Solved

Search through flat file databases on Windows and Unix servers?

Posted on 2002-07-05
56
317 Views
Last Modified: 2008-03-06
I have a couple of flat file databases I need to search through, and also a directory, if it's not too hard. I was looking at Webmonkey at "Roll your own search engine", but it only dealt with Unix. I downloaded a few free scripts, but one just gave me the title linked to the page and the path. I got a second one and it did the title, URL, and relevance. I liked it, but wasn't perfectly satisfied with it. I used KSearch on Unix and loved it, but I couldn't get it to work at all on Windows. I then went back to the second script and it couldn't do the flat files. I need it to search a couple of flat files and, if the words are found, return a script that calls the file. The file is separated with |. Is that going to matter? Something that opens a file, dbfile.txt. If it finds a match, it returns script.cgi?open=dbfile. Then, there is a directory full of html files that would be nice if it could search. My question is, how would I go about doing this, indexing the html file directory and using the flat files as an index? What would I use to search that works on both Unix and Windows? I know grep won't work.
0
Comment
Question by:sjaguar13
56 Comments
 
LVL 12

Expert Comment

by:lexxwern
ID: 7134598
how is data stored in the flat file db, are they comma separated values or something similar?
0
 
LVL 3

Expert Comment

by:DVB
ID: 7134618
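Use a regular expression and the m// match operator.
$input =~ m/pattern/

A sample
#!/usr/bin/perl -w
use strict;

open (FP, "</path/to/file") or die "Cannot open file: $!";
while(<FP>)
{
      if (m/pattern/)    # with no =~, the match is against $_, the current line
          {
            `/usr/bin/perl /path/to/file`;
             last;       # stop at the first match
          }
}
close (FP);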

If you can give more details about what you are trying to do, and on what input data, then it would be easier to help out.
0
 

Author Comment

by:sjaguar13
ID: 7134638
I have products listed in categories in each file. It's separated with |. I then have a script that opens up the file, sorts it, and prints a table. The problem I found is, if someone searched for something, it searches the html file. I don't have one. I need it to search the text file, so if I had one on cats, dogs, and other, and they searched for jaguar, it would open each file and search for it. It would find it in cats and preferably display that line as well as a link to the script, like browse.cgi?open=cats (If it's too hard to show the line, that's okay). I have one directory with html files that I would like it to be able to search through, too. Those should just display the line, if possible, and a link to that file. Does that help?
0
 
LVL 51

Accepted Solution

by:
ahoffmann earned 250 total points
ID: 7134700
rather than reinventing the wheel, how about htdig or glimpse?
0
 

Author Comment

by:sjaguar13
ID: 7139576
How do those work and will they work on both platforms?
0
 
LVL 51

Expert Comment

by:ahoffmann
ID: 7140098
both work on UNIX/Linux and NT.
http://www.htdig.org/  http://webglimpse.org/
0
 

Author Comment

by:sjaguar13
ID: 7141806
They both look confusing. I think I like htdig better, what one would you use? What is this installing stuff. What if I have a free host, am I still going to be able to install this stuff? It will search the text files and the html files and show the right script to call the text file?
0
 

Author Comment

by:sjaguar13
ID: 7151044
So I was searching Google looking for ways to do this and I came across grep. Can I use that, or would that be a bad idea?
0
 
LVL 12

Assisted Solution

by:lexxwern
lexxwern earned 250 total points
ID: 7151052
see this code

$variable="abc|hello|world|xyz";
@ary = split(/|/,$variable);

here $ary[0] is abc, $ary[1] is hello and so on...




so suppose file data.txt has the following data,
abc|hello|world|xyz
uio|hello|sdfsdf|uj
546|ty|dfg|asf
sdf|dr|547|xdf


then we need to write a loop which scans every line of the file. it should be somewhat like this,

open(DAT,"data.txt");
@alldata = <DAT>;
close(DAT);

foreach $line(@alldata)
{
 @ary = split(/|/,$line);
 print "$ary[0] , $ary[1] , $ary[2] , $ary[3] <br>";
}


test if this works,
0
 
LVL 12

Expert Comment

by:lexxwern
ID: 7151053
oops,   replace

@ary = split(/|/,$line);

with


@ary = split(/\|/,$line);
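
A minimal sketch of why the backslash matters, using the sample record from the comment above. The helper name split_record is made up for illustration; only the built-in split is assumed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split one pipe-delimited record into fields.
sub split_record {
    my ($line) = @_;
    chomp $line;                  # drop the trailing newline, if any
    return split /\|/, $line;     # \| matches a literal pipe character
}

my @fields = split_record("abc|hello|world|xyz\n");
print join(" , ", @fields), "\n";

# Unescaped, | is the regex "or" operator between two empty patterns,
# so it matches the empty string and split returns single characters.
my @chars = split /|/, "abc|hello";
print scalar(@chars), " pieces from the unescaped split\n";
```

The chomp is there because lines read from a file keep their newline, which would otherwise end up attached to the last field.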
0
 
LVL 51

Expert Comment

by:ahoffmann
ID: 7151109
well, rgrep is the simplest way ;-)
0
 
LVL 12

Expert Comment

by:lexxwern
ID: 7151136
at least try out what i've given above, just for the sake of my efforts ;)
0
 

Author Comment

by:sjaguar13
ID: 7151713
lexxwern, so it takes the file and chucks it into a big array. It then takes each line and splits the variables. Then it prints each variable for each line. How would I search through it?

Is grep or rgrep fast enough to do a lot of pages?
0
 
LVL 12

Expert Comment

by:lexxwern
ID: 7151972
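hmm. all your searching has to be done within the foreach loop.


foreach $line(@alldata)
{
 chomp($line);                 # remove the newline so the last field compares cleanly
 @ary = split(/\|/,$line);     # escaped pipe, as corrected above
 if($ary[1] eq 'hello')        # eq compares strings; == would compare numbers
 {
   # do something etc.
 }

}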
0
 

Author Comment

by:sjaguar13
ID: 7151993
How long would it take to do about 10 files like that, plus about 50 html files, and how would I do the html files?
0
 
LVL 12

Expert Comment

by:lexxwern
ID: 7152001
>> How long would it take to do about 10 files like that,
depends on the size of the files.

>> and how would I do the html files?
my script only searches flat file database thats asked in the Q. for parsing html files, i should know what exactly should be done.
0
 

Author Comment

by:sjaguar13
ID: 7152007
>>I have a couple of flat file databases I need to search through, and also a directory, if it's not too hard.
>>I need it to search a couple of flat files and if the words are found, return a script that calls the file.
>>If it finds a match, it returns script.cgi?open=dbfile. >>Then, there is a directory full of html files that would be nice if it could search.
>>My question is, how would I go about doing this, indexing the html file directory and using the flat files as an index?
 
>>What would I use to search to work on Unix and Windows, I know grep won't work. <<- I'm not so sure about this now, would grep, or fgrep work for this?
0
 
LVL 12

Expert Comment

by:lexxwern
ID: 7152013
hmm.

what you have to do is index your html files into the flat file database.
suppose you have yahoo.html, then you need a cgi which will index this page into the database you use, and then the search will work with the database.

so on what basis html gets represented in the database, it is best known to you...
0
 

Author Comment

by:sjaguar13
ID: 7152019
How do I index it? Open up the file and split at the spaces to get each word?
0
 
LVL 12

Expert Comment

by:lexxwern
ID: 7152027
hmm. you should give the structure of the Search DB a lot of thought.



First lets list the bare minimums required in the Database.
1. title
2. description
3. url

4. keyword1
5. keyword2
6. keyword3
7. keyword4
8. keyword5



so i recommend that your DB file have 3+5 attributes.
each site indexed should have 5 keywords provided.



so your indexing script should wisely pick 5 keyword and put it in the DB. BUT i recommend that the indexing is done manually and not thru a script.
0
 
LVL 51

Expert Comment

by:ahoffmann
ID: 7152088
as mentioned before: sounds like you try to reinvent the wheel:
   glimpse and webglimpse do that all, on both: UNIX and Windoze

If you write your own wheel, perl or Tcl are the only practical choices: the same script runs on both platforms without changes and/or platform-dependent ifdefs. IMHO.
In this case you have to do your own performance checks for both indexing and searching (glimpse is fast, much faster than grep).
Even though there are numerous grep.exe ports for M$, unless you call them from within perl or Tcl you need to write a different shell wrapper around them. Well, I have seen bash for M$ too ...

You mentioned that "I like htdig better", then keep in mind, that htdig just searches/indexes your DocumentRoot, while glimpse can do any directory. It's up to you which one is sufficient.
0
 

Author Comment

by:sjaguar13
ID: 7152848
Ok, glimpse sounds better. Does grep work on windows? That webglimpse thing seems like a lot of work. If I'm using a free host, how do I install it?
0
 
LVL 51

Expert Comment

by:ahoffmann
ID: 7152903
> Does grep work on windows?
yes.

> .. how do I install it?
sorry cannot help here ('cause I use my own CGI for that).
0
 
LVL 12

Expert Comment

by:lexxwern
ID: 7153239
>>  sorry cannot help here ('cause I use my own CGI for that).

that tells you to use your own cgi too.

i have used both free scripts and my own scripts,
and trust me the control you have over your page when you write it fully yourself is worth all the pains!
0
 

Author Comment

by:sjaguar13
ID: 7153269
>>and trust me the control you have over your page when you write it fully yourself is worth all the pains!

The biggest pain I have right now is figuring out what I'm supposed to do.

How about this, I grep through the database files line by line, returning the whole line if a match is found. It does each one. Then it opens the html files and stores it in an array and greps through those, too, one by one (sounds like a lot of time) and returns a link. Good idea, or bad idea?
0
 
LVL 12

Expert Comment

by:lexxwern
ID: 7153289
wait! what do you want? do you want a search engine for you site?

if yes read below. if not answer the Q.


lets now see what the components of a basic search engine should be.

1. a database with all required information of the pages.
2. a script which parses thru the database and matches its data with the users submitted data. this should also display matching entities to the browser.

with this the developer should answer a few Qs. like,

1. how does my DB get the searchable data?
A. it can be either indexed by a script or can be manually fed to the DB,

2. how do i store data in my database file?
A. ......done......data|separated|by|

3. what will be the attributes for each entry in DB?
A. you decide, my earlier comment was made to help you. now this is the most important factor which decides the fate of your search engine.

now give it a deep thought, only if you're confident of writing good code then proceed, otherwise use the free scripts.

why cant you use databases tho? its much easier to write code, if you use mysql or some other sql based db.
0
 

Author Comment

by:sjaguar13
ID: 7153385
I want a search engine. I think I can make one that searches the flat file databases, but I have no idea how to do the html files. How would I index it?
0
 
LVL 12

Expert Comment

by:lexxwern
ID: 7153391
you are not getting the idea.

the flatfile database has information about html files.

>> how do I index it?
your wish, you can either take all text in the <body> or perhaps you can use <meta> tags, or you can come up with something totally new.
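
A rough sketch of the "take all text in the <body>" option, assuming a regex-based approach is good enough for simple pages. The helper name extract_text and the sample page are invented for illustration; a real HTML parser would be more robust than these regular expressions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: pull the title and the visible body text out of
# an HTML string. Regex-based, so only an approximation of real parsing.
sub extract_text {
    my ($html) = @_;
    my ($title) = $html =~ m{<title[^>]*>(.*?)</title>}is;
    $title = '' unless defined $title;
    my ($body) = $html =~ m{<body[^>]*>(.*?)</body>}is;
    $body = $html unless defined $body;        # no <body> tag: use everything
    $body =~ s{<script\b.*?</script>}{ }gis;   # drop script blocks
    $body =~ s{<style\b.*?</style>}{ }gis;     # drop style blocks
    $body =~ s{<[^>]+>}{ }g;                   # strip the remaining tags
    for ($title, $body) {
        tr/|/ /;          # keep the record delimiter out of the data
        s/\s+/ /g;        # collapse runs of whitespace
        s/^ | $//g;       # trim both ends
    }
    return ($title, $body);
}

my $page = "<html><head><title>Cats</title></head>"
         . "<body><h1>Big cats</h1><p>The jaguar is a cat.</p></body></html>";
my ($title, $text) = extract_text($page);
print "$title|$text\n";    # one pipe-delimited index line
```

Each indexed page then becomes one line in the flat file, so the same pipe-delimited search can cover both the product files and the HTML pages.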
0
 

Author Comment

by:sjaguar13
ID: 7153422
>>you are not getting the idea.
Obviously.

So the flat file databases are cool. I can egrep those. The html files are harder, so I make them into a flat file database because that's easy. I would like all the text in the body tag. Should I make a script that grabs the text and chucks it into a txt file, or should I just grab the text as an array, search it, and then forget about it?

Am I getting more of the idea?
0
 
LVL 12

Expert Comment

by:lexxwern
ID: 7153444
:-)

well, this example will answer your Q.

lets say there is a page named yahoo.html, your database file name is db.txt.

so when the script indexes yahoo.html it adds the following to db.txt

Yahoo's Title|http://location.of/yahoo.html|db_yahoo.txt

now db_yahoo.txt is a file created while indexing yahoo.html, and all the text in yahoo.html is put into this file.


now, when i search for "certain words in yahoo" what the search program should do is open the file named in the database, ie. db_yahoo.txt, search it, and if the search is successful, return this to the user.


now this system works! good. but what i dont like about this is the speed, i mean just to search one word, it opens and closes each and every file indexed in the DB, then searches it.


this whole process is very slow. but if you want to persist with it, you can.
0
 

Author Comment

by:sjaguar13
ID: 7153503
Is there any way to make this go faster? Your example is using egrep to search the files, right?
0
 
LVL 51

Expert Comment

by:ahoffmann
ID: 7153771
> I want a search engine.
glimpseindex + glimpseserver + glimpse
(which are all part of the glimpse package)

glimpse also handles html files, see the -X option to glimpseindex.

Here is a simple call to the glimpse database producing HTML:

   glimpse  pattern | perl -F: -ane 'print "<a href=$F[0]>$F[0]</A>:\n\t<PRE>$F[1]\n</PRE>";'

(on Windoze you just need to exchange " and ')
Do this in your CGI, which just gets the pattern as parameter, ready.
Could it be simpler?
0
 

Author Comment

by:sjaguar13
ID: 7156255
<<Could it be simpler?
Yes, it could.

Would an associative array/hash thing work? I was reading this thing about making search engines and it mentions this. Is it a better way?
0
 

Author Comment

by:sjaguar13
ID: 7156313
I was thinking and I think I got the answer. For the 11 txt files, I egrep through them by line. The biggest file is 94 lines at 8kb, then there is a 7kb, then it drops to 3kb and 2kbs. That might take a little bit, but I use the hash thing for the html files. That's quick, right? It just returns the pages that have that word. It will already be indexed, so it won't have to look through each file. Is there any downside to that idea?
0
 
LVL 12

Expert Comment

by:lexxwern
ID: 7156317
>> Would an associative array/hash thing work? I was reading this thing about making search engines and it mentions this.

when you set out to make a search engine, it is assumed that you have mastery over the perl syntax and you can use the best tool at the best place.


>> Is it a better way?
could be could be not... depends on the way you code.
0
 
LVL 12

Expert Comment

by:lexxwern
ID: 7156329
im really dumb and i haven't understood your new method :)

could you pls explain it again so that i can comment abt it.?
0
 

Author Comment

by:sjaguar13
ID: 7157281
>>could you pls explain it again so that i can comment abt it.?

You know what I mean about the hash thing? Each word that appears on a page is a key, and the value is a number representing that page, so when I search for hello, it checks $hash{hello} and gets a list of numbers like 2,4,6. Then it takes the pages that are 2, 4, and 6 and prints out links. For the txt files, it shoves it all in as an array. Then, it egreps each line to see if the string matches, like foreach (@txt) { egrep hello $_ }

Is that a better way of doing it?
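
The "hash thing" described above is usually called an inverted index. A minimal sketch, assuming the page text has already been extracted; the page numbers, sample text, and the helper name lookup are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data: page number => page text.
my %pages = (
    2 => "hello world",
    4 => "Hello again",
    5 => "nothing here",
    6 => "say hello twice, hello",
);

# Build the index: each lowercased word maps to the set of pages
# that contain it.
my %index;
for my $page (keys %pages) {
    for my $word (lc($pages{$page}) =~ /(\w+)/g) {
        $index{$word}{$page} = 1;   # inner hash avoids duplicate page numbers
    }
}

# Hypothetical helper: return the sorted page numbers for one word.
sub lookup {
    my ($word) = @_;
    my $hits = $index{ lc($word) };
    return () unless $hits;
    return sort { $a <=> $b } keys %$hits;
}

print join(",", lookup("hello")), "\n";
```

A lookup is a single hash access, so it stays fast however many pages are indexed. The cost moves to building the index, which has to be redone whenever a page changes.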
0
 
LVL 3

Expert Comment

by:DVB
ID: 7157296
BTW, Perl has a builtin grep operator as well. No need for external commands.
(but unless you are doing this for the sake of learning Perl, I would recommend tool reuse.)
0
 

Author Comment

by:sjaguar13
ID: 7157881
Should I egrep? I was looking at other search scripts and if they don't use the hash thing, they search like $string=~/$searchquery/, what one's better?
0
 
LVL 12

Expert Comment

by:lexxwern
ID: 7158677
jaguar, you are getting carried away by the smaller things, do you have the bigger picture clear in your mind?
0
 

Author Comment

by:sjaguar13
ID: 7158809
I've been looking at free scripts for a week before I asked this question. It's been almost two weeks since I asked this question. The picture I have is what I already mentioned, and seeing how no one talked me out of it, that's still the plan I'm going with. I've already got it to search the txt files. There are still a few bugs, but for the most part it's good. The major thing right now is the HTML files. Other than that, I only have small things to worry about.
0
 
LVL 51

Expert Comment

by:ahoffmann
ID: 7158923
so, why don't you simply post your code, and show which part is not working as expected.
Would be much simpler than talking about apples and pears.
0
 

Author Comment

by:sjaguar13
ID: 7159725
I'd post my code, but the biggest problem I have right now is the fact that I don't have anything written to search the HTML files. Still trying to figure out if I should grep through those, too, or use a hash thing.
0
 
LVL 51

Expert Comment

by:ahoffmann
ID: 7160415
> I'd post my code, ..
Where? Not in this thread (or am I blind), I just see fragments, or more correctly: single statements.
0
 

Author Comment

by:sjaguar13
ID: 7160605
> I'd post my code, ..
Where?

I would post my code, but I don't know how to do the html files. What good would the first half be? I can fix the first half, it's just a bunch of design things. The part I need help with is the part I don't have, how to search html files.
0
 
LVL 51

Expert Comment

by:ahoffmann
ID: 7160669
> .. how to search html files.
Do you mean to find the files, or patterns in the files?
0
 

Author Comment

by:sjaguar13
ID: 7161009
I'm pretty sure I can get each one, how do I open it and see if the keyword is in there?
0
 

Author Comment

by:sjaguar13
ID: 7161273
Ok, no one's talking. How about this:
a directory is read and stored in an array
foreach file in the array, that file is opened
if file =~/search string/i then it's added to an array
foreach file in that array, a link is printed along with the title.

Is that a good plan, or should I put each word in an hash or what should I do?
0
 

Author Comment

by:sjaguar13
ID: 7164141
#!C:\perl\bin\perl.exe


open(DB, "C:\\apache\\htdocs\\cgi-bin\\DBman\\Mine\\EGuitar\\EGuitar.txt");
@db = <DB>;
close(DB);

print "Content-type: text/html\n\n";
print "EGuitar<HR>";
print "<Table>";
@results = grep /red/i, @db;

foreach (@results) {
($i1, $i2, $i3, $i4, $i5, $i6, $i7, $i8) = split(/\|/,$_);
print "<tr><td>$i1</td><td>$i2</td><td>$i3</td><td>$i4</td><td>$i5</td><td>$i6</td></tr>";
}

print "</table>";


open(DB, "C:\\apache\\htdocs\\cgi-bin\\DBman\\Mine\\Bass\\Bass.txt");
@db = <DB>;
close(DB);

print "Bass<HR>";
print "<Table>";
@results = grep /red/i, @db;

foreach (@results) {
($i1, $i2, $i3, $i4, $i5, $i6, $i7, $i8) = split(/\|/,$_);
print "<tr><td>$i1</td><td>$i2</td><td>$i3</td><td>$i4</td><td>$i5</td><td>$i6</td></tr>";
}

print "</table>";


open(DB, "C:\\apache\\htdocs\\cgi-bin\\DBman\\Mine\\Acoustic\\Acoustic.txt");
@db = <DB>;
close(DB);

print "Acoustic<HR>";
print "<Table>";
@results = grep /red/i, @db;



foreach (@results) {
($i1, $i2, $i3, $i4, $i5, $i6, $i7, $i8) = split(/\|/,$_);
print "<tr><td>$i1</td><td>$i2</td><td>$i3</td><td>$i4</td><td>$i5</td><td>$i6</td></tr>";
}

print "</table>";



$basedir = "C:\\my documents\\";
opendir(DIR,$basedir);
@files = grep /\.html$/i, readdir(DIR);
closedir(DIR);

foreach $file (@files) {
open(FILE,"C:\\my documents\\$file");
@LINES = <FILE>;
close(FILE);


$Line_string = join(" ", @LINES);


if ($Line_string =~ /you/i) {
push(@html,$file);
}

}

foreach $html (@html) {
print "<A HREF=\"$basedir$html\">$html</A><BR>";
}






How's that look? It's a bit crude, but before I went any further, I wanted to check with you guys. Three questions did come up, how do I search for the entire word, not just part, like if I type in you, it shouldn't return sites with your. Two, if it finds nothing, how do I have it print "No matches found"? Three, if they type in more than one word, it should still work, right?
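
One way the three repeated blocks in the script above could be folded into a loop. This is a sketch under assumptions: the relative paths are placeholders for the real ones, and the helper names search_db and results_table are invented here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Placeholder category => file map; the real paths are site-specific.
my %db_files = (
    EGuitar  => "EGuitar/EGuitar.txt",
    Bass     => "Bass/Bass.txt",
    Acoustic => "Acoustic/Acoustic.txt",
);

# Hypothetical helper: return the lines of $path that contain $term
# (case-insensitive); returns nothing if the file cannot be opened.
sub search_db {
    my ($path, $term) = @_;
    open(my $fh, '<', $path) or return;
    my @hits = grep { /\Q$term\E/i } <$fh>;   # \Q..\E quotes regex characters
    close($fh);
    return @hits;
}

# Hypothetical helper: render matching lines as an HTML table, showing
# at most the first six fields, as the original script does.
sub results_table {
    my ($name, @hits) = @_;
    my $html = "$name<HR><table>";
    for my $line (@hits) {
        chomp(my $record = $line);
        my @cells = split /\|/, $record;
        splice(@cells, 6) if @cells > 6;
        $html .= "<tr>" . join("", map("<td>$_</td>", @cells)) . "</tr>";
    }
    return $html . "</table>";
}

print "Content-type: text/html\n\n";
my $total = 0;
for my $name (sort keys %db_files) {
    my @hits = search_db($db_files{$name}, "red");
    next unless @hits;
    $total += @hits;
    print results_table($name, @hits), "\n";
}
print "No matches found\n" unless $total;
```

Adding another category then means adding one line to the map rather than copying a block, and the search term can later come from a CGI parameter instead of being hard-coded.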
0
 
LVL 51

Expert Comment

by:ahoffmann
ID: 7164239
> .. I search for the entire word, not just part, ..
grep /\bred\b/i,@db;

> if it finds nothing, how do I have it print "No matches found"?
if ($#results<0){print "No matches found\n";}

> . if they type in more than one word, it should still work, right
grep /\bred bear\b/i,@db;
0
 

Author Comment

by:sjaguar13
ID: 7165196
The $#results if statement works great, but the other two don't. Maybe I got this in the wrong order or something, but when I type \ \b around the word, it finds nothing. What does the \ \b mean, anyway? This is just a guess, but $#results gets the number of items in the array?
0
 
LVL 51

Expert Comment

by:ahoffmann
ID: 7165209
it's  \bWORD\b  not  \WORD\b
\b means word boundary (the position between a word character and whitespace, punctuation, etc.)
$#results is the index of the last element of @results (the count minus one), so it is -1 if there are no elements
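
A small sketch putting the whole-word match and the empty-result check together. The sample records and the helper name whole_word_matches are made up; \Q...\E is added on the assumption that the search word comes from user input and may contain regex characters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample records in the pipe-delimited format.
my @db = (
    "Strat|red|your choice of neck\n",
    "Tele|Red|you pick\n",
    "Bass|blue|redwood body\n",
);

# Hypothetical helper: return the lines containing $word as a whole word.
sub whole_word_matches {
    my ($word, @lines) = @_;
    # \b anchors at word boundaries; \Q...\E quotes regex metacharacters.
    return grep { /\b\Q$word\E\b/i } @lines;
}

my @results = whole_word_matches("red", @db);
if (@results) {                       # an array in boolean context is its count
    print scalar(@results), " match(es)\n";
} else {
    print "No matches found\n";
}
```

Testing @results directly is equivalent to checking $#results < 0, and scalar(@results) gives the actual count without the off-by-one.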
0
 

Author Comment

by:sjaguar13
ID: 7165345
Ohhhh, \b \b works for the whole word thing, but if I search for two words, it doesn't always work. It seems to only find matches if the two words are found side-by-side, like if you searched for big bear and the page had big red bear, it doesn't find it. Could I put a wildcard in between them so it would match if the words were anywhere on the page? The last thing I want to do is, if it's not too terribly hard, display the line from the page the word was found in. Can it be done relatively easily?
0
 
LVL 12

Expert Comment

by:lexxwern
ID: 7166271
why dont you use sql based databases, they are much easier, more powerful and at most times it is worth learning that extra piece of syntax.
0
 
LVL 51

Expert Comment

by:ahoffmann
ID: 7166550
to find big red bear and big bear, when your pattern is big bear, you need to do it like:

  grep /\bbig\b.*\bbear\b/,@db;

BTW, I'm pretty sure you're rethinking your statement when you look at your examples and patterns and future requirements ;-)
<quote>
 <<Could it be simpler?
  Yes, it could.
</quote>
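
Note that the pattern above only matches when "big" comes before "bear" on the same line. A sketch of an alternative that tests each query word separately, so the order does not matter, and prints the matching line itself. The helper name matches_all_words and the sample lines are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if every word of the query occurs in $text
# as a whole word, in any order.
sub matches_all_words {
    my ($query, $text) = @_;
    my @words = split ' ', $query;      # split on runs of whitespace
    return 0 unless @words;             # an empty query matches nothing
    for my $word (@words) {
        return 0 unless $text =~ /\b\Q$word\E\b/i;
    }
    return 1;
}

my @lines = (
    "A big red bear lives here.\n",
    "The bear is big.\n",
    "A bigger bear.\n",
);

for my $line (grep { matches_all_words("big bear", $_) } @lines) {
    print $line;    # show the line the words were found in
}
```

Here the test runs per line. To match words that may sit anywhere on a page, join the page's lines into one string first and pass that as $text.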
0
 
LVL 8

Expert Comment

by:davorg
ID: 9483984
No comment has been added lately, so it's time to clean up this TA.
I will leave a recommendation in the Cleanup topic area that this question is:

Split points between ahoffmann and lexxwern

Please leave any comments here within the next seven days.

PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER!

davorg
EE Cleanup Volunteer
0
