• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 468

Search through flat file databases on Windows and Unix servers?

I have a couple of flat file databases I need to search through, and also a directory, if it's not too hard. I was looking at Webmonkey's "Roll your own search engine", but it only dealt with Unix. I downloaded a few free scripts, but one just gave me the title linked to the page and the path. I got a second one and it did the title, URL, and relevance. I liked it, but wasn't perfectly satisfied with it. I used KSearch on Unix and loved it, but I couldn't get it to work at all on Windows. I then went back to the second script and it couldn't do the flat files. I need it to search a couple of flat files and, if the words are found, return a script that calls the file. The file is separated with |. Is that going to matter? Something that opens a file, dbfile.txt, and if it finds a match, returns script.cgi?open=dbfile. Then there is a directory full of html files that it would be nice if it could search. My question is, how would I go about doing this, indexing the html file directory and using the flat files as an index? What would I use to search that works on Unix and Windows? I know grep won't work.
Asked by: sjaguar13
2 Solutions
 
lexxwernCommented:
How is the data stored in the flat file db? Are they comma-separated values or something similar?
 
DVBCommented:
Use a regular expression with the m// match operator:
$string =~ m/pattern/;

A sample:
#!/usr/bin/perl -w
use strict;

open (FP, '<', '/path/to/file') or die "cannot open: $!";
while(<FP>)
{
      if (m/pattern/)
          {
            # match found -- run another script, for example
            system('/usr/bin/perl', '/path/to/script');
            last;
          }
}
close(FP);

If you can give more details about what you are trying to do, and on what input data, then it would be easier to help out.
 
sjaguar13Author Commented:
I have products listed in categories in each file. It's separated with |. I then have a script that opens up the file, sorts it, and prints a table. The problem I found is, if someone searches for something, it searches the html file. I don't have one. I need it to search the text file, so if I had one on cats, dogs, and other, and they searched for jaguar, it would open each file and search for it. It would find it in cats and preferably display that line as well as a link to the script, like browse.cgi?open=cats (if it's too hard to show the line, that's okay). I have one directory with html files that I would like it to be able to search through, too. Those should just display the line, if possible, and a link to that file. Does that help?
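Something like this rough sketch is what I mean (the cats/dogs/other file names and the browse.cgi link are just my example setup; the helper sub is only for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return every line of a pipe-delimited file that matches $query.
sub search_db {
    my ($file, $query) = @_;
    my @hits;
    open(my $fh, '<', $file) or return ();   # skip missing files
    while (my $line = <$fh>) {
        chomp $line;
        # \Q..\E treats the query as literal text, /i ignores case
        push @hits, $line if $line =~ /\Q$query\E/i;
    }
    close($fh);
    return @hits;
}

# One pipe-delimited file per category, as described above.
foreach my $db ('cats', 'dogs', 'other') {
    foreach my $line (search_db("$db.txt", 'jaguar')) {
        # Show the matching line plus a link to the script that opens the file.
        print qq{<a href="browse.cgi?open=$db">$db</a>: $line<br>\n};
    }
}
```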
 
ahoffmannCommented:
besides reinventing the wheel, how about htdig or glimpse?
 
sjaguar13Author Commented:
How do those work and will they work on both platforms?
 
ahoffmannCommented:
both work on UNIX/Linux and NT.
http://www.htdig.org/  http://webglimpse.org/
 
sjaguar13Author Commented:
They both look confusing. I think I like htdig better; which one would you use? What is this installing stuff? If I have a free host, am I still going to be able to install this stuff? Will it search the text files and the html files and show the right script to call the text file?
 
sjaguar13Author Commented:
So I was searching Google looking for ways to do this and I came across grep. Can I use that, or would that be a bad idea?
 
lexxwernCommented:
see this code

$variable="abc|hello|world|xyz";
@ary = split(/|/,$variable);

here $ary[0] is abc, $ary[1] is hello and so on...




so suppose file data.txt has the following data,
abc|hello|world|xyz
uio|hello|sdfsdf|uj
546|ty|dfg|asf
sdf|dr|547|xdf


then we need to write a loop which scans every line of the file. it should be somewhat like this,

open(DAT,"data.txt");
@alldata = <DAT>;
close(DAT);

foreach $line(@alldata)
{
 @ary = split(/|/,$line);
 print "$ary[0] , $ary[1] , $ary[2] , $ary[3] <br>";
}


test if this works,
 
lexxwernCommented:
oops,   replace

@ary = split(/|/,$line);

with


@ary = split(/\|/,$line);
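The backslash matters because a bare | is the regex alternation operator: it matches the empty string, so the record gets split between every character. A quick demonstration:

```perl
use strict;
use warnings;

my $record = "abc|hello|world|xyz";

# Unescaped: | is alternation, matches the empty string everywhere,
# so we get one field per character.
my @wrong = split(/|/, $record);

# Escaped: \| matches the literal pipe, giving the 4 fields we want.
my @right = split(/\|/, $record);

print scalar(@wrong), " fields without escaping\n";
print scalar(@right), " fields with escaping\n";     # 4
print "$right[1]\n";                                  # hello
```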
 
ahoffmannCommented:
well, grep is the simplest way ;-)
 
lexxwernCommented:
At least try out what I've given above, just for the sake of my efforts ;)
 
sjaguar13Author Commented:
lexxwern, so it takes the file and chucks it into a big array. It then takes each line and splits the variables. Then it prints each variable for each line. How would I search through it?

Is grep or rgrep fast enough to do a lot of pages?
 
lexxwernCommented:
hmm. all your searching has to be done within the foreach loop.


foreach $line(@alldata)
{
 @ary = split(/\|/,$line);
 if($ary[3] eq 'hello')
 {
   # do something etc.
 }

}
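For the record, a runnable version of that loop, with the pipe escaped and eq used for the string comparison (sample data made up to match the earlier example):

```perl
use strict;
use warnings;

# Sample records in the same shape as data.txt above.
my @alldata = (
    "abc|hello|world|xyz\n",
    "uio|hello|sdfsdf|uj\n",
    "546|ty|dfg|asf\n",
);

my @matches;
foreach my $line (@alldata) {
    chomp $line;
    my @ary = split(/\|/, $line);      # \| : split on the literal pipe
    if ($ary[1] eq 'hello') {          # eq, not ==, for string comparison
        push @matches, $ary[0];
    }
}
print "matched: @matches\n";           # matched: abc uio
```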
 
sjaguar13Author Commented:
How long would it take to do about 10 files like that, plus about 50 html files, and how would I do the html files?
 
lexxwernCommented:
>> How long would it take to do about 10 files like that,
depends on the size of the files.

>> and how would I do the html files?
my script only searches the flat file database; that's what was asked in the Q. For parsing html files, I'd need to know what exactly should be done.
 
sjaguar13Author Commented:
>>I have a couple of flat file databases I need to search through, and also a directory, if it's not too hard.
>>I need it to search a couple of flat files and if the words are found, return a script that calls the file.
>>If it finds a match, it returns script.cgi?open=dbfile.
>>Then, there is a directory full of html files that would be nice if it could search.
>>My question is, how would I go about doing this, indexing the html file directory and using the flat files as an index?
 
>>What would I use to search to work on Unix and Windows, I know grep won't work. <<- I'm not so sure about this now, would grep, or fgrep work for this?
 
lexxwernCommented:
hmm.

what you have to do is index your html files into the flat file database.
suppose you have yahoo.html; then you need a cgi which will index this page into the database you use, and then the search will work with the database.

on what basis the html gets represented in the database is best known to you...
 
sjaguar13Author Commented:
How do I index it? Open up the file and split at the spaces to get each word?
 
lexxwernCommented:
hmm. you should give the structure of the Search DB a lot of thought.



First lets list the bare minimums required in the Database.
1. title
2. description
3. url

4. keyword1
5. keyword2
6. keyword3
7. keyword4
8. keyword5



so i recommend that your DB file have 3+5 attributes.
each site indexed should have 5 keywords provided.



so your indexing script should wisely pick 5 keywords and put them in the DB. BUT I recommend that the indexing is done manually and not through a script.
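With that 3+5 layout, a record and a keyword lookup could look like this (the Yahoo record here is made up purely for illustration):

```perl
use strict;
use warnings;

# One record per line: title|description|url|kw1|kw2|kw3|kw4|kw5
my $record = "Yahoo|Web portal|http://www.yahoo.com/|search|news|mail|sports|finance";

my ($title, $desc, $url, @keywords) = split(/\|/, $record);

# A hit means the query equals one of the five keywords (case-insensitive).
my $query = 'news';
my $hit = grep { lc($_) eq lc($query) } @keywords;

if ($hit) {
    print qq{<a href="$url">$title</a> - $desc\n};
}
```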
 
ahoffmannCommented:
as mentioned before: sounds like you try to reinvent the wheel:
   glimpse and webglimpse do that all, on both: UNIX and Windoze

If you write your own wheel, perl or Tcl are the only real possibilities: the same script runs without changes and/or platform-dependent ifdefs. IMHO.
In this case you have to make your performance checks, for both indexing and searching, yourself (glimpse is fast, much faster than grep).
Even though there exist numerous grep.exe for M$, if not used within perl or Tcl you need to write a different shell around it. Well, have seen bash for M$ too ...

You mentioned that "I like htdig better", then keep in mind, that htdig just searches/indexes your DocumentRoot, while glimpse can do any directory. It's up to you which one is sufficient.
 
sjaguar13Author Commented:
Ok, glimpse sounds better. Does grep work on windows? That webglimpse thing seems like a lot of work. If I'm using a free host, how do I install it?
 
ahoffmannCommented:
> Does grep work on windows?
yes.

> .. how do I install it?
sorry cannot help here ('cause I use my own CGI for that).
 
lexxwernCommented:
>>  sorry cannot help here ('cause I use my own CGI for that).

that tells you to use your own cgi too.

i have used both free scripts and my own scripts,
and trust me, the control you have over your page when you write it fully yourself is worth all the pains!
 
sjaguar13Author Commented:
>>and trust me, the control you have over your page when you write it fully yourself is worth all the pains!

The biggest pain I have right now is figuring out what I'm supposed to do.

How about this, I grep through the database files line by line, returning the whole line if a match is found. It does each one. Then it opens the html files and stores it in an array and greps through those, too, one by one (sounds like a lot of time) and returns a link. Good idea, or bad idea?
 
lexxwernCommented:
wait! what do you want? do you want a search engine for your site?

if yes read below. if not answer the Q.


lets now see what the components of a basic search engine should be.

1. a database with all required information of the pages.
2. a script which parses through the database and matches its data with the user's submitted data. this should also display matching entries to the browser.

with this the developer should answer a few Qs. like,

1. how does my DB get the searchable data?
A. it can either be indexed by a script or can be manually fed to the DB.

2. how do i store data in my database file?
A. ......done......data|separated|by|

3. what will be the attributes for each entry in DB?
A. you decide, my earlier comment was made to help you. now this is the most important factor which decides the fate of your search engine.

now give it a deep thought. only if you're confident of writing good code should you proceed; otherwise use the free scripts.

why can't you use real databases, though? it's much easier to write code if you use mysql or some other sql-based db.
 
sjaguar13Author Commented:
I want a search engine. I think I can make one that searches the flat file databases, but I have no idea how to do the html files. How would I index it?
 
lexxwernCommented:
you are not getting the idea.

the flatfile database has information about html files.

>> how do I index it?
your wish; you can either take all text in the <body>, or perhaps you can use <meta> tags, or you can come up with something totally new.
 
sjaguar13Author Commented:
>>you are not getting the idea.
Obviously.

So the flat file databases are cool. I can egrep those. The html files are harder, so I make them into a flat file database because that's easy. I would like all the text in the body tag. Should I make a script that grabs the text and chucks it into a txt file, or should I just grab the text as an array, search it, and then forget about it?

Am I getting more of the idea?
 
lexxwernCommented:
:-)

well, this example will answer your Q.

lets say there is a page named yahoo.html, your database file name is db.txt.

so when the script indexes yahoo.html it adds the following to db.txt

Yahoo's Title|http://location.of/yahoo.html|db_yahoo.txt

now db_yahoo.txt is a file created while indexing yahoo.html; all the text in yahoo.html is put into this file.


now, when I search for "certain words in yahoo", what the search program should do is open the file named by the database entry, i.e. db_yahoo.txt, search it, and if the search is successful, return this result to the user.


now this system works! good. but what I don't like about this is the speed: just to search one word, it opens and closes each and every file indexed in the DB, then searches it.


this whole process is very slow. but if you want to persist with it, you can.
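A sketch of that lookup, wrapped in a sub so the db file name and the record layout (title|url|textfile) stay labeled as assumptions:

```perl
use strict;
use warnings;

# Scan every page indexed in $dbfile (one "title|url|textfile" line each)
# and return a link for every page whose text file contains $query.
sub search_index {
    my ($dbfile, $query) = @_;
    my @results;
    open(my $db, '<', $dbfile) or return ();
    while (my $entry = <$db>) {
        chomp $entry;
        my ($title, $url, $textfile) = split(/\|/, $entry);
        open(my $fh, '<', $textfile) or next;   # skip missing text files
        while (my $line = <$fh>) {
            if ($line =~ /\Q$query\E/i) {
                push @results, qq{<a href="$url">$title</a>};
                last;                           # one hit per page is enough
            }
        }
        close($fh);
    }
    close($db);
    return @results;
}

print "$_<br>\n" for search_index('db.txt', 'certain words');
```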
 
sjaguar13Author Commented:
Is there any way to make this go faster? Your example is using egrep to search the files, right?
 
ahoffmannCommented:
> I want a search engine.
glimpseindex + glimpseserver + glimpse
(which are all part of the glimpse package)

glimpse also handles html files, see the -X option to glimpseindex.

Here is a simple call to the glimpse database producing HTML:

   glimpse  pattern | perl -F: -ane 'print "<a href=$F[0]>$F[0]</A>:\n\t<PRE>$F[1]\n</PRE>";'

(on Windoze you just need to exchange " and ')
Do this in your CGI, which just gets the pattern as parameter, ready.
Could it be simpler?
 
sjaguar13Author Commented:
<<Could it be simpler?
Yes, it could.

Would an associative array/hash thing work? I was reading this thing about making search engines and it mentions this. Is it a better way?
 
sjaguar13Author Commented:
I was thinking and I think I got the answer. For the 11 txt files, I egrep through them by line. The biggest file is 94 lines at 8kb, then there is a 7kb, then it drops to 3kb and 2kbs. That might take a little bit, but I use the hash thing for the html files. That's quick, right? It just returns the pages that have that word. It will already be indexed, so it won't have to look through each file. Is there any downfall to that idea?
 
lexxwernCommented:
>> Would an associative array/hash thing work? I was reading this thing about making search engines and it mentions this.

when you set out to make a search engine, it is assumed that you have mastery over the perl syntax and you can use the best tool at the best place.


>> Is it a better way?
could be, could be not... depends on the way you code.
 
lexxwernCommented:
I'm really dumb and I haven't understood your new method :)

could you please explain it again so that I can comment about it?
 
sjaguar13Author Commented:
>>could you pls explain it again so that i can comment abt it.?

You know what I mean about the hash thing? Each word that appears on a page is a key, and the value is a number representing that page, so when I search for hello, it checks $hash{hello} and gets a list of numbers like 2,4,6. Then it takes the pages that are 2, 4, and 6 and prints out links. For the txt files, it shoves it all in as an array. Then, it egreps each line to see if the string matches, like foreach (@txt) { egrep hello $_ }

Is that a better way of doing it?
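A sketch of that hash scheme, with made-up page text (the page numbers and words are placeholders):

```perl
use strict;
use warnings;

# Hypothetical pages, keyed by number as described above.
my %pages = (
    2 => 'hello world',
    4 => 'say hello to the cat',
    6 => 'hello again',
    7 => 'nothing to see here',
);

# Build the inverted index once: word => list of page numbers.
my %index;
foreach my $num (keys %pages) {
    my %seen;   # avoid listing the same page twice for one word
    foreach my $word (split /\s+/, lc $pages{$num}) {
        push @{ $index{$word} }, $num unless $seen{$word}++;
    }
}

# A lookup is then one hash access -- no file scanning at search time.
my @hits = sort { $a <=> $b } @{ $index{'hello'} || [] };
print "hello found on pages: @hits\n";   # hello found on pages: 2 4 6
```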
 
DVBCommented:
BTW, Perl has a builtin grep operator as well. No need for external commands.
(but unless you are doing this for the sake of learning Perl, I would recommend tool reuse.)
 
sjaguar13Author Commented:
Should I egrep? I was looking at other search scripts and if they don't use the hash thing, they search like $string=~/$searchquery/. Which one's better?
 
lexxwernCommented:
jaguar, you are getting carried away by the smaller things; do you have the bigger picture clear in your mind?
 
sjaguar13Author Commented:
I've been looking at free scripts for a week before I asked this question. It's been almost two weeks since I asked this question. The picture I have is what I already mentioned, and seeing how no one talked me out of it, that's still the plan I'm going with. I've already got it to search the txt files; there are still a few bugs, but for the most part it's good. The major thing right now is the HTML files; other than that, I only have small things to worry about.
 
ahoffmannCommented:
so, why don't you simply post your code, and show which part is not working as expected.
Would be much simpler than talking about apples and pears.
 
sjaguar13Author Commented:
I'd post my code, but the biggest problem I have right now is the fact that I don't have anything written to search the HTML files. Still trying to figure out if I should grep through those, too, or use a hash thing.
 
ahoffmannCommented:
> I'd post my code, ..
Where? Not in this thread (or am I blind?); I just see fragments, or more correctly: single statements.
 
sjaguar13Author Commented:
> I'd post my code, ..
Where?

I would post my code, but I don't know how to do the html files. What good would the first half be? I can fix the first half, it's just a bunch of design things. The part I need help with is the part I don't have, how to search html files.
 
ahoffmannCommented:
> .. how to search html files.
Do you mean to find the files, or patterns in the files?
 
sjaguar13Author Commented:
I'm pretty sure I can get each one, how do I open it and see if the keyword is in there?
 
sjaguar13Author Commented:
Ok, no one's talking. How about this:
a directory is read and stored in an array
foreach file in the array, that file is opened
if file =~/search string/i then it's added to an array
foreach file in that array, a link is printed along with the title.

Is that a good plan, or should I put each word in an hash or what should I do?
 
sjaguar13Author Commented:
#!C:\perl\bin\perl.exe


open(DB, "C:\\apache\\htdocs\\cgi-bin\\DBman\\Mine\\EGuitar\\EGuitar.txt");
@db = <DB>;
close(DB);

print "Content-type: text/html\n\n";
print "EGuitar<HR>";
print "<Table>";
@results = grep /red/i, @db;

foreach (@results) {
($i1, $i2, $i3, $i4, $i5, $i6, $i7, $i8) = split(/\|/,$_);
print "<tr><td>$i1</td><td>$i2</td><td>$i3</td><td>$i4</td><td>$i5</td><td>$i6</td></tr>";
}

print "</table>";


open(DB, "C:\\apache\\htdocs\\cgi-bin\\DBman\\Mine\\Bass\\Bass.txt");
@db = <DB>;
close(DB);

print "Bass<HR>";
print "<Table>";
@results = grep /red/i, @db;

foreach (@results) {
($i1, $i2, $i3, $i4, $i5, $i6, $i7, $i8) = split(/\|/,$_);
print "<tr><td>$i1</td><td>$i2</td><td>$i3</td><td>$i4</td><td>$i5</td><td>$i6</td></tr>";
}

print "</table>";


open(DB, "C:\\apache\\htdocs\\cgi-bin\\DBman\\Mine\\Acoustic\\Acoustic.txt");
@db = <DB>;
close(DB);

print "Acoustic<HR>";
print "<Table>";
@results = grep /red/i, @db;



foreach (@results) {
($i1, $i2, $i3, $i4, $i5, $i6, $i7, $i8) = split(/\|/,$_);
print "<tr><td>$i1</td><td>$i2</td><td>$i3</td><td>$i4</td><td>$i5</td><td>$i6</td></tr>";
}

print "</table>";



$basedir = "C:\\my documents\\";
opendir(DIR,$basedir);
@files = grep /\.html$/i, readdir(DIR);
closedir(DIR);

foreach $file (@files) {
open(FILE,"C:\\my documents\\$file");
@LINES = <FILE>;
close(FILE);


$Line_string = join(" ", @LINES);


if ($Line_string =~ /you/i) {
push(@html,$file);
}

}

foreach $html (@html) {
print "<A HREF=\"$basedir$html\">$html</A><BR>";
}






How's that look? It's a bit crude, but before I went any further, I wanted to check with you guys. Three questions did come up. One, how do I search for the entire word, not just part of it? Like if I type in you, it shouldn't return sites with your. Two, if it finds nothing, how do I have it print "No matches found"? Three, if they type in more than one word, it should still work, right?
 
ahoffmannCommented:
> .. I search for the entire word, not just part, ..
grep /\bred\b/i,@db;

> if it finds nothing, how do I have it print "No matches found"?
if ($#results<0){print "No matches found\n";}

> . if they type in more than one word, it should still work, right
grep /\bred bear\b/i,@db;
 
sjaguar13Author Commented:
The $#results if statement works great, but the other two don't. Maybe I got this in the wrong order or something, but when I put \b around the word, it finds nothing. What does the \b mean, anyway? This is just a guess, but $#results gets the number of items in the array?
 
ahoffmannCommented:
it's  \bWORD\b  not  \WORD\b
\b means word boundary (whitespace, punctation, etc.)
$#results is the index of the last element of @results; it is -1 if there are no elements
 
sjaguar13Author Commented:
Ohhhh, \b \b works for the whole word thing, but if I search for two words, it doesn't always work. It seems to only find matches if the two words are found side-by-side; like if you searched for big bear and the page had big red bear, it doesn't find it. Could I put a wildcard in between them so it would match if the words were anywhere on the page? The last thing I want to do is, if it's not too terribly hard, display the line from the page the word was found in. Can it be done relatively easily?
 
lexxwernCommented:
why don't you use sql-based databases? they are much easier, more powerful, and most times it is worth learning that extra piece of syntax.
 
ahoffmannCommented:
to find big red bear and big bear, when your pattern is big bear, you need to do it like:

  grep /\bbig\b.*\bbear\b/,@db;

BTW, I'm pretty sure you're rethinking your statement when you look at your examples, patterns and future requirements ;-)
<quote>
 <<Could it be simpler?
  Yes, it could.
</quote>
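To see that pattern in action against a few sample lines (the sample data here is invented):

```perl
use strict;
use warnings;

my @db = (
    "a big red bear\n",     # matches: big ... bear, as whole words
    "a red herring\n",      # no match
    "big bears only\n",     # no match: \bbear\b rejects "bears"
);

# \b = word boundary, .* = anything in between, /i = case-insensitive
my @results = grep /\bbig\b.*\bbear\b/i, @db;

if ($#results < 0) {
    print "No matches found\n";
} else {
    print @results;        # prints "a big red bear"
}
```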
 
davorgCommented:
No comment has been added lately, so it's time to clean up this TA.
I will leave a recommendation in the Cleanup topic area that this question is:

Split points between ahoffmann and lexxwern

Please leave any comments here within the next seven days.

PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER!

davorg
EE Cleanup Volunteer