Solved

CGI Search Method

Posted on 1997-02-17
20
460 Views
Last Modified: 2013-12-25
I have written a Perl script which searches our website.  It currently searches the titles of 150 pages for the search terms, which takes about 30 to 40 seconds - I think that is quite slow.

The method I use at the moment is the serial file (item entry) method, whereby each document is opened, its text searched, then the next document, and so on....

I would like to use the inverted file (term entry) method whereby a list is generated of all the searchable terms and each term has a corresponding list of documents.

My questions are as follows, bearing in mind that the CGI program will be written in Perl:-

1). How should the inverted file list be created (format, example appreciated)

2). How to search the file list using a perl script and then provide the matches.

3). How to provide some sort of description to go with each document.

4). Any ideas on best-match retrieval using some sort of relevance ranking.

I know that this is quite a lot and probably very difficult, hence I am offering a lot of points.
0
Comment
Question by:Trevor013097
20 Comments
 
LVL 5

Author Comment

by:Trevor013097
ID: 1827754
Adjusted points to 500
0
 

Expert Comment

by:jmalone
ID: 1827755
I am not quite sure I know what you are talking about.  Are you trying to write something that you run once a week, which searches all of your files and creates a database, and then another script that will search that data file for certain criteria and return the results?
0
 
LVL 5

Expert Comment

by:julio011597
ID: 1827756
Hi Trevor, I'm afraid you won't get any answer here, since what you're talking about is actually a search engine... quite an involved piece of work, as you know - people make a living on this kind of thing.
Why don't you get an available tool from the net, then just build a CGI to access it (even in Perl ;-)?
I can suggest having a look at *the best one* (AFAIK): MG (Managing Gigabytes), http://deimos.kbs.citri.edu.au/~tes/mg/.
It comes from university research in Australia and is covered by the terms of the GNU Public Licence.
MG does all you need and comes with source code (in C), so it's fully customizable, and I could also point you to another supporting tool, called RemoteMG, which makes MG accessible from the net.
I've worked extensively with MG, so I could certainly help you make it work and, if you'd like to mess with the code, point you to where to make your changes.
HTH
0
 
LVL 1

Expert Comment

by:evilgreg
ID: 1827757
It's not that hard, except maybe part four, if you want some sort of intelligent guessing algorithm. I must confess I am lost as to what exactly it is you want. You have a site with multiple pages, and you wish to go through each file and then create a list, sorted by search terms, which has the corresponding file. This file is then searched for matches by a CGI script, no?

Stop me if I'm wrong here...

So the file looks like this (in theory, not actual):

Search word: apples
Files: fruity.html Grove/Fruittree.html ciderpress.html
Search word: mangos
Files: Grove/magotree.html fruity.html

Etc.. etc..

It is quite a feat to build this, but a clear description of what _exactly_ it is you need should clear it up. :)

Any questions, email me at fruits@turnstep.com

-Greg

0
 
LVL 5

Author Comment

by:Trevor013097
ID: 1827758
Okay,

I have my documents, and each document can be searched for a particular term.  But to speed things up I want to search just one document, not 150.  So what I want to do is have a single document which has a list of all the search terms, where each search term has an array of documents.

e.g. if I were to search for apple, it would not search all the documents and then display the ones with apple; it would simply search my inverted file list (which would be a lot quicker) and display the documents from there.


My inverted file list should look something like:-

t1: {d1, d3, d5}
t2: {d2, d3, d7}
t3: {d1, d2, d4}

each line gives a set of documents indexed by a given term.

t= term,  d= document

Now,

If I were to make the file list in the manner above, I could split on the : (colon) to get my $name-$value pairs and then split on the , (comma) between each document name to get my document list.
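Something like this, perhaps (just a sketch of what I mean - the file name terms.txt and taking the term as the first argument are only for illustration):

--- begin query_terms.pl (sketch) ---
#!/usr/local/bin/perl -w

# Look up a single term in a hand-made text index whose lines look like
#   t1: {d1, d3, d5}
# and print the documents listed for that term.

$term = $ARGV[0] || die( "Usage: query_terms.pl <term>\n" );

open( INDEX, "<terms.txt" ) || die( "Can't open terms.txt\n" );
while( <INDEX> ) {
    chop;
    # split once on the colon to separate the term from its document list
    ($name, $docs) = split( /:/, $_, 2 );
    next unless defined($docs) && $name eq $term;
    $docs =~ s/[{}]//g;                      # strip the braces
    # split on the commas to get the individual document names
    foreach $doc (split( /,/, $docs )) {
        $doc =~ s/^\s+//;                    # trim leading spaces
        print "$doc\n";
    }
}
close( INDEX );
--- end query_terms.pl (sketch) ---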


0
 
LVL 5

Author Comment

by:Trevor013097
ID: 1827759
The search script will have to search the single document containing the search terms and their corresponding documents.  However, I will have created that document manually, probably as a .txt file, in the format outlined in my previous comment.


0
 
LVL 3

Accepted Solution

by:
pc012197 earned 500 total points
ID: 1827760
Questions 1-3 are pretty straightforward. I've written two perl scripts, one for creating an index, another one for querying it:

--- begin mkindex.pl ---
#!/usr/local/bin/perl -w

# This program creates an index consisting of key-value pairs. Keys are
# words contained in files given as command line arguments. Values are
# strings of the form
#
# "<name1>,<count1>[:<name2>,<count2>[:...]]"
#
# where <namen> is the name of a file containing the corresponding word
# and <countn> says how often the word was found in the file.
 
use AnyDBM_File;
 
$indexfile="index";
 
# We keep our index in an associative array / dbm-file
 
dbmopen(%docs,$indexfile,0644) || die("Can't open $indexfile\n");
 
foreach $document (@ARGV) {
    print "Scanning $document...";
 
    %counters = ();
    @words = ();
    open( DOC, "<$document" );
    while( <DOC> )
    {
        chop;
        # discard HTML special characters. You might want to add
        # more specific parsing here.
        s/[^a-zA-Z0-9]+/ /g;
        # increment counters for every word in the current line
        foreach $word (split(/[ \t]+/)) {
            if( ! defined($counters{$word}) ) {
                $counters{$word} = 1;
                push @words, $word;
            } else {
                $counters{$word}++;
            }
        }
    }
    close( DOC );
 
    # add words and counters to our database
    foreach $word (@words) {
        if( ! defined($docs{$word}) ) {
            $docs{$word} = "$document,".$counters{$word};
        } else {
            $docs{$word} .= ":$document,".$counters{$word};
        }
    }
 
    print "\n";
}
 
dbmclose(%docs);
 
--- end mkindex.pl ---

--- begin query_index.pl ---
#!/usr/local/bin/perl -w
 
# This program queries an index created by mkindex for a single word
# and prints all hits to STDOUT.
 
use AnyDBM_File;
 
$indexfile="index";
 
# We keep our index in an associative array / dbm-file
 
dbmopen(%docs,$indexfile,0644) || die("Can't open $indexfile\n");
 
if( ! defined( $docs{$ARGV[0]} ) ) {
    print "$ARGV[0] is not in the database!\n";
} else {
    foreach $document (split(/:/,$docs{$ARGV[0]})) {
        ($url,$hits) = split(/,/,$document);
        print "$hits found in $url\n";
    }
}
 
dbmclose(%docs);
 
--- end query_index.pl ---

You can implement best-match retrieval by evaluating the hit counts appropriately. Descriptions for each document (question 3) should be kept in a separate database (for performance reasons).
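For question 4, a rough (untested) sketch of that idea: take several words on the command line, add up the per-document hit counts from the index, and print the documents with the highest totals first.

--- begin rank_query.pl (sketch) ---
#!/usr/local/bin/perl -w

# Best-match sketch: rank documents by the sum of the hit counts of all
# query words, using the index created by mkindex.pl.

use AnyDBM_File;

$indexfile="index";

dbmopen(%docs,$indexfile,0644) || die("Can't open $indexfile\n");

%score = ();
foreach $word (@ARGV) {
    next unless defined( $docs{$word} );
    foreach $document (split(/:/,$docs{$word})) {
        ($url,$hits) = split(/,/,$document);
        $score{$url} = 0 unless defined( $score{$url} );
        $score{$url} += $hits;
    }
}

# best matches first
foreach $url (sort { $score{$b} <=> $score{$a} } keys %score) {
    print "$score{$url} hits: $url\n";
}

dbmclose(%docs);
--- end rank_query.pl (sketch) ---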

0
 
LVL 5

Author Comment

by:Trevor013097
ID: 1827761
Okay, looks pretty good but I get errors when I compile.


mkindex script - syntax errors

1) use AnyDbm_file;  (Is this a Perl 5 command, I am using WinPerl)

2) push@words

query_index script - syntax errors

1) use AnyDBM_File;



0
 
LVL 3

Expert Comment

by:pc012197
ID: 1827762
Hm, I did this under UNIX with perl 5.

Look in your perl library directory for any files with DB support (like DB_File.pm or something) and substitute that one for AnyDBM_File. If that doesn't work, try substituting the 'use...' line with
BEGIN { require AnyDBM_File; import AnyDBM_File; }

push @words, $word; is equivalent to
$words[++$#words] = $word;


0
 
LVL 5

Author Comment

by:Trevor013097
ID: 1827763
I will be executing the scripts on a Unix machine; I just write them using WinPerl.  So if they worked for you on Unix with Perl 5, they should work for me.  I will try them tomorrow.

Thanks

0
 
LVL 5

Author Comment

by:Trevor013097
ID: 1827764
I am unable to look in my Perl library as it is not accessible to me.  I have contacted my ISP (Demon Internet Ltd) and was told that  I simply needed to reference it, but that doesn't help me if I do not know what the file is.

Is there a standard file that is always in the library that I could reference?  Otherwise I cannot use your answer.


0
 
LVL 3

Expert Comment

by:pc012197
ID: 1827765
Instead of AnyDBM_File you could try

DB_File
NDBM_File
GDBM_File
SDBM_File

(these are in our /usr/lib/perl5 directory). O'Reilly's "Programming Perl" says that SDBM_File is always available, because it's part of the standard perl distribution.
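If you want to force a particular module instead of relying on dbmopen()'s default choice, one way (a sketch, untested on your setup; the file name is just the one from my scripts) is to tie the hash explicitly:

--- begin sketch: tying SDBM_File explicitly ---
#!/usr/local/bin/perl -w

# Bind %docs to an SDBM database directly instead of using dbmopen().

use Fcntl;        # provides the O_RDWR and O_CREAT flags
use SDBM_File;

$indexfile = "index";

tie( %docs, 'SDBM_File', $indexfile, O_RDWR|O_CREAT, 0644 )
    || die( "Can't open $indexfile\n" );

# ... the rest of the indexing or query code stays the same ...

untie( %docs );
--- end sketch ---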

If that doesn't work for you, it is relatively easy to substitute the DB stuff in my code with a simple text database, like you suggested in your question. I think it will have lower performance, though.

0
 
LVL 5

Author Comment

by:Trevor013097
ID: 1827766
Still cannot get your solution to work.  I have tried the DB files you mentioned, but to no avail - the program still fails.  Perhaps the DB file is not the problem; maybe it is something I am doing.

1) How do you call the mkindex.pl file to create the index?
2) Where is this index created?
3) How does it know where to look for the docs (I can see no reference to a base URL)
      My directory structure is such that in the root there are the following directories:-

a) CGI-BIN
b) DOCS
c) LOGS
d) INCOMING

All I get at the moment is a server error when running the prog.


0
 
LVL 3

Expert Comment

by:pc012197
ID: 1827767
Hm.

1) mkindex expects filenames as command-line arguments. All the files given are searched and indexed.

2) The index is created in the file given by the $indexfile variable (relative to the current directory, unless an absolute path is specified).

3) It doesn't know where to look for the docs, because you have to specify all the documents to be indexed on the command line. Note that the database is created non-destructively, i.e. successive runs of mkindex will extend an existing database instead of overwriting it. If you want to index a large server with lots of documents you should write a small script to call mkindex for individual files.
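Something along these lines, for example (a sketch only - the document path is a placeholder and the batch size is arbitrary):

--- begin index_all.pl (sketch) ---
#!/usr/local/bin/perl -w

# Driver sketch: collect the documents and hand them to mkindex.pl in
# batches on the command line.

@files = glob( "/your/server/root/docs/*.htm" );

while( @files ) {
    @batch = splice( @files, 0, 20 );       # 20 files per mkindex run
    system( "perl", "mkindex.pl", @batch );
}
--- end index_all.pl (sketch) ---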

0
 
LVL 5

Author Comment

by:Trevor013097
ID: 1827768
Sorry PC, still having problems.

I am convinced that it is something I am doing wrong.

To begin with I am just trying to get mkindex.pl to work.  I now have the following code in my mkindex.pl file:

<--------Start of mkindex.pl---------->

#!/bin/perl

# This program creates an index consisting of key-value pairs. Keys are
# words contained in files given as command line arguments. Values are
# strings of the form
#
# "<name1>,<count1>[:<name2>,<count2>[:...]]"
#
# where <namen> is the name of a file containing the corresponding word
# and <countn> says how often the word was found in the file.

use SDBM_File;

$indexfile="index";

# We keep our index in an associative array / dbm-file

dbmopen(%docs,$indexfile,0644) || die("Can't open $indexfile\n");

foreach $document (@ARGV) {
    print "Scanning $document...";

    %counters = ();
    @words = ();
    open( DOC, "<$document" );
    while( <DOC> )
    {
        chop;
        # discard HTML special characters. You might want to add
        # more specific parsing here.
        s/[^a-zA-Z0-9]+/ /g;
        # increment counters for every word in the current line
        foreach $word (split(/[ \t]+/)) {
            if( ! defined($counters{$word}) ) {
                $counters{$word} = 1;
                push @words, $word;
            } else {
                $counters{$word}++;
            }
        }
    }
    close( DOC );

    # add words and counters to our database
    foreach $word (@words) {
        if( ! defined($docs{$word}) ) {
            $docs{$word} = "$document,".$counters{$word};
        } else {
            $docs{$word} .= ":$document,".$counters{$word};
        }
    }

    print "\n";
}

dbmclose(%docs);

<------end mkindex.pl-------->

Okay,

I am calling the mkindex.pl from my browser using the following:

http://www.pcmaritime.co.uk/cgi-bin/pcmweb/mkindex.pl?../docs/*.htm

this is the correct way to call a CGI script on my server, and I am trying to pass it the files I want indexed, which are in my main docs directory.  Is this right?

I have tried calling it from a telnet session, but to no avail.

You can try the script for yourself, and if I am calling it correctly you will get an error message.
0
 
LVL 3

Expert Comment

by:pc012197
ID: 1827769
Oops. I didn't realize you were calling mkindex as a CGI script. In that case you have to change the line

foreach $document (@ARGV) {

to

foreach $document (<$ENV{"QUERY_STRING"}>) {

Similarly, in query_index.pl you have to replace $ARGV[0] with $ENV{"QUERY_STRING"} in three places.
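For illustration, one way to write that lookup (pulling the query string into a variable first - nothing beyond the replacement described above):

--- begin sketch: query_index.pl lookup as CGI ---
$query = $ENV{"QUERY_STRING"};       # instead of $ARGV[0]

if( ! defined( $docs{$query} ) ) {
    print "$query is not in the database!\n";
} else {
    foreach $document (split(/:/,$docs{$query})) {
        ($url,$hits) = split(/,/,$document);
        print "$hits found in $url\n";
    }
}
--- end sketch ---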


0
 
LVL 5

Author Comment

by:Trevor013097
ID: 1827770
Sorry about that, I didn't make it clear that it was a CGI in my question.

But unfortunately it still won't work.

How do I call it, I have tried

http://www.pcmaritime.co.uk/cgi-bin/pcmweb/mkindex.pl?../docs/*.htm

and

http://www.pcmaritime.co.uk/cgi-bin/pcmweb/mkindex.pl?http://www.pcmaritime.co.uk/*.htm

but still no joy.

What am I doing wrong?

Am I calling it wrong?  Have you managed to get it to work as a CGI and, if so, what did you call it with?


0
 
LVL 3

Expert Comment

by:pc012197
ID: 1827771
Ok, I've tried it. Add the line

print "Content-type: text/html\n\n";

after the "use AnyDBM"... line (in both scripts).
Use glob($ENV{"QUERY_STRING"}) instead of <$ENV{"QUERY_STRING"}>, that's the clean way to do it.
Use an absolute path for the index database.
Make sure the CGI-script has write access to the directory where the index is created.
When you invoke mkindex, use the complete path to the documents as argument, like this:

http://www.pcmaritime.co.uk/cgi-bin/pcmweb/mkindex.pl?/your/server/root/*.htm

That's the way it works for me.

Alternatively, add the command

chdir "/your/server/root";

to the beginning of mkindex and use relative paths.
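Putting those pieces together, the top of the CGI version of mkindex.pl might look something like this (a sketch only - the perl path, /your/server/root and the index location are placeholders to adjust for your own server):

--- begin sketch: mkindex.pl adapted for CGI ---
#!/usr/local/bin/perl -w

use AnyDBM_File;

# send the HTTP header before any other output
print "Content-type: text/html\n\n";

# work relative to the server root so relative document paths work
chdir "/your/server/root";

# absolute path to the index, in a directory the CGI can write to
$indexfile = "/your/server/root/cgi-bin/index";

dbmopen( %docs, $indexfile, 0644 ) || die( "Can't open $indexfile\n" );

# the documents come from the query string, e.g. mkindex.pl?docs/*.htm
foreach $document (glob( $ENV{"QUERY_STRING"} )) {
    print "Scanning $document...\n";
    # ... the indexing loop from mkindex.pl, unchanged ...
}

dbmclose( %docs );
--- end sketch ---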

0
 
LVL 5

Author Comment

by:Trevor013097
ID: 1827772
I am still having problems, PC.  I have changed the parts you mentioned and have made sure I have the correct permissions everywhere, but when I call the mkindex.pl script all I get is a Server Error (you can try it - the file is up there at the moment).

I am not sure why it is doing this.  I have tried it on its own and with the argument /docs/www.pcmaritime.co.uk/main.htm, but to no avail.

What server were you running it on, and do you know how that might differ from Demon's?  (Do you have any experience with Demon Internet?)


0
 
LVL 5

Author Comment

by:Trevor013097
ID: 1827773
If you want, I can repost the entire code I am using (but this question is getting a bit long already), or I can e-mail it to you if you let me have your address.
0
