Solved

pc here are your points

Posted on 1997-05-23
Last Modified: 2013-12-25
Here are those points that I owe you plus the extra points for the continuation of the previous question.

Question:

To try and put together the last part of my original question: best-match retrieval.
More specifically, a ranking system based on relevancy (hit count), a way of
searching with case sensitivity on or off, and some form of Boolean logic.

Please collect your points pc and then we can continue the enhancements to the scripts.

No other experts except pc need answer this question please.
Question by:Trevor013097
LVL 3

Accepted Solution

by:
pc012197 earned 450 total points
ID: 1828068
To make mkindex.pl case-insensitive, after the line

foreach $word (split(/[ \t]+/)) {

add

$word =~ tr/A-Z/a-z/;

In queryindex.pl, after the

if( !defined($ENV{"QUERY_STRING"})...

add

$ENV{"QUERY_STRING"} =~ tr/A-Z/a-z/;

(or whatever mechanism you use to extract your form data).
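To illustrate why both sides need the same folding, here is a minimal sketch. It uses a made-up in-memory hash (%index is hypothetical, not the real dbm file) and a sample line of text; the indexer lowercases each word before counting it, and the query is lowercased before the lookup, so 'NavMaster' and 'NAVMASTER' end up under the same key.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-in for the index that mkindex.pl builds.
my %index;

# Indexing side: fold each word to lower case before counting it.
my $line = "NavMaster charts and NAVMASTER updates";
foreach my $word (split /[ \t]+/, $line) {
    $word =~ tr/A-Z/a-z/;
    $index{$word}++;
}

# Query side: fold the search term the same way before the lookup.
my $query = "NavMaster";
$query =~ tr/A-Z/a-z/;

my $count = defined $index{$query} ? $index{$query} : 0;
print "$query: $count\n";
```

If only one side is folded, a mixed-case query simply misses the lowercased keys, which is why the change has to go into both scripts.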
For the simple best-match algorithm, replace

foreach $document (split(/:/,$docs{$ENV{"QUERY_STRING"}})) {

with

foreach $document (sort sortByHitCount split(/:/,$docs{$ENV{"QUERY_STRING"}})) {

and add the following subroutine:

sub sortByHitCount {
local ($ua,$ca)=split(/,/,$a);
local ($ub,$cb)=split(/,/,$b);
#suppress a warning...
    $ua = $ub;
    return $cb <=> $ca;
}
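For what it's worth, the same ranking can be tried on its own, outside the CGI script. This is only a sketch: the sample entry string is made up, and by_hit_count is a hypothetical stand-in for sortByHitCount that discards the url id with undef instead of assigning it to suppress the warning.

```perl
use strict;
use warnings;

# Hypothetical index entry: colon-separated "urlid,hits" pairs.
my $entry = "3,2:7,9:5,4";

# Comparison routine: descending by hit count (the part after the comma).
sub by_hit_count {
    my (undef, $ca) = split /,/, $a;
    my (undef, $cb) = split /,/, $b;
    return $cb <=> $ca;
}

my @ranked = sort by_hit_count split /:/, $entry;
print join(" ", @ranked), "\n";
```

With this sample data the document with 9 hits comes first and the one with 2 hits last.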

Implementing the boolean logic will be trickier. I suggest a simple form like that used by AltaVista, i.e.:
 - use OR for individual words in the input not starting with + or -
 - use AND to evaluate words starting with '+'
 - use AND NOT to exclude words starting with '-'
 - evaluate the input from left to right (i.e. no operator precedence).

E.g. "hi hello +world -bye" is evaluated as ((hi OR hello) AND world) AND NOT bye, so it matches any document that contains 'hi' or 'hello', also contains 'world', and does not contain 'bye'.
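The rules above could be sketched roughly like this. Everything here is an assumption for illustration: the toy %index (word => document => hit count), the document ids d1..d4 and the sub name evaluate are all made up, and summing the hit counts is just one possible way to score a combined match.

```perl
use strict;
use warnings;

# Hypothetical toy index: word => { document id => hit count }.
my %index = (
    hi    => { d1 => 1, d2 => 2 },
    hello => { d3 => 1 },
    world => { d1 => 3, d3 => 1, d4 => 5 },
    bye   => { d3 => 2 },
);

# Evaluate an AltaVista-style query strictly left to right.
sub evaluate {
    my ($query) = @_;
    my %result;
    foreach my $term (split ' ', lc $query) {
        my $op = 'or';
        if    ($term =~ s/^\+//) { $op = 'and' }
        elsif ($term =~ s/^-//)  { $op = 'not' }
        my $docs = $index{$term} || {};

        if ($op eq 'or') {            # add documents, summing hit counts
            $result{$_} += $docs->{$_} for keys %$docs;
        } elsif ($op eq 'and') {      # keep only documents that also match
            for my $id (keys %result) {
                if (exists $docs->{$id}) { $result{$id} += $docs->{$id} }
                else                     { delete $result{$id} }
            }
        } else {                      # drop documents that match
            delete $result{$_} for keys %$docs;
        }
    }
    return %result;
}

my %hits = evaluate("hi hello +world -bye");
print join(", ", map { "$_=$hits{$_}" } sort keys %hits), "\n";
```

With this toy data only d1 survives: d2 lacks 'world' and d3 contains 'bye'. Note that in this sketch a query beginning with a '+' term returns nothing, because there is no earlier result to AND against.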


0
 
LVL 5

Author Comment

by:Trevor013097
ID: 1828069
Hi pc,

Sorry it has taken so long to get back to you, but I have had a few problems the last two weeks.  My internet gateway machine decided to crash completely and then the backup tapes would not read, and it has taken me two weeks to get back to where I was two weeks ago.  Oh well, such is life.

I have tried your suggested solutions and they work a treat; I now have case insensitivity and the results ranked by hit count.

Next thing is the boolean logic.

Okay the Alta Vista style boolean is what I am after and no more complicated than that.

Now what we need to do is to create an array of the terms sent to the query_index.pl and then split this array into the individual terms and also the boolean operators.  Then we can evaluate from left to right.

so we would need something like:-

@terms = split(/\s+/, $FORM{'terms'});

if ($boolean eq 'AND') {
    foreach $term (@terms) {
        ### search the database
    }
}

similarly for the other operators.

I am unsure how to code this in though, any ideas?

below is what I have so far for the query_index.pl:-

#!/bin/perl5

# This program queries an index created by mkindex for a single word
# and prints all hits to STDOUT.

# Get the input
   read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
# Split the name-value pairs
   @pairs = split(/&/, $buffer);
   foreach $pair (@pairs) {
      ($name, $value) = split(/=/, $pair);
      $value =~ tr/+/ /;
      $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
      $FORM{$name} = $value;
   }

use SDBM_File;
print "Content-type: text/html\n\n";

$indexfile="/docs/www.pcmaritime.co.uk/search/data/index";
$descfile="/docs/www.pcmaritime.co.uk/search/data/descr";
$urlfile="/docs/www.pcmaritime.co.uk/search/data/urls";

# We keep our index in an associative array / dbm-file

dbmopen(%docs,$indexfile,0644) || die("Can't open $indexfile\n");
dbmopen(%desc,$descfile,0644) || die("Can't open $descfile\n");
dbmopen(%urls,$urlfile,0644) || die("Can't open $urlfile\n");

# Fold the search term to lower case before the lookup, not after it
$FORM{'terms'}=~tr/A-Z/a-z/;
if( ! (defined( $docs{$FORM{'terms'}} ) ) ){
      print "$FORM{'terms'} is not in the database!\n";
      }
      else {
            foreach $document (sort sortByHitCount split(/:/,$docs{$FORM{'terms'}})) {
            ($urlid,$hits) = split(/,/,$document);
            ($title,$description) = split( /::/, $desc{$urlid}, 2);
            print "Url-ID: $urlid<br>\n";
            print "Document: $document<br>\n";
            print "Desc: $desc{$urlid}<br>\n";
            print "Title: $title<br>\n";
            print "$hits found in <a href=\"".$urls{$urlid}."\">$title</a>: $description<p>\n";
            
            }
      }

sub sortByHitCount {
      local($ua,$ca)=split(/,/,$a);
      local($ub,$cb)=split(/,/,$b);
      #suppress a warning...
            $ua = $ub;
            return $cb <=> $ca;
      }

dbmclose(%urls);
dbmclose(%docs);
dbmclose(%desc);

You can see the current version working at:-

http://www.pcmaritime.co.uk/search/search2.htm

this provides you with the simple graphical front end.

Thanks in advance

0
 
 
LVL 5

Author Comment

by:Trevor013097
ID: 1828071
Why my comment got posted twice I do not know.  I only clicked submit once, must be a problem at EE somewhere, perhaps it got stuck in the queue.

0
 
LVL 5

Author Comment

by:Trevor013097
ID: 1828072
I have reorganised our site and the current working version has now moved to: http://www.pcmaritime.co.uk/leisure/search/search2.htm
0
 
LVL 3

Expert Comment

by:pc012197
ID: 1828073
Hi Trevor,

it has taken me about two weeks to get started with the rest, and only about two hours to actually do it... looks like my motivation has dropped a little since I received that 'Wizard level' T-shirt... :-)))
Well, here is an alpha-version of the boolean queryindex.pl:

#!/usr/local/bin/perl -w
#
# Version: 1.1
#
# This program queries an index created by mkindex for one or more words
# combined with +/- boolean prefixes and prints all hits to STDOUT.
#
# Changes since last version (first working version):
#
# Requires three DB files instead of one (see mkindex.pl).
#
# Output is more verbose.

use SDBM_File;

print "Content-type: text/html\n\n";

$indexfile="/home/conrad/tmp/test/index";
$descfile ="/home/conrad/tmp/test/descr";
$urlfile  ="/home/conrad/tmp/test/urls";

# We keep our index in an associative array / dbm-file

dbmopen(%docs,$indexfile,0644) || die("Can't open $indexfile\n");
dbmopen(%desc,$descfile,0644) || die("Can't open $descfile\n");
dbmopen(%urls,$urlfile,0644) || die("Can't open $urlfile\n");

$ENV{"QUERY_STRING"} =~ tr/A-Z/a-z/;

@args = split(/\s+/, $ENV{"QUERY_STRING"});
%A = ();
$first = 1;

foreach $word (@args) {
    $key = substr($word,0,1);
    if( $key eq "+" ) { $func = "land"; $word = substr($word,1); }
    elsif( $key eq "-" ) { $func = "remove"; $word = substr($word,1); }
    else { $func = "merge"; }

    if( $first != 0 ) {
        $func = "merge";
        $first = 0;
    }

    %B = finddocs($word);
    @B = keys %B;
    print "$word: ".($#B+1)."<br>\n";
#    print "A = ".k2s(%A)."\n";
#    print "B = ".k2s(%B)."\n";
    eval "\%A = $func";
#    print "$func(A,B) = ".k2s(%A)."\n";
}

foreach $urlid (sort { $A{$b} <=> $A{$a} } keys %A)
{
    ($title,$description) = split( /::/, $desc{$urlid}, 2 );
    print "$A{$urlid} found in <a href=\"".$urls{$urlid}."\">$title</a>: $description<p>\n";
}

dbmclose(%urls);
dbmclose(%docs);
dbmclose(%desc);

exit 0;

sub finddocs
{
local ($w)=@_;
local (%B);

    if( ! defined( $docs{$w} ) ) {
        return ();
    }
    %B = ();
    foreach $document (split(/:/,$docs{$w})) {
        ($urlid,$hits) = split(/,/,$document);
        $B{$urlid} = $hits;
    }

    return %B;
}

sub merge
# Merge two hashes A and B, i. e. for any element (a,c1) in A:
#   if there is (a,c2) in B, substitute (a,c1+c2) for (a,c2) in B
#   else add (a,c1) to B.
{
local (%C, $k, $v);

    %C = %B;
    while( ($k,$v) = each %A )
    {
        if( ! exists $C{$k} ) {
            $C{$k} = $v;
        } else {
            $C{$k} += $v;
        }
    }

    return %C;
}

sub remove
# Remove all elements which appear in B from A
{
local (%C, $k);

    %C = %A;
    foreach $k (keys %B) {
        delete $C{$k};
    }

    return %C;
}

sub land
# Create a list containing all elements a with (a,c1) in A and (a,c2) in B
# and none else.
{
local (%C, $k, $v);

    %C = ();
    while( ($k,$v) = each %A ) {
        if( exists $B{$k} ) {
            $C{$k} = $B{$k} + $v;
        }
    }

    return %C;
}

#sub k2s
#{
#local (%A)=(@_);
#local ($res);
#
#    $res = "";
#    foreach $key (keys %A) {
#       $res .= ", $key";
#    }
#    return $res;
#}

I have made a few tests with the boolean stuff which seem to indicate the functions work properly. If they don't it would be helpful if you removed the comments from the 'k2s' subroutine, and from the print statements where it is invoked.
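In case the string eval and the global %A/%B hashes ever get in the way, one possible alternative is a dispatch table of code references that take the two hashes as arguments. This is only a sketch of that idea, not the script above: the %ops table and the sample data are made up, and only the names merge, land and remove are carried over.

```perl
use strict;
use warnings;

# Each operation takes two hash references (current result, new matches)
# and returns a new hash reference of urlid => hit count.
my %ops = (
    merge => sub {                       # OR: union, hit counts summed
        my ($x, $y) = @_;
        my %c = %$y;
        $c{$_} += $x->{$_} for keys %$x;
        return \%c;
    },
    land => sub {                        # AND: intersection, counts summed
        my ($x, $y) = @_;
        my %c;
        for my $k (keys %$x) {
            $c{$k} = $x->{$k} + $y->{$k} if exists $y->{$k};
        }
        return \%c;
    },
    remove => sub {                      # AND NOT: drop everything in $y
        my ($x, $y) = @_;
        my %c = %$x;
        delete @c{ keys %$y };
        return \%c;
    },
);

my $current = { 1 => 2, 2 => 1 };        # made-up sample data
my $matches = { 2 => 3, 3 => 4 };

foreach my $func (qw(merge land remove)) {
    my $r = $ops{$func}->($current, $matches);
    print "$func: ", join(", ", map { "$_=$r->{$_}" } sort keys %$r), "\n";
}
```

In the main loop the call would then look something like $ops{$func}->(\%A, \%B), with an unknown $func showing up as a missing hash key instead of a failed eval.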

Bye,
      Peter
0
 
LVL 3

Expert Comment

by:pc012197
ID: 1828074
Oops, there's a bad word wrap in the foreach $urlid part...

0
 
LVL 5

Author Comment

by:Trevor013097
ID: 1828075
Works fine when performing a simple search, i.e. with just one search term.  However, I cannot seem to get it to search on multiple terms.  What are the boolean operators?  I tried + and - but with no effect.  Two keywords which I know appear on the same page are navmaster and arcs, but when searched for together the query retrieves 0 matches.

It is on the site if you want to try it.

At the moment it is without a front end so it requires a command line argument.  I have tried the following:

http://www.pcmaritime.co.uk/cgi-bin/pcmweb/new.pl?navmaster

and

http://www.pcmaritime.co.uk/cgi-bin/pcmweb/new.pl?navmaster+arcs

no wrap should be there.

Any ideas?


0
 
LVL 5

Author Comment

by:Trevor013097
ID: 1828076
I have removed the comments from the k2s subroutine and also from the additional print statements and have uploaded a new file to our server.  There are now two versions up there:

The script with comments in:-

http://www.pcmaritime.co.uk/cgi-bin/new.pl?search-term

and the script without the comments in:

http://www.pcmaritime.co.uk/cgi-bin/new2.pl?search-term


0
 
LVL 3

Expert Comment

by:pc012197
ID: 1828077
I forgot that special characters are URL-encoded, i.e. all spaces are replaced with '+' and other special characters with '%xy', where 'xy' is a hex code.
Before the '@args=split...' line insert the following statements:

$ENV{"QUERY_STRING"} =~ s/\+/ /g;
$ENV{"QUERY_STRING"} =~ s/%([A-Fa-f0-9]{2})/pack("C",hex($1))/ge;

For testing you'd best use a form, so the browser will handle the encoding. If you want to enter the arguments in the URL, replace all spaces in the request with '+' and all pluses with '%2b':

http://www.pcmaritime.co.uk/cgi-bin/pcmweb/new.pl?search-term1+%2bsearch-term2

Oh, and of course, the name of the INPUT-field will also be included, so add the line

$ENV{"QUERY_STRING"} =~ s/^[^=]*=//;

before the other two. And use this in your URL:

http://www.pcmaritime.co.uk/cgi-bin/pcmweb/new.pl?terms=search-term1+%2bsearch-term2
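Put together, the three substitutions can be tried outside the CGI environment with something like the sketch below. The sub name decode_query is made up for illustration, it assumes a single form field, and it uses the unsigned "C" template for pack; the sample string is the query part of the URL above.

```perl
use strict;
use warnings;

# Decode a one-field CGI query string into the raw search text.
sub decode_query {
    my ($qs) = @_;
    $qs =~ s/^[^=]*=//;                               # drop the "name=" prefix
    $qs =~ s/\+/ /g;                                  # '+' encodes a space
    $qs =~ s/%([A-Fa-f0-9]{2})/pack("C", hex($1))/ge; # %xy -> character
    return $qs;
}

my $decoded = decode_query("terms=search-term1+%2bsearch-term2");
print "$decoded\n";

my @args = split /\s+/, $decoded;
print scalar(@args), " terms\n";
```

The order matters: the '+' to space replacement has to run before the %xy decoding, otherwise the literal plus sign that comes out of '%2b' would be turned into a space as well and the AND prefix would be lost.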


0
 
LVL 5

Author Comment

by:Trevor013097
ID: 1828078
I am sure I have inserted the line correctly but no joy.  I have tried both a form front end version and a command line version.

form:-

http://www.pcmaritime.co.uk/leisure/search/search2.htm

command argument:-

http://www.pcmaritime.co.uk/cgi-bin/pcmweb/new_pc.pl?search-term

Neither version appears to work.  Any ideas?

Maybe I should e-mail you my current scripts?


0
 
LVL 3

Expert Comment

by:pc012197
ID: 1828079
Yes, please send the code per email. I don't see what's going wrong there.

0
 
LVL 5

Author Comment

by:Trevor013097
ID: 1828080
Excellent Peter,

Works an absolute treat.  It was the form parsing that was causing the problems: the method itself was correct, but there were some extra unnecessary lines, and once they were removed it worked fine.

The boolean works great and now my next job is designing the front end and return HTML.  I will e-mail you the URL when it is up and running and may request your help later on as I want to make some refinements with the returned HTML based on a query, such as matched word highlighting.  That can be looked at later though.

For now thank you very much for all your help.
0
 
LVL 3

Expert Comment

by:pc012197
ID: 1828081
Great. Thanks for the points! It's been a pleasure to work with you. :-)

0
