whittle string down to search terms only

I need to whittle a string down to specific elements. Here's a sample string

$myString=FIND%20(ABSTRACT%20%2F%20ADIS_EVALUATION%20%2F%20CROSS_REF%20%2F%20CROSS_REF2%20%2F%20DERWENT_SUMMARY%20ct%20apples%20%2F%20ability)%20OR%20(IC_KEYWORDS%20%2F%20DERWENT_SUMMARY%20%3Ddoxazosin%A0GITS%20!%20%3DE5)%20OR%20(AUTHORS%20%2F%20CROSS_REF%20%3DA%A0Bertolini)%20AND%20(YEAR%20ct%202005)%20NOT%20(LANGUAGE%20%3D%22Ger%22)

I'm trying to figure out how I can push only selected items to an array. In the example above, the only items I want to push are:

apples
ability
doxazosin GITS
A Bertolini
2005

But of course the string above could just as easily be something like:

$myString=FIND%20(ABSTRACT%20%2F%20ADIS_EVALUATION%20%2F%20CROSS_REF%20%2F%20CROSS_REF2%20%2F%20DERWENT_SUMMARY%20ct%20orange%20%2F%20lack%20of%20imagination)%20OR%20(IC_KEYWORDS%20%2F%20DERWENT_SUMMARY%20%3Ddonepezil%20!%20%3DE5)%20OR%20(AUTHORS%20%2F%20CROSS_REF%20%3DSanta%A0Clause)%20AND%20(YEAR%20ct%201999%3A2002)%20NOT%20(LANGUAGE%20%3D%22Ger%22)

In the second example, I'd want to push the following to an array:

orange
lack of imagination
donepezil
Santa Clause
1999
2000
2001
2002

In the second example, I want years 1999 through 2002 because the string contains 1999%3A2002, which indicates a range of years.

And, as you can see, I don't want any of the following:

FIND%20
ABSTRACT
ADIS_EVALUATION
CROSS_REF
CROSS_REF2
DERWENT_SUMMARY
IC_KEYWORDS
AUTHORS
YEAR
LANGUAGE
%20%2F%20
(
)

nor do I want items that follow an ! or a NOT, such as (in the examples above)
E5
Ger

For me, this seems a nightmare of logic and I'm pretty lost. Would anybody out there care to give it a shot? All efforts much appreciated. The whole point of this is that I'm trying to narrow everything down to specific search terms so that I can then highlight them in my search results.

Like I said...a nightmare.

Any brave capable souls interested in diving into the breach?

GessWurkerAsked:
stefan73Commented:
Hi GessWurker,
First, get rid of the URL (percent) escapes (%...):

use URI::Escape;
print uri_unescape("FIND%20(ABSTRACT%20%2F%20ADIS_EVALUATION%20%2F%20CROSS_REF%20%2F%20CROSS_REF2%20%2F%20DERWENT_SUMMARY%20ct%20orange%20%2F%20lack%20of%20imagination)%20OR%20(IC_KEYWORDS%20%2F%20DERWENT_SUMMARY%20%3Ddonepezil%20!%20%3DE5)%20OR%20(AUTHORS%20%2F%20CROSS_REF%20%3DSanta%A0Clause)%20AND%20(YEAR%20ct%201999%3A2002)%20NOT%20(LANGUAGE%20%3D%22Ger%22)");

You'll get:
FIND (ABSTRACT / ADIS_EVALUATION / CROSS_REF / CROSS_REF2 / DERWENT_SUMMARY ct orange / lack of imagination) OR (IC_KEYWORDS / DERWENT_SUMMARY =donepezil ! =E5) OR (AUTHORS / CROSS_REF =Santa Clause) AND (YEAR ct 1999:2002) NOT (LANGUAGE ="Ger")
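One detail worth noting: the sample contains %A0, which decodes to a non-breaking space (byte 0xA0 in Latin-1) rather than an ordinary space. Below is a minimal sketch, assuming Latin-1 input, that decodes by hand instead of using URI::Escape so it runs without CPAN modules. The sub name decode_query is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: percent-decode a query string and turn
# non-breaking spaces (0xA0) into ordinary spaces.
sub decode_query {
    my ($raw) = @_;
    (my $text = $raw) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    $text =~ tr/\xA0/ /;    # %A0 decodes to a non-breaking space
    return $text;
}

# prints: (AUTHORS / CROSS_REF =Santa Clause)
print decode_query('(AUTHORS%20%2F%20CROSS_REF%20%3DSanta%A0Clause)'), "\n";
```

Normalizing the non-breaking space early means later splits on whitespace treat "Santa Clause" like any other two-word term.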

You can split those using the brackets (can you nest them? I hope not)

Like:

while($myString =~ m/(?:^|\s+)(\w+)\s*\(([^\)]*)\)/g){
    push @expression,[$1,$2];
}

Now you have in @expression:
[0]=["FIND","ABSTRACT / ADIS_EVALUATION / CROSS_REF / CROSS_REF2 / DERWENT_SUMMARY ct orange / lack of imagination"]
[1]=["OR","IC_KEYWORDS / DERWENT_SUMMARY =donepezil ! =E5"]
etc.

Then you can filter out NOTs.
(If your operators are not case-sensitive, map them to either lower or upper case: push @expression,[uc($1),$2];)

As the next step you can split by "ct" or "=":
for $ex (@expression){
    $right_side = (split(/\s*(?:\bct\b|=)\s*/, $ex->[1]))[-1];
    ...
}
 
As you can have composite ones (like "orange / lack of imagination"), you need to split again.
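Putting the steps above together for the flat (non-nested) case might look like the sketch below. It assumes the string is already decoded with non-breaking spaces normalized, that parentheses are not nested, that "!" introduces an exclusion, and that a year range is written as YYYY:YYYY. The sub name extract_terms is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: pull search terms out of a decoded, non-nested query.
sub extract_terms {
    my ($query) = @_;
    my @terms;
    # Each match is an operator word (FIND, OR, AND, NOT) plus one group.
    while ($query =~ /(\w+)\s*\(([^()]*)\)/g) {
        my ($op, $body) = (uc($1), $2);
        next if $op eq 'NOT';                      # skip negated groups
        # Keep everything after the first "ct" or "=".
        my (undef, $value) = split /\s*(?:\bct\b|=)\s*/, $body, 2;
        next unless defined $value;
        $value =~ s/\s*!.*//;                      # drop "! =E5" exclusions
        for my $term (split m{\s*/\s*}, $value) {  # implicit OR list
            $term =~ s/^"|"$//g;                   # strip surrounding quotes
            if ($term =~ /^(\d{4}):(\d{4})$/) {    # year range, e.g. 1999:2002
                push @terms, $1 .. $2;
            } elsif (length $term) {
                push @terms, $term;
            }
        }
    }
    return @terms;
}

my $q = 'FIND (ABSTRACT ct orange / lack of imagination) '
      . 'AND (YEAR ct 1999:2001) NOT (LANGUAGE ="Ger")';
print "$_\n" for extract_terms($q);
```

This handles the two sample strings from the question, but it is only as good as the "no nesting" assumption.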

Hope this gets you on the way. Be aware that this won't support a full syntax yet. For that, you need a full-scale parser, such as Lex/YACC.


Cheers!

Stefan
 
GessWurkerAuthor Commented:
Stefan: Thanks for your response. And sadly, yes, parens can be nested and nested again! But I'll be giving your suggestions a shot.

I'm not familiar with Lex/YACC.
 
GessWurkerAuthor Commented:
Perhaps it would be better to use split on the = (%3D) and on the "ct"? Maybe that would get me to the actual terms more efficiently? ...especially since there can be lots of nested parentheses.
 
stefan73Commented:
Yes, although getting rid of "NOT" expressions will be tricky then.

You're probably better off parsing the expression into a syntax tree. The "AND" or "NOT" operators are the nodes, and the (possibly nested) expressions are the branches.

       AND
     /       \      
NOT        OR
  |          /      \
apples  pears peaches

Then you can easily clip off NOT subtrees.
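As a rough illustration of clipping NOT subtrees, here is a sketch that walks a tree shaped like the one drawn above. The node layout (hash refs with op/kids for operators and term for leaves) and the sub name collect_terms are assumptions made for this example, not something the search tool defines.

```perl
use strict;
use warnings;

# Hypothetical node shapes: { op => 'AND', kids => [...] } for operators,
# { term => 'apples' } for leaves.
sub collect_terms {
    my ($node) = @_;
    return $node->{term} if exists $node->{term};    # leaf: keep the term
    return () if $node->{op} eq 'NOT';               # clip the NOT subtree
    return map { collect_terms($_) } @{ $node->{kids} };
}

my $tree = {
    op   => 'AND',
    kids => [
        { op => 'NOT', kids => [ { term => 'apples' } ] },
        { op => 'OR',  kids => [ { term => 'pears' }, { term => 'peaches' } ] },
    ],
};

print join(', ', collect_terms($tree)), "\n";    # prints: pears, peaches
```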

 
 
GessWurkerAuthor Commented:
Well...my head REALLY hurts now! Can you elaborate a little on how I'd parse the expression into a syntax tree? Thanks again for all your guidance!
 
stefan73Commented:
That's a bit tricky. Your root node normally is an AND expression (could also be OR) and the sub-nodes are the parenthesized expressions. Example:

FIND (field1 / field2 = a ) AND (field3 = c ) AND ( ( field4 = d ) OR ( field5 ct e) )

would translate into
 
                        AND
           /             |             \
          =              =              OR
        /   \          /   \          /    \
  field1/2   a     field3   c        =      ct
                                   /   \   /   \
                              field4   d field5  e


I'm not familiar with the search syntax, but I assume "ct" means "contains"?

Also:
FIND (ABSTRACT / ADIS_EVALUATION / CROSS_REF / CROSS_REF2 / DERWENT_SUMMARY ct orange / lack of imagination)

Is the same as
FIND (ABSTRACT / ADIS_EVALUATION / CROSS_REF / CROSS_REF2 / DERWENT_SUMMARY ct orange) OR (ABSTRACT / ADIS_EVALUATION / CROSS_REF / CROSS_REF2 / DERWENT_SUMMARY ct lack of imagination)

like an implicit "OR"?
 
GessWurkerAuthor Commented:
Yes, basically, ct means contains. And yes, it's an implicit "OR".
 
stefan73Commented:
So the implicit OR is something you can easily turn into an explicit OR subtree.
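A small sketch of that expansion, assuming values are separated by " / " as in the samples. The sub name expand_values and the node keys (op, field, term, kids) are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "orange / lack of imagination" into an OR node
# with one comparison leaf per value.
sub expand_values {
    my ($fields, $cmp, $values) = @_;
    my @leaves = map { +{ op => $cmp, field => $fields, term => $_ } }
                 split m{\s*/\s*}, $values;
    # A single value needs no OR wrapper.
    return @leaves == 1 ? $leaves[0] : { op => 'OR', kids => \@leaves };
}

my $node = expand_values('ABSTRACT / CROSS_REF', 'ct', 'orange / lack of imagination');
print "$node->{op}: ", join(' | ', map { $_->{term} } @{ $node->{kids} }), "\n";
```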
 
GessWurkerAuthor Commented:
OK. right now, all I'm trying to do is create the array and display each member. But I'm missing something. So all I'm ending up with is a list like this:
ARRAY(0x1845f1c)
ARRAY(0x1845f34)
ARRAY(0x190379c)
ARRAY(0x19037f0)
ARRAY(0x19038b0)
ARRAY(0x1903910)
ARRAY(0x1903a84)

What am I missing?

Here's the code:
$ContentType   = "Content-type:  text/html\n\n";
print $ContentType;

use strict;

my ($query,%data,$InmDB,$InmQ,$InmBU,$InmRF,@exp,$exp);

use CGI;

$query = new CGI;

use URI::Escape qw(uri_unescape);
$InmQ = uri_unescape($data{'InmQ'});

while($InmQ =~ m/\=?:^|\s+\=\W+/g){
    push @exp,[$1,$2];
}
foreach $exp(@exp) {
print <<HTML;
$exp
<br/>
HTML
}

print HTML;


print "<html><head><title>Parse query</title></head><BODY>
       <b>unescaped query (original): </b>".$InmQ."
       <BR/>";
print "</BODY>";
 
stefan73Commented:
Try Data::Dumper:

use Data::Dumper;
...
print Data::Dumper->Dump([\@exp],["*exp"]);

...and you'll see that @exp is a nested array, because you did
push @exp,[$1,$2];
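ARRAY(0x...) is what Perl prints when an array reference is used as a string. To show the contents, dereference each element first. A minimal sketch with made-up data (the sub name pair_lines is hypothetical):

```perl
use strict;
use warnings;

# Hypothetical helper: format each [operator, term] pair as one line.
sub pair_lines {
    my @lines;
    for my $pair (@_) {
        my ($op, $term) = @$pair;    # dereference the inner array
        push @lines, "$op => $term";
    }
    return @lines;
}

my @exp = ( [ 'ct', 'orange' ], [ '=', 'donepezil' ] );
print "$_<br/>\n" for pair_lines(@exp);
```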

BTW: You'll see your subarrays are empty: You need to capture the data using (...) expressions, in your case
while($InmQ =~ m/\=?:^|\s+\=\W+/g){

replaced with (check this, I'm not sure what you're planning to do)
while($InmQ =~ m/(?<=^|\s)(\=|ct\s)\s*(\.*?)(?=\))/g){

...and you'll get the compare operator ('=' or 'ct ') in $1, the word you're looking for in $2. This regex is still insufficient to capture a bracketed expression. For this, you can do:

while($InmQ =~ m/(?<=\()([^\)]+?)(\=|ct\s)\s*(.*?)(?=\))/g){

This still won't work for nested parentheses, though. A classic regex has no way to track nesting levels (Perl 5.10 and later offer recursive patterns such as (?R), but they get hard to read quickly). That's what YACC or similar tools are for.
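Short of a full parser, nesting can be tracked with a simple depth counter instead of a regex. The sketch below splits a string into its top-level parenthesized groups and could be applied again to each group to go one level deeper. It assumes balanced parentheses, and the sub name top_level_groups is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the contents of each top-level (...) group.
sub top_level_groups {
    my ($text) = @_;
    my ($depth, $start, @groups) = (0, 0);
    for my $i (0 .. length($text) - 1) {
        my $ch = substr $text, $i, 1;
        if ($ch eq '(') {
            $start = $i + 1 if $depth == 0;    # remember where the group begins
            $depth++;
        } elsif ($ch eq ')') {
            $depth--;
            push @groups, substr($text, $start, $i - $start) if $depth == 0;
        }
    }
    return @groups;
}

print "$_\n" for top_level_groups('FIND (a = b) AND ((c = d) OR (e ct f))');
```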
 
GessWurkerAuthor Commented:
made change to the regex as you suggested and am getting an error:
Variable length lookbehind not implemented in regex; marked by <-- HERE in m/(?<=^|\s)(=|ct\s)\s*(\.*?)(?=\)) <-- HERE / at d:\cgi\workshop\ParseQuery.pl line 27.

can you correct my code?

Here's my non-working code:
$ContentType   = "Content-type:  text/html\n\n";
print $ContentType;

use strict;

my ($query,%data,$InmDB,$InmQ,$InmBU,$InmRF,@exp,$exp,%exp);

use CGI;
use Data::Dumper;
$query = new CGI;

use URI::Escape qw(uri_unescape);

use URI::Escape qw(uri_escape);

$InmDB = uc($data{'InmDB'});
$InmQ = uri_unescape($data{'InmQ'});
$InmBU = $data{'InmBU'};
$InmRF = $data{'InmRF'};

while($InmQ =~ m/(?<=^|\s)(\=|ct\s)\s*(\.*?)(?=\))/g){
    push @exp,[$1,$2];
}

print Data::Dumper->Dump([\@exp],["*exp"]);

#foreach $exp(@exp) {
#print $exp."<br/>";}

print "<html><head><title>Parse query</title></head><BODY>
       <b>unescaped query (original): </b>".$InmQ.
       "       <BR/>";
print "</BODY>";
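The error comes from (?<=^|\s): the two alternatives have different lengths (^ is zero-width, \s is one character), which older Perls reject in a lookbehind. One possible workaround is (?<!\S), a fixed-width negative lookbehind that succeeds both at the start of the string and after whitespace. Note also that \.*? matches only literal dots, so the sketch below uses [^)]*? for the value instead. The sub name compare_pairs is hypothetical, and nested parentheses are still not handled.

```perl
use strict;
use warnings;

# Hypothetical helper: collect [operator, value] pairs from a decoded query.
sub compare_pairs {
    my ($query) = @_;
    my @pairs;
    # (?<!\S) = "not preceded by a non-space": start of string or after whitespace
    while ($query =~ /(?<!\S)(=|ct\s)\s*([^)]*?)(?=\))/g) {
        push @pairs, [ $1, $2 ];
    }
    return @pairs;
}

my @exp = compare_pairs('FIND (A / B ct orange / lack) OR (C =donepezil ! =E5)');
print "[$_->[0]] [$_->[1]]\n" for @exp;
```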
 
GessWurkerAuthor Commented:
stefan73: I've decided to abandon the query-parsing, pushing-to-array, term-highlighting approach, so I'm going to allocate points and close this question. But thanks for all your help on this; I've gained a lot from our discussion.

GessWurker
 
stefan73Commented:
Hi,

Here's a simple example of a syntax tree built from an arithmetic expression:


#!/usr/bin/env perl

use strict;
use Data::Dumper;

# Sample expression
my $input="(3+4)*(5+3)";

# Regex components
my $operator='[-+*/]';   # "-" goes first so it is a literal, not a range

my $root=parse($input);

print Data::Dumper->Dump([$root],['root']);

sub parse($){
      my $txt = shift;
      my ($left,$right,$op);
      
      # Left-hand match
      if($txt =~ m/^(\d+)(?=$operator)/g){
            #print "Left digit: $1\n";
            $left = $1;
      } elsif ($txt =~ m/^\((.*)(?=\)$operator)/g){
            #print "Left bracket: $1\n";
            $left = parse($1);
      }

      # Operator match      
      my $xx = substr($txt,pos($txt));
      if($xx =~ /^\)*($operator)/){
            $op = $1;
      }
      
      # Right-hand match
      if($txt =~ m/(\d+)$/){
            #print "Right digit: $1\n";
            $right = $1;
      } elsif ($txt =~ m/$operator\((.*)\)$/g){
            #print "Right bracket: $1\n";
            $right = parse($1);
      }
      
      return {left=>$left,right=>$right,op=>$op};
}

This is quite messy and far from robust, but you get the idea.
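For comparison, a somewhat sturdier way to build the same kind of tree is a small recursive-descent parser working on a token list. This is only a sketch for integer arithmetic with + - * / and parentheses. All sub names (parse_string, parse_expr, parse_term, parse_factor) are hypothetical, and adapting it to the search syntax would mean swapping in AND/OR/NOT and the ct/= comparisons.

```perl
use strict;
use warnings;

# expr := term (('+' | '-') term)*
sub parse_expr {
    my ($tokens) = @_;
    my $left = parse_term($tokens);
    while (@$tokens && $tokens->[0] =~ /^[-+]$/) {
        my $op = shift @$tokens;
        $left = { op => $op, left => $left, right => parse_term($tokens) };
    }
    return $left;
}

# term := factor (('*' | '/') factor)*
sub parse_term {
    my ($tokens) = @_;
    my $left = parse_factor($tokens);
    while (@$tokens && $tokens->[0] =~ m{^[*/]$}) {
        my $op = shift @$tokens;
        $left = { op => $op, left => $left, right => parse_factor($tokens) };
    }
    return $left;
}

# factor := number | '(' expr ')'
sub parse_factor {
    my ($tokens) = @_;
    my $tok = shift @$tokens;
    die "unexpected end of input\n" unless defined $tok;
    return $tok if $tok =~ /^\d+$/;
    die "unexpected token '$tok'\n" unless $tok eq '(';
    my $inner = parse_expr($tokens);            # recursion handles nesting
    my $close = shift @$tokens;
    die "missing ')'\n" unless defined $close && $close eq ')';
    return $inner;
}

sub parse_string {
    my ($text) = @_;
    my @tokens = $text =~ m{\d+|[-+*/()]}g;     # numbers, operators, parens
    my $tree = parse_expr(\@tokens);
    die "trailing input\n" if @tokens;
    return $tree;
}

my $root = parse_string('(3+4)*(5+3)');
print "$root->{op} $root->{left}{op} $root->{right}{op}\n";
```

Because each nesting level is just another call to parse_expr, arbitrarily deep parentheses need no special handling.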