whittle string down to search terms only

Posted on 2005-05-13
Last Modified: 2010-03-05
I need to whittle a string down to specific elements. Here's a sample string

$myString=FIND%20(ABSTRACT%20%2F%20ADIS_EVALUATION%20%2F%20CROSS_REF%20%2F%20CROSS_REF2%20%2F%20DERWENT_SUMMARY%20ct%20apples%20%2F%20ability)%20OR%20(IC_KEYWORDS%20%2F%20DERWENT_SUMMARY%20%3Ddoxazosin%A0GITS%20!%20%3DE5)%20OR%20(AUTHORS%20%2F%20CROSS_REF%20%3DA%A0Bertolini)%20AND%20(YEAR%20ct%202005)%20NOT%20(LANGUAGE%20%3D%22Ger%22)

I'm trying to figure out how I can push only selected items to an array. In the example above, the only items I want to push are:

apples
ability
doxazosin GITS
A Bertolini
2005

But of course the string above could just as easily be something like:

$myString=FIND%20(ABSTRACT%20%2F%20ADIS_EVALUATION%20%2F%20CROSS_REF%20%2F%20CROSS_REF2%20%2F%20DERWENT_SUMMARY%20ct%20orange%20%2F%20lack%20of%20imagination)%20OR%20(IC_KEYWORDS%20%2F%20DERWENT_SUMMARY%20%3Ddonepezil%20!%20%3DE5)%20OR%20(AUTHORS%20%2F%20CROSS_REF%20%3DSanta%A0Clause)%20AND%20(YEAR%20ct%201999%3A2002)%20NOT%20(LANGUAGE%20%3D%22Ger%22)

In the second example, I'd want to push the following to an array:

orange
lack of imagination
donepezil
Santa Clause
1999
2000
2001
2002

In the second example, I want years 1999 through 2002 because the string contains 1999%3A2002, which indicates a range of years.

And, as you can see, I don't want any of the following:

FIND%20
ABSTRACT
ADIS_EVALUATION
CROSS_REF
CROSS_REF2
DERWENT_SUMMARY
IC_KEYWORDS
AUTHORS
YEAR
LANGUAGE
%20%2F%20
(
)

nor do I want items that follow an ! or a NOT, such as (in the examples above)
E5
Ger

For me, this seems a nightmare of logic and I'm pretty lost. Would anybody out there care to give it a shot? All efforts much appreciated. The whole point of this is that I'm trying to narrow everything down to specific search terms so that I can then highlight them in my search results.

Like I said...a nightmare.

Any brave capable souls interested in diving into the breach?

Question by:GessWurker
LVL 12

Accepted Solution

by:
stefan73 earned 2000 total points
ID: 13999891
Hi GessWurker,
First, get rid of the URL escapes (%...):

use URI::Escape;
print uri_unescape("FIND%20(ABSTRACT%20%2F%20ADIS_EVALUATION%20%2F%20CROSS_REF%20%2F%20CROSS_REF2%20%2F%20DERWENT_SUMMARY%20ct%20orange%20%2F%20lack%20of%20imagination)%20OR%20(IC_KEYWORDS%20%2F%20DERWENT_SUMMARY%20%3Ddonepezil%20!%20%3DE5)%20OR%20(AUTHORS%20%2F%20CROSS_REF%20%3DSanta%A0Clause)%20AND%20(YEAR%20ct%201999%3A2002)%20NOT%20(LANGUAGE%20%3D%22Ger%22)");

You'll get:
FIND (ABSTRACT / ADIS_EVALUATION / CROSS_REF / CROSS_REF2 / DERWENT_SUMMARY ct orange / lack of imagination) OR (IC_KEYWORDS / DERWENT_SUMMARY =donepezil ! =E5) OR (AUTHORS / CROSS_REF =Santa Clause) AND (YEAR ct 1999:2002) NOT (LANGUAGE ="Ger")
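One detail worth checking in the decoded text: %A0 decodes to a non-breaking space (byte 0xA0), not an ordinary space, so later patterns may not treat it as whitespace. A minimal sketch of one way to normalize it follows. The helper name decode_query is made up for illustration, and the decoding is done with a core-Perl substitution rather than URI::Escape so the snippet has no dependencies.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): decode percent-escapes with core
# Perl only, then map the non-breaking space (byte 0xA0) to a plain space.
sub decode_query {
    my ($raw) = @_;
    (my $text = $raw) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    $text =~ tr/\xA0/ /;
    return $text;
}

print decode_query('(AUTHORS%20%2F%20CROSS_REF%20%3DSanta%A0Clause)'), "\n";
```

With that in place, "Santa Clause" and "doxazosin GITS" contain plain spaces and can be compared or highlighted like any other term.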

You can split those using the brackets (can you nest them? I hope not)

Like:

while($myString =~ m/(?:^|\s+)(\w+)\s*\(([^\)]*)\)/g){
    push @expression,[$1,$2];
}

Now you have in @expression:
[0]=["FIND","ABSTRACT / ADIS_EVALUATION / CROSS_REF / CROSS_REF2 / DERWENT_SUMMARY ct orange / lack of imagination"]
[1]=["OR","IC_KEYWORDS / DERWENT_SUMMARY =donepezil ! =E5"]
etc.

Then you can filter out NOTs.
(If your operators are not case-sensitive, map them to either lower or upper case: push @expression,[uc($1),$2];)

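As the next step you can split by "ct" or "=":
for my $ex (@expression){
    # $ex->[1] is the text inside the parentheses; keep what follows the last operator.
    # Note: a negated value such as "! =E5" still has to be dropped separately.
    my $right_side = (split /\s*(?:\bct\b|=)\s*/, $ex->[1])[-1];
    ...
}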
 
As you can have composite ones (like "orange / lack of imagination"), you need to split again.
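The second split, together with the year-range expansion asked for in the question, could be sketched roughly as below. This is only an illustration under assumptions: the helper name clause_terms is invented, everything after "!" is assumed to be negated, and a range is assumed to be two four-digit years joined by ":".

```perl
use strict;
use warnings;

# Hypothetical helper: extract search terms from the text of one clause,
# e.g. "YEAR ct 1999:2002" or "IC_KEYWORDS / DERWENT_SUMMARY =donepezil ! =E5".
sub clause_terms {
    my ($clause) = @_;

    # Split once, at the first "ct" or "=": field names left, values right.
    my (undef, $values) = split /\s*(?:\bct\b|=)\s*/, $clause, 2;
    return () unless defined $values;

    $values =~ s/\s*!.*$//;    # assumption: everything after "!" is negated

    my @terms;
    for my $term (split m{\s*/\s*}, $values) {
        $term =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        $term =~ s/^"|"$//g;        # drop surrounding double quotes
        next if $term eq '';
        if ($term =~ /^(\d{4}):(\d{4})$/ && $1 <= $2) {
            push @terms, ($1 + 0) .. ($2 + 0);    # expand a year range
        } else {
            push @terms, $term;
        }
    }
    return @terms;
}

print join(', ', clause_terms('YEAR ct 1999:2002')), "\n";
```

Calling it on "DERWENT_SUMMARY ct orange / lack of imagination" yields the two terms "orange" and "lack of imagination". Clauses introduced by NOT still need to be skipped before this step.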

Hope this gets you on the way. Be aware that this won't support a full syntax yet. For that, you need a full-scale parser, such as Lex/YACC.


Cheers!

Stefan

Author Comment

by:GessWurker
ID: 14000212
Stefan: Thanks for your response. And sadly, yes, parens can be nested and nested again! But I'll be giving your suggestions a shot.

I'm not familiar with Lex/YACC.

Author Comment

by:GessWurker
ID: 14000260
Perhaps it would be better to use split on the = (%3D) and on the "ct"? Maybe that would get me to the actual terms more efficiently? ...especially since there can be lots of nested parentheses.
 
LVL 12

Expert Comment

by:stefan73
ID: 14001487
Yes, although getting rid of "NOT" expressions will be tricky then.

You're probably better off parsing the expression into a syntax tree. The "AND" or "NOT" operators are the nodes, and the (possibly nested) expressions are the branches.

         AND
        /   \
     NOT     OR
      |     /  \
  apples pears  peaches

Then you can easily clip off NOT subtrees.
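As a rough illustration of clipping, the tree above can be held in nested hashes and walked recursively. The node layout (op, kids, term) and the function name collect_terms are assumptions made up for this sketch, not something defined in the thread.

```perl
use strict;
use warnings;

# The example tree, written out by hand. Inner nodes carry an operator and
# child nodes; leaves carry a single search term.
my $tree = {
    op   => 'AND',
    kids => [
        { op => 'NOT', kids => [ { term => 'apples' } ] },
        { op => 'OR',  kids => [ { term => 'pears' }, { term => 'peaches' } ] },
    ],
};

# Hypothetical walker: return every term that is not underneath a NOT node.
sub collect_terms {
    my ($node) = @_;
    return ($node->{term}) if exists $node->{term};
    return () if $node->{op} eq 'NOT';    # clip the whole subtree
    return map { collect_terms($_) } @{ $node->{kids} };
}

print join(', ', collect_terms($tree)), "\n";
```

Here only "pears" and "peaches" survive, because the "apples" leaf sits under NOT.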

 

Author Comment

by:GessWurker
ID: 14002060
Well...my head REALLY hurts now! Can you elaborate a little on how I'd parse the expression into a syntax tree? Thanks again for all your guidance!
LVL 12

Expert Comment

by:stefan73
ID: 14006567
That's a bit tricky. Your root node normally is an AND expression (could also be OR) and the sub-nodes are the parenthesized expressions. Example:

FIND (field1 / field2 = a ) AND (field3 = c ) AND ( ( field4 = d ) OR ( field5 ct e) )

would translate into
 
                   AND
          /         |         \
         =          =          OR
       /   \      /   \      /    \
field1/2    a  field3  c    =      ct
                          /   \   /   \
                     field4    d field5 e


I'm not familiar with the search syntax, but I assume "ct" means "contains"?

Also:
FIND (ABSTRACT / ADIS_EVALUATION / CROSS_REF / CROSS_REF2 / DERWENT_SUMMARY ct orange / lack of imagination)

Is the same as
FIND (ABSTRACT / ADIS_EVALUATION / CROSS_REF / CROSS_REF2 / DERWENT_SUMMARY ct orange) OR (ABSTRACT / ADIS_EVALUATION / CROSS_REF / CROSS_REF2 / DERWENT_SUMMARY ct lack of imagination)

like an implicit "OR"?

Author Comment

by:GessWurker
ID: 14007462
Yes, basically, ct means contains. And yes, it's an implicit "OR".
LVL 12

Expert Comment

by:stefan73
ID: 14008541
So a "/"-separated list of values is something you can easily change into an OR subtree.

Author Comment

by:GessWurker
ID: 14011201
OK. right now, all I'm trying to do is create the array and display each member. But I'm missing something. So all I'm ending up with is a list like this:
ARRAY(0x1845f1c)
ARRAY(0x1845f34)
ARRAY(0x190379c)
ARRAY(0x19037f0)
ARRAY(0x19038b0)
ARRAY(0x1903910)
ARRAY(0x1903a84)

What am I missing?

Here's the code:
$ContentType   = "Content-type:  text/html\n\n";
print $ContentType;

use strict;

my ($query,%data,$InmDB,$InmQ,$InmBU,$InmRF,@exp,$exp);

use CGI;

$query = new CGI;

use URI::Escape qw(uri_unescape);
$InmQ = uri_unescape($data{'InmQ'});

while($InmQ =~ m/\=?:^|\s+\=\W+/g){
    push @exp,[$1,$2];
}
foreach $exp(@exp) {
print <<HTML;
$exp
<br/>
HTML
}

print HTML;


print "<html><head><title>Parse query</title></head><BODY>
       <b>unescaped query (original): </b>".$InmQ."
       <BR/>";
print "</BODY>";
LVL 12

Expert Comment

by:stefan73
ID: 14011887
Try Data::Dumper:

use Data::Dumper;
...
print Data::Dumper->Dump([\@exp],["*exp"]);

...and you'll see that @exp is a nested array, because you did
push @exp,[$1,$2];
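The ARRAY(0x...) lines appear because each element of @exp is a reference to a two-element array, and printing a reference shows its address. A small sketch of dereferencing each pair before printing (the sample data and the helper name format_pair are made up for illustration):

```perl
use strict;
use warnings;

# Sample data in the same shape as @exp: one [operator, term] pair per match.
my @exp = ( [ 'ct', 'orange' ], [ '=', 'donepezil' ] );

# Hypothetical helper: turn one pair (an array reference) into printable text.
sub format_pair {
    my ($pair) = @_;
    my ($operator, $term) = @$pair;    # dereference the inner array
    return "$operator $term";
}

print format_pair($_), "<br/>\n" for @exp;
```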

BTW: You'll see your subarrays are empty: You need to capture the data using (...) expressions, in your case
while($InmQ =~ m/\=?:^|\s+\=\W+/g){

replaced with (check this, I'm not sure what you're planning to do)
while($InmQ =~ m/(?<=^|\s)(\=|ct\s)\s*(\.*?)(?=\))/g){

...and you'll get the compare operator ('=' or 'ct ') in $1, the word you're looking for in $2. This regex is still insufficient to capture a bracketed expression. For this, you can do:

while($InmQ =~ m/(?<=\()([^\)]+?)(\=|ct\s)\s*(\.*?)(?=\))/g){

This still won't work for nested parentheses, though. A plain regex has no way to track nesting levels. That's what YACC or similar other tools are for.
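Short of a full parser, nesting can be handled with a small hand-written scanner that keeps a stack while walking the parentheses. The sketch below is one possible approach, not a complete grammar: the function name leaf_clauses is invented, and it assumes a group is negated when the word NOT comes directly before its opening parenthesis (or when an enclosing group is negated). Perl 5.10 and later also offer recursive patterns such as (?R), but a scanner like this is often easier to follow.

```perl
use strict;
use warnings;

# Hypothetical scanner: return the text of every innermost parenthesized
# clause that is not inside (or directly preceded by) a NOT.
sub leaf_clauses {
    my ($query) = @_;
    my @stack = ( { negated => 0, text => '', has_child => 0 } );
    my @leaves;

    # split with a capturing group keeps the parentheses as separate tokens
    for my $tok (split /([()])/, $query) {
        if ($tok eq '(') {
            my $top = $stack[-1];
            $top->{has_child} = 1;
            # negated if the parent is, or if the text just before "(" ends in NOT
            my $neg = ($top->{negated} || $top->{text} =~ /\bNOT\s*$/) ? 1 : 0;
            $top->{text} = '';
            push @stack, { negated => $neg, text => '', has_child => 0 };
        } elsif ($tok eq ')') {
            die "unbalanced ')'\n" if @stack < 2;
            my $done = pop @stack;
            push @leaves, $done->{text}
                unless $done->{has_child} || $done->{negated};
        } else {
            $stack[-1]{text} .= $tok;
        }
    }
    die "unbalanced '('\n" if @stack != 1;
    return @leaves;
}

print "$_\n" for leaf_clauses(
    'FIND (A ct x) AND ((B = y) OR (C ct z)) NOT (D = w)');
```

Each returned clause (for example "B = y") can then be split on "ct" or "=" as before. Values negated with "!" inside a clause are not handled here and would need their own check.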

Author Comment

by:GessWurker
ID: 14012217
made change to the regex as you suggested and am getting an error:
Variable length lookbehind not implemented in regex; marked by <-- HERE in m/(?<=^|\s)(=|ct\s)\s*(\.*?)(?=\)) <-- HERE / at d:\cgi\workshop\ParseQuery.pl line 27.

can you correct my code?

Here's my non-working code:
$ContentType   = "Content-type:  text/html\n\n";
print $ContentType;

use strict;

my ($query,%data,$InmDB,$InmQ,$InmBU,$InmRF,@exp,$exp,%exp);

use CGI;
use Data::Dumper;
$query = new CGI;

use URI::Escape qw(uri_unescape);

use URI::Escape qw(uri_escape);

$InmDB = uc($data{'InmDB'});
$InmQ = uri_unescape($data{'InmQ'});
$InmBU = $data{'InmBU'};
$InmRF = $data{'InmRF'};

while($InmQ =~ m/(?<=^|\s)(\=|ct\s)\s*(\.*?)(?=\))/g){
    push @exp,[$1,$2];
}

print Data::Dumper->Dump([\@exp],["*exp"]);

#foreach $exp(@exp) {
#print $exp."<br/>";}

print "<html><head><title>Parse query</title></head><BODY>
       <b>unescaped query (original): </b>".$InmQ.
       "       <BR/>";
print "</BODY>";
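For reference, the error comes from (?<=^|\s): one alternative is zero-width and the other is one character wide, which this Perl rejects as a variable-length lookbehind. A non-capturing group (?:^|\s) avoids the problem. Separately, (\.*?) only matches literal dots, so it needs to be (.*?) to capture the term. A hedged sketch of the loop with both changes, run on a made-up flat query (no nested parentheses):

```perl
use strict;
use warnings;

# Made-up, already-unescaped sample query without nested parentheses.
my $InmQ = 'FIND (ABSTRACT / CROSS_REF ct orange) OR (IC_KEYWORDS =donepezil) AND (YEAR ct 1999:2002)';

my @exp;
# (?:^|\s) replaces the lookbehind; (.*?) lazily captures up to the next ")"
while ($InmQ =~ m/(?:^|\s)(=|ct\s)\s*(.*?)(?=\))/g) {
    push @exp, [ $1, $2 ];    # [operator, term]
}

print "$_->[0] => $_->[1]\n" for @exp;
```

This yields one [operator, term] pair per clause. It still treats a "/"-separated list as a single term and does not skip NOT clauses.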

Author Comment

by:GessWurker
ID: 14018834
stefan73: I've decided to abandon the query-parsing, pushing-to-array, term-highlighting approach, so I'm going to allocate points and close this question. But thanks for all your help on this; I've gained a lot from our discussion.

GessWurker
LVL 12

Expert Comment

by:stefan73
ID: 14026149
Hi,

Here's a simple example of a syntax tree built from an arithmetic expression:


#!/usr/bin/env perl

use strict;
use warnings;
use Data::Dumper;

# Sample expression
my $input="(3+4)*(5+3)";

# Regex components ("-" comes first so it is a literal, not a range)
my $operator='[-+*/]';

my $root=parse($input);

print Data::Dumper->Dump([$root],['root']);

# Split "left op right" into a hash, recursing into bracketed operands
sub parse {
      my $txt = shift;
      my ($left,$right,$op);
      
      # Left-hand match (the /g flag records where the match ended in pos($txt))
      if($txt =~ m/^(\d+)(?=$operator)/g){
            #print "Left digit: $1\n";
            $left = $1;
      } elsif ($txt =~ m/^\((.*)(?=\)$operator)/g){
            #print "Left bracket: $1\n";
            $left = parse($1);
      }

      # Operator match: look at the text right after the left operand
      my $xx = substr($txt,pos($txt) // 0);
      if($xx =~ /^\)*($operator)/){
            $op = $1;
      }
      
      # Right-hand match
      if($txt =~ m/(\d+)$/){
            #print "Right digit: $1\n";
            $right = $1;
      } elsif ($txt =~ m/$operator\((.*)\)$/g){
            #print "Right bracket: $1\n";
            $right = parse($1);
      }
      
      return {left=>$left,right=>$right,op=>$op};
}

This is quite messy and far from robust, but you get the idea.