Solved

Finding strings within strings and assigning score

Posted on 2006-11-13
6
159 Views
Last Modified: 2010-08-05
I want to be able to search for specific characters and strings within a string and increase a score if that string is found.

For example I want to be able to specify my strings to search for in one TEXTAREA box in an HTML page like this:

',comma,%2C,;,semicolon,%3B:1
delete,update,insert:3
drop table,union,having:4

So each line is a group of strings and at the end of the line is a colon follow by the score to assign if one of these strings is found. However if multiple strings from the same group are found I don't want them to be counted twice. Also if so many bytes (characters) goes by (configurable) without a match I want the score to be reset to 0.

So say I have the following text

This is a sample text. This comma has to be counted, once. Delete union but comma. What else do I want to type here having bad characters.

I would like this to be then output in a format like this with the part between the [] being the score at the time the string was found. For this example lets say the number of bytes to reset is 20.

This is a sample text. This comma[1] has to be counted, [0]once. Delete[3] union[7] but comma[8]. What else do I wa[0]nt to type here having[4] bad characters.

The times when it went back down to [0] were because 20 characters had passed without a match. Hopefully this makes sense.
0
Comment
Question by:mikedgibson
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 3
  • 3
6 Comments
 
LVL 84

Expert Comment

by:ozo
ID: 17933307
for( <DATA> ){
    next unless s/:(\d+)\s*//;
    $s = $1;
    @s= split/,/;
    $s{$_}=lc $s for @s;
    push @m,map"\Q$_",@s;
}
$re=join"|",@m;
$re=qr/\G(.{0,19}?($re)|.{20})/i;
$_='This is a sample text. This comma has to be counted, once. Delete union but comma. What else do I want to type here having bad characters.
';
$s=0;
s/$re/$1 . "[" . ($s{lc $2}?$s+=$s{lc $2}:($s=0)) . "]"/eg;
print;
__DATA__
',comma,%2C,;,semicolon,%3B:1
delete,update,insert:3
drop table,union,having:4
0
 
LVL 2

Author Comment

by:mikedgibson
ID: 17933568
This works great at the command line. Now how difficult would it be to make this a CGI where the sample text, groups and number of characters before resetting are all passed from a form?

Sorry I originally wanted it for command line but now I figure this would work better through a web interface.
0
 
LVL 2

Author Comment

by:mikedgibson
ID: 17933838
Hmm the counting seems to be a bit off. If you take the following sample

This is a sample text. This comma has, to be counted, once. Delete update union but comma. What else do I want to type here having bad characters.

You get the following output

This is a sample tex[0]t. This comma[1] has, to be counted,[0] once. Delete[3] update[6] union[10] but comma[11]. What else do I wan[0]t to type here having[4] bad characters.

The update right after the delete should be part of the same group so the delete should be counted but not the update.
0
Industry Leaders: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
LVL 84

Expert Comment

by:ozo
ID: 17934216
$g=0;
for( <DATA> ){
    next unless s/:(\d+)\s*//;
    $s = $1;
    @s= split/,/;
    ++$g;
    $s{lc$_}=[$s,$g] for @s;
    push @m,map"\Q$_",@s;
}
$re=join"|",@m;
$re=qr/(.{0,19}?($re)|.{20})/i;
while( <> ){
  $s=0;
  $g=-1;
  s/$re/(@g=@{$s{lc $2}})?($s+=$g[0]):($s=0);($p,$g)=($g,$g[0]); $1 . ($p!=$g&&"[$s]")/eg;
  print;
}
__DATA__
',comma,%2C,;,semicolon,%3B:1
delete,update,insert:3
drop table,union,having:4
0
 
LVL 84

Accepted Solution

by:
ozo earned 500 total points
ID: 17934796
Sorry, that should have been
s/$re/(@g=@{$s{lc $2}})?($s+=$g[0]):($s=0);($p,$g)=($g,$g[1]); $1 . ($p!=$g&&"[$s]")/eg;
As it was, it would count groups as the same if they had the same score
0
 
LVL 2

Author Comment

by:mikedgibson
ID: 17939631
Is there any way to not reset the counter after a new line?
0

Featured Post

Free Tool: Port Scanner

Check which ports are open to the outside world. Helps make sure that your firewall rules are working as intended.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

I've just discovered very important differences between Windows an Unix formats in Perl,at least 5.xx.. MOST IMPORTANT: Use Unix file format while saving Your script. otherwise it will have ^M s or smth likely weird in the EOL, Then DO NOT use m…
Email validation in proper way is  very important validation required in any web pages. This code is self explainable except that Regular Expression which I used for pattern matching. I originally published as a thread on my website : http://www…
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: T…
Six Sigma Control Plans

726 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question