Solved

Finding strings within strings and assigning score

Posted on 2006-11-13
Last Modified: 2010-08-05
I want to be able to search for specific characters and strings within a string and increase a score if that string is found.

For example, I want to be able to specify the strings to search for in a TEXTAREA box on an HTML page, like this:

',comma,%2C,;,semicolon,%3B:1
delete,update,insert:3
drop table,union,having:4

So each line is a group of strings, and at the end of the line is a colon followed by the score to assign if one of these strings is found. However, if multiple strings from the same group are found, I don't want them to be counted twice. Also, if a configurable number of bytes (characters) goes by without a match, I want the score to be reset to 0.

So say I have the following text:

This is a sample text. This comma has to be counted, once. Delete union but comma. What else do I want to type here having bad characters.

I would like this to be output in a format like the one below, with the part between the [] being the score at the time the string was found. For this example, let's say the number of bytes before a reset is 20.

This is a sample text. This comma[1] has to be counted, [0]once. Delete[3] union[7] but comma[8]. What else do I wa[0]nt to type here having[4] bad characters.

The times when it went back down to [0] were because 20 characters had passed without a match. Hopefully this makes sense.
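
One way to read the group format described in the question is sketched below. It is only an illustration: `parse_groups` and the `[score, group number]` layout are assumed names and choices, not something given in the thread. Keys are lowercased so that later lookups can ignore case, and the score is taken from whatever follows the last colon on the line.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "string,string,...:score" lines into a lookup table.
# Returns a hash ref mapping each lowercased string to [score, group number].
sub parse_groups {
    my ($spec) = @_;
    my %lookup;
    my $group = 0;
    for my $line (split /\r?\n/, $spec) {
        # Skip lines that do not end in ":<digits>".
        next unless $line =~ /^(.*):(\d+)\s*$/;
        my ($words, $score) = ($1, $2);
        $group++;
        $lookup{lc $_} = [$score, $group]
            for grep { length $_ } split /,/, $words;
    }
    return \%lookup;
}

my $lookup = parse_groups(
    "',comma,%2C,;,semicolon,%3B:1\n"
  . "delete,update,insert:3\n"
  . "drop table,union,having:4\n"
);

# List every string with its score and group number.
printf "%-10s score %d, group %d\n", $_, @{ $lookup->{$_} }
    for sort keys %$lookup;
```

Because a comma separates the strings, a literal comma cannot itself be a search string in this format, which is presumably why the question lists `comma` and `%2C` instead.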
Question by:mikedgibson
LVL 85

Expert Comment

by:ozo
ID: 17933307
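# Build a lookup of search string => score from the group lines under __DATA__.
for( <DATA> ){
    next unless s/:(\d+)\s*//;    # strip the trailing ":score" and capture it
    $s = $1;
    @s = split /,/;
    $s{lc $_} = $s for @s;        # lowercase the keys to match the lc() lookups below
    push @m, map "\Q$_", @s;      # escape regex metacharacters in each string
}
$re = join "|", @m;
# Either up to 19 characters followed by a search string, or 20 characters with no match.
$re = qr/\G(.{0,19}?($re)|.{20})/i;
$_='This is a sample text. This comma has to be counted, once. Delete union but comma. What else do I want to type here having bad characters.
';
$s=0;
# On a match add that string's score; after 20 unmatched characters reset to 0.
s/$re/$1 . "[" . ($s{lc $2}?$s+=$s{lc $2}:($s=0)) . "]"/eg;
print;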
__DATA__
',comma,%2C,;,semicolon,%3B:1
delete,update,insert:3
drop table,union,having:4
 
LVL 2

Author Comment

by:mikedgibson
ID: 17933568
This works great at the command line. Now how difficult would it be to make this a CGI where the sample text, groups and number of characters before resetting are all passed from a form?

Sorry, I originally wanted it for the command line, but now I figure this would work better through a web interface.
 
LVL 2

Author Comment

by:mikedgibson
ID: 17933838
Hmm, the counting seems to be a bit off. If you take the following sample:

This is a sample text. This comma has, to be counted, once. Delete update union but comma. What else do I want to type here having bad characters.

You get the following output:

This is a sample tex[0]t. This comma[1] has, to be counted,[0] once. Delete[3] update[6] union[10] but comma[11]. What else do I wan[0]t to type here having[4] bad characters.

The update right after the delete is part of the same group, so the delete should be counted but not the update.
 
LVL 85

Expert Comment

by:ozo
ID: 17934216
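$g=0;                             # group counter, one group per line of __DATA__
for( <DATA> ){
    next unless s/:(\d+)\s*//;    # strip the trailing ":score" and capture it
    $s = $1;
    @s= split/,/;
    ++$g;
    $s{lc$_}=[$s,$g] for @s;      # lowercased string => [score, group number]
    push @m,map"\Q$_",@s;         # escape regex metacharacters
}
$re=join"|",@m;
# Up to 19 characters followed by a search string, or 20 characters with no match.
$re=qr/(.{0,19}?($re)|.{20})/i;
while( <> ){
  # The score and the previous-match tracker start over for every input line.
  $s=0;
  $g=-1;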
  s/$re/(@g=@{$s{lc $2}})?($s+=$g[0]):($s=0);($p,$g)=($g,$g[0]); $1 . ($p!=$g&&"[$s]")/eg;
  print;
}
__DATA__
',comma,%2C,;,semicolon,%3B:1
delete,update,insert:3
drop table,union,having:4
 
LVL 85

Accepted Solution

by:
ozo earned 1500 total points
ID: 17934796
Sorry, that should have been
s/$re/(@g=@{$s{lc $2}})?($s+=$g[0]):($s=0);($p,$g)=($g,$g[1]); $1 . ($p!=$g&&"[$s]")/eg;
As originally written, it compared scores rather than group numbers, so two different groups were treated as the same group if they had the same score.
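
For readers who find the one-line substitution hard to follow, here is a longer sketch of the same idea: keep `[score, group number]` per string and compare group numbers, not scores. The name `annotate`, its parameters, and the sample table are assumptions made for illustration. Two behaviours are also assumptions about what the question intends: a match from the same group as the previous match is neither added nor annotated, and `[0]` is only printed when the score actually drops from a non-zero value.

```perl
use strict;
use warnings;

# Illustrative sketch. $lookup maps a lowercased search string to
# [score, group number]; $limit is how many characters may pass without a match.
sub annotate {
    my ($text, $lookup, $limit) = @_;
    return $text unless %$lookup;

    # Try longer strings first so "drop table" wins over a shorter overlap.
    my $alt = join '|', map { quotemeta($_) }
              sort { length($b) <=> length($a) } keys %$lookup;
    my $gap = $limit - 1;
    my $re  = qr/\G(.{0,$gap}?($alt)|.{$limit})/is;

    my ($out, $score, $prev_group, $end) = ('', 0, 0, 0);
    while ($text =~ /$re/g) {
        $out .= $1;
        $end = pos($text);
        if (defined $2) {
            my ($points, $group) = @{ $lookup->{lc $2} };
            next if $group == $prev_group;   # same group as the previous match
            $score += $points;
            $prev_group = $group;
        }
        else {
            $prev_group = 0;                 # $limit characters passed, no match
            next if $score == 0;             # nothing to reset
            $score = 0;
        }
        $out .= "[$score]";
    }
    return $out . substr($text, $end);       # keep the unmatched tail
}

my %lookup = (
    comma  => [1, 1],
    delete => [3, 2], update => [3, 2],
    union  => [4, 3],
);
print annotate("Delete update union", \%lookup, 20), "\n";
```

With this table the call prints `Delete[3] update union[7]`: `update` shares a group with `Delete`, so it is skipped, and `union` adds 4 to the running score.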
 
LVL 2

Author Comment

by:mikedgibson
ID: 17939631
Is there any way to not reset the counter after a new line?
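
Two things in the script above tie the counter to line boundaries: the `while( <> )` loop sets `$s=0` for every line it reads, and `.` in the pattern does not match a newline unless the `/s` modifier is given. A possible approach, offered as a sketch and not as a tested change to the accepted answer, is to read the whole input into one string (for example with `local $/;`) and add `/s` to the pattern. The snippet below only demonstrates the `/s` part; the text and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Two short lines; "union" sits on the second line, 8 characters into the text.
my $text = "abc\ndef union\n";

# Without /s, "." stops at the newline, so the 20-character window cannot reach it.
my $per_line = $text =~ /\A.{0,19}?(union)/i  ? 'found' : 'not found';

# With /s, "." also matches "\n", so the same window spans both lines.
my $spanning = $text =~ /\A.{0,19}?(union)/is ? 'found' : 'not found';

print "without /s: $per_line\n";    # without /s: not found
print "with /s:    $spanning\n";    # with /s:    found
```

Note that with this approach the newline counts as one of the characters that pass without a match.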