Solved

Finding strings within strings and assigning score

Posted on 2006-11-13
6
157 Views
Last Modified: 2010-08-05
I want to be able to search for specific characters and strings within a string and increase a score if that string is found.

For example I want to be able to specify my strings to search for in one TEXTAREA box in an HTML page like this:

',comma,%2C,;,semicolon,%3B:1
delete,update,insert:3
drop table,union,having:4

So each line is a group of strings and at the end of the line is a colon follow by the score to assign if one of these strings is found. However if multiple strings from the same group are found I don't want them to be counted twice. Also if so many bytes (characters) goes by (configurable) without a match I want the score to be reset to 0.

So say I have the following text

This is a sample text. This comma has to be counted, once. Delete union but comma. What else do I want to type here having bad characters.

I would like this to be then output in a format like this with the part between the [] being the score at the time the string was found. For this example lets say the number of bytes to reset is 20.

This is a sample text. This comma[1] has to be counted, [0]once. Delete[3] union[7] but comma[8]. What else do I wa[0]nt to type here having[4] bad characters.

The times when it went back down to [0] were because 20 characters had passed without a match. Hopefully this makes sense.
0
Comment
Question by:mikedgibson
  • 3
  • 3
6 Comments
 
LVL 84

Expert Comment

by:ozo
ID: 17933307
for( <DATA> ){
    next unless s/:(\d+)\s*//;
    $s = $1;
    @s= split/,/;
    $s{$_}=lc $s for @s;
    push @m,map"\Q$_",@s;
}
$re=join"|",@m;
$re=qr/\G(.{0,19}?($re)|.{20})/i;
$_='This is a sample text. This comma has to be counted, once. Delete union but comma. What else do I want to type here having bad characters.
';
$s=0;
s/$re/$1 . "[" . ($s{lc $2}?$s+=$s{lc $2}:($s=0)) . "]"/eg;
print;
__DATA__
',comma,%2C,;,semicolon,%3B:1
delete,update,insert:3
drop table,union,having:4
0
 
LVL 2

Author Comment

by:mikedgibson
ID: 17933568
This works great at the command line. Now how difficult would it be to make this a CGI where the sample text, groups and number of characters before resetting are all passed from a form?

Sorry I originally wanted it for command line but now I figure this would work better through a web interface.
0
 
LVL 2

Author Comment

by:mikedgibson
ID: 17933838
Hmm the counting seems to be a bit off. If you take the following sample

This is a sample text. This comma has, to be counted, once. Delete update union but comma. What else do I want to type here having bad characters.

You get the following output

This is a sample tex[0]t. This comma[1] has, to be counted,[0] once. Delete[3] update[6] union[10] but comma[11]. What else do I wan[0]t to type here having[4] bad characters.

The update right after the delete should be part of the same group so the delete should be counted but not the update.
0
Free Tool: IP Lookup

Get more info about an IP address or domain name, such as organization, abuse contacts and geolocation.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

 
LVL 84

Expert Comment

by:ozo
ID: 17934216
$g=0;
for( <DATA> ){
    next unless s/:(\d+)\s*//;
    $s = $1;
    @s= split/,/;
    ++$g;
    $s{lc$_}=[$s,$g] for @s;
    push @m,map"\Q$_",@s;
}
$re=join"|",@m;
$re=qr/(.{0,19}?($re)|.{20})/i;
while( <> ){
  $s=0;
  $g=-1;
  s/$re/(@g=@{$s{lc $2}})?($s+=$g[0]):($s=0);($p,$g)=($g,$g[0]); $1 . ($p!=$g&&"[$s]")/eg;
  print;
}
__DATA__
',comma,%2C,;,semicolon,%3B:1
delete,update,insert:3
drop table,union,having:4
0
 
LVL 84

Accepted Solution

by:
ozo earned 500 total points
ID: 17934796
Sorry, that should have been
s/$re/(@g=@{$s{lc $2}})?($s+=$g[0]):($s=0);($p,$g)=($g,$g[1]); $1 . ($p!=$g&&"[$s]")/eg;
As it was, it would count groups as the same if they had the same score
0
 
LVL 2

Author Comment

by:mikedgibson
ID: 17939631
Is there any way to not reset the counter after a new line?
0

Featured Post

Free Tool: Postgres Monitoring System

A PHP and Perl based system to collect and display usage statistics from PostgreSQL databases.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
ppm conversion to curl on a module install 8 86
Executing multiple sybase statements in perl dbi 2 94
Export Variables in Perl 3 85
Perl Untar File 1 54
There are many situations when we need to display the data in sorted order. For example: Student details by name or by rank or by total marks etc. If you are working on data driven based projects then you will use sorting techniques very frequently.…
In the distant past (last year) I hacked together a little toy that would allow a couple of Manager types to query, preview, and extract data from a number of MongoDB instances, to their tool of choice: Excel (http://dilbert.com/strips/comic/2007-08…
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: T…

860 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question