Solved

Finding strings within strings and assigning score

Posted on 2006-11-13
Last Modified: 2010-08-05
I want to be able to search for specific characters and strings within a string and increase a score if that string is found.

For example I want to be able to specify my strings to search for in one TEXTAREA box in an HTML page like this:

',comma,%2C,;,semicolon,%3B:1
delete,update,insert:3
drop table,union,having:4

So each line is a group of strings, and at the end of the line is a colon followed by the score to assign if one of these strings is found. However, if multiple strings from the same group are found, I don't want them to be counted twice. Also, if a configurable number of bytes (characters) goes by without a match, I want the score to be reset to 0.

So say I have the following text

This is a sample text. This comma has to be counted, once. Delete union but comma. What else do I want to type here having bad characters.

I would like this to then be output in a format like this, with the part between the [] being the score at the time the string was found. For this example, let's say the number of bytes before a reset is 20.

This is a sample text. This comma[1] has to be counted, [0]once. Delete[3] union[7] but comma[8]. What else do I wa[0]nt to type here having[4] bad characters.

The times when it went back down to [0] were because 20 characters had passed without a match. Hopefully this makes sense.
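
The group format described above could be parsed along these lines. This is only a sketch: `parse_groups` is a hypothetical helper name, and storing a [score, group number] pair per lowercased string is an assumption made so that later matching can be case-insensitive and can tell groups apart.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the TEXTAREA contents into a lookup table.
# Each key is a lowercased search string; each value is [score, group number].
sub parse_groups {
    my ($spec) = @_;
    my (%info, $group);
    for my $line (split /\r?\n/, $spec) {
        # The score is the number after the last colon on the line.
        next unless $line =~ /^(.*):(\d+)\s*$/;
        my ($list, $score) = ($1, $2);
        $group++;
        $info{lc $_} = [$score, $group] for grep { length } split /,/, $list;
    }
    return \%info;
}

my $info = parse_groups(
    "',comma,%2C,;,semicolon,%3B:1\ndelete,update,insert:3\ndrop table,union,having:4\n"
);
printf "%s => score %d, group %d\n", $_, @{ $info->{$_} } for sort keys %$info;
```

Because the strings are split on commas, a literal comma cannot itself be a search string in this format; the example above relies on "comma" and "%2C" instead.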
Question by:mikedgibson
 
LVL 84

Expert Comment

by:ozo
ID: 17933307
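# Read the scoring groups from __DATA__; each line is "str1,str2,...:score".
for( <DATA> ){
    next unless s/:(\d+)\s*//;    # strip ":score" (and the newline), keep the score in $1
    $s = $1;
    @s= split/,/;
    $s{lc $_}=$s for @s;          # key by lowercased string so the lc $2 lookup below finds %2C and %3B
    push @m,map"\Q$_",@s;         # escape regex metacharacters in each string
}
$re=join"|",@m;
# Either a match within the next 20 characters, or 20 characters with no match.
$re=qr/\G(.{0,19}?($re)|.{20})/i;
$_='This is a sample text. This comma has to be counted, once. Delete union but comma. What else do I want to type here having bad characters.
';
$s=0;
# On a match add that string's score; on a 20-character miss reset the score to 0.
s/$re/$1 . "[" . ($s{lc $2}?$s+=$s{lc $2}:($s=0)) . "]"/eg;
print;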
__DATA__
',comma,%2C,;,semicolon,%3B:1
delete,update,insert:3
drop table,union,having:4
 
LVL 2

Author Comment

by:mikedgibson
ID: 17933568
This works great at the command line. Now how difficult would it be to make this a CGI where the sample text, groups and number of characters before resetting are all passed from a form?

Sorry I originally wanted it for command line but now I figure this would work better through a web interface.
 
LVL 2

Author Comment

by:mikedgibson
ID: 17933838
Hmm the counting seems to be a bit off. If you take the following sample

This is a sample text. This comma has, to be counted, once. Delete update union but comma. What else do I want to type here having bad characters.

You get the following output

This is a sample tex[0]t. This comma[1] has, to be counted,[0] once. Delete[3] update[6] union[10] but comma[11]. What else do I wan[0]t to type here having[4] bad characters.

The update right after the delete should be part of the same group so the delete should be counted but not the update.
LVL 84

Expert Comment

by:ozo
ID: 17934216
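$g=0;
for( <DATA> ){
    next unless s/:(\d+)\s*//;
    $s = $1;
    @s= split/,/;
    ++$g;                          # one group number per line of __DATA__
    $s{lc$_}=[$s,$g] for @s;       # string => [score, group number]
    push @m,map"\Q$_",@s;
}
$re=join"|",@m;
$re=qr/(.{0,19}?($re)|.{20})/i;
while( <> ){
  $s=0;      # running score, reset for each input line
  $g=-1;     # group of the previous match
  # Print the score only when the group differs from the previous match's group.
  s/$re/(@g=@{$s{lc $2}})?($s+=$g[0]):($s=0);($p,$g)=($g,$g[0]); $1 . ($p!=$g&&"[$s]")/eg;
  print;
}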
__DATA__
',comma,%2C,;,semicolon,%3B:1
delete,update,insert:3
drop table,union,having:4
 
LVL 84

Accepted Solution

by:
ozo earned 500 total points
ID: 17934796
Sorry, that should have been
s/$re/(@g=@{$s{lc $2}})?($s+=$g[0]):($s=0);($p,$g)=($g,$g[1]); $1 . ($p!=$g&&"[$s]")/eg;
As it was, it would count groups as the same if they had the same score, because $g[0] holds the score and $g[1] holds the group number.
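
For readers who want the pieces in one place, here is a sketch of the same idea under `use strict`. It is not ozo's code: `annotate` is a hypothetical name, the reset distance is passed in as `$limit` rather than fixed at 20, and, unlike the one-liner above, a hit from the same group as the immediately preceding hit adds nothing to the score (one reading of "not counted twice").

```perl
use strict;
use warnings;

# Hypothetical helper. $info maps lowercased strings to [score, group number];
# $limit is how many characters may pass without a match before a reset.
sub annotate {
    my ($text, $info, $limit) = @_;
    # Longest strings first, so a longer string wins over a shorter one at the same position.
    my $alt = join '|', map { quotemeta($_) }
              sort { length($b) <=> length($a) } keys %$info;
    my $max = $limit - 1;
    my $re  = qr/\G(.{0,$max}?($alt)|.{$limit})/is;
    my ($score, $prev, $out) = (0, -1, '');
    while ($text =~ /$re/gc) {
        my ($chunk, $hit) = ($1, $2);
        my $group = 0;                          # 0 stands for "no match, reset"
        if (defined $hit) {
            my ($points, $g) = @{ $info->{lc $hit} };
            $group = $g;
            $score += $points if $group != $prev;
        } else {
            $score = 0;
        }
        $out .= $chunk;
        $out .= "[$score]" if $group != $prev;  # stay silent on a repeated group
        $prev = $group;
    }
    # Whatever is left is shorter than $limit and contains no match.
    return $out . substr($text, pos($text) // 0);
}

my %info = (
    (map { ($_ => [1, 1]) } "'", 'comma', '%2c', ';', 'semicolon', '%3b'),
    (map { ($_ => [3, 2]) } 'delete', 'update', 'insert'),
    (map { ($_ => [4, 3]) } 'drop table', 'union', 'having'),
);
my $sample = 'This is a sample text. This comma has, to be counted, once. '
           . 'Delete update union but comma. What else do I want to type here having bad characters.';
print annotate($sample, \%info, 20), "\n";
```

With the sample text from the earlier comment this should give Delete[3] update union[7] but comma[8], which is the counting the asker described. In a CGI script the three arguments would come from the form fields instead of being hard-coded.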
 
LVL 2

Author Comment

by:mikedgibson
ID: 17939631
Is there any way to not reset the counter after a new line?
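
One possible approach, assuming the whole input fits in memory: read it in a single gulp instead of line by line, and add the /s modifier so that "." also matches a newline. The score variables are then initialised once, and the character count runs across line breaks. The sketch below only demonstrates those two idioms; the in-memory filehandle and the variable names are illustrative stand-ins, not part of the accepted solution.

```perl
use strict;
use warnings;

# An in-memory filehandle stands in for STDIN or a real file here.
my $input = "delete this\nthen union\n";
open my $fh, '<', \$input or die "cannot open in-memory handle: $!";
my $text = do { local $/; <$fh> };   # undefined $/ means slurp mode: read everything at once
close $fh;

# Without /s, "." stops at a newline, so the pattern cannot span two lines.
my $spans_without_s = $text =~ /delete.{0,19}?union/i  ? 1 : 0;
# With /s, "." also matches "\n", so the distance is counted across lines.
my $spans_with_s    = $text =~ /delete.{0,19}?union/is ? 1 : 0;

print "without /s: $spans_without_s, with /s: $spans_with_s\n";
```

Applied to the script above, that would mean replacing the while( <> ) loop with a single slurped string and changing the /i on the qr// to /is.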