Solved

Finding strings within strings and assigning score

Posted on 2006-11-13
Last Modified: 2010-08-05
I want to be able to search for specific characters and strings within a string and increase a score if that string is found.

For example, I want to be able to specify the strings to search for in a TEXTAREA box on an HTML page, like this:

',comma,%2C,;,semicolon,%3B:1
delete,update,insert:3
drop table,union,having:4

So each line is a group of strings, and at the end of the line is a colon followed by the score to assign if one of these strings is found. However, if multiple strings from the same group are found, I don't want them to be counted twice. Also, if a configurable number of bytes (characters) goes by without a match, I want the score to be reset to 0.
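
The group format described above could be parsed along these lines. This is only a minimal sketch: parse_groups is a hypothetical helper name, and storing a group number next to each score is an assumption made so that strings from the same group can later be recognised.

```perl
use strict;
use warnings;

# Hypothetical helper: turn lines of the form "a,b,c:SCORE" into a lookup
# table mapping each lowercased string to [score, group number].
sub parse_groups {
    my (@lines) = @_;
    my (%rule, $group);
    for my $line (@lines) {
        # Greedy (.*) means the score is taken after the LAST colon;
        # \s*$ tolerates a trailing "\n" or "\r\n" from a form field.
        next unless $line =~ /^(.*):(\d+)\s*$/;
        my ($strings, $score) = ($1, $2);
        $group++;
        $rule{lc $_} = [$score, $group] for grep { length } split /,/, $strings;
    }
    return \%rule;
}

my $rules = parse_groups(
    "',comma,%2C,;,semicolon,%3B:1",
    "delete,update,insert:3",
    "drop table,union,having:4",
);
printf "%-10s score=%d group=%d\n", $_, @{ $rules->{$_} } for sort keys %$rules;
```

Because the list is split on commas, a literal comma cannot itself be a search string in this format, which is presumably why the sample uses %2C.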

So say I have the following text

This is a sample text. This comma has to be counted, once. Delete union but comma. What else do I want to type here having bad characters.

I would like this to be output in a format like this, with the part between the [] being the score at the time the string was found. For this example, let's say the number of bytes to reset is 20.

This is a sample text. This comma[1] has to be counted, [0]once. Delete[3] union[7] but comma[8]. What else do I wa[0]nt to type here having[4] bad characters.

The times when it went back down to [0] were because 20 characters had passed without a match. Hopefully this makes sense.
Question by:mikedgibson
6 Comments
 
LVL 84

Expert Comment

by:ozo
ID: 17933307
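# Build a lookup of search string => score from the group lines in __DATA__.
for( <DATA> ){
    next unless s/:(\d+)\s*//;      # strip the trailing ":score" (and the newline)
    $s = $1;
    @s= split/,/;
    $s{lc $_}=$s for @s;            # lowercase the key (not the score) so %2C and %3B can be looked up
    push @m,map"\Q$_",@s;           # quote regex metacharacters in each string
}
$re=join"|",@m;
# Match up to 19 characters followed by a search string, or else 20 characters with no match.
$re=qr/\G(.{0,19}?($re)|.{20})/i;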
$_='This is a sample text. This comma has to be counted, once. Delete union but comma. What else do I want to type here having bad characters.
';
$s=0;
s/$re/$1 . "[" . ($s{lc $2}?$s+=$s{lc $2}:($s=0)) . "]"/eg;
print;
__DATA__
',comma,%2C,;,semicolon,%3B:1
delete,update,insert:3
drop table,union,having:4
 
LVL 2

Author Comment

by:mikedgibson
ID: 17933568
This works great at the command line. Now how difficult would it be to make this a CGI where the sample text, groups and number of characters before resetting are all passed from a form?

Sorry, I originally wanted it for the command line, but now I figure this would work better through a web interface.
 
LVL 2

Author Comment

by:mikedgibson
ID: 17933838
Hmm, the counting seems to be a bit off. If you take the following sample:

This is a sample text. This comma has, to be counted, once. Delete update union but comma. What else do I want to type here having bad characters.

You get the following output

This is a sample tex[0]t. This comma[1] has, to be counted,[0] once. Delete[3] update[6] union[10] but comma[11]. What else do I wan[0]t to type here having[4] bad characters.

The update right after the delete is part of the same group, so the delete should be counted but not the update.
LVL 84

Expert Comment

by:ozo
ID: 17934216
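$g=0;                               # group counter: one number per line of __DATA__
for( <DATA> ){
    next unless s/:(\d+)\s*//;      # strip the trailing ":score"
    $s = $1;
    @s= split/,/;
    ++$g;
    $s{lc$_}=[$s,$g] for @s;        # string => [score, group number]
    push @m,map"\Q$_",@s;           # quote regex metacharacters in each string
}
$re=join"|",@m;
$re=qr/(.{0,19}?($re)|.{20})/i;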
while( <> ){
  $s=0;
  $g=-1;
  s/$re/(@g=@{$s{lc $2}})?($s+=$g[0]):($s=0);($p,$g)=($g,$g[0]); $1 . ($p!=$g&&"[$s]")/eg;
  print;
}
__DATA__
',comma,%2C,;,semicolon,%3B:1
delete,update,insert:3
drop table,union,having:4
 
LVL 84

Accepted Solution

by:
ozo earned 500 total points
ID: 17934796
Sorry, that should have been
s/$re/(@g=@{$s{lc $2}})?($s+=$g[0]):($s=0);($p,$g)=($g,$g[1]); $1 . ($p!=$g&&"[$s]")/eg;
As it was, it would count groups as the same if they had the same score.
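
For readers who find the substitution hard to follow, here is a longer sketch of the scoring rules as stated in the question. It is not the accepted code. The helper name annotate is hypothetical, and it rests on three assumptions read off the question's expected output: a group is counted once until the next reset, a reset clears the record of which groups were counted, and [0] is printed only when the score was above zero.

```perl
use strict;
use warnings;

# Rule table: lowercased search string => [score, group number].
# Values copied from the sample groups in the question.
my %rule = (
    (map { $_ => [1, 1] } "'", 'comma', '%2c', ';', 'semicolon', '%3b'),
    (map { $_ => [3, 2] } 'delete', 'update', 'insert'),
    (map { $_ => [4, 3] } 'drop table', 'union', 'having'),
);

# Hypothetical helper: return $text with "[score]" inserted after each
# counted string and "[0]" where $limit characters passed without a match.
sub annotate {
    my ($text, $rule, $limit) = @_;
    return $text unless %$rule;
    # Try longer strings first so a short entry cannot pre-empt a longer one.
    my $alt = join '|', map { quotemeta($_) }
              sort { length($b) <=> length($a) } keys %$rule;
    my ($out, $score, $gap, %seen) = ('', 0, 0);
    # Each iteration consumes either one search string or one other character.
    while ($text =~ /\G(?:($alt)|(.))/gis) {
        if (defined $1) {
            my ($word, $points, $group) = ($1, @{ $rule->{lc $1} });
            $gap = 0;
            $out .= $word;
            next if $seen{$group}++;        # group already counted since the last reset
            $score += $points;
            $out .= "[$score]";
        }
        else {
            $out .= $2;
            next unless ++$gap >= $limit;   # not enough unmatched characters yet
            $out .= '[0]' if $score;        # assumption: stay silent if already at zero
            ($score, $gap, %seen) = (0, 0); # reset score, gap, and counted groups
        }
    }
    return $out;
}

my $sample = 'This is a sample text. This comma has, to be counted, once. '
           . 'Delete update union but comma. What else do I want to type here '
           . 'having bad characters.';
print annotate($sample, \%rule, 20), "\n";
```

With the sample from the author's follow-up comment, the middle of the output reads Delete[3] update union[7] but comma[8]: the update is matched but adds nothing because its group was already counted. The substitution above behaves differently here. It compares only against the previous match's group and still adds the score of a repeated group to the running total.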
 
LVL 2

Author Comment

by:mikedgibson
ID: 17939631
Is there any way to not reset the counter after a new line?
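
One possible approach, offered as a suggestion rather than something from the thread: the code above reads input one line at a time with while( <> ) and sets $s=0 at the top of each iteration, and a dot in a regex does not match a newline unless the /s modifier is given. Reading the whole input as a single string and adding /s to the pattern would remove both per-line effects. The sketch below only demonstrates those two Perl behaviours; slurp is a hypothetical helper name.

```perl
use strict;
use warnings;

# Read a whole filehandle as one string rather than one line per iteration.
sub slurp {
    my ($fh) = @_;
    local $/;              # undefined record separator: one read returns everything
    return scalar <$fh>;
}

my $input = "abc\ndef\n";
open my $fh, '<', \$input or die "cannot open in-memory handle: $!";
my $text = slurp($fh);

my @per_line_dot = $text =~ /.{4}/g;     # "." stops at "\n": no 4-character run exists
my @any_char_dot = $text =~ /.{4}/gs;    # /s lets "." match "\n" too

printf "without /s: %d chunks, with /s: %d chunks\n",
    scalar @per_line_dot, scalar @any_char_dot;
```

Applied to the solution above, that would mean slurping the input once, initialising $s and $g once, and using /is instead of /i on the compiled pattern. Newlines would then count toward the 20-character limit like any other character.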
Question has a verified solution.
