Solved

Create a file that contains the lines not matching any keyword in keyword.ini

Posted on 2011-03-19
Last Modified: 2012-05-11
Hi,

To improve accuracy, it would be nice if an additional file could be created that stores all the lines from the input file that do not match any keyword in keyword.ini.

This file will help with adding new keywords to keyword.ini and so improve the accuracy of the scan.
Question by:anshuma
15 Comments
 
LVL 16

Expert Comment

by:sjklein42
ID: 35174021
This may do what you want.  It reads keyword.ini and myInputFile.txt and writes two files: one with the lines that contain at least one word not listed in keyword.ini, the other a list of the missing words.
#usage:   perl checkWords.pl myInputFile.txt >linesWithMissingKeywords.txt 2>missingKeywords.txt

open(KW,"<keyword.ini") or die("Could not open keyword.ini.\n");
while ( <KW> )
{
    s/[\r\n\"]//g;
    @x = split(/\s+/);
    foreach $x (@x) { $kw{lc($x)} = 1;}
}
close KW;

while ( <> )
{
    $line = $_;
    $_ = lc($line);
    s/\"\,\".*//;
    s/[^a-z0-9]/ /g;
    @x = split(/\s+/);
    $printIt = 0;
    foreach $x (@x)
    {
        if ( $x =~ /[a-z]/ )
        {
            if ( ! $kw{$x} )
            {
                $missingKw{$x}++;
                $printIt = 1;
            }
        }
    }
    if ( $printIt) { print $line; }
}

foreach $kw (sort(keys(%missingKw)))
{
    print STDERR "$kw\n";
}


Author Comment

by:anshuma
ID: 35174030
Could you please modify Fairlight2cx's code instead (please see the related question)?
LVL 31

Expert Comment

by:farzanj
ID: 35174057
Try this one.
local $" = "|";
$keywords = 'keywords.ini';
$to_be_filtered = 'lines';

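#Open the keywords file
open(INFO, '<', $keywords) or die("Could not open $keywords: $!\n");
@lines = <INFO>;
chomp @lines;
close(INFO);

#Open the file to be filtered
open(INFO2, '<', $to_be_filtered) or die("Could not open $to_be_filtered: $!\n");
@lines2 = <INFO2>;
close(INFO2);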

$filter = "@lines";
@filtered = grep ( ! /$filter/, @lines2);

print @filtered;
print "\n";


Author Comment

by:anshuma
ID: 35174065
Also, my input file is really a bad file, as it contains a lot of null characters (see the comment below from another expert on my related question). Can you incorporate this logic as well?

"
That file is nowhere near standard CSV format.  It will `cat` like it on linux, but that's only because it's ignoring the NULL character between -every- -single- -character-.  Literally, it's got a NULL (octal notation \000) between every single character in every line.   You can see this if you have a linux box and use vim -b on the file.

Thankfully, the lines are actually lines, they just have a tonne of extra nulls in them.  Which necessitates only one extra line of code.  :)  Here's a version that works on the input file you gave me.  Note the deletion of all \000 characters.
"
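A small sketch of what that one extra line does, using a made-up sample line rather than the real input file. A NULL after every character is what ASCII text looks like when stored as UTF-16LE; that the file is UTF-16LE is an assumption here, not something confirmed in the thread. For plain ASCII content, deleting the NULLs and decoding the bytes give the same text:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample line: ASCII text encoded as UTF-16LE,
# so every character is followed by a NULL byte.
my $raw = encode('UTF-16LE', qq("stolen laptop","x","US"\n));

# Approach from the quoted comment: delete every NULL byte.
(my $stripped = $raw) =~ s/\000//g;

# Alternative, valid only if the file really is UTF-16LE: decode it.
my $decoded = decode('UTF-16LE', $raw);

print "stripped: $stripped";
print "decoded:  $decoded";
print "same text\n" if $stripped eq $decoded;
```

The two results only agree for characters below U+0100; anything outside that range would need real decoding.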
LVL 7

Accepted Solution

by:
Fairlight2cx earned 167 total points
ID: 35174115
#!/usr/bin/perl

use strict;

die("No valid input file given.\n") unless scalar(@ARGV) and -r ${ARGV}[0];
my $conf = 'keyword.ini';
my $out = 'cleanfile.txt';
my $log = 'log.txt';
my $nomatch = 'nomatch.txt';
my $in = ${ARGV}[0];
undef my %keywords;
my $linecount_total = 0;
my $linecount_match = 0;
my $linecount_skip = 0;
my $debug = 0;

open(CONF,"<${conf}") or die("Could not open ${conf}.\n");
while (my $word = <CONF>) {
     chomp(${word});
     next if ${word} =~ /^\s*$/;
     $keywords{${word}} = 0 unless ${word} =~ /^\s*$/;
}
close(CONF) or die("Could not close ${conf}.\n");

open(IN,"<${in}") or die("Could not open ${in}.\n");
open(OUT,">${out}") or die("Could not open ${out}.\n");
open(LOG,">${log}") or die("Could not open ${log}.\n");
open(NM,">${nomatch}") or die("Could not open ${nomatch}.\n");

while (my $line = <IN>) {
     chomp(${line});
     $line =~ s/\000//g;
     $linecount_total++;
     my ($desc,$unknown1,$country,$citystate,$yearmonth,$year,$month);
     ($desc,$unknown1,$country,$citystate,$yearmonth) = split(/","/,${line});
     $year = substr(${yearmonth},0,4);
     $month = substr(${yearmonth},5,2);
     if (${debug}) {
          print("desc: ${desc}\n");
          print("country: ${country}\n");
          print("citystate: ${citystate}\n");
          print("yearmonth: ${yearmonth}\n");
          print("month: ${month}\n");
          print("year: ${year}\n");
     }
     my $found = 0;
     foreach my $word (keys(%keywords)) {
          if (${desc} =~ /(?:\A|\W)\Q${word}\E(?:\W|\z)/i) {
               $keywords{${word}}++;
               $linecount_match++;
               print OUT (qq(${word},${country},${year},${month}\n));
               $found++;
               last;
          }
     }
     print NM ("${line}\n") unless ${found};
}

foreach my $word (sort(keys(%keywords))) {
     print("${word},${keywords{${word}}}\n");
}

$linecount_skip = ${linecount_total} - ${linecount_match};
print LOG ("${linecount_total}\n");
print LOG ("${linecount_match}\n");
print LOG ("${linecount_skip}\n");
close(IN) or die("Could not close ${in}.\n");
close(OUT) or die("Could not close ${out}.\n");
close(LOG) or die("Could not close ${log}.\n");
close(NM) or die("Could not close ${nomatch}.\n");
exit;

Author Comment

by:anshuma
ID: 35174160
Hi Fairlight,

The nomatch.txt generated by the code contains totally unreadable characters. When I open it in TextPad it shows characters like the following:

??????????????????????????
nomatch.png
LVL 31

Assisted Solution

by:farzanj
farzanj earned 333 total points
ID: 35174213
Did you test this code?  Here is a commented version.  All you need to do is change the names of the files in the script.  If there is any problem, please let me know.
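use strict;
use warnings;

#Filenames
my $keywords = 'keywords.ini';
my $source   = 'lines.txt';

#Open the keywords file
open(INFO, '<', $keywords) or die("Could not open $keywords: $!\n");
my @lines = <INFO>;
close(INFO);

#Open the source file
open(INFO2, '<', $source) or die("Could not open $source: $!\n");
my @lines2 = <INFO2>;
close(INFO2);

#Strip line endings and skip blank keywords
#(an empty alternative in the filter would match every line)
s/[\r\n]+\z// for @lines;
@lines = grep { /\S/ } @lines;
die("No keywords found in $keywords\n") unless @lines;

#Delete the NULL characters rather than discarding every line that contains one
s/\000//g for @lines2;

#Create a filter: one alternation of all keywords, each matched literally
my $filter = join('|', map { quotemeta } @lines);
my @filtered = grep { !/$filter/ } @lines2;
print @filtered;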

LVL 7

Expert Comment

by:Fairlight2cx
ID: 35174233
That shouldn't happen.  Here, I've made a manual change to end-of-line handling for the nomatch.txt lines.

These -are- clean results...I've just tested the version I'm pasting below on WinXP, with the resultant nomatch.txt opening fine in gvim, notepad, and wordpad.  (I don't have textpad.)

If you get flaky results opening nomatch.txt with the version below, try something more standard, like notepad, wordpad...  It won't be the file that's at issue.



#!/usr/bin/perl

use strict;

die("No valid input file given.\n") unless scalar(@ARGV) and -r ${ARGV}[0];
my $conf = 'keyword.ini';
my $out = 'cleanfile.txt';
my $log = 'log.txt';
my $nomatch = 'nomatch.txt';
my $in = ${ARGV}[0];
undef my %keywords;
my $linecount_total = 0;
my $linecount_match = 0;
my $linecount_skip = 0;
my $debug = 0;

open(CONF,"<${conf}") or die("Could not open ${conf}.\n");
while (my $word = <CONF>) {
     chomp(${word});
     next if ${word} =~ /^\s*$/;
     $keywords{${word}} = 0 unless ${word} =~ /^\s*$/;
}
close(CONF) or die("Could not close ${conf}.\n");

open(IN,"<${in}") or die("Could not open ${in}.\n");
open(OUT,">${out}") or die("Could not open ${out}.\n");
open(LOG,">${log}") or die("Could not open ${log}.\n");
open(NM,">${nomatch}") or die("Could not open ${nomatch}.\n");

while (my $line = <IN>) {
     chomp(${line});
     $line =~ s/\000//g;
     $line =~ s/\r$//g;
     $linecount_total++;
     my ($desc,$unknown1,$country,$citystate,$yearmonth,$year,$month);
     ($desc,$unknown1,$country,$citystate,$yearmonth) = split(/","/,${line});
     $year = substr(${yearmonth},0,4);
     $month = substr(${yearmonth},5,2);
     if (${debug}) {
          print("desc: ${desc}\n");
          print("country: ${country}\n");
          print("citystate: ${citystate}\n");
          print("yearmonth: ${yearmonth}\n");
          print("month: ${month}\n");
          print("year: ${year}\n");
     }
     my $found = 0;
     foreach my $word (keys(%keywords)) {
          if (${desc} =~ /(?:\A|\W)\Q${word}\E(?:\W|\z)/i) {
               $keywords{${word}}++;
               $linecount_match++;
               print OUT (qq(${word},${country},${year},${month}\n));
               $found++;
               last;
          }
     }
     print NM ("${line}\n") unless ${found};
}

foreach my $word (sort(keys(%keywords))) {
     print("${word},${keywords{${word}}}\n");
}

$linecount_skip = ${linecount_total} - ${linecount_match};
print LOG ("${linecount_total}\n");
print LOG ("${linecount_match}\n");
print LOG ("${linecount_skip}\n");
close(IN) or die("Could not close ${in}.\n");
close(OUT) or die("Could not close ${out}.\n");
close(LOG) or die("Could not close ${log}.\n");
close(NM) or die("Could not close ${nomatch}.\n");
exit;

Author Comment

by:anshuma
ID: 35174241
Hi Farzanj:

This code doesn't seem to work. The input file is 58 MB, and the final print statement outputs nothing.

thanks
-anshu
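A possible cause, offered as a guess rather than something confirmed in this thread: a filter built by joining keywords with "|" matches every line if the keyword file contains a blank line, because the empty alternative matches anything, and grep then keeps nothing. Filtering with ! /\000/ also throws away every line of an input that has a NULL in each line. The sketch below builds the filter more defensively; build_filter and the sample data are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# build_filter is a hypothetical helper: it turns a list of keyword
# lines into a single case-insensitive regex, or returns nothing
# when no usable keyword is left.
sub build_filter {
    my @keywords = @_;
    s/[\r\n]+\z// for @keywords;            # drop LF or CRLF line endings
    @keywords = grep { /\S/ } @keywords;    # skip blank lines
    return unless @keywords;
    # quotemeta makes characters such as + or . match literally
    my $alternation = join('|', map { quotemeta } @keywords);
    return qr/$alternation/i;
}

# Made-up stand-ins for keyword.ini and the input file
my @keywords = ("laptop\n", "\n", "c++\n");
my @input    = ("stolen Laptop\n", "lost wallet\n", "learning C++\n");

my $filter  = build_filter(@keywords);
my @nomatch = $filter ? (grep { !/$filter/ } @input) : @input;
print @nomatch;    # prints "lost wallet"
```

With the blank keyword line left in, the joined pattern would contain an empty alternative and @nomatch would come out empty.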

Author Comment

by:anshuma
ID: 35174260
Hi Fairlight,

This is what I see after opening in wordpad

thanks
-anshu
newnomatch.png
LVL 7

Expert Comment

by:Fairlight2cx
ID: 35174283
That makes no sense.  It's not being written any differently than the other files you can read.  I do see some hex 0xa3 characters throughout the file, but those were in the original file.

Which version of Windows?

I mean, you can read the other files I'm generating just fine, right?
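One thing that might be worth checking, purely as an assumption since nobody in this thread inspected the bytes: if the input is UTF-16, it may begin with a byte-order mark (FF FE or FE FF). Deleting only the NULLs leaves those two bytes in place, and if the first input line ends up in nomatch.txt, some editors may take the leading mark as a sign that the whole file is UTF-16 and display garbage. A sketch of a line cleaner that also drops the mark; clean_line is a hypothetical helper, not part of the scripts above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# clean_line is a hypothetical helper: it removes a leading UTF-16
# byte-order mark, all NULL bytes, and a trailing LF or CRLF.
sub clean_line {
    my ($line) = @_;
    $line =~ s/\A(?:\xFF\xFE|\xFE\xFF)//;   # byte-order mark, if present
    $line =~ s/\000//g;                     # NULL after every character
    $line =~ s/\r?\n?\z//;                  # Windows or Unix line ending
    return $line;
}

# Made-up first line of a UTF-16LE file: mark, then char/NULL pairs
my $text  = qq("lost keys","x","US"\r\n);
my $first = "\xFF\xFE" . join('', map { "$_\000" } split(//, $text));

print clean_line($first), "\n";   # prints "lost keys","x","US"
```

Only the first line of a file can carry the mark, so the first substitution is a no-op on every later line.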

Author Comment

by:anshuma
ID: 35174289
Yeah :-( the other files are perfectly fine. Anyway, right now my more immediate need is the State and City fields. Could you help me with that? If you go back to the old question, I also need the data for city and state.

Thanks a lot for all the help. This is a great learning experience for me.
LVL 7

Expert Comment

by:Fairlight2cx
ID: 35174308
It'd be possible to give you the combined city/state field, but your data wasn't normalised.  You have some with "city, state", and you have some with "city state" without a comma.  Splitting that out is really not a good thing to try automating with non-normalised data, unless you dictate that if there's no comma, 'x' rule is always followed.
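To illustrate what dictating a rule could look like, here is a minimal sketch. The rule itself is an assumption chosen for the example: with a comma, split on the comma; without one, treat the last word as the state. split_city_state and the sample values are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# split_city_state is a hypothetical helper. Assumed rule: split on
# the comma when there is one, otherwise take the last word as state.
sub split_city_state {
    my ($citystate) = @_;
    $citystate =~ s/^\s+|\s+$//g;                 # trim both ends
    if ($citystate =~ /^(.*?)\s*,\s*(.*)$/) {     # "city, state"
        return ($1, $2);
    }
    if ($citystate =~ /^(.+)\s+(\S+)$/) {         # "city state"
        return ($1, $2);
    }
    return ($citystate, '');                      # single word: no state
}

# Made-up sample values
for my $value ('Austin, TX', 'New York NY', 'Paris') {
    my ($city, $state) = split_city_state($value);
    print "city=[$city] state=[$state]\n";
}
```

The no-comma rule misfires on multi-word states ("Raleigh North Carolina" would split into "Raleigh North" and "Carolina"), which is the kind of risk described above for non-normalised data.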

Author Comment

by:anshuma
ID: 35174386
Cool, I guess. You will help me in creating a preprocessing file now :-) I will post a new question
LVL 31

Assisted Solution

by:farzanj
farzanj earned 333 total points
ID: 35174966
Did you try my first version?  Please show the output for the first version and tell me what is wrong.