Solved

Create a file containing the lines that match no keyword in keyword.ini

Posted on 2011-03-19
Last Modified: 2012-05-11
Hi,

To improve accuracy, it would be nice if an additional file could be created that stores all the lines from the input file that do not match any keyword in keyword.ini.

This file will help me add new keywords to keyword.ini and improve the accuracy of the scan.
Question by:anshuma
15 Comments
 
LVL 16

Expert Comment

by:sjklein42
ID: 35174021
This may do what you want.  It reads keyword.ini and myInputFile.txt and produces two outputs: the input lines that contain at least one word not listed in keyword.ini, and a list of the words that are missing.
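#usage:   perl checkWords.pl myInputFile.txt >linesWithMissingKeywords.txt 2>missingKeywords.txt

use strict;
use warnings;

# Load every word from keyword.ini into a lowercase lookup table.
my %kw;
open(KW, "<", "keyword.ini") or die("Could not open keyword.ini: $!\n");
while ( <KW> )
{
    s/[\r\n\"]//g;
    foreach my $x (split(/\s+/)) { $kw{lc($x)} = 1; }
}
close KW;

my %missingKw;
while ( <> )
{
    my $line = $_;
    $_ = lc($line);
    s/\"\,\".*//;          # keep only the first quoted field
    s/[^a-z0-9]/ /g;       # replace everything else with spaces
    my $printIt = 0;
    foreach my $x (split(/\s+/))
    {
        # Only consider tokens that contain at least one letter.
        if ( $x =~ /[a-z]/ && ! $kw{$x} )
        {
            $missingKw{$x}++;
            $printIt = 1;
        }
    }
    if ( $printIt ) { print $line; }
}

# The sorted list of unknown words goes to STDERR.
foreach my $kw (sort(keys(%missingKw)))
{
    print STDERR "$kw\n";
}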

 

Author Comment

by:anshuma
ID: 35174030
Could you please modify Fairlight2cx's code instead? (Please see the related question.)
 
LVL 31

Expert Comment

by:farzanj
ID: 35174057
Try this one.
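local $" = "|";
$keywords = 'keywords.ini';
$to_be_filtered = 'lines';

#Read the keywords, one per line
open(INFO, $keywords) or die("Could not open $keywords: $!\n");
@lines = <INFO>;
chomp @lines;
close(INFO);

#Read the lines to be filtered
open(INFO2, $to_be_filtered) or die("Could not open $to_be_filtered: $!\n");
@lines2 = <INFO2>;
close(INFO2);

#Drop blank keywords (an empty alternative would match every line)
#and escape regex metacharacters
@lines = map { quotemeta($_) } grep { /\S/ } @lines;

#Join the keywords with "|" and keep only the lines that match none of them
$filter = "@lines";
@filtered = grep ( ! /$filter/, @lines2);

print @filtered;
print "\n";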


Author Comment

by:anshuma
ID: 35174065
Also, my input file is really a bad file, as it contains a lot of null characters (see the comment by another expert on my related question). Can you incorporate this logic as well?

"
That file is nowhere near standard CSV format.  It will `cat` like it on linux, but that's only because it's ignoring the NULL character between -every- -single- -character-.  Literally, it's got a NULL (octal notation \000) between every single character in every line.   You can see this if you have a linux box and use vim -b on the file.

Thankfully, the lines are actually lines, they just have a tonne of extra nulls in them.  Which necessitates only one extra line of code.  :)  Here's a version that works on the input file you gave me.  Note the deletion of all \000 characters.
"
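A NULL after every character is what UTF-16 text looks like when it is read as plain bytes. If that is what this file is (an assumption; the thread never confirms the encoding), an alternative to deleting \000 is to let Perl decode the file while reading it. The sketch below is only an illustration: read_utf16_lines is a hypothetical helper, the sample data is made up, and a file without a byte-order mark would need UTF-16LE or UTF-16BE named explicitly instead of UTF-16.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Read a UTF-16 file (with BOM) and return its lines as decoded strings.
# Assumption: the real input is UTF-16 with a BOM.
sub read_utf16_lines {
    my ($path) = @_;
    # :raw keeps CRLF translation away from the still-encoded bytes.
    open(my $fh, '<:raw:encoding(UTF-16)', $path)
        or die("Could not open $path: $!\n");
    my @lines;
    while (my $line = <$fh>) {
        $line =~ s/\r?\n$//;     # strip CRLF or LF after decoding
        push(@lines, $line);
    }
    close($fh);
    return @lines;
}

# Demo: write a made-up two-line sample as UTF-16LE with a BOM, then read it back.
my $sample = qq("first desc","x","US"\r\n"second desc","y","UK"\r\n);
my ($tmp, $tmpname) = tempfile(UNLINK => 1);
binmode($tmp, ':raw');
print {$tmp} "\xFF\xFE", encode('UTF-16LE', $sample);
close($tmp);

print "$_\n" for read_utf16_lines($tmpname);
```

Decoding this way also discards the byte-order mark, so nothing but the text itself reaches the later split and keyword matching.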
 
LVL 7

Accepted Solution

by:
Fairlight2cx earned 167 total points
ID: 35174115
#!/usr/bin/perl

use strict;

die("No valid input file given.\n") unless scalar(@ARGV) and -r ${ARGV}[0];
my $conf = 'keyword.ini';
my $out = 'cleanfile.txt';
my $log = 'log.txt';
my $nomatch = 'nomatch.txt';
my $in = ${ARGV}[0];
undef my %keywords;
my $linecount_total = 0;
my $linecount_match = 0;
my $linecount_skip = 0;
my $debug = 0;

open(CONF,"<${conf}") or die("Could not open ${conf}.\n");
while (my $word = <CONF>) {
     chomp(${word});
     next if ${word} =~ /^\s*$/;     # skip blank lines
     $keywords{${word}} = 0;
}
close(CONF) or die("Could not close ${conf}.\n");

open(IN,"<${in}") or die("Could not open ${in}.\n");
open(OUT,">${out}") or die("Could not open ${out}.\n");
open(LOG,">${log}") or die("Could not open ${log}.\n");
open(NM,">${nomatch}") or die("Could not open ${nomatch}.\n");

while (my $line = <IN>) {
     chomp(${line});
     $line =~ s/\000//g;
     $linecount_total++;
     my ($desc,$unknown1,$country,$citystate,$yearmonth,$year,$month);
     ($desc,$unknown1,$country,$citystate,$yearmonth) = split(/","/,${line});
     $year = substr(${yearmonth},0,4);
     $month = substr(${yearmonth},5,2);
     if (${debug}) {
          print("desc: ${desc}\n");
          print("country: ${country}\n");
          print("citystate: ${citystate}\n");
          print("yearmonth: ${yearmonth}\n");
          print("month: ${month}\n");
          print("year: ${year}\n");
     }
     my $found = 0;
     foreach my $word (keys(%keywords)) {
          if (${desc} =~ /\W${word}\W/i) {
               $keywords{${word}}++;
               $linecount_match++;
               print OUT (qq(${word},${country},${year},${month}\n));
               $found++;
               last;
          }
     }
     print NM ("${line}\n") unless ${found};
}

foreach my $word (sort(keys(%keywords))) {
     print("${word},${keywords{${word}}}\n");
}

$linecount_skip = ${linecount_total} - ${linecount_match};
print LOG ("${linecount_total}\n");
print LOG ("${linecount_match}\n");
print LOG ("${linecount_skip}\n");
close(IN) or die("Could not close ${in}.\n");
close(OUT) or die("Could not close ${out}.\n");
close(LOG) or die("Could not close ${log}.\n");
close(NM) or die("Could not close ${nomatch}.\n");
exit;
 

Author Comment

by:anshuma
ID: 35174160
Hi Fairlight,

The nomatch.txt generated by the code contains totally unreadable characters. When I open it in TextPad, it shows characters like the following:

??????????????????????????
nomatch.png
 
LVL 31

Assisted Solution

by:farzanj
farzanj earned 333 total points
ID: 35174213
Did you test this code?  Here is a commented version.  All you need to do is change the names of the files in the script.  If there is any problem, please let me know.
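use strict;

local $" = "|";

#Filenames
my $keywords = 'keywords.ini';
my $source   = 'lines.txt';

#Open the keywords file
open(INFO, $keywords) or die("Could not open $keywords: $!\n");
my @lines = <INFO>;
close(INFO);

#Open the source file
open(INFO2, $source) or die("Could not open $source: $!\n");
my @lines2 = <INFO2>;
close(INFO2);

chomp @lines;

#Strip the NULL characters; discarding every line that contains one
#would leave nothing to print for this input
s/\000//g for @lines2;

#Create a filter from the non-blank keywords, escaping regex metacharacters
@lines = map { quotemeta($_) } grep { /\S/ } @lines;
my $filter = "@lines";
my @filtered = grep ( ! /$filter/, @lines2);
print @filtered;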

 
LVL 7

Expert Comment

by:Fairlight2cx
ID: 35174233
That shouldn't happen.  Here, I've made a manual change to end-of-line handling for the nomatch.txt lines.

These -are- clean results...I've just tested the version I'm pasting below on WinXP, with the resultant nomatch.txt opening fine in gvim, notepad, and wordpad.  (I don't have textpad.)

If you get flaky results opening nomatch.txt with the version below, try something more standard, like notepad, wordpad...  It won't be the file that's at issue.



#!/usr/bin/perl

use strict;

die("No valid input file given.\n") unless scalar(@ARGV) and -r ${ARGV}[0];
my $conf = 'keyword.ini';
my $out = 'cleanfile.txt';
my $log = 'log.txt';
my $nomatch = 'nomatch.txt';
my $in = ${ARGV}[0];
undef my %keywords;
my $linecount_total = 0;
my $linecount_match = 0;
my $linecount_skip = 0;
my $debug = 0;

open(CONF,"<${conf}") or die("Could not open ${conf}.\n");
while (my $word = <CONF>) {
     chomp(${word});
     next if ${word} =~ /^\s*$/;     # skip blank lines
     $keywords{${word}} = 0;
}
close(CONF) or die("Could not close ${conf}.\n");

open(IN,"<${in}") or die("Could not open ${in}.\n");
open(OUT,">${out}") or die("Could not open ${out}.\n");
open(LOG,">${log}") or die("Could not open ${log}.\n");
open(NM,">${nomatch}") or die("Could not open ${nomatch}.\n");

while (my $line = <IN>) {
     chomp(${line});
     $line =~ s/\000//g;
     $line =~ s/\r$//g;
     $linecount_total++;
     my ($desc,$unknown1,$country,$citystate,$yearmonth,$year,$month);
     ($desc,$unknown1,$country,$citystate,$yearmonth) = split(/","/,${line});
     $year = substr(${yearmonth},0,4);
     $month = substr(${yearmonth},5,2);
     if (${debug}) {
          print("desc: ${desc}\n");
          print("country: ${country}\n");
          print("citystate: ${citystate}\n");
          print("yearmonth: ${yearmonth}\n");
          print("month: ${month}\n");
          print("year: ${year}\n");
     }
     my $found = 0;
     foreach my $word (keys(%keywords)) {
          if (${desc} =~ /\W${word}\W/i) {
               $keywords{${word}}++;
               $linecount_match++;
               print OUT (qq(${word},${country},${year},${month}\n));
               $found++;
               last;
          }
     }
     print NM ("${line}\n") unless ${found};
}

foreach my $word (sort(keys(%keywords))) {
     print("${word},${keywords{${word}}}\n");
}

$linecount_skip = ${linecount_total} - ${linecount_match};
print LOG ("${linecount_total}\n");
print LOG ("${linecount_match}\n");
print LOG ("${linecount_skip}\n");
close(IN) or die("Could not close ${in}.\n");
close(OUT) or die("Could not close ${out}.\n");
close(LOG) or die("Could not close ${log}.\n");
close(NM) or die("Could not close ${nomatch}.\n");
exit;
 

Author Comment

by:anshuma
ID: 35174241
Hi Farzanj:

This code doesn't seem to work. The input file is a 58 MB file. The last line is not printing anything.

thanks
-anshu
 

Author Comment

by:anshuma
ID: 35174260
Hi Fairlight,

This is what I see after opening in wordpad

thanks
-anshu
newnomatch.png
 
LVL 7

Expert Comment

by:Fairlight2cx
ID: 35174283
That makes no sense.  It's not being written any differently than the other files you can read.  I do see some hex 0xa3 characters throughout the file, but those were in the original file.

Which version of Windows?

I mean, you can read the other files I'm generating just fine, right?
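One thing that may be worth ruling out (a guess, not something confirmed in this thread): if the input is UTF-16 with a byte-order mark, deleting \000 still leaves the two BOM bytes FF FE at the start of the first line. nomatch.txt is the only output that copies whole input lines, so if the first line matches no keyword, those bytes would land at the very start of nomatch.txt, and some editors might then guess the wrong encoding. The sketch below checks for a leading BOM and strips it; bom_name and clean_line are hypothetical helpers and the sample bytes are made up.

```perl
use strict;
use warnings;

# Return the name of the byte-order mark at the start of $bytes, or '' if none.
sub bom_name {
    my ($bytes) = @_;
    return 'UTF-8'    if $bytes =~ /^\xEF\xBB\xBF/;
    return 'UTF-16LE' if $bytes =~ /^\xFF\xFE/;
    return 'UTF-16BE' if $bytes =~ /^\xFE\xFF/;
    return '';
}

# Remove a leading UTF-16LE BOM and all NULs from one raw input line.
sub clean_line {
    my ($line) = @_;
    $line =~ s/^\xFF\xFE//;
    $line =~ s/\000//g;
    return $line;
}

# Made-up first line of the input, as raw bytes: BOM, then "abc" with NULs.
my $first = "\xFF\xFE\"\000a\000b\000c\000\"\000";
printf("BOM before cleaning: %s\n", bom_name($first) || 'none');
printf("BOM after cleaning:  %s\n", bom_name(clean_line($first)) || 'none');
```

In the script above, the equivalent change would be one extra substitution on the first line read, for example `$line =~ s/^\xFF\xFE// if $. == 1;` next to the existing \000 deletion.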
 

Author Comment

by:anshuma
ID: 35174289
Yeah :-( the other files are perfectly fine. Anyway, right now my more immediate need is the State and City fields. Could you help me with that? If you go back to the old question, I also need the data for city and state.

Thanks a lot for all the help. This is a great learning experience for me.
 
LVL 7

Expert Comment

by:Fairlight2cx
ID: 35174308
It'd be possible to give you the combined city/state field, but your data wasn't normalised.  You have some with "city, state", and you have some with "city state" without a comma.  Splitting that out is really not a good thing to try automating with non-normalised data, unless you dictate that if there's no comma, 'x' rule is always followed.
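If such a rule were dictated, it could look like the sketch below. The rule used here is an assumption chosen only for illustration: with a comma, split on the first comma; without one, treat the last whitespace-separated word as the state. split_citystate is a hypothetical helper, and the no-comma branch would split multi-word states such as "New York" incorrectly, which is exactly the risk with non-normalised data.

```perl
use strict;
use warnings;

# Split a combined "city, state" or "city state" field into (city, state).
sub split_citystate {
    my ($field) = @_;
    $field =~ s/^\s+|\s+$//g;                    # trim surrounding whitespace
    my ($city, $state);
    if ($field =~ /,/) {
        # Comma present: everything before the first comma is the city.
        ($city, $state) = split(/\s*,\s*/, $field, 2);
    }
    elsif ($field =~ /^(.*\S)\s+(\S+)$/) {
        # No comma: assume the last word is the state.
        ($city, $state) = ($1, $2);
    }
    else {
        # Single word: no state available.
        ($city, $state) = ($field, '');
    }
    return ($city, $state);
}

foreach my $sample ('Austin, TX', 'San Jose CA', 'Boston') {
    my ($city, $state) = split_citystate($sample);
    print "$sample => city=[$city] state=[$state]\n";
}
```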
 

Author Comment

by:anshuma
ID: 35174386
Cool. I guess you will help me create a preprocessing file now :-) I will post a new question.
 
LVL 31

Assisted Solution

by:farzanj
farzanj earned 333 total points
ID: 35174966
Did you try my first version?  Please show the output from the first version and tell me what is wrong.