Solved

Create a file containing the lines that match no keyword in keyword.ini

Posted on 2011-03-19
Last Modified: 2012-05-11
Hi,

To improve the accuracy of the scan, it would be nice if an additional file could be created that stores all the lines from the input file that match no keyword from keyword.ini.

This file will help with adding new keywords to keyword.ini and so improve the accuracy of the scan.
Question by:anshuma
15 Comments
 
LVL 16

Expert Comment

by:sjklein42
ID: 35174021
This may do what you want.  It reads keyword.ini and myInputFile.txt and produces two outputs: one with the lines that contain at least one word that is not in keyword.ini, and one listing the words that are missing.
#usage:   perl checkWords.pl myInputFile.txt >linesWithMissingKeywords.txt 2>missingKeywords.txt

open KW,"<keyword.ini" or die("Could not open keyword.ini.\n");
while ( <KW> )
{
    s/[\r\n\"]//g;
    @x = split(/\s+/);
    foreach $x (@x) { $kw{lc($x)} = 1;}
}
close KW;

while ( <> )
{
    $line = $_;
    $_ = lc($line);
    s/\"\,\".*//;
    s/[^a-z0-9]/ /g;
    @x = split(/\s+/);
    $printIt = 0;
    foreach $x (@x)
    {
        if ( $x =~ /[a-z]/ )
        {
            if ( ! $kw{$x} )
            {
                $missingKw{$x}++;
                $printIt = 1;
            }
        }
    }
    if ( $printIt) { print $line; }
}

foreach $kw (sort(keys(%missingKw)))
{
    print STDERR "$kw\n";
}

 

Author Comment

by:anshuma
ID: 35174030
Could you please modify Fairlight2cx's code instead? (Please see the related question.)
 
LVL 31

Expert Comment

by:farzanj
ID: 35174057
Try this one.
local $" = "|";
$keywords = 'keywords.ini';
$to_be_filtered = 'lines';

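#Open the keywords file
open(INFO, $keywords) or die("Could not open $keywords.\n");
@lines = <INFO>;
chomp @lines;
close(INFO);

#Open the file to be filtered
open(INFO2, $to_be_filtered) or die("Could not open $to_be_filtered.\n");
@lines2 = <INFO2>;
close(INFO2);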

$filter = "@lines";
@filtered = grep ( ! /$filter/, @lines2);

print @filtered;
print "\n";


Author Comment

by:anshuma
ID: 35174065
Also, my input file is really a bad file, as it contains a lot of null characters (see the comment below, from another expert on my related question). Can you incorporate this logic as well?

"
That file is nowhere near standard CSV format.  It will `cat` like it on linux, but that's only because it's ignoring the NULL character between -every- -single- -character-.  Literally, it's got a NULL (octal notation \000) between every single character in every line.   You can see this if you have a linux box and use vim -b on the file.

Thankfully, the lines are actually lines, they just have a tonne of extra nulls in them.  Which necessitates only one extra line of code.  :)  Here's a version that works on the input file you gave me.  Note the deletion of all \000 characters.
"
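
A NULL after every character is what UTF-16 text looks like when it is read as plain bytes, so an alternative to deleting \000 is to decode the file on input. The following is only a sketch under that assumption: nothing in this thread confirms the encoding, and read_utf16_lines is a made-up helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: read a file as UTF-16 and return its lines as
# ordinary Perl strings, with no NULLs and no line terminators.
# Assumption: the file starts with a byte-order mark, which the
# "UTF-16" encoding needs in order to work out the byte order.
sub read_utf16_lines {
    my ($path) = @_;
    open(my $fh, '<:raw:encoding(UTF-16)', $path)
        or die("Could not open $path: $!\n");
    my @lines;
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;    # strip LF or CRLF
        push(@lines, $line);
    }
    close($fh) or die("Could not close $path: $!\n");
    return @lines;
}
```

If the file has no byte-order mark, the decoder will refuse it; in that case the byte order would have to be named explicitly (for example UTF-16LE), which is again a guess about this particular file.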
 
LVL 7

Accepted Solution

by:
Fairlight2cx earned 668 total points
ID: 35174115
#!/usr/bin/perl

use strict;

die("No valid input file given.\n") unless scalar(@ARGV) and -r ${ARGV}[0];
my $conf = 'keyword.ini';
my $out = 'cleanfile.txt';
my $log = 'log.txt';
my $nomatch = 'nomatch.txt';
my $in = ${ARGV}[0];
undef my %keywords;
my $linecount_total = 0;
my $linecount_match = 0;
my $linecount_skip = 0;
my $debug = 0;

open(CONF,"<${conf}") or die("Could not open ${conf}.\n");
while (my $word = <CONF>) {
     chomp(${word});
     next if ${word} =~ /^\s*$/;
     $keywords{${word}} = 0 unless ${word} =~ /^\s*$/;
}
close(CONF) or die("Could not close ${conf}.\n");

open(IN,"<${in}") or die("Could not open ${in}.\n");
open(OUT,">${out}") or die("Could not open ${out}.\n");
open(LOG,">${log}") or die("Could not open ${log}.\n");
open(NM,">${nomatch}") or die("Could not open ${nomatch}.\n");

while (my $line = <IN>) {
     chomp(${line});
     $line =~ s/\000//g;
     $linecount_total++;
     my ($desc,$unknown1,$country,$citystate,$yearmonth,$year,$month);
     ($desc,$unknown1,$country,$citystate,$yearmonth) = split(/","/,${line});
     $year = substr(${yearmonth},0,4);
     $month = substr(${yearmonth},5,2);
     if (${debug}) {
          print("desc: ${desc}\n");
          print("country: ${country}\n");
          print("citystate: ${citystate}\n");
          print("yearmonth: ${yearmonth}\n");
          print("month: ${month}\n");
          print("year: ${year}\n");
     }
     my $found = 0;
     foreach my $word (keys(%keywords)) {
          if (${desc} =~ /\W${word}\W/i) {
               $keywords{${word}}++;
               $linecount_match++;
               print OUT (qq(${word},${country},${year},${month}\n));
               $found++;
               last;
          }
     }
     print NM ("${line}\n") unless ${found};
}

foreach my $word (sort(keys(%keywords))) {
     print("${word},${keywords{${word}}}\n");
}

$linecount_skip = ${linecount_total} - ${linecount_match};
print LOG ("${linecount_total}\n");
print LOG ("${linecount_match}\n");
print LOG ("${linecount_skip}\n");
close(IN) or die("Could not close ${in}.\n");
close(OUT) or die("Could not close ${out}.\n");
close(LOG) or die("Could not close ${log}.\n");
close(NM) or die("Could not close ${nomatch}.\n");
exit;
 

Author Comment

by:anshuma
ID: 35174160
Hi Fairlight,

The nomatch.txt generated by the code contains totally unreadable characters. When I open it in TextPad, it shows characters like the following:

??????????????????????????
nomatch.png
 
LVL 31

Assisted Solution

by:farzanj
farzanj earned 1332 total points
ID: 35174213
Did you test this code?  Here is a commented version.  All you need to do is change the names of the files in the script.  If there is any problem, please let me know.
use strict;

local $" = "|";

#Filenames
my $keywords = 'keywords.ini';
my $source   = 'lines.txt';

#Open the keywords file
open(INFO, $keywords) or die("Could not open $keywords.\n");
my @lines = <INFO>;
close(INFO);

#Open the source file
open(INFO2, $source) or die("Could not open $source.\n");
my @lines2 = <INFO2>;
close(INFO2);

chomp @lines;

#Create a filter
s/\000//g for @lines2;    #Strip NULL characters before matching
my $filter = "@lines";
my @filtered = grep ( ! /$filter/, @lines2);
print @filtered;
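
One thing worth checking if a filter built this way prints nothing: a blank line in the keywords file becomes an empty alternative in the joined pattern, and an empty pattern matches every line, so grep rejects all of them. Below is a sketch of a more defensive pattern builder. The name build_filter is hypothetical, and treating each keyword as a literal, case-insensitive string is an assumption about what is wanted here.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a list of keyword lines into one regex.
# Blank lines are dropped and metacharacters are escaped, so every
# keyword is matched as a literal string (an assumption).
sub build_filter {
    my @keywords = @_;
    s/[\r\n]+\z// for @keywords;             # strip LF or CRLF
    @keywords = grep { /\S/ } @keywords;     # drop blank lines
    return unless @keywords;                 # nothing to filter on
    my $alternation = join('|', map { quotemeta($_) } @keywords);
    return qr/$alternation/i;
}

# Usage: keep only the lines that match no keyword.
my $filter = build_filter("foo\n", "\n", "a.b\r\n");
my @unmatched = grep { !defined($filter) || $_ !~ $filter }
                ("has FOO here\n", "axb only\n", "nothing\n");
print @unmatched;    # prints the "axb only" and "nothing" lines
```

Escaping with quotemeta means "a.b" only matches a literal dot; if the keywords are meant to be regular expressions, that step would have to go.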

 
LVL 7

Expert Comment

by:Fairlight2cx
ID: 35174233
That shouldn't happen.  Here, I've made a manual change to end-of-line handling for the nomatch.txt lines.

These -are- clean results...I've just tested the version I'm pasting below on WinXP, with the resultant nomatch.txt opening fine in gvim, notepad, and wordpad.  (I don't have textpad.)

If you get flaky results opening nomatch.txt with the version below, try something more standard, like notepad, wordpad...  It won't be the file that's at issue.



#!/usr/bin/perl

use strict;

die("No valid input file given.\n") unless scalar(@ARGV) and -r ${ARGV}[0];
my $conf = 'keyword.ini';
my $out = 'cleanfile.txt';
my $log = 'log.txt';
my $nomatch = 'nomatch.txt';
my $in = ${ARGV}[0];
undef my %keywords;
my $linecount_total = 0;
my $linecount_match = 0;
my $linecount_skip = 0;
my $debug = 0;

open(CONF,"<${conf}") or die("Could not open ${conf}.\n");
while (my $word = <CONF>) {
     chomp(${word});
     next if ${word} =~ /^\s*$/;
     $keywords{${word}} = 0 unless ${word} =~ /^\s*$/;
}
close(CONF) or die("Could not close ${conf}.\n");

open(IN,"<${in}") or die("Could not open ${in}.\n");
open(OUT,">${out}") or die("Could not open ${out}.\n");
open(LOG,">${log}") or die("Could not open ${log}.\n");
open(NM,">${nomatch}") or die("Could not open ${nomatch}.\n");

while (my $line = <IN>) {
     chomp(${line});
     $line =~ s/\000//g;
     $line =~ s/\r$//g;
     $linecount_total++;
     my ($desc,$unknown1,$country,$citystate,$yearmonth,$year,$month);
     ($desc,$unknown1,$country,$citystate,$yearmonth) = split(/","/,${line});
     $year = substr(${yearmonth},0,4);
     $month = substr(${yearmonth},5,2);
     if (${debug}) {
          print("desc: ${desc}\n");
          print("country: ${country}\n");
          print("citystate: ${citystate}\n");
          print("yearmonth: ${yearmonth}\n");
          print("month: ${month}\n");
          print("year: ${year}\n");
     }
     my $found = 0;
     foreach my $word (keys(%keywords)) {
          if (${desc} =~ /\W${word}\W/i) {
               $keywords{${word}}++;
               $linecount_match++;
               print OUT (qq(${word},${country},${year},${month}\n));
               $found++;
               last;
          }
     }
     print NM ("${line}\n") unless ${found};
}

foreach my $word (sort(keys(%keywords))) {
     print("${word},${keywords{${word}}}\n");
}

$linecount_skip = ${linecount_total} - ${linecount_match};
print LOG ("${linecount_total}\n");
print LOG ("${linecount_match}\n");
print LOG ("${linecount_skip}\n");
close(IN) or die("Could not close ${in}.\n");
close(OUT) or die("Could not close ${out}.\n");
close(LOG) or die("Could not close ${log}.\n");
close(NM) or die("Could not close ${nomatch}.\n");
exit;
 

Author Comment

by:anshuma
ID: 35174241
Hi Farzanj:

This code doesn't seem to work. The input file is a 58 MB file. The last line (print @filtered) does not print anything.

thanks
-anshu
 

Author Comment

by:anshuma
ID: 35174260
Hi Fairlight,

This is what I see after opening in wordpad

thanks
-anshu
newnomatch.png
 
LVL 7

Expert Comment

by:Fairlight2cx
ID: 35174283
That makes no sense.  It's not being written any differently than the other files you can read.  I do see some hex 0xa3 characters throughout the file, but those were in the original file.

Which version of Windows?

I mean, you can read the other files I'm generating just fine, right?
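
One possible explanation, offered as a guess rather than anything confirmed in this thread: if the input is UTF-16, its first two bytes are a byte-order mark (FF FE or FE FF). Deleting \000 leaves those two bytes in place, and nomatch.txt is the only output that copies whole input lines, so a non-matching first line would carry the mark into it and could lead an editor to decode the entire file as UTF-16. A sketch for checking that, with has_utf16_bom as a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: report whether a file starts with a UTF-16
# byte-order mark (FF FE for little-endian, FE FF for big-endian).
sub has_utf16_bom {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die("Could not open $path: $!\n");
    my $count = read($fh, my $head, 2);
    close($fh);
    return defined($count) && $count == 2
        && ($head eq "\xFF\xFE" || $head eq "\xFE\xFF") ? 1 : 0;
}
```

If this returns 1 for nomatch.txt, removing the mark from the first input line before it is written out (for example with $line =~ s/^\xFF\xFE// if $. == 1;) would be the thing to try.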
 

Author Comment

by:anshuma
ID: 35174289
Yeah :-( the other files are perfectly fine. Anyway, right now my more immediate need is the State and City fields. Could you help me with that? If you go back to the old question, I also need the data for city and state.

Thanks a lot for all the help. This is a great learning experience for me.
 
LVL 7

Expert Comment

by:Fairlight2cx
ID: 35174308
It'd be possible to give you the combined city/state field, but your data wasn't normalised.  You have some with "city, state", and you have some with "city state" without a comma.  Splitting that out is really not a good thing to try automating with non-normalised data, unless you dictate that if there's no comma, 'x' rule is always followed.
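
To make the "'x' rule" idea concrete, here is one possible convention, chosen purely for illustration and not taken from the data: with a comma, whatever follows the last comma is the state; without one, the last whitespace-separated word is the state. The helper name split_city_state is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: split a combined city/state field.
# Rule assumed here (not from the thread): with a comma, the text
# after the last comma is the state; without one, the last
# whitespace-separated word is the state.
sub split_city_state {
    my ($field) = @_;
    $field =~ s/^\s+|\s+$//g;                     # trim both ends
    if ($field =~ /^(.*?)\s*,\s*([^,]*)$/) {      # "city, state"
        return ($1, $2);
    }
    if ($field =~ /^(.*\S)\s+(\S+)$/) {           # "city state"
        return ($1, $2);
    }
    return ($field, '');                          # one word: no state
}

my ($city, $state) = split_city_state('Austin TX');
print("$city|$state\n");    # prints "Austin|TX"
```

The no-comma rule misfires on states written as more than one word (a field such as "Albany New York" would yield the state "York"), which is exactly the risk with non-normalised data.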
 

Author Comment

by:anshuma
ID: 35174386
Cool, I guess. You will help me in creating a preprocessing file now :-) I will post a new question
 
LVL 31

Assisted Solution

by:farzanj
farzanj earned 1332 total points
ID: 35174966
Did you try my first version?  Please show the output for the first version and tell me what is wrong.
Question has a verified solution.
