Read a config file, look for keywords, and count the keywords

Hi,

I need a Perl program that can do the following:

1. Read a config file (keyword.ini) and get all the keywords from there. The keywords should be separated by a new line

e.g

"starbucks"
"apple"
"Best Buy"
"macys"
"nordstrom"


2. Read the input file, count the instances of the keywords listed in keyword.ini (not case sensitive), and generate a count report. A keyword should be counted only once per line of the input file (so if a line contains "starbucks starbucks", that is one instance only).

3. The count report should contain the following output

Keyword,Total Count



4. The code should generate another file called cleanfile.txt and should contain

keyword,country,year,month
starbucks,United States, 2010,03

Note that year and month come from the last column of the input file (2010M03 becomes year=2010 and month=03).

5. The final file should be log.txt, which should give the count of lines scanned in the input file and the count of lines that did not contain any keyword:

total_lines_scanned
lines_containing_keywords
lines_skipped

lines_skipped + lines_containing_keywords = total_lines_scanned
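
A minimal sketch of the year/month split described in item 4, assuming the last column always has the form YYYYMmm (as in 2010M03). The sub name `split_period` is made up for illustration and is not part of any answer in this thread.

```perl
use strict;
use warnings;

# Split a "YYYYMmm" period such as 2010M03 into (year, month).
# Returns an empty list when the field does not have that shape.
sub split_period {
    my ($period) = @_;
    return $period =~ /^(\d{4})M(\d{2})$/ ? ($1, $2) : ();
}

my ($year, $month) = split_period('2010M03');
print "$year,$month\n";    # prints 2010,03
```

Returning an empty list for a malformed field lets the caller decide whether to skip the line or log it, rather than writing a half-parsed date to cleanfile.txt.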

-----------------------------------------------------------------------------------------

Input File


"Macy's gift card with a Value of $10.00!!!!!","1","United States","Winn, ME","2010M03"
"NORDSTROM  $100.00 GIFT CARD","1","United States","BROWNS MILLS, NJ","2010M05"
"NORDSTROM  $100.00 GIFT CARD READ","1","United States","BROWNS MILLS, NJ","2010M06"
"NORDSTROM  $50.00 GIFT CARD","1","United States","BROWNS MILLS, NJ","2010M06"
"pc richard gift card...$500 value !!","1","United States","Saddle Brook, New Jersey","2010M04"
"RITEAID GIFT CARD value $25.00 no expiration","1","United States","elkton va","2010M10"
"Starbucks COFFEE ART 2006 GIFT CARD- ","3","United Kingdom","york, North Yorkshire","2010M09"
"Starbucks COFFEE ART 2006 GIFT CARD- ","3","United Kingdom","york, North Yorkshire","2010M10"
"Target Gift Card","1","United States","Cactus Country","2010M07"
Asked by: anshuma

Fairlight2cx commented:
#!/usr/bin/perl

use strict;

die("No valid input file given.\n") unless scalar(@ARGV) and -r ${ARGV}[0];
my $conf = 'keyword.ini';
my $out = 'cleanfile.txt';
my $log = 'log.txt';
my $in = ${ARGV}[0];
undef my %keywords;
my $linecount_total = 0;
my $linecount_match = 0;
my $linecount_skip = 0;
my $debug = 0;

open(CONF,"<${conf}") or die("Could not open ${conf}.\n");
while (my $word = <CONF>) {
     chomp(${word});
     next if ${word} =~ /^\s*$/;
     $keywords{${word}} = 0 unless ${word} =~ /^\s*$/;
}
close(CONF) or die("Could not close ${conf}.\n");

open(IN,"<${in}") or die("Could not ${conf}.\n");
open(OUT,">${out}") or die("Could not ${out}.\n");
open(LOG,">${log}") or die("Could not ${log}.\n");

while (my $line = <IN>) {
     chomp(${line});
     $linecount_total++;
     my ($desc,$unknown1,$country,$citystate,$yearmonth,$year,$month);
     ($desc,$unknown1,$country,$citystate,$yearmonth) = split(/","/,${line});
     $year = substr(${yearmonth},0,4);
     $month = substr(${yearmonth},5,2);
     if (${debug}) {
          print("desc: ${desc}\n");
          print("country: ${country}\n");
          print("citystate: ${citystate}\n");
          print("yearmonth: ${yearmonth}\n");
          print("month: ${month}\n");
          print("year: ${year}\n");
     }
     foreach my $word (keys(%keywords)) {
          if (${desc} =~ /\W${word}\W/i) {
               $keywords{${word}}++;
               $linecount_match++;
               print OUT (qq(${word},${country},${year},${month}\n));
               last;
          }
     }
}

foreach my $word (sort(keys(%keywords))) {
     print("${word}: ${keywords{${word}}}\n");
}

my $linecount_skip = ${linecount_total} - ${linecount_match};
print LOG ("${linecount_total}\n");
print LOG ("${linecount_match}\n");
print LOG ("${linecount_skip}\n");
close(IN) or die("Could not close ${in}.\n");
close(OUT) or die("Could not close ${out}.\n");
close(LOG) or die("Could not close ${log}.\n");
exit;

Fairlight2cx commented:
Oops.  Three typos.  One in output format, and a couple missing "open" words in die() statements...  Editing a bit fast...

Corrected version:

#!/usr/bin/perl

use strict;

die("No valid input file given.\n") unless scalar(@ARGV) and -r ${ARGV}[0];
my $conf = 'keyword.ini';
my $out = 'cleanfile.txt';
my $log = 'log.txt';
my $in = ${ARGV}[0];
undef my %keywords;
my $linecount_total = 0;
my $linecount_match = 0;
my $linecount_skip = 0;
my $debug = 0;

open(CONF,"<${conf}") or die("Could not open ${conf}.\n");
while (my $word = <CONF>) {
     chomp(${word});
     next if ${word} =~ /^\s*$/;
     $keywords{${word}} = 0 unless ${word} =~ /^\s*$/;
}
close(CONF) or die("Could not close ${conf}.\n");

open(IN,"<${in}") or die("Could not open ${in}.\n");
open(OUT,">${out}") or die("Could not open ${out}.\n");
open(LOG,">${log}") or die("Could not open ${log}.\n");

while (my $line = <IN>) {
     chomp(${line});
     $linecount_total++;
     my ($desc,$unknown1,$country,$citystate,$yearmonth,$year,$month);
     ($desc,$unknown1,$country,$citystate,$yearmonth) = split(/","/,${line});
     $year = substr(${yearmonth},0,4);
     $month = substr(${yearmonth},5,2);
     if (${debug}) {
          print("desc: ${desc}\n");
          print("country: ${country}\n");
          print("citystate: ${citystate}\n");
          print("yearmonth: ${yearmonth}\n");
          print("month: ${month}\n");
          print("year: ${year}\n");
     }
     foreach my $word (keys(%keywords)) {
          if (${desc} =~ /\W${word}\W/i) {
               $keywords{${word}}++;
               $linecount_match++;
               print OUT (qq(${word},${country},${year},${month}\n));
               last;
          }
     }
}

foreach my $word (sort(keys(%keywords))) {
     print("${word},${keywords{${word}}}\n");
}

$linecount_skip = ${linecount_total} - ${linecount_match};
print LOG ("${linecount_total}\n");
print LOG ("${linecount_match}\n");
print LOG ("${linecount_skip}\n");
close(IN) or die("Could not close ${in}.\n");
close(OUT) or die("Could not close ${out}.\n");
close(LOG) or die("Could not close ${log}.\n");
exit;

anshuma (Author) commented:
Hi Fairlight,

This is my output. Either the code is not working or maybe I am missing something. Is there something wrong with my keyword.ini?

keyword.ini contains the following lines

"starbucks"
"apple"
"Best Buy"
"macys"
"nordstrom"

When I run the command this is the output

C:\Perl\scripts\best_scripts>perl generatecount.pl "Listings of Gift Cards with
dates_030110_030111.csv"
"Best Buy",0
"apple",0
"macys",0
"nordstrom",0
"starbucks",0
Fairlight2cx commented:
Yeah, the quotes in your config file are not necessary.  Just list words, unquoted:

starbucks
apple
best buy
macys
nordstrom

If Macy's needs an apostrophe, use one.  If not, not.  The match is sensitive to that, and if you list Macys and Macy's separately, they'll be tested and counted separately.  No way around that unless one writes an equivalency file for combination work.
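
A rough sketch of how the config lines could be cleaned up on read, so that quoted and unquoted keywords behave the same. Both sub names (`normalize_keyword`, `fold_apostrophes`) are hypothetical and do not appear in the script above; folding apostrophes is just one possible way to treat Macys and Macy's as the same word, under the assumption that no keyword relies on an apostrophe to be distinct.

```perl
use strict;
use warnings;

# Normalise one line from keyword.ini: drop the line ending, surrounding
# whitespace and an optional pair of double quotes, then lower-case it.
sub normalize_keyword {
    my ($raw) = @_;
    $raw =~ s/\r?\n\z//;
    $raw =~ s/^\s+|\s+$//g;
    $raw =~ s/^"(.*)"$/$1/;
    return lc $raw;
}

# Hypothetical helper: remove apostrophes so "Macy's" and "Macys" compare equal.
sub fold_apostrophes {
    my ($text) = @_;
    $text =~ tr/'//d;
    return $text;
}

print normalize_keyword(qq("Best Buy"\r\n)), "\n";    # prints best buy
print fold_apostrophes("Macy's gift card"), "\n";      # prints Macys gift card
```

If the apostrophe folding is used, it would have to be applied to both the keyword and the description before matching; applying it to only one side changes nothing.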

anshuma (Author) commented:
Still no luck

C:\Perl\scripts\best_scripts>perl generatecount.pl "Listings of Gift Cards with
dates_030110_030111.csv"
apple,0
best buy,0
macys,0
nordstrom,0
starbucks,0

This is even though the input file does contain all of these keywords.


anshuma (Author) commented:
By the way, it looks like your script works for the sample data. My real data does not always have the keyword at the start of the line. The data can also look like this:

"       * SAKS FIFTH AVENUE $495 GIFT CARD - NO EXP DATE *","1","United States","Rockaway, New Jersey","2010M05"
"       AUTHENTIC APPLE $50 US iTUNES GIFT CARD CERTIFICATE","1","United States","Newington, Connecticut","2011M01"
"       $30 AMC Movie Theatre Gift Card","1","United States","ridge, NY","2010M09"
"PIZZA HUT GIFT CARD","1","United States","valley stream, NY","2010M11"
"M.A.C.      Gift Card         $100","1","United States","Northridge, CA","2010M06"
"$30 The Cheesecake Factory Gift Card","1","United States","Thanks for checking out my","2011M01"
"BURLINGTON COAT FACTORY GIFT CARD","1","United States","valley stream, NY","2010M12"
"?? AMAZON Gift Card Certificate - $110 ??","1","United States","NY","2011M02"
"""  $ 10.00 BEST BUY....GIFT CARD ""","1","United States","Central Lake,MI","2010M10"
"""  $ 10.00 TARGET....GIFT CARD ""","1","United States","Central Lake,MI","2010M10"
"$55 BEST BUY GIFT CARD Free Shipping!  $55","1","United States","Gainesville, GA","2011M01"
"Crate & Barrel Gift Card        ***$7.25***","1","United States","GA","2010M08"
"HOME DEPOT gift card $500.00","1","United States","san antonio, TX","2011M02"
"GIFT CARD","3","United Kingdom","Erdington, West Midlands","2010M12"
"??????  Lowes Gift Card $214.00 5 DAYS   ??????","1","United States","Exeter, New Hampshire","2010M11"
"$15  STARBUCKS GIFT CARD      3 DAY ","1","United States","WEST PITTSBURG, CA","2010M05"

Fairlight2cx commented:
I worked off the data provided:

[arcadia-SuSE] [~] [8:06pm]: cat infile
"Macy's gift card with a Value of $10.00!!!!!","1","United States","Winn, ME","2010M03"
"NORDSTROM  $100.00 GIFT CARD","1","United States","BROWNS MILLS, NJ","2010M05"
"NORDSTROM  $100.00 GIFT CARD READ","1","United States","BROWNS MILLS, NJ","2010M06"
"NORDSTROM  $50.00 GIFT CARD","1","United States","BROWNS MILLS, NJ","2010M06"
"pc richard gift card...$500 value !!","1","United States","Saddle Brook, New Jersey","2010M04"
"RITEAID GIFT CARD value $25.00 no expiration","1","United States","elkton va","2010M10"
"Starbucks COFFEE ART 2006 GIFT CARD- ","3","United Kingdom","york, North Yorkshire","2010M09"
"Starbucks COFFEE ART 2006 GIFT CARD- ","3","United Kingdom","york, North Yorkshire","2010M10"
"Target Gift Card","1","United States","Cactus Country","2010M07"
[arcadia-SuSE] [~] [9:21pm]: cat keyword.ini
starbucks
apple
best buy
macys
nordstrom
[arcadia-SuSE] [~] [9:21pm]: perl perltest infile
apple,0
best buy,0
macys,0
nordstrom,3
starbucks,2
[arcadia-SuSE] [~] [9:22pm]: cat cleanfile.txt
nordstrom,United States,2010,05
nordstrom,United States,2010,06
nordstrom,United States,2010,06
starbucks,United Kingdom,2010,09
starbucks,United Kingdom,2010,10
[arcadia-SuSE] [~] [9:22pm]: cat log.txt
9
5
4

The program shouldn't care about where -in- the first field the keyword occurs, as long as it's -in- the first field (prior to the first instance of ","  (literally, doublequote, comma, doublequote)).

Without having the test file you're working off of, I couldn't tell you what's failing.  It works according to your specification.  If the data doesn't match the description and example, please supply the actual data so that the program can be adjusted.
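
For illustration, here is a variant of the keyword test that does not depend on a non-word character sitting on both sides of the keyword. It is a sketch, not the script above: `keywords_in` is a made-up name, it uses `\b` instead of `\W`, and it counts every keyword found on a line (once each) rather than stopping at the first hit as the posted loop does with `last`.

```perl
use strict;
use warnings;

# Return the keywords found in a description, each at most once.
# \b lets a keyword match at the very start or end of the field, and
# \Q...\E stops characters such as "." in a keyword acting as regex syntax.
sub keywords_in {
    my ($desc, @keywords) = @_;
    return grep { $desc =~ /\b\Q$_\E\b/i } @keywords;
}

my @keywords = ('starbucks', 'apple', 'best buy');
my %count    = map { $_ => 0 } @keywords;

for my $desc ('$15  STARBUCKS GIFT CARD      3 DAY ',
              '  $ 10.00 BEST BUY....GIFT CARD ',
              'starbucks starbucks',
              'PIZZA HUT GIFT CARD') {
    $count{$_}++ for keywords_in($desc, @keywords);
}

print "$_,$count{$_}\n" for sort keys %count;
# prints:
# apple,0
# best buy,1
# starbucks,2
```

Because `grep` returns each keyword at most once per description, the "starbucks starbucks" line adds one to the count, not two.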

anshuma (Author) commented:
I am running on Windows, and it looks like it may be adding some extra characters to the file. OK, I am attaching a small input.csv. The actual file is 58 MB.


INPUT.csv

Fairlight2cx commented:
That file is nowhere near standard CSV format.  It will `cat` like it on linux, but that's only because it's ignoring the NULL character between -every- -single- -character-.  Literally, it's got a NULL (octal notation \000) between every single character in every line.   You can see this if you have a linux box and use vim -b on the file.

Thankfully, the lines are actually lines, they just have a tonne of extra nulls in them.  Which necessitates only one extra line of code.  :)  Here's a version that works on the input file you gave me.  Note the deletion of all \000 characters.


#!/usr/bin/perl

use strict;

die("No valid input file given.\n") unless scalar(@ARGV) and -r ${ARGV}[0];
my $conf = 'keyword.ini';
my $out = 'cleanfile.txt';
my $log = 'log.txt';
my $in = ${ARGV}[0];
undef my %keywords;
my $linecount_total = 0;
my $linecount_match = 0;
my $linecount_skip = 0;
my $debug = 0;

open(CONF,"<${conf}") or die("Could not open ${conf}.\n");
while (my $word = <CONF>) {
     chomp(${word});
     next if ${word} =~ /^\s*$/;
     $keywords{${word}} = 0 unless ${word} =~ /^\s*$/;
}
close(CONF) or die("Could not close ${conf}.\n");

open(IN,"<${in}") or die("Could not open ${in}.\n");
open(OUT,">${out}") or die("Could not open ${out}.\n");
open(LOG,">${log}") or die("Could not open ${log}.\n");

while (my $line = <IN>) {
     chomp(${line});
     $line =~ s/\000//g;
     $linecount_total++;
     my ($desc,$unknown1,$country,$citystate,$yearmonth,$year,$month);
     ($desc,$unknown1,$country,$citystate,$yearmonth) = split(/","/,${line});
     $year = substr(${yearmonth},0,4);
     $month = substr(${yearmonth},5,2);
     if (${debug}) {
          print("desc: ${desc}\n");
          print("country: ${country}\n");
          print("citystate: ${citystate}\n");
          print("yearmonth: ${yearmonth}\n");
          print("month: ${month}\n");
          print("year: ${year}\n");
     }
     foreach my $word (keys(%keywords)) {
          if (${desc} =~ /\W${word}\W/i) {
               $keywords{${word}}++;
               $linecount_match++;
               print OUT (qq(${word},${country},${year},${month}\n));
               last;
          }
     }
}

foreach my $word (sort(keys(%keywords))) {
     print("${word},${keywords{${word}}}\n");
}

$linecount_skip = ${linecount_total} - ${linecount_match};
print LOG ("${linecount_total}\n");
print LOG ("${linecount_match}\n");
print LOG ("${linecount_skip}\n");
close(IN) or die("Could not close ${in}.\n");
close(OUT) or die("Could not close ${out}.\n");
close(LOG) or die("Could not close ${log}.\n");
exit;
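
A NUL after every character is also what plain ASCII text looks like when it has been saved as UTF-16 and is then read as raw bytes. The thread does not confirm the file's encoding, so treat this as an assumption: if the export really is UTF-16LE, decoding it is an alternative to deleting the NULs. The sketch below uses the core Encode module; `decode_utf16le` is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: the input is UTF-16LE, possibly with a byte-order mark.
sub decode_utf16le {
    my ($bytes) = @_;
    my $text = decode('UTF-16LE', $bytes);
    $text =~ s/^\x{FEFF}//;    # drop a leading byte-order mark if present
    return $text;
}

# Build a sample the way such a file would store it, then decode it again.
my $bytes = encode('UTF-16LE', qq("Target Gift Card","1"\r\n));
my $text  = decode_utf16le($bytes);
print $text;
```

Stripping `\000` is shorter and works for ASCII-only data like the sample; decoding would matter only if the descriptions contain characters outside ASCII.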

anshuma (Author) commented:
You rock. God should be thanked for creating geniuses like you.

anshuma (Author) commented:
Can I continue this question for a couple more things, or would you like me to start a new thread? I think it would be better to continue here.

Fairlight2cx commented:
You can continue, sure...

And thanks for the compliment!

anshuma (Author) commented:
I now also need to split the city and state and write them to 'cleanfile.txt'.

"Macy's gift card with a Value of $10.00!!!!!","1","United States","Winn, ME","2010M03"

So the city will be Winn and the state will be ME. Could you modify the code for that as well?
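
One possible shape for that split, as a sketch only: it assumes the city and state are separated by the last comma in the field, which holds for "Winn, ME" but not for values in the sample data such as "elkton va" or "NY". The sub name `split_city_state` is made up for illustration.

```perl
use strict;
use warnings;

# Split "City, ST" on the last comma. With no comma the whole value is
# returned as the city and the state is left empty.
sub split_city_state {
    my ($citystate) = @_;
    my ($city, $state) = $citystate =~ /^(.*),\s*([^,]*)$/
        ? ($1, $2)
        : ($citystate, '');
    s/^\s+|\s+$//g for $city, $state;    # trim stray whitespace
    return ($city, $state);
}

my ($city, $state) = split_city_state('Winn, ME');
print "$city,$state\n";    # prints Winn,ME
```

Values without a comma come back with an empty state, so the caller can decide whether to write them to cleanfile.txt as they are or flag them.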
