Solved

Read a config file, look for keywords, and count the keywords

Posted on 2011-03-19
Last Modified: 2012-06-21
Hi,

I need a Perl program that can do the following:

1. Read a config file (keyword.ini) and get all the keywords from it. The keywords are separated by newlines,

e.g.

"starbucks"
"apple"
"Best Buy"
"macys"
"nordstrom"


2. Read the input file, count the occurrences of the keywords listed in the config file (keyword.ini), ignoring case, and generate a count report. A keyword should be counted at most once per line of the input file (so a line containing "starbucks starbucks" counts as one instance only).

3. The count report should contain the following output:

Keyword,Total Count



4. The code should also generate a file called cleanfile.txt containing:

keyword,country,year,month
starbucks,United States, 2010,03

Note that year and month come from the last column of the input file (2010M03 becomes year=2010 and month=03).

5. The final file should be log.txt, which gives the number of lines scanned in the input file and the number of lines that did not contain any keyword:

total_lines_scanned
lines_containing_keywords
lines_skipped

lines_skipped + lines_containing_keywords = total_lines_scanned

-----------------------------------------------------------------------------------------

Input File


"Macy's gift card with a Value of $10.00!!!!!","1","United States","Winn, ME","2010M03"
"NORDSTROM  $100.00 GIFT CARD","1","United States","BROWNS MILLS, NJ","2010M05"
"NORDSTROM  $100.00 GIFT CARD READ","1","United States","BROWNS MILLS, NJ","2010M06"
"NORDSTROM  $50.00 GIFT CARD","1","United States","BROWNS MILLS, NJ","2010M06"
"pc richard gift card...$500 value !!","1","United States","Saddle Brook, New Jersey","2010M04"
"RITEAID GIFT CARD value $25.00 no expiration","1","United States","elkton va","2010M10"
"Starbucks COFFEE ART 2006 GIFT CARD- ","3","United Kingdom","york, North Yorkshire","2010M09"
"Starbucks COFFEE ART 2006 GIFT CARD- ","3","United Kingdom","york, North Yorkshire","2010M10"
"Target Gift Card","1","United States","Cactus Country","2010M07"
Question by:anshuma
 
LVL 7

Expert Comment

by:Fairlight2cx
#!/usr/bin/perl

use strict;

die("No valid input file given.\n") unless scalar(@ARGV) and -r ${ARGV}[0];
my $conf = 'keyword.ini';
my $out = 'cleanfile.txt';
my $log = 'log.txt';
my $in = ${ARGV}[0];
undef my %keywords;
my $linecount_total = 0;
my $linecount_match = 0;
my $linecount_skip = 0;
my $debug = 0;

open(CONF,"<${conf}") or die("Could not open ${conf}.\n");
while (my $word = <CONF>) {
     chomp(${word});
     next if ${word} =~ /^\s*$/;
     $keywords{${word}} = 0 unless ${word} =~ /^\s*$/;
}
close(CONF) or die("Could not close ${conf}.\n");

open(IN,"<${in}") or die("Could not ${conf}.\n");
open(OUT,">${out}") or die("Could not ${out}.\n");
open(LOG,">${log}") or die("Could not ${log}.\n");

while (my $line = <IN>) {
     chomp(${line});
     $linecount_total++;
     my ($desc,$unknown1,$country,$citystate,$yearmonth,$year,$month);
     ($desc,$unknown1,$country,$citystate,$yearmonth) = split(/","/,${line});
     $year = substr(${yearmonth},0,4);
     $month = substr(${yearmonth},5,2);
     if (${debug}) {
          print("desc: ${desc}\n");
          print("country: ${country}\n");
          print("citystate: ${citystate}\n");
          print("yearmonth: ${yearmonth}\n");
          print("month: ${month}\n");
          print("year: ${year}\n");
     }
     foreach my $word (keys(%keywords)) {
          if (${desc} =~ /\W${word}\W/i) {
               $keywords{${word}}++;
               $linecount_match++;
               print OUT (qq(${word},${country},${year},${month}\n));
               last;
          }
     }
}

foreach my $word (sort(keys(%keywords))) {
     print("${word}: ${keywords{${word}}}\n");
}

$linecount_skip = ${linecount_total} - ${linecount_match};
print LOG ("${linecount_total}\n");
print LOG ("${linecount_match}\n");
print LOG ("${linecount_skip}\n");
close(IN) or die("Could not close ${in}.\n");
close(OUT) or die("Could not close ${out}.\n");
close(LOG) or die("Could not close ${log}.\n");
exit;
 
LVL 7

Expert Comment

by:Fairlight2cx
Oops.  Three typos.  One in output format, and a couple missing "open" words in die() statements...  Editing a bit fast...

Corrected version:

#!/usr/bin/perl

use strict;

die("No valid input file given.\n") unless scalar(@ARGV) and -r ${ARGV}[0];
my $conf = 'keyword.ini';
my $out = 'cleanfile.txt';
my $log = 'log.txt';
my $in = ${ARGV}[0];
undef my %keywords;
my $linecount_total = 0;
my $linecount_match = 0;
my $linecount_skip = 0;
my $debug = 0;

open(CONF,"<${conf}") or die("Could not open ${conf}.\n");
while (my $word = <CONF>) {
     chomp(${word});
     next if ${word} =~ /^\s*$/;
     $keywords{${word}} = 0 unless ${word} =~ /^\s*$/;
}
close(CONF) or die("Could not close ${conf}.\n");

open(IN,"<${in}") or die("Could not open ${in}.\n");
open(OUT,">${out}") or die("Could not open ${out}.\n");
open(LOG,">${log}") or die("Could not open ${log}.\n");

while (my $line = <IN>) {
     chomp(${line});
     $linecount_total++;
     my ($desc,$unknown1,$country,$citystate,$yearmonth,$year,$month);
     ($desc,$unknown1,$country,$citystate,$yearmonth) = split(/","/,${line});
     $year = substr(${yearmonth},0,4);
     $month = substr(${yearmonth},5,2);
     if (${debug}) {
          print("desc: ${desc}\n");
          print("country: ${country}\n");
          print("citystate: ${citystate}\n");
          print("yearmonth: ${yearmonth}\n");
          print("month: ${month}\n");
          print("year: ${year}\n");
     }
     foreach my $word (keys(%keywords)) {
          if (${desc} =~ /\W${word}\W/i) {
               $keywords{${word}}++;
               $linecount_match++;
               print OUT (qq(${word},${country},${year},${month}\n));
               last;
          }
     }
}

foreach my $word (sort(keys(%keywords))) {
     print("${word},${keywords{${word}}}\n");
}

$linecount_skip = ${linecount_total} - ${linecount_match};
print LOG ("${linecount_total}\n");
print LOG ("${linecount_match}\n");
print LOG ("${linecount_skip}\n");
close(IN) or die("Could not close ${in}.\n");
close(OUT) or die("Could not close ${out}.\n");
close(LOG) or die("Could not close ${log}.\n");
exit;
 

Author Comment

by:anshuma
Hi Fairlight,

This is my output. Either the code is not working or maybe I am missing something. Is there something wrong with my keyword.ini?

keyword.ini contains the following lines

"starbucks"
"apple"
"Best Buy"
"macys"
"nordstrom"

When I run the command, this is the output:

C:\Perl\scripts\best_scripts>perl generatecount.pl "Listings of Gift Cards with
dates_030110_030111.csv"
"Best Buy",0
"apple",0
"macys",0
"nordstrom",0
"starbucks",0
 
LVL 7

Expert Comment

by:Fairlight2cx
Yeah, the quotes in your config file are not necessary.  Just list words, unquoted:

starbucks
apple
best buy
macys
nordstrom

If Macy's needs an apostrophe, use one.  If not, not.  It's going to be sensitive to that, and if you list Macys and Macy's separately, they'll be tested and counted separately.  No way around that unless one writes an equivalency file for combination work.
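
The quoting and equivalency points above can be sketched as follows. This is only an illustration, not part of the posted script: normalize_keyword, canonical_keyword and the %alias table are hypothetical names, and the single alias entry is an assumed example. The idea is that each line of keyword.ini is trimmed, stripped of one pair of surrounding double quotes and lowercased, and that variant spellings are folded onto one canonical key for counting.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: trim whitespace (including a trailing \r from a
# Windows-edited file), drop one pair of surrounding double quotes,
# and lowercase the result.
sub normalize_keyword {
    my ($word) = @_;
    $word =~ s/^\s+|\s+$//g;
    $word =~ s/^"(.*)"$/$1/;
    return lc $word;
}

# Hypothetical equivalency table: variant spelling => canonical keyword.
my %alias = ( "macy's" => 'macys' );

# Return the key under which a keyword should be counted.
sub canonical_keyword {
    my ($word) = @_;
    my $key = normalize_keyword($word);
    return exists $alias{$key} ? $alias{$key} : $key;
}

print canonical_keyword($_), "\n" for ('"Best Buy"', "Macy's", '  starbucks ');
```

With something like this, the script could still test each listed spelling against the description but increment the count stored under canonical_keyword($word), so that Macys and Macy's end up in one total.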
 

Author Comment

by:anshuma
Still no luck

C:\Perl\scripts\best_scripts>perl generatecount.pl "Listings of Gift Cards with
dates_030110_030111.csv"
apple,0
best buy,0
macys,0
nordstrom,0
starbucks,0

even though the input file does contain all these keywords

 

Author Comment

by:anshuma
By the way, it looks like your script works for the sample data. My real data does not always have the keyword at the start of the line. The data can also look like this:

"       * SAKS FIFTH AVENUE $495 GIFT CARD - NO EXP DATE *","1","United States","Rockaway, New Jersey","2010M05"
"       AUTHENTIC APPLE $50 US iTUNES GIFT CARD CERTIFICATE","1","United States","Newington, Connecticut","2011M01"
"       $30 AMC Movie Theatre Gift Card","1","United States","ridge, NY","2010M09"
"PIZZA HUT GIFT CARD","1","United States","valley stream, NY","2010M11"
"M.A.C.      Gift Card         $100","1","United States","Northridge, CA","2010M06"
"$30 The Cheesecake Factory Gift Card","1","United States","Thanks for checking out my","2011M01"
"BURLINGTON COAT FACTORY GIFT CARD","1","United States","valley stream, NY","2010M12"
"?? AMAZON Gift Card Certificate - $110 ??","1","United States","NY","2011M02"
"""  $ 10.00 BEST BUY....GIFT CARD ""","1","United States","Central Lake,MI","2010M10"
"""  $ 10.00 TARGET....GIFT CARD ""","1","United States","Central Lake,MI","2010M10"
"$55 BEST BUY GIFT CARD Free Shipping!  $55","1","United States","Gainesville, GA","2011M01"
"Crate & Barrel Gift Card        ***$7.25***","1","United States","GA","2010M08"
"HOME DEPOT gift card $500.00","1","United States","san antonio, TX","2011M02"
"GIFT CARD","3","United Kingdom","Erdington, West Midlands","2010M12"
"??????  Lowes Gift Card $214.00 5 DAYS   ??????","1","United States","Exeter, New Hampshire","2010M11"
"$15  STARBUCKS GIFT CARD      3 DAY ","1","United States","WEST PITTSBURG, CA","2010M05"

 
LVL 7

Expert Comment

by:Fairlight2cx
I worked off the data provided:

[arcadia-SuSE] [~] [8:06pm]: cat infile
"Macy's gift card with a Value of $10.00!!!!!","1","United States","Winn, ME","2010M03"
"NORDSTROM  $100.00 GIFT CARD","1","United States","BROWNS MILLS, NJ","2010M05"
"NORDSTROM  $100.00 GIFT CARD READ","1","United States","BROWNS MILLS, NJ","2010M06"
"NORDSTROM  $50.00 GIFT CARD","1","United States","BROWNS MILLS, NJ","2010M06"
"pc richard gift card...$500 value !!","1","United States","Saddle Brook, New Jersey","2010M04"
"RITEAID GIFT CARD value $25.00 no expiration","1","United States","elkton va","2010M10"
"Starbucks COFFEE ART 2006 GIFT CARD- ","3","United Kingdom","york, North Yorkshire","2010M09"
"Starbucks COFFEE ART 2006 GIFT CARD- ","3","United Kingdom","york, North Yorkshire","2010M10"
"Target Gift Card","1","United States","Cactus Country","2010M07"
[arcadia-SuSE] [~] [9:21pm]: cat keyword.ini
starbucks
apple
best buy
macys
nordstrom
[arcadia-SuSE] [~] [9:21pm]: perl perltest infile
apple,0
best buy,0
macys,0
nordstrom,3
starbucks,2
[arcadia-SuSE] [~] [9:22pm]: cat cleanfile.txt
nordstrom,United States,2010,05
nordstrom,United States,2010,06
nordstrom,United States,2010,06
starbucks,United Kingdom,2010,09
starbucks,United Kingdom,2010,10
[arcadia-SuSE] [~] [9:22pm]: cat log.txt
9
5
4

The program shouldn't care about where -in- the first field the keyword occurs, as long as it's -in- the first field (prior to the first instance of ","  (literally: doublequote, comma, doublequote)).

Without having the test file you're working off of, I couldn't tell you what's failing.  It works according to your specification.  If the data doesn't match the description and example, please supply the actual data so that the program can be adjusted.
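
One detail of the match is worth spelling out. The pattern in the script, /\W${word}\W/i, needs a non-word character on both sides of the keyword. A keyword at the very start of the first field is still found only because split(/","/) leaves the opening double quote attached to $desc, and a keyword at the very end of the field with nothing after it would not match. A minimal sketch of an alternative, assuming whole-word matching is what is wanted: \b word boundaries, plus \Q...\E so that any regex metacharacters in a keyword are taken literally. contains_keyword is a hypothetical helper, not part of the posted script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $word occurs as a whole word in $text
# (case-insensitive), 0 otherwise.
sub contains_keyword {
    my ($text, $word) = @_;
    return $text =~ /\b\Q$word\E\b/i ? 1 : 0;
}

print contains_keyword('$15  STARBUCKS GIFT CARD      3 DAY ', 'starbucks'), "\n";  # 1
print contains_keyword('GIFT CARD FROM STARBUCKS', 'starbucks'), "\n";              # 1
print contains_keyword('PINEAPPLE GIFT CARD', 'apple'), "\n";                       # 0
```

Note that \b assumes the keyword itself starts and ends with a word character, which holds for the keywords listed in this thread.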
 

Author Comment

by:anshuma
I am running on Windows; it looks like it may be adding some extra characters to the file. OK, I am attaching a small input.csv. The actual file is 58 MB.


INPUT.csv
 
LVL 7

Accepted Solution

by:Fairlight2cx (earned 500 total points)
That file is nowhere near standard CSV format.  It will `cat` like it on linux, but that's only because it's ignoring the NULL character between -every- -single- -character-.  Literally, it's got a NULL (octal notation \000) between every single character in every line.   You can see this if you have a linux box and use vim -b on the file.

Thankfully, the lines are actually lines, they just have a tonne of extra nulls in them.  Which necessitates only one extra line of code.  :)  Here's a version that works on the input file you gave me.  Note the deletion of all \000 characters.


#!/usr/bin/perl

use strict;

die("No valid input file given.\n") unless scalar(@ARGV) and -r ${ARGV}[0];
my $conf = 'keyword.ini';
my $out = 'cleanfile.txt';
my $log = 'log.txt';
my $in = ${ARGV}[0];
undef my %keywords;
my $linecount_total = 0;
my $linecount_match = 0;
my $linecount_skip = 0;
my $debug = 0;

open(CONF,"<${conf}") or die("Could not open ${conf}.\n");
while (my $word = <CONF>) {
     chomp(${word});
     # Skip blank lines; every other line is a keyword with an initial count of 0.
     next if ${word} =~ /^\s*$/;
     $keywords{${word}} = 0;
}
close(CONF) or die("Could not close ${conf}.\n");

open(IN,"<${in}") or die("Could not open ${in}.\n");
open(OUT,">${out}") or die("Could not open ${out}.\n");
open(LOG,">${log}") or die("Could not open ${log}.\n");

while (my $line = <IN>) {
     chomp(${line});
     $line =~ s/\000//g;
     $linecount_total++;
     my ($desc,$unknown1,$country,$citystate,$yearmonth,$year,$month);
     ($desc,$unknown1,$country,$citystate,$yearmonth) = split(/","/,${line});
     $year = substr(${yearmonth},0,4);
     $month = substr(${yearmonth},5,2);
     if (${debug}) {
          print("desc: ${desc}\n");
          print("country: ${country}\n");
          print("citystate: ${citystate}\n");
          print("yearmonth: ${yearmonth}\n");
          print("month: ${month}\n");
          print("year: ${year}\n");
     }
     foreach my $word (keys(%keywords)) {
          if (${desc} =~ /\W${word}\W/i) {
               $keywords{${word}}++;
               $linecount_match++;
               print OUT (qq(${word},${country},${year},${month}\n));
               last;
          }
     }
}

foreach my $word (sort(keys(%keywords))) {
     print("${word},${keywords{${word}}}\n");
}

$linecount_skip = ${linecount_total} - ${linecount_match};
print LOG ("${linecount_total}\n");
print LOG ("${linecount_match}\n");
print LOG ("${linecount_skip}\n");
close(IN) or die("Could not close ${in}.\n");
close(OUT) or die("Could not close ${out}.\n");
close(LOG) or die("Could not close ${log}.\n");
exit;
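
A NUL byte after every character is the usual signature of text saved as UTF-16, which is a common export encoding on Windows. The attachment itself is not in this thread, so that is an assumption about this file. If it holds, an alternative to deleting the \000 bytes is to decode the data. A minimal sketch using the core Encode module, where decode_utf16le_line is a hypothetical helper and the sample line is re-encoded here only to simulate such a file:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode raw UTF-16LE bytes into a Perl string,
# then drop a leading byte order mark and the trailing line ending.
sub decode_utf16le_line {
    my ($bytes) = @_;
    my $text = decode('UTF-16LE', $bytes);
    $text =~ s/^\x{FEFF}//;     # byte order mark, if present
    $text =~ s/\r?\n\z//;       # CRLF or LF line ending
    return $text;
}

# Simulate one line as it might be stored in such a file (assumed encoding).
my $raw = encode('UTF-16LE',
    qq("Target Gift Card","1","United States","Cactus Country","2010M07"\r\n));
print decode_utf16le_line($raw), "\n";
```

Reading the file through an encoding layer in open, for example '<:raw:encoding(UTF-16LE)', would apply the same decoding per handle. Whether the real file is little-endian and carries a byte order mark would need to be checked first. Stripping NULs works for plain ASCII data but would mangle any character outside that range.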
 

Author Closing Comment

by:anshuma
You rock. God should be thanked for creating geniuses like you.
 

Author Comment

by:anshuma
Can I continue this question for a couple more things, or do you want me to start a new thread? I think it would be better if this one could be continued.
 
LVL 7

Expert Comment

by:Fairlight2cx
You can continue, sure...

And thanks for the compliment!
 

Author Comment

by:anshuma
I now also need to split the city and state and write them to 'cleanfile.txt'.

"Macy's gift card with a Value of $10.00!!!!!","1","United States","Winn, ME","2010M03"

So the city will be Winn and the state will be ME. Could you modify the code for that as well?
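
The thread has no reply to this request. One possible approach, given only as a sketch: split the fourth field on its first comma. split_city_state is a hypothetical helper, and treating everything after the first comma as the state is an assumption. Sample rows such as "elkton va" or "NY" have no comma, so they come back with an empty state.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split "City, ST" on the first comma.
# Returns (city, state); state is '' when the field has no comma.
sub split_city_state {
    my ($citystate) = @_;
    my ($city, $state) = split(/\s*,\s*/, $citystate, 2);
    $city  = '' unless defined $city;
    $state = '' unless defined $state;
    s/^\s+|\s+$//g for ($city, $state);   # trim stray whitespace
    return ($city, $state);
}

for my $field ('Winn, ME', 'Central Lake,MI', 'elkton va') {
    my ($city, $state) = split_city_state($field);
    print "$city|$state\n";
}
```

In the script above, this would be called on $citystate inside the loop, with the two values added to the line printed to OUT.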
