Parse meta keywords/split/save

Yoo,

I am dumping multiple web pages into a single file and would like to read it line by line to find the META NAME Keywords Content tag, parse the keywords, remove duplicates, and then print them to a new file. Here are 500 points for your troubles....
BiffoAsked:
ozoCommented:
use HTML::Parser;
BiffoAuthor Commented:
ozo,

I use Parser daily; however, it is not always the preferred method.
oliberCommented:
ozo,
OK, if you need to parse an HTML file just to find the META keywords, you can use this.
Let's say that $line contains your actual HTML content:

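while( defined($line = <STDIN>) ) {   # or however you update $line; it must hold a whole page
  $line =~ s:[\r\n]::g;               # strip every CR/LF, not just the first one
  next unless $line =~ s:<HEAD>(.*?)</HEAD>::i;   # skip input without a HEAD block
  $head = $1;
  next unless $head =~ s:<META\s+NAME="Keywords"\s+CONTENT="([^"]+)"[^>]*::i;
  $keywords = $1;
  # [,;] splits on a comma or a semicolon; /,;/ would only match the pair ",;"
  print join("\n", split(/\s*[,;]\s*/, $keywords)), "\n";
}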

and here it is...(I hope)
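
A minimal sketch of the same idea applied to the whole dump at once, assuming the pages fit in memory and that each tag is written with NAME before CONTENT. The sub name extract_keywords and the sample HTML are made up for illustration; a real parser such as HTML::Parser copes with far more variations.

```perl
use strict;
use warnings;

# Hypothetical helper: return every keyword found in any
# <META NAME="keywords" CONTENT="..."> tag in the given HTML text.
sub extract_keywords {
    my ($html) = @_;
    my @keywords;
    # /g walks through every matching tag, /i ignores case
    while ( $html =~ /<meta\s+name\s*=\s*"keywords"\s+content\s*=\s*"([^"]*)"/gi ) {
        for my $kw ( split /[,;]/, $1 ) {
            $kw =~ s/^\s+|\s+$//g;            # trim surrounding whitespace
            push @keywords, $kw if length $kw;
        }
    }
    return @keywords;
}

# Made-up sample input standing in for the dumped pages
my $dump = <<'HTML';
<HEAD><META NAME="Keywords" CONTENT="perl, cgi; parsing"></HEAD>
<head><meta name="keywords" content="Perl,regex"></head>
HTML

print join( "\n", extract_keywords($dump) ), "\n";
```

Duplicates and case differences are deliberately left in here; removing them is a separate step.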

BiffoAuthor Commented:
oliber, Can you somehow sort for duplicates before printing the keywords?
oliberCommented:
My advice on sorting out duplicates would be to pipe the output to "sort -u", but this would not be case insensitive.
If you want case-insensitive output, just do $keywords =~ tr/a-z/A-Z/;
before the print, but you would still have accented characters and "white spaces".
It's not so simple to do a good conversion to capital letters; you should rely on the ISO-8859-1 standard and write your own subroutine to translate accented characters to capital letters.
example :
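sub my_toupper {
  my($thestring) = @_;
  for my $car (keys(%ISOMAJ)) {
    # tr/// does not interpolate variables, so use s///g instead
    $thestring =~ s/\Q$car\E/$ISOMAJ{$car}/g;
  }
  return $thestring;   # hand the converted string back to the caller
}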

where %ISOMAJ would be defined as :
%ISOMAJ = (
"é","E",
....
);
!!! do not forget capital accented characters

Hope that's what you're looking for.
ozoCommented:
# perldoc -q unique
@uniq{split/,;/,lc $keywords} = ();
print join "/n",sort keys %uniq;
#and if you setlocale, lc will properly handle accented characters
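
For readers unfamiliar with the idiom: assigning to a hash slice creates one key per list element, and because hash keys are unique, duplicates collapse. A small sketch with a made-up $keywords string; the character class [,;] in the split is an assumption about the intended separators.

```perl
use strict;
use warnings;

my $keywords = "Perl,cgi;perl,Regex,CGI";   # made-up sample

my %uniq;
# Hash slice: every element of the list becomes a key (value undef).
# lc folds case first, so "Perl" and "perl" land on the same key.
@uniq{ split /[,;]/, lc $keywords } = ();

my @sorted = sort keys %uniq;
print join( "\n", @sorted ), "\n";
```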

 
oliberCommented:
I agree with you, ozo, about the uniq and lc feature with setlocale;
the main problem is having a good installation of setlocale.
If you use a Perl sub to do the toupper, you are independent of the machine & installation.
BiffoAuthor Commented:
OK, this seems to work .....

@uniq{split/,;/,lc $keywords} = ();
print join "/n",sort keys %uniq;


ummm, how do you print that with CGI? I notice it returns nothing through STDOUT, but it does when piped directly to a file.
ozoCommented:
sorry, I meant join "\n"
If you're writing HTML, perhaps it should be join "<br>\n"
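
Whatever the cause turns out to be, a script run as a CGI program also has to print a header block before any other output. A minimal sketch, assuming plain HTML output; the helper name render_keywords is hypothetical, and no CGI module is needed for this much.

```perl
use strict;
use warnings;

# Hypothetical helper: build the full CGI response as one string.
sub render_keywords {
    my (@keywords) = @_;
    return "Content-type: text/html\n\n"      # header, then a blank line
         . join( "<br>\n", @keywords ) . "\n";
}

print render_keywords( "cgi", "perl", "regex" );
```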
BiffoAuthor Commented:
ozo, no, what I meant was zero output with CGI. It seems the cause is the semicolon and lc, because if I remove them it prints, like this:
@uniq{split/,/, $keywords} = ();
Because I remove it it is not sorted.

@uniq{split/,;/,lc $keywords} = ();
 Works just fine from command line on same machine, just not through CGI web server combo.
ozoCommented:
I assumed the previous code was working. What is $keywords before the split?
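
One thing worth checking, independent of CGI: the pattern /,;/ matches only a comma immediately followed by a semicolon, so an ordinary comma-separated list is not split at all. If the intent is "comma or semicolon", a character class does that. A quick sketch with a made-up string:

```perl
use strict;
use warnings;

my $keywords = "perl,cgi;regex";            # made-up sample

my @literal = split /,;/,   $keywords;      # needs the two-character sequence ",;"
my @class   = split /[,;]/, $keywords;      # splits on either character

print scalar(@literal), " field(s) with /,;/\n";
print scalar(@class),   " field(s) with /[,;]/\n";
```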
BiffoAuthor Commented:
Would any of you two guys happen to have an idea how to count each keyword before sorting/printing so if I wanted to get fancy I could print the number of times the keyword occurs ? Say "marketing" occurs 8 times I could then print like this:

marketing 8



 
ozoCommented:
for( split/\W+/,lc $keywords ){ $count{$_}++ }
for( sort keys %count ){ print "$_ $count{$_}\n"; }
BiffoAuthor Commented:
hmmmm ozo, now that brings up an interesting paradox... If marketing appears say 3 times, it would print like this:

marketing 1
marketing 2
marketing 3

When you just want:

marketing 3

What do you do? Maybe create two identical arrays of all the keywords parsed and pushed, sort and de-dupe array A, then count each keyword in array A against array B, and then print array A with the counts of keywords found in array B?

ozoCommented:
The
 for( keys %count )
was meant to be outside the while loop
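
A sketch of that layout, with a made-up @pages array standing in for the keyword strings the while loop would produce. Tallying happens inside the loop and printing happens once, after it:

```perl
use strict;
use warnings;

# Made-up strings standing in for the keywords pulled from each page
my @pages = ( "marketing, sales", "Marketing; web", "marketing" );

my %count;
for my $keywords (@pages) {                  # stands in for the while loop
    for my $kw ( split /\W+/, lc $keywords ) {
        $count{$kw}++ if length $kw;         # only tally inside the loop
    }
}

# Print once, after every page has been counted
for my $kw ( sort keys %count ) {
    print "$kw $count{$kw}\n";
}
```

With these sample strings, marketing is printed a single time with a count of 3.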
BiffoAuthor Commented:
ozo, yup, that was my problem, I was just one level too high.

Could I bug you for one last thing? How would ya sort the keys on the count (numerical)? It doesn't matter if the count is returned in front of the keywords if it matters for sort purposes.
ozoCommented:
sort {$count{$a} <=> $count{$b}} keys %count
#see also `perldoc -q sort`
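
The comparison above sorts from the lowest count to the highest. A small variation, assuming the most frequent keyword should come first: swap $a and $b, and fall back to alphabetical order for ties. The %count values here are made up.

```perl
use strict;
use warnings;

my %count = ( marketing => 8, sales => 2, web => 2 );   # made-up counts

# Highest count first ($b before $a); ties fall back to alphabetical order
my @by_count = sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;

printf "%-12s %d\n", $_, $count{$_} for @by_count;
```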