Solved

parse html files

Posted on 1998-10-30
Last Modified: 2010-03-04
I am trying to write a program in Perl to parse an HTML file for special data.
Very simplified: the program shall do the following things on a given file (or a list of files):

(1) save the text of all links in an array (look for <A.......>xxx</A> and save the xxx in an array)
(2) surround all links with special tags (e.g.: <TAG ID="n"><A...>xxx</a></TAG> where n is the number of the link)
(3) write the modified html data into a new file

My own attempts work some of the time and for some parts, but not well enough. I think my regexps are quite buggy, so I won't post any code; someone here probably has a much easier and shorter way to do it.

Thanks in advance,

      Christian
Question by:Christian_Wenz
15 Comments
 
LVL 84

Expert Comment

by:ozo
ID: 1205824
you might use the HTML::Parser module
 
LVL 5

Author Comment

by:Christian_Wenz
ID: 1205825
I once played around with HTML::HeadParser which is part of this package, that worked quite fine indeed. However, in this case, I'd like to avoid using modules but have some compact :) lines in one script. So if you could post here some code, I'd be very glad to give you the points.
 
LVL 84

Expert Comment

by:ozo
ID: 1205826
well, as long as there's nothing tricky happening with comments or scripts or nested tags, you may be able to get by with something like

s#(<\s*A\b[^>]*>(.*?)<\s*/\s*A\b[^>]*>)#<TAG ID="${\(0+((push @array,$2),@array))}">$1</TAG>#sgi;
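A minimal, self-contained sketch of how this substitution behaves, assuming a short hypothetical HTML string as input (the sample markup is not from the thread). The regex is the one posted above, unchanged.

```perl
use strict;
use warnings;

# Hypothetical sample input; any HTML string could be used here.
my $content = '<p><A HREF="a.htm">first</A> and <a href="b.htm">second</a></p>';
my @array;

# For every link: push the link text ($2) onto @array, then use the new
# element count as the ID of the wrapping <TAG>.
$content =~ s#(<\s*A\b[^>]*>(.*?)<\s*/\s*A\b[^>]*>)#<TAG ID="${\(0+((push @array,$2),@array))}">$1</TAG>#sgi;

print "$content\n";          # the HTML with every link wrapped in <TAG ID="n">
print "$_\n" for @array;     # the collected link texts, one per line
```

As the comment above warns, this is only reliable when the page has no links inside comments or scripts and no nested tags that confuse the pattern.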
 
LVL 5

Author Comment

by:Christian_Wenz
ID: 1205827
Hi ozo,
your idea works perfectly!
Here's my (simplified) code:

#!/usr/bin/perl
open(FILE, "$ARGV[0]");
open(FILE2, ">$ARGV[1]");
$content = "";
while ($line  = <FILE>){$content .= $line;}
  $content =~ s#(<\s*A\b[^>]*>(.*?)<\s*/\s*A\b[^>]*>)#<TAG ID="${\(0+((push @array, $2), @array))}">$1</TAG>#sgi;
  print FILE2 $content;
foreach $array (@array) { print "$array\n";}
close(FILE);
close(FILE2);

One minor change: I'd like the whole link (including <A...> and </A>) to be saved in @array. Please answer this question with the new regexp, and if I don't come up with more questions, you'll get the points :-)
 
LVL 84

Expert Comment

by:ozo
ID: 1205828
open(FILE, "<$ARGV[0]") or die "can't open $ARGV[0]:$!";
open(FILE2, ">$ARGV[1]") or die "can't open $ARGV[1]:$!";
{local $/=undef; $content = <FILE>}
$content =~ s#(<\s*A\b[^>]*>.*?<\s*/\s*A\b[^>]*>)#<TAG ID="${\push @array,$1}">$1</TAG>#sgi;
 
LVL 5

Author Comment

by:Christian_Wenz
ID: 1205829
aargh! what a dumb idea to read the file line by line x-( Thanks for the obvious hint. (one question: do I really need the "local" and the {} brackets?)
I'll check your code; I suggest that you post it as an answer now.
 
LVL 84

Expert Comment

by:ozo
ID: 1205830
>do I really need the "local" and the {} brackets?
for a small program like this, no.
but I think it's a good idea in general, so you don't inadvertently break code that expects $/ to be normal.
  $content = join'',<FILE>;
would be another way to do it
(the push was also simplified; I made it unnecessarily complicated the first time)
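A small sketch of the scoping point, assuming an in-memory filehandle in place of a real HTML file (the sample data is made up for illustration). Inside the braces the whole file is read in one go; after the closing brace $/ is back to its normal value, so later reads are line by line again.

```perl
use strict;
use warnings;

# In-memory "file" standing in for a real HTML file (an assumption for this demo).
my $data = "line one\nline two\nline three\n";

open(my $fh, '<', \$data) or die "can't open in-memory file: $!";
my $content;
{
    local $/ = undef;      # no record separator: <$fh> returns the whole file
    $content = <$fh>;
}                          # leaving the block restores the previous $/
close($fh);

open($fh, '<', \$data) or die "can't reopen in-memory file: $!";
my @lines = <$fh>;         # reads line by line, because $/ was restored
close($fh);

print length($content), " characters slurped, ", scalar(@lines), " lines read afterwards\n";
```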
 
LVL 5

Author Comment

by:Christian_Wenz
ID: 1205831
sorry, I still have a question, but this is very easy.
as you know, all <A ...>xxx</A> elements of the page are stored in @array.
I now want to replace all occurrences of HREF="..." within the <A> tag with a special HREF stored in $href. If the A tag has no HREF, $href shall be added to the tag.
Here's my code:

  $array =~ s/\s/ /g; #<A\nHREF works in HTML, so replace all whitespace with spaces
  if ($array =~ (/HREF/i)) { $array =~ s/HREF\s*=\s*['"].*['"]/$href/i ;} else {$array =~ s/>/ $href>/ ;}
  $array =~ s/"/\\"/g; #escape all "

if the <A> tag has no HREF element, all works fine, however if it does have a HREF attribute, all other attributes in this tag are overwritten x-(

<A HREF="xxx" STYLE="yyy"> turns to
<A HREF=\"---the value of $href---\">

TIA!
 
LVL 5

Author Comment

by:Christian_Wenz
ID: 1205832
forgot to mention: $href = 'HREF="xxx.htm"';
 
LVL 84

Expert Comment

by:ozo
ID: 1205833
Try changing
['"].*['"]
to
['"].*?['"]
(this won't work if you have \" within ' or \' within ")
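A short sketch of the difference, using the $href value from the question and a hypothetical sample tag. The greedy .* runs to the last quote in the string and so swallows the other attributes; the non-greedy .*? stops at the first closing quote.

```perl
use strict;
use warnings;

my $href = 'HREF="xxx.htm"';                   # value given in the question
my $tag  = '<A HREF="old.htm" STYLE="yyy">';   # hypothetical sample tag

# Greedy: .* extends to the LAST quote, so STYLE="yyy" is lost as well.
(my $greedy = $tag) =~ s/HREF\s*=\s*['"].*['"]/$href/i;

# Non-greedy: .*? stops at the FIRST quote after the opening one.
(my $lazy = $tag) =~ s/HREF\s*=\s*['"].*?['"]/$href/i;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```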
 
LVL 5

Author Comment

by:Christian_Wenz
ID: 1205834
ok, thanks a lot for your help. Please grab your points now!
 
LVL 84

Accepted Solution

by:
ozo earned 100 total points
ID: 1205835
$content =~ s#(<\s*A\b[^>]*>.*?<\s*/\s*A\b[^>]*>)#<TAG ID="${\push @array,$1}">$1</TAG>#sgi;

$array =~ s/\s?HREF\s*=\s*['"][^'"]*['"]|(?=>)/ $href/i;
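A minimal sketch of the second substitution applied to both cases from the question, assuming two made-up sample links. With an HREF present, the first alternative replaces it in place; without one, the empty (?=>) alternative matches just before the first > and $href is inserted there.

```perl
use strict;
use warnings;

my $href  = 'HREF="xxx.htm"';   # value given in the question
my @links = (
    '<A HREF="old.htm" STYLE="yyy">text</A>',   # hypothetical link with an HREF
    '<A NAME="top">text</A>',                   # hypothetical link without one
);

my @rewritten;
for my $link (@links) {
    # Either replace an existing HREF="..." or insert $href before the first ">".
    (my $new = $link) =~ s/\s?HREF\s*=\s*['"][^'"]*['"]|(?=>)/ $href/i;
    push @rewritten, $new;
    print "$link  becomes  $new\n";
}
```

Note that only the first match is replaced (there is no /g), so the closing </A> is left alone.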
 
LVL 5

Author Comment

by:Christian_Wenz
ID: 1205836
thank you very much for your help!
 
LVL 5

Author Comment

by:Christian_Wenz
ID: 1205837
Hi ozo,

if you are reading this: I have now found the time to have a closer look at the script I wrote using your code. Then, the following questions emerged. If you could answer them for 100 points, I'd then open a "grab your points" question.

I am especially curious about this regular expression:
s#(<\s*A\b[^>]*>.*?<\s*/\s*A\b[^>]*>)#<ILAYER ID="link${\push @array,$1}">$1</ILAYER>#sgi

a) what are the word boundaries \b for?
b) why did you "escape" the push with "\"?
c) how exactly does the "link${...}" construct work? I understand that you push the link onto @array and put the number of array elements after "link". But how exactly does this happen?

Thanks in advance!
 
LVL 84

Expert Comment

by:ozo
ID: 1205838
a) I didn't want to match, say, AUTHOR, or APPLET
b) to create a reference to it
c) ${...} is the scalar referenced by ...
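The three answers can be tried out in isolation. This is a sketch with made-up sample tags and strings, not part of the original script: push returns the new number of elements, \ takes a reference to that number, and ${...} dereferences it while the string is being interpolated.

```perl
use strict;
use warnings;

# a) \b requires a word boundary after the A, so only real <A ...> tags match.
my @tags    = ('<A HREF="x.htm">', '<APPLET CODE="y">', '<AUTHOR>');   # sample tags
my @anchors = grep { /<\s*A\b[^>]*>/i } @tags;

# b) + c) push returns the new element count; \ makes a reference to that
# value, and ${ ... } dereferences the reference inside the string.
my @array;
my ($one, $two) = ('first link', 'second link');   # hypothetical link contents
my $first  = "link${\push @array, $one}";
my $second = "link${\push @array, $two}";

print "matched as anchors: @anchors\n";
print "$first $second\n";
```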