parse html files

I am trying to write a program in Perl to parse an HTML file for special data.
Very simplified, the program should do the following with a given file (or a list of files):

(1) save the text of all links in an array (look for <A.......>xxx</A> and save the xxx in an array)
(2) surround all links with special tags (e.g.: <TAG ID="n"><A...>xxx</a></TAG> where n is the number of the link)
(3) write the modified html data into a new file

My own efforts work some of the time and in part, but not well enough. I think my regexps are quite buggy, so I won't post any code; someone here probably has a much easier and shorter way to do it.

Thanks in advance,

ozo Commented:
$content =~ s#(<\s*A\b[^>]*>.*?<\s*/\s*A\b[^>]*>)#<TAG ID="${\push @array,$1}">$1</TAG>#sgi;

$array =~ s/\s?HREF\s*=\s*['"][^'"]*['"]|(?=>)/ $href/i;
you might use the HTML::Parser module
Christian_Wenz (Author) Commented:
I once played around with HTML::HeadParser which is part of this package, that worked quite fine indeed. However, in this case, I'd like to avoid using modules but have some compact :) lines in one script. So if you could post here some code, I'd be very glad to give you the points.

well, as long as there's nothing tricky happening with comments or scripts or nested tags, you may be able to get by with something like

s#(<\s*A\b[^>]*>(.*?)<\s*/\s*A\b[^>]*>)#<TAG ID="${\(0+((push @array,$2),@array))}">$1</TAG>#sgi;
Christian_Wenz (Author) Commented:
Hi ozo,
your idea works perfectly!
Here's my (simplified) code:

open(FILE, "$ARGV[0]");
open(FILE2, ">$ARGV[1]");
$content = "";
while ($line  = <FILE>){$content .= $line;}
  $content =~ s#(<\s*A\b[^>]*>(.*?)<\s*/\s*A\b[^>]*>)#<TAG ID="${\(0+((push @array, $2), @array))}">$1</TAG>#sgi;
  print FILE2 $content;
foreach $array (@array) { print "$array\n";}

Some minor changes: I'd like the whole link (including <A...> and </A>) to be saved in @array. Please answer this question with the new regexp, and if I don't come up with more questions, you'll get the points :-)
open(FILE, "<$ARGV[0]") or die "can't open $ARGV[0]:$!";
open(FILE2, ">$ARGV[1]") or die "can't open $ARGV[1]:$!";
{local $/=undef; $content = <FILE>}
$content =~ s#(<\s*A\b[^>]*>.*?<\s*/\s*A\b[^>]*>)#<TAG ID="${\push @array,$1}">$1</TAG>#sgi;
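A minimal, self-contained sketch of the substitution above, for trying it without any files. The sample string and the name @links are made up for illustration. It also swaps the ${\push ...} trick for the /e modifier, which is a different way to get the same numbering and not what was posted above.

```perl
use strict;
use warnings;

# Hypothetical sample input; in the thread this comes from the slurped file.
my $content = qq{<p><A HREF="a.htm">first</A> and <a\nhref='b.htm' STYLE="x">second</a> by <AUTHOR>me</AUTHOR></p>};

my @links;   # collects each complete <A ...>...</A> element

# Same pattern as above. /e evaluates the replacement as Perl code, so the
# push and the numbering are written out explicitly.
$content =~ s{(<\s*A\b[^>]*>.*?<\s*/\s*A\b[^>]*>)}{
    push @links, $1;                                   # remember the whole link
    '<TAG ID="' . scalar(@links) . '">' . $1 . '</TAG>';
}sgie;

print "$content\n";
print "$_\n" for @links;
```

The <AUTHOR> element is left alone because of the \b after A, and the second link is still found although its tag contains a newline. As noted above, comments, scripts or nested tags can still trip this up.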
Christian_Wenz (Author) Commented:
aargh! what a dumb idea to read the file line by line x-( Thanks for the obvious hint. (one question: do I really need the "local" and the {} brackets?)
I'll check your code; I suggest that you post it as an answer now.
>do I really need the "local" and the {} brackets?
for a small program like this, no.
but I think it's a good idea in general, so you don't inadvertently break code that expects $/ to be normal.
  $content = join'',<FILE>;
would be another way to do it
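To see what the local and the braces buy you, here is a small sketch. It reads from an in-memory filehandle instead of a real file (that part is an assumption made so the example runs anywhere), and the variable names are only for illustration.

```perl
use strict;
use warnings;

# Stand-in for the HTML file: an in-memory filehandle over a string.
my $html = "<A\nHREF=\"x.htm\">x</A>\nlast line\n";
open(my $fh, '<', \$html) or die "can't open in-memory file: $!";

my $content;
{
    local $/ = undef;     # no record separator: one read returns everything
    $content = <$fh>;
}                         # leaving the block restores the previous $/
close $fh;

my $separator_restored = (defined $/ && $/ eq "\n") ? 1 : 0;
print length($content), " characters read\n";
print "separator restored: $separator_restored\n";
```

Without the braces, $/ would stay undefined for the rest of the enclosing scope, and any later line-by-line read in that scope would return the whole input at once.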
(the push was also simplified; I got unnecessarily complicated the first time)
Christian_Wenz (Author) Commented:
sorry, I still have a question, but this is very easy.
as you know, all <A ...>xxx</A> elements of the page are stored in @array.
I now want to replace all occurrences of HREF="..." within the <A> tag with a special HREF stored in $href. If the A tag has no HREF, $href shall be added to the tag.
Here's my code:

  $array =~ s/\s/ /g; #<A\nHREF works in HTML, so replace this by spaces
  if ($array =~ (/HREF/i)) { $array =~ s/HREF\s*=\s*['"].*['"]/$href/i ;} else {$array =~ s/>/ $href>/ ;}
  $array =~ s/"/\\"/g; #escape all "

If the <A> tag has no HREF attribute, all works fine; however, if it does have an HREF attribute, all other attributes in the tag are overwritten x-(

<A HREF="xxx" STYLE="yyy"> turns to
<A HREF=\"---the value of $href---\">

Christian_Wenz (Author) Commented:
forgot to mention: $href = 'HREF="xxx.htm"';
Try changing
(this won't work if you have \" within ' or \' within ")
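For reference, a small sketch that runs the HREF substitution quoted at the top of this thread against two made-up tags (the sample tags are assumptions; $href is the value given above). The .* in the earlier attempt is greedy and runs on to the last quote in the tag, which is why the other attributes disappeared; [^'"]* stops at the first closing quote.

```perl
use strict;
use warnings;

my $href = 'HREF="xxx.htm"';    # value given earlier in the thread

# Hypothetical sample tags: one with HREF plus another attribute, one without HREF.
my @tags = (
    '<A HREF="old.htm" STYLE="yyy">text</A>',
    '<A NAME="top">text</A>',
);

for my $tag (@tags) {
    # First alternative: replace an existing HREF attribute, quotes included.
    # Second alternative: (?=>) matches the empty string just before the
    # first ">", so $href is inserted there when no HREF was found earlier.
    $tag =~ s/\s?HREF\s*=\s*['"][^'"]*['"]|(?=>)/ $href/i;
    print "$tag\n";
}
```

The first tag keeps its STYLE attribute and the second one gains an HREF. The caveat about quotes nested inside the other kind of quote still applies.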
Christian_Wenz (Author) Commented:
ok, thanks a lot for your help. Please grab your points now!
Christian_Wenz (Author) Commented:
thank you very much for your help!
Christian_Wenz (Author) Commented:
Hi ozo,

if you are reading this: I have now found the time to have a closer look at the script I wrote using your code. Then, the following questions emerged. If you could answer them for 100 points, I'd then open a "grab your points" question.

I am especially curious about this regular expression:
s#(<\s*A\b[^>]*>.*?<\s*/\s*A\b[^>]*>)#<ILAYER ID="link${\push @array,$1}">$1</ILAYER>#sgi

a) what are the word boundaries \b for?
b) why did you "escape" the push with "\"?
c) how exactly does the "link${...}" construct work? I understand that you put the link into the array @array and put the number of array elements after "link". However, how exactly do you do this?

Thanks in advance!
a) I didn't want to match, say, AUTHOR, or APPLET
b) to create a reference to it (that is, to the value push returns)
c) ${...} is the scalar referenced by ...
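The three answers can be checked with a short sketch. The strings and variable names here are made up for illustration; the only facts relied on are that push returns the new number of elements and that ${ ... } dereferences a scalar reference inside a double-quoted string.

```perl
use strict;
use warnings;

my @array;

# push returns the new element count; \ takes a reference to that value,
# and ${ ... } dereferences it, which lets it interpolate in the string.
my $first  = "link${\ push @array, 'one'}";
my $second = "link${\ push @array, 'two'}";
print "$first $second\n";    # prints "link1 link2"

# \b is a word boundary: the A must not be followed by another word character.
my @matched = grep { $_ =~ /<\s*A\b/i } ('<A HREF="x">', '<AUTHOR>', '<APPLET>', '<a>');
print "@matched\n";          # prints '<A HREF="x"> <a>'
```

In the substitution, the replacement part is interpolated like a double-quoted string, so the same construct yields the running link number there.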
Question has a verified solution.
