Solved

parse html files

Posted on 1998-10-30
Last Modified: 2010-03-04
I am trying to write a program in Perl to parse an HTML file for special data.
Very simplified: the program shall do the following things on a given file (or a list of files):

(1) save the text of all links in an array (look for <A.......>xxx</A> and save the xxx in an array)
(2) surround all links with special tags (e.g.: <TAG ID="n"><A...>xxx</A></TAG> where n is the number of the link)
(3) write the modified html data into a new file

My own attempts work some of the time and for some parts, but not reliably. I think my regexps are quite buggy, so I won't post any code; someone here probably has a much easier and shorter way to do it.

Thanks in advance,

      Christian
Question by:Christian_Wenz
15 Comments
 
LVL 84

Expert Comment

by:ozo
ID: 1205824
you might use the HTML::Parser module
 
LVL 5

Author Comment

by:Christian_Wenz
ID: 1205825
I once played around with HTML::HeadParser, which is part of this package, and that worked quite well. However, in this case I'd like to avoid modules and instead have a few compact :) lines in one script. So if you could post some code here, I'd be very glad to give you the points.
 
LVL 84

Expert Comment

by:ozo
ID: 1205826
well, as long as there's nothing tricky happening with comments or scripts or nested tags, you may be able to get by with something like

s#(<\s*A\b[^>]*>(.*?)<\s*/\s*A\b[^>]*>)#<TAG ID="${\(0+((push @array,$2),@array))}">$1</TAG>#sgi;
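
For readers who find the one-liner above dense, here is a minimal sketch of the same idea spelled out with the /e modifier, so the replacement is ordinary Perl code. The sample HTML and the helper name wrap_link are hypothetical and not part of the original thread; the same caveats about comments, scripts and nested tags apply.

```perl
use strict;
use warnings;

# Hypothetical sample input; any HTML string would do.
my $content = '<p><a href="a.htm">First</a> and <A HREF="b.htm">Second</A></p>';
my @array;

# Hypothetical helper: stores the link text and returns the wrapped link.
sub wrap_link {
    my ($whole, $text, $store) = @_;
    push @$store, $text;        # (1) remember the text between <A ...> and </A>
    my $n = @$store;            # number of links seen so far, starting at 1
    return qq{<TAG ID="$n">$whole</TAG>};   # (2) surround the whole link
}

# Same pattern as above; /e evaluates the replacement as code,
# /s lets . match newlines, /g repeats, /i ignores case.
$content =~ s{(<\s*A\b[^>]*>(.*?)<\s*/\s*A\b[^>]*>)}{wrap_link($1, $2, \@array)}sgie;

print "$content\n";
print "$_\n" for @array;
```

This trades the compactness of the original for readability; the matching behaviour is intended to be the same.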
 
LVL 5

Author Comment

by:Christian_Wenz
ID: 1205827
Hi ozo,
your idea works perfectly!
Here's my (simplified) code:

#!/usr/bin/perl
open(FILE, "$ARGV[0]");
open(FILE2, ">$ARGV[1]");
$content = "";
while ($line  = <FILE>){$content .= $line;}
  $content =~ s#(<\s*A\b[^>]*>(.*?)<\s*/\s*A\b[^>]*>)#<TAG ID="${\(0+((push @array, $2), @array))}">$1</TAG>#sgi;
  print FILE2 $content;
foreach $array (@array) { print "$array\n";}
close(FILE);
close(FILE2);

Some minor changes: I'd like the whole link (including <A...> and </A>) to be saved in @array. Please answer this question with the new regexp, and if I don't come up with more questions, you'll get the points :-)
 
LVL 84

Expert Comment

by:ozo
ID: 1205828
open(FILE, "<$ARGV[0]") or die "can't open $ARGV[0]:$!";
open(FILE2, ">$ARGV[1]") or die "can't open $ARGV[1]:$!";
{local $/=undef; $content = <FILE>}
$content =~ s#(<\s*A\b[^>]*>.*?<\s*/\s*A\b[^>]*>)#<TAG ID="${\push @array,$1}">$1</TAG>#sgi;
 
LVL 5

Author Comment

by:Christian_Wenz
ID: 1205829
aargh! what a dumb idea to read the file line by line x-( Thanks for the obvious hint. (one question: do I really need the "local" and the {} brackets?)
I'll check your code; I suggest that you post it as an answer now.
 
LVL 84

Expert Comment

by:ozo
ID: 1205830
>do I really need the "local" and the {} brackets?
for a small program like this, no.
but I think it's a good idea in general, so you don't inadvertently break code that expects $/ to be normal.
  $content = join'',<FILE>;
would be another way to do it
(the push was also simplified; I got unnecessarily complicated the first time)
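
A small sketch of why the block and "local" matter: $/ is set to undef only inside the braces, so a later line-by-line read still behaves normally. The in-memory filehandle and the sample text are assumptions made so the example runs without a real file.

```perl
use strict;
use warnings;

my $data = "line one\nline two\nline three\n";   # stand-in for a real file

open my $fh, '<', \$data or die "can't open in-memory file: $!";
my $content;
{
    local $/ = undef;      # slurp mode, but only inside this block
    $content = <$fh>;      # reads everything in one go
}
close $fh;

# Outside the block $/ is back to "\n", so this reads line by line again.
open $fh, '<', \$data or die "can't reopen in-memory file: $!";
my @lines = <$fh>;
close $fh;

print "slurped ", length($content), " characters; then read ",
      scalar(@lines), " separate lines\n";
```

Without "local", $/ would stay undefined for the rest of the program and the second read would return the whole file as a single element.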
 
LVL 5

Author Comment

by:Christian_Wenz
ID: 1205831
Sorry, I still have a question, but this one is very easy.
As you know, all <A ...>xxx</A> elements of the page are stored in @array.
I now want to replace every occurrence of HREF="..." within the <A> tag with a special HREF stored in $href. If the A tag has no HREF, $href shall be added to the tag.
Here's my code:

  $array =~ s/\s/ /g; # <A\nHREF is valid HTML, so replace all whitespace with spaces
  if ($array =~ (/HREF/i)) { $array =~ s/HREF\s*=\s*['"].*['"]/$href/i ;} else {$array =~ s/>/ $href>/ ;}
  $array =~ s/"/\\"/g; #escape all "

If the <A> tag has no HREF attribute, all works fine. However, if it does have an HREF attribute, all other attributes in this tag are overwritten x-(

<A HREF="xxx" STYLE="yyy"> turns to
<A HREF=\"---the value of $href---\">

TIA!
 
LVL 5

Author Comment

by:Christian_Wenz
ID: 1205832
forgot to mention: $href = 'HREF="xxx.htm"';
 
LVL 84

Expert Comment

by:ozo
ID: 1205833
Try changing
['"].*['"]
to
['"].*?['"]
(this won't work if you have \" within ' or \' within ")
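
To illustrate the difference, here is a short sketch comparing the greedy and non-greedy versions on one tag. The tag and the value old.htm are made up for the example; $href is the value given earlier in the thread.

```perl
use strict;
use warnings;

my $href = 'HREF="xxx.htm"';
my $tag  = '<A HREF="old.htm" STYLE="yyy">';   # hypothetical sample tag

# Greedy .* runs on to the LAST quote in the tag and swallows STYLE too.
(my $greedy = $tag) =~ s/HREF\s*=\s*['"].*['"]/$href/i;

# Non-greedy .*? stops at the first closing quote after the value.
(my $lazy = $tag) =~ s/HREF\s*=\s*['"].*?['"]/$href/i;

print "greedy: $greedy\n";   # STYLE attribute is lost
print "lazy:   $lazy\n";     # STYLE attribute survives
```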
 
LVL 5

Author Comment

by:Christian_Wenz
ID: 1205834
ok, thanks a lot for your help. Please grab your points now!
 
LVL 84

Accepted Solution

by:
ozo earned 100 total points
ID: 1205835
$content =~ s#(<\s*A\b[^>]*>.*?<\s*/\s*A\b[^>]*>)#<TAG ID="${\push @array,$1}">$1</TAG>#sgi;

$array =~ s/\s?HREF\s*=\s*['"][^'"]*['"]|(?=>)/ $href/i;
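
A quick sketch of how the second substitution handles both cases, using two made-up tags. The first alternative replaces an existing HREF attribute; if there is none, the zero-width lookahead (?=>) matches just before the first ">" and $href is inserted there.

```perl
use strict;
use warnings;

my $href = 'HREF="xxx.htm"';

# Hypothetical sample links: one with an HREF, one without.
my @links = (
    '<A HREF="old.htm" STYLE="yyy">text</A>',
    '<A NAME="top">text</A>',
);

for my $link (@links) {
    # No /g, so only the first (leftmost) match is replaced.
    $link =~ s/\s?HREF\s*=\s*['"][^'"]*['"]|(?=>)/ $href/i;
    print "$link\n";
}
```

Because [^'"]* cannot cross a quote character, the other attributes in the tag are left alone; the earlier caveat about quotes inside attribute values still applies.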
 
LVL 5

Author Comment

by:Christian_Wenz
ID: 1205836
thank you very much for your help!
 
LVL 5

Author Comment

by:Christian_Wenz
ID: 1205837
Hi ozo,

if you are reading this: I have now found the time to have a closer look at the script I wrote using your code. Then, the following questions emerged. If you could answer them for 100 points, I'd then open a "grab your points" question.

I am especially curious about this regular expression:
s#(<\s*A\b[^>]*>.*?<\s*/\s*A\b[^>]*>)#<ILAYER ID="link${\push @array,$1}">$1</ILAYER>#sgi

a) What are the word boundaries \b for?
b) Why did you "escape" the push with "\"?
c) How exactly does the "link${...}" construct work? I understand that you put the link into the array @array and put the number of array elements after "link". But how exactly do you do this?

Thanks in advance!
 
LVL 84

Expert Comment

by:ozo
ID: 1205838
a) I didn't want to match, say, AUTHOR, or APPLET
b) to create a reference to it
c) ${...} is the scalar referenced by ...
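
The three answers can be tried out in isolation. The sketch below uses made-up tags and a made-up link: push returns the new number of elements, the backslash takes a reference to that number, and ${...} dereferences it inside the double-quoted string.

```perl
use strict;
use warnings;

# a) \b keeps "A" from matching the start of longer tag names.
my @tags = ('<A HREF="x">', '<APPLET CODE="y">', '<ADDRESS>');   # hypothetical samples
my @with_b    = grep { /<\s*A\b[^>]*>/i } @tags;   # only the anchor tag
my @without_b = grep { /<\s*A[^>]*>/i }   @tags;   # all three tags

# b) and c) push returns the new element count; \ makes a reference to
# that value and ${ } dereferences it, so the count lands in the string.
my @array;
my $link   = '<A HREF="x">text</A>';
my $first  = "link${\ push(@array, $link) }";
my $second = "link${\ push(@array, $link) }";

print scalar(@with_b), " match with \\b, ", scalar(@without_b), " without\n";
print "$first $second\n";
```

In the substitution, the replacement part is interpolated like a double-quoted string, which is why the same trick works there without the /e modifier.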