Solved

parse html files

Posted on 1998-10-30
Last Modified: 2010-03-04
I am trying to write a Perl program that parses an HTML file for specific data.
Very simplified, the program should do the following with a given file (or a list of files):

(1) save the text of all links in an array (look for <A.......>xxx</A> and save the xxx in an array)
(2) surround all links with special tags (e.g.: <TAG ID="n"><A...>xxx</a></TAG> where n is the number of the link)
(3) write the modified HTML data to a new file

My own attempts work some of the time and in part, but not well enough. I think my regexps are quite buggy, so I won't post any code; probably someone here has a much easier and shorter way to do it.

Thanks in advance,

      Christian
Question by:Christian_Wenz
15 Comments
 
LVL 84

Expert Comment

by:ozo
ID: 1205824
you might use the HTML::Parser module
 
LVL 5

Author Comment

by:Christian_Wenz
ID: 1205825
I once played around with HTML::HeadParser which is part of this package, that worked quite fine indeed. However, in this case, I'd like to avoid using modules but have some compact :) lines in one script. So if you could post here some code, I'd be very glad to give you the points.
 
LVL 84

Expert Comment

by:ozo
ID: 1205826
well, as long as there's nothing tricky happening with comments or scripts or nested tags, you may be able to get by with something like

s#(<\s*A\b[^>]*>(.*?)<\s*/\s*A\b[^>]*>)#<TAG ID="${\(0+((push @array,$2),@array))}">$1</TAG>#sgi;
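To see what that one-liner does, here is a minimal sketch run on a made-up input string (the sample HTML is an assumption, not from the question). It should wrap each link in a numbered <TAG> and collect the link texts in @array:

```perl
use strict;
use warnings;

# Hypothetical sample input; any HTML string would do.
my $content = '<p><a href="a.htm">First</a> and <A HREF="b.htm">Second</A></p>';
my @array;

# Same substitution as above: push the link text ($2), then interpolate
# the new length of @array as the ID of the wrapping tag.
$content =~ s#(<\s*A\b[^>]*>(.*?)<\s*/\s*A\b[^>]*>)#<TAG ID="${\(0+((push @array,$2),@array))}">$1</TAG>#sgi;

print "$content\n";        # links are now wrapped in <TAG ID="1">, <TAG ID="2">
print "$_\n" for @array;   # First, Second
```

As noted, this only holds up while the input has no comments, scripts, or nested tags that contain something looking like a link.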
 
LVL 5

Author Comment

by:Christian_Wenz
ID: 1205827
Hi ozo,
your idea works perfectly!
Here's my (simplified) code:

#!/usr/bin/perl
open(FILE, "$ARGV[0]");
open(FILE2, ">$ARGV[1]");
$content = "";
while ($line  = <FILE>){$content .= $line;}
  $content =~ s#(<\s*A\b[^>]*>(.*?)<\s*/\s*A\b[^>]*>)#<TAG ID="${\(0+((push @array, $2), @array))}">$1</TAG>#sgi;
  print FILE2 $content;
foreach $array (@array) { print "$array\n";}
close(FILE);
close(FILE2);

Some minor changes: I'd like the whole link (including <A...> and </A>) to be saved in @array. Please answer this question with the new regexp, and if I do not find any more questions, you'll get the points :-)
 
LVL 84

Expert Comment

by:ozo
ID: 1205828
open(FILE, "<$ARGV[0]") or die "can't open $ARGV[0]:$!";
open(FILE2, ">$ARGV[1]") or die "can't open $ARGV[1]:$!";
{local $/=undef; $content = <FILE>}
$content =~ s#(<\s*A\b[^>]*>.*?<\s*/\s*A\b[^>]*>)#<TAG ID="${\push @array,$1}">$1</TAG>#sgi;
 
LVL 5

Author Comment

by:Christian_Wenz
ID: 1205829
aargh! what a dumb idea to read the file line by line x-( Thanks for the obvious hint. (one question: do I really need the "local" and the {} brackets?)
I'll check your code; I suggest that you post it as an answer now.
 
LVL 84

Expert Comment

by:ozo
ID: 1205830
>do I really need the "local" and the {} brackets?
for a small program like this, no.
but I think it's a good idea in general, so you don't inadvertently break code that expects $/ to be normal.
  $content = join'',<FILE>;
would be another way to do it
(the push was also simplified; I made it unnecessarily complicated the first time)
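
A small sketch of why the braces and local are worth having. It assumes an in-memory filehandle with made-up contents in place of a real file, so nothing on disk is needed. The point is that $/ is undefined only inside the block, and code after it (here a chomp) still sees the normal newline separator:

```perl
use strict;
use warnings;

# Hypothetical file contents, read through an in-memory filehandle
# so the sketch needs no file on disk.
my $data = "line one\nline two\nline three\n";
open(my $fh, '<', \$data) or die "can't open in-memory file: $!";

my $content;
{
    local $/ = undef;   # slurp mode, but only inside this block
    $content = <$fh>;   # a single read returns the whole file
}
close($fh);

# Outside the block $/ is "\n" again, so chomp strips the newline as usual.
my $line = "text\n";
chomp($line);

print length($content), " characters read\n";
print "[$line]\n";
```

Without the local, $/ would stay undefined for the rest of the program, and chomp would no longer remove the trailing newline.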
 
LVL 5

Author Comment

by:Christian_Wenz
ID: 1205831
sorry, I still have a question, but this is very easy.
as you know, all <A ...>xxx</A> elements of the page are stored in @array.
I now want to replace all occurrences of HREF="..." within the <A> tag with a special HREF stored in $href. If the A tag has no HREF, $href shall be added to the tag.
Here's my code:

  $array =~ s/\s/ /g; #<A\nHREF works in HTML, so replace this with spaces
  if ($array =~ (/HREF/i)) { $array =~ s/HREF\s*=\s*['"].*['"]/$href/i ;} else {$array =~ s/>/ $href>/ ;}
  $array =~ s/"/\\"/g; #escape all "

if the <A> tag has no HREF element, all works fine, however if it does have a HREF attribute, all other attributes in this tag are overwritten x-(

<A HREF="xxx" STYLE="yyy"> turns to
<A HREF=\"---the value of $href---\">

TIA!
 
LVL 5

Author Comment

by:Christian_Wenz
ID: 1205832
forgot to mention: $href = 'HREF="xxx.htm"';
 
LVL 84

Expert Comment

by:ozo
ID: 1205833
Try changing
['"].*['"]
to
['"].*?['"]
(this won't work if you have \" within ' or \' within ")
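
A quick side-by-side of the two patterns, as a sketch using the sample tag and the $href value from the comments above. The greedy .* runs on to the last quote in the tag, while .*? stops at the first one:

```perl
use strict;
use warnings;

my $href = 'HREF="xxx.htm"';              # value given in the thread
my $tag  = '<A HREF="xxx" STYLE="yyy">';  # sample tag from the thread

# Greedy: .* extends to the LAST quote, swallowing STYLE="yyy" too.
(my $greedy = $tag) =~ s/HREF\s*=\s*['"].*['"]/$href/i;

# Non-greedy: .*? stops at the FIRST quote after the value starts.
(my $lazy = $tag) =~ s/HREF\s*=\s*['"].*?['"]/$href/i;

print "$greedy\n";   # <A HREF="xxx.htm">
print "$lazy\n";     # <A HREF="xxx.htm" STYLE="yyy">
```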
 
LVL 5

Author Comment

by:Christian_Wenz
ID: 1205834
ok, thanks a lot for your help. Please grab your points now!
 
LVL 84

Accepted Solution

by:ozo (earned 100 total points)
ID: 1205835
$content =~ s#(<\s*A\b[^>]*>.*?<\s*/\s*A\b[^>]*>)#<TAG ID="${\push @array,$1}">$1</TAG>#sgi;

$array =~ s/\s?HREF\s*=\s*['"][^'"]*['"]|(?=>)/ $href/i;
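
A sketch of the second substitution applied to two made-up links (the sample @array contents and the loop variable $link are assumptions for illustration). If an HREF is present it is replaced; otherwise the empty match just before the first > is used, so $href is inserted there:

```perl
use strict;
use warnings;

my $href = 'HREF="xxx.htm"';   # value given in the thread

# Hypothetical stored links: one with an HREF, one without.
my @array = (
    '<A HREF="xxx" STYLE="yyy">one</A>',
    '<A NAME="top">two</A>',
);

for my $link (@array) {
    # First alternative replaces an existing HREF="..." attribute;
    # second alternative matches the empty string before the first ">".
    $link =~ s/\s?HREF\s*=\s*['"][^'"]*['"]|(?=>)/ $href/i;
}

print "$_\n" for @array;
# <A HREF="xxx.htm" STYLE="yyy">one</A>
# <A NAME="top" HREF="xxx.htm">two</A>
```

Because there is no /g, only the first match in each link is changed. An attribute value that itself contains a > before the HREF would still confuse it.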
 
LVL 5

Author Comment

by:Christian_Wenz
ID: 1205836
thank you very much for your help!
 
LVL 5

Author Comment

by:Christian_Wenz
ID: 1205837
Hi ozo,

if you are reading this: I have now found the time to have a closer look at the script I wrote using your code. Then, the following questions emerged. If you could answer them for 100 points, I'd then open a "grab your points" question.

I am especially curious about this regular expression:
s#(<\s*A\b[^>]*>.*?<\s*/\s*A\b[^>]*>)#<ILAYER ID="link${\push @array,$1}">$1</ILAYER>#sgi

a) what are the word boundaries \b for?
b) why did you "escape" the push with "\"?
c) how exactly does the "link${...}" construct work? I understand that you put the link into the array @array and place the number of array elements after "link". But how exactly do you do this?

Thanks in advance!
 
LVL 84

Expert Comment

by:ozo
ID: 1205838
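a) I didn't want to match, say, AUTHOR, or APPLET
b) to create a reference to the value it returns (the new number of elements in @array)
c) ${...} is the scalar referenced by ..., so that number gets interpolated into the string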
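
The three answers can be tried out in isolation. This is only a sketch; the sample tags and array contents are made up for the demonstration:

```perl
use strict;
use warnings;

# a) \b stops <A ...> from also matching longer tag names such as <APPLET>.
my @hits = grep { /<\s*A\b/i } ('<A HREF="x">', '<APPLET CODE="y">', '<a>');

# b) and c) push returns the new element count; the backslash takes a
# reference to that value, and ${ ... } dereferences it inside the string.
my @array = ('first');
my $id = "link${\push @array, 'second'}";

print "@hits\n";   # <A HREF="x"> <a>
print "$id\n";     # link2
```

The same trick works for any expression: "${\ EXPR }" evaluates EXPR in scalar context and interpolates the result.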
