Solved

parse html files

Posted on 1998-10-30
Last Modified: 2010-03-04
I am trying to write a program in Perl to parse an HTML file for special data.
Very simplified: the program shall do the following things on a given file (or a list of files):

(1) save the text of all links in an array (look for <A.......>xxx</A> and save the xxx in an array)
(2) surround all links with special tags (e.g.: <TAG ID="n"><A...>xxx</a></TAG> where n is the number of the link)
(3) write the modified html data into a new file

My own efforts work some of the time and to some extent, but not sufficiently. I think my regexps are quite buggy, so I won't post any code; probably someone here has a much easier and shorter way to do it.

Thanks in advance,

      Christian
0
Comment
Question by:Christian_Wenz
 
LVL 84

Expert Comment

by:ozo
ID: 1205824
you might use the HTML::Parser module
0
 
LVL 5

Author Comment

by:Christian_Wenz
ID: 1205825
I once played around with HTML::HeadParser, which is part of this package, and that worked quite well. However, in this case I'd like to avoid modules and keep everything in a few compact :) lines in one script. So if you could post some code here, I'd be very glad to give you the points.
0
 
LVL 84

Expert Comment

by:ozo
ID: 1205826
well, as long as there's nothing tricky happening with comments or scripts or nested tags, you may be able to get by with something like

s#(<\s*A\b[^>]*>(.*?)<\s*/\s*A\b[^>]*>)#<TAG ID="${\(0+((push @array,$2),@array))}">$1</TAG>#sgi;
0
 
LVL 5

Author Comment

by:Christian_Wenz
ID: 1205827
Hi ozo,
your idea works perfectly!
Here's my (simplified) code:

#!/usr/bin/perl
open(FILE, "$ARGV[0]");
open(FILE2, ">$ARGV[1]");
$content = "";
while ($line  = <FILE>){$content .= $line;}
  $content =~ s#(<\s*A\b[^>]*>(.*?)<\s*/\s*A\b[^>]*>)#<TAG ID="${\(0+((push @array, $2), @array))}">$1</TAG>#sgi;
  print FILE2 $content;
foreach $array (@array) { print "$array\n";}
close(FILE);
close(FILE2);

One minor change: I'd like the whole link (including <A...> and </A>) to be saved in @array. Please answer this question with the new regexp, and if I do not find any more questions, you'll get the points :-)
0
 
LVL 84

Expert Comment

by:ozo
ID: 1205828
open(FILE, "<$ARGV[0]") or die "can't open $ARGV[0]:$!";
open(FILE2, ">$ARGV[1]") or die "can't open $ARGV[1]:$!";
{local $/=undef; $content = <FILE>}
$content =~ s#(<\s*A\b[^>]*>.*?<\s*/\s*A\b[^>]*>)#<TAG ID="${\push @array,$1}">$1</TAG>#sgi;
0
 
LVL 5

Author Comment

by:Christian_Wenz
ID: 1205829
aargh! what a dumb idea to read the file line by line x-( Thanks for the obvious hint. (one question: do I really need the "local" and the {} brackets?)
I'll check your code; I suggest that you post it as an answer now.
0
 
LVL 84

Expert Comment

by:ozo
ID: 1205830
>do I really need the "local" and the {} brackets?
for a small program like this, no.
but I think it's a good idea in general, so you don't inadvertently break code that expects $/ to be normal.
  $content = join'',<FILE>;
would be another way to do it
(the push was also simplified; I got unnecessarily complicated the first time)
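
To illustrate the scoping point, here is a minimal sketch. The sample data and the in-memory filehandle are assumptions for the demo and are not part of the original script; with a real file you would open $ARGV[0] as shown earlier.

```perl
use strict;
use warnings;

# Hypothetical sample data; an in-memory filehandle stands in for a real file.
my $data = "line one\nline two\nline three\n";

open(my $fh, '<', \$data) or die "can't open in-memory file: $!";
my $content;
{
    local $/ = undef;    # slurp mode, but only inside this block
    $content = <$fh>;    # reads the whole file as one string
}
close($fh);

# Outside the block $/ is back to "\n", so line-by-line reads still work.
open($fh, '<', \$data) or die "can't reopen in-memory file: $!";
my $first = <$fh>;
close($fh);

print "slurped ", length($content), " characters\n";
print "first line: $first";
```

Without the braces and "local", $/ would stay undefined for the rest of the program, and the second read would return the whole file instead of one line.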
0
 
LVL 5

Author Comment

by:Christian_Wenz
ID: 1205831
sorry, I still have a question, but this is very easy.
as you know, all <A ...>xxx</A> elements of the page are stored in @array.
I now want to replace all occurrences of HREF="..." within the <A> tag with a special HREF stored in $href. If the A tag has no HREF, $href shall be added to the tag.
Here's my code:

  $array =~ s/\s/ /g; # <A\nHREF works in HTML, so replace all whitespace with spaces
  if ($array =~ (/HREF/i)) { $array =~ s/HREF\s*=\s*['"].*['"]/$href/i ;} else {$array =~ s/>/ $href>/ ;}
  $array =~ s/"/\\"/g; #escape all "

If the <A> tag has no HREF attribute, all works fine; however, if it does have an HREF attribute, all other attributes in this tag are overwritten x-(

<A HREF="xxx" STYLE="yyy"> turns to
<A HREF=\"---the value of $href---\">

TIA!
0
 
LVL 5

Author Comment

by:Christian_Wenz
ID: 1205832
forgot to mention: $href = 'HREF="xxx.htm"';
0
 
LVL 84

Expert Comment

by:ozo
ID: 1205833
Try changing
['"].*['"]
to
['"].*?['"]
(this won't work if the value contains a " inside '...' or a ' inside "...")
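
A small sketch of the difference, using a hypothetical tag modelled on the example in the question ($href is the value given above):

```perl
use strict;
use warnings;

my $href = 'HREF="xxx.htm"';                  # replacement attribute from the thread
my $tag  = '<A HREF="old.htm" STYLE="yyy">';  # hypothetical sample tag

# Greedy .* runs to the LAST quote in the string, swallowing STYLE="yyy".
(my $greedy = $tag) =~ s/HREF\s*=\s*['"].*['"]/$href/i;

# Non-greedy .*? stops at the FIRST quote after the opening one.
(my $lazy = $tag) =~ s/HREF\s*=\s*['"].*?['"]/$href/i;

print "greedy: $greedy\n";   # greedy: <A HREF="xxx.htm">
print "lazy:   $lazy\n";     # lazy:   <A HREF="xxx.htm" STYLE="yyy">
```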
0
 
LVL 5

Author Comment

by:Christian_Wenz
ID: 1205834
ok, thanks a lot for your help. Please grab your points now!
0
 
LVL 84

Accepted Solution

by:
ozo earned 100 total points
ID: 1205835
$content =~ s#(<\s*A\b[^>]*>.*?<\s*/\s*A\b[^>]*>)#<TAG ID="${\push @array,$1}">$1</TAG>#sgi;

$array =~ s/\s?HREF\s*=\s*['"][^'"]*['"]|(?=>)/ $href/i;
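
For reference, a sketch of how the two substitutions might fit together on a string. The sample HTML, the names @links and @rewritten, and the use of /e with a block (instead of the ${\push ...} trick) are assumptions made for readability; the patterns themselves are the ones above. The same caveats apply: comments, scripts and nested tags are not handled.

```perl
use strict;
use warnings;

# Hypothetical input; in the real script this would be the slurped file.
my $html = <<'HTML';
<P><A HREF="a.htm" STYLE="x">one</A> and <a name="top">two</a></P>
HTML

# Steps 1 and 2: save each complete link and wrap it in a numbered tag.
my @links;
$html =~ s{(<\s*A\b[^>]*>.*?<\s*/\s*A\b[^>]*>)}{
    push @links, $1;
    '<TAG ID="' . scalar(@links) . '">' . $1 . '</TAG>';
}sgie;

# Follow-up question: replace an existing HREF, or add one before the first ">".
my $href = 'HREF="xxx.htm"';
my @rewritten;
for my $link (@links) {
    (my $copy = $link) =~ s/\s?HREF\s*=\s*['"][^'"]*['"]|(?=>)/ $href/i;
    push @rewritten, $copy;
}

print $html;
print "$_\n" for @rewritten;
```

The alternation works because the regex engine takes the leftmost match: an HREF inside the opening tag is found before the position just in front of ">", so the empty lookahead branch only fires when the tag has no HREF.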
0
 
LVL 5

Author Comment

by:Christian_Wenz
ID: 1205836
thank you very much for your help!
0
 
LVL 5

Author Comment

by:Christian_Wenz
ID: 1205837
Hi ozo,

if you are reading this: I have now found the time to have a closer look at the script I wrote using your code. Then, the following questions emerged. If you could answer them for 100 points, I'd then open a "grab your points" question.

I am especially curious about this regular expression:
s#(<\s*A\b[^>]*>.*?<\s*/\s*A\b[^>]*>)#<ILAYER ID="link${\push @array,$1}">$1</ILAYER>#sgi

a) what are the word boundaries \b for?
b) why did you "escape" the push with "\"?
c) how exactly does the "link${...}" construct work? I understand that you put the link into the array @array and place the number of array elements after the "link". However, how exactly do you do this?

Thanks in advance!
0
 
LVL 84

Expert Comment

by:ozo
ID: 1205838
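a) I didn't want to match, say, AUTHOR or APPLET
b) to create a reference to the value push returns (the new number of elements in @array)
c) ${...} is the scalar referenced by ..., so inside the string it interpolates that number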
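
A short sketch of all three points. The sample text and variable names are made up for illustration:

```perl
use strict;
use warnings;

# a) Without \b, "<A" also matches the start of <APPLET ...>.
my $text   = '<APPLET code="x"></APPLET><A HREF="y.htm">y</A>';
my $loose  = () = $text =~ /<\s*A[^>]*>/gi;     # counts <APPLET ...> and <A ...>
my $strict = () = $text =~ /<\s*A\b[^>]*>/gi;   # counts only <A ...>
print "loose=$loose strict=$strict\n";

# b) and c) push returns the new element count; \ takes a reference to that
# value, and ${ ... } dereferences it inside the double-quoted string.
my @array;
my $first  = "link${\ push(@array, 'one')}";
my $second = "link${\ push(@array, 'two')}";
print "$first $second\n";
```

In the substitution the replacement part is treated like a double-quoted string, which is why the same ${\ ...} construct can run push there without the /e modifier.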
0
