Solved

Tricky parse from HTML...

Posted on 2000-02-29
Last Modified: 2010-03-05
I want to parse out all of the lines of code in an HTML document and present them as simple text.  So say I have this line of HTML text, pulled from an HTML doc:

"<LI><A HREF="http://www.mywebsite.com">My web site link!</A><BR>"

How could I parse this down to just "http://www.mywebsite.com" and "My web site link!"?

I tried setting a scalar variable to represent a single line of code, parsed from the HTML, but the s/// didn't work.  If $myVar is set to the above line of HTML, then shouldn't:

$_ = $myVar;
s/(\< | \<\/)\..\>//g; #or something close to that
$myVar = $_;

at least strip out the <LI>, the </A>, and the <BR>? (Or, for that matter, any tag of the form <xx> or </xx>?)  I get nothing back, no change to $myVar at all, even when I tried s/href/dogs/g;  What am I doing wrong?  And if anyone can clean up the code in general...  I think the problem is the quotes that come in with the HTML line.  Do I have to break it all down, escape the quotes, put it all back together, and then parse?
Question by:Raydot
15 Comments
 
LVL 19

Expert Comment

by:Kim Ryan
ID: 2570985
If you can download and install the CPAN modules HTML::Parse and HTML::FormatText, the following should help.

use HTML::FormatText;
use HTML::Parse;

# Parse the file into a tree, then render the tree as plain text.
$html = parse_htmlfile($file_name);
$formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 80);
$ascii = $formatter->format($html);
print $ascii;
 
LVL 5

Expert Comment

by:PC_User321
ID: 2570996
The substitution works OK for me.
Here is a crude example that you can build on:

$Line = "\"<LI><A HREF=\"http://www.mywebsite.com\">My web site link!</A><BR>\"";
$Line =~ s!<[^="]+!!g;
$Line =~ s!>! !g;
print $Line;

Its output is:
"="http://www.mywebsite.com" My web site link!"
 
LVL 5

Expert Comment

by:PC_User321
ID: 2571007
>> do I have to break it all down, escape the quotes, put it all back together, and then parse?.
>>
No.  In my example I only escaped the quotes so that I could accurately set $Line to what it should be.

 
LVL 3

Author Comment

by:Raydot
ID: 2571191
I think I didn't explain myself well enough.  I'm pulling the HTML from a list of links, so I don't have the luxury of pre-generating the line, as PC321 suggests.  Tera, I'll take a look at your answer...
 
LVL 5

Expert Comment

by:PC_User321
ID: 2571224
I don't suggest that you pregenerate the lines.  You presumably have read the HTML lines into a Perl scalar or array, and then you just process that.
 
LVL 3

Author Comment

by:Raydot
ID: 2571274
Oh...I see, actually.
This'll show my ignorance, but what does [^="]+ signify?  Wait, I get the plus, but not the brackets and not the ^=".

Thanks!
 
LVL 5

Expert Comment

by:PC_User321
ID: 2571300
$Line =~ s!<[^="]+!!g;

This line means:
1. Find a '<'
2. Immediately after that find the longest possible string consisting entirely of any characters except '=' or '"'  (the '^' means except)
3. If a string was found in step 2, then replace it and what was found in step 1, with nothing.

For more info see
http://www.perl.com/pub/doc/manual/html/pod/perlre.html
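
As a small illustration of the same negated-character-class idea, here is a hedged sketch that pulls out the two pieces asked for in the original question (the URL and the link text). It assumes the HREF value is always in double quotes and that the link text contains no nested </A>; the variable names are only for illustration, and it is not a general HTML parser.

```perl
use strict;
use warnings;

# Sample line taken from the question.
my $line = '<LI><A HREF="http://www.mywebsite.com">My web site link!</A><BR>';

# [^"]* matches everything up to the closing quote of the HREF value,
# [^>]* skips any further attributes, and (.*?) lazily captures the link text.
my ($url, $text) = $line =~ m{<A\s+HREF="([^"]*)"[^>]*>(.*?)</A>}is;

if (defined $url) {
    print "$url\n";    # http://www.mywebsite.com
    print "$text\n";   # My web site link!
}
```

The /i flag makes the match case-insensitive, so <a href="..."> is handled as well.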
 
LVL 5

Accepted Solution

by:
PC_User321 earned 150 total points
ID: 2571563
Stimulated by your question, I had a bit of a play with 'cleaning up' HTML files, and came up with the script below.
It is not strictly relevant to your problem, but you might find it interesting.  It works quite well.

# Read the whole file into one string (the path is just an example).
open(F, "C:\\Apps\\Perl\\html\\ppmproxy.htm") or die "Cannot open file: $!";
@AllLines = <F>;
close(F);
$OneBigLine = join '', @AllLines;
# Remove everything between a '<' and the next '>', even if it spans
# several lines: [^>] matches newlines too, so the 's' flag is not needed.
$OneBigLine =~ s!<[^>]+>!!g;
# Collapse blank lines (this also strips leading whitespace from each line)
$OneBigLine =~ s!\n\s*!\n!g;
print $OneBigLine;
 
LVL 84

Expert Comment

by:ozo
ID: 2571621
perldoc -q "How do I remove HTML from a string"
 
LVL 5

Expert Comment

by:PC_User321
ID: 2571701
Good comment, ozo.
There are lots of traps in trying to decode HTML.
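
One such trap, of the kind the FAQ entry warns about, is a '>' inside a quoted attribute value. The sketch below uses a made-up input line (the file name and ALT text are hypothetical) to show how a simple tag-stripping substitution stops too early:

```perl
use strict;
use warnings;

# Hypothetical input: the ALT attribute contains a literal '>'.
my $html = '<IMG SRC="arrow.gif" ALT="a > b">Next page';

# Naive tag stripper: delete from '<' up to the first '>'.
(my $stripped = $html) =~ s/<[^>]+>//g;

# The match ends at the '>' inside the ALT value, so part of the tag survives.
print "[$stripped]\n";   # [ b">Next page]
```

A real parser tracks quoting inside tags, which is why the regex approach is only safe for HTML you control.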
 
LVL 3

Author Comment

by:Raydot
ID: 2573750
OK PC, I get it.  One last question.  What would you suggest for parsing text from between tags, like everything between <b> and </b>?
 
LVL 3

Author Comment

by:Raydot
ID: 2574207
You were right on the money.  Just do me a favor and answer my last question before we all check out...also, you never told me what [ ] is used for...
 
LVL 84

Expert Comment

by:ozo
ID: 2574282
@bold = $text =~ m(<b>(.*?)</b>)gs;  #with the caveats mentioned in the FAQ, use HTML::Parser if you want to do it reliably

[] are used for character classes in regular expressions, see
perldoc perlre
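
A minimal runnable sketch of the list-context match above, using a made-up $text value (the sample string is an assumption, not from the thread):

```perl
use strict;
use warnings;

# Hypothetical sample text; the second bold span crosses a line break.
my $text = "Some <b>bold</b> words and <b>more\nbold</b> text.";

# In list context, a /g match returns every captured group.
# /s lets '.' match newlines; '.*?' stops at the first closing tag.
my @bold = $text =~ m{<b>(.*?)</b>}gs;

print scalar(@bold), " matches\n";   # 2 matches
print "[$_]\n" for @bold;
```

As written the pattern is case-sensitive; adding the /i flag would also catch <B>...</B>.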
 
LVL 5

Expert Comment

by:PC_User321
ID: 2574361
By the way, in my first post the line
   $Line =~ s!<[^="]+!!g;
should be replaced with
   $Line =~ s!<[^>="]+[>="]!!g;
for safety.

(with the caveats mentioned in the FAQ, as ozo says)
Have fun in the quagmire :)
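
To see why the revised pattern is safer, here is a hedged comparison on a made-up line that has no attributes, and therefore no '=' or '"' to stop the first pattern (the sample string and variable names are assumptions):

```perl
use strict;
use warnings;

# Hypothetical input with no attributes, so it contains no '=' or '"'.
my $sample = '<B>Hello</B> world';

# First version: nothing stops the match at '>', so it runs to the end.
(my $first = $sample) =~ s!<[^="]+!!g;

# Revised version: '>' is excluded from the class, and one of '>', '=' or '"'
# is required as the terminator, so each match ends with the tag.
(my $revised = $sample) =~ s!<[^>="]+[>="]!!g;

print "first:   '$first'\n";     # first:   ''
print "revised: '$revised'\n";   # revised: 'Hello world'
```

The first version swallows the text along with the tags; the revised one removes only the tags.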