Solved

Tricky parse from HTML...

Posted on 2000-02-29
Last Modified: 2010-03-05
I want to parse out all of the lines of code in an HTML document and present them as simple text.  So say I have this line of HTML text, pulled from an HTML doc:

"<LI><A HREF="http://www.mywebsite.com">My web site link!</A><BR>"

How could I parse this down to just "http://www.mywebsite.com" and "My web site link!"?

I tried setting a scalar variable to represent a single line of code, parsed from the HTML, but the s/// didn't work.  If $myVar is set to the above line of HTML, then shouldn't:

$_ = $myVar;
s/(\< | \<\/)\..\>//g; #or something close to that
$myVar = $_;

at least parse out the <LI>, the </A>, and the <BR>? (Or, for that matter, any tag in the <xx> or </xx> form?)  I get nothing back, no change to $myVar at all, even when I tried s/href/dogs/g;  What am I doing wrong?  And if anyone can clean up the code in general...  I think the problem is the quotes that come in with the HTML line.  Do I have to break it all down, escape the quotes, put it all back together, and then parse?
Question by:Raydot
15 Comments
 

Expert Comment

by:Kim Ryan
If you can download and install the CPAN module HTML::Parse, the following should help.

use HTML::FormatText;
use HTML::Parse;

$html = parse_htmlfile($file_name);
$formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 80);
$ascii = $formatter->format($html);

Expert Comment

by:PC_User321
The substitution works OK for me.
Here is a crude example that you can build on:

$Line = "\"<LI><A HREF=\"http://www.mywebsite.com\">My web site link!</A><BR>\"";
$Line =~ s!<[^="]+!!g;
$Line =~ s!>! !g;
print $Line;

Its output is:
"="http://www.mywebsite.com" My web site link!"
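A possible explanation for the "no change at all" in the question, offered as a sketch rather than a diagnosis of the asker's actual script: without the /x modifier the spaces inside the pattern are literal, \.. asks for a literal dot followed by one more character, and s/href/dogs/g is case-sensitive while the sample line has HREF in capitals. The variable names below are made up for illustration.

```perl
use strict;
use warnings;

my $html = '<LI><A HREF="http://www.mywebsite.com">My web site link!</A><BR>';

# The pattern from the question: it needs "< " or " </" (with literal spaces),
# then a literal '.', then any one character, then '>'.
(my $attempt = $html) =~ s/(\< | \<\/)\..\>//g;

# Case-sensitive: 'href' does not match 'HREF'.
(my $lower = $html) =~ s/href/dogs/g;

# With /i the same substitution does match.
(my $nocase = $html) =~ s/href/dogs/gi;

print "attempt unchanged\n" if $attempt eq $html;
print "lower unchanged\n"   if $lower eq $html;
print "$nocase\n";
```

Both of the first two substitutions leave the line untouched; only the /i version rewrites HREF.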

Expert Comment

by:PC_User321
>> do I have to break it all down, escape the quotes, put it all back together, and then parse?.
>>
No.  In my example I only escaped the quotes so that I could accurately set $Line to what it should be.


Author Comment

by:Raydot
I think I didn't explain myself well enough.  I'm pulling the HTML from a list of links, so I don't have the luxury of pre-generating the line, as PC321 suggests.  Tera, I'll take a look at your answer...

Expert Comment

by:PC_User321
I don't suggest that you pregenerate the lines.  You presumably have read the HTML lines into a Perl scalar or array, and then you just process that.

Author Comment

by:Raydot
Oh...I see, actually.
This'll show my ignorance, but what does [^="]+ signify?  Wait, I get the plus, but not the brackets and not the ^=".

Thanks!

Expert Comment

by:PC_User321
$Line =~ s!<[^="]+!!g;

This line means:
1. Find a '<'
2. Immediately after that find the longest possible string consisting entirely of any characters except '=' or '"'  (the '^' means except)
3. If a string was found in step 2, then replace it and what was found in step 1, with nothing.

For more info see
http://www.perl.com/pub/doc/manual/html/pod/perlre.html
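To see the three steps above in action, here is a small sketch that captures what the pattern would remove instead of deleting it. The sample line is the one from the question (without the outer quotes); @removed is just an illustrative name.

```perl
use strict;
use warnings;

my $line = '<LI><A HREF="http://www.mywebsite.com">My web site link!</A><BR>';

# [^="] is a negated character class: any one character that is
# neither '=' nor '"'.  The '+' repeats it as many times as possible,
# so '<[^="]+' runs from a '<' up to (not including) the next '=' or '"'.
my @removed = $line =~ /(<[^="]+)/g;

print "would remove: $_\n" for @removed;
```

Note that the first match runs straight through <LI> and into the <A tag, because nothing in the pattern stops at '>'.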

Accepted Solution

by:PC_User321 (earned 150 total points)
Stimulated by your question, I had a bit of a play with 'cleaning up' HTML files, and came up with the script below.
It is not strictly relevant to your problem, but you might find it interesting.  It works quite well.

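# Read the whole file into one string.
open(F, "C:\\Apps\\Perl\\html\\ppmproxy.htm") or die "Cannot open file: $!";
@AllLines = <F>;
close(F);
$OneBigLine = join '', @AllLines;
# Remove everything between a '<' and the next '>'.
# [^>] also matches newlines, so tags that span several lines are
# removed too; the 's' modifier only affects '.' and is harmless here.
$OneBigLine =~ s!<[^>]+>!!gs;
# Remove blank lines (this also strips leading whitespace from each line).
$OneBigLine =~ s!\n\s*!\n!g;
print $OneBigLine;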
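The same idea can be tried without a file on disk. The sketch below wraps it in a helper (strip_tags is a hypothetical name, not part of any module) and runs it on a made-up sample string that includes a tag split across two lines. It deliberately omits the 's' modifier to show that a negated class like [^>] crosses newlines on its own, and it uses a slightly different blank-line pattern that leaves indentation alone. The usual caveats about regexes and HTML (comments, '>' inside attribute values) still apply.

```perl
use strict;
use warnings;

# Hypothetical helper: strip anything that looks like a tag.
sub strip_tags {
    my ($text) = @_;
    $text =~ s/<[^>]+>//g;     # [^>] matches newlines too, so no /s is needed
    $text =~ s/\n\s*\n/\n/g;   # collapse runs of blank lines into one newline
    return $text;
}

# Made-up sample; the <a ...> tag spans two lines.
my $sample = "<p>Hello\n<a\n href=\"x.html\">there</a></p>\n\n\n<b>bye</b>\n";
my $plain  = strip_tags($sample);

print $plain;
```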

Expert Comment

by:ozo
perldoc -q "How do I remove HTML from a string"

Expert Comment

by:PC_User321
Good comment, ozo.
There are lots of traps in trying to decode HTML.

Author Comment

by:Raydot
OK PC, I get it.  One last question.  What would you suggest for parsing text from between tags, like everything between <b> and </b>?

Author Comment

by:Raydot
You were right on the money.  Just do me a favor and answer my last question before we all check out...also, you never told me what [ ] is used for...

Expert Comment

by:ozo
@bold = $text =~ m(<b>(.*?)</b>)gs;  #with the caveats mentioned in the FAQ, use HTML::Parser if you want to do it reliably

[] are used for character classes in regular expressions, see
perldoc perlre
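The same list-context match can be stretched to pull out the URL and link text asked about in the original question. This is only a sketch: it assumes the HREF value is double-quoted and is the first attribute of the tag, the second link (www.example.com) is invented for the example, and the FAQ caveats about parsing HTML with regexes still hold.

```perl
use strict;
use warnings;

my $text = '<LI><A HREF="http://www.mywebsite.com">My web site link!</A><BR>'
         . '<LI><A HREF="http://www.example.com">Another <b>bold</b> link</A><BR>';

# List-context m//g returns every captured group, as in the <b> example.
my @bold = $text =~ m{<b>(.*?)</b>}gis;

# Two captures per match: the quoted HREF value and the text up to </A>.
my @links;
while ($text =~ m{<A\s+HREF="([^"]*)"[^>]*>(.*?)</A>}gis) {
    push @links, { url => $1, label => $2 };
}

print "bold: @bold\n";
print "$_->{url} => $_->{label}\n" for @links;
```

The /i modifier lets the pattern match <a href=...> as well as <A HREF=...>, and the non-greedy .*? stops at the first closing tag instead of the last one.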

Expert Comment

by:PC_User321
By the way, in my first post the line
   $Line =~ s!<[^="]+!!g;
should be replaced with
   $Line =~ s!<[^>="]+[>="]!!g;
for safety.

(with the caveats mentioned in the FAQ, as ozo says)
Have fun in the quagmire :)
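A short sketch of what the "for safety" change buys. The first input line is a hypothetical one (a tag with no attributes followed by text containing '='); the second is the sample line from the question.

```perl
use strict;
use warnings;

# Hypothetical input: a tag without attributes, followed by text containing '='.
my $line = '<B>Title</B> x=1';

(my $first   = $line) =~ s!<[^="]+!!g;        # pattern from the first post
(my $revised = $line) =~ s!<[^>="]+[>="]!!g;  # revised pattern

print "first:   $first\n";
print "revised: $revised\n";

# The sample line from the question, using the revised pattern.
my $link = '<LI><A HREF="http://www.mywebsite.com">My web site link!</A><BR>';
$link =~ s!<[^>="]+[>="]!!g;
$link =~ s!>! !g;
print "$link\n";
```

The first pattern swallows "Title" along with the tags because nothing stops it at '>', leaving only "=1". The revised pattern stops at the end of each tag and keeps "Title x=1". On the sample line it leaves "http://www.mywebsite.com" (still in its quotes) followed by the link text.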