Tricky parse from HTML...

I want to parse out all of the lines of code in an HTML document and present them as simple text.  So say I have this line of HTML text, pulled from an HTML doc:

"<LI><A HREF="http://www.mywebsite.com">My web site link!</A><BR>"

How could I parse this down to just "http://www.mywebsite.com" and "My web site link!"?

I tried setting a scalar variable to represent a single line of code, parsed from the HTML, but the s/// didn't work.  If $myVar is set to the above line of HTML, then shouldn't:

$_ = $myVar;
s/(\< | \<\/)\..\>//g; #or something close to that
$myVar = $_;

at least parse out the <LI>, the </A>, and the <BR>? (Or, for that matter, any tag of the form <xx> or </xx>?)  I get nothing back, no change to $myVar at all, even when I tried s/href/dogs/g;  What am I doing wrong?  And if anyone can clean up the code in general...  I think the problem is the quotes that come in with the HTML line.  Do I have to break it all down, escape the quotes, put it all back together, and then parse?
Asked by: Raydot
 
PC_User321 commented:
Stimulated by your question, I had a bit of a play with 'cleaning up' HTML files, and came up with the script below.
It is not strictly relevant to your problem, but you might find it interesting.  It works quite well.

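open(F, "<", "C:\\Apps\\Perl\\html\\ppmproxy.htm") or die "Cannot open file: $!";
@AllLines = <F>;
close(F);
$OneBigLine = join '', @AllLines;
# Remove everything between a '<' and the next '>', even if it spans
# several lines. ([^>] matches newlines by itself; the 's' flag only
# affects '.', so it is harmless but not required here.)
$OneBigLine =~ s!<[^>]+>!!gs;
# Remove blank lines (this also strips leading whitespace from each line)
$OneBigLine =~ s!\n\s*!\n!g;
print $OneBigLine;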
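
For anyone who wants to try those two substitutions without a file on disk, here is a minimal sketch that runs them on an in-memory string. The sample HTML and the variable names are made up for illustration. It also shows that a negated class such as [^>] matches newlines even without the 's' flag, so a tag split across two lines is still removed.

```perl
use strict;
use warnings;

# Hypothetical sample: a tag split across two lines, plus some blank lines.
my $html = "<P\n   ALIGN=\"left\">First paragraph\n\n\n<B>Second</B> line\n";

# No 's' flag here: a negated class such as [^>] matches newlines anyway.
(my $text = $html) =~ s/<[^>]+>//g;

# Collapse each newline plus any following whitespace into a single newline.
$text =~ s/\n\s*/\n/g;

print $text;
# prints:
# First paragraph
# Second line
```

This is only a sketch for simple, well-behaved markup; a '>' inside an attribute value or a comment would confuse it.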
 
Kim Ryan (IT Consultant) commented:
If you can download and install the CPAN modules HTML::Parse and HTML::FormatText, the following should help.

use HTML::FormatText;
use HTML::Parse;

# $file_name holds the path to the HTML file you want to convert
$html = parse_htmlfile($file_name);
$formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 80);
$ascii = $formatter->format($html);
print $ascii;
 
PC_User321 commented:
The substitution works OK for me.
Here is a crude example that you can build on:

$Line = "\"<LI><A HREF=\"http://www.mywebsite.com\">My web site link!</A><BR>\"";
$Line =~ s!<[^="]+!!g;
$Line =~ s!>! !g;
print $Line;

Its output is:
"="http://www.mywebsite.com" My web site link!"
 
PC_User321 commented:
>> do I have to break it all down, escape the quotes, put it all back together, and then parse?.
>>
No.  In my example I only escaped the quotes so that I could accurately set $Line to what it should be.

 
Raydot (author) commented:
I think I didn't explain myself well enough.  I'm pulling the HTML from a list of links, so I don't have the luxury of pre-generating the line, as PC321 suggests.  Tera, I'll take a look at your answer...
 
PC_User321 commented:
I don't suggest that you pregenerate the lines.  You presumably have read the HTML lines into a Perl scalar or array, and then you just process that.
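
As a minimal sketch of what that looks like: the lines below sit in an array as if they had just been read from a file, and each one is run through a tag-stripping substitution. The sample data and the names @html_lines and strip_tags are made up for illustration. No quotes need escaping, because the HTML is data in a variable, not Perl source.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for lines read with <FILEHANDLE>.
my @html_lines = (
    qq{<LI><A HREF="http://www.mywebsite.com">My web site link!</A><BR>\n},
    qq{<LI><A HREF="http://www.example.com">Another link</A><BR>\n},
);

# strip_tags is an illustrative helper, not part of any module.
# It removes anything that looks like <...> from a copy of the line.
sub strip_tags {
    my ($line) = @_;
    $line =~ s/<[^>]+>//g;
    return $line;
}

my @text_lines = map { strip_tags($_) } @html_lines;
print @text_lines;
# prints:
# My web site link!
# Another link
```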
 
Raydot (author) commented:
Oh...I see, actually.
This'll show my ignorance, but what does [^="]+ signify?  Wait, I get the plus, but not the brackets and not the ^=".

Thanks!
 
PC_User321 commented:
$Line =~ s!<[^="]+!!g;

This line means:
1. Find a '<'
2. Immediately after that find the longest possible string consisting entirely of any characters except '=' or '"'  (square brackets define a set of characters, and a '^' at the start of the set means "anything except these")
3. If a string was found in step 2, then replace it and what was found in step 1, with nothing.

For more info see
http://www.perl.com/pub/doc/manual/html/pod/perlre.html
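
The three steps above can be watched in action with a small sketch. It applies the same pattern to the sample line from the question (without the surrounding quotes), first listing what each match grabs and then showing what the substitution leaves behind. The variable names are just for illustration.

```perl
use strict;
use warnings;

my $line = '<LI><A HREF="http://www.mywebsite.com">My web site link!</A><BR>';

# Collect each chunk the pattern matches: a '<' followed by the longest
# possible run of characters that are neither '=' nor '"'.
my @matched = $line =~ m!(<[^="]+)!g;
print "matched: $_\n" for @matched;
# matched: <LI><A HREF
# matched: </A><BR>

# The substitution deletes exactly those chunks.
(my $result = $line) =~ s!<[^="]+!!g;
print "left:    $result\n";
# left:    ="http://www.mywebsite.com">My web site link!
```

Note how the first match runs straight through the '>' of <LI>, because '>' is not one of the excluded characters.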
 
ozo commented:
perldoc -q "How do I remove HTML from a string"
 
PC_User321 commented:
Good comment, ozo.
There are lots of traps in trying to decode HTML.
 
Raydot (author) commented:
OK PC, I get it.  One last question.  What would you suggest for parsing text from between tags, like everything between <b> and </b>?
 
Raydot (author) commented:
You were right on the money.  Just do me a favor and answer my last question before we all check out...also, you never told me what [ ] is used for...
 
ozo commented:
@bold = $text =~ m(<b>(.*?)</b>)gs;
# with the caveats mentioned in the FAQ; use HTML::Parser if you want to do it reliably

[] are used for character classes in regular expressions, see
perldoc perlre
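
The same capture-between-tags idea can be pointed at the link in the original question. This is only a sketch under some assumptions: the HREF value is double-quoted, HREF is the first attribute, and the link does not contain nested <A> tags. The variable names are made up, and the FAQ caveats apply just as much here.

```perl
use strict;
use warnings;

my $text = '<LI><A HREF="http://www.mywebsite.com">My web site link!</A><BR>';

# Capture the quoted HREF value and the text up to the closing </A>.
# [^"]* is a character class: any run of characters that are not '"'.
# The 'i' flag accepts <a href=...> as well as <A HREF=...>.
my @links;
while ($text =~ m{<a\s+href="([^"]*)"[^>]*>(.*?)</a>}gis) {
    push @links, [$1, $2];
}

print "$_->[0] => $_->[1]\n" for @links;
# prints: http://www.mywebsite.com => My web site link!
```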
 
PC_User321 commented:
By the way, in my first post the line
   $Line =~ s!<[^="]+!!g;
should be replaced with
   $Line =~ s!<[^>="]+[>="]!!g;
for safety.

(with the caveats mentioned in the FAQ, as ozo says)
Have fun in the quagmire :)
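
To see what "for safety" buys you, here is a small sketch comparing the two patterns on a made-up string that contains no '=' or '"' at all. With nothing to stop it, the original pattern swallows the text along with the tags; the corrected one stops at each '>'.

```perl
use strict;
use warnings;

# Hypothetical input with no '=' or '"' anywhere in it.
my $html = '<B>Hello</B> world';

# Original pattern: nothing stops the character class at '>', so it
# runs on to the next '=' or '"' (here, the end of the string).
(my $before = $html) =~ s!<[^="]+!!g;

# Corrected pattern: the class also excludes '>', and the match must
# end on one of '>', '=' or '"'.
(my $after = $html) =~ s!<[^>="]+[>="]!!g;

print "original pattern:  '$before'\n";   # original pattern:  ''
print "corrected pattern: '$after'\n";    # corrected pattern: 'Hello world'
```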
Question has a verified solution.
