Solved

Tricky parse from HTML...

Posted on 2000-02-29
Last Modified: 2010-03-05
I want to parse out all of the lines of code in an HTML document and present them as simple text.  So say I have this line of HTML text, pulled from an HTML doc:

"<LI><A HREF="http://www.mywebsite.com">My web site link!</A><BR>"

How could I parse this down to just "http://www.mywebsite.com" and "My web site link!"?

I tried setting a scalar variable to represent a single line of code, parsed from the HTML, but the s/// didn't work.  If $myVar is set to the above line of HTML, then shouldn't:

$_ = $myVar;
s/(\< | \<\/)\..\>//g; #or something close to that
$myVar = $_;

at least parse out the <LI>, the </A>, and the <BR>? (Or, for that matter, any tag in the <xx> or </xx> form?)  I get nothing back, no change to $myVar at all, even when I tried s/href/dogs/g;  What am I doing wrong?  And if anyone can clean up the code in general...  I think the problem is the quotes that come in with the HTML line.  Do I have to break it all down, escape the quotes, put it all back together, and then parse?
Question by:Raydot
15 Comments
 
LVL 19

Expert Comment

by:Kim Ryan
ID: 2570985
If you can download and install the CPAN modules HTML::Parse and HTML::FormatText, the following should help.

use HTML::FormatText;
use HTML::Parse;

$html = parse_htmlfile($file_name);
$formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 80);
$ascii = $formatter->format($html);
 
LVL 5

Expert Comment

by:PC_User321
ID: 2570996
The substitution works OK for me.
Here is a crude example that you can build on:

$Line = "\"<LI><A HREF=\"http://www.mywebsite.com\">My web site link!</A><BR>\"";
$Line =~ s!<[^="]+!!g;
$Line =~ s!>! !g;
print $Line;

Its output is:
"="http://www.mywebsite.com" My web site link!"
 
LVL 5

Expert Comment

by:PC_User321
ID: 2571007
>> do I have to break it all down, escape the quotes, put it all back together, and then parse?.
>>
No.  In my example I only escaped the quotes so that I could accurately set $Line to what it should be.
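
To get just the URL and the link text from a line like the one in the question, a capturing match may be easier than deleting tags. The following is a minimal sketch, not a general HTML parser: it assumes a double-quoted HREF that is the first attribute of the <A> tag, and the variable names are made up for illustration. The /i modifier matters because the sample uses uppercase HREF, which a lowercase pattern such as s/href/dogs/g would not match.

```perl
use strict;
use warnings;

# The sample line from the question, quotes included. Single quotes mean
# the embedded double quotes need no escaping.
my $line = '"<LI><A HREF="http://www.mywebsite.com">My web site link!</A><BR>"';

# Capture the double-quoted HREF value and the text up to the closing </A>.
# /i ignores case in tag and attribute names; /s lets the link text span lines.
my ($url, $text);
if ($line =~ m{<A\s+HREF="([^"]*)"[^>]*>(.*?)</A>}is) {
    ($url, $text) = ($1, $2);
    print "$url\n$text\n";
}
```

When the line comes from a file or a list instead of a literal, the same match applies to whatever scalar holds it; no quoting or escaping step is needed.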

 
LVL 3

Author Comment

by:Raydot
ID: 2571191
I think I didn't explain myself well enough.  I'm pulling the HTML from a list of links, so I don't have the luxury of pre-generating the line, as PC321 suggests.  Tera, I'll take a look at your answer...
 
LVL 5

Expert Comment

by:PC_User321
ID: 2571224
I don't suggest that you pregenerate the lines.  You presumably have read the HTML lines into a Perl scalar or array, and then you just process that.
 
LVL 3

Author Comment

by:Raydot
ID: 2571274
Oh...I see, actually.
This'll show my ignorance, but what does [^="]+ signify?  Wait, I get the plus, but not the brackets and not the ^=".

Thanks!
 
LVL 5

Expert Comment

by:PC_User321
ID: 2571300
$Line =~ s!<[^="]+!!g;

This line means:
1. Find a '<'
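2. Immediately after that, find the longest possible string consisting entirely of characters other than '=' or '"'.  (Square brackets define a set of characters; a '^' as the first character inside them means "any character except these".)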
3. If a string was found in step 2, then replace it and what was found in step 1, with nothing.

For more info see
http://www.perl.com/pub/doc/manual/html/pod/perlre.html
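
The steps above can be tried on a short string. This is a small sketch with a made-up sample value; it prints what the negated class actually matches, and what the same class matches without the leading '^'.

```perl
use strict;
use warnings;

my $sample = '<LI><A HREF="x">';

# [^="]+ : one or more characters that are neither '=' nor '"'.
# Matching starts at the first '<' and stops just before the '='.
my ($matched) = $sample =~ /(<[^="]+)/;
print "matched: $matched\n";    # matched: <LI><A HREF

# Without the leading '^' the class is positive: only '=' or '"' qualify.
my ($positive) = $sample =~ /([="]+)/;
print "positive: $positive\n";  # positive: ="
```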

 
LVL 5

Accepted Solution

by:
PC_User321 earned 150 total points
ID: 2571563
Stimulated by your question, I had a bit of a play with 'cleaning up' HTML files, and came up with the script below.
It is not strictly relevant to your problem, but you might find it interesting.  It works quite well.

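open(F, "C:\\Apps\\Perl\\html\\ppmproxy.htm") or die "Cannot open file: $!";
@AllLines = <F>;
close(F);
$OneBigLine = join '', @AllLines;
# Remove everything between a '<' and the next '>',
# even if it spans several lines.  ([^>] matches newlines too,
# so the 's' modifier is harmless here but not actually required.)
$OneBigLine =~ s!<[^>]+>!!gs;
# Remove blank lines (this also strips leading whitespace from the following line)
$OneBigLine =~ s!\n\s*!\n!g;
print $OneBigLine;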
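
The two substitutions can be checked without a file. The sketch below runs them on a hypothetical in-memory string containing a tag split across two lines; the sample text and variable names are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical HTML with an <A> tag broken across two lines.
my $html = "<P>Hello <A\nHREF=\"page.htm\">world</A></P>\n\n   <BR>\nBye\n";

# Strip tags. [^>] is a negated class, so it matches "\n" as well,
# which is why the multi-line tag is removed even without /s.
(my $plain = $html) =~ s!<[^>]+>!!g;

# Collapse each newline plus any following whitespace into one newline.
$plain =~ s!\n\s*!\n!g;

print $plain;    # prints "Hello world" and "Bye" on separate lines
```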
 
LVL 84

Expert Comment

by:ozo
ID: 2571621
perldoc -q "How do I remove HTML from a string"
 
LVL 5

Expert Comment

by:PC_User321
ID: 2571701
Good comment, ozo.
There are lots of traps in trying to decode HTML.
 
LVL 3

Author Comment

by:Raydot
ID: 2573750
OK PC, I get it.  One last question.  What would you suggest for parsing text from between tags, like everything between <b> and </b>?
 
LVL 3

Author Comment

by:Raydot
ID: 2574207
You were right on the money.  Just do me a favor and answer my last question before we all check out...also, you never told me what [ ] is used for...
 
LVL 84

Expert Comment

by:ozo
ID: 2574282
@bold = $text =~ m(<b>(.*?)</b>)gs;  #with the caveats mentioned in the FAQ, use HTML::Parser if you want to do it reliably

[] are used for character classes in regular expressions, see
perldoc perlre
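
A slightly expanded sketch of the same match, with made-up sample text. It adds /i so that <B> is accepted as well as <b>, which is an addition to the one-liner above. It still assumes bare <b> tags with no attributes and no nesting, so the FAQ caveats apply.

```perl
use strict;
use warnings;

my $text = "Plain <b>first</b> and <B>second\nline</B> and <b>third</b>.";

# Non-greedy (.*?) stops at the nearest closing tag; /g collects every match,
# /i accepts either case of the tag name, and /s lets '.' match newlines.
my @bold = $text =~ m{<b>(.*?)</b>}gis;

print scalar(@bold), " matches\n";
print "[$_]\n" for @bold;
```

In list context with /g, the match returns every captured group, so @bold ends up holding only the text between the tags.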
 
LVL 5

Expert Comment

by:PC_User321
ID: 2574361
By the way, in my first post the line
   $Line =~ s!<[^="]+!!g;
should be replaced with
   $Line =~ s!<[^>="]+[>="]!!g;
for safety.

(with the caveats mentioned in the FAQ, as ozo says)
Have fun in the quagmire :)
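
One way to see the difference between the two patterns is to feed both a string with a literal '<' that is not part of any tag. This is a small sketch with a made-up input: the first pattern deletes everything after the '<', while the second leaves the text alone because it never finds a terminating '>', '=' or '"'.

```perl
use strict;
use warnings;

# Hypothetical text containing a '<' that does not start a tag.
my $input = 'x < y and z';

(my $loose = $input) =~ s!<[^="]+!!g;         # pattern from the first post
(my $safer = $input) =~ s!<[^>="]+[>="]!!g;   # corrected pattern

print "loose: [$loose]\n";   # loose: [x ]
print "safer: [$safer]\n";   # safer: [x < y and z]
```

Neither pattern is a real parser; if a genuine tag follows the stray '<', the second one will still swallow the text in between.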
Question has a verified solution.
