Solved

Tricky parse from HTML...

Posted on 2000-02-29
Last Modified: 2010-03-05
I want to parse out all of the lines of code in an HTML document and present them as simple text.  So say I have this line of HTML text, pulled from an HTML doc:

"<LI><A HREF="http://www.mywebsite.com">My web site link!</A><BR>"

How could I parse this down to just "http://www.mywebsite.com" and "My web site link!"?

I tried setting a scalar variable to represent a single line of code, parsed from the HTML, but the s/// didn't work.  If $myVar is set to the above line of HTML, then shouldn't:

$_ = $myVar;
s/(\< | \<\/)\..\>//g; #or something close to that
$myVar = $_;

at least parse out the <LI>, the </A>, and the <BR>? (Or, for that matter, any tag in the <xx> or </xx> form?)  I get nothing back, no change to $myVar at all, even when I tried s/href/dogs/g;  What am I doing wrong?  And if anyone can clean up the code in general...  I think the problem is the quotes that come in with the HTML line.  Do I have to break it all down, escape the quotes, put it all back together, and then parse?
Question by:Raydot
15 Comments
 
LVL 19

Expert Comment

by:Kim Ryan
ID: 2570985
If you can download and install the CPAN modules HTML::Parse and HTML::FormatText, the following should help.

use HTML::FormatText;
use HTML::Parse;

# Parse the file into a tree, then render the tree as plain text.
$html = parse_htmlfile($file_name);
$formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 80);
$ascii = $formatter->format($html);
print $ascii;
0
 
LVL 5

Expert Comment

by:PC_User321
ID: 2570996
The substitution works OK for me.
Here is a crude example that you can build on:

$Line = "\"<LI><A HREF=\"http://www.mywebsite.com\">My web site link!</A><BR>\"";
$Line =~ s!<[^="]+!!g;
$Line =~ s!>! !g;
print $Line;

Its output is:
"="http://www.mywebsite.com" My web site link!"
0
 
LVL 5

Expert Comment

by:PC_User321
ID: 2571007
>> do I have to break it all down, escape the quotes, put it all back together, and then parse?.
>>
No.  In my example I only escaped the quotes so that I could accurately set $Line to what it should be.
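To illustrate the point about quotes, here is a minimal sketch. It assumes the HTML arrives as data; the single-quoted string and the variable names are only stand-ins for a line read from a file. Quotes inside data need no escaping. A more likely reason the s/href/dogs/g test changed nothing is that the match is case-sensitive and the tag says HREF. Note also that spaces inside a pattern are literal unless the /x modifier is used.

```perl
use strict;
use warnings;

# Single quotes stand in for a line read from a file: no escaping needed.
my $line = '<LI><A HREF="http://www.mywebsite.com">My web site link!</A><BR>';

# Case-sensitive: "href" does not match "HREF", so nothing changes.
(my $unchanged = $line) =~ s/href/dogs/g;

# Adding /i makes the match case-insensitive, so HREF is replaced.
(my $changed = $line) =~ s/href/dogs/gi;

print "$unchanged\n$changed\n";
```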

0
 
LVL 3

Author Comment

by:Raydot
ID: 2571191
I think I didn't explain myself well enough.  I'm pulling the HTML from a list of links, so I don't have the luxury of pre-generating the line, as PC321 suggests.  Tera, I'll take a look at your answer...
0
 
LVL 5

Expert Comment

by:PC_User321
ID: 2571224
I don't suggest that you pregenerate the lines.  You presumably have read the HTML lines into a Perl scalar or array, and then you just process that.
0
 
LVL 3

Author Comment

by:Raydot
ID: 2571274
Oh...I see, actually.
This'll show my ignorance, but what does [^="]+ signify?  Wait, I get the plus, but not the brackets and not the ^=".

Thanks!
0
 
LVL 5

Expert Comment

by:PC_User321
ID: 2571300
$Line =~ s!<[^="]+!!g;

This line means:
1. Find a '<'
2. Immediately after that, find the longest possible string consisting entirely of characters other than '=' or '"'.  (The brackets define a set of characters; a '^' at the start of the set means "anything except these".)
3. If step 2 found at least one such character, delete the '<' from step 1 together with the string from step 2.

For more info see
http://www.perl.com/pub/doc/manual/html/pod/perlre.html
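The steps above can be checked by capturing what the pattern matches instead of deleting it. This is only a sketch; the variable names are illustrative and the sample line is the one from the question without its surrounding quotes.

```perl
use strict;
use warnings;

my $line = '<LI><A HREF="http://www.mywebsite.com">My web site link!</A><BR>';

# [^="]+ is a negated character class: one or more characters that are
# neither '=' nor '"'.  After a '<' it runs on until the first '=' or '"'
# (or the end of the string), so it can swallow several tags at once.
my @removed = $line =~ /(<[^="]+)/g;

print "would remove: $_\n" for @removed;
```

Here the first match runs from <LI> up to (but not including) the '=', and the second covers </A><BR>.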
0
 
LVL 5

Accepted Solution

by:
PC_User321 earned 150 total points
ID: 2571563
Stimulated by your question, I had a bit of a play with 'cleaning up' HTML files, and came up with the script below.
It is not strictly relevant to your problem, but you might find it interesting.  It works quite well.

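open(F, "C:\\Apps\\Perl\\html\\ppmproxy.htm") or die "Cannot open file: $!";
@AllLines = <F>;
close(F);
$OneBigLine = join '', @AllLines;
# Remove everything between a '<' and the next '>', even if it spans
# several lines.  [^>] already matches newlines, so the 's' modifier
# (which only changes what '.' matches) is not needed here.
$OneBigLine =~ s!<[^>]+>!!g;
# Remove blank lines (this also strips leading whitespace from the next line)
$OneBigLine =~ s!\n\s*!\n!g;
print $OneBigLine;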
0
 
LVL 84

Expert Comment

by:ozo
ID: 2571621
perldoc -q "How do I remove HTML from a string"
0
 
LVL 5

Expert Comment

by:PC_User321
ID: 2571701
Good comment, ozo.
There are lots of traps in trying to decode HTML.
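One such trap, as a small sketch (the sample markup is made up for illustration): a '>' inside a quoted attribute value ends the match early, so part of the tag leaks into the output.

```perl
use strict;
use warnings;

# Hypothetical input: the alt attribute contains a '>' character.
my $html = '<img src="arrow.gif" alt="a > b"><p>Hello</p>';

# The "strip everything between '<' and the next '>'" approach.
(my $text = $html) =~ s/<[^>]+>//g;

# The img tag is cut off at the '>' inside the attribute, so the tail
# of the attribute value is left behind in front of "Hello".
print "[$text]\n";
```

Comments, scripts and unclosed tags cause similar surprises, which is why the FAQ entry points to a real parser.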
0
 
LVL 3

Author Comment

by:Raydot
ID: 2573750
OK PC, I get it.  One last question.  What would you suggest for parsing text from between tags, like everything between <b> and </b>?
0
 
LVL 3

Author Comment

by:Raydot
ID: 2574207
You were right on the money.  Just do me a favor and answer my last question before we all check out...also, you never told me what [ ] is used for...
0
 
LVL 84

Expert Comment

by:ozo
ID: 2574282
@bold = $text =~ m(<b>(.*?)</b>)gs;  #with the caveats mentioned in the FAQ, use HTML::Parser if you want to do it reliably

[] are used for character classes in regular expressions, see
perldoc perlre
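A short sketch of how that match behaves, with the same caveats. The sample strings and variable names are illustrative only, and the second pattern assumes a double-quoted href that directly follows the tag name; it is not something posted in this thread.

```perl
use strict;
use warnings;

my $text = "<b>first</b> plain <b>second\nline</b>";

# In list context with /g, the match returns every captured group.
# .*? is non-greedy, so each match stops at the nearest </b>;
# /s lets '.' match the newline inside the second pair.
my @bold = $text =~ m{<b>(.*?)</b>}gs;

# The same idea applied to the link from the question
# (/i because the tag there is written in upper case).
my $link = '<LI><A HREF="http://www.mywebsite.com">My web site link!</A><BR>';
my ($url, $label) = $link =~ m{<a\s+href="([^"]*)"[^>]*>(.*?)</a>}is;

print "bold: $_\n" for @bold;
print "url: $url\nlabel: $label\n";
```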
0
 
LVL 5

Expert Comment

by:PC_User321
ID: 2574361
By the way, in my first post the line
   $Line =~ s!<[^="]+!!g;
should be replaced with
   $Line =~ s!<[^>="]+[>="]!!g;
for safety.
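A quick sketch of why the extra character class matters, using a made-up line that contains no '=' or '"' at all:

```perl
use strict;
use warnings;

my $line = '<B>bold</B> text';

# Original pattern: nothing stops the match, so it runs to the end
# of the string and deletes the text along with the tags.
(my $old = $line) =~ s!<[^="]+!!g;

# Revised pattern: each match must end at the first '>', '=' or '"',
# so only the tags themselves are removed.
(my $new = $line) =~ s/<[^>="]+[>="]//g;

print "old: [$old]\nnew: [$new]\n";
```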

(with the caveats mentioned in the FAQ, as ozo says)
Have fun in the quagmire :)
0

Question has a verified solution.
