Regexp for stripping link info from HTML

Does anyone have a regular expression for stripping links out of an HTML document?
I would want it to return both the URLs and the link titles for a document in an array or the like.
matthewallum Asked:
pc012197 Commented:
It would be very difficult, if not impossible, to fit this into a single
regular expression. But the following short piece of code will do what
you want. $html should contain the complete HTML code of the page.

# first remove comments ($` and $' are Perl's pre-match and post-match strings)
while( $html =~ /<!--.*?-->/s ) {
    $html = $`.$';
}

# then create the array of alternating (href, link text) entries
@array = ();
while( $html =~ /<a\s[^>]*href=("[^"]*"|[^\s>]+)[^>]*>(.*?)<\/a>/si ) {
    $html = $`.$';            # drop the matched link so the loop advances
    push @array, ($1, $2);    # $1 = href attribute value, $2 = link text
}
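For example (the sample markup below is my own illustration, not from the original post), running that loop over a small snippet yields alternating href/text entries. Note that $1 keeps the surrounding quotes when the attribute is quoted:

```perl
my $html = '<p><a href="http://a.example/">First</a> and <A HREF=/two>Second</A></p>';

my @array;
while ( $html =~ /<a\s[^>]*href=("[^"]*"|[^\s>]+)[^>]*>(.*?)<\/a>/si ) {
    $html = $` . $';           # remove the matched link from $html
    push @array, ($1, $2);     # href (quotes included if present), then link text
}

# @array is now ('"http://a.example/"', 'First', '/two', 'Second')
```

If the quotes are unwanted, a follow-up `$href =~ s/^"(.*)"$/$1/;` would strip them.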

ozo Commented:
If you don't want to be fooled by links inside a comment or a <script> block, this could get tricky to do in a single regular expression.
It may be better to
use HTML::Parser;
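A minimal sketch of that approach, assuming the HTML::Parser module from CPAN is installed (the sample document and variable names are my own, not from the thread). Because the parser treats <script> content as CDATA and routes comments to a separate handler, the pitfalls above don't apply:

```perl
use HTML::Parser;

# Sample document: the first link is hidden inside a comment, the second is real.
my $html = '<!-- <a href="/hidden">no</a> --><a href="/real">Go</a>';

my @links;            # will hold [href, link text] pairs
my ($href, $text);

my $p = HTML::Parser->new(
    api_version => 3,
    # record the href when an <a> tag opens
    start_h => [ sub {
        my ($tag, $attr) = @_;
        if ( $tag eq 'a' && exists $attr->{href} ) {
            $href = $attr->{href};
            $text = '';
        }
    }, 'tagname, attr' ],
    # accumulate the visible link text between <a> and </a>
    text_h => [ sub { $text .= $_[0] if defined $href }, 'dtext' ],
    # on </a>, store the completed (href, text) pair
    end_h => [ sub {
        if ( $_[0] eq 'a' && defined $href ) {
            push @links, [ $href, $text ];
            undef $href;
        }
    }, 'tagname' ],
);
$p->parse($html);
$p->eof;

# @links is now ( ['/real', 'Go'] ) -- the commented-out link is ignored
```

Since no comment_h handler is registered, comment content is simply discarded rather than scanned for tags.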
matthewallum (Author) Commented:
Many thanks.
ozo Commented:
<!-- That works in many cases, but is still fooled by things like: -->
<script>print "<!--"</script>
<a HREF=/bin/ShowQ?qid=10176039>Reload ?</a>
<script>print "-->"</script>
<script>print "<a HREF=http://www.experts-exchange.com/>Home</a>"</script>

(Also, see some of the links at http://microsoft.com)
pc012197 Commented:
...which leads us to the fact that different browsers treat
HTML differently. So why should a Perl script be able
to parse *Script?
