Regexp for stripping link info from HTML.

Does anyone have a regular expression for stripping links out of an HTML document?
I would want it to return both the URLs and the link titles for a document in an array or similar.
pc012197 commented:
It would be very difficult, if not impossible, to fit this into a
single regular expression. But the following short piece of code
will do what you want. $html should contain the complete HTML code
of the page.

# first remove comments
while( $html =~ /<!--.*?-->/s ) {
    $html = $`.$';
}

# then create the array of (URL, title) pairs
@array = ();
while( $html =~ /<a\s[^>]*href=("[^"]*"|[^\s>]+)[^>]*>(.*?)<\/a>/si ) {
    $html = $`.$';
    push @array, ($1, $2);
}
If you don't want to be fooled by links inside a comment or a <script> block, this could get tricky to do with regular expressions; it may be better to
use HTML::Parser;
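A sketch of that approach, using the HTML::Parser CPAN module with its version-3 handler API (the sample input string is made up; a real parser skips the commented-out link that would fool the regex):

```perl
use strict;
use warnings;
use HTML::Parser;   # CPAN module, not core Perl

# Made-up sample: one link hidden in a comment, one real link.
my $html = '<!-- <a href="/hidden">nope</a> --><a href="/home">Home</a>';

my @links;
my $inside_a = 0;
my ( $url, $title );

my $p = HTML::Parser->new(
    api_version => 3,
    # record the href when an <a> opens
    start_h => [ sub {
        my ( $tag, $attr ) = @_;
        if ( $tag eq 'a' && defined $attr->{href} ) {
            $inside_a = 1;
            $url      = $attr->{href};
            $title    = '';
        }
    }, 'tagname,attr' ],
    # accumulate the link text between <a> and </a>
    text_h => [ sub { $title .= $_[0] if $inside_a }, 'dtext' ],
    # emit the (URL, title) pair when the <a> closes
    end_h => [ sub {
        if ( $_[0] eq 'a' && $inside_a ) {
            push @links, [ $url, $title ];
            $inside_a = 0;
        }
    }, 'tagname' ],
);
$p->parse($html);
$p->eof;

print "$_->[0] => $_->[1]\n" for @links;
```

Because the parser tokenizes the document properly, comments and script bodies never produce start-tag events, so the fake link is ignored without any pre-stripping pass.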
matthewallum (Author) commented:
many thanks
<!-- That works in many cases, but is still fooled by things like: -->
<script>print "<!--"</script>
<a HREF=/bin/ShowQ?qid=10176039>Reload ?</a>
<script>print "-->"</script>
<script>print "<a HREF=>Home</a>"</script>

(Also, see some of the links at ...which leads us to the fact that different browsers treat HTML differently. So why should a Perl script be able to parse *Script?
