REG EXP for stripping link info from HTML.

Posted on 1999-06-27
Last Modified: 2010-03-04
Does anyone have a regular expression for extracting links from an HTML document?
I would want it to return both the URLs and the link titles for a document, in an array or something similar.
Question by: matthewallum

Expert Comment

ID: 1213526
If you don't want to be fooled by links inside a comment or <script>, this could get tricky to do in a single regular expression.
It may be better to
use HTML::Parser;
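
As a rough illustration of that suggestion, here is a minimal sketch using the event-handler interface of HTML::Parser (api_version 3, which is newer than this thread, so it is an assumption about the installed module version). The helper name links_via_parser and the [url, title] pair layout are made up for the example. Comments are skipped simply because no comment handler is registered, and text inside <script> is ignored because it never falls between an <a> start tag and its end tag.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: returns a list of [url, title] pairs.
sub links_via_parser {
    my ($html) = @_;
    my @links;
    my $current;    # the link being collected, undef outside <a>...</a>

    my $p = HTML::Parser->new(
        api_version => 3,
        # Start tags: tag and attribute names arrive lowercased.
        start_h => [ sub {
            my ($tag, $attr) = @_;
            return unless $tag eq 'a' && defined $attr->{href};
            $current = [ $attr->{href}, '' ];
        }, 'tagname, attr' ],
        # Text: only collected while inside a link.
        text_h => [ sub {
            my ($text) = @_;
            $current->[1] .= $text if $current;
        }, 'dtext' ],
        # End tags: close the current link, if any.
        end_h => [ sub {
            my ($tag) = @_;
            return unless $tag eq 'a' && $current;
            push @links, $current;
            undef $current;
        }, 'tagname' ],
    );
    $p->parse($html);
    $p->eof;
    return @links;
}
```

Calling links_via_parser($html) and then reading $_->[0] and $_->[1] from each returned pair gives the URL and the link text.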

Accepted Solution

pc012197 earned 800 total points
ID: 1213527
It would be very difficult to fit this into a single regular
expression, if not impossible. But the following short
piece of code will do what you want. $html should contain
the complete HTML code of the page.

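# first remove comments
while( $html =~ /<!--.*?-->/si ) {
    $html = $`.$';
}

# then create the array
@array = ();
while( $html =~ /<a\s[^>]*href=("[^"]*"|[^\s>]+)[^>]*>(.*?)<\/a>/si ) {
    # cut the matched link out of $html so the loop advances
    $html = $`.$';
    # $1 is the URL (still in its quotes if it was quoted), $2 is the link text
    push @array, ($1, $2);
}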
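
For anyone who wants this as a reusable routine, the same idea can be sketched as a function that leaves the caller's string untouched. The name extract_links is hypothetical, and three things here go beyond the code above and are my own assumptions about what is wanted: single-quoted href values are accepted, whitespace around "=" is tolerated, and the surrounding quotes are stripped from the URL.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [url, title] pairs.
sub extract_links {
    my ($html) = @_;              # works on a copy of the caller's string

    # Remove comments first, as in the answer above.
    $html =~ s/<!--.*?-->//gs;

    my @links;
    # /g walks through the string instead of cutting each match out of it.
    while ( $html =~ /<a\s[^>]*href\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)[^>]*>(.*?)<\/a>/gsi ) {
        my ($url, $title) = ($1, $2);
        $url =~ s/^(["'])(.*)\1$/$2/s;    # strip matching surrounding quotes
        push @links, [ $url, $title ];
    }
    return @links;
}

# Usage: print one "url<TAB>title" line per link.
# print "$_->[0]\t$_->[1]\n" for extract_links($html);
```

This has the same blind spots as the original regular expressions: it knows nothing about <script> content or about markup nested in ways the patterns do not anticipate.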


Author Comment

ID: 1213528
many thanks

Expert Comment

ID: 1213529
<!-- That works in many cases, but is still fooled by things like: -->
<script>print "<!--"</script>
<a HREF=/bin/ShowQ?qid=10176039>Reload ?</a>
<script>print "-->"</script>
<script>print "<a HREF=http://www.experts-exchange.com/>Home</a>"</script>

(Also, see some of the links at http://microsoft.com)
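
To make the failure concrete, the sketch below runs the two patterns from the accepted solution over exactly the five lines quoted in this comment. It uses s///g and m//g rather than the original cut-and-rescan loops, which gives the same result for this particular input. The comment-stripping step swallows everything from the "<!--" inside the first script to the "-->" inside the second, taking the real link with it, and the only link left to find is the one inside the script string.

```perl
use strict;
use warnings;

my $html = <<'HTML';
<!-- That works in many cases, but is still fooled by things like: -->
<script>print "<!--"</script>
<a HREF=/bin/ShowQ?qid=10176039>Reload ?</a>
<script>print "-->"</script>
<script>print "<a HREF=http://www.experts-exchange.com/>Home</a>"</script>
HTML

# Step 1: strip comments (this also removes the genuine "Reload ?" link).
$html =~ s/<!--.*?-->//gs;

# Step 2: collect URL / link-text pairs.
my @array;
while ( $html =~ /<a\s[^>]*href=("[^"]*"|[^\s>]+)[^>]*>(.*?)<\/a>/gsi ) {
    push @array, $1, $2;
}

print join(" | ", @array), "\n";
# prints: http://www.experts-exchange.com/ | Home
```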

Expert Comment

ID: 1213530
...which leads us to the fact that different browsers treat
HTML differently. So why should a Perl script be able
to parse *Script?

