Regular Expression to find HTML links with rel nofollow attributes php

I am trying to come up with a regular expression that finds each HTML link on a page, checks whether that link has a rel="nofollow" attribute, and stores the results in variables. My off-the-top-of-my-head attempt used strip_tags() in PHP, but it returns not just the links but all the other text as well. I'm not sure whether PHP already has a function that can do this, or whether I need a regex both to extract the links and to find the nofollow attribute.

Ultimately I want to scan a webpage and return 2 things:

The link, and whether that link has a nofollow attribute on the page.

I don't need to pull the nofollow text itself, obviously; I just need to know whether the link has it. I am assuming preg_match will be used for that along with a regex. Can anyone help me out?
Asked by: cbielich

1 Solution

Terry Woods (IT Guru) commented:
I'd do it in 2 steps:
1. Get the links with preg_match_all.
2. Then, for each match, check whether it has the nofollow attribute:

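// Step 1: $matches[0] holds each full <a ...> tag, $matches[1] the matching href value.
preg_match_all("#<a[^>]*href\s*=\s*['\"]([^'\">]*)['\"][^>]*>#i", $myhtml, $matches);

// Step 2: test each matched tag for a rel="nofollow" attribute.
foreach ($matches[0] as $matchnum => $match) {
  if (preg_match("#rel\s*=\s*['\"]nofollow['\"]#", $matches[0][$matchnum])) {
    print "Link (nofollow): {$matches[1][$matchnum]}\n";
  } else {
    print "Link: {$matches[1][$matchnum]}\n";
  }
}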

cbielich (Author) commented:
I just came up with this. What do you think?

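<?php
$yourHTML = file_get_contents('http://www.somewebsite.com');

// Collect parse warnings from malformed HTML instead of silencing them with @.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($yourHTML);
libxml_clear_errors();

// Print the href of every link whose rel attribute is exactly "nofollow".
$links = $dom->getElementsByTagName('a');
foreach ($links as $link) {
    if ($link->hasAttribute('rel') && $link->getAttribute('rel') == 'nofollow') {
        echo $link->getAttribute('href'), "\n";
    }
}
?>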
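
The snippet above only prints the nofollow links, while the question asks for every link plus a flag. A minimal sketch of that variant follows. The function name find_links_dom and the sample HTML are made up for illustration, and it goes one step beyond the snippet by treating rel as a space-separated, case-insensitive token list, so rel="noopener NoFollow" also counts.

```php
<?php
// Hypothetical helper (not from the thread): returns every <a href> in $html
// together with a flag saying whether its rel attribute contains "nofollow".
function find_links_dom(string $html): array
{
    // Collect libxml parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $results = [];
    foreach ($dom->getElementsByTagName('a') as $link) {
        if (!$link->hasAttribute('href')) {
            continue; // skip anchors that have no target
        }
        // Split rel into lowercase tokens; an absent rel yields an empty list.
        $rel = strtolower(trim($link->getAttribute('rel')));
        $tokens = preg_split('/\s+/', $rel, -1, PREG_SPLIT_NO_EMPTY);
        $results[] = [
            'href'     => $link->getAttribute('href'),
            'nofollow' => in_array('nofollow', $tokens, true),
        ];
    }
    return $results;
}

// Sample input invented for this example (note the unclosed <p>).
$sample = '<p><a href="/a" rel="nofollow">A</a> <a href="/b">B</a> '
        . '<a href="/c" rel="noopener NoFollow">C</a>';
foreach (find_links_dom($sample) as $row) {
    echo ($row['nofollow'] ? 'Link (nofollow): ' : 'Link: ') . $row['href'] . "\n";
}
```

Returning an array of href/nofollow pairs keeps the parsing separate from the printing, so the same result can be stored in variables as the question wanted.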
cbielich (Author) commented:
I like yours better :)
Terry Woods (IT Guru) commented:
In an ideal world, the DOMDocument way would be the best. However, I've had others complain that it doesn't handle invalid HTML well; I'm not sure in what way it fails, though.

Note also that you might like to add an "i" pattern modifier to the preg_match call so it ignores case:
  if (preg_match("#rel\s*=\s*['\"]nofollow['\"]#i",$matches[0][$matchnum])) {
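
Putting the two steps and the case-insensitive check together, a rough sketch might look like the following. The function name find_links_regex and the sample HTML are invented here. As assumptions beyond the original answer, the tag pattern requires whitespace after "<a" (so tags such as <area> are not picked up) and the rel pattern is loosened so that multi-value attributes such as rel="noopener nofollow" still count.

```php
<?php
// Hypothetical wrapper around the two-step regex approach from this thread.
function find_links_regex(string $html): array
{
    // Step 1: capture each opening <a ...> tag and the value of its href.
    preg_match_all(
        "#<a\s[^>]*href\s*=\s*['\"]([^'\">]*)['\"][^>]*>#i",
        $html,
        $matches,
        PREG_SET_ORDER
    );

    $results = [];
    foreach ($matches as $m) {
        // Step 2: look for "nofollow" as a whole word inside a quoted rel value.
        $nofollow = preg_match(
            "#\srel\s*=\s*['\"][^'\"]*\bnofollow\b[^'\"]*['\"]#i",
            $m[0]
        ) === 1;
        $results[] = ['href' => $m[1], 'nofollow' => $nofollow];
    }
    return $results;
}

// Sample input invented for this example.
$sample = '<a href="/a" rel="nofollow">A</a> <a href="/b">B</a> '
        . '<a REL="noopener NoFollow" HREF="/c">C</a>';
foreach (find_links_regex($sample) as $row) {
    echo ($row['nofollow'] ? 'Link (nofollow): ' : 'Link: ') . $row['href'] . "\n";
}
```

Like any regex over HTML, this only sees quoted attribute values and will miss unquoted ones such as rel=nofollow.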
