preg_match to find meta tag noindex

I am trying to use preg_match to find out whether a page contains a noindex meta tag. The problem is that several different tags can be used. Here is a list of them:

All Search Engines

<meta name="robots" content="noindex" />

Google

<meta name="googlebot" content="noindex" />

Yahoo

<meta name="Slurp" content="noindex" />

Bing (Microsoft)

<meta name="msnbot" content="noindex" />

Also, some meta tags might have content="noindex, follow" or content="follow, noindex".

Does anyone have an example preg_match that would help me find these?
Asked by cbielich
 
Robert Schutt (Software Engineer) commented:
If you want to allow for some variation in the HTML, try this expression:
"/<meta\s+name\s*=\s*[\"']([^\"'>]*)[\"']\s*content\s*=\s*[\"'][^\"'>]*noindex[^\"'>]*[\"']\s*\/?>/i"

To check or record which bot was targeted, you could do something like:
// [^\"'>]* keeps each attribute value inside its own quotes and its own tag
if (preg_match("/<meta\s+name\s*=\s*[\"']([^\"'>]*)[\"']\s*content\s*=\s*[\"'][^\"'>]*noindex[^\"'>]*[\"']\s*\/?>/i", $s, $m) === 1) {
	echo "found: ".htmlspecialchars($m[1])."<br>";
}

assuming $s is a string containing the html.
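
The expression above expects the name attribute to come before content. As a rough sketch of one way to relax that, the following pulls out each meta tag first and then reads the two attributes separately, so either order works and every targeted bot is reported. The function name find_noindex_bots and the sample markup are assumptions made for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper (not from the thread): returns the lowercased name of
// every meta tag whose content attribute includes a noindex directive.
function find_noindex_bots(string $html): array
{
    $bots = [];
    // Grab every <meta ...> tag first, then inspect its attributes separately
    // so that name and content may appear in either order.
    preg_match_all('/<meta\b[^>]*>/i', $html, $tags);
    foreach ($tags[0] as $tag) {
        if (!preg_match('/\bname\s*=\s*["\']([^"\']*)["\']/i', $tag, $name)) {
            continue;
        }
        if (!preg_match('/\bcontent\s*=\s*["\']([^"\']*)["\']/i', $tag, $content)) {
            continue;
        }
        // Split values such as "follow, noindex" into individual directives.
        $directives = array_map('trim', explode(',', strtolower($content[1])));
        if (in_array('noindex', $directives, true)) {
            $bots[] = strtolower($name[1]);
        }
    }
    return $bots;
}

// Made-up sample markup: reversed attribute order, a description that merely
// mentions the word, and an upper-case directive.
$s = '<meta content="follow, noindex" name="Slurp" />'
   . '<meta name="description" content="All about noindex">'
   . '<meta name="googlebot" content="NOINDEX" />';
print_r(find_noindex_bots($s));
```

Comparing whole directives rather than searching for a substring means a description tag that only mentions the word noindex is not reported.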
 
Ray Paseur commented:
I don't think you really need a regular expression; stripos() should be enough.
http://us2.php.net/manual/en/function.stripos.php

See http://www.laprbass.com/RAY_temp_cbielich.php
<?php // RAY_temp_cbielich.php
error_reporting(E_ALL);

// SIMULATE A WEB PAGE

$htm = <<<HTM
<meta name="robots" content="noindex" />
Google
<meta name="googlebot" content="noindex" />
Yahoo
<meta name="Slurp" content="noindex" />
Bing (Microsoft)
<meta name="msnbot" content="noindex" />
Also some meta tags might have content="noindex, follow" or content="follow, noindex"
HTM;

// PROCESS EACH LINE (\R MATCHES ANY LINE ENDING, WHATEVER THE PLATFORM)
$arr = preg_split('/\R/', $htm);
$out = array();
$bad = array();
foreach ($arr as $str)
{
    if (stripos($str, 'NOINDEX') === FALSE)
    {
        $bad[] = htmlentities($str);
    }
    else
    {
        $out[] = htmlentities($str);
    }
}

// SHOW WHERE WE FOUND "NOINDEX"
echo '<pre>';
echo 'HERE ARE THE LINES WITH NOINDEX: ' . PHP_EOL;
print_r($out);
echo PHP_EOL;
echo 'HERE ARE THE LINES WITHOUT: ' . PHP_EOL;
print_r($bad);

Don't forget about robots.txt ;-)

Best regards, ~Ray
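
A plain stripos() over the page also matches the word when it appears in body text or in an unrelated tag. If that matters, one alternative is to parse the markup and look only at meta elements. The sketch below assumes PHP's DOM extension is enabled; the function name page_has_noindex is hypothetical, and the bot list is simply the four names from the question.

```php
<?php
// Hypothetical helper (not from the thread): true when any robots-style meta
// tag in $html carries a noindex directive.
function page_has_noindex(string $html): bool
{
    // Bot names taken from the question; extend the list as needed.
    $bots = ['robots', 'googlebot', 'slurp', 'msnbot'];

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so keep libxml warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $name = strtolower(trim($meta->getAttribute('name')));
        $content = strtolower($meta->getAttribute('content'));
        if (in_array($name, $bots, true) && strpos($content, 'noindex') !== false) {
            return true;
        }
    }
    return false;
}

// Made-up sample pages: one with a noindex tag, one that only mentions the word.
$withTag = '<html><head><meta name="Slurp" content="follow, noindex"></head><body></body></html>';
$textOnly = '<html><head><title>t</title></head><body><p>We never use noindex here.</p></body></html>';
var_dump(page_has_noindex($withTag));
var_dump(page_has_noindex($textOnly));
```

The parser handles attribute order, quoting style, and letter case, at the cost of being slower than a single string search.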
 
käµfm³d 👽 commented:
I think for this purpose the regex version could be as simple as:

preg_match_all('/<meta [^>]*?noindex/', $input, $matches);

 
Ray Paseur commented:
@kaufmed: Might want to consider case-insensitive.  Just a thought... ~Ray
 
käµfm³d 👽 commented:
@Ray_Paseur

Certainly  : )
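
For reference, a minimal sketch of the same short expression with the i modifier added, which is one way to make it case-insensitive. The sample input is made up for illustration.

```php
<?php
// Sample input is made up for illustration.
$input = '<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">' . "\n"
       . '<meta name="description" content="An ordinary page">';

// The i modifier makes the whole pattern case-insensitive.
$count = preg_match_all('/<meta [^>]*?noindex/i', $input, $matches);

echo $count, PHP_EOL;
print_r($matches[0]);
```

Each match stops at the word itself, so $matches[0] holds the start of every meta tag that contains it rather than the complete tag.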
 
cbielich (Author) commented:
Any example code to include the robots.txt file? :)
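
One possible starting point, as a rough sketch only: the function below takes the text of a robots.txt file and reports whether a path falls under a Disallow rule for all crawlers. The name robots_txt_disallows is hypothetical, and the parsing is deliberately simplified: it reads only "User-agent: *" groups and plain path prefixes, and it ignores Allow rules and wildcard patterns.

```php
<?php
// Hypothetical helper (not from the thread). Simplified on purpose: it only
// reads "User-agent: *" groups and plain Disallow prefixes, ignoring Allow
// rules and wildcard patterns.
function robots_txt_disallows(string $robotsTxt, string $path): bool
{
    $applies = false;   // inside a group that targets all crawlers?
    $inRules = false;   // seen a rule line since the last User-agent line?
    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line));   // drop comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        list($field, $value) = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);
        if ($field === 'user-agent') {
            if ($inRules) {   // a User-agent line after rules starts a new group
                $applies = false;
                $inRules = false;
            }
            if ($value === '*') {
                $applies = true;
            }
        } else {
            $inRules = true;
            if ($field === 'disallow' && $applies && $value !== ''
                && strpos($path, $value) === 0) {
                return true;
            }
        }
    }
    return false;
}

// Made-up robots.txt content for illustration.
$robots = "User-agent: *\nDisallow: /private/\n\nUser-agent: googlebot\nDisallow: /\n";
var_dump(robots_txt_disallows($robots, '/private/page.html'));
var_dump(robots_txt_disallows($robots, '/public/page.html'));
```

In practice the text could come from something like file_get_contents() on the site's /robots.txt URL, assuming allow_url_fopen is enabled. Note that a Disallow rule blocks crawling rather than indexing, so it complements the meta tag check rather than replacing it.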
 
Ray Paseur commented:
Question has a verified solution.
