• Status: Solved

preg_match to find meta tag noindex

I am trying to use preg_match to find out whether a page contains a noindex meta tag. The problem is that there are several possible tags in use. Here is a list of them:

All Search Engines

<meta name="robots" content="noindex" />

Google

<meta name="googlebot" content="noindex" />

Yahoo

<meta name="Slurp" content="noindex" />

Bing (Microsoft)

<meta name="msnbot" content="noindex" />

Also, some meta tags might have content="noindex, follow" or content="follow, noindex".

Does anyone have an example preg_match that would help me find these?

Asked by: cbielich
Robert Schutt (Software Engineer) Commented:
If you want to allow for some variations in the HTML, try this expression:
"/<meta\s+name\s*=\s*[\"']([^\"']*)[\"']\s*content\s*=\s*[\"'][^\"']*noindex[^\"']*[\"']\s*\/?>/i"

To check or record which bot was targeted, you could do something like:
if (preg_match("/<meta\s+name\s*=\s*[\"']([^\"']*)[\"']\s*content\s*=\s*[\"'][^\"']*noindex[^\"']*[\"']\s*\/?>/i", $s, $m) === 1) {
	echo "found: ".htmlspecialchars($m[1])."<br>";
}

This assumes $s is a string containing the HTML.
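One thing to watch: the expression above expects name to come before content inside the tag. Here is a minimal sketch, not a definitive implementation, that assumes the attributes may appear in either order and collects every targeted bot name. The function name findNoindexBots is hypothetical, and this is still regex-based, so unusual markup (for example a ">" inside an attribute value) can trip it up.

```php
<?php
// Hypothetical helper: returns the lower-cased name attribute of every
// <meta> tag whose content attribute contains "noindex".
function findNoindexBots(string $html): array
{
    $bots = array();

    // Grab each <meta ...> tag first, then inspect its attributes separately,
    // so the order of name= and content= does not matter.
    if (!preg_match_all('/<meta\b[^>]*>/i', $html, $tags)) {
        return $bots;
    }

    foreach ($tags[0] as $tag) {
        $hasName    = preg_match('/\bname\s*=\s*["\']([^"\']*)["\']/i', $tag, $name);
        $hasContent = preg_match('/\bcontent\s*=\s*["\']([^"\']*)["\']/i', $tag, $content);

        // stripos() is case-insensitive, so NOINDEX and noindex both count.
        if ($hasName && $hasContent && stripos($content[1], 'noindex') !== false) {
            $bots[] = strtolower(trim($name[1]));
        }
    }

    // Drop duplicates and reindex from 0.
    return array_values(array_unique($bots));
}
```

An empty array means no noindex meta tag was found; otherwise the array tells you which bots (robots, googlebot, slurp, msnbot, ...) are targeted.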
 
Ray Paseur Commented:
I think you don't really need a regular expression; this function should be enough.
http://us2.php.net/manual/en/function.stripos.php

See http://www.laprbass.com/RAY_temp_cbielich.php
<?php // RAY_temp_cbielich.php
error_reporting(E_ALL);

// SIMULATE A WEB PAGE

$htm = <<<HTM
<meta name="robots" content="noindex" />
Google
<meta name="googlebot" content="noindex" />
Yahoo
<meta name="Slurp" content="noindex" />
Bing (Microsoft)
<meta name="msnbot" content="noindex" />
Also some meta tags might have content="noindex, follow" or content="follow, noindex"
HTM;

// PROCESS EACH LINE (SPLIT ON ANY NEWLINE STYLE, NOT ONLY THE SERVER'S PHP_EOL)
$arr = preg_split('/\R/', $htm);
$out = array();
$bad = array();
foreach ($arr as $str)
{
    if (stripos($str, 'NOINDEX') === FALSE)
    {
        $bad[] = htmlentities($str);
    }
    else
    {
        $out[] = htmlentities($str);
    }
}

// SHOW WHERE WE FOUND "NOINDEX"
echo '<pre>';
echo 'HERE ARE THE LINES WITH NOINDEX: ' . PHP_EOL;
print_r($out);
echo PHP_EOL;
echo 'HERE ARE THE LINES WITHOUT: ' . PHP_EOL;
print_r($bad);


Don't forget about robots.txt ;-)

Best regards, ~Ray
 
käµfm³d 👽 Commented:
I think for this purpose the regex version could be as simple as:

preg_match_all('/<meta [^>]*?noindex/', $input, $matches);

 
Ray Paseur Commented:
@kaufmed: You might want to consider making it case-insensitive.  Just a thought... ~Ray
 
käµfm³d 👽 Commented:
@Ray_Paseur

Certainly  : )
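For reference, a minimal sketch of the case-insensitive variant discussed above, assuming the only change needed is the trailing i modifier. The sample $input is made up for illustration.

```php
<?php
// Sample input is made up for illustration.
$input = '<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">' . "\n"
       . '<meta name="description" content="An example page">';

// The trailing "i" modifier makes the whole pattern case-insensitive,
// so <META ... NOINDEX> is matched as well as <meta ... noindex>.
$count = preg_match_all('/<meta [^>]*?noindex/i', $input, $matches);

// preg_match_all() returns the number of full matches (or false on error).
$hasNoindex = ($count > 0);
```

Without the i modifier, the upper-case tag in this sample would not be matched at all.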
 
cbielich (Author) Commented:
Any example code to include the robots.txt file? :)
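A rough sketch of what a robots.txt check could look like, under some loud assumptions: the robots.txt text has already been fetched into a string, only User-agent and Disallow lines are considered, and paths are compared by simple prefix (no wildcards, no Allow rules). The function name isDisallowedByRobotsTxt is hypothetical, and real crawlers apply more elaborate matching rules than this.

```php
<?php
// Hypothetical helper: true if $path is disallowed by a group that names
// $agent in the given robots.txt text. Simplified on purpose: prefix
// matching only, no wildcards, no Allow lines.
function isDisallowedByRobotsTxt(string $robotsTxt, string $path, string $agent = '*'): bool
{
    $applies  = false; // does the current group name $agent?
    $inAgents = false; // are we inside a run of consecutive User-agent lines?

    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line)); // strip comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        list($field, $value) = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            if (!$inAgents) {
                $applies = false; // a new group starts here
            }
            $inAgents = true;
            if (strcasecmp($value, $agent) === 0) {
                $applies = true;
            }
        } else {
            $inAgents = false;
            // An empty Disallow value means "nothing is disallowed".
            if ($field === 'disallow' && $applies && $value !== ''
                && strpos($path, $value) === 0) {
                return true;
            }
        }
    }
    return false;
}
```

With the default $agent of '*', this checks the catch-all group only. Combining it with the meta tag check gives two independent answers: robots.txt says whether a crawler may fetch the page, the meta tag says whether it may index it.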