Regex or operator problems

Following on from a previous post, I'm updating my IMDb scraper to fall in line with the new site layout. The current problem is that there seem to be two regex patterns for obtaining the title, these being:

1. <h1 itemprop="name" class="">
2. <h1 itemprop="name" class="long">

Individually they work fine, but I'm having trouble combining them so that it searches for <h1 itemprop="name" class=""> or <h1 itemprop="name" class="long">. The pattern I've tried is:
$f_content = explode('(<h1 class="long" itemprop="name">|<h1 class="" itemprop="name">)', $movie_content);

The regex pattern seems to work in EditPad Lite 7 but not in PHP.

Can anyone offer some pointers?

Dan Craciun

explode does not work with regular expressions.

Your code will look for the literal string "(<h1 class="long" itemprop="name">|<h1 class="" itemprop="name">)".
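To make the difference concrete, here is a minimal sketch comparing explode() with preg_split(), which does accept a regular expression. The sample markup and the variable names $literal, $pattern and $parts are made up for illustration, and the attribute order follows the attempted pattern, which may not match what the live page actually sends.

```php
<?php
// Hypothetical sample markup standing in for a fetched page (an assumption).
$movie_content = 'head<h1 class="long" itemprop="name">Some Title&nbsp;<span id="titleYear">(1980)</span></h1>tail';

// explode() treats its first argument as a literal delimiter, so the
// parentheses and the | are searched for verbatim and nothing is split.
$literal = explode('(<h1 class="long" itemprop="name">|<h1 class="" itemprop="name">)', $movie_content);

// preg_split() takes a delimited PCRE pattern, so alternation works.
// (?:long)? accepts both class="" and class="long".
$pattern = '#<h1 class="(?:long)?" itemprop="name">#';
$parts   = preg_split($pattern, $movie_content);

echo count($literal), PHP_EOL; // the input came back as a single element
echo $parts[1], PHP_EOL;       // everything after the opening <h1> tag
```

If the pattern never matches, preg_split() also returns the whole input as a single element, so checking count($parts) before reading $parts[1] is probably worthwhile.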

HTH,
Dan

Dan Craciun

Furthermore, if you work with explode, I don't see the need for regular expressions:

$f_content = explode('<h1 class="long" itemprop="name">', $movie_content);
if ($f_content[0] == $movie_content) {  // if no match is found, explode() returns the original string as its only element
    $f_content = explode('<h1 class="" itemprop="name">', $movie_content);
}

Ray Paseur

Be sure to comply with Robots and Screen Scraping in the IMDB terms of use.
The best thing you can give us, in order for us to help you, is a good set of test data. I'll try to make do with my imagination, but your real-world test cases are the things that make the software development process work.
http://iconoun.com/demo/temp_nwalker78.php

<?php // demo/temp_nwalker78.php
/**
 * http://www.experts-exchange.com/questions/28934186/Regex-or-operator-problems.html
 * http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags?answertab=oldest#tab-top
 *
 * http://php.net/manual/en/function.preg-match-all.php
 * http://php.net/manual/en/reference.pcre.pattern.syntax.php
 * http://php.net/manual/en/reference.pcre.pattern.modifiers.php
 */
error_reporting(E_ALL);
echo '<pre>';

// SOME SAMPLE TEST DATA
$html = <<<EOD
1.  <h1 itemprop="name" class="">Hello</h1>
2.  <H1 itemprop="name" class="long">World</H1>
<h1>Zalgo Pony</h1>
EOD;

// A REGEX TO EXTRACT THE CONTENTS OF HTML H1 TAGS
$rgx
= '#'                 // REGEX DELIMITER
. preg_quote('<h1')   // H1 HTML TAG
. '.*?'               // ANYTHING OR NOTHING
. preg_quote('>')     // END OF H1 OPENING TAG
. '(.*?)'             // CAPTURE GROUP OF ANYTHING OR NOTHING
. preg_quote('</h1')  // END OF H1 TAG
. '#'                 // REGEX DELIMITER
. 'i'                 // CASE-INSENSITIVE FLAG
;

preg_match_all($rgx, $html, $match);
foreach ($match[1] as $title)
{
    echo PHP_EOL . $title;
}

nwalker78

ASKER

here is the full code including test data:

<?php
set_time_limit(0);
error_reporting(E_ERROR | E_WARNING | E_PARSE);


$movieids = array
(
	'tt0086190'	// class=""
	,'tt0092099'	// class=""
	,'tt0080684'	// class="long"
	,'tt0087985'	// class=""
	,'tt0095016'	// class=""
	,'tt0096895'	// class=""
);
  
foreach($movieids as $movies) // looping for titles with class=""
{ 
	$movie_url = "http://www.imdb.com/title/".$movies;
	echo $movie_url.'<br><hr>';
	sleep(2); 
	$movie_content= file_get_contents($movie_url); 
	$f_content = explode('<h1 itemprop="name" class="">', $movie_content); //Find beginning of film title string.
	$f_content = explode('<span id="titleYear">', $f_content[1]); //Find end of film title string.
	
	$resultsa[$movies]['movieNM'] = str_replace( "&nbsp;", "", "$f_content[0]"); // remove unwanted characters
}

// =======================================
foreach($movieids as $moviel)// looping for titles with class="long"
{ 
	$movie_url = "http://www.imdb.com/title/".$moviel;
	echo $movie_url.'<br><hr>';
	sleep(2); 
	$movie_content= file_get_contents($movie_url); 
	$f_content = explode('<h1 itemprop="name" class="long">', $movie_content); //Find beginning of film title string.
	$f_content = explode('<span id="titleYear">', $f_content[1]); //Find end of film title string.
	
	$resultsb[$moviel]['movieNM'] = str_replace( "&nbsp;", "", "$f_content[0]"); // remove unwanted characters
}
foreach($movieids as $movieb)// looping for titles with class="" or class="long"
{ 
	$movie_url = "http://www.imdb.com/title/".$movieb;
	echo $movie_url.'<br><hr>';
	sleep(2); 
	$movie_content= file_get_contents($movie_url); 
	$f_content = explode('(<h1 class="long" itemprop="name">|<h1 class="" itemprop="name">)', $movie_content); //Find beginning of film title string.
	$f_content = explode('<span id="titleYear">', $f_content[1]); //Find end of film title string.
	
	$resultsc[$movieb]['movieNM'] = str_replace( "&nbsp;", "", "$f_content[0]"); // remove unwanted characters
}
var_dump($resultsa);
echo '<hr>';
var_dump($resultsb);
echo '<hr>';
var_dump($resultsc);

?>
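
For comparison, the three loops above could probably be collapsed into a single lookup that accepts either class value. The following is only a sketch under assumptions: extract_title() is a hypothetical helper, the sample fragments are invented rather than real IMDb responses, and the pattern assumes the title still sits between the <h1 itemprop="name" ...> tag and the <span id="titleYear"> marker used in the listing. Note that the listing puts itemprop before class in the first two loops and class before itemprop in the third, so the pattern below does not depend on attribute order.

```php
<?php
// Hypothetical helper: returns the title text, or null if the heading is absent.
function extract_title(string $movie_content): ?string
{
    // Match an <h1> carrying itemprop="name", whatever its class value or
    // attribute order, and capture everything up to the year span.
    $pattern = '#<h1\b[^>]*\bitemprop="name"[^>]*>(.*?)<span id="titleYear">#is';
    if (preg_match($pattern, $movie_content, $m) !== 1) {
        return null;
    }
    // Remove the non-breaking space entity, as the original code does.
    return trim(str_replace('&nbsp;', '', $m[1]));
}

// Invented sample fragments (assumptions, not real page content).
$samples = array(
    '<h1 itemprop="name" class="">Short Title&nbsp;<span id="titleYear">(1983)</span></h1>',
    '<h1 class="long" itemprop="name">A Much Longer Title&nbsp;<span id="titleYear">(1980)</span></h1>',
    '<h1>No itemprop here</h1>',
);

foreach ($samples as $html) {
    var_dump(extract_title($html));
}
```

Returning null when nothing matches avoids reading $f_content[1] from an array that has only one element, which is what happens in the loops above whenever the delimiter is not found.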

In my original post here (Php-curl-xpath), Ray Paseur stated that it would be better to use regex/explode rather than DOM/XPath. That's why I changed it.

ASKER CERTIFIED SOLUTION

Ray Paseur

nwalker78

ASKER

Hi,

Once again, thanks for the help. I've now managed to solve the problem and keep the bald patches to a minimum.

Just out of curiosity, how would you condense the regex you gave:

$rgx
= '#'                 // REGEX DELIMITER
. preg_quote('<h1')   // H1 HTML TAG
. '.*?'               // ANYTHING OR NOTHING
. preg_quote('>')     // END OF H1 OPENING TAG
. '(.*?)'             // CAPTURE GROUP OF ANYTHING OR NOTHING
. preg_quote('</h1')  // END OF H1 TAG
. '#'                 // REGEX DELIMITER
. 'i'                 // CASE-INSENSITIVE FLAG
. 's'                 // SINGLE-LINE FLAG
;

down to one line?

Many thanks

Dave
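
One possible condensed form of the pattern quoted above, offered as a sketch and not necessarily the version given in the answer that follows. It relies on the fact that < and > have no special meaning at these positions in PCRE, so the preg_quote() calls can be dropped when the pieces are written as one literal. The test data is the sample from the earlier demo script.

```php
<?php
// The same pieces as the multi-line build, written as a single literal:
// delimiter #, lazy match to the end of the opening tag, a lazy capture
// group, the start of the closing tag, then the i and s flags.
$rgx = '#<h1.*?>(.*?)</h1#is';

// Sample test data reused from the earlier demo script.
$html = <<<EOD
1.  <h1 itemprop="name" class="">Hello</h1>
2.  <H1 itemprop="name" class="long">World</H1>
<h1>Zalgo Pony</h1>
EOD;

preg_match_all($rgx, $html, $match);
print_r($match[1]); // the three captured heading texts
```

The multi-line version is longer, but its per-piece comments may be easier to maintain, so which form is preferable is largely a matter of taste.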

SOLUTION

Ray Paseur
