nwalker78


Regex or operator problems

Following on from a previous post, I'm updating my IMDb scraper to fall in line with the new site layout. The current problem is that there seem to be two patterns for locating the title, these being:

1.  <h1 itemprop="name" class="">
2.  <h1 itemprop="name" class="long">

Individually they work fine, but I'm having trouble combining them so that it searches for <h1 itemprop="name" class=""> or <h1 itemprop="name" class="long">. The pattern I've tried is:
  $f_content = explode('(<h1 class="long" itemprop="name">|<h1 class="" itemprop="name">)', $movie_content);

The regex pattern seems to work in EditPad Lite 7 but not in PHP.

Can anyone offer some pointers?
Dan Craciun

explode does not work with regular expressions.

Your code will look for the literal string "(<h1 class="long" itemprop="name">|<h1 class="" itemprop="name">)".

HTH,
Dan
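To make the difference concrete, here is a minimal sketch that contrasts explode() with preg_split(), which does accept a pattern. The sample markup and the variable names $literal and $parts are made up for illustration, and the attribute order follows the two tags listed in the question; the real page may differ.

```php
<?php
// Hypothetical sample markup standing in for a fetched page.
$movie_content = 'head<h1 itemprop="name" class="long">Some Title&nbsp;<span id="titleYear">(year)</span></h1>tail';

// explode() treats the delimiter as plain text, so "|" is never an OR.
$literal = explode('(<h1 itemprop="name" class="long">|<h1 itemprop="name" class="">)', $movie_content);

// preg_split() takes a real pattern; (?:long)? makes the class value optional.
$parts = preg_split('#<h1 itemprop="name" class="(?:long)?">#', $movie_content);

var_dump(count($literal)); // one piece: the literal delimiter was not found
var_dump(count($parts));   // two pieces: the text before and after the tag
```

The optional group covers both class="" and class="long" in one pattern, which is probably what the alternation in the question was aiming for.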
Furthermore, if you work with explode, I don't see the need for regular expressions:

$f_content = explode('<h1 class="long" itemprop="name">', $movie_content);
if ($f_content[0] == $movie_content) {  // if no match is found, explode returns the original string as its only element
    $f_content = explode('<h1 class="" itemprop="name">', $movie_content);
}


Be sure to comply with Robots and Screen Scraping in the IMDB terms of use.  
The best thing you can give us, in order for us to help you, is a good set of test data.  I'll try to make do with my imagination, but your real-world test cases are the things that make the software development process work.
http://iconoun.com/demo/temp_nwalker78.php
<?php // demo/temp_nwalker78.php
/**
 * http://www.experts-exchange.com/questions/28934186/Regex-or-operator-problems.html
 * http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags?answertab=oldest#tab-top
 *
 * http://php.net/manual/en/function.preg-match-all.php
 * http://php.net/manual/en/reference.pcre.pattern.syntax.php
 * http://php.net/manual/en/reference.pcre.pattern.modifiers.php
 */
error_reporting(E_ALL);
echo '<pre>';

// SOME SAMPLE TEST DATA
$html = <<<EOD
1.  <h1 itemprop="name" class="">Hello</h1>
2.  <H1 itemprop="name" class="long">World</H1>
<h1>Zalgo Pony</h1>
EOD;

// A REGEX TO EXTRACT THE CONTENTS OF HTML H1 TAGS
$rgx
= '#'                 // REGEX DELIMITER
. preg_quote('<h1')   // H1 HTML TAG
. '.*?'               // ANYTHING OR NOTHING
. preg_quote('>')     // END OF H1 OPENING TAG
. '(.*?)'             // CAPTURE GROUP OF ANYTHING OR NOTHING
. preg_quote('</h1')  // END OF H1 TAG
. '#'                 // REGEX DELIMITER
. 'i'                 // CASE-INSENSITIVE FLAG
;

preg_match_all($rgx, $html, $match);
foreach ($match[1] as $title)
{
    echo PHP_EOL . $title;
}


nwalker78

ASKER

Here is the full code, including test data:

<?php
set_time_limit(0);
error_reporting(E_ERROR | E_WARNING | E_PARSE);


$movieids = array
(
	'tt0086190'	// class=""
	,'tt0092099'	// class=""
	,'tt0080684'	// class="long"
	,'tt0087985'	// class=""
	,'tt0095016'	// class=""
	,'tt0096895'	// class=""
);
  
foreach($movieids as $movies) // looping for titles with class=""
{ 
	$movie_url = "http://www.imdb.com/title/".$movies;
	echo $movie_url.'<br><hr>';
	sleep(2); 
	$movie_content= file_get_contents($movie_url); 
	$f_content = explode('<h1 itemprop="name" class="">', $movie_content); //Find beginning of film title string.
	$f_content = explode('<span id="titleYear">', $f_content[1]); //Find end of film title string.
	
	$resultsa[$movies]['movieNM'] = str_replace( "&nbsp;", "", "$f_content[0]"); // remove unwanted characters
}

// =======================================
foreach($movieids as $moviel)// looping for titles with class="long"
{ 
	$movie_url = "http://www.imdb.com/title/".$moviel;
	echo $movie_url.'<br><hr>';
	sleep(2); 
	$movie_content= file_get_contents($movie_url); 
	$f_content = explode('<h1 itemprop="name" class="long">', $movie_content); //Find beginning of film title string.
	$f_content = explode('<span id="titleYear">', $f_content[1]); //Find end of film title string.
	
	$resultsb[$moviel]['movieNM'] = str_replace( "&nbsp;", "", "$f_content[0]"); // remove unwanted characters
}
foreach($movieids as $movieb)// looping for titles with class="" or class="long"
{ 
	$movie_url = "http://www.imdb.com/title/".$movieb;
	echo $movie_url.'<br><hr>';
	sleep(2); 
	$movie_content= file_get_contents($movie_url); 
	$f_content = explode('(<h1 class="long" itemprop="name">|<h1 class="" itemprop="name">)', $movie_content); //Find beginning of film title string.
	$f_content = explode('<span id="titleYear">', $f_content[1]); //Find end of film title string.
	
	$resultsc[$movieb]['movieNM'] = str_replace( "&nbsp;", "", "$f_content[0]"); // remove unwanted characters
}
var_dump($resultsa);
echo '<hr>';
var_dump($resultsb);
echo '<hr>';
var_dump($resultsc);

?>



In my original post here (Php-curl-xpath), Ray Paseur stated it would be better to use regex/explode rather than DOM/XPath. That's why I changed it.
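For what it's worth, the three loops above could probably collapse into a single pass with preg_match(). The following is only a sketch under stated assumptions: extract_title() is a hypothetical helper, the two sample strings are invented to mirror the tag shapes described in the question (not real IMDb output), and the nullable return type assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: returns the title text, or null when no <h1> matches.
function extract_title(string $movie_content): ?string
{
    // Accept class="" or class="long", then capture up to the year span.
    $rgx = '#<h1 itemprop="name" class="(?:long)?">(.*?)<span id="titleYear">#s';
    if (preg_match($rgx, $movie_content, $m) !== 1) {
        return null;
    }
    // Remove the &nbsp; entity, as the original loops do, then trim whitespace.
    return trim(str_replace('&nbsp;', '', $m[1]));
}

// Made-up snippets in the two shapes described in the question.
$short = '<h1 itemprop="name" class="">Short Title&nbsp;<span id="titleYear">(year)</span></h1>';
$long  = '<h1 itemprop="name" class="long">A Much Longer Title&nbsp;<span id="titleYear">(year)</span></h1>';

var_dump(extract_title($short));
var_dump(extract_title($long));
var_dump(extract_title('<h1>no match here</h1>'));
```

Returning null on a miss lets the calling loop skip or log a page instead of reading $f_content[1] when the marker was never found.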
ASKER CERTIFIED SOLUTION
Ray Paseur
Hi,

Once again, thanks for the help. I've now managed to solve the problem and keep the bald patches to a minimum.

Just out of curiosity, how would you condense the regex you gave:
$rgx
= '#'                 // REGEX DELIMITER
. preg_quote('<h1')   // H1 HTML TAG
. '.*?'               // ANYTHING OR NOTHING
. preg_quote('>')     // END OF H1 OPENING TAG
. '(.*?)'             // CAPTURE GROUP OF ANYTHING OR NOTHING
. preg_quote('</h1')  // END OF H1 TAG
. '#'                 // REGEX DELIMITER
. 'i'                 // CASE-INSENSITIVE FLAG
. 's'                 // SINGLE-LINE FLAG
;



down to one line?

Many thanks

Dave
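One possible one-line form of the concatenated pattern quoted above, offered as a sketch rather than as the accepted answer: preg_quote() only adds backslashes in front of "<" and ">" (on PHP versions that escape them), and those characters are literal in PCRE either way, so the pieces can be written as a single string. The sample data is borrowed from the demo earlier in the thread.

```php
<?php
// Sample data borrowed from the demo earlier in the thread.
$html = <<<EOD
1.  <h1 itemprop="name" class="">Hello</h1>
2.  <H1 itemprop="name" class="long">World</H1>
<h1>Zalgo Pony</h1>
EOD;

// Same pattern on one line: # delimiters, followed by the i and s flags.
$rgx = '#<h1.*?>(.*?)</h1#is';

preg_match_all($rgx, $html, $match);
print_r($match[1]); // the three captured titles
```

The long form is easier to annotate line by line; the short form is easier to paste into a regex tester.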