nwalker78
asked on
Regex or operator problems
Following on from a previous post, I'm updating my IMDb scraper to fall in line with the new site layout. The current problem is that there seem to be two regex patterns for obtaining the title, these being:
1. <h1 itemprop="name" class="">
2. <h1 itemprop="name" class="long">
Individually they work fine, but I'm having trouble combining them so it searches for <h1 itemprop="name" class=""> or <h1 itemprop="name" class="long">. The pattern I've tried is:
$f_content = explode('(<h1 class="long" itemprop="name">|<h1 class="" itemprop="name">)', $movie_content);
The regex pattern seems to work in EditPad Lite 7 but not in PHP.
Can anyone offer some pointers?
Furthermore, if you work with explode, I don't see the need for regular expressions:
$f_content = explode('<h1 class="long" itemprop="name">', $movie_content);
if ($f_content[0] == $movie_content) { // if no match is found, explode() returns the original string as the only element
$f_content = explode('<h1 class="" itemprop="name">', $movie_content);
}
Be sure to comply with Robots and Screen Scraping in the IMDB terms of use.
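The fallback above can be generalised so that a missing tag is detected explicitly instead of by comparing against the whole page. This is only a minimal sketch: the helper name split_on_first_match() and the sample input are made up for illustration and do not come from the thread, and the type declarations assume PHP 7.1 or later.

```php
<?php
// Hypothetical helper (not from the thread): try each literal delimiter in turn.
function split_on_first_match(string $haystack, array $delimiters): ?array
{
    foreach ($delimiters as $delimiter) {
        // A limit of 2 splits only at the first occurrence of the delimiter.
        $parts = explode($delimiter, $haystack, 2);
        if (count($parts) === 2) {
            return $parts; // [text before the tag, text after the tag]
        }
    }
    return null; // none of the delimiters occurred in the input
}

// Made-up sample input standing in for a downloaded page.
$movie_content = '<div><h1 itemprop="name" class="long">Example Title <span id="titleYear">(1980)</span></h1></div>';

$f_content = split_on_first_match($movie_content, [
    '<h1 itemprop="name" class="">',
    '<h1 itemprop="name" class="long">',
]);

if ($f_content === null) {
    echo 'No title tag found' . PHP_EOL;
} else {
    // Cut the title off at the year span, as the scraper in this thread does.
    $title_parts = explode('<span id="titleYear">', $f_content[1], 2);
    echo trim($title_parts[0]) . PHP_EOL;
}
```

Returning null when nothing matches lets the caller skip a page cleanly rather than reading an undefined $f_content[1].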
The best thing you can give us, in order for us to help you, is a good set of test data. I'll try to make do with my imagination, but your real-world test cases are the things that make the software development process work.
http://iconoun.com/demo/temp_nwalker78.php
<?php // demo/temp_nwalker78.php
/**
* http://www.experts-exchange.com/questions/28934186/Regex-or-operator-problems.html
* http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags?answertab=oldest#tab-top
*
* http://php.net/manual/en/function.preg-match-all.php
* http://php.net/manual/en/reference.pcre.pattern.syntax.php
* http://php.net/manual/en/reference.pcre.pattern.modifiers.php
*/
error_reporting(E_ALL);
echo '<pre>';
// SOME SAMPLE TEST DATA
$html = <<<EOD
1. <h1 itemprop="name" class="">Hello</h1>
2. <H1 itemprop="name" class="long">World</H1>
<h1>Zalgo Pony</h1>
EOD;
// A REGEX TO EXTRACT THE CONTENTS OF HTML H1 TAGS
$rgx
= '#' // REGEX DELIMITER
. preg_quote('<h1') // H1 HTML TAG
. '.*?' // ANYTHING OR NOTHING
. preg_quote('>') // END OF H1 OPENING TAG
. '(.*?)' // CAPTURE GROUP OF ANYTHING OR NOTHING
. preg_quote('</h1') // END OF H1 TAG
. '#' // REGEX DELIMITER
. 'i' // CASE-INSENSITIVE FLAG
;
preg_match_all($rgx, $html, $match);
foreach ($match[1] as $title)
{
echo PHP_EOL . $title;
}
ASKER
Here is the full code including test data:
In my original post here (Php-curl-xpath), Ray Paseur stated it would be better to use regex/explode rather than DOM/XPath. That's why I changed it.
<?php
set_time_limit(0);
error_reporting(E_ERROR | E_WARNING | E_PARSE);
$movieids = array
(
'tt0086190' // class=""
,'tt0092099' // class=""
,'tt0080684' // class="long"
,'tt0087985' // class=""
,'tt0095016' // class=""
,'tt0096895' // class=""
);
foreach($movieids as $movies) // looping for titles with class=""
{
$movie_url = "http://www.imdb.com/title/".$movies;
echo $movie_url.'<br><hr>';
sleep(2);
$movie_content= file_get_contents($movie_url);
$f_content = explode('<h1 itemprop="name" class="">', $movie_content); //Find beginning of film title string.
$f_content = explode('<span id="titleYear">', $f_content[1]); //Find end of film title string.
$resultsa[$movies]['movieNM'] = str_replace( " ", "", "$f_content[0]"); // remove unwanted characters
}
// =======================================
foreach($movieids as $moviel)// looping for titles with class="long"
{
$movie_url = "http://www.imdb.com/title/".$moviel;
echo $movie_url.'<br><hr>';
sleep(2);
$movie_content= file_get_contents($movie_url);
$f_content = explode('<h1 itemprop="name" class="long">', $movie_content); //Find beginning of film title string.
$f_content = explode('<span id="titleYear">', $f_content[1]); //Find end of film title string.
$resultsb[$moviel]['movieNM'] = str_replace( " ", "", "$f_content[0]"); // remove unwanted characters
}
foreach($movieids as $movieb)// looping for titles with class="" or class="long"
{
$movie_url = "http://www.imdb.com/title/".$movieb;
echo $movie_url.'<br><hr>';
sleep(2);
$movie_content= file_get_contents($movie_url);
$f_content = explode('(<h1 class="long" itemprop="name">|<h1 class="" itemprop="name">)', $movie_content); //Find beginning of film title string.
$f_content = explode('<span id="titleYear">', $f_content[1]); //Find end of film title string.
$resultsc[$movieb]['movieNM'] = str_replace( " ", "", "$f_content[0]"); // remove unwanted characters
}
var_dump($resultsa);
echo '<hr>';
var_dump($resultsb);
echo '<hr>';
var_dump($resultsc);
?>
ASKER
Hi,
Once again, thanks for the help. I've now managed to solve the problem and keep the bald patches to a minimum.
Just out of curiosity, how would you condense the regex you gave:
$rgx
= '#' // REGEX DELIMITER
. preg_quote('<h1') // H1 HTML TAG
. '.*?' // ANYTHING OR NOTHING
. preg_quote('>') // END OF H1 OPENING TAG
. '(.*?)' // CAPTURE GROUP OF ANYTHING OR NOTHING
. preg_quote('</h1') // END OF H1 TAG
. '#' // REGEX DELIMITER
. 'i' // CASE-INSENSITIVE FLAG
. 's' // SINGLE-LINE FLAG
;
down to one line?
Many thanks
Dave
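One possible way to write that pattern on a single line is sketched below. It is not the answer given in the thread. It assumes that the preg_quote() calls can be dropped, since '<', '>' and '/' carry no special meaning at those positions when '#' is the delimiter. The sample data is copied from the demo script earlier in the thread.

```php
<?php
// Sample data copied from the demo script earlier in the thread.
$html = <<<EOD
1. <h1 itemprop="name" class="">Hello</h1>
2. <H1 itemprop="name" class="long">World</H1>
<h1>Zalgo Pony</h1>
EOD;

// The concatenated pieces collapsed into one string literal:
// '#' delimiters, then the 'i' (case-insensitive) and 's' (dot matches newline) flags.
$rgx = '#<h1.*?>(.*?)</h1#is';

preg_match_all($rgx, $html, $match);

// $match[1] holds the text captured between each opening and closing tag.
foreach ($match[1] as $title) {
    echo $title . PHP_EOL;
}
```

The multi-line form is easier to comment piece by piece; the one-line form is easier to paste into a regex tester.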
SOLUTION
Your code will look for the literal string "(<h1 class="long" itemprop="name">|<h1 class="" itemprop="name">)", because explode() treats its delimiter as plain text rather than as a regular expression.
HTH,
Dan
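To illustrate the difference, here is a minimal sketch that compares explode() with preg_split() on the same alternation. The sample input is made up, and its attribute order follows the pattern from the question; real pages may order the attributes differently, in which case the pattern would need adjusting.

```php
<?php
// Made-up sample input; a real page will contain far more markup.
$movie_content = '<h1 class="long" itemprop="name">Example Title <span id="titleYear">(1980)</span></h1>';

// explode() searches for this whole text literally, so nothing is split.
$literal = explode('(<h1 class="long" itemprop="name">|<h1 class="" itemprop="name">)', $movie_content);
echo count($literal) . PHP_EOL; // 1

// preg_split() interprets the | as "or". The pattern needs delimiters, here '#'.
$regex = '#<h1 class="long" itemprop="name">|<h1 class="" itemprop="name">#';
$f_content = preg_split($regex, $movie_content);
echo count($f_content) . PHP_EOL; // 2

// Same second step as in the question: cut the title off at the year span.
$f_content = explode('<span id="titleYear">', $f_content[1]);
echo trim($f_content[0]) . PHP_EOL; // Example Title
```

With preg_split() a single loop can handle both class="" and class="long" pages, provided the result is checked for a second element before it is used.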