nwalker78
asked on
Php/curl/xpath
Hi,
I'm running into some issues executing XPath queries against scraped data (see the code below). The error I'm getting is:
Notice: Trying to get property of non-object in actorinfo.php on line xx
The issue is caused by the queried node not being there. For example, with an actor's DOB/DOD: since most actors are still alive, the $actor_ddate query causes the notice. I can work around this by using if(strlen($actor_ddate) <=1){$actor_ddate = 'Still Alive';}
after the query has run, and it works just fine. The only downside is that the above notice keeps being shown.
$newDom = new DOMDocument;
$newDom->appendChild($newDom->importNode($people,true));
$personXpath = new DOMXPath($newDom);
// Scraped Content
$actor_image = trim($personXpath->query("//img[1]/@src")->item(0)->nodeValue);
$actor_name = trim($personXpath->query(".//*[@id='overview-top']/h1/span[1]/text()[1]")->item(0)->nodeValue);
$actor_bdate = trim($personXpath->query(".//*[@id='name-born-info']/time")->item(0)->nodeValue);
$actor_ddate = trim($personXpath->query(".//*[@id='name-death-info']/time")->item(0)->nodeValue);
$results[] = array
(
'actor_image' => $actor_image,
'actor_name' => $actor_name,
'actor_dob' => $actor_bdate,
'actor_dod' => $actor_ddate,
);
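For context, here is a minimal sketch of why the notice appears. It assumes a small hand-written HTML fragment (not the real IMDb markup): when an XPath query matches nothing, the returned DOMNodeList is empty and item(0) gives back null, so reading ->nodeValue from it triggers the notice.

```php
<?php
// Hypothetical fragment: it has a birth date but no death date.
$html = '<div><div id="name-born-info"><time>April 3, 1961</time></div></div>';

libxml_use_internal_errors(true); // keep libxml warnings about HTML5 tags quiet
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$born = $xpath->query(".//*[@id='name-born-info']/time");
$died = $xpath->query(".//*[@id='name-death-info']/time");

var_dump($born->length);            // int(1): the birth date was found
var_dump($died->length);            // int(0): nothing matched
var_dump($died->item(0) === null);  // bool(true): item(0) is null here
```

So the notice comes from the chained ->item(0)->nodeValue on an empty result, not from the query itself.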
I've tried several things with no luck and was wondering if anybody could offer some pointers. Ideally I want to solve the issue rather than silencing/suppressing the notice.
kind regards
nw
ASKER
Hi, the full code of the page is:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Untitled Document</title>
<style type="text/css">
.galleryItem {
width: 114px;
float:left;
border:#000 thin solid;
margin: 2px;
}
.galleryImage {
margin:3px;
height: 158px;
width: 107px;
border:#000 thin solid;
}
.galleryText {
margin:3px;
border:#000 thin solid;
height: 12px;
width: 107px;
font-size:10px;
text-align:center;
}
</style>
</head>
<body><?php
set_time_limit(0);
$results = array();
$actorlist= array('http://www.imdb.com/name/nm0000552',
'http://www.imdb.com/name/nm2242932',
'http://www.imdb.com/name/nm0219292',
'http://www.imdb.com/name/nm0256297',
'http://www.imdb.com/name/nm0000245',
'http://www.imdb.com/name/nm0003817',);
for ($actorid =0; $actorid <count($actorlist); $actorid++)
{
$actor_content = file_get_contents($actorlist[$actorid]);
$dom = new DOMDocument();
@$dom->loadHTML($actor_content);
$tempDom = new DOMDocument();
$overview_xpath = new DOMXPath($dom);
$movie_overview = $overview_xpath->query('//div[@class="article name-overview"]');
foreach ($movie_overview as $item)
{
$tempDom->appendChild($tempDom->importNode($item,true));
}
$tempDom->saveHTML();
$peopleXpath = new DOMXPath($tempDom);
$peopleDiv = $peopleXpath->query('//table[@id="name-overview-widget-layout"]');
foreach ($peopleDiv as $people)
{
$newDom = new DOMDocument;
$newDom->appendChild($newDom->importNode($people,true));
$personXpath = new DOMXPath($newDom);
// Scraped Content
$actor_image = trim($personXpath->query("//img[1]/@src")->item(0)->nodeValue);
$actor_name = trim($personXpath->query(".//*[@id='overview-top']/h1/span[1]/text()[1]")->item(0)->nodeValue);
$actor_idtag = $actorlist[$actorid];
$actor_bdate = trim($personXpath->query(".//*[@id='name-born-info']/time")->item(0)->nodeValue);
$actor_ddate = trim($personXpath->query(".//*[@id='name-death-info']/time")->item(0)->nodeValue);
$actor_bdate = preg_replace('/[^A-Za-z0-9\-]/', '-', $actor_bdate);
$actor_bdate = preg_replace('/-+/', '-', $actor_bdate);
$actor_bdate = preg_replace('/-/', ' ', $actor_bdate);
$actor_ddate = preg_replace('/[^A-Za-z0-9\-]/', '-', $actor_ddate);
$actor_ddate = preg_replace('/-+/', '-', $actor_ddate);
$actor_ddate = preg_replace('/-/', ' ', $actor_ddate);
if(strlen($actor_ddate) <=1){$actor_ddate = 'Still Alive';}
$actor_idtag = str_replace("http://","", $actor_idtag);
$idtag_array = explode( '/', $actor_idtag);
$results[] = array
(
'actor_image' => $actor_image,
'actor_name' => $actor_name,
'actor_idtag' => $idtag_array[2],
'actor_dob' => $actor_bdate,
'actor_dod' => $actor_ddate,
);
}
sleep(rand(1,3));
}
var_dump($results);
echo '<hr>';
for ($r =0; $r < count($results); $r++)
{
$sData = $results[$r]['actor_idtag'].'.jpg';
$filename = 'D:\wamp\www\guesswhat\Actors\\'.$sData;
if (file_exists($filename))
{
$imgres = "Exists";
} else {
$imgres = "Added";
//get_file($results[$r]['actor_image'], "D:\Actors\\", $sData);
} ?>
<div class="galleryItem">
<div class="galleryImage"><img src="<?php echo 'Actors/'.$sData ?>" width="107" height="158" /></div>
<div class="galleryText"><?php echo $imgres ?></div>
</div>
<?php
}
?></body>
</html>
Result from var_dump($results):
array (size=6)
0 =>
array (size=5)
'actor_image' => string 'http://ia.media-imdb.com/images/M/MV5BMTc0NDQzODAwNF5BMl5BanBnXkFtZTYwMzUzNTk3._V1_UY317_CR6,0,214,317_AL_.jpg' (length=110)
'actor_name' => string 'Eddie Murphy' (length=12)
'actor_idtag' => string 'nm0000552' (length=9)
'actor_dob' => string 'April 3 1961' (length=12)
'actor_dod' => string 'Still Alive' (length=11)
1 =>
array (size=5)
'actor_image' => string 'http://ia.media-imdb.com/images/M/MV5BMTkxNzU2OTY4OF5BMl5BanBnXkFtZTcwOTE1MzQwOQ@@._V1_UY317_CR12,0,214,317_AL_.jpg' (length=115)
'actor_name' => string 'Kenzie Dalton' (length=13)
'actor_idtag' => string 'nm2242932' (length=9)
'actor_dob' => string 'March 7 1988' (length=12)
'actor_dod' => string 'Still Alive' (length=11)
2 =>
array (size=5)
'actor_image' => string 'http://ia.media-imdb.com/images/M/MV5BMjExNzgzNTk5OF5BMl5BanBnXkFtZTcwMjgxNDA2Nw@@._V1_UX214_CR0,0,214,317_AL_.jpg' (length=114)
'actor_name' => string 'David Denman' (length=12)
'actor_idtag' => string 'nm0219292' (length=9)
'actor_dob' => string 'July 25 1973' (length=12)
'actor_dod' => string 'Still Alive' (length=11)
3 =>
array (size=5)
'actor_image' => string 'http://ia.media-imdb.com/images/M/MV5BMTg3NzA3OTE2Ml5BMl5BanBnXkFtZTgwNDUyMzYxNjE@._V1_UY317_CR1,0,214,317_AL_.jpg' (length=114)
'actor_name' => string 'Gideon Emery' (length=12)
'actor_idtag' => string 'nm0256297' (length=9)
'actor_dob' => string 'September 12 1972' (length=17)
'actor_dod' => string 'Still Alive' (length=11)
4 =>
array (size=5)
'actor_image' => string 'http://ia.media-imdb.com/images/M/MV5BNTYzMjc2Mjg4MF5BMl5BanBnXkFtZTcwODc1OTQwNw@@._V1_UX214_CR0,0,214,317_AL_.jpg' (length=114)
'actor_name' => string 'Robin Williams' (length=14)
'actor_idtag' => string 'nm0000245' (length=9)
'actor_dob' => string 'July 21 1951' (length=12)
'actor_dod' => string 'August 11 2014' (length=14)
5 =>
array (size=5)
'actor_image' => string 'http://ia.media-imdb.com/images/M/MV5BMTI3NDY2ODk5OV5BMl5BanBnXkFtZTYwMjQ0NzE0._V1_UY317_CR27,0,214,317_AL_.jpg' (length=111)
'actor_name' => string 'Michael Clarke Duncan' (length=21)
'actor_idtag' => string 'nm0003817' (length=9)
'actor_dob' => string 'December 10 1957' (length=16)
'actor_dod' => string 'September 3 2012' (length=16)
Warning:
Notice: Trying to get property of non-object in D:\wamp\www\actorinfo.php on line 71
Line 71 refers to: $actor_ddate = trim($personXpath->query(".//*[@id='name-death-info']/time")->item(0)->nodeValue);
As you can see, 4 of the 6 actors are still alive, and those are the 4 that generate the notice on line 71. Although the var_dump shows the missing value handled, that handling happens after the fact.
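One possible way to handle this before the fact, rather than after, is to check the node list before touching item(0). The following is only a sketch, not the accepted solution from this thread: firstNodeText() is a hypothetical helper name, the markup is a made-up stand-in for one scraped actor block, and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper (not from the original code): returns the trimmed text
// of the first node matching $expr, or $default when nothing matches.
function firstNodeText(DOMXPath $xpath, string $expr, string $default = ''): string
{
    $nodes = $xpath->query($expr);
    // query() returns false for a malformed expression;
    // an empty list means the node is simply not on the page.
    if ($nodes === false || $nodes->length === 0) {
        return $default;
    }
    return trim((string) $nodes->item(0)->nodeValue);
}

// Made-up markup standing in for one scraped actor block (no death date).
$html = '<table id="name-overview-widget-layout"><tr><td>'
      . '<div id="name-born-info"><time>April 3, 1961</time></div>'
      . '</td></tr></table>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
$personXpath = new DOMXPath($dom);

$actor_bdate = firstNodeText($personXpath, ".//*[@id='name-born-info']/time");
$actor_ddate = firstNodeText($personXpath, ".//*[@id='name-death-info']/time", 'Still Alive');

echo $actor_bdate, "\n"; // April 3, 1961
echo $actor_ddate, "\n"; // Still Alive
```

If the existing strlen() fallback should stay in place, the default can be left as an empty string and the rest of the loop works unchanged.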
ASKER CERTIFIED SOLUTION
ASKER
Thank you so much for simplifying this. I wasn't expecting such a detailed solution. I am aware of sites frowning on scraping. I usually have my sleep set to a random value between 15 and 45 seconds so as not to hammer the site.
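A rough sketch of that kind of randomised delay, for reference. politeDelaySeconds() is a hypothetical name, and the 15 and 45 second bounds are just the values mentioned above.

```php
<?php
// Hypothetical helper: picks a random wait time between two bounds.
function politeDelaySeconds(int $min = 15, int $max = 45): int
{
    // mt_rand() is inclusive on both ends.
    return mt_rand($min, $max);
}

$delay = politeDelaySeconds();
echo "Would wait {$delay} seconds before the next request\n";
// In the scraping loop this value would be passed to sleep($delay).
```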
Also, what value does $people have? Where is that data coming from? Can you post an example so I can test it?