Scott Fell
PHP extract HTML using DOMXpath from nested tables
I am using the PHP below to grab data from an HTML file (which I have permission to extract).
$html = file_get_contents("http://mysite.com/files/data.html");
# CREATE DOCUMENT AND LOAD DATA
$doc = new DOMDocument('1.0', 'UTF-8');
$internalErrors = libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_use_internal_errors($internalErrors);
$xpath = new DOMXpath($doc);
$elements = $xpath->query('//table[@class="X1"]');
# DATA IS NOW LOADED: SEE SAMPLE EXTRACTED TABLES WITH CLASS X1
At this point I have a series of tables with class X1 which contain nested tables and data I would like to extract
EXTRACTED HTML
<table width="575" border="0" align="center" cellpadding="0" cellspacing="0" class="X1">
<tbody><tr>
<td>
<table width="98%" border="0" align="left" cellpadding="0" cellspacing="0" class="X2">
<tbody><tr>
<td width="16%" valign="top">
<img src="images/spacer.gif" alt="" border="0" width="1" height="8">
<img src="/images/getimage.aspx?img=abc123.jpg" alt="">
</td>
<td width="84%" valign="top">
<table width="100%" border="0" cellpadding="3" cellspacing="3" class="Name">
<tbody><tr>
<td>
<a href="javascript: void(0)" onclick="popup('details.aspx?id=abc123')">FirstName LastName</a>
</td>
</tr>
</tbody></table>
<table width="100%" border="0" cellpadding="3" cellspacing="3" class="X3">
<tbody><tr>
<td width="18%">Age:</td>
<td width="21%">25</td>
<td width="10%" class="classText">Class:</td>
<td width="34%" class="classText">NON SENTENCED </td>
</tr>
<tr>
<td>Race/Sex: </td>
<td>W/F</td>
<td>Visit:</td>
<td>Call Facility For Time**</td>
</tr>
<tr>
<td valign="top">Intake Date: </td>
<td>
10/3/2016
<br>
10:34 PM
</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td>
<table width="90%" border="0" align="left" cellpadding="0" cellspacing="0">
<tbody><tr>
<td width="51%" valign="top">
<div align="right">
<img src="images/layout/img_uparrow.gif" alt="" width="9" height="12">
</div>
</td>
<td width="49%" valign="baseline" class="X4">
<div align="center">
<a href="#Top">back to top</a>
</div>
</td>
</tr>
</tbody></table>
</td>
</tr>
</tbody></table>
</td>
</tr>
</tbody></table>
</td>
</tr>
</tbody></table>
<table width="575" border="0" align="center" cellpadding="0" cellspacing="0" class="X1">
<tbody><tr>
<td>
<table width="98%" border="0" align="left" cellpadding="0" cellspacing="0" class="X2">
<tbody><tr>
<td width="16%" valign="top">
<img src="images/spacer.gif" alt="" border="0" width="1" height="8">
<img src="/images/getimage.aspx?img=abc123.jpg" alt="">
</td>
<td width="84%" valign="top">
<table width="100%" border="0" cellpadding="3" cellspacing="3" class="Name">
<tbody><tr>
<td>
<a href="javascript: void(0)" onclick="popup('details.aspx?id=abc123')">FirstName LastName</a>
</td>
</tr>
</tbody></table>
<table width="100%" border="0" cellpadding="3" cellspacing="3" class="X3">
<tbody><tr>
<td width="18%">Age:</td>
<td width="21%">25</td>
<td width="10%" class="classText">Class:</td>
<td width="34%" class="classText">NON SENTENCED </td>
</tr>
<tr>
<td>Race/Sex: </td>
<td>W/F</td>
<td>Visit:</td>
<td>Call Facility For Time**</td>
</tr>
<tr>
<td valign="top">Intake Date: </td>
<td>
10/3/2016
<br>
10:34 PM
</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td>
<table width="90%" border="0" align="left" cellpadding="0" cellspacing="0">
<tbody><tr>
<td width="51%" valign="top">
<div align="right">
<img src="images/layout/img_uparrow.gif" alt="" width="9" height="12">
</div>
</td>
<td width="49%" valign="baseline" class="X4">
<div align="center">
<a href="#Top">back to top</a>
</div>
</td>
</tr>
</tbody></table>
</td>
</tr>
</tbody></table>
</td>
</tr>
</tbody></table>
</td>
</tr>
</tbody></table>
<table width="575" border="0" align="center" cellpadding="0" cellspacing="0" class="X1">
<tbody><tr>
<td>
<table width="98%" border="0" align="left" cellpadding="0" cellspacing="0" class="X2">
<tbody><tr>
<td width="16%" valign="top">
<img src="images/spacer.gif" alt="" border="0" width="1" height="8">
<img src="/images/getimage.aspx?img=abc123.jpg" alt="">
</td>
<td width="84%" valign="top">
<table width="100%" border="0" cellpadding="3" cellspacing="3" class="Name">
<tbody><tr>
<td>
<a href="javascript: void(0)" onclick="popup('details.aspx?id=abc123')">FirstName LastName</a>
</td>
</tr>
</tbody></table>
<table width="100%" border="0" cellpadding="3" cellspacing="3" class="X3">
<tbody><tr>
<td width="18%">Age:</td>
<td width="21%">25</td>
<td width="10%" class="classText">Class:</td>
<td width="34%" class="classText">NON SENTENCED </td>
</tr>
<tr>
<td>Race/Sex: </td>
<td>W/F</td>
<td>Visit:</td>
<td>Call Facility For Time**</td>
</tr>
<tr>
<td valign="top">Intake Date: </td>
<td>
10/3/2016
<br>
10:34 PM
</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td>
<table width="90%" border="0" align="left" cellpadding="0" cellspacing="0">
<tbody><tr>
<td width="51%" valign="top">
<div align="right">
<img src="images/layout/img_uparrow.gif" alt="" width="9" height="12">
</div>
</td>
<td width="49%" valign="baseline" class="X4">
<div align="center">
<a href="#Top">back to top</a>
</div>
</td>
</tr>
</tbody></table>
</td>
</tr>
</tbody></table>
</td>
</tr>
</tbody></table>
</td>
</tr>
</tbody></table>
Next I am looping through each table with class X1 with
if (!is_null($elements)) {
    foreach ($elements as $element) {
        $image = $xpath->query('/table[@class="X2"]/tbody/tr/td/img[1]@src');
        echo $image
    }
}
FULL CODE
$html = file_get_contents("http://mysite.com/files/data.html");
# CREATE DOCUMENT AND LOAD DATA
$doc = new DOMDocument('1.0', 'UTF-8');
$internalErrors = libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_use_internal_errors($internalErrors);
$xpath = new DOMXpath($doc);
$elements = $xpath->query('//table[@class="X1"]');
# DATA IS NOW LOADED: SEE SAMPLE EXTRACTED TABLES WITH CLASS X1
if (!is_null($elements)) {
    foreach ($elements as $element) {
        $image = $xpath->query('/table[@class="X2"]/tbody/tr/td/img[1]@src');
        echo $image
    }
}
Ultimately, I would like to get the image, name, age and other data. For this question, just getting the image will help get me on the right track. Right now, I have an error from line 16 directly above, "Warning: DOMXPath::query(): Invalid expression"
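I have also tried the expressions below, which produce errors as well.
$image = $xpath->query('./table[@class="X2"]/tbody/tr/td/img[1]@src');
$image = $xpath->query('//table[@class="X2"]/tbody/tr/td/img[1]@src');
$image = $xpath->query('.//table[@class="X2"]/tbody/tr/td/img[1]@src');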
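A note on the "Invalid expression" warning: every attempt above ends in img[1]@src, with no "/" between the predicate and the attribute. In XPath an attribute is its own location step, so the tail would need to read img[1]/@src. The sketch below is only an illustration of that one point, not the accepted answer (which is not shown in this thread). The trimmed one-record markup and the variable names $broken and $fixed are assumptions made for the demo.

```php
<?php
// Hypothetical one-record sample, trimmed from the markup in the question.
$html = '<table class="X1"><tbody><tr><td>'
      . '<table class="X2"><tbody><tr><td>'
      . '<img src="images/spacer.gif" alt="">'
      . '<img src="/images/getimage.aspx?img=abc123.jpg" alt="">'
      . '</td></tr></tbody></table>'
      . '</td></tr></tbody></table>';

$doc = new DOMDocument();
// Collect libxml errors internally instead of printing them.
$previous = libxml_use_internal_errors(true);
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Malformed: "@src" follows the predicate with no "/" step separator.
$broken = $xpath->query('//table[@class="X2"]/tbody/tr/td/img[1]@src');

// Well-formed: the attribute is a separate location step.
$fixed = $xpath->query('//table[@class="X2"]/tbody/tr/td/img[1]/@src');

libxml_clear_errors();
libxml_use_internal_errors($previous);

// query() returns false for a malformed expression, a DOMNodeList otherwise.
var_dump($broken === false);
echo $fixed->item(0)->nodeValue, "\n";
```

Two further things worth checking against the real page: query() returns a DOMNodeList rather than a string, so the result cannot be passed to echo directly, and img[1] in this markup is the spacer GIF, so the photo would be img[2].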
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
<?php
$html = getHTML();
# CREATE DOCUMENT AND LOAD DATA
$doc = new DOMDocument('1.0', 'UTF-8');
$internalErrors = libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_use_internal_errors($internalErrors);
$xpath = new DOMXpath($doc);
$elements = $xpath->query('//table[@class="X1"]');
if (!is_null($elements)) {
    foreach ($elements as $element) {
        $imageList = $xpath->query('//table[@class="X2"]/tbody/tr/td/img[2]/@src');
        foreach($imageList as $node) {
            echo "{$node->nodeName} - {$node->nodeValue} <br/>";
        }
    }
}
echo "<hr>".$html;
function getHTML(){
$html = <<<'EOD'
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
<table>
<tbody>
<tr>
<td>
<!-- start repeating -->
<table width="575" border="0" align="center" cellpadding="0" cellspacing="0" class="X1">
<tbody><tr>
<td>
<table width="98%" border="0" align="left" cellpadding="0" cellspacing="0" class="X2">
<tbody><tr>
<td width="16%" valign="top">
<img src="images/spacer.gif" alt="" border="0" width="1" height="8">
<img src="/images/getimage.aspx?img=abc123.jpg" alt="">
</td>
<td width="84%" valign="top">
<table width="100%" border="0" cellpadding="3" cellspacing="3" class="Name">
<tbody><tr>
<td>
<a href="javascript: void(0)" onclick="popup('details.aspx?id=abc123')">FirstName1 LastName1</a>
</td>
</tr>
</tbody></table>
<table width="100%" border="0" cellpadding="3" cellspacing="3" class="X3">
<tbody><tr>
<td width="18%">Age:</td>
<td width="21%">25</td>
<td width="10%" class="classText">Class:</td>
<td width="34%" class="classText">NON SENTENCED </td>
</tr>
<tr>
<td>Race/Sex: </td>
<td>W/F</td>
<td>Visit:</td>
<td>Call Facility For Time**</td>
</tr>
<tr>
<td valign="top">Intake Date: </td>
<td>
10/3/2016
<br>
10:34 PM
</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td>
<table width="90%" border="0" align="left" cellpadding="0" cellspacing="0">
<tbody><tr>
<td width="51%" valign="top">
<div align="right">
<img src="images/layout/img_uparrow.gif" alt="" width="9" height="12">
</div>
</td>
<td width="49%" valign="baseline" class="X4">
<div align="center">
<a href="#Top">back to top</a>
</div>
</td>
</tr>
</tbody></table>
</td>
</tr>
</tbody></table>
</td>
</tr>
</tbody></table>
</td>
</tr>
</tbody></table>
<table width="575" border="0" align="center" cellpadding="0" cellspacing="0" class="X1">
<tbody><tr>
<td>
<table width="98%" border="0" align="left" cellpadding="0" cellspacing="0" class="X2">
<tbody><tr>
<td width="16%" valign="top">
<img src="images/spacer.gif" alt="" border="0" width="1" height="8">
<img src="/images/getimage.aspx?img=def456.jpg" alt="">
</td>
<td width="84%" valign="top">
<table width="100%" border="0" cellpadding="3" cellspacing="3" class="Name">
<tbody><tr>
<td>
<a href="javascript: void(0)" onclick="popup('details.aspx?id=def456')">FirstName2 LastName2</a>
</td>
</tr>
</tbody></table>
<table width="100%" border="0" cellpadding="3" cellspacing="3" class="X3">
<tbody><tr>
<td width="18%">Age:</td>
<td width="21%">25</td>
<td width="10%" class="classText">Class:</td>
<td width="34%" class="classText">NON SENTENCED </td>
</tr>
<tr>
<td>Race/Sex: </td>
<td>W/F</td>
<td>Visit:</td>
<td>Call Facility For Time**</td>
</tr>
<tr>
<td valign="top">Intake Date: </td>
<td>
10/3/2016
<br>
10:34 PM
</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td>
<table width="90%" border="0" align="left" cellpadding="0" cellspacing="0">
<tbody><tr>
<td width="51%" valign="top">
<div align="right">
<img src="images/layout/img_uparrow.gif" alt="" width="9" height="12">
</div>
</td>
<td width="49%" valign="baseline" class="X4">
<div align="center">
<a href="#Top">back to top</a>
</div>
</td>
</tr>
</tbody></table>
</td>
</tr>
</tbody></table>
</td>
</tr>
</tbody></table>
</td>
</tr>
</tbody></table>
<table width="575" border="0" align="center" cellpadding="0" cellspacing="0" class="X1">
<tbody><tr>
<td>
<table width="98%" border="0" align="left" cellpadding="0" cellspacing="0" class="X2">
<tbody><tr>
<td width="16%" valign="top">
<img src="images/spacer.gif" alt="" border="0" width="1" height="8">
<img src="/images/getimage.aspx?img=ghi789.jpg" alt="">
</td>
<td width="84%" valign="top">
<table width="100%" border="0" cellpadding="3" cellspacing="3" class="Name">
<tbody><tr>
<td>
<a href="javascript: void(0)" onclick="popup('details.aspx?id=ghi789')">FirstName3 LastName3</a>
</td>
</tr>
</tbody></table>
<table width="100%" border="0" cellpadding="3" cellspacing="3" class="X3">
<tbody><tr>
<td width="18%">Age:</td>
<td width="21%">25</td>
<td width="10%" class="classText">Class:</td>
<td width="34%" class="classText">NON SENTENCED </td>
</tr>
<tr>
<td>Race/Sex: </td>
<td>W/F</td>
<td>Visit:</td>
<td>Call Facility For Time**</td>
</tr>
<tr>
<td valign="top">Intake Date: </td>
<td>
10/3/2016
<br>
10:34 PM
</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td>
<table width="90%" border="0" align="left" cellpadding="0" cellspacing="0">
<tbody><tr>
<td width="51%" valign="top">
<div align="right">
<img src="images/layout/img_uparrow.gif" alt="" width="9" height="12">
</div>
</td>
<td width="49%" valign="baseline" class="X4">
<div align="center">
<a href="#Top">back to top</a>
</div>
</td>
</tr>
</tbody></table>
</td>
</tr>
</tbody></table>
</td>
</tr>
</tbody></table>
</td>
</tr>
</tbody></table>
<!-- end repeating -->
</td>
</tr>
</tbody>
</table>
</body>
</html>
EOD;
return $html;
}
The output from the above is:
src - /images/getimage.aspx?img=
src - /images/getimage.aspx?img=
src - /images/getimage.aspx?img=
src - /images/getimage.aspx?img=
src - /images/getimage.aspx?img=
src - /images/getimage.aspx?img=
src - /images/getimage.aspx?img=
src - /images/getimage.aspx?img=
src - /images/getimage.aspx?img=
What I am expecting is just 3 images because there are 3 tables each with their own image:
src - /images/getimage.aspx?img=
src - /images/getimage.aspx?img=
src - /images/getimage.aspx?img=
It is repeating finding table[@class="X2"]/tbody/t
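One likely reason for the nine results: an expression that starts with // is evaluated from the document root, and the inner query above is never given $element as a context node, so each of the three passes finds all three images. A common way to scope the search is to start the path with .// and pass the current node as the second argument to query(). The sketch below compares the two on a generated stand-in for the page; the generated markup and the names $absolute and $relative are assumptions for illustration, not code from this thread.

```php
<?php
// Hypothetical stand-in for the page: three X1 tables, each wrapping one X2 table.
$html = '';
foreach (['abc123', 'def456', 'ghi789'] as $id) {
    $html .= '<table class="X1"><tbody><tr><td>'
           . '<table class="X2"><tbody><tr><td>'
           . '<img src="images/spacer.gif" alt="">'
           . '<img src="/images/getimage.aspx?img=' . $id . '.jpg" alt="">'
           . '</td></tr></tbody></table>'
           . '</td></tr></tbody></table>';
}

$doc = new DOMDocument();
$previous = libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);
$xpath = new DOMXPath($doc);

$absolute = 0;   // total matches when the path starts at the document root
$relative = [];  // src values found when the path starts at the current X1 table

foreach ($xpath->query('//table[@class="X1"]') as $element) {
    // Leading "//": searches the whole document on every pass.
    $absolute += $xpath->query('//table[@class="X2"]/tbody/tr/td/img[2]/@src')->length;

    // Leading ".//" plus a context node: searches only inside $element.
    $scoped = $xpath->query('.//table[@class="X2"]/tbody/tr/td/img[2]/@src', $element);
    foreach ($scoped as $attr) {
        $relative[] = $attr->nodeValue;
    }
}

echo "absolute matches: $absolute\n";
echo "relative matches: " . count($relative) . "\n";
print_r($relative);
```

Note that passing $element alone is not enough: a path with a leading // still begins at the root even when a context node is supplied, so both the leading dot and the second argument are needed.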
SOLUTION
ASKER
Thanks, I will give it a try and report back later today!
@Scott - were you able to resolve your problem?
ASKER
Sorry for the long delay. I was pulled off this one and am still working on it. The information here pointed me in the right direction. Thank you zephyr_hex!
ASKER
Let me build a better self-contained test case and post it here. I think the issue I am having is with the looping: I first extract a portion of the HTML tree (all tables with class X1), then loop through those tables and extract bits by adding on to the XPath. Or should a new XPath object be created?
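On the last question: as far as I know, a second DOMXPath object should not be necessary. The same object can be reused for every table, with the current table passed as the context node. The sketch below is one possible way to pull the image, name and age per record using evaluate() with the XPath string() function. The trimmed one-record markup, the $records array and the "Age:" label lookup are assumptions for illustration and would need checking against the real page.

```php
<?php
// Hypothetical single record, trimmed from the markup in the question.
$html = <<<'EOD'
<table class="X1"><tbody><tr><td>
  <table class="X2"><tbody><tr>
    <td>
      <img src="images/spacer.gif" alt="">
      <img src="/images/getimage.aspx?img=abc123.jpg" alt="">
    </td>
    <td>
      <table class="Name"><tbody><tr><td>
        <a href="javascript: void(0)">FirstName1 LastName1</a>
      </td></tr></tbody></table>
      <table class="X3"><tbody><tr>
        <td>Age:</td><td>25</td><td class="classText">Class:</td><td class="classText">NON SENTENCED </td>
      </tr></tbody></table>
    </td>
  </tr></tbody></table>
</td></tr></tbody></table>
EOD;

$doc = new DOMDocument();
$previous = libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// One DOMXPath object serves every query; only the context node changes.
$xpath = new DOMXPath($doc);

$records = [];
foreach ($xpath->query('//table[@class="X1"]') as $element) {
    $records[] = [
        // string() yields the value of the first matching node, or '' if none.
        'image' => $xpath->evaluate('string(.//table[@class="X2"]/tbody/tr/td/img[2]/@src)', $element),
        'name'  => trim($xpath->evaluate('string(.//table[@class="Name"]//a)', $element)),
        // Find the cell labelled "Age:" and read the cell that follows it.
        'age'   => trim($xpath->evaluate('string(.//td[normalize-space(.)="Age:"]/following-sibling::td[1])', $element)),
    ];
}

print_r($records);
```

Looking fields up by their label cell rather than by column position is a design choice that should survive small layout changes, at the cost of depending on the exact label text.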