Asked by Scott Fell

PHP extract HTML using DOMXpath from nested tables

I am using the PHP below to grab data from an HTML file (which I have permission to extract):
$html  = file_get_contents("http://mysite.com/files/data.html");

#  CREATE DOCUMENT AND LOAD DATA
$doc = new DOMDocument('1.0', 'UTF-8');
$internalErrors = libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_use_internal_errors($internalErrors);
$xpath = new DOMXpath($doc);

$elements = $xpath->query('//table[@class="X1"]');
# DATA IS NOW LOADED: SEE SAMPLE EXTRACTED TABLES WITH CLASS X1

At this point I have a series of tables with class X1, each containing nested tables with the data I would like to extract.

EXTRACTED HTML
<table width="575" border="0" align="center" cellpadding="0" cellspacing="0" class="X1">
    <tbody><tr>
        <td>
            <table width="98%" border="0" align="left" cellpadding="0" cellspacing="0" class="X2">
                <tbody><tr>
                    <td width="16%" valign="top">
                        <img src="images/spacer.gif" alt="" border="0" width="1" height="8">
                        <img src="/images/getimage.aspx?img=abc123.jpg" alt="">
                    </td>
                    <td width="84%" valign="top">

                        <table width="100%" border="0" cellpadding="3" cellspacing="3" class="Name">
                            <tbody><tr>
                                <td>
                                    <a href="javascript: void(0)" onclick="popup('details.aspx?id=abc123')">FirstName  LastName</a>
                                </td>
                            </tr>
                            </tbody></table>

                        <table width="100%" border="0" cellpadding="3" cellspacing="3" class="X3">
                            <tbody><tr>
                                <td width="18%">Age:</td>
                                <td width="21%">25</td>
                                <td width="10%" class="classText">Class:</td>
                                <td width="34%" class="classText">NON SENTENCED                       </td>
                            </tr>
                            <tr>
                                <td>Race/Sex: </td>
                                <td>W/F</td>
                                <td>Visit:</td>
                                <td>Call Facility For Time**</td>
                            </tr>
                            <tr>
                                <td valign="top">Intake Date: </td>
                                <td>
                                    10/3/2016
                                    <br>
                                    10:34 PM
                                </td>
                                <td>&nbsp; </td>
                            </tr>
                            <tr>
                                <td>&nbsp;</td>
                                <td>&nbsp;</td>
                                <td>&nbsp;</td>
                                <td>
                                    <table width="90%" border="0" align="left" cellpadding="0" cellspacing="0">
                                        <tbody><tr>
                                            <td width="51%" valign="top">
                                                <div align="right">
                                                    <img src="images/layout/img_uparrow.gif" alt="" width="9" height="12">
                                                </div>

                                            </td>
                                            <td width="49%" valign="baseline" class="X4">
                                                <div align="center">
                                                    <a href="#Top">back to top</a>
                                                </div>

                                            </td>
                                        </tr>
                                        </tbody></table>
                                </td>

                            </tr>
                            </tbody></table>

                    </td>

                </tr>
                </tbody></table>
        </td>
    </tr>
    </tbody></table>



<table width="575" border="0" align="center" cellpadding="0" cellspacing="0" class="X1">
    <tbody><tr>
        <td>
            <table width="98%" border="0" align="left" cellpadding="0" cellspacing="0" class="X2">
                <tbody><tr>
                    <td width="16%" valign="top">
                        <img src="images/spacer.gif" alt="" border="0" width="1" height="8">
                        <img src="/images/getimage.aspx?img=abc123.jpg" alt="">
                    </td>
                    <td width="84%" valign="top">

                        <table width="100%" border="0" cellpadding="3" cellspacing="3" class="Name">
                            <tbody><tr>
                                <td>
                                    <a href="javascript: void(0)" onclick="popup('details.aspx?id=abc123')">FirstName  LastName</a>
                                </td>
                            </tr>
                            </tbody></table>

                        <table width="100%" border="0" cellpadding="3" cellspacing="3" class="X3">
                            <tbody><tr>
                                <td width="18%">Age:</td>
                                <td width="21%">25</td>
                                <td width="10%" class="classText">Class:</td>
                                <td width="34%" class="classText">NON SENTENCED                       </td>
                            </tr>
                            <tr>
                                <td>Race/Sex: </td>
                                <td>W/F</td>
                                <td>Visit:</td>
                                <td>Call Facility For Time**</td>
                            </tr>
                            <tr>
                                <td valign="top">Intake Date: </td>
                                <td>
                                    10/3/2016
                                    <br>
                                    10:34 PM
                                </td>
                                <td>&nbsp; </td>
                            </tr>
                            <tr>
                                <td>&nbsp;</td>
                                <td>&nbsp;</td>
                                <td>&nbsp;</td>
                                <td>
                                    <table width="90%" border="0" align="left" cellpadding="0" cellspacing="0">
                                        <tbody><tr>
                                            <td width="51%" valign="top">
                                                <div align="right">
                                                    <img src="images/layout/img_uparrow.gif" alt="" width="9" height="12">
                                                </div>

                                            </td>
                                            <td width="49%" valign="baseline" class="X4">
                                                <div align="center">
                                                    <a href="#Top">back to top</a>
                                                </div>

                                            </td>
                                        </tr>
                                        </tbody></table>
                                </td>

                            </tr>
                            </tbody></table>

                    </td>

                </tr>
                </tbody></table>
        </td>
    </tr>
    </tbody></table>



<table width="575" border="0" align="center" cellpadding="0" cellspacing="0" class="X1">
    <tbody><tr>
        <td>
            <table width="98%" border="0" align="left" cellpadding="0" cellspacing="0" class="X2">
                <tbody><tr>
                    <td width="16%" valign="top">
                        <img src="images/spacer.gif" alt="" border="0" width="1" height="8">
                        <img src="/images/getimage.aspx?img=abc123.jpg" alt="">
                    </td>
                    <td width="84%" valign="top">

                        <table width="100%" border="0" cellpadding="3" cellspacing="3" class="Name">
                            <tbody><tr>
                                <td>
                                    <a href="javascript: void(0)" onclick="popup('details.aspx?id=abc123')">FirstName  LastName</a>
                                </td>
                            </tr>
                            </tbody></table>

                        <table width="100%" border="0" cellpadding="3" cellspacing="3" class="X3">
                            <tbody><tr>
                                <td width="18%">Age:</td>
                                <td width="21%">25</td>
                                <td width="10%" class="classText">Class:</td>
                                <td width="34%" class="classText">NON SENTENCED                       </td>
                            </tr>
                            <tr>
                                <td>Race/Sex: </td>
                                <td>W/F</td>
                                <td>Visit:</td>
                                <td>Call Facility For Time**</td>
                            </tr>
                            <tr>
                                <td valign="top">Intake Date: </td>
                                <td>
                                    10/3/2016
                                    <br>
                                    10:34 PM
                                </td>
                                <td>&nbsp; </td>
                            </tr>
                            <tr>
                                <td>&nbsp;</td>
                                <td>&nbsp;</td>
                                <td>&nbsp;</td>
                                <td>
                                    <table width="90%" border="0" align="left" cellpadding="0" cellspacing="0">
                                        <tbody><tr>
                                            <td width="51%" valign="top">
                                                <div align="right">
                                                    <img src="images/layout/img_uparrow.gif" alt="" width="9" height="12">
                                                </div>

                                            </td>
                                            <td width="49%" valign="baseline" class="X4">
                                                <div align="center">
                                                    <a href="#Top">back to top</a>
                                                </div>

                                            </td>
                                        </tr>
                                        </tbody></table>
                                </td>

                            </tr>
                            </tbody></table>

                    </td>

                </tr>
                </tbody></table>
        </td>
    </tr>
    </tbody></table>

Next, I loop through each table with class X1:
if (!is_null($elements)) {
    foreach ($elements as $element) {

        $image = $xpath->query('/table[@class="X2"]/tbody/tr/td/img[1]@src');
       echo $image


    }
}

FULL CODE
$html  = file_get_contents("http://mysite.com/files/data.html");

#  CREATE DOCUMENT AND LOAD DATA
$doc = new DOMDocument('1.0', 'UTF-8');
$internalErrors = libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_use_internal_errors($internalErrors);
$xpath = new DOMXpath($doc);

$elements = $xpath->query('//table[@class="X1"]');
# DATA IS NOW LOADED: SEE SAMPLE EXTRACTED TABLES WITH CLASS X1

if (!is_null($elements)) {
    foreach ($elements as $element) {

        $image = $xpath->query('/table[@class="X2"]/tbody/tr/td/img[1]@src');
       echo $image


    }
}

Ultimately, I would like to get the image, name, age and other data. For this question, just getting the image will put me on the right track. Right now, line 16 of the full code directly above produces the error "Warning: DOMXPath::query(): Invalid expression".

I have also tried the variations below, and they produce errors as well:

$image = $xpath->query('./table[@class="X2"]/tbody/tr/td/img[1]@src');
$image = $xpath->query('//table[@class="X2"]/tbody/tr/td/img[1]@src');
$image = $xpath->query('.//table[@class="X2"]/tbody/tr/td/img[1]@src');
ASKER CERTIFIED SOLUTION
zephyr_hex (Megan)
This solution is only available to members.
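
A note on the warning itself: in XPath an attribute is selected with its own location step, so `img[1]@src` is not a valid expression; a `/` is needed before `@src`. Separately, `DOMXPath::query()` returns a `DOMNodeList`, which cannot be passed to `echo` directly and has to be iterated. The sketch below illustrates both points on a made-up fragment. The `$sample` markup and the variable names are assumptions for illustration, not the asker's data, and this is not necessarily what the accepted answer said.

```php
<?php
// Hypothetical sample markup, reduced from the structure in the question.
$sample = '<table class="X2"><tbody><tr><td>'
        . '<img src="images/spacer.gif" alt="">'
        . '<img src="/images/getimage.aspx?img=abc123.jpg" alt="">'
        . '</td></tr></tbody></table>';

$doc = new DOMDocument();
$previous = libxml_use_internal_errors(true);
$doc->loadHTML($sample);
libxml_clear_errors();
libxml_use_internal_errors($previous);
$xpath = new DOMXPath($doc);

// Malformed: no "/" between the predicate and the attribute step.
// The @ operator only hides the "Invalid expression" warning for this demo;
// query() returns false for a malformed expression.
$bad = @$xpath->query('//table[@class="X2"]/tbody/tr/td/img[1]@src');
var_dump($bad);

// Corrected: the attribute is its own step, "/@src".
$good = $xpath->query('//table[@class="X2"]/tbody/tr/td/img[1]/@src');

// query() returns a DOMNodeList, so loop over it instead of echoing it.
foreach ($good as $attr) {
    echo $attr->nodeName, ' - ', $attr->nodeValue, PHP_EOL;
}
```

Checking the return value against `false` before looping is a reasonable guard, since `is_null()` does not catch a failed query.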
Scott Fell (ASKER)

Thank you. The missing semicolons may be from changing some of the code before posting here.

Let me build a better self-contained test case and post it here. I think the issue I am having is with the looping: I first extract a portion of the HTML tree (all tables with class X1), then loop through those tables and extract bits by adding on to the XPath. Or should a new XPath be created?
<?php


$html  = getHTML();

#  CREATE DOCUMENT AND LOAD DATA
$doc = new DOMDocument('1.0', 'UTF-8');
$internalErrors = libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_use_internal_errors($internalErrors);
$xpath = new DOMXpath($doc);

$elements = $xpath->query('//table[@class="X1"]');

if (!is_null($elements)) {

    foreach ($elements as $element) {
        $imageList = $xpath->query('//table[@class="X2"]/tbody/tr/td/img[2]/@src');
        
        foreach($imageList as $node) {
            echo "{$node->nodeName} - {$node->nodeValue} <br/>";
        }
    }
}

echo "<hr>".$html;


function getHTML(){

    $html = <<<'EOD'
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>

<table>
    <tbody>
    <tr>
        <td>
            <!-- start repeating -->

            <table width="575" border="0" align="center" cellpadding="0" cellspacing="0" class="X1">
                <tbody><tr>
                    <td>
                        <table width="98%" border="0" align="left" cellpadding="0" cellspacing="0" class="X2">
                            <tbody><tr>
                                <td width="16%" valign="top">
                                    <img src="images/spacer.gif" alt="" border="0" width="1" height="8">
                                    <img src="/images/getimage.aspx?img=abc123.jpg" alt="">
                                </td>
                                <td width="84%" valign="top">

                                    <table width="100%" border="0" cellpadding="3" cellspacing="3" class="Name">
                                        <tbody><tr>
                                            <td>
                                                <a href="javascript: void(0)" onclick="popup('details.aspx?id=abc123')">FirstName1  LastName1</a>
                                            </td>
                                        </tr>
                                        </tbody></table>

                                    <table width="100%" border="0" cellpadding="3" cellspacing="3" class="X3">
                                        <tbody><tr>
                                            <td width="18%">Age:</td>
                                            <td width="21%">25</td>
                                            <td width="10%" class="classText">Class:</td>
                                            <td width="34%" class="classText">NON SENTENCED                       </td>
                                        </tr>
                                        <tr>
                                            <td>Race/Sex: </td>
                                            <td>W/F</td>
                                            <td>Visit:</td>
                                            <td>Call Facility For Time**</td>
                                        </tr>
                                        <tr>
                                            <td valign="top">Intake Date: </td>
                                            <td>
                                                10/3/2016
                                                <br>
                                                10:34 PM
                                            </td>
                                            <td>&nbsp; </td>
                                        </tr>
                                        <tr>
                                            <td>&nbsp;</td>
                                            <td>&nbsp;</td>
                                            <td>&nbsp;</td>
                                            <td>
                                                <table width="90%" border="0" align="left" cellpadding="0" cellspacing="0">
                                                    <tbody><tr>
                                                        <td width="51%" valign="top">
                                                            <div align="right">
                                                                <img src="images/layout/img_uparrow.gif" alt="" width="9" height="12">
                                                            </div>

                                                        </td>
                                                        <td width="49%" valign="baseline" class="X4">
                                                            <div align="center">
                                                                <a href="#Top">back to top</a>
                                                            </div>

                                                        </td>
                                                    </tr>
                                                    </tbody></table>
                                            </td>

                                        </tr>
                                        </tbody></table>

                                </td>

                            </tr>
                            </tbody></table>
                    </td>
                </tr>
                </tbody></table>



            <table width="575" border="0" align="center" cellpadding="0" cellspacing="0" class="X1">
                <tbody><tr>
                    <td>
                        <table width="98%" border="0" align="left" cellpadding="0" cellspacing="0" class="X2">
                            <tbody><tr>
                                <td width="16%" valign="top">
                                    <img src="images/spacer.gif" alt="" border="0" width="1" height="8">
                                    <img src="/images/getimage.aspx?img=def456.jpg" alt="">
                                </td>
                                <td width="84%" valign="top">

                                    <table width="100%" border="0" cellpadding="3" cellspacing="3" class="Name">
                                        <tbody><tr>
                                            <td>
                                                <a href="javascript: void(0)" onclick="popup('details.aspx?id=def456')">FirstName2  LastName2</a>
                                            </td>
                                        </tr>
                                        </tbody></table>

                                    <table width="100%" border="0" cellpadding="3" cellspacing="3" class="X3">
                                        <tbody><tr>
                                            <td width="18%">Age:</td>
                                            <td width="21%">25</td>
                                            <td width="10%" class="classText">Class:</td>
                                            <td width="34%" class="classText">NON SENTENCED                       </td>
                                        </tr>
                                        <tr>
                                            <td>Race/Sex: </td>
                                            <td>W/F</td>
                                            <td>Visit:</td>
                                            <td>Call Facility For Time**</td>
                                        </tr>
                                        <tr>
                                            <td valign="top">Intake Date: </td>
                                            <td>
                                                10/3/2016
                                                <br>
                                                10:34 PM
                                            </td>
                                            <td>&nbsp; </td>
                                        </tr>
                                        <tr>
                                            <td>&nbsp;</td>
                                            <td>&nbsp;</td>
                                            <td>&nbsp;</td>
                                            <td>
                                                <table width="90%" border="0" align="left" cellpadding="0" cellspacing="0">
                                                    <tbody><tr>
                                                        <td width="51%" valign="top">
                                                            <div align="right">
                                                                <img src="images/layout/img_uparrow.gif" alt="" width="9" height="12">
                                                            </div>

                                                        </td>
                                                        <td width="49%" valign="baseline" class="X4">
                                                            <div align="center">
                                                                <a href="#Top">back to top</a>
                                                            </div>

                                                        </td>
                                                    </tr>
                                                    </tbody></table>
                                            </td>

                                        </tr>
                                        </tbody></table>

                                </td>

                            </tr>
                            </tbody></table>
                    </td>
                </tr>
                </tbody></table>



            <table width="575" border="0" align="center" cellpadding="0" cellspacing="0" class="X1">
                <tbody><tr>
                    <td>
                        <table width="98%" border="0" align="left" cellpadding="0" cellspacing="0" class="X2">
                            <tbody><tr>
                                <td width="16%" valign="top">
                                    <img src="images/spacer.gif" alt="" border="0" width="1" height="8">
                                    <img src="/images/getimage.aspx?img=ghi789.jpg" alt="">
                                </td>
                                <td width="84%" valign="top">

                                    <table width="100%" border="0" cellpadding="3" cellspacing="3" class="Name">
                                        <tbody><tr>
                                            <td>
                                                <a href="javascript: void(0)" onclick="popup('details.aspx?id=ghi789')">FirstName3  LastName3</a>
                                            </td>
                                        </tr>
                                        </tbody></table>

                                    <table width="100%" border="0" cellpadding="3" cellspacing="3" class="X3">
                                        <tbody><tr>
                                            <td width="18%">Age:</td>
                                            <td width="21%">25</td>
                                            <td width="10%" class="classText">Class:</td>
                                            <td width="34%" class="classText">NON SENTENCED                       </td>
                                        </tr>
                                        <tr>
                                            <td>Race/Sex: </td>
                                            <td>W/F</td>
                                            <td>Visit:</td>
                                            <td>Call Facility For Time**</td>
                                        </tr>
                                        <tr>
                                            <td valign="top">Intake Date: </td>
                                            <td>
                                                10/3/2016
                                                <br>
                                                10:34 PM
                                            </td>
                                            <td>&nbsp; </td>
                                        </tr>
                                        <tr>
                                            <td>&nbsp;</td>
                                            <td>&nbsp;</td>
                                            <td>&nbsp;</td>
                                            <td>
                                                <table width="90%" border="0" align="left" cellpadding="0" cellspacing="0">
                                                    <tbody><tr>
                                                        <td width="51%" valign="top">
                                                            <div align="right">
                                                                <img src="images/layout/img_uparrow.gif" alt="" width="9" height="12">
                                                            </div>

                                                        </td>
                                                        <td width="49%" valign="baseline" class="X4">
                                                            <div align="center">
                                                                <a href="#Top">back to top</a>
                                                            </div>

                                                        </td>
                                                    </tr>
                                                    </tbody></table>
                                            </td>

                                        </tr>
                                        </tbody></table>

                                </td>

                            </tr>
                            </tbody></table>
                    </td>
                </tr>
                </tbody></table>
            <!-- end repeating -->
        </td>
    </tr>
    </tbody>
</table>


</body>
</html>

EOD;
    return $html;
}

The output from the above is:
src - /images/getimage.aspx?img=abc123.jpg
src - /images/getimage.aspx?img=def456.jpg
src - /images/getimage.aspx?img=ghi789.jpg
src - /images/getimage.aspx?img=abc123.jpg
src - /images/getimage.aspx?img=def456.jpg
src - /images/getimage.aspx?img=ghi789.jpg
src - /images/getimage.aspx?img=abc123.jpg
src - /images/getimage.aspx?img=def456.jpg
src - /images/getimage.aspx?img=ghi789.jpg

What I am expecting is just 3 images because there are 3 tables each with their own image:
src - /images/getimage.aspx?img=abc123.jpg
src - /images/getimage.aspx?img=def456.jpg
src - /images/getimage.aspx?img=ghi789.jpg

The inner query keeps matching table[@class="X2"]/tbody/tr/td/img[2]/@src across the whole document on every pass, when I only want the image in the current table[@class="X1"].
SOLUTION
This solution is only available to members.
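
The nine-line output in the test case is consistent with how XPath treats a leading `//`: it searches from the document root regardless of which `$element` the loop is on. `DOMXPath::query()` accepts a context node as its second argument, and an expression beginning with `.` is evaluated relative to that node, so the same `DOMXPath` object can be reused without creating a new one. A minimal sketch follows, assuming a cut-down version of the markup above. The `$row` template, the `$records` array and its keys are illustrative names, and this is not necessarily the approach given in the solution.

```php
<?php
// Hypothetical template for one X1 table with nested X2, Name and X3 tables.
$row = '<table class="X1"><tbody><tr><td>'
     . '<table class="X2"><tbody><tr>'
     . '<td><img src="images/spacer.gif" alt="">'
     . '<img src="/images/getimage.aspx?img=%s.jpg" alt=""></td>'
     . '<td>'
     . '<table class="Name"><tbody><tr><td><a href="#">%s</a></td></tr></tbody></table>'
     . '<table class="X3"><tbody><tr><td>Age:</td><td>%d</td></tr></tbody></table>'
     . '</td>'
     . '</tr></tbody></table>'
     . '</td></tr></tbody></table>';
$sample = sprintf($row, 'abc123', 'FirstName1 LastName1', 25)
        . sprintf($row, 'def456', 'FirstName2 LastName2', 31);

$doc = new DOMDocument();
$previous = libxml_use_internal_errors(true);
$doc->loadHTML($sample);
libxml_clear_errors();
libxml_use_internal_errors($previous);
$xpath = new DOMXPath($doc);

$records = [];
foreach ($xpath->query('//table[@class="X1"]') as $element) {
    // The leading "." and the second argument keep each query inside $element.
    $src  = $xpath->query('.//table[@class="X2"]/tbody/tr/td/img[2]/@src', $element);
    $name = $xpath->query('.//table[@class="Name"]//a', $element);
    // The value cell is the first cell after the "Age:" label cell.
    $age  = $xpath->query(
        './/table[@class="X3"]//td[normalize-space()="Age:"]/following-sibling::td[1]',
        $element
    );

    $records[] = [
        'image' => $src->length  ? $src->item(0)->nodeValue : null,
        'name'  => $name->length ? trim($name->item(0)->textContent) : null,
        'age'   => $age->length  ? trim($age->item(0)->textContent) : null,
    ];
}

// One line per X1 table rather than one line per match in the whole document.
foreach ($records as $record) {
    echo implode(' | ', $record), PHP_EOL;
}
```

Looking up a value by its label cell, as done for "Age:", is one way to reach the other fields (Class, Race/Sex, Intake Date) without depending on column positions; whether it holds up on the real page depends on the labels being consistent.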
Thanks, I will give it a try and report back later today!
@Scott - were you able to resolve your problem?
Sorry for the long delay.  I was pulled off this one and still working on it.  The information here put me in the right direction.  Thank you zephyr_hex!