PHP simple_html_dom.php on a normal HTML page

Hi PHP Experts

I am using the 'PHP simple HTML DOM Parser'. It works OK for me when a page has a file extension .php. So for example a file called test-status.php with the code below works OK where the same file renamed test-status.html will not work. There is no error, just blank.

Now here's the funny part. I have a .htaccess file that allows PHP to work on HTML pages and I can do a simple include on any HTML page on the site, something like...

<?php
include("navigation.php");
?>

...works OK. But the PHP below which is an include plus extra PHP code doesn't work on a HTML page (but does if the same file is renamed .php)

Any ideas why it doesn't work? Do I have to move it into the head and call it below?

I could rename the .html page to .php, but that would be a pain as it's the index and is linked to from many sites.

Hope you can help.

Cheers,

Will

<?php
include('simplehtmldom/simple_html_dom.php');
// get DOM from URL or file
$html = file_get_html('news-info-school-status.html');
// find all div tags with id=copystatus
foreach($html->find('div#copystatus') as $e)
    echo $e->innertext . '<br>';
?>

Zoppo

Hi willnjen,

IMO this has nothing to do with the script; it's caused by the HTTP server. For some HTTP servers (maybe all, I don't know) one has to specify that HTML files are also parsed/processed by the PHP script engine.

For example, for Apache this line in 'httpd.conf' configures the server to process PHP files (this is the default if PHP is installed):
> AddType application/x-httpd-php .php

If you add this line (and restart the HTTP server), HTML files are also processed as if they were PHP files:
> AddType application/x-httpd-php .html

If you use another HTTP server you should check its documentation to figure out how it needs to be configured.

Hope that helps,

ZOPPO

Ray Paseur

"doesn't work" is not much of an explanation. What happens, exactly? No output? Parse error? Halt and catch fire?

Please turn on the display errors setting and use error_reporting(E_ALL) then post the reported errors back here. Thanks, ~Ray

willnjen

ASKER

Hi Geniuses

Sorry if I didn't make my situation/problem clear enough.

Zoppo, the Apache server already runs HTML files as PHP because I had set the .htaccess file to AddType application/x-httpd-php .html. It runs simple PHP includes OK, but not this example when it is saved as .html.

Ray, Sorry about my 'doesn't work' description. Here is some more info....

The following code in a file named test-status.php works fine and returns text scraped from the page.

<?php
include('simplehtmldom/simple_html_dom.php');
$html = file_get_html('news-info-school-status.html');
foreach($html->find('div#copystatus') as $e)
echo $e->innertext . '<br>';
?>

The same file renamed test-status.html doesn't return any errors. It is just blank. See the results in these two examples...

http://www.wanaka.school.nz/test-status.php
http://www.wanaka.school.nz/test-status.html

Some more info: the following works OK as both .php and .html, which makes me think the problem is the extra complication of the test-status file rather than the Apache settings.

<?php
include_once('../simple_html_dom.php');
echo file_get_html('../../news-info-school-status.html')->plaintext;
?>

Re your suggestion to display errors, I'm sorry I don't understand where. I can't find any error reporting in the PHP simple HTML DOM Parser. If you mean turning on error reporting in Apache, Plesk or PHP, I can't find them. If I create a situation which will cause an error, eg change the relative address to something wrong, I get errors in the PHP file but no errors displayed in the html file.

In short, what confuses me is why this works as a file called test.html

<?php
include_once('../simple_html_dom.php');
echo file_get_html('../../news-info-school-status.html')->plaintext;
?>

where this doesn't as a file called test-status.html

<?php
include('simplehtmldom/simple_html_dom.php');
$html = file_get_html('news-info-school-status.html');
foreach($html->find('div#copystatus') as $e)
echo $e->innertext . '<br>';
?>

and both work as files ending with php.

Sorry if I'm rambling.

Will

Zoppo

Hm, strange - IMO if both PHP and HTML files are specified with 'AddType application/x-httpd-php' there shouldn't be a difference if the file's extension is HTML or PHP.

Maybe it's a problem related to the directory where the files are located: the last two samples you posted seem to be in different directories. Maybe the 'AddType application/x-httpd-php .html' isn't set in all '.htaccess' files?

willnjen

ASKER

Hi Zoppo

Yes, it is strange, isn't it? The examples have different directories only because I was playing around testing, but all paths are correct. What's strangest is that all PHP seems to work in .html files except for this particular simplehtmldom code. The basic simplehtmldom example test.html works with both the .html and .php extensions, whereas this example, test-status, fails only as .html.

I'll keep testing to figure it out.

Will

Zoppo

Could you verify that the PHP part is executed at all when loading 'test-status.html'? It might be interesting to add this line as the first line of the PHP block:
> echo getcwd() . "\n";

That way you can see whether the script runs at all and, if you also run the same code as a .php file, whether the current directory is the same for both.

If the script does run, you could dump '$html' after calling 'file_get_html'.

You should also configure PHP to show errors, if that's not done yet. You can do this in 'php.ini' or in the code, something like this:
> ini_set("error_reporting", E_ALL);
> ini_set("display_errors", "1");
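
The checks suggested above can be gathered into one small script. This is only a sketch: the helper name describe_environment() is made up for illustration, and the file name passed to it is the one used earlier in the thread. Saving the same code once as .php and once as .html and comparing the two outputs should show whether the working directory, the server API, or the path resolution differs between the two.

```php
<?php
// Show every error while debugging (remove again on a live site).
ini_set('display_errors', '1');
error_reporting(E_ALL);

// Hypothetical helper, not part of simple_html_dom: collects the facts
// that are worth comparing between the .php and the .html copy.
function describe_environment($relativePath)
{
    return array(
        'cwd'      => getcwd(),                  // directory relative paths are resolved against
        'sapi'     => php_sapi_name(),           // e.g. "apache2handler" or "cgi-fcgi"
        'resolved' => realpath($relativePath),   // false if the path does not exist
        'readable' => is_readable($relativePath) // false if PHP cannot read the file
    );
}

echo '<pre>';
var_dump(describe_environment('news-info-school-status.html'));
echo '</pre>';
```

If 'resolved' or 'readable' is false in only one of the two copies, the problem is more likely the path than the parser.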

Ray Paseur

The default installation of PHP suppresses error "notices" - these include undefined variables. If you turn on error_reporting(E_ALL) you might want to do it on a script-by-script basis because a lot of PHP code takes advantage of the way PHP loosely types undefined variables. These two lines will do it for you in your PHP script. Put them at the very top before any other code is executed.

<?php // ADD THESE AT THE TOP
ini_set('display_errors', TRUE);
error_reporting(E_ALL);

Ray Paseur

But let's look at this problem from a slightly different perspective. Simple_HTML_DOM looks like it may be an orphan project. There has not even been a post in the SourceForge forums in MORE THAN THREE MONTHS! I would not rely on something like this if my application had any importance.
http://sourceforge.net/project/stats/detail.php?group_id=218559&ugn=simplehtmldom&type=forum&mode=12months&forum_id=0

If you want to post some test data, I would be glad to show you how to get the contents of DIVs with the id=copystatus.

willnjen

ASKER

I've added the error reporting code to the top of both the .php and .html files. The .php runs with no reported errors and displays the correct info, but the .html is very interesting as it throws up pages of errors all relating to the function.preg-match-all. See this... http://www.wanaka.school.nz/simplehtmldom/example/test.html
(note: to make testing easy, I have two files called test.php and test.html with the code below)

Maybe we should look at it another way as Ray suggests. All I want to do is extract part of the HTML from one page and display that on another. The SimpleHTMLDOM script does this but as described, I can't get it working on the .html file. I've also tried SimpleHTMLDOM's other options of including a <span> contents and that comes up with the same errors.

Is there another way to achieve this?

Here is an example. Two files, both in the same directory and both ending .html.
The first is called example1.html and has this code.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type" />
<title>example1</title>
</head>

<body>
<p>Normal HTML above the Div</p>
<div id="copystatus">HTML text in a DIV to be copied</div>
<p>Normal HTML below the Div</p>
</body>

</html>

The second file called example2.html wants to display "HTML text in a DIV to be copied". It doesn't have to be in a DIV and could be identified using CLASS or TAGS or anything.

Is there other PHP you can point me to that does this?

Thanks,

Will

<?php
ini_set('display_errors', TRUE);
error_reporting(E_ALL);

// example of how to use basic selector to retrieve HTML contents
include('../simple_html_dom.php');
 
// get DOM from URL or file
$html = file_get_html('../../news-info-school-status.html');


// find all div tags with id=copystatus
foreach($html->find('div#copystatus') as $e)
    echo $e->innertext . '<br>';
?>

Ray Paseur

Well, it would be easier if the web page had XHTML STRICT, but we may be able to do something with this. You might find that REGEX is not the way to go -- a state engine might give better extraction. But this at least finds the data strings inside the HTML inside the appropriate DIV tags.

<?php // RAY_temp_willnjen.php
error_reporting(E_ALL);


// EXTRACT INFORMATION FROM THE DIV WITH ID="copystatus"


// SIMULATED DATA FROM AN EXTERNAL SOURCE
$htm = <<<ENDHTM
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type" />
<title>example1</title>
</head>

<body>
<p>Normal HTML above the Div</p>
<div id="copystatus">HTML text in a DIV to be copied</div>
<p>Normal HTML below the Div</p>
<div   id="copystatus">A second DIV for good testing</div>
</body>

</html>
ENDHTM;


// STANDARDIZE BLANKS TO ONE SPACE EACH
$htm = preg_replace('# +#', ' ', $htm);

// PREPARE A REGEX
$rgx
= '#'                          // THE REGEX DELIMITER
. '(\<div id="copystatus"\>)'  // GROUP 1: THE OPEN DIV TAG
. '(.*?)'                      // GROUP 2: THE UNGREEDY "MATCH ANY STRING"
. '(\</div\>)'                 // GROUP 3: THE END DIV TAG
. '#'                          // THE REGEX DELIMITER
. 'is'                         // CASE-INSENSITIVE, TREAT STRING AS A SINGLE LINE
;

// DECLOP THE HTML
preg_match_all($rgx, $htm, $arr);

// ACTIVATE THIS AND USE 'VIEW SOURCE' TO SEE THE ALL THE REGEX OUTPUT
// echo "<pre>";
// var_dump($arr);

// SHOW WHAT MATCHED GROUP #2
var_dump($arr[2]);
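
As an alternative to the regex, here is a minimal sketch that uses PHP's bundled DOM extension instead. It assumes the DOM extension is enabled on the host, which the thread does not confirm, and extract_by_id() is a made-up helper name. It uses XPath rather than getElementById so that it still returns every match when the id is duplicated, as in the test data above.

```php
<?php
// Hypothetical helper: returns the text of every element with the given id.
// Assumes $id contains no double quotes, since it is placed inside an XPath string.
function extract_by_id($html, $id)
{
    $dom = new DOMDocument();

    // Real-world markup is rarely valid, so collect libxml's parse
    // warnings internally instead of displaying them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $texts = array();
    foreach ($xpath->query('//*[@id="' . $id . '"]') as $node) {
        $texts[] = trim($node->textContent); // text only, inner tags are dropped
    }
    return $texts;
}

$htm = '<html><body><p>Normal HTML above the Div</p>'
     . '<div id="copystatus">HTML text in a DIV to be copied</div>'
     . '<p>Normal HTML below the Div</p></body></html>';

print_r(extract_by_id($htm, 'copystatus'));
```

In a real page the string would come from file_get_contents('example1.html') rather than being written inline. Because the parser handles whitespace inside tags itself, the "standardize blanks" step is not needed with this approach.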

willnjen

ASKER

Hi Ray

OK, I've learnt a lot about PHP today. I found a tutorial that explained REGEX etc and I combined your code and the tutorial's code to come up with this...

<?php
ini_set('display_errors', TRUE);
error_reporting(E_ALL);

$data = file_get_contents('example1.html');
$regex
= '#' // THE REGEX DELIMITER
. '(\<div id="copystatus"\>)' // GROUP 1: THE OPEN DIV TAG
. '(.*?)' // GROUP 2: THE UNGREEDY "MATCH ANY STRING"
. '(\</div\>)' // GROUP 3: THE END DIV TAG
. '#' // THE REGEX DELIMITER
. 'is' // CASE-INSENSITIVE, TREAT STRING AS A SINGLE LINE
;
preg_match($regex,$data,$match);
//var_dump($match);
echo $match[2];
?>

Saved as a file thursday.php it correctly displays "HTML text in a DIV to be copied" with no errors.
BUT, same problem if I save it as thursday.html it doesn't display the DIV text and displays the following error.

Warning: preg_match() [function.preg-match]: Internal pcre_fullinfo() error -3 in /usr/local/www/vhosts/wanaka.school.nz/httpdocs/simplehtmldom/new/thurs1.html on line 22

Notice: Undefined offset: 2 in /usr/local/www/vhosts/wanaka.school.nz/httpdocs/simplehtmldom/new/thurs1.html on line 24

preg_match (and preg_match_all) give the same error I get with the simple_HTML_DOM_Parser script. So it appears that any code with preg_match in a file with the suffix .html creates the same error. This leads me to believe it's a server setting. For the record, the HTML is below and my PHP settings are here: http://www.wanaka.school.nz/simplehtmldom/example/test.php

Any idea why preg_match is not allowed?

Will

<html>

<head>
<title>Untitled 1</title>
</head>

<body>

<?php
ini_set('display_errors', TRUE);
error_reporting(E_ALL);

$data = file_get_contents('example1.html');
$regex
= '#' // THE REGEX DELIMITER
. '(\<div id="copystatus"\>)' // GROUP 1: THE OPEN DIV TAG
. '(.*?)' // GROUP 2: THE UNGREEDY "MATCH ANY STRING"
. '(\</div\>)' // GROUP 3: THE END DIV TAG
. '#' // THE REGEX DELIMITER
. 'is' // CASE-INSENSITIVE, TREAT STRING AS A SINGLE LINE
;
preg_match($regex,$data,$match);
//var_dump($match);
echo $match[2];
?>

</body>

</html>
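
A note on the two messages above: the "Undefined offset: 2" notice looks like a consequence of the first warning, since $match is never filled when preg_match() fails. The sketch below checks the return value before using the match and prints the server API and PCRE version, which may help when comparing the .php and .html copies. The helper name first_div_by_id() is hypothetical, and whether preg_last_error() reports anything useful for this particular internal error is an assumption.

```php
<?php
// Hypothetical helper: returns the contents of the first <div id="..."> found,
// and reports a PCRE failure instead of reading a match that is not there.
function first_div_by_id($data, $id)
{
    // \s+ tolerates several blanks between "div" and "id"
    $regex = '#<div\s+id="' . preg_quote($id, '#') . '"\s*>(.*?)</div>#is';

    $result = preg_match($regex, $data, $match);
    if ($result === false) {
        // The regex engine itself failed (this is the case shown in the warning)
        return array('ok' => false, 'error' => preg_last_error(), 'text' => null);
    }

    // $result is 1 on a match and 0 when nothing matched
    $text = ($result === 1) ? $match[1] : null;
    return array('ok' => true, 'error' => PREG_NO_ERROR, 'text' => $text);
}

echo 'SAPI: ' . php_sapi_name() . ', PCRE: ' . PCRE_VERSION . "\n";
var_dump(first_div_by_id('<div id="copystatus">HTML text in a DIV to be copied</div>', 'copystatus'));
```

If 'ok' is false only in the .html copy, that points at the server setup rather than at the pattern.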

willnjen

ASKER

Progress... it turns out others have the same error messages and this fix is suggested...

http://lists.freebsd.org/pipermail/freebsd-questions/2009-October/205977.html

I've emailed my hosting company and asked them to look into it. Wait and see.

Will

Ray Paseur

Interesting - it certainly seems like a bug. I'm on Linux CGI/FastCGI and can parse PHP consistently regardless of the file name suffix.

Ray Paseur

The errors still appear on the wanaka site.

If this is the question, ID:33951242 then this is the answer http://#a33952136

Best to all, ~Ray

willnjen

ASKER

Hi Ray

I have given up on trying to get this to work as it seems it's the hosting which won't allow me run the PHP as HTML. I got the following reply from my Hosting Support...

"PHP on our servers is already compiled using --with-pcre-regex , you can verify this by running a phpinfo() command.
Half the problem will most likely be that we run PHP in a FastCGI environment and not as an Apache module and FastCGI will only work properly with the .php extension.
I understand that you have probably created an .htaccess as a work around to make .html pages parse PHP however I wouldn't trust all aspects of PHP to run smoothly.
It is completely non-standard to run php from within a .html document as oppose to running html within a .php file.
For the reason of unexpected problems we just don't recommend doing it especially in a FastCGI PHP environment."

Sorry to waste your time and the time of others and your help was appreciated.

Regards,

Will

ASKER CERTIFIED SOLUTION

Ray Paseur

willnjen

ASKER

Many thanks for your help!!!