
PHP Simple HTML DOM Parser Question

cbielich asked:
Not sure how to ask this, so I will just show an example. Let's say I am parsing a section of HTML like this:

<div class="results">
<p><b>word1</b></p>word2<a href="/my/link">word3</a>.
</div>

With the parser, all I want to get out of this is "word2". I want to ignore the <p>, <b>, and <a> tags and the text inside them.

Kiran Sonawane, Project Lead
Top Expert 2011

Commented:
In PHP you can use the strip_tags function.

See http://php.net/manual/en/function.strip-tags.php

You can also manipulate the DOM elements using jQuery, which is the easiest way. Let me know if you need to parse DOM elements using jQuery.

Author

Commented:
That removes the tags but keeps the text. I don't want the text to appear.

For example, with strip_tags I get

word1word2word3

I just want word2
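
To make the behavior described above concrete, here is a minimal sketch using the inner markup from the question. The variable names are only for illustration. strip_tags drops the tags themselves but leaves every piece of text in place, including the trailing period from the sample.

```php
<?php
// Inner markup of the sample <div> from the question.
$html = '<p><b>word1</b></p>word2<a href="/my/link">word3</a>.';

// strip_tags() removes the tags only; the text between them survives.
$text = strip_tags($html);

echo $text, PHP_EOL; // word1word2word3.
```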
Kiran Sonawane, Project Lead
Top Expert 2011

Commented:

Author

Commented:
Can't find a solution in there.
Sandeep Kothari, Project Lead

Commented:

preg_match_all("#</.*?>(.*?)<#",$string,$match);
print_r($match);

where $string contains the text... should work
Most Valuable Expert 2011
Top Expert 2016
Commented:
In the example given (which looks far too hypothetical to be really useful) the defining characteristics of the environment of "word2" are these:

1. It is encapsulated by the <div> tag with the class attribute "results".
2. It is not encapsulated by any other tags below the <div> tag.

The only other piece of HTML that matches these characteristics is the period after the </a> tag.  
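
One way to express those two characteristics directly is an XPath query for the text nodes that are immediate children of the <div>. The sketch below is an assumption on my part: it uses PHP's built-in DOMDocument and DOMXPath rather than the Simple HTML DOM Parser named in the question, and the variable names are illustrative.

```php
<?php
$html = <<<HTML
<div class="results">
<p><b>word1</b></p>word2<a href="/my/link">word3</a>.
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// text() selects only text nodes that are direct children of the <div>,
// so text nested inside <p>, <b>, or <a> is never visited.
$pieces = [];
foreach ($xpath->query('//div[@class="results"]/text()') as $node) {
    $text = trim($node->nodeValue);
    if ($text !== '') {
        $pieces[] = $text; // skip whitespace-only nodes
    }
}

print_r($pieces);
```

For the sample data this should yield the two pieces "word2" and ".", which matches the observation that the period is the only other text meeting both conditions.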

A word about parsing HTML: There are large companies (Microsoft, Google, Apple, and of course the Mozilla project) that devote considerable resources to parsing HTML. These engines are called web browsers, and they are fairly complicated pieces of programming. So do not become discouraged if you have difficulty teasing apart the HTML. Even if the HTML is perfectly formed according to the W3C standards (and not much HTML is well formed), it is a difficult programming task. In my experience, regular expressions are not very useful on large pieces of HTML, and tend to be more useful once the HTML has been distilled down to the part you want.

The strategy I would employ for your example here would be to use a state engine. Examine each character inside the <div> one byte at a time. Each < starts a tag and each > ends it. When a character is inside a tag, you do not want it. Each opening HTML tag begins an encapsulated string, and each closing tag (one that starts with </) ends it. When a character is inside an encapsulated string, you do not want it. If you keep only the characters you want, you will wind up with the string "word2." and then you can decide what to do with the trailing dot.
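
A minimal sketch of such a state engine follows. It is not a general HTML parser: it assumes the input is the inner content of the <div>, that every opening tag has a matching closing tag (or is written self-closing, like <br/>), and that no < or > appears inside attribute values or comments. The function name top_level_text is hypothetical.

```php
<?php
// Hypothetical helper: keep only characters that are neither inside a tag
// nor inside any element. Assumes balanced, simple markup (see above).
function top_level_text(string $html): string
{
    $out   = '';
    $depth = 0;      // number of elements currently open
    $inTag = false;  // true while between '<' and '>'
    $tag   = '';     // characters collected for the current tag

    $len = strlen($html);
    for ($i = 0; $i < $len; $i++) {
        $ch = $html[$i];
        if ($inTag) {
            if ($ch === '>') {
                $inTag = false;
                if ($tag !== '' && $tag[0] === '/') {
                    $depth = max(0, $depth - 1); // closing tag ends an element
                } elseif (substr($tag, -1) !== '/') {
                    $depth++;                    // opening tag starts an element
                }                                // self-closing tag: no change
            } else {
                $tag .= $ch;
            }
        } elseif ($ch === '<') {
            $inTag = true;
            $tag   = '';
        } elseif ($depth === 0) {
            $out .= $ch; // outside every tag and every element: keep it
        }
    }
    return $out;
}

echo top_level_text('<p><b>word1</b></p>word2<a href="/my/link">word3</a>.'), PHP_EOL;
// word2.
```

The trailing dot is still there, as described above; something like rtrim($result, '.') would remove it if that is what the real data calls for.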

Author

Commented:
@kshna

Your code produced

"Array ( [0] => Array ( [0] => < [1] => .< ) [1] => Array ( [0] => [1] => . ) )"

Author

Commented:
Sweet, I got it

preg_replace("#(<p>.*</p>||<a.*</a>)#is", "", $string);

I will accept kshna's comment, which led me to this :)
Most Valuable Expert 2011
Top Expert 2016

Commented:
For better or worse, computer programming is a fairly precise task (more like baking than like soup making), and so fairly precise questions and answers are needed. The accepted solution on this question does not work, nor does the solution at ID:37254110, which leaves the dot after "word2". See for yourself here:
http://www.laprbass.com/RAY_temp_cbielich.php

Regular expressions are not very good for parsing HTML, and certainly not very good for parsing larger sections of HTML.  That is why I recommended a state engine or some additional measures to avoid things like leaving unwanted extra characters in the resulting string.

Good luck with the project, ~Ray
<?php // RAY_temp_cbielich.php
error_reporting(E_ALL);
echo "<pre>";


echo PHP_EOL . 'POSTED TEST DATA';
// COPIED FROM http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_27477941.html
$str = <<<STR
<div class="results">
<p><b>word1</b></p>word2<a href="/my/link">word3</a>.
</div>
STR;
echo PHP_EOL . htmlentities($str);
echo PHP_EOL;


echo PHP_EOL . 'ACCEPTED SOLUTION';
// COPIED FROM http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_27477941.html?cid=1135#a37236591
preg_match_all("#</.*?>(.*?)<#",$str,$match);
print_r($match);
echo PHP_EOL;


echo PHP_EOL . 'AUTHOR SOLUTION';
// COPIED FROM http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_27477941.html?cid=1135#a37254110
$rgx = "#(<p>.*</p>||<a.*</a>)#is";
$new = preg_replace($rgx, '', $str); // empty string, as in the author's post; passing NULL here is deprecated in newer PHP versions
echo PHP_EOL . htmlentities($new);


Most Valuable Expert 2011
Top Expert 2016

Commented:
There is some useful advice, and the name of a solution (a "state engine"), at http:#37237075. The test data is far too small and too hypothetical to make it worth the time to write a state engine for this tiny example.

HTML parsers are a very big task, even for large companies with a lot of highly trained programmers and full-time testers.  If this were something I was working on, I would start by getting together the best possible test data set, one including center, edge and false cases, inspecting it by hand and eye, and writing up the expected results.  Once you know the data inputs and the desired outputs, the code usually is self-evident.
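
As a rough illustration of that workflow, here is a small table-driven check. Everything in it is an assumption for demonstration: the case names, the expected values, and the candidate function, which wraps the author's regex (written with a single | between alternatives) plus strip_tags and trim. The point is only that the expected results are written down first and the code is then measured against them.

```php
<?php
// Hypothetical candidate under test: the author's regex idea, then
// strip_tags() and trim() to drop the surrounding <div> and whitespace.
function candidate(string $html): string
{
    $stripped = preg_replace('#(<p>.*</p>|<a.*</a>)#is', '', $html);
    return trim(strip_tags($stripped));
}

// Each case pairs an input with the result decided by hand beforehand.
$cases = [
    'posted example' => [
        '<div class="results"><p><b>word1</b></p>word2<a href="/my/link">word3</a>.</div>',
        'word2',
    ],
    'no nested tags' => ['<div class="results">word2</div>', 'word2'],
    'empty div'      => ['<div class="results"></div>', ''],
];

// Run every case and collect the actual output of each one that fails.
function run_cases(callable $fn, array $cases): array
{
    $failures = [];
    foreach ($cases as $name => [$input, $expected]) {
        $actual = $fn($input);
        if ($actual !== $expected) {
            $failures[$name] = $actual;
        }
    }
    return $failures;
}

print_r(run_cases('candidate', $cases));
```

With these cases the posted example is reported as a failure because the candidate returns "word2." with the dot, which is the same defect noted earlier in the thread.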