Fernanditos


Getting content inside DIV with dynamic class name and ID.

The attached function gets the content inside the defined DIV. It works perfectly.

The defined DIV here is:

$str = '<div class="post-333 post hentry category-libros" id="post-333">';

When I define the DIV, I want to ignore everything after "post-" (here: 333 post hentry category-libros" id="post-333"). In other words, I want the content of the DIV whose opening tag starts with <div class="post- regardless of the rest of the class name and the id.

Please help.
Thank you
function get_content ($url) {
// FIND ALL OF THE DESIRED DIV
$htm = file_get_contents($url);
$str = '<div class="post-333 post hentry category-libros" id="post-333">';
$arr = explode($str, $htm);
$new = $arr[1];
$len = strlen($new);

// ACCUMULATE THE OUTPUT STRING HERE
$out = NULL;

// WE ARE INSIDE ONE DIV TAG
$cnt = 1;

// UNTIL THE END OF STRING OR UNTIL WE ARE OUT OF ALL DIV TAGS
while ($len)
{
    // COPY A CHARACTER
    $chr = substr($new,0,1);

    // IF THE DIV NESTING LEVEL INCREASES OR DECREASES
    if (substr($new,0,4) == '<div')  $cnt++;
    if (substr($new,0,5) == '</div') $cnt--;

    // ACTIVATE THIS TO FOLLOW THE COUNT OF NESTING LEVELS
    // echo " $cnt";

    // WHEN THE NESTING LEVEL GOES BACK TO ZERO
    if (!$cnt) break;

    // WHEN THE NESTING LEVEL IS STILL POSITIVE
    $len--;
    $out .= $chr;
    $new = substr($new,1);
}

return $out;
}


Ray Paseur

Sorry - I do not keep track of things from one question to the next.  Please post the test data that you want us to use, thanks.
Fernanditos

ASKER

Thank you Ray, find the test data here: http://www.frostwave.com/data.html

I want my function to get all content inside DIV:
 
<div class="post-333 post hentry category-libros" id="post-333">



The part after "post-" (333 post hentry category-libros" id="post-333") is dynamic and will always change, so I need to match only the first part of the DIV, starting with <div class="post- and ignore the rest of the class name and the id.

Thank you so much for your support.
If you use jQuery, this is easily solved with an attribute-starts-with selector:

$('div[class^="post"]')

For example...

alert($('div[class^="post"]').html());
I have in mind something like:

$str = '<div class="post-(.*)" id="(.*)">';



ASKER CERTIFIED SOLUTION
StingRaY

@StingRaY that worked great, I will test once again. Thank you.

$str = '{<div class="post-[^"]+"[^>]+>}';
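A note for anyone reusing this pattern: explode() in the original function takes a literal string, so a regular expression has to go through preg_split() instead. The sketch below shows one way that step might look. The helper name get_post_div_tail() is made up for illustration, and the sample HTML is not the real test page.

```php
<?php
// Hypothetical helper (not from the thread): returns everything that follows
// the first opening tag matched by the pattern, or null when nothing matches.
function get_post_div_tail(string $htm): ?string
{
    $pattern = '{<div class="post-[^"]+"[^>]+>}';

    // A limit of 2 splits only at the first match:
    // [0] is the text before the tag, [1] is the text after it.
    $arr = preg_split($pattern, $htm, 2);

    // Without a match there is only one element, so there is no tail.
    if ($arr === false || count($arr) < 2) {
        return null;
    }

    return $arr[1];
}

// Made-up sample input for illustration.
$htm  = '<body><div class="post-333 post hentry category-libros" id="post-333"><p>Hola</p></div></body>';
$tail = get_post_div_tail($htm);
echo $tail === null ? 'no match' : $tail, PHP_EOL;
```

The returned tail still contains the closing tags, so it would be fed to the same nesting-count loop as $arr[1] in the original function.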
@Fernanditos: I believe that solution can work as long as you are only looking for one <div> per page, and the attributes of the <div> tags are all on one line and in an exact order.  Good test data, including edge cases, is fairly important when you're working with external input.  Consider what your program will do with these tags, which are all valid and equivalent HTML.

<div class="post-333 post hentry category-libros" id="post-333">
<div id="post-333" class="post-333 post hentry category-libros">
<div class='post-333 post hentry category-libros' id="post-333">
<div
    class="post-333
               post hentry
               category-libros"
    id="post-333">
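To make the edge cases concrete, here is a small check that runs the pattern posted above against each of the four forms. It is only a sketch; the array labels are invented for readability.

```php
<?php
// The pattern posted earlier in the thread.
$pattern = '{<div class="post-[^"]+"[^>]+>}';

// The four equivalent opening tags listed above (labels are illustrative).
$variants = [
    'original order' => '<div class="post-333 post hentry category-libros" id="post-333">',
    'id first'       => '<div id="post-333" class="post-333 post hentry category-libros">',
    'single quotes'  => "<div class='post-333 post hentry category-libros' id=\"post-333\">",
    'multi-line'     => "<div\n    class=\"post-333\n               post hentry\n               category-libros\"\n    id=\"post-333\">",
];

// Record whether the pattern finds each tag.
$results = [];
foreach ($variants as $label => $tag) {
    $results[$label] = preg_match($pattern, $tag) === 1;
}

foreach ($results as $label => $matched) {
    echo str_pad($label, 16), $matched ? 'match' : 'no match', PHP_EOL;
}
```

Only the first form matches. The others fail because the pattern insists on the literal text <div class=" with a single space and a double quote.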

Executive summary: Using regular expressions to parse HTML is not a very professional approach.  A state engine is more reliable.

If you are parsing HTML to try to get information from a web publisher you might want to consider asking the publishers if they expose an API.  That way you would have a formal interface which is much more dependable than trying to scrape HTML.  If the publisher wants you to have their information they will almost certainly want to expose an API that is versioned and dependable.

Anyway, good luck with your project. ~Ray
@Fernanditos: Ray is correct. That solution is not the best one. A parser-based approach would be better, for example Simple HTML DOM Parser (http://simplehtmldom.sourceforge.net/).
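As a rough illustration of the parser-based route, the sketch below uses PHP's bundled DOM extension (DOMDocument and DOMXPath) rather than Simple HTML DOM Parser. The function name get_post_div_html() is hypothetical, and it assumes only the first matching DIV is wanted. The XPath expression matches any class token beginning with "post-", so attribute order, quote style and line breaks inside the tag do not matter.

```php
<?php
// Hypothetical helper (not from the thread): returns the inner HTML of the
// first <div> that has a class token starting with "post-", or null if none.
function get_post_div_html(string $htm): ?string
{
    $doc = new DOMDocument();

    // Real pages are rarely valid HTML; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($htm);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // normalize-space() collapses line breaks in the class attribute;
    // the leading space lets " post-" match at the start of any token.
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//div[contains(concat(" ", normalize-space(@class)), " post-")]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    // Serialize the children only, which leaves out the wrapping <div> itself.
    $out = '';
    foreach ($nodes->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }

    return $out;
}
```

Something like get_post_div_html(file_get_contents($url)) would then stand in for the hand-written nesting counter, since the parser tracks nested DIVs itself. The serializer may insert line breaks between block elements, so the output is not guaranteed to be byte-identical to the source markup.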