Insoftservice inso asked:

Data between html tags using regex in php with utf8 characters

The pattern below works fine for extracting data between HTML tags when the content has no UTF-8 characters.
Please help me find a way to extract the contents of all <h1> and <div> tags.

I have attached two sample files.
1> The pattern works for "Backward-history-info.php" even though the file has UTF-8 characters in it. But if I edit and save the file, the pattern stops working.
2> The pattern returns no data for the other file, "Backward_past_références.php".
3> Please also let me know how to handle a file name with UTF-8 characters, such as "Backward_past_références.php", in file_get_contents().

$pattern = "/<" . $tag . "(>|.+?(?<!<|>)>).+?(?<!<" . $tag . ")(.+?)<\/" . $tag . ">/i";

https://www.experts-exchange.com/questions/27981504/Data-between-html-tags-using-regex-in-php.html
Backward-past-r-f-rences.php
Backward-history-info.php

Ray Paseur

First off, there are a lot of things to know about UTF-8.  This article can tell you a lot of what you will need to understand any answers you get to this question.  I can't post it at EE because EE's article wizard cannot deal with the multiple character representations.
http://iconoun.com/articles/collisions/

Second, you might want to look at these:
http://php.net/manual/en/regexp.reference.unicode.php
http://php.net/manual/en/reference.pcre.pattern.modifiers.php (check out "u" in the modifiers).
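
As a rough illustration of what the "u" modifier changes, here is a minimal sketch. The sample strings and the heading pattern are assumptions made for this example, not the pattern from the question. It shows that without "u" the dot matches a single byte, that with "u" it matches a whole code point, and that a "u" pattern fails outright on input that is not valid UTF-8.

```php
<?php
// Sample data (assumed for illustration only).
$e      = "\xC3\xA9";                                        // "é" as two UTF-8 bytes
$utf8   = "<h1 class=\"t\">Consid{$e}rations th{$e}rapeutiques</h1>";
$latin1 = "<h1>Consid\xE9rations</h1>";                      // "é" as one ISO-8859-1 byte

// Without "u" the dot matches one byte, so "é" counts as two matches.
$bytes = preg_match_all('/./', $e);
// With "u" the dot matches one code point.
$chars = preg_match_all('/./u', $e);

// Heading extraction in UTF-8 mode; \1 ties the closing tag to the opening level.
$pattern = '#<h([1-6])\b[^>]*>(.*?)</h\1>#isu';
preg_match_all($pattern, $utf8, $mat);
$headings = $mat[2];

// Invalid UTF-8 makes a "u" pattern return false instead of zero matches.
$bad = @preg_match_all($pattern, $latin1, $ignored);
$err = preg_last_error();

var_dump($bytes, $chars, $headings, $bad, $err === PREG_BAD_UTF8_ERROR);
```

If a "u" pattern suddenly finds nothing after a file has been edited and saved, checking preg_last_error() is a quick way to see whether the file is no longer valid UTF-8.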

I'll see if there is an easy way to modify what you've got here and post back in a little while.

Insoftservice inso (ASKER)

@Ray, any success? I tried

$sHtml = utf8_encode($sHtml);

but it did not help.
This file has a bunch of problems.  It looks like it was created with UTF-8, but then stored with ISO-8859-1 or similar character set.  This can be caused by text editor settings.  Check to see that your editor is saving the file in UTF-8 encoding.
http://filedb.experts-exchange.com/incoming/2015/05_w18/911672/Backward-past-r-f-rences.php

This file also has character collisions, though not so many.  Example: Considérations thérapeutiques
http://filedb.experts-exchange.com/incoming/2015/05_w18/911673/Backward-history-info.php
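
The "created as UTF-8, stored as ISO-8859-1" damage described above can be reproduced in a few lines. This is a sketch that assumes the mbstring extension is loaded, and the sample word is made up for the demonstration. It also suggests why utf8_encode() could not help: that function converts ISO-8859-1 to UTF-8, so running it on text that is already UTF-8 adds a second layer of encoding.

```php
<?php
// "Considérations" as valid UTF-8 (sample text, assumed for illustration).
$good = "Consid\xC3\xA9rations";

// Simulate the damage: read the UTF-8 bytes as ISO-8859-1 and convert to UTF-8 again.
// This is the same conversion utf8_encode() performs.
$double = mb_convert_encoding($good, 'UTF-8', 'ISO-8859-1');   // displays as "ConsidÃ©rations"

// Converting back removes exactly one layer, which is what utf8_decode() does.
$repaired = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');

// Double-encoded text is still well-formed UTF-8, so a validity check alone cannot detect it.
$stillValid = mb_check_encoding($double, 'UTF-8');

echo bin2hex("\xC3\xA9"), ' becomes ', bin2hex(mb_convert_encoding("\xC3\xA9", 'UTF-8', 'ISO-8859-1')), PHP_EOL;
var_dump($repaired === $good, $stillValid);
```

The telltale sign of double encoding is an accented letter turning into two characters, typically starting with "Ã".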

I will try to repair one of these and see if there is a way to use REGEX with the repaired file. In directly related matters, you might want to learn about the HTML5 doctype and the rules for using the meta-charset tag.  It's needed here!
Thanks, Ray. Please also let me know the steps to repair it. The file name was actually Backward_past_références.php; I think EE renamed the file on upload.
Progress!  This worked to read the HTML document.  What do you want that REGEX to do - it looks awfully complicated, maybe an explanation will be easier to understand!

<?php // demo/temp_insoftservice.php

/**
 * http://www.experts-exchange.com/Programming/Languages/Scripting/PHP/Q_28666444.html
 */
error_reporting(E_ALL);

// BASIC SETTINGS
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');

// READ THE FILE AND DECODE THE INVALID CHARACTERS
$url = 'http://filedb.experts-exchange.com/incoming/2015/05_w18/911672/Backward-past-r-f-rences.php';
$htm = file_get_contents($url);
$htm = utf8_decode($htm);
echo $htm;


I want to extract data within <h1><h2>...<hn> & <div> tags.

$sTag     = 'h[0-9]';
$sPattern = "/<" . $sTag . "(>|.+?(?<!<|>)>).+?(?<!<" . $sTag . ")(.+?)<\/" . $sTag . ">/i";
OK, let me see what I can do with that.  I'm going to accept it as an article of faith that the REGEX works correctly on ASCII data.
Yes, it works perfectly fine for that. It works for
http://filedb.experts-exchange.com/incoming/2015/05_w18/911673/Backward-history-info.php
But if I edit the file and save it, the pattern fails. As you mentioned in your previous comment, that might be because the file is not saved in the proper encoding. But even when I save it in Notepad++ with the encoding option "Convert to UTF-8 without BOM", it fails.
Well, I've tried and I'm not having any luck.  I'll try an alternative approach instead.  While I work on that, please read this explanation of what happens when you try to use REGEX to parse XML or XHTML or HTML.  It tells the "real story" in a funny way!
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags?answertab=oldest#tab-top
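
One common alternative to REGEX for this kind of job is PHP's DOM extension. The sketch below is not the accepted solution from this thread; it is only an illustration under stated assumptions: the function name extract_tag_text() and the sample markup are invented for the example, and the XML declaration prefix is a widely used workaround for telling libxml that the input is UTF-8.

```php
<?php
// Hypothetical helper: collect the text of every element with one of the given tag names.
function extract_tag_text(string $html, array $tags): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);          // tolerate sloppy real-world markup
    // The prefix tells libxml to treat the input as UTF-8 (workaround, see lead-in).
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    $found = array();
    foreach ($tags as $tag) {
        $found[$tag] = array();
        foreach ($dom->getElementsByTagName($tag) as $node) {
            $found[$tag][] = trim($node->textContent);
        }
    }
    return $found;
}

// Sample markup, assumed for illustration; accents written as UTF-8 bytes.
$sample = "<h1 class=\"t\">Consid\xC3\xA9rations th\xC3\xA9rapeutiques</h1>"
        . "<div>Voyez \xC3\xA9galement <b>ceci</b></div>";

$found = extract_tag_text($sample, array('h1', 'h2', 'div'));
print_r($found);
```

Because the parser tracks nesting itself, one loop covers every tag name without a separate pattern per tag.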
Ray, I tried the method below. It is able to fetch the data, but I have to do some formatting on the results to display the proper tags.
Please provide your comments if possible, or if you have a better option please let me know.
Another issue here is that I have to fetch each tag with a separate function call or by using a loop.

$url = "http://filedb.experts-exchange.com/incoming/2015/05_w18/911672/Backward-past-r-f-rences.php";
$html = file_get_contents($url);

// RETURN EVERY SUBSTRING FOUND BETWEEN $start AND $end, IN DOCUMENT ORDER
function get_all_string_between($string, $start, $end)
{
    $result = array();
    $offset = 0;
    while (true)
    {
        $ini = strpos($string, $start, $offset);
        if ($ini === false)     // NO MORE OPENING MARKERS
            break;
        $ini += strlen($start);
        $fin = strpos($string, $end, $ini);
        if ($fin === false)     // OPENING MARKER WITHOUT A CLOSING ONE
            break;
        $result[] = substr($string, $ini, $fin - $ini);
        $offset = $fin + strlen($end);
    }
    return $result;
}

// NOTE: '<h2' LEAVES ANY ATTRIBUTES AND THE '>' OF THE OPENING TAG AT THE FRONT OF EACH RESULT
$result  = get_all_string_between($html, '<h2', '</h2>');
$result2 = get_all_string_between($html, '<h1', '</h1>');
echo "<pre>h2 tag "; print_r($result);  echo "</pre>";
echo "<pre>h1 tag "; print_r($result2); echo "</pre>";


ASKER CERTIFIED SOLUTION
Ray Paseur
This solution is only available to members.
It's working as expected; I checked it with a few sample files.
Please let me know how to use file_get_contents() when the file name has UTF-8 characters, for example Backward_pawn_voyez_également.php
The standard advice is: Please do not use a file name with UTF-8 characters.  Use only ASCII characters in file names.  If someone else is creating the files, and you're having trouble reading them because of the file names, please post a fully-qualified URL to the files and I'll try to help you find a solution.
This seems to work just fine.  The accented "e" is percent-encoded as the hex values of its two UTF-8 bytes (%C3%A9).  All I did was copy the URL (copy/paste) from the link posted above.
<?php // demo/temp_insoftservice.php

/**
 * http://www.experts-exchange.com/Programming/Languages/Scripting/PHP/Q_28666444.html
 * http://iconoun.com/articles/collisions/
 * http://php.net/manual/en/reference.pcre.pattern.modifiers.php
 *
 * I want to extract data within <h1><h2>...<hn> & <div> tags.
 */
error_reporting(E_ALL);

// BASIC SETTINGS
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');

// READ THE FILE AND DECODE THE INVALID CHARACTERS
// $url = 'http://filedb.experts-exchange.com/incoming/2015/05_w18/911672/Backward-past-r-f-rences.php';
$url = 'http://www.lovetomarry.com/Backward_pawn_voyez_%C3%A9galement.php';
$htm = file_get_contents($url);
$htm = utf8_decode($htm);
echo $htm;

// WHAT TAGS WILL WE USE?
$tags = array( 'h1', 'h2', 'h3', 'div' );
echo '<pre>';

// PROCESS EACH TAG
foreach ($tags as $tag)
{
    $rgx
    = '#'                     // REGEX DELIMITER
    . '\<'                    // START OF THE TAG
    . $tag
    . '.*?\>'                 // TAG ATTRIBUTES IF ANY
    . '(.*?)'                 // CAPTURE GROUP CONTENT
    . '\</'                   // END OF THE TAG
    . $tag
    . '\>'
    . '#'                     // REGEX DELIMITER
    . 'is'                    // FLAGS
    ;

    // SHOW THE TAG WE ARE PROCESSING
    echo PHP_EOL . "<h2>$tag</h2>";

    // SHOW THE FINDINGS, IF ANY
    preg_match_all($rgx, $htm, $mat);
    var_dump($mat[1]);
    echo PHP_EOL;
}

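
One limitation of a lazy pattern of this shape is worth knowing before running it on 'div': the capture stops at the first closing tag, so nested elements are cut short. The following sketch uses made-up markup to show the effect; it illustrates the general REGEX-on-HTML problem and is not a statement about the sample files.

```php
<?php
// Nested sample markup (assumed for illustration only).
$html = '<div id="outer">A<div id="inner">B</div>C</div>';

// Same shape as the per-tag pattern: lazy attributes, lazy content, first closing tag wins.
$rgx = '#<div.*?>(.*?)</div>#is';
preg_match_all($rgx, $html, $mat);

// The outer match ends at the inner </div>; "C" and the inner div as its own match are lost.
var_dump($mat[1]);
```

For flat tags such as headings this is rarely a problem, but for nested 'div' elements a DOM parser such as PHP's DOMDocument is usually the safer tool.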

The file name is Backward_pawn_voyez_également.php, not Backward_pawn_voyez_%C3%A9galement.php. I assume you copied and pasted the URL from the browser again.
Let me go back to this.  It really is the correct advice!
The standard advice is: Please do not use a file name with UTF-8 characters.  Use only ASCII characters in file names.
To see what is going wrong, follow these steps.  Please go to this URL:
https://www.experts-exchange.com/questions/28666444/Data-between-html-tags-using-regex-in-php-with-utf8-characters.html?anchorAnswerId=40756338#a40756338

Click the link to the URL that you posted there.  It will open a web page that says, "Gage arrière, etc."

Copy the URL from your browser address bar and paste it into a text document.  You will see that the encoded URL address says this:

http://www.lovetomarry.com/Backward_pawn_voyez_%C3%A9galement.php

Now go back to the URL and change it.  Remove the UTF-8 characters that say "%C3%A9" and insert the "é" character.  Fire up the request again and you will see:

Kohana_Request_Exception [ 0 ]: Unable to find a route to match the URI: Backward_pawn_voyez_?galement.php

This fails because the UTF-8 characters are translated by the web server.  That is why we advise you not to use accented characters (or any non-ASCII characters) in URLs or file names.

The disappearance of the "é" character and its replacement with the percent-encoded UTF-8 bytes, "%C3%A9", is the browser's way of trying to help you get a usable URL.  There are rules that we have to follow, and putting the "é" character into the URL violates these rules.  So just don't do that, and you'll be fine.  Use the "percent-encoded" version instead.
http://tools.ietf.org/html/rfc3986
http://www.w3.org/Addressing/URL/uri-spec.html
http://en.wikipedia.org/wiki/Percent-encoding
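
The percent-encoded form does not have to be copied from a browser; PHP's rawurlencode() produces it from the raw file name. This is a small sketch, assuming the mbstring extension for the ISO-8859-1 comparison. It shows that the result depends on which bytes the name contains: UTF-8 bytes give "%C3%A9", while the same letter as a single ISO-8859-1 byte gives "%E9", which a server expecting UTF-8 would not recognize.

```php
<?php
// File name from the thread, with "é" written explicitly as its two UTF-8 bytes.
$name = "Backward_pawn_voyez_\xC3\xA9galement.php";

// Percent-encode the name for use as a URL path segment.
$encoded = rawurlencode($name);

// The same name in ISO-8859-1 encodes differently, because "é" is then the single byte E9.
$latin1        = mb_convert_encoding($name, 'ISO-8859-1', 'UTF-8');
$encodedLatin1 = rawurlencode($latin1);

echo $encoded, PHP_EOL;          // Backward_pawn_voyez_%C3%A9galement.php
echo $encodedLatin1, PHP_EOL;    // Backward_pawn_voyez_%E9galement.php

// Decoding restores the original bytes.
var_dump(rawurldecode($encoded) === $name);
```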
Actually, these files with UTF-8 characters in their names were provided by the client and have to be scanned, so I can't change the file names. Is there any other way to read them? I don't have permission to rename the files.
Maybe you can do what I did -- copy each URL and paste it into the browser address bar.  Click the link to visit the URL.  Then copy the URL out of the browser address bar and paste it into your code editor.  There may be shortcuts - you can try other solutions, but the right answer is to follow the standards.  I recommend that you raise your rates to cover the extra cost of converting the non-standard URLs into something that will work right!
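
One possible shortcut for the copy-and-paste routine is to build each URL in code. The sketch below is an assumption-laden illustration: build_file_url() is a hypothetical helper, the base URL is the one quoted in this thread, and the file names are assumed to reach PHP as UTF-8 strings. The network call is left commented out.

```php
<?php
// Hypothetical helper: append a file name to a base URL as a percent-encoded path segment.
function build_file_url(string $baseUrl, string $fileName): string
{
    // Decode first so a name that is already percent-encoded is not encoded twice.
    // Caveat: a name containing a literal "%" would be altered by this step.
    return rtrim($baseUrl, '/') . '/' . rawurlencode(rawurldecode($fileName));
}

$base  = 'http://www.lovetomarry.com';
$files = array(
    "Backward_pawn_voyez_\xC3\xA9galement.php",   // raw UTF-8 name
    'Backward_pawn_voyez_%C3%A9galement.php',     // already percent-encoded
    'Backward-history-info.php',                  // plain ASCII name
);

$urls = array();
foreach ($files as $file) {
    $urls[] = build_file_url($base, $file);
    // $htm = file_get_contents(end($urls));      // network call omitted in this sketch
}
print_r($urls);
```

The first two entries produce the same URL, so the loop behaves the same whether the client supplies raw or already-encoded names.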
Thanks, Ray.