Data between html tags using regex in php with utf8 characters

The pattern below works fine for extracting data between HTML tags when the content has no UTF-8 characters.
Please help me find a method that extracts all <h1> and <div> tags.

I have attached two sample files.
1> The pattern works for the "Backward-history-info.php" file even though it has UTF-8 characters in it, but if I edit the file it stops working.
2> The pattern does not show any data for the other file, i.e. "Backward_past_références.php".
3> Please also let me know how to handle a file name with UTF-8 characters, such as "Backward_past_références.php", via file_get_contents().

$pattern = "/<" . $tag . "(>|.+?(?<!<|>)>).+?(?<!<" . $tag . ")(.+?)<\/" . $tag . ">/i";

http://www.experts-exchange.com/Programming/Languages/Scripting/PHP/Q_27981504.html
Backward-past-r-f-rences.php
Backward-history-info.php
InsoftserviceAsked:

Ray PaseurCommented:
First off, there are a lot of things to know about UTF-8.  This article can tell you a lot of what you will need to understand any answers you get to this question.  I can't post it at EE because EE's article wizard cannot deal with the multiple character representations.
http://iconoun.com/articles/collisions/

Second, you might want to look at these:
http://php.net/manual/en/regexp.reference.unicode.php
http://php.net/manual/en/reference.pcre.pattern.modifiers.php (check out "u" in the modifiers).

I'll see if there is an easy way to modify what you've got here and post back in a little while.
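To illustrate what the "u" modifier in the second link changes, here is a minimal sketch. The sample word is an assumption (it is not read from the attached files), and the accented characters are written as explicit UTF-8 byte escapes so the result does not depend on how the script itself is saved.

```php
<?php
// Hypothetical sample: "références" with each "é" written as its two UTF-8 bytes.
$word = "r\xC3\xA9f\xC3\xA9rences";

// Without "u", PCRE works byte by byte, so each "é" is counted twice.
$byteCount = preg_match_all('/./', $word);

// With "u", pattern and subject are treated as UTF-8, so "." matches a whole character.
$charCount = preg_match_all('/./u', $word);

// Unicode properties such as \p{L} (any letter) can be used together with "u".
preg_match('/^\p{L}+$/u', $word, $letters);

echo "bytes: $byteCount, characters: $charCount", PHP_EOL;
```

The byte count and the character count differ by two here, one for each accented letter.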
InsoftserviceAuthor Commented:
@Ray, any success? I tried

$sHtml = utf8_encode($sHtml); but it did not help.
Ray PaseurCommented:
This file has a bunch of problems.  It looks like it was created with UTF-8, but then stored with ISO-8859-1 or similar character set.  This can be caused by text editor settings.  Check to see that your editor is saving the file in UTF-8 encoding.
http://filedb.experts-exchange.com/incoming/2015/05_w18/911672/Backward-past-r-f-rences.php

This file also has character collisions, though not so many.  Example: Considérations thérapeutiques
http://filedb.experts-exchange.com/incoming/2015/05_w18/911673/Backward-history-info.php

I will try to repair one of these and see if there is a way to use REGEX with the repaired file. In directly related matters, you might want to learn about the HTML5 doctype and the rules for using the meta-charset tag.  It's needed here!
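One way to check whether an edited file really was saved as UTF-8 is to test the bytes before running any pattern on them. The sketch below uses two hypothetical sample strings (not the attached files) and assumes the mbstring extension is available.

```php
<?php
// Hypothetical samples: the same word saved in two different encodings.
$utf8   = "Consid\xC3\xA9rations"; // "é" as two UTF-8 bytes
$latin1 = "Consid\xE9rations";     // "é" as one ISO-8859-1 byte

// mb_check_encoding() reports whether the bytes form valid UTF-8.
$utf8IsValid   = mb_check_encoding($utf8, 'UTF-8');
$latin1IsValid = mb_check_encoding($latin1, 'UTF-8');

// With the "u" modifier, preg_match() returns false on invalid UTF-8
// (not 0 for "no match"), and preg_last_error() reports the reason.
$result  = preg_match('/rations/u', $latin1);
$badUtf8 = ($result === false && preg_last_error() === PREG_BAD_UTF8_ERROR);

var_dump($utf8IsValid, $latin1IsValid, $badUtf8);
```

A pattern that uses "u" and silently returns nothing after a file was edited is often a sign that the editor saved it in a different character set.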

InsoftserviceAuthor Commented:
Thanks Ray. Please also let me know the steps to repair it. The file name was actually Backward_past_références.php; I assume EE renamed the file.
Ray PaseurCommented:
Progress!  This worked to read the HTML document.  What do you want that REGEX to do - it looks awfully complicated, maybe an explanation will be easier to understand!

<?php // demo/temp_insoftservice.php

/**
 * http://www.experts-exchange.com/Programming/Languages/Scripting/PHP/Q_28666444.html
 */
error_reporting(E_ALL);

// BASIC SETTINGS
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');

// READ THE FILE AND DECODE THE INVALID CHARACTERS
$url = 'http://filedb.experts-exchange.com/incoming/2015/05_w18/911672/Backward-past-r-f-rences.php';
$htm = file_get_contents($url);
$htm = utf8_decode($htm);
echo $htm;
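A note on the utf8_decode() call above: it converts UTF-8 to ISO-8859-1, and it is deprecated as of PHP 8.2. The sketch below shows the same conversion with mb_convert_encoding() on a hypothetical double-encoded sample string (an assumption about what the "collisions" in the file look like, not data taken from it).

```php
<?php
// Hypothetical sample of a collision: the two UTF-8 bytes of "é" (C3 A9)
// were misread as ISO-8859-1 and then encoded to UTF-8 a second time.
$doubleEncoded = "Consid\xC3\x83\xC2\xA9rations";

// Same conversion that utf8_decode() performs: UTF-8 in, ISO-8859-1 out.
$repaired = mb_convert_encoding($doubleEncoded, 'ISO-8859-1', 'UTF-8');

// For double-encoded input, the output is valid single-encoded UTF-8 again.
$isUtf8 = mb_check_encoding($repaired, 'UTF-8');

var_dump($repaired === "Consid\xC3\xA9rations", $isUtf8);
```

This only repairs text that really was encoded twice; applied to correctly encoded UTF-8 it would produce ISO-8859-1 bytes instead.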


InsoftserviceAuthor Commented:
I want to extract data within <h1><h2>...<hn> & <div> tags.

$sTag       = 'h[0-9]';
 $sPattern = "/<" . $sTag . "(>|.+?(?<!<|>)>).+?(?<!<" . $sTag . ")(.+?)<\/" . $sTag . ">/i";
Ray PaseurCommented:
OK, let me see what I can do with that.  I'm going to accept it as an article of faith that the REGEX works correctly on ASCII data.
InsoftserviceAuthor Commented:
Yes, it works perfectly fine for that. It works for
http://filedb.experts-exchange.com/incoming/2015/05_w18/911673/Backward-history-info.php.
But if I edit the file and save it, the pattern fails to work. As you mentioned in your previous comments, it might be due to not saving in the proper format. But even when I save it in Notepad++ with the encoding option "Convert to UTF-8 without BOM", it fails.
Ray PaseurCommented:
Well, I've tried and I'm not having any luck.  I'll try an alternative approach instead.  While I work on that, please read this explanation of what happens when you try to use REGEX to parse XML or XHTML or HTML.  It tells the "real story" in a funny way!
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags?answertab=oldest#tab-top
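For comparison, a parser-based alternative to REGEX could look like the sketch below. It is not the approach used later in this thread; it assumes PHP's DOM extension is installed, and the sample markup is hypothetical.

```php
<?php
// Hypothetical sample document; the "é" is given as explicit UTF-8 bytes.
$html = '<html><body>'
      . '<h1 class="title">Consid' . "\xC3\xA9" . 'rations</h1>'
      . '<div><h2>Voyez aussi</h2></div>'
      . '</body></html>';

// Collect parser warnings instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
// The XML declaration prefix is a common workaround that tells libxml the input is UTF-8.
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);

$found = array();
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//h1 | //h2 | //h3 | //div') as $node) {
    // Group the text content of each element by its tag name.
    $found[$node->nodeName][] = trim($node->textContent);
}

print_r($found);
```

Because the parser understands nesting, the text of the h2 appears both under its own tag and under the div that contains it.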
InsoftserviceAuthor Commented:
Ray, I tried this method. It is able to fetch the data, but I have to do some formatting to display the proper tags.
 Please provide your comments if possible, or if you have a better option please let me know.
Another issue here is that I have to call it for each tag, either via separate function calls or in a loop.

$url = "http://filedb.experts-exchange.com/incoming/2015/05_w18/911672/Backward-past-r-f-rences.php";
$html = file_get_contents($url);
function get_all_string_between($string, $start, $end)
{
    $result = array();
    $string = " ".$string;
    $offset = 0;
    while(true)
    {
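        $ini = strpos($string,$start,$offset);
        if ($ini === false)      // no further start marker
            break;
        $ini += strlen($start);
        $endPos = strpos($string,$end,$ini);
        if ($endPos === false)   // start marker without a matching end marker
            break;
        $len = $endPos - $ini;
        $result[] = substr($string,$ini,$len);
        $offset = $ini+$len;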
    }
    return $result;
}

 $result = get_all_string_between($html, '<h2', '</h2>');
 $result1 = get_all_string_between($html, '<h1', '</h1>');
 echo "<pre>h2 tag ";print_r($result); echo "</pre><pre>h1 tag ";print_r($result1); echo "</pre>";


Ray PaseurCommented:
Please see the "view source" from this script: http://iconoun.com/demo/temp_insoftservice.php
<?php // demo/temp_insoftservice.php

/**
 * http://www.experts-exchange.com/Programming/Languages/Scripting/PHP/Q_28666444.html
 * http://iconoun.com/articles/collisions/
 * http://php.net/manual/en/reference.pcre.pattern.modifiers.php
 *
 * I want to extract data within <h1><h2>...<hn> & <div> tags.
 */
error_reporting(E_ALL);

// BASIC SETTINGS
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');

// READ THE FILE AND DECODE THE INVALID CHARACTERS
$url = 'http://filedb.experts-exchange.com/incoming/2015/05_w18/911672/Backward-past-r-f-rences.php';
$htm = file_get_contents($url);
$htm = utf8_decode($htm);
echo $htm;

// WHAT TAGS WILL WE USE?
$tags = array( 'h1', 'h2', 'h3', 'div' );
echo '<pre>';

// PROCESS EACH TAG
foreach ($tags as $tag)
{
    $rgx
    = '#'                     // REGEX DELIMITER
    . '\<'                    // START OF THE TAG
    . $tag
    . '.*?\>'                 // TAG ATTRIBUTES IF ANY
    . '(.*?)'                 // CAPTURE GROUP CONTENT
    . '\</'                   // END OF THE TAG
    . $tag
    . '\>'
    . '#'                     // REGEX DELIMITER
    . 'is'                    // FLAGS
    ;

    // SHOW THE TAG WE ARE PROCESSING
    echo PHP_EOL . "<h2>$tag</h2>";

    // SHOW THE FINDINGS, IF ANY
    preg_match_all($rgx, $htm, $mat);
    var_dump($mat[1]);
    echo PHP_EOL;
}
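A possible variation on the pattern above is sketched here, under stated assumptions: the input is valid UTF-8 (so it has not been passed through utf8_decode()), and the sample markup is hypothetical. It adds a word boundary so that "h1" cannot match a longer tag name, uses a back-reference so the closing tag must match the opening one, and adds the "u" modifier. Like any REGEX approach, it does not handle a tag nested inside another tag of the same name, such as a div within a div.

```php
<?php
// Hypothetical sample markup with UTF-8 bytes written explicitly.
$htm = "<h1 class=\"a\">Consid\xC3\xA9rations</h1>\n"
     . "<h2>Voyez\n\xC3\xA9galement</h2>\n"
     . "<header>not a heading</header>";

$rgx
= '#'                   // REGEX DELIMITER
. '<(h[1-6]|div)\b'     // TAG NAME, CAPTURED, ENDING AT A WORD BOUNDARY
. '[^>]*>'              // TAG ATTRIBUTES IF ANY
. '(.*?)'               // CAPTURE GROUP CONTENT
. '</\1>'               // CLOSING TAG MUST MATCH THE OPENING TAG NAME
. '#'                   // REGEX DELIMITER
. 'isu'                 // FLAGS: CASELESS, DOT MATCHES NEWLINE, UTF-8
;

preg_match_all($rgx, $htm, $mat, PREG_SET_ORDER);
foreach ($mat as $m)
{
    echo $m[1] . ': ' . $m[2] . PHP_EOL;
}
```

With this sample, the h1 and h2 are matched in a single pass and the header element is skipped.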


HTH, ~Ray

InsoftserviceAuthor Commented:
It's working as expected; I checked it with a few sample files.
Please let me know how to use file_get_contents() when the file name has UTF-8 characters, for example Backward_pawn_voyez_également.php
Ray PaseurCommented:
The standard advice is: Please do not use a file name with UTF-8 characters.  Use only ASCII characters in file names.  If someone else is creating the files, and you're having trouble reading them because of the file names, please post a fully-qualified URL to the files and I'll try to help you find a solution.
Ray PaseurCommented:
This seems to work just fine.  The accented "e" is percent-encoded in the URL as its two UTF-8 bytes, %C3%A9.  All I did was copy the URL (copy/paste) from the link posted above.
<?php // demo/temp_insoftservice.php

/**
 * http://www.experts-exchange.com/Programming/Languages/Scripting/PHP/Q_28666444.html
 * http://iconoun.com/articles/collisions/
 * http://php.net/manual/en/reference.pcre.pattern.modifiers.php
 *
 * I want to extract data within <h1><h2>...<hn> & <div> tags.
 */
error_reporting(E_ALL);

// BASIC SETTINGS
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');

// READ THE FILE AND DECODE THE INVALID CHARACTERS
// $url = 'http://filedb.experts-exchange.com/incoming/2015/05_w18/911672/Backward-past-r-f-rences.php';
$url = 'http://www.lovetomarry.com/Backward_pawn_voyez_%C3%A9galement.php';
$htm = file_get_contents($url);
$htm = utf8_decode($htm);
echo $htm;

// WHAT TAGS WILL WE USE?
$tags = array( 'h1', 'h2', 'h3', 'div' );
echo '<pre>';

// PROCESS EACH TAG
foreach ($tags as $tag)
{
    $rgx
    = '#'                     // REGEX DELIMITER
    . '\<'                    // START OF THE TAG
    . $tag
    . '.*?\>'                 // TAG ATTRIBUTES IF ANY
    . '(.*?)'                 // CAPTURE GROUP CONTENT
    . '\</'                   // END OF THE TAG
    . $tag
    . '\>'
    . '#'                     // REGEX DELIMITER
    . 'is'                    // FLAGS
    ;

    // SHOW THE TAG WE ARE PROCESSING
    echo PHP_EOL . "<h2>$tag</h2>";

    // SHOW THE FINDINGS, IF ANY
    preg_match_all($rgx, $htm, $mat);
    var_dump($mat[1]);
    echo PHP_EOL;
}


InsoftserviceAuthor Commented:
The file name is Backward_pawn_voyez_également.php, not Backward_pawn_voyez_%C3%A9galement.php. I assume you copied and pasted the URL from the browser again.
Ray PaseurCommented:
Let me go back to this.  It really is the correct advice!
The standard advice is: Please do not use a file name with UTF-8 characters.  Use only ASCII characters in file names.
To see what is going wrong, follow these steps.  Please go to this URL:
http://www.experts-exchange.com/Programming/Languages/Regular_Expressions/Q_28666444.html#a40756338

Click the link to the URL that you posted there.  It will open a web page that says, "Gage arrière, etc."

Copy the URL from your browser address bar and paste it into a text document.  You will see that the encoded URL address says this:

http://www.lovetomarry.com/Backward_pawn_voyez_%C3%A9galement.php

Now go back to the URL and change it.  Remove the UTF-8 characters that say "%C3%A9" and insert the "é" character.  Fire up the request again and you will see:

Kohana_Request_Exception [ 0 ]: Unable to find a route to match the URI: Backward_pawn_voyez_?galement.php

This fails because the UTF-8 characters are translated by the web server.  That is why we advise you not to use accented characters (or any non-ASCII characters) in URLs or file names.

The disappearance of the "é" character and its replacement with the percent-encoded UTF-8 bytes, "%C3%A9", is the browser's way of trying to help you get a usable URL.  There are rules that we have to follow, and putting the raw "é" character into the URL violates these rules.  So just don't do that, and you'll be fine.  Use the "percent-encoded" version instead.
http://tools.ietf.org/html/rfc3986
http://www.w3.org/Addressing/URL/uri-spec.html
http://en.wikipedia.org/wiki/Percent-encoding
InsoftserviceAuthor Commented:
Actually, these files are provided by the client with UTF-8 characters in their names, and they have to be scanned, so I can't change the file names. Is there any other way to read them? I don't have permission to rename the files.
Ray PaseurCommented:
Maybe you can do what I did -- copy each URL and paste it into the browser address bar.  Click the link to visit the URL.  Then copy the URL out of the browser address bar and paste it into your code editor.  There may be shortcuts - you can try other solutions, but the right answer is to follow the standards.  I recommend that you raise your rates to cover the extra cost of converting the non-standard URLs into something that will work right!
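One possible shortcut for the copy-and-paste step is to let PHP produce the percent-encoded form. This is a minimal sketch, assuming the file name is available as a UTF-8 string and that only the final path segment needs encoding; the host is the one from the URL posted earlier in this thread, and the request itself is left out.

```php
<?php
// File name as supplied by the client, with "é" as explicit UTF-8 bytes.
$fileName = "Backward_pawn_voyez_\xC3\xA9galement.php";

// rawurlencode() percent-encodes every byte outside the unreserved ASCII set (RFC 3986).
// Encode only the file name, not the whole URL, or the ":" and "/" would be encoded too.
$url = 'http://www.lovetomarry.com/' . rawurlencode($fileName);

echo $url, PHP_EOL;

// The encoded URL could then be passed to file_get_contents($url).
```

rawurldecode() reverses the encoding if the original file name is needed again, for example for display.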
InsoftserviceAuthor Commented:
Thx Ray