Data between html tags using regex in php with utf8 characters

The pattern below works fine for extracting data between HTML tags when the content has no UTF-8 characters.
Please help me find a way to extract the contents of all <h1> and <div> tags.

I have attached two sample files.
1> The pattern works for "Backward-history-info.php" even though it contains UTF-8 characters, but it stops working once the file is edited and saved.
2> The pattern returns no data for the other file, "Backward_past_références.php".
3> Please also let me know how to handle a file name with UTF-8 characters, such as "Backward_past_références.php", via file_get_contents().

$pattern = "/<" . $tag . "(>|.+?(?<!<|>)>).+?(?<!<" . $tag . ")(.+?)<\/" . $tag . ">/i";

https://www.experts-exchange.com/questions/27981504/Data-between-html-tags-using-regex-in-php.html
Backward-past-r-f-rences.php
Backward-history-info.php

Ray Paseur

First off, there are a lot of things to know about UTF-8. This article can tell you a lot of what you will need to understand any answers you get to this question. I can't post it at EE because EE's article wizard cannot deal with the multiple character representations.
http://iconoun.com/articles/collisions/

Second, you might want to look at these:
http://php.net/manual/en/regexp.reference.unicode.php
http://php.net/manual/en/reference.pcre.pattern.modifiers.php (check out "u" in the modifiers).
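
For illustration, here is a minimal sketch of what the "u" modifier changes. It assumes the subject string really is valid UTF-8; the sample strings are made up for the demo and are not taken from the attached files.

```php
<?php
// \xC3\xA9 is the UTF-8 byte pair for "é", so this is "références".
$word = "r\xC3\xA9f\xC3\xA9rences";

// Without "u", "." matches one byte, so the match cuts the "é" in half.
preg_match('/^.{2}/', $word, $byteMatch);

// With "u", "." matches one whole character.
preg_match('/^.{2}/u', $word, $charMatch);

// With "u", invalid UTF-8 in the subject makes the match fail outright.
$broken = "r\xE9f";   // "é" stored as a single ISO-8859-1 byte
$result = preg_match('/./u', $broken);
$failedOnBadUtf8 = ($result === false && preg_last_error() === PREG_BAD_UTF8_ERROR);

// Byte lengths of the two matches: 2 without "u", 3 with "u".
echo strlen($byteMatch[0]), ' ', strlen($charMatch[0]), PHP_EOL;
```

The last check matters in practice: a pattern with "u" that suddenly returns nothing is often a sign that the file is not valid UTF-8, rather than a sign that the pattern is wrong.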

I'll see if there is an easy way to modify what you've got here and post back in a little while.

Insoftservice inso

ASKER

@Ray, any success? I tried

$sHtml = utf8_encode($sHtml); but it did not help.

Ray Paseur

This file has a bunch of problems. It looks like it was created with UTF-8, but then stored with ISO-8859-1 or similar character set. This can be caused by text editor settings. Check to see that your editor is saving the file in UTF-8 encoding.
http://filedb.experts-exchange.com/incoming/2015/05_w18/911672/Backward-past-r-f-rences.php

This file also has character collisions, though not so many. Example: ConsidÃ©rations thÃ©rapeutiques
http://filedb.experts-exchange.com/incoming/2015/05_w18/911673/Backward-history-info.php

I will try to repair one of these and see if there is a way to use REGEX with the repaired file. In directly related matters, you might want to learn about the HTML5 doctype and the rules for using the meta-charset tag. It's needed here!
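
To make the "created with UTF-8, stored as ISO-8859-1" problem concrete, here is a small sketch that reproduces the collision and then reverses it. It assumes the mbstring extension is loaded and that the damage is exactly one extra round of encoding; real files may be damaged in other ways, so treat this as a demo and not a general repair tool.

```php
<?php
// Correct UTF-8 text: "Considérations" (\xC3\xA9 is "é").
$good = "Consid\xC3\xA9rations";

// Reproduce the damage: the UTF-8 bytes are treated as ISO-8859-1
// characters and encoded to UTF-8 a second time.
$double = mb_convert_encoding($good, 'UTF-8', 'ISO-8859-1');
// $double now displays as "ConsidÃ©rations".

// Reverse the damage: convert UTF-8 back to ISO-8859-1, which yields
// the original UTF-8 bytes again.
$repaired = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');

var_dump($repaired === $good);
```

utf8_encode() and utf8_decode() perform these same two ISO-8859-1 conversions, which is why calling utf8_encode() on text that is already UTF-8 makes things worse. Both functions are deprecated as of PHP 8.2, so mb_convert_encoding() is the safer choice in new code.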

Insoftservice inso

ASKER

Thanks, Ray. Please also let me know the steps to repair it. The file name was actually Backward_past_références.php; I assume EE renamed the file on upload.

Ray Paseur

Progress! This worked to read the HTML document. What do you want that REGEX to do - it looks awfully complicated, maybe an explanation will be easier to understand!

<?php // demo/temp_insoftservice.php

/**
 * http://www.experts-exchange.com/Programming/Languages/Scripting/PHP/Q_28666444.html
 */
error_reporting(E_ALL);

// BASIC SETTINGS
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');

// READ THE FILE AND DECODE THE INVALID CHARACTERS
$url = 'http://filedb.experts-exchange.com/incoming/2015/05_w18/911672/Backward-past-r-f-rences.php';
$htm = file_get_contents($url);
$htm = utf8_decode($htm);
echo $htm;

Insoftservice inso

ASKER

I want to extract data within <h1><h2>...<hn> & <div> tags.

$sTag = 'h[0-9]';
$sPattern = "/<" . $sTag . "(>|.+?(?<!<|>)>).+?(?<!<" . $sTag . ")(.+?)<\/" . $sTag . ">/i";

Ray Paseur

OK, let me see what I can do with that. I'm going to accept it as an article of faith that the REGEX works correctly on ASCII data.

Insoftservice inso

ASKER

Yes, it works perfectly fine for that. It works for
http://filedb.experts-exchange.com/incoming/2015/05_w18/911673/Backward-history-info.php.
But if I edit the file and save it, the pattern stops working. As you mentioned in your previous comment, that might be because the file is not saved in the proper encoding. But even when I save it in Notepad++ with the encoding option "Convert to UTF-8 without BOM", it fails.

Ray Paseur

Well, I've tried and I'm not having any luck. I'll try an alternative approach instead. While I work on that, please read this explanation of what happens when you try to use REGEX to parse XML or XHTML or HTML. It tells the "real story" in a funny way!
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags?answertab=oldest#tab-top
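
One common alternative to REGEX here is PHP's DOM extension. The sketch below is only an illustration of that technique, not necessarily the approach used later in this thread. It assumes the dom and mbstring extensions are available and that the input is valid UTF-8; the sample markup is hypothetical.

```php
<?php
// Sample markup standing in for the downloaded page (hypothetical content).
$html = "<div id=\"a\"><h1 class=\"t\">Consid\xC3\xA9rations</h1>"
      . "<h2>R\xC3\xA9f\xC3\xA9rences</h2></div>";

// Turn every non-ASCII character into a numeric entity so the parser
// cannot misread the encoding of the input.
$ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

// Parse the markup; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($ascii);
libxml_clear_errors();

// Collect the text content of every element of interest, tag by tag.
$found = [];
foreach (['h1', 'h2', 'div'] as $tag) {
    foreach ($doc->getElementsByTagName($tag) as $node) {
        $found[$tag][] = trim($node->textContent);
    }
}

print_r($found);
```

Because the parser understands nesting, a div that contains other tags is returned once with all of its text, which is hard to get right with a pattern.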

Insoftservice inso

ASKER

Ray, I tried the method below. It is able to fetch the data, but I have to do some formatting to display the proper tags.
Please comment on it if possible, or let me know if you have a better option.
Another issue: I have to fetch each tag with a separate function call or a loop.

$url = "http://filedb.experts-exchange.com/incoming/2015/05_w18/911672/Backward-past-r-f-rences.php";
$html = file_get_contents($url);

// Collect every substring found between $start and $end.
function get_all_string_between($string, $start, $end)
{
    $result = array();
    $offset = 0;
    while (true)
    {
        $ini = strpos($string, $start, $offset);
        if ($ini === false)
            break;                      // no more opening markers
        $ini += strlen($start);
        $stop = strpos($string, $end, $ini);
        if ($stop === false)
            break;                      // opening marker without a closing one
        $result[] = substr($string, $ini, $stop - $ini);
        $offset = $stop + strlen($end); // continue after the closing marker
    }
    return $result;
}

$result  = get_all_string_between($html, '<h2', '</h2>');
$result1 = get_all_string_between($html, '<h1', '</h1>');
echo "<pre>"; print_r($result); echo "</pre><pre>h1 tag "; print_r($result1); echo "</pre>";

ASKER CERTIFIED SOLUTION

Ray Paseur

This solution is only available to members.

Insoftservice inso

ASKER

It is working as expected; I checked it with a few sample files.
Please let me know how to use file_get_contents() if the file name has UTF-8 characters, e.g. Backward_pawn_voyez_également.php

Ray Paseur

The standard advice is: Please do not use a file name with UTF-8 characters. Use only ASCII characters in file names. If someone else is creating the files, and you're having trouble reading them because of the file names, please post a fully-qualified URL to the files and I'll try to help you find a solution.

Insoftservice inso

ASKER

http://www.lovetomarry.com/Backward_pawn_voyez_également.php

Ray Paseur

This seems to work just fine. The accented "e" is percent-encoded in the URL as its two UTF-8 bytes, %C3%A9. All I did was copy the URL (copy/paste) from the link posted above.

<?php // demo/temp_insoftservice.php

/**
 * http://www.experts-exchange.com/Programming/Languages/Scripting/PHP/Q_28666444.html
 * http://iconoun.com/articles/collisions/
 * http://php.net/manual/en/reference.pcre.pattern.modifiers.php
 *
 * I want to extract data within <h1><h2>...<hn> & <div> tags.
 */
error_reporting(E_ALL);

// BASIC SETTINGS
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');

// READ THE FILE AND DECODE THE INVALID CHARACTERS
// $url = 'http://filedb.experts-exchange.com/incoming/2015/05_w18/911672/Backward-past-r-f-rences.php';
$url = 'http://www.lovetomarry.com/Backward_pawn_voyez_%C3%A9galement.php';
$htm = file_get_contents($url);
$htm = utf8_decode($htm);
echo $htm;

// WHAT TAGS WILL WE USE?
$tags = array( 'h1', 'h2', 'h3', 'div' );
echo '<pre>';

// PROCESS EACH TAG
foreach ($tags as $tag)
{
    $rgx
    = '#'                     // REGEX DELIMITER
    . '\<'                    // START OF THE TAG
    . $tag
    . '.*?\>'                 // TAG ATTRIBUTES IF ANY
    . '(.*?)'                 // CAPTURE GROUP CONTENT
    . '\</'                   // END OF THE TAG
    . $tag
    . '\>'
    . '#'                     // REGEX DELIMITER
    . 'is'                    // FLAGS
    ;

    // SHOW THE TAG WE ARE PROCESSING
    echo PHP_EOL . "<h2>$tag</h2>";

    // SHOW THE FINDINGS, IF ANY
    preg_match_all($rgx, $htm, $mat);
    var_dump($mat[1]);
    echo PHP_EOL;
}
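
As a variation on the loop above, here is a hedged sketch that handles all heading levels and div in one pass. It is not part of the tested script: it adds the "u" flag, uses a backreference so the closing tag must match the opening one, and assumes the input is valid UTF-8. The sample markup is made up, and nested tags of the same name (a div inside a div) will still confuse it.

```php
<?php
// Hypothetical sample input; \xC3\xA9 is the UTF-8 byte pair for "é".
$htm = "<h1 class=\"t\">Consid\xC3\xA9rations</h1>\n"
     . "<H2>R\xC3\xA9f\xC3\xA9rences</H2>\n"
     . "<div id=\"x\">voyez \xC3\xA9galement</div>";

// Group 1 is the tag name, group 2 is the content, \1 repeats the tag name.
$rgx = '#<(h[1-6]|div)\b[^>]*>(.*?)</\1>#isu';

$count = preg_match_all($rgx, $htm, $mat, PREG_SET_ORDER);
if ($count === false) {
    // With "u", invalid UTF-8 input ends up here instead of matching nothing.
    echo 'REGEX error code: ', preg_last_error(), PHP_EOL;
}

// Group the captured content by lower-cased tag name.
$found = [];
foreach ((array) $mat as $m) {
    $found[strtolower($m[1])][] = $m[2];
}

var_dump($found);
```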

Insoftservice inso

ASKER

The file name is Backward_pawn_voyez_également.php, not Backward_pawn_voyez_%C3%A9galement.php. I assume you copied and pasted the URL from the browser again.

Ray Paseur

Let me go back to this. It really is the correct advice!

The standard advice is: Please do not use a file name with UTF-8 characters. Use only ASCII characters in file names.

To see what is going wrong, follow these steps. Please go to this URL:
https://www.experts-exchange.com/questions/28666444/Data-between-html-tags-using-regex-in-php-with-utf8-characters.html?anchorAnswerId=40756338#a40756338

Click the link to the URL that you posted there. It will open a web page that says, "Gage arrière, etc."

Copy the URL from your browser address bar and paste it into a text document. You will see that the encoded URL address says this:

http://www.lovetomarry.com/Backward_pawn_voyez_%C3%A9galement.php

Now go back to the URL and change it. Remove the percent-encoded UTF-8 bytes "%C3%A9" and insert the "é" character. Fire up the request again and you will see:

Kohana_Request_Exception [ 0 ]: Unable to find a route to match the URI: Backward_pawn_voyez_?galement.php

This fails because the UTF-8 characters are translated by the web server. That is why we advise you not to use accented characters (or any non-ASCII characters) in URLs or file names.

The disappearance of the "é" character and its replacement with the percent-encoded UTF-8 version, "%C3%A9", is the browser's way of trying to help you get a usable URL. There are rules that we have to follow, and putting the raw "é" character into the URL violates these rules. So just don't do that, and you'll be fine. Use the "percent-encoded" version instead.
http://tools.ietf.org/html/rfc3986
http://www.w3.org/Addressing/URL/uri-spec.html
http://en.wikipedia.org/wiki/Percent-encoding

Insoftservice inso

ASKER

Actually, these files are provided by the client with UTF-8 characters in their names, and they have to be scanned, so I can't change the file names. Is there any other way to read them? I don't have permission to rename the files.

Ray Paseur

Maybe you can do what I did -- copy each URL and paste it into the browser address bar. Click the link to visit the URL. Then copy the URL out of the browser address bar and paste it into your code editor. There may be shortcuts - you can try other solutions, but the right answer is to follow the standards. I recommend that you raise your rates to cover the extra cost of converting the non-standard URLs into something that will work right!
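
One possible shortcut, sketched here under assumptions: PHP's rawurlencode() produces the same percent-encoded form that the browser shows, so the copy/paste step can be done in code. The helper name encode_path_segments() is made up for this example, and the sketch assumes the file name is held as UTF-8 bytes; a name stored in ISO-8859-1 would encode to a different URL.

```php
<?php
// Percent-encode each segment of a path but keep the "/" separators.
// (Hypothetical helper, not a built-in function.)
function encode_path_segments(string $path): string
{
    return implode('/', array_map('rawurlencode', explode('/', $path)));
}

$base = 'http://www.lovetomarry.com/';
$file = "Backward_pawn_voyez_\xC3\xA9galement.php";   // \xC3\xA9 is "é" in UTF-8

$url = $base . encode_path_segments($file);
echo $url, PHP_EOL;

// The encoded URL can then be passed to file_get_contents():
// $htm = file_get_contents($url);
```

Only the path is encoded; the scheme and host are left alone, because encoding the whole URL would also escape the ":" and "/" characters.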

Insoftservice inso

ASKER

Thx Ray