Insoftservice inso
asked on
Data between html tags using regex in php with utf8 characters
The pattern below works perfectly well for extracting data between HTML tags when the content has no UTF-8 characters.
Please help me find a method to extract the contents of all <h1> and <div> tags.
I have attached two sample files.
1> The pattern works for the "Backward-history-info.php" file, even though it has UTF-8 characters in it. But if the file is edited, the pattern stops working.
2> The pattern does not return any data for the other file, i.e. "Backward_past_références.php".
3> Please also let me know how to handle a file name with UTF-8 characters, such as "Backward_past_références.php", via file_get_contents().
$pattern = "/<" . $tag . "(>|.+?(?<!<|>)>).+?(?<!<" . $tag . ")(.+?)<\/" . $tag . ">/i";
https://www.experts-exchange.com/questions/27981504/Data-between-html-tags-using-regex-in-php.html
Backward-past-r-f-rences.php
Backward-history-info.php
ASKER
@Ray, any success? I tried
$sHtml = utf8_encode($sHtml);
but it did not help.
This file has a bunch of problems. It looks like it was created with UTF-8, but then stored with ISO-8859-1 or similar character set. This can be caused by text editor settings. Check to see that your editor is saving the file in UTF-8 encoding.
http://filedb.experts-exchange.com/incoming/2015/05_w18/911672/Backward-past-r-f-rences.php
This file also has character collisions, though not so many. Example: Considérations thérapeutiques
http://filedb.experts-exchange.com/incoming/2015/05_w18/911673/Backward-history-info.php
I will try to repair one of these and see if there is a way to use REGEX with the repaired file. In directly related matters, you might want to learn about the HTML5 doctype and the rules for using the meta-charset tag. It's needed here!
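The kind of damage described above (text created as UTF-8, then treated as ISO-8859-1 and encoded a second time) can often be spotted programmatically. The following is a minimal sketch, not part of the original answer: looks_double_encoded() is a hypothetical helper, and the heuristic assumes the text is mostly Western European, where a damaged accented letter shows up as "Ã" or "Â" followed by a character in the range U+0080 to U+00BF.

```php
<?php
// Hypothetical helper (not from the thread): guess whether a string is
// UTF-8 that was mistaken for ISO-8859-1 and encoded to UTF-8 a second time.
function looks_double_encoded(string $text): bool
{
    // Text that is not valid UTF-8 at all is a different problem.
    if (!mb_check_encoding($text, 'UTF-8')) {
        return false;
    }

    // A double-encoded accented Latin letter appears as U+00C2 or U+00C3
    // followed by a code point between U+0080 and U+00BF.
    return preg_match('/[\x{00C2}\x{00C3}][\x{0080}-\x{00BF}]/u', $text) === 1;
}

// "é" encoded once is the bytes C3 A9; encoded twice it becomes C3 83 C2 A9.
$clean   = "Consid\xC3\xA9rations";
$damaged = "Consid\xC3\x83\xC2\xA9rations";

var_dump(looks_double_encoded($clean));
var_dump(looks_double_encoded($damaged));
```

This is only a heuristic: legitimate text can contain the same character pairs, so a positive result is a reason to inspect the file, not proof of damage.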
ASKER
Thanks, Ray. Please also let me know the steps to repair it. The file name was actually Backward_past_références.php; I assume EE renamed the file.
Progress! This worked to read the HTML document. What do you want that REGEX to do - it looks awfully complicated, maybe an explanation will be easier to understand!
<?php // demo/temp_insoftservice.php
/**
* http://www.experts-exchange.com/Programming/Languages/Scripting/PHP/Q_28666444.html
*/
error_reporting(E_ALL);
// BASIC SETTINGS
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');
// READ THE FILE AND DECODE THE INVALID CHARACTERS
$url = 'http://filedb.experts-exchange.com/incoming/2015/05_w18/911672/Backward-past-r-f-rences.php';
$htm = file_get_contents($url);
$htm = utf8_decode($htm);
echo $htm;
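The utf8_decode() call above works here because the file was encoded to UTF-8 twice, so removing one layer yields correct UTF-8. Since utf8_decode() is deprecated as of PHP 8.2, the same repair can be sketched with mb_convert_encoding(). This is an illustration under assumptions, not code from the thread: undo_double_utf8() is a hypothetical helper, and it assumes every character in the damaged text fits in ISO-8859-1 (anything else would be replaced during conversion).

```php
<?php
// Hypothetical helper (not from the thread): undo one layer of
// "UTF-8 encoded twice" damage, leaving healthy text untouched.
function undo_double_utf8(string $text): string
{
    // Map each code point back to a single ISO-8859-1 byte. For
    // double-encoded text those bytes are the original UTF-8 sequence.
    $repaired = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');

    // Keep the result only if it is still valid UTF-8. Correctly encoded
    // accented text turns into invalid byte sequences and is returned as-is.
    return mb_check_encoding($repaired, 'UTF-8') ? $repaired : $text;
}

$damaged = "r\xC3\x83\xC2\xA9f\xC3\x83\xC2\xA9rences"; // "références", encoded twice
$healthy = "r\xC3\xA9f\xC3\xA9rences";                 // "références", encoded once

var_dump(undo_double_utf8($damaged) === $healthy);
var_dump(undo_double_utf8($healthy) === $healthy);
```

The validity check is the design choice worth noting: it lets the function run safely over a mix of damaged and undamaged files.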
ASKER
I want to extract data within <h1><h2>...<hn> & <div> tags.
$sTag = 'h[0-9]';
$sPattern = "/<" . $sTag . "(>|.+?(?<!<|>)>).+?(?<!<" . $sTag . ")(.+?)<\/" . $sTag . ">/i";
OK, let me see what I can do with that. I'm going to accept it as an article of faith that the REGEX works correctly on ASCII data.
ASKER
Yes, it works perfectly well for that. It works for
http://filedb.experts-exchange.com/incoming/2015/05_w18/911673/Backward-history-info.php
But if I edit the file and save it, the pattern fails. As you mentioned in your previous comments, this might be because the file is not saved in the proper format. But even when I save it in Notepad++ with the encoding option "Convert to UTF-8 without BOM", it fails.
Well, I've tried and I'm not having any luck. I'll try an alternative approach instead. While I work on that, please read this explanation of what happens when you try to use REGEX to parse XML or XHTML or HTML. It tells the "real story" in a funny way!
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags?answertab=oldest#tab-top
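One common alternative to parsing HTML with REGEX is PHP's DOM extension. The following is a minimal sketch under assumptions, not the approach posted in this thread: extract_tag_text() is a hypothetical helper, the input is assumed to be valid UTF-8 already, and the XML encoding prefix is a widely used workaround to make DOMDocument::loadHTML() read the bytes as UTF-8.

```php
<?php
// Hypothetical helper (not from the thread): collect the text content of
// every element with one of the given tag names, using the DOM extension.
function extract_tag_text(string $html, array $tags): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely clean; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);

    // The XML encoding declaration tells libxml to treat the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $found = [];
    foreach ($tags as $tag) {
        $found[$tag] = [];
        foreach ($doc->getElementsByTagName($tag) as $node) {
            // textContent includes the text of nested child elements.
            $found[$tag][] = trim($node->textContent);
        }
    }
    return $found;
}

$html = "<html><body><h1>R\xC3\xA9f\xC3\xA9rences</h1>"
      . "<div class=\"x\">Consid\xC3\xA9rations <b>th\xC3\xA9rapeutiques</b></div>"
      . "</body></html>";

print_r(extract_tag_text($html, ['h1', 'h2', 'div']));
```

Unlike a pattern match, this handles nested tags and arbitrary attributes, and one call covers any list of tag names.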
ASKER
Ray, I tried this method. It is able to fetch the data, but I have to do some formatting on the result to display the proper tags.
Please provide your comments if possible, or if you have a better option, please let me know.
Another issue here is that I have to call it for every tag, either via separate function calls or by using a loop.
$url = "http://filedb.experts-exchange.com/incoming/2015/05_w18/911672/Backward-past-r-f-rences.php";
$html = file_get_contents($url);

// Return every substring found between $start and $end.
function get_all_string_between($string, $start, $end)
{
    $result = array();
    $offset = 0;
    while (true)
    {
        // strpos() returns false (not 0) when there is no further match
        $ini = strpos($string, $start, $offset);
        if ($ini === false)
            break;
        $ini += strlen($start);
        // Stop on an unterminated tag instead of computing a negative length
        $stop = strpos($string, $end, $ini);
        if ($stop === false)
            break;
        $result[] = substr($string, $ini, $stop - $ini);
        $offset = $stop + strlen($end);
    }
    return $result;
}

// The start marker omits ">", so each result still begins with the tag's attributes
$result = get_all_string_between($html, '<h2', '</h2>');
$result2 = get_all_string_between($html, '<h1', '</h1>');
echo "<pre>"; print_r($result); echo "h1 tag "; print_r($result2); echo "</pre>";
ASKER CERTIFIED SOLUTION
ASKER
It's working as expected; I checked it with a few sample files.
Please let me know how to use file_get_contents() if the file name has UTF-8 characters, e.g. Backward_pawn_voyez_également.php
The standard advice is: Please do not use a file name with UTF-8 characters. Use only ASCII characters in file names. If someone else is creating the files, and you're having trouble reading them because of the file names, please post a fully-qualified URL to the files and I'll try to help you find a solution.
This seems to work just fine. The accented "e" is URL-encoded to the hex UTF-8 characters. All I did was copy the URL (copy/paste) from the link posted above.
<?php // demo/temp_insoftservice.php
/**
* http://www.experts-exchange.com/Programming/Languages/Scripting/PHP/Q_28666444.html
* http://iconoun.com/articles/collisions/
* http://php.net/manual/en/reference.pcre.pattern.modifiers.php
*
* I want to extract data within <h1><h2>...<hn> & <div> tags.
*/
error_reporting(E_ALL);
// BASIC SETTINGS
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');
// READ THE FILE AND DECODE THE INVALID CHARACTERS
// $url = 'http://filedb.experts-exchange.com/incoming/2015/05_w18/911672/Backward-past-r-f-rences.php';
$url = 'http://www.lovetomarry.com/Backward_pawn_voyez_%C3%A9galement.php';
$htm = file_get_contents($url);
$htm = utf8_decode($htm);
echo $htm;
// WHAT TAGS WILL WE USE?
$tags = array( 'h1', 'h2', 'h3', 'div' );
echo '<pre>';
// PROCESS EACH TAG
foreach ($tags as $tag)
{
    $rgx
    = '#'          // REGEX DELIMITER
    . '\<'         // START OF THE TAG
    . $tag
    . '.*?\>'      // TAG ATTRIBUTES IF ANY
    . '(.*?)'      // CAPTURE GROUP CONTENT
    . '\</'        // END OF THE TAG
    . $tag
    . '\>'
    . '#'          // REGEX DELIMITER
    . 'is'         // FLAGS
    ;

    // SHOW THE TAG WE ARE PROCESSING
    echo PHP_EOL . "<h2>$tag</h2>";

    // SHOW THE FINDINGS, IF ANY
    preg_match_all($rgx, $htm, $mat);
    var_dump($mat[1]);
    echo PHP_EOL;
}
ASKER
The file name is Backward_pawn_voyez_également.php, not Backward_pawn_voyez_%C3%A9galement.php. I assume you copied and pasted the URL from the browser.
Let me go back to this. It really is the correct advice!
The standard advice is: Please do not use a file name with UTF-8 characters. Use only ASCII characters in file names. To see what is going wrong, follow these steps. Please go to this URL:
https://www.experts-exchange.com/questions/28666444/Data-between-html-tags-using-regex-in-php-with-utf8-characters.html?anchorAnswerId=40756338#a40756338
Click the link to the URL that you posted there. It will open a web page that says, "Gage arrière, etc."
Copy the URL from your browser address bar and paste it into a text document. You will see that the encoded URL address says this:
http://www.lovetomarry.com/Backward_pawn_voyez_%C3%A9galement.php
Now go back to the URL and change it. Remove the UTF-8 characters that say "%C3%A9" and insert the "é" character. Fire up the request again and you will see:
Kohana_Request_Exception [ 0 ]: Unable to find a route to match the URI: Backward_pawn_voyez_?galement.php
This fails because the UTF-8 characters are translated by the web server. That is why we advise you not to use accented characters (or any non-ASCII characters) in URLs or file names.
The disappearance of the "é" character and the replacement with the UTF-8 version, "%C3%A9" is the browser's way of trying to help you get a usable URL. There are rules that we have to follow, and putting the "é" character into the URL violates these rules. So just don't do that, and you'll be fine. Use the "percent-encoded" version instead.
http://tools.ietf.org/html/rfc3986
http://www.w3.org/Addressing/URL/uri-spec.html
http://en.wikipedia.org/wiki/Percent-encoding
ASKER
Actually, these files are provided by the client with UTF-8 characters in their names, and they have to be scanned, so I can't change the file names. Is there any other way to read them? I don't have permission to rename the files.
Maybe you can do what I did -- copy each URL and paste it into the browser address bar. Click the link to visit the URL. Then copy the URL out of the browser address bar and paste it into your code editor. There may be shortcuts - you can try other solutions, but the right answer is to follow the standards. I recommend that you raise your rates to cover the extra cost of converting the non-standard URLs into something that will work right!
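If copying each URL out of the browser by hand is impractical, the same percent-encoding can be produced in PHP with rawurlencode(). This is a minimal sketch, not something posted in the thread: url_for_file() is a hypothetical helper, and it assumes the file name is held as UTF-8 bytes and that the base URL is already plain ASCII.

```php
<?php
// Hypothetical helper (not from the thread): build a request URL from a
// base URL and a file name that may contain non-ASCII characters.
function url_for_file(string $baseUrl, string $fileName): string
{
    // rawurlencode() percent-encodes every byte except letters, digits,
    // "-", "_", "." and "~", so "é" (bytes C3 A9) becomes "%C3%A9".
    return rtrim($baseUrl, '/') . '/' . rawurlencode($fileName);
}

// The file name is written with byte escapes so the example does not
// depend on how this script itself is saved.
$fileName = "Backward_pawn_voyez_\xC3\xA9galement.php";

echo url_for_file('http://www.lovetomarry.com', $fileName), PHP_EOL;
```

The result can then be passed to file_get_contents() like any other URL. This covers files fetched over HTTP; files read from a local disk follow the file system's own encoding rules and are outside the scope of this sketch.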
ASKER
Thx Ray
http://iconoun.com/articles/collisions/
Second, you might want to look at these:
http://php.net/manual/en/regexp.reference.unicode.php
http://php.net/manual/en/reference.pcre.pattern.modifiers.php (check out "u" in the modifiers).
I'll see if there is an easy way to modify what you've got here and post back in a little while.
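To illustrate what the "u" modifier changes, here is a small sketch. It is an assumption-laden illustration rather than the expert's eventual solution: match_headings() and its pattern are hypothetical, and the pattern is a simplified stand-in for the asker's original one.

```php
<?php
// "é" is a single character but two bytes in UTF-8.
$e = "\xC3\xA9";

// Without /u the dot matches one byte, so the two-byte string does not fit.
$byteMode = preg_match('/^.$/', $e);

// With /u the dot matches one code point, so the string fits.
$utf8Mode = preg_match('/^.$/u', $e);

// Hypothetical helper (not from the thread): find <h1>..<h6> elements.
// The backreference \1 makes the closing tag match the opening tag.
function match_headings(string $html): array
{
    $rgx = '#<(h[1-6])\b[^>]*>(.*?)</\1>#isu';

    // With /u, preg_match_all() returns false if $html is not valid UTF-8.
    if (preg_match_all($rgx, $html, $mat, PREG_SET_ORDER) === false) {
        return [];
    }

    $out = [];
    foreach ($mat as $m) {
        $out[] = [strtolower($m[1]), $m[2]];
    }
    return $out;
}

var_dump($byteMode, $utf8Mode);
print_r(match_headings("<H2 class=\"t\">R\xC3\xA9f\xC3\xA9rences</H2>"));
```

Note the trade-off: with "u" the whole match fails on invalid UTF-8 input, so a file with encoding damage needs to be repaired before the pattern is applied.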