PHP encoding problem (how to do it like Google!)

Dear All,

I am facing a problem with character encoding in PHP. I will describe it with an example.

I have 3 files:

file1.html: its charset is WINDOWS-1256

file2.html: its charset is UTF-8

file3.html: its charset is ISO-8859-1


I use a PHP script to spider these files and index them (a complete search engine: spider, indexer, interface).

When the script finishes spidering and indexing these files, I use the interface to search for a specific keyword.

If I set the charset of the search page to WINDOWS-1256, only results from file1.html are found; if I set it to UTF-8, only results from file2.html are found, and so on.

Is there any way to unify the charset of these pages (exactly like Google does: treat everything as UTF-8 even if the page is not UTF-8)?

This is not a problem when the keyword I search for is an English word, but if it is in another language, the problem appears!

Regards
Asked by BinaryTree
 
blue_hunter Commented:
I had problems with iconv when converting Chinese text into UTF-8.

Use the function below, which emits HTML numeric character references instead:

// Converts GBK text into HTML numeric character references such as &#x4e2d;
// (one reference per UCS-2 code unit), rather than into raw UTF-8 bytes.
function intoUTF($string)
{
      // UCS-2BE uses exactly two bytes per character
      $string = iconv("GBK","UCS-2BE",$string);
      if ($string === false || $string === "")
            return "";

      // two bytes = four hex digits per character
      $hex = bin2hex($string);

      $result = "";
      foreach (str_split($hex, 4) as $digits)
      {
            $result .= "&#x" . $digits . ";";
      }

      return $result;
}
 
ps15 Commented:
 
BinaryTree (Author) Commented:
Hi,

iconv requires the input text encoding to be specified, but the input encoding is not fixed: it can be WINDOWS-1256, ISO-8859-1 or even UTF-8!

So I need a standardized way to convert any text to UTF-8 without knowing what the original encoding was!

Thanks
 
ps15 Commented:
Try just calling:

iconv('', 'UTF-8', $str);
 
BinaryTree (Author) Commented:
Dear ps15,

I tried this one, but it does not work; it gives bad results!
 
ClickCentric Commented:
Right idea, wrong direction.  The sites should be spidered as UTF-8.  This may, if the ultimate goal is greater than what's described here, require some translation during the spidering process.  Google works by actually checking the character set of each page, applying the proper transformations, and then indexing as UTF-8.
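
The "check the character set, transform, index as UTF-8" idea could be sketched roughly as below. This is only an illustration, not how Google actually does it: the helper names declared_charset() and page_to_utf8() are made up, it only looks at the <meta> tag inside the page, and the ISO-8859-1 fallback is an arbitrary assumption for pages that declare nothing.

```php
<?php
// Hypothetical helper: returns the charset named in a <meta> tag,
// or $default when the page declares none.
function declared_charset(string $html, string $default = 'ISO-8859-1'): string
{
    // Matches both <meta charset="..."> and
    // <meta http-equiv="Content-Type" content="text/html; charset=...">.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9._:-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }
    return $default;
}

// Hypothetical helper: converts a page to UTF-8 based on its declared charset.
// Returns null when iconv cannot perform the conversion.
function page_to_utf8(string $html): ?string
{
    $charset = declared_charset($html);
    if ($charset === 'UTF-8') {
        return $html; // already in the target encoding
    }
    // @ hides the warning iconv raises for unknown charsets or invalid bytes.
    $converted = @iconv($charset, 'UTF-8', $html);
    return $converted === false ? null : $converted;
}
```

The spider would call page_to_utf8() on each fetched page before indexing it, and the search page would then be served as UTF-8 as well, so the keyword and the index use the same encoding. Note that the converted text still contains the original <meta> declaration.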
 
BinaryTree (Author) Commented:
So how do I do it using PHP? I don't think I can ask a function like file_get_contents() to read the content in a specific charset; it just returns the bytes as they are!
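
Since file_get_contents() only hands back raw bytes, one possible place to learn the charset is the HTTP Content-Type header the server sent. A minimal sketch, assuming the header lines are available as an array of strings (for example the $http_response_header variable PHP fills in after an http:// fetch); charset_from_headers() is a made-up name and the sample headers are invented:

```php
<?php
// Hypothetical helper: extracts the charset from raw HTTP response header
// lines. Returns null when no Content-Type header names a charset.
function charset_from_headers(array $headers): ?string
{
    foreach ($headers as $line) {
        if (preg_match('/^Content-Type:.*;\s*charset\s*=\s*"?([A-Za-z0-9._:-]+)/i', $line, $m)) {
            return strtoupper($m[1]);
        }
    }
    return null;
}

// Example header lines (invented for illustration).
$headers = [
    'HTTP/1.1 200 OK',
    'Content-Type: text/html; charset=windows-1256',
];
echo charset_from_headers($headers), "\n"; // prints WINDOWS-1256
```

Many servers do not send a charset at all, so this would normally be combined with a look at the page's own <meta> declaration.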

 
BinaryTree (Author) Commented:
I cannot find any solution :( so I will split the points :)
 
ClickCentric Commented:
You can read in the rendered versions of the pages; that's what Google does.  But this is no easy task, and it is something Google would have spent a lot of money and resources on early on.  I don't know of a quick and easy way to do this.  Essentially, you'd need to use 'borrowed' browser code or function calls to retrieve and render the pages and then store them as UTF-8.  Many browsers have this capability; it's just rarely utilized.  So there's no really simple solution to what you want to do, at least none that I've ever seen.
 
blue_hunter Commented:
Although the question has been closed, I found another solution to the encoding problem while working on it today.
It requires PHP to be compiled with
 "--enable-mbstring"

Then use
mb_detect_encoding()  to detect the encoding of the text,

and then
mb_convert_encoding() to convert it to UTF-8.

Also check out the "HTML-ENTITIES" option for mb_convert_encoding() and see whether it suits your requirements.
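
A minimal sketch of the detect-then-convert approach, assuming mbstring is loaded. guess_and_convert() and its default candidate list are made up for illustration. Keep in mind that detection is only a guess: a single-byte encoding such as ISO-8859-1 accepts any byte sequence, so two single-byte encodings cannot really be told apart this way, and mb_list_encodings() shows whether a given encoding (WINDOWS-1256, for instance) is available in your build at all.

```php
<?php
// Hypothetical helper: guesses the encoding from a caller-supplied candidate
// list and converts the text to UTF-8. Returns null when no candidate fits.
function guess_and_convert(string $text, array $candidates = ['UTF-8', 'ISO-8859-1']): ?string
{
    // Strict mode (third argument) rejects candidates for which the bytes
    // are not valid.
    $encoding = mb_detect_encoding($text, $candidates, true);
    if ($encoding === false) {
        return null;
    }
    return mb_convert_encoding($text, 'UTF-8', $encoding);
}

// "caf\xE9" is not valid UTF-8, so it is treated as ISO-8859-1 here.
var_dump(guess_and_convert("caf\xE9") === "caf\xC3\xA9"); // bool(true)
```

Listing UTF-8 among the candidates lets strict mode rule it out for byte sequences that are not valid UTF-8, which is the one check that is fairly dependable.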


As for the function that I pasted before,
$string = iconv("GBK","UCS-2BE",$string);   <---- you can modify this line, replacing GBK with whichever source encoding you want to convert into hex code. This requires a little knowledge of encodings.

I hope I explained the solutions clearly.
Cheers