BinaryTree asked:

PHP encoding problem (how to do it like Google?)

Dear All,

I am facing a problem with character encoding in PHP. I will describe it with an example:

I have 3 files:

file1.html: charset=WINDOWS-1256

file2.html: charset=UTF-8

file3.html: charset=ISO-8859-1


I use a PHP script to spider these files and index them (a complete search engine: spider, indexer, interface).

When the script finishes spidering and indexing these files, I use the interface to search for a specific keyword.

If I set the charset of the search page to WINDOWS-1256, only results from file1.html are found; if I set it to UTF-8, only results from file2.html are found, and so on.

Is there any way to unify the charset of these pages (exactly like Google does, treating everything as UTF-8 even if the page is not UTF-8)?

This is not a problem when the keyword I search for is an English word, but if it is in another language, the problem appears!

Regards
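
To make the symptom concrete, here is a minimal sketch of why the same keyword stops matching across charsets. The word "café" and the variable names are illustrative assumptions, not taken from the asker's files; the point is only that a byte-wise comparison fails until both sides use one encoding.

```php
<?php
// The same word as it would be stored under two of the charsets from the question.
$utf8   = "caf\xC3\xA9";                        // "café" encoded as UTF-8
$latin1 = iconv('UTF-8', 'ISO-8859-1', $utf8);  // "café" encoded as ISO-8859-1

// Different byte sequences, so an index built from one never matches a query in the other.
var_dump($utf8 === $latin1);                    // bool(false)

// Once both sides are normalized to UTF-8, the comparison succeeds.
$normalized = iconv('ISO-8859-1', 'UTF-8', $latin1);
var_dump($normalized === $utf8);                // bool(true)
```

Plain English keywords are unaffected because ASCII letters have the same bytes in all three charsets.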
SOLUTION
ps15

[This solution is only available to Experts Exchange members.]
ASKER CERTIFIED SOLUTION
blue_hunter

[This solution is only available to Experts Exchange members.]
BinaryTree (ASKER)

Hi,

iconv requires the input text encoding to be specified, but the input encoding is not fixed: it can be WINDOWS-1256, ISO-8859-1 or even UTF-8!

So I need a standardized way to convert any text to UTF-8 without knowing what the original encoding was!

Thanks
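
One possible way around this, since the three files in the question each declare a charset, is to read that declaration and pass it to iconv as the source encoding. The sketch below is an assumption-laden illustration, not a complete detector: the helper names declared_charset() and page_to_utf8() are hypothetical, the regular expression only looks for a "charset=" declaration in the markup, and ISO-8859-1 is an assumed fallback for pages that declare nothing.

```php
<?php
// Hypothetical helper: return the charset a page declares about itself.
// Covers <meta charset="..."> and the older "...; charset=..." http-equiv form.
function declared_charset(string $html, string $default = 'ISO-8859-1'): string
{
    if (preg_match('/charset\s*=\s*["\']?\s*([A-Za-z0-9._:-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }
    return $default; // assumed fallback when nothing is declared
}

// Hypothetical helper: convert a fetched page to UTF-8 before indexing it.
function page_to_utf8(string $html): string
{
    $from = declared_charset($html);
    if ($from === 'UTF-8') {
        return $html;
    }
    // iconv() returns false for unknown charsets or unconvertible input;
    // in that case the original bytes are kept unchanged.
    $converted = @iconv($from, 'UTF-8', $html);
    return $converted === false ? $html : $converted;
}

$page = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=ISO-8859-1\">"
      . "<p>caf\xE9</p>";
$indexable = page_to_utf8($page);
```

If the index is stored as UTF-8 this way, the search page would also need to be served as UTF-8 so that the submitted keyword arrives in the same encoding as the indexed text.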
ps15

Try just calling:

iconv('', 'UTF-8', $str);
Dear ps15,

I tried this one, but it does not work; it gives bad results!
SOLUTION

[This solution is only available to Experts Exchange members.]
So how can I do it using PHP? I think I cannot ask a function like file_get_contents() to read the content in a specific charset; it just gets it as it is!

I cannot find any solution :( so I will split the points :)
You can read in the rendered versions of the pages; that's what Google does. But this is no easy task, and it is something that Google would have spent a lot of money and resources on early on. I don't know of a quick and easy way to do this. Essentially, you'd need to use 'borrowed' browser code or function calls to retrieve and render the pages and then store them as UTF-8. Many browsers have this capability; it's just rarely utilized. So there's no really simple solution to what you want to do, at least none that I've ever seen.
Although the question has been closed, I found another solution to the encoding problem while working on it today.
It requires PHP to be compiled with
 "--enable-mbstring"

Then use
mb_detect_encoding() to detect the encoding of the text,

and then
mb_convert_encoding() to convert it to UTF-8.

Check out the "HTML-ENTITIES" option for mb_convert_encoding() and see whether it suits your requirements.
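
The two mbstring calls above could be combined roughly as follows. This is a sketch under assumptions: guess_to_utf8() is a hypothetical helper name, the candidate list is illustrative, and detection is only a heuristic that can guess wrong on short strings. As far as I know mbstring's supported encodings do not include WINDOWS-1256 (check mb_list_encodings() on your build), so pages in that charset would probably still need iconv with a known source encoding.

```php
<?php
// Hypothetical helper: guess the encoding with mbstring and normalize to UTF-8.
function guess_to_utf8(string $bytes, array $candidates = ['UTF-8', 'ISO-8859-1']): string
{
    // Bytes that already form valid UTF-8 are kept as they are.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    // Strict mode (third argument) only accepts encodings in which the bytes are valid.
    $detected = mb_detect_encoding($bytes, $candidates, true);
    if ($detected === false) {
        // Nothing matched: fall back to the last candidate (an assumption).
        $detected = end($candidates);
    }
    return mb_convert_encoding($bytes, 'UTF-8', $detected);
}

$fromLatin1 = guess_to_utf8("caf\xE9");      // ISO-8859-1 bytes for "café"
$fromUtf8   = guess_to_utf8("caf\xC3\xA9");  // already UTF-8
var_dump($fromLatin1 === $fromUtf8);         // bool(true)
```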


For the function that I pasted before,
$string = iconv("GBK","UCS-2BE",$string);
In this line you can replace GBK with whichever source encoding you want to convert into hex codes. This requires a little knowledge of the encodings.
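
Since the original function is not visible in this thread, here is only a small assumed illustration of that line: converting to UCS-2BE yields two bytes per character, which bin2hex() can then print as hex codes. UTF-8 is used as the source encoding purely for the example; the sample character is also an assumption.

```php
<?php
$string = "\xC3\xA9";                        // "é" in UTF-8 (assumed sample input)
$ucs2   = iconv('UTF-8', 'UCS-2BE', $string); // two bytes per character, big-endian
$hex    = strtoupper(bin2hex($ucs2));
echo $hex, "\n";                              // 00E9
```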

I hope I was clear in explaining the solutions.
Cheers