grblades asked:

Problem reading UTF8 formatted file

I'm trying to read a tab-separated (TSV) Unicode file and insert the data into a MySQL database. This example is in Mandarin, but we'll also be dealing with Turkish, Arabic, Japanese... the whole works. Dumping the raw data read from the file (the first var_dump()) appears to show most of the Chinese characters. But once the contents are read into the array with fgetcsv(), both the English and the Mandarin are garbled, with the <?> character interspersed, and the Chinese characters disappear completely. I've tried several techniques; the two latest (mb_convert_encoding() and utf8_encode()) are in the first four assignments of the while() loop. They have no discernible effect on readability, although the garbage output does change. Text file attached.


$row = 0;
$importdata = Array();
// Help fgetcsv() to read in UTF8
setlocale( LC_ALL, 'en_US.UTF-8' );
$handle = fopen( $_FILES[ "uploadcsv" ][ "tmp_name" ], "r" );
var_dump( ( fread( $handle, 10000 ) ) );
while ( ( $data = fgetcsv( $handle, 10000, "\t" ) ) !== FALSE )
{
        $importdata[ $row ][ "englishlanguagename" ] =
                mb_convert_encoding( $data[ 0 ], "UTF-8", "auto" );
        $importdata[ $row ][ "nativelanguagename" ] =
                mb_convert_encoding( $data[ 1 ], "UTF-8", "auto" );
        $importdata[ $row ][ "englishcategoryname" ] =
                utf8_encode( $data[ 2 ] );
        $importdata[ $row ][ "nativecategoryname" ] =
                utf8_encode( $data[ 3 ] );
        $importdata[ $row ][ "englishsubcategoryname" ] =
                $data[ 4 ];
        $importdata[ $row ][ "nativesubcategoryname" ] = $data[ 5 ];
        $importdata[ $row ][ "englishphrase" ] = $data[ 6 ];
        $importdata[ $row ][ "nativephrase" ] = $data[ 7 ];
        $importdata[ $row ][ "audiofile" ] = $data[ 8 ];
        $row++;
}
fclose( $handle );
unlink( $_FILES[ "uploadcsv" ][ "tmp_name" ] );

echo "<pre>"; var_dump( $importdata ); echo "</pre>";
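
A side note on the snippet above: fread() advances the file pointer, so the fgetcsv() loop starts wherever fread() stopped rather than at the first row. Below is a minimal sketch of peeking at the data and then rewinding, assuming PHP 7 or later. The function name peek_then_parse() and the in-memory sample data are made up for illustration; printing the first bytes as hex is only a suggestion for checking the raw data without depending on how a browser renders it.

```php
<?php
// Hypothetical helper: look at the start of the stream for debugging,
// then rewind so that parsing begins at the first row again.
function peek_then_parse($handle, int $peekBytes = 10000): array
{
    $peek = fread($handle, $peekBytes);

    // Show the first four bytes as hex, e.g. to spot a byte order mark.
    echo bin2hex(substr((string) $peek, 0, 4)), "\n";

    // Without this call, fgetcsv() would continue after the peeked bytes.
    rewind($handle);

    $rows = [];
    while (($data = fgetcsv($handle, 0, "\t", '"', '\\')) !== false) {
        if ($data === [null]) {
            continue; // fgetcsv() reports a blank line as array(null)
        }
        $rows[] = $data;
    }
    return $rows;
}

// An in-memory stream stands in for the uploaded file here.
$handle = fopen('php://temp', 'r+');
fwrite($handle, "Mandarin\t普通话\nGreetings\t问候\n");
rewind($handle);
$rows = peek_then_parse($handle);
fclose($handle);
```

For a file shorter than the peek length, skipping the rewind() would leave the loop with nothing to read.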


Mandarin-phrases-not-final.txt
output-html.log
Loganathan Natarajan

Please search EE for "read utf-8 file"; a lot of answers have already been given.
Are you still working on this?
grblades (ASKER)

Yes. I am in the UK, so I posted only an hour before finishing work. I'm going to try a few things today and will let you know.
Thanks
Thanks for the link. However, I already have the mb_convert functions installed and am using them on lines 10 and 12 of the code above (the mb_convert_encoding() calls), but they appear to have no effect on the outcome. The var_dump() on line 6 does appear to get most of the multi-byte characters right (in IE at least; Firefox shows something completely different!), but once the lines have been run through fgetcsv() they turn to gibberish, so my feeling is that this is where the error lies.

I don't believe the file is having any issues with UTF8, and the fread() function working almost correctly leads me to believe that the reading of the file isn't having any problems - so searching for read utf8 file really doesn't help me much - and yes, I have tried it.
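
For illustration only, and not the accepted answer (which is not shown in this thread): one way to rule the file encoding in or out is to normalise the raw bytes to UTF-8 before fgetcsv() ever sees them. The sketch below assumes PHP 7 or later with the mbstring extension, and assumes that a UTF-16 file starts with a byte order mark; anything without a recognised mark is passed through as if it were already UTF-8. The function names decode_to_utf8() and read_tsv_rows() are made up for this example.

```php
<?php
// Hypothetical helper: convert raw file bytes to UTF-8 based on the BOM.
// Limitation: a UTF-32LE BOM also begins with FF FE and is not handled.
function decode_to_utf8(string $bytes): string
{
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        return substr($bytes, 3);                       // UTF-8 with BOM
    }
    if (strncmp($bytes, "\xFF\xFE", 2) === 0) {
        return mb_convert_encoding(substr($bytes, 2), 'UTF-8', 'UTF-16LE');
    }
    if (strncmp($bytes, "\xFE\xFF", 2) === 0) {
        return mb_convert_encoding(substr($bytes, 2), 'UTF-8', 'UTF-16BE');
    }
    return $bytes;                                      // assumed UTF-8
}

// Hypothetical helper: read a tab-separated file into rows of UTF-8 fields.
function read_tsv_rows(string $path): array
{
    $bytes = file_get_contents($path);
    if ($bytes === false) {
        throw new RuntimeException("Cannot read $path");
    }

    // Parse the converted text from a temporary stream, so that
    // fgetcsv() only ever works on UTF-8 data.
    $stream = fopen('php://temp', 'r+');
    fwrite($stream, decode_to_utf8($bytes));
    rewind($stream);

    $rows = [];
    while (($fields = fgetcsv($stream, 0, "\t", '"', '\\')) !== false) {
        if ($fields === [null]) {
            continue; // skip blank lines
        }
        $rows[] = $fields;
    }
    fclose($stream);
    return $rows;
}
```

With the upload from the question this would be called as read_tsv_rows( $_FILES[ "uploadcsv" ][ "tmp_name" ] ). If the rows then go into MySQL, the connection character set has to be able to carry the same characters (for example utf8mb4).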
ASKER CERTIFIED SOLUTION
grblades
This solution is only available to members.