Problem reading UTF8 formatted file

Trying to read in TSV Unicode file and then insert the data into a MySQL database.  This example is in Mandarin, but we'll be dealing with Turkish, Arabic, Japanese... the whole works. Reading the data out of the file (first var_dump()) appears to show most of the Chinese character symbols, but when getting the contents into the array using fgetcsv(), both the English and the Mandarin get garbled with the <?> character interspersed, and the Chinese characters disappear completely - I've tried using several techniques two of the latest are included in the first 4 lines of the while() loop, and have no discernable effect to readability (although the outputted garbage does change).  Text file attached.


$row = 0;
        $importdata = Array();
        // Help fgetcsv() to read in UTF8
        setlocale( LC_ALL, 'en_US.UTF-8' );
        $handle = fopen( $_FILES[ "uploadcsv" ][ "tmp_name" ], "r" );
        var_dump( ( fread( $handle, 10000 ) ) );
        while ( ( $data = fgetcsv( $handle, 10000, "    " ) ) !== FALSE )
        {
                $importdata[ $row ][ "englishlanguagename" ] = 
mb_convert_encoding( $data[ 0 ], "UTF-8", "auto" );
                $importdata[ $row ][ "nativelanguagename" ] = 
mb_convert_encoding( $data[ 1 ], "UTF-8", "auto" );
                $importdata[ $row ][ "englishcategoryname" ] = 
utf8_encode( $data[ 2 ] );
                $importdata[ $row ][ "nativecategoryname" ] = 
utf8_encode( $data[ 3 ] );
                $importdata[ $row ][ "englishsubcategoryname" ] = $data[ 
4 ];
                $importdata[ $row ][ "nativesubcategoryname" ] = $data[ 5 ];
                $importdata[ $row ][ "englishphrase" ] = $data[ 6 ];
                $importdata[ $row ][ "nativephrase" ] = $data[ 7 ];
                $importdata[ $row ][ "audiofile" ] = $data[ 8 ];
                $row++;
        }
        fclose( $handle );
        unlink( $_FILES[ "uploadcsv" ][ "tmp_name" ] );
 
        echo "<pre>"; var_dump( $importdata ); echo "</pre>";

Open in new window

Mandarin-phrases-not-final.txt
output-html.log
LVL 36
grbladesAsked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

Loganathan NatarajanLAMP DeveloperCommented:
0
Loganathan NatarajanLAMP DeveloperCommented:
please search in EE with "read utf-8 file" ... lot of answers are given already...
0
Loganathan NatarajanLAMP DeveloperCommented:
are you work on this?
0
Cloud Class® Course: CompTIA Healthcare IT Tech

This course will help prep you to earn the CompTIA Healthcare IT Technician certification showing that you have the knowledge and skills needed to succeed in installing, managing, and troubleshooting IT systems in medical and clinical settings.

grbladesAuthor Commented:
Yes. I am in the UK so posted only an hour before finishing work. Going to try a few things today and I will let you know.
Thanks
0
grbladesAuthor Commented:
Thanks for the link, however I already have the mb_convert functions installed, and am using them on lines 10 & 12 of the above, however using them appears to have no effect on the outcome.  The var_dump on line 6 does appear to get most of the multi-byte characters correct (in IE at least, Firefox shows something completely different!), but then once the lines have been run through the fgetcsv() function, they appear to turn to gibberish, so my feeling is that this is where the error lies.

I don't believe the file is having any issues with UTF8, and the fread() function working almost correctly leads me to believe that the reading of the file isn't having any problems - so searching for read utf8 file really doesn't help me much - and yes, I have tried it.
0
Loganathan NatarajanLAMP DeveloperCommented:
0
grbladesAuthor Commented:
The problem was lying somewhere to do with the files - not entirely sure if it was the Unicode export from Excel, the fgetcsv function, the fopen or wherever.  I've changed the front-end procedure now, and it all works.  The user no longer exports from Excel to Unicode text file & uploads it, I changed the code so that they copy and paste into a textarea.  I then parse this text manually instead of using fgetcsv - the only downside to this is that any tabs internal to the strings are not escaped, and so appear as seperate columns.  Because I know the data is coming out of Excel, and I know the people putting the data in, so I don't think it will ever occur, but a column count on the $data variable would be a quick fix to that....  Oh, and it won't be happy with 2 people using it at exactly the same time, but a random filename would sort that issue...

Here's the working code (I've made PHP save the textarea variable - contents - to a file, to see if it was the file functions that were causing me issues) :
if( isset( $_FILES[ "uploadcsv" ][ "tmp_name" ] ) && strlen( trim( $_FILES[ "uploadcsv" ][ "tmp_name" ] ) ) )
       $filename = $_FILES[ "uploadcsv" ][ "tmp_name" ];
else if( isset( $_POST[ "contents" ] ) && strlen( trim( $_POST[ "contents" ] ) ) )
{
       $filename = "/tmp/foo";
       file_put_contents( $filename, mb_convert_encoding( $_POST[ "contents" ], "UTF-8", "auto" ) ) );
}
if( isset( $filename ) )
{
       $row = 0;
       $importdata = Array();
       // Help fgetcsv() to read in UTF8
       setlocale( LC_ALL, 'en_US.UTF-8' );
       $handle = fopen( $filename, "r" );
       $contents = explode( "\n", fread( $handle, 10000000 ) );
       foreach( $contents as $line )
       {
               $data = explode( "      ", str_replace( "\r", "", $line ) );
               $importdata[ $row ][ "englishlanguagename" ] = mb_convert_encoding( $data[ 0 ], "UTF-8", "auto" ) );
               $importdata[ $row ][ "nativelanguagename" ] = mb_convert_encoding( $data[ 1 ], "UTF-8", "auto" ) );
               $importdata[ $row ][ "englishcategoryname" ] = mb_convert_encoding( $data[ 2 ], "UTF-8", "auto" ) );
               $importdata[ $row ][ "nativecategoryname" ] = mb_convert_encoding( $data[ 3 ], "UTF-8", "auto" ) );
               $importdata[ $row ][ "englishsubcategoryname" ] = mb_convert_encoding( $data[ 4 ], "UTF-8", "auto" ) );
               $importdata[ $row ][ "nativesubcategoryname" ] = mb_convert_encoding( $data[ 5 ], "UTF-8", "auto" ) );
               $importdata[ $row ][ "englishphrase" ] = mb_convert_encoding( $data[ 6 ], "UTF-8", "auto" ) );
               $importdata[ $row ][ "nativephrase" ] = mb_convert_encoding( $data[ 7 ], "UTF-8", "auto" ) );
               $importdata[ $row ][ "audiofile" ] = mb_convert_encoding( $data[ 8 ], "UTF-8", "auto" ) );
               $row++;
       }
 
       fclose( $handle );
       unlink( $filename ); 

Open in new window

0

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
PHP

From novice to tech pro — start learning today.