?
Solved

Problem reading UTF8 formatted file

Posted on 2008-02-11
7
Medium Priority
?
10,073 Views
Last Modified: 2013-12-12
Trying to read in TSV Unicode file and then insert the data into a MySQL database.  This example is in Mandarin, but we'll be dealing with Turkish, Arabic, Japanese... the whole works. Reading the data out of the file (first var_dump()) appears to show most of the Chinese character symbols, but when getting the contents into the array using fgetcsv(), both the English and the Mandarin get garbled with the <?> character interspersed, and the Chinese characters disappear completely - I've tried using several techniques two of the latest are included in the first 4 lines of the while() loop, and have no discernable effect to readability (although the outputted garbage does change).  Text file attached.


$row = 0;
        $importdata = Array();
        // Help fgetcsv() to read in UTF8
        setlocale( LC_ALL, 'en_US.UTF-8' );
        $handle = fopen( $_FILES[ "uploadcsv" ][ "tmp_name" ], "r" );
        var_dump( ( fread( $handle, 10000 ) ) );
        while ( ( $data = fgetcsv( $handle, 10000, "    " ) ) !== FALSE )
        {
                $importdata[ $row ][ "englishlanguagename" ] = 
mb_convert_encoding( $data[ 0 ], "UTF-8", "auto" );
                $importdata[ $row ][ "nativelanguagename" ] = 
mb_convert_encoding( $data[ 1 ], "UTF-8", "auto" );
                $importdata[ $row ][ "englishcategoryname" ] = 
utf8_encode( $data[ 2 ] );
                $importdata[ $row ][ "nativecategoryname" ] = 
utf8_encode( $data[ 3 ] );
                $importdata[ $row ][ "englishsubcategoryname" ] = $data[ 
4 ];
                $importdata[ $row ][ "nativesubcategoryname" ] = $data[ 5 ];
                $importdata[ $row ][ "englishphrase" ] = $data[ 6 ];
                $importdata[ $row ][ "nativephrase" ] = $data[ 7 ];
                $importdata[ $row ][ "audiofile" ] = $data[ 8 ];
                $row++;
        }
        fclose( $handle );
        unlink( $_FILES[ "uploadcsv" ][ "tmp_name" ] );
 
        echo "<pre>"; var_dump( $importdata ); echo "</pre>";

Open in new window

Mandarin-phrases-not-final.txt
output-html.log
0
Comment
Question by:grblades
  • 4
  • 3
7 Comments
 
LVL 36

Expert Comment

by:Loganathan Natarajan
ID: 20867595
0
 
LVL 36

Expert Comment

by:Loganathan Natarajan
ID: 20867633
please search in EE with "read utf-8 file" ... lot of answers are given already...
0
 
LVL 36

Expert Comment

by:Loganathan Natarajan
ID: 20872822
are you work on this?
0
Keep up with what's happening at Experts Exchange!

Sign up to receive Decoded, a new monthly digest with product updates, feature release info, continuing education opportunities, and more.

 
LVL 36

Author Comment

by:grblades
ID: 20873279
Yes. I am in the UK so posted only an hour before finishing work. Going to try a few things today and I will let you know.
Thanks
0
 
LVL 36

Author Comment

by:grblades
ID: 20873691
Thanks for the link, however I already have the mb_convert functions installed, and am using them on lines 10 & 12 of the above, however using them appears to have no effect on the outcome.  The var_dump on line 6 does appear to get most of the multi-byte characters correct (in IE at least, Firefox shows something completely different!), but then once the lines have been run through the fgetcsv() function, they appear to turn to gibberish, so my feeling is that this is where the error lies.

I don't believe the file is having any issues with UTF8, and the fread() function working almost correctly leads me to believe that the reading of the file isn't having any problems - so searching for read utf8 file really doesn't help me much - and yes, I have tried it.
0
 
LVL 36

Expert Comment

by:Loganathan Natarajan
ID: 20873868
0
 
LVL 36

Accepted Solution

by:
grblades earned 0 total points
ID: 20877023
The problem was lying somewhere to do with the files - not entirely sure if it was the Unicode export from Excel, the fgetcsv function, the fopen or wherever.  I've changed the front-end procedure now, and it all works.  The user no longer exports from Excel to Unicode text file & uploads it, I changed the code so that they copy and paste into a textarea.  I then parse this text manually instead of using fgetcsv - the only downside to this is that any tabs internal to the strings are not escaped, and so appear as seperate columns.  Because I know the data is coming out of Excel, and I know the people putting the data in, so I don't think it will ever occur, but a column count on the $data variable would be a quick fix to that....  Oh, and it won't be happy with 2 people using it at exactly the same time, but a random filename would sort that issue...

Here's the working code (I've made PHP save the textarea variable - contents - to a file, to see if it was the file functions that were causing me issues) :
if( isset( $_FILES[ "uploadcsv" ][ "tmp_name" ] ) && strlen( trim( $_FILES[ "uploadcsv" ][ "tmp_name" ] ) ) )
       $filename = $_FILES[ "uploadcsv" ][ "tmp_name" ];
else if( isset( $_POST[ "contents" ] ) && strlen( trim( $_POST[ "contents" ] ) ) )
{
       $filename = "/tmp/foo";
       file_put_contents( $filename, mb_convert_encoding( $_POST[ "contents" ], "UTF-8", "auto" ) ) );
}
if( isset( $filename ) )
{
       $row = 0;
       $importdata = Array();
       // Help fgetcsv() to read in UTF8
       setlocale( LC_ALL, 'en_US.UTF-8' );
       $handle = fopen( $filename, "r" );
       $contents = explode( "\n", fread( $handle, 10000000 ) );
       foreach( $contents as $line )
       {
               $data = explode( "      ", str_replace( "\r", "", $line ) );
               $importdata[ $row ][ "englishlanguagename" ] = mb_convert_encoding( $data[ 0 ], "UTF-8", "auto" ) );
               $importdata[ $row ][ "nativelanguagename" ] = mb_convert_encoding( $data[ 1 ], "UTF-8", "auto" ) );
               $importdata[ $row ][ "englishcategoryname" ] = mb_convert_encoding( $data[ 2 ], "UTF-8", "auto" ) );
               $importdata[ $row ][ "nativecategoryname" ] = mb_convert_encoding( $data[ 3 ], "UTF-8", "auto" ) );
               $importdata[ $row ][ "englishsubcategoryname" ] = mb_convert_encoding( $data[ 4 ], "UTF-8", "auto" ) );
               $importdata[ $row ][ "nativesubcategoryname" ] = mb_convert_encoding( $data[ 5 ], "UTF-8", "auto" ) );
               $importdata[ $row ][ "englishphrase" ] = mb_convert_encoding( $data[ 6 ], "UTF-8", "auto" ) );
               $importdata[ $row ][ "nativephrase" ] = mb_convert_encoding( $data[ 7 ], "UTF-8", "auto" ) );
               $importdata[ $row ][ "audiofile" ] = mb_convert_encoding( $data[ 8 ], "UTF-8", "auto" ) );
               $row++;
       }
 
       fclose( $handle );
       unlink( $filename ); 

Open in new window

0

Featured Post

Free Tool: Subnet Calculator

The subnet calculator helps you design networks by taking an IP address and network mask and returning information such as network, broadcast address, and host range.

One of a set of tools we're offering as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Nothing in an HTTP request can be trusted, including HTTP headers and form data.  A form token is a tool that can be used to guard against request forgeries (CSRF).  This article shows an improved approach to form tokens, making it more difficult to…
The title says it all. Writing any type of PHP Application or API code that provides high throughput, while under a heavy load, seems to be an arcane art form (Black Magic). This article aims to provide some general guidelines for producing this typ…
The viewer will learn how to look for a specific file type in a local or remote server directory using PHP.
The viewer will learn how to create and use a small PHP class to apply a watermark to an image. This video shows the viewer the setup for the PHP watermark as well as important coding language. Continue to Part 2 to learn the core code used in creat…
Suggested Courses

600 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question