Link to home
Start Free TrialLog in
Avatar of debitto
debittoFlag for Japan

asked on

Japanese MBCS input by PHP garbled in MySQL

Maximum points to the expert who solves this problem, which has stumped my hosting service tech support and me after a week of struggling.

Problem: Japanese form input inserted into a MySQL database appears garbled on the hosting server (hereinafter "Server"), although it appears as readable Japanese when I do the same operation on my testing machine ("Local").

Factors: The input form and the processing script are constant. So are the MySQL character sets and collations: UTF-8 and utf8_unicode_ci, respectively, in all categories. The PHP and Apache settings probably differ in various ways. I've tried managing PHP with a php.ini file containing multibyte-related lines in my /scripts directory. Not only does that have no effect, but the php.ini on Local contains no such lines anyway. The Content-Type of my input page is "text/html; charset=utf-8". The PHP/MySQL installations are the same major versions: PHP 4.4.6 and MySQL 4.1.22-standard on Server, versus PHP 4.3.6 and MySQL 4.1.7-nt on Local. Server runs Unix, and Local runs Windows XP SP2 (Japanese).

When Japanese is typed directly into the MySQL db, or contained in data that's copied by SQL in phpMyAdmin, the problem does not occur. Since the input form and script work on Local, I'm assuming they're not at fault and am omitting them here. However, I'll supply them as necessary. I've tried html_entity_decode, SET NAMES, and more, all in vain. Here is a sample of the garbled Japanese, pasted from the db: æ¥æS¬èª~

In case that's not visible to all, it's a mixture of international phonetic symbols, non-English language symbols such as a French accent grave, and non-language symbols. In other words, it's not html entities or question marks.

Lacking a solution to this problem per se, I'd like to know whether I can dispose of it by upgrading my hosting account to PHP5/MySQL5, along with any vital details about making the transition and setting up MBCS support. I'm already using PHP/MySQL heavily on my sites and can't afford to go down for long. Since my (otherwise outstanding) hosting service doesn't understand what's causing the problem at hand, I'd like an outside opinion on this.

Thank you.
Avatar of Bernard Savonet
Bernard Savonet
Flag of France image

You seem to have explored most of the possible trouble sources.
Some additional tracks you might follow:
1 - On which machine is your Unix running? Is it running 64bits or 32bits? If it is not a X86 type machine, is it running "big Endian" or "little Endian"? [This problem should probably be searched and if necessary solved at your hosting IT-level]

2 - Getting wrong displays is NOT the sure sign that there is the problem in your DB! I would suggest that you read your character string then look at the Hex codes of each of the bytes. Some possible results:
- the codes are OK but they are not in the right order, ie they are always in the sequence "YX" while you would expect "XY": this is probably an "endian" problem, look 1/ above
- the codes are OK and in the right order, but they do not display correctly as character strings; this means that you "just" have a display problem in one of the following "layers": php multibyte manipulation (either by "your" program or php settings), character transfer by your web server, character page code on your browser. The last one is most probably the easiest one to experiment, but might not be really conclusive.
- the codes are not OK and no pattern is discernable; you should supsect that some problem happened at input time.

3 - I would probably use phpmyadmin and experiment with it several combinations of input and display with different codepages in my browser; most noticeably, I would use 2 different input methods: one by typing directly into phpmyadmin, one by typing into a friendly text editor where I am sure that what I type looks fine (at least in a certain tuning), and then copying and pasting into phpmyadmin.
NOTE: since there are several layers involved at input, storing, processing and display times, I would recommend that you write down notes on paper on which solution work or don't.

Good bug-hunting!
Do you use php multibyte extension or multibyte php? Do you have both fixed and variable fonts to enter and display database on displaying system?

wtf is "Japanese MBCS" ?
Avatar of debitto

ASKER

To gheist:

>Do you use php multibyte extension or multibyte php?

I'll have to learn the difference before I can answer that.

>wtf is "Japanese MBCS" ?

I typed both "Japanese" and "MBCS" in my title as keywords, thinking the body of the post would make the matter clear. Sorry it didn't. The language I'm typing in is Japanese. The character set, UTF-8,  is a MBCS.
Can you verify that data are correct just before insertion into DB?  For example:

Original: gY (desu)
UTF-8: e3 81 a7 e3 81 99
hmm, EE trashed utf-8.  Let's try HTML -- です
There is problem with your browser.

Lets see if mine is broken as well:

#:@0i=0
Experts-exchange is broke as well...
Avatar of debitto

ASKER

To svs:

Your first message looks all right on my browser here in Japan. However, I have to admit that I don't know how to do the verification you mean. I'll be glad to check the data if you tell me the procedure.

To fibo, belatedly:

Thanks for the various ideas. To reply in part:

1 - The Unix server architecture is i686, and it's running Linux. I don't know whether it's 64bits or 32bits, but I'll try to find out.

2 - Unfortunately, I don't know how to go about examining Hex codes, as all my experience is with plain-text scripting languages.

3 - I've done a little of the testing you suggest, but so far I haven't discovered anything. Japanese typed or pasted into phpmyadmin has not been a problem, and when I tried various browser character encodings with the garbled characters (see my original post), the results were just different styles of garbage.
Avatar of debitto

ASKER

New information on the problem:

I have installed WordPress Multilingual Edition (v2.2.3 Japanese distribution) on my site and have found   that there is NO troulbe with inserting Japanese in its MySQL database, either with the default editor or with the embedded tinyMCE editor.

In my unsuccessful tests, I have been using a standalone installation of tinyMCE, localized with a Japanese language file (ja_utf-8.js), in addition to non-tinyMCE text input fields; so please note that this is not a tinyMCE issue.

I've looked around in the WordPress source and the db tables created by WP, but can't tell what WP is getting right that I'm getting wrong. I've tried echoing my form input when it's submitted to the db, and I find that it displays properly.

Here are the essentials of my input form and processor:

-----FORM-----
<head>
<title>Input Form</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
</head>

<body>
<form method="post" action="../scripts/myprocessor.php">
     <input type="text" id="nameFamily" name="nameFamily" size=20 />
     <input type="text" id="nameGiven" name="nameGiven" size=20 />
     <textarea id="mainText" name="mainText" rows="25" cols="70"></textarea>
     <input type="submit" name="save" value="Submit" />
</form>
-----END FORM-----

-----PROCESSOR-----
<?php
$host="localhost";
$database="mydatabase";
$user="myusername";
$password="mypassword";

$table="mytable";

$nameFamily = $_POST['nameFamily'];
$nameGiven = $_POST['nameGiven'];
$mainText = $_POST['mainText'];
mysql_connect($host,$user,$password) or die( 'Unable to connect to database server');
@mysql_select_db($database) or die( 'Unable to select database');
mysql_query("INSERT INTO $table (nameFamily, nameGiven, mainText) VALUES ('$nameFamily', '$nameGiven', '$mainText')")or die( "Unable to insert into table");
?>
-----END PROCESSOR-----

As reported before, I've tried a variety of tricks with these.

Any ideas will be appreciated. It seems that if I can make my handmade input tool behave like WordPress, I'll be all right.
ASKER CERTIFIED SOLUTION
Avatar of svs
svs
Flag of Afghanistan image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of debitto

ASKER

>Maybe wordpress does a "SET NAMES utf8" before all other database operations.

I can only find one instance of "SET NAMES" in all their code, and it doesn't seem to be used with INSERT operations. Anyway, I've tried it (again) with my script and it doesn't do any good.

Thanks anyway.
You said:
''Japanese typed or pasted into phpmyadmin has not been a problem"
Do you mean that you can paste, save, and display OK?
Avatar of debitto

ASKER

To fibo, svs, and gheist:

I've solved my problem. The solution was a single line of PHP code in the processor script. For the information of others with a similar problem, here it is:

mysql_query("SET NAMES utf8");

That should be placed just after the mysql_connect() line. I had tried it before, but in a slightly different form (which I found in the MySQL Manual!), with 'utf-8' hyphenated and enclosed in single quotes. I've reconfirmed that that breaks the code. It may work in some other context, but I won't take the trouble to find out. The separate page for the input form does not need to contain any reference to the Japanese language or the utf-8 encoding at all.

The online source in which I found the above line also said it was necessary to place the following lines in the script, BEFORE the mysql_connect() line:

mb_language("uni");
mb_internal_encoding("utf-8");
mb_http_input("auto");
mb_http_output("utf-8");

However, in my case they're not necessary. I wonder if that's because my hosting provider has all the MySQL character sets and collations set to utf8 and utf8_unicode_ci on the server? I imagine people started using the above lines a few years ago as a way of coping with "latin1" defaults, but it's just a guess.

Thank you for your suggestions and efforts to help me. This thread can now come to an end.
Avatar of debitto

ASKER

Wait!

Fibo, I see that you suggested a "SET NAMES utf8" before all other database operations. It seemed unlikely in the context of the discussion at that moment (learning from WordPress), but it did turn out to be the right answer. I'm accepting that as the solution to my problem. Congratulations, and thanks again!
Just out of curiosity, debitto -- are you Arudou Debito?
Avatar of debitto

ASKER

Hi, svs.

No, sorry. That's not me.
Debitto, I take a note of this solution;
And glad that you rightly gave points to svs since he, not me, did point the solution.
Avatar of debitto

ASKER

Oops! Fibo, I see that you're right: That idea came from svs. The mistake may seem to suggest that I was in a confused state from struggling with the problem, but actually it's my everyday state. Anyway, thanks again for contributing to the effort.

I'll take this opportunity to modify something I wrote in giving the details of my particular solution, for anybody who may be interested.

>The separate page for the input form does not need to contain any reference to the Japanese language or the utf-8 encoding at all.

That holds true on my hosting provider's server, where nothing works without "SET NAMES utf8" in the processing script, but on my local testing machine it's the opposite: If I put "SET NAMES utf8" in the script, I need the following line in the head of my input page:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

Otherwise, I get html entities in the database. Without "SET NAMES utf8" in the processor, I don't need the above line; everything simply works. For the differences between the two systems, please see the original post.