johnike
asked on
MySql data conversion from latin1 to utf8
I am trying to convert my MySql 5.0.* application data from latin1 to UTF8. In doing so, my European words with special characters are getting truncated upon uploading.
My database settings were originally:
character_set_client latin1
character_set_connection latin1
character_set_database latin1
character_set_filesystem binary
character_set_results latin1
character_set_server latin1
character_set_system utf8
character_sets_dir /usr/share/mysql/charsets/
Variable_name Value
collation_connection latin1_swedish_ci
collation_database latin1_swedish_ci
collation_server latin1_swedish_ci
Based on database config file changes, they are now:
character_set_client utf8
character_set_connection utf8
character_set_database utf8
character_set_filesystem binary
character_set_results utf8
character_set_server utf8
character_set_system utf8
character_sets_dir C:\\apps\\mysql\\share\\ch arsets\\
Variable_name Value
collation_connection utf8_general_ci
collation_database utf8_general_ci
collation_server utf8_unicode_ci
This seems correct.
My code does the following:
mysqldump ${DATABASE} ${ID} ${PASSWORD} --default-character-set=la tin1 |
sed -e "s/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8/g" | \
sed -e "s/SET NAMES latin1/SET NAMES utf8/g" > datafile
I drop the old database and recreated it as UTF8.
Then I do:
cat datafile | mysql --default-character-set=ut f8 ${ID} ${PASSWORD} ${DATABASE}
Once this is done the data after special characters in fields is truncated in the new database. datafile holds data that looks like it still has the special characters.
Ideas on what I am doing wrong?
Thanks!
My database settings were originally:
character_set_client latin1
character_set_connection latin1
character_set_database latin1
character_set_filesystem binary
character_set_results latin1
character_set_server latin1
character_set_system utf8
character_sets_dir /usr/share/mysql/charsets/
Variable_name Value
collation_connection latin1_swedish_ci
collation_database latin1_swedish_ci
collation_server latin1_swedish_ci
Based on database config file changes, they are now:
character_set_client utf8
character_set_connection utf8
character_set_database utf8
character_set_filesystem binary
character_set_results utf8
character_set_server utf8
character_set_system utf8
character_sets_dir C:\\apps\\mysql\\share\\ch
Variable_name Value
collation_connection utf8_general_ci
collation_database utf8_general_ci
collation_server utf8_unicode_ci
This seems correct.
My code does the following:
mysqldump ${DATABASE} ${ID} ${PASSWORD} --default-character-set=la
sed -e "s/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8/g" | \
sed -e "s/SET NAMES latin1/SET NAMES utf8/g" > datafile
I drop the old database and recreated it as UTF8.
Then I do:
cat datafile | mysql --default-character-set=ut
Once this is done the data after special characters in fields is truncated in the new database. datafile holds data that looks like it still has the special characters.
Ideas on what I am doing wrong?
Thanks!
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
I'm trying things based on your suggestions. Do you think this could happen due to transfering the data from UNIX to windows?
If you are moving data from *nix to Windows then that could be the case if you
generate the file in *nix and open it in wondows.
You could alwasys use SQLYog , MySQLAdmint tool or Toad for MySQL to extract/import or replciate the data
beteween databases. I have used it to transfer data from Win to Linux where I have Latin1 charset on Windows
and UTF8 on Linux with great success.
Regards,
Tomas Helgi
generate the file in *nix and open it in wondows.
You could alwasys use SQLYog , MySQLAdmint tool or Toad for MySQL to extract/import or replciate the data
beteween databases. I have used it to transfer data from Win to Linux where I have Latin1 charset on Windows
and UTF8 on Linux with great success.
Regards,
Tomas Helgi
MySQLAdmint should be MySQLAdministrator :)
ASKER
I don't quite get it, but I think your answer worked Tomas Helgi. It does not really make sense to me, but I believe I am seeing a UTF8 result by not specifying the --default-character-set=la tin1 on the dump .
In the database I now see a sequence that looks a bit odd (instead of it being a tilda for example), but when I have it come back out to my screens, it is showing up correctly as a tilda.
I need to do a bit of looking into whether or not what I am seeing is correct in the database, but thank you for such a simple fix.
I will follow up on this later.
In the database I now see a sequence that looks a bit odd (instead of it being a tilda for example), but when I have it come back out to my screens, it is showing up correctly as a tilda.
I need to do a bit of looking into whether or not what I am seeing is correct in the database, but thank you for such a simple fix.
I will follow up on this later.
ASKER
Seems like it worked. I have yet to put it in production, but I figured I would give you the points now. Thank you very much.
I think that this is not enough and I will tell you why:
mysql >= 5.0 also use collation when it comes to VARCHAR declarations. So I guess you also need to change the collation to utf8_bin or utf8_ci depending on your needs.