I am an experienced Java and database developer, but I have never converted a database from latin1 to UTF8. I am using MySQL 5.0.27. I am especially concerned because this is a live US site that I need to convert so that it can also handle Chinese and other languages. The US site can't break.
The main part of this question is: how can I feel comfortable about the database changes?
So far I have changed my my.cnf file to contain only the following:
[client]
default-character-set=utf8
[mysqld]
default-character-set=utf8
character_set_server=utf8
collation_server=utf8_general_ci
init_connect='SET collation_connection = utf8_general_ci'
init_connect='SET CHARACTER_SET utf8'
init_connect='SET NAMES utf8'
[mysql]
default-character-set=utf8
Now when I run: show variables like "%character%"; show variables like "%collation%";
I get the following results:
+--------------------------+--------------------------+
| Variable_name            | Value                    |
+--------------------------+--------------------------+
| character_set_client     | utf8                     |
| character_set_connection | utf8                     |
| character_set_database   | latin1                   |
| character_set_filesystem | binary                   |
| character_set_results    | utf8                     |
| character_set_server     | utf8                     |
| character_set_system     | utf8                     |
| character_sets_dir       | C:\mysql\share\charsets\ |
+--------------------------+--------------------------+
8 rows in set (0.06 sec)
+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | utf8_general_ci   |
| collation_database   | latin1_swedish_ci |
| collation_server     | utf8_general_ci   |
+----------------------+-------------------+
3 rows in set (0.00 sec)
Note that two of the settings, character_set_database and collation_database, are still latin1. Do I need to dump the database and recreate it? Do I need to do a data conversion first? If so, what might that be? Would ALTER DATABASE be enough? Do I need to specify UTF8 on every future CREATE TABLE and CREATE DATABASE? Do I need to do something to make sure dumps work correctly in UTF8?
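For reference, the statements I am asking about can be sketched as below. This is a minimal sketch, not a procedure verified against 5.0.27: the schema name `mydb` and table name `orders` are hypothetical, and the helper class exists only for illustration. As far as I understand, ALTER DATABASE by itself only changes the default for tables created later, while `CONVERT TO CHARACTER SET` re-encodes existing column data on the assumption that the stored bytes really are latin1.

```java
class Utf8ConversionSql {
    // Quote an identifier with backticks, doubling any embedded backtick.
    static String quote(String identifier) {
        return "`" + identifier.replace("`", "``") + "`";
    }

    // Changes the schema default character set; existing tables are not converted.
    static String alterDatabase(String schema) {
        return "ALTER DATABASE " + quote(schema)
                + " DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci";
    }

    // Converts an existing table's text columns and its default character set.
    static String convertTable(String table) {
        return "ALTER TABLE " + quote(table)
                + " CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci";
    }

    public static void main(String[] args) {
        // Hypothetical names; print the statements rather than executing them.
        System.out.println(alterDatabase("mydb") + ";");
        System.out.println(convertTable("orders") + ";");
    }
}
```

Running the generated statements against a restored copy of a backup first would show whether existing data survives before production is touched.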
I realize that my Java object-relational bridge connection also needs CONNECT=UTF8 or some such setting. But is that all?
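On the Java side, assuming the driver is MySQL Connector/J, I believe the connection properties in question are `useUnicode` and `characterEncoding`. A minimal sketch of the URL follows; the host, port, and schema name are placeholders, not values from my setup.

```java
class Utf8JdbcUrl {
    // Builds a Connector/J URL asking the driver to exchange text as UTF-8.
    // host, port and schema are placeholders supplied by the caller.
    static String buildUrl(String host, int port, String schema) {
        return "jdbc:mysql://" + host + ":" + port + "/" + schema
                + "?useUnicode=true&characterEncoding=UTF-8";
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("localhost", 3306, "mydb"));
    }
}
```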
I am also using Lucene for search. I assume multi-byte Chinese text might not work the same way, but I don't understand the full scope of that. I think I might need to pull in a multi-byte parser.
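To illustrate the concern: Chinese text has no spaces between words, so splitting on whitespace yields a single token per sentence. One common workaround is to index overlapping two-character pairs (bigrams). The sketch below uses plain Java only and is not Lucene's API; the class and method names are hypothetical, and it assumes Java 8 or later for `codePoints()`.

```java
import java.util.ArrayList;
import java.util.List;

class CjkBigramSketch {
    // Returns overlapping two-character tokens, working on code points
    // so that characters outside the Basic Multilingual Plane stay intact.
    // Input shorter than two characters yields an empty list.
    static List<String> bigrams(String text) {
        int[] cps = text.codePoints().toArray();
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < cps.length; i++) {
            out.add(new String(cps, i, 2));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587\u641c\u7d22"; // four Chinese characters, no spaces
        System.out.println(text.split("\\s+").length); // whitespace split: one token
        System.out.println(bigrams(text).size());      // bigram split: three tokens
    }
}
```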
If nothing else, I want to make sure I get the database part right.
I realize this is a difficult question, but thank you very much in advance for any help. Until I feel comfortable, I can't make much useful progress, and I don't want to destroy our production environment.
John