MySql DB Conversion to UTF8

I am an experienced Java / database developer, but I have never gone through a conversion of a database from latin1 to UTF8.   I am using MySQL 5.0.27.  I am especially concerned because I am working with a live US site that I need to convert to be able to handle also Chinese and other languages.   The US site can't break.

The main piece for this question is how do I feel comfortable about the database changes?

So far what I have done so far is changed my my.cnf file to have the following information only:
[client]
default-character-set=utf8

[mysqld]
default-character-set=utf8
character_set_server=utf8
collation_server=utf8_general_ci
init_connect='SET collation_connection = utf8_general_ci'
init_connect='SET CHARACTER_SET utf8'
init_connect='SET NAMES utf8'

[mysql]
default-character-set=utf8

Now when I do:  show variables like "%character%";show variables like "%collation%";

I am getting the results of:

+--------------------------+--------------------------+
| Variable_name            | Value                    |
+--------------------------+--------------------------+
| character_set_client     | utf8                     |
| character_set_connection | utf8                     |
| character_set_database   | latin1                   |
| character_set_filesystem | binary                   |
| character_set_results    | utf8                     |
| character_set_server     | utf8                     |
| character_set_system     | utf8                     |
| character_sets_dir       | C:\mysql\share\charsets\ |
+--------------------------+--------------------------+
8 rows in set (0.06 sec)

+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | utf8_general_ci   |
| collation_database   | latin1_swedish_ci |
| collation_server     | utf8_general_ci   |
+----------------------+-------------------+
3 rows in set (0.00 sec)

Note that I am still seeing the two configuration items still being latin1 oriented.   Do I need to dump the database and recreated it?  Do I need to do a data conversion first?   If so what might that be?   Would ALTER database be enough?   Do I need to put UTF8 on all table creations in the future and database creations in the future?  Do I need to do something to make sure dumps work correctly in UTF8?

I realize from my Java Object Relational Bridge connection I need to also set the CONNECT=UTF8 or some such.   But is that all?

I am also using Lucene for search engine capabilities.  I assume Chinese with multi-byte might not work the same, but don't understand the full scope of that.  I think I might need to pull in a multi-byte parser.

If only one thing, I want to make sure I am using the database part right here.

I realize this is a difficult question, but thank you very much in advance for any help.  Until I feel comfortable, I don't make much useful progress and I don't want to destroy our production environment.

John
johnikeAsked:
Who is Participating?
 
RurneConnect With a Mentor Commented:
You actually have to convert the database; the collation_database and character_set_database variables are determined at runtime, as to how the database was saved.  These should be able to sort you out:

ALTER DATABASE `myDb` CHARACTER SET UTF8;
ALTER DATABASE `myDb` COLLATE utf8_unicode_ci;

More info:
http://dev.mysql.com/doc/refman/4.1/en/alter-database.html
http://dev.mysql.com/doc/refman/4.1/en/charset.html
http://dev.mysql.com/doc/refman/4.1/en/charset-unicode-sets.html
0
 
johnikeAuthor Commented:
This has been very helpful.

Since I am just latin1 at the moment the ALTER databases should be  enough to have data converted to UTF8?

I was hoping to be able to change to UTF8 to support Asian character sets (esp Chinese).  Upon reading your third link, I am now not sure that will work.  I am thinking now that I will have to have a separate database that is the cp932 Character Set which is not what I was hoping to have to do to support the Asian set.  I was hoping to be able to have my site be both English and Asian and not have two databases.  I need to do more research.  I am hoping I am wrong and UTF8 will satisfy both.

If this is the case, then my Lucene Java code might have to have two different parsers as well for the search engine piece.

I sure wish there was one site that really gave more comprehensive information on what you need to internationalize (English and Asian) a database driven web site.  I have not found a good one yet.

Since the site is in production, I am very uncomfortable about corrupting the site's data.




0
 
AymanAdamConnect With a Mentor Commented:
hi johnike,

try this website:
http://micropipes.com/blog/category/code/
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.