September 29 2011
Converting MySQL from LATIN1 to UTF8
MySQL defaults to "latin1" as its character set, but at some point, most people want to migrate to "utf8". I realize that there are dozens of posts about how people handled this, and yet, not a single one of those worked completely for me.
I wanted MySQL to use "utf8" for the character set and "utf8_unicode_ci" (not "utf8_general_ci") for the collation, and I wanted it all to work on RDS with ActiveRecord on Rails and Rack.
Here are my steps:
- use mysqldump to extract the old data as "latin1"
- use sed to replace "latin1" with "utf8" in the dump file
- create the new database with the right parameters: character set utf8 collate utf8_unicode_ci
- use mysql --default-character-set=utf8 to pipe the converted dump into the new database
Here is my code:
# Dump the old database as latin1, because ironically, mysqldump defaults to utf8.
mysqldump --default-character-set=latin1 db > db.dump
# If you need to convert a MySQL dump from one character set to another, use iconv.
iconv -f LATIN1 -t UTF-8 < db.dump > db.dump
# If you've been running mysqldump without parameters on a latin1 instance, you can convert the dump from UTF8 to latin1 to correct it.
iconv -f UTF-8 -t LATIN1 < db.dump > db.dump
# Rewrite the dump to say 'utf8' and 'utf8_unicode_ci' in all the right places.
sed -e 's/SET NAMES latin1/SET NAMES utf8/g' -i db.dump
sed -e 's/CHARSET=latin1/CHARSET=utf8 COLLATE=utf8_unicode_ci/g' -i db.dump
# Create a new database with the correct parameters.
create database db character set utf8 collate utf8_unicode_ci;
# Verify it.
show create database db;
# Pipe the converted database dump into MySQL.
mysql -h hostname --default-character-set=utf8 -u root -p db < db.dump
To verify the character set and collation, you can always query the MySQL variables:
show variables like 'collation%';
show variables like 'character%';
Amazon RDS
For those using Amazon's AWS RDS for their MySQL instance, you have to create a parameter group with "utf8" values. I'd guess you could just modify the current parameter group then apply it, but I haven't verified that.
# Create a parameter group.
rds-create-db-parameter-group utf8 -e mysql5.1 -d utf8
# Modify the parameter group's values
rds-modify-db-parameter-group utf8 \
--parameters="name=character_set_server, value=utf8, method=immediate" \
--parameters="name=character_set_client, value=utf8, method=immediate" \
--parameters="name=character_set_results,value=utf8,method=immediate" \
--parameters="name=collation_server, value=utf8_unicode_ci, method=immediate" \
--parameters="name=collation_connection, value=utf8_unicode_ci, method=immediate"
# Check the parameter group's values.
rds-describe-db-parameters utf8 --source=User
# Push this new parameter group to your instance.
rds-modify-db-instance rds-db --db-parameter-group-name utf8
# Reboot the instance (necessary: http://aws.amazon.com/articles/Amazon-RDS/2935).
rds-reboot-db-instance rds-db
ActiveRecord in Rails and Rack
ActiveRecord supports an :encoding option in its parameters for ActiveRecord::Base.establish_connection. The option tells the connection to execute SET NAMES as soon as the connection is established, thereby telling the server what the character set the client wants.
However, I also wanted to specify the collation. When I added :encoding to ActiveRecord::Base.establish_connection, collation_connection (from the MySQL variables, not connection.collation) remained as "utf8_general_ci". Some people have indicated that you can specify :collation in database.yml for the establish_connection call, but that never worked for me. I think MySQL bug #34980 prevented it. Others indicated that you can simply add ActiveRecord::Base.connection.execute("set collation_connection='utf8_unicode_ci'") at the bottom of environment.rb for Rails; that also never worked for me. To specify the collation in Rails, I used a before_filter in application_controller.rb. See my code below.
# Rack: add a statement right after establishing the connection.
ActiveRecord::Base.establish_connection(
:adapter => "mysql",
:host => "host",
:username => "username",
:password => "password",
:database => "database",
:encoding => "utf8",
:reconnect => true
)
ActiveRecord::Base.connection.execute("SET collation_connection='utf8_unicode_ci'");
# Rails: add a before_filter in application_controller.rb
before_filter :set_database_collation
def set_database_collation
ActiveRecord::Base.connection.execute("set collation_connection='utf8_unicode_ci'")
end
Pointers
I pieced my steps together from the following helpful links: