September 29 2011
Converting MySQL from LATIN1 to UTF8

MySQL defaults to "latin1" as its character set, but at some point, most people want to migrate to "utf8". I realize that there are dozens of posts about how people handled this, and yet, not a single one of those worked completely for me.

I wanted MySQL to use "utf8" for the character set and "utf8_unicode_ci" (not "utf8_general_ci") for the collation, and I wanted it all to work on RDS with ActiveRecord on Rails and Rack.

Here are my steps:

  1. use mysqldump to extract the old data as "latin1"
  2. use sed to replace "latin1" with "utf8" in the dump file
  3. create the new database with the right parameters: character set utf8 collate utf8_unicode_ci
  4. use mysql --default-character-set=utf8 to pipe the converted dump into the new database

Here is my code:

# Dump the old database as latin1, because ironically, mysqldump defaults to utf8.
mysqldump --default-character-set=latin1 db > db.dump

# If you need to convert a MySQL dump from one character set to another, use iconv.
iconv -f LATIN1 -t UTF-8 < db.dump > db.dump

# If you've been running mysqldump without parameters on a latin1 instance, you can convert the dump from UTF8 to latin1 to correct it.
iconv -f UTF-8 -t LATIN1 < db.dump > db.dump

# Rewrite the dump to say 'utf8' and 'utf8_unicode_ci' in all the right places.
sed -e 's/SET NAMES latin1/SET NAMES utf8/g' -i db.dump
sed -e 's/CHARSET=latin1/CHARSET=utf8 COLLATE=utf8_unicode_ci/g' -i db.dump

# Create a new database with the correct parameters.
create database db character set utf8 collate utf8_unicode_ci;
# Verify it.
show create database db;

# Pipe the converted database dump into MySQL.
mysql -h hostname --default-character-set=utf8 -u root -p db < db.dump

To verify the character set and collation, you can always query the MySQL variables:

show variables like 'collation%';
show variables like 'character%';

Amazon RDS

For those using Amazon's AWS RDS for their MySQL instance, you have to create a parameter group with "utf8" values. I'd guess you could just modify the current parameter group then apply it, but I haven't verified that.

# Create a parameter group.
rds-create-db-parameter-group utf8 -e mysql5.1 -d utf8

# Modify the parameter group's values
rds-modify-db-parameter-group utf8 \
    --parameters="name=character_set_server, value=utf8, method=immediate" \
    --parameters="name=character_set_client, value=utf8, method=immediate" \
    --parameters="name=character_set_results,value=utf8,method=immediate" \
    --parameters="name=collation_server, value=utf8_unicode_ci, method=immediate" \
    --parameters="name=collation_connection, value=utf8_unicode_ci, method=immediate"

# Check the parameter group's values.
rds-describe-db-parameters utf8 --source=User

# Push this new parameter group to your instance.
rds-modify-db-instance rds-db --db-parameter-group-name utf8

# Reboot the instance (necessary:
rds-reboot-db-instance rds-db

ActiveRecord in Rails and Rack

ActiveRecord supports an :encoding option in its parameters for ActiveRecord::Base.establish_connection. The option tells the connection to execute SET NAMES as soon as the connection is established, thereby telling the server what the character set the client wants.

However, I also wanted to specify the collation. When I added :encoding to ActiveRecord::Base.establish_connection, collation_connection (from the MySQL variables, not connection.collation) remained as "utf8_general_ci". Some people have indicated that you can specify :collation in database.yml for the establish_connection call, but that never worked for me. I think MySQL bug #34980 prevented it. Others indicated that you can simply add ActiveRecord::Base.connection.execute("set collation_connection='utf8_unicode_ci'") at the bottom of environment.rb for Rails; that also never worked for me. To specify the collation in Rails, I used a before_filter in application_controller.rb. See my code below.

# Rack: add a statement right after establishing the connection.
  :adapter  => "mysql",
  :host     => "host",
  :username => "username",
  :password => "password",
  :database => "database",
  :encoding => "utf8",
  :reconnect => true
ActiveRecord::Base.connection.execute("SET collation_connection='utf8_unicode_ci'");

# Rails: add a before_filter in application_controller.rb
before_filter :set_database_collation
def set_database_collation
  ActiveRecord::Base.connection.execute("set collation_connection='utf8_unicode_ci'")


I pieced my steps together from the following helpful links:

Sanitizing POST params in Rack Rails Tip: Locking Cron Jobs
LinkedIn GitHub Email