bdunagan

Brian Dunagan

September 29 2011
Converting MySQL from LATIN1 to UTF8

MySQL defaults to "latin1" as its character set, but at some point, most people want to migrate to "utf8". I realize that there are dozens of posts about how people handled this, and yet, not a single one of those worked completely for me.

I wanted MySQL to use "utf8" for the character set and "utf8_unicode_ci" (not "utf8_general_ci") for the collation, and I wanted it all to work on RDS with ActiveRecord on Rails and Rack.

Here are my steps:

  1. use mysqldump to extract the old data as "latin1"
  2. use sed to replace "latin1" with "utf8" in the dump file
  3. create the new database with the right parameters: character set utf8 collate utf8_unicode_ci
  4. use mysql --default-character-set=utf8 to pipe the converted dump into the new database

Here is my code:

# Dump the old database as latin1, because ironically, mysqldump defaults to utf8.
mysqldump --default-character-set=latin1 db > db.dump

# If you need to convert a MySQL dump from one character set to another, use iconv.
iconv -f LATIN1 -t UTF-8 < db.dump > db.dump

# If you've been running mysqldump without parameters on a latin1 instance, you can convert the dump from UTF8 to latin1 to correct it.
iconv -f UTF-8 -t LATIN1 < db.dump > db.dump

# Rewrite the dump to say 'utf8' and 'utf8_unicode_ci' in all the right places.
sed -e 's/SET NAMES latin1/SET NAMES utf8/g' -i db.dump
sed -e 's/CHARSET=latin1/CHARSET=utf8 COLLATE=utf8_unicode_ci/g' -i db.dump

# Create a new database with the correct parameters.
create database db character set utf8 collate utf8_unicode_ci;
# Verify it.
show create database db;

# Pipe the converted database dump into MySQL.
mysql -h hostname --default-character-set=utf8 -u root -p db < db.dump

To verify the character set and collation, you can always query the MySQL variables:

show variables like 'collation%';
show variables like 'character%';

Amazon RDS

For those using Amazon's AWS RDS for their MySQL instance, you have to create a parameter group with "utf8" values. I'd guess you could just modify the current parameter group then apply it, but I haven't verified that.

# Create a parameter group.
rds-create-db-parameter-group utf8 -e mysql5.1 -d utf8

# Modify the parameter group's values
rds-modify-db-parameter-group utf8 \
    --parameters="name=character_set_server, value=utf8, method=immediate" \
    --parameters="name=character_set_client, value=utf8, method=immediate" \
    --parameters="name=character_set_results,value=utf8,method=immediate" \
    --parameters="name=collation_server, value=utf8_unicode_ci, method=immediate" \
    --parameters="name=collation_connection, value=utf8_unicode_ci, method=immediate"

# Check the parameter group's values.
rds-describe-db-parameters utf8 --source=User

# Push this new parameter group to your instance.
rds-modify-db-instance rds-db --db-parameter-group-name utf8

# Reboot the instance (necessary: http://aws.amazon.com/articles/Amazon-RDS/2935).
rds-reboot-db-instance rds-db

ActiveRecord in Rails and Rack

ActiveRecord supports an :encoding option in its parameters for ActiveRecord::Base.establish_connection. The option tells the connection to execute SET NAMES as soon as the connection is established, thereby telling the server what the character set the client wants.

However, I also wanted to specify the collation. When I added :encoding to ActiveRecord::Base.establish_connection, collation_connection (from the MySQL variables, not connection.collation) remained as "utf8_general_ci". Some people have indicated that you can specify :collation in database.yml for the establish_connection call, but that never worked for me. I think MySQL bug #34980 prevented it. Others indicated that you can simply add ActiveRecord::Base.connection.execute("set collation_connection='utf8_unicode_ci'") at the bottom of environment.rb for Rails; that also never worked for me. To specify the collation in Rails, I used a before_filter in application_controller.rb. See my code below.

# Rack: add a statement right after establishing the connection.
ActiveRecord::Base.establish_connection(
  :adapter  => "mysql",
  :host     => "host",
  :username => "username",
  :password => "password",
  :database => "database",
  :encoding => "utf8",
  :reconnect => true
)
ActiveRecord::Base.connection.execute("SET collation_connection='utf8_unicode_ci'");

# Rails: add a before_filter in application_controller.rb
before_filter :set_database_collation
def set_database_collation
  ActiveRecord::Base.connection.execute("set collation_connection='utf8_unicode_ci'")
end

Pointers

I pieced my steps together from the following helpful links:

Sanitizing POST params in Rack Rails Tip: Locking Cron Jobs
LinkedIn GitHub Email