MySQL and character sets

Nowadays, it is common place to use UTF-8 when working with text data. Unfortunately, this is not the case when you deal with legacy software or old configuration. MySQL supports a lot of character sets. There are two places where a developer has to choose the character set.

  • Every column of type VARCHAR has an associated character set, which means that data in this column are stored in the specified character set.
  • For every connection, the client can negotiate the character set that wants to send and receive. It can do that by issuing a SET NAMES SQL statement.

MySQL handles conversion between character sets when necessary. It can actually get really messy if you do not use the same character set everywhere. To demonstrate this fact, I have written a Python utility that tries combinations of character sets and operations and produces a nice summary in HTML. You can find the summary here: MySQL – character sets comparison tables.

The script uses the following MySQL character sets:

  • latin1
  • utf8
  • greek
  • binary

It performs this sequence of operations:

  • Flush the table.
  • Insert a row of data.
  • Select all rows.
  • Select the inserted row using a WHERE clause.
  • Select the inserted row using a fulltext index.

This sequence is repeated for every combination of the following:

  • Column character set.
  • Character set during INSERT.
  • Character set during SELECT.

The test data used are the first 4 (in order for the fulltext index to work) capital letters of the Greek alphabet. When working with the latin1 and binary character sets, these letters are encoded as UTF-8 (which is plain wrong, but it happens in bad configurations).
The output of the script consists of tables that describe the behavior for each case. The observed behavior fits in 5 categories:

  • INSERT triggers error.
  • SELECT … WHERE … triggers error.
  • SELECT … WHERE … returns empty set.
  • Almost correct behavior. Fulltext search does not work.
  • Correct behavior.

Notice, also, that by looking at the main diagonal of the tables you can understand how the system behaves when the client uses a specific character set on a column of the specified character set. By looking at other elements of the tables you can understand what will happen if you decide to switch your client from a character set A to character set B. You can see, that most of the time, this switch will cause you trouble.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


%d bloggers like this: