Mathias Bynens

How to support full Unicode in MySQL databases

Published · tagged with MySQL, security, Unicode

Alternative title: The things we do to store U+1F4A9 PILE OF POO (💩) correctly.

Are you using MySQL’s utf8 charset in your databases? In this write-up I’ll explain why you should switch to utf8mb4 instead, and how to do it.

UTF-8

The UTF-8 encoding can represent every symbol in the Unicode character set, which ranges from U+000000 to U+10FFFF. That’s 1,114,112 possible symbols. (Not all of these Unicode code points have been assigned characters yet, but that doesn’t stop UTF-8 from being able to encode them.)

UTF-8 is a variable-width encoding; it encodes each symbol using one to four 8-bit bytes. Symbols with lower numerical code point values are encoded using fewer bytes. This way, UTF-8 is optimized for the common case where ASCII characters and other BMP symbols (whose code points range from U+000000 to U+00FFFF) are used — while still allowing astral symbols (whose code points range from U+010000 to U+10FFFF) to be stored.

MySQL’s utf8

For a long time, I was using MySQL’s utf8 charset for databases, tables, and columns, assuming it mapped to the UTF-8 encoding described above. By using utf8, I’d be able to store any symbol I want in my database — or so I thought.

While writing about JavaScript’s internal character encoding, I noticed that there was no way to insert the U+1D306 TETRAGRAM FOR CENTRE (𝌆) symbol into the MySQL database behind this site. The column I was trying to update had the utf8_unicode_ci collation, and the connection charset was set to utf8.

mysql> SET NAMES utf8; # just to emphasize that the connection charset is set to `utf8`
Query OK, 0 rows affected (0.00 sec)

mysql> UPDATE database_name.table_name SET column_name = 'foo𝌆bar' WHERE id = 9001;
Query OK, 1 row affected, 1 warning (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 1

mysql> SELECT column_name FROM database_name.table_name WHERE id = 9001;
+-------------+
| column_name |
+-------------+
| foo |
+-------------+
1 row in set (0.00 sec)

The content got truncated at the first astral Unicode symbol, in this case 𝌆 — so, attempting to insert foo𝌆bar actually inserted foo instead, resulting in data loss (and possibly introducing security issues; see below). MySQL returned a warning message, too:

mysql> SHOW WARNINGS;
+---------+------+------------------------------------------------------------------------------+
| Level | Code | Message |
+---------+------+------------------------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: '\xF0\x9D\x8C\x86' for column 'column_name' at row 1 |
+---------+------+------------------------------------------------------------------------------+
1 row in set (0.00 sec)

Turns out MySQL’s utf8 charset only partially implements proper UTF-8 encoding. It can only store UTF-8-encoded symbols that consist of one to three bytes; encoded symbols that take up four bytes aren’t supported.

Since astral symbols (whose code points range from U+010000 to U+10FFFF) each consist of four bytes in UTF-8, you cannot store them using MySQL’s utf8 implementation.

This doesn’t just affect the 𝌆 character, but more important symbols like U+01F4A9 PILE OF POO (💩) as well. In total, that’s 1,048,575 possible code points you can’t use. In fact, MySQL’s utf8 only allows you to store 5.88% ((0x00FFFF + 1) / (0x10FFFF + 1)) of all possible Unicode code points. Proper UTF-8 can encode 100% of all Unicode code points.

As shown above, this behavior can lead to data loss, but it gets worse — it can result in security vulnerabilities. Here are some examples, all of which were discovered after publishing this write-up:

TL;DR MySQL’s utf8 encoding is awkwardly named, as it’s different from proper UTF-8 encoding. It doesn’t offer full Unicode support, which can lead to data loss or security vulnerabilities.

MySQL’s utf8mb4

Luckily, MySQL 5.5.3 (released in early 2010) introduced a new encoding called utf8mb4 which maps to proper UTF-8 and thus fully supports Unicode, including astral symbols.

Switching from MySQL’s utf8 to utf8mb4

Step 1: Create a backup

Create a backup of all the databases on the server you want to upgrade. Safety first!

Step 2: Upgrade the MySQL server

Upgrade the MySQL server to v5.5.3+, or ask your server administrator to do it for you.

Step 3: Modify databases, tables, and columns

Change the character set and collation properties of the databases, tables, and columns to use utf8mb4 instead of utf8.

# For each database:
ALTER DATABASE database_name CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
# For each table:
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
# For each column:
ALTER TABLE table_name CHANGE column_name column_name VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
# (Don’t blindly copy-paste this! The exact statement depends on the column type, maximum length, and other properties. The above line is just an example for a `VARCHAR` column.)

Since utf8mb4 is fully backwards compatible with utf8, no mojibake or other forms of data loss should occur. (But you have a backup, right?)

Step 4: Check the maximum length of columns and index keys

This is probably the most tedious part of the whole upgrading process.

When converting from utf8 to utf8mb4, the maximum length of a column or index key is unchanged in terms of bytes. Therefore, it is smaller in terms of characters, because the maximum length of a character is now four bytes instead of three.

For example, a TINYTEXT column can hold up to 255 bytes, which correlates to 85 three-byte or 63 four-byte characters. Let’s say you have a TINYTEXT column that uses utf8 but must be able to contain more than 63 characters. Given this requirement, you can’t convert this column to utf8mb4 unless you also change the data type to a longer type such as TEXT — because if you’d try to fill it with four-byte characters, you’d only be able to enter 63 characters, but not more.

The same goes for index keys. The InnoDB storage engine has a maximum index length of 767 bytes, so for utf8 or utf8mb4 columns, you can index a maximum of 255 or 191 characters, respectively. If you currently have utf8 columns with indexes longer than 191 characters, you will need to index a smaller number of characters when using utf8mb4. (Because of this, I had to change some indexed VARCHAR(255) columns to VARCHAR(191).)

Section 10.1.11 of the MySQL 5.5 Reference Manual has some more information on this.

Step 5: Modify connection, client, and server character sets

In your application code, set the connection character set to utf8mb4. This can be done by simply replacing any variants of SET NAMES utf8 with SET NAMES utf8mb4. If your old SET NAMES statement specified the collation, make sure to change that as well, e.g. SET NAMES utf8 COLLATE utf8_unicode_ci becomes SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci.

Make sure to set the client and server character set as well. I have the following in my MySQL configuration file (/etc/my.cnf):

[client]
default-character-set = utf8mb4

[mysql]
default-character-set = utf8mb4

[mysqld]
character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci

You can easily confirm these settings work correctly:

mysql> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
+--------------------------+--------------------+
| Variable_name | Value |
+--------------------------+--------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| collation_connection | utf8mb4_unicode_ci |
| collation_database | utf8mb4_unicode_ci |
| collation_server | utf8mb4_unicode_ci |
+--------------------------+--------------------+
10 rows in set (0.00 sec)

As you can see, all the relevant options are set to utf8mb4, except for character_set_filesystem which should be binary unless you’re on a file system that supports multi-byte UTF-8-encoded characters in file names, and character_set_system which is always utf8 and can’t be overridden.

Note: The default character set and collation can be configured at some other levels as well.

Step 6: Repair and optimize all tables

After upgrading the MySQL server and making the necessary changes explained above, make sure to repair and optimize all databases and tables. I didn’t do this right away after upgrading (I didn’t think it was necessary, as everything seemed to work fine at first glance), and ran into some weird bugs where UPDATE statements didn’t have any effect, even though no errors were thrown.

You could run the following MySQL queries for each table you want to repair and optimize:

# For each table
REPAIR TABLE table_name;
OPTIMIZE TABLE table_name;

Luckily, this can easily be done in one go using the command-line mysqlcheck utility:

$ mysqlcheck -u root -p --auto-repair --optimize --all-databases

This will prompt for the root user’s password, after which all tables in all databases will be repaired and optimized.

Summary

Never use utf8 in MySQL — always use utf8mb4 instead. Updating your databases and code might take some time, but it’s definitely worth the effort. Why would you arbitrarily limit the set of symbols that can be used in your database? Why would you lose data every time a user enters an astral symbol as part of a comment or message or whatever it is you store in your database? There’s no reason not to strive for full Unicode support everywhere. Do the right thing, and use utf8mb4. 🍻

About me

Hi there! I’m Mathias. I work on Chrome DevTools and the V8 JavaScript engine at Google. HTML, CSS, JavaScript, Unicode, performance, and security get me excited. Follow me on Twitter, Mastodon, and GitHub.

Comments

David Higgins wrote on :

Nice write-up Mathias! You could also escape the Unicode symbols into some sort of sanitized version, before you add them into the database. E.g. 𝌆𝌆.

wrote on :

David: That goes against the principles of normalization. And what if you want to use the data in another format that doesn’t support HTML character entities? Better to just store the raw data, and make sure your database supports it.

maxw3st wrote on :

Thank you much for documenting this thoroughly. Looks like updating my existing databases and tables will be a pain, but adding new ones should be a snap once I’ve followed the connection/client/server instructions and updated the config files. In general, great point about increasing the available character set. Thank you.

MeMyselfAndI wrote on :

What is the purpose of character-set-client-handshake=FALSE?

Why do you use init-connect='SET NAMES utf8mb4', which is already the default because of character-set-server=utf8mb4?

wrote on :

MeMyselfAndI:

While your questions are probably a bit out of scope for this article, I’ll respond to them to the best of my knowledge:

What is the purpose of character-set-client-handshake=FALSE?

character-set-client-handshake=FALSE causes the server to ignore character set information sent by the client (e.g. if the client requests a connection in utf8, it would still use utf8mb4) . By using it you can rest assured that the default server character set will be used at all times.

Why do you use init-connect='SET NAMES utf8mb4', which is already the default because of character-set-server=utf8mb4?

SET NAMES indicates what character set the client will use to send SQL statements to the server, i.e. the connection charset. character-set-server sets the server charset. To use utf8mb4 correctly, you need to make sure the client, the server, and the connection are all set to utf8mb4.

That is my understanding of these settings. If you think this is wrong, I’d appreciate a clarification. Thanks!

MeMyselfAndI wrote on :

Did you test the effect of character-set-client-handshake=FALSE? In my MySQL 5.1.x my clients could still change the character set, no matter how I wrote this option (with -, with _, with skip-… instead of =FALSE). And why should a client not be able to change the character set for its own connection?

character-set-server isn’t the only value you can set (without init-connect) in the my.cnf (yep, character-set-filesystem too, but it doesn’t matter for now). This is also the default value for all other character set settings you do not explicitly change (in a client’s session, for a database, table, column, etc.). In other words, (nearly) all other character_set_* will inherit from this value. So the default value for a client’s connection (character_set_client, character_set_results, and character_set_connection) will always be this value. You don’t need to use init-connect unless you want your client’s connection’s default value to be different than what’s defined in character-set-server.

wrote on :

MeMyselfAndI: Setting character-set-client-handshake=FALSE (or using skip-character-set-client-handshake) is the only way I could get collation_connection to show up as utf8mb4_unicode_ci instead of utf8mb4_general_ci when performing a SHOW VARIABLES LIKE 'collation%' query. Unless there’s a better way to achieve the same effect, I’m afraid this setting can not be omitted.

Thanks for pointing out the init-connect setting was unnecessary — I’ve removed it now. Please let me know if the example /etc/my.cnf can be optimized further.

MeMyselfAndI wrote on :

Mathias: To set the connection’s collation to _unicode_ instead of _general_ I could do this successfully using SET NAMES … COLLATE …. But you only need to use _unicode_ for ß=ss, otherwise you can use _general_ which is faster as the manual says. https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-sets.html

To use utf8mb4 correctly, you need to make sure the client, the server, and the connection are all set to utf8mb4.

No, not really. Of course the client has to speak UTF-8. For MySQL you only need two things: the setting for your client’s connection and the setting for the string fields. character-set-server is only a default value for creating databases if you don’t set a character set in the CREATE DATABASE statement. The database’s value is only a default value for its tables if you don’t set it in CREATE TABLE. The table’s value is only a default value for the string fields if you don’t say something special for the fields. Only now both the character set and the collation have a real purpose for storing values and comparing strings etc. The second point is the connection’s value. MySQL needs it to decode your input correctly and to encode the results. (A recoding can be skipped if connection and fields use the same character set value.) See the first three sub-chapters of https://dev.mysql.com/doc/refman/5.5/en/charset-syntax.html (the key sentences are at the end of these chapters) and https://dev.mysql.com/doc/refman/5.5/en/charset-connection.html.

My conclusion is: you can leave the default values of the server as-is. Just specify the values for your fields (or defaults for your tables or your databases if you don’t want to set it field by field). And use SET NAMES or mysql(i)_set_charset() in PHP or something similar after each connect to a MySQL server. That’s all you need. Don’t rely on the server’s default values.

Patricio wrote on :

Nice article Mathias, this also took me by surprise some time ago. Who decided to name that other charset utf8? :s

I think it’s much safer not to trust any default charset set in the server configuration, and make yourself sure that your app is setting it correctly, whenever it creates/modifies tables and databases, and opens new connections.

Binyamin wrote on :

@Mathias, I followed your steps to change MySQL 5.5.25 encoding to utf8mb4, but unfortunately SQL returns:

Error #1064 – You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci at line 1.

…when I run the query:

ALTER TABLE table_name CHANGE column_name column_name VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

…and with character-set-client-handshake = FALSE the whole website returns Error 500; without it returns Error 500 for some pages. MySQL log does not show any error.

With utf8 the website works fine. Any suggestions? Seems like latest stable MySQL release is 5.5.25a.

wrote on :

Binyamin: The 500 error most likely has nothing to do with MySQL itself, but rather with the server-side code that connects to it, or elsewhere in your code. Take a closer look at the pages that result in the error, and try to figure out what they’re doing different than the other pages.

I’m using MySQL 5.5.25a as well, with these exact settings, and everything works flawlessly.

Binyamin wrote on :

Mathias: Got it to work after discovering a Webmin bug with character-set-client-handshake=FALSE and fixing it with nodbi=1.

Strange that server returned error 500 also because some MySQL row type was mediumtext. After I change the type to text, the error disappeared and the utf8mb4 encoding works fine now.

Binyamin wrote on :

Mathias: I am not sure about that, because in some cases a MEDIUMTEXT row will cause Error 500 even when it has a small character value, and using TEXT instead fixes it. Maybe it is related to a MySQL 5.5.25 bug.

ArunMohan wrote on :

Still I am getting java.sql.SQLException: Incorrect string value: '\xF0\x9D\x8C\x86\xF0\x9D...' for column 'title' at row 1… character(𝌆𝌆).

But I can insert those characters through my MySQL client (SqlYog)…

Richard wrote on :

What about performance and memory usage? Presumably this has an impact on both.

wrote on :

Richard: I don’t know about memory usage, but performance of utf8mb4 is slower than for the 3-byte utf8. I suspect the difference to be negligible though, although I’d love to see some numbers.

Jon T. wrote on :

Are you using MySQL’s utf8 charset in your databases? In this write-up I’ll explain why you should switch to PostgreSQL – the Real Man’s database.

Neil Martin wrote on :

I found many resources out there that claimed to provide solutions to using utf8mb4 with MySQL, but they were all incomplete. I read your write up and applied the suggested configuration changes and everything worked. Great work; thanks!

Deron Meranda wrote on :

If you’re using Python with the popular MySQLpy module for your database connectivity, you will need to apply a small patch to the python source code so it recognizes the new utf8mb4 character set. Until this patch is officially made in the up-stream code, see the small Pastebin patch.

CodeKingPlusPlus wrote on :

What about changing the collation settings for JDBC? I just fixed all my databases and tables. I can run queries fine inside my database but I am getting the following exception from my Java code:

java.sql.SQLException: Illegal mix of collations (utf8mb4_unicode_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '='

I cannot figure out how to change the utf8_general_ci,COERCIBLE to utf8mb4_unicode_ci, IMPLICIT.

My connection string is:

con = DriverManager.getConnection("jdbc:mysql://localhost:3306/[dbName]?useUnicode=true&characterEncoding=utf8mb4&connectionCollation=utf8mb4_unicode_ci", "root", "root");

ArunMohan: Have you solved your Java.sql.Exception? I am having the same problem!

Julia Neystadt wrote on :

Thanks a lot for your article. Unfortunately, I followed all the instructions and I am still having a problem:

java.sql.SQLException: Incorrect string value: '\xF0\x9D\x8C\x86\xF0\x9D...' for column 'data' at row 1… character(𝌆𝌆).

This is the same problem ArunMohan reported earlier.

Any ideas?

Julia Neystadt wrote on :

Mathias: Thanks a lot for your fast reply. Unfortunately, this does not help me. My MySQL connector is 5.1.18.

But I paid attention that my DB version is 5.5.19-log. On the other hand, all the utf8mb4 definitions are set correctly:

mysql> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
+--------------------------+--------------------+
| Variable_name | Value |
+--------------------------+--------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| collation_connection | utf8mb4_unicode_ci |
| collation_database | utf8mb4_unicode_ci |
| collation_server | utf8mb4_unicode_ci |
+--------------------------+--------------------+

Your help is greatly appreciated.

Julia

Michael wrote on :

Thanks for your article, it did help me a lot.

Here are my thoughts and some additional information that may help ArunMohan and Julia: http://info.michael-simons.eu/2013/01/21/java-mysql-and-multi-byte-utf-8-support/

CodeKingPlusPlus: You must have character-set-server = utf8mb4 in my.cnf and useUnicode=true in your JDBC connection. Depending on the version of your MySQL Connector/J, you need to have characterEncoding=UTF-8 as well or must not have it, see my article.

Itay wrote on :

Thanks a lot for the great write-up. I did all the steps but I still see question marks where 4-byte chars should display. What am I missing?

Jamie wrote on :

I found that phpMyAdmin didn’t cope with utf8mb4. Existing 4-byte characters display as question marks and trying to insert a 4-byte character seems to insert four actual question marks into the database. Turning on the mysqld log and looking and what phpMyAdmin was doing, I noticed it was setting the character set to utf8 each time. It’s hard-coded in line 1303 of libraries/database_interface.lib.php:

PMA_DBI_query("SET CHARACTER SET 'utf8';", $link, PMA_DBI_QUERY_STORE);

Changing to it to utf8mb4 solves the problem, although being a hack you’ll need to remember to make this change each time you upgrade phpMyAdmin.

Also, if you want to be sure of correct sorting of results you might want to set the Server connection collation on phpMyAdmin’s front page to utf8mb4_unicode_ci.

Paul wrote on :

At the top of your aticle, you really should mention the version of MySQL this applies to because the version in Debian Squeeze is 5.1 and that’s too old to have utf8mb4 apparently. Spent ages writing a program to update my MySQL tables etc. and then found that I couldn’t without the update of the package. Your article is good stuff though. :-)

cmr wrote on :

Hi guys, I tried the above. I’ve restarted the MySQL server but the output still looks like this:

mysql> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
+--------------------------+-------------------+
| Variable_name | Value |
+--------------------------+-------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| collation_connection | utf8_general_ci |
| collation_database | latin1_swedish_ci |
| collation_server | latin1_swedish_ci |
+--------------------------+-------------------+
10 rows in set (0.00 sec)

This is what my /usr/local/mysql/my.cnf looks like:

[client]
default-character-set = utf8mb4

[mysqld]
default-character-set = utf8mb4

[mysqld]
character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci

sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES

I’ve looked at the default server settings and there is no mention of the above variables for my.cnf file.

I can’t save 𝌆 but I can save characters like , , and .

wrote on :

cmr: Are you sure the my.cnf file you’re editing is the one that gets loaded? Try the following command:

$ mysql --help | grep -A 1 "Default options are read from the following files"

On my machine, this prints:

Default options are read from the following files in the given order:

/etc/my.cnf /etc/mysql/my.cnf /usr/local/etc/my.cnf ~/.my.cnf`

Make sure the configuration file you’re saving to is listed. Also, check all the other files that get loaded and make sure they’re not overriding your settings.

cmr wrote on :

Mathias: I moved my.cnf to /usr/local/etc/ and it’s working now. Thanks heaps!

Update: I can’t seem to add the character to the database:

ERROR 1366: Incorrect string value: '\xF0\x9D\x8C\x86' for column 'text' at row 1

SQL Statement:

INSERT INTO `test`.`new_table` (`ID`, `text`) VALUES ('1', '𝌆');

I tried running this command

use test;
ALTER TABLE new_table CHANGE text text VARCHAR(191) CHARSET utf8mb4 COLLATE utf8mb4_unicode_ci;

The output says “0 rows affected”.

 0 row(s) affected Records: 0  Duplicates: 0  Warnings: 0

I’m guessing that utf8mb4 is not being applied to my row.

But when I ran the following:

ALTER DATABASE test CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;

1 row(s) affected

What else could be happening?

Update: Found it! I was missing:

SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci;

Thanks for the great post.

wrote on :

Binyamin: I think only the command-line option was removed, not the variable. A quick test confirms this: removing these variables from the config file impacts the results.

With these settings:

[client]
default-character-set = utf8mb4

[mysql]
default-character-set = utf8mb4

[mysqld]
character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci

…I get these results:

mysql> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
+--------------------------+--------------------+
| Variable_name | Value |
+--------------------------+--------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| collation_connection | utf8mb4_general_ci |
| collation_database | utf8_general_ci |
| collation_server | utf8_general_ci |
+--------------------------+--------------------+
10 rows in set (0.00 sec)

Without the [client] and [mysql] sections, the results are different:

mysql> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
+--------------------------+-----------------+
| Variable_name | Value |
+--------------------------+-----------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| collation_connection | utf8_general_ci |
| collation_database | utf8_general_ci |
| collation_server | utf8_general_ci |
+--------------------------+-----------------+
10 rows in set (0.00 sec)

Mihai wrote on :

Very good writeup. We knew about this concoction called utf8, but you provide the mechanics to properly addressing this issue.

Steve Hibbert wrote on :

About the only place I could find that had the correct way to set collation_connection in 5.6.15! Thanks very much.

HS wrote on :

[client]
default_character_set = utf8mb4

[mysql]
default_character_set = utf8mb4

[mysqld]
character_set_client_handshake = FALSE
init_connect = 'SET collation_connection = utf8mb4_unicode_ci,NAMES utf8mb4'
collation_server = utf8mb4_unicode_ci
collation_connection = utf8mb4_unicode_ci
collation_database = utf8mb4_unicode_ci
character_set_system = utf8
character_set_server = utf8mb4
character_set_client = utf8mb4
character_set_connection = utf8mb4
character_set_database = utf8mb4
character_set_results = utf8mb4

That’s what I have in my.cnf under /etc/. Am I missing something or doing something wrong? I’m running MySQL 5.6.16-log.

HS wrote on :

After hitting my head around a lot, these are the best settings I could figure out to run everything with utf8mb4:

[client]
default-character-set = utf8mb4

[mysql]
default-character-set = utf8mb4

[mysqld]
explicit_defaults_for_timestamp = 1 # Posting it here as a tip to disable the Timestamp message, maybe it can help someone :)
character-set-client-handshake = 0
init_connect = 'SET character_set_system = utf8mb4'
init_connect = 'SET character_set_connection = utf8mb4'
init_connect = 'SET character_set_database = utf8mb4'
init_connect = 'SET character_set_results = utf8mb4'
init_connect = 'SET collation_database = utf8mb4_unicode_ci'
init_connect = 'SET collation_connection = utf8mb4_unicode_ci'
init_connect = 'SET NAMES utf8mb4'
character-set-server = utf8mb4
#character_set_client = utf8mb4
collation-server = utf8mb4_unicode_ci

When running:

SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';

I get this:

+--------------------------+--------------------+
| Variable_name | Value |
+--------------------------+--------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| collation_connection | utf8mb4_unicode_ci |
| collation_database | utf8mb4_unicode_ci |
| collation_server | utf8mb4_unicode_ci |
+--------------------------+--------------------+
10 rows in set (0.00 sec)

My questions are:

  • How can I also set the system to use utf8mb4? I’m using CentOS 6.5 and WHM. Or it would be a bad idea?
  • Is there something else I should do to completely force the use of utf8mb4?
  • Is there anything wrong with the code above on my.cnf?

Thanks.

P.S. With the above configuration in my.cnf, I don’t get ANY sort of warnings or errors anymore when checking the MySQL error messages. Another note is that I’m running the latest MySQL from WHM: 5.6.16-log MySQL Community Server (GPL).

Chris wrote on :

Thanks for writing this! If anyone is doing this migration with Django, don’t forget to add:

'OPTIONS': {'charset': 'utf8mb4'},

…to the DATABASES object in your settings file. Hope this saves someone else some time.

Serge wrote on :

Just one question — why do you prefer utf8mb4_unicode_ci? Why is it better than utf8mb4_bin?

Tam wrote on :

Hi guys, I followed the above setup, seems all is ok, but when I insert a NickName(Primary Key) to MySQL, it shows error 1062.

Table and this column is use UTF*mb4_unicode_ci.

NickName  (is Primary Key)
森下えりか <--- Insert without error
森下エリカ <--- now Insert the second row, got an error 1062 (Duplicate entry)

…after that I find out if change to use UTF8mb4_general_ci instead of UTF*mb4_unicode_ci then insert the second row without any error, so any suggestion to fix this if keeping use UTF*mb4_unicode_ci? Thanks.

HS wrote on :

Did you see that there is an utf8mb4_unicode_520_ci collation? I’m wondering how it differs from utf8mb4_unicode_ci. Any ideas? Would you be able to give us some examples and show us why one would be better than the other one?

Update: https://dev.mysql.com/doc/refman/5.6/en/charset-collation-names.html says:

Unicode collation names may include a version number to indicate the version of the Unicode Collation Algorithm (UCA) on which the collation is based. UCA-based collations without a version number in the name use the version-4.0.0 UCA weight keys. A collation name such as utf8_unicode_520_ci is based on UCA 5.2.0 weight keys.

It sounds like the _520 stands for language standards and not to the MySQL likes? I mean, it looks like that the 520_ci looks to be a better approach to the Unicode standards? Is this true? Please, also see: https://bugs.mysql.com/bug.php?id=9604

It seems that MySQL won’t update the supposed Unicode stuff to keep up with the standards as it seems it would imply in their clients to update the contents of their current databases. Thereafter they created the new collation, which keeps up with standard Unicode. And the version they have available is v5.2.0, which is this one: http://www.unicode.org/Public/UCA/5.2.0/allkeys.txt

Then, it sounds like it would be better to stick with the real Unicode standard one, i.e. utf8mb4_unicode_520_ci?

Is there a way to compile MySQL 5.6 using the latest available UCA weight keys (v7.0.0 at the time of writing)?

Allan Wallace wrote on :

Nice article — although utf8mb4_bin may be more appropriate for certain tasks, as case insensitivity can be a problem in certain contexts, or where the use of accenting in languages alters the meaning of the words. E.g. if you are trying to build a dictionary of terms in Greek and you have a UNIQUE index defined in order to prevent the input of duplicates, you will get false duplicates for words where there should be no match.

As an example, here are two Greek words with considerably different meanings, one of which is rude, the other is not:

μαλακά
μαλάκα

If you try to insert these into a utf8mb4_unicode_ci column, you will get “MySQL Error 1062: duplicate entry for key”. So in this particular case, utf8mb4_bin is the logical option.

David Trotz wrote on :

Thank you for the article — it’s been a lifesaver in better understanding the MySQL documentation.

I had a question that I am still unclear on. If I have a VARCHAR(255) column and it is not indexed, can I leave its size at 255?

David Trotz wrote on :

I answered my own question, and discovered a few things that might help readers of your article.

Before I go into that I want to reiterate that your article has been extremely helpful in getting me started on understanding what I need to do in order to get to proper utf8 support. Thank you for being a pioneer here!

Notes on Converting Indexes

You mentioned a little bit why you changed the columns from VARCHAR(255) to VARCHAR(191) and I understand now why you did this, but you didn’t mention (although it is implied if you understand MySQL) that this will lead to truncation if you are using more than 191 characters in that column anywhere in your database table. MySQL will warn you after the dirty deed is done, but by then it is too late.

It might help to explain that you only need to do this if you index the column and are not worried about truncation but in reality you don’t need to do it, because you can simply create the index to have only use the valid 191 characters like so:

CREATE INDEX part_of_colum_name ON table_name (column_name(191));

For more on this see: MySQL 5.5 docs: 13.1.13 CREATE INDEX syntax.

Notes on Converting Tables

It is not necessary to convert all your columns as they will inherit the table’s encoding and be converted when the table is converted, in fact even if the column has a specified encoding the column will get the table’s default encoding. This is most likely what you want but I put it out there as a warning for others.

Also of note when converting a table to utf8mb4 from utf8: if you have a column that is TEXT, MySQL will automagically promote that column to MEDIUMTEXT to accomodate for the additional storage space needed for the new encoding. I only tested this with TEXT and assume it is similar with TINYTEXT etc.

I say all of that to note that you really do not need to run:

# For each column:
ALTER TABLE table_name CHANGE column_name column_name VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

…on all of your columns unless you want to override the default behaviour inherited from the table when it was converted. It will re-run and waste your time, as it does not look at your ALTER TABLE command and note that no changes will take place.

Peter Torelli wrote on :

Great post. I made all the updates to my server you specify. However, when I enter your test line foo𝌆bar there is still an extra space in the select results, like so:

mysql> select * from test;
+-----------------+
| name |
+-----------------+
| oo𝌆bar |
| 森下えりか |
| μαλάκα |
| μαλακά |
| foo𝌆bar |
+-----------------+
5 rows in set (0.00 sec)

What did I miss? Thanks!

Beau Key wrote on :

Maybe I am stupid but it is not easier to create a backup of all (relevant) databases, drop the databases, update my.ini and restore the databases from a backup?

Klaus wrote on :

I am using utf8mb4_unicode_ci in MySQL but when I upload a file with UTF-8 (CP 65001) then I get all the German “Umlaute” ÄäÜüÖöß wrong what I could see proper in the editor Notepad.

hnuecke wrote on :

Thank you for the excellent suggestions and explanations; they already helped me a lot.

I had the same messy config as cmr showed in his post on 30th May 2013 at 15:54. I made some progress, but I still cannot set the character-set-client and character-set-results to utf8mb4. My variables actually look like this:

+----------------------------+----------------------+
| Variable_name | Value |
+----------------------------+----------------------+
| `character_set_client` | `utf8` |
| `character_set_connection` | `utf8mb4` |
| `character_set_database` | `utf8mb4` |
| `character_set_filesystem` | `binary` |
| `character_set_results` | `utf8` |
| `character_set_server` | `utf8mb4` |
| `character_set_system` | `utf8` |
| `collation_connection` | `utf8mb4_unicode_ci` |
| `collation_database` | `utf8mb4_unicode_ci` |
| `collation_server` | `utf8mb4_unicode_ci` |
+----------------------------+----------------------+

I assume that this stops me from converting my WordPress site to a multisite. At least I still get error messages when WordPress tries to create a table for a new site:

WordPress database error COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'utf8mb4' for query

WordPress uses utf8mb4 since March or April 2015, I think. I’m also checking in the WP forum if what I see is a bug, because tables for the first site could be generated. Hopefully somebody here can give me a hint how I can manage to change the character sets for the remaining two variables.

Thanks!

Paul wrote on :

Thank you for this guide — it really helped.

I also had to drop and re-create all the database’s stored procedures (and functions too) in order that they execute within the new character set, otherwise I would receive: SQL state [HY000]; error code [1366]; Incorrect string value: '\xF0…' when calling them. Even when character_set_connection equalled utf8mb4.

Run:

SHOW PROCEDURE STATUS;

…to see which procedures have not been updated to the server’s new character_set_client, collation_connection and Database Collation values.

Justin Lee wrote on :

Thank you so much for this write-up, Mathias! I don’t know how I would have fixed my UTF-8 issues without this.

I followed some of the advice in some of the other comments, and I spent a day and rewrote my entire project to use Postgres instead. It was not too bad. It mostly involves changing all the non-standard funky MySQL SQL into proper SQL. And guess what, in Postgres the UTF-8 stuff works out of the box. You don’t have to specify anything. It just works by default because it’s the right way to do it.

For anyone else who stumbled on this website here, I encourage you to switch to PostgreSQL. MySQL was bought by Oracle and they haven’t done anything since. I think it was pretty broken even before.

Looks like Michael Stonebreaker will still get the last laugh over Larry Ellison ;) (at least in terms of software quality)

Erez wrote on :

Using SqlWorkbench I could not change the character_set_result and character_set_client until I used the following line as a part of your provided script: SET CHARACTER SET utf8mb4;.

This setting is erased each time I restart SqlWorkbench. Any idea how to keep it?

Dennis A wrote on :

I’ve done all these steps and everything looks fine in my tables now, yet the server still hangs when I try to submit a form that contains these annoying emojis that pop up in people’s communications nowadays. I am using the MySQL ODBC connector because the software is very old. I am wondering if the connector is the reason it is failing. My version of MySQL is 5.5.36.

Mighty Chaffinch wrote on :

Anyone using Connector/ODBC should be aware of the advice in MySQL bug #69021:

Please note that under no circumstances your application should set the character set for the connection or the results etc. It is always set by the driver at the connection time to UTF-8. UTF-8 is used as a “transport” character set to communicate with the server. So the data conversion normally goes similar to the following:

 UTF8MB4               UTF8                UTF8MB4
[ASP]<------->[ODBC Driver]<---->[MySQL Server]<------->[MySQL Table]

As you can see at both ends [ASP] and [MySQL Table] the data is in UFT8MB4.

Once again, the application should indicate the intended character set using the special option ...;CHARSET=UTF8MB4;... and should not attempt to set any of the connection properties because it confuses the driver conversion functions.

Also, if you want to round-trip text back from classic ASP forms back to MySQL, you need to set:

Session.CodePage = 65001;
Response.ContentType = "text/HTML";
Response.CodePage = 65001;
Response.Charset = "UTF-8";

And save your .asp files as UTF-8 in your text editor.

Hope this saves someone the pain and confusion I experienced arriving at this solution.

Moz wrote on :

And don’t forget to scatter --default-character-set=utf8mb4 liberally around any database-handling scripts you might have, as a lot of MySQL utilities default to utf8 and pick up the local environment if you don’t force proper utf8mb4 on everything.

Also, thank you so much for collecting everything into one page like this. It’s made an awful process somewhat easier.

hoekma wrote on :

I put the settings for the three sections into /etc/my.cnf and restarted the server but not all of them were applied when I checked the SHOW VARIABLES command. It was the exact same problem as @cmr’s post. However, once I put the settings into /etc/mysql/my.cnf they were all applied successfully.

Hopefully this helps someone!

P.S.: MySQL Server version 5.5.46 and Ubuntu 14.04.2.

Oliver wrote on :

I think you have an error in “Step 3: Modify databases, tables, and columns” in the ALTER TABLE statement.

ALTER TABLE table_name CHANGE column_name column_name VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

It says twice column_name column_name but actually it should be like this:

# For each column
ALTER TABLE table_name MODIFY COLUMN column_name VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

…and you should use MODIFY instead of CHANGE because it’s less powerful.

Cheers Oliver

Mikael wrote on :

I was wondering why you recommend utf8mb4 instead of utf32? Any particular reason? I'd prefer utf32 to utf8mb4 anytime because I dislike variable length encodings, except for transportation purposes, so I'd like to know if this is a poor choice with MySQL/MariaDB.

Mikael wrote on :

Mikael: I just changed my server's default to utf32 and did the same with the databases and tables. I also changed the client default to utf32 and that turned out to be a bad idea. The 'mysql' command, on Linux, choked and came with this error (with no further info):

ERROR 1231 (42000):

So I had to change the client's default encoding back to "utf8mb4". The output of:

SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';

Now makes sense as only the client side uses utf8mb4 and the server uses utf32.

Sam wrote on :

Chris: Do you know where to add that if I’m using Flask?

Paul: Did you add settings to the stored procedures when you re-created them? I’m running into the same issue (character-set-server is showing utf8mb4 but website is still throwing 1366).

ExplodingCabbage wrote on :

I think you have a typo: character-set-server and collation-server need to be written as character_set_server and collation_server (underscores instead of hyphens) in my.cnf, at least on my machine.

Shashi wrote on :

@ArunMohan I was facing the same problem… In my case the mysql-connector version was the problem. I was using 5.1.22; I changed it to 6.0.3 and it worked. Hope this helps!

Rick James wrote on :

The 3 ALTERs are too much, and probably broken. In particular, CONVERT TO will probably fail if the 191 is not done first.

The 191 problem has been eliminated in 5.7.

Raoul Duk wrote on :

Great article!

I have a follow-up question: is the conversion an all or nothing scenario?

Let’s say I have a database with > 70 tables and I know I really only care about utf8mb4_unicode_ci in one particular table. I know I can just convert that one particular table. But my question is: in my client connection (from a Python Django app) I need to set the encoding to utf8mb4 to ensure I am inserting/querying data in the correct encoding for the table in question.

But is it harmful to have that as a global setting in all DB connections for all the tables I have not migrated to utf8mb4?

Or is it just better to migrate all the tables? Thanks.

dGo wrote on :

Thanks for this! I wrote a little script to generate the update queries for tables, columns and views. If it’s correct (I’m asking), I figure it might be useful for others as well. So this script generates and list of SQL queries you’ll have to run to update the tables, columns and views charsets.

-- set your table name here
SET @MY_SCHEMA = "";

-- tables
SELECT DISTINCT
CONCAT("ALTER TABLE ", TABLE_NAME," CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;") as queries
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA=@MY_SCHEMA
AND TABLE_TYPE="BASE TABLE"

UNION

-- table columns
SELECT DISTINCT
CONCAT("ALTER TABLE ", C.TABLE_NAME, " CHANGE ", C.COLUMN_NAME, " ", C.COLUMN_NAME, " ", C.COLUMN_TYPE, " CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;") as queries
FROM INFORMATION_SCHEMA.COLUMNS as C
LEFT JOIN INFORMATION_SCHEMA.TABLES as T
ON C.TABLE_NAME = T.TABLE_NAME
WHERE C.COLLATION_NAME is not null
AND C.TABLE_SCHEMA=@MY_SCHEMA
AND T.TABLE_TYPE="BASE TABLE"

UNION

-- views
SELECT DISTINCT
CONCAT("CREATE OR REPLACE VIEW ", V.TABLE_NAME, " AS ", V.VIEW_DEFINITION, ";") as queries
FROM INFORMATION_SCHEMA.VIEWS as V
LEFT JOIN INFORMATION_SCHEMA.TABLES as T
ON V.TABLE_NAME = T.TABLE_NAME
WHERE V.TABLE_SCHEMA=@MY_SCHEMA
AND T.TABLE_TYPE="VIEW";

Björn wrote on :

Thank you for this article.

I tried a converstion of every field on a table with 1,000,000 rows and it performed very badly. A faster solution is to create a new table with correct charset and insert data directly from the old table. After this rename both tables and everything works fine.

Brendon wrote on :

Since this ancient article is still the very best reference on the whole web about setting up Mysql and Mariadb correctly, perhaps you can update to use the latest version of collation instead of the old standard?

utf8mb4_unicode_520_ci gives the most correct collation for non-English languages. It fixes several problems that exist in unicode_ci, including treating some characters as others and treating some characters differently depending on capitalization.

Thanks!

Leave a comment

Comment on “How to support full Unicode in MySQL databases”

Your input will be parsed as Markdown.