How to support full Unicode in MySQL databases
Alternative title: The things we do to store U+01F4A9 PILE OF POO (💩) correctly.
Are you using MySQL’s utf8 charset in your databases? In this write-up I’ll explain why you should switch to utf8mb4 instead, and how to do it.
UTF-8
The UTF-8 encoding can represent every symbol in the Unicode character set, which ranges from U+000000 to U+10FFFF. That’s 1,114,112 possible symbols. (Not all of these Unicode code points have been assigned characters yet, but that doesn’t stop UTF-8 from being able to encode them.)
UTF-8 is a variable-width encoding; it encodes each symbol using one to four 8-bit bytes. Symbols with lower numerical code point values are encoded using fewer bytes. This way, UTF-8 is optimized for the common case where ASCII characters and other BMP symbols (whose code points range from U+000000 to U+00FFFF) are used — while still allowing astral symbols (whose code points range from U+010000 to U+10FFFF) to be stored.
MySQL’s utf8
For a long time, I was using MySQL’s utf8 charset for databases, tables, and columns, assuming it mapped to the UTF-8 encoding described above. By using utf8, I’d be able to store any symbol I want in my database — or so I thought.
While writing about JavaScript’s internal character encoding, I noticed that there was no way to insert the U+1D306 TETRAGRAM FOR CENTRE (𝌆) symbol into the MySQL database behind this site. The column I was trying to update had the utf8_unicode_ci collation, and the connection charset was set to utf8.
mysql> SET NAMES utf8; # just to emphasize that the connection charset is set to `utf8`
Query OK, 0 rows affected (0.00 sec)
mysql> UPDATE database_name.table_name SET column_name = 'foo𝌆bar' WHERE id = 9001;
Query OK, 1 row affected, 1 warning (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 1
mysql> SELECT column_name FROM database_name.table_name WHERE id = 9001;
+-------------+
| column_name |
+-------------+
| foo |
+-------------+
1 row in set (0.00 sec)
The content got truncated at the first astral Unicode symbol, in this case 𝌆 — so, attempting to insert foo𝌆bar actually inserted foo instead, resulting in data loss. MySQL returned a warning message, too:
mysql> SHOW WARNINGS;
+---------+------+------------------------------------------------------------------------------+
| Level | Code | Message |
+---------+------+------------------------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: '\xF0\x9D\x8C\x86' for column 'column_name' at row 1 |
+---------+------+------------------------------------------------------------------------------+
1 row in set (0.00 sec)
Turns out MySQL’s utf8 charset only partially implements proper UTF-8 encoding. It can only store UTF-8-encoded symbols that consist of one to three bytes; encoded symbols that take up four bytes aren’t supported.
Since astral symbols (whose code points range from U+010000 to U+10FFFF) each consist of four bytes in UTF-8, you cannot store them using MySQL’s utf8 implementation.
This doesn’t just affect the 𝌆 character, but more important symbols like U+01F4A9 PILE OF POO (💩) as well. In total, that’s 1,048,575 possible code points you can’t use. In fact, MySQL’s utf8 only allows you to store 5.88% ((0x00FFFF + 1) / (0x10FFFF + 1)) of all possible Unicode code points. Proper UTF-8 can encode 100% of all Unicode code points.
TL;DR MySQL’s utf8 encoding is awkwardly named, as it’s different from proper UTF-8 encoding. It doesn’t offer full Unicode support.
MySQL’s utf8mb4
Luckily, MySQL 5.5.3 (released in early 2010) introduced a new encoding called utf8mb4 which maps to proper UTF-8 and thus fully supports Unicode, including astral symbols.
Switching from MySQL’s utf8 to utf8mb4
Step 1: Create a backup
Create a backup of all the databases on the server you want to upgrade. Safety first!
Step 2: Upgrade the MySQL server
Upgrade the MySQL server to v5.5.3+, or ask your server administrator to do it for you.
Step 3: Modify databases, tables, and columns
Change the character set and collation properties of the databases, tables, and columns to use utf8mb4 instead of utf8.
# For each database:
ALTER DATABASE database_name CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
# For each table:
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
# For each column:
ALTER TABLE table_name CHANGE column_name column_name VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
# (Don’t blindly copy-paste this! The exact statement depends on the column type, maximum length, and other properties. The above line is just an example for a `VARCHAR` column.)
Since utf8mb4 is fully backwards compatible with utf8, no mojibake or other forms of data loss should occur. (But you have a backup, right?)
Step 4: Check the maximum length of columns and index keys
This is probably the most tedious part of the whole upgrading process.
When converting from utf8 to utf8mb4, the maximum length of a column or index key is unchanged in terms of bytes. Therefore, it is smaller in terms of characters, because the maximum length of a character is now four bytes instead of three.
For example, a TINYTEXT column can hold up to 255 bytes, which correlates to 85 three-byte or 63 four-byte characters. Let’s say you have a TINYTEXT column that uses utf8 but must be able to contain more than 63 characters. Given this requirement, you can’t convert this column to utf8mb4 unless you also change the data type to a longer type such as TEXT — because if you’d try to fill it with four-byte characters, you’d only be able to enter 63 characters, but not more.
The same goes for index keys. The InnoDB storage engine has a maximum index length of 767 bytes, so for utf8 or utf8mb4 columns, you can index a maximum of 255 or 191 characters, respectively. If you currently have utf8 columns with indexes longer than 191 characters, you will need to index a smaller number of characters when using utf8mb4. (Because of this, I had to change some indexed VARCHAR(255) columns to VARCHAR(191).)
Section 10.1.11 of the MySQL 5.5 Reference Manual has some more information on this.
Step 5: Modify connection, client, and server character sets
In your application code, set the connection character set to utf8mb4. This can be done by simply replacing any variants of SET NAMES utf8 with SET NAMES utf8mb4. If your old SET NAMES statement specified the collation, make sure to change that as well, e.g. SET NAMES utf8 COLLATE utf8_unicode_ci becomes SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci.
Make sure to set the client and server character set as well. I have the following in my MySQL configuration file (/etc/my.cnf):
[mysqld]
character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
You can easily confirm these settings work correctly:
mysql> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
+--------------------------+--------------------+
| Variable_name | Value |
+--------------------------+--------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| collation_connection | utf8mb4_unicode_ci |
| collation_database | utf8mb4_unicode_ci |
| collation_server | utf8mb4_unicode_ci |
+--------------------------+--------------------+
10 rows in set (0.00 sec)
As you can see, all the relevant options are set to utf8mb4, except for character_set_filesystem which should be binary unless you’re on a file system that supports multi-byte UTF-8-encoded characters in file names, and character_set_system which is always utf8 and can’t be overridden.
Note: The default character set and collation can be configured at some other levels as well.
Step 6: Repair and optimize all tables
After upgrading the MySQL server and making the necessary changes explained above, make sure to repair and optimize all databases and tables. I didn’t do this right away after upgrading (I didn’t think it was necessary, as everything seemed to work fine at first glance), and ran into some weird bugs where UPDATE statements didn’t have any effect, even though no errors were thrown.
You could run the following MySQL queries for each table you want to repair and optimize:
# For each table
REPAIR TABLE table_name;
OPTIMIZE TABLE table_name;
Luckily, this can easily be done in one go using the command-line mysqlcheck utility:
$ mysqlcheck -u root -p --auto-repair --optimize --all-databases
This will prompt for the root user’s password, after which all tables in all databases will be repaired and optimized.
Summary
Never use utf8 in MySQL — always use utf8mb4 instead. Updating your databases and code might take some time, but it’s definitely worth the effort. Why would you arbitrarily limit the set of symbols that can be used in your database? Why would you lose data every time a user enters an astral symbol as part of a comment or message or whatever it is you store in your database? There’s no reason not to strive for full Unicode support everywhere. Do the right thing, and use utf8mb4. 🍻
Comments
David Higgins wrote on :
Nice write-up Mathias! You could also escape the Unicode symbols into some sort of sanitized version, before you add them into the database. E.g.
𝌆→𝌆.Mathias wrote on :
David: That goes against the principles of normalization. And what if you want to use the data in another format that doesn’t support HTML character entities? Better to just store the raw data, and make sure your database supports it.
maxw3st wrote on :
Thank you much for documenting this thoroughly. Looks like updating my existing databases and tables will be a pain, but adding new ones should be a snap once I’ve followed the connection/client/server instructions and updated the config files. In general, great point about increasing the available character set. Thank you.
MeMyselfAndI wrote on :
What is the purpose of
character-set-client-handshake=FALSE?Why do you use
init-connect='SET NAMES utf8mb4', which is already the default because ofcharacter-set-server=utf8mb4?Mathias wrote on :
MeMyselfAndI:
While your questions are probably a bit out of scope for this article, I’ll respond to them to the best of my knowledge:
character-set-client-handshake=FALSEcauses the server to ignore character set information sent by the client (e.g. if the client requests a connection inutf8, it would still useutf8mb4) . By using it you can rest assured that the default server character set will be used at all times.SET NAMESindicates what character set the client will use to send SQL statements to the server, i.e. the connection charset.character-set-serversets the server charset. To useutf8mb4correctly, you need to make sure the client, the server, and the connection are all set toutf8mb4.That is my understanding of these settings. If you think this is wrong, I’d appreciate a clarification. Thanks!
MeMyselfAndI wrote on :
Did you test the effect of
character-set-client-handshake=FALSE? In my MySQL 5.1.x my clients could still change the character set, no matter how I wrote this option (with-, with_, withskip-…instead of=FALSE). And why should a client not be able to change the character set for its own connection?character-set-serverisn’t the only value you can set (withoutinit-connect) in themy.cnf(yep,character-set-filesystemtoo, but it doesn’t matter for now). This is also the default value for all other character set settings you do not explicitly change (in a client’s session, for a database, table, column, etc.). In other words, (nearly) all othercharacter_set_*will inherit from this value. So the default value for a client’s connection (character_set_client,character_set_results, andcharacter_set_connection) will always be this value. You don’t need to useinit-connectunless you want your client’s connection’s default value to be different than what’s defined incharacter-set-server.character_set_systemis not changeable, neither in a configuration file nor in a user’s session.Mathias wrote on :
MeMyselfAndI: Setting
character-set-client-handshake=FALSE(or usingskip-character-set-client-handshake) is the only way I could getcollation_connectionto show up asutf8mb4_unicode_ciinstead ofutf8mb4_general_ciwhen performing aSHOW VARIABLES LIKE 'collation%'query. Unless there’s a better way to achieve the same effect, I’m afraid this setting can not be omitted.Thanks for pointing out the
init-connectsetting was unnecessary — I’ve removed it now. Please let me know if the example/etc/my.cnfcan be optimized further.MeMyselfAndI wrote on :
Mathias: To set the connection’s collation to
_unicode_instead of_general_I could do this successfully usingSET NAMES … COLLATE …. But you only need to use_unicode_forß=ss, otherwise you can use_general_which is faster as the manual says. http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-sets.htmlNo, not really. Of course the client has to speak UTF-8. For MySQL you only need two things: the setting for your client’s connection and the setting for the string fields.
character-set-serveris only a default value for creating databases if you don’t set a character set in theCREATE DATABASEstatement. The database’s value is only a default value for its tables if you don’t set it inCREATE TABLE. The table’s value is only a default value for the string fields if you don’t say something special for the fields. Only now both the character set and the collation have a real purpose for storing values and comparing strings etc. The second point is the connection’s value. MySQL needs it to decode your input correctly and to encode the results. (A recoding can be skipped if connection and fields use the same character set value.) See the first three sub-chapters of http://dev.mysql.com/doc/refman/5.5/en/charset-syntax.html (the key sentences are at the end of these chapters) and http://dev.mysql.com/doc/refman/5.5/en/charset-connection.html.My conclusion is: you can leave the default values of the server as-is. Just specify the values for your fields (or defaults for your tables or your databases if you don’t want to set it field by field). And use
SET NAMESormysql(i)_set_charset()in PHP or something similar after each connect to a MySQL server. That’s all you need. Don’t rely on the server’s default values.Patricio wrote on :
Nice article Mathias, this also took me by surprise some time ago. Who decided to name that other charset
utf8? :sI think it’s much safer not to trust any default charset set in the server configuration, and make yourself sure that your app is setting it correctly, whenever it creates/modifies tables and databases, and opens new connections.
Peter van Dijk wrote on :
If you really needed to
REPAIRafter these changes, that’s a bug and you should report it.Binyamin wrote on :
@Mathias, I followed your steps to change MySQL 5.5.25 encoding to
utf8mb4, but unfortunately SQL returns:…when I run the query:
…and with
character-set-client-handshake = FALSEthe whole website returns Error 500; without it returns Error 500 for some pages. MySQL log does not show any error.With
utf8the website works fine. Any suggestions? Seems like latest stable MySQL release is 5.5.25a.Mathias wrote on :
Binyamin: The 500 error most likely has nothing to do with MySQL itself, but rather with the server-side code that connects to it, or elsewhere in your code. Take a closer look at the pages that result in the error, and try to figure out what they’re doing different than the other pages.
I’m using MySQL 5.5.25a as well, with these exact settings, and everything works flawlessly.
Binyamin wrote on :
Mathias: Got it to work after discovering a Webmin bug with
character-set-client-handshake=FALSEand fixing it withnodbi=1.Strange that server returned error 500 also because some MySQL row type was
mediumtext. After I change the type totext, the error disappeared and theutf8mb4encoding works fine now.Mathias wrote on :
Binyamin: The
MEDIUMTEXTissue was probably a result of what’s described in step 4 (check the maximum length of columns and index keys), no?Binyamin wrote on :
Mathias: I am not sure about that, because in some cases a
MEDIUMTEXTrow will cause Error 500 even when it has a small character value, and usingTEXTinstead fixes it. Maybe it is related to a MySQL 5.5.25 bug.ArunMohan wrote on :
Still I am getting
java.sql.SQLException: Incorrect string value: '\xF0\x9D\x8C\x86\xF0\x9D...' for column 'title' at row 1… character(𝌆𝌆).But I can insert those characters through my MySQL client (SqlYog)…
Richard wrote on :
What about performance and memory usage? Presumably this has an impact on both.
Mathias wrote on :
Richard: I don’t know about memory usage, but performance of
utf8mb4is slower than for the 3-byteutf8. I suspect the difference to be negligible though, although I’d love to see some numbers.Jon T. wrote on :
Are you using MySQL’s
utf8charset in your databases? In this write-up I’ll explain why you should switch to PostgreSQL – the Real Man’s database.Neil Martin wrote on :
I found many resources out there that claimed to provide solutions to using
utf8mb4with MySQL, but they were all incomplete. I read your write up and applied the suggested configuration changes and everything worked. Great work; thanks!Deron Meranda wrote on :
If you’re using Python with the popular MySQLpy module for your database connectivity, you will need to apply a small patch to the python source code so it recognizes the new
utf8mb4character set. Until this patch is officially made in the up-stream code, see the small Pastebin patch.Roland Franssen wrote on :
ArunMohan:
SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci;made it work for me on MySQL 5.5.28.CodeKingPlusPlus wrote on :
What about changing the collation settings for JDBC? I just fixed all my databases and tables. I can run queries fine inside my database but I am getting the following exception from my Java code:
I cannot figure out how to change the
utf8_general_ci,COERCIBLEtoutf8mb4_unicode_ci, IMPLICIT.My connection string is:
ArunMohan: Have you solved your
Java.sql.Exception? I am having the same problem!Robert wrote on :
Thank you very very much, I searched for a solution for nearly a year!
Pawel wrote on :
If I stumble upon a PILE OF POO that needs to be inserted into database, I will remember your post :)
Julia Neystadt wrote on :
Thanks a lot for your article. Unfortunately, I followed all the instructions and I am still having a problem:
This is the same problem ArunMohan reported earlier.
Any ideas?
Mathias wrote on :
ArunMohan and Julia: You’re probably getting that exception because the JDBC driver (the connection between Java and your database) doesn’t recognize the
utf8mb4charset. Try using a JDBC driver withutf8mb4support — MySQL Connector/J 5.1.13 (or a newer version), for example.Julia Neystadt wrote on :
Mathias: Thanks a lot for your fast reply. Unfortunately, this does not help me. My MySQL connector is 5.1.18.
But I paid attention that my DB version is 5.5.19-log. On the other hand, all the
utf8mb4definitions are set correctly:Your help is greatly appreciated.
Julia
Julia Neystadt wrote on :
I resolved my problem by upgrading the MySQL Connector/J version to 5.1.22.
Michael wrote on :
Thanks for your article, it did help me a lot.
Here are my thoughts and some additional information that may help ArunMohan and Julia: http://info.michael-simons.eu/2013/01/21/java-mysql-and-multi-byte-utf-8-support/
CodeKingPlusPlus: You must have
character-set-server = utf8mb4inmy.cnfanduseUnicode=truein your JDBC connection. Depending on the version of your MySQL Connector/J, you need to havecharacterEncoding=UTF-8as well or must not have it, see my article.Itay wrote on :
Thanks a lot for the great write-up. I did all the steps but I still see question marks where 4-byte chars should display. What am I missing?
Jamie wrote on :
I found that phpMyAdmin didn’t cope with
utf8mb4. Existing 4-byte characters display as question marks and trying to insert a 4-byte character seems to insert four actual question marks into the database. Turning on themysqldlog and looking and what phpMyAdmin was doing, I noticed it was setting the character set toutf8each time. It’s hard-coded in line 1303 oflibraries/database_interface.lib.php:Changing to it to
utf8mb4solves the problem, although being a hack you’ll need to remember to make this change each time you upgrade phpMyAdmin.Also, if you want to be sure of correct sorting of results you might want to set the Server connection collation on phpMyAdmin’s front page to
utf8mb4_unicode_ci.Mathias wrote on :
Jamie: Thanks! You might want to append that info to the corresponding phpMyAdmin bug ticket: http://sourceforge.net/p/phpmyadmin/bugs/3619/
Paul wrote on :
At the top of your aticle, you really should mention the version of MySQL this applies to because the version in Debian Squeeze is 5.1 and that’s too old to have
utf8mb4apparently. Spent ages writing a program to update my MySQL tables etc. and then found that I couldn’t without the update of the package. Your article is good stuff though. :-)cmr wrote on :
Hi guys, I tried the above. I’ve restarted the MySQL server but the output still looks like this:
This is what my
/usr/local/mysql/my.cnflooks like:I’ve looked at the default server settings and there is no mention of the above variables for
my.cnffile.I can’t save
𝌆but I can save UTF-8 characters like♥,❤, and❥.Mathias wrote on :
cmr: Are you sure the
my.cnffile you’re editing is the one that gets loaded? Try the following command:On my machine, this prints:
Make sure the configuration file you’re saving to is listed. Also, check all the other files that get loaded and make sure they’re not overriding your settings.
cmr wrote on :
Thanks for the reply Mathias and fixing up my comment :)
I moved
my.cnfto/usr/local/etc/and it’s working now. Thanks heaps!Update: I can’t seem to add the character to the database:
SQL Statement:
I tried running this command
The output says 0 rows effected.
I’m guessing that
utf8mb4is not being applied to my row.But when I ran the following:
What else could be happening?
Update: Found it! I was missing:
Thanks for the great post.
Binyamin wrote on :
default-character-setremoved in MySQL 5.5.3 andutf8mb4supported since MySQL 5.5.3.Why would you use
default-character-setin[client]and[mysql]?Mathias wrote on :
Binyamin: Good point! There seems to be no need for those settings. I’ve updated the article with the simplified configuration settings. Thanks!