Mathias Bynens

How to support full Unicode in MySQL databases

· tagged with MySQL, Unicode

Alternative title: The things we do to store U+01F4A9 PILE OF POO (💩) correctly.

Are you using MySQL’s utf8 charset in your databases? In this write-up I’ll explain why you should switch to utf8mb4 instead, and how to do it.

UTF-8

The UTF-8 encoding can represent every symbol in the Unicode character set, which ranges from U+000000 to U+10FFFF. That’s 1,114,112 possible symbols. (Not all of these Unicode code points have been assigned characters yet, but that doesn’t stop UTF-8 from being able to encode them.)

UTF-8 is a variable-width encoding; it encodes each symbol using one to four 8-bit bytes. Symbols with lower numerical code point values are encoded using fewer bytes. This way, UTF-8 is optimized for the common case where ASCII characters and other BMP symbols (whose code points range from U+000000 to U+00FFFF) are used — while still allowing astral symbols (whose code points range from U+010000 to U+10FFFF) to be stored.

MySQL’s utf8

For a long time, I was using MySQL’s utf8 charset for databases, tables, and columns, assuming it mapped to the UTF-8 encoding described above. By using utf8, I’d be able to store any symbol I want in my database — or so I thought.

While writing about JavaScript’s internal character encoding, I noticed that there was no way to insert the U+1D306 TETRAGRAM FOR CENTRE (𝌆) symbol into the MySQL database behind this site. The column I was trying to update had the utf8_unicode_ci collation, and the connection charset was set to utf8.

mysql> SET NAMES utf8; # just to emphasize that the connection charset is set to `utf8`
Query OK, 0 rows affected (0.00 sec)

mysql> UPDATE database_name.table_name SET column_name = 'foo𝌆bar' WHERE id = 9001;
Query OK, 1 row affected, 1 warning (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 1

mysql> SELECT column_name FROM database_name.table_name WHERE id = 9001;
+-------------+
| column_name |
+-------------+
| foo |
+-------------+
1 row in set (0.00 sec)

The content got truncated at the first astral Unicode symbol, in this case 𝌆 — so, attempting to insert foo𝌆bar actually inserted foo instead, resulting in data loss (and possibly introducing security risks). MySQL returned a warning message, too:

mysql> SHOW WARNINGS;
+---------+------+------------------------------------------------------------------------------+
| Level | Code | Message |
+---------+------+------------------------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: '\xF0\x9D\x8C\x86' for column 'column_name' at row 1 |
+---------+------+------------------------------------------------------------------------------+
1 row in set (0.00 sec)

Turns out MySQL’s utf8 charset only partially implements proper UTF-8 encoding. It can only store UTF-8-encoded symbols that consist of one to three bytes; encoded symbols that take up four bytes aren’t supported.

Since astral symbols (whose code points range from U+010000 to U+10FFFF) each consist of four bytes in UTF-8, you cannot store them using MySQL’s utf8 implementation.

This doesn’t just affect the 𝌆 character, but more important symbols like U+01F4A9 PILE OF POO (💩) as well. In total, that’s 1,048,575 possible code points you can’t use. In fact, MySQL’s utf8 only allows you to store 5.88% ((0x00FFFF + 1) / (0x10FFFF + 1)) of all possible Unicode code points. Proper UTF-8 can encode 100% of all Unicode code points.

TL;DR MySQL’s utf8 encoding is awkwardly named, as it’s different from proper UTF-8 encoding. It doesn’t offer full Unicode support.

MySQL’s utf8mb4

Luckily, MySQL 5.5.3 (released in early 2010) introduced a new encoding called utf8mb4 which maps to proper UTF-8 and thus fully supports Unicode, including astral symbols.

Switching from MySQL’s utf8 to utf8mb4

Step 1: Create a backup

Create a backup of all the databases on the server you want to upgrade. Safety first!

Step 2: Upgrade the MySQL server

Upgrade the MySQL server to v5.5.3+, or ask your server administrator to do it for you.

Step 3: Modify databases, tables, and columns

Change the character set and collation properties of the databases, tables, and columns to use utf8mb4 instead of utf8.

# For each database:
ALTER DATABASE database_name CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
# For each table:
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
# For each column:
ALTER TABLE table_name CHANGE column_name column_name VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
# (Don’t blindly copy-paste this! The exact statement depends on the column type, maximum length, and other properties. The above line is just an example for a `VARCHAR` column.)

Since utf8mb4 is fully backwards compatible with utf8, no mojibake or other forms of data loss should occur. (But you have a backup, right?)

Step 4: Check the maximum length of columns and index keys

This is probably the most tedious part of the whole upgrading process.

When converting from utf8 to utf8mb4, the maximum length of a column or index key is unchanged in terms of bytes. Therefore, it is smaller in terms of characters, because the maximum length of a character is now four bytes instead of three.

For example, a TINYTEXT column can hold up to 255 bytes, which correlates to 85 three-byte or 63 four-byte characters. Let’s say you have a TINYTEXT column that uses utf8 but must be able to contain more than 63 characters. Given this requirement, you can’t convert this column to utf8mb4 unless you also change the data type to a longer type such as TEXT — because if you’d try to fill it with four-byte characters, you’d only be able to enter 63 characters, but not more.

The same goes for index keys. The InnoDB storage engine has a maximum index length of 767 bytes, so for utf8 or utf8mb4 columns, you can index a maximum of 255 or 191 characters, respectively. If you currently have utf8 columns with indexes longer than 191 characters, you will need to index a smaller number of characters when using utf8mb4. (Because of this, I had to change some indexed VARCHAR(255) columns to VARCHAR(191).)

Section 10.1.11 of the MySQL 5.5 Reference Manual has some more information on this.

Step 5: Modify connection, client, and server character sets

In your application code, set the connection character set to utf8mb4. This can be done by simply replacing any variants of SET NAMES utf8 with SET NAMES utf8mb4. If your old SET NAMES statement specified the collation, make sure to change that as well, e.g. SET NAMES utf8 COLLATE utf8_unicode_ci becomes SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci.

Make sure to set the client and server character set as well. I have the following in my MySQL configuration file (/etc/my.cnf):

[client]
default-character-set = utf8mb4

[mysql]
default-character-set = utf8mb4

[mysqld]
character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci

You can easily confirm these settings work correctly:

mysql> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
+--------------------------+--------------------+
| Variable_name | Value |
+--------------------------+--------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| collation_connection | utf8mb4_unicode_ci |
| collation_database | utf8mb4_unicode_ci |
| collation_server | utf8mb4_unicode_ci |
+--------------------------+--------------------+
10 rows in set (0.00 sec)

As you can see, all the relevant options are set to utf8mb4, except for character_set_filesystem which should be binary unless you’re on a file system that supports multi-byte UTF-8-encoded characters in file names, and character_set_system which is always utf8 and can’t be overridden.

Note: The default character set and collation can be configured at some other levels as well.

Step 6: Repair and optimize all tables

After upgrading the MySQL server and making the necessary changes explained above, make sure to repair and optimize all databases and tables. I didn’t do this right away after upgrading (I didn’t think it was necessary, as everything seemed to work fine at first glance), and ran into some weird bugs where UPDATE statements didn’t have any effect, even though no errors were thrown.

You could run the following MySQL queries for each table you want to repair and optimize:

# For each table
REPAIR TABLE table_name;
OPTIMIZE TABLE table_name;

Luckily, this can easily be done in one go using the command-line mysqlcheck utility:

$ mysqlcheck -u root -p --auto-repair --optimize --all-databases

This will prompt for the root user’s password, after which all tables in all databases will be repaired and optimized.

Summary

Never use utf8 in MySQL — always use utf8mb4 instead. Updating your databases and code might take some time, but it’s definitely worth the effort. Why would you arbitrarily limit the set of symbols that can be used in your database? Why would you lose data every time a user enters an astral symbol as part of a comment or message or whatever it is you store in your database? There’s no reason not to strive for full Unicode support everywhere. Do the right thing, and use utf8mb4. 🍻

Comments

David Higgins wrote on :

Nice write-up Mathias! You could also escape the Unicode symbols into some sort of sanitized version, before you add them into the database. E.g. 𝌆𝌆.

Mathias wrote on :

David: That goes against the principles of normalization. And what if you want to use the data in another format that doesn’t support HTML character entities? Better to just store the raw data, and make sure your database supports it.

maxw3st wrote on :

Thank you much for documenting this thoroughly. Looks like updating my existing databases and tables will be a pain, but adding new ones should be a snap once I’ve followed the connection/client/server instructions and updated the config files. In general, great point about increasing the available character set. Thank you.

MeMyselfAndI wrote on :

What is the purpose of character-set-client-handshake=FALSE?

Why do you use init-connect='SET NAMES utf8mb4', which is already the default because of character-set-server=utf8mb4?

Mathias wrote on :

MeMyselfAndI:

While your questions are probably a bit out of scope for this article, I’ll respond to them to the best of my knowledge:

What is the purpose of character-set-client-handshake=FALSE?

character-set-client-handshake=FALSE causes the server to ignore character set information sent by the client (e.g. if the client requests a connection in utf8, it would still use utf8mb4) . By using it you can rest assured that the default server character set will be used at all times.

Why do you use init-connect='SET NAMES utf8mb4', which is already the default because of character-set-server=utf8mb4?

SET NAMES indicates what character set the client will use to send SQL statements to the server, i.e. the connection charset. character-set-server sets the server charset. To use utf8mb4 correctly, you need to make sure the client, the server, and the connection are all set to utf8mb4.

That is my understanding of these settings. If you think this is wrong, I’d appreciate a clarification. Thanks!

MeMyselfAndI wrote on :

Did you test the effect of character-set-client-handshake=FALSE? In my MySQL 5.1.x my clients could still change the character set, no matter how I wrote this option (with -, with _, with skip-… instead of =FALSE). And why should a client not be able to change the character set for its own connection?

character-set-server isn’t the only value you can set (without init-connect) in the my.cnf (yep, character-set-filesystem too, but it doesn’t matter for now). This is also the default value for all other character set settings you do not explicitly change (in a client’s session, for a database, table, column, etc.). In other words, (nearly) all other character_set_* will inherit from this value. So the default value for a client’s connection (character_set_client, character_set_results, and character_set_connection) will always be this value. You don’t need to use init-connect unless you want your client’s connection’s default value to be different than what’s defined in character-set-server.

Mathias wrote on :

MeMyselfAndI: Setting character-set-client-handshake=FALSE (or using skip-character-set-client-handshake) is the only way I could get collation_connection to show up as utf8mb4_unicode_ci instead of utf8mb4_general_ci when performing a SHOW VARIABLES LIKE 'collation%' query. Unless there’s a better way to achieve the same effect, I’m afraid this setting can not be omitted.

Thanks for pointing out the init-connect setting was unnecessary — I’ve removed it now. Please let me know if the example /etc/my.cnf can be optimized further.

MeMyselfAndI wrote on :

Mathias: To set the connection’s collation to _unicode_ instead of _general_ I could do this successfully using SET NAMES … COLLATE …. But you only need to use _unicode_ for ß=ss, otherwise you can use _general_ which is faster as the manual says. http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-sets.html

To use utf8mb4 correctly, you need to make sure the client, the server, and the connection are all set to utf8mb4.

No, not really. Of course the client has to speak UTF-8. For MySQL you only need two things: the setting for your client’s connection and the setting for the string fields. character-set-server is only a default value for creating databases if you don’t set a character set in the CREATE DATABASE statement. The database’s value is only a default value for its tables if you don’t set it in CREATE TABLE. The table’s value is only a default value for the string fields if you don’t say something special for the fields. Only now both the character set and the collation have a real purpose for storing values and comparing strings etc. The second point is the connection’s value. MySQL needs it to decode your input correctly and to encode the results. (A recoding can be skipped if connection and fields use the same character set value.) See the first three sub-chapters of http://dev.mysql.com/doc/refman/5.5/en/charset-syntax.html (the key sentences are at the end of these chapters) and http://dev.mysql.com/doc/refman/5.5/en/charset-connection.html.

My conclusion is: you can leave the default values of the server as-is. Just specify the values for your fields (or defaults for your tables or your databases if you don’t want to set it field by field). And use SET NAMES or mysql(i)_set_charset() in PHP or something similar after each connect to a MySQL server. That’s all you need. Don’t rely on the server’s default values.

Patricio wrote on :

Nice article Mathias, this also took me by surprise some time ago. Who decided to name that other charset utf8? :s

I think it’s much safer not to trust any default charset set in the server configuration, and make yourself sure that your app is setting it correctly, whenever it creates/modifies tables and databases, and opens new connections.

Binyamin wrote on :

@Mathias, I followed your steps to change MySQL 5.5.25 encoding to utf8mb4, but unfortunately SQL returns:

Error #1064 – You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci at line 1.

…when I run the query:

ALTER TABLE table_name CHANGE column_name column_name VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

…and with character-set-client-handshake = FALSE the whole website returns Error 500; without it returns Error 500 for some pages. MySQL log does not show any error.

With utf8 the website works fine. Any suggestions? Seems like latest stable MySQL release is 5.5.25a.

Mathias wrote on :

Binyamin: The 500 error most likely has nothing to do with MySQL itself, but rather with the server-side code that connects to it, or elsewhere in your code. Take a closer look at the pages that result in the error, and try to figure out what they’re doing different than the other pages.

I’m using MySQL 5.5.25a as well, with these exact settings, and everything works flawlessly.

Binyamin wrote on :

Mathias: Got it to work after discovering a Webmin bug with character-set-client-handshake=FALSE and fixing it with nodbi=1.

Strange that server returned error 500 also because some MySQL row type was mediumtext. After I change the type to text, the error disappeared and the utf8mb4 encoding works fine now.

Binyamin wrote on :

Mathias: I am not sure about that, because in some cases a MEDIUMTEXT row will cause Error 500 even when it has a small character value, and using TEXT instead fixes it. Maybe it is related to a MySQL 5.5.25 bug.

ArunMohan wrote on :

Still I am getting java.sql.SQLException: Incorrect string value: '\xF0\x9D\x8C\x86\xF0\x9D...' for column 'title' at row 1… character(𝌆𝌆).

But I can insert those characters through my MySQL client (SqlYog)…

Richard wrote on :

What about performance and memory usage? Presumably this has an impact on both.

Mathias wrote on :

Richard: I don’t know about memory usage, but performance of utf8mb4 is slower than for the 3-byte utf8. I suspect the difference to be negligible though, although I’d love to see some numbers.

Jon T. wrote on :

Are you using MySQL’s utf8 charset in your databases? In this write-up I’ll explain why you should switch to PostgreSQL – the Real Man’s database.

Neil Martin wrote on :

I found many resources out there that claimed to provide solutions to using utf8mb4 with MySQL, but they were all incomplete. I read your write up and applied the suggested configuration changes and everything worked. Great work; thanks!

Deron Meranda wrote on :

If you’re using Python with the popular MySQLpy module for your database connectivity, you will need to apply a small patch to the python source code so it recognizes the new utf8mb4 character set. Until this patch is officially made in the up-stream code, see the small Pastebin patch.

CodeKingPlusPlus wrote on :

What about changing the collation settings for JDBC? I just fixed all my databases and tables. I can run queries fine inside my database but I am getting the following exception from my Java code:

java.sql.SQLException: Illegal mix of collations (utf8mb4_unicode_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '='

I cannot figure out how to change the utf8_general_ci,COERCIBLE to utf8mb4_unicode_ci, IMPLICIT.

My connection string is:

con = DriverManager.getConnection("jdbc:mysql://localhost:3306/[dbName]?useUnicode=true&characterEncoding=utf8mb4&connectionCollation=utf8mb4_unicode_ci", "root", "root");

ArunMohan: Have you solved your Java.sql.Exception? I am having the same problem!

Julia Neystadt wrote on :

Thanks a lot for your article. Unfortunately, I followed all the instructions and I am still having a problem:

java.sql.SQLException: Incorrect string value: '\xF0\x9D\x8C\x86\xF0\x9D...' for column 'data' at row 1… character(𝌆𝌆).

This is the same problem ArunMohan reported earlier.

Any ideas?

Julia Neystadt wrote on :

Mathias: Thanks a lot for your fast reply. Unfortunately, this does not help me. My MySQL connector is 5.1.18.

But I paid attention that my DB version is 5.5.19-log. On the other hand, all the utf8mb4 definitions are set correctly:

mysql> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
+--------------------------+--------------------+
| Variable_name | Value |
+--------------------------+--------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| collation_connection | utf8mb4_unicode_ci |
| collation_database | utf8mb4_unicode_ci |
| collation_server | utf8mb4_unicode_ci |
+--------------------------+--------------------+

Your help is greatly appreciated.

Julia

Michael wrote on :

Thanks for your article, it did help me a lot.

Here are my thoughts and some additional information that may help ArunMohan and Julia: http://info.michael-simons.eu/2013/01/21/java-mysql-and-multi-byte-utf-8-support/

CodeKingPlusPlus: You must have character-set-server = utf8mb4 in my.cnf and useUnicode=true in your JDBC connection. Depending on the version of your MySQL Connector/J, you need to have characterEncoding=UTF-8 as well or must not have it, see my article.

Itay wrote on :

Thanks a lot for the great write-up. I did all the steps but I still see question marks where 4-byte chars should display. What am I missing?

Jamie wrote on :

I found that phpMyAdmin didn’t cope with utf8mb4. Existing 4-byte characters display as question marks and trying to insert a 4-byte character seems to insert four actual question marks into the database. Turning on the mysqld log and looking and what phpMyAdmin was doing, I noticed it was setting the character set to utf8 each time. It’s hard-coded in line 1303 of libraries/database_interface.lib.php:

PMA_DBI_query("SET CHARACTER SET 'utf8';", $link, PMA_DBI_QUERY_STORE);

Changing to it to utf8mb4 solves the problem, although being a hack you’ll need to remember to make this change each time you upgrade phpMyAdmin.

Also, if you want to be sure of correct sorting of results you might want to set the Server connection collation on phpMyAdmin’s front page to utf8mb4_unicode_ci.

Paul wrote on :

At the top of your aticle, you really should mention the version of MySQL this applies to because the version in Debian Squeeze is 5.1 and that’s too old to have utf8mb4 apparently. Spent ages writing a program to update my MySQL tables etc. and then found that I couldn’t without the update of the package. Your article is good stuff though. :-)

cmr wrote on :

Hi guys, I tried the above. I’ve restarted the MySQL server but the output still looks like this:

mysql> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
+--------------------------+-------------------+
| Variable_name | Value |
+--------------------------+-------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| collation_connection | utf8_general_ci |
| collation_database | latin1_swedish_ci |
| collation_server | latin1_swedish_ci |
+--------------------------+-------------------+
10 rows in set (0.00 sec)

This is what my /usr/local/mysql/my.cnf looks like:

[client]
default-character-set = utf8mb4

[mysqld]
default-character-set = utf8mb4

[mysqld]
character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci

sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES

I’ve looked at the default server settings and there is no mention of the above variables for my.cnf file.

I can’t save 𝌆 but I can save UTF-8 characters like , , and .

Mathias wrote on :

cmr: Are you sure the my.cnf file you’re editing is the one that gets loaded? Try the following command:

$ mysql --help | grep -A 1 "Default options are read from the following files"

On my machine, this prints:

Default options are read from the following files in the given order:

/etc/my.cnf /etc/mysql/my.cnf /usr/local/etc/my.cnf ~/.my.cnf`

Make sure the configuration file you’re saving to is listed. Also, check all the other files that get loaded and make sure they’re not overriding your settings.

cmr wrote on :

Thanks for the reply Mathias and fixing up my comment :)

I moved my.cnf to /usr/local/etc/ and it’s working now. Thanks heaps!

Update: I can’t seem to add the character to the database:

ERROR 1366: Incorrect string value: '\xF0\x9D\x8C\x86' for column 'text' at row 1

SQL Statement:

INSERT INTO `test`.`new_table` (`ID`, `text`) VALUES ('1', '𝌆')

I tried running this command

use test;
ALTER TABLE new_table CHANGE text text VARCHAR(191) CHARSET utf8mb4 COLLATE utf8mb4_unicode_ci;

The output says 0 rows effected.

 0 row(s) affected Records: 0  Duplicates: 0  Warnings: 0

I’m guessing that utf8mb4 is not being applied to my row.

But when I ran the following:

ALTER DATABASE test CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;

1 row(s) affected

What else could be happening?

Update: Found it! I was missing:

SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci

Thanks for the great post.

Mathias wrote on :

Binyamin: I think only the command-line option was removed, not the variable. A quick test confirms this: removing these variables from the config file impacts the results.

With these settings:

[client]
default-character-set = utf8mb4

[mysql]
default-character-set = utf8mb4

[mysqld]
character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci

…I get these results:

mysql> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
+--------------------------+--------------------+
| Variable_name | Value |
+--------------------------+--------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| collation_connection | utf8mb4_general_ci |
| collation_database | utf8_general_ci |
| collation_server | utf8_general_ci |
+--------------------------+--------------------+
10 rows in set (0.00 sec)

Without the [client] and [mysql] sections, the results are different:

mysql> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
+--------------------------+-----------------+
| Variable_name | Value |
+--------------------------+-----------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| collation_connection | utf8_general_ci |
| collation_database | utf8_general_ci |
| collation_server | utf8_general_ci |
+--------------------------+-----------------+
10 rows in set (0.00 sec)

Mihai wrote on :

Very good writeup. We knew about this concoction called utf8, but you provide the mechanics to properly addressing this issue.

Steve Hibbert wrote on :

About the only place I could find that had the correct way to set collation_connection in 5.6.15! Thanks very much.

HS wrote on :

[client]
default_character_set = utf8mb4

[mysql]
default_character_set = utf8mb4

[mysqld]
character_set_client_handshake = FALSE
init_connect = 'SET collation_connection = utf8mb4_unicode_ci,NAMES utf8mb4'
collation_server = utf8mb4_unicode_ci
collation_connection = utf8mb4_unicode_ci
collation_database = utf8mb4_unicode_ci
character_set_system = utf8
character_set_server = utf8mb4
character_set_client = utf8mb4
character_set_connection = utf8mb4
character_set_database = utf8mb4
character_set_results = utf8mb4

That’s what I have in my.cnf under /etc/. Am I missing something or doing something wrong? I’m running MySQL 5.6.16-log.

HS wrote on :

After hitting my head around a lot, these are the best settings I could figure out to run everything with utf8mb4:

[client]
default-character-set = utf8mb4

[mysql]
default-character-set = utf8mb4

[mysqld]
explicit_defaults_for_timestamp = 1 # Posting it here as a tip to disable the Timestamp message, maybe it can help someone :)
character-set-client-handshake = 0
init_connect = 'SET character_set_system = utf8mb4'
init_connect = 'SET character_set_connection = utf8mb4'
init_connect = 'SET character_set_database = utf8mb4'
init_connect = 'SET character_set_results = utf8mb4'
init_connect = 'SET collation_database = utf8mb4_unicode_ci'
init_connect = 'SET collation_connection = utf8mb4_unicode_ci'
init_connect = 'SET NAMES utf8mb4'
character-set-server = utf8mb4
#character_set_client = utf8mb4
collation-server = utf8mb4_unicode_ci

When running:

SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';

I get this:

+--------------------------+--------------------+
| Variable_name | Value |
+--------------------------+--------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| collation_connection | utf8mb4_unicode_ci |
| collation_database | utf8mb4_unicode_ci |
| collation_server | utf8mb4_unicode_ci |
+--------------------------+--------------------+
10 rows in set (0.00 sec)

My questions are:

  • How can I also set the system to use utf8mb4? I'm using CentOS 6.5 and WHM. Or it would be a bad idea?
  • Is there something else I should do to completely force the use of utf8mb4?
  • Is there anything wrong with the code above on my.cnf?

Thanks.

P.S. With the above configuration in my.cnf, I don’t get ANY sort of warnings or errors anymore when checking the MySQL error messages. Another note is that I’m running the latest MySQL from WHM: 5.6.16-log MySQL Community Server (GPL).

Leave a comment

Comment on “How to support full Unicode in MySQL databases”

Some Markdown is allowed; HTML isn’t. Keyboard shortcuts are available.

It’s possible to add emphasis to text:

_Emphasize_ some terms. Perhaps you’d rather use **strong emphasis** instead?

Select some text and press + I on Mac or Ctrl + I on Windows to make it italic. For bold text, use + B or Ctrl + B.

To create links:

Here’s an inline link to [Google](http://www.google.com/).

If the link itself is not descriptive enough to tell users where they’re going, you might want to create a link with a title attribute, which will show up on hover:

Here’s a [poorly-named link](http://www.google.com/ "Google").

Use backticks (`) to create an inline <code> span:

In HTML, the `p` element represents a paragraph.

Select some inline text and press + K on Mac or Ctrl + K on Windows to make it a <code> span.

Indent four spaces to create an escaped <pre><code> block:

    printf("goodbye world!"); /* his suicide note
was in C */

Alternatively, you could use triple backtick syntax:

```
printf("goodbye world!"); /* his suicide note
was in C */
```

Select a block of text (more than one line) and press + K on Mac or Ctrl + K on Windows to make it a preformatted <code> block.

Quoting text can be done as follows:

> Lorem iPad dolor sit amet, consectetur Apple adipisicing elit,
> sed do eiusmod incididunt ut labore et dolore magna aliqua Shenzhen.
> Ut enim ad minim veniam, quis nostrud no multi-tasking ullamco laboris
> nisi ut aliquip iPad ex ea commodo consequat.

Select a block of text and press + E on Mac or Ctrl + E on Windows to make it a <blockquote>.