June 16, 2017
by Matt Raines

MySQL and the case of the missing EU flag

A customer recently wrote to us to tell us that they wanted to add an EU flag character (??) to an email they sent out from on of our websites but doing so resulted an error message.

SQLSTATE[HY000]: General error: 1366 Incorrect string value: ‘\xF0\x9F\x87\xAA\xF0\x9F…’ for column ‘title’ at row 1

We’ve been using the MySQL database engine and PHP for a pretty long time now. When we started building websites, mainly for Liberal Democrats, character sets were still a thing, and we stored the data in ISO-8859-1, or sometimes ISO-8859-15. But the problem of character sets has now been solved, with websites the world over settling on UTF-8 as the de facto standard encoding.

It was a delightful day in or around 2010 when we converted all our website data to UTF-8, and no longer needed to worry every time anybody typed “Lembit Öpik” into one of our databases (yes, this is a thing which happened a lot). For us it was a double-whammy when Öpik went on to lose his parliamentary seat later the same year. We followed best practice at the time and set the character set in MySQL to “utf8”.

For the 7 years since our clients have ticked along happily managing their websites, only occasionally perhaps pausing to notice that entering non-Western characters is easier than it used to be. But it turns out there’s a gotcha with MySQL’s “utf8” character set — it isn’t UTF-8! It uses a maximum of 3 bytes, not 4, to encode characters and is therefore only able to store the 66,536 characters in the Basic Multilingual Plane.

This must have seemed a decent compromise to someone at MySQL at the time between storage space and character support, but there are nearly 60,000 characters not included in the BMP, among them mathematical symbols, musical notation, emoji, REGIONAL FLAGS, and a person on a mountain bike doing a wheelie. In version 5.5, MySQL quietly added support for the actual UTF-8, which they called “utf8mb4”, while keeping their increasing poorly-named proprietary “utf8” character set.

There might not be a very long window in which our clients need to use the EU flag on their website, but you never know, so it felt like time to convert the databases to do things properly. This can take a long time, and there are some gotchas relating to maximum index size and column length, but here’s a quick snippet of code to convert your whole database from utf8 to utf8mb4. You’re very welcome.

$pdo = new PDO("mysql:host=localhost;dbname=$your_database_name",
               $your_database_username, $your_database_password,
               [PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES utf8mb4"]);
$pdo->exec("ALTER DATABASE $your_database_name DEFAULT CHARACTER SET utf8mb4");
// Might be required if some of your table identifiers are CHAR or VARCHAR
// and are used in foreign key definitions.
$pdo->exec("SET FOREIGN_KEY_CHECKS=0");
$tables = [];
foreach ($pdo->query("SHOW TABLES") as $table) {
    echo "$table[0]\n";
    $pdo->exec("ALTER TABLE $table[0] CONVERT TO CHARACTER SET utf8mb4");
}
$pdo->exec("SET FOREIGN_KEY_CHECKS=1");