Basic linguistics in MySQL

Raj Samuel
4 min readApr 25, 2021

Basic rules for data values

Character set

Database character set is the combination of symbols (alphabets, numerals, punctuation etc.) and their encoded numeric values (as defined by an encoding scheme like ASCII or Unicode).

Default character set for user data in MySQL is ‘latin1’ which includes English and some European languages.

Collation

Collation implies the rules to compare and manipulate symbols in character set of data (alphabets, numerals etc.). Collation doesn’t apply to symbols in identifiers (table name, column name etc.) or keywords (select, from, where etc.)

Each character set in MySQL has a definitive list of collations that it supports. The default collation (that the default charset ‘latin1’ supports) is latin1_swedish_ci. The simple reason for not having something like uk_english or us_english as default is here.

However the database catalog views in information_schema is defaulted to utf8.

What does this mean for a database user?

MySQL, as a software system, inherently doesn’t care about character sets or collation. It stores all data in bits in binary format. Character sets and collation matter only when you query the database — that is when MySQL is asked to either present the data (as in a select clause) or analyze the data (as in a like operator in where clause). And even then it only applies to textual columns, not to numeric columns like integer.

How’s this set?

Character sets (and hence collation) can be defined on the whole MySQL server, or it can be overridden on a database, which can be further overridden on a table, and finally be overridden again on a column of the table. They can be specified in the DDL or control statements used to create these objects (create, alter). That’s not it — it can be overridden again for a client making a connection to MySQL. In this case when client sees data, regardless of settings on the database/table/column or server, the client sees characters as defined by this variable:
mysql> SHOW VARIABLES LIKE ‘character_set_client’;

Character set differences between applications and database

When converting from one character set to another, let’s say from the default ‘latin1’ character set to ‘utf8’ character set, direct conversion will work for characters in the traditional ASCII character set. So English characters will work fine. But direct conversion may result in unexpected symbols for non-English characters, such as French accents (diacritics). To convert such characters, convert the column into binary first.

Let’s say you have a column in latin1. To convert it to utf8 for display in your application do this:
select convert(cast(french_text_column as binary) using utf8) from mydb.my_table;

If most of your applications expect utf8 characters (which is very common), it is a good idea to use that on your database instead of the default one. Because this is not only a “display” issue, it could give wrong results when you look up data. For instance let’s say your app uses utf8 and the table it’s querying has the default latin1.

If you issue a where clause like this from the app:
where french_text_column = ‘résumé’
the database might return values that do not match the above string erroneously because the app encoded the literal with diacritics in utf8 and the values in database are stored in latin. A workaround is to use binary keyword but it skips indexes if there are any on that column.

utf8 represents the 128 ASCII symbols (all that you see on a normal QWERTY keyboard and more) using 1 byte for backward compatibility with ASCII and for compaction, and most symbols above the ASCII character set using 2 bytes which can include European languages like French, and for symbols that need more encoding bits (like Chinese) it can go up to 4 bytes. latin1 on the other hand uses 1 byte for everything which means it can represent a total of 256 symbols (which will include all symbols in English and European languages).

Basic rules for metadata

Identifier character limits

Object names (like table names) are all 64 characters which is long enough compared to Oracle’s 30-character limit (which was lifted in recent releases of Oracle).

Aliases allow 256 characters but not allowed if you create an object with those long aliases (create view for example) — it should be 64.

Table names that are close to 64 character limit will return an error upon creation if there are unnamed constraints defined on tables. This is because system generated constraint names add a string to the table name which might exceed 64 character limit.

Identifiers (like table names) are stored in mysql system catalog so they are stored in the default utf8 character set.

Identifier case sensitivity

Case sensitivity of table names and database names depend on the OS that MySQL server is running on.

  • Windows is case insensitive so select * from hr.employee is same as select * from HR.EMPLOYEE.
  • Unix is case sensitive so the two queries above look for two different databases (hr and HR) and table (employee and EMPLOYEE).

To alleviate porting issues, assume MySQL is case sensitive and name your tables and databases with intentional casing.

Column names, indexes, constraints etc. are not case sensitive. So select * from hr.employee where last_name = ‘john’ is the same as select * from hr.employee where LAST_NAME = ‘john’ on all platforms. As long as the default collation is retained (which is latin1_swedish_ci where ci stands for case insensitive), ‘JOHN’ and ‘john’ (or something like ‘JohN’) will return the same result.

--

--

Raj Samuel

I write because I forget. (PS: if you take what I wrote and post it as your own please try not to edit it and post rubbish. CTRL+C, CTRL+V is your friend.)