MySQL doesn't have a way to index JSON documents directly, but it has given us an
alternative: generated columns.
One thing that has been missing since MySQL added the JSON data type in version 5.7.8, is the ability to index JSON values, at least directly. We can though use generated columns to achieve something similar. Generated columns, introduced in MySQL 5.7.5, allow developers to create columns that hold information generated from other columns, predefined expressions, or calculations. By generating columns from values within a JSON document and then indexing that column, we can practically index a JSON field. Let's show you how to do that ...
In this article, we'll be using a player and games JSON dataset that can be downloaded here. The dataset contains a list of players with the following elements: a player ID, name, and games played: Battlefield, _Crazy Tennis, and Puzzler.
{
"id": 1,
"name": "Sally",
"games_played":{
"Battlefield": {
"weapon": "sniper rifle",
"rank": "Sergeant V",
"level": 20
},
"Crazy Tennis": {
"won": 4,
"lost": 1
},
"Puzzler": {
"time": 7
}
}
}
...
Battlefield contains the player's favorite weapon, their current rank, and the level of that rank, while Crazy Tennis includes the number of games won and lost, and Puzzlercontains the time it took a player to solve the game. Let's start with our initial table creation:
CREATE TABLE `players` (
`id` INT UNSIGNED NOT NULL,
`player_and_games` JSON NOT NULL,
PRIMARY KEY (`id`)
);
This creates a table called
players
that includes an ID, the JSON document, and sets the primary key to the ID we've provided. What we want to do is index that JSON name value. Let's look at what we need to add to the CREATE TABLE
command.Generating a column
When you want to create generated columns, you use this syntax within a
CREATE TABLE
statement to set them up:`column_name` datatype GENERATED ALWAYS AS (expression)
The key here are the words
GENERATED ALWAYS
and AS
. The phrase GENERATED ALWAYS
is actually optional; it's only needed if you want to explicitly state that the table column is going to be a generated column. What is necessary is the word AS
followed by an expression that will return a value for what you want in the generated column. Let's start there:`names_virtual` VARCHAR(20) GENERATED ALWAYS AS ...
We're making a column called
names_virtual
which is up to 20 characters long and will contain the value of the "name" property from the JSON dataset. We'll access the "name" using a JSON path using MySQL's ->>
operator, which is equivalent to writing JSON_UNQUOTE(JSON_EXTRACT(...))
that will return the "name" as an unquotedresult from the JSON document. We've talked about some of these JSON functions here.`names_virtual` VARCHAR(20) GENERATED ALWAYS AS (`player_and_games` ->> '$.name')
That means we're going to take the JSON field
player_and_games
and extract the key name
which is a child of the root.
As with most column definitions, there's a number of constraints and options you can apply to a column.
[VIRTUAL|STORED] [UNIQUE [KEY]] [[NOT] NULL] [[PRIMARY] KEY]
Unique to generated columns, the keywords
VIRTUAL
and STORED
indicate whether the values will not or will be stored in the table. The keyword VIRTUAL
is used by default, which means that the column's values are not stored so they don't take up storage space. They are evaluated every time the row is read. If you create an index with a virtual column, the value does get stored - in the index itself. The STORED
keyword, on the other hand, indicates that values are calculated as the data is written to a table, which means values are calculated when documents are inserted or updated. In this case, the index doesn't need to store the value and behaves more traditionally.
The last three parameters enforce whether the values can be
NOT NULL
or NULL
, and add index constraints such as UNIQUE
or PRIMARY KEY
. We probably should always use NOT NULL
when creating a column to ensure that values exist, but using index constraints depend on your use case.
The other options are optional constraints to enforce whether the values can be
NOT NULL
or NULL
, and add index constraints such as UNIQUE
or PRIMARY KEY
. If you are relying on a field existing, you should use NOT NULL
when creating a column to ensure that values exist. The constraints really depend on your use case. We'll use NOT NULL
as we expect players to have a name, just not a unique one.
Now let's look at our
CREATE TABLE
statement:CREATE TABLE `players` (
`id` INT UNSIGNED NOT NULL,
`player_and_games` JSON NOT NULL,
`names_virtual` VARCHAR(20) GENERATED ALWAYS AS (`player_and_games` ->> '$.name') NOT NULL,
PRIMARY KEY (`id`)
);
If we use this to create the table, we can then insert some of the JSON documents. You can find the SQL for this in the compose examples respository. In this dataset, we've inserted the
id
for each player and then the JSON document like:INSERT INTO `players` (`id`, `player_and_games`) VALUES (1, '{
"id": 1,
"name": "Sally",
"games_played":{
"Battlefield": {
"weapon": "sniper rifle",
"rank": "Sergeant V",
"level": 20
},
"Crazy Tennis": {
"won": 4,
"lost": 1
},
"Puzzler": {
"time": 7
}
}
}'
);
...
Once we've run the code, and the data has been inserted into the
players
table, we can do a SELECT
query giving us the following:SELECT * FROM `players`;
+----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+
| id | player_and_games | names_virtual |
+----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+
| 1 | {"id": 1, "name": "Sally", "games_played": {"Puzzler": {"time": 7}, "Battlefield": {"rank": "Sergeant V", "level": 20, "weapon": "sniper rifle"}, "Crazy Tennis": {"won": 4, "lost": 1}}} | Sally |
| 2 | {"id": 2, "name": "Thom", "games_played": {"Puzzler": {"time": 25}, "Battlefield": {"rank": "Major General VIII", "level": 127, "weapon": "carbine"}, "Crazy Tennis": {"won": 10, "lost": 30}}} | Thom |
| 3 | {"id": 3, "name": "Ali", "games_played": {"Puzzler": {"time": 12}, "Battlefield": {"rank": "First Sergeant II", "level": 37, "weapon": "machine gun"}, "Crazy Tennis": {"won": 30, "lost": 21}}} | Ali |
| 4 | {"id": 4, "name": "Alfred", "games_played": {"Puzzler": {"time": 10}, "Battlefield": {"rank": "Chief Warrant Officer Five III", "level": 73, "weapon": "pistol"}, "Crazy Tennis": {"won": 47, "lost": 2}}} | Alfred |
| 5 | {"id": 5, "name": "Phil", "games_played": {"Puzzler": {"time": 7}, "Battlefield": {"rank": "Lt. Colonel III", "level": 98, "weapon": "assault rifle"}, "Crazy Tennis": {"won": 130, "lost": 75}}} | Phil |
| 6 | {"id": 6, "name": "Henry", "games_played": {"Puzzler": {"time": 17}, "Battlefield": {"rank": "Captain II", "level": 87, "weapon": "assault rifle"}, "Crazy Tennis": {"won": 68, "lost": 149}}} | Henry |
+----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+
As we can see, the table includes the
names_virtual
column with all the player's names inserted. Let's do a quick check to show how the columns have been set up by MySQL:SHOW COLUMNS FROM `players`;
+------------------+------------------+------+-----+---------+-------------------+
| Field | Type | Null | Key | Default | Extra |
+------------------+------------------+------+-----+---------+-------------------+
| id | int(10) unsigned | NO | PRI | NULL | |
| player_and_games | json | NO | | NULL | |
| names_virtual | varchar(20) | NO | | NULL | VIRTUAL GENERATED |
+------------------+------------------+------+-----+---------+-------------------+
Since we haven't indicated whether the generated column is
VIRTUAL
or STORED
, by default MySQL automatically set up a VIRTUAL
column for us. If you don't know whether your columns are VIRTUAL
or STORED
, just run the above SHOW COLUMNS
query and it will either show VIRTUAL GENERATED
or STORED GENERATED
.
Now that we set up the table and our first
VIRTUAL
column, let's add four more columns using the ALTER TABLE
and ADD COLUMN
operations. These will hold the Battlefield levels, tennis games won, tennis games lost, and the Puzzler times.ALTER TABLE `players` ADD COLUMN `battlefield_level_virtual` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.games_played.Battlefield.level') NOT NULL AFTER `names_virtual`;
ALTER TABLE `players` ADD COLUMN `tennis_won_virtual` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.games_played."Crazy Tennis".won') NOT NULL AFTER `battlefield_level_virtual`;
ALTER TABLE `players` ADD COLUMN `tennis_lost_virtual` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.games_played."Crazy Tennis".lost') NOT NULL AFTER `tennis_won_virtual`;
ALTER TABLE `players` ADD COLUMN `times_virtual` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.games_played.Puzzler.time') NOT NULL AFTER `tennis_lost_virtual`;
Again, running the query
SHOW COLUMNS FROM players;
, we see that all of the columns have VIRTUAL GENERATED
next to them, meaning that we've successfully set up new VIRTUAL
generated columns.+---------------------------+------------------+------+-----+---------+-------------------+
| Field | Type | Null | Key | Default | Extra |
+---------------------------+------------------+------+-----+---------+-------------------+
| id | int(10) unsigned | NO | PRI | NULL | |
| player_and_games | json | NO | | NULL | |
| names_virtual | varchar(20) | NO | | NULL | VIRTUAL GENERATED |
| battlefield_level_virtual | int(11) | NO | | NULL | VIRTUAL GENERATED |
| tennis_won_virtual | int(11) | NO | | NULL | VIRTUAL GENERATED |
| tennis_lost_virtual | int(11) | NO | | NULL | VIRTUAL GENERATED |
| times_virtual | int(11) | NO | | NULL | VIRTUAL GENERATED |
+---------------------------+------------------+------+-----+---------+-------------------+
Running the
SELECT
query shows us all the values from the VIRTUAL COLUMNS
, which should look like:SELECT `names_virtual`, `battlefield_level_virtual`, `tennis_won_virtual`, `tennis_lost_virtual`, `times_virtual` FROM `players`;
+---------------+---------------------------+--------------------+---------------------+---------------+
| names_virtual | battlefield_level_virtual | tennis_won_virtual | tennis_lost_virtual | times_virtual |
+---------------+---------------------------+--------------------+---------------------+---------------+
| Sally | 20 | 4 | 1 | 7 |
| Thom | 127 | 10 | 30 | 25 |
| Ali | 37 | 30 | 21 | 12 |
| Alfred | 73 | 47 | 2 | 10 |
| Phil | 98 | 130 | 75 | 7 |
| Henry | 87 | 68 | 149 | 17 |
+---------------+---------------------------+--------------------+---------------------+---------------+
Now that the data has been inserted and the generated columns set up, we can create indexes on each column to optimize our searches ...
Indexing generated columns
When putting secondary indexes on
VIRTUAL
generated column values, the values are materialized and stored in the index. This gives us the benefit of not increasing the table size and being able to take advantage of MySQL indexing.
Let's do a simple query on a generated column to see what it looks like before we index it. Examining the query plan when selecting
names_virtual
and the name "Sally", we'd get the following:EXPLAIN SELECT * FROM `players` WHERE `names_virtual` = "Sally"\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: players
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 6
filtered: 16.67
Extra: Using where
For this query, MySQL has to look at every row to find "Sally". However, we get an entirely different result once we put an index on the column like:
CREATE INDEX `names_idx` ON `players`(`names_virtual`);
Now, running the same query, we get:
EXPLAIN SELECT * FROM `players` WHERE `names_virtual` = "Sally"\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: players
partitions: NULL
type: ref
possible_keys: names_idx
key: names_idx
key_len: 22
ref: const
rows: 1
filtered: 100.00
Extra: NULL
As we can see, our index on the column sped up our query by only looking at one row instead of six using the index
names_idx
instead of all of the rows. Let's create indexes on the rest of our virtual columns following the same syntax as names_idx
like:CREATE INDEX `times_idx` ON `players`(`times_virtual`);
CREATE INDEX `won_idx` ON `players`(`tennis_won_virtual`);
CREATE INDEX `lost_idx` ON `players`(`tennis_lost_virtual`);
CREATE INDEX `level_idx` ON `players`(`battlefield_level_virtual`);
We can check to see if all of our columns have now been indexed by running:
SHOW INDEX ON `players`;
+---------+------------+-----------+--------------+---------------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------+------------+-----------+--------------+---------------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| players | 0 | PRIMARY | 1 | id | A | 6 | NULL | NULL | | BTREE | | |
| players | 1 | names_idx | 1 | names_virtual | A | 6 | NULL | NULL | | BTREE | | |
| players | 1 | times_idx | 1 | times_virtual | A | 5 | NULL | NULL | | BTREE | | |
| players | 1 | won_idx | 1 | tennis_won_virtual | A | 6 | NULL | NULL | | BTREE | | |
| players | 1 | lost_idx | 1 | tennis_lost_virtual | A | 6 | NULL | NULL | | BTREE | | |
| players | 1 | level_idx | 1 | battlefield_level_virtual | A | 6 | NULL | NULL | | BTREE | | |
+---------+------------+-----------+--------------+---------------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
Now that we created multiple indexes on our generated columns, let's make a more complex search to see how generated columns and indexes work. For this example, we'll get the ids, names, tennis games won, Battlefield level, and Puzzler time for players who have a level above 50 and who also have won 50 tennis games. All of the results will be ordered by ascending order according to the Puzzler time. The SQL command and results will look like:
SELECT `id`, `names_virtual`, `tennis_won_virtual`, `battlefield_level_virtual`, `times_virtual` FROM `players` WHERE (`battlefield_level_virtual` > 50 AND `tennis_won_virtual` > 50) ORDER BY `times_virtual` ASC;
+----+---------------+--------------------+---------------------------+---------------+
| id | names_virtual | tennis_won_virtual | battlefield_level_virtual | times_virtual |
+----+---------------+--------------------+---------------------------+---------------+
| 5 | Phil | 130 | 98 | 7 |
| 6 | Henry | 68 | 87 | 17 |
+----+---------------+--------------------+---------------------------+---------------+
Let's look at how MySQL planned out this query:
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: players
partitions: NULL
type: range
possible_keys: won_idx,level_idx
key: won_idx
key_len: 4
ref: NULL
rows: 2
filtered: 66.67
Extra: Using where; Using filesort
When using the indexes on
won_idx
and level_idx
it only had to access two columns to return the result we wanted. As you can see, if the query had to do a complete table scan on millions of documents, it would probably have taken a very long time. However, with the power of generated columns and indexing those columns, MySQL has provided a very fast and convenient way to search elements within JSON documents.
One question that remains, nonetheless, is what do we do with
STORED
generated columns? How do we use them, and how do they work?Storing values in generated columns
Using the
STORED
keyword when setting up a generated column is generally not preferred since you're basically storing values twice in a table: once in the JSON document and again in the STORED
column. However, there are three scenarios when MySQL suggests you use a STORED
generated column: 1) indexing primary keys, 2) you need a fulltext/R-tree index, or 3) you have a column that is scanned a lot.
The syntax for adding a
STORED
generated column is the same as creating VIRTUAL
generated columns, except we append the keyword STORED
after the expression like:`id` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.id') STORED NOT NULL,
To show you how to use
STORED
, let's create another table that will take the IDs from our JSON documents and store them in a STORED
generated column. Then we'll set the PRIMARY KEY
to the id
column. At the same time, we'll create all of our VIRTUAL
columns and set up indexes for those columns. This is done entirely within the initial table creation:CREATE TABLE `players_two` (
`id` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.id') STORED NOT NULL,
`player_and_games` JSON NOT NULL,
`names_virtual` VARCHAR(20) GENERATED ALWAYS AS (`player_and_games` ->> '$.name') NOT NULL,
`times_virtual` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.games_played.Puzzler.time') NOT NULL,
`tennis_won_virtual` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.games_played."Crazy Tennis".won') NOT NULL,
`tennis_lost_virtual` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.games_played."Crazy Tennis".lost') NOT NULL,
`battlefield_level_virtual` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.games_played.Battlefield.level') NOT NULL,
PRIMARY KEY (`id`),
INDEX `times_index` (`times_virtual`),
INDEX `names_index` (`names_virtual`),
INDEX `won_index` (`tennis_won_virtual`),
INDEX `lost_index` (`tennis_lost_virtual`),
INDEX `level_index` (`battlefield_level_virtual`)
);
Next, we'll insert the same dataset into
players_two
except we'll remove the ID that we previously added to the INSERT
operation:INSERT INTO `players_two` (`player_and_games`) VALUES ('{
"id": 1,
"name": "Sally",
"games_played":{
...
);
After our data has been inserted into the table, we can run
SHOW COLUMNS
on the new table to see how MySQL has generated our columns. Notice that the id
field now has STORED GENERATED
beside it with a PRIMARY KEY
index.SHOW COLUMNS FROM `players_two`;
+---------------------------+-------------+------+-----+---------+-------------------+
| Field | Type | Null | Key | Default | Extra |
+---------------------------+-------------+------+-----+---------+-------------------+
| id | int(11) | NO | PRI | NULL | STORED GENERATED |
| player_and_games | json | NO | | NULL | |
| names_virtual | varchar(20) | NO | MUL | NULL | VIRTUAL GENERATED |
| times_virtual | int(11) | NO | MUL | NULL | VIRTUAL GENERATED |
| tennis_won_virtual | int(11) | NO | MUL | NULL | VIRTUAL GENERATED |
| tennis_lost_virtual | int(11) | NO | MUL | NULL | VIRTUAL GENERATED |
| battlefield_level_virtual | int(11) | NO | MUL | NULL | VIRTUAL GENERATED |
+---------------------------+-------------+------+-----+---------+-------------------+
One thing to note about creating a
PRIMARY KEY
with generated columns is that MySQL will not allow you to create primary keys on VIRTUAL
generated columns. In fact, if we left out STORED
on the id
field, MySQL will throw the following error:ERROR 3106 (HY000): 'Defining a virtual generated column as primary key' is not supported for generated columns.
At the same time, if you leave off the primary key index and attempt to insert a data, MySQL will throw an error stating:
ERROR 3098 (HY000): The table does not comply with the requirements by an external plugin.
This means that you don't have a primary key on your table. Therefore, you must go back and either drop and recreate your table, or drop the
id
column and add the column as a STORED
generated column with a PRIMARY KEY
like:ALTER TABLE `players_two` ADD COLUMN `id` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.id') STORED PRIMARY KEY;
0 comments:
Post a Comment