Monday, 16 July 2018

MySQL for JSON: Generated Columns and Indexing

MySQL doesn't have a way to index JSON documents directly, but it has given us an 
alternative: generated columns.
One thing that has been missing since MySQL added the JSON data type in version 5.7.8, is the ability to index JSON values, at least directly. We can though use generated columns to achieve something similar. Generated columns, introduced in MySQL 5.7.5, allow developers to create columns that hold information generated from other columns, predefined expressions, or calculations. By generating columns from values within a JSON document and then indexing that column, we can practically index a JSON field. Let's show you how to do that ...
In this article, we'll be using a player and games JSON dataset that can be downloaded here. The dataset contains a list of players with the following elements: a player ID, name, and games played: Battlefield, _Crazy Tennis, and Puzzler.
{
    "id": 1,  
    "name": "Sally",  
    "games_played":{    
       "Battlefield": {
          "weapon": "sniper rifle",
          "rank": "Sergeant V",
          "level": 20
        },                                                                                                                          
       "Crazy Tennis": {
          "won": 4,
          "lost": 1
        },  
       "Puzzler": {
          "time": 7
        }
     }
 }
...
Battlefield contains the player's favorite weapon, their current rank, and the level of that rank, while Crazy Tennis includes the number of games won and lost, and Puzzlercontains the time it took a player to solve the game. Let's start with our initial table creation:
CREATE TABLE `players` (  
    `id` INT UNSIGNED NOT NULL,
    `player_and_games` JSON NOT NULL,
    PRIMARY KEY (`id`)
);
This creates a table called players that includes an ID, the JSON document, and sets the primary key to the ID we've provided. What we want to do is index that JSON name value. Let's look at what we need to add to the CREATE TABLE command.

Generating a column

When you want to create generated columns, you use this syntax within a CREATE TABLE statement to set them up:
`column_name` datatype GENERATED ALWAYS AS (expression)
The key here are the words GENERATED ALWAYS and AS. The phrase GENERATED ALWAYS is actually optional; it's only needed if you want to explicitly state that the table column is going to be a generated column. What is necessary is the word AS followed by an expression that will return a value for what you want in the generated column. Let's start there:
`names_virtual` VARCHAR(20) GENERATED ALWAYS AS ...
We're making a column called names_virtual which is up to 20 characters long and will contain the value of the "name" property from the JSON dataset. We'll access the "name" using a JSON path using MySQL's ->> operator, which is equivalent to writing JSON_UNQUOTE(JSON_EXTRACT(...)) that will return the "name" as an unquotedresult from the JSON document. We've talked about some of these JSON functions here.
`names_virtual` VARCHAR(20) GENERATED ALWAYS AS (`player_and_games` ->> '$.name')
That means we're going to take the JSON field player_and_games and extract the key name which is a child of the root.
As with most column definitions, there's a number of constraints and options you can apply to a column.
[VIRTUAL|STORED] [UNIQUE [KEY]] [[NOT] NULL] [[PRIMARY] KEY]
Unique to generated columns, the keywords VIRTUAL and STORED indicate whether the values will not or will be stored in the table. The keyword VIRTUAL is used by default, which means that the column's values are not stored so they don't take up storage space. They are evaluated every time the row is read. If you create an index with a virtual column, the value does get stored - in the index itself. The STOREDkeyword, on the other hand, indicates that values are calculated as the data is written to a table, which means values are calculated when documents are inserted or updated. In this case, the index doesn't need to store the value and behaves more traditionally.
The last three parameters enforce whether the values can be NOT NULL or NULL, and add index constraints such as UNIQUE or PRIMARY KEY. We probably should always use NOT NULL when creating a column to ensure that values exist, but using index constraints depend on your use case.
The other options are optional constraints to enforce whether the values can be NOT NULL or NULL, and add index constraints such as UNIQUE or PRIMARY KEY. If you are relying on a field existing, you should use NOT NULL when creating a column to ensure that values exist. The constraints really depend on your use case. We'll use NOT NULL as we expect players to have a name, just not a unique one.
Now let's look at our CREATE TABLE statement:
CREATE TABLE `players` (  
   `id` INT UNSIGNED NOT NULL,
   `player_and_games` JSON NOT NULL,
   `names_virtual` VARCHAR(20) GENERATED ALWAYS AS (`player_and_games` ->> '$.name') NOT NULL, 
   PRIMARY KEY (`id`)
);
If we use this to create the table, we can then insert some of the JSON documents. You can find the SQL for this in the compose examples respository. In this dataset, we've inserted the id for each player and then the JSON document like:
INSERT INTO `players` (`id`, `player_and_games`) VALUES (1, '{  
    "id": 1,  
    "name": "Sally",
    "games_played":{    
       "Battlefield": {
          "weapon": "sniper rifle",
          "rank": "Sergeant V",
          "level": 20
        },                                                                                                                          
       "Crazy Tennis": {
          "won": 4,
          "lost": 1
        },  
       "Puzzler": {
          "time": 7
        }
      }
   }'
);
...
Once we've run the code, and the data has been inserted into the players table, we can do a SELECT query giving us the following:
SELECT * FROM `players`;

+----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+
| id | player_and_games                                                                                                                                                                                           | names_virtual |
+----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+
|  1 | {"id": 1, "name": "Sally", "games_played": {"Puzzler": {"time": 7}, "Battlefield": {"rank": "Sergeant V", "level": 20, "weapon": "sniper rifle"}, "Crazy Tennis": {"won": 4, "lost": 1}}}                  | Sally         |
|  2 | {"id": 2, "name": "Thom", "games_played": {"Puzzler": {"time": 25}, "Battlefield": {"rank": "Major General VIII", "level": 127, "weapon": "carbine"}, "Crazy Tennis": {"won": 10, "lost": 30}}}            | Thom          |
|  3 | {"id": 3, "name": "Ali", "games_played": {"Puzzler": {"time": 12}, "Battlefield": {"rank": "First Sergeant II", "level": 37, "weapon": "machine gun"}, "Crazy Tennis": {"won": 30, "lost": 21}}}           | Ali           |
|  4 | {"id": 4, "name": "Alfred", "games_played": {"Puzzler": {"time": 10}, "Battlefield": {"rank": "Chief Warrant Officer Five III", "level": 73, "weapon": "pistol"}, "Crazy Tennis": {"won": 47, "lost": 2}}} | Alfred        |
|  5 | {"id": 5, "name": "Phil", "games_played": {"Puzzler": {"time": 7}, "Battlefield": {"rank": "Lt. Colonel III", "level": 98, "weapon": "assault rifle"}, "Crazy Tennis": {"won": 130, "lost": 75}}}          | Phil          |
|  6 | {"id": 6, "name": "Henry", "games_played": {"Puzzler": {"time": 17}, "Battlefield": {"rank": "Captain II", "level": 87, "weapon": "assault rifle"}, "Crazy Tennis": {"won": 68, "lost": 149}}}             | Henry         |
+----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+
As we can see, the table includes the names_virtual column with all the player's names inserted. Let's do a quick check to show how the columns have been set up by MySQL:
SHOW COLUMNS FROM `players`;

+------------------+------------------+------+-----+---------+-------------------+
| Field            | Type             | Null | Key | Default | Extra             |
+------------------+------------------+------+-----+---------+-------------------+
| id               | int(10) unsigned | NO   | PRI | NULL    |                   |
| player_and_games | json             | NO   |     | NULL    |                   |
| names_virtual    | varchar(20)      | NO   |     | NULL    | VIRTUAL GENERATED |
+------------------+------------------+------+-----+---------+-------------------+
Since we haven't indicated whether the generated column is VIRTUAL or STORED, by default MySQL automatically set up a VIRTUAL column for us. If you don't know whether your columns are VIRTUAL or STORED, just run the above SHOW COLUMNSquery and it will either show VIRTUAL GENERATED or STORED GENERATED.
Now that we set up the table and our first VIRTUAL column, let's add four more columns using the ALTER TABLE and ADD COLUMN operations. These will hold the Battlefield levels, tennis games won, tennis games lost, and the Puzzler times.
ALTER TABLE `players` ADD COLUMN `battlefield_level_virtual` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.games_played.Battlefield.level') NOT NULL AFTER `names_virtual`;  
ALTER TABLE `players` ADD COLUMN `tennis_won_virtual` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.games_played."Crazy Tennis".won') NOT NULL AFTER `battlefield_level_virtual`;  
ALTER TABLE `players` ADD COLUMN `tennis_lost_virtual` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.games_played."Crazy Tennis".lost') NOT NULL AFTER `tennis_won_virtual`;  
ALTER TABLE `players` ADD COLUMN `times_virtual` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.games_played.Puzzler.time') NOT NULL AFTER `tennis_lost_virtual`;  
Again, running the query SHOW COLUMNS FROM players;, we see that all of the columns have VIRTUAL GENERATED next to them, meaning that we've successfully set up new VIRTUAL generated columns.
+---------------------------+------------------+------+-----+---------+-------------------+
| Field                     | Type             | Null | Key | Default | Extra             |
+---------------------------+------------------+------+-----+---------+-------------------+
| id                        | int(10) unsigned | NO   | PRI | NULL    |                   |
| player_and_games          | json             | NO   |     | NULL    |                   |
| names_virtual             | varchar(20)      | NO   |     | NULL    | VIRTUAL GENERATED |
| battlefield_level_virtual | int(11)          | NO   |     | NULL    | VIRTUAL GENERATED |
| tennis_won_virtual        | int(11)          | NO   |     | NULL    | VIRTUAL GENERATED |
| tennis_lost_virtual       | int(11)          | NO   |     | NULL    | VIRTUAL GENERATED |
| times_virtual             | int(11)          | NO   |     | NULL    | VIRTUAL GENERATED |
+---------------------------+------------------+------+-----+---------+-------------------+
Running the SELECT query shows us all the values from the VIRTUAL COLUMNS, which should look like:
SELECT `names_virtual`, `battlefield_level_virtual`, `tennis_won_virtual`, `tennis_lost_virtual`, `times_virtual` FROM `players`;

+---------------+---------------------------+--------------------+---------------------+---------------+
| names_virtual | battlefield_level_virtual | tennis_won_virtual | tennis_lost_virtual | times_virtual |
+---------------+---------------------------+--------------------+---------------------+---------------+
| Sally         |                        20 |                  4 |                   1 |             7 |
| Thom          |                       127 |                 10 |                  30 |            25 |
| Ali           |                        37 |                 30 |                  21 |            12 |
| Alfred        |                        73 |                 47 |                   2 |            10 |
| Phil          |                        98 |                130 |                  75 |             7 |
| Henry         |                        87 |                 68 |                 149 |            17 |
+---------------+---------------------------+--------------------+---------------------+---------------+
Now that the data has been inserted and the generated columns set up, we can create indexes on each column to optimize our searches ...

Indexing generated columns

When putting secondary indexes on VIRTUAL generated column values, the values are materialized and stored in the index. This gives us the benefit of not increasing the table size and being able to take advantage of MySQL indexing.
Let's do a simple query on a generated column to see what it looks like before we index it. Examining the query plan when selecting names_virtual and the name "Sally", we'd get the following:
EXPLAIN SELECT * FROM `players` WHERE `names_virtual` = "Sally"\G  
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: players
   partitions: NULL
         type: ALL
possible_keys: NULL  
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 6
     filtered: 16.67
        Extra: Using where
For this query, MySQL has to look at every row to find "Sally". However, we get an entirely different result once we put an index on the column like:
CREATE INDEX `names_idx` ON `players`(`names_virtual`);  
Now, running the same query, we get:
EXPLAIN SELECT * FROM `players` WHERE `names_virtual` = "Sally"\G  
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: players
   partitions: NULL
         type: ref
possible_keys: names_idx  
          key: names_idx
      key_len: 22
          ref: const
         rows: 1
     filtered: 100.00
        Extra: NULL
As we can see, our index on the column sped up our query by only looking at one row instead of six using the index names_idx instead of all of the rows. Let's create indexes on the rest of our virtual columns following the same syntax as names_idxlike:
CREATE INDEX `times_idx` ON `players`(`times_virtual`);  
CREATE INDEX `won_idx` ON `players`(`tennis_won_virtual`);  
CREATE INDEX `lost_idx` ON `players`(`tennis_lost_virtual`);  
CREATE INDEX `level_idx` ON `players`(`battlefield_level_virtual`);  
We can check to see if all of our columns have now been indexed by running:
SHOW INDEX ON `players`;

+---------+------------+-----------+--------------+---------------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table   | Non_unique | Key_name  | Seq_in_index | Column_name               | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------+------------+-----------+--------------+---------------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| players |          0 | PRIMARY   |            1 | id                        | A         |           6 |     NULL | NULL   |      | BTREE      |         |               |
| players |          1 | names_idx |            1 | names_virtual             | A         |           6 |     NULL | NULL   |      | BTREE      |         |               |
| players |          1 | times_idx |            1 | times_virtual             | A         |           5 |     NULL | NULL   |      | BTREE      |         |               |
| players |          1 | won_idx   |            1 | tennis_won_virtual        | A         |           6 |     NULL | NULL   |      | BTREE      |         |               |
| players |          1 | lost_idx  |            1 | tennis_lost_virtual       | A         |           6 |     NULL | NULL   |      | BTREE      |         |               |
| players |          1 | level_idx |            1 | battlefield_level_virtual | A         |           6 |     NULL | NULL   |      | BTREE      |         |               |
+---------+------------+-----------+--------------+---------------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
Now that we created multiple indexes on our generated columns, let's make a more complex search to see how generated columns and indexes work. For this example, we'll get the ids, names, tennis games won, Battlefield level, and Puzzler time for players who have a level above 50 and who also have won 50 tennis games. All of the results will be ordered by ascending order according to the Puzzler time. The SQL command and results will look like:
SELECT `id`, `names_virtual`, `tennis_won_virtual`, `battlefield_level_virtual`, `times_virtual` FROM `players` WHERE (`battlefield_level_virtual` > 50 AND  `tennis_won_virtual` > 50) ORDER BY `times_virtual` ASC;

+----+---------------+--------------------+---------------------------+---------------+
| id | names_virtual | tennis_won_virtual | battlefield_level_virtual | times_virtual |
+----+---------------+--------------------+---------------------------+---------------+
|  5 | Phil          |                130 |                        98 |             7 |
|  6 | Henry         |                 68 |                        87 |            17 |
+----+---------------+--------------------+---------------------------+---------------+
Let's look at how MySQL planned out this query:
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: players
   partitions: NULL
         type: range
possible_keys: won_idx,level_idx  
          key: won_idx
      key_len: 4
          ref: NULL
         rows: 2
     filtered: 66.67
        Extra: Using where; Using filesort
When using the indexes on won_idx and level_idx it only had to access two columns to return the result we wanted. As you can see, if the query had to do a complete table scan on millions of documents, it would probably have taken a very long time. However, with the power of generated columns and indexing those columns, MySQL has provided a very fast and convenient way to search elements within JSON documents.
One question that remains, nonetheless, is what do we do with STORED generated columns? How do we use them, and how do they work?

Storing values in generated columns

Using the STORED keyword when setting up a generated column is generally not preferred since you're basically storing values twice in a table: once in the JSON document and again in the STORED column. However, there are three scenarios when MySQL suggests you use a STORED generated column: 1) indexing primary keys, 2) you need a fulltext/R-tree index, or 3) you have a column that is scanned a lot.
The syntax for adding a STORED generated column is the same as creating VIRTUALgenerated columns, except we append the keyword STORED after the expression like:
`id` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.id') STORED NOT NULL,
To show you how to use STORED, let's create another table that will take the IDs from our JSON documents and store them in a STORED generated column. Then we'll set the PRIMARY KEY to the id column. At the same time, we'll create all of our VIRTUAL columns and set up indexes for those columns. This is done entirely within the initial table creation:
CREATE TABLE `players_two` (  
    `id` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.id') STORED NOT NULL,
    `player_and_games` JSON NOT NULL,
    `names_virtual` VARCHAR(20) GENERATED ALWAYS AS (`player_and_games` ->> '$.name') NOT NULL,
    `times_virtual` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.games_played.Puzzler.time') NOT NULL,
    `tennis_won_virtual` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.games_played."Crazy Tennis".won') NOT NULL,
    `tennis_lost_virtual` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.games_played."Crazy Tennis".lost') NOT NULL,
    `battlefield_level_virtual` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.games_played.Battlefield.level') NOT NULL, 
    PRIMARY KEY (`id`),
    INDEX `times_index` (`times_virtual`),
    INDEX `names_index` (`names_virtual`),
    INDEX `won_index` (`tennis_won_virtual`),
    INDEX `lost_index` (`tennis_lost_virtual`),
    INDEX `level_index` (`battlefield_level_virtual`)
);
Next, we'll insert the same dataset into players_two except we'll remove the ID that we previously added to the INSERT operation:
INSERT INTO `players_two` (`player_and_games`) VALUES ('{  
    "id": 1,  
    "name": "Sally",  
    "games_played":{    
...
);
After our data has been inserted into the table, we can run SHOW COLUMNS on the new table to see how MySQL has generated our columns. Notice that the id field now has STORED GENERATED beside it with a PRIMARY KEY index.
SHOW COLUMNS FROM `players_two`;

+---------------------------+-------------+------+-----+---------+-------------------+
| Field                     | Type        | Null | Key | Default | Extra             |
+---------------------------+-------------+------+-----+---------+-------------------+
| id                        | int(11)     | NO   | PRI | NULL    | STORED GENERATED  |
| player_and_games          | json        | NO   |     | NULL    |                   |
| names_virtual             | varchar(20) | NO   | MUL | NULL    | VIRTUAL GENERATED |
| times_virtual             | int(11)     | NO   | MUL | NULL    | VIRTUAL GENERATED |
| tennis_won_virtual        | int(11)     | NO   | MUL | NULL    | VIRTUAL GENERATED |
| tennis_lost_virtual       | int(11)     | NO   | MUL | NULL    | VIRTUAL GENERATED |
| battlefield_level_virtual | int(11)     | NO   | MUL | NULL    | VIRTUAL GENERATED |
+---------------------------+-------------+------+-----+---------+-------------------+
One thing to note about creating a PRIMARY KEY with generated columns is that MySQL will not allow you to create primary keys on VIRTUAL generated columns. In fact, if we left out STORED on the id field, MySQL will throw the following error:
ERROR 3106 (HY000): 'Defining a virtual generated column as primary key' is not supported for generated columns.  
At the same time, if you leave off the primary key index and attempt to insert a data, MySQL will throw an error stating:
ERROR 3098 (HY000): The table does not comply with the requirements by an external plugin.  
This means that you don't have a primary key on your table. Therefore, you must go back and either drop and recreate your table, or drop the id column and add the column as a STORED generated column with a PRIMARY KEY like:

ALTER TABLE `players_two` ADD COLUMN `id` INT GENERATED ALWAYS AS (`player_and_games` ->> '$.id') STORED PRIMARY KEY;  

0 comments:

Post a Comment