Wednesday 27 November 2019

An Introduction to MySQL Indexes

We’re all familiar with the indexes in books: they help you find specific contents much faster by telling you where they’re located. In a nutshell, database indexes essentially do the same thing—they let you retrieve information from a database much faster by narrowing down the scope of your search.
In fact, creating indexes for database tables is one of the most important concepts of database modeling. It’s also often one of the first things you should consider if your query is running too slowly, in which case it may benefit from indexing. In this article, I'll explain all you need to know to understand mySQL indexes while at the same time demonstrating how indexing can improve the performance of your queries.

Overview

Imagine you want to find a particular chapter (say 10) in a book that happens to lack an index. The only certain way to find this chapter would be to scan the entire book until it’s found. If the book did have an index and it told you that chapter 10 starts on page 1023, you’d know that you can immediately skip all other pages and go directly to that chapter’s location.
Now, let's imagine a similar situation with a database table that has thousands, or perhaps millions, of data records. If you have to read each row to check if a particular record is present (called a full table scan), it will take you way more time to retrieve your record than if you were to specify an index pointing to the table section where your data is located.
As you can imagine, using indexes can significantly improve a query’s runtime performance if the number of records in your database grows. With our analogy, the advantage isn’t too obvious if we imagine having to sift through, say, a 100-page book page by page to locate the desired chapter, but suppose we had to scour all of a 5000-page encyclopedia—that would take forever!
I’ve prepared some SAMPLE DATA that you can download. This test data represents weather samples that I will use to guide our understanding of database indexing. Here’s a brief snapshot:
introduction to my SQL indexes
In this diagram, there are three tables. The first table is titled indexed_samples; this is where we will be adding indexes to compare records against those of the second table, unindexed_samples, which, as the name suggests, lacks indexes. Both of these tables store information from 1.8 million records of weather data. Finally, the third table small_samples will be used later on to illustrate special types of indexes.
I highly recommend you download the files and test this out in your local environment, as it will be faster and easier to go over such a large number of records. I’ve opted to use tables that store lots of data in order to illustrate just how much indexes can impact query importance as the number of records increases.

A quick peek at indexing

To see the power of indexing in action, go ahead and run the following query on the indexed_samples table to create an index for the wind speed column (we’ll get into the finer details afterwards):
ALTER TABLE indexed_samples 
ADD INDEX index_windspeed (wind_mtsperhour ASC);
Now, let’s see how long it takes to get the maximum wind speed on the unindexed table by executing the following query:
SET SESSION query_cache_type = OFF; #prevents caching so we can see the impact
SELECT max(wind_mtsperhour) FROM unindexed_samples;
Running the second query above will return the max(wind_mtsperhour) value of “1446485”. Pay attention to how long it took this query to run. Your results may vary, depending on the specifics of your machine’s hardware, but it should look something like this:

ActionMessageDuration
SELECT max(wind_mtsperhour)
FROM unindexed_samples
1 row(s) returned1.891 sec / 0.000 sec


Let’s now run the same exact query, this time using our indexed_samples table to demonstrate the computing power of indexing:
SELECT max(wind_mtsperhour) FROM indexed_samples;
The resulting max(wind_mtsperhour) value remains unchanged, as expected, but the time it took the query to execute is now so miniscule that it essentially rounds off to zero, suggesting it took almost no time to complete!

ActionMessageDuration
SELECT max(wind_mtsperhour)
FROM unindexed_samples
1 row(s) returned0.000 sec / 0.000 sec


Operations that benefit from indexing

As we saw in the previous section, indexing can significantly improve the performance of a query, and it doesn’t just apply to those that perform trivial tasks like searching for minimum/maximum values. In general, mySQL queries benefit from data indexing in the following scenarios:
  • finding rows that match a WHERE clause
  • removing rows that do not match a WHERE clause condition
  • row retrieval from other related tables when PERFORMING JOINS
  • finding the minimum/maximum value for a specific column, as in our example above
  • sorting or grouping the records of a table
More importantly, in some situations, a query can return the requested information directly from the index without ever having to consult the data rows themselves, allowing it to perform much faster than normal (a mechanism known as covering indexes).

Column indexes

Indexes may correspond to one (single-column indexes) or many columns (multi-column indexes). They may also be used for the entire column or just part of it (prefix indexes) and can even enforce special constraints, such as via PRIMARY KEYUNIQUE, or FULL TEXT. We won’t cover FULL TEXT in this article, but you can read more about it in this VERTABELO BLOG if you’re interested.
Index index_windspeed (wind_mtsperhour ASC)

Single-column indexes

Indexes can correspond to a single column, like the one we already saw, or even several columns. We can create a general index that corresponds to a single column either when we create our table or at a later time, once the table has already been created.
The syntax for adding an index to, say, the wind_mtsperhour column when creating our sample table would be the following (note that index_windspeed is the name of the index):
CREATE TABLE indexed_samples (
  measure_timestamp datetime NOT NULL,
  station_name varchar(1000) DEFAULT NULL,
  wind_mtsperhour int(11) NOT NULL,
  windgust_mtsperhour int(11) NOT NULL,
  windangle int(3) NOT NULL,
  rain_mm decimal(5,2) DEFAULT NULL,
  temperature_dht11 int(5) DEFAULT NULL,
  humidity_dht11 int(5) DEFAULT NULL,
  barometric_pressure decimal(10,2) NOT NULL,
  barometric_temperature decimal(10,0) NOT NULL,
  lux decimal(7,2) DEFAULT NULL,
  KEY index_windspeed (wind_mtsperhour)
);
Alternatively, if the table already exists, we can add the index to it by ALTERing the table, like we did in the introductory example:
ALTER TABLE indexed_samples
ADD INDEX index_windspeed (wind_mtsperhour);
It’s also possible to use the CREATE INDEX statement to achieve the same result:
CREATE INDEX index_windspeed
ON indexed_samples (wind_mtsperhour);
A full list of index options and modifiers can be found HERE. In the rest of this article, I’ll use the ALTER TABLE statement, since we will be modifying the base MySQL :: MySQL 5.7 Reference Manual :: 13.1.14 CREATE INDEX Syntax table.
Before we move on, you should know that you can remove a particular index at any time with the following syntax:
ALTER TABLE table_with_indexes
DROP INDEX index_name;

Indexing string columns

IN GENERAL, mySQL has a 767-byte index length limit for InnoDB (1000 on myIsam) for string columns. This means a regular index cannot include the full column string but only the first 767 bytes of it. The part that can be indexed is called the prefix. The prefix length can be specified after the indexed column name upon index definition, like so:
CREATE INDEX index_station_name
ON indexed_samples ( station_name (100) );
Here, we’re creating an index that will consider the first 100 characters of the station_name column (which is 1000 characters long). For CHARVARCHARBINARY, and VARBINARY column indexes, the prefix length is optional. If you don’t set it, the engine will use the maximum length possible when considering the column’s character set.
For BLOB and TEXT columns, you must specify the prefix length explicitly, or you will get ERROR 1170: BLOB/TEXT … without a key length. It’s important to note that the prefix size you specify will be considered in bytes instead of characters for the BINARYVARBINARY, and BLOB column types.
Using the special index modifier FULL TEXT will index an entire column, and the prefix will be ignored if it is specified. However, as I noted before, I won’t cover that type of index in this article.

Multi-column indexes

As noted before, indexes may also correspond to more than one column (up to 16 in mySQL). The only difference between single- and multi-column indexes is that the latter requires specifying several columns instead of just one. So, for starters, let’s create a single index for the wind speed and wind gust columns:
ALTER TABLE indexed_samples
ADD INDEX idx_speed_gust (wind_mtsperhour, windgust_mtsperhour);
Just as we did with single-column indexes, we can specify a prefix length for the string columns of a multi-column index if we want to. For instance, we could create another index that includes the first six characters of the station_name and the windangle columns:
ALTER TABLE indexed_samples
ADD INDEX idx_station_angle (station_name (6), windangle);
....
Index idx_station_angle (station_name (6), windangle)

How multi-column indexes are used

The engine will create multi-column index key values by concatenating the column values in the order they are specified for the index. Therefore, only queries that conform to the following criteria may use the index:
  • queries that test all columns used in the index
  • queries that test just the first column
  • queries that test the first two columns
  • queries that test the first three columns
  • and so on...
So, the order in which you specify the index definition is critical to how the engine can use it. Let’s take a look at when the engine can use our multi-column index from the prior example (with the station_name and windangle columns).
A query including all columns in the index will use it:
SELECT * FROM indexed_samples
WHERE station_name = “test 1” and windangle = 180
Or this one, too, which includes the first column:
SELECT * FROM indexed_samples
WHERE station_name = “test 1”
But, this one will not use the index:
SELECT * FROM indexed_samples
WHERE windangle = 180
The last query above does not use all columns: it only uses the second indexed column. Therefore, indexing cannot be used. This also applies when using OR in a WHERE clause, as each part must be evaluated separately. So, in the next example, the engine still cannot use the index:
SELECT * FROM indexed_samples
WHERE
station_name = "test 1" OR 
windangle = 180;
If we rewrite the query like so, it can use the index:
SELECT * FROM indexed_samples
WHERE
station_name = "test 1" OR
(station_name = "test 2" AND station_name = "test 3" AND windangle = 180);
Don’t worry if these rules seem complicated. Later in this article, I’ll show how you can detect whether your query is using indexing.

UNIQUE indexes

The concept here is fairly straightforward: when creating a UNIQUE index, all indexed values must be distinct. This applies to both single- and multi-column indexes. If you try to add a new row with a key value that matches an existing one, an error will occur. Also, if you try to create a unique index for a column with repeated values, the index will not be created, and Error 1062 will be generated. In practice, UNIQUE is used to prevent selecting duplicated rows.
Let’s see this in action. I’ll first try to create a multi-column index for the station_name (with a 100-character prefix) and measure_timestamp columns:
ALTER TABLE indexed_samples
ADD UNIQUE INDEX idx_station_timestamp (station_name(100), measure_timestamp);
As expected, the above code fails with Error Code: 1062. Duplicate entry 'test 2-2014-12-02 00:02:35' for key 'idx_station_timestamp'. This is because the table already contains duplicated values for these two columns. Now, let’s try using the small_samples table, which stores 10 non-repeated samples:
ALTER TABLE small_samples
ADD UNIQUE INDEX idx_station_timestamp (station_name(100), measure_timestamp);
The above query will succeed, and we can now try inserting a repeated value:
INSERT INTO small_samples VALUES('11', '2014-11-30 22:01:26', 'test 2', '5980', '9467', '171', '0.00', '16', '37', '93943.00', '17', '0.00');
The above query fails with Error Code 1062. Duplicate entry 'test 2-2014-11-30 22:01:26' for key 'idx_station_timestamp'. Again, this is expected.
Essentially, UNIQUE indexes prevent duplicate records from arising at a database level. If your application wanted to insert a record into uniquely indexed columns for some reason, you could rely on mySQL’s UNIQUE indexing to protect you from accidental duplicates, while at the same time speeding up searches for records in any indexed columns.

PRIMARY KEY indexes

From an indexing perspective, defining a column or columns as a PRIMARY KEY is very similar to creating a UNIQUE index. The main difference is that PRIMARY KEYs cannot hold null values. However, like UNIQUE indexes, PRIMARY KEYs cannot hold duplicates.
PRIMARY KEYs are usually defined on the smallest set of columns that can uniquely identify a row within your table, such as an id column. For this reason, mySQL optimizes data storage with PRIMARY KEY. Simply put, the recommendation is that you define a PRIMARY KEY for your table to improve performance. This is one of the core concepts of DATABASE NORMALIZATION.
Let’s see how we can add a PRIMARY KEY index to our small_samples table for the station_name and measure_timestamp columns:
ALTER TABLE small_samples
ADD PRIMARY KEY (station_name (100), measure_timestamp);
The query succeeds. Let’s now try to add another PRIMARY KEY to the id column to see what happens:
ALTER TABLE small_samples
ADD PRIMARY KEY (id);
 
Error Code: 1068. Multiple primary key defined
This reveals another core deviation from UNIQUE indexes: you are not allowed to have more than one PRIMARY KEY per table.
Let’s remove the existing PRIMARY KEY:
ALTER TABLE small_samples
DROP PRIMARY KEY;
And then try to add the id index again:
ALTER TABLE small_samples
ADD PRIMARY KEY (id);
Now, let’s confirm that we cannot insert duplicate entries or null values into the id column, as the index has been set as a PRIMARY KEY:
INSERT INTO small_samples (id) VALUES (null);
Error Code: 1048. Column 'id' cannot be null
INSERT INTO small_samples (id) VALUES (10);
Error Code: 1062. Duplicate entry '10' for key 'PRIMARY'
Both of these tests fail, as expected, preventing us from having duplicates or null values in column(s) that are set as PRIMARY KEYs.

EXPLAIN and SELECT optimizations

The best way to prove that your query is properly using the expected indexes is to use the EXPLAIN statement on the SELECT query you intend to analyze. Although a full optimization tutorial would be outside the scope of this article, I’d like to give you a quick example so you can see its power.
Let’s EXPLAIN a query that grabs samples whose wind speed values equal to “7000” to see what happens when we use the unindexed table:
EXPLAIN SELECT wind_mtsperhour = 7000 FROM unindexed_samples;
This query will return the following:

idselect_typetabletablepossible_keyskeykey_lenrefrowsextras
1SIMPLEunindexed_samplesALL1837353


As it stands, the query has no possible indexes to use, so a full table scan must be performed on all rows. However, if we run the same query on the indexed table:
EXPLAIN SELECT wind_mtsperhour = 7000 FROM indexed_samples;
We get the following feedback:

idselect_typetabletablepossible_keyskeykey_lenrefrowsextras
1SIMPLEindexed
_samples
indexindex_windspeed41750038Using index


As you can see, the extras column tells us that this query does indeed rely on indexing. Thus, the query need not read through the entire table to find the matching record, as the index_windspeed column already contains all the required information. The query can simply grab the records it needs directly from that column in a snap.

EXPLAIN on multicolumn index

Earlier in this article, I listed the criteria for using multicolumn indexes and mentioned that a SELECT statement that uses an OR can be modified to use indexes. Let’s EXPLAIN those queries to further understand the concept:
EXPLAIN SELECT * FROM indexed_samples
WHERE  station_name = "test 1" OR  windangle = 180;
This is what is returned:

idselect_typetabletablepossible_keyskeykey_lenrefrowsextras
1SIMPLEindexed_samplesALLidx_station_angle1750038Using where


However, since the second part of our OR clause uses a condition that does not comply with the multi-column indexing criteria I mentioned earlier, this query will not rely on any indexes and will thus have to scan through all 1.7 million rows.
Now, if we EXPLAIN the query that includes all the indexed columns as I suggested earlier, we can clearly observe the improvement in performance:
EXPLAIN SELECT * FROM indexed_samples
WHERE
station_name = "test 1" OR
(station_name = "test 2" AND station_name = "test 3" AND windangle = 180);
Here’s the feedback we get:

idselect_typetabletablepossible_keyskeykey_lenrefrowsextras
1SIMPLEindexed
_samples
refidx_station_angleidx_station_angle403const875019Using where


In this case, the index is used, and 50% less rows need to be scanned.

Final words

The key concept to keep in mind is that indexing speeds up your queries and helps you add an extra layer of consistency when used in combination with UNIQUE and/or PRIMARY KEY. The columns you are most likely going to use for your SELECTs are ideal candidates for indexing. Also, as I briefly showed, EXPLAIN can offer useful hints and insights into whether your query is properly using indexing in its searches.
Keep in mind that indexing also comes at hidden a cost: updating indexes as you insert/update/delete data impacts server performance. Choosing the best index, if any, will also consume mySQL processing time. Therefore, you should be extremely picky when choosing indexes and only use them where they are needed.

0 comments:

Post a Comment