Wednesday, 31 October 2018

MySQL: Is there a REAL performance difference between INT and VARCHAR primary keys?

Is there a measurable performance difference between using INT and VARCHAR as a primary key in MySQL? I'd like to use VARCHAR as the primary key for reference lists (think US states, country codes), but a coworker won't budge from INT AUTO_INCREMENT as the primary key for all tables.
My argument, as detailed here, is that the performance difference between INT and VARCHAR is negligible: every INT foreign key reference requires a JOIN to make sense of, while a VARCHAR key presents the information directly.
So, does anyone have experience with this particular use case and the performance concerns associated with it?
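For concreteness, here is a minimal sketch of the two designs in question (table and column names are illustrative, not from an actual schema):

-- Natural key: the code itself is the primary key, so a referencing
-- table already contains the readable value.
CREATE TABLE states (
    code CHAR(2) NOT NULL PRIMARY KEY,  -- e.g. 'CA'
    name VARCHAR(64) NOT NULL
) ENGINE=InnoDB;

CREATE TABLE addresses (
    id INT AUTO_INCREMENT PRIMARY KEY,
    state_code CHAR(2) NOT NULL,
    FOREIGN KEY (state_code) REFERENCES states (code)
) ENGINE=InnoDB;

-- Surrogate key: reading the state code requires a join.
CREATE TABLE states_s (
    id INT AUTO_INCREMENT PRIMARY KEY,
    code CHAR(2) NOT NULL UNIQUE,
    name VARCHAR(64) NOT NULL
) ENGINE=InnoDB;

CREATE TABLE addresses_s (
    id INT AUTO_INCREMENT PRIMARY KEY,
    state_id INT NOT NULL,
    FOREIGN KEY (state_id) REFERENCES states_s (id)
) ENGINE=InnoDB;

-- With the surrogate design, showing the state code needs a join:
SELECT a.id, s.code FROM addresses_s a JOIN states_s s ON s.id = a.state_id;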

Answers


You make a good point that you can avoid a number of joins by using what's called a natural key instead of a surrogate key. Only you can assess whether the benefit of this is significant in your application.
That is, you can measure the queries in your application that most need to be fast, because they work with large volumes of data or run very frequently. If those queries benefit from eliminating a join, and do not suffer from using a VARCHAR primary key, then do it.
Don't use either strategy for all tables in your database. It's likely that in some cases, a natural key is better, but in other cases a surrogate key is better.
Other folks make a good point that it's rare in practice for a natural key to never change or have duplicates, so surrogate keys are usually worthwhile.
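One way to do that measurement (a sketch, reusing the illustrative states/addresses tables from the question): compare the plans of your hot query under both designs, then benchmark the ones that matter.

-- Natural key: the filter runs against the referencing table alone.
EXPLAIN SELECT a.id FROM addresses a WHERE a.state_code = 'CA';

-- Surrogate key: the same question needs a join through the lookup table.
EXPLAIN SELECT a.id
FROM addresses_s a
JOIN states_s s ON s.id = a.state_id
WHERE s.code = 'CA';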



Absolutely not.
I have done several... several... performance checks between INT, VARCHAR, and CHAR.
A 10 million record table with a PRIMARY KEY (unique and clustered) had exactly the same speed and performance (and subtree cost) no matter which of the three I used.
That being said... use whatever is best for your application. Don't worry about the performance.



For short codes, there's probably no difference. This is especially true since the tables holding these codes are likely to be very small (a couple thousand rows at most) and to change rarely (when was the last time we added a new US state?).
For larger tables with wider variation among the keys, this can be dangerous. Think about using e-mail address/user name from a User table, for example. What happens when you have a few million users, some of whom have long names or e-mail addresses? Now any time you need to join against this table using that key, it becomes much more expensive.
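To illustrate the risky case (a hypothetical schema, not from the answer): in InnoDB the primary key is the clustered index, and every secondary index entry stores a copy of the primary key, so a wide VARCHAR key gets repeated in every index and in every referencing table.

CREATE TABLE users (
    email VARCHAR(255) NOT NULL PRIMARY KEY,
    name  VARCHAR(100) NOT NULL,
    INDEX (name)  -- each entry here also carries the full email value
) ENGINE=InnoDB;

CREATE TABLE orders (
    id INT AUTO_INCREMENT PRIMARY KEY,
    user_email VARCHAR(255) NOT NULL,  -- the wide key duplicated per order row
    FOREIGN KEY (user_email) REFERENCES users (email)
) ENGINE=InnoDB;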



At HauteLook, we changed many of our tables to use natural keys. We experienced a real-world increase in performance. As you mention, many of our queries now use fewer joins, which makes them more performant. We will even use a composite primary key if it makes sense (see the sketch below). That being said, some tables are just easier to work with if they have a surrogate key.
Also, if you are letting people write interfaces to your database, a surrogate key can be helpful. The third party can rely on the fact that the surrogate key will change only in very rare circumstances.
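As an example of the composite-key case (a sketch with illustrative names, assuming products and categories tables exist), a pure many-to-many link table often needs no surrogate key at all:

CREATE TABLE product_category (
    product_id  INT NOT NULL,
    category_id INT NOT NULL,
    PRIMARY KEY (product_id, category_id),  -- the pair is naturally unique
    FOREIGN KEY (product_id)  REFERENCES products (id),
    FOREIGN KEY (category_id) REFERENCES categories (id)
) ENGINE=InnoDB;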



I was a bit annoyed by the lack of benchmarks for this online, so I ran a test myself.
Note, though, that I don't do this on a regular basis, so please check my setup and steps for any factors that could have influenced the results unintentionally, and post your concerns in comments.
The setup was as follows:
  • Intel® Core™ i7-7500U CPU @ 2.70GHz × 4
  • 15.6 GiB RAM, of which I ensured around 8 GB was free during the test.
  • 148.6 GB SSD drive, with plenty of free space.
  • Ubuntu 16.04 64-bit
  • MySQL Ver 14.14 Distrib 5.7.20, for Linux (x86_64)
The tables:
-- Same payload columns throughout; only the type of myindex varies
-- (tinyint, char(6), varchar(63)), each without and with a secondary index:
create table jan_int (data1 varchar(255), data2 int(10), myindex tinyint(4)) ENGINE=InnoDB;
create table jan_int_index (data1 varchar(255), data2 int(10), myindex tinyint(4), INDEX (myindex)) ENGINE=InnoDB;
create table jan_char (data1 varchar(255), data2 int(10), myindex char(6)) ENGINE=InnoDB;
create table jan_char_index (data1 varchar(255), data2 int(10), myindex char(6), INDEX (myindex)) ENGINE=InnoDB;
create table jan_varchar (data1 varchar(255), data2 int(10), myindex varchar(63)) ENGINE=InnoDB;
create table jan_varchar_index (data1 varchar(255), data2 int(10), myindex varchar(63), INDEX (myindex)) ENGINE=InnoDB;
Then, I filled 10 million rows in each table with a PHP script whose essence is like this:
$pdo = get_pdo();

// ten fixed 6-character keys; one is picked at random for each row
$keys = [ 'alabam', 'massac', 'newyor', 'newham', 'delawa', 'califo', 'nevada', 'texas_', 'florid', 'ohio__' ];

for ($k = 0; $k < 10; $k++) {           // 10 batches of 1 million rows each
    for ($j = 0; $j < 1000; $j++) {     // 1000 multi-row INSERTs per batch
        $val = '';
        for ($i = 0; $i < 1000; $i++) { // 1000 rows per INSERT statement
            $val .= '("' . generate_random_string() . '", ' . rand(0, 10000) . ', "' . ($keys[rand(0, 9)]) . '"),';
        }
        $val = rtrim($val, ',');
        $pdo->query('INSERT INTO jan_char VALUES ' . $val);
    }
    echo "\n" . ($k + 1) . ' million(s) rows inserted.';
}
For int tables, the bit ($keys[rand(0, 9)]) was replaced with just rand(0, 9), and for varchar tables, I used full US state names, without cutting or extending them to 6 characters. generate_random_string() generates a 10-character random string.
Then I ran in MySQL:
  • SET SESSION query_cache_type=0;
  • For jan_int table:
    • SELECT count(*) FROM jan_int WHERE myindex = 5;
    • SELECT BENCHMARK(1000000000, (SELECT count(*) FROM jan_int WHERE myindex = 5));
  • For other tables, same as above, with myindex = 'califo' for char tables and myindex = 'california' for varchar tables.
Times of the BENCHMARK query on each table:
  • jan_int: 21.30 sec
  • jan_int_index: 18.79 sec
  • jan_char: 21.70 sec
  • jan_char_index: 18.85 sec
  • jan_varchar: 21.76 sec
  • jan_varchar_index: 18.86 sec
Regarding table & index sizes, here's the output of show table status from janperformancetest; (w/ a few columns not shown):
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Name              | Engine | Version | Row_format | Rows    | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Collation              |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| jan_int           | InnoDB |      10 | Dynamic    | 9739094 |             43 |   422510592 |               0 |            0 |   4194304 |           NULL | utf8mb4_unicode_520_ci |  
| jan_int_index     | InnoDB |      10 | Dynamic    | 9740329 |             43 |   420413440 |               0 |    132857856 |   7340032 |           NULL | utf8mb4_unicode_520_ci |   
| jan_char          | InnoDB |      10 | Dynamic    | 9726613 |             51 |   500170752 |               0 |            0 |   5242880 |           NULL | utf8mb4_unicode_520_ci |  
| jan_char_index    | InnoDB |      10 | Dynamic    | 9719059 |             52 |   513802240 |               0 |    202342400 |   5242880 |           NULL | utf8mb4_unicode_520_ci |  
| jan_varchar       | InnoDB |      10 | Dynamic    | 9722049 |             53 |   521142272 |               0 |            0 |   7340032 |           NULL | utf8mb4_unicode_520_ci |   
| jan_varchar_index | InnoDB |      10 | Dynamic    | 9738381 |             49 |   486539264 |               0 |    202375168 |   7340032 |           NULL | utf8mb4_unicode_520_ci | 
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
My conclusion is that there's no performance difference for this particular use case.






The question is about MySQL, so I say there is a significant difference. If it were about Oracle (which stores numbers as strings - yes, I couldn't believe it at first), then not much difference.
Storage in the table is not the issue, but updating and referring to the index is. Queries that look up a record by its primary key are frequent, and you want them to be as fast as possible because they happen so often.
The thing is, a CPU deals with 4-byte and 8-byte integers natively, in silicon. It's REALLY fast to compare two integers - it happens in one or two clock cycles.
Now look at a string: it's made up of lots of characters (more than one byte per character these days). Comparing two strings for precedence can't be done in one or two cycles. Instead, the strings' characters must be iterated over until a difference is found. I'm sure there are tricks to make it faster in some databases, but that's irrelevant here, because an int comparison is done natively, lightning fast, in silicon by the CPU.
My general rule: every primary key should be an auto-incrementing INT, especially in OO apps using an ORM (Hibernate, DataNucleus, whatever) where there are lots of relationships between objects. They'll almost always be implemented as a simple FK, and the DB's ability to resolve those fast is important to your app's responsiveness (see the sketch below).
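In schema terms, the rule amounts to something like this (a sketch with illustrative names, not code from the answer):

CREATE TABLE customer (
    id INT AUTO_INCREMENT PRIMARY KEY,  -- what the ORM maps each entity to
    name VARCHAR(100) NOT NULL
) ENGINE=InnoDB;

CREATE TABLE invoice (
    id INT AUTO_INCREMENT PRIMARY KEY,
    customer_id INT NOT NULL,           -- a simple 4-byte FK, cheap to compare
    FOREIGN KEY (customer_id) REFERENCES customer (id)
) ENGINE=InnoDB;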



Allow me to say yes, there is definitely a difference, taking into consideration the scope of performance (out-of-the-box definition):
1- Using a surrogate int is faster in the application, because you do not need to use ToUpper(), ToLower(), ToUpperInvariant(), or ToLowerInvariant() in your code or in your query, and these four functions have different performance benchmarks. See Microsoft's performance rules on this. (performance of the application)
2- Using a surrogate int guarantees that the key will not change over time. Even country codes may change; see on Wikipedia how ISO codes have changed over time. It would take a lot of time to change a primary key throughout all the subtrees that reference it. (performance of data maintenance; see the sketch after this list)
3- It seems there are issues with ORM solutions, such as NHibernate, when the PK/FK is not an int. (developer performance)
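On point 2: if you do keep a natural key, MySQL can at least propagate a key change to referencing rows automatically (a sketch with a hypothetical countries table; note the cascade still rewrites every referencing row, so it is not free):

CREATE TABLE countries (
    iso_code CHAR(2) NOT NULL PRIMARY KEY,
    name VARCHAR(64) NOT NULL
) ENGINE=InnoDB;

CREATE TABLE suppliers (
    id INT AUTO_INCREMENT PRIMARY KEY,
    country_code CHAR(2) NOT NULL,
    FOREIGN KEY (country_code) REFERENCES countries (iso_code)
        ON UPDATE CASCADE  -- a changed ISO code propagates to suppliers rows
) ENGINE=InnoDB;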
