Wednesday, 31 October 2018

MySQL DATETIME vs TIMESTAMP vs INT performance and benchmarking with MyISAM

Recently there was a discussion about DATETIME vs TIMESTAMP vs INT performance and storage requirements. If a setup were bound by disk I/O, one would expect INT to perform better, since its storage requirements are smaller. However, the amount of calculation needed to use it as a timestamp can be significant as well.
So I went to do some benchmarks to see what we can conclude. The tests were run with:
  • MySQL 5.4.0-beta
  • Intel Quad core x 2800 MHz
  • Solaris 10
  • A single-threaded script to generate the data and, later on, simple LOAD DATA INFILE statements (to be deterministic);
  • For each test, the tables were recreated to be positively sure there was no caching at all (I also flushed the filesystem cache, though not on every iteration), and each test was repeated a couple of times.
The schema used was basically this:

CREATE TABLE `test_datetime` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`datetime` FIELDTYPE NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM;

Here FIELDTYPE is one of DATETIME, TIMESTAMP or INT. As you may have guessed already, the INT timestamps will be UNIX-style, produced by the UNIX_TIMESTAMP() built-in function, which obviously adds some work for the server. Remember, however, that calculations are done for DATETIME and TIMESTAMP as well, since both are stored in packed formats.
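
Spelled out, the three variants look like this (the table names are the ones used throughout the tests below):

CREATE TABLE `test_datetime` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`datetime` DATETIME NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM;

CREATE TABLE `test_timestamp` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`datetime` TIMESTAMP NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM;

CREATE TABLE `test_int` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`datetime` INT NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM;
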
The configuration file was pretty default, but for this test I don’t think there was much tweaking possible, since it’s all about INSERTs and table scans:

skip-locking
key_buffer = 128M
max_allowed_packet = 1M
table_cache = 512
sort_buffer_size = 2M
read_buffer_size = 2M
read_rnd_buffer_size = 8M
myisam_sort_buffer_size = 8M
thread_cache_size = 8
query_cache_type = 0
query_cache_size = 0
thread_concurrency = 4

For the initial generation of data, I wrote a small script to load 10,000,000 rows of timestamped data sequentially; a sketch of it follows below.
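
In case you want to reproduce the setup, here is a minimal sketch of such a generator as a stored procedure. This is a reconstruction, not the original script: the procedure name and the one-second spacing from the epoch are my assumptions (the epoch start is suggested by the 1970 ranges queried further down).

DELIMITER //
CREATE PROCEDURE generate_rows(n INT)
BEGIN
  -- insert n rows spaced one second apart, starting at the epoch;
  -- for test_int, the value would be wrapped in UNIX_TIMESTAMP() instead
  DECLARE i INT DEFAULT 0;
  WHILE i < n DO
    INSERT INTO test_datetime (`datetime`)
    VALUES ('1970-01-01 00:00:00' + INTERVAL i SECOND);
    SET i = i + 1;
  END WHILE;
END//
DELIMITER ;

CALL generate_rows(10000000);

The results were: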
            avg      min      max      Data_length
DATETIME    14111    14010    14369    130000000
TIMESTAMP   13888    13887    14122     90000000
INT         13270    12970    13496     90000000
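
For reference, the Data_length figures can be re-checked at any point from the table metadata, for instance:

mysql> SELECT table_name, data_length FROM information_schema.TABLES
    -> WHERE table_schema = DATABASE() AND table_name LIKE 'test\_%';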

We can see that DATETIME performed best, while INT was actually the slowest. So far it seems the time conversion function has a notable impact, considering the same amount of I/O is done as for TIMESTAMP (their Data_length is identical). That's why I did another test: I dumped the data using SELECT … INTO OUTFILE, so that the same, already converted data would be available:

mysql> select * from test_datetime into outfile '/tmp/test_datetime.sql';
Query OK, 10000000 rows affected (6.19 sec)


mysql> select * from test_timestamp into outfile '/tmp/test_timestamp.sql';
Query OK, 10000000 rows affected (8.75 sec)


mysql> select * from test_int into outfile '/tmp/test_int.sql';
Query OK, 10000000 rows affected (4.29 sec)

Interesting how things turned around: we have now eliminated the calculations for INT, but since the DATETIME and TIMESTAMP fields are exported as plain strings, they have to be converted back for every row. Looking at the calculations done for both types, it's easier to understand why: TIMESTAMP is stored in a more packed form (4 bytes against DATETIME's 8), which makes turning it back into a string more expensive.
I did more testing. In order to use the same data for the queries below, which will be mostly reads, I did a quick transformation of the data:

alter table test_datetime rename test_int;
alter table test_int add column datetimeint INT NOT NULL;
update test_int set datetimeint = UNIX_TIMESTAMP(datetime);
alter table test_int drop column datetime;
alter table test_int change column datetimeint datetime int not null;
select * from test_int into outfile '/tmp/test_int2.sql';
drop table test_int;

So now I have exactly the same timestamps from the DATETIME test, and it will be possible to reuse the originals for TIMESTAMP tests as well.

mysql> load data infile '/export/home/ntavares/test_datetime.sql' into table test_datetime;
Query OK, 10000000 rows affected (41.52 sec)
Records: 10000000 Deleted: 0 Skipped: 0 Warnings: 0


mysql> load data infile '/export/home/ntavares/test_datetime.sql' into table test_timestamp;
Query OK, 10000000 rows affected, 44 warnings (48.32 sec)
Records: 10000000 Deleted: 0 Skipped: 0 Warnings: 44


mysql> load data infile '/export/home/ntavares/test_int2.sql' into table test_int;
Query OK, 10000000 rows affected (37.73 sec)
Records: 10000000 Deleted: 0 Skipped: 0 Warnings: 0

As expected, INT loads fastest since it is stored as-is, while the other two have to be recalculated. Notice how TIMESTAMP still performs worst, even though it uses half of DATETIME's storage: unlike DATETIME, a TIMESTAMP value is also converted between the session time zone and UTC when stored, which is extra work.
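
Out of curiosity, those 44 warnings could have been inspected right after the load; this check is my addition, not part of the original run:

mysql> SHOW WARNINGS LIMIT 5;
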
Let's check the performance of a full table scan:

mysql> SELECT SQL_NO_CACHE count(id) FROM test_datetime WHERE datetime > '1970-01-01 01:30:00' AND datetime < '1970-01-01 01:35:00';
+-----------+
| count(id) |
+-----------+
|    211991 |
+-----------+
1 row in set (3.93 sec)


mysql> SELECT SQL_NO_CACHE count(id) FROM test_timestamp WHERE datetime > '1970-01-01 01:30:00' AND datetime < '1970-01-01 01:35:00';
+-----------+
| count(id) |
+-----------+
|    211991 |
+-----------+
1 row in set (9.87 sec)


mysql> SELECT SQL_NO_CACHE count(id) FROM test_int WHERE datetime > UNIX_TIMESTAMP('1970-01-01 01:30:00') AND datetime < UNIX_TIMESTAMP('1970-01-01 01:35:00');
+-----------+
| count(id) |
+-----------+
|    211991 |
+-----------+
1 row in set (15.12 sec)

Then again, TIMESTAMP performed worst, and the recalculations seemed to have an impact, so the next thing worth testing was to eliminate those recalculations: find the constant equivalents of the UNIX_TIMESTAMP() values and use them instead:

mysql> select UNIX_TIMESTAMP('1970-01-01 01:30:00') AS lower, UNIX_TIMESTAMP('1970-01-01 01:35:00') AS bigger;
+-------+--------+
| lower | bigger |
+-------+--------+
|  1800 |   2100 |
+-------+--------+
1 row in set (0.00 sec)


mysql> SELECT SQL_NO_CACHE count(id) FROM test_int WHERE datetime > 1800 AND datetime < 2100;
+-----------+
| count(id) |
+-----------+
|    211991 |
+-----------+
1 row in set (1.94 sec)


It's very important to note that I forced a filesystem cache flush before that last SELECT. Now INT finally shows a huge difference: in fact, the expected halving of the query time, probably due to its storage size (half that of DATETIME). I also tried to benchmark UNIX_TIMESTAMP() alone, but could not come up with any conclusion:

mysql> select benchmark(100000000,UNIX_TIMESTAMP('1970-01-01 01:35:00'));
+-------------------------------------------------------------+
| benchmark(100000000,UNIX_TIMESTAMP('1970-01-01 01:35:00'))  |
+-------------------------------------------------------------+
|                                                           0 |
+-------------------------------------------------------------+
1 row in set (1 min 7.02 sec)

And this result is indeed weird: if each UNIX_TIMESTAMP() call really were that expensive, the earlier scan on test_int should have taken far longer, so apparently the function is not called twice for each row. Most likely the optimizer evaluates UNIX_TIMESTAMP() of a constant string only once per query.
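
One way to check that suspicion on the SELECT side is to ask the server what it actually executed: EXPLAIN EXTENDED followed by SHOW WARNINGS prints the rewritten statement, where a folded constant would show up (a quick sketch; how much of the folding is displayed varies between versions):

mysql> EXPLAIN EXTENDED SELECT count(id) FROM test_int
    -> WHERE datetime > UNIX_TIMESTAMP('1970-01-01 01:30:00')
    -> AND datetime < UNIX_TIMESTAMP('1970-01-01 01:35:00');
mysql> SHOW WARNINGS;
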
Regarding TIMESTAMP vs DATETIME, I see only problems with the former: it's limited to dates after 1970 (like INT, except that with INT you can apply your own time offset), it's slower in every aspect, and remember that '0' in a TIMESTAMP means '0000-00-00 00:00:00' and not '1970-01-01 00:00:01'. According to a related test by the DBTuna team comparing range scans on both data types, TIMESTAMP comes out 55% slower, so I start to question this data type's very existence…
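
Both gotchas are easy to reproduce in a scratch session (assuming the default, non-strict sql_mode of that era; the table is just a throwaway example of mine):

mysql> CREATE TABLE t (ts TIMESTAMP NOT NULL, dt DATETIME NOT NULL) ENGINE=MyISAM;
mysql> INSERT INTO t VALUES ('1960-05-01 12:00:00', '1960-05-01 12:00:00');
mysql> SELECT * FROM t;
-- the DATETIME column keeps the pre-1970 value, while the out-of-range
-- TIMESTAMP is silently clipped to '0000-00-00 00:00:00', with a warning
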
Anyway, what I've tried to demonstrate are the usage scenarios you'll need to consider for your own real cases: INT remains smaller in storage (50% of DATETIME) and will only perform better if the INSERTs and SELECTs are already fed with INT values, which is especially relevant for write-intensive scenarios, while DATETIME takes extra responsibility and care off the developer. Programmers don't usually care about this and want the most flexibility from the database, so it's up to you to find a compromise with them. I may have provided enough arguments for an endless discussion on both sides, though.
