Wednesday, 31 October 2018

Mysql: Database Skills: A Sane Approach To Choosing Primary Keys

Welcome to the Database Programmer. This blog is for anyone who wants to learn about databases, both simple and advanced. Since all non-trivial websites require a database, and since database skills are different from coding skills, there is very good reason for any programmer to master the principles of database design and use

There is no One True Primary Key

There are several competing theories out there on how to choose primary keys. Most of them tell you to use a single kind of key for all tables, usually an integer. In contrast to those theories I have found that a robust application uses different kinds of keys for different kinds of tables. In the last 15 years I have worked on projects large and small, simple and complex. Sometimes I had total technical control, and sometimes I had to work with what others gave me, and sometimes it was a little of both. Today's essay reflects what I have worked out in those years, and how I build my tables today. My goal is to report what actually works, not to promote a particular theory about how everybody should do something.
This week we will see "rules of thumb". A rule of thumb is a guiding idea that will tend to hold true most of the time, but which you may decide to change in certain circumstances.

Rule of Thumb 1: Use Character Keys For Reference Tables

A reference table is one that tends to be strongly constant over time and has relatively few columns. Sometimes a reference table may come already populated by the programmer. Examples include tables of country codes (perhaps with international telephone prefixes), a table of provinces or states within a country, or a table of timezones. In this series I have been using the example of a school management program, for that program we might give the user a reference table of school subjects like history, math, physics and so forth.
For these tables it is best to make a character primary key, which we often call a "code", as in "timezone code" or "country code" or "subject code." The strategy is to make a code which can be used on its own as a meaningful value that people can understand. This gives us tables that are easier to use for both programmer and end-user.
Let's consider our school management program. We have a table of teachers (populated by the school staff), and a table of subjects which we have provided as a reference table. When a teacher joins the faculty, somebody must enter the subjects that that teacher is qualified to each. The tables below show two examples of what this table might look like, which is easier to read?
TEACHER - SUBJECT CROSS REFERENCE

EXAMPLE 1: INTEGER KEYS           EXAMPLE 2: CHARACTER KEYS

Teacher | Subject                 Teacher     |  Subject
--------+----------               ------------+-----------
72      | 28                      SRUSSEL     |  PHYSICS
72      | 32                      SRUSSEL     |  CALCULUS
72      | 72                      SRUSSEL     |  HISTORY
45      | 28                      ACLAYBORNE  |  PHYSICS
45      | 29                      ACLAYBORNE  |  CELLBIOLOGY
45      | 45                      ACLAYBORNE  |  RUSSIAN
The table of character keys is much easier to work with, for the simple reason that many times you can just use the codes themselves, so you can avoid a lot of JOINs to the main tables. With integers you must always JOIN to the master table so you can get a meaningful value to show the user. But not only is the table itself easier to read when you are debugging, it is easier to work with when writing queries:
-- The character key example is pretty simple:
SELECT teacher,subject FROM teachers_x_subjects

-- The integer key absolutely requires joins
SELECT x.teacher_id,x.subject_id
       t.name,s.description
  FROM teachers_x_subjects x
  JOIN teachers t ON x.teacher_id = t.teacher_id
  JOIN subjects s ON x.subject_id = s.subject_id
I often hear people say they do not like SQL because it is so complicated and they hate doing so many JOINs. It makes me wonder if the person is lost in a JOIN jungle caused by very bad advice about always using integer primary keys.
If you are using some kind of ORM system that tries to protect you from coding any SQL, that basic problem of over-complicated tables will still appear in your code. One way or another you must enter details that tell the ORM system how to get the descriptions, which would not be necessary if the keys were meaningful character values.
We can now see the surprising fact that the integer keys will slow us down in many situations. Not only do they have no performance advantage, but they actually hurt performance. The reason is because they require joins on almost every query. A 3-table query with two joins will always be much slower than a 1-table query with no joins. If you are using an ORM system that does not do JOIN's, but instead does separate fetches, then you have 3 round trips to the server instead of 1, and heaven forbid you have queries in a nested loop, the performance will simply crash and burn. All of this is kind of ironic since you so often hear people blindly repeat the dogmatic phrase "We will use integer keys for performance reasons..."

Rule of Thumb 2: Use Character Keys for Small Master Tables

Many database programmers use the term "master table" to mean any table that lists the properties of things that have some permanence, like customers, teachers, students, school subjects, countries, timezones, items (skus), and anything else that can be listed once and used many times in other places. Generally a master table has more columns than a simple reference table.
Some master tables are small and do not change often. In our ongoing example of a school management application, the list of teachers is a good example of a small master table. Compared to the list of students, which is much larger and changes every year, the table of teachers at most schools (except for huge state universities) will have only a few changes each year.
For tables like this it is good to allow the user to enter character keys if they want to. Some schools will insist on being allowed to choose their own codes like 'SRUSSEL' for "Saxifrage Russel", while others will say, "Why should I have to make up a code, can't the computer do that?"
For these tables I have found it useful to always define the primary key as a character column, and then to allow some flexibility in how it is generated. Common ways of generating codes include:
  1. Letting the user make up their own code
  2. Generating a code out of some other column or columns, like first letter of first name, plus 5 letters of last name, plus three numeric digits. (This used to be very popular in decades past).
  3. Generate a number.
The key idea here is to follow the needs of your users. Option #2 above is one of the most useful because it gives you the best of both worlds.

Rule of Thumb 3: Use Integers For Large Master Tables

Some master tables are large or they change often, or both. In our ongoing example of a school management application, the list of students will change every year, with many students coming and going. Another example is a doctor's office that has patients coming and going all of the time. I have found it best to use plain integer keys here because:
  • Unlike small master tables (like teachers) or reference tables (like school subjects), a code is not likely to have any meaning for the end-user, so the biggest argument for using it does not hold.
  • Unlike reference tables, the master table is likely to have many more columns and you will probably end up JOINing to the table many times. This means our other big reason for using codes, which is to avoid JOINs, does not hold either.
  • It is not realistic to expect end-users to be making up codes for large tables, and since the codes will have no value, why should the end-user be troubled with the job?
  • Writing algorithms to generate unique codes will run into more difficulties, and since the code has no value why bother?

Rule of Thumb 4: Use Integers For Transaction Tables

Many database programmers use the term "transaction table" to mean any kind of table that records some kind of interaction or event between master tables. In an eCommerce program the shopping cart tables are all transaction tables, they record the purchase of items by customers. In our school management program the actual classes taken by students are transactions, because they record specific interactions between students and teachers.
For these tables the auto-generated integer key tends to be the most useful. I am not going to present any arguments for this because most programmers find it self-evident. It should be enough to say that any attempt to use a compound key (like customer + date ) always ends up causing a problem by limiting what can be entered, so the meaningless integer key is the way to go.

Rule of Thumb 5: Use Multi-Column Keys In Cross References

A useful database will end up with a lot of cross reference tables in it. A cross-reference table is any table that lists various facts about how master tables relate to each other. These tables are extremely useful for validating transactions. In fact, next week's entry will be all about these tables and how to use them.
For now the important point is that the primary key of a cross-reference is a combination of the foreign keys. We do not make up an extra column, either integer or character.
TEACHER-SUBJECT CROSS REFERENCE

Teacher     |  Subject
------ -----+-------------
SRUSSEL     |  PHYSICS
SRUSSEL     |  CALCULUS
SRUSSEL     |  HISTORY
ACLAYBORNE  |  PHYSICS
ACLAYBORNE  |  CELLBIOLOGY
ACLAYBORNE  |  RUSSIAN
The SQL for this table would resemble something like this:
CREATE TABLE teachers_x_subjects (
    teacher char(10)
   ,subject char(10)
   ,primary key (teacher,subject)
   ,foreign key (teacher) references teachers(teacher)
   ,foreign key (subject) references subjects(subject)
)
The reasons for this are rather complex, and next week the entire entry will be devoted to this and similar ideas. For now we will note that this approach lets us validate teacher-class assignments so that no teacher is assigned to teach a class she is not qualified for. Using a new column as a primary key does not allow that, and therefore leads to more complicated and error-prone code.

Rule of Thumb 6: Use Given Keys For Non-Insert Imports

Many systems today that we create will interact with systems that already exist. A typical eCommerce program will get a list of items and maybe even customers from the company's main computer system.
For some of these tables, your own system will absolutely never make new rows. A very common example is a table of items on an eCommerce site that is loaded up from some other computer system.
For these tables, the simplest route is to use whatever key exists on the table as it is given to you. Any other route involves more work with no clear motivation for putting out the effort.

Rule of Thumb 7: Use Integer Keys for Import/Export Tables

Sometimes you may have a table whose original values come from another system, but unlike the previous case your own system is generating new rows for the table, and you may have to send these rows back to the original system.
One classic example of this is a list of customers. I created a website a few years ago where the list of customers is updated from a different system from time-to-time. However, new customers can also sign up online. Both systems are handing the customer list back and forth from time to time to keep them reconciled.
In these cases I have an integer primary key for the table because it follows Rule of Thumb 3, it is a large master table. The most important concept here is that you must not try to combine your key and the key from the original table. Keep the key from the original table in its own column, index on it, and use it for updates, but do not try to enforce it as a unique column. The other system must take care of its own key, and your system must take care of yours.

Rule of Thumb 8: Use An Object Id On All Tables

Back when people were getting excited about the concept of "Object-Relational Databases", they came up with the term "object id" to denote a column that contains some unique value but otherwise has no meaning. The same idea exists with different names, but Object ID is now the term that most people understand so that is the term I will use.
Your programs can be made simpler in many cases if you add an object id to every single table in addition to the primary key. An object id is useful specifically for user interface code. If you use an object id, then it is easier to write UPDATE and DELETE statements, and it is easier to write framework or ORM code that does these things for you.
If you are following these rules of thumb closely in your project then it is important not to use the object id as a primary key, and therefore you may never use it as a foreign key either. If you use an object id as the primary key then you lose a lot of the benefits of the character keys listed above.
Also if you follow these rules in your projects it means that your transaction tables have both an auto-generated primary key like CART_ID and an auto-generated object id. Some programmers are bothered by this because we don't like the idea that two columns appear to be doing the same thing, and we try to save a column. But personally this does not bother me because it helps me write robust applications, and this is not 1985 where a 10MB hard drive cost hundreds of dollars.

Absolute Rule 1: Only Atomic Values

This is not merely a "rule of thumb" but a rule that I follow absolutely. It is actually part of First Normal Form, which is that column values must be atomic, or indivisible. Another way to say it is that the column must not have "subvalues" buried in it.
I have included this rule here instead of with First Normal Form because when most programmers violate this rule they are making primary keys by combining different values together. In our example of a school program, if we have a list of the actual students taking classes in a given school year, you might have a squashed-up primary key column like this:
CLASS_CODE                | STUDENT
--------------------------+---------------
SRUSSEL-2007-PHYSICS      | NAI
SRUSSEL-2007-MATH         | PCLAYBORNE
ACLAYBORNE-2007-RUSSIAN   | JBOONE
ACLAYBORNE-2007-MATH      | NAI
There are two practical problems with doing this:
  1. You cannot use a foreign key to validate the sub-values, so you must code validation manually.
  2. Retrieving the sub-values requires extra code, either in the SELECT or in your client code. If the values were in separate columns this would not be necessary.

Absolute Rule 2: No Magic Values

Another rule that I follow is to absolutely never have magic values. A magic value is a value in a column that causes some non-obvious result. I have included this is in this essay because most programmers who break this rule do so by hard-coding special actions to occur based on values of keys in reference tables and master tables.
An example might be a table of teachers, where one of the teacher values is something like "SUBSTITUTE", and the program is hardcoded to do a lot of different things when it sees this value. Magic values are bad because the code is harder to debug. It may not be obvious to a programmer that some special value of the TEACHER column would cause special actions to occur. But if you have a column called FLAG_SUBSTITUTE then any programmer who must maintain code written by somebody else will have a much easier time of it.
Magic numbers also confuse end-users. It may seem obvious to us that the value "SUBSTITUTE" in the teacher column means substitute, but if this value causes other things to occur, and we are in the regular habit of having these values in lots of tables, then the compound effect can be lots and lots of phone calls from confused users, and big trouble for the software developer's bottom line.
Finally, magic numbers limit you. If you use the value "SUBSTITUTE" as a single teacher in the teachers file, then how do you keep track of the dozen-odd substitutes the school may hire in a year? The end-user is stuck here, they must use pen and paper. It is much better to allow them to enter the substitute as a regular faculty member with a FLAG_SUBSTITUE column to check off.
Magic numbers have plagued programming since long before databases came around. 

Conclusion: Many Kinds of Tables, Many Kinds of Keys

This week we have seen that there can be many practical benefits to using different kinds of keys for different kinds of tables. Using the right kind of key for the right kind of table leads to simpler code and better performance, whether you code SQL directly or use an ORM system.
Remember always that your application will always follow the same structure as your tables. If the tables are designed well, the code will be lean, tight, efficient, and robust. Because table design is so important, it is best to know well the different kinds of tables there are: reference, master, cross-reference, and transaction, and to build the keys wisely.

0 comments:

Post a Comment