Friday, 2 November 2018

How to Write a SQL Exclusion Join

There is usually more than one way to write a given query, but not all ways are created equal. Some mathematically equivalent queries can have drastically different performance. This article examines one of the motivations for inventing LEFT OUTER join and including it in the SQL standard: improved performance through exclusion joins.
LEFT OUTER join syntax was added to the SQL-92 standard specifically to address certain queries that had only been possible with NOT IN subqueries. The disadvantage of using subqueries in these situations is that they may require creating many anonymous tables and probing into them. A clever optimizer could generate the same plan as a LEFT OUTER join, but since there was no such join at the time and query optimizers were much less capable, query performance could take quite a hit. I should pause here and say that I wasn’t programming in 1992, so I’m only speaking from the history I’ve read and heard, not from personal experience. However, I definitely have personal experience with the performance hits of NOT IN queries!

Setup

I’ll use two tables of data, apples and oranges.
apples
Variety   Price
Fuji      5.00
Gala      6.00

oranges
Variety   Price
Valencia  4.00
Navel     5.00
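A minimal sketch for recreating these tables so the queries below can be run as written. The column types are assumptions chosen for illustration; only the table names, column names, and data come from the tables above.

```sql
-- Assumed schema: the types are illustrative, not taken from the article.
create table apples (
    Variety varchar(20) not null,
    Price   decimal(4,2) not null
);

create table oranges (
    Variety varchar(20) not null,
    Price   decimal(4,2) not null
);

-- Sample data exactly as listed above.
insert into apples (Variety, Price) values ('Fuji', 5.00);
insert into apples (Variety, Price) values ('Gala', 6.00);

insert into oranges (Variety, Price) values ('Valencia', 4.00);
insert into oranges (Variety, Price) values ('Navel', 5.00);
```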

The old-style way

In old-style SQL, one joined data sets by simply specifying the sets, and then specifying the match criteria in the WHERE clause, like so:
select *
from apples, oranges
where apples.Price = oranges.Price
    and apples.Price = 5
Placing the join conditions in the WHERE clause is confusing when queries get more complex. It becomes hard to tell which conditions are used to join the tables (apples.Price = oranges.Price), and which are used to exclude results (apples.Price = 5). The two are equivalent in old-style joins, but as mentioned, some joins cannot be written in this style (more on this later).

The new way

The updated SQL standard addressed these issues by separating the join conditions from the WHERE clause. Join conditions now go in the FROM clause, greatly clarifying the syntax. Here is the simple join written in the newer style:
select *
from apples
    inner join oranges
        on apples.Price = oranges.Price
where apples.Price = 5

Outer joins

Separating the join conditions from the WHERE clause allows OUTER joins. There are three kinds of OUTER joins: LEFT, RIGHT, and FULL. The most common is a LEFT OUTER join, but all three types share one characteristic: rows that fail the join condition are not eliminated entirely from the result set. Instead, when data does not match, the row is included from one table as usual, and the other table’s columns are filled with NULLs (since there is no matching data to insert).
In a LEFT OUTER join, every row from the left-hand table is included, whether there is a matching row in the right-hand table or not. When there is a matching row in the right-hand table, it is included; otherwise the right-hand table’s columns are filled with NULLs. A demonstration may clarify:
select *
from apples
    left outer join oranges
        on apples.Price = oranges.Price
Variety  Price  Variety  Price
Fuji     5.00   Navel    5.00
Gala     6.00   NULL     NULL
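For comparison, here is a sketch of the same join written as a FULL OUTER join against the sample tables. It keeps unmatched rows from both sides, so the orange with no matching apple also appears, with NULLs in the apples columns. Not every database supports FULL OUTER JOIN, so treat this as illustrative.

```sql
-- Keeps unmatched rows from both apples and oranges.
select *
from apples
    full outer join oranges
        on apples.Price = oranges.Price;
-- With the sample data this yields three rows (in no guaranteed order):
--   Fuji, 5.00, Navel,    5.00
--   Gala, 6.00, NULL,     NULL
--   NULL, NULL, Valencia, 4.00
```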
INNER joins select matching rows in the result set. It is possible to use an INNER join to select apples and oranges with matching prices, as above. With LEFT OUTER joins it is possible to answer the reverse query, “show me apples for which there are no oranges with a matching price.” Simply eliminate matching rows in the WHERE clause:
select apples.Variety
from apples
    left outer join oranges
        on apples.Price = oranges.Price
where oranges.Price is null

Exclusion joins are not possible with inner joins

The above query is not possible with INNER JOIN. The following query does not accomplish the same thing:
select apples.Variety
from apples
    inner join oranges
        on apples.Price = oranges.Price
where apples.Price <> oranges.Price
In fact, this query will return nothing, because the join condition contradicts the WHERE clause. This query is not the same thing either:
select apples.Variety
from apples
    inner join oranges
        on apples.Price <> oranges.Price
Why? Because if there are no rows in oranges, nothing will get returned. And when oranges does have rows, an apple is returned once for every orange with a different price, even if some other orange matches its price. It is simply not possible to write this query with an INNER join or an old-style join, no matter what technique is used. Don’t be fooled by analyzing the two data sets presented in this article; for some cases you may be able to get the same behavior, but not for all possible data sets. There is a way to write this query using subqueries, though:
select apples.Variety
from apples
where apples.Price not in (
        select Price from oranges)
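The empty-table case is the easiest way to see the difference. The sketch below assumes the apples table from the Setup section and introduces a hypothetical empty_oranges table (my own name, with assumed column types) standing in for an oranges table with no rows.

```sql
-- Hypothetical stand-in for an oranges table that has no rows.
create table empty_oranges (
    Variety varchar(20) not null,
    Price   decimal(4,2) not null
);

-- Inner join on <>: there are no rows to pair with, so nothing is returned.
select apples.Variety
from apples
    inner join empty_oranges
        on apples.Price <> empty_oranges.Price;

-- Exclusion join: every apple is unmatched, so Fuji and Gala are returned.
select apples.Variety
from apples
    left outer join empty_oranges
        on apples.Price = empty_oranges.Price
where empty_oranges.Price is null;

-- NOT IN against an empty set is true for every row: Fuji and Gala again.
select apples.Variety
from apples
where apples.Price not in (
        select Price from empty_oranges);
```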

Outer joins and subqueries

Why use a LEFT OUTER join instead of a subquery? Depending on the query, the subquery approach may force the subquery to be evaluated for every row in the left-hand table (especially for correlated subqueries, where the subquery refers to values from the left-hand table). A LEFT OUTER join, by contrast, can often use a much more efficient query plan. Again, the two may be mathematically equivalent, and a good query optimizer may generate the same query plan for both, but this is not always the case. It depends heavily on the query, the optimizer, and how the tables are indexed. I have seen queries perform orders of magnitude better when rewritten with an exclusion join.
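For reference, a correlated-subquery version of the exclusion query might look like the following sketch, written against the sample apples and oranges tables. The subquery refers to apples.Price from the outer query, which is what makes it correlated.

```sql
-- Correlated subquery: the inner query references apples.Price,
-- so it is logically evaluated once per row of apples.
select apples.Variety
from apples
where not exists (
        select 1
        from oranges
        where oranges.Price = apples.Price);
-- With the sample data this returns Gala, the same as the exclusion join.
```

Unlike NOT IN, NOT EXISTS is unaffected by NULL prices in oranges; NOT IN returns no rows at all if the subquery produces a NULL. Whether this form or the exclusion join runs faster depends on the optimizer, so it is worth comparing the query plans on your own data.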
