Friday, 2 November 2018

How to Simulate FULL OUTER JOIN in MySQL

In this article I’ll show several ways to emulate a FULL OUTER join on a RDBMS that doesn’t support it, as is the case with even the most recent versions of MySQL. This useful query is surprisingly tricky to get right.

Introduction

A standard SQL FULL OUTER join is like a LEFT or RIGHT join, except that it includes all rows from both tables, matching them where possible and filling in with NULLs where there is no match. I’ll illustrate that for clarity. Here are two of my favorite tables, applesand oranges:
VarietyPrice
Fuji5.00
Gala6.00
VarietyPrice
Valencia4.00
Navel5.00
I’ll join them on price. Here is the left join:
select * from apples as a
    left outer join oranges as o on a.price = o.price
varietypricevarietyprice
Fuji5Navel5
Gala6NULLNULL
And the right join:
select * from apples as a
    right outer join oranges as o on a.price = o.price
varietypricevarietyprice
NULLNULLValencia4
Fuji5Navel5
The FULL OUTER JOIN of these two tables, on price, should give the following result:
varietypricevarietyprice
Fuji5Navel5
Gala6NULLNULL
NULLNULLValencia4
That’s the result I’ll be working toward in this article. Here is a script to create and populate the example tables, so you can follow along:
create table apples (variety char(10) not null primary key, price int not null);
create table oranges (variety char(10) not null primary key, price int not null);
insert into apples(variety, price) values('Fuji',5),('Gala',6);
insert into oranges(variety, price) values('Valencia',4),('Navel',5);

Method 1: two JOINs and a UNION

One method to simulate a full join is to take the union of two outer joins, for example,
select * from apples as a
    left outer join oranges as o on a.price = o.price
union
select * from apples as a
    right outer join oranges as o on a.price = o.price
This gives the desired results in this case, but it isn’t correct for all cases. Suppose there are duplicate records in the tables (remove the primary key and insert twice to create this situation). UNIONeliminates duplicates, which a full join doesn’t do. UNION ALL isn’t the right answer either, because it will cause spurious duplicates. In fact, UNION generates two independent result sets and then combines them, so there is no way to get around this, because the two result sets need to “know about each other” to produce the right results.
There are legitimate cases where duplicate results are expected and correct. For instance, even when the rows are unique, selecting only certain columns, in which there are duplicates, could cause this situation. This doesn’t apply in relational theory, because a set never has duplicates no matter what, but it does in SQL.

Method 2: UNION ALL and an exclusion join

One way to make UNION include only the duplicates I want is to use an exclusion join to eliminate anything from the second result that is already included in the first, like this:
select * from apples as a
   left outer join oranges as o on a.price = o.price
union all
select * from apples as a
   right outer join oranges as o on a.price = o.price
where a.price is null;
This handles duplicate rows correctly and doesn’t include anything it shouldn’t. It’s necessary to use UNION ALL instead of plain UNION, which would eliminate the duplicates I want to keep. This may be significantly more efficient on large result sets, since there’s no need to sort and remove duplicates.

Method 3: use a mutex table

There’s a case where UNION won’t work: older versions of MySQL don’t support it. All is not lost, though.
I’ve written several articles explaining how to start with a set of mutually exclusive numbers (which I informally call a “mutex table”), then use the mutual exclusivity of the numbers to join things together in interesting ways. This lets me simulate subqueries and unions on earlier versions of MySQL, for example. This approach seems like it might work well here, too. For the following queries I’ll assume my mutex table has the values 0 and 1. Here’s a baseline mutex query for these two tables:
select * from mutex
    left outer join apples as a on i = 0
    left outer join oranges as o on i = 1;
ivarietypricevarietyprice
Fuji5NULLNULL
Gala6NULLNULL
1NULLNULLValencia4
1NULLNULLNavel5
Of course this isn’t a full join. If I want to full join on price, naturally, I need to include price in the join criteria somewhere, and the query above doesn’t even mention the price. But it’s a starting point for tinkering.
The mutex values in the leftmost column, combined with the join criteria, ensure that every row in the two tables gets included on its own row. The mutual exclusivity causes the Navel row not to be matched to the Fuji row, even though they have the same price. The correct behavior of a full join on price is to “fill in” the NULL values where the prices are equal. This modification to the join criteria will fill it in:
select * from mutex
    left outer join apples as a on i = 0
    left outer join oranges as o on i = 1 or a.price = o.price;
The or a.price = o.price relaxes the mutual exclusivity, telling the join to keep the rows separated unless they have the same price:
ivarietypricevarietyprice
Fuji5Navel5
Gala6NULLNULL
1NULLNULLValencia4
1NULLNULLNavel5
That’s getting closer. There is a spurious row, though. The Navel row at the bottom of the result set shouldn’t be there; it has already been matched to the Fuji row earlier, so there’s no need to include it with all those NULLs as though there were no matching row in apples. Can I eliminate the Navel row without eliminating the Valencia row?
That turns out to be harder to do. I stared at it for a while, thinking I could include a WHERE clause that would eliminate spurious rows based on the value of i, but after a bit I got a reality check: the row has already been included above, and WHERE clauses work a row at a time, so there’s no way to assert something about one row while applying the WHERE clause to another row. This simple fact is all I needed to realize there’s no way to eliminate the Navel row with the given information.
What I can do, though, is stack another copy of the apples table onto the right-hand side of the results thus far, matching them to the oranges values and confining them to rows with mutex value 1 instead of 0. Now I can write a WHERE clause to see if a row in the i = 1 part of the result set matches a row in the i = 0 part. I’ll write it without the WHERE clause to start:
select * from mutex
   left outer join apples as a on i = 0
   left outer join oranges as o on i = 1 or a.price = o.price
   left outer join apples as a2 on i = 1 and a2.price = o.price;
ivarietypricevarietypricevarietyprice
Fuji5Navel5NULLNULL
Gala6NULLNULLNULLNULL
1NULLNULLValencia4NULLNULL
1NULLNULLNavel5Fuji5
Now there is a way to tell between the Navel row, which I don’t want, and the Valencia, which I do: the Navel has non-NULL values in the rightmost copy of apples, but the Valencia doesn’t. All I have to do is eliminate rows that have matching values:
select * from mutex
   left outer join apples as a on i = 0
   left outer join oranges as o on i = 1 or a.price = o.price
   left outer join apples as a2 on i = 1 and a2.price = o.price
where o.price is null or a2.price is null
If the WHERE clause is hard to understand, perhaps it’s easier to think of it this way: where not(o.price is not null and a2.price is not null). Both clauses are identical; all I did was apply some boolean identities. Here is the result:
ivarietypricevarietypricevarietyprice
Fuji5Navel5NULLNULL
Gala6NULLNULLNULLNULL
1NULLNULLValencia4NULLNULL
That result has the correct rows, but it has some extra columns, which I don’t need. Here’s the final query:
select a.*, o.* from mutex
   left outer join apples as a on i = 0
   left outer join oranges as o on i = 1 or a.price = o.price
   left outer join apples as a2 on i = 1 and a2.price = o.price
where o.price is null or a2.price is null
Remember, this can’t be done without the mutex table, because I need something to provide non-NULL values for every row; otherwise the joins would not include values where the leftmost table has no rows to contribute.
This technique works if there are duplicate rows, and works on older versions of MySQL, but is probably the least efficient of the three I’ve demonstrated here. As usual, which query is appropriate depends on circumstances.

0 comments:

Post a Comment