Rethinking the DBMS


Examining a join

Filed under: Theory — Tags: — Ben Samuel @ 12:28

Basic concept
The basic concept behind a join is that given two tables you can match rows based on some common attribute.
It’s simple enough in principle:
Join these two tables and you get an extended list of students and their professors:
It’s straightforward enough to think of stepping through each student and looking for the corresponding professor based on the course.
There’s also another way to use a join that might be a little surprising:
This involves joining against an infinite table of areas. It’s really just the times table, only laid out differently than the way it’s traditionally done in elementary school. The idea is that rather than have each row stored on disk, it’s done by lookup data up by passing it to a mathematical function.
Joining against it ought to work like any table. After all, when you look up the living room, you have width = 15 and length = 40, so the area must be 15 ⨉ 40 = 600.
There are problems, though. If it’s going to be just like any other table, you ought to be able to join against any columns. Say you had a table with width and area, it’s easy to see that length = area / width. But some mathematical operations aren’t easily reversed, such as taking a derivative. Some are deliberately difficult, like calculating a hash.
The reason I bring up virtual tables, though, is that it shows how a join can’t be expressed as something as simple as stepping through rows and finding matches. While that seems simple, it’s actually adding complexity.
Cartesian Product
Explaining why takes some doing. We start with the cartesian product.
Let’s go to something basic, the number line:
For now, just consider one point, say x = 4. So we’re just saying one thing about the world: that x is 4. If you add the other points in the line, it means that x = -3 or x = -1 or x = 1, &c.
It’s like a little table:
Let’s add another dimension:
This is another way of saying the same thing. Since we haven’t aid anything about y, it could be anything, hence the infinite vertical lines.
What if we leave x aside and say that y could be 1 or -1?
Right, now let’s ask what happens if (x = -3 or x = -1 or x = 1 or x = 3.5 or x = 5) AND (y = 1 or y = -1)?
It makes sense graphically. After all, I simply superimposed each set of lines and highlighted the points that match. Here’s what it looks like as tables:
Recall that the final table is the same as this:
(x = -3 or x = -1 or x = 1 or x = 3.5 or x = 5) AND (y = 1 or y = -1)
If you don’t remember your boolean algebra, what’s happening is simple distribution:
(a or b or c) and (d or e) = (a and d) or (a and e) or (b and d) or (b and e) or (c and d) or (c and e)
This might look more familiar:
(a + b + c) * (d + e) = a*d + a*e + b*d + b*e + c*d + c*e
Now, I’m not saying they’re the same thing, not by a long shot. For example, you certainly can’t say this:
(5 * 3) + (2 *4) = (5 + 2) * (5 + 4) * (3 + 2) * (3 + 4)
… but you can do that in boolean algebra. The point is that the pattern is the same, and that pattern arises because the cartesian product is essentially the result of asking about this and that.
Why does a join filter out incorrect rows?
The two parts of a join are to perform a cartesian product and then to “filter” the results. Recall the first example:
So, through the process of performing the join we generated every possible combination, and then threw out combinations that didn’t match. That makes sense, but why?
Consider the first row of Students. I’ll claim we’re saying, “Dennis is a student AND the grade is A+ AND the course is math.” So to join Dennis with Literature we’d have to say, “Dennis is a student AND the grade is A+ AND the course is math AND the course is literature AND the professor is Dr. Black AND the Credits are 3.”
By my claim, it’s clear that saying the course is two different things is a logical contradiction. Let’s also recall that the table is really just row OR row OR row. So the rows that are logical contradictions are false values, and when you have false OR true, you can just throw out the false values.
But how can I claim the ANDs I referred to? Consider the table decomposed like so:
These are simple little one column, one row tables. I call them “pairs” because they’re essentially a key-value pair, but they’re basically the smallest unit the relational algebra can address.
Normally, we don’t do cartesian products because the result becomes exponentially large. If you have two tables with 10 rows, the result is 100 rows.
But 1 * 1 * 1 is just 1! So each row is just the cartesian product of its pairs:
And since the whole table can be thought of as the union of its rows. This is a kind of normal form, I call it the “relational sum of products” which should be familiar to anyone whose done circuit design. More interesting is the fact that there actually is a “relational product of sums,” but I’ll get to that later.
The goal with this article was, hopefully, to get into the fine detail of why joins work the way they do. What we’ve covered is the two stages of a join: cartesian product and filtering, and shown the theory of why they work that way, and also, hopefully, made the case that there’s some simple and elegant math behind it. In a future article, I’ll explore the conjoin and disjoin operators and show how various operators are derived from these, and propose a negate operator.


Leave a Comment »

No comments yet.

RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Blog at

%d bloggers like this: