Rethinking the DBMS

2009/03/04

Examining a join

Filed under: Theory — Tags: — Ben Samuel @ 12:28

Basic concept
The basic concept behind a join is that given two tables you can match rows based on some common attribute.
It’s simple enough in principle:
untitledimage17.png
Join these two tables and you get an extended list of students and their professors:
untitledimage15.png
It’s straightforward enough to think of stepping through each student and looking for the corresponding professor based on the course.
There’s also another way to use a join that might be a little surprising:
untitledimage16.png
This involves joining against an infinite table of areas. It’s really just the times table, only laid out differently than the way it’s traditionally done in elementary school. The idea is that rather than have each row stored on disk, it’s done by lookup data up by passing it to a mathematical function.
Joining against it ought to work like any table. After all, when you look up the living room, you have width = 15 and length = 40, so the area must be 15 ⨉ 40 = 600.
There are problems, though. If it’s going to be just like any other table, you ought to be able to join against any columns. Say you had a table with width and area, it’s easy to see that length = area / width. But some mathematical operations aren’t easily reversed, such as taking a derivative. Some are deliberately difficult, like calculating a hash.
The reason I bring up virtual tables, though, is that it shows how a join can’t be expressed as something as simple as stepping through rows and finding matches. While that seems simple, it’s actually adding complexity.
Cartesian Product
Explaining why takes some doing. We start with the cartesian product.
Let’s go to something basic, the number line:
untitledimage22.png
For now, just consider one point, say x = 4. So we’re just saying one thing about the world: that x is 4. If you add the other points in the line, it means that x = -3 or x = -1 or x = 1, &c.
It’s like a little table:
untitledimage72.png
Let’s add another dimension:
untitledimage31.png
This is another way of saying the same thing. Since we haven’t aid anything about y, it could be anything, hence the infinite vertical lines.
What if we leave x aside and say that y could be 1 or -1?
untitledimage41.png
Right, now let’s ask what happens if (x = -3 or x = -1 or x = 1 or x = 3.5 or x = 5) AND (y = 1 or y = -1)?
untitledimage51.png
It makes sense graphically. After all, I simply superimposed each set of lines and highlighted the points that match. Here’s what it looks like as tables:
untitledimage62.png
Recall that the final table is the same as this:
(x = -3 or x = -1 or x = 1 or x = 3.5 or x = 5) AND (y = 1 or y = -1)
If you don’t remember your boolean algebra, what’s happening is simple distribution:
(a or b or c) and (d or e) = (a and d) or (a and e) or (b and d) or (b and e) or (c and d) or (c and e)
This might look more familiar:
(a + b + c) * (d + e) = a*d + a*e + b*d + b*e + c*d + c*e
Now, I’m not saying they’re the same thing, not by a long shot. For example, you certainly can’t say this:
(5 * 3) + (2 *4) = (5 + 2) * (5 + 4) * (3 + 2) * (3 + 4)
… but you can do that in boolean algebra. The point is that the pattern is the same, and that pattern arises because the cartesian product is essentially the result of asking about this and that.
Why does a join filter out incorrect rows?
The two parts of a join are to perform a cartesian product and then to “filter” the results. Recall the first example:
untitledimage17.png
So, through the process of performing the join we generated every possible combination, and then threw out combinations that didn’t match. That makes sense, but why?
Consider the first row of Students. I’ll claim we’re saying, “Dennis is a student AND the grade is A+ AND the course is math.” So to join Dennis with Literature we’d have to say, “Dennis is a student AND the grade is A+ AND the course is math AND the course is literature AND the professor is Dr. Black AND the Credits are 3.”
By my claim, it’s clear that saying the course is two different things is a logical contradiction. Let’s also recall that the table is really just row OR row OR row. So the rows that are logical contradictions are false values, and when you have false OR true, you can just throw out the false values.
But how can I claim the ANDs I referred to? Consider the table decomposed like so:
untitledimage18.png
These are simple little one column, one row tables. I call them “pairs” because they’re essentially a key-value pair, but they’re basically the smallest unit the relational algebra can address.
Normally, we don’t do cartesian products because the result becomes exponentially large. If you have two tables with 10 rows, the result is 100 rows.
But 1 * 1 * 1 is just 1! So each row is just the cartesian product of its pairs:
untitledimage19.png
And since the whole table can be thought of as the union of its rows. This is a kind of normal form, I call it the “relational sum of products” which should be familiar to anyone whose done circuit design. More interesting is the fact that there actually is a “relational product of sums,” but I’ll get to that later.
Conclusion
The goal with this article was, hopefully, to get into the fine detail of why joins work the way they do. What we’ve covered is the two stages of a join: cartesian product and filtering, and shown the theory of why they work that way, and also, hopefully, made the case that there’s some simple and elegant math behind it. In a future article, I’ll explore the conjoin and disjoin operators and show how various operators are derived from these, and propose a negate operator.

2009/02/24

Defining relation value and relation variable

Filed under: Theory — Tags: — Ben Samuel @ 15:49

This entry is to lay a little groundwork as many readers aren’t familiar with the relational algebra, or the calculus. (Incidentally, I recently got a copy of Logic and Databases, where Date goes into exactly what the two terms mean; more to come when I’ve finished reading it.) Even if you are, there are a lot of wrong ideas about some of the basics and anyone could be stuck with them.

  • A relation value, in the proper context, is a structured assertion about the world.

Context is important: if someone asks, “How many fingers do you have?” then the number 5, an integer value, is an assertion about the world. A value has meaning only inasmuch as it is placed in context.

As an example of what it means to place something in context, consider a computer program. By itself it is just a long stream of symbols, but if you run it in a computer it actually does something. This sonds like common sense, but in reality most people don’t understand this and often succumb to magical thinking. You see this often in the phrase “there ought to be a law” which carries the implicit assumption that merely writing a proscription against something will make it stop.

Structure means that you can derive new values from a value in a reproducible fashion. In particular, I’m talking about deterministic structure, such as the fact that you can add integers to get a certain sum, or decompose them to a product of primes. Nondeterministic structure would include statistical operations. For example, say you have a plain text representation of a book, written in English. You could analyze the frequency of various letters or letter combinations and assert with a high degree of confidence that the language is English, but there’s no mathematical operation that can say, for certain, that the language is X and you can’t take an argument from the book and algorithmically determine a counter-argument. The structure is there, the book definitely is written in English, it has sentences, ideas, themes, &c, but you’re forced to compare it indirectly with a mathematical model.

  • A relation variable is a portion of the larger context adding meaning to the value it is assigned.

The structure of a relation variable is typically represented as a table with labeled columns and unlabeled rows. The problem with any representation is that it tends to add stuff that shouldn’t be there. Here’s an example:

simple-table.png

This is a pretty typical example of how people record information. There are some minor issues, such as the fact that this representation suggests ordered columns and rows, but neither are ordered in a relation value. The biggest issue that this representation has added that simply doesn’t exist in an individual relation variable is meaning. Even if you weren’t sure what a sign-in sheet was, you could plainly see a list of people signing their name in chronological order all on the same day, during business hours.

In a typical DBMS, you might have a comment that explained roughly what your table is for, sometimes euphemistically known as a natural language predicate, but that’s just a comment to you. All you can really attach to a variable is a set of constraints, saying what values aren’t allowed. Even the predicate logic view doesn’t get you very far. The idea is that each row is a proposition. A proposition is saying that p(a, b, c, d) is true. And the predicate it exists in is just the pattern of the proposition, it adds no meaning.

The reason I bring all this up is that plenty of people did buy into the idea that adding labels and comments can add meaning; the latest example of attempting this is XML. Mathematically, XML gets you nothing you didn’t already have with symbolic expressions, which date back at least to the ‘60s. (This isn’t to say that the technologies that XML replaced weren’t worse.)

Conclusion What I’m going to describe in later posts is a minimalist representation for relation values, based on a small addition to the relational algebra. We’ll explore some basic relation operators and derive a standard form.

Blog at WordPress.com.