Rethinking the DBMS


Identifiers for science should be part of a broader system

Filed under: Uncategorized — Ben Samuel @ 00:47

There’s an interesting LinkedIn group discussing the issue of unique identifiers for researchers.
I understand the desire to restrict the problem to a reasonable domain, but some of the suggestions I’ve read seem unrealistic, such as “it’s critical that one researcher only gets one [digital scientist identifier].” (source) That depends on too many people doing the right thing when they have no real reason to care about doing it.
This really lets the cat out of the bag when it comes to identifying researchers because the whole scientific process expands, but I think most scientists agree that it ought to. A large problem in science is that there’s no serious mechanism for scientists to address falsehoods in the popular media. And while the scientific community may form a consensus on a particular issue, it’s difficult to express this. For example, to express a consensus on evolution a group of scientists launched Project Steve, but that only worked because the consensus was overwhelming.
The larger process begins to make sense if, instead of following the publishing model, we look at a more general model of interaction made up of smaller actions. Rather than articles as the unit of publishing, you might break it down as far as observations, assertions, arguments, criticisms, etc. Consider a hypothetical chain of events:

  • A researcher generates statistical data on cancerous growths.
  • The author summarizes the work for a paper.
  • The paper is published in a journal.
  • An advocacy group cites the paper in a press release.
  • The press release is picked up by a journalist.
  • A legislator reads the press release.

Now, imagine that a law comes out of this, and a concerned citizen hears about the new law and wants to investigate. There is virtually no way to go back and find the original research and determine whether the law in question actually reflects the original science in any way.
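To make the idea concrete, here is a minimal sketch of what tracing such a chain could look like if every step carried an identifier and a link to the step it was based on. Everything here is an assumption for illustration: the `Record` type, the `trace_back` function, and identifiers like `data:1` are made up and don't correspond to any real identifier scheme.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    """One step in the chain: an identified item that may cite an earlier one."""
    ident: str            # hypothetical identifier, not a real scheme
    kind: str             # what sort of item this is
    cites: Optional[str]  # identifier of the item this one is based on, if any


def trace_back(records: dict[str, Record], start: str) -> list[str]:
    """Follow `cites` links from `start` back toward the original source."""
    path: list[str] = []
    seen: set[str] = set()
    current: Optional[str] = start
    # Stop at the root, at a dangling citation, or if the links loop.
    while current is not None and current in records and current not in seen:
        seen.add(current)
        path.append(current)
        current = records[current].cites
    return path


chain = [
    Record("data:1", "statistical data", None),
    Record("paper:1", "journal paper", "data:1"),
    Record("release:1", "press release", "paper:1"),
    Record("story:1", "news story", "release:1"),
]
records = {r.ident: r for r in chain}

print(trace_back(records, "story:1"))
```

The point of the sketch is only that the walk back is trivial once the links exist; the hard part is getting anyone to record them.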
To a large extent, the value of research is what people write about it and the actions they take because of that.
After coming back to finish writing this post, I found a very interesting interview on a technology called CrossRef that does pretty much what I’m asking for. I guess great minds think alike…


Not another article about the doomed relational model

Filed under: Uncategorized — Ben Samuel @ 10:18

This one is pretty bad. The author, Mr. Bain, is, according to the bio, the founder of “a company that makes investments in early stage software development.” Well, his article is hype about some new technologies and hype is a necessary reality of life, but if he represents investors he ought to give them better advice than he’s giving in this article.
And from the first line, it doesn’t make sense:

Recently, a lot of new non-relational databases have cropped up both inside and outside the cloud. One key message this sends is, “if you want vast, on-demand scalability, you need a non-relational database”.

After that complete non sequitur:

During this time, several so-called revolutions flared up briefly, all of which were supposed to spell the end of the relational database. All of those revolutions fizzled out, of course, and none even made a dent in the dominance of relational databases.

If he’s just saying that all the silly articles predicting that the relational model was doomed were completely wrong, well, that seems vaguely familiar. But if he’s saying that non-relational technologies were attempted and then abandoned, I’d have to take issue with that. OODBMSs should count as one of the “so-called revolutions” and they’re still around and are, in fact, incorporated in many mainstream SQL DBMSs. Most of the other trendy technologies I can think of, such as XML or some of the data warehousing tech, were never really considered direct competitors to the relational model.
The lesson here is that the marketplace of human ingenuity is always bigger than your imagination and there’s plenty of space for all these various technologies. And, to be clear, I’m not pooh-poohing these technologies; rather, I’d like to follow them more closely because I think they’re extremely interesting. My argument is that they aren’t going to displace relational DBMS technology because they don’t compete with it.
What exactly is it that the relational database dominates, and how does it dominate? The what of the answer isn’t far-fetched, and it surprises me that a business guy like Mr. Bain missed it.
Here’s the basic business reason for a SQL DBMS: your business has many processes that involve many (metaphorical) moving parts and you need to capture that logic, known as business logic, in a set of rules. You need your data to conform, at all times, to those rules.
This isn’t to say that there isn’t a whole lot of other data that can be stored in other ways, or even must be stored in other ways, but here are a few of the things that businesses have to worry about:

  • Tracking employees.
  • Paying taxes.
  • Government regulations.
  • Tracking customers.
  • Tracking shipments.
  • Managing vendors.

This stuff goes back to before the Roman Empire, and, if anything, it’s only going to get more regulated and more complex, which means the need to track business logic, and thus the case for the relational DBMS, will only grow.
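As a minimal sketch of what “data that conforms, at all times, to the rules” can look like in practice, here are two made-up business rules declared to a SQL DBMS: every shipment must belong to a known customer, and a shipment’s weight must be positive. The table names, column names, and the `try_insert` helper are assumptions for illustration, and I’m using Python’s built-in `sqlite3` module only because it needs no setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite leaves foreign key enforcement off unless asked.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE customer (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE shipment (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        weight_kg   REAL NOT NULL CHECK (weight_kg > 0)
    );
""")

# Seed one valid customer; `with conn` commits on success.
with conn:
    conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Acme')")


def try_insert(customer_id, weight_kg):
    """Return True if the DBMS accepts the shipment, False if a rule rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO shipment (customer_id, weight_kg) VALUES (?, ?)",
                (customer_id, weight_kg),
            )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = try_insert(1, 12.5)        # known customer, positive weight
unknown_customer = try_insert(99, 5)  # no customer with id 99
negative_weight = try_insert(1, -3)   # violates the CHECK rule
stored = conn.execute("SELECT COUNT(*) FROM shipment").fetchone()[0]
print(accepted, unknown_customer, negative_weight, stored)
```

The rules live with the data rather than in each application, so no program talking to this database can store a shipment that breaks them.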
The second half of the answer, the how, is that the relational model is well defined. DBMSs using SQL (which is pretty much all of them) don’t use the relational model but what is often referred to as the SQL model. The differences between the relational model and the SQL model are a little arcane (and I’ll probably touch on them in this blog) but they are significant enough that it’s simply wrong to conflate the two.
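I haven’t said here which differences I have in mind, so take the following as an illustration rather than a list: two commonly cited ones are that a SQL table may hold duplicate rows (a relation is a set, so it can’t) and that SQL’s NULL brings in three-valued logic. A small sketch using Python’s built-in `sqlite3` module, with a made-up table `t`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No key is declared, so SQL happily stores the same row twice.
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t (x) VALUES (?)", [(1,), (1,), (None,)])

# Duplicates survive a plain SELECT and vanish only with DISTINCT.
all_rows = conn.execute("SELECT x FROM t WHERE x = 1").fetchall()
distinct_rows = conn.execute("SELECT DISTINCT x FROM t WHERE x = 1").fetchall()

# NULL = NULL is neither true nor false; it evaluates to NULL.
null_equals_null = conn.execute("SELECT NULL = NULL").fetchone()[0]
# So a comparison against NULL never matches, even for the NULL row.
null_matches = conn.execute("SELECT COUNT(*) FROM t WHERE x = NULL").fetchone()[0]

print(all_rows, distinct_rows, null_equals_null, null_matches)
```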
Many of the competitors to the relational model, such as the old Pick systems (still around, incidentally), used a model that was not well defined. So what’s wrong with loosely defined systems? There are a few problems, but I’d argue the complete and total lack of interoperability is the main one. At first blush XML refutes this, but really, it proves it.
What did we have pre-XML? A complete mess of binary formats, and the n-squared problem of writing translators for all of them. Did you ever try these translators? Even simply opening a Word document in a later version tends to destroy the underlying data.
Along comes the web and the successful tag structure of HTML is generalized to arbitrary data. LISP advocates complain that it’s just symbolic expressions with angle brackets.
So now we have a post-XML world. DOC has become DOCX and what’s changed? Well, loading and saving documents is somewhat more robust because some of the logic has been offloaded to reusable libraries. And now you can use XSLT to transform documents.
But the original problem remains: you have a whole mess of incompatible file formats and n-squared translators that don’t actually work.
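To put rough numbers on “n-squared”, here is a small sketch that counts the one-way translators needed when every format must convert directly to every other one. For contrast it also counts what would be needed if every format only converted to and from one shared format, which is the hope that, as argued above, XML hasn’t delivered. The function names are made up, and the shared-format case is an idealized assumption, not a description of any real toolchain.

```python
def pairwise_translators(n: int) -> int:
    """One-way translators needed so each of n formats converts directly to every other."""
    return n * (n - 1)


def shared_format_translators(n: int) -> int:
    """One-way translators needed if each format only converts to and from one shared format."""
    return 2 * n


for n in (3, 10, 50):
    print(n, pairwise_translators(n), shared_format_translators(n))
```

With ten formats that is 90 direct translators against 20 through a shared format, and the gap only widens from there.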
The various key-value stores are very new, and it remains to be seen how well they will work together, but I haven’t seen any evidence that they even attempt to address the issue.
(Updates: 4 Mar: formatting.)
