Oh relational databases, that tired old relic of another age. Codd and friends were great in their time, but serious software engineers need to move on.
People building Web Scale™ software
You’ve probably heard a similar sentiment at some point. That relational databases were great, but they are dinosaurs that are slowly being replaced by non-relational databases because they just can’t keep up with the demands of modern companies.
In other words, they don’t scale. They aren’t agile.
Some of these criticisms can be valid, but only in context.
We don’t ask ourselves how a technology that has arguably had more performance work done on it than almost any other class of software can be so easily surpassed by brand new entrants into the market.
As Rich Hickey once said:
Programmers know the benefits of everything and the tradeoffs of nothing.
We see the amazing benchmarks that some NoSQL databases provide, and we say “wow, they are so much faster/better/scalable than relational databases.” But we don’t ask ourselves why.
Instead of asking “how are they so much better?” we should be asking “what are they giving up?”
You see, relational databases aren’t dinosaurs. They aren’t lumbering prehistoric relics doomed to extinction by a changing world. They are sharks. Apex predators honed by millions of years of evolution into a perfectly adapted creature that is just as effective today as it was eons ago. There is a reason they are still around, and their supremacy in their domain hasn’t been questioned.
What Relational Databases Give Us
As a primer, most relational databases come with four primary guarantees:
Atomicity – The guarantee that any series of operations within a transaction are treated as a single unit. The entire thing either succeeds or fails, and won’t leave you in an invalid state.
Consistency – The guarantee that any operation against the database will leave it in a valid state.
Isolation – The guarantee that any operations executed concurrently will leave the database in the same state as they would have if they were executed sequentially. Generally this means that transactions can’t see data being modified by other transactions.
Durability – The guarantee that once a transaction is committed, it will stay committed, regardless of whether the system crashes or power fails.
And finally, not a core ACID guarantee of relational databases, but there is another critical feature set around data integrity. Tools such as foreign keys, unique constraints, not null constraints, check constraints, etc. Combine these features with transactions, and you now have customizable logical guarantees about your data that your database will enforce.
All of these guarantees come together to make writing reliable systems that keep consistent data a routine problem. Sure, there are still challenges around how and when to apply these tools, but when used correctly it makes keeping your data clean and consistent a tractable task.
In order to gain some of their superpowers, NoSQL databases generally make tradeoffs around these guarantees. Guarantees that a lot of software engineers take for granted, or don’t even know that they are leaning on.
An acronym was created, BASE, to describe the operational characteristics and tradeoffs of most NoSQL databases.
Basically Available – The system can guarantee availability, as defined by the CAP theorem, but by potentially trading off consistency.
Soft State – The database doesn’t enforce data consistency, and values may change without interaction, due to eventual consistency.
Eventual Consistency – When data is written, it isn’t guaranteed to be immediately consistent to all database consumers. Generally speaking, it has to be replicated across all nodes in the database, which means that any reads occurring during that time could be inconsistent.
NoSQL databases are offering different performance characteristics than most relational databases. So let’s look at some of the advantages that non-relational databases advertise, and what tradeoffs they make in order to accomplish them:
High Write Performance – Many NoSQL databases advertise having incredible write performance. Relational databases have spent years optimizing the speed of writes, but their speed is limited by their durability and consistency guarantees. Relational databases can only write as fast as their persistent memory will allow them to. Or as fast as they can process transactions, indexes, and foreign keys (especially if they have to block). Throwing out guaranteed immediate persistence, not allowing transactions, not enforcing foreign keys, etc… can immediately enable huge gains in write performance.
High Read Performance – Many NoSQL databases have incredible read performance. This is often enabled due to balancing reads across a number of instances in a cluster, which provides high write performance, but generally gives up consistency due to eventual consistency. Many NoSQL databases also increase read performance by having simpler query syntaxes that don’t allow complex joins or queries across multiple “tables”, or whatever grouping mechanism they use.
Easy Horizontal Sharding – Sharding historically has been a huge pain point for relational databases. The primary reason is that by sharding data across a number of remote instances, you end up having to give up a lot of the consistency guarantees that relational databases provide. For example, what if you had a foreign key between two tables that are sharded across a number of different instances. In order to enforce that foreign key, you would have to scan every instance in the cluster to find the related data. By getting rid of these kinds of data consistency checks, you can much more easily shard data across large numbers of instances without running into huge performance bottlenecks.
Easy Schema Updates – Many NoSQL databases make database modifications much easier by being “schemaless”. This usually means that whatever structure you put data into the system, it is simply stored in that way. So if you need to add a new field, you simply start saving data with that new field. If you want that field across all of your data, then you write a job to update every record and add that field. Relational databases, on the other hand, usually have strict schemas and require more complex operations to update schemas. Particularly when a database is under heavy load.
More Reliable, and Predictable, Performance – NoSQL databases often have much more reliable and predictable performance. Most relational databases provide a lot of opportunities to degrade performance, usually by poorly performing queries. SQL is an incredibly powerful language, but knowing the pitfalls of it, and how your particular relational database performs under different circumstances can be challenging. I’ve seen many instances where small changes to a SQL query can increase its performance by a hundred fold. Many NoSQL databases dramatically simplify data access (sometimes by only allowing key based access), and instead choose to duplicate data in order to make it more easily queryable or export data to dedicated reporting databases.
Look for tradeoffs, not benefits
Non-relational databases can scale because they frequently throw out a lot of the guarantees that made relational databases so powerful. Any time you make one of these tradeoffs, you’re moving the complexity from the database into your application. Features such as relational integrity and data consistency are a problem that software engineers often underestimate. It is easy to maintain data consistency in the happy path of your application, but the edge cases and failure cases are where you really get in trouble.
At Simple Thread we often like to say that your application will get rewritten, but your data is forever. Keeping your data consistent, well structured, and clean is a huge task. Doing whatever you can to make that easier is often a worthwhile investment.
However, in some instances you might work with vast quantities of data, or deal with transactional systems that just don’t easily fit the operational limitations of relational databases. And in those cases, you should consider moving some, or all, of your data into a non-relational database. Just make sure that you’re looking for the tradeoffs, and not solely at the benefits, because relational databases are sharks. They evolved over a long period of time to be really amazing at storing large volumes of data and offloading a lot of the complex work of maintaining data integrity and consistency. They continue to evolve, taking on workloads that you’d traditionally expect from non-relational databases (think of Postgres’ jsonb datatype).
The next time you hear someone describe relational databases as yesterday’s technology, or the next time you see someone assume a relational database can’t handle the needs of their unproven MVP, stop and ask them how they are going to account for the tradeoffs they’re making. Make sure they understand they aren’t skipping a dead dinosaur, they’re taking a pass on the thousands of human-years of effort that have made relational databases the sharks of the data industry.
Loved the article? Hated it? Didn’t even read it?
We’d love to hear from you.
Great article. How databases are used in machine learning would be another? How about looking at alphafold and DNA databases. What dependencies exist? DNA->Protein->drug?
Thanks for the nice article. Can you update the PostgreSQL link with a supported version (9.5 is EOL): https://www.postgresql.org/docs/current/functions-json.html would be much nicer :).
[Implement relational data model and programming based on hash-map (NoSQL)](https://github.com/linpengcheng/PurefunctionPipelineDataflow/blob/master/doc/relational_model_on_hashmap.md)
[Everything is RMDB](https://github.com/linpengcheng/PurefunctionPipelineDataflow/blob/master/doc/Everything_is_RMDB.md)
Also, sometimes you get both. See, for e.g., CockroachDB*. Serializable isolation with distributed transactions, easy horizontal scaling (just add more nodes). Schema updates are relational (didn’t want to give that up), and read/write performance can often be optimized in much the same way as a typical relational database, but with transactions and easy scaling.
* Full disclosure: I work there. Also I work there b/c I like the database and the culture, and they definitely don’t pay me to post blog comments.
The Reddit post on this essay has lots of comments. https://old.reddit.com/r/programming/comments/q663pf/relational_databases_arent_dinosaurs_theyre_sharks/
This is a good article. But how about covering some relational databases that don’t support SQL?
I want to find something similar to Btrieve or better yet DataFlex (which was Btrieve-based) in the 1980s. I don’t want a product that even has any knowledge of SQL. Also no distributed data (everything on local hard drives on a single computer), and no concepts like “eventual consistency.” I want rock-solid data integrity, and real-time updates of multiple multi-segment indexes.
Leave a comment