Why NoSQL Now?
Why have NoSQL databases reached the tipping point now? Almost twenty years ago I lived through the attack of the killer object databases. While they had some lovely technical superiorities (and I still make part of my living helping maintain a Gemstone/Smalltalk application), relational databases were able to beat object databases in the market. It wasn’t time.
Now, though, seemingly suddenly, alternative data models are all the rage. While my narcissistic programmer self would love to believe this had something to do with technical superiority, experience argues otherwise. Instead, if you want to understand technical change it’s more effective to follow the money.
I was talking to a developer of a cloud-deployed application (hi Patrick, love that BrowserMob.com!) and our design discussions quickly focused on money. SimpleDB is great for high-transaction rate storage, but it’s too dang expensive for reporting. A special-purpose database just for reporting makes sense. It’s worth extra programming to reduce operational costs (the classic capex/opex tradeoff familiar to telecom engineers).
Some Made Up Numbers
Here are some made up numbers (thanks @jkordyback) to illustrate the dynamic (adjust the numbers for your actual situation). Say you have a database from a commercial vendor. It costs you annually:
$50K (license) + $1.5K (electricity) + $1K (capital) = $52.5K
If you need more performance it makes sense to get beefier hardware:
$50K + $2K + $3K = $55K
The performance advantages of an alternative data storage paradigm (column-oriented, document-oriented, key-value, map-reduce) don’t justify the additional cost and complexity.
Eliminate the license and the cost of electricity becomes a huge percentage of the cost of a database. (EC2 is basically a really complicated way of charging for electricity.) Any technical advantage that reduces energy usage turns directly into profit. Your database now costs you:
$0 + $1.5K + $1K = $2.5K
Buying beefier hardware is a giant bump:
$0 + $2K + $3K = $5K
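To make the comparison concrete, here is a minimal Python sketch that totals the made up figures above and reports how big the hardware bump is relative to the baseline. The function names are mine, purely for illustration; the dollar figures are the invented ones from the text, in thousands per year.

```python
def annual_cost(license_k, electricity_k, capital_k):
    """Total yearly cost in thousands of dollars (made-up figures)."""
    return license_k + electricity_k + capital_k


def upgrade_bump(base, upgraded):
    """Relative cost increase from moving to beefier hardware."""
    return (upgraded - base) / base


# Commercial license: baseline vs. beefier hardware
licensed_base = annual_cost(50, 1.5, 1)   # 52.5
licensed_big = annual_cost(50, 2, 3)      # 55
# No license fee: same electricity and capital figures
free_base = annual_cost(0, 1.5, 1)        # 2.5
free_big = annual_cost(0, 2, 3)           # 5

licensed_bump = upgrade_bump(licensed_base, licensed_big)
free_bump = upgrade_bump(free_base, free_big)

print(f"licensed: {licensed_base}K -> {licensed_big}K (+{licensed_bump:.0%})")
print(f"free:     {free_base}K -> {free_big}K (+{free_bump:.0%})")
```

With the license in the picture the upgrade is roughly a 5% bump; without it, the same upgrade doubles the bill.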
What if you can avoid the hardware upgrade by shifting to an alternative database? Factor in internet-scale applications so you’re multiplying all your costs by 100 or 1000. The engineering required to shift to a different store or to keep multiple stores in sync vanishes in comparison to the operational expense savings (1000 servers for illustration):
$0 + $1500K + $1000K = $2500K
Improving performance with hardware:
$0 + $2000K + $3000K = $5000K
Versus switching to a different store:
$0 + $1500K + $3000K + $500K (engineering cost) = $4000K
Implications
There are several things that catch my eye in this picture. One is that when I was going to school we were always taught, “In the olden days of computing, computers were expensive and programmers were cheap. Now it’s the reverse. Therefore…” We are back to the future. At internet scale, programmers are (sometimes) cheap compared to the cost of electricity. That’s a pretty fundamental assumption to change. I’m sure we haven’t fully digested the implications.
Another is that the technical advantages of alternative stores translate directly into economic advantage. If I was a big database vendor, I’d be diversifying away from reliance on big-iron licenses, say by buying a hardware company (oh…) or running a variety of storage models on my cloud (double oh…) Again, I’m sure we haven’t fully digested the implications of this shift.
In spite of the roughness of the numbers above, based on this I feel justified in my gut feel that the row-oriented relational model we’ve lived with for 30 years is about to shatter. Look for opex optimization to become an increasingly important topic for engineers. Look for vendors, both software and services, to deliver further opex improvements. Where it goes from there isn’t clear, but it certainly will be interesting. The time has come.
The “financial” “models” above are just thinking tools to look at trends. I’d love to see some real numbers and trends to validate the qualitative conclusions I’ve already jumped to.
I’ve been scoping out a new IT organization that will need to scale up in shoulder style jumps. After running the numbers I believe that EC2 is the best way to go. Basically I can set a floor price on my operational costs (licensing, servers, etc) and hedge network and electricity prices. I now need to figure out a streamlined support and governance model to match.
Besides the technical and cost attractions of EC2 the ability to provision at will is a competitive advantage in a growing organization. But mainly I can work with fewer, smarter people and that is priceless.
@john kordyback: There are some companies now offering a support and deployment model on EC2 these days, http://www.engineyard.com/cloud is one of them if you’re into Ruby.
huh? this isn’t even hand-waving. you’ve made a flimsy argument and put some numbers on a line to make a curve fit. i can just as easily assert that denormalization will imply more spinning media, which means more power used by you.
$0 + $1500K + $3000K + $500K (engineering cost) = $4000K ??
Probably you meant
$0 + $1500K + $2000K + $500K (engineering cost) = $4000K
I don’t generally approve anonymous posts, but I appreciate the substance of what you’re saying. I use the numbers as a way of thinking about the shape of the market. If you have an alternative model that leads to different conclusions, please post it and we can discuss. Telling me, “That’s not science,” doesn’t really help clarify the picture. I know it’s not science.
I’m struggling with a qualitative phenomenon–row-oriented relational databases fought off object databases in spite of clear technical disadvantages. Now they are (in the early stages of) getting crushed by a variety of alternative storage paradigms. Why?
Using MySQL or PostgreSQL, your license cost is $0. Given that many of the early adopters of NoSQL were firms of the “Internet World” and not necessarily from the “Old Business World”, and often used MySQL and PostgreSQL at the heart of their applications, your made up numbers make no sense in my opinion.
Kent, I don’t buy it. The row-oriented relational model we’ve lived with for 30 years is not about to shatter. It will continue to do what it’s always done, and done well. There is no reason for most organizations to consider a non-relational database for their order processing, inventory management, or human resources, for example. These uses are exactly what relational was designed for.
The difference is that some (though not all or even most) companies are doing very different things that scale beyond the enterprise, and have data needs like publishing that do not fit well into a classic relational model. These use cases need different models like XQuery and key-value stores. However these models will not replace relational databases for what SQL is good for. They’ll just expand the market for database services into areas where databases were never used or where SQL just wasn’t a good fit for the problem in the first place.
I think we actually agree. Once the capex/opex ratio falls far enough, the performance advantages of NoSQL stores become economically significant. It becomes worth the extra complexity to add them to the mix.
Elliotte,
Thank you for the followup. I spoke vaguely when I used the word “shatter”. What I should have said is that the monopoly of the row-oriented relational model is over. Because of the dominance of opex we’ll see a variety of data stores even inside of a single system. Where opex doesn’t dominate, like where you only have a little bit of data, it’s not worth the added complexity of multiple stores. Row-oriented relational will be part of that mix, certainly.
What I woke up this morning realizing was why this question matters to me. I invested a couple years of my life in getting good at object databases without getting much of a return on it. I’m eager to learn about the alternatives to ROR but I don’t want to waste my time again. The piece was my elaborate rationalization for why it makes sense to dive into the NoSQL world. NoSQL is here to stay because for certain applications they have an opex advantage once you get to a certain scale. The lack of capex in the form of licensing fees amplifies this advantage.
Elliotte Rusty Harold
Two things about nosql make me go “hmmmm”. Google’s bigtable and Amazon’s simpledb are fast. Bloody fast. I never underestimate fast. OO DB’s were unfortunately slow. Tweaking an rdbms to be fast is an art and many developers are tired of doing it, they just want to move on to the solution. The way an rdbms behaves differently when it has little data vs lots of data is also tiresome.
Second, access to data through mechanisms like REST or JSON may be “good enough” to support most problems. Granted it doesn’t have the flexibility of SQL but I don’t need it for most situations. And when I do need a complicated query then I can just write python or java or whatever. To me this beats the heck out of some whacky subselect outer join statement thing-a-ma-jig.
Third (me count good) it’s a testable approach. There are lots of angles for tests at various points. By all that’s holy it’s apparent to me how to test.
The big down side I see is that nosql solutions don’t quite work exactly as advertised on your laptop like they do on a server. It’s one of those “trust but verify” things most developers have to work around. Also I want to work through db upgrades in more detail. I think the secret there is that you should be upgrading more often rather than in one big splat.
And before a rubyist pipes in I will try CouchDB when I get a chance. I only have so many hours in the day and my dogs just stare at me when they want a walk. It’s unnerving.
Quote: “The time has come.”
Do you have any resources that show this is right? Some performance tests with larger databases (some hundred GB) with some thousand transactions per second? Did you do any own performance tests?
I’d be interested in some comparisons (not provided by vendor, please).
Greets
Flo
No, I haven’t done any benchmarking myself. All of the NoSQL vendors claim roughly order-of-magnitude improvements for some range of scenarios.
Florian, I’ve done testing in the dozens of gigs with various loads. Comparing nosql to a db like SQL Server (which is a fine db in my humble opinion) or Oracle is like comparing Salvador Dali to Tony Curtis. They are both well known, they both painted pictures, and they both worked in movies. But I wouldn’t put them in the same category.
RDBMS do indexed queries really, really well. No question, no argument. They do non-indexed queries and crappy joins poorly. As the load or data sizes increase you have to start thinking about hw, indexing, file layout, etc. Reasonably smart people have been doing this a long time so it isn’t all that difficult, just time consuming. It’s like owning an MG Midget: you really learn how to tune a carburetor, you want to drive it on a smooth road, and you really enjoy the smell of musty leather seats.
nosql tends to maintain a fast and even response characteristic as the data grows. Indexing and caching work quite well without you even realizing it. Unstructured queries work very, very well. I don’t worry about hw, disks, reindexing, file layout, or any of that other stuff. I worry less about which ORM framework to use, think about scaling on the server (not on the database AND server) and take advantage of the frameworks the vendor provides. It’s like driving a Lexus with heated seats. You don’t fiddle with the motor but spend more time poking around the doo-dads.
I find nosql more straightforward for evolving design and change is less complex. I would describe the scaling as consistent and horizontal without shoulders of performance. I personally think you could tune a relational db to be faster but the overall experience for the end user would not be significantly different.
In a nutshell, nosql is very, very fast for something that is simpler to use. If you really want to dig in and spend time tuning your database then RDBMS will win on pure speed but that takes time, a lot more monitoring, and a lot more work on things that don’t differentiate your application. I’d rather spend that time focused on the user at the other end of the pipe.
Very interesting post. This is a hot issue of discussion among the 7-member team of which I am a part in Nashville, TN. We are currently evaluating CouchDB, and are approaching future projects with a message-bus architectural mindset. Right now, our guts are telling us that if we use something like Couch, along with Ruby and .NET MVC clients that are hooked into a message bus, we could also write a listener on that bus that would populate a read only RDBMS to be used for the ‘traditional’ ad-hoc analysis tools, and also be the starting point for a star schema data warehouse load process. Not every project would require the RDBMS listener to exist, but we may get the best of both worlds when it does…
I think this kind of architecture will become more common, both for performance and price/performance reasons. It was always how something like Sabre was architected (I worked briefly on the European equivalent). My take is that the data comes into the cheaply transactional system and then “trickle charges” everything downstream.
In my current project we use EventSourcing to integrate a number of persistence technologies. Our main domain store is a key-value store (id->blob), which we then index using RDF for graph queries, Lucene for text queries, and finally we use asynch event listeners to populate denormalized MySQL for reporting. Once we had EventSourcing in place, integrating Lucene to take advantage of it took about a day. IMO using EventSourcing in your app is what is going to drive use of NOSQL (note the capital O), as it allows a combo of various techniques rather than having to choose one over the other.
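A minimal Python sketch of the idea in the last few comments: an append-only event log is the source of truth, and independent listeners project each event into whatever stores they need. Everything here is hypothetical (the EventLog class, the event shape, and the in-memory dicts standing in for a key-value store, a text index, and a reporting table); none of it is the commenters' actual code. Listeners run synchronously here for simplicity, whereas the comment above describes asynchronous ones.

```python
from collections import defaultdict


class EventLog:
    """Append-only event log that fans each event out to subscribers."""

    def __init__(self):
        self.events = []
        self.listeners = []

    def subscribe(self, listener):
        self.listeners.append(listener)

    def append(self, event):
        # Record the event first, then notify every listener in order.
        self.events.append(event)
        for listener in self.listeners:
            listener(event)


# In-memory stand-ins for real stores (assumptions for illustration only)
kv_store = {}                    # id -> latest document
text_index = defaultdict(set)    # word -> ids of documents containing it
report_rows = defaultdict(int)   # customer -> order count


def update_kv(event):
    kv_store[event["id"]] = event["doc"]


def update_text_index(event):
    for word in event["doc"]["note"].lower().split():
        text_index[word].add(event["id"])


def update_report(event):
    report_rows[event["doc"]["customer"]] += 1


log = EventLog()
for listener in (update_kv, update_text_index, update_report):
    log.subscribe(listener)

log.append({"id": 1, "doc": {"customer": "acme", "note": "rush order"}})
log.append({"id": 2, "doc": {"customer": "acme", "note": "standard order"}})
```

Adding another store is then one more listener, and replaying log.events through a fresh listener rebuilds that store from scratch.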