Why NoSQL Now?

Why have NoSQL databases reached the tipping point now? Almost twenty years ago I lived through the attack of the killer object databases. While they had some lovely technical superiorities (and I still make part of my living helping maintain a Gemstone/Smalltalk application), relational databases were able to beat object databases in the market. It wasn’t time.

Now, though, seemingly suddenly, alternative data models are all the rage. While my narcissistic programmer self would love to believe this had something to do with technical superiority, experience argues otherwise. Instead, if you want to understand technical change it’s more effective to follow the money.

I was talking to a developer of a cloud-deployed application (hi Patrick, love that BrowserMob.com!) and our design discussions quickly focused on money. SimpleDB is great for high-transaction rate storage, but it’s too dang expensive for reporting. A special-purpose database just for reporting makes sense. It’s worth extra programming to reduce operational costs (the classic capex/opex tradeoff familiar to telecom engineers).

Some Made Up Numbers

Here are some made up numbers (thanks @jkordyback) to illustrate the dynamic (adjust the numbers for your actual situation). Say you have a database from a commercial vendor. It costs you annually:

$50K (license) + $1.5K (electricity) + $1K (capital) = $52.5K

If you need more performance it makes sense to get beefier hardware:

$50K + $2K + $3K = $55K

The performance advantages of an alternative data storage paradigm (column-oriented, document-oriented, key-value, map-reduce) don’t justify the additional cost and complexity.

Eliminate the license and the cost of electricity becomes a huge percentage of the cost of a database. (EC2 is basically a really complicated way of charging for electricity.) Any technical advantage that reduces energy usage turns directly into profit. Your database now costs you:

$0 + $1.5K + $1K = $2.5K

Buying beefier hardware is a giant bump:

$0 + $2K + $3K = $5K
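
To make the contrast concrete, here is a minimal Python sketch that runs the same made-up single-server figures through two ratios: electricity as a share of total cost, and the relative jump from upgrading hardware. The function and variable names are mine and purely illustrative.

```python
# Made-up annual costs in $K from the text: (license, electricity, capital).
licensed_base = (50.0, 1.5, 1.0)
licensed_beefy = (50.0, 2.0, 3.0)
free_base = (0.0, 1.5, 1.0)
free_beefy = (0.0, 2.0, 3.0)


def total(costs):
    """Sum the license, electricity, and capital components."""
    return sum(costs)


def electricity_share(costs):
    """Fraction of the total annual cost that goes to electricity."""
    return costs[1] / total(costs)


def upgrade_bump(base, upgraded):
    """Relative cost increase from moving to the beefier hardware."""
    return total(upgraded) / total(base) - 1.0


print(f"licensed: electricity {electricity_share(licensed_base):.1%}, "
      f"upgrade bump {upgrade_bump(licensed_base, licensed_beefy):.1%}")
print(f"free:     electricity {electricity_share(free_base):.1%}, "
      f"upgrade bump {upgrade_bump(free_base, free_beefy):.1%}")
```

With a license, electricity is about 3% of the bill and the upgrade adds roughly 5%. Without one, electricity is 60% of the bill and the same upgrade doubles it.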

What if you can avoid the hardware upgrade by shifting to an alternative database? Factor in internet-scale applications so you’re multiplying all your costs by 100 or 1000. The engineering required to shift to a different store or to keep multiple stores in sync vanishes in comparison to the operational expense savings (1000 servers for illustration):

$0 + $1500K + $1000K = $2500K

Improving performance with hardware:

$0 + $2000K + $3000K = $5000K

Versus switching to a different store:

$0 + $1500K + $3000K + $500K (engineering cost) = $4000K
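
The same comparison can be sketched in a few lines of Python. The per-server figures are the made-up ones above, scaled to 1000 servers. For the switching case the sketch assumes a capital figure of $2000K, the reading proposed in the comments, because that is the one that adds up to the stated $4000K. The function name and the break-even calculation are my own additions.

```python
SERVERS = 1000  # illustrative fleet size from the text


def fleet_cost(electricity_k, capital_k, engineering_k=0.0, license_k=0.0):
    """Annual fleet-wide cost in $K from its components."""
    return license_k + electricity_k + capital_k + engineering_k


# Per-server made-up figures ($K) multiplied out to the whole fleet.
baseline = fleet_cost(1.5 * SERVERS, 1.0 * SERVERS)
hardware_upgrade = fleet_cost(2.0 * SERVERS, 3.0 * SERVERS)

# Assumption: $2000K capital for the switching case (see lead-in),
# plus the $500K one-off engineering cost from the text.
switch_store = fleet_cost(1.5 * SERVERS, 2000.0, engineering_k=500.0)

savings = hardware_upgrade - switch_store

# Largest engineering spend at which switching still matches the upgrade.
break_even_engineering = hardware_upgrade - fleet_cost(1.5 * SERVERS, 2000.0)

print(f"baseline:          ${baseline:.0f}K")
print(f"hardware upgrade:  ${hardware_upgrade:.0f}K")
print(f"switch store:      ${switch_store:.0f}K")
print(f"savings:           ${savings:.0f}K")
print(f"break-even eng.:   ${break_even_engineering:.0f}K")
```

Under these assumptions switching saves $1000K a year over the hardware upgrade, and the engineering effort could cost up to $1500K before the two options break even.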

Implications

There are several things that catch my eye in this picture. One is that when I was going to school we were always taught, “In the olden days of computing, computers were expensive and programmers were cheap. Now it’s the reverse. Therefore…” We are back to the future. At internet scale, programmers are (sometimes) cheap compared to the cost of electricity. That’s a pretty fundamental assumption to change. I’m sure we haven’t fully digested the implications.

Another is that the technical advantages of alternative stores translate directly into economic advantage. If I was a big database vendor, I’d be diversifying away from reliance on big-iron licenses, say by buying a hardware company (oh…) or running a variety of storage models on my cloud (double oh…) Again, I’m sure we haven’t fully digested the implications of this shift.

In spite of the roughness of the numbers above, based on this I feel justified in my gut feel that the row-oriented relational model we’ve lived with for 30 years is about to shatter. Look for opex optimization to become an increasingly important topic for engineers. Look for vendors, both software and services, to deliver further opex improvements. Where it goes from there isn’t clear, but it certainly will be interesting. The time has come.

The “financial” “models” above are just thinking tools to look at trends. I’d love to see some real numbers and trends to validate the qualitative conclusions I’ve already jumped to.

18 Comments

John Kordyback January 29th, 2010 at 10:23 pm

I’ve been scoping out a new IT organization that will need to scale up in shoulder style jumps. After running the numbers I believe that EC2 is the best way to go. Basically I can set a floor price on my operational costs (licensing, servers, etc) and hedge network and electricity prices. I now need to figure out a streamlined support and governance model to match.

Besides the technical and cost attractions of EC2 the ability to provision at will is a competitive advantage in a growing organization. But mainly I can work with fewer, smarter people and that is priceless.

james January 29th, 2010 at 11:37 pm

@john kordyback: There are some companies now offering a support and deployment model on EC2 these days, http://www.engineyard.com/cloud is one of them if you’re into Ruby.

whoop dedo January 30th, 2010 at 12:11 am

huh? this isn’t even hand-waving. you’ve made a flimsy argument and put some numbers on a line to make a curve fit. i can just as easily assert that denormalization will imply more spinning media, which means more power used by you.

Sam Saariste January 30th, 2010 at 6:01 am

$0 + $1500K + $3000K + $500K (engineering cost) = $4000K ??
Probably you meant
$0 + $1500K + $2000K + $500K (engineering cost) = $4000K

admin January 30th, 2010 at 6:19 am

I don’t generally approve anonymous posts, but I appreciate the substance of what you’re saying. I use the numbers as a way of thinking about the shape of the market. If you have an alternative model that leads to different conclusions, please post it and we can discuss. Telling me, “That’s not science,” doesn’t really help clarify the picture. I know it’s not science.

I’m struggling with a qualitative phenomenon–row-oriented relational databases fought off object databases in spite of clear technical disadvantages. Now they are (in the early stages of) getting crushed by a variety of alternative storage paradigms. Why?

beberlei January 30th, 2010 at 6:24 am

Using MySQL or PostgreSQL, your license costs are $0. Given that many of the early adopters of NoSQL were firms of the “Internet World” and not necessarily from the “Old Business World”, and often used MySQL and PostgreSQL at the heart of their applications, your made-up numbers make no sense in my opinion.

Elliotte Rusty Harold January 30th, 2010 at 6:34 am

Kent, I don’t buy it. The row-oriented relational model we’ve lived with for 30 years is not about to shatter. It will continue to do what it’s always done, and done well. There is no reason for most organizations to consider a non-relational database for their order processing, inventory management, or human resources, for example. These uses are exactly what relational was designed for.

The difference is that some (though not all or even most) companies are doing very different things that scale beyond the enterprise, and have data needs like publishing that do not fit well into a classic relational model. These use cases need different models like XQuery and key-value stores. However these models will not replace relational databases for what SQL is good for. They’ll just expand the market for database services into areas where databases were never used or where SQL just wasn’t a good fit for the problem in the first place.

admin January 30th, 2010 at 6:48 am

I think we actually agree. Once the capex/opex ratio falls far enough, the performance advantages of NoSQL stores become economically significant. It becomes worth the extra complexity to add them to the mix.

admin January 30th, 2010 at 6:56 am

Elliotte,

Thank you for the followup. I spoke vaguely when I used the word “shatter”. What I should have said is that the monopoly of the row-oriented relational model is over. Because of the dominance of opex we’ll see a variety of data stores even inside of a single system. Where opex doesn’t dominate, like where you only have a little bit of data, it’s not worth the added complexity of multiple stores. Row-oriented relational will be part of that mix, certainly.

What I woke up this morning realizing was why this question matters to me. I invested a couple years of my life in getting good at object databases without getting much of a return on it. I’m eager to learn about the alternatives to ROR but I don’t want to waste my time again. The piece was my elaborate rationalization for why it makes sense to dive into the NoSQL world. NoSQL is here to stay because for certain applications they have an opex advantage once you get to a certain scale. The lack of capex in the form of licensing fees amplifies this advantage.

John Kordyback January 30th, 2010 at 8:36 am

Elliotte Rusty Harold

Two things about nosql make me go “hmmmm”. Google’s bigtable and Amazon’s simpledb are fast. Bloody fast. I never underestimate fast. OO DB’s were unfortunately slow. Tweaking rdbms to be fast is an art and many developers are tired of doing it, they just want to move on to the solution. The behavioral characteristics of a rdbms when it has little data vs lots of data are also tiresome.

Second, access to data through mechanisms like REST or JSON may be “good enough” to support most problems. Granted it doesn’t have the flexibility of SQL but I don’t need it for most situations. And when I do need a complicated query then I can just write python or java or whatever. To me this beats the heck out of some whacky subselect outer join statement thing-a-ma-jig.

Third (me count good) it’s a testable approach. There are lots of angles for tests at various points. By all that’s holy it’s apparent to me how to test.

The big down side I see is that nosql solutions don’t quite work exactly as advertised on your laptop like they do on a server. It’s one of those “trust but verify” things most developers have to work around. Also I want to work through db upgrades in more detail. I think the secret there is that you should be upgrading more often rather than in one big splat.

And before a rubyist pipes in I will try CouchDB when I get a chance. I only have so many hours in the day and my dogs just stare at me when they want a walk. It’s unnerving.

Florian Reischl January 30th, 2010 at 11:16 am

Quote: “The time has come.”

Do you have any resources that show this is right? Some performance tests with larger databases (some hundred GB) with some thousand transactions per second? Did you run any performance tests of your own?

I’d be interested in some comparisons (not provided by vendor, please ;-) ).

Greets
Flo

admin January 30th, 2010 at 11:34 am

No, I haven’t done any benchmarking myself. All of the NoSQL vendors claim roughly order-of-magnitude improvements for some range of scenarios.

John Kordyback January 30th, 2010 at 2:13 pm

Florian, I’ve done testing in the dozens of gigs with various loads. Comparing nosql to a db like SQL Server (which is a fine db in my humble opinion) or Oracle is like comparing Salvador Dalí to Tony Curtis. They are both well known, they both painted pictures, and they both worked in movies. But I wouldn’t put them in the same category.

RDBMS do indexed queries really, really well. No question, no argument. They do non-indexed queries and crappy joins poorly. As the load or data sizes increase you have to start thinking about hw, indexing, file layout, etc. Reasonably smart people have been doing this a long time so it isn’t all that difficult, just time consuming. It’s like owning an MG Midget: you really learn how to tune a carburetor, you want to drive it on a smooth road, and really enjoy the smell of musty leather seats.

nosql tends to maintain a fast and even response characteristic as the data grows. Indexing and caching works quite well without you even realizing it. Unstructured queries work very, very well. I don’t worry about hw, disks, reindexing, file layout, or any of that other stuff. I worry less about which ORM framework to use, think about scaling on the server (not on the database AND server) and take advantage of the frameworks the vendor provides. It’s like driving a Lexus with heated seats. You don’t fiddle with the motor but spend more time poking around the doo-dads.

I find nosql more straightforward for evolving design and change is less complex. I would describe the scaling as consistent and horizontal without shoulders of performance. I personally think you could tune a relational db to be faster but the overall experience for the end user would not be significantly different.

In a nutshell, nosql is very, very fast for something that is simpler to use. If you really want to dig in and spend time tuning your database then RDBMS will win on pure speed but that takes time, a lot more monitoring, and a lot more work on things that don’t differentiate your application. I’d rather spend that time focused on the user at the other end of the pipe.

Jim Cowart January 30th, 2010 at 11:25 pm

Very interesting post. This is a hot issue of discussion among the 7-member team of which I am a part in Nashville, TN. We are currently evaluating CouchDB, and are approaching future projects with a message-bus architectural mindset. Right now, our guts are telling us that if we use something like Couch, along with Ruby and .NET MVC clients that are hooked into a message bus, we could also write a listener on that bus that would populate a read-only RDBMS to be used for the ‘traditional’ ad-hoc analysis tools, and also be the starting point for a star schema data warehouse load process. Not every project would require the RDBMS listener to exist, but we may get the best of both worlds when it does…

admin January 31st, 2010 at 7:21 am

I think this kind of architecture will become more common, both for performance and price/performance reasons. It was always how something like Sabre was architected (I worked briefly on the European equivalent). My take is that the data comes into the cheaply transactional system and then “trickle charges” everything downstream.

Rickard Öberg February 1st, 2010 at 8:51 pm

In my current project we use EventSourcing to integrate a number of persistence technologies. Our main domain store is a key-value store (id->blob), which we then index using RDF for graph queries, Lucene for text queries, and finally we use asynch event listeners to populate denormalized MySQL for reporting. Once we had EventSourcing in place, integrating Lucene to take advantage of it took about a day. IMO using EventSourcing in your app is what is going to drive use of NOSQL (note the capital O), as it allows a combo of various techniques rather than having to choose one over the other.

Philip Haynes September 20th, 2010 at 11:37 pm

Serendipity. I was tossing up whether to comment or not, and then received a call from our CEO to help him explain to a client CEO why Internet/Industry class systems are fundamentally different to existing enterprise ones: one system, modelled from a multi-business network perspective, can more practically and economically deal with an industry of inter-related parties working as a network. Web models and efficient use of multi-core processing / large data stores require different approaches to storage. We have transitioned from OO/ORM to REST-oriented computing to ensure we can get the economics right and maintain a constant cost of change.

From an economies of scale point of view, my experience is that the scale figures you mention are a couple of orders of magnitude out. The Industry scale Self Managed Superannuation / Investment accounting system project I ran in 2007/2008 was built for ~$10M AUD using REST approaches. Historical figures for the previous 8 systems in that size range prior were between $150-$300M. By swapping out enterprise database approaches our cost per fund in IT dropped from $250 per fund per annum to 150M to install for *one* road. Processing cost per trip for SQL/Enterprise approaches is known to be >$1 per vehicle trip. Web scale approaches see the development cost fall 10x and running costs fall 100x.

Why NoSQL now? It technically enables implementation of rather compelling business cases as the scale of computing is changed from enterprise to industry level where this is relevant.

mike October 21st, 2011 at 5:36 am

Older piece, but I just read it. While NoSQL does generally have performance benefits, you ignore a key point: relational modelling exists for a reason. That reason is *data integrity*. In NoSQL models, long term data integrity is marginalized in the name of ease of use and performance. Relational DBs are still the preferred storage methodology…NoSQL DBs are a necessary evil to solve modern performance challenges that traditional relational DBs struggle to handle, but there is a clear and massive tradeoff.

In other words, carefully pick the right tool for the job!