The recent news that MongoDB, Inc. secured $150 million of investment capital underscored the fact that open source data is no longer in its infancy.
In fact, judging from the marquee names of the investors—including T. Rowe Price and Fidelity Investments—it’s fair to say that open source data has just passed out of late adolescence. It’s not fully grown up, but pretty soon it’ll have a mortgage and a retirement account of its own. And Wall Street is betting that it will some day hurl stones big enough to hurt giants like IBM and Oracle.
But the journey towards NoSQL adulthood hasn’t always been smooth.
Anyone who’s been lurking around the edges of the NoSQL world for a few years will remember the 2010 fallout when a MongoDB scaling and load balancing issue sent Foursquare into darkness for eleven straight hours.
Nathan Folkman, Foursquare Operations Director at the time, posted a detailed technical explanation under the understated blog title “So, that was a bummer.” Eliot Horowitz, the CTO of 10gen (the company changed its name to MongoDB, Inc. this year), conducted a rigorous postmortem of what went wrong with the code, and a lively discussion about the necessary fixes ensued among Mongo developers everywhere.
The exact reasons for the outage are complicated—and better suited for a developer blog—but the underlying cause was that Foursquare couldn’t keep up with the growth of its customer data. The team had failed to plan adequately for that growth and the implementation of MongoDB was deeply flawed.
By August of 2010, Foursquare had more than 3 million users but only 32 employees. The company was barely a year old. Welcome to the strange and bewildering world of coding for 21st century data.
In the age of distributed apps and real-time data, the journey to a million customers can happen in a relative heartbeat. It took eBay and AOL years to reach their first million customers; Instagram achieved the same benchmark in its first few months. Scaling apps and data isn’t a new problem. What’s new is the shortened timeframe to get it right.
Three years after the Foursquare outage, MongoDB has emerged as the clear leader of the NoSQL movement. Cisco and Craigslist are using it, and so are Shutterfly and McAfee. Today, MongoDB, Inc. is valued at over a billion dollars, not bad for a startup fueled by open source innovation.
To date, MongoDB has been downloaded more than 5 million times, and that number grows by 150,000 each month. Swarming around MongoDB are plenty of other non-relational players, including Cassandra, HBase, CouchDB, and Riak. In terms of technology, these NoSQL systems share a few common characteristics.
Generally, they’re all non-relational, open-source, cluster-friendly, schema-less, and built for the 21st century of Web computing. And they’re all vying for a piece of a $30 billion data industry. Just because the code is generally offered for free doesn’t mean the NoSQL space is run like a charity. The real value isn’t just in the code but in the knowledge required to run it at massive scale.
What companies and developers are realizing is that running data-intensive applications is all about performance, not only from the code but also from the underlying architecture and hardware.
As the NoSQL ecosystem continues to evolve, a sort of natural selection will unfold. Some NoSQL flavors will drop off the map and others will hit their stride. Enterprises and startups alike will demand more from their NoSQL partners, including tools to make scaling seamless and efficient.
A platform like MongoDB has broad appeal because it allows developers to use native data structures in many of today’s popular coding languages. Developers would rather code than wear the DBA hat and worry about the nuances of scaling. With MongoDB, automatic scaling and load balancing of data is essentially hardwired into the platform. It’s easy to see why developers love it.
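To make the “native data structures” point concrete, here is a minimal sketch in Python. The `checkin` record and its field names are hypothetical, and JSON stands in for BSON, the binary format MongoDB actually stores, so read it as an illustration of the document model rather than of any driver’s API.

```python
import json

# Hypothetical check-in record, written as an ordinary nested Python dict.
checkin = {
    "user_id": 4021,
    "venue": {"name": "Example Cafe", "location": [40.7411, -73.9897]},
    "tags": ["coffee", "wifi"],
}


def to_document(record: dict) -> str:
    """Serialize a native dict to a JSON document string (a stand-in for BSON)."""
    return json.dumps(record, sort_keys=True)


def from_document(document: str) -> dict:
    """Rebuild the native dict from its serialized form."""
    return json.loads(document)


# The nested structure survives the round trip with no table schema or join.
restored = from_document(to_document(checkin))
```

The nested venue and the list of tags travel together as one unit, which is the part developers tend to like: the shape in the code is the shape in the store.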
But there’s a subtle distinction between automatic scaling (part of the design) and automated scaling (it happens without user involvement). And this is where the next big phase of NoSQL innovation is happening, further down the technology stack, well below the app.
MongoDB automatically supports horizontal scaling, or sharding. By default, the balancer will move chunks of data to even out the workload across your sharded cluster. Mongo doesn’t much care whether the shards live on physical, virtual, or cloud servers, but your customers will care.
It turns out that the public cloud or cheap, virtualized servers are not a good performance fit for a resource-hungry sprawl of NoSQL data. And without a method for automating sharding as your data grows on a cloud-like infrastructure, you can get caught chasing endlessly after your data.
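The idea of moving chunks to even out a cluster can be sketched as a toy model. This is not MongoDB’s actual balancer code; the `balance` function, the shard names, and the `threshold` parameter are assumptions made for illustration. It simply migrates one chunk at a time from the fullest shard to the emptiest until their chunk counts are close.

```python
def balance(shards: dict, threshold: int = 2) -> list:
    """Toy balancer: even out chunk counts across shards, in place.

    `shards` maps a shard name to its list of chunk ids. Chunks move from
    the fullest shard to the emptiest until their counts differ by less
    than `threshold`. Returns the (chunk, source, destination) migrations.
    """
    if threshold < 2:
        # With a threshold of 1, a chunk could bounce back and forth forever.
        raise ValueError("threshold must be at least 2")
    migrations = []
    while True:
        fullest = max(shards, key=lambda name: len(shards[name]))
        emptiest = min(shards, key=lambda name: len(shards[name]))
        if len(shards[fullest]) - len(shards[emptiest]) < threshold:
            return migrations
        chunk = shards[fullest].pop()
        shards[emptiest].append(chunk)
        migrations.append((chunk, fullest, emptiest))


if __name__ == "__main__":
    # Hypothetical cluster: all six chunks start on one shard.
    cluster = {"shard0": ["c0", "c1", "c2", "c3", "c4", "c5"], "shard1": [], "shard2": []}
    for chunk, source, destination in balance(cluster):
        print(f"moved {chunk}: {source} -> {destination}")
```

Every migration in this sketch costs nothing. In a real cluster each one costs disk and network I/O, which is why the hardware underneath the shards matters so much.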
So how does an app get the scalability of the cloud but the performance of bare metal servers? In a sense, this is the Holy Grail quest of the data and compute world. The good news is that some providers are making serious headway towards finding it.
Of course it’s possible to build a private cloud and keep all of your data behind your own firewall, but unless you like building massive datacenters, it’s not likely to get investor or CFO approval. So here’s where a new generation of data-as-a-service providers comes in. They’re engineering every layer of the stack, from the OS kernel to the disk drives to the file system, for optimized data performance.
For MongoDB, as an example, they’re developing tools to automate the selection of the right shard key (the linchpin in separating out your data workloads) and they’re automatically adding pre-provisioned shards to your cluster as your data grows. Think of this new model as using a kind of containerized unit of capacity.
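What such tooling might look like can be sketched in a few lines of Python. Everything here is hypothetical: the function names, the sample fields, and the 75 percent fill target are assumptions, not any provider’s product. The first half scores candidate shard keys by how evenly their hashed values spread across shards; the second estimates how many pre-provisioned shards to add as data grows. A real shard key choice also has to weigh query patterns, which this sketch ignores.

```python
import hashlib
import math
from collections import Counter


def bucket_for(value, num_shards: int) -> int:
    """Map a key value to a shard index using a stable hash."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def skew(docs: list, key: str, num_shards: int) -> float:
    """Ratio of the busiest shard's share to a perfectly even share (1.0 is ideal)."""
    if not docs:
        raise ValueError("need at least one document to measure skew")
    counts = Counter(bucket_for(doc[key], num_shards) for doc in docs)
    return max(counts.values()) / (len(docs) / num_shards)


def pick_shard_key(docs: list, candidates: list, num_shards: int) -> str:
    """Choose the candidate field whose values spread most evenly."""
    return min(candidates, key=lambda field: skew(docs, field, num_shards))


def shards_to_add(data_gb: float, current_shards: int,
                  shard_capacity_gb: float, target_fill: float = 0.75) -> int:
    """How many extra shards keep each one at or below the target fill level."""
    needed = math.ceil(data_gb / (shard_capacity_gb * target_fill))
    return max(0, needed - current_shards)


if __name__ == "__main__":
    # Hypothetical sample: unique user ids versus a two-value country field.
    sample = [{"user_id": i, "country": "US" if i % 2 else "CA"} for i in range(1000)]
    print(pick_shard_key(sample, ["country", "user_id"], num_shards=4))
    print(shards_to_add(data_gb=900, current_shards=3, shard_capacity_gb=256))
```

A field with only two distinct values can never occupy more than two shards, no matter how many you add, which is why a low-cardinality key is a common way to get scaling wrong.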
Where it differs from the traditional cloud is that it doesn’t use a “one-size-fits-all” approach. Instead, it takes a hyper-customized approach to deploying a unit of scalability that has the performance specs of the most powerful bare metal servers. These new providers also understand that many apps rely equally on the relational and NoSQL sides of the data world, so they’re building seamless ways for the two realms to stay in sync. They finally understand the fact that NoSQL and MySQL have always been open source allies, not sworn enemies.
So if there’s a sequel to NoSQL, it’s that the promise of built-in scaling and retrieving massive amounts of data on the fly will finally be fulfilled. The line between the underlying infrastructure and the database technology itself will continue to blur. The container of data capacity will become the unit of engineering, not the server.
In my book, that’s a sequel worth seeing.