Big Data and Sharding Databases: To shard or not to shard

I've been building websites since the mid 90's and have witnessed the dot com boom, and bust. Over the course of the almost two decades I've also been exposed to many different configurations, requirements, and languages to all serve the purpose of serving the next website. I talk from my observations, and my opinions are my own.

Big Data

Databases of their different variety and flavor present great solutions and sometimes great problems. Take for instance big data. Big data, as I have come to understand it, is a mass of data to serve a purpose, be it analytical, overall business, or required for overall application operation, but on a very grand scale. For example, Google spiders and indexes millions of sites around the world on a constant basis, looking for updated content to expand its ever growing database of information relevant to a particular search. This database isn't a tiny one by any means, and it's a known fact that they constantly expand their data centre capacity by the container load (containers custom fitted with caseless servers--motherboards mounted directly to wooden boards for ease of assembly, compact etc--and they pack these into data centres for a technician to simply plug into a power and network source. They're massive. And the data that they contain has to be linked up to other data across the thousands of containers they operate.

"Jimmy Clidaras revealed that the core of the company's data centers are composed of standard 1AAA shipping containers packed with 1,160 servers each, with many containers in each data center."

Source: http://news.cnet.com/8301-1001_3-10209580-92.html

There's also a YouTube video showing how these Container Data centers are made up. When you see videos and articles such as this, you can very quickly and easily see that trying to build your own datacenter in a similar way would be exorbitantly expensive.

"The index is partitioned by document IDs into many pieces called shards. Each shard is replicated onto multiple servers. Initially, the index was being served from hard disk drives, as is done in traditional information retrieval (IR) systems. Google dealt with the increasing query volume by increasing number of replicas of each shard and thus increasing number of servers. Soon they found that they had enough servers to keep a copy of the whole index in main memory (although with low replication or no replication at all), and in early 2001 Google switched to an in-memory index system. This switch "radically changed many design parameters" of their search system, and allowed for a significant increase in throughput and a large decrease in latency of queries."

http://en.wikipedia.org/wiki/Google_platform

Why Shard

The simple fact of the matter is that very large databases are slow. You've all seen this I'm sure, a table or tables comprising of millions of rows can take several seconds to return a result, no matter how good your indexes are. So by splitting them up into smaller chunks, or shards, we can turn big databases into lots of smaller databases and then they become quick.

Another big problem with big databases is that they need big servers. Whenever most database issues are encountered, especially with some smaller sites, the simplest solution thought of is to upgrade the server hardware and throw more resources at it. These solutions should be considered a band-aid and WILL constantly require upgrades which in turn means...that's right...more money!

Google would never build a single server to accommodate it's entire database. It would just be totally crazy to expect all that data to run on one single machine. Even trying it as one server would cost an astronomical fortune. I think it's safe to say that fortune could well be in excess of the total GDP of the USA. So they do what anyone really can do, split it up.

CodeFutures, creators of dbShards spell this out in a clearly concise description of database sharding:

"Database Sharding provides a method for scalability across independent servers, each with their own CPU, memory and disk. Contrasted with other traditional methods of achieving greater database performance, it does not suffer from many of the typical limitations posed by these other approaches. The concept of a "shared-nothing" database implementation has been under research or discussion for 15+ years, but it appears that the business application market is just now finding the more general need for such capability due to the exponential increase in data volumes over the past several years."

Source: http://www.codefutures.com/database-sharding/

Problems Encountered with Sharding

The first and biggest problem with sharding is how to shard. The simplest is to use master/slave pairs and have a copy of the database on each shard. The master is configured to replicate to the slave. Then you have to figure out how to shard your data, with the simplest concept being write record 1 to shard 1, record 2 to shard 2 etc. This is your sharding algorithm which helps you find the data when you need it. So you will need to program your application to go to the server shard 2 to retrieve the record shard 2. It's also going to need logic built in to know how many shards you have, so if you only have 2 shards then record 3 will be on shard 1.

That leads to the second problem, resharding. When your database grows too big for 2 shards, you're going to need to expand. So capacity planning and resharding is needed so that you know when to reshard, and how large you need it to go.

Resharding is the process by which you expand your database by adding more shards. You can't just add more shards and then start distributing your data to them after the first 2 are filled. Well, you could but then your sharding algorithm is going to go nuts with exceptions to rebounce the query to the right server. You will have to set up the new shards and then copy the data across which allows the new sharding algorithm to kick in.

Additionally, you may have a really well designed schema, a solid sharding algorythm etc, however you will find that some shards will start to get unbalanced.

Visualize the following scenario:

You have a database that manages customers, products, and orders. Majority of customers will place roughly the same amount of orders as the next customer so sharding orders to customers by customer ID makes good sense as a sharding strategy. However, there will be some occasions where customer x, and customer z are whales, meaning they place many orders all the time. The shards they are hosted on will start getting unbalanced, or have more order records than most other customers. So you'll probably have to reshard to rebalance the data with some tweaks to the algorythm to accommodate some customers containing more related information than others.

You're also going to hit problems with the speed at which to process data that's returned. If you have a query that selects all the records in the database, each server will need to run the query, collate the responses and return them to the application. If you have millions of rows per server and many shards then this could result in billions of records. Network latency and additional hops rears its nasty head as data of that size will then need to travel down the wire back to the application.

I think it's also really important to address the problem of relational data. Relational data is what the web is made up of today. It's what most databases consist of. Let's face it, all data relates to other data, regardless of it's purpose. Granted, there are the likes of noSQL, Hadoop etc which are more commonly used for data sharding, but these require you to rescript your application using new query syntax etc and they are NOT relational but rather key/value pairs. The problem with key/value pair databases is that your application has to receive the records, iterate through them, process the type of data it is and then present it.

At the end of the day, we finally went for a DaaS provided by a very competent and hard working team at dbShards running mySQL to meet our relational data requirements, over numerous shards (more than 10) to support the sheer volume that was expected. The application launched successfully and is capable of handling tens of thousands of visitors per second with a scale-able application. Through the ingenious solution provided, and dedication and support of the DaaS provider, we could handle thousands of queries per second, all with guaranteed replication from each shard, and the ongoing monitoring and support meant that we were always on, satisfying our HA and DR requirements without my colleagues and I being on 24 hour call.

Ending this article...I was fortunate to spot a new Tweet posted to my twitter timeline about a comparison between dbShards and AWSRDS. It's data like this that you can't miss out on having and I highly recommend having a read.

Big Data and Sharding Databases: To shard or not to shard

Comments (0)