What Are They And Do I Need Them?
Where you have data though, you also have to have a way to store it, manage it and, if you expect to use it, a way to visualize it.
That almost always means Data Warehouses, Data Lakes and BI tools.
It really doesn’t seem all that long ago that organizations were still storing their data in simple paper files (or maybe the person writing this is just a lot older than the person reading it!). Over time those files were migrated to floppy disks, which then became on-prem hard drives and now everything is moving once again, this time into the cloud.
How data is stored has always had to evolve to match current tech, but the recent rapid evolution in data storage is largely down to the sheer volume of data now accessible to organizations… data storage solutions have had to become more efficient and advanced just to be able to cope with it.
Consider how you interact with some of the organizations in your life… mobile, desktop, app, browser, with the sales team, with the customer service team… and that’s just a standard retail business. Other sectors have even larger amounts of data and data nodes to store, process, manage, analyze and visualize.
That volume of data, whilst enormously useful to a business, puts a huge stress on database clusters.
They’re required to handle ever growing amounts of data (and the amount of data is growing exponentially) all whilst being called upon more and more often.
How are we supposed to handle that much data when it hits our cluster and still derive actionable business intelligence from it?
That’s where replication and sharding come in… sharding especially!
What Is Replication?
First… what is replication?
Replication, in terms of data warehousing and database management, refers to a type of database solution in which copies of the exact same database are hosted on multiple, different machines.
Why would you do that?
Well, the biggest reason is to create extra redundancy. If one of the machines hosting a database goes down, recovery is far quicker, as one of the other machines the database is replicated on can ‘step in’ to take the load.
That replication process both minimizes downtime and, because there’s a constant copy of an active database, loss of data is minimized too.
Key Features Of Replication
Replica sets are clusters of separate nodes that all maintain exact copies of the same data set. They can be set up in different ways, such as primary-secondary (master-slave) or multi-primary (master-master).
In a multi-primary setup, any replica can receive updates, with the other replicas then being synced.
In a primary-secondary setup, changes are written to the primary and then copied and applied to the secondary machines, usually in an asynchronous process.
That means all secondary nodes are connected to the primary node. Should the primary server go down, one of the secondaries is automatically ‘promoted’ to the status of the new primary.
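To make the primary and secondary roles a bit more concrete, here’s a minimal in-memory sketch. It isn’t how any particular database does it: the `Node` and `ReplicaSet` classes, the `sync` step and the "promote the first live secondary" rule are all assumptions made up for illustration.

```python
from collections import deque


class Node:
    """One machine holding a full copy of the data set."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True


class ReplicaSet:
    """Hypothetical primary-secondary replica set with asynchronous replication."""

    def __init__(self, names):
        self.nodes = [Node(name) for name in names]
        self.primary = self.nodes[0]
        # One queue of not-yet-applied changes per node.
        self.pending = {node.name: deque() for node in self.nodes}

    def secondaries(self):
        return [n for n in self.nodes if n is not self.primary and n.alive]

    def write(self, key, value):
        # The primary applies the write immediately...
        self.primary.data[key] = value
        # ...while the secondaries only get it queued (asynchronous).
        for node in self.secondaries():
            self.pending[node.name].append((key, value))

    def sync(self):
        # Apply every queued change to each live secondary.
        for node in self.secondaries():
            queue = self.pending[node.name]
            while queue:
                key, value = queue.popleft()
                node.data[key] = value

    def fail_primary(self):
        # Simulate the primary going down, then promote the first live secondary.
        self.primary.alive = False
        candidates = self.secondaries()
        if not candidates:
            raise RuntimeError("no live secondary left to promote")
        self.primary = candidates[0]
        # Changes the new primary never received are lost in this sketch.
        self.pending[self.primary.name].clear()
        return self.primary.name


replica_set = ReplicaSet(["node-a", "node-b", "node-c"])
replica_set.write("order:1", "paid")
replica_set.sync()
print(replica_set.fail_primary(), replica_set.primary.data)
```

Notice that any write still sitting in a queue when the primary fails never reaches the new primary. That’s the trade-off of an asynchronous process: writes are fast, but a small window of data can be lost.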
Why Should I Replicate My Database?
Replicating a database maximizes the amount of data that can be recovered in the event of something untoward happening, in some cases limiting the loss of data to nothing.
It means downtime is rarely needed for maintenance, as there’s always another copy of the data to draw on whilst the work is being carried out. The other major advantage (major being underlined twice here) is that it massively improves read scaling, as there are always extra copies of the data to read from. In fact, the additional availability is almost a secondary effect of the read scaling.
In short, it offers high data redundancy and high availability.
Databases are made much more durable by keeping multiple copies (or replicas) on isolated servers.
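Read scaling is easier to picture with a small sketch. The `ReadRouter` class below is hypothetical, plain dictionaries stand in for the servers, and the round-robin rule is just one simple way of spreading reads across the copies.

```python
import itertools


class ReadRouter:
    """Hypothetical router: writes go to the primary, reads rotate over the replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary      # dict standing in for the primary's data
        self.replicas = replicas    # list of dicts standing in for the copies
        self._next = itertools.cycle(range(len(replicas)))
        self.reads_served = [0] * len(replicas)

    def write(self, key, value):
        self.primary[key] = value
        # Copied to every replica straight away to keep the sketch short;
        # real replication is usually asynchronous.
        for replica in self.replicas:
            replica[key] = value

    def read(self, key):
        # Each read goes to the next replica in turn.
        index = next(self._next)
        self.reads_served[index] += 1
        return self.replicas[index].get(key)


router = ReadRouter(primary={}, replicas=[{}, {}, {}])
router.write("sku:42", "in stock")
for _ in range(6):
    router.read("sku:42")
print(router.reads_served)
```

Every extra replica takes an equal share of the reads, which is why adding copies helps read-heavy workloads. The writes still all land on one primary though, which is the problem sharding sets out to solve.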
And that’s replication in a nutshell.
What Is Sharding?
Sharding is a little more complicated than replicating, but not by much.
Once the amount of data an organization holds reaches a certain size, scaling becomes a huge performance issue.
A large number of users all writing to the same database can really cause issues.
Sharding is a solution that is designed to counteract that, by splitting the database up into separate shards, across multiple machines.
In essence, it’s a way of performing load balancing by splitting the database up over different database servers.
Unlike replication, where the entire database is copied, with sharding each server holds only part of it, and data is pulled from the different servers at a user’s request. The downside, however, is that there’s a price to pay for the performance gain. Whenever you need data from multiple shards at the same time, you have the overhead of recombining it, either within the calling server or in your application.
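Here’s a rough sketch of both sides of that trade. It assumes hash-based sharding (one common way of deciding which shard a record lives on; there are others), and the `shard_for`, `put`, `get` and `total_spend` names are made up for the example.

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count for the example
shards = [dict() for _ in range(NUM_SHARDS)]


def shard_for(key):
    """Map a key to a shard number using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def put(key, value):
    shards[shard_for(key)][key] = value


def get(key):
    # A single-key lookup touches exactly one shard.
    return shards[shard_for(key)].get(key)


def total_spend():
    # A cross-shard query: ask every shard for a partial result,
    # then recombine the partials in the application.
    partials = [sum(shard.values()) for shard in shards]
    return sum(partials)


put("customer:1", 20)
put("customer:2", 35)
put("customer:3", 5)
print(get("customer:2"), total_spend())
```

Looking up one customer only ever touches one shard, which is where the performance gain comes from. The total, on the other hand, has to visit every shard and be stitched back together, which is the overhead described above.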
Why Should I Be Sharding My Database?
Large, traditional databases that have huge datasets or a lot of users calling on them (or both) will often prove problematic for a single server, pushing it up against its CPU limit.
Sharding relieves that pressure by distributing the load across multiple servers, without the need to replicate your entire database.
That means, instead of one server acting as a primary (as in the case of replication) we now have several sharded servers with each one only holding part of the data.
A configuration server holds the metadata on what should be where, and a routing server uses that metadata to direct each read or write to the right shard.
Sharding means load can be automatically rebalanced across your servers depending on demand, whilst the total number of operations each individual shard has to manage is massively reduced.
By splitting your data up over multiple instances, your write capacity is also massively increased.
What’s The Difference Between Replication And Sharding?
Think of replication vs sharding as a virtual cake.
With replication, you have the original cake and then duplicate it into two cakes but now require extra storage to keep them both.
You’ve got a spare cake should anything go wrong… but you’ve just doubled your storage costs.
With sharding, imagine the cake being cut up into four, eight or twelve pieces, with each piece being stored in a separate place. You still ‘have’ a whole cake, but need to go to different places to get each slice… that’s sharding (or horizontal scaling).
Here’s where it starts to get complicated.
Both replication and sharding are only part of the answer. A true Modern Data Stack should use both, in a sharded cluster where each shard is replicated, so that your data’s integrity is preserved without every server having to store a full copy of it.
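A quick back-of-the-envelope calculation shows how the two fit together. The `cluster_footprint` function and the figures fed into it are invented for illustration, and it assumes the data splits evenly across the shards.

```python
def cluster_footprint(total_gb, num_shards, copies_per_shard):
    """Return (servers, GB per server, GB across the cluster) for an even split."""
    servers = num_shards * copies_per_shard
    per_server_gb = total_gb / num_shards        # each server holds one shard
    cluster_gb = total_gb * copies_per_shard     # every shard is stored N times
    return servers, per_server_gb, cluster_gb


# Replication only: one 'shard' (the whole database) kept as three copies.
print(cluster_footprint(total_gb=1200, num_shards=1, copies_per_shard=3))

# Sharding plus replication: four shards, each kept as three copies.
print(cluster_footprint(total_gb=1200, num_shards=4, copies_per_shard=3))
```

In both cases the cluster as a whole stores the same amount. What changes is how much each individual server has to hold: the whole cake with replication alone, or just a slice once sharding is added.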
Replication and sharding, then, are one way of managing your databases and networked applications to reduce cost.
If it sounds complicated that’s because, unfortunately, to optimize it correctly, it can be.
If it seems like too much work, there’s always the old solution of increasing the number of your databases, but if that seems too expensive, it may be time to look at a managed service.