This article explains how to choose which tables to shard and how to choose the sharding key.
Database sharding has proven itself a very successful strategy for scaling relational databases. Almost every large website or SaaS solution uses sharding when writing to its relational database. The reason is pretty simple – relational database technology is showing its age and just can’t meet today’s requirements: a massive number of operations per second, a lot of open connections (since there are many application servers talking to the database), huge amounts of data, and a very high write ratio (anything over 10% is high when it comes to relational databases).
Many sites and blog posts explain what sharding is. But how do you shard your application? Actually, the flow is quite simple, and consists of just four steps:
(Alas – it’s incredibly difficult to implement sharding.)
To build a sharding configuration, you’ll need the following data:
If you need instructions on how to collect this information, you can read our documentation here.
Of course, this section contains an assumption: that the database is well defined and already contains data. This is the easiest course of action. If you have a new database, and you are not sure yet how many rows each table will contain or what the SQL query log will show, you’ll just have to make some kind of educated guess.
Before you begin to implement sharding, it’s important to understand that not every table in the schema will be sharded. Sharding limits your SQL capabilities (no joins between sharded tables, no uniqueness enforced across shards, no auto-increment columns spanning shards, etc.), so every table you shard imposes limits on your application that will be very difficult to overcome in code.
Usually, some tables will just be replicated across all the shards. As a matter of fact, most tables will be replicated (in ScaleBase, for the sake of discussion, we call these tables Global tables), and only some tables will be sharded. You can read more about table types here.
The algorithm for choosing which tables to shard is not a very complex one:
Once you have decided which tables should be sharded (all the rest should be global tables), the choice of sharding keys is rather straightforward, as most will use the table primary key as the shard key. Of course, if multiple tables are sharded, and there is a foreign key relationship between these tables, then the foreign key will serve as the shard key for some tables.
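As a minimal sketch of the foreign-key case, assume two hypothetical sharded tables, `orders` and `order_items`, where `order_items.order_id` is a foreign key to `orders.order_id`. The table names, the shard count and the `shard_for` helper are illustrative assumptions, not part of any particular product. Routing the child table by the foreign key keeps each order and its items on the same shard, so they can still be joined locally:

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count


def shard_for(key_value, num_shards=NUM_SHARDS):
    """Map a shard-key value to a shard index using a stable hash."""
    digest = hashlib.sha256(str(key_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def shard_for_order(order):
    # Parent table: routed by its own primary key.
    return shard_for(order["order_id"])


def shard_for_order_item(item):
    # Child table: routed by the foreign key to the parent (order_id),
    # not by its own primary key (item_id).
    return shard_for(item["order_id"])


order = {"order_id": 1001, "total": 59.90}
items = [
    {"item_id": 1, "order_id": 1001, "sku": "A-17"},
    {"item_id": 2, "order_id": 1001, "sku": "B-02"},
]

for item in items:
    # Every item lands on the same shard as its parent order.
    print(item["item_id"], shard_for_order_item(item) == shard_for_order(order))
```

Had `order_items` been routed by `item_id` instead, the items of one order could be scattered across shards, and reading an order with its items would need a cross-shard query.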
Many people attempt to shard based on customer_id or a resource id, but I have seen how this usually fails in production environments. It is very hard to know in advance which customers belong together in the same database, and since customers can suddenly increase their traffic, this might create an unbalanced situation in which some shards are very busy while others sit nearly idle (see the details of last year’s Foursquare outage for some possible results of unbalanced sharding).
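The imbalance is easy to reproduce with a toy workload. The sketch below uses made-up row counts per customer and a plain modulo as the routing function (both are assumptions for illustration only). It compares routing every row by `customer_id` with routing every row by its own row id:

```python
from collections import Counter

NUM_SHARDS = 4  # hypothetical shard count

# Hypothetical workload: customer_id -> number of rows written.
# Customer 3 has suddenly become far busier than the others.
rows_per_customer = {1: 500, 2: 700, 3: 90_000, 4: 600, 5: 400, 6: 800}

# Strategy A: route by customer_id, so all rows of a customer share a shard.
by_customer = Counter()
for customer_id, n_rows in rows_per_customer.items():
    by_customer[customer_id % NUM_SHARDS] += n_rows

# Strategy B: route each row by its own primary key (a running row id).
by_row = Counter()
row_id = 0
for customer_id, n_rows in rows_per_customer.items():
    for _ in range(n_rows):
        by_row[row_id % NUM_SHARDS] += 1
        row_id += 1

print("by customer_id:", dict(sorted(by_customer.items())))
print("by row id:     ", dict(sorted(by_row.items())))
```

With these numbers, the shard that happens to hold customer 3 receives almost all of the rows under strategy A, while strategy B spreads the same rows evenly.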
As with database partitioning, there are multiple algorithms available for sharding: hash, list, or range.
Range: for example, a partition for all rows where zipcode has a value between 70000 and 79999.
List: for example, all rows where Country is either Iceland, Norway, Sweden, Finland or Denmark could build a partition for the Nordic countries.
Usually you’ll use list and range for multi-tenancy – saving customer information across different databases and maybe even different data centers. I’ll touch on that subject in a future post. But hash will probably give you the best results when it comes to sharding, as statistically it ensures that data is evenly distributed across all shards.
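The three algorithms can be sketched as small routing functions. This is a minimal illustration under stated assumptions: the shard count, the shard numbers assigned to each range and list, the second zipcode range, and all function names are hypothetical. The zipcode range 70000 to 79999 and the Nordic country list reuse the examples above.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4  # hypothetical shard count


def hash_shard(key_value, num_shards=NUM_SHARDS):
    """Hash sharding: a stable hash of the key, modulo the shard count.

    Python's built-in hash() is avoided on purpose: it is randomized per
    process for strings, so it would not route consistently across restarts.
    """
    digest = hashlib.sha256(str(key_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Range sharding: (low, high, shard) entries; the shard numbers are made up.
RANGE_SHARDS = [(70000, 79999, 0), (80000, 89999, 1)]


def range_shard(zipcode):
    for low, high, shard in RANGE_SHARDS:
        if low <= zipcode <= high:
            return shard
    raise KeyError(f"no shard configured for zipcode {zipcode}")


# List sharding: an explicit value -> shard mapping (Nordic countries on shard 0).
LIST_SHARDS = {
    country: 0
    for country in ("Iceland", "Norway", "Sweden", "Finland", "Denmark")
}


def list_shard(country):
    if country not in LIST_SHARDS:
        raise KeyError(f"no shard configured for country {country!r}")
    return LIST_SHARDS[country]


print(range_shard(75000), list_shard("Norway"))

# Count how 100,000 sequential keys spread across shards under hashing.
counts = Counter(hash_shard(i) for i in range(100_000))
print(dict(sorted(counts.items())))
```

Range and list routing need an explicit configuration entry for every value they should accept, which suits deliberate placement such as multi-tenancy. Hash routing needs no such table, and the per-shard counts it prints should come out close to one another.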
Ref: http://www.scalebase.com/how-to-implement-mysql-sharding/
http://www.scalebase.com/how-to-implement-mysql-sharding-%E2%80%93-part-2/