1.1 What is sharding?
If you google persistently, it turns out that the border between so-called partitioning and so-called sharding is rather blurred. Everyone calls whatever they like whatever they want. Some people distinguish horizontal partitioning from sharding. Others say that sharding is a particular kind of horizontal partitioning.
I did not find a single terminological standard approved by the founding fathers and certified by ISO. My personal inner conviction is something like this: partitioning, on average, is "cutting the database into pieces" in some arbitrarily chosen way.
- Vertical partitioning - by column. For example, there is a giant table with a couple of billion records in 60 columns. Instead of keeping one such giant table, we keep 60 smaller (but still giant) tables of 2 billion records each - and this is not a columnar database, but vertical partitioning (as an example of the terminology).
- Horizontal partitioning - we cut row by row, possibly within a single server. (Both kinds are sketched just below.)
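To make the distinction concrete, here is a minimal SQL sketch; the table and column names are invented for illustration:

-- Vertical partitioning: one wide table becomes several narrow ones,
-- each keeping all the rows but only some of the columns.
CREATE TABLE users_profile (user_id BIGINT PRIMARY KEY, name TEXT);
CREATE TABLE users_billing (user_id BIGINT PRIMARY KEY, balance DECIMAL(12,2));

-- Horizontal partitioning: several tables with identical columns,
-- each keeping only some of the rows.
CREATE TABLE users_0 (user_id BIGINT PRIMARY KEY, name TEXT, balance DECIMAL(12,2));
CREATE TABLE users_1 (user_id BIGINT PRIMARY KEY, name TEXT, balance DECIMAL(12,2));
-- rows with user_id % 2 = 0 go to users_0, the rest to users_1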
The awkward moment here is the subtle difference between horizontal partitioning and sharding. I can cut data into pieces all day long, but I cannot tell you for sure which of the two it is. There is a feeling that sharding and horizontal partitioning are about the same thing.
Sharding, in general, is when a large table (in database terms) or a collection of documents or objects (if you have a document store rather than a database) is cut up exactly by objects. That is, out of 2 billion objects, pieces are selected, of whatever size. The objects themselves are not cut into pieces, and we do not lay them out into separate columns; rather, we lay them out in batches in different places.
There are subtle terminological differences. For example, roughly speaking, Postgres developers may say that horizontal partitioning is when all the tables into which the main table is divided lie in the same schema, and when they are on different machines, that is already sharding.
In a general sense, without being tied to the terminology of a specific database or a specific data management system, there is a feeling that sharding is just typical row-by-row / document-by-document slicing, and so on - that's all.
I emphasize "typical". In the sense that we are doing all this not just to cut 2 billion documents into 20 tables, each of which would be more manageable, but in order to distribute the data over many cores, many disks, or many different physical or virtual servers.
1.2 Divide the indivisible
It is commonly assumed that we do this so that each shard - each piece of data - is replicated many times. But really, no.
INSERT INTO docs00
SELECT * FROM documents WHERE (id%16)=0;
...
INSERT INTO docs15
SELECT * FROM documents WHERE (id%16)=15;
In fact, if you do such a slicing of data, generating 16 small tables from one giant SQL table on MySQL on your valiant laptop, without going beyond a single laptop, a single schema, a single database, and so on - that's it, you already have sharding.
This results in the following:
- The bandwidth increases.
- Latency does not change: each worker (or consumer, in this case), so to speak, gets its own piece, and different requests are serviced in about the same time.
- Or both at once, plus high availability (replication).
Why bandwidth? We can sometimes have volumes of data that do not fit - it is not clear where, but they do not fit - on 1 {core | disk | server | ...}. There just are not enough resources, that's all. In order to work with this large dataset, you need to cut it up.
Why latency? On one core, scanning a table of 2 billion rows is 20 times slower than scanning 20 tables on 20 cores in parallel. Data is processed too slowly on a single resource.
Why high availability? We cut the data in order to get both of the above at the same time, and, along the way, keep several copies of each shard - replication ensures high availability.
1.3 A simple example "how to do it by hand"
A makeshift sharding can be carved out of the test table test.documents, which holds 32 documents, by generating 16 test tables of approximately 2 documents each from it: test.docs00, 01, 02, ..., 15.
INSERT INTO docs00
SELECT * FROM documents WHERE (id%16)=0;
...
INSERT INTO docs15
SELECT * FROM documents WHERE (id%16)=15;
Why approximately? Because a priori we do not know how the ids are distributed: if they run from 1 to 32 inclusive, there will be exactly 2 documents each; otherwise, not.
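If in doubt, it is easy to check the actual distribution before slicing; a minimal sketch against our documents table:

-- how many documents would land in each of the 16 shards
SELECT id % 16 AS shard_no, COUNT(*) AS docs
FROM documents
GROUP BY id % 16
ORDER BY shard_no;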
Why do we do this? After we have made 16 tables, we can "grab" 16 of whatever we need. Whatever we run up against, we can parallelize these resources. For example, if there is not enough disk space, it would make sense to spread these tables across separate disks.
All this, unfortunately, is not free. I suspect that in the case of the canonical SQL standard (I have not read the SQL standard for a long time; maybe it has not been updated for a long time), there is no official standardized syntax for saying to any SQL server: "Dear SQL server, make me 32 shards and divide them among 4 disks." But individual implementations often have a specific syntax for doing essentially the same thing. PostgreSQL has mechanisms for partitioning, MySQL and MariaDB have their own, and Oracle surely did all this a long time ago.
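For illustration, here is roughly what this looks like with PostgreSQL declarative partitioning (hash partitioning is available since PostgreSQL 11; the tablespace names disk0 and disk1 are invented and would have to be created beforehand, each pointing at its own disk):

CREATE TABLE documents (
    id      BIGINT NOT NULL,
    payload TEXT
) PARTITION BY HASH (id);

-- one partition per shard; TABLESPACE pins a partition to a particular disk
CREATE TABLE docs00 PARTITION OF documents
    FOR VALUES WITH (MODULUS 16, REMAINDER 0) TABLESPACE disk0;
CREATE TABLE docs01 PARTITION OF documents
    FOR VALUES WITH (MODULUS 16, REMAINDER 1) TABLESPACE disk1;
-- ... and so on up to REMAINDER 15

Note that PostgreSQL routes rows by its own hash of id rather than by a literal id % 16, so this is the same idea, not byte-for-byte the same slicing as our manual example.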
Nevertheless, if we do it by hand, without database support and within the framework of the standard, then we pay, relatively speaking, with the complexity of data access. Where there was a simple SELECT * FROM documents WHERE id=123, there are now 16 x SELECT * FROM docsXX. And it is good if we were trying to get a record by key. Much more interesting if we were trying to get a range of records. Now (if we, I emphasize, act like fools and remain within the framework of the standard), the results of these 16 SELECT * FROM queries will have to be combined in the application.
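A sketch of what the access patterns turn into, using the docsNN tables from above and assuming the application knows the sharding function id % 16:

-- Lookup by key: 123 % 16 = 11, so only one shard needs to be asked.
SELECT * FROM docs11 WHERE id = 123;

-- Range lookup: any shard may hold matching rows, so all 16 are asked.
-- Within one database, UNION ALL can do the merging; across servers,
-- the application performs the equivalent merge (and re-sort) itself.
SELECT * FROM docs00 WHERE id BETWEEN 100 AND 200
UNION ALL
SELECT * FROM docs01 WHERE id BETWEEN 100 AND 200
-- ... and so on through docs15
ORDER BY id;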
What performance change can you expect?
- Intuitively - linear.
- Theoretically - sublinear, because of Amdahl's law.
- Practically - maybe almost linear, maybe not.
In fact, the correct answer is unknown. With a clever application of the sharding technique, you can achieve a significant super-linear degradation in the performance of your application, and the DBA will come running with a red-hot poker.
Let's see how this can be achieved. It is clear that just setting shards=16 in PostgreSQL and having everything take off by itself is not interesting. Let's think about how to make sure that sharding 16 ways slows us down by a factor of 32 - this is interesting from the point of view of how not to do it.
Our attempts to speed up or slow down will always run into the classics - the good old Amdahl's law, which says that there is no perfect parallelization of any request; there is always some sequential part.
1.4 Amdahl's law
There is always a serialized part.
There is always a part of query execution that is parallelized, and there is always a part that is not. Even if it seems to you that the query is perfectly parallel, at the very least the assembly of the result row that you are going to send to the client, out of the rows received from each shard, is always there, and it is always sequential.
There is always some sequential part. It can be tiny, completely invisible against the general background; it can be gigantic and, accordingly, strongly affect parallelization; but it always exists.
In addition, its influence is not constant and can grow significantly. For example, if we cut our table - let's raise the stakes - of 64 records into 16 tables of 4 records each, this part will change. Of course, judging by such "gigantic" amounts of data, we are working on a mobile phone with a 2 MHz x86 processor, and we do not have enough file descriptors to keep everything open at once. Apparently, with such inputs, we open one file at a time.
- It was: Total = Serial + Parallel. Where, for example, Parallel is all the work inside the DB, and Serial is sending the result to the client.
- It became: Total2 = Serial + Parallel/N + Xserial. For example, when there is an overall ORDER BY, Xserial > 0.
With this simple example, I am trying to show that some Xserial appears. In addition to the always-present serialized part, and the fact that we are trying to work with data in parallel, there is an extra part to provide this data slicing. Roughly speaking, we may need to:
- find these 16 tables in the database's internal dictionary;
- open files;
- allocate memory;
- free memory;
- merge results;
- synchronize between cores.
Desynchronization effects appear anyway. They can be insignificant, taking one-billionth of the total time, but they are always non-zero and always there. With their help, we can lose performance dramatically after sharding.
This is the standard picture illustrating Amdahl's law. The important thing here is that the lines, which should ideally be straight and grow linearly, run into an asymptote. But since a graph from the Internet is unreadable, I have made, in my opinion, more visual tables with numbers.
Let's say we have some serialized part of the request processing that takes only 5%: serial = 0.05 = 1/20.
Intuitively, it would seem that with a serialized part taking only 1/20 of the request processing, if we parallelize the processing across 20 cores, it should become about 20 times (in the worst case, 18 times) faster.
In fact, mathematics is a heartless thing:
wall = 0.05 + 0.95/num_cores, speedup = 1 / (0.05 + 0.95/num_cores)
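If you would rather not do the arithmetic by hand, here is a quick sketch that reproduces the first table below (the unnest syntax is PostgreSQL-specific, but the formula is the same anywhere; change 0.05 for the other tables):

SELECT cores,
       round(1 / (0.05 + 0.95 / cores), 1)              AS speedup,
       round(100 * (1 / (0.05 + 0.95 / cores)) / cores) AS pct_of_ideal
FROM unnest(ARRAY[8, 10, 20, 40, 128]) AS cores;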
It turns out that if you carefully calculate, with a serialized part of 5%, the speedup will be 10 times (10.3), which is 51% compared to the theoretical ideal.
Cores | Speedup | % of ideal
8     | 5.9     | 74%
10    | 6.9     | 69%
20    | 10.3    | 51%
40    | 13.6    | 34%
128   | 17.4    | 14%
Having thrown 20 cores (or 20 disks, if you like) at a task that one core used to handle, we will never get a speedup of more than 20 times even theoretically, and in practice - much less. Moreover, as the degree of parallelism grows, the inefficiency grows greatly.
When only 1% of serialized work remains, and 99% is parallelized, the speedup values improve somewhat:
Cores | Speedup | % of ideal
8     | 7.5     | 93%
16    | 13.9    | 87%
32    | 24.4    | 76%
64    | 39.3    | 61%
For a perfectly thermonuclear query, which naturally takes hours to complete, with the preparatory work and the assembly of the result taking very little time (serial = 0.001), we already see good efficiency:
Cores | Speedup | % of ideal
8     | 7.94    | 99%
16    | 15.76   | 99%
32    | 31.04   | 97%
64    | 60.20   | 94%
Please note that we will never see 100%. In especially good cases, you can see, for example, 99.999%, but not exactly 100%.