5.1 Data distribution
Let's consider how data is distributed among the cluster nodes depending on the key. Cassandra lets you choose a data distribution strategy. The first strategy distributes data based on the md5 hash of the key value (RandomPartitioner). The second takes into account the byte representation of the key itself (ByteOrderedPartitioner).
The first strategy gives more advantages in most cases, since you do not need to worry about distributing data evenly between servers and similar problems. The second strategy is used in rare cases, for example, when range scans are needed. It is important to note that this choice is made before the cluster is created and in practice cannot be changed without completely reloading the data.
Cassandra uses a technique known as consistent hashing to distribute data. This approach makes it possible to spread the data across the nodes and to ensure that when a node is added or removed, only a small amount of data has to be transferred. To achieve this, each node is assigned a label (token), which splits the set of all md5 key values into ranges. Since RandomPartitioner is used in most cases, let's consider it.
As I said, RandomPartitioner calculates a 128-bit md5 hash for each key. To determine which nodes will store the data, Cassandra simply goes through the labels of the nodes from smallest to largest, and as soon as a label becomes greater than the md5 value of the key, that node, together with a number of subsequent nodes (in label order), is selected for storage. The total number of selected nodes equals the replication factor. The replication factor is set for each keyspace and allows you to adjust the degree of data redundancy.
Before a node can be added to the cluster, it must be given a label. The percentage of keys falling into the gap between the previous label and this one determines how much data will be stored on the node. The entire set of labels of a cluster is called a ring.
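To make the selection procedure more concrete, here is a minimal sketch in Java of the logic described above. It is only an illustration, not Cassandra's actual code: the class, node names, sample tokens and the replicasFor helper are invented for the example.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.*;

public class RingSketch {
    // The ring: label (token) -> node name, ordered by token value.
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();
    private final int replicationFactor;

    RingSketch(int replicationFactor) { this.replicationFactor = replicationFactor; }

    void addNode(String name, BigInteger token) { ring.put(token, name); }

    // md5 of the key interpreted as a non-negative 128-bit number, like RandomPartitioner does.
    static BigInteger md5Token(String key) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(key.getBytes(StandardCharsets.UTF_8));
        return new BigInteger(1, digest);
    }

    // Walk the labels from smallest to largest: take the first node whose label is
    // greater than the key's md5 value, then the following nodes in label order,
    // until replicationFactor distinct nodes are selected (wrapping around the ring).
    List<String> replicasFor(String key) throws Exception {
        BigInteger token = md5Token(key);
        List<String> replicas = new ArrayList<>();
        Iterator<String> it = ring.tailMap(token, false).values().iterator();
        while (replicas.size() < Math.min(replicationFactor, ring.size())) {
            if (!it.hasNext()) {                       // wrap around the ring
                it = ring.values().iterator();
            }
            String node = it.next();
            if (!replicas.contains(node)) {
                replicas.add(node);
            }
        }
        return replicas;
    }

    public static void main(String[] args) throws Exception {
        RingSketch ring = new RingSketch(3);
        // Six nodes with evenly spaced labels over the 128-bit md5 range.
        BigInteger max = BigInteger.ONE.shiftLeft(128);
        for (int i = 0; i < 6; i++) {
            ring.addNode("node" + (i + 1),
                    max.divide(BigInteger.valueOf(6)).multiply(BigInteger.valueOf(i)));
        }
        // Prints three consecutive nodes on the ring for this key.
        System.out.println(ring.replicasFor("user:42"));
    }
}
```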
The ring can be inspected with the built-in nodetool utility; on a cluster of 6 nodes with evenly spaced labels it shows each node's label and the share of data that node owns.
5.2 Data consistency when writing
Cassandra cluster nodes are equal peers, and clients can connect to any of them, both for writing and for reading. A request goes through a coordination stage, during which the server, using the key and the partitioner, finds out which nodes should hold the data and sends the request to those nodes. We will call the node that performs the coordination the coordinator, and the nodes selected to store the record with the given key the replica nodes. Physically, the coordinator can be one of the replica nodes - this depends only on the key, the partitioner and the labels.
For each request, both for reading and writing, it is possible to set the level of data consistency.
For a write, this level determines how many replica nodes the coordinator will wait for write confirmations from (data written) before returning control to the user. The following consistency levels exist for writes (an example of setting the level from client code is shown after the list):
- ONE - the coordinator sends requests to all replica nodes, but after waiting for confirmation from the first node, returns control to the user;
- TWO - the same, but the coordinator waits for confirmation from the first two nodes before returning control;
- THREE - similar, but the coordinator waits for confirmation from the first three nodes before returning control;
- QUORUM - a quorum is collected: the coordinator waits for write confirmations from more than half of the replica nodes, namely floor(N / 2) + 1, where N is the replication factor;
- LOCAL_QUORUM - the coordinator waits for confirmations from more than half of the replica nodes in the same data center where the coordinator is located (potentially different for each request). This avoids the latency of sending data to other data centers. Working with multiple data centers is only touched on in passing in this article;
- EACH_QUORUM - the coordinator waits for confirmations from more than half of the replica nodes in each data center, independently;
- ALL - the coordinator waits for confirmation from all replica nodes;
- ANY - makes it possible to write data even when none of the replica nodes are responding. The coordinator waits either for the first response from one of the replica nodes, or for the data to be saved on the coordinator itself as a hinted handoff.
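For illustration, this is roughly how a per-request write consistency level is set from client code. A minimal sketch, assuming the DataStax Java driver 4.x, a node reachable at the driver's default contact point, and a hypothetical ks.users table (the keyspace, table, column names and sample values are made up):

```java
import com.datastax.oss.driver.api.core.ConsistencyLevel;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class WriteConsistencyExample {
    public static void main(String[] args) {
        // The node the driver connects to acts as the coordinator for this request.
        try (CqlSession session = CqlSession.builder().build()) {
            SimpleStatement insert = SimpleStatement
                    .newInstance("INSERT INTO ks.users (id, name) VALUES (?, ?)", 42, "alice")
                    // QUORUM: the coordinator waits for floor(N / 2) + 1 write confirmations.
                    .setConsistencyLevel(ConsistencyLevel.QUORUM);
            session.execute(insert);
        }
    }
}
```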
5.3 Data consistency when reading
For reads, the consistency level determines how many replica nodes the read will be performed from. The following consistency levels exist for reads (a query-level example is shown after the list):
- ONE - the coordinator sends the request to the closest replica node. The remaining replicas are also read, for the purpose of read repair, with the probability specified in the Cassandra configuration;
- TWO - the same, but the coordinator sends requests to the two closest nodes, and the value with the newest timestamp is chosen;
- THREE - similar to the previous option, but with three nodes;
- QUORUM - a quorum is collected, that is, the coordinator sends requests to more than half of the replica nodes, namely floor(N / 2) + 1, where N is the replication factor;
- LOCAL_QUORUM - a quorum is collected in the data center where the coordination takes place, and the data with the newest timestamp is returned;
- EACH_QUORUM - the coordinator returns data after a quorum has been reached in each of the data centers;
- ALL - the coordinator returns data after reading from all replica nodes.
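Reads set the level in the same per-request way. Another sketch under the same assumptions as above (DataStax Java driver 4.x, hypothetical ks.users table):

```java
import com.datastax.oss.driver.api.core.ConsistencyLevel;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class ReadConsistencyExample {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            SimpleStatement select = SimpleStatement
                    .newInstance("SELECT name FROM ks.users WHERE id = ?", 42)
                    // QUORUM: the coordinator reads from floor(N / 2) + 1 replicas
                    // and returns the value with the newest timestamp.
                    .setConsistencyLevel(ConsistencyLevel.QUORUM);
            Row row = session.execute(select).one();
            if (row != null) {
                System.out.println(row.getString("name"));
            }
        }
    }
}
```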
Thus, it is possible to tune the time delays of read and write operations and to tune the consistency, as well as the availability, of each kind of operation. In fact, availability is directly related to the consistency level of reads and writes, since it determines how many replica nodes can go down while the operation is still confirmed.
If the number of nodes from which a write confirmation is received, plus the number of nodes from which a read is performed, is greater than the replication factor, then we have a guarantee that the new value will always be read after the write; this is called strong consistency. Without strong consistency, there is a possibility that a read operation will return stale data.
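This rule is easy to check with simple arithmetic. The sketch below just evaluates the W + R > N condition for a few sample combinations; the class and method names are made up for the example:

```java
public class StrongConsistencyCheck {
    // Number of nodes required by QUORUM for a given replication factor N.
    static int quorum(int n) {
        return n / 2 + 1;                 // integer division, i.e. floor(N / 2) + 1
    }

    // Strong consistency holds when writeNodes + readNodes > replicationFactor.
    static boolean strong(int w, int r, int n) {
        return w + r > n;
    }

    public static void main(String[] args) {
        int n = 3;                                            // replication factor
        System.out.println(strong(quorum(n), quorum(n), n));  // QUORUM/QUORUM: 2 + 2 > 3 -> true
        System.out.println(strong(n, 1, n));                  // ALL writes, ONE reads: 3 + 1 > 3 -> true
        System.out.println(strong(1, 1, n));                  // ONE/ONE: 1 + 1 > 3 -> false
    }
}
```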
In either case, the value will eventually propagate between the replicas, but only after the coordinator has stopped waiting. This propagation is called eventual consistency. If not all of the replica nodes are available at the time of the write, then sooner or later recovery mechanisms such as read repair and anti-entropy node repair will kick in. More on this later.
Thus, with QUORUM for both reads and writes, strong consistency is always maintained, and there is a balance between read and write latency. With ALL writes and ONE reads there is also strong consistency, and reads are faster and more available, i.e. more nodes can fail and a read will still complete than with QUORUM; writes, however, then require all replica nodes to be up.
With ONE writes and ALL reads there is strong consistency as well, and writes are faster and write availability is high, because it is enough to confirm that the write happened on at least one of the servers, while reads are slower and require all replica nodes. If an application does not require strong consistency, then both reads and writes can be sped up, and their availability improved, by setting lower consistency levels.