In the previous article, I wrote about the Cassandra node. Although the node plays an important role by itself, high availability, fault tolerance, resiliency and scalability are only achieved when we get multiple nodes to work together in a cluster.
Everything starts with the node, but a single node does not suffice. If that node crashes or is restarted, we are offline. Reasons include patching, maintenance, or simply running in a public cloud, where machines crash and are rebooted all the time.
Nowadays, it is more and more important to be online all the time. Cassandra uses the powerful capabilities provided by the ring to meet such a demanding set of requirements on always-on, always-working systems.
What is the ring?
The ring is a cluster of nodes. I may use the words ring and cluster interchangeably throughout this article.
A node is able to do everything from a functional perspective: we tell it to store data, it stores data; we ask it for data, it gives us data back. From an NFR (Non-Functional Requirements) standpoint, the ring is all about high availability, fault tolerance, resiliency and scalability.
The ring is a logical representation; it doesn’t mean that a node can only communicate with the next or the previous node in the ring. Any node can (and will) communicate with any other node.
Each node is going to be assigned a range of token values lying between -2^63 and 2^63-1, so there’s plenty of room for a ring to store data (though I may use much lower values in this article for the sake of the examples). But what are those useful for?
Data in Cassandra is partitioned by key. A partition can be written to any node in the ring (no special nodes, no leaders, no single point of failure or write contention).
When a node receives a read/write operation, the Partitioner is responsible for translating the key into a token value that falls within one of the ranges. Any node has the necessary information locally to forward the request to the node that owns that token value.
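To make the idea concrete, here is a toy sketch (not Cassandra code, and using the small made-up token values mentioned above) of how a token is mapped to the node owning the matching range:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy illustration only: a real cluster uses Murmur3 tokens in [-2^63, 2^63-1];
// here each map entry is the upper bound of a range and the node that owns it.
public class TokenRingSketch {
    public static void main(String[] args) {
        NavigableMap<Long, String> ring = new TreeMap<>();
        ring.put(-50L, "node-a"); // owns (-100, -50]
        ring.put(0L,   "node-b"); // owns (-50, 0]
        ring.put(50L,  "node-c"); // owns (0, 50]
        ring.put(100L, "node-d"); // owns (50, 100]

        long token = 37L; // pretend the partitioner hashed the partition key to 37
        Long rangeEnd = ring.ceilingKey(token); // first range whose upper bound >= token
        String owner = (rangeEnd == null)
                ? ring.firstEntry().getValue()  // wrap around the ring
                : ring.get(rangeEnd);
        System.out.println("Token " + token + " belongs to " + owner); // -> node-c
    }
}
```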
The coordinator is an important role in the ring. Because the client can write to any node, a node that receives an operation acts as the coordinator for that operation even if the partition token is not inside its range: it forwards the operation to the proper node on behalf of the client and answers the client as soon as the operation is completed.
It’s also possible to make the client token-aware so it sends the request to the owner of the data directly, using a proper load balancing policy as described later in this article.
Some questions arise that will be answered along the rest of this article:
- How does a node know about the other nodes?
- How does the driver know where to send data?
- What happens when a node joins or leaves the ring?
- How to determine the location of a node?
- How does replication work?
- How is consistency handled?
- What happens when a node fails to receive its data?
How does a node know about the other nodes?
The nodes communicate the metadata about the cluster with each other through Gossip, which is based on a well-established protocol. Every few seconds, each node starts to gossip with one to three other nodes in the cluster. A node can gossip with any other node.
Each node contains metadata that represents its view of the cluster. That includes:
- Heartbeat state: Generation and Version
- Application/Node state: Status, Datacenter, Rack, Schema, Load, etc
When a node gossips with another node, it sends a digest of the information it currently knows about.
The second node checks its own view of the state, requests an update for any information that is outdated (based on the Heartbeat version) and sends back any newer information it might have. The first node then accepts the newer information and builds a third package with the information the second node requested.
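As a rough sketch of the comparison logic (these are not Cassandra’s actual classes), each node can decide whether a digest refers to newer state than its own by looking at the generation first and the version second:

```java
// Toy model of the gossip digest comparison: a higher generation (bumped when a
// node restarts) always wins; for the same generation, the higher version wins.
public class GossipDigestSketch {

    static final class HeartbeatState {
        final int generation;
        final int version;
        HeartbeatState(int generation, int version) {
            this.generation = generation;
            this.version = version;
        }
    }

    static boolean isNewer(HeartbeatState candidate, HeartbeatState current) {
        if (candidate.generation != current.generation) {
            return candidate.generation > current.generation;
        }
        return candidate.version > current.version;
    }

    public static void main(String[] args) {
        HeartbeatState local  = new HeartbeatState(5, 120); // what this node believes about node X
        HeartbeatState gossip = new HeartbeatState(5, 133); // what the incoming digest claims
        // The digest is newer, so this node asks the sender for the missing updates about node X.
        System.out.println(isNewer(gossip, local)); // true
    }
}
```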
How does the driver know where to send data?
The driver becomes a part of the game as well. Load balancing policies determine how the driver spreads requests across the nodes in the ring.
- RoundRobinPolicy: Spreads the requests across the cluster in a round-robin fashion. It relies heavily on coordination as the node receiving the requests (the coordinator) might not be the owner of the partition token.
- TokenAwarePolicy: Understands the token ranges and sends data directly to the owner of the partition. The coordinator role becomes less important when using this policy.
- DCAwareRoundRobinPolicy: Makes sure requests stay in the local DC (datacenter)
The following example sets the load balancing policy when building the cluster object on the client side. The TokenAwarePolicy decorates the underlying policy with awareness of the token ranges.
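A minimal sketch using the DataStax Java driver 3.x (the contact point and the datacenter name are placeholders):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class ClusterFactory {
    public static Session connect() {
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1") // placeholder contact point
                // Token-aware routing wrapped around DC-aware round-robin:
                // requests go straight to a replica in the local datacenter.
                .withLoadBalancingPolicy(new TokenAwarePolicy(
                        DCAwareRoundRobinPolicy.builder()
                                .withLocalDc("dc-west") // placeholder datacenter name
                                .build()))
                .build();
        return cluster.connect();
    }
}
```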
What happens when a node joins or leaves the ring?
When a node is joining the ring, it gossips that out to the seed nodes. The seed nodes are part of the node’s configuration and have no special purpose apart from bootstrapping new nodes when they join the ring. It’s like the first couple of reliable neighbours a person has when moving into the neighbourhood. They play a relevant role when gossiping with the new joiner about the other neighbours (let’s just pretend gossiping is a good thing in real life). After that, the cluster communication carries on normally.
The other nodes will recalculate the token ranges and will start streaming data to the new joiner. While the node is in the Joining state receiving data, it’s not receiving any read requests.
On the other hand, when a node leaves the ring (let’s say gracefully, using the command nodetool decommission), it changes to the Leaving state and starts being deactivated by streaming its data to another node.
Vnodes
Token assignment is not always an even spread and may create hotspots. That may happen, for example, if the range assigned to one of the nodes is greater than the ranges assigned to the other nodes.
Vnodes (or Virtual Nodes) allow us to assign many smaller individual ranges to each node, so that when we add a new node to the cluster, it is able to get data from all the nodes in the ring, which makes the new node come online faster.
Adding and removing nodes, when using Vnodes, should not leave the cluster unbalanced. To use Vnodes, make sure to comment out the property initial_token and assign the property num_tokens a value greater than 1. Please check out this documentation for details.
How to determine the location of a node?
The Snitch is what determines the location of each node, in terms of rack and datacenter, and it’s configured through the property endpoint_snitch. Popular implementations include:
- SimpleSnitch: All nodes belong to the same datacenter and rack. Good for local tests, not good for production environments
- PropertyFileSnitch: Datacenter and rack locations are determined by a static file, cassandra-topology.properties, containing the topology definition. Deterministic, but not scalable, as any change in the topology implies changing the file
- GossipingPropertyFileSnitch: Each node has a file identifying its own location, cassandra-rackdc.properties, and that information flows through the cluster by being included in the Gossip packages
- Ec2Snitch / Ec2MultiRegionSnitch: Specific implementations for Amazon Web Services (AWS) deployments
- GoogleCloudSnitch: Specific implementation for Google Cloud deployments
How does replication work?
Replication allows a node to replicate its data to a separate replica node for high-availability and fault-tolerance purposes. If something bad happens to a node and its data does not exist anywhere else, that piece of data might be lost forever and no other node can serve it.
The Replication Factor describes how many copies of the data will exist and it can be set when creating the keyspace.
- RF=1 means that each node has a different set of data, which is equivalent to sharding.
- RF=2 means that upon a write operation, data is written into the node owning that partition and a replica. And so on.
It’s also common to have multi-region deployments to allow for things such as disaster recovery. Unlike SimpleStrategy, NetworkTopologyStrategy makes replica placement aware of the cluster topology (datacenters and racks).
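As a sketch (the keyspace name is a placeholder, and the datacenter names must match what the snitch reports), such a keyspace could be created through the driver session:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CreateKeyspace {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // NetworkTopologyStrategy takes a replication factor per datacenter.
            session.execute(
                "CREATE KEYSPACE IF NOT EXISTS my_keyspace WITH replication = {"
                + " 'class': 'NetworkTopologyStrategy',"
                + " 'dc-west': 2,"
                + " 'dc-east': 3 }");
        }
    }
}
```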
The above indicates RF=2 at dc-west and RF=3 at dc-east.
When the coordinator receives a write, it writes asynchronously to the local replicas, and it also forwards the write asynchronously to a remote coordinator, which is then responsible for syncing the other remote replicas.
How is consistency handled?
CAP stands for Consistency, Availability and network Partition tolerance. Distributed databases like Cassandra are built around this CAP theorem. Cassandra is designed to be AP, i.e. available and tolerant to network partitions.
But what about C? That doesn’t mean Cassandra is inconsistent. That just means that it favours Availability and Partition tolerance but pays with eventual consistency. The good thing about Cassandra is that Consistency Levels are tuneable, at the cost of things like latency or resiliency.
Each node is able to act as the coordinator of any request. This gives a lot of flexibility, especially in split-brain scenarios. Consistency is managed through the Consistency Level (CL).
The client gets to choose the consistency level for reads and writes. When performing a write/read request, the coordinator returns to the client under different conditions depending on the chosen level:
- CL=ALL: All replicas must store the data
- CL=ONE: At least one replica must store the data
- CL=QUORUM: Q replicas must store the data, where the value of Q depends on the replication factor (RF) and is calculated as sum_rf_in_all_datacenters / 2 + 1, rounded down (see the worked example after this list)
- CL=LOCAL_QUORUM: Useful on multi-region scenarios, to maintain QUORUM consistency only within the local datacenter, avoiding a performance hit to ensure consistency across datacenters
- CL=ANY: Means that at least one node (any node) must store data. Not widely used as it’s a bad idea, but it’s important for the explanation about hinted handoffs, below
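To make the QUORUM arithmetic concrete, here is a small sketch of the calculation (integer division performs the rounding down):

```java
public class QuorumMath {
    // quorum = (sum of replication factors across all datacenters / 2) + 1, rounded down
    static int quorum(int... rfPerDatacenter) {
        int sum = 0;
        for (int rf : rfPerDatacenter) {
            sum += rf;
        }
        return sum / 2 + 1; // integer division rounds down
    }

    public static void main(String[] args) {
        System.out.println(quorum(3));    // single DC, RF=3        -> 2
        System.out.println(quorum(2, 3)); // dc-west=2, dc-east=3   -> 3
    }
}
```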
Other consistency levels exist, though the above are the most commonly used.
Strong consistency can be achieved with a write CL of ALL and a read CL of ONE, because at read time we know in advance that all replicas will have the data.
But that write CL=ALL also means that we cannot afford to lose any node in the cluster, otherwise we won’t be able to write into our database. In general, a good compromise is using QUORUM levels for writes and reads, but of course that really depends on the use case.
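As a quick illustration of choosing the level per request, here is a sketch with the DataStax Java driver 3.x (the keyspace, table and data are hypothetical):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class TunableConsistency {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // Write with QUORUM: a majority of replicas must acknowledge the write.
            session.execute(new SimpleStatement(
                    "INSERT INTO my_keyspace.users (id, name) VALUES (42, 'alice')")
                    .setConsistencyLevel(ConsistencyLevel.QUORUM));

            // Read with QUORUM as well: the read and write sets are guaranteed to overlap.
            Row row = session.execute(new SimpleStatement(
                    "SELECT name FROM my_keyspace.users WHERE id = 42")
                    .setConsistencyLevel(ConsistencyLevel.QUORUM)).one();
            System.out.println(row.getString("name"));
        }
    }
}
```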
Read Consistency Level and Read-repairs
When we perform a read request with CL=ALL, all replicas are queried. One of the nodes (typically the fastest one) returns the data, and the other replicas return a digest of that data. The coordinator compares the digest of the received block of data with the other digests. If something does not match, the coordinator gets the data from the other nodes and resolves the conflict by comparing timestamps.
It works the same way as in the SSTable compaction process, LWW (Last Write Wins). The coordinator will reply to the client with the most recent data, and will sync-up the inconsistent nodes with the correct data asynchronously.
When the read CL is less than ALL (let’s say ONE or QUORUM), the response is returned immediately once the consistency level is met, and the property read_repair_chance (which defaults to 10%) is used as the probability of a read-repair being triggered in the background. In this case, it’s not possible to guarantee that all replicas are in sync.
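If the default doesn’t fit a table, that chance can be tuned per table. A minimal sketch, assuming a pre-4.0 Cassandra (where read_repair_chance is still a table option) and a hypothetical table:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class TuneReadRepair {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // 0.2 means a 20% chance of triggering a background read-repair on each read.
            session.execute("ALTER TABLE my_keyspace.users WITH read_repair_chance = 0.2");
        }
    }
}
```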
What happens when a node fails to receive its data?
Hinted handoffs are a feature of Cassandra that allows a coordinator to store data locally when it fails to write to a node in the replica set. Let’s say the client writes data: the write reaches the coordinator, the coordinator tries to write to the replicas for that partition, and one of them is down.
With hinted handoffs, the coordinator can store that data locally on disk, for up to 3 hours by default. The maximum hint window may be configured through the property max_hint_window_in_ms. Hinted handoffs are enabled by default; to disable them, one can use the property hinted_handoff_enabled.
When the failed node comes back online, it gossips that out, and the node holding the hint will start streaming that data into it.
This is tightly related to consistency levels.
- With a consistency level of ANY, the stored hint is enough to acknowledge the write back to the client, but it’s still not reliable.
- With a consistency level of ONE, if the coordinator can’t write to at least one node, the write is rejected.
What if the node is not back online after max_hint_window_in_ms, for example after a split-brain scenario? In that case, a data repair will be required. The nodetool utility offers the ability to manually repair the node using data coming from the other replicas.
Wrapping up
This is the second part of a Cassandra architecture overview. In the first part, we’ve seen the responsibilities of a single Cassandra node. In this article, we’ve seen how multiple nodes interact with each other in order to achieve a highly available, scalable, resilient and fault-tolerant database.
The ring connects a group of nodes together and offers powerful coordination capabilities that allow the cluster to adapt beautifully when a node joins or leaves the ring, or when something bad happens and a node is not able to respond.
This article was an introduction to the subject. For more details, feel free to check out the DataStax Cassandra documentation, which is pretty well structured and complete, or give Cassandra a try and dig into more distribution scenarios.