Wednesday, 25 July 2018

Cassandra - The fellowship of The Ring

In the previous article, I wrote about Cassandra Node. Although the node plays an important role by itself, things like high-availability, fault-tolerance, resiliency and scalability are only achieved when we get multiple of those nodes to work together in a cluster.

Everything starts with the node, but a single node does not suffice. If that node crashes or is restarted, we are offline. Reasons for that to happen might include patching, maintenance, or simply because you are in a public cloud and machines crash and are rebooted all the time.
Nowadays, it is becoming more and more important to be online all the time. Cassandra uses the capabilities provided by the ring to meet the demanding requirements of always-on, always-working systems.

What is the ring?

The ring is a cluster of nodes. I may use the words ring and cluster interchangeably throughout this article.
A node is able to do everything from a functional perspective: we tell it to store data, it stores data; we ask it for data, it gives us data back. From an NFR (Non-Functional Requirements) standpoint, the ring is all about high-availability, fault-tolerance, resiliency and scalability.
The ring is a logical representation; it doesn't mean that a node can only communicate with the next or the previous node in the ring. Any node can (and will) communicate with any other node.
Cassandra ring
Each node is going to be assigned a range of token values lying between -2^63 and 2^63-1, so there’s plenty of room for a ring to store data (though I may use much lower values in this article for the sake of the examples). But what are those useful for?
Data in Cassandra is partitioned by key. A partition can be written to any node in the ring (no special ones, no leaders, no single point of failure or write contention).
When a node receives a read/write operation, the Partitioner is responsible for translating the key into a token value that falls within one of the ranges. Any node has the necessary information locally to forward the request to the node that owns that token.
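To see the partitioner at work, CQL exposes a token() function that returns the token computed for a partition key. A minimal sketch, assuming a hypothetical users table keyed by id:
-- Hypothetical table used only for illustration
CREATE TABLE users (
  id   text,
  name text,
  PRIMARY KEY (id)
);

-- Shows the token the partitioner derives from the key 'user-42'
SELECT token(id), id, name FROM users WHERE id = 'user-42';
The returned token determines which node (and which replicas) own that partition.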
The coordinator plays an important role in the ring. Because the client can write to any node in the ring, when a node receives an operation, even if the partition token is not inside its range, it will act as the coordinator for that operation, forward it to the proper node on behalf of the client, and answer the client as soon as the operation is completed.
It’s also possible to make the client token-aware so it sends the request to the owner of the data directly, using a proper load balancing policy as described later in this article.
Some questions arise that will be answered along the rest of this article:
  • How does a node know about the other nodes?
  • How does the driver know where to send data?
  • What happens when a node joins or leaves the ring?
  • How to determine the location of a node?
  • How does replication work?
  • How is consistency handled?
  • What happens when a node fails to receive its data?

How does a node know about the other nodes?

The nodes exchange cluster metadata with each other through Gossip, a well-established peer-to-peer protocol. Every few seconds, each node gossips with one to three other nodes in the cluster. A node can gossip with any other node.
Each node contains metadata that represents its view of the cluster. That includes:
  • Heartbeat state: Generation and Version
  • Application/Node state: Status, Datacenter, Rack, Schema, Load, etc
When a node gossips to another node, it sends a digest of the information it currently knows about.
Cassandra gossip
The second node will check its own view of the state, it will request an update for its outdated information (based on the Heartbeat version) and will send back newer information it might have. The first node will then accept any newer information and build a third package to send the requested information to the second node.
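If you want to peek at this metadata on a running node, nodetool exposes it; the commands below are standard Cassandra tooling:
# Dump the gossip state this node holds for every peer (generation, heartbeat, STATUS, DC, RACK, LOAD, ...)
nodetool gossipinfo

# Condensed cluster view: up/down state, load, tokens and rack per node
nodetool status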

How does the driver know where to send data?

The driver becomes part of the game as well. Load balancing policies determine how the driver spreads requests across the nodes in the ring:
  • RoundRobinPolicy: Spreads the requests across the cluster in a round-robin fashion. It relies heavily on coordination as the node receiving the requests (the coordinator) might not be the owner of the partition token.
  • TokenAwarePolicy: Understands the token ranges and sends data directly to the owner of the partition. The coordinator role becomes less important when using this policy.
  • DCAwareRoundRobinPolicy: Makes sure requests stay in the local DC (Datacenter)
The following example sets the load balancing policy when building the cluster object on the client side.
Cluster cluster = Cluster.builder()
        .addContactPoint(endpoint)
        .withLoadBalancingPolicy(new TokenAwarePolicy(basePolicy))
        .build();
The TokenAwarePolicy decorates the underlying policy with the awareness for the token ranges.
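As a concrete sketch of what basePolicy might be, a common combination with the 3.x Java driver is a DC-aware policy wrapped by the token-aware one; the datacenter name dc-west is just an assumed example:
// Assumed local datacenter name; requests prefer nodes in dc-west
LoadBalancingPolicy basePolicy = DCAwareRoundRobinPolicy.builder()
        .withLocalDc("dc-west")
        .build();
// basePolicy is then wrapped as shown above: new TokenAwarePolicy(basePolicy)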

What happens when a node joins or leaves the ring?

When a node is joining the ring, it gossips that out to the seed nodes. The seed nodes are part of the node's configuration and have no special purpose apart from bootstrapping new nodes when they join the ring. It's like the first couple of reliable neighbours a person has when moving into the neighbourhood: they play a relevant role when gossiping with the new joiner about the other neighbours (let's just pretend gossiping is a good thing in real life). After that, cluster communication carries on normally.
The other nodes recalculate the token ranges and start streaming data to the new joiner. While the node is in the Joining state receiving data, it does not serve read requests.
On the other hand, when a node leaves the ring gracefully (for example, using the command nodetool decommission), it changes to the Leaving state and starts being deactivated by streaming its data to the remaining nodes.
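Both transitions can be watched from the command line with standard nodetool commands:
# Shows each node's Status (U/D = Up/Down) and State (N/J/L/M = Normal/Joining/Leaving/Moving)
nodetool status

# Run on the node being removed: it streams its ranges away and leaves the ring gracefully
nodetool decommission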
Vnodes
Token assignment is not always an even spread and may create hotspots. That may happen, for example, if the range for one of the nodes is set to be greater than the range for the other nodes.
Vnodes (or Virtual Nodes) allow us to create many smaller individual ranges per node, so that when we add a new node to the cluster, it can receive data from all the nodes in the ring, which lets the new node come online faster.
Adding and removing nodes, when using Vnodes, should not leave the cluster unbalanced. To use Vnodes, make sure to comment out the property initial_token and assign the property num_tokens a value greater than 1. Check the Cassandra documentation for details.
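For reference, a minimal cassandra.yaml fragment enabling Vnodes might look like this (256 has been the long-standing default number of tokens; pick what suits your cluster):
# cassandra.yaml
num_tokens: 256      # each node owns 256 small token ranges (vnodes)
# initial_token:     # leave commented out when using vnodes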

How to determine the location of a node?

The Snitch is what determines the location of each node, in terms of Rack and Datacenter, and it's configured through the property endpoint_snitch. Popular implementations include:
  • SimpleSnitch: All nodes belong to the same datacenter and rack. Good for local tests, not good for production environments
  • PropertyFileSnitch: Datacenter and rack locations are determined by a static file, cassandra-topology.properties, containing the topology definition. Deterministic, but not scalable, as any change in the topology implies changing the file
  • GossipingPropertyFileSnitch: Each node has a file identifying its own location, cassandra-rackdc.properties, and that information flows through the cluster by including it in the Gossip packages
  • Ec2Snitch / Ec2MultiRegionSnitch: Specific implementations for Amazon Web Services (AWS) deployments
  • GoogleCloudSnitch: Specific implementation for Google Cloud deployments
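For example, with GossipingPropertyFileSnitch the snitch is selected in cassandra.yaml and each node declares its own location in cassandra-rackdc.properties; the datacenter and rack names below are just placeholders:
# cassandra.yaml
endpoint_snitch: GossipingPropertyFileSnitch

# cassandra-rackdc.properties (one file per node)
dc=dc-west
rack=rack1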

How does replication work?

Replication allows a node to replicate its data to a separate replica node for high-availability and fault-tolerance purposes. If something bad happens to a node and the data is not anywhere else, that piece of data might be lost forever and no other node can serve that data.
The Replication Factor describes how many copies of the data will exist and it can be set when creating the keyspace.
CREATE KEYSPACE sample WITH replication = { 'class': 'SimpleStrategy', 
                                            'replication_factor': 3 };
  • RF=1 means that each node has a different set of data, which is equivalent to sharding.
  • RF=2 means that upon a write operation, data is written into the node owning that partition and a replica. And so on.
Cassandra replication
It's also common to have multi-region deployments to allow for things such as Disaster Recovery. Unlike SimpleStrategy, NetworkTopologyStrategy is aware of the cluster topology (datacenters and racks).
CREATE KEYSPACE sample WITH replication = { 'class': 'NetworkTopologyStrategy', 
                                            'dc-west': 2,
                                            'dc-east': 3 };
The above indicates RF=2 at dc-west and RF=3 at dc-east.
Cassandra Multi-DC replication
When the coordinator receives a write, it writes asynchronously to the local replicas, and it also forwards the write asynchronously to a coordinator in the remote datacenter, which is then responsible for syncing the replicas there.

How is consistency handled?

CAP stands for Consistency, Availability and network Partition tolerance. Distributed databases like Cassandra are built around this CAP theorem. Cassandra is designed to be AP, i.e. available and tolerant to network partitions.
But what about C? That doesn’t mean Cassandra is inconsistent. That just means that it favours Availability and Partition tolerance but pays with eventual consistency. The good thing about Cassandra is that Consistency Levels are tuneable, at the cost of things like latency or resiliency.
Each node is able to act as the coordinator of any request. This gives a lot of flexibility, especially in split-brain scenarios. Consistency is managed through the Consistency Level (CL).
The client gets to choose the consistency level for reads and writes. When performing a write/read request, the coordinator answers the client once different conditions are met, depending on the chosen level:
  • CL=ALL: All replicas must store the data
  • CL=ONE: At least one replica must store the data
  • CL=QUORUM: Q replicas must store the data, where the value of Q depends on the replication factor (RF) and is calculated as sum_rf_in_all_datacenters / 2 + 1, rounded down.
  • CL=LOCAL_QUORUM: Useful on multi-region scenarios, to maintain QUORUM consistency only within the local datacenter, avoiding a performance hit to ensure consistency across datacenters
  • CL=ANY: Means that at least one node (any node) must store data. Not widely used as it’s a bad idea, but it’s important for the explanation about hinted handoffs, below
Other consistency levels exist, though the above are the most commonly used.
Cassandra Quorum
Strong consistency can be achieved with a write CL of ALL and a read CL of ONE, because at read time we know in advance that every replica already has the data.
But that write CL=ALL also means that we cannot afford to lose any replica, otherwise we won't be able to write into our database. In general, a good compromise is using QUORUM for both writes and reads, but of course that really depends on the use-case.
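With the Java driver, the consistency level can be set per statement (or as a cluster-wide default through QueryOptions). A minimal sketch, assuming a sample.users table and a local contact point, both of which are illustrative:
import com.datastax.driver.core.*;

Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
Session session = cluster.connect();

// Write with QUORUM: a majority of replicas must acknowledge
Statement write = new SimpleStatement(
        "INSERT INTO sample.users (id, name) VALUES (?, ?)", "user-42", "Frodo")
        .setConsistencyLevel(ConsistencyLevel.QUORUM);
session.execute(write);

// Read with QUORUM: the read set overlaps the write set, giving strong consistency
Statement read = new SimpleStatement(
        "SELECT name FROM sample.users WHERE id = ?", "user-42")
        .setConsistencyLevel(ConsistencyLevel.QUORUM);
Row row = session.execute(read).one();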
Read Consistency Level and Read-repairs
When we perform a read request with CL=ALL, all replicas are queried. One of the nodes (typically the fastest one) returns the actual data, and the other two return a digest of that data. The coordinator compares the digest of the received block of data with the other digests. If something does not match, the coordinator fetches the data from the other nodes and resolves the conflict by comparing timestamps.
It works the same way as in the SSTable compaction process: LWW (Last Write Wins). The coordinator replies to the client with the most recent data, and syncs up the inconsistent nodes with the correct data asynchronously.
When the read CL is less than ALL (let’s say ONE or QUORUM), the response is returned immediately once the consistency level is met and the property read_repair_chance (which defaults to 10%) will be used as the probability of the read-repair being triggered in the background.
ALTER TABLE staff WITH read_repair_chance = 0.1;
In this case, it’s not possible to guarantee that all replicas are in-sync.

What happens when a node fails to receive its data?

Hinted handoffs are a feature of Cassandra that allows a coordinator to store data locally when it fails to write to a node in the replica set. Let's say the client writes data: it reaches the coordinator, the coordinator tries to write to the replicas for that partition, and one of them is down.
With hinted handoffs, the coordinator can store that data locally on disk, up to 3h by default. The maximum hint window may be configured through the property max_hint_window_in_ms. Hinted handoffs are enabled by default. To disable, one can use the property hinted_handoff_enabled.
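The corresponding cassandra.yaml settings, shown with their usual defaults, are:
# cassandra.yaml
hinted_handoff_enabled: true
max_hint_window_in_ms: 10800000   # 3 hours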
When the failed node comes back online, it gossips that out, and the node holding the hint will start streaming that data into it.
This is tightly related to consistency levels.
  • With a consistency level of ANY, the stored hint is enough to acknowledge the write back to the client, but it's still not reliable.
  • With a consistency level of ONE, if the coordinator can’t write to at least one node, the write is rejected.
What if the node is not back online after max_hint_window_in_ms, for example after a split-brain scenario? In that case, a repair will be required. nodetool offers the ability to manually repair the node using data coming from the other replicas.
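For example, once the node is reachable again you can run an anti-entropy repair for a keyspace (the keyspace name here is illustrative):
# Run a full repair of the keyspace on the node that was away longer than the hint window
nodetool repair -full my_keyspace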

Wrapping up

This is the second part of a Cassandra architecture overview. In the first part, we've seen the responsibilities of a single Cassandra node. In this article, we've seen how multiple nodes interact with each other in order to achieve a highly available, scalable, resilient and fault-tolerant database.
The ring connects a group of nodes together and offers powerful coordination capabilities that allow the cluster to adapt beautifully when a node joins or leaves the ring, or when something bad happens and a node is not able to respond.
This article was an introduction to the subject. For more details, feel free to check out the DataStax Cassandra documentation, which is well structured and complete, or give Cassandra a try and dig into more distribution scenarios.

Optimizing Cassandra Performance: Sometimes Two Writes Are Better Than One


SignalFx is a modern monitoring service that ingests, stores and performs real-time streaming analytics on high-volume, high-resolution metric data from companies all over the world.
Providing real-time streaming analytics means that we ingest tens of billions of points of time series data per day, and we give our customers the capability to send data at one second resolution. All of this data ends up in Cassandra, which we use as the backend of our time series database (or TSDB).
We chose Cassandra for scalability and read and write performance at extremely high load. For operational efficiency, we’ve gone through multiple stages of performance optimization. And we came to the counterintuitive conclusion: sometimes two writes perform better than one.

Measuring Overall Performance

We built a test environment with a load simulator to measure the difference in Cassandra performance as we moved through each optimization stage. For a constant simulated load measured in data points per second, we monitored and measured:
  • Write volume per second
  • Write latency in milliseconds
  • Host CPU load utilization
  • Host disk writes in bytes per second
We compared Cassandra performance across these metrics as we transitioned from one stage to the next and show these comparisons in a handful of key charts.

Stage 1: Vertical Writes

Our Cassandra schema is what you would expect. Each data point consists of a key name (or a time series ID key), a timestamp and a value for that timestamp. Each time series has its own row in the table, and we create tables representing distinct time ranges.
TSDB Schema
CREATE TABLE table_0 (
  timeseries text,
  time timestamp,
  value blob,
  PRIMARY KEY (timeseries, time)
) WITH COMPACT STORAGE;

As each datapoint comes in, we write it to the appropriate row and column for that time series.
Cassandra 1
It turns out that writing each data point individually is very expensive. In other words, touching every row, every second is a very expensive load pattern. So we decided to buffer data in memory and write multiple points for each time series in a single batch statement.
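Conceptually, the buffered write for a single time series becomes one batch against the schema above; the timestamps and blob values below are made up for illustration:
BEGIN BATCH
  INSERT INTO table_0 (timeseries, time, value) VALUES ('ts-123', '2015-06-01 10:00:00', 0x01);
  INSERT INTO table_0 (timeseries, time, value) VALUES ('ts-123', '2015-06-01 10:00:01', 0x02);
  INSERT INTO table_0 (timeseries, time, value) VALUES ('ts-123', '2015-06-01 10:00:02', 0x03);
APPLY BATCH;
Because every statement targets the same partition ('ts-123'), this is a single-partition batch, which is the cheap kind.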

Stage 2: Buffered Writes

In this version of the ingest system, we will write new data into a memory-tier. A migrator process will periodically read data from the memory-tier, write it to Cassandra, and, once it’s safely in Cassandra, remove it from the memory-tier. In other words, the migrator picks up a time range of data and moves it as a batch into Cassandra.
The TSDB is now effectively two-tiered. The memory-tier is essentially a higher-performance backend for the most recent data. The TSDB knows whether data for a specific time range belongs in the memory-tier or in Cassandra, and therefore routes reads and writes appropriately.
Cassandra 2
There are two independent operations here:
  1. New points are being ingested on the right side (same as the non-buffered Cassandra case)
  2. Batches of points are being written to Cassandra on the left side
The buffered-writes stage shows improvements in efficiency for Cassandra compared to vertical writes. While the write pattern is choppier, Cassandra is doing many fewer writes than in the previous stage. These writes are larger, but the host CPU utilization decreases significantly.
Cassandra stage 1 to 2
So buffering was an improvement to performance, although there was more that we could do. In this stage, writing data point-by-point means that a column is created for each data point, and this has implications for storage overhead in Cassandra.

Stage 3: Packed Writes

The next optimization stage is to have the migrator pack the contents of each batch of points into a single block which it writes to Cassandra. This reduces the number of columns and write operations in each row, which has a larger benefit for storage than CPU.
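Against the same schema, a packed write is a single insert whose blob holds an encoded block of points for that time range; the encoding and values are invented for illustration:
-- One column per time series and block start time; the blob packs, say, a minute of points
INSERT INTO table_0 (timeseries, time, value)
VALUES ('ts-123', '2015-06-01 10:00:00', 0x0a1b2c3d4e5f);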
Cassandra 3
This packed write operation is essentially the same as the previous, buffered case. However, we are writing fewer, bigger objects to Cassandra and the write rate has dropped tremendously. Latency also improves with fewer writes, while disk writes also drop as the blocks are more compact.

Stage 4: Persistent Logs

While the above changes represent significant performance improvements, they introduce a big problem: if a memory-tier server crashes, we lose all of the buffered data. Obviously that is not acceptable. We’ll solve that by writing data to Cassandra as we receive it. Why is that a good idea now when it performed so poorly before?
The answer is that we use a more favorable write pattern for Cassandra. The log schema has a row for each timestamp. Data points arriving at the same time for different time series typically have similar timestamps. Therefore, we can write these data points across a small number of rows in Cassandra instead of a row per time series as we did originally.
TSDB Schema
CREATE TABLE table_0 (
  timeseries text,
  time timestamp,
  value blob,
  PRIMARY KEY (timeseries, time)
) WITH COMPACT STORAGE;

Log Schema
CREATE TABLE table_0 (
  stamp text,
  sequence bigint,
  value blob,
  PRIMARY KEY (stamp, sequence)
) WITH COMPACT STORAGE;

We write data points for different time series in the order they arrive so that we can get them onto persistent storage as quickly as possible. Because this order of arrival is effectively non-deterministic, there’s no efficient way to retrieve a datapoint for a specific time series.
This is not a problem as we only read this data when we need to construct the memory-tier of the ingest server; we do random reads from the memory-tier.
Cassandra 4
The data is migrated from the memory-tier just as before. Once it’s been migrated, we can also remove the log data from Cassandra simply by truncating the table in which we store it.
Cassandra 3 to 4
With this process, there are clearly more write operations and an increase in disk I/O. However there are no adverse effects on write latency or on CPU load and, of course, our data is protected from a crash.

Ongoing Optimizations

We’ve learned a lot about how to incrementally improve Cassandra performance based on these optimization stages. Our analysis shows that CPU load utilization is much more dependent on the rate of writes than on the volume of data being written. For our very write-heavy workload, we saw a very large efficiency improvement by doing fewer, larger writes. This lets us get much better utilization from our Cassandra cluster.

Understanding Cassandra tombstones

We recently deployed to production a distributed system that uses Cassandra as its persistent storage.

Not long after we noticed that there were many warnings about tombstones in Cassandra logs.
WARN  [SharedPool-Worker-2] 2017-01-20 16:14:45,153 ReadCommand.java:508 -
Read 5000 live rows and 4771 tombstone cells for query
SELECT * FROM warehouse.locations WHERE token(address) >= token(D3-DJ-21-B-02) LIMIT 5000
(see tombstone_warn_threshold)
We found it quite surprising at first because we had only inserted data so far and didn't expect to see that many tombstones in our database. After asking around, no one seemed to have a clear explanation of what was going on in Cassandra.
In fact, the main misconception about tombstones is that people associate them with delete operations. While it's true that tombstones are generated when data is deleted, that is not the only case, as we shall see.

Looking into sstables

Cassandra provides a tool to look at what is stored inside an sstable: sstabledump. This tool comes with the 'cassandra-tools' package, which is not automatically installed with Cassandra. It's quite straightforward to install on a Debian-like (e.g. Ubuntu) distribution:
sudo apt-get update
sudo apt-get install cassandra-tools
In this blog post I used sstabledump to understand how Cassandra stores data and when tombstones are generated.
The syntax is pretty straightforward:
sstabledump /var/lib/cassandra/data/warehouse/locations-660dbcb0e4a211e6814a9116fc548b6b/mc-1-big-Data.db
sstabledump just takes the sstable file and displays its content as JSON. Before being able to dump an sstable, we need to flush the in-memory data into an sstable file using nodetool:
nodetool flush warehouse locations
This command flushes the table ‘locations’ in the ‘warehouse’ keyspace.
Now that we’re all setup, let’s have a look at some cases that generate tombstones.

Null values create tombstones

An upsert operation can generate a tombstone as well. Why? Because Cassandra doesn’t store ‘null’ values. Null means the absence of data. Cassandra returns a ‘null’ value when there is no value for a field. Therefore when a field is set to null Cassandra needs to delete the existing data.
INSERT INTO movements (
  id,
  address,
  item_id,
  quantity,
  username
) VALUES (
  103,
  'D3-DJ-21-B-02',
  '3600029145',
  2,
  null
);
This statement removes any existing username value for the movement identified by the id 103. And how does Cassandra remove data? By inserting a tombstone.
This is the corresponding sstabledump output:
[
  {
    "partition" : {
      "key" : [ "103" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 18,
        "liveness_info" : { "tstamp" : "2017-01-27T15:09:50.065224Z" },
        "cells" : [
          { "name" : "address", "value" : "D3-DJ-21-B-02" },
          { "name" : "item_d", "value" : "3600029145" },
          { "name" : "quantity", "value" : "2" },
          { "name" : "username", "deletion_info" : { "local_delete_time" : "2017-01-27T15:09:50Z" }
        ]
      }
    ]
  }
]
Cassandra is designed for optimised write performance and every operation is written to an append-only log. When data is removed we can't remove the existing value from the log; instead, a "tombstone" value is inserted into the log.
Moreover, Cassandra doesn't perform a read before a write (except for lightweight transactions) as it would be too expensive.
Therefore, when the above insert is executed, Cassandra inserts a tombstone value for the username field (even if there was no existing data for this key before).
Now let’s consider the following statement that looks very similar to the previous one:
INSERT INTO movements (
  id,
  address,
  item_id,
  quantity
) VALUES (
  103,
  'D3-DJ-21-B-02',
  '3600029145',
  2
);
But there is one difference. The first statement creates a tombstone for the username whereas the second statement doesn’t insert anything in that column.
[
  {
    "partition" : {
      "key" : [ "103" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 18,
        "liveness_info" : { "tstamp" : "2017-01-27T15:09:50.065224Z" },
        "cells" : [
          { "name" : "address", "value" : "D3-DJ-21-B-02" },
          { "name" : "item_d", "value" : "3600029145" },
          { "name" : "quantity", "value" : "2" }
        ]
      }
    ]
  }
]
If this is the first insert for this key (no previously existing data) then both statements yield the same state, except that the second one doesn't insert an unnecessary tombstone.
If there is existing data then what ends up in the username column might differ. With statement 1, whatever data was there is deleted by the tombstone and no longer returned. With statement 2, the username remains unchanged, so whatever value was there before (if any) will still be returned.
Therefore you should strive to only update the fields that you need to.
For instance, let's say that I need to update the status of a location. Then I should only update the status field rather than the whole object. That avoids tombstones for every missing value in the object.
The following statement is exactly what we need, as it only sets the status field; the 'properties' field is not set in the query, so no value (and no tombstone) is stored for it:
UPDATE locations SET status = 'damaged' WHERE address = 'D3-DJ-21-B-02';
as the sstable dump shows
[
  {
    "partition" : {
      "key" : [ "D3-DJ-21-B-02" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 28,
        "cells" : [
          { "name" : "status", "value" : "damaged", "tstamp" : "2017-01-28T11:55:18.146255Z" }
        ]
      }
    ]
  }
]
Compare it with the following one, which saves the whole location object (which happens to not have any properties) and inserts an unnecessary tombstone in the 'properties' column.
INSERT INTO locations (
  address,
  status,
  properties
) VALUES (
  'D3-DJ-21-B-02',
  'damaged',
  null
);
An empty collection is stored as a tombstone cell
[
  {
    "partition" : {
      "key" : [ "D3-DJ-21-B-02" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 18,
        "liveness_info" : { "tstamp" : "2017-01-28T11:58:59.160898Z" },
        "cells" : [
          { "name" : "status", "value" : "damaged" },
          { "name" : "properties", "deletion_info" : { "marked_deleted" : "2017-01-28T11:58:59.160897Z", "local_delete_time" : "2017-01-28T11:58:59Z" } }
        ]
      }
    ]
  }
]

Be aware of the collection types

In the previous example the 'properties' field is a collection type (most likely a set), so let's talk about collections, as they are trickier than they look.
Let’s create a new location with the following statement
INSERT INTO locations (
  address,
  status,
  properties
) VALUES (
  'C3-BE-52-C-01',
  'normal',
  {'pickable'}
);
Everything looks good, doesn’t it? Every field has a value, so no tombstone expected. And yet, this statement does create a tombstone for the ‘properties’ field.
[
  {
    "partition" : {
      "key" : [ "C3-BE-52-C-01" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 18,
        "liveness_info" : { "tstamp" : "2017-01-28T12:01:00.256789Z" },
        "cells" : [
          { "name" : "status", "value" : "normal" },
          { "name" : "properties", "deletion_info" : { "marked_deleted" : "2017-01-28T12:01:00.256788Z", "local_delete_time" : "2017-01-28T12:01:00Z" } },
          { "name" : "properties", "path" : [ "pickable" ], "value" : "" }
        ]
      }
    ]
  }
]
To understand why, we need to look at how Cassandra stores a collection in the underlying storage.
The collection field includes a tombstone cell to empty the collection before adding the new value.
Cassandra appends new values to the set, so when we want the collection to contain only the values passed in the query, we have to remove everything that might have been there before. That's why Cassandra inserts a tombstone and then our value. This makes sure the set now contains only the 'pickable' value, whatever was there before.
That’s one more reason to just set the values you need to update and nothing more.
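In CQL terms, that means using the collection-update syntax, which appends to the set without clearing it first and therefore writes no tombstone (unlike re-inserting the whole set):
-- Adds 'pickable' to whatever the set already contains; no tombstone is generated
UPDATE locations SET properties = properties + {'pickable'} WHERE address = 'C3-BE-52-C-01';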

Be careful with materialised views

A materialised view is a table that is maintained by Cassandra. One of its main features is that we can define a different primary key than the one in the base table. You can re-order the fields of the primary key from the base table, but you can also add one extra field into the primary key of the view.
This is great as it allows us to define a different partitioning or clustering, but it also generates more tombstones in the view. Let's consider an example to understand what's happening.
Imagine that we need to query the locations by status. For instance we want to retrieve all ‘damaged’ locations. We can create a materialised view to support this use case.
CREATE MATERIALIZED VIEW locations_by_status AS
  SELECT
    status,
    address,
    properties
  FROM locations
  WHERE status IS NOT NULL
  AND address IS NOT NULL
  PRIMARY KEY (status, address);
Good, now we can use this view to find out all the locations with a given status.
But let’s consider what happens in the view when we change the status of a location with an update query
UPDATE locations
SET status = 'damaged'
WHERE address = 'C3-BE-52-C-01';
As we've seen, this query just updates one field in the base table (locations) and doesn't generate any tombstone in that table. However, in the materialised view the 'status' field is part of the primary key. When the status changes, the partition key changes as well.
[
  {
    "partition" : {
      "key" : [ "normal" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 18,
        "clustering" : [ "C3-BE-52-C-01" ],
        "deletion_info" : { "marked_deleted" : "2017-01-20T10:34:27.707604Z", "local_delete_time" : "2017-01-20T10:46:14Z" },
        "cells" : [ ]
      }
    ]
  },
  {
    "partition" : {
      "key" : [ "damaged" ],
      "position" : 31
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 49,
        "clustering" : [ "C3-BE-52-C-01" ],
        "liveness_info" : { "tstamp" : "2017-01-20T10:46:14.285730Z" },
        "cells" : [
          { "name" : "properties", "deletion_info" : { "marked_deleted" : "2017-01-20T10:46:14.285729Z", "local_delete_time" : "2017-01-20T10:46:14Z" } }
        ]
      }
    ]
  }
]
To maintain the view in sync with the base table Cassandra needs to delete the row from the existing partition and insert a new one into the new partition. And a delete means a tombstone.
The update in the base table triggers a partition change in the materialised view which creates a tombstone to remove the row from the old partition.
The key thing here is to be thoughtful when designing the primary key of a materialised view (especially when the key contains more fields than the key of the base table). That being said it might be the only solution and completely worth it.
Also consider the rate of the changes of the fields of the primary key. In our case we should evaluate the rate at which the status changes for a given location (the location address doesn’t change). The less often the better off we are with respect to tombstones.
Finally, tombstones disappear over time, when compaction runs after the grace period has elapsed. The typical delay is 10 days (the default value of the 'gc_grace_seconds' configuration parameter). You may want to adjust it if too many tombstones are generated during this period.
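The grace period is configured per table; for example (864000 seconds is the 10-day default, and it should stay comfortably above your repair interval):
ALTER TABLE warehouse.locations WITH gc_grace_seconds = 864000;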