SignalFx is a modern monitoring service that ingests, stores, and performs real-time streaming analytics on high-volume, high-resolution metric data from companies all over the world.
Providing real-time streaming analytics means that we ingest tens of billions of points of time series data per day, and we give our customers the capability to send data at one second resolution. All of this data ends up in Cassandra, which we use as the backend of our time series database (or TSDB).
We chose Cassandra for its scalability and its read and write performance under extremely high load. For operational efficiency, we’ve gone through multiple stages of performance optimization and arrived at a counterintuitive conclusion: sometimes two writes perform better than one.
Measuring Overall Performance
We built a test environment with a load simulator to measure the difference in Cassandra performance as we moved through each optimization stage. For a constant simulated load measured in data points per second, we monitored and measured:
- Write volume per second
- Write latency in milliseconds
- Host CPU load utilization
- Host disk writes in bytes per second
We compared Cassandra performance across these metrics as we transitioned from one stage to the next and show these comparisons in a handful of key charts.
Stage 1: Vertical Writes
Our Cassandra schema is what you would expect. Each data point consists of a key name (or a time series ID key), a timestamp and a value for that timestamp. Each time series has its own row in the table, and we create tables representing distinct time ranges.
TSDB Schema:

CREATE TABLE table_0 (
    timeseries text,
    time timestamp,
    value blob,
    PRIMARY KEY (timeseries, time)
) WITH COMPACT STORAGE;
As each datapoint comes in, we write it to the appropriate row and column for that time series.
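For illustration, each incoming data point maps to a single-column write against that schema. A minimal sketch, with a made-up time series key and value:

INSERT INTO table_0 (timeseries, time, value)
VALUES ('host1.cpu.user', '2015-02-26 12:00:01+0000', 0x3ff0000000000000);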
It turns out that writing each data point individually is very expensive. In other words, touching every row, every second is a very expensive load pattern. So we decided to buffer data in memory and write multiple points for each time series in a single batch statement.
Stage 2: Buffered Writes
In this version of the ingest system, we write new data into a memory-tier. A migrator process periodically reads data from the memory-tier, writes it to Cassandra, and, once it’s safely in Cassandra, removes it from the memory-tier. In other words, the migrator picks up a time range of data and moves it as a batch into Cassandra.
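A minimal sketch of one of the migrator’s batch writes, assuming an unlogged batch keyed to a single partition and made-up keys and values (in practice this would be a prepared statement issued through a driver):

BEGIN UNLOGGED BATCH
    INSERT INTO table_0 (timeseries, time, value) VALUES ('host1.cpu.user', '2015-02-26 12:00:01+0000', 0x3ff0000000000000);
    INSERT INTO table_0 (timeseries, time, value) VALUES ('host1.cpu.user', '2015-02-26 12:00:02+0000', 0x4000000000000000);
    INSERT INTO table_0 (timeseries, time, value) VALUES ('host1.cpu.user', '2015-02-26 12:00:03+0000', 0x4008000000000000);
APPLY BATCH;

Keeping every statement in the batch on the same partition (the same time series row) keeps the batch cheap for the coordinator node.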
The TSDB is now effectively two-tiered. The memory-tier is essentially a higher performance backend for the most recent data. It knows whether data for a specific time range belongs in the memory-tier or on Cassandra, and therefore routes reads and writes appropriately.
There are two independent operations here:
- New points are being ingested into the memory-tier (same as the non-buffered Cassandra case)
- Batches of points are being migrated from the memory-tier into Cassandra
Buffered writes show a clear efficiency improvement for Cassandra compared to vertical writes. While the write pattern is choppier, Cassandra is doing far fewer writes than in the previous stage. These writes are larger, but host CPU utilization decreases significantly.
So buffering improved performance, although there was more we could do. In this stage, data is still written point by point, which means a column is created for each data point, and this has implications for storage overhead in Cassandra.
Stage 3: Packed Writes
The next optimization stage is to have the migrator pack the contents of each batch of points into a single block, which it writes to Cassandra. This reduces the number of columns and write operations in each row, which has a larger benefit for storage than for CPU.
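A sketch of what a packed write could look like, where the single blob value holds an encoded run of points for one time series over a time window (the key, timestamp, and encoding here are illustrative, not our actual format):

-- one column whose value packs several points for the window starting at this timestamp
INSERT INTO table_0 (timeseries, time, value)
VALUES ('host1.cpu.user', '2015-02-26 12:00:00+0000', 0x3ff000000000000040000000000000004008000000000000);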
This packed write operation is essentially the same as the previous, buffered case. However, we are writing fewer, bigger objects to Cassandra, and the write rate has dropped tremendously. Latency also improves with fewer writes, and disk write volume drops as the blocks are more compact.
Stage 4: Persistent Logs
While the above changes represent significant performance improvements, they introduce a big problem: if a memory-tier server crashes, we lose all of the buffered data. Obviously that is not acceptable. We’ll solve that by writing data to Cassandra as we receive it. Why is that a good idea now when it performed so poorly before?
The answer is that we now use a write pattern that is much more favorable for Cassandra. The log schema has a row for each timestamp. Data points arriving at the same time for different time series typically have similar timestamps, so we can write these data points across a small number of rows in Cassandra instead of a row per time series as we did originally.
TSDB Schema:

CREATE TABLE table_0 (
    timeseries text,
    time timestamp,
    value blob,
    PRIMARY KEY (timeseries, time)
) WITH COMPACT STORAGE;

Log Schema:

CREATE TABLE table_0 (
    stamp text,
    sequence bigint,
    value blob,
    PRIMARY KEY (stamp, sequence)
) WITH COMPACT STORAGE;
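Under the log schema, points for different time series are appended under a shared stamp partition in the order they arrive. A minimal sketch, keeping the table_0 placeholder name and using made-up stamp, sequence, and value encodings:

INSERT INTO table_0 (stamp, sequence, value) VALUES ('2015-02-26 12:00:01', 1, 0x0a3ff0000000000000);
INSERT INTO table_0 (stamp, sequence, value) VALUES ('2015-02-26 12:00:01', 2, 0x174000000000000000);
INSERT INTO table_0 (stamp, sequence, value) VALUES ('2015-02-26 12:00:01', 3, 0x084010000000000000);

In this sketch each blob would carry enough information to replay the point into the memory-tier (for example, the time series ID and value), while the sequence column simply keeps arrivals within the same stamp ordered and distinct.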
We write data points for different time series in the order they arrive so that we can get them onto persistent storage as quickly as possible. Because this order of arrival is effectively non-deterministic, there’s no efficient way to retrieve a datapoint for a specific time series.
This is not a problem, as we only read this data when we need to reconstruct the memory-tier of an ingest server; random reads are served from the memory-tier.
The data is migrated from the memory-tier just as before. Once it’s been migrated, we can also remove the log data from Cassandra simply by truncating the table in which we store it.
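Once a time range has been safely migrated, cleaning up its log data is a single table-level operation (again with the placeholder table name):

TRUNCATE table_0;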
With this process, there are clearly more write operations and an increase in disk I/O. However, there are no adverse effects on write latency or on CPU load and, of course, our data is protected from a crash.
Ongoing Optimizations
We’ve learned a lot about how to incrementally improve Cassandra performance based on these optimization stages. Our analysis shows that CPU load utilization is much more dependent on the rate of writes than on the volume of data being written. For our very write-heavy workload, we saw a very large efficiency improvement by doing fewer, larger writes. This lets us get much better utilization from our Cassandra cluster.