Wednesday, 25 July 2018

Sorted pagination in Cassandra

Cassandra is a fantastic database for many use cases. Sometimes, though, you need to twist it a little, and studying one of those situations is a helpful exercise for understanding what Cassandra is about. Databases are complex beasts, and approaching them at the right level of abstraction is vital. Their final goal is not storing data per se but making that data accessible, so the read patterns determine which database is the best tool for the job.

Time series in Cassandra
A time series is a collection of data points recorded for some variable over time. Facebook's timeline is a great example: a user writes a series of posts over time, and the access pattern to that data is something like 'return the last 20 posts of user 1234'. The DDL of a table that models that query would be:

CREATE TABLE timeline (
    user_id uuid,
    post_id timeuuid,
    content text,
    PRIMARY KEY (user_id, post_id)
)
WITH CLUSTERING ORDER BY (post_id DESC);
In Cassandra, primary keys are formed by partition keys and clustering keys. Primary keys identify rows uniquely, but not in the way relational databases do: uniqueness is not enforced by rejecting duplicates. If you insert a row whose primary key already exists, the existing cells are overwritten. The other way around also holds: an update to a row that does not exist ends up as an insert. That's called an upsert.
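The upsert behaviour can be illustrated with a toy in-memory model. This is only a sketch in plain Python, not the Cassandra driver API: the `table` dict and the `write` helper are hypothetical names introduced for illustration, and integers stand in for timeuuid values.

```python
# Toy model of upsert semantics (hypothetical names, not a Cassandra API).
# The table maps a primary key (user_id, post_id) to a dict of columns.
table = {}

def write(user_id, post_id, **columns):
    """Model both INSERT and UPDATE: merge columns into the row for this key."""
    table.setdefault((user_id, post_id), {}).update(columns)

write("u1", 1, content="Hi")        # behaves like an INSERT
write("u1", 1, content="Hi again")  # same primary key: the cell is overwritten
write("u1", 2, content="Hola")      # an "UPDATE" of a missing row creates it
```

After these three writes the model holds two rows, because the first two calls share a primary key.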

The partition key determines which node of the cluster the data is going to live on. If you include at least one clustering key, a single partition key can identify N rows. That can be confusing for someone coming from traditional relational databases. Cassandra does its best to map its concepts onto SQL terminology, but the result can sometimes look odd to newcomers. An example of the timeline table would be:

user_id--------------------------------post_id--------content
346e896a-c6b4-4d4e-826d-a5a9eda50636---today----------Hi
346e896a-c6b4-4d4e-826d-a5a9eda50636---yesterday------Hola
346e896a-c6b4-4d4e-826d-a5a9eda50636---one week ago---Bye
346e896a-c6b4-4d4e-826d-a5a9eda50636---two weeks ago--Ciao
To make the example easier to follow, I replaced the post_id values with something readable. As you can see, there are several rows with the same partition key (user_id). That works because we defined a clustering key (post_id) that clusters those rows and sorts them (descending in this case). Remember that uniqueness is defined by the primary key (partition key plus clustering key), so if we insert a row identified by '346e896a-c6b4-4d4e-826d-a5a9eda50636' and 'today', the content will be updated. Nothing really gets updated on disk, as Cassandra works with immutable on-disk structures; instead, at read time, different writes with the same primary key are reconciled by write timestamp and the most recent one wins.
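The idea of immutable writes that are only reconciled at read time can be sketched as an append-only log. This is a simplified model under stated assumptions, not how Cassandra is implemented internally: `write_log`, `append_write` and `read_latest` are hypothetical names, and the write timestamps are plain integers.

```python
# Toy model: writes are never modified in place, only appended.
# Each entry is (primary_key, write_timestamp, content).
write_log = []

def append_write(primary_key, write_timestamp, content):
    """Record a write without touching earlier entries."""
    write_log.append((primary_key, write_timestamp, content))

def read_latest(primary_key):
    """Reconcile at read time: the write with the highest timestamp wins."""
    versions = [(ts, content) for key, ts, content in write_log if key == primary_key]
    if not versions:
        return None
    return max(versions, key=lambda version: version[0])[1]

pk = ("346e896a", "today")            # shortened, illustrative primary key
append_write(pk, 100, "Hi")
append_write(pk, 200, "Hi, edited")   # same primary key, later timestamp
```

Both writes remain in the log, yet a read for that primary key sees only the later content.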

Let's see some queries to finish this example:

SELECT * FROM timeline
WHERE user_id = 346e896a-c6b4-4d4e-826d-a5a9eda50636
-> It will return four rows sorted by post_id DESC

SELECT content FROM timeline
WHERE user_id = 346e896a-c6b4-4d4e-826d-a5a9eda50636 LIMIT 1
-> It will return 'Hi'

SELECT content FROM timeline
WHERE user_id = 346e896a-c6b4-4d4e-826d-a5a9eda50636 AND post_id < today LIMIT 2
-> It will return 'Hola' and 'Bye'

As you can see, implementing sorted pagination is extremely easy when modeling time series in Cassandra. It also performs very well: Cassandra stores all the rows identified by a single partition key on the same node, so a single round trip is enough to fetch this data (assuming read consistency level ONE).
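The pagination pattern behind those queries can be sketched in plain Python: the last post_id of one page becomes the cursor for the next. This is a toy model of a single partition, assuming integer post ids in place of timeuuids; the `page` helper and the `before` parameter are hypothetical names, not part of any Cassandra API.

```python
# Toy model of one partition: post_id -> content (larger post_id = newer post).
posts = {4: "Hi", 3: "Hola", 2: "Bye", 1: "Ciao"}

def page(partition, limit, before=None):
    """Return up to `limit` (post_id, content) pairs, newest first.

    `before` plays the role of the `post_id < ?` restriction in the query.
    """
    rows = sorted(partition.items(), reverse=True)  # clustering order DESC
    if before is not None:
        rows = [row for row in rows if row[0] < before]
    return rows[:limit]

first_page = page(posts, 2)
# Use the last post_id of the previous page as the cursor for the next one.
second_page = page(posts, 2, before=first_page[-1][0])
```

Because the cursor is a clustering key value rather than an offset, each page is a bounded read within the partition.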

Let's see what happens when we want to implement sorted pagination in a different use case.

Sorted sets in Cassandra
If we look at the previous example at the data structure level, we can see that we just modeled a map whose values are sorted sets. What happens if we want to model something like a sorted set with Cassandra?
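To make the "map whose values are sorted sets" reading concrete, here is a minimal sketch using only the Python standard library. The names `timeline`, `add_post` and `latest` are hypothetical, and integers again stand in for timeuuid post ids.

```python
import bisect
from collections import defaultdict

# Map: partition key (user_id) -> sorted set of clustering keys (post_id).
# Each sorted set is kept as an ascending list without duplicates.
timeline = defaultdict(list)

def add_post(user_id, post_id):
    """Insert post_id in sorted position, ignoring duplicates (set semantics)."""
    posts = timeline[user_id]
    index = bisect.bisect_left(posts, post_id)
    if index == len(posts) or posts[index] != post_id:
        posts.insert(index, post_id)

def latest(user_id, n):
    """Return the n newest post ids for a user, newest first."""
    return timeline.get(user_id, [])[::-1][:n]

for post_id in (3, 1, 2, 2):
    add_post("u1", post_id)
```

The outer map corresponds to the partition key and each inner sorted set to the clustering key, which is why sorting within one user is cheap while sorting across all users is not something this shape gives for free.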

Our scenario is the following. The users of our system can be suspended or unsuspended through an admin portal. The admins would like to look at the most recently suspended users, along with the reason for each suspension, in order to confirm or revoke that decision. That's pretty similar to our previous paginated queries, so let's see how we can model that with Cassandra.

CREATE TABLE suspended_users (
    user_id uuid,
    occurred_on timestamp,
    reason text
)
I've deliberately left out the Primary Key from this DDL so we can discuss different options.
