(Go: >> BACK << -|- >> HOME <<)

Friday, 27 December 2013

Rolling your own Bitmap Index in Cassandra with CQL

This blog is about a specific modeling technique that can be used to support efficient indexing of your data in Cassandra so that you can run pretty dynamic queries against it - in particular rolling your own Bitmap Index with CQL.

Cassandra is not an relational database (shock, horror) :) Amongst other things, it does not support joins and foreign keys and there are some pretty good reasons as to why it doesn't. If you think about it, in a distributed database where your data is residing on various nodes spread all around the place - if you attempted to do a join the query would be massively inefficient as its likely to have to go out to other nodes etc.. - this would bring performance, scaling and predictability issues (a topic for another day).

Everything in how you model your data in Cassandra is attempting to get your query down to a level where it is being executed on a single node and does not need to go off querying other nodes to return your results. As a consequence, you do not model you data in the same way you would in an RDMS. For Cassandra you use "Query-Driven Data Modeling"!!

What this means is you denormalise your data and persist it multiple times to different tables (or column families) where that table has been optimised around the specific queries you need to perform. This is often one of the things developers, dba's etc.. start to get twitchy about. Duplicating your data has been drilled into us as a bad thing and it is - when your using a traditional RDMS. With Cassandra it's not - writes are faster than you could ever have dreamed off, it scales in a predictable horizontal fashion and compresses the data when persisting to disk. Have a read of this blog if you want to know more. So go ahead - duplicate your data, throw whatever you want at Cassandra - it will just keep taking it!

This type of index is quite similar to a Composite Key Index. Basically, imagine it like a truth table where you duplicate your data for every combination of fields you want to query by. I have called this a bitmap index as although not strictly a bitmap index (thanks Artem) this approach has previously been described as such.

For example, I have three fields A, B, C. This gives us 8 possible combinations (or 7 if you want to ignore the null one):

A  B  C 
-------
0  0  0 
0  0  1 
0  1  0 
0  1  1  
1  0  0  
1  0  1 
1  1  0  
1  1  1 

(See here for a handy truth table generator)

For example, imagine the use case where you wanted to persist bookmarks for specific locations in a video and then query those bookmarks in an a variety of different ways.

The following three examples show how you can do this in the usual table-per-query model approach for an example set of requirements.

My bookmark data consists of:
  • user_id - the unique ID of the user
  • tvshow_id - the unique ID of the tv show
  • season - the season the of the show
  • continue_watching - a boolean to handle automatically creating bookmarks while the user is watching the video
  • video_id - the unqique ID of the specific video
  • created - the timestamp the user created the bookmark on (might not be the same as the persisted timestamp in Cassandra)
  • time_in_video - the time in seconds into the video being watched

First, create a keyspace.

Note: please dont use SimpleStrategy in production or a replication_factor of 1!

Query Tables


Requirement 1: I want to get video_id, created and time_in_video for a specific users bookmarks for a specific video

This has a primary key where the user_id and video_id act as a composite partition key for my data.

Insert some data:
Check my data is inserted (do not do this in production or with lots of data):

Now for the specific query for this requirement:

Great - it does what its meant to! Its an efficient query using the partition key to return all bookmarks for johnny for videos with the id 666QAZ.

Requirement 2: I want to get video_id, created and time_in_video for specific users bookmarks where the continue watching flag is true

This has a primary key where the user_id and video_id act as a composite partition key for my data. However, video_id is in there as a clustering column for uniqueness.

Now insert some data:
Check my data is inserted (do not do this in production or with lots of data):

Now for the specific query for this requirement:

As before, pretty simple stuff. The data is being queried directly on the partition key and if I wanted I could also add the video_id to the query as its a clustering column.

Requirement 3: I want to get video_id, created and time_in_video for all user bookmarks for a specific tvshow and season

This has a primary key where user_id, tvshow_id and season act as a composite partition key for my data and again video_id is in there as a clustering column for uniqueness.

Now insert some data:
Check my data is inserted (do not do this in production or with lots of data):

Now for the specific query for this requirement:

Not overly different to the other query tables, but with different partitioning values optimised around how the data needs to be queried.

There is nothing wrong with the approach for these queries. In fact, it could probably be refined further into fewer tables. The reason I have described it is so that it can be contrasted with the following approach based around a bitmap index.


Alternative Solution: Roll your own index


So, based on those three requirements there are five fields we want to query on: user_id, tvshow_id, season, continue_watching, video_id - this equates to 32 possible combinations and we will need to insert a record for each combination.

I know this sounds expensive, but remember inserts are super fast and your reads will be very efficient. Obviously, some of these combinations make no sense so you could reduce the number of inserts further, specifically the null inserts are pointless so wont be included.

A  B  C  D  E
-------------
0  0  0  0  0
0  0  0  0  1
0  0  0  1  0
0  0  0  1  1
0  0  1  0  0
0  0  1  0  1
0  0  1  1  0
0  0  1  1  1
0  1  0  0  0  
0  1  0  0  1
0  1  0  1  0
0  1  0  1  1
0  1  1  0  0  
0  1  1  0  1
0  1  1  1  0
0  1  1  1  1
1  0  0  0  0
1  0  0  0  1
1  0  0  1  0
1  0  0  1  1
1  0  1  0  0
1  0  1  0  1
1  0  1  1  0
1  0  1  1  1
1  1  0  0  0
1  1  0  0  1
1  1  0  1  0
1  1  0  1  1
1  1  1  0  0
1  1  1  0  1
1  1  1  1  0
1  1  1  1  1


So, now we need to define a table for this data:

You will notice the addition of a new field vid_id. This has the same value as video_id because it is always required to be returned but may not be part of the query.

31 inserts for each bookmark (remember, ignoring the null one).  I have added all the inserts for each bookmark here.

So, now that this data is inserted it can be queried in a multitude of ways.

Requirement 1: I want to get video_id, created and time_in_video for a specific users bookmarks for a specific video

Requirement 2: I want to get video_id, created and time_in_video for specific users bookmarks where the continue watching flag is true

Requirement 3: I want to get video_id, created and time_in_video for all user bookmarks for a specific tvshow and season

Plus you can also do other queries, for example:

Get all bookmarks for a specific tvshow_id

Get all bookmarks for a bunch of of video_id's

Note: the use of IN on a composite partition key is only possible on the last column in the partition

Conclusion


There are obvious disadvantages to this approach and in a sense it is over engineering when you compare it to creating the query based tables. Also, you can end up with a pretty large table. It is certainly duplicating more data, however it does not require creating lots of tables and is quite efficient to query as its going directly to the data via the partition key.

One important consideration for this approach is going to be how frequently the data is changing - if the data in the index isn't changing that often, than its a reasonable approach. However if you are updating it frequently than this can be expensive and I would probably evaluate alternative approaches.

So go ahead and try it out, but be conscious of the type of workload you are targeting


Credits:

Patrick McFadin, Become a super modeler
Matt Stump, Advanced Data Modeling and Bitmap Indexes
Ilya Katsov, NoSQL Data Modeling Techniques
Artem Chebotko

How to find out what's listening on a specific port in Linux

If you need to find out the status of a port and what its listening on the following command is pretty handy.

The sample below show the result of looking at whats running on port 9160:

Command:
sudo netstat -anltp|grep :9160

Result:
tcp        0      0 127.0.0.1:9160          0.0.0.0:*               LISTEN      1120/jsvc.exec

In this instance I was looking to see if the port Cassandra was running was open as I couldn't connect (it's only listening on localhost).