Data modeling in Cassandra is a little tricky and requires a combination of science and art. Think of a Cassandra column family as a map of maps: an outer map keyed by a row key, and an inner map keyed by a column key. Both maps are sorted. To get the most out of Cassandra and to keep long-term maintenance manageable, it's better to understand and follow certain high-level rules when implementing it.
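The map-of-maps picture can be sketched in a few lines of Python. This is only a toy model of the description above, not how Cassandra stores data internally; the row keys, column names, and the `sorted_view` helper are hypothetical.

```python
# Toy model of a column family as a map of maps (not Cassandra's storage engine).
# Outer key = row key, inner key = column key. All names and values are made up.
column_family = {
    "user:2": {"name": "Bob", "email": "bob@example.com"},
    "user:1": {"name": "Alice", "email": "alice@example.com", "dob": "1990-01-01"},
}


def sorted_view(cf):
    """Return rows ordered by row key, each with its columns ordered by column key."""
    return {
        row_key: dict(sorted(columns.items()))
        for row_key, columns in sorted(cf.items())
    }


view = sorted_view(column_family)
for row_key, columns in view.items():
    print(row_key, list(columns))
```

Note that rows do not need to have the same set of columns: here only one row carries a `dob` column.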
A few things to consider while implementing Cassandra:
- Column based
- Cluster
- Nodes
- Duplicated data
- Distributed data platform
- Performance should scale linearly when more nodes are added to the cluster
- Writes in Cassandra are cheaper than reads and less problematic.
- Denormalization and duplication are encouraged in Cassandra. Its efficiency comes partly from data duplication.
- Forget what you know about joins in an RDBMS, because there are no joins in Cassandra.
In Cassandra you have clusters and nodes, and you want to make sure that during writes, data is spread evenly across all the nodes in the cluster. Rows are spread around the cluster based on a hash of the partition key, which is the first element of the PRIMARY KEY. To increase read efficiency, make sure that data is read from as few nodes as possible.
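To see why a hash of the partition key tends to spread rows evenly, here is a minimal Python sketch. It assumes MD5 plus a modulo over three hypothetical nodes as a stand-in for Cassandra's real partitioner and token ring, so treat it as an illustration of the idea only.

```python
import hashlib
from collections import Counter

NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names


def node_for(partition_key: str) -> str:
    """Pick a node from a hash of the partition key.

    Assumption: MD5 and a modulo stand in for Cassandra's actual
    partitioner and token ranges, purely to keep the sketch short.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


# 3000 made-up partition keys (email addresses).
emails = [f"user{i}@example.com" for i in range(3000)]
placement = Counter(node_for(email) for email in emails)
print(placement)
```

The same key always lands on the same node, which is what lets a read for a single partition key go straight to the node (and its replicas) that owns it.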
The Cassandra stress test below, run with the consistency level set to ALL and then to ONE, shows why it's better to read from as few nodes as possible.
Command:
- cassandra-stress read n=2000 cl=ALL no-warmup -rate threads=1
Result:
Command:
- cassandra-stress read n=2000 cl=ONE no-warmup -rate threads=1
Result:
Isolate clusters by functional area and criticality. Use cases with similar criticality from the same functional area share a cluster and reside in different keyspaces (databases).
Determine your queries and build the model based on those queries. Think about the query patterns up front and design the column families ahead of time. Another reason to follow this rule is that, unlike in a relational database, it's not easy to tune or introduce new query patterns in Cassandra. In other words, you can't just add complex SQL (T-SQL, PL/SQL, etc.) or secondary indexes to Cassandra, because of its highly distributed nature.
At a high level, below are some of the things you need to determine about your query patterns:
- Enforcing uniqueness in the result set
- Filtering based on some set of conditions
- Ordering by an attribute
- Grouping by an attribute
- Identify the most frequently used query pattern
- Identify queries that are sensitive to latency
Design your queries to read from one partition. Keep in mind that your data is replicated to multiple nodes, so you can create individual queries that each read from a different partition. When a query reads from multiple nodes, it has to go to each individual node and fetch the data, and this takes time. When it gets the data from one node, it saves time.
An example would be the CREATE TABLE statements below:
CREATE TABLE users_by_email (name VARCHAR, dob TIMESTAMP, email VARCHAR, join_date TIMESTAMP, PRIMARY KEY (email));
CREATE TABLE users_by_join_date (name VARCHAR, dob TIMESTAMP, email VARCHAR, join_date TIMESTAMP, PRIMARY KEY (join_date, email));
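The statements above create tables that let each query read from one partition. In users_by_email, each user gets their own partition. In users_by_join_date, the partition key is join_date, so users who share the same join_date value end up in the same partition.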
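The idea of one table per query can be sketched with plain Python dictionaries, where each dictionary plays the role of a table and the dictionary key plays the role of the partition key. The `add_user` helper and the sample users are hypothetical, and join dates are plain strings here just to keep the sketch short.

```python
# Each dict stands in for one table; the dict key is the partition key.
users_by_email = {}
users_by_join_date = {}


def add_user(name, dob, email, join_date):
    """Write the same user to both tables (duplicated on purpose)."""
    row = {"name": name, "dob": dob, "email": email, "join_date": join_date}
    users_by_email[email] = row
    # Inside a join_date partition, rows are keyed by email (the clustering column).
    users_by_join_date.setdefault(join_date, {})[email] = row


add_user("Alice", "1990-01-01", "alice@example.com", "2024-03-01")
add_user("Bob", "1988-07-15", "bob@example.com", "2024-03-01")
add_user("Carol", "1992-11-30", "carol@example.com", "2024-03-02")

# Each lookup touches exactly one partition of one table.
by_email = users_by_email["bob@example.com"]
joined_on = sorted(users_by_join_date["2024-03-01"])
print(by_email["name"], joined_on)
```

The price is one extra write per user, which fits the earlier point that writes are cheap and duplication is encouraged.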
If you are trying to fit a group into a partition, you can use a compound PRIMARY KEY, as in this example:
CREATE TABLE groups (groupname text, username text, email text, join_date int, PRIMARY KEY (groupname, username));
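As a rough mental model of that compound key, groupname selects the partition and username orders the rows inside it. The Python sketch below mimics this with nested dictionaries. The `add_member` and `members_of` helpers, the sample data, and the YYYYMMDD encoding of the integer join_date are all assumptions made for illustration.

```python
from collections import defaultdict

# groupname -> {username -> row}
# groupname acts as the partition key, username as the clustering column.
groups = defaultdict(dict)


def add_member(groupname, username, email, join_date):
    """Insert (or overwrite) one row in the group's partition."""
    groups[groupname][username] = {"email": email, "join_date": join_date}


def members_of(groupname):
    """Read a single partition and return usernames in sorted (clustering) order."""
    return sorted(groups.get(groupname, {}))


# join_date as an int in YYYYMMDD form is an assumption for this sketch.
add_member("admins", "zoe", "zoe@example.com", 20240105)
add_member("admins", "adam", "adam@example.com", 20240101)
add_member("guests", "mia", "mia@example.com", 20240110)

print(members_of("admins"))
```

Fetching every member of a group is then a read from a single partition, no matter how many users the group has.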