What are the differences between column families in Cassandra's data model compared to Bigtable?

Guy · December 5, 2022, 4:40am

I am learning about Cassandra’s data model and its relation to Bigtable but have some things I still don’t understand regarding the Column Family concept.

Is the column-family based data model of Cassandra the same as the column-family based data model of Google’s BigTable?

Firstly I’ve read the Bigtable paper, including the part about its data model, that is, how data is stored. As far as I understood, each table in Bigtable relies on a multi-dimensional sparse map with the dimensions row, column, and time. The map is sorted by rows. Columns can be grouped with the name convention family:qualifier to a column family. Therefore, a single row can contain multiple-column families.

Although it is stated that Cassandra relies on the Bigtable data model, I read multiple times that in Cassandra, a column family contains multiple rows and is, to some extent, comparable to a table in relational data stores. Isn’t this contrary to Bigtable’s approach, where a row could contain multiple column families? What comes first, the column family or row? Are these concepts even comparable?

*Based on a question originally asked on Stack Overflow by OxideNt

Guy · December 5, 2022, 4:43am

When Cassandra started, its data model was indeed based on BigTable’s. A row of data could include any number of columns, each with a name and a value. One row could have a thousand columns, and a different row could have a thousand other columns: rows do not have to have the same columns. Such a database is called “schema-less”, because there is no schema that each row needs to adhere to.

But Toto, we’re not in Kansas anymore - and Cassandra’s model changed in focus (though not in essence) since, and I’ll try to explain how and why:

As Cassandra matured, its developers started to realize that schema-less isn’t as great as they once thought it was. Schemas are valuable in ensuring application correctness. Moreover, one doesn’t normally get to 1000 columns in a single row just because there are 1000 individually-named fields in one record. Rather, the more common case is that the record actually contains 200 entries, each with 5 fields. The schema should fix the 5 fields that every one of these entries has, and what distinguishes one entry from another is called a “clustering key”. So around the time of Cassandra 0.8, these ideas were introduced to Cassandra as “CQL” (Cassandra Query Language).

For example, in CQL, one declares that a column-family (which was dutifully renamed “table”) has a schema, with a known list of fields:

CREATE TABLE groups (
    groupname text,
    username text,
    email text,
    age int,
    PRIMARY KEY (groupname, username)
);

This schema says that each wide row in the table (which modern Cassandra calls a “partition”), identified by the key “groupname”, is a possibly long list of users, each with username, email, and age fields. The first name in the “PRIMARY KEY” specifier is the partition key (it determines the key of the wide rows), and the second is called the clustering key (it determines the key of the small rows that together make up the wide rows).
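To make that nesting concrete, here is a toy Python model (only a sketch of the idea, not how Cassandra is implemented). It groups rows by the partition key and keeps the small rows inside each partition ordered by the clustering key. The function name, the group names, and the sample users are all made up for illustration.

```python
from collections import defaultdict

def group_into_partitions(rows):
    """Group (groupname, username, email, age) rows into partitions.

    The partition key (groupname) selects the wide row; inside it, the
    small rows are kept sorted by the clustering key (username).
    """
    partitions = defaultdict(dict)
    for groupname, username, email, age in rows:
        partitions[groupname][username] = {"email": email, "age": age}
    # Sort each partition's small rows by clustering key.
    return {
        group: dict(sorted(users.items()))
        for group, users in partitions.items()
    }

# Hypothetical sample data, not taken from any real table.
rows = [
    ("admins", "carol", "carol@example.com", 45),
    ("admins", "bob", "bob@example.com", 31),
    ("guests", "dave", "dave@example.com", 22),
]
partitions = group_into_partitions(rows)
print(list(partitions["admins"]))  # usernames in clustering-key order
```

In this model, looking up a group returns all of its users at once, which mirrors why reading a whole partition by its partition key is the natural access pattern.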

Despite the new CQL dressup, Cassandra continued to implement these new concepts using the good-old-BigTable-wide-row-without-schema implementation. For example, consider that our data has a group “mygroup” with two people, (john, john@somewhere.com, 27) and (joe, joe@somewhere.com, 38). Cassandra adds the following four column names->values to the wide row:

john:email -> john@somewhere.com
john:age -> 27
joe:email -> joe@somewhere.com
joe:age -> 38

Note how we ended up with a wide row with 4 columns: 2 non-key fields per row (email and age), multiplied by the number of rows in the partition (2). The clustering key field “username” no longer appears anywhere as a value, but rather as part of the column’s name! So if we have two username values, “john” and “joe”, we have some columns prefixed “john” and some columns prefixed “joe”, and when we read the column “joe:email” we know this is the value of the email field of the row which has username=joe.
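The same translation can be sketched in a few lines of Python. This is only an illustration of the naming scheme described above, under the assumption that a cell name is simply the clustering key value and the column name joined by a colon; the real on-disk encoding of composite names is more involved, and the function name here is hypothetical.

```python
def flatten_partition(rows, clustering_key, value_columns):
    """Turn CQL-style rows of one partition into old-style wide-row cells.

    Each cell name is "<clustering key value>:<column name>", so the
    clustering key ends up in the cell name rather than in any value.
    """
    cells = {}
    for row in rows:
        prefix = row[clustering_key]
        for column in value_columns:
            cells[f"{prefix}:{column}"] = row[column]
    return cells

# The two users of the partition "mygroup" from the example.
mygroup_rows = [
    {"username": "john", "email": "john@somewhere.com", "age": 27},
    {"username": "joe", "email": "joe@somewhere.com", "age": 38},
]
cells = flatten_partition(mygroup_rows, "username", ["email", "age"])
for name, value in sorted(cells.items()):
    print(name, "->", value)
```

Two rows with two non-key fields each produce four cells, and reading the cell named “joe:email” recovers the email of the row with username=joe.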

Cassandra still has this internal duality, converting the user-facing CQL rows and clustering keys into old-style wide rows. Until Cassandra 3.0, the on-disk format, known as “SSTables”, was still schema-less and used composite names, as shown above, for column names. I wrote a detailed description of the SSTable format in the “SSTables Data File” page of the scylladb/scylladb wiki on GitHub (Scylla is a more efficient C++ re-implementation of Cassandra to which I contribute). However, column names are very inefficient in this format, so Cassandra, in version 3.0, switched to a different file format which, for the first time, accepts clustering keys and schema-full rows as first-class citizens. This was the last nail in the coffin of the schema-less Cassandra from 13 years ago. Cassandra is now schema-full, all the way.

*Based on an answer originally on Stack Overflow by Nadav Har’El
