Taming Cassandra in Node.js with Datastar

by Jarrett Cruger and Charlie Robbins

Photo of an album

At GoDaddy, Cassandra is a core system in our infrastructure. Many teams rely on it for redundancy and reliability and to reduce the probability of downtime. With many of those teams also using Node.js, we experimented with several patterns for working with Cassandra, such as a fluent wrapper around the drivers that existed at the time. After exploring the Node.js ecosystem, we realized there was a need for a robust Object Data Mapper (ODM) for Cassandra that could define, manage, and use Cassandra tables in our applications, with comprehensive validation. It needed to be simple and usable while still offering enough flexibility for the power user, including control over batching, compound partition keys, and streams where appropriate. From these requirements, we built datastar to eliminate redundant work around data modeling and statement creation in Cassandra. We are happy to announce that datastar is now available as Open Source under the MIT license.

Mapping the Cassandra Data Model into JavaScript

Models in datastar are vanilla JavaScript constructor functions with prototypes, as you might expect. Calling datastar.define() with a name, a schema, and any other options needed to configure the model returns one of these constructors.

We support defining read/write consistency, automatically creating tables, and any additional modifications needed when the table is created.
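
Before defining any models, an application needs a connected datastar instance. Here is a minimal sketch of that bootstrap step; the configuration keys and connect callback shown are assumptions based on typical cassandra-driver style options, so check the datastar README for the exact settings:

var Datastar = require('datastar');

//
// Placeholder connection settings; adjust the keyspace and hosts for your
// own cluster (exact option names may differ -- see the datastar README).
//
var datastar = new Datastar({
  config: {
    keyspace: 'music',
    hosts: ['127.0.0.1']
  }
});

datastar.connect(function (err) {
  if (err) /* handle me */ return;
  // datastar.define(...) calls can now create and use tables
});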

How Do We Define Schemas?

When working with Cassandra, as with most schema-based databases, there is usually some dialect or syntax for defining the table name, its properties, and their types so that the data you fetch stays predictable. We use this definition explicitly for table creation, and from then on the table can start accepting data of that shape. We wanted something flexible that could be extended without being completely declarative like a json-schema. Thankfully, there was already a feature-rich option available: joi.

joi is a well-known validation library used by the hapi HTTP framework and originally written by Eran Hammer. The goal of joi is to provide an expressive and clear way to define a schema and validate data against it. It does so with the terseness and flexibility you would expect in JavaScript.
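
If you have not used joi before, here is a tiny stand-alone example of vanilla joi (not datastar-specific); the field and value are made up, and it assumes a joi version contemporary with this post where Joi.validate is still available:

var Joi = require('joi');

//
// Define a schema and validate a plain object against it.
//
var schema = Joi.object().keys({
  name: Joi.string().required()
});

var result = Joi.validate({ name: 'Pink Floyd' }, schema);
console.log(result.error); // null when the object matches the schema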

In order to make joi work with Cassandra, we needed to blend its functionality with the specific types and other concerns of Cassandra and the Cassandra Query Language (CQL). This need became joi-of-cql, thanks to the work of some GoDaddy engineers – Sam Shull and Fady Matar. Let's look at a sample Artist Model below, where we define our overall schema using schema.object(), which returns our extended joi-of-cql instance and accepts properties just like any vanilla joi instance.

//
// The `schema` object we are grabbing from here is an alias to the joi-of-cql module.
// We extend joi and create functions that represent the cassandra types for
// proper validation. The cql object represents the individual types that use
// joi under the hood.
//
var cql = datastar.schema.cql;

var Artist = datastar.define('artist', {
  schema: datastar.schema.object({
    artist_id: cql.uuid(),
    name: cql.text(),
    create_date: cql.timestamp({ default: 'create' }),
    update_date: cql.timestamp({ default: 'update' }),
    members: cql.set(cql.text()),
    related_artists: cql.set(cql.uuid()).allow(null),
    traits: cql.set(cql.text()),
    metadata: cql.map(cql.text(), cql.text()).allow(null)
  }).partitionKey('artist_id'),
  readConsistency: 'one',
  writeConsistency: 'localQuorum',
  with: {
    compaction: {
      class: 'LeveledCompactionStrategy'
    }
  }
});

These properties are defined using the cql variable that hangs off the joi-of-cql object attached to datastar. The functions on the cql object are what we use to define the specific Cassandra types using joi's dialect. joi gave us the primitive types that we used to build higher-level types specific to Cassandra and its capabilities as a database. We also have functions on our joi-of-cql schema instance that act as setters for the schema, e.g. partitionKey('artist_id'). Without the help of joi, we would not have such powerful expression in such a terse format.

Extending the Prototype, Smart Models

The power of datastar comes from how we utilize the JavaScript language itself. We have mentioned that datastar.define returns a constructor function that represents the configured model, and because it is a vanilla constructor function, we can define prototype methods on it. These application-specific prototype method extensions are then available for use on any object returned from the find methods of datastar. Let’s look at our Artist Model once again and see how we would extend its prototype and use it.

var Artist = datastar.define('artist', {
  schema: // … see above example
});

Artist.prototype.fetchAlbums = function (callback) {
  Album.findAll({ artistId: this.artistId }, callback);
};

Artist.findOne(artistId, (err, artist) => {
  if (err) /* handle me */ return;

  artist.validate(); // returns true
  artist.name = 'Pink Floyd'; // this operates as a setter
  artist.save((err) => { // save calls `validate` implicitly
    if (err) /* handle me */ return;
    console.log('We have saved our model!');
    artist.fetchAlbums((err, albums) => {
      if (err) /* handle me */ return;
      console.log('We have all the albums!');
    });
  });
});

This ability to extend models with more functionality is often referred to as "fat models". The simple case above lets us put more data-centric logic on the model itself and avoid writing redundant logic around the bare constructor. By extending the prototype, we leveraged an already available Album Model and linked it directly to the Artist Model, which is most useful for embedding the common-path queries directly onto the data model where they are most likely to be needed.

Modularity, Reusability, and Micro-Services

“Configuration as code, models as modules”

At GoDaddy, we care about modularity in our software. We want to make our software more testable and encourage more reuse, and the same goes for interacting with a database. The abstraction needed to fit so that we could share the same models across multiple services that talk to the same tables. In addition, given that we are defining our tables in code rather than executing straight CQL, we need a module that can be used by a simple management tool as well as integrated with our various services. What would this look like?

album.js

module.exports = function (datastar, models) {
  var cql = datastar.schema.cql;
  var Album = datastar.define('album', {
    schema: datastar.schema.object({
      album_id: cql.uuid(),
      artist_id: cql.uuid(),
      name: cql.text(),
      track_list: cql.list(cql.text()),
      song_list: cql.list(cql.uuid()),
      release_date: cql.timestamp(),
      create_date: cql.timestamp(),
      producer: cql.text()
    }).partitionKey('artist_id')
      .clusteringKey('album_id')
  });
  return Album;
};

We now have a simple function wrapper that accepts a datastar instance and returns the constructed model. This allows models to be easily decoupled from one another, which in turn allows the whole group of models to be decoupled from any single application. For example, we could use the same pattern for our Artist Model so that we can put them together in our models module.

model.js

var models = module.exports = function (datastar) {
  return new Models(datastar);
};

function Models(datastar) {
  this.Album = require('./album')(datastar, this);
  this.Artist = require('./artist')(datastar, this);
}

Decoupling models allows multiple micro-services backed by the same entities to share the same code. For example, consider a set of services for various types of data associated with music. One service could act as the main write pipeline for these entities, while a second service reads the data and feeds it into a machine-learning pipeline that incorporates the number of listens per album, per song, and so on. A very simple abstraction has opened up the possibilities for how we interact with our data. A sketch of the write service is shown below.
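
To make that concrete, here is a hedged sketch of how the write service might bootstrap the shared module from model.js above. The connection settings and artist values are placeholders, and the create call is an assumption based on datastar's model API rather than a verbatim excerpt:

// write-service.js
var uuid = require('uuid');
var Datastar = require('datastar');

//
// Each service constructs its own datastar instance but requires the same
// shared models module, so the table definitions live in exactly one place.
//
var datastar = new Datastar({
  config: { keyspace: 'music', hosts: ['127.0.0.1'] } // placeholders
});
var models = require('./model')(datastar);

datastar.connect(function (err) {
  if (err) /* handle me */ return;
  models.Artist.create({
    artistId: uuid.v4(),
    name: 'Pink Floyd'
  }, function (err) {
    if (err) /* handle me */ return;
    console.log('Artist written through the shared Artist Model');
  });
});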

Range Queries

Beyond promoting modularity, datastar has first-class support for more advanced Cassandra features like range queries, which are the most efficient way to access a series of related records. Consider again our example of artists and albums. Each artist in this case has 1 to N albums that can be fetched out of the database with a single range query.

It is a best practice to store data in this way as it creates a very specific hierarchy based on the partitionKey. The partitionKey is similar to a primary key and is used to decide where in the Cassandra cluster a particular record is stored. When we have both a partitionKey and a clusteringKey in a model, we expect multiple rows per partitionKey. Because the Artist and Album Models implement this best practice, the range query below is exceptionally fast: all of the records live on a single partition and are ordered on disk based on the clusteringKey.

var Album = datastar.define('album', {
  schema: datastar.schema.object({
    album_id: cql.uuid(),
    artist_id: cql.uuid(),
    name: cql.text(),
    track_list: cql.list(cql.text()),
    song_list: cql.list(cql.uuid()),
    release_date: cql.timestamp(),
    create_date: cql.timestamp(),
    producer: cql.text()
  }).partitionKey('artist_id')
    .clusteringKey('album_id')
});
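
With this model in place, the range query itself is a single findAll over the partition key. Here is a minimal sketch using the callback form shown earlier; artistId is a placeholder value:

Album.findAll({ artistId: artistId }, function (err, albums) {
  if (err) /* handle me */ return;
  //
  // Every album for this artist comes back from a single partition,
  // ordered by the clustering key (album_id).
  //
  console.log('Fetched %d albums', albums.length);
});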

Streams as First-Class Citizens

Given our model, and assuming we have inserted a handful of albums for a particular artist at artistId, let's fetch them all and write them to disk as JSON. To do this we are going to use the streams API that is exposed by the find* functions of a datastar model.

var fs = require('fs');
var stringify = require('stringify-stream');
Album.findAll({ artistId: artistId })
  .on('error', function (err) {
    /* handle me */
  })
  //
  // Turn objects into a stringified array and write the json file to disk
  // without buffering it all in memory.
  //
  .pipe(stringify({ open: '[', close: ']' }))
  .pipe(fs.createWriteStream('albums.json'));

Here we were able to create a very sleek pipeline that fetches the data, stringifies it without loading it all into memory, and writes it to disk. If you are not familiar with streams, think of a stream as an array in time rather than in "space" (memory). Streams allow us to make our web services less memory-hungry when we do not need to process an entire collection of records at once.

If we needed to modify any properties or whitelist part of each object before serializing it to disk, we could have added a transform stream before we "stringified" the objects. Streams are first-class citizens in Node: an HTTP response is a stream, a TCP socket is a stream, and so is the file stream we are using here. You get a lot of power and composability from streams, so it makes sense to make them first-class citizens in datastar as well.
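
As a quick illustration, here is a hedged sketch of such a transform stream built with Node's stream.Transform in object mode; the whitelisted field names are only examples:

var Transform = require('stream').Transform;

//
// Forward only the properties we want to appear in the resulting JSON.
//
var whitelist = new Transform({
  objectMode: true,
  transform: function (album, encoding, done) {
    done(null, { name: album.name, producer: album.producer });
  }
});

Album.findAll({ artistId: artistId })
  .on('error', function (err) { /* handle me */ })
  .pipe(whitelist)
  .pipe(stringify({ open: '[', close: ']' }))
  .pipe(fs.createWriteStream('albums.json'));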

To Batch or not to Batch

One Cassandra feature that was essential to a small subset of use cases at GoDaddy is batching. Cassandra supports the ability to batch any write operation on the database, with the intention that the entire batch of operations acts as a single atomic operation. This is a very useful property when you need to make sure multiple records stay in sync with one another: you want all of the statements to either succeed or fail together and not leave you in an undetermined state. Batching is especially useful when there are associated properties that need to be updated on multiple models as part of a single operation. We take full advantage of this feature in our implementation of datastar, and it plays a critical role in our Lookup Tables implementation.
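
If you have never seen a Cassandra batch, here is a sketch at the level of the underlying DataStax cassandra-driver (this is the driver's API, not datastar's); the table, column, and variable names are placeholders:

var cassandra = require('cassandra-driver');
var client = new cassandra.Client({ contactPoints: ['127.0.0.1'], keyspace: 'music' });

//
// Both statements are applied as one logged, atomic batch: either both
// writes land or neither does.
//
var queries = [
  { query: 'UPDATE artist SET name = ? WHERE artist_id = ?',
    params: ['Pink Floyd', artistId] },
  { query: 'UPDATE album SET name = ? WHERE artist_id = ? AND album_id = ?',
    params: ['Animals', artistId, albumId] }
];

client.batch(queries, { prepare: true }, function (err) {
  if (err) /* handle me */ return;
  console.log('Batch applied atomically');
});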

While this is an extremely useful feature of Cassandra, it is appropriate only for very particular scenarios. Abuse of batching in Cassandra can lead to serious performance problems that affect not only your application, but also the entire Cassandra cluster. This is due to the atomic nature of a batched operation on a distributed database: every batch needs to go through a coordinator node. Too many batch operations or overly large batch sizes create excess strain on this coordinator node and can degrade the overall performance and health of the system. For operations that do not require batching, we recommend passing a number as the strategy if you are using our statement-building feature.

Looking Forward as Open Source

This project has evolved considerably since its inception and this release marks the next stage: a stable and robust ODM for Cassandra. Features include:

  • Model validation with joi
  • Vanilla prototype Models to promote modularity
  • Range Queries with streams as first-class citizens
  • Batching updates on multiple models

In addition to the above, there are a number of ways that datastar takes advantage of Cassandra's features, such as the Lookup Tables support mentioned earlier, which are worth digging into if this topic interests you.

Now that datastar is available to the Open Source community, we have a simple ask for you – try it out! Dig into open issues, ask questions, and if you can spare some of your valuable time, please contribute. We are exploring new and innovative ways to use datastar and value input and diverse points of view to make that happen.

Join us!

Interested in working on this or solving other hard technology problems to help small businesses be more successful? If so, we are growing our team. Check out the GoDaddy Jobs page to learn about all our current openings across the company.

Photo Credit: John Donges via Compfight cc