Skip to content

Instantly share code, notes, and snippets.

@shashankgroovy
Created August 27, 2019 10:17
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save shashankgroovy/0cb4e449e3f24bd93669e47818fb6f32 to your computer and use it in GitHub Desktop.
Save shashankgroovy/0cb4e449e3f24bd93669e47818fb6f32 to your computer and use it in GitHub Desktop.
Case study for analytics models

Case study

Let’s say you have been tasked to create a new feature, and you have to create a new schema, since the current ones don’t suffice the needs. In that case, below we list down the steps to achieve the same and also talk a bit about data modelling.

Creating a schema:

Creating a new schema is all about data modelling. So, make sure you study Cassandra data modelling and choose a strategy on how denormalized you want to keep your data.

There are 2 essential rules about data modelling that one needs to follow and typically find a right balance between them:

  1. Spread data evenly around the cluster
  2. Minimize the number of partitions read

Both of these rules depend on the choice of a partition key i.e. the first primary key. Thus, it’s essential to spend good amount of time on data modelling.

The following steps entail how to write your schema (it does not tell you about - how to model your data).

  1. Pick a unique name for the schema that you’d be creating. For example, ”new_feature_model”.
  2. In analytics-models create a new file with the same name, in our case - new_feature_model.py.
  3. Carefully think about the constraints of the new schema and select the primary keys also known as, row keys for the modeling Cassandra. Make sure the primary keys are selected in a fashion which enables you to do a faster search query.
  4. Write the schema file as below and make sure to implement the get_mapper function which extends the BaseModel class.
  5. Also, add a function to sync this model in Cassandra in sync_database.py file.
# Filename: new_feature_model.py

import os

from cassandra.cqlengine import columns
from cassandra.cqlengine.models import Model
from analyticsmodels import model_constants
from analyticsmodels.base_model import BaseModel

class NewFeatureModel(Model, BaseModel):
    __table_name__ = "new_feature_model"
    __keyspace__ = os.getenv("CASSANDRA_DB",
                             model_constants.DEFAULT_KEYSPACE)
    __options__ = model_constants.KEYSPACE_OPTIONS

    brand_id = columns.Text(primary_key=True, partition_key=True)
    outlet_id = columns.Text(primary_key=True)
    business_date = columns.Text(primary_key=True, clustering_order='DESC')
    feature_id = columns.Text(primary_key=True)

    orders = columns.Integer()
    amount = columns.Double()
    created_at = columns.Text()

    def get_mapper(self):
        return {
            "brand_id": {
                "mapping_value": "brand_id",
                "data_type": "string"
            },
            "outlet_id": {
                "mapping_value": "outlet_id",
                "data_type": "string"
            },
            "business_date": {
                "mapping_value": "business_date",
                "data_type": "string"
            },
            "feature_id": {
                "mapping_value": "feature_id",
                "data_type": "string"
            },
            "orders": {
                "mapping_value": "orderCount",
                "data_type": "int"
            },
            "amount": {
                "mapping_value": "amount",
                "data_type": "float"
            },
            "created_at": {
                "mapping_value": self.get_time_now,
                "data_type": "int"
            }
        }

    def get_time_now(self, order):
        return int(time.time())

NOTES:

  • __table_name__ : Defines a unique table name to be used for new schema model.

  • __keyspace__ : Defines which keyspace to use on Cassandra. By default it tries to pick up the name defined in environment variables else it uses the default keyspace name defined in model_constants.py file. Each environment has a different keyspace name:

    • Test: “foo_test”
    • Staging: “foo_stg”
    • Production: “foo”
  • __options__ : Defines what compaction strategy and tombstone threshold value to choose. Use the default defined in model_constants.py.

Below is the sample function which needs to be created for every new schema model to reflect the schema model in Cassandra.

# Add the following function in sync_database.py file

    def sync_new_feature_model(self):
        from analyticsmodels.new_feature_model import NewFeatureModel
        sync_table(NewFeatureModel)

For more information on data modelling read the Datastax docs and other resources available online.

Syncing the schema:

Syncing any new schema model or updating an existing one is done via processing-service. There is only one step to sync a schema i.e. to fire up processing-service and hit it with the following curl which we saw earlier:

curl -X POST -H 'Content-Type:application/json' --data
'{"models":["sync_new_feature_model"]}' http://localhost:5000/db/sync

Common Gotchas:

There are a couple of gothas related to using User Defined Types(UDT) in Cassandra, like:

  • DELETE a column - (NOT ALLOWED) You can’t delete a column in a User defined types if they are being used i.e. they are frozen.
  • UPDATE a column - (NOT ALLOWED) You can’t change the datatype of a column if it is a frozen User Defined type
  • ADD a column - You can however, add a new column in a User Defined type just make sure to sync the data type
  • RENAME a column - Renaming a column is also allowed.

So, if you run into a problem like for example, a column was added in a UDT and now the UDT is also updated. You can perform all the above operations iff the table which uses the given UDT has not been updated. As soon as a table which uses the UDT gets updated, Cassandra freezes the UDT w.r.t. to that table. Other tables however, can receive the change in a UDT if they haven’t been updated.

Happy hacking!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment