shashankgroovy/analytics-models.md

## analytics-models.md

      
    Case study

Let’s say you have been tasked with building a new feature and the existing
schemas don’t meet its needs, so you have to create a new one. The steps below
walk through how to do that and also talk a bit about data modelling.
Creating a schema:

Creating a new schema is all about data modelling. So, make sure you study
Cassandra data modelling and decide how denormalized you want to keep your
data.
There are two essential rules of data modelling to follow, and the goal is
typically to find the right balance between them:

Spread data evenly around the cluster
Minimize the number of partitions read

Both of these rules depend on the choice of partition key, i.e. the first
component of the primary key. It is therefore essential to spend a good amount
of time on data modelling.
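
To make the trade-off between the two rules concrete, here is a minimal sketch in plain Python (no Cassandra needed). The sample rows and the helper names `partition_sizes` and `partitions_read_for_brand` are made up for illustration. It compares partitioning by `brand_id` alone against a composite partition key of `(brand_id, outlet_id)`.

```python
from collections import Counter

# Hypothetical sample data: brand-1 has 50 outlets, brand-2 has 2, brand-3 has 1,
# and every outlet has 10 days of rows.
rows = [
    (f"brand-{b}", f"outlet-{o}", f"2024-01-{d:02d}")
    for b, outlets in ((1, 50), (2, 2), (3, 1))
    for o in range(outlets)
    for d in range(1, 11)
]

def partition_sizes(rows, key_fn):
    """Count how many rows land in each partition for a given partition key."""
    return Counter(key_fn(row) for row in rows)

def partitions_read_for_brand(sizes, brand_id):
    """Number of partitions an 'all rows for one brand' query has to touch."""
    return sum(1 for key in sizes if key[0] == brand_id)

# Option A: partition key is brand_id only.
by_brand = partition_sizes(rows, lambda r: (r[0],))
# Option B: composite partition key (brand_id, outlet_id).
by_brand_outlet = partition_sizes(rows, lambda r: (r[0], r[1]))

for label, sizes in (("A", by_brand), ("B", by_brand_outlet)):
    print(
        label,
        "partitions:", len(sizes),
        "largest:", max(sizes.values()),
        "read for brand-1:", partitions_read_for_brand(sizes, "brand-1"),
    )
```

With this data, option A gives 3 partitions with the largest holding 500 rows (poor spread, but a brand query reads 1 partition). Option B gives 53 partitions of 10 rows each (even spread, but the same brand query reads 50 partitions). Which side to favour depends on the queries you need to serve.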
The following steps describe how to write your schema (they do not tell you
how to model your data).

Pick a unique name for the schema that you are creating. For example,
“new_feature_model”.
In analytics-models, create a new file with the same name, in our case
new_feature_model.py.
Think carefully about the constraints of the new schema and select its
primary keys (also known as row keys in Cassandra modelling). Make sure the
primary keys are chosen in a way that keeps your search queries fast.
Write the schema file as below. The model class must extend the BaseModel
class and implement the get_mapper function.
Also, add a function to sync this model to Cassandra in the sync_database.py
file.

# Filename: new_feature_model.py

import os
import time

from cassandra.cqlengine import columns
from cassandra.cqlengine.models import Model
from analyticsmodels import model_constants
from analyticsmodels.base_model import BaseModel

class NewFeatureModel(Model, BaseModel):
    __table_name__ = "new_feature_model"
    __keyspace__ = os.getenv("CASSANDRA_DB",
                             model_constants.DEFAULT_KEYSPACE)
    __options__ = model_constants.KEYSPACE_OPTIONS

    brand_id = columns.Text(primary_key=True, partition_key=True)
    outlet_id = columns.Text(primary_key=True)
    business_date = columns.Text(primary_key=True, clustering_order='DESC')
    feature_id = columns.Text(primary_key=True)

    orders = columns.Integer()
    amount = columns.Double()
    created_at = columns.Text()

    def get_mapper(self):
        return {
            "brand_id": {
                "mapping_value": "brand_id",
                "data_type": "string"
            },
            "outlet_id": {
                "mapping_value": "outlet_id",
                "data_type": "string"
            },
            "business_date": {
                "mapping_value": "business_date",
                "data_type": "string"
            },
            "feature_id": {
                "mapping_value": "feature_id",
                "data_type": "string"
            },
            "orders": {
                "mapping_value": "orderCount",
                "data_type": "int"
            },
            "amount": {
                "mapping_value": "amount",
                "data_type": "float"
            },
            "created_at": {
                "mapping_value": self.get_time_now,
                "data_type": "int"
            }
        }

    def get_time_now(self, order):
        return int(time.time())
NOTES:


__table_name__ : Defines the unique table name used for the new schema model.


__keyspace__ : Defines which keyspace to use in Cassandra. It reads the name
from the CASSANDRA_DB environment variable and, if that is not set, falls back
to the default keyspace name defined in the model_constants.py file. Each
environment has a different keyspace name:

Test: “foo_test”
Staging: “foo_stg”
Production: “foo”


__options__ : Defines which compaction strategy and tombstone threshold
value to use. Use the defaults defined in model_constants.py.
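
The BaseModel class is not shown here, so how it consumes get_mapper is not documented. Judging by the mapper above, each entry seems to name either a key in the incoming payload (e.g. "orderCount") or a callable that receives the payload (e.g. get_time_now), plus a target data type. The sketch below illustrates that reading. `apply_mapper` and `CASTS` are hypothetical names, not part of analyticsmodels.

```python
import time

def get_time_now(order):
    # Mirrors the model's helper: ignores the payload, returns epoch seconds.
    return int(time.time())

# Same shape as the dictionary returned by get_mapper() above (trimmed).
mapper = {
    "brand_id": {"mapping_value": "brand_id", "data_type": "string"},
    "orders": {"mapping_value": "orderCount", "data_type": "int"},
    "amount": {"mapping_value": "amount", "data_type": "float"},
    "created_at": {"mapping_value": get_time_now, "data_type": "int"},
}

# Assumed casts for each data_type label; the real BaseModel may differ.
CASTS = {"string": str, "int": int, "float": float}

def apply_mapper(mapper, payload):
    """Hypothetical helper: build column values from an incoming payload."""
    row = {}
    for column, rule in mapper.items():
        source = rule["mapping_value"]
        # A callable computes the value; a string names a key in the payload.
        raw = source(payload) if callable(source) else payload[source]
        row[column] = CASTS[rule["data_type"]](raw)
    return row

order = {"brand_id": 42, "orderCount": "3", "amount": "199.5"}
row = apply_mapper(mapper, order)
print(row)
```

Here the "orders" column is filled from the payload's "orderCount" key and cast to an integer, while "created_at" is computed by the callable.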


Below is a sample of the function that needs to be created for every new
schema model so that the model is reflected in Cassandra.
# Add the following function to the sync_database.py file
# (sync_table is assumed to be imported from cassandra.cqlengine.management)

    def sync_new_feature_model(self):
        from analyticsmodels.new_feature_model import NewFeatureModel
        sync_table(NewFeatureModel)
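
Since every model gets its own sync_* function, one plausible way to call them is to look them up by name. The sketch below assumes that approach. `DatabaseSyncer` and its `run` method are hypothetical stand-ins, and the sync function only records the call so the example runs without Cassandra.

```python
class DatabaseSyncer:
    """Hypothetical stand-in for the class that lives in sync_database.py."""

    def __init__(self):
        self.synced = []

    def sync_new_feature_model(self):
        # The real function imports NewFeatureModel and calls sync_table on it;
        # here we only record the call.
        self.synced.append("new_feature_model")

    def run(self, model_names):
        """Call each requested sync_* function, rejecting unknown names."""
        for name in model_names:
            method = getattr(self, name, None)
            if not name.startswith("sync_") or not callable(method):
                raise ValueError(f"unknown sync function: {name}")
            method()
        return self.synced

syncer = DatabaseSyncer()
print(syncer.run(["sync_new_feature_model"]))
```

Restricting the lookup to names that start with sync_ keeps a caller from invoking arbitrary attributes of the object.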
For more information on data modelling, read the DataStax docs and other
resources available online.
Syncing the schema:

Syncing a new schema model or updating an existing one is done via
processing-service. There is only one step to sync a schema: start
processing-service and send it the following curl request, which we saw
earlier:
curl -X POST -H 'Content-Type:application/json' --data \
'{"models":["sync_new_feature_model"]}' http://localhost:5000/db/sync
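
If you would rather trigger the sync from Python, the same request can be built with the standard library. This is a minimal sketch: the URL and payload are copied from the curl command above, and `build_sync_request` is a name made up for this example. The request is only built, not sent, so it runs even when processing-service is down.

```python
import json
import urllib.request

# URL and payload are taken from the curl command above.
SYNC_URL = "http://localhost:5000/db/sync"

def build_sync_request(model_names, url=SYNC_URL):
    """Build (but do not send) the POST request that triggers a schema sync."""
    body = json.dumps({"models": list(model_names)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_sync_request(["sync_new_feature_model"])

# To actually send it (requires processing-service to be running locally):
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(response.status, response.read().decode("utf-8"))
```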
Common Gotchas:

There are a couple of gotchas related to using User Defined Types (UDTs) in
Cassandra, like:

DELETE a column - (NOT ALLOWED) You can’t delete a column from a User Defined
Type if it is in use, i.e. frozen.
UPDATE a column - (NOT ALLOWED) You can’t change the datatype of a column
in a frozen User Defined Type.
ADD a column - You can, however, add a new column to a User Defined Type;
just make sure to sync the data type.
RENAME a column - Renaming a column is also allowed.

So if, for example, a column was added to a UDT and the UDT itself has been
updated, you can perform the operations above only if the table that uses the
UDT has not been updated yet. As soon as a table that uses the UDT gets
updated, Cassandra freezes the UDT with respect to that table. Other tables,
however, can still receive the change to the UDT if they haven’t been updated.
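
The rules above can be turned into a quick pre-flight check before you sync a changed UDT. The sketch below is plain Python and does not talk to Cassandra. `check_udt_change` is a hypothetical helper that compares the old and new field definitions and reports the changes the list above marks as not allowed.

```python
def check_udt_change(old_fields, new_fields):
    """Return the problems found when moving a frozen UDT from old to new.

    Both arguments map field name -> type name, e.g. {"city": "text"}.
    A rename looks like one removal plus one addition in this comparison,
    so apply renames separately before running the check.
    """
    problems = []
    for name, old_type in old_fields.items():
        if name not in new_fields:
            problems.append(f"cannot delete field '{name}'")
        elif new_fields[name] != old_type:
            problems.append(
                f"cannot change type of '{name}' "
                f"from {old_type} to {new_fields[name]}"
            )
    # Fields that exist only in new_fields are additions, which are allowed.
    return problems

old = {"street": "text", "zip_code": "int"}
added = {"street": "text", "zip_code": "int", "city": "text"}
retyped = {"street": "text", "zip_code": "text"}

print(check_udt_change(old, added))
print(check_udt_change(old, retyped))
```

Adding "city" passes the check, while changing the type of "zip_code" is reported as a problem.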
Happy hacking!