@karanth
Last active August 29, 2015 13:56
Lecture notes on data modeling for Big Data - III

[Part 2](https://gist.github.com/karanth/8931761) illustrated two important principles of data modeling in NoSQL: first, modeling is a design-time exercise and is application-specific; second, in the absence of joins, duplication of data is a must. Techniques like denormalization and aggregates are used for data modeling.

#### Key-Value Stores

These are simple and extremely flexible stores. Each record is identified by a key. Anything can be stuffed into a value (though some stores provide data structures as values too). Distribution across nodes in a cluster is based on the key, making the distribution model simple. Queries cannot be made on the values. The most common use case for this kind of store is caching in an application for quick lookup. In practice, most of these stores are memory-based with occasional writes to disk. A key-value store can be visualized as a distributed hash table.
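
To make the "distributed hash table" picture concrete, here is a minimal sketch of key-based distribution. The node names, `node_for` and `TinyKVCluster` are hypothetical and exist only for illustration; each "node" is a plain in-memory dict. Placement here is a simple hash-modulo, which is an assumption of this sketch and not how any particular store does it.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster members


def node_for(key, nodes=NODES):
    """Pick the node that owns a key by hashing the key alone."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


class TinyKVCluster:
    """One in-memory dict per node; values are opaque to the store."""

    def __init__(self, nodes=NODES):
        self.nodes = list(nodes)
        self.shards = {n: {} for n in self.nodes}

    def put(self, key, value):
        self.shards[node_for(key, self.nodes)][key] = value

    def get(self, key, default=None):
        return self.shards[node_for(key, self.nodes)].get(key, default)


cluster = TinyKVCluster()
cluster.put("session:42", b'{"user": "alice"}')
print(cluster.get("session:42"))
```

Note that the store never looks inside the value, which is why lookups by key are cheap and queries on values are not available.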

Range queries are not possible in these stores without scanning the entire data set. Some stores, however, keep their keys ordered. In such ordered-key stores, composite keys are a technique for performing range queries. Composite keys can also be used for modeling dimensions.
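
The composite key technique can be sketched with a toy ordered store. `OrderedKV`, `reading_key` and the `sensor#timestamp` key layout are assumptions made for this example, not the API of any real store. The dimension (the sensor) comes first in the key and a sortable timestamp second, so all readings of one sensor are contiguous in key order.

```python
import bisect


class OrderedKV:
    """Toy ordered key store: keys are kept sorted so ranges are contiguous."""

    def __init__(self):
        self._keys = []
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def range(self, start, end):
        """Return (key, value) pairs with start <= key < end."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


def reading_key(sensor_id, timestamp):
    # Composite key: dimension first, then a sortable ISO-8601 timestamp.
    return f"{sensor_id}#{timestamp}"


store = OrderedKV()
store.put(reading_key("sensor-1", "2014-02-10T09:00"), 20.5)
store.put(reading_key("sensor-1", "2014-02-11T09:00"), 21.0)
store.put(reading_key("sensor-1", "2014-02-12T09:00"), 19.8)
store.put(reading_key("sensor-2", "2014-02-11T09:00"), 30.1)

# Readings of sensor-1 on Feb 10 and Feb 11, found without a full scan.
feb_10_to_11 = store.range("sensor-1#2014-02-10", "sensor-1#2014-02-12")
```

The range lookup is two binary searches over the sorted keys followed by a slice, so its cost depends on the size of the result and not on the size of the store.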

Examples: Redis, Memcached.

#### Column-Family Stores

These impose more structure on the value than key-value stores do. The structure imposed on the value is a map or a map-of-maps, so the entire store can be visualized as a map-of-maps-of-maps. Rows are identified by row keys and can be wide, i.e. there is no fixed restriction on the number of columns in a row. Columns are maps again, pairing a column name with a value. Some column-family stores allow one more level of map indirection, where columns can be elevated to super-columns that host columns within them. A common use case for such stores is the modeling of sparse data. They are also generally used to replace relational models in the NoSQL world.
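
The map-of-maps-of-maps view can be sketched with nested dictionaries. The row keys, family names and the `columns` helper below are made up for illustration. The point is that each row stores only the columns it actually has, which is what makes the model a good fit for sparse data.

```python
from collections import defaultdict

# row key -> column family -> column name -> value
table = defaultdict(lambda: defaultdict(dict))

table["user:1"]["profile"]["name"] = "Alice"
table["user:1"]["profile"]["email"] = "alice@example.com"
table["user:2"]["profile"]["name"] = "Bob"
table["user:2"]["visits"]["2014-02-11"] = 3  # the column name itself carries data


def columns(row_key, family):
    """Return the columns present for a row; absent columns cost nothing."""
    row = table.get(row_key)
    if row is None:
        return {}
    return dict(row.get(family, {}))


print(columns("user:1", "profile"))
print(columns("user:2", "profile"))  # no "email" column, and no null placeholder
```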

Another key attribute of column-family stores is the presence of a version timestamp with every value in a column. This can be used for conflict resolution in a distributed setting. Almost all column-family stores keep the row keys in sorted order. This allows for range queries and aggregations using the composite key technique of modeling data.
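
One way the version timestamp can drive conflict resolution is a last-write-wins rule, sketched below. This is only one possible policy and the `write` and `merge` helpers are hypothetical. Each cell holds a `(value, timestamp)` pair, and when two replicas disagree the newer timestamp is kept.

```python
def write(cell_store, row, column, value, timestamp):
    """Store the value only if it is newer than what the cell already holds."""
    current = cell_store.get((row, column))
    if current is None or timestamp > current[1]:
        cell_store[(row, column)] = (value, timestamp)


def merge(replica_a, replica_b):
    """Reconcile two replicas cell by cell, keeping the newest version."""
    merged = dict(replica_a)
    for (row, column), (value, ts) in replica_b.items():
        write(merged, row, column, value, ts)
    return merged


replica_a, replica_b = {}, {}
write(replica_a, "user:1", "email", "old@example.com", 100)
write(replica_b, "user:1", "email", "new@example.com", 200)
write(replica_a, "user:1", "name", "Alice", 150)

merged = merge(replica_a, replica_b)
print(merged[("user:1", "email")])
```

With distinct timestamps the merge gives the same result regardless of the order in which the replicas are combined. This sketch does not handle ties between equal timestamps.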

Examples: Cassandra, BigTable, HBase.

#### Document Stores

Document stores take column-family stores to the next level by allowing arbitrarily nested structures as record values, not just a map-of-maps or a map-of-maps-of-maps (with super-columns). Some implementations also allow database-style indexes on document fields. Document stores appear in use cases where one-to-many relations are required or where many nesting levels are a must. They provide a high degree of aggregation, helping coalesce different types into a few abstract types.
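
As an illustration, the sketch below holds a one-to-many relation (an order and its items) inside a single nested document and builds a secondary index on a nested field. The documents, `get_path` and `build_index` are assumptions for this example and do not mirror the API of any particular document store.

```python
from collections import defaultdict

documents = {
    "order:1": {
        "customer": {"name": "Alice", "city": "Bangalore"},
        "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
    },
    "order:2": {
        "customer": {"name": "Bob", "city": "Pune"},
        "items": [{"sku": "A1", "qty": 5}],
    },
}


def get_path(doc, path):
    """Follow a dotted path such as 'customer.city' into nested dicts."""
    for part in path.split("."):
        doc = doc[part]
    return doc


def build_index(docs, path):
    """Map each value found at `path` to the ids of documents holding it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        index[get_path(doc, path)].add(doc_id)
    return index


by_city = build_index(documents, "customer.city")
print(by_city["Pune"])
```

The order and its items are read back in one lookup, with no join, because they were aggregated into one document at design time.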

Full-text search engines like Lucene also fall into the document store category, when considered broadly. They can have flexible schemas (though it is not clear whether they provide arbitrary nesting) and can be indexed as well. The index is built on terms, the units into which the values of a document's fields are broken.
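
An index built on units of field values is commonly called an inverted index. The sketch below is a deliberately naive version: the `tokenize` function, the sample documents and the `search` helper are assumptions for illustration and are far simpler than what a real engine such as Lucene does.

```python
from collections import defaultdict

docs = {
    1: {"title": "Data modeling for key-value stores"},
    2: {"title": "Graph stores and social networks"},
    3: {"title": "Modeling sparse data in column-family stores"},
}


def tokenize(text):
    """Break a field value into lowercase terms (a naive analyzer)."""
    return text.lower().replace("-", " ").split()


# (field, term) -> ids of documents whose field contains the term
inverted = defaultdict(set)
for doc_id, doc in docs.items():
    for term in tokenize(doc["title"]):
        inverted[("title", term)].add(doc_id)


def search(field, *terms):
    """Return ids of documents whose field contains every given term."""
    postings = [inverted.get((field, t.lower()), set()) for t in terms]
    return set.intersection(*postings) if postings else set()


print(search("title", "modeling", "stores"))
```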

Examples: CouchDB, MongoDB, Solr

#### Graph Stores

Graph stores are a niche kind of store that allows modeling of graph data. They allow arbitrary relationships between data points and facilitate fast multi-hop lookups across these relations. Relational stores struggle to make graph and hierarchical modeling easy. However, most other NoSQL stores come pretty close because of the aggregate techniques they provide.

A common use case for graph stores is network modeling, like a social network.
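
A multi-hop lookup over a small social network can be sketched as a breadth-first traversal of an adjacency map. The `follows` data and the `within_hops` function are hypothetical, and a real graph store would index and persist the edges instead of holding them in a dict.

```python
from collections import deque

# who follows whom (directed edges)
follows = {
    "alice": {"bob", "carol"},
    "bob": {"dave"},
    "carol": {"dave", "erin"},
    "dave": set(),
    "erin": {"alice"},
}


def within_hops(graph, start, max_hops):
    """Breadth-first search: map each reachable node to its hop distance."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    del distance[start]
    return distance


print(within_hops(follows, "alice", 2))
```

Each extra hop follows edges directly from the nodes already reached, whereas a relational model would typically need one more self-join per hop.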

Examples: AllegroGraph, Neo4j
