Data Design & Modeling

1. Modern Data problems

  1. The complexity of knowledge
  2. Cognitive bias (of the observer)
  3. Data quality
  4. Content bias (of the source)
  5. Granularity (time, space, ...)
  6. Availability and access
  7. Consistency
  8. Size

2. Relational databases

  • Conceptual design (ER model):
    • Entities
    • Relationships
    • Attributes (simple, no composite, no derived)
    • Attribute domains
    • Key attributes
  • Logical design:
    • From the ER model above, review the relationships: always binary, 1-to-many, no complex/redundant/recursive relationships, no relationships with attributes
    • Normalize relations: PK, FK, normal form, Boyce-Codd normal form
    • Add constraints
    • Referential integrity (on update or delete)
  • Physical design:
    • Indexes, disk space
    • Security: user views, access rules, transmission

Normal forms:

  • 1NF: No duplicate columns, no multivalued attributes, atomic attributes.
  • 2NF: 1NF + Any non-key columns must depend on the entire primary key (no partial functional dependencies).
  • 3NF: 2NF + All non-key columns are mutually independent and fully dependent upon the primary key (no transitive dependencies).
  • 3.5NF or Boyce-Codd NF: 3NF + every determinant is a candidate key.
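
The 2NF and 3NF violations above can be made concrete with a small decomposition. Below is a minimal sketch using sqlite3 and a hypothetical orders example (all table and column names are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Unnormalized design: with PK (order_id, product_id), product_name
# depends only on product_id (partial dependency -> violates 2NF), and
# customer_city depends on customer_id rather than on the key
# (transitive dependency -> violates 3NF).
con.execute("""
    CREATE TABLE order_lines_bad (
        order_id INTEGER, product_id INTEGER, product_name TEXT,
        customer_id INTEGER, customer_city TEXT, quantity INTEGER,
        PRIMARY KEY (order_id, product_id))""")

# 3NF decomposition: every non-key column depends on the whole key,
# and on nothing but the key.
con.executescript("""
    CREATE TABLE products  (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_city TEXT);
    CREATE TABLE orders    (order_id INTEGER PRIMARY KEY,
                            customer_id INTEGER REFERENCES customers);
    CREATE TABLE order_lines (
        order_id INTEGER REFERENCES orders,
        product_id INTEGER REFERENCES products,
        quantity INTEGER,
        PRIMARY KEY (order_id, product_id));
""")
```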

3. SQL

Structured Query Language.

  • Specify what, not how.
  • Query optimization engine
  • Physical layer can change without modifying applications
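
A quick sketch of "what, not how" with sqlite3 (a hypothetical employees table): the query states the desired result, and the optimizer is free to change the physical plan, e.g. after an index is added, without the query itself changing.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, dept TEXT, salary REAL)")
con.executemany("INSERT INTO employees (dept, salary) VALUES (?, ?)",
                [("sales", 50000), ("sales", 62000), ("eng", 80000)])

# Declarative: no loops, no access paths, just the result we want.
query = "SELECT dept, AVG(salary) FROM employees GROUP BY dept"
print(con.execute(query).fetchall())  # [('eng', 80000.0), ('sales', 56000.0)]

# A new index changes how the engine answers, not the query or result.
con.execute("CREATE INDEX idx_dept ON employees(dept)")
print(con.execute("EXPLAIN QUERY PLAN " + query).fetchall())
```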

4. Data warehousing

OLTP: On-Line Transaction Processing - describes processing at operational sites.
OLAP: On-Line Analytical Processing - describes processing at large, integrated data warehouses.

We can assume that OLTP systems provide source data to data warehouses, whereas OLAP systems help to analyze it. Standard DBs at the information sources use OLTP; an integration system then stores all the data in the data warehouse, where we use OLAP, with complex queries over organized and summarized data.

DW provides access for analytical use. There can be other OLAP servers as intermediate "helpers" between DW and client.

4.1 Schema used by DW

Star schema: a central entity called the FACT, connected by many relationships to surrounding entities called rays (the DIMENSIONS); all entities carry attributes. The FACT holds aggregated measures, has a key composed of the dimension keys, and is normalized. The DIMENSIONS are not normalized.

Schema could also be snowflake (more levels).

Data cube (three dimensions): slice and dice operations.
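
A minimal star-schema sketch in sqlite3 (a hypothetical sales warehouse; all names invented), including a slice-and-dice style aggregation over two dimensions:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day INT, month INT, year INT);
    CREATE TABLE dim_store   (store_id INTEGER PRIMARY KEY, city TEXT, country TEXT);
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    -- FACT: aggregated measures, key composed of the dimension keys.
    CREATE TABLE fact_sales (
        date_id INT REFERENCES dim_date,
        store_id INT REFERENCES dim_store,
        product_id INT REFERENCES dim_product,
        units_sold INT, revenue REAL,
        PRIMARY KEY (date_id, store_id, product_id));
""")

# Slice: fix one dimension value (year = 2018).
# Dice: restrict several dimensions, then aggregate what remains.
cube_query = """
    SELECT d.month, s.country, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_date d  ON f.date_id  = d.date_id
    JOIN dim_store s ON f.store_id = s.store_id
    WHERE d.year = 2018 AND s.country IN ('CH', 'IT')
    GROUP BY d.month, s.country"""
print(con.execute(cube_query).fetchall())
```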

5. ACID vs. BASE and SQL vs. NoSQL

5.1 ACID (properties of transactions)

  • Atomicity
  • Consistency
  • Isolation
  • Durability
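
Atomicity in particular is easy to see in code. A minimal sketch with sqlite3 (hypothetical accounts table): the transfer either fully commits or fully rolls back.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
con.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 0.0)])
con.commit()

try:
    with con:  # transaction: commits on success, rolls back on exception
        con.execute("UPDATE accounts SET balance = balance - 150 WHERE id = 1")
        (balance,) = con.execute(
            "SELECT balance FROM accounts WHERE id = 1").fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")  # triggers the rollback
        con.execute("UPDATE accounts SET balance = balance + 150 WHERE id = 2")
except ValueError:
    pass

# Both updates were undone together: balances are unchanged.
print(con.execute("SELECT * FROM accounts").fetchall())  # [(1, 100.0), (2, 0.0)]
```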

5.2 CAP Theorem

Only two of the following three properties can be fully satisfied in a distributed computer system:

  • Consistency
  • Availability
  • Partition tolerance

P should always be satisfied, so choose AP or CP.

5.2.1 AP

A partitioned node returns the most recent version of the data it has, which could be stale.

e.g., DynamoDB, CouchDB, and Cassandra

5.2.2 CP

A partitioned node returns:

  • a correct value, if in a consistent state;
  • a timeout or an error, otherwise.
e.g., MongoDB, Redis, AppFabric Caching, and MemcacheDB

Some (non-distributed) DB systems do however satisfy CA, giving up partition tolerance: MySQL, PostgreSQL, Vertica.

5.3 The NoSQL world (Not only SQL)

Two main systems:

  • Key/Value or ‘the big hash table’ (remember caching?):
    • Amazon S3 (Dynamo)
    • Voldemort
    • Scalaris
    • MemcacheDB
    • Azure Table Storage
    • Redis
    • Riak
  • Schema-less
    • Cassandra (column-based)
    • CouchDB (document-based)
    • Neo4J (graph-based)
    • HBase (column-based)

5.4 BASE

  • Basically Available
  • Soft-State
  • Eventually Consistent

5.5 Types of NoSQL DBs

5.5.1 Graph

Pros:

  • Associative data, explore relationships, map easily to OO

Cons:

  • Not efficient on whole graph analysis

Neo4j: schema-less, ACID. Meant to be an operational DB, not specifically for analytics.

5.5.2 Key-Value wide column based

Cassandra:

  • Eventual (weak) consistency, Availability, Partition-tolerance

HBase:

  • Prefers (strong) Consistency over availability

5.5.3 Column-oriented

Mainly used in OLAP.

Pros:

  • Data compression
  • Improved Bandwidth Utilization
  • Improved Code Pipelining
  • Improved cache locality

Cons:

  • Increased Disk Seek Time
  • Increased cost of Inserts
  • Increased tuple reconstruction costs
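
A toy sketch of where these trade-offs come from (hypothetical data): laid out column-wise, equal values sit next to each other, so run-length encoding compresses well and a scan of one column touches only its bytes; the price is that rebuilding a full row touches every column.

```python
from itertools import groupby

rows = [  # (order_id, country, status)
    (1, "CH", "shipped"), (2, "CH", "shipped"),
    (3, "CH", "open"),    (4, "IT", "open"),
]

# Row store: values of different columns interleaved on disk.
row_layout = [value for row in rows for value in row]
print(row_layout[:6])  # [1, 'CH', 'shipped', 2, 'CH', 'shipped']

# Column store: each column contiguous -> long runs of equal values.
country_col = [r[1] for r in rows]
rle = [(v, len(list(g))) for v, g in groupby(country_col)]
print(rle)  # [('CH', 3), ('IT', 1)] -- 4 values stored as 2 pairs
```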

5.5.4 Document-oriented

Pros:

  • Handle schema changes very well
  • Easy to use with JSON
  • solves the object-relational impedance mismatch problem
  • automatic data sharding
  • easy indexing

Cons:

  • No joins, less flexible queries
  • For complex jobs needs map-reduce

Sharding strategies:

  • Range or Hash for:
    • Scale
    • Low recovery times
  • Tag-aware for:
    • Geo-locality
    • Hardware optimization

MongoDB focuses on CP
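
A minimal pymongo sketch of these points (assumes a local mongod on the default port; database, collection, and field names are all hypothetical): documents in one collection need not share a schema, indexing is one call, and queries are rich within a single collection, but there are no joins.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client.shop.orders

# Two documents of different shape coexist: schema changes are cheap.
orders.insert_one({"user": "ada", "items": ["book"], "total": 12.0})
orders.insert_one({"user": "bob", "total": 30.0, "coupon": "XMAS"})

orders.create_index("user")  # easy indexing

# Rich single-collection queries (JSON-like filters), but no joins.
for doc in orders.find({"total": {"$gt": 20}}):
    print(doc)
```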

6. Hadoop

Software platform to easily process vast amounts of data. Includes:

  • HDFS (Hadoop Distributed File System)
  • Map-Reduce (offline computing engine)
  • HBase (online data access)
  • other stuff (Pig, Hive, Zookeeper, Flume, Sqoop, Hue, Oozie)

Pros:

  • scalable
  • economical
  • efficient
  • reliable

Hadoop 2.0 introduces YARN (Yet Another Resource Negotiator), a cluster resource manager.

6.1 HDFS

Architecture and components:

  • NameNode:
    • Maps a file name to a set of blocks.
    • Maps a block to the DataNodes where it resides.
    • Stores other file metadata.
    • Cluster Configuration Management.
    • Replication Engine for Blocks.
  • DataNode (many):
    • A Block Server, stores blocks, metadata of blocks.
    • Sends periodic Block Report to NameNode.
    • Sends heartbeat to NameNode every 3 seconds (that's a slow heartbeat!).
    • Pipelining of data: Forwards data to other specified DataNodes.

Facts:

  • Data files split into blocks (typically 64 MB) and distributed across nodes of the cluster.
  • Blocks replicated multiple times (default replication is 3-fold).
  • Provides redundant storage for massive amounts of data.
  • Works better with fewer, larger files.
  • Files usually > 100 MB.
  • Write once (and read many).
  • Optimized for streaming, not random reads.
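
Back-of-the-envelope arithmetic for the numbers above, for a hypothetical 1 GB file with 64 MB blocks and 3-fold replication:

```python
import math

block_mb, replication, file_mb = 64, 3, 1024

blocks = math.ceil(file_mb / block_mb)  # 16 blocks for the file
replicas = blocks * replication         # 48 block replicas cluster-wide
raw_mb = file_mb * replication          # 3072 MB of raw disk consumed
print(blocks, replicas, raw_mb)

# Each block is a metadata entry on the NameNode, so many small files
# are expensive there too: hence "fewer, larger files".
```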

6.2 Hadoop subprojects

Pig: High-level language for data analysis.

  • Expresses sequences of MapReduce jobs
  • Data model: nested “bags” of items
  • Provides relational (SQL) operators (JOIN, GROUP BY, etc.)
  • Easy to plug in Java functions

HBase: Table storage for semi-structured data.

  • Key-value row/column store, has support for row/column queries
  • column-oriented, so nulls are free
  • Column families
  • Stores bytes

Hive (stupid bee elephant): SQL-like Query language and Metastore

  • “Relational database” built on Hadoop
  • Maintains list of table schemas
  • SQL-like query language (HiveQL)
  • Can call Hadoop Streaming scripts from HiveQL
  • Supports table partitioning, clustering, complex data types, some optimizations

Impala: Massive parallel processing (MPP) database engine, developed by Cloudera.

  • Integrated into the Hadoop stack at the same level as MapReduce, not above it (as Hive and Pig are)
  • Process data in Hadoop cluster without using MapReduce

Storm: provides realtime computation

  • Scalable
  • Guarantees no data loss
  • Extremely robust and fault-tolerant
  • Programming language agnostic

7. Map-Reduce

Pros:

  • Easy to use scalable programming model for large-scale data processing on clusters
  • Allows users to focus on computations
  • Hides issues of parallelization, fault tolerance, data partitioning & load balancing
  • Achieves efficiency through disk-locality
  • Achieves fault-tolerance through replication

Cons:

  • Faster solutions exist (Spark)
  • If Master crashes, all tasks aborted (rare)
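
To make the model concrete, here is a word-count sketch in plain Python: the user writes only map() and reduce(), and the framework's shuffle phase (simulated here with a dict) groups map outputs by key between the two phases.

```python
from collections import defaultdict

def map_fn(_, line):          # (key, value) -> sequence of (key', value')
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):  # (key', [values]) -> sequence of (key', result)
    yield word, sum(counts)

documents = ["to be or not to be", "to do is to be"]

# "Shuffle": group all map outputs by key, as the framework would.
groups = defaultdict(list)
for doc_id, line in enumerate(documents):
    for word, count in map_fn(doc_id, line):
        groups[word].append(count)

results = [kv for word, counts in groups.items()
           for kv in reduce_fn(word, counts)]
print(sorted(results))  # [('be', 3), ('do', 1), ('is', 1), ...]
```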

8. Flume

Sources collect data and put it into channels; sinks take data from a channel (exactly one channel per sink) and write it to the output. Sinks use transactions to ensure no data is lost.

  • Flume is suitable for large volume data collection, especially when data is being produced in multiple locations
  • Once planned and sized appropriately, Flume will practically run itself without any operational intervention
  • Flume provides weak ordering guarantee, i.e., in the absence of failures the data will arrive in the order it was received in the Flume pipeline
  • Transactional exchange ensures that Flume never loses any data in transit between Agents. Sinks use transactions to ensure data is not lost at point of ingest or terminal destinations
  • Flume has rich out-of-the box features such as contextual routing, and support for popular data sources and destination systems

9. Spark

Much faster than classic MapReduce, largely because intermediate results can be kept in memory instead of being written to disk between steps.

One stack to rule them all: Spark + SparkSQL + Spark Streaming

  • Ad-hoc queries
  • Batch processing
  • Stream processing
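
A minimal PySpark sketch of that "one stack" idea (assumes pyspark is installed and "input.txt" is a hypothetical input file): the same SparkSession serves both batch RDD processing and ad-hoc SQL.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("notes-demo").getOrCreate()
sc = spark.sparkContext

# Batch processing: word count, intermediate results kept in memory.
counts = (sc.textFile("input.txt")  # hypothetical input file
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
print(counts.take(5))

# Ad-hoc queries: expose the same data to Spark SQL.
df = counts.toDF(["word", "cnt"])
df.createOrReplaceTempView("word_counts")
spark.sql("SELECT word, cnt FROM word_counts "
          "ORDER BY cnt DESC LIMIT 5").show()
```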

10. Caching

Memcached: a simple, distributed in-memory key-value cache; values are opaque blobs, with no persistence and no rich data types.

Redis: REmote DIctionary Server. An advanced key-value store with types. Can be used as a DB or as a caching layer.
Main usages: caching, blocking queues, anything real-time, cross-application/cross-platform communication.
Consider it when:

  • Speed is critical
  • More than just key-value pairs
  • Dataset can fit in memory
  • Dataset is not critical
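
A small redis-py sketch of two of these usages, caching with a TTL and a blocking queue (assumes a local Redis server on the default port; key names and the DB lookup are hypothetical):

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def load_profile_from_db(user_id):
    return f"profile-{user_id}"  # stand-in for a slow database lookup

def get_user_profile(user_id):
    """Cache-aside: try Redis first, fall back to the DB and cache it."""
    key = f"user:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return cached
    profile = load_profile_from_db(user_id)
    r.setex(key, 300, profile)  # cache it, expire after 5 minutes
    return profile

print(get_user_profile(42))  # miss -> DB; a second call would hit the cache

# Blocking queue for cross-application communication:
r.lpush("jobs", "resize:img42")
print(r.brpop("jobs", timeout=5))  # blocks until a job arrives
```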