Data Design & Modeling

1. Modern Data problems

  1. The complexity of knowledge
  2. Cognitive bias (of the observer)
  3. Data quality
  4. Content bias (of the source)
  5. Granularity (time, space, ...)
  6. Availability and access
  7. Consistency
  8. Size

2. Relational databases

  • Conceptual design (ER model):
    • Entities
    • Relationships
    • Attributes (simple, no composite, no derived)
    • Attribute domains
    • Key attributes
  • Logical design:
    • From the ER model above, review the relationships: always binary, 1-to-many, no complex/redundant/recursive relationships, no relationships with attributes
    • Normalize relations: PK, FK, normal form, Boyce-Codd normal form
    • Add constraints
    • Referential integrity (on update or delete)
  • Physical design:
    • Indexes, disk space
    • Security: user views, access rules, transmission

Normal forms:

  • 1NF: No duplicate columns, no multivalued attributes, atomic attributes.
  • 2NF: 1NF + Any non-key columns must depend on the entire primary key (no partial functional dependencies).
  • 3NF: 2NF + All non-key columns are mutually independent and fully dependent upon the primary key (no transitive dependencies).
  • 3.5NF or Boyce-Codd NF: 3NF + every determinant is a candidate key.
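
The 2NF and 3NF violations above can be made concrete with a small decomposition. Below is a minimal sketch using sqlite3 and a hypothetical orders example (all table and column names are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Unnormalized design: with PK (order_id, product_id), product_name
# depends only on product_id (partial dependency -> violates 2NF), and
# customer_city depends on customer_id rather than on the key
# (transitive dependency -> violates 3NF).
con.execute("""
    CREATE TABLE order_lines_bad (
        order_id INTEGER, product_id INTEGER, product_name TEXT,
        customer_id INTEGER, customer_city TEXT, quantity INTEGER,
        PRIMARY KEY (order_id, product_id))""")

# 3NF decomposition: every non-key column depends on the whole key,
# and on nothing but the key.
con.executescript("""
    CREATE TABLE products  (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_city TEXT);
    CREATE TABLE orders    (order_id INTEGER PRIMARY KEY,
                            customer_id INTEGER REFERENCES customers);
    CREATE TABLE order_lines (
        order_id INTEGER REFERENCES orders,
        product_id INTEGER REFERENCES products,
        quantity INTEGER,
        PRIMARY KEY (order_id, product_id));
""")
```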

3. SQL

Structured Query Language.

  • Specify what, not how.
  • Query optimization engine
  • Physical layer can change without modifying applications
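
A quick sketch of "what, not how" with sqlite3 (a hypothetical employees table): the query states the desired result, and the optimizer is free to change the physical plan, e.g. after an index is added, without the query itself changing.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, dept TEXT, salary REAL)")
con.executemany("INSERT INTO employees (dept, salary) VALUES (?, ?)",
                [("sales", 50000), ("sales", 62000), ("eng", 80000)])

# Declarative: no loops, no access paths, just the result we want.
query = "SELECT dept, AVG(salary) FROM employees GROUP BY dept"
print(con.execute(query).fetchall())  # [('eng', 80000.0), ('sales', 56000.0)]

# A new index changes how the engine answers, not the query or result.
con.execute("CREATE INDEX idx_dept ON employees(dept)")
print(con.execute("EXPLAIN QUERY PLAN " + query).fetchall())
```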

4. Data warehousing

OLTP: On-Line Transaction Processing - describes processing at operational sites.
OLAP: On-Line Analytical Processing - describes processing at large, integrated data warehouses.

We can assume that OLTP systems provide source data to data warehouses, whereas OLAP systems help to analyze it. Standard DBs at the information sources use OLTP; an integration system then stores all the data in the data warehouse, where we use OLAP, with complex queries over organized and summarized data.

DW provides access for analytical use. There can be other OLAP servers as intermediate "helpers" between DW and client.

4.1 Schema used by DW

Star schema: a central entity called the FACT, connected by many relationships to surrounding entities called rays (the DIMENSIONS); all entities carry attributes. The FACT holds aggregated measures, has a key composed of the dimension keys, and is normalized. The DIMENSIONS are not normalized.

Schema could also be snowflake (more levels).

Data cube (three dimensions): slice and dice operations.
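
A minimal star-schema sketch in sqlite3 (a hypothetical sales warehouse; all names invented), including a slice-and-dice style aggregation over two dimensions:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day INT, month INT, year INT);
    CREATE TABLE dim_store   (store_id INTEGER PRIMARY KEY, city TEXT, country TEXT);
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    -- FACT: aggregated measures, key composed of the dimension keys.
    CREATE TABLE fact_sales (
        date_id INT REFERENCES dim_date,
        store_id INT REFERENCES dim_store,
        product_id INT REFERENCES dim_product,
        units_sold INT, revenue REAL,
        PRIMARY KEY (date_id, store_id, product_id));
""")

# Slice: fix one dimension value (year = 2018).
# Dice: restrict several dimensions, then aggregate what remains.
cube_query = """
    SELECT d.month, s.country, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_date d  ON f.date_id  = d.date_id
    JOIN dim_store s ON f.store_id = s.store_id
    WHERE d.year = 2018 AND s.country IN ('CH', 'IT')
    GROUP BY d.month, s.country"""
print(con.execute(cube_query).fetchall())
```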

5. ACID vs. BASE and SQL vs. NoSQL

5.1 ACID (properties of transactions)

  • Atomicity
  • Consistency
  • Isolation
  • Durability
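
Atomicity in particular is easy to see in code. A minimal sketch with sqlite3 (hypothetical accounts table): the transfer either fully commits or fully rolls back.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
con.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 0.0)])
con.commit()

try:
    with con:  # transaction: commits on success, rolls back on exception
        con.execute("UPDATE accounts SET balance = balance - 150 WHERE id = 1")
        (balance,) = con.execute(
            "SELECT balance FROM accounts WHERE id = 1").fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")  # triggers the rollback
        con.execute("UPDATE accounts SET balance = balance + 150 WHERE id = 2")
except ValueError:
    pass

# Both updates were undone together: balances are unchanged.
print(con.execute("SELECT * FROM accounts").fetchall())  # [(1, 100.0), (2, 0.0)]
```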

5.2 CAP Theorem

Only two of the following three properties can be fully satisfied in a distributed computer system:

  • Consistency
  • Availability
  • Partition tolerance

P should always be satisfied, so choose AP or CP.

5.2.1 AP

A partitioned node returns the most recent version of the data it has, which could be stale.

e.g., DynamoDB, CouchDB, and Cassandra

5.2.2 CP

A partitioned node returns:

  • a correct value, if in a consistent state;
  • a timeout or an error, otherwise.
e.g., MongoDB, Redis, AppFabric Caching, and MemcacheDB

Some (non-distributed) DB systems do however satisfy CA, giving up partition tolerance: MySQL, PostgreSQL, Vertica.

5.3 The NoSQL world (Not only SQL)

Two main systems:

  • Key/Value or ‘the big hash table’ (remember caching?):
    • Amazon S3 (Dynamo)
    • Voldemort
    • Scalaris
    • MemcacheDB
    • Azure Table Storage
    • Redis
    • Riak
  • Schema-less
    • Cassandra (column-based)
    • CouchDB (document-based)
    • Neo4J (graph-based)
    • HBase (column-based)

5.4 BASE

  • Basically Available
  • Soft-State
  • Eventually Consistent

5.5 Types of NoSQL DBs

5.5.1 Graph

Pros:

  • Associative data, explore relationships, map easily to OO

Cons:

  • Not efficient on whole graph analysis

Neo4j: schema-less, ACID. Meant to be an operational DB, not specifically for analytics.

5.5.2 Key-Value wide column based

Cassandra:

  • Eventual (weak) consistency, Availability, Partition-tolerance

HBase:

  • Prefers (strong) Consistency over availability

5.5.3 Column-oriented

Mainly used in OLAP.

Pros:

  • Data compression
  • Improved Bandwidth Utilization
  • Improved Code Pipelining
  • Improved cache locality

Cons:

  • Increased Disk Seek Time
  • Increased cost of Inserts
  • Increased tuple reconstruction costs
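
A toy sketch of where these trade-offs come from (hypothetical data): laid out column-wise, equal values sit next to each other, so run-length encoding compresses well and a scan of one column touches only its bytes; the price is that rebuilding a full row touches every column.

```python
from itertools import groupby

rows = [  # (order_id, country, status)
    (1, "CH", "shipped"), (2, "CH", "shipped"),
    (3, "CH", "open"),    (4, "IT", "open"),
]

# Row store: values of different columns interleaved on disk.
row_layout = [value for row in rows for value in row]
print(row_layout[:6])  # [1, 'CH', 'shipped', 2, 'CH', 'shipped']

# Column store: each column contiguous -> long runs of equal values.
country_col = [r[1] for r in rows]
rle = [(v, len(list(g))) for v, g in groupby(country_col)]
print(rle)  # [('CH', 3), ('IT', 1)] -- 4 values stored as 2 pairs
```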

5.5.4 Document-oriented

Pros:

  • Handle schema changes very well
  • Easy to use with JSON
  • solves the object-relational impedance mismatch problem
  • automatic data sharding
  • easy indexing

Cons:

  • No joins, less flexible queries
  • For complex jobs needs map-reduce

Sharding strategies:

  • Range or Hash for:
    • Scale
    • Low recovery times
  • Tag-aware for:
    • Geo-locality
    • Hardware optimization

MongoDB focuses on CP
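
A minimal pymongo sketch of these points (assumes a local mongod on the default port; database, collection, and field names are all hypothetical): documents in one collection need not share a schema, indexing is one call, and queries are rich within a single collection, but there are no joins.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client.shop.orders

# Two documents of different shape coexist: schema changes are cheap.
orders.insert_one({"user": "ada", "items": ["book"], "total": 12.0})
orders.insert_one({"user": "bob", "total": 30.0, "coupon": "XMAS"})

orders.create_index("user")  # easy indexing

# Rich single-collection queries (JSON-like filters), but no joins.
for doc in orders.find({"total": {"$gt": 20}}):
    print(doc)
```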

6. Hadoop

Software platform to easily process vast amounts of data. Includes:

  • HDFS (Hadoop Distributed File System)
  • Map-Reduce (offline computing engine)
  • HBase (online data access)
  • other stuff (Pig, Hive, Zookeeper, Flume, Sqoop, Hue, Oozie)

Pros:

  • scalable
  • economical
  • efficient
  • reliable

Hadoop 2.0 introduces YARN (Yet Another Resource Negotiator), a cluster resource manager.

6.1 HDFS

Architecture and components:

  • NameNode:
    • Maps a file name to a set of blocks.
    • Maps a block to the DataNodes where it resides.
    • Stores other file metadata.
    • Cluster Configuration Management.
    • Replication Engine for Blocks.
  • DataNode (many):
    • A Block Server, stores blocks, metadata of blocks.
    • Sends periodic Block Report to NameNode.
    • Sends heartbeat to NameNode every 3 seconds (that's a slow heartbeat!).
    • Pipelining of data: Forwards data to other specified DataNodes.

Facts:

  • Data files split into blocks (typically 64 MB) and distributed across nodes of the cluster.
  • Blocks replicated multiple times (default replication is 3-fold).
  • Provides redundant storage for massive amounts of data.
  • Works better with fewer, larger files.
  • Files usually > 100 MB.
  • Write once (and read many).
  • Optimized for streaming, not random reads.
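
Back-of-the-envelope arithmetic for the numbers above, for a hypothetical 1 GB file with 64 MB blocks and 3-fold replication:

```python
import math

block_mb, replication, file_mb = 64, 3, 1024

blocks = math.ceil(file_mb / block_mb)  # 16 blocks for the file
replicas = blocks * replication         # 48 block replicas cluster-wide
raw_mb = file_mb * replication          # 3072 MB of raw disk consumed
print(blocks, replicas, raw_mb)

# Each block is a metadata entry on the NameNode, so many small files
# are expensive there too: hence "fewer, larger files".
```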

6.2 Hadoop subprojects

Pig: High-level language for data analysis.

  • Expresses sequences of MapReduce jobs
  • Data model: nested “bags” of items
  • Provides relational (SQL) operators (JOIN, GROUP BY, etc.)
  • Easy to plug in Java functions

HBase: Table storage for semi-structured data.

  • Key-value row/column store, has support for row/column queries
  • column-oriented, so nulls are free
  • Column families
  • Stores bytes

Hive (stupid bee elephant): SQL-like Query language and Metastore

  • “Relational database” built on Hadoop
  • Maintains list of table schemas
  • SQL-like query language (HiveQL)
  • Can call Hadoop Streaming scripts from HiveQL
  • Supports table partitioning, clustering, complex data types, some optimizations

Impala: Massive parallel processing (MPP) database engine, developed by Cloudera.

  • Integrated into the Hadoop stack at the same level as MapReduce, not above it (as Hive and Pig are)
  • Process data in Hadoop cluster without using MapReduce

Storm: provides realtime computation

  • Scalable
  • Guarantees no data loss
  • Extremely robust and fault-tolerant
  • Programming language agnostic

7. Map-Reduce

Pros:

  • Easy to use scalable programming model for large-scale data processing on clusters
  • Allows users to focus on computations
  • Hides issues of parallelization, fault tolerance, data partitioning & load balancing
  • Achieves efficiency through disk-locality
  • Achieves fault-tolerance through replication

Cons:

  • Faster solutions exist (Spark)
  • If Master crashes, all tasks aborted (rare)
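
To make the model concrete, here is a word-count sketch in plain Python: the user writes only map() and reduce(), and the framework's shuffle phase (simulated here with a dict) groups map outputs by key between the two phases.

```python
from collections import defaultdict

def map_fn(_, line):          # (key, value) -> sequence of (key', value')
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):  # (key', [values]) -> sequence of (key', result)
    yield word, sum(counts)

documents = ["to be or not to be", "to do is to be"]

# "Shuffle": group all map outputs by key, as the framework would.
groups = defaultdict(list)
for doc_id, line in enumerate(documents):
    for word, count in map_fn(doc_id, line):
        groups[word].append(count)

results = [kv for word, counts in groups.items()
           for kv in reduce_fn(word, counts)]
print(sorted(results))  # [('be', 3), ('do', 1), ('is', 1), ...]
```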

8. Flume

Sources collect data and put it into channels; sinks take data from a channel (exactly one channel per sink) and write it to the output. Sinks use transactions to ensure no data is lost.

  • Flume is suitable for large volume data collection, especially when data is being produced in multiple locations
  • Once planned and sized appropriately, Flume will practically run itself without any operational intervention
  • Flume provides weak ordering guarantee, i.e., in the absence of failures the data will arrive in the order it was received in the Flume pipeline
  • Transactional exchange ensures that Flume never loses any data in transit between Agents. Sinks use transactions to ensure data is not lost at point of ingest or terminal destinations
  • Flume has rich out-of-the box features such as contextual routing, and support for popular data sources and destination systems

9. Spark

Much faster than classic MapReduce, largely because intermediate results can be kept in memory instead of being written to disk between steps.

One stack to rule them all: Spark + SparkSQL + Spark Streaming

  • Ad-hoc queries
  • Batch processing
  • Stream processing
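
A minimal PySpark sketch of that "one stack" idea (assumes pyspark is installed and "input.txt" is a hypothetical input file): the same SparkSession serves both batch RDD processing and ad-hoc SQL.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("notes-demo").getOrCreate()
sc = spark.sparkContext

# Batch processing: word count, intermediate results kept in memory.
counts = (sc.textFile("input.txt")  # hypothetical input file
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
print(counts.take(5))

# Ad-hoc queries: expose the same data to Spark SQL.
df = counts.toDF(["word", "cnt"])
df.createOrReplaceTempView("word_counts")
spark.sql("SELECT word, cnt FROM word_counts "
          "ORDER BY cnt DESC LIMIT 5").show()
```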

10. Caching

Memcached: a simple, distributed in-memory key-value cache; values are opaque blobs, with no persistence and no rich data types.

Redis: REmote DIctionary Server. An advanced key-value store with types. Can be used as a DB or as a caching layer.
Main usages: caching, blocking queues, anything real-time, cross-application/cross-platform communication.
Consider it when:

  • Speed is critical
  • More than just key-value pairs
  • Dataset can fit in memory
  • Dataset is not critical
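
A small redis-py sketch of two of these usages, caching with a TTL and a blocking queue (assumes a local Redis server on the default port; key names and the DB lookup are hypothetical):

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def load_profile_from_db(user_id):
    return f"profile-{user_id}"  # stand-in for a slow database lookup

def get_user_profile(user_id):
    """Cache-aside: try Redis first, fall back to the DB and cache it."""
    key = f"user:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return cached
    profile = load_profile_from_db(user_id)
    r.setex(key, 300, profile)  # cache it, expire after 5 minutes
    return profile

print(get_user_profile(42))  # miss -> DB; a second call would hit the cache

# Blocking queue for cross-application communication:
r.lpush("jobs", "resize:img42")
print(r.brpop("jobs", timeout=5))  # blocks until a job arrives
```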