http://aws.amazon.com/big-data/
1. MDM
2. Data Governance
3. Metadata
4. Data Quality
5. Data Security
Map - Identify and filter data.
Reduce - Compress the filtered data.
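A minimal sketch of this split, Hadoop Streaming style, assuming tab-separated log lines on stdin; the field positions and the ERROR filter are made-up illustrations, not anything from the talk:

    # map identifies/filters records, reduce compresses them to one count per key
    import sys
    from itertools import groupby

    def mapper(lines):
        # Map: identify and filter - keep only ERROR records, emit (key, 1).
        for line in lines:
            fields = line.rstrip("\n").split("\t")
            if len(fields) > 2 and fields[2] == "ERROR":
                yield fields[0], 1            # key = source system (assumed layout)

    def reducer(pairs):
        # Reduce: compress the filtered data down to one count per key.
        for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
            yield key, sum(v for _, v in group)

    if __name__ == "__main__":
        print(dict(reducer(mapper(sys.stdin))))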
Mahout - Machine Learning
MLlib, Pandas, R.
Pig (query language) and Hive (BI) - the new big data ETL. Hive is the data warehouse of Big Data.
In order to put the data in context we need reference data. We send the answers back to the warehouse. All BI tools need to write SQL.
Cloudera Impala speaks SQL, HAWQ is SQL, and C* (Cassandra) uses CQL. Datameer can go against native HDFS.
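As a hedged illustration of "all BI tools need to write SQL": a client can send plain SQL to HiveServer2 through PyHive. The host, port, user and table names below are assumptions:

    # querying Hive with ordinary SQL, as a BI tool would (requires the PyHive package)
    from pyhive import hive

    conn = hive.Connection(host="hive-server.example.com", port=10000,
                           username="analyst", database="default")
    cur = conn.cursor()
    cur.execute("SELECT region, COUNT(*) AS orders FROM sales GROUP BY region")
    for region, orders in cur.fetchall():
        print(region, orders)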
Tools:
Kafka, Mahout, Hive, Pig, Sqoop, ZooKeeper, Storm, Spark.
NoSQL :
======
Document : MongoDB, CouchDB
Key Value : Redis, Riak
Columnar : C*, HBase
Graph : Neo4j, Titan, OrientDB
Languages: Scala, Java, Python, Ruby, SciPy, Pandas, R
Challenges : Volume, Variety, Veracity, Velocity
===================================
Volume : Huge. Using Storm, C* and Hadoop.
Variety : Structured and unstructured. How to govern?
Veracity : How do you certify the quality of sentiment analysis?
Velocity : A data warehouse must be complete; completeness is one of the standard data quality tests. If you are receiving data as a stream, how do you apply governance, since it is streaming and we don't know when it is complete?
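One hedged way to handle the Velocity point, since a stream never declares itself complete, is to bound lateness with an event-time watermark. A PySpark Structured Streaming sketch; the Kafka broker, topic and window sizes are assumptions:

    # declare how long we wait for late data, so each window can be treated as
    # "complete enough" to apply governance checks to
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import window, col

    spark = SparkSession.builder.appName("velocity-governance").getOrCreate()

    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "clickstream")          # topic name is assumed
              .load()
              .selectExpr("CAST(value AS STRING) AS body", "timestamp"))

    counts = (events
              .withWatermark("timestamp", "10 minutes")    # accept 10 min of lateness
              .groupBy(window(col("timestamp"), "5 minutes"))
              .count())

    query = counts.writeStream.outputMode("append").format("console").start()
    query.awaitTermination()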
Data Governance
=============
When we say that the data is governed there are 7 things that need to happen:
1. People: Data Stewards and Enterprise Data Council.
2. Catalog: Provide to users the catalog of what data is available and what it means.
3. Quality : ETL can't be done at rest.
Rules for Data Ingestion
==================
With Big Data we tend to ingest everything; in the DW world we run business rules before ingesting.
Rules: A. Don't dump data before you govern (see the ingestion-gate sketch below).
B. Information harvested from old systems.
Making data right is still an immature practice.
Existing Data Governance needs to change. Move from Data Science to Data Governance.
Everything must be automated.
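A minimal sketch of rule A's ingestion gate: check an incoming feed against the declared schema before it is allowed to land in the lake. The column list and file path are assumptions:

    # reject a feed that does not match the declared schema instead of dumping
    # it into the lake ungoverned
    import csv

    EXPECTED_COLUMNS = ["customer_id", "order_ts", "amount", "currency"]  # assumed

    def missing_columns(path):
        with open(path, newline="") as f:
            header = next(csv.reader(f))
        return [c for c in EXPECTED_COLUMNS if c not in header]

    if __name__ == "__main__":
        missing = missing_columns("incoming/orders_20150701.csv")  # placeholder path
        if missing:
            raise SystemExit(f"Rejecting feed, missing columns: {missing}")
        print("Feed accepted for ingestion")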
Highlights
=========
1. Org and Process changes for doing Data Governance right.
2. MDM: Managing hierarchies; graph databases are ideal for managing hierarchical data.
3. Metadata: Catalog. There is no silver bullet.
4. Data Quality & Monitoring
5. Information Security
6. Information Lifecycle
Big Data Governance - Truth
======================
Full Data Governance can only be applied to structured data.
Data must have a known schema. (This can include materialized endpoints, e.g. files or tables, or projections such as a Hive table.)
Governed structured data must have:
A known schema for metadata.
A known and certified lineage.
Monitored, quality-tested, managed processes for ingestion and transformation.
Governed usage -> data isn't just for enterprise BI tools.
Hadoop contains mostly semi-structured / structured data with a definable schema rather than completely unstructured data.
Even in the case of unstructured data, structure may be applied in just about every case before analysis is done.
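A hedged sketch of applying structure before analysis: read semi-structured JSON with an explicit schema in PySpark so the data has a known schema from the start. The field names and path are assumptions:

    # give semi-structured JSON a known schema up front, the precondition for governing it
    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField, StringType,
                                   DoubleType, TimestampType)

    spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

    schema = StructType([                      # the "known schema" the notes require
        StructField("customer_id", StringType(), nullable=False),
        StructField("event_ts",    TimestampType(), nullable=False),
        StructField("amount",      DoubleType(), nullable=True),
    ])

    df = spark.read.schema(schema).json("hdfs:///raw/events/")  # path is assumed
    df.createOrReplaceTempView("events")       # now queryable like a Hive table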
The rise of the Data Scientist
========================
Provide requirements for Data Lake
Proper metadata established
Catalog
Data Definition
Lineage
Quality monitoring
Know and validate data completeness
Data Science to Big Data Warehouse mapping
Full data governance requirements
Provide full process lineage
Data certification process by data stewards and business owners
Ongoing data quality monitoring that includes machine-learning algorithm enrichment and quality checks.
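A hedged sketch of what a single catalog entry covering definition, lineage and quality monitoring could look like; the field choices are assumptions, not any particular product's model:

    # one possible shape for a data-lake catalog record that keeps definition,
    # lineage and quality checks in one place
    from dataclasses import dataclass, field

    @dataclass
    class CatalogEntry:
        name: str                      # dataset name in the lake
        definition: str                # business meaning of the data
        schema: dict                   # column name -> type
        lineage: list = field(default_factory=list)        # upstream feeds / jobs
        quality_checks: list = field(default_factory=list)

    orders = CatalogEntry(
        name="orders_enriched",
        definition="Orders joined with MDM customer hierarchy",
        schema={"customer_id": "string", "amount": "double"},
        lineage=["raw.orders", "mdm.customer_hierarchy", "job:enrich_orders"],
        quality_checks=["row_count_vs_source", "amount_not_null"],
    )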
Feed Lifecycle management and data processing platform
=============================================
Apache Falcon.
OR
Oozie + retention metadata.
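If Falcon is not in place, a hedged sketch of the "Oozie + retention metadata" idea: keep the retention window as metadata and drop HDFS partitions older than it, using only the standard hdfs dfs CLI. The path, partition layout and 90-day window are assumptions:

    # enforce a retention policy on date-partitioned HDFS data
    import subprocess
    from datetime import date, timedelta

    RETENTION_DAYS = 90                                  # would live in retention metadata
    BASE_PATH = "/data/raw/clickstream"                  # partitions like dt=2015-07-01

    cutoff = date.today() - timedelta(days=RETENTION_DAYS)

    listing = subprocess.run(["hdfs", "dfs", "-ls", BASE_PATH],
                             capture_output=True, text=True, check=True)
    for line in listing.stdout.splitlines():
        parts = line.split()
        if not parts or "dt=" not in parts[-1]:
            continue
        partition_date = parts[-1].split("dt=")[-1]
        if partition_date < cutoff.isoformat():          # ISO dates compare as strings
            subprocess.run(["hdfs", "dfs", "-rm", "-r", "-skipTrash", parts[-1]],
                           check=True)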
Master Data Management
====================
Why? Needed during the data transformation stage for adding proper context to the raw big data.
Consistent policy enforcement and security
Integration with existing ecosystem
Data Governance through workflow management
Data Quality enforcement through meta data driven rules
Time variant hierarchies and attributes
Graph db - high performance, flexible and scalable
Unifying information coming from uncontrolled sources. Since the sources are uncontrolled, their hierarchies are different and depend on the context the data is coming from.
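A hedged sketch of the graph-database point: walking a customer hierarchy of arbitrary depth with the Neo4j Python driver is one Cypher pattern, where the same query is awkward in SQL. The URI, credentials, labels and relationship name are assumptions:

    # walk a customer hierarchy of arbitrary depth in a graph database
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://neo4j.example.com:7687",
                                  auth=("neo4j", "password"))

    CYPHER = """
    MATCH (child:Customer {id: $id})-[:ROLLS_UP_TO*]->(parent:Customer)
    RETURN parent.id AS parent_id, parent.name AS parent_name
    """

    with driver.session() as session:
        for record in session.run(CYPHER, id="CUST-42"):
            print(record["parent_id"], record["parent_name"])
    driver.close()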
Big Data Security
=============
Determining who sees what:
Need to secure as many data types as possible
Auto-discovery is important (see the masking sketch after the product list).
Current products
Sentry - SQL security semantics on Hive
Knox - Central auth mechanism on HDFS
Cloudera Navigator - Central security auditing
Hadoop - *NIX permission with LDAP
Dataguise - Auto discovery, masking, encryption
Datameer - The BI tool for Hadoop
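Not any vendor's actual logic, but a minimal sketch of what auto-discovery plus masking means in practice: scan values for PII-looking patterns and replace them with a salted hash before exposing them. The regexes, salt and file name are assumptions:

    # naive PII discovery and masking, for illustration only
    import csv, hashlib, re

    PII_PATTERNS = {
        "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
        "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    }

    def mask(value, salt="demo-salt"):
        # irreversible masking via a salted hash (scheme is an assumption)
        return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

    def mask_row(row):
        return [mask(v) if any(p.search(v) for p in PII_PATTERNS.values()) else v
                for v in row]

    with open("customers.csv", newline="") as f:          # placeholder file
        for row in csv.reader(f):
            print(mask_row(row))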
Hadoop Metadata
===============
Products like Loom
OSS alternatives include HCatalog.
Maps to relational schema
Devs don’t have to worry about data format and storage
Can use Superluminate to get started.
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-dev-create-metastore-outside.html
ILM
====
Twitter DAL (not open sourced)
Apache Falcon
Data Quality and monitoring
======================
Continuous monitoring is needed
Accuracy and completeness of data
All data in the BDW must have monitoring
Basic stats - Source to target counts
Error Events - Did we capture any errors while processing
Tolerance - Is the metric within the tolerance limit? What is the standard deviation from the calculated ideal?
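A hedged sketch of these three checks as plain functions; the thresholds and sample numbers are assumptions:

    # the three basic monitoring checks: counts, error events, tolerance
    from statistics import mean, stdev

    def counts_match(source_count, target_count):
        # Basic stats: source-to-target row counts should reconcile exactly.
        return source_count == target_count

    def had_errors(error_events):
        # Error events: anything captured during processing fails the run.
        return len(error_events) > 0

    def within_tolerance(metric, history, max_sigma=3.0):
        # Tolerance: flag a metric more than max_sigma standard deviations
        # away from its historical mean.
        return abs(metric - mean(history)) <= max_sigma * stdev(history)

    print(counts_match(10_000, 10_000))                               # True
    print(had_errors([]))                                             # False
    print(within_tolerance(9_950, [10_050, 9_980, 10_010, 9_940]))    # True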
Error Event Fact Table
Part of the data cleansing system is a set of diagnostic filters known as quality screens. Each implements a test in the data flow that, if it fails, records an error in the Error Event Schema. Quality screens are divided into three categories:
Column screens. Testing the individual column, e.g. for unexpected values like NULL values; non-numeric values that should be numeric; out of range values; etc.
Structure screens. These are used to test the integrity of different relationships between columns (typically foreign/primary keys) in the same or different tables. They are also used for testing that a group of columns is valid according to some structural definition it should adhere to.
Business rule screens. The most complex of the three tests. They test whether data, possibly across multiple tables, follow specific business rules. An example could be that if a customer is marked as a certain type of customer, the business rules that define this kind of customer should be adhered to.
When a quality screen records an error, it can either stop the dataflow process, send the faulty data somewhere other than the target system, or tag the data. The last option is considered the best solution, because the first requires someone to manually deal with the issue each time it occurs, and the second implies that data are missing from the target system (an integrity problem) and it is often unclear what should happen to that data.
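A minimal sketch of the three screen types using the tagging option described above, so the flow is not stopped and no data goes missing; the column names and rules are assumptions:

    # column, structure and business-rule screens that tag offending records
    def column_screen(row):
        # Column screen: amount must be present and numeric.
        try:
            float(row["amount"])
            return None
        except (KeyError, TypeError, ValueError):
            return "column:amount_not_numeric"

    def structure_screen(row, known_customer_ids):
        # Structure screen: the foreign key must resolve to the customer table.
        if row.get("customer_id") in known_customer_ids:
            return None
        return "structure:orphan_customer_id"

    def business_rule_screen(row):
        # Business-rule screen: 'premium' customers must have a credit limit.
        if row.get("segment") == "premium" and not row.get("credit_limit"):
            return "business:premium_without_credit_limit"
        return None

    def tag(row, known_customer_ids):
        errors = [e for e in (column_screen(row),
                              structure_screen(row, known_customer_ids),
                              business_rule_screen(row)) if e]
        row["dq_tags"] = errors          # tagged, not dropped, not halting the flow
        return row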
Error Event Schema
This schema is the place where all error events thrown by quality screens are recorded. It consists of an Error Event Fact table with foreign keys to three dimension tables that represent the date (when), the batch job (where) and the screen (which screen produced the error). It also holds information about exactly when the error occurred and the severity of the error. In addition there is an Error Event Detail Fact table, with a foreign key to the main table, that contains detailed information about the table, record and field in which the error occurred, and the error condition.
Use HBase for metadata and error events
Oozie for orchestration
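A hedged sketch of writing error events to HBase with the HappyBase client; the Thrift host, table name, column family and row-key layout are assumptions:

    # write one error-event record per failed screen
    # assumes an existing 'error_events' table with column family 'e'
    import time
    import happybase

    connection = happybase.Connection("hbase-thrift.example.com")  # Thrift gateway
    table = connection.table("error_events")

    def record_error(batch_id, screen, severity, table_name, record_id, detail):
        row_key = f"{batch_id}:{int(time.time() * 1000)}"
        table.put(row_key, {
            b"e:screen":    screen.encode(),
            b"e:severity":  severity.encode(),
            b"e:table":     table_name.encode(),
            b"e:record_id": record_id.encode(),
            b"e:detail":    detail.encode(),
        })

    record_error("batch-20150701", "column:amount_not_numeric", "WARN",
                 "orders", "ORD-1001", "amount='N/A'")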