http://aws.amazon.com/big-data/
1. MDM
2. Data Governance
3. Metadata
4. Data Quality
5. Data Security
Map - Identify and filter data.
Reduce - Compress the filtered data.
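A minimal sketch of this split, Hadoop Streaming style, assuming tab-separated log lines on stdin; the field positions and the ERROR filter are made-up illustrations, not anything from the talk:

    # map identifies/filters records, reduce compresses them to one count per key
    import sys
    from itertools import groupby

    def mapper(lines):
        # Map: identify and filter - keep only ERROR records, emit (key, 1).
        for line in lines:
            fields = line.rstrip("\n").split("\t")
            if len(fields) > 2 and fields[2] == "ERROR":
                yield fields[0], 1            # key = source system (assumed layout)

    def reducer(pairs):
        # Reduce: compress the filtered data down to one count per key.
        for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
            yield key, sum(v for _, v in group)

    if __name__ == "__main__":
        print(dict(reducer(mapper(sys.stdin))))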
Mahout - Machine Learning
MLlib, Pandas, R.
Pig (query language) and Hive (BI) - the new big data ETL. Hive is the data warehouse of Big Data.
In order to put the data in context we need reference data. We send the answers back to the warehouse. All BI tools need to write SQL.
Cloudera Impala speaks SQL, HAWQ is SQL, and C* (Cassandra) uses CQL. Datameer can go against native HDFS.
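As a hedged illustration of "all BI tools need to write SQL": a client can send plain SQL to HiveServer2 through PyHive. The host, port, user and table names below are assumptions:

    # querying Hive with ordinary SQL, as a BI tool would (requires the PyHive package)
    from pyhive import hive

    conn = hive.Connection(host="hive-server.example.com", port=10000,
                           username="analyst", database="default")
    cur = conn.cursor()
    cur.execute("SELECT region, COUNT(*) AS orders FROM sales GROUP BY region")
    for region, orders in cur.fetchall():
        print(region, orders)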
Tools:
Kafka, Mahout, Hive, Pig, Sqoop, ZooKeeper, Storm, Spark.
NoSQL :
======
Document : MongoDB, CouchDB
Key Value : Redis, Riak
Columnar : C*, HBase
Graph : Neo4j, Titan, OrientDB
Languages: Scala, Java, Python, Ruby, SciPy, Pandas, R
Challenges : Volume, Variety, Veracity, Velocity
===================================
Volume : Huge. Using Storm, C* and Hadoop.
Variety : Structured and unstructured. How to govern?
Veracity : How do you certify the quality of sentiment analysis?
Velocity : A data warehouse must be complete; completeness is one of the standard data quality tests. If you are receiving data as a stream, how do you apply governance, since it is streaming and we don't know when it is complete?
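One hedged way to handle the Velocity point, since a stream never declares itself complete, is to bound lateness with an event-time watermark. A PySpark Structured Streaming sketch; the Kafka broker, topic and window sizes are assumptions:

    # declare how long we wait for late data, so each window can be treated as
    # "complete enough" to apply governance checks to
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import window, col

    spark = SparkSession.builder.appName("velocity-governance").getOrCreate()

    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "clickstream")          # topic name is assumed
              .load()
              .selectExpr("CAST(value AS STRING) AS body", "timestamp"))

    counts = (events
              .withWatermark("timestamp", "10 minutes")    # accept 10 min of lateness
              .groupBy(window(col("timestamp"), "5 minutes"))
              .count())

    query = counts.writeStream.outputMode("append").format("console").start()
    query.awaitTermination()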
Data Governance
=============
When we say that the data is governed there are 7 things that need to happen:
1. People: Data Stewards and Enterprise Data Council.
2. Catalog: Provide to users the catalog of what data is available and what it means.
3. Quality : ETL can't be done at rest.
Rules for Data Ingestion
==================
With Big Data we tend to ingest everything; in the DW world we run business rules before ingesting.
Rules: A. Don't dump data before you govern (see the ingestion-gate sketch below).
B. Information harvested from old systems.
Making data right is still an immature practice.
Existing Data Governance needs to change. Move from Data Science to Data Governance.
Everything must be automated.
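A minimal sketch of rule A's ingestion gate: check an incoming feed against the declared schema before it is allowed to land in the lake. The column list and file path are assumptions:

    # reject a feed that does not match the declared schema instead of dumping
    # it into the lake ungoverned
    import csv

    EXPECTED_COLUMNS = ["customer_id", "order_ts", "amount", "currency"]  # assumed

    def missing_columns(path):
        with open(path, newline="") as f:
            header = next(csv.reader(f))
        return [c for c in EXPECTED_COLUMNS if c not in header]

    if __name__ == "__main__":
        missing = missing_columns("incoming/orders_20150701.csv")  # placeholder path
        if missing:
            raise SystemExit(f"Rejecting feed, missing columns: {missing}")
        print("Feed accepted for ingestion")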
Highlights
=========
1. Org and Process changes for doing Data Governance right.
2. MDM: Managing hierarchies; graph databases are ideal for managing hierarchical data.
3. Metadata: Catalog. There is no silver bullet.
4. Data Quality & Monitoring
5. Information Security
6. Information Lifecycle
Big Data Governance - Truth
======================
Full Data Governance can only be applied to structured data.
Data must have a known schema. (This can include materialized endpoints, e.g. files or tables, or projections such as a Hive table.)
Governed structured data must have:
A known schema for metadata.
A known and certified lineage.
Monitored, quality-tested, managed processes for ingestion and transformation.
Governed usage -> data isn't just for enterprise BI tools.
Hadoop contains mostly semi-structured / structured data with a definable schema rather than completely unstructured data.
Even in the case of unstructured data, structure may be applied in just about every case before analysis is done.
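A hedged sketch of applying structure before analysis: read semi-structured JSON with an explicit schema in PySpark so the data has a known schema from the start. The field names and path are assumptions:

    # give semi-structured JSON a known schema up front, the precondition for governing it
    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField, StringType,
                                   DoubleType, TimestampType)

    spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

    schema = StructType([                      # the "known schema" the notes require
        StructField("customer_id", StringType(), nullable=False),
        StructField("event_ts",    TimestampType(), nullable=False),
        StructField("amount",      DoubleType(), nullable=True),
    ])

    df = spark.read.schema(schema).json("hdfs:///raw/events/")  # path is assumed
    df.createOrReplaceTempView("events")       # now queryable like a Hive table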
The rise of the Data Scientist
========================
Provide requirements for Data Lake
Proper metadata established
Catalog
Data Definition
Lineage
Quality monitoring
Know and validate data completeness
Data Science to Big Data Warehouse mapping
Full data governance requirements
Provide full process lineage
Data certification process by data stewards and business owners
Ongoing data quality monitoring that includes machine-learning algorithm enrichment and quality checks.
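A hedged sketch of what a single catalog entry covering definition, lineage and quality monitoring could look like; the field choices are assumptions, not any particular product's model:

    # one possible shape for a data-lake catalog record that keeps definition,
    # lineage and quality checks in one place
    from dataclasses import dataclass, field

    @dataclass
    class CatalogEntry:
        name: str                      # dataset name in the lake
        definition: str                # business meaning of the data
        schema: dict                   # column name -> type
        lineage: list = field(default_factory=list)        # upstream feeds / jobs
        quality_checks: list = field(default_factory=list)

    orders = CatalogEntry(
        name="orders_enriched",
        definition="Orders joined with MDM customer hierarchy",
        schema={"customer_id": "string", "amount": "double"},
        lineage=["raw.orders", "mdm.customer_hierarchy", "job:enrich_orders"],
        quality_checks=["row_count_vs_source", "amount_not_null"],
    )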
Feed Lifecycle management and data processing platform
=============================================
Apache Falcon.
OR
Oozie + retention metadata.
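If Falcon is not in place, a hedged sketch of the "Oozie + retention metadata" idea: keep the retention window as metadata and drop HDFS partitions older than it, using only the standard hdfs dfs CLI. The path, partition layout and 90-day window are assumptions:

    # enforce a retention policy on date-partitioned HDFS data
    import subprocess
    from datetime import date, timedelta

    RETENTION_DAYS = 90                                  # would live in retention metadata
    BASE_PATH = "/data/raw/clickstream"                  # partitions like dt=2015-07-01

    cutoff = date.today() - timedelta(days=RETENTION_DAYS)

    listing = subprocess.run(["hdfs", "dfs", "-ls", BASE_PATH],
                             capture_output=True, text=True, check=True)
    for line in listing.stdout.splitlines():
        parts = line.split()
        if not parts or "dt=" not in parts[-1]:
            continue
        partition_date = parts[-1].split("dt=")[-1]
        if partition_date < cutoff.isoformat():          # ISO dates compare as strings
            subprocess.run(["hdfs", "dfs", "-rm", "-r", "-skipTrash", parts[-1]],
                           check=True)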
Master Data Management
====================
Why? Needed during the data transformation stage for adding proper context to the raw big data.
Consistent policy enforcement and security
Integration with existing ecosystem
Data Governance through workflow management
Data Quality enforcement through meta data driven rules
Time variant hierarchies and attributes
Graph db - high performance, flexible and scalable
Unifying information coming from uncontrolled sources. Since the sources are uncontrolled, their hierarchies are different and depend on the context the data is coming from.
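A hedged sketch of the graph-database point: walking a customer hierarchy of arbitrary depth with the Neo4j Python driver is one Cypher pattern, where the same query is awkward in SQL. The URI, credentials, labels and relationship name are assumptions:

    # walk a customer hierarchy of arbitrary depth in a graph database
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://neo4j.example.com:7687",
                                  auth=("neo4j", "password"))

    CYPHER = """
    MATCH (child:Customer {id: $id})-[:ROLLS_UP_TO*]->(parent:Customer)
    RETURN parent.id AS parent_id, parent.name AS parent_name
    """

    with driver.session() as session:
        for record in session.run(CYPHER, id="CUST-42"):
            print(record["parent_id"], record["parent_name"])
    driver.close()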
Big Data Security
=============
Determining who sees what:
Need to secure as many data types as possible
Auto-discovery is important (see the masking sketch after the product list).
Current products
Sentry - SQL security semantics on Hive
Knox - Central auth mechanism on HDFS
Cloudera Navigator - Central security auditing
Hadoop - *NIX permission with LDAP
Dataguise - Auto discovery, masking, encryption
Datameer - The BI tool for Hadoop
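Not any vendor's actual logic, but a minimal sketch of what auto-discovery plus masking means in practice: scan values for PII-looking patterns and replace them with a salted hash before exposing them. The regexes, salt and file name are assumptions:

    # naive PII discovery and masking, for illustration only
    import csv, hashlib, re

    PII_PATTERNS = {
        "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
        "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    }

    def mask(value, salt="demo-salt"):
        # irreversible masking via a salted hash (scheme is an assumption)
        return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

    def mask_row(row):
        return [mask(v) if any(p.search(v) for p in PII_PATTERNS.values()) else v
                for v in row]

    with open("customers.csv", newline="") as f:          # placeholder file
        for row in csv.reader(f):
            print(mask_row(row))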
Hadoop Metadata
===============
Products like Loom
OSS alternatives include HCatalog.
Maps to relational schema
Devs don’t have to worry about data format and storage
Can use Superluminate to get started.
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-dev-create-metastore-outside.html
ILM
====
Twitter DAL (not open sourced)
Apache Falcon
Data Quality and monitoring
======================
Continuous monitoring is needed
Accuracy and completeness of data
All data in the BDW must have monitoring
Basic stats - Source to target counts
Error Events - Did we capture any errors while processing
Tolerance - Is the metric within the tolerance limit? What is the standard deviation from the calculated ideal?
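A hedged sketch of these three checks as plain functions; the thresholds and sample numbers are assumptions:

    # the three basic monitoring checks: counts, error events, tolerance
    from statistics import mean, stdev

    def counts_match(source_count, target_count):
        # Basic stats: source-to-target row counts should reconcile exactly.
        return source_count == target_count

    def had_errors(error_events):
        # Error events: anything captured during processing fails the run.
        return len(error_events) > 0

    def within_tolerance(metric, history, max_sigma=3.0):
        # Tolerance: flag a metric more than max_sigma standard deviations
        # away from its historical mean.
        return abs(metric - mean(history)) <= max_sigma * stdev(history)

    print(counts_match(10_000, 10_000))                               # True
    print(had_errors([]))                                             # False
    print(within_tolerance(9_950, [10_050, 9_980, 10_010, 9_940]))    # True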
Error Event Fact Table
Part of the data cleansing system is a set of diagnostic filters known as quality screens. Each implements a test in the data flow that, if it fails, records an error in the Error Event Schema. Quality screens are divided into three categories:
Column screens. Testing the individual column, e.g. for unexpected values like NULL values; non-numeric values that should be numeric; out of range values; etc.
Structure screens. These are used to test the integrity of different relationships between columns (typically foreign/primary keys) in the same or different tables. They are also used for testing that a group of columns is valid according to some structural definition it should adhere to.
Business rule screens. The most complex of the three tests. They test whether data, possibly across multiple tables, follow specific business rules. An example could be that if a customer is marked as a certain type of customer, the business rules that define this kind of customer should be adhered to.
When a quality screen records an error, it can either stop the dataflow process, send the faulty data somewhere other than the target system, or tag the data. The last option is considered the best solution, because the first requires someone to manually deal with the issue each time it occurs, and the second implies that data are missing from the target system (an integrity problem) and it is often unclear what should happen to that data.
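A minimal sketch of the three screen types using the tagging option described above, so the flow is not stopped and no data goes missing; the column names and rules are assumptions:

    # column, structure and business-rule screens that tag offending records
    def column_screen(row):
        # Column screen: amount must be present and numeric.
        try:
            float(row["amount"])
            return None
        except (KeyError, TypeError, ValueError):
            return "column:amount_not_numeric"

    def structure_screen(row, known_customer_ids):
        # Structure screen: the foreign key must resolve to the customer table.
        if row.get("customer_id") in known_customer_ids:
            return None
        return "structure:orphan_customer_id"

    def business_rule_screen(row):
        # Business-rule screen: 'premium' customers must have a credit limit.
        if row.get("segment") == "premium" and not row.get("credit_limit"):
            return "business:premium_without_credit_limit"
        return None

    def tag(row, known_customer_ids):
        errors = [e for e in (column_screen(row),
                              structure_screen(row, known_customer_ids),
                              business_rule_screen(row)) if e]
        row["dq_tags"] = errors          # tagged, not dropped, not halting the flow
        return row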
Error Event Schema
This schema is the place where all error events thrown by quality screens are recorded. It consists of an Error Event Fact table with foreign keys to three dimension tables that represent the date (when), the batch job (where) and the screen (which screen produced the error). It also holds information about exactly when the error occurred and the severity of the error. In addition there is an Error Event Detail Fact table, with a foreign key to the main table, that contains detailed information about the table, record and field in which the error occurred, and the error condition.
Use HBase for metadata and error events
Oozie for orchestration
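A hedged sketch of writing error events to HBase with the HappyBase client; the Thrift host, table name, column family and row-key layout are assumptions:

    # write one error-event record per failed screen
    # assumes an existing 'error_events' table with column family 'e'
    import time
    import happybase

    connection = happybase.Connection("hbase-thrift.example.com")  # Thrift gateway
    table = connection.table("error_events")

    def record_error(batch_id, screen, severity, table_name, record_id, detail):
        row_key = f"{batch_id}:{int(time.time() * 1000)}"
        table.put(row_key, {
            b"e:screen":    screen.encode(),
            b"e:severity":  severity.encode(),
            b"e:table":     table_name.encode(),
            b"e:record_id": record_id.encode(),
            b"e:detail":    detail.encode(),
        })

    record_error("batch-20150701", "column:amount_not_numeric", "WARN",
                 "orders", "ORD-1001", "amount='N/A'")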