http://aws.amazon.com/big-data/
1. MDM
2. Data Governance
3. MetaData
4. Data Quality
5. Data Security
Map - Identify and filter data.
Reduce - Compress the filtered data.
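The Map/Reduce split above can be sketched in plain Python (a toy illustration, not Hadoop API code; the log-line format and the error-counting task are assumptions):

```python
from itertools import groupby
from operator import itemgetter

# Map: identify and filter - keep only ERROR records, emit (key, 1) pairs.
def map_phase(lines):
    for line in lines:
        level, msg = line.split(" ", 1)
        if level == "ERROR":
            yield (msg, 1)

# Reduce: compress the filtered data - collapse each key's pairs to a count.
def reduce_phase(pairs):
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (key, sum(count for _, count in group))

logs = ["INFO ok", "ERROR disk full", "ERROR disk full", "ERROR timeout"]
print(dict(reduce_phase(map_phase(logs))))  # -> {'disk full': 2, 'timeout': 1}
```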
Mahout - Machine Learning
MLlib, Pandas, R.
Pig (query language) and Hive (BI) - the new big data ETL. Hive is the data warehouse of big data.
In order to add context we need reference data. We send the answers back to the warehouse. All BI tools need to write SQL.
Cloudera Impala speaks SQL, HAWQ is SQL and C* is CQL. Datameer can go against native HDFS.
Tools:
Kafka, Mahout, Hive, Pig, Sqoop, ZooKeeper, Storm, Spark
NoSQL :
======
Document : MongoDB, CouchDB
Key Value : Redis, Riak
Columnar : C*, HBase
Graph : Neo4j, Titan, OrientDB
Languages: Scala, Java, Python, Ruby, SciPy, Pandas, R
Challenges : Volume, Variety, Veracity, Velocity
===================================
Volume : Huge. Using Storm, C* and Hadoop.
Variety : Structured and unstructured. How to govern?
Veracity : How do you certify the quality of sentiment analysis?
Velocity : A data warehouse must be complete; that is one of the data quality tests. If you receive data in a stream, how do you apply governance, since it's streaming and we don't know when it's complete?
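The velocity problem above - never seeing an explicit "end of batch" - is typically handled with per-window checks instead of whole-load checks. A minimal sketch; the window size and expected counts are illustrative assumptions:

```python
from collections import defaultdict

WINDOW = 60  # seconds per window (an illustrative choice)

def window_counts(events):
    """events: iterable of (epoch_seconds, payload) tuples."""
    counts = defaultdict(int)
    for ts, _ in events:
        counts[ts // WINDOW] += 1
    return counts

def completeness(counts, expected_per_window):
    # Ratio of observed to expected records per window - the streaming
    # stand-in for the warehouse-style "is the load complete?" test.
    return {w: min(1.0, c / expected_per_window) for w, c in counts.items()}

events = [(0, "a"), (10, "b"), (70, "c")]
print(completeness(window_counts(events), expected_per_window=2))  # -> {0: 1.0, 1: 0.5}
```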
Data Governance
=============
When we say that data is governed, there are 7 things that need to happen:
1. People: Data Stewards and an Enterprise Data Council.
2. Catalog: Provide users a catalog of what data is available and what it means.
3. Quality: ETL can't be done at rest.
Rules for Data Ingestion
==================
With Big Data we tend to ingest everything; in the DW world we run business rules before ingesting.
Rules: A. Don't dump data before you govern it.
B. Harvest information from old systems.
Making data right is still immature.
Existing Data Governance needs to change. Move from Data Science to Data Governance.
Everything must be automated.
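Rule A ("don't dump data before you govern") can be automated as an ingestion gate. A minimal sketch; the registry structure and field names are hypothetical:

```python
# Hypothetical governance registry: a feed may only land if it has a
# registered schema and an assigned data steward.
REGISTRY = {
    "orders": {"schema": ["order_id", "amount"], "steward": "jane"},
}

def ingest(feed_name, records):
    entry = REGISTRY.get(feed_name)
    if not entry or not entry.get("schema") or not entry.get("steward"):
        raise ValueError(f"feed '{feed_name}' is ungoverned - refusing to ingest")
    # Governance metadata exists: land the records (here, just return them).
    return records

print(len(ingest("orders", [{"order_id": 1, "amount": 9.5}])))  # -> 1
```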
Highlights
=========
1. Org and process changes for doing Data Governance right.
2. MDM: Managing hierarchies; graph databases are ideal for managing hierarchical data.
3. MetaData: Catalog. There is no silver bullet.
4. Data Quality & Monitoring
5. Information Security
6. Information Lifecycle
Big Data Governance - Truth
======================
Full Data Governance can only be applied to structured data.
Data must have a known schema. (This can include materialized endpoints, e.g. files or tables, or projections such as a Hive table.)
Governed structured data must have:
A known schema for metadata.
A known and certified lineage.
Monitored, quality-tested, managed processes for ingestion and transformation.
Governed usage -> data isn't just for enterprise BI tools.
Hadoop contains mostly semi-structured / structured data with a definable schema, rather than completely unstructured data.
Even for unstructured data, structure is applied in just about every case before analysis is done.
The rise of the Data Scientist
========================
Provide requirements for the Data Lake:
Proper metadata established:
Catalog
Data definition
Lineage
Quality monitoring
Know and validate data completeness
Data Science to Big Data Warehouse mapping
Full data governance requirements:
Provide full process lineage
Data certification process by data stewards and business owners
Ongoing data quality monitoring that includes machine learning algorithm enrichment and quality checks.
Feed lifecycle management and data processing platform
=============================================
Apache Falcon.
OR
Oozie + retention metadata.
Master Data Management
====================
Why? Needed during the data transformation stage to add proper context to the raw big data.
Consistent policy enforcement and security
Integration with the existing ecosystem
Data Governance through workflow management
Data Quality enforcement through metadata-driven rules
Time-variant hierarchies and attributes
Graph DB - high performance, flexible and scalable
Unifying information coming from uncontrolled sources. Since the sources are uncontrolled, their hierarchies differ based on the context the data is coming from.
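The point about graph databases and time-variant hierarchies can be sketched with date-stamped parent edges, so the same node rolls up differently depending on the as-of date. A toy in-memory model, not a real MDM schema; the names and dates are illustrative:

```python
from datetime import date

# edges: child -> list of (parent, valid_from, valid_to)
EDGES = {
    "Store-12": [("Region-East", date(2013, 1, 1), date(2014, 6, 30)),
                 ("Region-North", date(2014, 7, 1), date(9999, 12, 31))],
    "Region-East": [("Company", date(2013, 1, 1), date(9999, 12, 31))],
    "Region-North": [("Company", date(2013, 1, 1), date(9999, 12, 31))],
}

def rollup(node, as_of):
    """Walk the parent edges valid on `as_of`, returning the path to the root."""
    path = [node]
    while node in EDGES:
        node = next(p for p, lo, hi in EDGES[node] if lo <= as_of <= hi)
        path.append(node)
    return path

print(rollup("Store-12", date(2014, 1, 1)))  # -> ['Store-12', 'Region-East', 'Company']
```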
Big Data Security
=============
Determining who sees what:
Need to secure as many data types as possible
Auto-discovery is important
Current products:
Sentry - SQL security semantics on Hive
Knox - Central auth mechanism on HDFS
Cloudera Navigator - Central security auditing
Hadoop - *NIX permissions with LDAP
Dataguise - Auto-discovery, masking, encryption
Datameer - The BI tool for Hadoop
Hadoop Metadata
===============
Products like Loom
OSS alternatives include HCatalog:
Maps to a relational schema
Devs don't have to worry about data format and storage
Can use Superluminate to get started.
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-dev-create-metastore-outside.html
ILM
====
Twitter DAL (not open sourced)
Apache Falcon
Data Quality and Monitoring
======================
Continuous monitoring is needed for the accuracy and completeness of data.
All data in the BDW must have monitoring:
Basic stats - Source-to-target counts
Error events - Did we capture any errors while processing?
Tolerance - Is the metric within the tolerance limit? What is the standard deviation from the calculated ideal?
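The basic stats above reduce to two checks: count reconciliation and a deviation-based tolerance test against historical loads. A minimal sketch; the 2-standard-deviation threshold is an illustrative choice:

```python
import statistics

def counts_match(source_count, target_count):
    # Source-to-target count reconciliation.
    return source_count == target_count

def within_tolerance(todays_count, history, n_sigma=2.0):
    # Flag today's metric if it deviates from the historical mean
    # by more than n_sigma standard deviations.
    mean = statistics.mean(history)
    sigma = statistics.stdev(history)
    return abs(todays_count - mean) <= n_sigma * sigma

history = [1000, 1020, 990, 1010, 1005]
print(counts_match(1005, 1005), within_tolerance(640, history))  # -> True False
```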
Error Event Fact Table
Part of the data cleansing system is a set of diagnostic filters known as quality screens. Each implements a test in the data flow that, if it fails, records an error in the Error Event Schema. Quality screens are divided into three categories:
Column screens. Test an individual column, e.g. for unexpected values like NULLs, non-numeric values that should be numeric, or out-of-range values.
Structure screens. These test the integrity of relationships between columns (typically foreign/primary keys) in the same or different tables. They are also used to test that a group of columns is valid according to some structural definition it should adhere to.
Business rule screens. The most complex of the three. They test whether data, possibly across multiple tables, follows specific business rules. An example: if a customer is marked as a certain type of customer, the business rules that define that kind of customer should be adhered to.
When a quality screen records an error, it can either stop the dataflow process, divert the faulty data somewhere other than the target system, or tag the data. The last option is considered the best solution, because the first requires someone to deal with the issue manually each time it occurs, and the second implies that data are missing from the target system (integrity) and it is often unclear what should happen to that data.
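A toy sketch of the three screen categories, tagging rows rather than halting the flow (the option the text prefers). The column names and rules are illustrative assumptions:

```python
def column_screen(row):
    # Column screen: unexpected NULLs / non-numeric values in one column.
    return row.get("amount") is not None and isinstance(row["amount"], (int, float))

def structure_screen(row, known_customers):
    # Structure screen: foreign-key integrity against another table.
    return row.get("customer_id") in known_customers

def business_rule_screen(row):
    # Business rule screen (hypothetical rule): "premium" customers
    # must have an amount of at least 100.
    return row.get("tier") != "premium" or row["amount"] >= 100

def run_screens(rows, known_customers):
    for row in rows:
        errors = []
        # Screens run in order of complexity; a row failing an earlier
        # screen skips the later ones.
        if not column_screen(row):
            errors.append("column")
        elif not structure_screen(row, known_customers):
            errors.append("structure")
        elif not business_rule_screen(row):
            errors.append("business_rule")
        row["errors"] = errors  # tag the data; don't drop it or halt
        yield row

rows = [{"customer_id": 1, "tier": "premium", "amount": 50}]
print(next(run_screens(rows, known_customers={1}))["errors"])  # -> ['business_rule']
```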
Error Event Schema
This schema is where all error events thrown by quality screens are recorded. It consists of an Error Event Fact table with foreign keys to three dimension tables representing date (when), batch job (where) and screen (which test produced the error). It also holds information about exactly when the error occurred and its severity. In addition, there is an Error Event Detail Fact table with a foreign key to the main table, containing detailed information about the table, record and field in which the error occurred, and the error condition.
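The schema described above can be sketched as dataclasses; the field names approximate the design described in the text and are not a fixed standard:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ErrorEventFact:
    date_key: str        # when (date dimension)
    batch_key: str       # where (batch job dimension)
    screen_key: str      # which screen produced the error (screen dimension)
    severity: str
    occurred_at: datetime = field(default_factory=datetime.now)

@dataclass
class ErrorEventDetailFact:
    event: ErrorEventFact  # foreign key to the main fact table
    table_name: str        # where exactly the error occurred
    record_id: str
    field_name: str
    error_condition: str

event = ErrorEventFact("2015-08-29", "nightly-load", "column_screen", "high")
detail = ErrorEventDetailFact(event, "orders", "42", "amount", "NULL value")
print(detail.event.screen_key)  # -> column_screen
```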
Use HBase for metadata and error events.
Oozie for orchestration.