Data Lake

Data Lake Trilogy

[1] Diving in the data lake

  • introduce what a data lake (DL) is
  • explain why it exists at all and what the benefits are
  • components?
  • data swamp - i.e. what not to do
  • a short summary

https://github.com/VirtusLab/blog/pull/35/files#diff-2ce16658621b8e036b13d87703c87788

[2] Hadoop Legacy

  • point out what the problems are
  • present a different approach based on Pachyderm (its components and how they differ from Hadoop)
  • add conclusions:
    • that Hadoop has problems and tries to solve them
    • that Pachyderm solves some of them (better architecture), but not necessarily all of them (yet)

[3] Third blog post?

Focus on Pachyderm and show whether it really can be the future. Some example of ML in Pachyderm on a Raspberry Pi cluster - for this I would have to sit down after work.

  • a Raspberry Pi cluster
  • Kubernetes - in progress
  • Pachyderm
  • an example in Pachyderm
  • conclusions

[1] Diving in the data lake

https://github.com/VirtusLab/blog/pull/35/files#diff-2ce16658621b8e036b13d87703c87788


[2] Hadoop legacy pitfalls

http://blog.takipi.com/apache-spark-5-pitfalls-you-must-solve-before-changing-your-architecture/
http://blog.cloudera.com/wp-content/uploads/2011/03/ten_common_hadoopable_problems_final.pdf
http://www.j2eebrain.com/java-J2ee-hadoop-advantages-and-disadvantages.html

In the previous blog post I explained the basic concepts of a data lake. With that background I defined some core problems that can occur in a data lake and gave some hints on how to avoid them. Most of these pitfalls are caused by the very nature of a data lake. Unfortunately, current Hadoop distributions can't resolve them. Hadoop has a bad reputation for being extremely complex.

This time I would like to focus on those pitfalls, and especially on the problems of the Hadoop environment. Maybe they can be fixed?

Architecture

Hadoop has an irreparably fractured ecosystem. There are dozens of services that partially overlap in their responsibilities. For example, for security there are Sentry and Ranger; for processing management there are NiFi and Falcon. One solution may be to select an existing distribution. There are a couple of them, e.g. HDP, CDH or MEP. Each one is different, because the companies around Hadoop have their own vision. Selecting the right distribution can be a real challenge, especially as each of them embeds different Hadoop components (for example Cloudera's Impala in CDH) and configuration managers like Ambari. Everyone wants to do it their way, and you don't see much cooperation between them. Even Facebook, probably the biggest Hadoop deployment in the world, forked Hadoop seven years ago and has kept it closed source 1.

Hadoop has been on the market for a long time, yet some problems have still not been solved. This is because the Hadoop architecture, and primarily HDFS, does not allow them to be solved. For example, HDFS fits batch jobs without compromising on data locality, but nowadays streaming is equally important. There are three major groups of problems: security, versioning and lineage (metadata).

Security

  • Lack of authorization for data and processes, e.g. if you run a Spark job and that job uses HDFS, the job runs as the user who triggered it, but the resources are used as the yarn user. It is not possible to say: I am data scientist X, I want to run a job with my identity X and use resources with that identity. The sketch below illustrates the mismatch.
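
A minimal sketch of the mismatch (my own illustration, assuming PySpark on a non-Kerberized YARN cluster; the app name and numbers are made up):

```python
# Compare the identity that submitted the job with the identity
# the executor processes actually run as on the cluster.
import getpass

from pyspark import SparkContext

sc = SparkContext(appName="identity-check")

driver_user = getpass.getuser()  # the data scientist who submitted the job
executor_users = (
    sc.parallelize(range(4), 4)
      .map(lambda _: getpass.getuser())  # the OS user inside each executor
      .distinct()
      .collect()
)

# On a non-Kerberized cluster the executors typically report "yarn",
# not the submitting user's identity.
print("driver runs as: {}, executors run as: {}".format(driver_user, executor_users))
```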

Versioning

  • No versioned data and processes. In a data lake you will run lots of pipelines for data ingestion and wrangling. They will be developed by more than one developer. If you don't trace changes in a pipeline, bug hunting becomes extremely hard. It is not even possible to disable a particular pipeline and read its logs.

  • Lack of data and process collaboration without tools like Hive (a metadata layer). If we can't connect a pipeline with a version without metadata, then guessing how the current pipeline works is extremely hard. Of course you can give each pipeline deployment a unique name and then check out a particular one, but this is not the best way to deal with errors in production. Even if you find the bug and fix it, you still need to revert the corrupted data, and without metadata you can't do that. A simple manual workaround is sketched below.
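
Without built-in versioning, about the best you can do is stamp outputs by hand. A rough sketch of such a workaround (my own illustration, not any Hadoop API; the paths and file names are hypothetical):

```python
# Stamp every pipeline run's output directory with the pipeline's git SHA,
# so corrupted data can at least be traced back to the code that produced it.
import json
import subprocess
import time

def pipeline_version():
    # SHA of the pipeline code that is about to run
    return subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()

def write_run_metadata(output_dir):
    metadata = {
        "pipeline_version": pipeline_version(),
        "run_timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    with open(output_dir + "/_RUN_METADATA.json", "w") as f:
        json.dump(metadata, f, indent=2)

write_run_metadata("/data/output/run-001")  # hypothetical output path
```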

Lineage

  • No versioning for data lineage (Atlas metadata) - even with Atlas we need to integrate it with other applications like Spark. Right now Atlas provides only limited integration with Hadoop applications; the most popular computation model, Spark, is not supported.
  • the problem of finding a use case
  • the skills gap came out as the biggest hurdle to adoption

Operational

  • there are problems around high availability. In particular, Hadoop has a single NameNode, which stores the metadata about the Hadoop cluster. Unfortunately, there is only one of them, which means the NameNode is a single point of failure for the entire environment.

  • there is a problem with creating a Hadoop cluster:

    • security - Kerberos is hard and requires a lot of integration
    • Kerberos and encryption need to be highly available
  • maintaining the cluster itself is not simple either. Updating takes ages (at Tesco it took a few days). You can technically keep working during that time, but you don't have full power, and that costs money.

  • in Ambari you sometimes have to restart a lot of services after any change to the config of just one of them.

  • I don't know how backups work, or whether anything like that even exists. But if I had a data lake I would probably want at least a minimal set of data backed up, though that could be wired up as just another form of consumption from the DL.

Management

  • a long-running integration project
  • investment in staff - you can't spend too much on retraining people, because you need to scale quickly to see the profits
  • there is a huge difference between building a DIY cluster and a production environment:
    • availability architecture
    • security around the cluster
    • these things take a lot of time to build (and should be estimated well)
  • make sure you understand the hardware and networking requirements (the most important component)
  • the learning curve for a Hadoop cluster is steep
  • the learning curve for security in a Hadoop cluster is even steeper

How to live?

Kylo

NiFi

Pachyderm

Core concepts

Conclusion

1 “Let’s build a modern Hadoop” Feb 10, 2015, https://medium.com/pachyderm-data/lets-build-a-modern-hadoop-4fc160f8d74f


Next: CephFS, GlusterFS ...

Computation model

  • Here we can say that most workloads run on MapReduce
  • it is hard to integrate this with Atlas

All data lakes have some computation model for data processing (ingestion and wrangling)... The most popular is MapReduce; a minimal sketch follows.
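
To make the computation model concrete, here is the classic word count as a MapReduce job in the Hadoop Streaming style (a generic sketch, not taken from any of the linked posts):

```python
# Word count for Hadoop Streaming. Both phases are in one file for brevity;
# select with argv, e.g. -mapper "python wc.py map" -reducer "python wc.py reduce".
import sys
from itertools import groupby

def mapper():
    # Emit (word, 1) pairs, tab-separated, one per line.
    for line in sys.stdin:
        for word in line.split():
            print("{}\t1".format(word))

def reducer():
    # Hadoop sorts mapper output by key before the reduce phase,
    # so identical words arrive as consecutive lines.
    pairs = (line.rstrip("\n").split("\t") for line in sys.stdin if line.strip())
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print("{}\t{}".format(word, sum(int(count) for _, count in group)))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```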

Next: Stream, Stream, Stream ...

Resources management

All data lakes have some resource management layer to split resources between computations... The most popular is YARN; a sketch of a resource request follows.
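
A small sketch of how a computation asks YARN for resources, here through standard Spark configuration keys (the values themselves are made up):

```python
# Declare this job's resource needs; YARN grants containers accordingly.
from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("yarn-resource-demo")
    .setMaster("yarn")
    .set("spark.executor.instances", "4")  # how many containers YARN should grant
    .set("spark.executor.memory", "2g")    # memory per container
    .set("spark.executor.cores", "2")      # CPU cores per container
)
sc = SparkContext(conf=conf)
```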

Next: Mesosphere, Kubernetes ...

Authorization and authentication policy

Next: Ranger, ?

Data and metadata

Ingestion

One of the basic principles of data science is that the more data you get, the better your data model will ultimately be.

  • ingestion statistics - to know the health of the data (see the sketch after this list)
  • online ingestion, e.g. a stream
  • batch, e.g. reading from a file
  • a strategy for reading from the source
  • a merging strategy in the destination
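
A rough sketch of what ingestion statistics could look like (my illustration; the file path and the chosen metrics are hypothetical):

```python
# Compute basic health metrics on each batch before it lands in the lake.
import pandas as pd

def ingestion_stats(path):
    df = pd.read_csv(path)
    return {
        "rows": len(df),
        "null_ratio_per_column": df.isnull().mean().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }

print(ingestion_stats("/ingest/incoming/customers.csv"))  # hypothetical path
```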

Playing with data

  • reviewing the data structure
  • profiling the data (statistics)
  • reading the business data (source, pipeline)

Transformation

We want to transform data and keep lineage for it. Data transformation requires collaboration on pipeline development, and for now in Hadoop this is complicated. A minimal lineage-aware sketch follows the list below.

  • cleansing
  • joining
  • ...
  • ML
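
A minimal sketch of a lineage-aware transformation step (my own illustration, not a Hadoop API; the paths are hypothetical): every step records what went in, what came out, and how many rows it touched.

```python
# Run a transformation and append a lineage record for it.
import json
import pandas as pd

def cleanse(df):
    return df.dropna().drop_duplicates()

def run_step(step_name, func, input_path, output_path, lineage_log):
    df_in = pd.read_csv(input_path)
    df_out = func(df_in)
    df_out.to_csv(output_path, index=False)
    with open(lineage_log, "a") as log:
        log.write(json.dumps({
            "step": step_name,
            "input": input_path,
            "output": output_path,
            "rows_in": len(df_in),
            "rows_out": len(df_out),
        }) + "\n")

run_step("cleansing", cleanse,
         "/lake/raw/events.csv", "/lake/clean/events.csv",  # hypothetical paths
         "/lake/lineage.jsonl")
```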

Common actions (operational)

Monitoring

  • pipeline state
  • pipeline output
  • lineage
  • data provenance

SLA

  • a warning when something is wrong, or some other failure-handling strategy

New approaches

HDP

  • This one is optional
  • Nice integration with an existing deployment
  • Does not solve the problem of collaboration on data

Kylo

https://confluence.virtuslab.com/display/SHER/Kylo

Kylo is data lake management software that originated at Teradata. It is open-sourced under Think Big and built on Apache Hadoop and Spark. It is responsible for:

  • Data Ingestion - a mix of streaming and batch processing via NiFi
  • Prepare - helps to cleanse data in order to improve its quality and to support data governance
  • Discovery - analysts and data scientists can search and find what data is available to them

Components

https://confluence.virtuslab.com/display/SHER/NiFi

Put simply, NiFi was built to automate the flow of data between systems. While the term dataflow is used in a variety of contexts, we use it here to mean the automated and managed flow of information between systems. This problem space has been around ever since enterprises had more than one system, where some of the systems created data and some of the systems consumed data. The problems and solution patterns that emerged have been discussed and articulated extensively. A comprehensive and readily consumed form is found in the Enterprise Integration Patterns.

[3] Data lake on Kubernetes

Pachyderm

Resolves:

  • any tool a data scientist needs can be used (Docker)
  • distributed processing (Kubernetes)
  • version control for data
    • Reproducibility
    • Instant Revert
    • Data Lineage

https://www.youtube.com/watch?v=LamKVhe2RSM

Like a time machine for the data lake: you have versioning of both data and process:

  • a commit in the data repository
  • a commit in the Docker image

Collaboration not only on the process (the source) but also on the data: one data scientist can build a data set and another can fork it. Merging is one git concept that doesn't really make sense in the context of data - if we have conflicts, no human is going to be able to resolve merge conflicts for terabytes of data.

The processing app is a Docker image (so you can use any language), like microservices. A Docker image with some processing application (a microservice) can be replayed with the same data from the repository. We can create n containers from one image - easy scaling.

It doesn't take the large team with specific expertise that Hadoop requires to be productive. Containers allow you to build data-processing algorithms in any programming language.

A new position in IT: Data Scientist + DevOps - like SRE at Google, but for data science - https://landing.google.com/sre/

Because data scientists know what the data looks like, they can say how the processing should be run (cheaper and with fewer failures).

Components

Pipelines

A system for stringing containers together and doing data analysis with them; a hedged example of a pipeline spec is sketched below.
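
A sketch of what a Pachyderm pipeline spec looks like (field names follow the Pachyderm 1.x docs, but double-check against the exact version you run; the repo, image and command here are made up). The generated JSON would be submitted with pachctl create-pipeline:

```python
# Build a Pachyderm pipeline spec as JSON. The input repo is mounted at
# /pfs/<repo> inside the container, and results go to /pfs/out.
import json

spec = {
    "pipeline": {"name": "wordcount"},
    "transform": {
        "image": "example/wordcount:latest",  # any Docker image, any language
        "cmd": ["python", "/app/count.py", "/pfs/text", "/pfs/out"],
    },
    "parallelism_spec": {"constant": 2},  # scale by running 2 containers
    "input": {"atom": {"repo": "text", "glob": "/*"}},
}

with open("wordcount.json", "w") as f:
    json.dump(spec, f, indent=2)
```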

Pachyderm File System (PFS)

A distributed file system that draws inspiration from git, providing version control over all the data. It's the core data layer that delivers data to containers. Based on https://github.com/ceph/ceph. Provides historical snapshots (a time machine).

Hadoop vs Pachyderm


A comparison of HDFS and CephFS: http://iopscience.iop.org/article/10.1088/1742-6596/513/4/042014/pdf


Governance

| Governance | HDP | Kylo | Pachyderm |
|------------|-----|------|-----------|
| data       | 1   |      | 2         |
| processing |     |      |           |

Lineage

| Lineage    | HDP | Kylo | Pachyderm |
|------------|-----|------|-----------|
| data       | 3   |      |           |
| processing |     | 4    |           |

1 only when you access the data from an app integrated with Ranger

2 partially supported, e.g. if we have a regex in the input then we have the regex

3 only supported with Atlas or manually

4 supported in NiFi


Modern Hadoop

HDP

https://www.slideshare.net/hortonworks/hdp-security-overview
https://confluence.virtuslab.com/display/SHER/Hortonworks+Data+Platform

Components

  • HDFS
  • YARN

https://hortonworks.com/blog/docker-kubernetes-apache-hadoop-yarn/

What next?

Spark on Mesosphere

Spark is not a new approach; it is only for computation.

Spark vs Pachyderm

Pros
  • Easy to set up and use - Unix file concepts combined with git
Cons
  • Can't use YARN and other Hadoop stuff; you need to create a new cluster or change the software on an existing one.

Other

Integration with Mesos and Spark is in progress.
