Data Lake

Data Lake Trilogy

[1] Diving in the data lake

  • introduce what a data lake (DL) is
  • explain why it exists at all and what the benefits are
  • components?
  • data swamp - i.e. what not to do
  • a short summary

https://github.com/VirtusLab/blog/pull/35/files#diff-2ce16658621b8e036b13d87703c87788

[2] Hadoop Legacy

  • point out what the problems are
  • present a different approach based on Pachyderm (its components and how they differ from Hadoop)
  • add conclusions:
    • that Hadoop has problems and tries to solve them
    • that Pachyderm solves some of them (better architecture), but not necessarily all of them (yet)

[3] Third blog post?

Focus on Pachyderm and show whether it really can be the future. Some example of ML in Pachyderm on a Raspberry Pi cluster - for this I would have to sit down after work.

  • a Raspberry Pi cluster
  • Kubernetes - in progress
  • Pachyderm
  • an example in Pachyderm
  • conclusions

[1] Diving in the data lake

https://github.com/VirtusLab/blog/pull/35/files#diff-2ce16658621b8e036b13d87703c87788


[2] Hadoop legacy pitfalls

http://blog.takipi.com/apache-spark-5-pitfalls-you-must-solve-before-changing-your-architecture/
http://blog.cloudera.com/wp-content/uploads/2011/03/ten_common_hadoopable_problems_final.pdf
http://www.j2eebrain.com/java-J2ee-hadoop-advantages-and-disadvantages.html

In the previous blog post I explained the basic concepts of a data lake. With that background I defined some core problems that can occur in a data lake and gave some hints on how to avoid them. Most of these pitfalls are caused by the very nature of a data lake. Unfortunately, current Hadoop distributions can't resolve them. Hadoop has a bad reputation for being extremely complex.

This time I would like to focus on those pitfalls, and especially on the problems of the Hadoop environment. Maybe they can be fixed?

Architecture

Hadoop has an irreparably fractured ecosystem. There are dozens of services that partially overlap in their responsibilities. For example, for security there are Sentry and Ranger; for processing management there are NiFi and Falcon. One solution may be to select an existing distribution. There are a couple of them, e.g. HDP, CDH or MEP. Each one is different, because the companies around Hadoop have their own vision. Selecting the right distribution can be a real challenge, especially as each of them embeds different Hadoop components (for example Cloudera's Impala in CDH) and configuration managers like Ambari. Everyone wants to do it their way, and you don't see much cooperation between them. Even Facebook, probably the biggest Hadoop deployment in the world, forked Hadoop seven years ago and has kept it closed source 1.

Hadoop has been on the market for a long time, yet some problems have still not been solved. This is because the Hadoop architecture, and primarily HDFS, does not allow them to be solved. For example, HDFS fits batch jobs without compromising on data locality, but nowadays streaming is equally important. There are three major groups of problems: security, versioning and lineage (metadata).

Security

  • Lack of authorization for data and processes, e.g. if you run a Spark job and that job uses HDFS, the job runs as the user who triggered it, but the resources are used as the yarn user. It is not possible to say: I am data scientist X, I want to run a job with my identity X and use resources with that identity. The sketch below illustrates the mismatch.
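
A minimal sketch of the mismatch (my own illustration, assuming PySpark on a non-Kerberized YARN cluster; the app name and numbers are made up):

```python
# Compare the identity that submitted the job with the identity
# the executor processes actually run as on the cluster.
import getpass

from pyspark import SparkContext

sc = SparkContext(appName="identity-check")

driver_user = getpass.getuser()  # the data scientist who submitted the job
executor_users = (
    sc.parallelize(range(4), 4)
      .map(lambda _: getpass.getuser())  # the OS user inside each executor
      .distinct()
      .collect()
)

# On a non-Kerberized cluster the executors typically report "yarn",
# not the submitting user's identity.
print("driver runs as: {}, executors run as: {}".format(driver_user, executor_users))
```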

Versioning

  • No versioned data and processes. In a data lake you will run lots of pipelines for data ingestion and wrangling. They will be developed by more than one developer. If you don't trace changes in a pipeline, bug hunting becomes extremely hard. It is not even possible to disable a particular pipeline and read its logs.

  • Lack of data and process collaboration without tools like Hive (a metadata layer). If we can't connect a pipeline with a version without metadata, then guessing how the current pipeline works is extremely hard. Of course you can give each pipeline deployment a unique name and then check out a particular one, but this is not the best way to deal with errors in production. Even if you find the bug and fix it, you still need to revert the corrupted data, and without metadata you can't do that. A simple manual workaround is sketched below.
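
Without built-in versioning, about the best you can do is stamp outputs by hand. A rough sketch of such a workaround (my own illustration, not any Hadoop API; the paths and file names are hypothetical):

```python
# Stamp every pipeline run's output directory with the pipeline's git SHA,
# so corrupted data can at least be traced back to the code that produced it.
import json
import subprocess
import time

def pipeline_version():
    # SHA of the pipeline code that is about to run
    return subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()

def write_run_metadata(output_dir):
    metadata = {
        "pipeline_version": pipeline_version(),
        "run_timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    with open(output_dir + "/_RUN_METADATA.json", "w") as f:
        json.dump(metadata, f, indent=2)

write_run_metadata("/data/output/run-001")  # hypothetical output path
```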

Lineage

  • No versioning for data lineage (Atlas metadata) - even with Atlas we need to integrate it with other applications like Spark. Right now Atlas provides only limited integration with Hadoop applications; the most popular computation model, Spark, is not supported.
  • the problem of finding a use case
  • the skills gap came out as the biggest hurdle to adoption

Operational

  • there are problems around high availability. In particular, Hadoop has a single NameNode, which stores the metadata about the Hadoop cluster. Unfortunately, there is only one of them, which means the NameNode is a single point of failure for the entire environment.

  • there is a problem with creating a Hadoop cluster:

    • security - Kerberos is hard and requires a lot of integration
    • Kerberos and encryption need to be highly available
  • maintaining the cluster itself is not simple either. Updating takes ages (at Tesco it took a few days). You can technically keep working during that time, but you don't have full power, and that costs money.

  • in Ambari you sometimes have to restart a lot of services after any change to the config of just one of them.

  • I don't know how backups work, or whether anything like that even exists. But if I had a data lake I would probably want at least a minimal set of data backed up, though that could be wired up as just another form of consumption from the DL.

Management

  • a long-running integration project
  • investment in staff - you can't spend too much on retraining people, because you need to scale quickly to see the profits
  • there is a huge difference between building a DIY cluster and a production environment:
    • availability architecture
    • security around the cluster
    • these things take a lot of time to build (and should be estimated well)
  • make sure you understand the hardware and networking requirements (the most important component)
  • the learning curve for a Hadoop cluster is steep
  • the learning curve for security in a Hadoop cluster is even steeper

How to live?

Kylo

NiFi

Pachyderm

Core concepts

Conclusion

1 “Let’s build a modern Hadoop” Feb 10, 2015, https://medium.com/pachyderm-data/lets-build-a-modern-hadoop-4fc160f8d74f


Next: CephFS, GlusterFS ...

Computation model

  • Here we can say that most workloads run on MapReduce
  • it is hard to integrate this with Atlas

All data lakes have some computation model for data processing (ingestion and wrangling)... The most popular is MapReduce; a minimal sketch follows.
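
To make the computation model concrete, here is the classic word count as a MapReduce job in the Hadoop Streaming style (a generic sketch, not taken from any of the linked posts):

```python
# Word count for Hadoop Streaming. Both phases are in one file for brevity;
# select with argv, e.g. -mapper "python wc.py map" -reducer "python wc.py reduce".
import sys
from itertools import groupby

def mapper():
    # Emit (word, 1) pairs, tab-separated, one per line.
    for line in sys.stdin:
        for word in line.split():
            print("{}\t1".format(word))

def reducer():
    # Hadoop sorts mapper output by key before the reduce phase,
    # so identical words arrive as consecutive lines.
    pairs = (line.rstrip("\n").split("\t") for line in sys.stdin if line.strip())
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print("{}\t{}".format(word, sum(int(count) for _, count in group)))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```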

Next: Stream, Stream, Stream ...

Resources management

All data lakes have some resource management layer to split resources between computations... The most popular is YARN; a sketch of a resource request follows.
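
A small sketch of how a computation asks YARN for resources, here through standard Spark configuration keys (the values themselves are made up):

```python
# Declare this job's resource needs; YARN grants containers accordingly.
from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("yarn-resource-demo")
    .setMaster("yarn")
    .set("spark.executor.instances", "4")  # how many containers YARN should grant
    .set("spark.executor.memory", "2g")    # memory per container
    .set("spark.executor.cores", "2")      # CPU cores per container
)
sc = SparkContext(conf=conf)
```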

Next: Mesosphere, Kubernetes ...

Authorization and authentication policy

Next: Ranger, ?

Data and metadata

Ingestion

One of the basic principles of data science is that the more data you get, the better your data model will ultimately be.

  • ingestion statistics - to know the health of the data (see the sketch after this list)
  • online ingestion, e.g. a stream
  • batch, e.g. reading from a file
  • a strategy for reading from the source
  • a merging strategy in the destination
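
A rough sketch of what ingestion statistics could look like (my illustration; the file path and the chosen metrics are hypothetical):

```python
# Compute basic health metrics on each batch before it lands in the lake.
import pandas as pd

def ingestion_stats(path):
    df = pd.read_csv(path)
    return {
        "rows": len(df),
        "null_ratio_per_column": df.isnull().mean().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }

print(ingestion_stats("/ingest/incoming/customers.csv"))  # hypothetical path
```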

Playing with data

  • reviewing the data structure
  • profiling the data (statistics)
  • reading the business data (source, pipeline)

Transformation

We want to transform data and keep lineage for it. Data transformation requires collaboration on pipeline development, and for now in Hadoop this is complicated. A minimal lineage-aware sketch follows the list below.

  • cleansing
  • joining
  • ...
  • ML
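
A minimal sketch of a lineage-aware transformation step (my own illustration, not a Hadoop API; the paths are hypothetical): every step records what went in, what came out, and how many rows it touched.

```python
# Run a transformation and append a lineage record for it.
import json
import pandas as pd

def cleanse(df):
    return df.dropna().drop_duplicates()

def run_step(step_name, func, input_path, output_path, lineage_log):
    df_in = pd.read_csv(input_path)
    df_out = func(df_in)
    df_out.to_csv(output_path, index=False)
    with open(lineage_log, "a") as log:
        log.write(json.dumps({
            "step": step_name,
            "input": input_path,
            "output": output_path,
            "rows_in": len(df_in),
            "rows_out": len(df_out),
        }) + "\n")

run_step("cleansing", cleanse,
         "/lake/raw/events.csv", "/lake/clean/events.csv",  # hypothetical paths
         "/lake/lineage.jsonl")
```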

Common actions (operational)

Monitoring

  • pipeline state
  • pipeline output
  • lineage
  • data provenance

SLA

  • a warning when something is wrong, or some other failure-handling strategy

New approaches

HDP

  • This one is optional
  • Nice integration with an existing deployment
  • Does not solve the problem of collaboration on data

Kylo

https://confluence.virtuslab.com/display/SHER/Kylo

Kylo is data lake management software that originated at Teradata. It is open-sourced under Think Big and built on Apache Hadoop and Spark. It is responsible for:

  • Data Ingestion - a mix of streaming and batch processing via NiFi
  • Prepare - helps to cleanse data in order to improve its quality and to support data governance
  • Discovery - analysts and data scientists can search and find what data is available to them

Components

https://confluence.virtuslab.com/display/SHER/NiFi

Put simply, NiFi was built to automate the flow of data between systems. While the term dataflow is used in a variety of contexts, we use it here to mean the automated and managed flow of information between systems. This problem space has been around ever since enterprises had more than one system, where some of the systems created data and some of the systems consumed data. The problems and solution patterns that emerged have been discussed and articulated extensively. A comprehensive and readily consumed form is found in the Enterprise Integration Patterns.

[3] Data lake on Kubernetes

Pachyderm

Resolves:

  • any tool a data scientist needs can be used (Docker)
  • distributed processing (Kubernetes)
  • version control for data
    • Reproducibility
    • Instant Revert
    • Data Lineage

https://www.youtube.com/watch?v=LamKVhe2RSM

Like a time machine for the data lake: you have versioning of both data and process:

  • a commit in the data repository
  • a commit in the Docker image

Collaboration not only on the process (the source) but also on the data: one data scientist can build a data set and another can fork it. Merging is one git concept that doesn't really make sense in the context of data - if we have conflicts, no human is going to be able to resolve merge conflicts for terabytes of data.

The processing app is a Docker image (so you can use any language), like microservices. A Docker image with some processing application (a microservice) can be replayed with the same data from the repository. We can create n containers from one image - easy scaling.

It doesn't take the large team with specific expertise that Hadoop requires to be productive. Containers allow you to build data-processing algorithms in any programming language.

A new position in IT: Data Scientist + DevOps - like SRE at Google, but for data science - https://landing.google.com/sre/

Because data scientists know what the data looks like, they can say how the processing should be run (cheaper and with fewer failures).

Components

Pipelines

A system for stringing containers together and doing data analysis with them; a hedged example of a pipeline spec is sketched below.
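
A sketch of what a Pachyderm pipeline spec looks like (field names follow the Pachyderm 1.x docs, but double-check against the exact version you run; the repo, image and command here are made up). The generated JSON would be submitted with pachctl create-pipeline:

```python
# Build a Pachyderm pipeline spec as JSON. The input repo is mounted at
# /pfs/<repo> inside the container, and results go to /pfs/out.
import json

spec = {
    "pipeline": {"name": "wordcount"},
    "transform": {
        "image": "example/wordcount:latest",  # any Docker image, any language
        "cmd": ["python", "/app/count.py", "/pfs/text", "/pfs/out"],
    },
    "parallelism_spec": {"constant": 2},  # scale by running 2 containers
    "input": {"atom": {"repo": "text", "glob": "/*"}},
}

with open("wordcount.json", "w") as f:
    json.dump(spec, f, indent=2)
```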

Pachyderm File System (PFS)

A distributed file system that draws inspiration from git, providing version control over all the data. It's the core data layer that delivers data to containers. Based on https://github.com/ceph/ceph. Provides historical snapshots (a time machine).

Hadoop vs Pachyderm


A comparison of HDFS and CephFS: http://iopscience.iop.org/article/10.1088/1742-6596/513/4/042014/pdf


Governance

| Governance | HDP | Kylo | Pachyderm |
|------------|-----|------|-----------|
| data       | 1   |      | 2         |
| processing |     |      |           |

Lineage

| Lineage    | HDP | Kylo | Pachyderm |
|------------|-----|------|-----------|
| data       | 3   |      |           |
| processing |     | 4    |           |

1 only when you access the data from an app integrated with Ranger

2 partially supported, e.g. if we have a regex in the input then we have the regex

3 only supported with Atlas or manually

4 supported in NiFi


Modern Hadoop

HDP

https://www.slideshare.net/hortonworks/hdp-security-overview
https://confluence.virtuslab.com/display/SHER/Hortonworks+Data+Platform

Components

  • HDFS
  • YARN

https://hortonworks.com/blog/docker-kubernetes-apache-hadoop-yarn/

What next?

Spark on Mesosphere

Spark is not a new approach; it is only for computation.

Spark vs Pachyderm

Pros
  • Easy to set up and use - Unix file concepts combined with git
Cons
  • Can't use YARN and other Hadoop stuff; you need to create a new cluster or change the software on an existing one.

Other

Integration with Mesos and Spark is in progress.
