Because of the problems described in "Hadoop is like an elephant crushing":
I want to define requirements for a data lake from the perspective of:
- developer
  - collaboration (scm, review, system modularity)
  - testing
  - data samples
  - verification
- cluster operators
  - monitoring
  - deployment process
  - reliability
  - ingestion
  - security
- analyst
  - ad-hoc queries
  - business context
  - data quality
  - lineage
- business
  - data business value
  - staff scalability
Then I want to focus on pachyderm.io: review its architecture, compare it to Hadoop, test it, and answer the question: is Pachyderm a modern Hadoop?
Build a cluster on Raspberry Pi (Ansible for cluster management) - in progress:
- setup k8s
- setup CephFS (pv for k8s)
- setup minio - standalone
- setup pachyderm
- run examples in pachyderm
Hardware:
- 4 x Raspberry Pi 2 B
- 1 x DC supply 5V 4A (mini usb out)
- 1 x 5 ports switch
- 4 x ethernet cables
- 4 x micro sd 16gb cards
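A minimal Ansible inventory sketch for this hardware (hostnames and addresses below are assumptions - adjust to your network):

```yaml
# inventory.yml - hypothetical addressing for the 4-node Pi cluster
all:
  vars:
    ansible_user: ubuntu   # default user of the Ubuntu preinstalled image
  children:
    master:
      hosts:
        pi-master:
          ansible_host: 192.168.1.10
    workers:
      hosts:
        pi-node1: { ansible_host: 192.168.1.11 }
        pi-node2: { ansible_host: 192.168.1.12 }
        pi-node3: { ansible_host: 192.168.1.13 }
```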
- Install ubuntu http://cdimage.ubuntu.com/ubuntu/releases/16.04/release/ubuntu-16.04.3-preinstalled-server-armhf+raspi2.img.xz
- Copy ssh key to nodes - https://github.com/garthvh/ansible-raspi-playbooks
  - template
  - ansible-user
  - ansible-pass
- TODO update ansible with https://gist.github.com/gwillem/4ba393dceb55e5ae276a87300f6b8e6f
- create additional user (k8s) - see the task sketch below
- setup ssh conf for cluster (how to store hosts - DNS?)
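A sketch of the user step, assuming key-based login (the public-key path is an assumption):

```yaml
- name: create additional user (k8s)
  user:
    name: k8s
    groups: sudo
    append: yes
    shell: /bin/bash
  become: yes

- name: authorize our ssh key for the k8s user
  authorized_key:
    user: k8s
    key: "{{ lookup('file', '~/.ssh/id_rsa.pub') }}"
  become: yes
```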
```yaml
- name: init kubernetes on master node
  command: kubeadm init --pod-network-cidr 10.244.0.0/16
  run_once: true
  register: command_result
  failed_when: "'FAILED' in command_result.stderr"
  async: 200
  poll: 10
```
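The 10.244.0.0/16 CIDR above is flannel's default, so a pod-network task along these lines follows (the URL is the upstream flannel manifest; on Raspberry Pi the ARM image variant may be needed):

```yaml
- name: install flannel pod network
  command: kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
  run_once: true
  register: command_result
  failed_when: "'FAILED' in command_result.stderr"
```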
```yaml
- name: save token to file
  shell: echo $(kubectl describe secret $(kubectl get secrets | grep default | cut -f1 -d ' ') | grep -E '^token' | cut -f2 -d':' | tr -d '\t') > ~/token
```
https://kubernetes.io/docs/getting-started-guides/kubeadm/
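Worker nodes then join with the token printed by kubeadm init - roughly as below (the kubeadm_token and master_ip variables are hypothetical; newer kubeadm versions also require --discovery-token-ca-cert-hash):

```yaml
- name: join worker nodes to the cluster
  command: "kubeadm join --token {{ kubeadm_token }} {{ master_ip }}:6443"
  when: inventory_hostname in groups['workers']
  register: command_result
  failed_when: "'FAILED' in command_result.stderr"
```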
```yaml
- name: create headless service
  command: kubectl create -f https://github.com/kubernetes/kubernetes/blob/master/examples/storage/minio/minio-standalone-pvc.yaml?raw=true
  run_once: true
  register: command_result
  failed_when: "'FAILED' in command_result.stderr"
  async: 200
  poll: 10

- name: create stateful set
  command: kubectl create -f https://github.com/kubernetes/kubernetes/blob/master/examples/storage/minio/minio-distributed-statefulset.yaml?raw=true
  run_once: true
  register: command_result
  failed_when: "'FAILED' in command_result.stderr"
  async: 200
  poll: 10

- name: create service
  command: kubectl create -f https://github.com/kubernetes/kubernetes/blob/master/examples/storage/minio/minio-distributed-service.yaml?raw=true
  run_once: true
  register: command_result
  failed_when: "'FAILED' in command_result.stderr"
  async: 200
  poll: 10
```
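A verification task along these lines can gate the play until minio is actually up (the app=minio label is assumed to match the upstream manifests):

```yaml
- name: wait until minio pods report Running
  command: kubectl get pods -l app=minio -o jsonpath='{.items[*].status.phase}'
  register: minio_phase
  until: "'Running' in minio_phase.stdout"
  retries: 30
  delay: 10
  run_once: true
```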
http://docs.pachyderm.io/en/latest/deployment/on_premises.html#
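Following that guide, deploying Pachyderm against the minio object store is a single pachctl call, wrapped here as a task. A sketch only: the flags and argument order follow the Pachyderm 1.x on-premises docs and should be checked against the pachctl version in use; the volume name, size, bucket, credentials, and endpoint are placeholders.

```yaml
- name: deploy pachyderm backed by the minio object store
  command: >
    pachctl deploy custom --persistent-disk aws --object-store s3
    pach-volume 10 pach-bucket MINIO_ACCESS_KEY MINIO_SECRET_KEY minio-service:9000
  run_once: true
  register: command_result
  failed_when: "'FAILED' in command_result.stderr"
```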
Resolves:
- any tool for data scientists can be used (docker)
- distributed processing (kubernetes)
- version control for data
  - reproducibility
  - instant revert
  - data lineage
https://www.youtube.com/watch?v=LamKVhe2RSM
Like a time machine for the data lake. You get versioning of both data and process:
- commit in data repository
- commit in docker image
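For example, with 1.x-era pachctl commands, wrapped as tasks to match the playbook style above (repo and file names are placeholders):

```yaml
- name: create a versioned data repository
  command: pachctl create-repo images

- name: add a file - this records a commit in the data repository
  command: pachctl put-file images master /photo.png -f photo.png

- name: browse the commit history - the "time machine"
  command: pachctl list-commit images
```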
Collaboration happens not only on the process (source) but also on the data: one data scientist can build a data set and another can fork it. Merging is the one git concept that doesn't really make sense in the context of data - if there are conflicts, no human is going to be able to resolve merge conflicts for terabytes of data.
The processing app ships as a docker image (you can use any language), much like a microservice. A docker image with a processing application (microservice) can be replayed with the same data from the repository, and we can create n containers from one image - easy scaling.
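A minimal pipeline-spec sketch of that idea (field names follow the Pachyderm 1.x pipeline spec; shown as YAML for readability, while pachctl of that era takes JSON; the image, names, and glob are placeholders):

```yaml
pipeline:
  name: wordcount
transform:
  image: alpine:3.6
  cmd: ["sh", "-c", "wc -w /pfs/input/* > /pfs/out/counts"]
parallelism_spec:
  constant: 4          # n containers from the same image
input:
  atom:
    repo: input
    glob: "/*"
```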
It doesn't take the large team with specialized expertise that Hadoop requires to be productive.
Containers allow building data-processing algorithms in any programming language.
A new position in IT: Data Scientist + DevOps - like SRE at Google but for data science - https://landing.google.com/sre/
Data scientists know what the data looks like and can decide how processing should be run (cheaper, with fewer failures).
A system for stringing containers together and doing data analysis with them.
A distributed file system that draws inspiration from git, providing version control over all the data. It's the core data layer that delivers data to containers. Based on https://github.com/ceph/ceph. Provides historical snapshots (time machine).
Comparison of HDFS and CephFS: http://iopscience.iop.org/article/10.1088/1742-6596/513/4/042014/pdf
| Governance | HDP | Kylo | Pachyderm |
|---|---|---|---|
| data | ✓ 1 | ✓ 2 | ✓ |
| processing | ✗ | ✓ | ✓ |

| Lineage | HDP | Kylo | Pachyderm |
|---|---|---|---|
| data | ✓ 3 | ✓ | ✓ |
| processing | ✗ 4 | ✓ | ✓ |
1. only when you access data from an app integrated with Ranger
2. partially supported, e.g. if the input is given as a regex then only the regex is captured
3. only supported with Atlas or manually
4. supported in NiFi
https://www.slideshare.net/hortonworks/hdp-security-overview https://confluence.virtuslab.com/display/SHER/Hortonworks+Data+Platform
Components
- HDFS
- YARN
https://hortonworks.com/blog/docker-kubernetes-apache-hadoop-yarn/
Spark is not a new approach - it's only a computation layer.
- Easy to set up and use - Unix file concepts combined with git
- Can't use YARN and other Hadoop stack components. You need to create a new cluster or change the software on an existing one. Integration with Mesos and Spark is in progress.