Because of the problems described in "Hadoop is like an elephant crushing":
I want to define requirements for a data lake from the perspective of:
- developer
  - collaboration (scm, review, system modularity)
  - testing
  - data samples
  - verification
- cluster operators
  - monitoring
  - deployment process
  - reliability
  - ingestion
  - security
- analyst
  - ad-hoc queries
  - business context
  - data quality
  - lineage
- business
  - data business value
  - staff scalability
Then I want to focus on pachyderm.io: review its architecture, compare it to Hadoop, test it, and answer the question: is Pachyderm a modern Hadoop?
Build a cluster on Raspberry Pi (Ansible for cluster management) - in progress:
- setup k8s
- setup CephFS (pv for k8s)
- setup minio - standalone
- setup pachyderm
- run examples in pachyderm
Hardware:
- 4 x Raspberry Pi 2 B
- 1 x DC supply 5V 4A (mini usb out)
- 1 x 5 ports switch
- 4 x ethernet cables
- 4 x micro sd 16gb cards
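A minimal Ansible inventory sketch for this hardware (hostnames and addresses below are assumptions - adjust to your network):

```yaml
# inventory.yml - hypothetical addressing for the 4-node Pi cluster
all:
  vars:
    ansible_user: ubuntu   # default user of the Ubuntu preinstalled image
  children:
    master:
      hosts:
        pi-master:
          ansible_host: 192.168.1.10
    workers:
      hosts:
        pi-node1: { ansible_host: 192.168.1.11 }
        pi-node2: { ansible_host: 192.168.1.12 }
        pi-node3: { ansible_host: 192.168.1.13 }
```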
- Install ubuntu http://cdimage.ubuntu.com/ubuntu/releases/16.04/release/ubuntu-16.04.3-preinstalled-server-armhf+raspi2.img.xz
- Copy ssh key to nodes - https://github.com/garthvh/ansible-raspi-playbooks
  - template
  - ansible-user
  - ansible-pass
- TODO update ansible with https://gist.github.com/gwillem/4ba393dceb55e5ae276a87300f6b8e6f
- create additional user (k8s) - see the task sketch below
- setup ssh conf for cluster (how to store hosts - DNS?)
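A sketch of the user step, assuming key-based login (the public-key path is an assumption):

```yaml
- name: create additional user (k8s)
  user:
    name: k8s
    groups: sudo
    append: yes
    shell: /bin/bash
  become: yes

- name: authorize our ssh key for the k8s user
  authorized_key:
    user: k8s
    key: "{{ lookup('file', '~/.ssh/id_rsa.pub') }}"
  become: yes
```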
```yaml
- name: init kubernetes on master node
  command: kubeadm init --pod-network-cidr 10.244.0.0/16
  run_once: true
  register: command_result
  failed_when: "'FAILED' in command_result.stderr"
  async: 200
  poll: 10
```
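The 10.244.0.0/16 CIDR above is flannel's default, so a pod-network task along these lines follows (the URL is the upstream flannel manifest; on Raspberry Pi the ARM image variant may be needed):

```yaml
- name: install flannel pod network
  command: kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
  run_once: true
  register: command_result
  failed_when: "'FAILED' in command_result.stderr"
```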
```yaml
- name: save token to file
  shell: echo $(kubectl describe secret $(kubectl get secrets | grep default | cut -f1 -d ' ') | grep -E '^token' | cut -f2 -d':' | tr -d '\t') > ~/token
```
https://kubernetes.io/docs/getting-started-guides/kubeadm/
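Worker nodes then join with the token printed by kubeadm init - roughly as below (the kubeadm_token and master_ip variables are hypothetical; newer kubeadm versions also require --discovery-token-ca-cert-hash):

```yaml
- name: join worker nodes to the cluster
  command: "kubeadm join --token {{ kubeadm_token }} {{ master_ip }}:6443"
  when: inventory_hostname in groups['workers']
  register: command_result
  failed_when: "'FAILED' in command_result.stderr"
```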
```yaml
- name: create headless service
  command: kubectl create -f https://github.com/kubernetes/kubernetes/blob/master/examples/storage/minio/minio-standalone-pvc.yaml?raw=true
  run_once: true
  register: command_result
  failed_when: "'FAILED' in command_result.stderr"
  async: 200
  poll: 10

- name: create stateful set
  command: kubectl create -f https://github.com/kubernetes/kubernetes/blob/master/examples/storage/minio/minio-distributed-statefulset.yaml?raw=true
  run_once: true
  register: command_result
  failed_when: "'FAILED' in command_result.stderr"
  async: 200
  poll: 10

- name: create service
  command: kubectl create -f https://github.com/kubernetes/kubernetes/blob/master/examples/storage/minio/minio-distributed-service.yaml?raw=true
  run_once: true
  register: command_result
  failed_when: "'FAILED' in command_result.stderr"
  async: 200
  poll: 10
```
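A verification task along these lines can gate the play until minio is actually up (the app=minio label is assumed to match the upstream manifests):

```yaml
- name: wait until minio pods report Running
  command: kubectl get pods -l app=minio -o jsonpath='{.items[*].status.phase}'
  register: minio_phase
  until: "'Running' in minio_phase.stdout"
  retries: 30
  delay: 10
  run_once: true
```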
http://docs.pachyderm.io/en/latest/deployment/on_premises.html#
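Following that guide, deploying Pachyderm against the minio object store is a single pachctl call, wrapped here as a task. A sketch only: the flags and argument order follow the Pachyderm 1.x on-premises docs and should be checked against the pachctl version in use; the volume name, size, bucket, credentials, and endpoint are placeholders.

```yaml
- name: deploy pachyderm backed by the minio object store
  command: >
    pachctl deploy custom --persistent-disk aws --object-store s3
    pach-volume 10 pach-bucket MINIO_ACCESS_KEY MINIO_SECRET_KEY minio-service:9000
  run_once: true
  register: command_result
  failed_when: "'FAILED' in command_result.stderr"
```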
Resolves:
- any tool for data scientists can be used (docker)
- distributed processing (kubernetes)
- version control for data
  - reproducibility
  - instant revert
  - data lineage
https://www.youtube.com/watch?v=LamKVhe2RSM
Like a time machine for the data lake. You get versioning of both data and process:
- commit in data repository
- commit in docker image
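For example, with 1.x-era pachctl commands, wrapped as tasks to match the playbook style above (repo and file names are placeholders):

```yaml
- name: create a versioned data repository
  command: pachctl create-repo images

- name: add a file - this records a commit in the data repository
  command: pachctl put-file images master /photo.png -f photo.png

- name: browse the commit history - the "time machine"
  command: pachctl list-commit images
```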
Collaboration happens not only on the process (source) but also on the data: one data scientist can build a data set and another can fork it. Merging is the one git concept that doesn't really make sense in the context of data - if there are conflicts, no human is going to be able to resolve merge conflicts for terabytes of data.
The processing app ships as a docker image (you can use any language), much like a microservice. A docker image with a processing application (microservice) can be replayed with the same data from the repository, and we can create n containers from one image - easy scaling.
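A minimal pipeline-spec sketch of that idea (field names follow the Pachyderm 1.x pipeline spec; shown as YAML for readability, while pachctl of that era takes JSON; the image, names, and glob are placeholders):

```yaml
pipeline:
  name: wordcount
transform:
  image: alpine:3.6
  cmd: ["sh", "-c", "wc -w /pfs/input/* > /pfs/out/counts"]
parallelism_spec:
  constant: 4          # n containers from the same image
input:
  atom:
    repo: input
    glob: "/*"
```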
It doesn't take the large team with specialized expertise that Hadoop requires to be productive.
Containers allow building data-processing algorithms in any programming language.
A new position in IT: Data Scientist + DevOps - like SRE at Google but for data science - https://landing.google.com/sre/
Data scientists know what the data looks like and can decide how processing should be run (cheaper, with fewer failures).
A system for stringing containers together and doing data analysis with them.
A distributed file system that draws inspiration from git, providing version control over all the data. It's the core data layer that delivers data to containers. Based on https://github.com/ceph/ceph. Provides historical snapshots (time machine).
Comparison of HDFS and CephFS: http://iopscience.iop.org/article/10.1088/1742-6596/513/4/042014/pdf
| Governance | HDP | Kylo | Pachyderm |
|---|---|---|---|
| data | ✓ 1 | ✓ 2 | ✓ |
| processing | ✗ | ✓ | ✓ |

| Lineage | HDP | Kylo | Pachyderm |
|---|---|---|---|
| data | ✓ 3 | ✓ | ✓ |
| processing | ✗ 4 | ✓ | ✓ |
1. only when you access data from an app integrated with Ranger
2. partially supported, e.g. if the input is given as a regex then only the regex is captured
3. only supported with Atlas or manually
4. supported in NiFi
https://www.slideshare.net/hortonworks/hdp-security-overview https://confluence.virtuslab.com/display/SHER/Hortonworks+Data+Platform
Components
- HDFS
- YARN
https://hortonworks.com/blog/docker-kubernetes-apache-hadoop-yarn/
Spark is not a new approach - it's only a computation layer.
- Easy to set up and use - Unix file concepts combined with git
- Can't use YARN and other Hadoop stack components. You need to create a new cluster or change the software on an existing one. Integration with Mesos and Spark is in progress.