Focus on Pachyderm and show whether it can really be the future. Some ML example in Pachyderm on a Raspberry Pi cluster - I would need to sit down on this after work.
Raspberry Pi cluster - Kubernetes - in progress
- Pachyderm
- an example in Pachyderm
- conclusions
- 4 x Raspberry Pi 2 B
- 1 x 5 V 4 A DC supply (micro USB out)
- 1 x 5-port switch
- 4 x Ethernet cables
- 4 x 16 GB micro SD cards
```yaml
- name: init kubernetes on master node
  command: kubeadm init --pod-network-cidr 10.244.0.0/16
  run_once: true
  register: command_result
  failed_when: "'FAILED' in command_result.stderr"
  async: 200
  poll: 10
```
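The `--pod-network-cidr 10.244.0.0/16` above is flannel's default network, and a pod network add-on still has to be installed before the nodes become Ready. A possible follow-up task - the flannel manifest URL is the upstream one, and on Raspberry Pi the amd64 image references in it would likely need to be swapped for arm variants:

```yaml
- name: install flannel pod network
  command: kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
  run_once: true
  register: command_result
  failed_when: "'FAILED' in command_result.stderr"
```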
```yaml
- name: save token to file
  shell: echo $(kubectl describe secret $(kubectl get secrets | grep default | cut -f1 -d ' ') | grep -E '^token' | cut -f2 -d':' | tr -d '\t') > ~/token
```
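With the master initialized, the worker Pis can join it. A sketch of the corresponding task - the `kubeadm_token` and `master_ip` variables and the `workers` group name are placeholders, and newer kubeadm versions also require a discovery CA cert hash:

```yaml
- name: join worker nodes to the cluster
  command: kubeadm join --token {{ kubeadm_token }} {{ master_ip }}:6443
  when: inventory_hostname in groups['workers']
  register: command_result
  failed_when: "'FAILED' in command_result.stderr"
```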
https://kubernetes.io/docs/getting-started-guides/kubeadm/
```yaml
- name: create headless service
  command: kubectl create -f https://github.com/kubernetes/kubernetes/blob/master/examples/storage/minio/minio-standalone-pvc.yaml?raw=true
  run_once: true
  register: command_result
  failed_when: "'FAILED' in command_result.stderr"
  async: 200
  poll: 10
```
```yaml
- name: create stateful set
  command: kubectl create -f https://github.com/kubernetes/kubernetes/blob/master/examples/storage/minio/minio-distributed-statefulset.yaml?raw=true
  run_once: true
  register: command_result
  failed_when: "'FAILED' in command_result.stderr"
  async: 200
  poll: 10
```
```yaml
- name: create service
  command: kubectl create -f https://github.com/kubernetes/kubernetes/blob/master/examples/storage/minio/minio-distributed-service.yaml?raw=true
  run_once: true
  register: command_result
  failed_when: "'FAILED' in command_result.stderr"
  async: 200
  poll: 10
```
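Before pointing Pachyderm at the object store it is worth waiting until the Minio pods are actually up. A hedged verification task (the `minio` stateful set name is assumed from the example manifests):

```yaml
- name: wait for minio stateful set to roll out
  command: kubectl rollout status statefulset/minio --watch=true
  run_once: true
  register: command_result
  failed_when: "'FAILED' in command_result.stderr"
```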
http://docs.pachyderm.io/en/latest/deployment/on_premises.html#
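Following the on-premises guide above, Pachyderm can then be deployed against the in-cluster Minio endpoint. This is only a rough sketch - the exact `pachctl deploy custom` arguments differ between Pachyderm versions, and the volume, bucket, credentials and endpoint below are placeholders:

```shell
# deploy pachyderm using the in-cluster minio as the S3-compatible object store
# (argument order and flags vary by pachctl version - check `pachctl deploy custom --help`)
pachctl deploy custom --persistent-disk aws --object-store s3 \
    pach-volume 10 \
    pach-bucket minio-access-key minio-secret-key minio-service:9000
```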
Resolves:
- any tool a data scientist needs can be used (Docker)
- distributed processing (Kubernetes)
- version control for data
- reproducibility
- instant revert
- data lineage
https://www.youtube.com/watch?v=LamKVhe2RSM
Like a time machine for a data lake. You have versioning of both data and process:
- a commit in the data repository
- a commit in the Docker image
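The two kinds of commits can be seen in a minimal Pachyderm session. Repo, file and image names are just examples; the CLI syntax follows the 1.x `pachctl`:

```shell
# version the data: every put-file lands in a commit in the data repository
pachctl create-repo sensor-data
pachctl put-file sensor-data master readings.csv -f readings.csv
pachctl list-commit sensor-data

# version the process: the processing code is a tagged docker image
docker build -t my-registry/cleaner:v1 .
docker push my-registry/cleaner:v1
```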
Collaboration not only on the process (source) but on the data. One data scientist can build a data set and another can fork it. Merging is the one operation that doesn't really make sense in the context of data: if we have conflicts, no human is going to be able to resolve merge conflicts for terabytes of data.
The processing app runs as a Docker container (you can use any language), like microservices. A Docker image with some processing application (microservice) can be replayed with the same data from the repository. We can create n containers from one image - easy scaling.
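A pipeline ties the two together: a Docker image plus a command, reading from a data repo. A minimal spec sketch - the field layout follows Pachyderm 1.x pipeline specs, the names are illustrative, and raising `constant` in `parallelism_spec` is exactly the "n containers from one image" scaling:

```json
{
  "pipeline": { "name": "cleaner" },
  "transform": {
    "image": "my-registry/cleaner:v1",
    "cmd": ["python3", "/app/clean.py"]
  },
  "parallelism_spec": { "constant": 4 },
  "input": {
    "atom": { "repo": "sensor-data", "glob": "/*" }
  }
}
```

Saved as e.g. `cleaner.json`, it would be created with `pachctl create-pipeline -f cleaner.json`.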
It doesn't take a large team with the specific expertise that Hadoop requires to be productive.
Containers allow building data-processing algorithms in any programming language.
A new position in IT: Data Scientist + DevOps - like SRE at Google, but for data science - https://landing.google.com/sre/
Because a data scientist knows what the data looks like and can say how the processing should be run (cheaper and fewer failures).
A system for stringing containers together and doing data analysis with them.
A distributed file system that draws inspiration from git, providing version control over all the data. It's the core data layer that delivers data to containers. Based on https://github.com/ceph/ceph. Provides historical snapshots (time machine).
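The "time machine" aspect boils down to reading a file as of any past commit. A sketch in 1.x `pachctl` syntax; `<commit-id>` is a placeholder taken from the commit listing:

```shell
# list historical snapshots of the repo
pachctl list-commit sensor-data

# read a file as it looked in an older commit
pachctl get-file sensor-data <commit-id> readings.csv > old-readings.csv
```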
A comparison of HDFS and CephFS: http://iopscience.iop.org/article/10.1088/1742-6596/513/4/042014/pdf
Governance | HDP | Kylo | Pachyderm |
---|---|---|---|
data | ✓ 1 | ✓ 2 | ✓ |
processing | ✗ | ✓ | ✓ |
Lineage | HDP | Kylo | Pachyderm |
---|---|---|---|
data | ✓ 3 | ✓ | ✓ |
processing | ✗ 4 | ✓ | ✓ |
1 only when you access data from an app integrated with Ranger
2 partially supported, e.g. if we have a regex in the input then we only get the regex
3 only supported with Atlas or manually
4 supported in NiFi
https://www.slideshare.net/hortonworks/hdp-security-overview
https://confluence.virtuslab.com/display/SHER/Hortonworks+Data+Platform
Components
- HDFS
- YARN
https://hortonworks.com/blog/docker-kubernetes-apache-hadoop-yarn/
Spark is not a new approach. It only covers computation.
- Easy to set up and use - Unix file concepts with git
- Can't use YARN and other Hadoop stuff. You need to create a new cluster or change the software on an existing one.
Integration with Mesos and Spark is in progress.