@explicite
Last active February 28, 2018 11:07
RPI Cluster

Why?

Focus on Pachyderm and show whether it can really be the future. Some Pachyderm ML example on a Raspberry Pi cluster - I would have to sit down with this after work.

  • cluster of Raspberry Pis
  • Kubernetes - in progress
  • Pachyderm
  • example in Pachyderm
  • conclusions

RPI Cluster

Components

  • 4 x Raspberry Pi 2 Model B
  • 1 x 5 V 4 A DC supply (micro-USB out)
  • 1 x 5-port switch
  • 4 x Ethernet cables
  • 4 x 16 GB microSD cards

Kubernetes

- name: init kubernetes on master node
  command: kubeadm init --pod-network-cidr 10.244.0.0/16
  run_once: true
  register: command_result
  failed_when: "'FAILED' in command_result.stderr"
  async: 200
  poll: 10
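
The `--pod-network-cidr 10.244.0.0/16` above is flannel's default range, so a pod network still has to be applied before the nodes go Ready. A minimal sketch, assuming flannel is the chosen network (the manifest URL is the one from flannel's repo; on Raspberry Pi the image has to be the ARM variant, so the manifest may need adjusting):

```yaml
# Assumed follow-up task; manifest URL and ARM image handling may need adjusting.
- name: install flannel pod network
  command: kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
  run_once: true
  register: command_result
  failed_when: "'FAILED' in command_result.stderr"
```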

- name: save token to file
  shell: echo $(kubectl describe secret $(kubectl get secrets | grep default | cut -f1 -d ' ') | grep -E '^token' | cut -f2 -d':' | tr -d '\t')  > ~/token

https://kubernetes.io/docs/getting-started-guides/kubeadm/
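
`kubeadm init` prints a join token for the workers. A hedged sketch of the corresponding worker-side task - `kubeadm_token` and `master_ip` are assumed inventory variables, not something defined above:

```yaml
# Assumed variables: kubeadm_token, master_ip. Newer kubeadm versions
# also require --discovery-token-ca-cert-hash sha256:<hash>.
- name: join worker nodes to the cluster
  command: "kubeadm join --token {{ kubeadm_token }} {{ master_ip }}:6443"
  when: inventory_hostname not in groups['master']
  register: command_result
  failed_when: "'FAILED' in command_result.stderr"
```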

Minio or Rook to test

- name: create headless service
  command: kubectl create -f https://github.com/kubernetes/kubernetes/blob/master/examples/storage/minio/minio-standalone-pvc.yaml?raw=true
  run_once: true
  register: command_result
  failed_when: "'FAILED' in command_result.stderr"
  async: 200
  poll: 10

- name: create stateful set
  command: kubectl create -f https://github.com/kubernetes/kubernetes/blob/master/examples/storage/minio/minio-distributed-statefulset.yaml?raw=true
  run_once: true
  register: command_result
  failed_when: "'FAILED' in command_result.stderr"
  async: 200
  poll: 10

- name: create service
  command: kubectl create -f https://github.com/kubernetes/kubernetes/blob/master/examples/storage/minio/minio-distributed-service.yaml?raw=true
  run_once: true
  register: command_result
  failed_when: "'FAILED' in command_result.stderr"
  async: 200
  poll: 10

https://github.com/kubernetes/kubernetes/tree/master/examples/storage/minio#minio-distributed-server-deployment
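
After the three tasks above, a verification step can wait for the four distributed Minio pods to come up. A sketch - the `app=minio` label is what the example manifests use, but worth double-checking:

```yaml
- name: wait for minio pods to be running
  shell: kubectl get pods -l app=minio
  run_once: true
  register: minio_pods
  until: minio_pods.stdout.count('Running') == 4
  retries: 10
  delay: 30
```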

Pachyderm

http://docs.pachyderm.io/en/latest/deployment/on_premises.html#
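
A sketch of the on-premises deployment against the Minio service above, wrapped in the same Ansible style. The `pachctl deploy custom` flag names and argument order follow the Pachyderm 1.x docs and may differ between versions; bucket, keys and endpoint are placeholders:

```yaml
# Placeholders in <> must be filled in; verify flags against your pachctl version.
- name: deploy pachyderm against minio
  command: >
    pachctl deploy custom --object-store s3
    <bucket> <access-key> <secret-key> <minio-endpoint>
  run_once: true
  register: command_result
  failed_when: "'FAILED' in command_result.stderr"
```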

Pachyderm

Resolves:

  • any data-science tool can be used (Docker)
  • distributed processing (Kubernetes)
  • version control for data
    • Reproducibility
    • Instant Revert
    • Data Lineage

https://www.youtube.com/watch?v=LamKVhe2RSM

Like a time machine for a data lake: you get versioning of both the data and the process

  • commits in the data repository
  • commits of the Docker image
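
A hedged sketch of the data side of this in practice - pachctl 1.x used dash-separated command names, and the repo and file names here are made up:

```yaml
# Each put-file creates a new commit in the repository.
- name: commit data into a versioned repository
  shell: |
    pachctl create-repo images
    pachctl put-file images master /photo.png -f photo.png
    pachctl list-commit images
```

The Docker-image side of the versioning is simply the image tag referenced from the pipeline spec.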

Collaboration not only on the process (source code) but on the data: one data scientist can build a data set and another can fork it. Merging is the one concept that doesn't really make sense in the context of data - if there are conflicts, no human is going to resolve merge conflicts across terabytes of data.

The processing app runs as a Docker container (you can use any language), much like a microservice. A Docker image with a processing application can be replayed against the same data from the repository, and we can create n containers from one image - easy scaling.

It doesn’t take a large team with the specific expertise Hadoop requires to be productive. Containers allow building data-processing algorithms in any programming language.

A new position in IT: Data Scientist + DevOps - like SRE at Google, but for data science - https://landing.google.com/sre/

Because data scientists know what the data looks like and can say how the processing should be run (cheaper, with fewer failures).

Components

Pipelines

A system for stringing containers together and doing data analysis with them.
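
A pipeline is declared as a JSON spec and submitted with `pachctl create-pipeline -f <spec>.json`. A minimal sketch - the image, command and repo names are made up, and the field names follow the Pachyderm 1.x docs (the `atom` input type in particular has changed between versions):

```json
{
  "pipeline": { "name": "edges" },
  "transform": {
    "image": "some-registry/edges:latest",
    "cmd": ["python3", "/edges.py"]
  },
  "input": {
    "atom": { "repo": "images", "glob": "/*" }
  }
}
```

Every commit to the `images` repo triggers the container, and its output lands in a repo named after the pipeline.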

Pachyderm File System (PFS)

A distributed file system that draws inspiration from git, providing version control over all the data. It’s the core data layer that delivers data to containers. Based on https://github.com/ceph/ceph. Provides historical snapshots (time machine).
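
The time-machine behaviour in a hedged sketch - the commit id stays a placeholder, and the command name follows pachctl 1.x:

```yaml
# Reads the file as it existed at the given (placeholder) commit.
- name: read a file from a historical snapshot
  shell: pachctl get-file images <commit-id> /photo.png > photo-old.png
```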

Hadoop vs Pachyderm

(image: Hadoop vs Pachyderm comparison)

Comparison of HDFS and CephFS: http://iopscience.iop.org/article/10.1088/1742-6596/513/4/042014/pdf


Governance

|            | HDP | Kylo | Pachyderm |
|------------|-----|------|-----------|
| data       | 1   | 2    |           |
| processing |     |      |           |

Lineage

|            | HDP | Kylo | Pachyderm |
|------------|-----|------|-----------|
| data       | 3   |      |           |
| processing |     | 4    |           |

1 only when you access data from an app integrated with Ranger

2 partially supported, e.g. if we have a regex in the input then we have a regex

3 only supported with Atlas, or manually

4 supported in NiFi


Modern Hadoop

HDP

https://www.slideshare.net/hortonworks/hdp-security-overview
https://confluence.virtuslab.com/display/SHER/Hortonworks+Data+Platform

Components

  • HDFS
  • YARN

https://hortonworks.com/blog/docker-kubernetes-apache-hadoop-yarn/

What next?

Spark on Mesosphere

Spark is not a new approach; it is only for computation.

Spark vs Pachyderm

Pros
  • Easy to set up and use - Unix file concepts combined with git
Cons
  • Can't use YARN and other Hadoop tooling; you need to build a new cluster or replace the software on an existing one.

Other

Integration with Mesos and Spark is in progress.
