@explicite
Last active February 28, 2018 11:07
RPI Cluster

Why?

Focus on Pachyderm and show whether it can really be the future. Some Pachyderm ML example on a Raspberry Pi cluster - I would have to sit down with this after work.

  • cluster of Raspberry Pis
  • Kubernetes - in progress
  • Pachyderm
  • example in Pachyderm
  • conclusions

RPI Cluster

Components

  • 4 x Raspberry Pi 2 Model B
  • 1 x 5 V 4 A DC supply (micro-USB out)
  • 1 x 5-port switch
  • 4 x Ethernet cables
  • 4 x 16 GB microSD cards

Kubernetes

- name: init kubernetes on master node
  command: kubeadm init --pod-network-cidr 10.244.0.0/16
  run_once: true
  register: command_result
  failed_when: "'FAILED' in command_result.stderr"
  async: 200
  poll: 10
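
The `--pod-network-cidr 10.244.0.0/16` above is flannel's default range, so a pod network still has to be applied before the nodes go Ready. A minimal sketch, assuming flannel is the chosen network (the manifest URL is the one from flannel's repo; on Raspberry Pi the image has to be the ARM variant, so the manifest may need adjusting):

```yaml
# Assumed follow-up task; manifest URL and ARM image handling may need adjusting.
- name: install flannel pod network
  command: kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
  run_once: true
  register: command_result
  failed_when: "'FAILED' in command_result.stderr"
```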

- name: save token to file
  shell: echo $(kubectl describe secret $(kubectl get secrets | grep default | cut -f1 -d ' ') | grep -E '^token' | cut -f2 -d':' | tr -d '\t')  > ~/token

https://kubernetes.io/docs/getting-started-guides/kubeadm/
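
`kubeadm init` prints a join token for the workers. A hedged sketch of the corresponding worker-side task - `kubeadm_token` and `master_ip` are assumed inventory variables, not something defined above:

```yaml
# Assumed variables: kubeadm_token, master_ip. Newer kubeadm versions
# also require --discovery-token-ca-cert-hash sha256:<hash>.
- name: join worker nodes to the cluster
  command: "kubeadm join --token {{ kubeadm_token }} {{ master_ip }}:6443"
  when: inventory_hostname not in groups['master']
  register: command_result
  failed_when: "'FAILED' in command_result.stderr"
```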

Minio or Rook to test

- name: create headless service
  command: kubectl create -f https://github.com/kubernetes/kubernetes/blob/master/examples/storage/minio/minio-standalone-pvc.yaml?raw=true
  run_once: true
  register: command_result
  failed_when: "'FAILED' in command_result.stderr"
  async: 200
  poll: 10

- name: create stateful set
  command: kubectl create -f https://github.com/kubernetes/kubernetes/blob/master/examples/storage/minio/minio-distributed-statefulset.yaml?raw=true
  run_once: true
  register: command_result
  failed_when: "'FAILED' in command_result.stderr"
  async: 200
  poll: 10

- name: create service
  command: kubectl create -f https://github.com/kubernetes/kubernetes/blob/master/examples/storage/minio/minio-distributed-service.yaml?raw=true
  run_once: true
  register: command_result
  failed_when: "'FAILED' in command_result.stderr"
  async: 200
  poll: 10

https://github.com/kubernetes/kubernetes/tree/master/examples/storage/minio#minio-distributed-server-deployment
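
After the three tasks above, a verification step can wait for the four distributed Minio pods to come up. A sketch - the `app=minio` label is what the example manifests use, but worth double-checking:

```yaml
- name: wait for minio pods to be running
  shell: kubectl get pods -l app=minio
  run_once: true
  register: minio_pods
  until: minio_pods.stdout.count('Running') == 4
  retries: 10
  delay: 30
```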

Pachyderm

http://docs.pachyderm.io/en/latest/deployment/on_premises.html#
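
A sketch of the on-premises deployment against the Minio service above, wrapped in the same Ansible style. The `pachctl deploy custom` flag names and argument order follow the Pachyderm 1.x docs and may differ between versions; bucket, keys and endpoint are placeholders:

```yaml
# Placeholders in <> must be filled in; verify flags against your pachctl version.
- name: deploy pachyderm against minio
  command: >
    pachctl deploy custom --object-store s3
    <bucket> <access-key> <secret-key> <minio-endpoint>
  run_once: true
  register: command_result
  failed_when: "'FAILED' in command_result.stderr"
```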

Pachyderm

Resolves:

  • any data-science tool can be used (Docker)
  • distributed processing (Kubernetes)
  • version control for data
    • Reproducibility
    • Instant Revert
    • Data Lineage

https://www.youtube.com/watch?v=LamKVhe2RSM

Like a time machine for a data lake: you get versioning of both the data and the process

  • commits in the data repository
  • commits of the Docker image
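
A hedged sketch of the data side of this in practice - pachctl 1.x used dash-separated command names, and the repo and file names here are made up:

```yaml
# Each put-file creates a new commit in the repository.
- name: commit data into a versioned repository
  shell: |
    pachctl create-repo images
    pachctl put-file images master /photo.png -f photo.png
    pachctl list-commit images
```

The Docker-image side of the versioning is simply the image tag referenced from the pipeline spec.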

Collaboration not only on the process (source code) but on the data: one data scientist can build a data set and another can fork it. Merging is the one concept that doesn't really make sense in the context of data - if there are conflicts, no human is going to resolve merge conflicts across terabytes of data.

The processing app runs as a Docker container (you can use any language), much like a microservice. A Docker image with a processing application can be replayed against the same data from the repository, and we can create n containers from one image - easy scaling.

It doesn’t take a large team with the specific expertise Hadoop requires to be productive. Containers allow building data-processing algorithms in any programming language.

A new position in IT: Data Scientist + DevOps - like SRE at Google, but for data science - https://landing.google.com/sre/

Because data scientists know what the data looks like and can say how the processing should be run (cheaper, with fewer failures).

Components

Pipelines

A system for stringing containers together and doing data analysis with them.
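
A pipeline is declared as a JSON spec and submitted with `pachctl create-pipeline -f <spec>.json`. A minimal sketch - the image, command and repo names are made up, and the field names follow the Pachyderm 1.x docs (the `atom` input type in particular has changed between versions):

```json
{
  "pipeline": { "name": "edges" },
  "transform": {
    "image": "some-registry/edges:latest",
    "cmd": ["python3", "/edges.py"]
  },
  "input": {
    "atom": { "repo": "images", "glob": "/*" }
  }
}
```

Every commit to the `images` repo triggers the container, and its output lands in a repo named after the pipeline.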

Pachyderm File System (PFS)

A distributed file system that draws inspiration from git, providing version control over all the data. It’s the core data layer that delivers data to containers. Based on https://github.com/ceph/ceph. Provides historical snapshots (time machine).
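
The time-machine behaviour in a hedged sketch - the commit id stays a placeholder, and the command name follows pachctl 1.x:

```yaml
# Reads the file as it existed at the given (placeholder) commit.
- name: read a file from a historical snapshot
  shell: pachctl get-file images <commit-id> /photo.png > photo-old.png
```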

Hadoop vs Pachyderm

(image: Hadoop vs Pachyderm comparison)

Comparison of HDFS and CephFS: http://iopscience.iop.org/article/10.1088/1742-6596/513/4/042014/pdf


Governance

|            | HDP | Kylo | Pachyderm |
|------------|-----|------|-----------|
| data       | 1   | 2    |           |
| processing |     |      |           |

Lineage

|            | HDP | Kylo | Pachyderm |
|------------|-----|------|-----------|
| data       | 3   |      |           |
| processing |     | 4    |           |

1 only when you access data from an app integrated with Ranger

2 partially supported, e.g. if we have a regex in the input then we have a regex

3 only supported with Atlas, or manually

4 supported in NiFi


Modern Hadoop

HDP

https://www.slideshare.net/hortonworks/hdp-security-overview
https://confluence.virtuslab.com/display/SHER/Hortonworks+Data+Platform

Components

  • HDFS
  • YARN

https://hortonworks.com/blog/docker-kubernetes-apache-hadoop-yarn/

What next?

Spark on Mesosphere

Spark is not a new approach; it is only for computation.

Spark vs Pachyderm

Pros
  • Easy to set up and use - Unix file concepts combined with git
Cons
  • Can't use YARN and other Hadoop tooling; you need to build a new cluster or replace the software on an existing one.

Other

Integration with Mesos and Spark is in progress.
