Prateek Dubey dprateek1991

@dprateek1991
dprateek1991 / ceph_spark_k8s_data_processing.py
Last active September 25, 2021 08:53
Process Data in Ceph using Spark on Kubernetes
# Airflow DEMO DAG
from airflow import DAG
from datetime import timedelta, datetime
from kubernetes.client import models as k8s
# Airflow 1.10 import path; Airflow 2 ships this operator in the cncf.kubernetes provider package
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator
args = {
    "owner": "prateek.dubey",
    "email": ["dataengineeringe2e@gmail.com"],
@dprateek1991
dprateek1991 / Ceph_S3_Data_Read_PySpark_with_config.py
Created September 24, 2021 09:59
Process Ceph/S3 Data using Spark
# Import Spark Library
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder \
    .appName('amazon-data-review') \
    .config("spark.kubernetes.driver.master", "k8s://https://14HH948AC611F5A7F020B62A5C366F04.yl4.us-east-1.eks.amazonaws.com:443") \
    .config("spark.kubernetes.namespace", "spark") \
@dprateek1991
dprateek1991 / Ceph_S3_Data_Read_PySpark.py
Created September 24, 2021 09:56
Read Ceph/S3 Data via Spark and write back
# Import Spark Library
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder \
    .appName('amazon-data-review') \
    .getOrCreate()
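The preview above stops at `getOrCreate()`, so the settings needed to reach a Ceph Object Gateway over S3 are not shown. A minimal sketch of what such settings might look like, assuming the standard Hadoop S3A connector is on the Spark classpath. The helper name `ceph_s3a_conf`, the endpoint, the credentials, and the bucket paths are hypothetical placeholders, not values from the gist.

```python
def ceph_s3a_conf(endpoint, access_key, secret_key, use_ssl=False):
    """Return Spark config entries for the Hadoop S3A connector (hypothetical helper)."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Assumption: the Ceph gateway is addressed path-style, not virtual-hosted style
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.connection.ssl.enabled": str(use_ssl).lower(),
    }


# Placeholder endpoint and credentials, for illustration only
conf = ceph_s3a_conf("http://ceph-rgw.example.com:8080", "ACCESS_KEY", "SECRET_KEY")

# Possible usage with PySpark (needs pyspark and the hadoop-aws jars; not run here):
# from pyspark.sql import SparkSession
# builder = SparkSession.builder.appName("amazon-data-review")
# for key, value in conf.items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
# df = spark.read.parquet("s3a://some-bucket/input/")              # hypothetical path
# df.write.mode("overwrite").parquet("s3a://some-bucket/output/")  # hypothetical path
```

Keeping the settings in a plain dictionary makes them easy to reuse between a `SparkSession` builder and `--conf` flags on `spark-submit`.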
@dprateek1991
dprateek1991 / spark_jump_pod.yaml
Created September 24, 2021 09:53
Spark Jump Pod
apiVersion: v1
kind: Pod
metadata:
  name: spark-jump-pod
  namespace: spark
spec:
  serviceAccountName: spark
  containers:
  - image: dataengineeringe2e/spark-ubuntu-3.0.1
    name: spark-jump-pod
@dprateek1991
dprateek1991 / k8s_commands.sh
Created September 24, 2021 09:52
Spark on K8s Commands
kubectl create namespace spark
kubectl create serviceaccount spark -n spark
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=spark:spark --namespace=spark
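The three commands above create a namespace, a service account, and a cluster role binding that grants the built-in `edit` role to that account. As a rough declarative counterpart, the sketch below builds the corresponding Kubernetes objects as plain Python dictionaries. The function name `spark_rbac_manifests` is hypothetical, and the field layout follows the core `v1` and `rbac.authorization.k8s.io/v1` APIs rather than anything shown in the gist.

```python
import json


def spark_rbac_manifests(namespace="spark", service_account="spark",
                         binding_name="spark-role"):
    """Build Namespace, ServiceAccount and ClusterRoleBinding objects (hypothetical helper)."""
    namespace_obj = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": namespace},
    }
    service_account_obj = {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {"name": service_account, "namespace": namespace},
    }
    binding_obj = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRoleBinding",
        # ClusterRoleBinding is cluster-scoped, so it carries no namespace of its own
        "metadata": {"name": binding_name},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": "edit",
        },
        "subjects": [
            {"kind": "ServiceAccount", "name": service_account, "namespace": namespace}
        ],
    }
    return [namespace_obj, service_account_obj, binding_obj]


manifests = spark_rbac_manifests()
print(json.dumps(manifests, indent=2))
```

Each dictionary could be serialized to JSON or YAML and kept under version control, which is one way to make the setup repeatable across clusters.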
@dprateek1991
dprateek1991 / spark_onpremise.sh
Created September 24, 2021 09:49
Spark on On-Premise Rancher K8s
/opt/spark/bin/spark-submit \
  --master k8s://https://rancher.example.com:6443 \
  --deploy-mode cluster \
  --name amazon-data-review \
  --conf spark.kubernetes.driver.pod.name=amazon-data-review \
  --conf spark.kubernetes.executor.podNamePrefix=amazon-data-review \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.executor.instances=2 \
  --conf spark.executor.cores=3 \
  --conf spark.executor.memory=55g \
@dprateek1991
dprateek1991 / spark_eks.sh
Created September 24, 2021 09:48
Spark on EKS
/opt/spark/bin/spark-submit \
  --master k8s://https://14HH948AC611F5A7F020B62A5C366F04.yl4.us-east-1.eks.amazonaws.com:443 \
  --deploy-mode cluster \
  --name amazon-data-review \
  --conf spark.kubernetes.driver.pod.name=amazon-data-review \
  --conf spark.kubernetes.executor.podNamePrefix=amazon-data-review \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.executor.instances=2 \
  --conf spark.executor.cores=3 \
  --conf spark.executor.memory=55g \
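The on-premises and EKS commands differ only in the `--master` URL, so the shared flags can be assembled programmatically. The sketch below is one way to do that in Python. The function name `build_spark_submit` and the application path `local:///opt/spark/app.py` are hypothetical, since the preview is cut off before the application argument.

```python
import shlex


def build_spark_submit(master, name, conf, app,
                       deploy_mode="cluster",
                       spark_submit="/opt/spark/bin/spark-submit"):
    """Assemble a spark-submit argument list (hypothetical helper)."""
    cmd = [spark_submit, "--master", master,
           "--deploy-mode", deploy_mode, "--name", name]
    # Each configuration entry becomes its own "--conf key=value" pair
    for key, value in conf.items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    return cmd


# Configuration values copied from the previews above
shared_conf = {
    "spark.kubernetes.driver.pod.name": "amazon-data-review",
    "spark.kubernetes.executor.podNamePrefix": "amazon-data-review",
    "spark.kubernetes.namespace": "spark",
    "spark.executor.instances": 2,
    "spark.executor.cores": 3,
    "spark.executor.memory": "55g",
}

cmd = build_spark_submit(
    master="k8s://https://rancher.example.com:6443",
    name="amazon-data-review",
    conf=shared_conf,
    app="local:///opt/spark/app.py",  # hypothetical application path
)
print(shlex.join(cmd))

# To launch it for real (needs a Spark install and cluster access; not run here):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems, and switching clusters only requires a different `master` value.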
@dprateek1991
dprateek1991 / spark_k8s_airflow.py
Created September 24, 2021 09:47
Airflow DAG for Spark Application
# Airflow DEMO DAG
from airflow import DAG
from datetime import timedelta, datetime
from kubernetes.client import models as k8s
# Airflow 1.10 import path; Airflow 2 ships this operator in the cncf.kubernetes provider package
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator
args = {
    "owner": "prateek.dubey",
    "email": ["dataengineeringe2e@gmail.com"],
    "depends_on_past": False,
@dprateek1991
dprateek1991 / SparkK8sOperator.yaml
Created September 24, 2021 09:44
SparkK8sOperator YAML File
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: amazon-data-review
  namespace: spark
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: "dataengineeringe2e/spark-ubuntu-3.0.1"