tilakpatidar / pyspark_jdbc_df_count.md
Last active May 10, 2022 13:34
Gist to perform count() on JDBC sources without re-reading the DataFrame

Postgres snippet

create database test_db;

create table t_random as select s, md5(random()::text) from generate_series(1,5000) s;

PySpark snippet

In [1]: df = spark.read.jdbc(url="jdbc:postgresql://localhost:5432/test_db", table="t_random", properties={"driver": "org.postgresql.Driver"}).repartition(10)
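The preview cuts off here; one way to get the count without re-reading the repartitioned DataFrame is to push the aggregation down to Postgres as a subquery. A minimal sketch, assuming the same connection details (the subquery alias and variable names are illustrative, not from the gist):

# Push the count down to Postgres instead of counting in Spark.
# JDBC sources require the subquery to carry an alias ("as t").
count_df = spark.read.jdbc(
    url="jdbc:postgresql://localhost:5432/test_db",
    table="(select count(*) as cnt from t_random) as t",
    properties={"driver": "org.postgresql.Driver"})
row_count = count_df.collect()[0]["cnt"]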
tilakpatidar / sqoop.sh
Created July 13, 2019 07:28
Import data from a Postgres table to Parquet using Sqoop.
#!/usr/bin/env bash
#https://www.datageekinme.com/setup/setting-up-my-mac-sqoop/
# Installation on mac
brew install sqoop
sudo mkdir /var/lib/accumulo
export ACCUMULO_HOME='/var/lib/accumulo'
export SQOOP_VERSION=1.4.6_1
export SQOOP_HOME=/usr/local/Cellar/sqoop/1.4.6_1/libexec
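The preview stops at the environment setup. For comparison, the same Postgres-to-Parquet import can be sketched in PySpark, swapping Sqoop for a JDBC read plus a Parquet write; the table name, URL, and output path below are illustrative assumptions:

# PySpark equivalent of the Sqoop import: read the table over JDBC,
# then write it out as Parquet. All connection details are illustrative.
df = spark.read.jdbc(
    url="jdbc:postgresql://localhost:5432/test_db",
    table="t_random",
    properties={"driver": "org.postgresql.Driver"})
df.write.mode("overwrite").parquet("/tmp/t_random_parquet")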
tilakpatidar / spark-rest-job.sh
Created July 11, 2019 05:28
Using the Spark master REST API to submit a job, as a replacement for the spark-submit command, for Python and Scala.
#!/usr/bin/env bash
#python
curl -X POST http://localhost:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data '{
  "action":"CreateSubmissionRequest",
  "appArgs":[
    "/Users/tilak/jobs/test_job.py"
  ],
  "appResource":"file:/Users/tilak/jobs/test_job.py",
  "clientSparkVersion":"2.3.3",
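The JSON payload above is truncated; the same submission can also be driven from Python with requests. A minimal sketch, assuming the master's REST port is 6066; the mainClass, environmentVariables, and sparkProperties values are assumptions for a typical Python submission, not copied from the gist:

# Submit a job through the Spark master REST API (a sketch, not the gist's full payload).
import requests

payload = {
    "action": "CreateSubmissionRequest",
    "appArgs": ["/Users/tilak/jobs/test_job.py"],
    "appResource": "file:/Users/tilak/jobs/test_job.py",
    "clientSparkVersion": "2.3.3",
    # Fields below are assumptions for a typical Python submission.
    "environmentVariables": {"SPARK_ENV_LOADED": "1"},
    "mainClass": "org.apache.spark.deploy.SparkSubmit",
    "sparkProperties": {
        "spark.app.name": "test_job",
        "spark.master": "spark://localhost:7077",
    },
}
resp = requests.post(
    "http://localhost:6066/v1/submissions/create",
    json=payload,
    headers={"Content-Type": "application/json;charset=UTF-8"},
)
print(resp.json())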
tilakpatidar / keybase.md
Created October 16, 2018 14:06
Keybase identification

Keybase proof

I hereby claim:

  • I am tilakpatidar on github.
  • I am tilakpatidar (https://keybase.io/tilakpatidar) on keybase.
  • I have a public key whose fingerprint is B366 0F6B 48D9 5E12 D7DC 1487 FF74 B160 3F1C 7463

To claim this, I am signing this object:

tilakpatidar / conftest.py
Last active June 1, 2018 10:51
PySpark testing
# coding=utf-8
import findspark
from pandas.util.testing import assert_frame_equal

# findspark.init() must run before any pyspark import so that Spark's
# libraries are placed on sys.path.
findspark.init()

import logging
import pytest
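The fixture itself is cut off above; a typical shape, continuing the imports in the snippet, is a session-scoped SparkSession fixture plus a pandas-based comparison helper. A minimal sketch (fixture and helper names are assumptions, not from the gist):

# A sketch of the fixture such a conftest typically builds up to; names are illustrative.
@pytest.fixture(scope="session")
def spark_session():
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.master("local[2]").appName("tests").getOrCreate()
    yield spark
    spark.stop()

def assert_df_equal(actual_df, expected_df):
    # Compare Spark DataFrames via pandas, sorting rows so that ordering
    # differences do not cause false failures.
    cols = list(actual_df.columns)
    assert_frame_equal(
        actual_df.toPandas().sort_values(cols).reset_index(drop=True),
        expected_df.toPandas().sort_values(cols).reset_index(drop=True))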
tilakpatidar / unique_orc_records_to_orc.scala
Created March 25, 2018 13:28
Finding unique records between an ORC file and MySQL rows using Apache Spark
import spark.implicits._
import org.apache.spark.sql.SaveMode

val products = spark.sqlContext.read.format("jdbc").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "products").option("user", "gobblin").option("password", "gobblin").option("url", "jdbc:mysql://localhost/mopar_demo").load()
val newProducts = spark.sqlContext.read.format("orc").load("/Users/tilak/gobblin/mopar-demo/output/org/apache/gobblin/copy/user/tilak/pricing.products_1521799535.csv/20180325023900_append/part.task_PullCsvFromS3_1521945534992_0_0.orc")
val repartitionedProducts = products.repartition(10)
val joined = newProducts.as("np").join(repartitionedProducts.as("op"), repartitionedProducts("sha") === newProducts("sha"), "left_outer")
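The snippet ends at the join; the unique records are the left-side rows that found no match on the right. In PySpark the same result can be had in one step with a left_anti join, a sketch assuming products and new_products are the PySpark counterparts of the vals above:

# left_anti keeps new_products rows whose "sha" has no match in products,
# i.e. the left_outer join plus a null filter collapsed into one step.
unique_products = new_products.join(products, on="sha", how="left_anti")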
tilakpatidar / unique_orc_records.scala
Created March 25, 2018 13:26
Finding unique records between an ORC file and MySQL rows using Apache Spark
import spark.implicits._
import org.apache.spark.sql.SaveMode
val products = spark.sqlContext.read.format("jdbc").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "products").option("user", "gobblin").option("password", "gobblin").option("url", "jdbc:mysql://localhost/mopar_demo").load()
val newProducts = spark.sqlContext.read.format("orc").load("/Users/tilak/gobblin/mopar-demo/output/org/apache/gobblin/copy/user/tilak/pricing.products_1521799535.csv/20180325023900_append/part.task_PullCsvFromS3_1521945534992_0_0.orc")
val newnewProducts = newProducts.except(products)
val dfWriter = newnewProducts.write.mode(SaveMode.Append)
val connectionProperties = new java.util.Properties()
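The write itself is cut off after connectionProperties; a hedged PySpark rendering of the full flow, using subtract() as the analogue of Scala's except() and an illustrative ORC path:

# Rows in the ORC extract that are absent from MySQL, appended back over JDBC.
url = "jdbc:mysql://localhost/mopar_demo"
props = {"driver": "com.mysql.jdbc.Driver", "user": "gobblin", "password": "gobblin"}
products = spark.read.jdbc(url=url, table="products", properties=props)
new_products = spark.read.format("orc").load("/path/to/extract.orc")  # illustrative path
new_products.subtract(products).write.jdbc(url=url, table="products", mode="append", properties=props)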
tilakpatidar / pull_csv_from_s3_to_mysql.job
Created March 25, 2018 02:30
Apache Gobblin job to ingest CSV files from S3 buckets into a MySQL table.
# ====================================================================
# PullCsvFromS3
# Pull CSV data from an S3 directory to MySQL
# ====================================================================
job.name=PullCsvFromS3
job.description=Pull CSV data from an S3 directory to MySQL
fs.uri=file:///
# Set working directory
tilakpatidar / pull_csv_from_s3_to_avro.job
Created March 24, 2018 07:09
Apache Gobblin job to pull CSVs from S3 storage and write them as Avro
# ====================================================================
# PullCsvFromS3
# Pull CSV data from an S3 directory to our local system
# ====================================================================
job.name=PullCsvFromS3
job.description=Pull CSV data from an S3 directory to our local system and write it as Avro files
fs.uri=file:///
# Set working directory
tilakpatidar / pull_csv_from_s3.job
Last active March 24, 2018 07:06
Apache Gobblin job file for pulling CSV files from an S3 bucket
# ====================================================================
# PullCsvFromS3
# Pull CSV data from an S3 directory to our local system
# ====================================================================
job.name=PullCsvFromS3
job.description=Pull CSV data from an S3 directory to our local system
fs.uri=file:///
# Set working directory