Jai Prakash (prakashrd)
@prakashrd
prakashrd / simple-webservers.md
Last active June 2, 2017 07:59
Simple Static Webservers

python

Start the server:

python -m SimpleHTTPServer 8000

Then open localhost:8000 in the browser.
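Note: SimpleHTTPServer is Python 2 only; on Python 3 the equivalent is the http.server module:

python3 -m http.server 8000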

node

  • install the npm module: npm install http-server -g
  • start the server: http-server -p 8000 (serves the current directory; the default port is 8080)

@prakashrd
prakashrd / pyspark-csv-to-parquet.py
Last active June 29, 2017 01:59
Convert a CSV file to Parquet
# A simple script to convert a traffic CSV to a Parquet file. Demonstrates CSV-to-Parquet conversion, the usage of UDFs, and applying the ...
import argparse
from pyspark.sql import SparkSession
# Import data types
from pyspark.sql.types import *
from pyspark.sql.functions import when, lit, col, udf
def convert_csv_to_parquet(spark_context, custom_schema, csv_file, parquet_file):
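The gist preview cuts off at the signature. A minimal sketch of a body consistent with it (an assumption, not the original code; despite its name, the first parameter is presumably the SparkSession built by the script):

def convert_csv_to_parquet(spark_context, custom_schema, csv_file, parquet_file):
    # assumed body: read the CSV with the explicit schema, write it back out as Parquet
    df = spark_context.read.csv(csv_file, schema=custom_schema, header=True)
    df.write.mode("overwrite").parquet(parquet_file)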
@prakashrd
prakashrd / pyspark-beginers.py
Created June 29, 2017 03:09
PySpark: displaying a subset of columns from a DataFrame
# having started my Spark journey, everything is a discovery for me, so I'm jotting down a few notes
df = sqlContext.createDataFrame([{'name': 'Alice', 'age': 1, 'gender' : 'F'}])
# display all the columns
df.show()
# limit to few
df.select('name', 'age').show()
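Aside: newer PySpark versions warn that inferring a schema from a dict is deprecated; pyspark.sql.Row avoids the warning (a minimal sketch, assuming a SparkSession named spark):

from pyspark.sql import Row

df = spark.createDataFrame([Row(name='Alice', age=1, gender='F')])
df.select('name', 'age').show()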
@prakashrd
prakashrd / 00-OozieWorkflowShellAction
Created July 5, 2017 07:34 — forked from airawat/00-OozieWorkflowShellAction
Oozie workflow with a shell action and capture-output: counts the lines in a provided file glob, writes the count to standard output, and a subsequent email action emails the shell action's output.
This gist includes the components of an Oozie workflow (scripts/code, sample data, and commands). Oozie actions covered: shell action, email action.
Action 1: The shell action executes a shell script that does a line count for files in a provided glob and writes the line count to standard output.
Action 2: The email action emails the output of action 1. A sketch of the workflow XML follows the overview below.
Pictorial overview of job: (diagram not included in the preview)
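A minimal sketch of what such a workflow definition might look like (the script name lineCount.sh, the output key lineCount, and the email address are assumptions; the full XML lives in the forked gist):

<workflow-app name="shellActionDemo" xmlns="uri:oozie:workflow:0.4">
    <start to="shellAction"/>
    <action name="shellAction">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>lineCount.sh</exec>
            <file>${appPath}/lineCount.sh#lineCount.sh</file>
            <!-- the script must echo key=value pairs, e.g. lineCount=42 -->
            <capture-output/>
        </shell>
        <ok to="emailAction"/>
        <error to="fail"/>
    </action>
    <action name="emailAction">
        <email xmlns="uri:oozie:email-action:0.1">
            <to>someone@example.com</to>
            <subject>Line count result</subject>
            <body>Lines counted: ${wf:actionData('shellAction')['lineCount']}</body>
        </email>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Workflow failed</message>
    </kill>
    <end name="end"/>
</workflow-app>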
@prakashrd
prakashrd / read_encoded_file.py
Created April 11, 2018 02:17
Read an encoded UTF-16 file with python
import codecs
import json
json_data = json.load(codecs.open('url_entities.json', 'r', 'utf-16'))
json_rows = [r for r in json_data]
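On Python 3 the codecs module isn't needed; the built-in open accepts an encoding argument (an equivalent sketch):

import json

with open('url_entities.json', 'r', encoding='utf-16') as f:
    json_data = json.load(f)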
# List unique values in a DataFrame column
# h/t @makmanalp for the updated syntax!
df['Column Name'].unique()
# Convert Series datatype to numeric (will error if column has non-numeric values)
# h/t @makmanalp
pd.to_numeric(df['Column Name'])
# Convert Series datatype to numeric, changing non-numeric values to NaN
# h/t @makmanalp for the updated syntax!
pd.to_numeric(df['Column Name'], errors='coerce')
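A quick check of the coerce behavior (toy data; the column name is just illustrative):

import pandas as pd

df = pd.DataFrame({'Column Name': ['1', '2', 'x']})
print(pd.to_numeric(df['Column Name'], errors='coerce'))  # 'x' becomes NaN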
@prakashrd
prakashrd / squash_commits_after_push.sh
Last active May 18, 2018 07:39
Git: squash commits after push
git checkout my_branch
git reset --soft HEAD~4
git commit
git push --force origin my_branch
## The above resets the last four pushed commits and recommits them as one. Though this can be done
## on any branch, it is good practice to do it only on a feature branch.
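An interactive alternative (a sketch with the same effect, assuming the last four commits are all yours):

git rebase -i HEAD~4                          # mark all but the first commit as "squash"
git push --force-with-lease origin my_branch  # safer than --force: refuses to overwrite unseen remote commits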
@prakashrd
prakashrd / regex.scala
Created July 5, 2018 08:45
Scala Regex
scala> val s = """(\d+)-(\d+)-(\d+).*""".r
s: scala.util.matching.Regex = (\d+)-(\d+)-(\d+).*
scala> val s(a,b,c) = "20-30-04 jfa"
a: String = 20
b: String = 30
c: String = 04
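One caveat: a val extractor pattern like the above throws a MatchError when the input doesn't match; a match expression handles both cases (a minimal sketch):

"no digits here" match {
  case s(a, b, c) => println(s"$a $b $c")
  case _          => println("no match")  // avoids the MatchError a bare val pattern would throw
}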
@prakashrd
prakashrd / spark-join.scala
Created March 7, 2019 11:57
spark-joining-datasets
scala> val left = Seq((0), (1)).toDF("id")
left: org.apache.spark.sql.DataFrame = [id: int]

// right is not defined in the preview; a definition consistent with the output below would be:
scala> val right = Seq((0, "zero"), (0, "four")).toDF("id", "right")
right: org.apache.spark.sql.DataFrame = [id: int, right: string]

scala> left.join(right, "id").show
+---+-----+
| id|right|
+---+-----+
| 0| zero|
| 0| four|
+---+-----+
@prakashrd
prakashrd / pyspark_two_files.py
Created March 16, 2019 13:02
PySpark: read two files, join them on a column, and print the resulting DataFrame
import sys
from pyspark.sql import SparkSession
# Import data types
from pyspark.sql.types import *
from pyspark.sql.functions import when, lit, col, udf
spark = SparkSession.builder.appName("Python spark read two files").getOrCreate()
accounts_file = sys.argv[1]
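The preview stops after the first argument. A sketch of how the script might continue (the second CLI argument, the CSV format, and the join column "id" are all assumptions):

customers_file = sys.argv[2]

accounts_df = spark.read.csv(accounts_file, header=True, inferSchema=True)
customers_df = spark.read.csv(customers_file, header=True, inferSchema=True)

# join on an assumed shared column and print the result
result_df = accounts_df.join(customers_df, "id")
result_df.show()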