Kovid Rathee (kovid-r) — public gists
kovid-r / pyspark_cheatsheet_between.py
Last active October 11, 2022 04:49
Filter Between PySpark Cheatsheet
from pyspark.sql import functions as F

# Filter movies with 7.5 < avg_ratings < 8.2 (both bounds exclusive)
df.filter((F.col('avg_ratings') > 7.5) & (F.col('avg_ratings') < 8.2)).show()
# A more concise alternative; note that between() is inclusive of both bounds
df.filter(df.avg_ratings.between(7.5, 8.2)).show()
kovid-r / pyspark_cheatsheet_read_using_schema.py
Last active October 11, 2022 04:49
RDD to DataFrame using schema PySpark Cheatsheet
from pyspark.sql.types import StringType, StructField, StructType, IntegerType
from pyspark.sql import functions as F

# textFile lives on the SparkContext; split each CSV line into fields
# and cast age to int so it matches the schema below
rdd = spark.sparkContext.textFile(csv_file_path) \
    .map(lambda line: line.split(',')) \
    .map(lambda fields: (fields[0], fields[1], int(fields[2])))

schema = StructType([
    StructField("first_name", StringType(), True),
    StructField("last_name", StringType(), True),
    StructField("age", IntegerType(), True)
])
df = spark.createDataFrame(rdd, schema)

# When a new column is supposed to have nulls
df = df.withColumn('new_col_1', F.lit(None).cast(StringType()))
# When a new column is supposed to have 0 as the default value
df = df.withColumn('new_col_2', F.lit(0))
# When a new column is derived from two (or more) existing columns
df = df.withColumn('new_col_3', df.some_column / df.some_other_column)
kovid-r / pyspark_cheatsheet_isnull.py
Created June 13, 2020 11:39
isNull and isNotNull in Pyspark Cheatsheet
# Find all the films for which budget information is not available
df.where(df.budget.isNull()).show()
# Similarly, find all the films for which budget information is available
df.where(df.budget.isNotNull()).show()
kovid-r / pyspark_cheatsheet_aggregates.py
Created June 13, 2020 11:44
Aggregates in PySpark Cheatsheet
# Year-wise summary of a selected portion of the dataset
(df.groupBy('year')
   .agg(F.min('budget').alias('min_budget'),
        F.max('budget').alias('max_budget'),
        F.sum('revenue').alias('total_revenue'),
        F.avg('revenue').alias('avg_revenue'),
        F.mean('revenue').alias('mean_revenue'))  # mean() is an alias of avg()
   .sort(F.col('year').desc())
   .show())
kovid-r / pyspark_cheatsheet_windows_and_sorting.py
Created June 13, 2020 12:23
Sorting & Windows in PySpark Cheatsheet
from pyspark.sql import Window

# Rank all films by revenue in the default ascending order
df.select("title", "year", F.rank().over(Window.orderBy("revenue")).alias("revenue_rank")).show()
# Rank films by revenue in descending order within each year;
# desc() belongs on the column, not on the window specification
df.select("title", "year", F.rank().over(Window.partitionBy("year").orderBy(F.col("revenue").desc())).alias("revenue_rank")).show()
kovid-r / pyspark_cheatsheet_sort_orderby.py
Created June 13, 2020 12:27
Sorting & OrderBy in PySpark Cheatsheet
# sort() and orderBy() are aliases, and F.asc/F.desc are interchangeable
# with Column.asc()/Column.desc(), so each group of four below is equivalent
df.filter(df.year != '1998').sort(F.asc('year'))
df.filter(df.year != '1998').sort(F.desc('year'))
df.filter(df.year != '1998').sort(F.col('year').desc())
df.filter(df.year != '1998').sort(F.col('year').asc())
df.filter(df.year != '1998').orderBy(F.asc('year'))
df.filter(df.year != '1998').orderBy(F.desc('year'))
df.filter(df.year != '1998').orderBy(F.col('year').desc())
df.filter(df.year != '1998').orderBy(F.col('year').asc())
kovid-r / pyspark_cheatsheet_joins.py
Last active October 11, 2022 04:48
Joining data in PySpark Cheatsheet
# Full outer join on the title column
df1.join(df2, 'title', 'full')
# Left join on the title column, using the how keyword
df1.join(df2, 'title', how='left')
# With no join key, Spark produces a cross (Cartesian) join
df1.join(df2)
# Another way to join
kovid-r / get-medium-stats.js
Created January 9, 2022 13:36 — forked from igeligel/get-medium-stats.js
medium-get-totals
const totalTypes = {
VIEWS: 2,
READS: 3,
FANS: 5
};
const getTotal = tableColumn =>
[
...document.querySelectorAll(
`td:nth-child(${tableColumn}) > span.sortableTable-number`