@dineshdharme
dineshdharme / VideoProcessingAtScaleUsingSpark.py
Created March 11, 2024 06:10
An example demonstrating the use of PySpark for video processing.
I have adapted the following Jupyter notebook to show how Spark can do video processing at scale.
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1969271421694072/3760413548916830/5612335034456173/latest.html
You need to install these Python libraries in your conda environment. Also make sure you have the ffmpeg library installed natively:
`pip install ffmpeg-python`
`pip install face-recognition`
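A minimal sketch of the pattern, assuming ffmpeg-python decodes each video to raw RGB frames inside a Spark task; `video_paths` and `extract_frames` are illustrative names, not from the gist:

```python
# A hedged sketch: decode sampled frames with ffmpeg-python inside each task.
import ffmpeg
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("video-at-scale").getOrCreate()

def extract_frames(path, fps=1, width=640, height=360):
    # Decode the video to raw RGB frames, sampled at `fps` frames per second.
    out, _ = (
        ffmpeg.input(path)
        .filter("fps", fps=fps)
        .filter("scale", width, height)
        .output("pipe:", format="rawvideo", pix_fmt="rgb24")
        .run(capture_stdout=True, quiet=True)
    )
    frames = np.frombuffer(out, np.uint8).reshape([-1, height, width, 3])
    # Emit (video, frame index, mean brightness) as a stand-in for real per-frame work.
    return [(path, i, float(f.mean())) for i, f in enumerate(frames)]

video_paths = ["/data/videos/a.mp4", "/data/videos/b.mp4"]  # hypothetical paths
rdd = spark.sparkContext.parallelize(video_paths, len(video_paths))
print(rdd.flatMap(extract_frames).take(5))
```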
@dineshdharme
dineshdharme / Hashing2TeraByteFileOnSpark.py
Created February 29, 2024 14:38
A workflow for hashing a 2 TB file located on S3.
Question: https://stackoverflow.com/questions/78080522/md5-hash-of-huge-files-using-pyspark/
Here's a workflow that can help you achieve this.
Since this is one large file of 2 TB, you first need to split it into smaller chunks of, say, 1 GB each.
The reason for splitting is this:
https://community.databricks.com/t5/community-discussions/very-large-binary-files-ingestion-error-when-using-binaryfile/td-p/47440
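A hedged sketch of the combining step, assuming the 2 TB object has already been split into ordered chunk files (the chunk paths here are hypothetical). Note this yields an S3-multipart-ETag-style fingerprint (MD5 of the concatenated per-chunk MD5 digests), not the single-stream MD5 of the original file:

```python
# A hedged sketch: hash 1 GB chunks in parallel, then combine the digests
# in chunk order, S3 multipart-ETag style. Chunk paths are hypothetical.
import hashlib
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("chunk-hash").getOrCreate()

# binaryFile reads each chunk whole; this is why chunks must stay well under 2 GB.
df = spark.read.format("binaryFile").load("s3://my-bucket/chunks/part-*")

@F.udf("string")
def md5_hex(content):
    return hashlib.md5(content).hexdigest()

chunk_hashes = (df.select("path", md5_hex("content").alias("md5"))
                  .orderBy("path")  # chunk order must match byte order
                  .collect())

combined = hashlib.md5(
    bytes.fromhex("".join(r["md5"] for r in chunk_hashes))
).hexdigest()
print(f"{combined}-{len(chunk_hashes)}")  # ETag-style fingerprint
```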
@dineshdharme
dineshdharme / CumulativeSumWithResetFlag.py
Created February 27, 2024 15:21
Cumulative Sum with Reset Flag.
https://stackoverflow.com/questions/78052071/pyspark-count-over-a-window-with-reset/78060131#78060131
I adapted my answer from https://stackoverflow.com/a/78056548/3238085 to this problem setup.
import sys
from pyspark.sql import Window
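A minimal sketch of the window trick, with made-up column names: a running sum of the reset flag assigns a group id, and the cumulative sum restarts within each group:

```python
# A minimal sketch: a running sum of `reset_flag` assigns group ids,
# then the cumulative sum runs inside each group.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cumsum-reset").getOrCreate()

df = spark.createDataFrame(
    [(1, 10, 0), (2, 20, 0), (3, 5, 1), (4, 7, 0), (5, 3, 1)],
    ["id", "value", "reset_flag"],
)

w_all = Window.orderBy("id")
w_grp = Window.partitionBy("grp").orderBy("id")

result = (df.withColumn("grp", F.sum("reset_flag").over(w_all))
            .withColumn("cum_value", F.sum("value").over(w_grp)))
result.show()
```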
@dineshdharme
dineshdharme / ResetCumulativeSumComplexAccumulatorExample.py
Created February 25, 2024 15:29
Grouped cumulative sum with a reset condition, achieved through a complex accumulator structure.
https://stackoverflow.com/questions/78050162/pyspark-group-by-date-range/
I used the following answer as inspiration for the code below.
Basically, clever use of a complex accumulator function allows the grouping index to be computed properly.
https://stackoverflow.com/a/64957835/3238085
import sys
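A hedged sketch of the accumulator idea: collect each key's rows into an ordered array, then fold it with `F.aggregate`, carrying (running sum, group index, labeled rows) in a struct and bumping the index whenever the running sum would cross a threshold. Column names and the threshold are illustrative:

```python
# A hedged sketch: fold an ordered array of rows with F.aggregate,
# carrying state in a struct accumulator. Names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("struct-accumulator").getOrCreate()

df = spark.createDataFrame(
    [("a", 1, 4), ("a", 2, 5), ("a", 3, 3), ("a", 4, 6)],
    "key string, seq int, value int",
)

rows = df.groupBy("key").agg(
    F.sort_array(F.collect_list(F.struct("seq", "value"))).alias("rows")
)

threshold = 10  # start a new group once the running sum would exceed this

def step(acc, x):
    reset = acc["run"] + x["value"] > threshold
    run = F.when(reset, x["value"]).otherwise(acc["run"] + x["value"])
    grp = F.when(reset, acc["grp"] + 1).otherwise(acc["grp"])
    out = F.concat(acc["out"],
                   F.array(F.struct(x["seq"].alias("seq"), grp.alias("grp"))))
    return F.struct(run.alias("run"), grp.alias("grp"), out.alias("out"))

labeled = rows.withColumn(
    "labeled",
    F.aggregate(
        "rows",
        F.struct(
            F.lit(0).alias("run"),
            F.lit(0).alias("grp"),
            F.expr("cast(array() as array<struct<seq:int,grp:int>>)").alias("out"),
        ),
        step,
        lambda acc: acc["out"],
    ),
)

labeled.select("key", F.explode("labeled").alias("r")).select("key", "r.*").show()
```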
@dineshdharme
dineshdharme / ExtractZippedFilesCSV.scala
Created February 5, 2024 13:26
Porting the previous tar.gz extractor function to a zip extractor.
https://stackoverflow.com/questions/77914457/unzipping-multiple-files-from-1-zip-files-using-emr/
Porting my previous answer from tarred gzipped archives to zip archives actually wasn't that difficult.
Important point to keep in mind:
Repartition the RDD with `numPartitionsProvided` set to a suitably large number so that all your executors are utilized.
`ZipFileReader.scala`
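For readers who want the same idea in PySpark rather than Scala, a hedged sketch: read each zip as one binary blob, open it with the standard-library `zipfile` on the executor, and emit one record per line. Paths and the partition count are illustrative:

```python
# A hedged PySpark analogue of the Scala approach; paths are illustrative.
import io
import zipfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unzip-csv").getOrCreate()
numPartitionsProvided = 64  # large enough to keep every executor busy

def read_zip(record):
    path, content = record
    with zipfile.ZipFile(io.BytesIO(content)) as zf:
        for name in zf.namelist():
            with zf.open(name) as fh:
                for line in io.TextIOWrapper(fh, encoding="utf-8"):
                    yield (path, name, line.rstrip("\n"))

lines = (spark.sparkContext
         .binaryFiles("s3://my-bucket/zips/*.zip")
         .repartition(numPartitionsProvided)
         .flatMap(read_zip))

lines.toDF(["archive", "entry", "line"]).show(5, truncate=False)
```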
@dineshdharme
dineshdharme / ZipFileReader.scala
Created February 3, 2024 15:43
Reading compressed archive files for ETL.
https://stackoverflow.com/questions/77914457/unzipping-multiple-files-from-1-zip-files-using-emr/
The following is a solution in Scala. I had to do this before in my job, so I am extracting the relevant bits here.
A few important points to keep in mind.
If possible in your workflow, produce tar.gz archives of your files instead of zip, because I have tested this only with that format.
Secondly, repartition the RDD with `numPartitionsProvided` set to a suitably large number so that all your executors are utilized.
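A hedged PySpark analogue for the tar.gz case, again with illustrative paths, using the standard-library `tarfile` on the executors:

```python
# A hedged PySpark analogue for tar.gz archives; paths are illustrative.
import io
import tarfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("untar-csv").getOrCreate()
numPartitionsProvided = 64

def read_targz(record):
    path, content = record
    with tarfile.open(fileobj=io.BytesIO(content), mode="r:gz") as tf:
        for member in tf.getmembers():
            if member.isfile():
                fh = tf.extractfile(member)
                for line in io.TextIOWrapper(fh, encoding="utf-8"):
                    yield (path, member.name, line.rstrip("\n"))

lines = (spark.sparkContext
         .binaryFiles("s3://my-bucket/archives/*.tar.gz")
         .repartition(numPartitionsProvided)
         .flatMap(read_targz))

lines.toDF(["archive", "member", "line"]).show(5, truncate=False)
```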
@dineshdharme
dineshdharme / TimeConversionQuandry.py
Created December 12, 2023 10:12
Make sure the fractional part of your timestamps only goes to milliseconds; additional precision causes a parsing error in Spark.
The fractional seconds in your timestamp (".71910") have five digits. Spark expects up to three digits for fractional seconds (milliseconds). Having more than three digits can cause a parsing error.
Here's the modified code, which works.
import sys
from pyspark import SparkContext, SQLContext
from pyspark.sql import functions as F
import dateutil.parser
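A minimal sketch of the fix: trim the fractional seconds to three digits before parsing. The column name and input format are illustrative:

```python
# A minimal sketch: truncate fractional seconds to milliseconds, then parse.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ts-parse").getOrCreate()

df = spark.createDataFrame([("2023-12-12 10:12:01.71910",)], ["ts_str"])

parsed = df.withColumn(
    "ts",
    F.to_timestamp(
        F.regexp_replace("ts_str", r"(\.\d{3})\d+$", "$1"),  # ".71910" -> ".719"
        "yyyy-MM-dd HH:mm:ss.SSS",
    ),
)
parsed.show(truncate=False)
```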
@dineshdharme
dineshdharme / GraphXFindParents.py
Created December 8, 2023 14:53
Finding a path from source to destination using GraphFrames in PySpark.
I am adapting my previous answer from here:
https://gist.github.com/dineshdharme/7c13dcde72e42fdd3ec47d1ad40f6177
The GraphFrames jar (242 KB) can be found here:
https://mvnrepository.com/artifact/graphframes/graphframes/0.8.1-spark3.0-s_2.12
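A hedged sketch of path-finding with GraphFrames BFS, assuming the jar above (and the `graphframes` Python package) is available; the vertices and edges are illustrative:

```python
# A hedged sketch of GraphFrames BFS; graph data is illustrative.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("graphframes-bfs")
         .config("spark.jars.packages",
                 "graphframes:graphframes:0.8.1-spark3.0-s_2.12")
         .getOrCreate())

from graphframes import GraphFrame  # importable once the package/jar is available

vertices = spark.createDataFrame([("a",), ("b",), ("c",), ("d",)], ["id"])
edges = spark.createDataFrame([("a", "b"), ("b", "c"), ("c", "d")],
                              ["src", "dst"])

g = GraphFrame(vertices, edges)

# Breadth-first search from source "a" to destination "d".
paths = g.bfs(fromExpr="id = 'a'", toExpr="id = 'd'")
paths.show(truncate=False)
```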
Requirements:
Since your system of equations is underdetermined, i.e. there are more unknown variables than equations, you will get parametrized solutions.
You can solve systems of linear equations (underdetermined, overdetermined, or uniquely determined) using the `sympy` library.
I have adapted the following Stack Overflow solution to give an example of how you can solve your equations.
https://stackoverflow.com/a/50048060/3238085
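A minimal sympy sketch: two equations in three unknowns, so `linsolve` returns a parametrized solution set. The equations are illustrative:

```python
# A minimal sketch: an underdetermined system solved with sympy's linsolve.
from sympy import symbols, linsolve

x, y, z = symbols("x y z")

# x + y + z = 6 and x - y = 1, written as expressions equal to zero
solution = linsolve([x + y + z - 6, x - y - 1], [x, y, z])
print(solution)  # {(7/2 - z/2, 5/2 - z/2, z)} -- z stays a free parameter
```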
@dineshdharme
dineshdharme / FaceDetectionFFmpegInPyspark.py
Created November 29, 2023 09:07
Finding bounding boxes for faces in a video sampled at a particular rate.
I have adapted the following Jupyter notebook to show how Spark can do video processing at scale.
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1969271421694072/3760413548916830/5612335034456173/latest.html
You need to install these Python libraries in your conda environment. Also make sure you have the ffmpeg library installed natively:
`pip install ffmpeg-python`
`pip install face-recognition`
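A hedged sketch of the per-frame face step: decode sampled frames with ffmpeg-python, then run `face_recognition` on each frame inside a Spark task. The paths and the helper `face_boxes` are illustrative:

```python
# A hedged sketch: sample frames with ffmpeg, detect faces with face_recognition.
import ffmpeg
import numpy as np
import face_recognition
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("face-boxes").getOrCreate()

def face_boxes(path, fps=1, width=640, height=360):
    out, _ = (ffmpeg.input(path)
              .filter("fps", fps=fps)
              .filter("scale", width, height)
              .output("pipe:", format="rawvideo", pix_fmt="rgb24")
              .run(capture_stdout=True, quiet=True))
    frames = np.frombuffer(out, np.uint8).reshape([-1, height, width, 3])
    for i, frame in enumerate(frames):
        # face_locations returns a (top, right, bottom, left) box per detected face
        for box in face_recognition.face_locations(np.ascontiguousarray(frame)):
            yield (path, i, box)

paths = ["/data/videos/a.mp4"]  # hypothetical
boxes = spark.sparkContext.parallelize(paths).flatMap(face_boxes).collect()
print(boxes[:5])
```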