@dineshdharme
dineshdharme / VideoProcessingAtScaleUsingSpark.py
Created March 11, 2024 06:10
An example demonstrating the use of PySpark for video processing.
I have adapted the following Jupyter notebook to show how Spark can do video processing at scale.
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1969271421694072/3760413548916830/5612335034456173/latest.html
You need to install these Python libraries in your conda environment. Also make sure you have the ffmpeg library installed natively:
`pip install ffmpeg-python`
`pip install face-recognition`
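A minimal sketch of the pattern, assuming ffmpeg-python decodes each video to raw RGB frames inside a Spark task; `video_paths` and `extract_frames` are illustrative names, not from the gist:

```python
# A hedged sketch: decode sampled frames with ffmpeg-python inside each task.
import ffmpeg
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("video-at-scale").getOrCreate()

def extract_frames(path, fps=1, width=640, height=360):
    # Decode the video to raw RGB frames, sampled at `fps` frames per second.
    out, _ = (
        ffmpeg.input(path)
        .filter("fps", fps=fps)
        .filter("scale", width, height)
        .output("pipe:", format="rawvideo", pix_fmt="rgb24")
        .run(capture_stdout=True, quiet=True)
    )
    frames = np.frombuffer(out, np.uint8).reshape([-1, height, width, 3])
    # Emit (video, frame index, mean brightness) as a stand-in for real per-frame work.
    return [(path, i, float(f.mean())) for i, f in enumerate(frames)]

video_paths = ["/data/videos/a.mp4", "/data/videos/b.mp4"]  # hypothetical paths
rdd = spark.sparkContext.parallelize(video_paths, len(video_paths))
print(rdd.flatMap(extract_frames).take(5))
```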
@dineshdharme
dineshdharme / Hashing2TeraByteFileOnSpark.py
Created February 29, 2024 14:38
A workflow for hashing a 2 TB file located on S3.
Question: https://stackoverflow.com/questions/78080522/md5-hash-of-huge-files-using-pyspark/
Here's a workflow that can help you achieve this.
Since this is one large file of 2 TB, you first need to split it into smaller chunks of, say, 1 GB each.
The reason for splitting is this:
https://community.databricks.com/t5/community-discussions/very-large-binary-files-ingestion-error-when-using-binaryfile/td-p/47440
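A hedged sketch of the combining step, assuming the 2 TB object has already been split into ordered chunk files (the chunk paths here are hypothetical). Note this yields an S3-multipart-ETag-style fingerprint (MD5 of the concatenated per-chunk MD5 digests), not the single-stream MD5 of the original file:

```python
# A hedged sketch: hash 1 GB chunks in parallel, then combine the digests
# in chunk order, S3 multipart-ETag style. Chunk paths are hypothetical.
import hashlib
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("chunk-hash").getOrCreate()

# binaryFile reads each chunk whole; this is why chunks must stay well under 2 GB.
df = spark.read.format("binaryFile").load("s3://my-bucket/chunks/part-*")

@F.udf("string")
def md5_hex(content):
    return hashlib.md5(content).hexdigest()

chunk_hashes = (df.select("path", md5_hex("content").alias("md5"))
                  .orderBy("path")  # chunk order must match byte order
                  .collect())

combined = hashlib.md5(
    bytes.fromhex("".join(r["md5"] for r in chunk_hashes))
).hexdigest()
print(f"{combined}-{len(chunk_hashes)}")  # ETag-style fingerprint
```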
@dineshdharme
dineshdharme / CumulativeSumWithResetFlag.py
Created February 27, 2024 15:21
Cumulative Sum with Reset Flag.
https://stackoverflow.com/questions/78052071/pyspark-count-over-a-window-with-reset/78060131#78060131
I adapted my answer from https://stackoverflow.com/a/78056548/3238085 to this problem setup.
import sys
from pyspark.sql import Window
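A minimal sketch of the window trick, with made-up column names: a running sum of the reset flag assigns a group id, and the cumulative sum restarts within each group:

```python
# A minimal sketch: a running sum of `reset_flag` assigns group ids,
# then the cumulative sum runs inside each group.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cumsum-reset").getOrCreate()

df = spark.createDataFrame(
    [(1, 10, 0), (2, 20, 0), (3, 5, 1), (4, 7, 0), (5, 3, 1)],
    ["id", "value", "reset_flag"],
)

w_all = Window.orderBy("id")
w_grp = Window.partitionBy("grp").orderBy("id")

result = (df.withColumn("grp", F.sum("reset_flag").over(w_all))
            .withColumn("cum_value", F.sum("value").over(w_grp)))
result.show()
```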
@dineshdharme
dineshdharme / ResetCumulativeSumComplexAccumulatorExample.py
Created February 25, 2024 15:29
Grouped cumulative sum with a reset condition, achieved through a complex accumulator structure.
https://stackoverflow.com/questions/78050162/pyspark-group-by-date-range/
I used the following answer as inspiration for the code below.
Basically, clever use of a complex accumulator function allows the grouping index to be computed properly.
https://stackoverflow.com/a/64957835/3238085
import sys
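A hedged sketch of the accumulator idea: collect each key's rows into an ordered array, then fold it with `F.aggregate`, carrying (running sum, group index, labeled rows) in a struct and bumping the index whenever the running sum would cross a threshold. Column names and the threshold are illustrative:

```python
# A hedged sketch: fold an ordered array of rows with F.aggregate,
# carrying state in a struct accumulator. Names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("struct-accumulator").getOrCreate()

df = spark.createDataFrame(
    [("a", 1, 4), ("a", 2, 5), ("a", 3, 3), ("a", 4, 6)],
    "key string, seq int, value int",
)

rows = df.groupBy("key").agg(
    F.sort_array(F.collect_list(F.struct("seq", "value"))).alias("rows")
)

threshold = 10  # start a new group once the running sum would exceed this

def step(acc, x):
    reset = acc["run"] + x["value"] > threshold
    run = F.when(reset, x["value"]).otherwise(acc["run"] + x["value"])
    grp = F.when(reset, acc["grp"] + 1).otherwise(acc["grp"])
    out = F.concat(acc["out"],
                   F.array(F.struct(x["seq"].alias("seq"), grp.alias("grp"))))
    return F.struct(run.alias("run"), grp.alias("grp"), out.alias("out"))

labeled = rows.withColumn(
    "labeled",
    F.aggregate(
        "rows",
        F.struct(
            F.lit(0).alias("run"),
            F.lit(0).alias("grp"),
            F.expr("cast(array() as array<struct<seq:int,grp:int>>)").alias("out"),
        ),
        step,
        lambda acc: acc["out"],
    ),
)

labeled.select("key", F.explode("labeled").alias("r")).select("key", "r.*").show()
```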
@dineshdharme
dineshdharme / ExtractZippedFilesCSV.scala
Created February 5, 2024 13:26
Porting the previous tar.gz extractor function to a zip extractor.
https://stackoverflow.com/questions/77914457/unzipping-multiple-files-from-1-zip-files-using-emr/
Porting my previous answer from tarred gzipped archives to zip archives actually wasn't that difficult.
Important point to keep in mind:
Repartition the RDD with `numPartitionsProvided` set to a suitably large number so that all your executors are utilized.
`ZipFileReader.scala`
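For readers who want the same idea in PySpark rather than Scala, a hedged sketch: read each zip as one binary blob, open it with the standard-library `zipfile` on the executor, and emit one record per line. Paths and the partition count are illustrative:

```python
# A hedged PySpark analogue of the Scala approach; paths are illustrative.
import io
import zipfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unzip-csv").getOrCreate()
numPartitionsProvided = 64  # large enough to keep every executor busy

def read_zip(record):
    path, content = record
    with zipfile.ZipFile(io.BytesIO(content)) as zf:
        for name in zf.namelist():
            with zf.open(name) as fh:
                for line in io.TextIOWrapper(fh, encoding="utf-8"):
                    yield (path, name, line.rstrip("\n"))

lines = (spark.sparkContext
         .binaryFiles("s3://my-bucket/zips/*.zip")
         .repartition(numPartitionsProvided)
         .flatMap(read_zip))

lines.toDF(["archive", "entry", "line"]).show(5, truncate=False)
```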
@dineshdharme
dineshdharme / ZipFileReader.scala
Created February 3, 2024 15:43
Reading compressed archive files for ETL.
https://stackoverflow.com/questions/77914457/unzipping-multiple-files-from-1-zip-files-using-emr/
The following is a solution in Scala. I had to do this before in my job, so I am extracting the relevant bits here.
A few important points to keep in mind.
If possible in your workflow, produce tar.gz archives of your files instead of zip, because I have tested this only with that format.
Secondly, repartition the RDD with `numPartitionsProvided` set to a suitably large number so that all your executors are utilized.
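A hedged PySpark analogue for the tar.gz case, again with illustrative paths, using the standard-library `tarfile` on the executors:

```python
# A hedged PySpark analogue for tar.gz archives; paths are illustrative.
import io
import tarfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("untar-csv").getOrCreate()
numPartitionsProvided = 64

def read_targz(record):
    path, content = record
    with tarfile.open(fileobj=io.BytesIO(content), mode="r:gz") as tf:
        for member in tf.getmembers():
            if member.isfile():
                fh = tf.extractfile(member)
                for line in io.TextIOWrapper(fh, encoding="utf-8"):
                    yield (path, member.name, line.rstrip("\n"))

lines = (spark.sparkContext
         .binaryFiles("s3://my-bucket/archives/*.tar.gz")
         .repartition(numPartitionsProvided)
         .flatMap(read_targz))

lines.toDF(["archive", "member", "line"]).show(5, truncate=False)
```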
@dineshdharme
dineshdharme / TimeConversionQuandry.py
Created December 12, 2023 10:12
Make sure the fractional part of your timestamps only goes to milliseconds; additional precision causes a parsing error in Spark.
The fractional seconds in your timestamp (".71910") have five digits. Spark expects up to three digits for fractional seconds (milliseconds). Having more than three digits can cause a parsing error.
Here's the modified code, which works.
import sys
from pyspark import SparkContext, SQLContext
from pyspark.sql import functions as F
import dateutil.parser
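A minimal sketch of the fix: trim the fractional seconds to three digits before parsing. The column name and input format are illustrative:

```python
# A minimal sketch: truncate fractional seconds to milliseconds, then parse.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ts-parse").getOrCreate()

df = spark.createDataFrame([("2023-12-12 10:12:01.71910",)], ["ts_str"])

parsed = df.withColumn(
    "ts",
    F.to_timestamp(
        F.regexp_replace("ts_str", r"(\.\d{3})\d+$", "$1"),  # ".71910" -> ".719"
        "yyyy-MM-dd HH:mm:ss.SSS",
    ),
)
parsed.show(truncate=False)
```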
@dineshdharme
dineshdharme / GraphXFindParents.py
Created December 8, 2023 14:53
Finding a path from source to destination using GraphFrames in PySpark.
I am adapting my previous answer from here:
https://gist.github.com/dineshdharme/7c13dcde72e42fdd3ec47d1ad40f6177
The GraphFrames jar (242 KB) can be found here:
https://mvnrepository.com/artifact/graphframes/graphframes/0.8.1-spark3.0-s_2.12
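A hedged sketch of path-finding with GraphFrames BFS, assuming the jar above (and the `graphframes` Python package) is available; the vertices and edges are illustrative:

```python
# A hedged sketch of GraphFrames BFS; graph data is illustrative.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("graphframes-bfs")
         .config("spark.jars.packages",
                 "graphframes:graphframes:0.8.1-spark3.0-s_2.12")
         .getOrCreate())

from graphframes import GraphFrame  # importable once the package/jar is available

vertices = spark.createDataFrame([("a",), ("b",), ("c",), ("d",)], ["id"])
edges = spark.createDataFrame([("a", "b"), ("b", "c"), ("c", "d")],
                              ["src", "dst"])

g = GraphFrame(vertices, edges)

# Breadth-first search from source "a" to destination "d".
paths = g.bfs(fromExpr="id = 'a'", toExpr="id = 'd'")
paths.show(truncate=False)
```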
Requirements:
Since your system of equations is underdetermined, i.e. there are more unknown variables than equations, you will get parametrized solutions.
You can solve systems of linear equations (underdetermined, overdetermined, or uniquely determined) using the `sympy` library.
I have adapted the following Stack Overflow solution to give an example of how you can solve your equations.
https://stackoverflow.com/a/50048060/3238085
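A minimal sympy sketch: two equations in three unknowns, so `linsolve` returns a parametrized solution set. The equations are illustrative:

```python
# A minimal sketch: an underdetermined system solved with sympy's linsolve.
from sympy import symbols, linsolve

x, y, z = symbols("x y z")

# x + y + z = 6 and x - y = 1, written as expressions equal to zero
solution = linsolve([x + y + z - 6, x - y - 1], [x, y, z])
print(solution)  # {(7/2 - z/2, 5/2 - z/2, z)} -- z stays a free parameter
```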
@dineshdharme
dineshdharme / FaceDetectionFFmpegInPyspark.py
Created November 29, 2023 09:07
Finding bounding boxes for faces in a video sampled at a particular rate.
I have adapted the following Jupyter notebook to show how Spark can do video processing at scale.
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1969271421694072/3760413548916830/5612335034456173/latest.html
You need to install these Python libraries in your conda environment. Also make sure you have the ffmpeg library installed natively:
`pip install ffmpeg-python`
`pip install face-recognition`
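A hedged sketch of the per-frame face step: decode sampled frames with ffmpeg-python, then run `face_recognition` on each frame inside a Spark task. The paths and the helper `face_boxes` are illustrative:

```python
# A hedged sketch: sample frames with ffmpeg, detect faces with face_recognition.
import ffmpeg
import numpy as np
import face_recognition
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("face-boxes").getOrCreate()

def face_boxes(path, fps=1, width=640, height=360):
    out, _ = (ffmpeg.input(path)
              .filter("fps", fps=fps)
              .filter("scale", width, height)
              .output("pipe:", format="rawvideo", pix_fmt="rgb24")
              .run(capture_stdout=True, quiet=True))
    frames = np.frombuffer(out, np.uint8).reshape([-1, height, width, 3])
    for i, frame in enumerate(frames):
        # face_locations returns a (top, right, bottom, left) box per detected face
        for box in face_recognition.face_locations(np.ascontiguousarray(frame)):
            yield (path, i, box)

paths = ["/data/videos/a.mp4"]  # hypothetical
boxes = spark.sparkContext.parallelize(paths).flatMap(face_boxes).collect()
print(boxes[:5])
```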