I have adapted the following Jupyter notebook to show how Spark can do video processing at scale:
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1969271421694072/3760413548916830/5612335034456173/latest.html

You need to install the Python libraries below in your conda environment. Also make sure you have the ffmpeg library installed natively:

`pip install ffmpeg-python`
`pip install face-recognition`
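The overall pattern is to give each Spark task one whole video to decode and analyze. A minimal sketch of that pattern follows; the helper names (`sample_frame_indices`, `process_video`) are mine, not from the notebook, and the body of `process_video` is shown as comments because it needs the native ffmpeg binary to execute.

```python
def sample_frame_indices(total_frames, every_nth):
    """Indices of the frames we actually run face detection on;
    detecting on every single frame is usually wasteful."""
    return list(range(0, total_frames, every_nth))

def process_video(path):
    """Runs inside one Spark task: decode frames with ffmpeg-python,
    then count faces per sampled frame with face_recognition.
    Shown as comments because it needs the native ffmpeg binary."""
    # import ffmpeg, face_recognition
    # import numpy as np
    # out, _ = (ffmpeg.input(path)
    #           .output("pipe:", format="rawvideo", pix_fmt="rgb24")
    #           .run(capture_stdout=True, quiet=True))
    # frames = np.frombuffer(out, np.uint8).reshape(-1, height, width, 3)
    # keep = sample_frame_indices(len(frames), every_nth=10)
    # return [len(face_recognition.face_locations(frames[i])) for i in keep]
    raise NotImplementedError

# Driver side (hypothetical): one partition per video so that all
# executors stay busy.
# spark.sparkContext.parallelize(video_paths, len(video_paths)) \
#      .map(process_video).collect()
```

Note that ffmpeg-python and face-recognition must be installed on every executor node, not just the driver.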
Question: https://stackoverflow.com/questions/78080522/md5-hash-of-huge-files-using-pyspark/

A workflow that can help you achieve this:

Since this is one large file of size 2 TB, you first need to split it into smaller chunks of, say, 1 GB. The reason for splitting is this:
https://community.databricks.com/t5/community-discussions/very-large-binary-files-ingestion-error-when-using-binaryfile/td-p/47440
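A sketch of the chunking step in plain Python follows (the chunk naming scheme is my own; in the real workflow the chunks would live on cloud storage and be read by Spark's `binaryFile` source). The key property the workflow relies on is that MD5 is a streaming hash: feeding the chunks, in order, into one running digest reproduces the MD5 of the unsplit file.

```python
import hashlib
import os

def split_file(path, chunk_size, out_dir):
    """Split a large file into ordered chunk files of at most
    `chunk_size` bytes -- the preprocessing step needed before the
    `binaryFile` reader, which errors out on very large single files."""
    paths = []
    with open(path, "rb") as src:
        index = 0
        while True:
            data = src.read(chunk_size)
            if not data:
                break
            chunk_path = os.path.join(out_dir, f"chunk_{index:05d}")
            with open(chunk_path, "wb") as dst:
                dst.write(data)
            paths.append(chunk_path)
            index += 1
    return paths

def md5_over_chunks(chunk_paths):
    """Fold the chunks, in lexicographic (= creation) order, into one
    running MD5 digest. Because MD5 is a streaming hash, the result
    equals the MD5 of the original unsplit file."""
    digest = hashlib.md5()
    for p in sorted(chunk_paths):
        with open(p, "rb") as f:
            digest.update(f.read())
    return digest.hexdigest()
```

The in-order fold has to happen at a single point (e.g. on the driver), since MD5 digests of separate chunks cannot be combined after the fact.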
https://stackoverflow.com/questions/78052071/pyspark-count-over-a-window-with-reset/78060131#78060131

I modified my answer from https://stackoverflow.com/a/78056548/3238085 to fit this problem setup.

import sys
from pyspark.sql import Window
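The full PySpark code is in the gist file; the reset logic itself boils down to a sequential accumulator, which I sketch here in plain Python (the function name is mine). In Spark this kind of order-dependent scan is typically expressed by collecting the rows over a Window and folding them with `aggregate()`.

```python
def running_count_with_reset(reset_flags):
    """Running row count that restarts at 1 on every row whose reset
    flag is set. A plain window count cannot express this, because
    each row's value depends on where the previous reset happened."""
    counts, current = [], 0
    for flag in reset_flags:
        current = 1 if flag else current + 1
        counts.append(current)
    return counts
```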
https://stackoverflow.com/questions/78050162/pyspark-group-by-date-range/

I used the following answer as inspiration for the code below. Basically, clever use of a complex accumulator function allows the grouping index to be computed properly:
https://stackoverflow.com/a/64957835/3238085

import sys
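The accumulator idea can be sketched in plain Python as follows (function name and `max_span_days` parameter are mine, for illustration): a new group starts whenever the current date falls outside the range opened by the current group's first date.

```python
from datetime import date, timedelta

def date_range_groups(dates, max_span_days):
    """Assign each date in a sorted sequence a group index; a new
    group starts whenever the date is more than `max_span_days` days
    after the first date of the current group. Each row's group
    depends on where the previous group started, which is why the
    PySpark answer needs an accumulator in aggregate() rather than a
    plain window function."""
    groups, group_idx, group_start = [], 0, None
    for d in dates:
        if group_start is None:
            group_start = d
        elif (d - group_start).days > max_span_days:
            group_idx += 1
            group_start = d
        groups.append(group_idx)
    return groups
```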
https://stackoverflow.com/questions/77914457/unzipping-multiple-files-from-1-zip-files-using-emr/

Porting my previous answer from a tarred gzipped archive to a zipped archive wasn't that difficult.

Important point to keep in mind: repartition the RDD (`numPartitionsProvided`) to a suitably large number so that all your executors are utilized.

`ZipFileReader.scala`
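The reader in the gist is Scala (walking a `ZipInputStream` over each element of `sc.binaryFiles()`); the equivalent per-archive logic can be sketched with Python's standard library (function name mine):

```python
import io
import zipfile

def read_zip_entries(zip_bytes):
    """Given the raw bytes of one zip archive -- what each element of
    sc.binaryFiles() yields -- return (entry_name, contents) pairs,
    the Python analogue of walking a ZipInputStream in the Scala
    ZipFileReader."""
    entries = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            if not name.endswith("/"):  # skip directory entries
                entries.append((name, zf.read(name)))
    return entries
```

In a PySpark job, this function would be applied inside a `flatMap` over the repartitioned `binaryFiles()` RDD.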
https://stackoverflow.com/questions/77914457/unzipping-multiple-files-from-1-zip-files-using-emr/

The following is a solution in Scala. I had to do this before in my job, so I am extracting the relevant bits here.

A few important points to keep in mind:

If possible in your workflow, create a tar.gz of your files instead of a zip, because I have only tested it with that format.
Secondly, repartition the RDD (`numPartitionsProvided`) to a suitably large number so that all your executors are utilized.
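For the tar.gz path recommended above, the per-archive extraction can be sketched with Python's standard library (function name mine; the gist's actual implementation is the Scala shown in the file):

```python
import io
import tarfile

def read_targz_entries(targz_bytes):
    """Given the raw bytes of one .tar.gz archive, return
    (member_name, contents) pairs. In the Spark job, each task runs
    this on one archive obtained from sc.binaryFiles(), after
    repartitioning so every executor gets work."""
    entries = []
    with tarfile.open(fileobj=io.BytesIO(targz_bytes), mode="r:gz") as tf:
        for member in tf.getmembers():
            if member.isfile():
                entries.append((member.name, tf.extractfile(member).read()))
    return entries
```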
The fractional seconds in your timestamp (".71910") have five digits. Spark expects up to three digits (milliseconds) for fractional seconds; having more than three digits can cause a parsing error.

Here's the modified code, which works:

import sys
from pyspark import SparkContext, SQLContext
from pyspark.sql import functions as F
import dateutil.parser
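Two ways around the five-digit fraction, both sketched below (the `truncate_fraction` helper is mine, for illustration): truncate the string to three fractional digits so Spark's `yyyy-MM-dd HH:mm:ss.SSS` pattern can parse it, or parse in a UDF with `dateutil`, which handles arbitrary fraction lengths.

```python
import re
import dateutil.parser

def truncate_fraction(ts, digits=3):
    """Trim fractional seconds to at most `digits` places so a Spark
    pattern like 'yyyy-MM-dd HH:mm:ss.SSS' can parse the string.
    Strings without a fractional part pass through unchanged."""
    return re.sub(rf"(\.\d{{1,{digits}}})\d*", r"\1", ts)

# dateutil itself has no trouble with five fractional digits, so a
# UDF wrapping dateutil.parser.parse is the other option:
parsed = dateutil.parser.parse("2024-03-01 10:15:30.71910")
```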
I am adapting my previous answer from here:
https://gist.github.com/dineshdharme/7c13dcde72e42fdd3ec47d1ad40f6177

The GraphFrames jar (242 KB) can be found at this location:
https://mvnrepository.com/artifact/graphframes/graphframes/0.8.1-spark3.0-s_2.12

Requirements:
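One common way to attach GraphFrames is to let `spark-submit` resolve it from the spark-packages repository rather than downloading the jar by hand; the script name `my_graph_job.py` below is a placeholder, and the coordinates must match your Spark/Scala build (here 0.8.1 for Spark 3.0 / Scala 2.12, matching the mvnrepository link above).

```shell
# Resolve GraphFrames from the spark-packages repository at submit time.
spark-submit \
  --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12 \
  --repositories https://repos.spark-packages.org \
  my_graph_job.py

# Or, if you downloaded the jar from the mvnrepository link above:
# spark-submit --jars graphframes-0.8.1-spark3.0-s_2.12.jar my_graph_job.py
```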
Since your system of equations is underdetermined, i.e. there are more unknown variables than equations, you will get parametrized solutions.

You can solve a system of linear equations (underdetermined, overdetermined, or uniquely determined) using the `sympy` library. I have adapted the following Stack Overflow solution to give an example of how you can solve your equations:
https://stackoverflow.com/a/50048060/3238085
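A small worked example of what a parametrized solution looks like (the symbols and equations here are illustrative, not your actual system): with two equations in three unknowns, `linsolve` expresses the solution in terms of the free variable.

```python
import sympy as sp

# Illustrative underdetermined system: 2 equations, 3 unknowns.
x, y, z = sp.symbols("x y z")
equations = [
    sp.Eq(x + y + z, 6),
    sp.Eq(x - y, 1),
]

# linsolve returns a parametrized solution set: x and y come out as
# expressions in the free variable z.
solution = sp.linsolve(equations, [x, y, z])
print(solution)
```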