@dineshdharme
dineshdharme / MissingValuesInterpolationAndWindowIdentification.py
Created April 11, 2024 12:28
Interpolate missing values in a timeseries
https://stackoverflow.com/questions/78304441/how-can-i-interpolate-missing-values-based-on-the-sum-of-the-gap-using-pyspark/
This was a nice fun problem to solve.
In PySpark, you can populate a column over a window specification with the first or last non-null value.
We can also identify each contiguous run of nulls and rank the rows within it.
Once we have those two pieces of information, calculating the interpolated values is straightforward arithmetic.
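A minimal sketch of that approach (the column names and the toy data are assumptions, not taken from the gist): forward-fill the previous non-null value and timestamp, back-fill the next ones, then interpolate linearly.

```python
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("interpolate").getOrCreate()

# hypothetical timeseries with a gap at ts=2 and ts=3
df = spark.createDataFrame(
    [(1, 10.0), (2, None), (3, None), (4, 40.0)], ["ts", "value"]
)

w_prev = Window.orderBy("ts").rowsBetween(Window.unboundedPreceding, 0)
w_next = Window.orderBy("ts").rowsBetween(0, Window.unboundedFollowing)

df = (
    df
    # last/first non-null value (and its timestamp) around each gap
    .withColumn("prev_val", F.last("value", ignorenulls=True).over(w_prev))
    .withColumn("next_val", F.first("value", ignorenulls=True).over(w_next))
    .withColumn("prev_ts", F.last(F.when(F.col("value").isNotNull(), F.col("ts")), ignorenulls=True).over(w_prev))
    .withColumn("next_ts", F.first(F.when(F.col("value").isNotNull(), F.col("ts")), ignorenulls=True).over(w_next))
    # linear interpolation between the two anchor points
    .withColumn(
        "value_filled",
        F.coalesce(
            "value",
            F.col("prev_val")
            + (F.col("next_val") - F.col("prev_val"))
            * (F.col("ts") - F.col("prev_ts"))
            / (F.col("next_ts") - F.col("prev_ts")),
        ),
    )
)
df.show()
```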
@dineshdharme
dineshdharme / MaximalBipartiteMatchingGraphProblem.py
Last active April 9, 2024 16:21
A maximum bipartite matching algorithm solution.
https://stackoverflow.com/questions/78294920/select-unique-pairs-from-pyspark-dataframe
As @Abdennacer Lachiheb mentioned in the comments, this is indeed a bipartite matching problem. It is unlikely to be solved correctly in PySpark or with GraphFrames. The best approach is to solve it with a graph algorithm library's `hopcroft_karp_matching`, e.g. `NetworkX`'s, or with `scipy.optimize.linear_sum_assignment`.
`hopcroft_karp_matching`: pure Python code, runs in O(E√V) time, where E is the number of edges and V is the number of vertices in the graph.
`scipy.optimize.linear_sum_assignment`: O(n^3) complexity, but written in C++.
Only practical testing can determine which works better at your data sizes.
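A minimal sketch with `NetworkX` (the candidate pairs are made up for illustration):

```python
import networkx as nx

# hypothetical candidate pairs collected from the PySpark dataframe
pairs = [("a1", "b1"), ("a1", "b2"), ("a2", "b2"), ("a3", "b2")]

left = {u for u, _ in pairs}
G = nx.Graph()
G.add_nodes_from(left, bipartite=0)
G.add_nodes_from({v for _, v in pairs}, bipartite=1)
G.add_edges_from(pairs)

# maximum-cardinality matching in O(E * sqrt(V)) time
matching = nx.bipartite.hopcroft_karp_matching(G, top_nodes=left)
# the result maps both directions; keep only left -> right entries
unique_pairs = sorted((u, v) for u, v in matching.items() if u in left)
print(unique_pairs)
```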
@dineshdharme
dineshdharme / DynamicJsonFormatting.py
Created April 8, 2024 13:43
Dynamic JSON formatting in PySpark using the `schema_of_json` function.
https://stackoverflow.com/questions/78290764/flatten-dynamic-json-payload-string-using-pyspark/
There is a nifty method `schema_of_json` in PySpark which derives the schema of a JSON string; that schema can then be applied to the whole column.
The method for handling dynamic JSON payloads is as follows (see the sketch after this list):
- First take `json_payload` of first row of dataframe
- Create a schema of the json_payload using `schema_of_json`
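A minimal sketch of those steps (the payload contents and field names are made up for illustration):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("dynamic-json").getOrCreate()

# hypothetical dataframe with a dynamic JSON payload column
df = spark.createDataFrame(
    [('{"id": 1, "user": {"name": "a", "age": 30}}',)], ["json_payload"]
)

# take the json_payload of the first row and derive its schema
sample = df.select("json_payload").first()[0]
schema_ddl = df.select(F.schema_of_json(F.lit(sample))).first()[0]

# apply the derived schema to the whole column, then flatten
parsed = df.withColumn("parsed", F.from_json("json_payload", schema_ddl))
parsed.select("parsed.*").show(truncate=False)
```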
@dineshdharme
dineshdharme / ParallelAPICallsInPysparkUDF_Example.py
Created April 5, 2024 13:44
A demo PySpark script showing how to make parallel API calls using Spark.
Here's a helpful example of using DataFrames and making parallel API calls.
```python
import json
import sys
from pyspark.sql import SQLContext
import requests
from pyspark.sql.functions import *
from pyspark.sql.types import *
```
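Beyond those imports, a minimal sketch of the pattern (the endpoint URL and column names are hypothetical): repartition so rows spread across executor cores, then issue the requests inside `mapPartitions`, reusing one HTTP session per partition.

```python
import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallel-api").getOrCreate()

# hypothetical ids to look up against a placeholder endpoint
ids_df = spark.createDataFrame([(i,) for i in range(100)], ["id"])
API_URL = "https://api.example.com/items/{}"  # placeholder, not a real API

def call_api(rows):
    # one session per partition so TCP connections are reused
    with requests.Session() as session:
        for row in rows:
            resp = session.get(API_URL.format(row.id), timeout=10)
            yield (row.id, resp.status_code, resp.text)

# each of the 8 partitions issues its requests on a separate executor core
result = ids_df.repartition(8).rdd.mapPartitions(call_api).toDF(["id", "status", "body"])
result.show()
```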
@dineshdharme
dineshdharme / ParsingBooleanExpressionUsingLark.py
Created April 4, 2024 15:50
Parsing Boolean Expressions Using Lark.
https://stackoverflow.com/questions/78272962/split-strings-containing-nested-brackets-in-spark-sql
It is very easy to do with the Lark Python library.
$ `pip install lark --upgrade`
Then you need to create a grammar which is able to parse your expressions.
The following is the script:
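A minimal sketch of such a grammar (the operator keywords and token names here are assumptions, not necessarily the gist's exact script):

```python
from lark import Lark

# grammar for boolean expressions with nested brackets
grammar = r"""
    ?expr: expr "OR" term     -> or_
         | term
    ?term: term "AND" factor  -> and_
         | factor
    ?factor: "NOT" factor     -> not_
           | "(" expr ")"
           | NAME
    NAME: /[A-Za-z_]\w*/
    %import common.WS
    %ignore WS
"""

parser = Lark(grammar, start="expr")
tree = parser.parse("a AND (b OR (c AND NOT d))")
print(tree.pretty())
```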
@dineshdharme
dineshdharme / Suggestion.py
Created March 31, 2024 08:20
Rooted Minimum Spanning Tree in a Directed Graph.
https://stackoverflow.com/questions/78244909/graphframes-pyspark-route-compaction/78248893#78248893
You can possibly use NetworkX's Edmonds algorithm to find the minimum spanning arborescence rooted at a particular node in a given directed graph.
https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.tree.branchings.Edmonds.html
In graph theory, an arborescence is a directed graph having a distinguished vertex u (called the root) such that, for any other vertex v, there is exactly one directed path from u to v.
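A minimal sketch with `networkx` (the graph and weights are made up; forcing the root by deleting its incoming edges is one common trick, not necessarily the gist's):

```python
import networkx as nx

# hypothetical weighted directed graph
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("A", "B", 1), ("A", "C", 5), ("B", "C", 1),
    ("C", "D", 2), ("B", "D", 4),
])

# force the arborescence to be rooted at "A" by deleting A's incoming edges
root = "A"
G.remove_edges_from(list(G.in_edges(root)))

arb = nx.minimum_spanning_arborescence(G, attr="weight")
print(sorted(arb.edges(data="weight")))
```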
@dineshdharme
dineshdharme / ClusteringNamesUsingMinHashLSH.py
Created March 20, 2024 11:29
Clustering similar text using MinHash and LSH.
https://stackoverflow.com/questions/78186018/fuzzy-logic-to-match-the-records-in-a-dataframe/78192904#78192904
Here's another implementation which does the same thing. This time using MinHash and LSH.
Here's an article which explains this.
https://spotintelligence.com/2023/01/02/minhash/
First, install `datasketch` and `networkx`
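A minimal sketch of the pipeline (the `0.4` threshold, shingle width, and example names are assumptions): MinHash each name's character shingles, index them with LSH, then read the clusters off as connected components of the candidate-pair graph.

```python
import networkx as nx
from datasketch import MinHash, MinHashLSH

names = ["john smith", "jon smith", "jane doe", "john smyth"]

def minhash(name, num_perm=128, width=3):
    # hash the name's character shingles into a MinHash signature
    m = MinHash(num_perm=num_perm)
    for i in range(max(1, len(name) - width + 1)):
        m.update(name[i:i + width].encode("utf8"))
    return m

hashes = {name: minhash(name) for name in names}
lsh = MinHashLSH(threshold=0.4, num_perm=128)
for name, m in hashes.items():
    lsh.insert(name, m)

# connect each name to its LSH candidates, then take connected
# components as the clusters
G = nx.Graph()
G.add_nodes_from(names)
for name, m in hashes.items():
    for match in lsh.query(m):
        if match != name:
            G.add_edge(name, match)

print(list(nx.connected_components(G)))
```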
@dineshdharme
dineshdharme / ClusteringNamesUsingSimHashing.py
Created March 19, 2024 18:22
Clustering names using the SimHash algorithm, then refining each cluster with the fuzzywuzzy library.
https://stackoverflow.com/questions/78186018/fuzzy-logic-to-match-the-records-in-a-dataframe/78188853#78188853
I have taken inspiration from this blogpost to write the following code.
https://leons.im/posts/a-python-implementation-of-simhash-algorithm/
The `cluster_names` function clusters the strings within the list based on the `cluster_threshold` value. You can tweak this value to get good results. You can also play around with `shingling_width` in `name_to_features`: you can create features of width 2, 3, 4, 5, and so on, and concatenate them together.
Once you are satisfied with your clusters, you can run `fuzzywuzzy` matching (the library has been renamed to `thefuzz`) within each cluster to find more exact matches.
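A minimal sketch of the pieces described above (the helper names mirror the description, but the md5-based hashing, 64-bit width, and greedy clustering are my assumptions, not necessarily the gist's implementation):

```python
import hashlib

def name_to_features(name, shingling_width=3):
    # character shingles of the lowercased name
    name = name.lower()
    return [name[i:i + shingling_width]
            for i in range(max(1, len(name) - shingling_width + 1))]

def simhash(features, bits=64):
    # classic SimHash: sum +/-1 per bit position over all feature hashes
    v = [0] * bits
    for feat in features:
        h = int(hashlib.md5(feat.encode("utf8")).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

def cluster_names(names, cluster_threshold=12):
    # greedy clustering: attach a name to the first cluster whose
    # representative hash is within the Hamming-distance threshold
    clusters = []
    for name in names:
        h = simhash(name_to_features(name))
        for rep_hash, members in clusters:
            if hamming_distance(h, rep_hash) <= cluster_threshold:
                members.append(name)
                break
        else:
            clusters.append((h, [name]))
    return [members for _, members in clusters]

print(cluster_names(["john smith", "jon smith", "jane doe"]))
```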
@dineshdharme
dineshdharme / DaskPreProcessing50GBFile.py
Last active March 19, 2024 17:02
Preprocessing example of a file using Dask.
https://stackoverflow.com/questions/78162865/handling-column-breaks-in-pipe-delimited-file/78182964#78182964
You can use `dask` for this preprocessing task. The following code processes the 50GB file in blocks of 500MB and writes out the output in 5 partitions. Everything is a delayed/lazy operation, just as in Spark. You may have to remove the header line from the data and then provide the header in your Spark dataframe. Let me know how it goes.
Install Dask with:
`pip install dask[complete]`
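A minimal sketch of the pattern (the pipe count, file names, and broken-line heuristic are assumptions; rows split exactly at a 500MB block boundary would need extra handling):

```python
import dask.bag as db

EXPECTED_PIPES = 9  # assumed: a well-formed row of a 10-column file has 9 pipes

def merge_broken_lines(lines):
    # glue a physical line onto the previous one until the row has
    # enough pipe delimiters to be considered complete
    fixed, buffer = [], ""
    for line in lines:
        buffer += line.rstrip("\n")
        if buffer.count("|") >= EXPECTED_PIPES:
            fixed.append(buffer)
            buffer = ""
    if buffer:
        fixed.append(buffer)
    return fixed

bag = db.read_text("big_file.psv", blocksize="500MB")  # lazy, 500MB blocks
bag.map_partitions(merge_broken_lines).repartition(5).to_textfiles("cleaned-*.psv")
```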
@dineshdharme
dineshdharme / VideoProcessingAtScaleUsingSpark.py
Created March 11, 2024 06:10
An example demonstrating the use of PySpark for video processing.
I have adapted the following Jupyter notebook to show how Spark can do video processing at scale.
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1969271421694072/3760413548916830/5612335034456173/latest.html
You need to install these Python libraries in your conda environment. Also make sure the ffmpeg library is installed natively:
`pip install ffmpeg-python`
`pip install face-recognition`
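A minimal sketch of the idea (the file paths and one-frame-per-second sampling are assumptions, not the notebook's exact pipeline): parallelize video paths across executors, and have each task decode frames with `ffmpeg-python` and count faces with `face_recognition`.

```python
import numpy as np
import ffmpeg  # from the ffmpeg-python package
import face_recognition
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("video-faces").getOrCreate()

# hypothetical video files, reachable from every executor
video_paths = ["/data/videos/clip1.mp4", "/data/videos/clip2.mp4"]

def count_faces(path):
    # probe for the frame dimensions
    probe = ffmpeg.probe(path)
    stream = next(s for s in probe["streams"] if s["codec_type"] == "video")
    w, h = int(stream["width"]), int(stream["height"])
    # decode one frame per second as raw RGB bytes
    out, _ = (
        ffmpeg.input(path)
        .filter("fps", fps=1)
        .output("pipe:", format="rawvideo", pix_fmt="rgb24")
        .run(capture_stdout=True, quiet=True)
    )
    frames = np.frombuffer(out, np.uint8).reshape([-1, h, w, 3])
    return path, sum(len(face_recognition.face_locations(f)) for f in frames)

# one task per video; each executor decodes and scans its own files
results = spark.sparkContext.parallelize(video_paths, len(video_paths)).map(count_faces).collect()
print(results)
```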