https://stackoverflow.com/questions/78304441/how-can-i-interpolate-missing-values-based-on-the-sum-of-the-gap-using-pyspark/
This was a nice, fun problem to solve.
In PySpark, you can populate a column over a window specification with the first or last non-null value.
We can also identify each group of nulls that occur together as a bunch and then rank the rows within it.
Once we have those two values, calculating the interpolated values is straightforward.
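Here is a minimal sketch of those two window tricks (the sample data and column names are made up, not taken from the original question):

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: 'id' orders the rows, 'value' has runs of nulls.
df = spark.createDataFrame(
    [(1, 10.0), (2, None), (3, None), (4, 40.0)], ["id", "value"]
)

w = Window.orderBy("id")
prev_w = w.rowsBetween(Window.unboundedPreceding, 0)
next_w = w.rowsBetween(0, Window.unboundedFollowing)

df = (
    df
    # Carry the last non-null value forward and the next non-null value backward.
    .withColumn("prev_val", F.last("value", ignorenulls=True).over(prev_w))
    .withColumn("next_val", F.first("value", ignorenulls=True).over(next_w))
    # Group each run of nulls with its preceding non-null row, then rank within it.
    .withColumn("grp", F.count("value").over(prev_w))
    .withColumn("pos", F.row_number().over(Window.partitionBy("grp").orderBy("id")) - 1)
    .withColumn("gap", F.count(F.lit(1)).over(Window.partitionBy("grp")))
    # Linear interpolation between the bounding non-null values.
    .withColumn(
        "value_filled",
        F.when(
            F.col("value").isNull(),
            F.col("prev_val")
            + (F.col("next_val") - F.col("prev_val")) * F.col("pos") / F.col("gap"),
        ).otherwise(F.col("value")),
    )
)
df.select("id", "value", "value_filled").show()
```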
https://stackoverflow.com/questions/78294920/select-unique-pairs-from-pyspark-dataframe
As @Abdennacer Lachiheb mentioned in the comment, this is indeed a bipartite matching problem. It is unlikely to be solved correctly in PySpark or with GraphFrames. The best option is to use a graph library's `hopcroft_karp_matching`, such as NetworkX's, or `scipy.optimize.linear_sum_assignment`.
`hopcroft_karp_matching`: pure Python code, runs in O(E√V) time, where E is the number of edges and V is the number of vertices in the graph.
`scipy.optimize.linear_sum_assignment`: O(n^3) complexity, but written in C++.
So only practical testing can determine which works better at your data sizes.
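A minimal sketch of both options, assuming the candidate pairs have already been collected to the driver (the pair data below is made up):

```python
import networkx as nx
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical candidate pairs, e.g. collected from the dataframe.
pairs = [("a1", "b1"), ("a1", "b2"), ("a2", "b1"), ("a3", "b3")]
left = sorted({l for l, _ in pairs})
right = sorted({r for _, r in pairs})

# Option 1: Hopcroft-Karp maximum matching via NetworkX.
G = nx.Graph()
G.add_nodes_from(left, bipartite=0)
G.add_nodes_from(right, bipartite=1)
G.add_edges_from(pairs)
matching = nx.bipartite.hopcroft_karp_matching(G, top_nodes=left)
print({k: v for k, v in matching.items() if k in left})

# Option 2: linear_sum_assignment on a cost matrix
# (missing edges get a large cost so they are never chosen).
cost = np.full((len(left), len(right)), 1e6)
for l, r in pairs:
    cost[left.index(l), right.index(r)] = 0
rows, cols = linear_sum_assignment(cost)
print([(left[i], right[j]) for i, j in zip(rows, cols) if cost[i, j] == 0])
```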
https://stackoverflow.com/questions/78290764/flatten-dynamic-json-payload-string-using-pyspark/
There is a nifty method `schema_of_json` in PySpark which derives the schema of a JSON string; that schema can then be applied to the whole column.
So the method to handle dynamic JSON payloads is as follows:
- First, take the `json_payload` of the first row of the dataframe
- Create a schema of the `json_payload` using `schema_of_json`
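A minimal sketch of the idea (the sample payloads are made up); the derived schema is then applied to the whole column with `from_json` and the resulting struct is expanded:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical dataframe with a dynamic JSON string column.
df = spark.createDataFrame(
    [('{"id": 1, "name": "foo"}',), ('{"id": 2, "name": "bar"}',)],
    ["json_payload"],
)

# Derive the schema from the first row's payload ...
sample = df.select("json_payload").first()[0]
schema = F.schema_of_json(F.lit(sample))

# ... apply it to the whole column, then flatten the parsed struct.
flat = (
    df.withColumn("parsed", F.from_json("json_payload", schema))
      .select("parsed.*")
)
flat.show()
```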
Here's a helpful example of using DataFrames and making parallel API calls.
import json
import sys
from pyspark.sql import SQLContext
import requests
from pyspark.sql.functions import *
from pyspark.sql.types import *
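The rest of the original script isn't shown here; below is a minimal sketch of the core idea, where each partition makes its own HTTP calls (the endpoint and column names are hypothetical, and `requests` must be installed on the executors):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,), (3,)], ["record_id"])

def call_api(rows):
    # One HTTP session per partition; partitions run in parallel across executors.
    session = requests.Session()
    for row in rows:
        resp = session.get(f"https://api.example.com/items/{row.record_id}")
        yield (row.record_id, resp.status_code)

result = df.rdd.mapPartitions(call_api).toDF(["record_id", "status"])
result.show()
```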
https://stackoverflow.com/questions/78272962/split-strings-containing-nested-brackets-in-spark-sql
It is very easy to do with the `lark` Python library.
$ `pip install lark --upgrade`
Then you need to create a grammar which is able to parse your expressions.
Following is the script:
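The full script isn't reproduced here; this is a minimal sketch of what such a grammar could look like (the grammar and sample input are assumptions, not the exact code from the answer):

```python
from lark import Lark

# Assumed grammar: split a comma-separated expression while respecting
# nested round brackets (adapt the terminals to your real expressions).
grammar = r"""
    start: item ("," item)*
    item: (TEXT | bracketed)+
    bracketed: "(" (item ("," item)*)? ")"
    TEXT: /[^,()]+/
"""

parser = Lark(grammar, parser="lalr", keep_all_tokens=True)

def split_top_level(s):
    tree = parser.parse(s)
    items = [c for c in tree.children if getattr(c, "data", None) == "item"]
    # Re-assemble each top-level item from its leaf tokens.
    return ["".join(item.scan_values(lambda t: True)).strip() for item in items]

print(split_top_level("a, b(c, d(e)), f"))   # ['a', 'b(c, d(e))', 'f']
```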
https://stackoverflow.com/questions/78244909/graphframes-pyspark-route-compaction/78248893#78248893
You can possibly use NetworkX's implementation of Edmonds' algorithm to find a minimum spanning arborescence rooted at a particular vertex in a given directed graph.
https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.tree.branchings.Edmonds.html
In graph theory, an arborescence is a directed graph having a distinguished vertex u (called the root) such that, for any other vertex v, there is exactly one directed path from u to v.
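A small self-contained sketch (the toy graph and weights are made up); to force a particular root, one common trick is to drop that node's incoming edges first:

```python
import networkx as nx

# Hypothetical directed graph; the edges and weights are made up for illustration.
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("root", "a", 1), ("root", "b", 5),
    ("a", "b", 1), ("a", "c", 4),
    ("b", "c", 1),
])

# Force "root" to be the root by removing its incoming edges (it has none here),
# then compute the minimum spanning arborescence with Edmonds' algorithm.
G.remove_edges_from(list(G.in_edges("root")))
arb = nx.minimum_spanning_arborescence(G)
print(sorted(arb.edges(data="weight")))
# [('a', 'b', 1), ('b', 'c', 1), ('root', 'a', 1)]
```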
https://stackoverflow.com/questions/78186018/fuzzy-logic-to-match-the-records-in-a-dataframe/78192904#78192904
Here's another implementation which does the same thing, this time using MinHash and LSH.
Here's an article which explains this approach:
https://spotintelligence.com/2023/01/02/minhash/
First, install `datasketch` and `networkx`
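A minimal sketch of the MinHash/LSH idea (the names, shingle width and threshold are illustrative, not the exact code from the answer):

```python
from datasketch import MinHash, MinHashLSH
import networkx as nx

names = ["acme corp", "acme corporation", "globex inc", "globex incorporated"]

def minhash(name, num_perm=128):
    m = MinHash(num_perm=num_perm)
    # 3-character shingles as the feature set.
    for i in range(len(name) - 2):
        m.update(name[i:i + 3].encode("utf8"))
    return m

hashes = {n: minhash(n) for n in names}

# LSH index: query() returns candidates whose estimated Jaccard similarity
# is above the threshold (tune the threshold for your data).
lsh = MinHashLSH(threshold=0.4, num_perm=128)
for n, m in hashes.items():
    lsh.insert(n, m)

# Connect each name to its candidates and take connected components as clusters.
G = nx.Graph()
G.add_nodes_from(names)
for n, m in hashes.items():
    for candidate in lsh.query(m):
        if candidate != n:
            G.add_edge(n, candidate)

print([sorted(c) for c in nx.connected_components(G)])
```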
https://stackoverflow.com/questions/78186018/fuzzy-logic-to-match-the-records-in-a-dataframe/78188853#78188853
I have taken inspiration from this blog post to write the following code.
https://leons.im/posts/a-python-implementation-of-simhash-algorithm/
The `cluster_names` function just clusters the strings within the list based on the `cluster_threshold` value. You can tweak this value to get good results. You can also play around with `shingling_width` in `name_to_features`: you can create features of width 2, 3, 4, 5 and so on and concatenate them together.
Once you are satisfied with your clusters, you can further run `fuzzywuzzy` (this library has been renamed to `thefuzz`) matching to find more exact matches.
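The `name_to_features` and `cluster_names` helpers themselves aren't shown here; below is a hedged sketch of what the multi-width shingling and the follow-up `thefuzz` pass could look like (names and widths are assumptions):

```python
from thefuzz import fuzz

# Hypothetical sketch of the shingling described above: character n-grams of
# several widths concatenated into one feature list.
def name_to_features(name, widths=(2, 3, 4)):
    name = name.lower()
    features = []
    for w in widths:
        features.extend(name[i:i + w] for i in range(len(name) - w + 1))
    return features

print(name_to_features("acme", widths=(2, 3)))  # ['ac', 'cm', 'me', 'acm', 'cme']

# Follow-up pass within a cluster using thefuzz to score near-exact matches.
for a, b in [("Acme Corp", "ACME Corporation"), ("Acme Corp", "Acme Inc")]:
    print(a, "<->", b, fuzz.token_sort_ratio(a, b))
```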
https://stackoverflow.com/questions/78162865/handling-column-breaks-in-pipe-delimited-file/78182964#78182964
You can use `dask` for this preprocessing task. The following code processes the 50GB file in blocks of 500MB and writes the output in 5 partitions. Everything is a delayed/lazy operation, just like in Spark. Let me know how it goes. You may have to remove the header line from the data and then provide the header in your Spark dataframe.
Install dask with
`pip install dask[complete]`
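The original code isn't reproduced here; this is a minimal sketch of the approach with a dask bag, where `fix_line` is a hypothetical placeholder for the actual line-repair logic:

```python
import dask.bag as db

# A minimal sketch, assuming the broken records can be repaired line by line.
def fix_line(line):
    # Hypothetical repair: strip stray carriage returns / trailing line breaks.
    return line.rstrip("\r\n")

bag = db.read_text("pipe_delimited_50gb.txt", blocksize="500MB")  # ~500MB blocks
fixed = bag.map(fix_line).repartition(npartitions=5)              # 5 output partitions
fixed.to_textfiles("preprocessed/part-*.txt")                     # triggers the computation
```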
I have adapted the following Jupyter notebook to show how Spark can do video processing at scale.
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1969271421694072/3760413548916830/5612335034456173/latest.html
You need to install the Python libraries in your conda environment. Also make sure you have the ffmpeg library installed natively:
`pip install ffmpeg-python`
`pip install face-recognition`
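A small local sketch of the per-video work that would be distributed with Spark (the file paths are hypothetical):

```python
import os
import ffmpeg
import face_recognition

video_path = "sample.mp4"          # hypothetical input video
os.makedirs("frames", exist_ok=True)

# Extract one frame per second with ffmpeg-python.
(
    ffmpeg
    .input(video_path)
    .output("frames/frame_%04d.jpg", vf="fps=1")
    .overwrite_output()
    .run(quiet=True)
)

# Detect faces in one of the extracted frames with face-recognition.
image = face_recognition.load_image_file("frames/frame_0001.jpg")
locations = face_recognition.face_locations(image)
print(f"found {len(locations)} face(s): {locations}")
```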