
dineshdharme / Windows12HourSumPerRow.py
Created May 17, 2024 14:03
A clever way to sum all departures within a window before current row arrival.
https://stackoverflow.com/questions/78490654/calculate-rolling-counts-from-two-different-time-series-columns-in-pyspark/78495948#78495948
Here's a clever way to figure out all departures before the current row's arrival.
Label the corresponding times with an arrival flag, i.e. `"A"`, or a departure flag, `"D"`.
Now union these two dataframes.
Order the unioned dataframe by time, irrespective of label.
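Below is a minimal sketch of that event-stream trick, assuming a toy dataframe with `arrival` and `departure` timestamp columns and a 12-hour lookback; the column names, sample data, and window length are illustrative assumptions, not the gist's actual code.

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# Toy data: each row has an arrival and a departure timestamp (made-up values).
df = (spark.createDataFrame(
        [("2024-05-17 08:00:00", "2024-05-17 09:30:00"),
         ("2024-05-17 10:00:00", "2024-05-17 11:00:00"),
         ("2024-05-17 21:00:00", "2024-05-17 22:00:00")],
        ["arrival", "departure"])
      .withColumn("arrival", F.to_timestamp("arrival"))
      .withColumn("departure", F.to_timestamp("departure")))

# Label each time with its flag and union into a single event stream.
arrivals = df.select(F.col("arrival").alias("event_time"), F.lit("A").alias("flag"))
departures = df.select(F.col("departure").alias("event_time"), F.lit("D").alias("flag"))
events = arrivals.unionByName(departures)

# Rolling count of departures in the 12 hours up to each event, ordered by time only.
twelve_hours = 12 * 3600
w = Window.orderBy(F.col("event_time").cast("long")).rangeBetween(-twelve_hours, 0)
events = events.withColumn(
    "departures_last_12h",
    F.sum(F.when(F.col("flag") == "D", 1).otherwise(0)).over(w),
)

# The arrival rows now carry the rolling departure count.
events.filter(F.col("flag") == "A").orderBy("event_time").show(truncate=False)
```

Using a range window keyed on epoch seconds means the count covers the preceding 12 hours of wall-clock time rather than a fixed number of rows.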
dineshdharme / InLineManager.py
Last active May 11, 2024 09:11
Find the inline manager list all the way up to the CEO, i.e. the top, using the graph-tool library
I'm working with hierarchical data in PySpark where each employee has a manager, and I need to find all the inline managers for each employee. An inline manager is defined as the manager of the manager, and so on, until we reach the top-level manager (CEO) who does not have a manager.
Is it necessary that you use PySpark in Databricks?
If yes, this answer could help you. It does exactly that.
https://stackoverflow.com/a/77627393/3238085
Here is another solution which can help you. Unfortunately, the person who asked the question has since deleted it.
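The gist itself uses graph-tool; as a hedged alternative sketch of the same idea, the snippet below broadcasts the employee-to-manager map and walks up the chain in a UDF. The `employee`/`manager` column names and sample rows are assumptions for illustration.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.getOrCreate()

# Toy hierarchy (made-up): (employee, manager); the CEO has a null manager.
df = spark.createDataFrame(
    [("alice", None), ("bob", "alice"), ("carol", "bob"), ("dave", "bob")],
    "employee string, manager string",
)

# Collect the employee -> manager mapping and broadcast it to the executors.
manager_of = dict(df.rdd.map(lambda r: (r["employee"], r["manager"])).collect())
bc = spark.sparkContext.broadcast(manager_of)

@F.udf(ArrayType(StringType()))
def inline_managers(employee):
    # Walk up the chain until we reach someone without a manager (the CEO).
    chain, current = [], bc.value.get(employee)
    while current is not None:
        chain.append(current)
        current = bc.value.get(current)
    return chain

df.withColumn("inline_managers", inline_managers("employee")).show(truncate=False)
```

This only makes sense when the whole hierarchy fits comfortably on the driver; for very large hierarchies a graph library, as in the gist, is the better route.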
dineshdharme / BatchWriteParquetFromStructuredStreaming.py
Created April 28, 2024 19:24
BatchWriteParquetFromStructuredStreaming.py
This is not a perfect solution, but since a streaming solution would be more suitable, I'm providing it as an option.
Adapted from the socket example below:
https://github.com/abulbasar/pyspark-examples/blob/master/structured-streaming-socket.py
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html (search for 'socket' in this webpage)
To figure out whether processing is finished, just check for this line in the logs.
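A minimal sketch of that streaming option, assuming lines arrive on a local socket (start one with `nc -lk 9999`) and each micro-batch is written out as parquet; the output and checkpoint paths under `/tmp` are placeholders, not the gist's actual locations.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("socket-to-parquet").getOrCreate()

# Read lines from a local socket (e.g. started with: nc -lk 9999).
lines = (spark.readStream
              .format("socket")
              .option("host", "localhost")
              .option("port", 9999)
              .load())

# Minimal transformation so the parquet files carry more than one column.
events = lines.withColumn("ingested_at", F.current_timestamp())

# Write each micro-batch as parquet; a checkpoint location is required.
query = (events.writeStream
               .format("parquet")
               .option("path", "/tmp/stream_output")
               .option("checkpointLocation", "/tmp/stream_checkpoint")
               .outputMode("append")
               .trigger(processingTime="10 seconds")
               .start())

# Blocks until the stream is stopped manually.
query.awaitTermination()
```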
dineshdharme / MissingValuesInterpolationAndWindowIdentification.py
Created April 11, 2024 12:28
Interpolate missing values in a timeseries
https://stackoverflow.com/questions/78304441/how-can-i-interpolate-missing-values-based-on-the-sum-of-the-gap-using-pyspark/
This was a nice fun problem to solve.
In PySpark, you can populate a column over a window specification with the first or last non-null value.
Then we can also identify the groups of nulls which occur together as a bunch
and rank over them.
Once we have those two pieces of information, calculating the interpolated values is straightforward.
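A minimal sketch of the fill-forward/fill-backward part, using `last(..., ignorenulls=True)` and `first(..., ignorenulls=True)` over backward- and forward-looking windows; the toy `t`/`value` columns are assumptions, and a real job would add a `partitionBy` per series.

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# Toy timeseries with gaps (made-up data): t is the time index, value has nulls.
df = spark.createDataFrame(
    [(1, 10.0), (2, None), (3, None), (4, 40.0), (5, None), (6, 70.0)],
    "t int, value double",
)

# Backward- and forward-looking windows over the whole (single) series.
w_back = Window.orderBy("t").rowsBetween(Window.unboundedPreceding, 0)
w_fwd = Window.orderBy("t").rowsBetween(0, Window.unboundedFollowing)

filled = (
    df.withColumn("prev_val", F.last("value", ignorenulls=True).over(w_back))
      .withColumn("next_val", F.first("value", ignorenulls=True).over(w_fwd))
      .withColumn("prev_t", F.last(F.when(F.col("value").isNotNull(), F.col("t")), ignorenulls=True).over(w_back))
      .withColumn("next_t", F.first(F.when(F.col("value").isNotNull(), F.col("t")), ignorenulls=True).over(w_fwd))
      .withColumn(
          "value_interp",
          F.when(F.col("value").isNotNull(), F.col("value")).otherwise(
              # Linear interpolation between the surrounding non-null points.
              F.col("prev_val")
              + (F.col("next_val") - F.col("prev_val"))
              * (F.col("t") - F.col("prev_t"))
              / (F.col("next_t") - F.col("prev_t"))
          ),
      )
)
filled.show()
```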
dineshdharme / MaximalBipartiteMatchingGraphProblem.py
Last active April 9, 2024 16:21
A maximum bipartite matching algorithm solution.
https://stackoverflow.com/questions/78294920/select-unique-pairs-from-pyspark-dataframe
As @Abdennacer Lachiheb mentioned in the comments, this is indeed a bipartite matching problem, and it is unlikely to get solved correctly in PySpark or with GraphFrames. The best approach would be to solve it with a graph algorithm library's `hopcroft_karp_matching`, such as `NetworkX`'s, or to use `scipy.optimize.linear_sum_assignment`.
`hopcroft_karp_matching`: pure Python code, runs in O(E√V) time, where E is the number of edges and V is the number of vertices in the graph.
`scipy.optimize.linear_sum_assignment`: O(n^3) complexity, but written in C++.
So only practical testing can determine which works better on your data sizes.
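A minimal NetworkX sketch of the matching step, assuming the candidate pairs have already been collected from the dataframe (the pair values below are made up):

```python
import networkx as nx
from networkx.algorithms.bipartite import hopcroft_karp_matching

# Hypothetical candidate pairs (left_id, right_id) collected from the dataframe.
pairs = [("a1", "b1"), ("a1", "b2"), ("a2", "b2"), ("a3", "b3")]

# Build the bipartite graph.
G = nx.Graph()
left = {l for l, _ in pairs}
G.add_nodes_from(left, bipartite=0)
G.add_nodes_from({r for _, r in pairs}, bipartite=1)
G.add_edges_from(pairs)

# Maximum matching; the result maps nodes from both sides to their partner.
matching = hopcroft_karp_matching(G, top_nodes=left)
unique_pairs = [(l, r) for l, r in matching.items() if l in left]
print(unique_pairs)
```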
dineshdharme / DynamicJsonFormatting.py
Created April 8, 2024 13:43
Dynamic Json Formatting in Pyspark using schema_of_json function.
https://stackoverflow.com/questions/78290764/flatten-dynamic-json-payload-string-using-pyspark/
There is a nifty function `schema_of_json` in PySpark which derives the schema of a JSON string; that schema can then be applied to the whole column.
So the method to handle dynamic JSON payloads is as follows (see the sketch after these steps):
- First, take the `json_payload` of the first row of the dataframe
- Create a schema from that `json_payload` using `schema_of_json`
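A minimal sketch of those steps, with a made-up `json_payload` column; note that the derived schema only covers fields that appear in the sampled row.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical dataframe whose json_payload column holds dynamic JSON strings.
df = spark.createDataFrame(
    [('{"id": 1, "tags": ["a", "b"], "meta": {"source": "web"}}',),
     ('{"id": 2, "tags": ["c"], "meta": {"source": "app"}}',)],
    ["json_payload"],
)

# Derive the schema from the first row's payload...
sample_payload = df.select("json_payload").first()[0]
schema = df.select(F.schema_of_json(F.lit(sample_payload))).first()[0]

# ...then apply it to the whole column and flatten the resulting struct.
parsed = df.withColumn("parsed", F.from_json("json_payload", schema)).select("parsed.*")
parsed.show(truncate=False)
```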
dineshdharme / ParallelAPICallsInPysparkUDF_Example.py
Created April 5, 2024 13:44
A demo pyspark script to show how to invoke parallel api calls using spark.
Here's a helpful example of using DataFrames to make parallel API calls.
import json
import sys

import requests
from pyspark.sql import SQLContext
from pyspark.sql.functions import *
from pyspark.sql.types import *
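The rest of the gist isn't shown above; as a self-contained sketch of the same idea, here is one way to fan the HTTP calls out across executors with a UDF. The URLs, column names, and partition count are placeholders, not the gist's actual values.

```python
import requests
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: one URL per row; a real job would build these from its own data.
urls = spark.createDataFrame(
    [("https://httpbin.org/get?id=%d" % i,) for i in range(8)], ["url"]
)

def call_api(url):
    # Each executor task runs this function, so the calls happen in parallel
    # across partitions rather than sequentially on the driver.
    try:
        return requests.get(url, timeout=10).text
    except requests.RequestException as exc:
        return f"ERROR: {exc}"

call_api_udf = F.udf(call_api, StringType())

result = urls.repartition(4).withColumn("response", call_api_udf("url"))
result.show(truncate=40)
```

Repartitioning first controls how many tasks, and therefore how many concurrent API calls, run at once.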
dineshdharme / ParsingBooleanExpressionUsingLark.py
Created April 4, 2024 15:50
Parsing Boolean Expression Using Lark.
https://stackoverflow.com/questions/78272962/split-strings-containing-nested-brackets-in-spark-sql
It is very easy to do with the `lark` Python library.
$ `pip install lark --upgrade`
Then you need to create a grammar which is able to parse your expressions.
Following is the script:
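The gist's actual grammar and script aren't reproduced here; as a rough sketch, a small Lark grammar for AND/OR expressions with nested parentheses could look like this (the grammar rules and sample expression are illustrative assumptions):

```python
from lark import Lark

# A minimal grammar for AND/OR expressions with nested parentheses; a real
# grammar would be tailored to the exact strings being parsed.
grammar = r"""
    ?expr: expr "OR" term   -> or_
         | term
    ?term: term "AND" atom  -> and_
         | atom
    ?atom: NAME
         | "(" expr ")"
    NAME: /[A-Za-z_][A-Za-z0-9_]*/
    %import common.WS
    %ignore WS
"""

parser = Lark(grammar, start="expr")

# Parse a nested expression and print the resulting tree.
tree = parser.parse("(A AND (B OR C)) OR D")
print(tree.pretty())
```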
dineshdharme / Suggestion.py
Created March 31, 2024 08:20
Rooted Minimum Spanning Tree in a Directed Graph.
https://stackoverflow.com/questions/78244909/graphframes-pyspark-route-compaction/78248893#78248893
You can possibly use NetworkX's implementation of Edmonds' algorithm to find a minimum spanning arborescence rooted at a particular node in a given directed graph.
https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.tree.branchings.Edmonds.html
In graph theory, an arborescence is a directed graph having a distinguished vertex u (called the root) such that, for any other vertex v, there is exactly one directed path from u to v.
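A minimal sketch with NetworkX, on a made-up weighted digraph; dropping the root's incoming edges is one possible way to force the arborescence to root at a chosen node, and is not necessarily what the linked answer does.

```python
import networkx as nx

# Hypothetical weighted directed graph; edges are (src, dst, weight).
edges = [
    ("A", "B", 1.0), ("A", "C", 4.0), ("B", "C", 1.0),
    ("C", "D", 2.0), ("B", "D", 5.0), ("D", "A", 3.0),
]
G = nx.DiGraph()
G.add_weighted_edges_from(edges)

root = "A"
# Drop the root's incoming edges so every node must be reached from the root.
G.remove_edges_from(list(G.in_edges(root)))

# Edmonds' algorithm under the hood: minimum spanning arborescence.
arborescence = nx.minimum_spanning_arborescence(G, attr="weight")
print(sorted(arborescence.edges(data=True)))
```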
dineshdharme / ClusteringNamesUsingMinHashLSH.py
Created March 20, 2024 11:29
Clustering similar text using minhashing and lsh.
https://stackoverflow.com/questions/78186018/fuzzy-logic-to-match-the-records-in-a-dataframe/78192904#78192904
Here's another implementation which does the same thing, this time using MinHash and LSH.
Here's an article which explains this.
https://spotintelligence.com/2023/01/02/minhash/
First, install `datasketch` and `networkx`.
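After installing both (`pip install datasketch networkx`), a minimal sketch of the MinHash + LSH clustering idea could look like the following; the sample names, character-trigram shingling, and threshold are all illustrative assumptions.

```python
import networkx as nx
from datasketch import MinHash, MinHashLSH

# Hypothetical names to cluster; a real job would read them from a dataframe.
names = ["John Smith", "Jon Smith", "Jane Doe", "Jane M Doe", "Alice Brown"]

def minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    # Character trigrams work reasonably well for short, noisy strings.
    for i in range(len(text) - 2):
        m.update(text[i:i + 3].lower().encode("utf8"))
    return m

# Index every name in the LSH structure.
lsh = MinHashLSH(threshold=0.4, num_perm=128)
hashes = {name: minhash(name) for name in names}
for name, mh in hashes.items():
    lsh.insert(name, mh)

# Build a graph of candidate matches and take connected components as clusters.
G = nx.Graph()
G.add_nodes_from(names)
for name, mh in hashes.items():
    for match in lsh.query(mh):
        if match != name:
            G.add_edge(name, match)

print(list(nx.connected_components(G)))
```

LSH only proposes candidate pairs; taking connected components over those candidates is what turns pairs into clusters.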