#!/usr/bin/env bash
# This script follows the steps described in this link:
# https://stackoverflow.com/a/51646354/8808983

LIBRARIES_DIR="python3.6-wheelhouse"
REQ_FILE="requirements.txt"
PLATFORM="manylinux1_x86_64"

# Assumed continuation: the standard `pip download` step these variables feed
# (--only-binary=:all: is required whenever --platform is given)
pip download -r "${REQ_FILE}" -d "${LIBRARIES_DIR}" --platform "${PLATFORM}" --only-binary=:all:
# get the list of column names
cols_now = list(pandas_df)
# move the column to the head of the list using index, pop and insert
cols_now.insert(0, cols_now.pop(cols_now.index('middle_col_name')))
# select the modified column-name list from the original dataframe,
# thus rearranging the columns
pandas_df = pandas_df.loc[:, cols_now]
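For illustration, a quick run of the same steps on a made-up frame (the column names are hypothetical):

import pandas as pd

pandas_df = pd.DataFrame({'a': [1], 'middle_col_name': [2], 'b': [3]})
cols_now = list(pandas_df)
cols_now.insert(0, cols_now.pop(cols_now.index('middle_col_name')))
pandas_df = pandas_df.loc[:, cols_now]
print(list(pandas_df))  # ['middle_col_name', 'a', 'b']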
import random
import sys
from pathlib import Path

import boto3
from boto3.exceptions import S3UploadFailedError
from coolname import generate
from joblib import Parallel, delayed
import numpy as np
import pandas as pd
import sys
from pathlib import Path

import boto3
from boto3.exceptions import S3UploadFailedError
from joblib import Parallel, delayed
import numpy as np

# boto3 clients are not thread-safe, but they are also not serializable,
# so we make one module-level global instead of passing it to the workers
s3_client = boto3.client('s3')
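A minimal sketch of how that global client gets used from joblib workers; the bucket name, file list and backend choice are assumptions, not part of the original snippet:

def upload_one(path):
    # uses the module-level client defined above
    try:
        s3_client.upload_file(str(path), 'my-bucket', Path(path).name)  # 'my-bucket' is a placeholder
        return path
    except S3UploadFailedError as err:
        print(err, file=sys.stderr)
        return None

files = list(Path('data').glob('*.csv'))  # hypothetical local files
# the threading backend avoids pickling anything, which is why the global works
results = Parallel(n_jobs=8, backend='threading')(delayed(upload_one)(f) for f in files)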
I have added helpful comments in the code at the appropriate locations.

This solution uses Spark's GraphX API through the `graphframes` Python bindings.

from pyspark import SQLContext
from pyspark.sql.functions import *
from graphframes import *
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
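The rest of the snippet is cut off above, so here is a minimal, hypothetical GraphFrame to make the imports concrete (toy vertices and edges; `connectedComponents` needs a checkpoint directory):

spark = SparkSession.builder.appName("graphframes-demo").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")  # required by connectedComponents

vertices = spark.createDataFrame([("a",), ("b",), ("c",)], ["id"])
edges = spark.createDataFrame([("a", "b")], ["src", "dst"])

g = GraphFrame(vertices, edges)  # GraphFrame comes from the star import above
g.connectedComponents().show()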
Not sure if this answer is helpful, since I couldn't cast your iterative equation into a standard form or find an iterative equation solver, but you can definitely use scipy's fsolve to solve non-linear equations.
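As a generic illustration of the fsolve remark (the equation here is made up, not the asker's):

import numpy as np
from scipy.optimize import fsolve

# solve x = cos(x), i.e. f(x) = x - cos(x) = 0, starting from x0 = 1.0
root = fsolve(lambda x: x - np.cos(x), x0=1.0)
print(root)  # ~0.739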
EDIT: We can use a specialized Pandas UDF to do the aggregation over an appropriate Window definition. Here's an example:

import sys
from pyspark import SQLContext
from pyspark.sql.functions import *
import pyspark.sql.functions as F
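The example itself is truncated above; a minimal sketch of the idea, with invented column names, would look like this (a grouped-aggregate Pandas UDF applied over a Window):

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("window-udaf-demo").getOrCreate()
df = spark.createDataFrame([('a', 1.0), ('a', 2.0), ('b', 5.0)], ['grp', 'val'])

@pandas_udf('double', PandasUDFType.GROUPED_AGG)
def mean_udf(v: pd.Series) -> float:
    return float(v.mean())

# aggregate over a per-group (unbounded) window
w = Window.partitionBy('grp')
df.withColumn('grp_mean', mean_udf(df['val']).over(w)).show()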
This need not be so complicated. Here's a concrete example of how you can do it.

First, load all the models into a list and broadcast it; access the broadcast variable's value using `value`. You can concatenate your features into an array, as I have done below, and then run inference on the samples one by one.

You can achieve batch semantics by using the mapPartitions function on an RDD and then converting the result back to a DataFrame, as shown below.

import sys
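The original code is cut off after the import, so here is a minimal sketch of the pattern just described; the model list, column names and schema are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-inference-demo").getOrCreate()

# 'models' stands in for a list of fitted scikit-learn-style estimators,
# and 'df' for an input DataFrame with 'id' and array-typed 'features' columns
bc_models = spark.sparkContext.broadcast(models)

def predict_partition(rows):
    local_models = bc_models.value  # read the broadcast once per partition
    for row in rows:
        feats = [row['features']]
        yield row['id'], [float(m.predict(feats)[0]) for m in local_models]

# batch semantics: each partition is processed as a unit, then back to a DataFrame
result_df = df.rdd.mapPartitions(predict_partition).toDF(['id', 'predictions'])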
Here's a solution using `PandasUDFType.GROUPED_AGG`, which can be used inside a `groupBy` clause.

from pyspark import SQLContext
from pyspark.sql.functions import *
import pyspark.sql.functions as F
from pyspark.sql.types import *
from pyspark.sql.window import Window
from typing import Iterator, Tuple
import pandas as pd
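The rest of the original snippet isn't shown; as a minimal sketch with invented group and column names, such a UDF looks like this:

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.appName("grouped-agg-demo").getOrCreate()
df = spark.createDataFrame([('a', 1.0), ('a', 4.0), ('b', 2.0)], ['grp', 'val'])

@pandas_udf(DoubleType(), PandasUDFType.GROUPED_AGG)
def spread(v: pd.Series) -> float:
    # each group's column arrives as a pandas Series; return one scalar per group
    return float(v.max() - v.min())

df.groupBy('grp').agg(spread(F.col('val')).alias('val_spread')).show()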
My approach to this problem was to create a map from each code to the minimum timestamp at which it appears, and then to use that map to populate the start_time_i columns. The code is below.

from pyspark import SparkContext, SQLContext
from pyspark.sql import functions as F
from pyspark.sql.functions import col, when

sc = SparkContext('local')
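Since the rest of the snippet is truncated, here is a sketch of the idea with invented column names ('code', 'timestamp'): compute each code's minimum timestamp, then join it back onto the rows.

sqlContext = SQLContext(sc)

# hypothetical input: one row per event
df = sqlContext.createDataFrame(
    [('A', 3), ('A', 1), ('B', 2)], ['code', 'timestamp'])

# map of code -> minimum timestamp at which it appears
min_ts = df.groupBy('code').agg(F.min('timestamp').alias('start_time'))

# join it back to populate a start_time column on every row
df.join(min_ts, on='code', how='left').show()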
Here's one way to do it. Basically, I asked for the `Robs` node to be turned into rows; from there on it's normal data wrangling.

import sys
from pyspark import SparkContext, SQLContext
from pyspark.sql import functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MyApp") \
    .getOrCreate()