George Vyshnya (gvyshnya)

gvyshnya / makefile
Created July 20, 2021 21:52
The makefile that automates useful Dataproc-related deployment routines
REGION="europe-west1"
ZONE="europe-west1-b"
TEMPLATE_ID="download_production_table"
dev_dataproc_assets_bucket="gs://your-dataproc-assets-bucket/production/"
dev_project=your-gcp-project-id

# copy the PySpark driver script to the assets bucket; the bucket URL already
# pins the destination, so gsutil cp needs no --region/--project flags
upload_assets:
	gsutil cp main.py ${dev_dataproc_assets_bucket}
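A natural companion target (a sketch, not part of the original gist) would register the workflow template defined in workflow_template.yaml below; gcloud dataproc workflow-templates import is a standard subcommand, while the target name and local file name here are illustrative:

# illustrative extra target: (re)register the YAML workflow template
import_template:
	gcloud dataproc workflow-templates import ${TEMPLATE_ID} \
		--source workflow_template.yaml \
		--region ${REGION} \
		--project ${dev_project}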
gvyshnya / create_dev_cluster.sh
Created July 20, 2021 21:50
This script automates the creation of a permanent Dataproc cluster with the Jupyter Notebook/JupyterLab/PySpark notebook components enabled
REGION=europe-west1
ZONE=europe-west1-b
CLUSTER_NAME=dev-cluster
SERVICE_ACCOUNT=your_service_account_name@your-gcp-project.iam.gserviceaccount.com
BUCKET_NAME=your-dataproc-staging-bucket

gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region ${REGION} \
    --zone ${ZONE} \
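The preview cuts the gcloud command off mid-flags. A complete invocation consistent with the description and the variables defined above might look like the sketch below; all flags are standard gcloud dataproc clusters create options, but the image version and component choices are assumptions rather than the gist's exact values:

gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region ${REGION} \
    --zone ${ZONE} \
    --bucket ${BUCKET_NAME} \
    --service-account ${SERVICE_ACCOUNT} \
    --image-version 2.0 \
    --optional-components JUPYTER \
    --enable-component-gateway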
gvyshnya / workflow_template.yaml
Created July 20, 2021 21:48
The YAML definition of the Dataproc workflow template
jobs:
- pysparkJob:
    args:
    - dataset
    - entity_name
    - gcs_output_bucket
    - materialization_gcp_project_id
    - materialization_dataset
    - output_parquet
    - is_partitioned
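Once the template is registered (see the makefile target above), it can also be run straight from the file; a minimal sketch, assuming the europe-west1 region used elsewhere in these gists:

gcloud dataproc workflow-templates instantiate-from-file \
    --file workflow_template.yaml \
    --region europe-west1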
gvyshnya / PySpark_Job.py
Created July 20, 2021 21:47
The source code of the PySpark script exporting data from a BigQuery dataset to a GCS bucket (reservoir)
import sys

from pyspark.context import SparkContext
from pyspark.sql.functions import *
from pyspark.sql.session import SparkSession

YES_TOKEN = "Yes"

# reuse the cluster's existing Spark context and wrap it in a SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
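The preview stops before the export logic. Given the description and the workflow template's argument list above, the remaining steps would read a BigQuery table through the spark-bigquery connector and write it to GCS as Parquet; the argument parsing and path layout below are illustrative assumptions, not the gist's confirmed code:

# hypothetical continuation: export a BigQuery table to GCS as Parquet
dataset, entity_name, gcs_output_bucket = sys.argv[1:4]

df = (spark.read.format("bigquery")
      .option("table", f"{dataset}.{entity_name}")
      .load())

df.write.mode("overwrite").parquet(f"{gcs_output_bucket}/{entity_name}")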
gvyshnya / hyperopt_params.py
Created April 21, 2021 21:14
Hyperopt search parameters dictionary
import numpy as np
from hyperopt import hp

# integer and string parameters, used with hp.choice()
bootstrap_type = [
    {'bootstrap_type': 'Poisson'},
    {'bootstrap_type': 'Bayesian',
     'bagging_temperature': hp.loguniform('bagging_temperature', np.log(1), np.log(50))},
    {'bootstrap_type': 'Bernoulli'},
]
LEB = ['No', 'AnyImprovement']  # add 'Armijo' only when training on GPU
grow_policy = [
    {'grow_policy': 'SymmetricTree'},
    # {'grow_policy': 'Depthwise'},
    {'grow_policy': 'Lossguide'},  # Lossguide-specific knobs (e.g. max_leaves) can be nested here
]
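These fragments are meant to be nested into a single hyperopt search space via hp.choice. A minimal sketch of that assembly; the surrounding keys and the learning-rate range are illustrative, not the gist's full space:

from hyperopt import hp

space = {
    'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(0.3)),
    'bootstrap_type': hp.choice('bootstrap_type', bootstrap_type),
    'leaf_estimation_backtracking': hp.choice('leaf_estimation_backtracking', LEB),
    'grow_policy': hp.choice('grow_policy', grow_policy),
}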
gvyshnya / ensemble_search.py
Created April 21, 2021 14:14
Hyperopt Ensemble Search function
from sklearn.model_selection import train_test_split

def ensemble_search(params):
    # hold out 20% of the (module-level) X, y for evaluation
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=22)
    model = EnsembleModel(params)
    evaluation = [(X_test, y_test)]
    model.fit(X_train, y_train,
              eval_set=evaluation,
              early_stopping_rounds=100, verbose=False)
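The preview ends before the return statement; as a hyperopt objective, ensemble_search is expected to return a loss value or a {'loss': ..., 'status': STATUS_OK} dict. A sketch of the surrounding driver, reusing the space dictionary sketched above (max_evals is an arbitrary choice):

from hyperopt import fmin, tpe, Trials

# hypothetical tail of ensemble_search, e.g.:
#     pred = model.predict(X_test)
#     return {'loss': -accuracy_score(y_test, pred), 'status': STATUS_OK}

trials = Trials()
best = fmin(fn=ensemble_search,
            space=space,
            algo=tpe.suggest,
            max_evals=50,
            trials=trials)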
gvyshnya / ensemble_classifier_class.py
Created April 21, 2021 14:10
Custom class for Ensemble Classifier on top of lightgbm, xgboost, and catboost
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

class EnsembleModel:
    def __init__(self, params):
        """
        LGB + XGB + CatBoost model
        """
        self.lgb_params = params['lgb']
        self.xgb_params = params['xgb']
        self.cat_params = params['cat']
        self.lgb_model = LGBMClassifier(**self.lgb_params)
        self.xgb_model = XGBClassifier(**self.xgb_params)
        self.cat_model = CatBoostClassifier(**self.cat_params)
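The fit and prediction plumbing is cut off in the preview; one plausible shape, averaging the three base learners' probabilities, is sketched below. The blending scheme is an assumption, not the gist's confirmed code:

    def fit(self, X, y, **fit_kwargs):
        # train each base learner on the same data, forwarding eval_set etc.
        self.lgb_model.fit(X, y, **fit_kwargs)
        self.xgb_model.fit(X, y, **fit_kwargs)
        self.cat_model.fit(X, y, **fit_kwargs)
        return self

    def predict_proba(self, X):
        # simple average of the three probability estimates
        return (self.lgb_model.predict_proba(X) +
                self.xgb_model.predict_proba(X) +
                self.cat_model.predict_proba(X)) / 3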
gvyshnya / fw_do_log_transform.py
Created April 19, 2021 22:11
Adding log-transformed features using featurewiz
import featurewiz as FW

# log-transform these columns
log_cols = {'cont5': 'log', 'cont8': 'log', 'cont7': 'log'}
train_copy = FW.FE_transform_numeric_columns(train_copy, log_cols)
test_copy = FW.FE_transform_numeric_columns(test_copy, log_cols)
gvyshnya / fw_add_groupby_agg_features.py
Last active April 19, 2021 22:08
Adding groupby aggregate features with featurewiz
import featurewiz as FW

# create groupby aggregates of the following numeric columns
agg_nums = ['cont1', 'cont3']
groupby_vars = ['cat2', 'cat4']
train_add, test_add = FW.FE_add_groupby_features_aggregated_to_dataframe(
    train[agg_nums + groupby_vars],
    agg_types=['mean', 'std'],
    groupby_columns=groupby_vars,
    ignore_variables=[],
    test=test[agg_nums + groupby_vars])

# join the aggregated-feature dataframes back to the main training and testing dataframes
train_copy = train.join(train_add.drop(groupby_vars + agg_nums, axis=1))
test_copy = test.join(test_add.drop(groupby_vars + agg_nums, axis=1))
gvyshnya / fw_add_cat_crosses_features.py
Created April 19, 2021 21:55
Creating the category cross features with featurewiz
import featurewiz as FW

# create feature crosses of these categorical variables
train = FW.FE_create_categorical_feature_crosses(train, ['cat4', 'cat18', 'cat13', 'cat2'])
test = FW.FE_create_categorical_feature_crosses(test, ['cat4', 'cat18', 'cat13', 'cat2'])
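With the crosses, groupby aggregates, and log transforms from the gists above in place, featurewiz's main entry point can select the most predictive subset. A minimal sketch, assuming a 'target' column name; the exact return signature of FW.featurewiz varies between versions, so treat this as illustrative:

# hypothetical follow-up: automated feature selection over the engineered set
features, train_selected = FW.featurewiz(train, target='target',
                                         corr_limit=0.70, verbose=0)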