REGION="europe-west1"
ZONE="europe-west1-b"
TEMPLATE_ID="download_production_table"
dev_dataproc_assets_bucket="gs://your-dataproc-assets-bucket/production/"
dev_project=your-gcp-project-id

upload_assets:
	gsutil cp main.py ${dev_dataproc_assets_bucket}
REGION=europe-west1
ZONE=europe-west1-b
CLUSTER_NAME=dev-cluster
SERVICE_ACCOUNT=your_service_account_name@your-gcp-project.iam.gserviceaccount.com
BUCKET_NAME=your-dataproc-staging-bucket

gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region ${REGION} \
    --zone ${ZONE} \
    --service-account ${SERVICE_ACCOUNT} \
    --bucket ${BUCKET_NAME}
jobs:
  - pysparkJob:
      args:
        - dataset
        - entity_name
        - gcs_output_bucket
        - materialization_gcp_project_id
        - materialization_dataset
        - output_parquet
        - is_partitioned
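The PySpark driver receives these workflow-template arguments positionally through `sys.argv`. A minimal sketch of unpacking them (the ordering and the string-encoded boolean are assumptions based on the list above):

```python
import sys

def parse_args(argv):
    """Unpack the seven positional workflow-template arguments, in the
    order they appear in the template's args list."""
    (dataset, entity_name, gcs_output_bucket,
     materialization_gcp_project_id, materialization_dataset,
     output_parquet, is_partitioned) = argv[1:8]
    return {
        "dataset": dataset,
        "entity_name": entity_name,
        "gcs_output_bucket": gcs_output_bucket,
        "materialization_gcp_project_id": materialization_gcp_project_id,
        "materialization_dataset": materialization_dataset,
        "output_parquet": output_parquet,
        # flags arrive as strings, e.g. "Yes"/"No", not Python booleans
        "is_partitioned": is_partitioned,
    }

# example invocation with placeholder values
args = parse_args(["main.py", "sales", "orders", "gs://bucket/out",
                   "my-project", "mat_ds", "orders.parquet", "Yes"])
print(args["dataset"])  # → sales
```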
import sys

from pyspark.sql.functions import *
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

YES_TOKEN = "Yes"

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
import numpy as np
from hyperopt import hp

# integer and string parameters, used with hp.choice()
bootstrap_type = [
    {'bootstrap_type': 'Poisson'},
    {'bootstrap_type': 'Bayesian',
     'bagging_temperature': hp.loguniform('bagging_temperature', np.log(1), np.log(50))},
    {'bootstrap_type': 'Bernoulli'},
]
LEB = ['No', 'AnyImprovement']  # remove 'Armijo' if not using GPU
grow_policy = [
    {'grow_policy': 'SymmetricTree'},
    # {'grow_policy': 'Depthwise'},
    {'grow_policy': 'Lossguide',
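These lists are meant to be wrapped in `hp.choice`, which draws one dict per trial so that conditional parameters (e.g. `bagging_temperature` only under `'Bayesian'`) travel with their branch. A dependency-free sketch of that sampling behavior, with plain `random.choice` standing in for hyperopt and a placeholder numeric value:

```python
import random

bootstrap_type = [
    {'bootstrap_type': 'Poisson'},
    {'bootstrap_type': 'Bayesian', 'bagging_temperature': 7.5},  # placeholder value
    {'bootstrap_type': 'Bernoulli'},
]

def sample_space(space, rng=random):
    """Mimic hp.choice: pick one branch; its extra keys are included
    only when that branch is chosen."""
    return dict(rng.choice(space))

picked = sample_space(bootstrap_type, random.Random(0))
# bagging_temperature appears only when the 'Bayesian' branch was drawn
print('bagging_temperature' in picked)
```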
from sklearn.model_selection import train_test_split

def ensemble_search(params):
    # hold out 20% of the data for evaluation and early stopping
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=22)
    model = EnsembleModel(params)
    evaluation = [(X_test, y_test)]
    model.fit(X_train, y_train,
              eval_set=evaluation,
              early_stopping_rounds=100, verbose=False)
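A hyperopt objective has to end by returning a loss in the `{'loss': ..., 'status': ...}` shape that `fmin` expects. The snippet stops before that step; a plausible sketch of the tail, scoring the hold-out split (log-loss as the metric is an assumption, and `objective_tail` is a hypothetical helper name):

```python
from sklearn.metrics import log_loss

def objective_tail(model, X_test, y_test):
    """Score a fitted binary classifier on the hold-out split and return
    the dict shape hyperopt's fmin expects ('ok' matches STATUS_OK)."""
    proba = model.predict_proba(X_test)[:, 1]
    return {'loss': log_loss(y_test, proba, labels=[0, 1]), 'status': 'ok'}
```

With hyperopt, `ensemble_search` would end with `return objective_tail(model, X_test, y_test)` and be passed to `fmin` together with the search space.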
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

class EnsembleModel:
    def __init__(self, params):
        """
        LGB + XGB + CatBoost model
        """
        self.lgb_params = params['lgb']
        self.xgb_params = params['xgb']
        self.cat_params = params['cat']
        self.lgb_model = LGBMClassifier(**self.lgb_params)
        self.xgb_model = XGBClassifier(**self.xgb_params)
        self.cat_model = CatBoostClassifier(**self.cat_params)
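The snippet cuts off before the ensemble's combination logic. A plausible sketch of what such a class typically does at prediction time — equal-weight soft voting over the members' probabilities (the weighting is an assumption) — shown with scikit-learn stand-ins so it runs without the three boosting libraries:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def soft_vote_proba(models, X):
    """Equal-weight average of each fitted member's class probabilities."""
    return np.mean([m.predict_proba(X) for m in models], axis=0)

X, y = make_classification(n_samples=200, random_state=0)
members = [LogisticRegression(max_iter=1000),
           DecisionTreeClassifier(random_state=0),
           GaussianNB()]
for m in members:
    m.fit(X, y)
proba = soft_vote_proba(members, X)
print(proba.shape)  # → (200, 2); each row still sums to 1
```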
###### log transform these columns ##########
log_cols = {'cont5': 'log', 'cont8': 'log', 'cont7': 'log'}
train_copy = FW.FE_transform_numeric_columns(train_copy, log_cols)
test_copy = FW.FE_transform_numeric_columns(test_copy, log_cols)
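In plain pandas, the log transform applied by the `FW` helper amounts to replacing each listed column with its logarithm; this sketch uses `log1p` (an assumption, not necessarily what `FW` does) so zero values stay finite:

```python
import numpy as np
import pandas as pd

def log_transform(df, cols):
    """Replace each listed column with log1p(x) = log(1 + x);
    log1p is an assumption here -- it is defined at x = 0."""
    out = df.copy()
    for c in cols:
        out[c] = np.log1p(out[c])
    return out

demo = pd.DataFrame({'cont5': [0.0, 1.0], 'cont7': [2.0, 3.0]})
demo = log_transform(demo, ['cont5', 'cont7'])
print(demo['cont5'].tolist())  # → [0.0, log(2)]
```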
### create groupby aggregates of the following numerics
agg_nums = ['cont1', 'cont3']
groupby_vars = ['cat2', 'cat4']
train_add, test_add = FW.FE_add_groupby_features_aggregated_to_dataframe(
    train[agg_nums + groupby_vars],
    agg_types=['mean', 'std'],
    groupby_columns=groupby_vars,
    ignore_variables=[], test=test[agg_nums + groupby_vars])
# join the dataframes with the aggregated features to the main training and testing set dataframes
train_copy = train.join(train_add.drop(groupby_vars + agg_nums, axis=1))
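Conceptually, this step computes per-group statistics of the numeric columns and broadcasts them back onto each row. A plain-pandas sketch under that assumption (the generated column naming scheme is made up for illustration):

```python
import pandas as pd

def add_groupby_aggregates(df, num_cols, group_cols, agg_types=('mean', 'std')):
    """Append, for each numeric column and aggregate type, the per-group
    statistic broadcast back onto every row of that group."""
    out = df.copy()
    for num in num_cols:
        for agg in agg_types:
            name = f'{num}_{agg}_by_{"_".join(group_cols)}'
            out[name] = df.groupby(group_cols)[num].transform(agg)
    return out

df = pd.DataFrame({'cat2': ['a', 'a', 'b'], 'cont1': [1.0, 3.0, 5.0]})
df2 = add_groupby_aggregates(df, ['cont1'], ['cat2'])
print(df2['cont1_mean_by_cat2'].tolist())  # → [2.0, 2.0, 5.0]
```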
### we create feature crosses of these categorical variables ###
train = FW.FE_create_categorical_feature_crosses(train, ['cat4', 'cat18', 'cat13', 'cat2'])
test = FW.FE_create_categorical_feature_crosses(test, ['cat4', 'cat18', 'cat13', 'cat2'])
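A categorical feature cross is a new column whose value is the combination of two (or more) existing categories. A minimal pandas sketch of the idea, assuming pairwise crosses with string concatenation (the exact scheme `FW` uses may differ):

```python
from itertools import combinations

import pandas as pd

def create_feature_crosses(df, cat_cols, sep='_'):
    """For every pair of the listed categoricals, add a column whose
    value is the concatenation of the pair's values."""
    out = df.copy()
    for a, b in combinations(cat_cols, 2):
        out[f'{a}{sep}{b}'] = out[a].astype(str) + sep + out[b].astype(str)
    return out

df = pd.DataFrame({'cat4': ['x', 'y'], 'cat2': ['p', 'q']})
df2 = create_feature_crosses(df, ['cat4', 'cat2'])
print(df2['cat4_cat2'].tolist())  # → ['x_p', 'y_q']
```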