Slides: https://docs.google.com/presentation/d/1SMtBILSrqt9SWK5BnEKe4ivGTHItrVjgCEB0AUhGeyM
$ python -m venv venv
$ venv/bin/pip install sqlalchemy colorama
#!/usr/bin/env sh
# . "$(dirname -- "$0")/_/husky.sh" # - uncomment if using husky

# Get the current commit message file
COMMIT_MSG_FILE=$1
SOURCE_MSG=$2

# Ref: https://git-scm.com/docs/githooks#_prepare_commit_msg
# SOURCE_MSG is the source where the commit message is taken from
# - EMPTY -> No source commit message present `git commit -a`
""" | |
Simple benchmark to check if withColumn() is faster or select() is faster | |
Conflusion: select() is faster than withColumn() in a for loop as lesser dataframes are created | |
""" | |
import datetime | |
import findspark; findspark.init(); import pyspark | |
spark = pyspark.sql.SparkSession.builder.getOrCreate() | |
for ncol in [10, 100, 1000, 2000, 5000]: |
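The loop body is cut off here. As a sketch of the timing pattern the `datetime` import suggests (wall-clock timing of each approach per `ncol`), with plain-Python stand-ins for the Spark calls — the function names and bodies below are hypothetical, not from the original:

```python
import datetime


def timed(fn, *args):
    """Return (result, elapsed_seconds) for one call."""
    start = datetime.datetime.now()
    result = fn(*args)
    elapsed = (datetime.datetime.now() - start).total_seconds()
    return result, elapsed


# Hypothetical stand-ins: the real benchmark would call
# df.withColumn(...) ncol times vs. one df.select(...) with ncol expressions.
def with_column_loop(ncol):
    return ["col_%d" % i for i in range(ncol)]


def select_once(ncol):
    return ["col_%d" % i for i in range(ncol)]


for ncol in [10, 100]:
    _, t_wc = timed(with_column_loop, ncol)
    _, t_sel = timed(select_once, ncol)
    print(ncol, t_wc, t_sel)
```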
The three common ways for developers to document information about their work are:
# Example:
#   PYJAVA_LIB=jpype venv/bin/python pyjava.py
import os
from datetime import datetime

from jpmml_evaluator import _package_classpath

lib = os.environ.get('PYJAVA_LIB')
assert lib is not None, 'Set env var PYJAVA_LIB to py4j/jnius/jpype'
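The env-var check above implies a dispatch on the chosen Java bridge. A minimal sketch of that validation step, assuming the three backend names listed in the assert message (`pick_backend` is a hypothetical helper, not from the original):

```python
import os

SUPPORTED_BACKENDS = ('py4j', 'jnius', 'jpype')


def pick_backend(environ=os.environ):
    """Read and validate PYJAVA_LIB, mirroring the assert above."""
    lib = environ.get('PYJAVA_LIB')
    assert lib in SUPPORTED_BACKENDS, 'Set env var PYJAVA_LIB to py4j/jnius/jpype'
    return lib


# Usage, matching the example invocation above:
print(pick_backend({'PYJAVA_LIB': 'jpype'}))  # prints jpype
```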
# Write a Spark DataFrame into a single CSV file (to open easily with Excel/other tools)
# Save the file to S3
import s3fs
import pyspark.sql.functions as F  # noqa: N812


def spark_to_csv(spark_df, out_path):
    """
    Save the file in part files with Spark, then append them together
name: Python Tests

jobs:
  build:
    runs-on: ubuntu-18.04
    strategy:
      max-parallel: 2
      matrix:
        python-version: [3.5, 3.6, 3.7]
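A matrix job like this typically finishes with a `steps` block that checks out the repo and installs the matrix interpreter. A minimal sketch — the step names and test command below are assumptions, not from the original workflow:

```yaml
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v2
        with:
          python-version: ${{ matrix.python-version }}
      - name: Run tests
        run: |
          python -m pip install --upgrade pip
          python -m pytest
```

With `max-parallel: 2`, the three matrix entries run at most two at a time.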