@denji
denji / README.md
Last active April 26, 2024 18:09 — forked from istepanov/gist:3950977
Remove/Backup – settings & cli for macOS (OS X) – DataGrip, AppCode, CLion, Gogland, IntelliJ, PhpStorm, PyCharm, Rider, RubyMine, WebStorm
@inchoate
inchoate / including_external_package_in_dataflow.md
Last active February 2, 2024 11:40
Adding an extra package to a Python Dataflow project to run on GCP

The Problem

The documentation for deploying a pipeline with extra, non-PyPI, pure-Python packages on GCP is missing some detail. This gist shows how to package and deploy an external pure-Python, non-PyPI dependency to a managed Dataflow pipeline on GCP.

TL;DR: Your external package needs to be a Python (source/binary) distribution, properly packaged and shipped alongside your pipeline. It is not enough to only specify a tar file with a setup.py.

Preparing the External Package

Your external package must have a proper setup.py. What follows is an example setup.py for our ETL package. It is used to package version 1.1.1 of the etl library. The library requires three PyPI packages to run, specified in the install_requires field. The package also ships with custom external JSON data, declared in the package_data section. Finally, the setuptools.find_packages function searches for all available packages and returns them as a list.
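A minimal sketch of such a setup.py, assuming the library is named etl and ships its JSON files in an etl/data/ directory (the dependency names below are placeholders, not the gist's actual requirements):

import setuptools

setuptools.setup(
    name='etl',
    version='1.1.1',
    # The library's three PyPI dependencies (placeholder names).
    install_requires=[
        'requests',
        'python-dateutil',
        'six',
    ],
    # Find every package under the project root.
    packages=setuptools.find_packages(),
    # Ship the custom JSON data files with the package.
    package_data={
        'etl': ['data/*.json'],
    },
)

Built with "python setup.py sdist", the resulting tarball can then be handed to the pipeline via the --extra_package option.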

@criccomini
criccomini / airflow-supervisord.conf
Created June 22, 2016 14:54
airflow-supervisord.conf
; Configuration for Airflow webserver and scheduler in Supervisor
[program:airflow]
command=/bin/airflow webserver
stopsignal=QUIT
stopasgroup=true
user=airflow
stdout_logfile=/var/log/airflow/airflow-stdout.log
stderr_logfile=/var/log/airflow/airflow-stderr.log
environment=HOME="/home/airflow",AIRFLOW_HOME="/etc/airflow",TMPDIR="/storage/airflow_tmp"
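The stanza above only covers the webserver; a matching scheduler program, under the same assumptions (paths and log names are illustrative, not from the original file), might look like:

[program:airflow-scheduler]
command=/bin/airflow scheduler
stopsignal=QUIT
stopasgroup=true
user=airflow
stdout_logfile=/var/log/airflow/airflow-scheduler-stdout.log
stderr_logfile=/var/log/airflow/airflow-scheduler-stderr.log
environment=HOME="/home/airflow",AIRFLOW_HOME="/etc/airflow",TMPDIR="/storage/airflow_tmp"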
@willprice
willprice / .travis.yml
Last active August 15, 2023 17:12
How to set up Travis CI for projects that push back to GitHub
# Ruby is our language as asciidoctor is a ruby gem.
language: ruby
before_install:
- sudo apt-get install pandoc
- gem install asciidoctor
script:
- make
after_success:
- .travis/push.sh
env:
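The file is truncated at env:; presumably this is where the encrypted GitHub credentials used by .travis/push.sh live, along these lines (an assumed shape, not the original contents):

  global:
    # Encrypted GH_TOKEN used by push.sh; generate with the travis encrypt CLI
    - secure: "..."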
@simonw
simonw / gist:104413
Created April 30, 2009 11:19
Turn a BeautifulSoup form into a dict of fields and default values - useful for screen-scraping forms and then resubmitting them
def extract_form_fields(self, soup):
    "Turn a BeautifulSoup form into a dict of fields and default values"
    fields = {}
    for input in soup.findAll('input'):
        # ignore submit/image with no name attribute
        if input['type'] in ('submit', 'image') and not input.has_key('name'):
            continue
        # single-element name/value fields
        if input['type'] in ('text', 'hidden', 'password', 'submit', 'image'):
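The gist is truncated at this point; in the original, this branch goes on to read the input's default value, roughly as follows (reconstructed to match the visible Python 2/BeautifulSoup 3 style, not quoted from the gist):

            value = ''
            if input.has_key('value'):
                value = input['value']
            fields[input['name']] = value
            continue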
@mathewbyrne
mathewbyrne / CsvResponse.php
Created March 21, 2013 22:54
A small Symfony 2 class for returning a response as a CSV file. Based on the Symfony JsonResponse class.
<?php

namespace Jb\AdminBundle\Http;

use Symfony\Component\HttpFoundation\Response;

class CsvResponse extends Response
{
    protected $data;
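The class is truncated here; in the full gist the data is converted to CSV when it is set. A sketch of that logic (assumed, not the original code):

    public function __construct($data = array(), $status = 200, $headers = array())
    {
        parent::__construct('', $status, $headers);
        $this->setData($data);
        $this->headers->set('Content-Type', 'text/csv');
    }

    public function setData(array $data)
    {
        // Write each row through fputcsv into an in-memory stream.
        $output = fopen('php://temp', 'r+');
        foreach ($data as $row) {
            fputcsv($output, $row);
        }
        rewind($output);
        $this->data = stream_get_contents($output);
        fclose($output);

        return $this->setContent($this->data);
    }
}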
@justinnaldzin
justinnaldzin / spark_dataframe_size_estimator.py
Created July 18, 2018 19:29
Estimate size of Spark DataFrame in bytes
from pyspark.serializers import AutoBatchedSerializer, PickleSerializer

# Function to convert Python objects to Java objects
def _to_java_object_rdd(rdd):
    """Return a JavaRDD of Object by unpickling.

    It will convert each Python object into a Java object via Pyrolite,
    whether or not the RDD is serialized in batch.
    """
    rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
    return rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)
# Convert DataFrame to an RDD
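The snippet is truncated here; the remaining lines feed the converted RDD to Spark's JVM-side SizeEstimator, roughly like this (assuming a DataFrame df and a SparkContext sc):

java_obj = _to_java_object_rdd(df.rdd)
size_bytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(java_obj)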
@ololobus
ololobus / Spark+ipython_on_MacOS.md
Last active November 22, 2022 22:24
Apache Spark installation + ipython/jupyter notebook integration guide for macOS

Tested with Apache Spark 2.1.0, Python 2.7.13 and Java 1.8.0_112

For older versions of Spark and ipython, please see the previous revision of this text.

Install Java Development Kit

@shantanuo
shantanuo / mysql_to_big_query.sh
Last active September 14, 2022 07:12
Copy a MySQL table to BigQuery. If you need to copy all tables, use the loop given at the end. Exits with error code 3 if blob or text columns are found. The CSV files are first copied to Google Cloud Storage before being imported into BigQuery.
#!/bin/sh
# Usage: ./mysql_to_big_query.sh <schema> <table>
TABLE_SCHEMA=$1   # MySQL schema (database) name
TABLE_NAME=$2     # table to copy
mytime=`date '+%y%m%d%H%M'`          # timestamp to keep file/bucket names unique
hostname=`hostname | tr 'A-Z' 'a-z'` # lowercased hostname
file_prefix="trimax$TABLE_NAME$mytime$TABLE_SCHEMA"
bucket_name=$file_prefix             # GCS bucket named after the file prefix
splitat="4000000000"                 # split exported files at ~4 GB
bulkfiles=200
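The script is truncated here; the remainder follows the flow described above. A sketch of those steps (illustrative commands, not the original script):

# 1. Export the table to a local CSV file (the original exits with code 3
#    first if blob/text columns are found), then split it into ~4 GB chunks.
split -b $splitat "$file_prefix.csv" "$file_prefix.part."

# 2. Stage the chunks in a new Google Cloud Storage bucket.
gsutil mb gs://$bucket_name
gsutil -m cp $file_prefix.part.* gs://$bucket_name/

# 3. Load the staged files into BigQuery.
bq load --source_format=CSV --autodetect "$TABLE_SCHEMA.$TABLE_NAME" "gs://$bucket_name/$file_prefix.part.*"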
@elvismdev
elvismdev / jenkins_wpsvn_deploy.sh
Last active August 30, 2022 18:35 — forked from BFTrick/deploy.sh
Deploy from Jenkins to WordPress.org SVN
# In your Jenkins job configuration, select "Add build step > Execute shell", and paste this script contents.
# Replace `______your-plugin-name______`, `______your-wp-username______` and `______your-wp-password______` as needed.
# main config
WP_ORG_USER="______your-wp-username______" # your WordPress.org username
WP_ORG_PASS="______your-wp-password______" # your WordPress.org password
PLUGINSLUG="______your-plugin-name______"
CURRENTDIR=`pwd`
MAINFILE="______your-plugin-name______.php" # this should be the name of the main PHP file in your WordPress plugin
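The script continues beyond this point; the usual flow in these WordPress.org deploy scripts (assumed here, not quoted from the original) is to check out the plugin's SVN repository, sync the built files into trunk, and commit:

SVNPATH="/tmp/$PLUGINSLUG"                                 # local checkout path (illustrative)
SVNURL="https://plugins.svn.wordpress.org/$PLUGINSLUG/"    # the plugin's WordPress.org SVN repo

# Check out the SVN repo, copy the new code into trunk, and commit.
svn checkout $SVNURL $SVNPATH
rsync -r --delete --exclude='.svn' $CURRENTDIR/ $SVNPATH/trunk/
cd $SVNPATH/trunk
svn add --force .
svn commit --username=$WP_ORG_USER --password=$WP_ORG_PASS -m "Deploy from Jenkins"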