@mtustin-handy
mtustin-handy / capacity-scheduler.xml
Created April 1, 2016 23:15
Capacity scheduler configuration
<configuration>
<property>
<!-- Maximum fraction of cluster resources that application masters may use.
If this is too high, application masters can crowd out actual work. -->
<name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
<value>0.25</value>
</property>
<property>
mtustin-handy / yarn-site.xml
Created April 1, 2016 21:46
Using YARN CapacityScheduler
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
mtustin-handy / dynadag.py
Created December 4, 2015 20:21
Programmatically generating dag tasks in airflow
lognames = list(
    hdfs.list_filenames(conf.get('incoming_log_path'), full_path=False))
for logname in lognames:
    # TODO use a proper regex to filter out bad lognames
    # Airflow is particular about which characters exist in task names
    if logname not in excluded_logs and '%' not in logname and '@' not in logname:
        ingest = LogIngesterOperator(
def make_spooq_exporter(table, schema, task_id, dag):
    return SpooqExportOperator(
        jdbc_url=('jdbc:mysql://%s/%s?user=user&password=pasta'
                  % (TARGET_DB_HOST, TARGET_DB_NAME)),
        target_table=table,
        hive_table='%s.%s' % (schema, table),
        dag=dag,
        on_retry_callback=truncate_db,
        task_id=task_id)
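The TODO in dynadag.py asks for a proper regex to filter out bad log names. A minimal sketch of one way to do that, assuming task names should be limited to letters, digits, underscores, dashes and dots. The helper name `usable_lognames`, the pattern, and the sample names are hypothetical and not part of the original gist:

```python
import re

# Assumption: a task name is acceptable if it contains only letters,
# digits, underscores, dots and dashes. Adjust to match your Airflow version.
VALID_TASK_NAME = re.compile(r'[A-Za-z0-9_.-]+')


def usable_lognames(lognames, excluded_logs=()):
    """Return log names that are not excluded and are safe to use as task names."""
    return [
        name for name in lognames
        if name not in excluded_logs and VALID_TASK_NAME.fullmatch(name)
    ]


if __name__ == '__main__':
    sample = ['web.log', 'bad%name', 'user@host', 'skipme']
    print(usable_lognames(sample, excluded_logs={'skipme'}))
```

An allow-list pattern like this rejects any unexpected character, whereas the explicit `'%'` and `'@'` checks only catch the two characters already known to cause trouble.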
mtustin-handy / maindag.py
Last active December 4, 2015 04:15
How to structure subdags so they work as you would expect
from airflow.models import DAG
from airflow.operators import PythonOperator, SubDagOperator
from good_dags.subdag import hive_dag
from datetime import timedelta, datetime
main_dag = DAG(
    dag_id='main_dag',
    schedule_interval=timedelta(hours=1),
    start_date=datetime(2015, 9, 18, 21)
)
mtustin-handy / maindag.py
Last active June 13, 2018 03:51
How not to structure subdags (unless you want them to run on their own schedule). Module bad_dags
from airflow.models import DAG
from airflow.operators import PythonOperator, SubDagOperator
from bad_dags.subdag import hive_dag
from datetime import timedelta, datetime
main_dag = DAG(
    dag_id='main_dag',
    schedule_interval=timedelta(hours=1),
    start_date=datetime(2015, 9, 18, 21)
)
mtustin-handy / hextobin.scala
Last active December 3, 2015 23:39
Scala to convert hex encoded as text into real binary data
val hex = area.split(" ").map(Integer.parseInt(_, 16).toByte)
mtustin-handy / run_sqoop.sh
Last active December 3, 2015 23:16
Importing GIS data from MySQL into Hive as a string
sqoop-import --connect jdbc:mysql://<hostname>:3306/handy \
    --username <user> --table geodata_table \
    --target-dir /path/to/tables/geodata_table \
    --fetch-size -2147483648 --null-string \\\\N --null-non-string \\\\N \
    --map-column-hive area=STRING --delete-target-dir --hive-import \
    --hive-database handy_db --hive-drop-import-delims --hive-overwrite