@qxj
Created June 9, 2015 07:46
If the input files are serialized with Avro, deserialize them in Hadoop Streaming with org.apache.avro.mapred.AvroAsTextInputFormat.
#!/usr/bin/env bash
# @(#) norm.sh Time-stamp: <Julian Qian 2015-06-09 15:35:35>
# Copyright 2015 Julian Qian
# Author: Julian Qian <junist@gmail.com>
# Version: $Id: norm.sh,v 0.1 2015-06-08 18:03:30 jqian Exp $
#
day=$(date +%Y%m%d -d yesterday)
input=/user/hive/warehouse/query_log/ds=$day/hr=00
output=/user/work/query_log/ds=$day
# Run the job only if yesterday's input directory exists.
if hadoop fs -test -d "$input"; then
    # Avro jars: -files ships them to each task, -libjars puts them on the classpath.
    jars=/usr/lib/avro/avro.jar,/usr/lib/avro/avro-mapred.jar
    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
        -files "$jars" \
        -libjars "$jars" \
        -D mapred.reduce.tasks=5 \
        -D mapred.output.compress=true \
        -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
        -input "$input" \
        -output "$output" \
        -mapper mapper.py \
        -reducer reducer.py \
        -file ./mapper.py \
        -file ./reducer.py \
        -inputformat org.apache.avro.mapred.AvroAsTextInputFormat \
        -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
        -jobconf mapred.job.name="jqian:$output" \
        -jobconf map.output.key.field.separator=':' \
        -jobconf num.key.fields.for.partition=1
fi
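
The gist does not include mapper.py, so the following is only a minimal sketch of what a compatible mapper might look like. It assumes that AvroAsTextInputFormat hands each Avro record to the mapper as one line of JSON text followed by a tab (the value is empty), and that the map output key should be two ':'-separated fields so that KeyFieldBasedPartitioner can partition on the first one. The field names `query` and `user_id` are hypothetical and do not come from the original job.

```python
#!/usr/bin/env python3
"""Hypothetical mapper.py sketch for the streaming job above (not the author's code)."""
import json
import sys


def map_line(line, primary="query", secondary="user_id"):
    """Return 'primary:secondary<TAB>1' for one JSON record, or None if unusable.

    `primary` and `secondary` are assumed field names; adjust to the real schema.
    """
    # strip() drops the newline and the trailing tab left by the empty value.
    text = line.strip()
    if not text:
        return None
    try:
        record = json.loads(text)
    except ValueError:
        # Skip lines that are not valid JSON instead of failing the task.
        return None
    if not isinstance(record, dict) or primary not in record:
        return None
    # ':' separates key fields in this job, so it must not appear inside a field.
    first = str(record[primary]).replace(":", " ")
    second = str(record.get(secondary, "")).replace(":", " ")
    return "%s:%s\t1" % (first, second)


def main(stdin=sys.stdin, stdout=sys.stdout):
    for line in stdin:
        out = map_line(line)
        if out is not None:
            stdout.write(out + "\n")


if __name__ == "__main__":
    main()
```

With `num.key.fields.for.partition=1`, all records sharing the same first key field should reach the same reducer, while the second field remains part of the key. Field values that contain tabs or newlines would need extra escaping, which this sketch leaves out.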