@ottomata
Created February 8, 2018 15:45
$ spark-submit --class org.wikimedia.analytics.refinery.job.JsonRefine /srv/deployment/analytics/refinery/artifacts/refinery-job.jar --help
JSON Datasets -> Partitioned Hive Parquet tables.
Given an input base path, this will search all subdirectories for input
partitions to convert to Parquet backed Hive tables. This was originally
written to work with JSON data imported via Camus into hourly buckets, but
should be configurable to work with any regular import directory hierarchy.
Example:
spark-submit --class org.wikimedia.analytics.refinery.job.JsonRefine refinery-job.jar \
  --input-base-path /wmf/data/raw/event \
  --output-base-path /user/otto/external/eventbus5 \
  --database event \
  --since 24 \
  --input-regex '.*(eqiad|codfw)_(.+)/hourly/(\d+)/(\d+)/(\d+)/(\d+)' \
  --input-capture 'datacenter,table,year,month,day,hour' \
  --table-blacklist '.*page_properties_change.*'
Usage: spark-submit --class org.wikimedia.analytics.refinery.job.JsonRefine refinery-job.jar [options]
NOTE: You may pass all of the described CLI options to this job in a single
string with the --options '<options>' flag.
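For instance, all of the flags from the example above could be collapsed into a single --options string (values here are illustrative):
  spark-submit --class org.wikimedia.analytics.refinery.job.JsonRefine refinery-job.jar \
    --options '--input-base-path /wmf/data/raw/event --database event --since 24'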
--help
Prints this usage text.
-i <path> | --input-base-path <path>
Path to input JSON datasets. This directory is expected to contain
directories of individual (topic) table datasets. E.g.
/path/to/raw/data/{myprefix_dataSetOne,myprefix_dataSetTwo}, etc.
Each of these subdirectories will be searched for partitions that
need to be refined.
-o <path> | --output-base-path <path>
Base path of output data and of external Hive tables. Each table will be created
with a LOCATION in a subdirectory of this path.
-d <database> | --database <database>
Hive database name in which to manage refined Hive tables.
-s <since-date-time> | --since <since-date-time>
Refine all data found since this date time. This may either be given as an integer
number of hours ago, or an ISO-8601 formatted date time. Default: 192 hours ago.
-u <until-date-time> | --until <until-date-time>
Refine all data found until this date time. This may either be given as an integer
number of hours ago, or an ISO-8601 formatted date time. Default: now.
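For example (timestamps here are hypothetical), either form works:
  --since 24                  (refine everything imported in the last 24 hours)
  --since 2018-02-01T00:00:00 --until 2018-02-02T00:00:00   (a fixed ISO-8601 window)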
-R <regex> | --input-regex <regex>
input-regex should match the input partition directory hierarchy starting from the
dataset base path, and should capture the table name and the partition values.
Along with input-capture, this allows arbitrary extraction of table names and
partitions from the input path. You are required to capture at least "table"
using this regex. The default will match an hourly bucketed Camus import hierarchy,
using the topic name as the table name.
-C <capture-list> | --input-capture <capture-list>
input-capture should be a comma-separated list of named capture groups
corresponding to the groups captured by input-regex. These need to be
provided in the order that the groups are captured. This ordering will
also be used for partitioning.
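As an illustration, the regex and capture list from the example above, applied to a hypothetical input path
  /wmf/data/raw/event/eqiad_mediawiki_page_create/hourly/2018/02/08/15
would capture datacenter=eqiad, table=mediawiki_page_create, year=2018, month=02, day=08 and hour=15, and the Hive table would be partitioned in that order.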
-F <format> | --input-datetime-format <format>
This DateTimeFormat will be used to generate all possible partitions since the
--since datetime in each dataset directory, i.e. it formats a DateTime into an
input directory partition path. The finest granularity supported is hourly.
Every hour in the lookback window will be generated, but if you specify a less
granular format (e.g. daily, like "daily"/yyyy/MM/dd), the number of generated
partition paths to search for each day is reduced from 24 to 1. The default is
suitable for generating partitions in an hourly bucketed Camus import hierarchy.
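For a dataset imported into daily rather than hourly buckets (a hypothetical layout), a format along the lines of the daily example above could be passed, e.g.:
  --input-datetime-format '"daily"/yyyy/MM/dd'
so that only one candidate partition path per day is generated and searched.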
-w <regex> | --table-whitelist <regex>
Whitelist regex of table names to refine.
-b <regex> | --table-blacklist <regex>
Blacklist regex of table names to skip.
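For example, a whitelist such as the following (an illustrative pattern) would restrict refinement to mediawiki topics only:
  --table-whitelist '.*mediawiki.*'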
-D <filename> | --done-flag <filename>
When a partition is successfully refined, this file will be created in the
output partition path, containing the input source partition's modification
timestamp (in binary form). This allows subsequent runs to detect if the input
data has changed, meaning the partition needs to be re-refined.
Default: _REFINED
-X <filename> | --failure-flag <filename>
When a partition fails refinement, this file will be created in the
output partition path, containing the input source partition's modification
timestamp (in binary form). Any partition with this flag will be excluded
from refinement if the input data's modtime hasn't changed. If the
modtime has changed, this will re-attempt refinement anyway.
Default: _REFINE_FAILED
-I | --ignore-failure-flag
Set this if you want all discovered partitions with --failure-flag files to be
(re)refined. Default: false
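Because these flag files are written into the output partition directories, one quick way to list which partitions have already been refined (paths here are hypothetical, assuming a standard Hive year=/month=/day=/hour= layout under the output base path) is:
  hdfs dfs -ls '/user/otto/external/eventbus5/*/year=*/month=*/day=*/hour=*/_REFINED'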
-P <parallelism> | --parallelism <parallelism>
Refine into up to this many tables in parallel. Individual partitions
destined for the same Hive table will be refined serially.
Defaults to the number of local CPUs (i.e. the default used by Scala parallel
collections).
-c <codec> | --compression-codec <codec>
Value of spark.sql.parquet.compression.codec, default: snappy
-S <value> | --sequence-file <value>
Set to true if the input data is stored in Hadoop Sequence files.
Otherwise text is assumed. Default: true
-L <limit> | --limit <limit>
Only refine this many partition directories. This is useful while
testing to reduce the number of refinements to do at once. Defaults
to no limit.
-n | --dry-run
Set to true if no action should actually be taken. Instead, targets
to refine will be printed, but they will not be refined.
Default: false
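When trying out a new --input-regex, a dry run over a couple of partitions keeps the job from writing anything (flag values here are illustrative):
  spark-submit --class org.wikimedia.analytics.refinery.job.JsonRefine refinery-job.jar \
    --input-base-path /wmf/data/raw/event \
    --output-base-path /user/otto/external/eventbus5 \
    --database event \
    --dry-run \
    --limit 2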
-E | --send-email-report
Set this flag if you want an email report of any failures during refinement.
-T <smtp-uri> | --smtp-uri <smtp-uri>
SMTP server host:port. Default: mx1001.wikimedia.org
-f <from-email> | --from-email <from-email>
Sender email address for the email report.
-t <to-emails> | --to-emails <to-emails>
Email report recipient email addresses (comma separated). Default: analytics-alerts@wikimedia.org
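To be notified of failures from a regularly scheduled run, the reporting flags can be added to any of the invocations above (addresses here are illustrative):
  --send-email-report \
  --to-emails analytics-alerts@wikimedia.org,me@example.org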