Created February 8, 2018 15:45
$ spark-submit --class org.wikimedia.analytics.refinery.job.JsonRefine /srv/deployment/analytics/refinery/artifacts/refinery-job.jar --help
JSON Datasets -> Partitioned Hive Parquet tables.

Given an input base path, this will search all subdirectories for input
partitions to convert to Parquet backed Hive tables. This was originally
written to work with JSON data imported via Camus into hourly buckets, but
should be configurable to work with any regular import directory hierarchy.

Example:
  spark-submit --class org.wikimedia.analytics.refinery.job.JsonRefine refinery-job.jar \
    --input-base-path  /wmf/data/raw/event \
    --output-base-path /user/otto/external/eventbus5 \
    --database         event \
    --since            24 \
    --input-regex      '.*(eqiad|codfw)_(.+)/hourly/(\d+)/(\d+)/(\d+)/(\d+)' \
    --input-capture    'datacenter,table,year,month,day,hour' \
    --table-blacklist  '.*page_properties_change.*'

Usage: spark-submit --class org.wikimedia.analytics.refinery.job.JsonRefine refinery-job.jar [options]

NOTE: You may pass all of the described CLI options to this job in a single
string with the --options '<options>' flag.
  --help
        Prints this usage text.

  -i <path> | --input-base-path <path>
        Path to input JSON datasets. This directory is expected to contain
        directories of individual (topic) table datasets, e.g.
        /path/to/raw/data/{myprefix_dataSetOne,myprefix_dataSetTwo}, etc.
        Each of these subdirectories will be searched for partitions that
        need to be refined.

  -o <path> | --output-base-path <path>
        Base path of output data and of external Hive tables. Each table will
        be created with a LOCATION in a subdirectory of this path.

  -d <database> | --database <database>
        Hive database name in which to manage refined Hive tables.

  -s <since-date-time> | --since <since-date-time>
        Refine all data found since this date-time. This may be given either
        as an integer number of hours ago or as an ISO-8601 formatted
        date-time. Default: 192 hours ago.

  -u <until-date-time> | --until <until-date-time>
        Refine all data found until this date-time. This may be given either
        as an integer number of hours ago or as an ISO-8601 formatted
        date-time. Default: now.
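The dual interpretation of `--since`/`--until` (hours-ago integer or ISO-8601 date-time) can be sketched as follows. This is an illustrative Python sketch of the described behavior, not the job's actual Scala parsing code:

```python
from datetime import datetime, timedelta, timezone

def parse_since(value: str) -> datetime:
    """Interpret a --since/--until value either as an integer number of
    hours ago or as an ISO-8601 formatted date-time (sketch only)."""
    if value.isdigit():
        # "--since 24" means "24 hours before now"
        return datetime.now(timezone.utc) - timedelta(hours=int(value))
    # Otherwise treat it as an ISO-8601 date-time string
    return datetime.fromisoformat(value)

print(parse_since("2018-02-08T00:00:00"))
print(parse_since("24"))
```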
  -R <regex> | --input-regex <regex>
        input-regex should match the input partition directory hierarchy
        starting from the dataset base path, and should capture the table
        name and the partition values. Along with input-capture, this allows
        arbitrary extraction of table names and partitions from the input
        path. You are required to capture at least "table" using this regex.
        The default will match an hourly bucketed Camus import hierarchy,
        using the topic name as the table name.

  -C <capture-list> | --input-capture <capture-list>
        input-capture should be a comma-separated list of named capture
        groups corresponding to the groups captured by input-regex. These
        need to be provided in the order that the groups are captured. This
        ordering will also be used for partitioning.
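How `--input-regex` and `--input-capture` pair up can be illustrated in Python: the Nth regex capture group is assigned the Nth name from the capture list. The regex below is the one from the example invocation; the input path is a hypothetical Camus-style hourly import directory, not taken from the source:

```python
import re

input_regex = r'.*(eqiad|codfw)_(.+)/hourly/(\d+)/(\d+)/(\d+)/(\d+)'
input_capture = ['datacenter', 'table', 'year', 'month', 'day', 'hour']

# Hypothetical input partition path for illustration
path = '/wmf/data/raw/event/eqiad_mediawiki.revision-create/hourly/2018/02/08/15'

match = re.match(input_regex, path)
# Zip capture-group values with their names, in capture order
partition = dict(zip(input_capture, match.groups()))
print(partition)
```

The resulting dict carries both the table name ("table" is required) and the partition values, in the order that will also be used for Hive partitioning.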
  -F <format> | --input-datetime-format <format>
        This DateTimeFormat will be used to generate all possible partitions
        since the given lookback-hours in each dataset directory. This format
        will be used to format a DateTime into input directory partition
        paths. The finest granularity supported is hourly. Every hour in the
        past lookback-hours will be generated, but if you specify a less
        granular format (e.g. a daily format like yyyy/MM/dd), the code will
        reduce the generated partition search for that day to 1 instead
        of 24. The default is suitable for generating partitions in an
        hourly bucketed Camus import hierarchy.
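The partition-search generation can be sketched as: walk back hour by hour and format each hour with the given pattern; with a daily pattern, consecutive hours format to the same path, so each day collapses to one candidate. This Python sketch uses strftime patterns rather than the Java DateTimeFormat patterns (yyyy/MM/dd/HH) the job actually takes, and is not the job's Scala code:

```python
from datetime import datetime, timedelta

def candidate_partitions(fmt: str, since: datetime, until: datetime) -> list:
    """Format every hour in [since, until] with fmt, de-duplicating:
    a daily format yields one path per day instead of 24 (sketch only)."""
    seen, hour = [], since
    while hour <= until:
        path = hour.strftime(fmt)
        if path not in seen:
            seen.append(path)
        hour += timedelta(hours=1)
    return seen

hourly = candidate_partitions('%Y/%m/%d/%H',
                              datetime(2018, 2, 8, 0), datetime(2018, 2, 8, 23))
daily = candidate_partitions('%Y/%m/%d',
                             datetime(2018, 2, 8, 0), datetime(2018, 2, 8, 23))
print(len(hourly), len(daily))  # 24 1
```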
  -w <regex> | --table-whitelist <regex>
        Whitelist regex of table names to refine.

  -b <regex> | --table-blacklist <regex>
        Blacklist regex of table names to skip.

  -D <filename> | --done-flag <filename>
        When a partition is successfully refined, this file will be created
        in the output partition path with the binary timestamp of the input
        source partition's modification timestamp. This allows subsequent
        runs to detect whether the input data has changed, meaning the
        partition needs to be re-refined. Default: _REFINED

  -X <filename> | --failure-flag <filename>
        When a partition fails refinement, this file will be created in the
        output partition path with the binary timestamp of the input source
        partition's modification timestamp. Any partition with this flag
        will be excluded from refinement if the input data's modtime hasn't
        changed. If the modtime has changed, refinement will be re-attempted
        anyway. Default: _REFINE_FAILED
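The failure-flag logic described above can be sketched as a small decision function. The function name and arguments are hypothetical, introduced only to illustrate the rule; this is not the job's actual implementation:

```python
from typing import Optional

def should_refine(input_mtime: int,
                  failure_flag_mtime: Optional[int],
                  ignore_failure_flag: bool = False) -> bool:
    """Decide whether to (re)attempt refinement of a failed partition
    (sketch of the --failure-flag / --ignore-failure-flag rule)."""
    if failure_flag_mtime is None:
        return True   # no failure flag: attempt refinement
    if ignore_failure_flag:
        return True   # -I / --ignore-failure-flag forces a retry
    # Retry only if the input data's modtime has changed since the failure
    return input_mtime != failure_flag_mtime

print(should_refine(1000, 1000))  # False: failed before, input unchanged
print(should_refine(2000, 1000))  # True: input data has changed
```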
  -I | --ignore-failure-flag
        Set this if you want all discovered partitions with --failure-flag
        files to be (re)refined. Default: false

  -P <parallelism> | --parallelism <parallelism>
        Refine into up to this many tables in parallel. Individual
        partitions destined for the same Hive table will be refined
        serially. Defaults to the number of local CPUs (i.e. what Scala
        parallel collections use).

  -c <codec> | --compression-codec <codec>
        Value of spark.sql.parquet.compression.codec. Default: snappy

  -S <value> | --sequence-file <value>
        Set to true if the input data is stored in Hadoop Sequence files.
        Otherwise text is assumed. Default: true

  -L <limit> | --limit <limit>
        Only refine this many partition directories. This is useful while
        testing to reduce the number of refinements to do at once. Defaults
        to no limit.

  -n | --dry-run
        Set to true if no action should actually be taken. Instead, targets
        to refine will be printed, but they will not be refined.
        Default: false

  -E | --send-email-report
        Set this flag if you want an email report of any failures during
        refinement.

  -T <smtp-uri> | --smtp-uri <smtp-uri>
        SMTP server host:port. Default: mx1001.wikimedia.org

  -f <from-email> | --from-email <from-email>
        Email report sender address.

  -t <to-emails> | --to-emails <to-emails>
        Email report recipient addresses (comma separated).
        Default: analytics-alerts@wikimedia.org