@ottomata
Created February 8, 2018 15:45
$ spark-submit --class org.wikimedia.analytics.refinery.job.JsonRefine /srv/deployment/analytics/refinery/artifacts/refinery-job.jar --help
JSON Datasets -> Partitioned Hive Parquet tables.
Given an input base path, this will search all subdirectories for input
partitions to convert to Parquet backed Hive tables. This was originally
written to work with JSON data imported via Camus into hourly buckets, but
should be configurable to work with any regular import directory hierarchy.
Example:
spark-submit --class org.wikimedia.analytics.refinery.job.JsonRefine refinery-job.jar \
  --input-base-path /wmf/data/raw/event \
  --output-base-path /user/otto/external/eventbus5 \
  --database event \
  --since 24 \
  --input-regex '.*(eqiad|codfw)_(.+)/hourly/(\d+)/(\d+)/(\d+)/(\d+)' \
  --input-capture 'datacenter,table,year,month,day,hour' \
  --table-blacklist '.*page_properties_change.*'
Usage: spark-submit --class org.wikimedia.analytics.refinery.job.JsonRefine refinery-job.jar [options]
NOTE: You may pass all of the described CLI options to this job in a single
string with the --options '<options>' flag.
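For instance, all of the flags from the example above could be collapsed into a single --options string (values here are illustrative):
  spark-submit --class org.wikimedia.analytics.refinery.job.JsonRefine refinery-job.jar \
    --options '--input-base-path /wmf/data/raw/event --database event --since 24'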
--help
Prints this usage text.
-i <path> | --input-base-path <path>
Path to input JSON datasets. This directory is expected to contain
directories of individual (topic) table datasets. E.g.
/path/to/raw/data/{myprefix_dataSetOne,myprefix_dataSetTwo}, etc.
Each of these subdirectories will be searched for partitions that
need to be refined.
-o <path> | --output-base-path <path>
Base path of output data and of external Hive tables. Each table will be created
with a LOCATION in a subdirectory of this path.
-d <database> | --database <database>
Hive database name in which to manage refined Hive tables.
-s <since-date-time> | --since <since-date-time>
Refine all data found since this date time. This may either be given as an integer
number of hours ago, or an ISO-8601 formatted date time. Default: 192 hours ago.
-u <until-date-time> | --until <until-date-time>
Refine all data found until this date time. This may either be given as an integer
number of hours ago, or an ISO-8601 formatted date time. Default: now.
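For example (timestamps here are hypothetical), either form works:
  --since 24                  (refine everything imported in the last 24 hours)
  --since 2018-02-01T00:00:00 --until 2018-02-02T00:00:00   (a fixed ISO-8601 window)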
-R <regex> | --input-regex <regex>
input-regex should match the input partition directory hierarchy starting from the
dataset base path, and should capture the table name and the partition values.
Along with input-capture, this allows arbitrary extraction of table names and
partitions from the input path. You are required to capture at least "table"
using this regex. The default will match an hourly bucketed Camus import hierarchy,
using the topic name as the table name.
-C <capture-list> | --input-capture <capture-list>
input-capture should be a comma-separated list of named capture groups
corresponding to the groups captured by input-regex. These need to be
provided in the order that the groups are captured. This ordering will
also be used for partitioning.
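As an illustration, the regex and capture list from the example above, applied to a hypothetical input path
  /wmf/data/raw/event/eqiad_mediawiki_page_create/hourly/2018/02/08/15
would capture datacenter=eqiad, table=mediawiki_page_create, year=2018, month=02, day=08 and hour=15, and the Hive table would be partitioned in that order.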
-F <format> | --input-datetime-format <format>
This DateTimeFormat will be used to generate all possible partitions since the
--since datetime in each dataset directory, i.e. it formats a DateTime into an
input directory partition path. The finest granularity supported is hourly.
Every hour in the lookback window will be generated, but if you specify a less
granular format (e.g. daily, like "daily"/yyyy/MM/dd), the number of generated
partition paths to search for each day is reduced from 24 to 1. The default is
suitable for generating partitions in an hourly bucketed Camus import hierarchy.
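For a dataset imported into daily rather than hourly buckets (a hypothetical layout), a format along the lines of the daily example above could be passed, e.g.:
  --input-datetime-format '"daily"/yyyy/MM/dd'
so that only one candidate partition path per day is generated and searched.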
-w <regex> | --table-whitelist <regex>
Whitelist regex of table names to refine.
-b <regex> | --table-blacklist <regex>
Blacklist regex of table names to skip.
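For example, a whitelist such as the following (an illustrative pattern) would restrict refinement to mediawiki topics only:
  --table-whitelist '.*mediawiki.*'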
-D <filename> | --done-flag <filename>
When a partition is successfully refined, this file will be created in the
output partition path, containing the input source partition's modification
timestamp (in binary form). This allows subsequent runs to detect if the input
data has changed, meaning the partition needs to be re-refined.
Default: _REFINED
-X <filename> | --failure-flag <filename>
When a partition fails refinement, this file will be created in the
output partition path, containing the input source partition's modification
timestamp (in binary form). Any partition with this flag will be excluded
from refinement if the input data's modtime hasn't changed. If the
modtime has changed, this will re-attempt refinement anyway.
Default: _REFINE_FAILED
-I | --ignore-failure-flag
Set this if you want all discovered partitions with --failure-flag files to be
(re)refined. Default: false
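Because these flag files are written into the output partition directories, one quick way to list which partitions have already been refined (paths here are hypothetical, assuming a standard Hive year=/month=/day=/hour= layout under the output base path) is:
  hdfs dfs -ls '/user/otto/external/eventbus5/*/year=*/month=*/day=*/hour=*/_REFINED'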
-P <parallelism> | --parallelism <parallelism>
Refine into up to this many tables in parallel. Individual partitions
destined for the same Hive table will be refined serially.
Defaults to the number of local CPUs (i.e. the default used by Scala parallel
collections).
-c <codec> | --compression-codec <codec>
Value of spark.sql.parquet.compression.codec, default: snappy
-S <value> | --sequence-file <value>
Set to true if the input data is stored in Hadoop Sequence files.
Otherwise text is assumed. Default: true
-L <limit> | --limit <limit>
Only refine this many partition directories. This is useful while
testing to reduce the number of refinements to do at once. Defaults
to no limit.
-n | --dry-run
Set to true if no action should actually be taken. Instead, targets
to refine will be printed, but they will not be refined.
Default: false
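When trying out a new --input-regex, a dry run over a couple of partitions keeps the job from writing anything (flag values here are illustrative):
  spark-submit --class org.wikimedia.analytics.refinery.job.JsonRefine refinery-job.jar \
    --input-base-path /wmf/data/raw/event \
    --output-base-path /user/otto/external/eventbus5 \
    --database event \
    --dry-run \
    --limit 2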
-E | --send-email-report
Set this flag if you want an email report of any failures during refinement.
-T <smtp-uri> | --smtp-uri <smtp-uri>
SMTP server host:port. Default: mx1001.wikimedia.org
-f <from-email> | --from-email <from-email>
Sender email address for the email report.
-t <to-emails> | --to-emails <to-emails>
Email report recipient email addresses (comma separated). Default: analytics-alerts@wikimedia.org
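To be notified of failures from a regularly scheduled run, the reporting flags can be added to any of the invocations above (addresses here are illustrative):
  --send-email-report \
  --to-emails analytics-alerts@wikimedia.org,me@example.org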