@Metras
Created April 11, 2019 19:10
How to schedule and coordinate Oozie workflows in Hadoop
SCHEDULING AND COORDINATING OOZIE WORKFLOWS IN HADOOP
After you’ve created a set of workflows, you can use Oozie coordinator jobs to schedule when they run. There are two scheduling options: by time alone, or by time combined with the availability of data. Thanks to Dirk deRoos for this.
TIME-BASED SCHEDULING FOR OOZIE COORDINATOR JOBS
Oozie coordinator jobs can be scheduled to start at a certain time and, once started, to run at specified intervals. The following example shows a coordinator job that starts at a specified date and time and uses a daily frequency:
<coordinator-app name="sampleCoordinator"
                 frequency="${coord:days(1)}"
                 start="2014-06-01T00:01Z"
                 end="2014-06-01T01:00Z"
                 timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.1">
  <controls>...</controls>
  <action>
    <workflow>
      <app-path>${workflowAppPath}</app-path>
    </workflow>
  </action>
</coordinator-app>
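Because the window in the listing above is shorter than one day, a daily frequency yields a single run. As a sketch of a recurring schedule, the variant below keeps the same start time and moves the end out by a month. The coordinator name and the end date are illustrative assumptions, not values from the original example, and the optional controls element is simply left out.

```xml
<!-- Hypothetical variant: same daily frequency, but a one-month window,
     so Oozie materializes one action per day from June 1 through June 30. -->
<coordinator-app name="sampleDailyCoordinator"
                 frequency="${coord:days(1)}"
                 start="2014-06-01T00:01Z"
                 end="2014-07-01T00:00Z"
                 timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.1">
  <action>
    <workflow>
      <!-- workflowAppPath is a parameter supplied at submission time,
           for example in job.properties -->
      <app-path>${workflowAppPath}</app-path>
    </workflow>
  </action>
</coordinator-app>
```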
TIME AND DATA AVAILABILITY-BASED SCHEDULING FOR OOZIE COORDINATOR JOBS
Oozie coordinator jobs can also be scheduled to execute at a certain time only if specified data files or directories are available. The following listing shows a coordinator that starts at a specified date and time, executes once a day if the dataset identified by triggerDatasetDir exists, and runs until the specified end time:
<coordinator-app name="sampleCoordinator"
                 frequency="${coord:days(1)}"
                 start="${startTime}"
                 end="${endTime}"
                 timezone="${timeZoneDef}"
                 xmlns="uri:oozie:coordinator:0.1">
  <controls>...</controls>
  <datasets>
    <dataset name="input" frequency="${coord:days(1)}" initial-instance="${startTime}" timezone="${timeZoneDef}">
      <uri-template>${triggerDatasetDir}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="sampleInput" dataset="input">
      <instance>${startTime}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>${workflowAppPath}</app-path>
    </workflow>
  </action>
</coordinator-app>
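The listing above checks one fixed location. A common variation, sketched here, gives each day its own directory and hands the resolved path to the workflow. The HDFS path, the explicit _SUCCESS done flag, and the inputDir property name are assumptions for illustration; ${YEAR}, ${MONTH}, ${DAY}, coord:current, and coord:dataIn are standard Oozie coordinator constructs.

```xml
<coordinator-app name="sampleCoordinator"
                 frequency="${coord:days(1)}"
                 start="${startTime}"
                 end="${endTime}"
                 timezone="${timeZoneDef}"
                 xmlns="uri:oozie:coordinator:0.1">
  <datasets>
    <dataset name="input" frequency="${coord:days(1)}"
             initial-instance="${startTime}" timezone="${timeZoneDef}">
      <!-- Hypothetical layout: one directory per day -->
      <uri-template>hdfs://namenode:8020/data/incoming/${YEAR}/${MONTH}/${DAY}</uri-template>
      <!-- Assumed marker file; the instance counts as available once it exists -->
      <done-flag>_SUCCESS</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="sampleInput" dataset="input">
      <!-- current(0) is the dataset instance matching this action's nominal time -->
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>${workflowAppPath}</app-path>
      <configuration>
        <!-- inputDir is a hypothetical workflow parameter name -->
        <property>
          <name>inputDir</name>
          <value>${coord:dataIn('sampleInput')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
```

With this layout, each daily action waits for its own day's directory rather than for a single shared path, and the workflow receives the resolved directory as a parameter.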
RUNNING OOZIE COORDINATOR JOBS
As with Oozie workflow jobs, coordinator jobs require a job.properties file, and the coordinator.xml file must be uploaded to HDFS. The job.properties file points to the coordinator application's HDFS location through the oozie.coord.application.path property. To run an Oozie coordinator job from the Oozie command-line interface, issue a command like the following, making sure that the job.properties file is accessible on the local file system:
$ oozie job -config sampleCoordinator/job.properties -run
After you submit the job, the coordinator is stored in the Oozie object database. On submission, Oozie returns an identifier that you can use to monitor and administer the coordinator, for example job: 0000001-00000001234567-oozie-C.
To check the status of this job, run the following command:
$ oozie job -info 0000001-00000001234567-oozie-C