Apache Oozie - Workflow Scheduler for Hadoop

Today we will look at the workflow scheduler component of the Hadoop ecosystem – Oozie. We can use it to schedule our jobs. For comparison, this tool is similar in nature to job schedulers in other environments, such as SQL Server Agent in MSSQL, cron jobs on Linux, and Task Scheduler on Windows. However, scheduling and processing workflows in a distributed environment involves more complexity. Oozie is an open-source Apache product, originally started at Yahoo!, used to run Hadoop jobs with better manageability and error logging than regular cron jobs.


Let’s understand Oozie theoretically before we go ahead, create a job/workflow, and schedule it in Oozie. I know this is boring, but no one ever achieved anything without concepts; you always need to learn your basics first.


What is Oozie?

Apache Oozie is a tool in the Hadoop ecosystem that combines your different types of actions into a workflow. You can also combine multiple workflows into one unit and schedule these workflows at a particular time or on data availability. In short, Oozie is a workflow scheduler for Hadoop workloads.


You use Oozie to form data pipelines.


Oozie Features

  • Scalability – scalable, as the Hadoop cluster takes care of job execution.

  • Reliability – maintains a persistent store that holds all job/workflow information.

  • Multi-Tenancy – takes care of resource sharing and management, isolation, and security for all users.

  • Security – keeps every user’s data secure from other users.

  • Operability – provides job logs, status information, a Web UI, and a CLI.


Oozie Architecture

Oozie can be deployed on a server separate from the Hadoop cluster. Oozie has two parts:

  1. Oozie server: a Java-based web application, running on Tomcat, to which workflows are submitted.

  2. Oozie persistence store / database: a SQL database that stores all information about users, jobs, user requests, etc.


Types of Oozie Jobs

  • Oozie Workflow Jobs – on-demand jobs.

  • Oozie Coordinator Jobs – jobs triggered by time and/or data availability; essentially a schedule/trigger added to your workflow jobs.

  • Oozie Bundle Jobs – a single job that is a collection of coordinator jobs; packaging one or more coordinators into a single unit makes a bundle.
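To make the coordinator idea concrete, here is a minimal sketch of a coordinator.xml that triggers a workflow once a day. The app name, dates, and HDFS path are hypothetical, not from this article:

```xml
<coordinator-app name="daily-coord" frequency="${coord:days(1)}"
                 start="2020-01-01T00:00Z" end="2020-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <!-- HDFS path to the directory containing workflow.xml -->
      <app-path>${nameNode}/user/${user.name}/apps/my-workflow</app-path>
    </workflow>
  </action>
</coordinator-app>
```

The frequency can also be expressed with data-availability triggers (input datasets), but a time-based trigger is the simplest starting point.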


Application Deployment model

Oozie jobs have the following components:

  • an XML file with all job/workflow details – workflow.xml for workflow jobs, coordinator.xml for coordinator jobs, and bundle.xml for bundles

  • configuration files – e.g. job.properties

  • scripts and JAR files – placed in the workspace’s lib folder
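As a hedged example, a minimal job.properties for a workflow job might look like this; the host names, ports, and paths below are assumptions for illustration, not from this article:

```properties
# cluster endpoints (illustrative values)
nameNode=hdfs://namenode.example.com:8020
jobTracker=resourcemanager.example.com:8032
# HDFS directory that contains workflow.xml
oozie.wf.application.path=${nameNode}/user/ad/apps/my-workflow
oozie.use.system.libpath=true
```

For a coordinator job you would set oozie.coord.application.path instead, pointing at the directory that holds coordinator.xml.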


Oozie Workflow

  • An Oozie workflow is a multi-stage Hadoop job.

  • It is a collection of control and action nodes.

  • Control nodes capture control dependencies and decide the flow of control.

  • An action node runs a Hadoop job.

Control Types:

  • <start> - start of the workflow

  • <end> - end of the workflow

  • <kill> - allows the workflow to kill itself, typically on error

  • <fork> - splits execution into parallel paths

  • <join> - joins the forked paths back together

  • <decision> - works like if-else; decides which path to follow depending on a condition

Important: You can nest <fork> and <join>. The only constraint is that forks and joins come in pairs: all paths from one fork must end in that fork’s own join.
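The fork/join pairing can be sketched in a workflow.xml like the one below. This is a simplified, hedged sketch: the names are made up and the action bodies are omitted for brevity (a real action needs a body such as a shell or map-reduce element):

```xml
<workflow-app name="fork-join-demo" xmlns="uri:oozie:workflow:0.5">
  <start to="fork-node"/>
  <fork name="fork-node">
    <path start="action-a"/>
    <path start="action-b"/>
  </fork>
  <action name="action-a">
    <!-- action body omitted -->
    <ok to="join-node"/>
    <error to="fail"/>
  </action>
  <action name="action-b">
    <!-- action body omitted -->
    <ok to="join-node"/>
    <error to="fail"/>
  </action>
  <!-- both parallel paths transition to the same join -->
  <join name="join-node" to="end"/>
  <kill name="fail">
    <message>Workflow failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```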


Action Types:

  • FS – HDFS file system operations (create/delete directories, move files, change permissions)

  • DistCp – distributed copy of data within or across clusters

  • Java – Java programs/actions

  • MapReduce – MapReduce actions

  • Shell – Shell scripts/actions

  • Hive – Hive queries / Hive scripts

  • Spark – Spark actions

  • Sqoop – Sqoop actions

Important: The actions are not limited to the above; using the Shell/Java/MapReduce actions you can run almost any kind of workload from Oozie on a Hadoop cluster. For example, you may use a shell script to connect to a MySQL server and execute SQL scripts.
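For reference, this is roughly what a Shell action looks like inside workflow.xml (HUE generates something similar behind the scenes). The HDFS paths below are hypothetical; the # syntax gives each shipped file a symlink name in the action's working directory:

```xml
<action name="run-shell">
  <shell xmlns="uri:oozie:shell-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <exec>mySQLShell.sh</exec>
    <!-- ship the script and the SQL file to the action's working directory -->
    <file>/user/ad/scripts/mySQLShell.sh#mySQLShell.sh</file>
    <file>/user/ad/scripts/mySQLCommands.sql#mySQLCommands.sql</file>
  </shell>
  <ok to="end"/>
  <error to="fail"/>
</action>
```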


Create an Oozie Workflow

I guess that is enough concepts. Let’s get going and create an Oozie workflow for a Hadoop job. You can create an Oozie workflow using either the command line interface or a great UI – HUE (Hadoop User Experience).


HUE is an open-source web user interface for Hadoop. Hue allows technical and non-technical users to take advantage of Hive, Pig, and many of the other tools that are part of the Hadoop ecosystem.


We will use both HUE and the CLI, but we will start with HUE because it is a simple tool for getting started and building a good understanding of Oozie.
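To preview the CLI route, submitting a workflow from the command line typically looks like this. These commands need a running Oozie server; the server URL below is an assumption for illustration:

```shell
# submit and start a workflow described by job.properties
oozie job -oozie http://oozieserver.example.com:11000/oozie \
          -config job.properties -run

# check the status of a job using the id returned by the previous command
oozie job -oozie http://oozieserver.example.com:11000/oozie -info <job-id>
```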


Create an Oozie workflow using HUE

Let’s understand the case for which we need to create our workflow.


Case: We need a job that connects to MySQL Server and runs a set of MySQL commands. We will create a new table in the existing database myDB and insert some data into it. In the end we will read the inserted data back from this table.

We will need to create a few files for this.


mySQLCommands.sql

Here are the MySQL commands.

USE myDB;
CREATE TABLE siteUsers (id INT, name VARCHAR(30), email VARCHAR(30));
INSERT INTO siteUsers VALUES(1,'Anurag', 'anurag_pandiya@gmail.com');
INSERT INTO siteUsers VALUES(2,'Ankit', 'ankit.bora@hotmail.com');
INSERT INTO siteUsers VALUES(3,'Abhishek','abhishekheisenberg@yahoo.com');
INSERT INTO siteUsers VALUES(4,'Vipul', 'vipul_Osama@hotmail.com');
INSERT INTO siteUsers VALUES(5,'Sobhit', 'sobhitmaut@gmail.com');
INSERT INTO siteUsers VALUES(6,'Salil', 'dixit4u@rediffmail.com');
SELECT * FROM siteUsers;

Save these commands into a file named ‘mySQLCommands.sql’ to create the script file.


mySQLShell.sh

The shell script contains the command to connect to the MySQL server and run the script file.

#!/bin/bash
mysql -u ad -h dbserver.adlab.com -padpass < mySQLCommands.sql

Now we have both ‘mySQLCommands.sql’ and ‘mySQLShell.sh’ ready.
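Hue expects these files to be available on HDFS. As a sketch (the target directory is an assumption, and these commands need access to the cluster), you could upload them like this:

```shell
# create a working directory on HDFS and copy both files into it
hdfs dfs -mkdir -p /user/ad/oozie-demo
hdfs dfs -put mySQLCommands.sql mySQLShell.sh /user/ad/oozie-demo/
```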

Let’s start Hue now and open up Oozie.


Note: You will get the link to your Hue service from your Hadoop admin. Most vendors, like Cloudera and AWS, also provide Hue with their distributions.


Once you have the link to HUE, open it and log in with your credentials. You will see the screen below.

This is the file browser view. Click Workflows -> Editors -> Workflows to create a new workflow.

Click Create to create a new workflow.


Click the workflow name to edit it.

All available actions are shown at the top. Drag and drop a Shell action into the action panel.


Add your shell script, which is placed on HDFS.

Since we are using files, we need to provide paths for both ‘mySQLCommands.sql’ and ‘mySQLShell.sh’ within the action. Click + next to Files and add both file locations.

Now click Save.

We have completed the creation of our workflow. Now click Submit to submit the workflow for execution.

Click Submit again.

Workflow execution will start, and you can check the status on screen.

You can also view job status from Workflows -> Dashboards -> Workflows, which lists workflows from all users. This board is read-only; Oozie ensures user security by letting you modify only your own workflows, not those created by other users.

Wait for the workflow to complete. If everything goes well, you will see a success message.

You may also check the logs for the action executed.

Finally, you may log on to MySQL and verify the created table and inserted records.

USE myDB;
SELECT * FROM siteUsers;

Output:

+------+----------+-------------------------------+
| id   | name     | email                         |
+------+----------+-------------------------------+
|    1 | Anurag   | anurag_pandiya@gmail.com      |
|    2 | Ankit    | ankit.bora@hotmail.com        |
|    3 | Abhishek | abhishekheisenberg@yahoo.com  |
|    4 | Vipul    | vipul_Osama@hotmail.com       |
|    5 | Sobhit   | sobhitmaut@gmail.com          |
|    6 | Salil    | dixit4u@rediffmail.com        |
+------+----------+-------------------------------+
6 rows in set (0.01 sec)