Chapter 14. Using HDP for Workflow and Scheduling With Oozie
Hortonworks Data Platform deploys Apache Oozie for your Hadoop cluster.
Oozie is a server-based workflow engine specialized in running workflow jobs with actions that execute Hadoop jobs, such as MapReduce, Pig, Hive, Sqoop, HDFS operations, and sub-workflows. Oozie supports coordinator jobs, which are sequences of workflow jobs that are created at a given frequency and start when all of the required input data is available.
A command-line client and a browser interface allow you to manage and administer Oozie jobs locally or remotely.
After installing an HDP 2.x cluster by using Ambari 1.5.x, access the Oozie web UI at the following URL:
http://{your.oozie.server.hostname}:11000/oozie
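To confirm that the Oozie server is running before you use the web UI or command-line client, you can query its status. The following is a minimal sketch, assuming the default port 11000 and that the Oozie client is installed on the host where you run the commands; replace the hostname placeholder with your own value.

# Check the Oozie server status with the CLI; a healthy server prints "System mode: NORMAL"
oozie admin -oozie http://{your.oozie.server.hostname}:11000/oozie -status

# Or query the Oozie web services admin endpoint directly
curl http://{your.oozie.server.hostname}:11000/oozie/v1/admin/status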
Setting the Oozie Client Environment
The Oozie client requires JDK 1.6 or higher and must be available on all systems where the Oozie command line is run. Java must be on the PATH, or $JAVA_HOME must be set to point to a Java 6+ JDK/JRE.
This is a behavior change for the Oozie client from previous releases.
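For example, on a Linux client host you might verify the Java version and export JAVA_HOME before running the Oozie CLI. This is a minimal sketch; the JDK path shown is only an assumption and must match the JDK actually installed on your system.

# Verify that Java 1.6 or higher is on the PATH
java -version

# If it is not, point JAVA_HOME at an installed JDK (example path only) and update the PATH
export JAVA_HOME=/usr/java/default
export PATH=$JAVA_HOME/bin:$PATH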
Additional Oozie Resources
For additional Oozie documentation, use the following resources:
Developer Documentation
Administrator Documentation
ActiveMQ With Oozie and Falcon
You must configure an ActiveMQ URL in the Apache Oozie and Apache Falcon components if those components communicate by using an ActiveMQ server that is running on a different host.
If either of the following circumstances applies to your environment, perform the indicated action.
If Falcon starts the ActiveMQ server by default, but Oozie is running on a different host: Set the ActiveMQ server URL in Oozie.
If Falcon and Oozie communicate with a standalone ActiveMQ server: Set the ActiveMQ server URL in both Oozie and Falcon.
To configure the ActiveMQ URL in Oozie, add the following property through Ambari and restart Oozie:
In Ambari, navigate to Services > Oozie > Configs.
Add the following key/value pair as a property in the Custom oozie-site section.
Key = oozie.jms.producer.connection.properties
Value = java.naming.factory.initial#org.apache.activemq.jndi.ActiveMQInitialContextFactory;java.naming.provider.url#tcp://{ActiveMQ-server-host}:61616;connectionFactoryNames#ConnectionFactory
Navigate to Services > Falcon > Configs.
Add the following value for broker.url in the Falcon startup.properties section.
*.broker.url=tcp://{ActiveMQ-server-host}:61616
Click Service Actions > Restart All to restart the Falcon service.
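As a quick check after the restart, you can confirm that the ActiveMQ broker port is reachable from both the Oozie and Falcon hosts. This is only a sketch; it assumes the default broker port 61616 and that the nc (netcat) utility is available on those hosts.

# Run from the Oozie host and from the Falcon host; a "succeeded" (or "open") result
# means the broker port is reachable over the network.
nc -vz {ActiveMQ-server-host} 61616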
Troubleshooting:
When upgrading Falcon in HDP 2.5 or later, you might encounter the following error when starting the ActiveMQ server:
ERROR - [main:] ~ Failed to start ActiveMQ JMS Message Broker. Reason: java.lang.NegativeArraySizeException (BrokerService:528)
If you encounter this error, follow these steps to delete the ActiveMQ history and then restart Falcon. If you want to retain the history, be sure to back up the ActiveMQ history prior to deleting it.
cd <ACTIVEMQ_DATA_DIR>
rm -rf ./localhost
cd /usr/hdp/current/falcon-server
su -l <FALCON_USER>
./bin/falcon-stop
./bin/falcon-start
Configuring Pig Scripts to Use HCatalog in Oozie Workflows
To access HCatalog with a Pig action in an Oozie workflow, you need to modify configuration information to point to the Hive metastore URIs.
There are two methods for providing this configuration information. Which method you use depends on how many of your Pig actions access HCatalog.
Configuring Individual Pig Actions to Access HCatalog
If only a few individual Pig actions access HCatalog, do the following:
Identify the URI (host and port) for the Thrift metastore server.
In Ambari, click Hive > Configs > Advanced.
Make note of the URI in the hive.metastore.uris field in the General section.
This information is also stored in the hive.default.xml file.
Add the following two properties to the <configuration> element of each Pig action.
Note: Replace [host:port(default:9083)] in the example below with the host and port for the Thrift metastore server.
<configuration>
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://[host:port(default:9083)]</value>
        <description>A comma separated list of metastore uris the client can use to contact the metastore server.</description>
    </property>
    <property>
        <name>oozie.action.sharelib.for.pig</name>
        <value>pig,hive,hcatalog</value>
        <description>A comma separated list of libraries to be used by the Pig action.</description>
    </property>
</configuration>
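To confirm that the pig, hive, and hcatalog libraries referenced by oozie.action.sharelib.for.pig are actually available on the Oozie server, you can list the installed sharelibs with the Oozie CLI. This is a sketch; it assumes the default Oozie URL shown at the start of this chapter and an Oozie version that supports the shareliblist command.

# List the sharelibs known to the Oozie server; pig, hive, and hcatalog
# should all appear in the output.
oozie admin -oozie http://{your.oozie.server.hostname}:11000/oozie -shareliblist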
Configuring All Pig Actions to Access HCatalog
If all of your Pig actions access HCatalog, do the following:
Add the following line to the job.properties files, located in your working directory:
oozie.action.sharelib.for.pig=pig,hive,hcatalog
# A comma separated list of libraries to be used by the Pig action.
Identify the URI (host and port) for the Thrift metastore server.
In Ambari, click Hive > Configs > Advanced.
Make note of the URI in the hive.metastore.uris field in the General section.
This information is also stored in the hive.default.xml file.
Add the following property to the <configuration> element of each Pig action.
Note: Replace [host:port(default:9083)] in the example below with the host and port for the Thrift metastore server.
<configuration>
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://[host:port(default:9083)]</value>
        <description>A comma separated list of metastore uris the client can use to contact the metastore server.</description>
    </property>
</configuration>
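Once the workflow definition and job.properties file are in place, you can submit and monitor the workflow with the Oozie CLI. This is a minimal sketch, assuming the default Oozie URL shown at the start of this chapter and a job.properties file in the current directory; the job ID shown is a placeholder.

# Submit and start the workflow; the command prints the new job ID.
oozie job -oozie http://{your.oozie.server.hostname}:11000/oozie -config job.properties -run

# Check the status of the job by using the returned job ID.
oozie job -oozie http://{your.oozie.server.hostname}:11000/oozie -info <job-id>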