Setting Up a Gateway Node to Restrict Cluster Access
The following instructions describe how to set up and use a Hadoop cluster that is entirely firewalled off from outside access, with the exception of a single node that acts as a gateway. Client machines reach the cluster through the gateway using REST APIs.
HttpFS provides REST access to HDFS, and Oozie provides REST access for submitting and monitoring jobs.
Installing and Configuring the Firewall and Gateway
Follow these steps:
- Choose a cluster node to be the gateway machine.
- Install and configure the Oozie server by following the standard directions in Installing Oozie.
- Install HttpFS.
- Start the Oozie server:
$ sudo service oozie start
- Start the HttpFS server:
$ sudo service hadoop-httpfs start
- Configure firewalls to block all access from outside the cluster, with the following exceptions:
- The gateway node should have ports 11000 (oozie) and 14000 (hadoop-httpfs) open.
- Optionally, to maintain access to the Web UIs for the cluster's JobTrackers, NameNode, and so on, open their HTTP ports: see Ports Used by Components of CDH 5.
- Optionally configure authentication in simple mode (default) or using Kerberos. See HttpFS Authentication to configure Kerberos for HttpFS and Oozie Authentication to configure Kerberos for Oozie.
- Optionally encrypt communication using HTTPS for Oozie by following these directions.
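The firewall step on the gateway node might look roughly like the sketch below. It assumes iptables is the firewall tool (the steps above do not name one) and uses a hypothetical CLUSTER_SUBNET value for the private network shared by the cluster nodes. The function only prints the rules so that they can be reviewed before anything is applied.

```shell
#!/bin/sh
# Sketch only: prints iptables commands for the gateway node; nothing is applied.
# Assumptions (not from the original text): iptables is the firewall tool, and
# CLUSTER_SUBNET is a hypothetical private subnet shared by all cluster nodes.
CLUSTER_SUBNET="${CLUSTER_SUBNET:-10.0.0.0/24}"

gateway_firewall_rules() {
    # Keep loopback, already-established connections, and intra-cluster traffic.
    echo "iptables -A INPUT -i lo -j ACCEPT"
    echo "iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT"
    echo "iptables -A INPUT -s ${CLUSTER_SUBNET} -j ACCEPT"
    # Open only the Oozie (11000) and HttpFS (14000) ports to outside clients.
    for port in 11000 14000; do
        echo "iptables -A INPUT -p tcp --dport ${port} -j ACCEPT"
    done
    # Drop everything else that arrives from outside the cluster.
    echo "iptables -P INPUT DROP"
}
```

Calling gateway_firewall_rules lists the commands. As written, the rules also block SSH from outside the cluster subnet, so add any administrative ports you need before applying them.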
Accessing HDFS
With the Hadoop client:
All of the standard hadoop fs commands will work; just make sure to specify -fs webhdfs://HOSTNAME:14000. For example (where GATEWAYHOST is the hostname of the gateway machine):
$ hadoop fs -fs webhdfs://GATEWAYHOST:14000 -cat /user/me/myfile.txt
Hello World!
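To avoid retyping the -fs option, the command can be wrapped in a small shell function. This is a minimal sketch: the function name gwfs and the GATEWAY_HOST variable are made up for illustration, and the example paths are hypothetical.

```shell
#!/bin/sh
# Sketch only: a wrapper so that the -fs option need not be retyped.
# GATEWAY_HOST is a placeholder for the hostname of the gateway machine.
GATEWAY_HOST="${GATEWAY_HOST:-GATEWAYHOST}"

gwfs() {
    # -fs is a generic option, so it goes before the file system command.
    hadoop fs -fs "webhdfs://${GATEWAY_HOST}:14000" "$@"
}

# Example calls (each needs a reachable gateway and the Hadoop client):
#   gwfs -ls /user/me
#   gwfs -mkdir /user/me/input
#   gwfs -put data.txt /user/me/input/
```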
Without the Hadoop client:
You can run the equivalent of the standard hadoop fs commands by using the WebHDFS REST API and any program that can send GET, PUT, POST, and DELETE requests; for example:
$ curl "http://GATEWAYHOST:14000/webhdfs/v1/user/me/myfile.txt?op=OPEN&user.name=me"
Hello World!
In general, the command will look like this:
$ curl "http://GATEWAYHOST:14000/webhdfs/v1/PATH?[user.name=USER&]op=…"
You can find a full explanation of the commands in the WebHDFS REST API documentation.
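A few common operations can be sketched as curl wrappers that follow the general form above. The function names and the GATEWAY_HOST and HDFS_USER variables are illustrative only, and the user.name parameter assumes simple authentication. LISTSTATUS, OPEN, MKDIRS, and DELETE are operation names from the WebHDFS REST API.

```shell
#!/bin/sh
# Sketch only: helpers for a few WebHDFS operations through HttpFS.
# GATEWAY_HOST and HDFS_USER are placeholders; user.name applies to simple
# authentication. Paths containing special characters would need URL encoding.
GATEWAY_HOST="${GATEWAY_HOST:-GATEWAYHOST}"
HDFS_USER="${HDFS_USER:-me}"

webhdfs_url() {
    # $1 = absolute HDFS path, $2 = WebHDFS operation name (for example OPEN).
    printf 'http://%s:14000/webhdfs/v1%s?user.name=%s&op=%s' \
        "$GATEWAY_HOST" "$1" "$HDFS_USER" "$2"
}

# Each helper takes an absolute HDFS path and prints the server's response.
hdfs_ls()    { curl -s -X GET    "$(webhdfs_url "$1" LISTSTATUS)"; }
hdfs_cat()   { curl -s -X GET    "$(webhdfs_url "$1" OPEN)"; }
hdfs_mkdir() { curl -s -X PUT    "$(webhdfs_url "$1" MKDIRS)"; }
hdfs_rm()    { curl -s -X DELETE "$(webhdfs_url "$1" DELETE)"; }
```

For example, hdfs_ls /user/me requests the directory listing as JSON, and hdfs_cat /user/me/myfile.txt corresponds to the OPEN request shown earlier.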
Submitting and Monitoring Jobs
The Oozie REST API supports the direct submission of jobs for MapReduce, Pig, and Hive; Oozie automatically creates a workflow with a single action. For any other action types, or to execute anything more complicated than a single job, you must create an actual workflow. Required files (JAR files, input data, and so on) must already exist on HDFS; if they do not, you can use HttpFS to upload them.
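Uploading a required file through HttpFS could be sketched as follows. The function name, variables, and file names are hypothetical. The sketch assumes simple authentication and relies on HttpFS accepting the file body in a single CREATE request when data=true and an octet-stream content type are supplied.

```shell
#!/bin/sh
# Sketch only: upload a local file to HDFS through HttpFS on the gateway.
# GATEWAY_HOST and HDFS_USER are placeholders (simple authentication assumed).
GATEWAY_HOST="${GATEWAY_HOST:-GATEWAYHOST}"
HDFS_USER="${HDFS_USER:-me}"

hdfs_upload() {
    # $1 = local file, $2 = absolute HDFS destination path.
    # data=true and the octet-stream content type tell HttpFS that the
    # request body carries the file contents.
    curl -s -X PUT -T "$1" \
        -H "Content-Type: application/octet-stream" \
        "http://${GATEWAY_HOST}:14000/webhdfs/v1${2}?op=CREATE&data=true&user.name=${HDFS_USER}"
}

# Example (hypothetical file names):
#   hdfs_upload my-job.jar /user/me/lib/my-job.jar
```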
With the Oozie client:
All of the standard Oozie commands will work. You can find a full explanation of the commands in the documentation for the command-line utilities.
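As a rough illustration, the Oozie client can be pointed at the gateway with the -oozie option. The wrapper names and the GATEWAY_HOST variable below are made up for this sketch, and job.properties stands for a hypothetical properties file describing a workflow that already exists on HDFS.

```shell
#!/bin/sh
# Sketch only: point the Oozie command-line client at the gateway node.
# GATEWAY_HOST is a placeholder for the hostname of the gateway machine.
GATEWAY_HOST="${GATEWAY_HOST:-GATEWAYHOST}"

oozie_url() {
    # Oozie listens on port 11000 of the gateway.
    printf 'http://%s:11000/oozie' "$GATEWAY_HOST"
}

# $1 = properties file for submit, job ID for info.
oozie_submit() { oozie job  -oozie "$(oozie_url)" -config "$1" -run; }
oozie_info()   { oozie job  -oozie "$(oozie_url)" -info "$1"; }
oozie_list()   { oozie jobs -oozie "$(oozie_url)"; }
```

Exporting OOZIE_URL with the same value lets you omit the -oozie option from each command.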
Without the Oozie client:
You can run all of the standard Oozie commands by using the REST API and any program that can do GET, PUT, and POST requests. You can find a full explanation of the commands in the Oozie Web Services API documentation.
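The sketch below shows what a few of these requests might look like with curl against version 1 of the Oozie Web Services API. The function names and GATEWAY_HOST are illustrative, the job ID and configuration file are supplied by the caller, and authentication parameters are left out, so treat it as a starting point and check it against the API documentation.

```shell
#!/bin/sh
# Sketch only: a few Oozie REST calls through the gateway, without the client.
# GATEWAY_HOST is a placeholder; authentication details are omitted here.
GATEWAY_HOST="${GATEWAY_HOST:-GATEWAYHOST}"

oozie_api() {
    # $1 = resource path under the v1 API, for example "admin/status".
    printf 'http://%s:11000/oozie/v1/%s' "$GATEWAY_HOST" "$1"
}

# Check that the server is up, inspect a job, or kill a job by ID.
oozie_status()   { curl -s -X GET "$(oozie_api admin/status)"; }
oozie_job_info() { curl -s -X GET "$(oozie_api "job/$1?show=info")"; }
oozie_job_kill() { curl -s -X PUT "$(oozie_api "job/$1?action=kill")"; }

oozie_rest_submit() {
    # $1 = local XML file holding the job configuration properties.
    curl -s -X POST -H "Content-Type: application/xml;charset=UTF-8" \
        --data-binary @"$1" "$(oozie_api "jobs?action=start")"
}
```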