Appendix G - Setting Up a Gateway Node to Restrict Access
Use the instructions that follow to set up and use a Hadoop cluster that is entirely firewalled off from outside access; the only exception will be one node which will act as a gateway. Client machines can access the cluster through the gateway via the REST API.
HttpFS will be used to allow REST access to HDFS, and Oozie will allow REST access for submitting and monitoring jobs.
Installing and Configuring the Firewall and Gateway
Follow these steps:
- Choose a cluster node to be the gateway machine.
- Install and configure the Oozie server by following the standard directions starting here: Installing Oozie
- Install HttpFS.
- Start the Oozie server:
$ sudo service oozie start
- Start the HttpFS server:
$ sudo service hadoop-httpfs start
- Configure firewalls. Block all access from outside the cluster.
  - The gateway node should have ports 11000 (oozie) and 14000 (hadoop-httpfs) open.
  - Optionally, to maintain access to the Web UIs for the cluster's JobTrackers, NameNodes, etc., open their HTTP ports: see Ports Used by Components of CDH 5.
- Optionally configure authentication in simple mode (default) or using Kerberos. See HttpFS Security Configuration to configure Kerberos for HttpFS and Oozie Security Configuration to configure Kerberos for Oozie.
- Optionally encrypt communication via HTTPS for Oozie by following these directions.
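The firewall step above could be sketched as follows for a Linux gateway that uses iptables. This is an assumption, since the procedure does not name a firewall tool. The function only prints a rule set in iptables-restore format; the CLUSTER_NET variable and the 10.0.0.0/24 subnet are hypothetical placeholders for the cluster's internal network.

```shell
#!/bin/sh
# Hypothetical internal subnet; replace with your cluster's real network.
CLUSTER_NET="${CLUSTER_NET:-10.0.0.0/24}"

# Print a rule set (iptables-restore format) for the gateway node:
# drop inbound traffic by default, allow loopback, established
# connections, the cluster subnet, and the Oozie and HttpFS ports.
gateway_rules() {
  cat <<EOF
*filter
:INPUT DROP [0:0]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -i lo -j ACCEPT
-A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
-A INPUT -s ${CLUSTER_NET} -j ACCEPT
-A INPUT -p tcp --dport 11000 -j ACCEPT
-A INPUT -p tcp --dport 14000 -j ACCEPT
COMMIT
EOF
}

# Review the output first, then load it as root, for example:
#   gateway_rules | sudo iptables-restore
```

Note that this sketch also blocks SSH from outside the cluster subnet; add a rule for administrative access if you need it, along with any Web UI ports you chose to keep open.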
Accessing HDFS
With the Hadoop client:
All of the standard hadoop fs commands will work; just make sure to specify -fs webhdfs://HOSTNAME:14000. For example (where GATEWAYHOST is the hostname of the gateway machine):
$ hadoop fs -fs webhdfs://GATEWAYHOST:14000 -cat /user/me/myfile.txt
Hello World!
Without the Hadoop client:
You can perform the equivalent of the standard hadoop fs commands by using the WebHDFS REST API and any program that can do GET, PUT, POST, and DELETE requests; for example:
$ curl "http://GATEWAYHOST:14000/webhdfs/v1/user/me/myfile.txt?op=OPEN&user.name=me"
Hello World!
In general, the command will look like this:
$ curl "http://GATEWAYHOST:14000/webhdfs/v1/PATH?[user.name=USER&]op=…"
You can find a full explanation of the commands in the WebHDFS REST API documentation.
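As a rough illustration of the general form above, the following sketch wraps a few common WebHDFS operations in shell functions. The function names and the HDFS_USER variable are hypothetical, and simple (user.name) authentication is assumed. The data=true parameter and the application/octet-stream content type in the upload helper reflect how HttpFS is understood to accept file uploads; check the WebHDFS REST API documentation for your version before relying on them.

```shell
#!/bin/sh
GATEWAYHOST="${GATEWAYHOST:-GATEWAYHOST}"   # hostname of the gateway machine
HDFS_USER="${HDFS_USER:-me}"                # hypothetical HDFS user name

# Build an HttpFS URL. Usage: webhdfs_url PATH OP [EXTRA_QUERY]
webhdfs_url() {
  url="http://${GATEWAYHOST}:14000/webhdfs/v1$1?op=$2&user.name=${HDFS_USER}"
  if [ -n "${3:-}" ]; then
    url="${url}&$3"
  fi
  printf '%s\n' "$url"
}

# List a directory (GET). Usage: hdfs_ls /user/me
hdfs_ls() { curl -s "$(webhdfs_url "$1" LISTSTATUS)"; }

# Create a directory (PUT). Usage: hdfs_mkdir /user/me/input
hdfs_mkdir() { curl -s -X PUT "$(webhdfs_url "$1" MKDIRS)"; }

# Upload a local file (PUT). Usage: hdfs_put local.jar /user/me/lib/local.jar
hdfs_put() {
  curl -s -X PUT -T "$1" \
    -H "Content-Type: application/octet-stream" \
    "$(webhdfs_url "$2" CREATE "data=true")"
}

# Delete a file (DELETE). Usage: hdfs_rm /user/me/old.txt
hdfs_rm() { curl -s -X DELETE "$(webhdfs_url "$1" DELETE)"; }
```

Defining the functions performs no network access; each helper contacts the gateway only when it is called.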
Submitting and Monitoring Jobs
The Oozie REST API currently supports direct submission of MapReduce, Pig, and Hive jobs; Oozie will automatically create a workflow with a single action. For any other action types, or to execute anything more complicated than a single job, you will need to create an actual workflow. Any required files (e.g. JAR files, input data, etc.) must already exist on HDFS; if they don't, you can use HttpFS to upload the files.
With the Oozie client:
All of the standard Oozie commands will work. You can find a full explanation of the commands in the documentation for the command-line utilities.
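For instance, a client machine might point the Oozie command-line client at the gateway as sketched below. The wrapper function names and the job.properties file are hypothetical; the sketch assumes the Oozie client is installed and that the gateway serves Oozie on port 11000 under /oozie.

```shell
#!/bin/sh
# With OOZIE_URL exported, the -oozie option can be omitted from each command.
export OOZIE_URL="http://${GATEWAYHOST:-GATEWAYHOST}:11000/oozie"

# Check that the Oozie server behind the gateway is reachable.
oozie_status() { oozie admin -status; }

# Submit and start a workflow described by a local properties file.
# Usage: oozie_run job.properties
oozie_run() { oozie job -config "$1" -run; }

# Show the status of a job. Usage: oozie_job_info JOB_ID
oozie_job_info() { oozie job -info "$1"; }
```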
Without the Oozie client:
You can perform the equivalent of the standard Oozie commands by using the REST API and any program that can do GET, PUT, and POST requests. You can find a full explanation of the commands in the Oozie Web Services API documentation.
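A minimal sketch of submitting and monitoring a workflow with curl follows. The function names are hypothetical, and the sketch assumes a workflow application that already exists on HDFS, simple authentication, and the v1 endpoints described in the Oozie Web Services API documentation. A real job usually needs more properties than the two shown here.

```shell
#!/bin/sh
OOZIE_BASE="http://${GATEWAYHOST:-GATEWAYHOST}:11000/oozie"

# Print a minimal job configuration. Usage: oozie_job_xml USER APP_PATH
oozie_job_xml() {
  cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property><name>user.name</name><value>$1</value></property>
  <property><name>oozie.wf.application.path</name><value>$2</value></property>
</configuration>
EOF
}

# Submit and start a workflow (POST); the response should carry the job ID.
# Usage: oozie_submit USER APP_PATH
oozie_submit() {
  oozie_job_xml "$1" "$2" |
    curl -s -X POST -H "Content-Type: application/xml;charset=UTF-8" \
      --data-binary @- "${OOZIE_BASE}/v1/jobs?action=start"
}

# Fetch job information as JSON (GET). Usage: oozie_info JOB_ID
oozie_info() { curl -s "${OOZIE_BASE}/v1/job/$1?show=info"; }

# Kill a running job (PUT). Usage: oozie_kill JOB_ID
oozie_kill() { curl -s -X PUT "${OOZIE_BASE}/v1/job/$1?action=kill"; }
```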