Adding HttpFS

Minimum Required Role: Cluster Administrator (also provided by Full Administrator)

Apache Hadoop HttpFS is a service that provides HTTP access to HDFS.

HttpFS has a REST HTTP API supporting all HDFS filesystem operations (both read and write).
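As a rough illustration of how a read or write operation maps onto an HTTP request, the following sketch builds the method and URL for a few common operations. It assumes the WebHDFS-compatible /webhdfs/v1 path prefix and the usual HttpFS default port of 14000; the host name httpfs.example.com and the helper name httpfs_request are hypothetical, and only a subset of operations is listed.

```python
from urllib.parse import quote, urlencode

# HTTP method used by a few operations of the WebHDFS-compatible REST API.
# This is a partial table for illustration, not the full operation list.
HTTP_METHODS = {
    "OPEN": "GET",
    "LISTSTATUS": "GET",
    "GETFILESTATUS": "GET",
    "MKDIRS": "PUT",
    "CREATE": "PUT",
    "APPEND": "POST",
    "DELETE": "DELETE",
}


def httpfs_request(host, path, op, port=14000, **params):
    """Return (http_method, url) for one HttpFS operation.

    Hypothetical helper: host and port are assumptions about your deployment.
    """
    op = op.upper()
    if op not in HTTP_METHODS:
        raise ValueError(f"operation not covered by this sketch: {op}")
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    query = urlencode({"op": op, **params})
    url = f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"
    return HTTP_METHODS[op], url


if __name__ == "__main__":
    # Example host name is a placeholder.
    print(*httpfs_request("httpfs.example.com", "/user/alice", "liststatus"))
```

The returned pair can be passed to any HTTP client, such as curl or an HTTP library, which is what makes HttpFS usable from non-Java tools.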

Common HttpFS use cases are:

  • Read and write data in HDFS using HTTP utilities (such as curl or wget) and HTTP libraries from languages other than Java (such as Perl).
  • Transfer data between HDFS clusters running different versions of Hadoop (overcoming RPC versioning issues), for example using Hadoop DistCp.
  • Read and write data in HDFS in a cluster behind a firewall. The HttpFS server acts as a gateway and is the only system allowed to send and receive data through the firewall.

HttpFS supports Hadoop pseudo-authentication, HTTP SPNEGO Kerberos, and additional authentication mechanisms using a plugin API. HttpFS also supports Hadoop proxy user functionality.
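With pseudo-authentication, the caller identifies itself through the user.name query parameter, and a proxy user can act on behalf of another user through the doas parameter. The sketch below only assembles the query string; the helper name auth_query and the user names are hypothetical, the server must be configured to allow the proxy user, and none of this applies to a Kerberos-secured endpoint, which negotiates credentials through SPNEGO instead.

```python
from urllib.parse import urlencode


def auth_query(op, user, doas=None):
    """Build the query string for a pseudo-authenticated HttpFS request.

    user: identity of the caller (sent as user.name).
    doas: optional user to impersonate when the caller is a proxy user.
    """
    params = {"op": op, "user.name": user}
    if doas is not None:
        params["doas"] = doas
    return urlencode(params)


if __name__ == "__main__":
    # A proxy user "hue" asking for file status on behalf of "alice".
    print(auth_query("GETFILESTATUS", "hue", doas="alice"))
```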

The webhdfs client file system implementation can be used to access HttpFS from the Hadoop filesystem command (hadoop fs), from Hadoop DistCp, and from Java applications that use the Hadoop file system Java API.
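In practice this means pointing a webhdfs:// URI at the HttpFS host and port. The following sketch composes such URIs and the corresponding command lines without running them. The host names, paths, port 14000, and the helper names are assumptions for illustration; substitute the values for your cluster.

```python
def webhdfs_uri(host, path, port=14000):
    """Return a webhdfs:// URI that targets an HttpFS server (assumed port)."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    return f"webhdfs://{host}:{port}{path}"


def hadoop_fs_ls(host, path, port=14000):
    """Argument list for listing a directory through HttpFS with hadoop fs."""
    return ["hadoop", "fs", "-ls", webhdfs_uri(host, path, port)]


def distcp_from_httpfs(src_host, src_path, dest_uri, port=14000):
    """Argument list for copying from a remote cluster's HttpFS with DistCp."""
    return ["hadoop", "distcp", webhdfs_uri(src_host, src_path, port), dest_uri]


if __name__ == "__main__":
    # Placeholder host names and paths.
    print(" ".join(hadoop_fs_ls("httpfs.example.com", "/user/alice")))
    print(" ".join(distcp_from_httpfs(
        "httpfs.old-cluster.example.com", "/data/logs", "hdfs://nameservice1/data/logs")))
```

The argument lists can be handed to subprocess.run on a host where the Hadoop client is installed.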

The HttpFS REST HTTP API is interoperable with the WebHDFS REST HTTP API.

For more information about HttpFS, see Hadoop HDFS over HTTP.

The HttpFS role is required for Hue when you enable HDFS high availability.

Adding the HttpFS Role

  1. Go to the HDFS service.
  2. Click the Instances tab.
  3. Click Add Role Instances.
  4. Click the text box below the HttpFS field. The Select Hosts dialog box opens.
  5. Select the host on which to run the role and click OK.
  6. Click Continue.
  7. Select the checkbox next to the HttpFS role and select Actions for Selected > Start.

Using a Load Balancer with HttpFS

Configure the HttpFS Service to work with the load balancer you configured for the service:

  1. In the Cloudera Manager Admin Console, navigate to Cluster > <HDFS service>.
  2. On the Configuration tab, search for the following property:
    HttpFS Load Balancer
  3. Enter the hostname and port for the load balancer in the following format:
    <hostname>:<port>
  4. Save the changes.
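Before entering the value, it can help to check that it matches the <hostname>:<port> format. The sketch below is a hypothetical client-side check, not something Cloudera Manager provides; the function name parse_load_balancer and the example host are assumptions.

```python
def parse_load_balancer(value):
    """Split a '<hostname>:<port>' string and validate the port range."""
    host, sep, port_text = value.strip().rpartition(":")
    if not sep or not host:
        raise ValueError("expected <hostname>:<port>")
    if not (port_text.isascii() and port_text.isdigit()):
        raise ValueError("port must be a decimal number")
    port = int(port_text)
    if not 1 <= port <= 65535:
        raise ValueError("port must be between 1 and 65535")
    return host, port


if __name__ == "__main__":
    # Placeholder load balancer address.
    print(parse_load_balancer("httpfs-lb.example.com:14000"))
```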