Using Fast Upload with Amazon S3

By default, the s3a OutputStream implementation buffers the entire file to local disk before uploading it to Amazon S3. As a result, uploads can be very slow and can require a large amount of temporary space on local disks.

As of CDH 5.12, you can configure CDH to use the Fast Upload feature. This feature implements several performance improvements and provides tunable parameters for choosing whether to buffer to disk (the default) or to memory, for setting the number of threads, and for specifying the disk directories used for buffering.

For more information on this feature, and to learn about the tunable parameters, see Hadoop-AWS module: Integration with Amazon Web Services.
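
As a rough illustration of the tunable parameters mentioned above, the following core-site.xml fragment keeps the default disk buffering and sets the buffer directories and thread limit explicitly. The property names are taken from the Apache Hadoop hadoop-aws documentation and are assumed to be supported by your CDH release. The directory paths and numeric values are placeholders, not recommendations.

```xml
<!-- Sketch only: confirm that each property is supported by your CDH release. -->
<property>
  <name>fs.s3a.fast.upload.buffer</name>
  <!-- "disk" is the default; "array" and "bytebuffer" buffer in memory -->
  <value>disk</value>
</property>
<property>
  <name>fs.s3a.buffer.dir</name>
  <!-- Hypothetical paths: comma-separated local directories used for buffering -->
  <value>/data/1/s3a,/data/2/s3a</value>
</property>
<property>
  <name>fs.s3a.threads.max</name>
  <!-- Illustrative value: upper bound on threads for uploads and other queued operations -->
  <value>10</value>
</property>
```

Spreading the buffer directories across several local disks is one way to reduce contention when many files are written at once.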

Enabling Fast Upload Using Cloudera Manager

To enable Fast Upload for clusters managed by Cloudera Manager:
  1. Go to the HDFS service.
  2. Select the Configuration tab.
  3. Search for "core-site.xml" and locate the Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml property.
  4. Add the fs.s3a.fast.upload property and set it to true. See Setting an Advanced Configuration Snippet.
  5. Add any additional tuning properties to the same Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml property.
  6. Click Save Changes.

    Cloudera Manager indicates which services have stale configurations and need to be restarted. Restart the indicated services.
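
When the safety valve is edited as XML, the value entered in steps 4 and 5 might look like the following sketch. The fs.s3a.fast.upload property comes from the steps above. The fs.s3a.multipart.size property is an assumed tuning property from the Apache Hadoop hadoop-aws documentation, shown with an illustrative value.

```xml
<property>
  <name>fs.s3a.fast.upload</name>
  <value>true</value>
</property>
<!-- Assumed tuning property; confirm that your CDH release supports it. -->
<property>
  <name>fs.s3a.multipart.size</name>
  <!-- Illustrative value: 100 MB, expressed in bytes -->
  <value>104857600</value>
</property>
```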

Enabling Fast Upload Using the Command Line

To enable Fast Upload on unmanaged clusters:
  1. Set the fs.s3a.fast.upload property to true in the core-site.xml configuration file. For example:
    <property>
      <name>fs.s3a.fast.upload</name>
      <value>true</value>
    </property>
  2. Set any additional tuning parameters in the core-site.xml file.
  3. Restart the HDFS service.
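
For step 2, one possible tuning is to buffer in memory instead of on disk. The sketch below assumes that the fs.s3a.fast.upload.buffer and fs.s3a.fast.upload.active.blocks properties described in the Apache Hadoop hadoop-aws documentation are available in your release. The values are illustrative only.

```xml
<!-- Sketch only: memory buffering trades local disk usage for memory pressure. -->
<property>
  <name>fs.s3a.fast.upload.buffer</name>
  <!-- "bytebuffer" buffers in off-heap memory; "array" buffers in the JVM heap -->
  <value>bytebuffer</value>
</property>
<property>
  <name>fs.s3a.fast.upload.active.blocks</name>
  <!-- Illustrative value: limits the blocks a single output stream can have uploading or queued -->
  <value>2</value>
</property>
```

Because each active block holds buffered data in memory, a lower limit on active blocks reduces the risk of running out of memory when many streams write at the same time.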