A List of S3A Configuration Properties
The following fs.s3a configuration properties are available. To override these default S3A settings, add your configuration to your core-site.xml.
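For example, to raise the connection pool size above its default of 15, you could add an entry like the following to core-site.xml. The value shown is purely illustrative; pick one that suits your workload.

  <property>
    <name>fs.s3a.connection.maximum</name>
    <value>30</value>
  </property>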
<property>
  <name>fs.s3a.access.key</name>
  <description>AWS access key ID used by S3A file system. Omit for IAM
    role-based or provider-based authentication.</description>
</property>

<property>
  <name>fs.s3a.secret.key</name>
  <description>AWS secret key used by S3A file system. Omit for IAM role-based
    or provider-based authentication.</description>
</property>

<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <description>
    Comma-separated class names of credential provider classes which implement
    com.amazonaws.auth.AWSCredentialsProvider. These are loaded and queried in
    sequence for a valid set of credentials. Each listed class must implement
    one of the following means of construction, which are attempted in order:
    1. a public constructor accepting java.net.URI and
       org.apache.hadoop.conf.Configuration,
    2. a public static method named getInstance that accepts no arguments and
       returns an instance of com.amazonaws.auth.AWSCredentialsProvider, or
    3. a public default constructor.

    Specifying org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider allows
    anonymous access to a publicly accessible S3 bucket without any
    credentials. Please note that allowing anonymous access to an S3 bucket
    compromises security and therefore is unsuitable for most use cases. It
    can be useful for accessing public data sets without requiring AWS
    credentials.

    If unspecified, then the default list of credential provider classes,
    queried in sequence, is:
    1. org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider: supports static
       configuration of AWS access key ID and secret access key. See also
       fs.s3a.access.key and fs.s3a.secret.key.
    2. com.amazonaws.auth.EnvironmentVariableCredentialsProvider: supports
       configuration of AWS access key ID and secret access key in environment
       variables named AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, as
       documented in the AWS SDK.
    3. com.amazonaws.auth.InstanceProfileCredentialsProvider: supports use of
       instance profile credentials if running in an EC2 VM.
  </description>
</property>

<property>
  <name>fs.s3a.session.token</name>
  <description>Session token, when using
    org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider as one of the
    providers.
  </description>
</property>

<property>
  <name>fs.s3a.security.credential.provider.path</name>
  <value/>
  <description>
    Optional comma separated list of credential providers, a list which is
    prepended to that set in hadoop.security.credential.provider.path
  </description>
</property>

<property>
  <name>fs.s3a.assumed.role.arn</name>
  <value/>
  <description>
    AWS ARN for the role to be assumed.
    Required if the fs.s3a.aws.credentials.provider contains
    org.apache.hadoop.fs.s3a.AssumedRoleCredentialProvider
  </description>
</property>

<property>
  <name>fs.s3a.assumed.role.session.name</name>
  <value/>
  <description>
    Session name for the assumed role, must be valid characters according to
    the AWS APIs. Only used if AssumedRoleCredentialProvider is the AWS
    credential provider. If not set, one is generated from the current
    Hadoop/Kerberos username.
  </description>
</property>

<property>
  <name>fs.s3a.assumed.role.policy</name>
  <value/>
  <description>
    JSON policy to apply to the role. Only used if
    AssumedRoleCredentialProvider is the AWS credential provider.
  </description>
</property>

<property>
  <name>fs.s3a.assumed.role.session.duration</name>
  <value>30m</value>
  <description>
    Duration of assumed roles before a refresh is attempted. Only used if
    AssumedRoleCredentialProvider is the AWS credential provider.
    Range: 15m to 1h
  </description>
</property>
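As an illustration of the assumed-role settings above, a configuration that authenticates by assuming a role might look like the following sketch. The account ID and role name in the ARN are placeholders; the provider class name is the one referenced in the fs.s3a.assumed.role.arn description.

  <property>
    <name>fs.s3a.aws.credentials.provider</name>
    <value>org.apache.hadoop.fs.s3a.AssumedRoleCredentialProvider</value>
  </property>
  <property>
    <name>fs.s3a.assumed.role.arn</name>
    <value>arn:aws:iam::123456789012:role/example-s3a-role</value>
  </property>
  <property>
    <name>fs.s3a.assumed.role.session.duration</name>
    <value>30m</value>
  </property>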
<property>
  <name>fs.s3a.assumed.role.sts.endpoint</name>
  <value/>
  <description>
    AWS Simple Token Service Endpoint. If unset, uses the default endpoint.
    Only used if AssumedRoleCredentialProvider is the AWS credential provider.
  </description>
</property>

<property>
  <name>fs.s3a.assumed.role.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</value>
  <description>
    List of credential providers to authenticate with the STS endpoint and
    retrieve short-lived role credentials. Only used if
    AssumedRoleCredentialProvider is the AWS credential provider. If unset,
    uses "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider".
  </description>
</property>

<property>
  <name>fs.s3a.connection.maximum</name>
  <value>15</value>
  <description>Controls the maximum number of simultaneous connections to
    S3.</description>
</property>

<property>
  <name>fs.s3a.connection.ssl.enabled</name>
  <value>true</value>
  <description>Enables or disables SSL connections to S3.</description>
</property>

<property>
  <name>fs.s3a.endpoint</name>
  <description>AWS S3 endpoint to connect to. An up-to-date list is provided
    in the AWS Documentation: regions and endpoints. Without this property,
    the standard region (s3.amazonaws.com) is assumed.
  </description>
</property>

<property>
  <name>fs.s3a.path.style.access</name>
  <value>false</value>
  <description>Enable S3 path style access, i.e. disable the default virtual
    hosting behaviour. Useful for S3A-compliant storage providers as it
    removes the need to set up DNS for virtual hosting.
  </description>
</property>

<property>
  <name>fs.s3a.proxy.host</name>
  <description>Hostname of the (optional) proxy server for S3
    connections.</description>
</property>

<property>
  <name>fs.s3a.proxy.port</name>
  <description>Proxy server port. If this property is not set but
    fs.s3a.proxy.host is, port 80 or 443 is assumed (consistent with the value
    of fs.s3a.connection.ssl.enabled).
  </description>
</property>

<property>
  <name>fs.s3a.proxy.username</name>
  <description>Username for authenticating with proxy server.</description>
</property>

<property>
  <name>fs.s3a.proxy.password</name>
  <description>Password for authenticating with proxy server.</description>
</property>

<property>
  <name>fs.s3a.proxy.domain</name>
  <description>Domain for authenticating with proxy server.</description>
</property>

<property>
  <name>fs.s3a.proxy.workstation</name>
  <description>Workstation for authenticating with proxy server.</description>
</property>

<property>
  <name>fs.s3a.attempts.maximum</name>
  <value>20</value>
  <description>How many times we should retry commands on transient
    errors.</description>
</property>

<property>
  <name>fs.s3a.connection.establish.timeout</name>
  <value>5000</value>
  <description>Socket connection setup timeout in milliseconds.</description>
</property>

<property>
  <name>fs.s3a.connection.timeout</name>
  <value>200000</value>
  <description>Socket connection timeout in milliseconds.</description>
</property>

<property>
  <name>fs.s3a.socket.send.buffer</name>
  <value>8192</value>
  <description>Socket send buffer hint to amazon connector. Represented in
    bytes.</description>
</property>

<property>
  <name>fs.s3a.socket.recv.buffer</name>
  <value>8192</value>
  <description>Socket receive buffer hint to amazon connector. Represented in
    bytes.</description>
</property>

<property>
  <name>fs.s3a.paging.maximum</name>
  <value>5000</value>
  <description>How many keys to request from S3 when doing directory listings
    at a time.
  </description>
</property>
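As a sketch of how the endpoint and path-style settings above combine, a deployment pointing S3A at an S3-compatible object store could use something along these lines. The hostname is a placeholder for your store's endpoint.

  <property>
    <name>fs.s3a.endpoint</name>
    <value>object-store.example.com</value>
  </property>
  <property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>

Enabling path-style access avoids the need for wildcard DNS entries that virtual-host-style addressing would otherwise require.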
<property>
  <name>fs.s3a.threads.max</name>
  <value>10</value>
  <description>The total number of threads available in the filesystem for
    data uploads *or any other queued filesystem operation*.
  </description>
</property>

<property>
  <name>fs.s3a.threads.keepalivetime</name>
  <value>60</value>
  <description>Number of seconds a thread can be idle before being terminated.
  </description>
</property>

<property>
  <name>fs.s3a.max.total.tasks</name>
  <value>5</value>
  <description>The number of operations which can be queued for
    execution.</description>
</property>

<property>
  <name>fs.s3a.multipart.size</name>
  <value>100M</value>
  <description>How big (in bytes) to split upload or copy operations up into.
    A suffix from the set {K,M,G,T,P} may be used to scale the numeric value.
  </description>
</property>

<property>
  <name>fs.s3a.multipart.threshold</name>
  <value>2147483647</value>
  <description>How big (in bytes) to split upload or copy operations up into.
    This also controls the partition size in renamed files, as rename()
    involves copying the source file(s).
    A suffix from the set {K,M,G,T,P} may be used to scale the numeric value.
  </description>
</property>

<property>
  <name>fs.s3a.multiobjectdelete.enable</name>
  <value>true</value>
  <description>When enabled, multiple single-object delete requests are
    replaced by a single 'delete multiple objects' request, reducing the
    number of requests. Beware: legacy S3-compatible object stores might not
    support this request.
  </description>
</property>

<property>
  <name>fs.s3a.acl.default</name>
  <description>Set a canned ACL for newly created and copied objects. Value
    may be Private, PublicRead, PublicReadWrite, AuthenticatedRead,
    LogDeliveryWrite, BucketOwnerRead, or BucketOwnerFullControl.
  </description>
</property>

<property>
  <name>fs.s3a.multipart.purge</name>
  <value>false</value>
  <description>True if you want to purge existing multipart uploads that may
    not have been completed/aborted correctly. The corresponding purge age is
    defined in fs.s3a.multipart.purge.age. If set, when the filesystem is
    instantiated then all outstanding uploads older than the purge age will be
    terminated, across the entire bucket. This will impact multipart uploads
    by other applications and users, so it should be used sparingly, with an
    age value chosen to stop failed uploads without breaking ongoing
    operations.
  </description>
</property>

<property>
  <name>fs.s3a.multipart.purge.age</name>
  <value>86400</value>
  <description>Minimum age in seconds of multipart uploads to purge on startup
    if "fs.s3a.multipart.purge" is true.
  </description>
</property>

<property>
  <name>fs.s3a.server-side-encryption-algorithm</name>
  <description>Specify a server-side encryption algorithm for the s3a: file
    system. Unset by default. It supports the following values: 'AES256' (for
    SSE-S3), 'SSE-KMS' and 'SSE-C'.
  </description>
</property>

<property>
  <name>fs.s3a.server-side-encryption.key</name>
  <description>Specific encryption key to use if
    fs.s3a.server-side-encryption-algorithm has been set to 'SSE-KMS' or
    'SSE-C'. In the case of SSE-C, the value of this property should be the
    Base64 encoded key. If you are using SSE-KMS and leave this property
    empty, you'll be using your default S3 KMS key; otherwise you should set
    this property to the specific KMS key ID.
  </description>
</property>
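For instance, to request SSE-KMS encryption with a specific key, the two encryption properties above might be combined as follows. The key identifier shown is a placeholder for your own KMS key ID.

  <property>
    <name>fs.s3a.server-side-encryption-algorithm</name>
    <value>SSE-KMS</value>
  </property>
  <property>
    <name>fs.s3a.server-side-encryption.key</name>
    <value>example-kms-key-id</value>
  </property>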
<property>
  <name>fs.s3a.signing-algorithm</name>
  <description>Override the default signing algorithm so legacy
    implementations can still be used.
  </description>
</property>

<property>
  <name>fs.s3a.block.size</name>
  <value>32M</value>
  <description>Block size to use when reading files using the s3a: file
    system. A suffix from the set {K,M,G,T,P} may be used to scale the numeric
    value.
  </description>
</property>

<property>
  <name>fs.s3a.buffer.dir</name>
  <value>${hadoop.tmp.dir}/s3a</value>
  <description>Comma separated list of directories that will be used to buffer
    file uploads to.
  </description>
</property>

<property>
  <name>fs.s3a.fast.upload.buffer</name>
  <value>disk</value>
  <description>
    The buffering mechanism to use for data being written.
    Values: disk, array, bytebuffer.

    "disk" will use the directories listed in fs.s3a.buffer.dir as the
    location(s) to save data prior to being uploaded.

    "array" uses arrays in the JVM heap.

    "bytebuffer" uses off-heap memory within the JVM.

    Both "array" and "bytebuffer" will consume memory in a single stream up to
    the number of blocks set by:
    fs.s3a.multipart.size * fs.s3a.fast.upload.active.blocks.
    If using either of these mechanisms, keep this value low.

    The total number of threads performing work across all threads is set by
    fs.s3a.threads.max, with fs.s3a.max.total.tasks values setting the number
    of queued work items.
  </description>
</property>

<property>
  <name>fs.s3a.fast.upload.active.blocks</name>
  <value>4</value>
  <description>
    Maximum number of blocks a single output stream can have active
    (uploading, or queued to the central FileSystem instance's pool of queued
    operations). This stops a single stream overloading the shared thread
    pool.
  </description>
</property>

<property>
  <name>fs.s3a.readahead.range</name>
  <value>64K</value>
  <description>Bytes to read ahead during a seek() before closing and
    re-opening the S3 HTTP connection. This option will be overridden if any
    call to setReadahead() is made to an open stream.
    A suffix from the set {K,M,G,T,P} may be used to scale the numeric value.
  </description>
</property>

<property>
  <name>fs.s3a.user.agent.prefix</name>
  <value></value>
  <description>
    Sets a custom value that will be prepended to the User-Agent header sent
    in HTTP requests to the S3 back-end by S3AFileSystem. The User-Agent
    header always includes the Hadoop version number followed by a string
    generated by the AWS SDK. An example is
    "User-Agent: Hadoop 2.8.0, aws-sdk-java/1.10.6". If this optional property
    is set, then its value is prepended to create a customized User-Agent. For
    example, if this configuration property was set to "MyApp", then an
    example of the resulting User-Agent would be
    "User-Agent: MyApp, Hadoop 2.8.0, aws-sdk-java/1.10.6".
  </description>
</property>

<property>
  <name>fs.s3a.metadatastore.authoritative</name>
  <value>false</value>
  <description>
    When true, allow MetadataStore implementations to act as source of truth
    for getting file status and directory listings. Even if this is set to
    true, MetadataStore implementations may choose not to return authoritative
    results. If the configured MetadataStore does not support being
    authoritative, this setting will have no effect.
  </description>
</property>

<property>
  <name>fs.s3a.metadatastore.impl</name>
  <value>org.apache.hadoop.fs.s3a.s3guard.NullMetadataStore</value>
  <description>
    Fully-qualified name of the class that implements the MetadataStore to be
    used by s3a. The default class, NullMetadataStore, has no effect: s3a will
    continue to treat the backing S3 service as the one and only source of
    truth for file and directory metadata.
  </description>
</property>
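As a sketch of how the buffering options above interact, an upload-heavy workload that buffers in off-heap memory while capping memory use might combine them roughly as follows. The values are illustrative, not recommendations.

  <property>
    <name>fs.s3a.fast.upload.buffer</name>
    <value>bytebuffer</value>
  </property>
  <property>
    <name>fs.s3a.fast.upload.active.blocks</name>
    <value>2</value>
  </property>
  <property>
    <name>fs.s3a.multipart.size</name>
    <value>64M</value>
  </property>

Following the formula in the fs.s3a.fast.upload.buffer description, each open output stream can then consume up to roughly 64M * 2 = 128M of off-heap memory.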
<property>
  <name>fs.s3a.s3guard.cli.prune.age</name>
  <value>86400000</value>
  <description>
    Default age (in milliseconds) after which to prune metadata from the
    metadatastore when the prune command is run. Can be overridden on the
    command-line.
  </description>
</property>

<property>
  <name>fs.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  <description>The implementation class of the S3A Filesystem.</description>
</property>

<property>
  <name>fs.s3a.s3guard.ddb.region</name>
  <value></value>
  <description>
    AWS DynamoDB region to connect to. An up-to-date list is provided in the
    AWS Documentation: regions and endpoints. Without this property, S3Guard
    will operate the table in the associated S3 bucket region.
  </description>
</property>

<property>
  <name>fs.s3a.s3guard.ddb.table</name>
  <value></value>
  <description>
    The DynamoDB table name to operate on. Without this property, the
    respective S3 bucket name will be used.
  </description>
</property>

<property>
  <name>fs.s3a.s3guard.ddb.table.create</name>
  <value>false</value>
  <description>
    If true, the S3A client will create the table if it does not already
    exist.
  </description>
</property>

<property>
  <name>fs.s3a.s3guard.ddb.table.capacity.read</name>
  <value>500</value>
  <description>
    Provisioned throughput requirements for read operations in terms of
    capacity units for the DynamoDB table. This config value will only be used
    when creating a new DynamoDB table, though later you can manually
    provision by increasing or decreasing read capacity as needed for existing
    tables. See DynamoDB documents for more information.
  </description>
</property>

<property>
  <name>fs.s3a.s3guard.ddb.table.capacity.write</name>
  <value>100</value>
  <description>
    Provisioned throughput requirements for write operations in terms of
    capacity units for the DynamoDB table. Refer to related config
    fs.s3a.s3guard.ddb.table.capacity.read before usage.
  </description>
</property>

<property>
  <name>fs.s3a.s3guard.ddb.max.retries</name>
  <value>9</value>
  <description>
    Max retries on batched DynamoDB operations before giving up and throwing
    an IOException. Each retry is delayed with an exponential backoff timer
    which starts at 100 milliseconds and approximately doubles each time. The
    minimum wait before throwing an exception is
    sum(100, 200, 400, 800, .. 100*2^N-1) == 100 * ((2^N)-1).
    So N = 9 yields at least 51.1 seconds (51,100 milliseconds) of blocking
    before throwing an IOException.
  </description>
</property>

<property>
  <name>fs.s3a.s3guard.ddb.background.sleep</name>
  <value>25</value>
  <description>
    Length (in milliseconds) of pause between each batch of deletes when
    pruning metadata. Prevents prune operations (which can typically be low
    priority background operations) from overly interfering with other I/O
    operations.
  </description>
</property>

<property>
  <name>fs.s3a.retry.limit</name>
  <value>${fs.s3a.attempts.maximum}</value>
  <description>
    Number of times to retry any repeatable S3 client request on failure,
    excluding throttling requests.
  </description>
</property>

<property>
  <name>fs.s3a.retry.interval</name>
  <value>500ms</value>
  <description>
    Interval between attempts to retry operations for any reason other than S3
    throttle errors.
  </description>
</property>
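Tying the S3Guard properties above together, a configuration that enables a DynamoDB-backed metadata store and lets S3A create the table on demand might look like the following sketch. The table name is a placeholder, and org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore is assumed here to be the DynamoDB-backed MetadataStore implementation shipped alongside the NullMetadataStore default listed earlier.

  <property>
    <name>fs.s3a.metadatastore.impl</name>
    <value>org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore</value>
  </property>
  <property>
    <name>fs.s3a.s3guard.ddb.table</name>
    <value>example-s3guard-table</value>
  </property>
  <property>
    <name>fs.s3a.s3guard.ddb.table.create</name>
    <value>true</value>
  </property>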
<property>
  <name>fs.s3a.retry.throttle.limit</name>
  <value>${fs.s3a.attempts.maximum}</value>
  <description>
    Number of times to retry any throttled request.
  </description>
</property>

<property>
  <name>fs.s3a.retry.throttle.interval</name>
  <value>1000ms</value>
  <description>
    Interval between retry attempts on throttled requests.
  </description>
</property>

<property>
  <name>fs.s3a.committer.name</name>
  <value>file</value>
  <description>
    Committer to create for output to S3A, one of:
    "file", "directory", "partitioned", "magic".
  </description>
</property>

<property>
  <name>fs.s3a.committer.magic.enabled</name>
  <value>false</value>
  <description>
    Enable support in the filesystem for the S3 "Magic" committer. When
    working with AWS S3, S3Guard must be enabled for the destination bucket,
    as consistent metadata listings are required.
  </description>
</property>

<property>
  <name>fs.s3a.committer.threads</name>
  <value>8</value>
  <description>
    Number of threads in committers for parallel operations on files (upload,
    commit, abort, delete...).
  </description>
</property>

<property>
  <name>fs.s3a.committer.staging.tmp.path</name>
  <value>tmp/staging</value>
  <description>
    Path in the cluster filesystem for temporary data. This is for HDFS, not
    the local filesystem. It is only for the summary data of each file, not
    the actual data being committed. Using an unqualified path guarantees that
    the full path will be generated relative to the home directory of the user
    creating the job, hence private (assuming home directory permissions are
    secure).
  </description>
</property>

<property>
  <name>fs.s3a.committer.staging.unique-filenames</name>
  <value>true</value>
  <description>
    Option for final files to have a unique name through job attempt info, or
    the value of fs.s3a.committer.staging.uuid. When writing data with the
    "append" conflict option, this guarantees that new data will not overwrite
    any existing data.
  </description>
</property>

<property>
  <name>fs.s3a.committer.staging.conflict-mode</name>
  <value>fail</value>
  <description>
    Staging committer conflict resolution policy.
    Supported: "fail", "append", "replace".
  </description>
</property>

<property>
  <name>fs.s3a.committer.staging.abort.pending.uploads</name>
  <value>true</value>
  <description>
    Should the staging committers abort all pending uploads to the destination
    directory? Change this if more than one partitioned committer is writing
    to the same destination tree simultaneously; otherwise the first job to
    complete will cancel all outstanding uploads from the others. However, it
    may lead to leaked outstanding uploads from failed tasks. If disabled,
    configure the bucket lifecycle to remove uploads after a time period,
    and/or set up a workflow to explicitly delete entries. Otherwise there is
    a risk that uncommitted uploads may run up bills.
  </description>
</property>

<property>
  <name>fs.s3a.list.version</name>
  <value>2</value>
  <description>
    Select which version of the S3 SDK's List Objects API to use. Currently
    supported values are 2 (the default) and 1 (the older API).
  </description>
</property>

<property>
  <name>fs.s3a.etag.checksum.enabled</name>
  <value>false</value>
  <description>
    Should calls to getFileChecksum() return the etag value of the remote
    object? WARNING: if enabled, distcp operations between HDFS and S3 will
    fail unless -skipcrccheck is set.
  </description>
</property>
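As a final illustration, a job that writes its output with the directory staging committer and replaces any existing output could combine the committer properties above roughly as follows; both values come from the supported options listed in the descriptions.

  <property>
    <name>fs.s3a.committer.name</name>
    <value>directory</value>
  </property>
  <property>
    <name>fs.s3a.committer.staging.conflict-mode</name>
    <value>replace</value>
  </property>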