Limitations of Amazon S3
Even though Hadoop's S3A client can make an S3 bucket appear to be a Hadoop-compatible filesystem, S3 is still an object store and has some limitations when acting as one.
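To make the abstraction concrete, the sketch below lists a directory in a bucket through the standard Hadoop FileSystem API. It is a minimal illustration, assuming a client with the hadoop-aws module and S3 credentials already configured; the bucket name example-bucket, the data/ prefix, and the class name are placeholders.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3AListing {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Credentials normally come from fs.s3a.access.key / fs.s3a.secret.key,
        // environment variables, or an instance profile; none are set here.
        FileSystem fs = FileSystem.get(URI.create("s3a://example-bucket/"), conf);
        // The bucket now answers ordinary filesystem calls, even though each
        // "directory" is really just a shared key prefix in the object store.
        for (FileStatus status : fs.listStatus(new Path("s3a://example-bucket/data/"))) {
            System.out.println(status.getPath() + "  " + status.getLen());
        }
        fs.close();
    }
}
```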
- Operations on directories are potentially slow and non-atomic (see the rename sketch after this list).
- Not all file operations are supported. In particular, some file operations needed by Apache HBase are not available, so HBase cannot be run on top of Amazon S3 without additional features provided by a service such as S3Guard. The HBase Object Store Semantics (HBOSS) adapter bridges the gap between HBase and the S3A filesystem. HBOSS depends on S3Guard for accessing S3A buckets, so ensure that the given cluster and target AWS account fulfill all S3Guard requirements.
- Except for versions of HBase specifically designed to work with S3 storage, HBase must not use s3a:// paths for HBase storage.
- S3 cannot be used as a replacement for HDFS as the cluster filesystem in CDP. S3 can, however, be used as a source and destination of work (a pattern sketched at the end of this section).
- Data is not visible in the object store until the entire output stream has been written (see the visibility sketch after this list).
- Neither the per-file and per-directory permissions supported by HDFS nor its more sophisticated ACL mechanism are supported.
- Bandwidth between your workload clusters and Amazon S3 is limited and can vary significantly depending on network and VM load.
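The directory limitation in the first item is easiest to see with a rename. The sketch below is a hedged illustration, not an authoritative implementation: the bucket and the staging/ and final/ paths are placeholders, and the setup assumes a client configured as in the listing example above.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3ARenameSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("s3a://example-bucket/"), conf);

        // On HDFS this is a single atomic metadata update. On S3A the client
        // copies every object under staging/ to final/ and then deletes the
        // originals, so the call is O(amount of data) and a failure midway
        // can leave the directory partially renamed.
        boolean ok = fs.rename(new Path("s3a://example-bucket/staging"),
                               new Path("s3a://example-bucket/final"));
        System.out.println("rename returned " + ok);
        fs.close();
    }
}
```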
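The visibility limitation is just as observable: a file created through S3A does not appear in the bucket until the output stream is closed and the underlying upload completes. A minimal sketch, again with a placeholder bucket and path:

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3AVisibilitySketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("s3a://example-bucket/"), conf);
        Path path = new Path("s3a://example-bucket/tmp/visibility-check.txt");

        FSDataOutputStream out = fs.create(path, true /* overwrite */);
        out.write("partial results".getBytes(StandardCharsets.UTF_8));
        // Unlike HDFS, writing does not make the bytes durable or visible:
        // S3A buffers the data and uploads it only when the stream closes.
        System.out.println("visible before close? " + fs.exists(path)); // expect false

        out.close(); // the object is created here, in one step
        System.out.println("visible after close?  " + fs.exists(path)); // expect true
        fs.close();
    }
}
```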
For these reasons, while Amazon S3 can be used as the source and store for persistent data, it cannot be used as a direct replacement for a cluster-wide filesystem such as HDFS, or be used as the defaultFS.
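The supported pattern, then, is to keep HDFS as the defaultFS and address S3 only through explicit s3a:// URIs at the edges of a job. The sketch below illustrates this with FileUtil.copy; the bucket name, HDFS paths, and class name are hypothetical.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class S3ASourceAndSink {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS remains the cluster's HDFS; S3 is never made the
        // default filesystem, only reached through explicit s3a:// URIs.
        FileSystem hdfs = FileSystem.get(conf);
        FileSystem s3 = FileSystem.get(URI.create("s3a://example-bucket/"), conf);

        // S3 as a source: stage input data into HDFS before processing.
        FileUtil.copy(s3, new Path("s3a://example-bucket/input/dataset.csv"),
                      hdfs, new Path("/user/etl/input/dataset.csv"),
                      false /* deleteSource */, conf);

        // S3 as a destination: publish results once the job has finished.
        FileUtil.copy(hdfs, new Path("/user/etl/output/results.csv"),
                      s3, new Path("s3a://example-bucket/output/results.csv"),
                      false /* deleteSource */, conf);
    }
}
```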