Limitations of Amazon S3
Although Hadoop's S3A client can make an S3 bucket appear to be a Hadoop-compatible filesystem, S3 is still an object store and has the following limitations when used this way:
- Operations on directories are potentially slow and non-atomic.
- Not all file operations are supported. In particular, some file operations needed by Apache HBase are not available, so HBase cannot be run on top of Amazon S3 without additional features. For this reason, the use of Amazon S3 for HBase in a CDP Private Cloud Base (IaaS) environment is not supported. If you intend to use HBase with S3, Cloudera recommends that you use Cloudera Operational Database instead.
  - Except for versions of HBase specifically designed to work with S3 storage, HBase must not use s3a:// paths for HBase storage.
- S3 cannot be used as a replacement for HDFS as the cluster filesystem in CDP. S3 can be used as a source and destination of work.
- Data is not visible in the object store until the entire output stream has been written.
- S3 supports neither the per-file and per-directory permissions of HDFS nor its more sophisticated ACL mechanism.
- Bandwidth between your workload clusters and Amazon S3 is limited and can vary significantly depending on network and VM load.
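
The directory and visibility limitations in the list above can be illustrated with a toy model. The sketch below is not the S3A client or the S3 API: `ToyObjectStore`, `ToyOutputStream`, `rename_dir`, and `fail_after` are hypothetical names invented for this illustration. It assumes that a "directory" is only a shared key prefix, so a rename has to copy and delete each object in turn, and that written data is published as an object only when the stream is closed.

```python
class ToyObjectStore:
    """In-memory stand-in for a flat key/value object store (illustration only)."""

    def __init__(self):
        self.objects = {}  # key -> bytes; there are no real directories

    def list_prefix(self, prefix):
        # A "directory listing" is just a scan for keys that share a prefix.
        return sorted(k for k in self.objects if k.startswith(prefix))

    def rename_dir(self, src, dst, fail_after=None):
        """Emulate a directory rename as one copy + delete per object.

        fail_after (hypothetical knob) simulates a client failure after
        that many objects have been moved.
        """
        for moved, key in enumerate(self.list_prefix(src)):
            if fail_after is not None and moved >= fail_after:
                raise RuntimeError("client failed mid-rename")
            self.objects[dst + key[len(src):]] = self.objects[key]  # copy
            del self.objects[key]                                   # delete

    def create(self, key):
        return ToyOutputStream(self, key)


class ToyOutputStream:
    """Buffers writes on the client; the object appears only on close()."""

    def __init__(self, store, key):
        self._store, self._key, self._buf = store, key, bytearray()

    def write(self, data):
        self._buf += data  # nothing is visible in the store yet

    def close(self):
        self._store.objects[self._key] = bytes(self._buf)


store = ToyObjectStore()
for i in range(3):
    store.objects[f"out/tmp/part-{i}"] = b"data"

try:
    store.rename_dir("out/tmp/", "out/final/", fail_after=2)
except RuntimeError:
    pass
# Two objects were moved and one was left behind: the rename is not atomic.
print(store.list_prefix("out/"))

stream = store.create("out/log")
stream.write(b"hello")
print("out/log" in store.objects)  # False: data not yet visible
stream.close()
print("out/log" in store.objects)  # True: visible once the stream is closed
```

In this model the cost of `rename_dir` grows with the number of objects under the prefix, and a failure partway through leaves data split between the source and destination prefixes. An HDFS rename, by contrast, is a single metadata operation.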
For these reasons, while Amazon S3 can be used as the source and store for persistent data,
it cannot be used as a direct replacement for a cluster-wide filesystem such as HDFS, or be
used as defaultFS.
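
As an illustration of this guidance, a pre-flight check along the following lines could flag settings that point the cluster filesystem or HBase storage at S3. This is a minimal sketch, not a Cloudera or Hadoop tool: the function `check_filesystem_settings`, the host name, and the bucket name are hypothetical, and the sketch assumes the settings have already been read into a plain dictionary keyed by the `fs.defaultFS` and `hbase.rootdir` property names. It does not account for HBase versions specifically designed to work with S3 storage.

```python
from urllib.parse import urlparse

# URI schemes treated as object storage in this sketch (assumption: S3A only).
OBJECT_STORE_SCHEMES = {"s3a"}


def check_filesystem_settings(settings):
    """Return a list of problems for settings that must not point at S3."""
    problems = []
    for name in ("fs.defaultFS", "hbase.rootdir"):
        value = settings.get(name)
        if value and urlparse(value).scheme in OBJECT_STORE_SCHEMES:
            problems.append(f"{name} must not use an s3a:// path: {value}")
    return problems


# Hypothetical example values: HDFS stays the cluster filesystem,
# but HBase storage has been pointed at a bucket by mistake.
settings = {
    "fs.defaultFS": "hdfs://namenode.example.com:8020",
    "hbase.rootdir": "s3a://example-bucket/hbase",
}
for problem in check_filesystem_settings(settings):
    print(problem)
```

Jobs can still read from and write to s3a:// paths as a source and destination of work; the check only covers the two settings that should continue to reference HDFS.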