Introduction to Azure Storage and the ABFS Connector

The Hadoop-Azure module provides support for Azure Data Lake Storage Gen2 storage layer through the abfs connector.

‎Azure Data Lake Storage (ADLS) Gen2 combines the features of Azure Blob storage and Azure Data Lake Storage Gen1. In addition to the existing features of both the services, an important part of Azure Data Lake Storage Gen2 is the addition of hierarchical namespace  to Blob storage.  The hierarchical namespace feature processes operations such as moving, renaming, and deletings directories by updating a single entry (the parent directory). This ensures a significant improvement in performance of query engines writing data to the store, including MapReduce, Spark, Hive, and DistCp. You can also set permissions on a directory instead of per file basis.

Azure Blob Storage

The Azure Storage data model presents three core concepts:

  • Storage Account: All access is done through a storage account.
  • Container: A container is a grouping of multiple blobs. A storage account may have multiple containers. In Hadoop, an entire file system hierarchy is stored in a single container.
  • Blob: A file of any type and size stored with the existing Windows Azure Storage Blob (wasb) connector

Features of the ABFS Connector

The ABFS connector can be used as a replacement for HDFS on Hadoop clusters deployed in Azure infrastructure. Some of the features of the ABFS connector are as follows:

  • Supports reading and writing data stored in an Azure Blob Storage account
  • Helps to perform hadoop operations on Azure datastores
  • Provides a consistent view of the storage across all clients
  • Enables using a simple abfs:// url to access containers and directories
  • Presents a hierarchical file system view by implementing the standard Hadoop File System interface
  • Supports configuration of multiple Azure Blob Storage accounts
  • Acts as a source or destination of data in Hadoop MapReduce, Apache Hive, Apache Spark