Accessing External Storage from Spark
Spark can access all storage sources supported by Hadoop, including a local file system, HDFS, HBase, Amazon S3, and Microsoft ADLS.
Spark supports many file formats, including text files, RCFile, SequenceFile, any Hadoop InputFormat, Avro, and Parquet, as well as compressed versions of all supported file types.
For developer information about working with external storage, see External Storage in the Spark Programming Guide.
Accessing Compressed Files
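You can read compressed files using one of the following methods:
- textFile(path)
- hadoopFile(path, inputFormatClass)
You can save compressed files using one of the following methods:
- saveAsTextFile(path, compressionCodecClass="codec_class")
- saveAsHadoopFile(path, outputFormatClass, compressionCodecClass="codec_class")
where codec_class is the fully qualified name of a Hadoop compression codec class, for example org.apache.hadoop.io.compress.GzipCodec.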
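As a minimal sketch of the read and save methods above, the following PySpark helper reads a text file and writes it back out with a compression codec. The helper name copy_compressed, the default gzip codec, and any paths passed to it are illustrative assumptions rather than part of this guide; sc is assumed to be an existing SparkContext.

```python
# Fully qualified Hadoop codec class; gzip is chosen here only as an example.
GZIP_CODEC = "org.apache.hadoop.io.compress.GzipCodec"


def copy_compressed(sc, src_path, dst_path, codec_class=GZIP_CODEC):
    """Read a text file (compressed or not) and save it compressed.

    Hypothetical helper: `sc` is an existing SparkContext supplied by the caller.
    """
    # textFile reads compressed input transparently; Hadoop selects the
    # codec from the file extension (for example .gz).
    lines = sc.textFile(src_path)

    # saveAsTextFile compresses the output part files with the codec
    # named by its fully qualified class name.
    lines.saveAsTextFile(dst_path, compressionCodecClass=codec_class)
    return lines
```

With a live SparkContext, a call such as copy_compressed(sc, "hdfs:///tmp/input.txt.gz", "hdfs:///tmp/output_gz") would write compressed part files under the output directory; both paths are hypothetical.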
For examples of accessing Avro and Parquet files, see Spark with Avro and Parquet.
For details on how to access specific types of external storage and files, see:
Using Spark with Azure Data Lake Storage (ADLS)
Microsoft Azure Data Lake Store (ADLS) is a cloud-based filesystem that you can access through Spark applications. Data files are accessed using an adl:// prefix instead of hdfs://. See Configuring ADLS Gen1 Connectivity for instructions to set up ADLS as a storage layer for a CDH cluster.
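To illustrate the adl:// prefix, here is a small PySpark sketch that builds an ADLS path and reads it as text. The helper names adls_uri and read_adls_text are hypothetical, the account name and file path are placeholders, and the sketch assumes the usual Gen1 host name form <account>.azuredatalakestore.net and a cluster that is already configured for ADLS connectivity.

```python
def adls_uri(account_name, path):
    """Build an adl:// URI for a file in an ADLS Gen1 account.

    Assumes the host name form <account>.azuredatalakestore.net.
    """
    return "adl://{}.azuredatalakestore.net/{}".format(
        account_name, path.lstrip("/")
    )


def read_adls_text(sc, account_name, path):
    """Read a text file from ADLS using an existing SparkContext `sc`."""
    # Only the URI scheme differs from an hdfs:// read; credentials are
    # expected to come from the cluster configuration.
    return sc.textFile(adls_uri(account_name, path))
```

For example, adls_uri("myaccount", "/data/events.txt") yields adl://myaccount.azuredatalakestore.net/data/events.txt, where myaccount is a placeholder account name.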