Developing Apache Spark Applications

Accessing HDFS Files from Spark

This section contains information on running Spark jobs over HDFS data.

To add a compression library to Spark, you can use the --jars option. For an example, see "Adding Libraries to Spark" in this guide.
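For example, a spark-submit command that distributes a codec JAR to the executors might look like the following (the JAR path and application script are placeholders):

$ spark-submit --jars /local/path/to/compression-codec.jar my_spark_app.py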

To save a Spark RDD to HDFS in compressed format, use code similar to the following (this PySpark example uses the gzip codec):

# Write the RDD as gzip-compressed text files under /tmp/spark_compressed
rdd.saveAsHadoopFile("/tmp/spark_compressed",
                     "org.apache.hadoop.mapred.TextOutputFormat",
                     compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")
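Spark can read the compressed output back without additional configuration, because files with recognized compression extensions (such as .gz) are decompressed automatically. A minimal sketch, reusing the output path from the example above:

# Read the gzip-compressed text files back into an RDD;
# Spark decompresses .gz files transparently.
compressed_rdd = sc.textFile("/tmp/spark_compressed")
compressed_rdd.take(5)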

For more information about supported compression algorithms, see "Configuring HDFS Compression" in the HDP Data Storage guide.

When accessing an HDFS file from PySpark, you must set the HADOOP_CONF_DIR environment variable, as in the following example:

$ export HADOOP_CONF_DIR=/etc/hadoop/conf
$ pyspark
>>> lines = sc.textFile("hdfs://namenode.example.com:8020/tmp/PySparkTest/file-01")
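The same read can also be performed from a standalone PySpark application submitted with spark-submit, as long as HADOOP_CONF_DIR is exported in the submitting shell. A minimal sketch, reusing the NameNode host and path from the example above (the application name is a placeholder):

from pyspark import SparkContext

# HADOOP_CONF_DIR must be set in the environment before the job is submitted.
sc = SparkContext(appName="HDFSReadExample")
lines = sc.textFile("hdfs://namenode.example.com:8020/tmp/PySparkTest/file-01")
print(lines.count())
sc.stop()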

If HADOOP_CONF_DIR is not set properly, you might see an error similar to the following:

2016-08-22 00:27:06,046|t1.machine|INFO|1580|140672245782272|MainThread|Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
2016-08-22 00:27:06,047|t1.machine|INFO|1580|140672245782272|MainThread|: org.apache.hadoop.security.AccessControlException: SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]
2016-08-22 00:27:06,047|t1.machine|INFO|1580|140672245782272|MainThread|at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
2016-08-22 00:27:06,047|t1.machine|INFO|1580|140672245782272|MainThread|at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
2016-08-22 00:27:06,048|t1.machine|INFO|1580|140672245782272|MainThread|at 