Spark Guide
Also available as:
PDF

Chapter 8. Using Spark with HDFS

Specifying Compression

To add a compression library to Spark, use the --jars option. The following example adds the LZO compression library:

spark-submit --driver-memory 1G --executor-memory 1G --master yarn-client --jars /usr/hdp/2.3.0.0-2557/hadoop/lib/hadoop-lzo-0.6.0.2.3.0.0-2557.jar test_read_write.py

To specify compression in spark-shell when writing to HDFS, use code similar to:

rdd.saveAsHadoopFile("/tmp/spark_compressed", 
"org.apache.hadoop.mapred.TextOutputFormat", 
compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")

For more information about supported compression algorithms, see Configuring HDFS Compression in the HDFS Reference Guide.

Accessing HDFS from PySpark: Setting HADOOP_CONF_DIR

If PySpark is accessing an HDFS file, HADOOP_CONF_DIR needs to be set in an environment variable. For example:

export HADOOP_CONF_DIR=/etc/hadoop/conf
[hrt_qa@ip-172-31-42-188 spark]$ pyspark
[hrt_qa@ip-172-31-42-188 spark]$ >>>lines = sc.textFile("hdfs://ip-172-31-42-188.ec2.internal:8020/tmp/PySparkTest/file-01")
.......

If HADOOP_CONF_DIR is not set properly, you might see the following error:

Error from secure cluster

2015-09-04 00:27:06,046|t1.machine|INFO|1580|140672245782272|MainThread|Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
2015-09-04 00:27:06,047|t1.machine|INFO|1580|140672245782272|MainThread|: org.apache.hadoop.security.AccessControlException: SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]
2015-09-04 00:27:06,047|t1.machine|INFO|1580|140672245782272|MainThread|at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
2015-09-04 00:27:06,047|t1.machine|INFO|1580|140672245782272|MainThread|at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
2015-09-04 00:27:06,048|t1.machine|INFO|1580|140672245782272|MainThread|at 
{code}