Accessing HDFS Files from Spark
This subsection contains information for running Spark jobs over HDFS data.
Specifying Compression
To add a compression library to Spark, you can use the --jars
option. For
an example, see Adding Libraries to Spark.
To save a Spark RDD to HDFS in compressed format, use code similar to the following (the example uses the GZip algorithm):
rdd.saveAsHadoopFile("/tmp/spark_compressed", "org.apache.hadoop.mapred.TextOutputFormat", compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")
For more information about supported compression algorithms, see Configuring HDFS Compression in the HDFS Administration Guide.
Accessing HDFS from PySpark
When accessing an HDFS file from PySpark, you must set HADOOP_CONF_DIR
in
an environment variable, as in the following example:
export HADOOP_CONF_DIR=/etc/hadoop/conf [hrt_qa@ip-172-31-42-188 spark]$ pyspark [hrt_qa@ip-172-31-42-188 spark]$ >>>lines = sc.textFile("hdfs://ip-172-31-42-188.ec2.internal:8020/tmp/PySparkTest/file-01") .......
If HADOOP_CONF_DIR
is not set properly, you might see the following
error:
Error from secure cluster
2016-08-22 00:27:06,046|t1.machine|INFO|1580|140672245782272|MainThread|Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
2016-08-22 00:27:06,047|t1.machine|INFO|1580|140672245782272|MainThread|: org.apache.hadoop.security.AccessControlException: SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]
2016-08-22 00:27:06,047|t1.machine|INFO|1580|140672245782272|MainThread|at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
2016-08-22 00:27:06,047|t1.machine|INFO|1580|140672245782272|MainThread|at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
2016-08-22 00:27:06,048|t1.machine|INFO|1580|140672245782272|MainThread|at
{code}