Chapter 8. Using Spark with HDFS
Specifying Compression
To specify compression in Spark-shell when writing to HDFS, use code similar to:
rdd.saveAsHadoopFile("/tmp/spark_compressed",
"org.apache.hadoop.mapred.TextOutputFormat",
compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")
Setting HADOOP_CONF_DIR
If PySpark is accessing an HDFS file, HADOOP_CONF_DIR
needs to be set in an
environment variable. For example:
export HADOOP_CONF_DIR=/etc/hadoop/conf [hrt_qa@ip-172-31-42-188 spark]$ pyspark [hrt_qa@ip-172-31-42-188 spark]$ >>>lines = sc.textFile("hdfs://ip-172-31-42-188.ec2.internal:8020/tmp/PySparkTest/file-01") .......
If HADOOP_CONF_DIR is not set properly, you might see the following error:
Error from secure cluster
2015-09-04 00:27:06,046|t1.machine|INFO|1580|140672245782272|MainThread|Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. 2015-09-04 00:27:06,047|t1.machine|INFO|1580|140672245782272|MainThread|: org.apache.hadoop.security.AccessControlException: SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS] 2015-09-04 00:27:06,047|t1.machine|INFO|1580|140672245782272|MainThread|at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) 2015-09-04 00:27:06,047|t1.machine|INFO|1580|140672245782272|MainThread|at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) 2015-09-04 00:27:06,048|t1.machine|INFO|1580|140672245782272|MainThread|at {code}