Canary test for pyspark
command
The following example shows a simple pyspark session that refers to
the SparkContext, calls the collect()
function which runs a Spark 2
job, and writes data to HDFS. This sequence of operations helps to check if there are
obvious configuration issues that prevent Spark jobs from working at all. For the HDFS
path for the output directory, substitute a path that exists on your own system.
$ hdfs dfs -mkdir /user/systest/spark $ pyspark ... SparkSession available as 'spark'. >>> strings = ["one","two","three"] >>> s2 = sc.parallelize(strings) >>> s3 = s2.map(lambda word: word.upper()) >>> s3.collect() ['ONE', 'TWO', 'THREE'] >>> s3.saveAsTextFile('hdfs:///user/systest/spark/canary_test') >>> quit() $ hdfs dfs -ls /user/systest/spark Found 1 items drwxr-xr-x - systest supergroup 0 2016-08-26 14:41 /user/systest/spark/canary_test $ hdfs dfs -ls /user/systest/spark/canary_test Found 3 items -rw-r--r-- 3 systest supergroup 0 2016-08-26 14:41 /user/systest/spark/canary_test/_SUCCESS -rw-r--r-- 3 systest supergroup 4 2016-08-26 14:41 /user/systest/spark/canary_test/part-00000 -rw-r--r-- 3 systest supergroup 10 2016-08-26 14:41 /user/systest/spark/canary_test/part-00001 $ hdfs dfs -cat /user/systest/spark/canary_test/part-00000 ONE $ hdfs dfs -cat /user/systest/spark/canary_test/part-00001 TWO THREE