Accessing Parquet files from Spark SQL applications
Spark SQL supports loading and saving DataFrames from and to a variety of data sources and has native support for Parquet.
To read Parquet files in Spark SQL, use the spark.read.parquet("path") method.
To write Parquet files in Spark SQL, use the DataFrame.write.parquet("path") method.
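For example, the following PySpark sketch reads a Parquet data set and writes it back out; the application name and paths are placeholders, so substitute your own locations.

from pyspark.sql import SparkSession

# Create or reuse a SparkSession (adjust the app name and master for your cluster).
spark = SparkSession.builder.appName("parquet-example").getOrCreate()

# Load a Parquet file or directory into a DataFrame.
df = spark.read.parquet("/tmp/input_parquet")

# Write the DataFrame out as Parquet files.
df.write.parquet("/tmp/output_parquet")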
To set the compression format that Spark uses when writing Parquet files, configure the spark.sql.parquet.compression.codec property:
spark.conf.set("spark.sql.parquet.compression.codec","codec")
The supported codec values are uncompressed, gzip, lzo, and snappy. The default is gzip.
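As a minimal sketch, assuming the SparkSession and DataFrame from the earlier example, the following selects snappy compression before writing Parquet output; the output path is a placeholder.

# Override the default codec for subsequent Parquet writes.
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

# Files written from this point on are snappy-compressed Parquet.
df.write.parquet("/tmp/output_parquet_snappy")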
Currently, Spark looks up column data from Parquet files by using the names stored within the data files. This differs from the default Parquet lookup behavior of Impala and Hive. If data files are produced with a different physical layout due to added or reordered columns, Spark still decodes the column data correctly. However, if the logical layout of the table is changed in the metastore database, for example through an ALTER TABLE CHANGE statement that renames a column, Spark still looks for the data using the now-nonexistent column name and returns NULLs when it cannot locate the column values. To avoid behavior differences between Spark and Impala or Hive when modifying Parquet tables, avoid renaming columns, or use Impala, Hive, or a CREATE TABLE AS SELECT statement to produce a new table and a new set of Parquet files containing embedded column names that match the new layout.
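As an illustration of the CREATE TABLE AS SELECT approach, the sketch below assumes a SparkSession built with enableHiveSupport() and uses hypothetical table and column names; the new table's Parquet files embed the renamed column rather than relying on a metastore-only rename.

# Produce a new table and new Parquet files whose embedded column
# names match the desired layout, instead of renaming a column in place.
# Table and column names here are hypothetical.
spark.sql("""
    CREATE TABLE sales_v2
    STORED AS PARQUET
    AS SELECT id, amount AS total_amount FROM sales
""")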