Spark 2.4 CSV example
A CSV example illustrates the CSV-handling change in Spark 2.4. If you migrate to Spark 2.4 from Spark 1.6 - Spark 2.3, you need to understand this change.
These are extra line1
These are extra line2
10,Messi,CF
7,Ronaldo,LW
9,Benzema,CF
The following schema definition for the DataFrame reader uses the option DROPMALFORMED. You see only the required data; all the description and error records are removed.
schema=Structtype([Structfield(“jersy”,StringType()),Structfield(“name”,StringType()),Structfi
eld(“position”,StringType())])
df1=spark.read\
.option(“mode”,”DROPMALFORMED”)\
.option(“delimiter”,”,”)\
.schema(schema)\
.csv(“inputfile”)
df1.select(“*”).show()
Output is:
jersy | name | position |
---|---|---|
10 | Messi | CF |
7 | Ronaldo | LW |
9 | Benzema | CF |
Select two columns from the dataframe and invoke show():
df1.select(“jersy”,”name”).show(truncate=False)
jersy | name |
---|---|
These are extra line1 | null |
These are extra line2 | null |
10 | Messi |
7 | Ronaldo |
9 | Benzema |
Malformed records are not dropped and pushed to the first column and the remaining columns will be replaced with null.This is due to the CSV parser column pruning which is set to true by default in Spark 2.4.
Set the following conf, and run the same code, selecting two fields.
spark.conf.set(“spark.sql.csv.parser.columnPruning.enabled”,False)
df2=spark.read\
.option(“mode”,”DROPMALFORMED”)\
.option(“delimiter”,”,”)\
.schema(schema)\
.csv(“inputfile”)
df2.select(“jersy”,”name”).show(truncate=False)
jersy | name |
---|---|
10 | Messi |
7 | Ronaldo |
9 | Benzema |
Conclusion: If working on selective columns, to handle bad records in CSV files, set spark.sql.csv.parser.columnPruning.enabled
to false
; otherwise, the error record is pushed to the first column, and all the remaining columns are treated as nulls.