Spark 2.4 CSV example

A CSV example illustrates the CSV-handling change in Spark 2.4. If you migrate to Spark 2.4 from any release between Spark 1.6 and Spark 2.3, you need to understand this change.

In the following CSV file, the first two records describe the file. They are not valid data and must be dropped during processing. The actual data to be processed has three columns (jersey, name, position).
These are extra line1
These are extra line2
10,Messi,CF
7,Ronaldo,LW
9,Benzema,CF
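
To reproduce the walkthrough, you can create this file locally. The following is a minimal sketch; the file name inputfile matches the path used in the reader calls below, and writing it to the driver's working directory is an assumption (any location Spark can read works).

# Write the sample CSV, including the two description lines, to a local
# file named "inputfile" (assumed path; adjust for your environment).
sample = """These are extra line1
These are extra line2
10,Messi,CF
7,Ronaldo,LW
9,Benzema,CF
"""
with open("inputfile", "w") as f:
    f.write(sample)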

The following schema definition and DataFrame reader use the option DROPMALFORMED. You see only the required data; all the description (malformed) records are removed.

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("jersey", StringType()),
    StructField("name", StringType()),
    StructField("position", StringType())
])
df1 = spark.read \
    .option("mode", "DROPMALFORMED") \
    .option("delimiter", ",") \
    .schema(schema) \
    .csv("inputfile")
df1.select("*").show()

Output is:

+------+-------+--------+
|jersey|   name|position|
+------+-------+--------+
|    10|  Messi|      CF|
|     7|Ronaldo|      LW|
|     9|Benzema|      CF|
+------+-------+--------+

Select two columns from the DataFrame and invoke show():

df1.select("jersey", "name").show(truncate=False)
+---------------------+-------+
|jersey               |name   |
+---------------------+-------+
|These are extra line1|null   |
|These are extra line2|null   |
|10                   |Messi  |
|7                    |Ronaldo|
|9                    |Benzema|
+---------------------+-------+

Malformed records are not dropped: each one is pushed into the first selected column, and the remaining columns are filled with null. This is due to CSV parser column pruning, which is enabled by default in Spark 2.4. With pruning, the parser parses and validates only the columns the query requests, so records that would be malformed against the full schema pass through.
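
You can confirm the default before changing anything. This small sketch assumes a Spark 2.4 session exposed as spark, as in the examples above:

# CSV parser column pruning is enabled by default in Spark 2.4;
# on an unmodified session this prints "true".
print(spark.conf.get("spark.sql.csv.parser.columnPruning.enabled"))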

Set the following configuration and run the same code, again selecting the two fields.

spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", False)
df2 = spark.read \
    .option("mode", "DROPMALFORMED") \
    .option("delimiter", ",") \
    .schema(schema) \
    .csv("inputfile")
df2.select("jersey", "name").show(truncate=False)
+------+-------+
|jersey|name   |
+------+-------+
|10    |Messi  |
|7     |Ronaldo|
|9     |Benzema|
+------+-------+

Conclusion: When you read only selected columns from a CSV file and need DROPMALFORMED to handle bad records, set spark.sql.csv.parser.columnPruning.enabled to false; otherwise, each bad record is pushed into the first selected column, and all the remaining columns are filled with null.
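
One caveat: spark.conf.set changes the setting for the whole session, so it affects every subsequent CSV read. If only one read needs the change, a sketch along the following lines restores the previous value after the action runs. The variable name previous is illustrative, and schema is the one defined earlier; note that the configuration is consulted when the action executes, so restore it only after show() completes.

# The conf change is session-wide; save the current value, disable
# pruning for this read, and restore the saved value afterwards.
previous = spark.conf.get("spark.sql.csv.parser.columnPruning.enabled")
spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", False)
df3 = spark.read \
    .option("mode", "DROPMALFORMED") \
    .option("delimiter", ",") \
    .schema(schema) \
    .csv("inputfile")
df3.select("jersey", "name").show(truncate=False)   # bad records dropped
spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", previous)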