Coding in Spark for automatic metadata management

The Spark API that saves data directly to a specified location does not generate events in the Hive metastore, so automatic metadata management does not support it.

By: Manish Maheshwari, Data Architect and Data Scientist at Cloudera, Inc.

For example, the following Spark Scala API code does not generate events in the Hive metastore:

Seq((1, 2)).toDF("i", "j").write.save("/user/hive/warehouse/spark_etl.db/customers/date=01012019")

Instead, use the following Spark SQL code to ensure that events are generated in the Hive metastore:

spark.sql("INSERT OVERWRITE TABLE xxx PARTITION (date = , …) SELECT * FROM spark_dataframe")
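The statement above reads from spark_dataframe by name, so the DataFrame has to be visible to Spark SQL first, for example as a temporary view. The following is a minimal sketch of that flow, not a definitive implementation. It assumes a Hive-enabled SparkSession and an existing table spark_etl.customers with integer columns i and j, partitioned by a string column named date. The table name is inferred from the warehouse path shown earlier, and the object name, the helper insertOverwriteSql, and the partition value are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object InsertOverwriteSketch {
  // Hypothetical names: the table is assumed to exist already and to be
  // partitioned by a string column called `date`.
  val table = "spark_etl.customers"
  val view = "spark_dataframe"

  // Builds an INSERT OVERWRITE statement for a single static partition.
  // With a static partition, the SELECT lists only the non-partition columns.
  def insertOverwriteSql(table: String, partitionValue: String, view: String): String =
    s"INSERT OVERWRITE TABLE $table PARTITION (date = '$partitionValue') SELECT i, j FROM $view"

  def main(args: Array[String]): Unit = {
    // Hive support is needed so that the write goes through the metastore.
    val spark = SparkSession.builder()
      .appName("insert-overwrite-sketch")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // Same sample data as the unsupported example.
    val df = Seq((1, 2)).toDF("i", "j")

    // Register the DataFrame under the name that the SQL statement refers to.
    df.createOrReplaceTempView(view)

    // Write through Spark SQL instead of DataFrameWriter.save(path).
    spark.sql(insertOverwriteSql(table, "01012019", view))

    spark.stop()
  }
}
```

Keeping the statement in a small helper makes it easy to log or inspect the exact SQL before it runs. The partition value is interpolated directly into the string, so it should come from trusted input only.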