Integrating Apache Hive with Spark and KafkaPDF version

Examples of writing data in various file formats

Hive Warehouse Connector (HWC) enables you to write to tables in various formats, such as Parquet, ORC, AVRO, and Textfile. You see by example how to write a Dataframe in these formats, to a pre-existing or new table.

Initialize

import com.hortonworks.hwc.HiveWarehouseSession

import com.hortonworks.hwc.HiveWarehouseSession._

val hive = HiveWarehouseSession.session(spark).build()

Create a Dataframe

val df = Seq((1, "bat"), (2, "mouse"), (3, "horse")).toDF("id", "name")

df.show

Examples of a Dataframe write to new tables:

  • Dataframe write in Parquet
    df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table",
        "parquet_table").option("fileformat","parquet").save()
  • Dataframe write in ORC
    df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table",
        "orc_table").option("fileformat","orc").save()
  • Dataframe write in Avro
    df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table",
        "avro_table").option("fileformat","avro").save()
  • Dataframe write in Textfile
    • With Default Field Delimiter (,)
      df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table",
          "text_table1").option("fileformat","textfile").save()
    • With Custom Field Delimiter (*)
      df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table",
          "text_table2").option("fileformat","textfile").option("sep","*").save()

If you already have a table, you do not need to specify a file format, but you can as shown in the following examples:

Without file format specification

df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table",
"parquet_table").save()

With file format specification

Results differ depending on whether the file format specification matches that of the table or not. If there is a mismatch in the file format, an exception is displayed.

A match succeeds

df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table",
"parquet_table").option("fileformat","parquet").save()

Throws exception if there is a mismatch in the file format

df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table", 
"parquet_table").option("fileformat","orc").save()

Default Table Format is ORC

If you do not specify a file format for the table, the table is created in ORC format.

df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table", "sample").save()

To change the default format, use the set method:

sql("set spark.datasource.hive.warehouse.default.write.format=parquet")

You can specify properties as options as follows:

.option("compression", "SNAPPY")

.option("transactional", "false")