Examples of writing data in various file formats
Hive Warehouse Connector (HWC) enables you to write to tables in various formats, such as Parquet, ORC, AVRO, and Textfile. You see by example how to write a Dataframe in these formats, to a pre-existing or new table.
Initialize
import com.hortonworks.hwc.HiveWarehouseSession
import com.hortonworks.hwc.HiveWarehouseSession._
val hive = HiveWarehouseSession.session(spark).build()
Create a Dataframe
val df = Seq((1, "bat"), (2, "mouse"), (3, "horse")).toDF("id",
"name")
df.show
Table does not exist
Examples of a Dataframe write to new tables:
- Dataframe write in Parquet
df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table", "parquet_table").option("fileformat","parquet").save()
- Dataframe write in ORC
df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table", "orc_table").option("fileformat","orc").save()
- Dataframe write in Avro
df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table", "avro_table").option("fileformat","avro").save()
- Dataframe write in Textfile
- With Default Field Delimiter (,)
df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table", "text_table1").option("fileformat","textfile").save()
- With Custom Field Delimiter (*)
df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table", "text_table2").option("fileformat","textfile").option("sep","*").save()
- With Default Field Delimiter (,)
Table already exists
If you already have a table, you do not need to specify a file format, but you can as shown in the following examples:
Without file format specification
df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table", "parquet_table").save()
With file format specification
Results differ depending on whether the file format specification matches that of the table or not. If there is a mismatch in the file format, an exception is displayed.
A match succeeds
df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table", "parquet_table").option("fileformat","parquet").save()
Throws exception if there is a mismatch in the file format
df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table", "parquet_table").option("fileformat","orc").save()
Default Table Format is ORC
If you do not specify a file format for the table, the table is created in ORC format.
df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table", "sample").save()
To change the default format, use the set method:
sql("set spark.datasource.hive.warehouse.default.write.format=parquet")
You can specify properties as options as follows:
.option("compression", "SNAPPY")
.option("transactional", "false")