API examples of using file types
You see by example how to write and verify a DataFrame in Parquet, ORC, Avro, or Textfile format to a pre-existing or new table.
Initialize
import com.hortonworks.hwc.HiveWarehouseSession
import com.hortonworks.hwc.HiveWarehouseSession._
val hive = HiveWarehouseSession.session(spark).build()
Create a DataFrame
val df = Seq((1, "bat"), (2, "mouse"), (3, "horse")).toDF("id", "name")
df.show
Table does not exist
- DataFrame write in Parquet
df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table", "parquet_table").option("fileformat","parquet").save()
- DataFrame write in ORC
df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table", "orc_table").option("fileformat","orc").save()
- DataFrame write in Avro
df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table", "avro_table").option("fileformat","avro").save()
- DataFrame write in Textfile
  - With Default Field Delimiter (,)
  df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table", "text_table1").option("fileformat","textfile").save()
  - With Custom Field Delimiter (*)
  df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table", "text_table2").option("fileformat","textfile").option("sep","*").save()
Verify from Beeline
desc formatted parquet_table;
select * from parquet_table;
desc formatted orc_table;
select * from orc_table;
desc formatted avro_table;
select * from avro_table;
desc formatted text_table1;
select * from text_table1;
desc formatted text_table2;
select * from text_table2;
Table already exists
If you already have a table, you do not need to specify a file format, but you can, as shown in the following examples:
Without file format specification
df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table", "parquet_table").save()
With file format specification
Results differ depending on whether the file format specification matches that of the table or not.
A match succeeds
df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table", "parquet_table").option("fileformat","parquet").save()
A mismatch throws an exception
df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table", "parquet_table").option("fileformat","orc").save()
Default Table Format is ORC
df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table", "sample").save()
To change the default format, use the set method:
sql("set spark.datasource.hive.warehouse.default.write.format=parquet")
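After changing the default, a save that omits the fileformat option creates the table in Parquet. A minimal sketch of the effect (the table name sample2 is illustrative, not from the original; requires a Spark session with HWC configured):

```scala
// Assumption: spark, hive, and df are already initialized as shown above.
// Change the default write format from ORC to Parquet for this session.
sql("set spark.datasource.hive.warehouse.default.write.format=parquet")

// No fileformat option given, so a new table "sample2" is created as Parquet.
df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table", "sample2").save()
```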
You can specify properties as options as follows:
.option("compression", "SNAPPY")
.option("transactional", "false")
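For example, these options can be chained onto a single write. A sketch assuming the setup above (the table name orc_snappy_table is illustrative; behavior of the compression and transactional properties depends on your Hive configuration):

```scala
// Assumption: spark, hive, and df are already initialized as shown above.
// Write df as an ORC table with Snappy compression, as a non-transactional table.
df.write.format(HIVE_WAREHOUSE_CONNECTOR)
  .mode("append")
  .option("table", "orc_snappy_table")
  .option("fileformat", "orc")
  .option("compression", "SNAPPY")
  .option("transactional", "false")
  .save()
```

You can confirm the properties from Beeline with desc formatted orc_snappy_table;, which lists the compression codec and transactional setting in the table parameters.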