API examples of using file types

These examples show how to write and verify a Dataframe in Parquet, ORC, Avro, or Textfile format to a pre-existing or new table.

Initialize

import com.hortonworks.hwc.HiveWarehouseSession

import com.hortonworks.hwc.HiveWarehouseSession._

val hive = HiveWarehouseSession.session(spark).build()
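
If your target tables live in a database other than the current one, you can point the session at it before writing. The following is a minimal sketch; the database name is only an illustration:

// Select the database that the subsequent writes should target (name is hypothetical).
hive.setDatabase("default")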

Create a Dataframe

val df = Seq((1, "bat"), (2, "mouse"), (3, "horse")).toDF("id", "name")

df.show
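
For this data, df.show prints the Dataframe contents, roughly as follows:

+---+-----+
| id| name|
+---+-----+
|  1|  bat|
|  2|mouse|
|  3|horse|
+---+-----+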

Table does not exist

  1. Dataframe write in Parquet
    df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table",
        "parquet_table").option("fileformat","parquet").save()
  2. Dataframe write in ORC
    df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table",
        "orc_table").option("fileformat","orc").save()
  3. Dataframe write in Avro
    df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table",
        "avro_table").option("fileformat","avro").save()
  4. Dataframe write in Textfile
    1. With Default Field Delimiter (,)
      df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table",
          "text_table1").option("fileformat","textfile").save()
    2. With Custom Field Delimiter (*)
      df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table",
          "text_table2").option("fileformat","textfile").option("sep","*").save()

Verify from Beeline

desc formatted parquet_table;

select * from parquet_table;

desc formatted orc_table;

select * from orc_table;

desc formatted avro_table;

select * from avro_table;

desc formatted text_table1;

select * from text_table1;

desc formatted text_table2;

select * from text_table2;
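
As an alternative to Beeline, you can read a table back through HWC in the same Spark session. This is a minimal sketch using the connector's executeQuery method; substitute the table you want to check:

// Read one of the newly written tables back into a Dataframe and display it.
hive.executeQuery("select * from parquet_table").show()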

Table already exists

If the table already exists, you do not need to specify a file format, but you can, as shown in the following examples:

Without file format specification

df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table",
"parquet_table").save()

With file format specification

Results differ depending on whether the specified file format matches the format of the existing table.

Match succeeds

df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table",
"parquet_table").option("fileformat","parquet").save()

Mismatch throws an exception

df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table", 
"parquet_table").option("fileformat","orc").save()

Default table format is ORC

df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table", "sample").save()

To change the default format, use the set method:

sql("set spark.datasource.hive.warehouse.default.write.format=parquet")

You can specify properties as options as follows:

.option("compression", "SNAPPY")

.option("transactional", "false")