Introduction to HWC

You need to understand Hive Warehouse Connector (HWC) to query Apache Hive tables from Apache Spark. Examples of supported APIs, such as Spark SQL, show some operations you can perform, including how to write to a Hive ACID table or write a DataFrame from Spark.

HWC is software for securely accessing Hive tables from Spark. You need to use the HWC if you want to access Hive managed tables from Spark. You explicitly use HWC by calling the HiveWarehouseConnector API to write to managed tables. You might use HWC without even realizing it. HWC implicitly reads tables when you run a Spark SQL query on a Hive managed table.

You do not need HWC to read or write Hive external tables, but you might want to use HWC to purge external table files. From Spark, using HWC you can read Hive external tables in ORC or Parquet formats. From Spark, using HWC you can write Hive external tables in ORC format only.

Creating an external table stores only the metadata in HMS. If you use HWC to create the external table, HMS keeps track of the location of table names and columns. Dropping an external table deletes the metadata from HMS. You can set an option to also drop the actual data in files, or not, from the file system.

If you do not use HWC, dropping an external table deletes only the metadata from HMS. If you do not have permission to access the file system, and you want to purge table data in addition to metadata, you need to use HWC.

Supported APIs

Spark SQL
Supports native Spark SQL query read (only) patterns. Output conforms to native spark.sql conventions.
HWC
Supports HiveWarehouse Session API operations using the HWC sql API.

DataFrames

Supports accessing a Hive ACID table from Scala, or pySpark, directly using DataFrames. Use the short name HiveAcid. Direct reads and writes from the file are not supported.

Spark SQL Example

$ spark-shell <parameters to specify HWC jar and config settings>
scala> sql("select * from managedTable").show 
scala> spark.read.table("managedTable").show

HWC API Example

scala> val hive = com.hortonworks.hwc.HiveWarehouseSession.session(spark).build()
scala> hive.executeQuery("select * from emp_acid").show
scala> hive.executeQuery("select e.emp_id, e.first_name, d.name department from emp_acid e join dept_ext d on e.dept_id = d.id").show

DataFrames Example

Hive ACID tables are tables in Hive metastore and must be formatted using DataFrames as follows:

Syntax:

format("HiveAcid").option("table", "<table name>"")

Example:

scala> val df = spark.read.format("HiveAcid").options(Map("table" -> "default.acidtbl")).load()
scala> df.collect()

HWC Limitations

You cannot write data using Spark Direct Reader.
Transaction semantics of Spark RDDs are not ensured when using Spark Direct Reader to read ACID tables.
HWC supports reading tables in any format, but currently supports writing tables in ORC format only.
The spark thrift server is not supported.
Table stats (basic stats and column stats) are not generated when you write a DataFrame to Hive.
The Hive Union types are not supported.
When the HWC API save mode is overwrite, writes are limited.
You cannot read from and overwrite the same table. If your query accesses only one table and you try to overwrite that table using an HWC API write method, a deadlock state might occur. Do not attempt this operation.
Example: Operation Not Supported
```
scala> val df = hive.executeQuery("select * from t1")
scala> df.write.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector"). \
  mode("overwrite").option("table", "t1").save          
```

Workaround for using the Hive Warehouse Connector with Oozie Spark action

Hive and Spark use different Thrift versions and are incompatible with each other. Upgrading Thrift in Hive is complicated and may not be resolved in the near future. Therefore, Thrift packages are shaded inside the HWC JAR to make Hive Warehouse Connector work with Spark and Oozie Spark action. See the workaround in Cloudera Oozie documentation (link below).

Supported applications and operations

The Hive Warehouse Connector supports the following applications:

Spark shell
PySpark
The spark-submit script

The following list describes a few of the operations supported by the Hive Warehouse Connector:

Describing a table
Creating a table in ORC using .createTable() or in any format using .executeUpdate()
Writing to a table in ORC format
Selecting Hive data and retrieving a DataFrame
Writing a DataFrame to a Hive-managed ORC table in batch
Executing a Hive update statement
Reading table data, transforming it in Spark, and writing it to a new Hive table
Writing a DataFrame or Spark stream to Hive using HiveStreaming
Partitioning data when writing a DataFrame