Introduction to HWC

Hive Warehouse Connector (HWC) is software available by default with the Apache Hive service in CDP Data Hub. You need HWC to query Apache Hive managed tables from Apache Spark. HWC accesses these managed tables securely.

To read Hive external tables from Spark, you do not need HWC, because Spark reads external tables natively. If you configure HWC to work with managed tables, you can use the same configuration to work with external tables. However, accessing external tables through HWC is slower than accessing them through native Spark libraries.
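The choice between the two read paths can be sketched in PySpark as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the helpers `build_hwc_session` and `read_table` and the `managed` flag are hypothetical names introduced for illustration, the `pyspark_llap` module and the `hive.sql()` call are assumed to match the HWC Python API shipped with your cluster, and the table name is passed in by the caller.

```python
def build_hwc_session(spark):
    """Build an HWC session from an existing SparkSession.

    Assumption: the pyspark_llap module is provided by HWC on the cluster,
    so it is imported lazily here rather than at module load time.
    """
    from pyspark_llap import HiveWarehouseSession
    return HiveWarehouseSession.session(spark).build()


def read_table(spark, hive, table, managed):
    """Return a DataFrame for `table` using the appropriate read path.

    spark   -- a SparkSession, used for native Spark reads
    hive    -- an HWC session, or None if HWC is not configured
    managed -- True for a Hive managed table, False for an external table
    """
    query = f"SELECT * FROM {table}"
    if managed:
        # Managed tables must go through HWC.
        if hive is None:
            raise ValueError("managed tables must be read through HWC")
        return hive.sql(query)
    # External tables: native Spark is the faster path, so prefer it
    # even when an HWC session is available.
    return spark.sql(query)
```

On a cluster with HWC configured, `hive = build_hwc_session(spark)` would supply the second argument. The sketch deliberately routes external tables to native Spark even when HWC is available, because reading them through HWC is slower.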

Supported applications and operations

The Hive Warehouse Connector supports the following applications:

Spark shell
PySpark
The spark-submit script
sparklyr
Zeppelin with the Livy interpreter

The following list describes a few of the operations supported by the Hive Warehouse Connector:

Describing a table
Creating a table in ORC format using .createTable() or in any format using .executeUpdate()
Writing to a table in ORC format
Selecting Hive data and retrieving a DataFrame
Writing a DataFrame to a Hive-managed ORC table in batch
Executing a Hive update statement
Reading table data, transforming it in Spark, and writing it to a new Hive table
Writing a DataFrame or Spark stream to Hive using HiveStreaming
Partitioning data when writing a DataFrame
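Several of the operations listed above can be chained in one PySpark job. The following is a minimal sketch under stated assumptions: the function `run_operations`, the tables `sales` and `sales_summary`, and their columns are hypothetical. The `hive` argument is assumed to be an HWC session, and `hwc_format` is assumed to be the HWC data source name (in the HWC Python API this is typically exposed as `HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR`). Streaming and partitioned writes are not shown.

```python
def run_operations(hive, hwc_format):
    """Create, alter, describe, read, transform, and write using HWC.

    hive       -- an HWC session (assumed to be already built)
    hwc_format -- the HWC data source name used with DataFrame.write.format()
    Table and column names are hypothetical.
    """
    # Create a managed ORC table with the .createTable() builder.
    hive.createTable("sales_summary") \
        .ifNotExists() \
        .column("region", "string") \
        .column("total", "double") \
        .create()

    # Execute a Hive update statement with .executeUpdate().
    hive.executeUpdate(
        "ALTER TABLE sales_summary SET TBLPROPERTIES ('note' = 'created by sketch')"
    )

    # Describe the table; the result is displayed like any DataFrame.
    hive.describeTable("sales_summary").show()

    # Select Hive data and retrieve a DataFrame.
    # Assumes a table `sales` with a string `region` and a double `amount`.
    df = hive.sql("SELECT region, amount FROM sales")

    # Transform the data in Spark: total amount per region.
    summary = (
        df.groupBy("region")
          .sum("amount")
          .withColumnRenamed("sum(amount)", "total")
    )

    # Write the DataFrame to the managed ORC table in batch.
    summary.write.format(hwc_format) \
        .mode("append") \
        .option("table", "sales_summary") \
        .save()
    return summary
```

Passing the session and format name as arguments keeps the function free of cluster-specific imports, so the same logic can be exercised from the Spark shell, PySpark, or a spark-submit job once an HWC session exists.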