Introduction to HWC

The Hive Warehouse Connector (HWC) provides secure access to Apache Hive managed tables from Apache Spark. You need HWC to query Hive managed tables from Spark.

You do not need HWC to read Hive external tables from Spark, because Spark reads external tables natively. If you configure HWC to work with managed tables, you can use the same configuration to work with external tables. However, accessing external tables through HWC is slower than accessing them through native Spark libraries.
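As a rough illustration of the distinction above, the following PySpark sketch routes a read either through native Spark or through HWC. It assumes that the `pyspark_llap` module shipped with HWC is on the Python path and that the Spark session is already configured for HWC. The helper name `read_hive_table` and its `managed` flag are hypothetical, and the caller must already know the table type.

```python
def read_hive_table(spark, table_name, managed):
    """Return a DataFrame for table_name, choosing the access path by table type.

    `spark` is an existing SparkSession. `managed` is supplied by the caller;
    this sketch does not detect the table type itself.
    """
    if managed:
        # Managed tables go through HWC. The import is deferred so that this
        # module still loads on hosts where HWC is not installed.
        from pyspark_llap import HiveWarehouseSession

        hive = HiveWarehouseSession.session(spark).build()
        return hive.executeQuery(f"SELECT * FROM {table_name}")

    # External tables can be read directly with native Spark.
    return spark.table(table_name)
```

Keeping the choice in one helper makes it harder to send external-table reads through the slower HWC path by accident.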

Supported applications and operations

The Hive Warehouse Connector supports the following applications:

Spark 2 (2.4.7 in the current CDP releases). Spark 3 is not supported, even if deployed from the Cloudera parcel.
Spark shell
PySpark
The spark-submit script
sparklyr
Note: sparklyr for HWC is available only as part of 7.1.7 SP1 Cumulative hotfix 4 (CDP PvC Base 7.1.7.1024-4).
Zeppelin with the Livy interpreter

The following list shows some of the operations supported by the Hive Warehouse Connector:

Describing a table
Creating a table in ORC using .createTable() or in any format using .executeUpdate()
Writing to a pre-existing or new table in Parquet, ORC, Avro, or Textfile formats
Selecting Hive data and retrieving a DataFrame
Writing a DataFrame to a Hive-managed ORC table in batch
Executing a Hive update statement
Reading table data, transforming it in Spark, and writing it to a new Hive table
Writing a DataFrame or Spark stream to Hive using HiveStreaming
Partitioning data when writing a DataFrame
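Several of the operations listed above can be combined in one short PySpark sketch: creating an ORC table with `.createTable()`, selecting Hive data into a DataFrame, transforming it in Spark, and writing the result back in batch. The function name, table names, and column names are hypothetical. The `hive` argument is assumed to be a session built with `HiveWarehouseSession.session(spark).build()`, and the data source name in `HWC_FORMAT` is the one commonly shown for HWC; verify it against your release.

```python
# Data source name commonly used for HWC DataFrame writes (assumption:
# confirm that it matches the HWC version in your cluster).
HWC_FORMAT = "com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector"


def copy_recent_sales(hive, source_table, target_table, min_year):
    """Read from one Hive table, filter in Spark, and append to another."""
    # Create the target as a managed ORC table if it does not exist yet.
    (hive.createTable(target_table)
         .ifNotExists()
         .column("item_id", "bigint")
         .column("sale_year", "int")
         .create())

    # Select Hive data and get a Spark DataFrame back.
    df = hive.executeQuery(f"SELECT item_id, sale_year FROM {source_table}")

    # Transform in Spark using a SQL expression string.
    recent = df.filter(f"sale_year >= {int(min_year)}")

    # Write the filtered rows to the target table in batch.
    (recent.write
           .format(HWC_FORMAT)
           .mode("append")
           .option("table", target_table)
           .save())
    return recent
```

A call such as `copy_recent_sales(hive, "sales", "sales_recent", 2020)` would run on a cluster where HWC is configured. Statements that the builder does not cover, such as creating a table in a format other than ORC, would go through `hive.executeUpdate()` instead.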