Cloudera Data Engineering Spark support for Hive Warehouse Connector
The Hive Warehouse Connector (HWC) is a powerful tool that connects Apache Hive and Apache Spark, enabling seamless data exchange and processing between the two systems. Cloudera Data Engineering integrates with the Hive Warehouse Connector (HWC) so that you can access and manage Hive-managed tables from Cloudera Data Engineering Spark sessions and jobs. The HWC dependencies are built into Cloudera Data Engineering.
Hive Warehouse Connector overview
HWC enables secure access and querying of Hive-managed tables directly from Apache Spark. HWC enables you to access data with Fine-Grained Access Control (FGAC), to ensure secure data handling. FGAC includes row-level filtering and column masking, as defined in Ranger.
If HWC is configured to work with managed tables, the same configuration applies to external tables as well.
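As an illustrative sketch of how a PySpark job uses HWC (the pyspark_llap package ships with the HWC distribution and must be on the job's classpath; the table name here is hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession  # shipped with the HWC distribution

spark = SparkSession.builder.appName("hwc-example").getOrCreate()

# Build the HWC session; the hive.sql() calls shown in this document go through it.
hive = HiveWarehouseSession.session(spark).build()

# Read a Hive-managed table into a Spark DataFrame (hypothetical table name).
df = hive.sql("SELECT * FROM default.sample")
df.show()
```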
HWC provides a wide range of functionality to Cloudera Data Engineering, including:
- CREATE or DROP databases
- CREATE external or managed Hive tables
- Atomicity, Consistency, Isolation, and Durability (ACID) transactions on Hive-managed tables
- CREATE or INSERT data to Iceberg tables
- Enforce FGAC policies through Ranger using the SECURE_ACCESS read mode
Hive Warehouse Connector read modes
Using HWC, you can read data in the following modes:
- DIRECT_READER_V2
- JDBC_CLUSTER
- SECURE_ACCESS
Each of these read modes has its own advantages and limitations. Choose the read mode carefully based on your use cases. For more information on the supported HWC read modes, their advantages, requirements, and limitations, see Hive Warehouse Connector read modes.
HWC supports reading data stored in the following file formats:
- Optimized Row Columnar (ORC)
- Parquet
- Avro
- Textfile
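The read mode is typically selected through Spark configuration when the job is submitted. The following is a hedged sketch only: the property names follow Cloudera's HWC documentation, while the jar path, JDBC URL, and script name are placeholders for your cluster.

```shell
spark-submit \
  --jars /opt/hwc/hive-warehouse-connector-assembly.jar \
  --conf spark.sql.extensions=com.hortonworks.spark.sql.rule.Extensions \
  --conf spark.datasource.hive.warehouse.read.mode=DIRECT_READER_V2 \
  --conf spark.sql.hive.hiveserver2.jdbc.url="jdbc:hive2://<hs2-host>:10000/default" \
  my_hwc_job.py
```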
Hive Warehouse Connector write modes
HWC provides three write modes, each suited to a different scenario:
- HIVE_WAREHOUSE_CONNECTOR
- DATAFRAME_TO_STREAM
- STREAM_TO_STREAM
Each of these write modes has its own advantages and limitations. Choose the write mode carefully based on your use cases. For more information on the supported HWC write modes, their requirements, features, and usage, see Hive Warehouse Connector write modes.
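As a sketch of the batch write path (assuming an HWC session already built as hive, an existing Spark DataFrame df, and a hypothetical target table name), a HIVE_WAREHOUSE_CONNECTOR write looks like:

```python
# Write a DataFrame to a Hive ACID (managed) table through HWC.
# HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR resolves to the HWC data source name.
df.write.format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR) \
    .mode("append") \
    .option("table", "sample") \
    .save()
```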
Hive Warehouse Connector access patterns
The following table summarizes the access patterns and the read and write mode requirements of the Hive Warehouse Connector for each table type.
| Operation | Hive ACID tables (managed tables) | Hive external tables | Iceberg tables |
|---|---|---|---|
| Basic read | HWC required with DIRECT_READER mode | HWC is not needed, but you can use it | HWC is not needed, but you can use it |
| FGAC read | HWC required with SECURE_ACCESS mode | HWC required with SECURE_ACCESS mode | HWC required with SECURE_ACCESS mode |
| Batch write (HIVE_WAREHOUSE_CONNECTOR) | HWC required (Write Dataframe to Table) | Not supported (Non-ACID table) | Not supported (Non-ACID table) |
| Stream write (DATAFRAME_TO_STREAM, STREAM_TO_STREAM) | HWC required | Not supported (Non-ACID table) | Not supported (Non-ACID table) |
Using hive.sql and spark.sql in Hive Warehouse Connector
You can run SQL statements through either hive.sql or spark.sql in HWC. The key differences are:
- hive.sql: An explicit API defined in HWC. It is preferred for all table types and all HWC modes, across both read and write paths.
- spark.sql: Less versatile than hive.sql. It can be used only with the following read modes:
  - DIRECT_READER_V2
  - JDBC_CLUSTER
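To illustrate the difference (assuming an HWC session built as hive, one of the two compatible read modes configured, and a hypothetical table name):

```python
# hive.sql: HWC's own API; works for all table types and all HWC modes.
hive.sql("SELECT * FROM sample").show()

# spark.sql: works against Hive-managed tables only when the job runs in
# DIRECT_READER_V2 or JDBC_CLUSTER read mode.
spark.sql("SELECT * FROM sample").show()
```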
Running queries
HWC allows running Data Manipulation Language (DML) and Data Definition Language (DDL) queries
using hive.sql(). Examples include creating tables, inserting data, and
dropping tables:
- Creating an empty ORC table:
  hive.sql("CREATE TABLE sample (a INT, b INT) STORED AS ORC TBLPROPERTIES ('transactional'='true')")
- Inserting into a table:
  hive.sql("INSERT INTO TABLE sample VALUES (1,2), (3,4)")
- Creating a table with a CTAS query:
  hive.sql("CREATE TABLE dup_sample AS SELECT * FROM sample")
- Dropping a table:
  hive.sql("DROP TABLE sample")
These queries use a JDBC connection to HiveServer2 and do not generate statistics in the Spark UI.
Support for user-defined functions
HWC supports creating and using User-Defined Functions (UDFs) to extend the capabilities of Hive.
For example:
hive.sql("CREATE FUNCTION udftypeof AS 'com.mycompany.hiveudf.TypeOf01' USING JAR 'hdfs:///warehouse/tablespace/managed/TypeOf01-1.0-SNAPSHOT.jar'").show()
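Once the function is registered, it can be invoked from hive.sql() like any built-in Hive function; for example (using the column and table names from the earlier examples):

```python
# Apply the registered UDF to a column of the sample table.
hive.sql("SELECT udftypeof(a) FROM sample").show()
```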
