Cloudera Data Engineering Spark support for Hive Warehouse Connector

The Hive Warehouse Connector (HWC) is a powerful tool that connects Apache Hive and Apache Spark, enabling seamless data exchange and processing between the two systems. Cloudera Data Engineering integrates with the Hive Warehouse Connector (HWC) so that you can access and manage Hive-managed tables from Cloudera Data Engineering Spark sessions and jobs. The HWC dependencies are built into Cloudera Data Engineering.

Hive Warehouse Connector overview

HWC enables secure access to and querying of Hive-managed tables directly from Apache Spark. It also supports Fine-Grained Access Control (FGAC), including the row-level filtering and column masking policies defined in Ranger, to ensure secure data handling.

If HWC is configured to work with managed tables, the same configuration applies to external tables as well.

HWC provides a wide range of functionality to Cloudera Data Engineering, including:

  • CREATE or DROP databases
  • CREATE external or managed Hive tables
  • Atomicity, Consistency, Isolation, and Durability (ACID) transactions on Hive-managed tables
  • CREATE or INSERT data to Iceberg tables
  • Enforce FGAC policies through Ranger using the SECURE_ACCESS read mode

Hive Warehouse Connector read modes

HWC supports the following read modes:

  • DIRECT_READER_V2
  • JDBC_CLUSTER
  • SECURE_ACCESS

Each of these read modes has its own advantages and limitations. Choose the read mode carefully based on your use cases. For more information on the supported HWC read modes, their advantages, requirements, and limitations, see Hive Warehouse Connector read modes.
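As an illustrative sketch, the read mode is typically selected through Spark configuration when submitting a job. The property names below follow the HWC configuration documented for Cloudera platforms, but the JDBC URL, script name, and mode value are placeholders; verify the exact properties for your release:

```shell
# Hypothetical job submission selecting an HWC read mode (sketch only).
spark-submit \
  --conf spark.datasource.hive.warehouse.read.mode=DIRECT_READER_V2 \
  --conf spark.sql.hive.hiveserver2.jdbc.url="jdbc:hive2://<hiveserver2-host>:10000/default" \
  my_hwc_job.py
```

The same `--conf` pairs can be supplied as Spark configuration overrides when creating a Cloudera Data Engineering job.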

HWC supports the following storage formats across all read modes:
  • Optimized Row Columnar (ORC)
  • Parquet
  • Avro
  • Textfile

Hive Warehouse Connector write modes

HWC provides three write methods, each suited for different scenarios:

  • HIVE_WAREHOUSE_CONNECTOR
  • DATAFRAME_TO_STREAM
  • STREAM_TO_STREAM

Each of these write modes has its own advantages and limitations. Choose the write mode carefully based on your use cases. For more information on the supported HWC write modes, their requirements, features, and usage, see Hive Warehouse Connector write modes.
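As a hedged sketch of the batch path, a Spark DataFrame can be written to a Hive ACID table through the HIVE_WAREHOUSE_CONNECTOR data source using the HWC Python API. The table name and DataFrame contents are illustrative, and the snippet assumes a Cloudera Data Engineering Spark session with the built-in HWC dependencies, so it cannot run standalone:

```python
# Sketch only: requires a CDE Spark session with HWC available.
from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession  # HWC Python API

spark = SparkSession.builder.appName("hwc-write-example").getOrCreate()
hive = HiveWarehouseSession.session(spark).build()

# Illustrative DataFrame matching the `sample` table used later on this page.
df = spark.createDataFrame([(1, 2), (3, 4)], ["a", "b"])

# Batch write to a Hive ACID table via the HIVE_WAREHOUSE_CONNECTOR method.
df.write.format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR) \
    .mode("append") \
    .option("table", "sample") \
    .save()
```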

Hive Warehouse Connector access patterns

The following table summarizes the access patterns and the Hive Warehouse Connector requirements of each table type for the read and write modes.

Table 1. Hive Warehouse Connector access patterns

  • Basic read:
    • Hive ACID tables (managed tables): HWC required, with DIRECT_READER mode
    • Hive external tables: HWC is not needed, but you can use it
    • Iceberg tables: HWC is not needed, but you can use it
  • FGAC read:
    • Hive ACID tables (managed tables): HWC required, with SECURE_ACCESS mode
    • Hive external tables: HWC required, with SECURE_ACCESS mode
    • Iceberg tables: HWC required, with SECURE_ACCESS mode
  • Batch write (HIVE_WAREHOUSE_CONNECTOR):
    • Hive ACID tables (managed tables): HWC required (write DataFrame to table)
    • Hive external tables: not supported (non-ACID table)
    • Iceberg tables: not supported (non-ACID table)
  • Stream write (DATAFRAME_TO_STREAM, STREAM_TO_STREAM):
    • Hive ACID tables (managed tables): HWC required; supported on Azure, and supported with a workaround on AWS
    • Hive external tables: not supported (non-ACID table)
    • Iceberg tables: not supported (non-ACID table)

Using hive.sql and spark.sql in Hive Warehouse Connector

You can use hive.sql and spark.sql in HWC. The key differences are:
  • hive.sql: The explicit API defined in HWC. It is preferred for all table types and all HWC modes, across both read and write paths.
  • spark.sql: Less versatile than hive.sql. It can be used only with the following read modes:
    • DIRECT_READER_V2
    • JDBC_CLUSTER

Running queries

HWC allows running Data Manipulation Language (DML) and Data Definition Language (DDL) queries using hive.sql(). Examples include creating tables, inserting data, and dropping tables:

  • Creating an empty ORC table:
    
    hive.sql("CREATE TABLE sample (a INT, b INT) STORED AS ORC TBLPROPERTIES ('transactional'='true')")
  • Inserting into a table:
    hive.sql("INSERT INTO TABLE sample VALUES (1,2), (3,4)")
  • CTAS query:
    hive.sql("CREATE TABLE dup_sample AS SELECT * FROM sample")
  • Dropping a table:
    hive.sql("DROP TABLE sample")
    

These queries use a JDBC connection to HiveServer2 and do not generate statistics in the Spark UI.

Support for user-defined functions

HWC supports creating and using User-Defined Functions (UDFs) to extend the capabilities of Hive.

For example:

hive.sql("CREATE FUNCTION udftypeof AS 'com.mycompany.hiveudf.TypeOf01' USING JAR 'hdfs:///warehouse/tablespace/managed/TypeOf01-1.0-SNAPSHOT.jar'").show()
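Once registered, the UDF can be invoked through hive.sql like any built-in function. The snippet below is illustrative, reusing the udftypeof function registered above and the sample table from the earlier examples, and assumes an established HWC session:

```python
# Sketch only: assumes the HWC session `hive`, the `sample` table, and the
# `udftypeof` UDF created in the examples above.
hive.sql("SELECT udftypeof(a) FROM sample").show()
```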