Introduction to HWC execution modes

A comparison of each execution mode helps you make HWC configuration choices. You can see, graphically, how the configuration affects the query authorization process and your security. You read about which configuration provides fine-grained access control, such as column masking.

In CDP Public Cloud, HWC is available by default in provisioned clusters. In CDP Private Cloud Base, you need to configure an HWC execution mode. HWC executes queries in the modes shown in the following table:
Table 1.
Capabilities JDBC mode Spark Direct Reader mode
Ranger integration (fine-grained access control) N/A
Workloads handled Small datasets ETL without fine-grained access control
These modes that require connections to different Hive components:
  • Spark Direct Reader mode: Connects to Hive Metastore (HMS)
  • JDBC execution mode: Connects to HiveServer (HS2)
The execution mode determines the type of query authorization. Ranger authorizes access to Hive tables from Spark through HiveServer (HS2) or the Hive metastore API (HMS API). The following diagram shows the typical query authorization process in CDP Private Cloud Base:

You need to use HWC to read or write managed tables from Spark. Managed table queries go through HiveServer, which is integrated with Ranger. External table queries go through the HMS API, which is also integrated with Ranger.

In Spark Direct Reader mode, SparkSQL queries read managed table metadata directly from the HMS, but only if you have permission to access files on the file system.

If you do not use HWC, the Hive metastore (HMS) API, integrated with Ranger, authorizes external table access. HMS API-Ranger integration enforces the Ranger Hive ACL in this case. When you use HWC, queries such as DROP TABLE affect file system data as well as metadata in HMS.

Managed tables

A Spark job impersonates the end user when attempting to access an Apache Hive managed table. As an end user, you do not have permission to secure, managed files in the Hive warehouse. Managed tables have default file system permissions that that disallow end user access, including Spark user access.

As Administrator, you set permissions in Ranger to access the managed tables in JDBC mode. You can fine-tune Ranger to protect specific data. For example, you can mask data in certain columns, or set up tag-based access control.

In Spark Direct Reader mode, you cannot use Ranger. You must set read access to the file system location for managed tables. You must have Read and Execute permissions on the Hive warehouse location (hive.metastore.warehouse.dir).

External tables

Ranger authorization of external table reads and writes is supported. You need to configure a few properties in Cloudera Manager for authorization of external table writes. You must be granted file system permissions on external table files to allow Spark direct access to the actual table data instead of just the table metadata. For example, to purge actual data you need access to the file system.

Spark Direct Reader mode vs JDBC mode

As Spark allows users to run arbitrary code, fine grained access control, such as row level filtering or column level masking, is not possible within Spark itself. This limitation extends to data read in Spark Direct Reader mode.

To restrict data access at a fine-grained level, consider using Ranger and HWC in JDBC execution mode is your datasets are small. If you do not require fine-grained access, consider using HWC Spark Direct Reader mode. For example, use Spark Direct Reader mode for ETL use cases. Spark Direct Reader mode is the recommended mode for production.