The Hive Warehouse Connector (HWC) supports reads and writes to Apache Hive managed ACID
tables in R. Cloudera provides an R package, SparklyrHWC, that includes all the HWC methods, such as
execute and executeQuery, as well as a spark_write_table method for writing to managed tables. The native
sparklyr spark_write_table method supports writes to external tables only.
Support
HWC should work with sparklyr 1.0.4. Versions later than 1.0.4 should also work, provided
sparklyr does not change its interfaces. Note that sparklyr itself is not supported by Cloudera;
however, Cloudera does support issues that arise when you use HWC from sparklyr.
Downloading SparklyrHWC
You can download the SparklyrHWC R package, which includes the HWC methods, from your CDP
cluster. Go to /opt/cloudera/parcels/CDH/lib/hive_warehouse_connector/ and copy the
SparklyrHWC-<version> package to your download location.
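Because the package ships as a source tarball, you can install it into your R library with install.packages(). The following is a minimal sketch; the /tmp download location is an assumption, and you must substitute the actual <version>:

```r
# Install the SparklyrHWC package from the copied source tarball.
# The path is illustrative; substitute your download location and <version>.
install.packages("/tmp/SparklyrHWC-<version>.tar.gz", repos = NULL, type = "source")
```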
To read and write Hive tables in R, you must configure an HWC execution mode. You can
configure one of the following HWC execution modes in R (a connection sketch follows this list):
- JDBC mode
  - Suitable for writing production workloads.
  - Suitable for reading production workloads with a data size of 1 GB or less.
  - Use this mode for reading if latency is not an issue.
- Spark-ACID mode
  - Suitable for reading production workloads.
  - Does not support writes.
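The execution mode is selected through the Spark configuration that you pass to spark_connect(). The following is a hedged sketch, not a definitive configuration: the JDBC URL, jar paths, and the spark-acid extension class names are assumptions or placeholders that you must adapt to your cluster.

```r
library(sparklyr)
# Assumes SparklyrHWC was installed into the default R library.
library(SparklyrHWC)

config <- spark_config()

# HiveServer2 JDBC URL used by HWC in JDBC mode (host and port are placeholders).
config$spark.sql.hive.hiveserver2.jdbc.url <- "jdbc:hive2://<hs2-host>:10000/default"

# For Spark-ACID mode, register the spark-acid extension and serializer
# (class names are assumptions based on the Qubole spark-acid library).
config$spark.sql.extensions <- "com.qubole.spark.datasources.HiveAcidAutoConvertExtension"
config$spark.kryo.registrator <- "com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator"

# Make the HWC and spark-acid jars available to the Spark session.
config$sparklyr.jars.default <- c("<path to HWC jar>", "<path to spark-acid jar>")

sc <- spark_connect(master = "yarn", config = config)
```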
Reading and writing managed tables
You can read Hive managed tables using either JDBC or Spark-ACID mode. The mode you
configure affects the background process; you use the same R code regardless of the mode,
with one exception: you do not need to call commitTxn(hs) when using JDBC mode. To write
to Hive managed tables, you must connect to HWC in JDBC mode.
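As an illustration, a JDBC-mode read and write might look like the following sketch. The session builder build_session(), the table names, and the argument order of SparklyrHWC::spark_write_table() are assumptions for illustration, not confirmed API details:

```r
# Build an HWC session from the sparklyr connection
# (build_session() is assumed here; hs is the session passed to commitTxn()).
hs <- build_session(sc)

# JDBC mode: read a managed table through HiveServer2; no commit call is needed.
emp_df <- executeQuery(hs, "SELECT * FROM emp_managed")

# JDBC mode: write to a managed table with the SparklyrHWC method
# (assumed arguments: target table, Spark DataFrame, save mode).
SparklyrHWC::spark_write_table("emp_managed_copy", emp_df, "append")
```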
Reading and writing external tables
You can read and write Hive external tables in R using the native sparklyr package; HWC is
not required.
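For example, the following sketch uses only native sparklyr; the table name and sample data are illustrative:

```r
library(sparklyr)

# Copy a sample data frame into Spark.
iris_tbl <- sdf_copy_to(sc, iris, overwrite = TRUE)

# Write an external (non-ACID) Hive table with the native method.
sparklyr::spark_write_table(iris_tbl, "iris_external")

# Read the external table back into a Spark DataFrame.
iris_back <- sparklyr::spark_read_table(sc, "iris_external")
```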
In the following procedure, you configure the Spark-ACID execution mode to read tables on a
production cluster. You use the native sparklyr spark_read_table and spark_load_table methods
to read Hive managed tables in R.
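A sketch of such a Spark-ACID read, reusing the connection sc and the HWC session hs from the earlier sketches (the table name is illustrative):

```r
# Spark-ACID mode: read a managed table with the native sparklyr reader.
emp_df <- sparklyr::spark_read_table(sc, "emp_managed")

# Spark-ACID reads open a Hive transaction; commit it once the read completes.
commitTxn(hs)
```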