Hive Warehouse Connector for accessing Apache Spark data
The Hive Warehouse Connector (HWC) is a Spark library/plugin that is launched with the Spark app to access any managed Hive table from Spark. You use HWC explicitly by calling the HiveWarehouseConnector API for writes. You use HWC implicitly for reads by simply running a Spark SQL query on a managed table. Apache Ranger and the HiveWarehouseConnector library provide fine-grained, row- and column-level access to the data.
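For example, the following minimal sketch shows the explicit path, assuming HWC is on the application classpath, spark is the active SparkSession, and the table names sales and sales_copy are hypothetical:

import com.hortonworks.hwc.HiveWarehouseSession

// Build an HWC session on top of the existing SparkSession.
val hive = HiveWarehouseSession.session(spark).build()

// Read Hive data through the HWC API into a DataFrame.
val df = hive.executeQuery("SELECT * FROM sales")

// Write the DataFrame back through the HiveWarehouseConnector data source.
df.write
  .format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector")
  .mode("append")
  .option("table", "sales_copy")
  .save()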
Spark and Hive share a catalog in the Hive metastore (HMS). The shared catalog simplifies use of HWC. To read a Hive external table from Spark, you do not need to define the table redundantly in the Spark catalog. Also, HMS detects the type of client interacting with it, for example Hive or Spark, and compares the capabilities of the client with the requirements of the Hive table. HMS then performs whatever translation makes sense given the client and other factors.
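Because the catalog is shared, a Spark SQL query can reference a Hive external table directly. A minimal sketch, assuming an existing external table named web_logs (a hypothetical name):

// No redundant table definition is needed on the Spark side; the table
// is resolved through the shared HMS catalog.
val notFound = spark.sql("SELECT * FROM web_logs WHERE status = 404")
notFound.show()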
HWC Limitations
- HWC supports reading tables in any format, but currently supports writing tables in ORC format only.
- Table stats (basic stats and column stats) are not generated when you write a DataFrame to Hive.
- The Spark Thrift Server is not supported.
- The Hive Union data type is not supported.
- Transaction semantics of Spark RDDs are not ensured when using Spark Direct Reader to read ACID tables.
- When the HWC API save mode is overwrite, writes are limited.
You cannot read from and overwrite the same table. If your query accesses only one table and you try to overwrite that table using an HWC API write method, a deadlock state might occur. Do not attempt this operation.
Example: Operation Not Supported
scala> val df = hive.executeQuery("select * from t1")
scala> df.write.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector").mode("overwrite").option("table", "t1").save()
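One way to avoid the deadlock is to overwrite a table other than the one being read. A minimal sketch of that alternative (the target table t1_copy is hypothetical):

scala> val df = hive.executeQuery("select * from t1")
scala> df.write.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector").mode("overwrite").option("table", "t1_copy").save()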
Supported applications and operations
- Spark shell
- PySpark
- The spark-submit script
- Describing a table
- Creating a table in ORC format using .createTable() or in any format using .executeUpdate() (see the sketch after this list)
- Writing to a table in ORC format
- Selecting Hive data and retrieving a DataFrame
- Writing a DataFrame to a Hive-managed ORC table in batch
- Executing a Hive update statement
- Reading table data, transforming it in Spark, and writing it to a new Hive table
- Writing a DataFrame or Spark stream to Hive using HiveStreaming
- Partitioning data when writing a DataFrame
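The following sketch strings several of these operations together in spark-shell, assuming HWC is on the classpath; the table names parts_orc and parts_by_region and the sample row are hypothetical:

import com.hortonworks.hwc.HiveWarehouseSession
import com.hortonworks.hwc.HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR
import org.apache.spark.sql.functions.upper

val hive = HiveWarehouseSession.session(spark).build()

// Create a managed ORC table with .createTable().
hive.createTable("parts_orc")
  .ifNotExists()
  .column("id", "bigint")
  .column("name", "string")
  .column("region", "string")
  .create()

// Execute a Hive update statement with .executeUpdate().
hive.executeUpdate("INSERT INTO parts_orc VALUES (1, 'gear', 'emea')")

// Select Hive data into a DataFrame and transform it in Spark.
val df = hive.executeQuery("SELECT id, name, region FROM parts_orc")
val transformed = df.withColumn("name", upper(df("name")))

// Write the DataFrame to a Hive-managed ORC table in batch,
// partitioning the data on the region column.
transformed.write
  .format(HIVE_WAREHOUSE_CONNECTOR)
  .mode("append")
  .option("partition", "region")
  .option("table", "parts_by_region")
  .save()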