Introduction to HWC
You need to understand Hive Warehouse Connector (HWC) to query Apache Hive tables from Apache Spark. Examples of supported APIs, such as Spark SQL, show some operations you can perform, including how to write to a Hive ACID table or write a DataFrame from Spark.
HWC is software for securely accessing Hive tables from Spark. You need to use the HWC if you want to access Hive managed tables from Spark. You explicitly use HWC by calling the HiveWarehouseConnector API to write to managed tables. You might use HWC without even realizing it. HWC implicitly reads tables when you run a Spark SQL query on a Hive managed table.
You do not need HWC to read or write Hive external tables. You can use native Spark SQL. You might want to use HWC to purge external table files. From Spark, using HWC you can read Hive external tables in ORC or Parquet formats. From Spark, using HWC you can write Hive external tables in ORC format only.
Creating an external table stores only the metadata in HMS. If you use HWC to create the external table, HMS keeps track of the location of table names and columns. Dropping an external table deletes the metadata from HMS. You can set an option to also drop the actual data in files, or not, from the file system.
If you do not use HWC, dropping an external table deletes only the metadata from HMS. If you do not have permission to access the file system, and you want to purge table data in addition to metadata, you need to use HWC.
- Spark SQL
Supports native Spark SQL query read (only) patterns. Output conforms to native spark.sql conventions.
Supports HiveWarehouse Session API operations using the HWC
Supports accessing a Hive ACID table from Scala, or pySpark, directly using DataFrames. Use the short name
HiveAcid. Direct reads and writes from the file are not supported.
Spark SQL Example
$ spark-shell <parameters to specify HWC jar and config settings> scala> sql("select * from managedTable").show scala> spark.read.table("managedTable").show
HWC API Example
scala> val hive = com.hortonworks.hwc.HiveWarehouseSession.session(spark).build() scala> hive.executeQuery("select * from emp_acid").show scala> hive.executeQuery("select e.emp_id, e.first_name, d.name department from emp_acid e join dept_ext d on e.dept_id = d.id").show
Hive ACID tables are tables in Hive metastore and must be formatted using DataFrames as follows:Syntax:
format("HiveAcid").option("table", "<table name>"")Example:
scala> val df = spark.read.format("HiveAcid").options(Map("table" -> "default.acidtbl")).load() scala> df.collect()
- You cannot write data using Spark Direct Reader.
- Transaction semantics of Spark RDDs are not ensured when using Spark Direct Reader to read ACID tables.
- HWC supports reading tables in any format, but currently supports writing tables in ORC format only.
- The spark thrift server is not supported.
- Table stats (basic stats and column stats) are not generated when you write a DataFrame to Hive.
- When the HWC API save mode is overwrite, writes are limited.
You cannot read from and overwrite the same table. If your query accesses only one table and you try to overwrite that table using an HWC API write method, a deadlock state might occur. Do not attempt this operation.
Example: Operation Not Supported
scala> val df = hive.executeQuery("select * from t1") scala> df.write.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector"). \ mode("overwrite").option("table", "t1").save
Workaround for using the Hive Warehouse Connector with Oozie Spark action
Hive and Spark use different Thrift versions and are incompatible with each other. Upgrading Thrift in Hive is complicated and may not be resolved in the near future. Therefore, Thrift packages are shaded inside the HWC JAR to make Hive Warehouse Connector work with Spark and Oozie Spark action. See the workaround in Cloudera Oozie documentation (link below).
Supported applications and operations
- Spark shell
- The spark-submit script
- Describing a table
- Creating a table in ORC using .createTable() or in any format using .executeUpdate()
- Writing to a table in ORC format
- Selecting Hive data and retrieving a DataFrame
- Writing a DataFrame to a Hive-managed ORC table in batch
- Executing a Hive update statement
- Reading table data, transforming it in Spark, and writing it to a new Hive table
- Writing a DataFrame or Spark stream to Hive using HiveStreaming
- Partitioning data when writing a DataFrame