Introduction to HWC

You need to understand Hive Warehouse Connector (HWC) to query Apache Hive tables from Apache Spark. Examples of supported APIs, such as Spark SQL, show some operations you can perform, including how to write to a Hive ACID table or write a DataFrame from Spark.

HWC is software for securely accessing Hive tables from Spark. You need to use the HWC if you want to access Hive managed tables from Spark. You explicitly use HWC by calling the HiveWarehouseConnector API to write to managed tables. You might use HWC without even realizing it. HWC implicitly reads tables when you run a Spark SQL query on a Hive managed table.

You do not need HWC to read or write Hive external tables, but you might want to use HWC to purge external table files. From Spark, using HWC you can read Hive external tables in ORC or Parquet formats. From Spark, using HWC you can write Hive external tables in ORC format only.

Creating an external table stores only the metadata in HMS. If you use HWC to create the external table, HMS keeps track of the location of table names and columns. Dropping an external table deletes the metadata from HMS. You can set an option to also drop the actual data in files, or not, from the file system.

If you do not use HWC, dropping an external table deletes only the metadata from HMS. If you do not have permission to access the file system, and you want to purge table data in addition to metadata, you need to use HWC.

Supported APIs

  • Spark SQL

    Supports native Spark SQL query read (only) patterns. Output conforms to native spark.sql conventions.

  • HWC

    Supports HiveWarehouse Session API operations using the HWC sql API.

  • DataFrames

    Supports accessing a Hive ACID table from Scala, or pySpark, directly using DataFrames. Use the short name HiveAcid. Direct reads and writes from the file are not supported.

    Spark SQL Example

    $ spark-shell <parameters to specify HWC jar and config settings>
    scala> sql("select * from managedTable").show 
    scala> spark.read.table("managedTable").show

    HWC API Example

    scala> val hive = com.hortonworks.hwc.HiveWarehouseSession.session(spark).build()
    scala> hive.executeQuery("select * from emp_acid").show
    scala> hive.executeQuery("select e.emp_id, e.first_name, d.name department from emp_acid e join dept_ext d on e.dept_id = d.id").show

    DataFrames Example

    Hive ACID tables are tables in Hive metastore and must be formatted using DataFrames as follows:

    Syntax:
    format("HiveAcid").option("table", "<table name>"")
    Example:
    scala> val df = spark.read.format("HiveAcid").options(Map("table" -> "default.acidtbl")).load()
    scala> df.collect()

HWC Limitations

Kerberos users of HWC and Spark Direct Reader must observe the following requirements:

  • Do specify spark.sql.hive.hiveserver2.jdbc.url.principal in configurations. For example,
    --conf "spark.sql.hive.hiveserver2.jdbc.url.principal=hive/_HOST@VPC.CLOUDERA.COM"
  • Do not specify the spark.sql.hive.hiveserver2.jdbc.url.principal in the JDBC URL to invoke Hive. Do specify principal=hive.server2.authentication.kerberos.principal as shown in the following syntax:
    jdbc:hive://<host>:<port>/<dbName>;principal=<HiveServer2_kerberos_principal>;<otherSessionConfs>?<hiveConfs>#<hiveVars> 
    For example:
    jdbc:hive://<host>:<port>/<dbName>;principal=hive.server2.authentication.kerberos.principal;<otherSessionConfs>?<hiveConfs>#<hiveVars>       
  • Ensure that the keystore specified in the JDBC URL is in the same location on all hosts.

Other HWC limitations are:

  • You cannot write data using Spark Direct Reader.
  • HWC supports reading tables in any format, but currently supports writing tables in ORC format only.
  • The spark thrift server is not supported.
  • Transaction semantics of Spark RDDs are not ensured when using Spark Direct Reader to read ACID tables.
  • Table stats (basic stats and column stats) are not generated when you write a DataFrame to Hive.
  • The Hive Union types are not supported.
  • When the HWC API save mode is overwrite, writes are limited.

    You cannot read from and overwrite the same table. If your query accesses only one table and you try to overwrite that table using an HWC API write method, a deadlock state might occur. Do not attempt this operation.

    Example: Operation Not Supported

    scala> val df = hive.executeQuery("select * from t1")
    scala> df.write.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector"). \
      mode("overwrite").option("table", "t1").save          

Workaround for using the Hive Warehouse Connector with Oozie Spark action

Hive and Spark use different Thrift versions and are incompatible with each other. Upgrading Thrift in Hive is complicated and may not be resolved in the near future. Therefore, Thrift packages are shaded inside the HWC JAR to make Hive Warehouse Connector work with Spark and Oozie Spark action. See the workaround in Cloudera Oozie documentation (link below).

Supported applications and operations

The Hive Warehouse Connector supports the following applications:
  • Spark shell
  • PySpark
  • The spark-submit script
The following list describes a few of the operations supported by the Hive Warehouse Connector:
  • Describing a table
  • Creating a table in ORC using .createTable() or in any format using .executeUpdate()
  • Writing to a table in ORC format
  • Selecting Hive data and retrieving a DataFrame
  • Writing a DataFrame to a Hive-managed ORC table in batch
  • Executing a Hive update statement
  • Reading table data, transforming it in Spark, and writing it to a new Hive table
  • Writing a DataFrame or Spark stream to Hive using HiveStreaming
  • Partitioning data when writing a DataFrame