Introduction to HWC and DataFrame APIs

As an Apache Spark developer, you learn the code constructs for executing Apache Hive queries using the HiveWarehouseSession API. In Spark source code, you see how to create an instance of HiveWarehouseSession. You also learn how to access a Hive ACID table using DataFrames.

Supported APIs

  • Spark SQL

    Supports built-in Spark SQL query read (only) patterns. Output conforms to built-in spark.sql conventions.

    Example

    $ spark-shell <parameters to specify HWC jar and config settings>
    scala> sql("select * from managedTable").show 
    scala> spark.read.table("managedTable").show
  • HWC

    Supports HiveWarehouse Session API operations using the HWC sql API. The .execute() and .executeQuery() methods are deprecated.

    Example
    scala> val hive = com.hortonworks.hwc.HiveWarehouseSession.session(spark).build()
    scala> hive.sql("select * from emp_acid").show
    scala> hive.sql("select e.emp_id, e.first_name, d.name department from emp_acid e join dept_ext d on e.dept_id = d.id").show
  • DataFrames

    Supports accessing a Hive ACID table from Scala, or pySpark, directly using DataFrames. Direct reads and writes from the file are not supported.

    Hive ACID tables are tables in Hive metastore and must be formatted using DataFrames as follows:

    Syntax
    val df = hive.sql("<SELECT query>")
    Example
    scala> val df = hive.sql("select * from managedTable where a=100")
    scala> df.collect()

Import statements and variables

The following string constants are defined by the API:

  • HIVE_WAREHOUSE_CONNECTOR
  • DATAFRAME_TO_STREAM
  • STREAM_TO_STREAM

Assuming spark is running in an existing SparkSession, use this code for imports:

  • Scala
    import com.hortonworks.hwc.HiveWarehouseSession
    import com.hortonworks.hwc.HiveWarehouseSession._
    val hive = HiveWarehouseSession.session(spark).build()
  • Java
    import com.hortonworks.hwc.HiveWarehouseSession;
    import static com.hortonworks.hwc.HiveWarehouseSession.*;
    HiveWarehouseSession hive = HiveWarehouseSession.session(spark).build();
  • Python
    from pyspark_llap import HiveWarehouseSession
    hive = HiveWarehouseSession.session(spark).build()

Executing queries

HWC supports the hive.sql() API for executing queries. You can also use the Spark SQL to query Hive managed tables, however, it is recommened that you use the HWC sql method.
  • .sql()
    • Executes queries in both the read modes — Direct Reader and JDBC modes.

    • Consistent with the Spark sql interface.

    • Masks the internal implementation based on the cluster type you configured.

Results are returned as a DataFrame to Spark.

Support of HWC read modes on Hive tables or views

The following table represents the different Hive tables or views that are supported by the various HWC read modes:
Mode vs Table Full ACID table Insert-only ACID table View (created on a managed table) Materialized view (created on a managed table)
DIRECT_READER_V1 Yes Yes Yes Yes
DIRECT_READER_V2 Yes Yes Yes Yes
JDBC_CLUSTER Yes Yes Yes Yes

It is recommended that you do not use Managed non-transactional tables. Such tables should ideally be converted to external tables.

Support of HWC read modes on table formats

The following table formats are supported by HWC while reading a table:
Mode ORC Parquet Avro Textfile
DIRECT_READER_V1 Yes Yes Yes Yes
DIRECT_READER_V2 Yes Yes Yes Yes
JDBC_CLUSTER Yes Yes Yes Yes

hive.sql vs. spark.sql

There are a number of important differences between the hive.sql and spark.sql functions:
  • hive.sql() is explicitly defined in HWC and can be used across all read modes to query Apache Hive managed tables (full ACID and insert-only ACID tables).
  • spark.sql() can also be used across all the read modes to query an Apache Hive managed table. However, it is recommended that you use hive.sql() over spark.sql().
  • The Direct Reader imposes the constraint that the Hive table must be transactional.