Introduction to HWC and DataFrame APIs
As an Apache Spark developer, you learn the code constructs for executing Apache Hive queries using the HiveWarehouseSession API. In Spark source code, you see how to create an instance of HiveWarehouseSession. You also learn how to access a Hive ACID table using DataFrames.
Supported APIs
- Spark SQL
Supports built-in Spark SQL query read (only) patterns. Output conforms to built-in spark.sql conventions.
Example:
$ spark-shell <parameters to specify HWC jar and config settings> scala> sql("select * from managedTable").show scala> spark.read.table("managedTable").show
- HWC
Supports HiveWarehouse Session API operations using the HWC
sql
API. The.execute()
and.executeQuery()
methods are deprecated.Example
scala> val hive = com.hortonworks.hwc.HiveWarehouseSession.session(spark).build() scala> hive.sql("select * from emp_acid").show scala> hive.sql("select e.emp_id, e.first_name, d.name department from emp_acid e join dept_ext d on e.dept_id = d.id").show
- DataFrames
Supports accessing a Hive ACID table from Scala, or pySpark, directly using DataFrames. Direct reads and writes from the file are not supported.
Hive ACID tables are tables in Hive metastore and must be formatted using DataFrames as follows:
Syntax:val df = hive.sql("<SELECT query>")
Example:scala> val df = hive.sql("select * from managedTable where a=100") scala> df.collect()
Import statements and variables
The following string constants are defined by the API:
HIVE_WAREHOUSE_CONNECTOR
DATAFRAME_TO_STREAM
STREAM_TO_STREAM
Assuming spark
is running in an existing SparkSession
,
use this code for imports:
- Scala
import com.hortonworks.hwc.HiveWarehouseSession import com.hortonworks.hwc.HiveWarehouseSession._ val hive = HiveWarehouseSession.session(spark).build()
- Java
import com.hortonworks.hwc.HiveWarehouseSession; import static com.hortonworks.hwc.HiveWarehouseSession.*; HiveWarehouseSession hive = HiveWarehouseSession.session(spark).build();
- Python
from pyspark_llap import HiveWarehouseSession hive = HiveWarehouseSession.session(spark).build()
Executing queries
hive.sql()
API for executing queries. You can also use
Spark SQL to query Hive managed tables, however, it is recommended that you use the HWC sql
method. .sql()
- Executes queries in both the read modes — Direct Reader and JDBC modes.
- Consistent with the Spark sql interface.
- Masks the internal implementation based on the cluster type you configured.
- Used to execute read operations and does not support write operations, such as INSERT, UPDATE, and DELETE.
Results are returned as a DataFrame to Spark.
Support of HWC read modes on Hive tables or views
Mode vs Table | Full ACID table | Insert-only ACID table | View (created on a managed table) | Materialized view (created on a managed table) |
---|---|---|---|---|
DIRECT_READER_V1 | Yes | Yes | Yes | Yes |
DIRECT_READER_V2 | Yes | Yes | Yes | Yes |
JDBC_CLUSTER | Yes | Yes | Yes | Yes |
It is recommended that you do not use Managed non-transactional tables. Such tables should ideally be converted to external tables.
Support of HWC read modes on table formats
Mode | ORC | Parquet | Avro | Textfile |
---|---|---|---|---|
DIRECT_READER_V1 | Yes | Yes | Yes | Yes |
DIRECT_READER_V2 | Yes | Yes | Yes | Yes |
JDBC_CLUSTER | Yes | Yes | Yes | Yes |
hive.sql vs. spark.sql
hive.sql()
is explicitly defined in HWC and can be used across all read modes to query Apache Hive managed tables (full ACID and insert-ony ACID tables).spark.sql()
can also be used across all the read modes to query an Apache Hive managed table. However, it is recommended that you usehive.sql()
overspark.sql()
.- The Direct Reader imposes the constraint that the Hive table must be transactional.