JDBC mode limitations
You must understand the limitations of JDBC mode and what functionality is not supported.
Keep the following limitations of JDBC mode in mind:
- JDBC_CLUSTER is used for reads only, and is recommended for production workloads of 1 GB or
less. With larger workload, bottlenecks develop in data transfer to Spark. Use only for
running DML to extract data into an external table for direct spark.sql access.
Writes through HWC of any size are recommended for production. Writes do not use JDBC mode.
- Data transfer is through a single JDBC connection. Does not scale and is slow for large volumes of data.
- Executors require access to HiveServer (HS2).
- In JDBC_CLUSTER mode, HWC fails to correctly resolve queries that use the ORDER BY clause
when run as
hive.sql(" <query> ")
. The query returns unordered rows of data even though the query contains an ORDER BY clause. - In JDBC read mode, a query of a table having a column of a complex type, such as ARRAY, STRUCT and MAP, incorrectly represents the type as String in the returned DataFrame.
- Spark fetches only the first x rows from the connection. This is to prevent HS2 timeouts which might otherwise cause Spark to hit Session/Operation Handle exceptions.