Understanding Apache Phoenix-Hive connector
With Hortonworks Data Platform (HDP), you can use the Phoenix-Hive Storage Handler on your secured clusters to handle large joins and large aggregations. You can use this Storage Handler with HDP 2.6 or later.
This connector enables you to access Phoenix data from Hive without any data transfer, so Business Intelligence (BI) logic in Hive can work with the operational data available in Phoenix. Using this connector, you can run certain types of queries more efficiently than with Hive or other applications alone; however, it is not a universal tool that can run all types of queries. In some cases, Phoenix on its own runs queries faster than the Phoenix-Hive integration, and vice versa. In other cases, you can use this integration for operations such as many-to-many joins and aggregations that Phoenix would otherwise struggle to run efficiently on its own. This integration is better suited to online analytical processing (OLAP) workloads than Phoenix alone.
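For example, the following is a minimal sketch of such access. The table and column names (hive_orders, order_id, customer, amount) and the connection values are placeholders assumed for illustration:

CREATE EXTERNAL TABLE hive_orders (
  order_id INT,
  customer STRING,
  amount   DOUBLE
)
STORED BY 'org.apache.phoenix.hive.PhoenixStorageHandler'
TBLPROPERTIES (
  "phoenix.table.name" = "ORDERS",             -- existing Phoenix table (assumed)
  "phoenix.zookeeper.quorum" = "zk-host",      -- placeholder quorum for your cluster
  "phoenix.zookeeper.znode.parent" = "/hbase",
  "phoenix.zookeeper.client.port" = "2181",
  "phoenix.rowkeys" = "order_id"               -- column(s) that form the Phoenix primary key
);

-- BI-style aggregation in Hive that reads the Phoenix data in place
SELECT customer, SUM(amount) AS total_amount
FROM hive_orders
GROUP BY customer;

Because the query runs through the storage handler, the Phoenix rows are read directly; no export or copy of the data is involved.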
Another use case for this connector is transferring data between these two systems. You can use this connector to simplify data movement between Hive and Phoenix because an intermediate form of the data (for example, a .CSV file) is not required. The automatic movement of structured data between these two systems is the major advantage of using this tool. Be aware, however, that for moving large amounts of data from Hive to Phoenix, the Phoenix CSV bulk load tool is preferable for performance reasons.
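As a sketch of this data-movement use case (reusing the hypothetical hive_orders table above, plus a hypothetical native Hive table named staging_orders), a plain INSERT moves rows into Phoenix with no intermediate file:

-- Writes directly into the underlying Phoenix table through the storage handler
INSERT INTO TABLE hive_orders
SELECT order_id, customer, amount
FROM staging_orders;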
A change to Hive in HDP 3.0 is that all StorageHandler tables must be marked as “external”; there is no such thing as a non-external table created by a StorageHandler. If the corresponding Phoenix table exists when the Hive table is created, the table mimics the HDP 2.x semantics of an “external” table. If the corresponding Phoenix table does not exist when the Hive table is created, it mimics the HDP 2.x semantics of a non-external table (for example, the Phoenix table is dropped when the Hive table is dropped).
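The following sketch illustrates these semantics with hypothetical names; the connection properties are omitted here and would be supplied as in the earlier example:

-- EXTERNAL is mandatory for StorageHandler tables in HDP 3.0
CREATE EXTERNAL TABLE hive_events (id INT, payload STRING)
STORED BY 'org.apache.phoenix.hive.PhoenixStorageHandler'
TBLPROPERTIES ("phoenix.table.name" = "EVENTS",
               "phoenix.rowkeys" = "id");

-- If the Phoenix table EVENTS existed before the CREATE above, this drop removes
-- only the Hive definition (HDP 2.x “external” semantics). If EVENTS was created
-- by the CREATE above, the Phoenix table is dropped along with it.
DROP TABLE hive_events;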