Understanding Apache Phoenix-Hive Connector

This connector enables you to access Phoenix data from Hive without any data transfer, so the Business Intelligence (BI) logic in Hive can access the operational data available in Phoenix.

Using this connector, you can run certain types of queries more efficiently than with Hive or other applications alone. However, this is not a universal tool that can run all types of queries: in some cases, standalone Phoenix runs queries faster than the Phoenix-Hive integration, and vice versa. The integration is useful for operations such as many-to-many joins and aggregations, which Phoenix would otherwise struggle to run efficiently on its own, and it is better suited than standalone Phoenix for online analytical processing (OLAP) workloads.

Another use case for this connector is transferring data between the two systems. The connector simplifies data movement between Hive and Phoenix because no intermediate form of the data (for example, a .CSV file) is required. This automatic movement of structured data between the two systems is the major advantage of using this tool. Be aware, however, that for moving large amounts of data from Hive to Phoenix, the Phoenix CSV bulk load is preferable for performance reasons.

Configure Phoenix-Hive Connector using Cloudera Manager

You must configure the Phoenix-Hive connector before you can access Phoenix data from Hive. To configure the Phoenix-Hive connector using Cloudera Manager:

Minimum Required Role: Configurator (also provided by Cluster Administrator, Full Administrator)

  1. Go to the Hive service.
  2. Click the Configuration tab.
  3. Select Scope > Hive Cluster (Service-Wide).
  4. Select Category > Advanced.
  5. Locate the Hive Auxiliary JARs Directory property or search for it by typing its name in the Search box.
  6. Add the following auxiliary path directory: /usr/local/phoenix-hive.
  7. Locate the Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml property or search for it by typing its name in the Search box.
  8. Add the following property values:

    Name: hive.aux.jars.path

    Value: file:///opt/cloudera/parcels/PHOENIX/lib/phoenix/phoenix-version-hive.jar

  9. Select Scope > Gateway.
  10. Select Category > Advanced.
  11. Locate the Hive Client Advanced Configuration Snippet (Safety Valve) for hive-site.xml property or search for it by typing its name in the Search box.
  12. Add the following property values:

    Name: hive.aux.jars.path

    Value: file:///opt/cloudera/parcels/PHOENIX/lib/phoenix/phoenix-version-hive.jar

  13. Enter a Reason for change, and then click Save Changes to commit the changes.
  14. Restart the role and service when Cloudera Manager prompts you to restart.
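
After the services restart, you can optionally verify the setting from a Beeline session. This is a minimal check, assuming the parcel path configured above:

-- In Beeline, display the effective value of the property
SET hive.aux.jars.path;
-- The output should include the connector JAR, for example:
-- hive.aux.jars.path=file:///opt/cloudera/parcels/PHOENIX/lib/phoenix/phoenix-version-hive.jar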

Apache Phoenix-Hive Usage Examples

You can refer to the following Phoenix-Hive connector examples:

  • Creating a table
  • Loading data
  • Querying data

Creating a table

Creating an external Hive table requires an existing table in Phoenix. Hive manages only the Hive metadata. Dropping an external table from Hive deletes only the Hive metadata, but the Phoenix table is not deleted.
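
For example, a Phoenix table matching the Hive definition below could be created first from the Phoenix SQL client, such as sqlline. This is a sketch; the column types are the Phoenix counterparts of the Hive types used below:

-- Run in the Phoenix SQL client, not in Hive.
-- The PRIMARY KEY column corresponds to the phoenix.rowkeys property below.
CREATE TABLE ext_table (
  i1 INTEGER NOT NULL PRIMARY KEY,
  s1 VARCHAR,
  f1 FLOAT,
  d1 DECIMAL
);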

Use the create external table command to create an EXTERNAL Hive table.

create external table ext_table (
  i1 int,
  s1 string,
  f1 float,
  d1 decimal
)
STORED BY 'org.apache.phoenix.hive.PhoenixStorageHandler'
TBLPROPERTIES (
  "phoenix.table.name" = "ext_table",
  "phoenix.zookeeper.quorum" = "localhost",
  "phoenix.zookeeper.znode.parent" = "/hbase",
  "phoenix.zookeeper.client.port" = "2181",
  "phoenix.rowkeys" = "i1",
  "phoenix.column.mapping" = "i1:i1, s1:s1, f1:f1, d1:d1"
);

Following are the parameters that you could use when creating an external table:

| Parameter | Default Value | Description |
| --- | --- | --- |
| phoenix.table.name | The same name as the Hive table | Name of the existing Phoenix table |
| phoenix.zookeeper.quorum | localhost | Specifies the ZooKeeper quorum for HBase |
| phoenix.zookeeper.znode.parent | /hbase | Specifies the ZooKeeper parent node for HBase |
| phoenix.zookeeper.client.port | 2181 | Specifies the ZooKeeper port |
| phoenix.rowkeys | N/A | The list of columns to be the primary key in a Phoenix table |
| phoenix.column.mapping | N/A | Mappings between column names for Hive and Phoenix |

Loading data

Use the insert statement to load data into the Phoenix table through Hive.

insert into table T values (....);
insert into table T select c1,c2,c3 from source_table;
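
For example, using the ext_table definition from the previous section (the literal values are illustrative):

-- Insert a single literal row into the Phoenix-backed table
insert into table ext_table values (1, 'sample', 0.5, 10.25);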

Querying data

You can use HiveQL to query data in a Phoenix table. A Hive query on a single table can be as fast as running the query in the Phoenix CLI when the following properties are set: hive.fetch.task.conversion=more and hive.exec.parallel=true.
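
For example (the SELECT statement and its predicate are illustrative, reusing the ext_table columns from above):

-- Enable the settings for the session, then query the Phoenix-backed table
set hive.fetch.task.conversion=more;
set hive.exec.parallel=true;

select s1, f1 from ext_table where i1 = 1;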

Following are some of the parameters that you could use when querying the data:

| Parameter | Default Value | Description |
| --- | --- | --- |
| hbase.scan.cache | 100 | Read row size for a unit request |
| hbase.scan.cacheblock | false | Whether or not to cache blocks |
| split.by.stats | false | If true, mappers use table statistics; one mapper per guide post |
| [hive-table-name].reducer.count | 1 | Number of reducers. In Tez mode, this affects only single-table queries. See Limitations. |
| [phoenix-table-name].query.hint | N/A | Hint for the Phoenix query (for example, NO_INDEX) |
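
These parameters are set in the same way as other Hive session properties. The following is a sketch; the example table name ext_table stands in for the bracketed placeholders, and the values are illustrative:

-- Session-wide scan tuning
set hbase.scan.cache=1000;
set hbase.scan.cacheblock=true;

-- Per-table settings: replace ext_table with your Hive or Phoenix table name
set ext_table.reducer.count=4;
set ext_table.query.hint=NO_INDEX;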

Limitations of Phoenix-Hive Connector

Following are some of the limitations of the Phoenix-Hive connector:

  • Only a 4K character specification is allowed to specify a full table. If the volume of the data is huge, there is a possibility of losing metadata information.
  • Phoenix and Hive save timestamps differently: Phoenix uses a binary format, whereas Hive uses a text format to store data.
  • Hive LLAP is not supported in this integration.