Using Hive data connectors to support external data sources

You can use Hive data connectors to map databases present in external data sources to a local Hive Metastore (HMS). The external data sources can be of different types, such as MySQL, PostgreSQL, Oracle, Redshift, Derby, or other HMS instances. You can create external tables to represent the data, and then query the tables.

What is a data connector

Data connectors in Hive are objects that are defined using a set of properties, such as hostname, user credentials, and type that are required to connect Hive to the external data source. You define a connector and then use the same connector to map multiple databases from the remote data source to the local HMS.

Benefits of using data connectors

Currently, you can use a JDBCStorageHandler to connect Hive to external data sources. Users define a table in the HMS for which data resides in a remote JDBC data store. The Hive table's metadata is persisted locally in the HMS. When HiveServer (HS2) runs a query against this table, data is retrieved from the remote JDBC table. However, there are certain limitations which make it difficult to map databases having a large number of tables.

  • Each remote table in the remote data source has to be individually mapped to a local Hive table. It is cumbersome when you have to map an entire database having a lot of tables.
  • New tables created in the remote data source are not automatically visible and should be manually mapped in Hive.
  • The metadata for the mapped table is static. It does not track changes to the remote table. If the remote table is modified (added columns or dropped columns), the mapped Hive table has to be dropped and recreated. This is not feasible for organizations where tables are constantly changing.

With data connectors, you can map a database or schema in the remote data source to a Hive database. Such databases are referred to as 'Remote' databases in Hive. The benefits of creating remote databases using data connectors are as follows:

  • Tables within a mapped database are automatically visible from Hive. The metadata for these tables are not persisted in the HMS. They are retrieved at runtime through an active connector.
  • New tables created are automatically visible in Hive.
  • The columns and their data types are mapped to compatible Hive data types during runtime. Therefore, metadata changes to tables in the remote datasource are immediately visible in hive.
  • The same connector can be used to map another database to a Hive database. All the connection information is shared between these remote databases.

To create a remote database, you must first define a data connector with the required properties and then use this connector to create and map a remote database in an external data source to Hive.