Configuring Impala

You must review and configure the mandatory and recommended settings if you installed Impala without Cloudera Manager or if you want to customize your environment after installing Impala. If you installed Impala using Cloudera Manager, some of these configurations are completed automatically.

In some cases, depending on the level of Impala, CDP, and Cloudera Manager, you might need to add particular component configuration details in one of the free-form fields on the Impala configuration pages in Cloudera Manager.

  • You must enable short-circuit reads, whether or not Impala was installed through Cloudera Manager. This setting goes in the Impala configuration settings, not the Hadoop-wide settings.
  • If you installed Impala in an environment that is not managed by Cloudera Manager, you must enable block location tracking, and you can optionally enable native checksumming for optimal performance.

Short-Circuit Reads

Enabling short-circuit reads allows Impala to read local data directly from the file system. This removes the need to communicate through the DataNodes, improving performance. This setting also minimizes the number of additional copies of data. Short-circuit reads requires libhadoop.so (the Hadoop Native Library) to be accessible to both the server and the client. You must install it from an .rpm, .deb, or parcel to use short-circuit local reads.

To Enable Short-Circuit Reads

You can enable short-circuit reads through a checkbox available under Configuration tab for both Impala and HDFS.

  1. In Cloudera Manager > Clusters select the Impala service, for example, IMPALA.
  2. On the Configuration tab, search for dfs.client.read.shortcircuit.
  3. Accept the default (enabled), or check to enable the dfs.client.read.shortcircuit property to read the HDFS file blocks directly.
  4. Do the above for HDFS service too by clicking Cloudera Manager > Clusters and by selecting the HDFS service.
  5. Save the changes and Restart the service.

To configure DataNodes for short-circuit reads:

  1. On all Impala nodes, configure the following attributes from the HDFS service as shown:
    1. In Cloudera Manager > Clusters select the HDFS service, for example, HDFS.
    2. On the Configuration tab, search for dfs.domain.socket.path and set the attribute.
    3. Search for and set the attribute, if necessary, dfs.client.file-block-storage-locations.timeout.millis.
    4. Search for and set the attribute, if necessary, dfs.datanode.hdfs-blocks-metadata.enabled
  2. After applying these changes, restart all DataNodes.

Block Location Tracking

Enabling block location metadata allows Impala to know which disk data blocks are located on, allowing better utilization of the underlying disks. Impala will not start unless this setting is enabled.

To enable block location tracking:

  1. For each DataNode, enable the following attribute dfs.datanode.hdfs-blocks-metadata.enabled file:
    1. In Cloudera Manager > Clusters select the HDFS service, for example, HDFS.
    2. On the Configuration tab, search for dfs.datanode.hdfs-blocks-metadata.enabled and enable the attribute if not already enabled.
  2. After applying these changes, restart all DataNodes.

Native Checksumming

Enabling native checksumming causes Impala to use an optimized native library for computing checksums, if that library is available.

To enable native checksumming:

If you installed CDP from packages, the native checksumming library is installed and setup correctly, and no additional steps are required.

If you installed by other means, native checksumming may not be available due to missing shared objects. Finding the message "Unable to load native-hadoop library for your platform... using builtin-java classes where applicable" in the Impala logs indicates native checksumming may be unavailable.

To enable native checksumming, you must build and install libhadoop.so (the Hadoop Native Library).