Post-Installation Configuration for Impala
This section describes the mandatory and recommended configuration settings for Cloudera Impala. If Impala is installed using Cloudera Manager, some of these configurations are completed automatically; you must still configure short-circuit reads manually. If you installed Impala without Cloudera Manager, or if you want to customize your environment, consider making the changes described in this topic.
In some cases, depending on the level of Impala, CDH, and Cloudera Manager, you might need to add particular component configuration details in one of the free-form fields on the Impala configuration pages within Cloudera Manager. In Cloudera Manager 4, these fields are labelled Safety Valve; in Cloudera Manager 5, they are called Advanced Configuration Snippet.
- You must enable short-circuit reads, whether or not Impala was installed through Cloudera Manager. This setting goes in the Impala configuration settings, not the Hadoop-wide settings.
- If you installed Impala in an environment that is not managed by Cloudera Manager, you must enable block location tracking, and you can optionally enable native checksumming for optimal performance.
- If you deployed Impala using Cloudera Manager, see Testing Impala Performance to confirm proper configuration.
Mandatory: Short-Circuit Reads
Cloudera strongly recommends using Impala with CDH 4.2 or higher, ideally the latest 4.x release. Impala does support short-circuit reads with CDH 4.1, but for best performance, upgrade to CDH 4.3 or higher. The process of configuring short-circuit reads varies according to which version of CDH you are using. Choose the procedure that is appropriate for your environment.
To configure DataNodes for short-circuit reads with CDH 4.2 or later:
- Copy the client core-site.xml and hdfs-site.xml configuration files from the Hadoop configuration directory to the Impala configuration directory. The default Impala configuration location is /etc/impala/conf.
- On all Impala nodes, configure the following properties in Impala's copy of hdfs-site.xml as shown:
  <property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.domain.socket.path</name>
    <value>/var/run/hdfs-sockets/dn</value>
  </property>
  <property>
    <name>dfs.client.file-block-storage-locations.timeout.millis</name>
    <value>10000</value>
  </property>
- If /var/run/hadoop-hdfs/ is group-writable, make sure its group is root.
  Note: If you are also going to enable block location tracking, you can skip copying configuration files and restarting DataNodes and go straight to Mandatory: Block Location Tracking. Configuring short-circuit reads and block location tracking requires the same process of copying files and restarting services, so you can complete that process once, after you have made all configuration changes. Whether you copy files and restart services now or while configuring block location tracking, short-circuit reads are not enabled until you complete those final steps.
- After applying these changes, restart all DataNodes.
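The file-copy step in the procedure above can be scripted. The following is a minimal sketch, not an official Cloudera tool: the function name copy_client_conf is hypothetical, and the source directory /etc/hadoop/conf shown in the usage comment is an assumption about where the Hadoop client configuration lives on your hosts.

```shell
#!/bin/sh
# copy_client_conf SRC_DIR DEST_DIR
# Copies core-site.xml and hdfs-site.xml from SRC_DIR into DEST_DIR,
# creating DEST_DIR if it does not exist. Returns non-zero if either
# file is missing from SRC_DIR or a copy fails.
copy_client_conf() {
    src_dir="$1"
    dest_dir="$2"
    mkdir -p "$dest_dir" || return 1
    for f in core-site.xml hdfs-site.xml; do
        if [ ! -f "$src_dir/$f" ]; then
            echo "missing: $src_dir/$f" >&2
            return 1
        fi
        cp "$src_dir/$f" "$dest_dir/$f" || return 1
    done
}

# Usage (assumed source path; run with sufficient privileges):
#   copy_client_conf /etc/hadoop/conf "${IMPALA_CONF_DIR:-/etc/impala/conf}"
```

Because cp overwrites an existing file of the same name, run this before you edit Impala's copy of hdfs-site.xml, not after.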
To configure DataNodes for short-circuit reads with CDH 4.1:
- Enable short-circuit reads by adding settings to the Impala core-site.xml file.
- If you installed Impala using Cloudera Manager, short-circuit reads should be properly configured, but you can review the configuration by checking the contents of the core-site.xml file, which is installed at /etc/impala/conf by default.
- If you installed using packages, instead of using Cloudera Manager, create the core-site.xml file. The easiest way is to copy the core-site.xml client configuration file from another machine that is running Hadoop services into the Impala configuration directory. The Impala configuration directory is set by the IMPALA_CONF_DIR environment variable and is /etc/impala/conf by default. To confirm the location, check the value of IMPALA_CONF_DIR.
  Note: If the Impala configuration directory does not exist, create it and then add the core-site.xml file.
  <property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
  </property>
  Note: For an installation managed by Cloudera Manager, specify these settings in the Impala dialogs, in the options field for HDFS. In Cloudera Manager 4, these fields are labelled Safety Valve; in Cloudera Manager 5, they are called Advanced Configuration Snippet.
- For each DataNode, enable access by adding the following to the hdfs-site.xml file:
  <property>
    <name>dfs.client.use.legacy.blockreader.local</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir.perm</name>
    <value>750</value>
  </property>
  <property>
    <name>dfs.block.local-path-access.user</name>
    <value>impala</value>
  </property>
  <property>
    <name>dfs.client.file-block-storage-locations.timeout.millis</name>
    <value>10000</value>
  </property>
  Note: In the preceding example, the dfs.block.local-path-access.user is the user running the impalad process. By default, that account is impala.
- Use usermod to add users requiring local block access to the appropriate HDFS group. For example, if you assigned impala to the dfs.block.local-path-access.user property, you would add impala to the hadoop HDFS group:
$ usermod -a -G hadoop impala
  Note: The default HDFS group is hadoop, but it is possible to have an environment configured to use an alternate group. To find the configured HDFS group name using the Cloudera Manager admin console, click Services and click HDFS. Click the Configuration tab. Under Service-Wide, click Advanced in the left column. The Shared Hadoop Group Name property contains the group name.
  Note: If you are going to enable block location tracking, you can skip copying configuration files and restarting DataNodes and go straight to Mandatory: Block Location Tracking. Configuring short-circuit reads and block location tracking requires the same process of copying files and restarting services, so you can complete that process once, after you have made all configuration changes. Whether you copy files and restart services now or while configuring block location tracking, short-circuit reads are not enabled until you complete those final steps.
- Copy the client core-site.xml and hdfs-site.xml configuration files from the Hadoop configuration directory to the Impala configuration directory. The default Impala configuration location is /etc/impala/conf.
- After applying these changes, restart all DataNodes.
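After the usermod step in the CDH 4.1 procedure, you may want to confirm that the group change took effect. The sketch below assumes the default names used in this topic (user impala, group hadoop); the helper has_group is a hypothetical name, not part of Impala or CDH. It only inspects a list of group names, such as the one printed by id -nG.

```shell
#!/bin/sh
# has_group GROUP_LIST GROUP
# Succeeds if GROUP is one of the space-separated names in GROUP_LIST
# (the format printed by `id -nG USER`). Matches whole names only.
has_group() {
    for g in $1; do
        [ "$g" = "$2" ] && return 0
    done
    return 1
}

# Usage (assumes the default user and group names):
#   if has_group "$(id -nG impala)" hadoop; then
#       echo "impala is a member of hadoop"
#   else
#       echo "impala is NOT a member of hadoop" >&2
#   fi
```

If your environment uses an alternate HDFS group, substitute that group name for hadoop.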
Mandatory: Block Location Tracking
Enabling block location metadata allows Impala to know which disk data blocks are located on, allowing better utilization of the underlying disks. Impala will not start unless this setting is enabled.
To enable block location tracking:
- For each DataNode, add the following to the hdfs-site.xml file:
  <property>
    <name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
    <value>true</value>
  </property>
- Copy the client core-site.xml and hdfs-site.xml configuration files from the Hadoop configuration directory to the Impala configuration directory. The default Impala configuration location is /etc/impala/conf.
- After applying these changes, restart all DataNodes.
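Before restarting services, it can help to confirm that the property actually appears in Impala's copy of hdfs-site.xml. The following is a rough, text-based sketch rather than a real XML parser: get_hadoop_property is a hypothetical helper, it matches the property name literally, and it does not handle XML comments or extra whitespace inside the name and value tags.

```shell
#!/bin/sh
# get_hadoop_property FILE NAME
# Prints the <value> of the property called NAME in a Hadoop *-site.xml
# file. Works on both one-line and multi-line <property> blocks by
# joining the file into a single line and splitting it at </property>.
# Prints nothing if the property is not found.
get_hadoop_property() {
    tr -d '\n' < "$1" | awk -v want="$2" '{
        n = split($0, props, "</property>")
        for (i = 1; i <= n; i++) {
            p = props[i]
            if (index(p, "<name>" want "</name>") == 0) continue
            start = index(p, "<value>")
            end = index(p, "</value>")
            # "<value>" is 7 characters long
            if (start > 0 && end > start)
                print substr(p, start + 7, end - start - 7)
        }
    }'
}

# Usage (default Impala configuration location from this topic):
#   get_hadoop_property /etc/impala/conf/hdfs-site.xml \
#       dfs.datanode.hdfs-blocks-metadata.enabled
```

The same helper can be pointed at any other property named in this topic, such as dfs.client.read.shortcircuit.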
Optional: Native Checksumming
Enabling native checksumming causes Impala to use an optimized native library for computing checksums, if that library is available.
To enable native checksumming:
If you installed CDH from packages, the native checksumming library is installed and set up correctly, and no additional steps are required. Conversely, if you installed by other means, such as with tarballs, native checksumming may not be available due to missing shared objects. Finding the message "Unable to load native-hadoop library for your platform... using builtin-java classes where applicable" in the Impala logs indicates native checksumming may be unavailable. To enable native checksumming, you must build and install libhadoop.so (the Hadoop Native Library).
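One way to look for that message is to search the Impala log directory. This is a minimal sketch: native_lib_warning_found is a hypothetical helper, and /var/log/impala in the usage comment is an assumed log location that may differ in your installation.

```shell
#!/bin/sh
# native_lib_warning_found LOG_DIR
# Succeeds if any file under LOG_DIR contains the fixed string that
# signals the native Hadoop library could not be loaded.
native_lib_warning_found() {
    grep -rqF "Unable to load native-hadoop library" "$1"
}

# Usage (assumed log directory):
#   if native_lib_warning_found /var/log/impala; then
#       echo "native checksumming may be unavailable" >&2
#   fi
```

A clean result only means the message is absent from the files searched; it does not by itself prove that libhadoop.so is loaded.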