Setting Up Apache Impala Using the Command Line

Impala is an open-source add-on to the Cloudera Enterprise Core that returns rapid responses to queries.

What is Included in an Impala Installation

Impala is made up of a set of components that can be installed on multiple nodes throughout your cluster. The key installation step for performance is to install the impalad daemon (which does most of the query processing work) on all DataNodes in the cluster.

The Impala package installs these binaries:

  • impalad - The Impala daemon. Plans and executes queries against HDFS, HBase, and Amazon S3 data. Run one impalad process on each node in the cluster that has a DataNode.

  • statestored - Name service that tracks the location and status of all impalad instances in the cluster. Run one instance of this daemon on a node in your cluster. Most production deployments run this daemon on the NameNode.

  • catalogd - Metadata coordination service that broadcasts changes from Impala DDL and DML statements to all affected Impala nodes, so that new tables, newly loaded data, and so on are immediately visible to queries submitted through any Impala node. (Prior to Impala 1.2, you had to run the REFRESH or INVALIDATE METADATA statement on each node to synchronize changed metadata. Now those statements are only required if you perform the DDL or DML through an external mechanism such as Hive or by uploading data to the Amazon S3 filesystem.) Run one instance of this daemon on a node in your cluster, preferably on the same host as the statestored daemon.

  • impala-shell - Command-line interface for issuing queries to the Impala daemon. You install this on one or more hosts anywhere on your network, not necessarily DataNodes or even within the same cluster as Impala. It can connect remotely to any instance of the Impala daemon.

Before doing the installation, ensure that you have all necessary prerequisites. See Impala Requirements for details.

Installing Impala from the Command Line

Before installing Impala manually, make sure all applicable nodes have the appropriate hardware configuration, levels of operating system and CDH, and any other software prerequisites. See Impala Requirements for details.

You can install Impala across many hosts or on one host:

  • Installing Impala across multiple machines creates a distributed configuration. For best performance, install Impala on all DataNodes.
  • Installing Impala on a single machine produces a pseudo-distributed cluster.

To install Impala on a host:

  1. Install CDH, including Hive, as described in Installing and Deploying Unmanaged CDH Using the Command Line.
  2. Configure the Hive metastore to use an external database. Impala uses this same database for its own table metadata. You can choose either a MySQL or PostgreSQL database as the metastore. The process for configuring each type of database is described in the CDH Installation Guide.

    Cloudera recommends setting up a Hive metastore service rather than connecting directly to the metastore database; this configuration is required when running Impala under CDH 4.1. Make sure the /etc/impala/conf/hive-site.xml file contains the following setting, substituting the appropriate hostname for metastore_server_host:

    <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore_server_host:9083</value>
    </property>
    <property>
    <name>hive.metastore.client.socket.timeout</name>
    <value>3600</value>
    <description>MetaStore Client socket timeout in seconds</description>
    </property>
  3. (Optional) If you installed the full Hive component on any host, you can verify that the metastore is configured properly by starting the Hive console and querying for the list of available tables. Once you confirm that the console starts, exit the console to continue the installation:
    $ hive
    Hive history file=/tmp/root/hive_job_log_root_201207272011_678722950.txt
    hive> show tables;
    table1
    table2
    hive> quit;
    $
  4. Confirm that your package management command is aware of the Impala repository settings, as described in Impala Requirements. (For CDH 4, this is a different repository than for CDH.) You might need to download a repo or list file into a system directory underneath /etc.
  5. Use one of the following sets of commands to install the Impala package:

    For RHEL, Oracle Linux, or CentOS systems:

    $ sudo yum install impala             # Binaries for daemons
    $ sudo yum install impala-server      # Service start/stop script
    $ sudo yum install impala-state-store # Service start/stop script
    $ sudo yum install impala-catalog     # Service start/stop script
    

    For SUSE systems:

    $ sudo zypper install impala             # Binaries for daemons
    $ sudo zypper install impala-server      # Service start/stop script
    $ sudo zypper install impala-state-store # Service start/stop script
    $ sudo zypper install impala-catalog     # Service start/stop script
    

    For Debian or Ubuntu systems:

    $ sudo apt-get install impala             # Binaries for daemons
    $ sudo apt-get install impala-server      # Service start/stop script
    $ sudo apt-get install impala-state-store # Service start/stop script
    $ sudo apt-get install impala-catalog     # Service start/stop script
    
  6. Copy the client hive-site.xml, core-site.xml, hdfs-site.xml, and hbase-site.xml configuration files to the Impala configuration directory, which defaults to /etc/impala/conf. Create this directory if it does not already exist.
  7. Use one of the following commands to install impala-shell on the machines from which you want to issue queries. You can install impala-shell on any supported machine that can connect to DataNodes that are running impalad.

    For RHEL/CentOS systems:

    $ sudo yum install impala-shell

    For SUSE systems:

    $ sudo zypper install impala-shell

    For Debian/Ubuntu systems:

    $ sudo apt-get install impala-shell
  8. Complete any required or recommended configuration, as described in Post-Installation Configuration for Impala. Some of these configuration changes are mandatory.
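
Steps 2 and 6 above lend themselves to a small script. The following is a minimal sketch, not part of the Impala packages: the function names copy_impala_client_configs and get_site_property are hypothetical, the source directories you pass in are assumptions about where your client configuration files live, and the property lookup assumes one XML tag per line, as in the snippet in step 2.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are illustrative, not Impala tooling.

# Copy the four client configuration files named in step 6 into DEST,
# creating DEST if needed. The remaining arguments are directories to search.
# Files that are not found in any directory are skipped silently.
copy_impala_client_configs() {
    local dest="$1"
    shift
    mkdir -p "$dest" || return 1
    local name dir
    for name in hive-site.xml core-site.xml hdfs-site.xml hbase-site.xml; do
        for dir in "$@"; do
            if [ -f "$dir/$name" ]; then
                cp "$dir/$name" "$dest/$name" || return 1
                break   # first match wins; stop searching for this file
            fi
        done
    done
}

# Print the <value> on the line after <name>PROP</name> in FILE.
# Assumes one tag per line; this is not a general XML parser.
get_site_property() {
    local file="$1" prop="$2"
    grep -F -A1 "<name>${prop}</name>" "$file" \
        | sed -n 's|.*<value>\(.*\)</value>.*|\1|p'
}
```

As a usage example, a user who can write to /etc/impala/conf might call copy_impala_client_configs /etc/impala/conf /etc/hive/conf /etc/hadoop/conf /etc/hbase/conf (the three source directories are typical locations but are assumed here), and then run get_site_property /etc/impala/conf/hive-site.xml hive.metastore.uris to confirm that the thrift:// URI from step 2 is present.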

Once installation and configuration are complete, see Starting Impala for how to activate the software on the appropriate nodes in your cluster.

If this is your first time setting up and using Impala in this cluster, run through some of the exercises in Impala Tutorials to verify that you can do basic operations such as creating tables and querying them.

Modifying Impala Startup Options

The configuration options for the Impala-related daemons let you choose which hosts and ports to use for the services that run on a single host, specify directories for logging, control resource usage and security, and specify other aspects of the Impala software.

Configuring Impala Startup Options through the Command Line

When you run Impala in a non-Cloudera Manager environment, the Impala server, statestore, and catalog services start up using values provided in a defaults file, /etc/default/impala.

This file includes information about many resources used by Impala. The defaults in this file work in most cases. For example, you typically would not change the definition of the CLASSPATH variable, but you would always set the address used by the statestore server. Some of the content you might modify includes:

IMPALA_STATE_STORE_HOST=127.0.0.1
IMPALA_STATE_STORE_PORT=24000
IMPALA_BACKEND_PORT=22000
IMPALA_LOG_DIR=/var/log/impala
IMPALA_CATALOG_SERVICE_HOST=...
IMPALA_STATE_STORE_HOST=...

export IMPALA_STATE_STORE_ARGS=${IMPALA_STATE_STORE_ARGS:- \
    -log_dir=${IMPALA_LOG_DIR} -state_store_port=${IMPALA_STATE_STORE_PORT}}
IMPALA_SERVER_ARGS=" \
-log_dir=${IMPALA_LOG_DIR} \
-catalog_service_host=${IMPALA_CATALOG_SERVICE_HOST} \
-state_store_port=${IMPALA_STATE_STORE_PORT} \
-state_store_host=${IMPALA_STATE_STORE_HOST} \
-be_port=${IMPALA_BACKEND_PORT}"
export ENABLE_CORE_DUMPS=${ENABLE_COREDUMPS:-false}

To use alternate values, edit the defaults file, then restart all the Impala-related services so that the changes take effect. Restart the Impala server using the following commands:

$ sudo service impala-server restart
Stopping Impala Server:                                    [  OK  ]
Starting Impala Server:                                    [  OK  ]

Restart the Impala statestore using the following commands:

$ sudo service impala-state-store restart
Stopping Impala State Store Server:                        [  OK  ]
Starting Impala State Store Server:                        [  OK  ]

Restart the Impala catalog service using the following commands:

$ sudo service impala-catalog restart
Stopping Impala Catalog Server:                            [  OK  ]
Starting Impala Catalog Server:                            [  OK  ]
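
Because a change to the defaults file only takes effect after a restart, it can help to wrap the three commands above in one function. This is a minimal sketch: restart_impala_services and the SERVICE_CMD override are hypothetical names introduced here, and restarting the statestore first simply mirrors the startup order described under Starting Impala. Pass only the services that are installed on the current host.

```shell
#!/usr/bin/env bash
# Sketch only: restart the named Impala services in the order given.
# SERVICE_CMD is a hypothetical override (for example, "echo sudo service"
# for a dry run); by default the function runs "sudo service <name> restart".
restart_impala_services() {
    local service_cmd="${SERVICE_CMD:-sudo service}"
    local svc
    for svc in "$@"; do
        # $service_cmd is intentionally unquoted so "sudo service" splits into words.
        $service_cmd "$svc" restart || return 1
    done
}

# Dry run: print the commands instead of executing them.
SERVICE_CMD="echo sudo service" restart_impala_services \
    impala-state-store impala-catalog impala-server
```

The function stops at the first service that fails to restart, so a failure in the statestore is not hidden by later output.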

Some common settings to change include:

  • Statestore address. Where practical, put the statestore on a separate host not running the impalad daemon. In that recommended configuration, the impalad daemon cannot refer to the statestore server using the loopback address. If the statestore is hosted on a machine with an IP address of 192.168.0.27, change:

    IMPALA_STATE_STORE_HOST=127.0.0.1

    to:

    IMPALA_STATE_STORE_HOST=192.168.0.27
  • Catalog server address (including both the hostname and the port number). Update the value of the IMPALA_CATALOG_SERVICE_HOST variable. Cloudera recommends the catalog server be on the same host as the statestore. In that recommended configuration, the impalad daemon cannot refer to the catalog server using the loopback address. If the catalog service is hosted on a machine with an IP address of 192.168.0.27, add the following line:

    IMPALA_CATALOG_SERVICE_HOST=192.168.0.27:26000

    The /etc/default/impala defaults file currently does not define an IMPALA_CATALOG_ARGS environment variable, but if you add one it will be recognized by the service startup/shutdown script. Add a definition for this variable to /etc/default/impala and add the option -catalog_service_host=hostname. If the port is different than the default 26000, also add the option -catalog_service_port=port.

  • Memory limits. You can limit the amount of memory available to Impala. For example, to allow Impala to use no more than 70% of system memory, change:

    export IMPALA_SERVER_ARGS=${IMPALA_SERVER_ARGS:- \
        -log_dir=${IMPALA_LOG_DIR} \
        -state_store_port=${IMPALA_STATE_STORE_PORT} \
        -state_store_host=${IMPALA_STATE_STORE_HOST} \
        -be_port=${IMPALA_BACKEND_PORT}}

    to:

    export IMPALA_SERVER_ARGS=${IMPALA_SERVER_ARGS:- \
        -log_dir=${IMPALA_LOG_DIR} -state_store_port=${IMPALA_STATE_STORE_PORT} \
        -state_store_host=${IMPALA_STATE_STORE_HOST} \
        -be_port=${IMPALA_BACKEND_PORT} -mem_limit=70%}

    You can specify the memory limit using absolute notation such as 500m or 2G, or as a percentage of physical memory such as 60%.

  • Core dump enablement. To enable core dumps on systems not managed by Cloudera Manager, change:

    export ENABLE_CORE_DUMPS=${ENABLE_COREDUMPS:-false}

    to:

    export ENABLE_CORE_DUMPS=${ENABLE_COREDUMPS:-true}

    On systems managed by Cloudera Manager, enable the Enable Core Dump setting for the Impala service.

  • Authorization using the open source Sentry plugin. Specify the -server_name and -authorization_policy_file options as part of the IMPALA_SERVER_ARGS and IMPALA_STATE_STORE_ARGS settings to enable the core Impala support for authorization. See Starting the impalad Daemon with Sentry Authorization Enabled for details.

  • Auditing for successful or blocked Impala queries, another aspect of security. Specify the -audit_event_log_dir=directory_path option and optionally the -max_audit_event_log_file_size=number_of_queries and -abort_on_failed_audit_event options as part of the IMPALA_SERVER_ARGS settings, for each Impala node, to enable and customize auditing. See Auditing Impala Operations for details.

  • Password protection for the Impala web UI, which listens on port 25000 by default. This feature involves adding some or all of the --webserver_password_file, --webserver_authentication_domain, and --webserver_certificate_file options to the IMPALA_SERVER_ARGS and IMPALA_STATE_STORE_ARGS settings. See Security Guidelines for Impala for details.

  • Another setting you might add to IMPALA_SERVER_ARGS is a comma-separated list of query options and values:
    -default_query_options='option=value,option=value,...'
    
    These options control the behavior of queries performed by this impalad instance. The option values you specify here override the default values for Impala query options, as shown by the SET statement in impala-shell.
  • During troubleshooting, Cloudera Support might direct you to change other values, particularly for IMPALA_SERVER_ARGS, to work around issues or gather debugging information.
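
The simple NAME=value edits in the list above can also be scripted. The sketch below is one possible approach, not a Cloudera-provided tool: set_impala_default is a hypothetical helper that rewrites a NAME=value line in a defaults file, or appends the line if none exists. It assumes plain, unquoted values that contain no | or & characters, and it leaves a .bak copy next to the file. Options embedded in IMPALA_SERVER_ARGS, such as -mem_limit, still need to be edited by hand.

```shell
#!/usr/bin/env bash
# Sketch only: set NAME=VALUE in a shell defaults file such as /etc/default/impala.
# Assumes VALUE contains no "|" or "&" (both are special in the sed expression).
set_impala_default() {
    local file="$1" name="$2" value="$3"
    if grep -q "^${name}=" "$file"; then
        # Rewrite every existing NAME=... line in place, keeping a .bak backup.
        sed -i.bak "s|^${name}=.*|${name}=${value}|" "$file"
    else
        # No existing definition: append one.
        printf '%s=%s\n' "$name" "$value" >> "$file"
    fi
}

# Example against a scratch copy rather than the live defaults file.
defaults_copy="$(mktemp)"
printf 'IMPALA_STATE_STORE_HOST=127.0.0.1\nIMPALA_STATE_STORE_PORT=24000\n' > "$defaults_copy"
set_impala_default "$defaults_copy" IMPALA_STATE_STORE_HOST 192.168.0.27
set_impala_default "$defaults_copy" IMPALA_CATALOG_SERVICE_HOST 192.168.0.27:26000
cat "$defaults_copy"
```

Working on a copy first makes it easy to compare the result with the live file before replacing it and restarting the services.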

Checking the Values of Impala Configuration Options

You can check the current runtime value of all these settings through the Impala web interface, available by default at http://impala_hostname:25000/varz for the impalad daemon, http://impala_hostname:25010/varz for the statestored daemon, or http://impala_hostname:25020/varz for the catalogd daemon. In the Cloudera Manager interface, you can see the link to the appropriate service_name Web UI page when you look at the status page for a specific daemon on a specific host.
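
To fetch these pages from a script, you first need the right port for each daemon. The sketch below maps daemon names to the default ports listed above; varz_url is a hypothetical helper, and the ports apply only if you have not changed the web server port for that daemon.

```shell
#!/usr/bin/env bash
# Sketch only: build the /varz URL for an Impala daemon on a given host,
# using the default web UI ports described above.
varz_url() {
    local host="$1" daemon="$2" port
    case "$daemon" in
        impalad)     port=25000 ;;
        statestored) port=25010 ;;
        catalogd)    port=25020 ;;
        *) echo "unknown daemon: $daemon" >&2; return 1 ;;
    esac
    printf 'http://%s:%s/varz\n' "$host" "$port"
}

# Example: impala_hostname is a placeholder for a real host name.
varz_url impala_hostname statestored
```

The resulting URL can be passed to an HTTP client, for example curl -s "$(varz_url impala_hostname impalad)". If the web UI is password protected, the client also needs to supply credentials.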

Startup Options for impalad Daemon

The impalad daemon implements the main Impala service, which performs query processing and reads and writes the data files. Some of the noteworthy options are:
  • The fe_service_threads option specifies the maximum number of concurrent client connections allowed. The default value is 64, which allows 64 queries to run simultaneously.

    If more clients try to connect to Impala than this setting allows, the later-arriving clients must wait until earlier clients disconnect. You can increase this value to allow more client connections. However, a large value means more threads to maintain even when most connections are idle, which can negatively affect query latency. Client applications should use connection pooling to avoid needing a large number of sessions.
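
As a hedged illustration, the option can be passed to impalad through IMPALA_SERVER_ARGS in /etc/default/impala, in the same -name=value form as the other options in that file. The value 128 is an arbitrary example, not a recommendation, and the two fallback assignments exist only to keep the fragment self-contained.

```shell
# Fragment in the style of /etc/default/impala. 128 is an illustrative value only.
IMPALA_LOG_DIR="${IMPALA_LOG_DIR:-/var/log/impala}"
IMPALA_SERVER_ARGS="${IMPALA_SERVER_ARGS:--log_dir=${IMPALA_LOG_DIR}}"

# Append the connection limit to whatever options are already set.
IMPALA_SERVER_ARGS="${IMPALA_SERVER_ARGS} -fe_service_threads=128"
export IMPALA_SERVER_ARGS
```

After restarting impalad, the /varz page described under Checking the Values of Impala Configuration Options is the place to confirm that the new value is in effect.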

Startup Options for statestored Daemon

The statestored daemon implements the Impala statestore service, which monitors the availability of Impala services across the cluster, and handles situations such as nodes becoming unavailable or becoming available again.

Startup Options for catalogd Daemon

The catalogd daemon implements the Impala catalog service, which broadcasts metadata changes to all the Impala nodes when Impala creates a table, inserts data, or performs other kinds of DDL and DML operations.

Use the --load_catalog_in_background option to control when the metadata of a table is loaded.
  • If set to false, the metadata of a table is loaded when it is referenced for the first time. This means that the first run of a particular query can be slower than subsequent runs. Starting in Impala 2.2, the default for load_catalog_in_background is false.
  • If set to true, the catalog service attempts to load metadata for a table even if no query needs that metadata, so the metadata might already be loaded when the first query that needs it runs. However, we recommend not setting this option to true, for the following reasons:
    • Background loading can interfere with query-specific metadata loading. This can happen on startup or after invalidating metadata, with a duration that depends on the amount of metadata, and can lead to seemingly random long-running queries that are difficult to diagnose.
    • Impala might load metadata for tables that are never used, potentially increasing catalog size and consequently memory usage for both the catalog service and the Impala daemon.
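
If you want to state the recommended setting explicitly, one option is to define IMPALA_CATALOG_ARGS in /etc/default/impala, which the service script recognizes as noted in the catalog server address item under Configuring Impala Startup Options through the Command Line. This is a sketch under that assumption; your file may need additional catalogd options alongside this one.

```shell
# Fragment in the style of /etc/default/impala.
# Makes lazy metadata loading (the default since Impala 2.2) explicit for catalogd,
# appending to any catalogd options that are already defined.
export IMPALA_CATALOG_ARGS="${IMPALA_CATALOG_ARGS:-} -load_catalog_in_background=false"
```

The catalog service must be restarted for the change to take effect.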

Starting Impala

To activate Impala if it is installed but not yet started:

  1. Set any necessary configuration options for the Impala services. See Modifying Impala Startup Options for details.
  2. Start one instance of the Impala statestore. The statestore helps Impala to distribute work efficiently, and to continue running in the event of availability problems for other Impala nodes. If the statestore becomes unavailable, Impala continues to function.
  3. Start one instance of the Impala catalog service.
  4. Start the main Impala service on one or more DataNodes, ideally on all DataNodes to maximize local processing and avoid network traffic due to remote reads.

Once Impala is running, you can conduct interactive experiments using the instructions in Impala Tutorials and try Using the Impala Shell (impala-shell Command).

Starting Impala from the Command Line

To start the Impala statestore, catalog service, and Impala daemons from the command line or a script, you can either use the service command or start the daemons directly through the impalad, statestored, and catalogd executables.

Start the Impala statestore and then start impalad instances. You can modify the values the service initialization scripts use when starting the statestore and Impala by editing /etc/default/impala.

Start the statestore service using a command similar to the following:

$ sudo service impala-state-store start

Start the catalog service using a command similar to the following:

$ sudo service impala-catalog start

Start the Impala service on each DataNode using a command similar to the following:

$ sudo service impala-server start
If any of the services fail to start, review: