Apache Impala Overview
The Apache Impala provides high-performance, low-latency SQL queries on data stored in popular Apache Hadoop file formats.
The Impala solution is composed of the following components.
- Impala
- The Impala service coordinates and executes queries received from clients. Queries are distributed among Impala nodes, and these nodes then act as workers, executing parallel query fragments.
- Hive Metastore
- Stores information about the data available to Impala. For example, the metastore lets Impala know what databases are available and what the structure of those databases is. As you create, drop, and alter schema objects, load data into tables, and so on through Impala SQL statements, the relevant metadata changes are automatically broadcast to all Impala nodes by the dedicated catalog service.
- Clients
- Entities including Hue, ODBC clients, JDBC clients, Business Intelligence applications, and the Impala Shell can all interact with Impala. These interfaces are typically used to issue queries or complete administrative tasks such as connecting to Impala.
- Storage for data to be queried
Queries executed using Impala are handled as
follows:
- User applications send SQL queries to Impala through ODBC or JDBC,
which provide standardized querying interfaces. The user application
may connect to any
impalad
in the cluster. Thisimpalad
becomes the coordinator for the query. - Impala parses the query and analyzes it to determine what tasks
need to be performed by
impalad
instances across the cluster. Execution is planned for optimal efficiency. - Storage services are accessed by local
impalad
instances to provide data. - Each
impalad
returns data to the coordinatingimpalad
, which sends these results to the client.