Porting SQL from other database systems to Impala
Although Impala uses standard SQL for queries, you might need to modify SQL source when bringing applications to Impala, due to variations in data types, built-in functions, vendor language extensions, and Hadoop-specific syntax. Even when SQL is working correctly, you might make further minor modifications for best performance.
Porting DDL and DML statements
When adapting SQL code from a legacy database system to Impala, expect to find a number of differences in the DDL statements that you use to set up the schema. Clauses related to physical layout of files, tablespaces, and indexes have no equivalent in Impala. You might restructure your schema considerably to account for the Impala partitioning scheme and Hadoop file formats.
Expect SQL queries to have a much higher degree of compatibility. With modest rewriting to address vendor extensions and features not yet supported in Impala, you might be able to run identical or almost-identical query text on both systems.
Therefore, consider separating out the DDL into a separate Impala-specific setup script. Focus your reuse and ongoing tuning efforts on the code for SQL queries.
Porting data types from other database systems
- Change any VARCHAR, VARCHAR2, and CHAR columns to STRING. Remove any length constraints from the column declarations; for example, change VARCHAR(32) or CHAR(1) to STRING. Impala is very flexible about the length of string values; it does not impose any length constraints or do any special processing (such as blank-padding) for STRING columns. (In Impala 2.0 and higher, there are VARCHAR and CHAR data types, with length constraints for both types and blank-padding for CHAR. However, for performance reasons, it is still preferable to use STRING columns where practical.)
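  For example, a minimal sketch of how an Oracle-style declaration might be redone for Impala (the table and column names are illustrative):

  ```sql
  -- Hypothetical source DDL:
  --   CREATE TABLE customers (code CHAR(1), name VARCHAR2(32));
  -- Impala equivalent: plain STRING columns, no lengths, no blank-padding.
  CREATE TABLE customers (
    code STRING,  -- was CHAR(1)
    name STRING   -- was VARCHAR2(32)
  );
  ```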
- For national language character types such as NCHAR, NVARCHAR, or NCLOB, be aware that while Impala can store and query UTF-8 character data, currently some string manipulation operations only work correctly with ASCII data.
- Change any DATETIME or TIME columns to DATE or TIMESTAMP. Remove any precision constraints. Remove any timezone clauses, and make sure your application logic or ETL process accounts for the fact that Impala expects all TIMESTAMP values to be in Coordinated Universal Time (UTC). See TIMESTAMP data type for information about the TIMESTAMP data type, and Impala date and time functions for conversion functions for different date and time formats.

  You might also need to adapt date- and time-related literal values and format strings to use the supported Impala date and time formats. If you have date and time literals with different separators or different numbers of YY, MM, and so on placeholders than Impala expects, consider using calls to regexp_replace() to transform those values to the Impala-compatible format. See TIMESTAMP data type for the allowed formats for date and time literals, and Impala string functions for string conversion functions such as regexp_replace().

  Instead of SYSDATE, call the function NOW().

  Instead of adding or subtracting directly from a date value to produce a value N days in the past or future, use an INTERVAL expression, for example NOW() + INTERVAL 30 DAYS. A sketch of these conversions follows.
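  A minimal sketch of both conversions; the MM/DD/YYYY input format is an assumed example:

  ```sql
  -- SYSDATE + 30 becomes NOW() plus an INTERVAL expression.
  SELECT NOW() + INTERVAL 30 DAYS;

  -- Normalize an assumed MM/DD/YYYY literal into a form Impala can cast.
  SELECT CAST(regexp_replace('12/31/2024',
                             '(\\d{2})/(\\d{2})/(\\d{4})',
                             '\\3-\\1-\\2') AS TIMESTAMP);
  ```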
- Although Impala supports INTERVAL expressions for datetime arithmetic, as shown in TIMESTAMP data type, INTERVAL is not available as a column data type in Impala. For any INTERVAL values stored in tables, convert them to numeric values that you can add or subtract, using the functions in Impala date and time functions. For example, if you had a table DEADLINES with an INT column TIME_PERIOD, you could construct dates N days in the future like so:

  SELECT NOW() + INTERVAL time_period DAYS FROM deadlines;
- For YEAR columns, change to the smallest Impala integer type that has sufficient range. See Impala SQL data types for details about ranges, casting, and so on for the various numeric data types.
- Change any DECIMAL and NUMBER types. If fixed-point precision is not required, you can use FLOAT or DOUBLE on the Impala side depending on the range of values. For applications that require precise decimal values, such as financial data, you might need to make more extensive changes to table structure and application logic, such as using separate integer columns for dollars and cents, or encoding numbers as string values and writing UDFs to manipulate them.
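  A minimal sketch of the separate-integer-columns approach, assuming a hypothetical ledger table that splits currency values into dollars and cents:

  ```sql
  CREATE TABLE ledger (dollars BIGINT, cents TINYINT);

  -- Reassemble an exact display value without floating-point rounding.
  SELECT CONCAT(CAST(dollars AS STRING), '.',
                LPAD(CAST(cents AS STRING), 2, '0')) AS amount
  FROM ledger;
  ```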
- FLOAT, DOUBLE, and REAL types are supported in Impala. Remove any precision and scale specifications. (In Impala, REAL is just an alias for DOUBLE; columns declared as REAL are turned into DOUBLE behind the scenes.)
- Most integer types from other systems have equivalents in Impala, perhaps under different names such as BIGINT instead of INT8. For any that are unavailable, for example MEDIUMINT, switch to the smallest Impala integer type that has sufficient range. Remove any precision specifications.
- Remove any UNSIGNED constraints. All Impala numeric types are signed.
- For any types holding bitwise values, use an integer type with enough range to hold all the relevant bits within a positive integer.

  For example, TINYINT has a maximum positive value of 127, not 255, so to manipulate 8-bit bitfields as positive numbers switch to the next largest type, SMALLINT. For example, CAST(127*2 AS SMALLINT).

  Impala does not support notation such as b'0101' for bit literals.
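  A minimal sketch, assuming a SMALLINT column named flags holding an 8-bit bitfield in a table t1:

  ```sql
  -- Test bit 3 (value 8) with integer arithmetic, which keeps all
  -- values positive and needs no bit-literal syntax.
  SELECT (flags DIV 8) % 2 = 1 AS has_bit_3 FROM t1;
  ```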
- For BLOB values, use STRING to represent CLOB or TEXT types (character-based large objects) up to 32 KB in size. Binary large objects such as BLOB, RAW, BINARY, and VARBINARY do not currently have an equivalent in Impala.
- For Boolean-like types such as BOOL, use the Impala BOOLEAN type.
- Because Impala currently does not support composite or nested types, any spatial data types in other database systems do not have direct equivalents in Impala. You could represent spatial values in string format and write UDFs to process them. Where practical, separate spatial types into separate tables so that Impala can still work with the non-spatial data.
- Take out any DEFAULT clauses. Impala can use data files produced from many different sources, such as Pig, Hive, or MapReduce jobs. The fast import mechanisms of LOAD DATA and external tables mean that Impala is flexible about the format of data files, and Impala does not necessarily validate or cleanse data before querying it. When copying data through Impala INSERT statements, you can use conditional functions such as CASE or NVL to substitute some other value for NULL fields.
- Take out any constraints from your CREATE TABLE and ALTER TABLE statements, for example PRIMARY KEY, FOREIGN KEY, UNIQUE, NOT NULL, UNSIGNED, or CHECK constraints. Impala can use data files produced from many different sources, such as Pig, Hive, or MapReduce jobs. Therefore, Impala expects initial data validation to happen earlier, during the ETL or ELT cycle. After data is loaded into Impala tables, you can perform queries to test for NULL values. When copying data through Impala INSERT statements, you can use conditional functions such as CASE or NVL to substitute some other value for NULL fields.

  Do as much verification as practical before loading data into Impala. After data is loaded into Impala, you can do further verification using SQL queries to check whether values have expected ranges, whether values are NULL or not, and so on, as in the sketch below. If there is a problem with the data, you will need to re-run earlier stages of the ETL process, or do an INSERT ... SELECT statement in Impala to copy the faulty data to a new table and transform or filter out the bad values.
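  A minimal sketch of such a post-load check, assuming a staging table stage_t1 with a numeric column c1:

  ```sql
  -- Count rows that would have violated NOT NULL or CHECK constraints
  -- in the source system.
  SELECT COUNT(*) AS bad_rows
  FROM stage_t1
  WHERE c1 IS NULL OR c1 NOT BETWEEN 0 AND 100;
  ```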
- Take out any CREATE INDEX, DROP INDEX, and ALTER INDEX statements, and equivalent ALTER TABLE statements. Remove any INDEX, KEY, or PRIMARY KEY clauses from CREATE TABLE and ALTER TABLE statements. Impala is optimized for bulk read operations for data warehouse-style queries, and therefore does not support indexes for its tables.
- Calls to built-in functions with out-of-range or otherwise incorrect arguments return NULL in Impala, as opposed to raising exceptions. (This rule applies even when the ABORT_ON_ERROR=true query option is in effect.) Run small-scale queries using representative data to double-check that calls to built-in functions are returning expected values rather than NULL. For example, unsupported CAST operations do not raise an error in Impala:
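  ```sql
  -- Returns NULL rather than raising a conversion error.
  SELECT CAST('foo' AS INT);
  ```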
- For any other type not supported in Impala, you could represent its values in string format and write UDFs to process them.
- To detect the presence of unsupported or unconvertible data types in data files, do initial testing with the ABORT_ON_ERROR=true query option in effect. This option causes queries to fail immediately if they encounter disallowed type conversions.
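  For example, in impala-shell you might run your smoke-test queries like so (stage_t1 is an illustrative table name):

  ```sql
  SET ABORT_ON_ERROR=true;  -- fail fast instead of returning partial results
  SELECT COUNT(*) FROM stage_t1;
  ```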
SQL statements to remove or adapt
The following SQL statements or clauses are not currently supported in Impala, or are supported only with limitations:
- Impala supports the DELETE statement only for Kudu tables.

  Impala is intended for data warehouse-style operations where you do bulk moves and transforms of large quantities of data. When not using Kudu tables, instead of DELETE, use INSERT OVERWRITE to entirely replace the contents of a table or partition, or use INSERT ... SELECT to copy a subset of data (everything but the rows you intended to delete) from one table to another, as in the sketch below.
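  A minimal sketch of the copy-what-you-keep approach; the table names and status filter are illustrative:

  ```sql
  -- Keep everything except the rows you would have deleted.
  CREATE TABLE t1_clean LIKE t1;
  INSERT OVERWRITE t1_clean
  SELECT * FROM t1 WHERE status != 'obsolete';
  ```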
- Impala supports the UPDATE statement only for Kudu tables.

  When not using Kudu tables, instead of UPDATE, do all necessary transformations early in the ETL process, such as in the job that generates the original data, or when copying from one table to another to convert to a particular file format or partitioning scheme.
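  A minimal sketch: apply what would have been an UPDATE while copying between tables (the table and column names are illustrative):

  ```sql
  INSERT OVERWRITE t1_normalized
  SELECT id, TRIM(LOWER(email)) FROM t1_raw;
  ```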
- Impala has no transactional statements, such as COMMIT or ROLLBACK.

  Impala effectively works like the AUTOCOMMIT mode in some database systems, where changes take effect as soon as they are made.
- If your database, table, column, or other names conflict with Impala reserved words, use different names or quote the names with backticks, as in the example below.

  Conversely, if you use a keyword that Impala does not recognize, it might be interpreted as a table or column alias. For example, in SELECT * FROM t1 NATURAL JOIN t2, Impala does not recognize the NATURAL keyword and interprets it as an alias for the table t1. If you experience any unexpected behavior with queries, check the list of reserved words to make sure all keywords in join and WHERE clauses are supported keywords in Impala.
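  For example, a name that collides with a reserved word can be quoted like this:

  ```sql
  CREATE TABLE `select` (id INT);  -- SELECT is a reserved word
  SELECT id FROM `select`;
  ```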
- Impala has some restrictions on subquery support.
- Impala supports the UNION and UNION ALL set operators, but not INTERSECT.

  Prefer UNION ALL over UNION when you know the data sets are disjoint or duplicate values are not a problem; UNION ALL is more efficient because it avoids materializing and sorting the entire result set to eliminate duplicate values.

- Impala requires query aliases for the subqueries used as inline views in the FROM clause.

  For example, without the alias contents_of_t1 at the end, the following query gives a syntax error:

  SELECT COUNT(*) FROM (SELECT * FROM t1) contents_of_t1;

  Aliases are not required for the subqueries used in other parts of queries. For example:

  SELECT * FROM functional.alltypes WHERE id = (SELECT MIN(id) FROM functional.alltypes);
- When an alias is declared for an expression in a query, that alias cannot be referenced again within the same SELECT list.

  For example, the average alias cannot be referenced a second time in the SELECT list as below; you will receive an error:

  SELECT AVG(x) AS average, average+1 FROM t1 GROUP BY x;

  An alias can be referenced again in the same query if not in the SELECT list. For example, the average alias can be used elsewhere in the query as shown below:

  SELECT AVG(x) AS average FROM t1 GROUP BY x HAVING average > 3;
- Impala does not support NATURAL JOIN, and it does not support the USING clause in joins.
- Impala supports a limited choice of partitioning types.

  Partitions are defined based on each distinct combination of values for one or more partition key columns. Impala does not redistribute or check data to create evenly distributed partitions. You must choose partition key columns based on your knowledge of the data volume and distribution. Adapt any tables that use range, list, hash, or key partitioning to use the Impala partition syntax for CREATE TABLE and ALTER TABLE statements, as in the sketch below.

  Impala partitioning is similar to range partitioning where every range has exactly one value, or key partitioning where the hash function produces a separate bucket for every combination of key values.
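  A minimal sketch of the Impala partition syntax, with illustrative table and column names:

  ```sql
  CREATE TABLE events (id BIGINT, payload STRING)
  PARTITIONED BY (year INT, month INT);

  ALTER TABLE events ADD PARTITION (year=2024, month=1);
  ```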
- For top-N queries, Impala uses the LIMIT clause rather than comparing against a pseudocolumn named ROWNUM or ROW_NUM, as in the sketch below.
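  For example, an Oracle-style ROWNUM query might become (table and column names are illustrative):

  ```sql
  -- Source system: SELECT * FROM t1 WHERE ROWNUM <= 10;
  SELECT * FROM t1 ORDER BY c1 DESC LIMIT 10;
  ```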
SQL constructs to double-check
Some SQL constructs that are supported have behavior or defaults more oriented towards convenience than optimal performance. Also, sometimes machine-generated SQL, perhaps issued through JDBC or ODBC applications, might have inefficiencies or exceed internal Impala limits. As you port SQL code, examine and possibly update the following where appropriate:
- A CREATE TABLE statement with no STORED AS clause creates data files in plain text format, which is convenient for data interchange but not a good choice for high-volume data with high-performance queries. See the sketch below for specifying a binary format explicitly.
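  A minimal sketch, with illustrative names:

  ```sql
  CREATE TABLE sales (id BIGINT, amount DOUBLE)
  STORED AS PARQUET;  -- explicit binary columnar format
  ```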
- Adapting tables that were already partitioned in a different database system could produce an Impala table with a high number of partitions and not enough data in each one, leading to underutilization of Impala's parallel query features.
- The INSERT ... VALUES syntax is suitable for setting up toy tables with a few rows for functional testing when used with HDFS. Each such statement creates a separate tiny file in HDFS, and it is not a scalable technique for loading megabytes or gigabytes (let alone petabytes) of data.

  Consider revising your data load process to produce raw data files outside of Impala, then setting up Impala external tables or using the LOAD DATA statement to use those data files instantly in Impala tables, with no conversion or indexing stage. INSERT works fine for Kudu tables, though it is not particularly fast.
- If your ETL process is not optimized for Hadoop, you might end up with highly fragmented small data files, or a single giant data file that cannot take advantage of distributed parallel queries or partitioning. In this case, use an INSERT ... SELECT statement to copy the data into a new table and reorganize it into a more efficient layout in the same operation, as in the sketch below.

  You can do INSERT ... SELECT into a table with a more efficient file format or from an unpartitioned table into a partitioned one.
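  A minimal sketch of compacting fragmented text data into a partitioned Parquet table (table and column names are illustrative):

  ```sql
  CREATE TABLE sales_by_year (id BIGINT, amount DOUBLE)
  PARTITIONED BY (year INT) STORED AS PARQUET;

  -- The partition key column goes last in the SELECT list.
  INSERT INTO sales_by_year PARTITION (year)
  SELECT id, amount, year FROM sales_raw;
  ```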
- Complex queries may have high codegen time. As a workaround, set the DISABLE_CODEGEN=true query option if queries fail for this reason.
- If practical, rewrite UNION queries to use the UNION ALL operator instead. Prefer UNION ALL over UNION when you know the data sets are disjoint or duplicate values are not a problem; UNION ALL is more efficient because it avoids materializing and sorting the entire result set to eliminate duplicate values.
Next porting steps after verifying syntax and semantics
Some of the decisions you make during the porting process can have an impact on performance. After your SQL code is ported and working correctly, examine the performance-related aspects of your schema design, physical layout, and queries to make sure that the ported application is taking full advantage of Impala's parallelism, performance-related SQL features, and integration with Hadoop components. The following are a few of the areas you should examine:
- For optimal performance, we recommend that you run COMPUTE STATS on all tables (see the example below).
- Use the most efficient file format for your data volumes, table structure, and query characteristics.
- Partition on columns that are often used for filtering in WHERE clauses.
- Your ETL process should produce a relatively small number of multi-megabyte data files rather than a huge number of small files.
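For example (the table name is illustrative):

```sql
COMPUTE STATS sales_by_year;  -- gather table and column statistics for the planner
```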
