ALTER TABLE Statement
The ALTER TABLE statement changes the structure or properties of an existing Impala table.
In Impala, this is primarily a logical operation that updates the table metadata in the metastore database that Impala shares with Hive. Most ALTER TABLE operations do not actually rewrite, move, or otherwise modify the data files. (The RENAME TO clause is the one exception; it can cause HDFS files to be moved to different paths.) After an ALTER TABLE operation, you typically need to perform the corresponding physical filesystem operations yourself, such as rewriting the data files to include extra fields, or converting them to a different file format.
Syntax:
ALTER TABLE [old_db_name.]old_table_name RENAME TO [new_db_name.]new_table_name
ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
ALTER TABLE name DROP [COLUMN] column_name
ALTER TABLE name CHANGE column_name new_name new_type
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])
ALTER TABLE name { ADD [IF NOT EXISTS] | DROP [IF EXISTS] } PARTITION (partition_spec)
ALTER TABLE name [PARTITION (partition_spec)]
SET { FILEFORMAT file_format
| LOCATION 'hdfs_path_of_directory'
| TBLPROPERTIES (table_properties)
| SERDEPROPERTIES (serde_properties) }
ALTER TABLE name [PARTITION (partition_spec)] SET { CACHED IN 'pool_name' | UNCACHED }
new_name ::= [new_database.]new_table_name
col_spec ::= col_name type_name
partition_spec ::= partition_col=constant_value
table_properties ::= 'name'='value'[, 'name'='value' ...]
serde_properties ::= 'name'='value'[, 'name'='value' ...]
file_format ::= { PARQUET | TEXTFILE | RCFILE | SEQUENCEFILE | AVRO }
Statement type: DDL
Usage notes:
Whenever you specify partitions in an ALTER TABLE statement, through the PARTITION (partition_spec) clause, you must include all the partitioning columns in the specification.
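For example, on a table partitioned by two columns, both partitioning columns must appear in the partition_spec. A minimal sketch, using a hypothetical logs table:

```sql
-- Hypothetical table partitioned by two columns.
create table logs (msg string) partitioned by (year int, month int);
-- The partition_spec must include constant values for both
-- partitioning columns, year and month.
alter table logs add partition (year=2016, month=1);
-- A spec naming only one of the two columns, such as
-- PARTITION (year=2016), would be rejected.
```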
Most of the ALTER TABLE operations work the same for internal tables (managed by Impala) as for external tables (with data files located in arbitrary locations). The exception is renaming a table; for an external table, the underlying data directory is not renamed or moved.
HDFS caching (CACHED IN clause):
If you specify the CACHED IN clause, any existing or future data files in the table directory or the partition subdirectories are designated to be loaded into memory with the HDFS caching mechanism. See Using HDFS Caching with Impala (CDH 5.1 or higher only) for details about using the HDFS caching feature.
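As a sketch of the CACHED IN and UNCACHED clauses, assuming a hypothetical table census and an HDFS cache pool named 'four_gig_pool' that was already created with the hdfs cacheadmin command:

```sql
-- Designate all data files of the table for HDFS caching.
alter table census set cached in 'four_gig_pool';
-- Or cache just one partition's subdirectory.
alter table census partition (year=2016) set cached in 'four_gig_pool';
-- Later, remove the caching designation for that partition.
alter table census partition (year=2016) set uncached;
```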
If you connect to different Impala nodes within an impala-shell session for load-balancing purposes, you can enable the SYNC_DDL query option to make each DDL statement wait before returning, until the new or changed metadata has been received by all the Impala nodes. See SYNC_DDL Query Option for details.
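For example, within an impala-shell session (the table name t1 is hypothetical):

```sql
-- Make DDL statements in this session wait until the new metadata
-- has propagated to all Impala nodes before returning.
set SYNC_DDL=1;
alter table t1 add columns (c1 string);
-- Turn the option off again if subsequent DDL does not need to wait.
set SYNC_DDL=0;
```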
The following sections show examples of the use cases for various ALTER TABLE clauses.
To rename a table (RENAME TO clause):
The RENAME TO clause lets you change the name of an existing table, and optionally which database it is located in.
For internal tables, this operation physically renames the directory within HDFS that contains the data files; the original directory name no longer exists. By qualifying the table names with database names, you can use this technique to move an internal table (and its associated data directory) from one database to another. For example:
create database d1;
create database d2;
create database d3;
use d1;
create table mobile (x int);
use d2;
-- Move table from another database to the current one.
alter table d1.mobile rename to mobile;
use d1;
-- Move table from one database to another.
alter table d2.mobile rename to d3.mobile;
For external tables, renaming changes only the table name within the metastore database; the underlying data directory is not renamed or moved.
To change the physical location where Impala looks for data files associated with a table or partition:
ALTER TABLE table_name [PARTITION (partition_spec)] SET LOCATION 'hdfs_path_of_directory';
The path you specify is the full HDFS path where the data files reside, or will be created. Impala does not create any additional subdirectory named after the table. Impala does not move any data files to this new location or change any data files that might already exist in that directory.
To set the location for a single partition, include the PARTITION clause. Specify all the same partitioning columns for the table, with a constant value for each, to precisely identify the single partition affected by the statement:
create table p1 (s string) partitioned by (month int, day int);
-- Each ADD PARTITION clause creates a subdirectory in HDFS.
alter table p1 add partition (month=1, day=1);
alter table p1 add partition (month=1, day=2);
alter table p1 add partition (month=2, day=1);
alter table p1 add partition (month=2, day=2);
-- Redirect queries, INSERT, and LOAD DATA for one partition
-- to a specific different directory.
alter table p1 partition (month=1, day=1) set location '/usr/external_data/new_years_day';
To change the key-value pairs of the TBLPROPERTIES and SERDEPROPERTIES fields:
ALTER TABLE table_name SET TBLPROPERTIES ('key1'='value1', 'key2'='value2'[, ...]);
ALTER TABLE table_name SET SERDEPROPERTIES ('key1'='value1', 'key2'='value2'[, ...]);
The TBLPROPERTIES clause is primarily a way to associate arbitrary user-specified data items with a particular table.
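For example, you might record provenance details as table properties (the table and key names here are hypothetical):

```sql
-- Attach arbitrary bookkeeping information to a table.
alter table t1 set tblproperties ('source_system'='billing', 'loaded_by'='nightly_etl');
-- The properties appear in the Table Parameters section of the
-- DESCRIBE FORMATTED output.
describe formatted t1;
```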
The SERDEPROPERTIES clause sets up metadata defining how tables are read or written, needed in some cases by Hive but not used extensively by Impala. You would use this clause primarily to change the delimiter in an existing text table or partition, by setting the 'serialization.format' and 'field.delim' property values to the new delimiter character:
-- This table begins life as pipe-separated text format.
create table change_to_csv (s1 string, s2 string)
  row format delimited fields terminated by '|';
-- Then we change it to a CSV table.
alter table change_to_csv set SERDEPROPERTIES ('serialization.format'=',', 'field.delim'=',');
insert overwrite change_to_csv values ('stop','go'), ('yes','no');
!hdfs dfs -cat 'hdfs://hostname:8020/data_directory/dbname.db/change_to_csv/data_file';
stop,go
yes,no
Use the DESCRIBE FORMATTED statement to see the current values of these properties for an existing table. See CREATE TABLE Statement for more details about these clauses. See Setting Statistics Manually through ALTER TABLE for an example of using table properties to fine-tune the performance-related table statistics.
To reorganize columns for a table:
ALTER TABLE table_name ADD COLUMNS (column_defs);
ALTER TABLE table_name REPLACE COLUMNS (column_defs);
ALTER TABLE table_name CHANGE column_name new_name new_type;
ALTER TABLE table_name DROP column_name;
Each col_spec is the same as in the CREATE TABLE statement: the column name, then its data type, then an optional comment. You can add multiple columns at a time. The parentheses are required whether you add a single column or multiple columns. When you replace columns, all the original column definitions are discarded. You might use this technique if you receive a new set of data files with different data types or columns in a different order. (The data files are retained, so if the new columns are incompatible with the old ones, use INSERT OVERWRITE or LOAD DATA with the OVERWRITE clause to replace all the data before issuing any further queries.)
For example, here is how you might add columns to an existing table. The first ALTER TABLE adds two new columns, and the second ALTER TABLE adds one new column. A single Impala query reads both the old and new data files, containing different numbers of columns. For any columns not present in a particular data file, all the column values are considered to be NULL.
create table t1 (x int);
insert into t1 values (1), (2);
alter table t1 add columns (s string, t timestamp);
insert into t1 values (3, 'three', now());
alter table t1 add columns (b boolean);
insert into t1 values (4, 'four', now(), true);
select * from t1 order by x;
+---+-------+-------------------------------+------+
| x | s     | t                             | b    |
+---+-------+-------------------------------+------+
| 1 | NULL  | NULL                          | NULL |
| 2 | NULL  | NULL                          | NULL |
| 3 | three | 2016-05-11 11:19:45.054457000 | NULL |
| 4 | four  | 2016-05-11 11:20:20.260733000 | true |
+---+-------+-------------------------------+------+
You might use the CHANGE clause to rename a single column, or to treat an existing column as a different type than before, such as to switch between treating a column as STRING and TIMESTAMP, or between INT and BIGINT. You can only drop a single column at a time; to drop multiple columns, issue multiple ALTER TABLE statements, or define the new set of columns with a single ALTER TABLE ... REPLACE COLUMNS statement.
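As a brief sketch of the two uses of CHANGE (the table and column names here are hypothetical):

```sql
-- Rename a column while keeping its existing type.
alter table t1 change x x_renamed int;
-- Keep the column name but treat the values as a different type.
alter table t1 change y y bigint;
```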
The following examples show some safe operations to drop or change columns. Dropping the final column in a table lets Impala ignore the corresponding data without any disruption to existing data files. Changing the type of a column works if existing data values can be safely converted to the new type. The type conversion rules depend on the file format of the underlying table. For example, in a text table, the same value can be interpreted as a STRING or a numeric value, while in a binary format such as Parquet, the rules are stricter and type conversions only work between certain sizes of integers.
create table optional_columns (x int, y int, z int, a1 int, a2 int);
insert into optional_columns values (1,2,3,0,0), (2,3,4,100,100);
-- When the last column in the table is dropped, Impala ignores the
-- values that are no longer needed. (Dropping A1 but leaving A2
-- would cause problems, as we will see in a subsequent example.)
alter table optional_columns drop column a2;
alter table optional_columns drop column a1;
select * from optional_columns;
+---+---+---+
| x | y | z |
+---+---+---+
| 1 | 2 | 3 |
| 2 | 3 | 4 |
+---+---+---+
create table int_to_string (s string, x int);
insert into int_to_string values ('one', 1), ('two', 2);
-- What was an INT column will now be interpreted as STRING.
-- This technique works for text tables but not other file formats.
-- The second X represents the new name of the column, which we keep the same.
alter table int_to_string change x x string;
-- Once the type is changed, we can insert non-integer values into the X column
-- and treat that column as a string, for example by uppercasing or concatenating.
insert into int_to_string values ('three', 'trois');
select s, upper(x) from int_to_string;
+-------+----------+
| s     | upper(x) |
+-------+----------+
| one   | 1        |
| two   | 2        |
| three | TROIS    |
+-------+----------+
Remember that Impala does not actually do any conversion for the underlying data files as a result of ALTER TABLE statements. If you use ALTER TABLE to create a table layout that does not agree with the contents of the underlying files, you must replace the files yourself, such as using LOAD DATA to load a new set of data files, or INSERT OVERWRITE to copy from another table and replace the original data.
The following example shows what happens if you delete the middle column from a Parquet table containing three columns. The underlying data files still contain three columns of data. Because the columns are interpreted based on their positions in the data file instead of the specific column names, a SELECT * query now reads the first and second columns from the data file, potentially leading to unexpected results or conversion errors. For this reason, if you expect to someday drop a column, declare it as the last column in the table, where its data can be ignored by queries after the column is dropped. Or, re-run your ETL process and create new data files if you drop or change the type of a column in a way that causes problems with existing data files.
-- Parquet table showing how dropping a column can produce unexpected results.
create table p1 (s1 string, s2 string, s3 string) stored as parquet;
insert into p1 values ('one', 'un', 'uno'), ('two', 'deux', 'dos'), ('three', 'trois', 'tres');
select * from p1;
+-------+-------+------+
| s1    | s2    | s3   |
+-------+-------+------+
| one   | un    | uno  |
| two   | deux  | dos  |
| three | trois | tres |
+-------+-------+------+
alter table p1 drop column s2;
-- The S3 column contains unexpected results.
-- Because S2 and S3 have compatible types, the query reads
-- values from the dropped S2, because the existing data files
-- still contain those values as the second column.
select * from p1;
+-------+-------+
| s1    | s3    |
+-------+-------+
| one   | un    |
| two   | deux  |
| three | trois |
+-------+-------+
-- Parquet table showing how dropping a column can produce conversion errors.
create table p2 (s1 string, x int, s3 string) stored as parquet;
insert into p2 values ('one', 1, 'uno'), ('two', 2, 'dos'), ('three', 3, 'tres');
select * from p2;
+-------+---+------+
| s1    | x | s3   |
+-------+---+------+
| one   | 1 | uno  |
| two   | 2 | dos  |
| three | 3 | tres |
+-------+---+------+
alter table p2 drop column x;
select * from p2;
WARNINGS:
File 'hdfs_filename' has an incompatible Parquet schema for column 'add_columns.p2.s3'.
Column type: STRING, Parquet schema: optional int32 x [i:1 d:1 r:0]
File 'hdfs_filename' has an incompatible Parquet schema for column 'add_columns.p2.s3'.
Column type: STRING, Parquet schema: optional int32 x [i:1 d:1 r:0]
To change the file format that Impala expects data to be in, for a table or partition:
Use an ALTER TABLE ... SET FILEFORMAT clause. You can include an optional PARTITION (col1=val1, col2=val2, ...) clause so that the file format is changed for a specific partition rather than the entire table.
Because this operation only changes the table metadata, you must do any conversion of existing data using regular Hadoop techniques outside of Impala. Any new data created by the Impala INSERT statement will be in the new format. You cannot specify the delimiter for Text files; the data files must be comma-delimited.
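For example, to change the expected format for an entire table (the table name t2 is hypothetical):

```sql
-- Existing data files are not converted; subsequent INSERT
-- statements write new files in Parquet format.
alter table t2 set fileformat parquet;
```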
To set the file format for a single partition, include the PARTITION clause. Specify all the same partitioning columns for the table, with a constant value for each, to precisely identify the single partition affected by the statement:
create table p1 (s string) partitioned by (month int, day int);
-- Each ADD PARTITION clause creates a subdirectory in HDFS.
alter table p1 add partition (month=1, day=1);
alter table p1 add partition (month=1, day=2);
alter table p1 add partition (month=2, day=1);
alter table p1 add partition (month=2, day=2);
-- Queries and INSERT statements will read and write files
-- in this format for this specific partition.
alter table p1 partition (month=2, day=2) set fileformat parquet;
To add or drop partitions for a table, the table must already be partitioned (that is, created with a PARTITIONED BY clause). The partition is a physical directory in HDFS, with a name that encodes a particular column value (the partition key). The Impala INSERT statement already creates the partition if necessary, so the ALTER TABLE ... ADD PARTITION is primarily useful for importing data by moving or copying existing data files into the HDFS directory corresponding to a partition. (You can use the LOAD DATA statement to move files into the partition directory, or ALTER TABLE ... PARTITION (...) SET LOCATION to point a partition at a directory that already contains data files.)
The DROP PARTITION clause is used to remove the HDFS directory and associated data files for a particular set of partition key values; for example, if you always analyze the last 3 months' worth of data, at the beginning of each month you might drop the oldest partition that is no longer needed. Removing partitions reduces the amount of metadata associated with the table and the complexity of calculating the optimal query plan, which can simplify and speed up queries on partitioned tables, particularly join queries. Here is an example showing the ADD PARTITION and DROP PARTITION clauses.
To avoid errors while adding or dropping partitions whose existence is not certain, add the optional IF [NOT] EXISTS clause between the ADD or DROP keyword and the PARTITION keyword. That is, the entire clause becomes ADD IF NOT EXISTS PARTITION or DROP IF EXISTS PARTITION. The following example shows how partitions can be created automatically through INSERT statements, or manually through ALTER TABLE statements. The IF [NOT] EXISTS clauses let the ALTER TABLE statements succeed even if a new requested partition already exists, or a partition to be dropped does not exist.
Inserting 2 year values creates 2 partitions:
create table partition_t (s string) partitioned by (y int);
insert into partition_t (s,y) values ('two thousand',2000), ('nineteen ninety',1990);
show partitions partition_t;
+-------+-------+--------+------+--------------+-------------------+--------+-------------------+
| y     | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats |
+-------+-------+--------+------+--------------+-------------------+--------+-------------------+
| 1990  | -1    | 1      | 16B  | NOT CACHED   | NOT CACHED        | TEXT   | false             |
| 2000  | -1    | 1      | 13B  | NOT CACHED   | NOT CACHED        | TEXT   | false             |
| Total | -1    | 2      | 29B  | 0B           |                   |        |                   |
+-------+-------+--------+------+--------------+-------------------+--------+-------------------+
Without the IF NOT EXISTS clause, an attempt to add a new partition might fail:
alter table partition_t add partition (y=2000);
ERROR: AnalysisException: Partition spec already exists: (y=2000).
The IF NOT EXISTS clause makes the statement succeed whether or not there was already a partition with the specified key value:
alter table partition_t add if not exists partition (y=2000);
alter table partition_t add if not exists partition (y=2010);
show partitions partition_t;
+-------+-------+--------+------+--------------+-------------------+--------+-------------------+
| y     | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats |
+-------+-------+--------+------+--------------+-------------------+--------+-------------------+
| 1990  | -1    | 1      | 16B  | NOT CACHED   | NOT CACHED        | TEXT   | false             |
| 2000  | -1    | 1      | 13B  | NOT CACHED   | NOT CACHED        | TEXT   | false             |
| 2010  | -1    | 0      | 0B   | NOT CACHED   | NOT CACHED        | TEXT   | false             |
| Total | -1    | 2      | 29B  | 0B           |                   |        |                   |
+-------+-------+--------+------+--------------+-------------------+--------+-------------------+
Likewise, the IF EXISTS clause lets DROP PARTITION succeed whether or not the partition is already in the table:
alter table partition_t drop if exists partition (y=2000);
alter table partition_t drop if exists partition (y=1950);
show partitions partition_t;
+-------+-------+--------+------+--------------+-------------------+--------+-------------------+
| y     | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats |
+-------+-------+--------+------+--------------+-------------------+--------+-------------------+
| 1990  | -1    | 1      | 16B  | NOT CACHED   | NOT CACHED        | TEXT   | false             |
| 2010  | -1    | 0      | 0B   | NOT CACHED   | NOT CACHED        | TEXT   | false             |
| Total | -1    | 1      | 16B  | 0B           |                   |        |                   |
+-------+-------+--------+------+--------------+-------------------+--------+-------------------+
-- Create an empty table and define the partitioning scheme.
create table part_t (x int) partitioned by (month int);
-- Create an empty partition into which you could copy data files from some other source.
alter table part_t add partition (month=1);
-- After changing the underlying data, issue a REFRESH statement to make the data visible in Impala.
refresh part_t;
-- Later, do the same for the next month.
alter table part_t add partition (month=2);
-- Now you no longer need the older data.
alter table part_t drop partition (month=1);
-- If the table was partitioned by month and year, you would issue a statement like:
--   alter table part_t drop partition (year=2003, month=1);
-- which would require 12 ALTER TABLE statements to remove a year's worth of data.
-- If the data files for subsequent months were in a different file format,
-- you could set a different file format for the new partition after creating it.
alter table part_t add partition (month=3);
alter table part_t partition (month=3) set fileformat parquet;
The value specified for a partition key can be an arbitrary constant expression, without any references to columns. For example:
alter table time_data add partition (month=concat('Decem','ber'));
alter table sales_data add partition (zipcode = cast(9021 * 10 as string));
To switch a table between internal and external:
-- Switch a table from internal to external.
ALTER TABLE table_name SET TBLPROPERTIES('EXTERNAL'='TRUE');
-- Switch a table from external to internal.
ALTER TABLE table_name SET TBLPROPERTIES('EXTERNAL'='FALSE');
Cancellation: Cannot be cancelled.
HDFS permissions:
Most ALTER TABLE clauses do not actually read or write any HDFS files, and so do not depend on specific HDFS permissions. For example, the SET FILEFORMAT clause does not actually check the file format of existing data files or convert them to the new format, and the SET LOCATION clause does not require any special permissions on the new location. (Any permission-related failures would come later, when you actually query or insert into the table.)
In general, ALTER TABLE clauses that do touch HDFS files and directories require the same HDFS permissions as corresponding CREATE, INSERT, or SELECT statements. The permissions allow the user ID that the impalad daemon runs under, typically the impala user, to read or write files or directories, or (in the case of the execute bit) descend into a directory. The RENAME TO clause requires read, write, and execute permission in the source and destination database directories and in the table data directory, and read and write permission for the data files within the table. The ADD PARTITION and DROP PARTITION clauses require write and execute permissions for the associated partition directory.
Related information:
Overview of Impala Tables, CREATE TABLE Statement, DROP TABLE Statement, Partitioning for Impala Tables, Internal Tables, External Tables