ALTER TABLE Statement
The ALTER TABLE statement changes the structure or properties of an existing table. In Impala, this is a logical operation that updates the table metadata in the metastore database that Impala shares with Hive; ALTER TABLE does not actually rewrite, move, and so on the actual data files. Thus, you might need to perform corresponding physical filesystem operations, such as moving data files to a different HDFS directory, rewriting the data files to include extra fields, or converting them to a different file format.
Syntax:
ALTER TABLE name RENAME TO new_name ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...]) ALTER TABLE name DROP [COLUMN] column_name ALTER TABLE name CHANGE column_name new_name new_type ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...]) ALTER TABLE name { ADD | DROP } PARTITION (partition_spec) ALTER TABLE name [PARTITION (partition_spec)] SET { FILEFORMAT format | LOCATION 'hdfs_path_of_directory' | TBLPROPERTIES (table_properties) | SERDEPROPERTIES (serde_properties) } new_name ::= [new_database.]new_table_name col_spec ::= col_name type_name partition_spec ::= partition_col=constant_value table_properties ::= 'name'='value'[, 'name'='value' ...] serde_properties ::= 'name'='value'[, 'name'='value' ...]
Statement type: DDL
Usage notes:
Whenever you specify partitions in an ALTER TABLE statement, through the PARTITION (partition_spec) clause, you must include all the partitioning columns in the specification.
Most of the ALTER TABLE operations work the same for internal tables (managed by Impala) as for external tables (with data files located in arbitrary locations). The exception is renaming a table; for an external table, the underlying data directory is not renamed or moved.
If you connect to different Impala nodes within an impala-shell session for load-balancing purposes, you can enable the SYNC_DDL query option to make each DDL statement wait before returning, until the new or changed metadata has been received by all the Impala nodes. See SYNC_DDL for details.
CREATE TABLE Statement, DROP TABLE Statement, Partitioning Internal Tables, External Tables
The following sections show examples of the use cases for various ALTER TABLE clauses.
To rename a table:
ALTER TABLE old_name RENAME TO new_name;
For internal tables, his operation physically renames the directory within HDFS that contains the data files; the original directory name no longer exists. By qualifying the table names with database names, you can use this technique to move an internal table (and its associated data directory) from one database to another. For example:
create database d1; create database d2; create database d3; use d1; create table mobile (x int); use d2; -- Move table from another database to the current one. alter table d1.mobile rename to mobile; use d1; -- Move table from one database to another. alter table d2.mobile rename to d3.mobile;
To change the physical location where Impala looks for data files associated with a table or partition:
ALTER TABLE table_name [PARTITION (partition_spec)] SET LOCATION 'hdfs_path_of_directory';
The path you specify is the full HDFS path where the data files reside, or will be created. Impala does not create any additional subdirectory named after the table. Impala does not move any data files to this new location or change any data files that might already exist in that directory.
To set the location for a single partition, include the PARTITION clause. Specify all the same partitioning columns for the table, with a constant value for each, to precisely identify the single partition affected by the statement:
create table p1 (s string) partitioned by (month int, day int); -- Each ADD PARTITION clause creates a subdirectory in HDFS. alter table p1 add partition (month=1, day=1); alter table p1 add partition (month=1, day=2); alter table p1 add partition (month=2, day=1); alter table p1 add partition (month=2, day=2); -- Redirect queries, INSERT, and LOAD DATA for one partition -- to a specific different directory. alter table p1 partition (month=1, day=1) set location '/usr/external_data/new_years_day';
To change the key-value pairs of the TBLPROPERTIES and SERDEPROPERTIES fields:
ALTER TABLE table_name SET TBLPROPERTIES ('key1'='value1', 'key2'='value2'[, ...]); ALTER TABLE table_name SET SERDEPROPERTIES ('key1'='value1', 'key2'='value2'[, ...]);
The TBLPROPERTIES clause is primarily a way to associate arbitrary user-specified data items with a particular table.
The SERDEPROPERTIES clause sets up metadata defining how tables are read or written, needed in some cases by Hive but not used extensively by Impala. You would use this clause primarily to change the delimiter in an existing text table or partition, by setting the 'serialization.format' and 'field.delim' property values to the new delimiter character:
-- This table begins life as pipe-separated text format. create table change_to_csv (s1 string, s2 string) row format delimited fields terminated by '|'; -- Then we change it to a CSV table. alter table change_to_csv set SERDEPROPERTIES ('serialization.format'=',', 'field.delim'=','); insert overwrite change_to_csv values ('stop','go'), ('yes','no'); !hdfs dfs -cat 'hdfs://hostname:8020/data_directory/dbname.db/change_to_csv/data_file'; stop,go yes,no
Use the DESCRIBE FORMATTED statement to see the current values of these properties for an existing table. See CREATE TABLE Statement for more details about these clauses. See Setting Statistics Manually through ALTER TABLE for an example of using table properties to fine-tune the performance-related table statistics.
To reorganize columns for a table:
ALTER TABLE table_name ADD COLUMNS (column_defs); ALTER TABLE table_name REPLACE COLUMNS (column_defs); ALTER TABLE table_name CHANGE column_name new_name new_type; ALTER TABLE table_name DROP column_name;
The column_spec is the same as in the CREATE TABLE statement: the column name, then its data type, then an optional comment. You can add multiple columns at a time. The parentheses are required whether you add a single column or multiple columns. When you replace columns, all the original column definitions are discarded. You might use this technique if you receive a new set of data files with different data types or columns in a different order. (The data files are retained, so if the new columns are incompatible with the old ones, use INSERT OVERWRITE or LOAD DATA OVERWRITE to replace all the data before issuing any further queries.)
You might use the CHANGE clause to rename a single column, or to treat an existing column as a different type than before, such as to switch between treating a column as STRING and TIMESTAMP, or between INT and BIGINT. You can only drop a single column at a time; to drop multiple columns, issue multiple ALTER TABLE statements, or define the new set of columns with a single ALTER TABLE ... REPLACE COLUMNS statement.
To change the file format that Impala expects data to be in, for a table or partition:
ALTER TABLE table_name [PARTITION (partition_spec)] SET FILEFORMAT { PARQUET | PARQUETFILE | TEXTFILE | RCFILE | SEQUENCEFILE }
Because this operation only changes the table metadata, you must do any conversion of existing data using regular Hadoop techniques outside of Impala. Any new data created by the Impala INSERT statement will be in the new format. You cannot specify the delimiter for Text files; the data files must be comma-delimited.
To set the file format for a single partition, include the PARTITION clause. Specify all the same partitioning columns for the table, with a constant value for each, to precisely identify the single partition affected by the statement:
create table p1 (s string) partitioned by (month int, day int); -- Each ADD PARTITION clause creates a subdirectory in HDFS. alter table p1 add partition (month=1, day=1); alter table p1 add partition (month=1, day=2); alter table p1 add partition (month=2, day=1); alter table p1 add partition (month=2, day=2); -- Queries and INSERT statements will read and write files -- in this format for this specific partition. alter table p1 partition (month=2, day=2) set fileformat parquet;
To add or drop partitions for a table, the table must already be partitioned (that is, created with a PARTITIONED BY clause). The partition is a physical directory in HDFS, with a name that encodes a particular column value (the partition key). The Impala INSERT statement already creates the partition if necessary, so the ALTER TABLE ... ADD PARTITION is primarily useful for importing data by moving or copying existing data files into the HDFS directory corresponding to a partition. (You can use the LOAD DATA statement to move files into the partition directory, or ALTER TABLE ... PARTITION (...) SET LOCATION to point a partition at a directory that already contains data files.
The DROP PARTITION clause is used to remove the HDFS directory and associated data files for a particular set of partition key values; for example, if you always analyze the last 3 months worth of data, at the beginning of each month you might drop the oldest partition that is no longer needed. Removing partitions reduces the amount of metadata associated with the table and the complexity of calculating the optimal query plan, which can simplify and speed up queries on partitioned tables, particularly join queries. Here is an example showing the ADD PARTITION and DROP PARTITION clauses.
-- Create an empty table and define the partitioning scheme. create table part_t (x int) partitioned by (month int); -- Create an empty partition into which you could copy data files from some other source. alter table part_t add partition (month=1); -- After changing the underlying data, issue a REFRESH statement to make the data visible in Impala. refresh part_t; -- Later, do the same for the next month. alter table part_t add partition (month=2); -- Now you no longer need the older data. alter table part_t drop partition (month=1); -- If the table was partitioned by month and year, you would issue a statement like: -- alter table part_t drop partition (year=2003,month=1); -- which would require 12 ALTER TABLE statements to remove a year's worth of data. -- If the data files for subsequent months were in a different file format, -- you could set a different file format for the new partition as you create it. alter table part_t add partition (month=3) set fileformat=parquet;
The value specified for a partition key can be an arbitrary constant expression, without any references to columns. For example:
alter table time_data add partition (month=concat('Decem','ber')); alter table sales_data add partition (zipcode = cast(9021 * 10 as string));
An alternative way to reorganize a table and its associated data files is to use CREATE TABLE to create a variation of the original table, then use INSERT to copy the transformed or reordered data to the new table. The advantage of ALTER TABLE is that it avoids making a duplicate copy of the data files, allowing you to reorganize huge volumes of data in a space-efficient way using familiar Hadoop techniques.
<< DML Statements | ALTER VIEW Statement >> | |