Test driving Iceberg from Impala
You complete a task that creates Iceberg tables from Impala with mock data that you can test drive using your own queries. You learn how to work with partitioned tables.
- You must meet the prerequisites to query Iceberg tables mentioned earlier, including obtaining Ranger access permissions.
- In Impala, use a database.
-
Create an Impala table to hold mock data for this task.
create external table mock_rows stored as parquet as select x from ( with v as (values (1 as x), (1), (1), (1), (1)) select v.x from v, v v2, v v3, v v4, v v5, v v6 ) a;
-
Create another Impala table based on mock_rows.
create external table customer_demo stored as parquet as select FROM_TIMESTAMP(DAYS_SUB(now() , cast ( TRUNC(RAND(7)*365*1) as bigint)), 'yyyy-MM') as year_month, DAYS_SUB(now() , cast ( TRUNC(RAND(7)*365*1) as bigint)) as ts, CONCAT( cast ( TRUNC(RAND(1) * 250 + 2) as string), '.' , cast ( TRUNC(RAND(2) * 250 + 2) as string), '.', cast ( TRUNC(RAND(3) * 250 + 2) as string), '.', cast ( TRUNC(RAND(4) * 250 + 2) as string) ) as ip, CONCAT("USER_", cast ( TRUNC(RAND(4) * 1000) as string),'@somedomain.com') as email, CONCAT("USER_", cast ( TRUNC(RAND(5) * 1000) as string)) as username, CONCAT("USER_", cast ( TRUNC(RAND(6) * 100) as string)) as country, cast( RAND(8)*10000 as double) as metric_1, cast( RAND(9)*10000 as double) as metric_2, cast( RAND(10)*10000 as double) as metric_3, cast( RAND(11)*10000 as double) as metric_4, cast( RAND(12)*10000 as double) as metric_5 from mock_rows ;
-
Create another Impala table based on mock_rows.
create external table customer_demo2 stored as parquet as select FROM_TIMESTAMP(DAYS_SUB(now() , cast ( TRUNC(RAND(7)*365*1) as bigint)), 'yyyy-MM') as year_month, DAYS_SUB(now() , cast ( TRUNC(RAND(7)*365*1) as bigint)) as ts, CONCAT( cast ( TRUNC(RAND(1) * 250 + 2) as string), '.' , cast ( TRUNC(RAND(2) * 250 + 2) as string), '.', cast ( TRUNC(RAND(3) * 250 + 2) as string), '.', cast ( TRUNC(RAND(4) * 250 + 2) as string) ) as ip, CONCAT("USER_", cast ( TRUNC(RAND(4) * 1000) as string),'@somedomain.com') as email, CONCAT("USER_", cast ( TRUNC(RAND(5) * 1000) as string)) as username, CONCAT("USER_", cast ( TRUNC(RAND(6) * 100) as string)) as country, cast( RAND(8)*10000 as double) as metric_1, cast( RAND(9)*10000 as double) as metric_2, cast( RAND(10)*10000 as double) as metric_3, cast( RAND(11)*10000 as double) as metric_4, cast( RAND(12)*10000 as double) as metric_5 from mock_rows ;
-
Create an Iceberg table from the customer_demo table.
CREATE TABLE customer_demo_iceberg STORED BY ICEBERG AS SELECT * FROM customer_demo;
-
Insert into the customer_demo_iceberg table the results of selecting all data from the customer_demo2 table.
INSERT INTO customer_demo_iceberg select * from customer_demo2; INSERT INTO customer_demo_iceberg select * from customer_demo2; INSERT INTO customer_demo_iceberg select * from customer_demo2;
-
Create an Iceberg table partitioned by the year_month column and based on the customer_demo_iceberg table.
CREATE TABLE customer_demo_iceberg_part PARTITIONED BY(year_month) STORED BY ICEBERG AS SELECT ts, ip , email, username , country, metric_1 , metric_2 , metric_3 , metric_4 , metric_5, year_month FROM customer_demo_iceberg;
-
Split the partitioned data into manageable files.
ALTER TABLE customer_demo_iceberg_part SET PARTITION SPEC (year_month,BUCKET(15, country));
-
Insert the results of reading the customer_demo_iceberg table into the partitioned table.
INSERT INTO customer_demo_iceberg_part (year_month, ts, ip, email, username, country, metric_1, metric_2, metric_3, metric_4, metric_5) SELECT year_month, ts, ip, email, username, country, metric_1, metric_2, metric_3, metric_4, metric_5 FROM customer_demo_iceberg;
-
Run time travel queries on the Iceberg tables, using the history output to get the
snapshot id, and substitute the id in the second SELECT query.
SELECT * FROM customer_demo_iceberg FOR SYSTEM_TIME AS OF '2021-12-09 05:39:18.689000000' LIMIT 100; DESCRIBE HISTORY customer_demo_iceberg; SELECT * FROM customer_demo_iceberg FOR SYSTEM_VERSION AS OF <snapshot id> LIMIT 100;