Test driving Iceberg from Impala

You complete a task that creates Iceberg tables from Impala with mock data that you can test drive using your own queries. You learn how to work with partitioned tables.

  • You must meet the prerequisites to query Iceberg tables mentioned earlier, including obtaining Ranger access permissions.
  1. In Impala, use a database.
  2. Create an Impala table to hold mock data for this task.
    create external table mock_rows stored as parquet as 
    select x from (
    with v as (values (1 as x), (1), (1), (1), (1)) 
    select v.x from v, v v2, v v3, v v4, v v5, v v6
    ) a; 
  3. Create another Impala table based on mock_rows.
    create external table customer_demo stored as parquet as 
    select 
    FROM_TIMESTAMP(DAYS_SUB(now() , cast ( TRUNC(RAND(7)*365*1) as bigint)), 'yyyy-MM') as year_month,
    DAYS_SUB(now() , cast ( TRUNC(RAND(7)*365*1) as bigint)) as ts,
    CONCAT(
      cast ( TRUNC(RAND(1) * 250 + 2) as string), '.' , 
      cast ( TRUNC(RAND(2) * 250 + 2) as string), '.', 
      cast ( TRUNC(RAND(3) * 250 + 2) as string), '.',
      cast ( TRUNC(RAND(4) * 250 + 2) as string)
    ) as ip,
    CONCAT("USER_", cast ( TRUNC(RAND(4) * 1000) as string),'@somedomain.com') as email,
    CONCAT("USER_", cast ( TRUNC(RAND(5) * 1000) as string)) as username,
    CONCAT("USER_", cast ( TRUNC(RAND(6) * 100) as string)) as country,
    cast( RAND(8)*10000 as double) as metric_1,
    cast( RAND(9)*10000 as double) as metric_2,
    cast( RAND(10)*10000 as double) as metric_3,
    cast( RAND(11)*10000 as double) as metric_4,
    cast( RAND(12)*10000 as double) as metric_5
    from mock_rows 
    ;
  4. Create another Impala table based on mock_rows.
    create external table customer_demo2 stored as parquet as 
    select 
    FROM_TIMESTAMP(DAYS_SUB(now() , cast ( TRUNC(RAND(7)*365*1) as bigint)), 'yyyy-MM') as year_month,
    DAYS_SUB(now() , cast ( TRUNC(RAND(7)*365*1) as bigint)) as ts,
    CONCAT(
      cast ( TRUNC(RAND(1) * 250 + 2) as string), '.' , 
      cast ( TRUNC(RAND(2) * 250 + 2) as string), '.', 
      cast ( TRUNC(RAND(3) * 250 + 2) as string), '.',
      cast ( TRUNC(RAND(4) * 250 + 2) as string)
    ) as ip,
    CONCAT("USER_", cast ( TRUNC(RAND(4) * 1000) as string),'@somedomain.com') as email,
    CONCAT("USER_", cast ( TRUNC(RAND(5) * 1000) as string)) as username,
    CONCAT("USER_", cast ( TRUNC(RAND(6) * 100) as string)) as country,
    cast( RAND(8)*10000 as double) as metric_1,
    cast( RAND(9)*10000 as double) as metric_2,
    cast( RAND(10)*10000 as double) as metric_3,
    cast( RAND(11)*10000 as double) as metric_4,
    cast( RAND(12)*10000 as double) as metric_5
    from mock_rows 
    ;   
  5. Create an Iceberg table from the customer_demo table.
    CREATE TABLE customer_demo_iceberg STORED BY ICEBERG AS SELECT * FROM customer_demo;    
  6. Insert into the customer_demo_iceberg table the results of selecting all data from the customer_demo2 table.
    INSERT INTO customer_demo_iceberg select * from customer_demo2;
    INSERT INTO customer_demo_iceberg select * from customer_demo2;
    INSERT INTO customer_demo_iceberg select * from customer_demo2;  
  7. Create an Iceberg table partitioned by the year_month column and based on the customer_demo_iceberg table.
    CREATE TABLE customer_demo_iceberg_part PARTITIONED BY(year_month) STORED BY ICEBERG 
    AS SELECT ts, ip , email, username , country, metric_1 , metric_2 , metric_3 , metric_4 , metric_5, year_month 
    FROM customer_demo_iceberg;
  8. Split the partitioned data into manageable files.
    ALTER TABLE customer_demo_iceberg_part SET PARTITION SPEC (year_month,BUCKET(15, country));
  9. Insert the results of reading the customer_demo_iceberg table into the partitioned table.
    INSERT INTO customer_demo_iceberg_part (year_month, ts, ip, email, username, country, metric_1, metric_2, metric_3, metric_4, metric_5) 
    SELECT year_month, ts, ip, email, username, country, metric_1, metric_2, metric_3, metric_4, metric_5  
    FROM customer_demo_iceberg;
  10. Run time travel queries on the Iceberg tables, using the history output to get the snapshot id, and substitute the id in the second SELECT query.
    SELECT * FROM customer_demo_iceberg FOR SYSTEM_TIME AS OF '2021-12-09 05:39:18.689000000' LIMIT 100;
    DESCRIBE HISTORY customer_demo_iceberg;
    SELECT * FROM customer_demo_iceberg FOR SYSTEM_VERSION AS OF <snapshot id> LIMIT 100;