Using the Spark shell

Using Spark, you can create an Iceberg table and then perform schema evolution, define a partition specification, and perform partition evolution.

You must configure the Spark shell so that it includes the valid Spark runtime version.
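
For example, assuming the Iceberg Spark runtime is used, the shell can be launched with the runtime package and the Iceberg catalog extensions. The artifact coordinates and version numbers below are illustrative only and depend on your Spark and Iceberg versions:

  spark-shell \
    --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.1 \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
    --conf spark.sql.catalog.spark_catalog.type=hive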

Run the following command in your Spark shell to create a new Iceberg table:

  1. spark.sql("CREATE TABLE spark_catalog.default.sample_1 ( id bigint COMMENT 'unique id', data string) USING iceberg");
  2. Navigate accordingly in the Atlas UI to view the changes.
    The following images provide information about the Iceberg table creation process.

    To demonstrate schema evolution, run the following command in your Spark shell to create a new table, for example sample_2:

  3. spark.sql("CREATE TABLE spark_catalog.default.sample_2 ( id bigint COMMENT 'unique id', data string) USING iceberg");
  4. Navigate accordingly in the Atlas UI to view the changes.
    The following image provides information about the Iceberg schema evolution process.

    Run the following command in your Spark shell to add a column:

  5. spark.sql("ALTER TABLE spark_catalog.default.sample_2 ADD COLUMN (add_col_1 string)");
  6. Navigate accordingly in the Atlas UI to view the changes.
    The following images provide information about the Iceberg schema evolution process.

    Run the following command in your Spark shell to add a second column:

  7. spark.sql("ALTER TABLE spark_catalog.default.sample_2 ADD COLUMN (add_col_2 string)");
  8. Navigate accordingly in the Atlas UI to view the changes.
    The following image provides information about the Iceberg schema evolution process.
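
    As an optional check (not part of the original steps), you can confirm the added columns from the Spark shell by describing the table:

    spark.sql("DESCRIBE TABLE spark_catalog.default.sample_2").show();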

    Run the following command in your Spark shell to create a new table (sample_3) with a partition specification:

  9. spark.sql("CREATE TABLE spark_catalog.default.sample_3 (id bigint, data string, category string, ts timestamp) USING iceberg PARTITIONED BY (bucket(16, id), days(ts), category)");
  10. Navigate accordingly in the Atlas UI to view the changes.
    The following images provide information about the Iceberg partition specification process.
    Run the following command in your Spark shell to perform partition evolution on the table (sample_3):
  11. spark.sql("ALTER TABLE spark_catalog.default.sample_3 ADD PARTITION FIELD years(ts)");
  12. Navigate accordingly in the Atlas UI to view the changes.
    The following images provide information about the Iceberg partition evolution process.
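
    As an optional check, you can describe the table from the Spark shell; the partition transforms, including the newly added years(ts) field, are listed in the output:

    spark.sql("DESCRIBE TABLE spark_catalog.default.sample_3").show(false);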

    Displaying Snapshot attributes

    Run the following commands to display relevant snapshot table parameters as attributes in Atlas, for example the existing snapshot ID, the current snapshot timestamp, the snapshot count, the number of files, and related attributes.

    spark.sql("CREATE TABLE spark_catalog.default.sample3 (id int, data string) USING iceberg");

    spark.sql("INSERT INTO default.sample3 VALUES (1, 'TEST')");

    Run the following command to increase the snapshot count value.

    spark.sql("INSERT INTO default.sample3 VALUES (1, 'TEST')");

    The latest snapshot ID that Iceberg points to is updated, along with the snapshot timestamp and the snapshot count.

    Click the parameters field to display the details pertaining to the snapshot attributes.
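
    You can also inspect the same details from the Spark shell through Iceberg's snapshots metadata table; the columns selected below follow the upstream Iceberg metadata table schema:

    spark.sql("SELECT snapshot_id, committed_at, operation FROM spark_catalog.default.sample3.snapshots").show();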

    Support for data compaction

    Data compaction is the process of taking several small files and rewriting them into fewer, larger files to speed up queries. To perform compaction on an Iceberg table, execute the rewrite_data_files procedure, optionally specifying a filter for which files to rewrite and the desired size of the resulting files (see the sketch after the basic command below).

    As an example, consider an Atlas instance in which the number of files and the snapshot count are both 9.

    Run the following command to perform data compaction:

    spark.sql("CALL spark_catalog.system.rewrite_data_files('default.sample3')").show();

    The number of data files is compacted from a total count of 9 to 1.
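
    For reference, the upstream Iceberg procedure also accepts named arguments for a row filter and for options such as the target file size; the filter and size values below are illustrative only:

    spark.sql("CALL spark_catalog.system.rewrite_data_files(table => 'default.sample3', where => 'id = 1', options => map('target-file-size-bytes', '536870912'))").show();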

    Support for metadata rewrite attributes

    Iceberg uses the metadata in its manifest list and manifest files to speed up query planning and to prune unnecessary data files. You can rewrite the manifests for a table to optimize the plan for scanning the data.

    Run the following command to rewrite the manifests on the sample3 table:

    spark.sql("CALL spark_catalog.system.rewrite_manifests('default.sample3')").show();

    The manifest count for the table is rewritten from 10 to 1.
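
    As an optional check, you can inspect the manifests from the Spark shell through Iceberg's manifests metadata table:

    spark.sql("SELECT path, length, added_data_files_count FROM spark_catalog.default.sample3.manifests").show();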

    The process is completed.