Exposing Cloud Data as Hive Tables
Datasets stored in S3, ADLS or WASB can be easily made available in Hive as managed or external tables. The main difference between these two table types is that data linked in an external table does not get deleted when the table is deleted.
Therefore, external tables are optimal when the data is already present in a cloud storage service such as S3, ADLS or WASB, which provides longer-term persistence at a lower cost than attached storage.
External Tables
External tables operate in a similar manner as references. If you create an external table called "inventory"
CREATE EXTERNAL TABLE `inventory`( `inv_item_sk` int, `inv_warehouse_sk` int, `inv_quantity_on_hand` int) PARTITIONED BY ( `inv_date_sk` int) STORED AS ORC LOCATION 's3a://BUCKET_NAME/tpcds_bin_partitioned_orc_200.db/inventory';
and then you drop the "inventory" table, the contents of
s3a://BUCKET_NAME/tpcds_bin_partitioned_orc_200.db/inventory
will not be
deleted. Only the table definition will be removed from the metastore.
Note that data from a partitioned table is not automatically loaded upon table creation. To load data, use the MSCK. For more information, refer to Populating Partition-Related Information.
Managed Tables
Note | |
---|---|
The following actions require you to have write access to the S3 bucket. |
If you create a managed table called “inventory”
CREATE TABLE `inventory`( `inv_item_sk` int, `inv_warehouse_sk` int, `inv_quantity_on_hand` int) PARTITIONED BY ( `inv_date_sk` int) STORED AS ORC LOCATION 's3a://BUCKET_NAME/tpcds_bin_partitioned_orc_200.db/inventory';
and then you drop the “inventory” table, the contents of
s3a://BUCKET_NAME/tpcds_bin_partitioned_orc_200.db/inventory
will be deleted.