Cloud Data Access
Also available as:
PDF
loading table of contents...

Exposing Cloud Data as Hive Tables

Datasets stored in S3, ADLS or WASB can be easily made available in Hive as managed or external tables. The main difference between these two table types is that data linked in an external table does not get deleted when the table is deleted.

Therefore, external tables are optimal when the data is already present in a cloud storage service such as S3, ADLS or WASB, which provides longer-term persistence at a lower cost than attached storage.

External Tables

External tables operate in a similar manner as references. If you create an external table called "inventory"

CREATE EXTERNAL TABLE `inventory`(
  `inv_item_sk` int,
  `inv_warehouse_sk` int,
  `inv_quantity_on_hand` int)
PARTITIONED BY (
  `inv_date_sk` int) STORED AS ORC
LOCATION
  's3a://BUCKET_NAME/tpcds_bin_partitioned_orc_200.db/inventory';

and then you drop the "inventory" table, the contents of s3a://BUCKET_NAME/tpcds_bin_partitioned_orc_200.db/inventory will not be deleted. Only the table definition will be removed from the metastore.

Note that data from a partitioned table is not automatically loaded upon table creation. To load data, use the MSCK. For more information, refer to Populating Partition-Related Information.

Managed Tables

[Note]Note

The following actions require you to have write access to the S3 bucket.

If you create a managed table called “inventory”

CREATE TABLE `inventory`(
  `inv_item_sk` int,
  `inv_warehouse_sk` int,
  `inv_quantity_on_hand` int)
PARTITIONED BY (
  `inv_date_sk` int) STORED AS ORC
LOCATION
  's3a://BUCKET_NAME/tpcds_bin_partitioned_orc_200.db/inventory';

and then you drop the “inventory” table, the contents of s3a://BUCKET_NAME/tpcds_bin_partitioned_orc_200.db/inventory will be deleted.