SQL tables in CDP

You need to understand the relationship of table types to ACID properties to run SQL queries on your data. The location of a table depends on the table type. You might choose a table type based on its supported storage format.

You can create ACID (atomic, consistent, isolated, and durable) tables for unlimited transactions or for insert-only transactions. These tables are managed tables. Alternatively, you can create an external table for non-transactional use. Because control of the external table is weak, the external table is not ACID compliant.

The following matrix includes the types of tables you can create, SQL engine support, ACID property support, required storage format, and key SQL operations (INSERT, UPDATE, and DELETE).

Table Type Engine ACID File Format INSERT UPDATE/DELETE
Managed: CRUD transactional Hive Yes ORC Yes Yes
Managed: Insert-only transactional Hive, Impala Yes Any Yes No
External Hive, Impala No Any Yes No

Although you cannot use the SQL UPDATE or DELETE statements to delete data in some types of tables, you can use DROP PARTITION on any table type to delete the data.

Transactional tables

Transactional tables are ACID tables that reside in the Hive warehouse. To achieve ACID compliance, Hive has to manage the table, including access to the table data. Only through Hive and Impala can you access and change the data in managed tables. Because the engine has full control of managed tables, the engine can optimize these tables extensively.

CDP is designed to support a relatively low rate of transactions, as opposed to serving as an online analytical processing (OLAP) system. You can use the SHOW TRANSACTIONS command to list open and aborted transactions.

Transactional tables are on a par with non-ACID tables performance-wise. No bucketing or sorting is required. Bucketing does not affect performance. These tables are compatible with native cloud storage.

Hive supports one statement per transaction, which can include any number of rows, partitions, or tables.

External tables

External table data created from Hive or Impala is not owned or controlled by Hive. You typically use an external table when you want to create the table based on data located in the object store. You can create an external table based on a PostgreSQL table, and use the table from Hive or Impala. You can create an external table based on a text file, such as CSV (comma separated values). You can use Hue to put the Postgres table or CSV file on S3.

The following capabilities are not supported for external tables:

  • Query cache

  • Materialized views, except in a limited way

  • Automatic runtime filtering

  • File merging after insert

When you run DROP TABLE on an external table, by default only the metadata (schema) is dropped. If you want the DROP TABLE command to also remove the actual data in the external table, as DROP TABLE does on a managed table, you need to set the external.table.purge property to true.

ORC vs Parquet formats

The differences between Optimized Row Columnar (ORC) file format for storing data in SQL engines are important to understand. Query performance improves when you use the appropriate format for your application.

The following table compares SQL engine support for ORC and Parquet.

Table 1.
Capability ORC Parquet SQL Engine
Read external data Hive, Impala
Write Full ACID tables Hive
Read Full ACID tables Hive, Impala
Read Insert-only managed tables Impala
Column index Hive
Column index Impala
CBO uses column metadata Hive
Recommended format Hive
Recommended format Impala
Read complex types Impala
Read/write complex types Hive
The major differences between Hive and Impala are as follows:
  • Default file format for table creation

    Hive=ORC; Impala=Parquet

  • Support for CRUD ACID tables

    Hive only, ORC storage format only

  • Support for insert-only ACID tables

    Hive and Impala, in all storage formats

In Hive, if you accept the default, or explicitly specify ORC storage, you get an ACID table with insert, update, and delete (CRUD) capabilities. If you specify any other storage type for a managed table, such as text, CSV, AVRO, or JSON, you get an insert-only ACID table. You cannot update or delete columns in the insert-only table.