SQL tables in CDP
You need to understand the relationship of table types to ACID properties to run SQL queries on your data. The location of a table depends on the table type. You might choose a table type based on its supported storage format.
You can create ACID (atomic, consistent, isolated, and durable) tables for unlimited transactions or for insert-only transactions. These tables are managed tables. Alternatively, you can create an external table for non-transactional use. Because control of the external table is weak, the external table is not ACID compliant.
The following matrix includes the types of tables you can create, SQL engine support, ACID property support, required storage format, and key SQL operations (INSERT, UPDATE, and DELETE).
Table Type | Engine | ACID | File Format | INSERT | UPDATE/DELETE |
---|---|---|---|---|---|
Managed: CRUD transactional | Hive | Yes | ORC | Yes | Yes |
Managed: Insert-only transactional | Hive, Impala | Yes | Any | Yes | No |
External | Hive, Impala | No | Any | Yes | No |
Although you cannot use the SQL UPDATE or DELETE statements to delete data in some types of tables, you can use DROP PARTITION on any table type to delete the data.
Transactional tables
Transactional tables are ACID tables that reside in the Hive warehouse. To achieve ACID compliance, Hive has to manage the table, including access to the table data. Only through Hive and Impala can you access and change the data in managed tables. Because the engine has full control of managed tables, the engine can optimize these tables extensively.
CDP is designed to support a relatively low rate of transactions, as opposed to serving as an online analytical processing (OLAP) system. You can use the SHOW TRANSACTIONS command to list open and aborted transactions.
Transactional tables are on a par with non-ACID tables performance-wise. No bucketing or sorting is required. Bucketing does not affect performance. These tables are compatible with native cloud storage.
Hive supports one statement per transaction, which can include any number of rows, partitions, or tables.
External tables
External table data created from Hive or Impala is not owned or controlled by Hive. You typically use an external table when you want to create the table based on data located in the object store. You can create an external table based on a PostgreSQL table, and use the table from Hive or Impala. You can create an external table based on a text file, such as CSV (comma separated values). You can use Hue to put the Postgres table or CSV file on S3.
The following capabilities are not supported for external tables:
-
Query cache
-
Materialized views, except in a limited way
-
Automatic runtime filtering
-
File merging after insert
When you run DROP TABLE on an external table, by default only the metadata
(schema) is dropped. If you want the DROP TABLE command to also remove the actual
data in the external table, as DROP TABLE does on a managed table, you need to set
the external.table.purge
property to true.
ORC vs Parquet formats
The differences between Optimized Row Columnar (ORC) file format for storing data in SQL engines are important to understand. Query performance improves when you use the appropriate format for your application.
The following table compares SQL engine support for ORC and Parquet.
Capability | ORC | Parquet | SQL Engine |
---|---|---|---|
Read external data | ✓ | ✓ | Hive, Impala |
Write Full ACID tables | ✓ | Hive | |
Read Full ACID tables | ✓ | Hive, Impala | |
Read Insert-only managed tables | ✓ | ✓ | Impala |
Column index | ✓ | ✓ | Hive |
Column index | ✓ | Impala | |
CBO uses column metadata | ✓ | Hive | |
Recommended format | ✓ | Hive | |
Recommended format | ✓ | Impala | |
Read complex types | ✓ | ✓ | Impala |
Read/write complex types | ✓ | ✓ | Hive |
- Default file format for table creation
Hive=ORC; Impala=Parquet
- Support for CRUD ACID tables
Hive only, ORC storage format only
- Support for insert-only ACID tables
Hive and Impala, in all storage formats