SQL tables in CDP

You need to understand the relationship of table types to ACID properties to run SQL queries on your data. The location of a table depends on the table type. You might choose a table type based on its supported storage format.

You can create ACID (atomic, consistent, isolated, and durable) tables for unlimited transactions or for insert-only transactions. These tables are managed tables. Alternatively, you can create an external table for non-transactional use. Because control of the external table is weak, the external table is not ACID compliant.

The following matrix includes the types of tables you can create, SQL engine support, ACID property support, required storage format, and key SQL operations (INSERT, UPDATE, and DELETE).


Table Type	Engine	ACID	File Format	INSERT	UPDATE/DELETE
Managed: CRUD transactional	Hive	Yes	ORC	Yes	Yes
Managed: Insert-only transactional	Hive, Impala	Yes	Any	Yes	No
External	Hive, Impala	No	Any	Yes	No

Although you cannot use the SQL UPDATE or DELETE statements to delete data in some types of tables, you can use DROP PARTITION on any table type to delete the data.

Transactional tables🔗

Transactional tables are ACID tables that reside in the Hive warehouse. To achieve ACID compliance, Hive has to manage the table, including access to the table data. Only through Hive and Impala can you access and change the data in managed tables. Because the engine has full control of managed tables, the engine can optimize these tables extensively.

CDP is designed to support a relatively low rate of transactions, as opposed to serving as an online analytical processing (OLAP) system. You can use the SHOW TRANSACTIONS command to list open and aborted transactions.

Transactional tables are on a par with non-ACID tables performance-wise. No bucketing or sorting is required. Bucketing does not affect performance. These tables are compatible with native cloud storage.

Hive supports one statement per transaction, which can include any number of rows, partitions, or tables.

External tables🔗

External table data created from Hive or Impala is not owned or controlled by Hive. You typically use an external table when you want to create the table based on data located in the object store. You can create an external table based on a PostgreSQL table, and use the table from Hive or Impala. You can create an external table based on a text file, such as CSV (comma separated values). You can use Hue to put the Postgres table or CSV file on S3.

The following capabilities are not supported for external tables:

Query cache
Materialized views, except in a limited way
Automatic runtime filtering
File merging after insert

When you run DROP TABLE on an external table, by default only the metadata (schema) is dropped. If you want the DROP TABLE command to also remove the actual data in the external table, as DROP TABLE does on a managed table, you need to set the external.table.purge property to true.

ORC vs Parquet formats🔗

The differences between Optimized Row Columnar (ORC) file format for storing data in SQL engines are important to understand. Query performance improves when you use the appropriate format for your application.

The following table compares SQL engine support for ORC and Parquet.

Table 1.
Capability	ORC	Parquet	SQL Engine
Read external data	✓	✓	Hive, Impala
Write Full ACID tables	✓		Hive
Read Full ACID tables	✓		Hive, Impala
Read Insert-only managed tables	✓	✓	Impala
Column index	✓	✓	Hive
Column index		✓	Impala
CBO uses column metadata	✓		Hive
Recommended format	✓		Hive
Recommended format		✓	Impala
Read complex types	✓	✓	Impala
Read/write complex types	✓	✓	Hive

The major differences between Hive and Impala are as follows:

Default file format for table creation
Hive=ORC; Impala=Parquet
Support for CRUD ACID tables
Hive only, ORC storage format only
Support for insert-only ACID tables
Hive and Impala, in all storage formats

In Hive, if you accept the default, or explicitly specify ORC storage, you get an ACID table with insert, update, and delete (CRUD) capabilities. If you specify any other storage type for a managed table, such as text, CSV, AVRO, or JSON, you get an insert-only ACID table. You cannot update or delete columns in the insert-only table.

SQL tables in CDP

Transactional tables🔗

External tables🔗

ORC vs Parquet formats🔗

We want your opinion

How can we improve this page?