Table design considerations
You need to understand the old way of designing tables and how bring some table design habits to ACID cause problems. Solutions come later.
Managing data ingestion frequency
Things to consider when designing a table with Hive include ingest frequency. Data ingestion has been the controlling factor in table designs on legacy platforms. ACID tables provide an opportunity to solve problems in legacy table designs.
On legacy platforms, you see a tendency to over-partition tables, not to gain an advantage from a consumer standpoint, but for a physical demarkation point for the ingest process. If anything goes wrong, you delete the partition representing the data for that ingest, and just move on.
A table design based on ingest frequency, which is the number of ingest events, can lead to an abnormal, inefficent number of small files. Each ingest event creates a new series of files. Historically, to overcome this, you used excessive partitions to define and mitigate problems from numerous appends to a physical location (base or partition).
Append operations on non-ACID tables create a small file problem.
Ingesting tables with numerous partitions, excessive append operations, and numerous READ operations result in the following problems:
-
Poor consumer performance
-
Increase compute requirements
-
File system management stress
An ACID table helps prevent issues by:
-
Compacting append operations, building better file sizes through compactions.
-
Building and maintaining statistics
-
Atomic operations
When you cannot alter data sources, design your workflow as follows:
-
Use an ingest table to sweep to an ACID table (append).
-
Execute READs on ACID tables.
-
When partitions make sense, design them for the consumer, not the ingest pattern.
Streaming Data Sources:
-
Hive streaming (to an ACID table)
Building high-level partitions for ACID tables
With compactions and CRUD capabilities, you need to design tables for the consumer.
Partitioning that works fine for non-ACID tables, such as YYYY-MM-DD-HH, which represents 8760 partitions a year, is not recommended for partitioning ACID tables. Build a higher-level partition, such as YYYY-MM, which represents 12 partitions a year.
Let compactions manage and optimize the appends. Optimize by using larger files per partition. ORC file types with columnar formatting, statistics, headers, footers, and Bloom Filters help optimize what Hive must scan, reducing I/O.
Tracking batches
- Add a field to the table with the batch id.
- Use ORC CRUD functionality to remove and replay a batch instead of dropping the partition.