org.apache.hadoop.hive.ql.io
Interface AcidInputFormat<V>

Type Parameters:
V - The row type
All Superinterfaces:
org.apache.hadoop.mapred.InputFormat<org.apache.hadoop.io.NullWritable,V>, InputFormatChecker
All Known Implementing Classes:
OrcInputFormat

public interface AcidInputFormat<V>
extends org.apache.hadoop.mapred.InputFormat<org.apache.hadoop.io.NullWritable,V>, InputFormatChecker

The interface required for input formats that want to support ACID transactions.

The goal is to provide ACID transactions to Hive. There are several primary use cases:

It is important to support batch updates and maintain read consistency within a query. A non-goal is to support many simultaneous updates or to replace online transaction systems.

The design changes the layout of data within a partition from being in files at the top level to having base and delta directories. Each write operation will be assigned a sequential global transaction id and each read operation will request the list of valid transaction ids.

With each new write operation a new delta directory is created with events that correspond to inserted, updated, or deleted rows. Each of the files is stored sorted by the original transaction id (ascending), bucket (ascending), row id (ascending), and current transaction id (descending). Thus the files can be merged by advancing through the files in parallel.
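The parallel merge described above can be sketched as a k-way merge over already-sorted inputs. This is a minimal illustration, not Hive's implementation: the `Event` class, the `ORDER` comparator, and the `merge` helper are hypothetical names introduced here, and in-memory lists stand in for the sorted files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative sketch only: none of these types belong to the Hive API.
final class AcidMergeSketch {

    // Hypothetical in-memory stand-in for the key of one ACID event.
    static final class Event {
        final long originalTxn;
        final int bucket;
        final long rowId;
        final long currentTxn;

        Event(long originalTxn, int bucket, long rowId, long currentTxn) {
            this.originalTxn = originalTxn;
            this.bucket = bucket;
            this.rowId = rowId;
            this.currentTxn = currentTxn;
        }

        @Override
        public String toString() {
            return originalTxn + ":" + bucket + ":" + rowId + "@" + currentTxn;
        }
    }

    // Original transaction, bucket, and row id ascending; current transaction descending.
    static final Comparator<Event> ORDER =
        Comparator.comparingLong((Event e) -> e.originalTxn)
            .thenComparingInt(e -> e.bucket)
            .thenComparingLong(e -> e.rowId)
            .thenComparing(Comparator.comparingLong((Event e) -> e.currentTxn).reversed());

    // Cursor over one sorted input, exposing the event it currently points at.
    private static final class Cursor {
        final Iterator<Event> rest;
        Event head;

        Cursor(Iterator<Event> it) {
            this.rest = it;
            this.head = it.next();
        }
    }

    // Merges inputs that are each sorted by ORDER, advancing through all of them in parallel.
    static List<Event> merge(List<List<Event>> sortedInputs) {
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>((a, b) -> ORDER.compare(a.head, b.head));
        for (List<Event> input : sortedInputs) {
            if (!input.isEmpty()) {
                heap.add(new Cursor(input.iterator()));
            }
        }
        List<Event> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor smallest = heap.poll();
            merged.add(smallest.head);
            if (smallest.rest.hasNext()) {
                smallest.head = smallest.rest.next();
                heap.add(smallest);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Event> base = Arrays.asList(new Event(1, 0, 0, 1), new Event(1, 0, 1, 1));
        List<Event> delta = Arrays.asList(new Event(1, 0, 1, 5), new Event(5, 0, 0, 5));
        // prints [1:0:0@1, 1:0:1@5, 1:0:1@1, 5:0:0@5]
        System.out.println(merge(Arrays.asList(base, delta)));
    }
}
```

Because the current transaction id sorts in descending order, the newest version of a row is the first event the merge yields for that row.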

The base files include all transactions from the beginning of time (transaction id 0) to the transaction in the directory name. Delta directories include the transactions between the two transaction ids in the directory name, inclusive.
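The transaction ranges covered by base and delta directories can be illustrated as follows. This is a sketch under stated assumptions: the directory name patterns `base_<txn>` and `delta_<min>_<max>` are assumed here rather than taken from this page, and `TxnRange` and `parse` are hypothetical helpers, not Hive classes.

```java
// Illustrative sketch only: none of these types belong to the Hive API.
final class AcidDirectoryRangeSketch {

    // Inclusive range of transaction ids covered by one directory.
    static final class TxnRange {
        final long min;
        final long max;

        TxnRange(long min, long max) {
            this.min = min;
            this.max = max;
        }

        boolean contains(long txn) {
            return min <= txn && txn <= max;
        }
    }

    // Assumed naming: "base_<txn>" covers 0..txn, "delta_<min>_<max>" covers min..max.
    static TxnRange parse(String dirName) {
        String[] parts = dirName.split("_");
        if (parts.length == 2 && parts[0].equals("base")) {
            return new TxnRange(0L, Long.parseLong(parts[1]));
        }
        if (parts.length == 3 && parts[0].equals("delta")) {
            return new TxnRange(Long.parseLong(parts[1]), Long.parseLong(parts[2]));
        }
        throw new IllegalArgumentException("Not a base or delta directory: " + dirName);
    }

    public static void main(String[] args) {
        TxnRange base = parse("base_0000100");
        TxnRange delta = parse("delta_0000101_0000105");
        System.out.println(base.contains(100));  // prints true
        System.out.println(delta.contains(100)); // prints false
    }
}
```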

Because read operations get the list of valid transactions when they start, all reads are performed on that snapshot, regardless of any transactions that are committed afterwards.
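Snapshot reads can be pictured as filtering events against the transaction list captured when the read started. The sketch below is only an illustration: `Snapshot` is a hypothetical stand-in and does not reproduce the `ValidTxnList` API, and representing the snapshot as a high-water mark plus a set of excluded (open or aborted) transaction ids is an assumption made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: none of these types belong to the Hive API.
final class AcidSnapshotSketch {

    // Hypothetical snapshot of the valid transactions, taken once when the read starts.
    static final class Snapshot {
        private final long highWatermark;
        private final Set<Long> excluded;

        Snapshot(long highWatermark, Set<Long> excluded) {
            this.highWatermark = highWatermark;
            this.excluded = new HashSet<>(excluded); // defensive copy: the snapshot never changes
        }

        boolean isValid(long txn) {
            return txn <= highWatermark && !excluded.contains(txn);
        }
    }

    // Keeps only the transaction ids that the snapshot considers valid.
    static List<Long> visible(List<Long> eventTxnIds, Snapshot snapshot) {
        List<Long> result = new ArrayList<>();
        for (long txn : eventTxnIds) {
            if (snapshot.isValid(txn)) {
                result.add(txn);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Transaction 7 was still open at read start; transaction 11 committed afterwards.
        Snapshot snapshot = new Snapshot(10L, Collections.singleton(7L));
        System.out.println(visible(Arrays.asList(3L, 7L, 10L, 11L), snapshot)); // prints [3, 10]
    }
}
```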

The base and the delta directories have the transaction ids so that major (merge all deltas into the base) and minor (merge several deltas together) compactions can happen while readers continue their processing.

To support transitions from non-ACID layouts to ACID layouts, the input formats are expected to support both layouts and detect the correct one.


Nested Class Summary
static class AcidInputFormat.Options
          Options for controlling the record readers.
static interface AcidInputFormat.RawReader<V>
           
static interface AcidInputFormat.RowReader<V>
           
 
Method Summary
 AcidInputFormat.RawReader<V> getRawReader(org.apache.hadoop.conf.Configuration conf, boolean collapseEvents, int bucket, ValidTxnList validTxnList, org.apache.hadoop.fs.Path baseDirectory, org.apache.hadoop.fs.Path[] deltaDirectory)
          Get a reader that returns the raw ACID events (insert, update, delete).
 AcidInputFormat.RowReader<V> getReader(org.apache.hadoop.mapred.InputSplit split, AcidInputFormat.Options options)
          Get a record reader that provides the user-facing view of the data after it has been merged together.
 
Methods inherited from interface org.apache.hadoop.mapred.InputFormat
getRecordReader, getSplits
 
Methods inherited from interface org.apache.hadoop.hive.ql.io.InputFormatChecker
validateInput
 

Method Detail

getReader

AcidInputFormat.RowReader<V> getReader(org.apache.hadoop.mapred.InputSplit split,
                                       AcidInputFormat.Options options)
                                       throws IOException
Get a record reader that provides the user-facing view of the data after it has been merged together. The key provides information about the record's identifier (transaction, bucket, record id).

Parameters:
split - the split to read
options - the options to read with
Returns:
a record reader
Throws:
IOException

getRawReader

AcidInputFormat.RawReader<V> getRawReader(org.apache.hadoop.conf.Configuration conf,
                                          boolean collapseEvents,
                                          int bucket,
                                          ValidTxnList validTxnList,
                                          org.apache.hadoop.fs.Path baseDirectory,
                                          org.apache.hadoop.fs.Path[] deltaDirectory)
                                          throws IOException
Get a reader that returns the raw ACID events (insert, update, delete). Should only be used by the compactor.

Parameters:
conf - the configuration
collapseEvents - whether the ACID events should be collapsed so that only the last version of each row is kept
bucket - the bucket to read
validTxnList - the list of valid transactions to use
baseDirectory - the base directory to read or the root directory for old style files
deltaDirectory - a list of delta files to include in the merge
Returns:
a record reader
Throws:
IOException
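
One way to picture the effect of collapseEvents, assuming the events arrive in the sort order given in the class description (current transaction id descending within a row), is to keep only the first event seen for each row. The `Event` class and `collapse` helper below are hypothetical and do not reflect how any implementing class actually performs the collapse.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: none of these types belong to the Hive API.
final class AcidCollapseSketch {

    // Hypothetical in-memory stand-in for the key of one ACID event.
    static final class Event {
        final long originalTxn;
        final int bucket;
        final long rowId;
        final long currentTxn;

        Event(long originalTxn, int bucket, long rowId, long currentTxn) {
            this.originalTxn = originalTxn;
            this.bucket = bucket;
            this.rowId = rowId;
            this.currentTxn = currentTxn;
        }

        // Two events describe the same row when the identifying triple matches.
        boolean sameRow(Event other) {
            return originalTxn == other.originalTxn
                && bucket == other.bucket
                && rowId == other.rowId;
        }
    }

    // Input must already be sorted so that the newest version of each row comes first.
    static List<Event> collapse(List<Event> sorted) {
        List<Event> latest = new ArrayList<>();
        Event previous = null;
        for (Event event : sorted) {
            if (previous == null || !event.sameRow(previous)) {
                latest.add(event); // first event for this row, so it is the newest version
            }
            previous = event;
        }
        return latest;
    }

    public static void main(String[] args) {
        List<Event> sorted = Arrays.asList(
            new Event(1, 0, 0, 1),
            new Event(1, 0, 1, 5),  // newest version of row (1, 0, 1)
            new Event(1, 0, 1, 1),  // older version, dropped by collapse
            new Event(5, 0, 0, 5));
        System.out.println(collapse(sorted).size()); // prints 3
    }
}
```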


Copyright © 2014 The Apache Software Foundation. All rights reserved.