Hive Sink
Important | |
---|---|
Hortonworks strongly recommends that all users running HDP 2.3.4 upgrade to HDP 2.3.4.7. |
This sink streams events containing delimited text or JSON data directly into a Hive table or partition. Events are written using Hive transactions. As soon as a set of events are committed to Hive, they become immediately visible to Hive queries. Partitions to which flume will stream to can either be pre-created or, optionally, Flume can create them if they are missing. Fields from incoming event data are mapped to corresponding columns in the Hive table.
Property Name |
Default |
Description |
---|---|---|
channel | – | |
type | – | The component type name, needs to be hive. |
hive.metastore | – | Hive metastore URI (eg thrift://a.b.com:9083). |
hive.database | – | Hive database name . |
hive.table | – | Hive table name. |
hive.partition | – | Comma separated list of partition values identifying the partition to write to. May contain escape sequences. E.g.: If the table is partitioned by (continent: string, country :string, time : string) then ‘Asia,India,2014-02-26-01-21’ will indicate continent=Asia,country=India,time=2014-02-26-01-21. |
hive.txnsPerBatchAsk | 100 | Hive grants a batch of transactions instead of single transactions to streaming clients like Flume. This setting configures the number of desired transactions per Transaction Batch. Data from all transactions in a single batch end up in a single file. Flume will write a maximum of batchSize events in each transaction in the batch. This setting in conjunction with batchSize provides control over the size of each file. Note that eventually Hive will transparently compact these files into larger files. |
heartBeatInterval | 240 | (In seconds) Interval between consecutive heartbeats sent to Hive to keep unused transactions from expiring. Set this value to 0 to disable heartbeats . |
autoCreatePartitions | true | Flume will automatically create the necessary Hive partitions to stream to. |
batchSize | 15000 | Max number of events written to Hive in a single Hive transaction. |
maxOpenConnections | 500 | Allow only this number of open connections. If this number is exceeded, the least recently used connection is closed. |
callTimeout | 10000 | (In milliseconds) Timeout for Hive & HDFS I/O operations, such as openTxn, write, commit, abort. |
serializer | – | Serializer is responsible for parsing out field from the event and mapping them to columns in the hive table. Choice of serializer depends upon the format of the data in the event. Supported serializers: DELIMITED and JSON. |
roundUnit | minute | The unit of the round down value - second, minute or hour. |
roundValue | 1 | Rounded down to the highest multiple of this (in the unit configured using hive.roundUnit), less than current time. |
timeZone | Local | Name of the timezone that should be used for resolving the escape sequences in partition, e.g. Time America/Los_Angeles. |
useLocalTimeStamp | false | Use the local time (instead of the timestamp from the event header) while replacing the escape sequences. |
[D]
Following serializers are provided for Hive sink:
JSON: Handles UTF8 encoded Json (strict syntax) events and requires no configuration. Object names in the JSON are mapped directly to columns with the same name in the Hive table. Internally uses org.apache.hive.hcatalog.data.JsonSerDe but is independent of the Serde of the Hive table. This serializer requires HCatalog to be installed.
DELIMITED: Handles simple delimited textual events. Internally uses LazySimpleSerde but is independent of the Serde of the Hive table.
Property Name |
Default |
Description |
---|---|---|
serializer.delimiter | , | (Type: string) The field delimiter in the incoming data. To use special characters, surround them with double quotes like “\t”. |
serializer.fieldnames | – | The mapping from input fields to columns in hive table. Specified as a comma separated list (no spaces) of hive table columns names, identifying the input fields in order of their occurrence. To skip fields leave the column name unspecified. Eg. ‘time,,IP,message’ indicates the 1st, 3rd and 4th fields in input map to time, IP and message columns in the hive table. |
serializer.serdeSeparator | Ctrl-A | (Type: character) Customizes the separator used by underlying serde. There can be a gain in efficiency if the fields in serializer.fieldnames are in same order as table columns, the serializer.delimiter is same as the serializer.serdeSeparator and number of fields in serializer.fieldnames is less than or equal to number of table columns, as the fields in incoming event body do not need to be reordered to match order of table columns. Use single quotes for special characters like ‘\t’. Ensure input fields do not contain this character. Note: If serializer.delimiter is a single character, preferably set this to the same character. |
[D]
The following are the escape sequences supported:
Alias |
Description |
---|---|
%{host} | Substitute value of event header named “host”. Arbitrary header names are supported. |
%t | Unix time in milliseconds . |
%a | Locale’s short weekday name (Mon, Tue, ...) |
%A | Locale’s full weekday name (Monday, Tuesday, ...) |
%b | Locale’s short month name (Jan, Feb, ...) |
%B | Locale’s long month name (January, February, ...) |
%c | Locale’s date and time (Thu Mar 3 23:05:25 2005) |
%d | Day of month (01) |
%D | Date; same as %m/%d/%y |
%H | Hour (00..23) |
%I | Hour (01..12) |
%j | Day of year (001..366) |
%k | Hour ( 0..23) |
%m | Month (01..12) |
%M | Minute (00..59) |
%p | Locale’s equivalent of am or pm |
%s | Seconds since 1970-01-01 00:00:00 UTC |
%S | Second (00..59) %y last two digits of year (00..99) |
%Y | Year (2015) |
%z | +hhmm numeric timezone (for example, -0400) |
Example Hive table:
create table weblogs ( id int , msg string ) partitioned by (continent string, country string, time string) clustered by (id) into 5 buckets stored as orc;
Example for agent named a1:
a1.channels = c1 a1.channels.c1.type = memory a1.sinks = k1 a1.sinks.k1.type = hive a1.sinks.k1.channel = c1 a1.sinks.k1.hive.metastore = thrift://127.0.0.1:9083 a1.sinks.k1.hive.database = logsdb a1.sinks.k1.hive.table = weblogs a1.sinks.k1.hive.partition = asia,%{country},%y-%m-%d-%H-%M a1.sinks.k1.useLocalTimeStamp = false a1.sinks.k1.round = true a1.sinks.k1.roundValue = 10 a1.sinks.k1.roundUnit = minute a1.sinks.k1.serializer = DELIMITED a1.sinks.k1.serializer.delimiter = "\t" a1.sinks.k1.serializer.serdeSeparator = '\t' a1.sinks.k1.serializer.fieldnames =id,,msg
Note: For all of the time related escape sequences, a header with the key “timestamp” must exist among the headers of the event (unless useLocalTimeStampis set to true). One way to add this automatically is to use the TimestampInterceptor.
The above configuration will round down the timestamp to the last 10th minute. For example, an event with timestamp header set to 11:54:34 AM, June 12, 2012 and ‘country’ header set to ‘india’ will evaluate to the partition (continent=’asia’,country=’india’,time=‘2012-06-12-11-50’. The serializer is configured to accept tab separated input containing three fields and to skip the second field.