Anatomy of a Hive-to-Druid data transformation
A breakdown of the example SQL code that transforms Hive data into a Druid datasource helps you understand how to transform your own data.
SQL example code
The following SQL statement contains the main elements needed to transform Hive data into a time series-based Druid datasource. Replace the values in the statement to match your data warehouse environment and analytics parameters.
CREATE TABLE ssb_druid_hive
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES (
"druid.segment.granularity" = "MONTH",
"druid.query.granularity" = "DAY")
AS
SELECT
cast(d_year || '-' || d_monthnuminyear || '-' || d_daynuminmonth as timestamp) as `__time`,
cast(c_city as string) c_city,
cast(c_nation as string) c_nation,
cast(c_region as string) c_region,
cast(d_weeknuminyear as string) d_weeknuminyear,
cast(d_year as string) d_year,
cast(d_yearmonth as string) d_yearmonth,
cast(d_yearmonthnum as string) d_yearmonthnum,
cast(lo_discount as string) lo_discount,
cast(lo_quantity as string) lo_quantity,
cast(p_brand1 as string) p_brand1,
cast(p_category as string) p_category,
cast(p_mfgr as string) p_mfgr,
cast(s_city as string) s_city,
cast(s_nation as string) s_nation,
cast(s_region as string) s_region,
lo_revenue,
lo_extendedprice * lo_discount discounted_price,
lo_revenue - lo_supplycost net_revenue
FROM
ssb_10_flat_orc.customer, ssb_10_flat_orc.dates, ssb_10_flat_orc.lineorder,
ssb_10_flat_orc.part, ssb_10_flat_orc.supplier
WHERE
lo_orderdate = d_datekey AND lo_partkey = p_partkey
AND lo_suppkey = s_suppkey AND lo_custkey = c_custkey;
Explanation of SQL example
The following breakdown explains how each element of the preceding SQL statement transforms Hive data into a time series-based Druid datasource.
CREATE TABLE ssb_druid_hive
- Creates the Hive table and assigns a name to it. You must use a table name that is not already used by another Druid datasource.
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
- Specifies the Druid storage handler, which transforms the Hive data into a Druid datasource.
TBLPROPERTIES ("druid.segment.granularity" = "MONTH", "druid.query.granularity" = "DAY")
- Sets the segment granularity, which controls how Druid physically partitions the data, and the query granularity, which controls how finely rows are rolled up within each segment. Both properties are described in the table at the end of this section.
AS SELECT cast(d_year || '-' || d_monthnuminyear || '-' || d_daynuminmonth as timestamp) as `__time`,
- Creates the __time column, which is a required SQL timestamp column for the Druid datasource. Here the timestamp is assembled by concatenating separate year, month, and day columns into a date string; a simpler variant for sources that already store a timestamp is sketched after this breakdown.
cast(c_city as string) c_city, cast(c_nation as string) c_nation, cast(c_region as string) c_region, cast(d_weeknuminyear as string) d_weeknuminyear, ...
- The cast(… as string) expressions index these columns as dimensions. Druid uses dimensions to search and filter the datasource.
lo_extendedprice * lo_discount discounted_price, lo_revenue - lo_supplycost net_revenue
- These expressions create preaggregated metric columns. To index a column as a metric, leave it as, or cast it to, a Hive numeric data type (see the second sketch after this breakdown).
FROM ssb_10_flat_orc.customer, ssb_10_flat_orc.dates, ...
- The 10 in the database name indicates the scale of the source data. The remaining parts of each qualified name identify the source Hive database (ssb_10_flat_orc) and the tables that are joined to populate the datasource.
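If your source table already stores a date or timestamp column, you do not need to assemble the __time value from parts. The following minimal sketch shows the simpler cast; the table and column names (orders_druid, sales.orders, order_date, order_status, order_total) are hypothetical:
CREATE TABLE orders_druid
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES (
"druid.segment.granularity" = "MONTH",
"druid.query.granularity" = "DAY")
AS
SELECT
cast(order_date as timestamp) as `__time`, -- direct cast of an existing date column
cast(order_status as string) order_status, -- string: indexed as a dimension
order_total                                -- numeric: indexed as a metric
FROM sales.orders;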
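To make the dimension-versus-metric distinction concrete, the following sketch indexes the same source column both ways. The table name ssb_quantity_demo and the two output column names are illustrative; the source tables are the same SSB tables used above:
CREATE TABLE ssb_quantity_demo
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES (
"druid.segment.granularity" = "MONTH",
"druid.query.granularity" = "DAY")
AS
SELECT
cast(d_year || '-' || d_monthnuminyear || '-' || d_daynuminmonth as timestamp) as `__time`,
cast(lo_quantity as string) lo_quantity_dim,   -- string cast: indexed as a dimension, searchable and filterable
cast(lo_quantity as double) lo_quantity_metric -- numeric cast: indexed as a metric, aggregatable
FROM ssb_10_flat_orc.lineorder, ssb_10_flat_orc.dates
WHERE lo_orderdate = d_datekey;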
Table Property | Required | Description | Valid Values |
---|---|---|---|
druid.segment.granularity | No | Defines how the data is physically partitioned. The values that are permissible here correspond with Druid segment granularity. | "YEAR", "MONTH", "WEEK", "DAY", "HOUR", "MINUTE", "SECOND" |
druid.query.granularity | No | Defines how much granularity to store in a segment. The values that are permissible here correspond with Druid query granularity. | "YEAR", "MONTH", "WEEK", "DAY", "HOUR", "MINUTE", "SECOND" |
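For example, event data that arrives continuously and is usually analyzed by the hour might be partitioned into daily segments and rolled up to hourly rows. The following values are illustrative; the right choice depends on your data volume and query patterns:
TBLPROPERTIES (
"druid.segment.granularity" = "DAY",
"druid.query.granularity" = "HOUR")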
If you need Druid to ingest additional Hive data that follows the same schema as the first data set that you transformed, you can append it to the existing datasource with the INSERT INTO statement.
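A sketch of such an INSERT INTO statement follows. It reuses the column list and joins from the CREATE TABLE statement above; the d_yearmonthnum filter that selects only the new rows is hypothetical and depends on how you identify newly arrived data:
INSERT INTO TABLE ssb_druid_hive
SELECT
cast(d_year || '-' || d_monthnuminyear || '-' || d_daynuminmonth as timestamp) as `__time`,
cast(c_city as string) c_city,
cast(c_nation as string) c_nation,
cast(c_region as string) c_region,
cast(d_weeknuminyear as string) d_weeknuminyear,
cast(d_year as string) d_year,
cast(d_yearmonth as string) d_yearmonth,
cast(d_yearmonthnum as string) d_yearmonthnum,
cast(lo_discount as string) lo_discount,
cast(lo_quantity as string) lo_quantity,
cast(p_brand1 as string) p_brand1,
cast(p_category as string) p_category,
cast(p_mfgr as string) p_mfgr,
cast(s_city as string) s_city,
cast(s_nation as string) s_nation,
cast(s_region as string) s_region,
lo_revenue,
lo_extendedprice * lo_discount discounted_price,
lo_revenue - lo_supplycost net_revenue
FROM
ssb_10_flat_orc.customer, ssb_10_flat_orc.dates, ssb_10_flat_orc.lineorder,
ssb_10_flat_orc.part, ssb_10_flat_orc.supplier
WHERE
lo_orderdate = d_datekey AND lo_partkey = p_partkey
AND lo_suppkey = s_suppkey AND lo_custkey = c_custkey
AND d_yearmonthnum = 199802; -- hypothetical filter selecting only the newly arrived month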