Anatomy of a Hive-to-Druid data transformation
A breakdown of the example SQL code that transforms Hive data into a Druid datasource helps you understand how to transform your own data.
SQL example code
The following SQL statement contains the main elements needed to transform Hive data into a time series-based Druid datasource. Replace the values in the statement to match your data warehouse environment and analytics parameters.
CREATE TABLE ssb_druid_hive
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES (
"druid.segment.granularity" = "MONTH",
"druid.query.granularity" = "DAY")
AS
SELECT
cast(d_year || '-' || d_monthnuminyear || '-' || d_daynuminmonth as timestamp) as `__time`,
cast(c_city as string) c_city,
cast(c_nation as string) c_nation,
cast(c_region as string) c_region,
cast(d_weeknuminyear as string) d_weeknuminyear,
cast(d_year as string) d_year,
cast(d_yearmonth as string) d_yearmonth,
cast(d_yearmonthnum as string) d_yearmonthnum,
cast(lo_discount as string) lo_discount,
cast(lo_quantity as string) lo_quantity,
cast(p_brand1 as string) p_brand1,
cast(p_category as string) p_category,
cast(p_mfgr as string) p_mfgr,
cast(s_city as string) s_city,
cast(s_nation as string) s_nation,
cast(s_region as string) s_region,
lo_revenue,
lo_extendedprice * lo_discount discounted_price,
lo_revenue - lo_supplycost net_revenue
FROM
ssb_10_flat_orc.customer, ssb_10_flat_orc.dates, ssb_10_flat_orc.lineorder,
ssb_10_flat_orc.part, ssb_10_flat_orc.supplier
WHERE
lo_orderdate = d_datekey AND lo_partkey = p_partkey
AND lo_suppkey = s_suppkey AND lo_custkey = c_custkey;
Explanation of SQL example
The following breakdown explains how each element of the preceding SQL statement transforms Hive data into a time series-based Druid datasource.
CREATE TABLE ssb_druid_hive
- Creates the Hive table and assigns a name to it. You must use a table name that is not already used by another Druid datasource.
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
- Specifies the Druid storage handler, which transforms the Hive data into a Druid datasource.
TBLPROPERTIES ("druid.segment.granularity" = "MONTH", "druid.query.granularity" = "DAY")
- Sets the segment granularity, which controls how Druid physically partitions the data, and the query granularity, which controls how finely rows are rolled up within each segment. Both properties are described in the table at the end of this section.
AS SELECT cast(d_year || '-' || d_monthnuminyear || '-' || d_daynuminmonth as timestamp) as `__time`,
- Creates the __time column, which is a required SQL timestamp column for the Druid datasource. Here the timestamp is assembled by concatenating separate year, month, and day columns into a date string; a simpler variant for sources that already store a timestamp is sketched after this breakdown.
cast(c_city as string) c_city, cast(c_nation as string) c_nation, cast(c_region as string) c_region, cast(d_weeknuminyear as string) d_weeknuminyear, ...
- The cast(… as string) expressions index these columns as dimensions. Druid uses dimensions to search and filter the datasource.
lo_extendedprice * lo_discount discounted_price, lo_revenue - lo_supplycost net_revenue
- These expressions create preaggregated metric columns. To index a column as a metric, leave it as, or cast it to, a Hive numeric data type (see the second sketch after this breakdown).
FROM ssb_10_flat_orc.customer, ssb_10_flat_orc.dates, ...
- The 10 in the database name indicates the scale of the source data. The remaining parts of each qualified name identify the source Hive database (ssb_10_flat_orc) and the tables that are joined to populate the datasource.
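If your source table already stores a date or timestamp column, you do not need to assemble the __time value from parts. The following minimal sketch shows the simpler cast; the table and column names (orders_druid, sales.orders, order_date, order_status, order_total) are hypothetical:
CREATE TABLE orders_druid
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES (
"druid.segment.granularity" = "MONTH",
"druid.query.granularity" = "DAY")
AS
SELECT
cast(order_date as timestamp) as `__time`, -- direct cast of an existing date column
cast(order_status as string) order_status, -- string: indexed as a dimension
order_total                                -- numeric: indexed as a metric
FROM sales.orders;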
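To make the dimension-versus-metric distinction concrete, the following sketch indexes the same source column both ways. The table name ssb_quantity_demo and the two output column names are illustrative; the source tables are the same SSB tables used above:
CREATE TABLE ssb_quantity_demo
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES (
"druid.segment.granularity" = "MONTH",
"druid.query.granularity" = "DAY")
AS
SELECT
cast(d_year || '-' || d_monthnuminyear || '-' || d_daynuminmonth as timestamp) as `__time`,
cast(lo_quantity as string) lo_quantity_dim,   -- string cast: indexed as a dimension, searchable and filterable
cast(lo_quantity as double) lo_quantity_metric -- numeric cast: indexed as a metric, aggregatable
FROM ssb_10_flat_orc.lineorder, ssb_10_flat_orc.dates
WHERE lo_orderdate = d_datekey;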
Table Property | Required | Description | Valid Values |
---|---|---|---|
druid.segment.granularity | No | Defines how the data is physically partitioned. The values that are permissible here correspond with Druid segment granularity. | "YEAR", "MONTH", "WEEK", "DAY", "HOUR", "MINUTE", "SECOND" |
druid.query.granularity | No | Defines how much granularity to store in a segment. The values that are permissible here correspond with Druid query granularity. | "YEAR", "MONTH", "WEEK", "DAY", "HOUR", "MINUTE", "SECOND" |
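For example, event data that arrives continuously and is usually analyzed by the hour might be partitioned into daily segments and rolled up to hourly rows. The following values are illustrative; the right choice depends on your data volume and query patterns:
TBLPROPERTIES (
"druid.segment.granularity" = "DAY",
"druid.query.granularity" = "HOUR")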
If you need Druid to ingest additional Hive data that follows the same schema as the first data set that you transformed, you can append it to the existing datasource with the INSERT INTO statement.
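A sketch of such an INSERT INTO statement follows. It reuses the column list and joins from the CREATE TABLE statement above; the d_yearmonthnum filter that selects only the new rows is hypothetical and depends on how you identify newly arrived data:
INSERT INTO TABLE ssb_druid_hive
SELECT
cast(d_year || '-' || d_monthnuminyear || '-' || d_daynuminmonth as timestamp) as `__time`,
cast(c_city as string) c_city,
cast(c_nation as string) c_nation,
cast(c_region as string) c_region,
cast(d_weeknuminyear as string) d_weeknuminyear,
cast(d_year as string) d_year,
cast(d_yearmonth as string) d_yearmonth,
cast(d_yearmonthnum as string) d_yearmonthnum,
cast(lo_discount as string) lo_discount,
cast(lo_quantity as string) lo_quantity,
cast(p_brand1 as string) p_brand1,
cast(p_category as string) p_category,
cast(p_mfgr as string) p_mfgr,
cast(s_city as string) s_city,
cast(s_nation as string) s_nation,
cast(s_region as string) s_region,
lo_revenue,
lo_extendedprice * lo_discount discounted_price,
lo_revenue - lo_supplycost net_revenue
FROM
ssb_10_flat_orc.customer, ssb_10_flat_orc.dates, ssb_10_flat_orc.lineorder,
ssb_10_flat_orc.part, ssb_10_flat_orc.supplier
WHERE
lo_orderdate = d_datekey AND lo_partkey = p_partkey
AND lo_suppkey = s_suppkey AND lo_custkey = c_custkey
AND d_yearmonthnum = 199802; -- hypothetical filter selecting only the newly arrived month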