Using Druid and Apache Hive
Also available as:
PDF

Transform Apache Hive Data to Druid

You can execute a Hive query to transform data in Hive to a data source in Druid.

  • If you use Kerberos, configure and run Hive low-latency analytical processing (LLAP).
  • You set up a table, ssb_10_flat_orc as defined in the Star Schema Benchmark example.

A Hive SQL command, an excerpt from the Star Schema Benchmark using the Hive/Druid Integration, invokes the Druid storage handler, specifies Druid segment granularity, and maps selected Hive columns to Druid column types.

  1. Put all the Hive data to undergo ETL in a Hive table.
  2. Execute a Hive query to set the location of the Druid broker using a DNS name or IP address and port 8082, the default broker text listening port.
    SET hive.druid.broker.address.default=10.10.20.30:8082;
    If you installed the Hive and Druid services using Ambari, you can skip this step.
  3. Run a CREATE TABLE AS SELECT statement to create a new Druid datasource from the Hive table you selected in step 1.
    The following example pushes Hive data to Druid.
    CREATE TABLE ssb_druid_hive
    STORED BY 'org.apache.hadoop.hive.
    druid.DruidStorageHandler'
    TBLPROPERTIES (
    "druid.segment.granularity" = "MONTH",
    "druid.query.granularity" = "DAY")
    AS
    SELECT
    cast(d_year || '-' || d_monthnuminyear || '-' || d_daynuminmonth as timestamp) as `__time`,
    cast(c_city as string) c_city,
    cast(c_nation as string) c_nation,
    cast(c_region as string) c_region,
    cast(d_weeknuminyear as string) d_weeknuminyear,
    cast(d_year as string) d_year,
    cast(d_yearmonth as string) d_yearmonth,
    cast(d_yearmonthnum as string) d_yearmonthnum,
    cast(lo_discount as string) lo_discount,
    cast(lo_quantity as string) lo_quantity,
    cast(p_brand1 as string) p_brand1,
    cast(p_category as string) p_category,
    cast(p_mfgr as string) p_mfgr,
    cast(s_city as string) s_city,
    cast(s_nation as string) s_nation,
    cast(s_region as string) s_region,
    lo_revenue,
    lo_extendedprice * lo_discount discounted_price,
    lo_revenue - lo_supplycost net_revenue
    FROM
    ssb_10_flat_orc.customer, ssb_10_flat_orc.dates, ssb_10_flat_orc.lineorder,
    ssb_10_flat_orc.part, ssb_10_flat_orc.supplier
    where
    lo_orderdate = d_datekey and lo_partkey = p_partkey
    and lo_suppkey = s_suppkey and lo_custkey = c_custkey;