Creating a default directory for managed tables

You can specify a top-level directory for managed tables when creating a Hive database.

Create a default directory for managed tables only after limiting CREATE DATABASE and ALTER DATABASE statements to users having the Admin role, which has hive service user permissions. Permissions to the managed directory must be limited to the hive service user. In addition to restricting permissions to the hive user, you can further secure managed tables using Ranger fine-grained permissions, such as row-level filtering and column masking.

As Admin, you specify a managed location within the default location specified by the hive.metastore.warehouse.dir configuration property to give managed tables a common location for governance policies. The managed location designates a single root directory for all tenant tables, managed and external.

Use the following syntax to create a database that specifies a location for managed tables:
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
    [COMMENT database_comment]
    [LOCATION external_table_path]
    [MANAGEDLOCATION managed_table_directory_path]
    [WITH DBPROPERTIES (property_name=property_value, ...)];
Do not set LOCATION and MANAGEDLOCATION to the same file system path.
Use the following syntax to set or change a location for managed tables.
ALTER (DATABASE|SCHEMA) database_name SET MANAGEDLOCATION [managed_table_directory_path];
  1. Create a database mydatabase that specifies a top level directory named sales for managed tables.
    CREATE DATABASE mydatabase MANAGEDLOCATION '/warehouse/tablespace/managed/hive/sales';
  2. Change the abc_sales database location to the same location as mydatabase.
    ALTER DATABASE abc_sales SET MANAGEDLOCATION '/warehouse/tablespace/managed/hive/sales';

Generating surrogate keys

You can use the built-in SURROGATE_KEY user-defined function (UDF) to automatically generate numerical Ids for rows as you enter data into a table. The generated surrogate keys can replace wide, multiple composite keys.

Hive supports the surrogate keys on ACID tables only, as described in the following matrix of table types:
Table Type ACID Surrogate Keys File Format INSERT UPDATE/DELETE
Managed: CRUD transactional Yes Yes ORC Yes Yes
Managed: Insert-only transactional Yes Yes Any Yes No
Managed: Temporary No No Any Yes No
External No No Any Yes No
The table you want to join using surrogate keys cannot have column types that need casting. These data types must be primitives, such as INT or STRING.

Joins using the generated keys are faster than joins using strings. Using generated keys does not force data into a single node by a row number. You can generate keys as abstractions of natural keys. Surrogate keys have an advantage over UUIDs, which are slower and probabilistic.

The SURROGATE_KEY UDF generates a unique Id for every row that you insert into a table. It generates keys based on the execution environment in a distributed system, which includes a number of factors, such as internal data structures, the state of a table, and the last transaction id. Surrogate key generation does not require any coordination between compute tasks.

The UDF takes either no arguments or two arguments:

  • Write Id bits
  • Task Id bits
  1. Create a students table in the default ORC format that has ACID properties.
    CREATE TABLE students (row_id INT, name VARCHAR(64), dorm INT);
  2. Insert data into the table. For example:
    INSERT INTO TABLE students VALUES (1, 'fred flintstone', 100), (2, 'barney rubble', 200);
  3. Create a version of the students table using the SURROGATE_KEY UDF.
    CREATE TABLE students_v2 
    (`ID` BIGINT DEFAULT SURROGATE_KEY(),
     row_id INT,
     name VARCHAR(64), 
     dorm INT, 
     PRIMARY KEY (ID) DISABLE NOVALIDATE);
  4. Insert data, which automatically generates surrogate keys for the primary keys.
    INSERT INTO students_v2 (row_id, name, dorm) SELECT * FROM students;
  5. Take a look at the surrogate keys.
    SELECT * FROM students_v2;
    +-----------------+---------------------+-------------------+-------------------+
    | students_v2.id  | students_v2.row_id  | students_v2.name  | students_v2.dorm  |
    +-----------------+---------------------+-------------------+-------------------+
    | 1099511627776   | 1                   | fred flintstone   | 100               |
    | 1099511627777   | 2                   | barney rubble     | 200               |
    +-----------------+---------------------+-------------------+-------------------+               
  6. Add the surrogate keys as a foreign key to another table, such as a student_grades table, to speed up subsequent joins of the tables.
    ALTER TABLE student_grades ADD COLUMNS (gen_id BIGINT);
     
    MERGE INTO student_grades g USING students_v2 s ON g.row_id = s.row_id
    WHEN MATCHED THEN UPDATE SET gen_id = s.id;
    Now you can achieve fast joins on the surrogate keys.