Create database
In this example we will use a HIVE ACID table as we expect our table to feature a number of slowly changing dimensions such as mileage and data center utility rates. Furthermore we don’t expect significant transaction volumes or no of Hive Objects such as partitions on our clusters that could impact performance due to compactions.
We will create a specific database that will use managed tables for both travel
expenses and public cloud reporting (see, Create a default directory for managed
tables
). From a security perspective some of the tables will have read access for all
employees, with write access limited. The trip table will feature Personal Identifiable
Information, and employees in the European Union will be subject to GDPR regulatory
requirements and therefore those columns will need to be masked to a restricted number of
users that require access.
In the following section we illustrate how we can apply a GDPR compliance policy using Apache Ranger on these objects, first we create the tables.
CREATE DATABASE carbonAccounting
Each table will be stored in the default
managed location for finance data /warehouse/finance/managed/hive.CREATE TABLE trip(employeeID string, date date, departureCity string, arrivalCity string, routeDistance int, flighNo string, serviceClass string, journeyType string, co2Emission int)
STORED AS ORC
TBLPROPERTIES ('transactional'='true',
'transactional_properties'='insert_only');
CREATE TABLE officeEmissions(use https://docs.google.com/spreadsheets/d/1NYU0eQDukjYufN80qzrrSjf2ZUE672wop69qCb3G0Sc/edit?usp=sharing for columns)
STORED AS ORC
TBLPROPERTIES ('transactional'='true',
'transactional_properties'='insert_only');
CREATE TABLE public_cloud_reporting(Kilowatt int, DataCentre string, Utilization int)
STORED AS ORC
TBLPROPERTIES ('transactional'='true',
'transactional_properties'='insert_only')
Description “captures power draw of respective data center note public cloud”;
CREATE TABLE carbonEmissions(type int, emission int, value int)
STORED AS ORC
TBLPROPERTIES ('transactional'='true',
'transactional_properties'='insert_only');
The emission data can be calculated by means of a function that can call reference
data that might be delivered to the cluster using Cloudera Flow Management tools
(CFM)
such as Apache NiFi. Typically this reference data will be subject to commercial agreements
and therefore access will need to be restricted.