Create database

In this example we will use a HIVE ACID table as we expect our table to feature a number of slowly changing dimensions such as mileage and data center utility rates. Furthermore we don’t expect significant transaction volumes or no of Hive Objects such as partitions on our clusters that could impact performance due to compactions.

We will create a specific database that will use managed tables for both travel expenses and public cloud reporting (see, Create a default directory for managed tables). From a security perspective some of the tables will have read access for all employees, with write access limited. The trip table will feature Personal Identifiable Information, and employees in the European Union will be subject to GDPR regulatory requirements and therefore those columns will need to be masked to a restricted number of users that require access.

In the following section we illustrate how we can apply a GDPR compliance policy using Apache Ranger on these objects, first we create the tables.

CREATE DATABASE carbonAccounting
Each table will be stored in the default managed location for finance data /warehouse/finance/managed/hive.
Our database will be composed of a number of tables that capture scope 1 emissions data such as travel and data center power consumption. We also include an Embedded Emissions Table which can be used to store carbon contribution of assets such as physical hardware.
CREATE TABLE trip(employeeID string, date date, departureCity string, arrivalCity string, routeDistance int, flighNo string, serviceClass string, journeyType string, co2Emission int) 
  STORED AS ORC
  TBLPROPERTIES ('transactional'='true',
  'transactional_properties'='insert_only');

CREATE TABLE officeEmissions(use https://docs.google.com/spreadsheets/d/1NYU0eQDukjYufN80qzrrSjf2ZUE672wop69qCb3G0Sc/edit?usp=sharing for columns) 
  STORED AS ORC
  TBLPROPERTIES ('transactional'='true',
  'transactional_properties'='insert_only');

CREATE TABLE public_cloud_reporting(Kilowatt int, DataCentre string, Utilization int) 
  STORED AS ORC
  TBLPROPERTIES ('transactional'='true',
  'transactional_properties'='insert_only')
Description “captures power draw of respective data center note public cloud”;

CREATE TABLE carbonEmissions(type int, emission int, value int) 
  STORED AS ORC
  TBLPROPERTIES ('transactional'='true',
  'transactional_properties'='insert_only');

The emission data can be calculated by means of a function that can call reference data that might be delivered to the cluster using Cloudera Flow Management tools (CFM) such as Apache NiFi. Typically this reference data will be subject to commercial agreements and therefore access will need to be restricted.