Spark and Hive permissions on Azure cloud cluster

This topic walks through example scenarios that show how Ranger authorizes queries coming from Spark to access Hive tables.

Environment details
Cloudera Data Platform: Azure cloud cluster with CDP 7.2.18.500
Cloudera Manager: CM 7.12.0.500

Connecting to Spark and running Hive queries

Connecting to Spark shell
To connect to the Spark shell, run the following command:
spark3-shell
Running Hive queries in Spark shell
To run Hive queries, create a HiveContext instance (note that in Spark 3 HiveContext is deprecated; the pre-built spark session already has Hive support, so spark.sql(...) is equivalent):
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
Executing a query
sqlContext.sql("select * from customer2").show()

Prerequisites for Ranger Hive security

In the DataHub Hive services (HiveServer2 and Hive Metastore), add the following property to ranger-hive-security.xml and restart the services:
ranger.plugin.hive.urlauth.filesystem.schemes = hdfs:,file:,wasb:,adl:,abfs:
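In ranger-hive-security.xml, the property takes the standard Hadoop configuration form. A sketch of the entry (the property name and value come from this topic; the description text is illustrative):

```xml
<property>
  <!-- Filesystem schemes that the Ranger Hive plugin treats as URL-authorized -->
  <name>ranger.plugin.hive.urlauth.filesystem.schemes</name>
  <value>hdfs:,file:,wasb:,adl:,abfs:</value>
</property>
```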

Initial setup with Hive user

Run the following commands using the Hive user:
CREATE EXTERNAL TABLE customer2 (c1 int, c2 STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
insert into customer2 values(1,'hivetest1');
select * from customer2;

create database vehicle;
create external table vehicle.cars(car_id int, car_name string, car_color string, car_price int);
insert into vehicle.cars(car_id, car_name, car_color, car_price) VALUES (1,'car1','color1',100000), (2,'car2','color2',200000), (3,'car3','color3',300000), (4,'car4','color4',400000);
select * from vehicle.cars;

Use cases for custom user permissions

Use case 1: Running a SELECT query with a custom user
  • Objective: To run a SELECT query as the custom user hrt_qa, who has a Ranger policy on the table.
  • Workflow:
    1. Initial Hive policy creation:
      • Resource:

        • Hive Database = default

        • Hive Table = customer2

        • Hive Column = *

      • User:
        • hrt_qa = select
    2. Attempted command (as hrt_qa after kinit in Spark shell):
      sqlContext.sql("select * from customer2").show()
    3. Observed error:
      Denied exception for abfs path. For example,
      SASTokenProviderException: Failed to acquire a SAS token for get-status on /warehouse/tablespace/external/hive/customer2 due to org.apache.hadoop.security.AccessControlException: Permission denied

      This indicates a need for an ADLS policy.

    4. ADLS policy creation:
      • Storage account: sdxqeweekly

      • Storage account container: vb-18df-data

      • Relative path: /warehouse/tablespace/external/hive

      • User: hrt_qa = all

    5. Re-attempted command (as hrt_qa after kinit):
      sqlContext.sql("select * from customer2").show()
  • Result: The command was successful, and the table records were visible.
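Policies like the one above are usually created in the Ranger Admin UI, but the same Hive policy can also be expressed as a Ranger REST API payload (POST to /service/public/v2/api/policy on the Ranger Admin host). This is a sketch; the service name cm_hive and the policy name are assumptions to verify against your cluster:

```json
{
  "service": "cm_hive",
  "name": "customer2-select-hrt_qa",
  "resources": {
    "database": {"values": ["default"]},
    "table": {"values": ["customer2"]},
    "column": {"values": ["*"]}
  },
  "policyItems": [
    {
      "users": ["hrt_qa"],
      "accesses": [{"type": "select", "isAllowed": true}]
    }
  ]
}
```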
Use case 2: Creating a table in the database with a custom user
  • Objective: To allow the custom user hrt_qa to create a table in the vehicle database.
  • Workflow:
    1. Initial Hive policy creation:
      • Resource:

        • Hive Database = vehicle

        • Hive Table = *

        • Hive Column = *

      • User:
        • hrt_qa = create
    2. Attempted command (as hrt_qa after kinit in a Spark shell):
      sqlContext.sql("create table vehicle.cars1(car_id int, car_name string, car_color string, car_price int)").show()
    3. Observed error: Denied exception for Azure path. For example,
      SASTokenProviderException: Failed to acquire a SAS token for get-status on /warehouse/tablespace/external/hive/vehicle.db/cars1 due to org.apache.hadoop.security.AccessControlException: Permission denied.

      This indicates a need for an ADLS policy.

    4. ADLS policy creation:
      • Storage Account: sdxqeweekly

      • Storage Account Container: vb-18df-data

      • Relative Path: /warehouse/tablespace/external/hive

      • User: hrt_qa = all

    5. Re-attempted command (as hrt_qa after kinit in a Spark shell):
      sqlContext.sql("create table vehicle.cars1(car_id int, car_name string, car_color string, car_price int)").show()
  • Result: Command was successful, and the table was created.
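The ADLS policy used in these use cases can similarly be sketched as a REST payload. The service name cm_adls, the resource keys (storageaccount, storageaccountcontainer, relativepath), and the access types shown for "hrt_qa = all" are assumptions; verify them against the Ranger ADLS (RAZ) service definition on your cluster:

```json
{
  "service": "cm_adls",
  "name": "hive-warehouse-hrt_qa",
  "resources": {
    "storageaccount": {"values": ["sdxqeweekly"]},
    "storageaccountcontainer": {"values": ["vb-18df-data"]},
    "relativepath": {"values": ["/warehouse/tablespace/external/hive"], "isRecursive": true}
  },
  "policyItems": [
    {
      "users": ["hrt_qa"],
      "accesses": [
        {"type": "read", "isAllowed": true},
        {"type": "write", "isAllowed": true}
      ]
    }
  ]
}
```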
Use case 3: Altering a table in the database with a custom user
  • Objective: To allow the custom user hrt_qa to alter the cars table in the vehicle database.
  • Workflow:
    1. Initial Hive policy creation:
      • Resource:

        • Hive Database = vehicle

        • Hive Table = cars

        • Hive Column = *

      • User: hrt_qa = alter

    2. Attempted command (as hrt_qa after kinit in a Spark shell):
      sqlContext.sql("alter table vehicle.cars rename to vehicle.cars_a").show()
    3. Observed error: Permission denied: user [hrt_qa] does not have [SELECT] privilege on [vehicle/cars].

      This indicates that you need to give SELECT permission to the user.

    4. Updated Hive policy (adding select):
      • Resource:

        • Hive Database = vehicle

        • Hive Table = cars

        • Hive Column = *

      • User: hrt_qa = alter, select

    5. Re-attempted command (as hrt_qa after kinit in a Spark shell):
      sqlContext.sql("alter table vehicle.cars rename to vehicle.cars_a").show()
    6. Observed error: Unable to alter table. Permission denied: user [hrt_qa] does not have [ALTER] privilege on [vehicle/cars_a].

      This indicates that the new table name also needs ALTER permission.

    7. Further updated Hive policy (adding new table name):
      • Resource:

        • Hive Database = vehicle

        • Hive Table = cars, cars_a

        • Hive Column = *

      • User: hrt_qa = alter, select

    8. Re-attempted command (as hrt_qa after kinit in a Spark shell):
      sqlContext.sql("alter table vehicle.cars rename to vehicle.cars_a").show()
    9. Deny log generated: The command was successful and the table was renamed; however, the Ranger UI audit access logs showed a deny log for the Azure directory.

      This indicates that an ADLS policy is needed for the alter operation to avoid the deny log above.

    10. Addressing Azure directory deny log (with ADLS policy):
      • Hive policy:

        • Resource:
          • Hive Database = vehicle

          • Hive Table = cars_a, cars_b

          • Hive Column = *

        • User: hrt_qa = alter, select

      • ADLS policy:

        • Storage Account: sdxqeweekly

        • Storage Account Container: vb-18df-data

        • Relative Path: /warehouse/tablespace/external/hive

        • User: hrt_qa = all

    11. Final command (as hrt_qa after kinit in a Spark shell):
      sqlContext.sql("alter table vehicle.cars_a rename to vehicle.cars_b").show()
  • Result: The command was successful, the table was renamed, and no deny logs for the Azure directory were generated.
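The key lesson of this use case can be sketched as a single Ranger policy payload: a rename needs the alter and select permissions on both the old and the new table name, so both names appear in the table resource. The service name cm_hive and the policy name are assumptions:

```json
{
  "service": "cm_hive",
  "name": "vehicle-rename-hrt_qa",
  "resources": {
    "database": {"values": ["vehicle"]},
    "table": {"values": ["cars", "cars_a"]},
    "column": {"values": ["*"]}
  },
  "policyItems": [
    {
      "users": ["hrt_qa"],
      "accesses": [
        {"type": "select", "isAllowed": true},
        {"type": "alter", "isAllowed": true}
      ]
    }
  ]
}
```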
Use case 4: Dropping a table with a custom user
  • Objective: To allow the custom user hrt_qa to run a DROP TABLE query.
  • Workflow:
    1. Initial setup (as Hive user):
      1. Create database:
        create database sports;
      2. Create external table:
        create external table sports.cricket(id int, name string);
      3. Insert data:
        insert into sports.cricket values (11, 'name1');
      4. Verify data:
        select * from sports.cricket;
    2. Initial Hive policy:
      • Resource:
        • Hive Database = sports

        • Hive Table = cricket

        • Hive Column = *

      • User: hrt_qa = drop
    3. Attempt to drop table (as hrt_qa after kinit in a Spark shell):
      sqlContext.sql("drop table sports.cricket").show()
    4. Observed error:
      The command failed due to a missing SELECT permission:
      org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table cricket. Permission denied: user [hrt_qa] does not have [SELECT] privilege on [sports/cricket]

      This indicates that you need to give SELECT permission to the user.

    5. Updated Hive policy:
      • Resource:

        • Hive Database = sports

        • Hive Table = cricket

        • Hive Column = *

      • User: hrt_qa = drop, select

    6. Re-attempted command to drop table (as hrt_qa after kinit in a Spark shell):
      kinit as hrt_qa
      sqlContext.sql("drop table sports.cricket").show()
  • Result: The command executed successfully, and the table was dropped.
Use case 5: Inserting values into a table with a custom user
  • Objective: To allow the custom user hrt_qa to insert values into a table.
  • Workflow:
    1. Initial setup (as Hive user):
      1. Create database:
        create database sports;
      2. Create external table:
        create external table sports.football(id int, name string);
      3. Verify data:
        select * from sports.football;
    2. Initial Hive policy:
      • Resource:

        • Hive Database = sports

        • Hive Table = football

        • Hive Column = *

      • User: hrt_qa = update

    3. Attempt to insert data (as hrt_qa after kinit in a Spark shell):
      sqlContext.sql("insert into sports.football values (11, 'name1')").show()
    4. Observed error (after first attempt):
      The command failed due to a missing SELECT permission:
      Caused by: org.apache.hadoop.hive.metastore.api.MetaException: Permission denied: user [hrt_qa] does not have [SELECT] privilege on [sports/football]

      This indicates that the custom user needs SELECT permission for insertion.

    5. Updated Hive policy:
      • Resource:

        • Hive Database = sports

        • Hive Table = football

        • Hive Column = *

      • User: hrt_qa = update, select

    6. Re-attempted command to insert data (as hrt_qa after kinit in a Spark shell):
      sqlContext.sql("insert into sports.football values (11, 'name1')").show()
    7. Observed error (after second attempt):
      The command failed due to an Azure directory permission issue:
      org.apache.spark.SparkRuntimeException: Cannot create staging directory: 'Some(abfs://vb-18df-data@sdxqeweekly.dfs.core.windows.net/warehouse/tablespace/external/hive/sports.db/football/.hive-staging_hive_2024-12-18_10-12-47_030_6416593550262289528-1)': Failed to acquire a SAS token for create-directory on /warehouse/tablespace/external/hive/sports.db/football/.hive-staging_hive_2024-12-18_10-12-47_030_6416593550262289528-1 due to org.apache.hadoop.security.AccessControlException: Permission denied.

      This indicates that an ADLS (Azure Data Lake Storage) policy is required for the custom user to manage the staging directory.

    8. Added ADLS policy:
      • Existing Hive Policy:

        • Resource:
          • Hive Database = sports
          • Hive Table = football

          • Hive Column = *

        • User: hrt_qa = update, select

      • ADLS Policy:

        • Storage Account = sdxqeweekly

        • Storage Account Container = vb-18df-data

        • Relative Path = /warehouse/tablespace/external/hive

        • User: hrt_qa = all

    9. Re-attempted command to insert data (as hrt_qa after kinit in a Spark shell, after adding the ADLS policy):
      sqlContext.sql("insert into sports.football values (11, 'name1')").show()
    10. Observed error (after third attempt):
      The command failed due to a missing ALTER permission, even though the data was observed to be inserted:
      org.apache.spark.sql.AnalysisException: Unable to alter table. Permission denied: user [hrt_qa] does not have [ALTER] privilege on [sports/football]
      This indicates that the custom user also requires ALTER permission.
    11. Final updated Hive policy:
      • Existing Hive policy:

        • Resource:
          • Hive Database = sports
          • Hive Table = football

          • Hive Column = *

        • User: hrt_qa = update, select, alter

      • ADLS policy:

        • Storage Account = sdxqeweekly

        • Storage Account Container = vb-18df-data

        • Relative Path = /warehouse/tablespace/external/hive

        • User: hrt_qa = all

    12. Final re-attempted command to insert data (as hrt_qa after kinit in a Spark shell):
      kinit as hrt_qa
      sqlContext.sql("insert into sports.football values (11, 'name1')").show()
  • Result: The command executed successfully, and two entries were observed in the table (one from the earlier attempt in step 9, which inserted the data despite the ALTER error).