Ranger RMS - HIVE-S3 ACL Sync Overview
Ranger Resource Mapping Server (RMS) enables automatic translation of access policies from HIVE to S3. This feature is available only in AWS deployments.
About HIVE-S3 ACL Sync
It is common for different workloads to use the same data: some require authorization at the table level (Apache Hive queries) and others at the level of the underlying files (Apache Spark jobs). Unfortunately, in such cases you would have to create and maintain separate Ranger policies for Hive and S3 that correspond to each other.
As a result, whenever a change is made on a Hive table policy, the data admin should make a consistent change in the corresponding S3 policy. Failure to do so could result in security and/or data exposure issues. Ideally the data admin would set a single table policy, and the corresponding file access policies would automatically be kept in sync along with access audits, referring to the table policy that enforced it.
Legacy CDH users had a feature called Hive-HDFS ACL sync, in which Hive policies in Apache Sentry automatically linked Hive permissions with HDFS ACLs. This was especially convenient for external table data used by Spark or Hive.
Prior to CDP 7.2.18, Ranger only supported manually managing Hive and S3 policies separately. Ranger RMS (Resource Mapping Server) allows you to authorize access to S3 locations using policies defined for Hive tables. RMS is the service that enables Hive-S3 Policy Sync.
RMS periodically connects to the Hive Metastore and pulls Hive metadata (database-name, table-name) to S3 file-name mapping. We have introduced a RAZ-chained plugin (running in the Ranger RAZ service) which has an additional HivePolicyEnforcer module. The RAZ-S3 plugin downloads Hive policies from Ranger Admin, along with the mappings from Ranger RMS. S3 access is determined by both S3 policies and Hive policies.
Phase I (items 1-3 above)
Ranger RMS periodically connects to the HIVE Metastore and pulls HIVE metadata (database-name, table-name) to S3 file-name mapping.
Phase II (items 4-9 above)
The Ranger RAZ S3 Chained Plugin (running in the RAZ service) periodically pulls S3 policies from Ranger Admin. With the introduction of Ranger RMS, the Ranger RAZ S3 Chained Plugin (running in the RAZ service) has been extended with an additional HIVEPolicyEnforcer module. It now pulls down the HIVE-S3 mappings from RMS and HIVE Policies from Ranger Admin.
After phase II completes, the requested S3 access is determined in the RAZ service by the S3 and HIVE policies defined by the Ranger Administrator.
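As a rough illustration of how a downloaded HIVE-S3 mapping could be used, the sketch below resolves a requested S3 path to the HIVE resource whose location contains it. The bucket name, paths, and the resolve function are hypothetical, and longest-prefix matching is an assumption made for this example, not a description of the actual RAZ or RMS implementation.

```python
from typing import Dict, Optional, Tuple

# Hypothetical mappings, shaped like the (database-name, table-name) to
# S3 location mapping that RMS is described as pulling from the Metastore.
# A table name of None stands for the database itself.
MAPPINGS: Dict[str, Tuple[str, Optional[str]]] = {
    "s3a://bucket/warehouse/sales.db": ("sales", None),
    "s3a://bucket/warehouse/sales.db/orders": ("sales", "orders"),
}


def resolve(path: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return the (database, table) whose location contains the path.

    Assumption for this sketch: the longest matching location wins.
    """
    best_location = None
    best_resource = None
    for location, resource in MAPPINGS.items():
        loc = location.rstrip("/")
        # Match whole path components only, so ".../sales.db2" does not
        # match the ".../sales.db" location.
        if path == loc or path.startswith(loc + "/"):
            if best_location is None or len(loc) > len(best_location):
                best_location, best_resource = loc, resource
    return best_resource
```

A path that resolves to None has no HIVE mapping, so in this sketch only the S3 policies would be relevant for it.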
About database-level grants feature
Legacy CDH users used HIVE policies in Apache Sentry that automatically linked HIVE permissions with HDFS ACLs. This was especially convenient for external table data used by Spark or HIVE. Specifically, using Sentry, you could make grants at the HIVE database level and HDFS permissions would propagate to the database directory, and to all tables and partitions under it.
The database-level grant feature is also available in RMS-S3. Ranger Resource Mapping Server (RMS) allows you to create a database-level policy in HIVE and have these permissions propagate to the database's S3 location and all tables under it. RMS is the service that enables HIVE-S3 ACL Sync.
RMS captures database metadata from the HIVE Metastore (HMS). After the first, full-synchronization run, RMS downloads mappings for tables and databases present in the HMS.
Whenever you create a new database, RMS synchronizes metadata from HMS and uses it to update the resource mapping file that links HIVE database resources to their corresponding S3 locations. Any user with access permissions on a HIVE database automatically receives similar S3 file-level access permissions on the database’s data files. Select/Read access for any user in the database location is allowed through the default HIVE policy for all-databases. This behavior is treated as _any access, which is similar to the HIVE command show tables. If a user has no HIVE policy that allows access to the database, the user is denied access to the corresponding S3 location of that database. Without this feature, users are not allowed to access the S3 location of a database even if they have permission to access the database through a HIVE policy. The S3 to HIVE access type mappings follow:
Access Type mapping for S3 to HIVE for Database:
- _any=[_any]
- read=[_any]
- write=[create, drop, alter]
Access Type mapping for S3 to HIVE for Table:
- _any=[_any]
- read=[select]
- write=[update, alter]
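The two mappings above can be read as a simple lookup from an S3 access type to the HIVE access types it corresponds to. The following is a minimal sketch of that lookup. The dictionary and function names are hypothetical, and the sketch does not say how multiple HIVE access types are combined during evaluation, because that is not specified here.

```python
# Access-type mappings copied from the two lists above.
DATABASE_MAPPING = {
    "_any": ["_any"],
    "read": ["_any"],
    "write": ["create", "drop", "alter"],
}
TABLE_MAPPING = {
    "_any": ["_any"],
    "read": ["select"],
    "write": ["update", "alter"],
}


def hive_access_types(s3_access, resource_kind):
    """Return the HIVE access types mapped to an S3 access type.

    resource_kind is "database" or "table" (hypothetical labels).
    """
    if resource_kind == "database":
        mapping = DATABASE_MAPPING
    elif resource_kind == "table":
        mapping = TABLE_MAPPING
    else:
        raise ValueError("unknown resource kind: " + resource_kind)
    return mapping[s3_access]
```

For example, an S3 read request on a table location maps to the HIVE select access type, while the same request on a database location maps to _any.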
If you create tables under a database but the S3 location of a table does not reside under the S3 location of that database (for example, the table locations are external locations), the HIVE policies (database-name, table = *, column = *) translate into S3 access rules and allow the RAZ S3 chained plugin to enforce them. If the policy is created only for the database resource, the same access translates to the S3 location of that database only, not to the tables residing under that database.
Ranger RMS Assumptions and Limitations
- All partitions of a table are assumed to be under the location specified for the table. Therefore, table permissions will not authorize access to partitions that store data outside the location specified for the table. For example, if a table is located in a /warehouse/foo S3 directory, all partitions of the table must have locations that are under the /warehouse/foo directory.
- In public cloud 7.2.18 or above, RMS service will be available only on AWS deployment for fresh install setup (not for upgrade scenario). A customer with this new RMS entitlement ENABLE_RMS_ON_DATALAKE should be able to create a cluster with RMS as a configurable option (--enable-ranger-rms) through a cdp cli command create-aws-datalake. When RMS is selected during cluster setup, customers will not be required to install & configure RMS separately.
- The Ranger RMS ACL-sync feature supports a single logical HMS, to evaluate S3 access via HIVE permissions. This is aligned with the Sentry implementation in CDH.
- Permissions granted on views (traditional and materialized) do not extend to S3 access. This is aligned with the Sentry implementation in CDH.
- RMS ACL sync is designed to work on a specific pair of S3 and Hive Ranger services. Ranger RMS supports only one pair of Hive and S3 services. By default, cm_s3 is configured as the source service and cm_hive as the target service.
- If a Public Cloud Base deployment supports multiple logical HMS with a single Ranger, Ranger RMS (Hive-S3 ACL-Sync) works with only one logical HMS. Permissions granted on databases/tables in other logical HMS instances will not be considered to authorize S3 access.
- Ranger RAZ memory requirements must be increased, based on the number of HIVE table mappings downloaded to the S3 Ranger plugin. Maintaining HIVE policies in the memory cache also requires additional memory.
- In public cloud deployments, the Ranger RMS service is installed only on the DataLake, and it uses the same database as Ranger Admin to store mappings downloaded from HMS.
- The Ranger RAZ service running in DataHub has an S3 chained plugin, which performs authorization based on the policies and mappings downloaded and stored in the policy-cache directory of the RAZ service. Even if the RMS service is stopped, authorization continues to work based on the files available in the policy-cache directory.
- Metrics support for RMS with S3 is not added in the CDH-7.2.18 release.
- Expect Ranger RAZ CPU load to increase, due to the additional access evaluation performed to enforce HIVE policies and the periodic downloading and processing of the HIVE table mappings. The latter increase is proportional to the number of table mappings downloaded to the S3 Ranger plugin.
When multiple databases are mapped to a single S3 location and a HIVE policy allows a user to access one of those databases, the user will be able to access that S3 location and all other files and directories under it. This may include the directories of other databases and their tables. However, the user will not be able to access those other databases or tables through Hive queries.
For example,
Databases music_a, music_b, and music_c are created at S3 path '/data'.
Policy-A allows the 'sam' user 'all' access on resource = {database=music_a; table=*; column=*;}
Now, the 'sam' user gets all access on S3 path /data and on the files and directories under it. Therefore, the 'sam' user will be able to access the S3 locations of tables under the music_b and music_c databases as long as those locations reside under the /data directory.
However, 'sam' user will not be able to access music_b and music_c databases or any tables under these databases through Hive queries.
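The example above can be sketched in a few lines of code. This is a simplified model for illustration only: the dictionaries, function names, and the prefix-based location check are assumptions made for the sketch and do not represent the actual plugin logic.

```python
# Hypothetical data taken from the example: three databases share /data,
# and Policy-A grants 'sam' access to music_a only.
DB_LOCATIONS = {"music_a": "/data", "music_b": "/data", "music_c": "/data"}
HIVE_GRANTS = {"sam": {"music_a"}}


def is_under(path, location):
    """True if path equals location or lies beneath it (whole components)."""
    loc = location.rstrip("/")
    return path == loc or path.startswith(loc + "/")


def s3_allowed(user, path):
    """S3 access follows the location of any database the user may access."""
    granted = HIVE_GRANTS.get(user, set())
    return any(is_under(path, DB_LOCATIONS[db]) for db in granted)


def hive_allowed(user, database):
    """Hive query access depends only on the database named in the policy."""
    return database in HIVE_GRANTS.get(user, set())
```

In this model, s3_allowed("sam", "/data/music_b/t1/part-0") is true because the path lies under the location of music_a, while hive_allowed("sam", "music_b") is false. Giving each database its own S3 location avoids this overlap.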
Comparison with Sentry HDFS ACL sync
The Ranger RMS (Hive-S3 ACL-Sync) feature resembles the Sentry HDFS ACL Sync feature in the way it downloads and keeps track of the HIVE table to S3 location mapping.
It differs from Sentry in the way it completely and transparently supports all features that Ranger policies express. Therefore, support for tag-based policies, security-zones, masking and row-filtering and audit logging is included with this implementation.
Also, the feature is enabled or disabled by a simple configuration on the Ranger RAZ side, allowing each installation the option of turning this feature on or off.