Authentication Overview

Authentication is a basic security requirement for any computing environment. In simple terms, users and services must prove their identity (authenticate) to the system before they can use system features to the degree authorized. Authentication and authorization work hand-in-hand to protect system resources. Authorization is handled in many different ways, from access control lists (ACLs), to HDFS extended ACLs, to role-based access controls (RBAC) using Sentry. See Authorization Overview for more information.

Several different mechanisms work together to authenticate users and services in a cluster. These vary depending on the services configured on the cluster. Most CDH components, including Apache Hive, Hue, and Apache Impala, can use Kerberos for authentication. Both MIT and Microsoft Active Directory Kerberos implementations can be integrated for use with Cloudera clusters.

In addition, Kerberos credentials can be stored and managed in an LDAP-compliant identity service, such as OpenLDAP or Microsoft Active Directory, a core component of Windows Server.

This section provides a brief overview with special focus on different deployment models available when using Microsoft Active Directory for Kerberos authentication or when integrating MIT Kerberos and Microsoft Active Directory.

Cloudera does not provide a Kerberos implementation. Cloudera clusters can be configured to use an existing Kerberos implementation for authentication, either MIT Kerberos or Microsoft Active Directory Kerberos, and specifically its Key Distribution Center (KDC). The Kerberos instance must be set up and operational before you can configure the cluster to use it.

Gathering all the configuration details about the KDC, or having the Kerberos administrator available to help during the setup process, is an important preliminary task when integrating the cluster with Kerberos, regardless of the deployment model.

Kerberos Overview

In simple terms, Kerberos is an authentication protocol that relies on cryptographic mechanisms to handle interactions between a requesting client and server, greatly reducing the risk of impersonation. Passwords are neither stored locally nor sent over the network in the clear. The password users enter when logging in to their systems is used to unlock a local mechanism that is then used in a subsequent interaction with a trusted third party to grant the user a ticket (with a limited lifetime) that is used to authenticate with requested services. After the client and server processes prove their respective identities to each other, communications are encrypted to ensure privacy and data integrity.

The trusted third party is the Kerberos Key Distribution Center (KDC), the focal point for Kerberos operations, which also provides the Authentication Service and the Ticket Granting Service (TGS) for the system. Briefly, the TGS issues a ticket to the requesting user or service. The ticket is then presented to the requested service and proves the identity of the user (or service) for the ticket lifetime (by default, 10 hours). There are many nuances to Kerberos, including defining the principals that identify users and services for the system, ticket renewal, and delegated token handling, to name a few. See Kerberos Security Artifacts Overview.
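
As a rough illustration of the flow described above, the following Python sketch models a KDC that first issues a ticket-granting ticket and then exchanges it for service tickets with a limited lifetime. It is a toy model under stated assumptions, not the Kerberos protocol: it performs no cryptography, and the names ToyKDC and Ticket, the principals, and the realm are hypothetical. The 10-hour lifetime mirrors the default mentioned above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

TICKET_LIFETIME = timedelta(hours=10)  # default lifetime cited in the text


@dataclass(frozen=True)
class Ticket:
    client: str        # principal the ticket was issued to
    service: str       # principal the ticket is valid for
    expires: datetime  # end of the ticket lifetime

    def is_valid(self, now: datetime) -> bool:
        return now < self.expires


class ToyKDC:
    """Hypothetical stand-in for the KDC (AS + TGS); performs no cryptography."""

    def __init__(self, realm: str, secrets: dict):
        self.realm = realm
        self.secrets = secrets  # principal -> shared secret (illustration only)

    def authenticate(self, principal: str, secret: str, now: datetime) -> Ticket:
        # Authentication Service: issue a ticket-granting ticket (TGT).
        # A real client never sends its password; it proves knowledge of the
        # key cryptographically. The comparison below only stands in for that.
        if self.secrets.get(principal) != secret:
            raise PermissionError("authentication failed")
        tgs_name = f"krbtgt/{self.realm}@{self.realm}"
        return Ticket(principal, tgs_name, now + TICKET_LIFETIME)

    def grant(self, tgt: Ticket, service: str, now: datetime) -> Ticket:
        # Ticket Granting Service: exchange a valid TGT for a service ticket.
        if not tgt.service.startswith("krbtgt/") or not tgt.is_valid(now):
            raise PermissionError("TGT missing or expired")
        # A service ticket never outlives the TGT it was derived from.
        return Ticket(tgt.client, service, min(tgt.expires, now + TICKET_LIFETIME))
```

In this model a user authenticates once, and every later request for a service ticket reuses the TGT until it expires, which is why users do not re-enter their password for each service.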

Furthermore, these processes occur for the most part completely transparently. For example, business users of the cluster simply enter their password when they log in, and the ticket-handling, encryption, and other details take place automatically, behind the scenes. Additionally, users are authenticated not only to a single service target, but to the network as a whole thanks to the tickets and other mechanisms at work in the Kerberos infrastructure.

Kerberos Deployment Models

Credentials needed for Kerberos authentication can be stored and managed in an LDAP-compliant identity/directory service, such as OpenLDAP or Microsoft Active Directory.

At one time a stand-alone service offering from Microsoft, Active Directory services are now packaged as part of the Microsoft Server Domain Services. In the early 2000s, Microsoft replaced its NT LAN Manager authentication mechanism with Kerberos. That means that sites running Microsoft Server can integrate their clusters with Active Directory for Kerberos and have the credentials stored in the LDAP directory on the same server.

This section provides overviews of the different deployment models available for integrating Kerberos authentication with Cloudera clusters, with some of the advantages and disadvantages of the available approaches.

Local MIT KDC

This approach uses an MIT KDC that is local to the cluster. Users and services authenticate to the local KDC before they can interact with the CDH components on the cluster.

Architecture Summary

  • An MIT KDC and a separate Kerberos realm are deployed locally to the CDH cluster. The local MIT KDC is typically deployed on a Utility host. Additional replicated MIT KDCs for high availability are optional.
  • All cluster hosts must be configured to use the local MIT Kerberos realm using the krb5.conf file.
  • All service and user principals must be created in the local MIT KDC and Kerberos realm.
  • The local MIT KDC will authenticate both the service principals (using keytab files) and user principals (using passwords).
  • Cloudera Manager connects to the local MIT KDC to create and manage the principals for the CDH services running on the cluster. To do this, Cloudera Manager uses an admin principal and keytab that are created during the setup process. This step has been automated by the Kerberos wizard. See Enabling Kerberos Authentication Using the Wizard for details, or see How to Configure Clusters to Use Kerberos for Authentication for information about creating an admin principal manually.
  • The local MIT KDC administrator typically creates all other user principals. However, the Cloudera Manager Kerberos wizard can create the principals and keytab files automatically.
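
The krb5.conf requirement in the list above can be sketched as follows. This Python helper only renders the text of a minimal client configuration for a single local realm. The realm CLUSTER.EXAMPLE.COM, the host kdc1.cluster.example.com, and the function name render_krb5_conf are placeholders, and a real deployment will typically need additional settings.

```python
def render_krb5_conf(realm: str, kdc_host: str, dns_domain: str) -> str:
    """Render a minimal krb5.conf that points cluster hosts at one local realm."""
    return "\n".join([
        "[libdefaults]",
        f"    default_realm = {realm}",
        "",
        "[realms]",
        f"    {realm} = {{",
        f"        kdc = {kdc_host}",
        f"        admin_server = {kdc_host}",
        "    }",
        "",
        "[domain_realm]",
        # Map the cluster's DNS domain (and its subdomains) to the realm.
        f"    .{dns_domain} = {realm}",
        f"    {dns_domain} = {realm}",
        "",
    ])


# Placeholder values; substitute the realm and KDC host of the actual cluster.
print(render_krb5_conf("CLUSTER.EXAMPLE.COM",
                       "kdc1.cluster.example.com",
                       "cluster.example.com"))
```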


Pros:
  • The authentication mechanism is isolated from the rest of the enterprise.
  • This is fairly easy to set up, especially if you use the Cloudera Manager Kerberos wizard that automates creation and distribution of service principals and keytab files.

Cons:
  • This mechanism is not integrated with a central authentication system.
  • User and service principals must be created in the local MIT KDC, which can be time-consuming.
  • The local MIT KDC can be a single point of failure for the cluster unless replicated KDCs are configured for high availability.
  • The local MIT KDC is yet another authentication system to manage.

Local MIT KDC with Active Directory Integration

This approach uses an MIT KDC and Kerberos realm that are local to the cluster. However, Active Directory stores the user principals that will access the cluster in a central realm. Users must authenticate with this central AD realm to obtain ticket-granting tickets (TGTs) before they can interact with CDH services on the cluster. Note that CDH service principals reside only in the local KDC realm.

Architecture Summary

  • An MIT KDC and a distinct Kerberos realm are deployed locally to the CDH cluster. The local MIT KDC is typically deployed on a Utility host, and additional replicated MIT KDCs for high availability are optional.
  • All cluster hosts are configured with both Kerberos realms (local and central AD) using the krb5.conf file. The default realm should be the local MIT Kerberos realm.
  • Service principals should be created in the local MIT KDC and the local Kerberos realm. Cloudera Manager connects to the local MIT KDC to create and manage the principals for the CDH services running on the cluster. To do this, Cloudera Manager uses an admin principal and keytab that are created during the security setup. This step has been automated by the Kerberos wizard.
  • A one-way, cross-realm trust must be set up from the local Kerberos realm to the central AD realm containing the user principals that require access to the CDH cluster. There is no need to create the service principals in the central AD realm and no need to create user principals in the local realm.
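
To make the direction of the trust concrete: in Kerberos, a cross-realm trust is represented by a krbtgt principal that names the realm hosting the services and belongs to the realm hosting the users, and that principal must exist with matching keys in both KDCs. The Python sketch below derives that principal name and models which logins a set of one-way trusts permits. The realm names and both function names are hypothetical.

```python
def cross_realm_principal(user_realm: str, service_realm: str) -> str:
    """Name of the krbtgt principal that lets user_realm users reach service_realm."""
    return f"krbtgt/{service_realm}@{user_realm}"


def can_authenticate(user_realm: str, service_realm: str, trusts: set) -> bool:
    """trusts holds (trusting_realm, trusted_realm) pairs; each pair is one-way."""
    return user_realm == service_realm or (service_realm, user_realm) in trusts


LOCAL = "CLUSTER.EXAMPLE.COM"  # hypothetical local MIT realm (service principals)
AD = "AD.EXAMPLE.COM"          # hypothetical central AD realm (user principals)
trusts = {(LOCAL, AD)}         # the local realm trusts the AD realm, not vice versa

print(cross_realm_principal(AD, LOCAL))     # krbtgt/CLUSTER.EXAMPLE.COM@AD.EXAMPLE.COM
print(can_authenticate(AD, LOCAL, trusts))  # True: AD users can reach cluster services
print(can_authenticate(LOCAL, AD, trusts))  # False: the trust is one-way
```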


Pros:
  • The local MIT KDC serves as a shield for the central Active Directory from the many hosts and services in a CDH cluster. Service restarts in a large cluster create many simultaneous authentication requests. If Active Directory is unable to handle the spike in load, then the cluster can effectively cause a distributed denial of service (DDoS) attack.
  • This is fairly easy to set up, especially if you use the Cloudera Manager Kerberos wizard that automates creation and distribution of service principals and keytab files. Active Directory administrators need to be involved only to configure the cross-realm trust during setup.
  • Integration with central Active Directory for user principal authentication results in a more complete authentication solution.
  • Allows for incremental configuration. Hadoop security can be configured and verified using the local MIT KDC independently of integrating with Active Directory.

Cons:
  • The local MIT KDC can be a single point of failure (SPOF) for the cluster. Replicated KDCs can be configured for high availability.
  • The local MIT KDC is yet another authentication system to manage.

Using a Centralized Active Directory Service

This approach uses the central Active Directory as the KDC. No local KDC is required. Before you decide upon an AD KDC deployment, make sure you are aware of the following possible ramifications of that decision.

Architecture Summary

  • All service and user principals are created in the Active Directory KDC.
  • All cluster hosts are configured with the central AD Kerberos realm using krb5.conf.
  • Cloudera Manager connects to the Active Directory KDC to create and manage the principals for the CDH services running on the cluster. To do this, Cloudera Manager uses a principal that has the privileges to create other accounts within the given Organizational Unit (OU). (This step has been automated by the Kerberos wizard.)
  • All service and user principals are authenticated by the Active Directory KDC.


Recommendations for Active Directory KDC

Several different subsystems are involved in servicing authentication requests, including the Key Distribution Center (KDC), Authentication Service (AS), and Ticket Granting Service (TGS). The more nodes in the cluster and the more services provided, the heavier the traffic between these services and the services running on the cluster.

As a general guideline, Cloudera recommends using a dedicated Active Directory instance (Microsoft Server Domain Services) for every 100 nodes in the cluster. However, this is just a loose guideline. Monitor utilization and deploy additional instances as needed to meet the demand.
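
The guideline above reduces to simple arithmetic. The Python sketch below turns it into a starting estimate. Rounding up for partial blocks of 100 nodes is an assumption made here, the function name is hypothetical, and the result is only a first guess to be corrected by monitoring.

```python
import math

NODES_PER_AD_INSTANCE = 100  # loose guideline from the text, not a hard limit


def recommended_ad_instances(cluster_nodes: int) -> int:
    """Starting estimate: one dedicated AD instance per 100 nodes, rounded up."""
    if cluster_nodes < 1:
        raise ValueError("cluster_nodes must be positive")
    return math.ceil(cluster_nodes / NODES_PER_AD_INSTANCE)


print(recommended_ad_instances(350))  # 4
```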

By default, the MIT Kerberos client library tries UDP before TCP when a message to the KDC is smaller than the udp_preference_limit setting. UDP has less overhead than TCP but does not guarantee delivery, and larger messages may not fit in a single UDP packet. To have Kerberos try TCP first for all requests, modify the Kerberos configuration file (krb5.conf) as follows:
[libdefaults]
udp_preference_limit = 1
...

This is especially useful if the domain controllers are not on the same subnet as the cluster or are separated by firewalls.
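
Applying a [libdefaults] setting such as the one above across many cluster hosts is usually scripted. The following is a minimal Python sketch, assuming the configuration can be handled as plain text. The helper name set_libdefault is hypothetical, and it only rewrites the text passed to it rather than touching any file on disk.

```python
def set_libdefault(conf_text: str, key: str, value: str) -> str:
    """Return conf_text with `key = value` set in the [libdefaults] section."""
    out = []
    in_section = False     # currently inside [libdefaults]?
    found_section = False  # has [libdefaults] been seen at all?
    for line in conf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            in_section = stripped == "[libdefaults]"
            out.append(line)
            if in_section and not found_section:
                # Write the setting directly under the section header.
                out.append(f"    {key} = {value}")
                found_section = True
            continue
        if in_section and stripped.split("=", 1)[0].strip() == key:
            continue  # drop any previous value of the same key
        out.append(line)
    if not found_section:
        # No [libdefaults] section yet: create one at the top of the file.
        out = ["[libdefaults]", f"    {key} = {value}", ""] + out
    return "\n".join(out) + "\n"


sample = "[libdefaults]\n    default_realm = CLUSTER.EXAMPLE.COM\n\n[realms]\n"
print(set_libdefault(sample, "udp_preference_limit", "1"))
```

Because the helper removes any earlier value before writing the new one, running it twice yields the same text, which makes it safe to reapply from a configuration management tool.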

In general, Cloudera recommends setting up the Active Directory domain controller (Microsoft Server Domain Services) on the same subnet as the cluster and never over a WAN connection. Separating the cluster from the KDC running on the Active Directory domain controller results in considerable latency and affects cluster performance.

Troubleshooting cluster operations when Active Directory is being used for Kerberos authentication requires administrative access to the Microsoft Server Domain Services instance. Administrators may need to enable Kerberos event logging on the Microsoft Server KDC to resolve issues.

Deleting Cloudera Manager roles or nodes requires manually deleting the associated Active Directory accounts. Cloudera Manager cannot delete entries from Active Directory.

Identity Integration with Active Directory

A core requirement for enabling Kerberos security in the platform is that users have accounts on all cluster processing nodes. Commercial products such as Centrify or Quest Authentication Services (QAS) provide integration of all cluster hosts for user and group resolution to Active Directory. These tools support automated Kerberos authentication on login by users to a Linux host with AD. For sites not using Active Directory, or sites wanting to use an open source solution, the System Security Services Daemon (SSSD) can be used with either AD or OpenLDAP-compatible directory services and MIT Kerberos for the same needs.

For third-party providers, you may have to purchase licenses from the respective vendors. This requires some planning, because it takes time to procure the licenses and deploy the products on a cluster. Take care that the identity management product does not associate the service principal names (SPNs) with the host principals when the computers are joined to the AD domain. For example, Centrify by default associates the HTTP SPN with the host principal, so the HTTP SPN should be specifically excluded when the hosts are joined to the domain.

You will also need to complete the following setup tasks in AD:
  • Active Directory Organizational Unit (OU) and OU user - A separate OU in Active Directory should be created along with an account that has privileges to create additional accounts in that OU.

  • Enable SSL for AD - Cloudera Manager should be able to connect to AD on the LDAPS (TCP 636) port.

  • Principals and Keytabs - In a direct-to-AD deployment that is set up using the Kerberos wizard, by default, all required principals and keytabs will be created, deployed, and managed by Cloudera Manager. However, if for some reason you cannot allow Cloudera Manager to manage your direct-to-AD deployment, then unique accounts should be manually created in AD for each service running on each host, and keytab files must be provided for those accounts. These accounts should have the AD User Principal Name (UPN) set to service/fqdn@REALM, and the Service Principal Name (SPN) set to service/fqdn. The principal name in the keytab files should be the UPN of the account. The keytab files should follow the naming convention: servicename_fqdn.keytab. For the principals and keytab files that must be created for each host they run on, see Hadoop Users (user:group) and Kerberos Principals.

  • AD Bind Account - Create an AD account that will be used for LDAP bindings in Hue, Cloudera Manager and Cloudera Navigator.

  • AD Groups for Privileged Users - Create AD groups and add members for the authorized users, HDFS admins and HDFS superuser groups.
    • Authorized users – A group consisting of all users that need access to the cluster
    • HDFS admins – Groups of users that will run HDFS administrative commands
    • HDFS super users – Group of users that require superuser privilege, that is, read/write access to all data and directories in HDFS

      Putting regular users into the HDFS superuser group is not recommended. Instead, an account that administrators escalate issues to should be part of the HDFS superuser group.

  • AD Groups for Role-Based Access to Cloudera Manager and Cloudera Navigator - Create AD groups and add members to these groups so you can later configure role-based access to Cloudera Manager and Cloudera Navigator.

  • AD Test Users and Groups - At least one existing AD user and the group that the user belongs to should be provided to test whether authorization rules work as expected.
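
The UPN, SPN, and keytab naming rules from the Principals and Keytabs item above can be sketched in Python as follows. The service name, host name, and realm are placeholders, and service_account is a hypothetical helper, not part of Cloudera Manager.

```python
from typing import NamedTuple


class ServiceAccount(NamedTuple):
    upn: str          # AD User Principal Name
    spn: str          # Service Principal Name
    keytab_file: str  # keytab file name


def service_account(service: str, fqdn: str, realm: str) -> ServiceAccount:
    """Apply the service/fqdn@REALM naming rules for one service on one host."""
    spn = f"{service}/{fqdn}"
    return ServiceAccount(
        upn=f"{spn}@{realm}",
        spn=spn,
        keytab_file=f"{service}_{fqdn}.keytab",
    )


acct = service_account("hdfs", "node1.example.com", "AD.EXAMPLE.COM")
print(acct.upn)          # hdfs/node1.example.com@AD.EXAMPLE.COM
print(acct.spn)          # hdfs/node1.example.com
print(acct.keytab_file)  # hdfs_node1.example.com.keytab
```

Generating the three names from one function keeps them consistent, which matters because the principal name stored in each keytab must match the UPN of the account.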

Using TLS/SSL for Secure Keytab Distribution

The Kerberos keytab file is transmitted among the hosts in the Cloudera Manager cluster, between Cloudera Manager Server and Cloudera Manager Agent hosts. To keep this sensitive data secure, configure Cloudera Manager Server and the Cloudera Manager Agent hosts for encrypted communications using TLS/SSL. See Encrypting Data in Transit for details.

Using the Wizard or Manual Process to Configure Kerberos Authentication

Cloudera does not provide a Kerberos implementation but uses an existing Kerberos deployment to authenticate services and users. The Kerberos server may be set up exclusively for use by the cluster (for example, Local MIT KDC) or may be a distributed Kerberos deployment used by other applications in the organization.

Regardless of the deployment model, the Kerberos instance must be operational before the cluster can be configured to use it. In addition, the cluster itself should also be operational and ideally, configured to use TLS/SSL for Cloudera Manager Server and Cloudera Manager Agent hosts, as mentioned above.

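When you are ready to integrate the cluster with your organization's MIT KDC or Active Directory KDC, you can do so using the wizard provided in Cloudera Manager Server or by following a manual process. See Enabling Kerberos Authentication Using the Wizard or How to Configure Clusters to Use Kerberos for Authentication, respectively.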

Authentication Mechanisms used by Cluster Components

Component or Product          Authentication Mechanism Supported
Accumulo                      Kerberos (partial)
Backup and Disaster Recovery  Kerberos (used to authenticate Cloudera Manager to Kerberos-protected services), LDAP, SAML
Cloudera Manager              Kerberos (used to authenticate Cloudera Manager to Kerberos-protected services), LDAP, SAML
Cloudera Navigator            Active Directory, OpenLDAP, SAML
Flume                         Kerberos (starting CDH 5.4)
HBase                         Kerberos, user-based authentication required for HBase Thrift and REST clients
HDFS                          Kerberos, SPNEGO (HttpFS)
HiveServer                    None
HiveServer2                   Kerberos, LDAP, Custom/pluggable authentication
Hive Metastore                Kerberos
Hue                           Kerberos, LDAP, SAML, Custom/pluggable authentication
Impala                        Kerberos, LDAP, SPNEGO (Impala Web Console)
Kudu                          Kerberos
MapReduce                     Kerberos (also see HDFS)
Oozie                         Kerberos, SPNEGO
Pig                           Kerberos
Search                        Kerberos, SPNEGO
Sentry                        Kerberos
Spark                         Kerberos
Sqoop                         Kerberos
Sqoop2                        Kerberos (as of CDH 5.4)
YARN                          Kerberos (also see HDFS)
ZooKeeper                     Kerberos