1.1. Kerberos Overview

Establishing identity with strong authentication is the basis for secure access in Hadoop. Users need to be able to reliably “identify” themselves and then have that identity propagated throughout the Hadoop cluster. Once this is done those users can access resources (such as files or directories) or interact with the cluster (like running MapReduce jobs). As well, Hadoop cluster resources themselves (such as Hosts and Services) need to authenticate with each other to avoid potential malicious systems “posing as” part of the cluster to gain access to data.

To create that secure communication among its various components, Hadoop uses Kerberos. Kerberos is a third party authentication mechanism, in which users and services that users want to access rely on a third party - the Kerberos server - to authenticate each to the other. The Kerberos server itself is known as the Key Distribution Center, or KDC. At a high level, it has three parts:

  • A database of the users and services (known as principals) that it knows about and their respective Kerberos passwords

  • An authentication server (AS) which performs the initial authentication and issues a Ticket Granting Ticket (TGT)

  • A Ticket Granting Server (TGS) that issues subsequent service tickets based on the initial TGT

A user principal requests authentication from the AS. The AS returns a TGT that is encrypted using the user principal's Kerberos password, which is known only to the user principal and the AS. The user principal decrypts the TGT locally using its Kerberos password, and from that point forward, until the ticket expires, the user principal can use the TGT to get service tickets from the TGS. Service tickets are what allow a principal to access various services.

Because cluster resources (hosts or services) cannot provide a password each time to decrypt the TGT, they use a special file, called a keytab, which contains the resource principal's authentication credentials.

The set of hosts, users, and services over which the Kerberos server has control is called a realm.


Table 1.1. Kerberos terminology

Key Distribution Center, or KDCThe trusted source for authentication in a Kerberos-enabled environment.
Kerberos KDC ServerThe machine, or server, that serves as the Key Distribution Center.
Kerberos ClientAny machine in the cluster that authenticates against the KDC.
PrincipalThe unique name of a user or service that authenticates against the KDC.
KeytabA file that includes one or more principals and their keys.
RealmThe Kerberos network that includes a KDC and a number of Clients.