This section describes how to install the Hadoop Core components: HDFS, YARN, and MapReduce.
Complete the following instructions to install Hadoop Core components:
Set the default file and directory creation mask (umask) to 0022.
Use the umask command to confirm the current value and set it if necessary.
Ensure that the umask is set for all terminal sessions that you use during installation.
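A quick check of the umask, as a sketch (run this in every shell session you use for the installation):

```shell
# Set the mask for the current session, then confirm it.
umask 0022
umask            # should print 0022
# To apply it to future sessions as well, you could add "umask 0022" to
# ~/.bash_profile (the file path is an assumption; adjust for your shell).
```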
Execute the following command on all cluster nodes.
For RHEL/CentOS/Oracle Linux:
yum install hadoop hadoop-hdfs hadoop-libhdfs hadoop-yarn hadoop-mapreduce hadoop-client openssl
For SLES:
zypper install hadoop hadoop-hdfs hadoop-libhdfs hadoop-yarn hadoop-mapreduce hadoop-client openssl
For Ubuntu/Debian:
apt-get install hadoop hadoop-hdfs libhdfs0 hadoop-yarn hadoop-mapreduce hadoop-client openssl
Make the following compression libraries available on all the cluster nodes.
Install Snappy on all the nodes in your cluster. At each node:
For RHEL/CentOS/Oracle Linux:
yum install snappy snappy-devel
For SLES:
zypper install snappy snappy-devel
For Ubuntu/Debian:
apt-get install libsnappy1 libsnappy-dev
Create directories and configure ownership and permissions on the appropriate hosts as described below.
If any of these directories already exist, we recommend deleting and recreating them. Use the following instructions to create appropriate directories:
We strongly suggest that you edit and source the bash script files included with the HDP companion files.
Alternatively, you can copy the contents to your ~/.bash_profile to set up these environment variables.
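If you are not using the companion files, a minimal sketch of the variables used in the commands below might look like this. The values shown are this document's own examples, not requirements; adjust them for your cluster before sourcing.

```shell
# Sketch of the environment variables referenced by the directory-creation
# commands below. All values are the documentation's example values.
export HDFS_USER=hdfs
export YARN_USER=yarn
export MAPRED_USER=mapred
export HADOOP_GROUP=hadoop
export DFS_NAME_DIR="/grid/hadoop/hdfs/nn /grid1/hadoop/hdfs/nn"
export FS_CHECKPOINT_DIR="/grid/hadoop/hdfs/snn"
export DFS_DATA_DIR="/grid/hadoop/hdfs/dn"
export YARN_LOCAL_DIR="/grid/hadoop/yarn/local"
export YARN_LOCAL_LOG_DIR="/grid/hadoop/yarn/logs"
export HDFS_LOG_DIR=/var/log/hadoop/hdfs
export YARN_LOG_DIR=/var/log/hadoop/yarn
export MAPRED_LOG_DIR=/var/log/hadoop/mapred
export HDFS_PID_DIR=/var/run/hadoop/hdfs
export YARN_PID_DIR=/var/run/hadoop/yarn
export MAPRED_PID_DIR=/var/run/hadoop/mapred
```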
On the node that hosts the NameNode service, execute the following commands:
mkdir -p $DFS_NAME_DIR; chown -R $HDFS_USER:$HADOOP_GROUP $DFS_NAME_DIR; chmod -R 755 $DFS_NAME_DIR;
Where:
$DFS_NAME_DIR is the space-separated list of directories where the NameNode stores the file system image. For example, /grid/hadoop/hdfs/nn /grid1/hadoop/hdfs/nn.
$HDFS_USER is the user owning the HDFS services. For example, hdfs.
$HADOOP_GROUP is a common group shared by services. For example, hadoop.
On all the nodes that can potentially run the SecondaryNameNode service, execute the following commands:
mkdir -p $FS_CHECKPOINT_DIR; chown -R $HDFS_USER:$HADOOP_GROUP $FS_CHECKPOINT_DIR; chmod -R 755 $FS_CHECKPOINT_DIR;
where:
$FS_CHECKPOINT_DIR is the space-separated list of directories where the SecondaryNameNode should store the checkpoint image. For example, /grid/hadoop/hdfs/snn /grid1/hadoop/hdfs/snn /grid2/hadoop/hdfs/snn.
$HDFS_USER is the user owning the HDFS services. For example, hdfs.
$HADOOP_GROUP is a common group shared by services. For example, hadoop.
At each DataNode, execute the following commands:
mkdir -p $DFS_DATA_DIR; chown -R $HDFS_USER:$HADOOP_GROUP $DFS_DATA_DIR; chmod -R 750 $DFS_DATA_DIR;
where:
$DFS_DATA_DIR is the space-separated list of directories where DataNodes should store the blocks. For example, /grid/hadoop/hdfs/dn /grid1/hadoop/hdfs/dn /grid2/hadoop/hdfs/dn.
$HDFS_USER is the user owning the HDFS services. For example, hdfs.
$HADOOP_GROUP is a common group shared by services. For example, hadoop.
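Because $DFS_DATA_DIR is expanded unquoted, the shell splits the space-separated list into multiple arguments, so mkdir, chown, and chmod act on every directory in the list. A sketch of verifying the result (the paths here are illustrative placeholders, and chown is omitted because it requires root):

```shell
# Illustrative paths only; on a real DataNode use your $DFS_DATA_DIR list.
DFS_DATA_DIR="/tmp/demo/hdfs/dn1 /tmp/demo/hdfs/dn2"
mkdir -p $DFS_DATA_DIR            # unquoted on purpose: one directory per entry
chmod -R 750 $DFS_DATA_DIR
for d in $DFS_DATA_DIR; do
  stat -c '%a %n' "$d"            # expect mode 750 for each directory
done
```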
At the ResourceManager and all DataNodes, execute the following commands:
mkdir -p $YARN_LOCAL_DIR; chown -R $YARN_USER:$HADOOP_GROUP $YARN_LOCAL_DIR; chmod -R 755 $YARN_LOCAL_DIR;
where:
$YARN_LOCAL_DIR is the space-separated list of directories where YARN should store temporary data. For example, /grid/hadoop/yarn/local /grid1/hadoop/yarn/local /grid2/hadoop/yarn/local.
$YARN_USER is the user owning the YARN services. For example, yarn.
$HADOOP_GROUP is a common group shared by services. For example, hadoop.
On the ResourceManager and all DataNodes, execute the following commands:
mkdir -p $YARN_LOCAL_LOG_DIR; chown -R $YARN_USER:$HADOOP_GROUP $YARN_LOCAL_LOG_DIR; chmod -R 755 $YARN_LOCAL_LOG_DIR;
where:
$YARN_LOCAL_LOG_DIR is the space-separated list of directories where YARN should store container log data. For example, /grid/hadoop/yarn/logs /grid1/hadoop/yarn/logs /grid2/hadoop/yarn/logs.
$YARN_USER is the user owning the YARN services. For example, yarn.
$HADOOP_GROUP is a common group shared by services. For example, hadoop.
At all nodes, execute the following commands:
mkdir -p $HDFS_LOG_DIR; chown -R $HDFS_USER:$HADOOP_GROUP $HDFS_LOG_DIR; chmod -R 755 $HDFS_LOG_DIR;
where:
$HDFS_LOG_DIR is the directory for storing the HDFS logs.
This directory name is a combination of a directory and the $HDFS_USER. For example, /var/log/hadoop/hdfs, where hdfs is the $HDFS_USER.
$HDFS_USER is the user owning the HDFS services. For example, hdfs.
$HADOOP_GROUP is a common group shared by services. For example, hadoop.
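The pattern above, where the log directory path ends in the service user's name, can be expressed directly in terms of the variables (a sketch using this document's example values):

```shell
# The log directory embeds the service user name.
HDFS_USER=hdfs
HDFS_LOG_DIR=/var/log/hadoop/$HDFS_USER
echo "$HDFS_LOG_DIR"    # prints /var/log/hadoop/hdfs
```

The same construction applies to $YARN_LOG_DIR, the PID directories, and the MapReduce directories below.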
mkdir -p $YARN_LOG_DIR; chown -R $YARN_USER:$HADOOP_GROUP $YARN_LOG_DIR; chmod -R 755 $YARN_LOG_DIR;
where:
$YARN_LOG_DIR is the directory for storing the YARN logs.
This directory name is a combination of a directory and the $YARN_USER. For example, /var/log/hadoop/yarn, where yarn is the $YARN_USER.
$YARN_USER is the user owning the YARN services. For example, yarn.
$HADOOP_GROUP is a common group shared by services. For example, hadoop.
mkdir -p $HDFS_PID_DIR; chown -R $HDFS_USER:$HADOOP_GROUP $HDFS_PID_DIR; chmod -R 755 $HDFS_PID_DIR
where:
$HDFS_PID_DIR is the directory for storing the HDFS process ID.
This directory name is a combination of a directory and the $HDFS_USER. For example, /var/run/hadoop/hdfs where hdfs is the $HDFS_USER.
$HDFS_USER is the user owning the HDFS services. For example, hdfs.
$HADOOP_GROUP is a common group shared by services. For example, hadoop.
mkdir -p $YARN_PID_DIR; chown -R $YARN_USER:$HADOOP_GROUP $YARN_PID_DIR; chmod -R 755 $YARN_PID_DIR;
where:
$YARN_PID_DIR is the directory for storing the YARN process ID.
This directory name is a combination of a directory and the $YARN_USER. For example, /var/run/hadoop/yarn where yarn is the $YARN_USER.
$YARN_USER is the user owning the YARN services. For example, yarn.
$HADOOP_GROUP is a common group shared by services. For example, hadoop.
mkdir -p $MAPRED_LOG_DIR; chown -R $MAPRED_USER:$HADOOP_GROUP $MAPRED_LOG_DIR; chmod -R 755 $MAPRED_LOG_DIR;
where:
$MAPRED_LOG_DIR is the directory for storing the JobHistory Server logs.
This directory name is a combination of a directory and the $MAPRED_USER. For example, /var/log/hadoop/mapred where mapred is the $MAPRED_USER.
$MAPRED_USER is the user owning the MAPRED services. For example, mapred.
$HADOOP_GROUP is a common group shared by services. For example, hadoop.
mkdir -p $MAPRED_PID_DIR; chown -R $MAPRED_USER:$HADOOP_GROUP $MAPRED_PID_DIR; chmod -R 755 $MAPRED_PID_DIR;
where:
$MAPRED_PID_DIR is the directory for storing the JobHistory Server process ID.
This directory name is a combination of a directory and the $MAPRED_USER. For example, /var/run/hadoop/mapred where mapred is the $MAPRED_USER.
$MAPRED_USER is the user owning the MAPRED services. For example, mapred.
$HADOOP_GROUP is a common group shared by services. For example, hadoop.
Important: HDP 2.2 installs hdp-select automatically with the installation or upgrade of the first HDP component. If you have not already upgraded ZooKeeper, hdp-select has not been installed.
To prevent version-specific directory issues for your scripts and updates, Hortonworks provides hdp-select, a script that symlinks directories to hdp-current and modifies paths for configuration directories.
Run hdp-select set all on the NameNode and on all DataNodes:
hdp-select set all 2.2.0.0-<$version>
For example:
/usr/bin/hdp-select set all 2.2.0.0-2041