Understanding Custom Installation Solutions

Cloudera hosts two types of software repositories that you can use to install products such as Cloudera Manager or CDH: parcel repositories and package repositories (RPM packages for RHEL and SLES, and Debian/Ubuntu packages).

These repositories are effective solutions in most cases, but custom installation solutions are sometimes required. Using the software repositories requires client access over the Internet and results in the installation of the latest version of products. An alternative solution is required if:
  • You need to install older product versions. For example, in a CDH cluster, all hosts must run the same CDH version. After completing an initial installation, you may want to add hosts, either to increase the size of your cluster to handle larger tasks or to replace older hardware. Those new hosts must run the version already installed on the cluster, which may no longer be the latest version in the Cloudera repository.
  • The hosts on which you want to install Cloudera products are not connected to the Internet, so they cannot reach the Cloudera repository. (For a parcel installation, only the Cloudera Manager Server needs Internet access, but for a package installation, all cluster hosts need access to the Cloudera repository.) Some organizations partition parts of their network from outside access. Isolating segments of a network provides greater assurance that valuable data is not compromised by individuals acting out of malice or for personal gain. In such a case, the isolated computers cannot access Cloudera repositories for new installations or upgrades.
In both of these cases, using a custom repository solution allows you to meet the needs of your organization, whether that means installing older versions of Cloudera software or installing any version of Cloudera software on hosts that are disconnected from the Internet.

Understanding Parcels

Parcels are a packaging format that facilitates upgrading software from within Cloudera Manager. You can download, distribute, and activate a new software version all from within Cloudera Manager. Cloudera Manager downloads a parcel to a local directory. Once the parcel is downloaded to the Cloudera Manager Server host, an Internet connection is no longer needed to deploy the parcel. Parcels are available for CDH 4.1.3 and onwards. For detailed information about parcels, see Parcels.

If your Cloudera Manager Server does not have Internet access, you can obtain the required parcel files and put them into a parcel repository. See Creating and Using a Parcel Repository for Cloudera Manager.

Understanding Package Management

Before getting into the details of how to configure a custom package management solution in your environment, it can be useful to have more information about:
  • Package management tools
  • Package repositories

See Creating and Using a Package Repository for Cloudera Manager.

Package Management Tools

Packages (rpm or deb files) help ensure that installations complete successfully by encoding each package's dependencies. If you request the installation of a package, all the packages it requires can be installed at the same time. For example, hadoop-0.20-hive depends on hadoop-0.20. Package management tools such as yum (RHEL), zypper (SLES), and apt-get (Debian/Ubuntu) find and install any required packages. For example, on RHEL, you might enter yum install hadoop-0.20-hive. yum informs you that the hive package requires hadoop-0.20 and offers to complete that installation for you. zypper and apt-get provide similar functionality.
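
The dependency behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how yum, zypper, or apt-get are implemented: the DEPENDS table and the install_order function are hypothetical names, and real tools read dependency metadata from the packages and the repository index.

```python
# Hypothetical dependency table for illustration only. Real package
# managers read this information from rpm/deb metadata.
DEPENDS = {
    "hadoop-0.20-hive": ["hadoop-0.20"],
    "hadoop-0.20": [],
}


def install_order(package, depends):
    """Return the requested package and everything it requires,
    ordered so that each dependency comes before its dependents."""
    order = []
    visited = set()

    def visit(name):
        # Skip packages already handled (also guards against cycles).
        if name in visited:
            return
        visited.add(name)
        for dep in depends.get(name, []):
            visit(dep)
        order.append(name)

    visit(package)
    return order


print(install_order("hadoop-0.20-hive", DEPENDS))
```

Requesting hadoop-0.20-hive yields hadoop-0.20 first, which mirrors yum offering to install the required package along with the one you asked for.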

Package Repositories

Package management tools operate on package repositories.

Repository Configuration Files

Information about package repositories is stored in configuration files, the location of which varies according to the package management tool.
  • RHEL/CentOS yum - /etc/yum.repos.d
  • SLES zypper - /etc/zypp/zypper.conf (Repositories are defined in *.repo files in the /etc/zypp/repos.d/ directory.)
  • Debian/Ubuntu apt-get - /etc/apt/apt.conf (Repositories are specified in /etc/apt/sources.list and in *.list files in the /etc/apt/sources.list.d/ directory.)
For example, on a typical CentOS system, you might find:
[user@localhost ~]$ ls -l /etc/yum.repos.d/
total 24
-rw-r--r-- 1 root root 2245 Apr 25  2010 CentOS-Base.repo
-rw-r--r-- 1 root root  626 Apr 25  2010 CentOS-Media.repo
The .repo files contain pointers to one or more repositories. There are similar pointers inside configuration files for zypper and apt-get. In the following snippet from CentOS-Base.repo, there are two repositories defined: one named Base and one named Updates. The mirrorlist parameter points to a URL that returns a list of mirror sites from which the repository can be downloaded.
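# ...
[base]
name=CentOS-$releasever - Base
mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=os

#released updates
[updates]
name=CentOS-$releasever - Updates
mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=updates
# ...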
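
Because .repo files use an INI-style layout, their contents can be inspected with Python's standard configparser module. The following is a minimal sketch under stated assumptions: SAMPLE_REPO is sample text modeled on a stock CentOS-Base.repo file (the section IDs and mirrorlist URLs are assumptions, not taken from a specific system), and list_repositories is a hypothetical helper name.

```python
import configparser

# Sample text modeled on a stock CentOS-Base.repo file (an assumption
# for illustration; check the file on your own system).
SAMPLE_REPO = """\
[base]
name=CentOS-$releasever - Base
mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=os

#released updates
[updates]
name=CentOS-$releasever - Updates
mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=updates
"""


def list_repositories(repo_text):
    """Map each repository ID (the [section] name) to its name= value."""
    # Interpolation is disabled so that values are returned verbatim.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(repo_text)
    return {
        section: parser.get(section, "name", fallback="")
        for section in parser.sections()
    }


repos = list_repositories(SAMPLE_REPO)
for repo_id, name in repos.items():
    print(repo_id, "->", name)
```

To inspect a real file, you could pass the text of a file from /etc/yum.repos.d/ to list_repositories instead of the sample string. Variables such as $releasever are left unexpanded, because yum substitutes them at run time.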

Listing Repositories

You can list the repositories you have enabled. The command varies according to operating system:
  • RHEL/CentOS - yum repolist
  • SLES - zypper repos
  • Debian/Ubuntu - apt-get does not include a command to display sources, but you can determine sources by reviewing the contents of /etc/apt/sources.list and any files contained in /etc/apt/sources.list.d/.
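
Since the Debian/Ubuntu sources have to be read from files, a short script can summarize them. The sketch below assumes the traditional one-line format (type, URI, suite, components) without bracketed options; the sample entries use a placeholder host and are hypothetical, as is the parse_sources helper name.

```python
# Hypothetical sample content for illustration only.
SAMPLE_SOURCES = """\
# Main archive
deb http://archive.example.com/ubuntu precise main restricted
deb-src http://archive.example.com/ubuntu precise main
"""


def parse_sources(text):
    """Parse one-line-style apt source entries into dictionaries.

    Assumes entries of the form: type uri suite [component ...]
    Lines with bracketed options are not handled by this sketch.
    """
    entries = []
    for line in text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if fields[0] not in ("deb", "deb-src") or len(fields) < 3:
            continue
        entries.append({
            "type": fields[0],
            "uri": fields[1],
            "suite": fields[2],
            "components": fields[3:],
        })
    return entries


entries = parse_sources(SAMPLE_SOURCES)
for entry in entries:
    print(entry["type"], entry["uri"], entry["suite"], entry["components"])
```

On a real host, you could apply the same function to the text of /etc/apt/sources.list and of each *.list file in /etc/apt/sources.list.d/.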
The following shows an example of what you might find in the yum repolist output on a CentOS system:
[root@localhost yum.repos.d]$ yum repolist
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
 * addons: mirror.san.fastserv.com
 * base: centos.eecs.wsu.edu
 * extras: mirrors.ecvps.com
 * updates: mirror.5ninesolutions.com
repo id                        repo name                                 status
addons                         CentOS-5 - Addons                         enabled:     0
base                           CentOS-5 - Base                           enabled: 3,434
extras                         CentOS-5 - Extras                         enabled:   296
updates                        CentOS-5 - Updates                        enabled: 1,137
repolist: 4,867
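
If you need the repository list in a script, the table can be parsed from the command output. The following sketch works on the sample output shown above; the parse_repolist helper is a hypothetical name, and the column layout is an assumption that holds for this sample but differs between yum versions.

```python
import re

# The table portion of the sample output shown above.
SAMPLE_OUTPUT = """\
repo id                        repo name                                 status
addons                         CentOS-5 - Addons                         enabled:     0
base                           CentOS-5 - Base                           enabled: 3,434
extras                         CentOS-5 - Extras                         enabled:   296
updates                        CentOS-5 - Updates                        enabled: 1,137
repolist: 4,867
"""

# Repo ID, repo name, then "enabled:" followed by a package count.
ROW = re.compile(r"^(\S+)\s+(.*?)\s+enabled:\s*([\d,]+)\s*$")


def parse_repolist(output):
    """Return a dict mapping each enabled repo ID to its package count."""
    counts = {}
    for line in output.splitlines():
        match = ROW.match(line)
        if match:
            counts[match.group(1)] = int(match.group(3).replace(",", ""))
    return counts


counts = parse_repolist(SAMPLE_OUTPUT)
print(counts)
print("total:", sum(counts.values()))
```

The per-repository counts add up to the 4,867 reported on the final repolist line, which is the total number of packages available from the enabled repositories.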