Installing and Upgrading Cloudera Data Science Workbench 1.3.x
This topic walks you through the installation and upgrade paths available for Cloudera Data Science Workbench 1.3.x. It also describes the steps needed to configure your cluster gateway hosts and block devices before you can begin installing the Cloudera Data Science Workbench parcel/package.
Installing Cloudera Data Science Workbench 1.3.x
- Using a Custom Service Descriptor (CSD) and Parcel - Starting with version 1.2.x, Cloudera Data Science Workbench is available as an add-on service for Cloudera Manager. Two files are required for this type of installation: a CSD JAR file that contains all the configuration needed to describe and manage the new Cloudera Data Science Workbench service, and the Cloudera Data Science Workbench parcel. To install this service, first download and copy the CSD file to the Cloudera Manager Server host. Then use Cloudera Manager to distribute the Cloudera Data Science Workbench parcel to the relevant gateway nodes.
or
- Using a Package - Alternatively, you can install the Cloudera Data Science Workbench package directly on the CDH cluster's gateway nodes. In this case, the Cloudera Data Science Workbench service will not be available in Cloudera Manager.
To begin the installation process, continue reading Required Pre-Installation Steps.
Upgrading to the Latest Version of Cloudera Data Science Workbench 1.3.x
- Upgrading an existing CSD-based deployment to the latest 1.3.x CSD and parcel. For instructions, see Upgrading a CSD-based Deployment to the Latest 1.3.x CSD.
- Migrating from an RPM-based deployment to the latest 1.3.x CSD and parcel-based deployment. For instructions, see Migrating from an RPM-based Deployment to the Latest 1.3.x CSD.
- Upgrading an existing RPM-based deployment to the latest 1.3.x RPM. Note that you cannot use Cloudera Manager for this upgrade path. For instructions, see Upgrading to the Latest Version of Cloudera Data Science Workbench 1.3.x Using Packages.
Airgapped Installations
Organizations sometimes isolate parts of their network from the Internet for security reasons. Isolating network segments helps ensure that valuable data is not compromised by individuals acting out of malice or for personal gain. However, isolated hosts cannot reach the Cloudera repositories for new installations or upgrades. As of version 1.1.1, Cloudera Data Science Workbench supports installation on CDH clusters that are not connected to the Internet.
For CSD-based installs in an airgapped environment, put the Cloudera Data Science Workbench parcel into a new hosted or local parcel repository, and then configure the Cloudera Manager Server to target this newly-created repository.
Rolling Back Cloudera Data Science Workbench to a Previous Version
All stateful data for Cloudera Data Science Workbench is stored in the /var/lib/cdsw directory on the master node. The contents of this directory are forward compatible, which is what makes upgrades possible. However, they are not backward compatible. Therefore, to roll back Cloudera Data Science Workbench to a previous version, you must have a backup of the /var/lib/cdsw directory that was taken before the last upgrade.
- Depending on your deployment, either uninstall the RPM or deactivate the current CDSW parcel in Cloudera Manager.
- On the master node, restore the backup copy you have of /var/lib/cdsw. Note that any changes made after this backup will be lost.
- Install a version of Cloudera Data Science Workbench that is equal to or greater than the version of the /var/lib/cdsw backup.
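The version rule in the last step can be checked before you reinstall. The following is a minimal sketch, assuming version strings are purely numeric and dot-separated (for example 1.2.2 or 1.3.0); the function names are hypothetical and are not part of any Cloudera tooling.

```python
def parse_version(version, width=4):
    """Turn a dotted version string such as '1.3.0' into a padded tuple of ints.

    Assumption: every component is numeric. Short versions are padded with
    zeros so that '1.3' and '1.3.0' compare as equal.
    """
    parts = [int(part) for part in version.strip().split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def can_install_over_backup(backup_version, install_version):
    """Return True if install_version is equal to or greater than backup_version."""
    return parse_version(install_version) >= parse_version(backup_version)


if __name__ == "__main__":
    # A backup taken on 1.2.2 can be used with a 1.3.0 install, but not the reverse.
    print(can_install_over_backup("1.2.2", "1.3.0"))
    print(can_install_over_backup("1.3.0", "1.2.2"))
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as treating 1.10 as older than 1.9.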
Required Pre-Installation Steps
The rest of this topic describes the steps you should take to review your platforms and configure your gateway hosts before you begin to install Cloudera Data Science Workbench.
Review Requirements and Supported Platforms
Review the complete list of Cloudera Data Science Workbench 1.3.x Requirements and Supported Platforms before you proceed with the installation.
Set Up a Wildcard DNS Subdomain
Cloudera Data Science Workbench uses subdomains to isolate user-generated HTML and JavaScript, and to route requests between services. To set up subdomains for Cloudera Data Science Workbench, configure your DNS server with an A record for a wildcard DNS name such as *.cdsw.<your_domain>.com that points to the master host, and a second A record for the root entry, cdsw.<your_domain>.com.
For example, if your master IP address is 172.46.47.48, you'd configure two A records as follows:
cdsw.<your_domain>.com. IN A 172.46.47.48
*.cdsw.<your_domain>.com. IN A 172.46.47.48
You can also use a wildcard CNAME record if it is supported by your DNS provider.
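Once the records are in place, a quick check that both the root name and an arbitrary subdomain resolve to the master host can catch DNS mistakes early. The sketch below is illustrative only: the function names and the wildcard-check label are hypothetical, and the check assumes A records (IPv4) as in the example above.

```python
import socket


def cdsw_a_records(domain, master_ip):
    """Return the root and wildcard A records in zone-file notation."""
    return [
        f"{domain}. IN A {master_ip}",
        f"*.{domain}. IN A {master_ip}",
    ]


def wildcard_resolves(domain, master_ip, resolve=socket.gethostbyname):
    """Check that the root name and an arbitrary subdomain resolve to master_ip.

    'wildcard-check' is an arbitrary label chosen for this sketch; any label
    without its own explicit record should be answered by the wildcard entry.
    The resolver is injectable so the logic can be exercised without a network.
    """
    names = [domain, f"wildcard-check.{domain}"]
    try:
        return all(resolve(name) == master_ip for name in names)
    except OSError:
        # socket.gaierror (name not found) is a subclass of OSError.
        return False
```

For example, calling wildcard_resolves with your own domain and the master IP address returns True only if both lookups return that address.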
Disable Untrusted SSH Access
Cloudera Data Science Workbench assumes that users only access the gateway hosts through the web application. Untrusted users with SSH access to a Cloudera Data Science Workbench host can gain full access to the cluster, including access to other users' workloads. Therefore, untrusted (non-sudo) SSH access to Cloudera Data Science Workbench hosts must be disabled to ensure a secure deployment.
For more information on the security capabilities of Cloudera Data Science Workbench, see the Cloudera Data Science Workbench Security Guide.
Configure Block Devices
Docker Block Device
The Cloudera Data Science Workbench installer will format and mount the Docker block devices on each gateway host. Make sure there is no important data stored on these devices. Do not mount these block devices prior to installation.
Every Cloudera Data Science Workbench gateway host must have one or more block devices with at least 1 TB dedicated to storage of Docker images. The Docker block devices store the Cloudera Data Science Workbench Docker images, including the Python, R, and Scala engines. Each engine image can occupy 15 GB.
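Because the Docker block devices must not be mounted before installation, it can help to compare the devices you plan to use against the host's current mount table. The following is a rough sketch that parses text in the /proc/mounts format (device, mount point, file system type, options, and so on); the device names in the example are hypothetical.

```python
def mounted_devices(mounts_text):
    """Return the set of device names found in /proc/mounts-style text."""
    devices = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if fields:
            # The first field of each entry is the mounted device or source.
            devices.add(fields[0])
    return devices


def already_mounted(candidates, mounts_text):
    """Return the candidate devices that appear as mounted, sorted by name."""
    return sorted(set(candidates) & mounted_devices(mounts_text))


if __name__ == "__main__":
    # On a Linux host, read the live mount table.
    with open("/proc/mounts") as mounts:
        print(already_mounted(["/dev/xvde"], mounts.read()))
```

This only matches exact device names. It does not detect mounted partitions of a device (such as /dev/xvde1) or devices referenced through symbolic links, so treat an empty result as a hint rather than proof.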
Application Block Device or Mount Point
The master host for Cloudera Data Science Workbench requires at least 500 GB for database and project storage. The capacity you actually need depends on the expected number of users and projects on the cluster. While large data files should be stored on HDFS, it is not uncommon to find gigabytes of data or libraries in individual projects. Running out of storage will cause the application to fail. Cloudera recommends allocating at least 5 GB per project and at least 1 TB of storage in total. Continue to monitor disk space usage and I/O carefully using Cloudera Manager.
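The sizing guidance above reduces to simple arithmetic: take the larger of 1 TB and 5 GB multiplied by the expected number of projects. The sketch below illustrates this, assuming decimal units (1 GB = 10^9 bytes); the function names are hypothetical.

```python
import shutil

GB = 10**9   # assumption: decimal gigabytes
TB = 10**12  # assumption: decimal terabytes


def recommended_app_storage_bytes(expected_projects):
    """Return the larger of 1 TB and 5 GB per expected project, in bytes."""
    return max(1 * TB, expected_projects * 5 * GB)


def has_enough_space(path, required_bytes):
    """Return True if the file system holding path is at least required_bytes in size."""
    return shutil.disk_usage(path).total >= required_bytes


if __name__ == "__main__":
    needed = recommended_app_storage_bytes(400)
    print(f"400 projects: {needed / TB:.1f} TB recommended")
    print(has_enough_space("/var/lib/cdsw", needed))
```

With 100 expected projects the 1 TB floor applies, because 100 x 5 GB is only 500 GB. With 400 projects the per-project rule dominates and the estimate is 2 TB.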
Cloudera Data Science Workbench will store all application data at /var/lib/cdsw. In a CSD-based deployment, this location is not configurable. Cloudera Data Science Workbench will assume the system administrator has formatted and mounted one or more block devices to /var/lib/cdsw.
Regardless of the application data storage configuration you choose, /var/lib/cdsw must be stored on a separate block device. Given typical database and user access patterns, an SSD is strongly recommended.
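One rough way to confirm that /var/lib/cdsw does not share a file system with the root volume is to compare device IDs. This is a sketch under stated assumptions: the function name is hypothetical, and a differing device ID only shows that the path is on a different file system from the reference path. It does not prove the underlying physical disk is different, for example when both are partitions or logical volumes on the same disk.

```python
import os


def on_separate_device(path, reference="/"):
    """Return True if path lives on a different file system than reference.

    Compares the st_dev field reported by os.stat for both paths.
    """
    return os.stat(path).st_dev != os.stat(reference).st_dev


if __name__ == "__main__":
    target = "/var/lib/cdsw"
    if os.path.isdir(target):
        print(on_separate_device(target))
    else:
        print(f"{target} does not exist yet")
```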
By default, data in /var/lib/cdsw is not backed up or replicated to HDFS or other nodes. A reliable storage and backup strategy is critical for production installations. See Backup and Disaster Recovery for Cloudera Data Science Workbench for more information.
Install Cloudera Data Science Workbench
To use the Cloudera Manager CSD and parcel to install Cloudera Data Science Workbench, follow the steps at Installation and Upgrade Using Cloudera Manager.
OR
To install the Cloudera Data Science Workbench package on the cluster gateway hosts, follow the steps at Installation and Upgrade Using Packages.