Installing Cloudera Data Science Workbench 1.7.2 Using Packages

Use the following steps to install Cloudera Data Science Workbench 1.7.2 using RPM packages.

Prerequisites
Configure Gateway Hosts Using Cloudera Manager
Install Cloudera Data Science Workbench on the Master Host
(Optional) Install Cloudera Data Science Workbench on Worker Hosts
Create the Administrator Account
Next Steps

Prerequisites

Before you begin installing Cloudera Data Science Workbench, make sure you have completed the steps to configure your hosts and block devices.

Configure Gateway Hosts Using Cloudera Manager

Cloudera Data Science Workbench hosts must be added to your CDH cluster as gateway hosts, with gateway roles properly configured. To configure gateway hosts:

If you have not already done so and plan to use PySpark, install either the Anaconda parcel or Python (versions 2.7.11 and 3.6.1) on your CDH cluster. For more information, see Python Supported Versions.
Configure Apache Spark on your gateway hosts.
1. (CDH 5 only) Install and configure the CDS 2.x Powered by Apache Spark parcel and CSD. For instructions, see Installing CDS 2.x Powered by Apache Spark.
  
  Important: Do not install CDS 2.x if you are using CDH 6. Spark 2 ships as part of the CDH 6 package; the add-on parcel is no longer required. To see which version of Spark 2 ships with CDH, refer to the CDH 6 Packaging documentation.
2. (Required for CDH 5 and CDH 6) To be able to use Spark 2, each user must have their own home directory in HDFS (/user/<username>). If you sign in to Hue first, these directories are created for you automatically. Alternatively, cluster administrators can create these directories:
```
hdfs dfs -mkdir /user/<username>
hdfs dfs -chown <username>:<username> /user/<username>
```
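When many users need home directories, an administrator might wrap the two commands above in a small helper. The following is a minimal sketch, not a Cloudera-provided tool: the function name `create_hdfs_home` and the `DRY_RUN` variable are hypothetical, and it assumes each user's group has the same name as the user, as in the commands above. The `-p` flag is added so that `mkdir` does not fail when the directory already exists.

```shell
# Hypothetical helper: create an HDFS home directory for one user.
# Set DRY_RUN=1 to print the commands instead of running them.
create_hdfs_home() {
  local user="$1"
  local run="${DRY_RUN:+echo}"
  # -p: do not fail if /user/<username> already exists
  $run hdfs dfs -mkdir -p "/user/${user}"
  $run hdfs dfs -chown "${user}:${user}" "/user/${user}"
}
```

Run it as a user with HDFS superuser privileges, for example `for u in alice bob; do create_hdfs_home "$u"; done`, where `alice` and `bob` are placeholder usernames.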
If you are using CDS 2.3 release 2 (or higher), review the associated known issues here: CDS Powered By Apache Spark.
Use Cloudera Manager to add gateway hosts to your CDH cluster.
1. Create a new host template that includes gateway roles for HDFS, YARN, and Spark 2.
  
  (Required for CDH 6) If you want to run workloads on dataframe-based tables, such as tables from PySpark, sparklyr, SparkSQL, or Scala, you must also add the Hive gateway role to the template.
2. Use the instructions at Adding a Host to the Cluster to add gateway hosts to the cluster. Apply the template created in the previous step to these gateway hosts. If your cluster is kerberized, confirm that the krb5.conf file on your gateway hosts is correct.

Test Spark 2 integration on the gateway hosts.

SSH to a gateway host.
If your cluster is kerberized, run kinit to authenticate to the CDH cluster’s Kerberos Key Distribution Center. The Kerberos ticket you create is not visible to Cloudera Data Science Workbench users.

Submit a test job to Spark by executing the following command:

CDH 5

spark2-submit --class org.apache.spark.examples.SparkPi --master yarn \
--deploy-mode client /opt/cloudera/parcels/SPARK2/lib/spark2/examples/jars/spark-examples*.jar 100

CDH 6

spark-submit --class org.apache.spark.examples.SparkPi --master yarn \
--deploy-mode client SPARK_HOME/lib/spark-examples*.jar 100

Sample command:

spark-submit --class org.apache.spark.examples.SparkPi --master yarn \
--deploy-mode client /opt/cloudera/parcels/CDH/lib/spark/examples/jars/spark-examples*.jar 100

View the status of the job in the CLI output or in the Spark web UI to confirm that the host you want to use for the Cloudera Data Science Workbench master functions properly as a Spark gateway.

Sample CLI output:

19/02/15 09:37:39 INFO spark.SparkContext: Running Spark version 2.4.0-cdh6.1.0
19/02/15 09:37:39 INFO spark.SparkContext: Submitted application: Spark Pi
...
19/02/15 09:37:40 INFO util.Utils: Successfully started service 'sparkDriver' on port 37050.
...
19/02/15 09:38:06 INFO scheduler.DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 18.659033 s
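To check the result from a script rather than by eye, you could capture the job output and search it for the completion message shown in the sample above. This is a rough sketch: the function name `spark_job_finished` is hypothetical, and the pattern is taken only from the sample output on this page, so it may need adjusting for other Spark versions.

```shell
# Hypothetical check: succeed if the log on stdin contains a
# "Job N finished" line like the one in the sample output.
spark_job_finished() {
  grep -Eq 'scheduler\.DAGScheduler: Job [0-9]+ finished'
}
```

For example, append `2>&1 | tee /tmp/sparkpi.log` to the submit command (Spark writes its log to standard error by default; the log path is a placeholder), then run `spark_job_finished < /tmp/sparkpi.log && echo OK`.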

Install Cloudera Data Science Workbench on the Master Host

Use the following steps to install Cloudera Data Science Workbench on the master host. Note that airgapped clusters and non-airgapped clusters use different files for installation.

Non-airgapped Installation - Download the Cloudera Data Science Workbench repo file (cloudera-cdsw.repo) from the following location:
```
https://archive.cloudera.com/p/cdsw1/1.7.2/redhat7/yum/cloudera-cdsw.repo
```
Airgapped Installation - Download the Cloudera Data Science Workbench RPM file from the following location:
```
https://archive.cloudera.com/p/cdsw1/1.7.2/redhat7/yum/RPMS/x86_64/
```
Important: Make sure all Cloudera Data Science Workbench hosts (master and worker) are running the same version of Cloudera Data Science Workbench.
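One way to script the non-airgapped download is sketched below. It rests on several assumptions not stated on this page: that `curl` is installed, that the repo file belongs in `/etc/yum.repos.d/` (the standard location for yum repository definitions), and that the archive is reachable without extra authentication; if your archive access requires credentials, this sketch does not handle them. Both function names are hypothetical.

```shell
# Hypothetical helper: print the repo file URL for a given version,
# following the URL pattern shown above.
cdsw_repo_url() {
  echo "https://archive.cloudera.com/p/cdsw1/$1/redhat7/yum/cloudera-cdsw.repo"
}

# Hypothetical helper: download the repo file into yum's repository
# directory. Defined only; call it as root on each host.
download_cdsw_repo() {
  curl -fSL -o /etc/yum.repos.d/cloudera-cdsw.repo "$(cdsw_repo_url "$1")"
}
```

Calling `download_cdsw_repo 1.7.2` with the same version on every host helps keep the master and worker hosts on the same release.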
Skip this step for airgapped installations. Add the Cloudera Public GPG repository key. This key verifies that you are downloading genuine packages.
```
sudo rpm --import https://archive.cloudera.com/p/cdsw1/1.7.2/redhat7/yum/RPM-GPG-KEY-cloudera
```
Non-airgapped Installation - Install the latest RPM with the following command:
```
sudo yum install cloudera-data-science-workbench
```
Airgapped Installation - Copy the RPM downloaded in the previous step to the appropriate gateway host. Then, use the complete filename to install the package. For example:
```
sudo yum install cloudera-data-science-workbench-1.7.2.12345.rpm
```
For guidance on any warnings displayed during the installation process, see Understanding Installation Warnings.

Edit the configuration file at /etc/cdsw/config/cdsw.conf. The following table lists the configuration properties that can be configured in cdsw.conf.

cdsw.conf Properties
Properties	Description
Required Configuration
`DOMAIN`	Wildcard DNS domain configured to point to the master host. If the wildcard DNS entries are configured as `cdsw.<company>.com` and `*.cdsw.<company>.com`, then `DOMAIN` should be set to `cdsw.<company>.com`. Users' browsers should then contact the Cloudera Data Science Workbench web application at `http://cdsw.<company>.com`. This domain is used for DNS only and is unrelated to Kerberos or LDAP domains.
`MASTER_IP`	IPv4 address for the master host that is reachable from the worker hosts. Within an AWS VPC, `MASTER_IP` should be set to the internal IP address of the master host; for instance, if your hostname is `ip-10-251-50-12.ec2.internal`, set `MASTER_IP` to the corresponding IP address, `10.251.50.12`.
`DISTRO`	The Hadoop distribution installed on the cluster. Set this property to `CDH`.
`DOCKER_BLOCK_DEVICES`	Block device(s) for Docker images (space separated if there are multiple). Use the full path to specify the device(s), for instance, `/dev/xvde`.
`JAVA_HOME`	Path where Java is installed on the Cloudera Data Science Workbench hosts. The value for `JAVA_HOME` depends on whether you are using JDK or JRE. For example, if you're using JDK 1.8.0_162, set `JAVA_HOME` to `/usr/java/jdk1.8.0_162`. If you are only using JRE, set it to `/usr/java/jdk1.8.0_162/jre`. Note that Spark 2.2 (and higher) requires JDK 1.8. For more details on the specific versions of Oracle JDK recommended for CDH and Cloudera Manager clusters, see the Cloudera Product Compatibility Matrix - Supported JDK Versions.
Optional Configuration
`APPLICATION_BLOCK_DEVICE`	(Master Host Only) Configure a block device for application state. If this property is left blank (recommended), the filesystem mounted at `/var/lib/cdsw` on the master host will be used to store all user data. For production deployments, Cloudera strongly recommends that you leave this property blank and use a dedicated SSD block device for the `/var/lib/cdsw` mount. If this property is set (not recommended), Cloudera Data Science Workbench will format the provided block device as `ext4`, mount it to `/var/lib/cdsw`, and store all user data on it. This option has only been provided for proof-of-concept setups, and Cloudera is not responsible for any data loss. Use the full path to specify the block device, for instance, `/dev/xvdf`.
`RESERVE_MASTER`	Set this property to `true` to reserve the master host for Cloudera Data Science Workbench's internal components and services such as Livelog, the PostgreSQL database, and so on. User workloads will now run exclusively on worker hosts, while the master is reserved for internal application services. Note that this property is not yet available as a configuration property in Cloudera Manager. However, you can use an Advanced Configuration Snippet (Safety Valve) to configure this as described here: Reserving the Master Host for Internal CDSW Components. Important: This feature only applies to deployments with more than one Cloudera Data Science Workbench host. Enabling this feature on single-host deployments will leave Cloudera Data Science Workbench incapable of scheduling any workloads.
`DISTRO_DIR`	Path where the Hadoop distribution is installed on the Cloudera Data Science Workbench hosts. For CDH clusters, the default location of the parcel directory is `/opt/cloudera/parcels`. Specify this property only if you are using a non-default location.
`ANACONDA_DIR`	Path where the Anaconda package is installed. On CDH clusters, Anaconda is installed as a parcel in Cloudera Manager. Therefore, this parameter does not apply and must be left blank.
`TLS_ENABLE`	Enable and enforce HTTPS (TLS/SSL) for web access. Set to `true` to enable and enforce HTTPS access to the web application. You can also set this property to `true` to enable external TLS termination. For more details on TLS termination, see Enabling TLS/SSL for Cloudera Data Science Workbench.
`TLS_CERT` `TLS_KEY`	Certificate and private key for internal TLS termination. Setting `TLS_CERT` and `TLS_KEY` will enable internal TLS termination. You must also set `TLS_ENABLE` to `true` above to enable and enforce internal termination. Set these only if you are not terminating TLS externally. Make sure you specify the full path to the certificate and key files, which must be in `PEM` format. For details on certificate requirements and enabling TLS termination, see Enabling TLS/SSL for Cloudera Data Science Workbench.
`TLS_ROOTCA`	If `TLS_CERT` is signed by a non-public or Internal Custom Certificate Authority, set this to a .pem file containing the root certificate (trust chain) for that certificate authority. The contents of this field are then inserted into the engine's root certificate store every time a session (or any workload) is launched. This allows processes inside the engine to communicate securely with the ingress controller.
`HTTP_PROXY` `HTTPS_PROXY`	If your deployment is behind an HTTP or HTTPS proxy, set the respective `HTTP_PROXY` or `HTTPS_PROXY` property in `/etc/cdsw/config/cdsw.conf` to the hostname of the proxy you are using: `HTTP_PROXY="<http://proxy_host>:<proxy_port>"` `HTTPS_PROXY="<http://proxy_host>:<proxy_port>"`. If you are using an intermediate proxy such as Cntlm to handle NTLM authentication, add the Cntlm proxy address to the `HTTP_PROXY` or `HTTPS_PROXY` fields in `cdsw.conf`: `HTTP_PROXY="http://localhost:3128"` `HTTPS_PROXY="http://localhost:3128"`. If the proxy server uses TLS encryption to handle connection requests, you will need to add the proxy's root CA certificate to your host's store of trusted certificates. This is because proxy servers typically sign their server certificate with their own root certificate. Therefore, any connection attempts will fail until the Cloudera Data Science Workbench host trusts the proxy's root CA certificate. If you do not have access to your proxy's root certificate, contact your Network / IT administrator. To enable trust, copy the proxy's root certificate to the trusted CA certificate store (`ca-trust`) on the Cloudera Data Science Workbench host: `cp /tmp/<proxy-root-certificate>.crt /etc/pki/ca-trust/source/anchors/`. Then rebuild the trusted certificate store: `update-ca-trust extract`.
`ALL_PROXY`	If a SOCKS proxy is in use, set to `socks5://<host>:<port>/`.
`NO_PROXY`	Comma-separated list of hostnames that should bypass the proxy. Starting with version 1.4, if you have defined a proxy in the `HTTP_PROXY`, `HTTPS_PROXY`, or `ALL_PROXY` properties, Cloudera Data Science Workbench automatically appends the following list of IP addresses to the `NO_PROXY` configuration. Note that this is the minimum required configuration for this field. This list includes `127.0.0.1`, `localhost`, and any private Docker registries and HTTP services inside the firewall that Cloudera Data Science Workbench users might want to access from the engines. `"127.0.0.1,localhost,100.66.0.1,100.66.0.2,100.66.0.3,100.66.0.4,100.66.0.5,100.66.0.6,100.66.0.7,100.66.0.8,100.66.0.9,100.66.0.10,100.66.0.11,100.66.0.12,100.66.0.13,100.66.0.14,100.66.0.15,100.66.0.16,100.66.0.17,100.66.0.18,100.66.0.19,100.66.0.20,100.66.0.21,100.66.0.22,100.66.0.23,100.66.0.24,100.66.0.25,100.66.0.26,100.66.0.27,100.66.0.28,100.66.0.29,100.66.0.30,100.66.0.31,100.66.0.32,100.66.0.33,100.66.0.34,100.66.0.35,100.66.0.36,100.66.0.37,100.66.0.38,100.66.0.39,100.66.0.40,100.66.0.41,100.66.0.42,100.66.0.43,100.66.0.44,100.66.0.45,100.66.0.46,100.66.0.47,100.66.0.48,100.66.0.49,100.66.0.50,100.77.0.10,100.77.0.128,100.77.0.129,100.77.0.130,100.77.0.131,100.77.0.132,100.77.0.133,100.77.0.134,100.77.0.135,100.77.0.136,100.77.0.137,100.77.0.138,100.77.0.139"`
`NVIDIA_GPU_ENABLE`	Set this property to `true` to enable GPU support for Cloudera Data Science Workbench workloads. When this property is enabled on a host that is equipped with GPU hardware, the GPU(s) will be available for use by Cloudera Data Science Workbench hosts. If this property is set to `true` on a host that does not have GPU support, there will be no effect. By default, this property is set to `false`. For detailed instructions on how to enable GPU-based workloads on Cloudera Data Science Workbench, see Using NVIDIA GPUs for Cloudera Data Science Workbench Projects.
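Before starting the application, it can help to confirm that the required properties in the table above are set. The following sketch assumes that cdsw.conf holds one `KEY="value"` assignment per line, as the proxy examples in the table suggest; the function name `check_cdsw_conf` is hypothetical, and the check only tests for non-empty values, not valid ones.

```shell
# Hypothetical check: report required cdsw.conf properties that are
# missing or empty. Returns non-zero if any are found.
check_cdsw_conf() {
  local conf="$1" key status=0
  for key in DOMAIN MASTER_IP DISTRO DOCKER_BLOCK_DEVICES JAVA_HOME; do
    # Match KEY="value" or KEY=value with a non-empty value.
    if ! grep -Eq "^${key}=\"?[^\"[:space:]]+" "$conf"; then
      echo "missing or empty: ${key}"
      status=1
    fi
  done
  return "$status"
}
```

For example, `check_cdsw_conf /etc/cdsw/config/cdsw.conf` prints one line per unset required property and returns success only when all five are present.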

Initialize and start Cloudera Data Science Workbench.
```
cdsw start
```
The application will take a few minutes to bootstrap. You can watch the status of application installation and startup with watch cdsw status.
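If you would rather have a script wait for startup than watch the status by hand, a generic polling helper is one option. The sketch below is an assumption-laden illustration: `wait_for` is a hypothetical function, and a reachable web application does not guarantee that every internal component is ready, so `cdsw status` remains the authoritative check.

```shell
# Hypothetical helper: re-run a command until it succeeds.
# Usage: wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_for() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    # Success of the probed command ends the wait.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

For example, `wait_for 40 15 curl -fsS -o /dev/null "http://cdsw.<company>.com"` probes the web application every 15 seconds for up to 10 minutes; substitute your own `DOMAIN` value, and use `https` if `TLS_ENABLE` is set.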

(Optional) Install Cloudera Data Science Workbench on Worker Hosts

Cloudera Data Science Workbench supports adding and removing additional worker hosts at any time. Worker hosts allow you to transparently scale the number of concurrent workloads users can run.

Use the following steps to add worker hosts to Cloudera Data Science Workbench. Note that airgapped clusters and non-airgapped clusters use different files for installation.

Non-airgapped Installation - Download the Cloudera Data Science Workbench repo file (cloudera-cdsw.repo) from the following location:
```
https://archive.cloudera.com/p/cdsw1/1.7.2/redhat7/yum/cloudera-cdsw.repo
```
Airgapped Installation - Download the Cloudera Data Science Workbench RPM file from the following location:
```
https://archive.cloudera.com/p/cdsw1/1.7.2/redhat7/yum/RPMS/x86_64/
```
Important: Make sure all Cloudera Data Science Workbench hosts (master and worker) are running the same version of Cloudera Data Science Workbench.
Skip this step for airgapped installations. Add the Cloudera Public GPG repository key. This key verifies that you are downloading genuine packages.
```
sudo rpm --import https://archive.cloudera.com/p/cdsw1/1.7.2/redhat7/yum/RPM-GPG-KEY-cloudera
```
Non-airgapped Installation - Install the latest RPM with the following command:
```
sudo yum install cloudera-data-science-workbench
```
Airgapped Installation - Copy the RPM downloaded in the previous step to the appropriate gateway host. Then, use the complete filename to install the package. For example:
```
sudo yum install cloudera-data-science-workbench-1.7.2.12345.rpm
```
For guidance on any warnings displayed during the installation process, see Understanding Installation Warnings.
Copy the cdsw.conf file from the master host:
```
scp root@cdsw-host-1.<company>.com:/etc/cdsw/config/cdsw.conf /etc/cdsw/config/cdsw.conf
```
After initialization, the cdsw.conf file includes a generated bootstrap token that allows worker hosts to securely join the cluster. Worker hosts obtain this token when you copy the configuration file from the master host; make sure the copied file has 600 permissions.

If your hosts have heterogeneous block device configurations, modify the Docker block device settings in the worker host configuration file after you copy it. Worker hosts do not need application block devices, which store the project files and database state, and this configuration option is ignored.
Create /var/lib/cdsw on the worker host. This directory must exist on all worker hosts. Without it, the next step that registers the worker host with the master will fail.

Unlike the master host, the /var/lib/cdsw directory on worker hosts does not need to be mounted to an Application Block Device. It is only used to store CDH client configuration on workers.
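The two preparation steps above (restricting permissions on the copied cdsw.conf and creating /var/lib/cdsw) can be sketched as a single helper. The function name `prepare_worker` and its optional arguments are hypothetical; the defaults are the paths given on this page.

```shell
# Hypothetical helper: tighten permissions on the copied cdsw.conf and
# create the directory that must exist before the worker joins.
prepare_worker() {
  local conf="${1:-/etc/cdsw/config/cdsw.conf}"
  local data_dir="${2:-/var/lib/cdsw}"
  # The file holds the bootstrap token, so restrict it to its owner.
  chmod 600 "$conf" || return 1
  # -p: succeed even if the directory already exists
  mkdir -p "$data_dir"
}
```

Run `prepare_worker` as root on each worker host after copying the configuration file and before adding the host to the cluster.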
On the worker host, run the following command to add the host to the cluster:
```
cdsw join
```
This causes the worker hosts to register themselves with the Cloudera Data Science Workbench master host and increase the available pool of resources for workloads.
Return to the master host and verify the host is registered with this command:
```
cdsw status
```

Create the Administrator Account

Installation typically takes 30 minutes, although it might take an additional 60 minutes for the R, Python, and Scala engines to be available on all hosts.

After your installation is complete, set up the initial administrator account. Go to the Cloudera Data Science Workbench web application at http://cdsw.<company>.com.

The first account that you create becomes the site administrator. You may now use this account to create a new project and start using the workbench to run data science workloads. For a brief example, see Getting Started with the Cloudera Data Science Workbench.

Next Steps

As a site administrator, you can invite new users, monitor resource utilization, secure the deployment, and upload a license key for the product. Depending on the size of your deployment, you might also want to customize how Cloudera Data Science Workbench schedules workloads on your gateway hosts. For more details on these tasks, see:

You can also start using the product by configuring your personal account and creating a new project. For a quickstart that walks you through creating a simple template project, see Getting Started with Cloudera Data Science Workbench. For more details on collaborating with teams, working on projects, and sharing results, see Managing Cloudera Data Science Workbench Users.
