Install Cloudera Data Science Workbench on the Master Host
Use the following steps to install Cloudera Data Science Workbench on the master host.
-
Non-airgapped Installation - Download the Cloudera Data Science Workbench repo file (cloudera-cdsw.repo) from the following location:
https://username:password@archive.cloudera.com/p/cdsw1/1.9.0/redhat7/yum/cloudera-cdsw.repo
Airgapped Installation - Download the Cloudera Data Science Workbench RPM file from the following location:
https://username:password@archive.cloudera.com/p/cdsw1/1.9.0/redhat7/yum/RPMS/x86_64/
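For non-airgapped installations, the download can be scripted. The following is a minimal sketch, not an official procedure. It assumes curl is installed and that the repo file belongs in /etc/yum.repos.d/, the standard yum repository directory (this page does not name a destination). The function names cdsw_repo_url and download_cdsw_repo are hypothetical.

```shell
# Build the repo-file URL from the credentials passed as $1 (username)
# and $2 (password). Special characters in either value would need to be
# percent-encoded first; this sketch does not do that.
cdsw_repo_url() {
  printf 'https://%s:%s@archive.cloudera.com/p/cdsw1/1.9.0/redhat7/yum/cloudera-cdsw.repo\n' "$1" "$2"
}

# Download the repo file into the yum repository directory (assumed location).
# -f fails on HTTP errors, -sS is quiet but still reports errors,
# -L follows redirects, -o names the output file.
download_cdsw_repo() {
  sudo curl -fsSL -o /etc/yum.repos.d/cloudera-cdsw.repo "$(cdsw_repo_url "$1" "$2")"
}

# Usage (the credentials are placeholders):
#   download_cdsw_repo "username" "password"
```

Keeping the URL construction in its own function makes it easy to inspect the URL before anything is downloaded.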
-
Skip this step for airgapped installations. Add the Cloudera Public GPG repository
key. This key verifies that you are downloading genuine packages.
sudo rpm --import https://username:password@archive.cloudera.com/p/cdsw1/1.9.0/redhat7/yum/RPM-GPG-KEY-cloudera
-
Non-airgapped Installation - Install the latest RPM with the following command:
sudo yum install cloudera-data-science-workbench
Airgapped Installation - Copy the RPM you downloaded earlier to the appropriate gateway host. Then, use the complete filename to install the package. For example:
sudo yum install cloudera-data-science-workbench-1.9.0.12345.rpm
For guidance on any warnings displayed during the installation process, see Understanding Installation Warnings.
-
Edit the configuration file at /etc/cdsw/config/cdsw.conf. The following table lists the properties that can be configured in cdsw.conf.
Table 1. cdsw.conf Properties
Properties Description
Required Configuration
DOMAIN
Wildcard DNS domain configured to point to the master host.
If the wildcard DNS entries are configured as cdsw.<your_domain>.com and *.cdsw.<your_domain>.com, then DOMAIN should be set to cdsw.<your_domain>.com. Users' browsers should then contact the Cloudera Data Science Workbench web application at http://cdsw.<your_domain>.com.
This domain is used for DNS only and is unrelated to Kerberos or LDAP domains.
MASTER_IP
IPv4 address for the master host that is reachable from the worker hosts.
Within an AWS VPC, MASTER_IP should be set to the internal IP address of the master host; for instance, if your hostname is ip-10-251-50-12.ec2.internal, set MASTER_IP to the corresponding IP address, 10.251.50.12.
DISTRO
The Hadoop distribution installed on the cluster. Set this property to HDP.
DOCKER_BLOCK_DEVICES
Block device(s) for Docker images (space separated if there are multiple).
Use the full path to specify the device(s), for instance, /dev/xvde.
JAVA_HOME
Path where Java is installed on the Cloudera Data Science Workbench hosts.
This path must match the JAVA_HOME environment variable that is configured for your HDP cluster. You can find the value in hadoop-env.sh on any node in the HDP cluster.
Note that Spark 2.3 requires JDK 1.8. For more details on the specific versions of Oracle JDK recommended for HDP clusters, see the Hortonworks Support Matrix at https://supportmatrix.cloudera.com/.
Optional Configuration
APPLICATION_BLOCK_DEVICE
(Master Host Only) Configure a block device for application state.
If this property is left blank, the filesystem mounted at /var/lib/cdsw on the master host will be used to store all user data. For production deployments, Cloudera strongly recommends this option (leaving the property blank), with a dedicated SSD block device for the /var/lib/cdsw mount.
(Not recommended) If set, Cloudera Data Science Workbench will format the provided block device as ext4, mount it to /var/lib/cdsw, and store all user data on it. This option has only been provided for proof-of-concept setups, and Cloudera is not responsible for any data loss.
Use the full path to specify the block device, for instance, /dev/xvdf.
RESERVE_MASTER
Set this property to true to reserve the master host for Cloudera Data Science Workbench's internal components and services, such as Livelog, the PostgreSQL database, and so on. User workloads will then run exclusively on worker hosts, while the master is reserved for internal application services.
This feature only applies to deployments with more than one Cloudera Data Science Workbench host. Enabling this feature on single-host deployments will leave Cloudera Data Science Workbench incapable of scheduling any workloads.
DISTRO_DIR
Path where the Hadoop distribution is installed on the Cloudera Data Science Workbench hosts. For HDP clusters, the default location of the packages is /usr/hdp. Specify this property only if you are using a non-default location.
ANACONDA_DIR
Path where Anaconda is installed. Set this property only if you are using Anaconda for package management.
By default, the Anaconda package is installed at /home/<your-username>/anaconda<2 or 3>. Refer to the Anaconda FAQs for more details.
If you choose to start using Anaconda at any time after installation, you must set this property and then restart Cloudera Data Science Workbench for the change to take effect.
TLS_ENABLE
Enable and enforce HTTPS (TLS/SSL) for web access.
Set to true to enable and enforce HTTPS access to the web application.
You can also set this property to true to enable external TLS termination. For more details on TLS termination, see Enabling TLS/SSL for Cloudera Data Science Workbench.
TLS_CERT
TLS_KEY
Certificate and private key for internal TLS termination.
Setting TLS_CERT and TLS_KEY will enable internal TLS termination. You must also set TLS_ENABLE to true above to enable and enforce internal termination. Set these only if you are not terminating TLS externally.
Make sure you specify the full path to the certificate and key files, which must be in PEM format.
For details on certificate requirements and enabling TLS termination, see Enabling TLS/SSL for Cloudera Data Science Workbench.
HTTP_PROXY
HTTPS_PROXY
If your deployment is behind an HTTP or HTTPS proxy, set the respective HTTP_PROXY or HTTPS_PROXY property in /etc/cdsw/config/cdsw.conf to the address (host and port) of the proxy you are using.
HTTP_PROXY="<http://proxy_host>:<proxy_port>"
HTTPS_PROXY="<http://proxy_host>:<proxy_port>"
If you are using an intermediate proxy, such as Cntlm, to handle NTLM authentication, add the Cntlm proxy address to the HTTP_PROXY or HTTPS_PROXY fields in cdsw.conf.
HTTP_PROXY="http://localhost:3128"
HTTPS_PROXY="http://localhost:3128"
If the proxy server uses TLS encryption to handle connection requests, you will need to add the proxy's root CA certificate to your host's store of trusted certificates. This is because proxy servers typically sign their server certificate with their own root certificate. Therefore, any connection attempts will fail until the Cloudera Data Science Workbench host trusts the proxy's root CA certificate. If you do not have access to your proxy's root certificate, contact your Network / IT administrator.
To enable trust, copy the proxy's root certificate to the trusted CA certificate store (ca-trust) on the Cloudera Data Science Workbench host.
cp /tmp/<proxy-root-certificate>.crt /etc/pki/ca-trust/source/anchors/
Use the following command to rebuild the trusted certificate store.
update-ca-trust extract
ALL_PROXY
If a SOCKS proxy is in use, set to socks5://<host>:<port>/.
NO_PROXY
Comma-separated list of hostnames that should be skipped from the proxy.
Starting with version 1.4, if you have defined a proxy in the HTTP_PROXY, HTTPS_PROXY, or ALL_PROXY properties, Cloudera Data Science Workbench automatically appends the following list of IP addresses to the NO_PROXY configuration. Note that this is the minimum required configuration for this field.
This list includes 127.0.0.1, localhost, and any private Docker registries and HTTP services inside the firewall that Cloudera Data Science Workbench users might want to access from the engines.
"127.0.0.1,localhost,100.66.0.1,100.66.0.2,100.66.0.3,
100.66.0.4,100.66.0.5,100.66.0.6,100.66.0.7,100.66.0.8,100.66.0.9,
100.66.0.10,100.66.0.11,100.66.0.12,100.66.0.13,100.66.0.14,
100.66.0.15,100.66.0.16,100.66.0.17,100.66.0.18,100.66.0.19,
100.66.0.20,100.66.0.21,100.66.0.22,100.66.0.23,100.66.0.24,
100.66.0.25,100.66.0.26,100.66.0.27,100.66.0.28,100.66.0.29,
100.66.0.30,100.66.0.31,100.66.0.32,100.66.0.33,100.66.0.34,
100.66.0.35,100.66.0.36,100.66.0.37,100.66.0.38,100.66.0.39,
100.66.0.40,100.66.0.41,100.66.0.42,100.66.0.43,100.66.0.44,
100.66.0.45,100.66.0.46,100.66.0.47,100.66.0.48,100.66.0.49,
100.66.0.50,100.77.0.10,100.77.0.128,100.77.0.129,100.77.0.130,
100.77.0.131,100.77.0.132,100.77.0.133,100.77.0.134,100.77.0.135,
100.77.0.136,100.77.0.137,100.77.0.138,100.77.0.139"
NVIDIA_GPU_ENABLE
Set this property to true to enable GPU support for Cloudera Data Science Workbench workloads. When this property is enabled on a host that is equipped with GPU hardware, the GPU(s) will be available for use by Cloudera Data Science Workbench.
If this property is set to true on a host that does not have GPU support, there will be no effect. By default, this property is set to false.
For detailed instructions on how to enable GPU-based workloads on Cloudera Data Science Workbench, see Using NVIDIA GPUs for Cloudera Data Science Workbench Projects.
NVIDIA_LIBRARY_PATH
Complete path to the NVIDIA driver libraries.
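Before starting the application, it can help to check cdsw.conf against the rules in the table. The following is a minimal sketch, not a Cloudera-provided tool. It assumes each property is written as a single KEY="value" line, as in the proxy examples in the table. The function names conf_value and check_cdsw_conf are hypothetical, and the check that TLS_CERT and TLS_KEY are set together is an inference from the table rather than a documented rule.

```shell
# Print the value assigned to property $2 in file $1 (last assignment wins),
# with one pair of surrounding double quotes removed. Commented-out lines
# are ignored because the match is anchored at the start of the line.
conf_value() {
  sed -n "s/^[[:space:]]*$2=//p" "$1" | tail -n 1 | sed 's/^"\(.*\)"$/\1/'
}

# Print one line per problem found; print nothing if the file looks consistent.
check_cdsw_conf() {
  conf_file="$1"
  # Required properties, as listed under "Required Configuration".
  for prop in DOMAIN MASTER_IP DISTRO DOCKER_BLOCK_DEVICES JAVA_HOME; do
    [ -n "$(conf_value "$conf_file" "$prop")" ] || echo "missing required property: $prop"
  done
  # Internal TLS termination needs TLS_CERT and TLS_KEY, plus TLS_ENABLE set to true.
  cert="$(conf_value "$conf_file" TLS_CERT)"
  key="$(conf_value "$conf_file" TLS_KEY)"
  if [ -n "$cert" ] || [ -n "$key" ]; then
    if [ -z "$cert" ] || [ -z "$key" ]; then
      echo "TLS_CERT and TLS_KEY must be set together"
    fi
    if [ "$(conf_value "$conf_file" TLS_ENABLE)" != "true" ]; then
      echo "TLS_ENABLE must be true for internal TLS termination"
    fi
  fi
}

# Usage:
#   check_cdsw_conf /etc/cdsw/config/cdsw.conf
```

The sketch parses the file with sed instead of sourcing it, so nothing in the configuration file is executed during the check.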
-
Initialize and start Cloudera Data Science Workbench.
cdsw start
The application will take a few minutes to bootstrap. You can watch the status of application installation and startup with the following command:
watch cdsw status
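For unattended installs, where an interactive watch is not practical, a polling loop is one alternative. The following is a minimal sketch of a generic retry helper. The function name wait_for is hypothetical. The usage line assumes that cdsw status exits with a non-zero code until the application is ready, which this page does not confirm; verify that behavior on your deployment before relying on it.

```shell
# Run a command repeatedly until it succeeds or a timeout elapses.
#   $1 = timeout in seconds, $2 = polling interval in seconds,
#   remaining arguments = the command to run.
# Returns 0 as soon as the command succeeds, 1 if the timeout is reached.
wait_for() {
  timeout_s="$1"
  interval_s="$2"
  shift 2
  elapsed=0
  until "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
  return 0
}

# Usage (assumes "cdsw status" signals readiness through its exit code):
#   wait_for 600 10 cdsw status || echo "CDSW did not become ready in time"
```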