Getting Started with Cloudera Data Science Workbench

This topic provides a suggested method for quickly getting started with data science workloads on Cloudera Data Science Workbench. For detailed instructions on using Cloudera Data Science Workbench, see the Cloudera Data Science Workbench User Guide.

Watch the following video for a quick demo of the steps described in this topic:

Cloudera Data Science Workbench Quickstart Demo

Sign up

To sign up, open the Cloudera Data Science Workbench web application in a browser. The application is typically hosted on the master node at http://cdsw.<your_domain>.com. The first time you log in, you will be prompted to create a username and password. Note that the first account created will receive site administrator privileges.

If your site administrator has configured your deployment to require invitations, you will need an invitation link to sign up.

Create a Project from a Built-in Template

Cloudera Data Science Workbench is organized around projects. Projects hold all the code, configuration, and libraries needed to reproducibly run analyses.

To help you get started, Cloudera Data Science Workbench includes sample template projects in R, Python, PySpark, and Scala. Starting from a template project lets you begin using Cloudera Data Science Workbench right away.

Create a Template Project


To create a template project:
  1. Sign in to Cloudera Data Science Workbench.
  2. Click New Project.
  3. Enter the account and project name.
  4. Under the Template tab, choose a programming language to create the project from the corresponding built-in template. If your site administrator has added custom template projects, they are also available in this dropdown list.
  5. Click Create Project.

After you create the project, you see its files and the list of jobs defined in it. The project files are stored on an internal NFS server and are available to all of the project's sessions and jobs, regardless of the gateway nodes they run on. Any changes you make to the code, and any libraries you install into the project, are immediately available when you run an engine.
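As a quick check that the project files are visible from a session, a short Python sketch like the following lists the top-level entries. It assumes that the session's working directory is the project root, which this topic does not state, and `list_project_entries` is a hypothetical helper name used only for illustration.

```python
from pathlib import Path


def list_project_entries(root="."):
    """Return the sorted names of the files and folders directly under root."""
    return sorted(entry.name for entry in Path(root).iterdir())


if __name__ == "__main__":
    # Assumption: the current working directory is the project root.
    for name in list_project_entries():
        print(name)
```

Running the same snippet from a second session of the same project should show the same entries, since the files are shared across sessions.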

Launch a Session to Run the Project

Cloudera Data Science Workbench provides an interactive environment tailored for data science called the workbench. It supports R, Python, and Scala engines, one of which we will use to run the template project.

Open the Workbench to Launch a Session

To run the project code, open the workbench and launch a new session.
  1. Navigate to the new project's Overview page.
  2. Click Open Workbench.
  3. Launch a new session:
    1. Use Select Engine Kernel to choose the programming language that your project uses.
    2. Use Select Engine Profile to select the number of CPU cores and the amount of memory to be used.
    3. Click Launch Session.

  The command prompt at the bottom right of your browser window turns green when the engine is ready. Sessions typically take between 10 and 20 seconds to start.

Execute Project Code

You can enter and execute code using either the editor or the command prompt. The editor is best used for code you want to keep, while the command prompt is best for quick interactive exploration.

Editor - To run code in the editor:

  1. Select a script from the project files on the left sidebar.
  2. To run the whole script, click the run button on the top navigation bar. To run only part of it, highlight the code you want to run and press Ctrl+Enter (Windows/Linux) or Cmd+Enter (macOS).

Command Prompt - The command prompt functions largely like any other. Enter a command and press Enter to execute it. If you want to enter more than one line of code, use Shift+Enter to move to the next line. The output of your code, including plots, appears in the console.
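For instance, in a Python session you might type a short multi-line snippet such as the following at the command prompt, using Shift+Enter between lines and Enter to run it. The function name and the sample values are purely illustrative and do not come from any template project.

```python
import statistics


def summarize(values):
    """Return the count, mean, and sample standard deviation of values."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }


# Illustrative sample data, not taken from a template project.
print(summarize([2.0, 4.0, 4.0, 6.0]))
```

Once the exploration settles into something worth keeping, the same function can be moved into a script in the editor.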


Code Autocomplete - The Python and R kernels support automatic code completion in both the editor and the command prompt. Press Tab once to display suggestions and twice to autocomplete.

Test Terminal Access

Cloudera Data Science Workbench provides terminal access to the running engines from the web console. You can use the terminal to move files around, run Git commands, or install libraries that cannot be installed directly from the engine.

To access the Terminal from a running session, click Terminal Access above the console pane.


By default, the terminal does not provide root or sudo access to the container. To install packages that require root access, see Customizing Engine Images.
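Before installing a library from the terminal, it can help to check from the session whether the engine already provides it. The following sketch uses Python's standard importlib module for that check. The helper name and the module names in the usage loop are illustrative assumptions, not packages that this topic says are preinstalled.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


# Example module names only; substitute the ones your project needs.
for name in ("json", "numpy"):
    print(name, "available" if is_installed(name) else "missing")
```

The check only looks up top-level module names, so it does not confirm a specific version.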

Stop the Session

When you are done with the session, click Stop in the menu bar above the console.

Next Steps

Now that you have successfully run a sample workload with Cloudera Data Science Workbench, read the User, Administration, and Security guides to learn more. They cover the types of users, how to collaborate on projects, how to use Spark 2 for advanced analytics, and how to secure your deployment.