Managing Projects in Cloudera Data Science Workbench

Projects form the heart of Cloudera Data Science Workbench. They hold all the code, configuration, and libraries needed to reproducibly run analyses. Each project is independent, ensuring users can work freely without interfering with one another or breaking existing workloads.

This topic describes how to create and manage projects in Cloudera Data Science Workbench.

Creating a Project

To create a Cloudera Data Science Workbench project:
  1. Go to Cloudera Data Science Workbench and on the left sidebar, click Projects.
  2. Click New Project.
  3. If you are a member of a team, from the drop-down menu, select the Account under which you want to create this project. If there is only one account on the deployment, you will not see this option.
  4. Enter a Project Name.
  5. Select one of the following Project Visibility options.
    • Private - Only project collaborators can view or edit the project.
    • Team - If the project is created under a team account, all members of the team can view the project. Only explicitly added collaborators can edit the project.
    • Public - All authenticated users of Cloudera Data Science Workbench can view the project. Only collaborators can edit the project.
  6. Under Initial Setup, you can either create a blank project, or select one of the following sources for your project files.
    • Built-in Templates - Template projects contain example code that can help you get started with Cloudera Data Science Workbench. They are available in R, Python, PySpark, and Scala. Using a template project is optional, but it helps you start using Cloudera Data Science Workbench right away.

    • Custom Templates - Starting with version 1.3, site administrators can add template projects that are customized for their organization's use cases. For details, see Custom Template Projects.

    • Local - If you have an existing project on your local disk, use this option to upload compressed files or folders to Cloudera Data Science Workbench.
    • Git - If you already use Git for version control and collaboration, you can continue to do so with Cloudera Data Science Workbench. Specifying a Git URL clones the project into Cloudera Data Science Workbench. If you use a Git SSH URL, your personal private SSH key is used to clone the repository. This is the recommended approach. However, you must add the public SSH key from your personal Cloudera Data Science Workbench account to the remote Git hosting service before you can clone the project. If you use an HTTP or HTTPS URL instead, specify your username and password in the URL as follows:
      http://username:password@server/path/project.git
  7. Click Create Project. After the project is created, you can see your project files and the list of jobs defined in your project.
    Note that as part of the project filesystem, Cloudera Data Science Workbench also creates the following .gitignore file.
    R
    node_modules
    *.pyc
    .*
    !.gitignore
  8. (Optional) To work with team members on a project, use the instructions in the following section to add them as collaborators to the project.
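
The username:password URL format shown under the Git option breaks if either value contains reserved characters such as @ or /. The following is a minimal Python sketch of building such a URL with percent-encoded credentials. The function name, the example server, and the example credentials are hypothetical and are not part of Cloudera Data Science Workbench.

```python
from urllib.parse import quote


def build_git_http_url(username, password, server, repo_path, scheme="http"):
    """Return scheme://username:password@server/repo_path.

    The username and password are percent-encoded so that reserved
    characters (for example '@' or '/') do not corrupt the URL.
    """
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"{scheme}://{user}:{secret}@{server}/{repo_path.lstrip('/')}"


# Hypothetical values, for illustration only.
print(build_git_http_url("jdoe", "p@ss/word", "git.example.com", "path/project.git"))
```

With these example values, the @ and / in the password are encoded as %40 and %2F, so the only unencoded @ in the result is the one that separates the credentials from the server name.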

Adding Collaborators

If you want to work closely with colleagues on a particular project, use the following steps to add them to the project.
  1. Navigate to the project overview page.
  2. Click Team to open the Collaborators page.
  3. Search for collaborators by either name or email address and click Add.

    For a project created under your personal account, anyone who belongs to your organization can be added as a collaborator. For a project created under a team account, you can only add collaborators that already belong to the team. If you want to work on a project that requires collaborators from different teams, create a new team with the required members, and then create a project under that account. If your project was created from a Git repository, each collaborator must create the project from the same central Git repository.

    You can grant project collaborators one of four levels of access:
    • Viewer - Read-only access to code, data, and results.
    • Operator - Read-only access to code, data, and results. Additionally, Operators can start and stop existing jobs in the projects that they have access to.
    • Contributor - Can view, edit, create, and delete files and environment variables, run sessions, experiments, jobs, and models, and execute code in running jobs. Additionally, Contributors can set the default engine for the project.
    • Admin - Has complete access to all aspects of the project. This includes the ability to add new collaborators, and delete the entire project.
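
One way to reason about these access levels is as nested sets of capabilities. The following Python sketch is purely illustrative: the capability names are hypothetical paraphrases of the descriptions above, not a Cloudera Data Science Workbench API, and it assumes that each level includes everything granted to the level before it.

```python
# Hypothetical capability names that paraphrase the access levels above.
READ_ONLY = {"view_code", "view_data", "view_results"}

ACCESS_LEVELS = {
    "Viewer": set(READ_ONLY),
    "Operator": READ_ONLY | {"start_stop_jobs"},
}
# Assumption: a Contributor can do everything an Operator can.
ACCESS_LEVELS["Contributor"] = ACCESS_LEVELS["Operator"] | {
    "edit_files",
    "edit_environment_variables",
    "run_workloads",
    "set_default_engine",
}
# Admins have complete access, including collaborator and project management.
ACCESS_LEVELS["Admin"] = ACCESS_LEVELS["Contributor"] | {
    "add_collaborators",
    "delete_project",
}


def is_allowed(level, capability):
    """Return True if the given access level includes the capability."""
    return capability in ACCESS_LEVELS[level]


print(is_allowed("Operator", "start_stop_jobs"))  # an Operator can start and stop jobs
print(is_allowed("Contributor", "delete_project"))  # only an Admin can delete the project
```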

For more information on collaborating effectively, see Collaborating on Projects with Cloudera Data Science Workbench.

Modifying Project Settings

Project contributors and administrators can modify aspects of the project environment, such as the engine used to launch sessions and the environment variables, and can create SSH tunnels to access external resources. To make these changes:
  1. Switch context to the account where the project was created.
  2. Click Projects.
  3. From the list of projects, select the one you want to modify.
  4. Click Settings to open up the Project Settings dashboard.
    Options
    Modify the project name and its privacy settings on this page.
    Engine
    Cloudera Data Science Workbench ensures that your code is always run with the specific engine version you selected. You can select the version here. For advanced use cases, projects can use custom Docker images. Site administrators can whitelist images for use in projects, and project administrators can use this page to select which of these whitelisted images is used for their projects. For an example, see Customized Engine Images.

    Environment
    If there are any environment variables that should be injected into all the engines running this project, you can add them on this page. For more details, see Engine Environment Variables.

    Tunnels
    In some environments, external databases and data sources reside behind restrictive firewalls. Cloudera Data Science Workbench provides a convenient way to connect to such resources using your SSH key. For instructions, see SSH Tunnels.
    Delete Project
    This page can only be accessed by project administrators. Remember that deleting a project is irreversible. All files, data, sessions, and jobs will be lost.
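
Variables added on the Environment page are injected into the engines that run the project, so code in a session can read them like any other environment variable. The following is a minimal Python sketch. The variable name DATA_DIR and its fallback value are hypothetical and stand in for whatever you define in the project settings.

```python
import os


def get_project_setting(name, default=None):
    """Read an environment variable, falling back to a default if it is unset."""
    return os.environ.get(name, default)


# DATA_DIR is a hypothetical variable name, used here for illustration only.
data_dir = get_project_setting("DATA_DIR", default="/tmp/data")
print(data_dir)
```

Supplying a default keeps the script usable when the variable has not been defined for the project.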

Managing Files

Cloudera Data Science Workbench allows you to move, rename, copy, and delete files within the scope of the project where they live. You can also upload new files to a project, or download project files. Files can only be uploaded within the scope of a single project. Therefore, to access a script or data file from multiple projects, you will need to manually upload it to all the relevant projects.

  1. Switch context to the account where the project was created.
  2. Click Projects.
  3. From the list of projects, click on the project you want to modify. This will take you to the project overview.
  4. Click Files.
    Upload Files to a Project

    Click Upload. Select Files or Folder from the dropdown, and choose the files or folder you want to upload from your local filesystem.

    In addition to uploading files or a folder, you can upload a .tar file of multiple files and folders. After you select and upload the .tar file, you can use a terminal session to extract the contents:

    1. On the project overview page, click Open Workbench and select a running session or create a new one.
    2. Click Terminal access.
    3. In the terminal window, extract the contents of the .tar file:
      tar -xvf <file_name>.tar

      The extracted files are now available for the project.

    Download Project Files

    Click Download to download the entire project in a .zip file. To download only specific files, select the checkbox next to each file to be downloaded and click Download.

    You can also use the checkboxes to Move, Rename, or Delete files within the scope of this project.
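
As an alternative to running tar in a terminal, an uploaded archive can also be extracted from a Python session. The following is a minimal sketch using the standard tarfile module. The function name and the archive name in the usage comment are hypothetical. The sketch rejects archives whose entries would be written outside the destination directory.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir="."):
    """Extract a .tar archive (optionally compressed) into dest_dir.

    Returns the names of the regular files contained in the archive.
    """
    dest = Path(dest_dir).resolve()
    # Mode "r:*" lets tarfile detect the compression (none, gzip, bz2, xz).
    with tarfile.open(archive_path, "r:*") as archive:
        members = archive.getmembers()
        for member in members:
            target = (dest / member.name).resolve()
            # Refuse entries that would land outside the destination directory.
            if target != dest and dest not in target.parents:
                raise ValueError(f"Unsafe path in archive: {member.name}")
        archive.extractall(dest)
    return [member.name for member in members if member.isfile()]


# Example call with a hypothetical archive name:
# extract_archive("uploaded_files.tar", ".")
```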

Disabling Project File Uploads and Downloads

Required Role: Site Administrator

By default, all Cloudera Data Science Workbench users are allowed to upload and download files to/from a project. Version 1.5 introduces a new feature flag that allows site administrators to hide the UI features that let users upload and download project files.

Note that this feature flag only removes the relevant features from the Cloudera Data Science Workbench UI. It does not disable the ability to upload and download files through the backend web API.

To disable project file uploads and downloads:
  1. Go to Admin > Security.
  2. Under the File Upload/Download section, disable the Allow file upload/download through UI checkbox.

Custom Template Projects

Required Role: Site Administrator

Site administrators can add template projects that have been customized for their organization's use-cases. These custom project templates can be added in the form of a Git repository.

To add a new template project, go to Admin > Settings. Under the Project Templates section, provide a template name and the URL of the project's Git repository, and then click Add.

The added templates will become available in the Template tab on the Create Project page. Site administrators can add, edit, or delete custom templates, but not the built-in ones. However, individual built-in templates can be disabled using a checkbox in the Project Templates table at Admin > Settings.

Deleting a Project

To delete a project:
  1. Go to the project Overview page.
  2. On the left sidebar, click Settings.
  3. Go to the Delete Project page.
  4. Click Delete Project and click OK to confirm.