Configure PyCharm as a Local IDE

Cloudera Data Science Workbench supports using local IDEs on your machine that allow remote execution and/or file sync over SSH, such as PyCharm. This topic describes the tasks you need to perform to configure Cloudera Data Science Workbench to act as a remote SSH interpreter for PyCharm. Once finished, you can use PyCharm to edit files and sync your changes to Cloudera Data Science Workbench. To perform actions such as deploying a model, use the Cloudera Data Science Workbench web UI.
Before you begin, ensure that the following prerequisites are met:
  • You have an edition of PyCharm that supports SSH, such as the Professional Edition.
  • You have an SSH public/private key pair for your local machine that is compatible with PyCharm. If you use OpenSSH to generate the key, include the -m PEM option because PyCharm does not support modern (RFC 4716) OpenSSH keys.
  • You have Contributor permissions for an existing Cloudera Data Science Workbench project. Alternatively, create a new project you have access to.
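
As a rough sketch of the key prerequisite, the following shell helpers generate an RSA key pair in PEM format with OpenSSH and report which format an existing private key uses. The function names and the key path in the usage comment are illustrative assumptions, not part of Cloudera Data Science Workbench or PyCharm.

```shell
# Generate an RSA key pair in PEM format. ssh-keygen prompts for a passphrase.
# Usage (hypothetical path): make_pem_key "$HOME/.ssh/cdsw_pycharm_rsa"
make_pem_key() {
  ssh-keygen -t rsa -b 4096 -m PEM -f "$1"
}

# Report the format of an existing private key based on its first line.
# Prints "pem", "openssh", or "unknown".
key_format() {
  case "$(head -n 1 "$1")" in
    "-----BEGIN RSA PRIVATE KEY-----") echo "pem" ;;
    "-----BEGIN OPENSSH PRIVATE KEY-----") echo "openssh" ;;
    *) echo "unknown" ;;
  esac
}
```

If key_format prints openssh for a key you already have, generating a separate PEM key for PyCharm is likely the least disruptive option.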

Download cdswctl and Add an SSH Key

  1. Open the Cloudera Data Science Workbench web UI and go to Settings > Remote Editing for your user account.
  2. Download the cdswctl client for your operating system.
  3. Add your SSH public key to SSH public keys for session access.
    Cloudera Data Science Workbench uses the SSH public key to authenticate your CLI client session, including the SSH endpoint connection to the Cloudera Data Science Workbench deployment.

    Any SSH endpoints that are running when you add an SSH public key must also be restarted.

Initialize an SSH Connection to Cloudera Data Science Workbench

The following task describes how to establish an SSH endpoint for Cloudera Data Science Workbench. Creating an SSH endpoint is the first step to configuring a remote editor for Cloudera Data Science Workbench.

  1. Log in to Cloudera Data Science Workbench with the CLI client:
    cdswctl login -n <username> -u cdsw.your_domain.com
    For example, the following command logs the user sample_user into the cdsw.your_domain.com deployment:
    cdswctl login -n sample_user -u cdsw.your_domain.com
  2. Create a local SSH endpoint to Cloudera Data Science Workbench. Run the following command:
    cdswctl ssh-endpoint -p [<username>/]<project_name> [-c <CPU_cores>] [-m <memory_in_GB>] [-g <number_of_GPUs>]
    The command uses the following defaults for optional parameters:
    • CPU cores: 1
    • Memory: 1 GB
    • GPUs: 0
    For example, the following command starts a session for the logged-in user sample_user under the customerchurn project with 0.5 cores, 0.75 GB of memory, 0 GPUs, and the Python 3 kernel:
    cdswctl ssh-endpoint -p customerchurn -c 0.5 -m 0.75

    To create an SSH endpoint in a project owned by another user or a team, for example finance, prepend the owner's username or team name to the project name and separate them with a forward slash:

    cdswctl ssh-endpoint -p finance/customerchurn -c 0.5 -m 0.75
    This command creates a session in the project customerchurn that belongs to the team finance.
    Information for the SSH endpoint appears in the output:
    ...
    You can SSH to it using
    
        ssh -p <some_port> cdsw@localhost
    ...
  3. Open a new command prompt and run the command shown in the output of the previous step:
    ssh -p <some_port> cdsw@localhost
    For example:
    ssh -p 9750 cdsw@localhost
    You are prompted for the passphrase of the SSH key whose public key you added in the Cloudera Data Science Workbench web UI.
    Once you are connected to the endpoint, you are logged in as the cdsw user and can perform actions as though you are accessing the terminal through the Cloudera Data Science Workbench web UI.
  4. Test the connection.
    If you run ls, the project files associated with the session you created are shown. If you run whoami, the command returns the cdsw user.
  5. Leave the SSH endpoint running as long as you want to use a local IDE.
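
The port reported by cdswctl is needed again when you configure PyCharm. As a minimal sketch, assuming the endpoint output has the form shown in step 2, the following helpers pull the port out of saved output and print an OpenSSH client configuration entry for it. The function names and the cdsw-endpoint host alias are hypothetical and are not provided by cdswctl.

```shell
# Read cdswctl ssh-endpoint output on stdin and print the local port
# from the first "ssh -p <port> cdsw@localhost" line.
extract_ssh_port() {
  sed -n 's/.*ssh -p \([0-9][0-9]*\) cdsw@localhost.*/\1/p' | head -n 1
}

# Print an OpenSSH client config entry for the endpoint.
# $1 = port, $2 = path to the private key; "cdsw-endpoint" is a made-up alias.
print_ssh_config_entry() {
  printf 'Host cdsw-endpoint\n'
  printf '    HostName localhost\n'
  printf '    Port %s\n' "$1"
  printf '    User cdsw\n'
  printf '    IdentityFile %s\n' "$2"
}
```

For example, print_ssh_config_entry 9750 ~/.ssh/cdsw_pycharm_rsa prints an entry that you could append to ~/.ssh/config and then use as ssh cdsw-endpoint. If the port differs the next time you start an endpoint, the entry needs to be updated.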

Add Cloudera Data Science Workbench as an Interpreter for PyCharm

Before you begin, ensure that the SSH endpoint for Cloudera Data Science Workbench is running on your local machine. In PyCharm, you can configure an SSH interpreter. PyCharm uses this method to connect to Cloudera Data Science Workbench, which then acts as its interpreter. These instructions were written for the Professional Edition of PyCharm Version 2019.1 and are meant as a starting point. If additional information is required, see the documentation for your version of PyCharm for specific instructions.
  1. Verify that the SSH endpoint for Cloudera Data Science Workbench is running with cdswctl. If the endpoint is not running, start it.
  2. Open PyCharm.
  3. Create a new project.
  4. Expand Project Interpreter and select Existing interpreter.
  5. Click on ... and select SSH Interpreter.
  6. Select New server configuration and complete the fields:
    • Host: localhost
    • Port: <port_number>

      This is the port number provided by cdswctl.

    • Username: cdsw
  7. Select Key pair and complete the fields using the RSA private key that corresponds to the public key you added to the Remote Editing tab in the Cloudera Data Science Workbench web UI.
    For macOS users, you must add your RSA private key to your keychain. In a terminal window, run the following command:
    ssh-add -K <path to your private key>/<private_key>
  8. Complete the wizard. Based on the Python version you want to use, enter one of the following parameters:
    • For Python 2: /usr/local/bin/python
    • For Python 3: /usr/local/bin/python3
    You are returned to the New Project window. Existing interpreter is selected, and you should see the connection to Cloudera Data Science Workbench in the Interpreter field.
  9. In the Remote project location field, specify the following directory:
    /home/cdsw
  10. Create the project.
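
If PyCharm cannot find the interpreter, it can help to confirm the path over the same SSH endpoint first. The following is a small sketch of such a check. The function name is hypothetical, and it assumes the endpoint is still running and that your key is accepted.

```shell
# Ask the remote session for the version of a Python interpreter.
# $1 = local port printed by cdswctl
# $2 = interpreter path (optional, defaults to the Python 3 path from step 8)
check_remote_python() {
  ssh -p "$1" cdsw@localhost "${2:-/usr/local/bin/python3} --version"
}

# Usage, with the example port from the previous section:
#   check_remote_python 9750
#   check_remote_python 9750 /usr/local/bin/python
```

A version string in the output indicates that the path exists in the session and can be entered in the PyCharm wizard.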

(Optional) Configure the Sync Between Cloudera Data Science Workbench and PyCharm

Before you configure syncing behavior between the remote editor and Cloudera Data Science Workbench, ensure that you understand the policies set forth by IT and the Site Administrator. For example, a policy might require that data remains within the Cloudera Data Science Workbench deployment but allow you to download and edit code. Configuring what files PyCharm ignores can help you adhere to IT policies.
  1. In your project, go to Preferences.
    Depending on your operating system, Preferences may be called Settings.
  2. Go to Build, Execution, Deployment and select Deployment.
  3. On the Connection tab, add the following path to the Root path field:
    /home/cdsw
  4. On the Excluded Paths tab, add any paths you want to exclude.
    Cloudera recommends excluding the following paths at a minimum:
    • /home/cdsw/.local
    • /home/cdsw/.cache
    • /home/cdsw/.ipython
    • /home/cdsw/.oracle_jre_usage
    • /home/cdsw/.pip
    • /home/cdsw/.pycharm_helpers
  5. Optionally, add a Deployment path on the Mappings tab if the code for your Cloudera Data Science Workbench project lives in a subdirectory of the root path.
  6. Expand Deployment in the left navigation and go to Options > Upload changed files automatically to the default server and set the behavior to adhere to the policies set forth by IT and the Site Administrator.

    Cloudera recommends setting the behavior to Automatic upload because the data remains on the cluster while your changes get uploaded.

  7. Sync the project files to your machine and begin editing.
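
To illustrate what the recommended exclusions cover, the following sketch checks whether a remote path is one of the excluded directories or lies beneath one. It only mirrors the list from step 4 and is not how PyCharm evaluates excluded paths internally. The variable and function names are assumptions made for this example.

```shell
# Paths that step 4 recommends excluding from sync.
EXCLUDED_PATHS="/home/cdsw/.local /home/cdsw/.cache /home/cdsw/.ipython /home/cdsw/.oracle_jre_usage /home/cdsw/.pip /home/cdsw/.pycharm_helpers"

# Return 0 if the given remote path is an excluded path or lies under one,
# and 1 otherwise.
is_excluded() {
  for prefix in $EXCLUDED_PATHS; do
    case "$1" in
      "$prefix"|"$prefix"/*) return 0 ;;
    esac
  done
  return 1
}
```

For example, is_excluded /home/cdsw/.cache/pip succeeds, while is_excluded /home/cdsw/analysis.py fails, so a project file such as analysis.py would still be synced.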