Register a GCP environment

Once you’ve met the cloud provider requirements, register your GCP environment.

Before you begin

These steps assume that you have fulfilled the environment prerequisites described in GCP environment prerequisites.

Steps - CDP web interface

  1. Navigate to the Management Console > Environments > Register environment.

  2. On the Register Environment page, provide the following information:
    Parameter Description
    General Information
    Environment Name Enter a name for your environment. The name:
    • Must be between 5 and 28 characters long.
    • Can only include lowercase letters, numbers, and hyphens.
    • Must start with a lowercase letter.
    Description (Optional) Enter a description for your environment.
    Select Cloud Provider Select Google Cloud.
    Google Cloud Platform Credential
    Select Credential Select an existing credential or select Create new credential.

    For instructions on how to create a credential for Google Cloud, refer to Create a provisioning credential for GCP.
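
The naming rules for Environment Name can be checked before you open the registration wizard. The following is a minimal sketch, assuming bash; `validate_env_name` is a hypothetical helper and not part of CDP.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of CDP): checks a candidate environment name
# against the rules listed above:
#   - between 5 and 28 characters long
#   - only lowercase letters, numbers, and hyphens
#   - starts with a lowercase letter
validate_env_name() {
    local LC_ALL=C          # keep [a-z] strictly ASCII lowercase
    local name="$1"
    # One leading lowercase letter, then 4 to 27 allowed characters.
    [[ "$name" =~ ^[a-z][a-z0-9-]{4,27}$ ]]
}

# Example usage:
#   validate_env_name "test-env" && echo "name is acceptable"
```

The function returns a zero exit status for an acceptable name and a non-zero status otherwise, so it can be used directly in an `if` statement.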

  3. Click Next.
  4. On the Data Access and Data Lake Scaling page, provide the following information:
    Parameter Description
    Data Lake Settings
    Data Lake Cluster Name Enter a name for the Data Lake cluster that will be created for this environment. The name:
    • Must be between 5 and 100 characters long
    • Must contain lowercase letters
    • Cannot contain uppercase letters
    • Must start with a letter
    • Can only include the following characters: a-z, 0-9, -.
    Data Lake Version Select the Cloudera Runtime version to deploy for your Data Lake. The latest stable version is used by default.

    All Data Hub clusters provisioned within this Data Lake use the same Runtime version.

    Note: Google Cloud environments can only be provisioned in CDP with Runtime version 7.2.8 or newer.

    Data Access
    Assumer Service Account Select the IDBroker service account created in Minimum setup for cloud storage.
    Storage Location Base Select the Google Storage location created for data in Minimum setup for cloud storage.
    Data Access Service Account Select the Data Lake service account created in Minimum setup for cloud storage.
    IDBroker Mappings We recommend that you leave this out and set it up after registering your environment as part of Onboarding CDP users and groups for cloud storage.
    Scale
    Scale Select Data Lake scale. By default, “Light Duty” is used. For more information on Data Lake scale, refer to Data Lake scale.
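
The Data Lake name rules and the minimum Runtime version from the note above can be checked with a short script. This is a sketch under stated assumptions: both function names are hypothetical, and the version check assumes a plain dotted numeric version such as 7.2.8.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of CDP).

# Data Lake cluster name: 5 to 100 characters, a-z, 0-9 and hyphen only,
# starting with a (lowercase) letter.
validate_datalake_name() {
    local LC_ALL=C
    [[ "$1" =~ ^[a-z][a-z0-9-]{4,99}$ ]]
}

# Returns success when the given Runtime version is 7.2.8 or newer.
# Assumes a dotted numeric version with up to three components.
runtime_supported() {
    local IFS=.
    local -a have=($1) need=(7 2 8)
    local i h n
    for i in 0 1 2; do
        h="${have[i]:-0}"
        n="${need[i]}"
        (( 10#$h > 10#$n )) && return 0
        (( 10#$h < 10#$n )) && return 1
    done
    return 0    # all three components are equal to 7.2.8
}

# Example usage:
#   validate_datalake_name "my-dl" && runtime_supported "7.2.10" && echo "ok"
```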
  5. Click Next.
  6. On the Region, Networking, Security and Storage page, provide the following information:
    Parameter Description
    Region
    Select Region Select the region where your VPC network is located.
    Network
    Use shared VPC This option is disabled by default. Enable this if you would like to use your existing shared VPC. Next enter:
    • Host project ID
    • Network name
    • Subnet name(s). If providing multiple, provide a comma-separated list.
    Select Network Select the existing VPC network that you created as a prerequisite in the VPC network and subnets step. All CDP resources will be provisioned into this network.
    Select Subnets Select at least one existing subnet.
    Enable Cluster Connectivity Manager This option is enabled by default. Cluster Connectivity Manager (CCM) enables communication with Data Lake and Data Hub workload clusters that are on private subnets. You can disable this option if you do not want to use CCM. For more information about the required setup, refer to Cluster Connectivity Manager documentation.
    Don't Create Public Ip When CCM is enabled, you can enable this option to use private IPs instead of public IPs.
    Enable FreeIPA HA This is currently not supported for Google Cloud.
    Proxies Select a proxy configuration if previously registered. For more information refer to Setting up a proxy server.
    Security Access Settings
    Select Security Access Type You have two options:
    • Do not create firewall rule: If you are using a shared VPC you can set the firewall rules directly on the VPC. If you did so, you can select this option.
    • Provide existing firewall rules: If not all of your firewall rules are set directly on the VPC, provide the previously created firewall rules for SSH and UI access. You should select two existing firewall rules, one for Knox gateway-installed nodes and another for all other nodes. You may select the same firewall rule in both places if needed.

    For information on required ports, see Firewall rules.

    SSH Settings
    New SSH public key Upload a public key directly from your computer.

    Note: CDP does not use this SSH key. The matching private key can be used by your CDP administrator for root-level access to the instances provisioned for the Data Lake and Data Hub.

    Add tags You can optionally add tags to be created for your resources on GCP. Refer to Defining custom tags.
  7. Click Next.
  8. On the Audit and Storage page, provide the following information:
    Parameter Description
    Logs - Storage and Audit
    Logger Service Account Select the Logger service account created in Minimum setup for cloud storage.
    Logs Location Base Select the Google Storage location created for logs in Minimum setup for cloud storage.
    Telemetry
    Enable Workload Analytics Enables Workload Manager support for workload clusters created within this environment. When this setting is enabled, diagnostic information about job and query execution is sent to the Workload Manager.
    Enable Cluster Logs Collection When this option is enabled, the logs generated during deployments are automatically sent to Cloudera.
  9. Click Register Environment to trigger environment registration.
  10. Registration triggers the creation of the FreeIPA server and the Data Lake cluster. Environment creation takes about 60 minutes, and you can monitor the progress from the web UI. Once environment creation has completed, the environment status changes to “Running”.

Steps - CDP CLI

Unlike in the CDP web interface, environment creation in the CDP CLI is a two-step process: you create the environment first and the Data Lake second. Use the following commands to create an environment in CDP.

  1. Once you’ve met the prerequisites, register your GCP environment in CDP using the cdp environments create-gcp-environment command and providing the CLI input parameters. For example:
    cdp environments create-gcp-environment --cli-input-json '{
        "environmentName": "test-env",
        "description": "Test GCP environment",
        "credentialName": "test-gcp-crd",
        "region": "us-west2",
        "publicKey": "ssh-rsa AAAAB3NzaZ1yc2EAAAADAQABAAABAQDwCI/wmQzbNn9YcA8vdU+Ot41IIUWJfOfiDrUuNcULOQL6ke5qcEKuboXzbLxV0YmQcPFvswbM5S4FlHjy2VrJ5spyGhQajFEm9+PgrsybgzHkkssziX0zRq7U4BVD68kSn6CuAHj9L4wx8WBwefMzkw7uO1CkfifIp8UE6ZcKKKwe2fLR6ErDaN9jQxIWhTPEiFjIhItPHrnOcfGKY/p6OlpDDUOuMRiFZh7qMzfgvWI+UdN/qjnTlc/M53JftK6GJqK6osN+j7fCwKEnPwWC/gmy8El7ZMHlIENxDut6X0qj9Okc/JMmG0ebkSZAEbhgNOBNLZYdP0oeQGCXjqdv",
        "enableTunnel": true,
        "usePublicIp": true,
        "existingNetworkParams": {
            "networkName": "eng-private",
            "subnetNames": [
                "private-us-west2"
            ],
            "sharedProjectId": "dev-project"
        },
        "logStorage": {
            "storageLocationBase": "gs://logs",
            "serviceAccountEmail": "logger@dev-project.iam.gserviceaccount.com"
        }
    }'
    Parameter Description
    environmentName Provide a name for your environment.
    credentialName Provide the name of the credential created earlier.
    region Specify the region where your existing VPC network is located. For example, “us-west2” is a valid region.
    publicKey Paste your SSH public key.
    existingNetworkParams Provide a JSON specifying the following:
    {
      "networkName": "string",
      "subnetNames": ["string", ...],
      "sharedProjectId": "string"
    }

    Replace the values with the actual VPC network name, one or more subnet names and shared project ID.

    The sharedProjectId value needs to be set in the following way:
    • For a shared VPC, set it to the GCP host project ID
    • For a non-shared VPC, set it to the GCP project ID of the project where CDP is being deployed.
    enableTunnel CCM is enabled by default (set to “true”). To disable it, set this to “false”. If you disable it, you must also add the following to your JSON definition to specify two security groups:
    "securityAccess": {
      "securityGroupIdForKnox": "string",
      "defaultSecurityGroupId": "string"
    }
    usePublicIp Set this to “true” or “false”, depending on whether or not you want to create public IPs.
    logStorage Provide a JSON specifying your configuration for cluster and audit logs:
    {
      "storageLocationBase": "string",
      "serviceAccountEmail": "string"
    }

    The storageLocationBase should be in the following format: gs://my-bucket-name.
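
Before running the create command, it can help to sanity-check the two logStorage values. The following is a minimal sketch, assuming bash. `check_log_storage` is a hypothetical helper. The checks are deliberately loose: they only confirm the gs:// prefix described above and that the email ends in gserviceaccount.com, as in the example values. The file name env.json in the usage comment is also an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of the CDP CLI) for the
# logStorage values passed to create-gcp-environment.
check_log_storage() {
    local location="$1" email="$2" status=0
    # storageLocationBase should look like gs://my-bucket-name
    if [[ "$location" != gs://?* ]]; then
        echo "storageLocationBase must look like gs://my-bucket-name: $location" >&2
        status=1
    fi
    # Loose check only: something@something.gserviceaccount.com
    if [[ "$email" != ?*@?*.gserviceaccount.com ]]; then
        echo "serviceAccountEmail does not look like a service account email: $email" >&2
        status=1
    fi
    return "$status"
}

# Example usage, with the JSON definition kept in a local file (assumed name):
#   check_log_storage "gs://logs" "logger@dev-project.iam.gserviceaccount.com" \
#     && cdp environments create-gcp-environment --cli-input-json "$(cat env.json)"
```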

  2. To verify that your environment is running, use:
    cdp environments list-environments
    You can also log in to the CDP web interface to check the deployment status.
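
Because environment creation can take a while, you may prefer to poll for the status instead of re-running the command by hand. The sketch below is a generic retry loop, assuming bash. `wait_for_status` is a hypothetical helper, and the describe-environment call and the "AVAILABLE" status string in the usage comment are assumptions that you should verify against the output of your CLI version.

```shell
#!/usr/bin/env bash
# Hypothetical helper: runs a command repeatedly until its output contains
# the expected text, or until the number of attempts is used up.
#   wait_for_status EXPECTED ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_for_status() {
    local expected="$1" attempts="$2" delay="$3"
    shift 3
    local i
    for (( i = 0; i < attempts; i++ )); do
        # -F: match the expected text literally, not as a regular expression
        if "$@" | grep -qF -- "$expected"; then
            return 0
        fi
        sleep "$delay"
    done
    return 1
}

# Example usage (status string and subcommand are assumptions):
#   wait_for_status "AVAILABLE" 90 60 \
#     cdp environments describe-environment --environment-name test-env
```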
  3. Once your environment and Data Lake are running, you should set IDBroker Mappings. To create the mappings, run the cdp environments set-id-broker-mappings command. For example:
    cdp environments set-id-broker-mappings \
     --environment-name test-env \
     --data-access-role dl-admin@dev-project.iam.gserviceaccount.com \
     --ranger-audit-role ranger-audit@dev-project.iam.gserviceaccount.com \
     --mappings '[{"accessorCrn": "crn:altus:iam:us-west-1:45ca3068-42a6-4227-8394-13a4493e2ac0:user:430c534d-8a19-4d9e-963d-8af377d16963", "role": "data-science@dev-project.iam.gserviceaccount.com"},{"accessorCrn":"crn:altus:iam:us-west-1:45ca3068-42a6-4227-8394-13a4493e2ac0:machineUser:mfox-gcp-idbmms-test-mu/2cbca867-647b-44b9-8e41-47a01dea6c19","role":"data-eng@dev-project.iam.gserviceaccount.com"}]'
    Parameter Description
    environment-name Specify the name of the environment created earlier.
    data-access-role Specify the email address of the Data Lake admin service account created earlier.
    ranger-audit-role Specify the email address of the Ranger audit service account created earlier.
    mappings Map CDP users or groups to GCP service accounts created earlier. Use the following syntax:
    [
      {
        "accessorCrn": "string",
        "role": "string"
      }
      ...
    ]

    You can obtain the CRN of a user or group from the Management Console > User Management by navigating to the details of a specific user or group.

    The role should be specified as a service account email.
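
Writing the mappings JSON by hand is error-prone once there are several entries. The following is a minimal sketch, assuming bash, that assembles the array from CRN=email pairs. `build_mappings` is a hypothetical helper, and it assumes that neither value contains a double quote, a backslash, or an equals sign.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the CDP CLI): builds the JSON array for
# --mappings from arguments of the form "<accessorCrn>=<service account email>".
build_mappings() {
    local pair crn role sep="" out="["
    for pair in "$@"; do
        crn="${pair%%=*}"     # text before the first "="
        role="${pair#*=}"     # text after the first "="
        out+="${sep}{\"accessorCrn\": \"${crn}\", \"role\": \"${role}\"}"
        sep=", "
    done
    printf '%s]\n' "$out"
}

# Example usage (placeholders in angle brackets are yours to fill in):
#   cdp environments set-id-broker-mappings \
#     --environment-name test-env \
#     --data-access-role dl-admin@dev-project.iam.gserviceaccount.com \
#     --ranger-audit-role ranger-audit@dev-project.iam.gserviceaccount.com \
#     --mappings "$(build_mappings "<user-crn>=data-science@dev-project.iam.gserviceaccount.com")"
```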

  4. Next, sync IDBroker mappings:
    cdp environments sync-id-broker-mappings --environment-name test-env
  5. Finally, check the sync status:
    cdp environments get-id-broker-mappings-sync-status --environment-name test-env
  6. Once your environment is running, you can create a Data Lake using the cdp datalake create-gcp-datalake command and providing the CLI input parameters:
    cdp datalake create-gcp-datalake --cli-input-json '{
        "datalakeName": "my-dl",
        "environmentName": "test-env",
        "scale": "LIGHT_DUTY",
        "cloudProviderConfiguration": {
            "serviceAccountEmail": "idbroker@dev-project.iam.gserviceaccount.com",
            "storageLocation": "gs://data-storage"
        }
    }' 
    Parameter Description
    datalakeName Provide a name for your Data Lake.
    environmentName Provide the name of the environment created earlier.
    scale Provide the Data Lake scale. It must be one of:
    • LIGHT_DUTY or
    • MEDIUM_DUTY_HA.
    cloudProviderConfiguration Provide the name of the data storage bucket and the email of the IDBroker service account.
  7. To verify that your Data Lake is running, use:
    cdp datalake list-datalakes

    You can also log in to the CDP web interface to check the deployment status.
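
The input JSON for the create-gcp-datalake command in step 6 can also be generated from variables, which makes it easy to reject an unsupported scale value before calling the CLI. This is a sketch, assuming bash. `datalake_input_json` is a hypothetical helper, and it assumes that none of the values contain double quotes or backslashes.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the CDP CLI): prints the input JSON for
# create-gcp-datalake, or fails if the scale is not one of the two values
# listed in the parameter table.
#   datalake_input_json NAME ENVIRONMENT SCALE SERVICE_ACCOUNT_EMAIL STORAGE_LOCATION
datalake_input_json() {
    local name="$1" env="$2" scale="$3" email="$4" location="$5"
    case "$scale" in
        LIGHT_DUTY|MEDIUM_DUTY_HA) ;;
        *)
            echo "scale must be LIGHT_DUTY or MEDIUM_DUTY_HA: $scale" >&2
            return 1
            ;;
    esac
    printf '{"datalakeName": "%s", "environmentName": "%s", "scale": "%s", "cloudProviderConfiguration": {"serviceAccountEmail": "%s", "storageLocation": "%s"}}\n' \
        "$name" "$env" "$scale" "$email" "$location"
}

# Example usage:
#   input="$(datalake_input_json my-dl test-env LIGHT_DUTY \
#       idbroker@dev-project.iam.gserviceaccount.com gs://data-storage)" \
#     && cdp datalake create-gcp-datalake --cli-input-json "$input"
```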

After you finish

After your environment is running, perform the following steps: