Configuring Google Cloud Storage Connectivity

Google Cloud Storage (GCS) is the object storage component of the Google Cloud Platform (GCP), which can act as the persistent storage layer for CDH clusters. You can point to data stored in GCS with the prefix gs://<bucket>. For example, to copy data from GCS to HDFS with distcp, use the following command:
hadoop distcp gs://bucket/45678/ hdfs://aNameNode:8020/45678/ 
For more information about GCS such as access control and security, see the GCS documentation.

Before you start, review the supported services and limitations.

Supported Services

CDH with GCS as the storage layer supports the following services:

  • Hive
  • MapReduce
  • Spark

Limitations

Note the following limitations with GCS support in CDH:

  • Cloudera Manager’s Backup and Disaster Recovery and other management features, such as credential management, do not support GCS.
  • GCS cannot be the default file system for the cluster.
  • Services not explicitly listed under the Supported Services section, such as Impala, Hue, and Cloudera Navigator, are not supported.

Connect the Cluster to GCS

Before you can configure connectivity between your CDH cluster and GCS, you need the following information from your GCS console:
  • The project ID listed for project_id.
  • The service account email listed for client_email.
  • The service account private key ID listed for private_key_id.
  • The service account private key listed for private_key.

If you do not have this information, you can create a new private key.

  1. Open the Google Cloud Console and navigate to IAM & admin > Service accounts.
  2. Click on the service account you want to use for your CDH cluster. Alternatively, create a new one.
  3. Click Edit.
  4. Click Create Key.

    A window appears.

  5. Select JSON file and save the file.

    The JSON file contains all the required information to configure the GCS connection.
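The four values listed at the start of this section can be read directly from the downloaded file. The following is a minimal sketch, not an official tool: it assumes the key file uses the field names shown above (project_id, client_email, private_key_id, private_key), and the function name gcs_properties_from_key is hypothetical.

```python
import json

# Fields in the service account JSON key file, mapped to the Hadoop
# property names used later in this section.
KEY_FIELD_TO_PROPERTY = {
    "project_id": "fs.gs.project.id",
    "client_email": "fs.gs.auth.service.account.email",
    "private_key_id": "fs.gs.auth.service.account.private.key.id",
    "private_key": "fs.gs.auth.service.account.private.key",
}


def gcs_properties_from_key(key_json_text):
    """Return the GCS connection properties for a key file given as JSON text.

    Hypothetical helper: raises ValueError if an expected field is absent.
    """
    key = json.loads(key_json_text)
    missing = [field for field in KEY_FIELD_TO_PROPERTY if field not in key]
    if missing:
        raise ValueError("key file is missing fields: " + ", ".join(missing))
    return {prop: key[field] for field, prop in KEY_FIELD_TO_PROPERTY.items()}
```

To use it, read the saved file as text and pass the contents to gcs_properties_from_key; the returned dictionary holds one entry per property that takes a value from the key file.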

This section describes how to add the GCS-related properties and distribute them to every node in the cluster. Alternatively, you can submit them on a per-job basis.
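For the per-job alternative, one possible approach is to pass each property with the Hadoop generic option -D when you run the command. The sketch below only builds the argument list. It assumes the Hadoop subcommand you run accepts generic options, and the function name per_job_command is hypothetical. Values passed on the command line can be visible to other users on the host, so treat the private key with care.

```python
def per_job_command(properties, hadoop_args):
    """Build an argument list that passes properties with -D.

    hadoop_args is the subcommand followed by its own arguments,
    for example ["fs", "-ls", "gs://<existing-bucket>"].
    Generic options are placed directly after the subcommand.
    """
    command = ["hadoop", hadoop_args[0]]
    for name, value in properties.items():
        command += ["-D", f"{name}={value}"]
    return command + list(hadoop_args[1:])
```

The resulting list can be handed to subprocess.run without a shell, which avoids quoting problems with the line breaks in the private key value.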

Complete the following steps to add the GCS information to every node in the cluster:

  1. In the Cloudera Manager Admin Console, search for the following property that corresponds to the HDFS Service you want to use with GCS: Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml.
  2. Add the following properties:
    • Name: fs.gs.impl

      Value: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem

    • Name: fs.AbstractFileSystem.gs.impl

      Value: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS

    • Name: fs.gs.project.id

      Value: <project_id>

    • Name: fs.gs.auth.service.account.email

      Value: <client_email>

    • Name: fs.gs.auth.service.account.enable

      Value: true

    • Name: fs.gs.auth.service.account.private.key.id

      Value: <private_key_id>

    • Name: fs.gs.auth.service.account.private.key

      Value: <private_key>

    You can see sample entries below:

    • Name: fs.gs.impl

      Value: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem

    • Name: fs.AbstractFileSystem.gs.impl

      Value: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS

    • Name: fs.gs.project.id

      Value: [gcp-project-id]

    • Name: fs.gs.auth.service.account.email

      Value: [service-account]@[gcp-project-id].iam.gserviceaccount.com

    • Name: fs.gs.auth.service.account.enable

      Value: true

    • Name: fs.gs.auth.service.account.private.key.id

      Value: 0123456789abcde0123456789abcde0123456789

    • Name: fs.gs.auth.service.account.private.key

      Value: -----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n

  3. Save the changes.
  4. Deploy the client configurations for the HDFS service. Note that HDFS restarts as part of this process.
  5. Run the following command, replacing <existing-bucket> with the name of an existing bucket, to verify that the cluster can access GCS:
    hadoop fs -ls gs://<existing-bucket>

    The command lists the contents of the bucket.
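The name and value pairs added in step 2 correspond to property elements in core-site.xml. If you prefer to prepare the entries as XML, they can be rendered as in the following sketch. It assumes your Cloudera Manager version lets you enter the safety valve contents as XML, and the function name to_core_site_xml is hypothetical.

```python
from xml.sax.saxutils import escape


def to_core_site_xml(properties):
    """Render a dict of Hadoop properties as <property> elements.

    Names and values are XML-escaped so that characters such as & and <
    do not break the snippet.
    """
    lines = []
    for name, value in properties.items():
        lines.append("<property>")
        lines.append(f"  <name>{escape(name)}</name>")
        lines.append(f"  <value>{escape(value)}</value>")
        lines.append("</property>")
    return "\n".join(lines)
```

Pass a dictionary containing all seven properties from step 2, including the fixed entries such as fs.gs.impl and fs.gs.auth.service.account.enable, to produce the complete snippet.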