Configuring Google Cloud Storage Connectivity
Once connectivity is configured, you can, for example, use distcp to copy data between a GCS bucket and HDFS:
hadoop distcp gs://bucket/45678/ hdfs://aNameNode:8020/45678/
For more information about GCS, such as access control and security, see the GCS documentation.
Before you start, review the supported services and limitations.
Supported Services
CDH with GCS as the storage layer supports the following services:
- Hive
- MapReduce
- Spark
Limitations
Note the following limitations with GCS support in CDH:
- Cloudera Manager’s Backup and Disaster Recovery and other management features, such as credential management, do not support GCS.
- GCS cannot be the default file system for the cluster.
- Services not explicitly listed under the Supported Services section, such as Impala, Hue, and Cloudera Navigator, are not supported.
Connect the Cluster to GCS
To connect the cluster to GCS, you need the following information from the service account JSON key file:
- The project ID listed for project_id.
- The service account email listed for client_email.
- The service account private key ID listed for private_key_id.
- The service account private key listed for private_key.
If you do not have this information, you can create a new private key.
- Open the Google Cloud Console and navigate to the Service Accounts page.
- Click on the service account you want to use for your CDH cluster. Alternatively, create a new one.
- Click Edit.
- Click Create Key.
A window appears.
- Select JSON as the key type and save the file.
The JSON file contains all the required information to configure the GCS connection.
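For reference, the key file is a JSON document whose fields map directly to the values listed above. The entries below are placeholders only, not real credentials; the file also contains additional fields, such as client_id and token URIs, that are not needed for this configuration:
{
  "type": "service_account",
  "project_id": "[gcp-project-id]",
  "private_key_id": "[private-key-id]",
  "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
  "client_email": "[service-account]@[gcp-project-id].iam.gserviceaccount.com",
  ...
}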
This section describes how to add the GCS-related properties to the cluster configuration and distribute them to every node in the cluster. Alternatively, you can supply the properties on a per-job basis, as in the sketch below.
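For example, a per-job submission might pass the same properties as generic Hadoop options instead of storing them in the cluster configuration. This is only a sketch: it assumes the GCS connector is already available on the classpath, and the bracketed values are placeholders for your own project, service account, and key:
hadoop distcp \
  -Dfs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
  -Dfs.gs.project.id=[gcp-project-id] \
  -Dfs.gs.auth.service.account.enable=true \
  -Dfs.gs.auth.service.account.email=[service-account]@[gcp-project-id].iam.gserviceaccount.com \
  -Dfs.gs.auth.service.account.private.key.id=[private-key-id] \
  -Dfs.gs.auth.service.account.private.key="[private-key]" \
  gs://bucket/45678/ hdfs://aNameNode:8020/45678/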
Complete the following steps to add the GCS information to every node in the cluster:
- In the Cloudera Manager Admin Console, search for the following property that corresponds to the HDFS Service you want to use with GCS: Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml.
- Add the following properties:
- Name: fs.gs.impl
Value: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
- Name: fs.AbstractFileSystem.gs.impl
Value: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
- Name: fs.gs.project.id
Value: <project_id>
- Name: fs.gs.auth.service.account.email
Value: <client_email>
- Name: fs.gs.auth.service.account.enable
Value: true
- Name: fs.gs.auth.service.account.private.key.id
Value: <private_key_id>
- Name: fs.gs.auth.service.account.private.key
Value: <private_key>
You can see sample entries below:
- Name: fs.gs.impl
Value: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
- Name: fs.AbstractFileSystem.gs.impl
Value: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
- Name: fs.gs.project.id
Value: [gcp-project-id]
- Name: fs.gs.auth.service.account.email
Value: [service-account]@[gcp-project-id].iam.gserviceaccount.com
- Name: fs.gs.auth.service.account.enable
Value: true
- Name: fs.gs.auth.service.account.private.key.id
Value: 0123456789abcde0123456789abcde0123456789
- Name: fs.gs.auth.service.account.private.key
Value: -----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n
- Save the changes.
- Deploy the client configurations for the HDFS service. Note that this process involves restarting HDFS.
- Run the following command to verify that you can access GCS, replacing <existing-bucket> with the name of an existing bucket:
hadoop fs -ls gs://<existing-bucket>
The command lists the contents of the bucket.
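Once the client configurations are deployed, the safety valve entries are written into core-site.xml as standard Hadoop property elements. As a reference only, the first two entries above would appear roughly as follows:
<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
</property>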