Configuring Google Cloud Storage Connectivity
For example, the following command copies data from GCS to HDFS with distcp:
hadoop distcp gs://bucket/45678/ hdfs://NameNode:8020/45678/
For more information about GCS, such as access control and security, see the GCS documentation.
Connecting your Hadoop cluster to GCS is a multi-step process:
- Step 1. Download the GCS Connector and Create a Parcel
- Step 2. Distribute the Parcel
- Step 3. Connect the Cluster to GCS
Before setting up your CDH cluster to point to GCS, review the Supported Services and Limitations.
Supported Services
CDH 6.1 with GCS as the storage layer supports the following services:
- Hive
- MapReduce
- Spark
Limitations
Note the following limitations with GCS support in CDH 6.1:
- Cloudera Manager is not supported. This means that features such as Backup and Disaster Recovery and credential management do not support GCS.
- GCS cannot be the default file system for the cluster.
- Any other services not explicitly listed under the Supported Services section, such as Impala, Hue, and Cloudera Navigator, are not supported.
Step 1. Download the GCS Connector and Create a Parcel
To use GCS with CDH, you must manually download the GCS connector and create a parcel with it:
- Download the Cloud Storage connector for Hadoop 3.x on the Cloudera Manager Server host for your cluster.
- Copy the .jar file to your $HADOOP_COMMON_LIB_JARS_DIR:
cp ~/Downloads/gcs-connector-hadoop3-latest.jar $HADOOP_COMMON_LIB_JARS_DIR
- Create a parcel with the .jar file:
- Download the create_parcel.sh script provided by Google.
- Run the script with the following command:
./create_parcel.sh -f gcsconnector -v 3.x -o <os_distro_suffix>
Replace <os_distro_suffix> with the suffix for your operating system. For the list of valid suffixes, see Parcel OS distro suffixes. For example, if you are creating a parcel for RHEL 7, use el7.
The script creates a .parcel file and a .SHA file that you use to distribute the GCS connector.
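Before distributing the parcel, it can be worth confirming that the checksum file matches the parcel. The sketch below assumes that the .SHA file holds the SHA-1 hex digest of the .parcel file as its first field, and that sha1sum is available on the host. The function name verify_parcel_sha and the file names in the usage comment are hypothetical; substitute the names that create_parcel.sh actually produced.

```shell
# Sketch: compare the digest stored in the .SHA file with the parcel's SHA-1.
# Assumption: the first whitespace-separated field of the .SHA file is the digest.
verify_parcel_sha() {
  parcel=$1
  sha_file=$2
  expected=$(awk '{print $1; exit}' "$sha_file")
  actual=$(sha1sum "$parcel" | awk '{print $1}')
  # Succeeds (exit status 0) only when both digests are identical.
  [ "$expected" = "$actual" ]
}

# Usage (hypothetical file names):
#   verify_parcel_sha my-gcs.parcel my-gcs.parcel.sha && echo "checksum OK"
```

A mismatch usually means the parcel was modified or re-created after the checksum file was written, in which case regenerating both files is the simplest fix.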
Step 2. Distribute the Parcel
After you create the parcel, use Cloudera Manager to distribute the parcel to all the hosts in the cluster:
- For a local repository, copy the .parcel and .SHA files to the following directory: /opt/cloudera/parcel-repo.
- For an internal repository on a web server, add the .parcel and .SHA files to that repository and update manifest.json.
- Open the Cloudera Manager Admin Console.
- Navigate to the Parcels page and click Check for New Parcels.
The GCS connector parcel appears in the list of parcels.
- Find the GCS connector parcel in the list and distribute the parcel.
- Activate the parcel.
Step 3. Connect the Cluster to GCS
To connect the cluster to GCS, you need the following information from the JSON key file of your Google service account:
- The project ID listed for project_id.
- The service account email listed for client_email.
- The service account private key ID listed for private_key_id.
- The service account private key listed for private_key.
If you do not have this information, you can create a new private key.
- Open the Google Cloud Console and navigate to the service accounts page.
- Click on the service account you want to use for your CDH cluster. Alternatively, create a new one.
- Click Edit.
- Click Create Key.
A window appears.
- Select JSON file and save the file.
The JSON file contains all the required information to configure the GCS connection.
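The four values can be copied out of the key file by hand, or extracted with a small helper such as the sketch below. It assumes the key file is pretty-printed with one "name": "value" pair per line, which is how downloaded key files are typically laid out; it is not a general JSON parser. The function name json_field and the path in the usage comment are hypothetical.

```shell
# Sketch: print the string value of a top-level field from a service account key file.
# Assumption: one "name": "value" pair per line; escape sequences such as \n are
# printed exactly as they appear in the file.
json_field() {
  key_file=$1
  field=$2
  sed -n 's/^[[:space:]]*"'"$field"'"[[:space:]]*:[[:space:]]*"\(.*\)",\{0,1\}[[:space:]]*$/\1/p' "$key_file"
}

# Usage (placeholder path):
#   json_field /path/to/key.json project_id
#   json_field /path/to/key.json client_email
#   json_field /path/to/key.json private_key_id
#   json_field /path/to/key.json private_key
```

The private key is a secret, so avoid echoing it to shared terminals or logs.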
This section describes how to add the GCS-related properties and distribute them to every node in the cluster. Alternatively, you can supply the properties on a per-job basis.
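One possible way to supply the properties for a single job is Hadoop's generic -conf option, which loads an extra configuration file for that invocation only. The sketch below is an assumption about how this could be done, not a procedure confirmed for CDH. The function name write_gcs_conf, the GCS_* variable names, and the file name are hypothetical; the property names are the ones used in this section.

```shell
# Sketch: write a job-local Hadoop configuration file with the GCS properties.
# Assumption: GCS_PROJECT_ID, GCS_CLIENT_EMAIL, GCS_PRIVATE_KEY_ID and
# GCS_PRIVATE_KEY are set by the caller (hypothetical variable names) and
# contain no XML special characters such as & or <.
write_gcs_conf() {
  (
    umask 077   # a newly created file is readable by the owner only
    cat > "$1" <<EOF
<?xml version="1.0"?>
<configuration>
  <property><name>fs.gs.impl</name><value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value></property>
  <property><name>fs.AbstractFileSystem.gs.impl</name><value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value></property>
  <property><name>fs.gs.project.id</name><value>${GCS_PROJECT_ID}</value></property>
  <property><name>fs.gs.auth.service.account.email</name><value>${GCS_CLIENT_EMAIL}</value></property>
  <property><name>fs.gs.auth.service.account.enable</name><value>true</value></property>
  <property><name>fs.gs.auth.service.account.private.key.id</name><value>${GCS_PRIVATE_KEY_ID}</value></property>
  <property><name>fs.gs.auth.service.account.private.key</name><value>${GCS_PRIVATE_KEY}</value></property>
</configuration>
EOF
  )
}

# Usage (hypothetical file name):
#   write_gcs_conf ./gcs-job-site.xml
#   hadoop fs -conf ./gcs-job-site.xml -ls gs://<existing-bucket>
```

Passing a file with -conf keeps the private key out of the command line and shell history, which is why this sketch prefers it over individual -D property=value arguments. The connector .jar still has to be on the Hadoop classpath for the job.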
Complete the following steps to add the GCS information to every node in the cluster:
- In the Cloudera Manager Admin Console, search for the following property that corresponds to the HDFS Service you want to use with GCS: Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml.
- Add the following properties:
- Name: fs.gs.impl
Value: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
- Name: fs.AbstractFileSystem.gs.impl
Value: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
- Name: fs.gs.project.id
Value: <project_id>
- Name: fs.gs.auth.service.account.email
Value: <client_email>
- Name: fs.gs.auth.service.account.enable
Value: true
- Name: fs.gs.auth.service.account.private.key.id
Value: <private_key_id>
- Name: fs.gs.auth.service.account.private.key
Value: <private_key>
You can see sample entries below:
- Name: fs.gs.impl
Value: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
- Name: fs.AbstractFileSystem.gs.impl
Value: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
- Name: fs.gs.project.id
Value: [gcp-project-id]
- Name: fs.gs.auth.service.account.email
Value: [service-account]@[gcp-project-id].iam.gserviceaccount.com
- Name: fs.gs.auth.service.account.enable
Value: true
- Name: fs.gs.auth.service.account.private.key.id
Value: 0123456789abcde0123456789abcde0123456789
- Name: fs.gs.auth.service.account.private.key
Value: -----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n
- Save the changes.
- Deploy the client configurations for the HDFS service. Note that this process restarts HDFS.
- Export the Hadoop classpath to point to the GCS .jar file.
For example, run the following command with the parcel name in place of <GCS Parcel name>:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/opt/cloudera/parcels/<GCS Parcel name>/lib/hadoop/lib/gcs-connector-hadoop3-latest.jar
- To verify that you can access GCS, run the following command, replacing <existing-bucket> with the name of an existing bucket:
hadoop fs -ls gs://<existing-bucket>
The command lists the contents of the bucket.
You can now run MapReduce, Hive, and Spark jobs against data stored in GCS.
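The classpath export in the steps above fails silently if the parcel name is mistyped, because a missing .jar is only noticed when a job cannot load the gs:// file system. A minimal sketch of a guarded version follows; the function name add_gcs_jar_to_classpath is hypothetical, and the path in the usage comment is the one from the export step.

```shell
# Sketch: append the connector .jar to HADOOP_CLASSPATH only if the file exists.
add_gcs_jar_to_classpath() {
  jar=$1
  if [ ! -f "$jar" ]; then
    echo "Connector jar not found: $jar" >&2
    return 1
  fi
  # The ${VAR:+...} form avoids a leading ":" when HADOOP_CLASSPATH is empty or unset.
  export HADOOP_CLASSPATH="${HADOOP_CLASSPATH:+$HADOOP_CLASSPATH:}$jar"
}

# Usage:
#   add_gcs_jar_to_classpath "/opt/cloudera/parcels/<GCS Parcel name>/lib/hadoop/lib/gcs-connector-hadoop3-latest.jar" \
#     && hadoop fs -ls gs://<existing-bucket>
```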