Configuring Google Cloud Storage Connectivity

Google Cloud Storage (GCS) is the object storage component of the Google Cloud Platform (GCP) and can act as the persistent storage layer for CDH clusters. You refer to data stored in GCS with the prefix gs://<bucket>. For example, to copy data from GCS to HDFS with distcp, use the following command:

hadoop distcp gs://bucket/45678/ hdfs://NameNode:8020/45678/
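
When the copy is scripted, it can help to assemble the distcp arguments in code and check the URIs first. The following is a minimal Python sketch, not part of CDH; the helper name build_distcp_command is hypothetical, and the bucket and NameNode values are the ones from the command above. It only builds the argument list and leaves running it to the caller.

```python
from __future__ import annotations

from urllib.parse import urlparse


def build_distcp_command(source: str, target: str) -> list[str]:
    """Return the argument list for `hadoop distcp <source> <target>`.

    Hypothetical helper: it only checks that each URI has a scheme and an
    authority (a bucket name or a NameNode address) before building the list.
    """
    for uri in (source, target):
        parsed = urlparse(uri)
        if not parsed.scheme or not parsed.netloc:
            raise ValueError(f"expected a URI such as gs://<bucket>/<path>, got {uri!r}")
    return ["hadoop", "distcp", source, target]


if __name__ == "__main__":
    # Same copy as the command above; the list can be passed to subprocess.run().
    print(build_distcp_command("gs://bucket/45678/", "hdfs://NameNode:8020/45678/"))
```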

For more information about GCS, such as access control and security, see the GCS documentation.

Connecting your Hadoop cluster to GCS is a multi-step process, described in Step 1 through Step 3 below.

Before setting up your CDH cluster to point to GCS, review the Supported Services and Limitations.

Supported Services

CDH 6.1 with GCS as the storage layer supports the following services:

Hive
MapReduce
Spark

Limitations

Note the following limitations with GCS support in CDH 6.1:

Cloudera Manager integration with GCS is not supported. This means that Cloudera Manager features such as Backup and Disaster Recovery and credential management do not support GCS.
GCS cannot be the default file system for the cluster.
Any other services not explicitly listed under the Supported Services section, such as Impala, Hue, and Cloudera Navigator, are not supported.

Step 1. Download the GCS Connector and Create a Parcel

To use GCS with CDH, you must manually download the GCS connector and create a parcel with it:

Download the Cloud Storage connector for Hadoop 3.x to the Cloudera Manager Server host for your cluster.

Copy the .jar file to your $HADOOP_COMMON_LIB_JARS_DIR:

cp ~/Downloads/gcs-connector-hadoop3-latest.jar $HADOOP_COMMON_LIB_JARS_DIR

Create a parcel with the .jar file:
1. Download the create_parcel.sh script provided by Google.
2. Run the script with the following command:
```
./create_parcel.sh -f gcsconnector -v 3.x -o <os_distro_suffix>
```
  Replace <os_distro_suffix> with your operating system suffix. For the list of suffixes, see Parcel OS distro suffixes. For example, if you’re creating a parcel for RHEL 7, use el7.
  
  The script creates a .parcel file and a .SHA file that you use to distribute the GCS connector.
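
Before distributing the files, you may want to confirm that the hash file matches the parcel. The sketch below assumes the .SHA file holds the hex SHA-1 digest of the .parcel file, optionally followed by a file name; that layout is an assumption about the parcel repository convention, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parcel_matches_hash(parcel: Path, sha_file: Path) -> bool:
    """Compare the parcel's SHA-1 with the first token stored in the hash file.

    Assumption: the hash file contains the hex SHA-1 digest of the parcel,
    optionally followed by whitespace or a file name.
    """
    tokens = sha_file.read_text().split()
    return bool(tokens) and tokens[0].lower() == sha1_of_file(parcel)
```

A mismatch usually means that one of the two files was truncated or replaced after the script ran.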

Step 2. Distribute the Parcel

After you create the parcel, use Cloudera Manager to distribute the parcel to all the hosts in the cluster:

For a local repository, copy the .parcel and .SHA files to the following directory: /opt/cloudera/parcel-repo.
For an internal repository on a web server, add the .parcel and .SHA files to that repository and update manifest.json.
Open the Cloudera Manager Admin Console.
Navigate to the Parcels page and click Check for New Parcels.
The GCS connector parcel appears in the list of parcels.
Find the GCS connector parcel in the list and distribute the parcel.
Activate the parcel.
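
For the local repository case, the copy can be scripted. This is a rough sketch only: stage_parcel is a hypothetical helper, the default directory is the one named above, and file ownership and permissions are left to the caller.

```python
import shutil
from pathlib import Path


def stage_parcel(parcel: Path, sha_file: Path,
                 repo_dir: Path = Path("/opt/cloudera/parcel-repo")) -> list:
    """Copy the parcel and its hash file into the local parcel repository.

    Hypothetical helper: both files must exist before anything is copied, so a
    parcel is never staged without its hash file. repo_dir must already exist.
    """
    for source in (parcel, sha_file):
        if not source.is_file():
            raise FileNotFoundError(f"missing parcel artifact: {source}")
    # copy2 preserves timestamps; it does not change file ownership.
    return [Path(shutil.copy2(source, repo_dir / source.name))
            for source in (parcel, sha_file)]
```

After the files are in place, continue in the Cloudera Manager Admin Console with Check for New Parcels as described above.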

Step 3. Connect the Cluster to GCS

Before you can configure connectivity between the cluster and GCS, you need the following information from the Google Cloud Console:

The project ID listed for project_id.
The service account email listed for client_email.
The service account private key ID listed for private_key_id.
The service account private key listed for private_key.

If you do not have this information, you can create a new private key as follows:

Open the Google Cloud Console and navigate to IAM & admin > Service accounts.
Click the service account you want to use for your CDH cluster. Alternatively, create a new one.
Click Edit.
Click Create Key.
A window appears.
Select JSON file and save the file.
The JSON file contains all the required information to configure the GCS connection.
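
To avoid copying values by hand, the four fields listed above can be read from the downloaded file. A minimal sketch, assuming the key file is a JSON object that uses exactly the field names listed above; the function name and the file path passed to it are hypothetical.

```python
import json
from pathlib import Path

# The four fields listed above, as named in the service account JSON key file.
REQUIRED_FIELDS = ("project_id", "client_email", "private_key_id", "private_key")


def read_key_fields(key_path: Path) -> dict:
    """Return the required fields from a service account key file.

    Raises KeyError naming any field that is absent or empty.
    """
    key = json.loads(key_path.read_text())
    missing = [field for field in REQUIRED_FIELDS if not key.get(field)]
    if missing:
        raise KeyError(f"key file is missing: {', '.join(missing)}")
    return {field: key[field] for field in REQUIRED_FIELDS}
```

Any other fields in the file are ignored by this sketch.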

This section describes how to add the GCS-related properties and distribute them to every node in the cluster. Alternatively, you can submit them on a per-job basis.
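
For the per-job alternative, Hadoop commands that parse generic options accept properties as -D name=value pairs placed before the command options. The sketch below builds such an argument list; the helper names are hypothetical, and the property names passed in would be the fs.gs.* properties described later in this section. Note that values passed this way, including a private key, are visible in the process list on the submitting host.

```python
from __future__ import annotations


def per_job_options(properties: dict[str, str]) -> list[str]:
    """Turn property name/value pairs into Hadoop generic -D name=value options."""
    options: list[str] = []
    for name, value in properties.items():
        options += ["-D", f"{name}={value}"]
    return options


def fs_ls_command(bucket: str, properties: dict[str, str]) -> list[str]:
    """Build `hadoop fs <generic options> -ls gs://<bucket>` as an argument list."""
    return ["hadoop", "fs", *per_job_options(properties), "-ls", f"gs://{bucket}"]
```

Passing the resulting list to subprocess.run() avoids shell quoting problems with multi-line values.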

Complete the following steps to add the GCS information to every node in the cluster:

In the Cloudera Manager Admin Console, search for the following property for the HDFS service that you want to use with GCS: Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml.
Add the following properties:
- Name: fs.gs.impl
  Value: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
- Name: fs.AbstractFileSystem.gs.impl
  Value: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
- Name: fs.gs.project.id
  Value: <project_id>
- Name: fs.gs.auth.service.account.email
  Value: <client_email>
- Name: fs.gs.auth.service.account.enable
  Value: true
- Name: fs.gs.auth.service.account.private.key.id
  Value: <private_key_id>
- Name: fs.gs.auth.service.account.private.key
  Value: <private_key>
You can see sample entries below:
- Name: fs.gs.impl
  Value: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
- Name: fs.AbstractFileSystem.gs.impl
  Value: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
- Name: fs.gs.project.id
  Value: [gcp-project-id]
- Name: fs.gs.auth.service.account.email
  Value: [service-account]@[gcp-project-id].iam.gserviceaccount.com
- Name: fs.gs.auth.service.account.enable
  Value: true
- Name: fs.gs.auth.service.account.private.key.id
  Value: 0123456789abcde0123456789abcde0123456789
- Name: fs.gs.auth.service.account.private.key
  Value: -----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n
Save the changes.
Deploy the client configurations for the HDFS service. Note that this process restarts HDFS.
Export the Hadoop classpath to point to the GCS .jar file.
For example, run the following command, replacing <GCS Parcel name> with the name of the parcel:
```
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/opt/cloudera/parcels/<GCS Parcel name>/lib/hadoop/lib/gcs-connector-hadoop3-latest.jar
```
To verify that you can access GCS, run the following command, replacing <existing-bucket> with the name of an existing bucket:
```
hadoop fs -ls gs://<existing-bucket>
```
The command lists the contents of the bucket.
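
The classpath export and the listing check can be combined in a small script. This is a sketch under assumptions: can_list_bucket is a hypothetical helper, the .jar path is whatever the export command above resolves to on your hosts, and the run parameter exists only so the logic can be exercised without a cluster.

```python
import os
import subprocess


def can_list_bucket(bucket: str, connector_jar: str, run=subprocess.run) -> bool:
    """Run `hadoop fs -ls gs://<bucket>` with the connector .jar on HADOOP_CLASSPATH.

    `run` is injectable so the check can be exercised without a cluster.
    """
    env = dict(os.environ)
    existing = env.get("HADOOP_CLASSPATH")
    # Append to any existing classpath, as the export command above does.
    env["HADOOP_CLASSPATH"] = f"{existing}:{connector_jar}" if existing else connector_jar
    result = run(["hadoop", "fs", "-ls", f"gs://{bucket}"], env=env)
    return result.returncode == 0
```

A return value of False means only that the command exited with a nonzero status; the command output shows the actual cause.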

You can now run MapReduce, Hive, and Spark jobs against data stored in GCS.
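
The name and value pairs added to the core-site.xml safety valve in this step can also be prepared as XML. The sketch below assumes that the safety valve accepts standard Hadoop property elements when edited as XML; property_xml is a hypothetical helper, and the value shown in the usage is the sample value from above.

```python
import xml.etree.ElementTree as ET


def property_xml(properties: dict) -> str:
    """Render name/value pairs as <property> elements for core-site.xml."""
    elements = []
    for name, value in properties.items():
        prop = ET.Element("property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
        # tostring() escapes characters such as & and < in the values.
        elements.append(ET.tostring(prop, encoding="unicode"))
    return "\n".join(elements)


if __name__ == "__main__":
    print(property_xml({
        "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
        "fs.gs.auth.service.account.enable": "true",
    }))
```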
