Accessing Cloud Data

Configure Access to GCS from Your Cluster

After obtaining the service account key, perform these steps on your cluster. The steps below assume that your service account key is called google-access-key.json. If you chose a different name, make sure to update the commands accordingly.

Steps

  1. Place the service account key on all nodes of the cluster.

    Note the following about where to place the file:

    • Make sure to use an absolute path such as /etc/hadoop/conf/google-access-key.json (where google-access-key.json is your JSON key).

    • The path must be the same on all nodes.

    • In a single-user cluster, /etc/hadoop/conf/google-access-key.json is appropriate. Permissions for the file should be set to 444.

    • If you need to use this option with a multi-user cluster, place the file in each user's home directory: ${USER_HOME}/.credentials/storage.json. Permissions for the file should be set to 400.

    There are many ways to place the file on the hosts. For example, you can create a `hosts` file listing all the hosts, one per line, and then run the following:

    for host in `cat hosts`;
    do scp -i <Path_to_ssh_private_key> google-access-key.json <Ssh_user>@$host:/etc/hadoop/conf/google-access-key.json;
    done
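
    Once the key is in place on each host, you can apply the permissions mentioned above in the same way. The loop below is a minimal sketch for the single-user case, assuming the same `hosts` file, SSH private key, and key file path as the example above:

    # set read-only permissions on the key file on every host
    for host in `cat hosts`;
    do ssh -i <Path_to_ssh_private_key> <Ssh_user>@$host chmod 444 /etc/hadoop/conf/google-access-key.json;
    done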

  2. In the Ambari web UI, set the following properties under Custom core-site.

    To set these properties, navigate to HDFS > Configs > Custom core-site and click Add Property. The JSON and P12 properties cannot be set at the same time. (A core-site.xml view of the resulting configuration follows the notes below.)

    • If using a key in the JSON format (recommended), set the following properties:

      fs.gs.auth.service.account.json.keyfile=<Path-to-the-JSON-file>
      fs.gs.working.dir=/
      fs.gs.path.encoding=uri-path
      fs.gs.reported.permissions=777

    • If using a key in the P12 format, set the following properties:

      fs.gs.auth.service.account.email=<Your-Service-Account-email>
      fs.gs.auth.service.account.keyfile=<Path-to-the-p12-file>
      fs.gs.working.dir=/
      fs.gs.path.encoding=uri-path
      fs.gs.reported.permissions=777

      Note

      Setting fs.gs.working.dir configures the initial working directory of a GHFS instance. This should always be set to "/".

      Setting fs.gs.path.encoding sets the path encoding to be used, and allows for spaces in the filename. This should always be set to "uri-path".

      Setting fs.gs.reported.permissions sets permissions for file listings when using gs. The default 700 may end up being too restrictive for some processes performing file-based checks.
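
    When you save the configuration in the next step, Ambari writes these properties into core-site.xml on the cluster nodes. For reference, the JSON-keyfile variant above would appear roughly as follows; the keyfile path shown is the single-user example path from step 1, so substitute the path you actually used:

      <property>
        <name>fs.gs.auth.service.account.json.keyfile</name>
        <value>/etc/hadoop/conf/google-access-key.json</value>
      </property>
      <property>
        <name>fs.gs.working.dir</name>
        <value>/</value>
      </property>
      <property>
        <name>fs.gs.path.encoding</name>
        <value>uri-path</value>
      </property>
      <property>
        <name>fs.gs.reported.permissions</name>
        <value>777</value>
      </property>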

  3. Save the configuration change and restart affected services. Additionally, depending on which services you use, you must also restart other services that access cloud storage, such as Spark Thrift Server, HiveServer2, and Hive Metastore. Ambari will not list these as affected, but they require a restart to pick up the configuration changes.

  4. Test access to the Google Cloud Storage bucket by running a few commands from any cluster node. For example, you can use the command listed below (replace “mytestbucket” with the name of your bucket):

    hadoop fs -ls gs://mytestbucket/
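
    If listing succeeds, you can also verify write access with a quick round trip. The commands below are only an illustration; the testdir directory and hosts.copy file names are arbitrary:

    hadoop fs -mkdir gs://mytestbucket/testdir
    hadoop fs -put /etc/hosts gs://mytestbucket/testdir/hosts.copy
    hadoop fs -cat gs://mytestbucket/testdir/hosts.copy
    hadoop fs -rm -r gs://mytestbucket/testdir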

After performing these steps, you should be able to start working with the Google Cloud Storage bucket(s).
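
For example, you can copy existing HDFS data into a bucket with DistCp. The source path and destination below are placeholders; replace them with your own directory and bucket:

    hadoop distcp /user/hive/warehouse/sample_table gs://mytestbucket/backup/sample_table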