Prerequisites

Learn how to collect the information you need to deploy the S3 to Databricks ReadyFlow, and meet other prerequisites.

For your data ingest source

You have the source S3 bucket and path.
You have performed one of the following to configure access to the source S3 bucket:
- You have configured access to the S3 bucket with a RAZ-enabled environment.
  It is a best practice to enable RAZ to control access to your object store buckets. This allows you to use your Cloudera on cloud credentials to access S3 buckets, increases auditability, and makes object store data ingest workflows portable across cloud providers.
  1. Ensure that Fine-grained access control is enabled for your Cloudera Data Flow environment.
  2. From the Ranger UI, navigate to the S3 repository.
  3. Create a policy to govern access to the S3 bucket and path used in your ingest workflow.
    Tip: The Path field must begin with a forward slash (/).
  4. Add the machine user that you have created for your ingest workflow to the policy you just created.
  For more information, see Creating Ranger policy to use in RAZ-enabled AWS environment.
- You have configured access to the S3 bucket using IDBroker mapping.
  If your environment is not RAZ-enabled, you can configure access to the S3 bucket using IDBroker mapping.
  1. Access IDBroker mappings.
    1. To access IDBroker mappings in your environment, click Actions > Manage Access.
    2. Choose the IDBroker Mappings tab where you can provide mappings for users or groups and click Edit.
  2. Click the blue + sign and add your Cloudera Workload User, together with the corresponding AWS role that provides read access to your folder in the source S3 bucket, to the Current Mappings section.
    Note: You can get the AWS IAM role ARN from the Roles Summary page in AWS and copy it into the IDBroker role field. The selected AWS IAM role must have a trust policy that allows IDBroker to assume this role.
  3. Click Save and Sync.
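
The trust policy requirement in the IDBroker note above can be sketched as follows. This is a minimal illustration of the standard IAM trust policy document shape, not a Cloudera-provided template: the function name build_trust_policy and the role ARN are hypothetical placeholders, and the ARN that IDBroker actually uses in your environment has to be substituted.

```python
import json


def build_trust_policy(idbroker_role_arn: str) -> dict:
    """Build an IAM trust policy that lets the given role assume the mapped role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The principal is the role that IDBroker runs with (assumption:
                # you look this ARN up in your own AWS account).
                "Principal": {"AWS": idbroker_role_arn},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# Placeholder ARN for illustration only.
policy = build_trust_policy("arn:aws:iam::123456789012:role/example-idbroker-role")
print(json.dumps(policy, indent=2))
```

The printed JSON is the kind of document you would expect to see on the Trust relationships tab of the mapped role in the AWS console.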
You have created a Streams Messaging cluster in Cloudera on cloud to host your Schema Registry.

For information on how to create a Streams Messaging cluster, see Setting up your Streams Messaging Cluster.

You have created a schema for your data and have uploaded it to the Schema Registry in the Streams Messaging cluster.

For information on how to create a new schema, see Creating a new schema. The following is a sample schema:


{
   "type":"record",
   "name":"SensorReading",
   "namespace":"com.cloudera.example",
   "doc":"This is a sample sensor reading",
   "fields":[
      {
         "name":"sensor_id",
         "doc":"Sensor identification number.",
         "type":"int"
      },
      {
         "name":"sensor_ts",
         "doc":"Timestamp of the collected readings.",
         "type":"long"
      },
      {
         "name":"sensor_0",
         "doc":"Reading #0.",
         "type":"int"
      },
      {
         "name":"sensor_1",
         "doc":"Reading #1.",
         "type":"int"
      },
      {
         "name":"sensor_2",
         "doc":"Reading #2.",
         "type":"int"
      },
      {
         "name":"sensor_3",
         "doc":"Reading #3.",
         "type":"int"
      }
   ]
}
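
Before uploading a schema, it can help to check that a sample record actually fits it. The following is a minimal sketch using only the Python standard library; it handles just the int and long field types used in the sample schema above and is not a full Avro validator. The names SCHEMA, RANGES, and check_record are hypothetical and chosen for this illustration.

```python
import json

# Compact copy of the sample schema above (doc strings omitted).
SCHEMA = json.loads("""
{
  "type": "record",
  "name": "SensorReading",
  "namespace": "com.cloudera.example",
  "fields": [
    {"name": "sensor_id", "type": "int"},
    {"name": "sensor_ts", "type": "long"},
    {"name": "sensor_0", "type": "int"},
    {"name": "sensor_1", "type": "int"},
    {"name": "sensor_2", "type": "int"},
    {"name": "sensor_3", "type": "int"}
  ]
}
""")

# Value ranges of the two Avro integer types used by the schema.
RANGES = {"int": (-2**31, 2**31 - 1), "long": (-2**63, 2**63 - 1)}


def check_record(record: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the record fits."""
    problems = []
    for field in schema["fields"]:
        name, avro_type = field["name"], field["type"]
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        low, high = RANGES[avro_type]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            problems.append(f"{name}: expected {avro_type}, got {type(value).__name__}")
        elif not low <= value <= high:
            problems.append(f"{name}: {value} is out of range for {avro_type}")
    return problems


good = {"sensor_id": 7, "sensor_ts": 1700000000000,
        "sensor_0": 1, "sensor_1": 2, "sensor_2": 3, "sensor_3": 4}
bad = {"sensor_id": "7", "sensor_ts": 1700000000000, "sensor_0": 1}

print(check_record(good, SCHEMA))
print(check_record(bad, SCHEMA))
```

For production use, a dedicated Avro library would cover the remaining types and schema features that this sketch ignores.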

You have the Schema Registry Host Name.
1. From the Management Console, go to Data Hub Clusters and select the Streams Messaging cluster you are using.
2. Navigate to the Hardware tab to locate the Master Node FQDN. Schema Registry always runs on the Master node, so copy the Master Node FQDN.
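
Once you have the host name, you may want to confirm that the schema is reachable. The sketch below only builds a candidate REST URL for the latest version of a schema; it sends no request. The host name, the port value, and the function name schema_latest_url are assumptions made for illustration, so confirm the port and whether your cluster is accessed directly or through a gateway before relying on the result.

```python
from urllib.parse import quote


def schema_latest_url(host: str, port: int, schema_name: str) -> str:
    """Build a Schema Registry REST URL for the latest version of a schema."""
    return (
        f"https://{host}:{port}/api/v1/schemaregistry/schemas/"
        f"{quote(schema_name, safe='')}/versions/latest"
    )


# Hypothetical Master Node FQDN and port, for illustration only.
url = schema_latest_url("sm-cluster-master0.example.site", 7790, "SensorReading")
print(url)
```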

You have assigned the Cloudera Workload User read access to the schema.
1. Navigate to Management Console > Environments, and select the environment where you have created your cluster.
2. Select Ranger. You are redirected to the Ranger Service Manager page.
3. Select your Streams Messaging cluster under the Schema Registry folder.
4. Click Add New Policy.
5. On the Create Policy page, give the policy a name, specify the schema details, add the user, and assign the Read permission.

For Cloudera Data Flow

You have enabled Cloudera Data Flow for an environment.

For information on how to enable Cloudera Data Flow for an environment, see Enabling Cloudera Data Flow for an Environment.
You have created a Machine User to use as the Cloudera Workload User.
You have given the Cloudera Workload User the EnvironmentUser role.
1. From the Management Console, go to the environment for which Cloudera Data Flow is enabled.
2. From the Actions drop-down, click Manage Access.
3. Identify the user you want to use as a Workload User.
  Note: The Cloudera Workload User can be a machine user or your own user name. It is a best practice to create a dedicated machine user for this.
4. Give that user the EnvironmentUser role.
You have synchronized your user to the Cloudera on cloud environment that you enabled for Cloudera Data Flow.

For information on how to synchronize your user to FreeIPA, see Performing User Sync.
You have granted your Cloudera user the DFCatalogAdmin and DFFlowAdmin roles to enable your user to add the ReadyFlow to the Catalog and deploy the flow definition.
1. Give a user permission to add the ReadyFlow to the Catalog.
  1. From the Management Console, click User Management.
  2. Enter the name of the user or group you wish to authorize in the Search field.
  3. Select the user or group from the list that displays.
  4. Click Roles > Update Roles.
  5. From Update Roles, select DFCatalogAdmin and click Update.
    Note: If the ReadyFlow is already in the Catalog, then you can give your user just the DFCatalogViewer role.
2. Give your user or group permission to deploy flow definitions.
  1. From the Management Console, click Environments to display the Environment List page.
  2. Select the environment to which you want your user or group to deploy flow definitions.
  3. Click Actions > Manage Access to display the Environment Access page.
  4. Enter the name of the user or group you wish to authorize in the Search field.
  5. Select your user or group and click Update Roles.
  6. Select DFFlowAdmin from the list of roles.
  7. Click Update Roles.
3. Give your user or group access to the Project where the ReadyFlow will be deployed.
  1. Go to DataFlow > Projects.
  2. Select the project where you want to manage access rights and click More > Manage Access.
  3. Start typing the name of the user or group you want to add and select them from the list.
  4. Select the Resource Roles you want to grant.
  5. Click Update Roles.
  6. Click Synchronize Users.
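
The role requirements for the deploying user can be summarized as a small checklist. This sketch is purely illustrative: the function missing_dataflow_roles is hypothetical, it does not query Cloudera, and it assumes that a user who already holds DFCatalogAdmin does not also need DFCatalogViewer. You supply the set of roles you see on the Update Roles pages.

```python
# Role names as they appear in the steps above.
FLOW_ROLE = "DFFlowAdmin"
CATALOG_ADMIN = "DFCatalogAdmin"
CATALOG_VIEWER = "DFCatalogViewer"


def missing_dataflow_roles(assigned: set, readyflow_in_catalog: bool) -> set:
    """Return the role names that still need to be granted to the user."""
    missing = set()
    if FLOW_ROLE not in assigned:
        missing.add(FLOW_ROLE)
    if readyflow_in_catalog:
        # Assumption: either catalog role is enough when the ReadyFlow
        # has already been added to the Catalog.
        if CATALOG_ADMIN not in assigned and CATALOG_VIEWER not in assigned:
            missing.add(CATALOG_VIEWER)
    elif CATALOG_ADMIN not in assigned:
        missing.add(CATALOG_ADMIN)
    return missing


print(missing_dataflow_roles({"DFCatalogViewer"}, readyflow_in_catalog=False))
```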

For your data ingest target

You have created a Databricks table, non-partitioned or partitioned (single column only).
You have the Storage Location of your Databricks Table, which consists of the S3 Bucket, Path and Table Id.
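
To separate the three parts of the Storage Location, a short parsing sketch can help. It assumes, for illustration only, that the location has the layout s3://<bucket>/<path>/<table-id>, with the Table Id as the last path segment; verify this against the value shown for your table. The names StorageLocation and parse_storage_location and the example URI are hypothetical.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class StorageLocation(NamedTuple):
    bucket: str
    path: str
    table_id: str


def parse_storage_location(location: str) -> StorageLocation:
    """Split an S3 storage location into bucket, path, and table ID."""
    parts = urlsplit(location)
    if parts.scheme not in ("s3", "s3a") or not parts.netloc:
        raise ValueError(f"not an S3 location: {location!r}")
    segments = [s for s in parts.path.split("/") if s]
    if not segments:
        raise ValueError(f"no table ID found in: {location!r}")
    # Assumption: the final segment is the table ID, the rest is the path.
    *path_segments, table_id = segments
    return StorageLocation(parts.netloc, "/" + "/".join(path_segments), table_id)


# Made-up location for illustration only.
loc = parse_storage_location("s3://example-bucket/warehouse/tables/0a1b2c3d")
print(loc)
```

The returned path keeps its leading forward slash, which is also the form the Ranger Path field expects.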
You have performed one of the following to configure access to the target S3 bucket:
- You have configured access to the S3 bucket with a RAZ-enabled environment.
  It is a best practice to enable RAZ to control access to your object store buckets. This allows you to use your Cloudera on cloud credentials to access S3 buckets, increases auditability, and makes object store data ingest workflows portable across cloud providers.
  1. Ensure that Fine-grained access control is enabled for your Cloudera Data Flow environment.
  2. From the Ranger UI, navigate to the S3 repository.
  3. Create a policy to govern access to the S3 bucket and path used in your ingest workflow.
    Tip: The Path field must begin with a forward slash (/).
  4. Add the machine user that you have created for your ingest workflow to the policy you just created.
  For more information, see Creating Ranger policy to use in RAZ-enabled AWS environment.
- You have configured access to the S3 bucket using IDBroker mapping.
  If your environment is not RAZ-enabled, you can configure access to the S3 bucket using IDBroker mapping.
  1. Access IDBroker mappings.
    1. To access IDBroker mappings in your environment, click Actions > Manage Access.
    2. Choose the IDBroker Mappings tab where you can provide mappings for users or groups and click Edit.
  2. Click the blue + sign and add your Cloudera Workload User, together with the corresponding AWS role that provides write access to your folder in the target S3 bucket, to the Current Mappings section.
    Note: You can get the AWS IAM role ARN from the Roles Summary page in AWS and copy it into the IDBroker role field. The selected AWS IAM role must have a trust policy that allows IDBroker to assume this role.
  3. Click Save and Sync.
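
For the RAZ option above, the requirement that the Path field begins with a forward slash is easy to miss when a path is copied from elsewhere. The following minimal sketch normalizes a path before you paste it into the Ranger policy; the function name normalize_ranger_path and the example paths are hypothetical.

```python
def normalize_ranger_path(path: str) -> str:
    """Return the path with exactly one leading forward slash."""
    stripped = path.strip()
    if not stripped:
        raise ValueError("path must not be empty")
    return "/" + stripped.lstrip("/")


print(normalize_ranger_path("landing/sensors"))
print(normalize_ranger_path("/landing/sensors"))
```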