Prerequisites

Learn how to collect the information you need to deploy the Kafka to Cloudera Operational Database (COD) ReadyFlow, and meet other prerequisites.

Use the following checklist to ensure that you meet all the requirements before you start building your data flow:

For your data ingest source

You have created a Streams Messaging cluster in Cloudera on cloud to host your Schema Registry.
Show Me How
For information on how to create a Streams Messaging cluster, see Setting up your Streams Messaging Cluster.
You have created at least one Kafka topic.
Show Me How
1. Navigate to Management Console > Environments and select your environment.
2. Select your Streams Messaging cluster.
3. Click on the Streams Messaging Manager icon.
4. Navigate to the Topics page.
5. Click Add New and provide the following information:
  - Topic name
  - Number of partitions
  - Level of availability
  - Cleanup policy
  Tip: SMM automatically sets the Kafka topic configuration parameters. To adjust them manually, click Advanced.
6. Click Save.

You have created a schema for your data and have uploaded it to the Schema Registry in the Streams Messaging cluster.

Show Me How

For information on how to create a new schema, see Creating a new schema. For example:

{
   "type":"record",
   "name":"SensorReading",
   "namespace":"com.cloudera.example",
   "doc":"This is a sample sensor reading",
   "fields":[
      {
         "name":"sensor_id",
         "doc":"Sensor identification number.",
         "type":"int"
      },
      {
         "name":"sensor_ts",
         "doc":"Timestamp of the collected readings.",
         "type":"long"
      },
      {
         "name":"sensor_0",
         "doc":"Reading #0.",
         "type":"int"
      },
      {
         "name":"sensor_1",
         "doc":"Reading #1.",
         "type":"int"
      },
      {
         "name":"sensor_2",
         "doc":"Reading #2.",
         "type":"int"
      },
      {
         "name":"sensor_3",
         "doc":"Reading #3.",
         "type":"int"
      }
   ]
}
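
As a quick sanity check before uploading, the sketch below loads the example schema with Python's standard `json` module and verifies that a sample record contains every declared field with a value that fits the declared primitive type. The `validate_record` helper and the sample values are hypothetical illustrations. It only understands the `int` and `long` types used in this example, so it is not a full Avro validator and does not replace the checks that Schema Registry performs.

```python
import json

# The example schema from above, with the "doc" attributes omitted for brevity.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "SensorReading",
  "namespace": "com.cloudera.example",
  "fields": [
    {"name": "sensor_id", "type": "int"},
    {"name": "sensor_ts", "type": "long"},
    {"name": "sensor_0", "type": "int"},
    {"name": "sensor_1", "type": "int"},
    {"name": "sensor_2", "type": "int"},
    {"name": "sensor_3", "type": "int"}
  ]
}
"""

# Avro "int" is a 32-bit signed integer and "long" is a 64-bit signed integer.
RANGES = {
    "int": (-2**31, 2**31 - 1),
    "long": (-2**63, 2**63 - 1),
}


def validate_record(schema: dict, record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record fits.

    Hypothetical helper: supports only the "int" and "long" field types.
    """
    problems = []
    declared = {field["name"]: field["type"] for field in schema["fields"]}
    for name, avro_type in declared.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        low, high = RANGES[avro_type]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(value, int) or isinstance(value, bool):
            problems.append(
                f"{name}: expected {avro_type}, got {type(value).__name__}"
            )
        elif not low <= value <= high:
            problems.append(f"{name}: value out of range for {avro_type}")
    for name in record:
        if name not in declared:
            problems.append(f"unexpected field: {name}")
    return problems


schema = json.loads(SCHEMA_JSON)
# Sample values are made up for illustration.
sample = {"sensor_id": 7, "sensor_ts": 1700000000000,
          "sensor_0": 1, "sensor_1": 2, "sensor_2": 3, "sensor_3": 4}
print(validate_record(schema, sample))
```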

You have the Schema Registry Host Name.
Show Me How
1. From the Management Console, go to Data Hub Clusters and select the Streams Messaging cluster you are using.
2. Navigate to the Hardware tab to locate the Master Node FQDN. Schema Registry is always running on the Master node, so copy the Master node FQDN.
You have the Kafka broker end points.
Show Me How
1. From the Management Console, click Data Hub Clusters.
2. Select the Streams Messaging cluster from which you want to ingest data.
3. Click the Hardware tab.
4. Note the Kafka Broker FQDNs for each node in your cluster. Kafka broker FQDNs are listed under the Core_broker section.
5. Construct your Kafka Broker Endpoints by joining each FQDN and the port number 9093 with a colon. Separate the endpoints with commas. For example:
```
broker1.fqdn:9093,broker2.fqdn:9093,broker3.fqdn:9093
```
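
If you have many brokers, the endpoint string can also be assembled programmatically. The following is a minimal sketch: `build_broker_endpoints` is a hypothetical helper name, and the FQDNs are the same placeholders used in the example above.

```python
def build_broker_endpoints(fqdns, port=9093):
    """Join broker FQDNs into a 'host:port,host:port,...' string."""
    # Drop blank entries and surrounding whitespace from pasted values.
    cleaned = [fqdn.strip() for fqdn in fqdns if fqdn.strip()]
    if not cleaned:
        raise ValueError("at least one broker FQDN is required")
    return ",".join(f"{host}:{port}" for host in cleaned)


# Placeholder FQDNs; replace them with the values from the Hardware tab.
endpoints = build_broker_endpoints(
    ["broker1.fqdn", "broker2.fqdn", "broker3.fqdn"]
)
print(endpoints)  # broker1.fqdn:9093,broker2.fqdn:9093,broker3.fqdn:9093
```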
You have the Kafka Consumer Group ID.
Show Me How
This ID is defined by the user. Pick an ID and then create a Ranger policy for it. Use the ID when deploying the flow in Cloudera Data Flow.
You have assigned policies that allow the Cloudera Workload User to access the consumer group ID and topic.
Show Me How
1. Navigate to Management Console > Environments, and select the environment where you have created your cluster.
2. Select Ranger. You are redirected to the Ranger Service Manager page.
3. Select your Streams Messaging cluster under the Kafka folder.
4. Create a policy to enable your Workload User to access the Kafka source topic.
5. On the Create Policy page, give the policy a name, select topic from the drop-down list, add the user, and assign the Consume permission.
6. Create another policy to give your Workload User access to the consumer group ID.
7. On the Create Policy page, give the policy a name, select consumergroup from the drop-down list, add the user, and assign the Consume permission.
You have assigned the Cloudera Workload User read access to the schema.
Show Me How
1. Navigate to Management Console > Environments, and select the environment where you have created your cluster.
2. Select Ranger. You are redirected to the Ranger Service Manager page.
3. Select your Streams Messaging cluster under the Schema Registry folder.
4. Click Add New Policy.
5. On the Create Policy page, give the policy a name, specify the schema details, add the user, and assign the Read permission.

For Cloudera Data Flow

You have enabled Cloudera Data Flow for an environment.
Show Me How
For information on how to enable Cloudera Data Flow for an environment, see Enabling Cloudera Data Flow for an Environment.
You have created a Machine User to use as the Cloudera Workload User.
You have given the Cloudera Workload User the EnvironmentUser role.
Show Me How
1. From the Management Console, go to the environment for which Cloudera Data Flow is enabled.
2. From the Actions drop-down, click Manage Access.
3. Identify the user you want to use as a Workload User.
  Note: The Cloudera Workload User can be a machine user or your own user name. It is best practice to create a dedicated Machine User for this purpose.
4. Give that user the EnvironmentUser role.
You have synchronized your user to the Cloudera on cloud environment that you enabled for Cloudera Data Flow.
Show Me How
For information on how to synchronize your user to FreeIPA, see Performing User Sync.
You have granted your Cloudera user the DFCatalogAdmin and DFFlowAdmin roles to enable your user to add the ReadyFlow to the Catalog and deploy the flow definition.
Show Me How
1. Give a user permission to add the ReadyFlow to the Catalog.
  1. From the Management Console, click User Management.
  2. Enter the name of the user or group you wish to authorize in the Search field.
  3. Select the user or group from the list that displays.
  4. Click Roles > Update Roles.
  5. From Update Roles, select DFCatalogAdmin and click Update.
    Note: If the ReadyFlow is already in the Catalog, you can give your user just the DFCatalogViewer role.
2. Give your user or group permission to deploy flow definitions.
  1. From the Management Console, click Environments to display the Environment List page.
  2. Select the environment to which you want your user or group to deploy flow definitions.
  3. Click Actions > Manage Access to display the Environment Access page.
  4. Enter the name of the user or group you wish to authorize in the Search field.
  5. Select your user or group and click Update Roles.
  6. Select DFFlowAdmin from the list of roles.
  7. Click Update Roles.
3. Give your user or group access to the Project where the ReadyFlow will be deployed.
  1. Go to Data Flow > Projects.
  2. Select the project where you want to manage access rights and click More > Manage Access.
  3. Start typing the name of the user or group you want to add and select them from the list.
  4. Select the Resource Roles you want to grant.
  5. Click Update Roles.
  6. Click Synchronize Users.

For your data ingest target

Ensure that the HBase table you are ingesting data to exists. If not, create one.
Show Me How
1. From the Cloudera Shared Data Experience UI, click Operational Database in the left navigation pane.
2. Click Create Database.
3. Select the environment for which Cloudera Data Flow is enabled.
4. Enter a name for your database, and click Create Database.
5. Go to the newly created database from the Databases page.
6. Go to the Hue UI by clicking Hue SQL Editor.
7. Click the HBase icon to go to HBase home.
8. Click New Table.
  The Create New Table dialog appears.
9. Enter a table name and a column family name, and click Submit.
  A blank table is created.
10. Go to the newly created table and click New Row.
  The Insert New Row dialog appears.
11. Click Add Field, and then specify the row key, column name, and cell value.
  Note: The column name must follow the format family:column_name, where family is the column family name.
12. Click Submit.
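
The column name format from the note in step 11 can be illustrated with a short sketch. The helper names `qualified_column` and `split_column`, and the sample values `cf` and `temperature`, are hypothetical. Requiring a non-empty column name is an assumption of this sketch that matches the format described above.

```python
def qualified_column(family: str, column_name: str) -> str:
    """Build a 'family:column_name' string from its two parts."""
    for label, part in (("family", family), ("column_name", column_name)):
        if not part:
            raise ValueError(f"{label} must not be empty")
    # The first colon separates the family from the column name,
    # so the family itself must not contain one.
    if ":" in family:
        raise ValueError("family must not contain ':'")
    return f"{family}:{column_name}"


def split_column(qualified: str) -> tuple[str, str]:
    """Split 'family:column_name' at the first colon."""
    family, separator, column_name = qualified.partition(":")
    if not separator or not family or not column_name:
        raise ValueError(f"expected 'family:column_name', got {qualified!r}")
    return family, column_name


# Hypothetical family and column names, for illustration only.
name = qualified_column("cf", "temperature")
print(name)                # cf:temperature
print(split_column(name))  # ('cf', 'temperature')
```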
Obtain the table name, column family name, and row identifier of the HBase table in Cloudera Operational Database.
Show Me How
1. From the Cloudera Shared Data Experience UI, click Operational Database in the left navigation pane.
2. Select the database where your HBase table exists.
3. Go to the Hue UI by clicking Hue SQL Editor.
4. Click the HBase icon to go to HBase home.
5. Click the HBase table in Cloudera Operational Database.
6. After the table appears, obtain the table name, column family name, and row identifier.
You have set Ranger policies for the HBase table.
Show Me How
1. From the Cloudera Management Console, click Environments.
2. Use the search field to find and select the Cloudera environment for which Cloudera Data Flow is enabled.
3. Go to the Ranger UI by clicking Ranger.
4. Select your database from the HBase folder.
5. Click Add New Policy.
6. Enter the policy name, HBase table name, HBase column-family name, and HBase column value.
7. In the Allow Conditions section, enter the name of the Machine User you created in Cloudera, prefixed with srv_.
8. Click Add Permissions, and assign Read and Write permissions to the user.
9. Click Add.
Obtain the hbase-site.xml file.
Show Me How
To get the hbase-site.xml file from Cloudera Data Hub:
1. From the Cloudera Management Console, click Environments.
2. Use the search field to find and select the Cloudera on cloud environment for which Cloudera Data Flow is enabled.
3. Go to Data Hubs.
4. Select the Cloudera Operational Database cluster.
5. Go to Cloudera Manager by clicking CM-UI.
6. Click Clusters from the left-navigation pane, and click hbase.
7. Click Actions > Download Client Configuration to download the client configuration zip file.
8. Unzip the zip file to obtain the hbase-site.xml file.
To get the hbase-site.xml file from Cloudera Shared Data Experience:
1. From the Cloudera Shared Data Experience UI, click Operational Database in the left navigation pane.
2. Select the database where your HBase table exists.
3. Go to the HBase Client Tarball tab.
4. Copy the HBase Client Configuration URL.
5. Use the URL as a command to download the client configuration zip file.
6. Unzip the zip file to obtain the hbase-site.xml file.
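
The final unzip step, which is the same in both procedures, can be scripted with Python's standard `zipfile` module. This is a minimal sketch under stated assumptions: `extract_hbase_site` is a hypothetical helper, and because the internal layout of the client configuration archive is not described here, the function searches every entry for a file named hbase-site.xml instead of assuming a fixed path.

```python
import zipfile
from pathlib import Path, PurePosixPath


def extract_hbase_site(zip_path, dest_dir):
    """Copy the first hbase-site.xml found in the archive into dest_dir.

    Returns the path of the extracted file.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            # Zip entries use forward slashes; the file may be nested
            # inside a subdirectory of the archive.
            if PurePosixPath(member).name == "hbase-site.xml":
                target = dest / "hbase-site.xml"
                target.write_bytes(archive.read(member))
                return target
    raise FileNotFoundError(f"hbase-site.xml not found in {zip_path}")
```

For example, `extract_hbase_site("hbase-clientconfig.zip", "conf")` would write `conf/hbase-site.xml`, where both the archive name and the target directory are placeholders for your own values.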