Prerequisites
Learn how to collect the information you need to deploy the Kafka to Kudu ReadyFlow, and meet other prerequisites.
For your data ingest source
-
You have created a Streams Messaging cluster in Cloudera Public Cloud to host your Schema Registry.
For information on how to create a Streams Messaging cluster, see Setting up your Streams Messaging Cluster.
-
You have created at least one Kafka topic.
- Navigate to Management Console > Environments and select your environment.
- Select your Streams Messaging cluster.
- Click the Streams Messaging Manager icon.
- Navigate to the Topics page.
- Click Add New and provide the following information:
- Topic name
- Number of partitions
- Level of availability
- Cleanup policy
- Click Save.
-
You have created a schema for your data and have uploaded it to the Schema Registry in the Streams Messaging cluster.
For information on how to create a new schema, see Creating a new schema. For example:
{
  "type": "record",
  "name": "SensorReading",
  "namespace": "com.cloudera.example",
  "doc": "This is a sample sensor reading",
  "fields": [
    { "name": "sensor_id", "doc": "Sensor identification number.", "type": "int" },
    { "name": "sensor_ts", "doc": "Timestamp of the collected readings.", "type": "long" },
    { "name": "sensor_0", "doc": "Reading #0.", "type": "int" },
    { "name": "sensor_1", "doc": "Reading #1.", "type": "int" },
    { "name": "sensor_2", "doc": "Reading #2.", "type": "int" },
    { "name": "sensor_3", "doc": "Reading #3.", "type": "int" }
  ]
}
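Before uploading a schema, it can help to confirm that a sample message actually matches it. The following is a minimal sketch using only the Python standard library, assuming the SensorReading schema shown above; the `check_record` helper and the sample values are hypothetical, and this is not full Avro validation (it does not check numeric ranges or complex types).

```python
import json

# The example schema from this page, with the "doc" attributes omitted for brevity.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "SensorReading",
  "namespace": "com.cloudera.example",
  "fields": [
    {"name": "sensor_id", "type": "int"},
    {"name": "sensor_ts", "type": "long"},
    {"name": "sensor_0", "type": "int"},
    {"name": "sensor_1", "type": "int"},
    {"name": "sensor_2", "type": "int"},
    {"name": "sensor_3", "type": "int"}
  ]
}
"""

# Only the two primitive types used by the example schema are handled here.
AVRO_TO_PYTHON = {"int": int, "long": int}


def check_record(schema, record):
    """Return a list of problems found when comparing a record to the schema."""
    problems = []
    for field in schema["fields"]:
        name, avro_type = field["name"], field["type"]
        expected = AVRO_TO_PYTHON.get(avro_type) if isinstance(avro_type, str) else None
        if expected is None:
            problems.append(f"{name}: type {avro_type!r} is not handled by this sketch")
        elif name not in record:
            problems.append(f"{name}: missing")
        elif isinstance(record[name], bool) or not isinstance(record[name], expected):
            # bool is a subclass of int in Python, so it is rejected explicitly.
            problems.append(
                f"{name}: expected {avro_type}, got {type(record[name]).__name__}"
            )
    return problems


schema = json.loads(SCHEMA_JSON)

# Hypothetical sample message.
sample = {
    "sensor_id": 1,
    "sensor_ts": 1700000000000,
    "sensor_0": 10,
    "sensor_1": 20,
    "sensor_2": 30,
    "sensor_3": 40,
}

print(check_record(schema, sample))  # an empty list means every field matched
```

An empty result only indicates that the field names and basic types line up; the Schema Registry remains the source of truth for the schema itself.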
-
You have the Schema Registry Host Name.
- From the Management Console, go to Data Hub Clusters and select the Streams Messaging cluster you are using.
- Navigate to the Hardware tab to locate the Master Node FQDN. Schema Registry is always running on the Master node, so copy the Master node FQDN.
-
You have the Kafka broker end points.
- From the Management Console, click Data Hub Clusters.
- Select the Streams Messaging cluster from which you want to ingest data.
- Click the Hardware tab.
- Note the Kafka Broker FQDNs for each node in your cluster. Kafka broker FQDNs are listed under the Core_broker section.
- Construct your Kafka Broker Endpoints by joining each FQDN and the port number 9093 with a colon. Separate the endpoints with commas. For example:
broker1.fqdn:9093,broker2.fqdn:9093,broker3.fqdn:9093
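The endpoint string described above can also be assembled programmatically. This is a small sketch, assuming you have already copied the broker FQDNs from the Hardware tab; the function name `build_broker_endpoints` is hypothetical and the FQDNs are the placeholders used on this page.

```python
KAFKA_BROKER_PORT = 9093  # port number given in the steps above


def build_broker_endpoints(fqdns, port=KAFKA_BROKER_PORT):
    """Join broker FQDNs into a comma-separated list of host:port endpoints."""
    # Drop surrounding whitespace and empty entries left over from copy and paste.
    cleaned = [fqdn.strip() for fqdn in fqdns if fqdn.strip()]
    if not cleaned:
        raise ValueError("at least one broker FQDN is required")
    return ",".join(f"{fqdn}:{port}" for fqdn in cleaned)


print(build_broker_endpoints(["broker1.fqdn", "broker2.fqdn", "broker3.fqdn"]))
# prints broker1.fqdn:9093,broker2.fqdn:9093,broker3.fqdn:9093
```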
-
You have the Kafka Consumer Group ID.
This ID is defined by the user. Pick an ID and then create a Ranger policy for it. Use the ID when deploying the flow in Cloudera DataFlow.
-
You have assigned the Cloudera Workload User policies to access the consumer group ID and topic.
- Navigate to Management Console > Environments, and select the environment where you have created your cluster.
- Select Ranger. You are redirected to the Ranger Service Manager page.
- Select your Streams Messaging cluster under the Kafka folder.
- Create a policy to enable your Workload User to access the Kafka source topic.
- On the Create Policy page, give the policy a name, select topic from the drop-down list, add the user, and assign the Consume permission.
- Create another policy to give your Workload User access to the consumer group ID.
- On the Create Policy page, give the policy a name, select consumergroup from the drop-down list, add the user, and assign the Consume permission.
-
You have assigned the Cloudera Workload User read-access to the schema.
- Navigate to Management Console > Environments, and select the environment where you have created your cluster.
- Select Ranger. You are redirected to the Ranger Service Manager page.
- Select your Streams Messaging cluster under the Schema Registry folder.
- Click Add New Policy.
- On the Create Policy page, give the policy a name, specify the schema details, add the user, and assign the Read permission.
For Cloudera DataFlow
-
You have enabled Cloudera DataFlow for an environment.
For information on how to enable Cloudera DataFlow for an environment, see Enabling Cloudera DataFlow for an Environment.
-
You have created a Machine User to use as the Cloudera Workload User.
- You have given the Cloudera Workload User the EnvironmentUser role.
- From the Management Console, go to the environment for which Cloudera DataFlow is enabled.
- From the Actions drop-down, click Manage Access.
- Identify the user you want to use as a Workload User.
- Give that user the EnvironmentUser role.
-
You have synchronized your user to the Cloudera Public Cloud environment that you enabled for Cloudera DataFlow.
For information on how to synchronize your user to FreeIPA, see Performing User Sync.
- You have granted your Cloudera user the DFCatalogAdmin and DFFlowAdmin roles to enable your user to add the ReadyFlow to the Catalog and deploy the flow definition.
- Give a user permission to add the ReadyFlow to the Catalog.
- From the Management Console, click User Management.
- Enter the name of the user or group you wish to authorize in the Search field.
- Select the user or group from the list that displays.
- Click .
- From Update Roles, select DFCatalogAdmin and click Update.
- Give your user or group permission to deploy flow definitions.
- From the Management Console, click Environments to display the Environment List page.
- Select the environment to which you want your user or group to deploy flow definitions.
- From the Actions drop-down, click Manage Access to display the Environment Access page.
- Enter the name of the user or group you wish to authorize in the Search field.
- Select your user or group and click Update Roles.
- Select DFFlowAdmin from the list of roles.
- Click Update Roles.
- Give your user or group access to the Project where the ReadyFlow will be deployed.
- Go to .
- Select the project where you want to manage access rights and click .
- Start typing the name of the user or group you want to add and select them from the list.
- Select the Resource Roles you want to grant.
- Click Update Roles.
- Click Synchronize Users.
For your data ingest target
-
You have a Real-Time Data Mart cluster running Kudu, Impala, and Hue in the same environment for which Cloudera DataFlow has been enabled.
-
You have the Kudu Master hostnames.
- From the Management Console, click Data Hub Clusters.
- Select the Real-Time Data Mart cluster into which you want to ingest data.
- Click the Hardware tab.
- Copy the FQDN for each Kudu Master.
-
You have created the Kudu table that you want to ingest data into.
- Navigate to your Real-Time Data Mart cluster and click Hue from the Services pane.
- Click the Tables icon on the left pane.
- Select the default database, and click + New to create a new table.
- In the Type field, select Manually and click Next.
- Provide the table Name, Format, Primary keys, and any partitions.
- Click Submit. The newly created table displays in the default database Tables pane.
- Check the Kudu UI Tables tab for the name of the table you created. You will need this table name when you use the Cloudera DataFlow Deployment wizard to deploy the ReadyFlow.
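The steps above create the table through the Hue table wizard. As an alternative illustration only, the sketch below generates an Impala `CREATE TABLE ... STORED AS KUDU` statement whose columns mirror the example SensorReading schema. The table name `default.sensor_readings`, the choice of primary key columns, the partition count, and the `kudu_create_table_sql` helper are all assumptions, not values from this page.

```python
# Columns taken from the example SensorReading schema: (name, Avro type).
FIELDS = [
    ("sensor_id", "int"),
    ("sensor_ts", "long"),
    ("sensor_0", "int"),
    ("sensor_1", "int"),
    ("sensor_2", "int"),
    ("sensor_3", "int"),
]

# Assumed mapping from the Avro primitives used above to Impala column types.
AVRO_TO_IMPALA = {"int": "INT", "long": "BIGINT"}


def kudu_create_table_sql(table, fields, primary_keys, hash_partitions):
    """Build an Impala DDL statement for a hash-partitioned Kudu table."""
    names = [name for name, _ in fields]
    # Assumption: primary key columns are expected to be listed first.
    if names[: len(primary_keys)] != list(primary_keys):
        raise ValueError("primary key columns must come first in the column list")
    lines = [f"  {name} {AVRO_TO_IMPALA[avro_type]}" for name, avro_type in fields]
    lines.append(f"  PRIMARY KEY ({', '.join(primary_keys)})")
    return (
        f"CREATE TABLE {table} (\n"
        + ",\n".join(lines)
        + "\n)\n"
        + f"PARTITION BY HASH ({primary_keys[0]}) PARTITIONS {hash_partitions}\n"
        + "STORED AS KUDU"
    )


# Hypothetical table name, key columns, and partition count.
print(kudu_create_table_sql("default.sensor_readings", FIELDS, ["sensor_id", "sensor_ts"], 4))
```

Whichever way the table is created, the name to use in the Deployment wizard is the one shown on the Kudu UI Tables tab.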
-
You have assigned permissions via IDBroker or in Ranger to enable the Cloudera Workload User to access the Kudu table that you want to ingest data into.
- From the base cluster on Cloudera Public Cloud, select Ranger.
- Select your Real-Time Data Mart cluster from the Kudu folder.
- Click Add New Policy.
- On the Create Policy page, enter the Kudu table name in the topic field.
- Add the Cloudera Workload User in the Select User field.
- Add the Insert and Select permissions in the Permissions field.
- Click Save.
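The prerequisites on this page yield a handful of values that you need at deployment time. One way to keep them together and spot anything still missing is sketched below. The class and field names are hypothetical and are not the parameter names used by the Deployment wizard; the comma-separated form for the Kudu Master hostnames is also an assumption.

```python
from dataclasses import dataclass, fields


@dataclass
class ReadyFlowInputs:
    """Values collected while working through the prerequisites (names are illustrative)."""

    schema_registry_host: str
    kafka_broker_endpoints: str
    kafka_consumer_group_id: str
    kafka_topic: str
    schema_name: str
    kudu_masters: str  # assumed here to be a comma-separated list of FQDNs
    kudu_table_name: str


def missing_inputs(inputs):
    """Return the names of fields that are empty or contain only whitespace."""
    return [f.name for f in fields(inputs) if not getattr(inputs, f.name).strip()]


# Placeholder values; replace them with the ones gathered from your environment.
collected = ReadyFlowInputs(
    schema_registry_host="master0.fqdn",
    kafka_broker_endpoints="broker1.fqdn:9093,broker2.fqdn:9093,broker3.fqdn:9093",
    kafka_consumer_group_id="my-consumer-group",
    kafka_topic="sensor-readings",
    schema_name="SensorReading",
    kudu_masters="kudu-master1.fqdn,kudu-master2.fqdn",
    kudu_table_name="",
)

print(missing_inputs(collected))  # prints ['kudu_table_name']
```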