Replicating data from CDP PvC Base cluster to Data Hub cluster with SRM running in CDP PvC Base cluster
You can set up and configure an instance of SRM running in a CDP PvC Base cluster to
replicate data between the CDP PvC Base cluster and a Data Hub cluster. In addition, you can use
SMM to monitor the replication process. Review the following example to learn how this can be
set up.
Consider the following replication scenario:
In this scenario, data is replicated from a CDP PvC Base cluster that has Kafka, SRM, and
SMM deployed on it. This is a secure cluster that has TLS/SSL encryption and Kerberos
authentication enabled. In addition, it uses Ranger for authorization.
SRM, which is deployed in this same cluster, replicates the data to a Data Hub cluster.
The Data Hub cluster is provisioned with one of the default Streams Messaging cluster
definitions.
This example scenario does not go into detail on how to
set up the clusters and assumes the following:
A Data Hub cluster provisioned with the
Streams Messaging Light Duty or Heavy Duty cluster definition is available.
A CDP PvC Base cluster with Kafka, SRM, and SMM is available. This cluster is TLS/SSL
and Kerberos enabled. In addition, it uses Ranger for authorization.
Network connectivity and DNS resolution are
established between the clusters.
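One way to confirm the connectivity assumption before you continue is a small shell check run from an SRM host. The following is a minimal sketch and not part of the product: the check_broker function name is hypothetical, and it assumes that the getent and nc (netcat) utilities are installed on the host.

```shell
#!/bin/sh
# Hypothetical helper: check DNS resolution and TCP reachability for one
# Kafka broker given as host:port. Assumes getent and nc are installed.
check_broker() {
    host=${1%:*}    # part before the last colon
    port=${1##*:}   # part after the last colon
    if ! getent hosts "$host" >/dev/null; then
        echo "DNS FAILED: $host"
        return 1
    fi
    # -z: only probe the port, -w 5: give up after five seconds
    if ! nc -z -w 5 "$host" "$port"; then
        echo "CONNECT FAILED: $host:$port"
        return 1
    fi
    echo "OK: $host:$port"
}

# Example call (placeholder host name, replace it with a real broker FQDN):
# check_broker my-datahub-cluster-host-1.com:9093
```

A successful check only shows that the name resolves and the port accepts connections. It does not validate TLS or authentication.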
Create a machine user for SRM in Management Console:
A machine user is required so that SRM has credentials that it can use to connect to
the Kafka service in the Data Hub cluster.
Navigate to Management Console > User Management.
Click Actions > Create Machine User.
Enter a unique name for the user and click Create.
For example: srm
After the user is created, you are presented with a page that displays the
user details.
Click Set Workload Password.
Type a password in the Password and Confirm
Password fields. Leave the Environment field
blank.
Click Set Workload Password.
A message appears on successful password creation.
Grant the machine user access to your environment:
You must grant the machine user access to your environment so that SRM can connect to
the Kafka service as this user.
Navigate to Management Console > Environments, and select the environment where your
Kafka cluster is located.
Click Actions > Manage Access.
Use the search box to find and select the machine user you want to use.
A list of Resource Roles appears.
Select the EnvironmentUser role and click
Update Roles.
Go back to the Environment Details page and click Actions > Synchronize Users to FreeIPA.
On the Synchronize Users page, click Synchronize Users.
Synchronizing users ensures that the role assignment is in effect for the environment.
Add Ranger permissions for the user you created for SRM in the Data Hub cluster:
You must grant the necessary privileges to the user so that the user can access
Kafka resources. This is configured through Ranger policies.
Navigate to Management Console > Environments, and select the environment where your
Kafka cluster is located.
Click the Ranger link on the Environment Details page.
Select the resource-based service corresponding to the Kafka resource in
the Data Hub cluster.
Add the Workload User Name of the user you created
for SRM to the following Ranger policies:
All - consumergroup
All - topic
All - transactionalid
All - cluster
All - delegationtoken
Ensure that Ranger permissions exist for the streamsrepmgr user in
the CDP PvC Base cluster:
Access the Cloudera Manager instance of your CDP PvC Base cluster.
Go to Ranger > Ranger Admin Web UI.
Log in to the Ranger Console (Ranger Admin Web UI).
Ensure that the streamsrepmgr user is added to all required
policies.
If the user is missing, add it. The required policies are as
follows:
All - consumergroup
All - topic
All - transactionalid
All - cluster
All - delegationtoken
Create a truststore on the CDP PvC Base cluster:
A truststore is required so that the SRM instance running in the CDP PvC Base cluster can
trust the secure Data Hub cluster. To do this, you extract the FreeIPA certificate from the
CDP environment, create a truststore that includes the certificate, and copy the truststore
to all hosts on the CDP PvC Base cluster.
Navigate to Management Console > Environments, and select the environment where your
Kafka cluster is located.
Go to the Summary tab.
Scroll down to the FreeIPA section.
Click Actions > Get FreeIPA Certificate.
The FreeIPA certificate file, [***ENVIRONMENT NAME***]-env.crt, is downloaded to your computer.
Run the following command to create the truststore:
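A minimal sketch of such a command, assuming that the JDK keytool utility is on the PATH. The create_datahub_truststore function name and the freeipa-ca alias are hypothetical names chosen for illustration. The truststore path in the example call matches the one used in the SRM configuration in this example.

```shell
#!/bin/sh
# Sketch only: import the FreeIPA certificate into a new JKS truststore.
# Arguments: certificate file, truststore file, truststore password.
# The alias "freeipa-ca" is a hypothetical name chosen for this example.
create_datahub_truststore() {
    cert_file=$1
    truststore=$2
    store_pass=$3
    keytool -importcert -noprompt \
        -alias freeipa-ca \
        -file "$cert_file" \
        -keystore "$truststore" \
        -storetype JKS \
        -storepass "$store_pass"
}

# Example call, run from the directory that holds the downloaded certificate:
# create_datahub_truststore "[***ENVIRONMENT NAME***]-env.crt" \
#     /opt/cloudera/security/datahub-truststore.jks "[***PASSWORD***]"
#
# List the truststore contents afterwards to confirm the import
# (keytool prompts for the truststore password):
# keytool -list -keystore /opt/cloudera/security/datahub-truststore.jks
```

Passing the password with -storepass exposes it in the process list and the shell history. Omitting the option makes keytool prompt for it instead. The resulting truststore file must then be copied to the same path on all hosts of the CDP PvC Base cluster.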
Add connection and security information for both clusters to the SRM service configuration:
#Bootstrap servers:
cdppvc.bootstrap.servers=[***MY-CDP-PVC-CLUSTER-HOST-1.COM:9093***],[***MY-CDP-PVC-CLUSTER-HOST-2.COM:9093***]
datahub.bootstrap.servers=[***MY-DATAHUB-CLUSTER-HOST-1.COM:9093***],[***MY-DATAHUB-CLUSTER-HOST-2.COM:9093***]
#Replications:
cdppvc->datahub.enabled=true
#Security properties for the Datahub cluster:
datahub.security.protocol=SASL_SSL
datahub.sasl.mechanism=PLAIN
datahub.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="[***WORKLOAD USER NAME***]" password="[***MACHINE USER PASSWORD***]";
datahub.ssl.truststore.location=/opt/cloudera/security/datahub-truststore.jks
datahub.ssl.truststore.password=[***PASSWORD***]
#Use the FQDN when specifying the cluster hosts.
#The terminating semicolon in the [***ALIAS***].sasl.jaas.config property must be included in the configuration.
#The value of the [***ALIAS***].ssl.truststore.location is the location where you copied the truststore in a previous step.
#The [***ALIAS***].ssl.truststore.password property must be specified. Otherwise, the configuration might get overridden by the service ssl.truststore.password property.
Click Save.
Restart SRM.
Deploy client configuration for SRM.
Start the replication process using the srm-control tool:
SSH as an administrator to any of the SRM hosts in the CDP PvC cluster.
ssh [***USER***]@[***MY-CDP-PVC-CLUSTER.COM***]
Create a configuration file for the srm-control tool.
The srm-control tool behaves as a Kafka
client and requires configuration that is similar to any Kafka client. The configuration
file is specified with the --config option when you run the tool. The
configuration file must include cluster alias definitions, as well as properties related
to connection information and security. Cluster aliases are defined once, whereas
connection and security properties are defined separately for each alias (cluster). In
this example, the file is named srm.properties.
#Define aliases:
clusters=datahub, cdppvc
#Bootstrap servers:
datahub.bootstrap.servers=[***MY-DATAHUB-CLUSTER-HOST-1.COM:9093***],[***MY-DATAHUB-CLUSTER-HOST-2.COM:9093***]
cdppvc.bootstrap.servers=[***MY-CDP-PVC-CLUSTER-HOST-1.COM:9093***],[***MY-CDP-PVC-CLUSTER-HOST-2.COM:9093***]
#DataHub cluster's security properties:
datahub.security.protocol=SASL_SSL
datahub.sasl.mechanism=PLAIN
datahub.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="[***WORKLOAD USER NAME***]" password="[***MACHINE USER PASSWORD***]";
datahub.ssl.truststore.location=/opt/cloudera/security/datahub-truststore.jks
datahub.ssl.truststore.password=[***PASSWORD***]
#CDP PvC Base cluster's security properties:
cdppvc.security.protocol=SASL_SSL
cdppvc.sasl.mechanism=GSSAPI
cdppvc.sasl.kerberos.service.name=kafka
cdppvc.sasl.jaas.config=com.sun.security.auth.module.Krb5LoginModule required useKeyTab=true keyTab="[***PATH TO KEYTAB FILE***]" storeKey=true useTicketCache=false principal="[***MY KERBEROS PRINCIPAL***]";
cdppvc.ssl.truststore.location=[***CDP PVC BASE GLOBAL TRUSTSTORE LOCATION***]
cdppvc.ssl.truststore.password=[***CDP PVC BASE GLOBAL TRUSTSTORE PASSWORD***]
#Use the FQDN when specifying the cluster hosts.
#The terminating semicolon in the [***ALIAS***].sasl.jaas.config properties must be included in the configuration.
#The value of the datahub.ssl.truststore.location property is the location where you copied the truststore in a previous step.
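Two of the requirements above are easy to miss when editing the file by hand: every alias listed in clusters needs its own bootstrap.servers entry, and every sasl.jaas.config value must end with a semicolon. The following is a minimal sketch of a pre-flight check for these two points only. The check_srm_properties function is a hypothetical helper and not part of the srm-control tool.

```shell
#!/bin/sh
# Hypothetical helper: basic sanity checks for an srm-control properties file.
# Prints one line per problem and returns non-zero if any problem is found.
check_srm_properties() {
    file=$1
    status=0

    # Read the alias list from the clusters= line and split it on commas.
    names=$(sed -n 's/^clusters=//p' "$file" | tr ',' ' ')
    if [ -z "$names" ]; then
        echo "missing clusters= definition"
        status=1
    fi

    # Every alias needs its own bootstrap.servers entry.
    for name in $names; do
        if ! grep -q "^${name}\.bootstrap\.servers=" "$file"; then
            echo "missing ${name}.bootstrap.servers"
            status=1
        fi
    done

    # Every sasl.jaas.config value must end with a semicolon.
    # Comment lines are skipped; the key before the first "=" is reported.
    bad=$(awk -F= '!/^[ \t]*#/ && /\.sasl\.jaas\.config=/ && $0 !~ /;[ \t]*$/ { print $1 }' "$file")
    for key in $bad; do
        echo "missing terminating semicolon in $key"
        status=1
    done

    return $status
}

# Example call:
# check_srm_properties srm.properties && echo "srm.properties looks consistent"
```

The check does not contact either cluster and does not validate credentials or truststore contents. Because the file stores the workload password in clear text, consider restricting access to it, for example with chmod 600 srm.properties.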
Use the srm-control tool with the topics
subcommand to add topics to the allow list: