Replicating data from CDP PvC Base cluster to Data Hub cluster with SRM running in CDP PvC Base cluster
You can set up and configure an instance of SRM running in a CDP PvC Base cluster to
replicate data between the CDP PvC Base cluster and a Data Hub cluster. In addition, you can use
SMM to monitor the replication process. Review the following example to learn how this can be
set up.
Consider the following replication scenario:
In this scenario, data is replicated from a CDP PvC Base cluster that has Kafka, SRM, and
SMM deployed on it. This is a secure cluster that has TLS/SSL encryption and Kerberos
authentication enabled. In addition, it uses Ranger for authorization.
Data is being replicated from this cluster by SRM deployed in this cluster to a Data Hub
cluster.
The Data Hub cluster is provisioned with one of the default Streams Messaging cluster
definitions.
This example scenario does not go into detail on how to set up the clusters and assumes the following:
A Data Hub cluster provisioned with the Streams Messaging Light Duty or Heavy Duty cluster definition is available.
A CDP PvC Base cluster with Kafka, SRM, and SMM is available. This cluster is TLS/SSL and Kerberos enabled. In addition, it uses Ranger for authorization.
Network connectivity and DNS resolution are established between the clusters.
This example scenario also demonstrates the configuration required to enable replication monitoring of the Data Hub cluster with Streams Messaging Manager (SMM). To do this, you configure the SRM Service role to target (monitor) the Data Hub cluster. This configuration is the last step in the following list of steps and is marked optional, because enabling replication monitoring of the Data Hub cluster comes with the following caveats:
The SRM Service role will generate additional cloud traffic.
Any extra traffic in your cloud deployment can lead to additional cloud costs.
The Replications tab in SMM will display all replications targeting the Data Hub cluster.
Although this is expected, all other pages in SMM will display information about the CDP PvC cluster. A setup like this can confuse or mislead users about what this specific instance of SMM is monitoring.
You will lose the ability to monitor the replications targeting the CDP PvC cluster.
This is only critical if you have existing replications that target the CDP PvC cluster and you monitor them with the SMM instance running in the CDP PvC cluster.
Create a machine user for SRM in Management Console:
A machine user is required so that SRM has credentials that it can use to connect to
the Kafka service in the Data Hub cluster.
Navigate to Management Console > User Management.
Click Actions > Create Machine User.
Enter a unique name for the user and click Create.
For example: srm
After the user is created, you are presented with a page that displays the
user details.
Click Set Workload Password.
Type a password in the Password and Confirm Password fields. Leave the Environment field blank.
Click Set Workload Password.
A message appears on successful password creation.
Grant the machine user access to your environment:
You must grant the machine user access to your environment for SRM to connect to the Kafka service with this user.
Navigate to Management Console > Environments, and select the environment where your Kafka cluster is located.
Click Actions > Manage Access.
Use the search box to find and select the machine user you want to use.
A list of Resource Roles appears.
Select the EnvironmentUser role and click Update Roles.
Go back to the Environment Details page and click Actions > Synchronize Users to FreeIPA.
On the Synchronize Users page, click Synchronize Users.
Synchronizing users ensures that the role assignment is in effect for the environment.
Add Ranger permissions for the user you created for SRM in the Data Hub cluster:
You must grant the necessary privileges to the user so that the user can access Kafka resources. This is configured through Ranger policies.
Navigate to Management Console > Environments, and select the environment where your Kafka cluster is located.
Click the Ranger link on the Environment Details page.
Select the resource-based service corresponding to the Kafka resource in the Data Hub cluster.
Add the Workload User Name of the user you created for SRM to the following Ranger policies:
All - consumergroup
All - topic
All - transactionalid
All - cluster
All - delegationtoken
Ensure that Ranger permissions exist for the streamsrepmgr user in the CDP PvC Base cluster:
Access the Cloudera Manager instance of your CDP PvC Base cluster.
Go to Ranger > Ranger Admin Web UI.
Log in to the Ranger Console (Ranger Admin Web UI).
Ensure that the streamsrepmgr user is added to all required policies.
If the user is missing, add it. The required policies are as follows:
All - consumergroup
All - topic
All - transactionalid
All - cluster
All - delegationtoken
Create a truststore on the CDP PvC Base cluster:
A truststore is required so that the SRM instance running in the CDP PvC Base cluster can trust the secure Data Hub cluster. To do this, you extract the FreeIPA certificate from the CDP environment, create a truststore that includes the certificate, and copy the truststore to all hosts on the CDP PvC Base cluster.
Navigate to Management Console > Environments, and select the environment where your Kafka cluster is located.
Go to the Summary tab.
Scroll down to the FreeIPA section.
Click Actions > Get FreeIPA Certificate.
The FreeIPA certificate file, [***ENVIRONMENT NAME***]-env.crt, is downloaded to your computer.
Run the following command to create the truststore:
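A minimal sketch of such a command, assuming the JDK keytool utility is available on the host. The function name, truststore file name, certificate alias (freeipa-ca), and password are placeholders chosen for illustration and do not come from this document:

```shell
# Hypothetical helper: import a certificate into a new JKS truststore.
# Arguments: <certificate file> <truststore file> <truststore password>
create_truststore() {
  cert_file="$1"
  truststore="$2"
  storepass="$3"

  # Stop early if the downloaded certificate is not where we expect it.
  if [ ! -f "$cert_file" ]; then
    echo "Certificate not found: $cert_file" >&2
    return 1
  fi

  # -importcert adds the certificate as a trusted entry;
  # -noprompt skips the interactive "Trust this certificate?" question.
  keytool -importcert -noprompt \
    -storetype JKS \
    -keystore "$truststore" \
    -storepass "$storepass" \
    -alias freeipa-ca \
    -file "$cert_file"
}

# Example invocation (all values are placeholders):
# create_truststore "./[***ENVIRONMENT NAME***]-env.crt" ./datahub-truststore.jks "[***PASSWORD***]"
```

After the truststore is created, it still has to be copied to all hosts of the CDP PvC Base cluster, as described above.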
If credential creation is
successful, a new entry corresponding to the Kafka credential you specified appears on
the page.
Define the co-located Kafka cluster (PvC Base):
In Cloudera Manager, go to Clusters and select the Streams Replication Manager service.
Go to Configuration.
Find and enable the Kafka Service property.
Find and configure the Streams Replication Manager Co-located Kafka Cluster Alias property.
The alias you configure represents the co-located cluster. Enter an alias that is unique and easily identifiable. For example:
cdppvc
Enable relevant security feature toggles.
Because CDP PvC Base is both TLS/SSL and Kerberos enabled, you must enable all feature toggles for both the Driver and Service roles. The feature toggles are the following:
Enable TLS/SSL for SRM Driver
Enable TLS/SSL for SRM Service
Enable Kerberos Authentication
Add both clusters to SRM's configuration:
Find and configure the External Kafka Accounts property.
Add the names of all Kafka credentials you created to this property. To do this, click the add button to add a new line to the property, and then enter the name of the Kafka credential. For example:
datahub
Find and configure the Streams Replication Manager Cluster alias property.
Add all cluster aliases to this property. This includes the aliases present in both the External Kafka Accounts and Streams Replication Manager Co-located Kafka Cluster Alias properties. Delimit the aliases with commas. For example:
datahub,cdppvc
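The relationship between the three properties can be sketched in shell. The variable names are hypothetical and exist only for this illustration; the alias values are the ones used in the example above:

```shell
# Hypothetical variables mirroring the two Cloudera Manager properties.
EXTERNAL_KAFKA_ACCOUNTS="datahub"   # External Kafka Accounts
COLOCATED_ALIAS="cdppvc"            # Streams Replication Manager Co-located Kafka Cluster Alias

# The cluster alias property holds both values, delimited with a comma.
CLUSTER_ALIASES="${EXTERNAL_KAFKA_ACCOUNTS},${COLOCATED_ALIAS}"
echo "$CLUSTER_ALIASES"
```

Every alias that later appears in a replication must be present in this list.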
Configure replications:
In this example, data is replicated unidirectionally. As a result, only a single replication must be configured.
Find the Streams Replication Manager's Replication Configs property.
Click the add button and add new lines
for each unique replication you want to add and enable.
Add and enable your replications. For example:
cdppvc->datahub.enabled=true
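As an illustration of the format, the following sketch generates such a line and rejects aliases that are not in the cluster alias list. The replication_line function is hypothetical and not part of SRM; it only mirrors the [source]->[target].enabled=true pattern shown above:

```shell
# Hypothetical helper: print the line that enables one replication flow.
# Arguments: <comma-delimited alias list> <source alias> <target alias>
replication_line() {
  aliases="$1"
  source_alias="$2"
  target_alias="$3"

  # Both aliases must appear in the configured cluster alias list.
  for a in "$source_alias" "$target_alias"; do
    case ",$aliases," in
      *",$a,"*) ;;
      *) echo "Unknown cluster alias: $a" >&2; return 1 ;;
    esac
  done

  echo "${source_alias}->${target_alias}.enabled=true"
}

replication_line "datahub,cdppvc" cdppvc datahub
```

A bidirectional setup would presumably need a second line with the aliases swapped, which is why the unidirectional example needs only one.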
Configure Driver and Service role targets:
Find and configure the Streams Replication Manager Service Target Cluster property.
Add the co-located cluster's alias to the property. For example:
cdppvc
Setting this property to cdppvc does not enable you to monitor the replications targeting the Data Hub cluster. It is possible to add the Data Hub cluster alias to this property and, as a result, monitor the Data Hub cluster. However, this can lead to unwanted behavior. See the Before you begin section for more information.
Find and configure the Streams Replication Manager Driver Target
Cluster property.
Gateway TLS/SSL Trust Store File: [***CDP PVC BASE GLOBAL TRUSTSTORE LOCATION***]
Gateway TLS/SSL Truststore Password: [***CDP PVC BASE GLOBAL TRUSTSTORE PASSWORD***]
SRM Client's Kerberos Principal Name: [***MY KERBEROS PRINCIPAL***]
SRM Client's Kerberos Keytab Location: [***PATH TO KEYTAB FILE***]
Take note of the password you configure in SRM Client's Secure Storage
Password and the name you configure in Environment Variable
Holding SRM Client's Secure Storage Password. You will need to provide
both of these in your CLI session before running the tool.
Click Save Changes.
Restart the SRM service.
Deploy client configuration for SRM.
Start the replication process using the srm-control tool:
SSH as an administrator to any of the SRM hosts in the CDP PvC cluster.
ssh [***USER***]@[***MY-CDP-PVC-CLUSTER.COM***]
Set the secure storage password as an environment variable.
Replace [***SECURE STORAGE ENV VAR***] with the name of the environment variable you specified in Environment Variable Holding SRM Client's Secure Storage Password. Replace [***SRM SECURE STORAGE PASSWORD***] with the password you specified in SRM Client's Secure Storage Password. For example:
export SECURESTOREPASS="mypassword"
Use the srm-control tool with the topics subcommand to add
topics to the allow list.
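A hedged sketch of what the invocation might look like for this scenario. The topic names are hypothetical, and the --source, --target, and --add options are assumed from the commonly documented srm-control syntax, so confirm them against the tool's help output for your version. The snippet only assembles and prints the command so that it can be reviewed before it is run:

```shell
SOURCE_ALIAS="cdppvc"     # co-located source cluster alias used in this example
TARGET_ALIAS="datahub"    # Data Hub cluster alias used in this example
TOPICS="topic1,topic2"    # hypothetical topic names, comma-delimited

# Assemble the command (option names are an assumption, see above).
SRM_CMD="srm-control topics --source $SOURCE_ALIAS --target $TARGET_ALIAS --add $TOPICS"

# Print the command for review; run it manually once it looks right.
echo "$SRM_CMD"
```

The secure storage password environment variable set in the previous step must be present in the same CLI session when the command is run.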