Replicating data from CDP PvC Base cluster to Data Hub cluster with SRM deployed in Data Hub cluster

You can set up and configure an instance of SRM running in a Data Hub cluster to replicate data between the Data Hub cluster and a CDP PvC Base cluster. In addition, you can use SMM to monitor the replication process. Review the following example to learn how this can be set up.

Consider the following replication scenario:

In this scenario, data is replicated from a CDP PvC Base cluster to a Data Hub cluster by an SRM instance that is deployed in the Data Hub cluster.

The CDP PvC Base cluster has Kafka deployed on it. It is a secure cluster that has TLS/SSL encryption enabled and uses PLAIN authentication. In addition, it uses Ranger for authorization.

The Data Hub cluster is provisioned with one of the default Streams Messaging cluster definitions.

This example scenario does not go into detail on how to set up the clusters and assumes the following:

  • A Data Hub cluster provisioned with the Streams Messaging Light Duty or Heavy Duty cluster definition is available.

    For more information, see Creating your first Streams Messaging cluster in the CDF for Data Hub library. Alternatively, review the cloud provider-specific cluster creation instructions available in the Cloudera Data Hub library.

  • A CDP PvC Base cluster with Kafka is available. This cluster has TLS/SSL encryption enabled, uses PLAIN authentication, and has Ranger for authorization. For more information, see the CDP Private Cloud Base Installation Guide.

  • Network connectivity and DNS resolution are established between the clusters.
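
One way to confirm the last prerequisite is a quick check from a Data Hub host. The following is a minimal sketch, not part of the product tooling: check_brokers is a hypothetical helper, and it assumes a Linux host where bash, getent, and timeout are available. It takes a comma-separated host:port list in the same form as the bootstrap.servers values used later in this example.

```shell
#!/bin/sh
# Hypothetical helper (not part of SRM or Cloudera Manager).
# For every host:port entry in a comma-separated list, check that the
# host name resolves and that a TCP connection to the port can be opened.
check_brokers() {
  local entry host port rc=0
  for entry in $(printf '%s' "$1" | tr ',' ' '); do
    host="${entry%:*}"    # text before the last colon
    port="${entry##*:}"   # text after the last colon
    if ! getent hosts "$host" > /dev/null 2>&1; then
      echo "DNS FAIL  $host"
      rc=1
      continue
    fi
    # Assumption: bash was built with /dev/tcp redirection support.
    if timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2> /dev/null; then
      echo "OK        $host:$port"
    else
      echo "TCP FAIL  $host:$port"
      rc=1
    fi
  done
  return "$rc"
}
```

For example, check_brokers "host1.example.com:9093,host2.example.com:9093" prints one status line per broker and returns a non-zero exit code if any check fails. A successful TCP connection only shows that the port is reachable; it does not validate TLS or authentication.
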
  1. Obtain PLAIN credentials for SRM.
    The credentials of a PLAIN user that can access the CDP PvC Base cluster are required. These credentials are supplied to SRM in a later step. In this example, [***PLAIN USER***] and [***PLAIN USER PASSWORD***] are used to refer to these credentials.
  2. Add Ranger permissions for the PLAIN user in the CDP PvC Base cluster:

    You must ensure that the PLAIN user you obtained has correct permissions assigned to it in Ranger. Otherwise, SRM will not be able to access Kafka resources on the CDP PvC Base cluster.

    1. Access the Cloudera Manager instance of your CDP PvC Base cluster.
    2. Go to Ranger > Ranger Admin Web UI.
    3. Log in to the Ranger Console (Ranger Admin Web UI).
    4. Add the [***PLAIN USER***] to the following policies:
      • All - consumergroup
      • All - topic
      • All - transactionalid
      • All - cluster
      • All - delegationtoken
  3. Acquire the CDP PvC Base cluster truststore and add it to the Data Hub cluster:
    The actions you need to take differ depending on how TLS is set up in the CDP PvC Base cluster.
    If the CDP PvC Base cluster uses Auto-TLS:
    1. Obtain the certificate of the Cloudera Manager root Certificate Authority.

      The Certificate Authority certificate can be obtained using the certmanager utility. For more information, see The certmanager utility.

    2. Run the following command to create the truststore:
      keytool \
        -importcert \
        -storetype JKS \
        -noprompt \
        -keystore cdppvc-truststore.jks \
        -storepass [***PASSWORD***] \
        -alias cdppvc-cm-ca \
        -file [***PATH TO CM CA CERTIFICATE***]
      

      Note down the password. It is needed in a later step.

    3. Copy the cdppvc-truststore.jks file to a common location on all the hosts in your CDP Data Hub cluster.

      Cloudera recommends that you use the following location: /opt/cloudera/security/cdppvc-truststore.jks.

    4. Set the correct file permissions.

      Use 751 for the directory and 444 for the truststore file.

    If TLS was set up manually in the CDP PvC Base cluster:
    1. Note down the CDP PvC Base cluster's truststore location and password. These should be known to you.
    2. Copy the truststore file to a common location on all the hosts in your CDP Data Hub cluster.

      Cloudera recommends that you use the following location: /opt/cloudera/security/truststore.jks.

    3. Set the correct file permissions.

      Use 751 for the directory and 444 for the truststore file.
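
The copy and permission steps above can be scripted. The sketch below is one possible approach, not an official procedure: install_truststore is a hypothetical helper name, and it assumes that the truststore file has already been transferred to the host and that you run the function on each Data Hub host with sufficient privileges (for example, through sudo).

```shell
#!/bin/sh
# Hypothetical helper (not part of any Cloudera tooling).
# Copies a truststore into a target directory and applies the
# permissions given in this example: 751 on the directory and
# 444 on the truststore file.
install_truststore() {
  local src="$1" dest_dir="$2" dest
  dest="$dest_dir/$(basename "$src")"
  mkdir -p "$dest_dir"    # create the directory if it does not exist
  chmod 751 "$dest_dir"   # rwxr-x--x on the directory
  cp -f "$src" "$dest"    # -f replaces an existing read-only copy
  chmod 444 "$dest"       # read-only for all users
}
```

With the file name used in this example, the call would be install_truststore ./cdppvc-truststore.jks /opt/cloudera/security. Note that the function also resets the mode of the target directory if it already exists.
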

  4. Configure the SRM properties in the Data Hub cluster:
    1. Access the Cloudera Manager instance of your Data Hub cluster.
    2. Go to Streams Replication Manager > Configuration and configure the following properties:
      • Streams Replication Manager Cluster alias: datahub, cdppvc
      • Streams Replication Manager Driver Target Cluster: datahub, cdppvc
      • Streams Replication Manager Service Target Cluster: datahub
      • Streams Replication Manager's Replication Configs:
        #Bootstrap servers:
        cdppvc.bootstrap.servers=[***MY-CDP-PVC-CLUSTER-HOST-1.COM:9093***],[***MY-CDP-PVC-CLUSTER-HOST-2.COM:9093***]
        datahub.bootstrap.servers=[***MY-DATAHUB-CLUSTER-HOST-1.COM:9093***],[***MY-DATAHUB-CLUSTER-HOST-2.COM:9093***]
        
        #Replications:
        cdppvc->datahub.enabled=true
        
        #Security properties for the CDP PvC Base cluster:
        cdppvc.security.protocol=SASL_SSL
        cdppvc.sasl.mechanism=PLAIN
        cdppvc.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="[***PLAIN USER***]" password="[***PLAIN USER PASSWORD***]";
        cdppvc.ssl.truststore.location=/opt/cloudera/security/cdppvc-truststore.jks
        cdppvc.ssl.truststore.password=[***PASSWORD***]
        
        #Use the FQDN when specifying cluster hosts.
        #The terminating semicolon in the [***ALIAS***].sasl.jaas.config property must be included in the configuration.
        #The value of the [***ALIAS***].ssl.truststore.location is the location where you copied the truststore in a previous step.
        #The [***ALIAS***].ssl.truststore.password property must be specified. Otherwise, the configuration might get overridden by the service ssl.truststore.password property.
         
        
    3. Click Save.
    4. Restart SRM.
    5. Deploy client configuration for SRM.
  5. Start data replication using the srm-control tool:
    1. SSH as an administrator to any of the SRM hosts in the Data Hub cluster.
      ssh [***USER***]@[***MY-DATAHUB-CLUSTER.COM***]
      
    2. Create a configuration file for the srm-control tool.

      The srm-control tool behaves as a Kafka client and requires configuration that is similar to any Kafka client. The configuration file is specified with the --config option when you run the tool. The configuration file must include cluster alias definitions, as well as properties related to connection information and security. Cluster aliases are defined a single time; connection and security properties are defined separately for each alias (cluster). In this example, the file is named srm.properties.

      #Bootstrap servers:
      cdppvc.bootstrap.servers=[***MY-CDP-PVC-CLUSTER-HOST-1.COM:9093***],[***MY-CDP-PVC-CLUSTER-HOST-2.COM:9093***]
      datahub.bootstrap.servers=[***MY-DATAHUB-CLUSTER-HOST-1.COM:9093***],[***MY-DATAHUB-CLUSTER-HOST-2.COM:9093***]
      
      #CDP PvC Base cluster's security properties:
      cdppvc.security.protocol=SASL_SSL
      cdppvc.sasl.mechanism=PLAIN
      cdppvc.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="[***PLAIN USER***]" password="[***PLAIN USER PASSWORD***]";
      cdppvc.ssl.truststore.location=/opt/cloudera/security/cdppvc-truststore.jks
      cdppvc.ssl.truststore.password=[***PASSWORD***]
      
      #Data Hub cluster's security properties:
      datahub.security.protocol=SASL_SSL
      datahub.sasl.mechanism=GSSAPI
      datahub.sasl.kerberos.service.name=kafka
      datahub.sasl.jaas.config=com.sun.security.auth.module.Krb5LoginModule required useKeyTab=true keyTab="[***PATH TO KEYTAB FILE***]" storeKey=true useTicketCache=false principal="[***MY KERBEROS PRINCIPAL***]";
      datahub.ssl.truststore.location=/opt/cloudera/security/datahub-truststore.jks
      datahub.ssl.truststore.password=[***PASSWORD***]
      
      #Use the FQDN when specifying the cluster hosts.
      #The terminating semicolon in the [***ALIAS***].sasl.jaas.config properties must be included in the configuration.
      #The value of the cdppvc.ssl.truststore.location property is the location where you copied the truststore in a previous step.
      
    3. Use the srm-control tool with the topics subcommand to add topics to the allow list:
      srm-control --config ./srm.properties topics --source cdppvc --target datahub --add [***TOPIC NAME***]
    4. Use the srm-control tool with the groups subcommand to add groups to the allow list:
      srm-control --config ./srm.properties groups --source cdppvc --target datahub --add ".*"
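
A missing terminating semicolon in a sasl.jaas.config value is easy to overlook, so a small pre-flight check of srm.properties before running the commands above may save a debugging round. This is a minimal sketch that uses only grep; check_jaas_semicolons is a hypothetical helper and is not part of the srm-control tool.

```shell
#!/bin/sh
# Hypothetical helper (not part of srm-control).
# Reports every non-comment *.sasl.jaas.config line in a properties file
# whose value does not end with a semicolon. Trailing whitespace after
# the semicolon is tolerated by this check.
check_jaas_semicolons() {
  local file="$1" bad
  if [ ! -r "$file" ]; then
    echo "Cannot read $file" >&2
    return 2
  fi
  bad="$(grep -Ev '^[[:space:]]*#' "$file" \
    | grep -E '\.sasl\.jaas\.config=' \
    | grep -Ev ';[[:space:]]*$' || true)"
  if [ -n "$bad" ]; then
    echo "Missing terminating semicolon:"
    echo "$bad"
    return 1
  fi
  echo "All sasl.jaas.config values end with a semicolon."
}
```

Running check_jaas_semicolons ./srm.properties returns a non-zero exit code and prints the offending lines if any value lacks the semicolon. The check covers only this one formatting rule; it does not validate credentials, truststore paths, or connectivity.
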
  6. Monitor the replication process.
    Access the SMM UI in the Data Hub cluster and go to the Cluster Replications page. The replications you set up will be visible on this page.