Replicating data from CDP PvC Base cluster to Data Hub cluster with SRM deployed in Data Hub cluster

You can set up and configure an instance of SRM running in a Data Hub cluster to replicate data between the Data Hub cluster and a CDP PvC Base cluster. In addition, you can use SMM to monitor the replication process. Review the following example to learn how this can be set up.

Consider the following replication scenario:

In this scenario, data is replicated from a CDP PvC Base cluster to a Data Hub cluster by an SRM instance that is deployed in the Data Hub cluster.

The CDP PvC Base cluster has Kafka deployed on it. It is a secure cluster that has TLS/SSL encryption enabled and uses PLAIN authentication. In addition, it uses Ranger for authorization.

The Data Hub cluster is provisioned with one of the default Streams Messaging cluster definitions.

This example scenario does not go into detail on how to set up the clusters and assumes the following:

  • A Data Hub cluster provisioned with the Streams Messaging Light Duty or Heavy Duty cluster definition is available.

    For more information, see Setting up your Streams Messaging cluster in the CDF for Data Hub library. Alternatively, you can also review the cloud provider specific cluster creation instructions available in the Cloudera Data Hub library.

  • A CDP PvC Base cluster with Kafka is available. This cluster has TLS/SSL encryption enabled, uses PLAIN authentication, and has Ranger for authorization. For more information, see the CDP Private Cloud Base Installation Guide.

  • Network connectivity and DNS resolution are established between the clusters.
  1. Obtain PLAIN credentials for SRM.
    The credentials of a PLAIN user that can access the CDP PvC Base cluster are required. These credentials are supplied to SRM in a later step. In this example, [***PLAIN USER***] and [***PLAIN USER PASSWORD***] are used to refer to these credentials.
  2. Add Ranger permissions for the PLAIN user in the CDP PvC Base cluster:

    You must ensure that the PLAIN user you obtained has correct permissions assigned to it in Ranger. Otherwise, SRM will not be able to access Kafka resources on the CDP PvC Base cluster.

    1. Access the Cloudera Manager instance of your CDP PvC Base cluster.
    2. Go to Ranger > Ranger Admin Web UI.
    3. Log in to the Ranger Console (Ranger Admin Web UI).
    4. Add the [***PLAIN USER***] to the following policies:
      • All - consumergroup
      • All - topic
      • All - transactionalid
      • All - cluster
      • All - delegationtoken
  3. Acquire the CDP PvC Base cluster truststore and add it to the Data Hub cluster:
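    The actions you need to take differ depending on how TLS is set up in the CDP PvC Base cluster. If Auto-TLS is enabled and certificates are managed by Cloudera Manager, complete the following steps: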
    1. Obtain the certificate of the Cloudera Manager root Certificate Authority and its password.

      The Certificate Authority certificate and its password can be obtained using the Cloudera Manager API. The following steps describe how you can retrieve the certificate and password using the Cloudera Manager API Explorer. Alternatively, you can also retrieve the certificate and password by calling the appropriate endpoints in your browser window or using curl.

      1. Access the Cloudera Manager instance of your CDP PvC Base cluster.
      2. Go to Support > API Explorer.
      3. Find CertManagerResource.
      4. Select the /certs/truststore GET operation and click Try it out.
      5. Enter the truststore type.
      6. Click Execute.
      7. Click Download file under Responses.

        The downloaded file is your certificate.

      8. Select the /certs/truststorePassword GET operation and click Try it out.
      9. Click Execute.

        The password is displayed under Responses.
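As an alternative to the API Explorer, the same two endpoints can be called with curl. The following is a minimal sketch under stated assumptions, not an official procedure. The Cloudera Manager URL, the API version (v41), the user name, and the use of a `type` query parameter with the value PEM are assumptions; confirm them in the API Explorer of your Cloudera Manager instance. The function and variable names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: fetch the Cloudera Manager CA truststore and its password with curl.
# CM_URL, CM_API_VERSION, and CM_USER are assumed placeholders; adjust them.
CM_URL="${CM_URL:-https://cm.example.com:7183}"
CM_API_VERSION="${CM_API_VERSION:-v41}"
CM_USER="${CM_USER:-admin}"

# Build the URL of a CertManagerResource endpoint, for example "truststore".
cm_cert_url() {
  printf '%s/api/%s/certs/%s\n' "$CM_URL" "$CM_API_VERSION" "$1"
}

# Download the truststore in the given format (assumed: PEM) to a file.
fetch_truststore() {
  local type="$1" out="$2"
  # Only the user name is passed to -u, so curl prompts for the password.
  curl --fail --silent --show-error -u "$CM_USER" \
    "$(cm_cert_url truststore)?type=${type}" -o "$out"
}

# Print the truststore password returned by the API.
fetch_truststore_password() {
  curl --fail --silent --show-error -u "$CM_USER" \
    "$(cm_cert_url truststorePassword)"
}

# Example usage (not run automatically):
#   fetch_truststore PEM cm-ca.pem
#   fetch_truststore_password
```

If the host running curl does not yet trust the Cloudera Manager certificate, curl rejects the connection. How you establish that trust is outside the scope of this sketch.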

    2. Run the following command to create the truststore:
      keytool \
        -importcert \
        -storetype JKS \
        -noprompt \
        -keystore cdppvc-truststore.jks \
        -storepass ***PASSWORD*** \
        -alias cdppvc-cm-ca \
        -file ***PATH TO CM CA CERTIFICATE***
      

      Note down the password. It is needed in a later step.

    3. Copy the cdppvc-truststore.jks file to a common location on all the hosts in your CDP Data Hub cluster.

      Cloudera recommends that you use the following location: /opt/cloudera/security/cdppvc-truststore.jks.

    4. Set the correct file permissions.

      Use 751 for the directory and 444 for the truststore file.

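    If TLS was set up manually, the truststore and its password already exist. In that case, complete the following steps instead:
    1. Note down the location and password of the CDP PvC Base cluster's truststore. You should already know both.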
    2. Copy the truststore file to a common location on all the hosts in your CDP Data Hub cluster.

      Cloudera recommends that you use the following location: /opt/cloudera/security/truststore.jks.

    3. Set the correct file permissions.

      Use 751 for the directory and 444 for the truststore file.
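The permission settings can be applied with a small helper once the truststore file is in place on a host. This is a minimal sketch; the function name `secure_truststore` is hypothetical, and the path in the usage comment is the recommended location mentioned in this step.

```shell
#!/usr/bin/env bash
# Sketch: apply the recommended permissions to a truststore and its directory.
# Usage: secure_truststore /opt/cloudera/security/cdppvc-truststore.jks
secure_truststore() {
  local file="$1"
  local dir
  dir="$(dirname "$file")"
  # Stop early if the truststore has not been copied to this host yet.
  [ -f "$file" ] || { echo "truststore not found: $file" >&2; return 1; }
  chmod 751 "$dir"   # owner: rwx, group: r-x, others: --x
  chmod 444 "$file"  # read-only for everyone
}
```

Run the helper on every host of the Data Hub cluster that holds a copy of the truststore.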

  4. Access the Cloudera Manager instance of your Data Hub cluster.
  5. Define the external Kafka cluster (CDP PvC Base).
    1. Go to Administration > External Accounts.
    2. Go to the Kafka Credentials tab.
      On this tab you will create a credential for each external cluster taking part in the replication process.
    3. Click Add Kafka credentials.
    4. Configure the Kafka credentials:
      In the case of this example, you must create a single credential representing the CDP PvC Base cluster. For example:
      Name=cdppvc
      Bootstrap servers=[***MY-CDP-PVC-CLUSTER-HOST-1.COM:9093***],[***MY-CDP-PVC-CLUSTER-HOST-2.COM:9093***]
      Security Protocol=SASL_SSL
      JAAS Secret 1=[***PLAIN USER***]
      JAAS Secret 2=[***PLAIN USER PASSWORD***]
      JAAS Template=org.apache.kafka.common.security.plain.PlainLoginModule required username="##JAAS_SECRET_1##" password="##JAAS_SECRET_2##"; 
      SASL Mechanism=PLAIN
      Truststore Password=[***PASSWORD***] 
      Truststore Path=/opt/cloudera/security/cdppvc-truststore.jks
      Truststore type=JKS
      
    5. Click Add.
      If credential creation is successful, a new entry corresponding to the Kafka credential you specified appears on the page.
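Before relying on the credential, it can help to confirm that the same settings work from a plain Kafka client on a Data Hub host. The sketch below writes a client properties file that mirrors the credential fields using standard Kafka client property names. The function name, the output path, and the sample values in the usage comment are hypothetical, and the connectivity check with `kafka-topics` is a suggestion rather than part of the documented procedure.

```shell
#!/usr/bin/env bash
# Sketch: write a Kafka client properties file that mirrors the Kafka credential.
# All values passed in are placeholders; substitute your own.
write_client_properties() {
  local user="$1" password="$2" truststore="$3" truststore_password="$4" out="$5"
  cat > "$out" <<EOF
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="${user}" password="${password}";
ssl.truststore.location=${truststore}
ssl.truststore.password=${truststore_password}
ssl.truststore.type=JKS
EOF
  chmod 600 "$out"  # the file contains secrets
}

# Example usage (not run automatically):
#   write_client_properties myuser mypassword \
#     /opt/cloudera/security/cdppvc-truststore.jks mytruststorepassword \
#     /tmp/cdppvc-client.properties
#   kafka-topics --bootstrap-server my-cdp-pvc-cluster-host-1.com:9093 \
#     --list --command-config /tmp/cdppvc-client.properties
```

If the topic listing succeeds, the PLAIN user, the truststore, and the network path to the CDP PvC Base cluster are all usable from that host. Delete the properties file afterwards because it contains the password in plain text.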
  6. Define the co-located Kafka cluster (Data Hub):
    1. In Cloudera Manager, go to Clusters and select the Streams Replication Manager service.
    2. Go to Configuration.
    3. Find and enable the Kafka Service property.
    4. Find and configure the Streams Replication Manager Co-located Kafka Cluster Alias property.
      The alias you configure represents the co-located cluster. Enter an alias that is unique and easily identifiable. For example:
      datahub
    5. Enable relevant security feature toggles.
      Because the Data Hub cluster is both TLS/SSL and Kerberos enabled, you must enable all feature toggles for both the Driver and Service roles. The feature toggles are the following:
      • Enable TLS/SSL for SRM Driver
      • Enable TLS/SSL for SRM Service
      • Enable Kerberos Authentication
  7. Add both clusters to SRM's configuration:
    1. Find and configure the External Kafka Accounts property.
      Add the name of all Kafka credentials you created to this property. This can be done by clicking the add button to add a new line to the property and then entering the name of the Kafka credential. For example:
      cdppvc
    2. Find and configure the Streams Replication Manager Cluster alias property.
      Add all cluster aliases to this property. This includes the aliases present in both the External Kafka Accounts and Streams Replication Manager Co-located Kafka Cluster Alias properties. Delimit the aliases with commas. For example:
      datahub,cdppvc
  8. Configure replications:

    In this example, data is replicated unidirectionally. As a result, only a single replication must be configured.

    1. Find the Streams Replication Manager's Replication Configs property.
    2. Click the add button and add new lines for each unique replication you want to add and enable.
    3. Add and enable your replications. For example:
      cdppvc->datahub.enabled=true
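A replication entry only makes sense if both of its aliases are also listed in the Streams Replication Manager Cluster alias property. The following sketch illustrates that relationship with a small check you could run on the values before entering them in Cloudera Manager. The function name `check_replication` is hypothetical and the check is not part of SRM.

```shell
#!/usr/bin/env bash
# Sketch: check that both aliases of a replication entry are declared aliases.
# Usage: check_replication "datahub,cdppvc" "cdppvc->datahub.enabled=true"
check_replication() {
  local aliases=",$1," entry="$2"
  local pair="${entry%%".enabled="*}"  # "cdppvc->datahub"
  local src="${pair%%"->"*}"           # "cdppvc"
  local dst="${pair##*"->"}"           # "datahub"
  local alias
  for alias in "$src" "$dst"; do
    case "$aliases" in
      *",$alias,"*) ;;  # alias is declared
      *) echo "unknown cluster alias: $alias" >&2; return 1 ;;
    esac
  done
  echo "$src -> $dst"
}
```

With the example values used here, `check_replication "datahub,cdppvc" "cdppvc->datahub.enabled=true"` prints `cdppvc -> datahub`. The call fails if either alias is missing from the list.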
  9. Configure Driver and Service role targets:
    1. Find and configure the Streams Replication Manager Service Target Cluster property.
      Add the co-located cluster's alias to the property. For example:
      datahub
    2. Find and configure the Streams Replication Manager Driver Target Cluster property.
      For example:
      datahub,cdppvc
  10. Configure the srm-control tool:
    1. Click Gateway in the Filters pane.
    2. Find and configure the following properties:
      • SRM Client's Secure Storage Password: [***PASSWORD***]
      • Environment Variable Holding SRM Client's Secure Storage Password: SECURESTOREPASS
      • Gateway TLS/SSL Trust Store File: /opt/cloudera/security/datahub-truststore.jks
      • Gateway TLS/SSL Truststore Password: [***PASSWORD***]
      • SRM Client's Kerberos Principal Name: [***MY KERBEROS PRINCIPAL***]
      • SRM Client's Kerberos Keytab Location: [***PATH TO KEYTAB FILE***]
      Take note of the password you configure in SRM Client's Secure Storage Password and the name you configure in Environment Variable Holding SRM Client's Secure Storage Password. You will need to provide both of these in your CLI session before running the tool.
    3. Click Save Changes.
    4. Restart the SRM service.
    5. Deploy client configuration for SRM.
  11. Start the replication process using the srm-control tool:
    1. SSH as an administrator to any of the SRM hosts in the Data Hub cluster.
      ssh [***USER***]@[***MY-DATAHUB-CLUSTER.COM***]
    2. Set the secure storage password as an environment variable.
      export [***SECURE STORAGE ENV VAR***]="[***SECURE STORAGE PASSWORD***]"
      Replace [***SECURE STORAGE ENV VAR***] with the name of the environment variable you specified in Environment Variable Holding SRM Client's Secure Storage Password. Replace [***SECURE STORAGE PASSWORD***] with the password you specified in SRM Client's Secure Storage Password. For example:
      export SECURESTOREPASS="mypassword"
    3. Use the srm-control tool with the topics subcommand to add topics to the allow list.
      srm-control topics --source cdppvc --target datahub --add [***TOPIC NAME***]
    4. Use the srm-control tool with the groups subcommand to add groups to the allow list.
      srm-control groups --source cdppvc --target datahub --add ".*"
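The commands in this step can be wrapped in a small script so that they run in a fixed order and stop if the secure storage password is missing. This is a minimal sketch, not part of the srm-control tool. The `run` and `start_replication` function names and the `DRY_RUN` switch are hypothetical. The final call uses the `--list` option of the topics subcommand to print the resulting allow list.

```shell
#!/usr/bin/env bash
# Sketch: wrap the srm-control calls of this step in one function.
# Set DRY_RUN=1 to print the commands instead of running them.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Usage: start_replication SECURESTOREPASS "my_topic"
start_replication() {
  local env_var="$1" topics="$2"
  # Refuse to continue if the secure storage password variable is empty or unset.
  if [ -z "${!env_var:-}" ]; then
    echo "environment variable $env_var is not set" >&2
    return 1
  fi
  run srm-control topics --source cdppvc --target datahub --add "$topics" &&
  run srm-control groups --source cdppvc --target datahub --add ".*" &&
  run srm-control topics --source cdppvc --target datahub --list
}
```

Export the secure storage password as described in the earlier substep, then call, for example, `DRY_RUN=1 start_replication SECURESTOREPASS my_topic` to preview the commands before running them for real.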
      
  12. Monitor the replication process.
    Access the SMM UI in the Data Hub cluster and go to the Cluster Replications page. The replications you set up will be visible on this page.