Known issues and limitations
This section provides a summary of known issues for Cloudera Data Flow in Data Hub.
- ReplaceText with multiple Concurrent Tasks can result in data corruption
- ReplaceText, when scheduled to run with multiple Concurrent Tasks and using a Replacement Strategy of "Regular Expression" or "Literal Replace", can result in content being corrupted.
The issue is far more likely to occur with multiple Concurrent Tasks, but it may be possible to trigger when using a single Concurrent Task.
- To get the fix for this issue, file a support case through the Cloudera portal.
- Schema Registry is unavailable if Knox has failed
Schema Registry depends on Knox, but Knox is not highly available. If Knox fails, your clients cannot reach Schema Registry.
To work around this issue, switch to an internal Schema Registry configuration. To do this, configure Schema Registry with the host name, port, and direct URL rather than a dynamic Knox endpoint.
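As a minimal sketch of the workaround above, the following Python helper assembles a direct Schema Registry URL from a host name and port instead of a Knox endpoint. The host name, the port 7790, and the `/api/v1` path are assumptions used for illustration; use the values from your own cluster.

```python
from urllib.parse import urlunsplit


def direct_schema_registry_url(host: str, port: int,
                               scheme: str = "https",
                               api_path: str = "/api/v1") -> str:
    """Build a Schema Registry URL that points at the service host directly.

    The default api_path is an assumption; adjust it to match your deployment.
    """
    if not host:
        raise ValueError("host must not be empty")
    return urlunsplit((scheme, f"{host}:{port}", api_path, "", ""))


# Hypothetical host and port, shown only to illustrate the resulting shape.
url = direct_schema_registry_url("sr.example.internal", 7790)
print(url)
```

Clients configured with a URL of this form connect to Schema Registry without passing through Knox, so a Knox failure does not affect them.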
- Atlas does not connect lineage when NiFi is writing files to S3 or ADLS Gen2
When you run a flow that writes data to S3 using the PutHDFS or PutS3Object processors, or that writes files to ADLS Gen2 using PutHDFS or PutAzureDataLakeStorage, the Atlas type reported by NiFi is "nifi_dataset", while Atlas expects the type "aws_s3_pseudo_dir" or "adls_gen2_directory". As a result, the lineage of the NiFi flow is shown, but it is not connected to subsequent processes that use these S3 or ADLS files.
Lineage can be connected manually in Atlas if required.
- PutAzureDataLakeStorage has several limitations
- You can add files to read-only buckets
- There is no check for file overwriting. It is possible to overwrite data.
- To add files to a bucket root level, set the destination with an empty string, rather than "/".
PutAzureDataLakeStorage was introduced in CFM 2.0.0 for inclusion in Flow Management clusters in CDP Public Cloud. It is not available in HDF 3.5.x or CFM 1.1.x.
- You can use the PutHDFS processor to write data to Azure Data Lake Storage. See Ingesting Data into Azure Data Lake Storage for details.
- Adjust PublishKafkaProcessor default timeout value for cloud
The default 5000 ms timeout of the PublishKafkaRecord processor might not be sufficient when "Delivery Guarantee" is set to "all", depending on your network setup and the workload on the Kafka cluster you are connecting to.
The error message may look similar to:
2020-01-22 09:50:12,854 ERROR org.apache.nifi.processors.kafka.pubsub.PublishKafkaRecord_2_0: PublishKafkaRecord_2_0[id=cca729a5-016f-1000-ffff-ffffa3429f0c] Failed to send StandardFlowFileRecord[uuid=03d8a3ab-e1a3-41a4-9fa1-07af176aeb56, claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1579683456791-21, container=default, section=21], offset=844488, length=20],offset=0, name=NLgsvfQuAi2aad67b4-b053-4e42-b9e4-a10153941229,size=20] to Kafka: org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 5000 ms.
Increase the values of Max Metadata Wait Time and Acknowledgement Wait Time in the processor configuration.
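To check whether a NiFi log line reports this timeout, and which wait time was in effect, a small parser can help. The sketch below is illustrative only: the function name is hypothetical, and the pattern is based solely on the message format shown above.

```python
import re
from typing import Optional

# Matches the tail of the Kafka error shown above and captures the wait time.
METADATA_TIMEOUT = re.compile(
    r"TimeoutException: Failed to update metadata after (\d+) ms"
)


def metadata_timeout_ms(log_line: str) -> Optional[int]:
    """Return the timeout in milliseconds if the line reports a metadata timeout."""
    match = METADATA_TIMEOUT.search(log_line)
    return int(match.group(1)) if match else None


sample = ("... to Kafka: org.apache.kafka.common.errors.TimeoutException: "
          "Failed to update metadata after 5000 ms.")
print(metadata_timeout_ms(sample))
```

If the reported value equals the wait time currently configured on the processor, that setting is the limit being reached and is the one to raise.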
- Terminating a Flow Management cluster does not delete the cluster specific NiFi and NiFi Registry repositories in Ranger
When a new Flow Management cluster is created, the setup process creates new repository entries in Ranger to allow cluster-specific Ranger policies. These cluster-specific Ranger repositories are not deleted when the cluster is terminated. This can cause issues when a cluster is terminated and another cluster with the same name is created afterwards: the new cluster reuses the existing Ranger repository, but the NiFi component UUIDs in the existing policies do not match the UUIDs of the new cluster.
Manually delete the cluster specific Ranger repositories in Ranger after terminating a Flow Management cluster.
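If you prefer to script the cleanup rather than use the Ranger UI, the sketch below prepares a delete request for one repository without sending it. It assumes that your Ranger instance exposes the public v2 REST endpoint for deleting a service by name and accepts basic authentication. The Ranger URL, repository name, and credentials are hypothetical placeholders; confirm the actual repository names in Ranger first.

```python
import base64
from urllib.parse import quote
from urllib.request import Request


def build_delete_service_request(ranger_url: str, service_name: str,
                                 user: str, password: str) -> Request:
    """Prepare (but do not send) a DELETE request for one Ranger repository."""
    # Assumed endpoint: Ranger public v2 API, delete a service by its name.
    url = (f"{ranger_url.rstrip('/')}/service/public/v2/api/service/name/"
           f"{quote(service_name, safe='')}")
    request = Request(url, method="DELETE")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request.add_header("Authorization", f"Basic {token}")
    return request


# Hypothetical values; review the output before sending the request,
# for example with urllib.request.urlopen(request).
request = build_delete_service_request(
    "https://ranger.example.internal:6182", "mycluster_nifi", "admin", "changeme")
print(request.get_method(), request.full_url)
```

Building the request separately from sending it lets you review the target repository name before anything is deleted.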
- Terminating a Streams Messaging cluster does not delete the cluster specific Kafka repositories in Ranger
When a new Streams Messaging cluster is created, the setup process creates new repository entries in Ranger to allow cluster-specific Ranger policies. These cluster-specific Ranger repositories are not deleted when the cluster is terminated.
Manually delete the cluster specific Ranger repositories in Ranger after terminating a Streams Messaging cluster.
- Scaling Kafka Brokers or NiFi Nodes up/down is not possible
- Data Hub does not allow users to resize Kafka broker or NiFi node groups.
- There is no workaround for this issue.