Known issues and limitations

This section summarizes the known issues and limitations for Cloudera Data Flow in Data Hub.

ReplaceText with multiple concurrency tasks can result in data corruption
ReplaceText, when scheduled to run with multiple Concurrent Tasks and configured with a Replacement Strategy of "Regular Expression" or "Literal Replace", can corrupt FlowFile content.

The issue is far more likely to occur with multiple Concurrent Tasks, but it may also be triggered when using a single Concurrent Task.

To get the fix for this issue, file a support case through the Cloudera portal.
Schema Registry is unavailable if Knox has failed

Schema Registry depends on Knox, but Knox is not highly available. If Knox fails, your clients cannot reach Schema Registry.

To work around this issue, switch to an internal Schema Registry configuration: configure your clients with the Schema Registry host name, port, and direct URL rather than the dynamic Knox endpoint.
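
As an illustration, the following Python sketch verifies that clients can reach Schema Registry directly, bypassing Knox. It assumes the default Schema Registry REST port 7788, the /api/v1/schemaregistry/schemas listing endpoint, and a hypothetical host name; substitute the values from your own cluster and add whatever authentication your environment requires.

import requests

# Hypothetical direct URL; 7788 is the usual default Schema Registry REST port.
DIRECT_URL = "http://schemaregistry-host.example.com:7788/api/v1"

# Listing registered schemas over the direct URL confirms that clients can reach
# Schema Registry without going through Knox. In a secured cluster, add the
# authentication your environment requires (for example TLS or Kerberos/SPNEGO).
response = requests.get(f"{DIRECT_URL}/schemaregistry/schemas", timeout=10)
print(response.status_code)  # 200 indicates the direct endpoint is reachable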

Atlas does not connect lineage when NiFi is writing files to S3 or ADLS Gen2

When a flow writes data to S3 using the PutHDFS or PutS3 processors, or writes files to ADLS Gen2 using the PutHDFS or PutADLS processors, the Atlas type reported by NiFi is "nifi_dataset", while Atlas expects the type "aws_s3_pseudo_dir" or "adls_gen2_directory". As a result, the lineage of the NiFi flow is shown, but it is not connected to subsequent processes that use these S3 or ADLS files.

Lineage can be connected manually in Atlas if required.

CFM-1017

PutAzureDataLakeStorage has several limitations
  • You can add files to read-only buckets
  • There is no check for file overwriting, so it is possible to overwrite data.
  • To add files at the bucket root level, set the destination to an empty string ("") rather than "/".

PutAzureDataLakeStorage was introduced in CFM 2.0.0 for inclusion in Flow Management clusters in CDP Public Cloud. It is not available in HDF 3.5.x or CFM 1.1.x.

You can use the PutHDFS processor to write data to Azure Data Lake Storage. See Ingesting Data into Azure Data Lake Storage for details.
Adjust the PublishKafkaRecord processor default timeout value for cloud

The default 5000 ms timeout of the PublishKafkaRecord processor, when "Delivery Guarantee" is set to "all", might not be enough depending on your network setup and on the workload of the Kafka cluster you are connecting to.

The error message may look similar to:

2020-01-22 09:50:12,854 ERROR 
org.apache.nifi.processors.kafka.pubsub.PublishKafkaRecord_2_0: 
PublishKafkaRecord_2_0[id=cca729a5-016f-1000-ffff-ffffa3429f0c] 
Failed to send StandardFlowFileRecord[uuid=03d8a3ab-e1a3-41a4-9fa1-07af176aeb56,
claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1579683456791-21, 
container=default, section=21], offset=844488, length=20],offset=0,
name=NLgsvfQuAi2aad67b4-b053-4e42-b9e4-a10153941229,size=20] to 
Kafka: org.apache.kafka.common.errors.
TimeoutException: Failed to update metadata after 5000 ms. 

Increase the Max Metadata Wait Time and Acknowledgement Wait Time values in the processor configuration.
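
The values can be changed in the processor configuration dialog. As an alternative illustration only, the following Python sketch updates them through the NiFi REST API; it assumes a hypothetical NiFi endpoint, that authentication is handled separately, that the processor is stopped, and that "max.block.ms" and "ack.wait.time" are the internal property names in your NiFi version.

import requests

NIFI_API = "https://nifi-host.example.com:8443/nifi-api"   # hypothetical endpoint; authentication omitted
PROCESSOR_ID = "cca729a5-016f-1000-ffff-ffffa3429f0c"      # example ID taken from the log above; use your own processor ID

# Fetch the current processor entity so that the update carries the correct revision.
# The processor must be stopped before its configuration can be changed.
entity = requests.get(f"{NIFI_API}/processors/{PROCESSOR_ID}", timeout=10).json()

# "max.block.ms" and "ack.wait.time" are assumed internal names of the
# "Max Metadata Wait Time" and "Acknowledgement Wait Time" properties; verify
# them for your NiFi version before applying the change.
update = {
    "revision": entity["revision"],
    "component": {
        "id": PROCESSOR_ID,
        "config": {"properties": {"max.block.ms": "30 sec", "ack.wait.time": "30 sec"}},
    },
}
requests.put(f"{NIFI_API}/processors/{PROCESSOR_ID}", json=update, timeout=10).raise_for_status()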

CFM-661

Terminating a Flow Management cluster does not delete the cluster specific NiFi and NiFi Registry repositories in Ranger

When a new Flow Management cluster is created, the setup process creates new repository entries in Ranger to allow cluster specific Ranger policies. These cluster specific Ranger repositories are not deleted when the cluster is terminated. This can lead to issues when a cluster is terminated and another cluster with the same name is created afterwards. The new cluster re-uses the existing Ranger repository, but the NiFi component UUIDs in the existing policies no longer match the UUIDs of the new cluster.

Manually delete the cluster specific Ranger repositories in Ranger after terminating a Flow Management cluster.
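
The repositories can be deleted in the Ranger Admin web UI. As an illustration, the following Python sketch removes them through the Ranger public REST API; the Ranger Admin URL, the credentials, and the repository names are hypothetical placeholders to replace with the values from your environment.

import requests
from requests.auth import HTTPBasicAuth

RANGER_API = "https://ranger-host.example.com:6182/service/public/v2/api"  # hypothetical Ranger Admin URL
AUTH = HTTPBasicAuth("admin", "changeit")                                  # replace with real credentials

# Hypothetical names of the cluster specific repositories left behind after termination.
for repo_name in ["flow-cluster_nifi", "flow-cluster_nifiregistry"]:
    # Look up the service by name to obtain its numeric ID, then delete it by ID.
    service = requests.get(f"{RANGER_API}/service/name/{repo_name}", auth=AUTH, timeout=10).json()
    requests.delete(f"{RANGER_API}/service/{service['id']}", auth=AUTH, timeout=10).raise_for_status()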

CB-5566

Terminating a Streams Messaging cluster does not delete the cluster specific Kafka repositories in Ranger

When a new Streams Messaging cluster is created, the setup process creates new repository entries in Ranger to allow cluster specific Ranger policies. These cluster specific Ranger repositories are not deleted when the cluster is terminated.

Manually delete the cluster specific Ranger repositories in Ranger after terminating a Streams Messaging cluster.
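
The same approach shown in the sketch for Flow Management clusters above can be used here, substituting the name of the cluster specific Kafka repository.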

Scaling Kafka Brokers or NiFi Nodes up/down is not possible
Data Hub does not allow users to resize Kafka broker or NiFi node groups.
There is no workaround for this issue.

Technical Service Bulletins

TSB 2022-580: NiFi Processors cannot write to content repository
If the content repository disk is filled more than 50% (or any other value set in nifi.properties for nifi.content.repository.archive.max.usage.percentage), and there is no data in the content repository archive, the following warning message can be found in the logs: "Unable to write flowfile content to content repository container default due to archive file size constraints; waiting for archive cleanup". This blocks the processors, and no more data is processed.

This appears to happen only if there is already data in the content repository on startup that needs to be archived, or if the following message is logged: "Found unknown file XYZ in the File System Repository; archiving file".
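
As a diagnostic illustration (not part of any fix), the following Python sketch compares the content repository disk usage against the configured archive threshold; the nifi.properties and content repository paths are hypothetical placeholders for your installation.

import shutil

# Hypothetical paths; adjust them to your NiFi installation.
NIFI_PROPERTIES = "/opt/nifi/conf/nifi.properties"
CONTENT_REPO_DIR = "/opt/nifi/content_repository"

# Read the archive threshold (50% by default) from nifi.properties.
threshold = 50.0
with open(NIFI_PROPERTIES) as f:
    for line in f:
        if line.startswith("nifi.content.repository.archive.max.usage.percentage"):
            threshold = float(line.split("=", 1)[1].strip().rstrip("%"))

usage = shutil.disk_usage(CONTENT_REPO_DIR)
used_pct = 100.0 * usage.used / usage.total
print(f"Content repository disk usage: {used_pct:.1f}% (archive threshold: {threshold}%)")
if used_pct > threshold:
    print("Usage exceeds the archive threshold; the warning described in this TSB can occur.")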

Upstream JIRA
Knowledge article
For the latest update on this issue, see the corresponding Knowledge article: TSB 2022-580: NiFi Processors cannot write to content repository