Known issues and limitations

This section summarizes the known issues and limitations for Cloudera Data Flow in Data Hub.

Atlas does not connect lineage when NiFi is writing files to S3 or ADLS Gen2

When a flow writes data to S3 using the PutHDFS or PutS3Object processors, or to ADLS Gen2 using the PutHDFS or PutAzureDataLakeStorage processors, NiFi reports the Atlas type "nifi_dataset", whereas Atlas expects the type "aws_s3_pseudo_dir" or "adls_gen2_directory". As a result, Atlas shows lineage for the NiFi flow itself, but that lineage is not connected to subsequent processes that use these S3 or ADLS files.

Lineage can be connected manually in Atlas if required.
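
When connecting lineage manually, it can help to know which Atlas type a downstream process expects for a given storage location. The following is a minimal sketch, not part of NiFi or Atlas: the function name and the list of URI schemes are assumptions for illustration, and only the three type names come from the description above.

```python
from urllib.parse import urlparse

# Atlas types named in the issue description, keyed by URI scheme.
# The set of schemes listed here is an assumption for illustration.
EXPECTED_ATLAS_TYPE = {
    "s3": "aws_s3_pseudo_dir",
    "s3a": "aws_s3_pseudo_dir",
    "abfs": "adls_gen2_directory",
    "abfss": "adls_gen2_directory",
}


def expected_atlas_type(uri: str) -> str:
    """Return the Atlas type that downstream lineage expects for a storage URI."""
    scheme = urlparse(uri).scheme.lower()
    # Fall back to the generic type that NiFi reports.
    return EXPECTED_ATLAS_TYPE.get(scheme, "nifi_dataset")
```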


PutAzureDataLakeStorage has several limitations
  • You can add files to read-only buckets.
  • There is no check for file overwriting, so existing data can be overwritten.
  • To add files at the root level of a bucket, set the destination to an empty string rather than "/".
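
The root-level rule can be applied before the destination value reaches the processor. This is a minimal sketch under the assumption that the destination is prepared outside NiFi; the helper name is hypothetical and is not a NiFi API.

```python
def normalize_adls_directory(directory: str) -> str:
    """Return "" for a root-level destination, otherwise the trimmed value."""
    trimmed = directory.strip()
    # "/" (with or without surrounding spaces) means the root level,
    # which must be passed to the processor as an empty string.
    return "" if trimmed == "/" else trimmed
```

For example, normalize_adls_directory(" / ") returns an empty string, while "landing/raw" is returned unchanged.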

PutAzureDataLakeStorage was introduced in CFM 2.0.0 for inclusion in Flow Management clusters in CDP Public Cloud. It is not available in HDF 3.5.x or CFM 1.1.x.

You can use the PutHDFS processor to write data to Azure Data Lake Storage. See Ingesting Data into Azure Data Lake Storage for details.


Adjust PublishKafkaProcessor default timeout value for cloud

When "Delivery Guarantee" is set to "all", the default 5000 ms timeout of the PublishKafkaRecord processor might not be long enough, depending on your network setup and the workload on the Kafka cluster you are connecting to.

The error message may look similar to:

2020-01-22 09:50:12,854 ERROR 
Failed to send StandardFlowFileRecord[uuid=03d8a3ab-e1a3-41a4-9fa1-07af176aeb56,
claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1579683456791-21, 
container=default, section=21], offset=844488, length=20],offset=0,
name=NLgsvfQuAi2aad67b4-b053-4e42-b9e4-a10153941229,size=20] to 
Kafka: org.apache.kafka.common.errors.
TimeoutException: Failed to update metadata after 5000 ms. 

Increase the values of Max Metadata Wait Time and Acknowledgement Wait Time in the processor configuration.
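
To check whether a log excerpt shows this particular timeout, and which timeout value was in effect, a short script can help. The following is a minimal sketch; the function name is hypothetical, and the pattern is based only on the sample message above, so other NiFi or Kafka versions may word the error differently.

```python
import re

# Pattern taken from the sample error message; the wording is an assumption
# for other NiFi or Kafka versions.
TIMEOUT_PATTERN = re.compile(
    r"TimeoutException:\s*Failed to update metadata after (\d+) ms"
)


def metadata_timeout_ms(log_text: str):
    """Return the timeout in ms reported by the error, or None if absent."""
    # Log output may be wrapped across lines, so collapse whitespace first.
    flattened = " ".join(log_text.split())
    match = TIMEOUT_PATTERN.search(flattened)
    return int(match.group(1)) if match else None
```

If the function returns a value equal to the configured wait time, raising that setting is the likely fix.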


Terminating a Flow Management cluster does not delete the cluster-specific NiFi and NiFi Registry repositories in Ranger

When a new Flow Management cluster is created, the setup process creates new repository entries in Ranger to allow cluster-specific Ranger policies. These cluster-specific Ranger repositories are not deleted when the cluster is terminated. This can cause issues when a cluster is terminated and another cluster with the same name is created afterwards: the new cluster reuses the existing Ranger repository, but the NiFi component UUIDs in the existing policies do not match the UUIDs of the new cluster.

Manually delete the cluster-specific Ranger repositories in Ranger after terminating a Flow Management cluster.


ReportLineageToAtlas reporting task is throwing errors on a new Flow Management cluster

Ranger policies that allow NiFi to publish metadata to Atlas are not created automatically, which prevents NiFi from writing to Atlas.

Attempting to write to Atlas may result in an error message similar to:

Error running task ReportLineageToAtlas[id=843ce571-0171-1000-ffff-ffffefdc49dd] due 
to java.lang.RuntimeException: Failed to check and create NiFi flow type definitions in 
Atlas due to org.apache.atlas.AtlasServiceException: Metadata service API 
org.apache.atlas.AtlasClientV2$API_V2@3ea5b832 failed with status 403 (Forbidden) 
Response Body ({"errorCode":"ATLAS-403-00-001","errorMessage":"nifi is not authorized 
to perform create entity-def nifi_output_port"})"

As an environmentAdmin, go to Ranger and add the nifi user to the following pre-existing Ranger policies:

Ranger repository name: cm_atlas
Policy name: all - entity-type, entity-classification, entity
Action: Add the nifi user to the allow condition entry where 'Create Entity', 'Update Entity', and similar permissions are allowed.

Ranger repository name: cm_atlas
Policy name: all - type-category, type
Action: Add the nifi user to the allow condition entry where 'Create Type', 'Update Type', and similar permissions are allowed.

Ranger repository name: cm_kafka
Policy name: ATLAS_HOOK
Action: Add the nifi user to the allow condition entry where the Publish permission is allowed.

After adding the nifi user to the specified policies, wait up to 5 minutes, then confirm that the pre-defined ReportLineageToAtlas task can successfully publish lineage to Atlas. You do not need to restart ReportLineageToAtlas to pick up this configuration change. To see the newly published information in Atlas, refresh your open Atlas browser windows.
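
To confirm that a failure is the authorization problem described here, rather than some other Atlas error, the JSON response body in the message can be inspected. The following is a minimal sketch; the function name is hypothetical, and the error code and message wording are taken only from the sample error above.

```python
import json
import re

# Message wording taken from the sample error; treat it as an assumption
# for other Atlas versions.
DENIED = re.compile(r"^(?P<user>\S+) is not authorized to perform (?P<action>.+)$")


def parse_atlas_denial(response_body: str):
    """Return (user, action) for an Atlas 403 response body, else None."""
    payload = json.loads(response_body)
    match = DENIED.match(payload.get("errorMessage", ""))
    if payload.get("errorCode") != "ATLAS-403-00-001" or match is None:
        return None
    return match.group("user"), match.group("action")
```

The returned action, for example a denied "create entity-def" operation, indicates which of the listed policies still lacks the nifi user.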

The FQDNs of the NiFi nodes in a Flow Management cluster are not registered with public DNS

Some NiFi use cases require inbound connectivity to NiFi from external systems using hostnames. Currently, the FQDNs of NiFi nodes cannot be resolved over the public internet.

Add a mapping of the FQDN and public IP address of each NiFi node to the local hosts file of your external system.
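
With several nodes, the hosts file lines can be generated rather than typed by hand. The following is a minimal sketch; the function name, the example hostname, and the example address are hypothetical, and the output uses the usual "address, whitespace, hostname" layout of a hosts file.

```python
import ipaddress


def hosts_entries(nodes: dict) -> str:
    """Build hosts-file lines from a mapping of node FQDN to public IP."""
    lines = []
    for fqdn, ip in sorted(nodes.items()):
        # Raises ValueError if the address is malformed.
        ipaddress.ip_address(ip)
        lines.append(f"{ip}\t{fqdn}")
    return "\n".join(lines)


# Hypothetical node name and documentation-range IP address.
print(hosts_entries({"nifi0.example.internal": "203.0.113.10"}))
```

Append the printed lines to the hosts file of the external system, and update them whenever a node's public IP address changes.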

Terminating a Streams Messaging cluster does not delete the cluster-specific Kafka repositories in Ranger

When a new Streams Messaging cluster is created, the setup process creates new repository entries in Ranger to allow cluster-specific Ranger policies. These cluster-specific Ranger repositories are not deleted when the cluster is terminated.

Manually delete the cluster-specific Ranger repositories in Ranger after terminating a Streams Messaging cluster.

Scaling Kafka brokers or NiFi nodes up or down is not possible

Data Hub does not allow users to resize Kafka broker or NiFi node groups.

There is no workaround for this issue.

NiFi Registry API endpoint is not displayed in the list of exposed cluster endpoints in Data Hub UI

For all services running within a Flow Management cluster, Knox is set up to proxy requests to the endpoints. While the NiFi Registry API is configured to be proxied by Knox, the endpoint URI is not exposed in the Data Hub cluster management UI.

If you need to access the NiFi Registry API through Knox, use the following endpoint format: https://<gateway-node-fqdn>/<clustername>/cdp-proxy-api/nifi-registry-app/nifi-registry-api/
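
The endpoint format above can be filled in programmatically when scripting against the API. The following is a minimal sketch; the function name and the example values are hypothetical, and only the path layout comes from the format shown above.

```python
def registry_api_url(gateway_fqdn: str, cluster_name: str) -> str:
    """Build the Knox-proxied NiFi Registry API base URL."""
    return (
        f"https://{gateway_fqdn}/{cluster_name}"
        "/cdp-proxy-api/nifi-registry-app/nifi-registry-api/"
    )


# Hypothetical gateway host and cluster name.
print(registry_api_url("gateway.example.com", "flow-cluster"))
```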
