Known issues and limitations

Learn about the known issues and limitations in Cloudera DataFlow, the areas they affect, and the available workarounds.

Known issues

Enable DataFlow dialog offers incorrect options for Load Balancer subnet selection if Endpoint Gateway Subnets are defined in the CDP Environment

CDP supports optionally specifying Endpoint Gateway Subnets at the CDP Environment level. The known issue explained below is only present if this Endpoint Gateway Subnet feature is used.

When enabling DataFlow, the Endpoint Gateway Subnets defined for the CDP Environment should be offered as the options for Load Balancer Subnet selection. The actual behavior is that the Endpoint Gateway Subnets are ignored and the standard Network Subnets defined at the CDP Environment level are offered instead.

If specific Endpoint Gateway Subnets are specified at the CDP Environment level, enabling DataFlow uses the Network Subnets for Load Balancer placement, which may cause enablement to fail or attach the Load Balancer to the wrong subnet.

To enable DataFlow with the correct subnet selections, temporarily add the desired Endpoint Gateway Subnets to the environment's subnets:

  1. In the CDP Management Console, select your environment and add the desired Endpoint Gateway Subnets to the existing subnets. For more information, see Add subnets to an environment.

  2. Go back to the Enable DataFlow dialog, where you will now see the updated subnet list in both the Worker Node Subnets and Load Balancer Subnets fields. Explicitly select the subnets to use for Worker Node and Load Balancer placement and proceed with enablement. For more information, see Deploying a flow definition.

  3. After the enablement process has started, return to the CDP Management Console and remove the subnets that you temporarily added.
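
If you suspect that a previous enablement attempt attached the Load Balancer to the wrong subnet, a sketch along the following lines can help confirm the actual placement. This is an illustrative example only: it assumes an AWS environment and uses boto3, and the VPC ID and expected subnet IDs are placeholders for your environment.

```
# Illustrative sketch: list the subnets each load balancer in the environment
# VPC is attached to, to confirm whether the DataFlow Load Balancer landed in
# the intended Endpoint Gateway Subnets. IDs below are placeholders.
import boto3

VPC_ID = "vpc-0123456789abcdef0"                           # your CDP environment VPC
EXPECTED_SUBNETS = {"subnet-aaaa1111", "subnet-bbbb2222"}  # intended Endpoint Gateway Subnets

elbv2 = boto3.client("elbv2")

for lb in elbv2.describe_load_balancers()["LoadBalancers"]:
    if lb["VpcId"] != VPC_ID:
        continue
    attached = {az["SubnetId"] for az in lb["AvailabilityZones"]}
    status = "OK" if attached <= EXPECTED_SUBNETS else "unexpected subnets"
    print(f"{lb['LoadBalancerName']}: {sorted(attached)} -> {status}")
```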

IAM Policy Simulator preflight check fails with resource policy validation

With all cross-account policies in place, the IAM Policy Simulator preflight check still fails with the following error message:

IAM Resource Policy validation failed on AWS. CrossAccount role does not have permissions for these operations : : ssm:GetParameter, ssm:GetParameters, ssm:GetParameterHistory, ssm:GetParametersByPath

This happens because even if a cross-account role is allowed to perform a certain action through its IAM policies, an attached Service Control Policy (SCP) can override that permission by enforcing a Deny on the action. SCPs take precedence over IAM policies. They are applied either at the root of an organization or to individual accounts, so a permission can be blocked at any level above the account, either implicitly or explicitly (by including it in a Deny policy statement).

Because the IAM Policy Simulator SDK does not provide an option to include or exclude an organization's SCPs, the preflight check fails if an SCP denies an action, even though the IAM role itself has the necessary permissions.

This is a known issue in AWS.
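
For reference, the result the preflight check reports can be reproduced with the IAM Policy Simulator API. The following is a minimal sketch rather than the exact call DataFlow makes; the role ARN is a placeholder for your cross-account role.

```
# Illustrative sketch: simulate the cross-account role against the SSM actions
# the preflight check requires. Because the simulator also evaluates the
# organization's SCPs and offers no option to exclude them, an SCP Deny shows
# up in the decision even when the attached IAM policies allow the action.
import boto3

ROLE_ARN = "arn:aws:iam::111122223333:role/cdp-cross-account-role"  # placeholder
ACTIONS = [
    "ssm:GetParameter",
    "ssm:GetParameters",
    "ssm:GetParameterHistory",
    "ssm:GetParametersByPath",
]

iam = boto3.client("iam")
response = iam.simulate_principal_policy(PolicySourceArn=ROLE_ARN, ActionNames=ACTIONS)

for result in response["EvaluationResults"]:
    org = result.get("OrganizationsDecisionDetail", {})
    print(
        result["EvalActionName"],
        "->",
        result["EvalDecision"],
        "| allowed by SCPs:",
        org.get("AllowedByOrganizations", "n/a"),
    )
```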

Do not bypass this issue by selecting the Skip Validations option when enabling DataFlow, as this bypasses all preflight validation checks. Instead, submit a request to add the LIFTIE_DISABLE_IAM_PREFLIGHT_CHECK entitlement to your account, which ensures that only the IAM Policy preflight validation check is skipped.

Stale Prometheus rules cause false or missing capacity alerts

If you performed a DataFlow settings update and modified the maximum Kubernetes node capacity value between October 10, 2022 (release 2.2.0-h3-b2) and December 08, 2022 (release 2.3.0-b347), the Prometheus alert rules that raise active alerts when the cluster is nearing or has reached maximum capacity do not reflect the updated minimum and maximum thresholds. As a result, you may see an active alert indicating that the DataFlow cluster is nearing or has reached capacity when it has not, or you may see no active alert when one should be raised. This happens because the rules are stale and still reflect the previous maximum capacity value.

If an inaccurate alert is present or there is concern that an alert may not activate when it should, the mitigation is to perform another settings update, changing either the minimum or maximum capacity again to trigger a corrective update.
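
If you have access to the Prometheus instance that serves these alerts, you can check whether the loaded alerting rules still contain the previous thresholds before triggering another settings update. The sketch below is illustrative only; the Prometheus URL and the "capacity" name filter are assumptions about your monitoring setup.

```
# Illustrative sketch: print the capacity-related alerting rules currently
# loaded in Prometheus so their thresholds can be compared with the expected
# maximum node capacity. URL and rule-name filter are placeholders.
import requests

PROMETHEUS_URL = "http://localhost:9090"  # placeholder for your Prometheus endpoint

groups = requests.get(f"{PROMETHEUS_URL}/api/v1/rules", timeout=10).json()["data"]["groups"]

for group in groups:
    for rule in group["rules"]:
        if rule.get("type") == "alerting" and "capacity" in rule["name"].lower():
            print(rule["name"])
            print("  expr:", rule["query"])
```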

Role assignment processing

Role assignment changes for DataFlow workload applications are only retrieved when the user logs into the workload application. Note that a user can remain logged in for up to eight hours after their last interaction with the workload application.

If you assign or remove the DFFlowAdmin, DFFlowUser, or DFFlowDeveloper roles while the user is logged into the workload application, the user must log out and then log back into the workload application for the change to take effect. To do so, the user has to log out from either the Deployment Manager or the Deployment Wizard.

To log out from the Deployment Manager:

  1. Select a flow deployment from the Dashboard, then click Manage Deployment.
    • If you have sufficient rights to manage flow deployments, the Deployment Manager launches. Select your user name in the lower left corner and click Log Out.
    • If you do not have sufficient rights to manage flow deployments, you are forwarded to an error page. Click Logout.

To log out from the Deployment Wizard:

  1. From the Dashboard, select Catalog.
  2. Select an available flow definition and click Deploy.
  3. From the Deployment Wizard, select an available environment and click Continue.
    • If you have sufficient rights to deploy dataflows in that environment, the Deployment Wizard launches. Select your user name in the lower left corner and click Log Out.
    • If you do not have sufficient rights to deploy dataflows, you are forwarded to an error page. Click Logout.

Logging out of and back into the DataFlow control plane has no effect on workload application role assignments; to pick up role changes, the user must log out of and back into the workload application itself.

Limitations

Reporting Tasks are not supported as part of a flow deployment
When you download a flow definition from NiFi, existing reporting tasks are not exported to the resulting JSON file. As a result, deployments created in DataFlow do not have any pre-configured Reporting Tasks.
If you have been assigned the FlowAdmin role, you can manually create Reporting Tasks in the NiFi canvas after a deployment is completed.
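
To confirm which Reporting Tasks exist in a deployment's NiFi instance, you can query the NiFi REST API. The following is a minimal sketch; the host and the bearer-token authentication are placeholders, and the exact way you authenticate to NiFi depends on your setup.

```
# Illustrative sketch: list the Reporting Tasks configured in a deployment's
# NiFi instance, for example to verify that none were carried over from the
# exported flow definition. Host and token are placeholders.
import requests

NIFI_API = "https://<deployment-nifi-host>/nifi-api"  # placeholder
TOKEN = "<access-token>"                              # placeholder

response = requests.get(
    f"{NIFI_API}/flow/reporting-tasks",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=10,
)
response.raise_for_status()

for task in response.json().get("reportingTasks", []):
    component = task["component"]
    print(component["name"], "-", component["type"])
```
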
Data Lineage information is not automatically reported to Atlas in the Data Catalog
Flow deployments created by DataFlow do not come with a pre-configured ReportLineageToAtlas Reporting Task.
If you have been assigned the FlowAdmin role, you can manually create and configure the ReportLineageToAtlas Reporting Task in the NiFi canvas after a deployment is completed.
PowerUsers are not able to create flow deployments without additional DataFlow roles
While the PowerUser role allows you to view flow deployments in the Dashboard, view flow definitions in the Catalog, and initiate flow deployments, the Deployment Wizard fails after you select an environment for which you do not have the DFFlowAdmin resource role assigned.
Assign the DFFlowAdmin role to the user for the environment to which they want to deploy flow definitions.
CDF reports "Bad Health" during Data Lake upgrade
CDF monitors the state of the associated CDP environment to decide which actions CDF users can take. CDF detects Data Lake upgrades of the associated CDP environment and puts the CDF service into Bad Health for the duration of the upgrade, blocking new deployments.
To work around this issue, wait for the Data Lake upgrade to complete before creating new flow deployments.
Deployments and CDF Services are no longer visible in the DataFlow Dashboard or Environments page when the associated CDP Environment has been deleted
If the associated CDP Environment is deleted while a DataFlow service is enabled, the service becomes orphaned. Orphaned resources are no longer visible to users without the PowerUser role.
To work around this issue, open the Environments or Dashboard page with a user who has been assigned the PowerUser role. PowerUsers are able to view orphaned deployments and DataFlow services.
Non-transparent proxies are not supported on Azure
There is no workaround for this issue.
Disk-related metrics have been removed from deployments and DataFlow service details

In previous versions of CDF, deployments and enabled DataFlow services showed disk capacity and disk usage metrics as part of their system metrics. You were also able to define KPIs and alerts on these metrics. Due to issues with the underlying metrics collection framework, the following metrics have been removed starting with DFX-1.1.0-h2-b1:

  • Disk Capacity (DF Service Metric)

  • Disk Capacity (Deployment System Metric)

  • Disk Usage (Deployment System Metric)