CDP Public Cloud: April 2020 Release Summary

Cloudera is pleased to announce the latest updates to CDP Public Cloud.

Highlights

All CDP services are now generally available on Microsoft Azure, including SDX, Data Hub, Cloudera Data Warehouse (both Hive and Impala), and Cloudera Machine Learning. SDX and Data Hub are available in most Azure regions, however CDW and CML are only available on Azure regions that provide AKS.
This further includes integration with the Microsoft Azure Marketplace in order to make it easy for customers to pay for CDP through their Azure credits.
Flow Management for Data Hub is now available as a new cluster definition to make it easy to create NiFi clusters on AWS or Azure.
Cloudera Machine Learning now includes expanded production machine learning capabilities (aka MLOps) to support model monitoring, cataloging, access controls, and lineage.

New Features

Cloudera Data Warehouse (CDW)

Faster Time to Productivity and Improved Workload Health
- Higher user productivity with faster Virtual Warehouse provisioning as images get preloaded on compute and shared service nodes prior to run time, and reducing the image size of the compute node from 11GB to 2GB.
- Streamlined user experience with single sign-on across more components
- Improved workload health with the ability to monitor workload operations even deeper, and receive important alerts, in Prometheus and in the CDP Web UI, as well as greater resiliency in SQL engines
Improved Business Insights with More Data Types
- Take advantage of analytics on even more new data types for deeper business insight including Kudu Date and Varchar, as well as zstd compressed text files
More Business Insights and Improved Exploration Experience with Faster Analytics
- Improved SQL Engine performance for faster response to dashboard, report and ad-hoc queries with improvements to scratch capacity management, the cache eviction algorithm, and improved read performance for ORC tables

Cloudera Machine Learning (CML)

Production Machine Learning: NEW ML lifecycle focused functionality enabling ML Engineers and Data Scientists to cut time to production for ML models from weeks to minutes and scale ML use cases without compromising enterprise security, maintainability, and governance standards.
- MLOps now GA - Monitoring models’ functional and business performance requires specialized tooling and CML now includes native functionality to enable the storage and access of custom and arbriatry model metrics. Included as well is the ability to track individual predictions to ground truth ensuring models are performing optimally and compliantly.
- SDX governance capabilities expanded to support CML now GA - This includes model cataloging, full lifecycle lineage, and custom metadata in Apache Atlas to help customers track, manage, and understand large numbers of ML models deployed across their enterprise
- Increased Model security - Model REST endpoints now have additional security features that allow user-level access control to prevent unauthorized users from accessing the end points. This enables models to be served in a CML production environment without compromising security.
Data science usability improvements
- Data scientists can now manage groups of sessions within a project in bulk, making it easier to clean up workspaces and control cloud costs for expired machine learning projects.
- Also, data scientist users can now pass in custom CLI argv for Jobs — no need to rewrite or configure when copying workloads
Administrative & security Improvements
- The platform is now W3C WCAG 2-AA Compliant for CDSW/CML Workspace Applications.
- Admins can now enforce users to enter information when starting sessions for audit purposes

Cloudera DataHub

Flow Management for Data Hub (NiFi) now GA
- A new cluster definition to spin up Apache NiFi clusters on public cloud
- Light Duty configuration for PoCs / Heavy Duty configuration for production workloads
- Based on latest Apache NiFi 1.11.4
- Integrated with SDX for authentication, authorization and data governance/lineage
Streams Messaging for Data Hub (Kafka) [Tech Preview]
- Increased connection stability for external clients by automatically updating DNS records for Kafka brokers to reflect public IP changes after a cluster has been stopped or a failed broker has been replaced
Data Engineering for Data Hub
- Data Hub now includes a new Data Engineering HA cluster definition (tech preview) that deploys all critical services (YARN, HDFS, Hive) in an HA fashion. Non-critical services (such as like Zeppelin and Livy) are not yet HA in this release
- Data Engineering cluster administrators can now leverage the visual YARN queue manager via CM to control YARN resource pools
Real-Time Data Mart for Data Hub (Kudu+Impala)
- Take advantage of analytics on even more new data types for deeper business insight including Kudu Date and Varchar

Workload Manager

Search by Job/Query ID
- Ability to search for or lookup a specific job or query run by YARN Application ID, Hive or Impala Query ID or MR Job ID
Access Control - Manage & control user access for clusters
- Workload Manager provides you with the ability to restrict access to workload views, jobs, and queries by assigning user roles. The Cluster Admin, Cluster User, and Workload User roles provide varying levels of access, which prevents users from accessing information that they do not explicitly need.
Resource Consumption Widget in Impala
- Showcases Average CPU Core Hours and average memory usage as a function of time on the Impala summary page

SDX and Data Catalog

Data Profilers - Provide not only insight into data makeup (cardinality, nullity, min/max/mean and more, both in textual as well as graphical representation) but also sensitivity with discovered classifications ensuring data access and masking rules are consistently applied, even for previously unknown data sets. Data Profilers also collect statistics about access and usage so data owners can gauge the impact of their datasets and data users can infer who are the domain experts around data sets.
Data Flow Lineage - Offers a complete overview of data ingest pipelines and how they are integrated with tables and ML models. Data flows built using the Flow Management cluster definitions in Data Hubhave lineage automatically captured and presented for NiFi flows and processors.
CML Model Lineage - With the inclusion of model and training lineage, CDP provides ever more detailed and complete lineage across the data lifecycle. This new addition gives Data Scientist/Architects the insight in the impact of changes to the upstream ingest and ETL pipelines on their work and enables them to evaluate whether ML workloads are affected by the changes.
Keytab Externalization [Tech Preview] - Enhance gateway services and reduce implementation and management overheads by retrieving CDP Kerberos keytabs from the Control Plane.

Management Console

System Auditing - Reduce security risks and prevent unexpected infrastructure costs by knowing exactly what’s happening in CDP through full auditing of user role changes, group changes as well as login activity.
Enhanced Resource Tagging - Custom tags can now be applied to SDX clusters and FreeIPA nodes during environment creation.

Full release documentation including release notes for each service is available here (Service Name → Release Notes → What’s New).