CDP Public Cloud: May 2021 Release Summary

RAPIDS ML Runtimes are now available for Open GPU Data Science

RAPIDS Runtimes ship a suite of libraries from NVIDIA that bring the power of accelerated GPUs to standard Data Science operations — be it exploratory data analysis, feature engineering, or model training. The RAPIDS libraries are designed as drop-in replacements for common Python data science libraries like cuDF (pandas), cuPy (numpy), cuML (sklearn) and Dask-CUDA (dask) — enabling GPU acceleration for data science workloads of 5X+ without significant code changes. By leveraging the parallel compute capacity of GPUs the time for complicated data engineering and data science tasks can be dramatically reduced, accelerating the timeframes for Data Scientists to take ideas from concept to production. For more information about RAPIDS see rapids.ai.

Data Scientists now can use the RAPIDS Runtimes that enable end-to-end data science and analytics pipelines entirely on GPUs.

To learn more, visit the documentation about the RAPIDS ML Runtimes in CML.

Admins can now update the configuration of Cloudera Machine Learning Workspaces on Azure

Previously, the settings on a Cloudera Machine Learning (CML) Workspace created on Microsoft Azure could not be adjusted after the ML Workspace was provisioned. This meant that, for example, the autoscaling range for Azure instances could not be changed in order to provide additional compute to the Users on the ML Workspace.

When viewing the details of a given ML Workspace from the CDP Management Console, several settings are now updateable. These include the Allowed Load Balancer Source Ranges, API Server Authorized IP Ranges, and Autoscale Range on the Azure instances. This enables Admins to not only enable the ML Workspace to continue to work even after networking changes, but also either expand or restrict the number of instances that could be made available to handle the various workloads on the ML Workspace.

Cloudera Machine Learning now conducts Validation Checks while provisioning a new Workspace

While Cloudera has documented the prerequisites necessary to provision a new Cloudera Machine Learning (CML) Workspace in the Public Cloud, steps can be missed which lead to a provisioning failure. Sometimes, these failures do not appear until minutes after a provisioning action has begun, which can lead to frustration in getting CML set up.

After an Admin provides the necessary settings and clicks the “Provision Workspace” button, CML will immediately conduct a Validation Check of the provided inputs. Any issues found will be highlighted on the “Provision Workspace” screen, enabling Admins to correct any issues before attempting to provision the ML Workspace again.

Applied ML Prototypes in Cloudera Machine Learning can now pull from repositories stored in Azure Repos

Applied ML Prototypes (AMPs) in Cloudera Machine Learning (CML) enable Users to immediately get started with powerful end-to-end examples of Projects that include all the dependencies, algorithms, and user applications required to directly apply use cases to their own needs. While CML ships with access to a library of AMPs created and maintained by Cloudera’s own Fast Forward Labs team, Admins can now set up customized AMPs that are stored in Azure Repos.

When defining the AMP Catalog Sources, Admins have been able to define sources through a Git Repository URL or Catalog File URL, pointing at repositories internal to the organization. Admins can now point these sources to repositories stored in Azure Repos, in addition to other repositories such as GitHub.

To learn more, visit the documentation about AMP Catalogs.

ML Runtimes in Cloudera Machine Learning now support Add Ons such as Spark

ML Runtimes are designed to be the successor to Engines in Cloudera Machine Learning (CML), moving away from the monolithic architecture of Engines and enabling finer grained control over exactly what is necessary by defining the Editor, Kernel, Edition, and Version needed for the given workload. One core piece previously missing from ML Runtimes was the ability to define Addons that go beyond these settings, in particular the ability to leverage the Spark execution engine through an ML Runtime.

Admins can now define Addons for ML Runtimes, including Spark and Hadoop CLI. In doing so, Users can leverage these Addons when running workloads such as Sessions, thereby enabling such functionality as Spark-on-K8s to be run through the ML Runtime.

To learn more, visit the documentation about Runtime Addons in CML.

Cloudera Machine Learning now allows Admins to Synchronize Users

COD UI has been enhanced to provide the client URL for Python (used in conjunction with the phoenixDB python library)

Previously, Admins could assign roles such as “MLUser” at the Environment level, enabling the Users to access the ML Workspaces under that Environment. However, those users would not show up inside the ML Workspace until they log in for the first time, preventing other Users from taking actions such as proactively adding them to Projects.

From the Users page, Admins can now sync the list of Users between the CDP Management Console and the given ML Workspace. After doing so, any users who have been assigned access to the ML Workspace through the “MLAdmin”, “MLUser”, or “MLBusinessUser” role would appear appropriately within the ML Workspace. This enables other Users to manage any new Users who have not logged in yet as needed, such as adding them to necessary Teams, Projects, etc.

To learn more, visit the documentation about Synchronizing Users in CML.

Cloudera Machine Learning now provides a tailored user experience for Business Users

Business Users, who are meant to use and consume information from Applications deployed in Cloudera Machine Learning (CML), can now be enabled with a streamlined user experience that not only optimizes their access to the Applications, but also unintrousively limits their ability to access workloads that they do not have access to in CML.

Admins can now define Business Users by applying the “MLBusinessUser” role at the Environment level, enabling the Business Users to access the ML Workspaces under that Environment. When viewing a given ML Workspace, Business Users will only see the Applications that have been assigned to them through Projects

To learn more, visit the documentation about Business Users in CML.

CDP Operational Database is even easier to use

It also provides better failure detection when downloading HBase client configuration files using the provided CURL command. Instead of failing silently, the command will now emit error messages upon authentication failure.

CDP Operational Database adds infrastructure tagging support

CDP Operational Database (COD) adds the ability to automatically tag underlying compute instances. This enables customers to leverage negotiated pricing with AWS & Azure where such tags are required. To use this functionality, create a COD database using the CDP CLI.

CDP Operational Database enhances auto-scaling capabilities

CDP Operational Database (COD) enhances autoscaling to include considerations in addition to RPC latency. These result in

  • Optimal performance based on data volume in the database
  • Improved database availability when a replicated database faces an outage
  • Improved data management during auto-scale down of the database

Apache Hue now supports Relational SQL queries with CDP Operational Database

Apache Hue now supports querying Apache Phoenix greatly simplifying development for customers. They can now use Hue to test their queries to ensure correctness before, during and after building their applications as well as run operational analytics on live data. (See blog.)

Apache Hue can be accessed directly from the CDP Operational Database screen and provides a rich set of features including query history, data exploration, etc.

CDP Operational Database supports complex distributed transactions

CDP Operational Database (COD) now adds support for complex distributed transactions that can support TPC-C benchmarks out of the box when using it as a scale-out relational database i.e., Apache Phoenix. (See blog.) This enables a significant expansion of capabilities for the database enabling customers struggling to manage sharded relational databases to easily use COD as an alternative.

There is no configuration required to use this capability. Simply use the “autocommit(false)” and “commit()” commands in your application code / SQL and any SQL commands within these two commands will be executed atomically just as they would with traditional relational databases.

Data Engineering HA template for Data Hub on AWS delivers end-to-end resiliency

The Data Engineering HA template has graduated from Preview status to GA on AWS today, with the release of Cloudera Runtime 7.2.9. It now supports fault-tolerant access to all cluster web UIs and end-to-end reliability for more workload types.

The addition of a load balanced, highly available Apache Knox gateway service configured for transparent failover provides this additional resiliency for end users accessing web interfaces of the cluster and for scheduled or ad hoc workloads. Specifically, Hive workloads submitted to HiveServer2 via the Knox gateway and workloads that access WebHDFS through Knox can seamlessly tolerate instance failures of master nodes running the Knox gateway. This capability is thanks to the recent addition of an external load balancer to the cluster, and adds to the highly available HDFS, YARN and HiveServer 2 services offered in earlier versions.

Future releases of Data Engineering HA will support highly available Oozie and Cloudera Manager services, ensuring that nearly all workload types can seamlessly tolerate failures of any once instance in the cluster.

To learn more, visit the documentation about Data Engineering HA.