CDP Public Cloud overview

Learn what CDP is and what advantages it offers.

CDP is one platform with two form factors: CDP Public Cloud and CDP Private Cloud.

CDP is an easy, fast, and secure enterprise analytics and management platform with the following capabilities:
  • Enables ingesting, managing, and delivering any analytics workload from Edge to AI.
  • Provides enterprise-grade security and governance.
  • Provides self-service access to integrated, multi-function analytics on centrally managed and secured business data.
  • Provides a consistent experience on public cloud, multi-cloud, and private cloud deployments.

CDP powers data-driven decision making by easily, quickly, and safely connecting and securing the entire data lifecycle, in which data moves through five distinct phases.

CDP gives you complete visibility into all your data with no blind spots. The CDP control plane allows you to manage data, infrastructure, analytics, and analytic workloads across hybrid and multi-cloud environments, while Cloudera Shared Data Experience (SDX) provides consistent security and governance across the entire data lifecycle. With CDP, you can manage and secure the data lifecycle in any cloud and data center.

CDP enables you to:
  • Automatically spin up workloads when needed and suspend their operation when complete, thereby controlling cloud costs
  • Optimize workloads based on analytics and machine learning
  • View data lineage across any cloud and transient clusters
  • Use a single pane of glass across hybrid and multi-clouds
  • Scale to petabytes of data and thousands of diverse users
  • Centrally control customer and operational data across multi-cloud and hybrid environments

CDP Public Cloud

Addressing real-world business problems generally requires multiple analytic functions working together on the same data. For example, autonomous vehicles require both real-time data streaming and machine learning algorithms. CDP addresses this by offering multi-function data management and analytics that solve an enterprise’s most pressing data and analytic challenges in a streamlined fashion.

Hybrid and multi-cloud: CDP gives enterprises the flexibility to operate with equivalent functionality on and off premises. Support for all major cloud providers helps you avoid vendor lock-in and allows you to take control over your enterprise’s data and future.

Secure and compliant: CDP meets the strict data privacy, governance, data migration, and metadata management demands of large enterprises across all their environments.

Cloudera Data Platform (CDP) is a secure and governed cloud service platform that offers a broad set of enterprise data cloud services with the key data analytics and artificial intelligence functionality that enterprises require. CDP Public Cloud is a cloud form factor of CDP.

Use cases

CDP Public Cloud services address multiple use cases, such as registering existing CDH and HDP clusters, or spinning up Data Hub clusters and analyzing data in a cloud object store.

CDP Public Cloud services address the following use cases:

  • Register your existing CDH and HDP clusters in order to burst or migrate a workload to your public cloud environment by replicating the data and creating a Data Hub cluster to host the workload.

  • Spin up Data Hub clusters and then process and analyze your data in a cloud object store by using applications such as Spark, Hive LLAP, Hue, and Impala.

  • Import and deploy your data flow definitions efficiently, securely, and at scale.

CDP Public Cloud services

CDP Public Cloud consists of a number of cloud services designed to address specific enterprise data cloud use cases.

These include Data Hub powered by Cloudera Runtime, data services (Data Warehouse, Machine Learning, Data Engineering, DataFlow, and Operational Database), the administrative layer (Management Console), and SDX services (Data Lake, Data Catalog, Replication Manager, and Workload Manager).

Administrative layer

Management Console is a general service used by CDP administrators to manage, monitor, and orchestrate all of the CDP services from a single pane of glass across all environments. If you have deployments in your data center as well as in multiple public clouds, you can create, monitor, provision, and destroy services for all of them in one place.

Workload clusters

Data Hub is a service for launching and managing workload clusters powered by Cloudera Runtime (Cloudera’s new unified open source distribution including the best of CDH and HDP). It includes a set of cloud-optimized built-in templates for common workload types as well as a set of options allowing for extensive customization based on your enterprise’s needs.

Data Hub provides complete workload isolation and full elasticity so that every workload, application, or department can have its own cluster with a different version of the software, a different configuration, and different infrastructure. This enables a more agile development process.

Since Data Hub clusters are easy to launch and their lifecycle can be automated, you can create them on demand and, when you no longer need them, return the resources to the cloud.
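The on-demand lifecycle described above can be sketched as a small Python pattern. This is only an illustration, not CDP code: `on_demand_cluster`, `fake_provision`, and `fake_terminate` are hypothetical names, and the stand-in callables only record events instead of calling any cloud service. In a real automation you would replace them with whatever provisioning and termination calls your tooling provides.

```python
from contextlib import contextmanager


@contextmanager
def on_demand_cluster(name, provision, terminate):
    """Provision a cluster, hand it to the caller, and always release it afterwards."""
    cluster = provision(name)
    try:
        yield cluster
    finally:
        # Runs even if the workload raises, so resources are always returned.
        terminate(cluster)


# Hypothetical stand-ins that only record what happened (no cloud calls are made).
events = []


def fake_provision(name):
    events.append(("provision", name))
    return {"name": name}


def fake_terminate(cluster):
    events.append(("terminate", cluster["name"]))


with on_demand_cluster("nightly-etl", fake_provision, fake_terminate) as cluster:
    events.append(("run", cluster["name"]))

print(events)
```

Wrapping the workload in a context manager ties termination to the end of the job, so a failed run does not leave a cluster consuming cloud resources.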

Data services

Data Engineering is a serverless service that allows you to submit Spark jobs to an auto-scaling cluster.

DataFlow is a service that enables you to import and deploy your data flow definitions efficiently, securely, and at scale. It is a cloud-native end-to-end flow management service that gives you a flow-centric experience in contrast to the traditional cluster-centric approach. This service reduces your operational and cluster management overhead, and provides a multi-tenant portal.

Data Warehouse is a service for creating and managing self-service data warehouses for teams of data analysts. This service makes it easy for an enterprise to provision a new data warehouse and share a subset of the data with a specific team or department. The service is ephemeral, allowing you to quickly create data warehouses and terminate them once the task at hand is done.

Machine Learning is a service for creating and managing self-service Machine Learning workspaces. This enables teams of data scientists to develop, test, train, and ultimately deploy machine learning models for building predictive applications all on the data under management within the enterprise data cloud.

Operational Database is a service for self-service creation of an operational database. Operational Database is a scale-out, autonomous database powered by Apache HBase and Apache Phoenix. You can use it for your low-latency and high-throughput use cases with the same storage and access layers that you are familiar with using in CDH and HDP.

Security and governance

Shared Data Experience (SDX) is a suite of technologies that make it possible for enterprises to pull all their data into one place to be able to share it with many different teams and services in a secure and governed manner. There are four discrete services within SDX technologies: Data Lake, Data Catalog, Replication Manager, and Workload Manager.

Data Lake is a set of functionality for creating safe, secure, and governed data lakes, which provide a protective ring around the data wherever it is stored, be that in cloud object storage or HDFS. Data Lake functionality is subsumed by the Management Console service and related Cloudera Runtime functionality (Ranger, Atlas, Hive Metastore).

Data Catalog is a service for searching, organizing, securing, and governing data within the enterprise data cloud. Data Catalog is used by data stewards to browse, search, and tag the content of a data lake, create and manage authorization policies (by file, table, column, row, and so on), identify what data a user has accessed, and access the lineage of a particular data set.

Replication Manager is a service for copying, migrating, snapshotting, and restoring data between environments within the enterprise data cloud. This service is used by administrators and data stewards to move, copy, back up, replicate, and restore data in or between data lakes. This can be done for backup, disaster recovery, or migration purposes, or to facilitate dev/test in another virtual environment.

Workload Manager is a service for analyzing and optimizing workloads within the enterprise data cloud. This service is used by database and workload administrators to troubleshoot, analyze, and optimize workloads in order to improve performance and/or cost.

Interfaces

There are three basic ways to access and use CDP Public Cloud: web interface, CLI client, and SDK.

Web interface

The CDP Public Cloud web interface provides a web-based, graphical user interface. As an admin user, you can use the web interface to register environments, manage users, and provision CDP service resources for end users. As an end user, you can use the web console to access CDP service web interfaces to perform data engineering or data analytics tasks.

CLI

If you prefer to work in a terminal window, you can download and configure the CDP client that gives you access to the CDP CLI tool. The CDP CLI allows you to perform the same actions as can be performed from the web console. Furthermore, it allows you to automate routine tasks such as cluster creation.
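As a rough sketch of automating a routine task, the following Python helper wraps the CDP CLI. It assumes that the `cdp` executable is installed and configured, that a `cdp datahub list-clusters` subcommand is available, and that the CLI prints JSON. Verify all three against the CLI's built-in help before relying on them. The helper names `build_cdp_command` and `run_cdp` are hypothetical and not part of any Cloudera library.

```python
import json
import shutil
import subprocess


def build_cdp_command(service, operation, **options):
    """Build an argv list such as ['cdp', 'datahub', 'list-clusters']."""
    argv = ["cdp", service, operation]
    for name, value in options.items():
        # The keyword argument cluster_name becomes the flag --cluster-name.
        argv += ["--" + name.replace("_", "-"), str(value)]
    return argv


def run_cdp(argv):
    """Run the command and parse its JSON output.

    Returns None when the executable is not on PATH, so the sketch can be
    tried on a machine without the CLI installed.
    """
    if shutil.which(argv[0]) is None:
        return None
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


# Only builds and prints the command line; nothing is executed here.
print(" ".join(build_cdp_command("datahub", "list-clusters")))
```

Calling `run_cdp(build_cdp_command("datahub", "list-clusters"))` from a scheduled script would then return the parsed response, or `None` if the CLI is not installed.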

SDK

You can use the CDP SDK for Java to integrate CDP services with your applications. Use the CDP SDK to connect to CDP services, create and manage clusters, and run jobs from your Java application or other data integration tools that you may use in your organization.

Getting started

Steps for getting started with CDP Public Cloud.

CDP onboarding

Regardless of your use case, your first step in CDP should be to synchronize your identity provider with CDP so that your users can access CDP and are authorized to access specific resources within it. See Getting started as an admin.