Setting up your CDP pattern infrastructure

Before starting, ensure that your central or departmental IT team has onboarded to CDP Public Cloud and registered an environment with Cloudera. The Business Intelligence at Scale pattern uses the Streams Messaging Manager (SMM), Cloudera DataFlow (CDF), Cloudera Data Engineering (CDE), and Cloudera Data Warehouse (CDW) services. You must set up these services as part of your infrastructure.

Most of the steps required to set up your CDP pattern infrastructure should be performed by a CDP administrator and require the EnvironmentAdmin CDP role. Because the goal of this pattern is to enable self-service DataFlow development, Data Engineering, and Data Warehousing, make sure that the IT team has created the necessary users and groups and has granted them the required permissions.

This pattern assumes that, when your IT team created and registered the environment with CDP, they also created a medium-duty Data Lake. The Data Lake is used when importing and managing streaming data in the Streams Messaging Data Hub cluster, as well as when running other CDP services.

Before you begin self-service development and infrastructure setup:
  • Ensure that you have an available CDP environment
  • Ensure that you have CDP login credentials
  • Ensure that you have a running Data Lake
  • Ensure that your CDP user is synchronized to the CDP Public Cloud environment
  • Review the AWS environments requirements checklist
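
Some of the checks above can be scripted. The following is a minimal sketch, not an official Cloudera tool. It assumes the CDP CLI (`cdp`) is installed and configured with your credentials, and that the JSON output uses the field names and status strings shown ("user", "environment", "datalakes", "AVAILABLE", "RUNNING"); verify these against your CLI version. The function names `run_cdp` and `preflight` are hypothetical helpers introduced for illustration.

```python
import json
import subprocess
from typing import Callable, Dict, List

# A runner takes CDP CLI arguments and returns the parsed JSON response.
Runner = Callable[[List[str]], dict]


def run_cdp(args: List[str]) -> dict:
    """Run `cdp <args>` and parse its JSON output.

    Raises subprocess.CalledProcessError if the command fails, for
    example when the CLI credentials are missing or invalid.
    """
    result = subprocess.run(
        ["cdp", *args], capture_output=True, text=True, check=True
    )
    return json.loads(result.stdout)


def preflight(env_name: str, run: Runner = run_cdp) -> Dict[str, bool]:
    """Return a pass/fail flag for each prerequisite checkable from the CLI."""
    checks: Dict[str, bool] = {}

    # Credentials: the call only succeeds with working CDP credentials.
    # The "user" key in the response is an assumption.
    checks["credentials_valid"] = "user" in run(["iam", "get-user"])

    # Environment: the "AVAILABLE" status string is an assumption.
    env = run(
        ["environments", "describe-environment", "--environment-name", env_name]
    )
    checks["environment_available"] = (
        env.get("environment", {}).get("status") == "AVAILABLE"
    )

    # Data Lake: at least one Data Lake in the environment must be running.
    # The "RUNNING" status string is an assumption.
    lakes = run(["datalake", "list-datalakes", "--environment-name", env_name])
    checks["datalake_running"] = any(
        lake.get("status") == "RUNNING" for lake in lakes.get("datalakes", [])
    )

    return checks
```

Calling `preflight("my-env")`, where `my-env` is a placeholder environment name, returns a dictionary of check names mapped to `True` or `False`. User synchronization and the AWS requirements checklist are not covered by this sketch and still need to be reviewed manually.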

For test and development environments, you can set up small to medium-sized clusters, and then scale up for production use.

After you register your environment with CDP, contact your Cloudera Account Representative to activate the entitlements for using Unified Analytics and Data Visualization in CDW.

The following diagram shows all the Data Services used to implement the Business Intelligence at Scale pattern:

Streaming data is ingested into CDP using SMM. The data is then uploaded to S3 or other cloud object stores in Avro format. Spark jobs in the CDE clusters transform the data into Parquet format and aggregate it into tables. These tables are then available for ad-hoc analytics in CDW. Tables and queries created in CDW are then used to generate reports and dashboards in Cloudera Data Visualization.