Machine Learning Discovery and Exploration has a few prerequisites.

  • For both Hive and Impala connections, the HADOOP_USERNAME and WORKLOAD_PASSWORD must be set. The HADOOP_USERNAME is set automatically by the environment. As a data scientist, you must set the WORKLOAD_PASSWORD. For more information, see Setting the workload password.
  • To manually create Impala or Hive data connections, you must have the related JDBC URL. You can obtain this from the option menu for the virtual warehouse to which you want to connect. These warehouses must be already created in the environment. For more information on how to create a data warehouse, see Adding a New Virtual Warehouse.
  • To manually create a Spark connection, you need the Data Lake S3 bucket. This can be retrieved from the Data Lake page. For Spark data connections, you must use ML Runtimes, and specifically the Spark Runtime Add-on must be enabled before starting a workload (job or session).
  • For Spark data connections, you must have permissions set correctly for external access to the S3 bucket.
  • Hive and Impala data connections also require ML Runtimes. Legacy engines may work, but are not supported.
  • There must be auto-discovered data connections, or your Administrator must create data connections to virtual warehouses or other data sources in your CDP environment.

Known Issue

  • Only Hive virtual warehouses with SSO disabled are supported.