Connecting to S3 compatible accounts for Cloudera Data Engineering jobs or sessions

Connect securely to an S3-compatible account in Cloudera Data Engineering jobs or sessions.

  1. Create the S3-compatible accounts.
    After you create or edit an S3 external account, Cloudera Data Engineering automatically configures that account for the jobs or sessions that belong to the corresponding workload user. However, the change can take approximately 10 minutes to take effect in Cloudera Data Engineering jobs or sessions.
  2. To connect to S3 external accounts in Cloudera Data Engineering, run the Spark job as the workload user configured for the S3-compatible account. The Spark job automatically uses the account credentials during Hadoop S3 read and write operations. You do not need any additional configuration to specify credentials or endpoints. For more information about creating jobs, see Creating jobs in Cloudera Data Engineering.
    Sample job:
    from pyspark.sql import SparkSession
    
    def main():
        spark = SparkSession.builder \
            .appName("Spark S3A") \
            .getOrCreate()
        try:
            # Read a CSV file from S3 using the s3a:// prefix
            df = spark.read.csv("s3a://your-bucket-name/input/data.csv", header=True, inferSchema=True)
            df.show()
        except Exception as e:
            print(f"Error executing Spark job: {e}")
            # Re-raise so the job run is reported as failed instead of succeeding silently
            raise
        finally:
            spark.stop()
    
    if __name__ == "__main__":
        main()
  3. Import self-signed or untrusted certificates for S3-compatible accounts by setting the CA Certificate Type to Datalake in the Cloudera Management Console. For information about how to import certificates, see Updating TLS certificates.
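
Step 2 mentions write operations as well as reads, while the sample job only reads. The following is a minimal sketch of a read-then-write flow, assuming the workload user's S3 external account is already configured as described above. The helper names `s3a_uri` and `copy_csv_to_parquet`, the bucket name, and the object paths are hypothetical placeholders, not part of any Cloudera API. No credentials or endpoints are set in the code.

```python
def s3a_uri(bucket: str, key: str) -> str:
    """Build an s3a:// URI from a bucket name and an object key (hypothetical helper)."""
    if not bucket or "/" in bucket:
        raise ValueError(f"invalid bucket name: {bucket!r}")
    # Strip leading slashes so the URI never contains "//" after the bucket name
    return f"s3a://{bucket}/{key.lstrip('/')}"


def copy_csv_to_parquet(spark, source_uri: str, target_uri: str) -> None:
    """Read a CSV file through S3A and write it back as Parquet.

    `spark` is an existing SparkSession, for example the one created in a job
    with SparkSession.builder.getOrCreate() or the one provided by a session.
    """
    df = spark.read.csv(source_uri, header=True, inferSchema=True)
    # mode("overwrite") replaces any existing output at the target path
    df.write.mode("overwrite").parquet(target_uri)
```

In a job or session where `spark` is available, this could be called as `copy_csv_to_parquet(spark, s3a_uri("your-bucket-name", "input/data.csv"), s3a_uri("your-bucket-name", "output/data_parquet"))`, replacing the placeholder bucket and paths with your own.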