ReadyFlow overview: HuggingFace to S3/ADLS

You can use the HuggingFace to S3/ADLS ReadyFlow to retrieve a HuggingFace dataset and write the Parquet data to a target S3 or ADLS destination.

The flow calls the HuggingFace API to retrieve the dataset and writes its Parquet data to the target S3 or ADLS location. By default it retrieves "Salesforce/wikitext", which is the default value of the Dataset Name parameter. Failed S3 or ADLS write operations are retried automatically to handle transient issues. To monitor failed write operations, define a KPI on the failure_WriteToS3/ADLS connection.

This flow is not meant to run continuously. Run it once for each dataset that you want to retrieve.
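
For readers who want to see the retrieve-then-write pattern outside of the flow canvas, the following is a minimal Python sketch. It is not the ReadyFlow's implementation. It assumes the public Hugging Face dataset viewer endpoint (https://datasets-server.huggingface.co/parquet) is used to list a dataset's Parquet files, and the function names, the bounded retry count, and the delay are hypothetical choices made for illustration only.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumption: the public Hugging Face dataset viewer endpoint that lists
# the Parquet files of a dataset. The ReadyFlow may call a different URL.
PARQUET_API = "https://datasets-server.huggingface.co/parquet"


def parquet_listing_url(dataset_name="Salesforce/wikitext"):
    """Build the URL that lists the Parquet files of a dataset."""
    query = urllib.parse.urlencode({"dataset": dataset_name})
    return PARQUET_API + "?" + query


def list_parquet_files(dataset_name="Salesforce/wikitext"):
    """Return the Parquet file entries of a dataset (needs network access)."""
    with urllib.request.urlopen(parquet_listing_url(dataset_name)) as resp:
        return json.load(resp)["parquet_files"]


def write_with_retry(write, key, data, attempts=3, delay_seconds=1.0):
    """Call write(key, data), retrying on failure.

    `write` is any callable that stores `data` under `key`, for example a
    thin wrapper around an S3 or ADLS client. The bounded attempt count is
    an assumption of this sketch, not documented ReadyFlow behavior.
    """
    for attempt in range(1, attempts + 1):
        try:
            return write(key, data)
        except Exception:
            if attempt == attempts:
                raise  # retries exhausted: surface the failure to the caller
            time.sleep(delay_seconds)
```

With boto3, for instance, the `write` callable could wrap `s3.put_object(Bucket=..., Key=key, Body=data)` for a bucket of your choice. Because the writer is passed in as a parameter, the retry logic stays independent of whether the destination is S3 or ADLS.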

HuggingFace to S3/ADLS ReadyFlow details
Source: HuggingFace Dataset
Source Format: Parquet
Destination: Cloudera Managed Amazon S3 or ADLS
Destination Format: Parquet