ReadyFlow overview: HuggingFace to S3/ADLS

You can use the HuggingFace to S3/ADLS ReadyFlow to retrieve a HuggingFace dataset and write the Parquet data to a target S3 or ADLS destination.

This ReadyFlow retrieves a dataset from the HuggingFace API and writes the Parquet data to a target S3 or ADLS destination. By default, the flow retrieves the "wikitext" dataset, which is the default value of the Dataset Name parameter. Failed S3 or ADLS write operations are retried automatically to handle transient issues. To monitor failed write operations, define a KPI on the failure_WriteToS3/ADLS connection.
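
The following Python sketch illustrates the general pattern described above: list the Parquet files of a dataset, then write each one with automatic retries. It is not the ReadyFlow's implementation. It assumes the public HuggingFace dataset viewer endpoint (https://datasets-server.huggingface.co/parquet) and its "parquet_files" response field. The function names, the retry count, and the delay are hypothetical, and the write callable stands in for whatever S3 or ADLS client you use.

```python
import time
import urllib.parse

# Assumption: the public HuggingFace dataset viewer endpoint that lists
# the Parquet files of a dataset. The ReadyFlow may call a different URL.
PARQUET_API = "https://datasets-server.huggingface.co/parquet"


def parquet_listing_url(dataset_name="wikitext"):
    """Build the URL that lists the Parquet files of a dataset."""
    query = urllib.parse.urlencode({"dataset": dataset_name})
    return PARQUET_API + "?" + query


def parse_parquet_listing(payload):
    """Extract (config, split, filename, url) tuples from the decoded JSON.

    Assumes each entry under "parquet_files" carries these four fields.
    """
    return [
        (f["config"], f["split"], f["filename"], f["url"])
        for f in payload["parquet_files"]
    ]


def write_with_retry(write, data, max_attempts=3, delay_seconds=1.0,
                     sleep=time.sleep):
    """Call write(data), retrying on OSError to ride out transient issues.

    write is a hypothetical callable wrapping your S3 or ADLS client.
    The last failure is re-raised so the caller can count it, much as
    a KPI on a failure connection would.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return write(data)
        except OSError:
            if attempt == max_attempts:
                raise
            # Wait a little longer after each failed attempt.
            sleep(delay_seconds * attempt)
```

Passing the sleep function as a parameter keeps the retry helper easy to exercise without real delays. The retry limit and the linear delay are illustrative choices, not values taken from the ReadyFlow.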

This flow is not meant to run continuously. Run it once for each dataset that you want to retrieve.

HuggingFace to S3/ADLS ReadyFlow details
Source: HuggingFace Dataset
Source Format: Parquet
Destination: Cloudera Managed Amazon S3 or ADLS
Destination Format: Parquet