ReadyFlow overview: HuggingFace to S3/ADLS
You can use the HuggingFace to S3/ADLS ReadyFlow to retrieve a HuggingFace dataset and write the Parquet data to a target S3 or ADLS destination.
By default, the flow retrieves the "Salesforce/wikitext" dataset, which is the default value of the Dataset Name parameter. Failed S3 or ADLS write operations are retried automatically to handle transient issues. To monitor failed write operations, define a KPI on the failure_WriteToS3/ADLS connection.
This flow is not meant to run continuously. Run it once for each dataset that you want to retrieve.
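The automatic retry of failed writes can be pictured with a small, generic sketch. This is not the ReadyFlow's actual NiFi configuration: `write_with_retry`, `make_flaky_writer`, the attempt limit, and the exponential backoff delay are all hypothetical names and choices made for illustration, and `OSError` merely stands in for a transient storage error.

```python
import time
from typing import Callable


def write_with_retry(
    write: Callable[[], None],
    max_attempts: int = 3,
    base_delay: float = 1.0,
) -> int:
    """Call write() until it succeeds and return the number of attempts used.

    Hypothetical helper: OSError stands in for a transient storage error.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            write()
            return attempt
        except OSError:
            if attempt == max_attempts:
                raise  # retries exhausted; surface the failure to the caller
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")


def make_flaky_writer(failures: int) -> Callable[[], None]:
    """Build a stand-in writer that fails `failures` times, then succeeds."""
    remaining = {"count": failures}

    def write() -> None:
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise OSError("simulated transient write failure")

    return write


if __name__ == "__main__":
    # Assumed scenario: two transient failures before the write goes through.
    attempts = write_with_retry(make_flaky_writer(2), max_attempts=5, base_delay=0.0)
    print(f"write succeeded after {attempts} attempts")
```

In this sketch, a write that keeps failing eventually raises the error to the caller. That is loosely the situation a KPI on the failure connection is meant to make visible: failed writes accumulating instead of clearing after a retry.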
| HuggingFace to S3/ADLS ReadyFlow details | |
|---|---|
| Source | HuggingFace Dataset |
| Source Format | Parquet |
| Destination | Cloudera Managed Amazon S3 or ADLS |
| Destination Format | Parquet |
