Filesystems

You can use file systems in Flink to consume and persistently store data, both for application results and for fault tolerance and data recovery.

The file system used for a particular file is determined by its URI scheme. For example, file:///home/user/text.txt refers to a file in the local file system. Flink has built-in support for the local machine's file system, including any NFS or SAN drives mounted into it, and this file system can be used by default without additional configuration.
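For illustration, the following is a minimal sketch of reading that local file with the DataStream File Source (the file path comes from the example above; the job and source names are illustrative):

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.connector.file.src.FileSource;
    import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class LocalFileRead {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // The "file://" scheme selects Flink's built-in local file system.
            FileSource<String> source = FileSource
                    .forRecordStreamFormat(new TextLineInputFormat(), new Path("file:///home/user/text.txt"))
                    .build();

            env.fromSource(source, WatermarkStrategy.noWatermarks(), "local-file-source")
                    .print();

            env.execute("Read from local file system");
        }
    }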

For all schemes where Flink cannot find a directly supported file system, it falls back to Hadoop. All Hadoop file systems are automatically available when flink-runtime and the Hadoop libraries are on the classpath. This way, Flink seamlessly supports all Hadoop file systems that implement the org.apache.hadoop.fs.FileSystem interface, as well as all Hadoop-compatible file systems (HCFS).

There are also pluggable file systems that are not included by default. CSA contains the following pluggable file systems:
  • Amazon S3
  • Azure Data Lake Store Gen2
  • Azure Blob Storage
  • Google Cloud Storage

For more information about how to use the supported file systems, see the Apache Flink documentation.
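As an illustrative sketch only (the bucket name and output path are hypothetical), once the corresponding plugin is available, a pluggable file system is selected simply through its URI scheme, for example when building a FileSink:

    import org.apache.flink.api.common.serialization.SimpleStringEncoder;
    import org.apache.flink.connector.file.sink.FileSink;
    import org.apache.flink.core.fs.Path;

    // The "s3://" scheme selects the Amazon S3 pluggable file system;
    // other schemes such as "abfs://" or "gs://" work the same way.
    FileSink<String> sink = FileSink
            .forRowFormat(new Path("s3://my-bucket/flink-output"), new SimpleStringEncoder<String>("UTF-8"))
            .build();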

Using Ozone with Flink

You can use Ozone as a sink with Flink through its Hadoop file system implementation, but it has some limitations.

Limitations using Ozone File System (OFS) with DataStream connector
When using Ozone as a file system, you must use the OnCheckpointRollingPolicy configuration, which rolls part files on every checkpoint. This configuration is needed because, if part files span checkpoint intervals, the FileSink may call the truncate() method of the file system upon recovery from a failure to discard uncommitted data from the in-progress file. This method is not supported by OFS, and Flink throws an exception.
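The following is a minimal sketch of a FileSink writing to Ozone with the required rolling policy (the ofs:// path, checkpoint interval, and example records are illustrative):

    import org.apache.flink.api.common.serialization.SimpleStringEncoder;
    import org.apache.flink.connector.file.sink.FileSink;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.OnCheckpointRollingPolicy;

    public class OzoneFileSinkExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // Checkpointing must be enabled so that part files are rolled and committed.
            env.enableCheckpointing(60_000);

            FileSink<String> sink = FileSink
                    .forRowFormat(new Path("ofs://ozone-service/volume/bucket/flink-output"),
                            new SimpleStringEncoder<String>("UTF-8"))
                    // Roll part files on every checkpoint so truncate() is never needed on recovery.
                    .withRollingPolicy(OnCheckpointRollingPolicy.build())
                    .build();

            env.fromElements("a", "b", "c").sinkTo(sink);
            env.execute("Write to Ozone with OnCheckpointRollingPolicy");
        }
    }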
Limitations using OFS with Table API connector
Because of the limitation described for the DataStream connector, files must also be rolled on every checkpoint when using the Table API connector. If checkpointing is enabled and a bulk format is used, this happens automatically. In case of a row format, automatic compaction has to be turned on, as shown in the sketch below.
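A hedged sketch of a Table API filesystem sink on Ozone using a row format (CSV) with automatic compaction turned on (the table name, schema, path, and checkpoint interval are illustrative):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

    public class OzoneTableSinkExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // Checkpointing is required for the filesystem connector to commit files.
            env.enableCheckpointing(60_000);
            StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

            tableEnv.executeSql(
                    "CREATE TABLE ozone_sink (" +
                    "  id BIGINT," +
                    "  message STRING" +
                    ") WITH (" +
                    "  'connector' = 'filesystem'," +
                    "  'path' = 'ofs://ozone-service/volume/bucket/table-output'," +
                    "  'format' = 'csv'," +
                    // Row formats need automatic compaction so files are rolled on every checkpoint.
                    "  'auto-compaction' = 'true'" +
                    ")");

            tableEnv.executeSql("INSERT INTO ozone_sink VALUES (1, 'hello'), (2, 'world')");
        }
    }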