File system for content repository

If your function is going to process very large files, the data may not fit in memory, which would cause errors. In that case, you may want to attach a File System to your Lambda and tell Cloudera Data Flow Functions to use that File System as the Content Repository.

Configuration

To attach a File System to your Lambda, you must connect the Lambda to the VPC where the File System is provisioned.

By default, however, Lambda runs your functions in a secure VPC with access to AWS services and the internet. Lambda owns this VPC, which is not connected to your account's default VPC. Once you connect a function to a VPC in your account, the function cannot access the internet unless your VPC provides that access. Internet access is required so that Cloudera Data Flow Functions can query the Cloudera Data Flow Catalog and retrieve the flow definition to be executed.

In such a situation, a recommended option is to create a VPC with a public subnet and multiple private subnets (to increase resiliency across multiple availability zones). You can then create an Internet Gateway and NAT Gateway to give your Lambda access to the internet. For more information, see the AWS Knowledge Center.

Concurrent access

Depending on your Lambda’s configuration and the throughput of the events triggering the function, multiple instances of the Lambda may run concurrently, and all of these instances share the same File System. For more information, see this AWS Compute Blog article.

To ensure that each instance of the function has its own content repository, set the CONTENT_REPO environment variable when configuring your Lambda, and give it the value of the local mount path of the attached File System.
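As a rough illustration of this setting, the Python sketch below builds the arguments one might pass to boto3's `update_function_configuration` call for a Lambda with an Amazon EFS access point attached. The helper name `build_lambda_config`, the function name, the access point ARN, and the mount path are all placeholders assumed for the example, not values from this documentation. The only point it makes is that `CONTENT_REPO` carries the same value as the local mount path.

```python
def build_lambda_config(function_name, access_point_arn, mount_path, existing_env=None):
    """Build keyword arguments for boto3's Lambda update_function_configuration.

    Hypothetical helper: every value is supplied by the caller. Only the
    CONTENT_REPO variable name comes from the documentation above.
    """
    # Lambda only accepts local mount paths located under /mnt/.
    if not mount_path.startswith("/mnt/"):
        raise ValueError("Lambda local mount paths must start with /mnt/")

    # Updating the environment replaces the whole variable set, so start
    # from the variables that are already configured on the function.
    env = dict(existing_env or {})
    env["CONTENT_REPO"] = mount_path

    return {
        "FunctionName": function_name,
        "FileSystemConfigs": [
            {"Arn": access_point_arn, "LocalMountPath": mount_path}
        ],
        "Environment": {"Variables": env},
    }


if __name__ == "__main__":
    kwargs = build_lambda_config(
        function_name="my-dataflow-function",  # placeholder name
        access_point_arn=(
            "arn:aws:elasticfilesystem:us-east-1:123456789012:"
            "access-point/fsap-0123456789abcdef0"  # placeholder ARN
        ),
        mount_path="/mnt/content",  # placeholder mount path
        existing_env={"EXISTING_VAR": "value"},  # placeholder variable
    )
    print(kwargs)
    # With AWS credentials available, the arguments could be applied with:
    #   import boto3
    #   boto3.client("lambda").update_function_configuration(**kwargs)
```

The same two values can also be entered in the Lambda console. The sketch simply makes explicit that the environment variable and the mount path must match.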

In this case, each instance of the function will create its content repository on the File System under the following path: ${CONTENT_REPO}/<function name>/content_repository/<epoch timestamp>-<UUID>
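To make the path layout concrete, here is a small Python sketch that composes a path of the same shape. It is an illustration only, not the code Cloudera Data Flow Functions runs: the helper name `example_content_repo_path` is hypothetical, and the timestamp unit (milliseconds here) is an assumption, since the unit is not stated above.

```python
import time
import uuid
from pathlib import PurePosixPath


def example_content_repo_path(content_repo, function_name):
    """Return a path shaped like the per-instance layout described above."""
    # Assumption: epoch timestamp in milliseconds (the unit is not documented here).
    epoch = int(time.time() * 1000)
    # A random UUID keeps directories distinct even for instances that
    # start within the same timestamp tick.
    instance_dir = f"{epoch}-{uuid.uuid4()}"
    return (
        PurePosixPath(content_repo)
        / function_name
        / "content_repository"
        / instance_dir
    )


if __name__ == "__main__":
    # Two simulated instances of the same function get different directories.
    print(example_content_repo_path("/mnt/content", "my-dataflow-function"))
    print(example_content_repo_path("/mnt/content", "my-dataflow-function"))
```

Because every instance writes under its own timestamp-and-UUID directory, concurrent instances can share one mount without overwriting each other's content.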

If desired, you can share the same File System across multiple Lambda functions, depending on the I/O performance you need.
