Running data processing containers on AWS Lambda

John Cherian
4 min read · Apr 15, 2022

AWS Lambda supports running Docker containers as functions, sometimes called Container as a Function (CaaF). This adds new dimensions to AWS Lambda and to serverless architecture: portability, dependency handling, and environment setup. CaaF suits use cases that need a standardized environment, portability across hybrid cloud environments, and languages Lambda does not support natively, such as PHP. There is already a trend of using AWS Lambda for data processing and loading when a load runs for no longer than 15 minutes. What is trending now is that some (not all) Apache Spark use cases are being converted into cost-optimized AWS Lambda data processing pipelines, and some of these have seen more than a 50% reduction in AWS cloud costs.

What is Big data?

“Big data” is commonly defined as data with greater variety, arriving in increasing volumes and with higher velocity; in practice, it is a dataset that cannot be processed with traditional processing engines. The concept has evolved over the last three decades: in the 90s a 1 GB payload was considered big data, while today it means TBs and PBs, and the definition keeps evolving. Many enterprise use cases do not exhibit all three V's and do not require heavyweight big data frameworks like Spark or MPP engines. The key determining factors are the size of the data, its annual/monthly growth rate, and the frequency at which it arrives. Some low-frequency, under-10 GB payload use cases can be handled with less expensive AWS resources such as AWS Lambda, AWS ECS container options, or plain AWS EC2.

Bigger data use cases

Recently, the AWS Lambda team added up to 10 GB of ephemeral storage to AWS Lambda, which makes it capable of handling bigger datasets and files. Combined with existing features such as 10 GB of memory, concurrency, asynchronous invocations, and the 15-minute maximum timeout, Lambda now overlaps with Apache Spark in big data processing territory. Concurrency makes these limits apply per payload, i.e. they hold as long as the big dataset can be split into chunks under 10 GB, with one invocation per chunk (see the sketch below). Some semi-big-data use cases can now be handled in AWS Lambda by combining Pandas, AWS boto3, and AWS Wrangler.
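
As a rough illustration of this fan-out idea, a small driver script can list the pre-chunked objects in S3 and invoke the processing Lambda asynchronously once per chunk. This is only a sketch: the bucket, prefix, function name, and event fields below are hypothetical.

import json
import boto3
s3 = boto3.client("s3")
lam = boto3.client("lambda")
BUCKET = "my-data-bucket"   # hypothetical bucket holding pre-chunked files (under 10 GB each)
PREFIX = "incoming/"
# one asynchronous invocation per chunk; Lambda concurrency fans the work out in parallel
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        lam.invoke(
            FunctionName="data-processing-lambda",
            InvocationType="Event",
            Payload=json.dumps({
                "source_path": f"s3://{BUCKET}/{obj['Key']}",
                "target_path": f"s3://{BUCKET}/processed/" + obj["Key"].split("/")[-1],
            }),
        )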

AWS Lambda Container approach

In the CaaF approach, AWS also provides customers with ready-made base images that are constantly patched and updated, so it is quick to add your code on top of the image. The Docker containers can be developed and tested locally before being deployed to AWS ECR, and they remain portable between AWS resources and hybrid cloud platforms.

Steps for deploying container in Lambda

  1. Design the Dockerfile/image with the AWS base image, dependencies, and the AWS Lambda code
  2. Build, test, and deploy the container to AWS ECR
  3. Create the AWS Lambda function from the live container image

Prerequisites

Docker installed on the local machine
An AWS account and access
Prior knowledge of Pandas and AWS Wrangler

Design of docker image

Build a Dockerfile using the AWS base image. Add a requirements.txt file with all the libraries required for the data processing logic, then add lambda_code.py with the data processing Python logic.

#base image from AWS
FROM public.ecr.aws/lambda/python:3.8
#pull the lambda code in
COPY lambda_code.py .
#add the pandas and wrangler dependencies
COPY requirements.txt .
#install all the dependencies
RUN pip install -r requirements.txt
#set the handler (module.function) that runs when the function is triggered
CMD [ "lambda_code.run" ]
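
For reference, the lambda_code.py referenced in the Dockerfile might look roughly like the sketch below, with a handler named run to match the CMD entry; requirements.txt would then list pandas and awswrangler. The event fields source_path and target_path are assumptions for illustration, not part of the original code.

import awswrangler as wr

def run(event, context):
    # hypothetical event fields carrying the S3 locations to read from and write to
    source_path = event["source_path"]
    target_path = event["target_path"]
    # read the CSV from S3 into a pandas DataFrame via AWS Wrangler
    df = wr.s3.read_csv(source_path)
    # placeholder for the actual data processing logic (filters, joins, aggregations, ...)
    df = df.dropna()
    # write the processed DataFrame back to S3 as CSV
    wr.s3.to_csv(df=df, path=target_path, index=False)
    return {"rows_processed": len(df), "output": target_path}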

Build, test, and deploy the container to AWS ECR

Browse to the folder where all the Docker-related files were created and build the image

docker build -t data-processing-lambda .

Run the Docker image locally

docker run -p 9000:8080 data-processing-lambda
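
The AWS base image bundles the Lambda Runtime Interface Emulator, so the container started above exposes the standard invocation endpoint on port 9000. A quick local test can post a sample event to it; the payload fields here are the hypothetical ones used in the handler sketch above.

import json
import urllib.request
url = "http://localhost:9000/2015-03-31/functions/function/invocations"
event = {"source_path": "s3://my-data-bucket/incoming/sample.csv",
         "target_path": "s3://my-data-bucket/processed/sample.csv"}
req = urllib.request.Request(url, data=json.dumps(event).encode("utf-8"),
                             headers={"Content-Type": "application/json"})
# print the handler's response returned by the local emulator
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode("utf-8"))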

Authenticate the docker CLI with AWS ECR

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <account>.dkr.ecr.us-east-1.amazonaws.com

Create the AWS ECR repository

aws ecr create-repository --repository-name data-processing-lambda --image-scanning-configuration scanOnPush=true --image-tag-mutability MUTABLE

Tag the image and push it to AWS ECR repo

# tag the image before the AWS ECR push
docker tag data-processing-lambda:latest <account>.dkr.ecr.us-east-1.amazonaws.com/data-processing-lambda:latest

# push the image to the AWS ECR repository
docker push <account>.dkr.ecr.us-east-1.amazonaws.com/data-processing-lambda:latest

Create AWS Lambda function with live container

Now that the image is in the AWS ECR repository, the next step is to create a Lambda function that pulls the image from AWS ECR and can be triggered.
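
If you prefer scripting this step instead of using the console, a minimal boto3 sketch could look like the following; the execution role ARN, function name, and image URI are placeholders, and the memory/storage settings simply reflect the 10 GB limits mentioned earlier.

import boto3
lam = boto3.client("lambda")
# create the function directly from the container image pushed to ECR
lam.create_function(
    FunctionName="data-processing-lambda",
    PackageType="Image",
    Code={"ImageUri": "<account>.dkr.ecr.us-east-1.amazonaws.com/data-processing-lambda:latest"},
    Role="arn:aws:iam::<account>:role/lambda-execution-role",  # placeholder execution role
    Timeout=900,                        # 15-minute maximum timeout
    MemorySize=10240,                   # up to 10 GB of memory
    EphemeralStorage={"Size": 10240},   # up to 10 GB of ephemeral storage
)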

Set the environment variables under the AWS Lambda configuration. These environment variables are read by the Python os module within the script and applied in the read/write logic, as in the snippet below.
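
For example, if the Lambda configuration defines SOURCE_PATH and TARGET_PATH (hypothetical variable names), the script can pick them up with the os module instead of taking them from the event:

import os
# read the S3 locations from the Lambda environment variables set in the configuration
source_path = os.environ["SOURCE_PATH"]
target_path = os.environ["TARGET_PATH"]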

Run the function and test whether the CSV file is generated in the target location.
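
One way to verify this, again with placeholder names, is to invoke the deployed function once and then check that the output object exists in S3:

import boto3
lam = boto3.client("lambda")
s3 = boto3.client("s3")
# synchronous test invocation; if the paths come from environment variables, an empty event is enough
response = lam.invoke(FunctionName="data-processing-lambda",
                      InvocationType="RequestResponse",
                      Payload=b"{}")
print(response["Payload"].read().decode("utf-8"))
# head_object raises a ClientError if the CSV was not written to the target location
s3.head_object(Bucket="my-data-bucket", Key="processed/sample.csv")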

GitHub
All the code mentioned in this blog is available on my GitHub page.
