Running data processing containers on AWS Lambda
AWS Lambda supports Docker Containers as a Function (CaaF). This adds new dimensions to the serverless architecture: portability, dependency handling, and environment setup. CaaF supports use cases that require a standardized environment setup, portability, unsupported languages like PHP, and hybrid cloud environments. There is a trend of using AWS Lambda for data processing and loading when the load duration is no longer than 15 minutes. More recently, some (not all) Apache Spark use cases are being converted into cost-optimized AWS Lambda data processing pipelines; some of these use cases have seen more than a 50% reduction in AWS Cloud costs.
What is big data?
“Big data” is defined as data that contains greater variety, arriving in increasing volumes and with more velocity. Big data is typically a dataset that cannot be processed using traditional processing engines. The concept has evolved over the last three decades: in the 90s a 1 GB payload was considered big data, but today TBs and PBs are. The concept and definition of big data keep evolving. Most enterprise use cases may not have all three V's and may not require heavy big data frameworks like Spark or MPP systems. The key determining factors are the size of the data, the annual/monthly growth rate of the data, and the frequency of the data. Some low-frequency, under-10 GB payload use cases can be handled with less expensive AWS resources like AWS Lambda, AWS ECS container options, and plain AWS EC2.
Bigger data use cases
Recently, the AWS Lambda team added 10 GB of ephemeral storage to AWS Lambda, which makes it capable of handling bigger datasets and files. Combined with existing features like 10 GB of memory, concurrency, asynchronous invocations, and the 15-minute maximum timeout, it now overlaps with Apache Spark in big data processing territory. The concurrency feature in AWS Lambda makes the above limits apply per payload, i.e. big data can be handled if it is split into chunks under 10 GB. Some semi-big-data use cases can now be handled in AWS Lambda by combining Pandas, Boto3, and AWS Wrangler.
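As a rough sketch of the chunking idea, pandas can process a file in pieces so that memory stays within the Lambda limit. The column names and chunk size here are illustrative (in Lambda the file would typically be streamed from S3 via awswrangler or Boto3 rather than an in-memory buffer):

```python
import io

import pandas as pd

# Sample CSV standing in for a much larger file.
raw_csv = io.StringIO("id,amount\n1,10\n2,20\n3,30\n4,40\n")

# Process the file in fixed-size chunks instead of loading it whole,
# accumulating the aggregate across chunks.
total = 0.0
for chunk in pd.read_csv(raw_csv, chunksize=2):
    total += chunk["amount"].sum()

print(total)  # → 100.0
```

The same pattern scales to per-invocation payloads: each concurrent Lambda invocation handles one sub-10 GB chunk of the overall dataset.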
AWS Lambda Container approach
In the CaaF approach, AWS also provides customers with ready-made base images that are constantly patched and updated, so it is quicker to add code to the image. Docker containers can be developed and tested locally before deployment to AWS ECR, and they are portable between AWS resources and hybrid cloud platforms.
Steps for deploying container in Lambda
- Design the Dockerfile/image with the AWS base image, dependencies, and the AWS Lambda code
- Build, test, and deploy the container to AWS ECR
- Create the AWS Lambda function with the live container
Prerequisites
Docker installed on the local machine
An AWS account and access
Prior knowledge of Pandas and AWS Wrangler
Design of docker image
Build a Dockerfile using the AWS base image. Add a requirements.txt file with all the libraries required by the data processing logic. Then add lambda_code.py with the data processing Python logic.
# base image from AWS
FROM public.ecr.aws/lambda/python:3.8

# pull the lambda code in
COPY lambda_code.py .

# add the pandas and wrangler dependencies
COPY requirements.txt .

# install all the dependencies
RUN pip install -r requirements.txt

# run the handler when triggered
CMD [ "lambda_code.run" ]
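A minimal lambda_code.py with the run handler referenced in the Dockerfile could look like the sketch below. The paths, event keys, and column names are illustrative, not from the original post; a real handler would usually read from and write to S3 (e.g. via awswrangler), with /tmp used only as scratch space:

```python
import pandas as pd


def run(event, context):
    # Source and target locations; in Lambda these would typically be
    # S3 URIs passed in the event or via environment variables.
    src = event.get("src", "/tmp/input.csv")
    dst = event.get("dst", "/tmp/output.csv")

    # Illustrative processing step: load, filter, write back out.
    df = pd.read_csv(src)
    df = df[df["amount"] > 0]
    df.to_csv(dst, index=False)

    return {"rows": len(df), "output": dst}
```

Because the handler is plain Python, it can be unit-tested locally before the image is ever built.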
Build, test, deploy the container to AWS ECR
Browse to the folder where all the Docker-related files are located
docker build -t data-processing-lambda .
Run the docker image
docker run -p 9000:8080 data-processing-lambda
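The AWS base images bundle the Lambda Runtime Interface Emulator, so the locally running container can be invoked over HTTP to check the handler before pushing the image (the empty JSON body is a placeholder event):

```shell
# Invoke the local emulator endpoint mapped to port 9000 above
curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{}'
```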
Authenticate the docker CLI with AWS ECR
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <account>.dkr.ecr.us-east-1.amazonaws.com
Create the AWS ECR repository
aws ecr create-repository --repository-name data-processing-lambda --image-scanning-configuration scanOnPush=true --image-tag-mutability MUTABLE
Tag the image and push it to AWS ECR repo
# tag the image before the AWS ECR push
docker tag data-processing-lambda:latest <account>.dkr.ecr.us-east-1.amazonaws.com/data-processing-lambda:latest

# push to the AWS ECR repository
docker push <account>.dkr.ecr.us-east-1.amazonaws.com/data-processing-lambda:latest
Create AWS Lambda function with live container
Now that the image is in the AWS ECR repository, the next step is to create a Lambda function that pulls the image from AWS ECR and can be triggered.
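Besides the console, the function can be created from the pushed image with the AWS CLI. This is a sketch with placeholders: `<account>` and the IAM role name must be replaced with your own values, and the timeout/memory settings mirror the limits discussed earlier:

```shell
# Create the Lambda function from the container image in ECR
aws lambda create-function \
  --function-name data-processing-lambda \
  --package-type Image \
  --code ImageUri=<account>.dkr.ecr.us-east-1.amazonaws.com/data-processing-lambda:latest \
  --role arn:aws:iam::<account>:role/lambda-execution-role \
  --timeout 900 \
  --memory-size 10240
```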
Set the environment variables under the AWS Lambda configuration. These environment variables are read by the Python os library within the script and applied in the read/write logic.
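For example, the handler can pick up those settings through os.environ; the variable names and default values here are illustrative, not from the original post:

```python
import os

# Read the locations configured on the Lambda console; fall back to
# illustrative defaults when the variables are not set.
source_path = os.environ.get("SOURCE_PATH", "s3://my-bucket/input.csv")
target_path = os.environ.get("TARGET_PATH", "s3://my-bucket/output.csv")

print(source_path, target_path)
```

Keeping the paths in environment variables lets the same container image serve different buckets or environments without a rebuild.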
Run the function and test whether the CSV file is generated in the target location.
Github
All the code mentioned in the blog is available on my GitHub page.