Performance Benchmarking of Pandas on AWS Lambda for CSV Files

John Cherian
4 min read · Apr 27, 2022

In this blog, we want to explore the data size limit for CSV files on AWS S3 when using the combination of AWS Wrangler and Pandas on AWS Lambda. The dataset used for the benchmarking was downloaded from the Geographical Society. The results of the performance test may vary based on factors such as the dataset, data types, class methods, data format, data type composition, read/write methods, AWS Lambda cold starts, coding practices, and more. Files of various sizes were created with Pandas and uploaded to an AWS S3 location; the data types include integer, object, and string. I see this blog as a collaborative space for anyone using Pandas with AWS Lambda to share their experiences.


Test Scenarios

Scenario 1: Read and write CSV format (no compression)

The first test checks whether Pandas and AWS Lambda can handle different data sizes for simple read and write operations. The AWS Lambda code uses the Pandas and AWS Wrangler libraries for this test, with file sizes ranging from 50 MB to 2 GB per payload. The code reads the CSV file from the AWS S3 location using the s3.read_csv method and writes it back to AWS S3 using the s3.to_csv method; a minimal sketch follows the metrics list below. The metrics monitored in this test are:

  • Data size
  • Data format (CSV)
  • Duration
  • Max memory used
  • Pandas in-memory usage
  • Wrangler read/write methods: s3.read_csv and s3.to_csv
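Here is a minimal sketch of the Scenario 1 Lambda handler, assuming awswrangler (AWS SDK for pandas) is packaged with the function (for example via the managed AWSSDKPandas layer); the bucket and object paths below are hypothetical placeholders for illustration only.

```python
import awswrangler as wr

# Hypothetical input/output locations used only for illustration.
SOURCE_PATH = "s3://my-benchmark-bucket/input/sample.csv"
TARGET_PATH = "s3://my-benchmark-bucket/output/sample_copy.csv"

def lambda_handler(event, context):
    # Read the CSV from S3 into an in-memory Pandas DataFrame.
    df = wr.s3.read_csv(path=SOURCE_PATH)

    # Write the DataFrame back to S3 as an uncompressed CSV
    # (awswrangler exposes the writer as wr.s3.to_csv).
    wr.s3.to_csv(df=df, path=TARGET_PATH, index=False)

    # Return basic metrics; duration and max memory used are reported
    # by the Lambda runtime in the CloudWatch REPORT log line.
    return {
        "rows": len(df),
        "in_memory_bytes": int(df.memory_usage(deep=True).sum()),
    }
```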

Scenario 2: Read and write compressed CSV file

In this performance test, my focus is to see whether there are other ways to load files above 1.2 GB. One approach is to compress the file before reading. There are different compression algorithms for the CSV format; zip is the algorithm used in this experiment. I assumed that files under 1.2 GB would work well with or without compression. A compressed CSV file makes the I/O operations faster and gives slightly better performance, but Pandas still has to decompress and deserialize the data in order to convert it into an in-memory data frame. The 2 GB file was compressed to a 300 MB file and the test was rerun on the same framework; AWS Lambda threw an out-of-memory error. Breaking the big file into smaller chunks before the Pandas read is an option, but use cases with aggregation requirements may not fit this approach.
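A minimal sketch of the Scenario 2 read is shown below, assuming a hypothetical path to the zip-compressed object and that awswrangler forwards extra keyword arguments (such as compression) to pandas.read_csv.

```python
import awswrangler as wr

# Hypothetical location of the zip-compressed CSV.
COMPRESSED_PATH = "s3://my-benchmark-bucket/input/sample.csv.zip"

def lambda_handler(event, context):
    # The object is small on S3 (~300 MB after compression), but Pandas must
    # decompress it and materialize the full DataFrame, so the memory
    # footprint is roughly the same as reading the uncompressed file.
    df = wr.s3.read_csv(path=COMPRESSED_PATH, compression="zip")
    return {"rows": len(df)}
```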

Conclusion

This performance benchmarking applies only to the Pandas-on-AWS-Lambda approach. These are not comprehensive performance test cases, and the results may vary based on the factors involved. The experiment was conducted to get an approximation of the CSV data size threshold for Pandas on AWS Lambda using AWS S3 as the source. Note that the payload size read from S3 into AWS Lambda differs from the event payload size limit for AWS Lambda.

Pandas with AWS Lambda is best suited to use cases of 100 KB to 1.2 GB per payload, where performance is rarely a concern. Anything above 1.2 GB requires more than 10 GB of in-memory capacity, and AWS Lambda is limited to 10 GB of memory. AWS Lambda was not intended to be a heavy data processing engine for long-running tasks. During the test, I noticed that the CSV file size on AWS S3 is smaller than its in-memory Pandas size, so the S3 file size does not map directly onto the 10 GB memory limit in AWS Lambda. Compression reduces the I/O cost in AWS Lambda and is faster than reading uncompressed data; for better performance, other compression algorithms could be explored. The majority of enterprise use cases are under 1 GB per payload, and for anything above 1.2 GB there are other options such as Pandas chunking (see the sketch below), Dask, and Spark to handle such workloads. AWS Lambda also has the option to run concurrently (scale out) with payloads under 1.2 GB where ordering is not critical. Carefully choosing the framework will reduce infrastructure cost. This will be an evolving blog; I will update it as I observe other scenarios and use cases. Please feel free to contribute or share your experiences.
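As a rough illustration of the chunking option, here is a minimal sketch that streams the file in smaller pieces; the path is hypothetical, and it assumes the chunksize argument is forwarded to pandas.read_csv so an iterator of DataFrames is returned instead of one large frame.

```python
import awswrangler as wr

# Hypothetical location of a large CSV.
SOURCE_PATH = "s3://my-benchmark-bucket/input/sample.csv"

def lambda_handler(event, context):
    total_rows = 0
    # Each iteration holds only ~100,000 rows in memory, which keeps the
    # footprint small but does not suit aggregations needing the full dataset.
    for chunk in wr.s3.read_csv(path=SOURCE_PATH, chunksize=100_000):
        total_rows += len(chunk)
    return {"rows": total_rows}
```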
