My First Take on Amazon EMR Serverless
Which flavor of Apache Spark would you like to use on AWS?
I faced the same dilemma at the ice cream shop when I was a child. There are five ways to run Spark jobs on AWS: AWS Glue, Amazon EMR on EC2, Amazon EMR on EKS, AWS Fargate, and Amazon EMR Serverless. Most of the time it is hard to choose one over the others, or even to know which aspects to take into account.
Ordered from the most granular control to the easiest to work with, the options line up as Amazon EMR on EKS, EMR on EC2, EMR Serverless, and then AWS Glue. Amazon EMR Serverless handles infrastructure setup, instance type selection, and scaling up and down, so the user can concentrate solely on the code. It does not include the AWS Glue wrapper libraries that automate tasks such as bookmarking, performance tuning, database connections, data processing, and data flattening. At the same time, unlike Amazon EMR on EC2, you do not need to worry about provisioning infrastructure or scaling it with the workload.
Amazon EMR Serverless is a new deployment option. The Spark code you run on it is plain open-source Spark, so it can be migrated to other deployments such as Amazon EMR on EC2 or EKS as needed. Spark dynamic allocation is enabled, so scaling up and down is taken care of behind the scenes. At the time of writing, EMR Serverless supports only batch jobs, not streaming. Its capacity is configured in virtual CPUs (vCPUs), which means it runs on a container-based environment that, unlike Amazon EMR on EC2, can slice an EC2 instance into smaller vCPU-sized units.
Amazon EMR Serverless is ideal for short-running batch jobs; for longer-running jobs, Amazon EMR on EC2 may at times be less expensive. It also suits not-so-big or borderline big data, because it saves cost today while leaving room for a potentially high data growth rate. Keep in mind that there is an overhead for warming up the cluster before the job runs, so for small data sets pandas may outperform Spark. You can use the pre-initialized capacity feature to have some workers ready before submitting the job. EMR Serverless currently runs Spark and Hive, with Presto in the works, so it has fewer options than EMR on EC2. It works with Apache Hudi and Apache Iceberg at a lower cost.
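To make the pre-initialized capacity idea concrete, here is a minimal sketch of the request you could pass to the Boto3 `emr-serverless` client's `create_application` call. The application name, release label, and worker sizes are illustrative assumptions, not recommendations, and the helper names are my own.

```python
from typing import Any, Dict


def worker_spec(count: int, cpu: str, memory: str) -> Dict[str, Any]:
    """One entry of the initialCapacity map: a worker count plus its size."""
    return {
        "workerCount": count,
        "workerConfiguration": {"cpu": cpu, "memory": memory},
    }


def build_application_request(name: str, release_label: str) -> Dict[str, Any]:
    """Keyword arguments for create_application (all sizes are assumed values)."""
    return {
        "name": name,
        "releaseLabel": release_label,
        "type": "SPARK",
        # Workers kept ready before any job is submitted.
        "initialCapacity": {
            "DRIVER": worker_spec(1, "2vCPU", "4GB"),
            "EXECUTOR": worker_spec(2, "4vCPU", "8GB"),
        },
        # Upper bound the application may scale up to.
        "maximumCapacity": {"cpu": "32vCPU", "memory": "64GB"},
    }


def create_application(client: Any, name: str, release_label: str) -> str:
    """Create the application and return its ID."""
    response = client.create_application(
        **build_application_request(name, release_label)
    )
    return response["applicationId"]
```

With real credentials, usage would look like `create_application(boto3.client("emr-serverless"), "demo-app", "emr-6.6.0")`; the release label here is only an example value.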
A few limitations to keep in mind:
- The Amazon Redshift connection must be configured manually. Some of the required JAR files are available locally on the EMR Serverless workers, but redshift-jdbc.jar must be downloaded and placed in an Amazon S3 folder.
- Adding extra Python libraries requires packaging them in a Python virtual environment, zipping it, and pointing spark-submit to the archive. Build the package on Amazon Linux 2, for example in AWS Cloud9.
- AWS Lake Formation is not supported by Amazon EMR Serverless. If this is a main requirement, avoid EMR Serverless.
- If Airflow is used to orchestrate Spark jobs, only a few operators are available on GitHub, and they are still evolving. As an alternative, the PythonOperator can be combined with the AWS Boto3 SDK to build the DAGs.
- Sometimes the warm-pool workers (pre-initialized capacity) are slow to pick up a submitted job, and there is a 2–3 minute delay before the results appear.
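The Redshift JAR and Python packaging points above both come down to what you pass in `sparkSubmitParameters`. Here is a rough sketch of assembling that string. The S3 paths are hypothetical, the `environment` alias is just a name I chose, and the `spark.archives` plus `PYSPARK_PYTHON` settings follow the virtual-environment pattern as I understand it, so verify them against the current EMR Serverless documentation.

```python
from typing import List, Optional


def build_spark_submit_parameters(
    venv_archive: Optional[str] = None,
    extra_jars: Optional[List[str]] = None,
) -> str:
    """Assemble the sparkSubmitParameters string for an EMR Serverless job."""
    parts: List[str] = []
    if extra_jars:
        # Extra JARs (e.g. the Redshift JDBC driver) are read from S3.
        parts.append("--jars " + ",".join(extra_jars))
    if venv_archive:
        # The archive is unpacked under the alias after '#'.
        python = "./environment/bin/python"
        parts += [
            f"--conf spark.archives={venv_archive}#environment",
            f"--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON={python}",
            f"--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON={python}",
            f"--conf spark.executorEnv.PYSPARK_PYTHON={python}",
        ]
    return " ".join(parts)


# Hypothetical bucket and file names, for illustration only.
params = build_spark_submit_parameters(
    venv_archive="s3://my-bucket/artifacts/pyspark_venv.tar.gz",
    extra_jars=["s3://my-bucket/jars/redshift-jdbc.jar"],
)
```

The resulting string goes into the `sparkSubmit` section of the job driver when the job is started.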
Below is an example of an Airflow DAG submitting a job to Amazon EMR Serverless.
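This is a minimal sketch rather than a production DAG. It assumes Airflow 2.x and Boto3 are installed, uses a PythonOperator with the Boto3 `emr-serverless` client, and every ID, ARN, and S3 path is a hypothetical placeholder. The function names are my own.

```python
import time
from typing import Any, Dict

# Hypothetical placeholders; replace with your own values.
APPLICATION_ID = "your-application-id"
EXECUTION_ROLE_ARN = "arn:aws:iam::123456789012:role/your-job-role"
ENTRY_POINT = "s3://your-bucket/scripts/job.py"

TERMINAL_STATES = {"SUCCESS", "FAILED", "CANCELLED"}


def build_job_request() -> Dict[str, Any]:
    """Keyword arguments for the emr-serverless start_job_run call."""
    return {
        "applicationId": APPLICATION_ID,
        "executionRoleArn": EXECUTION_ROLE_ARN,
        "name": "airflow-submitted-job",
        "jobDriver": {"sparkSubmit": {"entryPoint": ENTRY_POINT}},
    }


def run_job(client: Any = None, poll_seconds: float = 30) -> str:
    """Submit the job, poll until it reaches a terminal state, return its run ID."""
    if client is None:
        import boto3  # imported lazily so the module loads without AWS access

        client = boto3.client("emr-serverless")
    job_run_id = client.start_job_run(**build_job_request())["jobRunId"]
    while True:
        response = client.get_job_run(
            applicationId=APPLICATION_ID, jobRunId=job_run_id
        )
        state = response["jobRun"]["state"]
        if state in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)
    if state != "SUCCESS":
        # Raising makes Airflow mark the task as failed.
        raise RuntimeError(f"Job run {job_run_id} ended in state {state}")
    return job_run_id


def build_dag():
    """Wire run_job into a manually triggered DAG (Airflow 2.x assumed)."""
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    with DAG(
        dag_id="emr_serverless_spark_job",
        start_date=datetime(2022, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        PythonOperator(task_id="run_spark_job", python_callable=run_job)
    return dag
```

In an actual DAG file you would add `dag = build_dag()` at module level so the scheduler can discover it. Polling inside the task keeps the sketch short; a sensor or a dedicated operator may be a better fit once the community operators mature.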
Overall, my experience with Amazon EMR Serverless has been great. I would suggest using EMR Serverless for batch jobs that do not require integration with AWS Lake Formation. If additional features are needed in the future and are not available on Amazon EMR Serverless, the code can easily be migrated to Amazon EMR on EC2.