Spark on Docker
Hi, I am John Cherian, a data engineering architect consulting for various clients in the DC area. My key area of focus is designing big data and low-latency ETL architectures on cloud platforms like AWS, Azure, and GCP. Recently, I have noticed that a lot of firms are moving to hybrid cloud models to utilize the strengths of each cloud platform. The common ETL framework used for processing large sets of data is Apache Spark, and Apache Spark comes in different flavors on each cloud platform. When designing for hybrid cloud, I favor a cloud-platform-agnostic virtual environment so that workloads can be migrated from one platform to another seamlessly.
A Spark application (REST API) on Docker is a cloud-agnostic approach to implementing data processing applications. It provides portability, an isolated environment, and dedicated resource assignment for the Spark application. The Spark Docker containers can be deployed on AWS, Azure, and GCP, and remote jobs can be submitted through the REST API via an HTTP request. All the cloud platforms offer flavors of Kubernetes where the Docker container can be hosted and orchestrated.
Benefits of Spark on Docker (SoD)
· Ideal for migrating an application from one cloud platform to another
· Packages the Spark application or Spark cluster with all its dependencies
· Standardized environment and Spark configuration
· Portable and cloud-platform agnostic
· Eases migration from older Spark versions to the latest; each legacy Spark application is free to pin its own Spark version
Enough talking …Show me some code
Building Spark on Docker Image
Let us work through a simple example of Spark on Docker. In this package, we create a small Spark REST API using the Flask library and then host it in a Docker container. The REST API can be accessed via an HTTP endpoint. The package for the sample Docker image is available on my GitHub.
Prerequisites: any operating system, Python 3.x, and Docker installed with the daemon exposed on TCP.
The Dockerfile uses base image variants of Python and OpenJDK. The version of PySpark used is 3.1.x. All the dependency libraries are listed in requirements.txt.
Cloud platforms like AWS, Azure, and GCP may provide their own builds of OpenJDK and the Linux operating system, maintained and upgraded by the platform. Here, in our example, I am using an OpenJDK image from Docker Hub (keeping it cloud-agnostic).
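The Dockerfile described above might look roughly like this. This is a minimal sketch, not the exact contents of the repository: the base image tag, package names, and file layout are my assumptions.

```dockerfile
# Hypothetical sketch of the Dockerfile — image tags and file names are
# assumptions, not the exact repository contents.
FROM python:3.8-slim

# PySpark needs a JVM; install a JRE on top of the Python base image.
RUN apt-get update \
    && apt-get install -y --no-install-recommends default-jre \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /main

# Install dependencies first so this layer is cached across code changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code and the sample dataset.
COPY main.py COVID-19_Hospital_Capacity.csv ./

EXPOSE 5000
ENV FLASK_APP=main.py
CMD ["flask", "run", "--host=0.0.0.0", "--port=5000"]
```

Copying requirements.txt before the application code is a deliberate layering choice: dependency installation is the slowest step, and keeping it in its own cached layer means code-only changes rebuild in seconds.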
Spark application on Flask
The Python script main.py uses the Spark framework to read sample data from a dataset downloaded from data.gov and convert it into HTML format. The Flask framework is used to APIfy the Spark application; Flask hosts the application on port 5000 (by default). An HTTP request can then be sent to view the generated dataframe in HTML format in a browser or via OS commands such as curl.
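A minimal sketch of what main.py could look like: a Flask route wrapping a Spark read and a helper that renders the result as HTML. The route path matches the `/readdf?location=...` endpoint used later in this article, but the function and helper names are my own, not necessarily the repository's.

```python
from flask import Flask, Response, request

app = Flask(__name__)


def frame_to_html(pdf):
    """Render a pandas DataFrame as an HTML table (no index column)."""
    return pdf.to_html(index=False)


@app.route("/readdf", methods=["GET", "POST"])
def readdf():
    # Lazy import so the web layer can be imported without a Spark runtime.
    from pyspark.sql import SparkSession

    # Dataset path comes from the query string, with the sample file as default.
    location = request.args.get("location", "/main/COVID-19_Hospital_Capacity.csv")

    spark = SparkSession.builder.appName("sparkRestDocker").getOrCreate()
    sdf = spark.read.option("header", "true").csv(location)

    # Convert to pandas for HTML rendering — fine for small result sets.
    return Response(frame_to_html(sdf.toPandas()), mimetype="text/html")

# Served with: FLASK_APP=main.py flask run --host=0.0.0.0 --port=5000
```

Collecting a Spark dataframe with `toPandas()` pulls the whole result to the driver, so this pattern only suits small datasets being surfaced over HTTP.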
The requirements file is used to install the dependencies such as Flask, pandas, and the HTML-rendering libraries. Flask hosts the Spark function as a REST API, pandas converts the Spark dataframe to HTML, and the HTML libraries render it on a web page (when accessed via localhost:5000).
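A requirements.txt along these lines would cover the dependencies mentioned above; the exact pins are assumptions, apart from the PySpark 3.1.x version stated earlier.

```
flask
pyspark==3.1.*
pandas
```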
A Docker image is a combination of a file system and the parameters needed to run an application as a container: code, config files, environment variables, and the runtime. It comprises multiple layers and is used to execute code in a Docker container. Once the image is deployed to a Docker environment, it can be run as a container. Docker images are immutable but can be copied and shared, which makes them an excellent candidate for testing new applications and configurations. A Docker image starts from a base image consisting of the OS, dependent libraries, and executable code. As you add instructions to the Dockerfile, each completed step is cached as a new layer, which speeds up subsequent builds and execution.
Once the package is downloaded from GitHub into a folder, open a command prompt, change into that folder, and run the commands below.
docker build -t sparkrestdocker:latest .
The above command builds a Docker image from the Dockerfile in the package (note that Docker repository names must be lowercase, hence sparkrestdocker). The image has all the dependencies installed and the required files copied in.
A Docker container is a unit of packaged code and its dependencies that runs in a platform- and environment-agnostic way. It is created from a Docker image using the docker run command.
docker run -d -p 5000:5000 sparkrestdocker:latest
docker container ls
The docker run command starts a container from the image, with port 5000 published. Check that the container is running with the docker container ls command, then use a browser or the curl command to access the REST API endpoint.
Access Docker and REST spark API function
Confirm the Docker container is running with the command docker container ls. The container can then be accessed on your localhost at port 5000 using any browser or curl.
The link below invokes the Spark method in main.py and displays the dataframe in HTML format. The data used was downloaded from data.gov.
Browser — http://127.0.0.1:5000/readdf?location=/main/COVID-19_Hospital_Capacity.csv
Or
curl http://127.0.0.1:5000/readdf?location=/main/COVID-19_Hospital_Capacity.csv
The base images might vary for each cloud platform, so refer to the guidelines in the platform's documentation. Prefer the cloud platform's own images so that OS and security updates are applied automatically. Each cloud platform offers its own managed Kubernetes service and container registry. Upload the image to the container registry and host it on the Kubernetes cluster.
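As a generic sketch of that last step, the push looks like this on any platform; the registry hostname and project path below are placeholders, and each cloud has its own authentication flow (for example, AWS ECR, Azure ACR, and GCP Artifact Registry each provide a login command) that must be run before the push.

```
docker tag sparkrestdocker:latest registry.example.com/my-project/sparkrestdocker:latest
docker push registry.example.com/my-project/sparkrestdocker:latest
```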