Spark on Docker

Hi, I am John Cherian, a data engineering architect consulting for various clients in the DC area. My key area of focus is designing big data and low-latency ETL architectures on cloud platforms such as AWS, Azure, and GCP. I have recently noticed that many firms are moving to hybrid cloud models to leverage the strengths of each cloud platform. The common ETL framework used for processing large data sets is Apache Spark, and it comes in different flavors on each cloud platform. When designing for hybrid cloud, I aim for a platform-agnostic virtual environment so that workloads can be migrated from one platform to another seamlessly.

A Spark application exposed as a REST API on Docker is a cloud-agnostic way to implement data processing applications. It provides portability, an isolated environment, and dedicated resource assignment for the Spark application. The Spark Docker containers can be deployed on AWS, Azure, and GCP, and remote jobs can be submitted through the REST API via an HTTP request. All the major cloud platforms offer a flavor of Kubernetes on which the Docker containers can be hosted and orchestrated.

Benefits of Spark on Docker (SoD)

· Ideal for migrating an application from one cloud platform to another

· Packages the Spark application or Spark cluster with all its dependencies

· Standardized environment and Spark configuration

· Portable and cloud-platform agnostic

· Easier migration from older Spark versions to the latest; you are free to choose the Spark version for each legacy Spark application

Enough talking… show me some code

Building Spark on Docker Image

Let us walk through a simple example of Spark on Docker. In this package, we create a small Spark REST API using the Flask library and then host it in a Docker container. The REST API can be accessed via an HTTP endpoint. The package for the sample Docker image is available in my GitHub repository.

Prerequisite:

Any operating system with Python 3.x installed. Install Docker and expose the Docker daemon on TCP.

Dockerfile

The Dockerfile uses base image variants of Python and OpenJDK. The version of PySpark used is 3.1.x. All the dependency libraries are listed in requirements.txt.

Cloud platforms like AWS, Azure, and GCP may provide their own versions of the OpenJDK image and Linux operating system, which are maintained and upgraded by the platform. In this example, I am using an OpenJDK image from Docker Hub to keep it cloud-agnostic.
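As a rough illustration, the Dockerfile follows this general shape. This is a minimal sketch, and the image tag, package versions, and file names below are assumptions rather than the exact contents of the repository:

# Layer Python on top of an OpenJDK base image so PySpark can find a JVM
FROM openjdk:11-jre-slim

# Install Python 3 and pip on the Debian-based base image
RUN apt-get update && apt-get install -y python3 python3-pip && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Install the Python dependencies listed in requirements.txt (PySpark, Flask, pandas)
COPY requirements.txt .
RUN pip3 install -r requirements.txt

# Copy the application code and the sample dataset into the image
COPY . .

# Flask listens on port 5000 by default
EXPOSE 5000

CMD ["python3", "main.py"]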

Spark application on Flask

The Python script main.py uses the Spark framework to read sample data from a dataset downloaded from data.gov and convert it into HTML format. The Flask framework is used to expose the Spark application as an API, and Flask hosts the application on port 5000 by default. An HTTP request can then be sent to view the generated dataframe in HTML format in a browser or from the command line.
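A minimal sketch of what main.py might look like is shown below. The dataset path and the row limit are assumptions for illustration, not the exact code from the repository; the /readdf route name matches the endpoint used later in this post.

# main.py - a sketch of a Spark function exposed as a REST API with Flask
from flask import Flask
from pyspark.sql import SparkSession

app = Flask(__name__)

# One SparkSession is created when the container starts and reused per request
spark = SparkSession.builder.appName("sparkRestDocker").getOrCreate()


@app.route("/readdf", methods=["GET", "POST"])
def readdf():
    # Read the sample dataset downloaded from data.gov (path is an assumption)
    df = spark.read.csv("data/sample_data.csv", header=True, inferSchema=True)
    # Convert the Spark dataframe to pandas and render it as an HTML table
    return df.limit(100).toPandas().to_html()


if __name__ == "__main__":
    # Bind to 0.0.0.0 so the endpoint is reachable from outside the container
    app.run(host="0.0.0.0", port=5000)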

Requirements.txt

The requirements file is used to install dependencies such as Flask, pandas, and the HTML rendering libraries. Flask hosts the Spark function as a REST API, pandas converts the Spark dataframe to HTML, and the HTML libraries display it on a web page (when accessed via localhost:5000).
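For reference, an indicative requirements.txt could look like the following; the pinned PySpark version is an assumption within the 3.1.x line, and the actual file may list additional helper libraries:

pyspark==3.1.2
flask
pandas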

Docker Image

A Docker image is a combination of a file system and the parameters needed to run an application as a container: code, config files, environment variables, and the runtime. It is made up of multiple layers and is used to execute code in a Docker container. Once the image is deployed to a Docker environment, it can be run as a Docker container. Docker images are immutable but can be copied and shared, which makes them an excellent candidate for testing new applications and configurations. A Docker image starts with a base image consisting of the OS, dependent libraries, and executable code. As you add instructions to the Dockerfile, each step is cached as a new layer on top of the base image, which speeds up subsequent builds.

Once the package is downloaded from GitHub into a folder, open a command prompt, change into that folder, and run the command below.

docker build -t sparkrestdocker:latest .

The above command creates a Docker image from the Dockerfile in the package. The image will have all dependencies installed and the required files copied in. Note that Docker image names must be lowercase.

Docker container

A Docker container is a unit of packaged code and its dependencies that runs in a platform- and environment-agnostic way. It is created from a Docker image using the docker run command.

docker run -d -p 5000:5000 sparkrestdocker
docker container ls

Running the docker run command creates the container from the image and exposes port 5000. Check that the container is running using the docker container ls command, then use a browser or the curl command to access the REST API endpoint.

http://localhost:5000/readdf

Access the Docker container and the Spark REST API function

Confirm the Docker container is running with the command docker container ls. The container can then be reached on your localhost at port 5000 using any browser or curl commands.

The link below calls the Spark method in main.py and displays the dataframe in HTML format. The data used was downloaded from data.gov.

Browser or curl:
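For example, the same endpoint can be queried from a terminal, assuming the container from the previous step is running:

curl http://localhost:5000/readdf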

The base images may vary for each cloud platform, so refer to the guidelines in the platform documentation. Using the cloud platform's own images means OS and security updates are handled automatically. Each cloud platform also has its own managed Kubernetes service and container registry. Upload the image to the container registry and host it on the Kubernetes cluster.
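As a rough sketch, publishing the image generally amounts to tagging it with the registry path and pushing it. The registry path below is a placeholder; each platform (ECR, ACR, GCR/Artifact Registry) documents its own login and naming conventions.

docker tag sparkrestdocker:latest <your-registry>/sparkrestdocker:latest
docker push <your-registry>/sparkrestdocker:latest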
