Deploying to EMR Serverless via EMR CLI


Recently, as part of my academic studies, I had to deploy a PySpark application to EMR Serverless. I ran into a couple of issues along the way, most of them related to trying to get a newer version of Python running on the cluster. This post covers the issues I encountered and how I resolved them.

As part of my paper on link prediction, I had to deploy a PySpark application to EMR Serverless.

The first thing I did was create an EMR Serverless application and install the EMR CLI from AWS Labs.
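For reference, a minimal sketch of that setup, based on the AWS Labs amazon-emr-cli project; the application ID, job role ARN, and bucket below are placeholders, and flag names may vary between CLI versions:

```bash
# Install the EMR CLI from AWS Labs
pip install emr-cli

# Package the local project, upload it to S3, and run it on an existing
# EMR Serverless application (IDs, ARNs, and bucket are placeholders)
emr run \
  --application-id <application-id> \
  --job-role <job-role-arn> \
  --s3-code-uri s3://<my-bucket>/code/ \
  --entry-point main.py \
  --build \
  --wait
```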

> Amazon EMR Serverless provides support for Custom Images, a capability that enables you to customize the Docker container images used for running Apache Spark and Apache Hive applications on Amazon EMR Serverless. Custom images enable you to install and configure packages specific to your workload that are not available in the public distribution of EMR’s runtimes into a single immutable container. An immutable container promotes portability and simplifies dependency management for each workload and enables you to integrate developing applications for EMR Serverless with your own continuous integration (CI) pipeline.

The main issue came from trying to get a newer version of Python running on the cluster. A method in a library I use (networkx) requires at least Python 3.9, while the default version available via the yum package manager is 3.6.

I tried multiple ways to resolve this issue, including:

- Getting a newer Python version from Amazon Linux Extras (the latest available there is currently 3.8): `RUN amazon-linux-extras install python3.8`

- Building Python from source (the build failed on my machine for reasons I never pinned down, and it took an extraordinary amount of time)

- Using a custom Docker image with Python 3.9 installed on Amazon Linux 2

- Using a Python Docker image

All of these attempts led to one of two issues:

1. `Exception in thread "main" java.io.IOException: Cannot run program "./environment/bin/python"` (see the snippet below for where this path comes from)

2. `ModuleNotFoundError: No module named 'jobs'`
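For context, the `./environment/bin/python` path in the first error comes from the usual way a packed virtual environment is wired into an EMR Serverless Spark job, roughly like the sparkSubmitParameters below (property names as documented for EMR Serverless; the bucket is a placeholder):

```
# Ship the packed venv as an archive and point the driver and executors at it
--conf spark.archives=s3://<my-bucket>/artifacts/pyspark_deps.tar.gz#environment
--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python
--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python
--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python
```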

I tried many things to resolve both issues, including creating various symlinks, copying the Python binary, copying the whole environment folder, and so on. None of them worked.

What finally worked was building a custom Docker image on top of the base image provided by EMR Serverless and setting the application's release version to match the one used in the Dockerfile. I updated my application to use emr-7.0.0 and used the following Dockerfile:

```dockerfile
# Use the Amazon EMR Serverless Spark base image
FROM --platform=linux/amd64 public.ecr.aws/emr-serverless/spark/emr-7.0.0:latest AS base

# Run as root user for installation purposes
USER root

# Create a virtual environment
ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

# Set Python path for PySpark
ENV PYSPARK_PYTHON=/usr/bin/python3

# Upgrade pip in the virtual environment
RUN python3 -m pip install --upgrade pip

# Copy your application code to the container
WORKDIR /app
COPY . .

# Install dependencies including venv-pack
RUN python3 -m pip install venv-pack==0.2.0
RUN python3 -m pip install .

# Package the virtual environment with venv-pack
RUN mkdir /output && venv-pack -f -o /output/pyspark_deps.tar.gz

# Switch to non-root user for running the application
USER hadoop:hadoop

# The packaged virtual environment is now ready to be used

# Export stage - used to copy packaged venv to local filesystem
FROM scratch AS export-python
COPY --from=base /output/pyspark_deps.tar.gz /
```
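To get the packaged environment out of the export stage and onto the local filesystem, a BuildKit build along these lines should work (the stage name matches the Dockerfile above; the output directory is arbitrary):

```bash
# Build the image and copy pyspark_deps.tar.gz from the export-python stage
# into ./output on the local machine (requires BuildKit)
DOCKER_BUILDKIT=1 docker build --target export-python --output ./output .
```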

The Python version available by default in this image is 3.9, which is exactly what I needed. By using this image, I was able to resolve both issues and deploy my application to EMR Serverless.
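For completeness, a rough sketch of pointing the application at the matching release and the custom image after pushing it to ECR (application ID and image URI are placeholders; if your CLI version cannot update the release label in place, recreating the application on emr-7.0.0 achieves the same):

```bash
# Update the EMR Serverless application to use the emr-7.0.0 release
# and the custom image pushed to ECR (IDs and URIs are placeholders)
aws emr-serverless update-application \
  --application-id <application-id> \
  --release-label emr-7.0.0 \
  --image-configuration '{"imageUri": "<account-id>.dkr.ecr.<region>.amazonaws.com/<repo>:latest"}'
```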
