Recently, as part of a paper on link prediction for my academic studies, I had to deploy a PySpark application to EMR Serverless. I ran into a couple of issues along the way, most of them caused by trying to get a newer version of Python running on the cluster. This post covers those issues and how I resolved them.
The first thing I did was create an EMR Serverless application and install the CLI from AWS Labs.
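The post does not say whether the application was created through the console, a CLI, or an SDK. As one possible route, here is a minimal sketch using boto3's `emr-serverless` client. The application name is hypothetical, and the `emr-7.0.0` default is the release label this post ends up using later; `boto3` has to be installed and AWS credentials configured for the actual call to work.

```python
def build_create_application_request(name, release_label="emr-7.0.0"):
    """Build the keyword arguments for creating a Spark application."""
    return {
        "name": name,                    # hypothetical name chosen by the caller
        "releaseLabel": release_label,   # assumption: the release used later in this post
        "type": "SPARK",
    }


def create_application(name, release_label="emr-7.0.0"):
    """Create the application and return its ID (requires boto3 and AWS credentials)."""
    import boto3  # third-party; imported lazily so the request builder works without it

    client = boto3.client("emr-serverless")
    response = client.create_application(
        **build_create_application_request(name, release_label)
    )
    return response["applicationId"]
```

Keeping the request builder separate from the call makes it easy to inspect or log the exact parameters before anything is created in the account.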
Amazon EMR Serverless supports custom images, a capability that lets you customize the Docker container images used to run Apache Spark and Apache Hive applications. Custom images let you install and configure packages specific to your workload that are not available in the public distribution of EMR's runtimes, all in a single immutable container. An immutable container promotes portability, simplifies dependency management for each workload, and lets you integrate EMR Serverless application development with your own continuous integration (CI) pipeline.
The main issue arose from trying to get a newer version of Python running on the cluster. A method in a library I used (networkx) required at least Python 3.9, and the default version available via the yum package manager is 3.6.
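A version mismatch like this is easier to diagnose when the job fails fast with a clear message instead of deep inside a library call. The following is a minimal sketch of such a guard, not something from my actual job: the function names are hypothetical, and the `(3, 9)` minimum is taken from the networkx requirement described above.

```python
import sys

# Assumption: the minimum comes from the networkx requirement mentioned above.
MIN_PYTHON = (3, 9)


def python_is_new_enough(version_info=None, minimum=MIN_PYTHON):
    """Return True when the given (or running) interpreter is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only major and minor, e.g. (3, 9) against (3, 6).
    return tuple(version_info[:2]) >= tuple(minimum)


def require_python(minimum=MIN_PYTHON):
    """Raise a readable error when the running interpreter is too old."""
    if not python_is_new_enough(minimum=minimum):
        found = ".".join(str(part) for part in sys.version_info[:3])
        wanted = ".".join(str(part) for part in minimum)
        raise RuntimeError(
            f"Python {wanted}+ is required, but this interpreter is {found}"
        )
```

Calling `require_python()` at the top of the job's entry point would surface the problem in the driver log right away.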
I tried multiple ways to resolve this issue, including:
- Getting a newer Python version from Amazon Linux Extras (the latest currently available is 3.8):
  RUN amazon-linux-extras install python3.8
- Building Python from source (this failed on my machine for some reason, and it took an extraordinary amount of time)
- Using a custom Docker image with Python 3.9 installed on Amazon Linux 2
- Using the python Docker image
All of these led to the same two errors:
Exception in thread "main" java.io.IOException: Cannot run program "./environment/bin/python":
ModuleNotFoundError: No module named 'jobs'.
I tried many things to resolve both issues, including creating various symlinks, copying the Python binary, and copying the whole environment folder. None of these worked.
I resolved it by building on the base Docker image provided by EMR Serverless and setting the application's release version to match the one in the Dockerfile. I updated my application to use emr-7.0.0 and used the following Dockerfile:
# Use the Amazon EMR Serverless Spark base image
FROM --platform=linux/amd64 public.ecr.aws/emr-serverless/spark/emr-7.0.0:latest AS base
# Run as root user for installation purposes
USER root
# Create a virtual environment
ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
# Set Python path for PySpark
ENV PYSPARK_PYTHON=/usr/bin/python3
# Upgrade pip in the virtual environment
RUN python3 -m pip install --upgrade pip
# Copy your application code to the container
WORKDIR /app
COPY . .
# Install dependencies including venv-pack
RUN python3 -m pip install venv-pack==0.2.0
RUN python3 -m pip install .
# Package the virtual environment with venv-pack
RUN mkdir /output && venv-pack -f -o /output/pyspark_deps.tar.gz
# Switch to non-root user for running the application
USER hadoop:hadoop
# The packaged virtual environment is now ready to be used
# Export stage - used to copy packaged venv to local filesystem
FROM scratch AS export-python
COPY --from=base /output/pyspark_deps.tar.gz /
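Since the fix depends on the application's release version matching the base image in the `FROM` line, a small check can catch drift between the two. This is a hedged sketch rather than part of my setup: the function names and the regular expression are my own, and it assumes the base image tag follows the `emr-X.Y.Z` pattern used above.

```python
import re

# Matches e.g. "FROM --platform=linux/amd64 public.ecr.aws/.../emr-7.0.0:latest AS base"
# and captures the "emr-7.0.0" part. Assumes the emr-X.Y.Z naming used in this post.
FROM_PATTERN = re.compile(
    r"^FROM\s+(?:--platform=\S+\s+)?\S*/(emr-\d+\.\d+\.\d+)(?::\S+)?",
    re.MULTILINE,
)


def base_image_release(dockerfile_text):
    """Return the EMR release in the first matching FROM line, or None."""
    match = FROM_PATTERN.search(dockerfile_text)
    return match.group(1) if match else None


def releases_match(dockerfile_text, application_release_label):
    """True when the Dockerfile base image and the application use the same release."""
    return base_image_release(dockerfile_text) == application_release_label
```

A check like this could run in CI before building, for example `releases_match(open("Dockerfile").read(), "emr-7.0.0")`.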
The default Python version in this image is 3.9, which is what I needed. With this image, both errors went away and I was able to deploy my application to EMR Serverless.
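For context, the `./environment/bin/python` path in the first error comes from how a packaged virtual environment is usually attached to a job: the archive is unpacked under an alias, and the driver and executors are pointed at the interpreter inside it. The sketch below builds the corresponding Spark properties following the pattern in the AWS documentation for Python libraries on EMR Serverless. The function name and the S3 location are hypothetical, and this is not necessarily the exact set of parameters my job used.

```python
def build_spark_submit_parameters(archive_uri, env_dir="environment"):
    """Return a sparkSubmitParameters string that uses a packaged virtual environment.

    `archive_uri` is the S3 location of pyspark_deps.tar.gz (hypothetical here);
    `env_dir` is the alias the archive is unpacked under on the driver and executors.
    """
    python_path = f"./{env_dir}/bin/python"
    properties = {
        # The "#environment" suffix sets the directory name the archive is extracted to.
        "spark.archives": f"{archive_uri}#{env_dir}",
        "spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON": python_path,
        "spark.emr-serverless.driverEnv.PYSPARK_PYTHON": python_path,
        "spark.executorEnv.PYSPARK_PYTHON": python_path,
    }
    return " ".join(f"--conf {key}={value}" for key, value in properties.items())
```

For example, `build_spark_submit_parameters("s3://example-bucket/artifacts/pyspark_deps.tar.gz")` yields four `--conf` entries that can be passed as the job's Spark submit parameters.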