bgereke (10/17/2021, 9:52 PM)

datajoely (10/18/2021, 9:09 AM)

napoleon_borntoparty (10/18/2021, 9:10 AM)
deploy or client mode?

bgereke (10/18/2021, 4:06 PM)

napoleon_borntoparty (10/18/2021, 4:29 PM)
pyspark and pandas and numpy, because they should be executing "Spark code only".
Firstly, you should create and package your environment (Conda/venv) into a tarball (.tar.gz).
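For the packaging step itself, conda-pack (or venv-pack for a plain virtualenv) can build that tarball; a minimal sketch, assuming conda-pack is installed and an environment named my_pyspark_env (the name is illustrative):

import conda_pack

# Pack the named Conda environment into a relocatable tarball that Spark can
# ship via --archives / spark.archives; the environment name is illustrative.
conda_pack.pack(
    name="my_pyspark_env",
    output="pyspark_conda_env.tar.gz",
)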
Then you should set up your environment variables so that they point the Spark driver (on the master node) to a location where it can access the libraries and modules. This can usually be done with (if you're using Conda):
export PYSPARK_DRIVER_PYTHON=python
export PYSPARK_PYTHON=./environment/bin/python
pyspark --archives pyspark_conda_env.tar.gz#environment
or, alternatively, in a notebook/IPython environment:
import os
from pyspark.sql import SparkSession
os.environ['PYSPARK_PYTHON'] = "./environment/bin/python"
spark = SparkSession.builder.config(
    "spark.archives",  # 'spark.yarn.dist.archives' in YARN.
    "pyspark_conda_env.tar.gz#environment").getOrCreate()
This should enable the driver to point the workers to where they can get the necessary Python code and resources.
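A quick way to sanity-check that the executors really resolve the shipped environment is to have a task report its interpreter and an example library version (a rough diagnostic sketch, assuming the spark session from the snippet above):

# Quick diagnostic: ask each task which Python it runs and whether pandas
# resolves from the shipped archive. pandas is just an illustrative import.
def describe_worker(_):
    import sys
    import pandas
    return f"{sys.executable} | pandas {pandas.__version__}"

print(
    spark.sparkContext.parallelize(range(4), 4)
    .map(describe_worker)
    .distinct()
    .collect()
)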
Please note this gets more expensive with big DS projects due to the sheer number of DS libs that need to be independently packaged.

cluster mode is similar in essence, but you obviously lose access to the CLI, as you package the entire project as well.
You'll likely need to build the project into a whl file, package up all dependencies and explicitly distribute them to all worker nodes, then supply all conf paths and settings to spark-submit directly, for example:
#!/bin/bash
PYTHON_ZIP="./dependencies.zip#pythonlib"
PYTHON_WHL="./src/dist/my-project-build-py3-none-any.whl"
SPARK_CMD="spark-submit \
--conf 'spark.yarn.dist.archives=${PYTHON_ZIP}' \
--conf 'spark.yarn.appMasterEnv.PYTHONPATH=pythonlib' \
--conf 'spark.executorEnv.PYTHONPATH=pythonlib' \
--conf 'spark.jars.packages=com.amazonaws:aws-java-sdk:1.11.271,org.apache.hadoop:hadoop-aws:2.10.1' \
--conf 'spark.sql.shuffle.partitions=16' \
--conf 'spark.hadoop.fs.s3a.committer.staging.conflict-mode=replace' \
--conf 'spark.hadoop.fs.s3a.committer.name=partitioned' \
--conf 'spark.hadoop.fs.s3a.connection.maximum=5000' \
--conf 'spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35' \
--conf 'spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35' \
--conf 'spark.hadoop.fs.s3a.fast.upload=true' \
--conf 'spark.hadoop.fs.s3a.fast.upload.buffer=array' \
--py-files ./dependencies.zip,${PYTHON_WHL} \
run.py"
eval ${SPARK_CMD}
Note the spark.yarn settings added, since YARN is the EMR resource manager for Spark in cluster mode.
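If you prefer to keep those settings in Python rather than in a shell wrapper, the same keys can be set on the builder; a sketch assuming client mode on the EMR master node, with the values taken from the script above:

from pyspark.sql import SparkSession

# Same archive / PYTHONPATH settings as the spark-submit script above,
# expressed on the builder (assumes client mode on the EMR master node).
spark = (
    SparkSession.builder
    .config("spark.yarn.dist.archives", "./dependencies.zip#pythonlib")
    .config("spark.yarn.appMasterEnv.PYTHONPATH", "pythonlib")
    .config("spark.executorEnv.PYTHONPATH", "pythonlib")
    .config("spark.sql.shuffle.partitions", "16")
    .getOrCreate()
)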
bgereke (10/18/2021, 7:38 PM)

datajoely (10/20/2021, 5:32 PM)

bgereke (10/20/2021, 6:57 PM)

napoleon_borntoparty (10/21/2021, 4:38 PM)
pyspark as an entrypoint and see if it works. IIRC my dependency setup looked something like this:
prepare-deployment-dependencies:
	echo "Packaging Project" && \
	kedro package && \
	rm -rf deployment && \
	rm -rf deployment-conf && \
	mkdir deployment && \
	mkdir deployment-conf && \
	pip install src/dist/<my-packaged-project>.whl -t deployment/ && \
	TMP_FILE=yaml_files.tmp && \
	TMP=$$(find ./ -name "*.yml" -not -path "*deployment*") && \
	echo $$TMP | xargs -n1 > $$TMP_FILE && \
	rsync --verbose --files-from=$$TMP_FILE ./ ./deployment-conf && \
	rm $$TMP_FILE && \
	cd deployment && zip -r ../dependencies.zip . && \
	cd ../deployment-conf && zip -ur ../dependencies.zip .
dependencies.zip is then referenced as per the script above:

PYTHON_ZIP="./dependencies.zip#pythonlib"
PYTHON_WHL="./src/dist/my-project-build-py3-none-any.whl"
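It can also be worth a quick look inside the archive to confirm that the installed packages and the conf YAMLs actually made it in; a small stdlib check (illustrative):

import zipfile

# List the archive contents to confirm site-packages and the conf *.yml files
# landed in dependencies.zip before handing it to spark-submit.
with zipfile.ZipFile("dependencies.zip") as zf:
    names = zf.namelist()

print(len(names), "entries")
print([n for n in names if n.endswith(".yml")][:10])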
This works with spark-submit or client mode with pyspark. The spark-submit entrypoint looked something simple like:
from <my-packaged-project-module>.run import run_package
run_package()
data princess (11/05/2021, 3:01 PM)

bgereke (11/05/2021, 3:32 PM)

datajoely (11/05/2021, 3:58 PM)

data princess (11/05/2021, 4:15 PM)

datajoely (11/05/2021, 4:17 PM)

data princess (11/05/2021, 4:18 PM)

datajoely (11/05/2021, 4:36 PM)