No, the "No such file" error was in reference to ...
# beginners-need-help
b
No, the "No such file" error was in reference to the kedro "binary" that gets called when you do "kedro run". I tried transferring that file to a known location on the cluster and it executed fine but then failed to "import kedro" because I hadn't installed kedro on the cluster. I then tried creating a new emr cluster with a "pip install kedro" included in a boostrap.sh. With kedro installed on the cluster, I can now "import kedro" but "kedro run" still errors at "import git" inside kedro.framework.cli.starters, so it seems there are still more dependencies to install. What I was wondering is if there is a way to do runs on the remote cluster without having to install all of the dependencies on the cluster? Maybe something like the databricks-connect example but for emr?
Update: I've been able to successfully run the pyspark-iris starter on databricks following the databricks-connect example: https://kedro.readthedocs.io/en/latest/10_deployment/08_databricks.html#run-the-kedro-project-with-databricks-connect. I'm wondering if there is a similar sort of workflow to trigger a remote run from a local IDE for AWS EMR?
d
Hey @User I've checked with some people who have experience on the matter and I'll summarise here:
n
hi @User, I've done this some time ago - are you intending to run in `cluster` or `client` deploy mode?
b
I think client mode will be good for now.
n
Ah ok. So the setup is twofold:
1. Master node - requires the same setup as if you were developing locally.
2. Worker nodes - require only basic libraries, such as `pyspark`, `pandas` and `numpy`, because they should be executing "Spark code only".

Firstly, you should create and package your environment (Conda/venv) into a tarball (`.tar.gz`). Then you should set up your environment variables in such a way that they point the Spark driver (on the master node) to a place where it can get access to the libraries and modules. If you're using Conda, this can usually be done with:
```bash
export PYSPARK_DRIVER_PYTHON=python
export PYSPARK_PYTHON=./environment/bin/python
pyspark --archives pyspark_conda_env.tar.gz#environment
```
or alternatively in a notebook/ipython environment
```python
import os
from pyspark.sql import SparkSession

os.environ['PYSPARK_PYTHON'] = "./environment/bin/python"
spark = SparkSession.builder.config(
    "spark.archives",  # 'spark.yarn.dist.archives' in YARN.
    "pyspark_conda_env.tar.gz#environment").getOrCreate()
```
This should enable the driver to point the workers to where they can get the necessary Python code and resources. Please note this gets more expensive with big DS projects due to the sheer number of DS libraries that need to be independently packaged.
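In case it helps, one way to produce the `pyspark_conda_env.tar.gz` archive referenced above is conda-pack; the package list here is just an illustrative assumption:
```bash
# Sketch: build a relocatable environment archive to pass via --archives / spark.archives
conda create -y -n pyspark_conda_env python=3.8 pyspark pandas numpy
pip install conda-pack
conda pack -f -n pyspark_conda_env -o pyspark_conda_env.tar.gz
```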
Now, `cluster` mode is similar in essence, but you obviously lose access to the CLI, as you package the entire project as well. You'll likely need to build the project into a `whl` file, package up all dependencies and explicitly distribute them to all worker nodes, then supply all `conf` paths and settings to `spark-submit` directly, for example:
```bash
#!/bin/bash

PYTHON_ZIP="./dependencies.zip#pythonlib"
PYTHON_WHL="./src/dist/my-project-build-py3-none-any.whl"

SPARK_CMD="spark-submit \
        --conf 'spark.yarn.dist.archives=${PYTHON_ZIP}' \
        --conf 'spark.yarn.appMasterEnv.PYTHONPATH=pythonlib' \
        --conf 'spark.executorEnv.PYTHONPATH=pythonlib' \
        --conf 'spark.jars.packages=com.amazonaws:aws-java-sdk:1.11.271,org.apache.hadoop:hadoop-aws:2.10.1' \
        --conf 'spark.sql.shuffle.partitions=16' \
        --conf 'spark.hadoop.fs.s3a.committer.staging.conflict-mode=replace' \
        --conf 'spark.hadoop.fs.s3a.committer.name=partitioned' \
        --conf 'spark.hadoop.fs.s3a.connection.maximum=5000' \
        --conf 'spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35' \
        --conf 'spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35' \
        --conf 'spark.hadoop.fs.s3a.fast.upload=true' \
        --conf 'spark.hadoop.fs.s3a.fast.upload.buffer=array' \
        --py-files ./dependencies.zip,${PYTHON_WHL} \
        run.py"

eval ${SPARK_CMD}
```
As you can see, there are now `spark.yarn.*` settings added, since YARN is the EMR resource manager for Spark in `cluster` mode.
Hope this helps!
b
Thanks a ton, I'll give it a shot!
d
Let us know how it goes - if it works, we can basically copy and paste @User's comments into the docs.
b
Okay, so I've tried a few variations of the instructions above to run kedro in client mode on EMR and haven't been able to get a successful remote run yet. The rough steps I've followed are (using PyCharm):
1. create a new conda env (name: kedro_env)
2. pip install kedro
3. kedro new --starter=pyspark-iris (name: kedro_project)
4. cd into project root
5. pip install -r src/requirements.txt (kedro install doesn't appear to work)
6. kedro run (pipeline successfully runs)
7. pip install conda-pack
8. conda pack -f -o kedro_env.tar.gz
9. set up ssh interpreter in PyCharm
10. set up PyCharm deployment configuration to mirror the project directory to the EMR cluster via sftp (EMR path: /tmp/pycharm_project_389/kedro_project)
11. in terminal, ssh into the EMR cluster and cd into the project root
12. export PYSPARK_DRIVER_PYTHON=python
13. export PYSPARK_PYTHON=./environment/bin/python
14. spark-submit --archives kedro_env.tar.gz src/kedro_project/__main__.py run

For the last few steps, I've tried a lot of variations, like using pyspark instead of spark-submit, calling the CLI (kedro run instead of src/kedro_project/__main__.py run), including the environment variable exports as configuration arguments to spark-submit instead of exporting before spark-submit, storing/reading the kedro_env.tar.gz on/from s3, etc.
When I try to submit the CLI call (kedro run), it fails to locate the CLI program. When I submit src/kedro_project/__main__.py, it fails to import dependencies (e.g., pathlib), even though the archived conda env has been provided.
Some questions:
1. Has anyone been able to set up an environment that enables you to initialize EMR runs using the CLI? Something like kedro run --from-nodes node1,node2, where the run happens on the multiple executors of the remote cluster?
2. Or is it more straightforward to submit a Python script (e.g., main.py) directly? In this case, what is the correct entrypoint script, and what directory does this script need to be called from?
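For completeness, one variation I haven't isolated yet is aliasing the archive with #environment (as the snippets above do), since my step 13 points PYSPARK_PYTHON at ./environment/bin/python. A rough sketch of that attempt, untested:
```bash
# Sketch: client-mode submission where the archive is aliased to "environment",
# so ./environment/bin/python resolves inside the executors' working directories.
export PYSPARK_DRIVER_PYTHON=python
export PYSPARK_PYTHON=./environment/bin/python
spark-submit \
  --archives kedro_env.tar.gz#environment \
  src/kedro_project/__main__.py run
```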
n
ah interesting - so initially I would recommend trying to set it up in a way that you use `pyspark` as an entrypoint and see if it works. IIRC my dependency setup looked something like this:
```make
prepare-deployment-dependencies:
	echo "Packaging Project" && \
	kedro package && \
	rm -rf deployment && \
	rm -rf deployment-conf && \
	mkdir deployment && \
	mkdir deployment-conf && \
	pip install src/dist/<my-packaged-project>.whl -t deployment/ && \
	TMP_FILE=yaml_files.tmp && \
	TMP=$$(find ./ -name "*.yml" -not -path "*deployment*") && \
	echo $$TMP | xargs -n1 > $$TMP_FILE && \
	rsync --verbose --files-from=$$TMP_FILE ./ ./deployment-conf && \
	rm $$TMP_FILE && \
	cd deployment && zip -r ../dependencies.zip . && \
	cd ../deployment-conf && zip -ur ../dependencies.zip .
```
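Assuming that target lives in the project's Makefile (an assumption on my part), you'd run it before submission with:
```bash
make prepare-deployment-dependencies
```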
then, I would provide `dependencies.zip` as per the script above:
```bash
PYTHON_ZIP="./dependencies.zip#pythonlib"
PYTHON_WHL="./src/dist/my-project-build-py3-none-any.whl"
```
the other issue that you rightly pointed out is the entrypoint - I believe that depends on your script version (but also whether you're running `spark-submit` or `client` mode with `pyspark`)
my `spark-submit` entrypoint looked something simple like:
```python
from <my-packaged-project-module>.run import run_package

run_package()
```
d
@User did you have any luck with this? Trying to get a similar setup working and wondering how far I should get into Napoleon's suggested setup above
b
Unfortunately not. If someone is able to put together some clear documentation on the many ways to develop and deploy locally and remotely with kedro on databricks vs emr vs dataproc, I think it could be very helpful. At my current skill level, the options/setup effort are a bit much. I know EMR has some docker integration which may simplify the setup and dependency management, so may be worth exploring that route as well.
d
@User have you had any success following the above? or are you checking before you start
d
Checking before I start haha. I've seen kedro work well with AWS Batch, which just executes a job within a container. But you can't connect to a spark cluster that way so it's not well suited to large-scale data
d
Okay - we've used Kedro many many times with EMR internally - it does require the packaging deployment workflow above but it can work effectively.
d
Interesting. Yeah I'm amenable to that flow, just haven't done it before so would love to see an example of, e.g. how code changes get deployed Separately need to convince the team / client that kedro is the right choice 🙂
d
Well shout if you need support on that too