No, the "No such file" error was in reference to ...
# beginners-need-help
b
No, the "No such file" error was in reference to the kedro "binary" that gets called when you do "kedro run". I tried transferring that file to a known location on the cluster and it executed fine but then failed to "import kedro" because I hadn't installed kedro on the cluster. I then tried creating a new emr cluster with a "pip install kedro" included in a boostrap.sh. With kedro installed on the cluster, I can now "import kedro" but "kedro run" still errors at "import git" inside kedro.framework.cli.starters, so it seems there are still more dependencies to install. What I was wondering is if there is a way to do runs on the remote cluster without having to install all of the dependencies on the cluster? Maybe something like the databricks-connect example but for emr?
Update: I've been able to successfully run the pyspark-iris starter on databricks following the databricks-connect example: https://kedro.readthedocs.io/en/latest/10_deployment/08_databricks.html#run-the-kedro-project-with-databricks-connect. I'm wondering if there is a similar sort of workflow to trigger a remote run from a local IDE for AWS EMR?
d
Hey @User I've checked with some people who have experience on the matter and I'll summarise here:
n
hi @User, I've done this some time ago - are you intending to run in `cluster` or `client` deploy mode?
b
I think client mode will be good for now.
n
Ah ok. So the setup is twofold:
1. Master node - requires the same setup as if you were developing locally.
2. Worker nodes - require only basic libraries, such as `pyspark`, `pandas` and `numpy`, because they should be executing "Spark code only".

Firstly, you should create and package your environment (Conda/venv) into a tarball (`.tar.gz`). Then you should set up your environment variables in such a way that they point the Spark driver (on the master node) to a place where it can get access to the libraries and modules. If you're using Conda, this can usually be done with:
```bash
export PYSPARK_DRIVER_PYTHON=python
export PYSPARK_PYTHON=./environment/bin/python
pyspark --archives pyspark_conda_env.tar.gz#environment
```
or alternatively in a notebook/ipython environment
```python
import os
from pyspark.sql import SparkSession

os.environ['PYSPARK_PYTHON'] = "./environment/bin/python"
spark = SparkSession.builder.config(
    "spark.archives",  # 'spark.yarn.dist.archives' in YARN.
    "pyspark_conda_env.tar.gz#environment").getOrCreate()
```
This should enable the driver to point the workers to where they can get the necessary Python code and resources. Please note this gets more expensive with big DS projects due to the sheer number of DS libraries that need to be independently packaged.
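In case it helps, one way to produce the `pyspark_conda_env.tar.gz` archive referenced above is conda-pack; the package list here is just an illustrative assumption:
```bash
# Sketch: build a relocatable environment archive to pass via --archives / spark.archives
conda create -y -n pyspark_conda_env python=3.8 pyspark pandas numpy
pip install conda-pack
conda pack -f -n pyspark_conda_env -o pyspark_conda_env.tar.gz
```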
Now, `cluster` mode is similar in essence, but you obviously lose access to the CLI, as you package the entire project as well. You'll likely need to build the project into a `whl` file, package up all dependencies and explicitly distribute them to all worker nodes, then supply all `conf` paths and settings to `spark-submit` directly, for example:
```bash
#!/bin/bash

PYTHON_ZIP="./dependencies.zip#pythonlib"
PYTHON_WHL="./src/dist/my-project-build-py3-none-any.whl"

SPARK_CMD="spark-submit \
        --conf 'spark.yarn.dist.archives=${PYTHON_ZIP}' \
        --conf 'spark.yarn.appMasterEnv.PYTHONPATH=pythonlib' \
        --conf 'spark.executorEnv.PYTHONPATH=pythonlib' \
        --conf 'spark.jars.packages=com.amazonaws:aws-java-sdk:1.11.271,org.apache.hadoop:hadoop-aws:2.10.1' \
        --conf 'spark.sql.shuffle.partitions=16' \
        --conf 'spark.hadoop.fs.s3a.committer.staging.conflict-mode=replace' \
        --conf 'spark.hadoop.fs.s3a.committer.name=partitioned' \
        --conf 'spark.hadoop.fs.s3a.connection.maximum=5000' \
        --conf 'spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35' \
        --conf 'spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35' \
        --conf 'spark.hadoop.fs.s3a.fast.upload=true' \
        --conf 'spark.hadoop.fs.s3a.fast.upload.buffer=array' \
        --py-files ./dependencies.zip,${PYTHON_WHL} \
        run.py"

eval ${SPARK_CMD}
```
As you can see, there are now `spark.yarn.*` settings added, since YARN is the EMR resource manager for Spark in `cluster` mode.
Hope this helps!
b
Thanks a ton, I'll give it a shot!
d
Let us know how it goes - if it works, we can basically copy and paste @User's comments into the docs.
b
Okay, so I've tried a few variations of the instructions above to run kedro in client mode on EMR and haven't been able to get a successful remote run yet. The rough steps I've followed are (using PyCharm):
1. create a new conda env (name: kedro_env)
2. pip install kedro
3. kedro new --starter=pyspark-iris (name: kedro_project)
4. cd into project root
5. pip install -r src/requirements.txt (kedro install doesn't appear to work)
6. kedro run (pipeline successfully runs)
7. pip install conda-pack
8. conda pack -f -o kedro_env.tar.gz
9. set up ssh interpreter in PyCharm
10. set up PyCharm deployment configuration to mirror the project directory to the EMR cluster via sftp (EMR path: /tmp/pycharm_project_389/kedro_project)
11. in terminal, ssh into the EMR cluster and cd into the project root
12. export PYSPARK_DRIVER_PYTHON=python
13. export PYSPARK_PYTHON=./environment/bin/python
14. spark-submit --archives kedro_env.tar.gz src/kedro_project/__main__.py run

For the last few steps, I've tried a lot of variations, like using pyspark instead of spark-submit, calling the CLI (kedro run instead of src/kedro_project/__main__.py run), including the environment variable exports as configuration arguments to spark-submit instead of exporting before spark-submit, storing/reading the kedro_env.tar.gz on/from s3, etc.
When I try to submit the CLI call (kedro run), it fails to locate the CLI program. When I submit src/kedro_project/__main__.py, it fails to import dependencies (e.g., pathlib), even though the archived conda env has been provided.
Some questions:
1. Has anyone been able to set up an environment that enables you to initialize EMR runs using the CLI? Something like kedro run --from-nodes node1,node2, where the run happens on the multiple executors of the remote cluster?
2. Or is it more straightforward to submit a Python script (e.g., main.py) directly? In this case, what is the correct entrypoint script, and what directory does this script need to be called from?
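For completeness, one variation I haven't isolated yet is aliasing the archive with #environment (as the snippets above do), since my step 13 points PYSPARK_PYTHON at ./environment/bin/python. A rough sketch of that attempt, untested:
```bash
# Sketch: client-mode submission where the archive is aliased to "environment",
# so ./environment/bin/python resolves inside the executors' working directories.
export PYSPARK_DRIVER_PYTHON=python
export PYSPARK_PYTHON=./environment/bin/python
spark-submit \
  --archives kedro_env.tar.gz#environment \
  src/kedro_project/__main__.py run
```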
n
ah interesting - so initially I would recommend trying to set it up in a way that you use `pyspark` as an entrypoint and see if it works. IIRC my dependency setup looked something like this:
```make
prepare-deployment-dependencies:
	echo "Packaging Project" && \
	kedro package && \
	rm -rf deployment && \
	rm -rf deployment-conf && \
	mkdir deployment && \
	mkdir deployment-conf && \
	pip install src/dist/<my-packaged-project>.whl -t deployment/ && \
	TMP_FILE=yaml_files.tmp && \
	TMP=$$(find ./ -name "*.yml" -not -path "*deployment*") && \
	echo $$TMP | xargs -n1 > $$TMP_FILE && \
	rsync --verbose --files-from=$$TMP_FILE ./ ./deployment-conf && \
	rm $$TMP_FILE && \
	cd deployment && zip -r ../dependencies.zip . && \
	cd ../deployment-conf && zip -ur ../dependencies.zip .
```
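Assuming that target lives in the project's Makefile (an assumption on my part), you'd run it before submission with:
```bash
make prepare-deployment-dependencies
```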
then, I would provide `dependencies.zip` as per the script above:
```bash
PYTHON_ZIP="./dependencies.zip#pythonlib"
PYTHON_WHL="./src/dist/my-project-build-py3-none-any.whl"
```
the other issue that you rightly pointed out is the entrypoint - I believe that depends on your script version (but also whether you're running `spark-submit` or `client` mode with `pyspark`)
my `spark-submit` entrypoint looked something simple like:
```python
from <my-packaged-project-module>.run import run_package

run_package()
```
d
@User did you have any luck with this? Trying to get a similar setup working and wondering how far I should get into Napoleon's suggested setup above
b
Unfortunately not. If someone is able to put together some clear documentation on the many ways to develop and deploy locally and remotely with kedro on databricks vs emr vs dataproc, I think it could be very helpful. At my current skill level, the options/setup effort are a bit much. I know EMR has some docker integration which may simplify the setup and dependency management, so may be worth exploring that route as well.
d
@User have you had any success following the above? or are you checking before you start
d
Checking before I start haha. I've seen kedro work well with AWS Batch, which just executes a job within a container. But you can't connect to a spark cluster that way so it's not well suited to large-scale data
d
Okay - we've used Kedro many many times with EMR internally - it does require the packaging deployment workflow above but it can work effectively.
d
Interesting. Yeah I'm amenable to that flow, just haven't done it before so would love to see an example of, e.g. how code changes get deployed Separately need to convince the team / client that kedro is the right choice 🙂
d
Well shout if you need support on that too