# advanced-need-help

deepyaman

12/20/2021, 9:03 PM
Not sure if this counts as "needs help," but I'm trying to put together a guide for running Kedro pipelines with Dask (i.e. for distributed node execution), and I cleaned up somebody else's work into https://github.com/deepyaman/kedro-dask-example this weekend. All of the real work is contained in https://github.com/deepyaman/kedro-dask-example/blob/develop/src/kedro_dask_example/runner/dask_runner.py. As far as help goes:

1. If anybody wants to try it out themselves and see if it works (or doesn't) for them, any feedback is much appreciated! The easiest thing is to just `kedro run --runner kedro_dask_example.runner.DaskRunner`, but it's also not that interesting. To use the distributed scheduler, you can run `dask-scheduler` and `PYTHONPATH=$PWD/src dask-worker 127.0.0.1:8786` in a couple terminal windows, and then run the pipeline. I change the default value for `client_args` to `{"address": "127.0.0.1:8786"}` for this, because I'm lazy (but you can of course construct the runner the normal way).
2. If somebody has familiarity with Dask, a review of how I get the `Client` would be very helpful. I think `worker_client` in `_DaskDataSet` is correct, but not sure if I should be using `Client.current()` the way I am in `DaskRunner`. I think `worker_client` is unnecessary here, since it all runs on the scheduler, and `Client.as_current` seems to be for a use case where you have a client object already and want to use it, but I don't find much documentation around this and most of my understanding is from reading the `distributed` source.
3. I'll try and work on a first version of tracking load counts and releasing datasets tonight. My plan is to do it in the simplest way possible, in the `as_completed` loop. However, this feels a bit inefficient, as it really could've been released on the final load (rather than waiting for the node to finish running). I think this would require a distributed counter that `_DaskDataSet` instances could modify... is this even smart?
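For anyone following along, here is a rough sketch of the "simplest way possible" idea from point 3: count how many nodes still need each dataset, and release a dataset in the `as_completed` loop once its last consumer has finished. This is not the code from the example repo. It uses the standard library's `concurrent.futures` as a stand-in for `distributed`, and `NODES`, `run_node`, and `run_pipeline` are hypothetical names made up for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical toy pipeline: node name -> (input dataset names, output dataset name).
NODES = {
    "split": (["raw"], "train"),
    "fit": (["train"], "model"),
    "report": (["train", "model"], "summary"),
}


def run_node(name, inputs):
    """Stand-in for real node logic: just record what was combined."""
    return f"{name}({', '.join(inputs)})"


def run_pipeline(nodes, catalog):
    """Run nodes as their inputs appear; release inputs nobody needs anymore."""
    # Number of nodes that still have to load each dataset.
    load_counts = Counter(ds for inputs, _ in nodes.values() for ds in inputs)
    released = []
    todo = dict(nodes)
    futures = {}
    with ThreadPoolExecutor(max_workers=2) as pool:
        while todo or futures:
            # Submit every node whose inputs are all available.
            ready = [n for n, (ins, _) in todo.items() if all(i in catalog for i in ins)]
            for name in ready:
                inputs, _ = todo.pop(name)
                future = pool.submit(run_node, name, [catalog[i] for i in inputs])
                futures[future] = name
            if not futures:
                raise ValueError(f"unsatisfiable inputs for nodes: {sorted(todo)}")
            for future in as_completed(list(futures)):
                name = futures.pop(future)
                inputs, output = nodes[name]
                catalog[output] = future.result()
                # Release happens here, only after the whole node has finished.
                for ds in inputs:
                    load_counts[ds] -= 1
                    if load_counts[ds] == 0:
                        del catalog[ds]
                        released.append(ds)
                break  # go back and check for newly runnable nodes
    return released


if __name__ == "__main__":
    catalog = {"raw": "raw"}
    print(run_pipeline(NODES, catalog), catalog)
```

The inefficiency mentioned above is visible in the sketch: `train` stays in the catalog until `report` has finished running, even though `report` had already loaded it when it was submitted. Releasing on the final load instead would mean moving the decrement into the load itself, which is where a shared counter would come in.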

datajoely

12/21/2021, 10:53 AM
Hi @User - Deepyaman is on Discord here, but I can't remember his name. I can put you in touch if interested. I really want to make a set of Dask docs like we have for PySpark; we just haven't gotten round to it. Probably the best resource we have is this talk by the folks at Coiled: https://www.linkedin.com/posts/gustafrcavanaugh_python-ml-dask-activity-6845770788231106560-hx0v/

deepyaman

12/21/2021, 12:31 PM
@User this is Deepyaman lol

datajoely

12/21/2021, 2:03 PM
😂😂 your other username was cached

deepyaman

12/21/2021, 2:57 PM
Nah, I changed it this morning, to avoid confusion. Was too lazy to figure it out yesterday. 🙂

datajoely

12/21/2021, 2:57 PM
You have been upgraded to status 🙂
enjoy your yellow name

deepyaman

12/28/2021, 3:29 PM
I further tweaked the example repo above and turned it into a deployment guide (https://github.com/quantumblacklabs/kedro/blob/deepyaman-patch-3/docs/source/10_deployment/12_dask.md), for anybody who's interested in taking a look/reviewing! For review comments on the guide: https://github.com/quantumblacklabs/kedro/pull/1131