deepyaman
12/20/2021, 9:03 PM

`kedro run --runner kedro_dask_example.runner.DaskRunner`, but it's also not that interesting. To use the distributed scheduler, you can run `dask-scheduler` and `PYTHONPATH=$PWD/src dask-worker 127.0.0.1:8786` in a couple of terminal windows, and then run the pipeline. I changed the default value for `client_args` to `{"address": "127.0.0.1:8786"}` for this, because I'm lazy (but you can of course construct the runner the normal way).
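For reference, constructing the runner "the normal way" would look something like the sketch below, assuming the standard Kedro session API; the package name passed to `KedroSession.create` is hypothetical.

```python
# A minimal sketch, assuming DaskRunner forwards client_args to
# distributed.Client; "kedro_dask_example" is a hypothetical package name.
from kedro.framework.session import KedroSession

from kedro_dask_example.runner import DaskRunner

with KedroSession.create("kedro_dask_example") as session:
    # Point the runner at the already-running scheduler explicitly,
    # instead of relying on the changed default for client_args.
    session.run(runner=DaskRunner(client_args={"address": "127.0.0.1:8786"}))
```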
2. If somebody has familiarity with Dask, a review of how I get the `Client` would be very helpful. I think `worker_client` in `_DaskDataSet` is correct, but I'm not sure whether I should be using `Client.current()` the way I am in `DaskRunner`. I think `worker_client` is unnecessary in `DaskRunner`, since it all runs on the scheduler; `Client.as_current` seems to be for the case where you already have a client object and want to use it, but I haven't found much documentation around this, and most of my understanding comes from reading the `distributed` source.
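To make the comparison concrete, here's a minimal sketch of the two patterns, using only primitives I know exist in `distributed` (the dataset name is made up):

```python
from distributed import Client, worker_client

def load_inside_a_task():
    # The _DaskDataSet pattern: this runs inside a task on a worker, so
    # worker_client() borrows that worker's connection to the scheduler.
    with worker_client() as client:
        return client.get_dataset("example")

def load_on_the_driver():
    # The DaskRunner pattern: this runs in the process that created the
    # Client, so Client.current() returns that same client.
    client = Client.current()
    return client.get_dataset("example")
```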
3. I'll try to work on a first version of tracking load counts and releasing datasets tonight. My plan is to do it in the simplest way possible, in the `as_completed` loop. However, this feels a bit inefficient, since a dataset really could have been released on its final load (rather than waiting for the node to finish running). I think that would require a distributed counter that `_DaskDataSet` instances could modify... is this even smart?
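Roughly, the counter idea would be something like the sketch below. `Variable`, `Lock`, and `worker_client` are real `distributed` primitives; the helper functions and the naming scheme are made up:

```python
from distributed import Lock, Variable, worker_client

def set_load_count(name: str, count: int) -> None:
    # Runner-side: record how many times the dataset will be loaded.
    Variable(f"load-count-{name}").set(count)

def load_and_maybe_release(name: str):
    # Worker-side: load the published dataset, decrement the counter, and
    # unpublish on the final load instead of waiting for the node to finish.
    with worker_client() as client:
        data = client.get_dataset(name)
        with Lock(f"load-count-{name}"):
            counter = Variable(f"load-count-{name}")
            remaining = counter.get() - 1
            counter.set(remaining)
        if remaining == 0:
            client.unpublish_dataset(name)
        return data
```

The open question is whether the extra scheduler round-trips for the lock and counter cost more than just holding the data until the node finishes.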
datajoely
12/21/2021, 10:53 AM

deepyaman
12/21/2021, 12:31 PM

datajoely
12/21/2021, 2:03 PM

deepyaman
12/21/2021, 2:57 PM

deepyaman
12/28/2021, 3:29 PM