deepyaman
12/20/2021, 9:03 PM

`kedro run --runner kedro_dask_example.runner.DaskRunner`, but it's also not that interesting. To use the distributed scheduler, you can run `dask-scheduler` and `PYTHONPATH=$PWD/src dask-worker 127.0.0.1:8786` in a couple of terminal windows, and then run the pipeline. I changed the default value for `client_args` to `{"address": "127.0.0.1:8786"}` for this, because I'm lazy (but you can of course construct the runner the normal way).
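For reference, constructing the runner "the normal way" would look something like the sketch below, assuming the standard Kedro session API; the package name passed to `KedroSession.create` is hypothetical.

```python
# A minimal sketch, assuming DaskRunner forwards client_args to
# distributed.Client; "kedro_dask_example" is a hypothetical package name.
from kedro.framework.session import KedroSession

from kedro_dask_example.runner import DaskRunner

with KedroSession.create("kedro_dask_example") as session:
    # Point the runner at the already-running scheduler explicitly,
    # instead of relying on the changed default for client_args.
    session.run(runner=DaskRunner(client_args={"address": "127.0.0.1:8786"}))
```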
2. If somebody has familiarity with Dask, a review of how I get the `Client` would be very helpful. I think `worker_client` in `_DaskDataSet` is correct, but I'm not sure whether I should be using `Client.current()` the way I am in `DaskRunner`. I think `worker_client` is unnecessary in `DaskRunner`, since it all runs on the scheduler; `Client.as_current` seems to be for the case where you already have a client object and want to use it, but I haven't found much documentation around this, and most of my understanding comes from reading the `distributed` source.
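To make the comparison concrete, here's a minimal sketch of the two patterns, using only primitives I know exist in `distributed` (the dataset name is made up):

```python
from distributed import Client, worker_client

def load_inside_a_task():
    # The _DaskDataSet pattern: this runs inside a task on a worker, so
    # worker_client() borrows that worker's connection to the scheduler.
    with worker_client() as client:
        return client.get_dataset("example")

def load_on_the_driver():
    # The DaskRunner pattern: this runs in the process that created the
    # Client, so Client.current() returns that same client.
    client = Client.current()
    return client.get_dataset("example")
```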
3. I'll try to work on a first version of tracking load counts and releasing datasets tonight. My plan is to do it in the simplest way possible, in the `as_completed` loop. However, this feels a bit inefficient, since a dataset really could have been released on its final load (rather than waiting for the node to finish running). I think that would require a distributed counter that `_DaskDataSet` instances could modify... is this even smart?
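Roughly, the counter idea would be something like the sketch below. `Variable`, `Lock`, and `worker_client` are real `distributed` primitives; the helper functions and the naming scheme are made up:

```python
from distributed import Lock, Variable, worker_client

def set_load_count(name: str, count: int) -> None:
    # Runner-side: record how many times the dataset will be loaded.
    Variable(f"load-count-{name}").set(count)

def load_and_maybe_release(name: str):
    # Worker-side: load the published dataset, decrement the counter, and
    # unpublish on the final load instead of waiting for the node to finish.
    with worker_client() as client:
        data = client.get_dataset(name)
        with Lock(f"load-count-{name}"):
            counter = Variable(f"load-count-{name}")
            remaining = counter.get() - 1
            counter.set(remaining)
        if remaining == 0:
            client.unpublish_dataset(name)
        return data
```

The open question is whether the extra scheduler round-trips for the lock and counter cost more than just holding the data until the node finishes.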
datajoely
12/21/2021, 10:53 AM

deepyaman
12/21/2021, 12:31 PM

datajoely
12/21/2021, 2:03 PM

deepyaman
12/21/2021, 2:57 PM

deepyaman
12/28/2021, 3:29 PM