Schoolmeister
12/20/2021, 2:42 PM
kedro run still works
* This is the output from the terminal when starting kedro jupyter lab; everything seems OK: https://gist.github.com/michaeltoqua/95055dbb439a6a9240e61d5680c0aec5
* Even though I'm on 0.17.4, I've just tried what you suggested, and it seems to work!

datajoely
12/20/2021, 2:42 PM

datajoely
12/20/2021, 2:43 PM
kedro project rename command to make this easier

Schoolmeister
12/20/2021, 2:44 PM

Schoolmeister
12/20/2021, 2:44 PM

datajoely
12/20/2021, 2:45 PM
.ipython folder at the root of the project? But I'm out of ideas!

Schoolmeister
12/20/2021, 2:46 PM

datajoely
12/20/2021, 2:46 PM

datajoely
12/20/2021, 2:46 PM

deepyaman
12/20/2021, 9:03 PM
kedro run --runner kedro_dask_example.runner.DaskRunner, but it's also not that interesting. To use the distributed scheduler, you can run dask-scheduler and PYTHONPATH=$PWD/src dask-worker 127.0.0.1:8786 in a couple of terminal windows, and then run the pipeline. I change the default value for client_args to {"address": "127.0.0.1:8786"} for this, because I'm lazy (but you can of course construct the runner the normal way).
2. If somebody has familiarity with Dask, a review of how I get the Client would be very helpful. I think worker_client in _DaskDataSet is correct, but I'm not sure if I should be using Client.current() the way I am in DaskRunner. I think worker_client is unnecessary here, since it all runs on the scheduler, and Client.as_current seems to be for a use case where you already have a client object and want to use it, but I can't find much documentation around this and most of my understanding comes from reading the distributed source.
3. I'll try to work on a first version of tracking load counts and releasing datasets tonight. My plan is to do it in the simplest way possible, in the as_completed loop. However, this feels a bit inefficient, as a dataset really could be released on its final load (rather than waiting for the node to finish running). I think this would require a distributed counter that _DaskDataSet instances could modify... is this even smart?
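A minimal sketch of the scheduler/client wiring described above, with made-up node names; this is not the actual kedro_dask_example implementation:

from distributed import Client, as_completed

def run_node(name: str) -> str:
    # stand-in for executing a single Kedro node on a worker
    return name

# connect to the scheduler started by dask-scheduler; this matches the
# client_args default of {"address": "127.0.0.1:8786"} described above
client = Client(address="127.0.0.1:8786")

futures = [client.submit(run_node, n) for n in ["node_a", "node_b"]]
for future in as_completed(futures):
    finished = future.result()
    # point 3 above: this loop is where per-dataset load counts could be
    # decremented and fully consumed datasets released
    print(f"{finished} completed")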
datajoely
12/21/2021, 10:50 AM

Schoolmeister
12/23/2021, 9:01 AM
PartitionedDataset? I like using it, since it's a nice way to deal with structured data residing in a folder structure, but I'm not happy with the way the pipelines handle them.
Simplified, our data folder structure looks something like the following:
└───data
    ├───01_raw
    │   └───data_type1
    │       ├───entity1.csv
    │       ├───entity2.csv
    │       ├───...
    │       └───entityX.csv
    ├───02_intermediate
    │   └───data_type1
    │       ├───entity1.csv
    │       ├───entity2.csv
    │       ├───...
    │       └───entityX.csv
    └───03_primary
        └───data_type1
            ├───entity1.csv
            ├───entity2.csv
            ├───...
            └───entityX.csv
Each entity's data gets improved as it moves down through the layers. These entities are handled separately from one another. Ideally there would be a pipeline that takes in a single pd.DataFrame containing one entity's data and transforms it from the raw layer to the primary layer. In that sense, it is a horizontal execution that can be done in parallel for each entity. However, as far as I can tell, using a PartitionedDataset forces the pipeline and its nodes to accept a Dict as input. Now parallelizing becomes harder, as the pipeline stages have become vertical: every entity in the dictionary must be processed before moving on to the next stage.
Is there any way around this? We'd like to keep using DataFrames as inputs and outputs rather than Dicts, since DataFrames also convey some semantic information about what exactly the pipeline does.
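For context on the constraint being described: a node fed by a PartitionedDataset receives a dict mapping partition ids to load callables, so the per-entity loop lives inside the node. A minimal sketch, with illustrative names:

from typing import Callable, Dict

import pandas as pd

def clean_all_entities(
    partitions: Dict[str, Callable[[], pd.DataFrame]]
) -> Dict[str, pd.DataFrame]:
    # each value is a zero-argument function that loads one entity's CSV
    cleaned = {}
    for entity_id, load in partitions.items():
        cleaned[entity_id] = load().dropna()  # entities are processed one by one
    return cleaned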
datajoely
12/23/2021, 12:03 PM
PartitionedDataSet class for your own purposes?

datajoely
12/23/2021, 12:03 PM

deepyaman
12/23/2021, 11:54 PM
Assuming the transformation is the same for 01_raw/data_type1/entityM.csv -> 02_intermediate/data_type1/entityM.csv -> 03_primary/data_type1/entityM.csv as for 01_raw/data_type1/entityN.csv -> 02_intermediate/data_type1/entityN.csv -> 03_primary/data_type1/entityN.csv, it sounds to me like you want a modular pipeline that does this transformation, which you then reuse. This lets the transformation for each entity happen at the node level, which makes it easier to parallelize.
The route you described requires parallelization to occur within nodes, which runs into the blocking problem you describe. It's also less Kedronic, since you're encroaching on the runner's responsibility.
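A sketch of what such a reusable single-entity pipeline could look like; the function and dataset names are made up, and this is one possible definition of the single_entity_pipeline referenced in the snippet further down:

import pandas as pd
from kedro.pipeline import Pipeline, node

def raw_to_intermediate(df: pd.DataFrame) -> pd.DataFrame:
    return df.dropna()  # placeholder cleaning step

def intermediate_to_primary(df: pd.DataFrame) -> pd.DataFrame:
    return df.reset_index(drop=True)  # placeholder refinement step

single_entity_pipeline = Pipeline(
    [
        node(raw_to_intermediate, "raw_data", "intermediate_data"),
        node(intermediate_to_primary, "intermediate_data", "primary_data"),
    ]
)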
j c h a r l e s
12/24/2021, 8:35 AM

Schoolmeister
12/24/2021, 1:01 PM
PartitionedDataset dict and maps each key to an input of the modular pipeline, is that the right way to look at it?

deepyaman
12/26/2021, 6:12 AM
PartitionedDataSet? One area where Kedro could be a lot better, IMO, is how it handles PartitionedDataSet and how that can be read/consumed in unpartitioned form. As it stands, you can do this unpacking, as you describe, but:
1. it requires some sort of dynamic behavior to get the list of partitions
2. an unpacking pipeline really doesn't do anything, given the data already exists in split form; you're just making another set of catalog entries that point to the same data
I would probably leave the dynamic behavior to when you're constructing the pipeline, like:
from kedro.pipeline import Pipeline, pipeline

all_entities_pipeline = Pipeline([])
for i in range(NUM_ENTITIES):
    all_entities_pipeline += pipeline(single_entity_pipeline, namespace=f"entity{i}")
This is still not perfect if you want a lot of catalog entries for each entity; you'd probably need to look at using templating in that case.
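As an alternative to templating, one possible (unofficial) pattern is to generate the per-entity catalog entries programmatically; the names and paths below are illustrative:

from kedro.extras.datasets.pandas import CSVDataSet
from kedro.io import DataCatalog

NUM_ENTITIES = 3  # illustrative

catalog = DataCatalog()
for i in range(NUM_ENTITIES):
    # matches the dataset names produced by pipeline(..., namespace=f"entity{i}")
    catalog.add(
        f"entity{i}.raw_data",
        CSVDataSet(filepath=f"data/01_raw/data_type1/entity{i}.csv"),
    )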
j c h a r l e s
01/03/2022, 9:45 AM
Files I save to S3 are uploaded with content-type: binary/octet-stream, regardless of the content type of my files. I'd like to preserve their types rather than coercing them to binary/octet-stream.

j c h a r l e s
01/03/2022, 10:29 AM
my_catalog_data:
  filepath: s3://path.../data.html
  type: text.TextDataSet
  fs_args:
    open_args_save:
      ContentType: "text/html"
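A quick sanity check that the upload kept its content type (a hedged sketch; the bucket and key are placeholders for the truncated path above):

import boto3

s3 = boto3.client("s3")
head = s3.head_object(Bucket="my-bucket", Key="data.html")  # placeholder bucket/key
print(head["ContentType"])  # expect "text/html" after the fs_args change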
j c h a r l e s
01/03/2022, 10:31 AM

datajoely
01/03/2022, 4:13 PM

user
01/06/2022, 4:42 PM

austin-hilberg
01/10/2022, 5:13 PM
The kedro-starter-pandas-iris repo doesn't have a prompts.yml, and so using a local clone of the repo as a starter doesn't allow any input. Is this an oversight, or was that intentional?

datajoely
01/10/2022, 5:42 PM

austin-hilberg
01/10/2022, 5:44 PM

datajoely
01/10/2022, 5:45 PM

datajoely
01/10/2022, 5:45 PM
kedro new --starter=pandas-iris

datajoely
01/10/2022, 5:46 PM

austin-hilberg
01/10/2022, 5:55 PM