# advanced-need-help
Hey, can someone provide me some advice regarding building pipelines when using `PartitionedDataset`? I like using it, since it's a nice way to deal with structured data residing in a folder structure, but I'm not happy with the way the pipelines handle them. Simplified, our data folder structure is somewhat like the following:
```
└───data
    ├───01_raw
    │   └───data_type1
    │       ├───entity1.csv
    │       ├───entity2.csv
    │       ├───...
    │       └───entityX.csv
    ├───02_intermediate
    │   └───data_type1
    │       ├───entity1.csv
    │       ├───entity2.csv
    │       ├───...
    │       └───entityX.csv
    └───03_primary
        └───data_type1
            ├───entity1.csv
            ├───entity2.csv
            ├───...
            └───entityX.csv
```
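In case it helps to see the setup, here's roughly how the folders map onto catalog entries. This is just a minimal sketch assuming a recent kedro / kedro-datasets; the entry names (`raw_data_type1`, etc.) are placeholders, not our real catalog:

```python
from kedro.io import DataCatalog
from kedro_datasets.pandas import CSVDataset
from kedro_datasets.partitions import PartitionedDataset


def layer(folder: str) -> PartitionedDataset:
    # One PartitionedDataset per layer, with one CSV partition per entity.
    return PartitionedDataset(
        path=f"data/{folder}/data_type1",
        dataset=CSVDataset,
        filename_suffix=".csv",
    )


catalog = DataCatalog(
    {
        "raw_data_type1": layer("01_raw"),
        "intermediate_data_type1": layer("02_intermediate"),
        "primary_data_type1": layer("03_primary"),
    }
)
```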
Each entity's data gets improved the further it goes down the layers. These entities are handled separately from one another. Ideally there would be a pipeline that takes in a single `pd.DataFrame` containing the entity's data and transforms it from the raw layer to the primary layer. In that sense, it is a horizontal execution that can be done in parallel for each entity. However, as far as I can tell, using a `PartitionedDataset` forces the pipeline and its nodes to accept a `Dict` as input. Now parallelizing becomes harder, since the pipeline stages have become vertical: each entity in the dictionary must be processed before moving on to the next stage. Is there any way around this? We'd like to keep using DataFrames as inputs and DataFrames as outputs rather than Dicts, since using DataFrames also conveys some semantic information about what exactly the pipeline does.