# advanced-need-help
Hey, can someone provide me some advice regarding building pipelines when using `PartitionedDataset`? I like using it, since it's a nice way to deal with structured data residing in a folder structure, but I'm not happy with the way the pipelines handle them. Simplified, our data folder structure is somewhat like the following:
```
└───data
    ├───01_raw
    │   └───data_type1
    │       ├───entity1.csv
    │       ├───entity2.csv
    │       ├───...
    │       └───entityX.csv
    ├───02_intermediate
    │   └───data_type1
    │       ├───entity1.csv
    │       ├───entity2.csv
    │       ├───...
    │       └───entityX.csv
    └───03_primary
        └───data_type1
            ├───entity1.csv
            ├───entity2.csv
            ├───...
            └───entityX.csv
```
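In case it helps to see the setup, here's roughly how the folders map onto catalog entries. This is just a minimal sketch assuming a recent kedro / kedro-datasets; the entry names (`raw_data_type1`, etc.) are placeholders, not our real catalog:

```python
from kedro.io import DataCatalog
from kedro_datasets.pandas import CSVDataset
from kedro_datasets.partitions import PartitionedDataset


def layer(folder: str) -> PartitionedDataset:
    # One PartitionedDataset per layer, with one CSV partition per entity.
    return PartitionedDataset(
        path=f"data/{folder}/data_type1",
        dataset=CSVDataset,
        filename_suffix=".csv",
    )


catalog = DataCatalog(
    {
        "raw_data_type1": layer("01_raw"),
        "intermediate_data_type1": layer("02_intermediate"),
        "primary_data_type1": layer("03_primary"),
    }
)
```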
Each entity's data gets improved the further it goes down the layers. These entities are handled separately from one another. Ideally there would be a pipeline that takes in a single `pd.DataFrame` containing the entity's data and transforms it from the raw layer to the primary layer. In that sense, it is a horizontal execution that can be done in parallel for each entity. However, as far as I can tell, using a `PartitionedDataset` forces the pipeline and its nodes to accept a `Dict` as input. Now parallelizing becomes harder, since the pipeline stages have become vertical: each entity in the dictionary must be processed before moving on to the next stage. Is there any way around this? We'd like to keep using DataFrames as inputs and DataFrames as outputs rather than Dicts, since using DataFrames also conveys some semantic information about what exactly the pipeline does.