Schoolmeister12/23/2021, 9:01 AM
I like using it, since it's a nice way to deal with structured data residing in a folder structure, but I'm not happy with the way the pipelines handle them. Simplified, our data folder structure is somewhat like the following:

```
└───data
    ├───01_raw
    │   └───data_type1
    │       ├───entity1.csv
    │       ├───entity2.csv
    │       ├───...
    │       └───entityX.csv
    ├───02_intermediate
    │   └───data_type1
    │       ├───entity1.csv
    │       ├───entity2.csv
    │       ├───...
    │       └───entityX.csv
    └───03_primary
        └───data_type1
            ├───entity1.csv
            ├───entity2.csv
            ├───...
            └───entityX.csv
```

Each entity's data gets improved the further it goes down the layers. These entities are handled separately from one another. Ideally there is a pipeline that takes in a single DataFrame containing the entity's data and transforms it from the raw layer to the primary layer. In that sense, it is a horizontal execution that can be done in parallel for each entity. However, as far as I can tell, using a
partitioned dataset forces the pipeline and its nodes to accept a Dict as input. Parallelizing then becomes harder, because the pipeline stages have become vertical: every entity in the dictionary must be processed before any of them can move on to the next stage. Is there any way around this? We'd like to keep using DataFrames as inputs and DataFrames as outputs rather than Dicts, since using DataFrames also provides some semantic information about what exactly the pipeline does.
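To make the contrast concrete, here is a minimal sketch of the "horizontal" execution I have in mind. All names (`raw_to_intermediate`, `intermediate_to_primary`, `process_entity`) and the sample data are hypothetical placeholders, not any real pipeline API: each stage is a plain DataFrame → DataFrame function, so a single entity can flow through every layer end to end, and entities can be run in parallel independently of one another.

```python
# Hypothetical sketch of per-entity ("horizontal") execution.
# Each stage is DataFrame -> DataFrame, so entities never have to
# wait for each other between layers. Function names and the sample
# entities below are illustrative placeholders only.
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def raw_to_intermediate(df: pd.DataFrame) -> pd.DataFrame:
    # Placeholder cleaning step for the 01_raw -> 02_intermediate layer.
    return df.dropna()


def intermediate_to_primary(df: pd.DataFrame) -> pd.DataFrame:
    # Placeholder enrichment step for the 02_intermediate -> 03_primary layer.
    return df.assign(total=df["a"] + df["b"])


def process_entity(df: pd.DataFrame) -> pd.DataFrame:
    # One entity runs through every layer end to end.
    return intermediate_to_primary(raw_to_intermediate(df))


# Stand-in for the per-entity CSV files in the folder structure above.
entities = {
    "entity1": pd.DataFrame({"a": [1, 2], "b": [3, 4]}),
    "entity2": pd.DataFrame({"a": [5, None], "b": [6, 7]}),
}

# Entities are independent, so they can be processed concurrently.
with ThreadPoolExecutor() as pool:
    results = dict(zip(entities, pool.map(process_entity, entities.values())))
```

With a Dict-of-entities input, by contrast, the whole dictionary passes through `raw_to_intermediate` for every entity before anything reaches `intermediate_to_primary`, which is the vertical, stage-by-stage ordering described above.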