Matthias Roels
08/08/2022, 1:37 PM
datajoely
08/08/2022, 2:00 PM
datajoely
08/08/2022, 2:00 PM
Matthias Roels
08/08/2022, 7:28 PM
KEDRO_ENV). And on top of that, we want to be able to construct pipelines more dynamically, using logic that generates additional nodes during registration.
antheas
08/09/2022, 12:27 PM
antheas
08/09/2022, 12:29 PM
antheas
08/09/2022, 12:30 PM
antheas
08/09/2022, 12:58 PM
datajoely
08/09/2022, 1:12 PM
Barros
08/10/2022, 8:36 PM
datajoely
08/10/2022, 9:13 PM
Barros
08/10/2022, 9:39 PM
__init__ method of PartitionedDataSet but I think there should be a better way
datajoely
08/10/2022, 9:40 PM
Barros
08/10/2022, 9:41 PM
Barros
08/10/2022, 9:41 PM
Barros
08/10/2022, 9:41 PM
Barros
08/10/2022, 9:42 PM
Barros
08/10/2022, 9:46 PM
Matthias Roels
08/11/2022, 12:50 PM
datajoely
08/11/2022, 12:57 PM
javier.16
08/11/2022, 2:50 PM
antheas
08/12/2022, 11:08 PM
antheas
08/12/2022, 11:19 PM
_dataset_csv: &dataset_csv
  type: pandas.CSVDataSet
  layer: raw
  filepath: ""
dataset_2021_2:
  <<: *dataset_csv
  filepath: ${base_location}/raw/x/t/z.csv.gz
Then you can instantiate your pipeline with a node whose input is a dictionary {n: n for n in your_dataset_names}. You can even use jinja2 to template it. After that you can just dump everything into a parquet file, assuming it fits in RAM. If you don't like the duplication this causes (you need to define your dataset names both in your catalog and in your code), you can instead instantiate your datasets in Python using the after_catalog_created hook.
Ofc this assumes that you don't need the partitioned features like lazy loading. I did check out the PartitionedDataset class. It's just 300 lines. What you want should be manageable by overriding __init__, adding a list param for your datasets and then calling the super().__init__() method. Then you can use that list in _list_partitions() to return your files instead of a directory listing.
antheas
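A minimal sketch of the dictionary-input idea, under a few assumptions: `dataset_2021_1` is a hypothetical second catalog entry, `combined` is a hypothetical output dataset name, and plain lists of rows stand in for the DataFrames that `pandas.CSVDataSet` would load. The helper names `build_inputs`, `combine` and `make_combine_node` are illustrative only.

```python
# Hypothetical dataset names; each must match an entry in the catalog.
DATASET_NAMES = ["dataset_2021_1", "dataset_2021_2"]


def build_inputs(names):
    """Map each keyword argument name to the catalog entry of the same name."""
    return {name: name for name in names}


def combine(**datasets):
    """Merge all loaded datasets into one list of rows, tagging each row
    with the name of the dataset it came from.

    Lists of dicts stand in for the DataFrames a CSV dataset would load.
    """
    combined = []
    for name, rows in sorted(datasets.items()):
        for row in rows:
            combined.append({"source": name, **row})
    return combined


def make_combine_node():
    """Wire `combine` into a Kedro node (requires Kedro to be installed)."""
    # Imported lazily so the rest of the sketch runs without Kedro.
    from kedro.pipeline import node

    # With a dict of inputs, Kedro passes each loaded dataset to the
    # function as a keyword argument named after the dict key.
    return node(
        func=combine,
        inputs=build_inputs(DATASET_NAMES),
        outputs="combined",  # hypothetical output, e.g. a parquet entry
    )
```

Because the function takes `**datasets`, adding a dataset only means adding its name to the list; the function body does not change.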
08/12/2022, 11:22 PM
PartitionedDataset works without changes 🤷‍♂️
Barros
08/13/2022, 7:01 PM
Barros
08/13/2022, 8:03 PM
Barros
08/13/2022, 8:04 PM
Barros
08/13/2022, 8:04 PM
datajoely
08/13/2022, 8:04 PM