antheas
08/12/2022, 11:19 PM
_dataset_csv: &dataset_csv
  type: pandas.CSVDataSet
  layer: raw
  filepath: ""

dataset_2021_2:
  <<: *dataset_csv
  filepath: ${base_location}/raw/x/t/z.csv.gz
Then you can instantiate your pipeline with a node whose input is a dictionary {n: n for n in your_dataset_names}. You can even use jinja2 to template it. After that you can just dump everything into a single parquet file, assuming it fits in RAM. If you don't like that this causes duplication (you have to define your dataset names both in your catalog and in your code), then you can instantiate your datasets in Python using the after_catalog_created hook instead.
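A minimal sketch of that node-input pattern; combine_partitions, the dataset names, and the output name are all hypothetical, only the {n: n for n in ...} mapping comes from the message above:

```python
import pandas as pd

# Hypothetical combining function: each catalog dataset arrives as a named
# keyword argument, and everything is concatenated into one frame
# (this is the fits-in-RAM assumption from the message).
def combine_partitions(**dataframes: pd.DataFrame) -> pd.DataFrame:
    return pd.concat(dataframes.values(), ignore_index=True)

# Assumed names; in a real pipeline these must match the catalog entries.
dataset_names = ["dataset_2021_1", "dataset_2021_2"]

# The kedro node would then be wired roughly like:
# node(combine_partitions, inputs={n: n for n in dataset_names}, outputs="combined")
```

The dict form of inputs makes kedro pass each dataset as a keyword argument, which is what lets a single **dataframes parameter absorb an arbitrary, templated list of datasets.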
Of course, this assumes that you don't need the partitioned features like lazy loading. I did check out the PartitionedDataset class. It's just 300 lines. What you want should be manageable by overriding __init__, adding a list param for your datasets, and then calling the super().__init__() method. Then you can use that list in _list_partitions() to return your files instead of a directory listing.
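A minimal sketch of that subclass idea. To keep the snippet self-contained, PartitionedDataSetStub here is a tiny stand-in for kedro's real PartitionedDataset (which does directory listing in _list_partitions); the class and parameter names are assumptions:

```python
class PartitionedDataSetStub:
    """Stand-in for kedro.io.PartitionedDataSet, just enough to run this sketch.
    The real class resolves a filesystem from `path` and lists it."""

    def __init__(self, path, dataset):
        self._path = path
        self._dataset = dataset

    def _list_partitions(self):
        # The real implementation returns a directory listing here.
        raise NotImplementedError


class ExplicitPartitionedDataSet(PartitionedDataSetStub):
    """Partitions come from an explicit file list instead of a listing."""

    def __init__(self, path, dataset, partitions):
        # Keep the user-supplied list of partition paths, then let the
        # parent class finish the usual setup.
        self._partitions = list(partitions)
        super().__init__(path=path, dataset=dataset)

    def _list_partitions(self):
        # Return the supplied files instead of listing the directory.
        return self._partitions
```

Against the real kedro class you would forward the remaining keyword arguments (credentials, load_args, etc.) to super().__init__() unchanged; only _list_partitions needs to change.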