antheas

08/12/2022, 11:19 PM
You can use YAML's anchor/merge ("extension") syntax to define your datasets:
```yaml
# Anchor holding the shared fields; the leading underscore keeps it
# out of the catalog as a dataset of its own.
_dataset_csv: &dataset_csv
  type: pandas.CSVDataSet
  layer: raw
  filepath: ""

dataset_2021_2:
  <<: *dataset_csv  # merge in the shared fields
  filepath: ${base_location}/raw/x/t/z.csv.gz
```
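After the `<<` merge key is resolved, `dataset_2021_2` is equivalent to this plain entry — the anchored keys are pulled in and the locally defined `filepath` wins:

```yaml
dataset_2021_2:
  type: pandas.CSVDataSet
  layer: raw
  filepath: ${base_location}/raw/x/t/z.csv.gz
```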
Then you can instantiate your pipeline with a node that takes as inputs a dictionary `{n: n for n in your_dataset_names}`. You can even use Jinja2 to template the catalog. After that you can just dump everything into a single Parquet file, assuming it fits in RAM.

If you don't like that this duplicates your dataset names (you need to define them both in the catalog and in your code), you can instead register the datasets in Python with the `after_catalog_created` hook. Of course, this all assumes you don't need the partitioned features like lazy loading.

I did check out the `PartitionedDataSet` class; it's just ~300 lines. What you want should be manageable by overriding `__init__`, adding a list parameter for your datasets, and then calling `super().__init__()`. Then you can use that list in `_list_partitions()` to return your files instead of a directory listing.
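A minimal sketch of the combining node. The dataset names and the `combined_parquet` output are placeholders; the Kedro wiring is kept in comments so the snippet stays self-contained and only the pandas part is concrete:

```python
import pandas as pd

your_dataset_names = ["dataset_2021_2", "dataset_2021_3"]  # placeholder names

def combine(**dfs: pd.DataFrame) -> pd.DataFrame:
    # Dict inputs arrive as keyword arguments, one DataFrame per dataset;
    # concatenate them all into one frame. The node's output can then be
    # a pandas.ParquetDataSet entry in the catalog.
    return pd.concat(dfs.values(), ignore_index=True)

# Kedro wiring (requires kedro installed):
# from kedro.pipeline import node
# combine_node = node(
#     combine,
#     inputs={n: n for n in your_dataset_names},  # the dict from above
#     outputs="combined_parquet",
# )
```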
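A sketch of the hook-based alternative. The `csv_dataset_config` helper and its path scheme are made up for illustration; the hook class itself needs kedro installed, so it is shown in comments and only the config-building part is live code:

```python
def csv_dataset_config(base_location: str, name: str) -> dict:
    # Build the same entry the YAML anchor produced, but in Python,
    # so dataset names live in one place. Hypothetical path layout.
    return {
        "type": "pandas.CSVDataSet",
        "layer": "raw",
        "filepath": f"{base_location}/raw/{name}.csv.gz",
    }

# Hook wiring (requires kedro; register the class in settings.py HOOKS):
# from kedro.framework.hooks import hook_impl
# from kedro_datasets.pandas import CSVDataSet
#
# class RegisterRawCSVs:
#     @hook_impl
#     def after_catalog_created(self, catalog):
#         for name in your_dataset_names:
#             cfg = csv_dataset_config("/data", name)
#             catalog.add(name, CSVDataSet(filepath=cfg["filepath"]))
```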
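A sketch of that subclass idea. To keep the snippet dependency-free, `PartitionedDataSet` here is a tiny stand-in base class, not the real one from `kedro.io` (whose `__init__` takes more arguments); the override pattern is the same either way:

```python
class PartitionedDataSet:
    """Stand-in for kedro.io.PartitionedDataSet, simplified for illustration."""

    def __init__(self, path: str, dataset: str):
        self._path = path
        self._dataset = dataset

    def _list_partitions(self):
        # The real implementation lists files under self._path.
        raise NotImplementedError


class ListPartitionedDataSet(PartitionedDataSet):
    """Partitioned dataset whose partitions come from an explicit list."""

    def __init__(self, path: str, dataset: str, partitions: list[str]):
        # Extra list parameter, then defer the rest to the parent class.
        super().__init__(path=path, dataset=dataset)
        self._partitions = partitions

    def _list_partitions(self):
        # Return the configured files instead of a directory listing.
        return self._partitions
```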