I'm getting my first pipeline up and running. I've been focusing on learning how to use params as input to my pipelines, and I'm really starting to like Kedro.
My next step is to harvest the value of the Data Catalog, and this is where I struggle.
I am looking for anyone working with HDFDataSet who can give me a little guidance. I am refactoring existing code into a Kedro data engineering pipeline.
New data arrives in a JSON file. I process the data and today I store it in an HDF file, splitting the data by "id" into separate DataFrames, each stored with its "id" as the key.
If the file and key already exist I append the data, otherwise I write a new key. (At regular intervals new files are created to keep file sizes reasonable.)
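Roughly what my current (pre-Kedro) save step looks like, with placeholder column names and values:

```python
import pandas as pd  # requires PyTables ("tables") for HDF support


def save_by_id(df: pd.DataFrame, path: str) -> None:
    """Append each id's rows to its own key in one HDF file (current, non-Kedro code)."""
    with pd.HDFStore(path, mode="a") as store:
        for id_, group in df.groupby("id"):
            store.append(              # append creates the key if it does not exist yet
                key=str(id_),
                value=group,
                format="table",        # table format so appends work
                min_itemsize=50,       # placeholder: reserves width for string columns
                data_columns=True,     # placeholder: index columns for on-disk queries
            )
```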
So now I am exploring the Kedro Data Catalog and HDFDataSet. My first challenge is that you have to specify the key when creating a dataset, rather than passing it as an argument when saving. I will have up to 5000 "ids" that I do not know before reading a file. So I think a `before_pipeline_run` hook is needed to create a dataset for each file/key combination during processing, correct? Something like the sketch below is what I have in mind.
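(The dataset names, paths and hard-coded ids here are just placeholders, and I am not sure this is the recommended pattern.)

```python
from kedro.extras.datasets.pandas import HDFDataSet
from kedro.framework.hooks import hook_impl


class DynamicHDFHooks:
    """Sketch: register one HDFDataSet per id before the run starts."""

    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        ids = ["id_001", "id_002"]  # in reality these would come from the incoming file
        for id_ in ids:
            catalog.add(
                f"measurements_{id_}",
                HDFDataSet(
                    filepath="data/02_intermediate/measurements.h5",  # placeholder path
                    key=id_,
                ),
            )
```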
But when I try to create a dataset with HDFDataSet, I struggle to use my old save parameters (I use pandas today). I pass format, min_itemsize and data_columns, but I get an error telling me they are not valid save parameters. Roughly what I tried is shown below. Help wanted.
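(File path, key and the concrete values are placeholders; the three save_args are the ones that get reported as invalid.)

```python
from kedro.extras.datasets.pandas import HDFDataSet

dataset = HDFDataSet(
    filepath="data/02_intermediate/measurements.h5",  # placeholder path
    key="id_001",                                     # placeholder key
    save_args={
        "format": "table",     # these three are accepted by pandas today
        "min_itemsize": 50,    # but reported as invalid when used
        "data_columns": True,  # as save_args on HDFDataSet
    },
)
```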
My use case looks like a perfect fit for PartitionedDataSet, or even better IncrementalDataSet.
But I don't want 5000 files to be created; I want the partitions to be keys in the same HDF file. Does anyone have something like this working and would be willing to share some code? The sketch below is roughly the idea.
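(Just a sketch of the idea, not tested: a custom dataset where one save call writes or appends a whole dict of DataFrames as separate keys into a single HDF file.)

```python
from typing import Dict

import pandas as pd
from kedro.io import AbstractDataSet


class MultiKeyHDFDataSet(AbstractDataSet):
    """Sketch: store a dict of DataFrames as separate keys in one HDF file."""

    def __init__(self, filepath: str, save_args: dict = None):
        self._filepath = filepath
        self._save_args = save_args or {"format": "table"}

    def _save(self, data: Dict[str, pd.DataFrame]) -> None:
        with pd.HDFStore(self._filepath, mode="a") as store:
            for key, df in data.items():
                store.append(key, df, **self._save_args)  # appends if key exists, else creates it

    def _load(self) -> Dict[str, pd.DataFrame]:
        with pd.HDFStore(self._filepath, mode="r") as store:
            return {key.lstrip("/"): store[key] for key in store.keys()}

    def _describe(self) -> dict:
        return {"filepath": self._filepath, "save_args": self._save_args}
```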
Feel free to PM me if you want to help.