# beginners-need-help
I'm getting my first pipeline up and running. I've been focusing on learning how to use params as input to my pipelines, and I'm really starting to like Kedro 🙂 My next step is to harvest the value of the Data Catalog, and here I struggle. I am looking for anyone working with HDFDataSet who can give me a little guidance. I am refactoring existing code as a Kedro data engineering pipeline. New data arrives in a JSON file. I process the data and today I store it in an HDF file, splitting the data by "id" into separate dataframes, each stored with its "id" as key. If the file and key already exist I append the data, otherwise I write it. (At intervals, new files are created to keep file sizes reasonable.) So now I am exploring the option of using the Kedro Data Catalog and HDFDataSet. My first challenge is that you have to specify the key when creating a dataset, rather than passing it as an argument when saving. I will have up to 5000 "ids" that I do not know before reading a file. So I think a "before_pipeline hook" is needed to create a dataset for each file/key combination during processing, correct? But when trying to create a dataset with HDFDataSet, I struggle to use my old save params (I use pandas today). I use format, min_itemsize and data_columns today, but I get an error telling me they are not valid save_params. Help wanted. My use case looks like a perfect fit for PartitionedDataSet or, even better, IncrementalDataSet 🙂 But I don't want 5000 files to be created; I want the partitions to be keys in the same HDF file. Does anyone have something like this working and is willing to share some code? Feel free to PM me if you want to help.
Welcome to the community!
So our current dataset is set up to overwrite the same file, and even if you use PartitionedDataSet we will create new files in a folder. Off the top of my head, the approach I would suggest is to subclass the dataset and override its save method so that it appends rather than overwrites. Instructions for doing something like this can be found here: I think you could achieve this with various hooks, but I think that would overcomplicate things.
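The subclass-and-append suggestion can be sketched roughly as follows. This is a minimal illustration, not Kedro's own implementation: the class name `AppendingHDFDataSet` and the helper `split_by_id` are hypothetical, and the class is written as a plain Python class so that it runs with pandas and PyTables alone. In a real project it would inherit from Kedro's abstract dataset base class, which expects the `_load`, `_save` and `_describe` methods shown here.

```python
from pathlib import Path
from typing import Any, Dict, Optional

import pandas as pd


class AppendingHDFDataSet:
    """Hypothetical sketch: one HDF file, one table per key, appended on save.

    Assumption: in a Kedro project this would subclass Kedro's abstract
    dataset base class; that import is omitted to keep the sketch standalone.
    """

    def __init__(self, filepath: str, save_args: Optional[Dict[str, Any]] = None):
        self._filepath = Path(filepath)
        # The "table" format is what allows appending (and later querying).
        self._save_args = {"format": "table", **(save_args or {})}

    def _save(self, data: Dict[str, pd.DataFrame]) -> None:
        # Mode "a" creates the file if it is missing and keeps existing keys.
        with pd.HDFStore(str(self._filepath), mode="a") as store:
            for key, frame in data.items():
                # append() creates the table on first use, then adds rows.
                store.append(key, frame, **self._save_args)

    def _load(self) -> Dict[str, pd.DataFrame]:
        # Return every table in the file, keyed without the leading "/".
        with pd.HDFStore(str(self._filepath), mode="r") as store:
            return {key.lstrip("/"): store[key] for key in store.keys()}

    def _describe(self) -> Dict[str, Any]:
        return {"filepath": str(self._filepath), "save_args": self._save_args}


def split_by_id(frame: pd.DataFrame, id_column: str = "id") -> Dict[str, pd.DataFrame]:
    """Split one frame into {"id_<value>": sub-frame}, ready for _save()."""
    return {f"id_{value}": group for value, group in frame.groupby(id_column)}
```

Because the keys are only discovered when a batch is saved, no dataset has to be registered per id up front. Arguments such as `min_itemsize` and `data_columns` can be passed through `save_args`, since `HDFStore.append` accepts them.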
Hi, and thanks for the pointer. That was an approach I had not thought of. Then I might be able to solve my challenge with "save_params" (or arguments) as well. I will use the documentation and give it a try 🙂
It’s worth noting that we’re going to drop Windows support for HDFDataSet in Python 3.7 specifically
Hopefully that doesn’t affect you
But it came up today so I thought I’d mention
I was looking through the source code for HDFDataSet, and it might be too much work to rewrite it for my current approach, for now. I may get back to this when I have the first delivery done and get some time for refactoring 🙂
@User Thanks for taking the time to help me 👍
I would also say I’m a bit of a Parquet fanboy, so if you can use that I would recommend it
I also like Parquet as a column-based format, and we use it today. But we started with HDF as a container for multiple dataframes (by key), and I really like that you can read only one table/key, or a subset of rows from a table, with a simple query. There is not one format that is great for all needs 🙂
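The selective read mentioned above can be illustrated with plain pandas. This is a small sketch under assumptions: the helper names `write_table` and `read_subset` and the column name `value` are made up for the example, and the table must be written in "table" format with the queried column listed in `data_columns`.

```python
import pandas as pd


def write_table(filepath: str, key: str, frame: pd.DataFrame) -> None:
    """Store a frame under `key` so that it can be queried on the "value" column."""
    # "table" format plus data_columns is what makes row-level queries possible.
    frame.to_hdf(filepath, key=key, format="table", data_columns=["value"])


def read_subset(filepath: str, key: str, where: str) -> pd.DataFrame:
    """Read only the rows of one table that match the `where` expression."""
    return pd.read_hdf(filepath, key=key, where=where)
```

For example, `read_subset("data.h5", "id_1", "value > 1.5")` would load only the matching rows of the `id_1` table, without reading the other keys in the file.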
OK, while at it, a bit off-topic maybe... and I am not sure if it is possible, but Kafka seems to be breaking into all kinds of spaces now. A KafkaDataSet?
So I'm all for it, but I guess the hard thing is to reconcile how streaming should work in Kedro. We had a Spark streaming prototype a while back that could work for this. Do you use Kafka day to day in your workflow?
We are not using Kafka today, but we are considering a new architecture for our production environment, and Kafka is one option I will evaluate.