#beginners-need-help
Stefan P

10/02/2021, 7:47 PM
Getting my first pipeline up and running. I've been focusing on learning how to use params as input to my pipelines, and I'm really starting to like Kedro 🙂 My next step is to harvest the value of the Data Catalog, and here I struggle. I am looking for anyone working with HDFDataSet who can give me a little guidance.

I am refactoring existing code as a Kedro data engineering pipeline. New data arrives in a JSON file. I process the data, and today I store it in an HDF file, splitting the data by "id" into separate dataframes stored with "id" as the key. If the file and key already exist I append the data, otherwise I write it. (At certain intervals new files are created to maintain decent file sizes.)

So now I'm exploring the option of using the Kedro Data Catalog and HDFDataSet. My first challenge is that you have to specify the key when creating a dataset, rather than passing it as an argument when saving. I will have up to 5000 "ids", which I do not know before reading a file. So I think a "before_pipeline" hook is needed to create a dataset for each file/key combination during processing, correct?

But when trying to create a dataset with HDFDataSet, I struggle to use my old save params (I use pandas today). I use format, min_itemsize and data_columns today, but I get an error telling me they are not valid save_params. Help wanted.

My use case looks like a perfect fit for PartitionedDataSet, or even better IncrementalDataSet 🙂 But I don't want 5000 files to be created; I want the partitions to be keys in the same HDF file. Does anyone have something like this working and willing to share some code? Feel free to PM me if you want to help.
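[A minimal sketch of the split-by-id, append-per-key layout described above, in plain pandas. The function names, the "id" column, and the save args (including the "text" column in min_itemsize) are illustrative, not from the thread:]

```python
import pandas as pd

def split_by_id(df: pd.DataFrame, id_col: str = "id") -> dict:
    """Split a frame into one sub-frame per id, matching the one-key-per-id HDF layout."""
    return {str(key): group.reset_index(drop=True) for key, group in df.groupby(id_col)}

def save_partitions(partitions: dict, path: str) -> None:
    """Append each partition under its id key, creating the key on first write.

    append=True requires format="table"; that format is also what makes
    min_itemsize and data_columns meaningful.
    """
    for key, frame in partitions.items():
        frame.to_hdf(
            path,
            key=key,
            mode="a",
            append=True,
            format="table",
            min_itemsize={"text": 64},  # illustrative: reserve width for a string column
            data_columns=True,          # index all columns so they can be queried
        )
```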
datajoely

10/04/2021, 9:03 AM
Welcome to the community!
9:10 AM
So our current dataset is set up to overwrite the same file, and even if you use PartitionedDataSet we will create new files in a folder. Off the top of my head, the approach I would suggest is that you subclass pandas.HDFDataSet and override the save() method to append rather than overwrite. Instructions for doing something like this can be found here: https://kedro.readthedocs.io/en/stable/07_extend_kedro/03_custom_datasets.html I think you could achieve this with various hooks, but I think that would overcomplicate things
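[The override suggested here boils down to an append-style save. A minimal sketch of just that save logic in plain pandas; the function name is illustrative, and in a real subclass this body would go inside the overridden save method, per the custom-datasets docs linked above:]

```python
import pandas as pd

def append_save(filepath: str, key: str, data: pd.DataFrame, **save_args) -> None:
    """Append `data` to `key` in the HDF file, creating the key on first write.

    append=True requires format="table" (the default "fixed" format cannot be
    appended to). Extra save args such as min_itemsize or data_columns are
    passed straight through to pandas.
    """
    data.to_hdf(filepath, key=key, mode="a", append=True, format="table", **save_args)
```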
Stefan P

10/05/2021, 3:56 PM
Hi, and thanks for the pointer. That was an approach I hadn't thought of. Then I might be able to solve my challenge with "save_params" (or arguments) as well. I will use the documentation and give it a try 🙂
datajoely

10/05/2021, 3:58 PM
It’s worth noting that we’re going to drop Windows support for HDFDataSet on Python 3.7 specifically
3:59 PM
Hopefully that doesn’t affect you
3:59 PM
But it came up today so I thought I’d mention
Stefan P

10/06/2021, 7:24 PM
I was looking through the source code for the HDFDataSet, and it might be too much work rewriting it towards my current approach for now. I may get back to this once I've got the first delivery done and get some time for refactoring 🙂
7:25 PM
@User Thanks for taking the time to help me 👍
datajoely

10/06/2021, 7:42 PM
I would also say I’m a bit of a Parquet fanboy so if you can use that I would recommend it
Stefan P

10/07/2021, 7:06 PM
I also like Parquet as a column-based format, and we use it today. But I started with HDF as a container for multiple dataframes (by keys), and I really like that you can read just one table/key, or a subset of rows from a table, with a simple query. No one format is great for all needs 🙂
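[The single-key and row-subset reads mentioned here look roughly like this in plain pandas. The path, key, and where-clause are illustrative; where-queries only work on keys written in "table" format, querying indexed data_columns:]

```python
from typing import Optional

import pandas as pd

def load_subset(path: str, key: str, where: Optional[str] = None) -> pd.DataFrame:
    """Read one table (key) out of a multi-table HDF file.

    With `where`, only matching rows are pulled from disk instead of
    loading the whole table into memory.
    """
    return pd.read_hdf(path, key=key, where=where)
```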
7:07 PM
OK, while at it, a bit off-topic maybe... and I'm not sure if it's possible, but Kafka seems to be breaking into all different spaces now. A KafkaDataSet?
datajoely

10/08/2021, 9:14 AM
So I'm all for it - but I guess the hard thing is to reconcile how streaming should work in Kedro. We had a Spark streaming prototype a while back that could work for this. Do you use Kafka day to day in your workflow?
Stefan P

10/09/2021, 6:46 AM
We are not using Kafka today, but we are considering a new architecture for our production environment, and Kafka is one option I will evaluate.