Getting my first pipeline up and running, been foc...
# beginners-need-help
s
Getting my first pipeline up and running. I've been focusing on learning how to use params as input to my pipelines, and I'm really starting to like Kedro 🙂 My next step is to harvest the value of the Data Catalog, and here I struggle. I'm looking for anyone working with HDFDataSet who can give me a little guidance.

I'm refactoring existing code as a Kedro data engineering pipeline. New data arrives in a JSON file. I process the data, and today I store it in an HDF file, splitting the data by "id" into separate dataframes stored with "id" as the key. If the file and key already exist I append the data, otherwise I write it. (At regular intervals new files are created to keep file sizes reasonable.)

Now I'm exploring the option of using the Kedro Data Catalog and HDFDataSet. My first challenge is that you have to specify the key when creating a dataset, rather than passing it as an argument when saving. I will have up to 5000 "ids" that I do not know before reading a file. So I think a "before_pipeline hook" is needed to create a dataset for each file/key combination during processing, correct? But when trying to create a dataset with HDFDataSet, I struggle to use my old save params (I am using pandas today). I use format, min_itemsize and data_columns today, but I get an error telling me they are not valid save_params. Help wanted.

My use case looks like a perfect fit for PartitionedDataSet, or even better IncrementalDataSet 🙂 But I don't want 5000 files to be created; I want partitions to be keys in the same HDF file. Does anyone have something like this working and is willing to share some code? Feel free to PM me if you want to help.
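Roughly, the existing (pre-Kedro) save logic looks something like this sketch. The column names, key naming and min_itemsize values below are only illustrative, not my real schema:

```python
# Sketch of the existing pandas-only approach: one HDF file, one key per "id",
# appending if the key already exists. Assumes df has "id", "timestamp" and
# "comment" columns (illustrative only).
import pandas as pd


def save_by_id(df: pd.DataFrame, filepath: str) -> None:
    with pd.HDFStore(filepath) as store:
        for id_value, group in df.groupby("id"):
            # HDFStore.append creates the key on first write and appends afterwards
            store.append(
                f"id_{id_value}",
                group,
                format="table",
                min_itemsize={"comment": 200},  # illustrative string-column size
                data_columns=["timestamp"],     # illustrative queryable column
            )
```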
d
Welcome to the community!
So our current dataset is set up to write to a single file, and even if you use PartitionedDataSet we will create new files in a folder. Off the top of my head, two approaches: the one I would suggest is to subclass pandas.HDFDataSet and override the save() method to append rather than overwrite. Instructions for doing something like this can be found here: https://kedro.readthedocs.io/en/stable/07_extend_kedro/03_custom_datasets.html The other option is to achieve this with various hooks, but I think that would overcomplicate things.
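Something roughly like this sketch, based on the AbstractDataSet interface from those docs. It is untested, local-filesystem only, and the class and argument names are made up:

```python
# A minimal sketch (assumptions: local filepath, kedro.io.AbstractDataSet
# interface as described in the custom-datasets docs).
import pandas as pd
from kedro.io import AbstractDataSet


class AppendableHDFDataSet(AbstractDataSet):
    """Stores a DataFrame under one key of an HDF file, appending when the
    file/key already exists instead of overwriting."""

    def __init__(self, filepath: str, key: str, save_args: dict = None):
        self._filepath = filepath
        self._key = key
        # e.g. {"format": "table", "min_itemsize": {...}, "data_columns": [...]}
        self._save_args = save_args or {}

    def _load(self) -> pd.DataFrame:
        return pd.read_hdf(self._filepath, key=self._key)

    def _save(self, data: pd.DataFrame) -> None:
        with pd.HDFStore(self._filepath) as store:
            # HDFStore.append creates the key on first write, appends afterwards
            store.append(self._key, data, **self._save_args)

    def _describe(self) -> dict:
        return dict(filepath=self._filepath, key=self._key, save_args=self._save_args)
```

Each catalog entry could then point at the same filepath with a different key, or you could register entries programmatically once the ids are known.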
s
Hi, and thanks for the pointer. That was an approach I had not thought of. Then I might be able to solve my challenge with "save_params" (or arguments) as well. I will use the documentation and give it a try 🙂
d
It's worth noting that we're going to drop Windows support for HDFDataSet on Python 3.7 specifically
Hopefully that doesn't affect you
But it came up today so I thought I'd mention it
s
I was looking through the source code for the HDFDataSet, and it might be too much work rewriting it towards my current approach, for now. I may get back to this once I've got the first delivery done and have some time for refactoring 🙂
@User Thanks for taking the time to help me 👍
d
I would also say I'm a bit of a Parquet fanboy, so if you can use that I would recommend it
s
I also like Parquet as a column-based format, and we use it today. But I started with HDF as a container for multiple dataframes (by keys), and I really like that you can read just one table/key, or a subset of rows from a table, with a simple query. There is no one format that is great for all needs 🙂
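For example, something like this (illustrative file/key/column names; it assumes the data was written in table format with that column listed in data_columns):

```python
import pandas as pd

# Read a single dataframe by its key
df = pd.read_hdf("data/02_intermediate/measurements.h5", key="id_0042")

# Read only a subset of rows from that key with a simple query
recent = pd.read_hdf(
    "data/02_intermediate/measurements.h5",
    key="id_0042",
    where="timestamp > '2021-01-01'",
)
```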
OK, while we're at it, a bit off-topic maybe... and I'm not sure if it's possible, but Kafka seems to be breaking into all different spaces now. A KafkaDataSet?
d
So I'm all for it - but I guess the hard thing is to reconcile how streaming should work in Kedro. We had a Spark streaming prototype a while back that could work for this. Do you use Kafka day to day in your workflow?
s
We are not using Kafka today, but we are considering a new architecture for our production environment, and Kafka is one option I will evaluate.