Getting my first pipeline up and running, been foc...
# beginners-need-help
s
Getting my first pipeline up and running. I've been focusing on learning how to use params as input to my pipelines, and I'm really starting to like Kedro 🙂 My next step is to harvest the value of the Data Catalog, and here I struggle. I'm looking for anyone working with HDFDataSet who can give me a little guidance.

I'm refactoring existing code as a Kedro data engineering pipeline. New data arrives in a JSON file. I process the data, and today I store it in an HDF file, splitting the data by "id" into separate dataframes stored with "id" as the key. If the file and key already exist I append the data, otherwise I write it. (At regular intervals new files are created to keep file sizes reasonable.)

Now I'm exploring the option of using the Kedro Data Catalog and HDFDataSet. My first challenge is that you have to specify the key when creating a dataset, rather than passing it as an argument when saving. I will have up to 5000 "ids" that I do not know before reading a file. So I think a "before_pipeline hook" is needed to create a dataset for each file/key combination during processing, correct? But when trying to create a dataset with HDFDataSet, I struggle to use my old save params (I am using pandas today). I use format, min_itemsize and data_columns today, but I get an error telling me they are not valid save_params. Help wanted.

My use case looks like a perfect fit for PartitionedDataSet, or even better IncrementalDataSet 🙂 But I don't want 5000 files to be created; I want partitions to be keys in the same HDF file. Does anyone have something like this working and is willing to share some code? Feel free to PM me if you want to help.
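Roughly, the existing (pre-Kedro) save logic looks something like this sketch. The column names, key naming and min_itemsize values below are only illustrative, not my real schema:

```python
# Sketch of the existing pandas-only approach: one HDF file, one key per "id",
# appending if the key already exists. Assumes df has "id", "timestamp" and
# "comment" columns (illustrative only).
import pandas as pd


def save_by_id(df: pd.DataFrame, filepath: str) -> None:
    with pd.HDFStore(filepath) as store:
        for id_value, group in df.groupby("id"):
            # HDFStore.append creates the key on first write and appends afterwards
            store.append(
                f"id_{id_value}",
                group,
                format="table",
                min_itemsize={"comment": 200},  # illustrative string-column size
                data_columns=["timestamp"],     # illustrative queryable column
            )
```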
d
Welcome to the community!
So our current dataset is set up to write to a single file, and even if you use PartitionedDataSet we will create new files in a folder. Off the top of my head, two approaches: the one I would suggest is to subclass pandas.HDFDataSet and override the save() method to append rather than overwrite. Instructions for doing something like this can be found here: https://kedro.readthedocs.io/en/stable/07_extend_kedro/03_custom_datasets.html The other option is to achieve this with various hooks, but I think that would overcomplicate things.
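Something roughly like this sketch, based on the AbstractDataSet interface from those docs. It is untested, local-filesystem only, and the class and argument names are made up:

```python
# A minimal sketch (assumptions: local filepath, kedro.io.AbstractDataSet
# interface as described in the custom-datasets docs).
import pandas as pd
from kedro.io import AbstractDataSet


class AppendableHDFDataSet(AbstractDataSet):
    """Stores a DataFrame under one key of an HDF file, appending when the
    file/key already exists instead of overwriting."""

    def __init__(self, filepath: str, key: str, save_args: dict = None):
        self._filepath = filepath
        self._key = key
        # e.g. {"format": "table", "min_itemsize": {...}, "data_columns": [...]}
        self._save_args = save_args or {}

    def _load(self) -> pd.DataFrame:
        return pd.read_hdf(self._filepath, key=self._key)

    def _save(self, data: pd.DataFrame) -> None:
        with pd.HDFStore(self._filepath) as store:
            # HDFStore.append creates the key on first write, appends afterwards
            store.append(self._key, data, **self._save_args)

    def _describe(self) -> dict:
        return dict(filepath=self._filepath, key=self._key, save_args=self._save_args)
```

Each catalog entry could then point at the same filepath with a different key, or you could register entries programmatically once the ids are known.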
s
Hi, and thanks for the pointer. That was an approach I had not thought of. Then I might be able to solve my challenge with "save_params" (or arguments) as well. I will use the documentation and give it a try 🙂
d
It's worth noting that we're going to drop Windows support for HDFDataSet on Python 3.7 specifically
Hopefully that doesn't affect you
But it came up today so I thought I'd mention it
s
I was looking through the source code for the HDFDataSet, and it might be too much work rewriting it towards my current approach, for now. I may get back to this once I've got the first delivery done and have some time for refactoring 🙂
@User Thanks for taking the time to help me 👍
d
I would also say I'm a bit of a Parquet fanboy, so if you can use that I would recommend it
s
I also like Parquet as a column-based format, and we use it today. But I started with HDF as a container for multiple dataframes (by keys), and I really like that you can read just one table/key, or a subset of rows from a table, with a simple query. There is no one format that is great for all needs 🙂
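For example, something like this (illustrative file/key/column names; it assumes the data was written in table format with that column listed in data_columns):

```python
import pandas as pd

# Read a single dataframe by its key
df = pd.read_hdf("data/02_intermediate/measurements.h5", key="id_0042")

# Read only a subset of rows from that key with a simple query
recent = pd.read_hdf(
    "data/02_intermediate/measurements.h5",
    key="id_0042",
    where="timestamp > '2021-01-01'",
)
```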
OK, while we're at it, a bit off-topic maybe... and I'm not sure if it's possible, but Kafka seems to be breaking into all different spaces now. A KafkaDataSet?
d
So I'm all for it - but I guess the hard thing is to reconcile how streaming should work in Kedro. We had a Spark streaming prototype a while back that could work for this. Do you use Kafka day to day in your workflow?
s
We are not using Kafka today, but we are considering a new architecture for our production environment, and Kafka is one option I will evaluate.