# beginners-need-help
b
Let's say I have thousands of operations to perform which are computationally expensive. Each iteration yields a set of parameters which I'd like to write one at a time to the same file -- whether this be as rows of a csv, SQL table or whatever -- so that already written data is preserved should the script fail on a particular iteration. What is the "Kedro approved" factory pattern for a use case like this? Any advice would be much appreciated, cheers.
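For concreteness, here's a plain-Python sketch of the pattern I mean (`expensive_computation` is just a stand-in for the costly step):

```python
import csv

def expensive_computation(task):
    """Placeholder for the real costly step; returns one row of parameters."""
    return [task, task ** 2]

def run_all(tasks, out_path="params.csv"):
    # Append each row as soon as it is computed, so everything already
    # written survives a crash on a later iteration.
    with open(out_path, "a", newline="") as f:
        writer = csv.writer(f)
        for task in tasks:
            writer.writerow(expensive_computation(task))
            f.flush()  # make sure the row is on disk before the next iteration

run_all(range(5))
```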
d
It's a very good question - one I'd be keen to hear other people's opinions on here
I'm tempted to say incremental dataset would be useful
But I'm not sure
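Roughly what I have in mind - a sketch only, since the import path depends on your Kedro version (newer releases move these classes to `kedro_datasets.partitions`) and the paths/type strings are just for illustration:

```python
# Each iteration's parameters become their own partition (their own small
# file), so partitions written before a failure are preserved.
from kedro.io import PartitionedDataSet  # IncrementalDataSet adds checkpointed loading

params_dataset = PartitionedDataSet(
    path="data/07_model_output/params",  # one file per partition under this folder
    dataset="pandas.CSVDataSet",
)

# A node would return a dict like {"iteration_0001": df, ...};
# each key is saved as a separate CSV inside the folder above.
```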
b
@datajoely Thanks for the quick response! It seems to tick most of the boxes of an incremental dataset (like saving one's progress). It just feels wrong to be creating an individual file for every record in an identical schema like this!
@datajoely I mean, this is kind of the same problem as storing any timeseries of records. Imagine I open a websocket for "BTC/USDT" and receive new json data every second. For some interval, I want to write all of these incoming records to the same database, not as individual files. Surely there are others in here that have used Kedro for exactly this
d
So @beats-like-a-helix in truth I think you can get Kedro to do this sort of thing
but I'm not sure it's the best tool for the job, I'd argue you're looking for some sort of streaming ingestion process
Kedro is fundamentally a batch tool, and scheduling it to run every second feels wrong
This may be a good opportunity to create some sort of live service which listens to the websocket and saves the data somewhere accessible, like partitioned Parquet or a database
you could then pick up the data in a Kedro pipeline at the frequency that makes sense
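Something like this, as a rough sketch (the message source is stubbed out - swap in your actual websocket client - and the table/path names are just illustrative):

```python
# Small standalone listener that appends each incoming record to a SQLite
# table; a Kedro pipeline can then read that table on its own schedule.
import json
import sqlite3
import time

def fake_message_stream():
    """Stand-in for a websocket subscription; yields one JSON record per second."""
    while True:
        yield json.dumps({"symbol": "BTC/USDT", "ts": time.time(), "price": 0.0})
        time.sleep(1)

def run_listener(db_path: str = "ticks.db") -> None:
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS ticks (symbol TEXT, ts REAL, price REAL)"
    )
    for raw in fake_message_stream():
        rec = json.loads(raw)
        # Each insert is committed immediately, so everything written so far
        # survives a crash on a later iteration.
        con.execute("INSERT INTO ticks VALUES (:symbol, :ts, :price)", rec)
        con.commit()

if __name__ == "__main__":
    run_listener()
```

A Kedro pipeline could then pick up the `ticks` table via whatever SQL dataset you have in your catalog, at whatever frequency makes sense.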
w
@User I don’t know if the way we do it here is the “kedro way”, but we solved it by creating our own dataset that accepts as input either an instance (e.g. pandas or Spark) or a generator (that yields pandas or Spark). To make myself clear: Node() -> generator. And inside the dataset's _save() we changed “overwrite” to “append” after the first generator result is saved. Hope my 2 cents help you 😅
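Something along those lines might look like this - a rough sketch of the idea, not our actual implementation (the class name is made up, and Kedro 0.19+ spells the base class `AbstractDataset`):

```python
from pathlib import Path
from typing import Any, Dict, Iterable, Union

import pandas as pd
from kedro.io import AbstractDataSet  # AbstractDataset in newer Kedro


class AppendableCSVDataSet(AbstractDataSet):
    """Saves either a single DataFrame or a generator of DataFrames to one CSV."""

    def __init__(self, filepath: str):
        self._filepath = Path(filepath)

    def _load(self) -> pd.DataFrame:
        return pd.read_csv(self._filepath)

    def _save(self, data: Union[pd.DataFrame, Iterable[pd.DataFrame]]) -> None:
        frames = [data] if isinstance(data, pd.DataFrame) else data
        for i, chunk in enumerate(frames):
            first = i == 0
            # "overwrite" (with header) for the first chunk only,
            # "append" (without header) for every chunk after that.
            chunk.to_csv(
                self._filepath,
                mode="w" if first else "a",
                header=first,
                index=False,
            )

    def _describe(self) -> Dict[str, Any]:
        return {"filepath": str(self._filepath)}
```

Because each chunk is written as soon as the generator yields it, everything saved before a failure is already on disk.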
d
Clever!
b
@datajoely @Walber Moreira Thank you both for the great advice! I'm sure I should be able to get something working for my own use case now