# beginners-need-help
b
Let's say I have thousands of operations to perform which are computationally expensive. Each iteration yields a set of parameters which I'd like to write one at a time to the same file -- whether this be as rows of a csv, SQL table or whatever -- so that already written data is preserved should the script fail on a particular iteration. What is the "Kedro approved" factory pattern for a use case like this? Any advice would be much appreciated, cheers.
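For concreteness, here's a plain-Python sketch of the pattern I mean (`expensive_computation` is just a stand-in for the costly step):

```python
import csv

def expensive_computation(task):
    """Placeholder for the real costly step; returns one row of parameters."""
    return [task, task ** 2]

def run_all(tasks, out_path="params.csv"):
    # Append each row as soon as it is computed, so everything already
    # written survives a crash on a later iteration.
    with open(out_path, "a", newline="") as f:
        writer = csv.writer(f)
        for task in tasks:
            writer.writerow(expensive_computation(task))
            f.flush()  # make sure the row is on disk before the next iteration

run_all(range(5))
```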
d
It's a very good question - one I'd be keen to hear other people's opinions on here
I'm tempted to say incremental dataset would be useful
But I'm not sure
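Roughly what I have in mind - a sketch only, since the import path depends on your Kedro version (newer releases move these classes to `kedro_datasets.partitions`) and the paths/type strings are just for illustration:

```python
# Each iteration's parameters become their own partition (their own small
# file), so partitions written before a failure are preserved.
from kedro.io import PartitionedDataSet  # IncrementalDataSet adds checkpointed loading

params_dataset = PartitionedDataSet(
    path="data/07_model_output/params",  # one file per partition under this folder
    dataset="pandas.CSVDataSet",
)

# A node would return a dict like {"iteration_0001": df, ...};
# each key is saved as a separate CSV inside the folder above.
```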
b
@datajoely Thanks for the quick response! It seems to tick most of the boxes of an incremental dataset (like saving one's progress). It just feels wrong to be creating an individual file for every record in an identical schema like this!
@datajoely I mean, this is kind of the same problem as storing any timeseries of records. Imagine I open a websocket for "BTC/USDT" and receive new json data every second. For some interval, I want to write all of these incoming records to the same database, not as individual files. Surely there are others in here that have used Kedro for exactly this
d
So @beats-like-a-helix in truth I think you can get Kedro to do this sort of thing
but I'm not sure it's the best tool for the job, I'd argue you're looking for some sort of streaming ingestion process
Kedro is fundamentally a batch tool, and scheduling it to run every second feels wrong
This may be a good opportunity to create some sort of live service which listens to the websocket and saves the data somewhere accessible, like partitioned Parquet or a database
you could then pick up the data in a Kedro pipeline at the frequency that makes sense
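Something like this, as a rough sketch (the message source is stubbed out - swap in your actual websocket client - and the table/path names are just illustrative):

```python
# Small standalone listener that appends each incoming record to a SQLite
# table; a Kedro pipeline can then read that table on its own schedule.
import json
import sqlite3
import time

def fake_message_stream():
    """Stand-in for a websocket subscription; yields one JSON record per second."""
    while True:
        yield json.dumps({"symbol": "BTC/USDT", "ts": time.time(), "price": 0.0})
        time.sleep(1)

def run_listener(db_path: str = "ticks.db") -> None:
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS ticks (symbol TEXT, ts REAL, price REAL)"
    )
    for raw in fake_message_stream():
        rec = json.loads(raw)
        # Each insert is committed immediately, so everything written so far
        # survives a crash on a later iteration.
        con.execute("INSERT INTO ticks VALUES (:symbol, :ts, :price)", rec)
        con.commit()

if __name__ == "__main__":
    run_listener()
```

A Kedro pipeline could then pick up the `ticks` table via whatever SQL dataset you have in your catalog, at whatever frequency makes sense.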
w
@User I don’t know if the way we do it here is the “kedro way”, but we solved it by creating our own dataset that accepts as input either an instance (e.g. pandas or Spark) or a generator (that yields pandas or Spark). To make myself clear: Node() -> generator. And inside the dataset's _save() we changed “overwrite” to “append” after the first generator result is saved. Hope my 2 cents help you 😅
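Something along those lines might look like this - a rough sketch of the idea, not our actual implementation (the class name is made up, and Kedro 0.19+ spells the base class `AbstractDataset`):

```python
from pathlib import Path
from typing import Any, Dict, Iterable, Union

import pandas as pd
from kedro.io import AbstractDataSet  # AbstractDataset in newer Kedro


class AppendableCSVDataSet(AbstractDataSet):
    """Saves either a single DataFrame or a generator of DataFrames to one CSV."""

    def __init__(self, filepath: str):
        self._filepath = Path(filepath)

    def _load(self) -> pd.DataFrame:
        return pd.read_csv(self._filepath)

    def _save(self, data: Union[pd.DataFrame, Iterable[pd.DataFrame]]) -> None:
        frames = [data] if isinstance(data, pd.DataFrame) else data
        for i, chunk in enumerate(frames):
            first = i == 0
            # "overwrite" (with header) for the first chunk only,
            # "append" (without header) for every chunk after that.
            chunk.to_csv(
                self._filepath,
                mode="w" if first else "a",
                header=first,
                index=False,
            )

    def _describe(self) -> Dict[str, Any]:
        return {"filepath": str(self._filepath)}
```

Because each chunk is written as soon as the generator yields it, everything saved before a failure is already on disk.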
d
Clever!
b
@datajoely @Walber Moreira Thank you both for the great advice! I'm sure I should be able to get something working for my own use case now