10/05/2022, 3:36 PM
Sounds like what you're doing is taking yesterday's dataset, adding new data, and saving it again. That means you are mutating your raw data, which is not a good idea: if your code crashes mid-write, you could corrupt it.

If the overhead is not too big, I would save each day's data into a separate timestamped file, then use Kedro with an aggregation node that merges all the files into one dataset (for example via a PartitionedDataset). The timestamped files don't have to be written by Kedro, and you can treat them as immutable and back them up with e.g. S3 versioning.

Otherwise, you can use environment variables with TemplatedConfigLoader, so that dataset A's filename uses yesterday's timestamp and B's uses today's. That way you also keep a history of your datasets in case something goes wrong. If something does go wrong and you don't notice for a few days, though, you would have to revert and lose all those days' data.

You could also combine both approaches if the overhead is too big: start your aggregation node from, say, last month's dataset as a base and only add this month's daily files. That helps when your dataset is not append-only and a full merge would otherwise scale with the number of days. In that setup, even if something catastrophic happens you still have all the daily files backed up, so you can reconstruct your dataset given enough time. This parallels write-ahead logging (WAL), which you might find insightful.
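A minimal sketch of that aggregation node, assuming Kedro's PartitionedDataset convention of handing the node a dict mapping partition ids (here, timestamped names) to zero-argument loaders. The loaders below are stand-ins for real file reads, and the data shape is illustrative:

```python
def merge_partitions(partitions):
    """Merge immutable daily partitions into one dataset.

    `partitions` maps a partition id (e.g. a timestamped filename)
    to a zero-argument loader, which is the shape a Kedro
    PartitionedDataset passes to a node. Partitions are concatenated
    in sorted (i.e. chronological) order so the result is deterministic.
    """
    merged = []
    for key in sorted(partitions):
        merged.extend(partitions[key]())
    return merged

# stand-in loaders simulating two immutable daily files
partitions = {
    "2022-10-05": lambda: [{"day": "2022-10-05", "value": 2}],
    "2022-10-04": lambda: [{"day": "2022-10-04", "value": 1}],
}
merged = merge_partitions(partitions)
```

Because each daily file is never rewritten, a crash in the merge step can't damage the raw data; you just rerun the node.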
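For the templating route, a sketch of how you might compute the two timestamps and expose them as environment variables for a TemplatedConfigLoader catalog to pick up (the variable names and file layout here are made up for illustration, not a Kedro convention):

```python
import os
from datetime import date, timedelta

def dated_filename(base, day):
    """Build a timestamped path, e.g. events_2022-10-05.csv."""
    return f"{base}_{day.isoformat()}.csv"

today = date.today()
yesterday = today - timedelta(days=1)

# Dataset A (input) reads yesterday's file, dataset B (output)
# writes today's; a templated catalog entry could reference these
# via something like `filepath: ${INPUT_PATH}`.
os.environ["INPUT_PATH"] = dated_filename("data/01_raw/events", yesterday)
os.environ["OUTPUT_PATH"] = dated_filename("data/01_raw/events", today)
```

Since every run writes a fresh file, you keep the full history and can roll back by pointing the input at an older timestamp.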