# beginners-need-help
I don't know anything about Spark or Dask yet, so the docs may cover the following question: Let's say I have 1000 files, and there is some anomaly in file 926 that causes an error. So I fix the file itself or my nodes, and run again. How can I start from 926 and not waste time computing results I've already acquired? I've never really worked with a dataset large enough to warrant asking this question before, haha.
So this is a good question, and I'm not sure we have a good out-of-the-box solution, since our retry logic is on a node level.
I wonder if it's worth defining a custom dataset that works on ranges
and then you run the pipeline once for each range
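A rough sketch of the range idea in plain Python, not tied to any framework's API: split the 1000 files into ranges, record each finished range in a small checkpoint file, and skip recorded ranges on the next run. The names `make_ranges`, `run_in_ranges`, `process_range`, and the checkpoint file are all made up for illustration.

```python
import json
from pathlib import Path
from typing import Callable, List, Tuple


def make_ranges(n_files: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split indices 0..n_files-1 into half-open (start, stop) ranges."""
    return [
        (start, min(start + chunk_size, n_files))
        for start in range(0, n_files, chunk_size)
    ]


def run_in_ranges(
    n_files: int,
    chunk_size: int,
    process_range: Callable[[int, int], None],  # hypothetical: runs one pipeline over files [start, stop)
    checkpoint_path: Path,                      # hypothetical: JSON file listing finished ranges
) -> List[Tuple[int, int]]:
    """Process every range that is not already recorded as done."""
    done = set()
    if checkpoint_path.exists():
        done = {tuple(r) for r in json.loads(checkpoint_path.read_text())}

    for rng in make_ranges(n_files, chunk_size):
        if rng in done:
            continue  # finished in an earlier run, so skip it
        # If this raises, the checkpoint file still lists only finished ranges.
        process_range(*rng)
        done.add(rng)
        checkpoint_path.write_text(json.dumps(sorted(done)))

    return sorted(done)
```

With a chunk size of 100, a failure in file 926 means only the range 900 to 999 is redone on the next run; smaller ranges waste less work but produce more runs.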
Is the error you're worried about related to logic or memory?
because I guess we could take diff approaches depending on those variables
Thanks for the information, that's good to know
I'm anticipating either of those error types. It's gravitational wave data, so big files, and the signal can be inconsistent at times, so I'm just trying to make a plan for things going wrong
So this specific data per se is outside of my wheelhouse; @User, with his astrophysics background, may actually be helpful here
In general I'd explore three things: (1) Get good at logging exceptions and warnings, the things that can be used to understand why something failed. Potentially the
hook is really useful here. (2) Try to set up a data profiling pipeline that allows you to do low-cost analysis of the data, which may give you an idea of what could cause problems downstream. (3) For logic errors, you may want to use exception handlers to gracefully swallow and log errors rather than killing the process.
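For points (1) and (3), something like the following could work as a starting point. It is a minimal sketch using only the standard `logging` module; `process_all`, `process_file`, and the logger name are hypothetical, and it does not show any framework hook.

```python
import logging
from typing import Any, Callable, Dict, Iterable, List, Tuple

logger = logging.getLogger("file_processing")  # hypothetical logger name


def process_all(
    paths: Iterable[str],
    process_file: Callable[[str], Any],  # hypothetical per-file processing function
) -> Tuple[Dict[str, Any], List[Tuple[str, str]]]:
    """Process each file; log and record failures instead of stopping."""
    results: Dict[str, Any] = {}
    failures: List[Tuple[str, str]] = []

    for path in paths:
        try:
            results[path] = process_file(path)
        except Exception as exc:
            # logger.exception logs at ERROR level and includes the traceback.
            logger.exception("Failed to process %s", path)
            failures.append((path, repr(exc)))

    return results, failures
```

The returned `failures` list tells you which files to rerun once the data or the node logic is fixed. Catching a broad `Exception` is deliberate here so that one bad file does not kill the whole run; memory errors at the process level would still need a different approach.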
Fantastic! I'll do those things and hopefully the project should stay on course