# beginners-need-help
b
I don't know anything about Spark or Dask yet, so the docs may cover the following question: Let's say I have 1000 files, and there is some anomaly in file 926 that causes an error. So I fix the file itself or my nodes, and run again. How can I start from 926 and not waste time computing results I've already acquired? I've never really worked with a dataset large enough to warrant asking this question before, haha.
d
So this is a good question, and I'm not sure we have a good out-of-the-box solution since our retry logic is at the node level
I wonder if it's worth defining a custom dataset that works on ranges
and then running a separate pipeline for each range
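Something like this plain-Python sketch of the range idea (not Kedro-specific): the files are split into fixed-size chunks, and a chunk is skipped when its marker file already exists, so a re-run after a failure around file 926 picks up from the failed range. `process_file`, the `.hdf5` glob, and the directory layout are all hypothetical placeholders.

```python
from pathlib import Path

INPUT_DIR = Path("data/01_raw")
OUTPUT_DIR = Path("data/02_intermediate")
CHUNK_SIZE = 50  # files per range / per pipeline run


def process_file(path: Path) -> str:
    """Placeholder for the real per-file computation."""
    return path.read_text()  # replace with the actual processing


def run_in_ranges() -> None:
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    files = sorted(INPUT_DIR.glob("*.hdf5"))  # assumed file extension
    for start in range(0, len(files), CHUNK_SIZE):
        chunk = files[start:start + CHUNK_SIZE]
        marker = OUTPUT_DIR / f"range_{start:04d}.done"
        if marker.exists():
            continue  # this range already completed on a previous run
        for path in chunk:
            result = process_file(path)
            (OUTPUT_DIR / f"{path.stem}.out").write_text(result)
        marker.touch()  # mark the whole range as finished


if __name__ == "__main__":
    run_in_ranges()
```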
Is the error you're worried about related to logic or memory?
because I guess we could take different approaches depending on those variables
b
Thanks for the information, that's good to know
I'm anticipating either of those error types. It's gravitational wave data, so the files are big and the signal can be inconsistent at times, so I'm just trying to make a plan for when things go wrong
d
So this specific data is outside of my wheelhouse; @User, with his astrophysics background, may actually be helpful here
In general I'd explore three things: (1) get good at logging out exceptions/warnings and anything else that can be used to understand why things failed; the `on_pipeline_error` hook is potentially really useful here. (2) Try to set up a data profiling pipeline that lets you do low-cost analysis of the data and maybe gives you an idea of what could cause problems downstream. (3) For logic errors, you may want to use exception handlers to gracefully swallow and log errors rather than killing the process
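A rough sketch of what (1) and (3) could look like, assuming a Kedro-style project: the exact hook signature may differ between versions, `process_file` is a hypothetical stand-in for the real node logic, and the hook class would need to be registered (e.g. via `HOOKS` in `settings.py`).

```python
import logging
from pathlib import Path

from kedro.framework.hooks import hook_impl

logger = logging.getLogger(__name__)


class ErrorLoggingHooks:
    """Log pipeline-level failures so re-runs have context on what broke."""

    @hook_impl
    def on_pipeline_error(self, error, run_params, pipeline, catalog):
        # Record the exception and the run parameters that triggered it.
        logger.error("Pipeline failed with %r (run_params=%s)", error, run_params)


def process_file(path: Path) -> str:
    """Placeholder for the real per-file computation."""
    return path.read_text()


def run_with_graceful_errors(files: list[Path]) -> list[Path]:
    """Swallow and log per-file errors instead of killing the whole run."""
    failed = []
    for path in files:
        try:
            process_file(path)
        except Exception:
            # Log the full traceback, then carry on with the remaining files.
            logger.exception("Failed on %s; continuing with the next file", path)
            failed.append(path)
    return failed  # rerun just these once the underlying issue is fixed
```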
b
Fantastic! I'll do those things and hopefully the project will stay on course