# advanced-need-help
Hello folks! I have a question regarding the IncrementalDataSet catalog entry. The documentation says that a checkpoint file will be created /at the location/ of the dataset to remember which entries have already been processed. My question is: if multiple data scientists run the pipeline from different computers, will the checkpoint file remember which computer has already processed which entries? Or will one user have missing points if another has already processed them on their side?
So the checkpoint is technically a whole dataset of its own, so it will be persisted at the storage location, which, if accessible team-wide, will be shared
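To make the shared-checkpoint behaviour concrete, here is a minimal sketch in plain Python. It is a simplified stand-in, not Kedro's actual implementation: the file name `CHECKPOINT`, the helpers `load_new_partitions` and `confirm`, and the "keep the largest processed name" rule are all assumptions made for illustration.

```python
from __future__ import annotations

import tempfile
from pathlib import Path

CHECKPOINT_NAME = "CHECKPOINT"  # hypothetical checkpoint file name for this sketch


def load_new_partitions(folder: Path) -> list[str]:
    """Return partition names that sort after the stored checkpoint value."""
    checkpoint_file = folder / CHECKPOINT_NAME
    last_seen = checkpoint_file.read_text() if checkpoint_file.exists() else ""
    names = sorted(p.name for p in folder.iterdir() if p.name != CHECKPOINT_NAME)
    return [name for name in names if name > last_seen]


def confirm(folder: Path, processed: list[str]) -> None:
    """Persist the largest processed partition name next to the data."""
    if processed:
        (folder / CHECKPOINT_NAME).write_text(max(processed))


def demo_shared_checkpoint() -> tuple[list[str], list[str]]:
    """Two users point at the same folder and therefore the same checkpoint."""
    with tempfile.TemporaryDirectory() as tmp:
        shared = Path(tmp)
        for name in ("2023-01-01.csv", "2023-01-02.csv"):
            (shared / name).write_text("data")

        seen_by_a = load_new_partitions(shared)  # user A sees both partitions
        confirm(shared, seen_by_a)               # user A advances the checkpoint
        seen_by_b = load_new_partitions(shared)  # user B now sees nothing new
        return seen_by_a, seen_by_b
```

In this sketch the checkpoint records only how far processing has got, not who did it, so the second user gets an empty list once the first has confirmed.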
That being said, it wasn't really designed to be robust in this situation. For instance, there isn't a mechanism to prevent two different people from overwriting the data in parallel runs
so I think you can use it in this situation, but it requires some coordination amongst the team to be safe
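One possible form of that coordination is a lock file that a run must create before touching the shared location. This is only a sketch of the idea and not something Kedro provides: `exclusive_run` and `pipeline.lock` are hypothetical names, and exclusive file creation may not be atomic on every network filesystem or object store, so it would need checking against the storage actually in use.

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def exclusive_run(lock_path: Path):
    """Hold a lock file for the duration of a run; fail if one already exists."""
    try:
        # O_EXCL makes creation fail if the file is already there.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise RuntimeError(f"Another run holds {lock_path}") from None
    try:
        os.close(fd)
        yield
    finally:
        os.remove(lock_path)  # release the lock even if the run fails


def demo_lock() -> bool:
    """Return True if a second, overlapping run is refused."""
    with tempfile.TemporaryDirectory() as tmp:
        lock = Path(tmp) / "pipeline.lock"  # hypothetical lock file name
        with exclusive_run(lock):
            try:
                with exclusive_run(lock):
                    return False  # not reached: the lock is already held
            except RuntimeError:
                return True
```

A run would wrap its pipeline execution in `with exclusive_run(...)`, so a colleague starting a parallel run gets an error instead of silently overwriting the data.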
Thank you for your, as always, prompt and helpful answer 🙂
The ID generated is a timestamp at, I think, millisecond precision, which is very unlikely to be non-unique, but possible
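To illustrate why a millisecond-precision timestamp can still collide, here is a small sketch. The exact string format and the `millisecond_id` helper are assumptions for illustration, not a confirmed description of how the ID is generated.

```python
from datetime import datetime, timezone


def millisecond_id(moment: datetime) -> str:
    """Format a datetime as an ID truncated to millisecond precision.

    The format is hypothetical; only the truncation matters for the example.
    """
    millis = moment.microsecond // 1000
    return moment.strftime("%Y-%m-%dT%H.%M.%S.") + f"{millis:03d}Z"


# Two runs started 0.8 ms apart fall into the same millisecond...
first = datetime(2023, 1, 1, 12, 0, 0, 123_100, tzinfo=timezone.utc)
second = datetime(2023, 1, 1, 12, 0, 0, 123_900, tzinfo=timezone.utc)
# ...while a run in the next millisecond gets a different ID.
third = datetime(2023, 1, 1, 12, 0, 0, 124_000, tzinfo=timezone.utc)

collides = millisecond_id(first) == millisecond_id(second)
differs = millisecond_id(first) != millisecond_id(third)
```

Two runs would have to start within the same millisecond to clash, which is rare for people launching pipelines by hand but not impossible for automated parallel jobs.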