Thanks a lot for the prompt feedback. 🙂 It makes sense and would probably be a better solution.
The way I was seeing it, I have a data_extraction pipeline as a first step which interacts with a DB through an API and creates a MemoryDataset. After this initial pipeline, everything (preprocessing, training, analysis) is designed for reproducibility. I wasn't sure about the best option for the first step between having a different app and having a "hacked" custom dataset. Will give it a bit more thoughts.