Thanks again! Broader overview since I might have a debatable at best use of Kedro. 😄
I have a Django project which fills a DB with information from multiple sources and serve them through multiple APIs. Because I need to ensure a consistent train/test split across various splits (called tasks), the split is part of the DB. I therefore want to make a first API call to get the train and test indices, and then a second and third (could be wrapped into one) API calls to get the information related to the samples in the train and test sets. Because all the potential users of the DB will not use Kedro, I started to write a minimalistic CLI which just extracts the information for train and test samples given the task. While I could run this CLI before running the Kedro pipelines, I would rather have had it integrated to reduce the number of dependencies and ease the deployment. After the various workflows ran in Kedro, a final API call is made by Kedro which this time populates back the DB with the prediction associated with each strategy for the test set.
From a DevOps point of view, everything is hosted on OpenShift and the Django webpage can trigger the Kedro pipeline through a simple script which listens to a socket command. The Kedro part is deployed using an image produced using kedro-docker which was tuned to wait for a socket command.
The Kedro part can be found here [1].
In this context, while it might seem to go against the spirit of reproducibility, reproducibility is achieved through the combination of the tools and creating a more complex custom dataset might be a viable (while not optimal) option.
The stack might not be the best one but was chosen based on some already existing constraints (Django based ETL, ...). However, if you have any feedbacks on the implementation of the overall project, I would be happy to hear them. 🙂 The structure fulfills its goals while seeming a bit hacky. 😄
[1]Â
https://github.com/XavierAtCERN/dqm-playground-ds