#advanced-need-help
xxavier

07/14/2022, 8:34 AM
Hi everyone, I am trying to achieve the following:
- get a list of items from an API using APIDataSet
- loop over the list of items and make one API call per item to get more information, again using APIDataSet

The first part is working fine (after a small modification to APIDataSet to allow token authentication, as described in a previous message [1]). However, I am not sure what the best way to proceed is for the second part. I was considering the catalog.py option but wanted to know if there was a better way to loop over items and aggregate them within the catalog. Any help is appreciated. 🙂

[1] https://discord.com/channels/778216384475693066/778998585454755870/973951577561890856
datajoely

07/14/2022, 9:32 AM
So the catalog is designed for reproducibility, and this sort of violates some of those assumptions
9:33 AM
You can go down the route of creating more and more complex custom datasets
9:33 AM
But it may make sense to simply write a separate app that generates the file your Kedro pipeline needs to read
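A minimal sketch of such a separate export step, assuming the two API calls are wrapped in callables. The names `export_items`, `list_items`, and `get_details`, as well as the `id` field, are hypothetical and do not come from the thread.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, List


def export_items(
    list_items: Callable[[], List[Dict[str, Any]]],
    get_details: Callable[[Any], Dict[str, Any]],
    output_path: Path,
) -> int:
    """Fetch the details of every listed item and write them to one JSON file.

    `list_items` and `get_details` are hypothetical wrappers around the two
    API calls. Injecting them keeps this step independent of any HTTP client.
    """
    # One detail call per item; assumes each listed item has an "id" field.
    records = [get_details(item["id"]) for item in list_items()]
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(json.dumps(records, indent=2))
    return len(records)
```

The resulting file could then be declared in the catalog as an ordinary JSON dataset, so the pipeline itself only ever reads a static file.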
xxavier

07/14/2022, 9:53 AM
Thanks a lot for the prompt feedback. 🙂 It makes sense and would probably be a better solution. The way I was seeing it, I have a data_extraction pipeline as a first step which interacts with a DB through an API and creates a MemoryDataset. After this initial pipeline, everything (preprocessing, training, analysis) is designed for reproducibility. I wasn't sure which was the best option for the first step: a separate app or a "hacked" custom dataset. Will give it a bit more thought.
datajoely

07/14/2022, 10:05 AM
It's a good question!
10:05 AM
Ultimately API dataset is a thin wrapper on top of requests.get
10:05 AM
and if you need to do anything much more dynamic than that then it's probably not fit for purpose
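For comparison, a rough sketch of what the custom-dataset route could look like. It is written as a plain class so that it runs without Kedro installed. In a Kedro project of that era it would subclass `kedro.io.AbstractDataSet`, which expects `_load`, `_save`, and `_describe`. The class name, the URL template, and the `id` field are assumptions for illustration.

```python
import json
from typing import Any, Callable, Dict, List, Optional
from urllib.request import Request, urlopen


def http_get_json(url: str, headers: Dict[str, str]) -> Any:
    """Plain GET that returns the decoded JSON body (standard library only)."""
    with urlopen(Request(url, headers=headers), timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


class PerItemAPIDataSet:
    """Read-only dataset: one list call, then one detail call per item."""

    def __init__(
        self,
        list_url: str,
        detail_url_template: str,  # hypothetical, e.g. ".../items/{id}/"
        headers: Optional[Dict[str, str]] = None,
        get: Callable[[str, Dict[str, str]], Any] = http_get_json,
    ) -> None:
        self._list_url = list_url
        self._detail_url_template = detail_url_template
        self._headers = headers or {}
        self._get = get

    def _load(self) -> List[Any]:
        # First call returns the list; assumes each entry has an "id" field.
        items = self._get(self._list_url, self._headers)
        return [
            self._get(self._detail_url_template.format(id=item["id"]), self._headers)
            for item in items
        ]

    def _save(self, data: Any) -> None:
        raise NotImplementedError("This sketch is read-only.")

    def _describe(self) -> Dict[str, Any]:
        return {
            "list_url": self._list_url,
            "detail_url_template": self._detail_url_template,
        }
```

Every load hits the live API, so two runs can see different data, which is the reproducibility concern raised earlier in the thread.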
xxavier

07/14/2022, 11:36 AM
Thanks again! Here is a broader overview, since my use of Kedro might be debatable at best. 😄

I have a Django project which fills a DB with information from multiple sources and serves it through multiple APIs. Because I need to ensure a consistent train/test split across various splits (called tasks), the split is part of the DB. I therefore want to make a first API call to get the train and test indices, and then a second and third API call (which could be wrapped into one) to get the information related to the samples in the train and test sets.

Because not all potential users of the DB will use Kedro, I started to write a minimalistic CLI which just extracts the information for the train and test samples given the task. While I could run this CLI before running the Kedro pipelines, I would rather have it integrated, to reduce the number of dependencies and ease deployment. After the various workflows have run in Kedro, Kedro makes a final API call which populates the DB with the predictions associated with each strategy for the test set.

From a DevOps point of view, everything is hosted on OpenShift, and the Django webpage can trigger the Kedro pipeline through a simple script which listens for a socket command. The Kedro part is deployed using an image produced with kedro-docker, which was tuned to wait for a socket command. The Kedro part can be found here [1].

In this context, while it might seem to go against the spirit of reproducibility, reproducibility is achieved through the combination of the tools, and creating a more complex custom dataset might be a viable (though not optimal) option. The stack might not be the best one, but it was chosen based on some existing constraints (Django-based ETL, ...). However, if you have any feedback on the implementation of the overall project, I would be happy to hear it. 🙂 The structure fulfills its goals while seeming a bit hacky. 😄

[1] https://github.com/XavierAtCERN/dqm-playground-ds
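The two-step extraction described above (split indices first, then samples) can be sketched as one small function that both a CLI and a Kedro node could call. The function name, the callables, and the `{"train": [...], "test": [...]}` response shape are assumptions and are not taken from the linked repository.

```python
from typing import Any, Callable, Dict, List


def load_task_split(
    task: str,
    get_split: Callable[[str], Dict[str, List[int]]],
    get_samples: Callable[[List[int]], List[Dict[str, Any]]],
) -> Dict[str, List[Dict[str, Any]]]:
    """Resolve the stored split for a task, then fetch the samples per subset.

    `get_split` and `get_samples` are hypothetical wrappers around the API
    calls; the split is assumed to look like {"train": [...], "test": [...]}.
    """
    split = get_split(task)
    # One sample request per subset, so train and test stay separate.
    return {subset: get_samples(split[subset]) for subset in ("train", "test")}
```

Keeping the HTTP details inside the injected callables means the same function works whether it is invoked from the standalone CLI or from inside the data_extraction pipeline.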