# advanced-need-help
Hello everyone, I'm new to Kedro and have had success with a use case for tracking the performance of a traditional ML model with variable architectures, making changes to input parameters, and saving the reporting results. I'm now looking to use the same data, but applied to fundamentally different architectures and with different evaluation criteria. Specifically, I'd like to be able to use deep learning frameworks and hand-crafted algorithms. The hand-crafted algorithms are simple mathematical operations wrapped in a class to be deployed to firmware.

What are the best practices with respect to Kedro for this to be 1) easily scaled and 2) easy to integrate with kedro-mlflow in the future?

My current data flow is as follows: data load -> preprocessing -> feature calculations -> model training -> evaluation, where model training contains the model specifications.

As I understand it, I have the following options:
1) route which nodes to use within the model training pipeline using parameters, e.g. a parameter such as architecture_type that routes the data flow accordingly
2) determine node logic via parameters which specify the architecture (similar to the above)
3) give each fundamentally different architecture its own pipeline (1) traditional ML, 2) deep learning, 3) hand-crafted algos), routed at the pipeline registry level
4) implement modular pipelines for these 3 cases

My judgement is that options 1 and 2 do not scale well, are not good practice, and seem ridiculous. Option 4 is attractive, but I don't know whether the modular pipeline framework will be sufficiently flexible. Furthermore, it seems from reading other posts here that this may complicate tracking runs with MLflow (multiple models being saved within the same run). Thus, I'm leaning towards option 3 to start, and if I need additional granularity I can make modular pipelines within those 3 categories.

Would really appreciate any kind of feedback, clarification or advice. Thanks!
Hi @User creating a thread since this is a chunky set of questions
So the Kedro maintainer team specifically pivoted to building out the modular pipeline approach to solve this sort of problem for users hitting this kind of wall
Whilst we don't maintain kedro-mlflow ourselves, I think that using namespaced modular pipelines in general should address your worries in (4), because each instance of the pipeline will have its own isolated namespace
More than happy to help you build out the scope / architecture diagram
Appreciate the prompt response. If this obstacle is the use case that modular pipelines are made to address then I will go forth with that approach. I'm going to go ahead and review the documentation for namespaced pipelines more rigorously and then return.
This repo might be useful (the modular spaceflights example); it's designed to show you what a more complex project may look like
I've now read the docs and am pretty convinced that modular pipelines will address my use case and can be designed similarly to the modular spaceflights repo, with some additional supported flexibility. My goal is to have some N number of models, as supported in the spaceflights repo.

Let's say (imagining the most complicated configuration) we have models A, B and C which must be able to receive either the same or different x_train/y_train/x_test/y_test data. For instance, models A and B receive xtrain1 and ytrain1, whereas C receives xtrain2 and ytrain2. This is valuable in the scenario where models A and B are two candidates making a prediction towards one goal and C is making predictions towards another, so different features are being used for the different goals of the respective models (which I would like to catalog separately for data provenance).

My understanding is that I first make a modelling pipeline template which specifies the arguments I want to pass to a given model, exactly as in modular spaceflights. Now my first question concerns best practices for variable inputs to these models. A very simple way would be that in parameters.yml, under model options / model type, I have another field for the input data that should be accepted. Alternatively, models could be put lower than the data input in the hierarchy of the YAML file. Or I could implement a mapping, so that if I want to map some model to multiple different train sets, I could do that.

I would really appreciate advice on this specific point: would any of these approaches be difficult to scale with Kedro in the long run, and is there a more effective implementation I didn't consider?
So I think what you've proposed makes a lot of sense
So this sort of challenger-model operations question isn't necessarily something we have an opinion on,
but what we can provide is a good view of where users get into trouble managing configuration like this
In general I think my two guiding principles here would be:
- optimise your configuration for readability
- use configuration environments
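To make the second principle concrete, a minimal sketch of configuration environments: `conf/base` holds shared defaults, and an extra environment directory overrides only what changes for a given set of experiments. The directory layout is Kedro's standard one; the environment name and parameter keys are hypothetical.

```yaml
# conf/base/parameters.yml  (shared defaults; names hypothetical)
model_a:
  train_set: "1"
  learning_rate: 0.001

# conf/deep_learning_experiments/parameters.yml
# Overrides only the keys that differ for this experiment set.
model_a:
  learning_rate: 0.01
```

Running `kedro run --env=deep_learning_experiments` then picks up the base values with the environment's overrides layered on top, so each architecture family can carry its own tweaks without duplicating whole config files.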
but most importantly keep asking questions, we'll do our best to help you think through this