# advanced-need-help
i
Hi! Concerning the answer to my last question: I saw your answer suggesting the use of modular pipelines, but in my case I'm always using the same pipeline, just changing the input files. To give more context: I have a pipeline which I will call PIP. The catalog is templated using the variable cohort in globals.yml. I want to run the pipeline using different values for the variable cohort. This works fine as long as I run it sequentially: update globals.yml -> kedro run for each cohort. If I run this in parallel, the steps update globals.yml -> kedro run execute concurrently for all cohorts, so globals.yml can be overwritten by another run. Therefore I would like to be able to specify globals.yml in another place, or provide the variable from the command line, to avoid this problem. Is there any way to achieve this?
d
Hi @User, so this is something we do a lot internally. There are two key parts to understand:
1. Configuration environments. Setting up an additional config environment is documented here: https://kedro.readthedocs.io/en/latest/04_kedro_project_setup/02_configuration.html#additional-configuration-environments In the docs we talk about it being used for, say, Staging/Prod having mirrored but different config paths. This is inspired by the thinking of the twelve-factor app: https://12factor.net/config Additionally, @User documented how to extend this to be even more powerful: https://discord.com/channels/778216384475693066/846330075535769601/875275273430515795
2. Utilise modular pipelines, which allow you to reuse the same pipeline with different inputs - this is documented here: https://kedro.readthedocs.io/en/stable/06_nodes_and_pipelines/03_modular_pipelines.html#how-to-use-a-modular-pipeline-twice I am also currently working on this demo project (very much still work in progress) where you can see how I'm taking advantage of things: https://github.com/datajoely/modular-spaceflights/blob/main/src/modular_spaceflights/pipeline_registry.py
i
Perfect! I will have a look at it! Thanks!
a
Just to add, it's also very common to define globals using environment variables (in addition to, or instead of, using globals.yml). This way you can do
COHORT=value1 kedro run
- see the last code example here for how to inject this into `TemplatedConfigLoader`: https://kedrozerotohero.com/programming-patterns/how-to-inject-secrets-into-your-kedro-configuration. Not quite sure how Slurm would handle setting the same environment variable multiple times in parallel, but it might be an easy way to do this 🙂
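As a kedro-free sketch of that pattern (the `COHORT` variable name, the function names, and the catalog path are made up for illustration; `str.format` stands in for the `${...}` substitution that `TemplatedConfigLoader` performs):

```python
import os

# Illustrative only: the value that would normally be hard-coded in
# globals.yml is read from an environment variable and merged into the
# globals dictionary that templates the catalog.

def build_globals(defaults):
    merged = dict(defaults)
    cohort = os.environ.get("COHORT")  # set by e.g. `COHORT=cohort_a kedro run`
    if cohort is not None:
        merged["cohort"] = cohort
    return merged

def render(template, globals_dict):
    # str.format stands in for TemplatedConfigLoader's ${...} substitution
    return template.format(**globals_dict)

os.environ["COHORT"] = "cohort_a"
globals_dict = build_globals({"cohort": "default"})
print(render("data/01_raw/{cohort}/input.csv", globals_dict))
# prints data/01_raw/cohort_a/input.csv
```

Each parallel job then gets its own environment variable, so nothing shared on disk is overwritten.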
But as Joel said, if it's just a case of injecting a different variable value then using configuration environments sounds like the right solution; if it's more complex then modular pipelines could help.
i
Thanks a lot! I will have a look at the link!
Thanks! This makes sense; the env approach is perfect in case I have few values for the variable. But if I needed to use many different values, then I would need the same number of different envs. Did I understand it correctly?
d
Yes
Now that does mean there will be a bit of duplication
And that’s where people tend to use jinja
But in principle this is the right approach
i
At the moment I solved it in this way; I will also test the env approach later. This way I can pass either the path to the global_dict.yml or the dict as a string, i.e. "{'cohort':'cohort_name'}", to the --params flag. Any comment is welcome!
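For what it's worth, the dict-as-string variant can be parsed safely with `ast.literal_eval` rather than `eval` (a small illustrative sketch, not the actual project code):

```python
import ast

# The string shape passed via --params in the message above
raw = "{'cohort':'cohort_name'}"

# literal_eval only evaluates Python literals (dicts, lists, strings, numbers),
# so it cannot execute arbitrary code the way eval() can
globals_dict = ast.literal_eval(raw)
print(globals_dict["cohort"])
# prints cohort_name
```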
d
That would work! I need to think about whether I like it or not, haha
i
Let me know your final thoughts and, if needed, how to improve it. 😅
d
So ultimately - I think you should use config environments.
kedro run --env="cohort_a"
should give you the right configuration if you set it up correctly
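For reference, the layout that assumes looks roughly like this (directory and variable names are illustrative): each environment overrides only globals.yml, and `kedro run --env=cohort_a` picks the matching folder.

```
conf/
├── base/
│   ├── catalog.yml   # entries templated with ${cohort}
│   └── globals.yml   # cohort: default
├── cohort_a/
│   └── globals.yml   # cohort: cohort_a
└── cohort_b/
    └── globals.yml   # cohort: cohort_b
```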
i
But what if I have 1000 cohorts? Should I generate all the folders conf/"cohort_{1 ... 1000}" just to inject one variable? I would love to have a flag like --globals or --globals_dict to pass this, like one can do with the --params flag for parameters.
d
ah I see
yeah, if you have 1,000 environments it doesn't make sense unless you start generating config dynamically
if this works for you, great - I just need to think about how we can do better here
i
Thanks! It would be great to understand the best solution for something like this. In my case most of the pipelines are applied to different inputs and store to different output folders, and injecting variables like one can inject parameters would be cool.
d
Yeah, the other thing to consider here is wrapping Kedro in an orchestrator
and setting it up to do different things based on CLI arguments or environment variables
i
Any suggestions on the orchestrator? Are there examples of Kedro pipelines being used with orchestrators?
d
I'm a fan of Argo personally
we don't have many examples open source
but lots of teams internally have
i
It would be nice to see an example. I will have a look at Argo. Thanks for your help!
d
I don't have one to share, but from a Kedro point of view it's about exposing this sort of functionality via CLI commands or environment variables
then configuring your orchestrator to pass those in
i
ah I see, thanks!
a
The contents of `--params` are actually already available as a variable in `register_config_loader` - it's the `extra_params` dictionary described here: https://kedro.readthedocs.io/en/0.17.5/kedro.framework.hooks.specs.RegistrationSpecs.html#kedro.framework.hooks.specs.RegistrationSpecs.register_config_loader. So no need to delve into click to create `globals_dict_params` - it's already there for you!
Maybe I'm missing something, but I also don't see why you'd pass in a yaml file using `--params`. If you want to load `globals_dict` from a yaml file then that's exactly what environments are for. I understand you don't want like 1000 different environments, but in your version you'd still need to create all 1000 different yaml files? 🤔
Basically I think the stuff you're putting in the yml files should be done as separate environments (assuming there's a sensible number of them). Any other arguments can be injected into `globals_dict` through `--params` (this is very similar to doing it through environment variables, as in the example I linked to earlier).
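A kedro-free sketch of that flow (the function and key names are made up for illustration): values arriving via `--params` land in `extra_params`, and the hook can merge them over whatever globals.yml provides, so the command line wins.

```python
# Illustrative only: mimics feeding extra_params into TemplatedConfigLoader's
# globals_dict, with command-line values taking precedence over globals.yml.

def resolve_globals(yaml_globals, extra_params):
    resolved = dict(yaml_globals)        # start from the globals.yml values
    resolved.update(extra_params or {})  # --params values override them
    return resolved

# `kedro run --params cohort:cohort_42` would arrive roughly as:
extra_params = {"cohort": "cohort_42"}
print(resolve_globals({"cohort": "default", "root_dir": "data"}, extra_params))
# prints {'cohort': 'cohort_42', 'root_dir': 'data'}
```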
i
Thanks! The extra_params is exactly what I needed! It was missing in the code generated in the ProjectHooks and I must have overlooked it. Thanks for pointing me to it!
I was checking the possibility of using extra_params; unfortunately I started this project with kedro 0.17.0, and there is no extra_params in the hook specification for register_config_loader. Would there be an easy way to add it? Or what would be the steps to go from kedro 0.17.0 to kedro 0.17.5 if the project was initialised with kedro 0.17.0? (Just updating kedro is not enough, i.e. the wrong run command is used, etc.)
d
The extra params is part of the TemplatedConfigLoader
i
Do you mean like an attribute of TemplatedConfigLoader?
d
So it's two things. At the point the `KedroSession` is created there is an opportunity to pass `extra_params`: https://kedro.readthedocs.io/en/latest/04_kedro_project_setup/03_session.html?highlight=extra_params#create-a-session There is also the `globals_dict` kwarg in the `TemplatedConfigLoader` constructor: https://kedro.readthedocs.io/en/stable/_modules/kedro/config/templated_config.html#TemplatedConfigLoader
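To make the `globals_dict` mechanics concrete, here is a kedro-free sketch using `string.Template`, which happens to use the same `${...}` placeholder syntax as templated catalog entries (the filepath below is made up):

```python
from string import Template

# A catalog filepath as it might appear in catalog.yml, with a ${cohort}
# placeholder that TemplatedConfigLoader would fill in from globals_dict.
entry = Template("data/02_intermediate/${cohort}/features.parquet")

# Rendering the same entry for two different cohort values:
for cohort in ("cohort_a", "cohort_b"):
    print(entry.substitute(cohort=cohort))
# prints data/02_intermediate/cohort_a/features.parquet
# prints data/02_intermediate/cohort_b/features.parquet
```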
i
Sure, the extra_params are in the KedroSession but not in register_config_loader. Because I'm using kedro 0.17.0, the register_config_loader hook has a different signature, and in KedroContext the hook_manager.hook.register_config_loader call receives neither the extra_params nor the env variable, like it does in kedro 0.17.5. So I have no access to those variables in the register_config_loader function in hooks.py, unless I change the hook specification for register_config_loader and pass the variables in hook_manager.hook.register_config_loader myself. So I should either upgrade to kedro 0.17.5 or keep using the click code to intercept the extra parameters provided.
d
Ah gotcha
Yes you should try and upgrade then
a
Yeah, I see this was added in 0.17.1. In theory all minor releases are non-breaking, so you should be able to upgrade to anything in 0.17.x and it will still work with your 0.17.0 project. In reality there have been a few very small breaking changes to non-essential workflows - see https://github.com/quantumblacklabs/kedro/blob/master/RELEASE.md
I think it's worth trying to upgrade to kedro 0.17.5 - chances are good that it will just work straight away, or with very little effort. If not, then kedro 0.17.1 should do what you need.
Upgrading should be easy enough that this is what I'd recommend rather than the click code you've got (good general rule: never delve into click objects unless you *really* need to 😬)
i
I tried to upgrade to 0.17.5 and it runs as long as I'm not using the CLI commands I implemented in cli.py. For some reason kedro is using the wrong run command. I also tried to add the missing files that are generated when creating a new project with kedro 0.17.5, e.g. the `__main__.py` file was missing. I will have a look at the link! Thanks for your help and time @User @User!
d
Good luck!
i
Thanks! It worked! I'm using Kedro 0.17.5 and the extra_params argument works as described!
d
Sweet!