# advanced-need-help
e
Hi community! Enthusiastic returning user here after trying Kedro for the first time two years ago. I'm really trying to get into the details this time, so here are two questions to begin with:

1) I really like the way you can define datasets in catalog.yml for use in your pipeline. However, I'm a bit stuck on where/how Kedro has defined "parameters", which is a reference to conf/base/parameters.yml. For reference, I'm using v0.18.1 and have initialized the Iris tutorial pipeline. In my mind, that reference should exist in conf/base/catalog.yml, but the only thing defined there is the "example_iris_dataset" used in the pipeline. Where/how does Kedro define "parameters"?

2) Returning to catalog.yml and defining datasets/datasources of different kinds: I have a little trouble picking the correct values for `type` in those definitions. For example, let's say I simply have another .yml file that I want to create a reference to. Which type would I use in that scenario? The closest thing I've found so far is `YAMLDataSet`. I've tried it by adding the following reference, for the file conf/base/yaml_test.yml, in catalog.yml:

```yaml
from kedro.extras.datasets.yaml import YAMLDataSet
yaml_test:
  type: YAMLDataSet
  filepath: conf/base/yaml_test.yml
```

But it seems incorrect... I mean, I could always cheat by not creating that reference in catalog.yml and simply hardcoding the loading of that yaml file in a suitable node of the pipeline, but that seems against the Kedro spirit. Help very much appreciated! 🙂
a
Hi @Evolute and welcome back to Kedro!

1. Parameters are defined by the parameters.yml file (actually anything which matches a more general pattern, so you could have multiple files called parameters_1.yml, parameters_2.yml, etc., or a directory called parameters). In terms of consumption in the pipeline, this does behave very similarly to a dataset name. In the example iris tutorial, you'll find it used in the `split_data` node: `inputs=["example_iris_data", "parameters"]`. See https://kedro.readthedocs.io/en/stable/kedro_project_setup/configuration.html#use-parameters for more.
2. You are confusing two slightly different concepts here:
* `YAMLDataSet` is a dataset type that can be used to store dictionary data used as a node input/output. The actual .yml file here should not live in `conf`; it should go in `data` (or S3 or wherever else).
* Your project configuration lives in `conf` and is also written in YAML, but it does not use `YAMLDataSet`. It's a separate concept of runtime configuration rather than a data source.

`parameters` is a bit of a special case because it's runtime configuration defined in `conf`, but you can use it as a node input. Note that there's no explicit definition of `parameters` in the catalog.yml file, i.e. `parameters` is not a `YAMLDataSet`.
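To make point 1 concrete, here's a minimal sketch of a parameters file; the key name below comes from the iris starter, so treat it as an assumption for your version:

```yaml
# conf/base/parameters.yml (could equally be parameters_1.yml or parameters/anything.yml)
# No catalog.yml entry is needed for this file.
example_test_data_ratio: 0.2
```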
e
Hi @antony.milne, thank you for the answer! Yes, of course. I completely understand that the parameters are defined in parameters.yml, however I'm a bit lost in where/how Kedro creates the reference to it so that we can use it in our nodes. In that split_data node example you posted, we can clearly see that parameters.yml is being referenced by "parameters". It's that reference that I want to understand
a
You also don't need to do `from kedro.extras.datasets.yaml import YAMLDataSet` in the yaml file. If you use `type: yaml.YAMLDataSet` then Kedro knows where to import it from automatically. In fact, it doesn't make sense to put Python imports in a .yml file, because it's not written in Python; it's written in YAML.
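Putting that together, a corrected version of the earlier catalog entry might look like this (the filepath under `data/01_raw/` is an assumption, following the advice above that dataset files should not live in `conf`):

```yaml
# conf/base/catalog.yml -- dotted type path, no Python import
yaml_test:
  type: yaml.YAMLDataSet
  filepath: data/01_raw/yaml_test.yml
```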
Ah ok, I see the confusion. Basically `parameters` and `params:..` are special. They are loaded up automatically by Kedro and can be treated as dataset names even though they are not defined in the catalog.
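For example, a sketch of how both forms can appear as node inputs; the function bodies and output names are illustrative, and `example_test_data_ratio` is assumed from the iris starter:

```python
from kedro.pipeline import node, pipeline

def split_data(data, parameters):
    # "parameters" arrives as a plain dict built from parameters.yml
    test_ratio = parameters["example_test_data_ratio"]
    return data.sample(frac=1 - test_ratio)  # assumes a pandas DataFrame

def report_ratio(ratio):
    return f"test ratio is {ratio}"

demo = pipeline(
    [
        # "parameters" -> the whole parameters dict
        node(split_data, inputs=["example_iris_data", "parameters"], outputs="train_sample"),
        # "params:example_test_data_ratio" -> just that single key
        node(report_ratio, inputs="params:example_test_data_ratio", outputs="ratio_report"),
    ]
)
```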
The actual code that does this is here in case you're interested: https://github.com/kedro-org/kedro/blob/main/kedro/framework/context/context.py#L244
You'll see that it uses `config_loader` rather than `YAMLDataSet`. This means that you can have multiple configuration environments (folders in `conf`), each of which has its own `parameters.yml` file. And when you run `kedro run --env=...` it will pick up the right file.
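As an illustration of that environment mechanism (the `prod` environment name and override value below are made up):

```yaml
# conf/base/parameters.yml -- default value
example_test_data_ratio: 0.2
```

```yaml
# conf/prod/parameters.yml -- overrides the base value when you run `kedro run --env=prod`
example_test_data_ratio: 0.1
```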
FYI, this is the function that makes `"parameters"` and `"params:..."` available as dataset names in node inputs: https://github.com/kedro-org/kedro/blob/main/kedro/framework/context/context.py#L312 ("feed dict" is basically weird terminology for parameters here, just for historical reasons).
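Roughly speaking, the feed dict that the linked function builds looks like this; a simplified sketch, not the actual Kedro source (the real code also handles nested keys, which this flat version skips):

```python
# Contents of parameters.yml as a dict; the key is assumed from the iris starter.
params = {"example_test_data_ratio": 0.2}

feed_dict = {"parameters": params}
for key, value in params.items():
    feed_dict[f"params:{key}"] = value

# feed_dict is then registered on the catalog, which is why "parameters" and
# "params:..." work as node inputs without any catalog.yml entries.
```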
e
Absolutely fantastic! I suspected something like that, but you've cleared it up completely. Thanks 🙂 Ok, so here's one more! The reason I asked these things is because I actually want to add another configuration file (which will be in .yml format). How would I go ahead and create the reference for that in catalog.yml? For example, say that I want to create a reference to conf/base/additional.yml.
a
What sort of thing are you intending to put in this? And is the reason you're interested in doing so because you want the file to be different for different run environments?
e
Well, the thing is, the pipeline I want to create won't have local raw data; rather, it will fetch data from MongoDB using various configs and paths that I want to put in a .yml file. I could in theory put all of that in the parameters.yml file, but essentially I just want to learn how to add my own additional .yml file reference (so that I can both learn and have the option to do so in the future, if the need arises).
n
In that case, you most likely want a custom dataset defined in `catalog.yml` instead of having a separate config file.
a
This is a very interesting question actually, because the best way to solve it is not super obvious! An easy way to do something similar would be just to define it as a dataset:
```yaml
# conf/base/catalog.yml
mongo_db:
  type: yaml.YAMLDataSet
  filepath: data/mongo_db_config.yml
```
And then have a different entry for `mongo_db` in different run environments that points to different files, e.g.
```yaml
# conf/env/catalog.yml
mongo_db:
  type: yaml.YAMLDataSet
  filepath: data/mongo_db_config_env.yml
```
But then you might very reasonably argue that if those `mongo_db_config_env.yml` files are different for each environment, they belong in `conf` rather than `data` as you were originally doing. So if you want to do this "properly" and have something that behaves like parameters, I think you should be able to do so with some hooks:
```python
from kedro.framework.hooks import hook_impl


class MongoDBHooks:
    @hook_impl
    def after_context_created(self, context):
        # Stash the config loader so the next hook can use it
        self.config_loader = context.config_loader

    @hook_impl
    def after_catalog_created(self, catalog):
        # Load conf/<env>/mongo_db*.yml and register it under the name "mongo_db"
        mongo_db = self.config_loader.get("mongo_db*")
        catalog.add_feed_dict({"mongo_db": mongo_db})
```
This is basically just extracting the key parts of the code that converts `parameters.yml` into something that can be used as a node input. You can then use `"mongo_db"` as a node input. I've left out the stuff that would enable you to use subkeys like `mongo_db:key` here.
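For completeness, hooks like this get registered via `HOOKS` in your project's settings.py; a sketch, where the module path `my_project.hooks` is an assumption about your project layout:

```python
# src/my_project/settings.py
from my_project.hooks import MongoDBHooks  # hypothetical module path

HOOKS = (MongoDBHooks(),)
```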
e
Thank you so very much! I'm going to try that 🙂 I did something similar yesterday, but I think my implementation was wrong, so I'll go ahead and test yours! My pipeline will fetch different data for different runs (same environment though) and I plan to simply use `kedro run --params` to supply those unique variables. If I can get this to work, it's going to be super nice. I'll try and report back!
a
Awesome, let me know how it works! The `after_context_created` hook is very new (kedro 0.18.1 only) and we're working on improving how the config loader and context work. So it will be very interesting to hear what works well here or if you have any suggestions.
e
Amazing, it worked like a charm. I simply added the following to catalog.yml:

```yaml
mongo_db:
  type: yaml.YAMLDataSet
  filepath: conf/base/mongo_db_config.yml
```

A simple and elegant solution for my needs 🙂