# advanced-need-help
e
Hi community! Enthusiastic returning user here after trying Kedro for the first time two years ago. I'm really trying to get into the details this time, so here are two questions to begin with:

1) I really like the way you can define datasets in catalog.yml for use in your pipeline. However, I'm a bit stuck on where/how Kedro has defined "parameters", which is a reference to conf/base/parameters.yml. For reference, I'm using v0.18.1 and have initialized the Iris tutorial pipeline. In my mind, that reference should exist in conf/base/catalog.yml, but the only thing defined there is the "example_iris_dataset" used in the pipeline. Where/how does Kedro define "parameters"?

2) Returning to catalog.yml and defining datasets/datasources of different kinds: I have a little trouble picking the correct values for `type` in those definitions. For example, let's say I simply have another .yml file that I want to create a reference to. Which type would I use in that scenario? The closest thing I've found so far is `YAMLDataSet`. I've tried it by adding the following reference, for the file conf/base/yaml_test.yml, in catalog.yml:

```yaml
from kedro.extras.datasets.yaml import YAMLDataSet
yaml_test:
  type: YAMLDataSet
  filepath: conf/base/yaml_test.yml
```

But it seems incorrect... I mean, I could always cheat by not creating that reference in catalog.yml and simply hardcoding the loading of that yaml file in a suitable node of the pipeline, but that seems against the Kedro spirit. Help very much appreciated! 🙂
a
Hi @Evolute and welcome back to Kedro!

1. Parameters are defined by the parameters.yml file (actually anything which matches a more general pattern, so you could have multiple files called parameters_1.yml, parameters_2.yml, etc., or a directory called parameters). In terms of consumption in the pipeline, this does behave very similarly to a dataset name. In the example iris tutorial, you'll find it used in the `split_data` node: `inputs=["example_iris_data", "parameters"]`. See https://kedro.readthedocs.io/en/stable/kedro_project_setup/configuration.html#use-parameters for more.
2. You are confusing two slightly different concepts here:
* `YAMLDataSet` is a dataset type that can be used to store dictionary data used as a node input/output. The actual .yml file here should not live in `conf`; it should go in `data` (or S3 or wherever else).
* Your project configuration lives in `conf` and is also written in YAML, but it does not use `YAMLDataSet`. It's a separate concept of runtime configuration rather than a data source.

`parameters` is a bit of a special case because it's runtime configuration defined in `conf`, but you can use it as a node input. Note that there's no explicit definition of `parameters` in the catalog.yml file, i.e. `parameters` is not a `YAMLDataSet`.
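To make point 1 concrete, here's a minimal sketch of a parameters file; the key name below comes from the iris starter, so treat it as an assumption for your version:

```yaml
# conf/base/parameters.yml (could equally be parameters_1.yml or parameters/anything.yml)
# No catalog.yml entry is needed for this file.
example_test_data_ratio: 0.2
```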
e
Hi @antony.milne, thank you for the answer! Yes, of course. I completely understand that the parameters are defined in parameters.yml, however I'm a bit lost in where/how Kedro creates the reference to it so that we can use it in our nodes. In that split_data node example you posted, we can clearly see that parameters.yml is being referenced by "parameters". It's that reference that I want to understand
a
You also don't need to do `from kedro.extras.datasets.yaml import YAMLDataSet` in the yaml file. If you use `type: yaml.YAMLDataSet` then Kedro knows where to import it from automatically. In fact, it doesn't make sense to put Python imports in a .yml file, because it's not written in Python; it's written in YAML.
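Putting that together, a corrected version of the earlier catalog entry might look like this (the filepath under `data/01_raw/` is an assumption, following the advice above that dataset files should not live in `conf`):

```yaml
# conf/base/catalog.yml -- dotted type path, no Python import
yaml_test:
  type: yaml.YAMLDataSet
  filepath: data/01_raw/yaml_test.yml
```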
Ah ok, I see the confusion. Basically `parameters` and `params:..` are special. They are loaded up automatically by Kedro and can be treated as dataset names even though they are not defined in the catalog.
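For example, a sketch of how both forms can appear as node inputs; the function bodies and output names are illustrative, and `example_test_data_ratio` is assumed from the iris starter:

```python
from kedro.pipeline import node, pipeline

def split_data(data, parameters):
    # "parameters" arrives as a plain dict built from parameters.yml
    test_ratio = parameters["example_test_data_ratio"]
    return data.sample(frac=1 - test_ratio)  # assumes a pandas DataFrame

def report_ratio(ratio):
    return f"test ratio is {ratio}"

demo = pipeline(
    [
        # "parameters" -> the whole parameters dict
        node(split_data, inputs=["example_iris_data", "parameters"], outputs="train_sample"),
        # "params:example_test_data_ratio" -> just that single key
        node(report_ratio, inputs="params:example_test_data_ratio", outputs="ratio_report"),
    ]
)
```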
The actual code that does this is here in case you're interested: https://github.com/kedro-org/kedro/blob/main/kedro/framework/context/context.py#L244
You'll see that it uses `config_loader` rather than `YAMLDataSet`. This means that you can have multiple configuration environments (folders in `conf`), each of which has its own `parameters.yml` file. And when you run `kedro run --env=...` it will pick up the right file.
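As an illustration of that environment mechanism (the `prod` environment name and override value below are made up):

```yaml
# conf/base/parameters.yml -- default value
example_test_data_ratio: 0.2
```

```yaml
# conf/prod/parameters.yml -- overrides the base value when you run `kedro run --env=prod`
example_test_data_ratio: 0.1
```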
FYI, this is the function that makes `"parameters"` and `"params:..."` available as dataset names in node inputs: https://github.com/kedro-org/kedro/blob/main/kedro/framework/context/context.py#L312 ("feed dict" is basically weird terminology for parameters here, just for historical reasons).
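Roughly speaking, the feed dict that the linked function builds looks like this; a simplified sketch, not the actual Kedro source (the real code also handles nested keys, which this flat version skips):

```python
# Contents of parameters.yml as a dict; the key is assumed from the iris starter.
params = {"example_test_data_ratio": 0.2}

feed_dict = {"parameters": params}
for key, value in params.items():
    feed_dict[f"params:{key}"] = value

# feed_dict is then registered on the catalog, which is why "parameters" and
# "params:..." work as node inputs without any catalog.yml entries.
```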
e
Absolutely fantastic! I suspected something like that, but you've cleared it up completely. Thanks 🙂 Ok, so here's one more! The reason I asked these things is because I actually want to add another configuration file (which will be in .yml format). How would I go ahead and create the reference for that in catalog.yml? For example, say that I want to create a reference to conf/base/additional.yml.
a
What sort of thing are you intending to put in this? And is the reason you're interested in doing so because you want the file to be different for different run environments?
e
Well, the thing is, the pipeline I want to create won't have local raw data; rather, it will fetch data from MongoDB using various configs and paths that I want to put in a .yml file. I could in theory put all of that in the parameters.yml file, but essentially I just want to learn how to add my own additional .yml file reference (so that I can both learn and have the option to do so in the future, if the need arises).
n
In that case, you most likely want a custom dataset defined in `catalog.yml` instead of having a separate config file.
a
This is a very interesting question actually, because the best way to solve it is not super obvious! An easy way to do something similar would be just to define it as a dataset:
```yaml
# conf/base/catalog.yml
mongo_db:
  type: yaml.YAMLDataSet
  filepath: data/mongo_db_config.yml
```
And then have a different entry for `mongo_db` in different run environments that points to different files, e.g.
```yaml
# conf/env/catalog.yml
mongo_db:
  type: yaml.YAMLDataSet
  filepath: data/mongo_db_config_env.yml
```
But then you might very reasonably argue that if those `mongo_db_config_env.yml` files are different for each environment, they belong in `conf` rather than `data` as you were originally doing. So if you want to do this "properly" and have something that behaves like parameters, I think you should be able to do so with some hooks:
```python
from kedro.framework.hooks import hook_impl


class MongoDBHooks:
    @hook_impl
    def after_context_created(self, context):
        # Stash the config loader so the next hook can use it
        self.config_loader = context.config_loader

    @hook_impl
    def after_catalog_created(self, catalog):
        # Load conf/<env>/mongo_db*.yml and register it under the name "mongo_db"
        mongo_db = self.config_loader.get("mongo_db*")
        catalog.add_feed_dict({"mongo_db": mongo_db})
```
This is basically just extracting the key parts of the code that converts `parameters.yml` into something that can be used as a node input. You can then use `"mongo_db"` as a node input. I've left out the stuff that would enable you to use subkeys like `mongo_db:key` here.
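For completeness, hooks like this get registered via `HOOKS` in your project's settings.py; a sketch, where the module path `my_project.hooks` is an assumption about your project layout:

```python
# src/my_project/settings.py
from my_project.hooks import MongoDBHooks  # hypothetical module path

HOOKS = (MongoDBHooks(),)
```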
e
Thank you so very much! I'm going to try that 🙂 I did something similar yesterday, but I think my implementation was wrong, so I'll go ahead and test yours! My pipeline will fetch different data for different runs (same environment though) and I plan to simply use `kedro run --params` to supply those unique variables. If I can get this to work, it's going to be super nice. I'll try and report back!
a
Awesome, let me know how it works! The `after_context_created` hook is very new (kedro 0.18.1 only) and we're working on improving how the config loader and context work. So it will be very interesting to hear what works well here or if you have any suggestions.
e
Amazing, it worked like a charm. I simply added the following to catalog.yml:

```yaml
mongo_db:
  type: yaml.YAMLDataSet
  filepath: conf/base/mongo_db_config.yml
```

A simple and elegant solution for my needs 🙂