Is there a way to automatically add a dataset to catalog based on params?
# beginners-need-help
c
Is there a way to automatically add a dataset to the catalog based on params? Because the API endpoint is exactly the same, but the name differs.
d
Hey @User, started a thread so we don't pollute the main channel
c
This is what I've got for the moment
that's fine, but I wondered if I can make it more compact 😄
d
```python
from kedro.extras.datasets.api import APIDataSet
from kedro.framework.hooks import hook_impl


class APICatalogHooks:

    @hook_impl
    def after_catalog_created(self, catalog, conf_catalog, conf_creds, feed_dict, save_version, load_versions, run_id):
        """This is an advanced use of the catalog hook so that we create the right
        catalog entries at runtime based on the inputs in `params:api_stuff`. The
        alternative is that your non-technical users would have to create the three
        output dataset entries in the catalog for every input they declare.
        """
        api_datasets_to_create = feed_dict['params:api_stuff']

        for dataset in api_datasets_to_create:
            new_entry = {
                f"{dataset['compare_1']}_vs_{dataset['compare_2']}_market_chart": APIDataSet(url=f'https://api.coingecko.com...{params}...'),
            }
            catalog.add(new_entry, replace=True)
```
you then add this hook to `settings.py` and it will dynamically create the necessary catalog entries
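the registration is a one-liner, a minimal sketch assuming your project package is called `my_project` (a placeholder name):
```python
# settings.py -- register the hook so Kedro discovers it at startup
# (`my_project` is a placeholder for your actual package name)
from my_project.hooks import APICatalogHooks

HOOKS = (APICatalogHooks(),)
```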
the long catalog name makes it super ugly here
but hopefully you get the point
c
long catalog names are okay because I need this kind of info to be accurate (because it could be vs usd, vs eur, vs gbp... and it could be anything other than market chart)
btw thanks for the code snippets
I will try it now
d
Good luck!
to explain how this works: by default Kedro will create `MemoryDataSet`s for every input/output in the pipeline definition NOT in the catalog
c
I create my custom hook class, then I add it to `settings.py`, so when I run the kedro project it will do the job
yes, I have that point in mind
d
then this hook will replace the `MemoryDataSet`s with the real references to the `APIDataSet` instances
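i.e. roughly this effect (an illustration of the swap, not actual framework code):
```python
from kedro.extras.datasets.api import APIDataSet
from kedro.io import DataCatalog, MemoryDataSet

catalog = DataCatalog({})

# what Kedro does implicitly for a dataset with no catalog entry
catalog.add("bitcoin_vs_usd_market_chart", MemoryDataSet())

# what the hook does before the run starts: swap in the real dataset
catalog.add(
    "bitcoin_vs_usd_market_chart",
    APIDataSet(url="https://api.coingecko.com/..."),
    replace=True,
)
```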
c
Perfect this was exactly what I needed
thanks a lot
d
💪
c
Are these arguments optional or mandatory?
`(self, catalog, conf_catalog, conf_creds, feed_dict, save_version, load_versions, run_id)`
oh ok, I saw in the template project hooks that they are also there
d
they are mandatory
but you don't have to use them
you can read about how they work here
but essentially each of the available hooks, like `before_node_run`, has a function signature that exposes certain things - in this case the catalog
c
you are using this: `feed_dict['params:api_stuff']`
If my understanding is correct, I create a file in `~/conf/base/feed_dict.yml` with this inside:
```yml
params:
  - bitcoin
  - ethereum
  - chiliz
...
```
d
so if your `parameters.yaml` looks like
```yml
crypto_currencies:
  - bitcoin
  - ethereum
  - chiliz
```
you would have to do `feed_dict['params:crypto_currencies']`
and you would get a python list of the 3 strings in that key
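i.e. a minimal sketch of that lookup inside the hook body:
```python
# inside after_catalog_created, given the parameters.yaml above
currencies = feed_dict["params:crypto_currencies"]
print(currencies)  # ['bitcoin', 'ethereum', 'chiliz']
```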
c
ok I see
what is the import for `feed_dict`?
d
then I would recommend using `itertools` to generate the combinations in your hook
the `feed_dict` is a bit of an old name
c
is it `add_feed_dict()` right now?
d
it's for advanced usage, to inject more parameters
but it's a part of the library we only expose if you're doing advanced things like hooks
c
Sorry, I didn't understand the last part 😄
Never used itertools, I do list or dict comprehensions in Python
d
if you give `combinations` an iterable and the number 2, it will generate every pair without doing each pair in reverse too
`combinations` returns a generator, so you need to iterate through it to get the pairs out; `list` does that for readability here, but you could do a for loop too
`itertools` is genuinely one of the best bits of the standard library
if you want the reverse pairs, it's got that sorted with `permutations`
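for example:
```python
import itertools

coins = ["bitcoin", "ethereum", "chiliz"]

# every unordered pair, no reverses: 3 pairs
print(list(itertools.combinations(coins, 2)))
# [('bitcoin', 'ethereum'), ('bitcoin', 'chiliz'), ('ethereum', 'chiliz')]

# every ordered pair, reverses included: 6 pairs
print(list(itertools.permutations(coins, 2)))
# [('bitcoin', 'ethereum'), ('bitcoin', 'chiliz'), ('ethereum', 'bitcoin'), ...]
```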
c
would be useful when I implement eur or gbp alongside usd
nice, thanks
so helpful!
d
my pleasure
c
I read the config section about config files and loading, but it's not super clear about using config params inside a hook
I know how to use the Hydra config manager, so maybe it's a different paradigm for Kedro, but I can't understand how to use them
Ok, I found a resource that will help with understanding parameters 🙂
d
The params appear in the `feed_dict` kwarg, which you can use
honestly, the easiest technique is to put a `breakpoint()` in the hook body and then inspect it at runtime to get a sense of what's available
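something like this, reusing the same hook signature as above:
```python
@hook_impl
def after_catalog_created(self, catalog, conf_catalog, conf_creds,
                          feed_dict, save_version, load_versions, run_id):
    breakpoint()  # drops into pdb when the hook fires
    # at the (Pdb) prompt, try e.g. feed_dict.keys() to see what's in there
```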
the parameters in the linked article are the standard way of doing it - you are using an advanced mechanism for auto-generating catalog entries
c
oh c'mon, it's my fault, I didn't add `feed_dict` to my function args
That's why I couldn't find `feed_dict`...
d
no worries
what I do
c
it seems that the doc is missing `feed_dict`, btw
d
ah good spot!
what I do is copy the arguments from this page
c
the first code snippet you provided seems to be broken at the `catalog.add()` line, but I fixed it using the source code
```python
catalog.add(
    data_set_name=f"{dataset[0]}_vs_{dataset[1]}_market_chart",
    data_set=APIDataSet(
        url=f"https://api.coingecko.com/api/v3/coins/{dataset[0]}/market_chart?vs_currency={dataset[1]}&days=max&interval=daily"
    ),
    replace=True
)
```
the `new_entry` dict doesn't work - `kedro run` raises an error because it's missing a name for the dataset, so I removed `new_entry` and passed both arguments directly into the `add()` function 🙂
But that works, thanks again.
d
amazing!
u
Not sure if this is helpful, but I have a "pre-launch pipeline" that builds CSV files from a database and runs before my other pipelines. After this pipeline produces the CSVs, I have an "autogenerated" section in my parameters files and in my catalog.yml, and a script that goes through and updates both files to include the values from the CSVs. Once both my parameters and catalog are updated with these entries, my create_pipeline loops through the keys in my parameters files in order to create the nodes of my pipeline. I found that when I tried to use catalog.add it was easy to get confused, and there was also some case (that I can't remember now) that writing the updates to the params & catalog files fixed.
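For illustration, a rough sketch of that kind of generator script; every name, path, and column below is hypothetical, not from the actual project:
```python
import csv
import yaml

# read the coins produced by the pre-launch pipeline
with open("data/01_raw/prelaunch_output.csv") as f:
    coins = [row["coin"] for row in csv.DictReader(f)]

# build one catalog entry per coin
catalog_entries = {
    f"{coin}_market_chart": {
        "type": "pandas.CSVDataSet",
        "filepath": f"data/02_intermediate/{coin}_market_chart.csv",
    }
    for coin in coins
}

# write the autogenerated section to its own catalog file
# (Kedro's config loader picks up any conf/base/catalog*.yml by default)
with open("conf/base/catalog_autogenerated.yml", "w") as f:
    yaml.safe_dump(catalog_entries, f)
```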
c
I'm curious about the "in order to create the nodes of my pipeline" part 👀
Do you define a template node and use it for generating N nodes based on that template?
d
I'll say that this isn't entirely endorsed by the Kedro core team - the only reason being that we have observed that the more dynamic your pipeline definition, the harder it is to debug or to onboard new team members
obviously every situation is different and it's great to see the creative ways people use the tool
c
I get it
I will probably go with adding nodes manually for now 🙂
Even though automatically adding datasets is useless if I can't automatically add nodes 😄
u
yeah I automatically generate a parameters file, which is then parsed in order to generate pipeline nodes
c
I'm on another path
I will make a pipeline and pass params at run time via `kedro run --params key:value,key2:value ...`
so I can define a standard pipeline without worrying about datasets/nodes that depend on the params given at runtime
I think it will be easier to maintain and to understand
So I use, for example, `kedro run --params currency:bitcoin,compare:usd`
How do I format the node inputs? I have tried:
```python
node(
    func=format_market_chart_to_dataframe,
    inputs="params:currency_vs_params:compare_market_chart",
    outputs="fetched_params:currency_vs_params:compare_market_chart",
    name="fetched_data_node",
),
```
I also tried with f-strings, but that doesn't work; I get this error:
```bash
ValueError: Pipeline input(s) {'params:currency_vs_params:compare_market_chart'} not found in the DataCatalog
```
I also tried adding single quotes inside double quotes
I will go with standard names for datasets, without variables, but I'm still curious whether there is a way to add parameters to inputs (without putting them in a list).
Sorry for spamming, but Kedro is so awesome I want to use it at its full potential 😄
d
No worries
I'm not entirely sure what you're trying to do here
can you show me how you've formatted your parameters?
are you using any hooks?
c
Yes I use the hook you gave me last time
```python
from typing import Any, Dict

from kedro.extras.datasets.api import APIDataSet
from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog


class APICatalogHooks:
    @hook_impl
    def after_catalog_created(
        self,
        catalog: DataCatalog,
        conf_catalog: Dict[str, Any],
        conf_creds: Dict[str, Any],
        feed_dict: Dict[str, Any],
        save_version: str,
        load_versions: Dict[str, str],
        run_id: str,
    ) -> None:
        """
        This hook is called after the catalog is created. It creates one entry
        in the catalog per crypto currency listed in the config file.
        """
        currency = feed_dict["params:currency"]
        compare = feed_dict["params:compare"]

        catalog.add(
            data_set_name="inputs_market_chart",
            data_set=APIDataSet(
                url=f"https://api.coingecko.com/api/v3/coins/{currency}/market_chart?vs_currency={compare}&days=max&interval=daily"
            ),
            replace=True,
        )
```
for running, I add 2 parameters: `kedro run --params currency:bitcoin,compare:usd`
This way I can use the same pipelines just by changing the parameters I pass when running them
And I decided to use standard names for datasets in my pipelines
```python
from kedro.pipeline import Pipeline, node

from .nodes import fetch_data_to_dataframe  # project-specific node function


def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                func=fetch_data_to_dataframe,
                inputs=["params:currency", "params:compare"],
                outputs="fetched_market_chart",
                name="fetching_data_node",
            ),
        ]
    )
```
d
nice, so is it working?
what error are you getting?
c
no that's fine 🙂
d
💪
c
I wanted to use dynamic names for datasets, but it's unnecessary; it's easier like this
d
nice