Is there a way to automatically add a dataset to catalog based on params?
# beginners-need-help
c
Is there a way to automatically add a dataset to the catalog based on params? Because the API endpoint is exactly the same, but the name differs.
d
Hey @User, started a thread so we don't pollute the main channel
c
This is what I've got for the moment
that's fine, but I wondered if I can make it more compact 😄
d
```python
from kedro.extras.datasets.api import APIDataSet
from kedro.framework.hooks import hook_impl


class APICatalogHooks:

    @hook_impl
    def after_catalog_created(self, catalog, conf_catalog, conf_creds, feed_dict, save_version, load_versions, run_id):
        """This is an advanced use of the catalog hook so that we create the right
        catalog entries at runtime based on the inputs in `params:api_stuff`. The
        alternative is that your non-technical users would have to create the three
        output dataset entries in the catalog for every input they declare.
        """
        api_datasets_to_create = feed_dict['params:api_stuff']

        for dataset in api_datasets_to_create:
            new_entry = {
                f"{dataset['compare_1']}_vs_{dataset['compare_2']}_market_chart": APIDataSet(url=f'https://api.coingecko.com...{params}...'),
            }
            catalog.add(new_entry, replace=True)
```
you then add this hook to `settings.py` and it will dynamically create the necessary catalog entries
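the registration is a one-liner, a minimal sketch assuming your project package is called `my_project` (a placeholder name):
```python
# settings.py -- register the hook so Kedro discovers it at startup
# (`my_project` is a placeholder for your actual package name)
from my_project.hooks import APICatalogHooks

HOOKS = (APICatalogHooks(),)
```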
the long catalog name makes it super ugly here
but hopefully you get the point
c
long catalog names are okay because I need this kind of info to be accurate (because it could be vs usd, vs eur, vs gbp... and it could be anything other than market chart)
btw thanks for the code snippets
I will try it now
d
Good luck!
to explain how this works: by default Kedro will create `MemoryDataSet`s for every input/output in the pipeline definition NOT in the catalog
c
I create my custom hook class, then I add it to `settings.py`, so when I run the kedro project it will do the job
yes, I have that point in mind
d
then this hook will replace the `MemoryDataSet`s with the real references to the `APIDataSet` instances
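i.e. roughly this effect (an illustration of the swap, not actual framework code):
```python
from kedro.extras.datasets.api import APIDataSet
from kedro.io import DataCatalog, MemoryDataSet

catalog = DataCatalog({})

# what Kedro does implicitly for a dataset with no catalog entry
catalog.add("bitcoin_vs_usd_market_chart", MemoryDataSet())

# what the hook does before the run starts: swap in the real dataset
catalog.add(
    "bitcoin_vs_usd_market_chart",
    APIDataSet(url="https://api.coingecko.com/..."),
    replace=True,
)
```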
c
Perfect this was exactly what I needed
thanks a lot
d
💪
c
Are these arguments optional or mandatory?
`(self, catalog, conf_catalog, conf_creds, feed_dict, save_version, load_versions, run_id)`
oh ok, I saw in the template project hooks that they are also there
d
they are mandatory
but you don't have to use them
you can read about how they work here
but essentially each of the available hooks, like `before_node_run`, has a function signature that exposes certain things - in this case the catalog
c
you are using this: `feed_dict['params:api_stuff']`
If my understanding is correct, I create a file in `~/conf/base/feed_dict.yml` with this inside:
```yml
params:
  - bitcoin
  - ethereum
  - chiliz
...
```
d
so if your `parameters.yaml` looks like
```yml
crypto_currencies:
  - bitcoin
  - ethereum
  - chiliz
```
you would have to do `feed_dict['params:crypto_currencies']`
and you would get a python list of the 3 strings in that key
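i.e. a minimal sketch of that lookup inside the hook body:
```python
# inside after_catalog_created, given the parameters.yaml above
currencies = feed_dict["params:crypto_currencies"]
print(currencies)  # ['bitcoin', 'ethereum', 'chiliz']
```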
c
ok I see
what is the import for `feed_dict`?
d
then I would recommend using `itertools` to generate the combinations in your hook
the `feed_dict` is a bit of an old name
c
is it `add_feed_dict()` right now?
d
it's for advanced usage, to inject more parameters
but it's a part of the library we only expose if you're doing advanced things like hooks
c
Sorry, I didn't understand the last part 😄
Never used itertools, I do list or dict comprehensions in Python
d
if you give `combinations` an iterable and the number 2, it will generate every pair without doing each pair in reverse too
`combinations` returns a generator, so you need to iterate through it to get the pairs out; `list` does that for readability here, but you could do a for loop too
`itertools` is genuinely one of the best bits of the standard library
if you want the reverse pairs, it's got that sorted with `permutations`
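for example:
```python
import itertools

coins = ["bitcoin", "ethereum", "chiliz"]

# every unordered pair, no reverses: 3 pairs
print(list(itertools.combinations(coins, 2)))
# [('bitcoin', 'ethereum'), ('bitcoin', 'chiliz'), ('ethereum', 'chiliz')]

# every ordered pair, reverses included: 6 pairs
print(list(itertools.permutations(coins, 2)))
# [('bitcoin', 'ethereum'), ('bitcoin', 'chiliz'), ('ethereum', 'bitcoin'), ...]
```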
c
would be useful when I implement eur or gbp alongside usd
nice, thanks
so helpful!
d
my pleasure
c
I read the config section about config files and loading, but it's not super clear about using config params inside a hook
I know how to use the Hydra config manager, so maybe it's a different paradigm for Kedro, but I can't understand how to use them
Ok, I found a resource that will help with understanding parameters 🙂
d
The params appear in the `feed_dict` kwarg, which you can use
honestly, the easiest technique is to put a `breakpoint()` in the hook body and then inspect it at runtime to get a sense of what's available
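something like this, reusing the same hook signature as above:
```python
@hook_impl
def after_catalog_created(self, catalog, conf_catalog, conf_creds,
                          feed_dict, save_version, load_versions, run_id):
    breakpoint()  # drops into pdb when the hook fires
    # at the (Pdb) prompt, try e.g. feed_dict.keys() to see what's in there
```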
the parameters in the linked article are the standard way of doing it - you are using an advanced mechanism for auto-generating catalog entries
c
oh c'mon, it's my fault, I didn't add `feed_dict` to my function args
That's why I couldn't find `feed_dict`...
d
no worries
what I do
c
it seems that the doc is missing `feed_dict`, btw
d
ah good spot!
what I do is copy the arguments from this page
c
the first code snippet you provided seems to be broken at the `catalog.add()` line, but I fixed it using the source code
```python
catalog.add(
    data_set_name=f"{dataset[0]}_vs_{dataset[1]}_market_chart",
    data_set=APIDataSet(
        url=f"https://api.coingecko.com/api/v3/coins/{dataset[0]}/market_chart?vs_currency={dataset[1]}&days=max&interval=daily"
    ),
    replace=True
)
```
the `new_entry` dict doesn't work - `kedro run` raises an error because it's missing a name for the dataset, so I removed `new_entry` and passed both arguments directly into the `add()` function 🙂
But that works, thanks again.
d
amazing!
u
Not sure if this is helpful, but I have a "pre-launch pipeline" that builds CSV files from a database and runs before my other pipelines. After this pipeline produces the CSVs, I have an "autogenerated" section in my parameters files and in my catalog.yml, and a script that goes through and updates both files to include the values from the CSVs. Once both my parameters and catalog are updated with these entries, my create_pipeline loops through the keys in my parameters files in order to create the nodes of my pipeline. I found that when I tried to use catalog.add it was easy to get confused, and there was also some case (that I can't remember now) that writing the updates to the params & catalog files fixed.
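For illustration, a rough sketch of that kind of generator script; every name, path, and column below is hypothetical, not from the actual project:
```python
import csv
import yaml

# read the coins produced by the pre-launch pipeline
with open("data/01_raw/prelaunch_output.csv") as f:
    coins = [row["coin"] for row in csv.DictReader(f)]

# build one catalog entry per coin
catalog_entries = {
    f"{coin}_market_chart": {
        "type": "pandas.CSVDataSet",
        "filepath": f"data/02_intermediate/{coin}_market_chart.csv",
    }
    for coin in coins
}

# write the autogenerated section to its own catalog file
# (Kedro's config loader picks up any conf/base/catalog*.yml by default)
with open("conf/base/catalog_autogenerated.yml", "w") as f:
    yaml.safe_dump(catalog_entries, f)
```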
c
I'm curious about the "in order to create the nodes of my pipeline" part 👀
Do you define a template node and use it for generating N nodes based on that template?
d
I'll say that this isn't entirely endorsed by the Kedro core team - the only reason being that we have observed that the more dynamic your pipeline definition, the harder it is to debug or to onboard new team members
obviously every situation is different and it's great to see the creative ways people use the tool
c
I get it
I will probably go with adding nodes manually for now 🙂
Even though automatically adding datasets is useless if I can't automatically add nodes 😄
u
yeah I automatically generate a parameters file, which is then parsed in order to generate pipeline nodes
c
I'm on another path
I will make a pipeline and pass params at run time via `kedro run --params key:value,key2:value ...`
so I can define a standard pipeline without worrying about datasets/nodes that depend on the params given at runtime
I think it will be easier to maintain and to understand
So I use, for example, `kedro run --params currency:bitcoin,compare:usd`
How do I format the node inputs? I have tried:
```python
node(
    func=format_market_chart_to_dataframe,
    inputs="params:currency_vs_params:compare_market_chart",
    outputs="fetched_params:currency_vs_params:compare_market_chart",
    name="fetched_data_node",
),
```
I also tried with f-strings, but that doesn't work; I get this error:
```bash
ValueError: Pipeline input(s) {'params:currency_vs_params:compare_market_chart'} not found in the DataCatalog
```
I also tried adding single quotes inside double quotes
I will go with standard names for datasets, without variables, but I'm still curious whether there is a way to add parameters to inputs (without putting them in a list).
Sorry for spamming, but Kedro is so awesome I want to use it at its full potential 😄
d
No worries
I'm not entirely sure what you're trying to do here
can you show me how you've formatted your parameters?
are you using any hooks?
c
Yes I use the hook you gave me last time
```python
from typing import Any, Dict

from kedro.extras.datasets.api import APIDataSet
from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog


class APICatalogHooks:
    @hook_impl
    def after_catalog_created(
        self,
        catalog: DataCatalog,
        conf_catalog: Dict[str, Any],
        conf_creds: Dict[str, Any],
        feed_dict: Dict[str, Any],
        save_version: str,
        load_versions: Dict[str, str],
        run_id: str,
    ) -> None:
        """
        This hook is called after the catalog is created. It creates one entry
        in the catalog per crypto currency listed in the config file.
        """
        currency = feed_dict["params:currency"]
        compare = feed_dict["params:compare"]

        catalog.add(
            data_set_name="inputs_market_chart",
            data_set=APIDataSet(
                url=f"https://api.coingecko.com/api/v3/coins/{currency}/market_chart?vs_currency={compare}&days=max&interval=daily"
            ),
            replace=True,
        )
```
for running, I add 2 parameters: `kedro run --params currency:bitcoin,compare:usd`
This way I can use the same pipelines just by changing the parameters I pass when running them
And I decided to use standard names for datasets in my pipelines
```python
from kedro.pipeline import Pipeline, node

from .nodes import fetch_data_to_dataframe  # project-specific node function


def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                func=fetch_data_to_dataframe,
                inputs=["params:currency", "params:compare"],
                outputs="fetched_market_chart",
                name="fetching_data_node",
            ),
        ]
    )
```
d
nice, so is it working?
what error are you getting?
c
no that's fine 🙂
d
💪
c
I wanted to use dynamic names for datasets, but it's unnecessary; it's easier like this
d
nice