advanced-need-help
  • w

    Wit

    05/10/2022, 8:01 PM
    OK. That's not obvious from the docs :). I think hooks are much simpler.
  • d

    datajoely

    05/10/2022, 8:02 PM
    Hooks will change your life
  • w

    Wit

    05/10/2022, 8:02 PM
    Yep 🙂
  • w

    Wit

    05/10/2022, 8:03 PM
    Hooks are a much better design, as you are not bound to specific implementation details
  • w

    Wit

    05/10/2022, 8:03 PM
    Thank you for the hints
  • d

    datajoely

    05/10/2022, 8:07 PM
    Yeah it's via a library called pluggy by the pytest folks so it's really robust
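
    For readers unfamiliar with pluggy, here is a toy sketch of the spec/implementation mechanism Kedro's hooks are built on; the names below are illustrative, not Kedro's own hook specs:

        import pluggy

        hookspec = pluggy.HookspecMarker("demo")
        hookimpl = pluggy.HookimplMarker("demo")

        class DemoSpec:
            @hookspec
            def after_run(self, result):
                """Specification: called after a run completes."""

        class DemoPlugin:
            @hookimpl
            def after_run(self, result):
                # Implementations are discovered and called by the manager.
                print(f"after_run received: {result}")

        pm = pluggy.PluginManager("demo")
        pm.add_hookspecs(DemoSpec)
        pm.register(DemoPlugin())
        pm.hook.after_run(result=42)  # pluggy requires keyword arguments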
  • m

    marioFeynman

    05/11/2022, 12:59 AM
    Hey! I'm trying to migrate one of my projects that currently runs on Kedro 0.16.x to 0.18.0, but I'm having a lot of trouble where I need to use the old load_context method... I used it when I wanted to run a Jupyter notebook template and load the Kedro context there... how should I continue this mission?
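
    (For context: in 0.18 the old load_context entry point is replaced by KedroSession. A minimal notebook sketch, assuming you start from the project root:)

        from pathlib import Path

        from kedro.framework.session import KedroSession
        from kedro.framework.startup import bootstrap_project

        bootstrap_project(Path.cwd())  # reads pyproject.toml and registers the project
        with KedroSession.create(project_path=Path.cwd()) as session:
            context = session.load_context()  # the old load_context() equivalent
            catalog = context.catalog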
  • m

    marioFeynman

    05/11/2022, 1:01 AM
    And another question... how can I use the run parameters (like KEDRO_ENV) in this new version? In the past I was able to use the now-deprecated hooks and journal, which helped me with that task...
  • r

    Rjify

    05/11/2022, 5:36 AM
    Hello all, I am at the stage where I have to deploy a DS project built on the Kedro template to Databricks. I am wondering what the different ways of achieving this are? I believe there is a way using a notebook as per the documentation, but that's not suggested for productionisation. I am looking for options to deploy a Kedro pipeline on a Databricks cluster.
  • i

    inigohrey

    05/11/2022, 8:49 AM
    In 0.18.1 there's an after_context_created hook which might be interesting for you, as it allows you to access the context without needing to create it yourself. My team has been stuck on 0.17.1 because we were using the context for a few things, but with this hook we might finally be able to move past it.
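
    A minimal sketch of that hook (Kedro >= 0.18.1); the class name is a placeholder, and it would be registered via HOOKS in settings.py:

        from kedro.framework.hooks import hook_impl

        class ProjectHooks:
            @hook_impl
            def after_context_created(self, context) -> None:
                # The framework hands over the ready-made KedroContext,
                # so there is no need to construct one yourself.
                print(f"Active environment: {context.env}")

        # in settings.py:
        # HOOKS = (ProjectHooks(),)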
  • t

    Tsakagur

    05/11/2022, 9:44 AM
    Thanks, I'll have a look!
  • y

    Yetunde

    05/11/2022, 10:25 AM
    Hi @Rjify! We're so excited to see you mention this. We're actually setting up a project to work with the Databricks team to build out their IDE support and figure out best-practice ways of developing Kedro projects on Databricks. In our current sprint, we're fixing some of the bugs we've found while using Kedro on Databricks. We have suggested some workflows and will update our documentation. You can work with Databricks and Kedro by:
    - packaging a Kedro project with kedro package and publishing the package using the Databricks DBFS API
    - using Databricks Repos functionality and doing a pipeline run through a Databricks notebook
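
    A hedged sketch of the second option, i.e. running a Repos-hosted project from a Databricks notebook cell (the repo path is a placeholder):

        from pathlib import Path

        from kedro.framework.session import KedroSession
        from kedro.framework.startup import bootstrap_project

        project_root = Path("/Workspace/Repos/<user>/<project>")  # placeholder
        bootstrap_project(project_root)
        with KedroSession.create(project_path=project_root) as session:
            session.run()  # executes the default pipeline on the cluster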
  • x

    xxavier

    05/11/2022, 12:25 PM
    Hi everyone, I am trying to use the APIDataSet in the catalog.yml file but am failing to load some credentials into the headers. What I have tried:
    run_histograms:
      type: api.APIDataSet
      url: https://xxx/
      headers:
        Authorization: Token <token>
    Works without error (which is nice, but the token is somewhat sensitive information). I tried to fill the headers using credentials but failed to do so. credentials.yml
    dqm_playground_token:
      - Content-Type: application/json
      - Authorization: Token <token>
    catalog.yml
    run_histograms:
      type: api.APIDataSet
      url: https://xxx/
      # Test 1
      headers: dqm_playground_token
      # Test 2
      headers:
        - dqm_playground_token
      # More tests
    It seems to boil down to the fact that headers is read as Dict[str, Any] and not Union[Iterable[str], AuthBase]: https://kedro.readthedocs.io/en/stable/_modules/kedro/extras/datasets/api/api_dataset.html#APIDataSet I could probably modify the APIDataSet definition to solve it by having headers = auth, but I guess there is a better way. 🙂 Sorry about the naive question. Any help is appreciated.
  • d

    datajoely

    05/11/2022, 12:53 PM
    So the easiest way to debug this is to jump into a Jupyter/IPython session, import the APIDataSet in Python, and get the .load() method working. It will then be simple to work out what the YAML should be.
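
    For example, something along these lines in an IPython session (the URL and token are placeholders from the thread):

        from kedro.extras.datasets.api import APIDataSet

        ds = APIDataSet(
            url="https://xxx/",
            headers={"Authorization": "Token <token>"},  # headers is a dict, not a list
        )
        response = ds.load()  # returns a requests.Response once the arguments are right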
  • d

    datajoely

    05/11/2022, 12:54 PM
    Oh, I don't think your solution is bad, by the way
  • d

    datajoely

    05/11/2022, 12:55 PM
    Improving credentials in general is on the roadmap
  • m

    marioFeynman

    05/11/2022, 1:30 PM
    So, do you think that maybe exposing it using this method could be the right way?
  • d

    datajoely

    05/11/2022, 1:31 PM
    I need to think about it more; you only get the ConfigLoader at that point, before the catalog is created. So I'm leaning towards no rather than yes.
  • x

    xxavier

    05/11/2022, 2:15 PM
    Thanks for the feedback! I should make it more generic, but since the solution was not too bad, I just created a custom dataset based on the APIDataSet (so as not to mess with Kedro's code). catalog.yml
    run_histograms:
      type: dqm_playground_ds.extras.datasets.tuned_API_dataset.TunedAPIDataSet
      url: https://xxx/
      credentials: dqm_playground_token
      headers: credentials
    credentials.yml
    dqm_playground_token:
      - Authorization: Token <token>
    tuned_API_dataset.py (similar to kedro's api_dataset.py)
            auth = credentials or auth
    
            # Added the following three lines :)
            if headers == "credentials":
                auth = None
                headers = credentials[0]
    Not great, not terrible. 🙂 Thanks again!
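
    For reference, a fuller sketch of what such a subclass might look like; this is a sketch only, assuming Kedro 0.18's APIDataSet signature, with names mirroring the thread:

        from typing import Any, Dict, Iterable, Union

        from requests.auth import AuthBase

        from kedro.extras.datasets.api import APIDataSet

        class TunedAPIDataSet(APIDataSet):
            """APIDataSet variant that can route `credentials` into the headers."""

            def __init__(
                self,
                url: str,
                headers: Dict[str, Any] = None,
                credentials: Union[Iterable[str], AuthBase] = None,
                **kwargs: Any,
            ) -> None:
                if headers == "credentials" and credentials:
                    # credentials.yml holds a one-element list of header mappings,
                    # so use it as the headers and skip requests' auth handling.
                    headers = dict(credentials[0])
                    credentials = None
                super().__init__(url=url, headers=headers, credentials=credentials, **kwargs)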
  • m

    marioFeynman

    05/11/2022, 3:51 PM
    Oh, OK, I was trying to use that hook, but yes, it is not providing the data I was looking for. Thanks anyway!
  • r

    Rjify

    05/11/2022, 4:26 PM
    Hi @Yetunde, thanks for replying and providing the options. I am more inclined towards the option of "using Databricks Repos functionality and doing a pipeline run through a Databricks notebook". I felt the documentation for deploying the project using this method was not sufficient. Is there a better example available that you can share with me? It would be of great help. Thanks
  • r

    Rjify

    05/11/2022, 9:04 PM
    Another question: does Kedro have any plans to visualize hooks associated with nodes in the visualization generated by Kedro-Viz?
  • d

    datajoely

    05/11/2022, 9:05 PM
    The hooks aren't really node-based, so it would be difficult to pair them to a single node. If you have any ideas on how this could look, please raise a GH issue.
  • d

    datajoely

    05/11/2022, 9:06 PM
    Regarding your Databricks question, the current docs are all we have to share as of now, but we're hard at work on their overhaul.
  • u

    user

    05/12/2022, 2:33 AM
    Kedro SunPy - Writing Custom Data Set to S3 https://stackoverflow.com/questions/72209505/kedro-sunpy-writing-custom-data-set-to-s3
  • m

    marioFeynman

    05/13/2022, 1:18 AM
    Hey guys! Quick question: how can I dynamically add the Kedro environment to my parameters so I can run specific things during the pipeline run? I'm using Kedro 0.18.0.
  • a

    antony.milne

    05/13/2022, 7:45 AM
    env in parameters
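
    One possible approach in 0.18 (a sketch; the class name is a placeholder): a hook that copies the env out of run_params into the catalog, so that nodes can declare "params:env" as an input:

        from typing import Any, Dict

        from kedro.framework.hooks import hook_impl

        class EnvToParamsHooks:
            @hook_impl
            def before_pipeline_run(self, run_params: Dict[str, Any], pipeline, catalog) -> None:
                # run_params carries the run metadata, including the active "env".
                catalog.add_feed_dict({"params:env": run_params["env"]}, replace=True)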
  • k

    Kastakin

    05/14/2022, 10:55 AM
    For my use case I have created a custom dataset for importing and exporting data in the mzML format, an open-source format for proteomics/metabolomics analysis, using pyOpenMS (https://pyopenms.readthedocs.io/en/latest/index.html). Looking at issues and PRs in the GitHub repo, I've noticed that a decoupling of the main package and the datasets is in the works. Should I open an issue + PR to add the new dataset now, or am I better off waiting for the aforementioned decoupling first?
  • k

    Kastakin

    05/14/2022, 10:56 AM
    On the same note: is adding a new dataset considered a breaking change?
  • n

    noklam

    05/15/2022, 5:49 PM
    We just released 0.18.1 last week, so it will probably take a few more weeks to release a new version. I think it is OK to open an issue and PR in the current repository and migrate it to the new package later. The decoupling is mainly to speed up releases for the datasets, since these third-party package dependencies move much faster than Kedro core.