Yes, that's exactly the problem and what I want to achieve.
Hello, I'm still unable to get a unique session ID for the whole pipeline when running it with Airflow. I've been thinking of overriding the save_version param with an environment variable that I will have set beforehand. Is there a safe way to override the save_version with a unique ID for all my nodes when using catalog.yml to register datasets?
05/17/2022, 10:47 AM
In Airflow, each node runs as a separate session, so it makes sense that they have different session IDs. Why do you want them to share the same ID?
05/17/2022, 2:03 PM
Because I find Kedro's "normal" behaviour, which defines a global session ID for all nodes, very useful: I can see at a glance which run produced which dataset and link all the datasets in my output folders.
With Airflow, I ended up with versioned datasets having different run IDs as folder names...
I know that this is not how it works with Airflow because, as you said, a new session is created for each node; that's why I wanted to find a way to somehow override the save_version of each dataset.
05/17/2022, 9:24 PM
Sorry for the delayed response, but I don't think there is an elegant solution here.
session_id is basically equal to save_version, and there is no easy way to modify it, since the timestamp is important to make sure Kedro loads the correct data.
A hacky solution would be to override the session_id after session creation but before the session run.
Thank you very much for opening this issue on GitHub! For now, I'll implement a mapping between a global ID called AIRFLOW_VAR_GLOBAL_RUN_ID (following Airflow's convention for env variables), which I'll pass to the whole DAG with docker-compose, and all the IDs generated by Kedro at each node.
I'm also gonna try to have only this global ID tracked in MLflow.
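The mapping described above could be sketched like this: read the DAG-wide ID from the environment, and fall back to a fresh timestamp in the same shape as Kedro's save_version strings (e.g. `2022-05-17T10.47.00.123Z`). The helper name and the fallback-format details are assumptions for illustration, not Kedro API.

```python
import os
from datetime import datetime, timezone


def global_run_id(env_var="AIRFLOW_VAR_GLOBAL_RUN_ID"):
    """Return the DAG-wide run ID, or a fresh timestamp if unset.

    The fallback mimics the shape of Kedro's save_version strings
    (YYYY-MM-DDTHH.MM.SS.mmmZ); the env var name follows Airflow's
    AIRFLOW_VAR_* convention mentioned in the message above.
    """
    run_id = os.environ.get(env_var)
    if run_id:
        return run_id
    now = datetime.now(tz=timezone.utc)
    return now.strftime("%Y-%m-%dT%H.%M.%S.") + f"{now.microsecond // 1000:03d}Z"


os.environ["AIRFLOW_VAR_GLOBAL_RUN_ID"] = "my-dag-run-42"
print(global_run_id())  # my-dag-run-42
```

In Airflow, an env variable named `AIRFLOW_VAR_GLOBAL_RUN_ID` is also visible to tasks as the Airflow Variable `global_run_id`, which is why that naming convention is convenient here.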