Yes, that's exactly the problem and what I want to achieve.
Hello, I'm still unable to get a unique session ID for the whole pipeline when running it with Airflow. I've been thinking of overriding the save_version param with an environment variable that I will have set beforehand. Is there a safe way to override the save_version with a unique ID for all my nodes when using catalog.yml to register datasets?
05/17/2022, 10:47 AM
In Airflow, each node runs as a separate session, so it makes sense that they have different session IDs. Why do you want them to share the same ID?
05/17/2022, 2:03 PM
Because I find Kedro's "normal" behaviour, which defines a global session ID for all nodes, very useful: I can see at a glance which run produced which dataset and link all the datasets in my output folders.
With Airflow, I ended up with versioned datasets having different run IDs as folder names...
I know that this is not how it works with Airflow because, as you said, a new session is created for each node; that's why I wanted to find a way to somehow override the save_version of each dataset.
05/17/2022, 9:24 PM
Sorry for the delayed response, but I don't think there is an elegant solution here.
session_id is basically equal to save_version, and there is no easy way to modify it, since the timestamp is important to make sure Kedro loads the correct data.
A hacky solution would be to override the session_id after session creation but before the session run.
Thank you very much for opening this issue on GitHub! For now, I'll implement a mapping between a global ID called AIRFLOW_VAR_GLOBAL_RUN_ID (following Airflow's convention for env variables), which I'll pass to the whole DAG with docker-compose, and all the IDs generated by Kedro at each node.
I'm also gonna try to have only this global ID tracked in MLflow.
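The mapping described above could be sketched like this: read the DAG-wide ID from the environment, and fall back to a fresh timestamp in the same shape as Kedro's save_version strings (e.g. `2022-05-17T10.47.00.123Z`). The helper name and the fallback-format details are assumptions for illustration, not Kedro API.

```python
import os
from datetime import datetime, timezone


def global_run_id(env_var="AIRFLOW_VAR_GLOBAL_RUN_ID"):
    """Return the DAG-wide run ID, or a fresh timestamp if unset.

    The fallback mimics the shape of Kedro's save_version strings
    (YYYY-MM-DDTHH.MM.SS.mmmZ); the env var name follows Airflow's
    AIRFLOW_VAR_* convention mentioned in the message above.
    """
    run_id = os.environ.get(env_var)
    if run_id:
        return run_id
    now = datetime.now(tz=timezone.utc)
    return now.strftime("%Y-%m-%dT%H.%M.%S.") + f"{now.microsecond // 1000:03d}Z"


os.environ["AIRFLOW_VAR_GLOBAL_RUN_ID"] = "my-dag-run-42"
print(global_run_id())  # my-dag-run-42
```

In Airflow, an env variable named `AIRFLOW_VAR_GLOBAL_RUN_ID` is also visible to tasks as the Airflow Variable `global_run_id`, which is why that naming convention is convenient here.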