#beginners-need-help

Goss

09/26/2022, 8:56 PM
If I run
kedro catalog create --pipeline __default__
on the space tutorial, it generates a bunch of datasets not in the catalog:
data_science.active_modelling_pipeline.X_test:
  type: MemoryDataSet
data_science.active_modelling_pipeline.X_train:
  type: MemoryDataSet
data_science.active_modelling_pipeline.y_test:
  type: MemoryDataSet
data_science.active_modelling_pipeline.y_train:
  type: MemoryDataSet
data_science.candidate_modelling_pipeline.X_test:
  type: MemoryDataSet
data_science.candidate_modelling_pipeline.X_train:
  type: MemoryDataSet
data_science.candidate_modelling_pipeline.y_test:
  type: MemoryDataSet
data_science.candidate_modelling_pipeline.y_train:
  type: MemoryDataSet
Why aren't these included in
conf/base/catalog.yml
when their absence causes errors like
ValueError: Pipeline input(s) {'data_science.active_modelling_pipeline.y_train', 'data_science.active_modelling_pipeline.X_train'} not found in the DataCatalog
???
Is this happening because intermediate datasets default to MemoryDataSet, which works when running locally but not on a distributed platform like Kubeflow?

datajoely

09/27/2022, 3:58 AM
They won't work if you make each Kedro node a Kubeflow task.
You may also have to change some to persisted datasets if they are shared between pipelines.
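As a rough sketch of what persisting could look like, the two inputs named in the error above might get entries in conf/base/catalog.yml along these lines. The dataset type and file paths here are assumptions for illustration, not taken from the tutorial; on Kubeflow the paths would presumably need to point at storage that every task can reach.

```yaml
# Hypothetical catalog entries; the filepath values are placeholders.
# pickle.PickleDataSet is assumed here because it can store arbitrary Python objects.
data_science.active_modelling_pipeline.X_train:
  type: pickle.PickleDataSet
  filepath: data/05_model_input/active_modelling_pipeline/X_train.pkl

data_science.active_modelling_pipeline.y_train:
  type: pickle.PickleDataSet
  filepath: data/05_model_input/active_modelling_pipeline/y_train.pkl
```

The same pattern would apply to the other MemoryDataSet entries that `kedro catalog create` generated, if they also need to cross task boundaries.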

Goss

09/27/2022, 11:49 AM
Thanks. It sounds like there is a way to control the granularity of how Kedro nodes map to tasks in Kubeflow based on your comment. Can you elaborate on that or point me to any docs? I couldn't find anything by searching the docs...