If I run ` kedro catalog create --pipeline __defau...
# beginners-need-help
g
If I run
kedro catalog create --pipeline __default__
on the spaceflights tutorial, it generates a bunch of datasets that are not in the catalog:
data_science.active_modelling_pipeline.X_test:
  type: MemoryDataSet
data_science.active_modelling_pipeline.X_train:
  type: MemoryDataSet
data_science.active_modelling_pipeline.y_test:
  type: MemoryDataSet
data_science.active_modelling_pipeline.y_train:
  type: MemoryDataSet
data_science.candidate_modelling_pipeline.X_test:
  type: MemoryDataSet
data_science.candidate_modelling_pipeline.X_train:
  type: MemoryDataSet
data_science.candidate_modelling_pipeline.y_test:
  type: MemoryDataSet
data_science.candidate_modelling_pipeline.y_train:
  type: MemoryDataSet
Why aren't these included in
conf/base/catalog.yml
when their absence causes errors like
ValueError: Pipeline input(s) {'data_science.active_modelling_pipeline.y_train', 'data_science.active_modelling_pipeline.X_train'} not found in the DataCatalog
?
Is this happening because the intermediate datasets are MemoryDataSet, which works when running locally but not on a distributed platform like Kubeflow?
d
They won't work if you make each Kedro node a Kubeflow task, because a MemoryDataSet only lives in the memory of a single process.
You may also have to change some of them to persisted datasets if they are shared between pipelines, for example:
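A minimal sketch of what persisting two of those shared datasets in conf/base/catalog.yml could look like; the dataset types and filepaths below are just illustrative assumptions, not taken from the tutorial:
data_science.active_modelling_pipeline.X_train:
  # Persisted to disk so other tasks/pipelines can load it
  type: pandas.ParquetDataSet
  filepath: data/05_model_input/X_train.pq
data_science.active_modelling_pipeline.y_train:
  # Pickle used here as a generic fallback for non-tabular objects
  type: pickle.PickleDataSet
  filepath: data/05_model_input/y_train.pkl
With entries like these, each Kubeflow task can load its inputs from storage instead of expecting them to already be in memory.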
g
Thanks. Based on your comment, it sounds like there is a way to control the granularity of how Kedro nodes map to Kubeflow tasks. Can you elaborate on that or point me to any docs? I couldn't find anything by searching the docs...