# advanced-need-help
u
Hi kedro community, First of all thanks for the great tool! I am playing around with the deployment options, in particular the Prefect deployment. I noticed that when using the
register_flow.py
script, the datasets in the catalog object are named as in the file. However, in the nodes, the input and output dataset names are namespaced. Therefore, when running that flow, it creates only MemoryDatasets because it assumes none of the datasets exist in the catalog. Now if I change the
register_flow.py
so that it does not create MemoryDatasets for everything, the
run_node
function does not work, as the node input names and the catalog names don't match up, and the save/load functions no longer work (it tries to load a namespaced dataset that it can't find in the catalog). Is there a way to obtain either a namespaced catalog or a pipeline object where the inputs/outputs of the nodes are not namespaced, so that the
run_node
function works properly? 🙂
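The mismatch can be sketched outside Kedro with simplified dict-based stand-ins (hypothetical names, not Kedro's actual API):

```python
# Minimal sketch of the naming mismatch, assuming simplified stand-ins
# for the catalog and a node. This is NOT Kedro's actual API.

# Catalog keys as they appear in catalog.yml (no namespace prefix).
catalog = {"preprocessed_companies": "<persisted dataset>"}

# The node, however, refers to its input by the namespaced name.
node_inputs = ["data_processing.preprocessed_companies"]

# A run_node-style lookup misses the catalog entry, so a fallback
# in-memory dataset would be created instead of loading from disk.
missing = [name for name in node_inputs if name not in catalog]
```

Here `missing` contains the namespaced name, which is exactly the case where Kedro falls back to a MemoryDataset.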
d
Hello, could you post the error? In truth our Prefect docs are a little out of date and predate a lot of the namespacing work, so we would really appreciate some help getting them up to date
u
Sure, happy to help. Actually, after looking into it a bit more, I think the root cause is the Kedro version and the spaceflights starter. (In Kedro version
0.17.6
it all works.) The catalog of the spaceflights starter for version 0.17.7 does not include the additional
data_science
and
data_processing
namespaces. e.g.:
preprocessed_companies
instead of
data_processing.preprocessed_companies
Therefore, only MemoryDatasets are ever used when running either the Prefect workflow or a normal kedro run. Is that how it is supposed to be? This layering could get a bit intricate.
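To make the naming difference concrete, a catalog entry would need the namespace prefix in its key; a hypothetical `conf/base/catalog.yml` excerpt (dataset type and filepath are illustrative only):

```yaml
# Hypothetical entry: the key carries the pipeline namespace so it
# matches the dataset name the node actually uses.
data_processing.preprocessed_companies:
  type: pandas.ParquetDataSet
  filepath: data/02_intermediate/preprocessed_companies.pq
```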
d
So there are a couple of points here (which, as a maintainer group, we're still thinking through):
the advantage of MemoryDatasets is that they don't require I/O, so you can keep things running efficiently that way
the easiest way to solve this is simply to provide catalog entries for everything and then persist them
@User do you have any thoughts here?
a
Let me take a look at the Prefect docs - this should be easy to provide a quick fix for
f
To follow up: it is now working with Prefect.
It was the fact that the spaceflights starter did not have entries in the catalog for certain datasets. My in-code renaming fix for the namespaces led to a mismatch between the catalog and the pipelines, which caused an error. By providing properly namespaced entries in the catalog, it all works now (including the Prefect tutorial).
@User
d
So yeah, persisting everything (i.e. providing catalog entries for everything) will fix this sort of thing
but in the future we want to make it work both ways
f
I initially did not see the data_science and data_processing namespaces, so I fixed it myself in code in the Prefect tutorial, which clearly did not work. Sorry for the confusion.
d
ah nice!
Is there anything we can do to improve the Prefect docs?
they need a bit of love, and you're the second person to bring them up in the last few weeks
f
I'm playing around with them at the moment, together with MLflow and kedro-mlflow. Let me get back to you once I have a good feel for it.
a
Hi Florian, I noticed your question about Prefect. As a heads-up, I wanted to mention that one user highlighted that this approach doesn't work well with versioned datasets. If needed, you can find an alternative solution in the archived prefect-versioned-datasets thread in the #846330075535769601 channel.