# advanced-need-help
u
Hi kedro community, First of all thanks for the great tool! I am playing around with the deployment options, in particular the Prefect deployment. I noticed that when using the
register_flow.py
script, the datasets in the catalog object are named as in the file. However, in the nodes, the input and output dataset names are namespaced. Therefore, when running that flow, it creates only MemoryDatasets because it assumes none of the datasets exist in the catalog. Now if I change the
register_flow.py
so that it does not create MemoryDatasets for everything, the
run_node
function does not work, as the node input names and the catalog names don't match up, and the save/load functions no longer work (it tries to load a namespaced dataset that it can't find in the catalog). Is there a way to obtain either a namespaced catalog or a pipeline object where the inputs/outputs of the nodes are not namespaced, so that the
run_node
function works properly? 🙂
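The mismatch can be sketched outside Kedro with simplified dict-based stand-ins (hypothetical names, not Kedro's actual API):

```python
# Minimal sketch of the naming mismatch, assuming simplified stand-ins
# for the catalog and a node. This is NOT Kedro's actual API.

# Catalog keys as they appear in catalog.yml (no namespace prefix).
catalog = {"preprocessed_companies": "<persisted dataset>"}

# The node, however, refers to its input by the namespaced name.
node_inputs = ["data_processing.preprocessed_companies"]

# A run_node-style lookup misses the catalog entry, so a fallback
# in-memory dataset would be created instead of loading from disk.
missing = [name for name in node_inputs if name not in catalog]
```

Here `missing` contains the namespaced name, which is exactly the case where Kedro falls back to a MemoryDataset.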
d
Hello, could you post the error? In truth our Prefect docs are a little out of date and predate a lot of the namespacing work, so we would really appreciate some help getting them up to date
u
Sure, happy to help. Actually, after looking into it a bit more, I think the root cause is the Kedro version and the spaceflights starter. (In Kedro version
0.17.6
it all works.) The catalog of the spaceflights starter for version 0.17.7 does not include the additional
data_science
and
data_processing
namespaces. e.g.:
preprocessed_companies
instead of
data_processing.preprocessed_companies
Therefore, only MemoryDatasets are ever used when running either the Prefect workflow or a normal kedro run. Is that how it is supposed to be? This layering could get a bit intricate.
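To make the naming difference concrete, a catalog entry would need the namespace prefix in its key; a hypothetical `conf/base/catalog.yml` excerpt (dataset type and filepath are illustrative only):

```yaml
# Hypothetical entry: the key carries the pipeline namespace so it
# matches the dataset name the node actually uses.
data_processing.preprocessed_companies:
  type: pandas.ParquetDataSet
  filepath: data/02_intermediate/preprocessed_companies.pq
```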
d
So there are a couple of points here (which, as a maintainer group, we're still thinking through):
the advantage of MemoryDatasets is that they don't require I/O, so you can keep things running efficiently that way
the easiest way to solve this is simply to provide catalog entries for everything and then persist them
@User do you have any thoughts here?
a
Let me take a look at the Prefect docs - this should be easy to provide a quick fix for
f
To follow up: it is now working with Prefect.
It was the fact that the spaceflights starter did not have entries in the catalog for certain datasets. My in-code renaming fix for the namespaces led to a mismatch between the catalog and the pipelines, which caused an error. By providing properly namespaced entries in the catalog, it all works now (including the Prefect tutorial).
@User
d
So yeah, persisting everything (i.e. providing catalog entries for everything) will fix this sort of thing
but in the future we want to make it work both ways
f
I initially did not see the data_science and data_processing namespaces, so I fixed it myself in code in the Prefect tutorial, which clearly did not work. Sorry for the confusion.
d
ah nice!
Is there anything we can do to improve the Prefect docs?
they need a bit of love, and you're the second person to bring them up in the last few weeks
f
I'm playing around with them at the moment, together with MLflow and kedro-mlflow. Let me get back to you once I have a good feel for it.
a
Hi Florian, I noticed your question about Prefect. As a heads-up, I wanted to mention that one user highlighted that this approach doesn't work well with versioned datasets. If needed, you can find an alternative solution in the archived prefect-versioned-datasets thread in the #846330075535769601 channel.