beginners-need-help
  • i

    Isaac89

    11/10/2021, 1:55 PM
    normal ones
  • d

    datajoely

    11/10/2021, 1:55 PM
    That's surprising
  • d

    datajoely

    11/10/2021, 1:56 PM
    let me check with the team
  • d

    datajoely

    11/10/2021, 1:58 PM
    running a spaceflights tutorial it seems to work okay - note that we use a special _SharedMemoryDataSet at runtime
  • i

    Isaac89

    11/10/2021, 2:04 PM
    what's the difference between a _SharedMemoryDataSet and a normal one?
  • i

    Isaac89

    11/10/2021, 2:04 PM
    is that available out of the box or is it a custom dataset?
  • d

    datajoely

    11/10/2021, 2:08 PM
    The parallel runner should use it automatically as far as I understand, so I'm not sure why this is popping up
  • d

    datajoely

    11/10/2021, 2:09 PM
    I've asked the developers but I'm not sure when I'll get a response
  • d

    datajoely

    11/10/2021, 2:09 PM
    do things work okay if you use SequentialRunner?
  • d

    datajoely

    11/10/2021, 2:09 PM
    And if you're using Spark, the ParallelRunner will not work
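For context, the check being suggested here can be reproduced outside a project in a few lines. Below is a minimal sketch, assuming the 2021-era Kedro 0.17.x Python API (DataCatalog, Pipeline, node and the runner classes); the toy node functions and dataset names are invented and are not from this thread.

    # Run the same toy pipeline with SequentialRunner and then ParallelRunner.
    # Nothing here is project code from the thread; it is a hypothetical example.
    from kedro.io import DataCatalog
    from kedro.pipeline import Pipeline, node
    from kedro.runner import ParallelRunner, SequentialRunner


    def make_raw():
        return [1, 2, 3]


    def double(xs):
        return [x * 2 for x in xs]


    pipeline = Pipeline(
        [
            node(make_raw, inputs=None, outputs="raw"),
            node(double, inputs="raw", outputs="doubled"),
        ]
    )

    if __name__ == "__main__":  # guard required: ParallelRunner uses multiprocessing
        # A fresh, empty catalog per run so that the default datasets one runner
        # registers do not leak into the other run.
        print(SequentialRunner().run(pipeline, DataCatalog()))  # {'doubled': [2, 4, 6]}
        print(ParallelRunner().run(pipeline, DataCatalog()))    # same result

Inside a real project the same comparison is simply kedro run versus kedro run --runner=ParallelRunner.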
  • i

    Isaac89

    11/10/2021, 2:10 PM
    sequential is working
  • i

    Isaac89

    11/10/2021, 2:11 PM
    It is failing in ParallelRunner, in the _validate_catalog function
  • d

    datajoely

    11/10/2021, 2:11 PM
    Have you explicitly declared MemoryDataSets in the catalog?
  • d

    datajoely

    11/10/2021, 2:12 PM
    Updating from Kedro into SQL
  • i

    Isaac89

    11/10/2021, 2:12 PM
    yes
  • d

    datajoely

    11/10/2021, 2:12 PM
    Ah so that may be causing the issue
  • d

    datajoely

    11/10/2021, 2:13 PM
    so they shouldn't need to be declared there; MemoryDataSets are created implicitly if not present in the catalog
  • i

    Isaac89

    11/10/2021, 2:13 PM
    I created the entries through the cli
  • i

    Isaac89

    11/10/2021, 2:13 PM
    I can try to remove them
  • d

    datajoely

    11/10/2021, 2:13 PM
    as long as the MemoryDataSets are used as outputs/inputs mid-pipeline, they will be created by Kedro without you declaring them
  • i

    Isaac89

    11/10/2021, 2:18 PM
    OK, thanks! I guess it should work without them explicitly written. I've just seen that the _validate_catalog function explicitly checks for the presence of memory datasets, so if none is found it should work, but I have no idea how memory datasets are internally stored. Could they be overwritten or create some conflicts?
  • d

    datajoely

    11/10/2021, 2:20 PM
    So if you scroll up and use the diagram I posted earlier - you don't have to declare preprocessed_varieties in the catalog; it will be produced by the first node and used by create_variety_table. Kedro will create a MemoryDataSet at runtime to hand it between the nodes if it doesn't exist in the catalog.
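As an illustration of the point above, here is a minimal sketch in which preprocessed_varieties is never declared in the catalog. It borrows the dataset names from the discussion, but the functions are invented stand-ins rather than the real spaceflights code, and it assumes the 2021-era Kedro 0.17.x Python API.

    # Only the raw input is registered; the intermediate and the final output
    # are left out of the catalog on purpose.
    from kedro.io import DataCatalog, MemoryDataSet
    from kedro.pipeline import Pipeline, node
    from kedro.runner import SequentialRunner


    def preprocess_varieties(varieties):
        # stand-in for the real preprocessing node
        return [v.strip().lower() for v in varieties]


    def create_variety_table(preprocessed):
        # stand-in for the real table-building node
        return {v: len(v) for v in preprocessed}


    catalog = DataCatalog({"varieties": MemoryDataSet([" Cavendish ", "Gros Michel "])})

    pipeline = Pipeline(
        [
            node(preprocess_varieties, inputs="varieties", outputs="preprocessed_varieties"),
            node(create_variety_table, inputs="preprocessed_varieties", outputs="variety_table"),
        ]
    )

    # Kedro creates MemoryDataSets for the two undeclared names at runtime and
    # returns the free output of the pipeline.
    print(SequentialRunner().run(pipeline, catalog))  # {'variety_table': {...}}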
  • a

    antony.milne

    11/10/2021, 2:29 PM
    I guess you might have found this already, but the docstring for _validate_catalog explains a bit of what's going on here: "Ensure that all data sets are serializable and that we do not have any non proxied memory data sets being used as outputs as their content will not be synchronized across threads." The second part about memory datasets is what's relevant here. As Joel said, the default for the parallel runner is that _SharedMemoryDataSet is used rather than MemoryDataSet (see ParallelRunner.create_default_data_set for where this happens). In theory you could specify this dataset type explicitly in the catalog, but the fact that it's private means that's probably not a good idea, and I've never seen anyone do so. Just don't define them in the catalog and they will default to _SharedMemoryDataSet and everything should work ok 🙂
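To make the failure mode concrete, here is a sketch under the same assumptions (2021-era Kedro 0.17.x, invented node functions, not Isaac89's actual project): declaring the mid-pipeline dataset as a plain MemoryDataSet trips the validation, while leaving it undeclared lets ParallelRunner.create_default_data_set supply the proxied _SharedMemoryDataSet.

    # ParallelRunner refuses a catalog that declares a plain MemoryDataSet for a
    # dataset produced by a node, but is happy when the same name is left out.
    from kedro.io import DataCatalog, MemoryDataSet
    from kedro.pipeline import Pipeline, node
    from kedro.runner import ParallelRunner


    def make_numbers():
        return [1, 2, 3]


    def summarise(xs):
        return sum(xs)


    pipeline = Pipeline(
        [
            node(make_numbers, inputs=None, outputs="intermediate"),
            node(summarise, inputs="intermediate", outputs="summary"),
        ]
    )

    if __name__ == "__main__":  # guard required: ParallelRunner uses multiprocessing
        # Explicitly declared MemoryDataSet used as a node output: _validate_catalog
        # rejects the run before any worker process does real work.
        declared = DataCatalog({"intermediate": MemoryDataSet()})
        try:
            ParallelRunner().run(pipeline, declared)
        except Exception as exc:  # exact exception type deliberately not assumed
            print(f"ParallelRunner refused the catalog: {exc}")

        # Nothing declared: the intermediate defaults to _SharedMemoryDataSet and runs fine.
        print(ParallelRunner().run(pipeline, DataCatalog()))  # {'summary': 6}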
  • a

    antony.milne

    11/10/2021, 2:31 PM
    Here's where it's all defined in case you're interested in what's going on under the hood: https://github.com/quantumblacklabs/kedro/blob/ded55eb824af25ea28ea9f5249317693a9b1574d/kedro/runner/parallel_runner.py#L26-L72. Just don't ask me how it works though, since I've never actually looked at this code before 😄
  • i

    Isaac89

    11/10/2021, 10:11 PM
    Thanks a lot for your help @antony.milne @datajoely! Now everything makes more sense! So as long as the datasets are picklable and not defined in the catalog, everything should work fine. 🤞
  • j

    jcasanuevam

    11/11/2021, 1:20 PM
    Hello! I hope you can help me out with a question about the MLflow tracking server and how to set everything up in the mlflow.yml file of the Kedro project. I have a database backend store on an external server to track metrics etc., and an SFTP server on the same external server as the artifact store for models. In the kedro-mlflow documentation I've seen that I have to define the mlflow_tracking_uri variable, but I'm not sure whether I should write the sftp://user@host/path/to/directory of the artifact store or the <dialect>+<driver>://<username>:<password>@<host>:<port>/<database> of the database backend store. In my case I want to use both, and mlflow.yml only gives us one possible input. How can I set up both the backend and artifact stores? Thanks!
  • m

    Matheus Serpa

    11/15/2021, 12:43 PM
    Hello there, I hope you're all doing well. Any guidelines/suggestions on how to deploy a kedro project to GCP / Cloud Composer? Should I upload the kedro project into the dags folder? Or is there any other way to deploy it? Best,
  • d

    datajoely

    11/15/2021, 12:44 PM
    Hello! I'm not sure we've had this come up before. Happy to help work through this - I wonder if any of the tutorials in the deployment guide apply here just the same: https://kedro.readthedocs.io/en/stable/10_deployment/01_deployment_guide.html https://kedro.readthedocs.io/en/stable/03_tutorial/05_package_a_project.html
  • m

    Matheus Serpa

    11/15/2021, 12:49 PM
    Thanks @User, I'll dive into it and let the community know about any news (the good ones 🙂)
  • e

    ende

    11/16/2021, 3:38 AM
    Sorry if this is a dumb question, but how do you run kedro with different data locations? Like, say I have a data catalog with an S3 key... how do I run the pipeline pointing at a different key with new data?