beginners-need-help
  • i

    Isaac89

    11/10/2021, 1:55 PM
    normal ones
  • d

    datajoely

    11/10/2021, 1:55 PM
    That's surprising
  • d

    datajoely

    11/10/2021, 1:56 PM
    let me check with the team
  • d

    datajoely

    11/10/2021, 1:58 PM
    running a spaceflights tutorial it seems to work okay - note that we use a special _SharedMemoryDataSet at runtime
  • i

    Isaac89

    11/10/2021, 2:04 PM
    what's the difference between a _SharedMemoryDataSet and a normal one?
  • i

    Isaac89

    11/10/2021, 2:04 PM
    is that available out of the box or is it a custom dataset?
  • d

    datajoely

    11/10/2021, 2:08 PM
    The parallel runner should use it automatically as far as I understand, so I'm not sure why this is popping up
  • d

    datajoely

    11/10/2021, 2:09 PM
    I've asked the developers but I'm not sure when I'll get a response
  • d

    datajoely

    11/10/2021, 2:09 PM
    do things work okay if you use SequentialRunner?
  • d

    datajoely

    11/10/2021, 2:09 PM
    And if you're using Spark, the ParallelRunner will not work
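For context, the check being suggested here can be reproduced outside a project in a few lines. Below is a minimal sketch, assuming the 2021-era Kedro 0.17.x Python API (DataCatalog, Pipeline, node and the runner classes); the toy node functions and dataset names are invented and are not from this thread.

    # Run the same toy pipeline with SequentialRunner and then ParallelRunner.
    # Nothing here is project code from the thread; it is a hypothetical example.
    from kedro.io import DataCatalog
    from kedro.pipeline import Pipeline, node
    from kedro.runner import ParallelRunner, SequentialRunner


    def make_raw():
        return [1, 2, 3]


    def double(xs):
        return [x * 2 for x in xs]


    pipeline = Pipeline(
        [
            node(make_raw, inputs=None, outputs="raw"),
            node(double, inputs="raw", outputs="doubled"),
        ]
    )

    if __name__ == "__main__":  # guard required: ParallelRunner uses multiprocessing
        # A fresh, empty catalog per run so that the default datasets one runner
        # registers do not leak into the other run.
        print(SequentialRunner().run(pipeline, DataCatalog()))  # {'doubled': [2, 4, 6]}
        print(ParallelRunner().run(pipeline, DataCatalog()))    # same result

Inside a real project the same comparison is simply kedro run versus kedro run --runner=ParallelRunner.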
  • i

    Isaac89

    11/10/2021, 2:10 PM
    sequential is working
  • i

    Isaac89

    11/10/2021, 2:11 PM
    It is failing in ParallelRunner, in the _validate_catalog function
  • d

    datajoely

    11/10/2021, 2:11 PM
    Have you explicitly declared MemoryDataSets in the catalog?
  • d

    datajoely

    11/10/2021, 2:12 PM
    Updating from Kedro into SQL
  • i

    Isaac89

    11/10/2021, 2:12 PM
    yes
  • d

    datajoely

    11/10/2021, 2:12 PM
    Ah so that may be causing the issue
  • d

    datajoely

    11/10/2021, 2:13 PM
    so they shouldn't need to be declared there; MemoryDataSets are created implicitly if not present in the catalog
  • i

    Isaac89

    11/10/2021, 2:13 PM
    I created the entries through the cli
  • i

    Isaac89

    11/10/2021, 2:13 PM
    I can try to remove them
  • d

    datajoely

    11/10/2021, 2:13 PM
    as long as the MemoryDataSets are used as outputs/inputs mid-pipeline, they will be created by Kedro without you declaring them
  • i

    Isaac89

    11/10/2021, 2:18 PM
    OK, thanks! I guess it should work without them explicitly written. I've just seen that the _validate_catalog function explicitly checks for the presence of memory datasets, so if none is found it should work, but I have no idea how memory datasets are internally stored. Could they be overwritten or create some conflicts?
  • d

    datajoely

    11/10/2021, 2:20 PM
    So if you scroll up and use the diagram I posted earlier - you don't have to declare preprocessed_varieties in the catalog; it will be produced by the first node and used by create_variety_table. Kedro will create a MemoryDataSet at runtime to hand it between the nodes if it doesn't exist in the catalog.
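As an illustration of the point above, here is a minimal sketch in which preprocessed_varieties is never declared in the catalog. It borrows the dataset names from the discussion, but the functions are invented stand-ins rather than the real spaceflights code, and it assumes the 2021-era Kedro 0.17.x Python API.

    # Only the raw input is registered; the intermediate and the final output
    # are left out of the catalog on purpose.
    from kedro.io import DataCatalog, MemoryDataSet
    from kedro.pipeline import Pipeline, node
    from kedro.runner import SequentialRunner


    def preprocess_varieties(varieties):
        # stand-in for the real preprocessing node
        return [v.strip().lower() for v in varieties]


    def create_variety_table(preprocessed):
        # stand-in for the real table-building node
        return {v: len(v) for v in preprocessed}


    catalog = DataCatalog({"varieties": MemoryDataSet([" Cavendish ", "Gros Michel "])})

    pipeline = Pipeline(
        [
            node(preprocess_varieties, inputs="varieties", outputs="preprocessed_varieties"),
            node(create_variety_table, inputs="preprocessed_varieties", outputs="variety_table"),
        ]
    )

    # Kedro creates MemoryDataSets for the two undeclared names at runtime and
    # returns the free output of the pipeline.
    print(SequentialRunner().run(pipeline, catalog))  # {'variety_table': {...}}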
  • a

    antony.milne

    11/10/2021, 2:29 PM
    I guess you might have found this already, but the docstring for _validate_catalog explains a bit of what's going on here: "Ensure that all data sets are serializable and that we do not have any non proxied memory data sets being used as outputs as their content will not be synchronized across threads." The second part about memory datasets is what's relevant here. As Joel said, the default for the parallel runner is that _SharedMemoryDataSet is used rather than MemoryDataSet (see ParallelRunner.create_default_data_set for where this happens). In theory you could specify this dataset type explicitly in the catalog, but the fact that it's private means that's probably not a good idea, and I've never seen anyone do so. Just don't define them in the catalog and they will default to _SharedMemoryDataSet and everything should work ok 🙂
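To make the failure mode concrete, here is a sketch under the same assumptions (2021-era Kedro 0.17.x, invented node functions, not Isaac89's actual project): declaring the mid-pipeline dataset as a plain MemoryDataSet trips the validation, while leaving it undeclared lets ParallelRunner.create_default_data_set supply the proxied _SharedMemoryDataSet.

    # ParallelRunner refuses a catalog that declares a plain MemoryDataSet for a
    # dataset produced by a node, but is happy when the same name is left out.
    from kedro.io import DataCatalog, MemoryDataSet
    from kedro.pipeline import Pipeline, node
    from kedro.runner import ParallelRunner


    def make_numbers():
        return [1, 2, 3]


    def summarise(xs):
        return sum(xs)


    pipeline = Pipeline(
        [
            node(make_numbers, inputs=None, outputs="intermediate"),
            node(summarise, inputs="intermediate", outputs="summary"),
        ]
    )

    if __name__ == "__main__":  # guard required: ParallelRunner uses multiprocessing
        # Explicitly declared MemoryDataSet used as a node output: _validate_catalog
        # rejects the run before any worker process does real work.
        declared = DataCatalog({"intermediate": MemoryDataSet()})
        try:
            ParallelRunner().run(pipeline, declared)
        except Exception as exc:  # exact exception type deliberately not assumed
            print(f"ParallelRunner refused the catalog: {exc}")

        # Nothing declared: the intermediate defaults to _SharedMemoryDataSet and runs fine.
        print(ParallelRunner().run(pipeline, DataCatalog()))  # {'summary': 6}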
  • a

    antony.milne

    11/10/2021, 2:31 PM
    Here's where it's all defined in case you're interested in what's going on under the hood: https://github.com/quantumblacklabs/kedro/blob/ded55eb824af25ea28ea9f5249317693a9b1574d/kedro/runner/parallel_runner.py#L26-L72. Just don't ask me how it works though, since I've never actually looked at this code before 😄
  • i

    Isaac89

    11/10/2021, 10:11 PM
    Thanks a lot for your help @antony.milne @datajoely! Now everything makes more sense! So as long as the datasets are picklable and not defined in the catalog, everything should work fine. 🤞
  • j

    jcasanuevam

    11/11/2021, 1:20 PM
    Hello! I hope you can help me out with a question about the MLflow tracking server and how to set everything up in the mlflow.yml file of the Kedro project. I have a database backend store on an external server to track metrics etc., and an SFTP server on the same external server as the artifact store for models. In the kedro-mlflow documentation I've seen that I have to define the mlflow_tracking_uri variable, but I'm not sure whether I should write the sftp://user@host/path/to/directory of the artifact store or the <dialect>+<driver>://<username>:<password>@<host>:<port>/<database> of the database backend store. In my case I want to use both, and mlflow.yml only gives us one possible input. How can I set up both the backend and artifact stores? Thanks!
  • m

    Matheus Serpa

    11/15/2021, 12:43 PM
    Hello there, I hope you're all doing well. Any guidelines/suggestions on how to deploy a kedro project to GCP / Cloud Composer? Should I upload the kedro project into the dags folder? Or is there any other way to deploy it? Best,
  • d

    datajoely

    11/15/2021, 12:44 PM
    Hello! I'm not sure we've had this come up before. Happy to help work through this - I wonder if any of the tutorials in the deployment guide apply here just the same: https://kedro.readthedocs.io/en/stable/10_deployment/01_deployment_guide.html https://kedro.readthedocs.io/en/stable/03_tutorial/05_package_a_project.html
  • m

    Matheus Serpa

    11/15/2021, 12:49 PM
    Thanks @User, I'll dive into it and let the community know about any news (the good ones 🙂)
  • e

    ende

    11/16/2021, 3:38 AM
    Sorry if this is a dumb question, but how do you run kedro with different data locations? Like, say I have a data catalog with an S3 key... how do I run the pipeline pointing at a different key with new data?