beginners-need-help
  • v

    vivekumar

    04/19/2022, 6:32 AM
    message has been deleted
  • v

    vivekumar

    04/19/2022, 6:48 AM
I changed the name of the file from kedro.py to kedro_hello.py; now the 'no module' error is gone, but a new error occurred.
  • v

    vivekumar

    04/19/2022, 6:50 AM
    searched and found this https://github.com/kedro-org/kedro/issues/1409
  • n

    noklam

    04/19/2022, 7:36 AM
I see. If you name your file kedro.py, Python gets confused and treats it as the kedro module instead of the installed one.
  • n

    noklam

    04/19/2022, 7:39 AM
    We have a work in progress to fix this. https://github.com/kedro-org/kedro/issues/1429 For now, please try to follow the suggested solution in the Github Issue that you found.
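The shadowing noklam describes is generic Python behaviour, not specific to Kedro; a minimal stdlib sketch, using the json module as a stand-in for the installed kedro package:

```python
import importlib
import sys
import tempfile
from pathlib import Path

# A local file named like an installed module shadows it, because the
# script's directory sits earlier on sys.path than site-packages.
tmp = Path(tempfile.mkdtemp())
(tmp / "json.py").write_text("SHADOWED = True\n")

sys.modules.pop("json", None)         # forget any cached copy
sys.path.insert(0, str(tmp))
importlib.invalidate_caches()
import json                           # resolves to the local file, not the stdlib
print(hasattr(json, "SHADOWED"))      # True: the local file won
print(hasattr(json, "loads"))         # False: the real API is hidden

# Removing the shadowing path (or renaming the file, as vivekumar did)
# restores the installed module.
sys.path.remove(str(tmp))
sys.modules.pop("json", None)
importlib.invalidate_caches()
import json                           # resolves to the real module again
print(json.loads('{"ok": 1}')["ok"])  # 1
```

This is exactly why renaming kedro.py to kedro_hello.py made the "no module" error go away.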
  • g

    gui42

    04/19/2022, 3:08 PM
Hello folks. I've been using kedro for a while now, but not much in interactive mode. Is there a better way of computing the nodes that reach a memory dataset, and then retrieving that memory dataset, without changing the catalog so that it is written somewhere?
  • a

    avan-sh

    04/19/2022, 3:58 PM
You could try running a partial pipeline in Jupyter, up to the memory datasets you're interested in, using session.run with the to_outputs arg. But this will only return a dictionary of your datasets, not memory datasets retrievable from the catalog. Reference to the session.run function spec: https://kedro.readthedocs.io/en/stable/kedro.framework.session.session.KedroSession.html#kedro.framework.session.session.KedroSession.run
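Conceptually, to_outputs slices the pipeline so only the nodes needed to produce the requested datasets run, and those datasets come back in a dict. A rough stdlib sketch of that idea (not Kedro's actual implementation; run_to_outputs and the node tuples are made up for illustration):

```python
def run_to_outputs(nodes, inputs, to_outputs):
    """nodes: list of (func, input_names, output_name) in topological order."""
    needed = set(to_outputs)
    # Walk backwards to collect every dataset the targets depend on.
    for func, ins, out in reversed(nodes):
        if out in needed:
            needed.update(ins)
    data = dict(inputs)
    for func, ins, out in nodes:
        if out in needed:                       # skip nodes past the targets
            data[out] = func(*(data[n] for n in ins))
    return {name: data[name] for name in to_outputs}

nodes = [
    (lambda xs: [x + 1 for x in xs], ["raw"], "inc"),
    (lambda xs: sum(xs), ["inc"], "total"),
    (lambda t: t * 10, ["total"], "scaled"),    # past the target, never runs
]
result = run_to_outputs(nodes, {"raw": [1, 2, 3]}, ["total"])
print(result)   # {'total': 9}
```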
  • n

    noklam

    04/19/2022, 4:36 PM
    Can you explain what you are trying to achieve here?
  • g

    gui42

    04/19/2022, 6:48 PM
The use case is simple. I want to run a pipeline up to a dataset, but I don't want to write it anywhere. session.run only returns an empty dict, and from what I understand, only datasets with some catalog issues are returned. From the session.run docstring:
    Returns:
        Any node outputs that cannot be processed by the ``DataCatalog``.
        These are returned in a dictionary, where the keys are defined
        by the node outputs.
  • g

    gui42

    04/19/2022, 6:52 PM
    Unless I'm using the catalog wrong
  • a

    avan-sh

    04/19/2022, 6:58 PM
You'll have to add the list of datasets you want to access in the to_outputs arg for it to return them in the dictionary. Also, as noklam asked, it would help to know the reason you're trying to do this.
  • g

    gui42

    04/19/2022, 7:00 PM
I've been running it as:
session.run(to_outputs=['my_dataset'])
And the return value is an empty dict. The pipeline runs smoothly, and everything is defined in the catalog.
  • g

    gui42

    04/19/2022, 7:01 PM
My use case is debugging, actually. There is something wrong with a node somewhere, and I'm backtracking through datasets to see where the issue lies.
  • d

    datajoely

    04/19/2022, 7:01 PM
    In that case I would recommend breakpoints as a much better way of backtracking
  • g

    gui42

    04/19/2022, 7:03 PM
yep, this is what I've been doing 😄 But I'm just after loading pretty much all the datasets first, to see at least in which region of the pipeline stuff changed.
  • g

    gui42

    04/19/2022, 7:03 PM
I have a local development catalog/env that saves everything to disk, but not some intermediary steps.
  • g

    gui42

    04/19/2022, 7:04 PM
    So I was wondering if there was a way to reach those without having to write them to disk.
  • a

    avan-sh

    04/19/2022, 7:05 PM
One other case where that might not work is debugging in production/remote systems. I had to do similar console runs to understand what was going wrong.
  • g

    gui42

    04/19/2022, 7:06 PM
Huum, the system is running locally, but I prefer using IPython in the console rather than Jupyter Lab/Notebook. Could that be an issue?
  • n

    noklam

    04/19/2022, 7:11 PM
By default, when you run a pipeline, any free output will be returned in a dictionary.
  • g

    gui42

    04/19/2022, 7:12 PM
At first, I thought the session.run idea was this: have a generic function that can run nodes, pipelines, everything needed for a set of inputs and/or a set of outputs, and Kedro would take care of running everything. But the return values are just for those that have a catalog issue, and there is no way to access the catalog for those runs (I think). So I'm always lost on how to use session.run, specifically because I can't reach those memory datasets interactively unless I'm always persisting everything in the catalog.
  • g

    gui42

    04/19/2022, 7:12 PM
    Huuum. What do you mean by default 😄 😅?
  • a

    avan-sh

    04/19/2022, 7:13 PM
I quickly checked a few things. It looks like, if it's being persisted/present in the catalog, it is available in the catalog object, and if it's a free dataset (not present in the catalog YAML), it is returned in the dictionary.
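A toy sketch of the split avan-sh describes, mocked with plain dicts (this is not Kedro's code; run_pipeline and the node tuples are made up for illustration): outputs declared in the catalog are persisted there, while free outputs come back in the dict that the run returns.

```python
def run_pipeline(nodes, declared, data):
    """nodes: list of (func, input_names, output_name).
    declared: dataset names present in the catalog; data: the stored datasets."""
    free = {}
    lookup = lambda n: data[n] if n in data else free[n]
    for func, inputs, output in nodes:
        result = func(*map(lookup, inputs))
        if output in declared:
            data[output] = result    # catalog dataset: persisted
        else:
            free[output] = result    # free output: returned to the caller
    return free

# "raw" and "final" are in the catalog; "doubled" is not, so it is free.
declared = {"raw", "final"}
data = {"raw": [1, 2, 3]}
nodes = [
    (lambda xs: [2 * x for x in xs], ["raw"], "doubled"),   # free output
    (lambda xs: sum(xs), ["doubled"], "final"),             # persisted
]
free = run_pipeline(nodes, declared, data)
print(free)           # {'doubled': [2, 4, 6]}
print(data["final"])  # 12
```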
  • n

    noklam

    04/19/2022, 7:14 PM
I mean result = session.run(), and the result will store any free output in the pipeline.
  • n

    noklam

    04/19/2022, 7:15 PM
    As @avan-sh said.
  • a

    antony.milne

    04/19/2022, 10:20 PM
    FYI this is just the sort of thing (jumping into a certain point of a pipeline in jupyter) that we're hoping to make easier in future. I think this sort of debugging is a pretty common thing to do. https://github.com/kedro-org/kedro/issues/1075
  • g

    gui42

    04/20/2022, 3:49 PM
    nice. I'll check out the PR.
  • g

    gui42

    04/20/2022, 3:52 PM
But, just my 2 cents here: the session.run API would match the use cases perfectly if it returned all the dataset data resulting from a node/pipeline/set of inputs, or a list of the wanted outputs. Now, session.run shouldn't change, obviously, but the API and the arguments in the signature seem very ergonomic to me.
  • n

    noklam

    04/20/2022, 3:57 PM
Do you mean it should just return every dataset instead of only the free outputs (current implementation)?
  • g

    gui42

    04/20/2022, 3:59 PM
yep. I mean, viewing session.run as a helper for interactive inspection and development when using outputs from other nodes/pipelines.