beginners-need-help
  • noklam (04/20/2022, 4:12 PM)
    `session.run` is actually the core API for Kedro's pipeline now. The use case of debugging is a valid one, especially without access to a debugger. The current concept of "free outputs" is essentially the pipeline's outputs (node outputs that have no consumer, so intermediate outputs are not included) minus the set of catalog entries (anything defined in the catalog). From my experience, this is not convenient, as intermediate outputs could be useful for debugging, but I think the main reason is that Kedro tries to be memory efficient. I think it's one area we could improve in the interactive workflow.
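    To illustrate the definition above, a minimal sketch (assuming a `Pipeline` object `pipeline` and a `DataCatalog` object `catalog`; `Pipeline.outputs()` and `DataCatalog.list()` are the public methods involved):
    # Free outputs are the pipeline's terminal outputs (node outputs with
    # no downstream consumer) that are not registered in the catalog;
    # these are what session.run() returns in memory.
    free_outputs = pipeline.outputs() - set(catalog.list())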
  • Apoorva (04/21/2022, 12:43 PM)
    Hey all, for one of my use cases I have to add a node that creates histograms for a dataset on each pipeline run, and then a second node that stitches the existing stitched histogram (from the previous run) together with the new histogram created by the first node, something like:
    hist_train_candidates = node(
        func=monitor_training_cand,
        inputs="training_candidates",
        outputs="training_candidates_hist",
        name="hist_train_candidates_node")

    stitch_train_candidates = node(
        func=stitch_training_cand,
        inputs=["training_candidates_hist", "stitched_hist"],
        outputs="stitched_hist",
        name="stitch_train_candidates_node")

    create_report = node(
        func=create_report_cand,
        inputs="stitch_train_candidates",
        outputs="stitch_report",
        name="create_report_node")
    Having an output that is the same as your input isn't supported, but I do need it for my use case. Plus I have to create a custom versioned dataset for the stitched histogram, which (after applying a hack of a differently named catalog entry that points to the same file location) leads to:
    raise VersionNotFoundError(f"Did not find any versions for {self}")
    kedro.io.core.VersionNotFoundError: Did not find any versions for HistogramDataSet(filepath=/Users/Project/data/08_reporting/stitch_train_candidates.json, protocol=file, version=Version(load=None, save='2022-04-21T12.17.11.537Z'))
    Any suggestions on how to better handle this scenario?
  • datajoely (04/21/2022, 12:44 PM)
    You can't have the same input as an output; the DAG needs to be acyclic by definition. What you can do is define two datasets that point to the same path.
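    A catalog sketch along those lines (the entry names and the dataset's module path are assumptions, not from the thread; one entry is read by the stitching node and the other is written by it, both backed by the same file):
    stitched_hist_in:
      type: my_project.datasets.HistogramDataSet
      filepath: data/08_reporting/stitched_hist.json

    stitched_hist_out:
      type: my_project.datasets.HistogramDataSet
      filepath: data/08_reporting/stitched_hist.json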
  • Apoorva (04/21/2022, 1:17 PM)
    In this scenario the dataset I am using is a custom versioned dataset, and for the very first run of the node
    stitch_train_candidates = node(
        func=stitch_training_cand,
        inputs=["training_candidates_hist", "stitched_hist"],
        outputs="stitch_train_candidates",
        name="stitch_train_candidates_node")
    stitched_hist isn't available yet, and I am getting this error:
    raise VersionNotFoundError(f"Did not find any versions for {self}")
    kedro.io.core.VersionNotFoundError: Did not find any versions for HistogramDataSet(filepath=/Users/Project/data/08_reporting/stitch_train_candidates.json, protocol=file, version=Version(load=None, save='2022-04-21T12.17.11.537Z'))
    How can I fix that?
  • datajoely (04/21/2022, 1:20 PM)
    If I'm reading this correctly, you've never written any data to that location, so it can't read any versions?
  • Apoorva (04/21/2022, 1:24 PM)
    Yes, so I don't understand why I am getting this. Something may be wrong with how I am creating this versioned dataset:
    import json
    from copy import deepcopy
    from pathlib import PurePosixPath
    from typing import Any, Dict

    import fsspec
    from kedro.io.core import (
        AbstractVersionedDataSet,
        Version,
        get_filepath_str,
        get_protocol_and_path,
    )

    # `log` and `dumper` are defined elsewhere in the project.

    class HistogramDataSet(AbstractVersionedDataSet):

        def __init__(self, filepath: str, version: Version = None, credentials: Dict[str, Any] = None):
            _credentials = deepcopy(credentials) or {}
            protocol, path = get_protocol_and_path(filepath)
            self._protocol = protocol
            self._fs = fsspec.filesystem(self._protocol, **_credentials)
            super().__init__(
                filepath=PurePosixPath(path),
                version=version,
                exists_function=self._fs.exists,
                glob_function=self._fs.glob,
            )

        def _load(self):
            load_path = get_filepath_str(self._filepath, self._protocol)
            log.info(f"load_path: {load_path}")
            try:
                with self._fs.open(load_path) as f:
                    return json.load(f)
            except FileNotFoundError:
                return None

        def _save(self, data) -> None:
            """Saves data to the specified filepath."""
            save_path = get_filepath_str(self._filepath, self._protocol)
            with self._fs.open(save_path, mode="w") as f:
                json.dump(data, f, default=dumper)
            self._invalidate_cache()
    For versioning I am using Kedro's functionality.
  • avan-sh (04/21/2022, 2:30 PM)
    What you're doing looks OK as far as I can see, but you might need to catch `VersionNotFoundError` instead of `FileNotFoundError`.
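    A sketch of that change applied to the `_load` method above (assuming Kedro 0.18.x, where `VersionNotFoundError` lives in `kedro.io.core` and is raised while resolving the versioned load path):
    from kedro.io.core import VersionNotFoundError

    def _load(self):
        try:
            # _get_load_path() resolves the versioned path and raises
            # VersionNotFoundError when no version has been saved yet.
            load_path = get_filepath_str(self._get_load_path(), self._protocol)
            with self._fs.open(load_path) as f:
                return json.load(f)
        except VersionNotFoundError:
            # First run: no stitched histogram exists yet.
            return None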
  • LightMiner (04/21/2022, 7:52 PM)
    Hi everyone, I've been using Kedro for a little while and followed the DataEngineerOne videos. I have a question about programmatically adding datasets. For one of my projects I have a hierarchy of files that is growing in a structured way, where recordings are being added for new subjects (data/sub01/recordings.txt). In one of the DataEngineerOne videos he does this by changing the ProjectContext class in the run.py file, but it seems that in recent versions this file no longer exists.

    https://www.youtube.com/watch?v=CIRVpMqWEIs

    I want to be able to create the datasets automatically from params, and create the corresponding nodes from params. I was thinking of 4 solutions:
    1. Find the equivalent of the ProjectContext class: I've been wondering if this class is still used in a new file, or if there is an equivalent in the new version of Kedro.
    2. Use Jinja2 in the catalog: I've been wondering how I could then load the parameters to iterate over them and create the catalog entries.
    3. Create a custom dataset class: but then I've been wondering how to return a dictionary of callables like PartitionedDataSet does.
    4. Use hooks, as proposed in a past question, but honestly I have never used them.
    Which solution is the best? Or is there another, simpler one?
  • datajoely (04/21/2022, 8:09 PM)
    dynamic datasets
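    On option 4, a minimal sketch of a hook that registers datasets dynamically (names are hypothetical; it assumes Kedro 0.18.x, where `after_catalog_created` is a built-in hook spec, `DataCatalog.add` registers a dataset at runtime, and the class is enabled via `HOOKS` in settings.py):
    from pathlib import Path

    from kedro.framework.hooks import hook_impl
    from kedro.extras.datasets.text import TextDataSet

    class DynamicDatasetHooks:
        @hook_impl
        def after_catalog_created(self, catalog):
            # Register one dataset per subject folder found on disk,
            # e.g. data/sub01/recordings.txt -> "sub01_recordings".
            for sub_dir in sorted(Path("data").glob("sub*")):
                catalog.add(
                    f"{sub_dir.name}_recordings",
                    TextDataSet(filepath=str(sub_dir / "recordings.txt")),
                )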
  • nd0rf1n (04/25/2022, 2:32 PM)
    Hi, everybody! I came back to the project I set up last week while going through the Kedro tutorial, which was working just fine. When I try to execute `kedro run`, I get the following ValueError:
    ValueError: Pipeline input(s) {'params:data_science.model_options_experimental', 'params:data_science.active_modelling_pipeline.model_options'} not found in the DataCatalog
  • datajoely (04/25/2022, 2:33 PM)
    What does your parameters.yml look like?
  • nd0rf1n (04/25/2022, 2:33 PM)
    I have not been able to figure out what changed in the meantime. Any ideas are more than welcome.
  • datajoely (04/25/2022, 2:33 PM)
    This did change between 0.17.7 and 0.18.0; it's in the release notes.
  • nd0rf1n (04/25/2022, 2:33 PM)
    It is actually empty
  • datajoely (04/25/2022, 2:33 PM)
    Well it needs to be populated!
  • nd0rf1n (04/25/2022, 2:35 PM)
    Hahaha! Yeah, I'd imagine. Last week it was working fine. I thought that, since you have a parameters/ folder and then a YAML file for each pipeline, it was getting those through there.
  • datajoely (04/25/2022, 2:38 PM)
    So if it's mapped in a pipeline, it needs to be there, even if you're not calling it. This open PR I was just looking at actually has the right answer here: https://github.com/kedro-org/kedro/pull/1424/files#diff-fb71c3224b6465dcb63b9daff7d185ea9fda1fa61464d8670e7ba917b2e8fdb1R85
  • nd0rf1n (04/25/2022, 2:42 PM)
    The `conf/base/parameters/data_science.yml` is indeed populated with:
    model_options:
      test_size: 0.2
      random_state: 3
      features:
        - engines
        - passenger_capacity
        - crew
        - d_check_complete
        - moon_clearance_complete
        - iata_approved
        - company_rating
        - review_scores_rating

    model_options_experimental:
      test_size: 0.2
      random_state: 8
      features:
        - engines
        - passenger_capacity
        - crew
        - review_scores_rating
    I thought you were referring to the `conf/base/parameters.yml`, which is empty.
  • datajoely (04/25/2022, 2:44 PM)
    Can you type `kedro -V`? I think you have accidentally upgraded to 0.18.x, which has breaking changes.
  • datajoely (04/25/2022, 2:44 PM)
    And if you compare to the PR, your file needs to be updated to have the pipeline namespace as the top level.
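    Based on the missing inputs named in the error above, the re-nested parameters file would look something like this (a sketch; the key nesting is inferred from `params:data_science.active_modelling_pipeline.model_options` and `params:data_science.model_options_experimental`):
    data_science:
      active_modelling_pipeline:
        model_options:
          test_size: 0.2
          random_state: 3
          features:
            - engines
            - passenger_capacity
            - crew
            - d_check_complete
            - moon_clearance_complete
            - iata_approved
            - company_rating
            - review_scores_rating
      model_options_experimental:
        test_size: 0.2
        random_state: 8
        features:
          - engines
          - passenger_capacity
          - crew
          - review_scores_rating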
  • datajoely (04/25/2022, 2:45 PM)
    https://github.com/kedro-org/kedro/blob/main/RELEASE.md#modular-pipelines
  • nd0rf1n (04/25/2022, 2:47 PM)
    I do have 0.18.0, but I'm pretty sure I have not upgraded since I last ran the pipeline.
  • datajoely (04/25/2022, 2:47 PM)
    The snippet you posted would have worked in 0.17.7, so maybe you were looking at an older version of the docs.
  • nd0rf1n (04/25/2022, 2:49 PM)
    I see. I just checked again and indeed the docs have been changed.
  • nd0rf1n (04/25/2022, 2:50 PM)
    No, actually they have not been changed, my bad.
  • nd0rf1n (04/25/2022, 2:50 PM)
    https://kedro.readthedocs.io/en/stable/tutorial/namespace_pipelines.html#adding-namespaces-to-the-data-science-pipeline
  • nd0rf1n (04/25/2022, 3:07 PM)
    I will use the updated docs from the PR you sent me, though, and hopefully it will solve my issues. Thanks for the help! Note to self: be more aware of updates to projects you're not familiar with, and read the release notes! 😉
  • datajoely (04/25/2022, 3:08 PM)
    I think the tutorial is broken and that PR is the fix. So in this case it's us, not you! Thanks.
  • Rafał (04/26/2022, 10:08 AM)
    Hello, could anyone help me find a good example of the YAML file used with `kedro run -c config.yml`? I see the official documentation says nothing about `--from-nodes`. I am afraid I have a case where kedro 0.18.0 ignores the option provided in `run.from-nodes`.
  • noklam (04/26/2022, 10:13 AM)
    What does your `config.yml` look like?
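    For reference, a run config passed via `kedro run -c` nests the CLI options under a top-level `run` key, with hyphens in option names replaced by underscores; a sketch with placeholder names (whether `from_nodes` is honoured here is exactly what is being questioned above):
    run:
      pipeline: __default__
      from_nodes:
        - my_first_node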