datajoely — 04/19/2022, 7:01 PM
gui42 — 04/19/2022, 7:03 PM
gui42 — 04/19/2022, 7:03 PM
gui42 — 04/19/2022, 7:04 PM
avan-sh — 04/19/2022, 7:05 PM
gui42 — 04/19/2022, 7:06 PM
noklam — 04/19/2022, 7:11 PM
gui42 — 04/19/2022, 7:12 PM
The idea of `session.run` was this: have a generic function that can run nodes, pipelines, everything needed, for a set of inputs and/or a set of outputs, and Kedro would take care of running everything. But the return values only cover datasets that have no catalog entry, and there is no way (I think) to access the catalog for those runs. So I'm always lost on how to use `session.run`, specifically because I can't reach those memory datasets interactively unless I persist everything in the catalog.
gui42 — 04/19/2022, 7:12 PM
avan-sh — 04/19/2022, 7:13 PM
noklam — 04/19/2022, 7:14 PM
If you do `result = session.run()`, `result` will store any free outputs of the pipeline.
noklam — 04/19/2022, 7:15 PM
antony.milne — 04/19/2022, 10:20 PM
gui42 — 04/20/2022, 3:49 PM
gui42 — 04/20/2022, 3:52 PM
The `session.run` API seems to match the use cases perfectly, if only it returned all the datasets that result from a node/pipeline/set of inputs, or a list of the wanted outputs. Now, `session.run` shouldn't change, obviously, but the API and the arguments in its signature seem very ergonomic to me.
noklam — 04/20/2022, 3:57 PM
gui42 — 04/20/2022, 3:59 PM
I see `session.run` as a helper for interactive inspection and development when using outputs from other nodes/pipelines.
noklam — 04/20/2022, 4:12 PM
`session.run` is actually the core API for Kedro's pipeline now. The use case of debugging is a valid one, especially without access to a debugger. The current concept of "free outputs" is essentially the pipeline's outputs (node outputs that have no consumer, so intermediate outputs are not included) minus the set of catalog entries (anything defined in the catalog).
From my experience this is not convenient, as the intermediate outputs could be useful for debugging, but I think the main reason is that Kedro tries to be memory efficient. I think it's one area we could improve in the interactive workflow.
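noklam's "free outputs" rule above can be sketched in plain Python. This is a toy illustration with made-up dataset names, not Kedro's actual implementation: free outputs are the outputs no node consumes, minus anything registered in the catalog.

```python
# Toy pipeline: each node lists its input and output dataset names.
nodes = [
    {"inputs": ["raw"], "outputs": ["clean"]},
    {"inputs": ["clean"], "outputs": ["features", "report"]},
]
catalog_entries = {"raw", "report"}  # datasets defined in catalog.yml

all_inputs = {d for n in nodes for d in n["inputs"]}
all_outputs = {d for n in nodes for d in n["outputs"]}

# Outputs no node consumes (so "clean", an intermediate, is excluded),
# minus datasets already persisted via the catalog:
free_outputs = (all_outputs - all_inputs) - catalog_entries
print(free_outputs)  # only "features" is a free output here
```

This is why `session.run()` returns so little by default: everything consumed downstream or persisted in the catalog is filtered out of the result dict.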
Apoorva — 04/21/2022, 12:43 PM

```python
hist_train_candidates = node(
    func=monitor_training_cand,
    inputs="training_candidates",
    outputs="training_candidates_hist",
    name="hist_train_candidates_node")

stitch_train_candidates = node(
    func=stitch_training_cand,
    inputs=["training_candidates_hist", "stitched_hist"],
    outputs="stitched_hist",
    name="stitch_train_candidates_node")

create_report = node(
    func=create_report_cand,
    inputs="stitch_train_candidates",
    outputs="stitch_report",
    name="create_report_node")
```

Having an output that is the same as an input isn't supported, but I do need it for my use case. Plus I have to create a custom versioned dataset for `stitched_hist`, which (after applying a hack of a differently named catalog entry pointing to the same file location) is leading to:

```
raise VersionNotFoundError(f"Did not find any versions for {self}")
kedro.io.core.VersionNotFoundError: Did not find any versions for HistogramDataSet(filepath=/Users/Project/data/08_reporting/stitch_train_candidates.json, protocol=file, version=Version(load=None, save='2022-04-21T12.17.11.537Z'))
```

Any suggestion on how to better handle this scenario?
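For context on why `outputs="stitched_hist"` with `"stitched_hist"` also as an input is rejected: Kedro builds a DAG from dataset dependencies, and a node that reads and writes the same dataset is a cycle. A toy topological-sort check in plain Python (not Kedro's code; dataset names taken from the snippet above) shows the failure:

```python
from graphlib import TopologicalSorter, CycleError

# Dataset-level dependencies implied by the nodes above: each output
# depends on its node's inputs. The stitch node makes "stitched_hist"
# depend on itself, which breaks the DAG.
deps = {
    "training_candidates_hist": {"training_candidates"},
    "stitched_hist": {"training_candidates_hist", "stitched_hist"},
    "stitch_report": {"stitched_hist"},
}

cycle = None
try:
    list(TopologicalSorter(deps).static_order())
except CycleError as err:
    cycle = err.args[1]  # the offending chain of nodes
print("cycle detected:", cycle)
```

Any runner that needs a valid execution order has to reject this, which is why the usual workaround is two catalog entries (previous state in, new state out) rather than one self-referencing dataset.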
datajoely — 04/21/2022, 12:44 PM
Apoorva — 04/21/2022, 1:17 PM

```python
stitch_train_candidates = node(
    func=stitch_training_cand,
    inputs=["training_candidates_hist", "stitched_hist"],
    outputs="stitch_train_candidates",
    name="stitch_train_candidates_node")
```

`stitched_hist` isn't available and I am getting this error:

```
raise VersionNotFoundError(f"Did not find any versions for {self}")
kedro.io.core.VersionNotFoundError: Did not find any versions for HistogramDataSet(filepath=/Users/Project/data/08_reporting/stitch_train_candidates.json, protocol=file, version=Version(load=None, save='2022-04-21T12.17.11.537Z'))
```

How can I fix that?
datajoely — 04/21/2022, 1:20 PM
Apoorva — 04/21/2022, 1:24 PM

```python
import json

import fsspec
from copy import deepcopy
from pathlib import PurePosixPath
from typing import Any, Dict

from kedro.io.core import (
    AbstractVersionedDataSet,
    Version,
    get_filepath_str,
    get_protocol_and_path,
)

# `log` and `dumper` are defined elsewhere in the module.


class HistogramDataSet(AbstractVersionedDataSet):
    def __init__(self, filepath: str, version: Version = None,
                 credentials: Dict[str, Any] = None):
        _credentials = deepcopy(credentials) or {}
        protocol, path = get_protocol_and_path(filepath)
        self._protocol = protocol
        self._fs = fsspec.filesystem(self._protocol, **_credentials)
        super().__init__(
            filepath=PurePosixPath(path),
            version=version,
            exists_function=self._fs.exists,
            glob_function=self._fs.glob,
        )

    def _load(self):
        load_path = get_filepath_str(self._filepath, self._protocol)
        log.info(f"load_path: {load_path}")
        try:
            with self._fs.open(load_path) as f:
                return json.load(f)
        except FileNotFoundError:
            return None

    def _save(self, data) -> None:
        """Saves data to the specified filepath."""
        save_path = get_filepath_str(self._filepath, self._protocol)
        with self._fs.open(save_path, mode="w") as f:
            json.dump(data, f, default=dumper)
        self._invalidate_cache()
```

For versioning I am using Kedro functionality.
avan-sh — 04/21/2022, 2:30 PM
You should catch `VersionNotFoundError` instead of `FileNotFoundError`.
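avan-sh's point, sketched in plain Python: a versioned load fails with Kedro's `VersionNotFoundError`, which is not a subclass of `FileNotFoundError`, so the `except FileNotFoundError` in the `_load` above never fires. The exception classes here are local stubs that only mirror the names in `kedro.io.core`:

```python
class DataSetError(Exception):
    """Stub mirroring kedro.io.core.DataSetError."""

class VersionNotFoundError(DataSetError):
    """Stub mirroring kedro.io.core.VersionNotFoundError."""

def load_versioned():
    # A versioned load resolves its path by globbing for timestamped
    # versions; when none exist, the failure is VersionNotFoundError,
    # not FileNotFoundError.
    raise VersionNotFoundError("Did not find any versions")

def load_with_wrong_handler():
    try:
        return load_versioned()
    except FileNotFoundError:  # wrong type: never matches
        return None

def load_with_right_handler():
    try:
        return load_versioned()
    except VersionNotFoundError:  # matches, so the fallback works
        return None

# The wrong handler lets the error escape; the right one returns the fallback:
wrong_escapes = False
try:
    load_with_wrong_handler()
except VersionNotFoundError:
    wrong_escapes = True
```

So the fix in `_load` is to catch `VersionNotFoundError` (or the broader `DataSetError`) where the snippet currently catches `FileNotFoundError`.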
LightMiner — 04/21/2022, 7:52 PM
datajoely — 04/21/2022, 8:09 PM
nd0rf1n — 04/25/2022, 2:32 PM
When I do a `kedro run`, I get the following `ValueError`:

```
ValueError: Pipeline input(s) {'params:data_science.model_options_experimental', 'params:data_science.active_modelling_pipeline.model_options'} not found in the DataCatalog
```
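For context on that error: Kedro exposes everything loaded from the parameters files as `params:`-prefixed catalog entries, so each `params:a.b.c` pipeline input must correspond to a key path in the parameters dict. A rough sketch of that resolution (hypothetical helper, not Kedro's implementation):

```python
def resolve_param(parameters: dict, entry: str):
    """Walk a 'params:a.b.c' catalog entry down the parameters dict."""
    assert entry.startswith("params:")
    value = parameters
    for key in entry[len("params:"):].split("."):
        if not isinstance(value, dict) or key not in value:
            raise KeyError(f"{entry} not found in the DataCatalog")
        value = value[key]
    return value

# Parameters as they might be loaded from conf/base/parameters.yml:
parameters = {
    "data_science": {
        "active_modelling_pipeline": {"model_options": {"test_size": 0.2}},
    }
}

# This entry resolves fine...
ok = resolve_param(
    parameters, "params:data_science.active_modelling_pipeline.model_options")

# ...but the experimental options are not defined, reproducing the error:
try:
    resolve_param(parameters, "params:data_science.model_options_experimental")
    missing = False
except KeyError:
    missing = True
```

So the fix is usually to define the missing keys (here `model_options_experimental` under `data_science`) in the parameters file, or to correct the `params:` names in the pipeline definition.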
datajoely — 04/25/2022, 2:33 PM
nd0rf1n — 04/25/2022, 2:33 PM
datajoely — 04/25/2022, 2:33 PM