noklam
05/13/2022, 5:47 PM

Valentin DM.
05/13/2022, 5:48 PM
Should kedro-viz be added to requirements.txt? Or in the documentation: https://kedro.readthedocs.io/en/0.18.1/tutorial/create_pipelines.html ?
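For reference, a minimal sketch of what that could look like in a project's src/requirements.txt (the pin below is illustrative, not from the thread):

kedro~=0.18.1
kedro-viz    # assumption: pin to whichever release matches your Kedro version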

Kastakin
05/13/2022, 6:40 PM

datajoely
05/13/2022, 7:15 PM

wwliu
05/16/2022, 9:04 PM
from typing import Any, Dict

import mlflow.sklearn
from kedro.framework.hooks import hook_impl
from kedro.pipeline.node import Node


class ModelTrackingHooks:
    @hook_impl
    def after_node_run(self, node: Node, outputs: Dict[str, Any], inputs: Dict[str, Any]) -> None:
        if node._func_name == "train_model":
            model = outputs["example_model"]
            mlflow.sklearn.log_model(model, "model")
            mlflow.log_params(inputs["parameters"])
My question is: I only need to log metrics in this specific train_model node, while based on my understanding this function will run every time a node finishes, and there could be a lot of nodes in the whole pipeline. Is there a way I could specify which node this hook is hooked to?
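Since hooks are registered globally rather than per node, one hedged sketch (the constructor argument, node names and settings.py line are assumptions, not from the thread) is to make the hook itself filter on the node's name and return early for everything else:

from typing import Any, Dict, Iterable

import mlflow.sklearn
from kedro.framework.hooks import hook_impl
from kedro.pipeline.node import Node


class ModelTrackingHooks:
    def __init__(self, tracked_nodes: Iterable[str] = ("train_model",)):
        # Only nodes whose explicit name appears here get logged.
        self._tracked_nodes = set(tracked_nodes)

    @hook_impl
    def after_node_run(self, node: Node, outputs: Dict[str, Any], inputs: Dict[str, Any]) -> None:
        if node.name not in self._tracked_nodes:
            return  # every other node finishes without any MLflow calls
        mlflow.sklearn.log_model(outputs["example_model"], "model")
        mlflow.log_params(inputs["parameters"])

Registered in settings.py with HOOKS = (ModelTrackingHooks(tracked_nodes=["train_model"]),). This assumes the node was created with name="train_model"; the hook still fires after every node, it just does no work for the ones not listed, because Kedro has no per-node hook registration.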

noklam
05/16/2022, 9:06 PM

wwliu
05/16/2022, 9:12 PM

datajoely
05/16/2022, 9:16 PM

wwliu
05/16/2022, 9:36 PM
Right now I use if statements in the scripts.
if node._func_name == "split_data":
    mlflow.log_params(
        {"split_data_ratio": inputs["params:example_test_data_ratio"]}
    )
elif node._func_name == "train_model":
    model = outputs["example_model"]
    mlflow.sklearn.log_model(model, "model")
    mlflow.log_params(inputs["parameters"])
These are node-specific functions. Do you think these are better put in the node logic itself instead of hooks, or is this a proper use case for hooks? Or, as @noklam suggested, use kedro-mlflow?
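For comparison, a hedged sketch of the "put it in the node logic" option (the estimator and parameter handling are assumptions; only the function and dataset names come from the snippets above):

import mlflow.sklearn
from sklearn.linear_model import LogisticRegression


def train_model(X_train, y_train, parameters: dict):
    # Hypothetical node body: the MLflow calls live inside the node,
    # so no hook and no node-name branching is needed.
    model = LogisticRegression(**parameters.get("model_kwargs", {}))
    model.fit(X_train, y_train)
    mlflow.log_params(parameters)
    mlflow.sklearn.log_model(model, "model")
    return model

The trade-off: hooks keep tracking concerns out of the data-science code but accumulate node-specific branching, while in-node logging keeps node-specific behaviour next to the node it belongs to. kedro-mlflow, as suggested above, largely moves this into catalog dataset types and its own hooks instead.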

datajoely
05/16/2022, 9:38 PM

datajoely
05/16/2022, 9:38 PM

wwliu
05/16/2022, 10:13 PM

datajoely
05/16/2022, 10:23 PM

wwliu
05/16/2022, 10:38 PM
In the split_data node I need to log train_split_ratio, and in the train_model node I need to log the model object. Does this mean this scenario is kind of against the hooks usage guide?
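Another hedged option, if the goal is to keep one generic hook: tag the nodes that should be tracked and check the tag inside the hook. Everything below (the "mlflow" tag, the stand-in split_data function, the hook name) is illustrative, not from the thread:

from typing import Any, Dict

import mlflow
from kedro.framework.hooks import hook_impl
from kedro.pipeline import node
from kedro.pipeline.node import Node


def split_data(data, ratio):
    # Stand-in for the real node function; not from the thread.
    return data, data


# Tag the nodes that should be tracked when they are defined.
split_node = node(
    split_data,
    ["example_data", "params:example_test_data_ratio"],
    ["train", "test"],
    name="split_data",
    tags=["mlflow"],
)


class TaggedTrackingHooks:
    @hook_impl
    def after_node_run(self, node: Node, inputs: Dict[str, Any], outputs: Dict[str, Any]) -> None:
        # React only to nodes carrying the "mlflow" tag; all others are skipped.
        if "mlflow" not in node.tags:
            return
        mlflow.log_params({k: v for k, v in inputs.items() if k.startswith("params:")})

This keeps the hook free of hard-coded node names; which nodes are tracked is decided where the pipeline is defined.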

datajoely
05/16/2022, 10:46 PM

datajoely
05/16/2022, 10:46 PM

wwliu
05/16/2022, 10:49 PM

AnnaRie
05/18/2022, 6:59 PM

antony.milne
05/18/2022, 8:32 PM

wwliu
05/18/2022, 11:29 PM

noklam
05/18/2022, 11:35 PM

noklam
05/18/2022, 11:37 PM

datajoely
05/19/2022, 10:48 AM
There is no guarantee about the order, only per dependency level, i.e. if dataset D requires A, B and C, then D will always be the last to execute, but the order in which A, B and C run is not fixed per run.
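A small self-contained sketch of that (all names made up): D is only computed once A, B and C exist, but the order of A, B and C among themselves is not something to rely on.

from kedro.pipeline import Pipeline, node


def make_a():
    return "a"


def make_b():
    return "b"


def make_c():
    return "c"


def combine(a, b, c):
    return a + b + c


pipeline = Pipeline(
    [
        node(combine, ["A", "B", "C"], "D", name="make_d"),
        node(make_a, None, "A", name="node_a"),
        node(make_b, None, "B", name="node_b"),
        node(make_c, None, "C", name="node_c"),
    ]
)

# Pipeline.nodes is topologically sorted: make_d always comes after the three
# producers, but the relative order of node_a/node_b/node_c is an
# implementation detail, not something to rely on.
print([n.name for n in pipeline.nodes])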

Lazy2PickName
05/19/2022, 4:41 PM
def _parse_inctf() -> Pipeline:
    return Pipeline(
        [
            node(
                func=nodes.insert_columns_inctf,
                inputs="external-inct-fracionada",
                outputs="inctf-preprocess-01-insert-columns",
                name="read-and-insert-columns-inctf",
            ),
            node(
                func=nodes.parse_inct_dates,
                inputs="inctf-preprocess-01-insert-columns",
                outputs="inctf-preprocess-02-parse-dates",
            ),
            node(
                func=nodes.get_pct_change,
                inputs="inctf-preprocess-02-insert-columns",
                outputs="inctf-preprocessed",
            ),
        ]
    )
From those datasets, only external-inct-fracionada and inctf-preprocessed are actually declared in the catalog.yml. I want to pass the others as MemoryDatasets, since they are intermediaries of my pipeline, but when I run I get this error:
ValueError: Pipeline input(s) {'inctf-preprocess-02-insert-columns'} not found in the DataCatalog
Is there a way of doing this without declaring each intermediary dataset in my catalog? Just so you know, this is the entry for external-inct-fracionada in my catalog:
external-inct-fracionada:
  type: project.io.encrypted_excel.EncryptedExcelDataSet
  filepath: "${DATA_DIR}/External/INCT/INCTF_0222.xls"
EncryptedExcelDataSet and its implementation can be seen in the attached file.
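A hedged note on the likely cause (this is not a reply from the thread): datasets missing from the catalog do default to MemoryDataSet, but only when some node actually produces them. In the snippet above the second node outputs inctf-preprocess-02-parse-dates while the third node asks for inctf-preprocess-02-insert-columns, so Kedro sees a dataset that no node produces, treats it as a free pipeline input, and looks for it in the DataCatalog, hence the ValueError. Aligning the names should be enough, e.g. for the third node:

node(
    func=nodes.get_pct_change,
    # consume what the previous node actually produces; the undeclared
    # intermediate datasets then fall back to MemoryDataSet automatically
    inputs="inctf-preprocess-02-parse-dates",
    outputs="inctf-preprocessed",
),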

noklam
05/19/2022, 4:53 PM

Lazy2PickName
05/19/2022, 4:57 PM

SirTylerDurden
05/20/2022, 1:34 AM

datajoely
05/20/2022, 9:56 AM

datajoely
05/20/2022, 10:01 AM

SirTylerDurden
05/21/2022, 12:08 AM