Powered by Linen
plugins-integrations
  • d

    Dhaval

    12/31/2021, 5:06 PM
    Hi, so this slightly deviates from kedro, but I have a list of models which I am trying to train on the same data. This is a classification problem and I want to log the parameters and all the metrics that I am willing to track with MLFlow. I checked out the kedro-mlflow repo, but when I run this on a pipeline which internally has multiple models, the autologging functionality returns the parameters of the last model that was run. Is there any way around this? I am using the modular-spaceflights tutorial from @datajoely https://github.com/datajoely/modular-spaceflights @Galileo-Galilei would appreciate your feedback on this since you are the repo owner for kedro-mlflow
  • d

    Dhaval

    01/03/2022, 5:12 AM
    Looking for your feedback @User
  • g

    Galileo-Galilei

    01/03/2022, 9:10 AM
    Yes sorry, I'll answer tonight
  • g

    Galileo-Galilei

    01/03/2022, 8:57 PM
    Hello @User, can you give a reproducible example (e.g. a repo I can clone) so I can try to reproduce the bug? I guess that what you mean by "running a pipeline with several models" is using modular pipelines to duplicate a "base" pipeline, with a namespace which changes the names of its datasets (in particular, the metrics and models you persist in your ``catalog.yml``). In this case, it seems that Kedro does not namespace parameters, according to this pinned issue: https://github.com/quantumblacklabs/kedro/issues/929, so mlflow has no way to distinguish them — but this will be fixed in 0.18.0. Since this is more a Kedro bug than a kedro-mlflow bug, I won't support any specific trick to fix it, and we will have to wait for 0.18.0 unfortunately ☹️ Would you mind trying the develop branch and telling me if it fixes the issue? If I misunderstood what you said, feel free to ask again!
  • d

    Dhaval

    01/04/2022, 6:54 AM
    Hi @User, I am just trying to use mlflow.autolog() in the modeling pipeline, but it doesn't log the parameters of the different models separately. It is all included in one run rather than separate runs. This is the repo: https://github.com/datajoely/modular-spaceflights
  • g

    Galileo-Galilei

    01/04/2022, 11:55 AM
    Some remarks about your comment:
    - ``kedro-mlflow`` does not use ``mlflow.autolog()``, so I don't know what you are doing, but it is not related to the plugin. Where did you add this command? Inside a node? Inside a hook?
    - I've seen the repo, but I have no idea what you are doing with it (what did you modify? what pipeline are you running? what lines of the code are you referring to?), so I'll need more details to be more helpful.
    In general, mlflow logs everything inside the same run. In pseudo code, if you are tuning several models and you want to store the information in different runs, you can do:
    ```python
    mlflow.autolog()
    for hyperparams in hyperparams_candidates:
        with mlflow.start_run(nested=True):
            model.train(**hyperparams)
    ```
    which will create sub-runs inside the main one. Is it what you want?
  • d

    Dhaval

    01/05/2022, 3:40 PM
    @User I have just run the command
    kedro mlflow init
    and then
    kedro run
    on the repository mentioned above. What is happening right now is that the parameters aren't logged as separate runs; it's all present in just one single run. Since the models are different, I want the runs recorded separately. How do I achieve that? Do note that the modelling pipeline in this repo uses a namespace for grouping multiple regression models
  • g

    Galileo-Galilei

    01/05/2022, 9:27 PM
    Ok, I get it. I have very bad news:
    - it is not possible to do it with kedro-mlflow for now
    - I don't think it will ever be possible in a general way, for the following reasons:
      - functional reason: there is no reason to suppose that every user wants to log every namespaced modular pipeline in a sub-run. It is common to use modular pipelines which are the continuation of the same run. For instance, I often use a namespaced modular "evaluation" pipeline which takes a ``pandas.DataFrame`` of predictions and outputs a lot of metrics. I may use this pipeline just after training on a validation data set, or standalone on another extraction, but it does not make sense to create a mlflow sub-run for this pipeline.
      - technical reason: even if we wanted to trigger a sub-run for all modular pipelines:
        - it is very hard to identify the beginning and the end of such pipelines (because they can have several inputs and outputs, and Kedro does not always run them in the same order). It is very hard to catch at execution time whether this is "the first input node" or "the last output node" of the sub-pipeline, in order to start and end the run properly
        - it is very hard to identify sub-pipelines once they are summed up altogether. When you do something like ``final_pipeline=pipeline_etl+pipeline_training1+pipeline_training2+evaluation`` (with ``pipeline_training1`` and ``pipeline_training2`` being the same pipeline, just with different namespaces), Kedro recreates a single big pipeline composed of the nodes of all the sub-pipelines. There is no notion of "sub-pipelines" any more, so ``kedro-mlflow`` has no obvious way to identify these "sub" pipelines.
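[Editor's note] The "summing flattens pipelines" point can be illustrated with a toy sketch. This is not the real Kedro API — just a minimal stand-in class showing why no sub-pipeline grouping survives the `+` operation:

```python
# Toy illustration (NOT the real Kedro Pipeline class): summing pipelines
# concatenates their node lists into one flat pipeline, so nothing records
# which original sub-pipeline each node came from.
class Pipeline:
    def __init__(self, nodes):
        self.nodes = list(nodes)

    def __add__(self, other):
        # The sum is a single flat pipeline; sub-pipeline boundaries are lost.
        return Pipeline(self.nodes + other.nodes)

etl = Pipeline(["extract", "transform"])
training1 = Pipeline(["train_model_a"])   # same "base" pipeline,
training2 = Pipeline(["train_model_b"])   # different namespace
evaluation = Pipeline(["evaluate"])

final_pipeline = etl + training1 + training2 + evaluation
print(final_pipeline.nodes)
# One flat node list; no "sub pipeline" grouping remains for a plugin to hook into
```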
  • g

    Galileo-Galilei

    01/05/2022, 9:28 PM
    What to do then? You can either:
    - create a custom ``AbstractRunner`` to run the same pipeline several times and create a mlflow sub-run each time:
    ```python
    # inside a custom AbstractRunner
    with mlflow.start_run(nested=True):
        runner.run(pipeline, catalog)
    ```
    - do all hyperparameter tuning inside a node, and launch sub-runs as described above:
    ```python
    # inside a node
    for hyperparams in hyperparams_candidates:
        with mlflow.start_run(nested=True):
            model.train(**hyperparams)
    ```
    There is an interesting discussion on the same theme here: https://github.com/Galileo-Galilei/kedro-mlflow/issues/246
  • c

    ChainYo

    01/28/2022, 2:44 PM
    Anyone already used Celery to run multiple Kedro pipelines on a daily basis?
  • d

    Dhaval

    02/03/2022, 8:01 AM
    @User I am trying to load a model from the Model Registry of MLFlow but it throws an error. Currently my mlflow.yml file has the tracking_uri set to 'postgresql://postgres:postgres@localhost:5432/mlflow_db'. When I use this command:
    ```python
    model = mlflow.pyfunc.load_model(
        model_uri=f"models:/temp/1"
    )
    ```
    I get the following error: Note: temp here is the model name and 1 is the version of the registered model
  • g

    Galileo-Galilei

    02/04/2022, 12:21 PM
    Hi, the log shows that you didn't set the mlflow tracking uri before loading the model, hence it looks for a local mlruns folder
  • g

    Galileo-Galilei

    02/04/2022, 12:24 PM
    Can you try this:
    ```python
    from kedro.framework.session import KedroSession
    from kedro_mlflow.config import get_mlflow_config

    with KedroSession.create() as session:
        mlflow_config = get_mlflow_config()
        mlflow_config.setup()
        # < your code>
    ```
  • d

    Dhaval

    02/08/2022, 1:11 PM
    Thanks, this works
  • l

    lbonini

    02/10/2022, 5:15 PM
    Hello, could someone help me with ``kedro build-docs``? Nodes, pipelines and subpackages in general (docstrings) are not showing in docs
  • a

    antony.milne

    02/10/2022, 5:25 PM
    The docstrings should build by default when you do ``kedro build-docs`` - no need to modify index.rst. Something that's tripped me up before: you need ``__init__.py`` files everywhere for Sphinx to find all the right modules
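[Editor's note] A quick way to check for the missing-``__init__.py`` problem is to walk the source tree and list directories without one. This helper (name and behavior are my own sketch, not part of Kedro or Sphinx) uses only the standard library:

```python
# Hypothetical helper: walk a package tree and report directories that are
# missing an __init__.py, since Sphinx/autodoc will not treat those
# directories as importable modules.
import os

def dirs_missing_init(root):
    missing = []
    for dirpath, dirnames, filenames in os.walk(root):
        # skip hidden directories such as .git
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        if "__init__.py" not in filenames:
            missing.append(dirpath)
    return missing
```

Running it as, e.g., `dirs_missing_init("src/my_project")` prints candidates to fix before rebuilding the docs.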
  • l

    lbonini

    02/10/2022, 5:50 PM
    @User thank you very much for replying! I'm trying to build docs for the pandas-iris demo (``kedro new --starter=pandas-iris``), and there are ``__init__.py`` files in every folder. 😕
  • i

    IS_0102

    02/10/2022, 8:31 PM
    Hi everyone! Nice to meet you all! I'm not sure if I'm posting this in the right place, but I'm trying to set up kedro-mlflow to track some experiments and I have some doubts:
    - Is there a straightforward way to save the actual logs? Either as an artifact or in the metadata
    - I created a node to compute some metrics following the documentation here https://kedro-mlflow.readthedocs.io/en/stable/source/04_experimentation_tracking/05_version_metrics.html#how-to-return-metrics-from-a-node, but even if the node runs smoothly I do not see any metric recorded. Do I have to specifically call the .save function?
    - Where can I find the run ids in the web UI? Even double-clicking on a specific run I cannot see them anywhere, except in the indication of the artifacts path, which seems quite strange
  • d

    datajoely

    02/10/2022, 9:08 PM
    @User any ideas on this one?
  • a

    Arnaldo

    02/11/2022, 12:26 PM
    @User
    > I created a node to compute some metrics
    To log metrics on MLFlow using the kedro-mlflow plugin, you need to:
    1. return a dictionary of metrics in the following format:
    ```python
    {
        "metric_name_1": {"value": 0.9, "step": 1},
        "metric_name_2": {"value": 0.9, "step": 1},
    }
    ```
    2. have the metric defined in the catalog. For example:
    ```yaml
    output_metric_name_defined_in_node_output:
        type: kedro_mlflow.io.metrics.MlflowMetricsDataSet
    ```
    > Where can I find the run-ids on the web ui?
    If I understood it correctly, the run id is at the top of the run page (see image attached)
  • a

    Arnaldo

    02/11/2022, 12:28 PM
    > Is there a straightforward way to save the actual logs?
    Probably @User can provide a better answer, but I know that the logs of Kedro are saved in ``logs/info.log``. Therefore, you could probably call ``mlflow.log_artifact("logs/info.log")`` to save it
  • a

    antony.milne

    02/11/2022, 4:08 PM
    kedro build-docs
  • i

    IS_0102

    02/11/2022, 10:48 PM
    Hi Arnaldo! Thanks a lot for your answer!
    Logs: Thanks, I'll try this out! My fear is that in logs/info.log the logs of all subsequent runs are appended, so the file might become a bit too heavy (at least we should be able to identify the logs we are interested in, since they should always be at the end of the file).
    Metrics: Actually I had done as you suggested, but there was a typo in my catalog name and so the results were not stored. Now it works, thanks!
    Run_id: Where you have the run_id, I get only the pipeline name (image below). This might be due to the fact that I am running single pipelines (kedro run --pipeline ) and not a complete 'kedro run'. Do you know if there's a way to save the run_id also in this situation (maybe by tweaking something in mlflow.yml)? Otherwise I'll probably need to 'kedro run', changing the default pipeline every time
  • d

    Dhaval

    02/12/2022, 7:05 AM
    Hi, I am trying to schedule runs on Airflow but there's no guide specific to plain Airflow. There is a guide for Astronomer Airflow, but not Airflow alone. I'm looking to create a docker-compose setup which has Airflow to schedule runs of Kedro pipelines, MLFlow to log the metrics, and Postgres as the metric-logging DB. Can anyone help with this?
  • g

    Galileo-Galilei

    02/12/2022, 7:36 AM
    Thanks @User for replying faster than I could 😉 Some remarks:
    1. I have no better way than the one suggested to log the logs. I know that logging the entire file is not satisfactory, but I don't know how I could log only the logs for the specific run. It would require storing them somewhere during execution, and I don't think this exists for now. Another issue is that you would only have the logs up to the ``after_pipeline_run`` call, which may strip off some of the logging occurring afterwards.
    2. I suspected the error came from a typo in the catalog, thanks for confirming. Notice that in the most recent versions of ``kedro-mlflow`` you can return only a float / a list of floats or a {step: value} dict rather than the complicated format above (see ``MlflowMetricDataSet`` without "s" or ``MlflowMetricHistoryDataSet`` on the same documentation page you sent above).
    3. Actually I've noticed that several times in the past: the mlflow UI does not display the same thing depending on the mlflow version you have, and it varies a lot between versions (sometimes nothing shows up, sometimes you have the name and sometimes you have the run id). @User and @User can you tell me what mlflow version you have? Since mlflow is highly inconsistent in what is displayed, I may log the ``run_id`` as a tag in the future to ensure consistency.
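[Editor's note] For point 2, the catalog side of those simpler dataset types might look like the sketch below. The dataset names are placeholders, and the exact options depend on your ``kedro-mlflow`` version — check the documentation page linked above:

```yaml
# Sketch (dataset names are placeholders). With recent kedro-mlflow,
# a node bound to MlflowMetricDataSet (no "s") can return a plain float,
# and one bound to MlflowMetricHistoryDataSet can return a list of floats.
my_accuracy:
  type: kedro_mlflow.io.metrics.MlflowMetricDataSet

my_loss_history:
  type: kedro_mlflow.io.metrics.MlflowMetricHistoryDataSet
```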
  • g

    Galileo-Galilei

    02/12/2022, 7:46 AM
    Hi @User, there is an interesting discussion here: https://github.com/Galileo-Galilei/kedro-mlflow/issues/44 but it raises more questions than it gives answers 😅 maybe @User got something functional, given that it seems he was in touch with the airflow team back then
  • i

    IS_0102

    02/13/2022, 5:22 PM
    Thanks a lot for your reply @User!
    1. Yes, that makes sense. I think I might try to 'manually' parse the logs, keeping only the characters after the last 'kedro run', and see if I can make it work
    2. Got it! Does this mean that I should add a node for every metric I want to log?
    3. Currently I have mlflow==1.23.1 and kedro-mlflow==0.8.0
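[Editor's note] The 'manual parse' idea in point 1 can be sketched with plain string handling. The marker string here is an assumption — adjust it to whatever your logging config actually writes at the start of a run:

```python
# Sketch: keep only the portion of the log file after the last occurrence
# of a marker string (assumed to appear once per run). Everything before
# the last marker is from earlier, appended runs.
def tail_after_last_marker(text, marker="kedro run"):
    idx = text.rfind(marker)
    if idx == -1:
        return text  # marker not found: keep everything
    return text[idx:]

log = "old stuff\nkedro run #1\nline a\nkedro run #2\nline b\n"
print(tail_after_last_marker(log))  # keeps only "kedro run #2" and what follows
```

The trimmed text could then be written to a temporary file and logged with ``mlflow.log_artifact`` instead of the full ``logs/info.log``.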
  • a

    Arnaldo

    02/14/2022, 12:32 PM
    I'm using mlflow==1.23.1 as well, but with kedro-mlflow==0.7.6
  • a

    Arnaldo

    02/14/2022, 12:41 PM
    @User
    > My fear is that in logs.info, the logs of all consequent runs are appended, so the file might become a bit too heavy
    To log only the current session to ``logs/info.log``, you can change the ``info_file_handler`` in ``conf/<env>/logging.yml`` to the following:
    ```yaml
    info_file_handler:
        class: logging.FileHandler
        level: INFO
        formatter: simple
        filename: logs/info.log
        mode: w
        encoding: utf8
        delay: True
    ```
  • d

    Dhaval

    02/14/2022, 4:16 PM
    @User So, I am trying to push my artifact data to a local minio bucket. It's hosted on http://192.168.0.150:9000 with the bucket name ``mlflow``. The minio server is running locally, so I've added the following entries in the ``credentials.yml`` file of my kedro project:
    ```yaml
    mlflow_creds:
      MLFLOW_S3_ENDPOINT_URL: 'http://192.168.0.150:9000'
      AWS_ACCESS_KEY_ID: 'minioadmin'
      AWS_SECRET_ACCESS_KEY: 'minioadmin'
    ```
    These are the values for the mlflow.yml file:
    ```yaml
    server:
      mlflow_tracking_uri: 'postgresql://postgres:postgres@localhost:5432/mlflow_db'
      stores_environment_variables: {}
      credentials: mlflow_creds
    ```
    What I want to do is use postgres to track metrics and manage registered models, and use minio's ``mlflow`` bucket to save the artifacts. The above configuration saves all the runs locally inside the ./mlruns folder; I want it to point to the s3://mlflow bucket. Please help