# plugins-integrations

Galileo-Galilei

02/14/2022, 9:58 PM
Hello @User, this should work with your current configuration; it is very similar to what we use at work. We should figure out whether the problem comes from ``mlflow`` or from ``kedro-mlflow``.
```python
# First test with plain mlflow
# temp.py

import os
import mlflow

os.environ["MLFLOW_S3_ENDPOINT_URL"]='http://192.168.0.150:9000'
os.environ["AWS_ACCESS_KEY_ID"]='minioadmin'
os.environ["AWS_SECRET_ACCESS_KEY"]='minioadmin'

mlflow.set_tracking_uri('postgresql://postgres:postgres@localhost:5432/mlflow_db')

with mlflow.start_run():
    mlflow.log_param("a",1)
```
Then open the UI and check whether the results are stored where you want them. If not, your mlflow configuration is incorrect: check your server settings / port / password. If it works as expected, restart your kernel and try the following script:
```python
# Second test: kedro-mlflow for configuration setting and plain mlflow for logging

import mlflow

from kedro.framework.session import KedroSession
from kedro_mlflow.config import get_mlflow_config

with KedroSession.create() as session:
    mlflow_config=get_mlflow_config()
    mlflow_config.setup()
  
    with mlflow.start_run():
        mlflow.log_param("a",1)
```
This sets the environment variables and the uri through ``kedro-mlflow``, then logs with plain mlflow. Does it log where you want?

Dhaval

02/15/2022, 8:00 AM
@User It doesn't log to the s3 bucket because there is no way to set the value of the ``artifact_location`` variable. This value can only be set when the experiment is created with ``mlflow.create_experiment()``. As far as I can see in the kedro-mlflow source code, you use ``set_experiment``, which cannot set ``artifact_location``.
This particular piece of code works as expected:
```python
import os
import mlflow

os.environ["MLFLOW_S3_ENDPOINT_URL"]='http://192.168.0.150:9000'
os.environ["AWS_ACCESS_KEY_ID"]='minioadmin'
os.environ["AWS_SECRET_ACCESS_KEY"]='minioadmin'

mlflow.set_tracking_uri('postgresql://postgres:postgres@localhost:5432/mlflow_db')
mlflow.create_experiment(name='new_flow_test', artifact_location='s3://mlflow')  # artifact_location can only be set at creation time
mlflow.set_experiment('new_flow_test')
with mlflow.start_run():
    print(mlflow.get_artifact_uri())
    mlflow.log_param("a",1)
    mlflow.log_artifact('data/01_raw/movie.csv')
```
I want to emulate this with kedro-mlflow but I don't see a way to achieve it. Is there anything that can be done @Galileo-Galilei ?

Galileo-Galilei

02/16/2022, 8:29 PM
I am just writing an answer 😉

Dhaval

02/16/2022, 8:32 PM
You're a lifesaver, dude. I'm sorry to bother you so frequently, but you have made something amazing and I want it to be adopted at the company I'm currently working for. So thanks 😁

Galileo-Galilei

02/16/2022, 8:57 PM
Hi @User, there are two distinct parts to the problem:
1. On the bad side, as you mention, ``kedro-mlflow`` does not let you pass extra arguments to the ``create_experiment`` function, so you cannot specify the artifact location. I'll add it to the backlog and update it soon (within a few weeks rather than immediately). Feel free to open a PR if you need it sooner, I'd be happy to guide you. The key components to update are ``kedro_mlflow.config.kedro_mlflow_config.KedroMlflowConfig._set_experiment`` and ``kedro_mlflow.template.project.mlflow.yml``. It should be quite straightforward, but there are a few edge cases to deal with (e.g. an experiment created, then deleted, then restored).
2. On the good side, your configuration is quite unusual (I don't think I have ever seen anyone do this before) and it can likely be tweaked into something that works. Let me elaborate:
   - You specify an ``mlflow_tracking_uri`` with a ``postgresql://...`` scheme.
   - mlflow understands this as the database that tracks all entities except artifacts. Since the artifact location is not set, it falls back to its default ``mlruns`` folder when you try to log. The following issue is different from your need, but the picture should explain quite clearly what is going on: https://github.com/Galileo-Galilei/kedro-mlflow/issues/77. On the picture, your ``postgresql`` uri corresponds to the backend store, and mlflow defaults the artifact store to the ``mlruns`` folder because there is no server in between.
   - The usual configuration is to set up an mlflow server, expose it (possibly locally, as you are doing), and set the uri to the server address, ``http://<host>:<port>``, in the ``mlflow.yml``. Everything should then work as expected; see the sketch below.
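For illustration, a minimal sketch of that usual setup (all hosts, ports, bucket names and credentials below are placeholders, adapt them to your environment): launch a server with ``mlflow server --backend-store-uri postgresql://<user>:<password>@<host>:5432/mlflow_db --default-artifact-root s3://<bucket> --host 0.0.0.0 --port 5000``, then point ``mlflow.yml`` at that server rather than at the database:
```yaml
# mlflow.yml (sketch, placeholder values)
server:
  # the tracking uri is the mlflow server itself, NOT the postgresql backend store
  mlflow_tracking_uri: http://localhost:5000
  # optional: points to a credentials.yml entry holding MLFLOW_S3_ENDPOINT_URL / AWS keys
  credentials: mlflow_credentials
```
The server is the component that knows about both the backend store (postgresql) and the artifact store (s3), so artifacts land in the bucket instead of the local ``mlruns`` folder.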
P.S.: thanks for the kind words, and glad to hear you want to get it adopted in your company. Just out of curiosity, do you use ``pipeline_ml_factory`` to serve kedro pipelines? I feel that it is quite hard to understand / use, but in my experience, once people have tried it they do not want to give it up afterwards :p I'd be glad to have feedback about it. For the record, I don't know if you have seen it, but I have a repo (quite unmaintained for now - I want to do a lot of things, but I do it in my spare time and I am not paid for it, so I cannot be as responsive as kedro's team members nor develop as fast as they can) with some examples, and feedback is always appreciated: https://github.com/Galileo-Galilei/kedro-mlflow-tutorial/stargazers

IS_0102

02/17/2022, 5:08 PM
Hi everyone! I'm using this thread since I'm interested in storing artifacts to s3, too. Reading the above, I would say that for now there is no way to set the artifact_location, so artifacts will always be stored in the default mlflow folder in the project root. Is this correct, or am I missing something? The only thing that can be changed is the backend store server, which is specified in the pipeline_ml_factory entry in the mlflow.yml. I am asking because I will need to run some experiments from a Databricks notebook (I'll simply do the kedro run from there, because dbconnect does not work for our specific needs). To do that I'll have to clone the repo on Databricks, and it will be deleted every time the cluster is restarted, so the local mlruns folder would vanish and we would lose the artifacts. So, unless kedro-mlflow allows specifying a different folder in which to store artifacts, I think we might need to use the native mlflow integration with Databricks.

Galileo-Galilei

02/17/2022, 6:35 PM
Actually everything already works, unless you have a very specific configuration (and one likely not suitable for a production environment) like the one above. Just use the ``http://...`` uri of your server, or ``databricks``, or ``databricks://PROFILE``, and all your artifacts will be logged to s3. For the record, my company has been using ``kedro-mlflow`` with S3 storage for almost 2 years.
@IS_0102 one small precision: as you can see in the issues above and in the documentation, you should use a LOCAL path in the catalog, and mlflow will automatically send the files to S3.
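For example, a sketch of such a catalog entry (the dataset name and filepath are placeholders, and I assume a pandas CSV here): the filepath stays local, and the wrapper uploads the file to the artifact store when the run is tracked:
```yaml
# catalog.yml (sketch, placeholder names)
my_report_dataset:
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataSet
  data_set:
    type: pandas.CSVDataSet
    filepath: data/08_reporting/my_report.csv  # local path; mlflow pushes it to the artifact store (e.g. S3)
```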

Dhaval

02/17/2022, 7:16 PM
@User Can the mlflow artifact dataset take care of autologged data from models? Because if not, they have to be declared every single time we use them.
@User Thanks for your help. Based on your instructions, I just went ahead and created a new docker image with MLflow, a MySQL db and S3 storage for the backend store and artifact store. It is working seamlessly with Prefect for scheduling inference runs. I tried looking at pipelineML but couldn't quite get the gist of it. I'll sit down and understand it properly when I get time. Thank you very much.

Galileo-Galilei

02/17/2022, 11:06 PM
Sure it can, provided you know their path in the mlflow artifacts folder, but it does not look like good practice. I think it is better to return the model at the end of a node and specify a ``MlflowModelSaver/LoggerDataSet`` entry in the catalog. This is much more declarative and makes the model very easy to reuse (though it is a bit more verbose, since you have to write the catalog entry); see the sketch below. Even better, ``pipeline_ml_factory`` will package your inference pipeline along with all its required artifacts (including the model) and make predicting on new data very easy.
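To make that concrete, a sketch of such a catalog entry (the dataset name is a placeholder, and I assume a scikit-learn model here; other mlflow flavors work the same way):
```yaml
# catalog.yml (sketch, placeholder name)
trained_model:
  type: kedro_mlflow.io.models.MlflowModelLoggerDataSet
  flavor: mlflow.sklearn  # the model returned by the node is logged to the active mlflow run
```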

Dhaval

02/18/2022, 7:25 AM
I'm definitely going to have a look at pipeline ml factory. I've just installed Prefect for scheduling runs, so now I can experiment with your suggested approach. As always, thanks a lot for your help @Galileo-Galilei 😁

lbonini

02/21/2022, 8:09 PM
Hi guys, I would like to take advantage of this thread and ask about kedro-mlflow.
I'm trying to use kedro-mlflow in this scenario (an mlflow tracking server with proxied artifact access); therefore, the only environment variable that the client needs is the tracking uri.
But for some reason the plugin seems to force me to provide these variables (AWS_SECRET... ). I can't configure it in a way that uses the proxy.

Dhaval

02/21/2022, 8:34 PM
As far as I know, you need the environment variables to write artifacts to s3. One good docker image that helped me out was this one: https://github.com/Toumash/mlflow-docker. It might be of some help to you.

lbonini

02/21/2022, 8:37 PM
I managed to run without needing any other variables besides the tracking uri, using the scenario above. But I used pure mlflow with some demo code; I was not using the kedro-mlflow plugin.
```sh
mlflow server --serve-artifacts --artifacts-destination s3://$${AWS_BUCKET}/artifacts --backend-store-uri sqlite:///$${FILE_DIR}/sqlite.db --host 0.0.0.0 --port $${PORT}
```

Galileo-Galilei

02/22/2022, 7:50 AM
The plugin does not force you to do anything, actually it's quite the opposite 😅 If you specify the "host:port" in the ``server: mlflow_tracking_uri`` of your mlflow.yml (see the sketch below), what happens?
If nothing happens inside kedro but mlflow works outside of kedro, can you check whether you have a ".env" or ".aws" config file somewhere on your computer which exports these variables automatically?
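For reference, in the proxied-artifact scenario the configuration I mean would be roughly this (host and port are placeholders for your remote server):
```yaml
# mlflow.yml (sketch, placeholder host/port)
server:
  # with `mlflow server --serve-artifacts ...`, the client only needs the tracking uri;
  # artifacts go through the server, so no AWS variables should be required on the kedro side
  mlflow_tracking_uri: http://<mlflow-server-host>:<port>
```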

lbonini

02/22/2022, 1:39 PM
hello @User! Thanks for your response. Let me summarize:
- The mlflow server is outside of the kedro environment (remote server).
- I'm already setting the mlflow_tracking_uri with ip and port.
- This tracking_uri is the only config I need to provide, since the server is using the proxy scenario. I suppose I don't need a .aws or .env file to export AWS_SECRET_KEY or anything else needed for artifact storage...
- If I run a demo experiment using only mlflow and the tracking server, it works like a charm. I can export artifacts too, without providing any extra variables.
- Using kedro-mlflow raises an error because it can't find AWS_SECRET_KEY, as you mentioned.
Update: for some reason I can now run properly... I restarted the mlflow server, deleting the tracking database as well (clean install), and now I'm able to use kedro-mlflow with the scenario I sent yesterday. Anyway, sorry for bothering you, and thanks for your attention...

IS_0102

03/01/2022, 8:03 PM
Hi everyone! Sorry @User to bring up this thread again, but I'm having some issues with storing artifacts on s3. I am using kedro-mlflow with postgres as the ``mlflow_tracking_uri``. I have followed this link (https://kedro-mlflow.readthedocs.io/en/stable/source/04_experimentation_tracking/03_version_datasets.html) to set up the versioning of a csv dataset I am interested in, but even though the node creating it runs smoothly (and the csv is saved at the local path specified), it is not automatically uploaded to the s3 bucket I specified in ``MLFLOW_S3_ENDPOINT_URL`` in mlflow.yml. For debugging I printed the tracking_uri and artifact_uri: while the former is correctly set to postgres, the latter reports the default local mlruns folder. Do you know what could be causing this issue?
```python
import logging
from pathlib import Path

import mlflow

# snippet from inside a node: `save_logs` and `root_path` are defined in the surrounding function
run_id = mlflow.active_run().info.run_id
run_id_str = "Mlflow run id: {}".format(run_id)
if save_logs:
    with open(f"{root_path}/logs/info.log", "r") as log_info_file:
        log_lines = log_info_file.read().split(
            "root - INFO - ** Kedro project pmpx-patient-embedding"
        )[-1]
        mlflow.log_text(run_id_str + log_lines, "log_last_run.txt")

    mlflow.log_artifact(f"{root_path}/logs/info.log")

    error_file_path = f"{root_path}/logs/errors.log"
    if Path(error_file_path).is_file():
        mlflow.log_artifact(error_file_path)

tags = {
    "mlflow_run_id": run_id,
}
mlflow.set_tags(tags)
logging.info(run_id_str)
logging.info("Mlflow artifact uri: {}".format(mlflow.get_artifact_uri()))
logging.info("Mlflow tracking uri: {}".format(mlflow.get_tracking_uri()))
```
This is the code I used to print the ``tracking_uri`` and ``artifact_uri``.
This is how I set up the mlflow.yml
```yaml
server:
  mlflow_tracking_uri: "postgresql://{username}:{password}@{ipaddress}:{port}/{dbname}"
  stores_environment_variables: {
    MLFLOW_S3_ENDPOINT_URL: "s3://{s3_bucket_name}"
  }
  credentials: mlflow_credentials
```

Galileo-Galilei

03/02/2022, 7:39 AM
Try to create an mlflow server as described above rather than passing the postgres URI; you have the exact same problem as Dhaval above.

IS_0102

03/02/2022, 10:53 AM
Hi @User, thanks a lot! Do you mean substituting ``MLFLOW_S3_ENDPOINT_URL: "s3://{s3_bucket_name}"`` with ``MLFLOW_S3_ENDPOINT_URL: "https://s3://{s3_bucket_name}"``? I tried this but I get the same result.

Galileo-Galilei

03/02/2022, 11:09 AM
No, you have to create an mlflow server (see the command above) and set the http url of this server as the tracking uri.

IS_0102

03/02/2022, 11:22 AM
Sorry @User, is this the command you are referring to?
```sh
mlflow server --backend-store-uri postgresql://{} --default-artifact-root s3://{}
```

Galileo-Galilei

03/02/2022, 11:36 AM
Yes

IS_0102

03/02/2022, 11:56 AM
Got it, thanks! I see this is taking quite a long time to run (more than 30 minutes and it has not finished yet), is this normal? If so, do I have to do it every time I run an experiment, or is it just a one-off? In the mlflow.yml, I still have to define the ``mlflow_tracking_uri`` and ``MLFLOW_S3_ENDPOINT_URL`` as in the ``mlflow server`` command I run, correct?

Dhaval

03/02/2022, 1:36 PM
@IS_0102 Use this repo to create a docker image: https://github.com/Toumash/mlflow-docker. As far as I know, you need the environment variables to write artifacts to s3, and that image might be of some help to you. Use the credentials.yml file to set the mlflow s3 endpoint URL, aws access key and secret key, and just change the tracking uri to localhost:5000.
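In kedro-mlflow terms, that wiring would look roughly like this (a sketch; endpoint, keys and ports are placeholders for your own setup):
```yaml
# conf/local/credentials.yml (sketch, placeholder values)
mlflow_credentials:
  MLFLOW_S3_ENDPOINT_URL: http://<your-s3-or-minio-endpoint>:9000
  AWS_ACCESS_KEY_ID: <access_key>
  AWS_SECRET_ACCESS_KEY: <secret_key>

# mlflow.yml (sketch)
server:
  mlflow_tracking_uri: http://localhost:5000   # the mlflow server, not the backend database
  credentials: mlflow_credentials              # exposed to mlflow as environment variables by the plugin
```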

Galileo-Galilei

03/03/2022, 7:54 AM
@IS_0102 Setting the server up should be really fast; maybe you have some connection issues with the S3 storage and hit a timeout? Setting up the server should be done only once: as long as it is up, you can just forget about it and everything will be logged properly. As Dhaval said, the mlflow tracking uri is NOT the postgresql uri you use in the mlflow server command, but the URL of the server itself (if launched locally, likely localhost:5000). You still need the MLFLOW_S3_ENDPOINT_URL variable.