# beginners-need-help

jcasanuevam

11/11/2021, 1:20 PM
Hello! I hope you can help me out with a doubt about the mlflow tracking server and how to set everything up in the mlflow.yml file of the kedro project. I have a database backend store on an external server to track metrics, etc., and an SFTP server as artifact store for storing models on the same external server. In the kedro-mlflow documentation I've seen that I have to define the mlflow_tracking_uri variable, but I'm not sure if I must write the sftp://user@host/path/to/directory of the artifact store or the `<dialect>+<driver>://<username>:<password>@<host>:<port>/<database>` of the database backend store. In my case I want to use both solutions, and mlflow.yml only gives us one possible input. How can I set up both the backend and artifact stores? Thanks!

datajoely

11/11/2021, 1:21 PM
Is this using `kedro-mlflow`?

jcasanuevam

11/11/2021, 1:22 PM
yep

datajoely

11/11/2021, 1:22 PM
So that's actually a third-party plugin not developed by the core Kedro team - @User are you able to help?

Galileo-Galilei

11/11/2021, 1:27 PM
Thanks for tagging @User. Actually I'm pretty sure there was an issue on this, let me check

datajoely

11/11/2021, 1:28 PM
Thanks for picking this up - I've also created a #908346260224872480 channel if helpful

Galileo-Galilei

11/11/2021, 1:37 PM
I think this looks like your problem (even if it is about storing artifacts on S3): https://github.com/Galileo-Galilei/kedro-mlflow/issues/15#issuecomment-653257558. The key idea is that you set up your mlflow server with the command defined here: https://mlflow.org/docs/latest/tracking.html#mlflow-tracking-servers, and then you communicate with mlflow only through the tracking uri (the ``MlflowClient`` will manage whether data should be logged as artifacts or in the backend store). As a consequence, you only need to set the ``mlflow_tracking_uri`` in your kedro project (but it assumes that you have configured your server store) and:
- specify the tracking uri in `mlflow.yml`:
```yaml
mlflow_tracking_uri: <dialect>+<driver>://<host>:<port>/<database>
credentials: mlflow_credentials
```
- create a credentials entry like:
```yaml
# credentials.yml
mlflow_credentials:
  MLFLOW_TRACKING_USERNAME: <username>
  MLFLOW_TRACKING_PASSWORD: <password>
```
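For concreteness, here is a minimal sketch of the two sides of that setup; the host names, ports, and file paths below are illustrative placeholders, not values from this thread:

```python
# Server side (run once on the external server) -- both stores are
# configured here, not in the kedro project; the URIs are placeholders:
#
#   mlflow server \
#     --backend-store-uri postgresql://user:password@localhost:5432/mlflow \
#     --default-artifact-root sftp://user@host/path/to/artifacts \
#     --host 0.0.0.0 --port 5000
#
# Client side: only the tracking uri is needed. mlflow routes params and
# metrics to the backend store and files to the artifact store by itself.
import mlflow

mlflow.set_tracking_uri("http://my-mlflow-host:5000")  # placeholder host

with mlflow.start_run():
    mlflow.log_metric("accuracy", 0.93)  # -> database backend store
    mlflow.log_artifact("model.pkl")     # -> SFTP artifact root (file assumed to exist)
```

Note that with `kedro-mlflow` you would not call `set_tracking_uri` yourself; the plugin does that from `mlflow.yml`, so the snippet only illustrates where each kind of data ends up.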

jcasanuevam

11/11/2021, 4:22 PM
Thanks! I will give it a try. Do you have/know any recommended Dockerfile or Docker Compose file for setting up the mlflow server on my external server? (at least a good starting point for my needs)
BTW, can I use any kedro-mlflow plugin version on my local machine and any mlflow package version on the external server? I mean, is there any restriction between the versions on the local machine and on the server side?

Galileo-Galilei

11/11/2021, 9:45 PM
Regarding the Dockerfile, I can't really help. I don't usually bother to create an mlflow server locally on my personal projects, and at work we use a very custom one.
Regarding your 2nd question, it could theoretically be an issue if the database schema for tracking differs between versions, but in practice I've seen dozens of people and projects running kedro-mlflow locally with almost all versions of mlflow (from 0.8.0 to 1.21.0) over the past two years, then switching in production to our mlflow server (which has an old mlflow version, but above 1.0.0, likely something like 1.3.0) just by changing the ``mlflow_tracking_uri`` in their project, and we never experienced any issue as far as I know. I strongly recommend using ``mlflow>=1.0.0`` for your production server though.
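As a quick sanity check before pointing a project at an older server (not something from the thread, just an illustration), you can print the client-side versions you are about to pair with it:

```python
# Hypothetical sanity check: record the local versions before switching
# the tracking uri to a production server running an older mlflow.
import kedro
import mlflow

print("mlflow client:", mlflow.__version__)
print("kedro:", kedro.__version__)
```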

martinlarsalbert

12/13/2021, 6:20 PM
Does anyone have experience with mlflow and modular pipelines in kedro? I want each modular pipeline to be an mlflow run. My problem is that the modular pipelines will exist simultaneously, which means that two mlflow runs must also exist simultaneously...

datajoely

12/13/2021, 6:22 PM
I'll defer to people with more experience on this - but one of the reasons we have started our own first-party version of experiment tracking is that it's hard to rationalise what a 'run' is.

martinlarsalbert

12/14/2021, 8:55 AM
I chose to have only one run for the pipeline that consists of many modular pipelines, and to maintain the namespaces when logging artifacts, parameters and names, so that I have many models as artifacts: <namespace>_model.pkl etc.
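A minimal sketch of that single-run, namespaced-names approach; the namespace and parameter names below are made up for illustration:

```python
import pickle

import mlflow

# One trained object per modular pipeline, keyed by namespace (illustrative).
models = {"training_a": {"alpha": 0.1}, "training_b": {"alpha": 0.2}}

with mlflow.start_run():
    for namespace, params in models.items():
        # Prefix parameter keys with the namespace so cloned pipelines
        # don't collide inside the single run.
        for key, value in params.items():
            mlflow.log_param(f"{namespace}.{key}", value)
        # One artifact per namespace, e.g. training_a_model.pkl.
        path = f"{namespace}_model.pkl"
        with open(path, "wb") as f:
            pickle.dump(params, f)
        mlflow.log_artifact(path)
```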

datajoely

12/14/2021, 10:06 AM
@User If you customise your runs to run `kedro run --pipeline={name}` and limit your run to just one registered pipeline, how does that affect the output?

martinlarsalbert

12/14/2021, 12:11 PM
That works fine. It is only when I run many joined modular (cloned) pipelines with namespaces that parameter names, artifact names etc. collide in mlflow, but giving them namespaced names is one solution. I tried to create an mlflow run each time the first node in the modular pipeline was called (with a `before_node_run` hook), but since the modular pipelines are not necessarily run one by one, more than one mlflow run needed to be active at the same time. I tried to solve it with nested mlflow runs, but I never got it to work.

Galileo-Galilei

12/14/2021, 10:09 PM
This is an interesting use case and I'd like to support this in kedro-mlflow! To solve your use case, you must use the mlflow client to log whatever you want to a specific run instead of the active run. You need to track the run ids after starting the runs in order to specify which run you want to log to later, but it is quite technical. If you want to give it a try and create a prototype, I'd be happy to help you make it work.
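A rough sketch of that idea, assuming hypothetical pipeline names and the default experiment; the client logs to explicit run ids rather than to the single active run, so execution order no longer matters:

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()
experiment_id = "0"  # default experiment, assumed for illustration

# Start one run per modular pipeline up front and remember the run ids.
run_ids = {
    name: client.create_run(experiment_id).info.run_id
    for name in ("pipeline_a", "pipeline_b")
}

# Later (e.g. in a before_node_run hook) route each log call to the run
# that belongs to the node's namespace, regardless of execution order.
client.log_param(run_ids["pipeline_a"], "alpha", 0.1)
client.log_metric(run_ids["pipeline_b"], "accuracy", 0.9)

# Close every run once its pipeline has finished.
for run_id in run_ids.values():
    client.set_terminated(run_id)
```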