plugins-integrations
  • g

    Galileo-Galilei

    02/14/2022, 9:58 PM
    Hello @User, this should work with your current configuration; it is very similar to what we use at work. We should figure out whether the problem comes from ``mlflow`` or from ``kedro-mlflow``.
    python
    # First test with plain mlflow
    # temp.py
    
    import os
    import mlflow
    
    os.environ["MLFLOW_S3_ENDPOINT_URL"] = 'http://192.168.0.150:9000'
    os.environ["AWS_ACCESS_KEY_ID"] = 'minioadmin'
    os.environ["AWS_SECRET_ACCESS_KEY"] = 'minioadmin'
    
    mlflow.set_tracking_uri('postgresql://postgres:postgres@localhost:5432/mlflow_db')
    
    with mlflow.start_run():
        mlflow.log_param("a", 1)
    Then open the UI and check whether the results are stored where you want. If not, your mlflow configuration is incorrect: check your server settings / port / password. If it works as expected, restart your kernel and try the following script:
    python
    # Second test: kedro-mlflow for configuration setting and plain mlflow for logging
    
    import mlflow
    
    from kedro.framework.session import KedroSession
    from kedro_mlflow.config import get_mlflow_config
    
    with KedroSession.create() as session:
        mlflow_config = get_mlflow_config()
        mlflow_config.setup()
    
        with mlflow.start_run():
            mlflow.log_param("a", 1)
    This sets the environment variables and the URI through ``kedro-mlflow`` and then logs with plain mlflow. Does it log where you want?
  • g

    Galileo-Galilei

    02/14/2022, 10:06 PM
    @User 1. You can try @User's suggestion, which is a really good idea! 2. You do not need a node for each metric; a node can return a tuple, e.g.:
    python
    # in node.py
    def compute_metrics(y_true, y_preds):
        # <compute the metrics here>
        return metric1, metric2, [metric3_step0, metric3_step1, metric3_step2, metric3_step3]
    
    # in pipeline_registry.py
    from kedro.pipeline import Pipeline, node
    
    Pipeline([
        ...,
        node(compute_metrics, ["y_true", "y_preds"], ["my_metric1", "my_metric2", "my_metric3"]),
        ...,
    ])
    
    # in catalog.yml
    my_metric1:
        type: kedro_mlflow.io.metrics.MlflowMetricDataSet
    
    my_metric2:
        type: kedro_mlflow.io.metrics.MlflowMetricDataSet
    
    my_metric3:
        type: kedro_mlflow.io.metrics.MlflowMetricHistoryDataSet
    This is easier than returning the complex format above. 3. Unfortunately this is not linked to ``kedro-mlflow``; I don't modify the UI. I know that mlflow often changes its UI and it varies slightly across versions. Feel free to [open an issue on my repo](https://github.com/Galileo-Galilei/kedro-mlflow/issues), and I'll add it to my backlog to log it as a tag in the future so we have consistent access across mlflow versions.
  • d

    Dhaval

    02/15/2022, 7:58 AM
    MLFlow_s3 issue
  • d

    DarthGreedius

    02/16/2022, 10:03 PM
    Hello everyone
  • d

    DarthGreedius

    02/16/2022, 10:04 PM
    Has anyone by chance created a custom dataset to wrap around the Azure ML datasets (https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.dataset(class)?view=azure-ml-py)?
  • d

    DarthGreedius

    02/16/2022, 10:04 PM
    I've checked the docs and googled around some but no luck...
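    Not an official answer, but a minimal sketch of what such a wrapper could look like, assuming a registered tabular dataset and the ``azureml-core`` SDK; the class name and the read-only behaviour are purely illustrative:
    python
    # Hypothetical sketch of a read-only Kedro dataset wrapping a registered
    # Azure ML tabular dataset.
    from typing import Any, Dict
    
    import pandas as pd
    from azureml.core import Dataset, Workspace
    from kedro.io import AbstractDataSet
    
    
    class AzureMLTabularDataSet(AbstractDataSet):
        def __init__(self, dataset_name: str, version: str = "latest"):
            self._dataset_name = dataset_name
            self._version = version
    
        def _load(self) -> pd.DataFrame:
            # Assumes the workspace details are available via a local config.json
            workspace = Workspace.from_config()
            dataset = Dataset.get_by_name(
                workspace, name=self._dataset_name, version=self._version
            )
            return dataset.to_pandas_dataframe()
    
        def _save(self, data: pd.DataFrame) -> None:
            raise NotImplementedError("This sketch is read-only.")
    
        def _describe(self) -> Dict[str, Any]:
            return dict(dataset_name=self._dataset_name, version=self._version)
    It could then be referenced from ``catalog.yml`` with ``type: <your_package>.datasets.AzureMLTabularDataSet`` plus a ``dataset_name`` argument.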
  • i

    IS_0102

    02/17/2022, 4:56 PM
    Great! Thanks a lot @User and @User , super useful! I'll try this!
  • a

    Alexandros Tsakpinis

    03/30/2022, 12:17 PM
    Hello Kedro community 🙂 Have any of you integrated a data versioning tool like DVC in a Kedro application? I wanted to try out a more sophisticated data versioning method than the one built into Kedro. Can anyone help with this and share their experience? Thanks a lot for your help!
  • d

    datajoely

    03/30/2022, 12:21 PM
    There is an ongoing project which @User is leading https://github.com/FactFiber/kedro-dvc
  • d

    datajoely

    03/30/2022, 12:22 PM
    The other thing to consider is kedro-dolt https://www.dolthub.com/blog/2021-06-16-kedro-dolt-plugin/
  • s

    shaunc

    03/30/2022, 1:11 PM
    Thanks for the call-out @User -- we are not at a usable state yet, though, @User. It should be by the end of May.
  • n

    noklam

    04/21/2022, 11:08 AM
    Adding a new after_context_created hook into Kedro's hooks collection
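    For reference, a minimal sketch of what a hook using ``after_context_created`` could look like once the hook is available (the class name and the printed attribute are purely illustrative):
    python
    # Illustrative only: a project hook that reacts to the Kedro context being created.
    from kedro.framework.hooks import hook_impl
    
    
    class ContextInspectionHooks:
        @hook_impl
        def after_context_created(self, context) -> None:
            # The freshly created KedroContext exposes e.g. the project path
            print(f"Kedro context created for project at {context.project_path}")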
  • d

    Downforu

    04/22/2022, 12:50 PM
    Hello everyone! I'm trying to use kedro-airflow with Dockerfile and docker-compose files to execute a very simple Kedro node, but I'm getting the following error: "Task exited with return code Negsignal.SIGKILL". Has anyone encountered this before? I'm not sure, but it seems that in my case the scheduler is not working well with the KedroOperator, because I have another very simple DAG file where I use the standard PythonOperator and it works well. You can find the project on my GitHub here: https://github.com/Downfor-u/kedro-airflow-test Here are the commands I execute: 1/ In a terminal:
    docker-compose up postgres
    2/ In another terminal:
    docker-compose up init_db
    3/ In the previous terminal:
    docker-compose up scheduler webserver
    Thanks a lot for your help!
  • n

    noklam

    04/22/2022, 2:50 PM
    Hi @Downforu, I suspect it is the same problem as here. I tried running local Airflow before but I couldn't reproduce the issue; it only fails with certain Docker images, so there may be something funky happening: https://github.com/kedro-org/kedro-plugins/issues/13 The current workaround is updating ``logging.yml`` with the setting ``"disable_existing_loggers": True``. This is an issue we are keen to fix, so please share if you have any findings! Thank you for the very detailed report!
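    For reference, a minimal sketch of where that key sits in ``conf/base/logging.yml``; everything else in the file stays as your project already has it (the two lines follow the standard Python ``dictConfig`` schema):
    yaml
    # conf/base/logging.yml (only the relevant keys shown)
    version: 1
    disable_existing_loggers: True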
  • d

    Downforu

    04/22/2022, 3:24 PM
    Hi @noklam, I just tested it and it works now!!! Thank you very much! I'm gonna do the same in my real ML pipeline and will let you know how it goes.
  • n

    noklam

    04/22/2022, 3:43 PM
    Glad it works!
  • d

    Downforu

    04/27/2022, 12:03 PM
    Hi, I'm coming back to you regarding the execution of my "real" pipeline, which didn't work with Airflow. I think I figured out where the problem comes from: it seems that hook implementations are not taken into account when the pipeline is launched in Airflow. Are you aware of that? The problem is that I use hooks not only for tracking experiments with MLflow and the Azure integration, but also to initialize and update the parameters.yml file. I'm doing forecasting and need parameters.yml to be updated easily based on a specific parameter also defined in hooks.py. The nodes are not running at all in Airflow, mainly because the parameters.yml file is not initialized, but it is also problematic that the MLFlowTrackingClass() I created in the hooks.py file does not execute. Here's an example of the UpdateParametersFile class defined in the hooks.py file.
    from typing import Any, Dict
    
    import yaml
    from kedro.config import ConfigLoader
    from kedro.framework.hooks import hook_impl
    
    
    class UpdateParametersFile:
    
        @hook_impl
        def before_pipeline_run(self, run_params: Dict[str, Any]) -> None:
            conf_paths = ['conf/base', 'conf/local']
            conf_loader = ConfigLoader(conf_paths)
            config_params = conf_loader.get("parameters*", "parameters*/**")
            with open('conf/base/parameters.yml', 'w') as file:
                pass
            config_params["param1"] = dict(key1=10, key2=24)
            # Initialize key-value pairs
            config_params["param3"] = dict(key1='sum')
            with open('conf/base/parameters.yml', 'w') as file:
                yaml.dump(config_params, file)
    I also made the config_params available to the nodes that require "parameters" by adding the following to the ProjectHooks class:
    class ProjectHooks:
        @hook_impl
        def before_node_run(self, node: Node, inputs):
            conf_paths = ['conf/base', 'conf/local']
            conf_loader = ConfigLoader(conf_paths)
            config_params = conf_loader.get("parameters*", "parameters*/**")
            if node.name == "node_name":
                return {"parameters": config_params}
    Thank you in advance
  • n

    noklam

    04/27/2022, 12:50 PM
    Was the ``pass`` intended in the ``UpdateParametersFile``?
  • d

    Downforu

    04/27/2022, 1:26 PM
    Just forgot to remove it, we can get rid of it
  • d

    datajoely

    04/27/2022, 1:28 PM
    I also have a question: what are you actually trying to do here? The return will have no effect here.
  • d

    datajoely

    04/27/2022, 1:29 PM
    You also don't need to load the ``ConfigLoader`` like this; the parameters are already accessible in the ``catalog`` object.
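    To illustrate the point above, a minimal sketch (the hook class name is illustrative) of reading the parameters straight from the catalog inside a hook:
    python
    # Illustrative only: "parameters" is already a catalog entry, so there is no
    # need to rebuild a ConfigLoader inside the hook.
    from kedro.framework.hooks import hook_impl
    
    
    class ParameterInspectionHooks:
        @hook_impl
        def before_node_run(self, node, catalog):
            node_params = catalog.load("parameters")
            print(f"Running {node.name} with parameters: {node_params}")
    And a node that needs them can simply declare ``"parameters"`` or ``"params:<key>"`` as one of its inputs.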
  • d

    Downforu

    04/27/2022, 1:36 PM
    I noticed that when I generated the parameters.yml file before each pipeline run, the "parameters" were somehow not updated for my nodes. So that's why I added these lines to load the ConfigLoader.
  • d

    Downforu

    04/27/2022, 1:47 PM
    I tested a simpler pipeline with MLflow tracking only in the hooks.py file. It works "outside" of Airflow but not with Airflow. Tasks are executed in this case, but there is no recording of metrics. Are you aware of the Airflow integration not taking hooks.py into account during execution?
  • d

    datajoely

    04/27/2022, 1:58 PM
    So I think there are two issues here: (1) the parameters piece feels like an order-of-execution thing and should be simple to solve; (2) we're a little sceptical about the mlflow hook never firing, could you put in a breakpoint/logging to prove this?
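    For example, a minimal sketch of such a check (the class name is taken from the earlier message; the body is purely illustrative), which should show up in the Airflow task logs if the hook fires:
    python
    # Illustrative only: prove whether the hook fires under Airflow by logging from it.
    import logging
    
    from kedro.framework.hooks import hook_impl
    
    logger = logging.getLogger(__name__)
    
    
    class MLFlowTrackingClass:
        @hook_impl
        def before_pipeline_run(self, run_params):
            logger.info("before_pipeline_run fired, run_params keys: %s", list(run_params))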
  • d

    Downforu

    04/27/2022, 2:12 PM
    For point (2), unfortunately Airflow logging does not provide any info 😦 because the nodes execute normally in the simpler pipeline. The only difference from a "normal" ``kedro run`` is that I see experiments and metrics recorded in the Azure workspace, which is not the case when I trigger the DAG from the Airflow UI.
  • d

    datajoely

    04/27/2022, 2:14 PM
    Are you running the Airflow instance in a container? If so, are there any ports you need to allow-list?
  • d

    Downforu

    04/27/2022, 2:15 PM
    Yes, I use docker-compose and Dockerfiles to set up the Airflow services.
  • d

    Downforu

    04/27/2022, 2:17 PM
    I run that inside an Azure VM (a compute instance with SSH enabled), and then I SSH into it and create a tunnel to access the Airflow UI from my local computer.
  • d

    Downforu

    04/27/2022, 2:26 PM
    Ports are normally exposed in the docker-compose file (see below). The Airflow UI is accessible through SSH, and execution is also successful if triggered from there. Here's the content of my docker-compose file.
    services:
      postgres:
        image: postgres:13
        environment:
          - POSTGRES_USER=airflow
          - POSTGRES_PASSWORD=airflow
          - POSTGRES_DB=airflow
        ports:
          - "5434:5432"
      init_db:
        build:
          context: .
          dockerfile: Dockerfile
        command: bash -c "airflow db init && airflow db upgrade"
        env_file: .env
        depends_on:
          - postgres
      scheduler:
        build:
          context: .
          dockerfile: Dockerfile
        restart: on-failure
        command:  bash -c "airflow scheduler"
        env_file: .env
        depends_on:
          - postgres
        ports:
          - "8080:8793"
        volumes:
          - ./airflow_dags:/opt/airflow/dags
          - ./airflow_logs:/opt/airflow/logs
        healthcheck:
          test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"]
          interval: 30s
          timeout: 30s
          retries: 3
      webserver:
        build:
          context: .
          dockerfile: Dockerfile
        hostname: webserver
        restart: always
        env_file: .env
        depends_on:
          - postgres
        command: bash -c "airflow users create -r Admin -u admin -e admin@example.com -f admin -l user -p admin && airflow webserver"
        volumes:
          - ./airflow_dags:/opt/airflow/dags
          - ./airflow_logs:/opt/airflow/logs
        ports:
          - "5000:8080"
        healthcheck:
          test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"]
          interval: 30s
          timeout: 30s
          retries: 32
  • d

    Downforu

    05/02/2022, 10:47 AM
    Hello guys! I finally got it to work. The problem was on my end... In fact, as I was also tracking the git hash with MLflow, I had to add the git folder to the Docker container, which of course I didn't do... Anyway, execution works now! Just one question: when I run the pipeline locally without Airflow, I have the same run ID for each of the versioned datasets. However, when I run it with Airflow, it creates different run IDs, which makes it difficult to track and reproduce the outputs. Can you please help me get the same run ID with Airflow? You can reproduce this behaviour with this repo: https://github.com/Downfor-u/kedro-airflow-simple-dag Here are the commands I execute for the Airflow run: 1/
    kedro package
    2/
    docker-compose up postgres
    3/ Open another terminal:
    docker-compose up init_db
    4/ In the new terminal:
    docker-compose up scheduler webserver
    Thank you in advance!