beginners-need-help
  • a

    antheas

    08/15/2022, 2:37 PM
    A big problem I have with kedro is that there are no fixed value inputs for nodes. Consider the following node. It needs to know the name of the results that it's logging. This is fixed for the node that runs the function. For now, I have to use closure functions (such as
    mlflow_log_model_closure
    ) to pin the name for the version of the function that will run. What's the better solution for this?
    import mlflow
    import pandas as pd


    def mlflow_log_model_results(name: str, res: pd.DataFrame):
        # `name` identifies whose results are being logged; it is fixed per node,
        # not something the pipeline passes in at run time.
        if not mlflow.active_run():
            return

        ...


    def mlflow_log_model_closure(name: str):
        # Pin `name` so the returned function only takes the DataFrame input.
        def closure(res: pd.DataFrame):
            return mlflow_log_model_results(name, res)

        closure.__name__ = f"log_{name}_model_results"
        return closure
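    A minimal alternative sketch using plain functools.partial, which pins the name the same way the closure does (the dataset name "price" is purely hypothetical). Partial objects carry no __name__ by default, so either set one as below or give the Kedro node an explicit name.
    from functools import partial

    import mlflow
    import pandas as pd


    def mlflow_log_model_results(name: str, res: pd.DataFrame) -> None:
        if not mlflow.active_run():
            return
        ...  # log metrics/artifacts for `name` here


    # Pin the fixed name without a hand-written closure.
    log_price_model_results = partial(mlflow_log_model_results, "price")
    log_price_model_results.__name__ = "log_price_model_results"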
  • d

    datajoely

    08/15/2022, 3:05 PM
    Yes, I think lifecycle hooks have everything you need for this
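    A rough sketch of the hook approach, assuming Kedro 0.18-style hooks; the class name and logging logic are hypothetical, and the hook would be registered via HOOKS in settings.py.
    import mlflow
    import pandas as pd
    from kedro.framework.hooks import hook_impl


    class MLflowLoggingHooks:
        @hook_impl
        def after_node_run(self, node, outputs):
            # `outputs` maps dataset names to the values the node just produced,
            # so the "fixed" result name is available without closures.
            if not mlflow.active_run():
                return
            for name, value in outputs.items():
                if isinstance(value, pd.DataFrame):
                    ...  # log metrics/artifacts under `name`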
  • b

    brewski

    08/16/2022, 11:34 PM
    Was just wondering if there is any best practice for managing the complexity of a lot of datasets? I remember in my databases class we went over ORM modeling, but it was always in the context of creating entities such that their interactions would be represented sensibly in tables that mirror that object structure. In this context, though, I feel like we're all being given data tables without a pre-made ORM.
  • r

    roman

    08/17/2022, 8:05 AM
    Hi
  • r

    roman

    08/17/2022, 8:06 AM
    Hi, I am new to Kedro. I am trying to improve a machine learning workflow using Kedro. For our use case we are using Kedro with the YOLO algorithm for object detection. I implemented a pipeline in which the dataset (images and labels) is saved to an intermediate data directory within the Kedro project, and the path is hardcoded in the training node. This implementation results in a non-linear style of data flow, because there is no actual connection between the data node and the training node. I was wondering if there is a better way to handle this. The problem is that YOLO, like most object detection algorithms, needs only the path to the data rather than the actual data being loaded. Any help is appreciated, TIA.
  • d

    datajoely

    08/17/2022, 10:30 AM
    So the typical way to do this is to define a custom YOLO dataset which can be declared in the catalog and also removes the need for you to do the hard coding. We can help you define this, but if you're interested in contributing it back to kedro I'm sure the community would appreciate it!
  • r

    roman

    08/18/2022, 7:17 AM
    Okay, how should I proceed if I am to do this? YOLO uses images (JPG/PNG) and labels (.txt files) as its dataset.
  • d

    datajoely

    08/18/2022, 2:29 PM
    Hey @roman, we have a tutorial here: https://kedro.readthedocs.io/en/stable/extend_kedro/custom_datasets.html
  • d

    datajoely

    08/18/2022, 2:29 PM
    You could also have a go at subclassing the existing image dataset to get this working
  • d

    datajoely

    08/18/2022, 2:29 PM
    I'd also make sure you're aware of how the PartitionedDataSet wrapper works as that's usually important for this type of work
  • r

    roman

    08/18/2022, 5:01 PM
    But for YOLO we need to handle both image files and text files (labels). I think if we are to create a custom Kedro dataset for YOLO, it should handle both of these files, which might involve combining the existing image dataset and text dataset and using them with PartitionedDataSet. What do you think?
  • d

    datajoely

    08/18/2022, 5:05 PM
    I don't think there is anything wrong with expecting two file paths and essentially wrapping a version of ImageDataSet, TextDataSet and PartitionedDataSet all together!
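    A rough sketch of what such a combined dataset could look like, assuming a Kedro 0.18-style AbstractDataSet and Pillow; the class name and the image/label pairing convention are hypothetical. For a whole directory of samples this would typically be combined with PartitionedDataSet, as mentioned above.
    from pathlib import Path
    from typing import Any, Dict, Tuple

    from kedro.io import AbstractDataSet
    from PIL import Image


    class YOLOSampleDataSet(AbstractDataSet):
        """Loads/saves one image plus its YOLO label file as an (image, labels) pair."""

        def __init__(self, image_path: str, label_path: str):
            self._image_path = Path(image_path)
            self._label_path = Path(label_path)

        def _load(self) -> Tuple[Image.Image, str]:
            image = Image.open(self._image_path)
            labels = self._label_path.read_text()
            return image, labels

        def _save(self, data: Tuple[Image.Image, str]) -> None:
            image, labels = data
            image.save(self._image_path)
            self._label_path.write_text(labels)

        def _describe(self) -> Dict[str, Any]:
            return {"image_path": self._image_path, "label_path": self._label_path}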
  • w

    waylonwalker

    08/18/2022, 8:16 PM
    Anyone have experience creating vscode plugins? I have kedro-lsp updated to work with kedro 0.17.x and 0.18.x in a beta release. Currently all that is out is go-to-definition on datasets, but I have dataset completion close to working. I use neovim and have it working there, but I have spent a day trying to get it running in vscode with no luck. PyCharm, Sublime, Spyder, Emacs, or any IDE with LSP support would also be welcome.
  • i

    ithomp

    08/19/2022, 5:51 PM
    Hi everyone! I've been using Kedro for a couple of months but I've recently come across an issue that I'm hoping someone here can give me some advice on. I'm trying to use a runtime parameter (specified via
    kedro run --params ...
    ) with templated configuration of my data catalog so I can specify a project/site name to use as a prefix (subdirectory) on my file paths in the catalog. I'm able to achieve this functionality if I specify the parameter in my globals config, but it appears that runtime parameters provided through the CLI are not available to the TemplatedConfigLoader. My goal is to enable execution of the pipeline on different raw datasets while preserving the previous dataset's data directory and without requiring the user to edit the global config file. Is this possible or is there another way I should go about this? Any advice would be greatly appreciated 😀
  • p

    PetitLepton

    08/20/2022, 12:12 PM
    If you are OK with the parameters being environment variables, you can add
    os.environ
    in globals_dict to parametrize the paths, as in https://github.com/kedro-org/kedro/issues/403.
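    A minimal sketch of that idea for a 0.18-style settings.py, assuming the TemplatedConfigLoader; SITE_NAME is a hypothetical environment variable referenced as ${SITE_NAME} in the catalog.
    import os

    from kedro.config import TemplatedConfigLoader

    CONFIG_LOADER_CLASS = TemplatedConfigLoader
    CONFIG_LOADER_ARGS = {
        "globals_pattern": "*globals.yml",
        # Make every environment variable available for ${...} templating,
        # e.g. filepath: data/${SITE_NAME}/01_raw/images
        "globals_dict": dict(os.environ),
    }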
  • p

    PetitLepton

    08/21/2022, 1:53 PM
    Hi folks, in a previous post here there was a question about passing parameters to SQL queries. While randomly reading about dataset transcoding yesterday, I got an idea on how to use it for the problem at stake. The idea is to use a Jinja template for the SQL query — nothing new here — and use transcoding to impose the sequential run of the nodes: first fill the template with the parameters, and only then run the query, via a common file. The catalog would look like
    aggregates@query_template:
      type: text.TextDataSet
      filepath: data/01_raw/aggregates_query.sql
    
    aggregates@query_string:
      type: text.TextDataSet
      filepath: data/02_intermediate/filled_aggregates_query.sql
    
    aggregates@query:
      type: pandas.SQLQueryDataSet
      filepath: data/02_intermediate/filled_aggregates_query.sql
      credentials: aggregates_uri
    and the pipeline
    def create_pipeline(**kwargs) -> Pipeline:
        return Pipeline(
            [
                node(
                    parse_parameters,
                    inputs=[
                        "params:start_date",
                        "params:end_date",
                        "params:metric",
                    ],
                    outputs="query_parameters",
                ),
                node(
                    fill_template,
                    inputs=["aggregates@query_template", "query_parameters"],
                    outputs="aggregates@query_string",
                ),
                node(
                    perform_query,
                    inputs=["aggregates@query"],
                    outputs="results",
                ),
            ]
        )
    Transcoding ensures that the second node runs before the third node. I like using transcoding in this situation because it makes the link between nodes more transparent than using an extra output/input. Please let me know what you think about it.
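    A hypothetical sketch of the node functions this pipeline assumes (names taken from the pipeline above, Jinja2 assumed for templating):
    from typing import Dict

    import pandas as pd
    from jinja2 import Template


    def parse_parameters(start_date: str, end_date: str, metric: str) -> Dict[str, str]:
        # Collect the runtime parameters into one dict for templating.
        return {"start_date": start_date, "end_date": end_date, "metric": metric}


    def fill_template(query_template: str, query_parameters: Dict[str, str]) -> str:
        # Render the Jinja template; the result is saved as the transcoded
        # aggregates@query_string file.
        return Template(query_template).render(**query_parameters)


    def perform_query(result: pd.DataFrame) -> pd.DataFrame:
        # aggregates@query is a pandas.SQLQueryDataSet, so the filled SQL has
        # already been executed on load; this node just passes the DataFrame on.
        return result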
  • d

    datajoely

    08/21/2022, 6:10 PM
    We have configuration environments for this too
  • d

    datajoely

    08/21/2022, 6:12 PM
    So this is a neat solution - it will work. SQL execution, rather than storage, isn't going to be a perfect fit for kedro unless there is a DataFrame-like API.
  • a

    antheas

    08/21/2022, 7:55 PM
    the config loader takes in the extra parameters via the
    runtime_params
    property. You can extend the template config loader and just add your path via the
    self._config_mapping
    in
    __init__()
  • d

    datajoely

    08/21/2022, 7:56 PM
    On 0.18.x this is a little different; there are a few examples out there on GitHub.
  • i

    Isaac89

    08/21/2022, 8:06 PM
    Hi! Before kedro 0.18.0 I was achieving this with the register_config_loader hook, because all the params passed through the command line were available in extra_params. I would also be interested in knowing what the best practice is in the latest version. What I think may work is providing a custom subclass of the TemplatedConfigLoader that intercepts the extra parameters (runtime_params), updates the globals dict, and returns the templated config loader with the updated globals_dict.
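    A hypothetical sketch of that idea for a 0.18-style settings.py; the subclass name is made up, the constructor signature is assumed from the 0.18 config loader API, and _config_mapping is the attribute mentioned above.
    from kedro.config import TemplatedConfigLoader


    class RuntimeTemplatedConfigLoader(TemplatedConfigLoader):
        def __init__(self, conf_source, env=None, runtime_params=None, **kwargs):
            super().__init__(conf_source, env=env, runtime_params=runtime_params, **kwargs)
            # Merge CLI params (e.g. kedro run --params site_name:plant_a) into the
            # mapping used for ${...} templating of the catalog.
            self._config_mapping.update(runtime_params or {})


    CONFIG_LOADER_CLASS = RuntimeTemplatedConfigLoader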
  • n

    noestl

    08/22/2022, 4:26 PM
    Hi, when I try to install kedro 0.17.7, I get an error message indicating that the version does not exist. Is it still available via pip? Where do I find all the distributions?
  • d

    datajoely

    08/22/2022, 4:28 PM
    It may not work on your version of Python; 3.9 support came in 0.18.x.
  • n

    noestl

    08/22/2022, 5:25 PM
    Correct, many thanks @datajoely !
  • i

    ithomp

    08/22/2022, 7:47 PM
    Thanks @PetitLepton @antheas @datajoely @Isaac89, extending the TemplatedConfigLoader class and updating the globals_dict with runtime_params did the trick! 🙏
  • m

    mcapetz

    08/22/2022, 9:27 PM
    has anyone used kedro with flask?
  • m

    mcapetz

    08/22/2022, 9:27 PM
    can anyone help me pls?
  • l

    LawrenceS

    08/23/2022, 11:10 AM
    Hey everyone, I'm just wondering what people's approaches are to using Jupyter Notebooks within Kedro? I believe the general consensus is that notebooks in Kedro are still considered useful for things like data exploration and prototyping, so the aim is not to forgo using them. However, one of the driving forces for me moving to Kedro was the difficulty around version control with Jupyter Notebooks. Is there built-in functionality within Kedro for dealing with this? Or are people just excluding notebook files in .gitignore after converting notebooks into pipelines? Interested to get people's opinions and what is considered to be Kedro best practice regarding this! 🙂
  • d

    datajoely

    08/23/2022, 12:28 PM
    So one of the driving forces for us building kedro originally was to get people out of notebooks when it comes to building robust pipelines
  • d

    datajoely

    08/23/2022, 12:29 PM
    It's a difficult question - personally I commit notebooks in the notebooks directory provided in the template, but they're a communication and prototyping tool, not software in my eyes