beginners-need-help
  • datajoely (10/06/2021, 6:44 AM)
    So globals.yml only affects the catalog by default - you will need to use parameters.yml to access keys within a node https://kedro.readthedocs.io/en/latest/04_kedro_project_setup/02_configuration.html#parameters
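    For context, this is the pattern the linked docs describe: a key in parameters.yml becomes a node input via the `params:` prefix in the pipeline definition. A minimal sketch with hypothetical names:

    ```yaml
    # conf/base/parameters.yml (hypothetical key)
    model_options:
      test_size: 0.2
    ```

    ```python
    # pipeline.py
    from kedro.pipeline import Pipeline, node

    def split_data(raw_data, model_options):
        test_size = model_options["test_size"]  # the key arrives as a plain dict
        ...

    def create_pipeline(**kwargs):
        return Pipeline([
            node(split_data,
                 inputs=["raw_data", "params:model_options"],
                 outputs="split_datasets"),
        ])
    ```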
  • Stefan P (10/06/2021, 7:31 PM)
    I am new to Kedro, but I am feeding my nodes with parameters as inputs, added in pipeline.py. You might want to structure the YAML file so you can pass several params with one input, and then "unpack" the params in your node. The values of the params can also be passed as "extra_params" when establishing a session.
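    A sketch of that pattern with made-up names: one grouped key holds several values, the node unpacks it, and extra_params overrides it when the session is created:

    ```python
    from kedro.framework.session import KedroSession

    def train_model(data, train):        # wired as inputs=["data", "params:train"]
        lr = train["learning_rate"]      # "unpack" the grouped params in the node
        ...

    # Override the values from conf/base/parameters.yml for this run only
    with KedroSession.create(
        "my_project",                                    # hypothetical package name
        extra_params={"train": {"learning_rate": 0.1}},
    ) as session:
        session.run()
    ```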
  • datajoely (10/07/2021, 9:40 AM)
    Do any 🪟 users know how to help this user? https://github.com/quantumblacklabs/kedro/issues/941
  • WolVez (10/08/2021, 7:07 PM)
    @User is there a dataset to load tons of log (JSON) files in S3 and take advantage of async? Looking at how `--async` is implemented with `_run_node_async`, there doesn't seem to be a good way to achieve this that I can think of, other than making my own dataset which applies an async get while running in non-async mode (since stacking async on async is bad). However, this must be a common use case that has been overcome before? I feel like adding a hook that appends to the node would be a good thing here.
  • datajoely (10/08/2021, 7:45 PM)
    What about wrapping one of the existing datasets in PartitionedDataSet?
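    For reference, a hypothetical catalog entry along those lines; each partition then arrives in the node as a lazy-load callable keyed by partition id:

    ```yaml
    raw_logs:
      type: PartitionedDataSet
      path: s3://my-bucket/logs/        # hypothetical bucket
      dataset: json.JSONDataSet         # one JSON file per partition
      filename_suffix: ".json"
    ```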
  • WolVez (10/08/2021, 7:51 PM)
    @User, that was my original plan. But after looking into it, I am pretty sure that the PartitionedDataSet gets loaded as a single item with async. Thus, if you have other datasets, each one gets run, but each partition in the PartitionedDataSet would run synchronously.
  • datajoely (10/09/2021, 10:52 AM)
    I feel like we're hitting the limits of reading data from the file system without an index - it may be time to start dumping data into something like Postgres or Elasticsearch and reading it into Kedro via a query
  • WolVez (10/11/2021, 1:34 AM)
    @User for my case, probably. For a more generic use of the partitioned dataset, say a per-year CSV for 5 or 10 years, I think a quick edit to `_run_node_async` to change from a list of node items to an extended list of all datasets (to include partitioned datasets) might be a nice speed improvement at some point. Just a thought. I added my own dataset for my use case to manage it.
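    A rough sketch of what such a custom dataset could look like. Nothing here is from Kedro itself: the class name and parameters are invented, and it uses a thread pool rather than literal asyncio to fetch many S3 JSON files concurrently, which sidesteps the async-on-async problem mentioned above:

    ```python
    import json
    from concurrent.futures import ThreadPoolExecutor
    from typing import Any, Dict, List

    import fsspec
    from kedro.io import AbstractDataSet


    class ConcurrentJSONDataSet(AbstractDataSet):
        """Hypothetical read-only dataset: fetch many S3 JSON files in parallel."""

        def __init__(self, path: str, max_workers: int = 16):
            self._path = path                # e.g. "s3://my-bucket/logs/*.json"
            self._max_workers = max_workers

        def _load(self) -> List[Dict[str, Any]]:
            # Expand the glob into concrete object paths on the filesystem
            fs, _, paths = fsspec.get_fs_token_paths(self._path)

            def read_one(p: str) -> Dict[str, Any]:
                with fs.open(p) as f:
                    return json.load(f)

            # Loading is I/O-bound, so threads give the concurrency we want
            with ThreadPoolExecutor(max_workers=self._max_workers) as pool:
                return list(pool.map(read_one, paths))

        def _save(self, data) -> None:
            raise NotImplementedError("This dataset is read-only")

        def _describe(self) -> Dict[str, Any]:
            return {"path": self._path}
    ```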
  • datajoely (10/11/2021, 6:52 AM)
    If you wouldn't mind raising a GitHub issue under 'feature request' with the specific changes needed to `_run_node_async`, we could maybe discuss the approach in detail there?
  • wulfcrona (10/11/2021, 8:49 AM)
    I'm doing my first real Kedro project and I need to run the same model with different features and parameters on different categories of the same underlying data. Ideally I would like to have a different pipeline for each category so I can run them separately while reusing the node code. How should this be implemented? My initial thought was to move the model code to a separate module and keep the node code clean for each pipeline, but since it's very project-specific I'd rather keep it in the node code if possible. Thanks in advance for any suggestions!
  • datajoely (10/11/2021, 9:19 AM)
    Hi @User, this is a great application of modular pipelines. Our documentation is here: https://kedro.readthedocs.io/en/stable/06_nodes_and_pipelines/03_modular_pipelines.html Essentially this technique allows pipelines to be reused with different inputs and outputs. I have a very simple example here too: https://gist.github.com/datajoely/018607d5d721c747d742605494b822a3 The current documentation is a bit heavy to read, so we are working on more accessible content.
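    The technique in compressed, hypothetical form: build the pipeline once, then stamp out one instance per category with remapped inputs, outputs and parameters:

    ```python
    from kedro.pipeline import Pipeline, node
    from kedro.pipeline.modular_pipeline import pipeline

    def train_model(features, model_params):
        ...

    template = Pipeline([
        node(train_model, ["features", "params:model"], "model"),
    ])

    # One instance per data category, reusing the same node code
    pipeline_a = pipeline(
        template,
        inputs={"features": "category_a_features"},
        outputs={"model": "category_a_model"},
        parameters={"params:model": "params:model_a"},
    )
    ```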
  • dmb23 (10/13/2021, 8:34 AM)
    Hi, I have some metadata in a project stored as a SQLite database file. In handling that more or less cleanly, I stumbled upon two questions and would be curious whether there are more elegant solutions:
    - Currently I store the data I receive in e.g. data/01_raw/my_metadata.db, but I specify this location in credentials.yml (which is not the first place I would look if future me has to change something). I think it might also be possible to put it in a globals.yml with a TemplatedConfigLoader, but I did not find a way to include it in catalog.yml, which is where I would search for it when defining SQLQueryDataSet(s).
    - In that context: the documentation states that it might be possible (at least for a SQLTableDataSet) to provide the connection string not in credentials but in the load_args. However, both SQLTableDataSet and SQLQueryDataSet check for credentials[con] and raise a DataSetError when it is not found. Is there some way around that which I am missing? (Or should I offer to change the documentation accordingly?)
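    For readers hitting the same wall: as the message above notes, both SQL datasets currently insist on finding the connection string under a credentials key, which the catalog entry then references by name. A minimal hypothetical pairing (table and key names made up):

    ```yaml
    # conf/local/credentials.yml
    db_sqlite:
      con: sqlite:///data/01_raw/my_metadata.db

    # conf/base/catalog.yml
    metadata:
      type: pandas.SQLQueryDataSet
      sql: SELECT * FROM my_table    # hypothetical query
      credentials: db_sqlite
    ```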
  • datajoely (10/13/2021, 9:00 AM)
    Hi @User let me look into it and come back to you
  • datajoely (10/13/2021, 1:33 PM)
    SQLite set up
  • Bastian (10/13/2021, 2:37 PM)
    Hi everyone, we have a few steps that all our Kedro pipelines use, just with different parameters. I understand that this could be a use case for modular pipelines. Now I am unsure how to do the actual sharing: using `kedro pipeline pull` copies the source code into the project. I am afraid that this would lead to the different usages of the modular pipeline diverging over time. What is the suggested way of reusing modular pipelines in different projects?
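    For anyone skimming later, the sharing flow being discussed is roughly this pair of CLI commands (pipeline name hypothetical): `package` builds a distributable archive from a modular pipeline, and `pull` copies its source into another project.

    ```
    kedro pipeline package my_modular_pipeline
    kedro pipeline pull <path-or-url-to-packaged-archive>
    ```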
  • datajoely (10/13/2021, 6:05 PM)
    Hi @Bastian @wulfcrona, this is a really interesting topic of discussion and, in truth, something where best practice is still being defined. I would say this feels similar to API versioning, where a level of governance is needed and breaking changes need to be communicated and transparent. I have some initial thoughts:
    1. The smaller the modular pipeline, the easier it is to manage changes. Several of our internal use cases have very simple pipelines, e.g. splitting for cross-validation is reused across several pipelines and projects.
    2. This also brings in questions related to governance, mono-repo vs multi-repo, where things are centralised versus decentralised.
    3. We currently allow users to pull pipelines as source code, rather than as importable and installable libraries, because we want people to be able to remix and modify pipelines to fit their needs.
    Ultimately, we're looking to you, the community, to help us understand how we can best suit your needs. I'm delighted you're using some of the newest functionality we've brought to Kedro and I'm super keen to set you up for success. What I would love is for you to raise a GitHub discussion so our community can comment on: (1) What is working for them today? (2) What isn't? (3) What would they ❤️ to see added in the future?
  • Bastian (10/14/2021, 8:02 AM)
    Thank you for your thoughts, I created a discussion for this: https://github.com/quantumblacklabs/kedro/discussions/959
  • user (10/15/2021, 5:41 PM)
    Apologies if this has been asked before. Inside catalog.yml, how do I specify a PartitionedDataSet where the individual partitions are SQLite files? I'm going to be running the same SQL query on each file. I thought to try pandas.SQLTableDataSet, but that requires a "con" input, which varies depending on the partition.
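    One way the custom-dataset route could look. Everything here is a hypothetical sketch: a small read-only dataset that runs a fixed query against a single SQLite file, so that PartitionedDataSet, which passes each partition's path to the underlying dataset as `filepath`, can instantiate it once per file.

    ```python
    import sqlite3
    from typing import Any, Dict

    import pandas as pd
    from kedro.io import AbstractDataSet


    class SQLiteQueryDataSet(AbstractDataSet):
        """Hypothetical read-only dataset: one SQL query against one SQLite file."""

        def __init__(self, filepath: str, sql: str):
            self._filepath = filepath  # PartitionedDataSet injects the partition path here
            self._sql = sql

        def _load(self) -> pd.DataFrame:
            con = sqlite3.connect(self._filepath)
            try:
                return pd.read_sql_query(self._sql, con)
            finally:
                con.close()

        def _save(self, data: pd.DataFrame) -> None:
            raise NotImplementedError("This dataset is read-only")

        def _describe(self) -> Dict[str, Any]:
            return {"filepath": self._filepath, "sql": self._sql}
    ```

    And the matching catalog wiring, with a made-up module path:

    ```yaml
    sqlite_partitions:
      type: PartitionedDataSet
      path: data/01_raw/sqlite_files     # hypothetical folder of .db files
      filename_suffix: ".db"
      dataset:
        type: my_project.extras.datasets.SQLiteQueryDataSet  # hypothetical import path
        sql: SELECT * FROM my_table
    ```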
  • datajoely (10/15/2021, 5:42 PM)
    Ooo I don’t think this has been done before
  • user (10/15/2021, 5:42 PM)
    Interesting. Maybe the right thing then is to make a custom dataset?
  • datajoely (10/15/2021, 5:42 PM)
    Why partition the SQLite and not just write different tables within?
  • user (10/15/2021, 5:43 PM)
    The SQLite files come to me as a data source. Not in my control
  • datajoely (10/15/2021, 5:43 PM)
    Very interesting
  • user (10/15/2021, 5:43 PM)
    It’s a strange file format given to me by external forces 🙂
  • datajoely (10/15/2021, 5:43 PM)
    So I think you will need to do something custom but it should be doable without too much trouble
  • datajoely (10/15/2021, 5:44 PM)
    Also I think use of templated config is needed
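    For context, swapping in TemplatedConfigLoader at this point in Kedro's history went through the register_config_loader hook. A minimal sketch, assuming a Kedro ~0.17-era project:

    ```python
    from kedro.config import TemplatedConfigLoader
    from kedro.framework.hooks import hook_impl


    class ProjectHooks:
        @hook_impl
        def register_config_loader(self, conf_paths):
            # Values from globals.yml fill ${...} placeholders in catalog.yml
            return TemplatedConfigLoader(conf_paths, globals_pattern="*globals.yml")
    ```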
  • user (10/15/2021, 5:47 PM)
    I’ll look into that. I think I can get away with a custom dataset, but if that doesn’t work I’ll try out a template config.
  • user (10/15/2021, 5:47 PM)
    Thanks for the quick response!
  • datajoely (10/15/2021, 6:35 PM)
    Yeah I think it will be a bit of both - but happy to help you think through the implementation
  • user (10/15/2021, 6:38 PM)
    I managed to get it to work! My next question, if I have a sql database that needs an intermediate jump host to access, is there a best way to put that in the credentials.yml?
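    On the jump-host question: credentials.yml is free-form YAML, so one possible layout (entirely hypothetical, and assuming the SSH tunnel is opened separately, e.g. with the sshtunnel package in a before-pipeline-run hook) is to keep the tunnel details next to the connection string:

    ```yaml
    my_db:
      con: postgresql://user:pass@localhost:5433/mydb   # local end of the tunnel
      ssh_host: jump.example.com                        # hypothetical jump host
      ssh_username: tunnel_user
    ```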