beginners-need-help
  • user
    07/30/2021, 6:04 PM
    nice.. thank you 🙂
  • waylonwalker
    07/31/2021, 1:27 AM
    Honestly, some side projects I've done with no kedro template whatsoever, starting from a blank file, I hand-rolled an auto catalog in ~10 lines (probably only 3 lines of execution). It feels so fast to just ignore the catalog. Honestly I do not care where the files go, I just care that they are of the right type, and rarely (outside of the raw layer) am I fussing with any settings on them. I'd rather just say: dump all dataframes into parquet, in the data folder, name the file just like the dataset, and make anything else a pickle. I can't remember ever hand-loading a file, so I really don't care about the structure; I just want there to be enough there that the option is still there if I need it.
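    One way that could look (a rough sketch of the idea, not waylonwalker's actual code; auto_save and DATA_DIR are invented names):

        from pathlib import Path
        import pickle

        import pandas as pd

        DATA_DIR = Path("data")

        def auto_save(name, obj):
            """Dump dataframes to parquet, anything else to pickle,
            with the filename matching the dataset name."""
            DATA_DIR.mkdir(exist_ok=True)
            if isinstance(obj, pd.DataFrame):
                obj.to_parquet(DATA_DIR / f"{name}.parquet")
            else:
                (DATA_DIR / f"{name}.pickle").write_bytes(pickle.dumps(obj))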
  • waylonwalker
    07/31/2021, 1:30 AM
    I'm a fan of Brian Okken's take that utils modules are like junk drawers: once you let yourself have one, it's chock full of random stuff that belongs somewhere else. There are some rare occasions where things don't really belong anywhere else, but generally if you roll back the problem enough you can suss out what it actually is and put it where it belongs.
  • user
    08/02/2021, 2:09 PM
    Nice @User. I'll keep an eye on it. Thank you.
  • WolVez
    08/04/2021, 9:07 PM
    Is it possible to load() a dataset inside a notebook when that dataset was never saved, but the node that generates it has been run inside KedroSession.run()?
  • waylonwalker
    08/04/2021, 9:55 PM
    Yes, if it is a dataset that is stored somewhere, i.e. not a MemoryDataSet. Two things you can try, using cars as an example dataset name:

        catalog.load('cars')
        catalog.datasets.cars.load()
  • WolVez
    08/04/2021, 10:12 PM
    @User, I am asking specifically about the MemoryDataSet situation.
  • waylonwalker
    08/05/2021, 1:20 AM
    I think the only way is to get it from the return of the run, which returns a dictionary of {dataset name (str): dataset value} for the final datasets. There is no way to get intermediate datasets if they are not saved.
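    A minimal sketch of that pattern, assuming a Kedro ~0.17 project packaged as my_project (a placeholder name):

        from kedro.framework.session import KedroSession

        # session.run() returns a dict of the pipeline's free (unsaved)
        # terminal outputs, keyed by dataset name.
        with KedroSession.create("my_project") as session:
            outputs = session.run()

        cars_df = outputs["cars"]  # "cars" is the example dataset name from above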
  • waylonwalker
    08/05/2021, 1:22 AM
    @User, do you use MemoryDataSets often? I use them for examples on my blog or live stream when I simply do not set up a catalog, but for my production pipelines I have a lint check for MemoryDataSets to make sure we don't accidentally commit any.
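    Such a check could be as simple as flagging every pipeline dataset that is not declared in the catalog and would therefore fall back to a MemoryDataSet (a sketch, not the actual lint rule):

        def find_memory_datasets(pipeline, catalog):
            """Return pipeline dataset names that would default to MemoryDataSet."""
            declared = set(catalog.list())
            return sorted(
                name
                for name in pipeline.data_sets()
                if name not in declared
                and name != "parameters"
                and not name.startswith("params:")
            )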
  • WolVez
    08/05/2021, 2:35 AM
    @User, I am not going out and creating a static dataframe and passing it as a MemoryDataSet. However, we have lots of pipelines inside a project, each with a ton of nodes which aggregate data. It's not uncommon for us to have 6 different aggregation nodes which then only get passed into further aggregation nodes downstream, so there is never a need to save the output of the first aggregation nodes to a storage location. I am just assuming that data passed between nodes without being saved is kept in a MemoryDataSet until it is used by later nodes. Does that make sense? I do know that I cannot call these unsaved datasets via the catalog methods you mentioned above (at least not that I have found yet).
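    That is indeed how Kedro handles it. Schematically (function and dataset names invented for illustration):

        import pandas as pd
        from kedro.pipeline import Pipeline, node

        def agg_step_1(df: pd.DataFrame) -> pd.DataFrame:
            return df.groupby("key").sum()

        def agg_step_2(df: pd.DataFrame) -> pd.DataFrame:
            return df.nlargest(10, "value")

        # "agg_1" is not declared in the catalog, so the runner holds it
        # as a MemoryDataSet and hands it straight to the downstream node.
        pipeline = Pipeline([
            node(agg_step_1, inputs="raw_events", outputs="agg_1"),
            node(agg_step_2, inputs="agg_1", outputs="agg_final"),
        ])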
  • waylonwalker
    08/05/2021, 3:35 AM
    I see. I need to remember that I am blessed with cheap storage and relatively small datasets. For us the cost of storage is outweighed by the engineering time spent even thinking about whether datasets should be saved, even before an event where someone needs to see them.
  • waylonwalker
    08/05/2021, 3:37 AM
    It sounds like you maintain a very large project that consists of many small pipelines. You often need to run a portion of these pipelines and see the results in a notebook, but only while the session is still active?
  • waylonwalker
    08/05/2021, 3:42 AM
    Would it be appropriate to store those intermediate steps to disk next to the notebook? You could make your own run function that asks the section of pipeline you are running for its all_outputs, then make pickle or parquet datasets for each of them on the fly. They won't exist in the project's catalog, only in memory. You might even be able to use the tempfile module to let the OS clean up for you after you're done.
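    As a sketch, from inside a kedro ipython session (where catalog and pipeline are already provided; the "agg" tag and the temp directory are placeholders):

        import tempfile
        from pathlib import Path

        from kedro.extras.datasets.pickle import PickleDataSet

        tmp_dir = Path(tempfile.mkdtemp())

        # Register an on-the-fly pickle dataset for every output of the
        # sub-pipeline, so the run persists them somewhere loadable.
        sub_pipeline = pipeline.only_nodes_with_tags("agg")
        for name in sub_pipeline.all_outputs():
            catalog.add(
                name,
                PickleDataSet(filepath=str(tmp_dir / f"{name}.pkl")),
                replace=True,
            )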
  • datajoely
    08/05/2021, 9:08 AM
    @User the best option is to persist your dataset for inspection and then find a way to clear it up automatically, like @User suggests.
    1. The easiest way to do this is to set up a cron job that clears a directory at some predefined interval.
    2. The other option is to define a custom dataset that uses a temp file.
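    A bare-bones sketch of option 2, assuming kedro 0.17's AbstractDataSet interface (the class name is invented):

        import os
        import pickle
        import tempfile
        from pathlib import Path

        from kedro.io import AbstractDataSet

        class TempPickleDataSet(AbstractDataSet):
            """Pickles data to a file in the OS temp directory."""

            def __init__(self):
                fd, name = tempfile.mkstemp(suffix=".pkl")
                os.close(fd)
                self._path = Path(name)

            def _save(self, data) -> None:
                self._path.write_bytes(pickle.dumps(data))

            def _load(self):
                return pickle.loads(self._path.read_bytes())

            def _describe(self) -> dict:
                return {"filepath": str(self._path)}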
  • waylonwalker
    08/05/2021, 1:10 PM
    @User would it make sense for kedro to have a semi-permanent dataset that sits between MemoryDataSet and the others, one that just pickles to a tmp file? Mixing this with my issue to change the default dataset would be an interesting prospect. Since settings.py is a Python script, you could set up standards for your team that change based on env variables. Then it's pretty much automatically doing what is best for your project in both prod and dev modes. https://github.com/quantumblacklabs/kedro/issues/849
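    In settings.py that could look something like this; note that DEFAULT_DATASET and the KEDRO_MODE variable are hypothetical, sketching what issue #849 proposes rather than an existing Kedro setting:

        # settings.py -- plain Python, so team defaults can key off an env var.
        import os

        if os.environ.get("KEDRO_MODE", "dev") == "prod":
            DEFAULT_DATASET = "pandas.ParquetDataSet"  # hypothetical setting (issue #849)
        else:
            DEFAULT_DATASET = "pickle.PickleDataSet"   # hypothetical setting (issue #849)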
  • datajoely
    08/05/2021, 1:30 PM
    I'm not sure - I'm still thinking through how best to work with this. We also have plans to improve the CachedDataSet and I'm wondering how much of that will cover this use case.
  • WolVez
    08/05/2021, 3:53 PM
    Thanks for the feedback. I can absolutely make a temp storage dataset type. That's a really good idea! Next question: is there a way to layer environments? I suppose I could just copy the environment I care about and apply the temp storage dataset type, but I am thinking it might be even cooler to automatically reference the other environments. For example, it would escalate like base > dev (not local) > temp. This would allow dev to not be bogged down with local write events for temp edits, while still having temp inherit continued development inside of dev.
  • datajoely
    08/05/2021, 3:55 PM
    There absolutely is @User
  • datajoely
    08/05/2021, 3:56 PM
    https://kedro.readthedocs.io/en/latest/04_kedro_project_setup/02_configuration.html#additional-configuration-environments
  • WolVez
    08/05/2021, 3:56 PM
    sweet!
  • datajoely
    08/05/2021, 3:56 PM
    so you basically maintain mirror catalogs with the delta differences
  • datajoely
    08/05/2021, 3:56 PM
    and there is a hierarchy {custom_env} / local / base
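    Concretely, the layout might look like this ("temp" is an example environment name):

        conf/
        ├── base/     # defaults, always loaded
        │   └── catalog.yml
        ├── local/    # the default run environment
        └── temp/     # custom env, activated with: kedro run --env=temp
            └── catalog.yml   # only the entries that differ from base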
  • WolVerz
    08/10/2021, 10:08 PM
    Hey! I have a kedro project with a custom conf including parameters and a catalog. I am trying to access these parameters and datasets within kedro ipython, but it keeps loading only base and local. How do I specify the right conf?
  • datajoely
    08/10/2021, 10:15 PM
    I think you’re looking for additional run environments https://kedro.readthedocs.io/en/latest/04_kedro_project_setup/02_configuration.html#additional-configuration-environments
  • WolVerz
    08/10/2021, 10:16 PM
    Also, @User, can you explain how you got that to work with multiple confs? I have been wanting to do something similar, but I am not following @User's response about the three-environment layering. That is what you are trying to do, yes? In other words, if the confs "base", "dev", and "temp" all have the same parameter, the "temp" value is used; otherwise it grabs the value from "dev" and lastly from "base"?
  • datajoely
    08/10/2021, 10:17 PM
    It's quite late here in London; I can respond in detail tomorrow.
  • WolVerz
    08/10/2021, 10:18 PM
    @User, I tried that, but it doesn't seem to work when you do kedro ipython --env=test
  • datajoely
    08/10/2021, 10:18 PM
    I think this video is useful:

    https://youtu.be/D2v9k9ARDBE
  • WolVerz
    08/10/2021, 10:18 PM
    Feel free to respond tomorrow, no rush.
  • datajoely
    08/10/2021, 10:18 PM
    Around 5 minutes in