beginners-need-help
  • a

    amos

    02/24/2022, 6:12 PM
    Okay, I’ll keep playing around, was just wondering if there was a good way of doing it. I’m currently inlining all of my yamls as gigantic dicts and it doesn’t quite feel right. Thanks for your help πŸ™‚
  • d

    datajoely

    02/24/2022, 6:13 PM
    I agree we need to find a better way!
  • b

    beats-like-a-helix

    02/25/2022, 3:01 PM
    Can anyone link me to a project that uses partitioned datasets? Looking for some general examples, would be much appreciated.
  • b

    beats-like-a-helix

    02/25/2022, 5:48 PM
    Following the contents of this video:

    https://www.youtube.com/watch?v=mPPLk4IKu_s

    I see this gentleman creates a "singular" and a "plural" node so that he essentially has two separate pipelines: one for processing a specified file and one for processing the entire partitioned dataset. Is this the recommended way to design all projects that involve many files of the same type and structure?
  • d

    datajoely

    02/25/2022, 5:50 PM
    Hi @User, it's one pattern that users have been successful with.
  • d

    datajoely

    02/25/2022, 5:50 PM
    Maybe it would be best for you to tell us a little about what you're trying to build and we can think it through together?
  • b

    beats-like-a-helix

    02/25/2022, 6:31 PM
    @User Thanks for responding. The project is nothing serious, just trying to re-work a past project using Kedro for learning purposes. In this case it's ~20 files of timeseries data of cepheid variable star brightness, where the objective is to calculate some best-fit parameters such as the period of pulsation, etc. No ML, maybe 10 functions in total. In my previous implementation of the project, I just had a main() function that looped over all the files in the directory. Many of my Astro projects are of a similar format to this, so I've been wondering what the best design choices are for a Kedro implementation.
  • d

    datajoely

    02/25/2022, 6:35 PM
    I'm with you, I think the DE1 pattern above makes a lot of sense. The only thing that has changed since then is the introduction of modular pipelines, which facilitate more powerful reuse of logic: https://kedro.readthedocs.io/en/0.17.7/06_nodes_and_pipelines/03_modular_pipelines.html
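
    (For illustration, a minimal sketch of the "plural" node pattern discussed above: the node receives a PartitionedDataSet as a dict mapping partition IDs to load callables and loops over them, much like the old main() loop. The function name and result columns below are hypothetical.)
    python
    from typing import Callable, Dict

    import pandas as pd


    def fit_all_light_curves(partitions: Dict[str, Callable[[], pd.DataFrame]]) -> pd.DataFrame:
        """Process every partition of a PartitionedDataSet lazily."""
        results = []
        for partition_id, load_partition in partitions.items():
            df = load_partition()  # each value is a callable that loads one file
            # placeholder "fit": just record how many samples each light curve has
            results.append({"file": partition_id, "n_samples": len(df)})
        return pd.DataFrame(results)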
  • b

    beats-like-a-helix

    02/25/2022, 6:39 PM
    Thanks for the advice! @User PS. Is DE1 still around? Looks like he dropped off the face of the earth sometime last year
  • d

    datajoely

    02/25/2022, 6:42 PM
    Unfortunately not, he's a bit too successful to keep doing this in his spare time! We're actually about to start hiring a full-time DevRel, which should help here.
  • b

    beats-like-a-helix

    02/25/2022, 6:47 PM
    Ah, I assumed he was part of the team or something. That's great!
  • w

    wulfcrona

    02/28/2022, 12:43 PM
    I have more of a conceptual question: for my latest project, one of the features is scraped from a React web app. To do this I need a path to the Chrome driver in addition to some Python libraries. What is the Kedro way to solve this? Should I just add them to the catalog and feed them to the nodes, or create a special folder and have the path in the parameters?
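
    (One way to read the "path in the parameters" option above: keep the driver path in conf/base/parameters.yml and inject it into the node with the params: prefix. The key name and scraping function below are hypothetical.)
    yml
    # conf/base/parameters.yml
    chromedriver_path: /usr/local/bin/chromedriver
    python
    # in the pipeline definition, parameters are passed with the "params:" prefix
    from kedro.pipeline import node

    node(
        func=scrape_features,  # hypothetical scraping function
        inputs="params:chromedriver_path",
        outputs="scraped_features",
    )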
  • l

    lbonini

    02/28/2022, 2:19 PM
    Hello people! Could someone suggest a simple way to persist a SQLQueryDataSet into parquet, with a parameter to switch between the persisted and non-persisted dataset? (Without duplicating the entries in catalog.yml)
  • d

    datajoely

    02/28/2022, 2:23 PM
    So you would have to create an output dataset that does the persisting - a simple node that accepts the data and then outputs to a new dataset that gets persisted. What you can do is an after_pipeline_created hook that replaces the dataset with a MemoryDataSet dynamically, based on the parameter or env variable.
  • l

    lbonini

    02/28/2022, 2:26 PM
    Thank you for your response @User! Do you have any code example or video that I can use to understand it better?
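
    (A rough sketch of the dataset-swap idea described above. This version uses the after_catalog_created hook, which receives the catalog object, and an environment variable as the toggle; the dataset name "queried_table" and the variable name are hypothetical, and a parameter could be used instead.)
    python
    import os

    from kedro.framework.hooks import hook_impl
    from kedro.io import MemoryDataSet


    class SwapToMemoryHooks:
        """Replace the persisted dataset with an in-memory one when persistence is turned off."""

        @hook_impl
        def after_catalog_created(self, catalog):
            # KEDRO_PERSIST_SQL=false -> keep the query result in memory only
            if os.environ.get("KEDRO_PERSIST_SQL", "true").lower() != "true":
                catalog.add("queried_table", MemoryDataSet(), replace=True)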
  • b

    beats-like-a-helix

    03/04/2022, 7:17 PM
    I'm looking at the documentation for MatplotlibWriter regarding saving a dictionary of plots: https://kedro.readthedocs.io/en/stable/kedro.extras.datasets.matplotlib.MatplotlibWriter.html However, when the images are saved, they do not have a format, even if a format is specified in the save_args dict as "format": "". I reckon I'm misunderstanding something. Anyone got any advice?
  • b

    beats-like-a-helix

    03/04/2022, 8:32 PM
    Found a workaround, which is to specify the format in the dictionary key, but this doesn't feel right, as it's not what the documentation suggests:
    python
    import matplotlib.pyplot as plt
    from kedro.extras.datasets.matplotlib import MatplotlibWriter

    plots_dict = dict()
    for colour in ["blue", "green", "red"]:
        plots_dict[f"{colour}.pdf"] = plt.figure()
        plt.plot([1, 2, 3], [4, 5, 6], color=colour)
    plt.close("all")
    dict_plot_writer = MatplotlibWriter(
        filepath="matplotlib_dict",
        save_args={
            # "format": "pdf",
            "dpi": 300,
            "bbox_inches": "tight",
        },
    )
    dict_plot_writer.save(plots_dict)
  • p

    pypeaday

    03/04/2022, 9:37 PM
    Would a PartitionedDataSet made up of MatplotlibWriter datasets work/make sense? I'm admittedly totally unfamiliar with the MatplotlibWriter one but we used Partitioned and IncrementalDataSets kind of a lot and they're super nice
  • d

    desrame

    03/04/2022, 10:14 PM
    very new to kedro - the first time I ran kedro new, it generated a Project folder, with a repo folder inside, and a package under source... almost every time since then as I explore and learn, it seems to be generating only the repo and package level
  • d

    desrame

    03/04/2022, 10:15 PM
    will this end up causing any issues?
  • b

    beats-like-a-helix

    03/04/2022, 10:17 PM
    Ah, I didn't know that MatplotlibWriter could be treated as a dataset in catalog.yml! That makes my job easier, since I'm actually trying to create plots for each file in an existing PartitionedDataset. But I'm still experiencing the same problem of not having a file format by default!
  • b

    beats-like-a-helix

    03/04/2022, 10:31 PM
    Crisis averted, just had to specify things properly in catalog.yml. In my case:
    yml
    power_spectrum_figures:
      type: PartitionedDataSet
      path: data/07_model_output/figures
      dataset:
        type: matplotlib.MatplotlibWriter
        save_args:
          format: pdf
          dpi: 300
          bbox_inches: tight
      filename_suffix: ".pdf"
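
    (For completeness, a node feeding a catalog entry like this would typically return a dict of figures keyed by the desired file names; the PartitionedDataSet then saves each value with a MatplotlibWriter and appends the filename_suffix. A rough sketch, assuming two-column time-series partitions; all names are hypothetical.)
    python
    import matplotlib.pyplot as plt


    def plot_power_spectra(partitions):
        """Return one figure per input partition; dict keys become the output file names."""
        figures = {}
        for partition_id, load_partition in partitions.items():
            df = load_partition()
            fig, ax = plt.subplots()
            ax.plot(df.iloc[:, 0], df.iloc[:, 1])  # assumes frequency vs. power columns
            ax.set_title(partition_id)
            figures[partition_id] = fig
            plt.close(fig)
        return figures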
  • b

    beats-like-a-helix

    03/04/2022, 10:37 PM
    The top level of a new project should always look something like this, I believe:
    sh
    .
    β”œβ”€β”€ README.md
    β”œβ”€β”€ conf
    β”œβ”€β”€ data
    β”œβ”€β”€ docs
    β”œβ”€β”€ info.log
    β”œβ”€β”€ logs
    β”œβ”€β”€ notebooks
    β”œβ”€β”€ pyproject.toml
    β”œβ”€β”€ setup.cfg
    └── src
  • b

    beats-like-a-helix

    03/04/2022, 10:42 PM
    Another general question, what is the accepted directory in which to place any generated figures? Do they "belong" in one of the later data layer folders, or should I just create a new damn folder? Not that it matters much, but I'm trying to learn the jedi way
  • d

    desrame

    03/04/2022, 10:44 PM
    oh cool, i must have mis-remembered something when i populated my first project
  • d

    desrame

    03/04/2022, 10:45 PM
    thanks beats πŸ™‚
  • d

    desrame

    03/05/2022, 3:25 AM
    Another newb question: I'm running into the following error, and after reading the docs I'm not sure what catalog and credentials mismatch exists.
  • d

    desrame

    03/05/2022, 3:26 AM
    KeyError: "Unable to find credentials 'sql1': check your data catalog and credentials configuration."
  • d

    desrame

    03/05/2022, 3:28 AM
    # catalog.yml definition
    lstm_base:
        type: pandas.SQLTableDataSet
        table_name: sometable
        credentials: sql1
    
    # credentials.yml in local
    sql1:
        con: mssql+pyodbc:///?odbc_connect=DRIVER={ODBC+Driver+17+for+SQL+Server};SERVER=someserver;DATABASE=somedatabase;UID=someuser;PWD=somepwd
    
    # .py version working in .py script
    conn = \
        'DRIVER={ODBC Driver 17 for SQL Server};SERVER=someserver;DATABASE=somedatabase;UID=someuser;PWD=somepassword'
    quoted = quote_plus(conn)
    new_con = 'mssql+pyodbc:///?odbc_connect={}'.format(quoted)
    engine = create_engine(new_con, fast_executemany=True, connect_args={'timeout': 100})
  • d

    desrame

    03/05/2022, 3:28 AM
    based on the docs, it feels like catalog.yml should be able to reference the credentials in credentials.yml