beginners-need-help
  • r

    Roger

    03/07/2022, 4:57 AM
    Can items in globals.yml refer to other items in the same file for templating? For example:
    date: 2022-02-02
    somefile: filename_${date}.txt
  • d

    datajoely

    03/07/2022, 9:27 AM
    I'm not actually sure - no one has ever asked that before.
  • d

    datajoely

    03/07/2022, 9:28 AM
    After looking at the code, I don't think so: the files matching *globals.yml are consumed first, and then the templating is applied to the other files.
  • d

    datajoely

    03/07/2022, 9:32 AM
    Using environment vars is a common way to get around this sort of thing. Set the var outside of Kedro and then tweak your register_config_loader method in hooks.py:
    python
    import os

    from kedro.config import TemplatedConfigLoader

    # inside register_config_loader in hooks.py, where conf_paths is the hook argument
    return TemplatedConfigLoader(
        conf_paths,
        globals_pattern="*globals.yml",
        globals_dict={
            k: v for k, v in os.environ.items()  # .items() yields (key, value) pairs
            if k.startswith("date")
        },
    )
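    With the filtered environment variables exposed as globals, a value exported in the shell (for example export date=2022-02-02 before kedro run) can then be referenced from the templated config files, much like the globals.yml example above. A rough sketch, with an illustrative catalog entry:
    # conf/base/catalog.yml (dataset name and path are illustrative)
    my_dataset:
      type: pandas.CSVDataSet
      filepath: data/01_raw/filename_${date}.csv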
  • d

    Daehyun Kim

    03/07/2022, 7:16 PM
    Hi team, I'm trying to pass parameters at runtime. https://kedro.readthedocs.io/en/latest/04_kedro_project_setup/02_configuration.html#specify-parameters-at-runtime One of my params has a comma in its value, for example kedro run --params "param1=hi, param2=a,b". Is there a way to include a comma in a param value, such as an escape sequence?
  • d

    datajoely

    03/07/2022, 7:18 PM
    I'm not actually sure - can you try putting a \ before it? Otherwise there is a chance this isn't possible
  • d

    Daehyun Kim

    03/07/2022, 7:19 PM
    kedro run --params "param1=hi, param2=a\\,b"
  • d

    Daehyun Kim

    03/07/2022, 7:19 PM
    That's not working either.
  • d

    datajoely

    03/07/2022, 7:19 PM
    Yeah can you use a different delimiter?
  • d

    datajoely

    03/07/2022, 7:19 PM
    I think it's fair to say commas are reserved
  • d

    Daehyun Kim

    03/07/2022, 7:20 PM
    I see, thank you. I'll probably need to change the part of our system that relies on commas.
  • d

    datajoely

    03/07/2022, 7:21 PM
    This isn't speaking for the project, but I wouldn't be against changing this to pure JSON.
  • l

    lbonini

    03/07/2022, 7:58 PM
    Hello! @User Could you help me with this one? I'm trying to use environment variables in mlflow.yml. I need to set MLFLOW_TRACKING_INSECURE_TLS to true. How am I supposed to declare it? It only works if I export it in the terminal and then execute kedro run.
  • g

    Galileo-Galilei

    03/07/2022, 8:46 PM
    Environment variables in mlflow
  • d

    Daehyun Kim

    03/07/2022, 8:52 PM
    @User is there a way to specify params in config file? https://kedro.readthedocs.io/en/latest/04_kedro_project_setup/02_configuration.html#configure-kedro-run-arguments
  • d

    datajoely

    03/07/2022, 8:53 PM
    Yes! --config some.yaml
  • d

    datajoely

    03/07/2022, 8:54 PM
    https://kedro.readthedocs.io/en/stable/09_development/03_commands_reference.html#modifying-a-kedro-run
  • d

    Daehyun Kim

    03/07/2022, 9:13 PM
    Thanks! Having a param value with a comma in the YAML file seems to work.
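    A rough sketch of such a config file, assuming the Kedro version in use accepts a nested params mapping (see the commands reference linked above for the exact format), invoked as kedro run --config config.yml:
    # config.yml (illustrative)
    run:
      params:
        param1: hi
        param2: a,b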
  • w

    waylonwalker

    03/07/2022, 10:24 PM
    Looking for how others would do this. If you have a number of catalog entries that all need to be merged into one table on the same key, how do you do it? I've always made a node for each dataset separately, but today I read a PR that passed them all in as inputs, and the function signature was def joiner(left, *rights). I can't tell if this is genius or too much magic. Would love some thoughts on how others would review this PR.
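    A minimal sketch of that pattern, assuming pandas DataFrames joined on a shared key column (the dataset names and key are illustrative):
    python
    from functools import reduce

    import pandas as pd
    from kedro.pipeline import node

    def joiner(left: pd.DataFrame, *rights: pd.DataFrame) -> pd.DataFrame:
        # Fold each additional input onto the first one using the shared key.
        return reduce(lambda acc, df: acc.merge(df, on="id", how="left"), rights, left)

    # Each catalog entry is still listed explicitly, so lineage stays visible in kedro-viz.
    join_node = node(
        joiner,
        inputs=["orders", "customers", "products"],
        outputs="master_table",
    )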
  • d

    datajoely

    03/07/2022, 10:30 PM
    I encourage this pattern in the modular spaceflights demo
  • d

    datajoely

    03/07/2022, 10:30 PM
    I think it's a good idea
  • d

    datajoely

    03/07/2022, 10:31 PM
    You're still being explicit where it matters
  • m

    Melen

    03/08/2022, 2:35 AM
    Hi, I am very interested in Kedro. However, I am not interested in the full workflow as described in the Spaceflights tutorial. At my work we have an in-house Python DAG framework built on Dask, and I would like to replace it with Kedro. What I would like to do is create pipelines dynamically and use mostly memory datasets in the catalogue. I fear that for my use case I may not be able to use Kedro and kedro-viz how I would like. Does the pipelining part of Kedro stand alone, without needing to set up the Kedro template, the data folders, and project packaging?
  • y

    Yetunde

    03/08/2022, 9:10 AM
    Using Kedro in your workflow
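    For what it's worth, the core pieces can be driven without the project template; a minimal sketch, assuming Kedro 0.17.x (the node and dataset names are illustrative):
    python
    from kedro.io import DataCatalog, MemoryDataSet
    from kedro.pipeline import Pipeline, node
    from kedro.runner import SequentialRunner

    def double(x):
        return x * 2

    # Pipelines are plain Python objects, so they can be assembled dynamically.
    pipeline = Pipeline([node(double, inputs="x", outputs="doubled", name="double_x")])

    # Anything not registered in the catalog is held in memory by default.
    catalog = DataCatalog({"x": MemoryDataSet(21)})

    print(SequentialRunner().run(pipeline, catalog))  # {'doubled': 42}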
  • b

    beats-like-a-helix

    03/08/2022, 11:57 AM
    Let's say I have a partitioned dataset that's quite large, 0.5TB or so. Can I expect kedro run to work as normal, or is there anything special I need to do or configure before executing to make sure I don't run out of memory?
  • d

    datajoely

    03/08/2022, 12:17 PM
    I think you can assume it works - but I would also suggest that if you're hitting that sort of data size, you may want to explore Spark or Dask as an execution engine.
  • d

    datajoely

    03/08/2022, 12:17 PM
    we have docs on both if you're interested
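    One reason it tends to hold up is that PartitionedDataSet hands the node a dictionary of partition id to load function, so partitions can be read one at a time instead of all at once. A rough sketch (the per-partition summary step is illustrative):
    python
    from typing import Callable, Dict

    import pandas as pd

    def summarise_partitions(partitions: Dict[str, Callable[[], pd.DataFrame]]) -> pd.DataFrame:
        # PartitionedDataSet passes {partition_id: load_function}, so nothing is read up front.
        summaries = []
        for partition_id in sorted(partitions):
            df = partitions[partition_id]()   # each file is only loaded here
            summaries.append(df.describe())   # keep a small summary, let the big frame be freed
        return pd.concat(summaries, keys=sorted(partitions))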
  • b

    beats-like-a-helix

    03/08/2022, 12:19 PM
    Thanks, I'll check that out!
  • d

    datajoely

    03/08/2022, 12:21 PM
    The latest Dask docs are on the latest docs branch.
  • b

    beats-like-a-helix

    03/08/2022, 12:44 PM
    I don't know anything about Spark or Dask yet, so the docs may cover the following question: Let's say I have 1000 files, and there is some anomaly in file 926 that causes an error. So I fix the file itself or my nodes, and run again. How can I start from 926 and not waste time computing results I've already acquired? I've never really worked with a dataset large enough to warrant asking this question before, haha.
  • d

    datajoely

    03/08/2022, 1:40 PM
    So this is a good question, and I'm not sure we have a good out-of-the-box solution, since our retry logic is at the node level.
    I wonder if it's worth defining a custom dataset that works on ranges, and then running separate pipelines for each range.
    Is the error you're worried about related to logic or memory? Because I guess we could take different approaches depending on those variables.
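    A low-tech variant of the range idea, given that PartitionedDataSet already hands the node lazy loaders keyed by partition id: pass the restart point in as a parameter and skip everything before it (the parameter name and the processing step are illustrative):
    python
    from typing import Callable, Dict

    import pandas as pd

    def process_from(
        partitions: Dict[str, Callable[[], pd.DataFrame]], start_after: str
    ) -> Dict[str, pd.DataFrame]:
        # e.g. start_after="0925" to resume from the fixed file 926 onwards
        results = {}
        for partition_id in sorted(partitions):
            if partition_id <= start_after:
                continue                         # already handled in the earlier run
            df = partitions[partition_id]()      # loaded lazily, one file at a time
            results[partition_id] = df           # replace with the real per-file processing
        return results
    Writing the result back through a PartitionedDataSet output keeps the per-file structure, and partitions already written by the earlier run should stay in place unless the dataset is configured to overwrite.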
  • b

    beats-like-a-helix

    03/08/2022, 2:14 PM
    Thanks for the information, that's good to know.
    I'm anticipating either of those error types. It's gravitational wave data, so the files are big and the signal can be inconsistent at times, so I'm just trying to make a plan for things going wrong.
  • d

    datajoely

    03/08/2022, 2:37 PM
    This specific data is outside of my wheelhouse per se; @User, with his astrophysics background, may actually be helpful here.
    In general I'd explore a few things: (1) get good at logging out exceptions/warnings and anything else that can be used to understand why things failed - the on_pipeline_error hook is potentially really useful here; (2) try to set up a data profiling pipeline that allows you to do low-cost analysis of the data and maybe gives you an idea of what may cause problems downstream; (3) for logic errors, you may want to use exception handlers to gracefully swallow and log errors rather than killing the process.
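    A minimal sketch of point (1), assuming a hooks class registered in the project's settings.py (the logging details are illustrative):
    python
    import logging

    from kedro.framework.hooks import hook_impl

    logger = logging.getLogger(__name__)

    class ErrorLoggingHooks:
        @hook_impl
        def on_pipeline_error(self, error, run_params, pipeline, catalog):
            # Record which run failed and why, so the offending file can be tracked down later.
            logger.error("Pipeline run failed: %r (run_params=%s)", error, run_params)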
  • b

    beats-like-a-helix

    03/08/2022, 2:58 PM
    Fantastic! I'll do those things and hopefully the project will stay on course.