beginners-need-help
  • datajoely
    05/16/2022, 10:46 PM
    No, the point of hooks is that you have a lot of flexibility; the example is very much demonstrating functionality more than specific good practices.
  • datajoely
    05/16/2022, 10:46 PM
    In my opinion.
  • wwliu
    05/16/2022, 10:49 PM
    I see. Thanks for replying and helping me out.
  • AnnaRie
    05/18/2022, 6:59 PM
    Hello 🙂 I'm using Kedro to train a model and predict values with this model afterwards. The model is saved versioned after training, and for prediction I usually take the latest version. But if I want to use a specific version, I have to define this in the terminal after kedro run (as I read in the Kedro documentation). Is there an option to get the defined version? I want to write a log for my prediction to keep things reproducible. Thanks, Anna
  • antony.milne
    05/18/2022, 8:32 PM
    Logging load version
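For reference, one way to surface the version Kedro resolves is a project hook; a minimal sketch, assuming a versioned catalog entry named "model" (the dataset name is hypothetical, and _get_dataset is not public API):

    import logging

    from kedro.framework.hooks import hook_impl
    from kedro.io import DataCatalog


    class LoadVersionLoggingHooks:
        """Logs the load version Kedro resolves for a versioned dataset."""

        @hook_impl
        def after_catalog_created(self, catalog: DataCatalog) -> None:
            # "model" is a hypothetical versioned catalog entry;
            # resolve_load_version() returns the version string that
            # will be used when the dataset is loaded.
            dataset = catalog._get_dataset("model")  # not public API
            logging.getLogger(__name__).info(
                "Load version for 'model': %s", dataset.resolve_load_version()
            )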
  • wwliu
    05/18/2022, 11:29 PM
    Hello. I have a question regarding the execution order of nodes. I understand the node execution order is decided by Kedro, not exactly like the layout in pipeline.py. I would like to understand the underlying mechanism of how the order is determined. Is there randomness involved? If there is randomness, it might make testing and QA work difficult.
  • noklam
    05/18/2022, 11:35 PM
    Under the hood it is a topological sort. There is no guarantee about the order if there is more than one possible solution. Nodes are supposed to be pure Python functions without side effects, so ordering should not affect the result.
  • noklam
    05/18/2022, 11:37 PM
    That said, there are cases, like random sequence generation, that are affected by the order; there is an issue to change this to a more deterministic approach.
  • datajoely
    05/19/2022, 10:48 AM
    Hi @wwliu, I would also clarify @noklam's point slightly: "there is no guarantee about the order" only applies per dependency level. I.e. if dataset D requires A, B and C, D will always be executed last, but the order in which A, B and C run is not fixed per run.
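A minimal sketch of that guarantee (function and dataset names are made up): the last node always runs after the first three, but the relative order of the first three can vary between runs:

    from kedro.pipeline import Pipeline, node

    def make_a():
        return 1

    def make_b():
        return 2

    def make_c():
        return 3

    def combine(a, b, c):
        # Only runs once A, B and C all exist.
        return a + b + c

    pipeline = Pipeline(
        [
            node(make_a, inputs=None, outputs="A"),
            node(make_b, inputs=None, outputs="B"),
            node(make_c, inputs=None, outputs="C"),
            node(combine, inputs=["A", "B", "C"], outputs="D"),
        ]
    )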
  • Lazy2PickName
    05/19/2022, 4:41 PM
    Hi, so, I have a pipeline like this:
    def _parse_inctf() -> Pipeline:
        return Pipeline(
            [
                node(
                    func=nodes.insert_columns_inctf,
                    inputs="external-inct-fracionada",
                    outputs="inctf-preprocess-01-insert-columns",
                    name="read-and-insert-columns-inctf",
                ),
                node(
                    func=nodes.parse_inct_dates,
                    inputs="inctf-preprocess-01-insert-columns",
                    outputs="inctf-preprocess-02-parse-dates",
                ),
                node(
                    func=nodes.get_pct_change,
                    inputs="inctf-preprocess-02-insert-columns",
                    outputs="inctf-preprocessed",
                ),
            ]
        )
    Of those datasets, only external-inct-fracionada and inctf-preprocessed are actually declared in the catalog.yml. I want to pass the others as MemoryDataSets, since they are intermediaries in my pipeline, but when I run, I get this error:
    ValueError: Pipeline input(s) {'inctf-preprocess-02-insert-columns'} not found in the DataCatalog
    Is there a way of doing this without declaring each intermediary dataset in my catalog? Just so you know, this is the entry for external-inct-fracionada in my catalog:
    external-inct-fracionada:
      type: project.io.encrypted_excel.EncryptedExcelDataSet
      filepath: "${DATA_DIR}/External/INCT/INCTF_0222.xls"
    EncryptedExcelDataSet and its implementation can be seen in the attached file.
  • noklam
    05/19/2022, 4:53 PM
    The naming of intermediate data can be arbitrary; you just need to use consistent names. If it is a memory dataset, it must be the output of some other node. Change 02 to 01 and it should run.
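noklam's rename makes the input match the first node's output; if the date-parsing step is meant to stay in the chain, the likelier intended fix is to point the third node at the second node's output instead. A sketch of the latter (same principle either way: the input must exactly match a name some node outputs):

    node(
        func=nodes.get_pct_change,
        # was "inctf-preprocess-02-insert-columns", which no node outputs;
        # "inctf-preprocess-02-parse-dates" is produced by the previous node
        inputs="inctf-preprocess-02-parse-dates",
        outputs="inctf-preprocessed",
    ),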
  • Lazy2PickName
    05/19/2022, 4:57 PM
    Thanks!
  • SirTylerDurden
    05/20/2022, 1:34 AM
    Is there any control flow that's supported in a Kedro DAG?
  • datajoely
    05/20/2022, 9:56 AM
    Could you explain more what you mean? Do you mean conditional nodes?
  • datajoely
    05/20/2022, 10:01 AM
    If you do: we don't, by design, since we want to enforce reproducibility. The way to achieve this is to have different registered pipelines/environments/instances of modular pipelines.
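A minimal sketch of the registered-pipelines pattern, assuming two hypothetical modular pipelines, training and inference, in a project's pipeline_registry.py (the project and pipeline names are made up):

    from typing import Dict

    from kedro.pipeline import Pipeline

    # hypothetical modular pipelines in this project
    from my_project.pipelines import inference, training


    def register_pipelines() -> Dict[str, Pipeline]:
        training_pipeline = training.create_pipeline()
        inference_pipeline = inference.create_pipeline()
        return {
            "__default__": training_pipeline + inference_pipeline,
            "training": training_pipeline,
            "inference": inference_pipeline,
        }

The "branch" is then chosen at run time, e.g. kedro run --pipeline=inference, rather than by a conditional node inside the DAG.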
  • SirTylerDurden
    05/21/2022, 12:08 AM
    I'm referring to the ability to run nodes conditionally depending on outputs of other nodes or configs. In many DAG solutions, you can have branching flows depending on conditions, but I don't see that supported in Kedro. I was wondering if I may be missing something.
  • datajoely
    05/21/2022, 3:04 AM
    Yeah, this is a deliberate choice not to support that, because we believe it harms reproducibility and makes things harder to debug.
  • RRoger
    05/21/2022, 6:16 AM
    What's the pattern for numerous files as raw data? I want to download about 2000 files of the same type with different dates, e.g. "senate_2006-03-30.xml". 1. Do I create a catalog entry for each file? 2. Does the download node output a list of length 2000, i.e. ["senate_2006-03-30", "senate_2006-03-31", ...], i.e. a 2000-line pipeline.py? Or is there some sort of clever templating?
  • datajoely
    05/21/2022, 6:35 AM
    PartitionedDataSet?
  • RRoger
    05/21/2022, 11:46 AM
    Yes, this solved the problem, thank you.
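For the record, a sketch of what such a catalog entry can look like (the entry name, path and underlying dataset type are assumptions; the node then receives a dictionary mapping each partition id to a load function):

    senate_files:
      type: PartitionedDataSet
      path: data/01_raw/senate  # hypothetical location of the ~2000 XML files
      dataset: text.TextDataSet
      filename_suffix: ".xml"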
  • Mackson
    05/24/2022, 12:37 AM
    Hello people, how should I work with chunking a huge dataset when applying a function? I found an issue, but it was not clear how I should deal with the context manager inside the function. Thanks!!!!
  • datajoely
    05/24/2022, 1:10 AM
    Are you using pandas? Can you post a snippet?
  • Mackson
    05/24/2022, 8:37 AM
    It's from my work, so I can't share it, but just consider a really huge pandas dataset that does not fit in memory, in a function where you just, let's say, add a column.
  • Mackson
    05/24/2022, 8:38 AM
    The only lazy evaluation I know is through Spark, which I don't have access to at the moment.
  • datajoely
    05/24/2022, 8:50 AM
    Sure - but is this a general pandas question or a Kedro one? This tutorial shows simple chunking + full dask: https://pythonspeed.com/articles/faster-pandas-dask/
  • Mackson
    05/24/2022, 8:51 AM
    It's a Kedro question; I know how to chunk outside the node context.
  • Mackson
    05/24/2022, 8:56 AM
    My question is: I know how _load can return the iterator (let's say I return the chunk iterator object from pandas), but how will the node itself know to apply all the steps inside the function to EACH chunk (not just one) without an outside loop (let's say a wrapper around the node)?
  • noklam
    05/24/2022, 10:02 AM
    I think what you get in a node will be a generator instead of a dataframe if you are using the chunk iterator. That looping would be logic that you write inside the node.
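A minimal sketch of that pattern (column names and chunk size are made up): the custom dataset's _load hands the node a chunk iterator, and the per-chunk loop lives in the node function:

    import pandas as pd

    # In the custom dataset, _load returns pandas' chunk iterator
    # instead of a single DataFrame, e.g.:
    #     def _load(self):
    #         return pd.read_csv(self._filepath, chunksize=100_000)

    def add_column(chunks) -> pd.DataFrame:
        """Node function: the per-chunk loop lives here."""
        processed = []
        for chunk in chunks:
            chunk["new_col"] = chunk["value"] * 2  # hypothetical step
            processed.append(chunk)
        # pd.concat re-materialises the full result; if that is also too
        # big, write each chunk out incrementally instead (see the
        # dataset sketch further down)
        return pd.concat(processed)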
  • Mackson
    05/24/2022, 10:54 AM
    Yeah, maybe returning a "map" will do the trick?
  • Mackson
    05/24/2022, 10:55 AM
    Or doing a whole new AbstractDataSet that will write the iterator
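A rough sketch of that idea: a custom dataset whose _save consumes an iterator of chunks and appends each one to disk, so the full result is never held in memory (untested; the class name and parameters are hypothetical):

    import pandas as pd

    from kedro.io import AbstractDataSet


    class ChunkedCSVDataSet(AbstractDataSet):
        def __init__(self, filepath: str, chunksize: int = 100_000):
            self._filepath = filepath
            self._chunksize = chunksize

        def _load(self):
            # lazy: hands the node a chunk iterator
            return pd.read_csv(self._filepath, chunksize=self._chunksize)

        def _save(self, chunks) -> None:
            for i, chunk in enumerate(chunks):
                # first chunk overwrites and writes the header, the rest append
                chunk.to_csv(
                    self._filepath,
                    mode="w" if i == 0 else "a",
                    header=i == 0,
                    index=False,
                )

        def _describe(self):
            return {"filepath": self._filepath, "chunksize": self._chunksize}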