advanced-need-help
  • Anish Shah @ WANDB, 10/11/2021, 2:40 PM
    I'm super excited to hear about this! Dataset previews
  • datajoely, 10/11/2021, 2:44 PM
    I'll be open: this is a 2022 thing. Features to expect before then:
    - Modular pipeline folding
    - Experiment comparisons (underpinned by Plotly in many places)
    We're also open to PRs, and you can raise feature request tickets on GitHub.
  • datajoely, 10/11/2021, 2:58 PM
    Here is a sneak peek at what the collapsible modular pipelines will look like 🔥
  • waylonwalker, 10/14/2021, 3:28 PM
    Is this specifically for Viz? That looks wicked cool!
  • datajoely, 10/14/2021, 3:34 PM
    If you use modular pipelines you get it for free 😊
  • datajoely, 10/14/2021, 3:35 PM
    Will be released soon
  • waylonwalker, 10/14/2021, 4:20 PM
    What constitutes a modular pipeline? Entries in the pipeline registry?
  • datajoely, 10/14/2021, 4:29 PM
    No, you have to set them up in a certain folder structure; the guide is here: https://kedro.readthedocs.io/en/stable/06_nodes_and_pipelines/03_modular_pipelines.html I've also made a quick demo showing how you could apply two modelling techniques with the same pipeline.
  • datajoely, 10/14/2021, 4:29 PM
    https://gist.github.com/datajoely/018607d5d721c747d742605494b822a3
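    For a sense of what that folder structure contains, here is a minimal sketch of the create_pipeline factory that kedro pipeline create scaffolds; the data_science name, path, and node function are illustrative placeholders, not taken from the thread:
    python
    # src/<package>/pipelines/data_science/pipeline.py (illustrative path)
    from kedro.pipeline import Pipeline, node

    def train_model(model_input_table):
        # Placeholder node function.
        ...

    def create_pipeline(**kwargs) -> Pipeline:
        # Each modular pipeline lives in its own package and exposes a
        # create_pipeline() factory like this one.
        return Pipeline(
            [
                node(train_model, inputs="model_input_table", outputs="model"),
            ]
        )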
  • waylonwalker, 10/14/2021, 6:46 PM
    Are these modular pipelines? We lean heavily on this. This is straight from the modular pipelines section of the docs, but I don't think it requires any change to your folder structure.
  • datajoely, 10/15/2021, 8:35 AM
    So the docs are really not where I want them to be on this one. We've been waiting for the Viz work to catch up, since it makes the whole concept much easier to articulate, and we also need a new version of the spaceflights tutorial that takes advantage of the functionality. What I would encourage you to do is run
    kedro pipeline create <pipeline_name>
    in a new project to see the folder structure in action. The clever part is articulated in this example:
    python
    final_pipeline = Pipeline(
        [
            pipeline(cook_pipeline, outputs={"grilled_meat": "new_name"}),
            pipeline(lunch_pipeline, inputs={"food": "new_name"}),
            node(...),
            ...,
        ]
    )
    Both cook_pipeline and lunch_pipeline are existing modular pipelines, but by using the pipeline function (not the Pipeline class) you are able to create an instance of them where you can swap catalog inputs/outputs.
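    To make that pattern concrete, here is a self-contained sketch of the same wiring; the node functions and dataset names (grill, make_lunch, meat, food) are invented placeholders, assuming the pipeline helper importable from kedro.pipeline:
    python
    from kedro.pipeline import Pipeline, node, pipeline

    def grill(meat):
        # Placeholder node function.
        return f"grilled {meat}"

    def make_lunch(food):
        # Placeholder node function.
        return f"lunch of {food}"

    cook_pipeline = Pipeline([node(grill, inputs="meat", outputs="grilled_meat")])
    lunch_pipeline = Pipeline([node(make_lunch, inputs="food", outputs="lunch")])

    # pipeline() re-instantiates an existing pipeline while remapping catalog
    # entries: cook_pipeline's output "grilled_meat" and lunch_pipeline's
    # input "food" are both renamed to the shared dataset "new_name",
    # which wires the two pipelines together.
    final_pipeline = Pipeline(
        [
            pipeline(cook_pipeline, outputs={"grilled_meat": "new_name"}),
            pipeline(lunch_pipeline, inputs={"food": "new_name"}),
        ]
    )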
  • user, 10/15/2021, 6:57 PM
    We definitely need to articulate modular pipelines better. I actually have this idea for an online tutorial: write music with Kedro (I plan to name it "Your data pipeline is a symphony", or something equally pretentious).
  • user, 10/15/2021, 6:59 PM
    But let's say your pipeline is like a song: it has a structure
    * Intro
    * Verse 1
    * Chorus
    * Verse 2
    * Chorus
    * Outro
    Each modular pipeline is like a materialisation of that structure but with a different instrument, i.e. you can namespace this structure with guitar, piano, vocal, etc. and connect them all together to make the final master pipeline (the entire song).
  • user, 10/15/2021, 7:00 PM
    Not sure if this helps or accidentally complicates the matter further 😓
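    A minimal sketch of the namespacing idea behind the song analogy; the node functions are placeholders, assuming the namespace argument of the pipeline helper:
    python
    from kedro.pipeline import Pipeline, node, pipeline

    def play_intro(score):
        # Placeholder node function.
        ...

    def play_verse(score):
        # Placeholder node function.
        ...

    # The shared song structure, written once.
    song_structure = Pipeline(
        [
            node(play_intro, inputs="score", outputs="intro_track"),
            node(play_verse, inputs="intro_track", outputs="verse_track"),
        ]
    )

    # One materialisation per instrument: namespacing prefixes every dataset
    # name, so guitar.score, piano.score and vocal.score stay independent.
    full_song = (
        pipeline(song_structure, namespace="guitar")
        + pipeline(song_structure, namespace="piano")
        + pipeline(song_structure, namespace="vocal")
    )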
  • user, 10/15/2021, 7:03 PM
    In any case, I've got a demo build of the expand/collapse modular pipelines feature here: http://kedro-viz-fe.s3-website.eu-west-2.amazonaws.com/KED-2517/expand-collapse-modular-pipeline-fe/?data=spaceflights if anyone wants to play. The final version will have some more changes, but we are hoping this change de-clutters the pipeline visualisation as well as encouraging people to build pipelines in a more modular manner.
  • user, 10/16/2021, 1:56 AM
    Kedro - Memory management https://stackoverflow.com/questions/69592125/kedro-memory-management
  • waylonwalker, 10/16/2021, 2:25 PM
    I often use the CLI to create pipelines, but not every time. I use find-kedro to collect all pipelines into a dictionary for me. Then after that we make some special ones that are slices of those, which make it easy for the project to get scheduled quickly. One thing I would really like to see is the ability to just import pipelines, rather than copying them across projects. The catalog being in YAML seems to make this difficult. I had done it in some early pipelines, but it was hard to keep up with newer versions of Kedro while doing something unsupported like that.
    python
    from .pipelines import lunch_pipeline
    from other_project import cook_pipeline  # simply import from another project

    final_pipeline = Pipeline(
        [
            pipeline(cook_pipeline, outputs={"grilled_meat": "new_name"}),
            pipeline(lunch_pipeline, inputs={"food": "new_name"}),
            node(...),
            ...,
        ]
    )
  • waylonwalker, 10/16/2021, 2:29 PM
    This helps quite a bit. I do see where you are constructing a Pipeline with other pipelines, using the pipeline method to map some IO, rather than just the nodes.
    python
    final_pipeline = Pipeline(
        [
            *pipeline(cook_pipeline, outputs={"grilled_meat": "new_name"}).nodes,
            *pipeline(lunch_pipeline, inputs={"food": "new_name"}).nodes,
            node(...),
            ...,
        ]
    )
    This is closer to how I have been doing it. It achieves a similar effect of reusing pipelines, but loses the history of where nodes came from. And typically I am not passing in the inputs/outputs here.
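    For comparison, the same composition can be written without flattening to nodes at all, since Pipeline objects support addition; a sketch (reusing the cook_pipeline and lunch_pipeline names from above), which also preserves where each node came from:
    python
    # Equivalent composition without unpacking .nodes.
    final_pipeline = (
        pipeline(cook_pipeline, outputs={"grilled_meat": "new_name"})
        + pipeline(lunch_pipeline, inputs={"food": "new_name"})
    )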
  • SandyShocks™, 10/17/2021, 2:25 PM
    Any reason why writing explicitly works but not via the catalog? I tried to replicate how the save method got its variables via the catalog.
  • datajoely, 10/18/2021, 8:30 AM
    Hi Sandy, if you do catalog.save() do you get the same error? Additionally, it looks like you're having a DeltaTable-specific merge issue to do with types. You can still use Delta via the Python API, but our full support for Delta is WIP: https://github.com/quantumblacklabs/kedro/pull/964
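    For reference, a rough sketch of what using Delta via the Python API can look like inside a node, assuming the delta-spark package; the table path and merge keys are invented placeholders:
    python
    from delta.tables import DeltaTable  # from the delta-spark package

    def upsert_events(spark, new_events_df):
        # Merge incoming rows into an existing Delta table by key; the path
        # and column names here are illustrative only.
        target = DeltaTable.forPath(spark, "/data/03_primary/events")
        (
            target.alias("t")
            .merge(new_events_df.alias("s"), "t.event_id = s.event_id")
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute()
        )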
  • SandyShocks™, 10/18/2021, 11:31 PM
    I plucked it out of my custom PySpark _save method. I wanted to test if my save args were being respected.
  • datajoely, 10/19/2021, 7:31 AM
    PySpark Delta table
  • Edmund M, 10/21/2021, 8:27 PM
    Getting this error:
    %load_ext rpy2.ipython
    
    2021-10-21 14:39:23,435 - rpy2.rinterface_lib.callbacks - WARNING - R[write to console]: Error in .Primitive("as.environment")("package:utils") : 
      no item called "package:utils" on the search list
  • Edmund M, 10/21/2021, 8:28 PM
    Reposting from #846330075535769601 because both statements in #778994879958876161 say to post here.
  • datajoely, 10/22/2021, 9:02 AM
    Hi @User - this looks like an rpy2 issue rather than a Python one. StackOverflow suggests it is something to do with the visible R environment: https://stackoverflow.com/questions/56199823/unable-to-run-rpy2-under-alpine-linux-no-item-called-packageutils-on-the-se
  • Edmund M, 10/22/2021, 2:17 PM
    Thanks! I think I came to the same conclusion last night; I couldn't install rpy2 and Jupyter and get them to work even in a fresh conda environment.
  • datajoely, 10/22/2021, 2:18 PM
    Okay, let us know if you do get it working, as we could maybe put an example in the docs.
  • Edmund M, 10/22/2021, 2:20 PM
    Will do! Is conda support going away in the new version? I saw kedro install is, but I'm assuming conda install is still fair game.
  • datajoely, 10/22/2021, 2:28 PM
    Good question! So we've had a lot of feedback that the install flow was confusing to users. That's because kedro install does a few things behind the scenes, like a pip-tools compile turning requirements.in into requirements.txt. In the next major version things will be simplified into the following:
    1. kedro build-reqs is still going to prepare src/requirements.txt so that the dependencies are fully resolved.
    2. We then recommend running pip install -r src/requirements.txt to install your compiled dependencies.
    I'm pretty sure conda install will work here too, but we do know that conda sometimes has issues with the Kedro optional dependencies like pip install "kedro[pandas]".
  • simon_myway, 11/03/2021, 1:13 PM
    Hi team, a question about datasets: what is the recommended way to handle an incremental dataset based on a SQL-like database (e.g. Google BigQuery)? Context: I'm building a data pipeline with a daily fetch of new data, which is inserted into a Google BigQuery table, and only the new data should then be processed downstream.
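    One possible sketch for that incremental fetch, assuming a Kedro version that ships pandas.GBQQueryDataSet; the project, table, and ingested_at watermark column are invented placeholders:
    python
    from kedro.extras.datasets.pandas import GBQQueryDataSet

    # Load only rows newer than a watermark tracked by the pipeline; the
    # project, table and timestamp literal here are illustrative only.
    new_rows = GBQQueryDataSet(
        sql="""
            SELECT *
            FROM `my-project.my_dataset.events`
            WHERE ingested_at > TIMESTAMP('2021-11-02')
        """,
        project="my-project",
    )
    daily_increment = new_rows.load()  # returns a pandas DataFrame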