beginners-need-help
  • d

    datajoely

    12/14/2021, 2:56 PM
    So we err on the side of explicit rather than implicit, so we reserve that kind of thing for the human-input parts. What you could do is declare a
    YAMLDataSet
    for the parameters you generate at runtime, for safekeeping, then either an automatic or manual process outside of Kedro could mirror those in your actual
    parameters.yml
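    A minimal sketch of what that runtime-parameters entry could look like in the catalog (the dataset name and filepath here are hypothetical, not from the thread):
    yaml
    runtime_parameters:
      type: yaml.YAMLDataSet
      filepath: data/08_reporting/runtime_parameters.yml  # hypothetical location for the generated parameters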
  • n

    NC

    12/14/2021, 2:58 PM
    Ahh, great idea, thank you! 😀
  • d

    datajoely

    12/14/2021, 2:58 PM
    Yeah the balance between readability and dynamism is super hard to find
  • d

    datajoely

    12/14/2021, 2:58 PM
    but I think that works okay
  • n

    NC

    12/14/2021, 2:58 PM
    Definitely!
  • j

    j c h a r l e s

    12/14/2021, 6:08 PM
    This seems like it addresses my concerns almost fully, thank you for sharing
  • r

    Rroger

    12/15/2021, 11:07 PM
    I ran
    kedro viz
    but got “there is no active Kedro session”. Does anyone know how to make it work? I had managed to run it successfully before; I'm just not sure what has changed since then.
  • d

    datajoely

    12/16/2021, 8:47 AM
    Viz session issue
  • c

    czix

    12/16/2021, 1:26 PM
    Is there a way to output a tuple in a node, without having to split it into separate variables?
  • d

    datajoely

    12/16/2021, 1:48 PM
    Of course - just in your regular function:
    python
    from typing import Tuple

    def my_node_func() -> Tuple[int, int]:
        return tuple([1, 2])
  • c

    czix

    12/16/2021, 2:03 PM
    But in the pipeline I have to specify two output variables, e.g., in your example:
    python
    Pipeline([
        node(func=my_node_func, inputs=None, outputs=["a", "b"])
    ])
    Or am I wrong?
  • d

    datajoely

    12/16/2021, 2:06 PM
    ah you just provide one output!
  • c

    czix

    12/16/2021, 2:08 PM
    Like
    outputs=["ab"]
    ?
  • d

    datajoely

    12/16/2021, 2:10 PM
    So in this case you would just need
    outputs="a"
    and
    "a"
    would store the tuple
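    As a quick illustration (the dataset name "ab" here is just an example), the single output name holds the whole tuple:
    python
    from kedro.pipeline import Pipeline, node

    Pipeline([
        node(func=my_node_func, inputs=None, outputs="ab"),  # "ab" stores the (1, 2) tuple
    ])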
  • c

    czix

    12/16/2021, 2:11 PM
    hmm, it seems to work for datatypes other than a pandas DataFrame, e.g., if I return two DataFrames in a tuple, it seems to want to split them
  • d

    datajoely

    12/16/2021, 2:15 PM
    are you getting an error if you try that with Pandas?
  • d

    datajoely

    12/16/2021, 2:17 PM
    One thing you can do if it's giving you an error is in
    catalog.yml
    yaml
    a:
      type: MemoryDataSet
      copy_mode: copy
  • c

    czix

    12/16/2021, 2:27 PM
    No, it was actually my own fault. I forgot to remove the [] around the output as you said. Thank you!
  • d

    Dhaval

    12/16/2021, 4:23 PM
    Hi everyone, kinda new to Kedro. I was looking for some examples where I can pass different datasets to the same pipeline (reusing the same pipeline code for different datasets) to process information, but I was unable to find anything. Can anyone help?
  • d

    datajoely

    12/16/2021, 4:29 PM
    This is exactly what modular pipelines are for - I have a work in progress example project here https://github.com/datajoely/modular-spaceflights
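    A rough sketch of the idea with the pipeline() helper, in case it helps (the function and dataset names here are made up for illustration):
    python
    from kedro.pipeline import Pipeline, node
    from kedro.pipeline.modular_pipeline import pipeline  # on newer Kedro versions: from kedro.pipeline import pipeline

    def preprocess(df):
        return df.dropna()

    base = Pipeline([node(preprocess, inputs="raw_data", outputs="clean_data")])

    # re-use the same pipeline for two different datasets by remapping its inputs/outputs
    companies_pipe = pipeline(
        base,
        inputs={"raw_data": "companies"},
        outputs={"clean_data": "clean_companies"},
        namespace="companies",
    )
    shuttles_pipe = pipeline(
        base,
        inputs={"raw_data": "shuttles"},
        outputs={"clean_data": "clean_shuttles"},
        namespace="shuttles",
    )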
  • d

    Dhaval

    12/16/2021, 4:30 PM
    Thanks for the fast response. I'll go through it 😃
  • r

    Rroger

    12/16/2021, 9:21 PM
    I suppose I could output a dummy dataset.
  • r

    Rroger

    12/17/2021, 12:54 AM
    Is there a built-in dataset that saves to a database? Or do we have to create our own class for that?
  • d

    datajoely

    12/17/2021, 7:35 AM
    Yes, there are pandas and Spark database connectors:
    kedro.extras.datasets.pandas.SQLQueryDataSet
    kedro.extras.datasets.pandas.SQLTableDataSet
    kedro.extras.datasets.spark.SparkJDBCDataSet
    kedro.extras.datasets.spark.SparkHiveDataSet
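    For example, a pandas.SQLTableDataSet catalog entry typically looks something like this (the dataset name, table and connection string below are illustrative, not from the thread):
    yaml
    # conf/base/catalog.yml
    cars_table:
      type: pandas.SQLTableDataSet
      table_name: cars
      credentials: db_credentials
      save_args:
        if_exists: append

    # conf/local/credentials.yml
    db_credentials:
      con: postgresql://user:password@localhost:5432/mydb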
  • r

    RRoger

    12/17/2021, 9:25 PM
    I tried modifying the data ingestion pipeline in https://github.com/datajoely/modular-spaceflights by adding a node to save to a db; in
    data_ingestion/pipeline.py
    python
    node(
        name="upload_to_db",
        func=lambda x: x,
        inputs="shuttles",
        outputs="shuttles_table",
    ),
    in
    catalog_01_raw.yml
    yaml
    shuttles_table:
      type: pandas.SQLTableDataSet
      table_name: shuttles
      credentials: postgres
      save_args:
        if_exists: replace
    but the log shows that
    shuttles_table
    is a
    MemoryDataSet
    2021-12-18 08:24:36,993 - kedro.pipeline.node - INFO - Running node: <lambda>([shuttles]) -> [data_ingestion.shuttles_table]
    2021-12-18 08:24:36,993 - kedro.io.data_catalog - INFO - Saving data to `data_ingestion.shuttles_table` (MemoryDataSet)...
    And the table is not created in the database.
  • r

    RRoger

    12/17/2021, 9:37 PM
    ✔ The solution was to add the output in the
    new_ingestion_pipeline
    Pipeline
    . I didn't realise that creating another function just to add a namespace to an existing
    Pipeline
    is how it's done.
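    For anyone hitting the same issue, the fix looks roughly like this (a sketch modelled on the modular-spaceflights layout, not copied from the repo):
    python
    from kedro.pipeline import Pipeline
    from kedro.pipeline.modular_pipeline import pipeline  # on newer Kedro versions: from kedro.pipeline import pipeline

    def new_ingestion_pipeline(base: Pipeline) -> Pipeline:
        return pipeline(
            base,
            namespace="data_ingestion",
            # declare the output so it keeps its plain name and resolves to the catalog entry
            outputs={"shuttles_table"},
        )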
  • r

    RRoger

    12/17/2021, 9:53 PM
    If a node (node `B`) is dependent on a previous node (node `A`) having uploaded to a database (e.g. `some_table` as `pandas.SQLTableDataSet`) and I use `some_table` as the input for `B`, does `B` automatically try to download `some_table` to memory (if not already in memory)? I would not like the data downloaded if:
    - the data is large, hence most of the pipeline time is spent on downloading
    - `B`'s code is to run SQL queries without ever requiring the data locally
  • d

    datajoely

    12/18/2021, 8:48 AM
    So we don’t support SQL as a remote execution engine - today we bring things into the Python world. If you use PySpark, it will expose SQL tables as DataFrames and do some of this lazily.
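    A sketch of what that could look like with the Spark JDBC connector (the URL, table and credentials names below are placeholders, not from the thread):
    yaml
    some_table_spark:
      type: spark.SparkJDBCDataSet
      url: jdbc:postgresql://localhost:5432/mydb
      table: some_table
      credentials: db_spark_credentials
    Spark only materialises the table when an action forces it, so simple filters can be pushed down to the database instead of pulling everything into memory first.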
  • d

    datajoely

    12/18/2021, 8:48 AM
    We use SQL as a storage layer not an execution engine
  • d

    datajoely

    12/18/2021, 8:49 AM
    If you need to use SQL for execution - maybe dbt is right for the data engineering, and Kedro kicks in when doing the ML engineering