Powered by Linen
advanced-need-help
  • d

    datajoely

    09/22/2021, 12:06 PM
Well, modular pipelines and configuration are a scalable pattern for managing large projects, but you should also be wary of premature optimisation
  • w

    Waldrill

    09/22/2021, 12:13 PM
Yep, we do have the solution currently running, and the client asked to scale from 4 to 20 .. we started seeing problems with duplicated stuff, but it is done ... But the project's next step is to scale up to hundreds, and now it looks like it is time to think of doing it in a way that will prevent a support nightmare 😅
  • d

    datajoely

    09/22/2021, 12:14 PM
    in which case - I'd optimise for the future, but would warn that the namespacing of parameters is changing to be consistent with the way we namespace catalog entries in 0.18.0
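For context, a hedged sketch of what that convention might look like (the keys here are purely illustrative; check the 0.18.0 release notes for the actual behaviour): parameters get grouped under the pipeline's namespace and are referenced with the same dotted form already used for namespaced catalog entries.

```yaml
# conf/base/parameters.yml (illustrative sketch only)
data_science:
  model_options:
    test_size: 0.2

# referenced from a pipeline definition as
#   params:data_science.model_options
# mirroring dotted catalog names such as data_science.model_input_table
```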
  • w

    Waldrill

    09/22/2021, 12:15 PM
Thanks, I'll take a look at it. By the way, thank you very much ... this was very helpful; I now have more background to keep discussing it internally and find a way forward.
  • d

    datajoely

    09/22/2021, 12:16 PM
    Good luck! Do shout if you have any other questions
  • u

    user

    09/29/2021, 3:08 PM
    How to dynamically pass save_args to kedro catalog? https://stackoverflow.com/questions/69378898/how-to-dynamically-pass-save-args-to-kedro-catalog
  • e

    ende

    10/01/2021, 6:53 PM
If you're trying to create a new custom DataSet where the _load method wraps some other library's read operation that only takes file paths (not file-like objects, etc.)... what's the best general strategy here using fsspec?
  • d

    datajoely

    10/04/2021, 8:58 AM
I would recommend taking an existing dataset core to Kedro like pandas.CSVDataSet and altering it for your purposes - since that's all tested to work https://kedro.readthedocs.io/en/stable/kedro.extras.datasets.pandas.CSVDataSet.html
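One common general strategy for the path-only-reader case (a minimal standalone sketch, not Kedro's internal implementation; `load_via_local_copy` and the reader argument are illustrative names): let fsspec stream the possibly-remote file into a local temporary file, then hand that temp path to the library that only accepts plain paths.

```python
import shutil
import tempfile

import fsspec


def load_via_local_copy(uri: str, path_only_reader):
    """Stream a (possibly remote) file to a local temporary file,
    then call a reader that only accepts plain file paths.

    Illustrative sketch of the fsspec pattern, not a Kedro API.
    """
    # fsspec.open resolves the protocol (s3://, gcs://, plain path, ...)
    with fsspec.open(uri, "rb") as remote:
        with tempfile.NamedTemporaryFile(delete=False) as tmp:
            shutil.copyfileobj(remote, tmp)
            local_path = tmp.name
    # The wrapped library never sees fsspec - only a plain local path
    return path_only_reader(local_path)
```

Inside a real custom dataset this would live in `_load`, with the reader being the wrapped library's read function.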
  • u

    user

    10/08/2021, 7:34 AM
    Want to run Specific node or group of nodes and capture the output into a variable in kedro jupyter lab https://stackoverflow.com/questions/69492121/want-to-run-specific-node-or-group-of-nodes-and-capture-the-output-into-a-variab
  • s

    simon_myway

    10/08/2021, 2:15 PM
Hi team, I have been using Kedro for a couple of years and have recently been looking into deploying a Kedro pipeline with Airflow. As each node becomes an Airflow task, is there a way to specify different requirements for each node/task, since the nodes will use different libraries and I would like to avoid including unused ones? Thanks for the help!
  • d

    datajoely

    10/08/2021, 2:25 PM
    So we've actually recently released the ability to package modular pipelines with local dependencies. The full docs are here https://kedro.readthedocs.io/en/stable/06_nodes_and_pipelines/03_modular_pipelines.html#package-a-modular-pipeline There is some nuance here, we're still working on this experience and the airflow stuff is still downstream, but this may get you on your way
  • d

    datajoely

    10/08/2021, 2:25 PM
essentially if you include a requirements.txt within a modular pipeline subfolder, it will take that as gospel for that particular pipeline
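A sketch of the layout being described (the project and pipeline names here are made up): each modular pipeline subfolder carries its own requirements.txt, which `kedro pipeline package <name>` is documented to pick up for that pipeline only.

```shell
# Illustrative layout only - project and pipeline names are invented.
mkdir -p src/my_project/pipelines/data_engineering
cat > src/my_project/pipelines/data_engineering/requirements.txt <<'EOF'
pandas>=1.3
EOF
# The per-pipeline requirements sit next to the pipeline code:
ls src/my_project/pipelines/data_engineering
```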
  • u

    user

    10/08/2021, 5:01 PM
    Kedro cannot find run https://stackoverflow.com/questions/69499388/kedro-cannot-find-run
  • m

    mlemainque

    10/11/2021, 9:09 AM
Hello Kedro team! I am new to Kedro and I am trying to assess in which ways my team could use this framework in their daily work to replace other heavy tools. For now it is almost exactly the framework we had been looking for for so long, great work! One of my concerns is about incremental datasets, as we often work with huge partitioned datasets fed on a regular basis. I have two questions:
1. Is it planned to integrate the incremental behaviour into datasets other than fsspec-based ones (such as SQL)? The checkpoint could be based on a datetime or incremental-id column in the table...
2. Is there any way to load one node's output content? A typical use case is when we want to transform a non-incremental dataset into an incremental one: we read the input data and do an anti-join with the output's previous content. But I saw it is not currently possible to have one dataset both as input and output of one node (even though it should not be a problem to solve the DAG order). Thanks for your time!
  • d

    datajoely

    10/11/2021, 9:27 AM
Hi @User glad to hear Kedro helps some of your team's workflow.
1. It hasn't been a feature I recall being requested before as part of the existing IncrementalDataSet. I'd love to see what that would look like as YAML pseudocode if you have any ideas. Quite a lot of people template SQL calls via custom datasets, but we've been reluctant to support something like that out of the box for security reasons.
2. I'm not entirely sure what you mean here - I guess you could have two datasets pointing at the same data, one in incremental form and one not, and perform the operation in the node. At a glance, what you're describing doesn't sound acyclic, but I might be wrong, so I'm keen to understand more.
  • m

    mlemainque

    10/11/2021, 9:43 AM
For the first point, I was thinking of something like this:
```yaml
incremental_sql_dataset:
  type: SQLQueryDataSet
  sql: SELECT * FROM table WHERE id > %(checkpoint)s
  checkpoint:
    column: id      # which column to use to update the checkpoint based on the loaded content
    filepath: ...   # where to store the checkpoint (same as for partitioned incremental datasets)
```
    But you're right it could easily be done with a custom implementation
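That custom implementation could be sketched roughly as follows (stdlib sqlite3 used purely for illustration; the class name, table schema, and checkpoint-file layout are all invented, and this is not a Kedro dataset API - a real version would subclass AbstractDataSet):

```python
import json
import sqlite3
from pathlib import Path


class IncrementalSQLQuery:
    """Load only rows with id greater than the last checkpoint, then
    advance the checkpoint - mimicking IncrementalDataSet semantics
    for a SQL table. Illustrative sketch, not a Kedro dataset."""

    def __init__(self, db_path: str, table: str, checkpoint_path: str):
        self.db_path = db_path
        self.table = table
        self.checkpoint = Path(checkpoint_path)

    def _read_checkpoint(self) -> int:
        # The checkpoint file plays the role of the filepath: entry above
        if self.checkpoint.exists():
            return json.loads(self.checkpoint.read_text())["id"]
        return -1  # nothing loaded yet

    def load(self) -> list:
        last_id = self._read_checkpoint()
        with sqlite3.connect(self.db_path) as conn:
            rows = conn.execute(
                f"SELECT id, value FROM {self.table} WHERE id > ? ORDER BY id",
                (last_id,),
            ).fetchall()
        if rows:
            # Advance the checkpoint to the highest id seen in this load
            self.checkpoint.write_text(json.dumps({"id": rows[-1][0]}))
        return rows
```

Each call to `load()` then returns only the rows added since the previous call, which is the checkpoint behaviour the YAML above describes.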
  • m

    mlemainque

    10/11/2021, 9:59 AM
For the second point, declaring the same dataset twice would definitely work but would not be very elegant, would it? Below is the use case I am describing. Even though the anti-join can become a costly task, it can be worth it if the following pipeline is even more costly (ML tasks).
```python
from datetime import datetime
from typing import Dict

import pandas as pd
from kedro.pipeline import node


def make_incremental(input_data: pd.DataFrame, output_partitioned_data: Dict) -> Dict:
    # Anti-join: keep only input rows whose id has not already been
    # written to a previous partition of the output dataset
    for _, load_output in output_partitioned_data.items():
        input_data = input_data.merge(load_output()[['id']], on='id', how='left', indicator=True)
        input_data = input_data[input_data['_merge'] == 'left_only'].drop(columns=['_merge'])
    return {str(datetime.utcnow()): input_data}


# The output dataset would need to be both an input and an output here
node(make_incremental, ['input_dataset', 'output_partitioned_dataset'], 'output_partitioned_dataset')
```
  • d

    datajoely

    10/11/2021, 10:00 AM
So I actually think point 2 could also be done with a custom dataset - essentially inherit from PartitionedDataSet and do the logic you describe in there too
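A file-level sketch of that idea (stdlib only; the function name, CSV layout, and partition naming are illustrative, and this is deliberately standalone rather than an actual PartitionedDataSet subclass): on each save, collect the ids already present in previous partition files and write only the unseen input rows as a fresh partition.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path


def write_new_partition(input_rows: list, partition_dir: str) -> Path:
    """Anti-join the input against ids found in existing partitions,
    then write only the unseen rows as a new timestamped partition.

    Illustrative sketch of the incremental-save logic, not a Kedro API.
    """
    out_dir = Path(partition_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Gather every id written by previous runs
    seen = set()
    for part in out_dir.glob("*.csv"):
        with part.open() as f:
            seen.update(row["id"] for row in csv.DictReader(f))

    # Keep only rows not seen before (the anti-join)
    new_rows = [r for r in input_rows if r["id"] not in seen]

    # Timestamped partition name, like {str(datetime.utcnow()): ...} above
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%f")
    target = out_dir / f"{stamp}.csv"
    with target.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "value"])
        writer.writeheader()
        writer.writerows(new_rows)
    return target
```

Inside a PartitionedDataSet subclass the same logic would sit in the save path, so the node itself never needs to read its own output.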
  • m

    mlemainque

    10/11/2021, 10:02 AM
    I am not sure as the node won't pass the output to the inner function. It would require a custom implementation of the node
  • d

    datajoely

    10/11/2021, 10:03 AM
    Possibly - maybe a before_node_run hook is an option too
  • m

    mlemainque

    10/11/2021, 10:03 AM
The Node._run_with_dict method should also pass the outputs if they are in the inner function's signature
  • d

    datajoely

    10/11/2021, 10:04 AM
    Yeah it's an interesting question
  • d

    datajoely

    10/11/2021, 10:04 AM
I've never seen someone request this before, so I'd be very keen to see where you land and learn how we can make this easier for you in the future
  • m

    mlemainque

    10/11/2021, 10:06 AM
    Ok, thanks for your help. If we find a convenient and elegant solution to this use case we'll probably come back to you
  • d

    datajoely

    10/11/2021, 10:07 AM
    Please do - it's a very cool problem
  • m

    mlemainque

    10/11/2021, 2:36 PM
Hi again, I am wondering how difficult it would be for you to add more interactivity to kedro-viz and finally have it somehow integrated into our favourite IDEs? A first easy step I think would be to add hyperlinks:
* From a node you could go directly to the inner func's code in VS Code thanks to a vscode:// hyperlink
* From a FS dataset you could see the list of files and open them thanks to a file:// hyperlink, or even display a table preview directly in kedro-viz
* From an image/matplotlib dataset you could display a preview...
  • d

    datajoely

    10/11/2021, 2:37 PM
We may or may not be working on a prototype for this 🤫
  • d

    datajoely

    10/11/2021, 2:37 PM
    The FastAPI rewrite allows all of this
  • d

    datajoely

    10/11/2021, 2:37 PM
    cc @User
  • m

    mlemainque

    10/11/2021, 2:38 PM
    That would be amazing... Would you need any beta tester, I'm here 😄