Powered by Linen
advanced-need-help
  • d

    datajoely

    09/22/2021, 12:06 PM
Well, modular pipelines and configuration are a scalable pattern for managing large projects, but you should also be wary of premature optimisation
  • w

    Waldrill

    09/22/2021, 12:13 PM
Yep, we do have the solution currently running, and the client asked to scale from 4 to 20 .. we started seeing problems with duplicated stuff, but it is done ... But the project's next step is to scale up to hundreds, and now it looks like it is time to think of doing it in a way that will prevent a support nightmare 😅
  • d

    datajoely

    09/22/2021, 12:14 PM
    in which case - I'd optimise for the future, but would warn that the namespacing of parameters is changing to be consistent with the way we namespace catalog entries in 0.18.0
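For context, a hedged sketch of what that convention might look like (the keys here are purely illustrative; check the 0.18.0 release notes for the actual behaviour): parameters get grouped under the pipeline's namespace and are referenced with the same dotted form already used for namespaced catalog entries.

```yaml
# conf/base/parameters.yml (illustrative sketch only)
data_science:
  model_options:
    test_size: 0.2

# referenced from a pipeline definition as
#   params:data_science.model_options
# mirroring dotted catalog names such as data_science.model_input_table
```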
  • w

    Waldrill

    09/22/2021, 12:15 PM
Thanks, I'll take a look at it. By the way, thank you very much ... this was very helpful; I now have more background to keep discussing it internally and find a way forward.
  • d

    datajoely

    09/22/2021, 12:16 PM
    Good luck! Do shout if you have any other questions
  • u

    user

    09/29/2021, 3:08 PM
    How to dynamically pass save_args to kedro catalog? https://stackoverflow.com/questions/69378898/how-to-dynamically-pass-save-args-to-kedro-catalog
  • e

    ende

    10/01/2021, 6:53 PM
If you're trying to create a new custom DataSet where the _load method wraps some other library's read operation that only takes file paths (not file-like objects, etc.)... what's the best general strategy here using fsspec?
  • d

    datajoely

    10/04/2021, 8:58 AM
I would recommend taking an existing dataset core to Kedro like pandas.CSVDataSet and altering it for your purposes - since that's all tested to work https://kedro.readthedocs.io/en/stable/kedro.extras.datasets.pandas.CSVDataSet.html
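One common general strategy for the path-only-reader case (a minimal standalone sketch, not Kedro's internal implementation; `load_via_local_copy` and the reader argument are illustrative names): let fsspec stream the possibly-remote file into a local temporary file, then hand that temp path to the library that only accepts plain paths.

```python
import shutil
import tempfile

import fsspec


def load_via_local_copy(uri: str, path_only_reader):
    """Stream a (possibly remote) file to a local temporary file,
    then call a reader that only accepts plain file paths.

    Illustrative sketch of the fsspec pattern, not a Kedro API.
    """
    # fsspec.open resolves the protocol (s3://, gcs://, plain path, ...)
    with fsspec.open(uri, "rb") as remote:
        with tempfile.NamedTemporaryFile(delete=False) as tmp:
            shutil.copyfileobj(remote, tmp)
            local_path = tmp.name
    # The wrapped library never sees fsspec - only a plain local path
    return path_only_reader(local_path)
```

Inside a real custom dataset this would live in `_load`, with the reader being the wrapped library's read function.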
  • u

    user

    10/08/2021, 7:34 AM
    Want to run Specific node or group of nodes and capture the output into a variable in kedro jupyter lab https://stackoverflow.com/questions/69492121/want-to-run-specific-node-or-group-of-nodes-and-capture-the-output-into-a-variab
  • s

    simon_myway

    10/08/2021, 2:15 PM
Hi team, I have been using Kedro for a couple of years and have recently been looking into deploying a Kedro pipeline with Airflow. As each node becomes an Airflow task, is there a way to specify different requirements for each node/task, since the nodes will use different libraries and I would like to avoid including unused ones? Thanks for the help!
  • d

    datajoely

    10/08/2021, 2:25 PM
    So we've actually recently released the ability to package modular pipelines with local dependencies. The full docs are here https://kedro.readthedocs.io/en/stable/06_nodes_and_pipelines/03_modular_pipelines.html#package-a-modular-pipeline There is some nuance here, we're still working on this experience and the airflow stuff is still downstream, but this may get you on your way
  • d

    datajoely

    10/08/2021, 2:25 PM
essentially if you include a requirements.txt within a modular pipeline subfolder, it will take that as gospel for that particular pipeline
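A sketch of the layout being described (the project and pipeline names here are made up): each modular pipeline subfolder carries its own requirements.txt, which `kedro pipeline package <name>` is documented to pick up for that pipeline only.

```shell
# Illustrative layout only - project and pipeline names are invented.
mkdir -p src/my_project/pipelines/data_engineering
cat > src/my_project/pipelines/data_engineering/requirements.txt <<'EOF'
pandas>=1.3
EOF
# The per-pipeline requirements sit next to the pipeline code:
ls src/my_project/pipelines/data_engineering
```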
  • u

    user

    10/08/2021, 5:01 PM
    Kedro cannot find run https://stackoverflow.com/questions/69499388/kedro-cannot-find-run
  • m

    mlemainque

    10/11/2021, 9:09 AM
Hello Kedro team! I am new to Kedro and I am trying to assess in which ways my team could use this framework in their daily work to replace other heavy tools. For now it is almost exactly the framework we had been looking for for so long, great work! One of my concerns is about incremental datasets, as we often work with huge partitioned datasets fed on a regular basis. I have two questions:
1. Is it planned to integrate the incremental behaviour into datasets other than fsspec-based ones (such as SQL)? The checkpoint could be based on a datetime or incremental-id column in the table...
2. Is there any way to load one node's output content? A typical use case is when we want to transform a non-incremental dataset into an incremental one: we read the input data and do an anti-join with the output's previous content. But I saw it is not currently possible to have one dataset both as input and output of one node (even though it should not be a problem to solve the DAG order). Thanks for your time!
  • d

    datajoely

    10/11/2021, 9:27 AM
Hi @User glad to hear Kedro helps some of your team's workflow.
1. It hasn't been a feature I recall being requested before as part of the existing IncrementalDataSet. I'd love to see what that would look like as YAML pseudocode if you have any ideas. Quite a lot of people template SQL calls via custom datasets, but we've been reluctant to support something like that out of the box for security reasons.
2. I'm not entirely sure what you mean here - I guess you could have two datasets pointing at the same data, one in incremental form and one not, and perform the operation in the node. At a glance, what you're describing doesn't sound acyclic, but I might be wrong, so I'm keen to understand more.
  • m

    mlemainque

    10/11/2021, 9:43 AM
For the first point, I was thinking of something like this:
```yaml
incremental_sql_dataset:
  type: SQLQueryDataSet
  sql: SELECT * FROM table WHERE id > %(checkpoint)s
  checkpoint:
    column: id      # which column to use to update the checkpoint based on the loaded content
    filepath: ...   # where to store the checkpoint (same as for partitioned incremental datasets)
```
    But you're right it could easily be done with a custom implementation
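That custom implementation could be sketched roughly as follows (stdlib sqlite3 used purely for illustration; the class name, table schema, and checkpoint-file layout are all invented, and this is not a Kedro dataset API - a real version would subclass AbstractDataSet):

```python
import json
import sqlite3
from pathlib import Path


class IncrementalSQLQuery:
    """Load only rows with id greater than the last checkpoint, then
    advance the checkpoint - mimicking IncrementalDataSet semantics
    for a SQL table. Illustrative sketch, not a Kedro dataset."""

    def __init__(self, db_path: str, table: str, checkpoint_path: str):
        self.db_path = db_path
        self.table = table
        self.checkpoint = Path(checkpoint_path)

    def _read_checkpoint(self) -> int:
        # The checkpoint file plays the role of the filepath: entry above
        if self.checkpoint.exists():
            return json.loads(self.checkpoint.read_text())["id"]
        return -1  # nothing loaded yet

    def load(self) -> list:
        last_id = self._read_checkpoint()
        with sqlite3.connect(self.db_path) as conn:
            rows = conn.execute(
                f"SELECT id, value FROM {self.table} WHERE id > ? ORDER BY id",
                (last_id,),
            ).fetchall()
        if rows:
            # Advance the checkpoint to the highest id seen in this load
            self.checkpoint.write_text(json.dumps({"id": rows[-1][0]}))
        return rows
```

Each call to `load()` then returns only the rows added since the previous call, which is the checkpoint behaviour the YAML above describes.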
  • m

    mlemainque

    10/11/2021, 9:59 AM
For the second point, declaring the same dataset twice would definitely work but would not be very elegant, would it? Below is the use case I am describing. Even though the anti-join can become a costly task, it can be worth it if the following pipeline is even more costly (ML tasks).
```python
from datetime import datetime
from typing import Dict

import pandas as pd
from kedro.pipeline import node


def make_incremental(input_data: pd.DataFrame, output_partitioned_data: Dict) -> Dict:
    # Anti-join: keep only input rows whose id has not already been
    # written to a previous partition of the output dataset
    for _, load_output in output_partitioned_data.items():
        input_data = input_data.merge(load_output()[['id']], on='id', how='left', indicator=True)
        input_data = input_data[input_data['_merge'] == 'left_only'].drop(columns=['_merge'])
    return {str(datetime.utcnow()): input_data}


# The output dataset would need to be both an input and an output here
node(make_incremental, ['input_dataset', 'output_partitioned_dataset'], 'output_partitioned_dataset')
```
  • d

    datajoely

    10/11/2021, 10:00 AM
So I actually think point 2 could also be done with a custom dataset - essentially inherit from PartitionedDataSet and do the logic you describe in there too
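A file-level sketch of that idea (stdlib only; the function name, CSV layout, and partition naming are illustrative, and this is deliberately standalone rather than an actual PartitionedDataSet subclass): on each save, collect the ids already present in previous partition files and write only the unseen input rows as a fresh partition.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path


def write_new_partition(input_rows: list, partition_dir: str) -> Path:
    """Anti-join the input against ids found in existing partitions,
    then write only the unseen rows as a new timestamped partition.

    Illustrative sketch of the incremental-save logic, not a Kedro API.
    """
    out_dir = Path(partition_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Gather every id written by previous runs
    seen = set()
    for part in out_dir.glob("*.csv"):
        with part.open() as f:
            seen.update(row["id"] for row in csv.DictReader(f))

    # Keep only rows not seen before (the anti-join)
    new_rows = [r for r in input_rows if r["id"] not in seen]

    # Timestamped partition name, like {str(datetime.utcnow()): ...} above
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%f")
    target = out_dir / f"{stamp}.csv"
    with target.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "value"])
        writer.writeheader()
        writer.writerows(new_rows)
    return target
```

Inside a PartitionedDataSet subclass the same logic would sit in the save path, so the node itself never needs to read its own output.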
  • m

    mlemainque

    10/11/2021, 10:02 AM
    I am not sure as the node won't pass the output to the inner function. It would require a custom implementation of the node
  • d

    datajoely

    10/11/2021, 10:03 AM
    Possibly - maybe a before_node_run hook is an option too
  • m

    mlemainque

    10/11/2021, 10:03 AM
The Node._run_with_dict method should also pass the outputs if they are in the inner function's signature
  • d

    datajoely

    10/11/2021, 10:04 AM
    Yeah it's an interesting question
  • d

    datajoely

    10/11/2021, 10:04 AM
I've never seen someone request this before, so I'd be very keen to see where you land and learn how we can make this easier for you in the future
  • m

    mlemainque

    10/11/2021, 10:06 AM
    Ok, thanks for your help. If we find a convenient and elegant solution to this use case we'll probably come back to you
  • d

    datajoely

    10/11/2021, 10:07 AM
    Please do - it's a very cool problem
  • m

    mlemainque

    10/11/2021, 2:36 PM
Hi again, I am wondering how difficult it would be for you to add more interactivity to kedro-viz and finally have it somehow integrated into our favourite IDEs? A first easy step I think would be to add hyperlinks:
* From a node you could go directly to the inner func's code in VS Code thanks to a vscode:// hyperlink
* From a FS dataset you could see the list of files and open them thanks to a file:// hyperlink, or even display a table preview directly in kedro-viz
* From an image/matplotlib dataset you could display a preview...
  • d

    datajoely

    10/11/2021, 2:37 PM
We may or may not be working on a prototype for this 🤫
  • d

    datajoely

    10/11/2021, 2:37 PM
    The FastAPI rewrite allows all of this
  • d

    datajoely

    10/11/2021, 2:37 PM
    cc @User
  • m

    mlemainque

    10/11/2021, 2:38 PM
    That would be amazing... Would you need any beta tester, I'm here 😄