mlemainque

10/11/2021, 9:59 AM
For the second point, declaring the same dataset twice would definitely work, but it would not be very elegant, would it? Below is the use case I am describing. Even though the anti-join can become a costly task, it can be worth it if the pipeline that follows is even more costly (e.g. ML tasks).
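```python
from datetime import datetime
from typing import Callable, Dict

import pandas as pd
from kedro.pipeline import node

def make_incremental(input_data: pd.DataFrame, output_partitioned_data: Dict[str, Callable]) -> Dict[str, pd.DataFrame]:
  # Anti-join: keep only the input rows whose id is not yet in any saved partition
  for load_output in output_partitioned_data.values():
    input_data = input_data.merge(load_output()[['id']], on='id', how='left', indicator=True)
    input_data = input_data[input_data['_merge'] == 'left_only'].drop(columns=['_merge'])
  return {str(datetime.utcnow()): input_data}

node(make_incremental, ['input_dataset', 'output_partitioned_dataset'], 'output_partitioned_dataset')
```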
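
For anyone who wants to try the anti-join on its own, outside of Kedro, here is a minimal sketch in plain pandas. It assumes that rows are identified by an `id` column and that the partitions arrive as a dict mapping partition ids to load functions. The function name `anti_join_new_rows`, the sample frames and the partition keys `run_1` / `run_2` are made up for illustration.

```python
from datetime import datetime, timezone
from typing import Callable, Dict

import pandas as pd


def anti_join_new_rows(
    input_data: pd.DataFrame,
    partitions: Dict[str, Callable[[], pd.DataFrame]],
) -> pd.DataFrame:
    """Return the rows of input_data whose id is absent from every partition."""
    for load_partition in partitions.values():
        # Only the key column is needed; drop duplicates to keep the merge small.
        seen_ids = load_partition()[["id"]].drop_duplicates()
        merged = input_data.merge(seen_ids, on="id", how="left", indicator=True)
        # "left_only" marks rows that exist in the input but not in this partition.
        input_data = merged[merged["_merge"] == "left_only"].drop(columns=["_merge"])
    return input_data


# Hypothetical sample data: ids 1 and 2 were already processed in earlier runs.
incoming = pd.DataFrame({"id": [1, 2, 3, 4], "value": ["a", "b", "c", "d"]})
existing = {
    "run_1": lambda: pd.DataFrame({"id": [1], "value": ["a"]}),
    "run_2": lambda: pd.DataFrame({"id": [2], "value": ["b"]}),
}

new_rows = anti_join_new_rows(incoming, existing)
# The new rows become one fresh partition keyed by the current UTC timestamp.
new_partition = {datetime.now(timezone.utc).isoformat(): new_rows}
print(new_rows["id"].tolist())  # [3, 4]
```

If only the key matters, filtering with `input_data[~input_data["id"].isin(seen_ids["id"])]` should give the same rows without the merge.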