Kedro #beginners-need-help

datajoely

06/27/2022, 6:23 PM

If you post the node function syntax we can try and work it out, but double check that the object being returned is definitely a df

sjster

06/27/2022, 6:43 PM

I have posted my node function in this gist https://gist.github.com/sjster/f5a51f19bf818d540327ac7b0f8f5769

datajoely

06/27/2022, 6:46 PM

Can you confirm how many columns are present? If it's 0 or 1, the df may just be the index, which is technically a series. I always use merge rather than join because I find it easier to reason about.
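
A minimal sketch of the check and the merge suggested above, using made-up toy frames (the `ev_score` and `tu_score` columns are hypothetical; only `IDENTITY_ID` comes from the thread). It merges on the key column instead of setting the index first, then confirms the result is a DataFrame and counts its columns.

```python
import pandas as pd

# Hypothetical toy frames standing in for the real inputs.
ev = pd.DataFrame({"IDENTITY_ID": [1, 2, 3], "ev_score": [0.1, 0.2, 0.3]})
tu = pd.DataFrame({"IDENTITY_ID": [2, 3, 4], "tu_score": [10, 20, 30]})

# Inner merge on the key column; the key stays a regular column,
# so there is no need to call set_index beforehand.
joined = ev.merge(tu, on="IDENTITY_ID", how="inner")

# Sanity checks: a DataFrame (not a Series) with the expected shape.
is_dataframe = isinstance(joined, pd.DataFrame)
n_rows, n_cols = joined.shape
print(type(joined).__name__, n_rows, n_cols)
```

Because the key remains a column after `merge`, it is also written out when the dataset is saved with `index: False`.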

sjster

06/27/2022, 6:52 PM

It says that it has around 43 columns

sjster

06/27/2022, 6:52 PM

Let me try this with a merge

datajoely

06/27/2022, 6:52 PM

And what does your catalog entry look like? Discord supports snippets so you don't have to create a gist

datajoely

06/27/2022, 6:53 PM

It looks like your dataframe is correct

sjster

06/27/2022, 6:57 PM

Still figuring out how to use discord :)

sjster

06/27/2022, 7:00 PM

```yaml
joined_es_tu_target:
  type: PartitionedDataSet
  dataset:
    type: pandas.ParquetDataSet
    save_args:
      index: False
  path: data/03_primary/joined_es_tu_target
  filename_suffix: ".parquet"
```

datajoely

06/27/2022, 7:01 PM

That looks correct (you can do ```yaml to make that highlight correctly)

datajoely

06/27/2022, 7:01 PM

So can you try without the index save argument?

datajoely

06/27/2022, 7:02 PM

And finally, can I see the node syntax in your pipeline?

datajoely

06/27/2022, 7:04 PM

Oh, also, you need to return a dictionary of chunks. Sorry, I forgot to mention that: https://kedro.readthedocs.io/en/stable/data/kedro_io.html#partitioned-dataset-save
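
A rough sketch of what "return a dictionary" means for a node feeding a partitioned dataset: the function returns a mapping from partition ID to DataFrame rather than a bare DataFrame. The function name `join_as_partitions`, the key `"part_0"`, and the toy columns are all hypothetical; the actual file naming depends on the catalog entry's `path` and `filename_suffix`.

```python
from typing import Dict

import pandas as pd


def join_as_partitions(
    left: pd.DataFrame, right: pd.DataFrame, key: str
) -> Dict[str, pd.DataFrame]:
    """Join two frames and wrap the result in a {partition_id: DataFrame} dict."""
    joined = left.merge(right, on=key, how="inner")
    # The dict key acts as the partition ID; "part_0" is a made-up name.
    return {"part_0": joined}


# Toy data for illustration only.
left = pd.DataFrame({"IDENTITY_ID": [1, 2], "a": [10, 20]})
right = pd.DataFrame({"IDENTITY_ID": [2, 3], "b": [5, 6]})
partitions = join_as_partitions(left, right, key="IDENTITY_ID")
```

A single-key dict like this writes everything as one partition; splitting into several keys is what turns it into real chunks.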

sjster

06/27/2022, 7:05 PM

Ah ok, let me try that

sjster

06/27/2022, 7:06 PM

For reference, my node logic

```python
import logging

import pandas as pd


def read_inputs_join(ev: pd.DataFrame, tu: pd.DataFrame, df_target: pd.DataFrame) -> pd.DataFrame:
    print(ev.head())
    print(tu.head())
    # Note: inplace=True mutates the input frames passed to this node.
    ev.set_index('IDENTITY_ID', inplace=True)
    tu.set_index('IDENTITY_ID', inplace=True)
    df_target.set_index('IDENTITY_ID', inplace=True)

    log = logging.getLogger(__name__)
    log.info(f"Length of ev is {len(ev)}")
    log.info(f"Length of tu is {len(tu)}")
    log.info(f"Length of target is {len(df_target)}")

    # Index-on-index inner joins: ev with tu, then target with the result.
    df_joined = ev.join(tu, how='inner')
    df_ev_tu_target = df_target.join(df_joined, how='inner')
    print("Target type is ", type(df_ev_tu_target))
    print("Target columns ", df_ev_tu_target.columns)
    print("Target head ", df_ev_tu_target.head())
    print("Average age of credit ", df_ev_tu_target['AVG_AGE_OF_CREDIT'])
    log.info(f"Length of target joined with es_tu is {len(df_ev_tu_target)}")

    return df_ev_tu_target
```

sjster

06/27/2022, 7:21 PM

Do the chunks correspond to columns?

datajoely

06/27/2022, 7:22 PM

No, more like chunks of rows. You can typically break it up in two ways: 1) dict(df.groupby(value)) 2) chunking by some number of rows
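
To illustrate the first option, here is a small sketch that builds the dictionary from a groupby. The `region` column and the toy data are assumptions for the example. Iterating a groupby yields `(key, sub-frame)` pairs, and the keys are cast to `str` on the assumption that partition IDs should be strings.

```python
import pandas as pd

# Toy frame; "region" is a hypothetical grouping column.
df = pd.DataFrame(
    {"region": ["north", "south", "north", "east"], "value": [1, 2, 3, 4]}
)

# One partition per group: each key maps to the rows of that group.
partitions = {str(key): group for key, group in df.groupby("region")}

for partition_id, chunk in partitions.items():
    print(partition_id, len(chunk))
```

Each row lands in exactly one partition, so the chunk lengths add up to the length of the original frame.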

sjster

06/27/2022, 7:29 PM

I have about 4 million rows, so chunking to about 800k rows per chunk for 5 chunks
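
A sketch of the fixed-size approach described here, under the assumption that rows are sliced in order with `iloc`. The helper name `chunk_rows` and the `chunk_000`-style keys are invented for illustration. The arithmetic for the numbers in the thread is included: 4,000,000 rows at 800,000 rows per chunk gives 5 chunks.

```python
import math
from typing import Dict

import pandas as pd


def chunk_rows(df: pd.DataFrame, chunk_size: int) -> Dict[str, pd.DataFrame]:
    """Split df into consecutive row chunks of at most chunk_size rows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return {
        f"chunk_{i:03d}": df.iloc[start:start + chunk_size]
        for i, start in enumerate(range(0, len(df), chunk_size))
    }


# Chunk count for the sizes mentioned in the thread.
n_chunks_for_4m = math.ceil(4_000_000 / 800_000)  # 5

# Small demonstration: 10 rows in chunks of 4 gives sizes 4, 4, 2.
small = pd.DataFrame({"x": range(10)})
chunks = chunk_rows(small, chunk_size=4)
```

The last chunk is simply shorter when the row count is not a multiple of the chunk size.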

sjster

06/27/2022, 11:25 PM

I want to say that the save worked!
