# beginners-need-help
d
If you post the node function syntax we can try and work it out, but double check that the object being returned is definitely a df
s
I have posted my node function in this gist https://gist.github.com/sjster/f5a51f19bf818d540327ac7b0f8f5769
d
Can you confirm how many columns are present? If it's 0 or 1, what you have may be just the index, which is technically a Series rather than a DataFrame. I always use merge rather than join because I find it easier to reason about.
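For what it's worth, here is a rough sketch of the merge version together with the type and shape check. The toy frames and the `ev_feature` / `tu_feature` columns are made up for illustration; only `IDENTITY_ID` is taken from your description.

```python
import pandas as pd

# Hypothetical stand-ins for the real inputs; only IDENTITY_ID is from the thread.
ev = pd.DataFrame({"IDENTITY_ID": [1, 2, 3], "ev_feature": [0.1, 0.2, 0.3]})
tu = pd.DataFrame({"IDENTITY_ID": [2, 3, 4], "tu_feature": [10, 20, 30]})

# merge joins on a named column, so there is no need to call set_index first.
joined = ev.merge(tu, on="IDENTITY_ID", how="inner")

# Sanity checks: confirm the result is a DataFrame and count its columns.
is_dataframe = isinstance(joined, pd.DataFrame)
n_rows, n_cols = joined.shape
print(type(joined), n_rows, n_cols)
```

With merge the key stays a regular column, so it is still there if you save with `index: False`.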
s
It says that it has around 43 columns
Let me try this with a merge
d
And what does your catalog entry look like? Discord supports snippets so you don't have to create a gist
It looks like your dataframe is correct
s
Still figuring out how to use discord :)
yaml
joined_es_tu_target:
  type: PartitionedDataSet
  dataset:
     type: pandas.ParquetDataSet
     save_args:
      index: False
  path: data/03_primary/joined_es_tu_target
  filename_suffix: ".parquet"
d
That looks correct (you can do ```yaml to make that highlight correctly)
So can you try without the index save argument?
And finally can I see the node syntax in your pipeline
Oh, also: you need to return a dictionary of chunks (partition name mapped to DataFrame) rather than a single DataFrame. Sorry, I forgot to mention that: https://kedro.readthedocs.io/en/stable/data/kedro_io.html#partitioned-dataset-save
s
Ah ok, let me try that
For reference, my node logic
python
def read_inputs_join(ev: pd.DataFrame, tu: pd.DataFrame, df_target: pd.DataFrame) -> pd.DataFrame:
    print(ev.head())
    print(tu.head())
    ev.set_index('IDENTITY_ID', inplace=True)
    tu.set_index('IDENTITY_ID', inplace=True)
    df_target.set_index('IDENTITY_ID', inplace=True)

    log = logging.getLogger(__name__)
    log.info(f"Length of ev is {len(ev)}")
    log.info(f"Length of tu is {len(tu)}")
    log.info(f"Length of target is {len(df_target)}")

    df_joined = ev.join(tu, how='inner')
    df_ev_tu_target = df_target.join(df_joined, how='inner')
    print("Target type is ",type(df_ev_tu_target))
    print("Target columns ",df_ev_tu_target.columns)
    print("Target columns ",df_ev_tu_target.head())
    print("Average age of credit ",df_ev_tu_target['AVG_AGE_OF_CREDIT'])
    log.info(f"Length of target joined with es_tu is {len(df_ev_tu_target)}")

    return(df_ev_tu_target)
Do the chunks correspond to columns?
d
No, more like chunks of rows. You can typically break it up in two ways: 1) dict(df.groupby(value)) 2) chunking by some number of rows
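A rough sketch of both options, assuming the node returns a dict that maps partition names to DataFrames. The helper names, the `segment` column and the key formats are made up for illustration, so adapt them to your data. Each key ends up as a file name under `path`, with your `filename_suffix` appended.

```python
import math
from typing import Dict

import pandas as pd


def partition_by_value(df: pd.DataFrame, column: str) -> Dict[str, pd.DataFrame]:
    """Option 1: one partition per distinct value of `column`."""
    # Keys are built as strings so they can be used as partition names.
    return {f"{column}_{value}": group for value, group in df.groupby(column)}


def partition_by_rows(df: pd.DataFrame, rows_per_chunk: int) -> Dict[str, pd.DataFrame]:
    """Option 2: consecutive chunks of at most `rows_per_chunk` rows."""
    if rows_per_chunk <= 0:
        raise ValueError("rows_per_chunk must be positive")
    n_chunks = math.ceil(len(df) / rows_per_chunk)
    return {
        f"part_{i:03d}": df.iloc[i * rows_per_chunk:(i + 1) * rows_per_chunk]
        for i in range(n_chunks)
    }


# Tiny hypothetical frame to show the shape of the result.
df = pd.DataFrame({"IDENTITY_ID": range(10), "segment": ["a", "b"] * 5})
by_value = partition_by_value(df, "segment")
by_rows = partition_by_rows(df, rows_per_chunk=4)
print(sorted(by_value), {key: len(chunk) for key, chunk in by_rows.items()})
```

In the node you would return one of these dicts instead of the joined DataFrame, and change the return annotation to match.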
s
I have about 4 million rows, so chunking to about 800k rows per chunk for 5 chunks
I want to say that the save worked!