03/04/2022, 11:16 PM
I've been debugging this issue for a couple of days now, and I'm about to lose my sanity 😁 So, I'm dealing with Spark dataframes and TensorFlow. To have them talk, I usually save my dataframes as CSV and then read them into a TensorFlow dataset with a call to
. In this particular case, I have a node at the end of one of my pipelines saving the dataframe to S3 (
), and I have written a custom dataset (essentially copied most of the code from
) that does the reading at the beginning of the next pipeline. The maddening issue I haven't been able to solve is that, if I run both pipelines with the --from-node option, the run fails as my call to
returns an empty result. I have verified that the dataframe is being correctly written to my S3 bucket, but a call to
comes back empty as well. If, after my failed run, I run just the second pipeline, everything works as expected:
returns my CSV files, and I'm able to load my data into a TF dataset and train my model without issue. Does anybody have any idea what I'm doing wrong?