jaweiss2305  02/26/2022, 1:06 PM

datajoely  02/26/2022, 2:38 PM

jaweiss2305  02/26/2022, 3:43 PM

Arnaldo  02/26/2022, 4:51 PM

jaweiss2305  02/26/2022, 4:56 PM

Arnaldo  02/26/2022, 4:57 PM

datajoely  02/26/2022, 7:30 PM
FelicioV  03/02/2022, 5:42 PM
PartitionedDataSets with pandas.ExcelDataSet, specifying load_args such as sheet_name, names and dtype. It works like a charm, but I'm worried about the size of the catalog/ingest.yml. I've been searching for a way to split that catalog YAML into a few files, maybe along business-oriented segments, but I've had no luck with it. Is there an intended way to do such a thing? If there's no intended way implemented, I've been thinking (not really tried, though) of messing with register_catalog on the ProjectHooks class. Am I making any sense? Thanks!
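For anyone reading along: Kedro's config loader merges every file whose name matches catalog*, including files nested under a conf/base/catalog/ folder, so entries can be split into business-oriented files without touching register_catalog. A minimal sketch of one such split-out file; the dataset name, path, sheet and column names below are hypothetical:

```yaml
# conf/base/catalog/ingest.yml -- merged alongside catalog.yml
raw_invoices:                       # hypothetical dataset name
  type: PartitionedDataSet
  path: data/01_raw/invoices        # hypothetical folder of Excel drops
  filename_suffix: ".xlsx"
  dataset:
    type: pandas.ExcelDataSet
    load_args:
      sheet_name: Sheet1            # hypothetical
      names: [invoice_id, amount]   # hypothetical
      dtype: {invoice_id: str}
```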
FelicioV  03/02/2022, 6:07 PM

datajoely  03/02/2022, 6:08 PM

waylonwalker  03/03/2022, 4:32 PM
datajoely  03/03/2022, 4:57 PM
the chunksize argument of pd.read_sql_table should work in load_args

datajoely  03/03/2022, 4:58 PM
PartitionedDataSet and a file system
waylonwalker  03/03/2022, 5:08 PM

datajoely  03/03/2022, 5:08 PM

waylonwalker  03/03/2022, 5:08 PM

datajoely  03/03/2022, 5:09 PM
chunksize is an argument of pandas' DataFrame.to_sql, so you can use it in pandas.SQLTableDataSet
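In catalog terms that could look like the sketch below; the dataset name, table, credentials key, and chunk size are all hypothetical. One caveat: setting chunksize in load_args makes pd.read_sql_table return an iterator of DataFrames rather than a single DataFrame, so the consuming node has to expect that.

```yaml
orders:                       # hypothetical dataset name
  type: pandas.SQLTableDataSet
  table_name: orders          # hypothetical
  credentials: warehouse_db   # hypothetical entry in credentials.yml (must define con)
  load_args:
    chunksize: 50000          # read_sql_table yields chunks of this many rows
  save_args:
    chunksize: 50000          # DataFrame.to_sql inserts in batches of this size
    if_exists: replace
```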
jaweiss2305  03/03/2022, 6:25 PM

waylonwalker  03/03/2022, 6:26 PM
williamc  03/04/2022, 11:16 PM
tf.data.experimental.make_csv_dataset. In this particular case I have a node at the end of one of my pipelines saving the dataframe to S3 (spark.SparkDataFrame), and I have written a custom dataset (essentially copied most of the code from TensorFlowModelDataset) that does the reading at the beginning of the next pipeline. The maddening issue I haven't been able to solve is that if I run both pipelines with the --from-node option, the run fails because my call to self._fs.get() returns an empty result. I have verified that the dataframe is being correctly written to my S3 bucket, but a call to self._fs.ls(load_path) comes back empty as well.
If, after my failed run, I run just the second pipeline, everything works as expected: self._fs.get() returns my CSV files and I'm able to load my data into a TF dataset and train my model without issue.
Does anybody have any idea what I'm doing wrong?

williamc  03/04/2022, 11:16 PM

williamc  03/04/2022, 11:16 PM

avan-sh  03/05/2022, 1:26 AM

williamc  03/05/2022, 3:27 AM
self._fs.ls(load_path) to find something in the S3 bucket where my CSV dataframe is, with no luck.
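One possibility worth ruling out here (an assumption, not something confirmed in the thread): s3fs caches directory listings per filesystem instance, so a listing taken before the upstream node wrote the files can keep coming back empty for the rest of the same process, which would also explain why a fresh run of just the second pipeline works. A sketch of a _load that invalidates the cache first; the class name, batch size, and label column are hypothetical:

```python
import tempfile

import fsspec
import tensorflow as tf
from kedro.io import AbstractDataSet


class TFCsvDataSet(AbstractDataSet):  # hypothetical name
    """Reads a folder of CSV part-files from S3 into a tf.data.Dataset."""

    def __init__(self, filepath: str):
        self._filepath = filepath
        protocol, _ = fsspec.core.split_protocol(filepath)
        self._fs = fsspec.filesystem(protocol or "file")

    def _load(self) -> tf.data.Dataset:
        # s3fs caches ls() results; drop anything cached before the
        # upstream node wrote the files earlier in this same process.
        self._fs.invalidate_cache()
        local_dir = tempfile.mkdtemp()
        # Copy the partition down locally, then hand the files to TF.
        self._fs.get(self._filepath.rstrip("/") + "/", local_dir, recursive=True)
        return tf.data.experimental.make_csv_dataset(
            local_dir + "/*.csv",
            batch_size=32,        # hypothetical
            label_name="target",  # hypothetical
            num_epochs=1,
        )

    def _save(self, data: tf.data.Dataset) -> None:
        raise NotImplementedError("This dataset is read-only")

    def _describe(self) -> dict:
        return {"filepath": self._filepath}
```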
datajoely  03/05/2022, 11:31 AM

williamc  03/05/2022, 12:14 PM
tf.data.DataSet object. According to their docs, "when exiting the context, the reader of the dataset will be closed".
RE breakpoints: unfortunately I'm working with an old version of Jupyter Lab and can't readily update it or install plugins. I'd rather use VS Code, but I've had some trouble setting up the SSH + Docker integration (my dev env is a Docker container running on an EC2 instance). I'll keep trying things to isolate the error further. Thanks for the pointers.

datajoely  03/05/2022, 12:15 PM
breakpoint() syntax
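For reference, breakpoint() is built into Python 3.7+ and drops into pdb wherever the run hits it, which works inside a node function or a custom dataset's _load just as well as in a script. A minimal sketch with a hypothetical node:

```python
import pandas as pd


def clean_invoices(df: pd.DataFrame) -> pd.DataFrame:  # hypothetical node
    breakpoint()  # pauses the kedro run in pdb; `c` continues, `q` aborts
    return df.dropna()
```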
williamc  03/05/2022, 12:18 PM

Deep  03/07/2022, 1:41 PM

Deep  03/07/2022, 1:42 PM

Deep  03/07/2022, 1:42 PM