filpa11/03/2022, 3:29 PM
I run this pipeline from my local workstation for testing purposes. My Dask cluster is deployed on AWS EC2 (scheduler + workers), and they communicate privately. I noticed that at the last node, the flow
dask.ParquetDataSet from s3 -> MemoryDataSet -> dask.ParquetDataSet to s3
causes the data to be transferred to my local machine, where the Kedro pipeline is being run, and then transferred back to s3. Needless to say, this introduces cost and lag and is not what I intended. Can I tell the workers to write this data directly to s3 instead? If not, what is the intended way to do this? I read through the documentation, and there is some very good information on running the pipeline as Step Functions (https://kedro.readthedocs.io/en/stable/deployment/aws_step_functions.html) or on AWS Batch (https://kedro.readthedocs.io/en/stable/deployment/aws_batch.html), but this is not quite the deployment flow I had in mind. Is the pipeline intended to be run on the same infrastructure where the workers are deployed?
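For context, here is a minimal sketch of the catalog setup I mean (dataset names and S3 paths are placeholders; my real entries live in conf/base/catalog.yml, this is just the equivalent in Python):
```python
# Sketch of the catalog for the last node: s3 parquet -> memory -> s3 parquet.
# Names and bucket paths are made up for illustration.
from kedro.io import DataCatalog, MemoryDataSet
from kedro.extras.datasets.dask import ParquetDataSet

catalog = DataCatalog(
    {
        # Input loaded from S3 as a Dask DataFrame
        "events_raw": ParquetDataSet(filepath="s3://my-bucket/raw/events.parquet"),
        # Intermediate result held in memory -- this seems to be where the data
        # ends up on my local machine rather than staying on the workers
        "events_processed": MemoryDataSet(),
        # Final output written back to S3
        "events_final": ParquetDataSet(filepath="s3://my-bucket/processed/events.parquet"),
    }
)
```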