# beginners-need-help
Hello everyone! I'm relatively new to Kedro. I'm using it together with Dask for some data processing, and I have some issues regarding data locality. I have a pipeline with three nodes, where the datasets are loaded as follows:
dask.ParquetDataSet from s3 -> MemoryDataSet -> dask.ParquetDataSet to s3
I run this pipeline from my local workstation for testing purposes. My Dask Cluster is then deployed on AWS EC2 (Scheduler+Workers) and they communicate privately. I noticed that on the last node, the
MemoryDataSet -> dask.ParquetDataSet to s3
causes the data to be transferred to my local machine, where the Kedro pipeline is running, and then transferred back to S3. Needless to say, this adds cost and latency and is not what I intended. Can I tell the workers to write this data directly to S3? If not, what is the intended way to do this? I read through the documentation, and there is some very good information on getting the pipeline to run either as AWS Step Functions (https://kedro.readthedocs.io/en/stable/deployment/aws_step_functions.html) or on AWS Batch (https://kedro.readthedocs.io/en/stable/deployment/aws_batch.html), but this is not quite the deployment flow I had in mind. Is the pipeline intended to be run on the same infrastructure where the workers are deployed?