# beginners-need-help
i
Hi! Actually I'm struggling with a resource issue, because I have to process a lot of high-resolution images and the node processing has to keep them all in RAM, which raises an OutOfMemory error. So I'm thinking of processing them sequentially in the node (the node processes an image then saves it, instead of the current behaviour where all images are processed and then saved to the catalog). I found that it should be doable with a custom runner, is that right? Any tips for that?
d
So I think you probably could do it with a custom runner
I think an `after_node_run` hook is probably the best
you have access to the catalog object, input names and output names
You can then explicitly call the `catalog.release(dataset_name)` method
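A minimal sketch of such a hook, assuming it is registered through the `HOOKS` tuple in `settings.py`; pluggy lets a hook implementation declare only the arguments it needs, so the other `after_node_run` parameters are left out here:

```python
from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog
from kedro.pipeline.node import Node


class ReleaseInputsHook:
    """Free each dataset a node has just consumed so cached copies
    don't accumulate in RAM over the course of the run."""

    @hook_impl
    def after_node_run(self, node: Node, catalog: DataCatalog, inputs: dict) -> None:
        # `inputs` maps the node's input dataset names to the loaded data;
        # releasing by name drops whatever the catalog is still holding on to.
        for dataset_name in inputs:
            catalog.release(dataset_name)
```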
i
Thank you! Is the catalog released automatically, or should I do it manually? But my main issue is that I have a data catalog that I can't process all at once in a node, so how can I batch the processing?
d
Okay so Kedro will release at the end of the pipeline run (I think)
but if your issue is on load
What kind of data are you loading?
and roughly how many rows?
i
Perfect! Yes, actually the issue is on loading the whole data catalog into memory for processing. So is there any trick to do batch processing of the data catalog, or something like this: load_image, process, save in catalog, then repeat for all the other images to avoid OOM?
d
Is it image datasets?
are you using `PartitionedDataSet` or `IncrementalDataSet` to make life easier?
i
Yes exactly, I'm using a custom image dataset with PartitionedDataSet
d
Okay, have you tried `IncrementalDataSet`? it may help here
I think the partitions provide a lazy load method for each partition
i
I think the IncrementalDataSet is not relevant in this case
I think the question is how to save data through the data catalog while the node is running (I mean, when using PartitionedDataSet I have to iterate through the partitions and process each image), so the idea is to save each image after each iteration of the PartitionedDataSet reading loop
d
today - the partitioned dataset is eager
and doesn't really allow this well
I'm trying to think how we can make it lazy
this is how load works today
where essentially the `partitions` dictionary already contains the data loaded into memory
I think if you drop the `.load()` that has been highlighted, it will return a lazy object reference that you could `load()` yourself in the node, within a loop
you should be able to subclass `PartitionedDataSet` just like you have for your custom image dataset
If you have any time to test this and see if it works I'd love to make it out of the box functionality
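A rough, untested sketch of what that subclass could look like, mirroring the private helpers used by the `PartitionedDataSet` implementation linked above (these internals may differ between Kedro versions):

```python
from copy import deepcopy
from typing import Any, Callable, Dict

from kedro.io import PartitionedDataSet


class LazyPartitionedDataSet(PartitionedDataSet):
    """Return a load callable per partition instead of the loaded data."""

    def _load(self) -> Dict[str, Callable[[], Any]]:
        partitions = {}
        for partition in self._list_partitions():
            kwargs = deepcopy(self._dataset_config)
            partition_id = self._path_to_partition(partition)
            kwargs[self._filepath_arg] = self._join_protocol(partition)
            dataset = self._dataset_type(**kwargs)
            # keep the bound `load` method rather than calling it, so the
            # image stays on disk until the node decides to load it
            partitions[partition_id] = dataset.load
        return partitions
```

The node would then iterate over the returned dictionary and call each value itself, one partition at a time.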
i
actually PartitionedDataSet provides a loader function to load the image, doesn't it? so the image is not loaded into memory automatically
d
so I think the highlighted part actually loads it
but if you remove it
you could do the load when the dictionary is provided to you in the node
i
I see your point, but don't you think the loading isn't a problem, since we can use the load function to load each image within the loop? I think we should rather consider how to save the images gradually so we don't keep them in RAM. Currently we have to keep them in RAM until the node run ends to save them in the data catalog
d
ah I'm with you
I'm trying to think of the best way of doing this
i
Yeaaah! I really appreciate your help 😄
d
We essentially need ways of running the node on small batches
i
actually I thought of saving the images manually without using the data catalog, but that's not the proper way I think
d
well that would work - and we don't currently offer a better solution
i
exactly, that's what I thought about
d
so maybe to get you going that's not a bad idea
but I'm just trying to think how we fix it conceptually
i
Yes! If there is no better way I think I have no choice 😅
I think "batch processing" could be a really good feature for Kedro
d
Yeah I'm just whiteboarding a hook idea
i
Cool !
d
so I've done something a bit crazy but I think it may work
I've not tested it so it's only pseudocode
it would work like this
essentially you have two partitioned datasets, one for your source images and one for your target images
you do your load/transform/save for each source/target pair at the same time
and we do some creative stuff to pull out the ability to save to the target folder using the same partition IDs as we had in the source folder
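The whiteboarded pseudocode isn't reproduced in this log; the sketch below is only one hypothetical reading of the idea, with made-up node and dataset names (`process_images`, `source_images`, `target_images`) and the assumption that the source partitions load lazily as callables:

```python
from typing import Any, Callable, Dict

from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog
from kedro.pipeline.node import Node


class PairwisePartitionHook:
    """Walk the source partitions one by one, run the node's own function on
    each image, and save the result into the target PartitionedDataSet under
    the same partition ID, so only one image is in memory at a time."""

    @hook_impl
    def after_node_run(self, node: Node, catalog: DataCatalog) -> None:
        if node.name != "process_images":  # hypothetical node name
            return
        source: Dict[str, Callable[[], Any]] = catalog.load("source_images")
        for partition_id, load_func in source.items():
            processed = node.func(load_func())  # load and transform a single image
            # save under the source partition ID; by default PartitionedDataSet
            # does not delete existing partitions, so they accumulate one by one
            catalog.save("target_images", {partition_id: processed})
            catalog.release("target_images")  # drop any cached copy before the next image
```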
i
Wow you rock !
d
let me know if it works
i
so I can do the processing in the hook part and save directly into the catalog
d
if it does I'd love to come up with an official way of doing this
yeah
i
Of course !
d
I think there is a lot of merit in this being generic
i
Ok, I will try it, thank you!
d
there is a core assumption you have the same number of partitions as inputs as you do outputs
i
yes, I think so too
Thanks for your time, I'll let you know the results 😉
d
🤞
Feel free to comment if you have any thoughts https://github.com/kedro-org/kedro/issues/1413
d
@idriss__ Sorry for the late reply (found this through the issue on GitHub), but what you can do is return a dictionary with `Callable`s as values from your node in order to save lazily using `PartitionedDataSet`. This will allow you to essentially save each image on each iteration through the `PartitionedDataSet` loop.
d
This is a really cool idea - could you provide a snippet?
r
The example (https://kedro.readthedocs.io/en/stable/data/kedro_io.html#partitioned-dataset-save) of returning a dictionary with `Callable` values doesn't seem to have the expected behaviour. The point is to save each partition one at a time, right? It seems that all the files are saved at the same time (see screenshot); each file actually took 4 minutes to process. So it seems like the outputs are stored in RAM until the end of the node. Or am I missing something?
m
I am having some problems lazily saving parquet files
And yes, it saves all at the same time
What's happening is that I create the callable dictionary, but when the write happens it writes the same file
d
If you look at the implementation it's definitely running in a for loop https://kedro.readthedocs.io/en/stable/_modules/kedro/io/partitioned_dataset.html#PartitionedDataSet
can you put a breakpoint in the `PartitionedDataSet` to prove that?
Think you may need more precision on your timestamps
Can you post your YAML definition?
m
I just discovered what happened with the identical writes (exactly the same file): if your files are inside another folder it will write all at once; once I removed the subfolder it worked
partitioned_data:
  type: PartitionedDataSet
  path: data/07_model_output/scored
  data: pandas.ParquetDataSet
I just tested: the function that writes the dictionary with callables is OK (the names and files are all listed). Once it goes to the saving hook it only uses the last partition repeatedly
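One common cause of every partition being written with the last partition's data is Python's late-binding closures when the callables are built in a loop; a minimal sketch of the lazy-output pattern with the binding pinned through a default argument (`transform` is a placeholder for the real processing):

```python
from typing import Any, Callable, Dict


def transform(image: Any) -> Any:
    """Placeholder for the actual per-partition processing."""
    return image


def process_partitions(partitions: Dict[str, Callable[[], Any]]) -> Dict[str, Callable[[], Any]]:
    """Node sketch: consume a PartitionedDataSet lazily and emit lazy outputs."""
    results: Dict[str, Callable[[], Any]] = {}
    for partition_id, load_func in partitions.items():
        # bind `load_func` as a default argument: without this, every callable
        # closes over the same loop variable and the last partition is saved
        # over and over again
        def build(load_func: Callable[[], Any] = load_func) -> Any:
            return transform(load_func())

        results[partition_id] = build
    return results
```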