# beginners-need-help
i
Hi! Actually I'm struggling with a resource issue, because I have to process a lot of high-resolution images and the node processing has to keep them all in RAM, which raises an OutOfMemory error. So I'm thinking of processing them sequentially in the node (the node processes an image then saves it, instead of the current behaviour where all images are processed and then saved to the catalog). I found that it should be doable with a custom runner, is that right? Any tips for that?
d
So I think you probably could do it with a custom runner
I think an `after_node_run` hook is probably the best
you have access to the catalog object, input names and output names
You can then explicitly call the `catalog.release(dataset_name)` method
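A minimal sketch of such a hook, assuming it is registered through the `HOOKS` tuple in `settings.py`; pluggy lets a hook implementation declare only the arguments it needs, so the other `after_node_run` parameters are left out here:

```python
from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog
from kedro.pipeline.node import Node


class ReleaseInputsHook:
    """Free each dataset a node has just consumed so cached copies
    don't accumulate in RAM over the course of the run."""

    @hook_impl
    def after_node_run(self, node: Node, catalog: DataCatalog, inputs: dict) -> None:
        # `inputs` maps the node's input dataset names to the loaded data;
        # releasing by name drops whatever the catalog is still holding on to.
        for dataset_name in inputs:
            catalog.release(dataset_name)
```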
i
Thank you! Is the catalog released automatically, or should I do it manually? But my main issue is that I have a data catalog that I can't process all at once in a node, so how can I batch the processing?
d
Okay so Kedro will release at the end of the pipeline run (I think)
but if your issue is on load
What kind of data are you loading?
and roughly how many rows?
i
Perfect! Yes, actually the issue is on loading the whole data catalog into memory for processing. So is there any trick to do batch processing of the data catalog, or something like this: load_image, process, save in catalog, then repeat for all the other images to avoid OOM?
d
Is it image datasets?
are you using `PartitionedDataSet` or `IncrementalDataSet` to make life easier?
i
Yes exactly, I'm using a custom image dataset with PartitionedDataSet
d
Okay, have you tried `IncrementalDataSet`? it may help here
I think the partitions provide a lazy load method for each partition
i
I think the IncrementalDataSet is not relevant in this case
I think the question is how to save data through the data catalog while the node is running (I mean, when using PartitionedDataSet I have to iterate through the partitions and process each image), so the idea is to save each image after each iteration of the PartitionedDataSet reading loop
d
today - the partitioned dataset is eager
and doesn't really allow this well
I'm trying to think how we can make it lazy
this is how load works today
where essentially the `partitions` dictionary already contains the data loaded into memory
I think if you drop the `.load()` that has been highlighted, it will return a lazy object reference that you could `load()` yourself in the node, within a loop
you should be able to subclass `PartitionedDataSet` just like you have for your custom image dataset
If you have any time to test this and see if it works I'd love to make it out of the box functionality
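A rough, untested sketch of what that subclass could look like, mirroring the private helpers used by the `PartitionedDataSet` implementation linked above (these internals may differ between Kedro versions):

```python
from copy import deepcopy
from typing import Any, Callable, Dict

from kedro.io import PartitionedDataSet


class LazyPartitionedDataSet(PartitionedDataSet):
    """Return a load callable per partition instead of the loaded data."""

    def _load(self) -> Dict[str, Callable[[], Any]]:
        partitions = {}
        for partition in self._list_partitions():
            kwargs = deepcopy(self._dataset_config)
            partition_id = self._path_to_partition(partition)
            kwargs[self._filepath_arg] = self._join_protocol(partition)
            dataset = self._dataset_type(**kwargs)
            # keep the bound `load` method rather than calling it, so the
            # image stays on disk until the node decides to load it
            partitions[partition_id] = dataset.load
        return partitions
```

The node would then iterate over the returned dictionary and call each value itself, one partition at a time.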
i
actually PartitionedDataSet provides a loader function to load the image, doesn't it? so the image is not loaded into memory automatically
d
so I think the highlighted part actually loads it
but if you remove it
you could do the load when the dictionary is provided to you in the node
i
I see your point, but don't you think the loading isn't a problem, since we can use the load function to load each image within the loop? I think we should rather consider how to save the images gradually so we don't keep them in RAM. Currently we have to keep them in RAM until the node run ends to save them in the data catalog
d
ah I'm with you
I'm trying to think of the best way of doing this
i
Yeaaah! I really appreciate your help 😄
d
We essentially need ways of running the node on small batches
i
actually I thought of saving the images manually without using the data catalog, but that's not the proper way I think
d
well that would work - and we don't currently offer a better solution
i
exactly, that's what I thought about
d
so maybe to get you going that's not a bad idea
but I'm just trying to think how we fix it conceptually
i
Yes! If there is no better way I think I have no choice 😅
I think "batch processing" could be a really good feature for Kedro
d
Yeah I'm just whiteboarding a hook idea
i
Cool !
d
so I've done something a bit crazy but I think it may work
I've not tested it so it's only pseudocode
it would work like this
essentially you have two partitioned datasets, one for your source images and one for your target images
you do your load/transform/save for each source/target pair at the same time
and we do some creative stuff to pull out the ability to save to the target folder using the same partition IDs as we had in the source folder
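The whiteboarded pseudocode isn't reproduced in this log; the sketch below is only one hypothetical reading of the idea, with made-up node and dataset names (`process_images`, `source_images`, `target_images`) and the assumption that the source partitions load lazily as callables:

```python
from typing import Any, Callable, Dict

from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog
from kedro.pipeline.node import Node


class PairwisePartitionHook:
    """Walk the source partitions one by one, run the node's own function on
    each image, and save the result into the target PartitionedDataSet under
    the same partition ID, so only one image is in memory at a time."""

    @hook_impl
    def after_node_run(self, node: Node, catalog: DataCatalog) -> None:
        if node.name != "process_images":  # hypothetical node name
            return
        source: Dict[str, Callable[[], Any]] = catalog.load("source_images")
        for partition_id, load_func in source.items():
            processed = node.func(load_func())  # load and transform a single image
            # save under the source partition ID; by default PartitionedDataSet
            # does not delete existing partitions, so they accumulate one by one
            catalog.save("target_images", {partition_id: processed})
            catalog.release("target_images")  # drop any cached copy before the next image
```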
i
Wow you rock !
d
let me know if it works
i
so I can do the processing in the hook part and save directly into the catalog
d
if it does I'd love to come up with an official way of doing this
yeah
i
Of course !
d
I think there is a lot of merit in this being generic
i
Ok, I will try it, thank you!
d
there is a core assumption you have the same number of partitions as inputs as you do outputs
i
yes, I think so too
Thanks for your time, I'll let you know the results 😉
d
🤞
Feel free to comment if you have any thoughts https://github.com/kedro-org/kedro/issues/1413
d
@idriss__ Sorry for the late reply (found this through the issue on GitHub), but what you can do is return a dictionary with `Callable`s as values from your node in order to save lazily using `PartitionedDataSet`. This will allow you to essentially save each image on each iteration through the `PartitionedDataSet` loop.
d
This is a really cool idea - could you provide a snippet?
r
The example (https://kedro.readthedocs.io/en/stable/data/kedro_io.html#partitioned-dataset-save) of returning a dictionary with `Callable` values doesn't seem to have the expected behaviour. The point is to save each partition one at a time, right? It seems that all the files are saved at the same time (see screenshot); each file actually took 4 minutes to process. So it seems like the outputs are stored in RAM until the end of the node. Or am I missing something?
m
I am having some problems lazily saving parquet files
And yes, it saves all at the same time
What's happening is that I create the callable dictionary, but when the write happens it writes the same file
d
If you look at the implementation it's definitely running in a for loop https://kedro.readthedocs.io/en/stable/_modules/kedro/io/partitioned_dataset.html#PartitionedDataSet
can you put a breakpoint in the `PartitionedDataSet` to prove that?
Think you may need more precision on your timestamps
Can you post your YAML definition?
m
I just discovered what happened with the identical writes (exactly the same file): if your files are inside another folder it will write all at once; once I removed the subfolder it worked
partitioned_data:
  type: PartitionedDataSet
  path: data/07_model_output/scored
  data: pandas.ParquetDataSet
I just tested: the function that writes the dictionary with callables is OK (the names and files are all listed). Once it goes to the saving hook it only uses the last partition repeatedly
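One common cause of every partition being written with the last partition's data is Python's late-binding closures when the callables are built in a loop; a minimal sketch of the lazy-output pattern with the binding pinned through a default argument (`transform` is a placeholder for the real processing):

```python
from typing import Any, Callable, Dict


def transform(image: Any) -> Any:
    """Placeholder for the actual per-partition processing."""
    return image


def process_partitions(partitions: Dict[str, Callable[[], Any]]) -> Dict[str, Callable[[], Any]]:
    """Node sketch: consume a PartitionedDataSet lazily and emit lazy outputs."""
    results: Dict[str, Callable[[], Any]] = {}
    for partition_id, load_func in partitions.items():
        # bind `load_func` as a default argument: without this, every callable
        # closes over the same loop variable and the last partition is saved
        # over and over again
        def build(load_func: Callable[[], Any] = load_func) -> Any:
            return transform(load_func())

        results[partition_id] = build
    return results
```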