Oh, so you want a general dataset that represents the whole bucket, and then you grab specific parts of that bucket?
And I assume you want to use everything that query returns, and that it will fit in RAM
So maybe a standard (or incremental) custom dataset that takes in a query and a bucket and returns the matching images in memory would be better?
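Roughly what I have in mind, as a minimal sketch. Everything here is made up for illustration: `QueriedBucketDataset` is a hypothetical name, the "bucket" is just a dict of key to raw bytes standing in for a real object store, and the "query" is a plain predicate on the key. It only follows the `__len__`/`__getitem__` protocol that map-style datasets use, so it doesn't depend on any particular framework.

```python
from typing import Callable, Dict, List, Tuple


class QueriedBucketDataset:
    """Map-style dataset holding only the bucket objects that match a query."""

    def __init__(self, bucket: Dict[str, bytes], query: Callable[[str], bool]):
        # ASSUMPTION: `bucket` is a stand-in for a real object store (key -> raw image bytes).
        # Everything matching the query is loaded eagerly, so it has to fit in RAM.
        self.items: List[Tuple[str, bytes]] = [
            (key, data) for key, data in sorted(bucket.items()) if query(key)
        ]

    def __len__(self) -> int:
        return len(self.items)

    def __getitem__(self, index: int) -> Tuple[str, bytes]:
        # Returns (key, raw bytes); decoding into an image is left to the caller.
        return self.items[index]


# Hypothetical bucket contents, just to show the filtering.
fake_bucket = {
    "cats/001.png": b"cat-1",
    "cats/002.png": b"cat-2",
    "dogs/001.png": b"dog-1",
}
cats = QueriedBucketDataset(fake_bucket, query=lambda key: key.startswith("cats/"))
print(len(cats), cats[0][0])
```

With a real bucket you'd swap the dict for whatever listing/download calls your storage client gives you, the shape of the class stays the same.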
Then you can optionally dump that into a tar file in an ingest pipeline so it's faster afterwards. Or into a better columnar format. I haven't worked with images much
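For the tar part, something like this with the standard library `tarfile` module would probably do. It's a sketch only: `dump_to_tar` and `load_from_tar` are hypothetical helper names, and the samples are placeholder bytes rather than real images.

```python
import io
import os
import tarfile
import tempfile
from typing import Dict


def dump_to_tar(samples: Dict[str, bytes], tar_path: str) -> None:
    """Write each (name, raw bytes) pair as one member of an uncompressed tar."""
    with tarfile.open(tar_path, "w") as tar:
        for name, data in samples.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)  # size must be set before adding the member
            tar.addfile(info, io.BytesIO(data))


def load_from_tar(tar_path: str) -> Dict[str, bytes]:
    """Read every regular file in the tar back into memory."""
    loaded: Dict[str, bytes] = {}
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            if member.isfile():
                loaded[member.name] = tar.extractfile(member).read()
    return loaded


# Placeholder data standing in for the images the query returned.
samples = {"cats/001.png": b"cat-1", "cats/002.png": b"cat-2"}
tar_path = os.path.join(tempfile.mkdtemp(), "cats.tar")
dump_to_tar(samples, tar_path)
restored = load_from_tar(tar_path)
```

The idea is that later runs read one local file sequentially instead of hitting the bucket once per image.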