advanced-need-help
  • b

    Barros

    08/13/2022, 7:01 PM
    Working on this right now. I need the lazy features too, so I am going for the second approach. Let's see how well it goes.
  • b

    Barros

    08/13/2022, 8:03 PM
    The thing is that I want to be able to use a PartitionedDataSet of PartitionedDataSets. This is a borderline case that I want to test.
  • b

    Barros

    08/13/2022, 8:04 PM
    I have a folder structure that has sub-folders and these sub-folders have some specific files each
  • b

    Barros

    08/13/2022, 8:04 PM
    I want to specify each sub-folder as a single dataset
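The layout described above (sub-folders, each holding its own files, loaded lazily) can be sketched without Kedro at all. This is only an illustration under stated assumptions, not Barros's implementation and not a Kedro API: `lazy_nested_partitions` is a hypothetical helper, and it only mirrors the idea that PartitionedDataSet hands back a dictionary of partition IDs mapped to load callables rather than loaded data.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict


def lazy_nested_partitions(
    root: Path, suffix: str = ""
) -> Dict[str, Dict[str, Callable[[], bytes]]]:
    """Map each sub-folder name to {file name: zero-argument loader}.

    Hypothetical helper: nothing is read from disk until a loader is called.
    """
    nested: Dict[str, Dict[str, Callable[[], bytes]]] = {}
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        nested[folder.name] = {
            # Bind `f` as a default argument so each loader keeps its own path.
            f.name: (lambda f=f: f.read_bytes())
            for f in sorted(folder.iterdir())
            if f.is_file() and f.name.endswith(suffix)
        }
    return nested
```

Calling `lazy_nested_partitions(Path("data/imagery"), suffix=".tif")` (the path is made up) would return one inner dictionary per sub-folder, and a node could then call only the loaders it actually needs.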
  • d

    datajoely

    08/13/2022, 8:04 PM
    This feels slightly out of the bounds of what we've tested so far. If you get it working I'd be interested in trying to support this natively
  • b

    Barros

    08/13/2022, 8:05 PM
    Maybe I'll have to use something like a tar file, I'll see what happens
  • a

    antheas

    08/13/2022, 8:06 PM
    How many subfolders? With fewer than 20, I'd use the first approach with N partitioned datasets
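For a small number of sub-folders, the "N partitioned datasets" idea could be sketched by generating one catalog entry per sub-folder. This is a rough sketch under assumptions: `partitioned_entries` and the `images_` naming are hypothetical, and the dataset type passed in is a placeholder for whatever underlying dataset reads a single file. The keys `type`, `path`, `dataset` and `filename_suffix` are the ones PartitionedDataSet takes in a Kedro 0.18 catalog.

```python
from typing import Dict, List


def partitioned_entries(
    base_path: str, subfolders: List[str], dataset_type: str, suffix: str
) -> Dict[str, dict]:
    """Build one PartitionedDataSet catalog entry per sub-folder.

    Hypothetical helper: the returned dict has the same shape as the
    entries you would otherwise write by hand in catalog.yml.
    """
    entries: Dict[str, dict] = {}
    for name in subfolders:
        entries[f"images_{name}"] = {
            "type": "PartitionedDataSet",
            "path": f"{base_path.rstrip('/')}/{name}",
            "dataset": dataset_type,  # placeholder for the per-file dataset
            "filename_suffix": suffix,
        }
    return entries
```

With hundreds of sub-folders this becomes unwieldy, since every entry is a separate node input, which is why it only suits the "fewer than 20" case.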
  • a

    antheas

    08/13/2022, 8:06 PM
    I have a node that will receive 1-2 dozen datasets when I'm done with it
  • b

    Barros

    08/13/2022, 8:07 PM
    It's a lot of folders
  • b

    Barros

    08/13/2022, 8:07 PM
    It's satellite imagery
  • b

    Barros

    08/13/2022, 8:07 PM
    I have been successfully using Kedro as a geospatial data pipeline for machine learning
  • b

    Barros

    08/13/2022, 8:07 PM
    And I am making improvements right now
  • a

    antheas

    08/13/2022, 8:07 PM
    Oh so it's a directory structure that ends in folders with images
  • b

    Barros

    08/13/2022, 8:08 PM
    Yes, and images that have regular structure between folders
  • a

    antheas

    08/13/2022, 8:08 PM
    How many gigs?
  • b

    Barros

    08/13/2022, 8:08 PM
    about 30gb of images
  • b

    Barros

    08/13/2022, 8:08 PM
    But distributed in small pieces of about 200kb
  • d

    datajoely

    08/13/2022, 8:09 PM
    Super cool use case!
  • b

    Barros

    08/13/2022, 8:10 PM
    I've wanted so badly to contribute to this kind of dataset, but I didn't find the time until recently
  • b

    Barros

    08/13/2022, 8:10 PM
    All I have written is closed source
  • d

    datajoely

    08/13/2022, 8:11 PM
    I've been keen to improve PartitionedDataSet in general so this could feed into a new design
  • b

    Barros

    08/13/2022, 8:11 PM
    If you want to understand the case better, take a look at this bucket: https://registry.opendata.aws/sentinel-2-l2a-cogs/
  • b

    Barros

    08/13/2022, 8:12 PM
    It's all based on this folder structure in
    s3://sentinel-cogs/sentinel-s2-l2a-cogs/
  • b

    Barros

    08/13/2022, 8:13 PM
    In the end you have some files that are images representing the same location in space. However, each image holds different data at a different resolution
  • b

    Barros

    08/13/2022, 8:14 PM
    I have built a GDALRasterDataSet that can open geospatial data with some interesting functionalities, like cropping with a specified polygon (directly reading from S3)
  • b

    Barros

    08/13/2022, 8:15 PM
    Kedro really has a good framework for this
  • b

    Barros

    08/13/2022, 8:16 PM
    But I'd like to work with all the image files as if they were a single one
  • b

    Barros

    08/13/2022, 8:16 PM
    Because they represent literally the same place
  • a

    antheas

    08/13/2022, 8:20 PM
    Sounds like PartitionedDataSet is the proper fit, provided you can figure out how you'd shard your initial dataset. If you use a compressed format afterwards, would you get a performance benefit? If so, I'd be keen to dump the partitions in that format so they're available locally and faster afterwards. Some food for thought
  • b

    Barros

    08/13/2022, 8:21 PM
    Compression would hurt performance a lot