Powered by Linen
advanced-need-help
  • b

    Barros

    08/13/2022, 8:22 PM
    The thing is that if I use compression I would not be able to use this data source (the s3 bucket), since I cannot change it
  • a

    antheas

    08/13/2022, 8:22 PM
    About that, you said 30gb. Is it updating?
  • b

    Barros

    08/13/2022, 8:22 PM
    The source is far more
  • b

    Barros

    08/13/2022, 8:22 PM
    But when I have a specified dataset it is 30gb
  • b

    Barros

    08/13/2022, 8:23 PM
    It is not changing
  • b

    Barros

    08/13/2022, 8:24 PM
    What I do is a spatial join - I have a polygon dataset, and I find the satellite images inside these polygons and download just them
  • b

    Barros

    08/13/2022, 8:24 PM
    Just what is inside the polygon
  • b

    Barros

    08/13/2022, 8:24 PM
    And then a dataset that would be many TB is now just 30gb
  • b

    Barros

    08/13/2022, 8:25 PM
    It's the magic of cloud optimized geotiffs
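The spatial-join approach Barros describes can be sketched in plain Python (a minimal sketch; the file names, footprints, and area of interest are made up for illustration): keep a catalog of image footprints, intersect them with the polygon's bounding box, and download only the matching files.

```python
# Hypothetical sketch: filter a catalog of satellite-image footprints down to
# those intersecting an area of interest, so only matching files are fetched.
# Footprints and the AOI are (minx, miny, maxx, maxy) bounding boxes.

def intersects(a, b):
    """Axis-aligned bounding-box intersection test."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def select_images(catalog, aoi):
    """Return only the catalog entries whose footprint overlaps the AOI."""
    return [key for key, bbox in catalog.items() if intersects(bbox, aoi)]

catalog = {
    "s3://bucket/scene_a.tif": (0, 0, 10, 10),
    "s3://bucket/scene_b.tif": (20, 20, 30, 30),
}
aoi = (5, 5, 15, 15)
print(select_images(catalog, aoi))  # only scene_a overlaps the AOI
```

In practice the join would use real polygon geometries (e.g. via geopandas), but the effect is the same: a many-TB bucket reduces to the ~30 GB that actually intersects the polygons.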
  • a

    antheas

    08/13/2022, 8:27 PM
    Oh, so you want a general dataset that represents the whole bucket, and then you grab specific parts of that bucket. And I assume you want to use all the parts after that query, and it will fit in RAM. So maybe a standard/incremental custom dataset that takes in a query and a bucket and returns the applicable images in memory would be better? Then you can optionally dump that into a tar file in an ingest pipeline so it's faster afterwards. Or in a better columnar format. I haven't worked with images much
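The custom dataset antheas suggests might look like this in outline (a hedged sketch: listing and fetching are stubbed out, and a real Kedro implementation would subclass `AbstractDataset` rather than stand alone):

```python
# Hypothetical shape of a query-driven dataset: configured with a bucket and a
# query, its _load() returns only the matching images in memory. No real S3
# access happens here; list_keys/fetch are injectable stubs for illustration.

class QueryImageDataset:
    def __init__(self, bucket, query, fetch=None, list_keys=None):
        self._bucket = bucket
        self._query = query  # predicate applied to each key
        self._fetch = fetch or (lambda key: f"<bytes of {key}>")
        self._list_keys = list_keys or (lambda: [])

    def _load(self):
        # Only keys the query selects are ever downloaded.
        return {k: self._fetch(k) for k in self._list_keys() if self._query(k)}

ds = QueryImageDataset(
    bucket="my-bucket",
    query=lambda key: key.endswith(".tif"),
    list_keys=lambda: ["a.tif", "b.jpg", "c.tif"],
)
print(sorted(ds._load()))  # only the .tif keys are loaded
```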
  • a

    antheas

    08/13/2022, 8:29 PM
    One of the downsides of Kedro is that it assumes you want to load a dataset into RAM. No virtual datasets. Works for some use cases, maybe not yours
  • b

    Barros

    08/13/2022, 8:30 PM
    That would be the ideal case, but I don't think it is possible due to the massive size of the bucket. Instead I catalog all the files in the bucket in the CSV and I get only the files I need from the bucket.
  • b

    Barros

    08/13/2022, 8:32 PM
    I decided to fork the repo to make a possible implementation for Kedro. I'll do it and show you how it is done
  • b

    beats-like-a-helix

    08/14/2022, 4:12 PM
    Let's say I have some raw image data. I use a class to manually label objects within these images, which is done in a Jupyter Notebook, and this yields another dataset. This manual process is only performed once, and afterwards the pipeline continues as normal. Is there a way to create a dummy node for Kedro Viz just to demonstrate to others that the labelling process actually occurred?
  • d

    datajoely

    08/14/2022, 4:14 PM
    I think the dummy node is the right call today. But I've wanted to push custom icons for ages - would you like something like that?
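For reference, the dummy-node approach might look like this (a sketch; the function and dataset names are hypothetical). At run time the node is a pass-through, so its only job is to make the one-off manual labelling step visible in Kedro Viz:

```python
# A pass-through node that documents the manual labelling step in the DAG.

def manual_labelling(labelled_images):
    """Placeholder for the one-off labelling done in a Jupyter notebook.

    Identity at run time; it exists so Kedro Viz shows that the labelling
    happened between the raw and labelled datasets.
    """
    return labelled_images

# In the pipeline definition it would be wired up roughly like:
#   node(manual_labelling, inputs="labelled_images_raw", outputs="labelled_images")

print(manual_labelling(["img_01", "img_02"]))
```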
  • b

    beats-like-a-helix

    08/14/2022, 4:19 PM
    I suppose I'd need to see an example of the implementation, but if it helps cut out meaningless code and communicate steps clearly, it's a yes from me 👍
  • d

    datajoely

    08/14/2022, 4:20 PM
    I was thinking of the catalog entry providing a Font Awesome or Material icon name, something like that
  • b

    beats-like-a-helix

    08/14/2022, 4:25 PM
    Could be interesting! I wonder how much demand there would be in general for this
  • b

    Barros

    08/15/2022, 12:08 AM
    @datajoely @antheas I have added a proof of concept of GDALRasterDataSet, with the expected functionalities for Cloud Optimized GeoTIFF (COG). There is a jupyter notebook showing how it works here: https://github.com/yurigba/kedro/blob/main/kedro/extras/datasets/rasterio/raster_dataset_usage.ipynb
  • b

    Barros

    08/15/2022, 12:09 AM
    There is a problem though, and that is why I had so many issues: GDAL virtual file systems (https://gdal.org/user/virtual_file_systems.html) kind of conflict with fsspec
  • b

    Barros

    08/15/2022, 12:10 AM
    And for COG usage you need to enable virtual file systems
  • b

    Barros

    08/15/2022, 12:11 AM
    So there are a lot of problems to work around before it works properly with geospatial data
  • b

    Barros

    08/15/2022, 12:12 AM
    But it is really possible if you give up on some stuff like fsspec, or work around it
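The conflict Barros mentions comes down to path conventions: fsspec expects URLs like `s3://bucket/key`, while GDAL's virtual file systems use prefixes like `/vsis3/bucket/key`. A small translation helper (a sketch, not part of the linked proof of concept) can bridge the two:

```python
# Map fsspec-style URLs to GDAL virtual-file-system paths.
# /vsis3/, /vsigs/, and /vsicurl/ are GDAL's documented VFS prefixes.

def to_gdal_vsi(url):
    """Translate an fsspec-style URL to the equivalent GDAL /vsi*/ path."""
    scheme, _, rest = url.partition("://")
    if scheme == "s3":
        return "/vsis3/" + rest
    if scheme == "gs":
        return "/vsigs/" + rest
    if scheme in ("http", "https"):
        return "/vsicurl/" + url  # /vsicurl/ keeps the full URL
    return url  # plain local path, leave as-is

print(to_gdal_vsi("s3://bucket/scene.tif"))  # /vsis3/bucket/scene.tif
```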
  • z

    Zhee

    08/15/2022, 9:54 AM
    Hi, this would be great! Could be easy to represent each main node role... stack, clean, join, predict... etc.
  • d

    datajoely

    08/15/2022, 10:00 AM
    If either of you want this feature, please comment on Kedro Viz issue https://github.com/kedro-org/kedro-viz/issues/480
  • a

    antheas

    08/15/2022, 2:22 PM
    looks good! And intuitive enough. I don't do geospatial though so my understanding stops there
  • b

    Barros

    08/15/2022, 4:10 PM
    Thanks! There is a lot of work needed for it to be merge-able, and I need to make some decisions about how to implement virtual file systems properly and write a lot of unit tests
  • b

    Barros

    08/15/2022, 4:10 PM
    This will make it easier to make the other dataset that I proposed (the "partitioned dataset of partitioned datasets")
  • u

    user

    08/16/2022, 8:13 AM
    Is it possible to automate creating readme content using sphinx in kedro? https://stackoverflow.com/questions/73370726/is-it-possible-to-automate-creating-readme-content-using-sphinx-in-kedro
  • m

    marioFeynman

    08/16/2022, 8:28 PM
    Hi! Is there any way to close the sql engines (created when the catalog is created) after the pipeline runs?
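One way this could be done (a sketch, not an official Kedro recipe: the engine tracking here is stubbed, and real SQL datasets keep their engines internally, so the lookup would need adapting): a project hook whose `after_pipeline_run` disposes the SQLAlchemy engines once the run finishes.

```python
# Sketch of a Kedro hook that disposes SQLAlchemy engines after a run.
# In a real project, after_pipeline_run would carry Kedro's @hook_impl
# decorator and the hook instance would be listed in settings.py.

class DisposeSQLEnginesHook:
    def __init__(self):
        self._engines = []

    def register(self, engine):
        """Track an engine created alongside a catalog entry."""
        self._engines.append(engine)

    def after_pipeline_run(self, run_params=None, pipeline=None, catalog=None):
        for engine in self._engines:
            engine.dispose()  # closes the engine's connection pool

# Stand-in engine so the sketch is runnable without a database.
class FakeEngine:
    def __init__(self):
        self.disposed = False

    def dispose(self):
        self.disposed = True

hook = DisposeSQLEnginesHook()
engine = FakeEngine()
hook.register(engine)
hook.after_pipeline_run()
print(engine.disposed)  # True
```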