# advanced-need-help

Rafał

04/26/2022, 3:12 PM

Hello, I am wondering if there is any way to list all available versions of a versioned dataset in Kedro's catalog?

avan-sh

04/26/2022, 4:12 PM

One simple way to get the list of all available versions would be to list all the folders in the dataset's top-level folder. Another approach is to leave the version empty (not pass a version), load the object in an IPython session, and run the following (code from the VersionedDataset implementation):

dataset_name = catalog.load("<your_versioned_dataset_name>")
pattern = str(dataset_name._get_versioned_path("*"))
version_paths = sorted(dataset_name._glob_function(pattern), reverse=True)

I'm guessing @datajoely means the second approach

datajoely

04/26/2022, 4:14 PM

I did mean this, but didn't realise it was so complicated

datajoely

04/26/2022, 4:14 PM

Perhaps it makes sense to expose this as a property

datajoely

04/26/2022, 4:14 PM

The session store will eventually cover more of this

Rafał

04/26/2022, 5:56 PM

The best would be to expose the catalog item before loading it. I would then call ds = catalog.get("datasetname"); after that I could load any version I want and, of course, ask for the list of available versions.

Rafał

04/26/2022, 5:58 PM

For example, if I had the raw DataSet, then I could take its filepath to get the top-level root directory of the VersionedDataSet.

datajoely

04/26/2022, 6:02 PM

so if you do

catalog.datasets.{dataset_name}

you get access to the lazy object not the data

datajoely

04/26/2022, 6:03 PM

so you can do everything @avan-sh suggests in an interactive session

Rafał

04/26/2022, 6:03 PM

Oh that’s great news for me.

datajoely

04/26/2022, 6:03 PM

but I think it makes sense for me to make this much easier for users

datajoely

04/26/2022, 6:03 PM

we do for partitioned dataset, no reason not to here

Rafał

04/26/2022, 6:05 PM

I see, and I agree it is quite complicated, especially having to load a single item in order to resolve all the other versions.

datajoely

04/26/2022, 6:15 PM

yes

datajoely

04/26/2022, 6:16 PM

it's trivial to make a little convenience function

datajoely

04/26/2022, 6:16 PM

but not accessible to newbies

Rafał

04/26/2022, 6:36 PM

I thought that one had to load a single item. Actually, it works without loading any item.

dataset = catalog.datasets.__your_dataset_name___
pattern = dataset._get_versioned_path('*')

version_paths = sorted(dataset._glob_function(pattern), reverse=True)

Rafał

04/26/2022, 6:37 PM

It would be great to parse the paths and expose only the versions, but I think I can manage that. Thank you, guys.
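A rough sketch of that parsing step, assuming Kedro's usual layout for versioned datasets, `<filepath>/<version>/<filename>`, so that the version is the name of each path's parent directory. The function name `extract_versions` and the example paths are hypothetical, not part of Kedro.

```python
from pathlib import PurePosixPath


def extract_versions(version_paths):
    """Return the unique version strings found in the given paths, newest first."""
    # Assumed layout: <filepath>/<version>/<filename>, so the version is the
    # name of the directory that directly contains the file.
    versions = {PurePosixPath(str(path)).parent.name for path in version_paths}
    # Kedro version strings are timestamps, so a reverse lexicographic sort
    # puts the most recent version first.
    return sorted(versions, reverse=True)


# Hypothetical paths, shaped like the output of the glob call in this thread.
example_paths = [
    "data/01_raw/cars.csv/2022-04-26T15.12.00.000Z/cars.csv",
    "data/01_raw/cars.csv/2022-04-27T04.17.00.000Z/cars.csv",
]
print(extract_versions(example_paths))
# ['2022-04-27T04.17.00.000Z', '2022-04-26T15.12.00.000Z']
```

Wrapping each path in `str()` first should let the same function accept either plain strings or path objects.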

Rafał

04/27/2022, 4:17 AM

Actually, this method works only for a local filepath. In the case of the s3 protocol, I got the following error:

sorted(dataset._glob_function(pattern), reverse=True)

  File "/opt/miniconda3/envs/deep-identifier/lib/python3.8/site-packages/fsspec/asyn.py", line 91, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/opt/miniconda3/envs/deep-identifier/lib/python3.8/site-packages/fsspec/asyn.py", line 71, in sync
    raise return_result
  File "/opt/miniconda3/envs/deep-identifier/lib/python3.8/site-packages/fsspec/asyn.py", line 25, in _runner
    result[0] = await coro
  File "/opt/miniconda3/envs/deep-identifier/lib/python3.8/site-packages/s3fs/core.py", line 624, in _glob
    if path.startswith("*"):
AttributeError: 'PurePosixPath' object has no attribute 'startswith'

Rafał

04/27/2022, 4:22 AM

This helped

dataset = catalog.datasets.__your_dataset_name___
pattern = dataset._get_versioned_path('*')

version_paths = sorted(dataset._glob_function(str(pattern)), reverse=True)
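A minimal sketch that wraps the snippet above in a small helper, assuming the dataset exposes the private `_get_versioned_path` and `_glob_function` attributes used in this thread. These are private, so they may change between Kedro releases. The helper name `glob_versioned_paths` and the dataset name in the usage comment are hypothetical. The `str()` cast is the important part: as the traceback shows, s3fs calls `startswith` on the pattern, and `PurePosixPath` has no such method.

```python
def glob_versioned_paths(dataset):
    """Return every versioned path of `dataset`, newest first."""
    # _get_versioned_path returns a PurePosixPath; glob backends such as
    # s3fs expect a plain string, so cast before globbing.
    pattern = str(dataset._get_versioned_path("*"))
    return sorted(dataset._glob_function(pattern), reverse=True)


# Usage in a Kedro IPython session (hypothetical dataset name):
# version_paths = glob_versioned_paths(catalog.datasets.my_versioned_dataset)
```

Casting unconditionally keeps the helper working for both local and remote filesystems, since a local glob accepts a string as well.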

datajoely

04/27/2022, 10:02 AM

Yes that makes sense
