# advanced-need-help
r
Hello, I am wondering whether there is any way to list all available versions of a versioned dataset in Kedro's catalog?
a
One simple way to get the list of all available versions is to list all folders in the dataset's top-level folder. Another approach is to leave the version empty (do not pass a version), load the object in an IPython session, and run the following (code adapted from the VersionedDataset implementation):
```python
dataset_name = catalog.load("<your_versioned_dataset_name>")
pattern = str(dataset_name._get_versioned_path("*"))
version_paths = sorted(dataset_name._glob_function(pattern), reverse=True)
```
I'm guessing @datajoely means the second approach
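For reference, the first approach (listing the folders under the dataset's top-level folder) could look roughly like the sketch below for a local filesystem. The helper name `list_version_folders` is hypothetical, and the sketch assumes each version lives in its own sub-folder directly under the dataset's filepath, with timestamp-style names that sort chronologically as plain strings.

```python
from pathlib import Path


def list_version_folders(dataset_root):
    """Return the names of the sub-folders of `dataset_root`, newest first.

    Hypothetical helper: assumes one sub-folder per version and that the
    folder names sort chronologically as plain strings.
    """
    root = Path(dataset_root)
    # Keep directories only, so stray files next to the versions are ignored.
    return sorted((p.name for p in root.iterdir() if p.is_dir()), reverse=True)
```

This only covers local paths; a remote store such as S3 would need a filesystem-aware listing instead.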
d
I did mean this, but didn't realise it was so complicated
Perhaps it makes sense to expose this as a property
The session store will eventually cover more of this
r
The best would be to expose the catalog item before loading it. I would then call `ds = catalog.get("datasetname")`; after that I could load any version I want and, of course, ask for the list of available versions.
For example, if I had the raw DataSet, I could take its filepath to get the top-level root directory of the VersionedDataSet.
d
so if you do
catalog.datasets.{dataset_name}
you get access to the lazy object not the data
so you can do everything @avan-sh suggests in an interactive session
r
Oh that’s great news for me.
d
but I think it makes sense for me to make this much easier for users
we do for partitioned dataset, no reason not to here
r
I see, and I agree it is quite complicated, especially having to load a single item in order to resolve all the other versions.
d
yes
it's trivial to make a little convenience function
but not accessible to newbies
r
I thought that one has to load the single item. Actually it works without loading any item.
```python
dataset = catalog.datasets.__your_dataset_name___
pattern = dataset._get_versioned_path('*')

version_paths = sorted(dataset._glob_function(pattern), reverse=True)
```
Would be great to parse the paths and expose only versions, but I think I would manage to do that. Thank you guys.
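A minimal sketch of that parsing step, assuming the versioned paths follow the `<filepath>/<version>/<filename>` layout, so that the version is the name of each path's parent folder. The helper name `versions_from_paths` is hypothetical, and the sketch assumes POSIX-style (forward slash) paths.

```python
from pathlib import PurePosixPath


def versions_from_paths(version_paths):
    """Extract the version name from each versioned path.

    Hypothetical helper: assumes a <filepath>/<version>/<filename> layout,
    so the version is the name of the file's parent folder.
    """
    # str() lets the helper accept both path objects and plain strings.
    return [PurePosixPath(str(p)).parent.name for p in version_paths]
```

The input order is preserved, so passing the `version_paths` list sorted with `reverse=True` yields the versions newest first.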
Actually, this method works only for a local filepath. With the `s3` protocol, I got the following error:
```
sorted(dataset._glob_function(pattern), reverse=True)

  File "/opt/miniconda3/envs/deep-identifier/lib/python3.8/site-packages/fsspec/asyn.py", line 91, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/opt/miniconda3/envs/deep-identifier/lib/python3.8/site-packages/fsspec/asyn.py", line 71, in sync
    raise return_result
  File "/opt/miniconda3/envs/deep-identifier/lib/python3.8/site-packages/fsspec/asyn.py", line 25, in _runner
    result[0] = await coro
  File "/opt/miniconda3/envs/deep-identifier/lib/python3.8/site-packages/s3fs/core.py", line 624, in _glob
    if path.startswith("*"):
AttributeError: 'PurePosixPath' object has no attribute 'startswith'
```
This helped
```python
dataset = catalog.datasets.__your_dataset_name___
pattern = dataset._get_versioned_path('*')

version_paths = sorted(dataset._glob_function(str(pattern)), reverse=True)
```
d
Yes that makes sense
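The "little convenience function" mentioned earlier in the thread could be sketched as below. The name `list_version_paths` is hypothetical and not part of Kedro; the sketch relies on the private `_get_versioned_path` and `_glob_function` members used in the snippets above, which are not public API and may change between releases.

```python
def list_version_paths(dataset):
    """Return all versioned paths of `dataset`, newest first.

    Hypothetical helper built on the private members used in this thread.
    """
    # Convert the pattern to str: the s3fs glob in the traceback above
    # calls `.startswith` on it and fails when given a path object.
    pattern = str(dataset._get_versioned_path("*"))
    return sorted(dataset._glob_function(pattern), reverse=True)
```

In an interactive session it would be called with the lazy dataset object, for example the one obtained from `catalog.datasets`, so no data needs to be loaded.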