#beginners-need-help

williamc

06/23/2022, 4:31 PM
Let's say I just cloned my kedro project repo to another machine, and its datasets are versioned and configured to use S3 for storage. If I try to run a pipeline that depends on those datasets I get the infamous `kedro.io.core.VersionNotFoundError`. Bucket has versions all the way up to `2022-06-07T22.04.39.460Z/` and the error says `2022-06-23T16.20.52.945Z`. Is this the intended behavior? Thanks

noklam

06/27/2022, 5:35 PM
It would be great if you could share the stack trace. If all you did was `kedro run`, it should grab the latest version of the dataset. It may also be useful to share the `catalog.yml`, or just the related dataset entries.

williamc

06/27/2022, 5:51 PM
https://gist.github.com/williamcaicedo/febf490a87fda1d4fc187e97014712de is the related custom dataset I created by modifying `PickleDataSet`. The relevant portion of the stack trace is:
File "/home/ec2-user/SageMaker/tars/tars-env/lib/python3.9/site-packages/kedro/io/core.py", line 539, in _fetch_latest_load_version
    raise VersionNotFoundError(f"Did not find any versions for {self}")
kedro.io.core.VersionNotFoundError: Did not find any versions for KerasStringLookupLayer(backend=pickle, filepath=.../data/06_models/censor_rating_lookup_conf.pkl, load_args={}, protocol=s3, save_args={}, version=Version(load=None, save='2022-06-27T16.59.49.284Z'))
The relevant portion of my data catalog is as follows:
censor_rating_lookup:
  type: tars.extras.datasets.tensorflow.KerasStringLookupLayer
  filepath: s3:${s3_bucket}${exhibitor}/data/06_models/censor_rating_lookup_conf.pkl
  backend: pickle
  versioned: True

noklam

06/28/2022, 12:57 PM
Looks like you are running this on SageMaker. Do other datasets load successfully, or does only the custom dataset have a problem with `versioned: true`? From your stack trace, it fails to see any available versions.
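For context, a versioned dataset stores each save under `<filepath>/<version>/<filename>` and, when no load version is pinned, picks the latest version by listing what exists under the filepath. The sketch below is a simplified stand-in for that lookup, not Kedro's actual implementation: `fetch_latest_version`, the local exception class, and the in-memory listing are all hypothetical, and the first timestamp is made-up sample data. It shows why an empty listing, whatever its cause, ends in "Did not find any versions".

```python
from pathlib import PurePosixPath


class VersionNotFoundError(Exception):
    """Local stand-in for kedro.io.core.VersionNotFoundError."""


def fetch_latest_version(filepath, visible_paths):
    """Return the newest version among paths shaped <filepath>/<version>/<filename>."""
    base = PurePosixPath(filepath)
    versions = [
        PurePosixPath(p).parent.name
        for p in visible_paths
        if PurePosixPath(p).name == base.name
        and PurePosixPath(p).parent.parent == base
    ]
    if not versions:
        # An empty listing looks the same whether nothing was saved
        # or the caller simply is not allowed to list the location.
        raise VersionNotFoundError(f"Did not find any versions for {filepath}")
    # Timestamps in this format sort chronologically as plain strings.
    return max(versions)


filepath = "data/06_models/censor_rating_lookup_conf.pkl"
listing = [
    f"{filepath}/2022-06-01T10.00.00.000Z/censor_rating_lookup_conf.pkl",  # made up
    f"{filepath}/2022-06-07T22.04.39.460Z/censor_rating_lookup_conf.pkl",
]
latest = fetch_latest_version(filepath, listing)
print(latest)
```

Calling `fetch_latest_version(filepath, [])` raises the stand-in error, which mirrors the message in the stack trace above.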

williamc

06/28/2022, 1:33 PM
The Spark datasets work fine. These are versioned as well
n

noklam

06/28/2022, 3:49 PM
Is there any non-Spark versioned dataset? I ask because Spark is slightly special with its I/O. I just want to understand whether this is an issue with your custom dataset or something more generic. Are you able to load the data from a local repository rather than from S3 storage?

williamc

06/29/2022, 6:17 PM
My custom dataset works as intended when executing the project from my previous location, as well as the Pandas based ones. I'm experiencing some permissions problems in aws right now, so I can't rule out yet if that has something to do with this issue. I don't think it does, though. I'll report back as soon as I get the permissions stuff sorted. Thanks!
It seems this was because of a permissions issue. The S3 bucket access policy was set to `s3:*Object`, which I think is insufficient for Kedro's versioned datasets. Once I set the policy to `s3:*` everything worked as intended. Sorry for wasting your time here 😅
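In IAM, listing the keys in a bucket is authorized by the `s3:ListBucket` action, whose name does not end in "Object". The sketch below approximates IAM's wildcard matching with Python's `fnmatch` (an approximation, not IAM's evaluation logic; the `allowed` helper and the short action list are illustrative) to show which actions each pattern covers.

```python
from fnmatch import fnmatchcase


def allowed(action, pattern):
    """Case-insensitive wildcard match, approximating IAM action matching."""
    return fnmatchcase(action.lower(), pattern.lower())


actions = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"]

for pattern in ("s3:*Object", "s3:*"):
    matched = [a for a in actions if allowed(a, pattern)]
    print(pattern, "->", matched)
```

Under this approximation `s3:*Object` covers the object-level actions but not `s3:ListBucket`, while `s3:*` covers all four.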

noklam

06/30/2022, 3:31 PM
Did you figure out which policy is actually needed?
I am guessing you may need `ListObjects` or `ListObjectsV2`, which your wildcard pattern excluded.
I would expect an error like `Insufficient permissions to list objects` to be thrown here. @datajoely, any idea about this? I am not super familiar with S3 policies: does it just show nothing, or will it tell you that a permission is needed?

datajoely

06/30/2022, 3:42 PM
Given this is a custom dataset it may have been suppressed for many reasons

williamc

06/30/2022, 4:13 PM
Yes, I believe `ListObjects` is necessary for versioned datasets to work. I originally coded my project on an EC2 instance that had its own, much more permissive, IAM role, and it worked fine because of that. Setting the right permissions for SageMaker notebooks is a bit more involved, and the original role I was passing on was different from the EC2 one. Those IAM errors don't actually tell you much more than "access denied" or something like that.
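A narrower alternative to `s3:*` might look like the sketch below, which builds a policy document in Python and prints it as JSON. It is a starting point under stated assumptions, not a verified minimal policy for Kedro: the bucket name is hypothetical, and a given project may need further actions. The `ListObjects` and `ListObjectsV2` API calls are authorized by the `s3:ListBucket` action, which applies to the bucket ARN rather than to object ARNs.

```python
import json

BUCKET = "example-bucket"  # hypothetical bucket name

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Listing keys is a bucket-level action, so it targets the bucket ARN.
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{BUCKET}"],
        },
        {
            # Reading and writing data are object-level actions.
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": [f"arn:aws:s3:::{BUCKET}/*"],
        },
    ],
}

print(json.dumps(policy, indent=2))
```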

noklam

07/01/2022, 10:55 AM
Thank you for your feedback, I think that makes sense. Under the hood, we rely on `fsspec` and I don't know how it handles this; there is a chance that the behaviour differs slightly between storage backends too. I think there may be something we can do about it, but it will require some more investigation. I don't have access to quickly spin up S3 storage and play with the S3 policy myself, but I will jot down some notes about this first, thank you!
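One way a custom dataset could surface this earlier is to try listing the target location before resolving versions. The sketch below assumes the filesystem object exposes an fsspec-style `ls(path, detail=...)` method and raises `PermissionError` when listing is denied; that behaviour is an assumption and was not verified against `s3fs` in this thread. `DeniedFS` and `check_can_list` are hypothetical names used only for illustration.

```python
class DeniedFS:
    """Hypothetical stand-in for a filesystem whose role lacks list permission."""

    def ls(self, path, detail=True):
        raise PermissionError("Access Denied")


def check_can_list(fs, directory):
    """List `directory`, turning a bare permission error into an actionable one."""
    try:
        return fs.ls(directory, detail=False)
    except PermissionError as exc:
        raise PermissionError(
            f"Cannot list {directory!r}; versioned datasets need list access "
            "to discover existing versions"
        ) from exc


try:
    check_can_list(DeniedFS(), "data/06_models/censor_rating_lookup_conf.pkl")
except PermissionError as err:
    message = str(err)
    print(message)
```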

datajoely

07/01/2022, 10:56 AM
@idanov can show you how to get a sandbox s3 bucket btw