# beginners-need-help
Let's say I just cloned my kedro project repo to another machine, and its datasets are versioned and configured to use S3 for storage. If I try to run a pipeline that depends on those datasets I get the infamous
. Bucket has versions all the way up to
and the error says
. Is this the intended behavior? Thanks
It would be great if you could share the stack trace. If all you did is
kedro run
it should grab the latest dataset. It may also be useful to share the
or just the related datasets
https://gist.github.com/williamcaicedo/febf490a87fda1d4fc187e97014712de is the related custom dataset I created by modifying
. The relevant portion of the stack trace is
File "/home/ec2-user/SageMaker/tars/tars-env/lib/python3.9/site-packages/kedro/io/core.py", line 539, in _fetch_latest_load_version
    raise VersionNotFoundError(f"Did not find any versions for {self}")
kedro.io.core.VersionNotFoundError: Did not find any versions for KerasStringLookupLayer(backend=pickle, filepath=.../data/06_models/censor_rating_lookup_conf.pkl, load_args={}, protocol=s3, save_args={}, version=Version(load=None, save='2022-06-27T16.59.49.284Z'))
The relevant portion of my data catalog is as follows:
  type: tars.extras.datasets.tensorflow.KerasStringLookupLayer
  filepath: s3:${s3_bucket}${exhibitor}/data/06_models/censor_rating_lookup_conf.pkl
  backend: pickle
  versioned: True
It looks like you are running this on SageMaker. Do other datasets load successfully, or does only the custom dataset have a problem with
versioned: true
? According to your stack trace, it fails to find any available versions.
The Spark datasets work fine. These are versioned as well
Is there any non-Spark versioned dataset? I ask because Spark is slightly special in how it handles I/O. I just want to understand whether this is an issue with your custom dataset or something more generic. Are you able to load the data from local storage rather than from S3?
My custom dataset works as intended when executing the project from my previous location, and so do the pandas-based ones. I'm experiencing some permissions problems in AWS right now, so I can't yet rule out that they have something to do with this issue. I don't think they do, though. I'll report back as soon as I get the permissions stuff sorted. Thanks!
It seems this was because of a permissions issue. The S3 bucket access policy was set to
which I think is insufficient for Kedro's versioned datasets. Once I set the policy to
everything worked as intended. Sorry for wasting your time here 😅
Did you figure out which policy is actually needed?
I am guessing you may need
, which your regex excluded
I would expect an error like
Insufficient permissions to list objects
to be thrown here. @datajoely, any idea about this? I am not very familiar with S3 policies: does it just show nothing, or will it tell you that a permission is needed?
Given this is a custom dataset it may have been suppressed for many reasons
Yes, I believe
is necessary for versioned datasets to work. I originally coded my project on an EC2 instance that had its own, much more permissive, IAM role, and it worked fine because of that. Setting the right permissions for SageMaker notebooks is a bit more involved, and the original role I was passing on was different from the EC2 one. Those IAM errors don't actually tell you much more than "access denied" or something like that.
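To illustrate why listing matters here: the stack trace above ends in `_fetch_latest_load_version`, which has to discover which version folders exist before it can pick the newest one. The following is only a rough sketch of that idea using the local filesystem and the standard library, not Kedro's actual implementation. The helper names (`fetch_latest_load_version`, `make_versions`), the stand-in exception class, the file name `model.pkl`, and the example version strings are all hypothetical. It assumes the layout `<filepath>/<version>/<filename>` and that a store which cannot list objects looks the same as a store with no versions at all.

```python
import glob
import os
import tempfile


class VersionNotFoundError(Exception):
    """Local stand-in for Kedro's error of the same name (not imported here)."""


def fetch_latest_load_version(filepath, glob_function=glob.glob):
    """Return the newest version folder name found under ``filepath``.

    Assumed layout: <filepath>/<version>/<filename>.
    ``glob_function`` is injectable so that a store which returns no
    listing can be simulated.
    """
    filename = os.path.basename(filepath)
    pattern = os.path.join(filepath, "*", filename)
    # Timestamp-style version names sort chronologically as plain strings.
    version_paths = sorted(glob_function(pattern), reverse=True)
    if not version_paths:
        raise VersionNotFoundError(f"Did not find any versions for {filepath}")
    return os.path.basename(os.path.dirname(version_paths[0]))


def make_versions(root, name, versions):
    """Create empty versioned files under ``root`` (hypothetical test helper)."""
    filepath = os.path.join(root, name)
    for version in versions:
        os.makedirs(os.path.join(filepath, version))
        with open(os.path.join(filepath, version, name), "wb"):
            pass
    return filepath


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        path = make_versions(
            root,
            "model.pkl",
            ["2022-06-20T10.00.00.000Z", "2022-06-21T10.00.00.000Z"],
        )
        # Listing works: the newest version is resolved.
        print(fetch_latest_load_version(path))
        # Listing returns nothing (simulated): same error as an empty store.
        try:
            fetch_latest_load_version(path, glob_function=lambda pattern: [])
        except VersionNotFoundError as exc:
            print(f"error: {exc}")
```

In this sketch the files exist in both cases; only the listing step differs. That would be consistent with what was observed in this thread, where the bucket held versions but the loader reported none until the access policy was widened.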
Thank you for your feedback, I think that makes sense. Under the hood, we rely on
and I don't know how it handles this; the behavior may also differ slightly between storage backends. I think there is something we can do about it, but it will require some more investigation. I don't have access to quickly spin up S3 storage and experiment with the S3 policy myself, but I will jot down some notes about this first. Thank you!
@idanov can show you how to get a sandbox s3 bucket btw