Let's say I just cloned my kedro project repo to another machine (Kedro #beginners-need-help)

williamc

06/23/2022, 4:31 PM

Let's say I just cloned my kedro project repo to another machine, and its datasets are versioned and configured to use S3 for storage. If I try to run a pipeline that depends on those datasets I get the infamous `kedro.io.core.VersionNotFoundError`. The bucket has versions all the way up to `2022-06-07T22.04.39.460Z/` and the error says `2022-06-23T16.20.52.945Z`. Is this the intended behavior? Thanks

noklam

06/27/2022, 5:35 PM

It would be great if you could share the stack trace. If all you did was `kedro run`, it should grab the latest dataset. It may also be useful to share the `catalog.yml`, or just the related datasets.

williamc

06/27/2022, 5:51 PM

https://gist.github.com/williamcaicedo/febf490a87fda1d4fc187e97014712de is the related custom dataset I created by modifying `PickleDataSet`. The relevant portion of the stack trace is:

File "/home/ec2-user/SageMaker/tars/tars-env/lib/python3.9/site-packages/kedro/io/core.py", line 539, in _fetch_latest_load_version
    raise VersionNotFoundError(f"Did not find any versions for {self}")
kedro.io.core.VersionNotFoundError: Did not find any versions for KerasStringLookupLayer(backend=pickle, filepath=.../data/06_models/censor_rating_lookup_conf.pkl, load_args={}, protocol=s3, save_args={}, version=Version(load=None, save='2022-06-27T16.59.49.284Z'))

The relevant portion of my data catalog is as follows:

censor_rating_lookup:
  type: tars.extras.datasets.tensorflow.KerasStringLookupLayer
  filepath: s3:${s3_bucket}${exhibitor}/data/06_models/censor_rating_lookup_conf.pkl
  backend: pickle
  versioned: True

noklam

06/28/2022, 12:57 PM

Looks like you are doing it on SageMaker. Do other datasets load successfully, or does only the custom dataset have a problem with `versioned: true`? From your stack trace, it fails to see any versions available.
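For context, a versioned Kedro dataset is laid out as `<filepath>/<version>/<filename>`, and when no load version is pinned the latest one is found by listing that location. The following is a simplified, stdlib-only sketch of that idea against a local temporary directory. It is not Kedro's actual implementation: `fetch_latest_load_version` and the local `VersionNotFoundError` class are illustrative stand-ins, and the real lookup goes through the storage backend rather than `glob`.

```python
import glob
import tempfile
from pathlib import Path


class VersionNotFoundError(Exception):
    """Local stand-in for kedro.io.core.VersionNotFoundError."""


def fetch_latest_load_version(filepath: str) -> str:
    """Return the newest version directory name under `filepath`.

    Assumes the layout <filepath>/<version>/<filename>, where <version>
    is a timestamp string that sorts chronologically.
    """
    filename = Path(filepath).name
    pattern = str(Path(filepath) / "*" / filename)
    # Newest first: the timestamp format sorts lexicographically.
    candidates = sorted(glob.glob(pattern), reverse=True)
    if not candidates:
        # An empty listing looks the same whether nothing was ever saved
        # or the listing itself returned no results.
        raise VersionNotFoundError(f"Did not find any versions for {filepath}")
    return Path(candidates[0]).parent.name


def demo() -> str:
    """Create two fake versions in a temp directory and resolve the latest."""
    with tempfile.TemporaryDirectory() as root:
        base = Path(root) / "censor_rating_lookup_conf.pkl"
        for version in ("2022-06-01T10.00.00.000Z", "2022-06-07T22.04.39.460Z"):
            target = base / version / base.name
            target.parent.mkdir(parents=True)
            target.write_bytes(b"")
        return fetch_latest_load_version(str(base))


print(demo())
```

In the stack trace above, `load=None` means no load version was pinned, so a lookup of this kind runs. The `save` timestamp appears to be the version generated for the current run rather than one expected to already exist in the bucket.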

williamc

06/28/2022, 1:33 PM

The Spark datasets work fine. These are versioned as well

noklam

06/28/2022, 3:49 PM

Are there any non-Spark versioned datasets? I ask because Spark is slightly special with its I/O. I just want to understand whether this is an issue with your custom dataset or something more generic. Are you able to load the data from a local repository rather than from S3 storage?

williamc

06/29/2022, 6:17 PM

My custom dataset works as intended when executing the project from my previous location, as do the Pandas-based ones. I'm experiencing some permissions problems in AWS right now, so I can't yet rule out that they have something to do with this issue. I don't think they do, though. I'll report back as soon as I get the permissions stuff sorted. Thanks!

williamc

06/30/2022, 3:10 PM

It seems this was because of a permissions issue. The S3 bucket access policy was set to `s3:*Object`, which I think is insufficient for Kedro's versioned datasets. Once I set the policy to `s3:*`, everything worked as intended. Sorry for wasting your time here 😅

noklam

06/30/2022, 3:31 PM

Did you figure out which policy is actually needed?

noklam

06/30/2022, 3:32 PM

I am guessing you may need `ListObjects` / `ListObjectsV2`, which your wildcard pattern excluded
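To illustrate the point about the pattern: `ListObjects` and `ListObjectsV2` are S3 API operations, and the IAM action that authorizes them is `s3:ListBucket`. Neither that action nor any name ending in "Objects" matches `s3:*Object`. The sketch below approximates IAM's `*` wildcard with Python's `fnmatch`; it is only an approximation for illustration, not AWS's policy evaluator, and `action_allowed` is a made-up helper name.

```python
from fnmatch import fnmatchcase


def action_allowed(pattern: str, action: str) -> bool:
    """Approximate IAM action matching: '*' matches any run of characters.

    IAM action names are case-insensitive, so both sides are lowercased.
    """
    return fnmatchcase(action.lower(), pattern.lower())


# Compare the restrictive pattern from the thread with the permissive one.
for action in ("s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"):
    print(
        action,
        "| s3:*Object ->", action_allowed("s3:*Object", action),
        "| s3:* ->", action_allowed("s3:*", action),
    )
```

Under this approximation, the object-level actions match both patterns, while `s3:ListBucket` matches only `s3:*`.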

noklam

06/30/2022, 3:38 PM

I would expect an error like `Insufficient permissions to list objects` to be thrown here. @datajoely, any idea about this? I am not super familiar with S3 policies: does it just show nothing, or will it tell you that permission is needed?

datajoely

06/30/2022, 3:42 PM

Given this is a custom dataset it may have been suppressed for many reasons

williamc

06/30/2022, 4:13 PM

Yes, I believe `ListObjects` is necessary for versioned datasets to work. I originally coded my project on an EC2 instance that had its own, much more permissive, IAM role, and it worked fine because of that. Setting the right permissions for SageMaker notebooks is a bit more involved, and the original role I was passing on was different from the EC2 one. Those IAM errors don't actually tell you much more than "access denied" or something like that.
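As a rough sketch of what a narrower policy than `s3:*` might look like, the snippet below builds an IAM policy document that grants listing on the bucket and read/write on its objects. The bucket name `my-kedro-bucket` and the helper `build_policy` are hypothetical, and whether this exact set of actions is sufficient for Kedro's versioned datasets was not confirmed in this thread, so treat it as a starting point to test rather than a verified answer.

```python
import json


def build_policy(bucket: str, prefix: str = "") -> dict:
    """Build an IAM policy document as a plain dict (hypothetical helper)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so it targets the bucket ARN.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Reading and writing target object ARNs under the prefix.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
            },
        ],
    }


print(json.dumps(build_policy("my-kedro-bucket"), indent=2))
```

Note that the listing statement applies to the bucket itself while the object statement applies to keys inside it, which is why the two are kept as separate statements.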

noklam

07/01/2022, 10:55 AM

Thank you for your feedback, I think that makes sense. Under the hood, we rely on `fsspec`, and I don't know how it handles this; there is a chance that the behavior differs slightly between storage backends too. I think there may be something we can do about it, but it will require some more investigation. I don't have access to quickly spin up S3 storage and play with the S3 policy myself, but I will jot down some notes about this first. Thank you!

datajoely

07/01/2022, 10:56 AM

@idanov can show you how to get a sandbox s3 bucket btw
