marioFeynman

05/24/2022, 11:01 PM
Hi team! Hoping you are all right. I want to ask what the best way is to "mount" a Kedro project onto a data lake path... is there any good practice for this? I want to read and write non-Spark datasets to the data lake using the catalog feature... I am deploying my project on Databricks, which has access to a mounted data lake.

datajoely

05/25/2022, 8:52 AM
Could you explain a little more about what you're trying to do? We also have some Databricks-centric integration work going on at the moment - would you mind commenting on this thread? https://discord.com/channels/778216384475693066/778996598880862208/978664438968225802

marioFeynman

05/25/2022, 2:49 PM
So, right now we have several data science projects using Kedro, and we are deploying them on Databricks. Those Databricks workspaces can read data from Azure Data Lake through a mount point (an mnt path). Each project has a dedicated blob storage path in the data lake where we would like to store some inputs/outputs of the projects' scheduled jobs (versioned data, predictions, model .pkl files, etc.), but if we follow the instructions, they only apply to using the same cluster hard drive or the Databricks file system... there is no native integration between Databricks and Azure Data Lake. So some workarounds we are using are:
1. Generate a symbolic link between the cluster and the dedicated blob storage, i.e. link the data folder on the cluster to a data folder in the data lake.
2. After running a job, add an extra snippet to copy the outputs into the lake.

datajoely

05/25/2022, 2:50 PM
Would you mind copying this into the thread above?

marioFeynman

05/25/2022, 2:51 PM
There is a third option, which would be to create a new Kedro environment dedicated to the Azure Data Lake setup, but that is hard to maintain... like changing all the paths to abfs, handling credentials, and so on

datajoely

05/25/2022, 2:57 PM
So I'm not a Databricks expert - but we leverage fsspec behind the scenes to abstract local and cloud filesystems.
There does appear to be dbfs protocol support in fsspec - is this not fit for your purposes? https://filesystem-spec.readthedocs.io/en/stable/_modules/fsspec/implementations/dbfs.html
And then if you're looking for Azure as part of fsspec, all of these Azure targets are supported too: https://github.com/fsspec/adlfs
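For illustration, a minimal catalog sketch for a non-Spark dataset going through adlfs could look like the following; the dataset name, container, path, and credentials key are all hypothetical:

```yaml
# conf/base/catalog.yml
model_predictions:
  type: pandas.CSVDataSet        # non-Spark dataset; fsspec/adlfs handles the filesystem
  filepath: abfs://my-container/my-project/predictions.csv   # hypothetical container and path
  credentials: azure_datalake    # references an entry in conf/local/credentials.yml
```

Kedro passes the referenced credentials dictionary through to fsspec when it opens the file.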

marioFeynman

05/25/2022, 9:57 PM
The problem is that when I define the project root path, it doesn't match the DBFS mount path to the lake.
That's the reason we ended up using a symbolic link: to trick Kedro into reading and saving from the cluster while everything is actually mapped into the lake.

datajoely

05/25/2022, 10:00 PM
The only thing I don't understand is why the Kedro project location dictates where data lives
Would you mind posting a catalog entry?

marioFeynman

05/25/2022, 10:03 PM
The entries are the basic ones:
data/...
And if I use that kind of entry, Kedro will try to read and write directly from the cluster's hard drive
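For context, a relative entry like that would be along these lines (names illustrative); Kedro resolves the filepath relative to the project directory it runs from, i.e. the cluster's local disk here:

```yaml
# conf/base/catalog.yml -- relative filepath, resolved against the project directory
predictions:
  type: pandas.CSVDataSet
  filepath: data/07_model_output/predictions.csv   # ends up on the cluster's local disk
```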

datajoely

05/25/2022, 10:04 PM
Right - if you use the dbfs:// protocol prefix, does it change things?
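A sketch of the same kind of entry with the protocol prefix, assuming a hypothetical mount called my-lake; the exact path form and any dbfs credentials (Databricks instance and token) depend on how fsspec's dbfs filesystem is set up:

```yaml
predictions:
  type: pandas.CSVDataSet
  # fsspec picks its DBFS implementation from the dbfs:// prefix; the exact
  # path form and whether extra credentials are needed depend on the cluster setup
  filepath: dbfs://mnt/my-lake/predictions.csv
```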

marioFeynman

05/25/2022, 10:08 PM
And without any other args? Like putting dbfs:// directly in the filepath?

datajoely

05/25/2022, 10:08 PM
Yeah fsspec should resolve this if it's already authenticated on the cluster

marioFeynman

05/25/2022, 10:09 PM
I will try that. We tried to use it as a regular path, like dbfs/mnt/...
Not with ://
And for abfs or az, do we still need to use the credentials stuff? Or should it grab the permissions from the environment?
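On the credentials question for abfs/az, the usual Kedro pattern is a named entry in credentials.yml that the catalog references; a hypothetical sketch with an account key (a service principal with tenant_id/client_id/client_secret is another option adlfs accepts):

```yaml
# conf/local/credentials.yml -- hypothetical values, passed through to adlfs
azure_datalake:
  account_name: mystorageaccount
  account_key: <storage-account-key>
```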

datajoely

05/25/2022, 10:11 PM
That depends on the setup, but hopefully

marioFeynman

05/25/2022, 10:12 PM
OK, because when I define the path without the ://, the final path that Kedro uses is project_root + 'dbfs/mnt/data...'
So, manually leaving a copy of the dataset in DBFS and then using dbfs:// for the filepath works
I also tried using abfs, but I need to fight with my company because I don't have the permissions for that one
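For completeness, one more option that sidesteps both the dbfs:// protocol and the abfs credentials is the local FUSE mount Databricks exposes under /dbfs; this wasn't tested in this thread and the mount name is illustrative:

```yaml
# conf/base/catalog.yml -- untested alternative: absolute path through the /dbfs FUSE mount,
# so no protocol prefix or credentials are needed
predictions_via_mount:
  type: pandas.CSVDataSet
  filepath: /dbfs/mnt/my-lake/predictions.csv
```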