marioFeynman

05/24/2022, 11:01 PM
Hi team! Hoping you are all right. I want to ask what the best way is to "mount" a Kedro project onto a data lake path... is there any good practice for this? I want to read and write non-Spark datasets to the data lake using the catalog feature... I am deploying my project on Databricks, which has access to a mounted data lake.

datajoely

05/25/2022, 8:52 AM
Could you explain a little more about what you're trying to do? We also have some Databricks-centric integration work going on at the moment - would you mind commenting on this thread? https://discord.com/channels/778216384475693066/778996598880862208/978664438968225802

marioFeynman

05/25/2022, 2:49 PM
So, right now we have several data science projects using Kedro, and we are deploying them on Databricks. Those Databricks workspaces can read data from Azure Data Lake through a mount point (an mnt path). Each project has a dedicated blob storage path in the data lake where we would like to store some inputs/outputs of the projects' scheduled jobs (versioned data, predictions, model .pkl files, etc.), but if we follow the instructions, they only apply to using the same cluster hard drive or the Databricks file system... there is no native integration between Databricks and Azure Data Lake. So some workarounds we are using are:
1. Generate a symbolic link between the cluster and the dedicated blob storage, i.e. link the data folder on the cluster to a data folder in the data lake.
2. After running a job, add an extra snippet to copy the outputs into the lake.

datajoely

05/25/2022, 2:50 PM
Would you mind copying this into the thread above?

marioFeynman

05/25/2022, 2:51 PM
There is a third option, which would be to create a new Kedro environment dedicated to the Azure Data Lake setup, but that is hard to maintain... like changing all the paths to abfs, handling credentials, and so on

datajoely

05/25/2022, 2:57 PM
So I'm not a Databricks expert - but we leverage fsspec behind the scenes to abstract local and cloud filesystems.
There does appear to be dbfs protocol support in fsspec - is this not fit for your purposes? https://filesystem-spec.readthedocs.io/en/stable/_modules/fsspec/implementations/dbfs.html
And then if you're looking for Azure as part of fsspec, all of these Azure targets are supported too: https://github.com/fsspec/adlfs
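For illustration, a minimal catalog sketch for a non-Spark dataset going through adlfs could look like the following; the dataset name, container, path, and credentials key are all hypothetical:

```yaml
# conf/base/catalog.yml
model_predictions:
  type: pandas.CSVDataSet        # non-Spark dataset; fsspec/adlfs handles the filesystem
  filepath: abfs://my-container/my-project/predictions.csv   # hypothetical container and path
  credentials: azure_datalake    # references an entry in conf/local/credentials.yml
```

Kedro passes the referenced credentials dictionary through to fsspec when it opens the file.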

marioFeynman

05/25/2022, 9:57 PM
The problem is that when I define the project root path, it doesn't match the DBFS mount path to the lake.
That's the reason we ended up using a symbolic link: to trick Kedro into reading and saving from the cluster while everything is actually mapped into the lake.

datajoely

05/25/2022, 10:00 PM
The only thing I don't understand is why the Kedro project location dictates where data lives
Would you mind posting a catalog entry?

marioFeynman

05/25/2022, 10:03 PM
The entries are the basic ones:
data/...
And if I use that kind of entry, Kedro will try to read and write directly from the cluster's hard drive
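For context, a relative entry like that would be along these lines (names illustrative); Kedro resolves the filepath relative to the project directory it runs from, i.e. the cluster's local disk here:

```yaml
# conf/base/catalog.yml -- relative filepath, resolved against the project directory
predictions:
  type: pandas.CSVDataSet
  filepath: data/07_model_output/predictions.csv   # ends up on the cluster's local disk
```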

datajoely

05/25/2022, 10:04 PM
Right - if you use the dbfs:// protocol prefix, does it change things?
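A sketch of the same kind of entry with the protocol prefix, assuming a hypothetical mount called my-lake; the exact path form and any dbfs credentials (Databricks instance and token) depend on how fsspec's dbfs filesystem is set up:

```yaml
predictions:
  type: pandas.CSVDataSet
  # fsspec picks its DBFS implementation from the dbfs:// prefix; the exact
  # path form and whether extra credentials are needed depend on the cluster setup
  filepath: dbfs://mnt/my-lake/predictions.csv
```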

marioFeynman

05/25/2022, 10:08 PM
And without any other args? Like putting dbfs:// directly in the filepath?

datajoely

05/25/2022, 10:08 PM
Yeah fsspec should resolve this if it's already authenticated on the cluster

marioFeynman

05/25/2022, 10:09 PM
I will try that. We tried to use it as a regular path, like dbfs/mnt/...
Not with ://
And for abfs or az, do we still need to use the credentials stuff? Or should it grab the permissions from the environment?
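On the credentials question for abfs/az, the usual Kedro pattern is a named entry in credentials.yml that the catalog references; a hypothetical sketch with an account key (a service principal with tenant_id/client_id/client_secret is another option adlfs accepts):

```yaml
# conf/local/credentials.yml -- hypothetical values, passed through to adlfs
azure_datalake:
  account_name: mystorageaccount
  account_key: <storage-account-key>
```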

datajoely

05/25/2022, 10:11 PM
That depends on the setup, but hopefully

marioFeynman

05/25/2022, 10:12 PM
OK, because when I define the path without the ://, the final path that Kedro uses is project_root + 'dbfs/mnt/data...'
So, manually leaving a copy of the dataset in DBFS and then using dbfs:// for the filepath works
I also tried using abfs, but I need to fight with my company because I don't have the permissions for that one
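For completeness, one more option that sidesteps both the dbfs:// protocol and the abfs credentials is the local FUSE mount Databricks exposes under /dbfs; this wasn't tested in this thread and the mount name is illustrative:

```yaml
# conf/base/catalog.yml -- untested alternative: absolute path through the /dbfs FUSE mount,
# so no protocol prefix or credentials are needed
predictions_via_mount:
  type: pandas.CSVDataSet
  filepath: /dbfs/mnt/my-lake/predictions.csv
```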