# advanced-need-help
m
Hi team! Hoping you are all well. I want to ask what the best way is to "mount" a Kedro project onto a datalake path... is there any good practice for this? I want to read and write non-Spark datasets to the datalake using the catalog feature... I am deploying my project on Databricks, which has access to a mounted datalake
d
Could you explain a little more about what you're trying to do? We also have some Databricks-centric integration work going on at the moment; would you mind commenting on this thread? https://discord.com/channels/778216384475693066/778996598880862208/978664438968225802
m
So, right now we have several data science projects using Kedro, and we deploy them on Databricks. Those Databricks workspaces can read data from Azure Data Lake through a mount point (an mnt path). Each project has a dedicated blob storage path in the datalake where we'd like to store some inputs/outputs from the projects' scheduled jobs (versioned data, predictions, model .pkl files, etc.), but the instructions we've found only cover using the cluster's hard drive or the Databricks File System... there is no native integration between Databricks and Azure Data Lake. So the workarounds we are using are (roughly sketched below):
1. Create a symbolic link between the cluster and the dedicated blob storage, i.e. link the data folder on the cluster to a data folder in the datalake.
2. After running a job, add an extra snippet that copies the outputs into the lake.
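Roughly what those two workarounds look like, just to illustrate - all the paths and folder names here are placeholders, not our real ones:
```python
from pathlib import Path
import shutil

# Placeholder paths - not the real project or mount locations.
PROJECT_DATA = Path("/databricks/driver/my-kedro-project/data")  # data folder on the cluster
LAKE_DATA = Path("/dbfs/mnt/my-project-container/data")          # mounted datalake path


def link_data_folder_to_lake():
    """Workaround 1: symlink the project's data folder onto the mounted lake path,
    so Kedro keeps resolving 'data/...' locally but everything lands in the lake."""
    if not PROJECT_DATA.exists():
        PROJECT_DATA.symlink_to(LAKE_DATA, target_is_directory=True)


def copy_outputs_to_lake():
    """Workaround 2 (alternative to 1): run the job against local disk,
    then copy the outputs into the lake afterwards."""
    shutil.copytree(
        PROJECT_DATA / "07_model_output",
        LAKE_DATA / "07_model_output",
        dirs_exist_ok=True,
    )
```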
d
Would you mind copying this into the thread above?
m
There is a third option, which would be to create a new environment dedicated to the Azure Data Lake stuff, but that is hard to maintain... like changing all the paths to abfs ones, the credentials, and so on
d
So I'm not a Databricks expert, but we leverage fsspec behind the scenes to abstract local and cloud filesystems
there does appear to be `dbfs` protocol support in fsspec - is this not fit for your purposes? https://filesystem-spec.readthedocs.io/en/stable/_modules/fsspec/implementations/dbfs.html
and then if you're looking at Azure as part of fsspec, all of these Azure targets are supported too: https://github.com/fsspec/adlfs
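Something like this is the kind of thing I have in mind - completely untested on your setup, and the workspace instance, token and storage account values below are placeholders; how much of it you actually need depends on how the cluster is authenticated:
```python
import fsspec

# dbfs:// resolves to fsspec's Databricks implementation (REST-API based):
with fsspec.open(
    "dbfs://mnt/my-project/data/01_raw/example.csv",
    mode="r",
    instance="adb-1234567890123456.7.azuredatabricks.net",  # placeholder workspace
    token="dapiXXXXXXXXXXXX",                                # placeholder PAT
) as f:
    print(f.readline())

# abfs:// (or az://) resolves to adlfs and talks to the storage account directly:
with fsspec.open(
    "abfs://my-container/data/01_raw/example.csv",
    mode="r",
    account_name="mystorageaccount",  # placeholder account
    account_key="XXXXXXXX",           # placeholder key
) as f:
    print(f.readline())
```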
m
The problem is that when I define the project root path, it doesn't match the dbfs mount path to the lake
That's the reason we ended up using a symbolic link, to trick Kedro into reading from and saving to the cluster while everything is actually mapped into the lake
d
The only thing I don't understand is why the Kedro project location dictates where the data lives
Would you mind posting a catalog entry?
m
The entries are the basic ones:
data/...
And if I use that kind of entry, Kedro will try to read and write directly from the cluster HD
d
Right, if you use the dbfs:// protocol prefix, does it change things?
m
And without any other args? Like putting dbfs:// directly into the filepath?
d
Yeah fsspec should resolve this if it's already authenticated on the cluster
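As a sketch (I haven't run this on Databricks myself; the import path varies by Kedro version - `kedro.extras.datasets` here, `kedro_datasets` in newer releases - and the filepath is a placeholder):
```python
from kedro.extras.datasets.pandas import CSVDataSet

# No extra args - fsspec picks the filesystem implementation from the dbfs:// prefix.
dataset = CSVDataSet(filepath="dbfs://mnt/my-project/data/01_raw/example.csv")
df = dataset.load()   # read through the dbfs filesystem
dataset.save(df)      # write back through the same filesystem
```
The catalog entry would just be the same filepath with the prefix instead of a relative `data/...` path.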
m
I will try that; we tried to use it as a regular path, like dbfs/mnt/...
Not with ://
And for abfs or az, do we still need the credentials stuff? Or should it grab the permissions from the environment?
d
That depends on the setup, but hopefully
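If the environment doesn't hand you the permissions, one possible fallback - again untested on your setup, with placeholder values - is to pass credentials explicitly and let Kedro forward them to fsspec/adlfs:
```python
from kedro.extras.datasets.pandas import CSVDataSet

# The credentials dict is handed to fsspec/adlfs when the filesystem is created.
dataset = CSVDataSet(
    filepath="abfs://my-container/data/01_raw/example.csv",   # placeholder container/path
    credentials={"account_name": "mystorageaccount", "account_key": "XXXXXXXX"},  # placeholders
)
```
In a real project the account name/key would live in conf/local/credentials.yml and be referenced from the catalog rather than hard-coded.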
m
OK, because when I define the path without the ://, the final path Kedro used was project_root + 'dbfs/mnt/data...'
So manually leaving a copy of the dataset in DBFS and then using dbfs:// for the filepath works
I also tried using abfs, but I'll need to fight with my company because I don't have the permissions for that one