here We have two exciting announcements Kedro is officially Kedro #announcements

Yetunde

05/24/2022, 2:23 PM

@here We have two exciting announcements! 🎉🎉🎉

*Kedro is officially online 🌍* We have a Kedro website: https://kedro.org/. This is our first version; please give us feedback.

*We're working with Databricks* We're setting up a hackathon with a team from Databricks to address issues related to the Databricks / Kedro development and deployment workflows. We would love to understand the scope of issues that you have encountered. Comment in a thread below with your struggles or use the 🗓️ emoji so that we can book time in your calendar to dive deeper into your workflow.

WolVez

05/24/2022, 2:35 PM

I am happy to book time to go over our workflows with you all; I think you would find it pretty interesting. For context, we use Databricks as our primary cloud resource because our team has hurdles accessing dev-ops resources, and Databricks lets us manage a lot ourselves. Primary areas of interest:

1) Kedro doesn't really fit the notebook approach for most development. As a result, we depend heavily on jobs. We have developed our own job deployment approach for Kedro via the Databricks API, but being able to take advantage of Kedro's DAG-like approach and quickly use Databricks' new Workflow/Task development for on-the-fly multi-pipeline or node job setup would be huge!

2) When working in an interactive Databricks environment, we constantly have to push from PyCharm to Git, then install the project into the interactive notebook session for any and all updates. This requires detaching and reattaching the notebook to clear the Databricks environment and, if it gets stuck, sometimes restarting the entire cluster. It would be interesting if something like Databricks Connect could be used with the equivalent of IPython's %autoreload functionality.

3) We constantly need to use Kedro for incremental updates. This has required us to rely heavily on partial functions when creating Kedro pipelines. Kedro integrations with Databricks Delta for far more effective incremental updates would save us a ton of time.
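Editor's sketch of point 1, turning a Kedro-style DAG into a multi-task Databricks job definition. This is not WolVez's deployment code. The node graph, the package name, the entry point and the `--node` run option are all assumptions made for illustration (the run option name depends on the Kedro version), and cluster settings are left out. Only the `task_key`, `depends_on` and `python_wheel_task` fields follow the Databricks Jobs API 2.1 job format.

```python
# Hypothetical node graph: each key is a Kedro node name, each value lists the
# nodes it depends on. A real project would derive this from its pipeline.
NODE_DEPENDENCIES = {
    "preprocess": [],
    "train": ["preprocess"],
    "evaluate": ["train"],
}


def build_job_payload(job_name, dependencies, package_name, entry_point):
    """Build a job definition with one task per node, wired by dependencies."""
    tasks = []
    for node_name, parents in dependencies.items():
        tasks.append(
            {
                "task_key": node_name,
                # Each parent node becomes an upstream task of this task.
                "depends_on": [{"task_key": parent} for parent in parents],
                "python_wheel_task": {
                    "package_name": package_name,
                    "entry_point": entry_point,
                    # Assumption: the entry point accepts a Kedro-style
                    # option that restricts the run to a single node.
                    "parameters": ["--node", node_name],
                },
            }
        )
    # Cluster configuration is deliberately omitted from this sketch.
    return {"name": job_name, "tasks": tasks}


payload = build_job_payload(
    job_name="kedro-demo",          # hypothetical job name
    dependencies=NODE_DEPENDENCIES,
    package_name="my_project",      # hypothetical packaged Kedro project
    entry_point="my_project",       # hypothetical entry point name
)
```

The resulting dictionary is the kind of body that would be sent to the job creation endpoint once cluster settings are added; keeping the graph-to-tasks step as a pure function makes it easy to check before anything is deployed.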

Malaguth

05/24/2022, 2:48 PM

Hi @WolVez, you can use Databricks Repos with non-notebook files as a workaround to integrate the Databricks Environment. The code below creates a kedro session and updates the Databricks current path with the repo code (It's just a workaround, not the solution).

python
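import os

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

# The notebook runs inside the Databricks Repo. 'folder' is a placeholder for
# the sub-folder that holds the notebook; everything before it is the project root.
current_path = os.getcwd()
index_databricks_folder = current_path.find('folder')
project_root = current_path[:index_databricks_folder]

# Load the project's settings, then create a session on the repo code.
bootstrap_project(project_root)
kedro_session = KedroSession.create(project_path=project_root)

# Mode 2 reloads edited modules automatically before each cell runs.
%load_ext autoreload
%autoreload 2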
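
Editor's note: the snippet above locates the project root with `str.find`, which returns -1 when the folder name is absent, so the slice would silently drop the last character of the path. A minimal alternative sketch, assuming the Kedro project keeps a `pyproject.toml` at its root (the helper name `find_project_root` is made up for this example):

```python
from pathlib import Path


def find_project_root(start, marker="pyproject.toml"):
    """Return the closest directory at or above `start` that contains `marker`."""
    start = Path(start).resolve()
    # Check the starting directory first, then walk up through its parents.
    for candidate in (start, *start.parents):
        if (candidate / marker).is_file():
            return candidate
    raise FileNotFoundError(f"No {marker} found at or above {start}")
```

In a notebook this could be called as `find_project_root(Path.cwd())`, with the result passed to `bootstrap_project` and `KedroSession.create` in place of the sliced string. It fails loudly instead of returning a wrong path.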

WolVez

05/24/2022, 2:50 PM

Thanks @Malaguth! I was actually already aware of this approach. It has been a while since I tried Databricks Repos; when I used them a year ago, they definitely left a lot to be desired. Will have to check them out again.

Malaguth

05/24/2022, 2:52 PM

I agree with you that a more integrated workflow between Kedro and Databricks would be a good improvement for the project, but so far this is the best option that I have found.

WolVez

05/24/2022, 2:56 PM

@Malaguth Totally understand. It just occurred to me that when I applied this approach in the past, the edits had to be made in the actual Databricks Repo files. Has this changed such that I could still use PyCharm?

WolVez

05/24/2022, 2:56 PM

So pushing a commit from PyCharm would be all that %autoreload needs?

Malaguth

05/24/2022, 7:05 PM

You'll need to push the commit and pull on Databricks Repos. (This will clear the notebook results, but the context still works.)
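
Editor's sketch: the pull step can also be scripted instead of clicked. The example below only builds the HTTP request and does not send it. It assumes the Databricks Repos REST endpoint `PATCH /api/2.0/repos/{repo_id}` with a `branch` field, which moves the repo to the latest commit on that branch; the host, token and repo id shown are placeholders, and the helper name is made up for this example.

```python
import json
import urllib.request


def build_repo_update_request(host, token, repo_id, branch):
    """Build (but do not send) a request that updates a repo to `branch`."""
    body = json.dumps({"branch": branch}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/repos/{repo_id}",
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values: replace with a real workspace URL, token and repo id.
request = build_repo_update_request(
    host="https://example.cloud.databricks.com/",
    token="<personal-access-token>",
    repo_id=123,
    branch="main",
)
```

Passing `request` to `urllib.request.urlopen` would perform the update, for instance from a CI step that runs after the push from PyCharm.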

marioFeynman

05/25/2022, 2:54 PM

Hey team! That is very good news! In my case, we have several data science projects using Kedro, and we deploy them on Databricks. Those Databricks workspaces can read data from Azure Data Lake through a mount point (mnt path). Each project has a dedicated blob storage path in the data lake, where we like to store some inputs/outputs from the projects' scheduled jobs (versioned data, predictions, .pkl models, etc.). But the instructions we follow only cover the cluster's own hard drive or the Databricks File System; there is no native integration between Databricks and Azure Data Lake. So the workarounds we are using are:

1. Generate a symbolic link between the cluster and the dedicated blob storage, i.e., link the .data folder on the cluster to a .data folder in the data lake.
2. After running a job, add an extra snippet to copy the outputs into the lake.
3. Create a dedicated environment in conf for cloud settings, but that is hard to maintain because we need to change everything to the right Azure Data Lake path, handle credentials management, and so on.
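
Editor's sketch of workaround 2, copying job outputs into the lake after a run. It is a guess at the shape of such a snippet, not marioFeynman's code. It assumes the mounted storage is reachable as an ordinary directory from the driver (on Databricks, mounts are typically visible under a `/dbfs/mnt/...` path), and the function name and both paths are hypothetical.

```python
import shutil
from pathlib import Path


def copy_outputs_to_lake(local_data_dir, lake_data_dir):
    """Copy everything under `local_data_dir` into `lake_data_dir`.

    Existing files with the same name are overwritten. Returns the relative
    paths of all files present in the target afterwards.
    """
    source = Path(local_data_dir)
    target = Path(lake_data_dir)
    target.mkdir(parents=True, exist_ok=True)
    # dirs_exist_ok (Python 3.8+) lets repeated job runs reuse the same target.
    shutil.copytree(source, target, dirs_exist_ok=True)
    return sorted(
        p.relative_to(target).as_posix() for p in target.rglob("*") if p.is_file()
    )
```

A job could call this as its last step, for example `copy_outputs_to_lake("data", "/dbfs/mnt/<project>/data")`, where the mount path is a placeholder for the project's dedicated location.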

Yetunde

06/29/2022, 9:11 AM

Hey everyone! Thank you so much for sharing your workflows and challenges in this thread. I've set up the research task for this and will be reaching out. The plan is that we'll do the research to understand the key pain points, then share this information with the Databricks Developer Experience team, because we're going to structure a hackathon to test fixing some of these things. Please share any feedback here: https://github.com/kedro-org/kedro/issues/1653

devintaylor03

08/05/2022, 4:35 PM

Hey @Yetunde, if it is possible to get involved in the user interviews please let me know. We are currently exploring alternatives to databricks-connect as part of our workflows, so would be great to hear others' experiences/workflows too. Thanks!
