j

jcasanuevam

07/30/2021, 8:54 AM
Hello guys! I've got a question related to the kedro-airflow plugin, but I'm not sure if this is the place to ask... My company has Airflow installed on an internal server using Docker. The Kedro docs say we should install our packaged Kedro project inside the Airflow environment, but that means we'd have to change the Airflow Docker image every time we want to install a new package, module, etc. Is there a way to manage that scenario without modifying the Airflow Docker image/compose file?
d

datajoely

07/30/2021, 11:18 AM
Let me get back to you on what our view of best practice is
So we have two out-of-the-box plugins which may help. We have `kedro-docker`, which helps you dockerise your pipeline, but in my opinion it's mostly useful for people who are unfamiliar with Docker and need help getting started. We also have `kedro-airflow`, which helps convert your Kedro pipeline into an Airflow DAG. This was co-developed by the folks at Astronomer. Have you used either of these? The second one feels more useful for your purposes
j

jcasanuevam

07/30/2021, 12:32 PM
Yes, I've used kedro-airflow, so I already have my Kedro pipeline as an Airflow DAG. The thing is, when I open the Airflow UI an 'X module not found' error appears, where X could be kedro, pandas or whatever package Airflow doesn't install when it's deployed on the internal server
So I know I have to install my Kedro project package inside the Airflow venv. The problem is that to do this I need to modify the Airflow Docker image or docker-compose file every time I want to install a new Kedro project package in my Airflow environment
d

datajoely

07/30/2021, 12:34 PM
This isn't an area I'm super knowledgeable about - let me ask around
In truth this feels like an Airflow problem / nature-of-Python problem rather than a Kedro one. I'm still waiting on some answers though
It's not a particularly long post, but this SO thread may be useful https://stackoverflow.com/questions/56891035/how-to-manage-python-dependencies-in-airflow
j

jcasanuevam

08/02/2021, 11:06 AM
I'm running into the following error when trying to execute a Kedro pipeline with Airflow:
It seems to be a problem with the conf/base directory. Any tips?
d

datajoely

08/02/2021, 11:07 AM
Yeah the * directory is causing issues I think?
Is that the actual name of the folder?
because I think we use the * symbol as a glob wildcard?
j

jcasanuevam

08/02/2021, 11:08 AM
Nope, that's not the name of the folder
d

datajoely

08/02/2021, 11:08 AM
any idea where the * is coming from?
j

jcasanuevam

08/02/2021, 11:10 AM
* matches any number of directories (3 in this case) between /opt/ and /conf/base
d

datajoely

08/02/2021, 11:11 AM
is that a Docker or Airflow thing?
The stars will cause issues with Kedro
j

jcasanuevam

08/02/2021, 11:18 AM
I don't know... I guess it is an Airflow thing but I don't know how to deal with the stars 😦
d

datajoely

08/02/2021, 11:18 AM
hmm let me do some googling
Is there anywhere you specify that?
seeing your docker-compose/Dockerfile may be helpful too
j

jcasanuevam

08/02/2021, 12:27 PM
that's my docker-compose file
Inside the projects folder I have the following structure: project_name >> conf, data, logs folders related to the Kedro project
and I've installed the Kedro project inside all of the Airflow containers
always in the project_name root
Airflow recognises my DAG and all of the Python packages needed for my pipeline
but when I execute the workflow I get the error I shared before
d

datajoely

08/02/2021, 12:51 PM
Thank you for this
let me check with the team
u

user

08/02/2021, 1:07 PM
Hi @jcasanuevam the problem is that `kedro package` doesn't package your `conf/` directory. In our guide where we use Astronomer, their Dockerfile copies the entire project directory into the container. I suspect that isn't happening in your case?
If you post your Dockerfile, I can help you amend it to make it work
j

jcasanuevam

08/03/2021, 6:10 AM
That's it - kedro package doesn't package my conf/ directory, and that's why I copied the conf/, data/ and logs/ directories as the Kedro docs say
I don't have any Dockerfile related to my Kedro project. I've just accessed the Airflow containers and installed the Kedro wheel file in a directory where the conf/, data/ and logs/ directories live.
PROBLEM SOLVED. The issue was the project_path variable in the DAG file generated by kedro using the kedro-airflow plugin. It was set by default to project_path = Path.cwd(), and the project folder is not the same as the DAG file's folder, so changing this to the correct project path makes everything work 🙂
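For anyone landing here later, a minimal sketch of that fix inside the generated DAG file (the path below is illustrative, not the actual layout from this thread):
```python
# Relevant fragment of the DAG file produced by `kedro airflow create`.
from pathlib import Path

# Default generated value - only correct when Airflow's working directory
# happens to be the Kedro project root:
# project_path = Path.cwd()

# Point it at the actual project root inside the Airflow container instead
# (illustrative path):
project_path = Path("/opt/airflow/projects/project_name")
```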
d

datajoely

08/03/2021, 9:47 AM
Oh wow!
Well done @User πŸ₯³
So in an ideal world - would it be something that we could make configurable in kedro-airflow?
or would it be something that would just be worth putting in docs?
j

jcasanuevam

08/03/2021, 9:52 AM
Putting it in the docs would be easiest - as a reminder to modify the value of that variable if the project path is not the same as where the DAG file is located (as a best practice, they shouldn't be the same)
d

datajoely

08/03/2021, 9:53 AM
Understood - I'll raise a documentation PR
j

jcasanuevam

08/03/2021, 9:55 AM
super!
j

jcasanuevam

08/03/2021, 10:08 AM
I'm not using Astronomer but I guess we would face the same issue, so yes, I think this is a good place. I'd also modify the kedro-airflow github repository
d

datajoely

08/03/2021, 10:13 AM
Understood - thank you!
If you have a second - it would be great if you could look at this PR to the readme? https://github.com/quantumblacklabs/kedro-airflow/pull/83/files?short_path=b335630#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5 Would this have helped if it had been there a few days ago? 🤦
j

jcasanuevam

08/03/2021, 10:48 AM
Perfect! I think that's enough. Thank you!
d

datajoely

08/03/2021, 10:48 AM
Thank you!
j

jcasanuevam

08/03/2021, 11:41 AM
One question: if my project uses the kedro-mlflow plugin, once I've packaged my project, does the package include the kedro-mlflow plugin? I'm getting the following error:
d

datajoely

08/03/2021, 11:42 AM
Yes - we only package the Kedro project itself
this is something we're thinking about
j

jcasanuevam

08/03/2021, 11:45 AM
mmm so maybe a better solution than packaging the project is to use the kedro-docker plugin to get the Dockerfile and run it inside the Airflow environment at the desired project root?
d

datajoely

08/03/2021, 11:46 AM
I think that sounds simpler to me
j

jcasanuevam

08/03/2021, 11:47 AM
I will give it a try and tell you, thanks!
u

user

08/03/2021, 1:41 PM
@User if you have a docker container, you can also run it in Airflow with the DockerOperator
j

jcasanuevam

08/03/2021, 5:29 PM
But with DockerOperator I lose the option of having each node of my Kedro pipeline defined as a task in Airflow (the kedro-airflow plugin allows us to do that, 'translating' our Kedro pipeline into an Airflow DAG). My DAG using DockerOperator would be reduced to a single task built with DockerOperator, and I like having my Kedro pipeline visually represented in Airflow. Is there a way to keep that using DockerOperator?
Hi guys. Here again (sorry 😅). This is my DockerOperator in the Airflow DAG:
I'm getting the following error in Airflow:
Any tips about what's going on?
u

user

08/05/2021, 9:47 AM
Yea, in this setup you need to give the user inside Airflow permission to access the `docker_url` so it can launch the operator. Try running `sudo chmod 777 /var/run/docker.sock` on your host machine and see if it helps? In production, it will be a Docker cluster somewhere with a proper URL, so this permission problem will not exist.
Regarding the shape of the DAG: you can certainly retain the shape of the DAG with DockerOperator. In the `command` section of the operator, simply change `kedro run` to `kedro run --node=...`. Essentially, replace `KedroOperator` with `DockerOperator` and retain the same DAG by parameterising the command.
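To make that concrete, here's a rough sketch of what one-node-per-task could look like (assuming Airflow 2.x with the Docker provider installed; the image name, node names and DAG settings are illustrative, not taken from this thread):
```python
# One DockerOperator per Kedro node, preserving the pipeline's shape in Airflow.
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id="kedro_pipeline_docker",
    start_date=datetime(2021, 8, 1),
    schedule_interval=None,
) as dag:
    preprocess = DockerOperator(
        task_id="preprocess",
        image="my-kedro-project:latest",             # illustrative image name
        command="kedro run --node=preprocess_node",  # illustrative node name
    )
    train = DockerOperator(
        task_id="train",
        image="my-kedro-project:latest",
        command="kedro run --node=train_node",
    )
    preprocess >> train  # same dependency structure as the Kedro pipeline
```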
> I'm not using Astronomer but I guess we would face the same issue, so yes, I think this is a good place. I'd also modify the kedro-airflow github repository
With the Astronomer CLI you won't face the same problem because their Dockerfile gobbles up the entire project directory, so `conf/` will be included. An example of how to do that yourself would be:
* Package the Kedro project as a wheel and install it in the Dockerfile with `RUN pip install <the wheel>` (this will install all of the project dependencies into the container, including `kedro-mlflow` in your case)
* Copy the whole project directory into the current working directory in the container with `COPY . /home/kedro-project`
* Set the working directory in the container with `WORKDIR /home/kedro-project`
* Now `KedroOperator` or `kedro run` should work because `conf/` is present at the current working directory
Let me see if I have time today. I will cook up a sample repository for you to demonstrate this.
Sorry I caused more confusion the other day: I was suggesting `DockerOperator` as another option, not that you really need it to make things work.
@User hey I got this one up as a demo for you: https://github.com/limdauto/demo-kedro-airflow-docker-compose. The idea is:
* Install the wheel file into your container with this Dockerfile: https://github.com/limdauto/demo-kedro-airflow-docker-compose/blob/main/Dockerfile. It should install all dependencies of your project, including kedro-mlflow etc.
* Mount the conf and data dirs: https://github.com/limdauto/demo-kedro-airflow-docker-compose/blob/main/docker-compose.yaml#L64-L65
* Tell docker compose to build the image: https://github.com/limdauto/demo-kedro-airflow-docker-compose/blob/main/docker-compose.yaml#L47-L48 (or you can do what you did and build an image yourself and push it somewhere)
The rest is normal Airflow and Kedro: create the DAGs with `kedro airflow create`, etc.
You shouldn't need more than this to run Kedro on Airflow with docker-compose
lmk if it helps
j

jcasanuevam

08/05/2021, 12:12 PM
With that solution we'd need to do it every time we want to add a new Kedro project to our Airflow env, and we could have problems with Python dependencies. I think I prefer having a Docker container of my Kedro project and 'calling it' from Airflow in order to execute it. Now I'm trying to run sudo chmod 777 /var/run/docker.sock but I can't find the docker.sock file 😦
BTW thank you very much for your support and your demo πŸ™‚
u

user

08/05/2021, 12:24 PM
which environment is your docker host machine? Is the docker cluster running on that machine?
j

jcasanuevam

08/05/2021, 12:25 PM
I have everything running in my local machine
u

user

08/05/2021, 12:26 PM
and you are running on your local machine as root?
also what os is your local machine?
j

jcasanuevam

08/05/2021, 12:27 PM
windows
u

user

08/05/2021, 12:27 PM
aha! Sorry, I assumed you were using Linux because you're using a unix socket
1s
is this docker running on WSL?
and could you do a `docker network ls` and paste the result here please?
since this is Windows, instead of running the chmod command, another approach is to mount this under `x-airflow-common: volumes`:
# note the 2 // here for windows
- //var/run/docker.sock:/var/run/docker.sock
If that's still not working, try adding `network_mode: airflow-docker_default` in `x-airflow-common`
j

jcasanuevam

08/05/2021, 4:41 PM
Not working 😦 This is my docker-compose file:
and my DAG:
u

user

08/05/2021, 5:33 PM
What's the error? Could you remove `docker-socket-proxy` as well? Also remove `network_mode` and `docker_url` in the `DockerOperator` constructor - they are overriding values from docker compose.
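For illustration, the trimmed-down constructor that suggestion implies might look roughly like this (task id, image and command are placeholders, not values from this thread):
```python
# DockerOperator relying on the defaults coming from the docker provider /
# compose environment - no explicit docker_url or network_mode.
from airflow.providers.docker.operators.docker import DockerOperator

run_pipeline = DockerOperator(
    task_id="run_kedro_pipeline",     # placeholder
    image="my-kedro-project:latest",  # placeholder
    command="kedro run",
)
```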
j

jcasanuevam

08/06/2021, 4:44 AM
I'm still facing the same error:
I think the problem is in my Kedro project Dockerfile, where I have to add the airflow user, I guess:
UPDATE: I've set this in Docker Desktop
and set Ubuntu as the default WSL distro
via the PowerShell terminal I've accessed the distro with wsl -d Ubuntu, and /var/run/docker.sock exists
so I ran chmod 777 /var/run/docker.sock
ok with that
but still getting the same error 😦
and I have also changed the Kedro project Dockerfile:
u

user

08/09/2021, 1:52 PM
Sorry you are struggling with this 😦 This all seems like a problem with docker and WSL... Let me try this out tonight on my windows machine when I get home
j

jcasanuevam

08/17/2021, 8:25 AM
Hi @User ! Did you try it out?
d

datajoely

08/17/2021, 9:00 AM
@User unfortunately he's out this week. Do you have any updates? I can do my best to help
j

jcasanuevam

08/23/2021, 5:57 AM
Hi @User , sorry, I've been on holiday. Today I'll give it another try
Still facing the same issue 😦 I bet my problem is with my Kedro project Dockerfile:
d

datajoely

08/23/2021, 12:22 PM
what error are you seeing?
j

jcasanuevam

08/23/2021, 12:24 PM
That's the problem, I don't see any error hahaha. I'm a newbie with Docker, but I think the issue is with the user/userid and group/groupid and their permissions
d

datajoely

08/23/2021, 12:25 PM
ah it's a bit outside of my knowledge too
all I can suggest is digging around the logs for clues
j

jcasanuevam

08/25/2021, 12:20 PM
https://discord.com/channels/778216384475693066/870590055754899536/872778121110249495 Hi @User I was able to run my Kedro project Dockerfile with Airflow using the DockerOperator, but I think what you say in that message is not correct. If we just change the 'command' variable across the different DockerOperators it won't work, because each DockerOperator creates a new container from your Docker image, so once the first task is finished the container for the next task doesn't have the information from the previous task - the chain is broken. We can only run our whole Kedro project in one DockerOperator task, losing the tree diagram in Airflow.
I'm trying to figure out whether it is possible for tasks to share the same container in order to share state between tasks
d

datajoely

08/25/2021, 1:12 PM
At this point you've reached my Airflow knowledge too!
j

jcasanuevam

08/26/2021, 6:02 AM
Hi guys! I think I'll go back to the KedroOperator and kedro-airflow plugin solution, but I have a question about the wheel file. Why would I get all of my project's dependencies doing it the way @User says, but not when I install the wheel file with pip inside my Airflow container and copy the /conf, /data and /logs directories there? To me it seems like the same process - one uses the project's Dockerfile, the other installs the wheel file directly in the Airflow container and copies the desired directories
u

user

08/26/2021, 8:35 AM
Hello @User regarding the first question:
> it won't work because each DockerOperator creates a new container of your Docker Image so once the first task is finished the container of the next task doesn't have the information of the previous task, so the chain is broken
Could you elaborate on what you mean by "the chain is broken"? Which information do you want to retain between tasks? If it's data: since Airflow is a distributed scheduler, you will need to persist your data between tasks to a shared location.
Sorry, I lost a bit of context. I will need to re-read this thread, but you are right in theory: installing the wheel file should install all of the project dependencies.
j

jcasanuevam

08/26/2021, 8:43 AM
Picture from the Kedro docs. My question to @User is: does the resulting package also contain the dependencies of the project? (for example, the kedro-mlflow plugin, if I have it in my project)
All data stored as MemoryDataSets in my Kedro project would be lost, forcing me to persist every dataset, artifact, etc. somewhere, and I don't think that's a good solution/best practice
d

datajoely

08/26/2021, 8:53 AM
Yes - you need to persist at every point in an orchestrator
we are working on a way to deploy things as modular pipelines which take advantage of MemoryDataSet
but it's not released yet!
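One way to persist an intermediate result that would otherwise be a MemoryDataSet is to give it an explicit catalog entry pointing at a shared volume. A minimal sketch in Python, equivalent to adding a catalog.yml entry (it assumes a Kedro 0.17.x-style API where datasets live under kedro.extras.datasets; the dataset name and path are illustrative):
```python
# Persist an intermediate dataset to a shared volume so separate Airflow
# tasks (each running in its own container) can all read and write it.
from kedro.extras.datasets.pandas import CSVDataSet
from kedro.io import DataCatalog

catalog = DataCatalog(
    {
        "preprocessed_data": CSVDataSet(
            filepath="/shared/data/02_intermediate/preprocessed.csv"  # illustrative path
        ),
    }
)
```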
j

jcasanuevam

08/26/2021, 10:37 AM
Thanks @User , could you shed some light on the above question?
d

datajoely

08/26/2021, 10:38 AM
let me check with the team
j

jcasanuevam

08/26/2021, 11:59 AM
Another issue (an old one that seemed to be resolved, but I think not), regarding the kedro-airflow plugin: even if we use the same directory for our DAG and Kedro project, we're still having the same issue:
d

datajoely

08/26/2021, 12:00 PM
hmm I'm not sure on the second one
but this is the answer to the dependencies one
j

jcasanuevam

08/26/2021, 12:04 PM
Thanks! I manage dependencies with conda, not pip, so I don't know if the requirements.txt file fits my case
Maybe it'd be good to document how the 'project_path' variable works inside the dag.py, because there is something strange in that behaviour
Hi guys, one important thing I've already discovered: using KedroOperator, MemoryDatasets are not shared between tasks, so we need to persist every piece of data in our project in a folder (volume) shared with Airflow, just as when using the DockerOperator. I think that is an important thing to point out in the kedro-airflow and kedro-docker docs.
I guess you are in touch with the kedro-airflow plugin developers, so I think in a future release they could address that problem using Airflow's XComs functionality, telling Airflow that every dataset stored as a MemoryDataset in Kedro should be transferred between tasks using XComs.
d

datajoely

08/27/2021, 9:36 AM
The airflow plugin is managed by the Kedro team
we can update the docs - but as I said a bit earlier, we will revisit deployment in general in early 2022 once our current work on modular pipelines enables this
so it's front of mind - just need to sequence things right