# beginners-need-help
j
Hello guys! I've got a question related to the kedro-airflow plugin, but I'm not sure if this is the place to ask... My company has Airflow installed on an internal server using Docker. The Kedro docs say that we should install our packaged Kedro project inside the Airflow environment, but that means we would have to change the Airflow Docker image every time we want to install a new package, module, etc. Is there a way to manage that scenario without modifying the Airflow Docker image/Compose file?
d
Let me get back to you on what our view of best practice is
So we have two out-of-the-box plugins which may help. We have `kedro-docker`, which helps you dockerise your pipeline, but in my opinion it's mostly useful for people who are unfamiliar with Docker and need help getting started. We also have `kedro-airflow`, which helps convert your Kedro pipeline into an Airflow DAG. This was co-developed by the folks at Astronomer. Have you used either of these? The second one feels more useful for your purposes
j
Yes, I've used kedro-airflow, so I already have my Kedro pipeline as an Airflow DAG. The thing is, when I open the Airflow UI an 'X module not found' error appears, where X could be kedro, pandas or whatever package Airflow doesn't install when it is deployed on the internal server
So I know I have to install my Kedro project package inside the Airflow venv. The thing is that to do this I need to modify the Airflow Docker image or docker-compose file every time I want to install a new Kedro project package in my Airflow environment
d
This isn't an area that I'm super knowledgeable about - let me ask around
In truth this feels like an Airflow problem / nature-of-Python problem rather than a Kedro one. I'm still waiting on some answers though
It's not a particularly long post, but this SO thread may be useful: https://stackoverflow.com/questions/56891035/how-to-manage-python-dependencies-in-airflow
j
I'm running into the following error when trying to execute a Kedro pipeline with Airflow:
It seems to be a problem with the conf/base directory. Any tips?
d
Yeah the * directory is causing issues I think?
Is that the actual name of the folder?
because I think we use the * symbol as a glob wildcard?
j
Nope, that's not the name of the folder
d
any idea where the * is coming from?
j
* matches any number of directories (3 in this case) between /opt/ and /conf/base
d
is that a Docker or Airflow thing?
The stars will cause issues with Kedro
j
I don't know... I guess it is an Airflow thing but I don't know how to deal with the stars 😦
d
hmm let me do some googling
Is there anywhere you specify that?
seeing your docker-compose/Dockerfile may be helpful too
j
that's my docker-compose file
inside the projects folder I have the following structure: project_name >> conf, data, logs folders related to the Kedro project
and I've installed the Kedro project inside all of the Airflow containers
always at the root of project_name
Airflow recognizes my DAG and all of the Python packages needed for my pipeline
but when I execute the workflow I get the error I've shared before
d
Thank you for this
let me check with the team
u
Hi @jcasanuevam, the problem is that `kedro package` doesn't package your `conf/` directory. In our guide where we use Astronomer, their Dockerfile copies the entire project directory into the container. I suspect that isn't happening in your case?
u
If you post your Dockerfile, I can help you amend it to make it work
j
That's it. `kedro package` doesn't package my conf/ directory, and that's why I copied the conf/, data/ and logs/ directories, as the Kedro docs say
I don't have any Dockerfile related to my Kedro project. I've just accessed the Airflow containers and installed the Kedro wheel file in a root where the conf/, data/ and logs/ directories live.
PROBLEM SOLVED. The problem was the project_path variable in the DAG file generated by the kedro-airflow plugin. It was set by default to project_path = Path.cwd(), and the project folder is not the same as the DAG file's folder, so changing this to the correct project path makes everything work πŸ™‚
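For reference, a minimal sketch of that change in the generated DAG file (the container path below is just a placeholder for wherever the packaged project and its conf/, data/ and logs/ folders actually live):
```python
from pathlib import Path

# Default generated by kedro-airflow -- resolves to the scheduler/worker's
# current working directory, which is usually NOT the Kedro project root:
# project_path = Path.cwd()

# Point it explicitly at the Kedro project root inside the Airflow container
# (placeholder path, adjust to your own layout):
project_path = Path("/opt/airflow/projects/project_name")
```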
d
Oh wow!
Well done @User πŸ₯³
So in an ideal world - would it be something that we could make configurable in kedro-airflow?
or would it be something that would just be worth putting in docs?
j
Putting it in the docs would make it much easier - as a reminder to modify the value of that variable if the project path is not the same as where the DAG file is located (and as a best practice, it shouldn't be the same)
d
Understood - I'll raise a documentation PR
j
super!
j
I'm not using Astronomer but I guess we would face the same issue, so yes, I think this is a good place. I'd also modify the kedro-airflow github repository
d
Understood - thank you!
If you have a second - it would be great if you could look at this PR to the README: https://github.com/quantumblacklabs/kedro-airflow/pull/83/files?short_path=b335630#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5 Would this have helped if it had been there a few days ago? 🀦
j
Perfect! I think that's enough. Thank you!
d
Thank you!
j
One question. If my project uses the kedro-mlflow plugin, once I package my project, does it also package the kedro-mlflow plugin? I'm getting the following error:
d
Yes we only package the kedro project
this is something we're thinking about
j
mmm so maybe a better solution than packaging the project is to use the kedro-docker plugin to get the Dockerfile and run it inside the Airflow environment in the desired project root?
d
I think that sounds simpler to me
j
I will give it a try and tell you, thanks!
u
@User if you have a docker container, you can also run it in Airflow with the DockerOperator
j
But with DockerOperator I lose the option of having each node of my Kedro pipeline defined as a task in Airflow (the kedro-airflow plugin allows us to do that, 'translating' our Kedro pipeline into an Airflow DAG). My DAG using DockerOperator would be reduced to a single task built with DockerOperator, and I like having my Kedro pipeline visually represented in Airflow. Is there a way to keep that while using DockerOperator?
Hi guys. Here again (sorryπŸ˜… ). This is my DockerOperator in the airflow dag:
I'm getting the following error in airflow:
any tips about what's going on?
u
Yea, in this setup, you need to give the user inside Airflow permission to access the `docker_url` so it can launch the operator. Try running `sudo chmod 777 /var/run/docker.sock` on your host machine and see if it helps?
u
In production, it will be a docker cluster somewhere with a proper url so this permission problem will not exist.
u
Regarding the shape of the DAG: you can certainly retain the shape of the DAG with DockerOperator. In the `command` section of the operator, simply change `kedro run` to `kedro run --node=...`. Essentially, replace `KedroOperator` with `DockerOperator` and retain the same DAG by parameterising the command.
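As an illustration, a rough sketch of what that could look like (the image name, node names and dependencies are placeholders, not taken from this thread):
```python
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id="kedro_pipeline",
    schedule_interval=None,
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
    # One DockerOperator per Kedro node, parameterised by the node name
    tasks = {
        node: DockerOperator(
            task_id=node,
            image="my-kedro-project:latest",     # e.g. the image built with kedro-docker
            command=f"kedro run --node={node}",  # run a single node per task
        )
        for node in ["preprocess_node", "train_model_node", "evaluate_node"]
    }

    # Recreate the Kedro pipeline's dependencies between the Airflow tasks
    tasks["preprocess_node"] >> tasks["train_model_node"] >> tasks["evaluate_node"]
```
Note that, as discussed further down in the thread, each operator runs in its own container, so any data passed between nodes has to be persisted to a shared location rather than kept in MemoryDataSets.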
u
> I'm not using Astronomer but I guess we would face the same issue, so yes, I think this is a good place. I'd also modify the kedro-airflow github repository

With the Astronomer CLI you won't face the same problem, because their Dockerfile gobbles up the entire project directory, so `conf/` will be included. An example of how to do that by yourself would be:
* Package the Kedro project as a wheel and install it in the Dockerfile with `RUN pip install <the wheel>` (this will install all of the project dependencies into the container, including `kedro-mlflow` in your case)
* Copy the whole project directory into the current working directory in the container with `COPY . /home/kedro-project`
* Set the working directory in the container with `WORKDIR /home/kedro-project`
* Now `KedroOperator` or `kedro run` should work, because `conf/` is present at the current working directory
u
Let me see if I have time today. I will cook up a sample repository for you to demonstrate this.
u
Sorry I caused more confusion the other day: I was suggesting `DockerOperator` as another option, not that you really need it to make things work.
u
@User hey I got this one up as a demo for you: https://github.com/limdauto/demo-kedro-airflow-docker-compose. The idea is:
* Install the wheel file into your container with this Dockerfile: https://github.com/limdauto/demo-kedro-airflow-docker-compose/blob/main/Dockerfile. It should install all dependencies of your project, including kedro-mlflow etc.
* Mount the conf and data dirs: https://github.com/limdauto/demo-kedro-airflow-docker-compose/blob/main/docker-compose.yaml#L64-L65
* Tell docker compose to build the image: https://github.com/limdauto/demo-kedro-airflow-docker-compose/blob/main/docker-compose.yaml#L47-L48 (or you can do what you did and build an image yourself and push it somewhere)

The rest is normal Airflow and Kedro: create the DAGs with `kedro airflow create`, etc.
u
You shouldn't need more than this to run Kedro on Airflow with docker-compose
u
lmk if it helps
j
With that solution we need to do it every time we want to add a new Kedro project to our Airflow env, and we could have problems with Python dependencies. I think I prefer having a Docker container of my Kedro project and 'calling it' from Airflow in order to execute it. Now I'm trying to run sudo chmod 777 /var/run/docker.sock but I can't find the docker.sock file 😦
BTW thank you very much for your support and your demo πŸ™‚
u
which environment is your docker host machine? Is the docker cluster running on that machine?
j
I have everything running in my local machine
u
and you are running on your local machine as root?
u
also what os is your local machine?
j
windows
u
aha! sorry, I assumed you use Linux because you use a Unix socket
u
1s
u
is this docker running on WSL?
u
and could you do a `docker network ls` and paste the result here please?
u
since this is Windows, instead of running the chmod command, another approach is to mount this under `x-airflow-common: volumes`:
```yaml
# note the 2 // here for windows
- //var/run/docker.sock:/var/run/docker.sock
```
If that's still not working, try adding `network_mode: airflow-docker_default` in `x-airflow-common`
j
Not working 😦 This is my docker-compose file:
and my DAG:
u
What's the error? Could you remove `docker-socket-proxy` as well? Also remove `network_mode` and `docker_url` in the `DockerOperator` constructor. They are overriding values from docker compose.
j
I'm still facing the same error:
I think the problem is in my kedro project Dockerfile, where I have to add the airflow user I guess:
UPDATE: I've set this in Docker Desktop
and set Ubuntu as the default WSL distro
via the PS terminal I've accessed it with wsl -d Ubuntu, and /var/run/docker.sock exists
so I ran chmod 777 /var/run/docker.sock
ok with that
but still getting the same error 😦
and I have changed the kedro project Dockerfile also:
u
Sorry you are struggling with this 😦 This all seems like a problem with docker and WSL... Let me try this out tonight on my windows machine when I get home
j
Hi @User ! Did you try it out?
d
@User unfortunately he's out this week. Do you have any updates? I can do my best to help
j
Hi @User, sorry I've been on holiday. Today I'll give it a try again
Still facing the same issue 😦 I bet my problem is with my Kedro project Dockerfile:
d
what error are you seeing?
j
That's the problem, I don't see any hahaha. I'm a newbie with Docker but I think the issue is with the user/userid and group/groupid and their permissions
d
ah it's a bit outside of my knowledge too
all I can suggest is digging around the logs for clues
j
https://discord.com/channels/778216384475693066/870590055754899536/872778121110249495 Hi @User, I was able to run my Kedro project Dockerfile with Airflow using the DockerOperator, but I think what you say in that message is not correct. If we just change the 'command' variable across the different DockerOperators it won't work, because each DockerOperator creates a new container from your Docker image, so once the first task is finished the container of the next task doesn't have the information from the previous task and the chain is broken. We can only run our whole Kedro project in one DockerOperator task, losing the tree diagram of Airflow.
I'm trying to figure out if it is possible for tasks to share the same container in order to share state between tasks
d
At this point you've reached my Airflow knowledge too!
j
Hi guys! I think I'll go back to the KedroOperator and kedro-airflow plugin solution, but I have a question about the wheel file. Why is it that doing it the way @User says I'll have all the dependencies of my project, but installing the wheel file with pip inside my Airflow container and copying the /conf, /data and /logs directories there, I won't have all the dependencies? To me it seems to be the same process, just one using a Dockerfile for your project and the other installing the wheel file directly in the Airflow container and copying the desired directories
u
Hello @User, regarding the first question:
> it won't work because each DockerOperator creates a new container from your Docker image, so once the first task is finished the container of the next task doesn't have the information from the previous task and the chain is broken

Could you elaborate on what you mean by "the chain is broken"? Which information do you want to retain between tasks? If it's data, then since Airflow is a distributed scheduler, you will need to persist your data between tasks to a shared location.
u
Sorry I lost a bit of context. I will need to re-read this thread again but you are right in theory: installing the wheel file should install all of the project dependencies.
j
Picture from the Kedro docs. My question to @User is: does the resulting package also contain the dependencies of the project? (for example, the kedro-mlflow plugin if I've got it in my project)
All data stored as MemoryDataSets in my Kedro project would be lost, forcing me to persist every dataset, artifact, etc. somewhere, and I think that's not a good solution/best practice
d
Yes - you need to persist at every point in an orchestrator
we are working on a way to deploy things as modular pipelines which take advantage of MemoryDataSet
but it's not released yet!
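As a minimal sketch of what "persisting at every point" means on the Kedro side (the dataset name and path are placeholders, and in a real project this would normally be a catalog.yml entry rather than Python code):
```python
from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import ParquetDataSet

# Previously an implicit MemoryDataSet passed between nodes in-process;
# now written to a volume shared with the Airflow workers/containers,
# so a downstream task in another container can load it.
catalog = DataCatalog(
    {
        "preprocessed_data": ParquetDataSet(
            filepath="/opt/airflow/projects/project_name/data/02_intermediate/preprocessed_data.parquet"
        ),
    }
)
```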
j
Thanks @User , could you bring us some light about above question?
d
let me check with the team
j
Another issue (this is an old one that seemed to be resolved, but I think not), regarding the kedro-airflow plugin. Even if we have the same directory for our DAG and Kedro project, we're still having the same issue:
d
hmm I'm not sure on the second one
but this is the answer to the dependencies one
j
Thanks! I manage dependencies with conda, not pip, so I don't know if the requirements.txt file fits my case
Maybe it'd be good to know how the 'project_path' variable works inside the dag.py, because there is something strange in that behaviour
Hi guys, one important thing I have already discovered: using KedroOperator, MemoryDataSets are not shared between tasks, so we need to persist all of our project's data in a folder (volume) shared with Airflow, just as when using the DockerOperator. I think that is an important thing to point out in the kedro-airflow and kedro-docker docs.
I guess you are in touch with the kedro-airflow plugin developers, so I think in a future release they could fix that problem using Airflow's XComs functionality, telling Airflow that any data stored as a MemoryDataSet in Kedro should be transferred between tasks using XComs.
d
The airflow plugin is managed by the Kedro team
we can update the docs - but as I said a bit earlier, we will revisit deployment in general in early 2022, once our current work on modular pipelines enables this
so it's front of mind - just need to sequence things right