# advanced-need-help
b
On the same topic of environments, if you wanted to have a separate spark configuration for different nodes in a pipeline, is the correct approach to store those configurations in separate environments and execute the nodes in separate runs with -env flags? Or is there some other way that would allow changing the configuration, perhaps by reloading the context, within the same run?
d
So Kedro isn't designed to be execution-environment aware like fully-fledged orchestrators such as Argo, Flyte, etc.
you could achieve this by breaking up your pipeline and having different configuration environments i.e.
kedro run --pipeline initial_pipeline --env local_cluster
and then
kedro run --pipeline second_pipeline --env emr_cluster
but it doesn't feel like a robust solution
so I would perhaps encourage you to wrap your pipelines in a proper orchestrator designed for this sort of thing
a
Generally agreed with Joel on this, but actually since kedro 0.18.1 introduced the `after_context_created` hook I wonder if there's a better way of doing it now... Here's a rough demo of how you could do it: https://gist.github.com/AntonyMilneQB/792a748b0d921e2f9f78cc7dd9c13c97. The advantages of this are:
* no need for a custom `KedroContext` at all, since all the spark stuff is done in hooks
* you can still use run environments as you currently do, no need to create a separate run environment for each spark config (although you still can do so if you like)
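Roughly, the shape of the idea looks something like this (an illustrative sketch, not the gist verbatim; the config-loader call and app name may differ from what the gist actually does):
```python
# Illustrative sketch: a hook that reads spark.yml from the active run
# environment and starts a single SparkSession for the whole run.
from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        # Load spark*.yml from conf/<env>/ via the config loader
        # (exact pattern is illustrative)
        parameters = context.config_loader.get("spark*", "spark*/**")
        spark_conf = SparkConf().setAll(parameters.items())

        # One SparkSession for the run; the app name is just a placeholder
        spark = (
            SparkSession.builder.appName("kedro-project")
            .config(conf=spark_conf)
            .getOrCreate()
        )
        spark.sparkContext.setLogLevel("WARN")
```
with the hook registered in `settings.py` via `HOOKS = (SparkHooks(),)`.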
I'm very interested in hearing what you think of this approach and whether it works for you! In general I'm wondering if we should move to using this sort of pattern instead of a custom `KedroContext` for spark initialisation. See https://github.com/kedro-org/kedro/issues/1563
b
I also agree that much of this can/should be delegated to the orchestrator; however, for early development stages or configuring spark params that aren't tied so tightly to cluster params (maybe num_shuffle_partitions or something?), I'm interested in trying out the hook-based approach you outline here. It seems potentially simpler to have multiple spark configs in the same environment and map to them via node names or tags. I haven't used hooks yet, so this gives me a good excuse 🙂
d
Let us know how this goes - as @antony.milne mentioned this is a brand new feature so any and all feedback is really appreciated
a
The more I think about it, the more I think that node tags is actually a very good way to do this. It's quite similar to this idea: https://discord.com/channels/778216384475693066/846330075535769601/935879323813036142.
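To make the tag idea concrete, a very rough sketch could be a `before_node_run` hook that maps node tags to runtime-settable Spark options (everything below is illustrative: the tag names, values, and mapping are made up, and only options that Spark allows changing on a live session, like spark.sql.shuffle.partitions, can be set this way):
```python
# Illustrative only: apply per-node Spark settings based on node tags.
from kedro.framework.hooks import hook_impl
from pyspark.sql import SparkSession

# Hypothetical mapping of tag -> runtime-settable Spark options;
# this could just as well be loaded from spark.yml per environment
TAG_SPARK_CONF = {
    "big_shuffle": {"spark.sql.shuffle.partitions": "400"},
    "small_job": {"spark.sql.shuffle.partitions": "8"},
}


class PerNodeSparkHooks:
    @hook_impl
    def before_node_run(self, node) -> None:
        spark = SparkSession.builder.getOrCreate()
        for tag in node.tags:
            for key, value in TAG_SPARK_CONF.get(tag, {}).items():
                spark.conf.set(key, value)
```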
@bgereke Something else I'd be interested in as part of general user research related to this: roughly speaking, what are the different spark configurations you want to use? And do you like specifying this configuration in a .yml file or does that seem weird? Do you ever specify configuration for spark using yaml outside kedro?
b
I think for deployment I might keep most/all of the spark configuration outside kedro and instead include it in the spark-submit. During development, I don't mind yaml inside kedro. Some of the options in my current config are things like: spark.hadoop.fs.s3.canned.acl, spark.sql.adaptive.enabled, spark.sql.adaptive.coalescePartitions.enabled, spark.sql.adaptive.coalescePartitions.minPartitionSize, spark.sql.shuffle.partitions, spark.sql.adaptive.skewJoin.enabled
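For reference, in a Kedro-style spark.yml those options are just flat key-value pairs; something like this (the values below are placeholders, not my actual settings):
```yaml
# Placeholder values only
spark.hadoop.fs.s3.canned.acl: BucketOwnerFullControl
spark.sql.adaptive.enabled: true
spark.sql.adaptive.coalescePartitions.enabled: true
spark.sql.adaptive.coalescePartitions.minPartitionSize: 1MB
spark.sql.shuffle.partitions: 200
spark.sql.adaptive.skewJoin.enabled: true
```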
I haven't needed to apply different spark configs to different nodes yet, but this is a question I've gotten from colleagues while introducing them to kedro
a
Thanks very much for the explanation! Stupid question from someone who doesn't know much about spark, but when you say "include it in spark-submit", you mean you'd do something like this?
spark-submit --conf spark.hadoop.fs.s3.canned.acl=... --conf spark.sql.adaptive.enabled=...
b
Yes exactly. When we run a pipeline from airflow, we typically run spark tasks as EMR "steps" via a custom EMR operator which uses spark-submit. Also, there are no stupid spark questions, but I very well may give stupid answers!
a
Thank you, this is very helpful! The thing which I'm really interested in here is whether storing spark config in a spark.yml file is a natural thing to do, since I'm not sure where we originally got this idea from on Kedro. If you want to be able to specify a different spark configuration per run environment then using yaml is the typical Kedro way to do things, but I've always wondered whether anyone else would use yaml to specify spark config outside of Kedro. As far as I can tell spark config is just a set of key-value pairs (is that right?), so having a yaml file for it is in one sense overkill because you don't need a general dictionary structure. On the other hand, just listing arguments like in `spark-submit` feels very awkward to me and I'm surprised there's no `--config-file` option already where you can input some file in a standardised format of key-value pairs. Like doesn't your `spark-submit` command get huge if you want to specify 100 options? Or does that never really happen?
b
Ah gotcha, yes the spark-submits can get large but we typically call them through a command-runner.jar that can accept a bunch of arguments to assemble the spark-submit call like in this pic.
So you could imagine a list like this getting populated from a yaml
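As a toy example of what I mean (illustrative only, not what our operator actually does; the function name and file path are made up), you could turn a flat spark.yml into the repeated --conf arguments a spark-submit call inside a command-runner.jar step expects:
```python
# Illustrative only: build spark-submit arguments from a flat spark.yml
import yaml  # PyYAML


def spark_submit_args(entry_point: str, conf_path: str = "conf/base/spark.yml") -> list:
    with open(conf_path) as f:
        options = yaml.safe_load(f)

    args = ["spark-submit"]
    for key, value in options.items():
        args += ["--conf", f"{key}={value}"]
    return args + [entry_point]


# e.g. ["spark-submit", "--conf", "spark.sql.adaptive.enabled=True", ..., "my_job.py"]
print(spark_submit_args("my_job.py"))
```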
You can also zip configuration files and pass them to the --files and/or --archives arguments of spark-submit. This works for the other kedro config files but I haven't actually tried to load spark configuration that way. I'm sure I'll have more opinions once I deploy a kedro project.
a
Thank you very much for the explanation @bgereke, that's super helpful. Let me know how it goes and if you develop any more opinions on how we handle spark in kedro - I'm very interested in seeing if this is something we could improve.