# advanced-need-help
b
On the same topic of environments, if you wanted to have a separate spark configuration for different nodes in a pipeline, is the correct approach to store those configurations in separate environments and execute the nodes in separate runs with -env flags? Or is there some other way that would allow changing the configuration, perhaps by reloading the context, within the same run?
d
So Kedro isn't designed to be execution-environment aware like fully-fledged orchestrators such as Argo, Flyte, etc.
you could achieve this by breaking up your pipeline and having different configuration environments i.e.
kedro run --pipeline initial_pipeline --env local_cluster
and then
kedro run --pipeline second_pipeline --env emr_cluster
but it doesn't feel like a robust solution
so I would perhaps encourage you to wrap your pipelines in a proper orchestrator designed for this sort of thing
a
Generally agreed with Joel on this, but actually since kedro 0.18.1 introduced the `after_context_created` hook I wonder if there's a better way of doing it now... Here's a rough demo of how you could do it: https://gist.github.com/AntonyMilneQB/792a748b0d921e2f9f78cc7dd9c13c97. The advantages of this are:
* no need for a custom `KedroContext` at all, since all the spark stuff is done in hooks
* you can still use run environments as you currently do, no need to create a separate run environment for each spark config (although you still can do so if you like)
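Roughly, the shape of the idea looks something like this (an illustrative sketch, not the gist verbatim; the config-loader call and app name may differ from what the gist actually does):
```python
# Illustrative sketch: a hook that reads spark.yml from the active run
# environment and starts a single SparkSession for the whole run.
from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        # Load spark*.yml from conf/<env>/ via the config loader
        # (exact pattern is illustrative)
        parameters = context.config_loader.get("spark*", "spark*/**")
        spark_conf = SparkConf().setAll(parameters.items())

        # One SparkSession for the run; the app name is just a placeholder
        spark = (
            SparkSession.builder.appName("kedro-project")
            .config(conf=spark_conf)
            .getOrCreate()
        )
        spark.sparkContext.setLogLevel("WARN")
```
with the hook registered in `settings.py` via `HOOKS = (SparkHooks(),)`.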
I'm very interested in hearing what you think of this approach and whether it works for you! In general I'm wondering if we should move to using this sort of pattern instead of a custom `KedroContext` for spark initialisation. See https://github.com/kedro-org/kedro/issues/1563
b
I also agree that much of this can/should be delegated to the orchestrator; however, for early development stages or configuring spark params that aren't tied so tightly to cluster params (maybe num_shuffle_partitions or something?), I'm interested in trying out the hook-based approach you outline here. It seems potentially simpler to have multiple spark configs in the same environment and map to them via node names or tags. I haven't used hooks yet, so this gives me a good excuse 🙂
d
Let us know how this goes - as @antony.milne mentioned this is a brand new feature so any and all feedback is really appreciated
a
The more I think about it, the more I think that node tags is actually a very good way to do this. It's quite similar to this idea: https://discord.com/channels/778216384475693066/846330075535769601/935879323813036142.
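To make the tag idea concrete, a very rough sketch could be a `before_node_run` hook that maps node tags to runtime-settable Spark options (everything below is illustrative: the tag names, values, and mapping are made up, and only options that Spark allows changing on a live session, like spark.sql.shuffle.partitions, can be set this way):
```python
# Illustrative only: apply per-node Spark settings based on node tags.
from kedro.framework.hooks import hook_impl
from pyspark.sql import SparkSession

# Hypothetical mapping of tag -> runtime-settable Spark options;
# this could just as well be loaded from spark.yml per environment
TAG_SPARK_CONF = {
    "big_shuffle": {"spark.sql.shuffle.partitions": "400"},
    "small_job": {"spark.sql.shuffle.partitions": "8"},
}


class PerNodeSparkHooks:
    @hook_impl
    def before_node_run(self, node) -> None:
        spark = SparkSession.builder.getOrCreate()
        for tag in node.tags:
            for key, value in TAG_SPARK_CONF.get(tag, {}).items():
                spark.conf.set(key, value)
```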
@bgereke Something else I'd be interested in as part of general user research related to this: roughly speaking, what are the different spark configurations you want to use? And do you like specifying this configuration in a .yml file or does that seem weird? Do you ever specify configuration for spark using yaml outside kedro?
b
I think for deployment I might keep most/all of the spark configuration outside kedro and instead include it in the spark-submit. During development, I don't mind yaml inside kedro. Some of the options in my current config are things like: spark.hadoop.fs.s3.canned.acl, spark.sql.adaptive.enabled, spark.sql.adaptive.coalescePartitions.enabled, spark.sql.adaptive.coalescePartitions.minPartitionSize, spark.sql.shuffle.partitions, spark.sql.adaptive.skewJoin.enabled
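For reference, in a Kedro-style spark.yml those options are just flat key-value pairs; something like this (the values below are placeholders, not my actual settings):
```yaml
# Placeholder values only
spark.hadoop.fs.s3.canned.acl: BucketOwnerFullControl
spark.sql.adaptive.enabled: true
spark.sql.adaptive.coalescePartitions.enabled: true
spark.sql.adaptive.coalescePartitions.minPartitionSize: 1MB
spark.sql.shuffle.partitions: 200
spark.sql.adaptive.skewJoin.enabled: true
```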
I haven't needed to apply different spark configs to different nodes yet, but this is a question I've gotten from colleagues while introducing them to kedro
a
Thanks very much for the explanation! Stupid question from someone who doesn't know much about spark, but when you say "include it in spark-submit", you mean you'd do something like this?
spark-submit --conf spark.hadoop.fs.s3.canned.acl=... --conf spark.sql.adaptive.enabled=...
b
Yes exactly. When we run a pipeline from airflow, we typically run spark tasks as EMR "steps" via a custom EMR operator which uses spark-submit. Also, there are no stupid spark questions, but I very well may give stupid answers!
a
Thank you, this is very helpful! The thing which I'm really interested in here is whether storing spark config in a spark.yml file is a natural thing to do, since I'm not sure where we originally got this idea from on Kedro. If you want to be able to specify a different spark configuration per run environment then using yaml is the typical Kedro way to do things, but I've always wondered whether anyone else would use yaml to specify spark config outside of Kedro. As far as I can tell spark config is just a set of key-value pairs (is that right?), so having a yaml file for it is in one sense overkill because you don't need a general dictionary structure. On the other hand, just listing arguments like in `spark-submit` feels very awkward to me and I'm surprised there's no `--config-file` option already where you can input some file in a standardised format of key-value pairs. Like doesn't your `spark-submit` command get huge if you want to specify 100 options? Or does that never really happen?
b
Ah gotcha, yes the spark-submits can get large but we typically call them through a command-runner.jar that can accept a bunch of arguments to assemble the spark-submit call like in this pic.
So you could imagine a list like this getting populated from a yaml
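As a toy example of what I mean (illustrative only, not what our operator actually does; the function name and file path are made up), you could turn a flat spark.yml into the repeated --conf arguments a spark-submit call inside a command-runner.jar step expects:
```python
# Illustrative only: build spark-submit arguments from a flat spark.yml
import yaml  # PyYAML


def spark_submit_args(entry_point: str, conf_path: str = "conf/base/spark.yml") -> list:
    with open(conf_path) as f:
        options = yaml.safe_load(f)

    args = ["spark-submit"]
    for key, value in options.items():
        args += ["--conf", f"{key}={value}"]
    return args + [entry_point]


# e.g. ["spark-submit", "--conf", "spark.sql.adaptive.enabled=True", ..., "my_job.py"]
print(spark_submit_args("my_job.py"))
```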
You can also zip configuration files and pass them to the --files and/or --archives arguments of spark-submit. This works for the other kedro config files but I haven't actually tried to load spark configuration that way. I'm sure I'll have more opinions once I deploy a kedro project.
a
Thank you very much for the explanation @bgereke, that's super helpful. Let me know how it goes and if you develop any more opinions on how we handle spark in kedro - I'm very interested in seeing if this is something we could improve.