ende
12/10/2021, 11:26 PM

datajoely
12/11/2021, 5:45 PM

datajoely
12/11/2021, 5:46 PM

Rroger
12/13/2021, 10:15 PM

j c h a r l e s
12/13/2021, 10:34 PM

j c h a r l e s
12/13/2021, 10:43 PM
from kedro.config import ConfigLoader, MissingConfigException

conf_paths = ["conf/base", "conf/local"]
conf_loader = ConfigLoader(conf_paths)
try:
    credentials = conf_loader.get("credentials*", "credentials*/**")
except MissingConfigException:
    credentials = {}

j c h a r l e s
12/13/2021, 10:43 PM

j c h a r l e s
12/13/2021, 10:47 PM

j c h a r l e s
12/13/2021, 11:42 PM
conf_loader = ConfigLoader("conf", "local")
credentials = conf_loader.get("credentials*", "credentials*/**")
Also, I am calling this directly from a helper function within a specific node for now.

j c h a r l e s
12/13/2021, 11:48 PM
raw dataset. For each entity, it takes between 15 minutes and 1 hour to process that specific entity. I have a node that loops over each entity, processes it, and then merges all of the entities together. What's the best way to split up this node so that each entity is processed by its own node? The number of desired nodes would be one node per entity in the original dataset.

Rroger
12/14/2021, 12:09 AM

j c h a r l e s
12/14/2021, 12:13 AM

j c h a r l e s
12/14/2021, 12:14 AM

j c h a r l e s
12/14/2021, 12:16 AM

j c h a r l e s
12/14/2021, 12:18 AM
def create_pipeline(**kwargs):
    build_list_of_entity_nodes = node(
        func=generate_nodes_from_entities,
        inputs="entities_csv",
        outputs="list_of_entity_nodes",
    )
    ##
    # How do I run my_expensive_entity_function across this list of nodes,
    # which is determined through the contents of entities_csv?
    return Pipeline([build_list_of_entity_nodes])
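
A hedged sketch of the pattern this question usually leads to, not a confirmed answer from the thread: resolve the entity list while the pipeline is being constructed and have create_pipeline build one node per entity plus a merge node, instead of trying to emit nodes from inside a running node. The CSV path, column name, my_expensive_entity_function, and merge_entities below are illustrative placeholders.

from functools import partial

import pandas as pd
from kedro.pipeline import Pipeline, node


def my_expensive_entity_function(entities_df: pd.DataFrame, entity: str) -> pd.DataFrame:
    """Placeholder for the long-running per-entity processing described above."""
    return entities_df[entities_df["entity_id"] == entity]


def merge_entities(*processed: pd.DataFrame) -> pd.DataFrame:
    """Placeholder that concatenates the per-entity results."""
    return pd.concat(processed, ignore_index=True)


def create_pipeline(**kwargs):
    # Assumption: the entity list is known when the pipeline is built,
    # e.g. read from the same CSV that backs the "entities_csv" catalog entry.
    entities = pd.read_csv("data/01_raw/entities.csv")["entity_id"].tolist()

    # One node per entity; partial() pins which entity each node processes.
    entity_nodes = [
        node(
            func=partial(my_expensive_entity_function, entity=entity),
            inputs="entities_csv",
            outputs=f"processed_{entity}",
            name=f"process_{entity}",
        )
        for entity in entities
    ]

    # A final node gathers every per-entity output and merges them.
    merge_node = node(
        func=merge_entities,
        inputs=[f"processed_{entity}" for entity in entities],
        outputs="merged_entities",
        name="merge_entities",
    )

    return Pipeline(entity_nodes + [merge_node])

Because every per-entity node is independent, this also opens the door to running them in parallel (e.g. with ParallelRunner), provided the intermediate datasets are persisted in the catalog.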

Rroger
12/14/2021, 12:44 AM

j c h a r l e s
12/14/2021, 12:58 AM

j c h a r l e s
12/14/2021, 12:59 AM

j c h a r l e s
12/14/2021, 1:07 AM

Rroger
12/14/2021, 2:00 AM
node1 executes a SQL script in a DB producing table1, then node2 executes another SQL script on table1.

datajoely
12/14/2021, 10:13 AM
- pandas.QueryDataSet has been updated so that it takes a reference to a SQL file.
- We have introduced spark.DeltaTableDataSet and some tutorials on how to do 'out of DAG operations', which may translate here:
https://kedro.readthedocs.io/en/latest/11_tools_integration/01_pyspark.html#spark-and-delta-lake-interaction
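
A hedged sketch of what that could look like in code, assuming "pandas.QueryDataSet" refers to Kedro's pandas.SQLQueryDataSet; the connection string, query, .sql file path, and the filepath argument are illustrative assumptions, not details confirmed in the thread.

from kedro.extras.datasets.pandas import SQLQueryDataSet

credentials = {"con": "postgresql://user:password@localhost:5432/mydb"}  # placeholder

# Inline SQL: loading this dataset runs the query and hands the next node a DataFrame.
table1 = SQLQueryDataSet(
    sql="SELECT * FROM schema.table1",
    credentials=credentials,
)

# Referencing a .sql file instead of an inline string, as described above
# (argument name assumed here; check the docs for the release you are on).
table1_from_file = SQLQueryDataSet(
    filepath="queries/build_table1.sql",
    credentials=credentials,
)

df = table1.load()  # pandas.DataFrame that node2 can consume

Note that SQLQueryDataSet only reads query results into pandas; a node1 that mutates tables inside the database itself is closer to the 'out of DAG operations' tutorials linked above.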

datajoely
12/14/2021, 10:14 AM

NC
12/14/2021, 2:51 PM

datajoely
12/14/2021, 2:52 PM

datajoely
12/14/2021, 2:53 PM

NC
12/14/2021, 2:54 PM

datajoely
12/14/2021, 2:56 PM
Use a YAMLDataSet for the parameters you generate at runtime, for safekeeping; then either an automatic or manual process outside of Kedro could mirror those into your actual parameters.yml.
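
A minimal sketch of that suggestion, assuming Kedro's YAMLDataSet from kedro.extras.datasets.yaml; the file path and parameter contents are placeholders.

from kedro.extras.datasets.yaml import YAMLDataSet

# Parameters produced at runtime by some node or hook (placeholder values).
generated_params = {"model_options": {"threshold": 0.42, "entities": ["a", "b"]}}

# Persist them to YAML for safekeeping ...
params_ds = YAMLDataSet(filepath="data/08_reporting/generated_parameters.yml")
params_ds.save(generated_params)

# ... so that a process outside Kedro (or a manual copy) can mirror them into
# conf/base/parameters.yml for the next run.
reloaded = params_ds.load()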

NC
12/14/2021, 2:58 PM

datajoely
12/14/2021, 2:58 PM

datajoely
12/14/2021, 2:58 PM