Lorena
08/20/2021, 9:11 AM
`configure_project` can't/shouldn't be bypassed, as that's where settings and pipelines are (lazily) configured, in order to a) be able to import them anywhere in a project, and b) use them in the framework code. If you really, really need the parameters, I suggest recreating the ConfigLoader logic of fetching the parameters in a helper function that you can call in the node. But generally, dynamically generated pipelines are to be avoided if you can. I'm curious what your use case is; maybe there's an alternative?
WolVez
08/20/2021, 2:16 PM
`get_current_session()` fails. Specifically, getting to the `run` function in `__main__` takes forever. In my mind, this implies something is up with `configure_project`.
```python
import os
from pathlib import Path
from typing import Any

from kedro.config import ConfigLoader
from kedro.framework.project import settings
from kedro.framework.session import get_current_session

# get_env() is a project helper (not shown here)


def _get_config() -> ConfigLoader:
    """Get the Kedro config loader.

    Returns:
        The Kedro config loader.
    """
    try:
        return get_current_session().load_context().config_loader
    except Exception:
        env = os.getenv("RUN_ENV")
        if env:
            project_path = Path(os.getenv("PROJECT_PATH") or Path.cwd()).resolve()
            conf_root = settings.CONF_ROOT
            conf_paths = [
                str(project_path / conf_root / "base"),
                str(project_path / conf_root / env),
            ]
            return ConfigLoader(conf_paths)
        return ConfigLoader(["./conf/base", f"./conf/{get_env()}"])


def get_param(key: str, default_value: Any = None) -> Any:
    """Get a parameter value from the parameters .yml files.

    Args:
        key: The key from the parameters .yml files.
        default_value: Value returned when the key is missing.

    Returns:
        The parameter value.
    """
    return _get_config().get("parameters*", "parameters*/**", "**/parameters*").get(key, default_value)
```
Lorena
08/20/2021, 3:32 PM
`kedro.framework.project` or your own project package `my_package.settings`?
WolVez
08/20/2021, 3:34 PM
`from kedro.framework.project import settings`
Lorena
08/20/2021, 3:35 PM
WolVez
08/20/2021, 3:38 PM

```python
xsell_group = get_param("xsell_group")


def create_pipeline_train():
    pipeline = Pipeline([])
    for item in xsell_group:
        item_camel = camel_to_snake(item[1])
        model_name = f"xsell_propensity_{item_camel}"
        pipeline += create_pipeline_train_model(
            model_name=model_name,
            filter_pretargets_node=filter_pretargets_xsell,
            build_target_node=build_target_xsell,
            additional_tags=["xsell", model_name, f"train_{model_name}", item_camel],
        )
    return pipeline
```
where `xsell_group` is a .yml list of lists from the context parameters. In `configure_project`, pipelines is the last step.
Lorena
08/20/2021, 3:41 PM
`kedro run` or in package mode, i.e. you've packaged the project as a Python artifact and are running it as `my_package run`?
WolVez
08/20/2021, 3:44 PM

```python
import os
import sys

from phoenix_max.__main__ import main

# Collect CLI flags from environment variables, dropping the unset ones
argvs = {
    k: v
    for k, v in {
        "--pipeline": os.getenv("PIPELINES"),
        "--project_path": os.getenv("PROJECT_PATH"),
        "--env": os.getenv("RUN_ENV"),
        "--tags": os.getenv("TAGS"),
    }.items()
    if v is not None
}
# Flatten the (flag, value) pairs onto sys.argv before invoking the CLI
sys.argv += list(sum(argvs.items(), tuple()))
main()
```
Lorena
08/20/2021, 4:03 PM
`xsell_group` is a parameter that's used somewhere else in the project's pipelines too, not just for the dynamic creation here, right? Otherwise, if it's only used here, you could just move the list to a constant in the Python file.
I find it a bit surprising that `settings.CONF_ROOT` is that slow in package mode, but I'll have a play with this myself next week to see if I can reproduce it / figure out what's happening. Meanwhile, just importing from the local package or using a literal directly should be good enough.
Can you let me know what version of Kedro you're using?
WolVez
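If the list really is only used for the dynamic pipeline creation, the constant approach can be sketched like this (the group contents and the `.lower()` stand-in for `camel_to_snake` are made up for illustration):

```python
# Module-level constant instead of a get_param() lookup at import time.
# These pairs are placeholders, not the project's real xsell groups.
XSELL_GROUP = [
    ["group_a", "ProductAlpha"],
    ["group_b", "ProductBeta"],
]


def create_pipeline_train(xsell_group=XSELL_GROUP):
    # Iterate over the constant exactly as the original loop does;
    # building model names stands in for the real pipeline assembly.
    model_names = []
    for item in xsell_group:
        model_names.append(f"xsell_propensity_{item[1].lower()}")
    return model_names


names = create_pipeline_train()
```

No ConfigLoader is touched at import time, so pipeline registration stays fast.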
08/20/2021, 5:00 PM
datajoely
08/20/2021, 5:01 PM
WolVez
08/20/2021, 5:01 PM
datajoely
08/20/2021, 5:02 PM
WolVez
08/20/2021, 5:05 PM
`configure_project`? I am wondering if `get_param` is slowing things down by consistently having to remake new ConfigLoaders (though that should only be happening 3 or 4 times in our code base to generate pipelines)?
Lorena
08/24/2021, 4:22 PM
WolVez
08/24/2021, 7:52 PM
In `register_catalog`, `DataCatalog.from_config` is utilized to instantiate the catalog. We are using custom Datasets which inherit from `AbstractDataSet`, plus some more basic connectors we set up which automatically handle authentication themselves. As a result, the Datasets are able to connect to external databases without the need for credentials to be stored in Kedro. We have probably 100+ items in our catalog. The extensive lag time seems to be the entirety of all 100+ items instantiating a new instance of the dataset and then, as a result, going through authentication. We only have about 10 different Dataset types in total.
That being said, during each run we are calling a specific pipeline which uses a max of 5 or 6 catalog items.
My questions are thus:
1. Do I need to utilize the `_filtered_pipelines` to instantiate the correct catalog entries in our own `register_catalog`?
2. If I use Kedro credentials, will it only create a single class of these (I am not seeing where in the code this would happen, but I remain hopeful I am just missing it)?
3. The actual instantiation of the base classes takes about 0.2 seconds. However, there is a one-minute gap between each instantiation of the base classes while converting to Kedro Datasets. Ideas what could be causing this?
4. We are using Databricks to run this. We see the expected runtime when utilizing notebooks, but when running jobs we hit this significant lag time. Given that the code is fundamentally the same (and the environment setup), do you have any ideas as to why this would change the runtime speed and the instantiation of the catalog?
datajoely
08/28/2021, 9:50 AM
WolVez
09/01/2021, 2:10 AM
It is the `get_config` function inside of KedroContext during `Session.run()`, prior to hitting the actual runners. There is a 1-minute lag time in the creation of each AbstractDataSet instantiation. The lag time is not specific to any one dataset type.
datajoely
09/01/2021, 9:54 AM
idanov
09/01/2021, 10:24 AM
You probably shouldn't use `get_current_session()`, since it seems to instantiate too many things for your needs. What you can do instead, in your `_get_config()` method, is use https://kedro.readthedocs.io/en/stable/kedro.framework.startup.bootstrap_project.html?highlight=bootstrap_project to make sure the project is set up, and then simply instantiate a ConfigLoader instance yourself, as in the code block after `except` in your `_get_config()` function.
As for the `DataCatalog` instantiation, this happens due to the eagerness of the default `DataCatalog`, which will eagerly instantiate all catalog entries; as you mentioned, the lag may come from the authentication for each of the connections. So each of the 100+ datasets will set up its connection on instantiation. One way to improve that is to make sure that your custom DataSet class has a class property for the connection, which will be set only once when the first dataset is instantiated; all other DataSet instances will then reuse the same connection. This way Kedro will instantiate only 1 connection instead of 100+, and the slowness will disappear. In that case, though, you need to ensure that access to the connection is thread-safe (if you are using ThreadRunner).
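A minimal sketch of the class-level connection idea. The `make_connection` helper and the dataset internals here are hypothetical; in a real project the class would subclass `kedro.io.AbstractDataSet`, but it is kept plain so the pattern stands alone:

```python
import threading


def make_connection():
    # Hypothetical stand-in for an expensive, self-authenticating connection
    make_connection.calls += 1
    return {"id": make_connection.calls}


make_connection.calls = 0  # track how many real connections get created


class SnowflakeTable:
    """Dataset sketch that shares ONE connection across all instances."""

    _connection = None                # class-level, shared by every instance
    _lock = threading.Lock()          # needed if you run with ThreadRunner

    def __init__(self, table: str):
        self._table = table           # cheap: no connection made here

    @classmethod
    def _get_connection(cls):
        # Only the first instance pays the connection/authentication cost;
        # every later instance reuses the same object.
        with cls._lock:
            if cls._connection is None:
                cls._connection = make_connection()
        return cls._connection

    def load(self):
        conn = self._get_connection()
        return (conn["id"], self._table)  # stand-in for a real query


# 100+ catalog entries now trigger a single connection:
a = SnowflakeTable("db.schema.t1")
b = SnowflakeTable("db.schema.t2")
a.load()
b.load()
```

Constructing the datasets stays cheap; the one-time connection cost is deferred to the first `load`.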
09/02/2021, 9:12 PM
The time is going to `ConfigLoader.get()`. While `_lookup_config_filepaths` is the primary offender inside `ConfigLoader.get()`, `_load_configs` also took a large amount of time.
We created a `get_param` function inside of our Connector Repo to help with grabbing various parameters held inside the conf. Below are the functions:

```python
def _get_config() -> ConfigLoader:
    """Get the Kedro config loader.

    Returns:
        The Kedro config loader.
    """
    try:
        value = get_current_session().load_context().config_loader
        logging.info("_get_config - GET_CURRENT_SESSION_METHOD USED")
        return value
    except Exception:
        value = ConfigLoader(["./conf/base", f"./conf/{get_env()}"])
        logging.info("_get_config - NEW CONFIGLOADER CREATED!!!")
        return value


def get_param(key: str, default_value: Any = None) -> Any:
    """Get a parameter value from the parameters .yml files.

    Args:
        key: The key from the parameters .yml files.
        default_value: Value returned when the key is missing.

    Returns:
        The parameter value.
    """
    logging.info(f"GETTING PARAM from kedro_connect: key - {key}")
    start = perf_counter()
    config_start = perf_counter()
    config = _get_config()
    config_end = perf_counter()
    logging.info(f"TOTAL TIME TO RETRIEVE CONFIG FOR {key} from PARAM - CONFIG TIME: {config_end - config_start}s.")
    get_1_start = perf_counter()
    params = config.get("parameters*", "parameters*/**", "**/parameters*")
    get_1_end = perf_counter()
    logging.info(f"TOTAL TIME TO RETRIEVE GET1 PARAM FOR {key} from PARAM - CONFIG TIME: {get_1_end - get_1_start}s.")
    get_2_start = perf_counter()
    value = params.get(key, default_value)
    get_2_end = perf_counter()
    logging.info(f"TOTAL TIME TO RETRIEVE GET2 PARAM FOR {key} from PARAM - CONFIG TIME: {get_2_end - get_2_start}s.")
    end = perf_counter()
    logging.info(f"TOTAL TIME TO RETRIEVE {key} from PARAM - TOTAL TIME: {end - start}s.")
    logging.info(f"CONFIG: {config}")
    return value
```
These get called during `__init__`. Any thoughts on our `get_param` function?
idanov
09/03/2021, 9:04 AM
If you call `get_param` often, that means that every single time you call this function you will load all the parameters from scratch, since `config.get("parameters*", "parameters*/**", "**/parameters*")` reads the files and parses them. Could you somehow cache the results from the first time this is called? As for whether it is good practice to have something like `get_param`, I think in general it'd be best to keep the standard way Kedro passes parameters around and not use this kind of function, but I understand that this is not always possible. If refactoring your code at the moment is out of the question, don't worry too much about it and keep what you have, and look for opportunities to refactor in the future to get the parameters the usual way.
WolVez
09/03/2021, 2:28 PM

```
- conf
  - base
  - dev
  - catalog
    - pipeline1
      - snowflake.yml
      - sql.yml
    - pipeline2
      - catalog.yml
      - snowflake.yml
      - sql.yml
```

Where the `snowflake.yml` in the catalog could contain something like:

```yaml
# snowflake table
_snowflake_table: &create_snowflake_table
  type: kedro_connector.datasets.SnowflakeTable
  db: connection.string.stuff
  other_creds: creds stuff
```

Then in the pipeline1 `snowflake.yml` file and/or the pipeline2 `catalog.yml` we could have something like:

```yaml
pipeline1_table_output:
  <<: *create_snowflake_table
  table: db.schema.table_name
```

Would the load order matter in this situation, though? Could the CatalogLoader handle inheritance across multiple files like that?
datajoely
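On the anchors question: YAML anchors and aliases are scoped to a single document, so `<<: *create_snowflake_table` in one file cannot resolve an anchor defined in another file unless the loader concatenates the files before parsing (Kedro's ConfigLoader parses each file separately). A quick check with PyYAML (assumed installed; Kedro depends on it):

```python
import yaml

# Anchor and alias in the SAME document: the merge key resolves fine.
same_file = """
_snowflake_table: &create_snowflake_table
  type: kedro_connector.datasets.SnowflakeTable

pipeline1_table_output:
  <<: *create_snowflake_table
  table: db.schema.table_name
"""
loaded = yaml.safe_load(same_file)

# Alias WITHOUT its anchor, as if the anchor lived in another file:
# PyYAML raises a ComposerError ("found undefined alias").
other_file = """
pipeline1_table_output:
  <<: *create_snowflake_table
  table: db.schema.table_name
"""
try:
    yaml.safe_load(other_file)
    cross_file_alias_works = True
except yaml.YAMLError:
    cross_file_alias_works = False
```

So the shared `_snowflake_table` block would need to live in the same file as each entry that merges it.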
09/03/2021, 2:29 PM
WolVez
09/09/2021, 3:29 PM
datajoely
09/09/2021, 3:31 PM