# beginners-need-help
l
So I'm afraid `configure_project` can't/shouldn't be bypassed, as that's where settings and pipelines are (lazily) configured, in order to a) be able to import them anywhere in a project, and b) use them in the framework code. If you really, really need the parameters, I suggest recreating the ConfigLoader logic of fetching the parameters in a helper function that you can call in the node. But generally, dynamically generated pipelines are to be avoided if you can; I'm curious what your use case is, maybe there's an alternative?
w
@User, thanks for your reply. We have created models for about 50 of our product categories. We don't have the time or manpower to stand up / monitor each of these, so we created several different nodes to manage training, analysis, and production of these 50 models. Thus, we really only create the list in the catalog / the params to manage all 50 pipelines. I created a ConfigLoader in a very similar manner to the `KedroContext` method (see below). However, for some reason, when utilizing this method, it is extremely slow if `get_current_session()` fails. Specifically, getting to the `run` function in `__main__` takes forever. In my mind, this implies something is up with `configure_project`.
```python
import os
from pathlib import Path
from typing import Any

from kedro.config import ConfigLoader
from kedro.framework.project import settings
from kedro.framework.session import get_current_session


def _get_config() -> ConfigLoader:
    """Get the kedro configuration context.

    Returns:
        The kedro configuration context.
    """
    try:
        return get_current_session().load_context().config_loader
    except Exception:
        env = os.getenv("RUN_ENV")
        if env:
            project_path = Path(os.getenv("PROJECT_PATH") or Path.cwd()).resolve()
            conf_root = settings.CONF_ROOT
            conf_paths = [
                str(project_path / conf_root / "base"),
                str(project_path / conf_root / env),
            ]
            return ConfigLoader(conf_paths)
        else:
            # get_env() is a project helper returning the run environment
            return ConfigLoader(["./conf/base", f"./conf/{get_env()}"])


def get_param(key: str, default_value: Any = None) -> Any:
    """Get a parameter value from the parameters.yml.

    Args:
        key: The id from parameter .yml files.

    Returns:
        The parameter value.
    """
    return _get_config().get("parameters*", "parameters*/**", "**/parameters*").get(key, default_value)
```
l
Where does `settings` come from in the snippet above? Is it `kedro.framework.project` or your own project package, `my_package.settings`?
w
`from kedro.framework.project import settings`
l
If I step back a bit though, are you creating one pipeline per product category? And if so, are they structurally different, or is it just the data (catalog entries, parameters) that is different? If it's the latter, are you familiar with modular pipelines? https://kedro.readthedocs.io/en/stable/06_nodes_and_pipelines/03_modular_pipelines.html Could be what you're looking for; see the sketch below.
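Purely as a sketch of that idea: with the `pipeline()` helper you could keep one template pipeline and remap its catalog entries and parameters per category. The node function and the dataset/parameter names below are hypothetical, not from your project:

```python
from kedro.pipeline import Pipeline, node, pipeline


def train_model(features, options):
    """Hypothetical training node shared by every category."""
    ...


def template() -> Pipeline:
    return Pipeline(
        [node(train_model, ["features", "params:model_options"], "model")]
    )


def create_pipeline_train(categories) -> Pipeline:
    # Same structure for every category; only the catalog entries and
    # parameters are remapped per category.
    result = Pipeline([])
    for cat in categories:
        result += pipeline(
            template(),
            inputs={"features": f"{cat}_features"},
            outputs={"model": f"xsell_propensity_{cat}"},
            parameters={"params:model_options": f"params:{cat}_options"},
        )
    return result
```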
That's probably why. If you try importing from your actual package, or even better just using `"conf"` straight away, does that work as expected?
w
```python
from kedro.pipeline import Pipeline

# camel_to_snake, create_pipeline_train_model, filter_pretargets_xsell
# and build_target_xsell are project helpers.
xsell_group = get_param("xsell_group")


def create_pipeline_train():
    pipeline = Pipeline([])

    for item in xsell_group:
        item_camel = camel_to_snake(item[1])
        model_name = f"xsell_propensity_{item_camel}"
        pipeline += create_pipeline_train_model(
            model_name=model_name,
            filter_pretargets_node=filter_pretargets_xsell,
            build_target_node=build_target_xsell,
            additional_tags=["xsell", model_name, f"train_{model_name}", item_camel],
        )

    return pipeline
```
where `xsell_group` is a .yml list of lists from the context parameters.
The callout is that it only runs into the issue during the pipeline build. Settings look like they are initialized during the first step of `configure_project`, while pipelines are the last step.
l
Also just to clarify, are you running the pipeline with `kedro run` or in package mode, i.e. you've packaged the project as a Python artifact and are running it as `my_package run`?
w
So we are pip installing the package, then running a Python script which passes CLI options (i.e. `--pipeline`, `--project_path`) to `sys.argv`, then running `main()`:
```python
import os
import sys

from phoenix_max.__main__ import main

argvs = {
    k: v
    for k, v in {
        "--pipeline": os.getenv("PIPELINES"),
        "--project_path": os.getenv("PROJECT_PATH"),
        "--env": os.getenv("RUN_ENV"),
        "--tags": os.getenv("TAGS"),
    }.items()
    if v is not None
}

# Flatten {flag: value} pairs into [flag, value, flag, value, ...]
sys.argv += list(sum(argvs.items(), tuple()))
main()
```
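As an aside, a programmatic alternative to patching `sys.argv` is to drive the session API directly. A rough sketch against the Kedro 0.17 session API, reusing the same environment variables and package name as the snippet above:

```python
import os
from pathlib import Path

from kedro.framework.session import KedroSession

tags = os.getenv("TAGS")

# Create a session for the installed package and run the requested pipeline.
with KedroSession.create(
    "phoenix_max",
    project_path=Path(os.getenv("PROJECT_PATH") or Path.cwd()),
    env=os.getenv("RUN_ENV"),
) as session:
    session.run(
        pipeline_name=os.getenv("PIPELINES"),
        tags=tags.split(",") if tags else None,
    )
```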
l
I see, that makes sense. I presume the `xsell_group` is a parameter that's used somewhere else in the project's pipelines too, not just for the dynamic creation here, right? Otherwise, if it's just here, you could move the list to a constant in the Python file. I find it a bit surprising that `settings.CONF_ROOT` is that slow in package mode, but I'll have a play with this myself next week to see if I can reproduce it / figure out what's happening. Meanwhile, just importing from the local package or using a literal directly should be good enough. Can you let me know what version of Kedro you're using?
w
@User Yes, we use `xsell_group` in a lot of places. We could create a Python file for it, though that seems against the inherent build design of Kedro. We are using Kedro 0.17.2.
d
@User - Thanks for getting back to us with the version. We'll have to pick this up on Monday, so have a good weekend and speak then!
w
Thanks @User
@User, note that this thread archives in 24 hours (I don't have permissions to make it last longer). I am not sure if you want to extend that or not.
d
I'm still trying to work out what that actually means
I've been able to reopen threads older than that
if we have to open a new one on Monday we can
maybe even move this to advanced :p
w
Additional thoughts @User: would it make sense to build an optional ConfigLoader into `configure_project`? I am wondering if `get_param` is slowing things down by consistently having to remake new ConfigLoaders (though that should only be happening 3 or 4 times in our code base to generate pipelines)?
@User & @User, I did some additional testing and realized its less linear than I thought. It does look like session.run is being hit. The lag time is between Sesssion.run() and the first pipeline being hit. This should be a 3 minute pipeline, but this lag time is making it over an hour. Note, it is not the data loading, as we have Logging happening on the datasets being hit after the 1 hour delay.
l
@User would you be able to upgrade your Kedro version to the latest 0.17.4, or even 0.17.3? From 0.17.3 we've made pipelines load lazily, which might solve your problem. It should be a simple move as it's a non-breaking release.
w
@User, I cannot update to 0.17.4 or 0.17.3 right now. There seem to be a number of breaks, specifically with our custom datasets which inherit from `AbstractDataSet`.
@User, err, I might have lied and just introduced another breaking change prior to the edit.
@User and @User, alright, update. I have added an extensive level of logging to find this, and I updated Kedro to 0.17.4. It looks like during `register_catalog`, `DataCatalog.from_config` is utilized to instantiate the catalog. We are using custom datasets which inherit from `AbstractDataSet`, plus more basic connectors we set up which automatically handle authentication themselves. As a result, the datasets are able to connect to exterior databases without the need for credentials to be stored in Kedro. We have probably 100+ items in our catalog. The extensive lag time seems to be the entirety of all 100+ items instantiating a new instance of the dataset and, as a result, going through authentication. We only have about 10 different dataset types in total. That being said, during each run we are calling a specific pipeline which only uses a max of 5 or 6 catalog items. My questions are thus:
1. Do I need to utilize the `_filtered_pipelines` to instantiate the correct catalog entries in our own `register_catalog`?
2. If I use Kedro credentials, will it only create a single class of these? (I am not seeing where in the code this would happen, but I remain hopeful I am just missing it.)
3. The actual instantiation of the base classes takes about 0.2 seconds. However, a minute passes between each instantiation of the base classes while converting to Kedro datasets. Any ideas what would be causing this?
4. We are using Databricks to run this. We see the expected runtime when utilizing notebooks, but when running jobs we hit this significant lag time. Given that the code is fundamentally the same (and the environment setup), do you have any ideas as to why this would change the runtime speed and the instantiation of the catalog?
d
@WolVez Lorena is now on holiday for two weeks, so I'll have to pick this up next week with another team member. This is very helpful for us, so please keep the questions coming; I'm keen to find a solution / improve this because it feels like something which shouldn't be taking this long.
w
@User, here are some logs of the runtime shown during the for loop of the `get_config` function inside of `KedroContext` during `Session.run()`, prior to hitting the actual runners. There is a 1-minute lag in the creation of each `AbstractDataSet` instantiation, and the lag is not specific to any one dataset type.
Given we have close to 100 datasets (despite only using 7 in the pipeline), the runtime for any process is close to an hour and a half. The actual implementation of the pipeline is very speedy.
If we go deeper, we see that the one minute between runs is coming from the creation of the class object from the passed dictionary within `AbstractDataSet`.
d
Hi @User thank you for the detailed analysis
let me consult with the team
i
@User the challenge seems to be that in order to get the config loader, your code loads the whole session and instantiates the full `DataCatalog`. I would avoid using `get_current_session()`, since it seems to instantiate too many things for your needs. What you can do instead in your `_get_config()` method is use https://kedro.readthedocs.io/en/stable/kedro.framework.startup.bootstrap_project.html?highlight=bootstrap_project to make sure the project is set up, and then simply instantiate a `ConfigLoader` yourself, as in the code block after `except` in your `_get_config()` function.
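A rough sketch of that approach, reusing the `RUN_ENV`/`PROJECT_PATH` environment variables from your earlier snippet (Kedro 0.17 naming, where the conf root lives in `settings.CONF_ROOT`):

```python
import os
from pathlib import Path

from kedro.config import ConfigLoader
from kedro.framework.project import settings
from kedro.framework.startup import bootstrap_project


def _get_config() -> ConfigLoader:
    """Build a ConfigLoader without spinning up a full session/catalog."""
    project_path = Path(os.getenv("PROJECT_PATH") or Path.cwd()).resolve()
    # Sets up the project (settings, lazily-loaded pipelines) without
    # creating a session or instantiating the DataCatalog.
    bootstrap_project(project_path)
    env = os.getenv("RUN_ENV", "local")
    return ConfigLoader(
        [
            str(project_path / settings.CONF_ROOT / "base"),
            str(project_path / settings.CONF_ROOT / env),
        ]
    )
```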
For the slowness of the `DataCatalog` instantiation: this happens due to the eagerness of the default `DataCatalog`, which will eagerly instantiate all catalog entries, and as you mentioned, the lag may come from the authentication for each of the connections. So each of the 100+ datasets will set up its connection on instantiation. One way to improve that is to make sure that your custom dataset class has a class property for the connection, which is set only once when the first dataset is instantiated; all other dataset instances then reuse the same connection. This way Kedro will instantiate only 1 connection instead of 100+, and the slowness will disappear. In that case though, you need to ensure that access to the connection is thread-safe (if you are using ThreadRunner).
The 1-hour slowness makes sense: I would expect Databricks to have some kind of brute-force attack prevention for authenticating, and it seems that your code does too many authentications at once, so they throttle the authentication requests. This could explain the 1-minute delay between each authentication, because of the timeout they apply between two consecutive authentications for security. The one shared connection between all datasets should solve this problem. Let us know if it fixes it for you. Here is an example I found after a quick search for a thread-safe singleton; hopefully it will help you with making a thread-safe class instance for your connection object: https://blog.hbis.fr/2019/03/23/python-singleton/
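To illustrate the shared-connection idea, a minimal sketch of a custom dataset with a class-level, lock-guarded connection. The `SnowflakeTable` name, the `_authenticate()` helper and the `read`/`write` calls are illustrative stand-ins, not a real API:

```python
import threading

from kedro.io import AbstractDataSet


def _authenticate():
    """Hypothetical stand-in for the project's own auth logic."""
    ...


class SnowflakeTable(AbstractDataSet):
    """All instances reuse one class-level connection, so authentication
    happens once per process instead of once per catalog entry."""

    _conn = None
    _conn_lock = threading.Lock()  # keeps ThreadRunner usage safe

    def __init__(self, table: str):
        self._table = table

    @classmethod
    def _connection(cls):
        # Double-checked locking: only the first caller authenticates.
        if cls._conn is None:
            with cls._conn_lock:
                if cls._conn is None:
                    cls._conn = _authenticate()
        return cls._conn

    def _load(self):
        return self._connection().read(self._table)  # illustrative call

    def _save(self, data):
        self._connection().write(self._table, data)  # illustrative call

    def _describe(self):
        return {"table": self._table}
```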
w
@User, @User, thanks for the response. I ended up adding singletons and a slew of other enhancements to reduce requests. However, this actually didn't end up speeding up the system very much. I went as deep as I could go this time around and actually identified the issue as `ConfigLoader.get()`. While `_lookup_config_filepaths` is the primary offender inside `ConfigLoader.get()`, `_load_configs` also took a large amount of time. We created a `get_param` function inside of our connector repo to help with grabbing various parameters held inside the conf. Below are the functions:
```python
import logging
from time import perf_counter
from typing import Any

from kedro.config import ConfigLoader
from kedro.framework.session import get_current_session


def _get_config() -> ConfigLoader:
    """Get the kedro configuration context.

    Returns:
        The kedro configuration context.
    """
    try:
        value = get_current_session().load_context().config_loader
        logging.info("_get_config - GET_CURRENT_SESSION_METHOD USED")
        return value
    except Exception:
        # get_env() is a project helper returning the run environment
        value = ConfigLoader(["./conf/base", f"./conf/{get_env()}"])
        logging.info("_get_config - NEW CONFIGLOADER CREATED!!!")
        return value


def get_param(key: str, default_value: Any = None) -> Any:
    """Get a parameter value from the parameters.yml.

    Args:
        key: The id from parameter .yml files.

    Returns:
        The parameter value.
    """
    logging.info(f"GETTING PARAM from kedro_connect: key - {key}")
    start = perf_counter()

    config_start = perf_counter()
    config = _get_config()
    config_end = perf_counter()
    logging.info(f"TOTAL TIME TO RETRIEVE CONFIG FOR {key} from PARAM - CONFIG TIME: {config_end - config_start}s.")

    get_1_start = perf_counter()
    params = config.get("parameters*", "parameters*/**", "**/parameters*")
    get_1_end = perf_counter()
    logging.info(f"TOTAL TIME TO RETRIEVE GET1 PARAM FOR {key} from PARAM - CONFIG TIME: {get_1_end - get_1_start}s.")

    get_2_start = perf_counter()
    value = params.get(key, default_value)
    get_2_end = perf_counter()
    logging.info(f"TOTAL TIME TO RETRIEVE GET2 PARAM FOR {key} from PARAM - CONFIG TIME: {get_2_end - get_2_start}s.")

    end = perf_counter()
    logging.info(f"TOTAL TIME TO RETRIEVE {key} from PARAM - TOTAL TIME: {end - start}s.")
    logging.info(f"CONFIG: {config}")
    return value
```
Speeds from the notebook: (attachment not shown)
Speeds from the job: (attachment not shown)
While significantly slower in the job, GET1 is still pretty slow in the notebook. I will keep going down the rabbit hole, but I have confirmed that this is the primary source of the minute-ish long gaps. We utilize these `get_param` functions extensively within our custom Kedro datasets (commonly calling 2-4 per `__init__`).
Do you recommend a better solution than the above `get_param` function?
i
@User if you call `get_param` often, that means that every single time you call this function you will load all the parameters from scratch, since `config.get("parameters*", "parameters*/**", "**/parameters*")` reads the files and parses them. Could you somehow cache the results from the first time this is called? As for whether it is good practice to have something like `get_param`, I think in general it'd be best to keep the standard way Kedro passes parameters around and not use this kind of function, but I understand that this is not always possible. If refactoring your code at the moment is out of the question, don't worry too much about it and keep what you have, and look for opportunities to refactor it in the future to get the parameters the usual way.
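For example, one possible shape for that cache, reusing the `_get_config` helper from your snippet above (note the parameters are then parsed once per process, so on-disk changes won't be picked up until restart):

```python
from functools import lru_cache
from typing import Any


@lru_cache(maxsize=1)
def _all_params() -> dict:
    """Read and parse the parameter files only once per process."""
    return _get_config().get("parameters*", "parameters*/**", "**/parameters*")


def get_param(key: str, default_value: Any = None) -> Any:
    return _all_params().get(key, default_value)
```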
w
@User, another singleton! @User & @User, one of the issues is that we have so many catalog items, but also so many of them needing similar attributes. For example, we probably have 75+ catalog items for Snowflake. Each environment also needs to push to a different database (dev, pre, prd). So, we could include the required database as part of each Snowflake catalog item, but then we would be specifying the same information over and over again in the .yml. It seems easier to create one set of parameters in the environment conf, then utilize something like `get_param` within the dataset to manage that process instead. I am not super familiar with how far the .yml inheritance process goes. I suppose we could use that as a workaround if it can go down multiple levels. For example:
```
- conf
    - base
    - dev
        - catalog
            - pipeline1
                - snowflake.yml
                - sql.yml
            - pipeline2
                - catalog.yml
            - snowflake.yml
            - sql.yml
```
Where the `snowflake.yml` in the catalog could contain something like:
```yaml
# snowflake table
_snowflake_table: &create_snowflake_table
  type: kedro_connector.datasets.SnowflakeTable
  db: connection.string.stuff
  other_creds: creds stuff
```
Then in the pipeline1 `snowflake.yml` file and/or the pipeline2 `catalog.yml` we could have something like:
```yaml
pipeline1_table_output:
  <<: *create_snowflake_table
  table: db.schema.table_name
```
Would the load order matter in this situation though? Could the ConfigLoader handle inheritance across multiple files like that?
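One caveat with the sketch above: plain YAML anchors only resolve within a single file, so the `<<: *create_snowflake_table` reference wouldn't work across files. A possible alternative for the "one set of values per environment" goal is Kedro 0.17's `TemplatedConfigLoader`, wired in via the project hooks; a rough sketch (the `*globals.yml` pattern is the conventional one from the docs, and the `snowflake_db` key is hypothetical):

```python
from typing import Iterable

from kedro.config import ConfigLoader, TemplatedConfigLoader
from kedro.framework.hooks import hook_impl


class ProjectHooks:
    @hook_impl
    def register_config_loader(self, conf_paths: Iterable[str]) -> ConfigLoader:
        # Values from conf/<env>/*globals.yml (e.g. snowflake_db: dev_db)
        # are substituted into ${...} placeholders in the catalog YAML,
        # so each environment supplies its own database just once.
        return TemplatedConfigLoader(conf_paths, globals_pattern="*globals.yml")


# A catalog entry, for comparison with the anchor version above:
#   pipeline1_table_output:
#     type: kedro_connector.datasets.SnowflakeTable
#     db: ${snowflake_db}
#     table: db.schema.table_name
```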
d
^ So we actually have internal users who have abstracted their pipeline definition into YAML much like this.
We've researched how it's gone and have landed on the conclusion that this is not a good idea, for a couple of reasons.
You end up writing an ungodly amount of YAML, with no help from your IDE, tests, etc. Things become very hard to work with, hand over, and debug.
I'm not sure if you were involved in our most recent UX research piece where we tested our possible solutions to config hell. Did you join any of those calls?
w
@User, unfortunately not. What was your final conclusion on the best way to handle it? Also note, the singleton solved the speed problem! Thanks for all your help, everyone!
d
Re the config research - @User should be posting it on GitHub shortly