# beginners-need-help
l
So I'm afraid `configure_project` can't/shouldn't be bypassed, as that's where settings and pipelines are (lazily) configured, in order to a) be able to import them anywhere in a project, and b) use them in the framework code. If you really, really need the parameters, I suggest recreating the ConfigLoader logic of fetching the parameters in a helper function that you can call in the node. But generally, dynamically generated pipelines are to be avoided if you can; I'm curious what your use case is, maybe there's an alternative?
w
@User, thanks for your reply. We have created models for about 50 of our product categories. We don't have the time or manpower to stand up / monitor each of these, so we created several different nodes to manage training, analysis, and production of these 50 models. Thus, we really only create the list in the catalog / the params to manage all 50 pipelines. I created a ConfigLoader in a very similar manner to the `KedroContext` method (see below). However, for some reason, when utilizing this method, it is extremely slow if `get_current_session()` fails. Specifically, getting to the `run` function in `__main__` takes forever. In my mind, this implies something is up with `configure_project`.
```python
import os
from pathlib import Path
from typing import Any

from kedro.config import ConfigLoader
from kedro.framework.project import settings
from kedro.framework.session import get_current_session


def _get_config() -> ConfigLoader:
    """Get the kedro configuration context.

    Returns:
        The kedro configuration context.
    """
    try:
        return get_current_session().load_context().config_loader
    except Exception:
        env = os.getenv("RUN_ENV")
        if env:
            project_path = Path(os.getenv("PROJECT_PATH") or Path.cwd()).resolve()
            conf_root = settings.CONF_ROOT
            conf_paths = [
                str(project_path / conf_root / "base"),
                str(project_path / conf_root / env),
            ]
            return ConfigLoader(conf_paths)
        else:
            # get_env() is a project helper returning the run environment
            return ConfigLoader(["./conf/base", f"./conf/{get_env()}"])


def get_param(key: str, default_value: Any = None) -> Any:
    """Get a parameter value from the parameters.yml.

    Args:
        key: The id from parameter .yml files.

    Returns:
        The parameter value.
    """
    return _get_config().get("parameters*", "parameters*/**", "**/parameters*").get(key, default_value)
```
l
Where does `settings` come from in the snippet above? Is it `kedro.framework.project` or your own project package, `my_package.settings`?
w
`from kedro.framework.project import settings`
l
If I step back a bit though, are you creating one pipeline per product category? And if so, are they structurally different, or is it just the data (catalog entries, parameters) that is different? If it's the latter, are you familiar with modular pipelines? https://kedro.readthedocs.io/en/stable/06_nodes_and_pipelines/03_modular_pipelines.html Could be what you're looking for; see the sketch below.
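Purely as a sketch of that idea: with the `pipeline()` helper you could keep one template pipeline and remap its catalog entries and parameters per category. The node function and the dataset/parameter names below are hypothetical, not from your project:

```python
from kedro.pipeline import Pipeline, node, pipeline


def train_model(features, options):
    """Hypothetical training node shared by every category."""
    ...


def template() -> Pipeline:
    return Pipeline(
        [node(train_model, ["features", "params:model_options"], "model")]
    )


def create_pipeline_train(categories) -> Pipeline:
    # Same structure for every category; only the catalog entries and
    # parameters are remapped per category.
    result = Pipeline([])
    for cat in categories:
        result += pipeline(
            template(),
            inputs={"features": f"{cat}_features"},
            outputs={"model": f"xsell_propensity_{cat}"},
            parameters={"params:model_options": f"params:{cat}_options"},
        )
    return result
```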
That's probably why. If you try importing from your actual package, or even better just using `"conf"` straight away, does that work as expected?
w
```python
from kedro.pipeline import Pipeline

# camel_to_snake, create_pipeline_train_model, filter_pretargets_xsell
# and build_target_xsell are project helpers.
xsell_group = get_param("xsell_group")


def create_pipeline_train():
    pipeline = Pipeline([])

    for item in xsell_group:
        item_camel = camel_to_snake(item[1])
        model_name = f"xsell_propensity_{item_camel}"
        pipeline += create_pipeline_train_model(
            model_name=model_name,
            filter_pretargets_node=filter_pretargets_xsell,
            build_target_node=build_target_xsell,
            additional_tags=["xsell", model_name, f"train_{model_name}", item_camel],
        )

    return pipeline
```
where `xsell_group` is a .yml list of lists from the context parameters.
The callout is that it only runs into the issue during the pipeline build. Settings look like they are initialized during the first step of `configure_project`, while pipelines are the last step.
l
Also just to clarify, are you running the pipeline with `kedro run` or in package mode, i.e. you've packaged the project as a Python artifact and are running it as `my_package run`?
w
So we are pip installing the package, then running a Python script which passes CLI options (i.e. `--pipeline`, `--project_path`) to `sys.argv`, then running `main()`:
```python
import os
import sys

from phoenix_max.__main__ import main

argvs = {
    k: v
    for k, v in {
        "--pipeline": os.getenv("PIPELINES"),
        "--project_path": os.getenv("PROJECT_PATH"),
        "--env": os.getenv("RUN_ENV"),
        "--tags": os.getenv("TAGS"),
    }.items()
    if v is not None
}

# Flatten {flag: value} pairs into [flag, value, flag, value, ...]
sys.argv += list(sum(argvs.items(), tuple()))
main()
```
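As an aside, a programmatic alternative to patching `sys.argv` is to drive the session API directly. A rough sketch against the Kedro 0.17 session API, reusing the same environment variables and package name as the snippet above:

```python
import os
from pathlib import Path

from kedro.framework.session import KedroSession

tags = os.getenv("TAGS")

# Create a session for the installed package and run the requested pipeline.
with KedroSession.create(
    "phoenix_max",
    project_path=Path(os.getenv("PROJECT_PATH") or Path.cwd()),
    env=os.getenv("RUN_ENV"),
) as session:
    session.run(
        pipeline_name=os.getenv("PIPELINES"),
        tags=tags.split(",") if tags else None,
    )
```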
l
I see, that makes sense. I presume the `xsell_group` is a parameter that's used somewhere else in the project's pipelines too, not just for the dynamic creation here, right? Otherwise, if it's just here, you could move the list to a constant in the Python file. I find it a bit surprising that `settings.CONF_ROOT` is that slow in package mode, but I'll have a play with this myself next week to see if I can reproduce it / figure out what's happening. Meanwhile, just importing from the local package or using a literal directly should be good enough. Can you let me know what version of Kedro you're using?
w
@User Yes, we use `xsell_group` in a lot of places. We could create a Python file for it, though that seems against the inherent build design of Kedro. We are using Kedro 0.17.2.
d
@User - Thanks for getting back to us with the version. We'll have to pick this up on Monday, so have a good weekend and speak then!
w
Thanks @User
@User, note that this thread archives in 24 hours (I don't have permissions to make it last longer). I am not sure if you want to extend that or not.
d
I'm still trying to work out what that actually means
I've been able to reopen threads older than that
if we have to open a new one on Monday we can
maybe even move this to advanced :p
w
Additional thoughts @User: would it make sense to build an optional ConfigLoader into `configure_project`? I am wondering if `get_param` is slowing things down by consistently having to remake new ConfigLoaders (though that should only be happening 3 or 4 times in our code base to generate pipelines)?
@User & @User, I did some additional testing and realized its less linear than I thought. It does look like session.run is being hit. The lag time is between Sesssion.run() and the first pipeline being hit. This should be a 3 minute pipeline, but this lag time is making it over an hour. Note, it is not the data loading, as we have Logging happening on the datasets being hit after the 1 hour delay.
l
@User would you be able to upgrade your Kedro version to the latest 0.17.4, or even 0.17.3? From 0.17.3 we've made pipelines load lazily, which might solve your problem. It should be a simple move as it's a non-breaking release.
w
@User, I cannot update to 0.17.4 or 0.17.3 right now. There seem to be a number of breaks, specifically with our custom datasets which inherit from `AbstractDataSet`.
@User, err, I might have lied and just introduced another breaking change prior to the edit.
@User and @User, alright, update. I have added an extensive level of logging to find this, and I updated Kedro to 0.17.4. It looks like during `register_catalog`, `DataCatalog.from_config` is utilized to instantiate the catalog. We are using custom datasets which inherit from `AbstractDataSet`, plus more basic connectors we set up which automatically handle authentication themselves. As a result, the datasets are able to connect to exterior databases without the need for credentials to be stored in Kedro. We have probably 100+ items in our catalog. The extensive lag time seems to be the entirety of all 100+ items instantiating a new instance of the dataset and, as a result, going through authentication. We only have about 10 different dataset types in total. That being said, during each run we are calling a specific pipeline which only uses a max of 5 or 6 catalog items. My questions are thus:
1. Do I need to utilize the `_filtered_pipelines` to instantiate the correct catalog entries in our own `register_catalog`?
2. If I use Kedro credentials, will it only create a single class of these? (I am not seeing where in the code this would happen, but I remain hopeful I am just missing it.)
3. The actual instantiation of the base classes takes about 0.2 seconds. However, a minute passes between each instantiation of the base classes while converting to Kedro datasets. Any ideas what would be causing this?
4. We are using Databricks to run this. We see the expected runtime when utilizing notebooks, but when running jobs we hit this significant lag time. Given that the code is fundamentally the same (and the environment setup), do you have any ideas as to why this would change the runtime speed and the instantiation of the catalog?
d
@WolVez Lorena is now on holiday for two weeks, so I'll have to pick this up next week with another team member. This is very helpful for us, so please keep the questions coming; I'm keen to find a solution / improve this because it feels like something which shouldn't be taking this long.
w
@User, here are some logs of the runtime shown during the for loop of the `get_config` function inside of `KedroContext` during `Session.run()`, prior to hitting the actual runners. There is a 1-minute lag in the creation of each `AbstractDataSet` instantiation, and the lag is not specific to any one dataset type.
Given we have close to 100 datasets (despite only using 7 in the pipeline), the runtime for any process is close to an hour and a half. The actual implementation of the pipeline is very speedy.
If we go deeper, we see that the one minute between runs is coming from the creation of the class object from the passed dictionary within `AbstractDataSet`.
d
Hi @User thank you for the detailed analysis
let me consult with the team
i
@User the challenge seems to be that in order to get the config loader, your code loads the whole session and instantiates the full `DataCatalog`. I would avoid using `get_current_session()`, since it seems to instantiate too many things for your needs. What you can do instead in your `_get_config()` method is use https://kedro.readthedocs.io/en/stable/kedro.framework.startup.bootstrap_project.html?highlight=bootstrap_project to make sure the project is set up, and then simply instantiate a `ConfigLoader` yourself, as in the code block after `except` in your `_get_config()` function.
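A rough sketch of that approach, reusing the `RUN_ENV`/`PROJECT_PATH` environment variables from your earlier snippet (Kedro 0.17 naming, where the conf root lives in `settings.CONF_ROOT`):

```python
import os
from pathlib import Path

from kedro.config import ConfigLoader
from kedro.framework.project import settings
from kedro.framework.startup import bootstrap_project


def _get_config() -> ConfigLoader:
    """Build a ConfigLoader without spinning up a full session/catalog."""
    project_path = Path(os.getenv("PROJECT_PATH") or Path.cwd()).resolve()
    # Sets up the project (settings, lazily-loaded pipelines) without
    # creating a session or instantiating the DataCatalog.
    bootstrap_project(project_path)
    env = os.getenv("RUN_ENV", "local")
    return ConfigLoader(
        [
            str(project_path / settings.CONF_ROOT / "base"),
            str(project_path / settings.CONF_ROOT / env),
        ]
    )
```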
For the slowness of the `DataCatalog` instantiation: this happens due to the eagerness of the default `DataCatalog`, which will eagerly instantiate all catalog entries, and as you mentioned, the lag may come from the authentication for each of the connections. So each of the 100+ datasets will set up its connection on instantiation. One way to improve that is to make sure that your custom dataset class has a class property for the connection, which is set only once when the first dataset is instantiated; all other dataset instances then reuse the same connection. This way Kedro will instantiate only 1 connection instead of 100+, and the slowness will disappear. In that case though, you need to ensure that access to the connection is thread-safe (if you are using ThreadRunner).
The 1-hour slowness makes sense: I would expect Databricks to have some kind of brute-force attack prevention for authenticating, and it seems that your code does too many authentications at once, so they throttle the authentication requests. This could explain the 1-minute delay between each authentication, because of the timeout they apply between two consecutive authentications for security. The one shared connection between all datasets should solve this problem. Let us know if it fixes it for you. Here is an example I found after a quick search for a thread-safe singleton; hopefully it will help you with making a thread-safe class instance for your connection object: https://blog.hbis.fr/2019/03/23/python-singleton/
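To illustrate the shared-connection idea, a minimal sketch of a custom dataset with a class-level, lock-guarded connection. The `SnowflakeTable` name, the `_authenticate()` helper and the `read`/`write` calls are illustrative stand-ins, not a real API:

```python
import threading

from kedro.io import AbstractDataSet


def _authenticate():
    """Hypothetical stand-in for the project's own auth logic."""
    ...


class SnowflakeTable(AbstractDataSet):
    """All instances reuse one class-level connection, so authentication
    happens once per process instead of once per catalog entry."""

    _conn = None
    _conn_lock = threading.Lock()  # keeps ThreadRunner usage safe

    def __init__(self, table: str):
        self._table = table

    @classmethod
    def _connection(cls):
        # Double-checked locking: only the first caller authenticates.
        if cls._conn is None:
            with cls._conn_lock:
                if cls._conn is None:
                    cls._conn = _authenticate()
        return cls._conn

    def _load(self):
        return self._connection().read(self._table)  # illustrative call

    def _save(self, data):
        self._connection().write(self._table, data)  # illustrative call

    def _describe(self):
        return {"table": self._table}
```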
w
@User, @User, thanks for the response. I ended up adding singletons and a slew of other enhancements to reduce requests. However, this actually didn't end up speeding up the system very much. I went as deep as I could go this time around and actually identified the issue as `ConfigLoader.get()`. While `_lookup_config_filepaths` is the primary offender inside `ConfigLoader.get()`, `_load_configs` also took a large amount of time. We created a `get_param` function inside of our connector repo to help with grabbing various parameters held inside the conf. Below are the functions:
```python
import logging
from time import perf_counter
from typing import Any

from kedro.config import ConfigLoader
from kedro.framework.session import get_current_session


def _get_config() -> ConfigLoader:
    """Get the kedro configuration context.

    Returns:
        The kedro configuration context.
    """
    try:
        value = get_current_session().load_context().config_loader
        logging.info("_get_config - GET_CURRENT_SESSION_METHOD USED")
        return value
    except Exception:
        # get_env() is a project helper returning the run environment
        value = ConfigLoader(["./conf/base", f"./conf/{get_env()}"])
        logging.info("_get_config - NEW CONFIGLOADER CREATED!!!")
        return value


def get_param(key: str, default_value: Any = None) -> Any:
    """Get a parameter value from the parameters.yml.

    Args:
        key: The id from parameter .yml files.

    Returns:
        The parameter value.
    """
    logging.info(f"GETTING PARAM from kedro_connect: key - {key}")
    start = perf_counter()

    config_start = perf_counter()
    config = _get_config()
    config_end = perf_counter()
    logging.info(f"TOTAL TIME TO RETRIEVE CONFIG FOR {key} from PARAM - CONFIG TIME: {config_end - config_start}s.")

    get_1_start = perf_counter()
    params = config.get("parameters*", "parameters*/**", "**/parameters*")
    get_1_end = perf_counter()
    logging.info(f"TOTAL TIME TO RETRIEVE GET1 PARAM FOR {key} from PARAM - CONFIG TIME: {get_1_end - get_1_start}s.")

    get_2_start = perf_counter()
    value = params.get(key, default_value)
    get_2_end = perf_counter()
    logging.info(f"TOTAL TIME TO RETRIEVE GET2 PARAM FOR {key} from PARAM - CONFIG TIME: {get_2_end - get_2_start}s.")

    end = perf_counter()
    logging.info(f"TOTAL TIME TO RETRIEVE {key} from PARAM - TOTAL TIME: {end - start}s.")
    logging.info(f"CONFIG: {config}")
    return value
```
Speeds from the notebook: (attachment not shown)
Speeds from the job: (attachment not shown)
While significantly slower in the job, GET1 is still pretty slow in the notebook. I will keep going down the rabbit hole, but I have confirmed that this is the primary source of the minute-ish long gaps. We utilize these `get_param` functions extensively within our custom Kedro datasets (commonly calling 2-4 per `__init__`).
Do you recommend a better solution than the above `get_param` function?
i
@User if you call `get_param` often, that means that every single time you call this function you will load all the parameters from scratch, since `config.get("parameters*", "parameters*/**", "**/parameters*")` reads the files and parses them. Could you somehow cache the results from the first time this is called? As for whether it is good practice to have something like `get_param`, I think in general it'd be best to keep the standard way Kedro passes parameters around and not use this kind of function, but I understand that this is not always possible. If refactoring your code at the moment is out of the question, don't worry too much about it and keep what you have, and look for opportunities to refactor it in the future to get the parameters the usual way.
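For example, one possible shape for that cache, reusing the `_get_config` helper from your snippet above (note the parameters are then parsed once per process, so on-disk changes won't be picked up until restart):

```python
from functools import lru_cache
from typing import Any


@lru_cache(maxsize=1)
def _all_params() -> dict:
    """Read and parse the parameter files only once per process."""
    return _get_config().get("parameters*", "parameters*/**", "**/parameters*")


def get_param(key: str, default_value: Any = None) -> Any:
    return _all_params().get(key, default_value)
```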
w
@User, another singleton! @User & @User, one of the issues is that we have so many catalog items, but also so many of them needing similar attributes. For example, we probably have 75+ catalog items for Snowflake. Each environment also needs to push to a different database (dev, pre, prd). So, we could include the required database as part of each Snowflake catalog item, but then we would be specifying the same information over and over again in the .yml. It seems easier to create one set of parameters in the environment conf, then utilize something like `get_param` within the dataset to manage that process instead. I am not super familiar with how far the .yml inheritance process goes. I suppose we could use that as a workaround if it can go down multiple levels. For example:
```
- conf
    - base
    - dev
        - catalog
            - pipeline1
                - snowflake.yml
                - sql.yml
            - pipeline2
                - catalog.yml
            - snowflake.yml
            - sql.yml
```
Where the `snowflake.yml` in the catalog could contain something like:
```yaml
# snowflake table
_snowflake_table: &create_snowflake_table
  type: kedro_connector.datasets.SnowflakeTable
  db: connection.string.stuff
  other_creds: creds stuff
```
Then in the pipeline1 `snowflake.yml` file and/or the pipeline2 `catalog.yml` we could have something like:
```yaml
pipeline1_table_output:
  <<: *create_snowflake_table
  table: db.schema.table_name
```
Would the load order matter in this situation though? Could the ConfigLoader handle inheritance across multiple files like that?
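One caveat with the sketch above: plain YAML anchors only resolve within a single file, so the `<<: *create_snowflake_table` reference wouldn't work across files. A possible alternative for the "one set of values per environment" goal is Kedro 0.17's `TemplatedConfigLoader`, wired in via the project hooks; a rough sketch (the `*globals.yml` pattern is the conventional one from the docs, and the `snowflake_db` key is hypothetical):

```python
from typing import Iterable

from kedro.config import ConfigLoader, TemplatedConfigLoader
from kedro.framework.hooks import hook_impl


class ProjectHooks:
    @hook_impl
    def register_config_loader(self, conf_paths: Iterable[str]) -> ConfigLoader:
        # Values from conf/<env>/*globals.yml (e.g. snowflake_db: dev_db)
        # are substituted into ${...} placeholders in the catalog YAML,
        # so each environment supplies its own database just once.
        return TemplatedConfigLoader(conf_paths, globals_pattern="*globals.yml")


# A catalog entry, for comparison with the anchor version above:
#   pipeline1_table_output:
#     type: kedro_connector.datasets.SnowflakeTable
#     db: ${snowflake_db}
#     table: db.schema.table_name
```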
d
^ So we actually have internal users who have abstracted their pipeline definition into YAML much like this.
We've researched how it's gone and have landed on the conclusion that this is not a good idea, for a couple of reasons.
You end up writing an ungodly amount of YAML, with no help from your IDE, tests, etc. Things become very hard to work with, hand over, and debug.
I'm not sure if you were involved in our most recent UX research piece where we tested our possible solutions to config hell. Did you join any of those calls?
w
@User, unfortunately not. What was your final conclusion on the best way to handle it? Also note, the singleton solved the speed problem! Thanks for all your help, everyone!
d
Re the config research - @User should be posting it on GitHub shortly