beginners-need-help
  • d

    datajoely

    08/10/2021, 10:19 PM
    Good luck - Let us know how you get on and I’ll pick up any messages first thing
  • w

    WolVerz

    08/10/2021, 10:19 PM
    thanks @User
  • d

    datajoely

    08/11/2021, 9:00 AM
    @User did you get it to work?
  • b

    Bertozzo

    08/11/2021, 5:14 PM
    Greetings! I'm facing a very similar issue to this one that happened in an older version: https://github.com/quantumblacklabs/kedro/issues/291. Basically I can't save my dataset due to an encoding error
  • b

    Bertozzo

    08/11/2021, 5:15 PM
    kedro.io.core.DataSetError: Failed while saving data to data set CSVDataSet(filepath=C:/Users.........csv, load_args={'encoding': utf-8}, protocol=file, save_args={'index': False}). 'charmap' codec can't encode character '\x91' in position 9: character maps to <undefined>
  • b

    Bertozzo

    08/11/2021, 5:16 PM
    I'm on Kedro 0.17.4, Python 3.7.9, Windows 10 Pro
  • w

    WolVez

    08/11/2021, 8:17 PM
    @User, I was never able to figure it out. @User, I think you might be misunderstanding what I was asking. The video above and the documentation seem to only cover creating a single custom environment. I am talking about inheritance across the various environments. For example, right now, if I run my dev environment, I also inherit from base if dev does not include the same parameter/catalog entry. I want to take it a step further, with three levels of inheritance, specifically such that temp inherits from dev, which inherits from base (base -> dev -> temp).
  • i

    Ignacio

    08/12/2021, 7:11 AM
    Hi @User! I have defined similar conf inheritance patterns in the past by modifying the register_config_loader hook under src/<package_name>/hooks.py. Example:
    from pathlib import Path
    from typing import Iterable

    from kedro.config import ConfigLoader, TemplatedConfigLoader
    from kedro.framework.hooks import hook_impl
    from kedro.framework.project import settings

    class ProjectHooks:
        """Project hooks."""

        @hook_impl
        def register_config_loader(self, conf_paths: Iterable[str]) -> ConfigLoader:
            # Force local as the ultimate overriding params, regardless of the env chosen.
            conf_paths.append(str(Path(settings.CONF_ROOT) / "local"))
            return TemplatedConfigLoader(conf_paths, globals_pattern="*globals.yml")
    In this case, the inheritance will be base -> <custom_env> -> local. You can append other envs to conf_paths to customize this behavior.
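A sketch (not from the thread) of how the same hook could give the three-level base -> dev -> temp chain asked about above; it assumes conf_paths arrives as [<base path>, <chosen env path>] and that later paths take precedence, as the "local as ultimate override" comment in the example implies:

from pathlib import Path
from typing import Iterable

from kedro.config import ConfigLoader
from kedro.framework.hooks import hook_impl


class ProjectHooks:
    """Hypothetical hooks giving a base -> dev -> temp chain."""

    @hook_impl
    def register_config_loader(self, conf_paths: Iterable[str]) -> ConfigLoader:
        conf_paths = list(conf_paths)  # typically [<project>/conf/base, <project>/conf/<env>]
        # When running with --env=temp, slot dev in between so that temp overrides dev,
        # which overrides base.
        if conf_paths and Path(conf_paths[-1]).name == "temp":
            conf_paths.insert(1, str(Path(conf_paths[0]).parent / "dev"))
        return ConfigLoader(conf_paths)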
  • w

    WolVez

    08/17/2021, 2:38 PM
    @User, I just found your project find-kedro. Cool stuff. Quick question: why do you utilize the full path for the pipeline name (for example src.default_kedro_159.pipelines.data_science.pipeline) rather than just grabbing the directory name (in this case data_science), so that when running the project you just need "kedro run --pipeline=data_science"?
  • w

    waylonwalker

    08/17/2021, 4:02 PM
    @User, it was mostly so that it would be most broadly useful. I cannot guarantee how others structure their project. The roots of find-kedro come from my experience on a 0.14.x project that had a very deeply nested set of pipelines. You can easily change up the keys in your register_pipelines module and make up rules that make the most sense for your team. The core of what find-kedro does, creating pipelines based on name (quite similar to how pytest picks up tests), will still be there; you will just make a slight tweak to make it cleaner for your project.
  • w

    waylonwalker

    08/17/2021, 4:13 PM
    @User Please let me know if you decide to use it and have any thoughts on it. I am not sure if anyone else is using it, but it's the first thing I put on most of my projects.
  • w

    WolVez

    08/17/2021, 4:16 PM
    @User, I am planning on using it on a new project, then editing the output such that the key just requires the pipeline directory name, aka "data_science". We did something similar to your solution on another project, but yours is one line of code to implement rather than the 10 we needed before.
  • w

    WolVez

    08/17/2021, 4:17 PM
    + ours was only for pipelines, not nodes and everything else.
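A rough sketch of that tweak (not from the thread), assuming find_kedro() returns a dict keyed by the dotted module path (e.g. src.project.pipelines.data_science.pipeline) plus a "__default__" entry:

from find_kedro import find_kedro


def register_pipelines():
    pipelines = find_kedro()
    # Keep only the directory name, e.g.
    # "src.project.pipelines.data_science.pipeline" -> "data_science".
    return {
        name if name == "__default__" else name.split(".")[-2]: pipeline
        for name, pipeline in pipelines.items()
    }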
  • w

    waylonwalker

    08/18/2021, 1:24 PM
    My pattern is to just put everything in a list of nodes. We really struggled with onboarding early on; the docs were more sparse, and we had a very deeply nested setup. I spent way too many hours helping with packaging details just to get nodes to show in pipelines.
  • j

    JacobJeppesen

    08/19/2021, 7:44 AM
    Hi all :). I'm currently testing out Kedro, and have made a small example with a pipeline training a deep learning model on MNIST. The pipeline is composed of three smaller pipelines, with the first of these consisting of a seed_everything() node, which seeds all random generators. However, when I run the entire pipeline, it does not run the three smaller pipelines sequentially, even though I'm using a SequentialRunner. It seems like the order is based on the data dependencies, rather than the order defined in pipeline_registry(). Is there a way to ensure that a sub-pipeline finishes before starting the next when defining a pipeline?
  • d

    datajoely

    08/19/2021, 8:03 AM
    Execution order
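For context: Kedro orders nodes by their data dependencies regardless of the runner, so the usual workaround is to give the seeding node an output that downstream nodes consume. A minimal sketch (dataset and node names are made up):

from kedro.pipeline import Pipeline, node


def seed_everything(seed: int) -> bool:
    # seed random / numpy / torch etc. here ...
    return True  # dummy flag other nodes can depend on


def train_model(mnist_data, seeding_done):
    ...  # seeding_done is only consumed to force ordering


seed_pipeline = Pipeline(
    [node(seed_everything, inputs="params:seed", outputs="seeding_done", name="seed")]
)

train_pipeline = Pipeline(
    [node(train_model, inputs=["mnist_data", "seeding_done"], outputs="model", name="train")]
)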
  • a

    anhoang

    08/19/2021, 5:40 PM
    Do people usually create/edit the data catalog YAML file using Python in their pipelines? I have a pipeline with a known number of output datasets with consistent meaning and naming (file_A, file_B, file_C). I want the folder that this pipeline runs in to have its own dynamically generated data catalog, so other people can go in and inspect the results from the pipeline easily. Taking the example from https://kedro.readthedocs.io/en/latest/05_data/01_data_catalog.html#configuring-a-data-catalog , is it possible to do this:
    from kedro.extras.datasets.pandas import CSVDataSet, ParquetDataSet, SQLQueryDataSet, SQLTableDataSet
    from kedro.io import DataCatalog

    io = DataCatalog(
        {
            "bikes": CSVDataSet(filepath="../data/01_raw/bikes.csv"),
            "cars": CSVDataSet(filepath="../data/01_raw/cars.csv", load_args=dict(sep=",")),
            "cars_table": SQLTableDataSet(
                table_name="cars", credentials=dict(con="sqlite:///kedro.db")
            ),
            "scooters_query": SQLQueryDataSet(
                sql="select * from cars where gear=4",
                credentials=dict(con="sqlite:///kedro.db"),
            ),
            "ranked": ParquetDataSet(filepath="ranked.parquet"),
        }
    )
    and then do io.to_config()? We have io.from_config() but not io.to_config() to generate a YAML file from the DataCatalog object.
  • w

    waylonwalker

    08/19/2021, 8:20 PM
    @User , what do you mean by dynamic catalog? You have a known set of outputs, why does it need to be dynamic? I have definitely done some hacky python scripts to generate yaml.
  • w

    waylonwalker

    08/19/2021, 8:22 PM
    I have had a few that literally just print the config, and I run python gen_catalog.py > catalog.yml. At the end of the day YAML is still all that goes into the project; it's just a quick shortcut for me to generate a bunch of entries quickly.
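A minimal sketch of what such a generator script might look like (entry names, paths and save_args are made up for illustration):

# gen_catalog.py -- print catalog entries to stdout,
# e.g. `python gen_catalog.py > conf/base/catalog.yml`
import yaml

files = ["file_A", "file_B", "file_C"]  # hypothetical dataset names

entries = {
    name: {
        "type": "pandas.CSVDataSet",
        "filepath": f"data/02_intermediate/{name}.csv",
        "save_args": {"index": False},
    }
    for name in files
}

print(yaml.safe_dump(entries, sort_keys=False))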
  • a

    anhoang

    08/19/2021, 8:31 PM
    sorry for the confusion, I meant catalog that is generated using python code
  • a

    anhoang

    08/19/2021, 8:32 PM
    @User do you have a script that can be used to turn a DataCatalog like the example above into YAML? Would greatly appreciate it!!!
  • w

    waylonwalker

    08/19/2021, 8:38 PM
    ah, I see. I do not have one no.
  • w

    waylonwalker

    08/19/2021, 8:39 PM
    It definitely seems possible
  • a

    anhoang

    08/19/2021, 8:39 PM
    Also see [here](https://discord.com/channels/778216384475693066/877970253450199160/877977652198264962). "Dynamic data catalog" could mean using Python to export a DataCatalog object to YAML, and parameterizing the data catalog generating script to generate a different number of datasets (potentially with no intersection between the two sets) in different environments
  • a

    anhoang

    08/19/2021, 8:41 PM
    Yep! I just thought that someone must have done it. Now I'm confident that everyone is mainly just manually editing the YAML files when they want to add datasets (or maybe using kedro catalog create as a starting point and then manually editing) 🙂
  • w

    waylonwalker

    08/19/2021, 8:58 PM
    Actually I have never seen anyone make a catalog like you pulled from the docs using the Python API (well, except Lim). Personally I like it, but it just doesn't seem well supported. I raised an issue a while back that the datasets should document the YAML API rather than the Python API. I think it got a good response, it's just not a priority.
  • a

    anhoang

    08/19/2021, 9:13 PM
    I'm only using the Kedro catalog to combine with Prefect and not using Kedro pipelines, that's why I'd need this 🙂 . However, I still think things like this should be better supported for when people start with the mini-kedro minimal starter and have not needed pipelines yet
  • a

    anhoang

    08/19/2021, 9:16 PM
    @User I just whipped up a minimal code snippet to go from the dataset class object to its YAML string representation. It will need modifying for custom datasets, but it's a start! 🙂
    import inspect

    from kedro.extras.datasets.pandas import CSVDataSet

    full_class_name = inspect.getclasstree([CSVDataSet])[-1][0][0]  # kedro.extras.datasets.pandas.csv_dataset.CSVDataSet

    full_class_name_str = str(full_class_name)  # <class 'kedro.extras.datasets.pandas.csv_dataset.CSVDataSet'>

    yaml_class_str = full_class_name_str.partition("kedro.extras.datasets.")[-1].strip("'>")

    print(yaml_class_str)  # pandas.csv_dataset.CSVDataSet
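Extending that idea to a whole catalog, a rough sketch could iterate over the catalog and each dataset's description. It leans on private members (DataCatalog._data_sets and AbstractDataSet._describe()), so treat it as illustrative only; the output would likely still need hand-editing to become a valid catalog file:

import yaml

from kedro.extras.datasets.pandas import CSVDataSet
from kedro.io import DataCatalog


def catalog_to_yaml(catalog: DataCatalog) -> str:
    entries = {}
    for name, dataset in catalog._data_sets.items():  # private attribute
        cls = type(dataset)
        dataset_type = f"{cls.__module__}.{cls.__qualname__}".replace(
            "kedro.extras.datasets.", ""
        )
        config = {"type": dataset_type}
        # _describe() is what Kedro uses for the dataset repr; not all of its
        # keys map one-to-one onto valid catalog keys.
        for key, value in dataset._describe().items():
            if value is not None:
                config[key] = value if isinstance(value, (dict, int, float, bool)) else str(value)
        entries[name] = config
    return yaml.safe_dump(entries, sort_keys=False)


io = DataCatalog({"bikes": CSVDataSet(filepath="data/01_raw/bikes.csv")})
print(catalog_to_yaml(io))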
  • w

    WolVez

    08/19/2021, 9:21 PM
    In a similar line to the dynamic catalog, has anyone dynamically generated nodes inside a pipeline based on params in the catalog? I am using a for loop over a list in the YAML to generate nodes. To do this I created a basic function which acquires the params from the context. However, when pip-installing the project and running a pipeline from the installed package, this approach fails because __main__.py attempts to configure the pipelines before the session sets up the context with configure_project(Path(__file__).parent.name). Because the data files are not necessarily saved within the cwd, the pipeline registration fails. I tried to create a ConfigLoader pointing at the correct location when a session wasn't present, but that just seems to make the entire pipeline hang. Any idea how to get around configure_project?
  • l

    Lorena

    08/20/2021, 9:11 AM
    So I'm afraid configure_project can't/shouldn't be bypassed, as that's where settings and pipelines are (lazily) configured, in order to a) be able to import them anywhere in a project, and b) use them in the framework code. If you really really need the parameters, I suggest recreating the config loader logic of fetching the parameters in a helper function that you can call in the node. But generally dynamically generated pipelines are to be avoided if you can; I'm curious what your use case is, maybe there's an alternative?
w

WolVez

08/20/2021, 2:16 PM
@User, thanks for your reply. We have created models for about 50 of our product categories. We don't have the time or manpower to stand up / monitor each of these, so we created several nodes to manage training, analysis, and production of these 50 models. Thus, we really only create the list in the catalog/params to manage all 50 pipelines. I created a ConfigLoader in a very similar manner to the KedroContext method (see below). However, for some reason, when using this method it is extremely slow if get_current_session() fails. Specifically, getting to the run function in __main__ takes forever. In my mind, this implies something is up with configure_project.
import os
from pathlib import Path
from typing import Any

from kedro.config import ConfigLoader
from kedro.framework.project import settings
from kedro.framework.session import get_current_session

# get_env() is a project-specific helper (not shown) that returns the current run environment.


def _get_config() -> ConfigLoader:
    """Get the kedro configuration loader
    Returns:
        The kedro configuration loader
    """
    try:
        return get_current_session().load_context().config_loader
    except Exception:
        env = os.getenv("RUN_ENV")
        if env:
            project_path = Path(os.getenv("PROJECT_PATH") or Path.cwd()).resolve()
            conf_root = settings.CONF_ROOT
            conf_paths = [
                str(project_path / conf_root / "base"),
                str(project_path / conf_root / env),
            ]
            return ConfigLoader(conf_paths)
        else:
            return ConfigLoader(["./conf/base", f"./conf/{get_env()}"])


def get_param(key: str, default_value: Any = None) -> Any:
    """Get a parameter value from the parameters.yml
    Args:
        key: The id from parameter .yml files
    Returns:
        The parameter value
    """
    return _get_config().get("parameters*", "parameters*/**", "**/parameters*").get(key, default_value)
l

Lorena

08/20/2021, 3:32 PM
Where do settings come from in the snippet above? Is it kedro.framework.project or your own project package my_package.settings?
w

WolVez

08/20/2021, 3:34 PM
from kedro.framework.project import settings
l

Lorena

08/20/2021, 3:35 PM
If I step back a bit though, are you creating one pipeline per product category? And if so, are they structurally different, or is it just the data (catalog entries, parameters) that is different? If so, are you familiar with modular pipelines? https://kedro.readthedocs.io/en/stable/06_nodes_and_pipelines/03_modular_pipelines.html Could be what you're looking for.
That's probably why. If you try importing from your actual package, or even better just using "conf" straight away, does that work as expected?
w

WolVez

08/20/2021, 3:38 PM
# Pipeline comes from Kedro; get_param, camel_to_snake, create_pipeline_train_model,
# filter_pretargets_xsell and build_target_xsell are project-specific helpers.
from kedro.pipeline import Pipeline

xsell_group = get_param("xsell_group")

def create_pipeline_train():
    pipeline = Pipeline([])

    for item in xsell_group:
        item_camel = camel_to_snake(item[1])
        model_name = f"xsell_propensity_{item_camel}"
        pipeline += create_pipeline_train_model(
            model_name=model_name,
            filter_pretargets_node=filter_pretargets_xsell,
            build_target_node=build_target_xsell,
            additional_tags=["xsell", model_name, f"train_{model_name}", item_camel],
        )

    return pipeline
where xsell_group is a .yml list of lists from the context parameters
The callout is that it only runs into the issue during the pipeline build. Settings look like they are initialized during the first step of configure_project, while pipelines are the last step.
l

Lorena

08/20/2021, 3:41 PM
Also just to clarify, are you running the pipeline with kedro run or in package mode, i.e. you've packaged the project as a Python artifact and are running it as my_package run?
w

WolVez

08/20/2021, 3:44 PM
So we are pip installing the package, then running a Python script which passes CLI options (i.e. --pipeline, --project_path) via sys.argv, then running main():
import os
import sys

from phoenix_max.__main__ import main

argvs = {k: v for k, v in
    {
        '--pipeline': os.getenv("PIPELINES"),
        '--project_path': os.getenv("PROJECT_PATH"),
        '--env': os.getenv("RUN_ENV"),
        '--tags': os.getenv("TAGS")
    }.items() if v is not None
}

# Flatten the dict into ("--pipeline", "<value>", ...) and append to sys.argv.
sys.argv += list(sum(argvs.items(), tuple()))
main()
l

Lorena

08/20/2021, 4:03 PM
I see, that makes sense. I presume the xsell_group is a parameter that's used somewhere else in the project's pipelines too, not just for the dynamic creation here, right? Otherwise, if it's just here, you could simply move the list to a constant in the Python file. I find it a bit surprising that settings.CONF_ROOT is that slow in package mode, but I'll have a play with this myself next week to see if I can reproduce it / figure out what's happening. Meanwhile, just importing from the local package or using a literal directly should be good enough. Can you let me know what version of Kedro you're using?
w

WolVez

08/20/2021, 5:00 PM
@User Yes, we use xsell_group in a lot of places. We could create a Python file for it, though that seems against the inherent design of Kedro. We are using Kedro 0.17.2
d

datajoely

08/20/2021, 5:01 PM
@User - Thanks for getting back to us with the version. We'll have to pick this up on Monday, so have a good weekend and speak then!
w

WolVez

08/20/2021, 5:01 PM
Thanks @User
Note @User that this thread archives in 24 hours (I don't have permissions to make it last longer). I am not sure if you want to extend that or not.
d

datajoely

08/20/2021, 5:02 PM
I'm still trying to work out what that actually means
I've been able to reopen threads older than that
if we have to open a new one on Monday we can
maybe even move this to advanced :p
w

WolVez

08/20/2021, 5:05 PM
Additional thoughts @User: would it make sense to build an optional ConfigLoader into configure_project? I am wondering if get_param is slowing things down by consistently having to remake new ConfigLoaders (though that should only be happening 3 or 4 times in our code base to generate pipelines)?
@User & @User, I did some additional testing and realized it's less linear than I thought. It does look like session.run is being hit. The lag is between Session.run() and the first pipeline being hit. This should be a 3 minute pipeline, but the lag is making it over an hour. Note, it is not the data loading, as we have logging on the datasets being hit after the 1 hour delay.
l

Lorena

08/24/2021, 4:22 PM
@User would you be able to upgrade your Kedro version, to the latest 0.17.4 or even 0.17.3? From 0.17.3 we've made pipelines load lazily, which might solve your problem. It should be a simple move as it's a non-breaking release.
w

WolVez

08/24/2021, 7:52 PM
@User, I cannot update to 0.17.4 or 0.17.3 right now. There seem to be a number of breaks, specifically with our custom datasets which inherit from AbstractDataSet.
@User, rrr I might have lied and just introduced another breaking change prior to the edit.
@User and @User, alright, update. I have added an extensive level of logging to find this, and I updated Kedro to 0.17.4. It looks like during register_catalog, DataCatalog.from_config is used to instantiate the catalog. We are using custom datasets which inherit from AbstractDataSet and more basic connectors we set up which automatically handle authentication themselves. As a result, the datasets can connect to external databases without credentials needing to be stored in Kedro. We have probably 100+ items in our catalog. The extensive lag seems to be all 100+ items instantiating a new instance of the dataset and then, as a result, going through authentication. We only have about 10 different dataset types in total. That being said, during each run we call a specific pipeline which uses a max of 5 or 6 catalog items. My questions are thus: 1. Do I need to use the _filtered_pipelines to instantiate only the relevant catalog entries in our own register_catalog? 2. If I use Kedro credentials, will it only create a single instance of these (I am not seeing where in the code this would happen, but I remain hopeful I am just missing it)? 3. The actual instantiation of the base classes takes about 0.2 seconds, yet there is a minute between each instantiation of the base classes while converting to Kedro datasets; any ideas what could be causing this? 4. We are using Databricks to run this. We see the expected runtime when using notebooks, but when running jobs we hit this significant lag. Given that the code is fundamentally the same (and the environment setup), do you have any ideas as to why this would change the runtime speed and the instantiation of the catalog?
d

datajoely

08/28/2021, 9:50 AM
@WolVez Lorena is now on holiday for two weeks, so I'll have to pick this up next week with another team member. This is very helpful for us, so please keep the questions coming. I'm keen to find a solution / improve this because it feels like something which shouldn't be taking this long.
w

WolVez

09/01/2021, 2:10 AM
@User, here are some logs of the runtime shown during the for loop of the get_config function inside KedroContext during Session.run(), prior to hitting the actual runners. There is a 1 minute lag in the creation of each AbstractDataSet instantiation. The lag is not specific to any one dataset type.
Given we have close to 100 (despite only using 7 in the pipeline), the run time for any process is close to an hour and a half. The actual execution of the pipeline is very speedy.
If we go deeper, we see that the one minute between runs is coming from the creation of the class object from the passed dictionary within AbstractDataSet.
d

datajoely

09/01/2021, 9:54 AM
Hi @User thank you for the detailed analysis
let me consult with the team
i

idanov

09/01/2021, 10:24 AM
@User the challenge seems to be that in order to get the config loader, your code loads the whole session and instantiates the full DataCatalog. I would avoid using get_current_session() since it seems to instantiate too many things for your needs. What you can do instead is, in your _get_config() method, use https://kedro.readthedocs.io/en/stable/kedro.framework.startup.bootstrap_project.html?highlight=bootstrap_project to make sure the project is set up, and then simply instantiate a ConfigLoader instance yourself, as in the code block after except in your _get_config() function.
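A minimal sketch of that suggestion (assuming the default "conf" directory and the same PROJECT_PATH / RUN_ENV environment variables used earlier in the thread):

import os
from pathlib import Path

from kedro.config import ConfigLoader
from kedro.framework.startup import bootstrap_project


def _get_config() -> ConfigLoader:
    project_path = Path(os.getenv("PROJECT_PATH") or Path.cwd()).resolve()
    bootstrap_project(project_path)  # sets up settings/pipelines without creating a session
    env = os.getenv("RUN_ENV") or "local"
    return ConfigLoader([
        str(project_path / "conf" / "base"),
        str(project_path / "conf" / env),
    ])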
For the slowness of the DataCatalog instantiation: this happens due to the eagerness of the default DataCatalog, which eagerly instantiates all catalog entries, and as you mentioned the lag may come from the authentication for each of the connections. So each of the 100+ datasets will set up its connection on instantiation. One way to improve that is to make sure your custom DataSet class has a class property for the connection, set only once when the first dataset is instantiated; all other DataSet instances then reuse the same connection. This way Kedro will instantiate only 1 connection instead of 100+ and the slowness will disappear. In that case though you need to ensure that access to the connection is thread-safe (if you are using ThreadRunner).
The 1 hour slowness makes sense. I would expect Databricks to have some kind of brute-force attack prevention for authenticating, and it seems that your code does too many authentications at once, so they throttle the authentication requests. This could explain the 1 minute slowness between each authentication, because of the timeout they apply between two consecutive authentications for security. The one shared connection between all datasets should solve this problem. Let us know if it fixes it for you. Here is an example I found after a quick search for a thread-safe singleton; hopefully it will help you with making a thread-safe class instance for your connection object: https://blog.hbis.fr/2019/03/23/python-singleton/
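A rough sketch of the shared-connection idea; the dataset class name and the _authenticate() helper are hypothetical, and the load/save bodies are left as stubs:

import threading
from typing import Any, Dict

from kedro.io import AbstractDataSet


def _authenticate(connection_string: str):
    """Hypothetical placeholder for whatever authenticated connection you build."""
    raise NotImplementedError


class SharedConnectionDataSet(AbstractDataSet):
    """All instances share one connection, created lazily on first use."""

    _connection = None
    _lock = threading.Lock()

    def __init__(self, table: str, connection_string: str):
        # No connection here, so instantiating 100+ catalog entries stays cheap.
        self._table = table
        self._connection_string = connection_string

    @classmethod
    def _get_connection(cls, connection_string: str):
        # Double-checked locking: only the first caller authenticates,
        # and the lock keeps this safe under ThreadRunner.
        if cls._connection is None:
            with cls._lock:
                if cls._connection is None:
                    cls._connection = _authenticate(connection_string)
        return cls._connection

    def _load(self) -> Any:
        conn = self._get_connection(self._connection_string)
        ...  # read self._table through conn

    def _save(self, data: Any) -> None:
        conn = self._get_connection(self._connection_string)
        ...  # write data to self._table through conn

    def _describe(self) -> Dict[str, Any]:
        return {"table": self._table}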
w

WolVez

09/02/2021, 9:12 PM
@User, @User thanks for the response. I ended up adding singletons and a slew of other enhancements to reduce requests. However, this actually didn't end up speeding up the system very much. I went as deep as I could this time around and identified the issue as ConfigLoader.get(). While _lookup_config_filepaths is the primary offender inside ConfigLoader.get(), _load_configs also took a large amount of time. We created a get_param function inside our connector repo to help with grabbing various parameters held inside the conf. Below are the functions:
import logging
from time import perf_counter
from typing import Any

from kedro.config import ConfigLoader
from kedro.framework.session import get_current_session

# get_env() is a project-specific helper (not shown) that returns the current run environment.


def _get_config() -> ConfigLoader:
    """Get the kedro configuration loader
    Returns:
        The kedro configuration loader
    """
    try:
        value = get_current_session().load_context().config_loader
        logging.info("_get_config - GET_CURRENT_SESSION_METHOD USED")
        return value
    except Exception:
        value = ConfigLoader(["./conf/base", f"./conf/{get_env()}"])
        logging.info("_get_config - NEW CONFIGLOADER CREATED!!!")
        return value


def get_param(key: str, default_value: Any = None) -> Any:
    """Get a parameter value from the parameters.yml
    Args:
        key: The id from parameter .yml files
    Returns:
        The parameter value
    """
    logging.info(f"GETTING PARAM from kedro_connect: key - {key}")
    start = perf_counter()
    config_start = perf_counter()
    config = _get_config()
    config_end = perf_counter()
    logging.info(f"TOTAL TIME TO RETRIEVE CONFIG FOR {key} from PARAM - CONFIG TIME: {config_end - config_start}s.")
    get_1_start = perf_counter()
    params = config.get("parameters*", "parameters*/**", "**/parameters*")
    get_1_end = perf_counter()
    logging.info(f"TOTAL TIME TO RETRIEVE GET1 PARAM FOR {key} from PARAM - CONFIG TIME: {get_1_end - get_1_start}s.")
    get_2_start = perf_counter()
    value = params.get(key, default_value)
    get_2_end = perf_counter()
    logging.info(f"TOTAL TIME TO RETRIEVE GET2 PARAM FOR {key} from PARAM - CONFIG TIME: {get_2_end - get_2_start}s.")
    end = perf_counter()
    logging.info(f"TOTAL TIME TO RETRIEVE {key} from PARAM - TOTAL TIME: {end - start}s.")
    logging.info(f"CONFIG: {config}")
    return value
(Screenshots of the timing logs from the notebook and from the job were attached here; they are not preserved in this archive.)
While significantly slower in the job, GET1 is still pretty slow in the notebook. I will keep going down the rabbit hole, but I have confirmed that this is the primary source of the minute-ish long gaps. We use these get_param functions extensively within our custom Kedro datasets (commonly calling 2-4 per __init__).
Do you recommend a better solution than the above get_param function?
i

idanov

09/03/2021, 9:04 AM
@User if you call get_param often, that means that every single time you call this function you load all the parameters from scratch, since config.get("parameters*", "parameters*/**", "**/parameters*") reads the files and parses them. Could you somehow cache the result from the first time this is called? As for whether it is good practice to have something like get_param, I think in general it'd be best to keep to the standard way Kedro passes parameters around and not use this kind of function, but I understand that this is not always possible. If refactoring your code at the moment is out of the question, don't worry too much about it and keep what you have, and look for opportunities to refactor in the future to get the parameters the usual way.
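A minimal sketch of that caching suggestion, with the conf paths hard-coded for illustration:

from functools import lru_cache
from typing import Any

from kedro.config import ConfigLoader


@lru_cache(maxsize=1)
def _load_parameters() -> dict:
    # Read and parse the parameter files once per process, then reuse the result.
    conf_paths = ["./conf/base", "./conf/local"]  # adjust to your env logic
    return ConfigLoader(conf_paths).get("parameters*", "parameters*/**", "**/parameters*")


def get_param(key: str, default_value: Any = None) -> Any:
    return _load_parameters().get(key, default_value)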
w

WolVez

09/03/2021, 2:28 PM
@User, another singleton! @User & @User, one of the issues is that we have so many catalog items, but also so many of them needing similar attributes. For example, we probably have 75+ catalog items for Snowflake. Each environment also needs to push to a different database (dev, pre, prd). So we could include the required database as part of each Snowflake catalog item, but then we would be specifying the same information over and over again in the .yml. It seems easier to create one set of parameters in the environment conf, then use something like get_param inside the dataset to manage that process instead. I am not super familiar with how far the .yml inheritance process goes. I suppose we could use that as a workaround if it can go down multiple levels? For example:
- conf
    - base
    - dev
         - catalog
               - pipeline1
                    - snowflake.yml
                    - sql.yml
               - pipeline2
                    - catalog.yml
               - snowflake.yml
               - sql.yml
Where the snowflake.yml in the catalog could contain something like
#snowflake table
_snowflake_table: &create_snowflake_table
   type: kedro_connector.datasets.SnowflakeTable
   db: connection.string.stuff
   other_creds: creds stuff
Then in the pipeline1 snowflake.yml file and/or the pipeline2 catalog.yml we could have something like:
pipeline1_table_output:
    <<: *create_snowflake_table
    table: db.schema.table_name
Would the load order matter in this situation though? Could the ConfigLoader handle inheritance across multiple files like that?
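For what it's worth, standard YAML anchors only resolve within a single file, so the config loader would not expand &create_snowflake_table from a different file. A single-file version of the same idea might look like this (entry names are illustrative; keys starting with an underscore are ignored by the catalog):

# conf/dev/catalog/snowflake.yml -- anchor and entries in the same file
_snowflake_table: &create_snowflake_table
  type: kedro_connector.datasets.SnowflakeTable
  db: connection.string.stuff
  other_creds: creds stuff

pipeline1_table_output:
  <<: *create_snowflake_table
  table: db.schema.table_name

pipeline2_table_output:
  <<: *create_snowflake_table
  table: db.schema.other_table_name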
d

datajoely

09/03/2021, 2:29 PM
^ So we actually have internal users who have abstracted their pipeline definition into YAML much like this
We've researched how it's gone and have landed on the conclusion that this is not a good idea for a couple of reasons
You end up writing an ungodly amount of YAML, with no help from your IDE, tests, etc. Things become very hard to work with, hand over and debug.
I'm not sure if you were involved in our most recent UX research piece where we tested possible solutions to config hell
did you join any of those calls?
w

WolVez

09/09/2021, 3:29 PM
@User, unfortunately not. What was your final conclusion on the best way to handle it? Also note, the Singleton solved the speed problem!!!! Thanks for all your help all!
d

datajoely

09/09/2021, 3:31 PM
Re the config research - @User should be posting it on GitHub shortly