Thank you for the reply I want to use the envs yes but I wan Kedro #beginners-need-help

Zemeio

11/06/2021, 1:25 PM

Thank you for the reply. I want to use the envs, yes, but I also want pipelines (or nodes) that sample my prod data into test data, so I need a pipeline that goes from one env to the other. The way I thought to achieve this is to have one setting that always points to the test data (test) and one that can point to either the test data or the prod data (base). The test environment would make base point to the test data, so I can run stuff on a smaller dataset; otherwise base would point to prod on the cloud, to use the huge datasets.

datajoely

11/08/2021, 9:39 AM

Hi @User, I'm not sure I'm following entirely, so let's start a thread to work through this. I think the answer we end up with will involve more duplication than we'd like.

Zemeio

11/08/2021, 9:40 AM

Wow, thanks!

datajoely

11/08/2021, 9:41 AM

As I understand it we just need to have mirror catalogs in the folder structure. You can also do this trick to inject environment variables into your TemplatedConfigLoader scope:

python
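import os
from typing import Iterable
from kedro.config import ConfigLoader, TemplatedConfigLoader

def register_config_loader(self, conf_paths: Iterable[str]) -> ConfigLoader:
    # The snippet was truncated after `return`; TemplatedConfigLoader is assumed from the message above.
    return TemplatedConfigLoader(
        conf_paths,
        globals_pattern="*globals.yml",
        globals_dict={  # expose only environment variables with the chosen prefix
            k: v for k, v in os.environ.items()
            if k.startswith("XXXXX")
        },
    )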

Zemeio

11/08/2021, 9:42 AM

Basically, I want the ability to do (1)

kedro run --pipeline create-test-data

which is going to subset my prod data into my test data (sized so that I can use it locally), while still having the ability to run (2)

kedro run --pipeline my-usual-pipeline --env test

What complicates this is that (1) needs to run from the prod data to the test data.
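
The subsetting step in (1) could be sketched as a plain function that a Kedro node wraps. This is only a minimal sketch and not something posted in the thread: `sample_rows`, its `fraction` and `seed` parameters, and the dataset names in the comment are all hypothetical.

```python
import random
from typing import Dict, List

Row = Dict[str, object]


def sample_rows(rows: List[Row], fraction: float, seed: int = 0) -> List[Row]:
    """Return a reproducible random subset of `rows` (hypothetical helper)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Keep at least one row so a non-empty input never yields an empty sample.
    k = max(1, round(len(rows) * fraction)) if rows else 0
    rng = random.Random(seed)  # fixed seed so reruns produce the same test data
    return rng.sample(rows, k)


# In a Kedro project the function would be wrapped in a node that reads the
# prod entry and writes the test entry (dataset names are assumptions), e.g.
#   node(sample_rows, inputs=["prod_table", "params:fraction"], outputs="test_table")

prod = [{"id": i} for i in range(1000)]
test = sample_rows(prod, fraction=0.05)
```

Fixing the seed keeps the test data stable between runs, which matters if results from (2) are compared over time.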

datajoely

11/08/2021, 9:47 AM

So I think that's possible, but you need duplicate catalog entries

datajoely

11/08/2021, 9:48 AM

The other thing you could do, which I don't entirely endorse, is to start injecting some logic into your create_pipeline() functions that changes the pipeline inputs
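
A rough illustration of what such logic inside create_pipeline() might look like, using plain tuples as stand-ins for Kedro nodes so the snippet runs without Kedro installed. The `USE_TEST_DATA` variable, the dataset names, and the tuple layout are assumptions made for this sketch, not part of Kedro or of the thread.

```python
import os
from typing import List, Mapping, Optional, Tuple

# (function name, input dataset, output dataset): a stand-in for a Kedro node.
NodeSpec = Tuple[str, str, str]


def create_pipeline(environ: Optional[Mapping[str, str]] = None) -> List[NodeSpec]:
    """Pick the first node's input based on a hypothetical environment switch."""
    env = os.environ if environ is None else environ
    # Hypothetical switch: USE_TEST_DATA=1 makes the pipeline read the sample.
    source = "table_test" if env.get("USE_TEST_DATA") == "1" else "table_prod"
    return [
        ("clean", source, "table_clean"),
        ("train", "table_clean", "model"),
    ]


local_nodes = create_pipeline({"USE_TEST_DATA": "1"})
prod_nodes = create_pipeline({})
```

One consequence of this approach is that the choice of data source lives in Python code instead of in the catalog, so it is no longer visible from the configuration alone.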

Zemeio

11/08/2021, 9:54 AM

I planned on having duplicate catalog entries (I thought they would be needed). In that case I wanted to make the test entry always available, so you can always use it. The other one would be a "base" entry, which can point to test or to prod depending on which --env you pass.
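
The layout described here can be sketched with plain dictionaries standing in for the catalog files. This approximates how an environment's catalog overrides entries from the default one, key by key; the entry names `table_base` and `table_test`, the bucket, and the file paths are hypothetical.

```python
from typing import Dict

Catalog = Dict[str, Dict[str, str]]

# Hypothetical contents of conf/base/catalog.yml: "table_base" points to prod,
# "table_test" always points to the small local sample.
BASE_ENV: Catalog = {
    "table_base": {"type": "pandas.CSVDataSet", "filepath": "s3://prod-bucket/table.csv"},
    "table_test": {"type": "pandas.CSVDataSet", "filepath": "data/test/table.csv"},
}

# Hypothetical contents of conf/test/catalog.yml: only "table_base" is redefined.
TEST_ENV: Catalog = {
    "table_base": {"type": "pandas.CSVDataSet", "filepath": "data/test/table.csv"},
}


def resolve_catalog(base: Catalog, override: Catalog) -> Catalog:
    """Merge two catalogs; the override wins for each top-level entry name."""
    merged = dict(base)
    merged.update(override)
    return merged


default_run = resolve_catalog(BASE_ENV, {})       # kedro run
test_run = resolve_catalog(BASE_ENV, TEST_ENV)    # kedro run --env test
```

With this layout, the create-test-data pipeline run without --env would read `table_base` (prod) and write `table_test`, while the usual pipeline only ever reads `table_base`, so passing --env test redirects it to the sample.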

datajoely

11/08/2021, 9:55 AM

Yeah I think it's probably the best way of doing it

datajoely

11/08/2021, 9:55 AM

but I also don't like it! we should do a better job in the future

datajoely

11/08/2021, 9:56 AM

We have an enormous, long-running piece of user research on how to simplify / overhaul config in general in this issue: https://github.com/quantumblacklabs/kedro/issues/891 If you have any thoughts, they would be very welcome.

Zemeio

11/08/2021, 10:42 AM

I'll take a look at it (tomorrow, can't get my hands on it rn)

datajoely

11/08/2021, 10:42 AM

Yeah it's chunky, but any feedback can help us steer the future of what I feel is our biggest area of overhead

datajoely

11/08/2021, 10:42 AM

no rush!
