Do people usually create/edit data catalog YAML files using Python in their pipelines?
# beginners-need-help
a
Do people usually create/edit data catalog YAML files using Python in their pipelines? I have a pipeline with a known number of output datasets with consistent meaning and naming (`file_A`, `file_B`, `file_C`). I want the folder that this pipeline runs in to have its own dynamically generated data catalog, so other people can go in and inspect the results from the pipeline easily. Taking the example from https://kedro.readthedocs.io/en/latest/05_data/01_data_catalog.html#configuring-a-data-catalog , is it possible to do this:
```python
from kedro.extras.datasets.pandas import (
    CSVDataSet,
    ParquetDataSet,
    SQLQueryDataSet,
    SQLTableDataSet,
)
from kedro.io import DataCatalog

io = DataCatalog(
    {
        "bikes": CSVDataSet(filepath="../data/01_raw/bikes.csv"),
        "cars": CSVDataSet(filepath="../data/01_raw/cars.csv", load_args=dict(sep=",")),
        "cars_table": SQLTableDataSet(
            table_name="cars", credentials=dict(con="sqlite:///kedro.db")
        ),
        "scooters_query": SQLQueryDataSet(
            sql="select * from cars where gear=4",
            credentials=dict(con="sqlite:///kedro.db"),
        ),
        "ranked": ParquetDataSet(filepath="ranked.parquet"),
    }
)
```
and then do `io.to_config()`? We have `io.from_config()` but not `io.to_config()` to generate a YAML file from the data catalog object.
d
Hi @User - good question
So I would argue dynamic catalogs are an advanced topic
We do support Jinja2 in your YAML so you can do loops and things in there if YOU really want to
We are also currently working on something that would better support this sort of thing - so you may want to talk to our UX researcher @User
However - in general we feel that explicit is usually better than implicit, so err towards catalogs that are readable at rest
You can do what you've shown here via the Python API if you want - but it's been designed more for our purposes than users if that makes sense
a
from my understanding, the Jinja2 support is for looping over subsets of datasets. I don't usually do it since I believe the deterministic, written-out datasets are easier to maintain and understand
I want to export the data catalog of the datasets `bikes`, `cars`, `cars_table`, `scooters_query` and `ranked` to YAML from Python
d
I don't think we have a mechanism for that - but do you want to do it so that you then re-read that YAML in via Kedro?
because we can use the `DataCatalog` object you've created directly
a
what I'm trying to do is, for each set of parameters of the pipeline, create a folder `param1_X_param2_Y` with files [bikes.csv, cars.csv, etc.] and a data catalog that documents these datasets. Another folder `param1_A_param2_B` would have the same set of files, but the content of the files would be different. This way, another person can go into a folder and explore these datasets for each parameter combination and do subsequent analyses without worrying about the filepaths.
They would just use `kedro jupyter` or initialize a `DataCatalog` object that points to the folder `param1_X_param2_Y` and load one set of files (something like the sketch below).
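For illustration, a minimal sketch of that consumer side, assuming each parameter folder ships a `catalog.yml` in the standard Kedro format (the folder name and file layout here are hypothetical):
```python
import yaml

from kedro.io import DataCatalog

# hypothetical folder for one parameter combination
folder = "param1_X_param2_Y"

# read the catalog.yml generated alongside the pipeline outputs
with open(f"{folder}/catalog.yml") as f:
    catalog_config = yaml.safe_load(f)

io = DataCatalog.from_config(catalog_config)

# the dataset names are the same in every folder, only the files differ
bikes = io.load("bikes")
```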
d
So is this to have a pre-configured project for people to investigate?
less about parameterising a run
a
initializing a `DataCatalog` pointing to `param1_A_param2_B` will load the same dataset names but different files
yep! sorry for not being clear
d
so
a
it's about making the datasets output from a pipeline easier to navigate
d
we do have something that may make this easier
we have something called run environments
the canonical example is that you have the same names of datasets, but different paths for staging, production, testing etc
and you can dynamically change it by setting an env variable
`export KEDRO_ENV=test`
it's not exactly the same
but it's the closest thing we have to a native version of this
so instead of just `base` and `local`, you can add your own
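Roughly, the environment mechanism layers `conf/<env>/` on top of `conf/base/`. A hedged sketch of loading a catalog that way, assuming the `ConfigLoader` API from the docs version linked above (`conf/test` stands in for whatever environment `KEDRO_ENV` points at):
```python
from kedro.config import ConfigLoader
from kedro.io import DataCatalog

# entries in conf/test/catalog.yml override the ones in conf/base/catalog.yml
conf_loader = ConfigLoader(conf_paths=["conf/base", "conf/test"])
conf_catalog = conf_loader.get("catalog*", "catalog*/**")

io = DataCatalog.from_config(conf_catalog)
```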
a
what if the number of datasets changes between `base` and `local`, for example when you work in the cloud and need to output additional files? I thought that the number of datasets and what they are have to be exactly the same in every environment
d
no they can be diff
a
also, is there a way to export the `DataCatalog` in the example above into YAML?
d
no
all of the information is in the object
so if you wanted to write a little script to generate it you could
but it doesn't have a `to_yaml` mechanism out of the box
I don't think it would be too hard to patch in if you wanted to do it
but it's not something that exists
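For illustration, one hedged way to write such a script is to keep the catalog definition as a plain dict in the `catalog.yml` format, dump it to YAML for humans, and build the `DataCatalog` from the same dict for the run (the folder name is hypothetical; the dataset types follow the docs page linked above):
```python
import yaml

from kedro.io import DataCatalog

# hypothetical output folder for one parameter set
folder = "param1_X_param2_Y"

# single source of truth, written in the same shape catalog.yml uses
catalog_config = {
    "bikes": {"type": "pandas.CSVDataSet", "filepath": f"{folder}/bikes.csv"},
    "cars": {
        "type": "pandas.CSVDataSet",
        "filepath": f"{folder}/cars.csv",
        "load_args": {"sep": ","},
    },
    "ranked": {"type": "pandas.ParquetDataSet", "filepath": f"{folder}/ranked.parquet"},
}

# write a catalog.yml into the folder so other people can inspect it
with open(f"{folder}/catalog.yml", "w") as f:
    yaml.safe_dump(catalog_config, f, default_flow_style=False)

# and build the in-memory catalog from the exact same definition
io = DataCatalog.from_config(catalog_config)
```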
a
gotcha! thanks so much for the help! Is that something that you guys would accept a PR for?
d
I think so
but full disclosure: because the catalog is so central to Kedro, we would have quite a high bar in terms of documentation, tests etc
actually
there is something we have for a very diff purpose
a
I think I get what you mean. I can manually add another dataset to the YAML in `local` and not in `base`. Is there an example of a pipeline that outputs datasets A, B, C when run in environment A but outputs datasets A, B, D, E when run in environment B?
d
Yeah you would have to have the same number of outputs in both situations
I would maybe structure that as diff pipelines
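For illustration, a hedged sketch of "different pipelines with a subset of shared nodes", assuming Kedro's `Pipeline`/`node` API; the node functions here are hypothetical placeholders:
```python
from kedro.pipeline import Pipeline, node

# hypothetical placeholder functions standing in for the real processing
def make_a():
    return "A-data"

def make_b(a):
    return a + "-B"

def make_c(b):
    return b + "-C"

def make_d(b):
    return b + "-D"

def make_e(b):
    return b + "-E"

# nodes shared by both environments produce A and B
shared = Pipeline(
    [
        node(make_a, inputs=None, outputs="A"),
        node(make_b, inputs="A", outputs="B"),
    ]
)

# each environment-specific pipeline adds its own outputs on top
pipeline_env_a = shared + Pipeline([node(make_c, inputs="B", outputs="C")])
pipeline_env_b = shared + Pipeline(
    [
        node(make_d, inputs="B", outputs="D"),
        node(make_e, inputs="B", outputs="E"),
    ]
)
```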
a
Ah! I think that only creates a `MemoryDataSet` for every missing dataset not in the catalog, right?
d
yes
but it's a jumping off point
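A small sketch of that behaviour, assuming Kedro's `SequentialRunner` (the node functions and dataset names are made up):
```python
from kedro.io import DataCatalog
from kedro.pipeline import Pipeline, node
from kedro.runner import SequentialRunner

# made-up node functions
def make_raw():
    return [1, 2, 3]

def double(xs):
    return [2 * x for x in xs]

pipeline = Pipeline(
    [
        node(make_raw, inputs=None, outputs="raw"),
        node(double, inputs="raw", outputs="doubled"),
    ]
)

# neither "raw" nor "doubled" is registered, so the runner falls back to
# in-memory datasets for them instead of writing anything to disk
catalog = DataCatalog({})
result = SequentialRunner().run(pipeline, catalog)
print(result)  # the unregistered free output "doubled" comes back in the dict
```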
a
gotcha, different pipelines with a subset of shared nodes.
Thank you so much for your rapid help @User! Idk if you want/can move this into advanced-need-help or not, but I thought there was a `DataCatalog.to_yaml` method somewhere, so I thought it was a beginner question lol 😆
d
It's fine here! Now Discord has these fancy threads, I'm not worried. In general dynamic pipelining is an advanced topic, but this is before we even run the thing!
Good luck!