#beginners-need-help

anhoang

08/19/2021, 5:40 PM
Do people usually create/edit the data catalog YAML file using Python in their pipelines? I have a pipeline with a known number of output datasets with consistent meaning and naming (`file_A`, `file_B`, `file_C`). I want the folder that this pipeline runs in to have its own dynamically generated data catalog, so other people can go in and inspect the results from the pipeline easily. Taking the example from https://kedro.readthedocs.io/en/latest/05_data/01_data_catalog.html#configuring-a-data-catalog , is it possible to do this:
```python
# Imports as in the linked docs (Kedro 0.17.x puts these datasets
# under kedro.extras.datasets.pandas)
from kedro.extras.datasets.pandas import (
    CSVDataSet,
    ParquetDataSet,
    SQLQueryDataSet,
    SQLTableDataSet,
)
from kedro.io import DataCatalog

io = DataCatalog(
    {
        "bikes": CSVDataSet(filepath="../data/01_raw/bikes.csv"),
        "cars": CSVDataSet(filepath="../data/01_raw/cars.csv", load_args=dict(sep=",")),
        "cars_table": SQLTableDataSet(
            table_name="cars", credentials=dict(con="sqlite:///kedro.db")
        ),
        "scooters_query": SQLQueryDataSet(
            sql="select * from cars where gear=4",
            credentials=dict(con="sqlite:///kedro.db"),
        ),
        "ranked": ParquetDataSet(filepath="ranked.parquet"),
    }
)
```
and then do `io.to_config()`? We have `io.from_config()` but not `io.to_config()` to generate a YAML file from the data catalog object.
datajoely

08/19/2021, 5:41 PM
Hi @User - good question
5:44 PM
So I would argue dynamic catalogs are an advanced topic
5:45 PM
We do support Jinja2 in your YAML so you can do loops and things in there if YOU really want to
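A sketch of the kind of loop meant here, assuming the project's config loader has Jinja2 support enabled (the dataset names and paths are illustrative, not from the thread):

```yaml
# catalog.yml -- Jinja2 loop over a fixed list of dataset names
{% for name in ["file_A", "file_B", "file_C"] %}
{{ name }}:
  type: pandas.CSVDataSet
  filepath: data/01_raw/{{ name }}.csv
{% endfor %}
```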
5:45 PM
We are also currently working on something that would better support this sort of thing - so you may want to talk to our UX researcher @User
5:46 PM
However - in general we feel that explicit is usually better than implicit, so err towards catalogs that are readable at rest
5:47 PM
You can do what you've shown here via the Python API if you want - but it's been designed more for our internal purposes than for users, if that makes sense
anhoang

08/19/2021, 5:49 PM
from my understanding, the Jinja2 support is for looping over subsets of datasets. I don't usually use it, since I believe deterministic, written-out datasets are easier to maintain and understand
5:51 PM
I want to export the data catalog of datasets `bikes`, `cars`, `cars_table`, `scooters_query`, and `ranked` to YAML from Python
datajoely

08/19/2021, 5:51 PM
I don't think we have a mechanism for that - but do you want to do it so that you can then re-read that YAML back in via Kedro?
5:52 PM
because we can use the `DataCatalog` object you've created directly
anhoang

08/19/2021, 5:57 PM
what I'm trying to do is: for each set of parameters of the pipeline, create a folder `param1_X_param2_Y` with files `[bikes.csv, cars.csv, etc]` and a data catalog that documents these datasets, and another folder `param1_A_param2_B` with the same set of files but different contents. This way, another person can go into a folder, explore these datasets for each parameter combination, and do subsequent analyses without worrying about filepaths
5:59 PM
They would just use `kedro jupyter` or initialize a `DataCatalog` object that points to the folder `param1_X_param2_Y` and load one set of files
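A minimal sketch of the layout being described, with hypothetical folder and dataset names: every parameter combination gets its own folder, the dataset names stay fixed, and only the filepaths change.

```python
# Hypothetical helper: map a fixed set of dataset names to files inside
# one parameter-combination folder (names and layout are illustrative).
def folder_catalog(param1, param2, names=("bikes", "cars")):
    folder = f"param1_{param1}_param2_{param2}"
    return {name: f"{folder}/{name}.csv" for name in names}

# The same dataset name resolves to a different file per folder:
folder_catalog("X", "Y")["bikes"]  # "param1_X_param2_Y/bikes.csv"
folder_catalog("A", "B")["bikes"]  # "param1_A_param2_B/bikes.csv"
```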
datajoely

08/19/2021, 5:59 PM
So is this to have a pre-configured project for people to investigate?
5:59 PM
less about parameterising a run
anhoang

08/19/2021, 6:00 PM
initializing a `DataCatalog` pointing to `param1_A_param2_B` will load the same dataset names but different files
6:00 PM
yep! sorry for not being clear
datajoely

08/19/2021, 6:00 PM
so
anhoang

08/19/2021, 6:00 PM
it's about making the datasets output from a pipeline easier to navigate
datajoely

08/19/2021, 6:00 PM
we do have something that may make this easier
6:00 PM
we have something called run environments
6:01 PM
the canonical example is that you have the same names of datasets, but different paths for staging, production, testing etc
6:01 PM
and you can dynamically change it by setting an environment variable: `export KEDRO_ENV=test`
6:01 PM
it's not exactly the same
6:02 PM
but it's the closest thing we have to a native version of this
6:02 PM
so instead of `base` and `local`
6:02 PM
you can add your own
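A sketch of the run-environment layout being described (paths are illustrative): each environment's catalog declares the same dataset name with a different filepath, and `export KEDRO_ENV=test` (or `kedro run --env=test`) selects which one is loaded.

```yaml
# conf/base/catalog.yml -- the default
cars:
  type: pandas.CSVDataSet
  filepath: data/01_raw/cars.csv

# conf/test/catalog.yml -- used instead when KEDRO_ENV=test
cars:
  type: pandas.CSVDataSet
  filepath: s3://my-test-bucket/cars.csv
```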
anhoang

08/19/2021, 6:04 PM
what if the number of datasets changes between `base` and `local`? For example, when you work in the cloud you need to output additional files. I thought that the number of datasets, and what they are, had to be exactly the same in every environment
datajoely

08/19/2021, 6:04 PM
no they can be diff
anhoang

08/19/2021, 6:04 PM
also, is there a way to export the `DataCatalog` in the example above into YAML?
datajoely

08/19/2021, 6:05 PM
no
6:05 PM
all of the information is in the object
6:05 PM
so if you wanted to write a little script to generate it you could
6:05 PM
but it doesn't have a `to_yaml` mechanism out of the box
6:05 PM
I don't think it would be too hard to patch in if you wanted to do it
6:05 PM
but it's not something that exists
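The "little script" route might look something like this - a hypothetical sketch, not part of the Kedro API, assuming each dataset can be described by a type name plus keyword arguments:

```python
# Hypothetical DataCatalog -> YAML exporter (NOT a Kedro API; datasets are
# described here as plain (type, kwargs) tuples rather than real objects).
def catalog_to_yaml(datasets):
    """Render a catalog-style YAML string from {name: (type, kwargs)}."""
    lines = []
    for name, (ds_type, kwargs) in sorted(datasets.items()):
        lines.append(f"{name}:")
        lines.append(f"  type: {ds_type}")
        for key, value in kwargs.items():
            lines.append(f"  {key}: {value}")
    return "\n".join(lines) + "\n"

datasets = {
    "bikes": ("pandas.CSVDataSet", {"filepath": "data/01_raw/bikes.csv"}),
    "ranked": ("pandas.ParquetDataSet", {"filepath": "ranked.parquet"}),
}
print(catalog_to_yaml(datasets))
```

A real version would need to introspect each dataset object's attributes (and be careful not to write credentials to disk), which is part of why the bar for such a PR would be high.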
anhoang

08/19/2021, 6:06 PM
gotcha! thanks so much for the help! Is that something that you guys would accept a PR for?
datajoely

08/19/2021, 6:06 PM
I think so
6:07 PM
but full disclosure because the catalog is so central to Kedro we would have quite a high bar in terms of documentation, tests etc
6:09 PM
actually
6:09 PM
there is something we have for a very diff purpose
anhoang

08/19/2021, 6:09 PM
I think I get what you mean. I can manually add another dataset to the YAML in `local` and not in `base`. Is there an example of a pipeline that outputs datasets `A, B, C` when run in environment A but outputs datasets `A, B, D, E` when run in environment B?
datajoely

08/19/2021, 6:10 PM
Yeah you would have to have the same number of outputs in both situations
6:10 PM
I would maybe structure that as diff pipelines
a

anhoang

08/19/2021, 6:10 PM
Ah! I think that only creates a `MemoryDataSet` for every missing dataset not in the catalog, right?
datajoely

08/19/2021, 6:10 PM
yes
6:10 PM
but it's a jumping off point
anhoang

08/19/2021, 6:11 PM
gotcha, different pipelines with a subset of shared nodes.
6:13 PM
Thank you so much for your rapid help @User ! Idk if you want/can move this into advanced-need-help or not, but I thought there was a method `DataCatalog.to_yaml` somewhere, so I assumed it was a beginner question lol 😆
datajoely

08/19/2021, 6:14 PM
It's fine here! Now that Discord has these fancy threads I'm not worried. In general dynamic pipelining is an advanced topic, but this is before we even run the thing!
6:14 PM
Good luck!