Do people usually create/edit data catalog YAML files using Python in their pipelines?
# beginners-need-help
a
Do people usually create/edit data catalog YAML files using Python in their pipelines? I have a pipeline with a known number of output datasets with consistent meaning and naming (`file_A`, `file_B`, `file_C`). I want the folder that this pipeline runs in to have its own dynamically generated data catalog, so other people can go in and inspect the results from the pipeline easily. Taking the example from https://kedro.readthedocs.io/en/latest/05_data/01_data_catalog.html#configuring-a-data-catalog , is it possible to do this:
```python
from kedro.extras.datasets.pandas import (
    CSVDataSet,
    ParquetDataSet,
    SQLQueryDataSet,
    SQLTableDataSet,
)
from kedro.io import DataCatalog

io = DataCatalog(
    {
        "bikes": CSVDataSet(filepath="../data/01_raw/bikes.csv"),
        "cars": CSVDataSet(filepath="../data/01_raw/cars.csv", load_args=dict(sep=",")),
        "cars_table": SQLTableDataSet(
            table_name="cars", credentials=dict(con="sqlite:///kedro.db")
        ),
        "scooters_query": SQLQueryDataSet(
            sql="select * from cars where gear=4",
            credentials=dict(con="sqlite:///kedro.db"),
        ),
        "ranked": ParquetDataSet(filepath="ranked.parquet"),
    }
)
```
and then do `io.to_config()`? We have `io.from_config()` but not `io.to_config()` to generate a YAML file from the data catalog object.
d
Hi @User - good question
So I would argue dynamic catalogs are an advanced topic
We do support Jinja2 in your YAML so you can do loops and things in there if YOU really want to
We are also currently working on something that would better support this sort of thing - so you may want to talk to our UX researcher @User
However - in general we feel that explicit is usually better than implicit, so err towards catalogs that are readable at rest
You can do what you've shown here via the Python API if you want - but it's been designed more for our purposes than users if that makes sense
a
from my understanding, the Jinja2 support is for looping over subsets of datasets. I don't usually do it since I believe the deterministic, written-out datasets are easier to maintain and understand
I want to export the data catalog of the datasets `bikes`, `cars`, `cars_table`, `scooters_query` and `ranked` to YAML from Python
d
I don't think we have a mechanism for that - but do you want to do it so that you then re-read that YAML in via Kedro?
because we can use the `DataCatalog` object you've created directly
a
what I'm trying to do is, for each set of parameters of the pipeline, create a folder `param1_X_param2_Y` with files [bikes.csv, cars.csv, etc.] and a data catalog that documents these datasets. Another folder `param1_A_param2_B` would have the same set of files, but the content of the files would be different. This way, another person can go into a folder and explore these datasets for each parameter combination and do subsequent analyses without worrying about the filepaths.
They would just use `kedro jupyter` or initialize a `DataCatalog` object that points to the folder `param1_X_param2_Y` and load one set of files (something like the sketch below).
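For illustration, a minimal sketch of that consumer side, assuming each parameter folder ships a `catalog.yml` in the standard Kedro format (the folder name and file layout here are hypothetical):
```python
import yaml

from kedro.io import DataCatalog

# hypothetical folder for one parameter combination
folder = "param1_X_param2_Y"

# read the catalog.yml generated alongside the pipeline outputs
with open(f"{folder}/catalog.yml") as f:
    catalog_config = yaml.safe_load(f)

io = DataCatalog.from_config(catalog_config)

# the dataset names are the same in every folder, only the files differ
bikes = io.load("bikes")
```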
d
So is this to have a pre-configured project for people to investigate?
less about parameterising a run
a
initializing a `DataCatalog` pointing to `param1_A_param2_B` will load the same dataset names but different files
yep! sorry for not being clear
d
so
a
it's about making the datasets output from a pipeline easier to navigate
d
we do have something that may make this easier
we have something called run environments
the canonical example is that you have the same names of datasets, but different paths for staging, production, testing etc
and you can dynamically change it by setting an env variable
`export KEDRO_ENV=test`
it's not exactly the same
but it's the closest thing we have to a native version of this
so instead of just `base` and `local`, you can add your own
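Roughly, the environment mechanism layers `conf/<env>/` on top of `conf/base/`. A hedged sketch of loading a catalog that way, assuming the `ConfigLoader` API from the docs version linked above (`conf/test` stands in for whatever environment `KEDRO_ENV` points at):
```python
from kedro.config import ConfigLoader
from kedro.io import DataCatalog

# entries in conf/test/catalog.yml override the ones in conf/base/catalog.yml
conf_loader = ConfigLoader(conf_paths=["conf/base", "conf/test"])
conf_catalog = conf_loader.get("catalog*", "catalog*/**")

io = DataCatalog.from_config(conf_catalog)
```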
a
what if the number of datasets changes between `base` and `local`, for example when you work in the cloud and need to output additional files? I thought that the number of datasets and what they are have to be exactly the same in every environment
d
no they can be diff
a
also, is there a way to export the `DataCatalog` in the example above into YAML?
d
no
all of the information is in the object
so if you wanted to write a little script to generate it you could
but it doesn't have a `to_yaml` mechanism out of the box
I don't think it would be too hard to patch in if you wanted to do it
but it's not something that exists
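For illustration, one hedged way to write such a script is to keep the catalog definition as a plain dict in the `catalog.yml` format, dump it to YAML for humans, and build the `DataCatalog` from the same dict for the run (the folder name is hypothetical; the dataset types follow the docs page linked above):
```python
import yaml

from kedro.io import DataCatalog

# hypothetical output folder for one parameter set
folder = "param1_X_param2_Y"

# single source of truth, written in the same shape catalog.yml uses
catalog_config = {
    "bikes": {"type": "pandas.CSVDataSet", "filepath": f"{folder}/bikes.csv"},
    "cars": {
        "type": "pandas.CSVDataSet",
        "filepath": f"{folder}/cars.csv",
        "load_args": {"sep": ","},
    },
    "ranked": {"type": "pandas.ParquetDataSet", "filepath": f"{folder}/ranked.parquet"},
}

# write a catalog.yml into the folder so other people can inspect it
with open(f"{folder}/catalog.yml", "w") as f:
    yaml.safe_dump(catalog_config, f, default_flow_style=False)

# and build the in-memory catalog from the exact same definition
io = DataCatalog.from_config(catalog_config)
```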
a
gotcha! thanks so much for the help! Is that something that you guys would accept a PR for?
d
I think so
but full disclosure: because the catalog is so central to Kedro, we would have quite a high bar in terms of documentation, tests etc
actually
there is something we have for a very diff purpose
a
I think I get what you mean. I can manually add another dataset to the YAML in `local` and not in `base`. Is there an example of a pipeline that outputs datasets A, B, C when run in environment A but outputs datasets A, B, D, E when run in environment B?
d
Yeah you would have to have the same number of outputs in both situations
I would maybe structure that as diff pipelines
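For illustration, a hedged sketch of "different pipelines with a subset of shared nodes", assuming Kedro's `Pipeline`/`node` API; the node functions here are hypothetical placeholders:
```python
from kedro.pipeline import Pipeline, node

# hypothetical placeholder functions standing in for the real processing
def make_a():
    return "A-data"

def make_b(a):
    return a + "-B"

def make_c(b):
    return b + "-C"

def make_d(b):
    return b + "-D"

def make_e(b):
    return b + "-E"

# nodes shared by both environments produce A and B
shared = Pipeline(
    [
        node(make_a, inputs=None, outputs="A"),
        node(make_b, inputs="A", outputs="B"),
    ]
)

# each environment-specific pipeline adds its own outputs on top
pipeline_env_a = shared + Pipeline([node(make_c, inputs="B", outputs="C")])
pipeline_env_b = shared + Pipeline(
    [
        node(make_d, inputs="B", outputs="D"),
        node(make_e, inputs="B", outputs="E"),
    ]
)
```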
a
Ah! I think that only creates a `MemoryDataSet` for every missing dataset not in the catalog, right?
d
yes
but it's a jumping off point
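A small sketch of that behaviour, assuming Kedro's `SequentialRunner` (the node functions and dataset names are made up):
```python
from kedro.io import DataCatalog
from kedro.pipeline import Pipeline, node
from kedro.runner import SequentialRunner

# made-up node functions
def make_raw():
    return [1, 2, 3]

def double(xs):
    return [2 * x for x in xs]

pipeline = Pipeline(
    [
        node(make_raw, inputs=None, outputs="raw"),
        node(double, inputs="raw", outputs="doubled"),
    ]
)

# neither "raw" nor "doubled" is registered, so the runner falls back to
# in-memory datasets for them instead of writing anything to disk
catalog = DataCatalog({})
result = SequentialRunner().run(pipeline, catalog)
print(result)  # the unregistered free output "doubled" comes back in the dict
```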
a
gotcha, different pipelines with a subset of shared nodes.
Thank you so much for your rapid help @User! Idk if you want/can move this into advanced-need-help or not, but I thought there was a `DataCatalog.to_yaml` method somewhere, so I thought it was a beginner question lol 😆
d
It's fine here! Now Discord has these fancy threads, I'm not worried. In general dynamic pipelining is an advanced topic, but this is before we even run the thing!
Good luck!