anhoang
08/19/2021, 5:40 PM
file_A, file_B, file_C). I want the folder that this pipeline runs in to have its own dynamically generated data catalog, so other people can go in and inspect the results from the pipeline easily. Taking the example from https://kedro.readthedocs.io/en/latest/05_data/01_data_catalog.html#configuring-a-data-catalog , is it possible to do this:
```python
from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import (
    CSVDataSet,
    ParquetDataSet,
    SQLQueryDataSet,
    SQLTableDataSet,
)

io = DataCatalog(
    {
        "bikes": CSVDataSet(filepath="../data/01_raw/bikes.csv"),
        "cars": CSVDataSet(filepath="../data/01_raw/cars.csv", load_args=dict(sep=",")),
        "cars_table": SQLTableDataSet(
            table_name="cars", credentials=dict(con="sqlite:///kedro.db")
        ),
        "scooters_query": SQLQueryDataSet(
            sql="select * from cars where gear=4",
            credentials=dict(con="sqlite:///kedro.db"),
        ),
        "ranked": ParquetDataSet(filepath="ranked.parquet"),
    }
)
```
and then do io.to_config()? We have io.from_config() but not io.to_config() to generate a YAML file from the DataCatalog object.

datajoely
08/19/2021, 5:41 PM

anhoang
08/19/2021, 5:49 PM
bikes, cars, cars_table, scooters_query and ranked to YAML from python

datajoely
08/19/2021, 5:51 PM
DataCatalog object you've created directly

anhoang
08/19/2021, 5:57 PM
param1_X_param2_Y with files [bikes.csv, cars.csv, etc] and a data catalog that documents these datasets. Another folder param1_A_param2_B with the same set of files, but the content of the files is different.
This way, another person can go into a folder and explore these datasets for each parameter combination and do subsequent analyses without worrying about the filepaths. They can run kedro jupyter, or initialize a DataCatalog object that points to the folder param1_X_param2_Y
and load one set of files

datajoely
08/19/2021, 5:59 PM

anhoang
08/19/2021, 6:00 PM
DataCatalog pointing to param1_A_param2_B will load same dataset names but different files

datajoely
08/19/2021, 6:00 PM

anhoang
08/19/2021, 6:00 PM

datajoely
08/19/2021, 6:00 PM
export KEDRO_ENV=test
base and local
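[Editor's note: datajoely is pointing at Kedro's configuration environments here. conf/base/ is always loaded, and the environment named by KEDRO_ENV (local by default) is layered on top of it. A sketch of what the two catalog files might look like; the filenames and entries below are illustrative, not taken from the thread:]

```yaml
# conf/base/catalog.yml -- entries shared by every environment
bikes:
  type: pandas.CSVDataSet
  filepath: data/01_raw/bikes.csv

# conf/test/catalog.yml -- layered on top of base when KEDRO_ENV=test;
# the same dataset name can point at a different file here
bikes:
  type: pandas.CSVDataSet
  filepath: data/test/bikes_sample.csv
```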
anhoang
08/19/2021, 6:04 PM
base and local, for example when you work in the cloud you need to output additional files? I thought that the number of datasets and what they are have to be exactly the same in every environment

datajoely
08/19/2021, 6:04 PM

anhoang
08/19/2021, 6:04 PM
DataCatalog in the example above into YAML?

datajoely
08/19/2021, 6:05 PM
to_yaml mechanism

anhoang
08/19/2021, 6:06 PM

datajoely
08/19/2021, 6:06 PM

anhoang
08/19/2021, 6:09 PM
A, B, C when run in environment A but outputs datasets A, B, D, E when run in environment B?

datajoely
08/19/2021, 6:09 PM

anhoang
08/19/2021, 6:10 PM
MemoryDataset for every missing dataset not in the catalog, right?

datajoely
08/19/2021, 6:10 PM

anhoang
08/19/2021, 6:11 PM
DataCatalog.to_yaml somewhere so thought it was a beginner question lol 😆

datajoely
08/19/2021, 6:14 PM
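[Editor's note: on the MemoryDataset point above — Kedro backs any dataset name that a pipeline produces but that is not registered in the catalog with an in-memory dataset for the duration of the run. The class below is a simplified toy stand-in, not Kedro's own kedro.io.MemoryDataSet, just to show the fallback behaviour:]

```python
# Toy illustration of the fallback discussed above: dataset names
# missing from the catalog get an in-memory store for the run.
# This MemoryDataSet is a simplified stand-in, not Kedro's class.

class MemoryDataSet:
    """Holds a single value in memory for the duration of a run."""

    def __init__(self):
        self._value = None

    def save(self, value):
        self._value = value

    def load(self):
        return self._value


# A real catalog would map names to configured datasets (CSVDataSet etc.)
catalog = {}


def get_dataset(name):
    """Return the registered dataset, falling back to memory if absent."""
    return catalog.setdefault(name, MemoryDataSet())


ds = get_dataset("intermediate_table")  # not registered anywhere
ds.save([1, 2, 3])
print(ds.load())
```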
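[Editor's note: since the thread's main ask — a DataCatalog.to_config / to_yaml — did not exist at the time, here is a minimal hand-rolled sketch of that direction: rendering {dataset_name: config} mappings as catalog.yml-style text. The entries below are illustrative; this dict shape is the same one DataCatalog.from_config consumes. How you would extract such a dict from a live DataCatalog is left open, since that would rely on private attributes:]

```python
# Hand-rolled sketch of the missing "to_config"/"to_yaml" direction:
# render nested {dataset_name: config} dicts as catalog.yml-style text.
# Stdlib only; the example entries are illustrative.

def entries_to_catalog_yaml(entries: dict) -> str:
    """Render {name: {key: value}} mappings as YAML-like text."""
    lines = []
    for name, conf in entries.items():
        lines.append(f"{name}:")
        for key, value in conf.items():
            if isinstance(value, dict):  # e.g. load_args
                lines.append(f"  {key}:")
                for sub_key, sub_value in value.items():
                    lines.append(f"    {sub_key}: {sub_value}")
            else:
                lines.append(f"  {key}: {value}")
    return "\n".join(lines) + "\n"


entries = {
    "bikes": {"type": "pandas.CSVDataSet", "filepath": "../data/01_raw/bikes.csv"},
    "cars": {
        "type": "pandas.CSVDataSet",
        "filepath": "../data/01_raw/cars.csv",
        "load_args": {"sep": ","},
    },
}

print(entries_to_catalog_yaml(entries))
```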