# beginners-need-help
u
Hi, I am very new to this. I might be missing something, but it seems we can only input raw data file by file: in the catalog, each entry seems to point to a single file. However, my raw data is an entire directory from which I need to load files individually. Is there a way to pass a directory into a pipeline as input, instead of specific catalog entries that each refer to one file? Can we have a directory in the data catalog instead of a file? Sorry if this seems completely obvious...
d
Hi @User - if I understand correctly, you have many files which need to be unioned together?
Our Spark datasets do this by default; you can use a `*` character in the filepath if you want.
Are all the files of a predictable naming convention?
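d
For example, a catalog entry along these lines picks up every matching file in the directory (the entry name, path, and filename pattern here are illustrative, not from the thread):

```yaml
# Hypothetical catalog.yml entry: the wildcard in filepath lets the
# Spark dataset read all matching files in the directory at once.
raw_measurements:
  type: spark.SparkDataSet
  filepath: data/01_raw/measurements_*.csv
  file_format: csv
  load_args:
    header: true
```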
u
Yes, that is correct. There is some information in the filename (e.g. the date of the data collection) and the files all follow a predictable naming convention. I am currently using the file name to choose which files I want to load (for example, between two given dates). I want to try using Kedro for my analysis and still be able to specify such things.
d
Okay, so there are two ways to do this:
1. We support Jinja2 in YAML, so if you want to autogenerate the catalog entries you can use a loop and essentially replicate the catalog definition for every dataset.
2. You could define a custom dataset that uses something like `glob` to find all files in a directory and combine them together.
u
I'll look into this thank you!
u
Another question: I can't seem to load my custom dataset. I am following https://kedro.readthedocs.io/en/stable/07_extend_kedro/03_custom_datasets.html and when I load it in the Python console with context.catalog.load I get "name 'context' is not defined". What do you use to load a custom dataset in the Python console?
d
Is this after running `kedro ipython`?
u
Oh, that's it, sorry. I was just running ipython on its own.
u
Thanks!
d
👌