# beginners-need-help
u
Hi, I am very new to this. I might be missing something, but it seems we can only input raw data file by file: in the catalog, each entry seems to point to a single file. However, my raw data is an entire directory from which I need to load files individually. Is there a way to pass a directory into a pipeline as input, instead of specific catalog entries that each refer to one file? Can we have a directory in the data catalog instead of a file? Sorry if this seems completely obvious...
d
Hi @User - if I understand correctly, you have many files which need to be unioned together?
Our Spark datasets do this by default; you can use a `*` character in the filepath if you want.
Are all the files of a predictable naming convention?
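d
For example, a catalog entry along these lines picks up every matching file in the directory (the entry name, path, and filename pattern here are illustrative, not from the thread):

```yaml
# Hypothetical catalog.yml entry: the wildcard in filepath lets the
# Spark dataset read all matching files in the directory at once.
raw_measurements:
  type: spark.SparkDataSet
  filepath: data/01_raw/measurements_*.csv
  file_format: csv
  load_args:
    header: true
```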
u
Yes, that is correct. There is some information in the filename (e.g. the date of the data collection) and the files all follow a predictable naming convention. I am currently using the file name to choose which files I want to load (for example, between two given dates). I want to try using Kedro for my analysis and still be able to specify such things.
d
Okay, so there are two ways to do this:
1. We support Jinja2 in YAML, so if you want to autogenerate the catalog entries you can use a loop and essentially replicate the catalog definition for every dataset.
2. You could define a custom dataset that uses something like `glob` to find all files in a directory and combine them together.
u
I'll look into this thank you!
u
Another question: I can't seem to load my custom dataset. I am following https://kedro.readthedocs.io/en/stable/07_extend_kedro/03_custom_datasets.html and when I load it in the Python console with context.catalog.load I get "name 'context' is not defined". What do you use to load a custom dataset in the Python console?
d
Is this after running `kedro ipython`?
u
Oh, that's it, sorry. I was just running ipython on its own.
u
Thanks!
d
👌