# advanced-need-help
f
Hello, I'm trying to implement a few pipelines in Kedro 0.17.7 that have a lot of inputs of moderate complexity. Roughly summarised, it amounts to reading a few dozen sheets from a few hundred Excel spreadsheets. To do so, I'm using `PartitionedDataSet`s with `pandas.ExcelDataSet` and specifying `load_args` such as `sheet_name`, `names` and `dtype`. It works like a charm, but I'm worried about the size of `catalog/ingest.yml`. I've been searching for a way to split that catalog YAML into a few files, maybe along business-oriented segments, but I've had no luck with it. Is there an intended way to do such a thing? If there's no intended way implemented, I've been thinking (haven't really tried, though) of messing with `register_catalog` on the `ProjectHooks` class. Am I making any sense? Thanks!
d
Hi @FelicioV this is very doable
I'm not at my computer at the moment but will come back to this later if that's okay!
f
That's perfect. Thanks for your time.
d
Hi @User - so you are able to split the catalog into many files
if you look at this sample project that's what I've done here https://github.com/datajoely/modular-spaceflights/tree/main/conf/base
we actually look for the glob patterns `catalog*` and `catalog*/**`
so we will pick up any file prefixed with `catalog`, or living (recursively) within a folder with that prefix
so you can split into as many files as you want and they're treated as one catalog at runtime
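So a layout along these lines would be picked up as a single catalog — the file names here are illustrative, not prescribed:

```text
conf/base/
├── catalog_raw.yml        # matched by catalog*
├── catalog/
│   ├── ingestion.yml      # matched by catalog*/**
│   └── reporting.yml
└── parameters.yml
```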
there are also a few techniques we use to cut down repetition - but they do have tradeoffs in terms of readability
(1) You can use YAML anchors to re-use fragments within the same file
(2) You can use TemplatedConfigLoader to stop repeating yourself across files
(3) You can use Jinja to write loops in YAML (I'm not a fan of this, because I think it damages readability and makes things hard to maintain, but it's there)
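As a sketch of technique (1): Kedro ignores top-level catalog keys that start with an underscore, so you can park a shared fragment under one and merge it into real entries with a YAML anchor. Dataset names and paths below are invented:

```yaml
# Shared fragment; the leading underscore keeps it out of the catalog
_excel_defaults: &excel_defaults
  type: pandas.ExcelDataSet
  load_args:
    sheet_name: 0

sales_raw:
  <<: *excel_defaults            # merge the shared keys
  filepath: data/01_raw/sales.xlsx

costs_raw:
  <<: *excel_defaults
  filepath: data/01_raw/costs.xlsx
```

Note that anchors only work within a single file, which is one reason TemplatedConfigLoader exists for cross-file reuse.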
f
Awesome! I'll dig into it. I've tried nesting it in a folder once; got the idea from this snippet on the 01_data_catalog page of the docs
# <conf_root>/<env>/catalog/<pipeline_name>.yml
rockets:
  type: MemoryDataSet
scooters:
  type: MemoryDataSet
d
I tend to split things into phase-specific areas
but also keep in mind that sometimes that's a premature optimisation
during development I tend to persist everything at every stage so I can pick up where I left off
but once I'm done I delete all the catalog entries in the middle
so you often only need to persist the very first inputs and the very last outputs
f
Most of my catalog can, and will, be configured as MemoryDataSets. I'll just persist some cloud steps, since the internet connection isn't as reliable as I'd wish in my area.
d
so you don't need to declare memory datasets at all
if they are outputs of a node, the names are addressable in your pipeline logic
and can be referenced at runtime, but don't persist after the run finishes
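Putting both points together, a pared-down catalog only declares the endpoints; every intermediate name used in the pipeline is auto-created as a MemoryDataSet and discarded when the run ends. All names below are hypothetical:

```yaml
# Only the endpoints get entries; an intermediate such as
# "preprocessed_sheets" needs no entry - Kedro backs it with a
# MemoryDataSet automatically.
raw_sheets:
  type: PartitionedDataSet
  path: data/01_raw/sheets/
  dataset: pandas.ExcelDataSet

final_report:
  type: pandas.CSVDataSet
  filepath: data/08_reporting/final_report.csv
```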
f
Got this tip from you on the LinkedIn Live last week
d
💪
f
I understand. I'll do it for now to have more examples, and later in the implementation I'll comment most of it out. I hope it will be an instructive moment for the team. Again, thanks for your time.
d
No worries - happy to review any code snippets if helpful