# advanced-need-help
f
Hello, I'm trying to implement a few pipelines in Kedro 0.17.7 that have a lot of inputs of moderate complexity. Roughly summarised, it amounts to reading a few dozen sheets from a few hundred Excel spreadsheets. To do so, I'm using `PartitionedDataSet`s with `pandas.ExcelDataSet` and specifying `load_args` such as `sheet_name`, `names` and `dtype`. It works like a charm, but I'm worried about the size of `catalog/ingest.yml`. I've been searching for a way to split that catalog YAML into a few files, maybe along business-oriented segments, but I've had no luck with it. Is there an intended way to do such a thing? If there's no intended way implemented, I've been thinking (haven't really tried, though) of messing with `register_catalog` on the `ProjectHooks` class. Am I making any sense? Thanks!
d
Hi @FelicioV this is very doable
I'm not at my computer at the moment but will come back to this later if that's okay!
f
That's perfect. Thanks for your time.
d
Hi @User - so you are able to split the catalog into many files
if you look at this sample project that's what I've done here https://github.com/datajoely/modular-spaceflights/tree/main/conf/base
we actually look for the glob patterns `catalog*` and `catalog*/**`
so we will pick up any file prefixed with `catalog`, or living (recursively) within a folder with that prefix
so you can split into as many files as you want and they're treated as one catalog at runtime
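So a layout along these lines would be picked up as a single catalog — the file names here are illustrative, not prescribed:

```text
conf/base/
├── catalog_raw.yml        # matched by catalog*
├── catalog/
│   ├── ingestion.yml      # matched by catalog*/**
│   └── reporting.yml
└── parameters.yml
```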
there are also a few techniques we use to cut down repetition - but they do have tradeoffs in terms of readability
(1) You can use YAML anchors to re-use fragments within the same file
(2) You can use TemplatedConfigLoader to stop repeating yourself across files
(3) You can use Jinja to write loops in YAML (I'm not a fan of this, because I think it damages readability and makes things hard to maintain, but it's there)
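As a sketch of technique (1): Kedro ignores top-level catalog keys that start with an underscore, so you can park a shared fragment under one and merge it into real entries with a YAML anchor. Dataset names and paths below are invented:

```yaml
# Shared fragment; the leading underscore keeps it out of the catalog
_excel_defaults: &excel_defaults
  type: pandas.ExcelDataSet
  load_args:
    sheet_name: 0

sales_raw:
  <<: *excel_defaults            # merge the shared keys
  filepath: data/01_raw/sales.xlsx

costs_raw:
  <<: *excel_defaults
  filepath: data/01_raw/costs.xlsx
```

Note that anchors only work within a single file, which is one reason TemplatedConfigLoader exists for cross-file reuse.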
f
Awesome! I'll dig into it. I've tried nesting it in a folder once; got the idea from this snippet on the 01_data_catalog page of the docs
# <conf_root>/<env>/catalog/<pipeline_name>.yml
rockets:
  type: MemoryDataSet
scooters:
  type: MemoryDataSet
d
I tend to split things into phase-specific areas
but also keep in mind that sometimes that's a premature optimisation
during development I tend to persist everything at every stage so I can pick up where I left off
but once I'm done I delete all the catalog entries in the middle
so you often only need to persist the very first inputs and the very last outputs
f
Most of my catalog can, and will, be configured as MemoryDataSets. I'll just persist some cloud steps, since the internet connection isn't as reliable as I'd wish in my area.
d
so you don't need to declare memory datasets at all
if they are outputs of a node, the names are addressable in your pipeline logic
and can be referenced at runtime, but don't persist after the run finishes
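Putting both points together, a pared-down catalog only declares the endpoints; every intermediate name used in the pipeline is auto-created as a MemoryDataSet and discarded when the run ends. All names below are hypothetical:

```yaml
# Only the endpoints get entries; an intermediate such as
# "preprocessed_sheets" needs no entry - Kedro backs it with a
# MemoryDataSet automatically.
raw_sheets:
  type: PartitionedDataSet
  path: data/01_raw/sheets/
  dataset: pandas.ExcelDataSet

final_report:
  type: pandas.CSVDataSet
  filepath: data/08_reporting/final_report.csv
```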
f
Got this tip from you on the LinkedIn Live last week
d
💪
f
I understand. I'll do it for now to have more examples, and later in the implementation I'll comment most of it out. I hope it will be an instructive moment for the team. Again, thanks for your time.
d
No worries - happy to review any code snippets if helpful