# beginners-need-help
d
Hi @User I'll answer all of your questions in this thread
So regarding your second two questions - we don't encourage users to construct runners outside of the template like this. What are you trying to achieve?
m
I want to run the same pipeline with a number of datasets, like putting a pipeline in a "loop"
d
Okay so you can achieve that without constructing a runner outside of the main template
So we have a pattern called modular pipelines that allows you to instantiate versions of a pipeline but override the inputs/outputs/parameters
Your normal `Pipeline` object (`from kedro.pipeline import Pipeline`) can be overridden with the `pipeline` helper (`from kedro.pipeline.modular_pipeline import pipeline`)
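Roughly like this - a minimal sketch, where the node function and dataset names are just placeholders:
```python
from kedro.pipeline import Pipeline, node
from kedro.pipeline.modular_pipeline import pipeline

def preprocess(raw):
    # placeholder node function for illustration
    return raw

base = Pipeline([node(preprocess, inputs="raw_data", outputs="clean_data")])

# One namespaced instance: unmapped datasets get prefixed ("europe.clean_data"),
# while anything you map through `inputs`/`outputs` is overridden instead
europe = pipeline(base, namespace="europe", inputs={"raw_data": "europe_raw"})
```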
m
Yeah, I've tried that one too, for instance creating 5 modular pipelines. But then I'll need to use namespaces to separate them, if I understand correctly?
d
yes - but you can namespace as part of the loop
and that way we can be sure things are isolated
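e.g. something like this (dataset names made up):
```python
from kedro.pipeline import Pipeline, node
from kedro.pipeline.modular_pipeline import pipeline

def preprocess(raw):
    # placeholder node function for illustration
    return raw

base = Pipeline([node(preprocess, inputs="raw_data", outputs="clean_data")])

# One isolated, namespaced copy of the same pipeline per dataset
datasets = ["companies", "shuttles", "reviews"]
combined = sum(
    (pipeline(base, namespace=name) for name in datasets),
    Pipeline([]),
)
```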
If you look at this viz
and expand the modelling pipeline
you can see we have two instances of the same pipeline
the feature pipeline too
m
Thanks, so far I've been using the YAML files to specify my catalogs, and it seems like I'll need to duplicate these for my namespaces?
d
you can look at the code below
that's how we build it
it's going to be a tutorial, but it's currently work in progress
m
OK, I will have a look - thanks a lot, this help is fantastic! 😆
d
💪 we built Modular Pipelines for this problem specifically - static pipeline definition, dynamic inputs
we're hoping to overhaul the tutorials in the next few weeks, as we currently do a terrible job of showing off how cool it is
Plus the latest version of viz has the collapsible nodes that now show the power of namespacing
and also make it much easier to develop against with `kedro viz --auto-reload`
m
Nice solutions! It will take me some time to fully understand them. But to take it to the extreme: Let's say that I have 100 datasets that should independently go through the same pipeline and I want to save all the outputs from the nodes for each dataset. With the modular pipeline solution I guess that you will have to create 100 namespaced copies of the catalog.yml definitions to get that working?
d
so you can save yourself writing 100 catalog entries by doing a Jinja2 loop
and you can do the same on the Python side if you need to
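For the catalog side, a rough sketch - assuming your config loader renders Jinja2, with made-up dataset names and filepaths:
```yaml
# catalog.yml -- hypothetical Jinja2 loop, one entry per namespace
{% for name in ["companies", "shuttles", "reviews"] %}
{{ name }}.clean_data:
  type: pandas.CSVDataSet
  filepath: data/02_intermediate/{{ name }}_clean.csv
{% endfor %}
```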
I'd have a go with proving the concept for say 3
and then scale
because I think you'll get the hang of namespacing quicker that way
m
I got that working with Jinja2 - very nice indeed! Thank you so much! I'll try to pay it back with some contributions to this project when I get more into it
d
That's wonderful to hear!
Shout if you need any help 🙂 Your feedback on how difficult it was to read about modular pipelines is super useful, and we're keen to make this easier for future people in your position
j
These kinds of discussions are super helpful for other beginners such as myself. Please keep asking and answering questions in public. Thank you both
d
Messages like this make it worth it!
j
What do you mean by "expand the modelling pipeline"?
d
On the left-hand side you can expand the drop-down
j
Which tab?
d
The little chevrons correspond to the namespaces
Which can be nested with the dot syntax
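e.g. (names made up):
```python
from kedro.pipeline import Pipeline, node
from kedro.pipeline.modular_pipeline import pipeline

def fit(features):
    # placeholder node function for illustration
    return features

base = Pipeline([node(fit, inputs="features", outputs="model")])

# Dots in the namespace nest the groups: viz shows "candidate_1"
# collapsed inside "modelling"
nested = pipeline(base, namespace="modelling.candidate_1")
```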
j
Wow
d
Cool, right?! Super excited about this feature - it was only released a couple of weeks back on the viz side
And any modular pipeline can be packaged and shared with other projects
Lots of fun stuff in this space
j
Where do I submit feature requests?
d
GitHub issues please!
j
There's one thing I built 5 years ago that I would like to add in
ok awesome
d
If you’re feeling brave we accept PRs too 👀😂
j
For these runs do you ever visualize the cardinality of the datasets that have been processed so far?
d
Good question
j
Like N=10000 for X, then N=9000 for X_train and N=1000 for X_test
d
There is an argument you could use a `tracking.MetricsDataSet` to do that
Stay tuned for more on that
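Roughly, a sketch - the node and catalog entry names here are made up:
```python
import pandas as pd

# Hypothetical node returning row counts as a dict of floats; you'd register
# its output in the catalog as a tracking.MetricsDataSet, e.g.
#   row_counts:
#     type: tracking.MetricsDataSet
#     filepath: data/09_tracking/row_counts.json
def count_rows(X: pd.DataFrame, X_train: pd.DataFrame, X_test: pd.DataFrame) -> dict:
    return {
        "n_total": float(len(X)),
        "n_train": float(len(X_train)),
        "n_test": float(len(X_test)),
    }
```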
j
Yeah one of the main values I find for pipeline visualizations like these is to know what actually ran and what has the right amount of data
I literally built this exact pipeline a bunch of times for ML
and used graphviz to show the total rows output by each step
and it kept me sane
Because I ran this generic pipeline for dozens of use cases every day
d
So in the demo if you click the 🧪 icon you can see the first cut of our experiment tracking features
And this is actively being worked on
So expect more features to arrive in quick succession
d
Interesting, so you’d like to annotate the flowchart with custom attributes
j
yeah
d
Would those be attributes of the data or the task nodes?
j
And if something hasn't run
Then it would be red with a 0
or something like that
d
Interesting
j
yeah
this was insanely useful
Attributes of the data
d
Please raise as a GitHub issue and we can get a sense from the community if it would be worth prioritising
j
kk
Any guidelines for issues?
d
I think the short term answer is that our experiment tracking features will let you do something close to it
There is an issue template when you select new issue
j
kk excellent
Could you explain how the experiment tracking lets you do this?
Does it provide a table with stats for the intermediate datasets?
d
🙏
You can track what you want
And it will show up in the second tab today
But also on the flow chart soon
I can show you some designs tomorrow when I’m back at my computer