#beginners-need-help
d

datajoely

12/01/2021, 1:19 PM
Hi @User I'll answer all of your questions in this thread
So regarding your second two questions - we don't encourage users to construct runners outside of the template like this. What are you trying to achieve?
m

martinlarsalbert

12/01/2021, 1:21 PM
I want to run the same pipeline with a number of datasets, like putting a pipeline in a "loop"
d

datajoely

12/01/2021, 1:22 PM
Okay so you can achieve that without constructing a runner outside of the main template
So we have a pattern called modular pipelines that allows you to instantiate versions of pipelines but override the inputs/outputs/parameters
Your normal
from kedro.pipeline import Pipeline
object can be overridden with
from kedro.pipeline.modular_pipeline import pipeline
m

martinlarsalbert

12/01/2021, 1:23 PM
Yea I've tried that one also, for instance creating 5 modular pipelines. But then I will need to use namespaces to separate them if I understand correctly?
d

datajoely

12/01/2021, 1:23 PM
yes - but you can namespace as part of the loop
and that way we can be sure things are isolated
If you look at this viz
and expand the modelling pipeline
you can see we have two instances of the same pipeline
the feature pipeline too
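[Editor's sketch] To make "namespace as part of the loop" concrete, here is a plain-Python illustration of the idea, not Kedro's own code. The node layout, the dataset names, and the `with_namespace` helper are all hypothetical. In a real project the `pipeline` wrapper imported above would do this job via its namespace argument. The sketch only shows why prefixed names keep each copy isolated.

```python
from typing import Dict, List

# Hypothetical, simplified node description: function name plus dataset names.
BASE_PIPELINE: List[Dict] = [
    {"func": "clean", "inputs": ["raw"], "outputs": ["cleaned"]},
    {"func": "fit", "inputs": ["cleaned"], "outputs": ["model"]},
]


def with_namespace(nodes: List[Dict], namespace: str) -> List[Dict]:
    """Return a copy of `nodes` with every dataset name prefixed by `namespace.`."""

    def prefix(names: List[str]) -> List[str]:
        return [f"{namespace}.{name}" for name in names]

    return [
        {
            "func": node["func"],
            "inputs": prefix(node["inputs"]),
            "outputs": prefix(node["outputs"]),
        }
        for node in nodes
    ]


# One isolated copy of the same static pipeline per dataset, built in a loop.
namespaces = ["ship_a", "ship_b", "ship_c"]  # hypothetical dataset groups
instances = {ns: with_namespace(BASE_PIPELINE, ns) for ns in namespaces}

for ns, nodes in instances.items():
    for node in nodes:
        print(ns, node["func"], node["inputs"], "->", node["outputs"])
```

Because every dataset name carries its namespace prefix, no two copies read or write the same entry, which is the isolation mentioned above.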
m

martinlarsalbert

12/01/2021, 1:25 PM
Thanks, so far I've been using the yaml files to specify my catalogs and then it seems like I will need to duplicate these for my namespaces?
d

datajoely

12/01/2021, 1:25 PM
you can look at the code below
that's how we build it
it's going to be a tutorial, but it's currently work in progress
m

martinlarsalbert

12/01/2021, 1:26 PM
OK, I will have a look and thanks a lot this help is fantastic! 😆
d

datajoely

12/01/2021, 1:33 PM
💪 we built Modular Pipelines for this problem specifically - static pipeline definition, dynamic inputs
we're hoping to overhaul the tutorials in the next few weeks as we do a terrible job of showing off how cool it is
Plus the latest version of viz has the collapsible nodes that now show the power of namespacing
and also makes it much easier to develop against with
kedro viz --autoreload
m

martinlarsalbert

12/01/2021, 2:36 PM
Nice solutions! It will take me some time to fully understand them. But to take it to the extreme: Let's say that I have 100 datasets that should independently go through the same pipeline and I want to save all the outputs from the nodes for each dataset. With the modular pipeline solution I guess that you will have to create 100 namespaced copies of the catalog.yml definitions to get that working?
d

datajoely

12/01/2021, 2:45 PM
so you can save yourself writing 100 catalog entries by doing a Jinja2 loop
and doing the same on the python side if you need to
I'd have a go with proving the concept for say 3
and then scale
because I think you'll get the hang of namespacing quicker that way
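[Editor's sketch] The "same on the python side" suggestion could look roughly like this for the three-namespace proof of concept. It builds the namespaced catalog entries as a plain dictionary instead of writing them out by hand. The namespace names, dataset names, file paths, and the `catalog_entries` helper are hypothetical, and `pandas.CSVDataSet` is only an assumed example of a dataset type.

```python
from typing import Dict, List


def catalog_entries(
    namespaces: List[str], dataset_names: List[str], base_dir: str = "data"
) -> Dict[str, Dict[str, str]]:
    """Build one catalog entry per (namespace, dataset) pair."""
    entries: Dict[str, Dict[str, str]] = {}
    for ns in namespaces:
        for name in dataset_names:
            entries[f"{ns}.{name}"] = {
                "type": "pandas.CSVDataSet",  # assumed dataset type
                "filepath": f"{base_dir}/{ns}/{name}.csv",  # hypothetical layout
            }
    return entries


# Prove the concept with three namespaces first, then scale the list up.
catalog = catalog_entries(["run_1", "run_2", "run_3"], ["raw", "cleaned"])

# Print the entries in a YAML-like layout for inspection.
for key, entry in catalog.items():
    print(f"{key}:")
    for field, value in entry.items():
        print(f"  {field}: {value}")
```

Going from 3 to 100 datasets then only means changing the list of namespaces, which is the same effect the Jinja2 loop has on the YAML side.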
m

martinlarsalbert

12/01/2021, 3:45 PM
I got that working with Jinja2, very nice indeed! Thank you so much! I will try to pay back with some contributions to this project when I get more into it
d

datajoely

12/01/2021, 3:45 PM
That's wonderful to hear!
Shout if you need any help 🙂 your feedback on how difficult it was to read about modular pipelines is super useful and we're keen to make this easier for future people in your position
j

j c h a r l e s

12/01/2021, 7:32 PM
These kinds of discussions are super helpful for other beginners such as myself. Please keep asking and answering questions in public. Thank you both
d

datajoely

12/01/2021, 7:32 PM
Messages like this make it worth it!
j

j c h a r l e s

12/01/2021, 9:34 PM
What do you mean by "expand the modeling pipeline"?
d

datajoely

12/01/2021, 9:34 PM
On the left hand side you can expand the drop down
j

j c h a r l e s

12/01/2021, 9:34 PM
Which tab?
d

datajoely

12/01/2021, 9:35 PM
The little chevrons correspond to the namespaces
Which can be nested with the dot syntax
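[Editor's sketch] As a rough picture of how dot-separated names can map to nested, collapsible groups, the snippet below turns dotted names into a tree and prints it indented. This is only an illustration in plain Python, not how Kedro-Viz is implemented, and the node names are hypothetical.

```python
from typing import Dict, List


def build_tree(names: List[str]) -> Dict:
    """Nest each dotted name so that every dot opens one level deeper."""
    tree: Dict = {}
    for name in names:
        level = tree
        for part in name.split("."):
            level = level.setdefault(part, {})
    return tree


def render(tree: Dict, depth: int = 0) -> List[str]:
    """Return the tree as indented lines, one level of indent per namespace."""
    lines: List[str] = []
    for key in sorted(tree):
        lines.append("  " * depth + key)
        lines.extend(render(tree[key], depth + 1))
    return lines


# Hypothetical namespaced node names.
names = ["modelling.linear.fit", "modelling.forest.fit", "feature.build"]
print("\n".join(render(build_tree(names))))
```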
j

j c h a r l e s

12/01/2021, 9:36 PM
Wow
d

datajoely

12/01/2021, 9:36 PM
Cool, right?! Super excited about this feature, it was only released a couple of weeks back on the viz side
And any modular pipeline can be packaged and shared with other projects
Lots of fun stuff in this space
j

j c h a r l e s

12/01/2021, 9:38 PM
Where do I submit feature requests?
d

datajoely

12/01/2021, 9:38 PM
GitHub issues please!
j

j c h a r l e s

12/01/2021, 9:38 PM
There's one thing I built 5 years ago that I would like to add in
ok awesome
d

datajoely

12/01/2021, 9:39 PM
If you’re feeling brave we accept PRs too 👀😂
j

j c h a r l e s

12/01/2021, 9:39 PM
For these runs do you ever visualize the cardinality of the datasets that have been processed so far?
d

datajoely

12/01/2021, 9:39 PM
Good question
j

j c h a r l e s

12/01/2021, 9:39 PM
Like N=10000 for the X then N=9000 for X_train, N=1000 for X_test
d

datajoely

12/01/2021, 9:39 PM
There is an argument you could use a tracking.MetricsDataSet to do that
Stay tuned for more on that
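[Editor's sketch] One way to picture the row-count idea is a small function that returns a flat dictionary of numeric values, one per dataset. That is the kind of output a metrics-style dataset could save, assuming it accepts a dictionary of numbers. The `split_rows` and `row_counts` helpers and the 10% test fraction are hypothetical, chosen to reproduce the N=10000 / 9000 / 1000 example.

```python
import random
from typing import Dict, List, Tuple


def split_rows(rows: List[int], test_fraction: float, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle a copy of `rows` and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def row_counts(datasets: Dict[str, List[int]]) -> Dict[str, float]:
    """Return one numeric metric per dataset: its number of rows."""
    return {f"{name}_rows": float(len(rows)) for name, rows in datasets.items()}


X = list(range(10_000))  # stand-in for a real table with 10000 rows
X_train, X_test = split_rows(X, test_fraction=0.1)

metrics = row_counts({"X": X, "X_train": X_train, "X_test": X_test})
print(metrics)
```

In a pipeline, a function like `row_counts` could be an ordinary node whose output is written to a tracking dataset, so the counts are recorded on every run.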
j

j c h a r l e s

12/01/2021, 9:40 PM
Yeah one of the main values I find for pipeline visualizations like these is to know what actually ran and what has the right amount of data
I literally built this exact pipeline a bunch of times for ML
and used graphviz to show the total rows output by each step
and it kept me sane
Because I ran this generic pipeline for dozens of use cases every day
d

datajoely

12/01/2021, 9:41 PM
So in the demo if you click the 🧪 icon you can see the first cut of our experiment tracking features
And this is actively being worked on
So expect more features to arrive in quick succession
d

datajoely

12/01/2021, 9:45 PM
Interesting so you’d like to annotate the flowchart with custom attributes
j

j c h a r l e s

12/01/2021, 9:45 PM
yeah
d

datajoely

12/01/2021, 9:45 PM
Would those be attributes of the data or task nodes
j

j c h a r l e s

12/01/2021, 9:45 PM
And if something hasn't run
Then it would be red with a 0
or something like that
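[Editor's sketch] The annotation described here (row counts on each dataset, red with a 0 when nothing has been produced) can be approximated by generating Graphviz DOT text by hand. This is a guess at the general approach, not the original tool and not a Kedro-Viz feature. The edge list, the counts, and the `to_dot` helper are hypothetical.

```python
from typing import Dict, List, Tuple


def to_dot(edges: List[Tuple[str, str]], row_counts: Dict[str, int]) -> str:
    """Render a DOT graph whose nodes are labelled with their row counts.

    Datasets without a recorded count are treated as not yet produced:
    they get N=0 and are coloured red.
    """
    lines = ["digraph pipeline {"]
    nodes = sorted({name for edge in edges for name in edge})
    for name in nodes:
        count = row_counts.get(name, 0)
        color = "red" if count == 0 else "black"
        # "\\n" emits a literal backslash-n, which DOT reads as a line break.
        lines.append(f'  "{name}" [label="{name}\\nN={count}", color={color}];')
    for src, dst in edges:
        lines.append(f'  "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)


edges = [("X", "X_train"), ("X", "X_test"), ("X_test", "predictions")]
counts = {"X": 10000, "X_train": 9000, "X_test": 1000}  # "predictions" has not run

dot = to_dot(edges, counts)
print(dot)
```

The resulting text can be passed to any Graphviz renderer, and a dataset that never materialised stands out immediately.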
d

datajoely

12/01/2021, 9:46 PM
Interesting
j

j c h a r l e s

12/01/2021, 9:46 PM
yeah
this was insanely useful
Attributes of the data
d

datajoely

12/01/2021, 9:46 PM
Please raise as a GitHub issue and we can get a sense from the community if it would be worth prioritising
j

j c h a r l e s

12/01/2021, 9:46 PM
kk
Any guidelines for issues?
d

datajoely

12/01/2021, 9:47 PM
I think the short term answer is that our experiment tracking features will let you do something close to it
There is an issue template when you select new issue
j

j c h a r l e s

12/01/2021, 9:47 PM
kk excellent
Could you explain how the experiment tracking lets you do this?
Does it provide a table with stats for the intermediate datasets?
d

datajoely

12/01/2021, 9:48 PM
🙏
You can track what you want
And it will show up in the second tab today
But also on the flow chart soon
I can show you some designs tomorrow when I’m back at my computer