# beginners-need-help
d
Hi @User I'll answer all of your questions in this thread
So regarding your second two questions - we don't encourage users to construct runners outside of the template like this. What are you trying to achieve?
m
I want to run the same pipeline with a number of datasets, like putting a pipeline in a "loop"
d
Okay so you can achieve that without constructing a runner outside of the main template
So we have a pattern called modular pipelines that allows you to instantiate versions of a pipeline but override the inputs/outputs/parameters
Your normal `Pipeline` object (`from kedro.pipeline import Pipeline`) can be overridden with the `pipeline` helper (`from kedro.pipeline.modular_pipeline import pipeline`)
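Roughly like this - a minimal sketch, where the node function and dataset names are just placeholders:
```python
from kedro.pipeline import Pipeline, node
from kedro.pipeline.modular_pipeline import pipeline

def preprocess(raw):
    # placeholder node function for illustration
    return raw

base = Pipeline([node(preprocess, inputs="raw_data", outputs="clean_data")])

# One namespaced instance: unmapped datasets get prefixed ("europe.clean_data"),
# while anything you map through `inputs`/`outputs` is overridden instead
europe = pipeline(base, namespace="europe", inputs={"raw_data": "europe_raw"})
```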
m
Yeah, I've tried that one too, for instance creating 5 modular pipelines. But then I'll need to use namespaces to separate them, if I understand correctly?
d
yes - but you can namespace as part of the loop
and that way we can be sure things are isolated
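e.g. something like this (dataset names made up):
```python
from kedro.pipeline import Pipeline, node
from kedro.pipeline.modular_pipeline import pipeline

def preprocess(raw):
    # placeholder node function for illustration
    return raw

base = Pipeline([node(preprocess, inputs="raw_data", outputs="clean_data")])

# One isolated, namespaced copy of the same pipeline per dataset
datasets = ["companies", "shuttles", "reviews"]
combined = sum(
    (pipeline(base, namespace=name) for name in datasets),
    Pipeline([]),
)
```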
If you look at this viz
and expand the modelling pipeline
you can see we have two instances of the same pipeline
the feature pipeline too
m
Thanks, so far I've been using the YAML files to specify my catalogs, and it seems like I'll need to duplicate these for my namespaces?
d
you can look at the code below
that's how we build it
it's going to be a tutorial, but it's currently work in progress
m
OK, I will have a look - thanks a lot, this help is fantastic! 😆
d
💪 we built Modular Pipelines for this problem specifically - static pipeline definition, dynamic inputs
we're hoping to overhaul the tutorials in the next few weeks, as we currently do a terrible job of showing off how cool it is
Plus the latest version of viz has the collapsible nodes that now show the power of namespacing
and also make it much easier to develop against with `kedro viz --auto-reload`
m
Nice solutions! It will take me some time to fully understand them. But to take it to the extreme: Let's say that I have 100 datasets that should independently go through the same pipeline and I want to save all the outputs from the nodes for each dataset. With the modular pipeline solution I guess that you will have to create 100 namespaced copies of the catalog.yml definitions to get that working?
d
so you can save yourself writing 100 catalog entries by doing a Jinja2 loop
and you can do the same on the Python side if you need to
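For the catalog side, a rough sketch - assuming your config loader renders Jinja2, with made-up dataset names and filepaths:
```yaml
# catalog.yml -- hypothetical Jinja2 loop, one entry per namespace
{% for name in ["companies", "shuttles", "reviews"] %}
{{ name }}.clean_data:
  type: pandas.CSVDataSet
  filepath: data/02_intermediate/{{ name }}_clean.csv
{% endfor %}
```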
I'd have a go with proving the concept for say 3
and then scale
because I think you'll get the hang of namespacing quicker that way
m
I got that working with Jinja2 - very nice indeed! Thank you so much! I'll try to pay it back with some contributions to this project when I get more into it
d
That's wonderful to hear!
Shout if you need any help 🙂 Your feedback on how difficult it was to read about modular pipelines is super useful, and we're keen to make this easier for future people in your position
j
These kinds of discussions are super helpful for other beginners such as myself. Please keep asking and answering questions in public. Thank you both
d
Messages like this make it worth it!
j
What do you mean by "expand the modelling pipeline"?
d
On the left-hand side you can expand the drop-down
j
Which tab?
d
The little chevrons correspond to the namespaces
Which can be nested with the dot syntax
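e.g. (names made up):
```python
from kedro.pipeline import Pipeline, node
from kedro.pipeline.modular_pipeline import pipeline

def fit(features):
    # placeholder node function for illustration
    return features

base = Pipeline([node(fit, inputs="features", outputs="model")])

# Dots in the namespace nest the groups: viz shows "candidate_1"
# collapsed inside "modelling"
nested = pipeline(base, namespace="modelling.candidate_1")
```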
j
Wow
d
Cool, right?! Super excited about this feature - it was only released a couple of weeks back on the viz side
And any modular pipeline can be packaged and shared with other projects
Lots of fun stuff in this space
j
Where do I submit feature requests?
d
GitHub issues please!
j
There's one thing I built 5 years ago that I would like to add in
ok awesome
d
If you’re feeling brave we accept PRs too 👀😂
j
For these runs do you ever visualize the cardinality of the datasets that have been processed so far?
d
Good question
j
Like N=10000 for X, then N=9000 for X_train and N=1000 for X_test
d
There is an argument you could use a `tracking.MetricsDataSet` to do that
Stay tuned for more on that
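Roughly, a sketch - the node and catalog entry names here are made up:
```python
import pandas as pd

# Hypothetical node returning row counts as a dict of floats; you'd register
# its output in the catalog as a tracking.MetricsDataSet, e.g.
#   row_counts:
#     type: tracking.MetricsDataSet
#     filepath: data/09_tracking/row_counts.json
def count_rows(X: pd.DataFrame, X_train: pd.DataFrame, X_test: pd.DataFrame) -> dict:
    return {
        "n_total": float(len(X)),
        "n_train": float(len(X_train)),
        "n_test": float(len(X_test)),
    }
```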
j
Yeah one of the main values I find for pipeline visualizations like these is to know what actually ran and what has the right amount of data
I literally built this exact pipeline a bunch of times for ML
and used graphviz to show the total rows output by each step
and it kept me sane
Because I ran this generic pipeline for dozens of use cases every day
d
So in the demo if you click the 🧪 icon you can see the first cut of our experiment tracking features
And this is actively being worked on
So expect more features to arrive in quick succession
d
Interesting, so you’d like to annotate the flowchart with custom attributes
j
yeah
d
Would those be attributes of the data or the task nodes?
j
And if something hasn't run
Then it would be red with a 0
or something like that
d
Interesting
j
yeah
this was insanely useful
Attributes of the data
d
Please raise as a GitHub issue and we can get a sense from the community if it would be worth prioritising
j
kk
Any guidelines for issues?
d
I think the short term answer is that our experiment tracking features will let you do something close to it
There is an issue template when you select new issue
j
kk excellent
Could you explain how the experiment tracking lets you do this?
Does it provide a table with stats for the intermediate datasets?
d
🙏
You can track what you want
And it will show up in the second tab today
But also on the flow chart soon
I can show you some designs tomorrow when I’m back at my computer