
metalmind

01/05/2022, 5:21 PM
As far as I know, this is a tool for manual labeling. What I'm looking for is a script with one function used for labeling and another for feature generation.

datajoely

01/05/2022, 5:25 PM
Hello, I've created a thread to isolate the conversation
copying your questions here:
> Data can be loaded via configuration, right? I want to use Kedro during the experiment/R&D phase, so I need to be able to mix and match raw data + feature generator + label generator + parameters. Say I have 10 raw data files, 10 feature generation functions (source files), etc. I want to try Raw Data 1 + Featurizer 2 + Labeler 5 + Parameter Set 6 (1/2/5/6). Then 3/4/1/7, etc. All while letting MLflow record the results.
> So not only can raw data be configured and loaded, but also the scripts (functions) used to generate the features and labels.
So this is an interesting topic
as you'll need to set up your pipelines to work in quite a dynamic fashion, which we're not necessarily fans of, as we feel it makes things quite hard to read/maintain/debug
but it is possible
So I think you're going to get creative with your run args
so you can inject using the technique above
> --config may be used as a starting point, but how do I change the function in a pipeline node from it? I have an idea already but wanted first to check if there's a way to do it out of the box.

metalmind

01/05/2022, 5:29 PM
Thank you for the thread.

datajoely

01/05/2022, 5:30 PM
If you look at this sample project
you'll see
we load a different sklearn regressor object dynamically based on parameters
and then parametrise the class path
I think the same approach can be applied
You can actually follow along with that sample project, as it then instantiates the modelling twice
It's an advanced use of Kedro but I think it's similar to what you're trying to achieve
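Roughly, the node does something like this (a sketch from memory, not the exact project code; the parameter names are illustrative):

    import importlib
    from typing import Any, Dict

    def create_model(model_params: Dict[str, Any]):
        # model_params might look like:
        #   {"module": "sklearn.linear_model",
        #    "class": "LinearRegression",
        #    "kwargs": {"fit_intercept": True}}
        module = importlib.import_module(model_params["module"])
        model_class = getattr(module, model_params["class"])
        return model_class(**model_params.get("kwargs", {}))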

metalmind

01/05/2022, 5:35 PM
You used this for creating the model, right? I thought of doing the same, or creating the model in a notebook, exporting it as YAML, and then letting a node load it as a YamlDataset and create the model from it. So I'd avoid having extra scripts.

datajoely

01/05/2022, 5:35 PM
Yes - it's a trivial model so I appreciate it may not be super representative

metalmind

01/05/2022, 5:39 PM
Let me know your idea about handling the scripts differently. I'd create a new dataset called ScriptDataSet or PythonDataSet or FunctionDataset, define its params in the data catalog, and create 2 new folders, 09_featurizers and 10_modelers, where the scripts are stored. Then give the FunctionDataset a Python source file name to load from there, plus an optional function name, or it loads the first function from the file. Then use that dataset in one of the pipeline nodes.
Unless there's something like that ready in Kedro.
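Roughly this sketch - FunctionDataSet is hypothetical, not an existing Kedro dataset, and I'm assuming the Kedro 0.17-era AbstractDataSet interface:

    import importlib.util
    import inspect
    from pathlib import Path
    from typing import Optional

    from kedro.io import AbstractDataSet

    class FunctionDataSet(AbstractDataSet):
        """Hypothetical dataset that loads a function from a .py source file."""

        def __init__(self, filepath: str, function_name: Optional[str] = None):
            self._filepath = Path(filepath)
            self._function_name = function_name

        def _load(self):
            spec = importlib.util.spec_from_file_location(
                self._filepath.stem, self._filepath
            )
            module = importlib.util.module_from_spec(spec)
            spec.loader.exec_module(module)
            if self._function_name:
                return getattr(module, self._function_name)
            # otherwise fall back to the first function defined in the file itself
            functions = [
                fn
                for _, fn in inspect.getmembers(module, inspect.isfunction)
                if fn.__module__ == module.__name__
            ]
            return functions[0]

        def _save(self, data):
            raise NotImplementedError("FunctionDataSet is read-only")

        def _describe(self):
            return dict(filepath=self._filepath, function_name=self._function_name)

and a catalog entry pointing at one of the new folders, e.g.:

    featurizer_2:
      type: my_project.extras.datasets.FunctionDataSet
      filepath: data/09_featurizers/featurizer_2.py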

datajoely

01/05/2022, 5:39 PM
:S
I'm not sure - my first reaction is that it could get a bit messy and feels like a security risk
we have plans to let the user enable 'unsafe' yaml in the future, but it's not available yet
this would allow you to reference a Python module with the special syntax at the bottom
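for illustration, with plain PyYAML that syntax looks something like this (not wired into Kedro's config loader yet, and UnsafeLoader will execute arbitrary code referenced from config - hence the security concern):

    import yaml

    text = "model_class: !!python/name:sklearn.linear_model.LinearRegression"
    # python/* tags only resolve with the unsafe loader
    cfg = yaml.load(text, Loader=yaml.UnsafeLoader)
    model = cfg["model_class"]()  # instantiates LinearRegression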

metalmind

01/05/2022, 5:41 PM
It could get messy, yes, but what's the alternative? I could have like 10 different feature generators and similar label generators and need to be able to mix and match during experimentation.

datajoely

01/05/2022, 5:42 PM
Is the desired output just the combinatorial product of all of them?
Itertools in the standard library may be helpful https://docs.python.org/3/library/itertools.html#itertools.combinations
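For the full grid over your four element types, itertools.product is probably the closer fit - a quick sketch with made-up names:

    from itertools import product

    raw_data = ["raw_1", "raw_2", "raw_3"]
    featurizers = ["featurizer_1", "featurizer_2"]
    labelers = ["labeler_1", "labeler_2"]
    param_sets = ["params_1", "params_2"]

    # every (raw, featurizer, labeler, params) combination: 3*2*2*2 = 24 runs
    for raw, feat, label, params in product(raw_data, featurizers, labelers, param_sets):
        print(raw, feat, label, params)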

metalmind

01/05/2022, 5:43 PM
Yes, each time single raw data + single featurizer + single... etc
oh, not automated or grid-search-like, this is manual experimentation based on the results.

datajoely

01/05/2022, 5:44 PM
ok
so each pipeline run is:
- fixed input data
- parameterised featurizer
- parameterised labeler

metalmind

01/05/2022, 5:46 PM
Input raw data would be parameterized too. I thought each iteration/run would be a YAML file pointing to raw data + featurizer + labelers + YAML parameters.
Passed as
kedro run --config...
or something.

datajoely

01/05/2022, 5:46 PM
Unfortunately I think you may be breaking too many of Kedro's assumptions
let me think about this more
but I think there is a chance it may not be the right tool for this approach

metalmind

01/05/2022, 5:49 PM
You meant Kedro as a whole?

datajoely

01/05/2022, 5:49 PM
potentially
I'm prototyping something currently
regarding the dynamic input data - would it be defined in the catalog?

metalmind

01/05/2022, 5:50 PM
I think Kedro is suited more to come after a developed model, not to be used during experimentation.
Yes, as the standard method. And I'm thinking of using different catalogs or finding a way to replace the data with each run.

datajoely

01/05/2022, 5:51 PM
Okay I may have something
give me 10 mins to make a dummy project

metalmind

01/05/2022, 5:53 PM
My main target now is to find a way to run R&D on a matrix of the 4 mentioned elements as mix and match, while recording the results in MLflow, to find the best model and then use the same feature generation function in inference.
Thank you.
Actually I've started to feel that Kedro is overkill for my needs. I have only a fixed 4-step pipeline. The input data, functions, and parameters are what change, so if I can put them all in a repository and feed them to my pipeline via YAML configs, and use MLflow for experiment tracking and DVC for versioning the 4 elements with each run, that would be sufficient. I'm torn between using Kedro or writing my own small application.

datajoely

01/05/2022, 6:20 PM
alright
I think I'm done with my sample project - but I'm torn on whether this makes sense or not
so running
kedro run --config trial_1.yml
will take data from
conf/trial_1/catalog.yml
and the nodes will take their parametrisation from the configured YAML file
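the trial file is just the run's CLI options in YAML, along these lines (a sketch - the available keys depend on your Kedro version):

    run:
      env: trial_1          # resolves catalog/parameters from conf/trial_1/
      pipeline: __default__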

metalmind

01/05/2022, 6:35 PM
Thank you very much for your effort. I appreciate it. The folder-per-catalog approach would be too much when running a lot of experiments. Also, if the 4 elements could be added via the command line as an option, the folders wouldn't be required.
But generally, what do you think about my comment above about writing a small app for that? I'm in a proof-of-concept phase and have already written part of it. Either that, or is there another project more suited for R&D that you know of?
I prefer the params and features to be Python functions in their own script files instead of configs in YAML. I'd have more control that way and could integrate them into any other app.

datajoely

01/05/2022, 6:42 PM
Yeah it's fair to say this approach doesn't utilise the benefits of authoring your pipelines via Kedro
so I think you may be better off with any orchestrator that doesn't take responsibility for IO
so something like Prefect, Airflow or Dagster may be more appropriate

metalmind

01/05/2022, 6:46 PM
Does the part
sklearn.linear_model
below:
feature_params:
      module: 'sklearn.linear_model'
work for any module, or pip-installed ones only?
As my functions should be loaded from the same repository where the other data resides.

datajoely

01/05/2022, 6:55 PM
Hello, if you look at the implementation of the nodes.py
it just uses importlib
so it will load any module visible to the interpreter, whether local or in site-packages
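i.e. something like this (the local module name is hypothetical):

    import importlib

    # anything importable on sys.path works the same way:
    sk_mod = importlib.import_module("sklearn.linear_model")    # site-packages
    my_mod = importlib.import_module("my_project.featurizers")  # local module in the repo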

metalmind

01/05/2022, 7:03 PM
Thank you again for your effort. I'm checking your implementation to decide how I should move next.