#beginners-need-help
metalmind

01/05/2022, 5:21 PM
As far as I know, this is a tool for manual labeling. What I'm looking for is a script with a function used for labeling and another for feature generation.
datajoely

01/05/2022, 5:25 PM
Hello, I created a thread to isolate the conversation
5:25 PM
copying questions here
5:25 PM
Data can be loaded via configuration, right? I want to use Kedro during the experiment/R&D phase, so I need to be able to mix and match raw data + feature generator + label generator + parameters. Say I have 10 raw data files, 10 feature generation functions (source files), etc. I want to try Raw Data 1 + Featurizer 2 + Labeler 5 + Parameter Set 6 (1/2/5/6), then 3/4/1/7, etc., all while letting MLflow record the results. So not only should the raw data be configurable and loadable, but also the scripts (functions) used to generate the features and labels.
5:26 PM
So this is an interesting topic
5:26 PM
as you'll need to set up your pipelines to work in quite a dynamic fashion, which we're not necessarily fans of, as we feel it makes things quite hard to read/maintain/debug
5:26 PM
but it is possible
5:27 PM
So I think you're going to have to get creative with your run args
5:27 PM
so you can inject using the technique above
5:28 PM
--config may be used as a starting point, but how do I change the function in a pipeline node from it? I have an idea already, but wanted first to check if there's a way to do it out of the box.
metalmind

01/05/2022, 5:29 PM
Thank you for the thread.
datajoely

01/05/2022, 5:30 PM
If you look at this sample project
5:30 PM
you'll see
5:30 PM
we load a different sklearn regressor object dynamically based on parameters
5:31 PM
and then parametrise the class path
5:31 PM
I think the same approach can be applied
5:32 PM
You can actually follow along with that sample project, as it then instantiates the modelling twice
5:32 PM
It's an advanced use of Kedro but I think it's similar to what you're trying to achieve
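The core of it is just a parameter-driven dynamic import. A minimal sketch (the model_params keys here are illustrative assumptions, not the sample project's exact names):
```python
# Minimal sketch of the dynamic-import pattern; the "model_params"
# keys are illustrative, not the sample project's actual code.
import importlib
from typing import Any, Dict


def create_model(model_params: Dict[str, Any]):
    """Instantiate an estimator from a parametrised class path,
    e.g. module='sklearn.linear_model', class_name='LinearRegression'."""
    module = importlib.import_module(model_params["module"])
    model_class = getattr(module, model_params["class_name"])
    return model_class(**model_params.get("kwargs", {}))
```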
metalmind

01/05/2022, 5:35 PM
You used this for creating the model, right? I thought of doing the same, or creating the model in a notebook and exporting it as YAML, then letting a node load it as a YAMLDataSet and create the model from it. That way I'd avoid having extra scripts.
datajoely

01/05/2022, 5:35 PM
Yes - it's a trivial model so I appreciate it may not be super representative
metalmind

01/05/2022, 5:39 PM
Let me know your idea about handling the scripts differently. I'd create a new dataset called ScriptDataSet or PythonDataSet or FunctionDataSet, define its params in the data catalog, and create 2 new folders, 09_featurizers and 10_modelers, where the scripts are stored. Then give the FunctionDataSet a Python source file name to load it from there, plus an optional method name, or it loads the first method from the file. Then use that dataset in one of the pipeline nodes.
5:39 PM
Unless there's something like that ready in Kedro.
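Roughly, a hypothetical sketch of what I have in mind - FunctionDataSet doesn't exist in Kedro, and I'm assuming the AbstractDataSet interface from kedro.io:
```python
# Hypothetical sketch - FunctionDataSet is not an out-of-the-box Kedro
# dataset; this assumes Kedro's AbstractDataSet interface from kedro.io.
import importlib.util
from typing import Any, Callable, Dict

from kedro.io import AbstractDataSet


class FunctionDataSet(AbstractDataSet):
    """Loads a named function from a Python source file on disk."""

    def __init__(self, filepath: str, function_name: str):
        self._filepath = filepath
        self._function_name = function_name

    def _load(self) -> Callable:
        spec = importlib.util.spec_from_file_location("_loaded", self._filepath)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return getattr(module, self._function_name)

    def _save(self, data: Any) -> None:
        raise NotImplementedError("FunctionDataSet is read-only")

    def _describe(self) -> Dict[str, Any]:
        return {"filepath": self._filepath, "function_name": self._function_name}
```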
datajoely

01/05/2022, 5:39 PM
:S
5:40 PM
I'm not sure - my first reaction is that it could get a bit messy and feels like a security risk
5:40 PM
we have plans to let the user enable 'unsafe' yaml in the future, but it's not available yet
5:41 PM
this would allow you to reference a Python module with a special syntax
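For context, 'unsafe' YAML means something along the lines of PyYAML's !!python/name: tag, which resolves a node to an actual Python object - a small illustration (plain PyYAML behaviour, not today's Kedro catalog syntax):
```python
# Illustration of PyYAML's "unsafe" python/name tag - plain PyYAML,
# not something Kedro's catalog supports today.
import yaml

doc = "featurizer: !!python/name:math.sqrt"
loaded = yaml.load(doc, Loader=yaml.UnsafeLoader)
assert loaded["featurizer"](9) == 3.0  # the value is the real math.sqrt
```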
metalmind

01/05/2022, 5:41 PM
Could get messy, yes, but what's the alternative? I could have like 10 different feature generators and similar label generators, and need to be able to mix and match during experimentation.
datajoely

01/05/2022, 5:42 PM
Is the desired output just every combination of all of them?
5:43 PM
itertools in the standard library may be helpful: https://docs.python.org/3/library/itertools.html#itertools.combinations
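Though for picking one item from each of several lists (rather than subsets of a single list), itertools.product is probably the closer fit - a quick sketch with placeholder names:
```python
# Quick sketch - picking one item from each list is itertools.product,
# not combinations; all the names below are illustrative placeholders.
from itertools import product

raw_data = ["raw_1", "raw_2"]
featurizers = ["featurizer_1", "featurizer_2"]
labelers = ["labeler_1", "labeler_2"]
param_sets = ["params_1", "params_2"]

for run in product(raw_data, featurizers, labelers, param_sets):
    print(run)  # e.g. ('raw_1', 'featurizer_2', 'labeler_1', 'params_2')
```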
metalmind

01/05/2022, 5:43 PM
Yes, every time a single raw dataset + a single featurizer + a single... etc.
5:44 PM
Oh, not automated or grid-search-like; this is manual experimentation based on the results.
datajoely

01/05/2022, 5:44 PM
ok
5:45 PM
so each pipeline run is: fixed input data + parameterised featurizer + parameterised labeler
metalmind

01/05/2022, 5:46 PM
Input raw data would be parameterized too. I thought each iteration/run would be a YAML file pointing to raw data + featurizer + labeler + YAML parameters.
5:46 PM
Passed as
kedro run --config...
or something.
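Roughly like this, as a sketch (every path and key here is hypothetical):
```yaml
# hypothetical per-run config - all paths/keys are made up for illustration
raw_data: data/01_raw/dataset_1.csv
featurizer: src/project/featurizers/featurizer_2.py
labeler: src/project/labelers/labeler_5.py
params: conf/base/param_set_6.yml
```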
datajoely

01/05/2022, 5:46 PM
Unfortunately I think you may be breaking too many of Kedro's assumptions
5:47 PM
let me think about this more
5:47 PM
but I think there is a chance it may not be the right tool for this approach
metalmind

01/05/2022, 5:49 PM
You mean Kedro as a whole?
datajoely

01/05/2022, 5:49 PM
potentially
5:49 PM
I'm prototyping something currently
5:50 PM
regarding the dynamic input data - would it be defined in the catalog?
metalmind

01/05/2022, 5:50 PM
I think Kedro is more suited to come after a developed model, not to be used during experimentation.
5:51 PM
Yes, as the standard method. And I'm thinking of using different catalogs, or finding a way to replace the data with each run.
datajoely

01/05/2022, 5:51 PM
Okay I may have something
5:52 PM
give me 10 mins to make a dummy project
metalmind

01/05/2022, 5:53 PM
My main target now is to find a way to run R&D on a matrix of the 4 mentioned elements, mixing and matching while recording the results in MLflow, to find the best model and then use the same feature generation function in inference.
5:53 PM
Thank you.
6:18 PM
Actually, I've started to feel that Kedro is overkill for my needs. I have only a fixed 4-step pipeline. The input data, functions, and parameters are what change, so if I can put them all in a repository and feed them to my pipeline via YAML configs, and use MLflow for experiment tracking and DVC for versioning the 4 elements with each run, that would be sufficient. I'm torn between using Kedro and writing my own small application.
datajoely

01/05/2022, 6:20 PM
alright
6:20 PM
I think I'm done with my sample project - but I'm torn on whether this makes sense or not
6:24 PM
so running
kedro run --config trial_1.yml
will take data from
conf/trial_1/catalog.yml
and the nodes will take their parametrisation from the configured YAML file
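For reference, trial_1.yml itself would be something like this - kedro run --config reads a YAML whose run: key mirrors the CLI options, and env: trial_1 is what points Kedro at conf/trial_1 (layout assumed from above):
```yaml
# sketch of trial_1.yml - the run: key maps kedro run's CLI options;
# env: trial_1 makes Kedro load conf/trial_1/catalog.yml
run:
  env: trial_1
```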
metalmind

01/05/2022, 6:35 PM
Thank you very much for your effort, I appreciate it. The folder-per-catalog approach would be too much when running a lot of experiments. Also, if the 4 elements could be added via the command line as an option, the folders wouldn't be required.
6:37 PM
But generally, what do you think about my comment above about writing a small app for that? I'm in a proof-of-concept phase and have already written part of it. Either that, or is there another project more suited for R&D that you know of?
6:40 PM
I'd prefer the params and features to be Python functions in their own script files instead of configs in YAML. I'd have more control that way and could integrate them into any other app.
datajoely

01/05/2022, 6:42 PM
Yeah it's fair to say this approach doesn't utilise the benefits of authoring your pipelines via Kedro
6:42 PM
so I think you may be better off with any orchestrator that doesn't take responsibility for IO
6:42 PM
so something like Prefect, Airflow or Dagster may be more appropriate
metalmind

01/05/2022, 6:46 PM
Does the sklearn.linear_model part below:
feature_params:
  module: 'sklearn.linear_model'
work for any module, or pip-installed ones only?
6:51 PM
As my functions should be loaded from the same repository where the other data resides.
datajoely

01/05/2022, 6:55 PM
Hello, if you look at the implementation of nodes.py
6:55 PM
It just uses importlib
6:56 PM
So it will read any module visible to the library, locally or in site-packages
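In spirit it's just this (a sketch, not the sample project's exact code):
```python
# importlib resolves any importable module, whether pip-installed into
# site-packages or local source on sys.path.
import importlib


def load_attribute(module_path: str, attribute: str):
    """Import `module_path` and return the named attribute from it."""
    module = importlib.import_module(module_path)
    return getattr(module, attribute)


# Works the same for a third-party package or your own project package:
regressor_cls = load_attribute("sklearn.linear_model", "LinearRegression")
```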
metalmind

01/05/2022, 7:03 PM
Thank you again for your effort. I'm checking your implementation to decide how I should move next.