
metalmind

01/05/2022, 5:21 PM
As far as I know, this is a tool for manual labeling. What I'm looking for is a script with one function used for labeling and another for feature generation.

datajoely

01/05/2022, 5:25 PM
Hello, I've created a thread to isolate the conversation
copying your questions here:
> Data can be loaded via configuration, right? I want to use Kedro during the experiment/R&D phase, so I need to be able to mix and match raw data + feature generator + label generator + parameters. Say I have 10 raw data files, 10 feature generation functions (source files), etc. I want to try Raw Data 1 + Featurizer 2 + Labeler 5 + Parameter Set 6 (1/2/5/6). Then 3/4/1/7, etc. All while letting MLflow record the results.
> So not only can raw data be configured and loaded, but also the scripts (functions) used to generate the features and labels.
So this is an interesting topic
as you'll need to set up your pipelines to work in quite a dynamic fashion, which we're not necessarily fans of, as we feel it makes things quite hard to read/maintain/debug
but it is possible
So I think you're going to get creative with your run args
so you can inject using the technique above
> --config may be used as a starting point, but how do I change the function in a pipeline node from it? I have an idea already but wanted first to check if there's a way to do it out of the box.

metalmind

01/05/2022, 5:29 PM
Thank you for the thread.

datajoely

01/05/2022, 5:30 PM
If you look at this sample project
you'll see
we load a different sklearn regressor object dynamically based on parameters
and then parametrise the class path
I think the same approach can be applied
You can actually follow along with that sample project, as it then instantiates the modelling twice
It's an advanced use of Kedro but I think it's similar to what you're trying to achieve
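Roughly, the node does something like this (a sketch from memory, not the exact project code; the parameter names are illustrative):

    import importlib
    from typing import Any, Dict

    def create_model(model_params: Dict[str, Any]):
        # model_params might look like:
        #   {"module": "sklearn.linear_model",
        #    "class": "LinearRegression",
        #    "kwargs": {"fit_intercept": True}}
        module = importlib.import_module(model_params["module"])
        model_class = getattr(module, model_params["class"])
        return model_class(**model_params.get("kwargs", {}))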

metalmind

01/05/2022, 5:35 PM
You used this for creating the model, right? I thought of doing the same, or creating the model in a notebook, exporting it as YAML, and then letting a node load it as a YamlDataset and create the model from it. So I'd avoid having extra scripts.

datajoely

01/05/2022, 5:35 PM
Yes - it's a trivial model so I appreciate it may not be super representative

metalmind

01/05/2022, 5:39 PM
Let me know your idea about handling the scripts differently. I'd create a new dataset called ScriptDataSet or PythonDataSet or FunctionDataset, define its params in the data catalog, and create 2 new folders, 09_featurizers and 10_modelers, where the scripts are stored. Then give the FunctionDataset a Python source file name to load from there, plus an optional function name, or it loads the first function from the file. Then use that dataset in one of the pipeline nodes.
Unless there's something like that ready in Kedro.
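Roughly this sketch - FunctionDataSet is hypothetical, not an existing Kedro dataset, and I'm assuming the Kedro 0.17-era AbstractDataSet interface:

    import importlib.util
    import inspect
    from pathlib import Path
    from typing import Optional

    from kedro.io import AbstractDataSet

    class FunctionDataSet(AbstractDataSet):
        """Hypothetical dataset that loads a function from a .py source file."""

        def __init__(self, filepath: str, function_name: Optional[str] = None):
            self._filepath = Path(filepath)
            self._function_name = function_name

        def _load(self):
            spec = importlib.util.spec_from_file_location(
                self._filepath.stem, self._filepath
            )
            module = importlib.util.module_from_spec(spec)
            spec.loader.exec_module(module)
            if self._function_name:
                return getattr(module, self._function_name)
            # otherwise fall back to the first function defined in the file itself
            functions = [
                fn
                for _, fn in inspect.getmembers(module, inspect.isfunction)
                if fn.__module__ == module.__name__
            ]
            return functions[0]

        def _save(self, data):
            raise NotImplementedError("FunctionDataSet is read-only")

        def _describe(self):
            return dict(filepath=self._filepath, function_name=self._function_name)

and a catalog entry pointing at one of the new folders, e.g.:

    featurizer_2:
      type: my_project.extras.datasets.FunctionDataSet
      filepath: data/09_featurizers/featurizer_2.py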

datajoely

01/05/2022, 5:39 PM
:S
I'm not sure - my first reaction is that it could get a bit messy and feels like a security risk
we have plans to let the user enable 'unsafe' yaml in the future, but it's not available yet
this would allow you to reference a Python module with the special syntax at the bottom
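for illustration, with plain PyYAML that syntax looks something like this (not wired into Kedro's config loader yet, and UnsafeLoader will execute arbitrary code referenced from config - hence the security concern):

    import yaml

    text = "model_class: !!python/name:sklearn.linear_model.LinearRegression"
    # python/* tags only resolve with the unsafe loader
    cfg = yaml.load(text, Loader=yaml.UnsafeLoader)
    model = cfg["model_class"]()  # instantiates LinearRegression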

metalmind

01/05/2022, 5:41 PM
It could get messy, yes, but what's the alternative? I could have like 10 different feature generators and similar label generators and need to be able to mix and match during experimentation.

datajoely

01/05/2022, 5:42 PM
Is the desired output just the combinatorial product of all of them?
Itertools in the standard library may be helpful https://docs.python.org/3/library/itertools.html#itertools.combinations
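For the full grid over your four element types, itertools.product is probably the closer fit - a quick sketch with made-up names:

    from itertools import product

    raw_data = ["raw_1", "raw_2", "raw_3"]
    featurizers = ["featurizer_1", "featurizer_2"]
    labelers = ["labeler_1", "labeler_2"]
    param_sets = ["params_1", "params_2"]

    # every (raw, featurizer, labeler, params) combination: 3*2*2*2 = 24 runs
    for raw, feat, label, params in product(raw_data, featurizers, labelers, param_sets):
        print(raw, feat, label, params)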

metalmind

01/05/2022, 5:43 PM
Yes, each time single raw data + single featurizer + single... etc
oh, not automated or grid-search-like, this is manual experimentation based on the results.

datajoely

01/05/2022, 5:44 PM
ok
so each pipeline run is:
- fixed input data
- parameterised featurizer
- parameterised labeler

metalmind

01/05/2022, 5:46 PM
Input raw data would be parameterized too. I thought each iteration/run would be a YAML file pointing to raw data + featurizer + labelers + YAML parameters.
Passed as
kedro run --config...
or something.

datajoely

01/05/2022, 5:46 PM
Unfortunately I think you may be breaking too many of Kedro's assumptions
let me think about this more
but I think there is a chance it may not be the right tool for this approach

metalmind

01/05/2022, 5:49 PM
You meant Kedro as a whole?

datajoely

01/05/2022, 5:49 PM
potentially
I'm prototyping something currently
regarding the dynamic input data - would it be defined in the catalog?

metalmind

01/05/2022, 5:50 PM
I think Kedro is suited more to come after a developed model, not to be used during experimentation.
Yes, as the standard method. And I'm thinking of using different catalogs or finding a way to replace the data with each run.

datajoely

01/05/2022, 5:51 PM
Okay I may have something
give me 10 mins to make a dummy project

metalmind

01/05/2022, 5:53 PM
My main target now is to find a way to run R&D on a matrix of the 4 mentioned elements as mix and match, while recording the results in MLflow, to find the best model and then use the same feature generation function in inference.
Thank you.
Actually I've started to feel that Kedro is overkill for my needs. I have only a fixed 4-step pipeline. The input data, functions, and parameters are what change, so if I can put them all in a repository and feed them to my pipeline via YAML configs, and use MLflow for experiment tracking and DVC for versioning the 4 elements with each run, that would be sufficient. I'm torn between using Kedro or writing my own small application.

datajoely

01/05/2022, 6:20 PM
alright
I think I'm done with my sample project - but I'm torn on whether this makes sense or not
so running
kedro run --config trial_1.yml
will take data from
conf/trial_1/catalog.yml
and the nodes will take their parametrisation from the configured YAML file
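the trial file is just the run's CLI options in YAML, along these lines (a sketch - the available keys depend on your Kedro version):

    run:
      env: trial_1          # resolves catalog/parameters from conf/trial_1/
      pipeline: __default__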

metalmind

01/05/2022, 6:35 PM
Thank you very much for your effort. I appreciate it. The folder-per-catalog approach would be too much when running a lot of experiments. Also, if the 4 elements could be added via the command line as an option, the folders wouldn't be required.
But generally, what do you think about my comment above about writing a small app for that? I'm in a proof-of-concept phase and have already written part of it. Either that, or is there another project more suited for R&D that you know of?
I prefer the params and features to be Python functions in their own script files instead of configs in YAML. I'd have more control that way and could integrate them into any other app.

datajoely

01/05/2022, 6:42 PM
Yeah it's fair to say this approach doesn't utilise the benefits of authoring your pipelines via Kedro
so I think you may be better off with any orchestrator that doesn't take responsibility for IO
so something like Prefect, Airflow or Dagster may be more appropriate

metalmind

01/05/2022, 6:46 PM
Does the part
sklearn.linear_model
below:
feature_params:
      module: 'sklearn.linear_model'
work for any module, or pip-installed ones only?
As my functions should be loaded from the same repository where the other data resides.

datajoely

01/05/2022, 6:55 PM
Hello, if you look at the implementation of the nodes.py
it just uses importlib
so it will load any module visible to the interpreter, whether local or in site-packages
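i.e. something like this (the local module name is hypothetical):

    import importlib

    # anything importable on sys.path works the same way:
    sk_mod = importlib.import_module("sklearn.linear_model")    # site-packages
    my_mod = importlib.import_module("my_project.featurizers")  # local module in the repo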

metalmind

01/05/2022, 7:03 PM
Thank you again for your effort. I'm checking your implementation to decide how I should move next.