# beginners-need-help
m
As far as I know, this is a tool for manual labeling. What I'm looking for is a script with one function used for labeling and another for feature generation.
d
Hello, I created a thread to isolate the conversation
copying questions here
> Data can be loaded via configuration, right? I want to use Kedro during the experiment/R&D phase, so I need to be able to mix and match raw data + feature generator + label generator + parameters. Say I have 10 raw data files, 10 feature generation functions (source files), etc. I want to try Raw Data 1 + Featurizer 2 + Labeler 5 + Parameter Set 6 (1/2/5/6), then 3/4/1/7, etc. All while letting MLflow record the results.
> So not only raw data should be configurable and loadable, but also the scripts (functions) used to generate the features and labels.
So this is an interesting topic
as you'll need to set up your pipelines to work in quite a dynamic fashion, which we're not necessarily fans of, as we feel it makes things quite hard to read/maintain/debug
but it is possible
So I think you're going to have to get creative with your run args
so you can inject using the technique above
> --config may be used as a starting point, but how would I change the function in a pipeline node from it? I have an idea already but wanted first to check if there's a way to do it out of the box.
m
Thank you for the thread.
d
If you look at this sample project
you'll see
we load a different Sklearn regressor object dynamically based on parameters
and then parametrise the class path
I think the same approach can be applied
You can actually follow along with that sample project, as it then instantiates the modelling twice
It's an advanced use of Kedro but I think it's similar to what you're trying to achieve
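Roughly, the pattern looks like this (a sketch only; the key names are illustrative, not the sample project's actual parameters):

```python
from importlib import import_module


def create_model(model_options: dict):
    """Build a regressor from a parametrised class path, e.g.
    {"module": "sklearn.linear_model", "class": "LinearRegression", "kwargs": {}}."""
    module = import_module(model_options["module"])
    model_class = getattr(module, model_options["class"])
    return model_class(**model_options.get("kwargs", {}))
```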
m
You used this for creating the model, right? I thought of doing the same, or creating the model in a notebook and exporting it as YAML, then letting a node load it as a YamlDataset and create the model from it. So I'd avoid having extra scripts.
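For reference, a catalog entry for that YAML could look like this (a sketch, assuming the built-in yaml.YAMLDataSet and a made-up file path):

```yaml
model_config:
  type: yaml.YAMLDataSet
  filepath: data/05_model_input/model_config.yml
```

A node would then take model_config as an input and instantiate the model from it, much like the sketch above.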
d
Yes - it's a trivial model so I appreciate it may not be super representative
m
Let me know your idea about handling the scripts differently. I'd create a new dataset called ScriptDataSet or PythonDataSet or FunctionDataSet, define its params in the data catalog, and create 2 new folders, 09_featurizers and 10_modelers, where the scripts are stored. Then I'd give the FunctionDataSet a Python source file name to load from there, plus an optional method name, or it loads the first method from the file. Then I'd use that dataset in one of the pipeline nodes.
Unless there's something like that ready in Kedro.
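Purely hypothetical, but such a dataset might be sketched like this (it is not something Kedro ships; it loads a callable from a source file via importlib):

```python
import importlib.util
from pathlib import Path

from kedro.io import AbstractDataSet


class FunctionDataSet(AbstractDataSet):
    """Hypothetical dataset that loads a callable from a Python source file."""

    def __init__(self, filepath: str, function_name: str = None):
        self._filepath = Path(filepath)
        self._function_name = function_name

    def _load(self):
        spec = importlib.util.spec_from_file_location(self._filepath.stem, self._filepath)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        if self._function_name:
            return getattr(module, self._function_name)
        # fall back to the first callable defined in the module
        callables = [v for v in vars(module).values() if callable(v)]
        return callables[0]

    def _save(self, data):
        raise NotImplementedError("This dataset is read-only.")

    def _describe(self):
        return dict(filepath=str(self._filepath), function=self._function_name)
```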
d
:S
I'm not sure - my first reaction is that it could get a bit messy and feels like a security risk
we have plans to let the user enable 'unsafe' yaml in the future, but it's not available yet
this would allow you to reference a Python module with the special syntax at the bottom
m
Could get messy, yes, but what's the alternative? I could have like 10 different feature generators and a similar number of label generators, and need to be able to mix and match during experimentation.
d
Is the desired output just every combination of them?
Itertools in the standard library may be helpful https://docs.python.org/3/library/itertools.html#itertools.combinations
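If it really were the full grid, itertools.product (one pick from each list) is the closer fit; a toy sketch with made-up names:

```python
from itertools import product

raw_files = ["raw_1.csv", "raw_2.csv"]
featurizers = ["featurizer_1", "featurizer_2"]
labelers = ["labeler_1", "labeler_5"]
param_sets = ["params_6", "params_7"]

# every raw-data / featurizer / labeler / parameter-set combination
for raw, feat, lab, params in product(raw_files, featurizers, labelers, param_sets):
    print(raw, feat, lab, params)
```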
m
Yes, each time a single raw data file + a single featurizer + a single... etc.
Oh, not automated or grid-search-like; this is manual experimentation based on the results.
d
ok
so each pipeline run is:
- fixed input data
- parameterised featurizer
- parameterised labeler
m
Input raw data would be parameterized too. I thought each iteration/run would be a YAML file pointing to raw data + featurizer + labeler + YAML parameters.
Passed as
kedro run --config...
or something.
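Something like this per run, perhaps (a hypothetical spec with made-up paths; not a format Kedro reads out of the box):

```yaml
# trial_003.yml (hypothetical experiment spec)
raw_data: data/01_raw/raw_1.csv
featurizer: src/featurizers/featurizer_2.py
labeler: src/labelers/labeler_5.py
parameters: conf/experiments/params_6.yml
```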
d
Unfortunately I think you may be breaking too many of Kedro's assumptions
let me think about this more
but I think there is a chance it may not be the right tool for this approach
m
You meant Kedro as a whole?
d
potentially
I'm prototyping something currently
regarding the dynamic input data - would it be defined in the catalog?
m
I think Kedro is more suited to come in after a model has been developed, not to be used during experimentation.
Yes, as the standard method. And I'm thinking of using different catalogs or finding a way to replace the data with each run.
d
Okay I may have something
give me 10 mins to make a dummy project
m
My main target now is to find a way to run R&D on a matrix of the 4 mentioned elements, mixing and matching while recording the results in MLflow, to find the best model and then use the same feature generation function in inference.
Thank you.
Actually, I've started to feel that Kedro is overkill for my needs. I have only a fixed 4-step pipeline. The input data, functions, and parameters are what change, so if I can put everything in a repository, feed them to my pipeline via YAML configs, use MLflow for experiment tracking, and use DVC for versioning the 4 elements with each run, that would be sufficient. I'm torn between using Kedro and writing my own small application.
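A minimal sketch of what that small application could look like, assuming the featurizer and labeler are referenced by dotted path in the run config and MLflow handles tracking (all names are illustrative):

```python
from importlib import import_module

import mlflow
import yaml


def load_callable(dotted_path: str):
    """Resolve 'package.module:function' style references from the repository."""
    module_path, func_name = dotted_path.split(":")
    return getattr(import_module(module_path), func_name)


def run_experiment(config_path: str):
    with open(config_path) as f:
        cfg = yaml.safe_load(f)

    featurize = load_callable(cfg["featurizer"])  # e.g. "featurizers.fft:make_features"
    make_labels = load_callable(cfg["labeler"])   # e.g. "labelers.threshold:make_labels"

    with mlflow.start_run():
        mlflow.log_param("raw_data", cfg["raw_data"])
        mlflow.log_params(cfg["parameters"])  # assumed to be a flat dict of hyperparameters
        # fixed 4-step pipeline: load -> featurize -> label -> train
        # (the actual steps are omitted in this sketch)
        ...
```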
d
alright
I think I'm done with my sample project - but I'm torn on whether this makes sense or not
so running
kedro run --config trial_1.yml
will take data from
conf/trial_1/catalog.yml
and the nodes will take their parametrisation from the configured YAML file
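For context, the file passed to --config is a YAML whose keys mirror the kedro run CLI options, so trial_1.yml here presumably contains something like the following (a guess at the sample project's file; check the schema for your Kedro version):

```yaml
run:
  env: trial_1   # picks up conf/trial_1/catalog.yml on top of conf/base
```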
m
Thank you very much for your effort, I appreciate it. The folder-per-catalog approach would be too much when running a lot of experiments. Also, if the 4 elements could be passed via the command line as options, the folders wouldn't be required.
But generally, what do you think about my comment above about writing a small app for that? I'm in a proof-of-concept phase and have already written part of it. Either that, or is there another project more suited to R&D that you know of?
I prefer the params and features to be Python functions in their own script files instead of configs in YAML. I'd have more control that way and can integrate them into any other app.
d
Yeah it's fair to say this approach doesn't utilise the benefits of authoring your pipelines via Kedro
so I think you may be better off with any orchestrator that doesn't take responsibility for IO
so something like Prefect, Airflow or Dagster may be more appropriate
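For example, with Prefect the IO and the choice of functions stay entirely in your own code (a sketch using Prefect 2.x-style decorators, bodies elided):

```python
from prefect import flow, task


@task
def featurize(raw_path: str, featurizer: str):
    ...  # call whichever feature generator the run spec names


@task
def label(features, labeler: str):
    ...  # likewise for the label generator


@flow
def experiment(raw_path: str, featurizer: str, labeler: str):
    features = featurize(raw_path, featurizer)
    return label(features, labeler)
```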
m
Does the sklearn.linear_model part below:
```yaml
feature_params:
  module: 'sklearn.linear_model'
```
work for any module, or pip-installed ones only?
As my functions should be loaded from the same repository where the other data resides.
d
Hello, if you look at the implementation of nodes.py
It just uses importlib
So it will load any module visible locally or in site-packages
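i.e. anything importable on sys.path works, whether pip-installed or part of your own repo; a toy illustration:

```python
from importlib import import_module

linear_model = import_module("sklearn.linear_model")   # pip-installed package
featurizers = import_module("my_project.featurizers")  # hypothetical module from your own src/ layout
```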
m
Thank you again for your effort. I'm checking your implementation to decide how I should move next.