# advanced-need-help
s
Questions about experiment tracking plans -- CC @User @User . I read @User's post about experiment tracking plans. We are building kedro-dvc, which integrates with DVC experiment tracking (see https://github.com/FactFiber/kedro-dvc/discussions/6 for the kedro-dvc discussion, with links to DVC). It would seem that the DVC and Kedro plans are largely orthogonal -- and could be used profitably together. For instance, DVC supports tracking data and parameter dependencies, and only partially rerunning pipelines. It supports forking experiments at checkpoints in the middle of pipelines and comparing metrics between experiments and forks. It also supports publishing experiments to git branches or pushing them "as experiments" to other repo users. (Underneath, it uses the git-stash mechanism together with internal files to cache metrics.) [To this list, we plan to add tracking of code dependencies as well as data dependencies in Kedro-DVC, allowing partial reruns to depend on code changes, even if not noted explicitly.] Kedro, on the other hand, seems to focus on cross-experiment visualization, adding to kedro-viz. (DVC provides this via DVC Studio, but that is on the other side of the freemium barrier.) To integrate, it would seem the key piece is the "session store". I wonder:
a) Could the session store be a plugin with a defined API, rather than a piece of kedro-viz? (Then I could switch out the default.)
b) Or will the session store have a defined API?
c) How does your planned session mechanism deal with different versions of data?
a
Just to make sure you've been looking at the right thing, this is the best comment to date on the topic, I think: https://github.com/kedro-org/kedro/issues/1070#issuecomment-979130536. It was originally written by @User, who probably still knows most about all this 🙂
a) `SessionStore` is definitely meant to be a configurable piece with a defined API. The way you would customise this is through `SESSION_STORE_CLASS` and `SESSION_STORE_ARGS` in your project's settings.py file: https://github.com/kedro-org/kedro/blob/develop/kedro/templates/project/%7B%7B%20cookiecutter.repo_name%20%7D%7D/src/%7B%7B%20cookiecutter.python_package%20%7D%7D/settings.py. The fact that `SQLiteStore` is defined in kedro-viz is a temporary convenience while the feature is developed - at some point in the future, it should become part of core Kedro.
b) That would be https://github.com/kedro-org/kedro/blob/326450b78e676fea440bde645c32637136d1d4cd/kedro/framework/session/store.py#L11, although as you can see from `SQLiteSessionStore`, that's not a restrictive API that stops you from adding other pieces: https://github.com/kedro-org/kedro-viz/blob/main/package/kedro_viz/integrations/kedro/sqlite_store.py
c) Basically, the version timestamp is used to uniquely identify a `kedro run`. This is recorded in the session store and is the same as the version used for versioning datasets. If a user has specified to load different versions using `load_version`, then this should also be available in the session store, because it's part of the run command arguments.
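To make (a) concrete, a minimal sketch of what that customisation might look like in settings.py. Only the `SESSION_STORE_CLASS` / `SESSION_STORE_ARGS` names come from the message above; `MySessionStore` and its module path are purely hypothetical:

```python
# settings.py (sketch) -- MySessionStore and my_project.stores are
# hypothetical stand-ins; only SESSION_STORE_CLASS and SESSION_STORE_ARGS
# are real Kedro settings names.
from my_project.stores import MySessionStore  # hypothetical custom store

SESSION_STORE_CLASS = MySessionStore
SESSION_STORE_ARGS = {"path": "./sessions"}
```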
s
Nice ... hmm.
1. `class BaseSessionStore(UserDict):` -- will this work with duck typing, or will I need to inherit from this? (Any chance of using a typing.Protocol here?)
2. `read -> Dict[str, Any]` is a little opaque as to what the expectations are... 🙂
3. Apropos IDs, could I use the content-based hashes from DVC instead of a timestamp? (Are you using the timestamp as something other than an ID? Could you make that configurable if so, perhaps? -- I can also add a timestamp, but to coordinate, it would be nice if the two were optionally not the same thing.)
NB .. if someone is really using this to fire up a big fleet of spot instances to do distributed hyper-parameter tuning, then in theory a timestamp might not be unique. A timestamp + a short hash would work. I guess this case probably isn't pressing to support....
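The timestamp-plus-short-hash idea could be sketched like this (the format string and hash length are arbitrary choices for illustration, not anything Kedro or DVC actually does):

```python
import hashlib
import os
from datetime import datetime, timezone


def make_run_id() -> str:
    """Timestamp prefix keeps IDs roughly lexicographically == chronologically
    ordered; the random suffix disambiguates concurrent runs on a big fleet
    of machines that might start in the same instant."""
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H.%M.%S.%fZ")
    suffix = hashlib.sha1(os.urandom(16)).hexdigest()[:8]
    return f"{ts}-{suffix}"


a, b = make_run_id(), make_run_id()
assert a != b  # unique even if generated within the same microsecond
```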
Apropos "challenges":
> * Currently, there is no easy way to get the list of previous run_ids and related run data from the store location, as non-run sessions are stored in the same location.

DVC can help with that. 🙂
BTW -- is `SessionRepository` also going to be configurable?
a
These are great questions, but there might not be very good answers, because (a) we're still trying to crystallise exactly what belongs in a session and how it works; and (b) `SQLiteSessionStore` is really the first "proper" session store we've ever had, and it was developed specifically for experiment tracking and is pretty new. So basically, some of the behaviour here hasn't necessarily been fully figured out, or has been left deliberately vague and open-ended, to be determined by future requirements and user feedback. Outside experiment tracking, I'm sure you're the first person who has considered writing a custom session store. So you're sort of hitting the limits of what Kedro has well-defined here, and any thoughts you have are very much welcomed!

We actually have a ticket in this sprint to try and figure out whether we should have session_id == run_id == dataset save version (the timestamp), as is currently the case - you should definitely take a look and leave a comment 🙂 e.g. if it would be useful to be able to set a custom run_id, or to control these properties independently: https://github.com/kedro-org/kedro/issues/1273

1. Currently this works with duck typing, but it's maybe not obvious what methods you need to provide, so it's probably best to just inherit from `BaseSessionStore`. Here's the only file where the session store is used, I think: https://github.com/kedro-org/kedro/blob/main/kedro/framework/session/session.py. It seems that the requirements for a valid session store are that it can be initialised with certain arguments (see `def _init_store`) and that it exposes certain methods, some of which are only in `UserDict` and not explicitly in `BaseSessionStore` or any classes below it (like `update`). Note that `SQLiteSessionStore` doesn't actually define a `read` method (not sure why, actually). Defining the requirements properly using typing.Protocol sounds like a good idea, but I guess we'd have to figure out exactly what those requirements are first...
2. Indeed. This was left deliberately vague in the interests of minimising breaking changes, because we didn't know what it might contain in future.
3. This is a great question and very relevant to the GH issue I linked to above. Currently it seems the answer is no, you can't use a custom session_id, but if there's demand for it then it might happen. Another problem is that there's currently a strong assumption that session_ids, even custom ones, are ordered lexicographically. This is because things like loading the most recent dataset rely on that ordering - see https://github.com/kedro-org/kedro/blob/main/kedro/io/core.py#L529
4. The session repository (now `RunsRepository`, it seems) is defined here: https://github.com/kedro-org/kedro-viz/blob/main/package/kedro_viz/data_access/repositories/runs.py . I don't think this is likely to become configurable any time soon, since it's part of the data access layer of Kedro-Viz, which doesn't have a system for injecting custom components like Kedro's settings.py.
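On point 1, a typing.Protocol for the store might look something like the sketch below. The method set is guessed from the discussion above (construction arguments, `read`/`save`, plus the UserDict-style `update`), not taken from Kedro's actual source:

```python
from typing import Any, Dict, Protocol, runtime_checkable


@runtime_checkable
class SessionStoreLike(Protocol):
    """Guessed requirements for a session store: read/save plus the
    dict-style update that session.py appears to use via UserDict."""

    def read(self) -> Dict[str, Any]: ...
    def save(self) -> None: ...
    def update(self, other: Dict[str, Any]) -> None: ...


class InMemoryStore:
    """Minimal conforming stand-in (not a real Kedro class)."""

    def __init__(self, path: str, session_id: str):
        self._path, self._session_id = path, session_id
        self.data: Dict[str, Any] = {}

    def read(self) -> Dict[str, Any]:
        return dict(self.data)

    def save(self) -> None:
        pass  # a real store would persist self.data to self._path here

    def update(self, other: Dict[str, Any]) -> None:
        self.data.update(other)


store = InMemoryStore(path="sessions", session_id="2022-03-01T10.00.00.000Z")
store.update({"cli": {"command_path": "kedro run"}})
assert isinstance(store, SessionStoreLike)  # duck typing, no inheritance
```

With `@runtime_checkable`, the `isinstance` check only verifies that the methods exist, which is exactly the duck-typing question being discussed.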
s
Thanks, Antony -- I did comment on the issue. BTW, what is the possible difference between a run and a session? Is a "run" the "current state" and a session "the state made persistent" (a checkpoint?) in some way? Are you aiming to support forking etc. like DVC? Does the `RunsRepository` really need to live in kedro-viz? I don't know what other things kedro-viz keeps track of persistently, but could at least this be made into a separate service? Kedro could provide a default, but `kedro-dvc` could override it, and kedro-viz wouldn't have to care which it was using.
d
I think the honest answer is we're still working out how deep into this world to go
l
Hello, nice to meet you @User. For a bit of context: if you are integrating Kedro & DVC experiment tracking, it's probably best to ignore Kedro-Viz altogether. All of the abstractions you mention, like `RunsRepository` etc., are quite specific to Kedro-Viz's needs at the moment. What we are hoping to achieve is: if the SQLite-based session store proves to be useful, we will iterate on and stabilise the interface in Kedro-Viz and backport it into Kedro in the future as a first-class citizen. You are welcome to use it, but expect it to change.
Regarding the run vs session business:
* In theory, a session can have multiple runs, e.g. users can have multiple kedro runs within the same Jupyter notebook session. As far as I can tell, this doesn't happen in practice, so session ~= run
* I'm not too sure how these map onto DVC primitives. I'll have to read up further on DVC and get back to you
s
@User -- Thanks -- nice to meet you as well! ... so a session is tied to Jupyter? In DVC there are two sorts of runs: (1) via `dvc repro`, which updates the state in the working tree without "marking" it, but stores results in the "run cache" (https://dvc.org/doc/user-guide/experiment-management#run-cache-automatic-log-of-stage-runs), and (2) via `dvc exp run` (https://dvc.org/doc/user-guide/experiment-management/running-experiments#running-the-pipelines), which associates results with a git reference (utilizing `git stash` machinery). @User -- I'm hoping you won't have to go very deep at all, but will be able to rely on DVC! 🙂 You have concentrated on visualization, which isn't part of DVC (being in their premium offering, DVC Studio). If you can establish appropriate hooks, we can use Kedro-DVC to store experiments, and also to fork, publish, share, collect in one repo, etc. -- while still using Kedro to visualize them. But I would presume that, in order to visualize, Kedro-Viz needs a mechanism to access stage metrics and plots across experiments. I'd rather not have to hack in a duplicate history, but would prefer a (pluggable) API, in order to keep things clean and DRY. For you, this might have the benefit of separation of concerns, regardless of DVC, keeping Kedro-Viz from getting bloated.
l
@User no, a session "orchestrates" runs: it parses the data catalog and pipeline definition, starts a runner, etc. People can then start a session from different entrypoints. By far the most common entrypoint is the command line: `kedro run` -- and in this entrypoint it's 1 session, 1 run. However, people can also start a session from other entrypoints, such as a Jupyter notebook, and in theory can do many runs per session. Thanks for the link. Let me check it out tonight and I'll circle back.
s
@User --
> If you are integrating Kedro & DvC experiment tracking, it's probably best to ignore Kedro-Viz altogether.

I presume this advice applies to `SESSION_STORE_CLASS` as well? Do you have any plans for how the rest of Kedro uses what it consumes from this API? I would guess that, if I override it to present a view of DVC experiments, I'll end up breaking Kedro-Viz at the moment. (?)
l
Re `SESSION_STORE_CLASS`: the rest of Kedro doesn't currently rely on or consume anything from it. It does what it says on the tin -- store data from a session run -- so it only has a `read` and `save` interface. Re compatibility with Kedro-Viz: yes, if you provide a custom implementation of the session store and ask your users to use it, the experiment tracking tab in Kedro-Viz won't work, but other features should still be fine. But I think this will be the case at a product level, right? Why would people use Kedro-Viz experiment tracking if they choose to go with DVC?
There is a good write-up from Neptune on how they integrate with Kedro: https://docs.neptune.ai/integrations-and-supported-tools/automation-pipelines/kedro
It's a pure hook-based solution, IIRC, so it might be of interest to you
s
@User -- thanks! So --
> Why would people use Kedro-Viz experimentation tracking if they choose to go with DvC?

Because Kedro-Viz is awesome? 🙂 ... What I'm arguing for is that the actual tracking of experiments can and should be separate from the visualization of tracked experiments.
l
Cheers, yeah, that's a good question. I think wrt experiments there are 2 kinds of data:
* Experiment artefacts, e.g. metrics, output models, etc. --> these are tracked through datasets (1)
* The surrounding "context" of an experiment, i.e. data related to a specific run (run params, timestamp, etc.) (2)

(1) is currently Kedro-native, since they are datasets. The end game for (2), I think, is what you said: once we are happy with the data model, we can backport the session store from Kedro-Viz into Kedro core. For now it's iterated on in Kedro-Viz, because we are still figuring out what exactly we are building.
s
Sounds right. wrt:
1. Experiment artifacts: we plan to maintain DVC state based on the Kedro setup.
2. Experiment context: here, however, DVC relies on git to map experiments onto the git "time" dimension, which I regard as an excellent solution. It can leverage git for all sorts of capabilities not yet contemplated in straight Kedro experiment tracking (I presume) -- forking, sharing, etc. Here I would rather present some view of this state to Kedro for it to consume.

As a practical matter, I wouldn't expect you to deviate from your plans at the drop of a hat -- I'm putting out a shout for some abstraction between Kedro-Viz and the session store. (We are targeting the end of March for a first functional version, and then have work to integrate code dependencies discovered with `trace` with DVC data dependencies... So actually writing a Kedro-compatible session store is probably a May issue for us.) I'm hoping that thinking about it in this manner will also help you in your design. [Just took a brief look at the Neptune integration; it seems to be pre-experiment-tracking -- not surprising 🙂 -- offering comparisons between different pipelines. I also hope for a purely hook-based implementation... which is why I'm asking for hooks.]
i
@User we're still designing the `KedroSession` and the store, so it's hard for us to give you many useful details. We mainly mean to use the store as a way to save the details of each run, so we can visualise them in Viz, or simply for investigative reasons. Pairing it with DVC sounds about right; we'd love to keep in touch with what you do with it. As it's fairly free-form at the moment, probably just go with what seems most reasonable for you and then share it back with us. We'll consider your use case when we are advancing the design, and hopefully won't make too many breaking changes along the way. At the moment Kedro-Viz is making some assumptions which may not hold in the future. We did that just to make progress on the experiment tracking work, but if changes arise, we'll update Viz to work with the new format.
Our philosophy is that versioned tracking will make it easy to map everything onto the same timeline, including git commits (probably exactly what DVC is doing). It's not out of the question that we eventually leverage what is already there in DVC -- who knows, maybe that's the most sane approach.
s
Excellent! ... I hope we can get a concrete mapping between Kedro and DVC implemented for you to study in the next 3-4 weeks. The proposal I raised in the issue -- allowing use of a commit hash as the session ID, with time (and/or order) optionally represented separately in session metadata -- is a small example of (one possible way of) smoothing out eventual integration. I have yet to look at `RunsRepository`, but I presume it may be a bit more of a challenge to figure out how to abstract. If you are open to it, I would be happy to brainstorm at some point on what the requirements and interface should be.
l
@User the runs repository is just a thin abstraction over SQLAlchemy to pull data out of the SQLite session store. I wouldn't even reuse it, because it's just a bunch of SQL queries anyway. The Kedro-Viz backend uses the repository pattern to manage data access, but you can query your session store however you like 🙂
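For what it's worth, pulling run IDs out of an SQLite session store really is only a few lines of stdlib `sqlite3`. The schema below is invented for illustration and is not Kedro-Viz's actual table layout:

```python
import sqlite3

# Invented schema: one row per run, with a JSON blob of session data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (id TEXT PRIMARY KEY, blob TEXT)")
conn.executemany(
    "INSERT INTO runs VALUES (?, ?)",
    [
        ("2022-02-28T09.00.00.000Z", '{"cli": "kedro run"}'),
        ("2022-03-01T10.00.00.000Z", '{"cli": "kedro run --pipeline=ds"}'),
    ],
)
# Lexicographic DESC on timestamp IDs gives newest-first -- the same
# ordering assumption discussed earlier in the thread.
run_ids = [row[0] for row in conn.execute("SELECT id FROM runs ORDER BY id DESC")]
```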
s
Yup -- I guess under the hood things will look quite different, as I'll be using a `dvc.repo.Repo` instance to query a git repo and some internal DVC state. One question will be whether the interface should return, e.g., paths to files with metrics, or the metrics themselves. (I guess the former?)