Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert formats to yaml with ids #76

Merged
merged 14 commits into from
Dec 9, 2024
Merged

Convert formats to yaml with ids #76

merged 14 commits into from
Dec 9, 2024

Conversation

sneakers-the-rat
Copy link
Collaborator

@sneakers-the-rat sneakers-the-rat commented Nov 12, 2024

Fix: #55

OK Here we replace the awkward format model system with one where we just use yaml files for all static config like that.

This is something that is intended to work alongside #72 to make a more coherent configuration experience.

Now we have the starts of a uniform system where we extended the generic YAMLMixin class to a ConfigYAMLMixin class which can be used with any pydantic model.

The way it works is

  • every config model gains an additional set of fields:
    • id: an identifier that is expected to be locally unique - i.e. in a given deployment, between the builtin configs and the user provided configs, an id is unique. not globally unique across all deployments.
    • mio_model: the fully-qualified python module that the config corresponds to, like package.subpackage.ClassName
    • mio_version: the last known version of miniscope-io that this config was known to work with this config. more on versioning below.
  • The mixin model provides classmethods to find matching configs by ID, either from
  • On loading those files, we make sure that the header keys that identify the config are at the top of the file. rewriting it if so (with a backup). This is to prep us for my own work where i want to be able to rely on header keys at the top to avoid parsing the document, and i've tried to make it nondestructive and conservative until we can be sure that it's not annoying to rewrite your document (doing it in a way that roundtrips, e.g. preserving comments and whitespace and formatting and etc. is a real pain in the ass, but would be less intrusive).
  • we also gain an additional from_any method that accepts the possible input types, yaml files and ids, to provide a "fuck it find my thing" interface. we get to avoid unstructured or EAFP-style string parsing by defining id to exclude ., so we can unambiguously decide whether something is a path or an id.
  • i ported over a yaml_peek function to be able to scan for ids quickly because parsing yaml is surprisingly expensive, especially if we imagine that someone might eventually use this tool to make a relatively complex configuration like we see in the existing miniscope-qt-daq configs. it should be at least as fast as pyyaml always, because the underlying file reading and regex methods in python are well optimized. the reason this is so perf-sensitive is that while it shouldn't happen repeatedly during runtime, we should expect someone to accumulate a billion and a half config files as they use it, and this is easier and likely about as reliable as building a caching system for that.

So basically how this should work in normal usage is like this:

where given some config file in a config directory like

id: wireless-200px
mio_model: miniscope_io.models.stream.StreamDevConfig
mio_version: "v5.0.0"

# ... rest of file
from miniscope_io.stream_daq import StreamDaq

cam = StreamDaq('wireless-200px')

where the user doesn't need to handle paths but can instead refer to their config by ID. this also helps with publishing code with papers, but that's more of a 'later' thing.

Where the actual API of the thing is like this

from miniscope_io import Config
from miniscope_io.models.mixins import ConfigYAMLMixin
from miniscope_io.models import MiniscopeConfig

class MyConfig(MiniscopeConfig, ConfigYAMLMixin):
    """sup use me to configure something"""
    key_1: str
    key_2: int
    
my_config = MyConfig(id="custom-config", key_1 = "hey", key_2 = 500)
my_config.to_yaml(Config().config_dir / "different_filename_idk.yaml")

# loading...
re_loaded = MyConfig.from_id("custom-config")

# using in a miniscope class
# these are equivalent:
cam = StreamDaq("custom-config")
cam = StreamDaq(re_loaded)
cam = StreamDaq(Config().config_dir / "different_filename_idk.yaml")

and so on.

This design sets us up for the future where we will want to have several different kinds of configs for every object, some builtin and fixed, and others provided by the user. For example the current stream_daq config blends together params that are intrinsic to the device with params that the user might want to configure, both dynamically (i.e. in a gui) and for a given set of experiments. we will want to be able to have a bunch of different types of configs that we can refer to with a shorthand and have a standard way of locating them rather than having them splayed out everywhere :). This also makes it possible to have a uniform interface over builtin configs as well as user-provided ones, because the user should not have to know anything about the directory structure of the package (e.g. knowing that wireless miniscope configs are in data/configs/wireless) because they usually can't see it, and they also shouldn't have to juggle paths in derivative code.

Additional notes on the design here:

  • the format of an id has been constrained to one where we might be able to support configs with PIDs in the future: a config is allowed to use / and # symbols to indicate hierarchy and fragments, respectively.
  • the purpose of the mio_version parameter is to be able to identify when old configs might not work with the current version of the model it corresponds to, and to be able to migrate between those version updates. I opted not to give every model its own independent version, which we would then have to write tests to validate that we write a migration whenever the model is updated and so on. Instead we just use the miniscope-io version for everything, and then when we change a model we write a migration that corresponds to that version, and on load we check "are there any migrations for this model? if so, are the versions they are tagged to greater than the version in this config, if so migrate." That's all tbd and i'm not sure if there is tooling for this already in pydantic universe, but if not we'll write it as a separate package. Downstream users are welcome to add an additional version field to their model if they'd like to keep track of things that way as well. we're not quite at plugin phase yet and we'll probably need to mildly tweak this in the future when miniscope-io version is not the package that contains the migrations, but we'll get there later.
  • currently we aren't validating that the mio_model value matches the loading class. that's because it would be annoying while we are pre-alpha and names and locations are shifting to keep having to update these strings. once we get a decent migration system then we can do that. for now it's like "you can name something the wrong name at your own peril lol"

📚 Documentation preview 📚: https://miniscope-io--76.org.readthedocs.build/en/76/

@sneakers-the-rat
Copy link
Collaborator Author

riddle me this - why would the tests fail at 2719511
if

miniscope_io.formats.stream.StreamBufferHeader == miniscope_io.models.stream.StreamBufferHeaderFormat.from_id("stream-buffer-header")

Copy link

github-actions bot commented Nov 12, 2024

Pull Request Test Coverage Report for Build 12243401376

Details

  • 171 of 202 (84.65%) changed or added relevant lines in 10 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.7%) to 77.893%

Changes Missing Coverage Covered Lines Changed/Added Lines %
miniscope_io/cli/stream.py 0 1 0.0%
miniscope_io/types.py 26 27 96.3%
miniscope_io/stream_daq.py 5 8 62.5%
miniscope_io/cli/common.py 0 10 0.0%
miniscope_io/models/mixins.py 127 143 88.81%
Files with Coverage Reduction New Missed Lines %
miniscope_io/cli/stream.py 1 0.0%
Totals Coverage Status
Change from base Build 12208310258: 0.7%
Covered Lines: 1205
Relevant Lines: 1547

💛 - Coveralls

@sneakers-the-rat sneakers-the-rat marked this pull request as ready for review November 16, 2024 05:26
@sneakers-the-rat
Copy link
Collaborator Author

ok @t-sasatani updated the root comment with a description of this PR :) it looks big but a lot of it is from converting the formats to yaml and also how we don't have linting on for the tests and my IDE reformatted some of the test files, oop.

Copy link
Collaborator

@t-sasatani t-sasatani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I generally like being able to call IDs instead of directly specifying file paths and making the YAML configs explicitly coupled with models.

Please correct me if I'm wrong, but other parts of this PR felt like it's leaning towards making the YAML config files primarily code-edited instead of manual-edited, which I think raises the bar for people who don't code much. Structure-wise, it makes sense if this is an intermediate step for making miniscope-io fully GUI-based, but I'm unsure if we should merge this part at this point. This is mostly about developer interface, and we've looked at this too much, so it could be good to ask opinions from the other folks trying to integrate this io.

Overall, my current opinions are:

  • id: I prefer using the directory/stemname as the ID.
  • mio_model: let's do this.
  • mio_version: I prefer not having this in the config file itself because it'll make it a mix of code/hand-edited entities.

while we're mostly CLI.

@@ -0,0 +1,41 @@
id: wirefree-sd-layout-battery
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might not fully understand the benefit of having a dedicated id field when we could also use the file path directly. (I assume it'll be one configuration per one YAML file here.)

  • Can't we use the stem path under config as a unique ID to call configs? Like wirefree/sd-layout-battery for this example or maybe user/some-config if it's a user directory.
  • I feel there could be some confusion with duplicate IDs, and ambiguous IDs, especially because they will be hidden from the human perspective. We could add tests to prevent duplicates, but I think using file paths as unique identifiers is more intuitive.
  • There will be multiple configs for one device. I think we'll have higher risks of ambiguous configs IDs getting pushed (like ScopeX-1, ScopeX-2, ScopeX-old, ScopeX-latest, and ScopeY for a device called ScopeX), compared to when we just use path/filename as IDs.
  • Being able to have nice versions in IDs feels nice, but I think it'll be pretty confusing if the filename doesn't sync with it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ope responded to this as a global comment before i read the individual comments. Responded to most of these there, but for continuity's sake: currently we get both, and there are reasons to have both, and they don't really trade off.

I feel there could be some confusion with duplicate IDs, and ambiguous IDs,

this would be an interface question. so we would want something like

$ mio config --show
id                    path
sd-layout-battery     miniscope_io/data/configs/sd-layout-battery.yaml
my-config             {user_dir}/configs/my-config.yaml
my-config             {user_dir}/configs/my-config-2.yaml

but again you can always specify something as a path if you want to.

There will be multiple configs for one device. I think we'll have higher risks of ambiguous configs IDs getting pushed (like ScopeX-1, ScopeX-2, ScopeX-old, ScopeX-latest, and ScopeY for a device called ScopeX), compared to when we just use path/filename as IDs.

the goal is that we don't want to have scope-v1, scope-v2, etc. The idea behind having the id is that we want to have a way of identifying unique configurations semantically, so then different versions could each be id: scope-default with a different mio_version field to identify that they are related configs that are for different versions of the scope. Then the id only contains the minimal semantic information necessary to identify it: like one would usually just use wireless to run their wireless miniscope, but if they had some custom configuration they could use wireless-mycustomconfig everywhere and not have to worry about keeping the rest of their code up to date with the current location of that file, or the different file names as the version changes.

Being able to have nice versions in IDs feels nice, but I think it'll be pretty confusing if the filename doesn't sync with it.

Filenames are unique identifiers to the system, but we want to provide an interface that potentially allows people to have multiple versions of the same configuration, and exposing them in the filesystem makes for nicer ux than just having them in a database somewhere. so that's the compromise - have an id field and allow it to be decoupled from the filename.

if someone wants to refer to something by the filename, then they just go path/filename.yaml. if someone wants to refer to something by the id, they go myconfig. one can make the filename match the if they want to, but they have the ability to make it orthogonal to filename in cases where that's useful.

There will be multiple configs for one device. I think we'll have higher risks of ambiguous configs IDs getting pushed (like ScopeX-1, ScopeX-2, ScopeX-old, ScopeX-latest, and ScopeY for a device called ScopeX), compared to when we just use path/filename as IDs

Comment on lines +123 to +133
for config_file in chain(*globs):
try:
file_id = yaml_peek("id", config_file)
if file_id == id:
init_logger("config").debug(
"Model for %s found at %s", cls._model_name(), config_file
)
return cls.from_yaml(config_file)
except KeyError:
continue
raise KeyError(f"No config with id {id} found in {Config().config_dir}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be better to return errors when there are duplicates (maybe when generating)? Though I feel it's better to use filepath as IDs as in the above comment.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure sure, we could check and warn for duplicates here, that would be np

Comment on lines -127 to +128
target-version = "py311"
target-version = "py39"
Copy link
Collaborator

@t-sasatani t-sasatani Dec 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly out of curiosity, but do we want to do this, or is it just a leftover from something? Might be better to have some notes if 3.11 isn't good because I believe 3.9's end of support is soon

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh sorry ya, this is because ruff uses this as the minimum target, so it was constantly suggesting changes to the code that were incompatible with our lowest version. (specifically Union[X,Y] to X | Y which 3.9 doesn't support). 3.9 should be EOL next october, so basically this value should always stay pegged to our lowest version we support and bumped then (rather than the max version, which was 3.11 at the time i set this iirc)

Comment on lines +78 to +80
* `id` - unique identifier for this config
* `mio_model` - fully-qualified module path to model class
* `mio_version` - version of miniscope-io when this model was created
Copy link
Collaborator

@t-sasatani t-sasatani Dec 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like having the model here, and even if this is manually filled, having validation for this could be useful.

However, I'm unsure about the rest because I assume these YAML config files will usually be created by copy-paste and manually filling them out. Having this mio_version seems to make these primarily made via a script or CLI command, which I feel will raise the bar without much benefit and cause confusion when hand-edited (which I assume will be the norm).

I feel it'll be more intuitive if we assume the configs will be manually edited. And maybe write a CI flow that documents the mio_version that the config YAML was last changed from the commit ID if we need that.

I've written about the id in other parts, so I will skip it here.

Copy link
Collaborator

@t-sasatani t-sasatani Dec 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe generating something like this with CI?
https://miniscope-io--84.org.readthedocs.build/en/84/history/config.html

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy-paste and manually filling them out.

I think for most deployments we should assume that people won't be looking at the source code - we assume pip install miniscope-io is the default way to get the package. So then they won't have anything to copy/paste, and we need to provide a cli or programmatic interface for dealing with configs. that's sorta the whole idea with the design of having an id field and the way it was implemented as being something that gets autofilled when loaded and then put at the top of the yaml file so we could use a cheap yaml_peek method to look through many files quickly.

So i think this is a pretty easy way to start using the program

$ pip install miniscope-io
$ mio capture --config wireless

and then when i want to make my own config

$ mio config create wireless --to wireless-custom
# or 
$ mio config create wireless --to my_config_dir/wireless_custom.yaml

and then manually edit it.

CI flow that documents the mio_version that the config YAML was last changed from the commit ID if we need that.

we're looking to support user-custom configs here, so we assume they aren't committing anything to the repository (and don't have to fork it to use it) - versions need to be totally independent from the git flow

@sneakers-the-rat
Copy link
Collaborator Author

other parts of this PR felt like it's leaning towards making the YAML config files primarily code-edited instead of manual-edited, which I think raises the bar for people who don't code much
mio_version: I prefer not having this in the config file itself because it'll make it a mix of code/hand-edited entities.

Fair point! The idea for the design here is that it can be both: you can totally hand-code your configs if you want to, and then when they are loaded the proper metadata is stamped into the hand-coded config files if it's not already there.

So the anticipated workflow would be someone doing something like this

# kwargs version
mio config create --from wireless-base --to my-wireless-base
# positional args version
mio config create wireless-base my-wireless-base

to copy the config file with id: wireless-base from the miniscope_io/data/configs directory to {USER_DIR}/configs/my-wireless-base.yaml.

Then they could edit it however they want by hand, and use it with the id my-wireless-base anywhere in the program. The automatic editing part would only happen if someone hand-wrote something from scratch (which we don't expect to happen, because the margin for error is high and providing a cli interface to create a new config from a base config should be the easiest way to do it), it would add the metadata fields, put them at the top, but otherwise leave the config unchanged.

this should actually make it easier to hand-maintain custom configs over time, specifically the mio_version field, because if someone has a custom config but the model changes in the code s.t. the old config is invalid, the intention is that we would write some migration that allows someone's old config to be updated to the new version - "this_field was renamed to that_field - prompt user to ask if they want to upgrade their config, if yes, make changes and stamp new version" kinda stuff

I prefer using the directory/stemname as the ID.

the idea for having a single id slug is to make it agnostic to location: we want to be able to refer to something uniquely from anywhere. so say when someone publishes their work, they might repackage their data and their config and etc. into a custom directory structure that fits for that journal's expectation, or some format, etc.

As-written though, we get both: you can either provide an id or a relative path. so if you provide something like directory/stemname.yaml then we try and locate it first relative to the cwd, and then if nothing is found there, relative to the config_dir. So the id field is strictly additive and benefits both cli as well as programmatic usage: in the future, we and others may want to make GUIs that just provide, e.g. a dropdown list of available configurations. So by providing a standard way of declaring and looking up those configs, we accomplish that while not losing the ability to specify configs by path, if that's more familiar to people.

@t-sasatani
Copy link
Collaborator

t-sasatani commented Dec 7, 2024

I agree that it will be the same for people used to coding and will be better for maintaining.
So, the points I mostly disagree with are these:

  • So the id field is strictly additive and benefits both cli as well as programmatic usage
  • providing a cli interface to create a new config from a base config should be the easiest way to do it

Function-wise, it is strictly additive, as you say. However, UI-wise, I think it's pretty negative to do this, requiring the user to read through the CLI reference to run a command they only run once (the alternative is just to copy and paste the YAML file), providing many ways to do the same thing, mixing up manually edited fields and generated fields in the same config file, etc.

I see a benefit and agree that limiting the flow's entrance to a CLI is good if the objective is to keep the IDs unique and the fields organized for future updates, but I see a pretty critical trade-off here. I think at this point it's more important lowering the bar/simplifying as much as possible.

@sneakers-the-rat
Copy link
Collaborator Author

requiring the user to read through the CLI reference to run a command they only run once (the alternative is just to copy and paste the YAML file) providing many ways to do the same thing, mixing up manually edited fields and generated fields in the same config file, etc.

  • the user will not usually have access to the straight up yaml files (i.e. pip install miniscope-io is the default access path, as well as any downstream packaging that is done)
  • the cli should be the primary entrypoint to the package, reading through the docs to know which files to copy paste is actually more of a burden than mio config create --help
  • the manually edited fields and autogenerated fields are actually clearly defined - the values are only autogenerated when they are absent, entirely independently of user input, where all explicitly defined fields are honored.

I see a benefit and agree that limiting the flow's entrance to a CLI is good if the objective is to keep the IDs unique and the fields organized for future updates, but I see a pretty critical trade-off here. I think at this point it's more important lowering the bar/simplifying as much as possible.

yep! and what i'm saying is that mio config create wireless is simpler than "go to the git repo website, navigate to this folder, copy and paste it to a file in this specific directory, and then refer to it by its relative path name"

@sneakers-the-rat
Copy link
Collaborator Author

sneakers-the-rat commented Dec 7, 2024

if you wanted to just copy and paste the files and pass relative paths, you can - this change is fully backwards compatible

@t-sasatani
Copy link
Collaborator

t-sasatani commented Dec 9, 2024

Ok, there's no reason to actively oppose this, as it is true that this is fully backward-compatible. Also, I guess my view of the user is mixed up; I considered tool developers as the users because they will decide whether to use this, and the end users will have no business with the device configs for a while. I thought it'd be nice if it looked a bit easier for developers to join the coding, but if they will likely treat this io as a well-defined black box downloaded via pip, as you say, there's no issue.

Related note: I think it is perfect timing to pull in other Miniscope developers (i.e., our lab people to start) in the dev to make our lives easier, but I'm a bit concerned that people seem to be a bit overwhelmed, which I think makes total sense. I don't know a good way to approach this, but it's rather non-ideal that it's mostly the two of us for nearly 1.5 years, though it should be a bridging project. These nesting, assuring multiple pathways, rich versioning, etc., will benefit future scalability, but it could also somewhat raise the bar for joining (at least, it is raising my mental bar to add stuff a bit, even though I think I've read most of the code). I totally agree these changes are worth it and get this project closer to what it ultimately should look like, but it could make sense to slow down a bit on these future scalability aspects.

@sneakers-the-rat
Copy link
Collaborator Author

it's rather non-ideal that it's mostly the two of us for nearly 1.5 years, though it should be a bridging project.

totally agree. in trying to get hemal up to speed, the config system being split across the format models and the yaml files was a big barrier, so hopefully this simplifies things down a bit, and there is still a decent big of work to do to simplify implementation since the sd-card classes and the stream daq are basically two independent implementations.

@sneakers-the-rat sneakers-the-rat merged commit f611a59 into main Dec 9, 2024
16 checks passed
@sneakers-the-rat sneakers-the-rat deleted the formats-to-yaml branch December 13, 2024 03:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

embed config metadata, id, type, and version in all config files by default
2 participants