[For Maintainers] Notes on `dev` ‐ `main` merges
All project maintainers are trusted to unilaterally do dev -> main merges as they see fit. The recommended process is through the git CLI (typically preferred over a PR, since the latter introduces an extra commit). The procedure is pretty simple:
- Make sure both `dev` and `main` are up-to-date.
- With the `dev` branch checked out, make sure you are able to, at a minimum: a) run a 2B IT model and produce a correct generation for some standard inputs ("hi how are you?", "tell me about places to visit in [geographic location]", "write code to do [simple task]"), and b) try at least one random new interaction that hasn't been tested recently (e.g. a different prompt, different commands).
  Beyond that, any additional testing, e.g. with 7B IT, is of course beneficial, but the main goal of this final manual approval gate is to sanity check end-to-end behavior before it impacts everyday users checking out from `main`.
- With the `main` branch checked out, perform `git merge dev`.
- `git push` to update the GitHub repo.
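The steps above can be sketched as a pair of shell functions. This is a minimal sketch, not the project's official tooling: the remote name `origin`, the function names, and the `GIT` dry-run variable are assumptions made for illustration, and `--ff-only` on the pulls is an added safeguard that is not part of the procedure as written.

```shell
# GIT can be overridden for a dry run, e.g. GIT="echo git" (illustrative knob).
GIT="${GIT:-git}"

# Bring local main and dev up to date with the remote (assumed name: origin).
# Leaves dev checked out, ready for the manual sanity checks.
sync_branches() {
  $GIT fetch origin &&
  $GIT checkout main &&
  $GIT pull --ff-only origin main &&
  $GIT checkout dev &&
  $GIT pull --ff-only origin dev
}

# Run only after the manual checks on dev have passed:
# merge dev into main, then publish main.
release_to_main() {
  $GIT checkout main &&
  $GIT merge dev &&
  $GIT push origin main
}
```

Setting `GIT="echo git"` prints the commands instead of executing them, which is a cheap way to review the sequence before touching `main`. The manual sanity check on `dev` sits between the two calls.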
Using automation for PR -> `dev` merges and a manual release gate for `dev` -> `main` is meant to balance two forms of availability:
- Don't block - Minimize friction for PRs: take advantage of CI / automation to merge PRs aggressively at relatively high velocity and low overhead, without blocking on the availability of maintainers.
- Don't break - Most end users probably don't care about any individual PR and would be more negatively affected if functionality either broke or changed in unexpected ways. We also want to avoid quasi-pager emergencies of fixing breaking changes or rolling back subtly breaking commits over the weekend/after hours.
This dual track between fast iteration on `dev` and manual gating on `main` gets us most of the benefits of both. For power users who want to track the latest PRs (e.g. bindings maintainers), there's little cost to this, as they can continuously track `dev`. Non-power users probably only notice if something is broken; `main` provides a buffer between fast iteration and their UX.
Can we automate more or all of this?
I think the boundaries of automation could be pushed further. The main blocker is probably that compared to other CI testing, the model artifacts are fairly large and the compute overhead is fairly high. These are not insurmountable though and it would be beneficial to be able to do a "hi how are you?"-type test as part of CI.
However, the interaction surface of LLM apps is not 1:1 transferable from standard applications. The surface area of interaction is a lot more amorphous, and there are corner cases that, if tested, have negative externalities such as inordinately increasing CI time. Two small examples: 1) Code/model artifact syncing - if we update the model artifact (e.g. the recent MHA -> MQA change), including the copy used in CI, we could easily miss that this impacts users with either a new version of the code + old version of the artifact or an old version of the code + new version of the artifact. Running into this in the manual check leads us to update docs + error messages to minimize the impact. 2) While CI is probably limited to short generations to avoid increasing overhead, some bugs may only be visible after longer generations (generation quality degradation, fence-post errors).
Although these two issues could benefit from at least partial mitigation, LLM testing has unique concerns that make comprehensive automated testing a challenge, and at the same time gating dev/main has relatively few costs.
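A "hi how are you?"-type CI check of the kind mentioned above might look roughly like the following. This is only a sketch under stated assumptions: `GEMMA_CMD` is a hypothetical placeholder for whatever command reads a prompt on stdin and prints the generation on stdout (the actual gemma.cpp invocation and flags are deliberately not spelled out), and the check only catches crashes and empty output, which is much weaker than the manual judgment of whether a generation is correct.

```shell
# Minimal smoke test: feed one prompt to the model command and fail if the
# command errors out or produces no visible output.
# GEMMA_CMD is a hypothetical placeholder, not a real gemma.cpp setting.
smoke_test() (
  if [ -z "${GEMMA_CMD:-}" ]; then
    echo "GEMMA_CMD is not set" >&2
    exit 2
  fi
  # Run the model command with the prompt on stdin; a non-zero exit fails the test.
  output="$(printf '%s\n' "$1" | ${GEMMA_CMD})" || exit 1
  # Echo the generation so it shows up in the CI log.
  printf '%s\n' "$output"
  # Fail on empty or whitespace-only generations.
  [ -n "$(printf '%s' "$output" | tr -d '[:space:]')" ]
)
```

A CI job would set `GEMMA_CMD` to the local model invocation and call `smoke_test "hi how are you?"`; keeping the prompt short is what keeps the compute overhead tolerable, at the cost of missing the long-generation bugs described above.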
We've cut releases relatively informally (see https://github.com/google/gemma.cpp/releases). With an OSS project like this, releases are primarily symbolic, but they can be useful to cut:
- To demarcate substantial updates/changes to functionality, features, and/or performance
- To engage the community + provide public recognition for contributions
GitHub has some nice automation around release notes - note that the `What's Changed` and `New Contributors` sections in 0.1.1 are automated suggestions.
We're not targeting a particular cadence. With the current velocity of the project, roughly once every 1-2 months feels like enough of a change to cut a minor version. This could slow down (e.g. quarterly/bi-annually if development is at a steady state) or speed up if there's a particular major feature release planned (e.g. a new model variation supported or a major refactoring).