Proposal: Agent healthcheck endpoint to advance production-readiness of deployed agents #73

jakedoublev · 2023-10-10T16:37:19Z

Is your feature request related to a problem? Please describe.
As it stands, the spec has no universal standard for liveness/readiness of an Agent running in the context of a system. Until I receive a bad response to a task-related request from a vendor Agent, a model, or a custom Agent, I don't know there are problems with my Agent (within the context of the spec).

At a high level, I believe an open protocol capably applied across all Agent implementations should treat an Agent as any other service within a stack and consider things like integrating observability, events/metrics, shutdown, etc., but this proposal is limited in scope to binary healthy/unhealthy discovery.

Describe the solution you'd like
Kubernetes actually has a great solution in the form of two health check probes, liveness and readiness. In that context, the liveness healthcheck returns either a 200 or an unhealthy HTTP status like 400/500 and indicates whether the service is alive, and the readiness healthcheck does the same but also ensures all dependencies are alive (such as a connection to a database).
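
To make the liveness/readiness split concrete, here is a minimal Python sketch using only the standard library. The `/live` and `/ready` sub-paths, the JSON body shape, the use of 503 for "not ready", and the `check_dependencies` placeholder are all assumptions for illustration; none of them is part of the Agent Protocol or taken from Kubernetes itself.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict, Optional, Tuple

# Hypothetical paths for illustration only; not defined by the Agent Protocol.
LIVENESS_PATH = "/ap/v1/agent/health_check/live"
READINESS_PATH = "/ap/v1/agent/health_check/ready"


def check_dependencies() -> Dict[str, bool]:
    """Placeholder: a real agent would ping its model, database, etc."""
    return {"model": True, "database": True}


def probe(path: str, deps: Optional[Dict[str, bool]] = None) -> Tuple[int, dict]:
    """Return (HTTP status, JSON-serializable body) for a probe path."""
    if path == LIVENESS_PATH:
        # Liveness: the process is up and able to answer at all.
        return 200, {"status": "alive"}
    if path == READINESS_PATH:
        # Readiness: every dependency must also be reachable.
        deps = check_dependencies() if deps is None else deps
        ready = all(deps.values())
        body = {"status": "ready" if ready else "not_ready", "dependencies": deps}
        return (200 if ready else 503), body
    return 404, {"error": "not found"}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        status, body = probe(self.path)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8000) -> None:
    """Blocking helper for trying the sketch locally."""
    HTTPServer(("127.0.0.1", port), HealthHandler).serve_forever()
```

Keeping the status decision in a plain `probe` function, separate from the HTTP handler, means the same logic could sit behind whatever web framework an Agent implementation already uses.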

An Agent that depends on a connection to a single model arguably has no external dependencies in this sense, since it could simply be considered not alive/unhealthy if there is no connection to its single LLM. But I think we could easily see a future where a single orchestrator Agent is facilitating interactions between multiple models, and a truly universally applied spec needs to consider such circumstances.
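
As a rough sketch of what readiness could mean for such an orchestrator, the Python snippet below aggregates several per-dependency checks into one report. The names `readiness_report`, `model_a`, and `model_b` are hypothetical, and treating any raised exception as "unhealthy" is an assumption, not something the spec says.

```python
from typing import Callable, Dict

# Each check is a zero-argument callable returning True when the dependency responds.
Check = Callable[[], bool]


def readiness_report(checks: Dict[str, Check]) -> Dict[str, object]:
    """Run every dependency check; a raised exception counts as unhealthy."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    # With no dependencies at all, all() is True, so the agent reports ready.
    return {"ready": all(results.values()), "dependencies": results}


def failing_check() -> bool:
    raise ConnectionError("model endpoint unreachable")  # simulated outage


# Hypothetical dependency names for an orchestrator fronting two models.
report = readiness_report({"model_a": lambda: True, "model_b": failing_check})
print(report)
# {'ready': False, 'dependencies': {'model_a': True, 'model_b': False}}
```

Reporting per-dependency results alongside the overall flag is one design option; a strictly binary endpoint could return only the `ready` value.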

An endpoint /ap/v1/agent/health_check would be a good place to capture health-related inquiries and I'd love to hear more discussion from there about:

Whether liveness and readiness are both requirements
Whether it needs to be a concern of the spec beyond providing a dedicated endpoint to query health and can be left as an Agent-implementation concern from there

I think being unable to tell whether a non-200 response is due to my Agent failing to start or to a bug in my implementation is a poor developer experience, so Agent implementations will solve for this on their own, and without a protocol-driven spec their solutions will undoubtedly differ.

Describe alternatives you've considered
The alternative is mostly just not providing a way to query an Agent's health state, which is the current status of the Agent Protocol. Agent health then remains the responsibility of each Agent-specific implementation and not of the protocol, which leads to a lack of consistency and will promote vendor lock-in should Agents evolve into 3rd-party SaaS tooling.

Additional context
It could be a worthwhile discussion to extend this conversation to things like Agent deployment versioning (/ap/v1/agent/version) and other deployment state/service context discovery as well, but again, this proposal is limited in scope to solely a health check.

jzanecook · 2023-10-11T19:40:31Z

I'm curious what your thoughts are on the info endpoint. I posted a potential schema for it in Issue #39.

It didn't cover a health check for readiness and liveness, but I think that could technically be added there, or as some other type of standardized status message.

The schema we were considering does include a version for both the Agent Protocol and the Agent itself, which would be useful for clients.

jakedoublev · 2023-10-11T23:50:18Z

That's an interesting proposal @jzanecook. I have some thoughts about that info proposal that are distinctly different from this one, but I believe they are and should be separate concerns. I think it's worthwhile deciding earlier rather than later how prescriptive the agent protocol needs to be. For what it's worth, I prefer open/extensible protocols rather than restrictive ones, and I also believe a mechanism in the protocol for Agent metadata is valuable but is a separate concern from liveness or readiness.

I would find a lot of value in being able to GET liveness with a 200 response as good enough to detect liveness of an Agent. I think the readiness check is distinctly different because it indicates that dependent resources are also alive and responsive. Agent metadata is also valuable but a GET for info/metadata should be different than a GET for a factor of health. My preference in RESTful architectures (since this protocol appears to assume HTTP so far) is always to keep separate endpoints for separate resources.
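
For illustration, a client-side liveness check along those lines might look like the Python sketch below. It assumes the `/ap/v1/agent/health_check` path proposed in this issue (not something the protocol currently defines); `is_alive` and the timeout default are hypothetical.

```python
import urllib.error
import urllib.request


def is_alive(base_url: str, timeout: float = 2.0) -> bool:
    """Return True only if GET <base_url>/ap/v1/agent/health_check answers 200."""
    url = base_url.rstrip("/") + "/ap/v1/agent/health_check"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Non-2xx responses (HTTPError), refused connections, and timeouts land here.
        return False
```

A caller would use it as `is_alive("http://localhost:8000")`, where the host and port are placeholders for wherever the Agent is deployed.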
