feat(metrics): implement Prometheus observability #45
Open
MatteoMori wants to merge 4 commits into kagent-dev:main
Replace generateRuntimeMetrics() with prometheus/client_golang and add a flexible metrics server architecture supporting same-port or dedicated-port deployment.

Changes:
- Add internal/metrics package with custom Prometheus registry
- Configurable metrics port via --metrics-port flag (default: 8084)
- Two-server architecture with proper WaitGroup coordination
- Graceful shutdown for both main and metrics servers
- Export kagent_tools_mcp_server_info (version metadata)
- Export kagent_tools_mcp_registered_tools (tool providers)
- Include Go runtime metrics (goroutines, memory, GC stats)
- Include process metrics (CPU, memory, file descriptors)

Architecture improvement: Move http.Server instantiation outside goroutines to prevent a race condition between assignment and shutdown.

Test coverage: 5 unit tests validating registry, collectors, and metrics.

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: MatteoMori <morimatteo14@gmail.com>
Use MCPServer.ListTools() to automatically detect which tools each provider registers, eliminating the need to modify individual tool packages.

The approach snapshots the tool list before and after each provider's RegisterTools() call, then records the newly added tools in Prometheus with the correct tool_provider label.

This means:
- Zero changes required in any pkg/ file
- Future tools are automatically tracked
- No risk of forgetting to add a metric for a new tool

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: MatteoMori <morimatteo14@gmail.com>
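The before/after snapshot idea from the commit above might look something like this in isolation. `fakeServer`, `newlyRegistered`, `trackProvider`, and the tool names are all hypothetical stand-ins; the actual change reads the list from MCPServer.ListTools() and records the result in Prometheus, neither of which is reproduced here.

```go
package main

import (
	"fmt"
	"sort"
)

// fakeServer is a hypothetical stand-in for the MCP server; only the
// registered tool names matter for this sketch.
type fakeServer struct{ tools []string }

// ToolNames returns a copy, so a snapshot is unaffected by later registrations.
func (s *fakeServer) ToolNames() []string { return append([]string(nil), s.tools...) }

// AddTool simulates a provider registering one tool.
func (s *fakeServer) AddTool(name string) { s.tools = append(s.tools, name) }

// newlyRegistered returns the names present in after but not in before, sorted.
func newlyRegistered(before, after []string) []string {
	seen := make(map[string]struct{}, len(before))
	for _, name := range before {
		seen[name] = struct{}{}
	}
	var added []string
	for _, name := range after {
		if _, ok := seen[name]; !ok {
			added = append(added, name)
		}
	}
	sort.Strings(added)
	return added
}

// trackProvider snapshots the tool list around one provider's registration
// call and returns the tools that the provider added.
func trackProvider(s *fakeServer, register func(*fakeServer)) []string {
	before := s.ToolNames()
	register(s)
	return newlyRegistered(before, s.ToolNames())
}

func main() {
	s := &fakeServer{}
	// Provider and tool names below are made up for illustration.
	providers := []struct {
		name     string
		register func(*fakeServer)
	}{
		{"k8s", func(s *fakeServer) { s.AddTool("k8s_get_pods"); s.AddTool("k8s_describe") }},
		{"helm", func(s *fakeServer) { s.AddTool("helm_list") }},
	}
	for _, p := range providers {
		for _, tool := range trackProvider(s, p.register) {
			// A real implementation would set a gauge labelled with
			// tool_name and tool_provider instead of printing.
			fmt.Printf("tool_provider=%s tool_name=%s\n", p.name, tool)
		}
	}
}
```

Because the diff is computed centrally, a provider that later gains a tool is picked up without touching the provider's own package, which matches the benefit the commit message claims.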
Add kagent_tools_mcp_invocations_total and kagent_tools_mcp_invocations_failure_total counters using the wrapper/middleware pattern. All handlers are centrally instrumented in wrapToolHandlersWithMetrics with zero changes to pkg/ files.

Update README with Observability section and CLI flags reference.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: MatteoMori <morimatteo14@gmail.com>
Add comprehensive Prometheus Operator integration via Helm chart:
- ServiceMonitor resource for automatic target discovery
- Dedicated metrics service (kagent-tools-metrics)
- Deployment args for --metrics-port configuration
- Configurable scrape interval, timeout, and labels

Include Grafana dashboard with 8 panels visualizing:
- Server version and health metrics
- Tool invocation rates by provider
- Success/failure rates and trends
- Top invoked tools table with heat mapping

Add CLAUDE.md with architecture documentation covering:
- Tool provider pattern and MCP server lifecycle
- Observability architecture (metrics wrapper pattern)
- Development commands and key implementation patterns
- Helm chart structure and troubleshooting guide

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: MatteoMori <morimatteo14@gmail.com>
02aaa2c to 569d744
While working with the kAgent Tool MCP Server, I noticed the absence of basic Prometheus metrics. Personally, I would find it very useful to know things like "what tools am I exposing?" and "are invocations failing more than usual?", so that I could start building operational best practices around the tool.
So I spent a little bit of time adding some basic metrics to this project.
What does this PR add?
[x] Prometheus server: it can run on the MCP port or on a custom one
[x] 4 initial metrics:
  - kagent_tools_mcp_server_info - Server metadata (version, commit, build date)
  - kagent_tools_mcp_registered_tools - Gauge per tool (tool_name, tool_provider)
  - kagent_tools_mcp_invocations_total - Counter of all invocations (DISCLAIMER: OPUS helped a lot here)
  - kagent_tools_mcp_invocations_failure_total - Counter of failures (DISCLAIMER: OPUS helped a lot here)
[x] updated the Helm chart
[x] added a basic Grafana dashboard