Live Dashboards
Real metrics from the running GKE cluster. Grafana dashboards powered by Google Managed Prometheus.
Agent Platform Overview
Agent runs, failure rates, token usage, and gateway health — the four numbers that matter.
Errors & Infrastructure
Error rates, web endpoint health, and gateway readiness — the operational view.
Key Metrics Explained
Every agent execution — broken down by model (gemini-2.5-pro, gemma-3-27b-it, etc.), status (success/failed), and instance name. The primary throughput indicator.
Output tokens generated by model. Directly correlates with Vertex AI cost. The dashboard shows tokens/minute by model tier so you can see where spend is going.
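A tokens/minute figure of this kind can be derived from a cumulative counter by dividing the increase over a window by the window length. The sketch below is illustrative only: it assumes the token metric is a monotonically increasing counter sampled per model, the sample values are made up, and it ignores counter resets, which a real Prometheus `rate()` would handle.

```python
def tokens_per_minute(samples):
    """Average tokens/minute between the first and last sample.

    samples: list of (timestamp_seconds, cumulative_output_tokens),
    sorted by time. Counter resets are NOT handled in this sketch.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("timestamps must increase")
    return (v1 - v0) / (t1 - t0) * 60


# Hypothetical counter samples over a 5-minute window, keyed by model.
counters = {
    "gemini-2.5-pro": [(0, 10_000), (300, 70_000)],
    "gemma-3-27b-it": [(0, 5_000), (300, 20_000)],
}

rates = {model: tokens_per_minute(s) for model, s in counters.items()}
for model, rate in rates.items():
    print(f"{model}: {rate:.0f} tokens/min, {rate * 60:.0f} tokens/hr")

# Extrapolate to an hourly total and compare with the 2M tokens/hr
# threshold used by the TokenBudgetCritical alert.
total_per_hour = sum(rates.values()) * 60
print("over budget:", total_per_hour > 2_000_000)
```

Splitting the rate by model before summing is what makes it possible to see which tier drives the spend.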
End-to-end agent execution time in milliseconds. The dashboard shows P50, P95, and P99 to catch tail latency issues — especially important for interactive channels.
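To see why percentiles catch tail latency that an average hides, here is a minimal sketch using the nearest-rank method on raw samples. The durations are hypothetical, and the dashboard itself most likely computes percentiles from histogram buckets rather than raw values, so treat this as an illustration of the idea rather than the dashboard's query.

```python
import math


def percentile(samples_ms, q):
    """Nearest-rank percentile: the smallest sample such that at least
    q percent of all samples are less than or equal to it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical run durations in milliseconds: mostly fast, one slow outlier.
durations_ms = [800, 900, 950, 1000, 1100, 1200, 1300, 1500, 2000, 45000]

summary = {f"p{q}": percentile(durations_ms, q) for q in (50, 95, 99)}
mean_ms = sum(durations_ms) / len(durations_ms)

print(summary)   # {'p50': 1100, 'p95': 45000, 'p99': 45000}
print(mean_ms)   # 5575.0
```

The median stays near one second while P95 and P99 expose the 45-second run, which is the case an interactive user would actually notice.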
Every tool/skill sidecar call, by tool name and status. Shows which tools agents use most and which are failing — essential for debugging complex multi-step agent runs.
Web endpoint gateway readiness. When this drops to 0, the OpenAI-compatible API is down. Alerts fire after 5 minutes of not-ready state.
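The "fires after 5 minutes of not-ready state" behavior means a brief blip does not alert; only a continuous run of zero readings does. A minimal sketch of that hold logic follows, assuming readiness is sampled as 1 (ready) or 0 (not ready). The function name and sample data are hypothetical.

```python
def alert_fires(samples, hold_seconds=300):
    """Return True if readiness has been 0 continuously for at least
    hold_seconds as of the last sample.

    samples: list of (timestamp_seconds, ready) sorted by time,
    where ready is 1 (ready) or 0 (not ready).
    """
    not_ready_since = None
    for ts, ready in samples:
        if ready == 0:
            # Remember when the current not-ready streak started.
            if not_ready_since is None:
                not_ready_since = ts
        else:
            # Any ready sample resets the streak.
            not_ready_since = None
    if not_ready_since is None:
        return False
    return samples[-1][0] - not_ready_since >= hold_seconds


# Hypothetical samples taken every 60 seconds.
blip = [(0, 1), (60, 0), (120, 1), (180, 1)]
outage = [(0, 1), (60, 0), (120, 0), (180, 0), (240, 0), (300, 0), (360, 0)]

print(alert_fires(blip))    # False: the gateway recovered after one sample
print(alert_fires(outage))  # True: not ready from t=60 to t=360 (300 s)
```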
Unhandled errors across all components. Spikes here correlate with agent failures and should be investigated alongside the failure rate panel.
Alert Rules
Built-in alerting via the Google Managed Prometheus (GMP) Rules CRD. These alerts fire automatically when a threshold is breached.
| Alert | Condition | Severity | Why It Matters |
|---|---|---|---|
| AgentRunFailureRateHigh | >25% failure rate over 5m | warning | Agents are failing: check Vertex AI quota, model availability, or prompt issues |
| AgentRunFailureRateCritical | >50% failure rate over 5m | critical | Majority of runs failing: likely a platform-level issue, not individual agents |
| TokenBudgetCritical | >2M output tokens/hr | critical | Significant Vertex AI spend: a runaway agent or unexpected traffic spike |
| AgentRunLatencyHigh | P95 >2min over 10m | warning | Interactive users are waiting too long: check model load or prompt complexity |
| ToolErrorRateHigh | >30% tool errors over 5m | warning | A specific tool sidecar is broken: check RBAC, image pull, or API connectivity |
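The two failure-rate rules share one signal with two thresholds. As a rough sketch of how the tiers relate, the function below maps run counts from a 5-minute window to a severity. The function name and inputs are hypothetical, and the real rules are evaluated by GMP rather than by application code.

```python
def failure_rate_severity(failed, total):
    """Map run counts from a 5-minute window to an alert severity.

    Thresholds mirror the table: above 50% is critical, above 25% is
    warning. Returns None when no alert applies or there were no runs.
    """
    if total == 0:
        return None  # No runs in the window, so no rate to evaluate.
    rate = failed / total
    if rate > 0.50:
        return "critical"
    if rate > 0.25:
        return "warning"
    return None


print(failure_rate_severity(30, 100))  # warning
print(failure_rate_severity(60, 100))  # critical
print(failure_rate_severity(25, 100))  # None: thresholds are strict
```

Checking the critical threshold first keeps the result to a single severity, even though in practice both alerts would be active above 50%.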