Agent Platform Overview

Agent runs, failure rates, token usage, and gateway health — the four numbers that matter.

Dashboard loading... If this doesn't appear, ensure Grafana is configured for anonymous access.

Errors & Infrastructure

Error rates, web endpoint health, and gateway readiness — the operational view.


Key Metrics Explained

sympozium_agent_runs_total
Counter

Counts every agent execution, labeled by model (gemini-2.5-pro, gemma-3-27b-it, etc.), status (success/failed), and instance name. This is the primary throughput indicator.
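
As a rough illustration of how a failure rate can be derived from this counter, the sketch below compares two scrapes and computes the failed share of the runs that happened in between. The sample layout (a dict keyed by model, status, and instance) and the function names are assumptions made for this example, not the exporter's actual format.

```python
from typing import Dict, Tuple

# Hypothetical sample layout: (model, status, instance) -> cumulative counter value.
Sample = Dict[Tuple[str, str, str], float]


def counter_increase(prev: float, curr: float) -> float:
    """Increase of a monotonic counter between two scrapes.

    If the value went down, the process restarted and the counter reset,
    so the current value is taken as the whole increase since the reset.
    """
    return curr - prev if curr >= prev else curr


def failure_rate(prev: Sample, curr: Sample) -> float:
    """Fraction of runs with status 'failed' between two scrapes (0.0 if no runs)."""
    total = 0.0
    failed = 0.0
    for key, value in curr.items():
        delta = counter_increase(prev.get(key, 0.0), value)
        total += delta
        if key[1] == "failed":
            failed += delta
    return failed / total if total else 0.0


before = {
    ("gemini-2.5-pro", "success", "demo"): 100.0,
    ("gemini-2.5-pro", "failed", "demo"): 10.0,
}
after = {
    ("gemini-2.5-pro", "success", "demo"): 130.0,
    ("gemini-2.5-pro", "failed", "demo"): 20.0,
}
print(failure_rate(before, after))  # 10 failed out of 40 new runs -> 0.25
```

Working from increases rather than raw totals matters because a counter only ever grows: the raw value reflects the whole process lifetime, while the difference between scrapes reflects recent behavior.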

gen_ai.usage.output_tokens_total
Counter

Output tokens generated, broken down by model. This correlates directly with Vertex AI cost. The dashboard shows tokens/minute by model tier so you can see where spend is going.
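
The tokens/minute view can be thought of as a per-model rate over the counter. The following is a minimal sketch under the assumption that two snapshots of the cumulative totals per model are available; the dict layout and the `tokens_per_minute` helper are hypothetical and only illustrate the arithmetic.

```python
from typing import Dict


def tokens_per_minute(
    prev: Dict[str, float], curr: Dict[str, float], elapsed_seconds: float
) -> Dict[str, float]:
    """Per-model output token rate between two snapshots of a cumulative counter."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    rates = {}
    for model, total in curr.items():
        previous = prev.get(model, 0.0)
        # A drop in the total means the counter was reset; count from zero.
        increase = total - previous if total >= previous else total
        rates[model] = increase / elapsed_seconds * 60.0
    return rates


earlier = {"gemini-2.5-pro": 1_000_000, "gemma-3-27b-it": 400_000}
later = {"gemini-2.5-pro": 1_030_000, "gemma-3-27b-it": 406_000}
rates = tokens_per_minute(earlier, later, elapsed_seconds=30.0)
print(rates)  # {'gemini-2.5-pro': 60000.0, 'gemma-3-27b-it': 12000.0}
```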

sympozium_agent_run_duration
Histogram

End-to-end agent execution time in milliseconds. The dashboard shows P50, P95, and P99 to catch tail latency issues — especially important for interactive channels.
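
Percentiles such as P95 are not stored directly in a histogram; they are estimated from cumulative bucket counts. The sketch below shows the usual linear-interpolation estimate used for Prometheus-style histograms. The bucket boundaries in the example are invented for illustration and are not the boundaries this metric actually uses.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile falls in the overflow bucket: report the
                # highest finite bound, since no upper limit is known.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical bucket bounds in milliseconds with cumulative counts.
duration_buckets = [(1000.0, 50.0), (5000.0, 90.0), (30000.0, 99.0), (math.inf, 100.0)]
print(histogram_quantile(0.50, duration_buckets))  # 1000.0
print(histogram_quantile(0.95, duration_buckets))  # roughly 18889 ms
```

Because the estimate interpolates inside a bucket, its accuracy depends on how finely the buckets are spaced around the latencies of interest; wide buckets in the tail make P95 and P99 coarse.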

sympozium_tool_invocations_total
Counter

Every tool/skill sidecar call, by tool name and status. Shows which tools agents use most and which are failing — essential for debugging complex multi-step agent runs.

sympozium_gateway_ready
Gauge

Readiness of the web endpoint gateway. When this drops to 0, the OpenAI-compatible API is down. An alert fires once the gateway has been not ready for 5 minutes.
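
The 5-minute delay means a brief blip does not page anyone; only a sustained not-ready state does. A minimal sketch of that hold-period logic follows, assuming the gauge is sampled as (timestamp, value) pairs in time order. The function name and sample format are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def not_ready_alert_fires(
    samples: Iterable[Tuple[float, float]], hold_seconds: float = 300.0
) -> bool:
    """Return True if, at the last sample, the gauge has been 0 for hold_seconds.

    samples: (timestamp_seconds, gauge_value) pairs in increasing time order.
    """
    not_ready_since: Optional[float] = None
    firing = False
    for timestamp, value in samples:
        if value == 0:
            if not_ready_since is None:
                not_ready_since = timestamp
            firing = timestamp - not_ready_since >= hold_seconds
        else:
            # Any ready sample resets the hold period.
            not_ready_since = None
            firing = False
    return firing


sustained = [(0, 1), (60, 0), (120, 0), (180, 0), (240, 0), (300, 0), (360, 0)]
blip = [(0, 0), (60, 0), (120, 1), (180, 0)]
print(not_ready_alert_fires(sustained))  # True: not ready from t=60 to t=360
print(not_ready_alert_fires(blip))       # False: recovered at t=120
```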

sympozium_errors_total
Counter

Unhandled errors across all components. Spikes here correlate with agent failures and should be investigated alongside the failure rate panel.

Alert Rules

Built-in alerting is defined through the Google Cloud Managed Service for Prometheus (GMP) Rules CRD. The alerts fire automatically when their thresholds are breached.

| Alert | Condition | Severity | Why It Matters |
|---|---|---|---|
| AgentRunFailureRateHigh | >25% failure rate over 5m | warning | Agents are failing; check Vertex AI quota, model availability, or prompt issues |
| AgentRunFailureRateCritical | >50% failure rate over 5m | critical | Majority of runs failing; likely a platform-level issue, not individual agents |
| TokenBudgetCritical | >2M output tokens/hr | critical | Significant Vertex AI spend; a runaway agent or unexpected traffic spike |
| AgentRunLatencyHigh | P95 >2min over 10m | warning | Interactive users are waiting too long; check model load or prompt complexity |
| ToolErrorRateHigh | >30% tool errors over 5m | warning | A specific tool sidecar is broken; check RBAC, image pull, or API connectivity |
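
To make the thresholds concrete, the sketch below encodes the five rules as data and checks a set of already-aggregated measurements against them. This is only an illustration of the conditions listed above, not how the Rules CRD evaluates them: the measurement keys are hypothetical, and the 5m and 10m windows are assumed to have been applied before the values are passed in.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Rule:
    name: str
    signal: str       # hypothetical key into the measurements dict
    threshold: float  # rule fires when the value is strictly greater
    severity: str


RULES = [
    Rule("AgentRunFailureRateHigh", "failure_rate_5m", 0.25, "warning"),
    Rule("AgentRunFailureRateCritical", "failure_rate_5m", 0.50, "critical"),
    Rule("TokenBudgetCritical", "output_tokens_per_hour", 2_000_000, "critical"),
    Rule("AgentRunLatencyHigh", "p95_latency_seconds_10m", 120.0, "warning"),
    Rule("ToolErrorRateHigh", "tool_error_rate_5m", 0.30, "warning"),
]


def firing_alerts(measurements: Dict[str, float]) -> List[str]:
    """Names of the rules whose threshold is exceeded; missing signals count as 0."""
    return [r.name for r in RULES if measurements.get(r.signal, 0.0) > r.threshold]


measurements = {
    "failure_rate_5m": 0.40,
    "output_tokens_per_hour": 2_500_000,
    "p95_latency_seconds_10m": 45.0,
    "tool_error_rate_5m": 0.10,
}
print(firing_alerts(measurements))
# ['AgentRunFailureRateHigh', 'TokenBudgetCritical']
```

Note that a failure rate above 50% trips both failure-rate rules at once, so the warning and the critical alert fire together in that case.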