Live Dashboards — Sympozium GCP

Agent Platform Overview

Agent runs, failure rates, token usage, and gateway health — the four numbers that matter.

Dashboard loading... If this doesn't appear, ensure Grafana is configured for anonymous access.

Errors & Infrastructure

Error rates, web endpoint health, and gateway readiness — the operational view.

Dashboard loading... If this doesn't appear, ensure Grafana is configured for anonymous access.

Key Metrics Explained

sympozium_agent_runs_total

Counter

Every agent execution — broken down by model (gemini-2.5-pro, gemma-3-27b-it, etc.), status (success/failed), and instance name. The primary throughput indicator.

gen_ai.usage.output_tokens_total

Counter

Output tokens generated by model. Directly correlates with Vertex AI cost. The dashboard shows tokens/minute by model tier so you can see where spend is going.

sympozium_agent_run_duration

Histogram

End-to-end agent execution time in milliseconds. The dashboard shows P50, P95, and P99 to catch tail latency issues — especially important for interactive channels.

sympozium_tool_invocations_total

Counter

Every tool/skill sidecar call, by tool name and status. Shows which tools agents use most and which are failing — essential for debugging complex multi-step agent runs.

sympozium_gateway_ready

Gauge

Web endpoint gateway readiness. When this drops to 0, the OpenAI-compatible API is down. Alerts fire after 5 minutes of not-ready state.

sympozium_errors_total

Counter

Unhandled errors across all components. Spikes here correlate with agent failures and should be investigated alongside the failure rate panel.

Alert Rules

Built-in alerting via GMP Rules CRD. These fire automatically when thresholds are breached.

Alert	Condition	Severity	Why It Matters
`AgentRunFailureRateHigh`	>25% failure rate over 5m	warning	Agents are failing — check Vertex AI quota, model availability, or prompt issues
`AgentRunFailureRateCritical`	>50% failure rate over 5m	critical	Majority of runs failing — likely a platform-level issue, not individual agents
`TokenBudgetCritical`	>2M output tokens/hr	critical	Significant Vertex AI spend — a runaway agent or unexpected traffic spike
`AgentRunLatencyHigh`	P95 >2min over 10m	warning	Interactive users are waiting too long — check model load or prompt complexity
`ToolErrorRateHigh`	>30% tool errors over 5m	warning	A specific tool sidecar is broken — check RBAC, image pull, or API connectivity