Most Grafana dashboards are bad: too many panels, unclear queries, inconsistent color schemes, no clear purpose. Here are the principles I apply now.
Rule 1: Every dashboard has one question
Start by writing down: “What question does this dashboard answer?”
Good:
- “Is the order service healthy right now?”
- “How is the nightly ETL job progressing?”
- “What is the cost trend for our compute in the last 30 days?”
Bad:
- “Production metrics”
- “Database overview”
If you can’t state the question in one sentence, you don’t know what the dashboard is for.
Rule 2: The top row tells the story
Top panels should answer the main question at a glance: error rate, p99 latency, queue depth, whatever matters. Use stat panels with color thresholds.
Secondary details (breakdowns by endpoint, instance, etc.) go below.
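As a rough sketch of what a top-row panel can look like in dashboard JSON, the Python below builds a stat panel with color thresholds. The helper name, the metric name in the query, and the threshold values are made up for illustration. The field names follow the panel JSON that Grafana exports, but check them against your Grafana version.

```python
import json


def stat_panel(title, expr, unit, warn, crit, x=0):
    """Build a stat panel dict with green/yellow/red thresholds.

    Hypothetical helper: warn and crit are the values at which the
    panel turns yellow and red.
    """
    return {
        "type": "stat",
        "title": title,
        "gridPos": {"x": x, "y": 0, "w": 6, "h": 4},  # y=0 puts it in the top row
        "targets": [{"refId": "A", "expr": expr}],
        "options": {"colorMode": "background"},
        "fieldConfig": {
            "defaults": {
                "unit": unit,
                "thresholds": {
                    "mode": "absolute",
                    "steps": [
                        {"color": "green", "value": None},  # base step, no lower bound
                        {"color": "yellow", "value": warn},
                        {"color": "red", "value": crit},
                    ],
                },
            }
        },
    }


# Assumed metric: http_requests_total with a "code" label.
error_ratio = stat_panel(
    title="HTTP 5xx ratio, all regions",
    expr='sum(rate(http_requests_total{code=~"5.."}[5m]))'
         ' / sum(rate(http_requests_total[5m]))',
    unit="percentunit",
    warn=0.01,
    crit=0.05,
)
print(json.dumps(error_ratio, indent=2))
```

Breakdown panels can reuse the same helper with a larger `y` in `gridPos` so they land below the summary row.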
Rule 3: Use consistent color semantics
- Green = healthy
- Yellow = warning
- Red = broken
- Blue = neutral / informational
Don’t use red for “request rate” just because it looks nice. Colors communicate meaning.
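One way to keep the semantics consistent is to define the palette once and have every generated panel pull from it. This is only a sketch: the constant and function names are hypothetical, and the color names are assumed to match Grafana's built-in named colors.

```python
# Single source of truth for what each color means.
SEMANTIC_COLORS = {
    "healthy": "green",
    "warning": "yellow",
    "broken": "red",
    "neutral": "blue",
}


def health_thresholds(warn, crit):
    """Threshold steps for panels where higher values are worse."""
    return {
        "mode": "absolute",
        "steps": [
            {"color": SEMANTIC_COLORS["healthy"], "value": None},
            {"color": SEMANTIC_COLORS["warning"], "value": warn},
            {"color": SEMANTIC_COLORS["broken"], "value": crit},
        ],
    }


def neutral_color():
    """Fixed color for informational series such as request rate."""
    return {"mode": "fixed", "fixedColor": SEMANTIC_COLORS["neutral"]}


# A request-rate panel carries no health judgment, so it stays blue.
request_rate_defaults = {"unit": "reqps", "color": neutral_color()}
print(request_rate_defaults)
```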
Rule 4: Label everything
Every panel needs:
- Clear title (not “Requests” — “HTTP 5xx per second, all regions”)
- Axis units (%, B/s, req/s)
- Legend with meaningful names (template variables help)
Without labels, a dashboard is useless when you return to it six months later.
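If dashboards live in git as JSON, the checklist above can be enforced with a small lint step. The sketch below assumes the exported panel layout (`title`, `fieldConfig.defaults.unit`, `targets[].legendFormat`). The function name and the "fewer than three words" heuristic for a terse title are arbitrary choices for illustration.

```python
def lint_panel(panel):
    """Return a list of labeling problems found in one panel dict."""
    problems = []

    # Heuristic only: a one- or two-word title is probably too vague.
    title = panel.get("title", "").strip()
    if len(title.split()) < 3:
        problems.append("title is missing or too terse")

    unit = panel.get("fieldConfig", {}).get("defaults", {}).get("unit")
    if not unit:
        problems.append("no axis unit set")

    for target in panel.get("targets", []):
        if not target.get("legendFormat"):
            ref = target.get("refId", "?")
            problems.append(f"query {ref} has no legend format")

    return problems


vague = {"title": "Requests", "targets": [{"refId": "A", "expr": "up"}]}
for problem in lint_panel(vague):
    print(problem)
```

Running this in CI over every panel in every dashboard file would fail the build before an unlabeled panel reaches production.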
Rule 5: Link to runbooks
Each critical panel should have a link to a runbook:
[panel description](https://runbooks.example.com/service-x/5xx)
When oncall gets paged at 3am and opens Grafana, they should be one click away from “what to do”.
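For generated dashboards, the link can be attached programmatically. A minimal sketch, assuming the panel JSON accepts a `links` list with `title`, `url`, and `targetBlank` entries; the helper name is hypothetical and the URL is the example one from above.

```python
def with_runbook(panel, url):
    """Return a copy of the panel with a runbook link appended."""
    linked = dict(panel)  # shallow copy so the input stays untouched
    linked["links"] = list(panel.get("links", [])) + [
        {"title": "Runbook", "url": url, "targetBlank": True}
    ]
    return linked


panel = {"type": "stat", "title": "HTTP 5xx per second, all regions"}
linked = with_runbook(panel, "https://runbooks.example.com/service-x/5xx")
print(linked["links"])
```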
Anti-patterns I see constantly
Graph walls: 30 panels on one page. Nobody reads all of them. Split into focused dashboards.
Last-data-point queries: Using last_over_time() everywhere makes dashboards show stale data without warning. Use rate() and sum() with short intervals.
Hardcoded service names: Use template variables so one dashboard serves all your services.
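To make the template-variable advice concrete, here is a sketch that defines a `service` variable and a rate() query that uses it. The metric name `http_requests_total` and the `service` label are assumptions, and the exact shape of the variable's `query` field differs between Grafana versions and data sources, so treat the dict as illustrative.

```python
def service_variable(metric="http_requests_total", label="service"):
    """Dashboard variable listing every value of a label (assumed names)."""
    return {
        "name": "service",
        "type": "query",
        "query": f"label_values({metric}, {label})",
        "refresh": 1,  # re-query the values on dashboard load
        "multi": False,
        "includeAll": False,
    }


def request_rate_expr(window="5m"):
    """PromQL using $service instead of a hardcoded service name."""
    return f'sum(rate(http_requests_total{{service="$service"}}[{window}]))'


templating = {"list": [service_variable()]}
print(templating["list"][0]["query"])
print(request_rate_expr())
```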
Bad time ranges: Default to “last 1 hour” for operational dashboards. “Last 24h” for trend dashboards. Never default to fixed dates.
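In dashboard JSON the default range lives in the top-level `time` field, and relative values such as `now-1h` keep it from going stale. A small sketch; the "operational" and "trend" labels and the helper name are hypothetical.

```python
# Relative defaults per dashboard kind; never absolute dates.
DEFAULT_RANGES = {
    "operational": "now-1h",
    "trend": "now-24h",
}


def with_default_time(dashboard, kind):
    """Return a copy of the dashboard with a relative default time range."""
    updated = dict(dashboard)
    updated["time"] = {"from": DEFAULT_RANGES[kind], "to": "now"}
    return updated


ops = with_default_time({"title": "Is the order service healthy right now?"}, "operational")
print(ops["time"])
```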
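Percentiles from counter queries: If you’re computing p99 from a rate of request counters, you’re wrong. A plain counter says nothing about how latencies are distributed. Use histogram_quantile() on histogram buckets, ideally native histograms.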
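The reason a histogram is required is that a quantile is estimated from how observations are spread across buckets. The sketch below is a simplified Python version of that interpolation for classic cumulative buckets. It is not Prometheus's actual implementation, and the bucket data is invented.

```python
import math


def bucket_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by bound,
    ending with the +Inf bucket. Simplified: no input validation.
    """
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Falls in the open-ended bucket: report the highest finite bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Invented data: 100 requests, bounds in seconds.
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(bucket_quantile(0.95, latency_buckets))  # about 0.78 seconds

# The equivalent PromQL, with an assumed metric name:
# histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```

The estimate is only as precise as the bucket boundaries, which is one reason to prefer native histograms where they are available.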
Grafana-as-code
Store dashboards in git as JSON. Use Jsonnet/libsonnet or Terraform for generation. Manual dashboard edits drift and never get reviewed.
// Manage the dashboard from the JSON file checked into git.
resource "grafana_dashboard" "order_service" {
  config_json = file("${path.module}/dashboards/order-service.json")
}
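Exported dashboard JSON tends to produce noisy diffs, which makes review harder. One possible pre-commit step, sketched below, drops the top-level `id` and `version` fields (assumed here to be instance-specific or bumped on every save) and writes keys in a stable order. The function name is hypothetical.

```python
import json

# Top-level fields assumed to change without any real edit to the dashboard.
VOLATILE_KEYS = ("id", "version")


def normalize_dashboard(raw_json):
    """Return dashboard JSON with volatile fields removed and keys sorted."""
    dashboard = json.loads(raw_json)
    for key in VOLATILE_KEYS:
        dashboard.pop(key, None)
    return json.dumps(dashboard, indent=2, sort_keys=True) + "\n"


exported = '{"version": 7, "title": "Order service", "id": 12}'
print(normalize_dashboard(exported), end="")
```

With the output committed to git, a change to a query or threshold shows up as a small, reviewable diff.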
The 80/20 tool
For most teams, one of the established method patterns is a solid default:
- USE (Utilization/Saturation/Errors) for infrastructure
- RED (Rate/Errors/Duration) for services
- Four Golden Signals (Latency/Traffic/Errors/Saturation): RED plus Saturation
Pick one framework and apply it consistently. Custom dashboards should have a clear reason to exist.
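As an illustration of applying one framework consistently, the sketch below generates the three RED queries for any service. The metric names (`http_requests_total`, `http_request_duration_seconds_bucket`) and the `service` and `code` labels are assumptions based on common naming conventions, so adapt them to what your services actually export.

```python
def red_queries(service, window="5m"):
    """Return PromQL for Rate, Errors, and Duration (p99) of one service."""
    all_requests = f'{{service="{service}"}}'
    failed_requests = f'{{service="{service}", code=~"5.."}}'
    return {
        "rate": f"sum(rate(http_requests_total{all_requests}[{window}]))",
        "errors": f"sum(rate(http_requests_total{failed_requests}[{window}]))",
        "duration_p99": (
            "histogram_quantile(0.99, sum by (le) "
            f"(rate(http_request_duration_seconds_bucket{all_requests}[{window}])))"
        ),
    }


for name, expr in red_queries("orders").items():
    print(f"{name}: {expr}")
```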