Grafana Dashboards That Don't Suck: Principles and Anti-Patterns

Most Grafana dashboards are bad: too many panels, unclear queries, inconsistent color schemes, and no clear purpose. Here are the principles I apply now.

Rule 1: Every dashboard has one question

Start by writing down: “What question does this dashboard answer?”

Good:

“Is the order service healthy right now?”
“How is the nightly ETL job progressing?”
“What is the cost trend for our compute in the last 30 days?”

Bad:

“Production metrics”
“Database overview”

If you can’t state the question in one sentence, you don’t know what the dashboard is for.

Rule 2: The top row tells the story

Top panels should answer the main question at a glance. Error rate, latency p99, queue depth, whatever matters. Use stat panels with color thresholds.

Secondary details (breakdowns by endpoint, instance, etc.) go below.
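As a rough sketch of what a top-row stat panel with color thresholds looks like in dashboard JSON, here is a small Python helper that builds one. The metric name `http_requests_total`, the `code` label, and the 1% / 5% thresholds are assumptions for illustration, and the field names follow the dashboard JSON model as I understand it, so check them against your Grafana version.

```python
import json


def stat_panel(title, expr, unit, warn, crit, x=0):
    """Build a Grafana stat panel dict with green/yellow/red thresholds."""
    return {
        "type": "stat",
        "title": title,
        # y=0 places the panel in the top row; the grid is 24 columns wide.
        "gridPos": {"x": x, "y": 0, "w": 6, "h": 4},
        "targets": [{"refId": "A", "expr": expr}],
        "options": {"colorMode": "background"},
        "fieldConfig": {
            "defaults": {
                "unit": unit,
                "thresholds": {
                    "mode": "absolute",
                    "steps": [
                        {"color": "green", "value": None},  # base step
                        {"color": "yellow", "value": warn},
                        {"color": "red", "value": crit},
                    ],
                },
            }
        },
    }


# Hypothetical metric and thresholds: 1% warns, 5% is broken.
error_rate = stat_panel(
    title="HTTP 5xx ratio, all regions",
    expr='sum(rate(http_requests_total{code=~"5.."}[5m]))'
    " / sum(rate(http_requests_total[5m]))",
    unit="percentunit",
    warn=0.01,
    crit=0.05,
)
print(json.dumps(error_rate, indent=2))
```

Keeping the thresholds in one helper also means every top-row panel gets the same green/yellow/red meaning.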

Rule 3: Use consistent color semantics

Green = healthy
Yellow = warning
Red = broken
Blue = neutral / informational

Don’t use red for “request rate” just because it looks nice. Colors communicate meaning.

Rule 4: Label everything

Every panel needs:

A clear title (“HTTP 5xx per second, all regions”, not “Requests”)
Axis units (%, B/s, req/s)
Legend with meaningful names (template variables help)

Without labels, a dashboard is useless when you come back to it six months later.
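If dashboards live in git as JSON, these labeling rules can be checked mechanically. The following is a minimal sketch of such a lint, not an official Grafana tool. The "fewer than three words" title rule is an arbitrary heuristic of mine, and the example metric name is hypothetical.

```python
def lint_panel(panel):
    """Return a list of labeling problems found in one panel dict."""
    problems = []
    title = (panel.get("title") or "").strip()
    # Assumed heuristic: a one- or two-word title rarely says what is measured.
    if len(title.split()) < 3:
        problems.append("title missing or too terse")
    unit = panel.get("fieldConfig", {}).get("defaults", {}).get("unit")
    if not unit:
        problems.append("no unit set")
    for target in panel.get("targets", []):
        if not target.get("legendFormat"):
            problems.append(f"query {target.get('refId', '?')} has no legendFormat")
    return problems


good = {
    "title": "HTTP 5xx per second, all regions",
    "fieldConfig": {"defaults": {"unit": "reqps"}},
    "targets": [
        {
            "refId": "A",
            "expr": 'sum by (region) (rate(http_requests_total{code=~"5.."}[5m]))',
            "legendFormat": "{{region}}",
        }
    ],
}
bad = {"title": "Requests", "targets": [{"refId": "A", "expr": "up"}]}

print(lint_panel(good))  # []
print(lint_panel(bad))
```

Running something like this in CI catches the “Requests” panel before it reaches production.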

Rule 5: Link to runbooks

Each critical panel should link to a runbook, for example with a Markdown link in the panel description:

[panel description](https://runbooks.example.com/service-x/5xx)

When oncall gets paged at 3am and opens Grafana, they should be one click away from “what to do”.
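For dashboards generated from code, the runbook link can be attached by a helper instead of by hand. This sketch assumes the panel JSON accepts a `links` list with `title`, `url`, and `targetBlank` fields and a Markdown `description`. That matches the dashboard JSON I have seen, but verify it against your Grafana version. The helper name `with_runbook` is my own.

```python
def with_runbook(panel, runbook_url):
    """Return a copy of the panel with a runbook link attached."""
    linked = dict(panel)  # shallow copy, the original panel is left untouched
    note = f"[Runbook]({runbook_url})"
    existing = panel.get("description", "")
    # Keep any existing description and append the runbook link below it.
    linked["description"] = f"{existing}\n\n{note}" if existing else note
    linked["links"] = list(panel.get("links", [])) + [
        {"title": "Runbook", "url": runbook_url, "targetBlank": True}
    ]
    return linked


panel = {"type": "timeseries", "title": "HTTP 5xx per second, all regions"}
linked = with_runbook(panel, "https://runbooks.example.com/service-x/5xx")
print(linked["description"])
```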

Anti-patterns I see constantly

Graph walls: 30 panels on one page. Nobody reads all of them. Split into focused dashboards.
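Last-data-point queries: last_over_time() returns the most recent sample anywhere in its lookback window, so a panel keeps showing a value after the series has stopped reporting, with no warning that the data is stale. Prefer rate() and sum() over short windows, so missing data shows up as a gap.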
Hardcoded service names: Use template variables so one dashboard serves all your services.
Bad time ranges: Default to “last 1 hour” for operational dashboards and “last 24 hours” for trend dashboards. Never default to fixed dates.
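Percentiles from counter queries: A plain request counter carries no latency distribution, so you cannot derive p99 from its rate, and averaging pre-computed percentiles across instances is just as wrong. Use histogram_quantile() over the rate of histogram buckets, aggregated by le for classic histograms, or over native histograms.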
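Three of these fixes (a template variable, a relative default time range, and a histogram-based p99) fit into one small dashboard skeleton. This is a sketch under assumptions: the metric `http_request_duration_seconds_bucket` and its `service` label are hypothetical, and the variable definition is trimmed to the fields I am confident about.

```python
import json

# p99 from classic histogram buckets: rate() per bucket, sum by le, then quantile.
P99_EXPR = (
    "histogram_quantile(0.99, sum by (le) ("
    'rate(http_request_duration_seconds_bucket{service="$service"}[5m])))'
)

dashboard = {
    "title": "Service health",
    # Relative range: operational dashboards open on the last hour.
    "time": {"from": "now-1h", "to": "now"},
    "templating": {
        "list": [
            {
                "name": "service",
                "type": "query",
                # Populate the dropdown from label values instead of hardcoding.
                "query": "label_values(http_request_duration_seconds_bucket, service)",
            }
        ]
    },
    "panels": [
        {
            "type": "timeseries",
            "title": "p99 request latency, $service",
            "targets": [{"refId": "A", "expr": P99_EXPR, "legendFormat": "p99"}],
            "fieldConfig": {"defaults": {"unit": "s"}},
        }
    ],
}

print(json.dumps(dashboard, indent=2))
```

The same JSON then serves every service that exposes the metric, selected through the `$service` dropdown.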

Grafana-as-code

Store dashboards in git as JSON. Generate them with Jsonnet (.libsonnet libraries such as Grafonnet) and provision them with Terraform. Manual edits in the UI drift from what is in git and never get reviewed.

resource "grafana_dashboard" "order_service" {
  config_json = file("${path.module}/dashboards/order-service.json")
}
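The JSON file that Terraform loads has to come from somewhere. If Jsonnet is not already in use, a small script is enough to start with. Here is a minimal sketch that writes a dashboard dict as stable, diff-friendly JSON. The `uid`-based file name is my convention, not a Grafana or Terraform requirement.

```python
import json
import tempfile
from pathlib import Path


def write_dashboard(dashboard, out_dir):
    """Write a dashboard dict as stable JSON and return the file path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{dashboard['uid']}.json"
    # Sorted keys and a fixed indent keep git diffs limited to real changes.
    path.write_text(json.dumps(dashboard, indent=2, sort_keys=True) + "\n")
    return path


# Demo in a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    demo = {"uid": "order-service", "title": "Order service health", "panels": []}
    written = write_dashboard(demo, tmp)
    print(written.name)  # order-service.json
```

In a real repository, `out_dir` would point at the `dashboards/` directory next to the Terraform module, so the generated file is what `file(...)` reads.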

The 80/20 tool

For most teams, one of the established monitoring methods is a solid default:

USE (Utilization/Saturation/Errors) for infrastructure
RED (Rate/Errors/Duration) for services
Four Golden Signals (Latency/Traffic/Errors/Saturation): roughly RED plus Saturation

Pick one framework and apply it consistently. Custom dashboards should have a clear reason to exist.
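To make RED concrete, here is a sketch that builds the three PromQL queries for a service. The metric names `http_requests_total` and `http_request_duration_seconds_bucket` and the `service` and `code` labels are assumptions, so substitute whatever your instrumentation exposes.

```python
def red_queries(selector='service="$service"', window="5m"):
    """Return PromQL strings for Rate, Errors, and Duration (p99)."""
    return {
        # Rate: requests per second.
        "rate": f"sum(rate(http_requests_total{{{selector}}}[{window}]))",
        # Errors: share of requests answered with a 5xx status.
        "errors": (
            f'sum(rate(http_requests_total{{{selector}, code=~"5.."}}[{window}]))'
            f" / sum(rate(http_requests_total{{{selector}}}[{window}]))"
        ),
        # Duration: p99 latency from histogram buckets.
        "duration": (
            "histogram_quantile(0.99, sum by (le) ("
            f"rate(http_request_duration_seconds_bucket{{{selector}}}[{window}])))"
        ),
    }


queries = red_queries()
for name, expr in queries.items():
    print(f"{name}: {expr}")
```

Each of the three queries maps naturally onto one top-row panel.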
