Most Grafana dashboards are bad. Too many panels, unclear queries, inconsistent color schemes, no clear purpose. Here are the principles I apply now.

Rule 1: Every dashboard has one question

Start by writing down: “What question does this dashboard answer?”

Good:

  • “Is the order service healthy right now?”
  • “How is the nightly ETL job progressing?”
  • “What is the cost trend for our compute in the last 30 days?”

Bad:

  • “Production metrics”
  • “Database overview”

If you can’t state the question in one sentence, you don’t know what the dashboard is for.

Rule 2: The top row tells the story

The top row should answer the main question at a glance: error rate, p99 latency, queue depth, whatever matters for that question. Use stat panels with color thresholds so the state is readable without parsing a graph.

Secondary details (breakdowns by endpoint, instance, etc.) go below.
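
As a rough illustration of a top-row panel, the sketch below builds the JSON for one stat panel with green/yellow/red thresholds. The helper name `stat_panel`, the metric `http_requests_total`, its `service` and `code` labels, and the threshold values are all assumptions made for the example. The field layout follows the panel JSON that recent Grafana versions export, so compare it against an export from your own instance before relying on it.

```python
import json


def stat_panel(title, expr, unit, warn, crit):
    """Return a Grafana stat panel dict that turns yellow at `warn` and red at `crit`."""
    return {
        "type": "stat",
        "title": title,
        "targets": [{"refId": "A", "expr": expr}],
        "fieldConfig": {
            "defaults": {
                "unit": unit,
                "thresholds": {
                    "mode": "absolute",
                    # The first step has no lower bound, so it acts as the base color.
                    "steps": [
                        {"color": "green", "value": None},
                        {"color": "yellow", "value": warn},
                        {"color": "red", "value": crit},
                    ],
                },
            },
            "overrides": [],
        },
        # Color the whole tile rather than only the number.
        "options": {"colorMode": "background"},
    }


# Hypothetical metric and labels; substitute your own.
ERROR_RATIO = (
    'sum(rate(http_requests_total{service="order", code=~"5.."}[5m]))'
    ' / sum(rate(http_requests_total{service="order"}[5m]))'
)

panel = stat_panel(
    "Order service: 5xx ratio, all regions", ERROR_RATIO, "percentunit", 0.01, 0.05
)
print(json.dumps(panel, indent=2))
```

Keeping the thresholds in one helper means every top-row panel uses the same three colors in the same order.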

Rule 3: Use consistent color semantics

  • Green = healthy
  • Yellow = warning
  • Red = broken
  • Blue = neutral / informational

Don’t use red for “request rate” just because it looks nice. Colors communicate meaning.

Rule 4: Label everything

Every panel needs:

  • Clear title (“HTTP 5xx per second, all regions”, not “Requests”)
  • Axis units (%, B/s, req/s)
  • Legend with meaningful names (template variables help)

Without labels, dashboards are useless in 6 months when you return to them.

Each critical panel should also link to its runbook. Grafana renders Markdown in panel descriptions, so the link can live there:

[panel description](https://runbooks.example.com/service-x/5xx)

When oncall gets paged at 3am and opens Grafana, they should be one click away from “what to do”.
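
These labeling rules are easy to check mechanically once dashboards live as JSON. The following is a minimal sketch of such a check, not an existing tool: `lint_panel`, the three-word title heuristic, and the `critical` flag are assumptions made for the example, and the panel fields it reads (`title`, `description`, `fieldConfig.defaults.unit`, `targets[].legendFormat`) are the ones Grafana uses for Prometheus-backed panels.

```python
import re

# Matches a Markdown link such as [Runbook](https://runbooks.example.com/...).
RUNBOOK_LINK = re.compile(r"\[[^\]]+\]\(https?://[^)]+\)")


def lint_panel(panel, critical=False):
    """Return a list of labeling problems for one panel dict."""
    problems = []

    # Heuristic (an assumption): a one- or two-word title is rarely self-explanatory.
    if len(panel.get("title", "").split()) < 3:
        problems.append("title: too short to say what is measured")

    defaults = panel.get("fieldConfig", {}).get("defaults", {})
    if not defaults.get("unit"):
        problems.append("unit: no axis unit set")

    for target in panel.get("targets", []):
        if not target.get("legendFormat"):
            problems.append(f"legend: query {target.get('refId', '?')} has no legendFormat")

    if critical and not RUNBOOK_LINK.search(panel.get("description", "")):
        problems.append("runbook: no link in the panel description")

    return problems


bad = {"title": "Requests", "targets": [{"refId": "A", "expr": "up"}]}
good = {
    "title": "HTTP 5xx per second, all regions",
    "description": "[Runbook](https://runbooks.example.com/service-x/5xx)",
    "fieldConfig": {"defaults": {"unit": "reqps"}},
    "targets": [{"refId": "A", "expr": "up", "legendFormat": "{{region}}"}],
}

for problem in lint_panel(bad, critical=True):
    print(problem)
```

Run over every panel in a dashboard file, a check like this could sit in CI next to the dashboard JSON.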

Anti-patterns I see constantly

  1. Graph walls: 30 panels on one page. Nobody reads all of them. Split into focused dashboards.

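  2. Last-data-point queries: Wrapping everything in last_over_time() makes a panel keep showing the most recent sample after the series has stopped reporting, so stale data looks current. Query the live series instead: rate() over counters, aggregated with sum(), over a short range, so that gaps show up as gaps.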

  3. Hardcoded service names: Use template variables so one dashboard serves all your services.

  4. Bad time ranges: Default to “last 1 hour” for operational dashboards and “last 24 hours” for trend dashboards. Never save a dashboard with fixed dates as its default range.

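  5. Percentiles from counter queries: A plain request counter carries no latency distribution, so you can’t get a p99 out of its rate. Record latency as a histogram and apply histogram_quantile() to the rate of its buckets; this works with both classic (_bucket series) and native histograms.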
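
To make items 2, 3, and 5 concrete, here is a small sketch that builds the PromQL strings a panel would use. The function names, the `service` dashboard variable, and the metric names `http_request_duration_seconds` and `http_requests_total` are assumptions for illustration; `$__rate_interval` is Grafana's built-in range variable for rate() queries.

```python
# Assumed: a dashboard variable named `service`, so nothing is hardcoded.
SELECTOR = 'service="$service"'
WINDOW = "$__rate_interval"


def p99_latency_expr(metric, native=False):
    """Build a p99 query from histogram data rather than from a plain counter."""
    if native:
        # Native histograms carry their buckets inside each sample: no `le` label.
        return f"histogram_quantile(0.99, sum(rate({metric}{{{SELECTOR}}}[{WINDOW}])))"
    # Classic histograms expose one `<metric>_bucket` series per `le` boundary,
    # so `le` must survive the aggregation.
    return (
        "histogram_quantile(0.99, sum by (le) ("
        f"rate({metric}_bucket{{{SELECTOR}}}[{WINDOW}])))"
    )


def request_rate_expr(counter):
    """Per-second request rate from a counter, recomputed on every refresh."""
    return f"sum(rate({counter}{{{SELECTOR}}}[{WINDOW}]))"


print(p99_latency_expr("http_request_duration_seconds"))
print(p99_latency_expr("http_request_duration_seconds", native=True))
print(request_rate_expr("http_requests_total"))
```

The classic variant keeps `le` in the `sum by` clause because histogram_quantile() needs the bucket boundaries to interpolate.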

Grafana-as-code

Store dashboards in git as JSON. Generate the JSON with Jsonnet (libsonnet libraries such as Grafonnet) and deploy it with Terraform. Manual edits in the UI drift from what is in git and never get reviewed.

# Deploy the dashboard from the JSON file tracked in git.
resource "grafana_dashboard" "order_service" {
  config_json = file("${path.module}/dashboards/order-service.json")
}
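
The JSON file that the Terraform resource reads can itself be generated. Below is a minimal sketch in Python, standing in for Jsonnet here. The function name `render_dashboard`, the `uid`, and the metric in the variable query are hypothetical, and the top-level keys are a small subset of Grafana's dashboard JSON model. It also bakes in two of the defaults discussed earlier: a relative one-hour time range and a `service` template variable.

```python
import json


def render_dashboard(title, uid, panels):
    """Serialize a dashboard deterministically so git diffs stay small."""
    dashboard = {
        "uid": uid,
        "title": title,
        # Relative default range for an operational dashboard.
        "time": {"from": "now-1h", "to": "now"},
        "templating": {
            "list": [
                {
                    "name": "service",
                    "type": "query",
                    # Hypothetical metric; any series with a `service` label works.
                    "query": "label_values(http_requests_total, service)",
                }
            ]
        },
        "panels": panels,
    }
    # Sorted keys and fixed indentation keep the output stable between runs.
    return json.dumps(dashboard, indent=2, sort_keys=True) + "\n"


if __name__ == "__main__":
    # Prints to stdout; redirect into dashboards/order-service.json in CI.
    print(render_dashboard("Order service health", "order-service", []), end="")
```

Because the output is deterministic, a review diff shows only the panels or settings that actually changed.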

The 80/20 tool

For most teams, one of the established monitoring methods is a solid default:

  • USE (Utilization/Saturation/Errors) for infrastructure
  • RED (Rate/Errors/Duration) for services
  • Four Golden Signals (Latency/Traffic/Errors/Saturation): RED plus Saturation

Pick one framework and apply it consistently. Custom dashboards should have a clear reason to exist.