
SOLR-18147 Make a new Grafana dashboard for Solr 10.x#4210

Draft
janhoy wants to merge 8 commits into apache:main from janhoy:001-grafana-dashboard-solr10

Conversation


@janhoy janhoy commented Mar 12, 2026

https://issues.apache.org/jira/browse/SOLR-18147

  • Brand-new dashboard, built from a mixin source that can regenerate both the dashboard and the alerts.
  • Brings back the monitoring-with-prometheus-and-grafana refguide page, written from scratch, with a new diagram showing Prometheus scraping each Solr node.
  • A solr/monitoring/dev folder with a docker-compose file that starts two Solr nodes, Prometheus, Grafana, Alertmanager and a traffic-ingester container, to easily test metric/Grafana changes locally.
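
For orientation, a stack like the one described might look roughly like the compose sketch below. This is an illustration only; the service names, images, ports and commands are assumptions, not the contents of the actual dev/ compose file in this PR:

```yaml
# Sketch only -- services, images, ports and commands are illustrative
# assumptions, not the real docker-compose file from this PR.
services:
  solr1:
    image: solr:10
    ports: ["8983:8983"]
    command: ["solr", "start", "-f", "-c"]          # SolrCloud, embedded ZK
  solr2:
    image: solr:10
    ports: ["8984:8983"]
    command: ["solr", "start", "-f", "-c", "-z", "solr1:9983"]
    depends_on: [solr1]
  prometheus:
    image: prom/prometheus
    ports: ["9090:9090"]
    volumes: ["./prometheus.yml:/etc/prometheus/prometheus.yml:ro"]
  grafana:
    image: grafana/grafana
    ports: ["3000:3000"]
  alertmanager:
    image: prom/alertmanager
    ports: ["9093:9093"]
```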

Want to review?

This is a first draft, the things most ready for review are the mixin build logic and the dev/ compose setup for local testing.

I would not recommend a detail-focused review of each dashboard panel yet. The dashboard and panels themselves I'd categorize as a first LLM draft; I have done no more than fix them so they display data and react to the variable dropdowns. Everything related to the choice of dashboard rows, the selection and presentation of metrics, and the design of the panels is up for discussion, so the most useful review feedback on the dashboard at this stage is high-level: which rows and panels do we need, and in what style?

I give every committer permission to commit fixes and improvements to this branch, after first announcing what you intend to do in a review comment or ordinary comment. I am not strongly attached to the current row+panel selection.

Current dashboard layout (Draft)

The rows are:

  • Node Overview (open by default) — query/index request rates, latency, cores, disk
  • JVM (open by default) — heap, GC, threads, CPU
  • SolrCloud (collapsed) — Overseer queues, ZK ops, shard leaders
  • Index Health (collapsed) — segments, index size, merge rates, MMap efficiency
  • Cache Efficiency (collapsed) — filter/query/document cache hit rates and evictions
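
For a sense of the queries behind a row like Cache Efficiency, a filter-cache hit-ratio panel is typically a ratio of counter rates. The metric and label names below are placeholders, not necessarily the series Solr 10 actually exports:

```promql
# Filter cache hit ratio over the last 5 minutes (placeholder metric names)
sum(rate(solr_cache_hits_total{cache="filterCache"}[5m]))
  /
sum(rate(solr_cache_lookups_total{cache="filterCache"}[5m]))
```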

Screenshots (attached in the PR): node-overview, jvm-panel, solrcloud, index-health, cache-panel.

Disclaimer: All of this is built by Claude Code.

@janhoy janhoy marked this pull request as draft March 12, 2026 15:52
@github-actions github-actions bot added documentation Improvements or additions to documentation scripts labels Mar 12, 2026
@janhoy janhoy marked this pull request as ready for review March 12, 2026 19:51
@janhoy janhoy requested a review from Copilot March 12, 2026 19:51


@janhoy janhoy requested a review from mlbiscoc March 12, 2026 22:38

janhoy commented Mar 13, 2026

So the foundation is laid, I believe. Technically it is working, and I generally like the "rows" and panels chosen by the AI.

But there are probably useful changes to make. Here are some I can think of:

  • Add a panel for system memory (dependent on SOLR-18159 Add metrics for system memory #4209), perhaps a stacked area with heap-max in it
  • Distinguish between "collection QPS" and "per-core" QPS. I think the metrics include a label for whether they are "local" or not?
  • Add a panel for the number of ZooKeeper nodes "up"
  • Add a panel for the number of Solr nodes "up"
  • Other panels for cluster-level things like number of collections, shard leadership over time
  • Gather more user feedback for what they lack
  • Add OTEL collector to the docker-compose and have it push metrics to the same prometheus, but with a different "cluster" or "environment" label, to test those dropdowns.
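
The two "up" panels above can likely be driven by Prometheus's built-in `up` series, which is 1 for every target scraped successfully. The `job` label values here are assumptions about the scrape config:

```promql
# Solr nodes currently reachable by Prometheus (job label is an assumption)
count(up{job="solr"} == 1)

# Same idea for ZooKeeper, if ZK exporters are scraped under their own job
count(up{job="zookeeper"} == 1)
```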

@janhoy janhoy marked this pull request as draft March 13, 2026 08:58

gus-asf commented Mar 13, 2026

Latency graphs should always show the max, p50 is basically useless... https://www.youtube.com/watch?v=lJ8ydIuPFeU

Also update latency is only rarely interesting... throughput is what most folks care about for indexing, that and stuck/failed documents.
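
For reference, if request latency is exported as a Prometheus histogram, a max-focused panel might pair something like the queries below. The metric and label names are placeholders, not Solr's actual series; note that `histogram_quantile(1.0, ...)` only approximates the max as the upper bound of the highest bucket with observations, so an explicit `*_max` series, if the exporter publishes one, is preferable:

```promql
# p50 (hides the tail, per the comment above)
histogram_quantile(0.5,
  sum by (le) (rate(solr_request_duration_seconds_bucket{handler="/select"}[5m])))

# Approximate worst case: upper bound of the highest bucket seeing traffic
histogram_quantile(1.0,
  sum by (le) (rate(solr_request_duration_seconds_bucket{handler="/select"}[5m])))
```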

@mlbiscoc
Contributor

Thanks Jan, this looks like a great start. I'll find some time to take a look. I really love the docker-compose setup making it easy to test. Something we should also add is a way to turn on the tracing module with this, so we can also see the exemplars Solr now supports in these dashboards. Maybe in a second iteration, since that is definitely out of scope here.


janhoy commented Mar 13, 2026

Latency graphs should always show the max, p50 is basically useless... https://www.youtube.com/watch?v=lJ8ydIuPFeU

Good feedback; I'm adding a max graph to the search latency panel. Let's do that.

Also update latency is only rarely interesting... throughput is what most folks care about for indexing, that and stuck/failed documents.

Yeah, because /update is non-blocking, so it won't tell much other than how large the payload was and perhaps how busy the server was. Let's use that real estate for something better.


janhoy commented Mar 13, 2026

Something we should also add is a way to turn on the tracing module with this, so we can also see the exemplars Solr now supports in these dashboards.

Thought of it, but wanted to keep scope somewhat low, so I think this PR should focus on a Grafana dashboard. Follow-up work could then add an OTEL collector and Jaeger to the dev/ setup. I also discovered Microsoft's Aspire Dashboard project, and I think I'll add it to the compose file. It shows you in real time what OTLP packets (metrics, traces, logs) are received, you can inspect the content of each, and it has a simple trace viewer.

@Jesssullivan

Looking good! +1 on lacing up an OTEL collector next 👀

Thought of it but wanted to keep scope somewhat low, so I think this PR should focus on a GA dashboard. Then follow up work could add OTEL collector and Jaeger to the dev/ setup.


janhoy commented Mar 19, 2026

Are you ok with the location in the monorepo, solr/monitoring? In some ways it belongs more at the top level, but I try to avoid adding stuff there. I considered a separate git repo, but that breaks with our monorepo style, and it is useful to keep the dashboard in sync with the evolution of the app.


mlbiscoc commented Mar 19, 2026

I like the solr/monitoring location over having it at the root, and over putting it in a separate repo. With a separate repo, if we add or change metrics, it'd be hard to see without switching between two repos. I'd vote to keep it how it is.

# ./stack.sh --help # All options
#
# Services (full stack):
# solr1 http://localhost:8983 (SolrCloud node 1, embedded ZooKeeper)

I ❤️ this!


@epugh epugh left a comment


Good progress... There is a lot here that I don't quite grok. Is trafficgen coming out of other perf-related efforts, or just "hey, we need some load"? ;-)


janhoy commented Mar 21, 2026

Good progress... There is a lot here that I don't quite grok. Is trafficgen coming out of other perf-related efforts, or just "hey, we need some load"? ;-)

Trafficgen is just something I wrote earlier, not written for perf testing at all, but to have something happening in a cluster, as it is boring to view a dashboard or traces with nothing going on. This dev/ hack is just convenience tooling to assist when developing or changing dashboards, metrics, modifying OTEL Collector configuration, etc.

Do you feel it is too much to add? Should the entire dev/ folder move to /dev-tools/monitoring instead, and trafficgen to /dev-tools/trafficgen?


6 participants