HDDS-14039. Create Grafana dashboard for Ozone SCM safemode rules and exit #9400

sreejasahithi · 2025-12-01T07:06:56Z

What changes were proposed in this pull request?

This patch introduces a new Grafana dashboard, "SCM Safemode", which contains a chart for each safemode rule showing its target and actual values. It also shows whether the SCM is in safemode as a binary value: 1 if the SCM is in safemode, and 0 if it is out of safemode.

What is the link to the Apache JIRA

HDDS-14039

How was this patch tested?

Green CI : https://github.com/sreejasahithi/ozone/actions/runs/19803545209

Tested over docker cluster:

… exit

jojochuang · 2025-12-01T17:23:52Z

@rnblough

jojochuang · 2025-12-02T06:18:56Z

IMO it would be even better if you can display "In Safe Mode" and "Exited Safe Mode" instead of the numerical 0 and 1.

sreejasahithi · 2025-12-02T09:49:31Z

@errose28 could you please review this PR.

rnblough · 2025-12-03T21:03:02Z

While clear to me, I expect a little confusion from the graph being labelled Binary but going up to 2. Can the Binary axis be limited to 1?

sumitagrawl · 2025-12-05T12:56:41Z

@sreejasahithi These metrics are applicable only during startup, for safemode exit info. Maybe we do not need a dashboard for this. While debugging, we can run a JMX query to check the status.
cc: @errose28

errose28 · 2025-12-05T16:41:35Z

This is essential for Ozone cluster admins/operators to monitor how long it takes their SCM to come out of safemode, especially when doing a rolling restart. Raw JMX queries are poor for usability and do not show trends over time. The dashboard is in its own file and can be ignored without harm when it is not needed.

Tejaskriya

Thanks for working on this @sreejasahithi, left a few suggestions below.

hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/SafeModeMetrics.java

hadoop-ozone/dist/src/main/compose/common/grafana/dashboards/Ozone - SCM Safemode.json

...op-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/SCMSafeModeManager.java

...-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/DataNodeSafeModeRule.java

Tejaskriya · 2025-12-09T15:54:22Z

Also, could you check why the CI seems to be failing?

sreejasahithi · 2025-12-10T09:02:16Z

> Also, could you check why the CI seems to be failing?

I think the failure is not related to my changes.
Could you please help me re-trigger the test?

Tejaskriya · 2025-12-11T10:47:04Z

It seems to be failing at the "Download ozone binary tar" stage. I tried re-triggering it a couple of times, and it still fails. I'll trigger the full run again and let's see if it helps. Can you also merge master when you push the next set of commits?

Tejaskriya

Thanks for the update @sreejasahithi, just a suggestion for the tests.

Tejaskriya · 2025-12-16T09:30:36Z

...dds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/safemode/TestSCMSafeModeManager.java

+    GenericTestUtils.waitFor(() -> !scmSafeModeManager.getInSafeMode() &&
+            scmSafeModeManager.getSafeModeMetrics().getScmInSafeMode().value() == 0,


I would suggest separating the metrics check from the waitFor block in all the occurrences. That way, if there were a failure in the metrics capturing logic rather than in the actual status of the SCM, debugging would be easier.
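As a rough illustration of this suggestion, here is a minimal, self-contained sketch. `FakeSafeMode` and the local `waitFor` helper are hypothetical stand-ins invented for this example; they are not the real `SCMSafeModeManager` or `GenericTestUtils.waitFor`. The point is only the shape: wait on the safemode state alone, then assert the gauge in a separate step so a failure points at the metric rather than at a timeout.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

public class SafeModeMetricCheckSketch {

  /** Hypothetical stand-in for the SCM safemode state and its gauge. */
  static final class FakeSafeMode {
    private final AtomicBoolean inSafeMode = new AtomicBoolean(true);
    private final AtomicInteger scmInSafeModeGauge = new AtomicInteger(1);

    void exitSafeMode() {
      inSafeMode.set(false);
      scmInSafeModeGauge.set(0);
    }

    boolean getInSafeMode() {
      return inSafeMode.get();
    }

    int gaugeValue() {
      return scmInSafeModeGauge.get();
    }
  }

  /** Minimal polling helper written for this sketch only. */
  static void waitFor(BooleanSupplier check, long intervalMs, long timeoutMs)
      throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (!check.getAsBoolean()) {
      if (System.currentTimeMillis() > deadline) {
        throw new IllegalStateException("Timed out waiting for condition");
      }
      Thread.sleep(intervalMs);
    }
  }

  /** Waits on the safemode state only, then reads the gauge separately. */
  static int waitThenReadGauge(FakeSafeMode scm) throws InterruptedException {
    waitFor(() -> !scm.getInSafeMode(), 10, 1000);
    return scm.gaugeValue();
  }

  public static void main(String[] args) throws Exception {
    FakeSafeMode scm = new FakeSafeMode();
    scm.exitSafeMode();
    int gauge = waitThenReadGauge(scm);
    // Separate assertion: a failure here implicates the metric, not the state.
    if (gauge != 0) {
      throw new AssertionError("gauge should be 0 after exit, was " + gauge);
    }
    System.out.println("state and metric agree, gauge=" + gauge);
  }
}
```

With the two checks split, a timeout means the SCM never left safemode, while an assertion failure means the state changed but the gauge was not updated.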

errose28 · 2025-12-18T00:10:20Z

Thanks for adding this. I pulled up the Grafana chart in docker to look around.

> IMO it would be even better if you can display "In Safe Mode" and "Exited Safe Mode" instead of the numerical 0 and 1.

+1 to Wei-Chiu's comment here. We can have text labels and see enter/exit safemode trends over time with Grafana's state timeline. Can we switch the binary plot to use this instead? A red block would indicate when an SCM was in safemode, and a green block would indicate that it is out.

For the threshold to exit safemode on each rule, the two solid lines on top of each other are difficult to read. We can either use a dashed line for the target value, or use a gradient fill where the area at/above the threshold is green and the area below is red. Also, the thresholds are expected to be the same for all SCMs with HA. I think it would be easier to read if we just take the max of thresholds returned by each SCM as a way to reduce this to a single number, and plot that as the exit criteria without a corresponding hostname label.

Can you share screenshots of what the updated dashboards look like in an SCM HA cluster?
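The "max of thresholds" reduction described above can be sketched as follows. This is only an illustration of the arithmetic; in the dashboard itself it would be expressed as a query-side aggregation. The host names and values are made up, and `exitThreshold` is a hypothetical helper, not part of Ozone.

```java
import java.util.Map;

public class ExitThresholdSketch {

  /** Collapses per-SCM thresholds into one exit criterion by taking the max. */
  static long exitThreshold(Map<String, Long> thresholdByScm) {
    return thresholdByScm.values().stream()
        .mapToLong(Long::longValue)
        .max()
        .orElse(0L);
  }

  public static void main(String[] args) {
    // Assumed values: in HA all SCMs are expected to report the same threshold.
    Map<String, Long> thresholds = Map.of("scm1", 3L, "scm2", 3L, "scm3", 3L);
    System.out.println("exit threshold = " + exitThreshold(thresholds));
  }
}
```

Because the thresholds are expected to match across SCMs, the max acts as a way to drop the hostname dimension rather than as a meaningful comparison.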

sumitagrawl · 2025-12-18T08:15:31Z

hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/SafeModeMetrics.java

+  private MutableGaugeInt scmInSafeMode;
+
+  @Metric private MutableGaugeLong numRequiredDatanodesThreshold;
+  @Metric private MutableCounterLong currentRegisteredDatanodesCount;


org.apache.hadoop.hdds.scm.node.SCMNodeMetrics#numNodeReportProcessed
This metric already exists; we do not need a duplicate of it.

The numNodeReportProcessed metric increments on every periodic node report received, while the currentRegisteredDatanodesCount metric increments once per unique datanode registration, so it is not exactly a duplicate.
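To make the distinction concrete, here is a small sketch of the two counting behaviours as described in the reply above. The class, its fields, and the `onNodeReport` / `onRegistration` methods are hypothetical names for illustration and do not mirror the actual `SCMNodeMetrics` or `SafeModeMetrics` code.

```java
import java.util.HashSet;
import java.util.Set;

public class RegistrationVsReportSketch {
  private final Set<String> registeredDatanodes = new HashSet<>();
  private long reportsProcessed;
  private long registeredCount;

  /** Counts every periodic report, including repeats from the same datanode. */
  void onNodeReport(String datanodeId) {
    reportsProcessed++;
  }

  /** Counts a datanode only the first time it registers. */
  void onRegistration(String datanodeId) {
    if (registeredDatanodes.add(datanodeId)) {
      registeredCount++;
    }
  }

  long getReportsProcessed() {
    return reportsProcessed;
  }

  long getRegisteredCount() {
    return registeredCount;
  }

  public static void main(String[] args) {
    RegistrationVsReportSketch m = new RegistrationVsReportSketch();
    m.onRegistration("dn1");
    m.onRegistration("dn2");
    m.onRegistration("dn1"); // repeat registration, not counted again
    for (int i = 0; i < 3; i++) {
      m.onNodeReport("dn1"); // each periodic report is counted
    }
    System.out.println("registered=" + m.getRegisteredCount()
        + " reportsProcessed=" + m.getReportsProcessed());
  }
}
```

Only the per-registration count converges on a fixed target, which is why it can be plotted against a threshold; the report count keeps growing for as long as datanodes send reports.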

sumitagrawl · 2025-12-18T08:21:20Z

@errose28 Do we need to add metrics for config values such as the threshold? These metrics only matter for a certain duration during startup ... I think this is not the correct approach for this Grafana dashboard.

errose28 · 2025-12-18T15:18:56Z

@sumitagrawl we need to be able to see the rule counters progress towards their target value. The target value is a configuration known by the SCM process. SCM communicates this to Prometheus and Grafana through metrics. If you have a different way to achieve this same goal we can look into it, but we cannot drop the target from the dashboard; otherwise the rule lines provide little value.

> I think this is not the correct approach for this Grafana dashboard.

Why do you say this? This is a pretty standard dashboard that tracks system progress towards a desired goal/threshold over time.

sreejasahithi · 2025-12-30T04:53:10Z

> Can you share screenshots of what the updated dashboards look like in an SCM HA cluster?

This is a screenshot of the updated dashboard in SCM HA.

sreejasahithi · 2025-12-30T08:56:30Z

The test TestSCMSafeModeManager looks flaky; I will fix it.

jojochuang · 2025-12-30T18:55:18Z

The safe mode chart looks good!

Sreeja Chintalapati added 2 commits December 1, 2025 00:38

HDDS-14039. Create Grafana dashboard for Ozone SCM safemode rules and…

fa2d983

… exit

Fixed metric name

f15eb49

jojochuang self-requested a review December 1, 2025 17:22

Updated grafana dashboard

1f1fd6f

Tejaskriya reviewed Dec 9, 2025

Merge branch 'master' of github.com:apache/ozone into HDDS-14039

4e3f753

errose28 self-requested a review December 11, 2025 18:18

Added test coverage for metrics and minor update to dashboard

04a7b86

sreejasahithi requested a review from Tejaskriya December 12, 2025 10:03

Tejaskriya reviewed Dec 16, 2025

Sreeja Chintalapati added 2 commits December 18, 2025 11:36

Updated test

3f3c876

Merge branch 'master' of github.com:apache/ozone into HDDS-14039

8d679db

sumitagrawl reviewed Dec 18, 2025

Updated grafana dashboard

476656d

sreejasahithi requested a review from Tejaskriya December 30, 2025 05:21
