HDDS-14039. Create Grafana dashboard for Ozone SCM safemode rules and exit #9400
base: master
Conversation
|
IMO it would be even better if you can display "In Safe Mode" and "Exited Safe Mode" instead of the numerical 0 and 1. |
|
@errose28 could you please review this PR. |
|
While clear to me, I expect a little confusion from the graph being labelled Binary but going up to 2. Can the Binary axis be limited to 1? |
|
@sreejasahithi These metrics are applicable only during startup, for safemode exit info. Maybe we do not need a dashboard for this. While debugging, we can run a JMX query to check the status. |
|
This is essential for Ozone cluster admins/operators to monitor how long it takes their SCM to come out of safemode, especially when doing a rolling restart. Raw JMX queries are poor for usability and do not show trends over time. The dashboard is in its own file and can be ignored without harm when it is not needed. |
Tejaskriya
left a comment
Thanks for working on this @sreejasahithi, left a few suggestions below.
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/SafeModeMetrics.java
Outdated
hadoop-ozone/dist/src/main/compose/common/grafana/dashboards/Ozone - SCM Safemode.json
Outdated
...op-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/SCMSafeModeManager.java
...-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/DataNodeSafeModeRule.java
|
Also, could you check why the CI seems to be failing? |
I think the failure is not related to my changes. |
|
It seems to be failing at the "Download ozone binary tar" stage. I tried re-triggering it a couple of times, but it still fails. I'll trigger the full run again and let's see if it helps. Can you merge master too when you push the next set of commits? |
Tejaskriya
left a comment
Thanks for the update @sreejasahithi, just a suggestion for the tests.
| GenericTestUtils.waitFor(() -> !scmSafeModeManager.getInSafeMode() && | ||
| scmSafeModeManager.getSafeModeMetrics().getScmInSafeMode().value() == 0, |
I would suggest separating the metrics check from the waitFor block in all the occurrences. That way, if a failure comes from the metrics capturing logic rather than from the actual status of SCM, debugging is easier.
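A minimal sketch of the suggested split, under stated assumptions: `SafeModeWaitSketch`, `FakeScm`, and the local `waitFor` helper are hypothetical stand-ins written only for illustration, not the actual `GenericTestUtils` or `SCMSafeModeManager` code. The point is that the polling step checks only the SCM state, and the metric is read and asserted afterwards, so a metrics bug fails at its own line instead of as a generic timeout.

```java
import java.util.function.BooleanSupplier;

class SafeModeWaitSketch {

  /** Hypothetical stand-in for a polling helper such as GenericTestUtils.waitFor. */
  static void waitFor(BooleanSupplier check, long intervalMs, long timeoutMs) {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (!check.getAsBoolean()) {
      if (System.currentTimeMillis() >= deadline) {
        throw new IllegalStateException("Timed out waiting for condition");
      }
      try {
        Thread.sleep(intervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("Interrupted while waiting", e);
      }
    }
  }

  /** Hypothetical stand-in for the SCM safemode state and its gauge. */
  static final class FakeScm {
    private volatile boolean inSafeMode = true;
    private volatile int scmInSafeModeGauge = 1;

    void exitSafeMode() {
      scmInSafeModeGauge = 0;
      inSafeMode = false;
    }

    boolean getInSafeMode() {
      return inSafeMode;
    }

    int getScmInSafeModeGauge() {
      return scmInSafeModeGauge;
    }
  }

  /** Step 1 waits on the SCM state only; step 2 reads the metric separately. */
  static int waitForExitThenReadGauge(FakeScm scm) {
    waitFor(() -> !scm.getInSafeMode(), 10, 1000);
    return scm.getScmInSafeModeGauge();
  }
}
```

In a real test, the returned gauge value would then go through a separate `assertEquals(0, ...)`, so a timeout points at the safemode state and an assertion failure points at the metric.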
|
Thanks for adding this. I pulled up the Grafana chart in docker to look around.
+1 to Wei-Chiu's comment here. We can have text labels and see enter/exit safemode trends over time with Grafana's state timeline. Can we switch the binary plot to use this instead? A red block would indicate when an SCM was in safemode, and a green block would indicate that it is out. For the threshold to exit safemode on each rule, the two solid lines on top of each other are difficult to read. We can either use a dashed line for the target value, or use a gradient fill where the area at/above the threshold is green and the area below is red. Also, the thresholds are expected to be the same for all SCMs with HA. I think it would be easier to read if we just take the max of thresholds returned by each SCM as a way to reduce this to a single number, and plot that as the exit criteria without a corresponding hostname label. Can you share screenshots of what the updated dashboards look like in an SCM HA cluster? |
| private MutableGaugeInt scmInSafeMode; | ||
|
|
||
| @Metric private MutableGaugeLong numRequiredDatanodesThreshold; | ||
| @Metric private MutableCounterLong currentRegisteredDatanodesCount; |
org.apache.hadoop.hdds.scm.node.SCMNodeMetrics#numNodeReportProcessed
This metric already exists, so we do not need a duplicate of it.
The numNodeReportProcessed metric increments on every periodic node report received, while the currentRegisteredDatanodesCount metric increments once per unique datanode registration, so it is not exactly a duplicate.
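To illustrate the distinction described above, here is a small sketch. It assumes a hypothetical `DatanodeCounterSketch` class and uses a plain `long` and a `Set` instead of the real Hadoop metrics types; it is not how `SCMNodeMetrics` or `SafeModeMetrics` are implemented, only a model of the two counting behaviours.

```java
import java.util.HashSet;
import java.util.Set;

class DatanodeCounterSketch {
  // Models a counter like numNodeReportProcessed: grows with every report.
  private long reportsProcessed;

  // Models a count like currentRegisteredDatanodesCount: one per unique datanode.
  private final Set<String> registeredDatanodes = new HashSet<>();

  /** Called for each periodic node report, including repeats from the same node. */
  void onNodeReport(String datanodeId) {
    reportsProcessed++;
  }

  /** Called on registration; a repeated ID does not change the count. */
  void onRegister(String datanodeId) {
    registeredDatanodes.add(datanodeId);
  }

  long getReportsProcessed() {
    return reportsProcessed;
  }

  long getRegisteredCount() {
    return registeredDatanodes.size();
  }
}
```

With two datanodes that each send several reports, the report counter keeps climbing while the registered count stays at two, which is why only the latter can be compared against a required-datanodes threshold.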
|
@errose28 Do we need to add metrics for config such as the threshold value, just for metrics that exist only for a certain duration during startup? I think this is not the correct approach for this Grafana dashboard. |
|
@sumitagrawl we need to be able to see the rule counters progress towards their target value. The target value is a configuration known by the SCM process. SCM communicates this to Prometheus and Grafana through metrics. If you have a different way to achieve this same goal we can look into it, but we cannot drop the target from the dashboard otherwise the rule lines provide little value.
Why do you say this? This is a pretty standard dashboard that tracks system progress towards a desired goal/threshold over time. |
|
The test |
|
The safe mode chart looks good! |

What changes were proposed in this pull request?
This patch introduces a new dashboard, "SCM Safemode", in Grafana. It contains a chart for each safemode rule displaying its target and actual value. It also displays whether the SCM is in safemode as a binary value, i.e., 1 if the SCM is in safemode and 0 if it is out of safemode.
What is the link to the Apache JIRA
HDDS-14039
How was this patch tested?
Green CI : https://github.com/sreejasahithi/ozone/actions/runs/19803545209
Tested over docker cluster:
