-
Notifications
You must be signed in to change notification settings - Fork 587
HDDS-14012. SCM needs to log safemode exit rules at regular intervals #9376
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @sreejasahithi for working on this.
The PR title has = instead of - in the JIRA ID.
...op-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/SCMSafeModeManager.java
Outdated
Show resolved
Hide resolved
sarvekshayr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you've tested the changes, attach the logs so we can verify the behaviour.
...op-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/SCMSafeModeManager.java
Outdated
Show resolved
Hide resolved
|
Thanks for working on this. Like Sarveksha said, if you could attach before and after log examples that would be helpful. |
yes , will add the log examples, I just have a couple of changes to make after which I will add the examples. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sreejasahithi IMO, we do not need print safe mode status, its logged in below condition based on event,
- DN registered
- pipeline report
- open pipeline
- container Ratis/EC registeration
So it process the event from DN on HB and validate. If satisfied, exit safemode.
We do not need again at regular interval, but CLI is present to have safemode rule info on need basis from leader. For HB, already we have audit log at SCM, that can be referred for problem analysis.
May be we need have support query safemode status from CLI as requirement from follower node also.
cc: @errose28
The information logged here is not a duplicate of the event triggered rules in
CLI works when you have direct access to the cluster but not for offline analysis where we need to triage an issue from logs.
Yes we should also circle back to HDDS-13832 and get that implemented as well. This is needed for rolling restart scenarios where we want to wait for the restarted follower to exit safemode before restarting another node. |
|
Below is a sample of SCM log messages before the changes made in this patch : Below is a sample of SCM log messages after the changes made in this patch (the SCM logs will still contain above log messages): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this. The comparison of the log messages definitely helps show the use case for the improvement since I think the bottom one is much easier to follow. Can we add a test using log capturer to check that each safemode rules's getStatusText is being printed periodically while in safemode?
| LOG.info("Started periodic Safe Mode logging with interval {} ms", safeModeLogIntervalMs); | ||
| } | ||
|
|
||
| private void logSafeModeStatus() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make this whole method synchronized instead of trying to isolate the specific parts of it that need to be coordinated with other method calls? This method will run quickly and not block other safemode checks. It also makes it easier to reason about. Right now there could be some strange cases where, for example, we read safeModeStatus as "in safemode", but it has left safemode by the time we get to the logging section, in which case the rule states won't match the status.
| + ", " + rule.getStatusText() + ")"; | ||
| }) | ||
| .collect(Collectors.joining(", ")); | ||
| LOG.info( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking of something like this for the log output. This has one summary line and each rule on its own line. It can be logged as one single log message using \n to separate the lines so it is not interrupted. The common prefix still helps when searching the logs with grep. Note that some of the safemode rules' messages have a semicolon at the end for some reason which we can probably remove. Lines are usually parsed with awk so using a pseudo-json layout with parentheses and brackets doesn't provide much benefit.
SCM SafeMode Status | state=INITIAL preCheckComplete=false validatedPreCheckRules=0/1 validatedRules=2/5
SCM SafeMode Status | DataNodeSafeModeRule (waiting) registered datanodes (=0) >= required datanodes (=5)
SCM SafeMode Status | HealthyPipelineSafeModeRule (waiting) healthy Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=1)
SCM SafeMode Status | OneReplicaPipelineSafeModeRule (waiting) reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)
SCM SafeMode Status | RatisContainerSafeModeRule (validated) 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99)
SCM SafeMode Status | ECContainerSafeModeRule (validated) 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with this update the periodic log message looks as follows:
2025-12-14 15:48:47 2025-12-14 10:18:47,920 [EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO safemode.SCMSafeModeManager:
2025-12-14 15:48:47 SCM SafeMode Status | state=PRE_CHECKS_PASSED preCheckComplete=true validatedPreCheckRules=1/1 validatedRules=4/5
2025-12-14 15:48:47 SCM SafeMode Status | DataNodeSafeModeRule (validated) registered datanodes (=5) >= required datanodes (=5)
2025-12-14 15:48:47 SCM SafeMode Status | RatisContainerSafeModeRule (validated) 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99)
2025-12-14 15:48:47 SCM SafeMode Status | HealthyPipelineSafeModeRule (waiting) healthy Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=1)
2025-12-14 15:48:47 SCM SafeMode Status | OneReplicaPipelineSafeModeRule (validated) reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)
2025-12-14 15:48:47 SCM SafeMode Status | ECContainerSafeModeRule (validated) 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99)
| HDDS_SCM_SAFEMODE_ONE_NODE_REPORTED_PIPELINE_PCT_DEFAULT = 0.90; | ||
|
|
||
| public static final String HDDS_SCM_SAFEMODE_LOG_INTERVAL = | ||
| "hdds.scm.safemode.log.interval"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make this dynamically reconfigurable? I'm thinking of a scenario where one safemode rule is having trouble validating and we want to adjust this rather than stop the SCM and restart the whole safemode process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay , will make this change to have HDDS_SCM_SAFEMODE_LOG_INTERVAL dynamically reconfigurable.
What changes were proposed in this pull request?
SCM logs rule statuses at arbitrary time intervals. Sometimes there is one log line per minute, sometimes it will go 5+ minutes without logging anything and then log one line showing a large jump in progress. This is not due to log flushing, the timestamps on the log lines exhibit these gaps too. We need a timer in the safemode manager that gives all safemode information at a configurable interval, probably once a minute by default.
What is the link to the Apache JIRA
HDDS-14012
How was this patch tested?
https://github.com/sreejasahithi/ozone/actions/runs/19693260454