HDDS-14012. SCM needs to log safemode exit rules at regular intervals #9376

sreejasahithi · 2025-11-26T08:58:02Z

What changes were proposed in this pull request?

SCM logs rule statuses at arbitrary time intervals. Sometimes there is one log line per minute; sometimes it goes 5+ minutes without logging anything and then logs one line showing a large jump in progress. This is not due to log flushing: the timestamps on the log lines show the same gaps. We need a timer in the safemode manager that logs all safemode information at a configurable interval, probably once a minute by default.
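A minimal sketch of the timer idea described above, assuming a single daemon thread and a caller-supplied summary. The class name `PeriodicStatusLogger` and its `start`/`stop` methods are hypothetical and are not the code in this patch; only the JDK scheduling calls are real.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical stand-in; not the SCMSafeModeManager code from this patch.
class PeriodicStatusLogger {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "SafeMode-Log-Sketch");
        t.setDaemon(true); // logging must never keep the JVM alive
        return t;
      });

  // Emits one summary right away, then one per interval until stop().
  void start(Supplier<String> summary, Consumer<String> sink, long intervalMs) {
    scheduler.scheduleAtFixedRate(() -> {
      try {
        sink.accept(summary.get());
      } catch (RuntimeException e) {
        // An uncaught exception would cancel all future runs of this task.
        sink.accept("failed to build safemode summary: " + e);
      }
    }, 0, intervalMs, TimeUnit.MILLISECONDS);
  }

  void stop() {
    scheduler.shutdownNow();
  }
}
```

Because the schedule is fixed-rate, the gap between summary lines stays constant regardless of when datanode events arrive, which is the property the description asks for.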

What is the link to the Apache JIRA

HDDS-14012

How was this patch tested?

https://github.com/sreejasahithi/ozone/actions/runs/19693260454

sarvekshayr

Thanks @sreejasahithi for working on this.

The PR title has = instead of - in the JIRA ID.

...op-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/SCMSafeModeManager.java

sarvekshayr

If you've tested the changes, attach the logs so we can verify the behaviour.

...op-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/SCMSafeModeManager.java

errose28 · 2025-12-05T01:10:04Z

Thanks for working on this. Like Sarveksha said, if you could attach before and after log examples that would be helpful.

sreejasahithi · 2025-12-05T05:15:41Z

> Thanks for working on this. Like Sarveksha said, if you could attach before and after log examples that would be helpful.

Yes, I will add the log examples. I just have a couple of changes to make first, after which I will add them.

sumitagrawl

@sreejasahithi IMO, we do not need to print the safe mode status periodically. It is already logged when one of the following events occurs:

DN registered
pipeline report
open pipeline
container Ratis/EC registration

SCM processes these events from the DN on heartbeat (HB) and validates the rules. Once they are satisfied, it exits safemode.

We do not need to log this again at a regular interval, since the CLI already provides safemode rule info from the leader on demand. For HB, we already have an audit log at SCM that can be referred to for problem analysis.

Maybe we also need to support querying safemode status from the CLI on follower nodes as a requirement.

cc: @errose28

errose28 · 2025-12-05T17:00:49Z

> We do not need again at regular interval, its logged in below condition based on event

The information logged here is not a duplicate of the event-triggered rules in safemodeExitRule#process. Those display information about only their individual rule. Here we propose adding a summary message of all safemode rule statuses. This way you can tail a log file with watch + grep for the summary keyword to get a periodic update. This workflow is not currently possible, which makes tailing logs for safemode exit difficult.

> CLI is present to have safemode rule info on need basis from leader

CLI works when you have direct access to the cluster but not for offline analysis where we need to triage an issue from logs.

> May be we need have support query safemode status from CLI as requirement from follower node also.

Yes, we should also circle back to HDDS-13832 and get that implemented as well. This is needed for rolling restart scenarios where we want to wait for the restarted follower to exit safemode before restarting another node.

sreejasahithi · 2025-12-05T18:52:34Z

Below is a sample of SCM log messages before the changes made in this patch:

2025-12-05 10:59:29 2025-12-05 05:29:29,934 [EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO safemode.SCMSafeModeManager: SCM in safe mode. 1 DataNodes registered, 5 required.
2025-12-05 10:59:29 2025-12-05 05:29:29,934 [EventQueue-ContainerRegistrationReportForRatisContainerSafeModeRule] INFO safemode.SCMSafeModeManager: RatisContainerSafeModeRule rule is successfully validated
2025-12-05 10:59:29 2025-12-05 05:29:29,935 [EventQueue-PipelineReportForOneReplicaPipelineSafeModeRule] INFO safemode.SCMSafeModeManager: OneReplicaPipelineSafeModeRule rule is successfully validated
2025-12-05 10:59:29 2025-12-05 05:29:29,935 [EventQueue-ContainerRegistrationReportForECContainerSafeModeRule] INFO safemode.SCMSafeModeManager: ECContainerSafeModeRule rule is successfully validated
2025-12-05 10:59:29 2025-12-05 05:29:29,967 [aaffa793-f02c-4e5b-b861-dfe90ff67c94@group-EF33F58B837E-StateMachineUpdater] INFO safemode.SCMSafeModeManager: RatisContainerSafeModeRule rule is success
2025-12-05 10:59:29 2025-12-05 05:29:29,999 [EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO safemode.SCMSafeModeManager: SCM in safe mode. 2 DataNodes registered, 5 required.
2025-12-05 10:59:30 2025-12-05 05:29:30,018 [EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO safemode.SCMSafeModeManager: SCM in safe mode. 3 DataNodes registered, 5 required.
2025-12-05 11:00:05 2025-12-05 05:30:05,491 [EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO safemode.SCMSafeModeManager: SCM in safe mode. 4 DataNodes registered, 5 required.
2025-12-05 11:01:36 2025-12-05 05:31:36,079 [EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO safemode.SCMSafeModeManager: SCM in safe mode. 5 DataNodes registered, 5 required.
2025-12-05 11:01:36 2025-12-05 05:31:36,080 [EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO safemode.SCMSafeModeManager: DataNodeSafeModeRule rule is successfully validated
2025-12-05 11:01:36 2025-12-05 05:31:36,080 [EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO safemode.SCMSafeModeManager: All SCM safe mode pre check rules have passed
2025-12-05 11:01:36 2025-12-05 05:31:36,088 [aaffa793-f02c-4e5b-b861-dfe90ff67c94@group-EF33F58B837E-StateMachineUpdater] INFO safemode.SCMSafeModeManager: DataNodeSafeModeRule rule is successfully validated
2025-12-05 11:01:44 2025-12-05 05:31:44,383 [EventQueue-OpenPipelineForHealthyPipelineSafeModeRule] INFO safemode.SCMSafeModeManager: HealthyPipelineSafeModeRule rule is successfully validated
2025-12-05 11:01:44 2025-12-05 05:31:44,384 [EventQueue-OpenPipelineForHealthyPipelineSafeModeRule] INFO safemode.SCMSafeModeManager: ScmSafeModeManager, all rules are successfully validated
2025-12-05 11:01:44 2025-12-05 05:31:44,384 [EventQueue-OpenPipelineForHealthyPipelineSafeModeRule] INFO safemode.SCMSafeModeManager: SCM exiting safe mode.

Below is a sample of SCM log messages after the changes made in this patch (the SCM logs will still contain the above log messages):

2025-12-05 10:59:22 2025-12-05 05:29:22,903 [main] INFO safemode.SCMSafeModeManager: Started periodic Safe Mode logging with interval 60000 ms
2025-12-05 10:59:22 2025-12-05 05:29:22,904 [SCM-SafeMode-Log-0] INFO safemode.SCMSafeModeManager: SCM SafeMode periodic status: state=SafeModeStatus{safeModeStatus=true, preCheckPassed=false}, preCheckComplete=false, validatedRules=0/5, preCheckValidated=0/1, rules=[DataNodeSafeModeRule(status=waiting, registered datanodes (=0) >= required datanodes (=5)), RatisContainerSafeModeRule(status=waiting, 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);), HealthyPipelineSafeModeRule(status=waiting, healthy Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=1)), OneReplicaPipelineSafeModeRule(status=waiting, reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)), ECContainerSafeModeRule(status=waiting, 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);)]
2025-12-05 11:00:22 2025-12-05 05:30:22,905 [SCM-SafeMode-Log-0] INFO safemode.SCMSafeModeManager: SCM SafeMode periodic status: state=SafeModeStatus{safeModeStatus=true, preCheckPassed=false}, preCheckComplete=false, validatedRules=3/5, preCheckValidated=0/1, rules=[DataNodeSafeModeRule(status=waiting, registered datanodes (=4) >= required datanodes (=5)), RatisContainerSafeModeRule(status=validated, 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);), HealthyPipelineSafeModeRule(status=waiting, healthy Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=1)), OneReplicaPipelineSafeModeRule(status=validated, reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)), ECContainerSafeModeRule(status=validated, 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);)]
2025-12-05 11:01:22 2025-12-05 05:31:22,905 [SCM-SafeMode-Log-0] INFO safemode.SCMSafeModeManager: SCM SafeMode periodic status: state=SafeModeStatus{safeModeStatus=true, preCheckPassed=false}, preCheckComplete=false, validatedRules=3/5, preCheckValidated=0/1, rules=[DataNodeSafeModeRule(status=waiting, registered datanodes (=4) >= required datanodes (=5)), RatisContainerSafeModeRule(status=validated, 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);), HealthyPipelineSafeModeRule(status=waiting, healthy Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=1)), OneReplicaPipelineSafeModeRule(status=validated, reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)), ECContainerSafeModeRule(status=validated, 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);)]
2025-12-05 11:01:36 2025-12-05 05:31:36,080 [EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO safemode.SCMSafeModeManager: SCM SafeMode periodic status: state=SafeModeStatus{safeModeStatus=true, preCheckPassed=true}, preCheckComplete=true, validatedRules=4/5, preCheckValidated=1/1, rules=[DataNodeSafeModeRule(status=validated, registered datanodes (=5) >= required datanodes (=5)), RatisContainerSafeModeRule(status=validated, 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);), HealthyPipelineSafeModeRule(status=waiting, healthy Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=1)), OneReplicaPipelineSafeModeRule(status=validated, reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)), ECContainerSafeModeRule(status=validated, 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);)]
2025-12-05 11:01:44 2025-12-05 05:31:44,384 [EventQueue-OpenPipelineForHealthyPipelineSafeModeRule] INFO safemode.SCMSafeModeManager: SCM SafeMode periodic status: state=SafeModeStatus{safeModeStatus=true, preCheckPassed=true}, preCheckComplete=true, validatedRules=5/5, preCheckValidated=1/1, rules=[DataNodeSafeModeRule(status=validated, registered datanodes (=5) >= required datanodes (=5)), RatisContainerSafeModeRule(status=validated, 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);), HealthyPipelineSafeModeRule(status=validated, healthy Ratis/THREE pipelines (=1) >= healthyPipelineThresholdCount (=1)), OneReplicaPipelineSafeModeRule(status=validated, reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)), ECContainerSafeModeRule(status=validated, 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);)]
2025-12-05 11:01:44 2025-12-05 05:31:44,384 [EventQueue-OpenPipelineForHealthyPipelineSafeModeRule] INFO safemode.SCMSafeModeManager: Stopped periodic Safe Mode logging

errose28

Thanks for working on this. The comparison of the log messages definitely helps show the use case for the improvement since I think the bottom one is much easier to follow. Can we add a test using log capturer to check that each safemode rule's getStatusText is being printed periodically while in safemode?
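One possible shape for such a check, sketched with `java.util.logging` so that it runs without any project dependencies. `CapturingHandler` and `PeriodicLogCheckSketch` are hypothetical names, and the project's own log capturer would take the place of the handler in a real test.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.logging.Handler;
import java.util.logging.LogRecord;

// Hypothetical stand-in for a log capturer: stores every raw log message.
class CapturingHandler extends Handler {
  final List<String> messages = new CopyOnWriteArrayList<>();

  @Override
  public void publish(LogRecord record) {
    messages.add(record.getMessage());
  }

  @Override
  public void flush() { }

  @Override
  public void close() { }
}

class PeriodicLogCheckSketch {
  // Counts captured messages that contain every expected rule status text.
  static long countContainingAll(List<String> messages, List<String> expected) {
    return messages.stream()
        .filter(m -> expected.stream().allMatch(m::contains))
        .count();
  }
}
```

A test could collect each rule's status text, wait for a few intervals while in safemode, and then assert that the count is at least two, which shows the summary is repeated rather than logged once.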

errose28 · 2025-12-12T17:47:06Z

...op-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/SCMSafeModeManager.java

+    LOG.info("Started periodic Safe Mode logging with interval {} ms", safeModeLogIntervalMs);
+  }
+
+  private void logSafeModeStatus() {


Can we make this whole method synchronized instead of trying to isolate the specific parts of it that need to be coordinated with other method calls? This method will run quickly and not block other safemode checks. It also makes it easier to reason about. Right now there could be some strange cases where, for example, we read safeModeStatus as "in safemode", but it has left safemode by the time we get to the logging section, in which case the rule states won't match the status.
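A minimal sketch of the suggestion, assuming the state changes and the status read share one lock. `SafeModeSnapshotSketch`, `validateRule`, and `statusLine` are hypothetical names used only for illustration; they are not members of SCMSafeModeManager.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of the safemode state, for illustration only.
class SafeModeSnapshotSketch {
  private boolean inSafeMode = true;
  private final Map<String, Boolean> ruleValidated = new LinkedHashMap<>();

  SafeModeSnapshotSketch(String... ruleNames) {
    for (String name : ruleNames) {
      ruleValidated.put(name, false);
    }
  }

  // Marks a known rule as validated; leaves safemode once every rule passes.
  synchronized void validateRule(String ruleName) {
    ruleValidated.replace(ruleName, true);
    if (!ruleValidated.containsValue(false)) {
      inSafeMode = false;
    }
  }

  // Reads the status and the rule states under the same lock, so the two
  // halves of the returned line always describe the same moment.
  synchronized String statusLine() {
    long validated = ruleValidated.values().stream().filter(v -> v).count();
    return "inSafeMode=" + inSafeMode
        + " validatedRules=" + validated + "/" + ruleValidated.size();
  }
}
```

With both methods synchronized, a line such as `inSafeMode=true validatedRules=2/2` cannot be produced, which is the mismatch the comment above describes.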

errose28 · 2025-12-12T22:16:12Z

...op-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/SCMSafeModeManager.java

+              + ", " + rule.getStatusText() + ")";
+        })
+        .collect(Collectors.joining(", "));
+    LOG.info(


I'm thinking of something like this for the log output. This has one summary line and each rule on its own line. It can be logged as one single log message using \n to separate the lines so it is not interrupted. The common prefix still helps when searching the logs with grep. Note that some of the safemode rules' messages have a semicolon at the end for some reason which we can probably remove. Lines are usually parsed with awk so using a pseudo-json layout with parentheses and brackets doesn't provide much benefit.

SCM SafeMode Status | state=INITIAL preCheckComplete=false validatedPreCheckRules=0/1 validatedRules=2/5
SCM SafeMode Status | DataNodeSafeModeRule (waiting) registered datanodes (=0) >= required datanodes (=5)
SCM SafeMode Status | HealthyPipelineSafeModeRule (waiting) healthy Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=1)
SCM SafeMode Status | OneReplicaPipelineSafeModeRule (waiting) reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)
SCM SafeMode Status | RatisContainerSafeModeRule (validated) 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99)
SCM SafeMode Status | ECContainerSafeModeRule (validated) 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99)
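The layout proposed above could be assembled roughly as follows. This is a sketch under assumptions: `SafeModeStatusFormatSketch` and its `Rule` holder are hypothetical, and the `preCheckComplete` and `validatedPreCheckRules` fields are left out to keep it short.

```java
import java.util.List;

class SafeModeStatusFormatSketch {
  static final String PREFIX = "SCM SafeMode Status | ";

  // Hypothetical holder for one rule's name, state, and status text.
  static final class Rule {
    final String name;
    final boolean validated;
    final String statusText;

    Rule(String name, boolean validated, String statusText) {
      this.name = name;
      this.validated = validated;
      this.statusText = statusText;
    }
  }

  // Builds one multi-line message: a summary line, then one line per rule,
  // each carrying the same prefix so that grep matches every line.
  static String format(String state, List<Rule> rules) {
    long validated = rules.stream().filter(r -> r.validated).count();
    StringBuilder sb = new StringBuilder(PREFIX)
        .append("state=").append(state)
        .append(" validatedRules=").append(validated)
        .append('/').append(rules.size());
    for (Rule r : rules) {
      sb.append('\n').append(PREFIX).append(r.name)
          .append(r.validated ? " (validated) " : " (waiting) ")
          .append(stripTrailingSemicolon(r.statusText));
    }
    return sb.toString();
  }

  // Drops the stray trailing semicolon that some rule messages carry.
  static String stripTrailingSemicolon(String text) {
    return text.endsWith(";") ? text.substring(0, text.length() - 1) : text;
  }
}
```

Passing the whole string to a single log call keeps the lines together, so messages from other threads cannot be interleaved between them.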

With this update, the periodic log message looks as follows:

2025-12-14 15:48:47 2025-12-14 10:18:47,920 [EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO safemode.SCMSafeModeManager:
2025-12-14 15:48:47 SCM SafeMode Status | state=PRE_CHECKS_PASSED preCheckComplete=true validatedPreCheckRules=1/1 validatedRules=4/5
2025-12-14 15:48:47 SCM SafeMode Status | DataNodeSafeModeRule (validated) registered datanodes (=5) >= required datanodes (=5)
2025-12-14 15:48:47 SCM SafeMode Status | RatisContainerSafeModeRule (validated) 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99)
2025-12-14 15:48:47 SCM SafeMode Status | HealthyPipelineSafeModeRule (waiting) healthy Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=1)
2025-12-14 15:48:47 SCM SafeMode Status | OneReplicaPipelineSafeModeRule (validated) reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)
2025-12-14 15:48:47 SCM SafeMode Status | ECContainerSafeModeRule (validated) 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99)

errose28 · 2025-12-12T22:18:10Z

hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/HddsConfigKeys.java

      HDDS_SCM_SAFEMODE_ONE_NODE_REPORTED_PIPELINE_PCT_DEFAULT = 0.90;

+  public static final String HDDS_SCM_SAFEMODE_LOG_INTERVAL =
+      "hdds.scm.safemode.log.interval";


Can we make this dynamically reconfigurable? I'm thinking of a scenario where one safemode rule is having trouble validating and we want to adjust this rather than stop the SCM and restart the whole safemode process.

Okay, I will make this change so that HDDS_SCM_SAFEMODE_LOG_INTERVAL is dynamically reconfigurable.
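A rough sketch of how an interval change could take effect without a restart, assuming the periodic task is cancelled and rescheduled in place. `ReconfigurableLogTimer` and its methods are hypothetical; how the new value of the config key reaches this class is outside the sketch.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical timer whose period can be changed while it is running.
class ReconfigurableLogTimer {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "SafeMode-Log-Sketch");
        t.setDaemon(true);
        return t;
      });
  private final Runnable logTask;
  private ScheduledFuture<?> future;
  private long intervalMs;

  ReconfigurableLogTimer(Runnable logTask) {
    this.logTask = logTask;
  }

  synchronized void start(long initialIntervalMs) {
    intervalMs = checkPositive(initialIntervalMs);
    future = scheduler.scheduleAtFixedRate(
        logTask, 0, intervalMs, TimeUnit.MILLISECONDS);
  }

  // Swaps the schedule in place; safemode progress itself is not touched.
  synchronized void reconfigure(long newIntervalMs) {
    intervalMs = checkPositive(newIntervalMs);
    if (future != null) {
      future.cancel(false); // let an in-flight log call finish
      future = scheduler.scheduleAtFixedRate(
          logTask, intervalMs, intervalMs, TimeUnit.MILLISECONDS);
    }
  }

  synchronized long getIntervalMs() {
    return intervalMs;
  }

  synchronized void stop() {
    scheduler.shutdownNow();
  }

  private static long checkPositive(long ms) {
    if (ms <= 0) {
      throw new IllegalArgumentException("interval must be positive: " + ms);
    }
    return ms;
  }
}
```

Rejecting non-positive values before touching the schedule means a bad reconfiguration leaves the previous interval running.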

HDDS=14012. SCM needs to log safemode exit rules at regular intervals

4809504

sarvekshayr reviewed Nov 26, 2025

sreejasahithi changed the title ~~HDDS=14012. SCM needs to log safemode exit rules at regular intervals~~ HDDS-14012. SCM needs to log safemode exit rules at regular intervals Nov 26, 2025

Minor fix

24674f7

sreejasahithi requested a review from sarvekshayr November 26, 2025 10:04

sarvekshayr reviewed Nov 26, 2025

sreejasahithi marked this pull request as draft November 26, 2025 10:23

jojochuang requested a review from sumitagrawl December 1, 2025 17:42

sumitagrawl reviewed Dec 5, 2025

Updated safemode status logging

0cd6226

errose28 reviewed Dec 12, 2025

Sreeja Chintalapati added 3 commits December 16, 2025 14:53

Added testcase to verify periodic logging and updated the log message format

1f4075e

Fixed finbugs issue

1586d71

Made SCM safemode log interval dynamically reconfigurable

dda8b43

sreejasahithi requested a review from errose28 December 30, 2025 04:29

Fixed reconfigurableProperties test failure

d85a0a1
