Skip to content

Conversation

@sreejasahithi
Copy link
Contributor

What changes were proposed in this pull request?

SCM logs rule statuses at arbitrary time intervals. Sometimes there is one log line per minute, sometimes it will go 5+ minutes without logging anything and then log one line showing a large jump in progress. This is not due to log flushing, the timestamps on the log lines exhibit these gaps too. We need a timer in the safemode manager that gives all safemode information at a configurable interval, probably once a minute by default.

What is the link to the Apache JIRA

HDDS-14012

How was this patch tested?

https://github.com/sreejasahithi/ozone/actions/runs/19693260454

Copy link
Contributor

@sarvekshayr sarvekshayr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sreejasahithi for working on this.

The PR title has = instead of - in the JIRA ID.

@sreejasahithi sreejasahithi changed the title HDDS=14012. SCM needs to log safemode exit rules at regular intervals HDDS-14012. SCM needs to log safemode exit rules at regular intervals Nov 26, 2025
Copy link
Contributor

@sarvekshayr sarvekshayr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you've tested the changes, attach the logs so we can verify the behaviour.

@sreejasahithi sreejasahithi marked this pull request as draft November 26, 2025 10:23
@errose28
Copy link
Contributor

errose28 commented Dec 5, 2025

Thanks for working on this. Like Sarveksha said, if you could attach before and after log examples that would be helpful.

@sreejasahithi
Copy link
Contributor Author

Thanks for working on this. Like Sarveksha said, if you could attach before and after log examples that would be helpful.

yes , will add the log examples, I just have a couple of changes to make after which I will add the examples.

Copy link
Contributor

@sumitagrawl sumitagrawl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sreejasahithi IMO, we do not need print safe mode status, its logged in below condition based on event,

  1. DN registered
  2. pipeline report
  3. open pipeline
  4. container Ratis/EC registeration
    So it process the event from DN on HB and validate. If satisfied, exit safemode.

We do not need again at regular interval, but CLI is present to have safemode rule info on need basis from leader. For HB, already we have audit log at SCM, that can be referred for problem analysis.

May be we need have support query safemode status from CLI as requirement from follower node also.

cc: @errose28

@errose28
Copy link
Contributor

errose28 commented Dec 5, 2025

We do not need again at regular interval, its logged in below condition based on event

The information logged here is not a duplicate of the event triggered rules in safemodeExitRule#process. Those display information about only their individual rule. Here we propose adding a summary message of all safemode rule statuses. This way you can tail a log file with watch + grep for the summary keyword to get a periodic update. This workflow is not currently possible which makes tailing logs for safemode exit difficult.

CLI is present to have safemode rule info on need basis from leader

CLI works when you have direct access to the cluster but not for offline analysis where we need to triage an issue from logs.

May be we need have support query safemode status from CLI as requirement from follower node also.

Yes we should also circle back to HDDS-13832 and get that implemented as well. This is needed for rolling restart scenarios where we want to wait for the restarted follower to exit safemode before restarting another node.

@sreejasahithi
Copy link
Contributor Author

Below is a sample of SCM log messages before the changes made in this patch :

2025-12-05 10:59:29 2025-12-05 05:29:29,934 [EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO safemode.SCMSafeModeManager: SCM in safe mode. 1 DataNodes registered, 5 required.
2025-12-05 10:59:29 2025-12-05 05:29:29,934 [EventQueue-ContainerRegistrationReportForRatisContainerSafeModeRule] INFO safemode.SCMSafeModeManager: RatisContainerSafeModeRule rule is successfully validated
2025-12-05 10:59:29 2025-12-05 05:29:29,935 [EventQueue-PipelineReportForOneReplicaPipelineSafeModeRule] INFO safemode.SCMSafeModeManager: OneReplicaPipelineSafeModeRule rule is successfully validated
2025-12-05 10:59:29 2025-12-05 05:29:29,935 [EventQueue-ContainerRegistrationReportForECContainerSafeModeRule] INFO safemode.SCMSafeModeManager: ECContainerSafeModeRule rule is successfully validated
2025-12-05 10:59:29 2025-12-05 05:29:29,967 [aaffa793-f02c-4e5b-b861-dfe90ff67c94@group-EF33F58B837E-StateMachineUpdater] INFO safemode.SCMSafeModeManager: RatisContainerSafeModeRule rule is success
2025-12-05 10:59:29 2025-12-05 05:29:29,999 [EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO safemode.SCMSafeModeManager: SCM in safe mode. 2 DataNodes registered, 5 required.
2025-12-05 10:59:30 2025-12-05 05:29:30,018 [EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO safemode.SCMSafeModeManager: SCM in safe mode. 3 DataNodes registered, 5 required.
2025-12-05 11:00:05 2025-12-05 05:30:05,491 [EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO safemode.SCMSafeModeManager: SCM in safe mode. 4 DataNodes registered, 5 required.
2025-12-05 11:01:36 2025-12-05 05:31:36,079 [EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO safemode.SCMSafeModeManager: SCM in safe mode. 5 DataNodes registered, 5 required.
2025-12-05 11:01:36 2025-12-05 05:31:36,080 [EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO safemode.SCMSafeModeManager: DataNodeSafeModeRule rule is successfully validated
2025-12-05 11:01:36 2025-12-05 05:31:36,080 [EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO safemode.SCMSafeModeManager: All SCM safe mode pre check rules have passed
2025-12-05 11:01:36 2025-12-05 05:31:36,088 [aaffa793-f02c-4e5b-b861-dfe90ff67c94@group-EF33F58B837E-StateMachineUpdater] INFO safemode.SCMSafeModeManager: DataNodeSafeModeRule rule is successfully validated
2025-12-05 11:01:44 2025-12-05 05:31:44,383 [EventQueue-OpenPipelineForHealthyPipelineSafeModeRule] INFO safemode.SCMSafeModeManager: HealthyPipelineSafeModeRule rule is successfully validated
2025-12-05 11:01:44 2025-12-05 05:31:44,384 [EventQueue-OpenPipelineForHealthyPipelineSafeModeRule] INFO safemode.SCMSafeModeManager: ScmSafeModeManager, all rules are successfully validated
2025-12-05 11:01:44 2025-12-05 05:31:44,384 [EventQueue-OpenPipelineForHealthyPipelineSafeModeRule] INFO safemode.SCMSafeModeManager: SCM exiting safe mode.

Below is a sample of SCM log messages after the changes made in this patch (the SCM logs will still contain above log messages):

2025-12-05 10:59:22 2025-12-05 05:29:22,903 [main] INFO safemode.SCMSafeModeManager: Started periodic Safe Mode logging with interval 60000 ms
2025-12-05 10:59:22 2025-12-05 05:29:22,904 [SCM-SafeMode-Log-0] INFO safemode.SCMSafeModeManager: SCM SafeMode periodic status: state=SafeModeStatus{safeModeStatus=true, preCheckPassed=false}, preCheckComplete=false, validatedRules=0/5, preCheckValidated=0/1, rules=[DataNodeSafeModeRule(status=waiting, registered datanodes (=0) >= required datanodes (=5)), RatisContainerSafeModeRule(status=waiting, 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);), HealthyPipelineSafeModeRule(status=waiting, healthy Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=1)), OneReplicaPipelineSafeModeRule(status=waiting, reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)), ECContainerSafeModeRule(status=waiting, 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);)]
2025-12-05 11:00:22 2025-12-05 05:30:22,905 [SCM-SafeMode-Log-0] INFO safemode.SCMSafeModeManager: SCM SafeMode periodic status: state=SafeModeStatus{safeModeStatus=true, preCheckPassed=false}, preCheckComplete=false, validatedRules=3/5, preCheckValidated=0/1, rules=[DataNodeSafeModeRule(status=waiting, registered datanodes (=4) >= required datanodes (=5)), RatisContainerSafeModeRule(status=validated, 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);), HealthyPipelineSafeModeRule(status=waiting, healthy Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=1)), OneReplicaPipelineSafeModeRule(status=validated, reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)), ECContainerSafeModeRule(status=validated, 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);)]
2025-12-05 11:01:22 2025-12-05 05:31:22,905 [SCM-SafeMode-Log-0] INFO safemode.SCMSafeModeManager: SCM SafeMode periodic status: state=SafeModeStatus{safeModeStatus=true, preCheckPassed=false}, preCheckComplete=false, validatedRules=3/5, preCheckValidated=0/1, rules=[DataNodeSafeModeRule(status=waiting, registered datanodes (=4) >= required datanodes (=5)), RatisContainerSafeModeRule(status=validated, 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);), HealthyPipelineSafeModeRule(status=waiting, healthy Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=1)), OneReplicaPipelineSafeModeRule(status=validated, reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)), ECContainerSafeModeRule(status=validated, 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);)]
2025-12-05 11:01:36 2025-12-05 05:31:36,080 [EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO safemode.SCMSafeModeManager: SCM SafeMode periodic status: state=SafeModeStatus{safeModeStatus=true, preCheckPassed=true}, preCheckComplete=true, validatedRules=4/5, preCheckValidated=1/1, rules=[DataNodeSafeModeRule(status=validated, registered datanodes (=5) >= required datanodes (=5)), RatisContainerSafeModeRule(status=validated, 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);), HealthyPipelineSafeModeRule(status=waiting, healthy Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=1)), OneReplicaPipelineSafeModeRule(status=validated, reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)), ECContainerSafeModeRule(status=validated, 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);)]
2025-12-05 11:01:44 2025-12-05 05:31:44,384 [EventQueue-OpenPipelineForHealthyPipelineSafeModeRule] INFO safemode.SCMSafeModeManager: SCM SafeMode periodic status: state=SafeModeStatus{safeModeStatus=true, preCheckPassed=true}, preCheckComplete=true, validatedRules=5/5, preCheckValidated=1/1, rules=[DataNodeSafeModeRule(status=validated, registered datanodes (=5) >= required datanodes (=5)), RatisContainerSafeModeRule(status=validated, 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);), HealthyPipelineSafeModeRule(status=validated, healthy Ratis/THREE pipelines (=1) >= healthyPipelineThresholdCount (=1)), OneReplicaPipelineSafeModeRule(status=validated, reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)), ECContainerSafeModeRule(status=validated, 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);)]
2025-12-05 11:01:44 2025-12-05 05:31:44,384 [EventQueue-OpenPipelineForHealthyPipelineSafeModeRule] INFO safemode.SCMSafeModeManager: Stopped periodic Safe Mode logging

Copy link
Contributor

@errose28 errose28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this. The comparison of the log messages definitely helps show the use case for the improvement since I think the bottom one is much easier to follow. Can we add a test using log capturer to check that each safemode rules's getStatusText is being printed periodically while in safemode?

LOG.info("Started periodic Safe Mode logging with interval {} ms", safeModeLogIntervalMs);
}

private void logSafeModeStatus() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this whole method synchronized instead of trying to isolate the specific parts of it that need to be coordinated with other method calls? This method will run quickly and not block other safemode checks. It also makes it easier to reason about. Right now there could be some strange cases where, for example, we read safeModeStatus as "in safemode", but it has left safemode by the time we get to the logging section, in which case the rule states won't match the status.

+ ", " + rule.getStatusText() + ")";
})
.collect(Collectors.joining(", "));
LOG.info(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking of something like this for the log output. This has one summary line and each rule on its own line. It can be logged as one single log message using \n to separate the lines so it is not interrupted. The common prefix still helps when searching the logs with grep. Note that some of the safemode rules' messages have a semicolon at the end for some reason which we can probably remove. Lines are usually parsed with awk so using a pseudo-json layout with parentheses and brackets doesn't provide much benefit.

SCM SafeMode Status | state=INITIAL preCheckComplete=false validatedPreCheckRules=0/1 validatedRules=2/5
SCM SafeMode Status | DataNodeSafeModeRule (waiting) registered datanodes (=0) >= required datanodes (=5)
SCM SafeMode Status | HealthyPipelineSafeModeRule (waiting) healthy Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=1)
SCM SafeMode Status | OneReplicaPipelineSafeModeRule (waiting) reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)
SCM SafeMode Status | RatisContainerSafeModeRule (validated) 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99)
SCM SafeMode Status | ECContainerSafeModeRule (validated) 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with this update the periodic log message looks as follows:

2025-12-14 15:48:47 2025-12-14 10:18:47,920 [EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO safemode.SCMSafeModeManager: 
2025-12-14 15:48:47 SCM SafeMode Status | state=PRE_CHECKS_PASSED preCheckComplete=true validatedPreCheckRules=1/1 validatedRules=4/5
2025-12-14 15:48:47 SCM SafeMode Status | DataNodeSafeModeRule (validated) registered datanodes (=5) >= required datanodes (=5)
2025-12-14 15:48:47 SCM SafeMode Status | RatisContainerSafeModeRule (validated) 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99)
2025-12-14 15:48:47 SCM SafeMode Status | HealthyPipelineSafeModeRule (waiting) healthy Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=1)
2025-12-14 15:48:47 SCM SafeMode Status | OneReplicaPipelineSafeModeRule (validated) reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)
2025-12-14 15:48:47 SCM SafeMode Status | ECContainerSafeModeRule (validated) 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99)

HDDS_SCM_SAFEMODE_ONE_NODE_REPORTED_PIPELINE_PCT_DEFAULT = 0.90;

public static final String HDDS_SCM_SAFEMODE_LOG_INTERVAL =
"hdds.scm.safemode.log.interval";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this dynamically reconfigurable? I'm thinking of a scenario where one safemode rule is having trouble validating and we want to adjust this rather than stop the SCM and restart the whole safemode process.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay , will make this change to have HDDS_SCM_SAFEMODE_LOG_INTERVAL dynamically reconfigurable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants