HDDS-14110. [DiskBalancer] Show EstimatedBytesToMove only during active balancing and improve threshold check message #9465

Gargi-jais11 · 2025-12-09T08:36:40Z

What changes were proposed in this pull request?

EstimatedBytesToMove and EstimatedTimeLeft should not be shown if no container movement is happening.
It is not a bug if there is no container to move while EstimatedBytesToMove is non-zero. This can happen when the configured threshold is very small and no container on the DN is smaller than that estimate.
For this case, we are adding a note to the output of the status CLI.
Improve threshold validation error message. When running the DiskBalancer update command with a threshold value of 100.0, the operation fails on all datanodes with the following error:

bash> ozone admin datanode diskbalancer update -t 100.0 --in-service-datanodes
Error on node [DN-1]: Threshold must be a percentage(double) in the range 0 to 100.

A threshold of 0 means any deviation from ideal usage (even 0.01%) triggers container movement.

This leads to excessive, continuous balancing operations and results in unnecessary I/O overhead and resource consumption.
A threshold can never be 100.0%, as that would allow moving 100% of a disk's contents, effectively emptying one disk.
Suggested improvement:
The error message should clarify that both 0 and 100 are excluded. The validation is being updated to exclude 0 as well, requiring the threshold to be in the open range (0, 100).
New error message:

Error on node [DN-1]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
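
As an illustration of the exclusive range check described above, here is a minimal sketch. It is not the actual DiskBalancerConfiguration code: the class name `ThresholdCheck` and the method `validate` are hypothetical, and only the error text is taken from this PR. The explicit NaN rejection is an extra assumption of the sketch.

```java
/** Hypothetical helper for illustration; not the real DiskBalancerConfiguration. */
final class ThresholdCheck {
  // Error text as quoted in this PR description.
  static final String ERROR_MESSAGE =
      "Threshold must be a percentage(double) in the range 0 to 100 both exclusive.";

  private ThresholdCheck() { }

  /** Returns the threshold unchanged if it lies strictly between 0 and 100. */
  static double validate(double threshold) {
    // NaN compares false against everything, so it would slip past the
    // range checks below; reject it explicitly (assumption of this sketch).
    if (Double.isNaN(threshold) || threshold <= 0d || threshold >= 100d) {
      throw new IllegalArgumentException(ERROR_MESSAGE);
    }
    return threshold;
  }

  public static void main(String[] args) {
    System.out.println("accepted: " + validate(0.001));
    for (double bad : new double[] {0d, 100d}) {
      try {
        validate(bad);
      } catch (IllegalArgumentException e) {
        System.out.println(bad + " rejected: " + e.getMessage());
      }
    }
  }
}
```

Both boundary values are rejected with the same message, while a small positive value such as 0.001 is accepted.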

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14110

How was this patch tested?

Added checks for estimatedBytes and DiskBalancerConfiguration in the unit test TestDiskBalancerService.
Tested manually.
Before the patch:

bash-5.1$ ozone admin datanode diskbalancer status --in-service-datanodes
Status result:
Datanode                            Status          Threshold(%)    BandwidthInMB   Threads      StopAfterDiskEven    SuccessMove  FailureMove  BytesMoved(MB)  EstBytesToMove(MB) EstTimeLeft(min)    
ozone-datanode-5.ozone_default      RUNNING         0.0001          10              5            false                0            0            0               638                2                   
ozone-datanode-3.ozone_default      RUNNING         0.0001          10              5            false                0            0            0               1                  1                   
ozone-datanode-4.ozone_default      RUNNING         0.0001          10              5            false                0            0            0               1                  1                   
ozone-datanode-2.ozone_default      RUNNING         0.0001          10              5            false                0            0            0               698                2                   
ozone-datanode-1.ozone_default      RUNNING         0.0001          10              5            false                0            0            0               3                  1                   

Note: Estimated time left is calculated based on the estimated bytes to move and the configured disk bandwidth.

After the code changes, the output is fixed:

bash-5.1$ ozone admin datanode diskbalancer status --in-service-datanodes
Status result:
Datanode                            Status          Threshold(%)    BandwidthInMB   Threads      StopAfterDiskEven    SuccessMove  FailureMove  BytesMoved(MB)  EstBytesToMove(MB) EstTimeLeft(min)    
ozone-datanode-1.ozone_default      STOPPED         10.0000         10              5            true                 0            0            0               0                  0                   
ozone-datanode-3.ozone_default      STOPPED         10.0000         10              5            true                 0            0            0               0                  0                   
ozone-datanode-5.ozone_default      STOPPED         10.0000         10              5            true                 0            0            0               0                  0                   
ozone-datanode-2.ozone_default      STOPPED         10.0000         10              5            true                 0            0            0               0                  0                   
ozone-datanode-4.ozone_default      STOPPED         10.0000         10              5            true                 0            0            0               0                  0                   

Note:
  - EstBytesToMove is calculated based on the target disk even state with the configured threshold.
  - EstTimeLeft is calculated based on EstimatedBytesToMove and configured disk bandwidth.
  - Both EstimatedBytes and EstTimeLeft could be non-zero while no containers can be moved, especially when the configured threshold or disk capacity is too small.
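
The EstTimeLeft note above can be illustrated with a small calculation. This is a sketch under stated assumptions, not the service's actual formula: it assumes BandwidthInMB means MB per second and that the result is rounded up to whole minutes. The class and method names are hypothetical.

```java
/** Hypothetical illustration of the EstTimeLeft note; not the real implementation. */
final class EstTimeLeft {
  private EstTimeLeft() { }

  /**
   * Estimates minutes left as bytesToMove / bandwidth, rounded up.
   * Assumes the bandwidth is in MB per second (an assumption of this sketch).
   */
  static long estimateMinutes(long bytesToMoveMB, long bandwidthMBPerSec) {
    if (bandwidthMBPerSec <= 0) {
      throw new IllegalArgumentException("bandwidth must be positive");
    }
    if (bytesToMoveMB <= 0) {
      return 0; // nothing to move, nothing to wait for
    }
    double seconds = (double) bytesToMoveMB / bandwidthMBPerSec;
    return (long) Math.ceil(seconds / 60.0);
  }

  public static void main(String[] args) {
    // EstBytesToMove(MB) values from the "before patch" status output, at 10 MB/s.
    for (long mb : new long[] {638, 1, 1, 698, 3}) {
      System.out.println(mb + " MB -> " + estimateMinutes(mb, 10) + " min");
    }
  }
}
```

Under these assumptions, the inputs 638, 1, 1, 698 and 3 MB at a bandwidth of 10 give 2, 1, 1, 2 and 1 minutes, which agrees with the EstTimeLeft column in the earlier output. That agreement does not confirm the real rounding rule.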

Threshold error output:

bash-5.1$ ozone admin datanode diskbalancer start -t 0 --in-service-datanodes
Error on node [172.18.0.11:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Error on node [172.18.0.10:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Error on node [172.18.0.8:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Error on node [172.18.0.9:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Error on node [172.18.0.7:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Failed to start DiskBalancer on nodes: [172.18.0.11:19864, 172.18.0.10:19864, 172.18.0.8:19864, 172.18.0.9:19864, 172.18.0.7:19864]
bash-5.1$ ozone admin datanode diskbalancer start -t 100 --in-service-datanodes
Error on node [172.18.0.11:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Error on node [172.18.0.10:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Error on node [172.18.0.8:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Error on node [172.18.0.9:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Error on node [172.18.0.7:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Failed to start DiskBalancer on nodes: [172.18.0.11:19864, 172.18.0.10:19864, 172.18.0.8:19864, 172.18.0.9:19864, 172.18.0.7:19864]
bash-5.1$ ozone admin datanode diskbalancer start -t 0.001 --in-service-datanodes
Started DiskBalancer on all IN_SERVICE nodes.

Copilot

Pull request overview

This PR addresses two issues in the DiskBalancer service: (1) preventing the display of EstimatedBytesToMove and EstimatedTimeLeft when no container movement is occurring, and (2) improving threshold validation to exclude boundary values 0 and 100 with a clearer error message.

Key Changes:

Updated threshold validation to exclude 0 and 100 (changed from < 0d to <= 0d), preventing edge cases that would cause excessive or meaningless balancing
Modified getDiskBalancerInfo() to only calculate and report bytesToMove when containers are actively being balanced (RUNNING state AND non-empty inProgressContainers)
Enhanced error message to clarify that the threshold range is exclusive: "Threshold must be a percentage(double) in the range 0 to 100 both exclusive."
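
The gating described in the second key change could look roughly like the following. This is a minimal sketch, not the actual DiskBalancerService code: the class, enum, and method names are hypothetical, and container IDs are modelled as plain Long values. A later commit in this PR is titled "removed inProgressContainer check and add CLI comment", so the final change may differ from this reviewed revision.

```java
import java.util.Collections;
import java.util.Set;

/** Hypothetical illustration of reporting bytesToMove only while actively balancing. */
final class BytesToMoveReport {
  /** Simplified stand-in for the balancer's running state. */
  enum Status { RUNNING, STOPPED }

  private BytesToMoveReport() { }

  /**
   * Returns the estimate only when the balancer is RUNNING and at least one
   * container move is in progress; otherwise reports 0.
   */
  static long reportedBytesToMove(Status status,
                                  Set<Long> inProgressContainers,
                                  long estimatedBytesToMove) {
    boolean activelyBalancing =
        status == Status.RUNNING && !inProgressContainers.isEmpty();
    return activelyBalancing ? estimatedBytesToMove : 0L;
  }

  public static void main(String[] args) {
    Set<Long> none = Collections.emptySet();
    Set<Long> one = Collections.singleton(42L);
    System.out.println(reportedBytesToMove(Status.RUNNING, one, 638L));
    System.out.println(reportedBytesToMove(Status.RUNNING, none, 638L));
    System.out.println(reportedBytesToMove(Status.STOPPED, one, 638L));
  }
}
```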

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
`hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/diskbalancer/DiskBalancerConfiguration.java`	Updated threshold validation to exclude 0 and 100, and improved error message clarity
`hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/diskbalancer/DiskBalancerService.java`	Added check for non-empty `inProgressContainers` before calculating `bytesToMove` and updated comments
`hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/diskbalancer/TestDiskBalancerService.java`	Added test coverage for the new `inProgressContainers` check in `getDiskBalancerInfo()`

Comments suppressed due to low confidence (1)

hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/diskbalancer/DiskBalancerService.java:724

getInProgressContainers exposes the internal representation stored in field inProgressContainers. The value may be modified after this call to getInProgressContainers.

  public Set<ContainerID> getInProgressContainers() {
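
One common way to address this kind of warning is to return a read-only view instead of the field itself. The sketch below is an assumption about a possible fix, not what this PR does: `InProgressTracker` and `markInProgress` are hypothetical names, and container IDs are modelled as Long values rather than ContainerID.

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical illustration of a getter that does not expose its internal set. */
final class InProgressTracker {
  // Thread-safe set backing the tracker.
  private final Set<Long> inProgressContainers = ConcurrentHashMap.newKeySet();

  void markInProgress(long containerId) {
    inProgressContainers.add(containerId);
  }

  /** Returns a read-only view; callers can observe but not modify the set. */
  Set<Long> getInProgressContainers() {
    return Collections.unmodifiableSet(inProgressContainers);
  }

  public static void main(String[] args) {
    InProgressTracker tracker = new InProgressTracker();
    tracker.markInProgress(1L);
    Set<Long> view = tracker.getInProgressContainers();
    try {
      view.add(2L);
    } catch (UnsupportedOperationException e) {
      System.out.println("view is read-only, size = " + view.size());
    }
  }
}
```

The view still reflects later changes made through the tracker. Returning a copy instead would give callers a stable snapshot, at the cost of an allocation per call.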

.../src/main/java/org/apache/hadoop/ozone/container/diskbalancer/DiskBalancerConfiguration.java

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Gargi-jais11 · 2025-12-10T05:30:18Z

@ChenSammi Please have a look at this patch. I have resolved the review comments.

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

.../src/main/java/org/apache/hadoop/ozone/container/diskbalancer/DiskBalancerConfiguration.java

...ce/src/test/java/org/apache/hadoop/ozone/container/diskbalancer/TestDiskBalancerService.java

...ervice/src/main/java/org/apache/hadoop/ozone/container/diskbalancer/DiskBalancerService.java

ChenSammi · 2025-12-30T07:59:14Z

...dmin/src/main/java/org/apache/hadoop/hdds/scm/cli/datanode/DiskBalancerStatusSubcommand.java

-    formatBuilder.append("%nNote: Estimated time left is calculated" +
-        " based on the estimated bytes to move and the configured disk bandwidth.");
+    formatBuilder.append("%nNote:%n");
+    formatBuilder.append("  - Estimated time left is calculated based on the estimated bytes" +


EstimatedBytesToMove is calculated based on the target disk even state with the configured threshold.
EstTimeLeft is calculated based on EstimatedBytesToMove and configured disk bandwidth.
Both EstimatedBytes and EstTimeLeft could be non-zero while no containers can be moved, especially when the configured threshold or disk capacity is too small.

BTW, @Gargi-jais11, how about changing volume.density.threshold to volume.density.threshold.percent, and changing the CLI option --threshold to --threshold--percentage?

DiskBalancerVolumeChoosingPolicy.java and ContainerChoosingPolicy.java have [0, 100] in their Javadoc.

BTW, @Gargi-jais11, how about changing volume.density.threshold to volume.density.threshold.percent, and changing the CLI option --threshold to --threshold--percentage?

I agree we should change to --threshold--percentage, as it more clearly indicates that the value is a percentage.

corrected to show estBytesToMove only during active balancing and threshold error msg improve

4a2e668

Gargi-jais11 marked this pull request as ready for review December 9, 2025 08:37

ChenSammi requested a review from Copilot December 9, 2025 09:03

Copilot started reviewing on behalf of ChenSammi December 9, 2025 09:03

Copilot AI reviewed Dec 9, 2025

.../src/main/java/org/apache/hadoop/ozone/container/diskbalancer/DiskBalancerConfiguration.java

Copilot AI reviewed Dec 9, 2025

added test case for diskbalancer configuration invalid values

75e7930

ChenSammi requested a review from Copilot December 15, 2025 08:00

Copilot started reviewing on behalf of ChenSammi December 15, 2025 08:00

Copilot AI reviewed Dec 15, 2025

.../src/main/java/org/apache/hadoop/ozone/container/diskbalancer/DiskBalancerConfiguration.java Outdated

...ce/src/test/java/org/apache/hadoop/ozone/container/diskbalancer/TestDiskBalancerService.java Outdated

ChenSammi reviewed Dec 15, 2025

...ervice/src/main/java/org/apache/hadoop/ozone/container/diskbalancer/DiskBalancerService.java Outdated

Gargi Jaiswal added 3 commits December 16, 2025 12:11

removed inProgressContainer check and add CLI comment

55ba64a

minor fix

8487d46

Merge branch 'HDDS-5713' of github.com:apache/ozone into HDDS-14110

c45be82

Gargi-jais11 requested a review from ChenSammi December 16, 2025 07:12

ChenSammi reviewed Dec 30, 2025

Gargi Jaiswal added 2 commits December 30, 2025 14:41

Merge branch 'HDDS-5713' of github.com:apache/ozone into HDDS-14110

86dad91

changed threshold to threshold-percentage and status note refactoring

3e392c3
