Skip to content

Conversation

@Gargi-jais11
Copy link
Contributor

@Gargi-jais11 Gargi-jais11 commented Dec 9, 2025

What changes were proposed in this pull request?

  • EstimatedBytesToMoved and EstimatedTimeLeft should not be shown up if no container movement happens.
    It's not a bug if there is no container to move while EstimatedBytesToMove is not 0, if the configured threshold is very small and none of container's size of DN is less than this value.
    For this case, we are adding comments in the output of status CLI.

  • Improve threshold validation error message. When running the DiskBalancer update command with a threshold value of 100.0, the operation fails on all datanodes with the following error:

bash> ozone admin datanode diskbalancer update -t 100.0 --in-service-datanodes
Error on node [DN-1]: Threshold must be a percentage(double) in the range 0 to 100.

A threshold of 0 means any deviation from ideal usage (even 0.01%) triggers
container movement

This leads to excessive and continuous balancing operations and results in unnecessary I/O overhead and resource consumption
A Threshold value can never be 100.0% as it would mean allow moving 100% of a disk's contents, effectively emptying one disk.
Suggested improvement:
Rather the error message should clarify that 0 and 100 is excluded. The validation is being updated to exclude 0, requiring threshold to be in
the range (0, 100) exclusive.
new error msg:

Error on node [DN-1]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14110

How was this patch tested?

Added check for estimatedBytes and DiskBalancerConfiguration in unit test TestDiskBalancerService.
Tested manually:
before patch:

bash-5.1$ ozone admin datanode diskbalancer status --in-service-datanodes
Status result:
Datanode                            Status          Threshold(%)    BandwidthInMB   Threads      StopAfterDiskEven    SuccessMove  FailureMove  BytesMoved(MB)  EstBytesToMove(MB) EstTimeLeft(min)    
ozone-datanode-5.ozone_default      RUNNING         0.0001          10              5            false                0            0            0               638                2                   
ozone-datanode-3.ozone_default      RUNNING         0.0001          10              5            false                0            0            0               1                  1                   
ozone-datanode-4.ozone_default      RUNNING         0.0001          10              5            false                0            0            0               1                  1                   
ozone-datanode-2.ozone_default      RUNNING         0.0001          10              5            false                0            0            0               698                2                   
ozone-datanode-1.ozone_default      RUNNING         0.0001          10              5            false                0            0            0               3                  1                   

Note: Estimated time left is calculated based on the estimated bytes to move and the configured disk bandwidth.

After code chnages output fixed:

bash-5.1$ ozone admin datanode diskbalancer status --in-service-datanodes
Status result:
Datanode                            Status          Threshold(%)    BandwidthInMB   Threads      StopAfterDiskEven    SuccessMove  FailureMove  BytesMoved(MB)  EstBytesToMove(MB) EstTimeLeft(min)    
ozone-datanode-1.ozone_default      STOPPED         10.0000         10              5            true                 0            0            0               0                  0                   
ozone-datanode-3.ozone_default      STOPPED         10.0000         10              5            true                 0            0            0               0                  0                   
ozone-datanode-5.ozone_default      STOPPED         10.0000         10              5            true                 0            0            0               0                  0                   
ozone-datanode-2.ozone_default      STOPPED         10.0000         10              5            true                 0            0            0               0                  0                   
ozone-datanode-4.ozone_default      STOPPED         10.0000         10              5            true                 0            0            0               0                  0                   

Note:
  - EstBytesToMove is calculated based on the target disk even state with the configured threshold.
  - EstTimeLeft is calculated based on EstimatedBytesToMove and configured disk bandwidth.
  - Both EstimatedBytes and EstTimeLeft could be non-zero while no containers can be moved, especially when the configured threshold or disk capacity is too small.

Threshold error output:

bash-5.1$ ozone admin datanode diskbalancer start -t 0 --in-service-datanodes
Error on node [172.18.0.11:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Error on node [172.18.0.10:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Error on node [172.18.0.8:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Error on node [172.18.0.9:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Error on node [172.18.0.7:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Failed to start DiskBalancer on nodes: [172.18.0.11:19864, 172.18.0.10:19864, 172.18.0.8:19864, 172.18.0.9:19864, 172.18.0.7:19864]
bash-5.1$ ozone admin datanode diskbalancer start -t 100 --in-service-datanodes
Error on node [172.18.0.11:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Error on node [172.18.0.10:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Error on node [172.18.0.8:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Error on node [172.18.0.9:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Error on node [172.18.0.7:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Failed to start DiskBalancer on nodes: [172.18.0.11:19864, 172.18.0.10:19864, 172.18.0.8:19864, 172.18.0.9:19864, 172.18.0.7:19864]
bash-5.1$ ozone admin datanode diskbalancer start -t 0.001 --in-service-datanodes
Started DiskBalancer on all IN_SERVICE nodes.

@Gargi-jais11 Gargi-jais11 marked this pull request as ready for review December 9, 2025 08:37
@ChenSammi ChenSammi requested a review from Copilot December 9, 2025 09:03
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses two issues in the DiskBalancer service: (1) preventing the display of EstimatedBytesToMove and EstimatedTimeLeft when no container movement is occurring, and (2) improving threshold validation to exclude boundary values 0 and 100 with a clearer error message.

Key Changes:

  • Updated threshold validation to exclude 0 and 100 (changed from < 0d to <= 0d), preventing edge cases that would cause excessive or meaningless balancing
  • Modified getDiskBalancerInfo() to only calculate and report bytesToMove when containers are actively being balanced (RUNNING state AND non-empty inProgressContainers)
  • Enhanced error message to clarify that the threshold range is exclusive: "Threshold must be a percentage(double) in the range 0 to 100 both exclusive."

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/diskbalancer/DiskBalancerConfiguration.java Updated threshold validation to exclude 0 and 100, and improved error message clarity
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/diskbalancer/DiskBalancerService.java Added check for non-empty inProgressContainers before calculating bytesToMove and updated comments
hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/diskbalancer/TestDiskBalancerService.java Added test coverage for the new inProgressContainers check in getDiskBalancerInfo()
Comments suppressed due to low confidence (1)

hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/diskbalancer/DiskBalancerService.java:724

  public Set<ContainerID> getInProgressContainers() {

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@Gargi-jais11
Copy link
Contributor Author

@ChenSammi Please have a look on this patch. I have resolved the review comments.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/diskbalancer/DiskBalancerService.java:724

  public Set<ContainerID> getInProgressContainers() {

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

formatBuilder.append("%nNote: Estimated time left is calculated" +
" based on the estimated bytes to move and the configured disk bandwidth.");
formatBuilder.append("%nNote:%n");
formatBuilder.append(" - Estimated time left is calculated based on the estimated bytes" +
Copy link
Contributor

@ChenSammi ChenSammi Dec 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EstimatedBytesToMove is calculated based on the target disk even state with the configured threshold.
EstTimeLeft is calculated based on EstimatedBytesToMove and configured disk bandwidth.
Both EstimatedBytes and EstTimeLeft could be non-zero while no containers can be moved, especially when the configured threshold or disk capacity is too small.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, @Gargi-jais11 , how about change the volume.density.threshold to volume.density.threshold.percent, and change CLI option --threshold to --threshold--percentage?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DiskBalancerVolumeChoosingPolicy.java and ContainerChoosingPolicy.java have [0, 100] in java doc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, @Gargi-jais11 , how about change the volume.density.threshold to volume.density.threshold.percent, and change CLI option --threshold to --threshold--percentage?

I agree we should change to --threshold--percentage as it is more clearer indicating it as percentage value.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants