@CyCle1024
Collaborator

Motivation

In the current CI rollout unit tests, an OSError (port already in use) occurred occasionally. After several attempts, we found that the main cause is that a port tested in find_master_addr_and_port may not be freed right after the function call. We can set the socket.SO_REUSEADDR option to make the port available.

Key Changes:

  1. Refactor find_master_addr_and_port to use fewer try/except blocks
  2. Set the socket.SO_REUSEADDR option on the socket
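
As a rough illustration of the second change, the sketch below shows one way to probe for free ports with socket.SO_REUSEADDR set. It is not the code from this PR: the helper name find_free_ports and its signature are hypothetical, and it only covers the case where the OS picks the ports.

```python
import socket


def find_free_ports(nums: int) -> list[int]:
    """Hypothetical helper: ask the OS for `nums` currently free TCP ports."""
    ports: list[int] = []
    probes: list[socket.socket] = []
    try:
        for _ in range(nums):
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            # SO_REUSEADDR lets a later bind to this port succeed even if
            # the port is still lingering in the TIME_WAIT state.
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            s.bind(("", 0))  # port 0 asks the OS to choose a free port
            probes.append(s)
            ports.append(s.getsockname()[1])
    finally:
        # Close every probe socket so the caller can bind the ports itself.
        for s in probes:
            s.close()
    return ports
```

A caller would then bind its own server socket to one of the returned ports, ideally also with SO_REUSEADDR set. Another process can still grab the port between the probe and the real bind, so this reduces the failure rate rather than eliminating the race.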

1. refactor find_master_addr_and_port, set sock option socket.SO_REUSEADDR to socket
@CyCle1024 CyCle1024 requested a review from YanhuiDua January 15, 2026 11:55
        s.close()
else:
    assert isinstance(start_port, int)
    assert isinstance(end_port, int)
Collaborator

Could there be a case where start_port is not None but end_port is None?

Collaborator

If the combination of start_port is not None and end_port is None is not allowed, we should also check here that (end_port - start_port) > nums.
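
A minimal sketch of the checks suggested in this comment, assuming that an explicit start_port must always come with an explicit end_port. The helper name validate_port_range is hypothetical and the assert messages are illustrative, not taken from the PR.

```python
def validate_port_range(start_port, end_port, nums):
    """Hypothetical helper: validate an explicit port range before scanning it."""
    if start_port is None:
        # No explicit range was given; the OS picks the ports instead.
        return
    assert isinstance(start_port, int), "start_port must be an int"
    assert isinstance(end_port, int), (
        "end_port must be an int when start_port is given"
    )
    # The range has to be wide enough to hold the requested number of ports.
    assert (end_port - start_port) > nums, (
        f"port range [{start_port}, {end_port}) is too small for {nums} ports"
    )
```

With this shape, a call such as validate_port_range(20000, None, 2) fails with a readable message instead of a TypeError further down.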

if start_port is None:
    for _ in range(nums):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # socket.SO_REUSEADDR can help avoid TIME_WAIT state issues
Collaborator

socket.SO_REUSEADDR allows the port to be bound immediately even while it is still in the TIME_WAIT state

1. Add more detailed comment on socket.SO_REUSEADDR
2. Add assert message in the case of start_port is not None
@CyCle1024 CyCle1024 requested a review from YanhuiDua January 16, 2026 03:34