Conversation

@mathieucarbou (Member) commented Jan 22, 2026

Adds some null checks to avoid any usage of the _client pointer while a loop calling send() is in progress and a disconnect event arrives at the same time (the disconnect event frees the _client pointer).

Copilot AI left a comment

Pull request overview

This pull request adds defensive null checks and improves code quality in the AsyncEventSource module. The changes focus on preventing potential null pointer dereferences and improving the handling of disconnected clients.

Changes:

  • Added null checks for _client pointer in _queueMessage and _runQueue methods
  • Refactored avgPacketsWaiting() to use a ternary operator for cleaner division-by-zero prevention
  • Added connected() checks in send() and _adjust_inflight_window() to skip disconnected clients
  • Improved _adjust_inflight_window() to use connected client count instead of total client count
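For illustration, a minimal sketch of the guard pattern this summary describes, using simplified stand-ins (assumed names and structure, not the library's code verbatim):

```cpp
#include <cstddef>

// Simplified stand-ins for the classes named above; a sketch of the
// guard pattern, not the library's actual declarations.
struct AsyncClient {};

struct AsyncEventSourceClient {
  AsyncClient *_client = nullptr;  // freed by a concurrent disconnect event

  void _runQueue() {
    if (_client == nullptr) {
      return;  // a disconnect already freed the client: do not touch it
    }
    // ... drain queued messages through _client here ...
  }
};

struct AsyncEventSource {
  std::size_t queuedPackets = 0;
  std::size_t connectedClients = 0;

  std::size_t avgPacketsWaiting() const {
    // ternary guard instead of a separate division-by-zero branch
    return connectedClients ? queuedPackets / connectedClients : 0;
  }
};
```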


@willmmiles left a comment

Re this one and #370: there's a recurring pattern/problem in this codebase regarding the ownership and life cycle of 'SomeClient' type objects when they're also attached to a "SomeServer" type object that wants to operate on a list of them. In the original AsyncServer->AsyncWebRequest->AsyncWebResponse architecture, the Request and Response objects are "self-owned" and managed by their AsyncClient's life cycle. But with AsyncWebSocketClient and AsyncEventClient, we must have a second reference to these objects in a list in their associated Source type, making cleanup of these objects somewhat more complex: they cannot simply be deleted when their AsyncClient terminates, as they might be in use by another task; and that other task still has to interlock with the fact that the AsyncClient has been invalidated. (AsyncTCP handles this gracefully, as it's now interlocked internally, but I haven't personally reviewed the older ESP8266 ESPAsyncTCP; perhaps I should.)

I think we probably want to consider building a common data structure to manage this pattern across the different use cases in this library -- something that allows us to safely do for(client& c: clients) c->write() without getting into trouble. (Also: the major feature I added to the WLED AWS fork is to recycle that pattern for AsyncWebRequest as well, so as to support a request queue to manage heap load.) I spent some time looking at this but I don't yet have a coherent API suggestion. I think we can do some magic with std::shared_ptr/std::weak_ptr and making a temporary stack copy of the list contents during iteration to minimize a lot of the friction.
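As a rough sketch of that std::shared_ptr/std::weak_ptr idea (all names here are hypothetical, not a proposed API): the AsyncClient life cycle owns the shared_ptr, the Source keeps only weak_ptrs, and iteration runs over a temporary stack copy of the list.

```cpp
#include <memory>
#include <mutex>
#include <vector>

struct EventClient {
  bool connected() const { return connected_; }
  void write() { /* send pending data */ }
  bool connected_ = true;
};

class EventSource {
 public:
  void add(const std::shared_ptr<EventClient> &c) {
    std::lock_guard<std::mutex> lock(mutex_);
    clients_.push_back(c);
  }

  void broadcast() {
    // Copy under the lock, then iterate without it: a disconnect on another
    // task can erase from clients_ without invalidating this iteration.
    std::vector<std::weak_ptr<EventClient>> snapshot;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      snapshot = clients_;
    }
    for (auto &weak : snapshot) {
      if (auto c = weak.lock()) {  // skip clients already destroyed
        if (c->connected()) c->write();
      }
    }
  }

 private:
  std::mutex mutex_;
  std::vector<std::weak_ptr<EventClient>> clients_;
};
```

A disconnect then only drops the owning shared_ptr; an in-flight iteration on another task fails the lock() and skips the dead client instead of dereferencing freed memory.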

      ++hits;
    } else {
      ++miss;
      if (c->connected()) {

Do we also need a client-pointer-is-not-null check here? (and all the other c->connected() pattern usages?)

@mathieucarbou (Member, Author) replied Jan 22, 2026

Not in this PR, at least: c is the unique_ptr of an AsyncEventSourceClient object wrapping the _client pointer. c->connected() is checking for a null _client pointer behind it. c is not supposed to be null.
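For context, a minimal sketch of what that wrapper check plausibly looks like (assumed from the description above, not copied from the library):

```cpp
struct AsyncClient {
  bool connected() const { return true; }  // placeholder body
};

struct AsyncEventSourceClient {
  AsyncClient *_client = nullptr;

  // The wrapper's connected() already guards the raw pointer, so callers
  // holding a valid wrapper need no extra null check on _client itself.
  bool connected() const { return _client != nullptr && _client->connected(); }
};
```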

This PR is just some code cleanup and null checks.

I would rather discuss how to correctly protect the iteration over the AsyncEventSourceClient list in #370, which was opened for that goal.

@mathieucarbou (Member, Author)

Re this one and #370: there's a recurring pattern/problem in this codebase regarding the ownership and life cycle of 'SomeClient' type objects […]

Yes, that's typically the kind of effort I wanted to initiate through #370, which is not ready for review yet: it was just an attempt to prove that the source of the issue @zekageri saw was the modification of the list during its iteration. But there were also some missing NPE checks.

Let's focus on this cleanup PR first and then we will see how to correctly solve this problem.

@zekageri

I have removed the cleanup call entirely and tested the whole day but so far no crash

@mathieucarbou (Member, Author)

I have removed the cleanup call entirely and tested the whole day but so far no crash

Thanks a lot for being proactive!

That's what I was going to ask you: to test this branch. I saw some missing null checks, so I was hoping they would solve the issue you saw.

So that's perfect then!

@mathieucarbou (Member, Author)

I have removed the cleanup call entirely and tested the whole day but so far no crash

Ah wait!

I think I misunderstood...

Can you please test this PR branch? It only has some null checks but still has the list erase doing a mutation in the disconnect handler.

If this branch fixes your issue, then it was a missing null check.

If this branch still has the bug, then it is the list mutation that is problematic.
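For illustration, a minimal sketch (hypothetical names) of the mutation-during-iteration hazard this test is meant to isolate:

```cpp
#include <list>
#include <memory>

struct EventClient {
  bool connected() const { return alive; }
  void send() { /* write to the socket */ }
  bool alive = true;
};

std::list<std::unique_ptr<EventClient>> clients;

// Task A: iterates the list to broadcast an event.
void broadcast() {
  for (auto &c : clients) {
    if (c->connected()) c->send();
  }
}

// Task B: the disconnect handler erases from the same list. If broadcast()
// is mid-iteration on another task, its iterator can dangle.
void onDisconnect(EventClient *dead) {
  clients.remove_if(
    [dead](const std::unique_ptr<EventClient> &c) { return c.get() == dead; });
}
```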

Thanks 🙏

@zekageri

Oh okay, will test it tomorrow

@bdraco commented Jan 22, 2026

Once there is a final version, I'm happy to retest on 8266 as well. Feel free to ping me here or reach out on discord (same handle)

@mathieucarbou (Member, Author) commented Jan 22, 2026

Once there is a final version, I'm happy to retest on 8266 as well. Feel free to ping me here or reach out on discord (same handle)

@bdraco: this PR is final and rebased on top of main, which also includes your previous PR. So it would be nice if you could test this branch as well, so we have your go-ahead for the merge 👍

Note: we can merge this PR even if @zekageri still sees the issue on ESP32 (regarding the list mutation), because IF there is still an issue regarding list mutation on ESP32, we will fix it in another PR.

@bdraco commented Jan 22, 2026

Great, I'll test it after we finish with the ESPHome patch release

@mathieucarbou linked an issue Jan 22, 2026 that may be closed by this pull request
@bdraco commented Jan 22, 2026

Testing passes on the reproducer. Trying ESPHome now.

@bdraco commented Jan 22, 2026

Can no longer repro the crash on ESPHome with heavy reloads (tested with this PR + other fixes, since main isn't merged in here).

diff --git a/esphome/components/web_server_base/__init__.py b/esphome/components/web_server_base/__init__.py
index d5d75b395d..4be653c362 100644
--- a/esphome/components/web_server_base/__init__.py
+++ b/esphome/components/web_server_base/__init__.py
@@ -48,4 +48,5 @@ async def to_code(config):
         if CORE.is_libretiny:
             CORE.add_platformio_option("lib_ignore", ["ESPAsyncTCP", "RPAsyncTCP"])
         # https://github.com/ESP32Async/ESPAsyncWebServer/blob/main/library.json
-        cg.add_library("ESP32Async/ESPAsyncWebServer", "3.7.10")
+        # Testing PR #370 for ESP8266 SSE crash fix
+        cg.add_library("https://github.com/bdraco/ESPAsyncWebServer.git#pr-370", None)

Pages of the following on the console though, but that's expected:

[12:28:23.506]E async_ws 233: Event message queue overflow: discard message
[12:28:23.519]E async_ws 233: Event message queue overflow: discard message
[12:28:23.524]E async_ws 233: Event message queue overflow: discard message
[12:28:23.531]E async_ws 233: Event message queue overflow: discard message
[12:28:23.537]E async_ws 233: Event message queue overflow: discard message
[12:28:23.543]E async_ws 233: Event message queue overflow: discard message
[12:28:23.549]E async_ws 233: Event message queue overflow: discard message
[12:28:23.554]E async_ws 233: Event message queue overflow: discard message
[12:28:23.561]E async_ws 233: Event message queue overflow: discard message
[12:28:23.567]E async_ws 233: Event message queue overflow: discard message
[12:28:23.577]E async_ws 233: Event message queue overflow: discard message
[12:28:23.585]E async_ws 233: Event message queue overflow: discard message
[12:28:23.591]E async_ws 233: Event message queue overflow: discard message
[12:28:23.598]E async_ws 233: Event message queue overflow: discard message
[12:28:23.605]E async_ws 233: Event message queue overflow: discard message
[12:28:23.612]E async_ws 233: Event message queue overflow: discard message
[12:28:23.644]E async_ws 233: Event message queue overflow: discard message
[12:28:23.651]E async_ws 233: Event message queue overflow: discard message
[12:28:23.657]E async_ws 233: Event message queue overflow: discard message
[12:28:23.663]E async_ws 233: Event message queue overflow: discard message
[12:28:23.669]E async_ws 233: Event message queue overflow: discard message
[12:28:23.675]E async_ws 233: Event message queue overflow: discard message
[12:28:23.682]E async_ws 233: Event message queue overflow: discard message
[12:28:23.688]E async_ws 233: Event message queue overflow: discard message
[12:28:23.695]E async_ws 233: Event message queue overflow: discard message
[12:28:25.448]E async_ws 233: Event message queue overflow: discard message
[12:28:26.985]E async_ws 233: Event message queue overflow: discard message
[12:28:27.004]E async_ws 233: Event message queue overflow: discard message

Might be nice to track dropped messages and only log every so often that x messages were discarded, to reduce the serial blocking. But that's a nice-to-have for sure.
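As a sketch of that idea (the counter, interval, and log_e call follow the ESP32 Arduino idiom but are illustrative, not the library's API):

```cpp
#include <Arduino.h>  // millis(), log_e(); assumes an ESP32 Arduino build

// Hypothetical rate-limited overflow logging: count drops and emit one
// summary line per interval instead of one line per discarded message.
static uint32_t droppedMessages = 0;
static uint32_t lastOverflowLog = 0;
static const uint32_t kOverflowLogIntervalMs = 5000;

void onQueueOverflow() {
  ++droppedMessages;
  const uint32_t now = millis();
  if (now - lastOverflowLog >= kOverflowLogIntervalMs) {
    log_e("Event message queue overflow: %u messages discarded in the last %u ms",
          (unsigned)droppedMessages, (unsigned)(now - lastOverflowLog));
    droppedMessages = 0;
    lastOverflowLog = now;
  }
}
```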

@bdraco commented Jan 22, 2026

Tested rtlxxx with libretiny as well. All good.

@bdraco commented Jan 22, 2026

ESPHome PR memory impact analysis esphome/esphome#13467 (comment)

@mathieucarbou (Member, Author)

Might be nice to track dropped messages and only log every so often that x messages were discarded, to reduce the serial blocking. But that's a nice-to-have for sure.

Yes, we know. Logging is something we would like to improve; we want a way to collect stats instead of logging like that.

FYI @me-no-dev - you came up with this idea last time as well.

@mathieucarbou (Member, Author)

Thanks a lot for your testing @bdraco!

I will ask the team for approval / review.

Do you need a release right after, for your upgrade?

We can tackle the ESP32 issue in a follow-up PR / release.

@bdraco commented Jan 23, 2026

A release in the next week would be great timing as it gives time to bake in 2026.2.x-dev for 3-4 weeks and for the ESP8266 users that test dev to report issues

@mathieucarbou (Member, Author)

@me-no-dev @vortigont @willmmiles: if you can have a look and approve, merge. Thank you!


Development

Successfully merging this pull request may close these issues.

Crash on ESP8266 with rapid reloads

5 participants