Description
The celery-low-prio worker, which is responsible for the majority of longer-running tasks, sometimes goes offline inexplicably. This happens when the worker misses a heartbeat, prompting the worker to believe that it is unable to connect to the broker (RabbitMQ) and making it show up as "offline" on Flower.
It is unclear what causes this issue, but it started appearing after the addition of the PDF OCR, and video_2.pdf_to_pages is often the task where things start to break down. Almost every time, inspecting the tasks on Flower reveals that video_2.pdf_to_pages is in the 'STARTED' state, with several instances of its child task, video_2.extract_multi_image_text, in the same state (or 'RECEIVED'). Although video_2.pdf_to_pages is supposed to spawn 8 instances of the child task, fewer than 8 are ever present. These tasks never reach 'SUCCESS' and hang forever.
This issue is intended as a discussion and diagnosis of the problem. There are two possibilities:
- The issue is caused directly by pdf to image conversion.
- This is simply a side effect of heavy, CPU-dominant use of the low-prio worker for the RAG pipelines.
The issue of Celery workers going offline after a while is well-documented, although no definitive fix exists yet. One theory is that high CPU usage slows down the heartbeat exchange with RabbitMQ, so that a heartbeat arrives later than the allowed delay and is counted as missed. A proposed fix is to lengthen the heartbeat interval so that a briefly starved worker does not miss heartbeats by accident. If this theory is correct, then the issue has nothing to do specifically with PDF to image conversion.
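As a starting point for testing the heartbeat theory, here is a minimal sketch of what a relaxed configuration might look like. `broker_heartbeat` and `broker_heartbeat_checkrate` are real Celery settings; according to the Celery documentation, the worker checks for missed heartbeats every `broker_heartbeat / broker_heartbeat_checkrate` seconds. The values below and the helper name `heartbeat_check_interval` are assumptions for illustration only and have not been validated against this deployment.

```python
# Illustrative settings only; the numbers are assumptions, not tested values.
# In a real app these would be applied with something like
# app.conf.update(**RELAXED_HEARTBEAT).
RELAXED_HEARTBEAT = {
    # Heartbeat interval in seconds (the broker may negotiate this value).
    "broker_heartbeat": 300.0,
    # How many times per heartbeat interval the worker checks for misses.
    "broker_heartbeat_checkrate": 2.0,
}


def heartbeat_check_interval(settings: dict) -> float:
    """Return how often (in seconds) the worker checks for missed heartbeats.

    Hypothetical helper: it only reproduces the division described in the
    Celery docs so the effect of a change can be reasoned about up front.
    """
    return settings["broker_heartbeat"] / settings["broker_heartbeat_checkrate"]


if __name__ == "__main__":
    print(heartbeat_check_interval(RELAXED_HEARTBEAT))
```

If workers still drop offline with a much longer interval, that would be some evidence against the CPU-starvation theory and in favor of a problem inside the PDF tasks themselves.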
It could also be that the issue is actually caused by PDF to image conversion, which is plausible given that the problem appeared right around the time PDF processing was added. Originally, this part of the task chain used pdf2image, which wraps Poppler's command-line utilities. Poppler was initially suspected to be the culprit and pdf2image was therefore replaced with pymupdf, but this only reduced the number of crashes rather than eliminating them. A related problem that appeared elsewhere, in image fingerprinting, was that certain PDF pages ran into the upper limit on pixel count enforced by PIL.Image. If something similar is happening here, then the solution would be to find an alternative to PIL-based utilities, to raise the relevant limits, or to render pages at a size that stays under them.
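To make the pixel-limit scenario concrete, here is a rough sketch of how a render resolution could be capped so that a page stays under Pillow's limit. It assumes the limit in question is `PIL.Image.MAX_IMAGE_PIXELS`, whose default is 89,478,485 pixels, and that page dimensions are available in PDF points (1/72 inch), as pymupdf reports them via `page.rect`. The function name `capped_dpi` and its parameters are hypothetical, and the estimate ignores the integer rounding a real renderer applies to the output dimensions.

```python
import math

# Pillow's default PIL.Image.MAX_IMAGE_PIXELS, hardcoded here so the sketch
# runs without Pillow installed. Assumption: the deployment uses the default.
DEFAULT_MAX_IMAGE_PIXELS = 89_478_485
POINTS_PER_INCH = 72.0


def capped_dpi(width_pt: float, height_pt: float, target_dpi: int,
               max_pixels: int = DEFAULT_MAX_IMAGE_PIXELS) -> int:
    """Return the largest DPI <= target_dpi whose render fits in max_pixels.

    Pixel count at a given DPI is approximately
    (width_pt * dpi / 72) * (height_pt * dpi / 72).
    """
    if width_pt <= 0 or height_pt <= 0:
        raise ValueError("page dimensions must be positive")
    # Solve the pixel-count formula for dpi at the limit, then round down.
    limit_dpi = POINTS_PER_INCH * math.sqrt(max_pixels / (width_pt * height_pt))
    return max(1, min(target_dpi, math.floor(limit_dpi)))


def estimated_pixels(width_pt: float, height_pt: float, dpi: int) -> float:
    """Approximate pixel count of a page rendered at the given DPI."""
    return (width_pt * dpi / POINTS_PER_INCH) * (height_pt * dpi / POINTS_PER_INCH)


if __name__ == "__main__":
    # An A4 page (595 x 842 pt) is far below the limit at 300 DPI.
    print(capped_dpi(595, 842, 300))
    # A 100 x 100 inch page (7200 x 7200 pt) has to be rendered at a lower DPI.
    print(capped_dpi(7200, 7200, 300))
```

With pymupdf, the result could then be passed to something like `page.get_pixmap(dpi=...)`. Logging each page's dimensions alongside the chosen DPI would also show whether the hanging tasks correlate with unusually large pages.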
Feel free to discuss suspected causes and potential fixes.