Celery worker inexplicably going offline

The `celery-low-prio` worker, which handles most of the longer-running tasks, occasionally goes offline for no apparent reason. This happens when the worker misses a heartbeat: it then concludes that it cannot reach the broker (RabbitMQ) and shows up as "offline" in Flower.

The root cause is unclear, but the problem first appeared after PDF OCR was added, and `video_2.pdf_to_pages` is often the task where things start to break down. Almost every time, inspecting the tasks in Flower shows `video_2.pdf_to_pages` in the 'STARTED' state, with several instances of its child task, `video_2.extract_multi_image_text`, in the same state (or 'RECEIVED'). Although `video_2.pdf_to_pages` is supposed to spawn 8 instances of the child task, fewer than 8 are ever present. These tasks never reach 'SUCCESS' and hang indefinitely.

This issue is intended for discussing and diagnosing the problem. There are two possibilities:
1. The issue is caused directly by PDF-to-image conversion.
2. It is simply a side effect of heavy, CPU-bound use of the low-prio worker for the RAG pipelines.

Celery workers going offline after a while is a [well-documented](https://github.com/celery/celery/issues/4758) problem, although no definitive fix exists yet. One theory is that high CPU usage slows down the heartbeat check against RabbitMQ, so that it overshoots the allowed delay and is counted as a missed heartbeat. A [proposed fix](https://github.com/celery/celery/issues/4758#issuecomment-1971375457) involves increasing the wait period so that a delayed check is not treated as a missed heartbeat. If this theory is correct, the issue is not specific to PDF-to-image conversion.
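To make the "increase the wait period" idea concrete, here is a minimal sketch of a `celeryconfig.py`-style module. `broker_heartbeat` and `broker_heartbeat_checkrate` are existing Celery settings, but the values below are illustrative assumptions, not necessarily what the linked comment proposes, and the helper `heartbeat_check_interval` is a hypothetical name used only to show how the two settings relate.

```python
# Hypothetical celeryconfig.py-style settings. The values are illustrative
# assumptions, not a verified fix for this issue.

# Heartbeat interval in seconds. A larger value leaves more slack before a
# delayed check is counted as a missed heartbeat.
broker_heartbeat = 300

# How many times per heartbeat interval the worker checks the heartbeat.
broker_heartbeat_checkrate = 2.0


def heartbeat_check_interval(heartbeat: float, checkrate: float) -> float:
    """Return the number of seconds between two heartbeat checks."""
    return heartbeat / checkrate


if __name__ == "__main__":
    interval = heartbeat_check_interval(broker_heartbeat, broker_heartbeat_checkrate)
    print(f"heartbeat is checked every {interval:.0f} s")
```

A module like this could be loaded with `app.config_from_object("celeryconfig")`. Whether a longer interval actually stops the worker from going offline would still need to be verified on the low-prio worker under load.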

It could also be that the issue really is caused by PDF-to-image conversion, which is plausible given that it appeared right around the time PDF processing was added. Originally, this part of the task chain used `pdf2image`, which relies on Poppler, a command-line tool. Poppler was initially suspected to be the culprit, so `pdf2image` was replaced with `pymupdf`, but this only reduced the number of crashes rather than eliminating them. A related problem appeared elsewhere, in image fingerprinting, where certain PDF pages hit the upper limit on the `PIL.Image` pixel count. If something like this is the cause, the solution would be either to find an alternative to PIL-based utilities or to raise the relevant limits.
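If the pixel limit turns out to be involved, a third option besides replacing PIL or raising its limit would be to cap the render resolution per page. The sketch below is only an illustration under stated assumptions: `safe_dpi` and `PIXEL_BUDGET` are hypothetical names, page sizes are assumed to be given in PDF points (72 per inch), and the budget is assumed to mirror Pillow's default `Image.MAX_IMAGE_PIXELS`.

```python
import math

# Assumed to mirror Pillow's default Image.MAX_IMAGE_PIXELS; adjust this if the
# limit has been changed in the deployment.
PIXEL_BUDGET = 1024 * 1024 * 1024 // 4 // 3


def safe_dpi(width_pt: float, height_pt: float, requested_dpi: int,
             pixel_budget: int = PIXEL_BUDGET) -> int:
    """Return the largest DPI <= requested_dpi that fits the pixel budget.

    width_pt and height_pt are the page dimensions in PDF points (1/72 inch).
    """
    width_in = width_pt / 72.0
    height_in = height_pt / 72.0
    # pixels = (width_in * dpi) * (height_in * dpi) <= pixel_budget
    max_dpi = math.floor(math.sqrt(pixel_budget / (width_in * height_in)))
    return min(requested_dpi, max_dpi)


def rendered_pixels(width_pt: float, height_pt: float, dpi: int) -> int:
    """Estimate the pixel count of a page rendered at the given DPI."""
    return math.ceil(width_pt / 72.0 * dpi) * math.ceil(height_pt / 72.0 * dpi)


if __name__ == "__main__":
    # An A4 page (595 x 842 pt) and an A0-sized page (2384 x 3370 pt).
    for w, h in [(595, 842), (2384, 3370)]:
        dpi = safe_dpi(w, h, requested_dpi=300)
        print(w, h, dpi, rendered_pixels(w, h, dpi))
```

The returned value would then be passed to whichever renderer the task uses. This only avoids oversized images; it does not explain why the tasks hang, so it is best treated as a way to rule one suspect in or out.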

Feel free to discuss suspected causes and potential fixes.

Celery worker inexplicably going offline #264
