@mihow
Collaborator

@mihow mihow commented Nov 20, 2025

Summary

Long-running tasks continue to go missing periodically: the parent job status never changes and no further progress is made. @carlosgjs discovered a RabbitMQ-specific behavior that may be killing our long-running processing tasks prematurely, along with a setting that seems to alleviate it in initial testing: consumer_timeout. The setting must be applied on the server where RabbitMQ is running, but in this PR we also add it to the example production .env file for documentation purposes.

Here are @carlosgjs's notes:

I started testing a "long" job (5k images) using the latest code in main, which uses rabbitmq and the new timeouts/limits. After 30 mins it failed with:
celeryworker-1 | amqp.exceptions.PreconditionFailed: (0, 0): (406) PRECONDITION_FAILED - delivery acknowledgement on channel 1 timed out. Timeout value used: 1800000 ms. This timeout value can be configured, see consumers doc guide to learn more

Good news! After adding the consumer_timeout, the 5000-image job proceeded smoothly. Celery memory stayed around 1.5 GB. It took a little over 3 hours, which means that previously it would have run into the old CELERY_TASK_TIME_LIMIT of 2.8 hours. So progress is being made.

The change was initially added as part of another PR (see the commit reference below), but I decided to make an explicit PR for it.
d2711a7#diff-f9a9a1ff76dacf233aaeb2ff844646b76955e06f958a597b14ebff11b50297b3R24

Related Issues

Part of the fixes for #721 and overall robustness for the current synchronous processing service API

How to Test the Changes

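Start a job that runs for more than 30 minutes, which is the RabbitMQ default consumer_timeout (1800000 ms) reported in the error above, and confirm that it is not cut off.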
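
As a rough aid for picking a test job, the check below compares a job's expected duration against a timeout. It is only a sketch: the function and constant names are hypothetical, and it assumes the delivery stays unacknowledged for the whole run of the job, which is the situation the error above describes.

```python
from datetime import timedelta

# Default delivery-acknowledgement timeout reported in the error message (30 minutes).
DEFAULT_CONSUMER_TIMEOUT_MS = 1_800_000


def exceeds_consumer_timeout(
    job_duration: timedelta,
    timeout_ms: int = DEFAULT_CONSUMER_TIMEOUT_MS,
) -> bool:
    """Return True if a job of this length would outlast the given timeout.

    Assumption: the message is not acknowledged until the job finishes.
    """
    return job_duration > timedelta(milliseconds=timeout_ms)


# The 5k-image job took a little over 3 hours.
three_hours = timedelta(hours=3)
print(exceeds_consumer_timeout(three_hours))              # against the 30-minute default
print(exceeds_consumer_timeout(three_hours, 60_480_000))  # against the value in Deployment Notes
```

With the default, a 3-hour job is past the limit; with the value from the Deployment Notes it is not.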

Deployment Notes

Set consumer_timeout to 60480000 (milliseconds, about 16.8 hours) in rabbitmq.conf on the RabbitMQ server.

Checklist

  • I have tested these changes appropriately.
  • I have added and/or modified relevant tests.
  • I updated relevant documentation or comments.
  • I have verified that this PR follows the project's coding standards.
  • Any dependent changes have already been merged to main.

Summary by CodeRabbit

  • Chores
    • Increased message broker consumer timeout to improve long-running task stability.
    • Added production example settings for message queue and results backend (RabbitMQ/Celery) to simplify deployment and configuration.

Copilot AI review requested due to automatic review settings November 20, 2025 02:13
@netlify

netlify bot commented Nov 20, 2025

Deploy Preview for antenna-preview canceled.

Name Link
🔨 Latest commit eb30cf1
🔍 Latest deploy log https://app.netlify.com/projects/antenna-preview/deploys/691e7b56ca29bd000802fb1a

@coderabbitai
Contributor

coderabbitai bot commented Nov 20, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds RabbitMQ/Celery environment variables: local env receives a new Erlang consumer timeout arg; production example gains a RabbitMQ block (broker URL, results backend set to rpc://, default user/pass placeholders, and the same Erlang consumer timeout).

Changes

| Cohort / File(s) | Summary |
|---|---|
| Local env change: .envs/.local/.django | Added RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS=-rabbit consumer_timeout 604800000 to increase RabbitMQ consumer timeout. |
| Production example env: .envs/.production/.django-example | Appended RabbitMQ/Celery variables: CELERY_BROKER_URL=, CELERY_RESULT_BACKEND=rpc://, RABBITMQ_DEFAULT_USER=, RABBITMQ_DEFAULT_PASS=, and RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS=-rabbit consumer_timeout 604800000. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Verify the Erlang consumer_timeout value (both files use 604800000) is intentional and correctly scaled/typed.
  • Confirm CELERY_RESULT_BACKEND=rpc:// is the intended backend and documented for deployers.
  • Check environment variable naming consistency and placement in the production example.
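
One way to check the scaling is to convert the millisecond values quoted in this thread into hours. The snippet below is a minimal sketch with a hypothetical helper name; it only does the arithmetic and makes no claim about which value the merged files actually contain.

```python
from datetime import timedelta


def describe_ms(ms: int) -> str:
    """Render a millisecond value as hours plus a human-readable duration."""
    duration = timedelta(milliseconds=ms)
    hours = duration.total_seconds() / 3600
    return f"{ms} ms = {hours:g} h ({duration})"


# Values that appear in this conversation: the RabbitMQ default from the error
# message, the value in the Deployment Notes, and the value in this walkthrough.
for value in (1_800_000, 60_480_000, 604_800_000):
    print(describe_ms(value))
```

Note that 60480000 ms is 16.8 hours while 604800000 ms is 7 days, so the two figures quoted in this thread differ by a factor of ten.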

Possibly related PRs

Suggested reviewers

  • carlosgjs

Poem

🐰 Hops of config, tidy and bright,
Queues that wait through day and night,
Erlang whispers, time stretched long,
Celery hums a steady song,
I nibble carrots, celebrate this byte.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately describes the main change: adding a RabbitMQ consumer_timeout setting to allow longer-running jobs. |
| Description check | ✅ Passed | The description includes most required sections (Summary, Related Issues, How to Test, Deployment Notes, Checklist) with clear context about the RabbitMQ consumer_timeout issue and its resolution. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e54579d and eb30cf1.

📒 Files selected for processing (2)
  • .envs/.local/.django (1 hunks)
  • .envs/.production/.django-example (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • .envs/.production/.django-example
  • .envs/.local/.django
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
.envs/.production/.django-example (1)

27-32: Consider adding a comment to document the timeout value.

The RabbitMQ configuration block is well-structured, and the empty credential placeholders are appropriate for an example file. The timeout value matches the local environment configuration, ensuring consistency.

Consider adding a brief comment to explain the timeout value:

 # RabbitMQ
 CELERY_BROKER_URL=
 CELERY_RESULT_BACKEND=rpc:// # Use RabbitMQ for results backend
 RABBITMQ_DEFAULT_USER=
 RABBITMQ_DEFAULT_PASS=
+# Consumer timeout: 60480000 ms = 16.8 hours (for long-running tasks)
 RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS=-rabbit consumer_timeout 60480000

This helps operators understand the timeout duration when configuring production environments.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7ef8c40 and e54579d.

📒 Files selected for processing (2)
  • .envs/.local/.django (1 hunks)
  • .envs/.production/.django-example (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Agent
  • GitHub Check: Redirect rules
  • GitHub Check: Header rules
  • GitHub Check: Pages changed
  • GitHub Check: test
🔇 Additional comments (1)
.envs/.local/.django (1)

24-24: All verification checks passed—no issues found.

The shell script output confirms that the RabbitMQ container in docker-compose loads environment variables from ./.envs/.local/.django via the env_file directive, ensuring that RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS is properly consumed. The web search validates that the Erlang argument syntax -rabbit consumer_timeout 60480000 is correct for RabbitMQ 3.x+ and is the recommended approach for Docker deployments. The configuration is appropriately implemented.
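
To double-check what value an env file actually carries, the timeout can be pulled out of the RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS string. This is a minimal sketch: the helper name is hypothetical and the regular expression assumes the exact `-rabbit consumer_timeout <ms>` form shown in this review.

```python
import re
from typing import Optional


def consumer_timeout_from_erl_args(erl_args: str) -> Optional[int]:
    """Extract the consumer_timeout (ms) from an Erlang args string, if present."""
    match = re.search(r"-rabbit\s+consumer_timeout\s+(\d+)", erl_args)
    return int(match.group(1)) if match else None


# Example line in the form used by the env files in this PR.
line = "RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS=-rabbit consumer_timeout 60480000"
key, _, value = line.partition("=")
print(key, consumer_timeout_from_erl_args(value))
```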

Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds RabbitMQ consumer_timeout configuration to address an issue where long-running Celery tasks were being terminated after 30 minutes (RabbitMQ's default timeout). The configuration is added to both local development and production example environment files to allow jobs to run longer without being prematurely killed by RabbitMQ.

Key changes:

  • Added RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS environment variable with consumer_timeout set to 60480000 ms (~16.8 hours)
  • Added comprehensive RabbitMQ configuration section to production example file including broker URL, result backend, and authentication variables

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| .envs/.production/.django-example | Added new RabbitMQ configuration section with broker URL, result backend, credentials, and consumer timeout settings |
| .envs/.local/.django | Added consumer timeout configuration to existing RabbitMQ settings |

@mihow
Collaborator Author

mihow commented Nov 20, 2025

Important note: I realized that I had previously increased the consumer timeout above the default on the server, and then recently reduced the timeout back to the default. Here are my notes. We need to start committing this to the devops repo.

# TEMPORARY: These high limits are a temporary solution until message results 
# from raw ML results are handled differently
# Increase timeout to 12 hours (we have multi-day jobs)
# Updated 2025-11-15 - reduce timeouts again to see if disconnections get canceled more reliably (stop waiting)
# consumer_timeout = 43200000
# handshake_timeout = 43200000
# Increase max message size to 100MB (in bytes)
max_message_size = 104857600

# Memory and disk settings that might help with large messages
vm_memory_high_watermark.relative = 0.8
disk_free_limit.absolute = 5GB

# Updated 2025-11-19 - increase timeout again 
consumer_timeout = 60480000
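
Until these notes live in the devops repo, a small script can report which settings are active once commented-out lines are ignored. The parser below is a simplified sketch written for this note only (hypothetical `read_conf` helper, plain `key = value` lines, `#` comments at line start); it is not RabbitMQ's own config parser.

```python
def read_conf(text: str) -> dict:
    """Collect active `key = value` settings, skipping blanks and # comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


# Abbreviated copy of the notes above.
notes = """\
# consumer_timeout = 43200000
# handshake_timeout = 43200000
max_message_size = 104857600
consumer_timeout = 60480000
"""

conf = read_conf(notes)
print(conf["consumer_timeout"])
```

Only the uncommented consumer_timeout is picked up; the earlier 12-hour values stay ignored.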

@mihow mihow merged commit a6a3605 into main Nov 20, 2025
7 checks passed
@mihow mihow deleted the fix/rabbitmq-consumer-timeout branch November 20, 2025 02:24