@CyberDem0n
Collaborator

Fast shutdown may hang indefinitely when the synchronous_standby_names requirement cannot be satisfied because too few synchronous replicas are available. In this situation, the pg_cron launcher can block waiting for a synchronous replication acknowledgment.

Example:

postgres -D testdb --shared_preload_libraries=pg_cron --synchronous_standby_names=foobar
 \_ postgres: io worker 0
 \_ postgres: io worker 1
 \_ postgres: io worker 2
 \_ postgres: checkpointer
 \_ postgres: pg_cron launcher  waiting for 0/A2DDC88

gdb:

(gdb) bt
#0  0x00007f7b2a5b3e5a in epoll_wait (epfd=5, events=0x56096e95dc08, maxevents=1, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x0000560942c65aa6 in WaitEventSetWaitBlock (set=set@entry=0x56096e95dba0, cur_timeout=cur_timeout@entry=-1, occurred_events=occurred_events@entry=0x7fff16aa23d0, nevents=nevents@entry=1) at waiteventset.c:1191
#2  0x0000560942c664b5 in WaitEventSetWait (set=0x56096e95dba0, timeout=timeout@entry=-1, occurred_events=occurred_events@entry=0x7fff16aa23d0, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=134217780) at waiteventset.c:1139
#3  0x0000560942c5884a in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=17, timeout=timeout@entry=-1, wait_event_info=wait_event_info@entry=134217780) at latch.c:196
#4  0x0000560942c0f6c4 in SyncRepWaitForLSN (lsn=170777736, commit=commit@entry=true) at syncrep.c:388
#5  0x00005609428d87cd in RecordTransactionCommit () at xact.c:1557
#6  0x00005609428d88f2 in CommitTransaction () at xact.c:2365
#7  0x00005609428d9831 in CommitTransactionCommandInternal () at xact.c:3202
#8  0x00005609428d9bbb in CommitTransactionCommand () at xact.c:3163
#9  0x00007f7b2b4b3b19 in MarkPendingRunsAsFailed () at src/job_metadata.c:1456
#10 0x00007f7b2b4b66a4 in PgCronLauncherMain (arg=<optimized out>) at src/pg_cron.c:588
#11 0x0000560942bc1798 in BackgroundWorkerMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at bgworker.c:879
#12 0x0000560942bc3a4b in postmaster_child_launch (child_type=child_type@entry=B_BG_WORKER, child_slot=238, startup_data=startup_data@entry=0x56096e9f67b0, startup_data_len=startup_data_len@entry=1472, client_sock=client_sock@entry=0x0) at launch_backend.c:290
#13 0x0000560942bc5bf2 in StartBackgroundWorker (rw=rw@entry=0x56096e9f67b0) at postmaster.c:4164
#14 0x0000560942bc5e43 in maybe_start_bgworkers () at postmaster.c:4330
#15 0x0000560942bc6be3 in LaunchMissingBackgroundProcesses () at postmaster.c:3404
#16 0x0000560942bc89f9 in ServerLoop () at postmaster.c:1717
#17 0x0000560942bc9e08 in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x56096e95d2e0) at postmaster.c:1400
#18 0x0000560942acfc06 in main (argc=5, argv=0x56096e95d2e0) at main.c:227

This happens because pg_cron installs a custom SIGTERM handler that does not set ProcDiePending. SyncRepWaitForLSN() checks that flag to decide when to stop waiting on shutdown, so it never exits its wait loop.

Fix this by switching to the standard SIGTERM handler (die()). Additionally, remove the custom SIGHUP handler and rely on SignalHandlerForConfigReload() instead.
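
For readers less familiar with the interrupt flags, here is a minimal standalone sketch of the failure mode and the fix. It is not pg_cron or PostgreSQL code: `proc_die_pending`, `got_sigterm`, `custom_sigterm_handler`, `die_like_handler`, `wait_for_sync_ack` and `shutdown_is_noticed` are hypothetical stand-ins, and the poll limit exists only so the toy loop terminates (the real wait has no such limit).

```c
#include <signal.h>
#include <stdbool.h>

/* Stand-in for ProcDiePending: the only flag the toy wait loop checks. */
static volatile sig_atomic_t proc_die_pending = 0;

/* Stand-in for a private flag kept by a custom handler. */
static volatile sig_atomic_t got_sigterm = 0;

/* Custom handler: remembers the signal privately, never sets proc_die_pending. */
static void custom_sigterm_handler(int signo)
{
    (void) signo;
    got_sigterm = 1;
}

/* Handler modeled loosely on die(): sets the flag the wait loop looks at. */
static void die_like_handler(int signo)
{
    (void) signo;
    proc_die_pending = 1;
}

/*
 * Toy model of the synchronous replication wait. Returns true if the loop
 * stopped because termination was pending, false if it gave up after
 * max_polls iterations without ever noticing the shutdown request.
 */
static bool wait_for_sync_ack(int max_polls)
{
    for (int i = 0; i < max_polls; i++)
    {
        if (proc_die_pending)
            return true;
        /* The real code would sleep on a latch here. */
    }
    return false;
}

/*
 * Installs the given handler, delivers SIGTERM to this process, and reports
 * whether the wait loop noticed the shutdown request.
 */
static bool shutdown_is_noticed(void (*handler)(int))
{
    proc_die_pending = 0;
    got_sigterm = 0;
    signal(SIGTERM, handler);
    raise(SIGTERM);     /* the handler runs before raise() returns */
    return wait_for_sync_ack(1000);
}
```

Calling `shutdown_is_noticed(custom_sigterm_handler)` from a `main()` models the hang: the signal is delivered and recorded, but the wait loop never sees it. With `die_like_handler` the loop exits on its first check, which is the behavior this change restores by using the standard handler.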

Collaborator

@sfc-gh-mslot left a comment

Makes a lot of sense.
