Skip to content

[Bug] - RCU Possible timer handling issue on cpu #1037

@p24-max

Description

@p24-max

Describe the bug
I run a data copy&conversion task in python using a queue which spawns up to 50 new worker processes. Its task is to copy from a remote MSSQL-Server (reachable through a SSH-tunnel) to a RDS Aurora (ServerlessV2) within the AWS-VPC (same AZ).
At the beginning it runs fast, but after a while it gets stuck and slow (regular AVG per job: 0.2sec / when stuck it takes around 4-10 minutes until job completes.

In the DMESG this gets visible:

[  176.408977] sched: DL replenish lagged too much
[  306.042180] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  306.042942] rcu:     0-...0: (3 GPs behind) idle=2c04/1/0x4000000000000002 softirq=4058/4058 fqs=2677
[  306.044351] rcu:     (detected by 11, t=5902 jiffies, g=7193, q=10320 ncpus=16)
[  306.045510] Sending NMI from CPU 11 to CPUs 0:
[  316.046168] rcu: rcu_sched kthread starved for 998 jiffies! g7193 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=8
[  316.047585] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[  316.049002] rcu: RCU grace-period kthread stack dump:
[  316.049900] task:rcu_sched       state:I stack:0     pid:17    tgid:17    ppid:2      flags:0x00000008
[  316.051373] Call trace:
[  316.052012]  __switch_to+0xf0/0x160 (T)
[  316.052794]  __schedule+0x290/0x668
[  316.053530]  schedule+0x30/0xc8
[  316.054220]  schedule_timeout+0x90/0x168
[  316.054981]  rcu_gp_fqs_loop+0x110/0x450
[  316.055754]  rcu_gp_kthread+0x148/0x190
[  316.056525]  kthread+0xe0/0xf0
[  316.057211]  ret_from_fork+0x10/0x20
[  316.057963] rcu: Stack dump where RCU GP kthread last ran:
[  316.058891] Sending NMI from CPU 11 to CPUs 8:
[  316.059717] NMI backtrace for cpu 8 skipped: idling at default_idle_call+0x3c/0xf0
[  548.619604] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  548.620556] rcu:     0-...0: (1 GPs behind) idle=720c/1/0x4000000000000000 softirq=5291/5292 fqs=2462
[  548.621965] rcu:     (detected by 11, t=5902 jiffies, g=12381, q=2751 ncpus=16)
[  548.623141] Sending NMI from CPU 11 to CPUs 0:
[  558.623792] rcu: rcu_sched kthread timer wakeup didn't happen for 1001 jiffies! g12381 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  558.625207] rcu:     Possible timer handling issue on cpu=11 timer-softirq=5710
[  558.626359] rcu: rcu_sched kthread starved for 1002 jiffies! g12381 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=11
[  558.627958] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[  558.629375] rcu: RCU grace-period kthread stack dump:
[  558.630264] task:rcu_sched       state:I stack:0     pid:17    tgid:17    ppid:2      flags:0x00000008
[  558.631738] Call trace:
[  558.632369]  __switch_to+0xf0/0x160 (T)
[  558.633143]  __schedule+0x290/0x668
[  558.633874]  schedule+0x30/0xc8
[  558.634566]  schedule_timeout+0x90/0x168
[  558.635363]  rcu_gp_fqs_loop+0x110/0x450
[  558.636146]  rcu_gp_kthread+0x148/0x190
[  558.636908]  kthread+0xe0/0xf0
[  558.637607]  ret_from_fork+0x10/0x20
[  558.638373] rcu: Stack dump where RCU GP kthread last ran:
[  558.639312] CPU: 11 UID: 0 PID: 0 Comm: swapper/11 Not tainted 6.12.55-74.119.amzn2023.aarch64 #1
[  558.640703] Hardware name: Amazon EC2 r8g.4xlarge/, BIOS 1.0 11/1/2018
[  558.641770] pstate: 43400005 (nZcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[  558.642912] pc : default_idle_call+0x3c/0xf0
[  558.643743] lr : default_idle_call+0x28/0xf0
[  558.644553] sp : ffff800083353de0
[  558.645264] x29: ffff800083353de0 x28: 0000000000000000 x27: 0000000000000000
[  558.646432] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
[  558.647601] x23: 00000000000000c0 x22: ffff800081d7fa60 x21: ffff0003c0b5c900
[  558.648773] x20: 0000000000000000 x19: 0000000000000000 x18: 0000000000000000
[  558.649939] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[  558.651105] x14: ffff800081131880 x13: ffff800081d7fa68 x12: 000000000000000b
[  558.652274] x11: 00000000000000c0 x10: 2666fefd0b604580 x9 : ffff8000801c61a4
[  558.653449] x8 : ffff0003c0b5ed38 x7 : ffff00226ffe7760 x6 : 0000000067aeac73
[  558.654631] x5 : 4000000000000000 x4 : 4000000000000000 x3 : ffff8021ee7f6000
[  558.655803] x2 : 000000000000000b x1 : ffff800083353da0 x0 : ffff0003c0b5c900
[  558.656974] Call trace:
[  558.657613]  default_idle_call+0x3c/0xf0 (P)
[  558.658423]  cpuidle_idle_call+0x168/0x198
[  558.659238]  do_idle+0xa8/0x120
[  558.659942]  cpu_startup_entry+0x40/0x50
[  558.660714]  secondary_start_kernel+0xe0/0x120
[  558.661549]  __secondary_switched+0xc0/0xc8
[  559.285178] NMI backtrace for cpu 0
[  559.285181] CPU: 0 UID: 992 PID: 11412 Comm: python3 Not tainted 6.12.55-74.119.amzn2023.aarch64 #1
[  559.285183] Hardware name: Amazon EC2 r8g.4xlarge/, BIOS 1.0 11/1/2018
[  559.285184] pstate: 43400005 (nZcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[  559.285186] pc : handle_softirqs+0x9c/0x310
[  559.285192] lr : __do_softirq+0x1c/0x28
[  559.285193] sp : ffff80008319bf50
[  559.285194] x29: ffff80008319bf50 x28: ffff000408902480 x27: 0000000000000000
[  559.285196] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
[  559.285197] x23: 0000000003400005 x22: 0000000000000282 x21: ffff8000a4f2be40
[  559.285198] x20: ffff8000800100a8 x19: ffff8000a4f2bcf0 x18: 0000000000000000
[  559.285200] x17: ffff8021ee680000 x16: ffff800083198000 x15: 0000000000000000
[  559.285201] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[  559.285202] x11: ffff8021ee680000 x10: 0000008237fe80e5 x9 : ffff80008001013c
[  559.285203] x8 : 00000081742eb800 x7 : 00000081742eb800 x6 : 0000000067aeac73
[  559.285205] x5 : 0000000000800000 x4 : 0000000000400040 x3 : ffff800081975e40
[  559.285206] x2 : ffff8021ee680000 x1 : ffff800081975e40 x0 : ffff8021ee680000
[  559.285207] Call trace:
[  559.285208]  handle_softirqs+0x9c/0x310 (P)
[  559.285210]  __do_softirq+0x1c/0x28
[  559.285211]  ____do_softirq+0x18/0x30
[  559.285213]  call_on_irq_stack+0x30/0x48
[  559.285214]  do_softirq_own_stack+0x24/0x50
[  559.285216]  __irq_exit_rcu+0xd8/0x100
[  559.285217]  irq_exit_rcu+0x18/0x30
[  559.285219]  el1_interrupt+0x48/0x98
[  559.285221]  el1h_64_irq_handler+0x18/0x28
[  559.285222]  el1h_64_irq+0x84/0x88
[  559.285223]  do_notify_resume+0x40/0x1b8 (P)
[  559.285225]  el0_interrupt+0xe0/0x140
[  559.285226]  __el0_irq_handler_common+0x18/0x28
[  559.285227]  el0t_64_irq_handler+0x10/0x20
[  559.285228]  el0t_64_irq+0x1b0/0x1b8
[  745.155635] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  745.156445] rcu:     0-...0: (1 GPs behind) idle=d634/1/0x4000000000000002 softirq=7032/7032 fqs=2024
[  745.157849] rcu:     (detected by 1, t=5902 jiffies, g=16541, q=4391 ncpus=16)
[  745.159013] Sending NMI from CPU 1 to CPUs 0:
[  755.159658] rcu: rcu_sched kthread starved for 999 jiffies! g16541 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=10
[  755.161090] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[  755.162491] rcu: RCU grace-period kthread stack dump:
[  755.163403] task:rcu_sched       state:I stack:0     pid:17    tgid:17    ppid:2      flags:0x00000008
[  755.164853] Call trace:
[  755.165476]  __switch_to+0xf0/0x160 (T)
[  755.166267]  __schedule+0x290/0x668
[  755.167013]  schedule+0x30/0xc8
[  755.167726]  schedule_timeout+0x90/0x168
[  755.168504]  rcu_gp_fqs_loop+0x110/0x450
[  755.169294]  rcu_gp_kthread+0x148/0x190
[  755.170056]  kthread+0xe0/0xf0
[  755.170761]  ret_from_fork+0x10/0x20
[  755.171498] rcu: Stack dump where RCU GP kthread last ran:
[  755.172452] Sending NMI from CPU 1 to CPUs 10:
[  755.173292] NMI backtrace for cpu 10 skipped: idling at default_idle_call+0x3c/0xf0
[ 1016.903975] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1016.904835] rcu:     0-...0: (2 GPs behind) idle=d5a4/1/0x4000000000000002 softirq=9113/9113 fqs=2488
[ 1016.906235] rcu:     (detected by 4, t=5902 jiffies, g=22521, q=14370 ncpus=16)
[ 1016.907405] Sending NMI from CPU 4 to CPUs 0:
[ 1026.908136] rcu: rcu_sched kthread timer wakeup didn't happen for 1000 jiffies! g22521 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[ 1026.909788] rcu:     Possible timer handling issue on cpu=4 timer-softirq=11871
[ 1026.910942] rcu: rcu_sched kthread starved for 1001 jiffies! g22521 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=4
[ 1026.912534] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 1026.913957] rcu: RCU grace-period kthread stack dump:
[ 1026.914843] task:rcu_sched       state:I stack:0     pid:17    tgid:17    ppid:2      flags:0x00000008
[ 1026.916283] Call trace:
[ 1026.916923]  __switch_to+0xf0/0x160 (T)
[ 1026.917676]  __schedule+0x290/0x668
[ 1026.918425]  schedule+0x30/0xc8
[ 1026.919135]  schedule_timeout+0x90/0x168
[ 1026.919940]  rcu_gp_fqs_loop+0x110/0x450
[ 1026.920727]  rcu_gp_kthread+0x148/0x190
[ 1026.921507]  kthread+0xe0/0xf0
[ 1026.922219]  ret_from_fork+0x10/0x20
[ 1026.922958] rcu: Stack dump where RCU GP kthread last ran:
[ 1026.923891] CPU: 4 UID: 0 PID: 0 Comm: swapper/4 Not tainted 6.12.55-74.119.amzn2023.aarch64 #1
[ 1026.925262] Hardware name: Amazon EC2 r8g.4xlarge/, BIOS 1.0 11/1/2018
[ 1026.926358] pstate: 43400005 (nZcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[ 1026.927509] pc : default_idle_call+0x3c/0xf0
[ 1026.928343] lr : default_idle_call+0x28/0xf0
[ 1026.929169] sp : ffff80008331bde0
[ 1026.929899] x29: ffff80008331bde0 x28: 0000000000000000 x27: 0000000000000000
[ 1026.931075] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
[ 1026.932235] x23: 00000000000000c0 x22: ffff800081d7fa60 x21: ffff0003c0b38000
[ 1026.933425] x20: 0000000000000000 x19: 0000000000000000 x18: 0000000000000000
[ 1026.934607] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 1026.935766] x14: ffff800081131880 x13: ffff800081d7fa68 x12: 0000000000000004
[ 1026.936939] x11: 00000000000000c0 x10: 6c754932078c6bf6 x9 : ffff8000801c61a4
[ 1026.938093] x8 : ffff0003c0b3a438 x7 : ffff8000ac0ab880 x6 : 0000000067aeac73
[ 1026.939274] x5 : 4000000000000000 x4 : 4000000000000000 x3 : ffff8021ee708000
[ 1026.940460] x2 : 0000000000000004 x1 : ffff80008331bda0 x0 : ffff0003c0b38000
[ 1026.941636] Call trace:
[ 1026.942261]  default_idle_call+0x3c/0xf0 (P)
[ 1026.943089]  cpuidle_idle_call+0x168/0x198
[ 1026.943886]  do_idle+0xa8/0x120
[ 1026.944597]  cpu_startup_entry+0x40/0x50
[ 1026.945398]  secondary_start_kernel+0xe0/0x120
[ 1026.946237]  __secondary_switched+0xc0/0xc8
[ 1028.144967] NMI backtrace for cpu 0 skipped: idling at default_idle_call+0x3c/0xf0
[ 1509.729088] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1509.729911] rcu:     0-...0: (4 GPs behind) idle=b6ec/1/0x4000000000000002 softirq=13924/13925 fqs=2867
[ 1509.731355] rcu:     (detected by 1, t=5902 jiffies, g=39757, q=8441 ncpus=16)
[ 1509.732505] Sending NMI from CPU 1 to CPUs 0:
[ 1514.457704] NMI backtrace for cpu 0 skipped: idling at default_idle_call+0x3c/0xf0
[ 1514.457811] rcu: rcu_sched kthread starved for 471 jiffies! g39757 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=7
[ 1514.460489] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 1514.461896] rcu: RCU grace-period kthread stack dump:
[ 1514.462768] task:rcu_sched       state:I stack:0     pid:17    tgid:17    ppid:2      flags:0x00000008
[ 1514.464216] Call trace:
[ 1514.464759]  __switch_to+0xf0/0x160 (T)
[ 1514.465459]  __schedule+0x290/0x668
[ 1514.466123]  schedule+0x30/0xc8
[ 1514.466749]  schedule_timeout+0x90/0x168
[ 1514.467463]  rcu_gp_fqs_loop+0x110/0x450
[ 1514.468191]  rcu_gp_kthread+0x148/0x190
[ 1514.468902]  kthread+0xe0/0xf0
[ 1514.469522]  ret_from_fork+0x10/0x20
[ 1514.470201] rcu: Stack dump where RCU GP kthread last ran:
[ 1514.471142] Sending NMI from CPU 1 to CPUs 7:
[ 1514.471930] NMI backtrace for cpu 7 skipped: idling at default_idle_call+0x3c/0xf0
[ 1761.536806] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1761.537710] rcu:     0-...0: (1 GPs behind) idle=75bc/1/0x4000000000000000 softirq=15304/15305 fqs=2359
[ 1761.539154] rcu:     (detected by 2, t=5902 jiffies, g=45361, q=1759 ncpus=16)
[ 1761.540308] Sending NMI from CPU 2 to CPUs 0:
[ 1765.982160] NMI backtrace for cpu 0
[ 1765.982163] CPU: 0 UID: 992 PID: 24017 Comm: python3 Not tainted 6.12.55-74.119.amzn2023.aarch64 #1
[ 1765.982166] Hardware name: Amazon EC2 r8g.4xlarge/, BIOS 1.0 11/1/2018
[ 1765.982167] pstate: 43400005 (nZcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[ 1765.982168] pc : handle_softirqs+0x9c/0x310
[ 1765.982177] lr : __do_softirq+0x1c/0x28
[ 1765.982179] sp : ffff80008319bf50
[ 1765.982179] x29: ffff80008319bf50 x28: ffff0003c55ea480 x27: 0000000000000000
[ 1765.982182] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
[ 1765.982183] x23: 0000000003400005 x22: 0000000000000282 x21: ffff8000a13c3e40
[ 1765.982185] x20: ffff8000800100a8 x19: ffff8000a13c3cf0 x18: 0000000000000000
[ 1765.982186] x17: ffff8021ee680000 x16: ffff800083198000 x15: 0000000000000000
[ 1765.982187] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 1765.982189] x11: ffff8021ee680000 x10: 0000019b2c9939dc x9 : ffff80008001013c
[ 1765.982191] x8 : ffff0003c0fd8040 x7 : 000000000000000a x6 : 0000000000000000
[ 1765.982192] x5 : 0000000000800000 x4 : 0000000000400040 x3 : ffff800081975e40
[ 1765.982193] x2 : ffff8021ee680000 x1 : ffff800081975e40 x0 : ffff8021ee680000
[ 1765.982195] Call trace:
[ 1765.982196]  handle_softirqs+0x9c/0x310 (P)
[ 1765.982198]  __do_softirq+0x1c/0x28
[ 1765.982199]  ____do_softirq+0x18/0x30
[ 1765.982201]  call_on_irq_stack+0x30/0x48
[ 1765.982203]  do_softirq_own_stack+0x24/0x50
[ 1765.982204]  __irq_exit_rcu+0xd8/0x100
[ 1765.982206]  irq_exit_rcu+0x18/0x30
[ 1765.982207]  el1_interrupt+0x48/0x98
[ 1765.982210]  el1h_64_irq_handler+0x18/0x28
[ 1765.982212]  el1h_64_irq+0x84/0x88
[ 1765.982213]  do_notify_resume+0x40/0x1b8 (P)
[ 1765.982215]  el0_interrupt+0xe0/0x140
[ 1765.982216]  __el0_irq_handler_common+0x18/0x28
[ 1765.982217]  el0t_64_irq_handler+0x10/0x20
[ 1765.982219]  el0t_64_irq+0x1b0/0x1b8
[ 1765.982602] rcu: rcu_sched kthread starved for 442 jiffies! g45361 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=10
[ 1766.013214] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 1766.014615] rcu: RCU grace-period kthread stack dump:
[ 1766.015499] task:rcu_sched       state:I stack:0     pid:17    tgid:17    ppid:2      flags:0x00000008
[ 1766.016942] Call trace:
[ 1766.017465]  __switch_to+0xf0/0x160 (T)
[ 1766.018181]  __schedule+0x290/0x668
[ 1766.018851]  schedule+0x30/0xc8
[ 1766.019479]  schedule_timeout+0x90/0x168
[ 1766.020207]  rcu_gp_fqs_loop+0x110/0x450
[ 1766.020935]  rcu_gp_kthread+0x148/0x190
[ 1766.021635]  kthread+0xe0/0xf0
[ 1766.022246]  ret_from_fork+0x10/0x20
[ 1766.022928] rcu: Stack dump where RCU GP kthread last ran:
[ 1766.023859] Sending NMI from CPU 2 to CPUs 10:
[ 1766.024667] NMI backtrace for cpu 10 skipped: idling at default_idle_call+0x3c/0xf0
[ 2199.150223] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 2199.151075] rcu:     0-...0: (1 GPs behind) idle=205c/1/0x4000000000000000 softirq=19121/19122 fqs=1391
[ 2199.152518] rcu:     (detected by 7, t=5906 jiffies, g=58149, q=20399 ncpus=16)
[ 2199.153685] Sending NMI from CPU 7 to CPUs 0:
[ 2207.009221] NMI backtrace for cpu 0
[ 2207.009224] CPU: 0 UID: 992 PID: 28020 Comm: python3 Not tainted 6.12.55-74.119.amzn2023.aarch64 #1
[ 2207.009227] Hardware name: Amazon EC2 r8g.4xlarge/, BIOS 1.0 11/1/2018
[ 2207.009228] pstate: 43400005 (nZcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[ 2207.009229] pc : handle_softirqs+0x9c/0x310
[ 2207.009236] lr : __do_softirq+0x1c/0x28
[ 2207.009238] sp : ffff80008319bf50
[ 2207.009239] x29: ffff80008319bf50 x28: ffff0004165c2480 x27: 0000000000000000
[ 2207.009241] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
[ 2207.009243] x23: 0000000083400805 x22: 0000000000000282 x21: ffff8000abe53e10
[ 2207.009244] x20: ffff8000800100a8 x19: ffff8000abe53cc0 x18: 0000000000000000
[ 2207.009245] x17: ffff8021ee680000 x16: ffff800083198000 x15: 0000000000000000
[ 2207.009247] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 2207.009248] x11: ffff8021ee680000 x10: 00000201dc000e8e x9 : ffff80008001013c
[ 2207.009249] x8 : ffff0003c0fd8040 x7 : 0000000000000003 x6 : 0000000000000000
[ 2207.009250] x5 : 0000000000800000 x4 : 0000000000400040 x3 : ffff800081975e40
[ 2207.009252] x2 : ffff8021ee680000 x1 : ffff800081975e40 x0 : ffff8021ee680000
[ 2207.009253] Call trace:
[ 2207.009254]  handle_softirqs+0x9c/0x310 (P)
[ 2207.009256]  __do_softirq+0x1c/0x28
[ 2207.009257]  ____do_softirq+0x18/0x30
[ 2207.009259]  call_on_irq_stack+0x30/0x48
[ 2207.009261]  do_softirq_own_stack+0x24/0x50
[ 2207.009263]  __irq_exit_rcu+0xd8/0x100
[ 2207.009264]  irq_exit_rcu+0x18/0x30
[ 2207.009266]  el1_interrupt+0x48/0x98
[ 2207.009269]  el1h_64_irq_handler+0x18/0x28
[ 2207.009270]  el1h_64_irq+0x84/0x88
[ 2207.009271]  __arm64_sys_newfstatat+0x0/0x38 (P)
[ 2207.009273]  el0_svc_common.constprop.0+0x48/0xf8
[ 2207.009275]  do_el0_svc+0x28/0x40
[ 2207.009276]  el0_svc+0x3c/0x180
[ 2207.009277]  el0t_64_sync_handler+0x13c/0x158
[ 2207.009278]  el0t_64_sync+0x1b0/0x1b8
[ 2207.010202] rcu: rcu_sched kthread starved for 780 jiffies! g58149 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=3
[ 2207.041539] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 2207.042943] rcu: RCU grace-period kthread stack dump:
[ 2207.043817] task:rcu_sched       state:I stack:0     pid:17    tgid:17    ppid:2      flags:0x00000008
[ 2207.045272] Call trace:
[ 2207.045800]  __switch_to+0xf0/0x160 (T)
[ 2207.046509]  __schedule+0x290/0x668
[ 2207.047173]  schedule+0x30/0xc8
[ 2207.047794]  schedule_timeout+0x90/0x168
[ 2207.048513]  rcu_gp_fqs_loop+0x110/0x450
[ 2207.049256]  rcu_gp_kthread+0x148/0x190
[ 2207.049953]  kthread+0xe0/0xf0
[ 2207.050566]  ret_from_fork+0x10/0x20
[ 2207.051248] rcu: Stack dump where RCU GP kthread last ran:
[ 2207.052187] Sending NMI from CPU 7 to CPUs 3:
[ 2207.052973] NMI backtrace for cpu 3 skipped: idling at default_idle_call+0x3c/0xf0

While running the dataload, the server is neither CPU bound (AVG 14%) nor disc-bound. Most of the time taken is network I/O (awaiting DB to provide data or awaiting the DB-commit to confirm).
Both databases are running on separate servers.

Additional context

  • r8g.4xlarge
  • EBS GP3 500GB (3000 IOPS)
  • Amazon Linux 2023.9.20251117
  • Python 3.9.24
  • modinfo ena
    version: 2.15.0g
  • ifconfig
    ens34: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9001
    inet 10.0.0.75 netmask 255.255.255.0 broadcast 10.0.0.255
    inet6 fe80::b9:9bff:fe4a:3187 prefixlen 64 scopeid 0x20
    ether 02:b9:9b:4a:31:87 txqueuelen 1000 (Ethernet)
    RX packets 7937817 bytes 11890830800 (11.0 GiB)
    RX errors 0 dropped 0 overruns 0 frame 0
    TX packets 7524972 bytes 5313936205 (4.9 GiB)
    TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions