generated from amazon-archives/__template_DevGuide
-
Notifications
You must be signed in to change notification settings - Fork 47
Open
Description
Describe the bug
I run a data copy&conversion task in python using a queue which spawns up to 50 new worker processes. Its task is to copy from a remote MSSQL-Server (reachable through a SSH-tunnel) to a RDS Aurora (ServerlessV2) within the AWS-VPC (same AZ).
At the beginning it runs fast, but after a while it gets stuck and slow (regular AVG per job: 0.2sec / when stuck it takes around 4-10 minutes until job completes.
In the DMESG this gets visible:
[ 176.408977] sched: DL replenish lagged too much
[ 306.042180] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 306.042942] rcu: 0-...0: (3 GPs behind) idle=2c04/1/0x4000000000000002 softirq=4058/4058 fqs=2677
[ 306.044351] rcu: (detected by 11, t=5902 jiffies, g=7193, q=10320 ncpus=16)
[ 306.045510] Sending NMI from CPU 11 to CPUs 0:
[ 316.046168] rcu: rcu_sched kthread starved for 998 jiffies! g7193 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=8
[ 316.047585] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 316.049002] rcu: RCU grace-period kthread stack dump:
[ 316.049900] task:rcu_sched state:I stack:0 pid:17 tgid:17 ppid:2 flags:0x00000008
[ 316.051373] Call trace:
[ 316.052012] __switch_to+0xf0/0x160 (T)
[ 316.052794] __schedule+0x290/0x668
[ 316.053530] schedule+0x30/0xc8
[ 316.054220] schedule_timeout+0x90/0x168
[ 316.054981] rcu_gp_fqs_loop+0x110/0x450
[ 316.055754] rcu_gp_kthread+0x148/0x190
[ 316.056525] kthread+0xe0/0xf0
[ 316.057211] ret_from_fork+0x10/0x20
[ 316.057963] rcu: Stack dump where RCU GP kthread last ran:
[ 316.058891] Sending NMI from CPU 11 to CPUs 8:
[ 316.059717] NMI backtrace for cpu 8 skipped: idling at default_idle_call+0x3c/0xf0
[ 548.619604] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 548.620556] rcu: 0-...0: (1 GPs behind) idle=720c/1/0x4000000000000000 softirq=5291/5292 fqs=2462
[ 548.621965] rcu: (detected by 11, t=5902 jiffies, g=12381, q=2751 ncpus=16)
[ 548.623141] Sending NMI from CPU 11 to CPUs 0:
[ 558.623792] rcu: rcu_sched kthread timer wakeup didn't happen for 1001 jiffies! g12381 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[ 558.625207] rcu: Possible timer handling issue on cpu=11 timer-softirq=5710
[ 558.626359] rcu: rcu_sched kthread starved for 1002 jiffies! g12381 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=11
[ 558.627958] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 558.629375] rcu: RCU grace-period kthread stack dump:
[ 558.630264] task:rcu_sched state:I stack:0 pid:17 tgid:17 ppid:2 flags:0x00000008
[ 558.631738] Call trace:
[ 558.632369] __switch_to+0xf0/0x160 (T)
[ 558.633143] __schedule+0x290/0x668
[ 558.633874] schedule+0x30/0xc8
[ 558.634566] schedule_timeout+0x90/0x168
[ 558.635363] rcu_gp_fqs_loop+0x110/0x450
[ 558.636146] rcu_gp_kthread+0x148/0x190
[ 558.636908] kthread+0xe0/0xf0
[ 558.637607] ret_from_fork+0x10/0x20
[ 558.638373] rcu: Stack dump where RCU GP kthread last ran:
[ 558.639312] CPU: 11 UID: 0 PID: 0 Comm: swapper/11 Not tainted 6.12.55-74.119.amzn2023.aarch64 #1
[ 558.640703] Hardware name: Amazon EC2 r8g.4xlarge/, BIOS 1.0 11/1/2018
[ 558.641770] pstate: 43400005 (nZcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[ 558.642912] pc : default_idle_call+0x3c/0xf0
[ 558.643743] lr : default_idle_call+0x28/0xf0
[ 558.644553] sp : ffff800083353de0
[ 558.645264] x29: ffff800083353de0 x28: 0000000000000000 x27: 0000000000000000
[ 558.646432] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
[ 558.647601] x23: 00000000000000c0 x22: ffff800081d7fa60 x21: ffff0003c0b5c900
[ 558.648773] x20: 0000000000000000 x19: 0000000000000000 x18: 0000000000000000
[ 558.649939] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 558.651105] x14: ffff800081131880 x13: ffff800081d7fa68 x12: 000000000000000b
[ 558.652274] x11: 00000000000000c0 x10: 2666fefd0b604580 x9 : ffff8000801c61a4
[ 558.653449] x8 : ffff0003c0b5ed38 x7 : ffff00226ffe7760 x6 : 0000000067aeac73
[ 558.654631] x5 : 4000000000000000 x4 : 4000000000000000 x3 : ffff8021ee7f6000
[ 558.655803] x2 : 000000000000000b x1 : ffff800083353da0 x0 : ffff0003c0b5c900
[ 558.656974] Call trace:
[ 558.657613] default_idle_call+0x3c/0xf0 (P)
[ 558.658423] cpuidle_idle_call+0x168/0x198
[ 558.659238] do_idle+0xa8/0x120
[ 558.659942] cpu_startup_entry+0x40/0x50
[ 558.660714] secondary_start_kernel+0xe0/0x120
[ 558.661549] __secondary_switched+0xc0/0xc8
[ 559.285178] NMI backtrace for cpu 0
[ 559.285181] CPU: 0 UID: 992 PID: 11412 Comm: python3 Not tainted 6.12.55-74.119.amzn2023.aarch64 #1
[ 559.285183] Hardware name: Amazon EC2 r8g.4xlarge/, BIOS 1.0 11/1/2018
[ 559.285184] pstate: 43400005 (nZcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[ 559.285186] pc : handle_softirqs+0x9c/0x310
[ 559.285192] lr : __do_softirq+0x1c/0x28
[ 559.285193] sp : ffff80008319bf50
[ 559.285194] x29: ffff80008319bf50 x28: ffff000408902480 x27: 0000000000000000
[ 559.285196] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
[ 559.285197] x23: 0000000003400005 x22: 0000000000000282 x21: ffff8000a4f2be40
[ 559.285198] x20: ffff8000800100a8 x19: ffff8000a4f2bcf0 x18: 0000000000000000
[ 559.285200] x17: ffff8021ee680000 x16: ffff800083198000 x15: 0000000000000000
[ 559.285201] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 559.285202] x11: ffff8021ee680000 x10: 0000008237fe80e5 x9 : ffff80008001013c
[ 559.285203] x8 : 00000081742eb800 x7 : 00000081742eb800 x6 : 0000000067aeac73
[ 559.285205] x5 : 0000000000800000 x4 : 0000000000400040 x3 : ffff800081975e40
[ 559.285206] x2 : ffff8021ee680000 x1 : ffff800081975e40 x0 : ffff8021ee680000
[ 559.285207] Call trace:
[ 559.285208] handle_softirqs+0x9c/0x310 (P)
[ 559.285210] __do_softirq+0x1c/0x28
[ 559.285211] ____do_softirq+0x18/0x30
[ 559.285213] call_on_irq_stack+0x30/0x48
[ 559.285214] do_softirq_own_stack+0x24/0x50
[ 559.285216] __irq_exit_rcu+0xd8/0x100
[ 559.285217] irq_exit_rcu+0x18/0x30
[ 559.285219] el1_interrupt+0x48/0x98
[ 559.285221] el1h_64_irq_handler+0x18/0x28
[ 559.285222] el1h_64_irq+0x84/0x88
[ 559.285223] do_notify_resume+0x40/0x1b8 (P)
[ 559.285225] el0_interrupt+0xe0/0x140
[ 559.285226] __el0_irq_handler_common+0x18/0x28
[ 559.285227] el0t_64_irq_handler+0x10/0x20
[ 559.285228] el0t_64_irq+0x1b0/0x1b8
[ 745.155635] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 745.156445] rcu: 0-...0: (1 GPs behind) idle=d634/1/0x4000000000000002 softirq=7032/7032 fqs=2024
[ 745.157849] rcu: (detected by 1, t=5902 jiffies, g=16541, q=4391 ncpus=16)
[ 745.159013] Sending NMI from CPU 1 to CPUs 0:
[ 755.159658] rcu: rcu_sched kthread starved for 999 jiffies! g16541 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=10
[ 755.161090] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 755.162491] rcu: RCU grace-period kthread stack dump:
[ 755.163403] task:rcu_sched state:I stack:0 pid:17 tgid:17 ppid:2 flags:0x00000008
[ 755.164853] Call trace:
[ 755.165476] __switch_to+0xf0/0x160 (T)
[ 755.166267] __schedule+0x290/0x668
[ 755.167013] schedule+0x30/0xc8
[ 755.167726] schedule_timeout+0x90/0x168
[ 755.168504] rcu_gp_fqs_loop+0x110/0x450
[ 755.169294] rcu_gp_kthread+0x148/0x190
[ 755.170056] kthread+0xe0/0xf0
[ 755.170761] ret_from_fork+0x10/0x20
[ 755.171498] rcu: Stack dump where RCU GP kthread last ran:
[ 755.172452] Sending NMI from CPU 1 to CPUs 10:
[ 755.173292] NMI backtrace for cpu 10 skipped: idling at default_idle_call+0x3c/0xf0
[ 1016.903975] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1016.904835] rcu: 0-...0: (2 GPs behind) idle=d5a4/1/0x4000000000000002 softirq=9113/9113 fqs=2488
[ 1016.906235] rcu: (detected by 4, t=5902 jiffies, g=22521, q=14370 ncpus=16)
[ 1016.907405] Sending NMI from CPU 4 to CPUs 0:
[ 1026.908136] rcu: rcu_sched kthread timer wakeup didn't happen for 1000 jiffies! g22521 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[ 1026.909788] rcu: Possible timer handling issue on cpu=4 timer-softirq=11871
[ 1026.910942] rcu: rcu_sched kthread starved for 1001 jiffies! g22521 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=4
[ 1026.912534] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 1026.913957] rcu: RCU grace-period kthread stack dump:
[ 1026.914843] task:rcu_sched state:I stack:0 pid:17 tgid:17 ppid:2 flags:0x00000008
[ 1026.916283] Call trace:
[ 1026.916923] __switch_to+0xf0/0x160 (T)
[ 1026.917676] __schedule+0x290/0x668
[ 1026.918425] schedule+0x30/0xc8
[ 1026.919135] schedule_timeout+0x90/0x168
[ 1026.919940] rcu_gp_fqs_loop+0x110/0x450
[ 1026.920727] rcu_gp_kthread+0x148/0x190
[ 1026.921507] kthread+0xe0/0xf0
[ 1026.922219] ret_from_fork+0x10/0x20
[ 1026.922958] rcu: Stack dump where RCU GP kthread last ran:
[ 1026.923891] CPU: 4 UID: 0 PID: 0 Comm: swapper/4 Not tainted 6.12.55-74.119.amzn2023.aarch64 #1
[ 1026.925262] Hardware name: Amazon EC2 r8g.4xlarge/, BIOS 1.0 11/1/2018
[ 1026.926358] pstate: 43400005 (nZcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[ 1026.927509] pc : default_idle_call+0x3c/0xf0
[ 1026.928343] lr : default_idle_call+0x28/0xf0
[ 1026.929169] sp : ffff80008331bde0
[ 1026.929899] x29: ffff80008331bde0 x28: 0000000000000000 x27: 0000000000000000
[ 1026.931075] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
[ 1026.932235] x23: 00000000000000c0 x22: ffff800081d7fa60 x21: ffff0003c0b38000
[ 1026.933425] x20: 0000000000000000 x19: 0000000000000000 x18: 0000000000000000
[ 1026.934607] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 1026.935766] x14: ffff800081131880 x13: ffff800081d7fa68 x12: 0000000000000004
[ 1026.936939] x11: 00000000000000c0 x10: 6c754932078c6bf6 x9 : ffff8000801c61a4
[ 1026.938093] x8 : ffff0003c0b3a438 x7 : ffff8000ac0ab880 x6 : 0000000067aeac73
[ 1026.939274] x5 : 4000000000000000 x4 : 4000000000000000 x3 : ffff8021ee708000
[ 1026.940460] x2 : 0000000000000004 x1 : ffff80008331bda0 x0 : ffff0003c0b38000
[ 1026.941636] Call trace:
[ 1026.942261] default_idle_call+0x3c/0xf0 (P)
[ 1026.943089] cpuidle_idle_call+0x168/0x198
[ 1026.943886] do_idle+0xa8/0x120
[ 1026.944597] cpu_startup_entry+0x40/0x50
[ 1026.945398] secondary_start_kernel+0xe0/0x120
[ 1026.946237] __secondary_switched+0xc0/0xc8
[ 1028.144967] NMI backtrace for cpu 0 skipped: idling at default_idle_call+0x3c/0xf0
[ 1509.729088] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1509.729911] rcu: 0-...0: (4 GPs behind) idle=b6ec/1/0x4000000000000002 softirq=13924/13925 fqs=2867
[ 1509.731355] rcu: (detected by 1, t=5902 jiffies, g=39757, q=8441 ncpus=16)
[ 1509.732505] Sending NMI from CPU 1 to CPUs 0:
[ 1514.457704] NMI backtrace for cpu 0 skipped: idling at default_idle_call+0x3c/0xf0
[ 1514.457811] rcu: rcu_sched kthread starved for 471 jiffies! g39757 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=7
[ 1514.460489] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 1514.461896] rcu: RCU grace-period kthread stack dump:
[ 1514.462768] task:rcu_sched state:I stack:0 pid:17 tgid:17 ppid:2 flags:0x00000008
[ 1514.464216] Call trace:
[ 1514.464759] __switch_to+0xf0/0x160 (T)
[ 1514.465459] __schedule+0x290/0x668
[ 1514.466123] schedule+0x30/0xc8
[ 1514.466749] schedule_timeout+0x90/0x168
[ 1514.467463] rcu_gp_fqs_loop+0x110/0x450
[ 1514.468191] rcu_gp_kthread+0x148/0x190
[ 1514.468902] kthread+0xe0/0xf0
[ 1514.469522] ret_from_fork+0x10/0x20
[ 1514.470201] rcu: Stack dump where RCU GP kthread last ran:
[ 1514.471142] Sending NMI from CPU 1 to CPUs 7:
[ 1514.471930] NMI backtrace for cpu 7 skipped: idling at default_idle_call+0x3c/0xf0
[ 1761.536806] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1761.537710] rcu: 0-...0: (1 GPs behind) idle=75bc/1/0x4000000000000000 softirq=15304/15305 fqs=2359
[ 1761.539154] rcu: (detected by 2, t=5902 jiffies, g=45361, q=1759 ncpus=16)
[ 1761.540308] Sending NMI from CPU 2 to CPUs 0:
[ 1765.982160] NMI backtrace for cpu 0
[ 1765.982163] CPU: 0 UID: 992 PID: 24017 Comm: python3 Not tainted 6.12.55-74.119.amzn2023.aarch64 #1
[ 1765.982166] Hardware name: Amazon EC2 r8g.4xlarge/, BIOS 1.0 11/1/2018
[ 1765.982167] pstate: 43400005 (nZcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[ 1765.982168] pc : handle_softirqs+0x9c/0x310
[ 1765.982177] lr : __do_softirq+0x1c/0x28
[ 1765.982179] sp : ffff80008319bf50
[ 1765.982179] x29: ffff80008319bf50 x28: ffff0003c55ea480 x27: 0000000000000000
[ 1765.982182] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
[ 1765.982183] x23: 0000000003400005 x22: 0000000000000282 x21: ffff8000a13c3e40
[ 1765.982185] x20: ffff8000800100a8 x19: ffff8000a13c3cf0 x18: 0000000000000000
[ 1765.982186] x17: ffff8021ee680000 x16: ffff800083198000 x15: 0000000000000000
[ 1765.982187] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 1765.982189] x11: ffff8021ee680000 x10: 0000019b2c9939dc x9 : ffff80008001013c
[ 1765.982191] x8 : ffff0003c0fd8040 x7 : 000000000000000a x6 : 0000000000000000
[ 1765.982192] x5 : 0000000000800000 x4 : 0000000000400040 x3 : ffff800081975e40
[ 1765.982193] x2 : ffff8021ee680000 x1 : ffff800081975e40 x0 : ffff8021ee680000
[ 1765.982195] Call trace:
[ 1765.982196] handle_softirqs+0x9c/0x310 (P)
[ 1765.982198] __do_softirq+0x1c/0x28
[ 1765.982199] ____do_softirq+0x18/0x30
[ 1765.982201] call_on_irq_stack+0x30/0x48
[ 1765.982203] do_softirq_own_stack+0x24/0x50
[ 1765.982204] __irq_exit_rcu+0xd8/0x100
[ 1765.982206] irq_exit_rcu+0x18/0x30
[ 1765.982207] el1_interrupt+0x48/0x98
[ 1765.982210] el1h_64_irq_handler+0x18/0x28
[ 1765.982212] el1h_64_irq+0x84/0x88
[ 1765.982213] do_notify_resume+0x40/0x1b8 (P)
[ 1765.982215] el0_interrupt+0xe0/0x140
[ 1765.982216] __el0_irq_handler_common+0x18/0x28
[ 1765.982217] el0t_64_irq_handler+0x10/0x20
[ 1765.982219] el0t_64_irq+0x1b0/0x1b8
[ 1765.982602] rcu: rcu_sched kthread starved for 442 jiffies! g45361 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=10
[ 1766.013214] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 1766.014615] rcu: RCU grace-period kthread stack dump:
[ 1766.015499] task:rcu_sched state:I stack:0 pid:17 tgid:17 ppid:2 flags:0x00000008
[ 1766.016942] Call trace:
[ 1766.017465] __switch_to+0xf0/0x160 (T)
[ 1766.018181] __schedule+0x290/0x668
[ 1766.018851] schedule+0x30/0xc8
[ 1766.019479] schedule_timeout+0x90/0x168
[ 1766.020207] rcu_gp_fqs_loop+0x110/0x450
[ 1766.020935] rcu_gp_kthread+0x148/0x190
[ 1766.021635] kthread+0xe0/0xf0
[ 1766.022246] ret_from_fork+0x10/0x20
[ 1766.022928] rcu: Stack dump where RCU GP kthread last ran:
[ 1766.023859] Sending NMI from CPU 2 to CPUs 10:
[ 1766.024667] NMI backtrace for cpu 10 skipped: idling at default_idle_call+0x3c/0xf0
[ 2199.150223] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 2199.151075] rcu: 0-...0: (1 GPs behind) idle=205c/1/0x4000000000000000 softirq=19121/19122 fqs=1391
[ 2199.152518] rcu: (detected by 7, t=5906 jiffies, g=58149, q=20399 ncpus=16)
[ 2199.153685] Sending NMI from CPU 7 to CPUs 0:
[ 2207.009221] NMI backtrace for cpu 0
[ 2207.009224] CPU: 0 UID: 992 PID: 28020 Comm: python3 Not tainted 6.12.55-74.119.amzn2023.aarch64 #1
[ 2207.009227] Hardware name: Amazon EC2 r8g.4xlarge/, BIOS 1.0 11/1/2018
[ 2207.009228] pstate: 43400005 (nZcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[ 2207.009229] pc : handle_softirqs+0x9c/0x310
[ 2207.009236] lr : __do_softirq+0x1c/0x28
[ 2207.009238] sp : ffff80008319bf50
[ 2207.009239] x29: ffff80008319bf50 x28: ffff0004165c2480 x27: 0000000000000000
[ 2207.009241] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
[ 2207.009243] x23: 0000000083400805 x22: 0000000000000282 x21: ffff8000abe53e10
[ 2207.009244] x20: ffff8000800100a8 x19: ffff8000abe53cc0 x18: 0000000000000000
[ 2207.009245] x17: ffff8021ee680000 x16: ffff800083198000 x15: 0000000000000000
[ 2207.009247] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 2207.009248] x11: ffff8021ee680000 x10: 00000201dc000e8e x9 : ffff80008001013c
[ 2207.009249] x8 : ffff0003c0fd8040 x7 : 0000000000000003 x6 : 0000000000000000
[ 2207.009250] x5 : 0000000000800000 x4 : 0000000000400040 x3 : ffff800081975e40
[ 2207.009252] x2 : ffff8021ee680000 x1 : ffff800081975e40 x0 : ffff8021ee680000
[ 2207.009253] Call trace:
[ 2207.009254] handle_softirqs+0x9c/0x310 (P)
[ 2207.009256] __do_softirq+0x1c/0x28
[ 2207.009257] ____do_softirq+0x18/0x30
[ 2207.009259] call_on_irq_stack+0x30/0x48
[ 2207.009261] do_softirq_own_stack+0x24/0x50
[ 2207.009263] __irq_exit_rcu+0xd8/0x100
[ 2207.009264] irq_exit_rcu+0x18/0x30
[ 2207.009266] el1_interrupt+0x48/0x98
[ 2207.009269] el1h_64_irq_handler+0x18/0x28
[ 2207.009270] el1h_64_irq+0x84/0x88
[ 2207.009271] __arm64_sys_newfstatat+0x0/0x38 (P)
[ 2207.009273] el0_svc_common.constprop.0+0x48/0xf8
[ 2207.009275] do_el0_svc+0x28/0x40
[ 2207.009276] el0_svc+0x3c/0x180
[ 2207.009277] el0t_64_sync_handler+0x13c/0x158
[ 2207.009278] el0t_64_sync+0x1b0/0x1b8
[ 2207.010202] rcu: rcu_sched kthread starved for 780 jiffies! g58149 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=3
[ 2207.041539] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 2207.042943] rcu: RCU grace-period kthread stack dump:
[ 2207.043817] task:rcu_sched state:I stack:0 pid:17 tgid:17 ppid:2 flags:0x00000008
[ 2207.045272] Call trace:
[ 2207.045800] __switch_to+0xf0/0x160 (T)
[ 2207.046509] __schedule+0x290/0x668
[ 2207.047173] schedule+0x30/0xc8
[ 2207.047794] schedule_timeout+0x90/0x168
[ 2207.048513] rcu_gp_fqs_loop+0x110/0x450
[ 2207.049256] rcu_gp_kthread+0x148/0x190
[ 2207.049953] kthread+0xe0/0xf0
[ 2207.050566] ret_from_fork+0x10/0x20
[ 2207.051248] rcu: Stack dump where RCU GP kthread last ran:
[ 2207.052187] Sending NMI from CPU 7 to CPUs 3:
[ 2207.052973] NMI backtrace for cpu 3 skipped: idling at default_idle_call+0x3c/0xf0
While running the dataload, the server is neither CPU bound (AVG 14%) nor disc-bound. Most of the time taken is network I/O (awaiting DB to provide data or awaiting the DB-commit to confirm).
Both databases are running on separate servers.
Additional context
- r8g.4xlarge
- EBS GP3 500GB (3000 IOPS)
- Amazon Linux 2023.9.20251117
- Python 3.9.24
- modinfo ena
version: 2.15.0g - ifconfig
ens34: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9001
inet 10.0.0.75 netmask 255.255.255.0 broadcast 10.0.0.255
inet6 fe80::b9:9bff:fe4a:3187 prefixlen 64 scopeid 0x20
ether 02:b9:9b:4a:31:87 txqueuelen 1000 (Ethernet)
RX packets 7937817 bytes 11890830800 (11.0 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 7524972 bytes 5313936205 (4.9 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
Metadata
Metadata
Assignees
Labels
No labels