Problem
I noticed some unusually long times spent in `read_byte_range` for a DataFusion workload partitioned 32 ways and doing 100% object store reads.
Sometimes these spans were as long as ~20s, with the majority of the time attributed to `idle_ns`. After noticing that the thread pool onto which these IO operations are dispatched only had a single thread:

```rust
// vortex/vortex-file/src/generic.rs, lines 22-23 @ f5473e5
static TOKIO_DISPATCHER: std::sync::LazyLock<IoDispatcher> =
    std::sync::LazyLock::new(|| IoDispatcher::new_tokio(1));
```

I tried bumping it to `num_cpus::get()`, to no effect.
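For reference, this is roughly the change I tried, written out as a minimal sketch (the import path for `IoDispatcher` is an assumption on my part):

```rust
// Sketch only: the same LazyLock as above, but sized to the number of CPUs.
// Assumes IoDispatcher::new_tokio takes the worker-thread count, as in the
// snippet above, and that the num_cpus crate is available as a dependency.
use vortex_io::IoDispatcher; // import path assumed, adjust as needed

static TOKIO_DISPATCHER: std::sync::LazyLock<IoDispatcher> =
    std::sync::LazyLock::new(|| IoDispatcher::new_tokio(num_cpus::get()));
```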
Based on tokio-console, there was indeed very high scheduling overhead.
At this point I tried removing the dispatch pool and running the IO tasks on the same runtime, and performance improved dramatically, confirming that the separate IO dispatcher pool introduces high scheduling overhead.
Profiles
I collected some profiling data while running with the separate IO pool, and I think the majority of the overhead comes from flume, the crate that provides the channel onto which IO tasks are scheduled and from which the IO threads pick them up.
A CPU profile shows that the majority of CPU time is spent just managing threads waiting on the channel: https://pprof.me/b6613f12145582bc14829a2232574346
For reference, this is where the flume channel is read from:

```rust
// vortex/vortex-io/src/dispatcher/tokio.rs, line 51 @ f5473e5
while let Ok(task) = rx.recv_async().await {
```
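To make the pattern concrete, here is a rough, self-contained sketch of the dispatcher shape as I understand it (this is not the actual vortex code; the task type, worker count, and workload are made up): IO tasks are boxed futures pushed through a flume channel and picked up by worker tasks looping on `recv_async()`, so every submission pays for channel wake-ups and queue management.

```rust
// Rough sketch of the dispatch pattern (not the actual vortex code).
// Requires the `flume` and `tokio` crates as dependencies.
use std::future::Future;
use std::pin::Pin;

type IoTask = Pin<Box<dyn Future<Output = ()> + Send + 'static>>;

#[tokio::main]
async fn main() {
    let (tx, rx) = flume::unbounded::<IoTask>();

    // Worker loop, analogous to the `rx.recv_async().await` loop above.
    let workers: Vec<_> = (0..4)
        .map(|_| {
            let rx = rx.clone();
            tokio::spawn(async move {
                while let Ok(task) = rx.recv_async().await {
                    task.await;
                }
            })
        })
        .collect();

    // Every "IO" request is boxed and funneled through the channel,
    // which is where the profile shows the CPU time going.
    for _ in 0..1_000 {
        let task: IoTask = Box::pin(async {
            // placeholder for a real read_byte_range call
        });
        tx.send(task).expect("workers still running");
    }
    drop(tx); // close the channel so the workers exit

    for worker in workers {
        worker.await.unwrap();
    }
}
```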
Additionally, there seems to be a bug in flume that causes a memory leak, which probably doesn't help. After running a couple of queries, the system was fully idle but still holding on to ~30GiB of memory, all of it coming from a vector inside flume: https://pprof.me/7f60e6b4ad581c22be6dd3669ba70eb9
Solution
Long term, @gatesn is working on a redesign of the IO dispatcher pool (#4406) that will, among other things, make it possible to run more than one thread in the pool. The use of flume should probably be re-evaluated or fixed as part of that work. There also seems to be some movement towards using io_uring, which should remove the need for a separate dispatcher pool altogether.
Short term, removing the IO dispatcher pool seems to work well enough (polarsignals@8b593ae), although it requires running on a fork. The approach I took is to just use tokio::spawn and make things Send so that tasks can be scheduled on any thread for a more even work distribution; a rough sketch of the idea is below. Another option would be to offer a dispatcher mode that runs on the current runtime, but I'm unsure whether that also requires making things Send.
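For illustration, a minimal sketch of that approach (the function name is hypothetical; the real change is in the commit linked above): rather than pushing the future onto the dispatcher channel, it is spawned directly onto the ambient tokio runtime, which is why the futures and their outputs need to be Send.

```rust
use std::future::Future;

use tokio::task::JoinHandle;

/// Hypothetical stand-in for the dispatcher's dispatch entry point.
/// Instead of pushing the task onto a flume channel consumed by a dedicated
/// IO thread pool, it spawns the task onto the current tokio runtime. This
/// requires the future and its output to be `Send`, which is why the fork
/// also makes the IO futures `Send`.
fn dispatch_on_current_runtime<F>(task: F) -> JoinHandle<F::Output>
where
    F: Future + Send + 'static,
    F::Output: Send + 'static,
{
    tokio::spawn(task)
}

#[tokio::main]
async fn main() {
    // Usage: the IO work can now run on any worker thread of the runtime
    // instead of queuing behind a single dispatcher thread.
    let handle = dispatch_on_current_runtime(async {
        // placeholder for a read_byte_range-style request
        42u64
    });
    assert_eq!(handle.await.unwrap(), 42);
}
```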