
IO dispatcher pool overhead makes it unusable in production scenarios #4400

@asubiotto

Description


Problem

I noticed some unusually long times spent in read_byte_range for a workload using DataFusion, partitioned 32 ways and doing 100% object reads.

[Image: trace showing multi-second read_byte_range spans]

Sometimes these spans were as long as ~20s, with the majority of the time attributed to idle_ns. I then noticed that the thread pool onto which these IO operations are dispatched has only a single thread:

```rust
static TOKIO_DISPATCHER: std::sync::LazyLock<IoDispatcher> =
    std::sync::LazyLock::new(|| IoDispatcher::new_tokio(1));
```

I tried bumping the thread pool size to num_cpus::get(), to no effect.

Based on tokio console, there was indeed a very high scheduling overhead:

[Image: tokio console showing high scheduling overhead]

At this point I tried removing the dispatch pool and running IO tasks on the same runtime, and performance improved dramatically, confirming that the separate IO dispatcher pool introduces high scheduling overhead.

Profiles

I collected some profiling data while running with the separate IO pool, and I think the majority of the overhead comes from flume, the crate that provides the channel onto which IO tasks are scheduled and from which the IO threads collect them.

A CPU profile shows that the majority of CPU time is spent just managing threads waiting on the channel: https://pprof.me/b6613f12145582bc14829a2232574346

For reference, this is the code that reads from the flume channel:

```rust
while let Ok(task) = rx.recv_async().await {
```

Additionally, there seems to be a bug in flume that causes a memory leak, which probably doesn't help. After running a couple of queries, the system was fully idle but holding on to ~30GiB of memory, all coming from a flume vector: https://pprof.me/7f60e6b4ad581c22be6dd3669ba70eb9

Solution

Long term, @gatesn is working on a redesign of the IO dispatcher pool (#4406) that will also make it possible to run more than one thread in the pool. The use of flume should probably be re-evaluated or fixed as part of this work. There also seems to be some movement towards using io_uring, which would remove the need for a separate dispatcher pool.

Short term, removing the IO dispatcher pool seems to work well enough (polarsignals@8b593ae), although it requires running on a fork. The approach I took was to just use tokio::spawn and make things Send so that tasks can be scheduled on any thread, giving a more even work distribution. Another option is to offer a dispatcher option that runs on the current runtime, but I'm unsure whether that also requires making things Send.

Metadata

Labels: bug