Problem
I noticed some unusually long times spent in `read_byte_range` for a DataFusion workload partitioned 32 ways and doing 100% object store reads.
Sometimes these spans were as long as ~20s, with the majority of the time attributed to `idle_ns`. After noticing that the thread pool onto which these IO operations are dispatched only had a single thread:

```rust
// vortex/vortex-file/src/generic.rs, lines 22-23 @ f5473e5
static TOKIO_DISPATCHER: std::sync::LazyLock<IoDispatcher> =
    std::sync::LazyLock::new(|| IoDispatcher::new_tokio(1));
```

I tried bumping it to `num_cpus::get()`, to no effect.
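For reference, this is roughly the change I tried, written out as a minimal sketch (the import path for `IoDispatcher` is an assumption on my part):

```rust
// Sketch only: the same LazyLock as above, but sized to the number of CPUs.
// Assumes IoDispatcher::new_tokio takes the worker-thread count, as in the
// snippet above, and that the num_cpus crate is available as a dependency.
use vortex_io::IoDispatcher; // import path assumed, adjust as needed

static TOKIO_DISPATCHER: std::sync::LazyLock<IoDispatcher> =
    std::sync::LazyLock::new(|| IoDispatcher::new_tokio(num_cpus::get()));
```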
Based on tokio-console, there was indeed very high scheduling overhead.
At this point I tried removing the dispatch pool and running the IO tasks on the same runtime, and performance improved dramatically, confirming that the separate IO dispatcher pool introduces high scheduling overhead.
Profiles
I collected some profiling data while running with the separate IO pool, and I think the majority of the overhead comes from flume, the crate that provides the channel onto which IO tasks are scheduled and from which the IO threads pick them up.
A CPU profile shows that the majority of CPU time is spent just managing threads waiting on the channel: https://pprof.me/b6613f12145582bc14829a2232574346
For reference, this is where the flume channel is read from:

```rust
// vortex/vortex-io/src/dispatcher/tokio.rs, line 51 @ f5473e5
while let Ok(task) = rx.recv_async().await {
```
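To make the pattern concrete, here is a rough, self-contained sketch of the dispatcher shape as I understand it (this is not the actual vortex code; the task type, worker count, and workload are made up): IO tasks are boxed futures pushed through a flume channel and picked up by worker tasks looping on `recv_async()`, so every submission pays for channel wake-ups and queue management.

```rust
// Rough sketch of the dispatch pattern (not the actual vortex code).
// Requires the `flume` and `tokio` crates as dependencies.
use std::future::Future;
use std::pin::Pin;

type IoTask = Pin<Box<dyn Future<Output = ()> + Send + 'static>>;

#[tokio::main]
async fn main() {
    let (tx, rx) = flume::unbounded::<IoTask>();

    // Worker loop, analogous to the `rx.recv_async().await` loop above.
    let workers: Vec<_> = (0..4)
        .map(|_| {
            let rx = rx.clone();
            tokio::spawn(async move {
                while let Ok(task) = rx.recv_async().await {
                    task.await;
                }
            })
        })
        .collect();

    // Every "IO" request is boxed and funneled through the channel,
    // which is where the profile shows the CPU time going.
    for _ in 0..1_000 {
        let task: IoTask = Box::pin(async {
            // placeholder for a real read_byte_range call
        });
        tx.send(task).expect("workers still running");
    }
    drop(tx); // close the channel so the workers exit

    for worker in workers {
        worker.await.unwrap();
    }
}
```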
Additionally, there seems to be a bug in flume that causes a memory leak, which probably doesn't help. After running a couple of queries, the system was fully idle but still holding on to ~30GiB of memory, all of it coming from a vector inside flume: https://pprof.me/7f60e6b4ad581c22be6dd3669ba70eb9
Solution
Long term, @gatesn is working on a redesign of the IO dispatcher pool (#4406) that will, among other things, make it possible to run more than one thread in the pool. The use of flume should probably be re-evaluated or fixed as part of that work. There also seems to be some movement towards using io_uring, which should remove the need for a separate dispatcher pool altogether.
Short term, removing the IO dispatcher pool seems to work well enough (polarsignals@8b593ae), although it requires running on a fork. The approach I took is to just use tokio::spawn and make things Send so that tasks can be scheduled on any thread for a more even work distribution; a rough sketch of the idea is below. Another option would be to offer a dispatcher mode that runs on the current runtime, but I'm unsure whether that also requires making things Send.
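For illustration, a minimal sketch of that approach (the function name is hypothetical; the real change is in the commit linked above): rather than pushing the future onto the dispatcher channel, it is spawned directly onto the ambient tokio runtime, which is why the futures and their outputs need to be Send.

```rust
use std::future::Future;

use tokio::task::JoinHandle;

/// Hypothetical stand-in for the dispatcher's dispatch entry point.
/// Instead of pushing the task onto a flume channel consumed by a dedicated
/// IO thread pool, it spawns the task onto the current tokio runtime. This
/// requires the future and its output to be `Send`, which is why the fork
/// also makes the IO futures `Send`.
fn dispatch_on_current_runtime<F>(task: F) -> JoinHandle<F::Output>
where
    F: Future + Send + 'static,
    F::Output: Send + 'static,
{
    tokio::spawn(task)
}

#[tokio::main]
async fn main() {
    // Usage: the IO work can now run on any worker thread of the runtime
    // instead of queuing behind a single dispatcher thread.
    let handle = dispatch_on_current_runtime(async {
        // placeholder for a read_byte_range-style request
        42u64
    });
    assert_eq!(handle.await.unwrap(), 42);
}
```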