Linux io-wq UAF rediscovery
----------------------------------------------------------------------------
date     : 2026-05-18
vendor   : linux-kernel
target   : linux/kernel
severity : high
cve      : n/a
status   : Published
----------------------------------------------------------------------------

Independent rediscovery of an io-wq hash_tail UAF. Fixed upstream in d6a2d7b04b5a. KASAN repro in lab.


Use-after-free in `io-wq`, on the read of `wq->hash_tail[index]` during worker selection. Already fixed upstream in commit `d6a2d7b04b5a`. Stable kernels carrying the bug picked up the backport in their corresponding releases.

I rediscovered it independently. Writing it up here as a methodology note plus a working reproducer, not as a vulnerability announcement.

## what's broken

io-wq is the workqueue subsystem inside `io_uring`. It manages a pool of worker threads that pick up async work submitted through SQEs. Hashed work, the kind tagged with `IOSQE_ASYNC` and bound to a particular hash bucket, gets routed for cache locality: work items with the same hash land on the same worker so caches stay warm across submissions.

The routing relies on `wq->hash_tail[]`, an array on the workqueue, indexed by hash bucket. Each slot holds a pointer to the last work item enqueued for that bucket. On enqueue, io-wq reads `hash_tail[bucket]` to decide whether to chain the new work after an existing tail (locality preserved) or set up a fresh chain.

The lifetime of the pointer in `hash_tail[bucket]` is tied to the work item it points to. When that work item is freed, the slot is supposed to be cleared or replaced before any concurrent reader can see the dangling pointer.

There's a window between the free and the clear where a concurrent enqueue path reads the slot and gets back a pointer into freed slab memory. Slab allocators recycle memory quickly under load, so on a stock kernel the stale pointer may already reference an unrelated object. KASAN marks the freed object as inaccessible in its shadow memory, and that marking is what trips the report on the next read.

## the race

Two contexts collide.

The teardown path: `io_wq_exit_workers` runs when a ring is being shut down. It drains outstanding work, frees the `io_wq_work` objects, and is supposed to clear the corresponding `hash_tail` slots. The clear is not atomic with respect to the free.

The submit path: `io_wq_enqueue`, reached via `io_queue_iowq` from `io_req_task_submit`, reads `hash_tail[bucket]` and decides what to do based on whether it sees a tail. If the read returns a freed pointer (because teardown freed the work but hasn't cleared the slot yet), enqueue dereferences it on the chain decision and KASAN trips.

The window is short. Under normal load it almost never opens. To open it reliably you need rapid ring teardown bursts overlapping with hashed work submission on adjacent rings or task_work flushes. The reproducer below structures the workload to maximize that overlap.

## what the patch does

Upstream fix `d6a2d7b04b5a` adds a check that the workqueue slot is live before reading `hash_tail`. The shape is a load with explicit memory ordering, paired with a release on the teardown side: teardown publishes the invalidation before the free becomes visible, and enqueue refuses to use a `hash_tail` value if the corresponding bucket has been marked dead.

The result is that the enqueue path either sees a valid pointer (it can deref) or sees the invalid sentinel (it bails to the slow path that does a fresh worker pick). Race window closed.

## reproduction

Build an affected kernel with KASAN on. Affected kernels are 6.x prior to the `d6a2d7b04b5a` backport landing in the relevant stable line.

```bash
make defconfig
./scripts/config --enable CONFIG_KASAN \
                 --enable CONFIG_KASAN_INLINE \
                 --enable CONFIG_IO_URING \
                 --enable CONFIG_DEBUG_KERNEL
# scripts/config only edits .config; let Kconfig resolve dependencies.
make olddefconfig
make -j"$(nproc)"
```

`CONFIG_KASAN_INLINE` gives you inline KASAN checks (faster, larger image) which catch the read at the access site rather than through a function call. Either inline or outline works for repro; inline produces a cleaner backtrace.

The reproducer is a small userspace program. It opens N rings, submits hashed NOP work concurrently from each, then tears them all down in a tight loop to force the teardown path to overlap with the submit path on adjacent rings.

```c
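/* Build: cc -O2 repro.c -o repro -luring */
#include <liburing.h>
#include <stdio.h>
#include <string.h>

#define RINGS 32
#define DEPTH 64

static struct io_uring rings[RINGS];

int main(void)
{
    for (int round = 0; round < 1000; round++) {
        /* Bring up a fresh batch of rings for this round. */
        for (int i = 0; i < RINGS; i++) {
            int ret = io_uring_queue_init(DEPTH, &rings[i], 0);
            if (ret < 0) {
                fprintf(stderr, "queue_init: %s\n", strerror(-ret));
                return 1;
            }
        }
        /* Punt one NOP per ring to io-wq via IOSQE_ASYNC. */
        for (int i = 0; i < RINGS; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&rings[i]);
            if (!sqe)
                continue;
            io_uring_prep_nop(sqe);
            sqe->flags |= IOSQE_ASYNC;
            io_uring_submit(&rings[i]);
        }
        /* Tear everything down right away so exit overlaps in-flight work. */
        for (int i = 0; i < RINGS; i++)
            io_uring_queue_exit(&rings[i]);
    }
    return 0;
}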
```

Pinning the process to a small CPU set (2 cores is plenty) makes the race more frequent by forcing teardown and submit work to interleave on the same CPUs. Without pinning you may need several seconds before KASAN trips. With pinning the first round usually does it.

Under load, KASAN trips on the read of `wq->hash_tail[index]` in the io-wq worker selection path:

```text
BUG: KASAN: slab-use-after-free in io_wq_enqueue+0x...
Read of size 8 at addr ffff888100abc010 by task repro/...

Call Trace:
  io_wq_enqueue                  // reads wq->hash_tail[bucket]
  io_queue_iowq                  // io_uring submit path
  io_req_task_submit             // task_work flush
  task_work_run
  exit_to_user_mode_loop
  ...

Allocated by task ...:
  kmem_cache_alloc
  io_wq_alloc_work
  ...

Freed by task ...:
  kmem_cache_free
  io_wq_exit_workers             // teardown frees prior tail
  io_wq_put_and_exit
  io_ring_exit_work
  process_one_work
  ...
```

The annotated frames show the shape: the submit-side reader is in `io_wq_enqueue`, the freer is `io_wq_exit_workers` running from a different ring's teardown on the worker pool that's about to be reaped.

## detection without KASAN

On a production kernel without KASAN compiled in, the bug shows up as a kernel oops or general protection fault on a wild pointer dereference inside the io-wq enqueue path. The crash signature varies because the freed slab object can be zeroed, repurposed for another allocation, or carrying allocator debug poison.

Workloads that hit this in practice are containerized services with rapid io_uring ring lifecycles: containers that fork, exec, and exit quickly, services that cycle rings on configuration change, and worker pools that recreate the ring per request. Services that hold one ring open for the long term rarely trigger it.

If you operate kernels that aren't yet carrying the `d6a2d7b04b5a` backport, detection options are:

1. Crash signature matching on `io_wq_enqueue+` frames in syslog with a wild pointer in `RSI` or `RDI`.
2. eBPF kprobe on `io_wq_enqueue` that validates the `hash_tail` pointer it is about to use against the slab cache for `io_wq_work`. Flag when the pointer is in a freed slab.
3. Audit metrics on ring teardown rate. The class of services that hit this is the class that tears down rings hundreds of times per second under stress. That's the signal worth alerting on independently of the bug.

## variants in adjacent code

The pattern is: an element of a pointer array read during enqueue, indexed by a key derived from the work, racing with a teardown path that frees the referenced object. I checked the obvious candidates in io-wq and a couple of places in `io_uring` outside io-wq after fingerprinting the primary site. Nothing else trips reliably under this harness, but the pattern is worth a grep across any worker pool code: arrays of `struct io_wq_work *` indexed by hash or bucket, similar arrays in `io_uring` for request queues and cancel tables.

The lifetime invariant that fails in all of these is the same. The array slot's lifetime is implicitly tied to the object it points to, and concurrent readers don't have a way to verify liveness before deref.

A more robust defense pattern for this class is to wrap the pointer in a tagged or versioned slot: each slot carries a generation counter that the reader can compare against the object's generation. Mismatch means stale, bail to slow path. That's heavier than the liveness check the upstream patch uses, but it's the better pattern if you have to do this in multiple places. The upstream fix took the lighter approach because it was localized to one site.

## what this calibrated

The pipeline surfaced the candidate site from a diff against the upstream io-wq patch series, with no prior knowledge of the fix or its commit message. The same primitive ended up as both the pipeline's candidate and upstream's actual fix target. That match is the calibration data point worth recording.

Time from "this looks like a candidate" to KASAN repro was short. Most of the time went into the harness, not the analysis. Once the workload shape was right (small rings, fast teardown, pinned CPUs), repro was reliable enough to drop into CI.

For a solo operation hunting kernel bugs, the metric that matters is time from candidate to repro. If the pipeline produces candidates faster than they can be harnessed and reproduced, the bottleneck shifts to bench time, which scales with effort more directly than analysis time scales with intuition. This rediscovery suggests the pipeline is operating in the right regime for kernel work.

## patch

Upstream fix: [`d6a2d7b04b5a`](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d6a2d7b04b5a093021a7a0e2e69e9d5237dfa8cc).


----------------------------------------------------------------------------
https://dtrsecurity.com/research/linux-iowq-uaf/
----------------------------------------------------------------------------