.. SPDX-License-Identifier: GPL-2.0

=====================
io_uring zero copy Rx
=====================

Introduction
============

io_uring zero copy Rx (ZC Rx) is a feature that removes the kernel-to-user
copy on the network receive path, allowing packet data to be received
directly into userspace memory. It differs from TCP_ZEROCOPY_RECEIVE in that
there are no strict alignment requirements and no need to mmap()/munmap().
Unlike kernel bypass solutions such as DPDK, packet headers are still
processed by the kernel TCP stack as normal.

NIC HW Requirements
===================

Several NIC HW features are required for io_uring ZC Rx to work. For now the
kernel API does not configure the NIC, so it must be done by the user.

Header/data split
-----------------

Required to split packets at the L4 boundary into a header and a payload.
Headers are received into kernel memory as normal and processed by the TCP
stack. Payloads are received directly into userspace memory.

Flow steering
-------------

Specific HW Rx queues are configured for this feature, but modern NICs
typically distribute flows across all HW Rx queues. Flow steering is required
to ensure that only the desired flows are directed towards the HW queues that
are configured for io_uring ZC Rx.

RSS
---

In addition to flow steering above, RSS is required to steer all other
non-zero copy flows away from the queues that are configured for io_uring
ZC Rx.

Usage
=====

Setup NIC
---------

Must be done out of band for now.

Ensure there are at least two queues::

  ethtool -L eth0 combined 2

Enable header/data split::

  ethtool -G eth0 tcp-data-split on

Carve out half of the HW Rx queues for zero copy using RSS::

  ethtool -X eth0 equal 1

Set up flow steering, bearing in mind that queues are 0-indexed::

  ethtool -N eth0 flow-type tcp6 ... action 1

Setup io_uring
--------------

This section describes the low level io_uring kernel API. Please refer to the
liburing documentation for how to use the higher level API.

Create an io_uring instance with the following required setup flags::

  IORING_SETUP_SINGLE_ISSUER
  IORING_SETUP_DEFER_TASKRUN
  IORING_SETUP_CQE32
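For example, with liburing this can be done as follows. This is a minimal
sketch: the queue depth of 32 is arbitrary, and error handling is reduced to
a bare exit::

  #include <stdlib.h>
  #include <liburing.h>

  struct io_uring ring;
  struct io_uring_params params = {
          /* all three flags are required for IORING_OP_RECV_ZC */
          .flags = IORING_SETUP_SINGLE_ISSUER |
                   IORING_SETUP_DEFER_TASKRUN |
                   IORING_SETUP_CQE32,
  };

  /* returns 0 on success, a negative errno value on failure */
  if (io_uring_queue_init_params(32, &ring, &params) < 0)
          exit(1);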
Create memory area
------------------

Allocate a userspace memory area for receiving zero copy data::

  void *area_ptr = mmap(NULL, area_size,
                        PROT_READ | PROT_WRITE,
                        MAP_ANONYMOUS | MAP_PRIVATE,
                        -1, 0);

Create refill ring
------------------

Allocate memory for a shared ring buffer used for returning consumed
buffers::

  void *ring_ptr = mmap(NULL, ring_size,
                        PROT_READ | PROT_WRITE,
                        MAP_ANONYMOUS | MAP_PRIVATE,
                        -1, 0);

This refill ring consists of some space for the header, followed by an array
of ``struct io_uring_zcrx_rqe``::

  size_t rq_entries = 4096;
  size_t ring_size = rq_entries * sizeof(struct io_uring_zcrx_rqe) + PAGE_SIZE;
  /* align to page size */
  ring_size = (ring_size + (PAGE_SIZE - 1)) & ~(PAGE_SIZE - 1);

Register ZC Rx
--------------

Fill in the registration structs::

  struct io_uring_zcrx_area_reg area_reg = {
          .addr = (__u64)(unsigned long)area_ptr,
          .len = area_size,
          .flags = 0,
  };

  struct io_uring_region_desc region_reg = {
          .user_addr = (__u64)(unsigned long)ring_ptr,
          .size = ring_size,
          .flags = IORING_MEM_REGION_TYPE_USER,
  };

  struct io_uring_zcrx_ifq_reg reg = {
          .if_idx = if_nametoindex("eth0"),
          /* this is the HW queue with the desired flow steered into it */
          .if_rxq = 1,
          .rq_entries = rq_entries,
          .area_ptr = (__u64)(unsigned long)&area_reg,
          .region_ptr = (__u64)(unsigned long)&region_reg,
  };

Register with the kernel::

  io_uring_register_ifq(ring, &reg);

Map refill ring
---------------

The kernel fills in the fields for the refill ring in the registration
``struct io_uring_zcrx_ifq_reg``. Map it into userspace::

  struct io_uring_zcrx_rq refill_ring;

  refill_ring.khead = (unsigned *)((char *)ring_ptr + reg.offsets.head);
  refill_ring.ktail = (unsigned *)((char *)ring_ptr + reg.offsets.tail);
  refill_ring.rqes =
          (struct io_uring_zcrx_rqe *)((char *)ring_ptr + reg.offsets.rqes);
  /* the kernel may have rounded rq_entries up to a power of two */
  refill_ring.ring_entries = reg.rq_entries;
  refill_ring.rq_tail = 0;
  refill_ring.ring_ptr = ring_ptr;

Receiving data
--------------

Prepare a zero copy recv request, where ``fd`` is a connected socket::

  struct io_uring_sqe *sqe;

  sqe = io_uring_get_sqe(ring);
  io_uring_prep_rw(IORING_OP_RECV_ZC, sqe, fd, NULL, 0, 0);
  sqe->ioprio |= IORING_RECV_MULTISHOT;

Now, submit and wait::

  io_uring_submit_and_wait(ring, 1);

Finally, process completions::

  struct io_uring_cqe *cqe;
  unsigned int count = 0;
  unsigned int head;

  io_uring_for_each_cqe(ring, head, cqe) {
          struct io_uring_zcrx_cqe *rcqe = (struct io_uring_zcrx_cqe *)(cqe + 1);

          unsigned long mask = (1ULL << IORING_ZCRX_AREA_SHIFT) - 1;
          unsigned char *data = area_ptr + (rcqe->off & mask);
          /* do something with the data */

          count++;
  }

  io_uring_cq_advance(ring, count);

Recycling buffers
-----------------

Return buffers back to the kernel to be used again. Here ``cqe`` and ``rcqe``
are the completion entries from the loop above::

  struct io_uring_zcrx_rqe *rqe;
  unsigned mask = refill_ring.ring_entries - 1;

  rqe = &refill_ring.rqes[refill_ring.rq_tail & mask];

  unsigned long area_offset = rcqe->off & ~IORING_ZCRX_AREA_MASK;
  rqe->off = area_offset | area_reg.rq_area_token;
  rqe->len = cqe->res;
  IO_URING_WRITE_ONCE(*refill_ring.ktail, ++refill_ring.rq_tail);

Testing
=======

See ``tools/testing/selftests/drivers/net/hw/iou-zcrx.c``
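In outline, the receiving and recycling snippets above combine into a loop
along the following lines. This is a sketch only, assuming the setup shown
earlier; it omits error handling and termination (a real loop must also check
``cqe->res`` and stop once a completion arrives without ``IORING_CQE_F_MORE``
set, which ends the multishot request)::

  for (;;) {
          struct io_uring_cqe *cqe;
          unsigned int count = 0;
          unsigned int head;

          io_uring_submit_and_wait(ring, 1);

          io_uring_for_each_cqe(ring, head, cqe) {
                  struct io_uring_zcrx_cqe *rcqe =
                          (struct io_uring_zcrx_cqe *)(cqe + 1);
                  unsigned long mask = (1ULL << IORING_ZCRX_AREA_SHIFT) - 1;
                  unsigned char *data = area_ptr + (rcqe->off & mask);
                  struct io_uring_zcrx_rqe *rqe;
                  unsigned rq_mask = refill_ring.ring_entries - 1;

                  /* consume cqe->res bytes at data, then recycle the buffer */
                  rqe = &refill_ring.rqes[refill_ring.rq_tail & rq_mask];
                  rqe->off = (rcqe->off & ~IORING_ZCRX_AREA_MASK) |
                             area_reg.rq_area_token;
                  rqe->len = cqe->res;
                  IO_URING_WRITE_ONCE(*refill_ring.ktail,
                                      ++refill_ring.rq_tail);
                  count++;
          }

          io_uring_cq_advance(ring, count);
  }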