.. SPDX-License-Identifier: GPL-2.0 OR GFDL-1.2-no-invariants-or-later

==========================
EDAC Memory Repair Control
==========================

Copyright (c) 2024-2025 HiSilicon Limited.

:Author:   Shiju Jose
:License:  The GNU Free Documentation License, Version 1.2 without
           Invariant Sections, Front-Cover Texts nor Back-Cover Texts.
           (dual licensed under the GPL v2)
:Original Reviewers:

- Written for: 6.15

Introduction
------------

Some memory devices support repair operations to address issues in their
memory media. Post Package Repair (PPR) and memory sparing are examples of
such features.

Post Package Repair (PPR)
~~~~~~~~~~~~~~~~~~~~~~~~~

Post Package Repair is a maintenance operation which requests the memory
device to perform a repair operation on its media. It is a memory
self-healing feature that fixes a failing memory location by replacing it
with a spare row in a DRAM device.

For example, a CXL memory device with DRAM components that support PPR
features implements maintenance operations. DRAM components support the
following types of PPR functions:

- hard PPR, for a permanent row repair, and
- soft PPR, for a temporary row repair.

Soft PPR is much faster than hard PPR, but the repair is lost after a power
cycle.

The data may not be retained and memory requests may not be correctly
processed during a repair operation. In such a case, the repair operation
should not be executed at runtime.

For example, for CXL memory devices, see CXL spec rev 3.1 [1]_ sections
8.2.9.7.1.1 PPR Maintenance Operations, 8.2.9.7.1.2 sPPR Maintenance
Operation and 8.2.9.7.1.3 hPPR Maintenance Operation for more details.

Memory Sparing
~~~~~~~~~~~~~~

Memory sparing is a repair function that replaces a portion of memory with
a portion of functional memory at a particular granularity. Memory sparing
has cacheline/row/bank/rank sparing granularities. For example, in rank
memory-sparing mode, one memory rank serves as a spare for other ranks on
the same channel in case they fail. The spare rank is held in reserve and
not used as active memory until a failure is indicated, with the reserved
capacity subtracted from the total available memory in the system.

After an error threshold is surpassed in a system protected by memory
sparing, the content of a failing rank of DIMMs is copied to the spare
rank. The failing rank is then taken offline and the spare rank is placed
online for use as active memory in place of the failed rank.

For example, CXL memory devices can support various subclasses of sparing
operation, which vary in the scope of the sparing being performed:

- Cacheline sparing refers to a sparing action that can replace a full
  cacheline.
- Row sparing is provided as an alternative to PPR sparing functions and
  its scope is that of a single DDR row.
- Bank sparing allows an entire bank to be replaced.
- Rank sparing is defined as an operation in which an entire DDR rank is
  replaced.

See CXL spec 3.1 [1]_ section 8.2.9.7.1.4 Memory Sparing Maintenance
Operations for more details.

.. [1] https://computeexpresslink.org/cxl-specification/

Use cases of generic memory repair features control
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1. The soft PPR, hard PPR and memory-sparing features share similar control
   attributes. Therefore, there is a need for a standardized, generic sysfs
   repair control that is exposed to userspace and used by administrators,
   scripts and tools.

2. When a CXL device detects an error in a memory component, it informs the
   host of the need for a repair maintenance operation by using an event
   record where the "maintenance needed" flag is set. The event record
   specifies the device physical address (DPA) and attributes of the memory
   that requires repair. The kernel reports the corresponding CXL general
   media or DRAM trace event to userspace, and userspace tools (e.g.
   rasdaemon) initiate a repair maintenance operation in response to the
   device request using the sysfs repair control.

3. Userspace tools, such as rasdaemon, request a repair operation on a
   memory region when the "maintenance needed" flag is set, when an
   uncorrected memory error or a number of corrected memory errors above a
   threshold value is reported, or when a flag indicating that the
   corrected errors threshold was exceeded is set for that memory.

4. Multiple PPR/sparing instances may be present per memory device.

5. Drivers should enforce that live repair is safe. In systems where memory
   mapping functions can change between boots, one approach is to log the
   memory errors seen on the current boot and to check live memory repair
   requests against that log.

The File System
---------------

The control attributes of a registered memory repair instance can be
accessed under /sys/bus/edac/devices//mem_repairX/

sysfs
-----

Sysfs files are documented in
`Documentation/ABI/testing/sysfs-edac-memory-repair`.
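
As an illustration of how a userspace tool might discover the registered
memory repair instances, the following minimal Python sketch walks the EDAC
device directories, finds every ``mem_repairX`` directory and reads whatever
attribute files are present. The function name ``list_repair_instances`` and
its ``edac_root`` parameter are hypothetical and exist only for this example.
The sketch assumes nothing about attribute names; the ABI document referenced
above remains the authoritative list.

```python
import glob
import os


def list_repair_instances(edac_root="/sys/bus/edac/devices"):
    """Return {instance_dir: {attribute_name: value_or_None}}.

    Hypothetical helper for illustration only. The EDAC device
    directory name is matched with a wildcard, so no particular
    device naming scheme is assumed.
    """
    instances = {}
    pattern = os.path.join(edac_root, "*", "mem_repair*")
    for inst in sorted(glob.glob(pattern)):
        if not os.path.isdir(inst):
            continue
        attrs = {}
        for name in sorted(os.listdir(inst)):
            path = os.path.join(inst, name)
            if not os.path.isfile(path):
                continue
            try:
                with open(path) as f:
                    attrs[name] = f.read().strip()
            except OSError:
                # The attribute may be write-only or not readable
                # in the current device state.
                attrs[name] = None
        instances[inst] = attrs
    return instances


if __name__ == "__main__":
    for inst, attrs in list_repair_instances().items():
        print(inst)
        for name, value in attrs.items():
            print(f"  {name}: {value}")
```

On a system without any registered memory repair instance the function
returns an empty dictionary. Passing a different ``edac_root`` makes it
possible to exercise the sketch against a mock directory tree.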