Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1	.. include:: <isonum.txt>
				2
Mauro Carvalho Chehab	9c058d24	2016-10-27 09:26:36 -0200	[diff] [blame]	3	============================================
				4	Reliability, Availability and Serviceability
				5	============================================
				6
				7	RAS concepts
				8	************
				9
Reliability, Availability and Serviceability (RAS) is a concept used on
servers meant to measure their robustness.

Reliability
  is the probability that a system will produce correct outputs.

  * Generally measured as Mean Time Between Failures (MTBF)
  * Enhanced by features that help to avoid, detect and repair hardware faults

Availability
  is the probability that a system is operational at a given time.

  * Generally measured as a percentage of downtime over a period of time
  * Often relies on mechanisms that detect and correct hardware faults at
    runtime

Serviceability (or maintainability)
  is the simplicity and speed with which a system can be repaired or
  maintained.

  * Generally measured as Mean Time Between Repair (MTBR)
				31
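As a worked example of the availability measure described above, the
percentage can be derived from the downtime observed over a period. This is
only an arithmetic sketch: the function name and the sample figures are
illustrative assumptions, not part of any kernel interface.

```python
def availability_percent(downtime_hours, period_hours):
    """Return availability as a percentage of the observation period."""
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime_hours must lie within the period")
    return 100.0 * (period_hours - downtime_hours) / period_hours


# Hypothetical figures: 8.76 hours of downtime in a 365-day year.
print(availability_percent(8.76, 365 * 24))
```
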
				32	Improving RAS
				33	-------------
				34
To reduce system downtime, a system should be capable of detecting
hardware errors and, when possible, correcting them at runtime. It should
also provide mechanisms to detect hardware degradation, so that the system
administrator can be warned to replace a component before it causes data
loss or system downtime.
				40
				41	Among the monitoring measures, the most usual ones include:
				42
				43	* CPU – detect errors at instruction execution and at L1/L2/L3 caches;
				44	* Memory – add error correction logic (ECC) to detect and correct errors;
Tamara Diaconita	9f02a48	2017-03-14 10:38:35 +0200	[diff] [blame]	45	* I/O – add CRC checksums for transferred data;
Mauro Carvalho Chehab	9c058d24	2016-10-27 09:26:36 -0200	[diff] [blame]	46	* Storage – RAID, journal file systems, checksums,
				47	Self-Monitoring, Analysis and Reporting Technology (SMART).
				48
By monitoring the number of detected errors, it is possible to identify
whether the probability of hardware errors is increasing and, in that
case, to perform preventive maintenance and replace a degraded component
while those errors are still correctable.
				53
				54	Types of errors
				55	---------------
				56
Most mechanisms used on modern systems rely on technologies like Hamming
Codes, which allow error correction when the number of errors in a bit
packet is below a threshold. If the number of errors is above that
threshold, those mechanisms can indicate with a high degree of confidence
that an error happened, but they can't correct it.

Also, sometimes an error occurs on a component that is not in use, for
example, a part of the memory that is not currently allocated.
				65
				66	That defines some categories of errors:
				67
* Correctable Error (CE) - the error detection mechanism detected and
  corrected the error. Such errors are usually not fatal, although some
  Kernel mechanisms allow the system administrator to consider them as fatal.

* Uncorrected Error (UE) - the number of errors was above the error
  correction threshold, and the system was unable to auto-correct.

* Fatal Error - when a UE happens on a critical component of the
  system (for example, a piece of the Kernel got corrupted by a UE), the
  only reliable way to avoid data corruption is to hang or reboot the machine.

* Non-fatal Error - when a UE happens on an unused component,
  like a CPU in power down state or an unused memory bank, the system may
  still run, eventually replacing the affected hardware by a hot spare,
  if available.

Also, when an error happens on a userspace process, it is possible to
kill that process and let userspace restart it.
				86
				87	The mechanism for handling non-fatal errors is usually complex and may
				88	require the help of some userspace application, in order to apply the
				89	policy desired by the system administrator.
				90
				91	Identifying a bad hardware component
				92	------------------------------------
				93
Just detecting a hardware flaw is usually not enough, as the system needs
to pinpoint the minimal replaceable unit (MRU) that should be exchanged
to make the hardware reliable again.
				97
				98	So, it requires not only error logging facilities, but also mechanisms that
				99	will translate the error message to the silkscreen or component label for
				100	the MRU.
				101
Typically, this is very complex for memory, as modern CPUs interlace memory
from different memory modules in order to provide better performance. The
DMI BIOS usually has a list of memory module labels, which can be obtained
using the ``dmidecode`` tool. For example, on a desktop machine, it shows::
				106
    Memory Device
        Total Width: 64 bits
        Data Width: 64 bits
        Size: 16384 MB
        Form Factor: SODIMM
        Set: None
        Locator: ChannelA-DIMM0
        Bank Locator: BANK 0
        Type: DDR4
        Type Detail: Synchronous
        Speed: 2133 MHz
        Rank: 2
        Configured Clock Speed: 2133 MHz
				120
In the above example, a DDR4 SO-DIMM memory module is located at the
system's memory labeled as "BANK 0", as given by the bank locator field.
Please notice that, on such a system, the total width is equal to the
data width. It means that such a memory module doesn't have error
detection/correction mechanisms.

Unfortunately, not all systems use the same field to specify the memory
bank. In this example, from an older server, ``dmidecode`` shows::
				129
    Memory Device
        Array Handle: 0x1000
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 8192 MB
        Form Factor: DIMM
        Set: 1
        Locator: DIMM_A1
        Bank Locator: Not Specified
        Type: DDR3
        Type Detail: Synchronous Registered (Buffered)
        Speed: 1600 MHz
        Rank: 2
        Configured Clock Speed: 1600 MHz
				145
There, the DDR3 RDIMM memory module is located at the system's memory labeled
as "DIMM_A1", as given by the locator field. Please notice that this
memory module has 64 bits of data width and 72 bits of total width. So,
it has 8 extra bits to be used by error detection and correction mechanisms.
This kind of memory is called Error-correcting code memory (ECC memory).
				151
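The comparison between total width and data width made in the two examples
above can be automated. The following is a minimal sketch, assuming the
``Memory Device`` record layout shown in this section; other ``dmidecode``
versions may name or order the fields differently. The function name
``ecc_bits_per_device`` and the abbreviated sample text are hypothetical.

```python
import re


def ecc_bits_per_device(dmidecode_text):
    """Map each 'Locator' to (total width - data width), in bits.

    A result of 0 suggests a module without extra ECC bits. Records whose
    widths are not plain "<N> bits" values are skipped.
    """
    records = []
    for line in dmidecode_text.splitlines():
        line = line.strip()
        if line == "Memory Device":
            records.append({})          # start of a new record
        elif records and ":" in line:
            key, _, value = line.partition(":")
            records[-1][key.strip()] = value.strip()

    result = {}
    for rec in records:
        total = re.match(r"(\d+) bits", rec.get("Total Width", ""))
        data = re.match(r"(\d+) bits", rec.get("Data Width", ""))
        if total and data and "Locator" in rec:
            result[rec["Locator"]] = int(total.group(1)) - int(data.group(1))
    return result


# Abbreviated copies of the two records shown above.
sample = """\
Memory Device
    Total Width: 64 bits
    Data Width: 64 bits
    Locator: ChannelA-DIMM0
    Bank Locator: BANK 0
Memory Device
    Total Width: 72 bits
    Data Width: 64 bits
    Locator: DIMM_A1
    Bank Locator: Not Specified
"""
print(ecc_bits_per_device(sample))
```

On a real system, the input text would come from running ``dmidecode``,
which normally requires root privileges.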
To make things even worse, it is not uncommon for systems with different
labels on their system boards to use exactly the same BIOS, meaning that
the labels provided by the BIOS won't match the real ones.
				155
				156	ECC memory
				157	----------
				158
Waiman Long	b17b24f	2020-05-06 12:22:17 -0400	[diff] [blame]	159	As mentioned in the previous section, ECC memory has extra bits to be
				160	used for error correction. In the above example, a memory module has
				161	64 bits of data width, and 72 bits of total width. The extra 8
				162	bits which are used for the error detection and correction mechanisms
				163	are referred to as the syndrome\ [#f1]_\ [#f2]_.
Mauro Carvalho Chehab	9c058d24	2016-10-27 09:26:36 -0200	[diff] [blame]	164
So, when the CPU requests the memory controller to write a word with
data width, the memory controller calculates the syndrome in real time,
using Hamming code, or some other error correction code, like SECDED+,
producing a code with total width size. Such code is then written
to the memory modules.

At read, the code with total width is converted back, using the same
ECC code used on write, producing a word with data width and a syndrome.
The word with data width is sent to the CPU, even when errors happen.
				174
				175	The memory controller also looks at the syndrome in order to check if
				176	there was an error, and if the ECC code was able to fix such error.
				177	If the error was corrected, a Corrected Error (CE) happened. If not, an
				178	Uncorrected Error (UE) happened.
				179
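The write/read flow described above can be illustrated with a toy code. The
sketch below is not what any particular memory controller implements: it
uses a Hamming(7,4) code plus one overall parity bit (a small SECDED code)
on 4 data bits, whereas real ECC memory typically protects 64 data bits
with 8 extra bits. The function names are hypothetical. Note that the code
uses "syndrome" for the value computed at read time to locate an error.

```python
def encode(data):
    """Encode 4 data bits into an 8-bit SECDED code word.

    Index 0 holds the overall parity bit; indexes 1..7 are the usual
    Hamming positions (1, 2 and 4 are check bits).
    """
    d1, d2, d3, d4 = data
    word = [0] * 8
    word[3], word[5], word[6], word[7] = d1, d2, d3, d4
    word[1] = d1 ^ d2 ^ d4        # covers positions 1, 3, 5, 7
    word[2] = d1 ^ d3 ^ d4        # covers positions 2, 3, 6, 7
    word[4] = d2 ^ d3 ^ d4        # covers positions 4, 5, 6, 7
    word[0] = sum(word[1:]) % 2   # overall parity over positions 1..7
    return word


def decode(word):
    """Return (data bits, status), with status "ok", "CE" or "UE"."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos
    overall = sum(word) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flipped bits: assume one and correct it.
        # A syndrome of 0 means the overall parity bit itself flipped.
        word[syndrome] ^= 1
        status = "CE"
    else:
        # Non-zero syndrome with even parity: two bits flipped.
        status = "UE"
    return [word[3], word[5], word[6], word[7]], status


stored = encode([1, 0, 1, 1])

one_flip = list(stored)
one_flip[5] ^= 1
print(decode(one_flip))     # single-bit error: corrected

two_flips = list(stored)
two_flips[2] ^= 1
two_flips[6] ^= 1
print(decode(two_flips))    # double-bit error: detected only
```

As in the text, the decoded data word is returned even in the UE case; it
is the status that tells the caller whether the data can be trusted.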
The information about the CE/UE errors is stored in special registers
at the memory controller and can be accessed by reading those registers,
either by the BIOS, by some special CPUs or by the Linux EDAC driver. On
x86 64-bit CPUs, such errors can also be retrieved via the Machine Check
Architecture (MCA)\ [#f3]_.
				185
.. [#f1] Please notice that several memory controllers allow operation in a
   mode called "Lock-Step", where two memory modules are grouped together,
   doing 128-bit reads/writes. That gives 16 bits for error correction, which
   significantly improves the error correction mechanism, at the expense
   that, when an error happens, there's no way to know which memory module is
   to blame. So, it has to blame both memory modules.

.. [#f2] Some memory controllers also allow using memory in mirror mode.
   In such mode, the same data is written to two memory modules. At read,
   the system checks both memory modules, in order to check if both provide
   identical data. In such a configuration, when an error happens, there's no
   way to know which memory module is to blame. So, it has to blame both
   memory modules (or 4 memory modules, if the system is also in Lock-step
   mode).
				200
				201	.. [#f3] For more details about the Machine Check Architecture (MCA),
Mauro Carvalho Chehab	cb1aaeb	2019-06-07 15:54:32 -0300	[diff] [blame]	202	please read Documentation/x86/x86_64/machinecheck.rst at the Kernel tree.
Mauro Carvalho Chehab	9c058d24	2016-10-27 09:26:36 -0200	[diff] [blame]	203
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	204	EDAC - Error Detection And Correction
Mauro Carvalho Chehab	9c058d24	2016-10-27 09:26:36 -0200	[diff] [blame]	205	*************************************
Doug Thompson	87f24c3	2007-07-19 01:50:34 -0700	[diff] [blame]	206
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	207	.. note::
Borislav Petkov	e34217c	2015-11-26 14:12:56 +0100	[diff] [blame]	208
Mauro Carvalho Chehab	9c058d24	2016-10-27 09:26:36 -0200	[diff] [blame]	209	"bluesmoke" was the name for this device driver subsystem when it
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	210	was "out-of-tree" and maintained at http://bluesmoke.sourceforge.net.
				211	That site is mostly archaic now and can be used only for historical
				212	purposes.
Doug Thompson	87f24c3	2007-07-19 01:50:34 -0700	[diff] [blame]	213
Mauro Carvalho Chehab	9c058d24	2016-10-27 09:26:36 -0200	[diff] [blame]	214	When the subsystem was pushed upstream for the first time, on
Mauro Carvalho Chehab	00aff95	2020-04-14 18:48:42 +0200	[diff] [blame]	215	Kernel 2.6.16, it was renamed to ``EDAC``.
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	216
				217	Purpose
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	218	-------
Doug Thompson	87f24c3	2007-07-19 01:50:34 -0700	[diff] [blame]	219
The ``edac`` kernel module's goal is to detect and report hardware errors
that occur within the computer system running under Linux.
Doug Thompson	87f24c3	2007-07-19 01:50:34 -0700	[diff] [blame]	222
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	223	Memory
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	224	------
Doug Thompson	87f24c3	2007-07-19 01:50:34 -0700	[diff] [blame]	225
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	226	Memory Correctable Errors (CE) and Uncorrectable Errors (UE) are the
				227	primary errors being harvested. These types of errors are harvested by
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	228	the ``edac_mc`` device.
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	229
Detecting CE events, then harvesting those events and reporting them,
can be, but is not necessarily, a predictor of future UE events. With
CE events only, the system can and will continue to operate, as no data
has been damaged yet.
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	234
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	235	However, preventive maintenance and proactive part replacement of memory
Mauro Carvalho Chehab	9c058d24	2016-10-27 09:26:36 -0200	[diff] [blame]	236	modules exhibiting CEs can reduce the likelihood of the dreaded UE events
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	237	and system panics.
				238
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	239	Other hardware elements
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	240	-----------------------
Doug Thompson	87f24c3	2007-07-19 01:50:34 -0700	[diff] [blame]	241
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	242	A new feature for EDAC, the ``edac_device`` class of device, was added in
Doug Thompson	87f24c3	2007-07-19 01:50:34 -0700	[diff] [blame]	243	the 2.6.23 version of the kernel.
				244
				245	This new device type allows for non-memory type of ECC hardware detectors
				246	to have their states harvested and presented to userspace via the sysfs
				247	interface.
				248
Some architectures have ECC detectors for L1, L2 and L3 caches,
along with DMA engines, fabric switches, main data path switches,
interconnections, and various other hardware data paths. If the hardware
reports it, then an ``edac_device`` device probably can be constructed to
harvest and present that to userspace.
Doug Thompson	87f24c3	2007-07-19 01:50:34 -0700	[diff] [blame]	254
				255
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	256	PCI bus scanning
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	257	----------------
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	258
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	259	In addition, PCI devices are scanned for PCI Bus Parity and SERR Errors
				260	in order to determine if errors are occurring during data transfers.
Doug Thompson	87f24c3	2007-07-19 01:50:34 -0700	[diff] [blame]	261
The presence of PCI Parity errors must be examined with a grain of salt.
There are several add-in adapters that do not follow the PCI specification
with regard to Parity generation and reporting. The specification says
the vendor should tie the parity status bits to 0 if they do not intend
to generate parity. Some vendors do not do this, and thus the parity bit
can "float", giving false positives.
				268
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	269	There is a PCI device attribute located in sysfs that is checked by
				270	the EDAC PCI scanning code. If that attribute is set, PCI parity/error
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	271	scanning is skipped for that device. The attribute is::
Doug Thompson	87f24c3	2007-07-19 01:50:34 -0700	[diff] [blame]	272
				273	broken_parity_status
				274
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	275	and is located in ``/sys/devices/pci<XXX>/0000:XX:YY.Z`` directories for
Doug Thompson	87f24c3	2007-07-19 01:50:34 -0700	[diff] [blame]	276	PCI devices.
				277
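To see which devices have this attribute set, the attribute file can be read
for every PCI device. The sketch below is only an illustration: the
function name is hypothetical, and it walks ``/sys/bus/pci/devices``, which
on typical systems links to the same per-device directories mentioned above
(an assumption; adjust the path if your system differs).

```python
import glob
import os


def devices_with_broken_parity(pci_root="/sys/bus/pci/devices"):
    """Return names of PCI devices whose broken_parity_status reads 1."""
    flagged = []
    pattern = os.path.join(pci_root, "*", "broken_parity_status")
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as attr:
                value = attr.read().strip()
        except OSError:
            continue                    # unreadable attribute: skip it
        if value == "1":
            flagged.append(os.path.basename(os.path.dirname(path)))
    return flagged


# Prints an empty list when no device is flagged or sysfs is unavailable.
print(devices_with_broken_parity())
```
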
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	278
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	279	Versioning
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	280	----------
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	281
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	282	EDAC is composed of a "core" module (``edac_core.ko``) and several Memory
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	283	Controller (MC) driver modules. On a given system, the CORE is loaded
				284	and one MC driver will be loaded. Both the CORE and the MC driver (or
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	285	``edac_device`` driver) have individual versions that reflect current
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	286	release level of their respective modules.
Doug Thompson	87f24c3	2007-07-19 01:50:34 -0700	[diff] [blame]	287
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	288	Thus, to "report" on what version a system is running, one must report
				289	both the CORE's and the MC driver's versions.
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	290
				291
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	292	Loading
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	293	-------
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	294
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	295	If ``edac`` was statically linked with the kernel then no loading
				296	is necessary. If ``edac`` was built as modules then simply modprobe
				297	the ``edac`` pieces that you need. You should be able to modprobe
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	298	hardware-specific modules and have the dependencies load the necessary
				299	core modules.
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	300
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	301	Example::
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	302
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	303	$ modprobe amd76x_edac
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	304
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	305	loads both the ``amd76x_edac.ko`` memory controller module and the
				306	``edac_mc.ko`` core module.
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	307
				308
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	309	Sysfs interface
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	310	---------------
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	311
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	312	EDAC presents a ``sysfs`` interface for control and reporting purposes. It
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	313	lives in the /sys/devices/system/edac directory.
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	314
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	315	Within this directory there currently reside 2 components:
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	316
======= ==============================
mc      memory controller(s) system
pci     PCI control and status system
======= ==============================
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	321
				322
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	323
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	324	Memory Controller (mc) Model
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	325	----------------------------
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	326
Mauro Carvalho Chehab	9c058d24	2016-10-27 09:26:36 -0200	[diff] [blame]	327	Each ``mc`` device controls a set of memory modules [#f4]_. These modules
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	328	are laid out in a Chip-Select Row (``csrowX``) and Channel table (``chX``).
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	329	There can be multiple csrows and multiple channels.
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	330
Mauro Carvalho Chehab	9c058d24	2016-10-27 09:26:36 -0200	[diff] [blame]	331	.. [#f4] Nowadays, the term DIMM (Dual In-line Memory Module) is widely
				332	used to refer to a memory module, although there are other memory
Robert Richter	778f3a9	2019-11-06 09:33:30 +0000	[diff] [blame]	333	packaging alternatives, like SO-DIMM, SIMM, etc. The UEFI
				334	specification (Version 2.7) defines a memory module in the Common
				335	Platform Error Record (CPER) section to be an SMBIOS Memory Device
				336	(Type 17). Along this document, and inside the EDAC subsystem, the term
				337	"dimm" is used for all memory modules, even when they use a
				338	different kind of packaging.
Mauro Carvalho Chehab	9c058d24	2016-10-27 09:26:36 -0200	[diff] [blame]	339
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	340	Memory controllers allow for several csrows, with 8 csrows being a
				341	typical value. Yet, the actual number of csrows depends on the layout of
Mauro Carvalho Chehab	9c058d24	2016-10-27 09:26:36 -0200	[diff] [blame]	342	a given motherboard, memory controller and memory module characteristics.
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	343
Dual channels allow for double-width data transfers (e.g. 128 bits on
64-bit systems) between the CPU and memory. Some newer chipsets allow
for more than 2 channels, like Fully Buffered DIMMs (FB-DIMMs) memory
controllers. The following example will assume 2 channels:
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	348
+------------+-----------------------+
| CS Rows    |       Channels        |
+------------+-----------+-----------+
|            |  ``ch0``  |  ``ch1``  |
+============+===========+===========+
|            | DIMM_A0   | DIMM_B0   |
+------------+-----------+-----------+
| ``csrow0`` |   rank0   |   rank0   |
+------------+-----------+-----------+
| ``csrow1`` |   rank1   |   rank1   |
+------------+-----------+-----------+
|            | DIMM_A1   | DIMM_B1   |
+------------+-----------+-----------+
| ``csrow2`` |   rank0   |   rank0   |
+------------+-----------+-----------+
| ``csrow3`` |   rank1   |   rank1   |
+------------+-----------+-----------+
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	366
Mauro Carvalho Chehab	9c058d24	2016-10-27 09:26:36 -0200	[diff] [blame]	367	In the above example, there are 4 physical slots on the motherboard
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	368	for memory DIMMs:
				369
+---------+---------+
| DIMM_A0 | DIMM_B0 |
+---------+---------+
| DIMM_A1 | DIMM_B1 |
+---------+---------+
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	375
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	376	Labels for these slots are usually silk-screened on the motherboard.
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	377	Slots labeled ``A`` are channel 0 in this example. Slots labeled ``B`` are
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	378	channel 1. Notice that there are two csrows possible on a physical DIMM.
				379	These csrows are allocated their csrow assignment based on the slot into
				380	which the memory DIMM is placed. Thus, when 1 DIMM is placed in each
				381	Channel, the csrows cross both DIMMs.
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	382
				383	Memory DIMMs come single or dual "ranked". A rank is a populated csrow.
Robert Richter	778f3a9	2019-11-06 09:33:30 +0000	[diff] [blame]	384	In the example above 2 dual ranked DIMMs are similarly placed. Thus,
				385	both csrow0 and csrow1 are populated. On the other hand, when 2 single
				386	ranked DIMMs are placed in slots DIMM_A0 and DIMM_B0, then they will
				387	have just one csrow (csrow0) and csrow1 will be empty. The pattern
				388	repeats itself for csrow2 and csrow3. Also note that some memory
				389	controllers don't have any logic to identify the memory module, see
				390	``rankX`` directories below.
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	391
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	392	The representation of the above is reflected in the directory
				393	tree in EDAC's sysfs interface. Starting in directory
Mauro Carvalho Chehab	9c058d24	2016-10-27 09:26:36 -0200	[diff] [blame]	394	``/sys/devices/system/edac/mc``, each memory controller will be
				395	represented by its own ``mcX`` directory, where ``X`` is the
				396	index of the MC::
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	397
    ..../edac/mc/
           |
           |->mc0
           |->mc1
           |->mc2
           ....
				404
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	405	Under each ``mcX`` directory each ``csrowX`` is again represented by a
				406	``csrowX``, where ``X`` is the csrow index::
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	407
    .../mc/mc0/
        |
        |->csrow0
        |->csrow2
        |->csrow3
        ....
				414
Notice that there is no csrow1, which indicates that csrow0 is composed
of single-ranked DIMMs. This should also apply to both channels, in
order for dual-channel mode to be operational. Since both csrow2 and
csrow3 are populated, this indicates a dual-ranked set of DIMMs for
channels 0 and 1.
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	420
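The reasoning above (a gap in the ``csrowX`` numbering hints at an
unpopulated rank) can be sketched as a small directory listing. The helper
names below are hypothetical, and the sketch only inspects directory names;
it does not interpret any attribute file.

```python
import os
import re


def populated_csrows(mc_dir):
    """Return the sorted csrow indexes present under one mcX directory."""
    indexes = []
    for name in os.listdir(mc_dir):
        match = re.fullmatch(r"csrow(\d+)", name)
        if match and os.path.isdir(os.path.join(mc_dir, name)):
            indexes.append(int(match.group(1)))
    return sorted(indexes)


def missing_csrows(indexes):
    """Return the indexes absent below the highest populated csrow."""
    if not indexes:
        return []
    return [i for i in range(max(indexes)) if i not in indexes]


mc0 = "/sys/devices/system/edac/mc/mc0"
if os.path.isdir(mc0):                  # only on systems exposing EDAC
    found = populated_csrows(mc0)
    print(found, missing_csrows(found))
```

For the directory layout shown above, ``populated_csrows`` would return
``[0, 2, 3]`` and ``missing_csrows`` would return ``[1]``.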
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	421	Within each of the ``mcX`` and ``csrowX`` directories are several EDAC
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	422	control and attribute files.
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	423
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	424	``mcX`` directories
				425	-------------------
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	426
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	427	In ``mcX`` directories are EDAC control and attribute files for
				428	this ``X`` instance of the memory controllers.
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	429
Mauro Carvalho Chehab	8b6f04c	2012-04-17 08:53:34 -0300	[diff] [blame]	430	For a description of the sysfs API, please see:
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	431
Rami Rosen	3aae9ed	2015-06-19 09:18:34 +0300	[diff] [blame]	432	Documentation/ABI/testing/sysfs-devices-edac
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	433
				434
Mauro Carvalho Chehab	032d0ab	2016-10-27 10:00:46 -0200	[diff] [blame]	435	``dimmX`` or ``rankX`` directories
				436	----------------------------------
				437
				438	The recommended way to use the EDAC subsystem is to look at the information
				439	provided by the ``dimmX`` or ``rankX`` directories [#f5]_.
				440
				441	A typical EDAC system has the following structure under
				442	``/sys/devices/system/edac/``\ [#f6]_::
				443
    /sys/devices/system/edac/
    ├── mc
    │   ├── mc0
    │   │   ├── ce_count
    │   │   ├── ce_noinfo_count
    │   │   ├── dimm0
    │   │   │   ├── dimm_ce_count
    │   │   │   ├── dimm_dev_type
    │   │   │   ├── dimm_edac_mode
    │   │   │   ├── dimm_label
    │   │   │   ├── dimm_location
    │   │   │   ├── dimm_mem_type
    │   │   │   ├── dimm_ue_count
    │   │   │   ├── size
    │   │   │   └── uevent
    │   │   ├── max_location
    │   │   ├── mc_name
    │   │   ├── reset_counters
    │   │   ├── seconds_since_reset
    │   │   ├── size_mb
    │   │   ├── ue_count
    │   │   ├── ue_noinfo_count
    │   │   └── uevent
    │   ├── mc1
    │   │   ├── ce_count
    │   │   ├── ce_noinfo_count
    │   │   ├── dimm0
    │   │   │   ├── dimm_ce_count
    │   │   │   ├── dimm_dev_type
    │   │   │   ├── dimm_edac_mode
    │   │   │   ├── dimm_label
    │   │   │   ├── dimm_location
    │   │   │   ├── dimm_mem_type
    │   │   │   ├── dimm_ue_count
    │   │   │   ├── size
    │   │   │   └── uevent
    │   │   ├── max_location
    │   │   ├── mc_name
    │   │   ├── reset_counters
    │   │   ├── seconds_since_reset
    │   │   ├── size_mb
    │   │   ├── ue_count
    │   │   ├── ue_noinfo_count
    │   │   └── uevent
    │   └── uevent
    └── uevent
				490
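The tree above can be walked from userspace to collect the per-module
counters. The following is a minimal sketch, assuming the ``mcX/dimmX``
layout shown above (systems that expose ``rankX`` directories instead would
need a different pattern). The function name is hypothetical, and only
attribute files that appear in the tree are read.

```python
import glob
import os


def read_dimm_counters(edac_root="/sys/devices/system/edac/mc"):
    """Return {(mc, dimm): {attribute: text or None}} for each dimmX."""
    wanted = ("dimm_label", "dimm_ce_count", "dimm_ue_count")
    counters = {}
    pattern = os.path.join(edac_root, "mc*", "dimm*")
    for dimm_dir in sorted(glob.glob(pattern)):
        if not os.path.isdir(dimm_dir):
            continue
        entry = {}
        for attr in wanted:
            try:
                with open(os.path.join(dimm_dir, attr)) as f:
                    entry[attr] = f.read().strip()
            except OSError:
                entry[attr] = None      # attribute missing or unreadable
        mc = os.path.basename(os.path.dirname(dimm_dir))
        counters[(mc, os.path.basename(dimm_dir))] = entry
    return counters


# Prints nothing on systems without a loaded EDAC driver.
for (mc, dimm), values in read_dimm_counters().items():
    print(mc, dimm, values)
```

A monitoring script could call such a function periodically and report any
module whose correctable error count is non-zero or growing.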
				491	In the ``dimmX`` directories are EDAC control and attribute files for
				492	this ``X`` memory module:
				493
- ``size`` - Total memory managed by this memory module attribute file

    This attribute file displays, in count of megabytes, the memory
    that this memory module contains.
				498
Aaron Miller	4fb6fde	2016-11-03 15:01:53 -0700	[diff] [blame]	499	- ``dimm_ue_count`` - Uncorrectable Errors count attribute file
				500
				501	This attribute file displays the total count of uncorrectable
				502	errors that have occurred on this DIMM. If panic_on_ue is set
				503	this counter will not have a chance to increment, since EDAC
				504	will panic the system.
				505
				506	- ``dimm_ce_count`` - Correctable Errors count attribute file
				507
				508	This attribute file displays the total count of correctable
				509	errors that have occurred on this DIMM. This count is very
				510	important to examine. CEs provide early indications that a
DIMM is beginning to fail. This count field should be
monitored for non-zero values, and such values should be
reported to the system administrator.
				514
Mauro Carvalho Chehab	032d0ab	2016-10-27 10:00:46 -0200	[diff] [blame]	515	- ``dimm_dev_type`` - Device type attribute file
				516
				517	This attribute file will display what type of DRAM device is
				518	being utilized on this DIMM.
				519	Examples:
				520
				521	- x1
				522	- x2
				523	- x4
				524	- x8
				525
				526	- ``dimm_edac_mode`` - EDAC Mode of operation attribute file
				527
				528	This attribute file will display what type of Error detection
				529	and correction is being utilized.
				530
				531	- ``dimm_label`` - memory module label control file
				532
				533	This control file allows this DIMM to have a label assigned
				534	to it. With this label in the module, when errors occur
				535	the output can provide the DIMM label in the system log.
				536	This becomes vital for panic events to isolate the
				537	cause of the UE event.
				538
				539	DIMM Labels must be assigned after booting, with information
				540	that correctly identifies the physical slot with its
				541	silk screen label. This information is currently very
				542	motherboard specific and determination of this information
				543	must occur in userland at this time.
				544
				545	- ``dimm_location`` - location of the memory module
				546
				547	The location can have up to 3 levels, and describe how the
				548	memory controller identifies the location of a memory module.
				549	Depending on the type of memory and memory controller, it
				550	can be:
				551
				552	- csrow and channel - used when the memory controller
doesn't identify a single DIMM - e.g. in ``rankX`` dir;
				554	- branch, channel, slot - typically used on FB-DIMM memory
				555	controllers;
				556	- channel, slot - used on Nehalem and newer Intel drivers.
				557
				558	- ``dimm_mem_type`` - Memory Type attribute file
				559
This attribute file will display what type of memory is currently
on this DIMM. Normally, either buffered or unbuffered memory.
				562	Examples:
				563
				564	- Registered-DDR
				565	- Unbuffered-DDR
				566
.. [#f5] On some systems, the memory controller doesn't have any logic
   to identify the memory module. On such systems, the directory is
   called ``rankX`` and works in a similar way to the ``csrowX``
   directories. On modern Intel memory controllers, the memory
   controller identifies the memory modules directly. On such systems,
   the directory is called ``dimmX``.
				571
				572	.. [#f6] There are also some ``power`` directories and ``subsystem``
				573	symlinks inside the sysfs mapping that are automatically created by
				574	the sysfs subsystem. Currently, they serve no purpose.
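
The per-DIMM counters described above can be polled from user space. The
following is a minimal sketch, not part of the EDAC tooling: it assumes the
``mcX/dimmX`` layout shown in the tree above under
``/sys/devices/system/edac/mc``, and the helper names ``read_dimm_counters``
and ``dimms_with_errors`` are hypothetical.

```python
from pathlib import Path

# Default sysfs location of the memory controllers (assumed from the tree above).
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def _read_attr(directory, name, default=""):
    """Return the stripped content of an attribute file, or a default."""
    path = directory / name
    return path.read_text().strip() if path.is_file() else default


def read_dimm_counters(mc_root=EDAC_MC_ROOT):
    """Collect label, CE count and UE count for every mcX/dimmX directory."""
    results = []
    for dimm_dir in sorted(Path(mc_root).glob("mc*/dimm*")):
        results.append({
            "mc": dimm_dir.parent.name,
            "dimm": dimm_dir.name,
            "label": _read_attr(dimm_dir, "dimm_label"),
            "ce": int(_read_attr(dimm_dir, "dimm_ce_count", "0")),
            "ue": int(_read_attr(dimm_dir, "dimm_ue_count", "0")),
        })
    return results


def dimms_with_errors(mc_root=EDAC_MC_ROOT):
    """Keep only the DIMMs whose CE or UE counter is non-zero."""
    return [d for d in read_dimm_counters(mc_root) if d["ce"] or d["ue"]]
```

Passing a different ``mc_root`` makes it possible to try the helper against a
copy of the sysfs tree instead of a live system.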
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	575
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	576	``csrowX`` directories
				577	----------------------
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	578
Mauro Carvalho Chehab	9c058d24	2016-10-27 09:26:36 -0200	[diff] [blame]	579	When CONFIG_EDAC_LEGACY_SYSFS is enabled, sysfs will contain the ``csrowX``
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	580	directories. As this API doesn't work properly for Rambus, FB-DIMMs and
				581	modern Intel Memory Controllers, this is being deprecated in favor of
Mauro Carvalho Chehab	9c058d24	2016-10-27 09:26:36 -0200	[diff] [blame]	582	``dimmX`` directories.
Mauro Carvalho Chehab	8b6f04c	2012-04-17 08:53:34 -0300	[diff] [blame]	583
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	584	In the ``csrowX`` directories are EDAC control and attribute files for
				585	this ``X`` instance of csrow:
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	586
				587
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	588	- ``ue_count`` - Total Uncorrectable Errors count attribute file
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	589
				590	This attribute file displays the total count of uncorrectable
				591	errors that have occurred on this csrow. If panic_on_ue is set
				592	this counter will not have a chance to increment, since EDAC
				593	will panic the system.
				594
				595
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	596	- ``ce_count`` - Total Correctable Errors count attribute file
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	597
				598	This attribute file displays the total count of correctable
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	599	errors that have occurred on this csrow. This count is very
				600	important to examine. CEs provide early indications that a
DIMM is beginning to fail. This count field should be
monitored for non-zero values, and such values should be
reported to the system administrator.
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	604
				605
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	606	- ``size_mb`` - Total memory managed by this csrow attribute file
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	607
Rami Rosen	3aae9ed	2015-06-19 09:18:34 +0300	[diff] [blame]	608	This attribute file displays, in count of megabytes, the memory
Dave Peterson	f347981	2006-03-26 01:38:53 -0800	[diff] [blame]	609	that this csrow contains.
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	610
				611
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	612	- ``mem_type`` - Memory Type attribute file
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	613
				614	This attribute file will display what type of memory is currently
				615	on this csrow. Normally, either buffered or unbuffered memory.
Doug Thompson	49c0dab7	2006-07-10 04:45:19 -0700	[diff] [blame]	616	Examples:
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	617
				618	- Registered-DDR
				619	- Unbuffered-DDR
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	620
				621
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	622	- ``edac_mode`` - EDAC Mode of operation attribute file
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	623
				624	This attribute file will display what type of Error detection
				625	and correction is being utilized.
				626
				627
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	628	- ``dev_type`` - Device type attribute file
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	629
Doug Thompson	49c0dab7	2006-07-10 04:45:19 -0700	[diff] [blame]	630	This attribute file will display what type of DRAM device is
				631	being utilized on this DIMM.
				632	Examples:
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	633
				634	- x1
				635	- x2
				636	- x4
				637	- x8
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	638
				639
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	640	- ``ch0_ce_count`` - Channel 0 CE Count attribute file
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	641
				642	This attribute file will display the count of CEs on this
				643	DIMM located in channel 0.
				644
				645
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	646	- ``ch0_ue_count`` - Channel 0 UE Count attribute file
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	647
				648	This attribute file will display the count of UEs on this
				649	DIMM located in channel 0.
				650
				651
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	652	- ``ch0_dimm_label`` - Channel 0 DIMM Label control file
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	653
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	654
				655	This control file allows this DIMM to have a label assigned
				656	to it. With this label in the module, when errors occur
				657	the output can provide the DIMM label in the system log.
				658	This becomes vital for panic events to isolate the
				659	cause of the UE event.
				660
				661	DIMM Labels must be assigned after booting, with information
				662	that correctly identifies the physical slot with its
				663	silk screen label. This information is currently very
				664	motherboard specific and determination of this information
				665	must occur in userland at this time.
				666
				667
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	668	- ``ch1_ce_count`` - Channel 1 CE Count attribute file
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	669
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	670
				671	This attribute file will display the count of CEs on this
				672	DIMM located in channel 1.
				673
				674
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	675	- ``ch1_ue_count`` - Channel 1 UE Count attribute file
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	676
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	677
				678	This attribute file will display the count of UEs on this
DIMM located in channel 1.
				680
				681
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	682	- ``ch1_dimm_label`` - Channel 1 DIMM Label control file
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	683
				684	This control file allows this DIMM to have a label assigned
				685	to it. With this label in the module, when errors occur
				686	the output can provide the DIMM label in the system log.
				687	This becomes vital for panic events to isolate the
				688	cause of the UE event.
				689
				690	DIMM Labels must be assigned after booting, with information
				691	that correctly identifies the physical slot with its
				692	silk screen label. This information is currently very
				693	motherboard specific and determination of this information
				694	must occur in userland at this time.
				695
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	696
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	697	System Logging
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	698	--------------
				699
				700	If logging for UEs and CEs is enabled, then system logs will contain
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	701	information indicating that errors have been detected::
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	702
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	703	EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0, channel 1 "DIMM_B1": amd76x_edac
				704	EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0, channel 1 "DIMM_B1": amd76x_edac
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	705
				706
				707	The structure of the message is:
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	708
+---------------------------------------+-------------+
| Content                               | Example     |
+=======================================+=============+
| The memory controller                 | MC0         |
+---------------------------------------+-------------+
| Error type                            | CE          |
+---------------------------------------+-------------+
| Memory page                           | 0x283       |
+---------------------------------------+-------------+
| Offset in the page                    | 0xce0       |
+---------------------------------------+-------------+
| The byte granularity                  | grain 8     |
| or resolution of the error            |             |
+---------------------------------------+-------------+
| The error syndrome                    | 0x6ec3      |
+---------------------------------------+-------------+
| Memory row                            | row 0       |
+---------------------------------------+-------------+
| Memory channel                        | channel 1   |
+---------------------------------------+-------------+
| DIMM label, if set prior              | DIMM_B1     |
+---------------------------------------+-------------+
| And then an optional, driver-specific |             |
| message that may have additional      |             |
| information.                          |             |
+---------------------------------------+-------------+
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	735
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	736	Both UEs and CEs with no info will lack all but memory controller, error
				737	type, a notice of "no info" and then an optional, driver-specific error
				738	message.
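
As an illustration of the message structure, the sketch below splits one log
line into its fields. It assumes exactly the format of the two example
messages in this section (other kernel versions or drivers may print a
different layout), and the names ``EDAC_MC_LINE`` and ``parse_edac_line`` are
hypothetical.

```python
import re

# Pattern modelled on the example messages in this section (an assumption,
# not a format guaranteed by the kernel).
EDAC_MC_LINE = re.compile(
    r'EDAC (?P<mc>MC\d+): (?P<type>CE|UE) '
    r'page (?P<page>0x[0-9a-fA-F]+), offset (?P<offset>0x[0-9a-fA-F]+), '
    r'grain (?P<grain>\d+), syndrome (?P<syndrome>0x[0-9a-fA-F]+), '
    r'row (?P<row>\d+), channel (?P<channel>\d+)'
    r'(?: "(?P<label>[^"]*)")?'      # optional DIMM label
    r'(?:: (?P<driver_msg>.*))?$'    # optional driver-specific message
)


def parse_edac_line(line):
    """Return a dict of fields for a matching line, or None otherwise."""
    match = EDAC_MC_LINE.search(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("page", "offset", "syndrome"):
        fields[key] = int(fields[key], 16)   # hexadecimal fields
    for key in ("grain", "row", "channel"):
        fields[key] = int(fields[key])       # decimal fields
    return fields
```

Lines that lack the label or the driver message still match; the
corresponding fields are then ``None``.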
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	739
				740
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	741	PCI Bus Parity Detection
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	742	------------------------
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	743
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	744	On Header Type 00 devices, the primary status is looked at for any
				745	parity error regardless of whether parity is enabled on the device or
				746	not. (The spec indicates parity is generated in some cases). On Header
				747	Type 01 bridges, the secondary status register is also looked at to see
if a parity error occurred on the bus on the other side of the bridge.
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	749
				750
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	751	Sysfs configuration
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	752	-------------------
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	753
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	754	Under ``/sys/devices/system/edac/pci`` are control and attribute files as
				755	follows:
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	756
				757
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	758	- ``check_pci_parity`` - Enable/Disable PCI Parity checking control file
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	759
				760	This control file enables or disables the PCI Bus Parity scanning
				761	operation. Writing a 1 to this file enables the scanning. Writing
				762	a 0 to this file disables the scanning.
				763
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	764	Enable::
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	765
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	766	echo "1" >/sys/devices/system/edac/pci/check_pci_parity
				767
				768	Disable::
				769
				770	echo "0" >/sys/devices/system/edac/pci/check_pci_parity
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	771
				772
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	773	- ``pci_parity_count`` - Parity Count
Arthur Jones	327dafb	2008-07-25 01:49:10 -0700	[diff] [blame]	774
				775	This attribute file will display the number of parity errors that
				776	have been detected.
				777
				778
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	779	Module parameters
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	780	-----------------
Arthur Jones	327dafb	2008-07-25 01:49:10 -0700	[diff] [blame]	781
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	782	- ``edac_mc_panic_on_ue`` - Panic on UE control file
Arthur Jones	327dafb	2008-07-25 01:49:10 -0700	[diff] [blame]	783
				784	An uncorrectable error will cause a machine panic. This is usually
				785	desirable. It is a bad idea to continue when an uncorrectable error
				786	occurs - it is indeterminate what was uncorrected and the operating
				787	system context might be so mangled that continuing will lead to further
				788	corruption. If the kernel has MCE configured, then EDAC will never
				789	notice the UE.
				790
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	791	LOAD TIME::
Arthur Jones	327dafb	2008-07-25 01:49:10 -0700	[diff] [blame]	792
module/kernel parameter: edac_mc_panic_on_ue=[0|1]
				794
				795	RUN TIME::
				796
				797	echo "1" > /sys/module/edac_core/parameters/edac_mc_panic_on_ue
Arthur Jones	327dafb	2008-07-25 01:49:10 -0700	[diff] [blame]	798
				799
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	800	- ``edac_mc_log_ue`` - Log UE control file
Arthur Jones	327dafb	2008-07-25 01:49:10 -0700	[diff] [blame]	801
Arthur Jones	327dafb	2008-07-25 01:49:10 -0700	[diff] [blame]	802
				803	Generate kernel messages describing uncorrectable errors. These errors
				804	are reported through the system message log system. UE statistics
				805	will be accumulated even when UE logging is disabled.
				806
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	807	LOAD TIME::
Arthur Jones	327dafb	2008-07-25 01:49:10 -0700	[diff] [blame]	808
module/kernel parameter: edac_mc_log_ue=[0|1]
				810
				811	RUN TIME::
				812
				813	echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ue
Arthur Jones	327dafb	2008-07-25 01:49:10 -0700	[diff] [blame]	814
				815
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	816	- ``edac_mc_log_ce`` - Log CE control file
Arthur Jones	327dafb	2008-07-25 01:49:10 -0700	[diff] [blame]	817
Arthur Jones	327dafb	2008-07-25 01:49:10 -0700	[diff] [blame]	818
				819	Generate kernel messages describing correctable errors. These
				820	errors are reported through the system message log system.
				821	CE statistics will be accumulated even when CE logging is disabled.
				822
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	823	LOAD TIME::
Arthur Jones	327dafb	2008-07-25 01:49:10 -0700	[diff] [blame]	824
module/kernel parameter: edac_mc_log_ce=[0|1]
				826
				827	RUN TIME::
				828
				829	echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ce
Arthur Jones	327dafb	2008-07-25 01:49:10 -0700	[diff] [blame]	830
				831
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	832	- ``edac_mc_poll_msec`` - Polling period control file
Arthur Jones	327dafb	2008-07-25 01:49:10 -0700	[diff] [blame]	833
Arthur Jones	327dafb	2008-07-25 01:49:10 -0700	[diff] [blame]	834
				835	The time period, in milliseconds, for polling for error information.
				836	Too small a value wastes resources. Too large a value might delay
necessary handling of errors and might lose valuable information for
locating the error. 1000 milliseconds (once each second) is the current
default. Systems which require all the bandwidth they can get may
increase this.
				841
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	842	LOAD TIME::
Arthur Jones	327dafb	2008-07-25 01:49:10 -0700	[diff] [blame]	843
module/kernel parameter: edac_mc_poll_msec=<period in milliseconds>
				845
				846	RUN TIME::
				847
				848	echo "1000" > /sys/module/edac_core/parameters/edac_mc_poll_msec
Arthur Jones	327dafb	2008-07-25 01:49:10 -0700	[diff] [blame]	849
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	850
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	851	- ``panic_on_pci_parity`` - Panic on PCI PARITY Error
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	852
				853
Rami Rosen	3aae9ed	2015-06-19 09:18:34 +0300	[diff] [blame]	854	This control file enables or disables panicking when a parity
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	855	error has been detected.
				856
				857
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	858	module/kernel parameter::
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	859
edac_panic_on_pci_pe=[0|1]
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	861
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	862	Enable::
				863
				864	echo "1" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe
				865
				866	Disable::
				867
				868	echo "0" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe
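
The run-time settings above are plain files, so they can also be changed from
a script. The sketch below is one possible helper, assuming the
``/sys/module/edac_core/parameters`` directory used in the examples above;
the function names are hypothetical, and writing to the real directory
requires root privileges.

```python
from pathlib import Path

# Directory used by the RUN TIME examples above.
EDAC_PARAMS_DIR = Path("/sys/module/edac_core/parameters")


def read_edac_param(name, params_dir=EDAC_PARAMS_DIR):
    """Return the current value of one parameter as a string."""
    return (Path(params_dir) / name).read_text().strip()


def apply_edac_params(settings, params_dir=EDAC_PARAMS_DIR):
    """Write each setting and return the values that were in place before."""
    previous = {}
    for name, value in settings.items():
        path = Path(params_dir) / name
        # Reading first raises FileNotFoundError if the parameter is absent.
        previous[name] = path.read_text().strip()
        path.write_text(f"{value}\n")
    return previous
```

The returned dictionary can be passed back to ``apply_edac_params`` to restore
the earlier settings, for example after a test run with a shorter polling
period.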
Alan Cox	da9bb1d	2006-01-18 17:44:13 -0800	[diff] [blame]	869
				870
				871
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	872	EDAC device type
				873	----------------
Doug Thompson	87f24c3	2007-07-19 01:50:34 -0700	[diff] [blame]	874
Mauro Carvalho Chehab	66c222a	2016-10-29 10:35:23 -0200	[diff] [blame]	875	In the header file, edac_pci.h, there is a series of edac_device structures
Doug Thompson	87f24c3	2007-07-19 01:50:34 -0700	[diff] [blame]	876	and APIs for the EDAC_DEVICE.
				877
				878	User space access to an edac_device is through the sysfs interface.
				879
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	880	At the location ``/sys/devices/system/edac`` (sysfs) new edac_device devices
				881	will appear.
Doug Thompson	87f24c3	2007-07-19 01:50:34 -0700	[diff] [blame]	882
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	883	There is a three level tree beneath the above ``edac`` directory. For example,
the ``test_device_edac`` device (found at the http://bluesmoke.sourceforge.net
				885	website) installs itself as::
Doug Thompson	87f24c3	2007-07-19 01:50:34 -0700	[diff] [blame]	886
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	887	/sys/devices/system/edac/test-instance
Doug Thompson	87f24c3	2007-07-19 01:50:34 -0700	[diff] [blame]	888
In this directory are various controls, a symlink and one or more ``instance``
Carlos Garcia	c98be0c	2014-04-04 22:31:00 -0400	[diff] [blame]	890	directories.
Doug Thompson	87f24c3	2007-07-19 01:50:34 -0700	[diff] [blame]	891
				892	The standard default controls are:
				893
==============  =======================================================
log_ce          boolean to log CE events
log_ue          boolean to log UE events
panic_on_ue     boolean to ``panic`` the system if a UE is encountered
                (default off, can be set true via startup script)
poll_msec       time period between POLL cycles for events
==============  =======================================================
Doug Thompson	87f24c3	2007-07-19 01:50:34 -0700	[diff] [blame]	901
The test_device_edac device adds at least one custom control of its own:
				903
==============  ==================================================
test_bits       which in the current test driver does nothing but
                show how it is installed. A ported driver can
                add one or more such controls and/or attributes
                for specific uses.
                One out-of-tree driver uses controls here to allow
                for ERROR INJECTION operations to hardware
                injection registers
==============  ==================================================
Doug Thompson	87f24c3	2007-07-19 01:50:34 -0700	[diff] [blame]	913
				914	The symlink points to the 'struct dev' that is registered for this edac_device.
				915
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	916	Instances
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	917	---------
Doug Thompson	87f24c3	2007-07-19 01:50:34 -0700	[diff] [blame]	918
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	919	One or more instance directories are present. For the ``test_device_edac``
				920	case:
Doug Thompson	87f24c3	2007-07-19 01:50:34 -0700	[diff] [blame]	921
+----------------+
| test-instance0 |
+----------------+
Doug Thompson	87f24c3	2007-07-19 01:50:34 -0700	[diff] [blame]	925
				926
In this directory there are two default counter attributes, which are totals of
the counters in deeper subdirectories.
				929
==============  ====================================
ce_count        total of CE events of subdirectories
ue_count        total of UE events of subdirectories
==============  ====================================
Doug Thompson	87f24c3	2007-07-19 01:50:34 -0700	[diff] [blame]	934
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	935	Blocks
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	936	------
Doug Thompson	87f24c3	2007-07-19 01:50:34 -0700	[diff] [blame]	937
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	938	At the lowest directory level is the ``block`` directory. There can be 0, 1
				939	or more blocks specified in each instance:
Doug Thompson	87f24c3	2007-07-19 01:50:34 -0700	[diff] [blame]	940
+-------------+
| test-block0 |
+-------------+
Doug Thompson	87f24c3	2007-07-19 01:50:34 -0700	[diff] [blame]	944
				945	In this directory the default attributes are:
				946
==============  ================================================
ce_count        counter of CE events for this ``block``
                of hardware being monitored
ue_count        counter of UE events for this ``block``
                of hardware being monitored
==============  ================================================
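
The relationship between block counters and instance totals can be checked
from user space. The sketch below sums the block counters of each instance.
It assumes the three-level layout described above (device directory, then
instance directories, then block directories, each with ``ce_count`` and
``ue_count`` files); the helper names are hypothetical.

```python
from pathlib import Path


def _read_count(path):
    """Read one integer counter file."""
    return int(path.read_text().strip())


def _counter_dirs(parent):
    """Yield real subdirectories of parent that expose a ce_count file.

    Symlinks and helper directories without counters are skipped.
    """
    for entry in sorted(Path(parent).iterdir()):
        if entry.is_dir() and not entry.is_symlink() \
                and (entry / "ce_count").is_file():
            yield entry


def block_totals(device_dir):
    """Return {instance name: {"ce": n, "ue": m}} summed over its blocks."""
    totals = {}
    for instance in _counter_dirs(device_dir):
        ce = ue = 0
        for block in _counter_dirs(instance):
            ce += _read_count(block / "ce_count")
            ue += _read_count(block / "ue_count")
        totals[instance.name] = {"ce": ce, "ue": ue}
    return totals
```

Comparing the result with the ``ce_count`` and ``ue_count`` files of the
instance directory itself is a quick consistency check for a new driver.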
Doug Thompson	87f24c3	2007-07-19 01:50:34 -0700	[diff] [blame]	953
				954
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	955	The ``test_device_edac`` device adds 4 attributes and 1 control:
Doug Thompson	87f24c3	2007-07-19 01:50:34 -0700	[diff] [blame]	956
==================  ====================================================
test-block-bits-0   for every POLL cycle this counter
                    is incremented
test-block-bits-1   every 10 cycles, this counter is bumped once,
                    and test-block-bits-0 is set to 0
test-block-bits-2   every 100 cycles, this counter is bumped once,
                    and test-block-bits-1 is set to 0
test-block-bits-3   every 1000 cycles, this counter is bumped once,
                    and test-block-bits-2 is set to 0
==================  ====================================================
Doug Thompson	87f24c3	2007-07-19 01:50:34 -0700	[diff] [blame]	967
				968
==================  ====================================================
reset-counters      writing anything to this control will
                    reset all the above counters.
==================  ====================================================
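
The four ``test-block-bits`` counters behave like the digits of a decimal
odometer. The simulation below is only one reading of the table above, not
the driver's code; the function names are hypothetical, and the list index
``i`` stands for ``test-block-bits-i``.

```python
def poll_cycle(bits):
    """Advance the four counters by one POLL cycle, in place."""
    bits[0] += 1
    for i in range(3):
        if bits[i] == 10:      # ten steps at this level...
            bits[i] = 0        # ...reset it to 0
            bits[i + 1] += 1   # ...and bump the next counter once


def reset_counters(bits):
    """Model a write to reset-counters: every counter returns to 0."""
    for i in range(len(bits)):
        bits[i] = 0


def simulate(cycles):
    """Return the counter values after the given number of POLL cycles."""
    bits = [0, 0, 0, 0]
    for _ in range(cycles):
        poll_cycle(bits)
    return bits
```

Under this reading, ``simulate(1234)`` leaves the counters at
``[4, 3, 2, 1]``, that is, the decimal digits of the cycle count from the
lowest counter upwards.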
Doug Thompson	87f24c3	2007-07-19 01:50:34 -0700	[diff] [blame]	973
				974
Use of the ``test_device_edac`` driver should enable others to create their own
Doug Thompson	87f24c3	2007-07-19 01:50:34 -0700	[diff] [blame]	976	unique drivers for their hardware systems.
				977
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	978	The ``test_device_edac`` sample driver is located at the
				979	http://bluesmoke.sourceforge.net project site for EDAC.
Doug Thompson	87f24c3	2007-07-19 01:50:34 -0700	[diff] [blame]	980
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	981
Mauro Carvalho Chehab	e4b5301	2016-10-26 08:43:58 -0200	[diff] [blame]	982	Usage of EDAC APIs on Nehalem and newer Intel CPUs
				983	--------------------------------------------------
Mauro Carvalho Chehab	31983a0	2009-08-05 21:16:56 -0300	[diff] [blame]	984
Mauro Carvalho Chehab	e4b5301	2016-10-26 08:43:58 -0200	[diff] [blame]	985	On older Intel architectures, the memory controller was part of the North
				986	Bridge chipset. Nehalem, Sandy Bridge, Ivy Bridge, Haswell, Sky Lake and
				987	newer Intel architectures integrated an enhanced version of the memory
				988	controller (MC) inside the CPUs.
Mauro Carvalho Chehab	31983a0	2009-08-05 21:16:56 -0300	[diff] [blame]	989
Mauro Carvalho Chehab	e4b5301	2016-10-26 08:43:58 -0200	[diff] [blame]	990	This chapter will cover the differences of the enhanced memory controllers
				991	found on newer Intel CPUs, such as ``i7core_edac``, ``sb_edac`` and
				992	``sbx_edac`` drivers.
				993
				994	.. note::
				995
				996	The Xeon E7 processor families use a separate chip for the memory
				997	controller, called Intel Scalable Memory Buffer. This section doesn't
apply to such families.
				999
1) There is one Memory Controller per QuickPath Interconnect
(QPI). In the driver, the term "socket" means one QPI. This is
associated with a physical CPU socket.
Mauro Carvalho Chehab	31983a0	2009-08-05 21:16:56 -0300	[diff] [blame]	1003
Each MC has 3 physical read channels, 3 physical write channels and
3 logical channels. The driver currently sees it as just 3 channels.
Mauro Carvalho Chehab	31983a0	2009-08-05 21:16:56 -0300	[diff] [blame]	1006	Each channel can have up to 3 DIMMs.
				1007
The smallest known unit is the DIMM. There is no information about csrows.
As the smallest unit in the EDAC API is the csrow, the driver sequentially
maps each channel/DIMM pair to a different csrow.
Mauro Carvalho Chehab	c344436	2009-09-05 05:10:15 -0300	[diff] [blame]	1011
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1012	For example, supposing the following layout::
				1013
Mauro Carvalho Chehab	c344436	2009-09-05 05:10:15 -0300	[diff] [blame]	1014	Ch0 phy rd0, wr0 (0x063f4031): 2 ranks, UDIMMs
				1015	dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
				1016	dimm 1 1024 Mb offset: 4, bank: 8, rank: 1, row: 0x4000, col: 0x400
				1017	Ch1 phy rd1, wr1 (0x063f4031): 2 ranks, UDIMMs
				1018	dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
				1019	Ch2 phy rd3, wr3 (0x063f4031): 2 ranks, UDIMMs
				1020	dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1021
				1022	The driver will map it as::
				1023
Mauro Carvalho Chehab	c344436	2009-09-05 05:10:15 -0300	[diff] [blame]	1024	csrow0: channel 0, dimm0
				1025	csrow1: channel 0, dimm1
				1026	csrow2: channel 1, dimm0
				1027	csrow3: channel 2, dimm0
				1028
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1029	That is, the driver exports one DIMM per csrow.
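
The sequential mapping can be illustrated with a small shell sketch. It is not taken from the driver source: the ``map_csrows`` function and its input format (one ``<channel> <number_of_dimms>`` pair per line) are assumptions made only for this illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of any EDAC driver): assign consecutive
# csrow numbers to the populated DIMMs, channel by channel.
# Input on stdin: one line per channel, "<channel> <number_of_dimms>".
map_csrows() {
    csrow=0
    while read -r channel ndimms; do
        dimm=0
        while [ "$dimm" -lt "$ndimms" ]; do
            echo "csrow${csrow}: channel ${channel}, dimm${dimm}"
            csrow=$((csrow + 1))
            dimm=$((dimm + 1))
        done
    done
}

# Layout from the example: two DIMMs on channel 0, one each on
# channels 1 and 2.
printf '0 2\n1 1\n2 1\n' | map_csrows
```

For the layout above, this should reproduce the four csrow lines listed in the example mapping.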
Mauro Carvalho Chehab	31983a0	2009-08-05 21:16:56 -0300	[diff] [blame]	1030
Mauro Carvalho Chehab	c344436	2009-09-05 05:10:15 -0300	[diff] [blame]	1031	Each QPI is exported as a different memory controller.
Mauro Carvalho Chehab	31983a0	2009-08-05 21:16:56 -0300	[diff] [blame]	1032
Mauro Carvalho Chehab	e4b5301	2016-10-26 08:43:58 -0200	[diff] [blame]	1033	2) The MC has the ability to inject errors to test drivers. The drivers
				1034	implement this functionality via some error injection nodes:
Mauro Carvalho Chehab	31983a0	2009-08-05 21:16:56 -0300	[diff] [blame]	1035
				1036	For injecting a memory error, there are some sysfs nodes, under
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1037	``/sys/devices/system/edac/mc/mc?/``:
Mauro Carvalho Chehab	31983a0	2009-08-05 21:16:56 -0300	[diff] [blame]	1038
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1039	- ``inject_addrmatch/*``:
Mauro Carvalho Chehab	31983a0	2009-08-05 21:16:56 -0300	[diff] [blame]	1040	Controls the error injection mask register. It is possible to specify
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1041	several characteristics of the address to match an error code::
				1042
Mauro Carvalho Chehab	31983a0	2009-08-05 21:16:56 -0300	[diff] [blame]	1043	dimm = the affected dimm. Numbers are relative to a channel;
				1044	rank = the memory rank;
				1045	channel = the channel that will generate an error;
				1046	bank = the affected bank;
				1047	page = the page address;
				1048	column (or col) = the address column.
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1049
Mauro Carvalho Chehab	31983a0	2009-08-05 21:16:56 -0300	[diff] [blame]	1050	Each of the above values can be set to "any" to match any valid value.
				1051
				1052	At driver init, all values are set to any.
				1053
				1054	For example, to generate an error at rank 1 of dimm 2, for any channel,
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1055	any bank, any page, any column::
				1056
Mauro Carvalho Chehab	35be954	2009-09-24 17:28:50 -0300	[diff] [blame]	1057	echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
				1058	echo 1 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank
Mauro Carvalho Chehab	31983a0	2009-08-05 21:16:56 -0300	[diff] [blame]	1059
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1060	To return to the default behaviour of matching any, you can do::
				1061
Mauro Carvalho Chehab	35be954	2009-09-24 17:28:50 -0300	[diff] [blame]	1062	echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
				1063	echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank
Mauro Carvalho Chehab	31983a0	2009-08-05 21:16:56 -0300	[diff] [blame]	1064
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1065	- ``inject_eccmask``:
				1066	specifies which bits will be affected by the error.
Mauro Carvalho Chehab	31983a0	2009-08-05 21:16:56 -0300	[diff] [blame]	1067
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1068	- ``inject_section``:
				1069	specifies which ECC cache section will get the error::
				1070
Mauro Carvalho Chehab	31983a0	2009-08-05 21:16:56 -0300	[diff] [blame]	1071	3 for both
				1072	2 for the highest
				1073	1 for the lowest
				1074
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1075	- ``inject_type``:
				1076	specifies the type of error, being a combination of the following bits::
				1077
Mauro Carvalho Chehab	31983a0	2009-08-05 21:16:56 -0300	[diff] [blame]	1078	bit 0 - repeat
				1079	bit 1 - ecc
				1080	bit 2 - parity
				1081
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1082	- ``inject_enable``:
				1083	starts the error generation when a non-zero value is written.
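
Since ``inject_type`` is a combination of bits, the value to write can be computed from the bit names. The sketch below only does that arithmetic. The ``inject_type_value`` function is a hypothetical helper, not part of the kernel or of any EDAC tool, and the bit positions are the ones listed for ``inject_type``.

```shell
#!/bin/sh
# Hypothetical helper: build an inject_type value from flag names.
# Bit positions follow the inject_type description:
#   bit 0 - repeat, bit 1 - ecc, bit 2 - parity
inject_type_value() {
    value=0
    for flag in "$@"; do
        case "$flag" in
            repeat) value=$((value | 1)) ;;
            ecc)    value=$((value | 2)) ;;
            parity) value=$((value | 4)) ;;
            *) echo "unknown flag: $flag" >&2; return 1 ;;
        esac
    done
    echo "$value"
}

inject_type_value ecc          # prints 2
inject_type_value repeat ecc   # prints 3
```

The printed number is what would be written to ``inject_type``. For instance, ``ecc`` alone gives 2.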
Mauro Carvalho Chehab	31983a0	2009-08-05 21:16:56 -0300	[diff] [blame]	1084
				1085	All inject vars can be read. Root permission is needed to write them.
				1086
				1087	The datasheet states that the error will only be generated after a write to an
				1088	address that matches inject_addrmatch. It seems, however, that reading will
				1089	also produce an error.
				1090
				1091	For example, the following code will generate an error for any write access
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1092	at socket 0, on any DIMM/address on channel 2::
Mauro Carvalho Chehab	31983a0	2009-08-05 21:16:56 -0300	[diff] [blame]	1093
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1094	echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/channel
				1095	echo 2 >/sys/devices/system/edac/mc/mc0/inject_type
				1096	echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask
				1097	echo 3 >/sys/devices/system/edac/mc/mc0/inject_section
				1098	echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable
				1099	dd if=/dev/mem of=/dev/null seek=16k bs=4k count=1 >& /dev/null
Mauro Carvalho Chehab	31983a0	2009-08-05 21:16:56 -0300	[diff] [blame]	1100
Mauro Carvalho Chehab	c344436	2009-09-05 05:10:15 -0300	[diff] [blame]	1101	For socket 1, replace "mc0" with "mc1" in the above
				1102	commands.
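
To avoid retyping the commands for each socket, they can be wrapped in a function that takes the memory controller directory as a parameter. This is only a sketch: ``inject_on_channel`` is a hypothetical name, and the values written are simply those of the example above (ECC error type, mask 64, both sections).

```shell
#!/bin/sh
# Hypothetical wrapper around the example commands. "mc_dir" is the sysfs
# directory of one memory controller, for instance
# /sys/devices/system/edac/mc/mc1 for socket 1.
inject_on_channel() {
    mc_dir=$1
    channel=$2
    if [ ! -d "$mc_dir/inject_addrmatch" ]; then
        echo "no error injection nodes under $mc_dir" >&2
        return 1
    fi
    echo "$channel" > "$mc_dir/inject_addrmatch/channel" || return 1
    echo 2  > "$mc_dir/inject_type"    || return 1   # bit 1: ecc
    echo 64 > "$mc_dir/inject_eccmask" || return 1
    echo 3  > "$mc_dir/inject_section" || return 1   # both sections
    echo 1  > "$mc_dir/inject_enable"  || return 1   # start injection
}

# Usage (as root), socket 1, channel 2:
#   inject_on_channel /sys/devices/system/edac/mc/mc1 2
```

The function only arms the injection. A memory access that matches the address filter is still needed afterwards to trigger the error.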
				1103
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1104	The generated error message will look like::
Mauro Carvalho Chehab	31983a0	2009-08-05 21:16:56 -0300	[diff] [blame]	1105
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1106	EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error))
Mauro Carvalho Chehab	31983a0	2009-08-05 21:16:56 -0300	[diff] [blame]	1107
Mauro Carvalho Chehab	e4b5301	2016-10-26 08:43:58 -0200	[diff] [blame]	1108	3) Corrected Error memory register counters
Mauro Carvalho Chehab	31983a0	2009-08-05 21:16:56 -0300	[diff] [blame]	1109
Mauro Carvalho Chehab	e4b5301	2016-10-26 08:43:58 -0200	[diff] [blame]	1110	Those newer MCs have some registers to count memory errors. The driver
				1111	uses those registers to report Corrected Errors on devices with Registered
				1112	DIMMs.
Mauro Carvalho Chehab	31983a0	2009-08-05 21:16:56 -0300	[diff] [blame]	1113
Mauro Carvalho Chehab	e4b5301	2016-10-26 08:43:58 -0200	[diff] [blame]	1114	However, those counters don't work with Unregistered DIMMs. As the chipset
				1115	offers some counters that also work with UDIMMs (but with a worse level of
Mauro Carvalho Chehab	35be954	2009-09-24 17:28:50 -0300	[diff] [blame]	1116	granularity than the default ones), the driver exposes those registers for
				1117	UDIMM memories.
Mauro Carvalho Chehab	c344436	2009-09-05 05:10:15 -0300	[diff] [blame]	1118
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1119	They can be read by looking at the contents of ``all_channel_counts/``::
Mauro Carvalho Chehab	31983a0	2009-08-05 21:16:56 -0300	[diff] [blame]	1120
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1121	$ for i in /sys/devices/system/edac/mc/mc0/all_channel_counts/*; do echo $i; cat $i; done
Mauro Carvalho Chehab	35be954	2009-09-24 17:28:50 -0300	[diff] [blame]	1122	/sys/devices/system/edac/mc/mc0/all_channel_counts/udimm0
				1123	0
				1124	/sys/devices/system/edac/mc/mc0/all_channel_counts/udimm1
				1125	0
				1126	/sys/devices/system/edac/mc/mc0/all_channel_counts/udimm2
				1127	0
Mauro Carvalho Chehab	c344436	2009-09-05 05:10:15 -0300	[diff] [blame]	1128
				1129	What happens here is that errors on different csrows, but at the same
				1130	dimm number will increment the same counter.
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1131	So, in this memory mapping::
				1132
Mauro Carvalho Chehab	c344436	2009-09-05 05:10:15 -0300	[diff] [blame]	1133	csrow0: channel 0, dimm0
				1134	csrow1: channel 0, dimm1
				1135	csrow2: channel 1, dimm0
				1136	csrow3: channel 2, dimm0
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1137
Mauro Carvalho Chehab	35be954	2009-09-24 17:28:50 -0300	[diff] [blame]	1138	The hardware will increment udimm0 for an error at the first dimm at either
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1139	csrow0, csrow2 or csrow3;
				1140
Mauro Carvalho Chehab	35be954	2009-09-24 17:28:50 -0300	[diff] [blame]	1141	The hardware will increment udimm1 for an error at the second dimm at either
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1142	csrow0, csrow2 or csrow3;
				1143
Mauro Carvalho Chehab	35be954	2009-09-24 17:28:50 -0300	[diff] [blame]	1144	The hardware will increment udimm2 for an error at the third dimm at either
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1145	csrow0, csrow2 or csrow3;
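
Because each ``udimmN`` counter aggregates errors across channels, a per-controller total can be handy when monitoring. The following sketch prints each counter and their sum. ``sum_udimm_counts`` is a hypothetical helper, and it assumes that every ``udimm*`` file holds a single decimal number, as in the output shown above.

```shell
#!/bin/sh
# Hypothetical helper: print every udimm counter found in a directory
# and the sum of all of them.
sum_udimm_counts() {
    counts_dir=$1
    total=0
    for f in "$counts_dir"/udimm*; do
        [ -r "$f" ] || continue      # skip if the glob matched nothing
        n=$(cat "$f")
        echo "${f##*/}: $n"
        total=$((total + n))
    done
    echo "total: $total"
}

# Usage:
#   sum_udimm_counts /sys/devices/system/edac/mc/mc0/all_channel_counts
```

A non-zero total only says that some DIMM on that controller saw a corrected error. The individual ``udimmN`` lines narrow it down to a DIMM position, not to a channel.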
Mauro Carvalho Chehab	c344436	2009-09-05 05:10:15 -0300	[diff] [blame]	1146
				1147	4) Standard error counters
				1148
				1149	The standard error counters are generated when an mcelog error is received
Mauro Carvalho Chehab	e4b5301	2016-10-26 08:43:58 -0200	[diff] [blame]	1150	by the driver. Since, with UDIMM, this is counted by software, it is
				1151	possible that some errors could be lost. With RDIMMs, they display the
Mauro Carvalho Chehab	35be954	2009-09-24 17:28:50 -0300	[diff] [blame]	1152	contents of the registers.
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	1153
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1154	Reference documents used on ``amd64_edac``
				1155	------------------------------------------
				1156
				1157	The ``amd64_edac`` module is based on the following documents
Aravind Gopalakrishnan	6b7464b	2015-09-28 06:44:31 -0500	[diff] [blame]	1158	(available from http://support.amd.com/en-us/search/tech-docs):
				1159
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1160	1. :Title: BIOS and Kernel Developer's Guide for AMD Athlon 64 and AMD
Aravind Gopalakrishnan	6b7464b	2015-09-28 06:44:31 -0500	[diff] [blame]	1161	Opteron Processors
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1162	:AMD publication #: 26094
				1163	:Revision: 3.26
				1164	:Link: http://support.amd.com/TechDocs/26094.PDF
Aravind Gopalakrishnan	6b7464b	2015-09-28 06:44:31 -0500	[diff] [blame]	1165
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1166	2. :Title: BIOS and Kernel Developer's Guide for AMD NPT Family 0Fh
Aravind Gopalakrishnan	6b7464b	2015-09-28 06:44:31 -0500	[diff] [blame]	1167	Processors
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1168	:AMD publication #: 32559
				1169	:Revision: 3.00
				1170	:Issue Date: May 2006
				1171	:Link: http://support.amd.com/TechDocs/32559.pdf
Aravind Gopalakrishnan	6b7464b	2015-09-28 06:44:31 -0500	[diff] [blame]	1172
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1173	3. :Title: BIOS and Kernel Developer's Guide (BKDG) For AMD Family 10h
Aravind Gopalakrishnan	6b7464b	2015-09-28 06:44:31 -0500	[diff] [blame]	1174	Processors
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1175	:AMD publication #: 31116
				1176	:Revision: 3.00
				1177	:Issue Date: September 07, 2007
				1178	:Link: http://support.amd.com/TechDocs/31116.pdf
Aravind Gopalakrishnan	6b7464b	2015-09-28 06:44:31 -0500	[diff] [blame]	1179
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1180	4. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h
Aravind Gopalakrishnan	6b7464b	2015-09-28 06:44:31 -0500	[diff] [blame]	1181	Models 30h-3Fh Processors
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1182	:AMD publication #: 49125
				1183	:Revision: 3.06
				1184	:Issue Date: 2/12/2015 (latest release)
				1185	:Link: http://support.amd.com/TechDocs/49125_15h_Models_30h-3Fh_BKDG.pdf
Aravind Gopalakrishnan	6b7464b	2015-09-28 06:44:31 -0500	[diff] [blame]	1186
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1187	5. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h
Aravind Gopalakrishnan	6b7464b	2015-09-28 06:44:31 -0500	[diff] [blame]	1188	Models 60h-6Fh Processors
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1189	:AMD publication #: 50742
				1190	:Revision: 3.01
				1191	:Issue Date: 7/23/2015 (latest release)
				1192	:Link: http://support.amd.com/TechDocs/50742_15h_Models_60h-6Fh_BKDG.pdf
Aravind Gopalakrishnan	6b7464b	2015-09-28 06:44:31 -0500	[diff] [blame]	1193
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1194	6. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 16h
Aravind Gopalakrishnan	6b7464b	2015-09-28 06:44:31 -0500	[diff] [blame]	1195	Models 00h-0Fh Processors
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1196	:AMD publication #: 48751
				1197	:Revision: 3.03
				1198	:Issue Date: 2/23/2015 (latest release)
				1199	:Link: http://support.amd.com/TechDocs/48751_16h_bkdg.pdf
Aravind Gopalakrishnan	6b7464b	2015-09-28 06:44:31 -0500	[diff] [blame]	1200
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1201	Credits
				1202	=======
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	1203
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1204	* Written by Doug Thompson <dougthompson@xmission.com>
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	1205
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1206	- 7 Dec 2005
				1207	- 17 Jul 2007 Updated
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	1208
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1209	* |copy| Mauro Carvalho Chehab
Borislav Petkov	043b431	2015-06-19 11:47:17 +0200	[diff] [blame]	1210
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1211	- 05 Aug 2009 Nehalem interface
Mauro Carvalho Chehab	e4b5301	2016-10-26 08:43:58 -0200	[diff] [blame]	1212	- 26 Oct 2016 Converted to ReST and cleanups at the Nehalem section
Mauro Carvalho Chehab	b27a2d0	2016-10-26 08:14:12 -0200	[diff] [blame]	1213
				1214	* EDAC authors/maintainers:
				1215
				1216	- Doug Thompson, Dave Jiang, Dave Peterson et al,
				1217	- Mauro Carvalho Chehab
				1218	- Borislav Petkov
				1219	- original author: Thayne Harbaugh