Bug #121

open

T520: Hangs in OS

Added by Firstname Lastname over 7 years ago. Updated 21 days ago.

Status: In Progress
Priority: Normal
Assignee: -
Category: chipset configuration
Target version: -
Start date: 06/09/2017
Due date:
% Done: 0%
Estimated time:
Affected versions:
Needs backport to:
Affected hardware: SNB, IVY
Affected OS: -

Description

I have been running coreboot since 2017.04.15 and have experienced hangs ever since then. People on IRC suggested that I run memtest to check whether incorrect raminit was causing errors; however, I have run memtest for 12 hours straight with no errors.

Due to the ambiguous nature of the hangs (immediate freeze with no warning signs, audio gets stuck repeating the last 50ms or so of noise, not sure what this effect is called) I don't have much useful information other than the .config and dmesg. However one thing I can say with high confidence is that the hangs occur significantly more frequently in Linux (*buntu distros) than Windows 10. Within an hour of launching Linux a hang is likely, whereas Windows typically runs for many hours before a hang occurs. I considered this an insignificant anecdotal anomaly at first but over the course of the nearly 2 months I have been running coreboot it seems to be a solid trend. The hangs occur anywhere, typically during mere desktop usage or basic web browsing.

Additionally, there is another form of hang I experience where the screen goes black except for some sort of graphical corruption down the left side (http://i.imgur.com/4zWrlpX.jpg). Whether this is related to the more common total-freeze hangs, I don't know, but I figured I should include it nonetheless. These hangs occur only about once for every 20 regular hangs.


Files

config (20.7 KB), Firstname Lastname, 06/09/2017 06:21 AM
dmesg.txt (57.3 KB), Firstname Lastname, 06/09/2017 06:21 AM
cbmem-raminit.txt (62 KB), Firstname Lastname, 06/29/2017 11:58 PM
lspci.txt (29.6 KB), sudo lspci -vv, Viktor V, 06/29/2019 06:46 AM
cpuinfo.txt (3.94 KB), cat /proc/cpuinfo, Viktor V, 06/29/2019 06:46 AM
defconfig (1023 Bytes), T420 config for CBET4000 4.16-1069-gf4905da14c, Anastasios Koutian, 03/29/2023 02:55 PM
defconfig (699 Bytes), Working coreboot defconfig for ThinkPad T420, Anastasios Koutian, 09/24/2023 10:07 AM
Actions #1

Updated by Firstname Lastname over 7 years ago

https://mail.coreboot.org/pipermail/coreboot/2016-September/082009.html

According to this entry on the mailing list, someone else was getting the same issue on their T520. I have tried limiting the max mem speed to 666 in devicetree.cb as suggested in the link; however, as expected, it did not fix the issue, since my RAM is only 1333 anyway. As for the second suggestion (limiting the CPU P-state), I wouldn't know how to do it.
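For readers wondering about the second suggestion: one OS-side way to approximate a P-state limit on Linux is to cap the maximum frequency through the cpufreq sysfs files. This is only a sketch under that assumption; it may not be what the mailing list post had in mind, and the function name `cap_cpu_freq` is hypothetical. The limit is given in kHz.

```shell
#!/bin/sh
# Cap the maximum frequency of every CPU through the cpufreq sysfs interface.
# $1 = limit in kHz
# $2 = CPU sysfs directory (optional, defaults to the real one)
cap_cpu_freq() {
    limit_khz=$1
    cpu_root=${2:-/sys/devices/system/cpu}
    for f in "$cpu_root"/cpu[0-9]*/cpufreq/scaling_max_freq; do
        [ -e "$f" ] || continue   # skip if the glob matched nothing
        echo "$limit_khz" > "$f"  # needs root on a real system
    done
}

# Example (as root): cap_cpu_freq 1200000
```

The cap is lost on reboot, so it is easy to undo if it makes no difference.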

Actions #2

Updated by Nico Huber over 7 years ago

Does your T520 have a dedicated GPU or the integrated Intel GPU only?

Actions #3

Updated by Firstname Lastname over 7 years ago

Integrated only.

Actions #4

Updated by Iru Cai over 7 years ago

What is the longest uptime before the system hangs in Linux?
How long can the system run before it hangs when you run heavy loads (e.g. boinc) or do a lot of network transfer?

Also, I suggest you try revision 39937cc2fd28bcc754c0595f1327467499af40ea, in which the Lenovo T520 still uses the mrc.bin blob. I'm now running it for the first time and the system has been up for >5 hours. However, I don't know whether it will stay stable across future boots.

Actions #5

Updated by Vasya Boytsov over 7 years ago

I have the same issue on t420 with 3632qm. And I accidentally found out that my laptop works more than 2 days without any hangs while I was using the x220 kernel config which had maxcpus set to 4. When I changed this value to 8 in the kernel config those hangs came back. I don't remember whether the maxcpus=7 worked the same way or not.

Actions #6

Updated by Firstname Lastname over 7 years ago

Iru Cai wrote:

What is the longest uptime before the system hangs in Linux?
How long the system can run before it hangs when you run some heavy loads (e.g. boinc) or do a lot of network transfer?

Also, I suggest you try revision 39937cc2fd28bcc754c0595f1327467499af40ea in which Lenovo T520 is still using mrc.bin blob. I'm now running it the first time and the system has run for >5 hours. However, I don't know if it's still stable in the future boots.

I am lucky to get 1 hour of uptime in Linux. Heavy loads on Windows seem to prevent the hangs: I have run Linpack and some GPU benchmarks multiple times for 6+ hours at a time with no hang, and have never seen a hang during such programs. This doesn't seem to be the case on Linux, where I frequently get hangs during the crossgcc build stage of the coreboot build, which I assume puts a high load on the CPU. Network activity does not seem to prevent the hangs either; in fact, the most common hang scenario for me now is when the laptop is left for some hours with only a torrent client running, in which case it is likely to hang within 2 hours.

Vasya Boytsov wrote:

I have the same issue on t420 with 3632qm. And I accidentally found out that my laptop works more than 2 days without any hangs while I was using the x220 kernel config which had maxcpus set to 4. When I changed this value to 8 in the kernel config those hangs came back. I don't remember whether the maxcpus=7 worked the same way or not.

I am already using a 4-CPU chip, though (i5-3320M). Perhaps I could try setting maxcpus=2 in the config.

Actions #7

Updated by Iru Cai over 7 years ago

Julz Buckton wrote:

Iru Cai wrote:

What is the longest uptime before the system hangs in Linux?
How long the system can run before it hangs when you run some heavy loads (e.g. boinc) or do a lot of network transfer?

Also, I suggest you try revision 39937cc2fd28bcc754c0595f1327467499af40ea in which Lenovo T520 is still using mrc.bin blob. I'm now running it the first time and the system has run for >5 hours. However, I don't know if it's still stable in the future boots.

I am lucky to get 1 hour uptime in linux. Heavy loads on windows seem to prevent the hangs, I have run Linpack and some GPU benchmarks multiple times for 6+ hours at a time with no hang, and have never seen a hang during such programs. This doesn't seem to be the case on linux, where I frequently get hangs during the crossgcc build stage of the coreboot build, which I assume is running the CPU high. Network activity does not seem to prevent the hangs, furthermore the most common hang scenario for me now is when the laptop was left for some hours with only a torrent client running, where it is unlikely to not hang after 2 hours.

Have you tried mrc.bin yet, e.g. revision 39937cc?
I've tried this revision and the first revision that uses native RAM init, and it seems that native RAM init is the problem. I just don't know if mrc.bin supports Ivy Bridge yet.

Vasya Boytsov wrote:

I have the same issue on t420 with 3632qm. And I accidentally found out that my laptop works more than 2 days without any hangs while I was using the x220 kernel config which had maxcpus set to 4. When I changed this value to 8 in the kernel config those hangs came back. I don't remember whether the maxcpus=7 worked the same way or not.

I already using a 4 CPUs chip though (i5-3320M). Perhaps I could try setting maxcpus=2 in config.

Actions #8

Updated by Iru Cai over 7 years ago

Vasya Boytsov wrote:

I have the same issue on t420 with 3632qm. And I accidentally found out that my laptop works more than 2 days without any hangs while I was using the x220 kernel config which had maxcpus set to 4. When I changed this value to 8 in the kernel config those hangs came back. I don't remember whether the maxcpus=7 worked the same way or not.

Linux kernel config?
As far as I remember, I haven't had any issues on an iGPU-only T420. My last working revision is 8bbd596de631adc8b677e69603e978b848eb1708.

Actions #9

Updated by Vasya Boytsov over 7 years ago

Iru Cai wrote:

Vasya Boytsov wrote:

I have the same issue on t420 with 3632qm. And I accidentally found out that my laptop works more than 2 days without any hangs while I was using the x220 kernel config which had maxcpus set to 4. When I changed this value to 8 in the kernel config those hangs came back. I don't remember whether the maxcpus=7 worked the same way or not.

Linux kernel config?
I remember I haven't have any issue on an iGPU only T420. My last working revision is 8bbd596de631adc8b677e69603e978b848eb1708.

Yes, I changed this setting in the Linux kernel config and recompiled the kernel, and it works flawlessly now. The last version I tested was between 4.5 and 4.6; I don't remember the exact revision. So, the problem should be connected with native RAM init. I'll try earlier revisions later. How can one help with debugging this issue?

Actions #10

Updated by Firstname Lastname over 7 years ago

Iru Cai wrote:

Julz Buckton wrote:

Iru Cai wrote:

What is the longest uptime before the system hangs in Linux?
How long the system can run before it hangs when you run some heavy loads (e.g. boinc) or do a lot of network transfer?

Also, I suggest you try revision 39937cc2fd28bcc754c0595f1327467499af40ea in which Lenovo T520 is still using mrc.bin blob. I'm now running it the first time and the system has run for >5 hours. However, I don't know if it's still stable in the future boots.

I am lucky to get 1 hour uptime in linux. Heavy loads on windows seem to prevent the hangs, I have run Linpack and some GPU benchmarks multiple times for 6+ hours at a time with no hang, and have never seen a hang during such programs. This doesn't seem to be the case on linux, where I frequently get hangs during the crossgcc build stage of the coreboot build, which I assume is running the CPU high. Network activity does not seem to prevent the hangs, furthermore the most common hang scenario for me now is when the laptop was left for some hours with only a torrent client running, where it is unlikely to not hang after 2 hours.

Have you tried mrc.bin yet, e.g revision 39937cc?
I've tried this revision and the first revision that uses native ram init, and it seems that native ram init is the problem. I just don't know if mrc.bin supports ivy bridge yet.

You mean this version? https://review.coreboot.org/cgit/coreboot.git/commit/?id=39937cc2fd28bcc754c0595f1327467499af40ea

I will give it a try. Could native ram init really be the cause of the issue, even if I got no errors in memtest?

Actions #11

Updated by Firstname Lastname over 7 years ago

Tried coreboot revision 39937cc2fd28bcc754c0595f1327467499af40ea (with systemagent-r6.bin; I tried systemagent-ivybridge.bin first and got a brick) and got a hang within 30 seconds of booting into Linux. I guess that rules out RAM init as the cause of the hangs?

Actions #12

Updated by Firstname Lastname over 7 years ago

Here is cbmem output with verbose RAM init logging enabled, in case it is helpful.

Actions #13

Updated by Firstname Lastname over 7 years ago

I managed to get my hands on another SNB chip (i3-2310M) and with the same config (with just PCI ID for vga blob changed from 8086:0166 to 8086:0126), I get no hangs.

So it looks like the combination of T520 mainboard and Ivy Bridge chip is the cause of the hangs.

Actions #14

Updated by Iru Cai over 7 years ago

Julz Buckton wrote:

I managed to get my hands on another SNB chip (i3-2310M) and with the same config (with just PCI ID for vga blob changed from 8086:0166 to 8086:0126), I get no hangs.

So looks like T520 mainboard + Ivy Bridge chip is cause for hangs.

Maybe it is related to turbo boost? Although the machine often hangs when idle.
I ask because the system hang also happens when I use a Sandy Bridge dual-core or quad-core processor.

Actions #15

Updated by Patrick Rudolph about 7 years ago

The vendor firmware dynamically limits the P-state depending on the attached power supply.
ATM coreboot doesn't care about the attached PSU...

Example:
The battery charges at 45 W.
The CPU has a TDP of 45 W.
Idle power is 7 W.
Other components, including USB: 10 W?

It would require a 135 W PSU, or limiting the CPU TDP / battery charge current to a smaller value.

What power rating does your PSU have?
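The arithmetic behind the example can be sketched as follows. The four wattages are the ones given in the comment; the list of adapter ratings (65, 90, 135, 170 W) is an assumption pieced together from the ratings mentioned elsewhere in this thread, and `smallest_psu` is a hypothetical helper name.

```shell
#!/bin/sh
# Power budget from the example, all values in watts.
battery_charge=45
cpu_tdp=45
idle_power=7
other_components=10   # marked as a guess in the comment

total=$((battery_charge + cpu_tdp + idle_power + other_components))
echo "estimated peak draw: ${total} W"

# Print the smallest adapter rating that covers the requested wattage.
# The rating list is an assumption (ratings mentioned in this thread).
smallest_psu() {
    for rating in 65 90 135 170; do
        if [ "$rating" -ge "$1" ]; then
            echo "$rating"
            return 0
        fi
    done
    return 1   # nothing in the list is large enough
}

smallest_psu "$total"
```

The sum is 107 W, which is more than a 90 W adapter delivers, so under these assumptions 135 W is the next rating that fits.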

Actions #16

Updated by Seff Qin over 6 years ago

Tested v4.8.1 on a T420; this issue has not been fixed.

I got different information when executing 'dmidecode -t 17':
Vendor BIOS: Total Width and Data Width are both 64 bits.
coreboot: Total Width is 16 bits and Data Width is 8 bits.

It seems that the RAM is not running at full speed.
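To make this comparison repeatable between the vendor BIOS and coreboot, the two fields can be filtered out of the dmidecode output. A minimal sketch; `mem_widths` is a hypothetical helper name, and it only assumes the usual "Total Width:" and "Data Width:" labels that dmidecode prints for type 17 entries.

```shell
#!/bin/sh
# Read `dmidecode -t 17` output on stdin and keep only the width fields,
# with the leading indentation stripped.
mem_widths() {
    grep -E '^[[:space:]]*(Total|Data) Width:' | sed 's/^[[:space:]]*//'
}

# Usage (needs root):
#   sudo dmidecode -t 17 | mem_widths
```

Saving the output once under each firmware gives two small files that can be compared with diff.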

Actions #17

Updated by Evgeny Zinoviev about 6 years ago

Having hangs on a T520 + i5-2450M. It happened twice, about 1 min after booting Debian (Devuan). The interesting part is that it unfreezes after 4-5 minutes. I'm using two 4G Hynix RAM sticks, 8G in total. I'll see if maxcpus=2 helps.

Actions #18

Updated by Evgeny Zinoviev about 6 years ago

Update: maxcpus=2 didn't help

Actions #19

Updated by Nico Huber about 6 years ago

Evgeny Zinoviev wrote:

Update: maxcpus=2 didn't help

Please note that the original report was for an Ivy Bridge CPU in a T520 (probably caused by missing compatible ME firmware or whatnot). You seem to have a very different problem.

Actions #20

Updated by Evgeny Zinoviev almost 6 years ago

Now I have an X220 with this bug. Yeah, I know that the original report is for an IVB CPU in a T520, but I've seen both symptoms and they are the same: (1) just a hang and (2) a black screen with a fluttering red line on the left, like in the photo from the last paragraph of this ticket.

It doesn't happen with the Lenovo BIOS. For now I suspect it's something RAM related (I just have no other ideas). I'm using 2x8 GB Patriot PSD38G16002S sticks. I'll try different sticks and see if it helps. What else can I do to debug this? At least I have hardware on which we can reproduce this; that's something for a start.

Actions #21

Updated by Evgeny Zinoviev over 5 years ago

Recent observations on X220.

Using most recent CPU microcode doesn't help.
Not using CPU microcode at all doesn't help.
Disabling HT with patch #29669 doesn't help.
Using mrc.bin instead of native raminit doesn't help.
Changing DIMMs doesn't help.
Using stock or neutered ME doesn't help.

Using OEM BIOS helps, of course, but that's not a solution.

Actions #22

Updated by Evgeny Zinoviev over 5 years ago

I also have a feeling that this happens more often when using virtualization (qemu/kvm). I'd say that if I run virtual machines, the lockup is likely to happen within an hour or so.

Actions #23

Updated by Viktor V over 5 years ago

Evgeny Zinoviev wrote:

Recent observations on X220.

Using most recent CPU microcode doesn't help.
Not using CPU microcode at all doesn't help.
Disabling HT with patch #29669 doesn't help.
Using mrc.bin instead of native raminit doesn't help.
Changing DIMMs doesn't help.
Using stock or neutered ME doesn't help.

Using OEM BIOS helps, of course, but that's not a solution.

I have exactly the same problem: my X220 randomly hangs with that weird glitch on the left side of the screen. My build settings are pretty much the defaults, with SeaBIOS and Intel ME disabled.

Using Debian with 2x4 GB RAM and an i5-2520M CPU.

By the way, I'm also from Russia. :)

Actions #24

Updated by Evgeny Zinoviev over 5 years ago

Viktor V wrote:

I have exactly the same problem, my X220 randomly hangs with that weird glitch in the left side of the screen. My build settings are pretty much defaults with SeaBIOS and Intel ME disabled.

Using Debian with 2x4 Gb RAM and i5-2520M CPU.

By the way, I'm also from Russia. :)

I'm glad to hear I'm not the only one. Did you update Lenovo BIOS to the latest version before extracting ME and flashing coreboot?

We had a discussion about these hangs on #coreboot and came up with two ideas:

  1. Make sure we use most recent ME firmware.
  2. Collect revisions and stepping ids of the Intel chips in faulty machines and compare them to the working ones.

Actions #25

Updated by Viktor V over 5 years ago

Did you update Lenovo BIOS to the latest version before extracting ME and flashing coreboot?

Yes, I did. It was version 1.45, but version 1.46, released on June 26, 2019, is now available.

Collect revisions and stepping ids of the Intel chips in faulty machines and compare them to the working ones.

Can I help with providing this information? Not sure what revision and stepping id are, how can I see them in Debian? I've built coreboot 4.9 release.

I assumed that X220 is the most stable hardware for coreboot. Honestly, my very first thought was that this hang is caused by some kind of a failed BIOS exploit by some malware. (LOL I'm paranoid)

Actions #26

Updated by Evgeny Zinoviev over 5 years ago

Viktor V wrote:

Can I help with providing this information?

I hope so. Won't hurt anyway.

Not sure what revision and stepping id are, how can I see them in Debian?

I guess, lspci and cat /proc/cpuinfo
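A minimal sketch of how the CPU identification could be collected in a form that is easy to compare between machines. `cpu_ids` is a hypothetical helper name; it only relies on the usual field labels in /proc/cpuinfo and collapses the per-core repetition with `sort -u`.

```shell
#!/bin/sh
# Print the unique CPU identification lines from a cpuinfo-style file.
# $1 = file to read (optional, defaults to /proc/cpuinfo)
cpu_ids() {
    grep -E '^(cpu family|model|model name|stepping|microcode)[[:space:]]*:' \
        "${1:-/proc/cpuinfo}" | sort -u
}

# Usage:
#   cpu_ids
#   lspci -nn    # PCI devices with numeric IDs and "(rev ..)" revisions
```

On a machine where all cores are identical this prints each field once instead of once per logical CPU.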

I assumed that X220 is the most stable hardware for coreboot.

It is believed to be very stable. Actually, I used an X220 (another one) for a year and a half and never had a single crash or hang. This bug is quite rare; only some mainboards (or CPUs, or something) are affected and, at the moment, we have no idea why. This bug is known to occur only on SNB ThinkPads, so, in this sense, the X230 is probably more "stable".

Honestly, my very first thought was that this hang is caused by some kind of a failed BIOS exploit

Well, you have replaced your BIOS with coreboot, haven't you? ;)

Another idea: try disabling cstates and see if it helps. I was going to try it myself but I doubt I'll have time for it earlier than next week.

Actions #27

Updated by Viktor V over 5 years ago

Attaching lspci and cpuinfo outputs

Actions #28

Updated by Viktor V over 5 years ago

Evgeny Zinoviev wrote:

Another idea: try disabling cstates and see if it helps. I was going to try it myself but I doubt I'll have time for it earlier than next week.

Looks like it works! I've added the "intel_idle.max_cstate=0 processor.max_cstate=1" kernel parameters and it has been running for 2 days without hangs so far.

Actions #29

Updated by Viktor V over 5 years ago

Some strange things I've experienced while flashing this X220.

Every tutorial online says you can flash X220 with Raspberry Pi SPI interface, but I had no luck with it. Flashrom couldn't detect the chip, though it reads/writes fine with RPi on my other laptops. So I had to buy and use ch341a USB programmer (black version).

With ch341a Flashrom works fine, but it shows strange warnings while writing:

Found Macronix flash chip "MX25L6405" (8192 kB, SPI) on ch341a_spi.
Reading old flash chip contents... done.
Erasing and writing flash chip... FAILED at 0x00001000! Expected=0xff, Found=0xf0, failed byte count from 0x00000000-0x0000ffff: 0x1cf9
ERASE FAILED!
Reading current flash chip contents... done. Looking for another erase function.
Erase/write done.
Verifying flash... VERIFIED.

cbmem output says it has SF: Detected MX25L6405D with sector size 0x1000, total 0x800000

Edit: Right, sorry about that. Just trying to understand differences between this unstable X220 and other stable ones.

Actions #30

Updated by Paul Menzel over 5 years ago

Please contact the flashrom mailing list for the flashrom issue as it’s unrelated to the coreboot bug tracker and the issue at hand specifically.

Actions #31

Updated by Viktor V over 5 years ago

Those hangs must be related to CPU C-states for sure. After 4 days of stable uptime, I changed the kernel parameters back to the defaults and rebooted my X220. It randomly hung with that glitch on the left side of the screen after just 8 hours of work.

The temporary fix on a Linux system is to run kernel with parameters "intel_idle.max_cstate=0 processor.max_cstate=1".

For example, on Debian I do:

# write the override as root (a plain > redirect would not run under sudo)
echo GRUB_CMDLINE_LINUX_DEFAULT=\"\$GRUB_CMDLINE_LINUX_DEFAULT intel_idle.max_cstate=0 processor.max_cstate=1\" | sudo tee /etc/default/grub.d/corebootfix.cfg
sudo update-grub

Hoping this information is useful.

Actions #32

Updated by Evgeny Zinoviev over 5 years ago

Viktor V wrote:

Those hangs must be related to CPU C-states for sure. After 4 days of stable uptime, I've changed back kernel parameters to default and rebooted my X220. It randomly hanged with that glitch on the left side of the screen after just 8 hours of work.

The temporary fix on a Linux system is to run kernel with parameters "intel_idle.max_cstate=0 processor.max_cstate=1".

For example, on Debian I do:

# write the override as root (a plain > redirect would not run under sudo)
echo GRUB_CMDLINE_LINUX_DEFAULT=\"\$GRUB_CMDLINE_LINUX_DEFAULT intel_idle.max_cstate=0 processor.max_cstate=1\" | sudo tee /etc/default/grub.d/corebootfix.cfg
sudo update-grub

Hoping this information is useful.

Nice! Thank you very much. After months of hangs we finally understand something.

Actions #33

Updated by Martin Zwicknagl over 5 years ago

Hello all,

I can confirm that
intel_idle.max_cstate=0 processor.max_cstate=1
seems to fix the problem.

I also tried:
intel_idle.max_cstate=1 processor.max_cstate=2
The T520 is running for more than three days now, without freezes.

Hope this helps.

Actions #34

Updated by Evgeny Zinoviev over 5 years ago

Martin Zwicknagl wrote:

Hello all,

I can confirm that
intel_idle.max_cstate=0 processor.max_cstate=1
seems to fix the problem.

I also tried:
intel_idle.max_cstate=1 processor.max_cstate=2
The T520 is running for more than three days now, without freezes.

Hope this helps.

Do you mean that intel_idle.max_cstate=1 processor.max_cstate=2 is also stable?

Actions #35

Updated by Nico Huber over 5 years ago

AFAIK, intel_idle and ACPI processor are two independent drivers. Does this mean you tested both? if not, please always mention which one was effective, cf. cat /sys/devices/system/cpu/cpuidle/current_driver. Otherwise, the information "processor.max_cstate=2 works", for instance, may be very misleading if the processor driver wasn't used at all.
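When reporting results, the active driver and the limits that were actually passed to the kernel can be captured together. A minimal sketch; `cpuidle_report` is a hypothetical helper name, and the two optional arguments exist only so the function can be pointed at copies of the files.

```shell
#!/bin/sh
# Print the active cpuidle driver and any max_cstate options on the
# kernel command line.
# $1 = cpuidle sysfs directory (optional)
# $2 = kernel command line file (optional)
cpuidle_report() {
    cpuidle_dir=${1:-/sys/devices/system/cpu/cpuidle}
    cmdline_file=${2:-/proc/cmdline}
    echo "driver: $(cat "$cpuidle_dir/current_driver")"
    # One option per line, then keep only the two max_cstate options.
    tr ' ' '\n' < "$cmdline_file" |
        grep -E '^(intel_idle|processor)\.max_cstate='
}

# Usage: cpuidle_report
```

If the driver line says intel_idle, only the intel_idle.max_cstate value says anything about the test that was run.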

Actions #36

Updated by Martin Zwicknagl over 5 years ago

Nico Huber wrote:

AFAIK, intel_idle and ACPI processor are two independent drivers. Does this mean you tested both? if not, please always mention which one was effective, cf. cat /sys/devices/system/cpu/cpuidle/current_driver. Otherwise, the information "processor.max_cstate=2 works", for instance, may be very misleading if the processor driver wasn't used at all.

Oops, I was not aware of the difference. cat /sys/devices/system/cpu/cpuidle/current_driver shows intel_idle, so I think I have tested intel_idle.max_cstate=1.

Actions #37

Updated by Martin Zwicknagl over 5 years ago

Hello,

I want to report that the laptop does NOT freeze with
intel_idle.max_cstate=1, intel_idle.max_cstate=2 and intel_idle.max_cstate=3.

With
intel_idle.max_cstate=4, intel_idle.max_cstate=5 and intel_idle.max_cstate=6
it freezes.
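To relate these numbers to actual idle states, it may help to list what the cpuidle driver exposes on the machine, since the value given to max_cstate may not coincide with the hardware C-state name of the same number. A minimal sketch that reads the standard cpuidle sysfs files; `list_cstates` is a hypothetical helper name.

```shell
#!/bin/sh
# List the idle states the kernel exposes for one CPU, with their names
# and how often each state has been entered.
# $1 = CPU sysfs directory (optional, defaults to cpu0)
list_cstates() {
    cpu_dir=${1:-/sys/devices/system/cpu/cpu0}
    for s in "$cpu_dir"/cpuidle/state[0-9]*; do
        [ -d "$s" ] || continue   # skip if the glob matched nothing
        printf '%s %s usage=%s\n' "${s##*/}" \
            "$(cat "$s/name")" "$(cat "$s/usage")"
    done
}

# Usage: list_cstates
```

Posting this output next to the max_cstate value that was tested makes results from different kernels easier to compare.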

Actions #38

Updated by Evgeny Zinoviev over 5 years ago

My X220 just hung with intel_idle.max_cstate=3 :(

Actions #39

Updated by Alexander Wetzel over 5 years ago

I have been using coreboot for roughly six months on a ThinkPad W530 (i7-3820QM, K2000M and 24GB of RAM, with ME neutered) and have what looks like the same issue.
I did have a custom modification to coreboot, but yesterday I built and flashed fad9536edf without it and already had a few of the freezes. After reproducing the freezes without the mod, I installed it again. (It is based on https://review.coreboot.org/c/coreboot/+/28380; I just fixed a rather serious error in the DSDT so that Windows boots with it.)

So I have those freezes with or without this mod, regardless of whether I set hybrid_graphics_mode to integrated, discrete or dual mode. (Using the discrete card seems to freeze the system more often, but that may also just have been bad luck.)
The freezes always happen with a load close to idle. While I have had a few while booting up the system, it normally occurs when putting the system aside for a short moment after some light browsing or text file editing. But the idle system can also just sit there for hours without hitting it. I get the impression that either putting it aside or picking it up again has a chance of triggering the bug, and I needed quite some time to accept that it's probably not the movement itself. (I got a new PSU, since the old one sometimes stopped charging the notebook on movement. The new PSU fixed the stop/resume charging issue, which was a broken cable in the old PSU, but not the freezes.)

When it freezes, it's always the same: the screen freezes, and any LEDs that normally flash stay either lit or unlit. So far I have not had any screen corruption, though.
(Sound is normally muted, so I can't say whether there are audio artefacts.)

But I have also an additional symptom after switching to coreboot which could be linked to the problem and if so could be very helpful for debugging it:

I'm also using Gentoo, and sometimes there are some painful software updates that keep the CPU at 100% for hours.
Sometimes (less frequently with more recent coreboot versions) during such a big update the CPUs stop using the max speed (around 3491 MHz) and are stuck at a much lower speed (I think it was around 2 GHz). All cores are still at 100% load, but the CPU power is reduced, resulting in drastically longer compile times.
I tried some months ago to figure out why, but there was nothing in the logs and the CPU governor still reported the normal limits. For some undetermined reason the CPUs just did not use the higher frequencies until I rebooted. Some time later I figured out how to fix the stuck CPU frequency without rebooting: suspending the system to RAM and resuming it. (Which is basically a CPU reboot after all.)
Since the system is still fully operational when I hit this bug, I can execute basically anything. Is there anything I should gather when I get my system into that state next time?
Unfortunately I get into that state much less often than the freezes... But I guess I could try forcing the issue and let the system sit in a corner recompiling dev-qt/qtwebengine, the package most likely to trigger the bug for me.

Noteworthy here is that with more recent coreboot versions I hit the CPU throttling bug much less frequently. Maybe once in the last two months, whereas some time back I got it within maybe 30 min of compiling packages. Normally it takes quite some time (>1h?) of 100% CPU to trigger this bug. I have had quite a few big updates recently that did not trigger it, (un)fortunately.

But with time I'm sure I can trigger it again, either accidentally or forced. If you have suggestions on what to do when I get into that state next, I'll do that on top of what I can think of myself. (Which is not much, to be honest. Still pretty new to coreboot...)
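One thing that could be recorded while the system is in the slow state is what the kernel itself reports for each CPU. A minimal sketch that reads the standard cpufreq sysfs files (values are in kHz); `freq_snapshot` is a hypothetical helper name, and this only shows the kernel's view, not why the hardware is limited.

```shell
#!/bin/sh
# Print current frequency, the governor's upper limit and the hardware
# maximum for every CPU that has a cpufreq directory.
# $1 = CPU sysfs directory (optional, defaults to the real one)
freq_snapshot() {
    cpu_root=${1:-/sys/devices/system/cpu}
    for c in "$cpu_root"/cpu[0-9]*; do
        [ -r "$c/cpufreq/scaling_cur_freq" ] || continue
        printf '%s cur=%s max=%s hw_max=%s\n' "${c##*/}" \
            "$(cat "$c/cpufreq/scaling_cur_freq")" \
            "$(cat "$c/cpufreq/scaling_max_freq")" \
            "$(cat "$c/cpufreq/cpuinfo_max_freq")"
    done
}

# Usage: freq_snapshot > stuck-state.txt
```

Taking one snapshot in the normal state and one in the stuck state shows whether the limit is visible to the kernel at all or is imposed below it.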

Actions #40

Updated by Evgeny Zinoviev over 5 years ago

Hello, Alexander.

That's sad. Until now I believed this bug at least affected only the xx20 ThinkPad series. By the way, I have been using a corebooted W530 too (i7-3720QM, then i7-3940XM, 32GB RAM, neutered ME) for over a year and have never had a single crash or freeze.

just fixed an rather serious error in DSDT so windows boots with it

Can you upload a fix somewhere? I'll update the patch on Gerrit.

The freezes always happen with a load close to idle
Now when it freezes it's always the same: The screen freezes, any LED's which normally may flash are staying either lit or unlit. So far I did not had any screen corruption, though.

This is also what I see on the X220. The crash is more likely to happen when idle. Sometimes there is video corruption, sometimes it just gets stuck.

Sometimes - less frequent in more current coreboot versions - when having such a big update the CPUs stop using the max speed (around 3491 MHz) and are stuck at a much lower speed. (I think it was around 2 GHz). All cores are still working 100% but the CPU power reduced, resulting in drastically longer compile times.

Two suggestions.

  1. CPU is throttling because the temperature is too high. Not likely.
  2. I know how to reproduce a similar frequency drop: just plug in a lower-power adapter, not the huge 170W brick that comes with the W530, but, for example, a 90W or 65W one. The CPU frequency will immediately drop to ~1200 MHz, and the only way I know to fix this is to suspend/resume or reboot. But sometimes this happens to my W530 with the original 170W brick too, just as you say, maybe once in two months or so. I just never bothered debugging it.

Please post your lspci and cat /proc/cpuinfo | grep stepping output (I want to compare hardware revisions with mine). I'm collecting information about affected and non-affected machines, maybe I'll see some pattern, idk.

Actions #41

Updated by Alexander Wetzel over 5 years ago

That's sad. Until this moment I believed this bug affects at least only xx20 ThinkPad series. By the way I use corebooted W530 too (i7-3720QM, then i7-3940XM, 32GB RAM, neutered ME) for over a year and never ever had a single crash or freeze.

Really a strange bug...

just fixed an rather serious error in DSDT so windows boots with it

Can you upload a fix somewhere? I'll update the patch on Gerrit.

I was planning to work on that a bit more; this is basically only a forward-ported version of my very first shot at coreboot patching, without caring about other platforms...
The idea was to polish it prior to contacting you :-)... That said, here is what I have: https://www.awhome.eu/index.php/s/GBfFb2Et768cQWM
Since that is highly off-topic, I've added the comments for that to the patch.

The freezes always happen with a load close to idle
Now when it freezes it's always the same: The screen freezes, any LED's which normally may flash are staying either lit or unlit. So far I did not had any screen corruption, though.

This is also what I see on X220. The crash is more likely to happen when idle. Sometimes there is video corruption, sometimes it just stucks.

Sometimes - less frequent in more current coreboot versions - when having such a big update the CPUs stop using the max speed (around 3491 MHz) and are stuck at a much lower speed. (I think it was around 2 GHz). All cores are still working 100% but the CPU power reduced, resulting in drastically longer compile times.

Two suggestions.

  1. CPU is throttling because the temperature is too high. Not likely.

Correct. I'm 100% sure it's not that. (Had that in the past and it DID cause log entries.)

  2. I know how to reproduce a similar frequency drop, just put lower power adapter, not this huge 170W brick that comes with W530, but for example 90W one or 65W one. The CPU frequency will immediately drop to ~1200 MHz and the only way I know to fix this is to perform suspend/resume or reboot. But sometimes this happens to my W530 with original 170W brick, just as you say, maybe once in two months or so. I just didn't really bother debugging this.

Some months ago I was wondering if I had to flash back to the official BIOS. But it has gotten much less frequent and is now only an itch.
Now I'm wondering whether it is linked to the bug... Maybe we do something wrong at setup which can either crash the CPU when idle or just break whatever mechanism Linux uses to tell the CPU to switch the frequency. That's a very thin link, and it may well turn out to be something unrelated. But since I have no idea how we can debug the freeze, I hope that poking at that may turn up something...

Please post your lspci and cat /proc/cpuinfo | grep stepping output (I want to compare hardware revisions with mine). I'm collecting information about affected and non-affected machines, maybe I'll see some pattern, idk.

$ lspci
00:00.0 Host bridge: Intel Corporation 3rd Gen Core processor DRAM Controller (rev 09)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor PCI Express Root Port (rev 09)
00:02.0 VGA compatible controller: Intel Corporation 3rd Gen Core processor Graphics Controller (rev 09)
00:04.0 Signal processing controller: Intel Corporation 3rd Gen Core Processor Thermal Subsystem (rev 09)
00:14.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB xHCI Host Controller (rev 04)
00:16.0 Communication controller: Intel Corporation 7 Series/C216 Chipset Family MEI Controller #1 (rev 04)
00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (Lewisville) (rev 04)
00:1a.0 USB controller: Intel Corporation 7 Series/C216 Chipset Family USB Enhanced Host Controller #2 (rev 04)
00:1b.0 Audio device: Intel Corporation 7 Series/C216 Chipset Family High Definition Audio Controller (rev 04)
00:1c.0 PCI bridge: Intel Corporation 7 Series/C216 Chipset Family PCI Express Root Port 1 (rev c4)
00:1c.1 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 2 (rev c4)
00:1c.2 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 3 (rev c4)
00:1d.0 USB controller: Intel Corporation 7 Series/C216 Chipset Family USB Enhanced Host Controller #1 (rev 04)
00:1f.0 ISA bridge: Intel Corporation QM77 Express Chipset LPC Controller (rev 04)
00:1f.2 SATA controller: Intel Corporation 7 Series Chipset Family 6-port SATA Controller [AHCI mode] (rev 04)
00:1f.3 SMBus: Intel Corporation 7 Series/C216 Chipset Family SMBus Controller (rev 04)
01:00.0 VGA compatible controller: NVIDIA Corporation GK107GLM [Quadro K2000M] (rev a1)
01:00.1 Audio device: NVIDIA Corporation GK107 HDMI Audio Controller (rev ff)
02:00.0 SD Host controller: Ricoh Co Ltd PCIe SDXC/MMC Host Controller (rev 08)
02:00.3 FireWire (IEEE 1394): Ricoh Co Ltd R5C832 PCIe IEEE 1394 Controller (rev 04)
03:00.0 Network controller: Intel Corporation Centrino Ultimate-N 6300 (rev 3e)

I'm mainly running the OS on a msata card but also have two HDDs installed. (Both normally powered but unused.)

$ cat /proc/cpuinfo | grep stepping
stepping : 9
stepping : 9
stepping : 9
stepping : 9
stepping : 9
stepping : 9
stepping : 9
stepping : 9

Actions #42

Updated by Evgeny Zinoviev over 5 years ago

Thanks. All revisions are the same as on my machine :(

Did you try limiting C-States? People say it helps (earlier in this topic). Might be worth checking.
It didn't help my X220 though. I made sure that the current driver is intel_idle, and it crashed after a couple of hours as usual with intel_idle.max_cstate=3.

Actions #43

Updated by Ryan Heyser over 5 years ago

Evgeny Zinoviev wrote:

Thanks. All revisions are the same as on my machine :(

Did you try limiting C-States? People say it helps (earlier in this topic). Might be worth checking.
It didn't help my X220 though. I made sure that the current driver is intel_idle, and it crashed after a couple of hours as usual with intel_idle.max_cstate=3.

It doesn't help my T420. After limiting C-states I've seen a significant drop in crashes, but still a few, with the same stepping as above. Note that I have a model with a discrete GPU.

Actions #44

Updated by Alexander Wetzel about 5 years ago

Evgeny Zinoviev wrote:

Did you try limiting C-States? People say it helps (earlier in this topic). Might be worth checking.
It didn't help my X220 though. I made sure that the current driver is intel_idle, and it crashed after a couple of hours as usual with intel_idle.max_cstate=3.

I think my freezes were caused by something else...
As mentioned, my freezes seem to be linked to physical movement of the device. I left the wires soldered to the debug connector in the system, so I just have to remove the keyboard and connect them to the flasher to recover the system from a potential brick. After you reported no problems with your W530, I placed these wires slightly differently, and since then I have had no new freeze. (I did not set any cstate kernel parameter.)
Of course it could also be linked to something in the Linux 5.3 kernel, but that seems less likely. (I'm closely tracking the wireless git kernel, and the last freeze was already with kernel 5.3.0-rc6-wt.)

I'll report back if the freezes return, but it looks like my report should be ignored for tracking down the bug handled here.

Actions #45

Updated by Andrey A. almost 5 years ago

Same problem on a T420 with an Ivy Bridge CPU (i5-3380M). Random, rare hard freezes and nothing in the log (Debian testing).

Actions #46

Updated by Martin Zwicknagl almost 5 years ago

Hello,

I can report, that I had NO freezes for a month now.

During a RAM upgrade to 16GB I have replaced one 2GB Samsung SODIMM (2GB 1Rx8 PC3-10600s-09-10-ZZZ, M471B5773CHS-CH9 1149)
with a SODIMM [Crucial CT102464BF160B 8GB memory (DDR3L, 1600 MT/s, PC3L-12800, SODIMM, 204-Pin)].
The laptop is now using two identical Crucial SODIMMs.

The problem is gone. I do not need any intel_idle.max_cstate tricks anymore. My T520 is using all cores, hyperthreading, all cstates (and turbo boost).

I think ticket 178 is also solved. Who can close 178?

Cheeers
Martin

Actions #47

Updated by Evgeny Zinoviev almost 5 years ago

That's very good! I have many ThinkPads though, some of them are affected and some are not, and using DIMM sticks from unaffected units in affected ones doesn't help them, but that might just mean that I don't have the right DIMMs.

I think ticket 178 is also solved. Who can close 178?

Closed it for you.

Actions #48

Updated by Andrey Korolyov almost 5 years ago

Could anyone still experiencing this issue please confirm that it continues to appear under the following conditions:

  • intel_pstate driver is either not compiled in or disabled via intel_pstate=off kernel commandline option,
  • the CPU governor is not set to powersave for acpi-cpufreq interface,
  • C-States are not limited by a command-line option like the one above, i.e. deep C-States are available to the system.
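
To make reports comparable, the three conditions above can be read back from sysfs. This is only a sketch: `report_pm_settings` is a hypothetical helper name, and the optional directory argument exists just so the function can be tried against a copy of the tree. The sysfs paths are the standard cpufreq/cpuidle ones.

```shell
#!/bin/sh
# Print the active cpufreq driver, the governor and the cpuidle driver.
# $1 (optional): sysfs CPU directory, defaults to the standard location.
report_pm_settings() {
    base=${1:-/sys/devices/system/cpu}
    for f in cpu0/cpufreq/scaling_driver \
             cpu0/cpufreq/scaling_governor \
             cpuidle/current_driver; do
        if [ -r "$base/$f" ]; then
            printf '%s: %s\n' "${f##*/}" "$(cat "$base/$f")"
        else
            printf '%s: (not available)\n' "${f##*/}"
        fi
    done
}

report_pm_settings
```

With intel_pstate=off one would expect scaling_driver to show something other than intel_pstate (typically acpi-cpufreq), but that depends on the kernel configuration.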

EDIT: works for me after disabling deep sleep states for the i915 via enable_rc6=0. The P-State/C-State settings seem to prevent this exact problem from appearing in some indirect way. At this moment, accumulated uptime with a 4.11-ish branch is five days without a single freeze. I hope this issue is the same as the one at the beginning of this topic.

Actions #49

Updated by Evgeny Zinoviev over 4 years ago

Andrey Korolyov wrote:

EDIT: worksforme after disabling deep sleep states for the i915 via enable_rc6=0, the P-State/C-State settings seems to prevent this exact problem to appear in some indirect way, at this moment accumulated uptime with 4.11-ish branch equals to five days without a single freeze, hope that this issue is the same as at the beginning of this topic

So I did a little digging and I see that the enable_rc6=0 option was removed in https://patchwork.kernel.org/patch/10027945/. I just tested a 5.7 kernel on the X220 with fresh coreboot and with Lenovo BIOS, and I see that the GEN6_RC_CONTROL register (0xa090) is set to the same value of 0x88060000 in both cases, which means that RC6 (bit 18) and even RC6p (bit 17) are enabled. This value is set by the kernel (coreboot sets it to 0x88040000, which enables RC6 but not RC6p).
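
For anyone comparing register dumps, the two bits mentioned above can be decoded with plain shell arithmetic. A minimal sketch: `decode_rc_control` is a hypothetical helper, and it interprets only bit 18 (RC6) and bit 17 (RC6p) as described in this comment, nothing else in the register.

```shell
#!/bin/sh
# Decode the RC6 enable bits of a GEN6_RC_CONTROL (0xa090) value.
# Bit positions are taken from the comment above: bit 18 = RC6, bit 17 = RC6p.
decode_rc_control() {
    val=$(( $1 ))
    rc6=$(( (val >> 18) & 1 ))
    rc6p=$(( (val >> 17) & 1 ))
    echo "RC6=$rc6 RC6p=$rc6p"
}

decode_rc_control 0x88060000   # value left by the 5.7 kernel: RC6=1 RC6p=1
decode_rc_control 0x88040000   # value programmed by coreboot: RC6=1 RC6p=0
```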

I cannot confirm nor deny yet that it keeps crashing only with coreboot and not with Lenovo BIOS, but, I guess, if the RC6 is the real cause, then there should be reports of non-coreboot x220 (or other machines like t420) users that recent kernels are crashing.

Actions #50

Updated by Evgeny Zinoviev over 4 years ago

Oh and by the way I also tried 4.4 kernel with Lenovo BIOS and 0xa090 was 0x88040000, which means only RC6 is enabled but not RC6p.

Actions #51

Updated by Evgeny Zinoviev over 4 years ago

Okay, it just crashed (with coreboot). So I patched the kernel to disable rc6p for gen6 (GEN6_FEATURES in i915_pci.c), let's see if it helps. If not, then I'll patch it again to also disable rc6 and try again.

Actions #53

Updated by Evgeny Zinoviev over 4 years ago

Evgeny Zinoviev wrote:

Everyone who's still suffering from this bug is suggested to apply and try out these patches:

https://review.coreboot.org/c/coreboot/+/42410
https://review.coreboot.org/c/coreboot/+/42447
https://review.coreboot.org/c/coreboot/+/42450
https://review.coreboot.org/c/coreboot/+/42455

5 days no crashes with these patches on X220.

Actions #54

Updated by Sebastian Band over 4 years ago

Evgeny Zinoviev wrote:

Evgeny Zinoviev wrote:

Everyone who's still suffering from this bug is suggested to apply and try out these patches:

https://review.coreboot.org/c/coreboot/+/42410
https://review.coreboot.org/c/coreboot/+/42447
https://review.coreboot.org/c/coreboot/+/42450
https://review.coreboot.org/c/coreboot/+/42455

5 days no crashes with these patches on X220.

Could you please be so kind as to post your config? I applied all the patches, but my X220 still freezes quite regularly with coreboot under idle conditions.
Lenovo BIOS runs fine for several days. I tried different RAM setups (different vendors, and single-channel vs. dual-channel setups) and the newest kernel, 5.7, with no success so far.

Actions #55

Updated by Evgeny Zinoviev over 4 years ago

Sebastian Band wrote:

Evgeny Zinoviev wrote:

Evgeny Zinoviev wrote:

Everyone who's still suffering from this bug is suggested to apply and try out these patches:

https://review.coreboot.org/c/coreboot/+/42410
https://review.coreboot.org/c/coreboot/+/42447
https://review.coreboot.org/c/coreboot/+/42450
https://review.coreboot.org/c/coreboot/+/42455

5 days no crashes with these patches on X220.

Could you please be so kind as to post your config? I applied all the patches, but my X220 still freezes quite regularly with coreboot under idle conditions.
Lenovo BIOS runs fine for several days. I tried different RAM setups (different vendors, and single-channel vs. dual-channel setups) and the newest kernel, 5.7, with no success so far.

I'll post when I can, but my config is really just the defaults plus enabled usbdebug, so nothing special.

Have you tried disabling RC6 in coreboot config with these patches?

Actions #56

Updated by Sebastian Band over 4 years ago

Evgeny Zinoviev wrote:

Sebastian Band wrote:

Evgeny Zinoviev wrote:

Evgeny Zinoviev wrote:

Everyone who's still suffering from this bug is suggested to apply and try out these patches:

https://review.coreboot.org/c/coreboot/+/42410
https://review.coreboot.org/c/coreboot/+/42447
https://review.coreboot.org/c/coreboot/+/42450
https://review.coreboot.org/c/coreboot/+/42455

5 days no crashes with these patches on X220.

Could you please be so kind as to post your config? I applied all the patches, but my X220 still freezes quite regularly with coreboot under idle conditions.
Lenovo BIOS runs fine for several days. I tried different RAM setups (different vendors, and single-channel vs. dual-channel setups) and the newest kernel, 5.7, with no success so far.

I'll post when I can, but my config is really just the defaults plus enabled usbdebug, so nothing special.

Have you tried disabling RC6 in coreboot config with these patches?

Yes, I tried with RC6 and RC6p enabled and disabled, with not much difference. Could it be related to the vgabios blob? With the vgabios included, the crash sometimes occurred earlier. But I've not found a setup that led to a stable machine. Have you still patched the linux kernel?

Actions #57

Updated by Evgeny Zinoviev over 4 years ago

Could it be related to the vgabios blob?

I don't know, unlikely...

Have you still patched the linux kernel?

Nope, I reverted my patches, and 5.7 runs just fine on my X220 with RC6 enabled. But since another person said that disabling RC6 helped them get rid of crashes, I'd suggest you try patching the kernel as well (look for GEN6_FEATURES in i915_pci.c) to disable it permanently. If it doesn't help, at least we'll rule this out.

Actions #58

Updated by Sebastian Band over 4 years ago

I tried a default configuration apart from the ME, ethernet and descriptor blobs from the Lenovo BIOS, and this coreboot image was really stable. I played around with different configurations, and the crashes seem to be related to CONFIG_USE_OPTION_TABLE. When this option is set, my X220 freezes overnight :(.
Let me know if I can help any further.
Thank you for your help.

Actions #59

Updated by Evgeny Zinoviev over 4 years ago

Sebastian Band wrote:

I tried a default configuration apart from the ME, ethernet and descriptor blobs from the Lenovo BIOS, and this coreboot image was really stable. I played around with different configurations, and the crashes seem to be related to CONFIG_USE_OPTION_TABLE. When this option is set, my X220 freezes overnight :(.
Let me know if I can help any further.
Thank you for your help.

Hm, how sure are you that this relation to the CMOS support isn't just a coincidence?

Actions #60

Updated by Sebastian Band over 4 years ago

6 days ago, when I made the post I would have said about 60% sure: without CONFIG_USE_OPTION_TABLE I had an uptime of 1.5-2 days twice, as soon as I enabled CMOS settings, the laptop froze after about 6-7h.
In the meantime I changed the RAM to 2x4GB Samsung and did a git update. After that the laptop was up for 2 days without CONFIG_USE_OPTION_TABLE. For the last 1.5 days the laptop has been running fine with CONFIG_USE_OPTION_TABLE enabled.
I'll enable my previous settings one by one, and give an update, if I can identify the cause of my problems, but maybe the RAM update did the trick.
Any idea if it will be safe to enable intel_pstate in the near future?

Actions #61

Updated by Sebastian Band over 4 years ago

Sorry it took me some time. I tested different configurations (with and without the above-mentioned patches, default settings, and adjustments I thought might help), compiled Linux with a modified i915_pci.c, set cpu_scaling_governor to performance, and switched to a different RAM vendor, but I was not able to get a really stable system. The system freezes after about 3 hours to 2 days. I was even able to get my hands on a second X220, with no success. For a friend I installed coreboot on his old X230, and it is really stable.
Any tips, ideas or suggestions?

Maybe I'll try intel_idle.max_cstate=2 again.

Actions #62

Updated by Zak Brighton Knight about 4 years ago

I recently installed Coreboot plus SeaBIOS on my T520 Ivy Bridge and I have been having similar issues. The crashes happen exceptionally more often when I am using VMs or if I am docked in the ThinkPad dock. I managed to stop the crashes (so far) by passing the intel_idle.max_cstate=3 kernel parameter. However, I want a better solution and so I am posting here to see if anyone has worked out the issue. I am happy to test patches and provide information to help progress this issue.

Actions #63

Updated by Viktor V about 4 years ago

Viktor V wrote:

Evgeny Zinoviev wrote:

Recent observations on X220.

Using most recent CPU microcode doesn't help.
Not using CPU microcode at all doesn't help.
Disabling HT with patch #29669 doesn't help.
Using mrc.bin instead of native raminit doesn't help.
Changing DIMMs doesn't help.
Using stock or neutered ME doesn't help.

Using OEM BIOS helps, of course, but that's not a solution.

I have exactly the same problem, my X220 randomly hangs with that weird glitch in the left side of the screen. My build settings are pretty much defaults with SeaBIOS and Intel ME disabled.

Using Debian with 2x4 Gb RAM and i5-2520M CPU.

By the way, I'm also from Russia. :)

Brief update:

I've updated Coreboot on my X220 to 4.12 and the issue is still present.

With these kernel parameters Laptop hangs randomly in 2-5 hours of basic usage, confirmed 2 times:

BOOT_IMAGE=/vmlinuz-4.19.0-11-amd64 root=/dev/mapper/debian--vg-root ro quiet intel_iommu=on

With these parameters there are no hangs, Laptop works stable:

BOOT_IMAGE=/vmlinuz-4.19.0-11-amd64 root=/dev/mapper/debian--vg-root ro quiet intel_iommu=on intel_idle.max_cstate=0 processor.max_cstate=1
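
When posting results it helps to state exactly which limits were active. The sketch below pulls a single parameter out of a kernel command line; `cmdline_param` is a hypothetical helper, and without a second argument it reads the running kernel's /proc/cmdline.

```shell
#!/bin/sh
# Print the value of a "name=value" kernel parameter, or nothing if it is
# absent. If the parameter occurs more than once, the last one is printed.
# $1: parameter name, $2 (optional): command line to search.
cmdline_param() {
    name=$1
    line=${2:-$(cat /proc/cmdline)}
    found=
    for word in $line; do
        case $word in
            "$name"=*) found=${word#*=} ;;
        esac
    done
    if [ -n "$found" ]; then echo "$found"; fi
}

# The stable command line quoted above:
stable="BOOT_IMAGE=/vmlinuz-4.19.0-11-amd64 root=/dev/mapper/debian--vg-root ro quiet intel_iommu=on intel_idle.max_cstate=0 processor.max_cstate=1"
cmdline_param intel_idle.max_cstate "$stable"   # prints 0
cmdline_param processor.max_cstate "$stable"    # prints 1
```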

Actions #64

Updated by Zak Brighton Knight about 4 years ago

I recently installed Coreboot plus SeaBIOS on my T520 Ivy Bridge and I have been having similar issues. The crashes happen exceptionally more often when I am using VMs or if I am docked in the ThinkPad dock. I managed to stop the crashes (so far) by passing the intel_idle.max_cstate=3 kernel parameter. However, I want a better solution and so I am posting here to see if anyone has worked out the issue. I am happy to test patches and provide information to help progress this issue.

As an update to the above, intel_idle.max_cstate=3 was not stable but intel_idle.max_cstate=2 was and I haven't had any crashes in about a month.

I think it's fairly clear from reading through these issues that a workaround is limiting intel_idle.max_cstate to 2 or below. If someone knows which parts of coreboot interact with the kernel code relevant to this kernel parameter, I am happy to start looking into the underlying cause of these crashes.

Actions #65

Updated by Sebastian Band about 4 years ago

Using version 4.12-4147-gfd9a8b679b with the above-mentioned patches (RC6p), the random freezes seem to be resolved: I had an uptime of 1.5 days with my patched kernel, and with the Void Linux default kernel I have an uptime of 1.5 days so far. I have not tested the dock yet; once I reach an uptime of 5 days, I'll give the dock a try. I just wanted to thank you for the great work on coreboot.

I hope this is at least a little bit helpful to someone.

BTW, I'm still using my X220 which suffered from random freezes.

Edit (17h later): Sorry, I spoke too early: two freezes in the last hours :(. I'll update and give it another try.

Actions #66

Updated by Zak Brighton Knight over 3 years ago

Has any progress been made on this? I am happy to help out testing to try and fix this

Actions #67

Updated by Daniel Kulesz over 3 years ago

I encountered the issue with coreboot 4.14 on a T520 (i5) as well when running a VM using kernel 5.10 on the host. Setting the kernel boot parameter

intel_idle.max_cstate=2

as suggested here seems to have helped. Yet, I will need to use the machine longer to fully confirm this.

Interestingly, I did not encounter the issue on a X220 (i7) without this parameter set even with heavy VM usage. Could it be that this is an issue that is more likely to occur on i5 cpus? Or are they just more common?

Actions #68

Updated by David Gebski about 3 years ago

I encountered frequent (within 12 hours) hangs on a T420s (i5-2540M) running Manjaro Linux. I was first running Devuan with OpenRC, which seemed to cause fewer hangs.

intel_idle.max_cstate=2

Seemed to have fixed the hangs for the last two weeks (not permanently running), but recently encountered the first hang. I also seem to have higher CPU temps now (~10 degrees more).

UPDATE: the recent first hang may have been caused by Manjaro/GNOME/mpv, as I also encountered it on my non-corebooted desktop running the same setup.

Actions #69

Updated by Nemanja Z about 3 years ago

Here is my experience:
T520, i7-3740QM (upgraded from i5-2520M), HD4000
Vendor: coreboot
Version: CBET4000 4.14-2082-gee760b4be8
Release Date: 09/30/2021
RAM: 2x4GB DDR3 1333 (Rendition/Crucial RM51264BC1339.16)

Using SeaBIOS + Sandy and IvyBridge VGA Bios, standard config for these machines otherwise.
I experienced the first freeze in Windows10 after leaving the machine unattended for about 2 hours.
The fan was running, leds were on but the display was black, laptop not responding.
The same thing happened in Debian11, after a day of moderate usage without any problems I again left the machine unattended for at least 2 hours
and found it unresponsive.
So the issue seems to be the C7 state which I then disabled and had no problems after that.

C States for these cpus:
C1 – Auto Halt
C1E – Auto halt, low frequency, low voltage
C3 – L1/L2 caches flush, clocks off
C6 – Save core states before shutdown and PLL off
C7 – C6 + LLC may be flushed

After setting the max state to 4:

root@t520:~# cat /sys/module/intel_idle/parameters/max_cstate
4

root@t520:~# grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
/sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
/sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1
/sys/devices/system/cpu/cpu0/cpuidle/state2/name:C1E
/sys/devices/system/cpu/cpu0/cpuidle/state3/name:C3
/sys/devices/system/cpu/cpu0/cpuidle/state4/name:C6
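
The same sysfs directory also exposes the exit latency the kernel assumes for each state and a per-state disable flag, which may be useful when comparing machines. A small sketch, with `list_cpuidle_states` as a hypothetical helper name; the optional directory argument is only there so it can be pointed at a copy.

```shell
#!/bin/sh
# List one CPU's idle states with exit latency (microseconds) and the
# per-state disable flag.
# $1 (optional): cpuidle directory, defaults to cpu0's.
list_cpuidle_states() {
    dir=${1:-/sys/devices/system/cpu/cpu0/cpuidle}
    for s in "$dir"/state*; do
        [ -r "$s/name" ] || continue
        printf '%s %s latency=%sus disable=%s\n' \
            "${s##*/}" "$(cat "$s/name")" \
            "$(cat "$s/latency")" "$(cat "$s/disable")"
    done
}

list_cpuidle_states
```

On kernels that support it, writing 1 to a state's `disable` file turns that state off at runtime, which allows testing a limit without rebooting.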

I left the machine unused for 4-5 hours and the display fired up as soon as I moved the mouse again. (I don't use sleep, just display off, at least on power)
I have still to measure the power consumption difference when disabling cstate=5 but it can't be that much worse I guess.
Because of this thread I almost gave up flashing my T520 but then it also helped me a lot.
So I am overall very happy with coreboot.

Actions #70

Updated by Andrey A. about 3 years ago

T420 and IVB cpu here again.
After 9 months with intel_idle.max_cstate=4 (and the latest kernel and coreboot) I have a fully stable system without a single freeze.

Actions #71

Updated by Anastasios Koutian over 1 year ago

Andrey A. wrote in #note-70:

T420 and IVB cpu here again.
After 9 months with intel_idle.max_cstate=4 (and the latest kernel and coreboot) I have a fully stable system without a single freeze.

I have a similar setup:

Mainboard: T420
CPU: i7-3940XM
RAM: Corsair Vengeance 16 GB DDR3 1600 MHz (CMSX16GX3M2A1600C10)
iGPU: Intel HD Graphics 4000
dGPU: NVIDIA Quadro NVS 4200M

Coreboot version is CBET4000 4.16-1069-gf4905da14c using MrChromebox's old "corebootpayload" branch as payload (which is now deprecated, I think).
Intel ME was stripped. Coreboot config is attached.

With intel_idle.max_cstate=2 the system is stable. Any higher value results in the freezes as described above (sometimes with the glitch on the left side of the screen, sometimes without).

It's weird how intel_idle.max_cstate=4 doesn't work for me. If anyone is still looking into this, I would be happy to help.

Actions #72

Updated by Evgeny Zinoviev over 1 year ago

On my X230 with Gentoo, the situation has worsened significantly with recent kernels. On 5.x kernels it was fairly stable (I had crashes maybe once a month or so, and maybe they weren't even coreboot-related), but after upgrading to 6.1.12-gentoo it freezes about twice a day.

This might be completely bug-121-unrelated, but limiting max cstate to c3 seems to fix the issue, so I decided to report here anyway... can anybody confirm the same? I mean that the newer kernel == more freezes/crashes

Actions #73

Updated by Anastasios Koutian about 1 year ago

The issue is resolved for me after doing the following:

a) Change motherboards from FRU 04W2049 (dGPU + iGPU) to FRU 04W2045 (iGPU).
b) Update coreboot to 4.21 with EDK II (defconfig is attached).

System has been stable for 6+ hours with no c-state limit. Previously it would freeze after 1-3 hours.
Everything else is the same as my previous post.
I recommend anyone still facing this issue to try the above.

Actions #74

Updated by Patrick Rudolph about 1 year ago

Does limiting the max C-state in MSR MSR_PKG_CST_CONFIG_CONTROL also work around the issue?
What setting is being used on vendor firmware for MSR MSR_PKG_CST_CONFIG_CONTROL?

Currently the code doesn't check if the processor supports C6/C3 sleep states, it just assumes it does.
According to the BWG, when bus masters, that cannot tolerate long bus master latency, are present, the BM_STS
avoidance must be used for C3/C6 states.

It sounds a bit like this could be the case here. The CPU is in C3/C6 and takes too long to wake in order to handle
the bus master request. If enabled the acpi_idle_bm_check() would then prevent the CPU from entering
a higher C-state (deeper sleep) when a bus master is active (similar to what intel_idle.max_cstate=2 does).

Actions #75

Updated by Patrick Rudolph about 1 year ago

I compared the vendor ACPI code for T520 and T530:

  • The T520 is missing _CST entries, thus the C-states are reported in FADT. I couldn't find a firmware dump that has a FADT.
  • The T530 has _CST, but the latencies are higher: for C3, 148 usec vs. 63 in coreboot. It's possible that advertised exit latencies that are too short cause issues with meeting deadlines within the kernel. The kernel will pick deeper C-states when it shouldn't, as it assumes that those wake much faster.

Actions #76

Updated by Anastasios Koutian about 1 year ago

Patrick Rudolph wrote in #note-75:

I compared the vendor ACPI code for T520 and T530:

  • The T520 is missing _CST entries, thus the C-states are reported in FADT. I couldn't find a firmware dump that has a FADT.
  • The T530 has _CST, but the latencies are higher: for C3, 148 usec vs. 63 in coreboot. It's possible that advertised exit latencies that are too short cause issues with meeting deadlines within the kernel. The kernel will pick deeper C-states when it shouldn't, as it assumes that those wake much faster.

Hi Patrick,

I have done some digging with MSRs but never looked into this particular one.
I have a spare T420 motherboard now and I could flash the vendor firmware back in, to examine the value.
If that would contribute to solving this issue, I'd be happy to do it.
Thank you for looking into this.

Actions #77

Updated by Patrick Rudolph about 1 year ago

Issue still occurs on ThinkPad X220 using https://review.coreboot.org/c/coreboot/+/78293 using Linux 6.5.6

Actions #78

Updated by Patrick Rudolph about 1 year ago

  • Category set to chipset configuration
  • Status changed from New to In Progress
  • Affected hardware set to SNB, IVY
  • Affected OS set to -

Limiting the package C-state to C2 using MSR MSR_PKG_CST_CONFIG_CONTROL seems to work around the issue.
With this workaround the CPU cores can still enter C7 and no cmdline fix in Linux is required.
It draws about 1-2 W more power in idle.
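
For those who want to try the package limit from a running system, the arithmetic can be sketched as below. Assumptions, none of them stated in this thread: MSR_PKG_CST_CONFIG_CONTROL is MSR 0xE2, bits 2:0 hold the package C-state limit with 1 meaning C2, and bit 15 is a configuration lock (this is how I read the Intel SDM tables for Sandy Bridge; please verify for your CPU). The helper names and the sample value 0x1e008407 are made up for illustration. The thread does not say whether the limit was applied inside coreboot or at runtime.

```shell
#!/bin/sh
# Assumed register address and layout, see the note above.
PKG_CST_MSR=0xe2

# Print the MSR value with the package C-state limit field (bits 2:0)
# replaced by $2. $1: current 64-bit value.
with_pkg_cstate_limit() {
    val=$(( $1 ))
    limit=$(( $2 & 7 ))
    printf '0x%x\n' $(( (val & ~7) | limit ))
}

# Succeed if the assumed lock bit (bit 15) is set in the given value.
pkg_cst_locked() {
    [ $(( ($1 >> 15) & 1 )) -eq 1 ]
}

# Pure arithmetic on a made-up sample value:
with_pkg_cstate_limit 0x1e008407 1   # prints 0x1e008401

# On real hardware (root, msr-tools and the msr kernel module required):
#   cur=0x$(rdmsr -p 0 "$PKG_CST_MSR")
#   pkg_cst_locked "$cur" && echo "MSR is locked, a write would be rejected"
#   wrmsr -a "$PKG_CST_MSR" "$(with_pkg_cstate_limit "$cur" 1)"
```

If the lock bit is set by the firmware, the limit can only be changed in coreboot itself before the lock is applied.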

The issue is thus related to the 'package C-state', not the 'CPU C-state'.
I was trying to figure out what happens on package C-states:

  • Package C2:
    • CPUs are in C3 or higher
    • bus traffic or device latency requirements keep it in C2
  • Package C3:
    • DRAM is in self refresh
    • Primary (IA Core power) and secondary plane (Graphics) VR are switched into light mode
  • Package C6:
    • CPU voltage is turned off
  • Package C7:
    • L3 cache is turned off
    • Chipset doesn't snoop

It is possible that the issue is related to DRAM / native raminit, since the DRAM is placed in self refresh when package C3 is entered.

Actions #79

Updated by Anastasios Koutian 12 months ago

After upgrading to coreboot 4.22.01, the issue seems to have returned for me.
Hardware is the same as above, however now there is no black screen and no graphical corruption. Instead the whole screen just freezes (along with the entire system).
Also, it happens only after 1-2 days whereas before it would appear within 1-2 hours.

I am now limiting package C-state to 2 using MSR_PKG_CST_CONFIG_CONTROL as you suggested. I can confirm ~0.7 - 1.0 W extra power usage when idle.
If you would like me to test anything else, please let me know and I would be happy to help.
If this issue could be eliminated, it would be great for these older laptops, since their battery life is already limited.

Actions #80

Updated by Patrick Rudolph 12 months ago

It's still not clear which subsystem to look at. Even with access to Intel confidential documents it's hard to tell what happens when package C-states are entered.
The fact that the CPU isn't running in deeper C-states doesn't make it easier to debug.

What could help:

  • Compare MSRs against vendor firmware
  • Compare power/idle related registers against vendor firmware
  • Disable PCI devices and test if the problem persists
  • Test on desktops. So far only mobile users have reported problems.
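
For the MSR comparison in the list above, a plain text dump per firmware plus a small diff is probably enough. A sketch under assumptions: the dump format (one "address value" pair per line) and the helper name `diff_msr_dumps` are made up here, and the register list in the comment is only an example.

```shell
#!/bin/sh
# Print every register whose value differs between two dumps.
# $1: dump taken on vendor firmware, $2: dump taken on coreboot.
# Values are compared as strings, so both dumps must use the same notation.
diff_msr_dumps() {
    awk 'NR == FNR { vendor[$1] = $2; next }
         ($1 in vendor) && (vendor[$1] "") != ($2 "") {
             printf "%s vendor=%s coreboot=%s\n", $1, vendor[$1], $2
         }' "$1" "$2"
}

# Creating a dump on real hardware (root, msr-tools, msr kernel module):
#   for a in 0xe2 0x601 0x602; do
#       echo "$a 0x$(rdmsr -p 0 "$a")"
#   done > coreboot.txt
# Then: diff_msr_dumps vendor.txt coreboot.txt
```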

Actions #81

Updated by Patrick Rudolph 11 months ago

It looks like this issue doesn't appear any more when using the MRC.bin instead of native raminit.
Thus it's likely that there's a difference between MRC and NRI that causes this issue.

Actions #82

Updated by Patrick Rudolph 11 months ago

It looks like returning PDM_APD_PPD in get_power_down_mode() also fixes the issue.
It matches the previous discoveries that the problem only appears when the package
is in higher C-states. get_power_down_mode() seems to be only used when the DRAM is idle/
the package is in high C-states.

Actions #83

Updated by Alex Gravitos 11 months ago

Patrick Rudolph wrote in #note-82:

It looks like returning PDM_APD_PPD in get_power_down_mode() also fixes the issue.

took me a while to figure out which return was supposed to be changed (its the first one).

however, for some reason doing that also kills ethernet - both on-board and in the dock. like, to the point of them not being seen by the system. an android phone connected as usb modem is detected fine, so i'd say changing these returns somehow impacts the gigabit ethernet blob.

sorry if i give not enough info, i am using coreboot for less than a week on my T520i, tell me if you need anything like logs or whatever

Actions #84

Updated by Alex Gravitos 11 months ago

seems my assumptions about it being the first return were not entirely correct, the laptop just froze on me during a system update (its arch, so no harm was actually done), i would like to see more details on the PDM_APD_PPD stuff

Actions #85

Updated by Patrick Rudolph 11 months ago

It might be possible that testing on my side was a false positive and I was just lucky that the issue didn't appear within time.
As it's unclear what the issue is about I'm just poking in the dark, comparing the reference code (MRC.bin) with coreboot's native code.

Anything that could help narrowing it down as described here: https://ticket.coreboot.org/issues/121#note-80 would help to fix it.

Actions #86

Updated by Anastasios Koutian 9 months ago

Martin Zwicknagl wrote in #note-46:

Hello,

I can report, that I had NO freezes for a month now.

During a RAM upgrade to 16GB I have replaced one 2GB Samsung SODIMM (2GB 1Rx8 PC3-10600s-09-10-ZZZ, M471B5773CHS-CH9 1149)
with a SODIMM [Crucial CT102464BF160B 8GB Speicher (DDR3L, 1600 MT/s, PC3L-12800, SODIMM, 204-Pin)].
The laptop is now using two identical Crucial SODIMMs.

The problem is gone. I do not need any intel_idle.max_cstate tricks anymore. My T520 is using all cores, hyperthreading, all cstates (and turbo boost).

I think ticket 178 is also solved. Who can close 178?

Cheeers
Martin

Hello all,

I installed the exact same SODIMMs (Crucial CT102464BF160B) as the message quoted above and achieved four days continuous uptime with no issues.
I eventually had to reboot due to a necessary update, so it might be a premature conclusion. I will keep the system running for as long as I can to confirm.

I haven't gotten around to comparing with MSRs on stock firmware yet, but will once I have spare time.

Actions #87

Updated by Anastasios Koutian 9 months ago

Anastasios Koutian wrote in #note-86:

Martin Zwicknagl wrote in #note-46:

Hello,

I can report, that I had NO freezes for a month now.

During a RAM upgrade to 16GB I have replaced one 2GB Samsung SODIMM (2GB 1Rx8 PC3-10600s-09-10-ZZZ, M471B5773CHS-CH9 1149)
with a SODIMM [Crucial CT102464BF160B 8GB Speicher (DDR3L, 1600 MT/s, PC3L-12800, SODIMM, 204-Pin)].
The laptop is now using two identical Crucial SODIMMs.

The problem is gone. I do not need any intel_idle.max_cstate tricks anymore. My T520 is using all cores, hyperthreading, all cstates (and turbo boost).

I think ticket 178 is also solved. Who can close 178?

Cheeers
Martin

Hello all,

I installed the exact same SODIMMs (Crucial CT102464BF160B) as the message quoted above and achieved four days continuous uptime with no issues.
I eventually had to reboot due to a necessary update, so it might be a premature conclusion. I will keep the system running for as long as I can to confirm.

I haven't gotten around to comparing with MSRs on stock firmware yet, but will once I have spare time.

After testing for two weeks, it became clear that the problem is still present, but with different symptoms: the system freezes for a short time and then reboots.
Previously, it would freeze indefinitely.

Actions #88

Updated by Patrick Rudolph 7 months ago

I made good progress with https://review.coreboot.org/c/coreboot/+/81597
It allows to configure the VR12-compatible regulator adjusting the CPU core voltage.
Especially the PSI state are from interest, since those are using in Package C3 or deeper.

My X220 has been stable for 9 hours while residing in Package C7, without freezes or shutdowns.

Since the VR12 configuration is mainboard-specific, the devicetree settings should not be copy-pasted from an existing board, but read from the vendor firmware's MSRs.
The values can be obtained from MSR 0x601 and MSR 0x602, for example using the rdmsr tool.
The devicetree values can be obtained using:

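#!/bin/bash
# Run as root while booted with the vendor firmware.
# rdmsr needs access to /dev/cpu/*/msr (e.g. after "modprobe msr").
# Prints devicetree register lines derived from MSR 0x601 (pp0) and MSR 0x602 (pp1).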
echo "register \"pp0_current_limit\" = \"$(($(rdmsr -d -f 12:0 0x601) / 8))\""
echo "register \"pp1_current_limit\" = \"$(($(rdmsr -d -f 12:0 0x602) / 8))\""

echo "register \"pp0_psi[VR12_PSI1]\" = \"{$(($(rdmsr -d -f 41:39 0x601) + 1)), $(rdmsr -d -f 38:32 0x601)}\""
echo "register \"pp0_psi[VR12_PSI2]\" = \"{$(($(rdmsr -d -f 51:49 0x601) + 1)), $(rdmsr -d -f 48:42 0x601)}\""
echo "register \"pp0_psi[VR12_PSI3]\" = \"{$(($(rdmsr -d -f 61:59 0x601) + 1)), $(rdmsr -d -f 58:52 0x601)}\""

echo "register \"pp1_psi[VR12_PSI1]\" = \"{$(($(rdmsr -d -f 41:39 0x602) + 1)), $(rdmsr -d -f 38:32 0x602)}\""
echo "register \"pp1_psi[VR12_PSI2]\" = \"{$(($(rdmsr -d -f 51:49 0x602) + 1)), $(rdmsr -d -f 48:42 0x602)}\""
echo "register \"pp1_psi[VR12_PSI3]\" = \"{$(($(rdmsr -d -f 61:59 0x602) + 1)), $(rdmsr -d -f 58:52 0x602)}\""
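
To sanity-check the bit-field arithmetic without root access or the rdmsr tool, the same extraction can be sketched in plain bash on a raw 64-bit value. This is only an illustration: the helper name msr_field and the sample value are made up, and the field layout is simply the one used by the script above.

```shell
#!/bin/bash
# Hypothetical helper: extract bits high:low from a 64-bit value,
# mirroring what `rdmsr -d -f high:low` prints for a live MSR.
msr_field() {
    local value=$1 high=$2 low=$3
    local width=$(( high - low + 1 ))
    echo $(( (value >> low) & ((1 << width) - 1) ))
}

# Made-up sample value, NOT read from real hardware:
# bits 12:0 = 896, bits 38:32 = 20, bits 41:39 = 1.
sample=0x0000009400000380

echo "register \"pp0_current_limit\" = \"$(( $(msr_field "$sample" 12 0) / 8 ))\""
echo "register \"pp0_psi[VR12_PSI1]\" = \"{$(( $(msr_field "$sample" 41 39) + 1 )), $(msr_field "$sample" 38 32)}\""
```

With the made-up sample, the first line yields a current limit of 112 and the second yields {2, 20}. Real values must come from the vendor firmware's MSRs as described above.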

Actions #89

Updated by Patrick Rudolph 7 months ago

I checked the vendor BIOS for the X220, T420 and T420s, and it hard-codes the PSI2 and PSI3 values to 0 in PowerManagment2.efi.
The X230 vendor BIOS does not hard-code those values in PowerManagment2.efi.
Thus this is likely a bug in the voltage regulator used on Sandy-Bridge platforms.

I've created https://review.coreboot.org/c/coreboot/+/82070 based on this, please test.

Actions #90

Updated by Anastasios Koutian 7 months ago

Patrick Rudolph wrote in #note-89:

I checked the vendor BIOS for (X220, T420 and T420s) and it hard-codes PSI2 and PSI3 values to 0 in PowerManagment2.efi.
The X230 vendor BIOS does not hard-code those values in PowerManagment2.efi.
Thus this is likely a bug in the voltage regulator used on Sandy-Bridge platforms.

I've created https://review.coreboot.org/c/coreboot/+/82070 based on this, please test.

Hi Patrick, I have cherry-picked your commit on top of coreboot main and I am testing on my T420. I will inform you of the results.

Actions #91

Updated by Jun Muta 7 months ago

Patrick Rudolph wrote in #note-89:

I checked the vendor BIOS for (X220, T420 and T420s) and it hard-codes PSI2 and PSI3 values to 0 in PowerManagment2.efi.
The X230 vendor BIOS does not hard-code those values in PowerManagment2.efi.
Thus this is likely a bug in the voltage regulator used on Sandy-Bridge platforms.

I've created https://review.coreboot.org/c/coreboot/+/82070 based on this, please test.

Hi Patrick,
I've recently started using Libreboot and I'm having this issue on my T430. Do you happen to know what the values are for the T430?

Actions #92

Updated by Patrick Rudolph 7 months ago

Jun Muta wrote in #note-91:

Patrick Rudolph wrote in #note-89:

I checked the vendor BIOS for (X220, T420 and T420s) and it hard-codes PSI2 and PSI3 values to 0 in PowerManagment2.efi.
The X230 vendor BIOS does not hard-code those values in PowerManagment2.efi.
Thus this is likely a bug in the voltage regulator used on Sandy-Bridge platforms.

I've created https://review.coreboot.org/c/coreboot/+/82070 based on this, please test.

Hi Patrick,
I've recently started using Libreboot and I'm having this issue on my T430. Do you happen to know what the values are for the T430?

I checked PowerManagment2.efi in the T430 vendor firmware, and it does not hard-code PSI2 and PSI3 to 0.
It's likely a different problem, not related to the VR12 configuration.

It looks like this only applies to the Sandy Bridge series: as reported here, it only affects the X220, T420, T520 and T420s.

Actions #93

Updated by Anastasios Koutian 7 months ago

Anastasios Koutian wrote in #note-90:

Patrick Rudolph wrote in #note-89:

I checked the vendor BIOS for (X220, T420 and T420s) and it hard-codes PSI2 and PSI3 values to 0 in PowerManagment2.efi.
The X230 vendor BIOS does not hard-code those values in PowerManagment2.efi.
Thus this is likely a bug in the voltage regulator used on Sandy-Bridge platforms.

I've created https://review.coreboot.org/c/coreboot/+/82070 based on this, please test.

Hi Patrick, I have cherry-picked your commit on top of coreboot main and I am testing on my T420. I will inform you of the results.

The system froze again after a couple of days. Unfortunately, this does not seem to have fixed the bug. However, being able to set the VR config in the devicetree is a very useful feature.

Actions #94

Updated by Patrick Rudolph 7 months ago

My X220 has been stable for more than 48 hours, which wasn't possible before, as it would crash within a couple of hours.
Since the issue doesn't appear any more, I must assume it's fixed on my device.
There might be other settings causing a similar problem, but they haven't shown up on my test system so far.

Actions #95

Updated by Anastasios Koutian 7 months ago

The freezes have been quite random for me, sometimes happening within minutes of booting, sometimes hours, and sometimes days.
It is possible that there are two separate issues that have the same symptom.
You mentioned that using mrc.bin seems to also fix the freezes. Could you provide instructions on how to do that so I can confirm?

Actions #96

Updated by Anastasios Koutian 6 months ago

I tried booting with kernel command line parameter idle=nomwait.
There were no freezes for 1 week, 5 days and 16 hours, which was not possible previously.
After this, I powered down and removed the parameter to see what happens. The system froze after 15 hours and 7 minutes.
Intel PCM is showing no negative impact on idle power consumption with idle=nomwait.
I recommend that people following this bug try this out and report whether it works for them.
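
A quick way to confirm that the parameter is actually in effect after a reboot is to look for it in the running kernel's command line. A minimal sketch, assuming a Linux system with /proc mounted; the function name has_kernel_param is made up for illustration:

```shell
#!/bin/bash
# Hypothetical helper: succeed if the exact parameter appears as a
# whitespace-separated word in the given kernel command line.
has_kernel_param() {
    local param=$1 cmdline=$2 word
    local -a words
    # read -ra splits on whitespace without glob expansion.
    read -ra words <<< "$cmdline"
    for word in "${words[@]}"; do
        [ "$word" = "$param" ] && return 0
    done
    return 1
}

if [ -r /proc/cmdline ]; then
    if has_kernel_param "idle=nomwait" "$(cat /proc/cmdline)"; then
        echo "idle=nomwait is active"
    else
        echo "idle=nomwait is NOT on the kernel command line"
    fi
fi
```

How the parameter gets onto the command line depends on the bootloader, so that step is left out here.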

Actions #97

Updated by Alex Gravitos 6 months ago

Anastasios Koutian wrote in #note-96:

I tried booting with kernel command line parameter idle=nomwait.
There were no freezes for 1 week, 5 days and 16 hours, which was not possible previously.
After this, I powered down and removed the parameter to see what happens. The system froze after 15 hours and 7 minutes.
Intel PCM is showing no negative impact on idle power consumption with idle=nomwait.
I recommend to people following this bug to try this out and confirm if it works for them.

Not in my case (t520i, i7-3840qm) - it's almost as if the freezes are more frequent with this option 🤷

Actions #98

Updated by Anastasios Koutian 6 months ago

Alex Gravitos wrote in #note-97:

Anastasios Koutian wrote in #note-96:

I tried booting with kernel command line parameter idle=nomwait.
There were no freezes for 1 week, 5 days and 16 hours, which was not possible previously.
After this, I powered down and removed the parameter to see what happens. The system froze after 15 hours and 7 minutes.
Intel PCM is showing no negative impact on idle power consumption with idle=nomwait.
I recommend to people following this bug to try this out and confirm if it works for them.

not in my case (t520i, i7-3840qm) - its almost as if the freezes are more frequent with this option 🤷

That's interesting. I have not seen freezes since adding this option, but it's possible that it's just a coincidence, since this bug is quite random.

Actions #99

Updated by Patrick Rudolph 6 months ago

I tried booting with kernel command line parameter idle=nomwait.

That should make the OS use the hlt instruction instead of mwait, which is similar to the package C1 state.
Tools such as powertop should show how much time the system spends in specific C-states.

I would expect it to draw a bit more power, but probably only about 1 W.
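
If powertop is not at hand, the Linux cpuidle sysfs interface exposes per-state residency counters as well. The sketch below is illustrative only: the function name print_cstate_residency is hypothetical, and it assumes the usual /sys/devices/system/cpu/cpu0/cpuidle layout. Note that these are the per-CPU idle states the kernel requested, not package C-state residency.

```shell
#!/bin/bash
# Hypothetical helper: print each idle state's name and the total time
# spent in it (microseconds), as reported by the cpuidle sysfs files.
print_cstate_residency() {
    local dir=${1:-/sys/devices/system/cpu/cpu0/cpuidle} state
    for state in "$dir"/state*; do
        # Skip silently if the directory or files are missing.
        [ -r "$state/name" ] && [ -r "$state/time" ] || continue
        printf '%s %s us\n' "$(cat "$state/name")" "$(cat "$state/time")"
    done
}

print_cstate_residency
```

Comparing the output with and without idle=nomwait should show which idle states are still being entered.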

not in my case (t520i, i7-3840qm) - its almost as if the freezes are more frequent with this option 🤷

I would also expect it not to freeze any more, since C1 states were reported to be working fine. Quote:

As an update to the above, intel_idle.max_cstate=3 was not stable but intel_idle.max_cstate=2 was and I haven't had any crashes in about a month.

Actions #100

Updated by Anastasios Koutian 4 months ago

I have been using this change: https://review.coreboot.org/c/coreboot/+/78609 since it was merged (May 23) and haven't seen any freezes.
It's possible that this solves the issue. Can others confirm?

Actions #101

Updated by Ján Mlynek 3 months ago

I updated to main a week ago, and after more than 100 hours of uptime I haven't experienced any hangs so far. This amount of uptime wasn't achievable before. I will report again in a few weeks if no hangs occur, but it looks promising.

T420, i7-3632QM

Actions #102

Updated by Nemanja Z 21 days ago

I also finally updated my T520 i7-3740QM to 24.08-631-gf8d4283e78d2 (from 4.14-2082-gee760b4be8).
No kernel options, C7 working now, no crashes in Debian 12, Windows 10 also seems fine.
