Bug #121

T520: Hangs in OS

Added by Julz Buckton over 2 years ago. Updated 30 days ago.

Status: New
Start date: 06/09/2017
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: -
Target version: -

Description

I have been running coreboot since 2017.04.15 and have experienced hangs ever since. People on IRC suggested that I run memtest to check whether incorrect raminit was causing errors; however, I have run memtest for 12 hours straight with no errors.

Due to the ambiguous nature of the hangs (immediate freeze with no warning signs, audio gets stuck repeating the last 50ms or so of noise, not sure what this effect is called) I don't have much useful information other than the .config and dmesg. However one thing I can say with high confidence is that the hangs occur significantly more frequently in Linux (*buntu distros) than Windows 10. Within an hour of launching Linux a hang is likely, whereas Windows typically runs for many hours before a hang occurs. I considered this an insignificant anecdotal anomaly at first but over the course of the nearly 2 months I have been running coreboot it seems to be a solid trend. The hangs occur anywhere, typically during mere desktop usage or basic web browsing.

Additionally, there is another form of hang where the screen goes black except for some sort of graphical corruption down the left side (http://i.imgur.com/4zWrlpX.jpg). I don't know whether this is related to the more common total-freeze hangs, but I figured I should include it nonetheless. These hangs occur only about once for every 20 regular hangs.

config (20.7 KB) Julz Buckton, 06/09/2017 06:21 AM

dmesg.txt (57.3 KB) Julz Buckton, 06/09/2017 06:21 AM

cbmem-raminit.txt (62 KB) Julz Buckton, 06/29/2017 11:58 PM

lspci.txt - sudo lspci -vv (29.6 KB) Viktor V, 06/29/2019 06:46 AM

cpuinfo.txt - cat /proc/cpuinfo (3.94 KB) Viktor V, 06/29/2019 06:46 AM

History

#1 Updated by Julz Buckton over 2 years ago

https://mail.coreboot.org/pipermail/coreboot/2016-September/082009.html

According to this entry on the mailing list someone else was getting the same issue on their T520. I have tried limiting the max mem speed to 666 in devicetree.cb as suggested in the link, however it did not fix the issue as expected since my RAM is only 1333 anyway. The second suggestion (limiting CPU p-state), I wouldn't know how to do.

#2 Updated by Nico Huber over 2 years ago

Does your T520 have a dedicated GPU or the integrated Intel GPU only?

#3 Updated by Julz Buckton over 2 years ago

Integrated only.

#4 Updated by Iru Cai over 2 years ago

What is the longest uptime before the system hangs in Linux?
How long can the system run before it hangs when you run some heavy loads (e.g. boinc) or do a lot of network transfer?

Also, I suggest you try revision 39937cc2fd28bcc754c0595f1327467499af40ea, in which the Lenovo T520 still uses the mrc.bin blob. I'm now running it for the first time and the system has been up for >5 hours. However, I don't know whether it will remain stable across future boots.

#5 Updated by Vasya Boytsov over 2 years ago

I have the same issue on a T420 with a 3632QM. I accidentally found out that my laptop ran for more than 2 days without any hangs while I was using the X220 kernel config, which had maxcpus set to 4. When I changed this value to 8 in the kernel config, the hangs came back. I don't remember whether maxcpus=7 behaved the same way or not.

#6 Updated by Julz Buckton over 2 years ago

Iru Cai wrote:

What is the longest uptime before the system hangs in Linux?
How long the system can run before it hangs when you run some heavy loads (e.g. boinc) or do a lot of network transfer?

Also, I suggest you try revision 39937cc2fd28bcc754c0595f1327467499af40ea in which Lenovo T520 is still using mrc.bin blob. I'm now running it the first time and the system has run for >5 hours. However, I don't know if it's still stable in the future boots.

I am lucky to get 1 hour of uptime in Linux. Heavy loads on Windows seem to prevent the hangs: I have run Linpack and some GPU benchmarks multiple times for 6+ hours at a time with no hang, and have never seen a hang during such programs. This doesn't seem to be the case on Linux, where I frequently get hangs during the crossgcc build stage of the coreboot build, which I assume keeps the CPU load high. Network activity does not seem to prevent the hangs either. In fact, the most common hang scenario for me now is when the laptop is left for some hours with only a torrent client running, in which case it usually hangs within 2 hours.

Vasya Boytsov wrote:

I have the same issue on t420 with 3632qm. And I accidentally found out that my laptop works more than 2 days without any hangs while I was using the x220 kernel config which had maxcpus set to 4. When I changed this value to 8 in the kernel config those hangs came back. I don't remember whether the maxcpus=7 worked the same way or not.

I am already using a 4-CPU chip though (i5-3320M). Perhaps I could try setting maxcpus=2 in the config.

#7 Updated by Iru Cai over 2 years ago

Julz Buckton wrote:

Iru Cai wrote:

What is the longest uptime before the system hangs in Linux?
How long the system can run before it hangs when you run some heavy loads (e.g. boinc) or do a lot of network transfer?

Also, I suggest you try revision 39937cc2fd28bcc754c0595f1327467499af40ea in which Lenovo T520 is still using mrc.bin blob. I'm now running it the first time and the system has run for >5 hours. However, I don't know if it's still stable in the future boots.

I am lucky to get 1 hour uptime in linux. Heavy loads on windows seem to prevent the hangs, I have run Linpack and some GPU benchmarks multiple times for 6+ hours at a time with no hang, and have never seen a hang during such programs. This doesn't seem to be the case on linux, where I frequently get hangs during the crossgcc build stage of the coreboot build, which I assume is running the CPU high. Network activity does not seem to prevent the hangs, furthermore the most common hang scenario for me now is when the laptop was left for some hours with only a torrent client running, where it is unlikely to not hang after 2 hours.

Have you tried mrc.bin yet, e.g. revision 39937cc?
I've tried this revision and the first revision that uses native RAM init, and it seems that native RAM init is the problem. I just don't know if mrc.bin supports Ivy Bridge yet.

Vasya Boytsov wrote:

I have the same issue on t420 with 3632qm. And I accidentally found out that my laptop works more than 2 days without any hangs while I was using the x220 kernel config which had maxcpus set to 4. When I changed this value to 8 in the kernel config those hangs came back. I don't remember whether the maxcpus=7 worked the same way or not.

I already using a 4 CPUs chip though (i5-3320M). Perhaps I could try setting maxcpus=2 in config.

#8 Updated by Iru Cai over 2 years ago

Vasya Boytsov wrote:

I have the same issue on t420 with 3632qm. And I accidentally found out that my laptop works more than 2 days without any hangs while I was using the x220 kernel config which had maxcpus set to 4. When I changed this value to 8 in the kernel config those hangs came back. I don't remember whether the maxcpus=7 worked the same way or not.

Linux kernel config?
As far as I remember, I haven't had any issues on an iGPU-only T420. My last working revision is 8bbd596de631adc8b677e69603e978b848eb1708.

#9 Updated by Vasya Boytsov over 2 years ago

Iru Cai wrote:

Vasya Boytsov wrote:

I have the same issue on t420 with 3632qm. And I accidentally found out that my laptop works more than 2 days without any hangs while I was using the x220 kernel config which had maxcpus set to 4. When I changed this value to 8 in the kernel config those hangs came back. I don't remember whether the maxcpus=7 worked the same way or not.

Linux kernel config?
I remember I haven't have any issue on an iGPU only T420. My last working revision is 8bbd596de631adc8b677e69603e978b848eb1708.

Yes, I changed this setting in the Linux kernel config and compiled the kernel, and it works flawlessly now. The last version I tested was between 4.5 and 4.6; I don't remember the exact revision. So the problem should be connected with native RAM init. I'll try earlier revisions later. How can I help with debugging this issue?

#10 Updated by Julz Buckton over 2 years ago

Iru Cai wrote:

Julz Buckton wrote:

Iru Cai wrote:

What is the longest uptime before the system hangs in Linux?
How long the system can run before it hangs when you run some heavy loads (e.g. boinc) or do a lot of network transfer?

Also, I suggest you try revision 39937cc2fd28bcc754c0595f1327467499af40ea in which Lenovo T520 is still using mrc.bin blob. I'm now running it the first time and the system has run for >5 hours. However, I don't know if it's still stable in the future boots.

I am lucky to get 1 hour uptime in linux. Heavy loads on windows seem to prevent the hangs, I have run Linpack and some GPU benchmarks multiple times for 6+ hours at a time with no hang, and have never seen a hang during such programs. This doesn't seem to be the case on linux, where I frequently get hangs during the crossgcc build stage of the coreboot build, which I assume is running the CPU high. Network activity does not seem to prevent the hangs, furthermore the most common hang scenario for me now is when the laptop was left for some hours with only a torrent client running, where it is unlikely to not hang after 2 hours.

Have you tried mrc.bin yet, e.g revision 39937cc?
I've tried this revision and the first revision that uses native ram init, and it seems that native ram init is the problem. I just don't know if mrc.bin supports ivy bridge yet.

You mean this version? https://review.coreboot.org/cgit/coreboot.git/commit/?id=39937cc2fd28bcc754c0595f1327467499af40ea

I will give it a try. Could native ram init really be the cause of the issue, even if I got no errors in memtest?

#11 Updated by Julz Buckton over 2 years ago

Tried coreboot revision 39937cc2fd28bcc754c0595f1327467499af40ea (with systemagent-r6.bin; I tried systemagent-ivybridge.bin first and got a brick) and got a hang within 30 seconds of booting into Linux. I guess that rules out RAM init as the cause of the hangs?

#12 Updated by Julz Buckton over 2 years ago

Here is cbmem output with verbose RAM init logging enabled, in case it is helpful.

#13 Updated by Julz Buckton over 2 years ago

I managed to get my hands on another chip, a Sandy Bridge (SNB) i3-2310M, and with the same config (with just the PCI ID for the VGA blob changed from 8086:0166 to 8086:0126), I get no hangs.

So it looks like the combination of the T520 mainboard and an Ivy Bridge chip is the cause of the hangs.
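
For anyone repeating this CPU swap, the vendor:device ID that changes between the two chips can be read from the running system instead of being looked up. This is only a sketch, assuming `lspci -nn` from pciutils is available and that the integrated GPU sits at slot 00:02.0, as in the lspci listing later in this ticket. The helper name `pci_id_from_lspci` is made up for illustration.

```sh
# Extract the "vendor:device" pair (e.g. 8086:0166) from one line of
# `lspci -nn` output read on stdin. The class code such as [0300] has no
# colon, so it does not match the pattern.
# (Hypothetical helper, not part of coreboot or pciutils.)
pci_id_from_lspci() {
    sed -n 's/.*\[\([0-9a-f]\{4\}:[0-9a-f]\{4\}\)].*/\1/p' | head -n 1
}
```

It would be used as `lspci -nn -s 00:02.0 | pci_id_from_lspci`, and the result compared with the ID set for the VGA blob in the coreboot config.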

#14 Updated by Iru Cai over 2 years ago

Julz Buckton wrote:

I managed to get my hands on another SNB chip (i3-2310M) and with the same config (with just PCI ID for vga blob changed from 8086:0166 to 8086:0126), I get no hangs.

So looks like T520 mainboard + Ivy Bridge chip is cause for hangs.

Maybe it is related to turbo boost, although the machine often hangs while idle?
I ask because the hang also happens when I use a Sandy Bridge dual- or quad-core processor.

#15 Updated by Patrick Rudolph almost 2 years ago

The vendor firmware dynamically limits the P-state depending on the attached power supply.
At the moment, coreboot doesn't take the attached PSU into account.

Example:
The battery charges at 45 W.
The CPU has a TDP of 45 W.
Idle power is 7 W.
Other components, including USB: 10 W?

That adds up to 107 W, so it would require a 135 W PSU, or limiting the CPU TDP / battery charge current to a smaller value.

What power rating does your PSU have?
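
The arithmetic in the example above can be checked with a few lines of shell. This is only a sketch of the budget as stated in this comment. The wattages are the ones listed above, and the list of adapter ratings (65, 90, 135, 170 W) is an assumption taken from the sizes mentioned elsewhere in this ticket. The function names are made up.

```sh
# Power budget from the example, in watts.
battery_charge_w=45
cpu_tdp_w=45
idle_w=7
other_w=10   # marked as a guess in the comment above

# Sum of all consumers.
total_power_w() {
    echo $(( battery_charge_w + cpu_tdp_w + idle_w + other_w ))
}

# Print the smallest adapter rating that covers the given load.
# The ratings are an assumption (sizes mentioned in this ticket).
smallest_adapter_w() {
    load_w="$1"
    for rating_w in 65 90 135 170; do
        if [ "$rating_w" -ge "$load_w" ]; then
            echo "$rating_w"
            return 0
        fi
    done
    return 1   # no listed adapter is large enough
}
```

`total_power_w` prints 107, and `smallest_adapter_w "$(total_power_w)"` prints 135, which matches the PSU size named above.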

#16 Updated by Seff Qin about 1 year ago

Tested v4.8.1 on a T420; this issue has not been fixed.

I got different information when running 'dmidecode -t 17':
Vendor BIOS: Total Width and Data Width are both 64 bits.
coreboot: Total Width is 16 bits and Data Width is 8 bits.

It seems that the RAM is not running at full speed.
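
To make this comparison repeatable between the vendor BIOS and coreboot, the two fields can be pulled out of the dmidecode output. This is a sketch only, and the helper name `dimm_widths` is made up. Note that these fields come from the SMBIOS tables the firmware publishes, so a difference may reflect how the table is filled in rather than how the memory actually runs. That is an assumption worth checking before drawing conclusions.

```sh
# Read `dmidecode -t 17` output on stdin and print one line per memory
# device with its Total Width and Data Width fields.
# (Hypothetical helper for comparing firmware builds.)
dimm_widths() {
    awk -F': *' '
        /^[ \t]*Total Width:/ { total = $2 }
        /^[ \t]*Data Width:/  { print "total=" total " data=" $2 }
    '
}
```

Usage would be `sudo dmidecode -t 17 | dimm_widths`, run once under each firmware.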

#17 Updated by Evgeny Zinoviev about 1 year ago

Having hangs on T520 + i5-2450M. Happened twice after ~1 min after booting debian (devuan). The interesting part is that it unfreezes after 4-5 minutes. I'm using two 4G Hynix RAM sticks, 8G in total. I'll see if maxcpus=2 helps.

#18 Updated by Evgeny Zinoviev about 1 year ago

Update: maxcpus=2 didn't help

#19 Updated by Nico Huber about 1 year ago

Evgeny Zinoviev wrote:

Update: maxcpus=2 didn't help

Please note that the original report was for an Ivy Bridge CPU in a T520 (probably caused by missing compatible ME firmware or whatnot). You seem to have a very different problem.

#20 Updated by Evgeny Zinoviev 8 months ago

Now I have an X220 with this bug. Yeah, I know that the original report is for an IVB CPU in a T520, but I've seen both symptoms and they are the same: (1) just a hang and (2) a black screen with a fluttering red line at the left, like in the photo from the last paragraph of this ticket's description.

It doesn't happen with the Lenovo BIOS. For now I suspect it's something RAM related (I just have no other ideas). I'm using 2x8 GB Patriot PSD38G16002S sticks. I'll try different sticks and see if it helps. What else can I do to debug this? At least I have hardware on which we can reproduce this; that's something for a start.

#21 Updated by Evgeny Zinoviev 5 months ago

Recent observations on X220.

Using most recent CPU microcode doesn't help.
Not using CPU microcode at all doesn't help.
Disabling HT with patch #29669 doesn't help.
Using mrc.bin instead of native raminit doesn't help.
Changing DIMMs doesn't help.
Using stock or neutered ME doesn't help.

Using OEM BIOS helps, of course, but that's not a solution.

#22 Updated by Evgeny Zinoviev 5 months ago

I also have a feeling that this happens more often when using virtualization (qemu/kvm). I'd say if I run virtual machines, the lockup is likely to happen within an hour or so.

#23 Updated by Viktor V 4 months ago

Evgeny Zinoviev wrote:

Recent observations on X220.

Using most recent CPU microcode doesn't help.
Not using CPU microcode at all doesn't help.
Disabling HT with patch #29669 doesn't help.
Using mrc.bin instead of native raminit doesn't help.
Changing DIMMs doesn't help.
Using stock or neutered ME doesn't help.

Using OEM BIOS helps, of course, but that's not a solution.

I have exactly the same problem, my X220 randomly hangs with that weird glitch in the left side of the screen. My build settings are pretty much defaults with SeaBIOS and Intel ME disabled.

Using Debian with 2x4 Gb RAM and i5-2520M CPU.

By the way, I'm also from Russia. :)

#24 Updated by Evgeny Zinoviev 4 months ago

Viktor V wrote:

I have exactly the same problem, my X220 randomly hangs with that weird glitch in the left side of the screen. My build settings are pretty much defaults with SeaBIOS and Intel ME disabled.

Using Debian with 2x4 Gb RAM and i5-2520M CPU.

By the way, I'm also from Russia. :)

I'm glad to hear I'm not the only one. Did you update Lenovo BIOS to the latest version before extracting ME and flashing coreboot?

We had a discussion about these hangs on #coreboot and came up with two ideas:

  1. Make sure we use most recent ME firmware.
  2. Collect revisions and stepping ids of the Intel chips in faulty machines and compare them to the working ones.

#25 Updated by Viktor V 4 months ago

Did you update Lenovo BIOS to the latest version before extracting ME and flashing coreboot?

Yes, I did. It was version 1.45, but version 1.46, released on June 26, 2019, is already available.

Collect revisions and stepping ids of the Intel chips in faulty machines and compare them to the working ones.

Can I help with providing this information? Not sure what revision and stepping id are, how can I see them in Debian? I've built coreboot 4.9 release.

I assumed that X220 is the most stable hardware for coreboot. Honestly, my very first thought was that this hang is caused by some kind of a failed BIOS exploit by some malware. (LOL I'm paranoid)

#26 Updated by Evgeny Zinoviev 4 months ago

Viktor V wrote:

Can I help with providing this information?

I hope so. Won't hurt anyway.

Not sure what revision and stepping id are, how can I see them in Debian?

I guess, lspci and cat /proc/cpuinfo
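
As a sketch of how the CPU side of this could be collected in a form that is easy to compare across machines: the helper below condenses /proc/cpuinfo into one line per distinct family/model/stepping combination. It assumes the x86 layout of /proc/cpuinfo (`cpu family`, `model`, `stepping` fields). The function name is made up.

```sh
# Summarise family/model/stepping from a cpuinfo-style file.
# Defaults to /proc/cpuinfo; a different path can be given for testing.
# The leading number in each output line is how many logical CPUs
# reported that combination.
cpu_revision_summary() {
    cpuinfo_file="${1:-/proc/cpuinfo}"
    awk -F': *' '
        /^cpu family[ \t]*:/ { family = $2 }
        /^model[ \t]*:/      { model = $2 }
        /^stepping[ \t]*:/   { print "family=" family " model=" model " stepping=" $2 }
    ' "$cpuinfo_file" | sort | uniq -c
}
```

Running `cpu_revision_summary` with no argument reads the live system. The chipset revisions would still have to come from lspci (the `rev` field on each line).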

I assumed that X220 is the most stable hardware for coreboot.

It is believed to be very stable. Actually, I used an X220 (another one) for a year and a half and never had a single crash or hang. This bug is quite rare: only some mainboards (or CPUs, or something) are affected and, at the moment, we have no idea why. This bug is known to occur only on SNB ThinkPads, so, in this sense, the X230 is probably more "stable".

Honestly, my very first thought was that this hang is caused by some kind of a failed BIOS exploit

Well, you have replaced your BIOS with coreboot, haven't you? ;)

Another idea: try disabling cstates and see if it helps. I was going to try it myself but I doubt I'll have time for it earlier than next week.

#27 Updated by Viktor V 4 months ago

Attaching lspci and cpuinfo outputs

#28 Updated by Viktor V 4 months ago

Evgeny Zinoviev wrote:

Another idea: try disabling cstates and see if it helps. I was going to try it myself but I doubt I'll have time for it earlier than next week.

Looks like it works! I've added "intel_idle.max_cstate=0 processor.max_cstate=1" kernel parameters and it runs for 2 days without hangs so far.

#29 Updated by Viktor V 4 months ago

Some strange things I've experienced while flashing this X220.

Every tutorial online says you can flash X220 with Raspberry Pi SPI interface, but I had no luck with it. Flashrom couldn't detect the chip, though it reads/writes fine with RPi on my other laptops. So I had to buy and use ch341a USB programmer (black version).

With ch341a Flashrom works fine, but it shows strange warnings while writing:

Found Macronix flash chip "MX25L6405" (8192 kB, SPI) on ch341a_spi.
Reading old flash chip contents... done.
Erasing and writing flash chip... FAILED at 0x00001000! Expected=0xff, Found=0xf0, failed byte count from 0x00000000-0x0000ffff: 0x1cf9
ERASE FAILED!
Reading current flash chip contents... done. Looking for another erase function.
Erase/write done.
Verifying flash... VERIFIED.

cbmem output says it has SF: Detected MX25L6405D with sector size 0x1000, total 0x800000

Edit: Right, sorry about that. Just trying to understand differences between this unstable X220 and other stable ones.

#30 Updated by Paul Menzel 4 months ago

Please contact the flashrom mailing list for the flashrom issue as it’s unrelated to the coreboot bug tracker and the issue at hand specifically.

#31 Updated by Viktor V 4 months ago

Those hangs must be related to CPU C-states for sure. After 4 days of stable uptime, I changed the kernel parameters back to default and rebooted my X220. It randomly hung with that glitch on the left side of the screen after just 8 hours of work.

The temporary fix on a Linux system is to run kernel with parameters "intel_idle.max_cstate=0 processor.max_cstate=1".

For example, on Debian I do:

echo GRUB_CMDLINE_LINUX_DEFAULT=\"\$GRUB_CMDLINE_LINUX_DEFAULT intel_idle.max_cstate=0 processor.max_cstate=1\" | sudo tee /etc/default/grub.d/corebootfix.cfg
sudo update-grub
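
After rebooting, it may be worth confirming that the parameters actually reached the kernel before counting uptime as evidence. This is a sketch with a made-up helper name. It only checks the kernel command line, read from /proc/cmdline by default.

```sh
# Succeed if the exact parameter (e.g. intel_idle.max_cstate=0) appears
# as a separate word on the kernel command line.
# Second argument is the file to read; defaults to /proc/cmdline.
cmdline_has_param() {
    param="$1"
    cmdline_file="${2:-/proc/cmdline}"
    # One word per line, then look for an exact, literal line match.
    tr ' ' '\n' < "$cmdline_file" | grep -Fxq -- "$param"
}
```

For example: `cmdline_has_param intel_idle.max_cstate=0 && echo "parameter active"`. On kernels that expose it, `cat /sys/module/intel_idle/parameters/max_cstate` should show the same value.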

Hoping this information is useful.

#32 Updated by Evgeny Zinoviev 4 months ago

Viktor V wrote:

Those hangs must be related to CPU C-states for sure. After 4 days of stable uptime, I've changed back kernel parameters to default and rebooted my X220. It randomly hanged with that glitch on the left side of the screen after just 8 hours of work.

The temporary fix on a Linux system is to run kernel with parameters "intel_idle.max_cstate=0 processor.max_cstate=1".

For example, on Debian I do:

echo GRUB_CMDLINE_LINUX_DEFAULT=\"\$GRUB_CMDLINE_LINUX_DEFAULT intel_idle.max_cstate=0 processor.max_cstate=1\" | sudo tee /etc/default/grub.d/corebootfix.cfg
sudo update-grub

Hoping this information is useful.

Nice! Thank you very much. After months of hangs we finally understand something.

#33 Updated by Martin Zwicknagl 4 months ago

Hello all,

I can confirm that
intel_idle.max_cstate=0 processor.max_cstate=1
seems to fix the problem.

I also tried:
intel_idle.max_cstate=1 processor.max_cstate=2
The T520 is running for more than three days now, without freezes.

Hope this helps.

#34 Updated by Evgeny Zinoviev 4 months ago

Martin Zwicknagl wrote:

Hello all,

I can confirm that
intel_idle.max_cstate=0 processor.max_cstate=1
seems to fix the problem.

I also tried:
intel_idle.max_cstate=1 processor.max_cstate=2
The T520 is running for more than three days now, without freezes.

Hope this helps.

Do you mean that intel_idle.max_cstate=1 processor.max_cstate=2 is also stable?

#35 Updated by Nico Huber 4 months ago

AFAIK, intel_idle and ACPI processor are two independent drivers. Does this mean you tested both? If not, please always mention which one was in effect, cf. cat /sys/devices/system/cpu/cpuidle/current_driver. Otherwise, the information "processor.max_cstate=2 works", for instance, may be very misleading if the processor driver wasn't used at all.
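
To make reports unambiguous, the active driver can be mapped to the parameter that actually governs it. This is a sketch, assuming the ACPI processor driver reports itself as `acpi_idle` in sysfs. The function name is made up. As far as I know, intel_idle.max_cstate=0 disables intel_idle altogether, so with the parameter pair quoted earlier the system would be running acpi_idle limited by processor.max_cstate.

```sh
# Map a cpuidle driver name to the kernel parameter that limits its
# C-states. Fails for anything else (e.g. "none").
effective_cstate_param() {
    case "$1" in
        intel_idle) echo "intel_idle.max_cstate" ;;
        acpi_idle)  echo "processor.max_cstate" ;;
        *)          echo "unknown"; return 1 ;;
    esac
}
```

Usage: `effective_cstate_param "$(cat /sys/devices/system/cpu/cpuidle/current_driver)"`. Including that line in a test report tells readers which of the two settings was really being tested.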

#36 Updated by Martin Zwicknagl 4 months ago

Nico Huber wrote:

AFAIK, intel_idle and ACPI processor are two independent drivers. Does this mean you tested both? if not, please always mention which one was effective, cf. cat /sys/devices/system/cpu/cpuidle/current_driver. Otherwise, the information "processor.max_cstate=2 works", for instance, may be very misleading if the processor driver wasn't used at all.

Oops, I was not aware of the difference. cat /sys/devices/system/cpu/cpuidle/current_driver shows intel_idle, so I think I have tested intel_idle.max_cstate=1.

#37 Updated by Martin Zwicknagl 3 months ago

Hello,

I want to tell you, that the Laptop does NOT freeze with
intel_idle.max_cstate=1, intel_idle.max_cstate=2 and intel_idle.max_cstate=3

with
intel_idle.max_cstate=4, intel_idle.max_cstate=5 and intel_idle.max_cstate=6
it freezes.
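
When comparing results like these between machines, it may help to record which idle states the kernel actually exposes, since the max_cstate number is not necessarily the same as the hardware C-state name. This is a sketch that reads the cpuidle sysfs directory of cpu0. The function name is made up and the directory can be overridden for testing.

```sh
# List the idle states exposed for one CPU: index, name and how often
# each state has been entered so far.
list_idle_states() {
    cpuidle_dir="${1:-/sys/devices/system/cpu/cpu0/cpuidle}"
    for state_dir in "$cpuidle_dir"/state*; do
        # Skip if the glob did not match or the entry is unreadable.
        [ -r "$state_dir/name" ] || continue
        printf '%s %s usage=%s\n' "${state_dir##*/}" \
            "$(cat "$state_dir/name")" \
            "$(cat "$state_dir/usage" 2>/dev/null)"
    done
}
```

Posting the output of `list_idle_states` together with the max_cstate value under test would show which named states were still reachable when a freeze did or did not occur.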

#38 Updated by Evgeny Zinoviev 2 months ago

My X220 just hung with intel_idle.max_cstate=3 :(

#39 Updated by Alexander Wetzel about 2 months ago

I have been using coreboot for roughly six months on a ThinkPad W530 (i7-3820QM, K2000M and 24GB of RAM with ME neutered) and have what looks like the same issue.
Now I did have a custom modification to coreboot, but I built and flashed fad9536edf yesterday without it and already had a few of the freezes. After reproducing the freezes without the mod, I've installed it again. (It is based on https://review.coreboot.org/c/coreboot/+/28380; I just fixed a rather serious error in the DSDT so Windows boots with it.)

So I have those freezes with or without this mod, regardless if I set hybrid_graphics_mode to integrated, discrete or dual mode. (Using the discrete card seems to freeze the system more often, but that may also just have been bad luck.)
The freezes always happen with a load close to idle. While I had a few when booting up the system, it normally occurs when putting the system aside for a short moment after some light browsing or text file editing. But I can also have the idle system just sitting there for hours without hitting it. I get the impression that either putting it aside or picking it up again has a chance of triggering the bug, and I needed quite some time to accept that it's probably not the movement itself. (I got a new PSU, since the old one sometimes stopped charging the notebook on movement. The new PSU fixed the stop/resume charging issue, which was caused by a broken cable in the old PSU, but not the freezes.)

Now when it freezes it's always the same: the screen freezes, and any LEDs which normally may flash stay either lit or unlit. So far I have not had any screen corruption, though.
(Sound is normally muted, so I can't say if there are audio artefacts.)

But I have also an additional symptom after switching to coreboot which could be linked to the problem and if so could be very helpful for debugging it:

I'm also using Gentoo and sometimes there are some painful software updates, keeping the CPU at 100% for hours.
Sometimes (less frequently in more current coreboot versions) during such a big update the CPUs stop using the max speed (around 3491 MHz) and get stuck at a much lower speed (I think it was around 2 GHz). All cores are still 100% busy but CPU performance is reduced, resulting in drastically longer compile times.
I tried some months ago to figure out why, but there was nothing in the logs and the CPU governor still reported the normal limits. For some undetermined reason the CPUs just did not use the higher frequencies until I rebooted. Some time later I figured out how to fix the stuck CPU frequency without rebooting: suspending the system to RAM and resuming it. (Which is basically a CPU reboot after all.)
Since the system is still fully operational when I hit this bug, I can execute basically anything. Is there anything I should gather when I get my system into that state next time?
Unfortunately I get into that state much less often than the freezes... But I guess I could try forcing the issue and let the system sit in a corner recompiling dev-qt/qtwebengine, the package most likely to trigger the bug for me.

Noteworthy here is that with more recent coreboot versions I hit the CPU throttling bug much less frequently: maybe once in the last two months, compared to within maybe 30 minutes of compiling packages some time back. Normally it takes quite some time (>1h?) of 100% CPU to trigger this bug. I have had quite a few big updates recently that did not trigger it, (un)fortunately.

But with time I'm sure I can trigger it again, either accidentally or on purpose. If you have suggestions for what to do when I get into that state next, I'll do that on top of what I can think of myself. (Which is not much, to be honest. I'm still pretty new to coreboot...)
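
One thing that could be captured while the system is in the stuck-frequency state is a per-CPU snapshot of the cpufreq values, taken once when stuck and once after a suspend/resume for comparison. This is only a sketch of such a snapshot, not a known diagnostic for this bug. It assumes the standard cpufreq sysfs files are present. The function name is made up.

```sh
# Print current, policy-maximum and hardware-maximum frequency (kHz)
# for every CPU that has a cpufreq directory.
# The sysfs root can be overridden for testing.
freq_snapshot() {
    cpu_root="${1:-/sys/devices/system/cpu}"
    for freq_dir in "$cpu_root"/cpu[0-9]*/cpufreq; do
        [ -d "$freq_dir" ] || continue
        cpu_name="${freq_dir%/cpufreq}"
        cpu_name="${cpu_name##*/}"
        printf '%s cur=%s max=%s hw_max=%s\n' "$cpu_name" \
            "$(cat "$freq_dir/scaling_cur_freq")" \
            "$(cat "$freq_dir/scaling_max_freq")" \
            "$(cat "$freq_dir/cpuinfo_max_freq")"
    done
}
```

If `max` still equals `hw_max` while `cur` stays low under full load, that would agree with the observation above that the governor reports normal limits, and would suggest the limit is being applied somewhere below the cpufreq policy.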

#40 Updated by Evgeny Zinoviev about 2 months ago

Hello, Alexander.

That's sad. Until this moment I believed this bug affects at least only xx20 ThinkPad series. By the way I use corebooted W530 too (i7-3720QM, then i7-3940XM, 32GB RAM, neutered ME) for over a year and never ever had a single crash or freeze.

just fixed an rather serious error in DSDT so windows boots with it

Can you upload a fix somewhere? I'll update the patch on Gerrit.

The freezes always happen with a load close to idle
Now when it freezes it's always the same: The screen freezes, any LED's which normally may flash are staying either lit or unlit. So far I did not had any screen corruption, though.

This is also what I see on the X220. The crash is more likely to happen when idle. Sometimes there is video corruption, sometimes it just gets stuck.

Sometimes - less frequent in more current coreboot versions - when having such a big update the CPUs stop using the max speed (around 3491 MHz) and are stuck at a much lower speed. (I think it was around 2 GHz). All cores are still working 100% but the CPU power reduced, resulting in drastically longer compile times.

Two suggestions.
1. CPU is throttling because the temperature is too high. Not likely.
2. I know how to reproduce a similar frequency drop, just put lower power adapter, not this huge 170W brick that comes with W530, but for example 90W one or 65W one. The CPU frequency will immediately drop to ~1200 MHz and the only way I know to fix this is to perform suspend/resume or reboot. But sometimes this happens to my W530 with original 170W brick, just as you say, maybe once in two months or so. I just didn't really bother debugging this.

Please post your lspci and cat /proc/cpuinfo | grep stepping output (I want to compare hardware revisions with mine). I'm collecting information about affected and non-affected machines, maybe I'll see some pattern, idk.

#41 Updated by Alexander Wetzel about 2 months ago

That's sad. Until this moment I believed this bug affects at least only xx20 ThinkPad series. By the way I use corebooted W530 too (i7-3720QM, then i7-3940XM, 32GB RAM, neutered ME) for over a year and never ever had a single crash or freeze.

Really a strange bug...

just fixed an rather serious error in DSDT so windows boots with it

Can you upload a fix somewhere? I'll update the patch on Gerrit.

I was planning to work on that a bit more, this is basically only a forward ported version of my very first shot at coreboot patching without caring about other platforms...
The idea was to polish it prior to contacting you :-)... That said, here is what I have: https://www.awhome.eu/index.php/s/GBfFb2Et768cQWM
Since that is highly off-topic I've added the comments for that to the patch.

The freezes always happen with a load close to idle
Now when it freezes it's always the same: The screen freezes, any LED's which normally may flash are staying either lit or unlit. So far I did not had any screen corruption, though.

This is also what I see on X220. The crash is more likely to happen when idle. Sometimes there is video corruption, sometimes it just stucks.

Sometimes - less frequent in more current coreboot versions - when having such a big update the CPUs stop using the max speed (around 3491 MHz) and are stuck at a much lower speed. (I think it was around 2 GHz). All cores are still working 100% but the CPU power reduced, resulting in drastically longer compile times.

Two suggestions.
1. CPU is throttling because the temperature is too high. Not likely.

Correct. I'm 100% sure it's not that. (Had that in the past and it DID cause log entries.)

2. I know how to reproduce a similar frequency drop, just put lower power adapter, not this huge 170W brick that comes with W530, but for example 90W one or 65W one. The CPU frequency will immediately drop to ~1200 MHz and the only way I know to fix this is to perform suspend/resume or reboot. But sometimes this happens to my W530 with original 170W brick, just as you say, maybe once in two months or so. I just didn't really bother debugging this.

Some months ago I was wondering if I had to flash back to the official BIOS. But it has gotten much less frequent and is now only an itch.
Now I'm wondering whether it isn't linked to the bug... Maybe we do something wrong at setup which can crash either the CPU when idle or just whatever mechanism Linux uses to tell the CPU to switch the frequency. That's a very thin link, and it may well turn out to be something unrelated. But since I have no idea how we can debug the freeze, I hope that poking at this may turn up something...

Please post your lspci and cat /proc/cpuinfo | grep stepping output (I want to compare hardware revisions with mine). I'm collecting information about affected and non-affected machines, maybe I'll see some pattern, idk.

$ lspci
00:00.0 Host bridge: Intel Corporation 3rd Gen Core processor DRAM Controller (rev 09)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor PCI Express Root Port (rev 09)
00:02.0 VGA compatible controller: Intel Corporation 3rd Gen Core processor Graphics Controller (rev 09)
00:04.0 Signal processing controller: Intel Corporation 3rd Gen Core Processor Thermal Subsystem (rev 09)
00:14.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB xHCI Host Controller (rev 04)
00:16.0 Communication controller: Intel Corporation 7 Series/C216 Chipset Family MEI Controller #1 (rev 04)
00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (Lewisville) (rev 04)
00:1a.0 USB controller: Intel Corporation 7 Series/C216 Chipset Family USB Enhanced Host Controller #2 (rev 04)
00:1b.0 Audio device: Intel Corporation 7 Series/C216 Chipset Family High Definition Audio Controller (rev 04)
00:1c.0 PCI bridge: Intel Corporation 7 Series/C216 Chipset Family PCI Express Root Port 1 (rev c4)
00:1c.1 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 2 (rev c4)
00:1c.2 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 3 (rev c4)
00:1d.0 USB controller: Intel Corporation 7 Series/C216 Chipset Family USB Enhanced Host Controller #1 (rev 04)
00:1f.0 ISA bridge: Intel Corporation QM77 Express Chipset LPC Controller (rev 04)
00:1f.2 SATA controller: Intel Corporation 7 Series Chipset Family 6-port SATA Controller [AHCI mode] (rev 04)
00:1f.3 SMBus: Intel Corporation 7 Series/C216 Chipset Family SMBus Controller (rev 04)
01:00.0 VGA compatible controller: NVIDIA Corporation GK107GLM [Quadro K2000M] (rev a1)
01:00.1 Audio device: NVIDIA Corporation GK107 HDMI Audio Controller (rev ff)
02:00.0 SD Host controller: Ricoh Co Ltd PCIe SDXC/MMC Host Controller (rev 08)
02:00.3 FireWire (IEEE 1394): Ricoh Co Ltd R5C832 PCIe IEEE 1394 Controller (rev 04)
03:00.0 Network controller: Intel Corporation Centrino Ultimate-N 6300 (rev 3e)

I'm mainly running the OS on a msata card but also have two HDDs installed. (Both normally powered but unused.)

$ cat /proc/cpuinfo | grep stepping
stepping : 9
stepping : 9
stepping : 9
stepping : 9
stepping : 9
stepping : 9
stepping : 9
stepping : 9

#42 Updated by Evgeny Zinoviev about 2 months ago

Thanks. All revisions are the same as on my machine :(

Did you try limiting C-States? People say it helps (earlier in this topic). Might be worth checking.
It didn't help my X220, though. I made sure that the current driver is intel_idle, and it crashed after a couple of hours as usual with intel_idle.max_cstate=3.

#43 Updated by Ryan Heyser about 2 months ago

Evgeny Zinoviev wrote:

Thanks. All revisions are the same as on my machine :(

Did you try limiting C-States? People say it helps (earlier in this topic). Might be worth checking.
Didn't help mine X220 though. I made sure that current driver is intel_idle and it crashed after a couple of hours as usual with intel_idle.max_cstate=3.

It doesn't fully help my T420. Although I've seen a significant drop in crashes after limiting C-states, I've still had a few, with the same stepping as above. To note, I have a model with a discrete GPU.

#44 Updated by Alexander Wetzel 30 days ago

Evgeny Zinoviev wrote:

Did you try limiting C-States? People say it helps (earlier in this topic). Might be worth checking.
Didn't help mine X220 though. I made sure that current driver is intel_idle and it crashed after a couple of hours as usual with intel_idle.max_cstate=3.

I think my freezes were caused by something else...
As mentioned, my freezes seem to be linked to physical movement of the device. I left the wires soldered to the debug connector in the system, so I just have to remove the keyboard and connect them to the flasher to recover the system from a potential brick. After you reported no problems with your W530, I placed these wires slightly differently, and since then I have had no new freeze. (I did not set any cstate kernel parameter.)
Of course it could also be linked to something in the Linux 5.3 kernel, but that seems less likely. (I'm closely tracking the wireless git kernel, and the last freeze was already with kernel 5.3.0-rc6-wt.)

I'll report back if the freezes come back, but it looks like my report here should be ignored for tracking down the bug handled here.
