Bug #121

T520: Hangs in OS

Added by Julz Buckton about 2 years ago. Updated 9 days ago.

Status: New
Start date: 06/09/2017
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: -
Target version: -

Description

I have been running coreboot since 2017.04.15 and have experienced hangs ever since then. It was suggested by folk on the IRC that I run memtest to check for incorrect raminit causing errors, however I have run memtest for 12 hours straight with no errors.

Due to the ambiguous nature of the hangs (immediate freeze with no warning signs, audio gets stuck repeating the last 50ms or so of noise, not sure what this effect is called) I don't have much useful information other than the .config and dmesg. However one thing I can say with high confidence is that the hangs occur significantly more frequently in Linux (*buntu distros) than Windows 10. Within an hour of launching Linux a hang is likely, whereas Windows typically runs for many hours before a hang occurs. I considered this an insignificant anecdotal anomaly at first but over the course of the nearly 2 months I have been running coreboot it seems to be a solid trend. The hangs occur anywhere, typically during mere desktop usage or basic web browsing.

Additionally, there is another form of hang I experience where the screen goes black except for some sort of graphical corruption down the left side (http://i.imgur.com/4zWrlpX.jpg). I don't know whether this is related to the more common total-freeze hangs, but I figured I should include it nonetheless. These hangs occur only about once for every 20 regular hangs.

config (20.7 KB) Julz Buckton, 06/09/2017 06:21 AM

dmesg.txt (57.3 KB) Julz Buckton, 06/09/2017 06:21 AM

cbmem-raminit.txt (62 KB) Julz Buckton, 06/29/2017 11:58 PM

lspci.txt - sudo lspci -vv (29.6 KB) Viktor V, 06/29/2019 06:46 AM

cpuinfo.txt - cat /proc/cpuinfo (3.94 KB) Viktor V, 06/29/2019 06:46 AM

History

#1 Updated by Julz Buckton about 2 years ago

https://mail.coreboot.org/pipermail/coreboot/2016-September/082009.html

According to this entry on the mailing list, someone else was getting the same issue on their T520. I have tried limiting the max mem speed to 666 in devicetree.cb as suggested in the link. As expected, it did not fix the issue, since my RAM is only 1333 anyway. The second suggestion (limiting the CPU P-state) I wouldn't know how to do.

#2 Updated by Nico Huber about 2 years ago

Does your T520 have a dedicated GPU or the integrated Intel GPU only?

#3 Updated by Julz Buckton about 2 years ago

Integrated only.

#4 Updated by Iru Cai about 2 years ago

What is the longest uptime before the system hangs in Linux?
How long can the system run before it hangs when you run some heavy loads (e.g. boinc) or do a lot of network transfer?

Also, I suggest you try revision 39937cc2fd28bcc754c0595f1327467499af40ea, in which the Lenovo T520 still uses the mrc.bin blob. I'm now running it for the first time and the system has run for >5 hours. However, I don't know if it will stay stable across future boots.

#5 Updated by Vasya Boytsov about 2 years ago

I have the same issue on t420 with 3632qm. And I accidentally found out that my laptop works more than 2 days without any hangs while I was using the x220 kernel config which had maxcpus set to 4. When I changed this value to 8 in the kernel config those hangs came back. I don't remember whether the maxcpus=7 worked the same way or not.

#6 Updated by Julz Buckton about 2 years ago

Iru Cai wrote:

What is the longest uptime before the system hangs in Linux?
How long the system can run before it hangs when you run some heavy loads (e.g. boinc) or do a lot of network transfer?

Also, I suggest you try revision 39937cc2fd28bcc754c0595f1327467499af40ea in which Lenovo T520 is still using mrc.bin blob. I'm now running it the first time and the system has run for >5 hours. However, I don't know if it's still stable in the future boots.

I am lucky to get 1 hour of uptime in Linux. Heavy loads on Windows seem to prevent the hangs: I have run Linpack and some GPU benchmarks multiple times for 6+ hours at a time with no hang, and have never seen a hang during such programs. This doesn't seem to be the case on Linux, where I frequently get hangs during the crossgcc build stage of the coreboot build, which I assume loads the CPU heavily. Network activity does not seem to prevent the hangs either. In fact, the most common hang scenario for me now is when the laptop is left for some hours with only a torrent client running, where it is likely to hang within 2 hours.

Vasya Boytsov wrote:

I have the same issue on t420 with 3632qm. And I accidentally found out that my laptop works more than 2 days without any hangs while I was using the x220 kernel config which had maxcpus set to 4. When I changed this value to 8 in the kernel config those hangs came back. I don't remember whether the maxcpus=7 worked the same way or not.

I am already using a 4-CPU chip though (i5-3320M). Perhaps I could try setting maxcpus=2 in the config.

#7 Updated by Iru Cai about 2 years ago

Julz Buckton wrote:

Iru Cai wrote:

What is the longest uptime before the system hangs in Linux?
How long the system can run before it hangs when you run some heavy loads (e.g. boinc) or do a lot of network transfer?

Also, I suggest you try revision 39937cc2fd28bcc754c0595f1327467499af40ea in which Lenovo T520 is still using mrc.bin blob. I'm now running it the first time and the system has run for >5 hours. However, I don't know if it's still stable in the future boots.

I am lucky to get 1 hour uptime in linux. Heavy loads on windows seem to prevent the hangs, I have run Linpack and some GPU benchmarks multiple times for 6+ hours at a time with no hang, and have never seen a hang during such programs. This doesn't seem to be the case on linux, where I frequently get hangs during the crossgcc build stage of the coreboot build, which I assume is running the CPU high. Network activity does not seem to prevent the hangs, furthermore the most common hang scenario for me now is when the laptop was left for some hours with only a torrent client running, where it is unlikely to not hang after 2 hours.

Have you tried mrc.bin yet, e.g. revision 39937cc?
I've tried this revision and the first revision that uses native RAM init, and it seems that native RAM init is the problem. I just don't know if mrc.bin supports Ivy Bridge yet.

Vasya Boytsov wrote:

I have the same issue on t420 with 3632qm. And I accidentally found out that my laptop works more than 2 days without any hangs while I was using the x220 kernel config which had maxcpus set to 4. When I changed this value to 8 in the kernel config those hangs came back. I don't remember whether the maxcpus=7 worked the same way or not.

I already using a 4 CPUs chip though (i5-3320M). Perhaps I could try setting maxcpus=2 in config.

#8 Updated by Iru Cai about 2 years ago

Vasya Boytsov wrote:

I have the same issue on t420 with 3632qm. And I accidentally found out that my laptop works more than 2 days without any hangs while I was using the x220 kernel config which had maxcpus set to 4. When I changed this value to 8 in the kernel config those hangs came back. I don't remember whether the maxcpus=7 worked the same way or not.

Linux kernel config?
As far as I remember, I haven't had any issues on an iGPU-only T420. My last working revision is 8bbd596de631adc8b677e69603e978b848eb1708.

#9 Updated by Vasya Boytsov about 2 years ago

Iru Cai wrote:

Vasya Boytsov wrote:

I have the same issue on t420 with 3632qm. And I accidentally found out that my laptop works more than 2 days without any hangs while I was using the x220 kernel config which had maxcpus set to 4. When I changed this value to 8 in the kernel config those hangs came back. I don't remember whether the maxcpus=7 worked the same way or not.

Linux kernel config?
I remember I haven't have any issue on an iGPU only T420. My last working revision is 8bbd596de631adc8b677e69603e978b848eb1708.

Yes, I changed this setting in the Linux kernel config and compiled the kernel, and it works flawlessly now. I last tested with a kernel between 4.5 and 4.6; I don't remember the exact revision. So the problem should be connected with native RAM init. I'll try earlier revisions later. How can I help with debugging this issue?

#10 Updated by Julz Buckton about 2 years ago

Iru Cai wrote:

Julz Buckton wrote:

Iru Cai wrote:

What is the longest uptime before the system hangs in Linux?
How long the system can run before it hangs when you run some heavy loads (e.g. boinc) or do a lot of network transfer?

Also, I suggest you try revision 39937cc2fd28bcc754c0595f1327467499af40ea in which Lenovo T520 is still using mrc.bin blob. I'm now running it the first time and the system has run for >5 hours. However, I don't know if it's still stable in the future boots.

I am lucky to get 1 hour uptime in linux. Heavy loads on windows seem to prevent the hangs, I have run Linpack and some GPU benchmarks multiple times for 6+ hours at a time with no hang, and have never seen a hang during such programs. This doesn't seem to be the case on linux, where I frequently get hangs during the crossgcc build stage of the coreboot build, which I assume is running the CPU high. Network activity does not seem to prevent the hangs, furthermore the most common hang scenario for me now is when the laptop was left for some hours with only a torrent client running, where it is unlikely to not hang after 2 hours.

Have you tried mrc.bin yet, e.g revision 39937cc?
I've tried this revision and the first revision that uses native ram init, and it seems that native ram init is the problem. I just don't know if mrc.bin supports ivy bridge yet.

You mean this version? https://review.coreboot.org/cgit/coreboot.git/commit/?id=39937cc2fd28bcc754c0595f1327467499af40ea

I will give it a try. Could native ram init really be the cause of the issue, even if I got no errors in memtest?

#11 Updated by Julz Buckton about 2 years ago

Tried coreboot revision 39937cc2fd28bcc754c0595f1327467499af40ea (with systemagent-r6.bin; I tried systemagent-ivybridge.bin first and got a brick) and got a hang within 30 seconds of booting into Linux. I guess that rules out RAM init being the cause of the hangs?

#12 Updated by Julz Buckton about 2 years ago

Here is cbmem output with verbose RAM init logging enabled, in case it is helpful.

#13 Updated by Julz Buckton about 2 years ago

I managed to get my hands on another SNB chip (i3-2310M) and with the same config (with just PCI ID for vga blob changed from 8086:0166 to 8086:0126), I get no hangs.

So it looks like the combination of the T520 mainboard and an Ivy Bridge chip is the cause of the hangs.

#14 Updated by Iru Cai about 2 years ago

Julz Buckton wrote:

I managed to get my hands on another SNB chip (i3-2310M) and with the same config (with just PCI ID for vga blob changed from 8086:0166 to 8086:0126), I get no hangs.

So looks like T520 mainboard + Ivy Bridge chip is cause for hangs.

Maybe it is related to turbo boost, although the machine often hangs while idle.
I suggest this because the system hang also happens for me with Sandy Bridge dual-core and quad-core processors.

#15 Updated by Patrick Rudolph over 1 year ago

The vendor firmware dynamically limits the P-state depending on the attached power supply.
At the moment, coreboot doesn't take the attached PSU into account.

Example:
The battery charges at 45 W.
The CPU has a TDP of 45 W.
Idle power is 7 W.
Other components, including USB: 10 W?

That adds up to about 107 W, so it would require a 135 W PSU, or limiting the CPU TDP / battery charge current to a smaller value.

What power rating does your PSU have?
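
The power budget above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the four draw figures come from the comment, while the list of PSU ratings is an assumption added for illustration and should be replaced with the adapters actually on hand.

```python
# Power figures quoted in the comment above, in watts.
budget_w = {
    "battery charging": 45,
    "CPU TDP": 45,
    "idle power": 7,
    "other components incl. USB (estimate)": 10,
}

# Assumption for illustration: PSU ratings to compare against.
ASSUMED_PSU_RATINGS_W = [65, 90, 135, 170]


def total_draw(budget):
    """Sum the worst-case draw of all consumers."""
    return sum(budget.values())


def smallest_sufficient_psu(draw_w, ratings_w):
    """Return the smallest rating that covers the draw, or None if none does."""
    candidates = [r for r in sorted(ratings_w) if r >= draw_w]
    return candidates[0] if candidates else None


total = total_draw(budget_w)
print(f"worst-case draw: {total} W")
print(f"smallest sufficient PSU: {smallest_sufficient_psu(total, ASSUMED_PSU_RATINGS_W)} W")
```

With a smaller adapter, the same sum shows how much the CPU TDP or the charge current would have to be reduced to stay within the rating.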

#16 Updated by Seff Qin 12 months ago

Tested v4.8.1 on a T420; this issue has not been fixed.

I got different information from 'dmidecode -t 17':
Vendor BIOS: Total Width and Data Width are both 64 bits.
Coreboot: Total Width is 16 bits and Data Width is 8 bits.

It seems that the RAM is not running at full speed.

#17 Updated by Evgeny Zinoviev 10 months ago

Having hangs on T520 + i5-2450M. Happened twice after ~1 min after booting debian (devuan). The interesting part is that it unfreezes after 4-5 minutes. I'm using two 4G Hynix RAM sticks, 8G in total. I'll see if maxcpus=2 helps.

#18 Updated by Evgeny Zinoviev 10 months ago

Update: maxcpus=2 didn't help

#19 Updated by Nico Huber 10 months ago

Evgeny Zinoviev wrote:

Update: maxcpus=2 didn't help

Please note that the original report was for an Ivy Bridge CPU in a T520 (probably caused by missing compatible ME firmware or whatnot). You seem to have a very different problem.

#20 Updated by Evgeny Zinoviev 5 months ago

Now I have an X220 with this bug. Yes, I know that the original report is for an IVB CPU in a T520, but I've seen both symptoms and they are the same: (1) a plain hang and (2) a black screen with a fluttering red line on the left, like in the photo linked in the last paragraph of the ticket description.

It doesn't happen with the Lenovo BIOS. For now I suspect it's something RAM related (I just have no other ideas). I'm using 2x8 GB Patriot PSD38G16002S sticks. I'll try different sticks and see if it helps. What else can I do to debug this? At least I have hardware on which we can reproduce this, which is something to start with.

#21 Updated by Evgeny Zinoviev about 2 months ago

Recent observations on X220.

Using most recent CPU microcode doesn't help.
Not using CPU microcode at all doesn't help.
Disabling HT with patch #29669 doesn't help.
Using mrc.bin instead of native raminit doesn't help.
Changing DIMMs doesn't help.
Using stock or neutered ME doesn't help.

Using OEM BIOS helps, of course, but that's not a solution.

#22 Updated by Evgeny Zinoviev about 2 months ago

I also have a feeling that this happens more often when using virtualization (qemu/kvm). I'd say that if I run virtual machines, the lockup is likely to happen within an hour or so.

#23 Updated by Viktor V 25 days ago

Evgeny Zinoviev wrote:

Recent observations on X220.

Using most recent CPU microcode doesn't help.
Not using CPU microcode at all doesn't help.
Disabling HT with patch #29669 doesn't help.
Using mrc.bin instead of native raminit doesn't help.
Changing DIMMs doesn't help.
Using stock or neutered ME doesn't help.

Using OEM BIOS helps, of course, but that's not a solution.

I have exactly the same problem: my X220 randomly hangs with that weird glitch on the left side of the screen. My build settings are pretty much the defaults, with SeaBIOS and Intel ME disabled.

Using Debian with 2x4 GB RAM and an i5-2520M CPU.

By the way, I'm also from Russia. :)

#24 Updated by Evgeny Zinoviev 25 days ago

Viktor V wrote:

I have exactly the same problem, my X220 randomly hangs with that weird glitch in the left side of the screen. My build settings are pretty much defaults with SeaBIOS and Intel ME disabled.

Using Debian with 2x4 Gb RAM and i5-2520M CPU.

By the way, I'm also from Russia. :)

I'm glad to hear I'm not the only one. Did you update Lenovo BIOS to the latest version before extracting ME and flashing coreboot?

We had a discussion about these hangs on #coreboot and came up with two ideas:

  1. Make sure we use most recent ME firmware.
  2. Collect revisions and stepping ids of the Intel chips in faulty machines and compare them to the working ones.

#25 Updated by Viktor V 24 days ago

Did you update Lenovo BIOS to the latest version before extracting ME and flashing coreboot?

Yes, I did. It was version 1.45, but version 1.46, released on June 26, 2019, is now available.

Collect revisions and stepping ids of the Intel chips in faulty machines and compare them to the working ones.

Can I help with providing this information? Not sure what revision and stepping id are, how can I see them in Debian? I've built coreboot 4.9 release.

I assumed that X220 is the most stable hardware for coreboot. Honestly, my very first thought was that this hang is caused by some kind of a failed BIOS exploit by some malware. (LOL I'm paranoid)

#26 Updated by Evgeny Zinoviev 24 days ago

Viktor V wrote:

Can I help with providing this information?

I hope so. Won't hurt anyway.

Not sure what revision and stepping id are, how can I see them in Debian?

I guess, lspci and cat /proc/cpuinfo
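
For the CPU side, a short Python sketch can pull just the identification fields out of /proc/cpuinfo so they are easy to paste into the ticket. The field names below are the ones Linux prints on x86. The helper names are made up for this example, and chipset revisions still have to come from lspci.

```python
from pathlib import Path

# Fields of /proc/cpuinfo that identify the CPU revision on x86.
FIELDS = ("model name", "cpu family", "model", "stepping", "microcode")


def parse_cpuinfo(text):
    """Return the identification fields of the first processor entry."""
    info = {}
    for line in text.splitlines():
        if not line.strip():
            if info:
                break  # a blank line ends the first processor block
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        if key in FIELDS:
            info[key] = value.strip()
    return info


def read_cpuinfo(path="/proc/cpuinfo"):
    """Parse the given cpuinfo file, or return {} if it does not exist."""
    p = Path(path)
    return parse_cpuinfo(p.read_text()) if p.exists() else {}


if __name__ == "__main__":
    for key, value in read_cpuinfo().items():
        print(f"{key}: {value}")
```

Only the first processor entry is reported, on the assumption that all logical CPUs in these laptops share the same stepping.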

I assumed that X220 is the most stable hardware for coreboot.

It is believed to be very stable. Actually, I used another X220 for a year and a half and never had a single crash or hang. This bug is quite rare: only some mainboards (or CPUs, or something) are affected and, at the moment, we have no idea why. This bug is known to occur only on SNB ThinkPads, so, in this sense, the X230 is probably more "stable".

Honestly, my very first thought was that this hang is caused by some kind of a failed BIOS exploit

Well, you have replaced your BIOS with coreboot, haven't you? ;)

Another idea: try disabling cstates and see if it helps. I was going to try it myself but I doubt I'll have time for it earlier than next week.

#27 Updated by Viktor V 24 days ago

Attaching lspci and cpuinfo outputs

#28 Updated by Viktor V 21 days ago

Evgeny Zinoviev wrote:

Another idea: try disabling cstates and see if it helps. I was going to try it myself but I doubt I'll have time for it earlier than next week.

Looks like it works! I've added "intel_idle.max_cstate=0 processor.max_cstate=1" kernel parameters and it runs for 2 days without hangs so far.

#29 Updated by Viktor V 21 days ago

Some strange things I've experienced while flashing this X220.

Every tutorial online says you can flash X220 with Raspberry Pi SPI interface, but I had no luck with it. Flashrom couldn't detect the chip, though it reads/writes fine with RPi on my other laptops. So I had to buy and use ch341a USB programmer (black version).

With ch341a Flashrom works fine, but it shows strange warnings while writing:

Found Macronix flash chip "MX25L6405" (8192 kB, SPI) on ch341a_spi.
Reading old flash chip contents... done.
Erasing and writing flash chip... FAILED at 0x00001000! Expected=0xff, Found=0xf0, failed byte count from 0x00000000-0x0000ffff: 0x1cf9
ERASE FAILED!
Reading current flash chip contents... done. Looking for another erase function.
Erase/write done.
Verifying flash... VERIFIED.

cbmem output says it has SF: Detected MX25L6405D with sector size 0x1000, total 0x800000

Edit: Right, sorry about that. Just trying to understand differences between this unstable X220 and other stable ones.

#30 Updated by Paul Menzel 21 days ago

Please contact the flashrom mailing list for the flashrom issue as it’s unrelated to the coreboot bug tracker and the issue at hand specifically.

#31 Updated by Viktor V 18 days ago

Those hangs must be related to CPU C-states for sure. After 4 days of stable uptime, I changed the kernel parameters back to the defaults and rebooted my X220. It randomly hung with that glitch on the left side of the screen after just 8 hours of work.

The temporary fix on a Linux system is to run kernel with parameters "intel_idle.max_cstate=0 processor.max_cstate=1".

For example, on Debian I do:

sudo mkdir -p /etc/default/grub.d
echo 'GRUB_CMDLINE_LINUX_DEFAULT="$GRUB_CMDLINE_LINUX_DEFAULT intel_idle.max_cstate=0 processor.max_cstate=1"' | sudo tee /etc/default/grub.d/corebootfix.cfg
sudo update-grub

Hoping this information is useful.

#32 Updated by Evgeny Zinoviev 18 days ago

Viktor V wrote:

Those hangs must be related to CPU C-states for sure. After 4 days of stable uptime, I've changed back kernel parameters to default and rebooted my X220. It randomly hanged with that glitch on the left side of the screen after just 8 hours of work.

The temporary fix on a Linux system is to run kernel with parameters "intel_idle.max_cstate=0 processor.max_cstate=1".

For example, on Debian I do:

echo GRUB_CMDLINE_LINUX_DEFAULT=\"\$GRUB_CMDLINE_LINUX_DEFAULT intel_idle.max_cstate=0 processor.max_cstate=1\" > /etc/default/grub.d/corebootfix.cfg
sudo update-grub

Hoping this information is useful.

Nice! Thank you very much. After months of hangs we finally understand something.

#33 Updated by Martin Zwicknagl 16 days ago

Hello all,

I can confirm that
intel_idle.max_cstate=0 processor.max_cstate=1
seems to fix the problem.

I also tried:
intel_idle.max_cstate=1 processor.max_cstate=2
The T520 has been running for more than three days now without freezes.

Hope this helps.

#34 Updated by Evgeny Zinoviev 16 days ago

Martin Zwicknagl wrote:

Hello all,

I can confirm that
intel_idle.max_cstate=0 processor.max_cstate=1
seems to fix the problem.

I also tried:
intel_idle.max_cstate=1 processor.max_cstate=2
The T520 is running for more than three days now, without freezes.

Hope this helps.

Do you mean that intel_idle.max_cstate=1 processor.max_cstate=2 is also stable?

#35 Updated by Nico Huber 16 days ago

AFAIK, intel_idle and ACPI processor are two independent drivers. Does this mean you tested both? If not, please always mention which one was effective, cf. cat /sys/devices/system/cpu/cpuidle/current_driver. Otherwise, the information "processor.max_cstate=2 works", for instance, may be very misleading if the processor driver wasn't used at all.
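
To make such reports unambiguous, the check can be scripted. The following Python sketch reads the active cpuidle driver from the sysfs path mentioned above and picks out the max_cstate parameter on the kernel command line that belongs to that driver. The function names are invented for this example, and the mapping from the sysfs name acpi_idle to the processor parameter prefix is an assumption worth verifying on your kernel.

```python
from pathlib import Path

CPUIDLE_DRIVER = "/sys/devices/system/cpu/cpuidle/current_driver"


def active_driver(path=CPUIDLE_DRIVER):
    """Return the active cpuidle driver name, or None if sysfs lacks it."""
    p = Path(path)
    return p.read_text().strip() if p.exists() else None


def max_cstate_params(cmdline):
    """Map parameter prefix -> value for every '<prefix>.max_cstate=N' token."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        if sep and name.endswith(".max_cstate"):
            params[name[: -len(".max_cstate")]] = value
    return params


def effective_limit(driver, cmdline):
    """Return the max_cstate value that applies to the active driver, if any."""
    # Assumption: sysfs reports the ACPI driver as 'acpi_idle', while its
    # kernel parameter prefix is 'processor'.
    prefix = {"acpi_idle": "processor"}.get(driver, driver)
    return max_cstate_params(cmdline).get(prefix)


if __name__ == "__main__":
    cmdline_file = Path("/proc/cmdline")
    cmdline = cmdline_file.read_text() if cmdline_file.exists() else ""
    driver = active_driver()
    print("active cpuidle driver:", driver)
    print("max_cstate applying to it:", effective_limit(driver, cmdline))
```

Posting both printed lines along with a test result shows which of the two parameters was actually in effect.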

#36 Updated by Martin Zwicknagl 16 days ago

Nico Huber wrote:

AFAIK, intel_idle and ACPI processor are two independent drivers. Does this mean you tested both? if not, please always mention which one was effective, cf. cat /sys/devices/system/cpu/cpuidle/current_driver. Otherwise, the information "processor.max_cstate=2 works", for instance, may be very misleading if the processor driver wasn't used at all.

Oops, I was not aware of the difference. cat /sys/devices/system/cpu/cpuidle/current_driver shows intel_idle, so I think I have tested intel_idle.max_cstate=1.

#37 Updated by Martin Zwicknagl 9 days ago

Hello,

I want to report that the laptop does NOT freeze with
intel_idle.max_cstate=1, intel_idle.max_cstate=2 or intel_idle.max_cstate=3.

With
intel_idle.max_cstate=4, intel_idle.max_cstate=5 or intel_idle.max_cstate=6
it freezes.
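
To see which hardware C-states sit on either side of that boundary on a given machine, the idle states exposed by the kernel can be listed from sysfs. This is a minimal Python sketch that reads the state directories of cpu0. How a sysfs state index corresponds to a max_cstate value is not established in this ticket and should be treated as something to verify.

```python
from pathlib import Path


def list_idle_states(cpu_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return sorted (index, name, exit latency in microseconds) tuples."""
    states = []
    base = Path(cpu_dir)
    if not base.is_dir():
        return states
    for state_dir in base.glob("state*"):
        index = int(state_dir.name[len("state"):])
        name = (state_dir / "name").read_text().strip()
        latency = int((state_dir / "latency").read_text())
        states.append((index, name, latency))
    return sorted(states)


if __name__ == "__main__":
    for index, name, latency in list_idle_states():
        print(f"state{index}: {name} (exit latency {latency} us)")
```

Running it once with the default parameters and once with a limit set shows which states the limit removes.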
