Bug #259

T440p: Tianocore unable to boot Windows 10 (MACHINE_CHECK_EXCEPTION)

Added by Crazy Fox about 2 months ago. Updated 10 days ago.

Status: New
Start date: 06/09/2020
Priority: Normal
Due date:
Assignee: Angel Pons
% Done: 0%
Category: board support
Target version: -

Description

Hi, Team!

I've successfully corebooted my T440p (without dGPU). Debian 10 works fine, but I'm unable to boot into Windows 10: I get a BSOD with Stop Code: MACHINE_CHECK_EXCEPTION.
The same exception appears even when trying to boot from Windows USB installation media.

I tried different config options, as well as coreboot master and the v4.11/v4.12 tags, with the same result.

Here is some info about my setup:
~~~
$ git rev-parse HEAD
342a8c3b2bc0845638e852af01f3054256a8446c
~~~
~~~
$ sudo hwinfo --cpu --short
cpu:
Intel(R) Core(TM) i7-4800MQ CPU @ 2.70GHz, 800 MHz
~~~
~~~
$ cat defconfig
CONFIG_LOCALVERSION="GLETA1WW (2.55)"
CONFIG_USE_OPTION_TABLE=y
CONFIG_TIMESTAMPS_ON_CONSOLE=y
CONFIG_FW_CONFIG=y
CONFIG_FW_CONFIG_SOURCE_CBFS=y
CONFIG_VENDOR_LENOVO=y
CONFIG_ONBOARD_VGA_IS_PRIMARY=y
CONFIG_CBFS_SIZE=0x200000
CONFIG_MAINBOARD_SMBIOS_PRODUCT_NAME="20AWS0VK00"
CONFIG_HAVE_IFD_BIN=y
CONFIG_BOARD_LENOVO_THINKPAD_T440P=y
CONFIG_CONSOLE_POST=y
CONFIG_PCIEXP_L1_SUB_STATE=y
CONFIG_POWER_STATE_PREVIOUS_AFTER_FAILURE=y
CONFIG_HAVE_MRC=y
CONFIG_MRC_FILE="3rdparty/blobs/mainboard/$(MAINBOARDDIR)/mrc.bin"
CONFIG_PCIEXP_CLK_PM=y
CONFIG_VALIDATE_INTEL_DESCRIPTOR=y
CONFIG_H8_SUPPORT_BT_ON_WIFI=y
CONFIG_HAVE_ME_BIN=y
CONFIG_CHECK_ME=y
CONFIG_USE_ME_CLEANER=y
CONFIG_HAVE_GBE_BIN=y
CONFIG_ELOG=y
CONFIG_USBDEBUG=y
CONFIG_USBDEBUG_DONGLE_FTDI_FT232H=y
CONFIG_DRIVERS_GENERIC_CBFS_SERIAL=y
CONFIG_DRIVERS_PS2_KEYBOARD=y
CONFIG_DEBUG_TPM=y
CONFIG_TPM_RDRESP_NEED_DELAY=y
CONFIG_SECURITY_CLEAR_DRAM_ON_REGULAR_BOOT=y
CONFIG_PAYLOAD_TIANOCORE=y
~~~

results.log - FWTS (343 KB) Crazy Fox, 06/10/2020 10:27 AM

ntbtlog.txt - Windows boot log (20.2 KB) Crazy Fox, 06/10/2020 10:28 AM

IMG_20200610_131344.jpg (114 KB) Crazy Fox, 06/10/2020 10:56 AM

IMG_20200610_132042.jpg (120 KB) Crazy Fox, 06/10/2020 10:56 AM

IMG_20200610_132407.jpg (123 KB) Crazy Fox, 06/10/2020 10:56 AM

History

#1 Updated by Paul Menzel about 2 months ago

Is there any more information regarding the machine check exception? Could you please take a picture of the error screen, and upload it?

Do you see ACPI errors when starting a GNU/Linux distribution? Does mcelog output anything? Does FWTS (https://wiki.ubuntu.com/FirmwareTestSuite) show anything critical?

PS: For the record. The master commit you used:

$ git describe 342a8c3b2bc0845638e852af01f3054256a8446c
4.12-604-g342a8c3b2b
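
For readers unfamiliar with the format, `git describe` prints `<tag>-<commits since tag>-g<abbreviated hash>`, so the string above means 604 commits after the 4.12 tag. A minimal sketch of splitting it apart follows; the function name `parse_describe` and the regular expression are illustrative only and not part of git or coreboot.

```python
import re

# Pattern for `git describe` output: <tag>-<commits since tag>-g<abbreviated hash>
DESCRIBE_RE = re.compile(r"^(?P<tag>.+)-(?P<count>\d+)-g(?P<sha>[0-9a-f]+)$")


def parse_describe(text):
    """Split a `git describe` string into (tag, commits_since_tag, abbreviated_hash)."""
    match = DESCRIBE_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not in <tag>-<n>-g<hash> form: {text!r}")
    return match["tag"], int(match["count"]), match["sha"]


tag, count, sha = parse_describe("4.12-604-g342a8c3b2b")
print(f"{count} commits after tag {tag}, at commit {sha}")
```

The abbreviated hash is a prefix of the full commit ID from the original report, which makes it easy to confirm that both refer to the same commit.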

#2 Updated by Crazy Fox about 2 months ago

Paul Menzel wrote:

Is there any more information regarding the machine check exception? Could you please take a picture of the error screen, and upload it?

Do you see ACPI errors when starting a GNU/Linux distribution? Does mcelog output anything? Does FWTS [1] show anything critical?

PS: For the record. The master commit you used:

$ git describe 342a8c3b2bc0845638e852af01f3054256a8446c
4.12-604-g342a8c3b2b

[1]: https://wiki.ubuntu.com/FirmwareTestSuite

Thanks for the reply!

Since yesterday there has been a little progress: as suggested in the reddit thread (https://www.reddit.com/r/coreboot/comments/gzmvgp/thinkpad_t440p_coreboot_v412tianocore_machine/ftjinjo/), after reverting the patch https://review.coreboot.org/c/coreboot/+/38723/6/src/mainboard/lenovo/t440p/romstage.c I can boot and log in to Windows in Safe Mode.

But a normal boot still ends in a BSOD with PAGE_FAULT_IN_NONPAGED_AREA, KMODE_EXCEPTION_NOT_HANDLED or BAD_POOL_CALLER just after the Welcome Screen appears.
When I switch back to the stock BIOS (only the BIOS region on the 4 MB flash chip; the 8 MB chip with the stripped Intel ME stays untouched), Windows boots as expected.

Incidentally, FWTS reports 14 critical failures:
~~~
Critical failures: 14
mtrr: Memory range 0x82200000 to 0x82200fff (0000:03:00.0) has incorrect attribute Unknown.
... (identical errors truncated)
mtrr: Memory range 0x82845000 to 0x8284500f (0000:00:16.0) has incorrect attribute Unknown.
~~~
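
To get a feel for how large the flagged ranges are, the two failure lines quoted above can be parsed as in the rough sketch below. It assumes the end address FWTS prints is inclusive; the helper name `parse_mtrr_failure` and the regular expression are illustrative and not part of FWTS.

```python
import re

# The two failure lines quoted above, copied verbatim from the FWTS output.
LINES = [
    "mtrr: Memory range 0x82200000 to 0x82200fff (0000:03:00.0) has incorrect attribute Unknown.",
    "mtrr: Memory range 0x82845000 to 0x8284500f (0000:00:16.0) has incorrect attribute Unknown.",
]

RANGE_RE = re.compile(
    r"Memory range (0x[0-9a-f]+) to (0x[0-9a-f]+) \(([0-9a-f:.]+)\)"
)


def parse_mtrr_failure(line):
    """Return (pci_address, start, size_in_bytes) for one FWTS mtrr failure line."""
    match = RANGE_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognized line: {line!r}")
    start = int(match.group(1), 16)
    end = int(match.group(2), 16)  # assumption: the printed end address is inclusive
    return match.group(3), start, end - start + 1


for line in LINES:
    device, start, size = parse_mtrr_failure(line)
    print(f"{device}: {size} bytes at {start:#x}")
```

Under that assumption the ranges are small device memory windows (4096 and 16 bytes) rather than RAM; running `lspci -s` with the printed address shows which device each one belongs to.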
~~~
$ sudo ras-mc-ctl --errors
No Memory errors.

PCIe AER events:
1 2020-06-09 20:11:07 +0300 Fatal error: Poisoned TLP
... (identical errors truncated)
7 2020-06-10 12:03:46 +0300 Fatal error: Poisoned TLP

No Extlog errors.

No MCE errors.
~~~
~~~
$ sudo systemctl status ras-mc-ctl
● ras-mc-ctl.service - Initialize EDAC v3.0.0 Drivers For Machine Hardware
Loaded: loaded (/lib/systemd/system/ras-mc-ctl.service; enabled; vendor preset: enabled)
Active: active (exited) since Wed 2020-06-10 13:25:40 EEST; 8min ago
Process: 896 ExecStart=/usr/sbin/ras-mc-ctl --register-labels (code=exited, status=0/SUCCESS)
Main PID: 896 (code=exited, status=0/SUCCESS)

Jun 10 13:25:39 ThinkPad-T440p systemd[1]: Starting Initialize EDAC v3.0.0 Drivers For Machine Hardware...
Jun 10 13:25:40 ThinkPad-T440p ras-mc-ctl[896]: ras-mc-ctl: Error: No dimm labels for LENOVO model 20AWS0VK00
Jun 10 13:25:40 ThinkPad-T440p systemd[1]: Started Initialize EDAC v3.0.0 Drivers For Machine Hardware.
~~~
~~~
$ sudo dmesg | grep acpi
[ 0.216226] acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
[ 0.224460] acpi PNP0A08:00: OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
[ 0.224486] acpi PNP0A08:00: OSC: OS now controls [PCIeHotplug SHPCHotplug PME AER PCIeCapability LTR]
[ 0.224492] acpi PNP0A08:00: [Firmware Info]: MMCONFIG for domain 0000 [bus 00-3f] only partially covers this bridge
[ 0.246620] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[ 2.303952] acpi device:0b: registered as cooling_device9
[ 21.922355] thinkpad_acpi: ThinkPad ACPI Extras v0.26
[ 21.922360] thinkpad_acpi: http://ibm-acpi.sf.net/
[ 21.922362] thinkpad_acpi: ThinkPad BIOS GLETA1WW (2.55), EC GLHT30WW-3.23
[ 21.932112] thinkpad_acpi: radio switch found; radios are enabled
[ 21.934136] thinkpad_acpi: Tablet mode switch found (type: MHKG), currently in laptop mode
[ 21.934198] thinkpad_acpi: This ThinkPad has standard ACPI backlight brightness control, supported by the ACPI video driver
[ 21.934199] thinkpad_acpi: Disabling thinkpad-acpi brightness events by default...
[ 21.956276] thinkpad_acpi: rfkill switch tpacpi_bluetooth_sw: radio is unblocked
[ 21.957888] thinkpad_acpi: rfkill switch tpacpi_wwan_sw: radio is unblocked
[ 21.967317] thinkpad_acpi: Standard ACPI backlight interface available, not loading native one
[ 21.967658] thinkpad_acpi: Console audio control enabled, mode: monitor (read only)
[ 21.971527] thinkpad_acpi: battery 1 registered (start 0, stop 100)
[ 21.971722] input: ThinkPad Extra Buttons as /devices/platform/thinkpad_acpi/input/input7
~~~

I've also attached BSOD screenshots, the full FWTS log, and the Windows boot log.

#3 Updated by Paul Menzel about 2 months ago

I guess the pictures are from different boots?

  1. 2 * PAGE_FAULT_IN_NONPAGED_AREA
  2. BAD_POOL_CALLER

I guess with your original report you saw MACHINE_CHECK_EXCEPTION?

Getting different errors, I’d say there is a problem with memory init. But as GNU/Linux works, I am not sure. Hopefully others will have a clue.

#4 Updated by Crazy Fox about 2 months ago

Paul Menzel wrote:

I guess the pictures are from different boots?

  1. 2 * PAGE_FAULT_IN_NONPAGED_AREA
  2. BAD_POOL_CALLER

I guess with your original report you saw MACHINE_CHECK_EXCEPTION?

Getting different errors, I’d say there is a problem with memory init. But as GNU/Linux works, I am not sure. Hopefully others will have a clue.

Yes, initially I got MACHINE_CHECK_EXCEPTION during normal boot, Safe Mode boot and USB installation media boot, and not a single entry was added to ntbtlog.txt.
With the patch reverted, a normal boot ends in a BSOD with PAGE_FAULT_IN_NONPAGED_AREA, KMODE_EXCEPTION_NOT_HANDLED or BAD_POOL_CALLER, in random order, just after the Welcome Screen appears.

#6 Updated by Crazy Fox 12 days ago

It seems this can be closed, as the latest Windows 10 2009 build works fine.

#7 Updated by Angel Pons 10 days ago

  • Assignee set to Angel Pons

I see the same BSOD on the Asrock B85M Pro4, and decided to take a look. Out of desperation, I tried disabling things in the devicetree, and found out why things break. See https://review.coreboot.org/43763 for a dirty fix for the B85M Pro4.

It turns out that when the last root port function is not visible for any reason, root_port_commit_config() is not called and PCH PCIe root port initialization is never completed. A symptom of this problem: briefly pressing the power button while in the payload should power the computer off, but the machine might lock up instead. That seems to be a side effect of the missing initialization.
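
The failure mode described above can be illustrated with a toy model. This is not the coreboot code: `init_root_ports`, the `visible` set and the default of eight ports are assumptions made purely for illustration, and the `committed` flag merely stands in for the call to root_port_commit_config().

```python
def init_root_ports(visible, num_ports=8):
    """Toy model: walk the root port functions and report whether the commit ran.

    `visible` is the set of function numbers that get enumerated. The commit
    step is triggered only while handling the last function, mirroring the
    behaviour described in the comment above.
    """
    last = num_ports - 1
    committed = False
    for function in range(num_ports):
        if function not in visible:
            continue  # hidden or disabled ports are never handed to the init hook
        # per-port setup would happen here
        if function == last:
            committed = True  # stand-in for root_port_commit_config()
    return committed


print(init_root_ports(visible={0, 1, 2, 3, 4, 5, 6, 7}))  # all ports visible
print(init_root_ports(visible={0, 1, 2}))  # last port hidden: commit never runs
```

In this model the commit depends entirely on whether the highest-numbered function is visible, which matches the observation that disabling the last root port in the devicetree leaves initialization unfinished.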

There's a log of the T440p in board_status that shows that the last PCIe root port is disabled, which could be why things don't work: https://review.coreboot.org/cgit/board-status.git/tree/lenovo/t440p/4.11-1594-g6daa8c3ba5/2020-03-13T02_50_21Z/coreboot_console.txt

In short: no, this is far from solved. The PCIe handling code needs a revamp.
