Saturday, February 22, 2014

0x101 Debugging

I've been working on an 0x101 BSOD (located here), and I thought I'd go ahead and blog about it, even though it's not officially solved just yet. It's still interesting nonetheless, and I believe it's good content for a post.

CLOCK_WATCHDOG_TIMEOUT (101)

This indicates that an expected clock interrupt on a secondary processor, in a multi-processor system, was not received within the allocated interval. 

So there's the basic definition of this particular bug check. Let's get into the debugging now.

--------------------


BugCheck 101, {19, 0, fffff880009b2180, 4}

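^^ The first parameter is the timeout interval in clock ticks. The debugger prints bug check parameters in hex, so 19 here is 0x19, or 25 ticks.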

fffff880009b2180 is the PRCB address of the hung processor. Let's keep this address in mind.
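
Since every value on that line is printed in hex, it can help to decode the parameters explicitly. Here is a minimal Python sketch that does so; the function name and the variable names are just illustrative labels, not anything WinDbg provides:

```python
import re

def parse_bugcheck(line):
    """Split a WinDbg 'BugCheck' line into its code and arguments (all hex)."""
    match = re.match(r"BugCheck\s+([0-9A-Fa-f]+),\s*\{([^}]*)\}", line.strip())
    if match is None:
        raise ValueError("not a BugCheck line: %r" % line)
    code = int(match.group(1), 16)
    args = [int(field.strip(), 16) for field in match.group(2).split(",")]
    return code, args

code, args = parse_bugcheck("BugCheck 101, {19, 0, fffff880009b2180, 4}")
# Labels below follow the way the parameters are used in this post.
timeout_ticks, _, prcb_address, processor_index = args
print(hex(code), timeout_ticks, hex(prcb_address), processor_index)
```

Decoded this way, the first parameter comes out as 25 ticks and the last as processor index 4.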

0: kd> !prcb 4
PRCB for Processor 4 at fffff880009b2180:
Current IRQL -- 0
Threads--  Current fffffa800d851060 Next fffffa800caa6680 Idle fffff880009bd0c0
Processor Index 4 Number (0, 4) GroupSetMember 10
Interrupt Count -- 001bd6a1
Times -- Dpc    00000018 Interrupt 00000048
         Kernel 0000f52d User      00003d36
For reference, I did not do !prcb 0 through 4. That would have been very tedious. Instead, you can use !running -it. The "i" argument causes it to display idle processors too, and "t" displays the stack trace for the thread running on each processor. If we run that extension, it shows this is an 8-core box.

Hint: At times, the 4th parameter of the bug check will show you the responsible processor. For example, in this *101, it was correct, as the 4th parameter was 4.

Hint #2: You can also generally tell the number of cores on the box by checking the bugcheck string - BUGCHECK_STR: CLOCK_WATCHDOG_TIMEOUT_8_PROC
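
Both hints can be cross-checked with a few lines of Python. This is only a sketch under two assumptions: that "GroupSetMember 10" in the !prcb output is the processor's single-bit affinity mask printed in hex, and that the count in the bugcheck string is printed in hex like other debugger values (for 8 it makes no difference). The helper names are made up for illustration:

```python
import re

def processor_from_set_member(mask):
    """Return the bit position of a single-bit affinity mask."""
    if mask <= 0 or mask & (mask - 1):
        raise ValueError("expected exactly one bit set")
    return mask.bit_length() - 1

def processor_count_from_bugcheck_str(text):
    """Pull the processor count out of a ..._<n>_PROC bugcheck string."""
    match = re.search(r"_([0-9A-Fa-f]+)_PROC$", text.strip())
    return int(match.group(1), 16) if match else None

group_set_member = int("10", 16)   # "GroupSetMember 10" from !prcb 4
fourth_parameter = int("4", 16)    # last bug check parameter

print(processor_from_set_member(group_set_member) == fourth_parameter)
print(processor_count_from_bugcheck_str("CLOCK_WATCHDOG_TIMEOUT_8_PROC"))
```

If the mask and the 4th parameter ever disagree, that is a sign to check each processor by hand rather than trusting the hint.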

The PRCB address that !prcb 4 reports matches the 3rd parameter of the bug check, so processor #4 is the responsible processor. With the information we have so far, we know that processor #4 went 0x19 (25) clock ticks without responding, and therefore the system crashed. Before we go further, what is a clock tick? The system timer (the HPET on this box, going by hal!HalpHpetClockInterrupt in the stack below) fires a clock interrupt at a regular interval, and each of those interrupts is one tick. Windows uses the ticks for timekeeping and scheduling bookkeeping, and every processor is expected to receive and process its clock interrupts. When one processor fails to do so within the allowed number of ticks, you crash.
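
To put a rough number on that, the !thread output further down in this post reports "Ticks: 714 (0:00:00:11.138)", which gives the length of one tick on this box. The sketch below is plain arithmetic on those two figures, assuming the first bug check parameter is counted in ticks of that same length; the function name is made up for the example:

```python
def tick_length_ms(ticks, elapsed_seconds):
    """Average length of one clock tick in milliseconds."""
    return elapsed_seconds * 1000.0 / ticks

tick_ms = tick_length_ms(714, 11.138)   # "Ticks: 714 (0:00:00:11.138)"
timeout_ticks = int("19", 16)           # first bug check parameter, in hex
timeout_ms = timeout_ticks * tick_ms

print(round(tick_ms, 1), "ms per tick")
print(round(timeout_ms), "ms before the watchdog fires")
```

That works out to roughly 15.6 ms per tick, so the 25-tick interval is in the neighbourhood of 390 ms.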

--------------------

Let's now look at the stacks of the different processors to see what the threads were involved in:

We could use knL and go through a grueling method of obtaining the trap frame by hand, but kv prints the trap frame address inline, so let's use kv instead on Processor 0:

0: kd> kv
Child-SP          RetAddr           : Args to Child                                                           : Call Site
fffff880`009a9728 fffff800`0322c443 : 00000000`00000101 00000000`00000019 00000000`00000000 fffff880`009b2180 : nt!KeBugCheckEx
fffff880`009a9730 fffff800`032885f7 : 00000000`00000000 fffff800`00000004 00000000`00002711 00000000`00000000 : nt! ?? ::FNODOBFM::`string'+0x4e3e
fffff880`009a97c0 fffff800`037f5895 : fffff800`0381a460 fffff880`009a9970 fffff800`0381a460 00000000`00000000 : nt!KeUpdateSystemTime+0x377
fffff880`009a98c0 fffff800`0327c3f3 : 00000000`8e403992 fffff800`033f8e80 fffff880`03088180 fffffa80`101ca060 : hal!HalpHpetClockInterrupt+0x8d
fffff880`009a98f0 fffff800`032b55a3 : 00000000`00000000 00000000`00000001 00000000`00000000 00000000`00000000 : nt!KiInterruptDispatchNoLock+0x163 (TrapFrame @ fffff880`009a98f0)
fffff880`009a9a80 fffff800`0328de2c : 00000000`00000000 fffff6fc`40004308 00000000`00000000 00000000`00000000 : nt!KxFlushEntireTb+0x93
fffff880`009a9ac0 fffff800`032c76b9 : fffff6fc`40004308 00000000`00000008 fffff800`033f8e80 00000000`00000080 : nt!KeFlushMultipleRangeTb+0x28c
fffff880`009a9b90 fffff800`032c728f : ffffffff`ffffffff 00000000`0000007f 00000000`00000000 00000000`00000000 : nt!MiZeroPageChain+0x14e
fffff880`009a9bd0 fffff800`03523166 : fffffa80`0ca90460 00000000`00000080 fffffa80`0ca909e0 fffff800`0325e479 : nt!MmZeroPageThread+0x7da
fffff880`009a9d00 fffff800`0325e486 : fffff800`033f8e80 fffffa80`0ca90460 fffff800`03406c40 15ff0000`0160248c : nt!PspSystemThreadStartup+0x5a
fffff880`009a9d40 00000000`00000000 : fffff880`009aa000 fffff880`009a4000 fffff880`009a9970 00000000`00000000 : nt!KiStartSystemThread+0x16
There it is! The trap frame address is on the nt!KiInterruptDispatchNoLock line. Let's move forward and switch to it:

0: kd> .trap fffff880`009a98f0
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=0000000000000001 rbx=0000000000000000 rcx=00000000000406f8
rdx=00000000000008e1 rsi=0000000000000000 rdi=0000000000000000
rip=fffff800032b55a3 rsp=fffff880009a9a80 rbp=fffff880009a9bb8
 r8=0000000000000000  r9=ffffffffffffff7f r10=0000000000000008
r11=fffff880009a9a20 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei ng nz na pe nc
nt!KxFlushEntireTb+0x93:
fffff800`032b55a3 ebe4            jmp     nt!KxFlushEntireTb+0x79 (fffff800`032b5589)
0: kd> knL
  *** Stack trace for last set context - .thread/.cxr resets it
 # Child-SP          RetAddr           Call Site
00 fffff880`009a9a80 fffff800`0328de2c nt!KxFlushEntireTb+0x93
01 fffff880`009a9ac0 fffff800`032c76b9 nt!KeFlushMultipleRangeTb+0x28c
02 fffff880`009a9b90 fffff800`032c728f nt!MiZeroPageChain+0x14e
03 fffff880`009a9bd0 fffff800`03523166 nt!MmZeroPageThread+0x7da
04 fffff880`009a9d00 fffff800`0325e486 nt!PspSystemThreadStartup+0x5a
05 fffff880`009a9d40 00000000`00000000 nt!KiStartSystemThread+0x16
^^ Here we can find the stored registers and the stack at the time of the interrupt.
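
On longer stacks the "(TrapFrame @ ...)" marker is easy to miss, so here is a small Python sketch that picks it out of saved kv output. The regular expression and function name are illustrative only; the two sample lines are copied from the kv output above:

```python
import re

TRAP_FRAME = re.compile(r"\(TrapFrame @ ([0-9a-fA-F`]+)\)")

def find_trap_frames(kv_output):
    """Return (call site, trap frame address) pairs found in kv output."""
    found = []
    for line in kv_output.splitlines():
        match = TRAP_FRAME.search(line)
        if match:
            # The call site is the last " : "-separated field before the marker.
            call_site = line[:match.start()].split(" : ")[-1].strip()
            found.append((call_site, match.group(1)))
    return found

sample = """\
fffff880`009a98c0 fffff800`0327c3f3 : 00000000`8e403992 fffff800`033f8e80 fffff880`03088180 fffffa80`101ca060 : hal!HalpHpetClockInterrupt+0x8d
fffff880`009a98f0 fffff800`032b55a3 : 00000000`00000000 00000000`00000001 00000000`00000000 00000000`00000000 : nt!KiInterruptDispatchNoLock+0x163 (TrapFrame @ fffff880`009a98f0)
"""

frames = find_trap_frames(sample)
print(frames)
```

The address it returns is the one passed to .trap above.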

This is where we're going to do some instruction disassembling:

0: kd> u @rip
nt!KxFlushEntireTb+0x93:
fffff800`032b55a3 ebe4            jmp     nt!KxFlushEntireTb+0x79 (fffff800`032b5589)
fffff800`032b55a5 f08305d3c5140001 lock add dword ptr [nt!KiTbFlushTimeStamp (fffff800`03401b80)],1
fffff800`032b55ad 400fb6c6        movzx   eax,sil
fffff800`032b55b1 440f22c0        mov     cr8,rax
fffff800`032b55b5 488b5c2440      mov     rbx,qword ptr [rsp+40h]
fffff800`032b55ba 488b742448      mov     rsi,qword ptr [rsp+48h]
fffff800`032b55bf 4883c430        add     rsp,30h
fffff800`032b55c3 5f              pop     rdi
0: kd> u fffff800`032b5589 fffff800`032b55a5
nt!KxFlushEntireTb+0x79:
fffff800`032b5589 8b8780200000    mov     eax,dword ptr [rdi+2080h]
fffff800`032b558f 85c0            test    eax,eax <-- Checking whether the value is zero.
fffff800`032b5591 7412            je      nt!KxFlushEntireTb+0x95 (fffff800`032b55a5) <-- This je leaves the loop once the value is zero. Otherwise execution falls through, and the jmp at +0x93 goes back to +0x79.
fffff800`032b5593 ffc3            inc     ebx
fffff800`032b5595 851d310d2000    test    dword ptr [nt!HvlLongSpinCountMask (fffff800`034b62cc)],ebx
fffff800`032b559b 0f84a11e0200    je      nt! ?? ::FNODOBFM::`string'+0x7467 (fffff800`032d7442)
fffff800`032b55a1 f390            pause
fffff800`032b55a3 ebe4            jmp     nt!KxFlushEntireTb+0x79 (fffff800`032b5589)
fffff800`032b55a5 f08305d3c5140001 lock add dword ptr [nt!KiTbFlushTimeStamp (fffff800`03401b80)],1
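^^ Disassembling the first few instructions reveals a jump (jmp) back up to +0x79 in the nt!KxFlushEntireTb function. It appears that at the time of the bug check, the thread was spinning in this loop, executing a pause (a hint to the CPU that it's in a spin-wait) on each pass while waiting for the value at [rdi+2080h] to drop to zero. Going by the function name, it was most likely waiting for the other processors to respond to a TB (translation buffer) flush request.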
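
To make the control flow of that loop easier to follow, here is a rough Python model of it. This is only a sketch of the disassembly above, not the actual kernel logic: read_flag stands in for the load from [rdi+2080h], long_spin_mask for nt!HvlLongSpinCountMask, and on_long_spin for the out-of-line branch whose code isn't shown in this post, so all three names are hypothetical. The max_spins cap is also an addition so that the example terminates.

```python
def spin_until_clear(read_flag, long_spin_mask=0xFFFFFFFF, on_long_spin=None,
                     max_spins=1000):
    """Model of the loop at nt!KxFlushEntireTb+0x79 through +0x93.

    Returns the spin count once read_flag() returns zero, or raises
    TimeoutError if it never does within max_spins passes.
    """
    spins = 0                                  # ebx in the disassembly
    while spins < max_spins:
        if read_flag() == 0:                   # mov eax,[rdi+2080h] / test / je +0x95
            return spins                       # loop exit
        spins += 1                             # inc ebx
        if (long_spin_mask & spins) == 0:      # test [HvlLongSpinCountMask],ebx / je
            if on_long_spin is not None:
                on_long_spin(spins)            # out-of-line path, contents unknown
        # pause, then jmp back to +0x79 for the next pass
    raise TimeoutError("flag never cleared")


# A flag that clears after three reads: the loop exits normally.
reads = iter([1, 1, 1, 0])
print(spin_until_clear(lambda: next(reads)))

# A flag that never clears: the loop keeps spinning until the cap is hit.
try:
    spin_until_clear(lambda: 1, max_spins=50)
except TimeoutError as exc:
    print("stuck:", exc)
```

The second case is the interesting one here: if the processor that is supposed to clear the flag never responds, the waiting processor just keeps spinning.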

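So, what's the summary so far? Processor #0 is the processor that raised the bug check itself. Its thread was spinning in nt!KxFlushEntireTb when a clock interrupt came in, and the clock interrupt handler (nt!KeUpdateSystemTime) is what detected the timeout and triggered the CLOCK_WATCHDOG_TIMEOUT bug check.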

--------------------

Let's take a look at Processor #1's call stack like we did for Processor #0:

Child-SP          RetAddr           : Args to Child                                                           : Call Site
fffff880`02f1bc58 fffff800`0328da3a : 00000000`0035ce39 fffffa80`0dc25bd8 fffff880`009fb0c0 00000000`00000001 : intelppm!MWaitIdle+0x19
fffff880`02f1bc60 fffff800`032886cc : fffff880`009f0180 fffff880`00000000 00000000`00000000 fffff880`00000000 : nt!PoIdle+0x53a
fffff880`02f1bd40 00000000`00000000 : fffff880`02f1c000 fffff880`02f16000 fffff880`02f1bd00 00000000`00000000 : nt!KiIdleLoop+0x2c
1: kd> !irql
Debugger saved IRQL for processor 0x1 -- 0 (LOW_LEVEL)
^^ Either it's running at 0, or the IRQL, despite saying 'saved', really didn't get saved. Windows Internals notes this is a possibility.

1: kd> u @rip
intelppm!MWaitIdle+0x19:
fffff880`06c7ac61 c3              ret
fffff880`06c7ac62 cc              int     3
fffff880`06c7ac63 cc              int     3
fffff880`06c7ac64 cc              int     3
fffff880`06c7ac65 cc              int     3
fffff880`06c7ac66 cc              int     3
fffff880`06c7ac67 cc              int     3
intelppm!SetPerfStateIO:
fffff880`06c7ac68 48895c2408      mov     qword ptr [rsp+8],rbx
^^ So it seems that we have the intelppm!MWaitIdle function. I have done some research and I cannot find info on it, although intelppm is the Intel processor driver, which I believe handles power configuration, power states, etc. Since it's called here from nt!PoIdle and nt!KiIdleLoop, and assuming idle implies what I believe it does, processor #1 was most likely just idle at the time of the crash.

--------------------

Let's check Processor #2:
2: kd> kv
Child-SP          RetAddr           : Args to Child                                                           : Call Site
fffff880`02f8cc58 fffff800`0328da3a : 00000000`0035ce39 fffffa80`0d163908 00000000`00000000 00000000`00000000 : intelppm!MWaitIdle+0x19
fffff880`02f8cc60 fffff800`032886cc : fffff880`02f64180 fffff880`00000000 00000000`00000000 fffff880`00000000 : nt!PoIdle+0x53a
fffff880`02f8cd40 00000000`00000000 : fffff880`02f8d000 fffff880`02f87000 fffff880`02f8cd00 00000000`00000000 : nt!KiIdleLoop+0x2c
^^ Exact same as Processor #1.

--------------------

Let's check Processor #3:

3: kd> kv
Child-SP          RetAddr           : Args to Child                                                           : Call Site
fffff880`07bb5210 fffff800`032ab0fb : 00000000`00000002 fffff880`07bb5380 fffff900`c2e82000 00000000`00000001 : nt!MiFlushTbAsNeeded+0x28a
fffff880`07bb5320 fffff800`033afceb : 00000000`00001230 00000000`00001230 00000000`00000021 00000008`00000000 : nt!MiAllocatePagedPoolPages+0x4bb
fffff880`07bb5440 fffff800`03292860 : 00000000`00001230 fffff880`049fccc0 00000000`00000021 00000000`00000000 : nt!MiAllocatePoolPages+0x8e2
fffff880`07bb5590 fffff800`033b2bfe : 00000000`00000000 00000000`02323fff fffffa80`00000020 fffff880`049fccc0 : nt!ExpAllocateBigPool+0xb0
fffff880`07bb5680 fffff960`001928ed : 00000000`00001d01 00000000`00000000 00000000`00000000 fffff960`001a40d1 : nt!ExAllocatePoolWithTag+0x82e
fffff880`07bb5770 fffff960`00193e0f : 00000000`00000001 fffff880`07bb5908 00000000`00000001 fffff960`001a4302 : win32k!AllocateObject+0xdd
fffff880`07bb57b0 fffff960`00169f2f : fffff880`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : win32k!SURFMEM::bCreateDIB+0x1fb
fffff880`07bb58a0 fffff960`00180bc4 : 00000000`01010051 fffff900`c34b3a00 00000000`00000000 00000000`0000002c : win32k!GreCreateDIBitmapReal+0x533
fffff880`07bb59d0 fffff960`00182be6 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : win32k!InternalGetIconInfo+0x174
fffff880`07bb5ac0 fffff800`0327f153 : fffffa80`0ea2d060 00000000`0272f098 fffff880`07bb5b88 00000000`00000020 : win32k!NtUserGetIconInfo+0x182
fffff880`07bb5b70 00000000`778f462a : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x13 (TrapFrame @ fffff880`07bb5be0)
00000000`0272f078 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x778f462a
3: kd> .trap fffff880`07bb5be0
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=0000000000000000 rbx=0000000000000000 rcx=0000000000000000
rdx=0000000000000000 rsi=0000000000000000 rdi=0000000000000000
rip=00000000778f462a rsp=000000000272f078 rbp=000000000272f1e0
 r8=0000000000000000  r9=0000000000000000 r10=0000000000000000
r11=0000000000000000 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei pl zr na po nc
0033:00000000`778f462a ??              ???
3: kd> u @rip
00000000`778f462a ??              ???
            ^ Memory access error in 'u @rip'
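^^ Cannot seem to disassemble at rip on processor #3. The rip in this trap frame (00000000`778f462a) is a user-mode address, since the trap frame was saved on entry to the system call, and that memory most likely isn't included in this dump. From the stack, it looks like nt!MiFlushTbAsNeeded may be in a loop.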
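
A quick way to tell which side of the user/kernel boundary an address is on is sketched below in Python. It assumes the usual x64 Windows layout where kernel addresses sit in the upper canonical half (0xFFFF800000000000 and up); the constant and function names are made up for the example:

```python
KERNEL_SPACE_START = 0xFFFF800000000000  # assumed start of the upper canonical half

def parse_address(text):
    """Convert a WinDbg-style address such as fffff800`032b55a3 to an int."""
    return int(text.replace("`", ""), 16)

def is_kernel_address(address):
    return address >= KERNEL_SPACE_START

for text in ("00000000`778f462a", "fffff800`032b55a3"):
    kind = "kernel" if is_kernel_address(parse_address(text)) else "user"
    print(text, kind)
```

The first address is the rip from processor #3's trap frame and the second is the rip from processor #0's, which is why only the second could be disassembled.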

--------------------

Let's now take a look at the problematic processor (#4):

4: kd> kv
Child-SP          RetAddr           : Args to Child                                                           : Call Site
00000000`00000000 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x0
4: kd> r
rax=0000000000000000 rbx=0000000000000000 rcx=0000000000000000
rdx=0000000000000000 rsi=0000000000000000 rdi=0000000000000000
rip=0000000000000000 rsp=0000000000000000 rbp=0000000000000000
 r8=0000000000000000  r9=0000000000000000 r10=0000000000000000
r11=0000000000000000 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up di pl nz na pe nc
cs=0000  ss=0000  ds=0000  es=0000  fs=0000  gs=0000             efl=00000000
00000000`00000000 ??              ???
^^ We have a zeroed stack and registers, so this will be problematic. This usually occurs on the problem processor because its IRQL is too high, or because the processor was too hung at the time of the crash to report its information. We will need to get the raw stack.

Let's give this a shot:

4: kd> !pcr
KPCR for Processor 4 at fffff880009b2000:
    Major 1 Minor 1
    NtTib.ExceptionList: fffff880009bd640
        NtTib.StackBase: fffff880009b7040
       NtTib.StackLimit: 00000000000ade58
     NtTib.SubSystemTib: fffff880009b2000
          NtTib.Version: 00000000009b2180
      NtTib.UserPointer: fffff880009b27f0
          NtTib.SelfTib: 00000000fffdb000

                SelfPcr: 0000000000000000
                   Prcb: fffff880009b2180
                   Irql: 0000000000000000
                    IRR: 0000000000000000
                    IDR: 0000000000000000
          InterruptMode: 0000000000000000
                    IDT: 0000000000000000
                    GDT: 0000000000000000
                    TSS: 0000000000000000

          CurrentThread: fffffa800d851060
             NextThread: fffffa800caa6680
             IdleThread: fffff880009bd0c0
4: kd> !thread
THREAD fffffa800d851060  Cid 0bb8.08c4  Teb: 00000000fffdb000 Win32Thread: fffff900c4147c30 RUNNING on processor 4
Not impersonating
DeviceMap                 fffff8a0019544e0
Owning Process            fffffa800d845b30       Image:         chrome.exe
Attached Process          N/A            Image:         N/A
Wait Start TickCount      78494          Ticks: 714 (0:00:00:11.138)
Context Switch Count      232083         IdealProcessor: 4                 LargeStack
UserTime                  00:00:44.990
KernelTime                00:00:09.032
Win32 Start Address 0x0000000000288c9e
Stack Init fffff8800b922d70 Current fffff8800b922a60
Base fffff8800b923000 Limit fffff8800b91a000 Call 0
Priority 9 BasePriority 8 UnusualBoost 0 ForegroundBoost 0 IoPriority 2 PagePriority 5
Child-SP          RetAddr           : Args to Child                                                           : Call Site
00000000`00000000 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x0
^^ We'll be using the base & limit addresses to dump the raw stack:

(For convenience purposes, I cut the stack to the important part because the entire raw stack is really large. Even after cutting it, it's still really large...)

fffff880`0b920988  fffff880`0fef3a11*** ERROR: Symbol file could not be found.  Defaulted to export symbols for nvlddmkm.sys -
 nvlddmkm+0xbda11
fffff880`0b920990  fffffa80`0fab3460
fffff880`0b920998  fffff880`06d92095 dxgmms1!VidSchiUpdateContextRunningTimeAtISR+0x45
fffff880`0b9209a0  00000000`0035ce39
fffff880`0b9209a8  fffffa80`0e47c000
fffff880`0b9209b0  00000000`00000000
fffff880`0b9209b8  00000000`00000005
fffff880`0b9209c0  fffffa80`0fab3460
fffff880`0b9209c8  00000000`0000e323
fffff880`0b9209d0  fffffa80`0e47d1a0
fffff880`0b9209d8  fffff880`06d8e66b dxgmms1!VidSchiProcessIsrCompletedPacket+0x1eb
fffff880`0b9209e0  fffffa80`0e47d1a0
fffff880`0b9209e8  fffffa80`0e44a410
fffff880`0b9209f0  fffffa80`0e47c000
fffff880`0b9209f8  fffffa80`0e47c000
fffff880`0b920a00  00000000`00000000
fffff880`0b920a08  00000000`00000000
fffff880`0b920a10  fffffa80`0e47d1a0
fffff880`0b920a18  00000000`00000000
fffff880`0b920a20  fffffa80`0fab3460
fffff880`0b920a28  00000000`00001661
fffff880`0b920a30  00000000`0000c350
fffff880`0b920a38  00000000`00400120
fffff880`0b920a40  00000000`0000e323
fffff880`0b920a48  00000000`00000001
fffff880`0b920a50  00000000`00000000
fffff880`0b920a58  00000000`00000000
fffff880`0b920a60  fffffa80`0e44a410
fffff880`0b920a68  fffff880`06d8e172 dxgmms1!VidSchDdiNotifyInterruptWorker+0x1ea
fffff880`0b920a70  fffffa80`0ebbd9c0
fffff880`0b920a78  fffff880`06d92095 dxgmms1!VidSchiUpdateContextRunningTimeAtISR+0x45
fffff880`0b920a80  00000000`0035ce39
fffff880`0b920a88  fffffa80`0e47e000
fffff880`0b920a90  00000000`00000000
fffff880`0b920a98  00000000`00000006
fffff880`0b920aa0  fffffa80`0ebbd9c0
fffff880`0b920aa8  00000000`00006ed6
fffff880`0b920ab0  fffffa80`0e47f5b0
fffff880`0b920ab8  fffff880`06d8e66b dxgmms1!VidSchiProcessIsrCompletedPacket+0x1eb
fffff880`0b920ac0  fffffa80`0e47f5b0
fffff880`0b920ac8  fffffa80`0e44a410
fffff880`0b920ad0  fffffa80`0e47e000
fffff880`0b920ad8  fffffa80`0e47e000
fffff880`0b920ae0  00000000`00000000
fffff880`0b920ae8  00000000`00000000
fffff880`0b920af0  fffffa80`0e47f5b0
fffff880`0b920af8  00000000`00000000
fffff880`0b920b00  fffffa80`0ebbd9c0
fffff880`0b920b08  00000000`0000071a
fffff880`0b920b10  00000000`000124f8
fffff880`0b920b18  00000000`00400120
fffff880`0b920b20  00000000`00006ed6
fffff880`0b920b28  00000000`00000001
fffff880`0b920b30  00000000`00000001
fffff880`0b920b38  00000000`00000000
fffff880`0b920b40  fffffa80`0e44a410
fffff880`0b920b48  fffff880`06d8e172 dxgmms1!VidSchDdiNotifyInterruptWorker+0x1ea
fffff880`0b920b50  fffffa80`0e47e000
fffff880`0b920b58  00000000`00000000
fffff880`0b920b60  fffffa80`00000001
fffff880`0b920b68  fffff880`0b920e00
fffff880`0b920b70  fffff880`0b920ba0
fffff880`0b920b78  00000000`00000005
fffff880`0b920b80  00000000`00000000
fffff880`0b920b88  00000000`00000001
fffff880`0b920b90  fffff880`0b920e00
fffff880`0b920b98  fffff880`06d8df76 dxgmms1!VidSchDdiNotifyInterrupt+0x9e
fffff880`0b920ba0  fffffa80`00006ed6
fffff880`0b920ba8  00000000`00000000
fffff880`0b920bb0  fffffa80`0e472000
fffff880`0b920bb8  fffffa80`0ebc2010
fffff880`0b920bc0  fffff880`0b920e00
fffff880`0b920bc8  fffff880`06c9513f dxgkrnl!DxgNotifyInterruptCB+0x83
fffff880`0b920bd0  00000000`00006ed6
fffff880`0b920bd8  00000000`00000000
fffff880`0b920be0  00000000`00000001
fffff880`0b920be8  00000000`00000000
fffff880`0b920bf0  fffff880`0b920c80
fffff880`0b920bf8  fffff880`0fef37c9 nvlddmkm+0xbd7c9
fffff880`0b920c00  fffff880`0b920e00
fffff880`0b920c08  fffffa80`0ebc2010
fffff880`0b920c10  fffffa80`0d906000
fffff880`0b920c18  00000000`00000000
fffff880`0b920c20  fffff880`0fef376f nvlddmkm+0xbd76f
fffff880`0b920c28  fffffa80`0d906000
fffff880`0b920c30  00000000`00000000
fffff880`0b920c38  00000000`00000000
fffff880`0b920c40  00000000`00000000
fffff880`0b920c48  00000000`00000000
fffff880`0b920c50  fffffa80`0d697480
fffff880`0b920c58  fffff880`0b920e00
fffff880`0b920c60  00000000`00006ed6
fffff880`0b920c68  fffffa80`0e246000
fffff880`0b920c70  00000000`00000001
fffff880`0b920c78  00000000`00000000
fffff880`0b920c80  fffff880`0b920d40
fffff880`0b920c88  fffff880`0fef3a11 nvlddmkm+0xbda11
fffff880`0b920c90  fffffa80`0d906000
fffff880`0b920c98  fffffa80`0ebc2010
fffff880`0b920ca0  fffff880`0b920e00
fffff880`0b920ca8  fffffa80`0e246000
fffff880`0b920cb0  fffff880`0fef39aa nvlddmkm+0xbd9aa
fffff880`0b920cb8  fffffa80`0d906000
fffff880`0b920cc0  00000000`00000000
fffff880`0b920cc8  00000000`00000000
fffff880`0b920cd0  00000000`00000000
fffff880`0b920cd8  00000000`00000000
fffff880`0b920ce0  00000000`00006ed6
fffff880`0b920ce8  00000000`00000000
fffff880`0b920cf0  00000000`00000001
fffff880`0b920cf8  00000000`00000000
fffff880`0b920d00  00000000`00000000
fffff880`0b920d08  fffffa80`0d906000
fffff880`0b920d10  00000000`00006ed6
fffff880`0b920d18  fffff880`0fef3924 nvlddmkm+0xbd924
fffff880`0b920d20  fffff880`0fef39aa nvlddmkm+0xbd9aa
fffff880`0b920d28  fffffa80`0e246000
fffff880`0b920d30  fffffa80`0e251ad0
fffff880`0b920d38  fffffa80`0d906000
fffff880`0b920d40  fffff880`0b920e90
fffff880`0b920d48  fffff880`0ff2c17d nvlddmkm+0xf617d
fffff880`0b920d50  00000000`00000000
fffff880`0b920d58  fffffa80`0ebc2010
fffff880`0b920d60  00000000`00000001
fffff880`0b920d68  00000000`00000000
fffff880`0b920d70  fffff880`0ff2be9c nvlddmkm+0xf5e9c
fffff880`0b920d78  fffffa80`0d906000
fffff880`0b920d80  fffffa80`0ebb5010
fffff880`0b920d88  fffffa80`0ebb6010
fffff880`0b920d90  fffffa80`0ebb69d0
fffff880`0b920d98  00000000`00000000
fffff880`0b920da0  00000000`00000001
fffff880`0b920da8  00000000`00000000
fffff880`0b920db0  fffffa80`0fab3460
fffff880`0b920db8  fffff880`06d92095 dxgmms1!VidSchiUpdateContextRunningTimeAtISR+0x45
fffff880`0b920dc0  00000000`0035ce39
fffff880`0b920dc8  fffffa80`0e47c000
fffff880`0b920dd0  00000000`00000000
fffff880`0b920dd8  00000000`00000006
fffff880`0b920de0  fffffa80`0fab3460
fffff880`0b920de8  00000000`0005fea6
fffff880`0b920df0  fffffa80`0e47cf30
fffff880`0b920df8  fffff880`06d8e66b dxgmms1!VidSchiProcessIsrCompletedPacket+0x1eb
fffff880`0b920e00  fffffa80`0fab3460
fffff880`0b920e08  fffff880`06d92095 dxgmms1!VidSchiUpdateContextRunningTimeAtISR+0x45
fffff880`0b920e10  00000000`0035ce39
fffff880`0b920e18  fffffa80`0e47c000
fffff880`0b920e20  00000000`00000000
fffff880`0b920e28  00000000`0000000b
fffff880`0b920e30  fffffa80`0fab3460
fffff880`0b920e38  00000000`0007cd31
fffff880`0b920e40  fffffa80`0e47ccc0
fffff880`0b920e48  fffff880`06d8e66b dxgmms1!VidSchiProcessIsrCompletedPacket+0x1eb
fffff880`0b920e50  fffffa80`0e47ccc0
fffff880`0b920e58  fffffa80`0e44a410
fffff880`0b920e60  fffffa80`0e47c000
fffff880`0b920e68  fffffa80`0e47c000
fffff880`0b920e70  00000000`00000000
fffff880`0b920e78  00000000`00000000
fffff880`0b920e80  fffffa80`0e47ccc0
fffff880`0b920e88  00000000`00000000
fffff880`0b920e90  fffffa80`0fab3460
fffff880`0b920e98  00000000`00000d3f
fffff880`0b920ea0  00000000`00007111
fffff880`0b920ea8  00000000`00400120
fffff880`0b920eb0  00000000`0007cd31
fffff880`0b920eb8  00000000`00000001
fffff880`0b920ec0  00000000`00000000
fffff880`0b920ec8  00000000`00000000
fffff880`0b920ed0  fffffa80`0e44a410
fffff880`0b920ed8  fffff880`06d8e172 dxgmms1!VidSchDdiNotifyInterruptWorker+0x1ea
fffff880`0b920ee0  fffffa80`0e47c000
fffff880`0b920ee8  00000000`00000000
fffff880`0b920ef0  fffffa80`00000001
fffff880`0b920ef8  fffff880`0b921190
fffff880`0b920f00  fffff880`0b920f30
fffff880`0b920f08  00000000`00000005
fffff880`0b920f10  00000000`00000000
fffff880`0b920f18  00000000`00000000
fffff880`0b920f20  fffff880`0b921190
fffff880`0b920f28  fffff880`06d8df76 dxgmms1!VidSchDdiNotifyInterrupt+0x9e
fffff880`0b920f30  fffffa80`0007cd31
fffff880`0b920f38  00000000`00000000
fffff880`0b920f40  fffffa80`0e472000
fffff880`0b920f48  fffffa80`0fbf5010
fffff880`0b920f50  fffff880`0b921190
fffff880`0b920f58  fffff880`06c9513f dxgkrnl!DxgNotifyInterruptCB+0x83
^^ Okay, so from that raw stack, we can see quite a few DirectX Kernel & MMS calls (dxgkrnl and dxgmms1), as well as nVidia driver calls (nvlddmkm). This is good news, as this may be our problem (it gives us a good start as far as troubleshooting goes). Keep in mind that a raw stack can also contain leftovers from earlier calls, so this is a lead rather than proof. I'd like to note that there was much more than this, and that the raw stack went on for a very long time. I am just cutting it to a very small sample for blogging purposes.
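
With a raw stack this long, counting which modules show up can be quicker than reading it line by line. Here is a minimal Python sketch that tallies modules from saved raw stack text; the function name and regular expression are illustrative, and the sample is just six lines copied from the dump above:

```python
import re
from collections import Counter

# address, value, then a symbol of the form module!function+off or module+off
SYMBOL_LINE = re.compile(r"^[0-9a-f`]+\s+[0-9a-f`]+\s+(\w+)[!+]")

def modules_on_stack(raw_stack):
    """Count how often each module appears as a symbol in raw stack text."""
    counts = Counter()
    for line in raw_stack.splitlines():
        match = SYMBOL_LINE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts

sample = """\
fffff880`0b920998  fffff880`06d92095 dxgmms1!VidSchiUpdateContextRunningTimeAtISR+0x45
fffff880`0b9209a0  00000000`0035ce39
fffff880`0b9209d8  fffff880`06d8e66b dxgmms1!VidSchiProcessIsrCompletedPacket+0x1eb
fffff880`0b920bc8  fffff880`06c9513f dxgkrnl!DxgNotifyInterruptCB+0x83
fffff880`0b920bf8  fffff880`0fef37c9 nvlddmkm+0xbd7c9
fffff880`0b920c20  fffff880`0fef376f nvlddmkm+0xbd76f
"""

counts = modules_on_stack(sample)
for module, hits in counts.most_common():
    print(module, hits)
```

Lines that carry only a value, or that are interrupted by a symbol error message, are skipped rather than guessed at.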

--------------------

Let's check Processor #5:

5: kd> kv
Child-SP          RetAddr           : Args to Child                                                           : Call Site
00000000`00000000 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x0
5: kd> u @rip
00000000`00000000 ??              ???
            ^ Memory access error in 'u @rip'
^^ Looks like this specific processor was too hung at the time of the crash to report any information.

--------------------

Let's check Processor #6:

6: kd> kv
Child-SP          RetAddr           : Args to Child                                                           : Call Site
fffff880`03121c58 fffff800`0328da3a : 00000000`0035ce39 fffffa80`0d16d378 fffff880`030f9180 00000000`00000000 : intelppm!MWaitIdle+0x19
fffff880`03121c60 fffff800`032886cc : fffff880`030f9180 fffff880`00000000 00000000`00000000 fffff880`00000000 : nt!PoIdle+0x53a
fffff880`03121d40 00000000`00000000 : fffff880`03122000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiIdleLoop+0x2c
^^ Same as processors #1 and #2.

--------------------

Finally, let's check Processor #7:

7: kd> kv
Child-SP          RetAddr           : Args to Child                                                           : Call Site
fffff880`07a19930 fffff800`032c1c4a : 00000000`00000000 00000000`2d6c0fff fffffa80`00000000 fffffa80`00000000 : nt!MiDeleteVirtualAddresses+0x7d8
fffff880`07a19af0 fffff800`0327f153 : ffffffff`ffffffff 00000000`2174e6d0 00000000`2174e6c8 00000000`00008000 : nt!NtFreeVirtualMemory+0x5ca
fffff880`07a19be0 00000000`77a3009a : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x13 (TrapFrame @ fffff880`07a19be0)
00000000`2174e698 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x77a3009a
7: kd> .trap fffff880`07a19be0
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=0000000000000000 rbx=0000000000000000 rcx=0000000000000000
rdx=0000000000000000 rsi=0000000000000000 rdi=0000000000000000
rip=0000000077a3009a rsp=000000002174e698 rbp=0000000017a8f904
 r8=0000000000000000  r9=0000000000000000 r10=0000000000000000
r11=0000000000000000 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei pl nz na po nc
7: kd> u @rip
00000000`77a3009a ??              ???
            ^ Memory access error in 'u @rip'
^^ Cannot seem to go too far into processor #7, but it seemed to be doing virtual memory related things.
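
With all eight processors checked, it can help to lay the results side by side. The Python sketch below groups the processors by the top frame seen in each stack above; the three category names are my own rough labels, not debugger terminology:

```python
# Top frame per processor, copied from the stacks above (None = zeroed context).
TOP_FRAMES = {
    0: "nt!KeBugCheckEx",
    1: "intelppm!MWaitIdle+0x19",
    2: "intelppm!MWaitIdle+0x19",
    3: "nt!MiFlushTbAsNeeded+0x28a",
    4: None,
    5: None,
    6: "intelppm!MWaitIdle+0x19",
    7: "nt!MiDeleteVirtualAddresses+0x7d8",
}

def classify(top_frame):
    """Very rough label for what a processor was doing."""
    if top_frame is None:
        return "no context"
    if top_frame.startswith("intelppm!MWaitIdle"):
        return "idle"
    return "busy"

summary = {}
for processor, frame in sorted(TOP_FRAMES.items()):
    summary.setdefault(classify(frame), []).append(processor)

for state, processors in summary.items():
    print(state, processors)
```

That leaves three busy processors (#0, #3 and #7), three idle ones (#1, #2 and #6), and two with no usable context (#4 and #5).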

--------------------

Overall, from the above, this looks like a hardware issue: the video card, RAM, or the CPU itself. I'd like to say it's also possible for it to be a video driver causing corruption. We shall see.

I'm having the user go through hardware diagnostics, as well as a few other things, so I'll report back with any info when I have it.
