current position:Home>Remember a CPU burst analysis of an entertainment chat streaming platform in .NET

Remember a CPU burst analysis of an entertainment chat streaming platform in .NET

2022-10-24 09:34:23first-line coders



Some time ago, a friend added WeChat,Said his program directly CPU=100%,You can only manually intervene and restart each time,Let me help see what's going on,哈哈,这种CPUfull of accidents,Programmers can be very stressed, I let my friends CPU catch when high 2 个 dump 下来,Then send it to me for analysis.

二:WinDbg 分析

1. CPU Is it really full??

To prevent divergences,一定要用 !tp command to check if it's real CPU 爆高.

0:000> !tp
CPU utilization: 100%
Worker Thread: Total: 21 Running: 7 Idle: 0 MaxLimit: 32767 MinLimit: 4
Work Request in Queue: 3
    AsyncTimerCallbackCompletion [email protected]
    AsyncTimerCallbackCompletion [email protected]
    AsyncTimerCallbackCompletion [email protected]
Number of Timers: 0
Completion Port Thread:Total: 18 Free: 9 MaxFree: 8 CurrentLimit: 18 MaxLimit: 1000 MinLimit: 4

从卦中看确实 100%,太牛了,而且 WorkRequest and task stacking,After confirming,Let's see who caused it?

2. 谁导致的 CPU 爆高

It is customary to first doubt whether it is GC 触发所致,可以用 !t Check out the thread list below,Observe whether GC 字样.

:000> !t
ThreadCount:      53
UnstartedThread:  0
BackgroundThread: 53
PendingThread:    0
DeadThread:       0
Hosted Runtime:   no
       ID OSID ThreadOBJ           State GC Mode     GC Alloc Context                  Domain           Count Apt Exception
   4    1 1240 00000000021cdf30    2a220 Preemptive  0000000000000000:0000000000000000 00000000021d94c0 0     MTA 
  23    2 4db4 00000000041cdaa0    2b220 Preemptive  0000000000000000:0000000000000000 00000000021d94c0 0     MTA (Finalizer) 
  65  156 22f4 000000000b1a3f60  8029220 Preemptive  00000004527751F0:0000000452775EE8 00000000021d94c0 0     MTA (Threadpool Completion Port) 
  66  205 2ef8 000000000b1a1080  8029220 Preemptive  0000000157641DE0:00000001576435B0 00000000021d94c0 0     MTA (Threadpool Completion Port) 

From the hexagram, no GC 字样,also indicates that this program is not GC 触发所致,What to do next? 一般来说 CPU The burst height is lifted by the thread,So next is to look at CPU The grades and stacks of individual threads,see if there are any new clues, 可以使用 ~*e !clrstack .

0:000> !cpuid
CP  F/M/S  Manufacturer     MHz
 0  6,79,1  <unavailable>   2299
 1  6,79,1  <unavailable>   2299
 2  6,79,1  <unavailable>   2299
 3  6,79,1  <unavailable>   2299

0:000> ~*e !clrstack 
OS Thread Id: 0x2cc4 (68)
        Child SP               IP Call Site
000000000c14e758 00007ffadeb86e4a [GCFrame: 000000000c14e758] 
000000000c14e840 00007ffadeb86e4a [GCFrame: 000000000c14e840] 
000000000c14e878 00007ffadeb86e4a [HelperMethodFrame: 000000000c14e878] System.Threading.Monitor.Enter(System.Object)
000000000c14e970 00007ffaceb40491 System.Net.ConnectionGroup.Disassociate(System.Net.Connection) [f:\dd\NDP\fx\src\net\System\Net\_ConnectionGroup.cs @ 148]
000000000c14e9d0 00007ffaceb3fc93 System.Net.Connection.PrepareCloseConnectionSocket(System.Net.ConnectionReturnResult ByRef) [f:\dd\NDP\fx\src\net\System\Net\_Connection.cs @ 3048]
000000000c14eaa0 00007ffacf139bfb System.Net.Connection.HandleError(Boolean, Boolean, System.Net.WebExceptionStatus, System.Net.ConnectionReturnResult ByRef) [f:\dd\NDP\fx\src\net\System\Net\_Connection.cs @ 3119]
000000000c14eb00 00007ffacebc4118 System.Net.Connection.ReadComplete(Int32, System.Net.WebExceptionStatus) [f:\dd\NDP\fx\src\net\System\Net\_Connection.cs @ 3387]
000000000c14eb80 00007ffacebe3dc5 System.Net.LazyAsyncResult.Complete(IntPtr) [f:\dd\NDP\fx\src\net\System\Net\_LazyAsyncResult.cs @ 415]
000000000c14ebe0 00007ffacebe3d07 System.Net.LazyAsyncResult.ProtectedInvokeCallback(System.Object, IntPtr) [f:\dd\NDP\fx\src\net\System\Net\_LazyAsyncResult.cs @ 368]
000000000c14ec20 00007ffacf3a476f System.Net.Security._SslStream.StartFrameBody(Int32, Byte[], Int32, Int32, System.Net.AsyncProtocolRequest)
000000000c14ec80 00007ffacebb3ed8 System.Net.Security._SslStream.ReadHeaderCallback(System.Net.AsyncProtocolRequest) [f:\dd\NDP\fx\src\net\System\Net\SecureProtocols\_SslStream.cs @ 1007]
000000000c14ece0 00007ffacebae5ee System.Net.AsyncProtocolRequest.CompleteRequest(Int32) [f:\dd\NDP\fx\src\net\System\Net\SecureProtocols\_HelperAsyncResults.cs @ 142]
000000000c14ed10 00007ffacf3a3567 System.Net.FixedSizeReader.CheckCompletionBeforeNextRead(Int32)
000000000c14ed40 00007ffacebae507 System.Net.FixedSizeReader.ReadCallback(System.IAsyncResult) [f:\dd\NDP\fx\src\net\System\Net\SecureProtocols\_FixedSizeReader.cs @ 148]
000000000c14ed90 00007ffacebe3dc5 System.Net.LazyAsyncResult.Complete(IntPtr) [f:\dd\NDP\fx\src\net\System\Net\_LazyAsyncResult.cs @ 415]
000000000c14edf0 00007ffadcbe3a63 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean) [f:\dd\ndp\clr\src\BCL\system\threading\executioncontext.cs @ 954]
000000000c14eec0 00007ffadcbe38f4 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean) [f:\dd\ndp\clr\src\BCL\system\threading\executioncontext.cs @ 902]
000000000c14eef0 00007ffadcbe38c2 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [f:\dd\ndp\clr\src\BCL\system\threading\executioncontext.cs @ 891]
000000000c14ef40 00007ffaceba60cf System.Net.ContextAwareResult.Complete(IntPtr) [f:\dd\NDP\fx\src\net\System\Net\_ContextAwareResult.cs @ 463]
000000000c14ef90 00007ffacebe3d07 System.Net.LazyAsyncResult.ProtectedInvokeCallback(System.Object, IntPtr) [f:\dd\NDP\fx\src\net\System\Net\_LazyAsyncResult.cs @ 368]
000000000c14efd0 00007ffaceba5e2f System.Net.Sockets.BaseOverlappedAsyncResult.CompletionPortCallback(UInt32, UInt32, System.Threading.NativeOverlapped*) [f:\dd\NDP\fx\src\net\System\Net\Sockets\_BaseOverlappedAsyncResult.cs @ 399]
000000000c14f040 00007ffadcc2ffef System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*) [f:\dd\ndp\clr\src\BCL\system\threading\overlapped.cs @ 135]
000000000c14f1f0 00007ffade9a6d93 [GCFrame: 000000000c14f1f0] 

OS Thread Id: 0x5d6c (70)
        Child SP               IP Call Site
000000000943e920 00007ffacf139bfb System.Net.Connection.HandleError(Boolean, Boolean, System.Net.WebExceptionStatus, System.Net.ConnectionReturnResult ByRef) 

OS Thread Id: 0x4ad4 (75)
        Child SP               IP Call Site
000000000c94e5a0 00007ffacf139bfb System.Net.Connection.HandleError(Boolean, Boolean, System.Net.WebExceptionStatus, System.Net.ConnectionReturnResult ByRef)

OS Thread Id: 0x1d70 (80)
        Child SP               IP Call Site
000000000d24e3a0 00007ffacf139bfb System.Net.Connection.HandleError(Boolean, Boolean, System.Net.WebExceptionStatus, System.Net.ConnectionReturnResult ByRef) [f:\dd\NDP\fx\src\net\System\Net\_Connection.cs @ 3119]

从线程栈上看,这个 CPU 是 4个核,刚好对应着 4 个 HandleError 报错,Looks like something is wrong with the network,接下来切到 80 No. thread to see if there is any exception class.

0:000> ~80s
00007ffa`deb86e40 4883e801        sub     rax,1
0:080> !mdso
Thread 80:
Location          Object            Type
000000000d24e098  000000015765e028  System.Net.WebException
000000000d24e0f8  0000000340b07110  System.Collections.ArrayList
000000000d24e110  000000015765e2b8  System.Net.HttpWebRequest[]
000000000d24e1c0  0000000340b070b8  System.Net.ConnectionGroup
000000000d24e258  0000000144a79678  System.Net.Connection

0:080> !mdt 000000015765e028
000000015765e028 (System.Net.WebException)
    _className:NULL (System.String)
    _exceptionMethod:NULL (System.Reflection.MethodBase)
    _exceptionMethodString:NULL (System.String)
    _message:000000015765df70 (System.String) Length=77, String="The underlying connection was closed: The connection was closed unexpectedly."

果然看到了 System.Net.WebException, From the abnormal information, it seems that 连接关闭了,Here I have a bold guess,Is it caused by abnormal high frequency output? CPU 爆高呢? 为了验证,You can find it on the managed heap WebException 的个数.

0:080> !dumpheap -stat
              MT    Count    TotalSize Class Name
00007ffacecc38b0    13315      2343440 System.Net.WebException
00007ffadcdf6570    11369      1909992 System.IO.IOException
00007ffadcdf5fb8    13380      2247840 System.ObjectDisposedException

It's scary to see so many anomalies,A friend just caught twodump可以做个比较.

0:048> !dumpheap -stat
              MT    Count    TotalSize Class Name
00007ffacecc38b0    26745      4707120 System.Net.WebException
00007ffadcdf6570    26722      4489296 System.IO.IOException
00007ffadcdf5fb8    28745      4829160 System.ObjectDisposedException

可以看到,2 min Abnormal increase in total 4w 左右,This verifies that the program is indeed throwing exceptions insanely,在 Windows on the platform硬件异常还是软件异常都由 Windows SEH Unified processing of exception handling framework,There will be a switch between user mode and kernel mode,Such crazy throws,必然会导致 CPU 爆高,终于找到原因了,The next step is to look for triggers.

3. who caused the anomaly

再回头看 HandleError The call stack of the function is the underlying library function,from the thread stack PerformIOCompletionCallback function is IO线程 诱发的,能被 IO The thread comes to the fore because this is done asynchronously,既然是 异步,自然 OverlappedData 也会非常多.

0:080> !gchandles -stat
              MT    Count    TotalSize Class Name
00007ffadc6f7b98    14511      1625232 System.Threading.OverlappedData
Total 17550 objects

    Strong Handles:       426
    Pinned Handles:       23
    Async Pinned Handles: 14511
    Ref Count Handles:    24
    Weak Long Handles:    2430
    Weak Short Handles:   132
    SizedRef Handles:     4

Explain that there is approximately 1.5w The asynchronous request of the pending return,The request volume is still quite large,But still no abnormal user code was found,Can only find out who initiated what request.

0:080> !mdso
Thread 80:
Location          Object            Type
000000000d24e488  0000000358c57918  System.Net.HttpWebRequest
000000000d24e2e8  00000001407b5b40  System.String  "net_io_readfailure"

0:080> !mdt -r:2 0000000358c57918
0000000358c57918 (System.Net.HttpWebRequest)
    _Uri:0000000358c57210 (System.Uri)
        m_String:00000002407ee430 (System.String) Length=98, String="https://api.xxxx/peer_messages"

You can see that the requested connection is https://api.xxxx/peer_messages,是一个第三方的API接口, due to the underlying connection being closed,导致了最后 net_io_readfailure.

Putting all the information together is:

When the request volume is large,访问 https://api.xxxx/peer_messages 会出问题,The peer closed the underlying connection,This leads to a large number of requests on the client side IO 回调异常:IOException: Unable to read data from the transport connection: The connection was closed.,2minbetween up to total 4w 的异常抛出,进而引发 CPU 爆高,tell a friend,Let your friends focus on https://api.xxxx/peer_messages 这个连接.

三: 总结

This production accident was mainly due to excessive demand during peak periods,由于某种原因 Socket 连接关闭,Causes a lot of asynchronous callback exceptions.

The solution is done on the calling side限流,According to friends reduced unnecessary https://api.xxxx/peer_messages 调用,目前没有出现 CPU high burst phenomenon.


copyright notice
author[first-line coders],Please bring the original link to reprint, thank you.

Random recommended