Is this "should not happen" crash an AMD Fusion CPU bug?
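My company has started getting calls from a number of customers because our program crashes with an access violation on their systems.
The crash happens in SQLite 3.6.23.1, which we ship as part of our application. (We ship a custom build so that it uses the same VC++ libraries as the rest of the app, but it is the stock SQLite code.)
As the WinDbg call stack shows, the crash happens when pcache1Fetch executes call 00000000: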
0b50e5c4 719f9fad 06fe35f0 00000000 000079ad 0x0
0b50e5d8 719f9216 058d1628 000079ad 00000001 SQLite_Interop!pcache1Fetch+0x2d [sqlite3.c @ 31530]
0b50e5f4 719fd581 000079ad 00000001 0b50e63c SQLite_Interop!sqlite3PcacheFetch+0x76 [sqlite3.c @ 30651]
0b50e61c 719fff0c 000079ad 0b50e63c 00000000 SQLite_Interop!sqlite3PagerAcquire+0x51 [sqlite3.c @ 36026]
0b50e644 71a029ba 0b50e65c 00000001 00000e00 SQLite_Interop!getAndInitPage+0x1c [sqlite3.c @ 40158]
0b50e65c 71a030f8 000079ad 0aecd680 071ce030 SQLite_Interop!moveToChild+0x2a [sqlite3.c @ 42555]
0b50e690 71a0c637 0aecd6f0 00000000 0001edbe SQLite_Interop!sqlite3BtreeMovetoUnpacked+0x378 [sqlite3.c @ 43016]
0b50e6b8 71a109ed 06fd53e0 00000000 071ce030 SQLite_Interop!sqlite3VdbeCursorMoveto+0x27 [sqlite3.c @ 50624]
0b50e824 71a0db76 071ce030 0b50e880 071ce030 SQLite_Interop!sqlite3VdbeExec+0x14fd [sqlite3.c @ 55409]
0b50e850 71a0dcb5 0b50e880 21f9b4c0 00402540 SQLite_Interop!sqlite3Step+0x116 [sqlite3.c @ 51744]
0b50e870 00629a30 071ce030 76897ff4 70f24970 SQLite_Interop!sqlite3_step+0x75 [sqlite3.c @ 51806]
The relevant line of C code is:
if( createFlag==1 ) sqlite3BeginBenignMalloc();
The compiler inlines sqlite3BeginBenignMalloc, which is defined as:
typedef struct BenignMallocHooks BenignMallocHooks;
static SQLITE_WSD struct BenignMallocHooks {
  void (*xBenignBegin)(void);
  void (*xBenignEnd)(void);
} sqlite3Hooks = { 0, 0 };

# define wsdHooksInit
# define wsdHooks sqlite3Hooks

SQLITE_PRIVATE void sqlite3BeginBenignMalloc(void){
  wsdHooksInit;
  if( wsdHooks.xBenignBegin ){
    wsdHooks.xBenignBegin();
  }
}
And the assembly for this is:
719f9f99 mov esi,dword ptr [esp+1Ch]
719f9f9d cmp esi,1
719f9fa0 jne SQLite_Interop!pcache1Fetch+0x2d (719f9fad)
719f9fa2 mov eax,dword ptr [SQLite_Interop!sqlite3Hooks (71a7813c)]
719f9fa7 test eax,eax
719f9fa9 je SQLite_Interop!pcache1Fetch+0x2d (719f9fad)
719f9fab call eax ; *** CRASH HERE ***
719f9fad mov ebx,dword ptr [esp+14h]
And the registers are:
eax=00000000 ebx=00000001 ecx=000013f0 edx=fffffffe esi=00000001 edi=00000000
eip=00000000 esp=0b50e5c8 ebp=000079ad iopl=0 nv up ei pl nz na po nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010202
If eax is 0 (which it is), test eax, eax should set the zero flag, yet the flag is clear. Because the zero flag is not set, je does not jump, and the app crashes trying to execute call eax (00000000).
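The flag state in the register dump can be checked by hand. The sketch below assumes the standard x86 EFLAGS layout (ZF in bit 6); the helper names are made up for illustration and are not part of SQLite or WinDbg.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed from the standard x86 EFLAGS layout: the zero flag is bit 6. */
#define EFLAGS_ZF (1u << 6)

/* Hypothetical helper: returns 1 if ZF is set in an EFLAGS value, else 0. */
int zero_flag_set(uint32_t eflags) {
    return (eflags & EFLAGS_ZF) != 0;
}

/* Hypothetical helper: the ZF value that `test reg, reg` should produce.
   TEST computes reg AND reg and sets ZF when the result is zero. */
int expected_zf_after_test(uint32_t reg) {
    return (reg & reg) == 0;
}

/* Self-check using the values from the register dump in the question. */
void check_register_dump(void) {
    const uint32_t efl = 0x00010202u; /* efl=00010202 */
    const uint32_t eax = 0x00000000u; /* eax=00000000 */
    assert(expected_zf_after_test(eax) == 1); /* test eax,eax should set ZF */
    assert(zero_flag_set(efl) == 0);          /* but the dump shows ZF clear */
}
```

Both assertions hold for the dumped values, which matches the "nz" in the WinDbg flag summary and is exactly the contradiction described above.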
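Update: eax should always be 0 here, because sqlite3Hooks.xBenignBegin is never set in our build of the code. I could rebuild SQLite with SQLITE_OMIT_BUILTIN_TEST defined, which would enable #define sqlite3BeginBenignMalloc() in the code and omit this code path entirely. That might solve the problem, but it doesn't feel like a "real" fix: what would stop the same thing from happening in some other code path?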
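To illustrate why defining that macro removes the indirect call, here is a minimal sketch of the pattern. All demo_ and DEMO_ names are hypothetical stand-ins, not SQLite's actual identifiers, and the counter exists only to make the effect observable.

```c
#include <assert.h>
#include <stddef.h>

/* Uncomment to mimic a build with the omit macro defined (hypothetical name). */
/* #define DEMO_OMIT_BUILTIN_TEST 1 */

typedef void (*demo_hook_fn)(void);
demo_hook_fn demo_begin_hook = NULL; /* never set, like xBenignBegin in the question */
int demo_hook_checks = 0;            /* counts how often the hook pointer is tested */

#ifndef DEMO_OMIT_BUILTIN_TEST
/* Normal build: load the pointer, test it, call through it if non-null. */
void demo_begin_benign(void) {
    demo_hook_checks++;
    if (demo_begin_hook) demo_begin_hook();
}
#define DEMO_EXPECTED_CHECKS 1
#else
/* Omit build: the call site expands to nothing, so no load/test/call is emitted. */
#define demo_begin_benign()
#define DEMO_EXPECTED_CHECKS 0
#endif

/* Stand-in for the call site: if( createFlag==1 ) sqlite3BeginBenignMalloc(); */
int demo_fetch(int createFlag) {
    if (createFlag == 1) demo_begin_benign();
    return createFlag;
}

void demo_omit(void) {
    demo_fetch(0); /* createFlag != 1: the hook is never consulted */
    assert(demo_hook_checks == 0);
    demo_fetch(1); /* consulted once in a normal build, never in an omit build */
    assert(demo_hook_checks == DEMO_EXPECTED_CHECKS);
}
```

With the macro defined, the statement after the if becomes empty, so the compiler has no pointer to load and no call to emit at that site. It does nothing for other indirect calls elsewhere in the program.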
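The common factor so far is that all of these customers are running "Windows 7 Home Premium 64-bit (6.1, Build 7601) Service Pack 1" and have one of the following CPUs (according to DxDiag):
- AMD A6-3400M APU with Radeon(tm) HD Graphics (4 CPUs), ~1.4GHz
- AMD A8-3500M APU with Radeon(tm) HD Graphics (4 CPUs), ~1.5GHz
- AMD A8-3850 APU with Radeon(tm) HD Graphics (4 CPUs), ~2.9GHz
According to Wikipedia's AMD Fusion article, these are all "Llano" model AMD Fusion chips based on the K10 core, and they were released in June 2011, which is when we first started getting reports.
The most common customer system is the Toshiba Satellite L775D, but we also have crash reports from HP Pavilion dv6 and dv7 and Gateway systems.
Could this crash be caused by a CPU error (see Errata for AMD Family 12h Processors), or is there another possible explanation that I'm overlooking? (According to Raymond, it could be overclocking, but it would be odd for only this specific CPU model to be affected.)
Honestly, it doesn't seem possible that this is really a CPU or OS error, because the customers aren't getting bluescreens or crashes in other applications. There must be some other, more likely explanation. But what?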
Update 15 August: I've acquired a Toshiba L745D notebook with an AMD A6-3400M processor and can reproduce the crash consistently when running the program. The crash is always on the same instruction; .time reports anywhere from 1m30s to 7m of user time before the crash. One fact (that may be pertinent to the issue) that I neglected to mention in the original post is that the application is multi-threaded and has both high CPU and I/O usage. The application spawns four worker threads by default and posts 80+% CPU usage (there is some blocking for I/O as well as for mutexes in the SQLite code) until it crashes. I modified the application to only use two threads, and it still crashed (although it took longer to happen). I'm now running a test with just one thread, and it hasn't crashed yet.
Note also that it doesn't appear to be purely a CPU load problem; I can run Prime95 without errors on the system and it will boost the CPU temperature to >70°C, while my application barely gets the temperature above 50°C while it's running.
Update 16 August: Perturbing the instructions slightly makes the problem "go away". For example, replacing the memory load (mov eax,dword ptr [SQLite_Interop!sqlite3Hooks (71a7813c)]) with xor eax, eax prevents the crash. Modifying the original C code to add an extra check to the if( createFlag==1 ) statement changes the relative offsets of various jumps in the compiled code (as well as the location of the test eax, eax and call eax statements) and also seems to prevent the problem.
The strangest result I've found so far is that changing the jne at 719f9fa0 to two nop instructions (so that control always falls through to the test eax, eax instruction, no matter what the value of createFlag/esi is) allows the program to run without crashing.
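The two-nop patch and the branch targets in the listing can be sketched as plain byte arithmetic. The question does not show the instruction bytes, so the encodings below (75 for jne rel8, 74 for je rel8, 90 for nop, and the byte sequence in the demo) are assumptions based on standard x86 encodings and the instruction lengths implied by the addresses. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed standard encodings: 0x75 = jne rel8, 0x90 = nop. */
enum { OP_JNE_REL8 = 0x75, OP_NOP = 0x90 };

/* Target of a 2-byte short jump at `addr`: next instruction plus signed rel8. */
uint32_t short_jump_target(uint32_t addr, int8_t rel8) {
    return addr + 2u + (uint32_t)(int32_t)rel8;
}

/* Overwrite a 2-byte jne rel8 at `offset` with two nops.
   Returns 0 on success, -1 if the bytes there are not a jne rel8. */
int patch_jne_with_nops(uint8_t *code, size_t len, size_t offset) {
    if (offset + 2 > len || code[offset] != OP_JNE_REL8) return -1;
    memset(code + offset, OP_NOP, 2);
    return 0;
}

void demo_patch(void) {
    /* Assumed bytes for 719f9fa0..719f9fac:
       jne +0x0B ; mov eax,[71a7813c] ; test eax,eax ; je +0x02 ; call eax */
    uint8_t code[] = {0x75, 0x0B, 0xA1, 0x3C, 0x81, 0xA7, 0x71,
                      0x85, 0xC0, 0x74, 0x02, 0xFF, 0xD0};
    /* Both branches in the listing land on 719f9fad, just past the call. */
    assert(short_jump_target(0x719f9fa0u, 0x0B) == 0x719f9fadu);
    assert(short_jump_target(0x719f9fa9u, 0x02) == 0x719f9fadu);
    /* After patching, execution always falls through to the mov and test. */
    assert(patch_jne_with_nops(code, sizeof code, 0) == 0);
    assert(code[0] == OP_NOP && code[1] == OP_NOP);
}
```

Two nops are needed because the short jne occupies two bytes; replacing it byte for byte leaves every later instruction at its original address, so only the branch itself changes.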
I spoke to an AMD engineer at the Microsoft Build conference about this error, and showed him my repro. He emailed me this morning:
We have investigated and found that this is due to a known errata in the Llano APU family. It can be fixed via a BIOS update depending on the OEM – if possible please recommend it to your customers (even though you have a workaround).
In case you’re interested, the errata is 665 in the Family 12h Revision Guide (see page 45): http://support.amd.com/TechDocs/44739_12h_Rev_Gd.pdf#page=45
Here's the description of that erratum:
665 Integer Divide Instruction May Cause Unpredictable Behavior
Description
Under a highly specific and detailed set of internal timing conditions, the processor core may abort a speculative DIV or IDIV integer divide instruction (due to the speculative execution being redirected, for example due to a mispredicted branch) but may hang or prematurely complete the first instruction of the non-speculative path.
Potential Effect on System
Unpredictable system behavior, usually resulting in a system hang.
Suggested Workaround
BIOS should set MSRC001_1029[31].
This workaround alters the DIV/IDIV instruction latency specified in the Software Optimization Guide for AMD Family 10h and 12h Processors, order# 40546. With this workaround applied, the DIV/IDIV latency for AMD Family 12h Processors are similar to the DIV/IDIV latency for AMD Family 10h Processors.
Fix Planned
No
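For anyone checking whether a given BIOS applies the workaround, the test reduces to one bit of that register. The sketch below only covers the bit arithmetic on a value that has already been read; reading an MSR needs a privileged rdmsr (a kernel driver or an OS-provided interface), which is deliberately not shown. The function names are hypothetical, and the MSR number is derived from the erratum's "MSRC001_1029" notation.

```c
#include <assert.h>
#include <stdint.h>

/* From the erratum text: MSRC001_1029, bit 31. */
#define MSR_C001_1029      0xC0011029u
#define DIV_WORKAROUND_BIT 31

/* Hypothetical helper: 1 if bit 31 is set in a raw 64-bit MSR value, else 0. */
int div_workaround_applied(uint64_t msr_value) {
    return (int)((msr_value >> DIV_WORKAROUND_BIT) & 1u);
}

/* Hypothetical helper: the value a BIOS would write back to apply the workaround. */
uint64_t with_div_workaround(uint64_t msr_value) {
    return msr_value | (UINT64_C(1) << DIV_WORKAROUND_BIT);
}

void demo_msr_bit(void) {
    uint64_t raw = 0; /* placeholder value, not read from real hardware */
    assert(div_workaround_applied(raw) == 0);
    assert(div_workaround_applied(with_div_workaround(raw)) == 1);
}
```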
I'm a bit concerned that the code generated for if (wsdHooks.xBenignBegin) isn't very general. It assumes the only true value is 1, whereas it should really be testing for any nonzero value. Still, MSVC is sometimes baffling that way. It is probably nothing.
Never mind: these instructions are for C code not presented.
Given that the EFLAGS Z bit is clear and EAX is zero, the code did not get here by executing the instruction 719f9fa7 test eax,eax. There must be a jump from somewhere else to the instruction following (719f9fa9 je SQLite_Interop!pcache1Fetch+0x2d) or even to the call instruction itself.
Another complication is that with the x86 family, it is common for an invalid jump target (like the second byte of the JE instruction) to execute unperturbed (no faults) for quite a few instructions, often eventually getting back on the proper instruction alignment. Said another way, you may not be looking for a jump to the beginning of any of these instructions: a jump might land in the midst of their bytes, resulting in executing unremarkable operations like add [eax+ebp],al, which tend not to be noticed.
I predict that a breakpoint at the test instruction will not be hit for the exception. The only ways to find such causes are either to be very lucky, or to suspect everything and prove each suspect innocent one by one.
Before considering the possibility of a CPU bug, try to rule out the more probable causes:
- A different code path to the call instruction. Use the uf command to disassemble the function and look for other jumps / branches to the call instruction.
- Jump / call to 0 from the hook function. Run dps SQLite_Interop!sqlite3Hooks l 2 and verify that it shows nulls.
Reference URL: https://stackoverflow.com/questions/7004728/is-this-should-not-happen-crash-an-amd-fusion-cpu-bug