Erik McClure

The Ninth Circle of Bugs


So I’m rewriting my 2D culling kd-tree for my graphics engine, and a strange bug pops up. On release mode, one of the images vanished. Since it didn’t happen in debug mode, it was already a heisenbug. A heisenbug is defined as a bug that vanishes when you try to find it. It took me almost a day to trace the bug to the rebalance function. At first I thought the image had simply been removed from a node accidentally, but this wasn’t the case. It took another day to finally figure out that the numimages variable was getting set to 0, thus causing the node to think it was empty and resulted in it deleting itself and removing itself from the tree (which caused all sorts of other problems).

Unfortunately, I could not verify the tree. Any attempt that so much as touched the tree’s memory would wipe out the bug, or so I thought. Then I tried adding the verification function into an if statement that would only activate if the bug appeared - it did not. The act of adding a line of code that was never executed actually caused the bug to vanish.

I was absolutely stunned. This was completely insane. Following the advice of a friend, I was forced to assume the compiler somehow screwed something up, so I randomly disabled various optimizations in release mode. It turned out that disabling the Omit Frame Pointers optimization removed the bug. I didn’t actually know what frame pointers were, only that I had turned on this optimization in many other projects for the hell of it and it had never caused any problems (no optimizations ever should, for that matter). What I discovered was astonishing. Frame pointers couldn’t be omitted from a function if it got too complicated or needed to unwind the stack due to a possible exception. On a hunch, instead of adding the verification function to the chunk of code that was only executed if the error occurred, I instead added a vestigial 'throw "derp";' line.

The problem vanished.

I knew instantly that either the problem was caused by the omission of frame pointers, which would indicate a bug in the VC++ 2010 compiler (unlikely), or when the frame pointers were included, it masked the bug (much more likely). But I also had another piece of knowledge at my disposal - exactly how I could modify the function without masking the bug. I considered decompiling the function and forcing VC++ to use the flawed assembly, but that didn’t allow me to modify the assembly in any meaningful way. A bit more experimentation revealed that any access of the root node, or for that matter, the 'this' pointer itself (unless it was for calling a function) caused the inclusion of the frame pointer. I realized that a global variable would be exempt from this, and that I might be able to get by this limitation by assigning the address of whatever variable I needed to the global variable and passing that into the function instead.

This approach, however, failed. In fact most attempts to get around the frame pointer inclusion failed. I did, however, notice what appeared to be a separate bug in another part of the tree. A short investigation later revealed an unrelated bug in the tree caused by the solve function. However, what was causing this bug (duplicated parentC pointers) still threw up errors after solving the first bug, indicating that it was possible this mysterious insane compiler induced bug was just a symptom of a deeper one that would be easier to detect. After more hunting, a second unrelated bug was found. Clearly this tree was not nearly as stable as I had thought it was.

A third bug was eventually found, and I discovered the root cause of this bug to be an #NaN float value in the tree. This should never ever, ever happen, because it destabilizes the tree, but sure enough, I finally found the cause.

_totalremove(node->total,(const float (&)[4])currect);

Casting from a float* that was previous cast from a float[4] causes read errors at totally random times, despite this being completely valid under the circumstances. My only guess is that the compiler somehow interpreted this cast as undefined behavior and went crazy. I will never know. All I know is that I should never, ever, ever, ever, ever cast to that data type ever again, because guess what? After removing all my debug equipement and putting the cast back in, I was able to reliable reproduce the bug that started this whole mess, and removing the cast made the bug vanish.

This entire week long trek through hell was because the compiler fucked up on a goddamn variable cast. It wasn’t a memory leak, it wasn’t a buffer overrun, it was just a goddamn miscast variable.

Lesson: Re-validate every inch of your data structure the instant you realize you have a heisenbug, and make sure your validation function properly checks for all things that can screw things up.


Investigating Low-level CPU Performance


While reconstructing my threaded Red-Black tree data structure, I naturally assumed that due to invalid branch predictions costing significant amounts of performance, by eliminating branching in low-level data structures, one can significant enhance the performance of your application. I did some profiling and was stunned to discover that my new, optimized Red Black tree was… SLOWER then the old one! This can’t be right, I eliminated several branches and streamlined the whole thing, how can it be SLOWER?! I tested again, and again, and again, but the results were clear - even with fluctuations of up to 5% in the results, the average speed for my new tree was roughly 7.5% larger then my old one (the following numbers are the average of 5 tests).

Old: 626699 ticks New: 674000 ticks

//Old
c = C(key, y->_key);
if(c==0) return y;
if(c<0) y=y->_left;
else y=y->_right;

//New
if(!(c=C(key,y->_key)))
return y;
else
y=y->_children[(++c)>>1];

Now, those of you familiar with CPU branching and other low-level optimizations might point out that the compiler may have optimized the old code path more effectively, leaving the new code path with extra instructions due to the extra increment and bitshift operations. Wrong. Both code paths have the exact same number of instructions. Furthermore, there are only FOUR instructions that are different between the two implementations (highlighted in red below).

New
00F72DE5  mov         esi,dword ptr [esp+38h]  
00F72DE9  mov         eax,dword ptr 
00F72DEE  cmp         esi,eax  
00F72DF0  je          main+315h (0F72E25h)  
00F72DF2  mov         edx,dword ptr [esp+ebx*4+4ECh]
00F72DF9  lea         esp,[esp]
00F72E00  mov         edi,dword ptr [esi+4]  
00F72E03  cmp         edx,edi  
00F72E05  jge         main+2FCh (0F72E0Ch)  
00F72E07  or          ecx,0FFFFFFFFh  
00F72E0A  jmp         main+303h (0F72E13h)  
00F72E0C  xor         ecx,ecx  
00F72E0E  cmp         edx,edi  
00F72E10  setne       cl  
00F72E13  movsx       ecx,cl  
00F72E16  test        ecx,ecx  
00F72E18  je          main+317h (0F72E27h)  
00F72E1A  inc         ecx  
00F72E1B  sar         ecx,1  

00F72E1D  mov         esi,dword ptr [esi+ecx*4+18h]  
00F72E21  cmp         esi,eax  
00F72E23  jne         main+2F0h (0F72E00h)  
00F72E25  xor         esi,esi  
00F72E27  mov         eax,dword ptr [esi]  
00F72E29  add         dword ptr [esp+1Ch],eax
Old
00F32DF0  mov         edi,dword ptr [esp+38h]  
00F32DF4  mov         ebx,dword ptr 
00F32DFA  cmp         edi,ebx  
00F32DFC  je          main+31Dh (0F32E2Dh)  
00F32DFE  mov         edx,dword ptr [esp+eax*4+4ECh]  

00F32E05  mov         esi,dword ptr [edi+4]  
00F32E08  cmp         edx,esi  
00F32E0A  jge         main+301h (0F32E11h)  
00F32E0C  or          ecx,0FFFFFFFFh  
00F32E0F  jmp         main+308h (0F32E18h)  
00F32E11  xor         ecx,ecx  
00F32E13  cmp         edx,esi  
00F32E15  setne       cl  
00F32E18  movsx       ecx,cl  
00F32E1B  test        ecx,ecx  
00F32E1D  je          main+31Fh (0F32E2Fh)  
00F32E1F  jns         main+316h (0F32E26h)  
00F32E21  mov         edi,dword ptr [edi+18h]  
00F32E24  jmp         main+319h (0F32E29h)  
00F32E26  mov         edi,dword ptr [edi+1Ch]  
00F32E29  cmp         edi,ebx  
00F32E2B  jne         main+2F5h (0F32E05h)  
00F32E2D  xor         edi,edi  
00F32E2F  mov         ecx,dword ptr [edi]  
00F32E31  add         dword ptr [esp+1Ch],ecx

I have no real explanation for this behavior, but I do have a hypothesis: The important instruction is the extra LEA in my new method that appears to be before the branch itself. As a result, it may be possible for the CPU to be doing branch prediction in such a way it shaves off one instruction, which gives it a significant advantage. It may also be that the branching is just faster then my increment and bitshift, although I find this highly unlikely. At this point I was wondering if anything I knew about optimization held any meaning in the real world, or if everything was just a lot of guesswork and profiling because what the fuck?! However, it then occurred to me that there was an optimization possible for the old version - Move the if(c==0) statement to the bottom so the CPU does the (c<0) and (c>0) comparisons first, since the c==0 comparison only happens once in the traversal. Naturally I was a bit skeptical of this having any effect on the assembly-rewriting, branch-predicting, impulsive teenage bitch that my CPU was at this point, but I tried it anyway.

It worked. There was a small but noticeable improvement in running time by using the old technique and rewriting the if statements as such:

c = C(key, y->_key);
if (c < 0)  y = y->_left;
else if(c > 0) y = y->_right;
else return y;

Optimized: 610161.8 Ticks

The total performance improvement over my failed optimization attempt and my more successful branch-manipulation technique is a whopping 63838.2 Ticks, or a ~10% improvement in speed, caused by simply rearranging 4 or 5 instructions. These tests were done on a randomized collection of 500000 integers, so that means the optimized version can pack in 10% more comparisons in the same period of time as the bad optimization. That’s 550000 vs 500000 elements, which seems to suggest that delicate optimization, even in modern CPUs, can have significant speed improvements. Those of you who say that toying around with low level code can’t infer significant performance increases should probably reconsider exactly what you’re claiming. This wouldn’t directly translate to 50000 extra players on your server, but a 10% increase in speed isn’t statistically insignificant.


Album For Sale! [Renascent]


Due to Bandcamp’s sudden threat to turn all of my free downloads into paid ones, I decided to go ahead and start selling my music properly. Renascent is now available for $3, or about as much as a gallon of milk costs. It contains remastered, super high quality (lossless if you choose to download in FLAC format) versions of all 14 songs, in addition to the original FLP project files used to create them. If you have ever wondered how I made a particular song, this might be another incentive to purchase the album. Note that these FLPs are released under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 license, so you can’t go running off with them like free candy.

Track List:

  1. On The Edge (2:56)
  2. Renascent (4:06)
  3. The Boundless Sea (6:49)
  4. Duress (2:40)
  5. Seaside Lookout (4:54)
  6. Sapphire [Redux] (2:20)
  7. Absolutia (3:04)
  8. The Plea (3:46)
  9. Now (2:34)
  10. Alutia (4:10)
  11. Rite (5:20)
  12. Crystalline Cloudscape (4:04)
  13. All Alone (3:06)
  14. SunStorm (4:12)

Total Time: 56:44

Listen and Buy It Here


WavSaver


There is a documented bug in windows 7 that has pissed me off a few times and recently crippled a friend of mine, where a .wav file with corrupted metadata causes explorer.exe to go into an infinite loop. My friend has a large collection of wavs that somehow got corrupted, so I wrote this program to strip them of all metadata. Due to the nature of the bug, the program can’t delete them (you must use the command prompt to do that), but rather creates a folder called “safe” with all the stripped wav files inside of it.

Hosted in case anyone else has corrupted wav files they need to save. Just stick it inside a folder and run it - it’ll automatically strip all wav files in the same folder as the executable.

WavSaver


Pixel Perfect Hit Testing


After beating World of Goo after stabilizing things in my game and renaming it, I wondered how easy it was to decompile C# applications and simultaneously thought this would be a great opportunity to get pixel perfect hit testing to work on my engine. So, I decompiled GearGOD’s composition example and quickly discovered that his method of detecting mouse messages was… well something completely different then his extremely bad attempt at explaining it to me had suggested.

Basically, he did not run into the window event issues that I was having because… he didn’t use them. XNA keeps track of the mouse coordinates in its own separate update function, most likely using its special input hook, and hence there is no mousemove to keep track of. Instead of occurring when the user moves the mouse, the hit tests occur every single frame.

Hence, once you have utilized WS_EX_TRANSPARENT|WS_EX_COMPOSITED|WS_EX_LAYERED to make your window click-through-able, you then simply do a hit test on a given pixel after everything has been drawn, and swap out WS_EX_TRANSPARENT depending on the value. GetCursorPos and ScreenToClient will get the mouse coordinates you need, although they can be off your app window entirely so check for that too.

if(_dxDriver->MouseHitTest(GetMouseExact()))
  SetWindowLong(_window,GWL_EXSTYLE,((GetWindowLong(_window,GWL_EXSTYLE))&(~WS_EX_TRANSPARENT)));
else
  SetWindowLong(_window,GWL_EXSTYLE,((GetWindowLong(_window,GWL_EXSTYLE))|WS_EX_TRANSPARENT));

To get the pixel value, its a bit trickier. You have two options - you can make a lockable render target, or you can copy the render target to a temporary texture and lock that instead. The DirectX docs said that locking a render target is so expensive you should just copy it over, but after GearGOD went and yelled at me I tested the lockable render target method, and it turns out to be significantly faster. Futher speed gains can be achieved by making a 1x1 lockable render target and simply copying a single pixel from the backbuffer into the lockable render target and testing that.

void cDirectX_real::ActivateMouseCheck()
{
  if(_mousehittest) _mousehittest->Release();
  DX3D_device->CreateRenderTarget(1,1,_holdparams.BackBufferFormat,D3DMULTISAMPLE_NONE,0,TRUE,&_mousehittest,NULL);
}
bool cDirectX_real::MouseHitTest(const cPositioni& mouse)
{
  RECT rect = { mouse.x,mouse.y,mouse.x+1,mouse.y+1 };
  DX3D_device->StretchRect(_backbuffer,&rect,_mousehittest,0,D3DTEXF_NONE);

  if(mouse.x<0 || mouse.y < 0 || mouse.x > (int)_width || mouse.y > (int)_height)
    return false; //off the stage
  D3DLOCKED_RECT desc = { 0,0 };
  if(FAILED(_mousehittest->LockRect(&desc, 0,D3DLOCK_READONLY)))
    return true;    

  unsigned char color = (*((unsigned long*)desc.pBits))>>24;
  _mousehittest->UnlockRect();
  return color>_alphacutoff;
}

Using this method, the performance hit is 620 FPS to 510 FPS at 1280x1024, which is fairly reasonable. However, my Planeshader SDK example is still at 0.9.71, which does not have this updated, fast version, so it will be using a much slower method to do it. The end result is the same though.


Avatar

Archive

  1. 2024
  2. 2023
  3. 2022
  4. 2021
  5. 2020
  6. 2019
  7. 2018
  8. 2017
  9. 2016
  10. 2015
  11. 2014
  12. 2013
  13. 2012
  14. 2011
  15. 2010
  16. 2009