Blogs - Erik McClure

The Problem of Vsync

If you were to write directly to the screen when drawing a bouncing circle, you would run into some problems. Because you don’t do any buffering, your user might end up with a quarter circle drawn for a frame. This can be solved through Double Buffering, which means you draw the circle on to a backbuffer, then “flip” (or copy) the completed image on to the screen. This means you will only ever send a completely drawn scene to the monitor, but you will still have tearing issues. These are caused by trying to update the monitor outside of its refresh rate, meaning you will have only finished drawing half of your new scene over the old scene in the monitor’s video buffer when it updates itself, resulting in half the scanlines on the screen having the new scene and half still having the old scene, which gives the impression of tearing.

This can be solved with Vsync, which only flips the backbuffer right before the screen refreshes, effectively locking your frames per second to the refresh rate (usually 60 Hz or 60 FPS). Unfortunately, Vsync with double buffering is implemented by simply locking up the entire program until the next refresh cycle. In DirectX, this problem is made even worse because the API locks up the program with a 100% CPU polling thread, sucking up an entire CPU core just waiting for the screen to enter a refresh cycle, often for almost 13 milliseconds. So your program sucks up an entire CPU core when 90% of the CPU isn’t actually doing anything but waiting around for the monitor.

This waiting introduces another issue - Input lag. By definition any input given during the current frame can only come up when the next frame is displayed. However, if you are using vsync and double buffering, the current frame on the screen was the LAST frame, and the CPU is now twiddling its thumbs until the monitor is ready to display the frame that you have already finished rendering. Because you already rendered the frame, the input now has to wait until the end of the frame being displayed on the screen, at which point the frame that was already rendered is flipped on to the screen and your program finally realizes that the mouse moved. It now renders yet another frame taking into account this movement, but because of Vsync that frame is blocked until the next refresh cycle. This means, if you were to press a key just as a frame was put up on the monitor, you would have two full frames of input lag, which at 60 FPS is 33 ms. I can ping a server 20 miles away with a ping of 21 ms. You might as well be in the next city with that much latency.

There is a solution to this - Triple Buffering. The idea is a standard flip mechanism commonly used in dual-thread lockless synchronization scenarios. With two backbuffers, the application can write to one and once its finished, tell the API and it will mark it for flipping to the front-buffer. Then the application starts drawing on the second, after waiting for any flipping operation to finish, and once its done, marks that for flipping to the front-buffer and starts drawing on the first again. This way, the application can draw 2000 frames a second, but only 60 of those frames actually get flipped on to the monitor using what is essentially a lockless flipping mechanism. Because the application is now effectively rendering 2000 frames per second, there is no more input lag. Problem Solved.

Except not, because DirectX implements Triple Buffering in the most useless manner possible. DirectX just treats the extra buffer as a chain, and rotates through the buffers as necessary. The only advantage this has is that it avoids waiting for the backbuffer copy operation to finish before writing again, which is completely useless in an era where said copy operation would have to be measured in microseconds. Instead, it simply ensures that vsync blocks the program, which doesn’t solve the input issue at all.

However, there is a flag, D3DPRESENT_DONOTWAIT, that forces vsync to simply return an error if the refresh cycle isn’t available. This would allow us to implement a hack resembling what triple buffering should be like by simply rolling our own polling loop and re-rendering things in the background on the second backbuffer. Problem solved!

Except not. It turns out the Nvidia and Intel don’t bother implementing this flag, forcing Vsync to block no matter what you do, and to make matters worse, this feature doesn’t have an entry in D3DCAPS9, meaning the DirectX9 API just assumes that it exists, and there is no way to check if it is supported. Of course, don’t complain about this to anyone, because of the 50% of people who asked about this who weren’t simply ignored, almost all of them were immediately accused of bad profiling, and that the Present() function couldn’t possibly be blocking with the flag on. I question the wisdom of people who ignore the fact that the code executed its main loop 2000 times with vsync off and 60 times with it on and somehow come to the conclusion that Present() isn’t blocking the code.

Either way, we’re kind of screwed now. Absolutely no feature in DirectX actually does what its supposed to do, so there doesn’t seem to be a way past this input lag.

There is, however, another option. Clever developers would note that to get around vsync’s tendency to eat up CPU cycles like a pig, one could introduce a Sleep() call. So long as you left enough time to render the frame, you could recover a large portion of the wasted CPU. A reliable way of doing this is figuring out how long the last frame took to render, then subtracting that from the FPS you want to enforce and sleep in the remaining time. By enforcing an FPS of something like 80, you give yourself a bit of breathing room, but end up finishing rendering the frame around the same time it would have been presented anyway.

By timing your updates very carefully, you can execute a Sleep() call, then update all the inputs, then render the scene. This allows you to cut down the additional lag time by nearly 50% in ideal conditions, almost completely eliminating excess input lag. Unfortunately, if your game is already rendering at or below 100 FPS, it takes you 10 milliseconds to render a frame, allowing you only 2.5 milliseconds of extra time to look for input, which is of limited usefulness. This illustrates why Intel and Nvidia are unlikely to care about D3DPRESENT_DONOTWAIT - modern games will never render fast enough for substantial input lag reduction.

Remember when implementing the Yield that the amount of time it takes to render the frame should be the time difference between the two render calls, minus the amount of time spent sleeping, minus the amount of time Present() was blocking.

Published on September 5, 2011 at 4:43am

C# to C++ Tutorial - Part 2: Pointers Everywhere!

[ 1 · 2 · 3 · 4 · 5 · 6 · 7 ]

We still have a lot of ground to cover on pointers, but before we do, we need to address certain conceptual frameworks missing from C# that one must be intimately familiar with when moving to C++.

Specifically, in C# you mostly work with the Heap. The heap is not difficult to understand - its a giant lump of memory that you take chunks out of to allocate space for your classes. Anything using the new keyword is allocated on the heap, which ends up being almost everything in a C# program. However, the heap isn’t the only source of memory - there is also the Stack. The Stack is best described as what your program lives inside of. I’ve said before that everything takes up memory, and yes, that includes your program. The thing is that the Heap is inherently dynamic, while the Stack is inherently fixed. Both can be re-purposed to do the opposite, but trying to get the Stack to do dynamic allocation is extremely dangerous and is almost guaranteed to open up a mile-wide security hole.

I’m going to assume that a C# programmer knows what a stack is. All you need to understand is that absolutely every single piece of data that isn’t allocated on the heap is pushed or popped off your program’s stack. That’s why most debuggers have a “stack” of functions that you can go up and down. Understanding the stack in terms of how many functions you’re inside of is ok, but in reality, there are also variables declared on the stack, including every single parameter passed to a function. It is important that you understand how variable scope works so you can take advantage of declaring things on the stack, and know when your stack variables will simply vanish into nothingness. This is where { and } come in.

int main(int argc, char *argv[])
{
  int bunny = 1;
  
  {
    int carrot=3;
    int lettuce=8;
    bunny = 2; // Legal
  }

  //carrot=2; //Compiler error: carrot does not exist
  int carrot = 3; //Legal, since the other carrot no longer exists
  
  {
    int lettuce = 0;

    { 
       //int carrot = 1; //Compiler error: carrot already defined
       int grass = 9;
       
       bunny = grass; //Still legal
       bunny = carrot; // Also legal
    }
    
    //bunny = grass; //Illegal
    bunny = lettuce; //Legal
  }
  
  //bunny = lettuce; //Illegal
}

{ and } define scope. Anything declared inside of them ceases to exist outside, but is still accessible to any additional layers of scope declared inside of them. This is a way to see your program’s stack in action. When bunny is declared, its pushed on to the stack. Then we enter our first scope area, where we push carrot and lettuce on to the stack and set bunny to 2, which is legal because bunny is still on the stack. When the scope is then closed, however, anything declared inside the scope is popped from the stack in the exact opposite order it was pushed on. Unfortunately, compiler optimization might change that order behind the scenes, so don’t rely on it, but it should be fairly consistent in debug builds. First lettuce is de-allocated (and its destructor called, if it has one), then carrot is de-allocated. Consequently, trying to set carrot to 2 outside of the scope will result in a compiler error, because it doesn’t exist anymore. This means we can now declare an entirely new integer variable that is also called carrot, without causing an error. If we visualize this as a stack, that means carrot is now directly above bunny. As we enter a new scope area, lettuce is then put on top of carrot, and then grass is put on top of lettuce. We can still assign either lettuce or carrot to bunny, since they are all on the stack, but once we leave this inner scope, grass is popped off the stack and no longer exists, so any attempt to use it causes an error. lettuce, however, is still there, so we can assign lettuce to bunny before the scope closes, which pops lettuce off the stack.

Now the only things on the stack are bunny and carrot, in that order (if the compiler hasn’t moved things around). We are about to leave the function, and the function is also surrounded by { and }. This is because a function is, itself, a scope, so that means all variables declared inside of that scope are also destroyed in the order they were declared in. First carrot is destroyed, then bunny is destroyed, and then the function’s parameters argc and argv are destroyed (however the compiler can push those on to the stack in whatever order it wants, so we don’t know the order they get popped off), until finally the function itself is popped off the stack, which returns program flow to whatever called it. In this case, the function was main, so program flow is returned to the parent operating system, which does cleanup and terminates the process.

You can declare anything that has a size determined at compile time on the stack. This means if you have an array that has a constant size, you can declare it on the stack:

int array[5]; //Array elements are not initialized and therefore are undefined!
int array[5] = {0,0,0,0,0}; //Elements all initialized to 0
//int array[5] = {0}; // Compiler error - your initialization must match the array size

You can also let the compiler infer the size of the array:

int array[] = {1,2,3,4}; //Declares an array of 4 ints on the stack initialized to 1,2,3,4

Not only that, but you can declare class instances and other objects on the stack.

Class instance(arg1, arg2); //Calls a constructor with 2 arguments
Class instance; //Used if there are no arguments for the constructor
//Class instance(); //Causes a compiler error! The compiler will think its a function.

In fact, if you have a very simple data structure that uses only default constructors, you can use a shortcut for initializing its members. I haven’t gone over classes and structs in C++ yet (See Part 3), but here is the syntax anyway:

struct Simple
{
  int a;
  int b;
  const char* str;
};

Simple instance = { 4, 5, "Sparkles" };
//instance.a is now 4
//instance.b is now 5
//instance.str is now "Sparkles"

All of these declare variables on the stack. C# actually does this with trivial datatypes like int and double that don’t require a new statement to allocate, but otherwise forces you to use the Heap so its garbage collector can do the work.

Wait a minute, stack variables automatically destroy themselves when they go out-of-scope, but how do you delete variables allocated from the Heap? In C#, you didn’t need to worry about this because of Garbage Collection, which everyone likes because it reduces memory leaks (but even I have still managed to cause a memory leak in C#). In C++, you must explicitly delete all your variables declared with the new keyword, and you must keep in mind which variables were declared as arrays and which ones weren’t. In both C# and C++, there are two uses of the new keyword - instantiating a single object, and instantiating an array. In C++, there are also two uses of the delete keyword - deleting a single object and deleting an array. You cannot mix up delete statements!

int* Fluffershy = new int();
int* ponies = new int[10];

delete Fluffershy; // Correct
//delete ponies; // WRONG, we should be using delete [] for ponies
delete [] ponies; // Just like this
//delete [] Fluffershy; // WRONG, we can't use delete [] on Fluffershy because we didn't
                        // allocate it as an array.

int* one = new int[1];

//delete one; // WRONG, just because an array only has one element doesn't mean you can
              // use the normal delete!
delete [] one; // You still must use delete [] because you used new [] to allocate it.

As you can see, it is much easier to deal with stack allocations, because they are automatically deallocated, even when the function terminates unexpectedly. [std::auto_ptr](http://www.cplusplus.com/reference/std/memory/auto_ptr/) takes advantage of this by taking ownership of a pointer and automatically deleting it when it is destroyed, so you can allocate the auto_ptr on the stack and benefit from the automatic destruction. However, in C++0x, this has been superseded by [std::unique_ptr](http://msdn.microsoft.com/en-us/library/ee410601.aspx), which operates in a similar manner but uses some complex move semantics introduced in the new standard. I won’t go into detail about how to use these here as its out of the scope of this tutorial. Har har har.

For those of you who like throwing exceptions, I should point out common causes of memory leaks. The most common is obviously just flat out forgetting to delete something, which is usually easily fixed. However, consider the following scenario:

void Kenny()
{
  int* kenny = new int();
  throw "BLARG";
  delete kenny; // Even if the above exception is caught, this line of code is never reached.
}

int main(int argc, char* argv[])
{
  try {
  Kenny();
  } catch(char * str) { 
    //Gotta catch'em all.
  }
  return 0; //We're leaking Kenny! o.O
}

Even this is fairly common:

int main(int argc, char* argv[])
{
  int* kitty = new int();

  *kitty=rand();
  if(*kitty==0)
    return 0; //LEAK
  
  delete kitty;
  return 0;
}

These situations seem obvious, but they will happen to you once the code becomes enormous. This is one reason you have to be careful when inside functions that are very large, because losing track of if statements may result in you forgetting what to delete. A good rule of thumb is to make sure you delete everything whenever you have a return statement. However, the opposite can also happen. If you are too vigilant about deleting everything, you might delete something you never allocated, which is just as bad:

int main(int argc, char* argv[])
{
  int* rarity = new int();
  int* spike;

  if(rarity==NULL)
  {
    spike=new int();
  }
  else
  {
    delete rarity;
    delete spike; // Suddenly, in an alternate dimension, earth ceased to exist
    return 0;
  }
  
  delete rarity; // Since this only happens if the allocation failed and returned a NULL
                 // pointer, this will also blow up.
  delete spike;
  return 0;
}

Clearly, one must be careful when dealing with allocating and destroying memory in C++. Its usually best to encapsulate as much as possible in classes that automate such things. But wait, what about that NULL pointer up there? Now that we’re familiar with memory management, we’re going to dig into pointers again, starting with the NULL pointer.

Since a pointer points to a piece of memory that’s somewhere between 0 and 4294967295, what happens if its pointing at 0? Any pointer to memory location 0 is always invalid. All you need to know is that the operating system does some magic voodoo to ensure that any attempted access of memory location 0 will always throw an error, no matter what. 1, 2, 3, and any other double or single digit low numbers are also always invalid. 0xfdfdfdfd is what the VC++ debugger sets uninitialized memory to, so that pointer location is also always invalid. A pointer set to 0 is called a Null Pointer, and is usually used to signify that a pointer is empty. Consequently if an allocation function fails, it tends to return a null pointer. Null pointers are returned when the operation failed and a valid pointer cannot be returned. Consequently, you may see this:

int main(int argc, char* argv[])
{
  int* blink = new int();
  if(blink!=0) delete blink;
  blink=0;
  return 0;
}

This is known as a safe deletion. It ensures that you only delete a pointer if it is valid, and once you delete the pointer you set the pointer to 0 to signify that it is invalid. Note that NULL is defined as 0 in the standard library, so you could also say blink = NULL.

Since pointers are just integers, we can do pointer arithmetic. What happens if you add 1 to a pointer? If you think of pointers as just integers, one would assume it would simply move the pointer forward a single byte.

This isn’t what happens. Adding 1 to a pointer of type integer results in the pointer moving forward 4 bytes.

Adding or subtracting an integer $i$ from a pointer moves that pointer $i\cdot n$ bytes, where $n$ is the size, in bytes, of the pointer’s type. This results in an interesting parallel - adding or subtracting from a pointer is the same as treating the pointer as an array and accessing it via an index.

int main(int argc, char* argv[])
{
  int* kitties = new int[14];
  int* a = &kitties[7];
  int* b = kitties+7; //b is now the same as a
  int* c = &a[4];
  int* d = b+4; //d is now the same as c
  int* e = &kitties[11];
  int* f = kitties+11; 
  //c,d,e, and f now all point to the same location
}

So pointer arithmetic is identical to accessing a given index and taking the address. But what happens when you try to add two pointers together? Adding two pointers together is undefined because it tends to produce total nonsense. Subtracting two pointers, however, is defined, provided you subtract a smaller pointer from a larger one. The reason this is allowed is so you can do this:

int main(int argc, char* argv[])
{
  int* eggplants = new int[14];
  int* a = &eggplants[7];
  int* b = eggplants+10;
  int diff = b-a; // Diff is now equal to 3
  a += (diff*2); // adds 6 to a, making it point to eggplants[13]
  diff = a-b; // diff is again equal to 3
  diff = a-eggplants; //diff is now 13
  ++a; //The increment operator is valid on pointers, and operates the same way a += 1 would
  // So now a points to eggplants[14], which is not a valid location, but this is still
  // where the "end" of the array technically is.
  diff = a-eggplants; // Diff now equals 14, the size of the array
  --b; // Decrement works too
  diff = a-b; // a is pointing to index 14, b is pointing to 9, so 14-9 = 5. Diff is now 5.
  return 0;
}

There is a mistake in the code above, can you spot it? I used a signed integer to store the difference between the two pointers. What if one pointer was above 2147483647 and the other was at 0? The difference would overflow! Had I used an unsigned integer to store the difference, I’d have to be really damn sure that the left pointer was larger than the right pointer, or the negative value would also overflow. This complexity is why you have to goad windows into letting your program deal with pointer sizes over 2147483647.

In addition to arithmetic, one can compare two pointers. We already know we can use == and !=, but we can also use < > <= and >=. While you can get away with comparing two completely unrelated pointers, these comparison operators are usually used in a context like the following:

int main(int argc, char* argv[])
{
  int* teapots = new int[15];
  int* end = teapots+15;
  for(int* s = teapots; s<end; ++s)
    *s = 0;
  return 0;
}

Here the for loop increments the pointer itself rather than an index, until the pointer reaches the end, at which point it terminates. But, what if you had a pointer that didn’t have any type at all? void* is a legal pointer type, that any pointer type can be implicitly converted to. You can also explicitly cast void* to any pointer type you want, which is why you are allowed to explicitly cast any pointer type to another pointer type (int* p; short* q = (short*)p; is entirely legal). Doing so, however, is obviously dangerous. void* has its own problems, namely, how big is it? The answer is, you don’t know. Any attempt to use any kind of pointer arithmetic with a void* pointer will cause a compiler error. It is most often used when copying generic chunks of memory that only care about size in bytes, and not what is actually contained in the memory, like memcpy().

int main(int argc, char* argv[])
{
  int* teapots = new int[15];
  void* p = (void*)teapots;
  p++; // compiler error
  unsigned short* d = (unsigned short*)p;
  d++; // No compiler error, but you end up pointing to half an integer
  d = (unsigned short*)teapots; // Still valid
  return 0;
}

Now that we know all about pointer manipulation, we need to look at pointers to pointers, and to anchor this in a context that actually makes sense, we need to look at how C++ does multidimensional arrays. In C#, multidimensional arrays look like this:

int[,] table = new int[4,5];

C++ has a different, but fairly reasonable stack-based syntax. When you want to declare a multidimensional array on the heap, however, things start getting weird:

int unicorns[5][3]; // Well this seems perfectly reasonable, I wonder what-
  int (*cthulu)[50] = new int[10][50]; // OH GOD GET IT AWAY GET IT AWAAAAAY...!
  int c=5;
  int (*cthulu)[50] = new int[c][50]; // legal
  //int (*cthulu)[] = new int[10][c]; // Not legal. Only the leftmost parameter
                                      // can be variable
  //int (*cthulu)[] = new int[10][50]; // This is also illegal, the compiler is not allowed
                                       // to infer the constant length of the array.

Why isn’t the multidimensional array here just an int**? Clearly if int* x is equivalent to int x[], shouldn’t int** x be equivalent to int x[][]? Well, it is - just look at the main() function, its got a multidimensional array in there that can be declared as just char** argv. The problem is that there are two kinds of multidimensional arrays - square and jagged. While both are accessed in identical ways, how they work is fundamentally different.

Let’s look at how one would go about allocating a 3x5 square array. We can’t allocate a 3x5 chunk out of our computer’s memory, because memory isn’t 2-dimensional, its 1-dimensional. Its just freaking huge line of bytes. Here is how you squeeze a 2-dimensional array into a 1-dimensional line:

As you can see, we just allocate each row right after the other to create a 15-element array ($5\cdot 3 = 15$). But then, how do we access it? Well, if it has a width of 5, to access another “row” we’d just skip forward by 5. In general, if we have an $n$ by $m$ multidimensional array being represented as a one-dimensional array, the proper index for a coordinate $(x,y)$ is given by: array[x + (y*n)]. This can be extended to 3D and beyond but it gets a little messy. This is all the compiler is really doing with multidimensional array syntax - just automating this for you.

Now, if this is a square array (as evidenced by it being a square in 2D or a cube in 3D), a jagged array is one where each array is a different size, resulting in a “jagged” appearance:

We can’t possibly allocate this in a single block of memory unless we did a lot of crazy ridiculous stuff that is totally unnecessary. However, given that arrays in C++ are just pointers to a block of memory, what if you had a pointer to a block of memory that was an array of pointers to more blocks of memory?

Suddenly we have our jagged array that can be accessed just like our previous arrays. It should be pointed out that with this format, each inner-array can be in a totally random chunk of memory, so the last element could be at position 200 and the first at position 5 billion. Consequently, pointer arithmetic only makes sense within each column. Because this is an array of arrays, we declare it by creating an array of pointers. This, however, does not initialize the entire array; all we have now is an array of illegal pointers. Since each array could be a different size than the other arrays (this being the entire point of having a jagged array in the first place), the only possible way of initializing these arrays is individually, often by using a for loop. Luckily, the syntax for accessing jagged arrays is the exact same as with square arrays.

int main(int argc, char* argv[])
{
  int** jagged = new int*[5]; //Creates an array of 5 pointers to integers.
  for(int i = 0; i < 5; ++i)
  {
    jagged[i] = new int[3+i]; //Assigns each pointer to a new array of a unique size
  }
  jagged[4][1]=0; //Now we can assign values directly, or...
  int* second = jagged[2]; //Pull out one column, and
  second[0]=0; //manipulate it as a single array

  // The double-access works because of the order of operations. Since [] is just an
  // operator, it is evaluated from left to right, like any other operator. Here it is
  // again, but with the respective types that each operator resolves to in parenthesis.
  ( (int&) ( (int*&) jagged[4] ) [1] ) = 0;
}

As you can see above, just like we can have pointers to pointers, we can also have references to pointers, since pointers are just another data type. This allows you to re-assign pointer values inside jagged arrays, like so: jagged[2] = (int*)kitty. However, until C++0x, those references didn’t have any meaningful data type, so even though the compiler was using int*&, using that in your code will throw a compiler error in older compilers. If you need to make your code work in non-C++0x compilers, you can simply avoid using references to pointers and instead use a pointer to a pointer.

int* bunny;
int* value = new int[5];

int*& bunnyref = bunny; // Throws an error in old compilers
int** pbunny = &bunny; // Will always work
bunnyref = value; // This does the same exact thing as below.
*pbunny = value;

// bunny is now equal to value

This also demonstrates the other use of a pointer-to-pointer data type, allowing you to remotely manipulate a pointer just like a pointer allows you to remotely manipulate an integer or other value type. So obviously you can do pointers to pointers to pointers to pointers to an absurd degree of lunacy, but this is exceedingly rare so you shouldn’t need to worry about it. Now you should be strong in the art of pointer-fu, so our next tutorial will finally get into object-oriented techniques in C++ in comparison to C#. Part 3: Classes and Structs and Inheritance OH MY!

Published on July 21, 2011 at 7:35pm

C# to C++ Tutorial - Part 1: Basics of Syntax

[ 1 · 2 · 3 · 4 · 5 · 6 · 7 ]

When moving from C# to C++, one must have a very deep knowledge of what C# is actually doing when you run your program. Doing so allows you to recognize the close parallels between both languages, and why and how they are different. This tutorial will assume you have a fairly strong grasp of C#, but may not be familiar with some of its more arcane attributes.

In C#, everything is an object, or a static member of an object. You can’t have a function just floating around willy-nilly. However, like all programs, a C# program must have an entry-point. If you have primarily done GUI-based design, you probably aren’t aware of the entry-point that is automatically generated, but it is definitely there, and like everything else, it’s part of an object. C# actually allows you to change the entry point function, but a default C# project will automatically generate a Program.cs file that looks like this:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Windows.Forms;

namespace ScheduleTimer
{
  static class Program
  {
    /// <summary>
    /// The main entry point for the application.
    /// </summary>
    [STAThread]
    static void Main()
    {
      Application.EnableVisualStyles();
      Application.SetCompatibleTextRenderingDefault(true);
      Application.Run(new frmMain());
    }
  }
}

static void Main() is the real entry point for your application, which simply initializes visual styles and then immediately launches the form that most C# users are accustomed to working with. Now, we can compare this with a simple “Hello World” C++ program:

#include <iostream>

int main(int argc, char *argv[])
{
  std::cout << "Hello World";
  return 0;
}

This program, to a C# user, immediately looks foreign and possibly even outright hostile. However, almost everything in it has a direct analogue in C#, despite the rather inane syntax that is being used. The most glaring example here is the insertion operator, «, because almost no one ever uses it except for in streams and the fact that it’s in a C++ Hello World program creates an absurd amount of confusion. It’s just a fancy way of doing this:


#include <iostream>

int main(int argc, char *argv[])
{
  std::cout.write("Hello World",11);
  return 0;
}

Now, counting the number of bytes you are pumping into the stream is really annoying, and that’s what the insertion operator does for you; it properly formats everything automatically. That’s all. It’s not a demon from hell bent on destroying your life, its just weird syntax. I don’t know why they don’t also have this functionality in a much easier to understand overloaded function, but there are a lot of things that they don’t do, so we’ll just have to live with it.

The main() function here serves the same exact purpose as the Main() function in C#. Strict C++ requires you to have a main() function to serve as an entry point, but various operating systems modify it and, in the case of Windows, outright replace it. As such, you will notice that your “hello world” C++ program, when built, opens in a command line. You will learn later how to prevent this by using Windows’ proprietary entry function. For those of you familiar with C#, this is exactly the same as C#’s ability to change around the entry point of the application, and you can even make a command line application in C# too by properly changing the compiler settings. The same concept applies to C++, but unlike C#, which defaults to a GUI, C++ defaults to a command line. Changing the compiler settings properly will result in a C++ program that starts in a GUI, just like C# (although unlike C#, C++ doesn’t have any help, which turns GUI programming into a complete nightmare).

So now that we have a direct analogue between C# and C++ in terms of where our application starts, we need to deal with a conceptual difference in how C# and C++ handle dependencies. In C#, your class file is just Class.cs, your helper class is Helper.cs, and both of them can call the other one provided they are in the same namespace, or if you are inheriting someone else’s, using the correct using statements to resolve the code. If these concepts are not familiar to you, you should learn more about C# before delving further into C++.

C++, on the other hand, does not do behind-the-scenes magic to help you resolve your dependencies. To understand what C++ is doing, one must understand how any compiler resolves references inside code (including C#). When the C# compiler is compiling your project, it goes through each of your code files one by one and compiles everything to an intermediate object code that will later be compiled down into the machine code (or, in this case, bytecode, since C# is an interpreted language). But wait, what if it’s compiling Class.cs before Helper.cs even though Class instantiates a Helper object and calls some functions inside of it that then instantiate another Class object? Well, what if you compiled Helper.cs first… but Helper.cs needs Class.cs to be compiled first because its instantiating a Class object inside the function that the Class object is calling! That’s a circular dependency! ~~THIS IS IMPOSSIBLE OH GOD WE’RE GOING TO DIE~~ No, it’s actually quite simple to deal with. Enter prototypes. If you have the following C# class:

using System;

namespace FunFunBunBuns
{
  class Class
  {
    private int _yay;
    private int _bunnies;

    // Constructor
    public Class(int yay)
    {
      _yay = yay;
      _bunnies = 0; // :C
    }

    // Destructor
    public ~Class()
    {
      _yay = 0;
    }

    public void IncrementYay()
    {
      _yay++;
    }
    
    public int MakeBunnies(int num) // :D
    {
      _bunnies = _bunnies + num;
      return _bunnies;
    }
  }
}

Making “prototypes” of these functions (which C# doesn’t have so this will be invalid syntax) would be the following:

using System;

namespace FunFunBunBuns
{
  class Class
  {
    private int _yay;
    private int _bunnies;

    // Constructor
    public Class(int yay);
    // Destructor
    public ~Class();
    public void IncrementYay();
    public int MakeBunnies(int num);
  }
}

Notice the distinct lack of code - this is how circular references get resolved. It turns out that to properly compile your program, the compiler only has to know what a function takes in as arguments, and what it returns. By treating the function as a “black box” of sorts, the compiler can ignore whatever code is inside it. Notice that this applies to constructors and destructors as well - they are simply special functions inside the class. In this manner the entire class can be treated as a bunch of black-box functions that don’t actually have any code that needs to be compiled in them. What the C# compiler does is create a bunch of these prototypes behind the scenes and feed them in front of all your code files, so it first compiles Class.cs using a prototype of the Helper class, which allows it to instantiate and use any functions that Helper defines without actually knowing the code inside them. Then, it compiles Helper.cs, compiling assigning code to the previously empty black-box functions defined in the Helper prototype, using a prototype of Class so that it can also instantiate and call functions from Class. In this way, both Helper.cs and Class.cs can be compiled in any order.

But wait, what if Class inherits Helper? In reality, this changes nothing. An important lesson here is that, in C++, you will not be able to simply ignore the fact that everything is a function. Classes are just an abstraction - in reality, inheritance, constructors, deconstructors, operators, everything is just various special functions. Python’s class syntax is interesting because requires that all class functions explicitly define the self parameter (which is identical to the this reference in C++ and C#), even the class constructor. Both C++ and C# hide all this from you, so Constructors and Destructors and class functions all magically just work, even though underneath it all they’re just ordinary functions with a special parameter that’s hidden from view. This is, in fact, how one mimics class behavior in C, which does not have object-oriented features - simply build a struct and make a bunch of functions for it that take a “this” pointer, or a pointer to a specific struct on which the function operates. This behavior can be (needlessly) duplicated using C# - let’s transform our Class class to C-style function implementations, ignoring the slightly invalid C# syntax.

using System;

namespace FunFunBunBuns
{
  struct Class
  {
    private int _yay;
    private int _bunnies;
  };

  public Constructor(Class this, int yay)
  {
    this._yay = yay;
    this._bunnies = 0; // :C
  }

  public Destructor(Class this)
  {
    this._yay = 0;
  }

  public void IncrementYay(Class this)
  {
    this._yay++;
  }
    
  public int MakeBunnies(Class this, int num) // :D
  {
    this._bunnies = this._bunnies + num;
    return this._bunnies;
  }
}

Thankfully, we don’t have to worry about this, since thinking of class functions as functions that operate on the object is a lot more intuitive. However, one must be aware that even in inheritance scenarios, everything is just a function, or an overload of a virtual function, or something similar (if you do not know what virtual functions are, you need to learn more C# before proceeding). Consequently, our ability to declare function prototypes solves all the dependency issues, because everything is a function.

This is where we get into exactly what the #include directive is for. In C#, all your files are automatically accessible from every other file, and this isn’t a problem because compilation is nigh-instantaneous. C++ is much more intensive to compile, partially because it doesn’t have a precompiled 400 MB library of crap to work off of, and partially due to a much more complicated precompiler. That means in C++, if you want a given file to have access to another file, you have to #include that file. In our Hello World application, we are including iostream, which does not have a .h file extension on the end for stupid regulatory reasons. However, what about the file our code is in? Our code is not in a .h file, its in a .cpp file. This is where we get to a critical difference between C# and C++. While C# just has .cs files for code, C++ has two types of files: header files and code files.

.cpp == C++ (C-plus-plus) code file
.h = C++ Header file

Header files contain class and function prototypes, and code files contain all the actual code. A C++ project is therefore defined entirely by a list of .cpp files that need to be compiled. Header files are just little helper files that make resolving dependencies easier. C# does this for you - C++ does not. Note that because these are technically arbitrary file distinctions, you can put whatever you want in either file type; nothing will stop you from doing #include "main.cpp", its just ridiculous and confusing. Both #include <> and #include "" are valid syntax for the #include directive, there is no real difference. Standard procedure, however, is that #include <> is used for any header files outside of your project, and #include "" is used for header files inside your project, or closely related to it.

So what we’re doing when we say #include <iostream> is that we’re including a bunch of prototypes for various input/output stream (i/o stream –> iostream) related classes defined in the standard library, which your compiler already has the corresponding .cpp implementations of built into it. So, the compiler links the application against this header file, and when you use std::cout, it just treats everything in it (including that ridiculously obtuse << operator, which is really just another function) as a black-box function.

Consequently, unless you know what your doing, you should keep code out of header files. C++ doesn’t prevent you from throwing functions that aren’t attached to classes all over the place, like C# does, so what would happen if you defined int ponies() { return 0; } in a header file that you include in two seperate .cpp files? The compiler will try to compile the function twice, and on the second time it will explode because the function it tried to put code into already had code in it, since it wasn’t a prototype! EVERYTHING DIES! So until you get to the more advanced areas of C++, don’t put code in your header files (unless you want to watch your compiler die, you monster).

At this point I want to clarify what std:: is, because it looks rather weird to a C# programmer. In C#, the . operator works on everything - you just have System.Forms.Whatever.Help.Im.Trapped.In.A.Universe.Factory.Your.Class.Member.Function() and its all good. In C++, that’s not going to work anymore. The :: operator is known as the Scope Resolution Operator. It’s a lot easier to explain if I first explain what the . operator has been demoted to. You can only use the . operator on a reference or value type of an instantiated object (basically everything you’ve ever worked with in C#). The important distinction here is that static functions cannot be accessed with the . operator anymore. This is because Static functions, along with namespaces and typedefs and everything else must use the Scope Resolution Operator. Consequently, you can think of the . operator as being demoted to just calling class functions, and everything else now uses the :: operator. So, std::cout just means that we’re access the cout class in the std namespace.

Now we just have one more hurdle to overcome with the “Hello World” application, that funky char* argv[] parameter in main(). Most C# programmers can correctly infer that it is probably an array of some sort, but we don’t know what type char* is, other than its clearly related to char.

char* is a pointer. Yes, the same scary pointers you hear about all the time. No, they aren’t really scary. In fact, you have been using similar concepts in C# all the time without actually realizing it. First, however, let’s take a hard look at what a pointer really is.

Everything in your entire program takes up memory. Since this tutorial is designed for people who know C# already, I really, really hope you already knew that. What you might not know is that all this memory has a specific location on the machine. In fact, on a 32-bit machine, every single possible location of a byte can be contained in an unsigned 32-bit integer. This is why we are currently moving to 64-bit CPUs, because an unsigned 32-bit integer can only hold up to 4294967295 possible byte locations, which amounts to 4.2 gigs of memory. That’s why you are limited to 4 gigs of RAM on a 32-bit machine, and windows has difficulty using more than 2 gigs because a lot of older programs assumed that a signed 32-bit integer was sufficient for all memory addresses, so windows has to do some funky memory paging techniques to get programs that ignore the last bit to use memory locations above 2147483647.

So, if you allocate a float, either on the stack or on the heap, it must exist somewhere within those 4294967295 possible byte locations. Consequently, lets say you want to call a function that modifies that float, but the function has to have a void return value for some arbitrary reason. If you know where in memory that float is, you can tell the function where to find the float and modify it to the desired value without ever returning a value. Here is an example C++ function doing just that (which is syntactically valid all by itself because C++ allows functions outside of classes):

void ModifyFloat(float* p)
{
  *p = 100.0;
}

int main(int argc, char* argv[])
{
  float x = 0; //x is equal to 0.0
  ModifyFloat( &x );
  // x is now equal to 100.0
}

What’s going on here? First, we have our ModifyFloat() function. This takes a pointer to a float, which is declared by adding a * to the desired type we want to make a pointer to. Remember that pointers are really just 32-bit integers (or 64-bit if you have a 64-bit operating system), but C++ assigns them a type so that if you try to assign a double to a pointer to a float, it throws an error instead of overflowing 4 extra bytes, causing a heap corruption and destroying the universe. So char* is a pointer to a char, a double* points to a double, and Helper* is a pointer to our own Helper class.

The next thing done in ModifyFloat() is *p. In this case, the * operator is the dereference operator. So unfortunately * is the multiply, pointer, and dereference operator in C++. Yes, this is retarded. I’m sorry. But what the heck does dereference even mean? It takes a pointer type and turns it into a reference. You already know what a reference is, even if you don’t realize it. In C#, you can pass a variable of your Helper class into a function, modify the class in the function, and the original variable will get modified too! This is because, by default, classes are passed by-reference in C#. That means, even though it looks identical to a variable passed by value, any changes made to the variable are in fact made to whatever variable it references. So, this idea of passing variables in by reference should be familiar to an experienced C# programmer. C++ has references too, I just haven’t gone over their syntax. Here’s a more explicit version of the function:

void ModifyFloat(float* p)
{
  float& ref_p = *p;
  ref_p = 100.0;
}

This is the exact same as the previous function, but here we can clearly see the reference. In C#, if you wanted a variable normally passed by value, like a struct, to get passed by reference, you had to override the default behavior by adding ref. In C++, a variable that is a reference to a given type is declared in a similar manner to a pointer. The & operator is used instead of *, so in this example, float& is a reference to a float. We assign it to the value produced by turning our pointer into a float reference. Then we just set our reference equal to 100.0 and it magically alters the original variable, just like it would in C#. In fact, here is the same function written in (slightly illegal) C#:

public static void ModifyFloat(ref float p)
{
  p=100.0;
}

This does the same thing, just without the pointer. In fact, we can totally ignore the pointer in C++ too, if we want (which I tend to prefer, when possible, because its a lot easier to work with):

void ModifyFloat(float& ref_p)
{
  ref_p = 100.0;
}
int main(int argc, char* argv[])
{
  float x = 0; //x is equal to 0.0
  ModifyFloat( x );
  // x is now equal to 100.0
}

Now, in this implementation, you will notice that our call to ModifyFloat is now equivalent to what it would be in C#, in that we just pass in the variable. What happened to that random & operator we had there before? The & operator is also known as the address-of operator, meaning when its applied to a variable as opposed to a type, it returns a pointer to that variable (yay, more context-dependent redundant operators). So, we could rewrite our function as follows to make it a bit more clear:

void ModifyFloat(float* p)
{
  float& ref_p = *p; //get a reference from the pointer
  ref_p = 100.0; //modify the reference
}
int main(int argc, char* argv[])
{
  float x = 0; //x is equal to 0.0
  float* p_x = &x; //get a pointer to x
  ModifyFloat( p_x ); //pass pointer into function
  // x is now equal to 100.0 
}

As we can see, pointers are just the underlying work behind references. If you ever go into Managed C++, you’ll find out that all C# references are really just pointers, but the language treats them as references so they’re hidden from you. In C++, you can have both pointers and references. It is important to note that you can only initialize a reference variable. Any subsequent operators will be applied to whatever variable its referencing, making it impossible to get the address of a reference variable or do anything to the reference variable itself - for all intents and purposes, it just is the variable it references. This is why pointers are handy - you CAN reassign the actual pointer variable while also accessing the variable its pointing to. Consequently, you can also get the address of a pointer variable, since just like any other variable, including the reference variable, it must occupy memory, and therefore has a location that you can get a pointer to (we’ll get to that syntax in a minute). But there’s one more thing…

What about arrays? In C#, arrays are actually a built-in class that has lots of fancy functions and whatnot. Interestingly, they are still of fixed size. C++ arrays are also fixed size, but they are manipulated as raw memory. Let’s compare initializing an array in C++ and initializing an array in C#:

int[] numbers = new int[5];

int* numbers = new int[5];

It should be pretty obvious at this point that arrays are pointers in C++. I can even rewrite the above in C++ using an empty array syntax, and it will be equally as valid:

int numbers[] = new int[5];

int *x*[] is identical to int* *x*. There is no difference. Observe the following modification of our original Hello World function:

int main(int argc, char** argv);

Same thing. In fact, if you watch your compiler output carefully, you might even see the compiler internally convert all the arrays to pointers when its resolving types. Now, as a C# programmer, you should already know what arrays are. You should probably also be at least dimly aware that each element of an array occupies memory directly after the element proceeding it. So, if you know where the address of the first element is, you know the second element is exactly x bytes afterwards, where x is the number of bytes your type takes up. This is why pointers have types associated with them - we know that float* points to an array of elements, and that each element takes up 4 bytes. To verify this, the sizeof() built-in function/operator/whatever will return the number of bytes a given type, class, or struct takes up. That’s the number of bytes we skip ahead to get to the next element in an array. This is all done transparently in C++ using the same array index operator as C# uses:

int main(int argc, char** argv)
{
  int* ponies = new int[5];
  ponies[0] = 1; //First element..
  ponies[1] = 2; //Second element...
}

So pointers can be treated as arrays that behave exactly the same way a C# array does. However, the astute C# programmer would ask, how do you know how long the array is?

YOU DON'T

Enter every single buffer overflow error that has been the bane of man since the beginning of time. YOU have to keep track of how long the array is, and you’d better be damn sure you don’t get it wrong. Consequently any function taking an array of variable size will also require a separate argument telling the function how many elements are in the array. Usually arrays are just constructed on the stack with a constant, known size, which is often harmless and pretty hard to screw up. If you start doing funky things with them, though, you might want to look up std::vector for an encapsulated dynamic array.

So C++ arrays are just like C# arrays, except they are pointers to the first element, and you don’t know how long they are (and they might cause the destruction of the universe if you screw up). You should already know that a string is an array, and consequently in C++ the standard string type is const char*, not string. You also can’t put them in switch() statements. Sorry.

There’s a lot of stuff about pointers that this tutorial hasn’t covered, like function pointers and pointer arithmetic, which we’ll get to next time.

Part 2: Pointers To Everything

Published on July 6, 2011 at 4:26am

The Ninth Circle of Bugs

So I’m rewriting my 2D culling kd-tree for my graphics engine, and a strange bug pops up. On release mode, one of the images vanished. Since it didn’t happen in debug mode, it was already a heisenbug. A heisenbug is defined as a bug that vanishes when you try to find it. It took me almost a day to trace the bug to the rebalance function. At first I thought the image had simply been removed from a node accidentally, but this wasn’t the case. It took another day to finally figure out that the numimages variable was getting set to 0, thus causing the node to think it was empty and resulted in it deleting itself and removing itself from the tree (which caused all sorts of other problems).

Unfortunately, I could not verify the tree. Any attempt that so much as touched the tree’s memory would wipe out the bug, or so I thought. Then I tried adding the verification function into an if statement that would only activate if the bug appeared - it did not. The act of adding a line of code that was never executed actually caused the bug to vanish.

I was absolutely stunned. This was completely insane. Following the advice of a friend, I was forced to assume the compiler somehow screwed something up, so I randomly disabled various optimizations in release mode. It turned out that disabling the Omit Frame Pointers optimization removed the bug. I didn’t actually know what frame pointers were, only that I had turned on this optimization in many other projects for the hell of it and it had never caused any problems (no optimizations ever should, for that matter). What I discovered was astonishing. Frame pointers couldn’t be omitted from a function if it got too complicated or needed to unwind the stack due to a possible exception. On a hunch, instead of adding the verification function to the chunk of code that was only executed if the error occurred, I instead added a vestigial 'throw "derp";' line.

The problem vanished.

I knew instantly that either the problem was caused by the omission of frame pointers, which would indicate a bug in the VC++ 2010 compiler (unlikely), or when the frame pointers were included, it masked the bug (much more likely). But I also had another piece of knowledge at my disposal - exactly how I could modify the function without masking the bug. I considered decompiling the function and forcing VC++ to use the flawed assembly, but that didn’t allow me to modify the assembly in any meaningful way. A bit more experimentation revealed that any access of the root node, or for that matter, the 'this' pointer itself (unless it was for calling a function) caused the inclusion of the frame pointer. I realized that a global variable would be exempt from this, and that I might be able to get by this limitation by assigning the address of whatever variable I needed to the global variable and passing that into the function instead.

This approach, however, failed. In fact most attempts to get around the frame pointer inclusion failed. I did, however, notice what appeared to be a separate bug in another part of the tree. A short investigation later revealed an unrelated bug in the tree caused by the solve function. However, what was causing this bug (duplicated parentC pointers) still threw up errors after solving the first bug, indicating that it was possible this mysterious insane compiler induced bug was just a symptom of a deeper one that would be easier to detect. After more hunting, a second unrelated bug was found. Clearly this tree was not nearly as stable as I had thought it was.

A third bug was eventually found, and I discovered the root cause of this bug to be an #NaN float value in the tree. This should never ever, ever happen, because it destabilizes the tree, but sure enough, I finally found the cause.

_totalremove(node->total,(const float (&)[4])currect);

Casting from a float* that was previous cast from a float[4] causes read errors at totally random times, despite this being completely valid under the circumstances. My only guess is that the compiler somehow interpreted this cast as undefined behavior and went crazy. I will never know. All I know is that I should never, ever, ever, ever, ever cast to that data type ever again, because guess what? After removing all my debug equipement and putting the cast back in, I was able to reliable reproduce the bug that started this whole mess, and removing the cast made the bug vanish.

This entire week long trek through hell was because the compiler fucked up on a goddamn variable cast. It wasn’t a memory leak, it wasn’t a buffer overrun, it was just a goddamn miscast variable.

Lesson: Re-validate every inch of your data structure the instant you realize you have a heisenbug, and make sure your validation function properly checks for all things that can screw things up.

Published on May 15, 2011 at 4:43am

Investigating Low-level CPU Performance

While reconstructing my threaded Red-Black tree data structure, I naturally assumed that due to invalid branch predictions costing significant amounts of performance, by eliminating branching in low-level data structures, one can significant enhance the performance of your application. I did some profiling and was stunned to discover that my new, optimized Red Black tree was… SLOWER then the old one! This can’t be right, I eliminated several branches and streamlined the whole thing, how can it be SLOWER?! I tested again, and again, and again, but the results were clear - even with fluctuations of up to 5% in the results, the average speed for my new tree was roughly 7.5% larger then my old one (the following numbers are the average of 5 tests).

Old: 626699 ticks New: 674000 ticks

//Old
c = C(key, y->_key);
if(c==0) return y;
if(c<0) y=y->_left;
else y=y->_right;

//New
if(!(c=C(key,y->_key)))
return y;
else
y=y->_children[(++c)>>1];

Now, those of you familiar with CPU branching and other low-level optimizations might point out that the compiler may have optimized the old code path more effectively, leaving the new code path with extra instructions due to the extra increment and bitshift operations. Wrong. Both code paths have the exact same number of instructions. Furthermore, there are only FOUR instructions that are different between the two implementations (highlighted in red below).

New

00F72DE5  mov         esi,dword ptr [esp+38h]  
00F72DE9  mov         eax,dword ptr 
00F72DEE  cmp         esi,eax  
00F72DF0  je          main+315h (0F72E25h)  
00F72DF2  mov         edx,dword ptr [esp+ebx*4+4ECh]
00F72DF9  lea         esp,[esp]
00F72E00  mov         edi,dword ptr [esi+4]  
00F72E03  cmp         edx,edi  
00F72E05  jge         main+2FCh (0F72E0Ch)  
00F72E07  or          ecx,0FFFFFFFFh  
00F72E0A  jmp         main+303h (0F72E13h)  
00F72E0C  xor         ecx,ecx  
00F72E0E  cmp         edx,edi  
00F72E10  setne       cl  
00F72E13  movsx       ecx,cl  
00F72E16  test        ecx,ecx  
00F72E18  je          main+317h (0F72E27h)  
00F72E1A  inc         ecx  
00F72E1B  sar         ecx,1  

00F72E1D  mov         esi,dword ptr [esi+ecx*4+18h]  
00F72E21  cmp         esi,eax  
00F72E23  jne         main+2F0h (0F72E00h)  
00F72E25  xor         esi,esi  
00F72E27  mov         eax,dword ptr [esi]  
00F72E29  add         dword ptr [esp+1Ch],eax

Old

00F32DF0  mov         edi,dword ptr [esp+38h]  
00F32DF4  mov         ebx,dword ptr 
00F32DFA  cmp         edi,ebx  
00F32DFC  je          main+31Dh (0F32E2Dh)  
00F32DFE  mov         edx,dword ptr [esp+eax*4+4ECh]  

00F32E05  mov         esi,dword ptr [edi+4]  
00F32E08  cmp         edx,esi  
00F32E0A  jge         main+301h (0F32E11h)  
00F32E0C  or          ecx,0FFFFFFFFh  
00F32E0F  jmp         main+308h (0F32E18h)  
00F32E11  xor         ecx,ecx  
00F32E13  cmp         edx,esi  
00F32E15  setne       cl  
00F32E18  movsx       ecx,cl  
00F32E1B  test        ecx,ecx  
00F32E1D  je          main+31Fh (0F32E2Fh)  
00F32E1F  jns         main+316h (0F32E26h)  
00F32E21  mov         edi,dword ptr [edi+18h]  
00F32E24  jmp         main+319h (0F32E29h)  
00F32E26  mov         edi,dword ptr [edi+1Ch]  
00F32E29  cmp         edi,ebx  
00F32E2B  jne         main+2F5h (0F32E05h)  
00F32E2D  xor         edi,edi  
00F32E2F  mov         ecx,dword ptr [edi]  
00F32E31  add         dword ptr [esp+1Ch],ecx

I have no real explanation for this behavior, but I do have a hypothesis: The important instruction is the extra LEA in my new method that appears to be before the branch itself. As a result, it may be possible for the CPU to be doing branch prediction in such a way it shaves off one instruction, which gives it a significant advantage. It may also be that the branching is just faster then my increment and bitshift, although I find this highly unlikely. At this point I was wondering if anything I knew about optimization held any meaning in the real world, or if everything was just a lot of guesswork and profiling because what the fuck?! However, it then occurred to me that there was an optimization possible for the old version - Move the if(c==0) statement to the bottom so the CPU does the (c<0) and (c>0) comparisons first, since the c==0 comparison only happens once in the traversal. Naturally I was a bit skeptical of this having any effect on the assembly-rewriting, branch-predicting, impulsive teenage bitch that my CPU was at this point, but I tried it anyway.

It worked. There was a small but noticeable improvement in running time by using the old technique and rewriting the if statements as such:

c = C(key, y->_key);
if (c < 0)  y = y->_left;
else if(c > 0) y = y->_right;
else return y;

Optimized: 610161.8 Ticks

The total performance improvement over my failed optimization attempt and my more successful branch-manipulation technique is a whopping 63838.2 Ticks, or a ~10% improvement in speed, caused by simply rearranging 4 or 5 instructions. These tests were done on a randomized collection of 500000 integers, so that means the optimized version can pack in 10% more comparisons in the same period of time as the bad optimization. That’s 550000 vs 500000 elements, which seems to suggest that delicate optimization, even in modern CPUs, can have significant speed improvements. Those of you who say that toying around with low level code can’t infer significant performance increases should probably reconsider exactly what you’re claiming. This wouldn’t directly translate to 50000 extra players on your server, but a 10% increase in speed isn’t statistically insignificant.

Published on April 10, 2011 at 5:06pm