Erik McClure

8-bit color cycling


Someone linked me to this awesome webpage that uses HTML5 to do 8-bit palette color cycling using Mark Ferrari’s technique and art. I immediately wanted to implement it in my graphics engine, but soon realized that the technique is so damn old that no modern graphics card supports it anymore. So, I have come up with a pixel shader that recreates the same functionality, either by using one image with an alpha channel containing the palette indices plus a separate texture acting as the palette, or by combining them into a single image. This is supposed to support variable palette sizes (up to 256), but I haven’t been able to test it much because it’s so damn hard to get the images formatted correctly. So, while all of the variations I’m about to show you should work, there is no guarantee that they will.

Video Link

8-bit cycling multi-image

ps 2.0 HLSL
// Global variables
float frame; // position in the palette animation, 0.0 to 1.0
float xdim;  // 255/(width of palette): scales the alpha index to a U coordinate
float xoff;  // half-texel offset into the palette

// Samplers
sampler s0 : register(s0); // indexed image (palette index stored in the alpha channel)
sampler s1 : register(s1); // palette texture

float4 ps_main( float2 texCoord : TEXCOORD0 ) : COLOR0
{
  float4 mainlookup = tex2D( s0, texCoord );                // read the palette index from alpha
  float2 palette = float2(mainlookup.a*xdim + xoff, frame); // build the palette lookup coordinate
  mainlookup = tex2D(s1, palette);                          // fetch the actual color from the palette
  return mainlookup;
}

 
ps 1.4 ASM
ps.1.4
texld r0, t0           // sample the indexed image
mad r0.x, r0.a, c1, c2 // x = alpha*xdim + xoff (c1 = xdim, c2 = xoff)
mov r0.y, c0           // y = frame (c0)
phase
texld r1, r0           // sample the palette at (x, y)
mov r0, r1             // output the palette color
 
It is also possible to write the shader in ps.1.1, but it requires crazy UV coordinate hacks.

frame is a value from 0.0 to 1.0 (ps.1.4 will not allow you to wrap this value, but ps.2.0 will) that specifies how far through the palette animation you are.
xdim = 255/(width of palette)
xoff = 1/(2*(width of palette))

Note that all assembly registers correspond to a variable in order of its declaration. So, c0 = frame, c1 = xdim, c2 = xoff.
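
For reference, here is a minimal sketch of how those constants might be computed and uploaded for the ps.1.4 version on a D3D9-style device. The palette width and cycle period parameters are hypothetical, and this is not code from the engine itself:

#include <d3d9.h>
#include <cmath>

// Hypothetical helper: computes frame, xdim, and xoff and uploads them to the
// c0/c1/c2 registers used by the ps.1.4 shader above.
void SetCycleConstants(IDirect3DDevice9* device, float paletteWidth,
                       double elapsedSeconds, double cyclePeriod)
{
  // frame wraps from 0.0 to 1.0 over one full palette animation cycle.
  float frame = (float)std::fmod(elapsedSeconds / cyclePeriod, 1.0);
  float xdim = 255.0f / paletteWidth;        // scales the 0-1 alpha index to a U coordinate
  float xoff = 1.0f / (2.0f * paletteWidth); // half-texel offset so we hit texel centers

  float c0[4] = { frame, frame, frame, frame }; // c0 = frame
  float c1[4] = { xdim, xdim, xdim, xdim };     // c1 = xdim
  float c2[4] = { xoff, xoff, xoff, xoff };     // c2 = xoff
  device->SetPixelShaderConstantF(0, c0, 1);
  device->SetPixelShaderConstantF(1, c1, 1);
  device->SetPixelShaderConstantF(2, c2, 1);
}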

8-bit cycling single-image

ps 2.0 HLSL
// Global variables
float frame; // vertical position of the current palette row, 0.0 to (palette height)/(image height)
float xdim;  // 255/(image width)
float xoff;  // half-texel offset: 1/(2*(image width))

// Samplers
sampler s0 : register(s0); // single image containing both the indexed art and the palette rows

float4 ps_main( float2 texCoord : TEXCOORD0 ) : COLOR0
{
  float4 mainlookup = tex2D(s0, texCoord );                 // read the palette index from alpha
  float2 palette = float2(mainlookup.a*xdim + xoff, frame); // palette rows are embedded in the same texture
  mainlookup = tex2D(s0, palette);                          // second lookup into the same image
  mainlookup.a = 1.0f;                                      // force the output to be opaque
  return mainlookup;
}

 
ps 1.4 ASM
ps.1.4
def c3, 1.0, 1.0, 1.0, 1.0
texld r0, t0            // sample the image
mov r1, r0
mad r1.x, r1.a, c1, c2  // x = alpha*xdim + xoff
mov r1.y, c0            // y = frame
phase
texld r0, r1            // second sample: the embedded palette
mov r0.a, c3            // force alpha to 1.0

frame is now a value between 0.0 and (palette height)/(image height).
xdim = 255/(image width)
xoff = 1/(2*(image width))
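
For example, with a hypothetical 256-pixel-wide image whose embedded palette occupies the top 16 rows of a 240-pixel-tall texture, xdim = 255/256 ≈ 0.996, xoff = 1/512 ≈ 0.00195, and frame cycles from 0.0 up to 16/240 ≈ 0.0667.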

24-bit cycling

ps 2.0 HLSL
// Global variables
float frame;
float xdim;
float xoff;

// Samplers
sampler s0 : register(s0); // full-color image, with the palette index stored in its alpha channel
sampler s1 : register(s1); // palette texture with its own alpha channel

float4 ps_main( float2 texCoord : TEXCOORD0 ) : COLOR0
{
  float4 mainlookup = tex2D( s0, texCoord );
  float2 palette = float2(mainlookup.a*xdim + xoff, frame);
  float4 lookup = tex2D(s1, palette);          // palette color plus its blend alpha
  return lerp(mainlookup, lookup, lookup.a);   // alpha 0 keeps the original 24-bit color
}

 
ps 1.4 ASM
ps.1.4
def c3, 1.0, 1.0, 1.0, 1.0
texld r0, t0            // sample the full-color image
mov r2, c0              // y = frame
mad r2.x, r0.a, c1, c2  // x = alpha*xdim + xoff
phase
texld r1, r2            // sample the palette
mov r0.a, c3            // force the original color's alpha to 1.0
lrp r0, r1.a, r1, r0    // blend the palette color in by its alpha

The variables are the same as in the 8-bit multi-image version.

All this variation does is make it possible to read an alpha value off of the palette texture, which is then used to interpolate between the palette color and the original color value. This way, you can give palette indices an alpha of 0 to keep full 24-bit color, and use the palette swapping only for small, animated areas.

If I had infinite time, I’d write a program that analyzed a palette-based image and re-assigned all of the color indices based on proximity, which would make animating with this method much easier. This will stay as a proof of concept until I get some non-copyrighted images to play with, at which point I’ll probably throw an implementation of it inside my engine.


Physics Networking


I’m still working on integrating physics into my game, but at some point here I am going to hit on that one major hurdle: syncing one physics environment with another that could be halfway across the globe. There are a number of ways to do this; some of them are bad, and some of them are absolutely terrible.

If any of you have played Transformice, you will know what I mean by terrible. While I did decompile the game, I never bothered to examine their networking code, but I would speculate that they are simply mass-updating all the clients and not properly interpolating received packets. Consequently, when things get laggy, players don’t just hop around, they completely clip through objects, even ones they should never clip through no matter what the other players are doing.

The question is, how do you properly handle physics networking without flooding the network with unnecessary packets, especially when you are dealing with very large numbers of physics objects spread across a large map with many people interacting in complex ways?

In a proper setup, a client processes input from the user and sends it to the server. While it’s waiting for a response, it interpolates the physics by guessing what all the other players are doing. If there are no other players, this interpolation can, for the most part, be considered perfectly accurate. This generally holds true as long as any other players are far enough away that they can’t directly or indirectly influence each other’s interpolation.

Meanwhile, the server receives the input a little while later - say, 150 ms. In an ideal scenario, the entire physics world is rewound 150 ms and then re-simulated taking into account the player’s action. The server then broadcasts new physics locations and the player’s action to all other clients.

All the other clients now get these packets another 150 ms later. In an ideal scenario, all they would need to know is that the other player pressed a button 300 ms ago; they could then rewind the simulation that far and re-simulate to take it into account.
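
In code, that ideal server-side step might look something like this conceptual, untested sketch (all of the types and helpers here are hypothetical placeholders, not part of any real engine):

#include <deque>

struct WorldState { double time = 0; /* positions, velocities, etc. */ };
struct PlayerInput { double time = 0; int buttons = 0; };

// Trivial stand-ins for the real physics step and input handling.
static void StepWorld(WorldState& w, double dt) { w.time += dt; /* run physics here */ }
static void ApplyInput(WorldState&, const PlayerInput&) { /* adjust the player's forces here */ }

// Snapshots of the world, assumed to already cover the last few hundred milliseconds.
static std::deque<WorldState> history;

void RewindAndResimulate(const PlayerInput& input, double now, double dt)
{
  // Pop old snapshots until the front is the last one taken at or before the input's timestamp.
  while(history.size() > 1 && history[1].time <= input.time)
    history.pop_front();

  WorldState w = history.front();
  ApplyInput(w, input);   // account for the player's late action
  while(w.time < now)
    StepWorld(w, dt);     // replay forward to the present
  history.push_back(w);   // this becomes the state that gets broadcast
}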

We are, obviously, not in an ideal scenario.

What exactly does this entail? In any interpolation function, speed is gained by making assumptions that introduce possible errors. This error margin grows as the player has more and more objects he might have interacted with. However, this also applies to each physics object against every other physics object, and in turn to every object those objects might have affected. Hence, the worst case boils down to the equation p = n·n!. The best case is that none of the physics objects interact with each other, so the error margin is p = n and therefore grows linearly.

Hence, we now know that interaction with physics objects - or possibly any sort of nonuniform physical force, like an explosion - is what creates the uncertainty problem. This is why the server always maintains its own physics world that is synced to everyone else, so that even if it’s not always right, it’s at least somewhat consistent. The question is, when do we need to send a physics packet, and when do we not?

If the player is falling through the air and jumps, and there is nothing he could possibly interact with, we can assume that any half-decent interpolation function will be almost perfect. Hence, we don’t actually have to send any physics packets because the interpolation should be perfect. However, the more possible objects that are involved, the more uncertain things get and the more we need to send physics updates in case the interpolation functions get confused. In addition, we have to send packets for all affected objects as well. However, we only have to do this when the player gives input to the game. If the player’s input does not change, then he is obeying the interpolation laws of everyone else and no physics update is needed, since the interpolation always assumes the player’s input status has not changed.

Hence, in a semi-ideal scenario, we only have to send physics packets for the player and any physics objects he might have collided with, and only when the player changes their input status. Right?

But wait, that works for the server, but not all the other clients. Those clients receive those physics packets 150 ms late, and have to interpolate those as well, introducing more uncertainty that cannot be eliminated. In addition, we aren’t even in a semi-ideal scenario - the interpolation functions become more unreliable over time regardless of the player’s input status.

However, this uncertainty itself is still predictable. The more objects a player is potentially interacting with, the more uncertain those object states are. This grows exponentially when other players are in the mix because we simply cannot guess what they might do in those 150 ms.

Consequently, one technique would be to send physics updates for all potentially affected objects surrounding the player whenever the player changes their input or interacts with another physics object. This alone will only work when the player is completely alone, and even then it requires some intermediate packets for higher uncertainty. Hence, one can create an update pattern that looks something like this:

Physics packet rapidity: (number of potentially interacting physics objects) * ( (other players) * (number of their potentially interacting objects) )
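
As a very rough illustration of that pattern, here is an untested sketch with hypothetical structures and made-up weights; the exact weighting is a guess, not a tuned formula:

#include <vector>
#include <cstdio>

struct PlayerState {
  int nearbyObjects;                 // physics objects the player could interact with
  std::vector<int> nearbyPlayerObjs; // for each nearby player, their potentially interacting objects
};

// Packets per second to send for this player: more potential interactions mean
// more updates, and each nearby player multiplies the uncertainty further.
float PhysicsPacketRate(const PlayerState& p, float baseRate)
{
  float rate = baseRate * (float)p.nearbyObjects;
  for(int objs : p.nearbyPlayerObjs)
    rate *= 1.0f + (float)objs;
  return rate;
}

int main()
{
  PlayerState alone   = { 3, {} };        // player near 3 objects, no other players
  PlayerState crowded = { 3, { 2, 5 } };  // same objects, but two players nearby
  printf("%.1f packets/s vs %.1f packets/s\n",
         PhysicsPacketRate(alone, 2.0f), PhysicsPacketRate(crowded, 2.0f));
}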

More precise values could be attained by taking distance into account and by understanding exactly what the interpolation function is doing and how. Building an interpolation function that is capable of rewinding and accurately re-simulating the small area around the player is obviously a very desirable course of action, because it removes the primary source of uncertainty and allows for a lot more breathing room.

Either way, the general idea still stands - the closer your player is to a physics object, the more often that object (and the player) needs to be updated. The closer your player is to another player, the more steeply the required update rate climbs - exponentially so. And always remember to send velocity data too!

I will make a less theoretical post on this after I’ve had an opportunity to do testing on what really works.


Assembly CAS implementation


inline unsigned char BSS_FASTCALL asmcas(int *pval, int newval, int oldval)
  {
      unsigned char rval;
      __asm {
#ifdef BSS_NO_FASTCALL //if we are using fastcall we don't need these instructions
        mov EDX, newval
        mov ECX, pval
#endif
        mov EAX, oldval
        lock cmpxchg [ECX], EDX
        sete rval // Note that sete sets a 'byte' not the word
      }
      return rval;
  }
This was an absolute bitch to get working in VC++, so maybe this will be useful to someone, somewhere, somehow. The GCC version I based this off of can be found here.

Note that, obviously, this will only work on x86 architecture.
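
For what it’s worth, here is a minimal, hypothetical usage sketch showing the kind of compare-and-swap retry loop asmcas is meant for:

// Lock-free increment built on asmcas above (my own example, not engine code).
void AtomicIncrement(int* p)
{
  int old;
  do {
    old = *p;                         // read the current value
  } while(!asmcas(p, old + 1, old));  // retry if another thread changed it first
}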


Function Pointer Speed


So after a lot of misguided profiling where I ended up just testing the stupid CPU cache and its ability to fucking predict what my code is going to do, I have, for the most part, demonstrated the following:

if(!(rand()%2)) footest.nothing(); else footest.nothing2();

is slightly faster than

(footest.*funcptr[rand()%2])();

where funcptr is an array of the possible function calls. I had suspected this after I looked at the assembly: a basic function pointer call like that takes around 11 instructions, whereas a normal function call takes a single instruction.
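
For context, here is a self-contained sketch of the two forms being compared; the class and function names are hypothetical stand-ins, not the actual benchmark:

#include <cstdlib>

struct FooTest {
  void nothing() {}
  void nothing2() {}
};

int main()
{
  FooTest footest;
  // Array of pointers to member functions, indexed at runtime.
  void (FooTest::*funcptr[2])() = { &FooTest::nothing, &FooTest::nothing2 };

  for(int i = 0; i < 1000; ++i)
  {
    if(!(rand() % 2)) footest.nothing(); else footest.nothing2(); // branch version
    (footest.*funcptr[rand() % 2])();                             // pointer array version
  }
}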

In debug mode, however, if you have more than 2 possibilities, a switch statement’s very existence takes up almost as many instructions as a single function pointer call, so the function pointer array technique is significantly faster with 3 or more possibilities. However, in release mode, if you write something like switch(rand()%3) and then just the function calls, the whole damn thing gets its own super special optimization that reduces it to about 3 instructions, and hence makes the switch statement method slightly faster.

In all of these cases, the speed difference for 1000 calls is about 0.007 milliseconds and varies wildly. The CPU is doing so much architectural optimization that it most likely doesn’t matter which method is used. I do find it interesting that the switch statement gets super-optimized in certain situations, though.


Most Bizarre Error Ever


Okay, probably not the most bizarre error ever, but it’s definitely the weirdest one for me.

My graphics engine has a Debug, a Release, and a special Release STD version that’s compatible with CLI function requirements and other dependencies. These are organized as 3 separate configurations in my solution for compiling. Pretty normal stuff.

My example applications are all set to be dependent on the graphics engine project, which means visual studio automatically compiles the proper lib file into the project.

Well, normally.

My graphics engine examples suddenly and inexplicably stopped working in Release mode, but not Debug mode. While this normally signals an uninitialized variable problem, I had only flipped a few negative signs since the last build, so it was completely impossible for that to be the cause. I was terribly confused so I went into the code and discovered that it was failing because the singleton instance of the engine was null.

Now, if you know what a singleton is, you should know that this is absolutely, completely impossible. Under normal conditions, that is. At first I thought my engine instance assignment had been fucked, but that was working fine. In fact, the engine existed for one call, and then didn’t exist for the other.

Then I checked where the calls were coming from. What I discovered next blew my mind.

One call was from Planeshader.dll; the other call was from Planeshader_std.dll - oh crap.

Somehow, visual studio had managed to link my executable to both instances of my graphics engine at the same time. I’m not entirely sure how it managed that feat since compiling two identical lib files creates thousands of collisions, but it appears that half the function calls were being sent to one dll, and half the function calls being sent to the other. My engine was trying to run in two dlls simultaneously.

I solved the problem simply by explicitly specifying the lib file in the project properties.

Surely my skill at breaking things knows no bounds.

