
OPTIMIZATION GUIDELINES

1. Reduce Memory Accesses

One important area for optimization is reducing memory reads and writes by keeping values in the processor's registers. VectorC will try to keep values in registers as much as possible, but it sometimes needs help, because it often cannot tell whether a variable's value might be modified through another pointer. This is called 'aliasing'. Read the guidelines on aliasing to help speed up your code on any processor.
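
As a rough illustration of why aliasing matters, consider the sketch below. The function names are hypothetical, and the C99 restrict qualifier is used here as an assumption; the aliasing guidelines describe what VectorC itself accepts. In the first function the compiler may have to re-read *scale on every iteration, because a write through out could change it. In the second, the value is read once and the pointers are declared not to overlap.

```c
#include <stddef.h>

/* Without aliasing information, *scale may have to be re-read from memory
   on every iteration, because out[i] could refer to the same location. */
void scale_may_alias(int *out, const int *in, const int *scale, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        out[i] = in[i] * *scale;
}

/* Reading the value into a local and marking the arrays restrict (C99)
   tells the compiler that the value can stay in a register. */
void scale_no_alias(int *restrict out, const int *restrict in,
                    const int *scale, size_t n)
{
    size_t i;
    const int s = *scale;   /* read once, before the loop */
    for (i = 0; i < n; i++)
        out[i] = in[i] * s;
}
```

Both functions give the same results when the arrays do not overlap; only the second gives the compiler the freedom to avoid the repeated read.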

By default, arguments to functions will be passed on the stack. You can change this by declaring functions with a register-passing calling convention, or setting register-calling as the default on the command line. Register calling conventions require function prototypes and will be disabled for old-style function declarations and functions with variable numbers of arguments.

Calling Convention        Specified in Declaration   Command-Line Option   Visual C-Compatibility Command-Line Option
cdecl                     cdecl                      default               /Gd
fastcall                  fastcall                   /fastcall             /Gr
Watcom Register Calling   __declspec (wcall)         /wcalls               not available

2. Reduce Misaligned Memory Accesses

Some processors will execute code very slowly if memory accesses are not aligned. This is a more serious problem on the Intel Pentium Pro, Pentium II and Pentium III than it is on other processors. It can often be faster to do 2 aligned memory reads and combine the results than to do 1 misaligned memory read. Alignment also affects whether VectorC will vectorize, so it is important to get it right if you want to use MMX, 3D Now! or SSE.
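
The idea of combining two aligned reads can be sketched in portable C as follows. This is only an illustration under stated assumptions: the function name is hypothetical, the data is assumed to be stored as an array of aligned 32-bit words on a little-endian machine (as on x86), and the word after the one being read is assumed to exist whenever the offset is not a multiple of 4.

```c
#include <stddef.h>
#include <stdint.h>

/* Fetch the 32-bit value that starts at byte_offset inside an array of
   aligned 32-bit words, using only aligned reads.
   Assumes little-endian byte order and that words[byte_offset / 4 + 1]
   exists whenever byte_offset is not a multiple of 4. */
uint32_t read_u32_from_aligned(const uint32_t *words, size_t byte_offset)
{
    size_t   index = byte_offset / 4;
    unsigned shift = (unsigned)(byte_offset % 4) * 8;
    uint32_t low   = words[index];
    uint32_t high;

    if (shift == 0)
        return low;                 /* already aligned: one read is enough */

    high = words[index + 1];        /* second aligned read */
    return (low >> shift) | (high << (32 - shift));
}
```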

VectorC will attempt to align variables itself, but can only do this if the definitions of the variables are compiled by VectorC. If the variables are aligned by the linker or by another compiler, they may not be aligned sufficiently to work well. To make sure that VectorC controls the alignment of variables, make sure the definition of the variable is in a file compiled by VectorC and that the variable is initialized (to 0, normally). Values allocated with malloc will have the alignment determined by the malloc function and cannot be controlled by VectorC.

When executing Intel's Streaming SIMD Extensions, accesses to 16-byte values generate exceptions if the address is not aligned and aligned instructions are used. VectorC generates aligned instructions when compiling for SSE, so if you tell VectorC that a variable or pointer value is aligned to 16 bytes, make sure that it is. Otherwise, you will get misaligned access exceptions.
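
One way to guard against passing misaligned pointers is to check the address before use and to allocate from a routine that guarantees the alignment. The sketch below is an assumption-laden example rather than a VectorC feature: the helper names are hypothetical, and aligned_alloc is the standard C11 function, which may not be available with every compiler or runtime.

```c
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if ptr is aligned to `alignment` bytes (a power of two). */
int is_aligned(const void *ptr, size_t alignment)
{
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}

/* Allocates room for `count` floats on a 16-byte boundary using C11
   aligned_alloc. The size is rounded up to a multiple of 16, as
   aligned_alloc expects. Release the result with free(). */
float *alloc_floats_aligned16(size_t count)
{
    size_t bytes = (count * sizeof(float) + 15) & ~(size_t)15;
    return (float *)aligned_alloc(16, bytes);
}
```

A debug build could call is_aligned (ptr, 16) at the top of any function whose pointer parameters have been declared as 16-byte aligned.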

3. Use MMX

MMX instructions were added to the Pentium processor and have been standard on all x86 processors for some time. These instructions work on 8-byte integer vector values. The integers can be signed or unsigned and 8, 16 or 32 bits wide, so eight 8-bit chars can be processed with one instruction. The operations are: add, subtract, multiply, exclusive-or, and, or, and shift left/right. Operations can also be saturated, which means that a result that would overflow is clamped to the minimum or maximum value of the type instead of wrapping around.
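
The difference between ordinary (wrapping) and saturated arithmetic can be shown on a single unsigned 8-bit value. The function names below are hypothetical; the code is a scalar sketch of the behaviour, not of how MMX is invoked.

```c
/* Wrapping add: the result is reduced modulo 256, as in ordinary C
   unsigned char arithmetic. */
unsigned char add_wrap_u8(unsigned char a, unsigned char b)
{
    return (unsigned char)(a + b);
}

/* Saturating add: a result above 255 is clamped to 255. */
unsigned char add_saturate_u8(unsigned char a, unsigned char b)
{
    int sum = a + b;
    return (unsigned char)(sum > 255 ? 255 : sum);
}
```

For example, adding 200 and 100 wraps to 44 in the first function and saturates to 255 in the second.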

To write C code that can be optimized to MMX code, you will need to make sure that the data you are operating on is made up of 8, 16 or 32-bit integers that can be processed in parallel. The data will normally need to be stored aligned to 4 or, preferably, 8 bytes, and VectorC will need to be able to determine that the input and output data are not aliased. Also, you may want to give a hint to the compiler that the loop should be unrolled. Unroll it so that the loop processes 8 or 16 bytes of data at a time.
e.g.

This code cannot be vectorized because:
1. in and out might point to the same memory region
2. in and out might not be aligned to 8-byte boundaries
3. The loop should be unrolled by 8 or 16 times, to process 8 or 16 bytes at a time.

void CopyBytes (char *in, char *out, int length)
    {
    int i;
    for (i=0; i<length; i++)
        out [i] = in [i];
    }
To take advantage of MMX, this function should be re-written as:
void CopyBytesWithMMX (__declspec (alignedvalue (8)) char restrict *in, __declspec (alignedvalue (8)) char restrict *out, int length)
    {
    int i;
    __hint__ ((unroll (2)));
    for (i=0; i<length; i++)
        out [i] = in [i];
    }
This version will be much faster because it can copy 8 bytes of data at a time.
Of course, this only works if CopyBytesWithMMX is called with pointers to 8-byte aligned data and the input and output memory areas are separate.
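
To make "8 bytes at a time" concrete, here is a portable sketch of what such a copy does. It is not the code VectorC generates: the function name is hypothetical, a uint64_t stands in for an MMX register, and the byte-by-byte tail is an addition for lengths that are not a multiple of 8.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copies length bytes in 8-byte blocks, then finishes byte by byte.
   in and out must not overlap. */
void copy_bytes_in_blocks(const char *in, char *out, size_t length)
{
    size_t i = 0;
    for (; i + 8 <= length; i += 8) {
        uint64_t block;
        memcpy(&block, in + i, 8);   /* one 8-byte read */
        memcpy(out + i, &block, 8);  /* one 8-byte write */
    }
    for (; i < length; i++)          /* remaining 0 to 7 bytes */
        out[i] = in[i];
}
```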

Multiplication can only be performed on 16-bit signed values. It is also possible to produce 2 32-bit signed results from 2 16-bit signed factors. It is also possible to multiply 4 signed 16-bit values and shift the result right by 16 to give 4 signed 16-bit results, which allows fixed-point arithmetic. Make sure that both operands to a multiplication are shorts. Remember that C converts all shorts to ints in arithmetic, so if one of the factors is an expression, you will need to cast it to a short before the multiplication.
e.g.

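/* Texture1 and MyScreen are assumed to be 2D arrays of pixel structures with
   unsigned char R, G, B and Alpha fields, declared elsewhere;
   w and h are the texture dimensions. */
void MultiplyTexture (alignvalue (4) unsigned char restrict *Texture1, alignvalue (4) unsigned char restrict *Texture2, short b, int w, int h)
    {
    int x, y;
    short R, G, B;
    for (y=0; y<h; y++)
        for (x=0; x<w; x++)
            {
            __hint__ ((unroll(2)));
            R = (*Texture1) [y] [x].R * 8;
            G = (*Texture1) [y] [x].G * 8;
            B = (*Texture1) [y] [x].B * 8;
            /* both factors are shorts, so a 16-bit multiply can be used */
            R = R * b >> 16;
            G = G * b >> 16;
            B = B * b >> 16;
            MyScreen [y] [x].R = R;
            MyScreen [y] [x].G = G;
            MyScreen [y] [x].B = B;
            MyScreen [y] [x].Alpha = 0;
            }
    }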
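
The multiply-and-shift rule and the cast to short described above can be sketched in portable C as follows. The function names are hypothetical, and the code shows the arithmetic only; whether a given expression is turned into an MMX multiply is up to the compiler.

```c
/* Multiplies two signed 16-bit values and keeps the top 16 bits of the
   32-bit product, i.e. (a * b) >> 16.
   Right-shifting a negative int is implementation-defined in C; most
   compilers perform an arithmetic shift. */
short mul_high_s16(short a, short b)
{
    return (short)(((int)a * (int)b) >> 16);
}

/* When a factor is an expression, it has type int after the usual
   promotions, so it is cast back to short before the multiplication. */
short scale_sum_s16(short x, short y, short factor)
{
    return mul_high_s16((short)(x + y), factor);
}
```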

Saturation can be performed on signed or unsigned 8-bit integers and 16-bit signed integers. In C, you would need to do the operation on a larger integer size and use if statements to check for both minimum and maximum values.
e.g.

/* Texture1, Texture2 and MyScreen are assumed to be 2D arrays of pixel
   structures with unsigned char R, G, B and Alpha fields, declared
   elsewhere; w and h are the texture dimensions. */
void BlendTextures (alignvalue (4) unsigned char restrict *Texture1, alignvalue (4) unsigned char restrict *Texture2, short b1, short b2, int w, int h)
    {
    int x, y;
    short R, G, B, R1, G1, B1, R2, G2, B2;
    for (y=0; y<h; y++)
        for (x=0; x<w; x++)
            {
            __hint__ ((unroll (2)));
            R1 = (*Texture1) [y] [x].R * 2;
            G1 = (*Texture1) [y] [x].G * 2;
            B1 = (*Texture1) [y] [x].B * 2;
            R2 = (*Texture2) [y] [x].R * 2;
            G2 = (*Texture2) [y] [x].G * 2;
            B2 = (*Texture2) [y] [x].B * 2;
            R = (R1 * b1 >> 16) + (R2 * b2 >> 16);
            G = (G1 * b1 >> 16) + (G2 * b2 >> 16);
            B = (B1 * b1 >> 16) + (B2 * b2 >> 16);
            /* clamp to the range of an unsigned 8-bit value (saturation) */
            if (R < 0) R = 0;
            if (R > 255) R = 255;
            if (G < 0) G = 0;
            if (G > 255) G = 255;
            if (B < 0) B = 0;
            if (B > 255) B = 255;
            MyScreen [y] [x].R = R;
            MyScreen [y] [x].G = G;
            MyScreen [y] [x].B = B;
            MyScreen [y] [x].Alpha = 0;
            }
    }

MMX instructions cannot be interleaved with floating-point arithmetic (unless it is done with 3D Now! or SSE, see below). Function calls will be assumed to require the processor in floating-point mode. Switches between MMX and floating-point mode can be slow, therefore:

Put code that you wish to run as MMX in separate functions
MMX mode will only be used in loops, because the overhead of switching to MMX mode is too high for just a few instructions.
Don't put floating-point code that cannot be executed with 3D Now! or SSE in the same loop, or maybe even the same function, as code that should be executed with MMX.
Don't call functions within loops that should be executed with MMX (unless the function is inline).
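
As a sketch of how these guidelines might be followed, the example below keeps the floating-point work in one small function and the integer loop in another. All names are hypothetical, the gain is assumed to be non-negative and small enough to fit the fixed-point format, and whether the loop is actually compiled to MMX depends on the compiler and its options.

```c
#include <stddef.h>

/* Floating-point work is done once, outside the pixel loop, and the
   result is converted to a 16-bit fixed-point factor with 8 fractional
   bits. Assumes 0 <= gain < 128. */
short gain_to_fixed(float gain)
{
    return (short)(gain * 256.0f);
}

/* Integer-only loop: no floating-point code and no function calls. */
void apply_gain(unsigned char *pixels, size_t count, short gain_fixed)
{
    size_t i;
    for (i = 0; i < count; i++) {
        int v = (pixels[i] * gain_fixed) >> 8;
        pixels[i] = (unsigned char)(v > 255 ? 255 : v);
    }
}
```

The caller converts the gain once with gain_to_fixed and then passes the result to apply_gain, so the loop itself never touches a float.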

4. Use AMD 3D Now!

AMD's 3D Now! is an extension to MMX introduced with their K6-2 processor and available on all subsequent AMD processors. It is also available on some processors from other manufacturers. Because 3D Now! uses the MMX registers, everything said about MMX regarding alignment, unrolling and MMX/floating-point modes also applies to 3D Now!.

3D Now! adds operations on single-precision floating-point types. The operations available are: minimum, maximum, comparison, multiplication, addition, subtraction, negation, absolute value, reciprocal square root and division. The results of floating-point operations with 3D Now! are not guaranteed to be exactly the same as with normal floating-point code, so do not use 3D Now! in situations where exact accuracy is necessary. Also, floating-point exceptions are not generated on errors.

Because the processor must be in MMX mode to execute 3D Now! instructions, you cannot have any floating-point code that cannot be executed with 3D Now! in the same area of your program.

Floating-point constants in C are doubles by default, so you need to add 'f' to the end of your floating-point constants (e.g. 1.0f). Alternatively, use the /single (VectorC.exe command-line) or /vec:single (Compatibility mode) command-line options to make floating-point constants automatically single-precision within single-precision expressions.

The floating-point intrinsic functions 'fabs' and 'sqrt' return doubles. You can re-define the prototypes to take and return floats, or use the /single (VectorC.exe command-line) or /vec:single (Compatibility mode) command-line options to make floating-point intrinsic functions return floats if their parameters are floats.
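
The effect of the constant suffix and of the intrinsic's return type can be checked with sizeof, which reports the type of an expression without evaluating it. The function names below are hypothetical. sqrtf is the single-precision function from the C99 <math.h>; whether it is available with a particular VectorC installation is an assumption.

```c
#include <math.h>
#include <stddef.h>

/* A double constant promotes the whole expression to double. */
size_t size_with_double_constant(float x) { return sizeof(x * 0.5); }

/* An f-suffixed constant keeps the expression in single precision. */
size_t size_with_float_constant(float x) { return sizeof(x * 0.5f); }

/* sqrt takes and returns double, even for a float argument. */
size_t size_of_sqrt_result(float x) { return sizeof(sqrt(x)); }

/* sqrtf (C99) takes and returns float. */
size_t size_of_sqrtf_result(float x) { return sizeof(sqrtf(x)); }
```

On a typical x86 compiler the first and third functions return 8 and the second and fourth return 4.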

Even without vectorization, 3D Now! will speed up code that operates on float types as long as there are not too many switches between MMX and floating-point code.

To use 12-bit precision division, you must specify 12-bit precision with: __hint__ ((precision (12))) immediately before the division (/).

To use the 12-bit precision reciprocal square root function, you need to specify 12-bit precision with: __hint__ ((precision (12))) immediately before the sqrt (). You must also be dividing a value by the square root.

5. Use Streaming SIMD Extensions

Intel's Streaming SIMD Extensions (SSE) were added to their Pentium III processor. These extensions add a new set of 128-bit registers that do not have the restrictions of MMX. However, they do require operating-system support (which exists in Windows 95/98/2000 and is available as a driver for Windows NT4). SSE operates on 1, 2 or 4 single-precision floating-point values at a time. When operating on 4-float vectors (128 bits), alignment is extremely important: incorrect alignment at this size can cause exceptions.

The operations on float vectors for SSE are: minimum, maximum, comparison, multiplication, division, 12-bit precision division, square root, 12-bit inverse square root, absolute value, negation, addition and subtraction.

Everything about alignment and aliasing described in the MMX section above also applies here.

As with 3D Now! (described above), 12-bit precision division and square root are available. Also, you should make sure your program uses floats as much as possible. See the 3D Now! descriptions of how VectorC can help. However, because SSE and normal x86 floating-point code can occur without mode switching, the penalties for mixing floats and doubles are much lower.

When using SSE to rotate or transform vectors in 3D, it is faster to store the source and destination values in structure-of-arrays format.
e.g.

Array of structures:
float OutPoints [NUMBEROFVECTORS] [4], InPoints [NUMBEROFVECTORS] [4];
void RotateVectors (void)
    {
    int i;
    for (i=0; i<NUMBEROFVECTORS; i++)
        {
        OutPoints [i] [0] = CameraRotation [0] [0] * InPoints [i] [0]
                          + CameraRotation [0] [1] * InPoints [i] [1]
                          + CameraRotation [0] [2] * InPoints [i] [2];
        OutPoints [i] [1] = CameraRotation [1] [0] * InPoints [i] [0]
                          + CameraRotation [1] [1] * InPoints [i] [1]
                          + CameraRotation [1] [2] * InPoints [i] [2];
        OutPoints [i] [2] = CameraRotation [2] [0] * InPoints [i] [0]
                          + CameraRotation [2] [1] * InPoints [i] [1]
                          + CameraRotation [2] [2] * InPoints [i] [2];
        __hint__ ((unroll(4)));
        }
    }
Structure of arrays:
float OutPointsSOA [3] [NUMBEROFVECTORS], InPointsSOA [3] [NUMBEROFVECTORS];
static void RotateVectorsSOA (void)
    {
    int i;
    for (i=0; i<NUMBEROFVECTORS; i++)
        {
        OutPointsSOA [0] [i] = CameraRotation [0] [0] * InPointsSOA [0] [i]
                             + CameraRotation [0] [1] * InPointsSOA [1] [i]
                             + CameraRotation [0] [2] * InPointsSOA [2] [i];
        OutPointsSOA [1] [i] = CameraRotation [1] [0] * InPointsSOA [0] [i]
                             + CameraRotation [1] [1] * InPointsSOA [1] [i]
                             + CameraRotation [1] [2] * InPointsSOA [2] [i];
        OutPointsSOA [2] [i] = CameraRotation [2] [0] * InPointsSOA [0] [i]
                             + CameraRotation [2] [1] * InPointsSOA [1] [i]
                             + CameraRotation [2] [2] * InPointsSOA [2] [i];
        __hint__ ((unroll(4)));
        }
    }
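
If existing data is already held as an array of structures, it can be converted once before the transform loop. The helper below is a hypothetical sketch of such a conversion: its name and parameter list are not part of the examples above, and it copies only the first three components of each point.

```c
#include <stddef.h>

/* Converts points stored as an array of structures, aos[i][component],
   into three separate arrays, one per component.
   Only the first three components (x, y, z) are copied. */
void aos_to_soa(float (*aos)[4], size_t count,
                float *xs, float *ys, float *zs)
{
    size_t i;
    for (i = 0; i < count; i++) {
        xs[i] = aos[i][0];
        ys[i] = aos[i][1];
        zs[i] = aos[i][2];
    }
}
```

With a destination declared as float soa [3] [N], the three output pointers would be soa [0], soa [1] and soa [2].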

6. Use Prefetching

Prefetching can speed up memory reads by telling the cache to load data before it is required.
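
The general pattern is to request data a fixed distance ahead of the element currently being processed. The sketch below uses __builtin_prefetch, which is a GCC and Clang builtin and not a VectorC feature; it stands in here only to show the pattern. The function name and the prefetch distance are assumptions, and the distance would need tuning for a real processor.

```c
#include <stddef.h>

/* How many elements ahead to prefetch; a guess that would need tuning. */
#define PREFETCH_DISTANCE 16

long sum_with_prefetch(const int *data, size_t count)
{
    long total = 0;
    size_t i;
    for (i = 0; i < count; i++) {
#if defined(__GNUC__)
        /* GCC/Clang builtin. A prefetch is only a hint to the cache;
           the guard keeps the requested address inside the array. */
        if (i + PREFETCH_DISTANCE < count)
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE]);
#endif
        total += data[i];
    }
    return total;
}
```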

7. Use Fast Float to Integer Conversions

In C, conversions from floating-point to integer are done by truncation (rounding towards zero). The FPU (not SSE or 3D Now!) defaults to round-to-nearest. This means that VectorC must generate a call to a runtime function to do the conversion. If you don't mind float-to-integer conversions being round-to-nearest, or if you change the FPU mode to round-towards-zero, then calling the runtime function is unnecessary. Use the "/fastftoi" or "/vec:fastftoi" command-line options to turn off the runtime calls. This can speed up conversions to integers considerably.
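
The difference between the two rounding behaviours can be seen in a small portable sketch. The function names are hypothetical. The second function is only a stand-in for round-to-nearest: it rounds halves away from zero, whereas the FPU's default mode rounds halves to the nearest even value, so the two can differ, for example on inputs ending in exactly .5.

```c
/* The conversion C requires: the fractional part is discarded. */
int to_int_truncate(float x)
{
    return (int)x;
}

/* A simple stand-in for round-to-nearest (halves round away from zero). */
int to_int_nearest_sketch(float x)
{
    return (int)(x >= 0.0f ? x + 0.5f : x - 0.5f);
}
```

For 2.7f the first function gives 2 and the second gives 3, which is the kind of difference to expect in results when the fast conversion option is turned on without changing the FPU mode.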

8. Use Fast Calling Conventions

You can speed up function calls and returns by using register-based calling conventions. You can also improve memory alignment and MMX usage by using Codeplay's own calling conventions. You may want to use different calling conventions for each processor that you compile for.

9. Debugging Optimized Code

When debugging optimized code, you need to make sure that outputting debugging information does not affect the object code produced. The default mode for 'VectorC.exe' is to output full debugging information without affecting object code performance. The default mode when in Microsoft compatibility mode is to output no debugging information - use /Zd to turn on line number debugging information only. Alternatively, use /Zi, then /Zd to turn on full debugging information without affecting performance.

10. Use OKRead Hints

It is often useful for the optimizer to know when it is OK to read from a memory address that you have not written an explicit read for. For example, when reading 3 bytes, it may be possible for the optimizer to convert this to a single read of 4 bytes.

e.g. In this example, "Out" can be written with a single 4-byte write, but "In" can be read with a single 4-byte read only because of the "okread" hint.

__hint__ ((okread (In [i].Alpha)));
Out [i].R = In [i].R;
Out [i].G = In [i].G;
Out [i].B = In [i].B;
Out [i].Alpha = 0;

The hint in this case could also be written as follows. This implies a 4-byte "okread" hint if "In" is an array of 4-byte structures.
__hint__ ((okread (In [i])));
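
The transformation that the hint permits can be written out by hand in portable C. The sketch below assumes a 4-byte pixel structure with hypothetical names; the second function reads all four bytes of each input pixel, including Alpha, which is exactly the extra read the hint declares to be safe.

```c
#include <stddef.h>

typedef struct { unsigned char R, G, B, Alpha; } Pixel;  /* assumed layout */

/* Field-by-field copy: only R, G and B are read from in[i]. */
void copy_rgb_fields(Pixel *out, const Pixel *in, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++) {
        out[i].R = in[i].R;
        out[i].G = in[i].G;
        out[i].B = in[i].B;
        out[i].Alpha = 0;
    }
}

/* Whole-pixel copy: all four bytes of in[i] are read, including Alpha,
   which is then overwritten. The result is the same. */
void copy_rgb_whole(Pixel *out, const Pixel *in, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++) {
        out[i] = in[i];
        out[i].Alpha = 0;
    }
}
```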

Also, conditional moves may be possible if a memory address can be accessed even if a conditional is not executed.

e.g. Without the okread hint in this example, the conditional cannot be converted into a conditional move. Converting it can give a large speed improvement on modern processors.

__hint__ ((okread (d [i])));
if (a == b)
    c = d [i];
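
The reason the read must be known to be safe becomes clear when the two forms are written side by side. In this sketch, with hypothetical function names, the branch-free version reads d[i] whether or not the condition holds, so the address has to be valid in both cases.

```c
/* Branching form: d[i] is read only when a == b. */
int select_with_branch(int a, int b, int c, const int *d, int i)
{
    if (a == b)
        c = d[i];
    return c;
}

/* Branch-free form: d[i] is always read, so it must be a valid address
   even when a != b. This is the shape of code that a conditional move
   makes possible. */
int select_without_branch(int a, int b, int c, const int *d, int i)
{
    int loaded = d[i];              /* unconditional read */
    return (a == b) ? loaded : c;   /* candidate for a conditional move */
}
```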