STREAMING SIMD EXTENSIONS

Intel's Streaming SIMD Extensions (SSE) are a set of processor instructions introduced with the Pentium III processor. The instructions perform single-precision floating-point operations on values held in a new set of 128-bit registers. These registers do not have the restrictions of MMX, but they do require operating system support, because the operating system must save and restore them when switching tasks. This support is available in Windows 98 and Windows NT (with a service pack).

The SSE instructions can operate on the following types:

float
float [2] (normally not as fast as float [4])
float [4]

Floating-point operations carried out in SSE registers will not, in all circumstances, produce exactly the same results as normal floating-point code. If two versions of your program must produce exactly the same results, you should enable consistent precision.

SSE is ideal for multiplying vectors and matrices as well as 3D projection. It can also be used with MMX for image and sound processing using floating-point intermediates.
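As a rough illustration of the kind of code this refers to, the sketch below multiplies a 4x4 matrix by a 4-element vector in plain C. The names VEC4, MAT4 and mat4_mul_vec4 are hypothetical and are not part of VectorC. Whether a given compiler turns the inner loop into packed SSE multiplies and adds depends on its vectorizer and settings.

```c
/* Hypothetical types for illustration; not part of VectorC. */
typedef struct { float f[4]; } VEC4;
typedef struct { float m[4][4]; } MAT4;   /* m[row][col] */

/* r = M * v. Each output element is the dot product of one matrix row
   (four contiguous floats) with the vector. */
VEC4 mat4_mul_vec4(const MAT4 *M, VEC4 v)
{
    VEC4 r;
    int row, col;
    for (row = 0; row < 4; row++) {
        float sum = 0.0f;
        for (col = 0; col < 4; col++)
            sum += M->m[row][col] * v.f[col];
        r.f[row] = sum;
    }
    return r;
}
```

A 3D projection is the same operation with a projection matrix, followed by a divide of x, y and z by w.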

Streaming SIMD Extensions use a different control register (MXCSR) from the FPU, so exception masks and rounding modes have to be set differently from normal floating-point code. Also, unlike the FPU, SSE has a convert-to-integer instruction that rounds toward zero (truncation, as defined in C).
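The difference between the two kinds of conversion can be sketched with the SSE intrinsics from xmmintrin.h. This assumes a compiler that provides that header; the intrinsics are not a VectorC feature described on this page, and the function names are hypothetical. The rounding conversion follows the mode held in the SSE control register (MXCSR), which defaults to round-to-nearest-even.

```c
#include <xmmintrin.h>   /* SSE intrinsics (assumed available) */

/* Convert with truncation toward zero (cvttss2si), matching a C cast. */
int to_int_truncate(float x)
{
    return _mm_cvttss_si32(_mm_set_ss(x));
}

/* Convert using the rounding mode currently set in MXCSR (cvtss2si).
   With the default mode this rounds to the nearest integer, ties to even. */
int to_int_current_mode(float x)
{
    return _mm_cvtss_si32(_mm_set_ss(x));
}
```

With the default rounding mode, to_int_truncate (2.7f) gives 2 while to_int_current_mode (2.7f) gives 3.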

16-byte alignment is required for most SSE instructions. If VectorC incorrectly assumes 16-byte alignment, you will get alignment exceptions. Also, unaligned memory accesses are significantly slower than aligned accesses.
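One way to guarantee the alignment, rather than rely on it being assumed, is to request it in the type. The sketch below uses the standard C11 _Alignas specifier, which is an assumption about the compiler in use and is not VectorC-specific syntax; ALIGNED_VECTOR4F and is_aligned_16 are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical 4-float vector forced onto a 16-byte boundary (C11). */
typedef struct { _Alignas(16) float f[4]; } ALIGNED_VECTOR4F;

/* Returns 1 if p is on a 16-byte boundary, 0 otherwise. */
int is_aligned_16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

A check such as is_aligned_16 can be placed in a debug assertion before a pointer is handed to code that was compiled on the assumption of aligned data.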

Operations Supported by Streaming SIMD Extensions

Arithmetic - addition, subtraction, multiplication and division on floats or float vectors

e.g.

Source code:

typedef struct {float f [4];} VECTOR4F;

VECTOR4F __declspec (codeplay_sse) example (VECTOR4F a, VECTOR4F b)
   {
   VECTOR4F r;
   r.f [0] = a.f [0] * b.f [0];
   r.f [1] = a.f [1] * b.f [1];
   r.f [2] = a.f [2] * b.f [2];
   r.f [3] = a.f [3] * b.f [3];
   return r;
   }

Compiled for SSE:

@example@SSE_32:
   mulps xmm0,xmm1
   ret



Reciprocal 12-bit precision - floats and float vectors

VectorC will also do 12-bit division by calculating the reciprocal of the right hand side and multiplying by the left hand side.

e.g.

Source code:

float __declspec (codeplay_sse) example (float a)
   {
   return 1 __hint__ ((precision (12))) / a;
   }

Compiled for SSE:

@example@SSE_4:
   rcpss xmm0,xmm0
   ret
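To get a feel for what 12-bit precision means in practice, the reciprocal-and-multiply pattern can be sketched with the SSE intrinsics from xmmintrin.h. This assumes a compiler that provides that header and is not VectorC syntax; the function names are hypothetical. Intel documents the relative error of rcpss as at most 1.5 * 2^-12, so results agree with a full division to roughly three or four significant decimal digits.

```c
#include <xmmintrin.h>   /* SSE intrinsics (assumed available) */

/* Approximate 1/a with the SSE reciprocal instruction (rcpss). */
float approx_reciprocal(float a)
{
    return _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(a)));
}

/* Approximate n/d as n * (1/d): reciprocal of the right-hand side
   multiplied by the left-hand side. */
float approx_divide(float n, float d)
{
    return n * approx_reciprocal(d);
}
```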



Reciprocal Square Root 12-bit precision - floats and float vectors

The "sqrt" function in "math.h" is declared with doubles. Doubles cannot be processed with SSE, so you can either declare your own version of "sqrt", or use the command-line switch "/vec:single" or "/single". This will use a "float" version of "sqrt" if the argument is a float.

e.g.

Source code:

float sqrt (float);

float __declspec (codeplay_sse) example (float a)
   {
   return 1 / __hint__ ((precision (12))) sqrt (a);
   }

Compiled for SSE:

@example@SSE_4:
   rsqrtss xmm0,xmm0
   ret



Square Root - floats and float vectors

The "sqrt" function in "math.h" is declared with doubles. Doubles cannot be processed with SSE, so you can either declare your own version of "sqrt", or use the command-line switch "/vec:single" or "/single". This will use a "float" version of "sqrt" if the argument is a float.

e.g.

Source code:

float sqrt (float);

float __declspec (codeplay_sse) example (float a)
   {
   return sqrt (a);
   }

Compiled for SSE:

@example@SSE_4:
   sqrtss xmm0,xmm0
   ret



Conversion to or from 32-bit signed integer or 2-element signed 32-bit integer vector

When converting to or from integer values, only 2 elements can be converted at a time and the integer vector will be in an MMX register.

e.g.

Source code:

typedef struct {float f [2];} VECTOR2F;
typedef struct {int i [2];} VECTOR2SD;

VECTOR2SD __declspec (codeplay_sse) example (VECTOR2F a)
   {
   VECTOR2SD r;
   r.i [0] = a.f [0];
   r.i [1] = a.f [1];
   return r;
   }

Compiled for SSE:

@example@SSE_8:
   sub esp,28
   movq [esp],mm0
   cvttps2pi mm0,[esp]
   add esp,28
   ret



Minimum and Maximum - floats or float vectors

e.g.

Source code:

typedef struct {float f [4];} VECTOR4F;

float __inline min (float a, float b)
   {
   if (b < a) a = b;
   return a;
   }

VECTOR4F __declspec (codeplay_sse) example (VECTOR4F a, VECTOR4F b)
   {
   VECTOR4F r;
   r.f [0] = min (a.f [0], b.f [0]);
   r.f [1] = min (a.f [1], b.f [1]);
   r.f [2] = min (a.f [2], b.f [2]);
   r.f [3] = min (a.f [3], b.f [3]);
   return r;
   }

Compiled for SSE:

@example@SSE_32:
   minps xmm0,xmm1
   ret



Absolute and Negate - floats and float vectors

The SSE logical instructions can be used to negate floats and calculate the absolute (positive) values.
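The idea can be sketched in portable C by working on the bit pattern of a float directly. This assumes 32-bit IEEE-754 floats; the function names are hypothetical, and the code VectorC actually generates is not shown on this page. Clearing the sign bit gives the absolute value, and flipping it negates the value.

```c
#include <stdint.h>
#include <string.h>

/* Absolute value: clear the sign bit (bit 31), as an AND with a
   constant mask would in an SSE register. */
float abs_by_mask(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* reinterpret the float's bits */
    bits &= 0x7FFFFFFFu;
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Negate: flip the sign bit, as an XOR with a constant mask would. */
float negate_by_mask(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits ^= 0x80000000u;
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

In SSE registers the same masks are applied to all four elements with a single logical instruction.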



Conditional Move - float, float vectors

It is possible to perform conditional assignments on vectors using a sequence of instructions without branching. This can be much faster than branching (which can be very slow on modern processors).

e.g.

Source code:

typedef struct {float f [4];} VECTOR4F;

float __inline cond (float a, float b)
   {
   if (a == 5) a = b;
   return a;
   }

VECTOR4F __declspec (codeplay_sse) example (VECTOR4F a, VECTOR4F b)
   {
   VECTOR4F r;
   r.f [0] = cond (a.f [0], b.f [0]);
   r.f [1] = cond (a.f [1], b.f [1]);
   r.f [2] = cond (a.f [2], b.f [2]);
   r.f [3] = cond (a.f [3], b.f [3]);
   return r;
   }

Compiled for SSE:

@example@SSE_32:
   movaps xmm3,xmm0
   movaps xmm4,xmm1
   cmpps xmm3,const,0      ; mask = (a == 5.0) per element
   movaps xmm2,xmm0
   andps xmm4,xmm3         ; b where the mask is set
   andnps xmm3,xmm2        ; a where the mask is clear
   orps xmm3,xmm4          ; combine the two halves
   movaps xmm2,xmm3
   movups xmm0,xmm2
   ret

const dd 40A00000H,40A00000H   ; 5.0 in each of the four elements
   dd 40A00000H,40A00000H
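The compare, AND, AND-NOT and OR steps in the listing above can be modelled on a single float in portable C. This is only a sketch to show what the mask does: it assumes 32-bit IEEE-754 floats, and select_if_five is a hypothetical name.

```c
#include <stdint.h>
#include <string.h>

/* Branch-free form of "if (a == 5) a = b; return a;" on one float. */
float select_if_five(float a, float b)
{
    uint32_t abits, bbits, mask, rbits;
    float r;
    memcpy(&abits, &a, sizeof abits);
    memcpy(&bbits, &b, sizeof bbits);
    /* All ones if a == 5, all zeros otherwise (what the compare yields). */
    mask = 0u - (uint32_t)(a == 5.0f);
    /* Keep b's bits where the mask is set and a's bits where it is clear. */
    rbits = (bbits & mask) | (abits & ~mask);
    memcpy(&r, &rbits, sizeof r);
    return r;
}
```

Because every element is processed with the same instructions whatever the outcome of the comparison, there is no branch for the processor to mispredict.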



Set Bit Mask - 4-bit mask from float vector

e.g.

Source code:

typedef struct {float f [4];} VECTOR4F;

unsigned char __declspec (codeplay_sse) example (VECTOR4F a, VECTOR4F b)
   {
   unsigned char r;
   r = (a.f [0] < b.f [0]);
   r |= (a.f [1] < b.f [1]) << 1;
   r |= (a.f [2] < b.f [2]) << 2;
   r |= (a.f [3] < b.f [3]) << 3;
   return r;
   }

Compiled for SSE:

@example@SSE_32:
   cmpps xmm0,xmm1,1
   movmskps eax,xmm0
   ret



Prefetching

Prefetching is available on processors with SSE support.
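This page does not say how VectorC exposes prefetching, so the following is only a sketch of the general technique using the _mm_prefetch intrinsic from xmmintrin.h, assuming a compiler that provides that header. The function name and the PREFETCH_AHEAD distance are hypothetical, and a suitable distance has to be found by measurement.

```c
#include <stddef.h>
#include <xmmintrin.h>   /* SSE intrinsics (assumed available) */

/* Hypothetical prefetch distance in elements; tune by measurement. */
#define PREFETCH_AHEAD 64

/* Sum an array, hinting to the cache about data needed a little later. */
float sum_with_prefetch(const float *data, size_t n)
{
    float sum = 0.0f;
    size_t i;
    for (i = 0; i < n; i++) {
        /* Only prefetch addresses that lie inside the array. */
        if (i + PREFETCH_AHEAD < n)
            _mm_prefetch((const char *)&data[i + PREFETCH_AHEAD], _MM_HINT_T0);
        sum += data[i];
    }
    return sum;
}
```

A prefetch is only a hint: it does not change the result of the loop, only how soon the data is likely to be in the cache.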
