
STREAMING SIMD EXTENSIONS

Intel's Streaming SIMD Extensions (SSE) are a set of processor instructions introduced with the Pentium III processor. The new instructions perform single-precision floating-point operations on values held in a new set of 128-bit registers (xmm0 to xmm7). These registers do not have the restrictions of MMX, but they do require operating system support. This support is available in Windows 98 and in Windows NT (with a service pack).

The SSE instructions can operate on the following types:

float
float [2] - (not as fast as float [4], normally)
float [4]

Floating-point operations carried out in SSE registers will not, in all circumstances, produce exactly the same results as normal floating-point code. If two versions of your program must produce exactly the same results, you should enable consistent precision.

SSE is ideal for multiplying vectors and matrices as well as 3D projection. It can also be used with MMX for image and sound processing using floating-point intermediates.
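As an illustration of the matrix case, here is a minimal hand-written sketch of a 4x4 matrix times 4-element vector product. It is not VectorC output: it assumes Intel's C intrinsics from xmmintrin.h, a column-major matrix layout, and a hypothetical function name mat4_mul_vec4.

```c
#include <xmmintrin.h>  /* SSE intrinsics (assumed available) */

/* Multiply a column-major 4x4 matrix m by the vector v.
   out = column0*v[0] + column1*v[1] + column2*v[2] + column3*v[3] */
void mat4_mul_vec4(const float m[16], const float v[4], float out[4])
{
    /* Each step broadcasts one element of v and scales a whole column. */
    __m128 r = _mm_mul_ps(_mm_loadu_ps(m), _mm_set1_ps(v[0]));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(m + 4), _mm_set1_ps(v[1])));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(m + 8), _mm_set1_ps(v[2])));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(m + 12), _mm_set1_ps(v[3])));
    _mm_storeu_ps(out, r);
}
```

The column-major layout is chosen so that every step is a full-width multiply and add, with no horizontal sums across a register.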

Streaming SIMD Extensions use a different control register (MXCSR) from the FPU, so exception masks and rounding modes have to be set differently from normal floating-point code. Also, unlike the FPU, SSE has a convert-to-integer instruction that rounds towards zero (as C defines float-to-integer conversion).
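The sketch below shows one way to work with the SSE control register (called MXCSR in Intel's documentation) and with the truncating conversion. It assumes the intrinsics and macros from xmmintrin.h; the function names are hypothetical and this is not how VectorC itself sets the modes.

```c
#include <xmmintrin.h>  /* SSE intrinsics and MXCSR helper macros */

/* C-style conversion: cvttss2si truncates towards zero,
   whatever rounding mode the SSE control register holds. */
int to_int_truncate(float x)
{
    return _mm_cvtt_ss2si(_mm_set_ss(x));
}

/* cvtss2si uses the rounding mode currently held in the SSE
   control register (round to nearest even by default). */
int to_int_current_mode(float x)
{
    return _mm_cvt_ss2si(_mm_set_ss(x));
}

/* Change the SSE rounding mode and return the previous one.
   This does not touch the FPU control word. */
unsigned int sse_set_rounding(unsigned int mode)
{
    unsigned int old = _MM_GET_ROUNDING_MODE();
    _MM_SET_ROUNDING_MODE(mode);
    return old;
}
```

A caller would typically save the value returned by sse_set_rounding and pass it back in once the block of SSE code has finished.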

16-byte alignment is required for most SSE instructions. If VectorC incorrectly assumes 16-byte alignment, you will get alignment exceptions. Also, unaligned memory accesses are significantly slower than aligned accesses.
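When alignment cannot be guaranteed, hand-written code can test the pointers and fall back to unaligned accesses. The following is a minimal sketch under that assumption, using intrinsics from xmmintrin.h; is_aligned16 and add_arrays are hypothetical names, not VectorC features.

```c
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics (assumed available) */

/* Non-zero if p is on a 16-byte boundary. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* dst[i] = a[i] + b[i] for n floats; n must be a multiple of 4. */
void add_arrays(float *dst, const float *a, const float *b, int n)
{
    int i;
    if (is_aligned16(dst) && is_aligned16(a) && is_aligned16(b)) {
        /* Aligned path: movaps, which faults on a misaligned address. */
        for (i = 0; i < n; i += 4)
            _mm_store_ps(dst + i,
                         _mm_add_ps(_mm_load_ps(a + i), _mm_load_ps(b + i)));
    } else {
        /* Unaligned path: movups, always safe but slower. */
        for (i = 0; i < n; i += 4)
            _mm_storeu_ps(dst + i,
                          _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    }
}
```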

Operations Supported by Streaming SIMD Extensions

Arithmetic - addition, subtraction, multiplication and division on floats or float vectors

e.g.

Source code

typedef struct {float f [4];} VECTOR4F;

VECTOR4F __declspec (codeplay_sse) example (VECTOR4F a, VECTOR4F b)
{
    VECTOR4F r;
    r.f [0] = a.f [0] * b.f [0];
    r.f [1] = a.f [1] * b.f [1];
    r.f [2] = a.f [2] * b.f [2];
    r.f [3] = a.f [3] * b.f [3];
    return r;
}

Compiled for SSE

@example@SSE_32:
    mulps xmm0,xmm1
    ret

Reciprocal 12-bit precision - floats and float vectors

VectorC will also do 12-bit division by calculating the reciprocal of the right hand side and multiplying by the left hand side.

e.g.

Source code	Compiled for SSE
`float __declspec (codeplay_sse) example (float a) { return 1 __hint__ ((precision (12))) / a; }`	`@example@SSE_4: rcpss xmm0,xmm0` `ret`
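For comparison, the same reciprocal-and-multiply approach can be written by hand. This is a sketch assuming the intrinsics in xmmintrin.h, with hypothetical function names. The second function adds one Newton-Raphson step, a standard refinement that is not described in the text above, for cases where 12 bits are not enough.

```c
#include <xmmintrin.h>  /* SSE intrinsics (assumed available) */

/* a / b on four floats at roughly 12-bit precision:
   approximate 1/b with rcpps, then multiply by a. */
__m128 div_approx(__m128 a, __m128 b)
{
    return _mm_mul_ps(a, _mm_rcp_ps(b));
}

/* 1/b with one Newton-Raphson step, x1 = x0 * (2 - b * x0),
   which roughly doubles the number of correct bits. */
__m128 rcp_refined(__m128 b)
{
    __m128 x0 = _mm_rcp_ps(b);
    return _mm_mul_ps(x0, _mm_sub_ps(_mm_set1_ps(2.0f), _mm_mul_ps(b, x0)));
}
```

The approximate result is only close to the exact quotient, so code that compares against exact values should use a tolerance.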

Reciprocal Square Root - 12-bit precision float and float vector.

The "sqrt" function in "math.h" is declared with doubles. Doubles cannot be processed with SSE, so you can either declare your own version of "sqrt", or use the command-line switch "/vec:single" or "/single". This will use a "float" version of "sqrt" if the argument is a float.

e.g.

Source code

float sqrt (float);

float __declspec (codeplay_sse) example (float a)
{
    return 1 / __hint__ ((precision (12))) sqrt (a);
}

Compiled for SSE

@example@SSE_4:
    rsqrtss xmm0,xmm0
    ret

Square Root - float and float vector.

e.g.

Source code

float sqrt (float);

float __declspec (codeplay_sse) example (float a)
{
    return sqrt (a);
}

Compiled for SSE

@example@SSE_4:
    sqrtss xmm0,xmm0
    ret

Conversion to or from 32-bit signed integer or 2-element signed 32-bit integer vector

When converting to or from integer values, only 2 elements can be converted at a time and the integer vector will be in an MMX register.

e.g.

Source code

typedef struct {float f [2];} VECTOR2F;
typedef struct {int i [2];} VECTOR2SD;

VECTOR2SD __declspec (codeplay_sse) example (VECTOR2F a)
{
    VECTOR2SD r;
    r.i [0] = a.f [0];
    r.i [1] = a.f [1];
    return r;
}

Compiled for SSE

@example@SSE_8:
    sub esp,28
    movq [esp],mm0
    cvttps2pi mm0,[esp]
    add esp,28
    ret

Minimum and Maximum - floats or float vectors

e.g.

Source code

typedef struct {float f [4];} VECTOR4F;

float __inline min (float a, float b)
{
    if (b < a) a = b;
    return a;
}

VECTOR4F __declspec (codeplay_sse) example (VECTOR4F a, VECTOR4F b)
{
    VECTOR4F r;
    r.f [0] = min (a.f [0], b.f [0]);
    r.f [1] = min (a.f [1], b.f [1]);
    r.f [2] = min (a.f [2], b.f [2]);
    r.f [3] = min (a.f [3], b.f [3]);
    return r;
}

Compiled for SSE

@example@SSE_32:
    minps xmm0,xmm1
    ret

Absolute and Negate - floats and float vectors

The SSE logical instructions can be used to negate floats and calculate the absolute (positive) values.
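A minimal sketch of the bit manipulation involved, assuming the intrinsics in xmmintrin.h (negate4 and abs4 are hypothetical names): an IEEE single-precision float keeps its sign in the top bit, so flipping that bit negates the value and clearing it gives the absolute value.

```c
#include <xmmintrin.h>  /* SSE intrinsics (assumed available) */

/* Negate four floats: xorps with a mask that has only the sign bit set.
   -0.0f has exactly that bit pattern (80000000H). */
__m128 negate4(__m128 v)
{
    return _mm_xor_ps(v, _mm_set1_ps(-0.0f));
}

/* Absolute value of four floats: andnps computes (~mask) & v,
   which clears the sign bit and leaves the other bits unchanged. */
__m128 abs4(__m128 v)
{
    return _mm_andnot_ps(_mm_set1_ps(-0.0f), v);
}
```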

Conditional Move - float, float vectors

It is possible to perform conditional assignments on vectors using a sequence of instructions without branching. This can be much faster than branching (which can be very slow on modern processors).

e.g.

Source code

typedef struct {float f [4];} VECTOR4F;

float __inline cond (float a, float b)
{
    if (a == 5) a = b;
    return a;
}

VECTOR4F __declspec (codeplay_sse) example (VECTOR4F a, VECTOR4F b)
{
    VECTOR4F r;
    r.f [0] = cond (a.f [0], b.f [0]);
    r.f [1] = cond (a.f [1], b.f [1]);
    r.f [2] = cond (a.f [2], b.f [2]);
    r.f [3] = cond (a.f [3], b.f [3]);
    return r;
}

Compiled for SSE

@example@SSE_32:
    movaps xmm3,xmm0
    movaps xmm4,xmm1
    cmpps xmm3,const,0
    movaps xmm2,xmm0
    andps xmm4,xmm3
    andnps xmm3,xmm2
    orps xmm3,xmm4
    movaps xmm2,xmm3
    movups xmm0,xmm2
    ret

const dd 40A00000H,40A00000H
      dd 40A00000H,40A00000H
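The compare, and, and-not, or pattern in that listing can also be written by hand. The sketch below assumes the intrinsics in xmmintrin.h and uses a hypothetical name, cond4. The compare produces all ones in each element where the condition holds and all zeros elsewhere, and the logical instructions then pick b or a per element.

```c
#include <xmmintrin.h>  /* SSE intrinsics (assumed available) */

/* r[i] = (a[i] == 5) ? b[i] : a[i], with no branches. */
__m128 cond4(__m128 a, __m128 b)
{
    /* cmpps with the "equal" predicate: per-element mask. */
    __m128 mask = _mm_cmpeq_ps(a, _mm_set1_ps(5.0f));
    /* (mask & b) | (~mask & a) */
    return _mm_or_ps(_mm_and_ps(mask, b), _mm_andnot_ps(mask, a));
}
```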

Set Bit Mask - 4-bit from float vector

e.g.

Source code

typedef struct {float f [4];} VECTOR4F;

unsigned char __declspec (codeplay_sse) example (VECTOR4F a, VECTOR4F b)
{
    unsigned char r;
    r = (a.f [0] < b.f [0]);
    r |= (a.f [1] < b.f [1]) << 1;
    r |= (a.f [2] < b.f [2]) << 2;
    r |= (a.f [3] < b.f [3]) << 3;
    return r;
}

Compiled for SSE

@example@SSE_32:
    cmpps xmm0,xmm1,1
    movmskps eax,xmm0
    ret
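A common use of such a mask is a single branch on the result of four comparisons. This is a minimal sketch assuming the intrinsics in xmmintrin.h; less_mask4 and any_less4 are hypothetical names.

```c
#include <xmmintrin.h>  /* SSE intrinsics (assumed available) */

/* Bit i of the result is set when a[i] < b[i]. */
int less_mask4(__m128 a, __m128 b)
{
    /* cmpltps sets each element to all ones or all zeros;
       movmskps gathers the four sign bits into an integer. */
    return _mm_movemask_ps(_mm_cmplt_ps(a, b));
}

/* Non-zero if any of the four elements of a is less than b. */
int any_less4(__m128 a, __m128 b)
{
    return less_mask4(a, b) != 0;
}
```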

Prefetching

Prefetching is available on processors with SSE support.
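As a sketch of how a prefetch hint might be placed in a loop, the example below sums an array while requesting data some distance ahead. It assumes the _mm_prefetch intrinsic from xmmintrin.h. The function name and the distance of 64 floats are hypothetical, and a suitable distance depends on the processor and the amount of work per iteration.

```c
#include <xmmintrin.h>  /* SSE intrinsics (assumed available) */

/* Sum n floats; n must be a multiple of 4. */
float sum_with_prefetch(const float *data, int n)
{
    __m128 acc = _mm_setzero_ps();
    float lanes[4];
    int i;
    for (i = 0; i < n; i += 4) {
        /* Hint only: ask for data 64 floats ahead to be brought into
           the cache. The result is the same with or without it. */
        if (i + 64 < n)
            _mm_prefetch((const char *)(data + i + 64), _MM_HINT_T0);
        acc = _mm_add_ps(acc, _mm_loadu_ps(data + i));
    }
    /* Add the four partial sums together. */
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```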