
MMX

MMX is a set of processor instructions and registers first added to an upgraded Pentium processor. It has since become standard on all new Intel Architecture processors: K6 MMX, K6-2, K6-III, Celeron, Pentium II, Pentium III, Athlon. It is not available on the Pentium Pro, older Pentiums, older K6s or any earlier processor.

MMX was designed to deal with 2D images and sounds. It can be used in 3D software renderers (to draw polygons), image processing and DSP-style sound processing, and to copy memory quickly. To remain compatible with the operating systems of the time, the 8 new MMX registers were aliased directly onto the floating-point unit registers, so existing task switchers needed no modification to support MMX code. Unfortunately, this also means that MMX code and floating-point code cannot be mixed without a slow mode switch. VectorC, therefore, will only use MMX in loops where there is no conflicting floating-point code.

A function with no loop will not be able to use MMX, because VectorC will decide that the mode switch would cancel any speed improvement (unless the function is declared with one of the MMX calling conventions).

MMX instructions operate on 8-byte vectors (you could also describe them as arrays) of integer types. Only one instruction is required to operate on the entire vector.

An MMX register could be treated as any of the following:

signed char [8];
unsigned char [8];
signed short [4];
unsigned short [4];
signed int [2];
unsigned int [2];
__int64;
unsigned __int64;

MMX instructions work fastest when data in memory is aligned on an 8-byte boundary.
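As a rough illustration of the alignment point, the sketch below tests whether a pointer sits on an 8-byte boundary and declares a buffer with that alignment. It assumes a C11 compiler for `_Alignas`; VectorC itself may need a different, compiler-specific mechanism. The names `is_aligned8` and `pixels` are hypothetical.

```c
#include <stdint.h>

/* Returns 1 if p sits on an 8-byte boundary, 0 otherwise. */
static int is_aligned8(const void *p)
{
    return ((uintptr_t)p & 7u) == 0;
}

/* Hypothetical buffer. _Alignas is C11 and is an assumption here,
   not something the VectorC documentation promises. */
static _Alignas(8) unsigned char pixels[64];
```

A check such as `is_aligned8(pixels)` could be placed in a debug assertion before entering a loop that is expected to be vectorized.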

Because MMX instructions operate on an entire vector in one instruction, VectorC may have to re-order the sequence of instructions from your C code. This requires alias detection. You may need to use the restrict keyword to help VectorC do this.
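A minimal sketch of how restrict might be applied to a loop of this kind is shown here. It uses the C99 spelling of the keyword; the function name `add_u16` is hypothetical, and whether VectorC vectorizes this exact loop is an assumption.

```c
/* Adds two arrays of 16-bit values element by element (modulo 65536).
   restrict promises the compiler that dst, a and b never overlap,
   so loads and stores can be reordered and grouped into vectors. */
void add_u16(unsigned short *restrict dst,
             const unsigned short *restrict a,
             const unsigned short *restrict b,
             int n)
{
    int i;
    for (i = 0; i < n; i++)
        dst[i] = (unsigned short)(a[i] + b[i]);
}
```

Without restrict, the compiler has to allow for dst overlapping a or b, which can force it to keep the original one-element-at-a-time order.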

To make maximum use of MMX, you should use "codeplay_mmx" or "codeplay_3dnow" calling conventions.

General guidelines on achieving vectorization, which is necessary to make full use of MMX, are available in the Optimization Guidelines section.

e.g. In the example below, a single MMX instruction does the work of all the C source code. Without MMX, the C source code requires 27 instructions! This is a simple example demonstrating one instruction. Normally, you would put MMX code inside loops. You may need to unroll the loops so that they process a full 8 bytes of data per iteration. It is often faster to process 16 bytes of data per iteration, because 2 MMX instructions can be executed at the same time.

Source code	Without MMX	MMX (with "__declspec (codeplay_mmx)")
`typedef struct {unsigned short w [4];} VECTOR4UW; VECTOR4UW example (VECTOR4UW a, VECTOR4UW b) { VECTOR4UW r; r.w [0] = a.w [0] + b.w [0]; r.w [1] = a.w [1] + b.w [1]; r.w [2] = a.w [2] + b.w [2]; r.w [3] = a.w [3] + b.w [3]; return r; }`	`@example@CP_16:` `push ebp` `push esi` `push edi` `sub esp,64` `mov si,word ptr 80[esp]` `mov di,word ptr 88[esp]` `mov ebp,esp` `add di,si` `mov word ptr [ebp],di` `mov si,word ptr 82[esp]` `mov di,word ptr 90[esp]` `add di,si` `mov word ptr 2[ebp],di` `mov si,word ptr 84[esp]` `mov di,word ptr 92[esp]` `add di,si` `mov word ptr 4[ebp],di` `mov si,word ptr 86[esp]` `mov di,word ptr 94[esp]` `add di,si` `mov word ptr 6[ebp],di` `mov eax,dword ptr [esp]` `mov edx,dword ptr 4[esp]` `add esp,64` `pop edi` `pop esi` `pop ebp` `ret 16`	`@example@MMX_16:` `paddw mm0,mm1` `ret`
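The unrolling advice can be sketched in plain C as follows. This is only an illustration: `add_bytes` is a hypothetical name, and whether VectorC turns the main loop into MMX instructions is an assumption that depends on the alias detection and calling-convention points made earlier.

```c
/* Adds n bytes of src into dst (modulo 256).
   The main loop handles 8 bytes per iteration, which matches the
   width of one MMX register; a short tail loop handles the rest. */
void add_bytes(unsigned char *dst, const unsigned char *src, int n)
{
    int i = 0;

    /* Unrolled main loop: one full 8-byte vector per pass. */
    for (; i + 8 <= n; i += 8) {
        dst[i + 0] = (unsigned char)(dst[i + 0] + src[i + 0]);
        dst[i + 1] = (unsigned char)(dst[i + 1] + src[i + 1]);
        dst[i + 2] = (unsigned char)(dst[i + 2] + src[i + 2]);
        dst[i + 3] = (unsigned char)(dst[i + 3] + src[i + 3]);
        dst[i + 4] = (unsigned char)(dst[i + 4] + src[i + 4]);
        dst[i + 5] = (unsigned char)(dst[i + 5] + src[i + 5]);
        dst[i + 6] = (unsigned char)(dst[i + 6] + src[i + 6]);
        dst[i + 7] = (unsigned char)(dst[i + 7] + src[i + 7]);
    }

    /* Tail loop: the 0 to 7 bytes that do not fill a vector. */
    for (; i < n; i++)
        dst[i] = (unsigned char)(dst[i] + src[i]);
}
```

Unrolling to 16 bytes per iteration would follow the same pattern with sixteen statements in the main loop body.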

Features of MMX

Addition and subtraction - 8, 16 or 32 bit

e.g.

Source code	Compiled for MMX
`typedef struct {unsigned short w [4];} VECTOR4UW; VECTOR4UW __declspec (codeplay_mmx) example (VECTOR4UW a, VECTOR4UW b) { VECTOR4UW r; r.w [0] = a.w [0] + b.w [0]; r.w [1] = a.w [1] + b.w [1]; r.w [2] = a.w [2] + b.w [2]; r.w [3] = a.w [3] + b.w [3]; return r; }`	`@example@MMX_16:` `paddw mm0,mm1` `ret`

Saturated addition and subtraction - signed 8-bit, signed 16-bit, unsigned 8-bit, unsigned 16-bit.

Saturated arithmetic differs from the usual (modulo) arithmetic in that when a value overflows, it stops at the maximum or minimum value for that type. So

(unsigned char) (255 + 1)

is 0 in modulo arithmetic, whereas the same expression evaluates to 255 in saturated arithmetic.
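The difference can be seen in a small scalar sketch. The function names `add_wrap_u8` and `add_sat_u8` are hypothetical and only illustrate the two behaviours for a single unsigned 8-bit value.

```c
/* Modulo (wrap-around) 8-bit addition: 255 + 1 gives 0. */
unsigned char add_wrap_u8(unsigned char a, unsigned char b)
{
    return (unsigned char)(a + b);
}

/* Saturated 8-bit addition: 255 + 1 stays at 255. */
unsigned char add_sat_u8(unsigned char a, unsigned char b)
{
    int sum = a + b;            /* computed in int, so no overflow here */
    if (sum > 255) sum = 255;   /* clamp to the maximum of the type */
    return (unsigned char)sum;
}
```

Saturation is usually what is wanted for pixel and sample data, where wrapping a bright value around to a dark one is visible or audible.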

e.g.

Source code	Compiled for MMX
`typedef struct {unsigned short w [4];} VECTOR4UW; VECTOR4UW __declspec (codeplay_mmx) example (VECTOR4UW a, VECTOR4UW b) { VECTOR4UW r; int i; i = a.w [0] + b.w [0]; if (i > 65535) i = 65535; r.w [0] = i; i = a.w [1] + b.w [1]; if (i > 65535) i = 65535; r.w [1] = i; i = a.w [2] + b.w [2]; if (i > 65535) i = 65535; r.w [2] = i; i = a.w [3] + b.w [3]; if (i > 65535) i = 65535; r.w [3] = i; return r; }`	`@example@MMX_16:` `paddusw mm0,mm1` `ret`

Saturated conversions - signed 16-bit to signed 8-bit, signed 32-bit to signed 16-bit, signed 16-bit to unsigned 8-bit

Saturated conversions convert from a larger integer size to a smaller integer size. Unlike a normal (modulo) conversion, a saturated conversion maps a source value below the minimum of the destination type to that minimum, and a source value above the maximum to that maximum. So

(unsigned char) 256

is 0, whereas the same expression with a saturated conversion would be 255. Because C does not directly support saturated conversions, the maximum and minimum checks have to be written as conditionals. Saturated conversions are usually faster than normal (modulo) conversions.
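A scalar sketch of the two kinds of conversion, for a single value, might look like this. The names `to_u8_modulo` and `to_u8_saturated` are hypothetical.

```c
/* Normal (modulo) conversion: only the low 8 bits survive. */
unsigned char to_u8_modulo(int v)
{
    return (unsigned char)v;
}

/* Saturated conversion: out-of-range values are clamped to the
   nearest limit of the destination type, written as conditionals. */
unsigned char to_u8_saturated(int v)
{
    if (v < 0)   v = 0;
    if (v > 255) v = 255;
    return (unsigned char)v;
}
```

With these definitions, 256 converts to 0 in the modulo version and to 255 in the saturated version.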

e.g.

Source code	Compiled for MMX
`typedef struct {signed char b [4];} VECTOR4SB; typedef struct {signed short w [4];} VECTOR4SW; VECTOR4SB __declspec (codeplay_mmx) example (VECTOR4SW a) { VECTOR4SB r; int i; i = a.w [0]; if (i > 127) i = 127; if (i < -128) i = -128; r.b [0] = i; i = a.w [1]; if (i > 127) i = 127; if (i < -128) i = -128; r.b [1] = i; i = a.w [2]; if (i > 127) i = 127; if (i < -128) i = -128; r.b [2] = i; i = a.w [3]; if (i > 127) i = 127; if (i < -128) i = -128; r.b [3] = i; return r; }`	`@example@MMX_8:` `packsswb mm0,mm1` `movd eax,mm0` `ret`

Logical Operations - binary and, binary or, binary exclusive-or

Logical operations work on 8-byte values. They can also be used with comparisons to make conditional assignments. This is done automatically by VectorC.

Comparisons - signed 8-bit, signed 16-bit, signed 32-bit

Comparisons can be combined with logical operations to create conditional assignments.

e.g.

Source code	Compiled for MMX
`typedef struct {signed short w [4];} VECTOR4SW; VECTOR4SW __declspec (codeplay_mmx) example (VECTOR4SW a) { VECTOR4SW r; int i; i = a.w [0]; if (i > 8) i = 8; r.w [0] = i; i = a.w [1]; if (i > 8) i = 8; r.w [1] = i; i = a.w [2]; if (i > 8) i = 8; r.w [2] = i; i = a.w [3]; if (i > 8) i = 8; r.w [3] = i; return r; }`	`@example@MMX_8:` `movq mm2,qword ptr const` `movq mm1,mm0` `pcmpgtw mm0,mm2` `pand mm2,mm0` `pandn mm0,mm1` `por mm0,mm2` `ret` `const dw 8,8,8,8`
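To make the compare-and-mask idea concrete, here is a scalar model of what the pcmpgtw, pand, pandn and por sequence does to one 16-bit element. It is a sketch only: `clamp_max_lane` is a hypothetical name, and it assumes two's-complement integers so that -1 has all bits set.

```c
/* Scalar model of a conditional assignment built from a comparison
   and logical operations, for one element of the vector. */
short clamp_max_lane(short a, short limit)
{
    /* Like pcmpgtw: all ones (-1) if a > limit, otherwise all zeros. */
    int mask = -(a > limit);

    /* Like pand / pandn / por: take limit where the mask is set,
       keep a where it is clear, then merge the two halves. */
    return (short)((limit & mask) | (a & ~mask));
}
```

No branch is taken on the data, which is why the same sequence can handle all four elements of the vector at once.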

Multiplication - signed 16-bit, signed 16-bit to 32-bit result

Multiplication can only be performed on signed 16-bit values. It is possible to take two 32-bit results or four 16-bit results. It is also possible to calculate just the high-order 16 bits of the results (useful for fixed-point arithmetic).

Because C promotes short operands to int (32-bit here) before multiplying, you may need to explicitly cast both the left and right arguments of the multiplication operator to short.

e.g.

Source code	Compiled for MMX
`typedef struct {signed short w [4];} VECTOR4SW; VECTOR4SW __declspec (codeplay_mmx) example (VECTOR4SW a, VECTOR4SW b) { VECTOR4SW r; int i; r.w [0] = a.w [0] * b.w [0] >> 16; r.w [1] = a.w [1] * b.w [1] >> 16; r.w [2] = a.w [2] * b.w [2] >> 16; r.w [3] = a.w [3] * b.w [3] >> 16; return r; }`	`@example@MMX_16:` `pmulhw mm0,mm1 ret`
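Where the operands are held in wider variables, the explicit casts mentioned above might be written as in this sketch. The name `mul_high` is hypothetical, and the assumption is that both arguments are known to fit in a signed 16-bit range.

```c
/* High 16 bits of a signed 16 x 16 -> 32-bit product, as used in
   fixed-point arithmetic. The casts state that only 16 bits of each
   operand matter, even though a and b are stored as int. */
short mul_high(int a, int b)
{
    int product = (short)a * (short)b;

    /* Right-shifting a negative int is implementation-defined in C;
       this sketch assumes the usual arithmetic shift. */
    return (short)(product >> 16);
}
```

For example, with 16384 standing for 0.25 in a 16.16 fixed-point scheme, `mul_high(16384, 16384)` yields 4096, which stands for 0.0625.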

Shifts - 16-bit left, 32-bit left, 64-bit left, signed 16-bit right, signed 32-bit right, unsigned 16-bit right, unsigned 32-bit right, unsigned 64-bit right

All components of the vector must be shifted by the same value (i.e. you cannot shift a vector by another vector, only by a scalar).

e.g.

Source code	Compiled for MMX
`typedef struct {signed short w [4];} VECTOR4SW; VECTOR4SW __declspec (codeplay_mmx) example (VECTOR4SW a) { VECTOR4SW r; int i; r.w [0] = a.w [0] << 4; r.w [1] = a.w [1] << 4; r.w [2] = a.w [2] << 4; r.w [3] = a.w [3] << 4; return r; }`	`@example@MMX_8:` `psllw mm0,const` `ret` `const dq 4`

Conversions

Conversions from one vector type to another are possible, but may not be implemented in a single instruction, so you may wish to avoid them where possible.

Extensions to MMX available with Streaming SIMD Extensions and AMD Athlon

These extensions to MMX are available on processors with Streaming SIMD Extensions support or on AMD's Athlon processor and compatibles. The same restrictions apply to these extensions as to the rest of MMX.

Average - unsigned 8-bit, unsigned 16-bit

e.g.

Source code	Compiled for MMX
`typedef struct {unsigned short w [4];} VECTOR4UW; VECTOR4UW __declspec (codeplay_mmx) example (VECTOR4UW a, VECTOR4UW b) { VECTOR4UW r; int i; r.w [0] = (a.w [0] + b.w [0] + 1) >> 1; r.w [1] = (a.w [1] + b.w [1] + 1) >> 1; r.w [2] = (a.w [2] + b.w [2] + 1) >> 1; r.w [3] = (a.w [3] + b.w [3] + 1) >> 1; return r; }`	`@example@MMX_8:` `pavgw mm0,mm1` `ret`

Maximum and Minimum - unsigned 8-bit, signed 16-bit

e.g.

Source code	Compiled for MMX
`typedef struct {signed short w [4];} VECTOR4SW; short __inline min (short a, short b) { if (b < a) a = b; return a; } VECTOR4SW __declspec (codeplay_mmx) example (VECTOR4SW a, VECTOR4SW b) { VECTOR4SW r; short i; r.w [0] = min (a.w [0], b.w [0]); r.w [1] = min (a.w [1], b.w [1]); r.w [2] = min (a.w [2], b.w [2]); r.w [3] = min (a.w [3], b.w [3]); return r; }`	`@example@MMX_8:` `pminsw mm0,mm1` `ret`

Multiplication - unsigned 16-bit with high 16-bit result

e.g.

Source code	Compiled for MMX
`typedef struct {unsigned short w [4];} VECTOR4UW; VECTOR4UW __declspec (codeplay_mmx) example (VECTOR4UW a, VECTOR4UW b) { VECTOR4UW r; int i; r.w [0] = a.w [0] * b.w [0] >> 16; r.w [1] = a.w [1] * b.w [1] >> 16; r.w [2] = a.w [2] * b.w [2] >> 16; r.w [3] = a.w [3] * b.w [3] >> 16; return r; }`	`@example@MMX_16:` `pmulhuw mm0,mm1 ret`

Set Bit Mask - 4-bit mask from a signed 16-bit vector, or 8-bit mask from a signed 8-bit vector

e.g.

Source code	Compiled for MMX
`typedef struct {short f [4];} VECTOR4SW; unsigned char __declspec (codeplay_mmx) example (VECTOR4SW a, VECTOR4SW b) { unsigned char r; r = (a.f [0] < b.f [0]); r |= (a.f [1] < b.f [1]) << 1; r |= (a.f [2] < b.f [2]) << 2; r |= (a.f [3] < b.f [3]) << 3; return r; }`	`@example@MMX_16:` `pcmpgtw mm1,mm0` `packsswb mm1,mm1` `psrlq mm1,32` `pmovmskb eax,mm1` `ret`