MMX

MMX is a set of new processor instructions and registers that was added to an upgraded Pentium processor. It has since become standard and has been included in all new Intel Architecture processors since its release - K6 MMX, K6-2, K6-III, Celeron, Pentium II, Pentium III, Athlon. It is not available on the Pentium Pro, old Pentiums, old K6s or any earlier processors.

MMX was designed to deal with 2D images and sounds. It can be used in 3D software renderers (to draw polygons), image processing and DSP-style sound processing, and it can be used to copy memory quickly.

To maintain compatibility with the operating systems of the time, the 8 new MMX registers were aliased directly onto the floating-point unit registers. This allowed existing task switchers to support MMX code without modification. Unfortunately, it also means that MMX code and floating-point code cannot be mixed without a slow mode switch. VectorC, therefore, will only use MMX in loops where there is no conflicting floating-point code.
MMX instructions operate on 8-byte vectors (you could also describe them as arrays) of integer types. Only one instruction is required to operate on the entire vector. An MMX register can be treated as eight 8-bit values, four 16-bit values, two 32-bit values or one 64-bit value.
MMX instructions work fastest when data in memory is aligned on an 8-byte boundary. Because MMX instructions operate on an entire vector in one instruction, VectorC may have to re-order the sequence of instructions from your C code. This requires alias detection, and you may need to use the restrict keyword to help VectorC do this. To make maximum use of MMX, you should use the "codeplay_mmx" or "codeplay_3dnow" calling conventions. General guidelines on achieving vectorization, which is necessary to make full use of MMX, are available in the Optimization Guidelines section.

e.g.

You can see from this example that only one MMX instruction is needed to do all the work of the C source code on the left. Without MMX, the C source code requires 27 instructions! This is a simple example demonstrating one instruction. Normally, you would put MMX code inside loops. You may need to unroll the loops so that they process a full 8 bytes of data per iteration. It is often faster to process 16 bytes of data per iteration because 2 MMX instructions can be executed at the same time.
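As a rough illustration of the loop advice above, the sketch below writes a byte-wise add so that each outer iteration covers exactly 8 bytes, and uses restrict to state that the arrays do not overlap. The function name add_bytes and its parameters are hypothetical, the length is assumed to be a multiple of 8, and the calling-convention keywords are left out because their exact placement is not shown here.

```c
#include <stddef.h>

/* Hypothetical helper: dst[i] = a[i] + b[i] for n bytes.
   Assumption: n is a multiple of 8; any remainder is ignored.
   restrict tells the compiler the three arrays never overlap,
   which is what alias detection needs to know. */
void add_bytes(unsigned char *restrict dst,
               const unsigned char *restrict a,
               const unsigned char *restrict b,
               size_t n)
{
    for (size_t i = 0; i + 8 <= n; i += 8) {
        /* Inner loop covers exactly 8 bytes: one MMX register's worth. */
        for (size_t j = 0; j < 8; j++)
            dst[i + j] = (unsigned char)(a[i + j] + b[i + j]);
    }
}
```

Widening the inner loop to 16 bytes would be the natural variation if you want two 8-byte operations per iteration.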
Features of MMX

Addition and subtraction - 8, 16 or 32 bit

e.g.
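A minimal sketch of element-wise addition and subtraction on four 16-bit values, the kind of loop that fits one 8-byte vector. The function names are hypothetical, and the results wrap in the usual modulo fashion.

```c
/* Hypothetical helpers working on four unsigned 16-bit elements (8 bytes). */

/* dst[i] = a[i] + b[i], wrapping modulo 65536. */
void add4_u16(unsigned short dst[4],
              const unsigned short a[4],
              const unsigned short b[4])
{
    for (int i = 0; i < 4; i++)
        dst[i] = (unsigned short)(a[i] + b[i]);
}

/* dst[i] = a[i] - b[i], wrapping modulo 65536. */
void sub4_u16(unsigned short dst[4],
              const unsigned short a[4],
              const unsigned short b[4])
{
    for (int i = 0; i < 4; i++)
        dst[i] = (unsigned short)(a[i] - b[i]);
}
```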
Saturated addition and subtraction - signed 8-bit, signed 16-bit, unsigned 8-bit, unsigned 16-bit

Saturated arithmetic differs from the usual (modulo) arithmetic in that when a value overflows, it stops at the maximum or minimum value for that type. So (unsigned char) (255 + 1) is 0 in modulo arithmetic, whereas the same expression evaluates to 255 in saturated arithmetic.

e.g.
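The difference between the two kinds of arithmetic can be sketched in plain C for the unsigned 8-bit case. The function names are hypothetical, and whether a given compiler maps these exact forms onto saturating instructions is an assumption.

```c
/* Modulo (wrapping) add, for comparison. */
unsigned char add_mod_u8(unsigned char a, unsigned char b)
{
    return (unsigned char)(a + b);
}

/* Saturated add: the sum is formed in int, then clamped at 255. */
unsigned char add_sat_u8(unsigned char a, unsigned char b)
{
    int sum = a + b;
    return (unsigned char)(sum > 255 ? 255 : sum);
}

/* Saturated subtract: clamps at 0 instead of wrapping around. */
unsigned char sub_sat_u8(unsigned char a, unsigned char b)
{
    return (unsigned char)(a > b ? a - b : 0);
}
```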
Saturated conversions - signed 16-bit to signed 8-bit, signed 32-bit to signed 16-bit, unsigned 16-bit to unsigned 8-bit

Saturated conversions convert from a larger integer size to a smaller integer size. Unlike normal (modulo) conversions, a saturated conversion converts a source value that is below the minimum of the destination type to that minimum, and a source value that is above the maximum to that maximum. So (unsigned char) 256 is 0, whereas the same expression with a saturated conversion would be 255. Because C does not directly support saturated conversions, the maximum and minimum calculations have to be added in as conditionals. Saturated conversions are usually faster than normal (modulo) conversions.

e.g.
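A minimal sketch of saturated conversions written with the conditionals described above. The function names are hypothetical; only the clamping pattern comes from the text.

```c
/* Unsigned 16-bit to unsigned 8-bit: values above 255 clamp to 255. */
unsigned char u16_to_u8_sat(unsigned short x)
{
    if (x > 255)
        return 255;
    return (unsigned char)x;
}

/* Signed 16-bit to signed 8-bit: clamp to the range -128..127. */
signed char s16_to_s8_sat(short x)
{
    if (x < -128)
        return -128;
    if (x > 127)
        return 127;
    return (signed char)x;
}
```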
Logical Operations - binary and, binary or, binary exclusive-or

Logical operations work on 8-byte values. They can also be used with comparisons to make conditional assignments. This is done automatically by VectorC.

Comparisons - signed 8-bit, signed 16-bit, signed 32-bit

Comparisons can be combined with logical operations to create conditional assignments.

e.g.
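To show how a comparison and logical operations combine into a conditional assignment, the sketch below gives the same selection twice: once with an if, and once with an all-ones or all-zeros mask. The function names are hypothetical, and the mask form is only an illustration of the idea, not a description of the code VectorC generates.

```c
/* Branching form: dst[i] takes the larger of a[i] and b[i]. */
void select_larger(short *dst, const short *a, const short *b, int n)
{
    for (int i = 0; i < n; i++) {
        if (a[i] > b[i])
            dst[i] = a[i];
        else
            dst[i] = b[i];
    }
}

/* Mask form: the comparison yields all ones (-1) or all zeros (0),
   and and/or operations pick one of the two inputs with no branch. */
void select_larger_masked(short *dst, const short *a, const short *b, int n)
{
    for (int i = 0; i < n; i++) {
        int mask = -(a[i] > b[i]);   /* -1 when a[i] > b[i], else 0 */
        dst[i] = (short)((a[i] & mask) | (b[i] & ~mask));
    }
}
```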
Multiplication - signed 16-bit, signed 16-bit to 32-bit result

Multiplication can only be performed on signed 16-bit values. It is possible to obtain two 32-bit results or four 16-bit results. It is also possible to calculate the high-order 16 bits of the results (useful for fixed-point arithmetic). Because C promotes operands to 32-bit types before multiplying, you may need to explicitly cast both the left and right arguments of the multiplication operator to short.

e.g.
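A sketch of the casting advice above, with both operands cast to short so that the multiply is visibly 16-bit by 16-bit. The function names are hypothetical, and the high-half version assumes that right-shifting a negative int is an arithmetic shift, as it is on the usual x86 compilers.

```c
/* Full 32-bit product of two signed 16-bit values. */
int mul_s16_full(int a, int b)
{
    return (short)a * (short)b;
}

/* High-order 16 bits of each product, handy for fixed-point scaling.
   Assumption: >> on a negative int is an arithmetic shift. */
void mul_high_s16(short *dst, const short *src, int gain, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = (short)(((short)src[i] * (short)gain) >> 16);
}
```

With a gain of 16384 (one quarter of 65536), each output is a quarter of the corresponding input, rounded down.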
Shifts - 16-bit left, 32-bit left, 64-bit left, signed 16-bit right, signed 32-bit right, unsigned 16-bit right, unsigned 32-bit right, unsigned 64-bit right

All components of the vector must be shifted by the same value (i.e. you cannot shift a vector by another vector, only by a scalar).

e.g.
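A minimal sketch of shifting every element by the same scalar count, as the restriction above requires. The function names are hypothetical, and the count is assumed to be in the range 0 to 15.

```c
/* Shift every 16-bit element left by the same scalar count (0..15 assumed). */
void shift_left_u16(unsigned short *v, int n, int count)
{
    for (int i = 0; i < n; i++)
        v[i] = (unsigned short)(v[i] << count);
}

/* Unsigned (logical) right shift of every element by the same scalar count. */
void shift_right_u16(unsigned short *v, int n, int count)
{
    for (int i = 0; i < n; i++)
        v[i] = (unsigned short)(v[i] >> count);
}
```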
Conversions

Conversions from one vector type to another are possible, but may not be implemented in a single instruction, so you may wish to avoid them where possible.

Extensions to MMX available with Streaming SIMD Extensions and AMD Athlon

These extensions to MMX are available on processors with Streaming SIMD Extensions support or on AMD's Athlon processor and compatibles. The same restrictions apply to these extensions as to the rest of MMX.

Average - unsigned 8-bit, unsigned 16-bit

e.g.
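A sketch of an unsigned 8-bit average written in C. It rounds halves upward, (a + b + 1) >> 1, which is how the hardware averaging instructions behave; the function name is hypothetical, and whether VectorC requires this exact form to recognise an average is an assumption.

```c
/* Rounded average of unsigned 8-bit elements: (a + b + 1) / 2.
   The sum is formed in int, so it cannot overflow. */
void average_u8(unsigned char *dst,
                const unsigned char *a,
                const unsigned char *b,
                int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = (unsigned char)((a[i] + b[i] + 1) >> 1);
}
```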
Maximum and Minimum - unsigned 8-bit, signed 16-bit

e.g.
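A minimal sketch of element-wise maximum and minimum for the two supported types. The function names are hypothetical; the conditional expression is simply the ordinary C way to write the operation.

```c
/* Element-wise maximum of unsigned 8-bit values. */
void max_u8(unsigned char *dst, const unsigned char *a,
            const unsigned char *b, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = (a[i] > b[i]) ? a[i] : b[i];
}

/* Element-wise minimum of signed 16-bit values. */
void min_s16(short *dst, const short *a, const short *b, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = (a[i] < b[i]) ? a[i] : b[i];
}
```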
Multiplication - unsigned 16-bit with high 16-bit result

e.g.
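A sketch of the unsigned high-half multiply in C: form the 32-bit product, then keep its top 16 bits. The function name is hypothetical, and unsigned long is used only to guarantee at least 32 bits for the product.

```c
/* High-order 16 bits of the product of two unsigned 16-bit values. */
unsigned short mul_high_u16(unsigned short a, unsigned short b)
{
    unsigned long product = (unsigned long)a * (unsigned long)b;
    return (unsigned short)(product >> 16);
}
```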
Set Bit Mask - 4-bit from signed 16-bit vector or signed 8-bit vector

e.g.
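One way to picture the bit mask operation is a loop that gathers the sign bit of each element into a small integer. The sketch below does this for four signed 16-bit values; the function name is hypothetical, and the choice that element i maps to bit i of the mask is an assumption.

```c
/* Collect the sign bits of four signed 16-bit values into a 4-bit mask.
   Assumption: element i sets bit i of the result when it is negative. */
unsigned sign_mask_s16x4(const short v[4])
{
    unsigned mask = 0;
    for (int i = 0; i < 4; i++) {
        if (v[i] < 0)
            mask |= 1u << i;
    }
    return mask;
}
```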