dcb2011 wrote: Assuming my code is going to be run on a standard Windows computer that can be purchased at a local computer store (e.g. - 64 bit), would a different interval be more appropriate (6, 7, etc.)? If so, any recommendations?

Because this is Fortran code, I'm going to assume that everything is floating point. In that case, there is a pretty good rule of thumb: on a modern machine, if you must manually unroll simple floating-point code, use a power of two. Unroll it by at least 4 for single precision and 2 for double precision. Then try doubling the factor and measure whether it helps.

Why? Because you want to use vector instructions if you can. SSE registers are 128 bits, which means they fit 4 single-precision floats or 2 double-precision floats, so you should be able to do that number of multiplies in one instruction. If your compiler is worth anything at all, and you are compiling for an x86_64 target, then the relevant registers and instructions are available (SSE2 is part of the x86_64 baseline), and the vectorisation should just happen.

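(Oh, and AVX registers are 256 bits, so they fit 8 single-precision or 4 double-precision floats, but AVX is not part of the x86_64 baseline, so you can't assume every machine has it.)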
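To make the unroll-by-4 advice concrete, here is a rough sketch. I'm assuming the loop in question scales an array by a constant (which is what SSCAL does), and I've written it in C for illustration; a Fortran DO loop has exactly the same shape. The names scale_simple and scale_unrolled4 are made up for this example and don't come from your code.

```c
#include <stddef.h>

/* Plain loop. An optimising compiler targeting x86_64 can usually
   turn this into SSE code by itself, so try this version first. */
void scale_simple(size_t n, float a, float *x)
{
    for (size_t i = 0; i < n; ++i)
        x[i] *= a;
}

/* Manually unrolled by 4: four single-precision floats are exactly
   one 128-bit SSE register's worth of data. */
void scale_unrolled4(size_t n, float a, float *x)
{
    size_t i = 0;
    size_t main_end = n - (n % 4);  /* largest multiple of 4 that is <= n */

    for (; i < main_end; i += 4) {
        x[i]     *= a;
        x[i + 1] *= a;
        x[i + 2] *= a;
        x[i + 3] *= a;
    }

    /* Clean-up loop for the 0 to 3 elements left over. */
    for (; i < n; ++i)
        x[i] *= a;
}
```

For double precision the same pattern applies with a factor of 2 instead of 4. Whether the unrolled version actually beats the plain loop depends on the compiler and flags, so time both before committing to either.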

Incidentally, the reason why 5 was optimal on the machine it was written for was some combination of instruction window size, reservation station size, whether or not it could do speculative execution across a conditional branch, number of machine registers, and so on. It's hard to say without knowing the specifics.

One more option which I should mention for completeness is that you may have access to an optimised BLAS library for your target platform. In that case, use the SSCAL or DSCAL routine (single-precision and double-precision respectively), which scales a vector by a constant, and you won't have to worry about these details.