Loop Unrolling: Suggested Interval?

A place to discuss the science of computers and programs, from algorithms to computability.

Formal proofs preferred.

Moderators: phlip, Moderators General, Prelates

dcb2011
Posts: 24
Joined: Thu Nov 03, 2011 10:19 pm UTC

Loop Unrolling: Suggested Interval?

Postby dcb2011 » Tue Apr 30, 2013 9:42 pm UTC

I am about to write a small program in C++ to illustrate loop-unrolling as a method of speeding up program execution for small loops.

I am looking at an old program written in FORTRAN 77 that used an interval of 5.
For example:

Code: Select all

for (i = 0; i < LIMIT; i+= 5){
   a[i] *= scalar_number;
   a[i + 1] *= scalar_number;
   a[i + 2] *= scalar_number;
   a[i + 3] *= scalar_number;
   a[i + 4] *= scalar_number;
} // End for loop

However, I believe that interval was selected because of the type of computer on which it was planned to run.

So . . . I am wondering:
Assuming my code is going to be run on a standard Windows computer that can be purchased at a local computer store (e.g. - 64 bit), would a different interval be more appropriate (6, 7, etc.)? If so, any recommendations?

User avatar
WanderingLinguist
Posts: 237
Joined: Tue May 22, 2012 5:14 pm UTC
Location: Seoul
Contact:

Re: Loop Unrolling: Suggested Interval?

Postby WanderingLinguist » Tue Apr 30, 2013 11:24 pm UTC

Why not just try different intervals and profile the results?

If you have a decent compiler, I suspect you're not going to see much of a difference between the regular loop and the unrolled loop unless you disable compiler optimizations.

Edit: Most modern compilers will do optimizations (like loop unrolling) for you, so for something as simple as the sample code you posted, you're probably not going to see any performance improvement by manually unrolling. If it's for educational purposes, you'll want to disable compiler optimizations in order to actually see the difference. However, in real-life situations, better to let the compiler handle it for you in most cases.
Last edited by WanderingLinguist on Tue Apr 30, 2013 11:37 pm UTC, edited 1 time in total.

EvanED
Posts: 4331
Joined: Mon Aug 07, 2006 6:28 am UTC
Location: Madison, WI
Contact:

Re: Loop Unrolling: Suggested Interval?

Postby EvanED » Tue Apr 30, 2013 11:34 pm UTC

WanderingLinguist wrote:If you have a decent compiler, I suspect you're not going to see much of a difference between the regular loop and the unrolled loop unless you disable compiler optimizations.
It'd be interesting to see though. Loop unrolling is something I'm not sure how good compilers are at doing. Might be quite good; might be bad.

Also, the flip side of that remark is that you should profile with optimizations on (or set to whatever you expect to use in the final result), because if they're off and you later turn them on the degree to which manually unrolling helps -- and even the relative order of different unrolling amounts -- could change. (Actually this is a fairly general rule of profiling in the first place.)

User avatar
WanderingLinguist
Posts: 237
Joined: Tue May 22, 2012 5:14 pm UTC
Location: Seoul
Contact:

Re: Loop Unrolling: Suggested Interval?

Postby WanderingLinguist » Tue Apr 30, 2013 11:49 pm UTC

Edit: clumsily hit "submit" rather than "preview" and posted an incomplete response

EvanED wrote:
WanderingLinguist wrote:If you have a decent compiler, I suspect you're not going to see much of a difference between the regular loop and the unrolled loop unless you disable compiler optimizations.
It'd be interesting to see though. Loop unrolling is something I'm not sure how good compilers are at doing. Might be quite good; might be bad.


I don't have access to a Windows system at the moment, but I tried it with GCC on my Mac, and it doesn't seem to be unrolling the loop unless it's relatively short (or you specifically request unrolling). CLANG also doesn't seem to do unrolling unless you ask for it. Interestingly, the unrolled (interval of 5) vs. non-unrolled versions seem to perform identically. (Perhaps there's some kind of optimization going on at the processor level?)

I don't think I've had to unroll a loop for performance reasons in, um... almost ten years. I guess it might still come up, but with the work I do these days, performance bottlenecks generally seem to be in other places.

Also, the flip side of that remark is that you should profile with optimizations on (or set to whatever you expect to use in the final result), because if they're off and you later turn them on the degree to which manually unrolling helps -- and even the relative order of different unrolling amounts -- could change. (Actually this is a fairly general rule of profiling in the first place.)


Indeed. The only time I ever turn off optimizations in real life is if I suspect they are causing a bug (happens less and less frequently these days, as I get better at concurrent programming, and optimizers get better at not breaking my code...) or to make it easier to step through code in the debugger (occasionally optimizing re-ordering makes debugging difficult).

EvanED
Posts: 4331
Joined: Mon Aug 07, 2006 6:28 am UTC
Location: Madison, WI
Contact:

Re: Loop Unrolling: Suggested Interval?

Postby EvanED » Wed May 01, 2013 12:51 am UTC

(Off topic a little bit; sorry for the hopefully mini threadjack.)
WanderingLinguist wrote:The only time I ever turn off optimizations in real life is ...to make it easier to step through code in the debugger (occasionally optimizing re-ordering makes debugging difficult).
"Occasionally"? What optimization do you use? I haven't really tried with -O1, but my experience trying to debug code compiled with -O2 is that it's almost undebuggable.

Reordering isn't so much the problem, it's when I go and look at a variable and it says "value is optimized out", or try to call a function (e.g. print myvec.size()) and it can't because that function isn't actually compiled in anywhere because it was inlined in all of its uses or never called at all.

Actually the latest version of GCC created a new optimization level, -Og I think, that turns on many optimizations but is supposed to not compromise debugability. Oooo, looks like it's available to me as well... maybe I'll have to play around with it.

csanders
Posts: 30
Joined: Mon Feb 22, 2010 9:09 pm UTC

Re: Loop Unrolling: Suggested Interval?

Postby csanders » Wed May 01, 2013 12:58 am UTC

Loop unrolling doesn't do much good unless you can manage some other tricks along with it. In the case of your loop, the CPU is going to be spending almost all of its time waiting on RAM access, so the efficiency of the unrolling doesn't matter.

Instead, consider a Fibonacci-calculating loop that doesn't hit RAM:

Code: Select all

for (i = 0; i < LIMIT; ++i)
{
    c = a + b;
    a = b;
    b = c;
}


Unrolling it still didn't get me any measurable performance improvement:

Code: Select all

for (i = 0; i < LIMIT; i += 2)
{
    c = a + b;
    a = b;
    b = c;
    c = a + b;
    a = b;
    b = c;
}


But with the unrolled version, you can go a bit farther and get rid of a bunch of assignment statements:

Code: Select all

for (i = 0; i < LIMIT; i += 2)
{
    a = a+b;
    b = a+b;
}

On my system, this last version is the only one that showed any meaningful difference, at about double the speed of the original.

User avatar
WanderingLinguist
Posts: 237
Joined: Tue May 22, 2012 5:14 pm UTC
Location: Seoul
Contact:

Re: Loop Unrolling: Suggested Interval?

Postby WanderingLinguist » Wed May 01, 2013 2:18 pm UTC

EvanED wrote:(Off topic a little bit; sorry for the hopefully mini threadjack.)
WanderingLinguist wrote:The only time I ever turn off optimizations in real life is ...to make it easier to step through code in the debugger (occasionally optimizing re-ordering makes debugging difficult).
"Occasionally"? What optimization do you use? I haven't really tried with -O1, but my experience trying to debug code compiled with -O2 is that it's almost undebuggable.


Heh, yeah, by "occasionally", of course, I mean "usually" ;-)

I do Android development, and about half is native code (via NDK/JNI), while the rest is Java. If you think debugging -O2 optimized C/C++ is bad, try throwing some Java functions on the call stack as well. What gets really fun is when something is only broken with optimizations ON, only when the debugger is not attached, and only when running on a multicore device. Had one of those the other day.

Reordering isn't so much the problem, it's when I go and look at a variable and it says "value is optimized out", or try to call a function (e.g. print myvec.size()) and it can't because that function isn't actually compiled in anywhere because it was inlined in all of its uses or never called at all.


Annoyingly, it seems like I'm starting to see values optimized out like this in Java too. Well, either that or it could be bugs in Eclipse (I mostly work on the C side, so I haven't really looked into it).

Actually the latest version of GCC created a new optimization level, -Og I think, that turns on many optimizations but is supposed to not compromise debugability. Oooo, looks like it's available to me as well... maybe I'll have to play around with it.


Oooh, I gotta see if that's available in the version of GCC we're using. That could save a lot of headaches.

Derek
Posts: 2180
Joined: Wed Aug 18, 2010 4:15 am UTC

Re: Loop Unrolling: Suggested Interval?

Postby Derek » Wed May 01, 2013 6:21 pm UTC

WanderingLinguist wrote:(Perhaps there's some kind of optimization going on at the processor level?)

Modern processors do use branch prediction, so it may be the case that branch predictions are giving the same effect as loop unrolling.

I know I've seen unrolled loops in -O2/3 code before though, but gcc might have changed it's strategies since then.

DeGuerre
Posts: 51
Joined: Mon Feb 04, 2008 6:41 am UTC

Re: Loop Unrolling: Suggested Interval?

Postby DeGuerre » Fri May 03, 2013 1:55 am UTC

dcb2011 wrote:Assuming my code is going to be run on a standard Windows computer that can be purchased at a local computer store (e.g. - 64 bit), would a different interval be more appropriate (6, 7, etc.)? If so, any recommendations?

Because this is Fortran code, I'm going to assume that everything is floating point. In that case, there is some pretty good advice: On a modern machine, if you must manually unroll simple floating point code, use a power of two. Unroll it by at least 4 for single precision and 2 for double precision. Try doubling it to see if it helps.

Why? Because you want to use vector instructions if you can. SSE registers are 128 bits, which means they fit 4 single-precision floats and 2 double-precision floats, so you should be able to do that number of multiplies in one instruction. If your compiler is worth anything at all, and you are compiling for an x86_64 target, then the relevant registers and instructions are available, everything should just happen.

(Oh, and AVX registers are 256 bits, but you can't assume AVX any time soon.)

Incidentally, the reason why 5 was optimal on the machine it was written for was some combination of instruction window size, reservation station size, whether or not it could do speculative execution across a conditional branch, number of machine registers, and so on. It's hard to say without knowing the specifics.

One more option which I should mention for completeness is that you may have access to an optimised BLAS library for your target platform. In that case, use the SSCAL or DSCAL operation (single-precision and double-precision respectively), then you won't have to worry about these details.

Carnildo
Posts: 2023
Joined: Fri Jul 18, 2008 8:43 am UTC

Re: Loop Unrolling: Suggested Interval?

Postby Carnildo » Mon May 06, 2013 4:20 am UTC

WanderingLinguist wrote:I don't have access to a Windows system at the moment, but I tried it with GCC on my Mac, and it doesn't seem to be unrolling the loop unless it's relatively short (or you specifically request unrolling). CLANG also doesn't seem to do unrolling unless you ask for it. Interestingly, the unrolled (interval of 5) vs. non-unrolled versions seem to perform identically. (Perhaps there's some kind of optimization going on at the processor level?)

A branch instruction is cheap; with a good branch predictor, it's even cheaper. A cache miss, on the other hand, is relatively expensive, and an unrolled loop takes up more space in the instruction cache, increasing the odds of a miss. Consequently, compilers tend not to unroll loops unless there's a large performance gain in doing so.


Return to “Computer Science”

Who is online

Users browsing this forum: No registered users and 8 guests