Want to win a PS4? Go Premium and enter to win our High-Tech Treats giveaway. Enter to Win

x
?
Solved

loop unrolling etc.

Posted on 2003-11-02
9
Medium Priority
?
484 Views
Last Modified: 2012-05-04
I'm intrigued by - in effect - some new responsibilities on the job which entails assembly but more importantly understanding computer architecture.   So i'm reviewing a sheet of paper with questions that once i get a general feel or am able to answer i'll be able to at a minumum get started.

Consider this assembly language code


   ld  r30, 16       # load immediate
   ld  r29, 20  
   ld  r28, 40    
start:
   add r13, r30, r29
   ld  r11, 0(r13)      # load into r11 from memory
   add r14, r30, r28
   ld  r12, 0(r14)
   mult r12, r11, r12
   st  r12, 0(r14)      # store r2 into memory
   ld  r12, 4
   sub r30, r30, r12  
   bnez r30, start    # jump to start if r30 not equal with zero

1.   How would I 'loop unroll' this four times?
2.  Assume a five stage pipeline.  Insert nop in assembly code where there's stalls.  The only stall I see is

   ld  r30, 16       # load immediate
   ld  r29, 20  
   ld  r28, 40    
start:
   :   # assembly code
   :   # assembly code
   bnez r30, start    # jump to start if r30 not equal with zero
   nop ## <--------------  stall

correct?

3.   How would i modify program via register renaming such that the number of stalls is minimized
0
Comment
Question by:forums_mp
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
9 Comments
 
LVL 5

Expert Comment

by:mtmike
ID: 9666410
Loop unrolling is simply a matter of cutting and pasting and changing some of the loop conditions to branch out. Loop unrolling is a precursor to different optimizations. You can usually also remove the branches by looping until there are fewer than 4 elements left. The trailing 0-3 elements should then be processed in a non-unrolled loop.

start:
  ...
  beqz r30, end    # jump to end if r30 equal to zero
  ...
  beqz r30, end    # jump to end if r30 equal to zero
  ...
  beqz r30, end    # jump to end if r30 equal to zero
  ...
  bnez r30, start    # jump to start if r30 not equal with zero

end:

There are probably a few data hazards in the code besides the branch control hazard. This one looks like a write-after-read stall:

st  r12, 0(r14)   <- read r12
ld  r12, 4         <- write r12

Register renaming can solve these issues, for example, by using a different register than r12 in the code above.

Note that a five stage dlx architecture is quite old. Modern CPUs do register renaming automatically and execute multiple instructions per cycle.
0
 
LVL 11

Expert Comment

by:dimitry
ID: 9666695
Just to add... You are using MIPS assembler. It has 1 "delay slot". It has some correlation with pipeline length, but it is
much easier to consider it from "programmer" point of view.

start:
  :   # assembly code 1
  :   # assembly code 2
  :   # assembly code 3
  bnez r30, start    # jump to start if r30 not equal with zero
  nop ## <--------------  stall

can be changed to

start:
  :   # assembly code 1
  :   # assembly code 2
  bnez r30, start    # jump to start if r30 not equal with zero
  :   # assembly code 3

However you need to remember that if # assembly code 3 affects on your branch condition you can not use it this way.
Also you need to be sure you set .noreorder directive because "clever" compiler may add nops automatically after branch.


0
 
LVL 22

Expert Comment

by:grg99
ID: 9667268
1.   How would I 'loop unroll' this four times?

Probably not worth doing.

You have eight instructions in the loop, unrolling it is only going to help at most remove one instruction from the bottom of the loop.

If your goal is to spped up the code, there are several much more fruitful things you might consider:

#1:  factor out the two "add" instructions, they're jsut recalculating addresses that could be done once before the loop.

#2:  move  the ld  r12,4 outside the loop if you have another free register.

#3: instead of counting down r30, count until one of the address registers hits a precomputed limit.

Do all of those and your inner loop shrinks to just 5 instructions, THEN it might be time to try unrolling the loop a bit.

0
Tech or Treat!

Submit an article about your scariest tech experience—and the solution—and you’ll be automatically entered to win one of 4 fantastic tech gadgets.

 
LVL 3

Accepted Solution

by:
terageek earned 2000 total points
ID: 9822009
1.  The purpose of loop unrolloing is to help with branch penalties.  A five stage pipelined processor example usually is used to teach from will not have branch prediction.  The 5 stages are fetch, decode/register access, execute, memory access, and write back. The processor won't know if the branch is taken until step 3, so 2 instructions in the pipeline from after the branch will need to be invalidated if the branch is actually taken. By unrolling the loop, there is no need to calculate the branch, and therefore nothing must be flushed.

r30 is initialized with 16, and is decremented by 4 at each itteration, so it will execute exactly 4 times.  The following unrolls the loop...

   ld  r30, 16       # load immediate
   ld  r29, 20  
   ld  r28, 40    
# Loop1
   add r13, r30, r29
   ld  r11, 0(r13)      # load into r11 from memory
   add r14, r30, r28
   ld  r12, 0(r14)
   mult r12, r11, r12
   st  r12, 0(r14)      # store r2 into memory
   ld  r12, 4
   sub r30, r30, r12  
#Loop2
   add r13, r30, r29
   ld  r11, 0(r13)      # load into r11 from memory
   add r14, r30, r28
   ld  r12, 0(r14)
   mult r12, r11, r12
   st  r12, 0(r14)      # store r2 into memory
   ld  r12, 4
   sub r30, r30, r12  
#Loop3
   add r13, r30, r29
   ld  r11, 0(r13)      # load into r11 from memory
   add r14, r30, r28
   ld  r12, 0(r14)
   mult r12, r11, r12
   st  r12, 0(r14)      # store r2 into memory
   ld  r12, 4
   sub r30, r30, r12  
#Loop4
   add r13, r30, r29
   ld  r11, 0(r13)      # load into r11 from memory
   add r14, r30, r28
   ld  r12, 0(r14)
   mult r12, r11, r12
   st  r12, 0(r14)      # store r2 into memory


2) Stalls occur any time you need data which isn't there yet, or when you are trying to branch, but don't know if you need to.  If you modify a register, it takes until stage 5.  If you want to read the data in the next cycle, you will need to stall that cycle back at stage 2 until it becomes available.  Your store cycle will keep moving forward while the next one waits at stage 2 until the register it needs is available.  For a branch, it takes until stage 3 to determine if you are going to branch, and until branch command gets to stage 3, you will need to stall fetching any additional cycles, so you will need to stall for 2 cycles at stage 1.  I am assuming for now that you can store and load from a register at the same time.  See below...

   ld  r30, 16
   ld  r29, 20               # r29 is updated in stage 5
   ld  r28, 40               # This is in stage 4 while r29 is updated
   nop                         # Fill stage 3 while r29 is updated, so it can be fetched
                                 # by the next cycle in stage 2
start:
   add r13, r30, r29     # r13 won't be updated until stage 5
   nop                        # While r13 is being stored, this nop will be in stage 4
   nop                        # While r13 is being stored, this nop will be in stage 3
   ld  r11, 0(r13)         # r13 is loaded fetched from the register in stage 2 at the
                                 # same time it is updated by the cycle in stage 5
   add r14, r30, r28     # r14 written back in stage 5
   nop                        # fills stage 4 while r14 is written back
   nop                        # fills stage 3 while r14 is written back
   ld  r12, 0(r14)         # Get r14 in stage 2 while it is being updated from the cycle in stage 5
   nop                        # Fills stage 4 while r12 is written back in stage 5
   nop                        # Get the idea?
   mult r12, r11, r12    # ...
   nop                        # ...
   nop                        # ...
   st  r12, 0(r14)         # You will get data from r12 in stage 2, but won't need it again
   ld  r12, 4                #  So there is no hazard here
   nop
   nop
   sub r30, r30, r12
   nop
   nop
   bnez r30, start    # jump to start if r30 not equal with zero
   nop                    # While the branch is determined in stage 3, this fills stage 2
   nop                    # This will fill stage 2 while branching

3) Using register renaming to limit the number of stalls...

Register renaming is needed if you need to access a register and them modify it.  The modifying instruction cannot be executed until the previous instruction has read the needed data.  This is only an issue in out-of-order and super-scalar processors.  It is never an issue in a five stage pipelined processor since the write instruction cannot pass the read instruction, even if there is no register name conflict.

You can however minimize the number of stalls by a combination of re-ordering and renaming...

   ld  r30, 16       # load immediate
   ld  r29, 20  
   ld  r28, 40    
   nop                     # Still need 1 bubble slot so r29 will be ready
 start:
   add r13, r30, r29
   add r14, r30, r28  # Move this up to fill in one bubble slot
   ld r10, 4              # Rename the register from 12 to 10, and move it into a bubble
   ld  r11, 0(r13)      # r13 was calculated 3 instrucitions ago, so it is now ready
   ld  r12, 0(r14)      # r14 was stored 3 instructions ago, so it is ready
   sub r30, r30, r10  # Move this up into a bubble slot
   nop                     # Still need 1 bubble so r12 will be ready
   mult r12, r11, r12
   bnez r30, start     # jump to start if r30 not equal with zero
   nop                     # Still need 1 bubble so r12 will be ready
   st  r12, 0(r14)      # store r12 into memory.  This instruction will be fetched while
                             # The jump is being considered at stage 3, and will be executed in
                             # a 5 stage pipeline, unless there is stall or instruction invalidate logic.
                             # I assume this is the case because you would want to add nop
                             # instructions to create stalls where the hardware does not have
                             # stall or invalidate logic.
0
 
LVL 3

Expert Comment

by:kenspencer
ID: 9827612
Just watching ... I know x86 and 370 assembler, so this is interesting.
Ken
0
 
LVL 3

Expert Comment

by:terageek
ID: 9854963
By the way, I am not sure if I explained this very well before.  You need to rename r12 to something else in the line
ld r12, 4
or you can't move the instruction before the previous uses or r12 in these lines:
ld  r12, 0(r14)
mult r12, r11, r12
st  r12, 0(r14)
0
 
LVL 3

Expert Comment

by:terageek
ID: 10150380
Not that I am in it for the points, but I did put a lot into my answer above.
0

Featured Post

Tech or Treat! - Giveaway

Submit an article about your scariest tech experience—and the solution—and you’ll be automatically entered to win one of 4 fantastic tech gadgets.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Are you looking for the options available for exporting EDB files to PST? You may be confused as they are different in different Exchange versions. Here, I will discuss some options available.
With so many activities to perform, Exchange administrators are always busy in organizations. If everything, including Exchange Servers, Outlook clients, and Office 365 accounts work without any issues, they can sit and relax. But unfortunately, it…
In this video, Percona Solution Engineer Dimitri Vanoverbeke discusses why you want to use at least three nodes in a database cluster. To discuss how Percona Consulting can help with your design and architecture needs for your database and infras…
Want to learn how to record your desktop screen without having to use an outside camera. Click on this video and learn how to use the cool google extension called "Screencastify"! Step 1: Open a new google tab Step 2: Go to the left hand upper corn…

610 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question