Solved

loop unrolling etc.

Posted on 2003-11-02
9
472 Views
Last Modified: 2012-05-04
I'm intrigued by - in effect - some new responsibilities on the job which entails assembly but more importantly understanding computer architecture.   So i'm reviewing a sheet of paper with questions that once i get a general feel or am able to answer i'll be able to at a minumum get started.

Consider this assembly language code


   ld  r30, 16       # load immediate
   ld  r29, 20  
   ld  r28, 40    
start:
   add r13, r30, r29
   ld  r11, 0(r13)      # load into r11 from memory
   add r14, r30, r28
   ld  r12, 0(r14)
   mult r12, r11, r12
   st  r12, 0(r14)      # store r2 into memory
   ld  r12, 4
   sub r30, r30, r12  
   bnez r30, start    # jump to start if r30 not equal with zero

1.   How would I 'loop unroll' this four times?
2.  Assume a five stage pipeline.  Insert nop in assembly code where there's stalls.  The only stall I see is

   ld  r30, 16       # load immediate
   ld  r29, 20  
   ld  r28, 40    
start:
   :   # assembly code
   :   # assembly code
   bnez r30, start    # jump to start if r30 not equal with zero
   nop ## <--------------  stall

correct?

3.   How would i modify program via register renaming such that the number of stalls is minimized
0
Comment
Question by:forums_mp
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
9 Comments
 
LVL 5

Expert Comment

by:mtmike
ID: 9666410
Loop unrolling is simply a matter of cutting and pasting and changing some of the loop conditions to branch out. Loop unrolling is a precursor to different optimizations. You can usually also remove the branches by looping until there are fewer than 4 elements left. The trailing 0-3 elements should then be processed in a non-unrolled loop.

start:
  ...
  beqz r30, end    # jump to end if r30 equal to zero
  ...
  beqz r30, end    # jump to end if r30 equal to zero
  ...
  beqz r30, end    # jump to end if r30 equal to zero
  ...
  bnez r30, start    # jump to start if r30 not equal with zero

end:

There are probably a few data hazards in the code besides the branch control hazard. This one looks like a write-after-read stall:

st  r12, 0(r14)   <- read r12
ld  r12, 4         <- write r12

Register renaming can solve these issues, for example, by using a different register than r12 in the code above.

Note that a five stage dlx architecture is quite old. Modern CPUs do register renaming automatically and execute multiple instructions per cycle.
0
 
LVL 11

Expert Comment

by:dimitry
ID: 9666695
Just to add... You are using MIPS assembler. It has 1 "delay slot". It has some correlation with pipeline length, but it is
much easier to consider it from "programmer" point of view.

start:
  :   # assembly code 1
  :   # assembly code 2
  :   # assembly code 3
  bnez r30, start    # jump to start if r30 not equal with zero
  nop ## <--------------  stall

can be changed to

start:
  :   # assembly code 1
  :   # assembly code 2
  bnez r30, start    # jump to start if r30 not equal with zero
  :   # assembly code 3

However you need to remember that if # assembly code 3 affects on your branch condition you can not use it this way.
Also you need to be sure you set .noreorder directive because "clever" compiler may add nops automatically after branch.


0
 
LVL 22

Expert Comment

by:grg99
ID: 9667268
1.   How would I 'loop unroll' this four times?

Probably not worth doing.

You have eight instructions in the loop, unrolling it is only going to help at most remove one instruction from the bottom of the loop.

If your goal is to spped up the code, there are several much more fruitful things you might consider:

#1:  factor out the two "add" instructions, they're jsut recalculating addresses that could be done once before the loop.

#2:  move  the ld  r12,4 outside the loop if you have another free register.

#3: instead of counting down r30, count until one of the address registers hits a precomputed limit.

Do all of those and your inner loop shrinks to just 5 instructions, THEN it might be time to try unrolling the loop a bit.

0
Free Tool: ZipGrep

ZipGrep is a utility that can list and search zip (.war, .ear, .jar, etc) archives for text patterns, without the need to extract the archive's contents.

One of a set of tools we're offering as a way to say thank you for being a part of the community.

 
LVL 3

Accepted Solution

by:
terageek earned 500 total points
ID: 9822009
1.  The purpose of loop unrolloing is to help with branch penalties.  A five stage pipelined processor example usually is used to teach from will not have branch prediction.  The 5 stages are fetch, decode/register access, execute, memory access, and write back. The processor won't know if the branch is taken until step 3, so 2 instructions in the pipeline from after the branch will need to be invalidated if the branch is actually taken. By unrolling the loop, there is no need to calculate the branch, and therefore nothing must be flushed.

r30 is initialized with 16, and is decremented by 4 at each itteration, so it will execute exactly 4 times.  The following unrolls the loop...

   ld  r30, 16       # load immediate
   ld  r29, 20  
   ld  r28, 40    
# Loop1
   add r13, r30, r29
   ld  r11, 0(r13)      # load into r11 from memory
   add r14, r30, r28
   ld  r12, 0(r14)
   mult r12, r11, r12
   st  r12, 0(r14)      # store r2 into memory
   ld  r12, 4
   sub r30, r30, r12  
#Loop2
   add r13, r30, r29
   ld  r11, 0(r13)      # load into r11 from memory
   add r14, r30, r28
   ld  r12, 0(r14)
   mult r12, r11, r12
   st  r12, 0(r14)      # store r2 into memory
   ld  r12, 4
   sub r30, r30, r12  
#Loop3
   add r13, r30, r29
   ld  r11, 0(r13)      # load into r11 from memory
   add r14, r30, r28
   ld  r12, 0(r14)
   mult r12, r11, r12
   st  r12, 0(r14)      # store r2 into memory
   ld  r12, 4
   sub r30, r30, r12  
#Loop4
   add r13, r30, r29
   ld  r11, 0(r13)      # load into r11 from memory
   add r14, r30, r28
   ld  r12, 0(r14)
   mult r12, r11, r12
   st  r12, 0(r14)      # store r2 into memory


2) Stalls occur any time you need data which isn't there yet, or when you are trying to branch, but don't know if you need to.  If you modify a register, it takes until stage 5.  If you want to read the data in the next cycle, you will need to stall that cycle back at stage 2 until it becomes available.  Your store cycle will keep moving forward while the next one waits at stage 2 until the register it needs is available.  For a branch, it takes until stage 3 to determine if you are going to branch, and until branch command gets to stage 3, you will need to stall fetching any additional cycles, so you will need to stall for 2 cycles at stage 1.  I am assuming for now that you can store and load from a register at the same time.  See below...

   ld  r30, 16
   ld  r29, 20               # r29 is updated in stage 5
   ld  r28, 40               # This is in stage 4 while r29 is updated
   nop                         # Fill stage 3 while r29 is updated, so it can be fetched
                                 # by the next cycle in stage 2
start:
   add r13, r30, r29     # r13 won't be updated until stage 5
   nop                        # While r13 is being stored, this nop will be in stage 4
   nop                        # While r13 is being stored, this nop will be in stage 3
   ld  r11, 0(r13)         # r13 is loaded fetched from the register in stage 2 at the
                                 # same time it is updated by the cycle in stage 5
   add r14, r30, r28     # r14 written back in stage 5
   nop                        # fills stage 4 while r14 is written back
   nop                        # fills stage 3 while r14 is written back
   ld  r12, 0(r14)         # Get r14 in stage 2 while it is being updated from the cycle in stage 5
   nop                        # Fills stage 4 while r12 is written back in stage 5
   nop                        # Get the idea?
   mult r12, r11, r12    # ...
   nop                        # ...
   nop                        # ...
   st  r12, 0(r14)         # You will get data from r12 in stage 2, but won't need it again
   ld  r12, 4                #  So there is no hazard here
   nop
   nop
   sub r30, r30, r12
   nop
   nop
   bnez r30, start    # jump to start if r30 not equal with zero
   nop                    # While the branch is determined in stage 3, this fills stage 2
   nop                    # This will fill stage 2 while branching

3) Using register renaming to limit the number of stalls...

Register renaming is needed if you need to access a register and them modify it.  The modifying instruction cannot be executed until the previous instruction has read the needed data.  This is only an issue in out-of-order and super-scalar processors.  It is never an issue in a five stage pipelined processor since the write instruction cannot pass the read instruction, even if there is no register name conflict.

You can however minimize the number of stalls by a combination of re-ordering and renaming...

   ld  r30, 16       # load immediate
   ld  r29, 20  
   ld  r28, 40    
   nop                     # Still need 1 bubble slot so r29 will be ready
 start:
   add r13, r30, r29
   add r14, r30, r28  # Move this up to fill in one bubble slot
   ld r10, 4              # Rename the register from 12 to 10, and move it into a bubble
   ld  r11, 0(r13)      # r13 was calculated 3 instrucitions ago, so it is now ready
   ld  r12, 0(r14)      # r14 was stored 3 instructions ago, so it is ready
   sub r30, r30, r10  # Move this up into a bubble slot
   nop                     # Still need 1 bubble so r12 will be ready
   mult r12, r11, r12
   bnez r30, start     # jump to start if r30 not equal with zero
   nop                     # Still need 1 bubble so r12 will be ready
   st  r12, 0(r14)      # store r12 into memory.  This instruction will be fetched while
                             # The jump is being considered at stage 3, and will be executed in
                             # a 5 stage pipeline, unless there is stall or instruction invalidate logic.
                             # I assume this is the case because you would want to add nop
                             # instructions to create stalls where the hardware does not have
                             # stall or invalidate logic.
0
 
LVL 3

Expert Comment

by:kenspencer
ID: 9827612
Just watching ... I know x86 and 370 assembler, so this is interesting.
Ken
0
 
LVL 3

Expert Comment

by:terageek
ID: 9854963
By the way, I am not sure if I explained this very well before.  You need to rename r12 to something else in the line
ld r12, 4
or you can't move the instruction before the previous uses or r12 in these lines:
ld  r12, 0(r14)
mult r12, r11, r12
st  r12, 0(r14)
0
 
LVL 3

Expert Comment

by:terageek
ID: 10150380
Not that I am in it for the points, but I did put a lot into my answer above.
0

Featured Post

Free Tool: Path Explorer

An intuitive utility to help find the CSS path to UI elements on a webpage. These paths are used frequently in a variety of front-end development and QA automation tasks.

One of a set of tools we're offering as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Gain an elementary understanding of Blockchain technology.
This article provides a convenient collection of links to Microsoft provided Security Patches for operating systems that have reached their End of Life support cycle. Included operating systems covered by this article are Windows XP,  Windows Server…
Michael from AdRem Software explains how to view the most utilized and worst performing nodes in your network, by accessing the Top Charts view in NetCrunch network monitor (https://www.adremsoft.com/). Top Charts is a view in which you can set seve…
Monitoring a network: how to monitor network services and why? Michael Kulchisky, MCSE, MCSA, MCP, VTSP, VSP, CCSP outlines the philosophy behind service monitoring and why a handshake validation is critical in network monitoring. Software utilized …

688 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question