Solved

loop unrolling etc.

Posted on 2003-11-02
9
445 Views
Last Modified: 2012-05-04
I'm intrigued by - in effect - some new responsibilities on the job which entails assembly but more importantly understanding computer architecture.   So i'm reviewing a sheet of paper with questions that once i get a general feel or am able to answer i'll be able to at a minumum get started.

Consider this assembly language code


   ld  r30, 16       # load immediate
   ld  r29, 20  
   ld  r28, 40    
start:
   add r13, r30, r29
   ld  r11, 0(r13)      # load into r11 from memory
   add r14, r30, r28
   ld  r12, 0(r14)
   mult r12, r11, r12
   st  r12, 0(r14)      # store r2 into memory
   ld  r12, 4
   sub r30, r30, r12  
   bnez r30, start    # jump to start if r30 not equal with zero

1.   How would I 'loop unroll' this four times?
2.  Assume a five stage pipeline.  Insert nop in assembly code where there's stalls.  The only stall I see is

   ld  r30, 16       # load immediate
   ld  r29, 20  
   ld  r28, 40    
start:
   :   # assembly code
   :   # assembly code
   bnez r30, start    # jump to start if r30 not equal with zero
   nop ## <--------------  stall

correct?

3.   How would i modify program via register renaming such that the number of stalls is minimized
0
Comment
Question by:forums_mp
9 Comments
 
LVL 5

Expert Comment

by:mtmike
ID: 9666410
Loop unrolling is simply a matter of cutting and pasting and changing some of the loop conditions to branch out. Loop unrolling is a precursor to different optimizations. You can usually also remove the branches by looping until there are fewer than 4 elements left. The trailing 0-3 elements should then be processed in a non-unrolled loop.

start:
  ...
  beqz r30, end    # jump to end if r30 equal to zero
  ...
  beqz r30, end    # jump to end if r30 equal to zero
  ...
  beqz r30, end    # jump to end if r30 equal to zero
  ...
  bnez r30, start    # jump to start if r30 not equal with zero

end:

There are probably a few data hazards in the code besides the branch control hazard. This one looks like a write-after-read stall:

st  r12, 0(r14)   <- read r12
ld  r12, 4         <- write r12

Register renaming can solve these issues, for example, by using a different register than r12 in the code above.

Note that a five stage dlx architecture is quite old. Modern CPUs do register renaming automatically and execute multiple instructions per cycle.
0
 
LVL 11

Expert Comment

by:dimitry
ID: 9666695
Just to add... You are using MIPS assembler. It has 1 "delay slot". It has some correlation with pipeline length, but it is
much easier to consider it from "programmer" point of view.

start:
  :   # assembly code 1
  :   # assembly code 2
  :   # assembly code 3
  bnez r30, start    # jump to start if r30 not equal with zero
  nop ## <--------------  stall

can be changed to

start:
  :   # assembly code 1
  :   # assembly code 2
  bnez r30, start    # jump to start if r30 not equal with zero
  :   # assembly code 3

However you need to remember that if # assembly code 3 affects on your branch condition you can not use it this way.
Also you need to be sure you set .noreorder directive because "clever" compiler may add nops automatically after branch.


0
 
LVL 22

Expert Comment

by:grg99
ID: 9667268
1.   How would I 'loop unroll' this four times?

Probably not worth doing.

You have eight instructions in the loop, unrolling it is only going to help at most remove one instruction from the bottom of the loop.

If your goal is to spped up the code, there are several much more fruitful things you might consider:

#1:  factor out the two "add" instructions, they're jsut recalculating addresses that could be done once before the loop.

#2:  move  the ld  r12,4 outside the loop if you have another free register.

#3: instead of counting down r30, count until one of the address registers hits a precomputed limit.

Do all of those and your inner loop shrinks to just 5 instructions, THEN it might be time to try unrolling the loop a bit.

0
How your wiki can always stay up-to-date

Quip doubles as a “living” wiki and a project management tool that evolves with your organization. As you finish projects in Quip, the work remains, easily accessible to all team members, new and old.
- Increase transparency
- Onboard new hires faster
- Access from mobile/offline

 
LVL 3

Accepted Solution

by:
terageek earned 500 total points
ID: 9822009
1.  The purpose of loop unrolloing is to help with branch penalties.  A five stage pipelined processor example usually is used to teach from will not have branch prediction.  The 5 stages are fetch, decode/register access, execute, memory access, and write back. The processor won't know if the branch is taken until step 3, so 2 instructions in the pipeline from after the branch will need to be invalidated if the branch is actually taken. By unrolling the loop, there is no need to calculate the branch, and therefore nothing must be flushed.

r30 is initialized with 16, and is decremented by 4 at each itteration, so it will execute exactly 4 times.  The following unrolls the loop...

   ld  r30, 16       # load immediate
   ld  r29, 20  
   ld  r28, 40    
# Loop1
   add r13, r30, r29
   ld  r11, 0(r13)      # load into r11 from memory
   add r14, r30, r28
   ld  r12, 0(r14)
   mult r12, r11, r12
   st  r12, 0(r14)      # store r2 into memory
   ld  r12, 4
   sub r30, r30, r12  
#Loop2
   add r13, r30, r29
   ld  r11, 0(r13)      # load into r11 from memory
   add r14, r30, r28
   ld  r12, 0(r14)
   mult r12, r11, r12
   st  r12, 0(r14)      # store r2 into memory
   ld  r12, 4
   sub r30, r30, r12  
#Loop3
   add r13, r30, r29
   ld  r11, 0(r13)      # load into r11 from memory
   add r14, r30, r28
   ld  r12, 0(r14)
   mult r12, r11, r12
   st  r12, 0(r14)      # store r2 into memory
   ld  r12, 4
   sub r30, r30, r12  
#Loop4
   add r13, r30, r29
   ld  r11, 0(r13)      # load into r11 from memory
   add r14, r30, r28
   ld  r12, 0(r14)
   mult r12, r11, r12
   st  r12, 0(r14)      # store r2 into memory


2) Stalls occur any time you need data which isn't there yet, or when you are trying to branch, but don't know if you need to.  If you modify a register, it takes until stage 5.  If you want to read the data in the next cycle, you will need to stall that cycle back at stage 2 until it becomes available.  Your store cycle will keep moving forward while the next one waits at stage 2 until the register it needs is available.  For a branch, it takes until stage 3 to determine if you are going to branch, and until branch command gets to stage 3, you will need to stall fetching any additional cycles, so you will need to stall for 2 cycles at stage 1.  I am assuming for now that you can store and load from a register at the same time.  See below...

   ld  r30, 16
   ld  r29, 20               # r29 is updated in stage 5
   ld  r28, 40               # This is in stage 4 while r29 is updated
   nop                         # Fill stage 3 while r29 is updated, so it can be fetched
                                 # by the next cycle in stage 2
start:
   add r13, r30, r29     # r13 won't be updated until stage 5
   nop                        # While r13 is being stored, this nop will be in stage 4
   nop                        # While r13 is being stored, this nop will be in stage 3
   ld  r11, 0(r13)         # r13 is loaded fetched from the register in stage 2 at the
                                 # same time it is updated by the cycle in stage 5
   add r14, r30, r28     # r14 written back in stage 5
   nop                        # fills stage 4 while r14 is written back
   nop                        # fills stage 3 while r14 is written back
   ld  r12, 0(r14)         # Get r14 in stage 2 while it is being updated from the cycle in stage 5
   nop                        # Fills stage 4 while r12 is written back in stage 5
   nop                        # Get the idea?
   mult r12, r11, r12    # ...
   nop                        # ...
   nop                        # ...
   st  r12, 0(r14)         # You will get data from r12 in stage 2, but won't need it again
   ld  r12, 4                #  So there is no hazard here
   nop
   nop
   sub r30, r30, r12
   nop
   nop
   bnez r30, start    # jump to start if r30 not equal with zero
   nop                    # While the branch is determined in stage 3, this fills stage 2
   nop                    # This will fill stage 2 while branching

3) Using register renaming to limit the number of stalls...

Register renaming is needed if you need to access a register and them modify it.  The modifying instruction cannot be executed until the previous instruction has read the needed data.  This is only an issue in out-of-order and super-scalar processors.  It is never an issue in a five stage pipelined processor since the write instruction cannot pass the read instruction, even if there is no register name conflict.

You can however minimize the number of stalls by a combination of re-ordering and renaming...

   ld  r30, 16       # load immediate
   ld  r29, 20  
   ld  r28, 40    
   nop                     # Still need 1 bubble slot so r29 will be ready
 start:
   add r13, r30, r29
   add r14, r30, r28  # Move this up to fill in one bubble slot
   ld r10, 4              # Rename the register from 12 to 10, and move it into a bubble
   ld  r11, 0(r13)      # r13 was calculated 3 instrucitions ago, so it is now ready
   ld  r12, 0(r14)      # r14 was stored 3 instructions ago, so it is ready
   sub r30, r30, r10  # Move this up into a bubble slot
   nop                     # Still need 1 bubble so r12 will be ready
   mult r12, r11, r12
   bnez r30, start     # jump to start if r30 not equal with zero
   nop                     # Still need 1 bubble so r12 will be ready
   st  r12, 0(r14)      # store r12 into memory.  This instruction will be fetched while
                             # The jump is being considered at stage 3, and will be executed in
                             # a 5 stage pipeline, unless there is stall or instruction invalidate logic.
                             # I assume this is the case because you would want to add nop
                             # instructions to create stalls where the hardware does not have
                             # stall or invalidate logic.
0
 
LVL 3

Expert Comment

by:kenspencer
ID: 9827612
Just watching ... I know x86 and 370 assembler, so this is interesting.
Ken
0
 
LVL 3

Expert Comment

by:terageek
ID: 9854963
By the way, I am not sure if I explained this very well before.  You need to rename r12 to something else in the line
ld r12, 4
or you can't move the instruction before the previous uses or r12 in these lines:
ld  r12, 0(r14)
mult r12, r11, r12
st  r12, 0(r14)
0
 
LVL 3

Expert Comment

by:terageek
ID: 10150380
Not that I am in it for the points, but I did put a lot into my answer above.
0

Featured Post

Maximize Your Threat Intelligence Reporting

Reporting is one of the most important and least talked about aspects of a world-class threat intelligence program. Here’s how to do it right.

Join & Write a Comment

HOW TO: Connect to the VMware vSphere Hypervisor 6.5 (ESXi 6.5) using the vSphere (HTML5 Web) Host Client 6.5, and perform a simple configuration task of adding a new VMFS 6 datastore.
We have come a long way with backup and data protection — from backing up to floppies, external drives, CDs, Blu-ray, flash drives, SSD drives, and now to the cloud.
This video discusses moving either the default database or any database to a new volume.
Access reports are powerful and flexible. Learn how to create a query and then a grouped report using the wizard. Modify the report design after the wizard is done to make it look better. There will be another video to explain how to put the final p…

708 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

15 Experts available now in Live!

Get 1:1 Help Now