Want to protect your cyber security and still get fast solutions? Ask a secure question today.Go Premium

x
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 487
  • Last Modified:

loop unrolling etc.

I'm intrigued by - in effect - some new responsibilities on the job which entails assembly but more importantly understanding computer architecture.   So i'm reviewing a sheet of paper with questions that once i get a general feel or am able to answer i'll be able to at a minumum get started.

Consider this assembly language code


   ld  r30, 16       # load immediate
   ld  r29, 20  
   ld  r28, 40    
start:
   add r13, r30, r29
   ld  r11, 0(r13)      # load into r11 from memory
   add r14, r30, r28
   ld  r12, 0(r14)
   mult r12, r11, r12
   st  r12, 0(r14)      # store r2 into memory
   ld  r12, 4
   sub r30, r30, r12  
   bnez r30, start    # jump to start if r30 not equal with zero

1.   How would I 'loop unroll' this four times?
2.  Assume a five stage pipeline.  Insert nop in assembly code where there's stalls.  The only stall I see is

   ld  r30, 16       # load immediate
   ld  r29, 20  
   ld  r28, 40    
start:
   :   # assembly code
   :   # assembly code
   bnez r30, start    # jump to start if r30 not equal with zero
   nop ## <--------------  stall

correct?

3.   How would i modify program via register renaming such that the number of stalls is minimized
0
forums_mp
Asked:
forums_mp
1 Solution
 
mtmikeCommented:
Loop unrolling is simply a matter of cutting and pasting and changing some of the loop conditions to branch out. Loop unrolling is a precursor to different optimizations. You can usually also remove the branches by looping until there are fewer than 4 elements left. The trailing 0-3 elements should then be processed in a non-unrolled loop.

start:
  ...
  beqz r30, end    # jump to end if r30 equal to zero
  ...
  beqz r30, end    # jump to end if r30 equal to zero
  ...
  beqz r30, end    # jump to end if r30 equal to zero
  ...
  bnez r30, start    # jump to start if r30 not equal with zero

end:

There are probably a few data hazards in the code besides the branch control hazard. This one looks like a write-after-read stall:

st  r12, 0(r14)   <- read r12
ld  r12, 4         <- write r12

Register renaming can solve these issues, for example, by using a different register than r12 in the code above.

Note that a five stage dlx architecture is quite old. Modern CPUs do register renaming automatically and execute multiple instructions per cycle.
0
 
dimitryCommented:
Just to add... You are using MIPS assembler. It has 1 "delay slot". It has some correlation with pipeline length, but it is
much easier to consider it from "programmer" point of view.

start:
  :   # assembly code 1
  :   # assembly code 2
  :   # assembly code 3
  bnez r30, start    # jump to start if r30 not equal with zero
  nop ## <--------------  stall

can be changed to

start:
  :   # assembly code 1
  :   # assembly code 2
  bnez r30, start    # jump to start if r30 not equal with zero
  :   # assembly code 3

However you need to remember that if # assembly code 3 affects on your branch condition you can not use it this way.
Also you need to be sure you set .noreorder directive because "clever" compiler may add nops automatically after branch.


0
 
grg99Commented:
1.   How would I 'loop unroll' this four times?

Probably not worth doing.

You have eight instructions in the loop, unrolling it is only going to help at most remove one instruction from the bottom of the loop.

If your goal is to spped up the code, there are several much more fruitful things you might consider:

#1:  factor out the two "add" instructions, they're jsut recalculating addresses that could be done once before the loop.

#2:  move  the ld  r12,4 outside the loop if you have another free register.

#3: instead of counting down r30, count until one of the address registers hits a precomputed limit.

Do all of those and your inner loop shrinks to just 5 instructions, THEN it might be time to try unrolling the loop a bit.

0
Free Tool: IP Lookup

Get more info about an IP address or domain name, such as organization, abuse contacts and geolocation.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

 
terageekCommented:
1.  The purpose of loop unrolloing is to help with branch penalties.  A five stage pipelined processor example usually is used to teach from will not have branch prediction.  The 5 stages are fetch, decode/register access, execute, memory access, and write back. The processor won't know if the branch is taken until step 3, so 2 instructions in the pipeline from after the branch will need to be invalidated if the branch is actually taken. By unrolling the loop, there is no need to calculate the branch, and therefore nothing must be flushed.

r30 is initialized with 16, and is decremented by 4 at each itteration, so it will execute exactly 4 times.  The following unrolls the loop...

   ld  r30, 16       # load immediate
   ld  r29, 20  
   ld  r28, 40    
# Loop1
   add r13, r30, r29
   ld  r11, 0(r13)      # load into r11 from memory
   add r14, r30, r28
   ld  r12, 0(r14)
   mult r12, r11, r12
   st  r12, 0(r14)      # store r2 into memory
   ld  r12, 4
   sub r30, r30, r12  
#Loop2
   add r13, r30, r29
   ld  r11, 0(r13)      # load into r11 from memory
   add r14, r30, r28
   ld  r12, 0(r14)
   mult r12, r11, r12
   st  r12, 0(r14)      # store r2 into memory
   ld  r12, 4
   sub r30, r30, r12  
#Loop3
   add r13, r30, r29
   ld  r11, 0(r13)      # load into r11 from memory
   add r14, r30, r28
   ld  r12, 0(r14)
   mult r12, r11, r12
   st  r12, 0(r14)      # store r2 into memory
   ld  r12, 4
   sub r30, r30, r12  
#Loop4
   add r13, r30, r29
   ld  r11, 0(r13)      # load into r11 from memory
   add r14, r30, r28
   ld  r12, 0(r14)
   mult r12, r11, r12
   st  r12, 0(r14)      # store r2 into memory


2) Stalls occur any time you need data which isn't there yet, or when you are trying to branch, but don't know if you need to.  If you modify a register, it takes until stage 5.  If you want to read the data in the next cycle, you will need to stall that cycle back at stage 2 until it becomes available.  Your store cycle will keep moving forward while the next one waits at stage 2 until the register it needs is available.  For a branch, it takes until stage 3 to determine if you are going to branch, and until branch command gets to stage 3, you will need to stall fetching any additional cycles, so you will need to stall for 2 cycles at stage 1.  I am assuming for now that you can store and load from a register at the same time.  See below...

   ld  r30, 16
   ld  r29, 20               # r29 is updated in stage 5
   ld  r28, 40               # This is in stage 4 while r29 is updated
   nop                         # Fill stage 3 while r29 is updated, so it can be fetched
                                 # by the next cycle in stage 2
start:
   add r13, r30, r29     # r13 won't be updated until stage 5
   nop                        # While r13 is being stored, this nop will be in stage 4
   nop                        # While r13 is being stored, this nop will be in stage 3
   ld  r11, 0(r13)         # r13 is loaded fetched from the register in stage 2 at the
                                 # same time it is updated by the cycle in stage 5
   add r14, r30, r28     # r14 written back in stage 5
   nop                        # fills stage 4 while r14 is written back
   nop                        # fills stage 3 while r14 is written back
   ld  r12, 0(r14)         # Get r14 in stage 2 while it is being updated from the cycle in stage 5
   nop                        # Fills stage 4 while r12 is written back in stage 5
   nop                        # Get the idea?
   mult r12, r11, r12    # ...
   nop                        # ...
   nop                        # ...
   st  r12, 0(r14)         # You will get data from r12 in stage 2, but won't need it again
   ld  r12, 4                #  So there is no hazard here
   nop
   nop
   sub r30, r30, r12
   nop
   nop
   bnez r30, start    # jump to start if r30 not equal with zero
   nop                    # While the branch is determined in stage 3, this fills stage 2
   nop                    # This will fill stage 2 while branching

3) Using register renaming to limit the number of stalls...

Register renaming is needed if you need to access a register and them modify it.  The modifying instruction cannot be executed until the previous instruction has read the needed data.  This is only an issue in out-of-order and super-scalar processors.  It is never an issue in a five stage pipelined processor since the write instruction cannot pass the read instruction, even if there is no register name conflict.

You can however minimize the number of stalls by a combination of re-ordering and renaming...

   ld  r30, 16       # load immediate
   ld  r29, 20  
   ld  r28, 40    
   nop                     # Still need 1 bubble slot so r29 will be ready
 start:
   add r13, r30, r29
   add r14, r30, r28  # Move this up to fill in one bubble slot
   ld r10, 4              # Rename the register from 12 to 10, and move it into a bubble
   ld  r11, 0(r13)      # r13 was calculated 3 instrucitions ago, so it is now ready
   ld  r12, 0(r14)      # r14 was stored 3 instructions ago, so it is ready
   sub r30, r30, r10  # Move this up into a bubble slot
   nop                     # Still need 1 bubble so r12 will be ready
   mult r12, r11, r12
   bnez r30, start     # jump to start if r30 not equal with zero
   nop                     # Still need 1 bubble so r12 will be ready
   st  r12, 0(r14)      # store r12 into memory.  This instruction will be fetched while
                             # The jump is being considered at stage 3, and will be executed in
                             # a 5 stage pipeline, unless there is stall or instruction invalidate logic.
                             # I assume this is the case because you would want to add nop
                             # instructions to create stalls where the hardware does not have
                             # stall or invalidate logic.
0
 
kenspencerCommented:
Just watching ... I know x86 and 370 assembler, so this is interesting.
Ken
0
 
terageekCommented:
By the way, I am not sure if I explained this very well before.  You need to rename r12 to something else in the line
ld r12, 4
or you can't move the instruction before the previous uses or r12 in these lines:
ld  r12, 0(r14)
mult r12, r11, r12
st  r12, 0(r14)
0
 
terageekCommented:
Not that I am in it for the points, but I did put a lot into my answer above.
0

Featured Post

Free Tool: IP Lookup

Get more info about an IP address or domain name, such as organization, abuse contacts and geolocation.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Tackle projects and never again get stuck behind a technical roadblock.
Join Now