Example with MIPS, Pipelining and Branch Delay Slot

J. Doe

I am preparing for a test and have such example. Following code:

1: SLL $1, $1, 2
2: LW $2, 1000($1)
3: BEQL $2, $0, END
4: ADDI $3, $2, 1
5: MULT $3, $2
6: MFLO $4
END:
7: J QUIT
...
QUIT:
100: NOP

is executed on RISC processor (with quasi MIPS instruction set) with

  • five-stage pipeline
  • no bypassing
  • no dynamic scheduling
  • Branch Delay Slot
  • Additionally we know, that branch won't be taken

My task is to understand how the Branch Delay Slot works in this situation and build the correct Pipeline Diagram.

I have an official solution and it gives the following diagram with no explanation:

1: SLL $1, $1, 2         IDEMW  
2: LW $2, 1000($1)        I---DEMW  
3: BEQL $2, $0, END           I---DEMW  
4: ADDI $3, $2, 1                 IDx
5: MULT $3, $2                       IDEMW
6: MFLO $4                            I---DEMW

As far as I understand, ADDI is executed in Branch Delay Slot and is stopped after processor understands, that branch is not taken, what leads us to wrong result. My Questions here are

  • Am i right?
  • When yes, why is ADDI executed in Branch Delay Slot and not Jump?
Ped7g

The CPU keeps reading instructions sequentially, i.e. during execution (was already fetched, decoded and the remaining phases are now being processed, I don't know your exact phases, so this is just general description) of beql it will get the other part of pipe free to fetch next instruction, but the branch was not finished yet, so the PC is still pointing to the next instruction after branch -> that's the "branch delay slot".

On classic MIPS this next instruction is fetched, decoded, and executed, and meanwhile the branch may or may not modify the PC to the branch target, so the branch-delay slot instruction will get executed every time. The next instruction after it gets executed only when branching didn't happen, i.e. the PC continues after the "branch delay slot" position sequentially. In case the branch did modify the PC, the fetch+decode will take notice and decode the next instruction from new destination, so on classic MIPS the branch delay slot is only 1 instruction "big" (I have no idea if more complex MIPS CPUs can have more stages and more delay slots available, technically with 5 stage pipeline even 5 instructions delayed sounds HW possible, but it would be probably very difficult to use practically and sounds like it would create more problems than help).

The BEQL is more complex instruction, killing the delay slot instruction halfway into execution if the branching condition fails.

See http://math-atlas.sourceforge.net/devel/assembly/mips-iv.pdf page 45 for detailed description of BEQL.

So that "NullifyCurrentInstruction()" is probably that "x" in the diagram. Remaining things in diagram, I'm just guessing as I didn't study your 5 stage details, but second LW after fetching and decoding(?) finds out it depends on $1, so it waits in depend-stage for previous instruction W phase. Etc...The ADDI doesn't depend on anything, so it is executed almost parallel to BEQL, and gets killed toward the end.

But I don't understand why there's no "I" phase every time the "I" stage gets freed, looks like the "I" waits for something and in the end you have like at most 2 instructions going on at the same time.

Anyway, this is quite undecipherable without studying the technical details of the CPU used in your question, and I don't want to study it, I'm not even sure what kind of CPU you have, and where to get it's technical documentation.


edit: I will try to extract relevant part of pdf also here, to make this answer not "just link", but copying pdf may be tricky...

BEQL instruction docs of MIPS IV CPU:

Description: if (rs = rt) then branch_likely
An 18-bit signed offset (the 16-bit offset field shifted left 2 bits) is added to the address of the instruction following the branch (not the branch itself), in the branch delay slot, to form a PC-relative effective target address.
If the contents of GPR rs and GPR rt are equal, branch to the target address after the instruction in the delay slot is executed. If the branch is not taken, the instruction in the delay slot is not executed.

Operation:
I:
  tgt_offset ← sign_extend(offset || 02)
  condition ← (GPR[rs] = GPR[rt])
I+1:
if condition then
  PC ← PC + tgt_offset
else
  NullifyCurrentInstruction()
endif

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at
0

Comments

0 comments
Login to comment

Related