A Simple Implementation of DLX using VHDL
DLX provides a good architectural model for study, not only because of the recent popularity of this type of machine, but also because it is easy to understand.
Like most recent load/store machines, DLX emphasizes
- A simple load/store instruction set
- Design for pipelining efficiency
- An easily decoded instruction set
- Efficiency as a compiler target

Registers for DLX
- thirty-two 32-bit general purpose registers (GPRs), named R0, R1, ..., R31. The value of R0 is always 0.
- thirty-two floating-point registers (FPRs), which can be used as
- 32 single precision (32-bit) registers or
- even-odd pairs holding double-precision values. Thus, the 64-bit FPRs are named F0,F2,...,F30
- a few special registers, which can be transferred to and from the integer registers.
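The register rules above can be sketched as a small behavioural model. This is a Python illustration only, not the VHDL from this page; the names `RegisterFile` and `double_register_pair` are hypothetical.

```python
class RegisterFile:
    """Toy model of the 32 x 32-bit integer registers, with R0 fixed at 0."""

    MASK32 = 0xFFFFFFFF

    def __init__(self):
        self._regs = [0] * 32

    def read(self, index):
        return self._regs[index]

    def write(self, index, value):
        if index == 0:
            return  # R0 always reads as 0, so writes to it are discarded
        self._regs[index] = value & self.MASK32  # keep only the low 32 bits


def double_register_pair(n):
    """Return the even-odd pair of single-precision registers behind 64-bit Fn."""
    if n % 2 != 0 or not 0 <= n <= 30:
        raise ValueError("double-precision registers are F0, F2, ..., F30")
    return (n, n + 1)


gprs = RegisterFile()
gprs.write(5, 0x100000001)  # value wider than 32 bits is truncated
gprs.write(0, 123)          # has no effect
```

Here `gprs.read(0)` still returns 0 after the write, and `double_register_pair(2)` names the pair F2/F3.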
Data types for DLX
for integer data
- 8-bit bytes
- 16-bit half words
- 32-bit words
for floating point
- 32-bit single precision
- 64-bit double precision
The DLX operations work on 32-bit integers and on 32- or 64-bit floating-point values. Bytes and half words are loaded into registers with the upper bits of the 32-bit register filled either with zeros (zero extension) or with copies of the sign bit (sign extension).
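As an illustration of the two fill rules for bytes and half words, here is a minimal Python sketch. The helper names `zero_extend` and `sign_extend` are hypothetical and are not taken from the source code of this page.

```python
MASK32 = 0xFFFFFFFF


def zero_extend(value, bits):
    """Keep the low `bits` bits; the upper bits of the 32-bit result are 0."""
    return value & ((1 << bits) - 1)


def sign_extend(value, bits):
    """Keep the low `bits` bits and replicate the top one into the upper bits."""
    low_mask = (1 << bits) - 1
    value &= low_mask
    if value & (1 << (bits - 1)):       # sign bit of the narrow value is set
        value |= MASK32 & ~low_mask     # fill the upper bits with ones
    return value


byte_loaded_unsigned = zero_extend(0x80, 8)    # 0x00000080
byte_loaded_signed = sign_extend(0x80, 8)      # 0xFFFFFF80
half_loaded_signed = sign_extend(0x8000, 16)   # 0xFFFF8000
```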
Instruction format
Operations
There are four classes of instructions:
1. Load/Store
Any of the GPRs or FPRs may be loaded and stored, except that loading R0 has no effect.
2. ALU Operations
All ALU instructions are register-register instructions.
The operations are:
- add
- subtract
- AND
- OR
- XOR
- shifts
Compare instructions compare two registers (=, !=, <, >, <=, >=).
If the condition is true, these instructions place a 1 in the destination register, otherwise they place a 0.
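The ALU and compare behaviour described above might be modelled as follows. This is a Python sketch under stated assumptions: results wrap to 32 bits, comparisons treat the operands as signed two's-complement values, and shifts are left out. The operation names (`"add"`, `"lt"`, and so on) are illustrative labels, not DLX mnemonics.

```python
import operator

MASK32 = 0xFFFFFFFF

ALU_OPS = {
    "add": operator.add,
    "sub": operator.sub,
    "and": operator.and_,
    "or": operator.or_,
    "xor": operator.xor,
}

COMPARISONS = {
    "eq": operator.eq, "ne": operator.ne,
    "lt": operator.lt, "gt": operator.gt,
    "le": operator.le, "ge": operator.ge,
}


def to_signed(value):
    """Interpret a 32-bit pattern as a signed integer (assumption: signed compares)."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def alu(op, a, b):
    """Register-register ALU operation; the result wraps to 32 bits."""
    return ALU_OPS[op](a, b) & MASK32


def compare(op, a, b):
    """Return 1 if the condition holds, otherwise 0, as a destination register value."""
    return 1 if COMPARISONS[op](to_signed(a), to_signed(b)) else 0
```

For example, `compare("lt", 0xFFFFFFFF, 0)` yields 1 under the signed assumption, because the first operand is read as -1.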
3. Branches/Jumps
All branches are conditional. The branch condition is specified by the instruction, which may test the source register for zero or nonzero.
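A branch decision of this kind can be sketched in a few lines of Python. The function name `next_pc` is hypothetical, and two details are assumptions not stated in the text above: instructions are 4 bytes long, and the branch offset is added to the address of the following instruction.

```python
MASK32 = 0xFFFFFFFF


def next_pc(pc, reg_value, offset, branch_if_zero):
    """Pick the next program counter for a conditional branch.

    branch_if_zero=True  models a branch taken when the register is zero,
    branch_if_zero=False models a branch taken when it is nonzero.
    """
    if branch_if_zero:
        taken = reg_value == 0
    else:
        taken = reg_value != 0
    fall_through = pc + 4  # assumption: fixed 4-byte instructions
    target = fall_through + offset if taken else fall_through
    return target & MASK32


taken_pc = next_pc(100, 0, 16, branch_if_zero=True)       # condition holds
not_taken_pc = next_pc(100, 0, 16, branch_if_zero=False)  # condition fails
```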
4. Floating-Point Operations
- add
- subtract
- multiply
- divide
Performance analysis:
Compare with a nonpipelined machine (assuming no hazards or cache misses).
Assume 40% of instructions are ALU operations requiring 4 cycles, 20%
are branches requiring 4 cycles, and 40% are loads and stores requiring
5 cycles. Assume a clock cycle time of 10 ns.
In comparing the machines, we can disregard the instruction count (IC),
which is the same for both, and look only at CPI and clock cycle time.
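Average instruction time (nonpipelined, 5-cycle) = clock time x CPI
= (10 ns) x ((0.4 + 0.2)(4) + (0.4)(5))
= (10 ns) x 4.4
= 44 ns
The pipelined machine uses an 11 ns clock (the 10 ns cycle plus, presumably, 1 ns of pipelining overhead) and ideally completes one instruction per cycle:
Average instruction time (pipelined) = (11 ns)(1) = 11 ns
So the speedup is 44 ns / 11 ns = 4.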
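The arithmetic above can be checked with a short Python sketch. The instruction mix, cycle counts, and clock times come from the text; the variable names are illustrative only.

```python
# (percentage of instructions, cycles per instruction) for each class
instruction_mix = [
    (40, 4),  # ALU operations
    (20, 4),  # branches
    (40, 5),  # loads and stores
]

nonpipelined_clock_ns = 10.0
pipelined_clock_ns = 11.0   # value given in the text
pipelined_cpi = 1.0         # ideal pipeline: no hazards or cache misses

# Weighted average CPI of the nonpipelined machine (about 4.4)
nonpipelined_cpi = sum(pct * cycles for pct, cycles in instruction_mix) / 100

nonpipelined_time_ns = nonpipelined_clock_ns * nonpipelined_cpi  # about 44 ns
pipelined_time_ns = pipelined_clock_ns * pipelined_cpi           # 11 ns

speedup = nonpipelined_time_ns / pipelined_time_ns               # about 4
print(f"speedup = {speedup:.2f}")
```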
Consider a single-cycle implementation of DLX. Assume the stages
require 10 ns, 8 ns, 10 ns, 10 ns, and 7 ns. Then one instruction
can be completed in a single 10 + 8 + 10 + 10 + 7 = 45 ns clock cycle.
Average instruction time (nonpipelined, single cycle) = (45 ns)(1) =
45 ns
Against the 11 ns pipelined machine, the speedup is 45 ns / 11 ns, or about 4.1.
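The single-cycle comparison can be worked through the same way. In this Python sketch the stage latencies come from the text. Deriving the 11 ns pipelined clock as the slowest stage plus 1 ns of overhead is an assumption, chosen because it reproduces the figure used above.

```python
stage_latencies_ns = [10, 8, 10, 10, 7]

# Single-cycle machine: one long cycle covers every stage in sequence.
single_cycle_time_ns = sum(stage_latencies_ns)  # 45 ns

# Pipelined machine: the clock must fit the slowest stage.
assumed_overhead_ns = 1  # assumption, not stated in the text
pipelined_clock_ns = max(stage_latencies_ns) + assumed_overhead_ns  # 11 ns

speedup = single_cycle_time_ns / pipelined_clock_ns
print(f"speedup = {speedup:.1f}")
```

The slowest stage, not the average, sets the pipelined clock, which is why the speedup stays below the number of stages.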