Inspiration
In ECE342, we studied floating-point representation, normalization, rounding, and how arithmetic is implemented at the hardware level. While we understood the theory of IEEE-style floating point formats, we wanted to go deeper and build a working floating-point unit ourselves.
Instead of implementing full IEEE-754 single precision, we designed a simplified 8-bit floating point format (E3M4) to better understand how exponent alignment, mantissa arithmetic, normalization, and rounding are actually handled in hardware.
What it does
Our design implements a custom 8-bit floating point unit using:
Format: [S][EEE][MMMM]
1 sign bit, 3 exponent bits (bias = 3), 4 mantissa bits
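To make the format concrete, here is a minimal C++ sketch of how an E3M4 byte could be decoded into a real value. The helper name `e3m4_to_double` is hypothetical. The sketch also assumes that an exponent field of 0 encodes zero and subnormals (no hidden 1), and that every other exponent code is an ordinary normalized number, since NaN and infinity are not handled yet.

```cpp
#include <cmath>
#include <cstdint>

// Decode an E3M4 byte laid out as [S][EEE][MMMM] into a double.
// Hypothetical helper for illustration, not the project's own code.
double e3m4_to_double(std::uint8_t bits) {
    const int sign = (bits >> 7) & 0x1;
    const int exp  = (bits >> 4) & 0x7;
    const int man  = bits & 0xF;
    const int bias = 3;

    double magnitude;
    if (exp == 0) {
        // Assumption: exponent field 0 means zero/subnormal,
        // so there is no hidden 1 and the exponent is fixed at 1 - bias.
        magnitude = std::ldexp(man / 16.0, 1 - bias);
    } else {
        // Normalized: value is 1.MMMM * 2^(exp - bias).
        magnitude = std::ldexp(1.0 + man / 16.0, exp - bias);
    }
    return sign ? -magnitude : magnitude;
}
```

Under these assumptions the largest representable magnitude is 31.0 (`0x7F`) and the smallest positive normalized value is 0.25 (`0x10`).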
The FPU supports:
- Addition
- Subtraction
- Multiplication
- Division
- Fused Multiply-Add (FMA)
- Square Root
All outputs are normalized and rounded back into E3M4 format after computation.
How we built it
The architecture is divided into modular stages:
Unpack stage: Extract sign, exponent, and mantissa. Insert hidden leading 1 for normalized numbers.
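The unpack step can be sketched in C++ roughly as follows. The `Unpacked` struct and `unpack` function are hypothetical names, and treating an exponent field of 0 as "no hidden 1" is an assumption about how non-normalized inputs are handled.

```cpp
#include <cstdint>

// Fields after unpacking (hypothetical layout for illustration).
struct Unpacked {
    bool sign;
    int exp;           // unbiased exponent
    std::uint8_t sig;  // 5-bit significand h.mmmm (hidden bit in bit 4)
};

Unpacked unpack(std::uint8_t bits) {
    const int e = (bits >> 4) & 0x7;
    const std::uint8_t m = bits & 0xF;

    Unpacked u;
    u.sign = (bits >> 7) != 0;
    if (e == 0) {
        // Assumption: exponent field 0 has no hidden 1.
        u.exp = 1 - 3;
        u.sig = m;
    } else {
        // Normalized number: insert the hidden leading 1.
        u.exp = e - 3;  // remove the bias of 3
        u.sig = static_cast<std::uint8_t>(0x10 | m);
    }
    return u;
}
```

For example, `unpack(0x38)` (the encoding of 1.5) gives an unbiased exponent of 0 and a significand of `0x18`, that is, binary 1.1000.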
Arithmetic core:
- Add/Sub: exponent alignment, significand addition/subtraction
- Mul: exponent addition, significand multiplication
- Div: exponent subtraction, iterative mantissa division
- FMA: full-precision multiply followed by aligned accumulation
- Sqrt: exponent halving and iterative mantissa square root
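As an illustration of the two simplest datapaths, the C++ sketch below shows a multiply core (exponent addition, significand multiplication) and a same-sign add core (exponent alignment, significand addition). All names are hypothetical, inputs are assumed to be already unpacked into an unbiased exponent and a 5-bit significand, and the choice of three extra low-order bits in the adder is an assumption made so that a later rounding step has guard, round, and sticky information. Results are left unnormalized.

```cpp
#include <cstdint>
#include <utility>

// Shift right by n, OR-ing every shifted-out bit into the LSB (sticky).
std::uint16_t shift_right_sticky(std::uint16_t v, int n) {
    if (n <= 0) return v;
    if (n >= 16) return v != 0 ? 1 : 0;
    const unsigned lost = v & ((1u << n) - 1u);
    return static_cast<std::uint16_t>((v >> n) | (lost != 0 ? 1u : 0u));
}

// Multiply core: signs XOR, exponents add, significands multiply.
// Two 5-bit significands (4 fraction bits each) give a product
// with 8 fraction bits, in the range [1, 4) for normalized inputs.
struct RawProduct { bool sign; int exp; std::uint16_t sig; };

RawProduct mul_core(bool sa, int ea, std::uint8_t siga,
                    bool sb, int eb, std::uint8_t sigb) {
    return RawProduct{sa != sb, ea + eb,
                      static_cast<std::uint16_t>(siga * sigb)};
}

// Add core for operands of the same sign: align the operand with the
// smaller exponent, then add. The sum carries 4 + 3 = 7 fraction bits.
struct RawSum { int exp; std::uint16_t sig; };

RawSum add_magnitudes(int ea, std::uint8_t siga,
                      int eb, std::uint8_t sigb) {
    std::uint16_t a = static_cast<std::uint16_t>(siga << 3);
    std::uint16_t b = static_cast<std::uint16_t>(sigb << 3);
    if (ea < eb) {
        std::swap(ea, eb);
        std::swap(a, b);
    }
    b = shift_right_sticky(b, ea - eb);  // exponent alignment
    return RawSum{ea, static_cast<std::uint16_t>(a + b)};
}
```

For instance, multiplying 1.5 by 1.5 yields a raw significand of `0x240` with 8 fraction bits (2.25), which the normalization stage would then shift back into the 1.M range.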
Normalization stage: Results are shifted to restore the canonical form $1.M \times 2^{E}$, and the exponent is adjusted accordingly.
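A normalization step along these lines could look like the following C++ sketch. The `normalize` function and its `frac_bits` parameter are hypothetical. The sketch folds any bit shifted out on the right into the LSB as a sticky bit, and it does not check for exponent overflow or underflow, which a complete design would need to do.

```cpp
#include <cstdint>

struct Norm { int exp; std::uint16_t sig; };

// Shift sig until its leading 1 sits at bit position frac_bits,
// adjusting the exponent so the represented value is unchanged.
Norm normalize(int exp, std::uint16_t sig, int frac_bits) {
    if (sig == 0) return {exp, 0};  // zero has no leading 1 to find

    const std::uint32_t one = 1u << frac_bits;
    std::uint32_t s = sig;

    // Too large (>= 2.0): shift right, keep the lost bit as sticky.
    while (s >= 2 * one) {
        s = (s >> 1) | (s & 1u);
        ++exp;
    }
    // Too small (< 1.0): shift left.
    while (s < one) {
        s <<= 1;
        --exp;
    }
    return {exp, static_cast<std::uint16_t>(s)};
}
```

As an example, a raw product of `0x240` with 8 fraction bits (2.25 × 2^0) normalizes to `0x120` with exponent 1 (1.125 × 2^1).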
Rounding stage: Guard, round, and sticky bits are used to implement round-to-nearest behavior before truncating back to 4 mantissa bits.
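The rounding decision can be sketched in C++ as below. The function name and input layout are hypothetical: the input is assumed to be a normalized 8-bit value holding the hidden bit, 4 mantissa bits, and then the guard, round, and sticky bits. The tie-breaking rule is also an assumption; this sketch uses round-to-nearest, ties-to-even, as in IEEE-754's default mode.

```cpp
#include <cstdint>

struct Rounded { int exp; std::uint8_t sig; };  // sig is 5-bit 1.mmmm

// Input layout (8 bits): [h][m3 m2 m1 m0][G][R][S]
Rounded round_nearest_even(int exp, std::uint8_t sig_grs) {
    unsigned kept = sig_grs >> 3;                 // h.mmmm
    const bool guard = ((sig_grs >> 2) & 1u) != 0;
    const bool round_or_sticky = (sig_grs & 0x3) != 0;
    const bool lsb = (kept & 1u) != 0;

    // Round up when above the halfway point, or exactly halfway
    // with an odd LSB (ties-to-even, an assumed tie rule).
    if (guard && (round_or_sticky || lsb)) {
        ++kept;
    }
    // Rounding can carry out of the significand (1.1111 -> 10.0000);
    // renormalize by bumping the exponent.
    if (kept == 0x20) {
        kept = 0x10;
        ++exp;
    }
    return {exp, static_cast<std::uint8_t>(kept)};
}
```

Note the carry-out case: rounding `0xFF` overflows the 5-bit significand, so the result becomes 1.0000 with the exponent incremented by one.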
Pack stage: Final result is encoded back into 8-bit E3M4 format.
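A minimal C++ sketch of the pack step follows. The `pack` function is a hypothetical name, and it assumes its input is already normalized, rounded, and within the representable exponent range (unbiased -2 to 4). How the design responds to overflow or underflow is not described here, so the sketch deliberately leaves that out.

```cpp
#include <cstdint>

// Encode sign, unbiased exponent, and a 5-bit significand (h.mmmm)
// back into the [S][EEE][MMMM] byte.
std::uint8_t pack(bool sign, int exp, std::uint8_t sig) {
    const unsigned s = sign ? 0x80u : 0x00u;

    if ((sig & 0x10) == 0) {
        // Assumption: no hidden bit means zero or subnormal,
        // which is stored with an exponent field of 0.
        return static_cast<std::uint8_t>(s | (sig & 0xFu));
    }

    // Re-apply the bias of 3; in-range inputs give a field of 1..7.
    const unsigned biased = static_cast<unsigned>(exp + 3) & 0x7u;
    // The hidden bit is dropped; only the 4 mantissa bits are stored.
    return static_cast<std::uint8_t>(s | (biased << 4) | (sig & 0xFu));
}
```

For example, `pack(false, 0, 0x18)` returns `0x38`, the encoding of 1.5.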
Challenges we ran into
- Limited dynamic range (the exponent is only 3 bits)
- Significant rounding error due to only 4 mantissa bits
Accomplishments that we're proud of
- Successfully implemented six floating-point operations in only 8 bits
- Demonstrated correct normalized outputs in all supported operations
What we learned
How small bit-width designs magnify architectural decisions
What's next for 8 bit FPU
- IEEE-style NaN and infinity handling
- Multiple rounding modes
Built With
- c++
- cognichip