AegisFS

Inspiration

I wanted to build something grounded in real systems concepts, not just a surface-level demo. AegisFS started as a way for me to understand how distributed file systems actually handle correctness, durability, and crash recovery.

What it does

AegisFS is a lightweight distributed file system with a Metadata Server and DataNodes. It supports block reads/writes, persistent metadata through a journal, and safe recovery even if the system crashes during an update.

How I built it

I implemented a custom TCP-based RPC layer, a journaling Metadata Server using a BEGIN/APPLY/COMMIT protocol, and a DataNode storage system that uses temporary files, fsync, and atomic renames to guarantee durability. Everything is modular so each component can be tested, debugged, or expanded independently.
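The temp-file, fsync, and rename sequence on the DataNode can be sketched roughly as follows. This is a minimal illustration, not the actual AegisFS code: the function name `write_block_atomically`, the one-file-per-block layout, and the `.tmp` suffix are all assumptions made for the example.

```python
import os
import tempfile


def write_block_atomically(directory: str, block_id: str, data: bytes) -> str:
    """Write a block so a reader sees either the old contents or the new ones.

    Hypothetical helper: the name and on-disk layout are assumptions.
    """
    final_path = os.path.join(directory, block_id)
    # Create the temp file in the same directory so the rename never
    # crosses a filesystem boundary (a cross-device rename is not atomic).
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=block_id + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()              # push Python's buffer to the OS
            os.fsync(f.fileno())   # ask the OS to push the data to disk
        # Atomically swap the finished file into place.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the partial temp file; the old block is untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    # On POSIX, fsync the directory too so the rename itself is persisted.
    if hasattr(os, "O_DIRECTORY"):
        dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
    return final_path
```

If the process dies before `os.replace`, only a stray temp file is left behind and the previous block contents remain intact.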

Challenges I ran into

Designing the journal so it could correctly handle crashes was the hardest part. I also had to deal with TCP message boundaries and partial reads, and make sure DataNode writes were truly atomic rather than just “write a file and hope for the best.”
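One common way to handle message boundaries and partial reads is length-prefixed framing. The sketch below assumes a 4-byte big-endian length header; the AegisFS wire format is not described here, so the header layout and the helper names (`recv_exact`, `send_message`, `recv_message`) are illustrative only.

```python
import socket
import struct

# Assumed framing: 4-byte unsigned big-endian length, then the payload.
HEADER = struct.Struct("!I")


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, looping because recv() may return fewer."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            # recv() returning b"" means the peer closed the connection.
            raise ConnectionError("peer closed connection mid-message")
        buf.extend(chunk)
    return bytes(buf)


def send_message(sock: socket.socket, payload: bytes) -> None:
    """Send one framed message; sendall() retries until all bytes are sent."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_message(sock: socket.socket) -> bytes:
    """Receive one framed message, regardless of how TCP split the bytes."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)
```

Because TCP is a byte stream, a single `recv()` can return part of a message or several messages at once; reading the header first and then exactly `length` bytes restores the boundaries.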

Accomplishments I'm proud of

The system works reliably under failure conditions. Metadata replays cleanly after restarts, block writes avoid corruption, and the overall architecture is clean and extensible. Getting the journaling layer working correctly was a major accomplishment.
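To illustrate what replaying a BEGIN/APPLY/COMMIT journal after a restart might look like, here is a minimal sketch. It assumes one JSON record per line with `op`, `txid`, `path`, and `blocks` fields; that record format and the function name `replay_journal` are hypothetical and not taken from AegisFS itself.

```python
import json


def replay_journal(lines):
    """Rebuild metadata from journal records, keeping only committed work.

    Assumed record format (hypothetical), one JSON object per line:
      {"op": "BEGIN",  "txid": 1}
      {"op": "APPLY",  "txid": 1, "path": "/a.txt", "blocks": ["blk-1"]}
      {"op": "COMMIT", "txid": 1}
    """
    metadata = {}   # path -> list of block ids
    pending = {}    # txid -> updates seen since BEGIN, not yet committed
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            # A torn record from a crash mid-append: stop replaying here.
            break
        op, txid = rec["op"], rec["txid"]
        if op == "BEGIN":
            pending[txid] = []
        elif op == "APPLY" and txid in pending:
            pending[txid].append((rec["path"], rec["blocks"]))
        elif op == "COMMIT" and txid in pending:
            # Only now do the buffered updates become visible.
            for path, blocks in pending.pop(txid):
                metadata[path] = blocks
    # Transactions left in `pending` never committed and are discarded.
    return metadata
```

A transaction that crashed between BEGIN and COMMIT leaves no trace in the rebuilt metadata, which is the property that makes recovery safe.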

What I learned

I learned how real systems use intent logs, commit protocols, and atomic filesystem operations to guarantee data integrity. I also saw how valuable modular design is when building and debugging complex systems.

What’s next for AegisFS

Next, I plan to add replication across DataNodes, load balancing, and more advanced fault-tolerance features. The foundation is solid enough to grow into a more fully featured distributed file system.

Built With

python

Updates

Andrew Chan started this project — Nov 16, 2025 02:25 PM EST
