Inspiration

Data centers are mission-critical infrastructure, yet even a single overheating or failed server can trigger downtime, expensive repairs, and slow incident response. We were inspired by the Microsoft AI & Automation challenge to build a scaled-down autonomous system that can do what an on-site technician would: detect a fault, identify the correct rack slot, and physically remove the failing server board. That became RackMedic.

What it does

RackMedic is an autonomous server-rack maintenance prototype. It simulates a small data-center rack with multiple server blades, monitors thermal/sensor data from each slot, and triggers an intervention workflow when one blade overheats.

The system:

  • Ingests telemetry from sensor nodes (ESP32-based setup with temperature monitoring).
  • Uses AI logic to determine which server is in a fault state.
  • Commands a robotic arm to align with the target slot.
  • Grasps and pulls out the affected server tray (cardboard prototype blade) for maintenance.
  • Supports manual override/teleoperation for safety and debugging.

How we built it

We built RackMedic as a full-stack robotics + AI prototype with parallel work across hardware, software, and controls:

Hardware and mechanical:

  • Raspberry Pi 5 + AI HAT for edge compute/control.
  • Viam/servo-based robotic arm with custom motion scripts.
  • Custom rack-and-blade physical mockup for repeatable testing.
  • Ultrasonic and thermal/sensor inputs for alignment and fault simulation.

Robotics control software:

  • Low-level serial control for 6-DOF servos (custom packet protocol).
  • Servo ID discovery and calibration tooling.
  • Keyboard teleoperation mode for manual test control.
  • Forward-motion and return trajectories with leveling compensation.
  • Kinematics-assisted movement and fallback logic for reliable reach behavior.

AI and orchestration:

  • AI-assisted fault diagnosis logic for “which blade is failing.”
  • Event-driven action pipeline: detect fault -> select target -> command arm -> extract.
  • Real-time status logging for demos and judging walkthroughs.
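The event-driven pipeline above can be sketched as one orchestration function that wires the stages together. The handler names and logging format here are illustrative; in the real system the `command_arm` and `extract` callbacks drive actual hardware.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("rackmedic")

def run_pipeline(readings, select, command_arm, extract):
    """Detect fault -> select target -> command arm -> extract, with status logging."""
    slot = select(readings)          # AI-assisted fault diagnosis step
    if slot is None:
        log.info("all blades healthy, no action taken")
        return None
    log.info("fault detected in slot %d", slot)
    command_arm(slot)                # align the arm with the target slot
    extract(slot)                    # grasp and pull the blade for maintenance
    log.info("extraction complete for slot %d", slot)
    return slot
```

Keeping the stages as injectable callbacks made it easy to swap real actuation for stubs during debugging and teleop testing.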

Challenges we ran into

  • Hardware constraints and stock limitations forced design pivots.
  • The fast-built cardboard rack offered limited mechanical precision, which left us very tight alignment tolerances.
  • Servo calibration and mapping were nontrivial; minor offsets caused major end-effector errors.
  • Inverse kinematics did not always produce practical motion in real-world calibration, so we implemented deterministic fallback movement deltas.
  • Integrating sensing, AI decision logic, and actuation into a dependable autonomous loop under hackathon time pressure required aggressive debugging and simplification.
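The deterministic fallback mentioned above boiled down to stepping each joint by a bounded, pre-calibrated delta toward its target instead of trusting an IK solution. A minimal sketch, with the step size and function names assumed for illustration:

```python
FALLBACK_STEP_DEG = 2.0  # assumed per-tick joint increment, in degrees

def fallback_step(current: list[float], target: list[float]) -> list[float]:
    """Move each joint at most FALLBACK_STEP_DEG toward its target angle."""
    out = []
    for cur, tgt in zip(current, target):
        # Clamp the per-joint delta so motion stays slow and predictable.
        delta = max(-FALLBACK_STEP_DEG, min(FALLBACK_STEP_DEG, tgt - cur))
        out.append(cur + delta)
    return out
```

Repeating a step like this in a loop until `current` converges gave slower but far more repeatable reach behavior than raw IK on our hardware.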

Accomplishments that we're proud of

  • Delivered an end-to-end autonomous maintenance demo, not just isolated subsystems.
  • Built robust robot-control utilities (ID sweep, teleop, trajectory scripts) that made rapid debugging possible.
  • Created a practical physical prototype environment aligned with sponsor challenge expectations.
  • Executed as a cross-functional team across embedded, mechanical, AI, and systems integration.

What we learned

  • In robotics hackathons, reliability beats complexity: a constrained, repeatable setup wins.
  • Good calibration tooling is as important as the core robot algorithm.
  • Simulation and kinematics are useful, but real hardware behavior always needs empirical fallback logic.
  • Early architecture decisions on interfaces between sensing, AI, and actuation dramatically reduce integration risk.
  • Fast team communication and clear task ownership are critical in a 36-hour build.

What's next for RackMedic

  • Replace coarse alignment with stronger vision-based rack-slot localization.
  • Add force-feedback/grasp validation for safer extraction and reduced jamming risk.
  • Improve autonomy with closed-loop correction during pull-out.
  • Build a cleaner operator dashboard with live telemetry, fault history, and manual override controls.
  • Move from cardboard rack prototype toward a more realistic modular testbed.
  • Expand from single-fault response to scheduling/queuing multiple incidents and predictive maintenance scoring.
