S.M.A.L.L - E | Devpost

Inspiration

We wanted to explore what happens when artificial intelligence can not only think and speak, but also see and respond to the people around it. Our team envisioned an interactive, physical assistant that merges computer vision, language models, and speech synthesis into one friendly, real-world presence.

What it does

S.M.A.L.L - E — is an interactive AI robot that can see, listen, and respond like a human companion. It recognizes and tracks faces, listens to speech, and replies in real time with natural conversation and expressive voice. The camera, mounted on a movable servo, allows the robot to physically turn toward whoever is speaking, creating the sense of genuine attention and awareness. Together, these elements make for a small yet lively assistant that feels truly present in its environment and capable of holding meaningful dialogue.

How we built it

We combined several frameworks and devices to bring the robot to life. Two Jetson Nano boards handled the vision and speech recognition tasks, sending processed data to a Raspberry Pi that controlled the servo-mounted camera. This coordination allowed the robot to orient toward a speaker while maintaining smooth, real-time interaction. Our system incorporated the following key components:

-DeepFace for facial detection and tracking -Jetson Xavier for speech recognition and for processing camera data to the Raspberry Pi -Raspberry Pi + servo motors for camera motion control -Faster whisper on coda 11.4 for speech-to-text processing -Gemini API for intelligent language understanding and response generation -ElevenLabs API for realistic text-to-speech output -Python 3.11/3.12 and VS Code for integration and testing -Intel Realsense d435

The system communicates through lightweight network calls between the camera module, the Jetson processing units, and the Raspberry Pi’s control program, forming a unified audio–visual interaction pipeline.

Challenges we ran into

Version control of Jetson hardware and communications between Jetson and servos.

Accomplishments that we're proud of

We hope to have delivered a finished product that allows hardware and AI to effectively interact with its environment and individuals.

What we learned

We deepened our understanding of multimodal AI systems, asynchronous communication between hardware and cloud APIs, and the importance of consistent environment setup for reproducible results. We also gained hands-on experience integrating vision, audio, and language in a cohesive user experience.