💡 Inspiration
The inspiration for AI Doctor came from a simple yet powerful question:
"What if medical advice could be available to anyone, anytime, just by talking?"
In many regions, people face:
Long hospital waiting times
Lack of nearby doctors
Language and literacy barriers
We wanted to create a solution where any person could speak their symptoms, upload an image (like a rash or eye issue), and receive AI-powered medical guidance, all within seconds.
🧠 What We Learned
During this project, our team learned:
How to use Groq Whisper for real-time speech-to-text conversion
How to integrate AI vision models for image-based reasoning
How to build interactive multimodal apps with Gradio
How to securely deploy projects using Hugging Face Spaces
The importance of designing safe, ethical, and human-centered AI in healthcare
We also explored how latency and inference speed can impact real-time AI interactions, and how Groq's accelerated LLMs solve that challenge.
🛠️ How We Built It
Frontend: Built using Gradio to capture audio and image inputs.
Audio Pipeline:
User speaks into the mic 🎙️
Audio is recorded and processed with pydub
Transcribed using Whisper-large-v3 via Groq API
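In sketch form, the audio pipeline looks roughly like this (file paths and function names are illustrative; the `pydub` and `groq` packages are imported lazily inside the functions so the snippet loads even where they are not installed):

```python
# Sketch of the audio pipeline: normalize the recorded clip with pydub,
# then transcribe it with whisper-large-v3 via the Groq API.
# Assumes GROQ_API_KEY is set in the environment.
import os

def to_mp3(audio_path: str) -> str:
    """Re-encode the browser recording as mp3 before upload."""
    from pydub import AudioSegment  # lazy import; requires pydub + ffmpeg
    out_path = audio_path.rsplit(".", 1)[0] + ".mp3"
    AudioSegment.from_file(audio_path).export(out_path, format="mp3", bitrate="128k")
    return out_path

def transcribe(audio_path: str) -> str:
    """Send the audio file to Groq's Whisper endpoint and return the text."""
    from groq import Groq  # lazy import; requires the groq package
    client = Groq(api_key=os.environ["GROQ_API_KEY"])
    with open(audio_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(
            model="whisper-large-v3",
            file=audio_file,
        )
    return result.text
```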
Image + Text Processing:
Uploaded image encoded into base64 format
Text and image passed into LLaMA 4 instruct model for reasoning
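The encoding step can be sketched with the standard library alone; the message layout follows the OpenAI-style chat format that Groq exposes, and the exact shape is an assumption:

```python
# Sketch: base64-encode the uploaded image and build a multimodal
# chat message combining the transcribed question with the image.
import base64

def encode_image(path: str) -> str:
    """Read an image file and return its base64 string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def build_messages(query: str, image_b64: str) -> list:
    """Pack text + image into one OpenAI-style user message."""
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": query},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }]
```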
Doctor's Response:
AI generates a natural medical-style answer
(Optional) Converted to voice with text-to-speech (gTTS / ElevenLabs)
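The optional voice step with gTTS is a one-liner in practice (output filename is illustrative; `gTTS` is imported lazily and needs network access to synthesize):

```python
# Sketch of the optional text-to-speech step using gTTS.
def speak(text: str, out_path: str = "doctor_reply.mp3") -> str:
    """Convert the doctor's reply to an mp3 file and return its path."""
    from gtts import gTTS  # lazy import; requires gTTS + internet
    gTTS(text=text, lang="en").save(out_path)
    return out_path
```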
Deployment:
Hosted on Hugging Face Spaces using gradio deploy
Mathematically, we represent the multimodal input as:

`f(voice, image) = LLM(Transcribe(voice), Encode(image))`

where `f` is the AI Doctor's response function combining audio and visual understanding.
🚧 Challenges We Faced
Audio Processing on Cloud:
pyaudio failed on Hugging Face (no mic device), so we redesigned the pipeline to use browser-based mic recording via Gradio.
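The fix, in sketch form: Gradio's Audio component records in the visitor's browser and hands the server a file path, so no server-side microphone (and no `pyaudio`) is needed. `diagnose` below is a stub standing in for the real pipeline, and `gradio` is imported lazily:

```python
# Sketch of browser-based mic recording with Gradio.
def diagnose(audio_path, image_path):
    """Stub: the real app transcribes audio, encodes the image, queries the LLM."""
    return f"Received audio={audio_path}, image={image_path}"

def build_demo():
    import gradio as gr  # lazy import; requires the gradio package
    return gr.Interface(
        fn=diagnose,
        inputs=[
            gr.Audio(sources=["microphone"], type="filepath"),
            gr.Image(type="filepath"),
        ],
        outputs=gr.Textbox(label="AI Doctor"),
    )
```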
Groq API Integration:
Parsing responses from the chat completion API required handling new response formats.
Model Latency:
Managing large models for image reasoning while keeping inference time low.
Deployment Issues:
Handling environment variables and file paths on Hugging Face Spaces correctly.
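A small pattern that handles both cases (a local `.env` file via python-dotenv, and Spaces secrets injected directly as environment variables) might look like this sketch:

```python
# Sketch: load the API key whether it comes from a local .env file
# or from Hugging Face Spaces secrets (plain environment variables).
import os

def get_api_key() -> str:
    try:
        from dotenv import load_dotenv  # optional locally; absent on Spaces is fine
        load_dotenv()  # does not override variables already set
    except ImportError:
        pass
    key = os.getenv("GROQ_API_KEY")
    if not key:
        raise RuntimeError("GROQ_API_KEY is not set")
    return key
```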
🚀 Future Scope
🌍 Multi-language support
📱 Mobile app version
🧩 Integration with hospital databases (EHR)
⚡ Offline lightweight version for low-connectivity regions
👨‍💻 Tech Stack
Python, Gradio, Groq API (Whisper + LLaMA), Hugging Face Spaces, pydub, dotenv, GitHub
Built With
- dotenv
- gradio
- groq-api-(whisper-+-llama)
- hugging-face-spaces
- pydub
- python