๐Ÿ’ก Inspiration

The inspiration for AI Doctor came from a simple yet powerful question:

โ€œWhat if medical advice could be available to anyone, anytime โ€” just by talking?โ€

In many regions, people face:

Long hospital waiting times

Lack of nearby doctors

Language and literacy barriers

We wanted to create a solution where any person could speak their symptoms, upload an image (like a rash or eye issue), and receive AI-powered medical guidance โ€” all within seconds.

๐Ÿง  What We Learned

During this project, our team learned:

How to use Groq Whisper for real-time speech-to-text conversion

How to integrate AI vision models for image-based reasoning

How to build interactive multimodal apps with Gradio

How to securely deploy projects using Hugging Face Spaces

The importance of designing safe, ethical, and human-centered AI in healthcare

We also explored how latency and inference speed can impact real-time AI interactions, and how Groqโ€™s accelerated LLMs solve that challenge.

๐Ÿ› ๏ธ How We Built It

Frontend: Built using Gradio to capture audio and image inputs.

Audio Pipeline:

User speaks into the mic ๐ŸŽ™๏ธ

Audio is recorded and processed with pydub

Transcribed using Whisper-large-v3 via Groq API

Image + Text Processing:

Uploaded image encoded into base64 format

Text and image passed into LLaMA 4 instruct model for reasoning

Doctorโ€™s Response:

AI generates a natural medical-style answer

(Optional) Converted to voice with text-to-speech (gTTS / ElevenLabs)

Deployment:

Hosted on Hugging Face Spaces using gradio deploy

Mathematically, we represent the multimodal input as:

๐‘“ ( voice , image

)

LLM ( Transcribe ( voice ) , Encode ( image ) ) f(voice,image)=LLM(Transcribe(voice),Encode(image))

where ๐‘“ f is the AI Doctorโ€™s response function combining audio and visual understanding.

๐Ÿšง Challenges We Faced

Audio Processing on Cloud:

pyaudio failed on Hugging Face (no mic device), so we redesigned the pipeline to use browser-based mic recording via Gradio.

Groq API Integration:

Parsing responses from the chat completion API required handling new response formats.

Model Latency:

Managing large models for image reasoning while keeping inference time low.

Deployment Issues:

Handling environment variables and file paths on Hugging Face Spaces correctly.

๐Ÿš€ Future Scope

๐ŸŒ Multi-language support

๐Ÿ“ฑ Mobile app version

๐Ÿงฉ Integration with hospital databases (EHR)

โšก Offline lightweight version for low-connectivity regions

๐Ÿ‘จโ€๐Ÿ’ป Tech Stack

Python, Gradio, Groq API (Whisper + LLaMA), Hugging Face Spaces, pydub, dotenv, GitHub

Built With

  • dotenv
  • gradio
  • groq-api-(whisper-+-llama)
  • hugging-face-spaces
  • pydub
  • python
Share this project:

Updates