Inspiration

This project started with a personal reflection. I occasionally experience minor visual difficulties, which led me to think about how technology could support people with more significant visual impairments.

The result is a serverless, multilingual API that turns images into both text and audio descriptions. Built to improve accessibility, enhance content workflows, and enable new use cases across the web.

What it does

The Image Description API accepts a base64-encoded image and a target language. It returns:

  • A detailed description of the image in plain text
  • An optional MP3 audio version of the description, using natural-sounding voices

Use cases include:

  • Accessibility tools (e.g. alt-text generation, screen readers)
  • E-commerce automation (product descriptions)
  • Social media (caption generation in multiple languages)
  • Educational tools (describing visuals for learners)

How we built it

We used a fully serverless AWS architecture:

  • Amazon Bedrock (Nova Lite model): For generating image descriptions using a multimodal foundation model
  • AWS Lambda: Orchestrates each request, handles validation, model invocation, and text-to-speech
  • Amazon Polly: Generates the audio version of the description in a natural voice
  • Amazon API Gateway: Exposes a secure, scalable REST API
  • AWS SAM: Used for infrastructure as code and streamlined deployment

Everything runs serverlessly with fast cold-start times and low latency.

Challenges we ran into

  • Handling image inputs across API Gateway, Lambda, and Bedrock with minimal overhead
  • Prompting the Bedrock model to generate accurate, concise, and translatable descriptions
  • Balancing performance with multi-step operations (image → text → translation → speech)
  • Making the API responses simple yet flexible (JSON or audio, multiple formats)

Accomplishments that we're proud of

  • Built a fully functional, production-ready solution
  • Delivered multilingual accessibility out of the box
  • Created a tool that's flexible enough for real-world use cases across industries
  • Open-sourced the entire codebase for others to use, learn from, and extend

What we learned

  • How to integrate AWS Bedrock’s multimodal AI models (like Nova Lite) into real-time workflows
  • Effective prompt engineering for image understanding in multiple languages
  • Best practices for managing binary data (images/audio) in serverless APIs
  • How powerful and accessible AI becomes when paired with clean cloud architecture

What's next for Image Description API

  • Build a lightweight web app demo to make the API more accessible for non-developers
  • Allow batch image processing

Built With

Share this project:

Updates