🖼️ Image Caption Generator using Vision Transformers + OpenCV
📌 Overview
This project generates natural language captions for images by combining a Vision Transformer (ViT)-based vision encoder with a transformer-based language model. It uses OpenCV for image preprocessing and video frame handling, and a pretrained multimodal model (such as BLIP or a ViT-GPT-style encoder-decoder) to convert visual features into textual descriptions.
The system supports two captioning modes:
- 🧠 Professional captions (LinkedIn-style)
- 🔥 Casual captions (Instagram/trend-style with hashtags)
🏗️ Architecture
1. Image Processing (OpenCV)
- Loads images or video frames
- Converts BGR → RGB
- Resizes and normalizes input
- Optional: frame sampling for video-based captioning
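The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the 224-pixel target size, the [0, 1] scaling, and the one-frame-per-second sampling rule are all assumptions. Note that Hugging Face processors usually perform their own resizing and normalization, so in practice only the BGR → RGB conversion may be needed before handing the image to the model.

```python
def sample_frame_indices(total_frames, fps, every_seconds=1.0):
    """Return indices of frames to caption, one every `every_seconds` (assumed rule)."""
    if total_frames <= 0 or fps <= 0:
        return []
    step = max(1, round(fps * every_seconds))
    return list(range(0, total_frames, step))


def preprocess_frame(frame_bgr, size=224):
    """Convert an OpenCV BGR frame to a resized RGB float array in [0, 1]."""
    # Imported here so the frame-index helper works without OpenCV installed.
    import cv2
    import numpy as np

    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)  # OpenCV loads images as BGR
    resized = cv2.resize(rgb, (size, size), interpolation=cv2.INTER_AREA)
    return resized.astype(np.float32) / 255.0  # hypothetical normalization choice
```

A typical call would pass the array returned by `cv2.imread(path)` or `cv2.VideoCapture.read()` to `preprocess_frame`, using `sample_frame_indices` to decide which video frames to read.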
2. Vision Encoder (Vision Transformer / ViT)
- Splits image into fixed-size patches
- Converts patches into embeddings
- Uses self-attention to capture global context
- Produces high-dimensional visual representation
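To make the patch-splitting step concrete, here is a small pure-Python sketch. The 224-pixel image and 16-pixel patch used in the usage note are a common ViT configuration, assumed here for illustration; the project does not state which sizes it uses. Real encoders operate on tensors and then project each flattened patch to an embedding, which this sketch leaves out.

```python
def patch_grid(image_size, patch_size):
    """Return (patches_per_side, num_patches) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


def split_into_patches(image, patch_size):
    """Split a 2D list (H x W) into a row-major list of square patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            rows = image[top:top + patch_size]
            patches.append([row[left:left + patch_size] for row in rows])
    return patches
```

With these assumed sizes, `patch_grid(224, 16)` gives a 14 × 14 grid, so the encoder's self-attention runs over 196 patch embeddings (plus any special tokens the model adds).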
3. Language Decoder (Transformer Model)
- Receives visual embeddings
- Generates text autoregressively (token-by-token)
- Uses cross-attention over image features
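Token-by-token generation can be illustrated with a toy greedy decoding loop. Everything here is a stand-in: `TOY_TABLE` and `toy_scores` are hypothetical and only look at the last token, whereas a real decoder scores the next token from the whole prefix and from the image features via cross-attention.

```python
def greedy_decode(next_token_scores, bos="<bos>", eos="<eos>", max_len=20):
    """Repeatedly append the highest-scoring next token until EOS or max_len."""
    tokens = [bos]
    for _ in range(max_len):
        scores = next_token_scores(tokens)        # dict: candidate token -> score
        next_token = max(scores, key=scores.get)  # greedy choice
        if next_token == eos:
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the BOS marker


# Hypothetical scores standing in for a trained decoder.
TOY_TABLE = {
    "<bos>": {"a": 0.9, "dog": 0.1},
    "a": {"dog": 0.8, "a": 0.2},
    "dog": {"running": 0.7, "<eos>": 0.3},
    "running": {"<eos>": 0.9, "dog": 0.1},
}


def toy_scores(tokens):
    """Score candidates from the last token only (a simplification)."""
    return TOY_TABLE[tokens[-1]]
```

Calling `greedy_decode(toy_scores)` walks the table one token at a time and stops when the end-of-sequence token wins. Libraries such as Hugging Face Transformers offer the same loop, along with beam search and sampling, through `model.generate`.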
4. Dual Caption Head
- Formal Head: structured, descriptive, professional output
- Casual Head: social-media optimized captions with hashtags and emojis
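The write-up does not say how the two heads are implemented. One simple possibility, sketched below purely as an assumption, is to restyle a single base caption with lightweight post-processing; the function names, the default emoji, and the hashtag handling are all hypothetical.

```python
def professional_caption(base_caption):
    """Return a capitalized, full-sentence version of the base caption."""
    text = base_caption.strip().rstrip(".")
    return text[:1].upper() + text[1:] + "."


def casual_caption(base_caption, hashtags=(), emoji="✨"):
    """Return a lowercase caption followed by an emoji and hashtags."""
    text = base_caption.strip().rstrip(".").lower()
    # Normalize tags: drop any leading '#', remove spaces, then re-add '#'.
    tags = " ".join("#" + tag.strip("#").replace(" ", "") for tag in hashtags)
    return " ".join(part for part in (text, emoji, tags) if part)
```

An alternative design would be to prompt or fine-tune the language decoder separately for each style, which could change the wording itself rather than only the formatting.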
⚙️ Tech Stack
- Python 🐍
- OpenCV
- PyTorch
- HuggingFace Transformers
- Vision Transformer (ViT / BLIP)
- Gradio (optional UI layer)
🚀 Workflow
- Input image/video loaded via OpenCV
- Preprocessing (resize, normalize, color conversion)
- Vision Transformer extracts image embeddings
- Transformer decoder generates caption
- Two outputs produced:
  - Professional caption (LinkedIn-ready)
  - Casual caption (Instagram-style)
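The workflow above could be wired together roughly like this, using BLIP through Hugging Face Transformers. This is a sketch under assumptions: the checkpoint name `Salesforce/blip-image-captioning-base`, the `max_new_tokens` value, and the helper names are illustrative choices, not confirmed details of this project. Skipping repeated captions on consecutive video frames is likewise an assumed convenience.

```python
def load_blip_captioner(model_name="Salesforce/blip-image-captioning-base"):
    """Return a function that maps an RGB uint8 array to a caption string."""
    # Heavy imports are deferred so this module loads without them installed.
    from PIL import Image
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(model_name)
    model = BlipForConditionalGeneration.from_pretrained(model_name)

    def caption(image_rgb, max_new_tokens=30):
        # The processor handles resizing and normalization for the model.
        inputs = processor(images=Image.fromarray(image_rgb), return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
        return processor.decode(output_ids[0], skip_special_tokens=True)

    return caption


def caption_frames(frames_rgb, captioner):
    """Caption each RGB frame, skipping consecutive duplicate captions."""
    captions = []
    for frame in frames_rgb:
        text = captioner(frame).strip()
        if text and (not captions or captions[-1] != text):
            captions.append(text)
    return captions
```

Passing the captioner in as an argument keeps `caption_frames` independent of any particular model, so BLIP could be swapped for another image-to-text model without changing the loop. The resulting base captions would then be restyled into the professional and casual outputs.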
📦 Installation
git clone https://github.com/Santhosh-p654/imagecaptiongenerator
cd imagecaptiongenerator
pip install -r requirements.txt