🖼️ Image Caption Generator using Vision Transformers + OpenCV

📌 Overview

This project generates natural-language captions for images by pairing a Vision Transformer (ViT) image encoder with a transformer-based language decoder. OpenCV handles image preprocessing and video frame handling, and a pretrained multimodal model (such as BLIP or a ViT-GPT-style model) converts the visual features into textual descriptions.

The system supports two captioning modes:

🧠 Professional captions (LinkedIn-style)
🔥 Casual captions (Instagram/trend-style with hashtags)

🏗️ Architecture

1. Image Processing (OpenCV)

Loads images or video frames
Converts BGR → RGB
Resizes the input to the encoder's expected resolution and normalizes pixel values
Optional: frame sampling for video-based captioning
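
The optional frame-sampling step could look roughly like the sketch below. It assumes a fixed sampling interval in seconds. The helper name sample_frame_indices and its parameters are illustrative and do not come from the project.

```python
from __future__ import annotations


def sample_frame_indices(total_frames: int, fps: float, every_seconds: float = 1.0) -> list[int]:
    """Return the indices of frames to caption, one every `every_seconds` seconds.

    Hypothetical helper: the project's real sampling policy is not documented.
    """
    if total_frames <= 0:
        return []
    if fps <= 0 or every_seconds <= 0:
        raise ValueError("fps and every_seconds must be positive")
    # Number of frames to skip between samples, never less than one frame.
    step = max(1, round(fps * every_seconds))
    return list(range(0, total_frames, step))


# A 10-second clip at 30 fps, sampled once every 2 seconds.
print(sample_frame_indices(total_frames=300, fps=30.0, every_seconds=2.0))
# → [0, 60, 120, 180, 240]
```

With OpenCV, each index could then be passed to cv2.VideoCapture.set(cv2.CAP_PROP_POS_FRAMES, index) before calling read(), and the returned frame converted with cv2.cvtColor(frame, cv2.COLOR_BGR2RGB). Seek accuracy can vary by codec, so reading sequentially and skipping frames may be the safer option for some files.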

2. Vision Encoder (Vision Transformer / ViT)

Splits image into fixed-size patches
Converts patches into embeddings
Uses self-attention to capture global context
Produces high-dimensional visual representation
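
To make the patch step concrete, here is a small sketch of the arithmetic involved. The 224x224 input and 16x16 patch size are assumptions borrowed from common ViT configurations. The project's actual encoder settings are not stated, and patch_grid is a hypothetical helper.

```python
from __future__ import annotations


def patch_grid(height: int, width: int, patch_size: int, channels: int = 3) -> dict[str, int]:
    """Compute how a ViT-style encoder would tile an image into patches."""
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    rows = height // patch_size
    cols = width // patch_size
    return {
        "rows": rows,
        "cols": cols,
        # Sequence length seen by self-attention (before any class token).
        "num_patches": rows * cols,
        # Length of one flattened patch before the linear projection.
        "patch_dim": patch_size * patch_size * channels,
    }


# Assumed configuration: 224x224 RGB input, 16x16 patches.
print(patch_grid(224, 224, 16))
# → {'rows': 14, 'cols': 14, 'num_patches': 196, 'patch_dim': 768}
```

Each of the 196 flattened patches is projected to an embedding, and self-attention then relates every patch to every other patch, which is how the encoder captures global context.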

3. Language Decoder (Transformer Model)

Receives visual embeddings
Generates text autoregressively (token-by-token)
Uses cross-attention over image features

4. Dual Caption Head

Formal Head: structured, descriptive, professional output
Casual Head: social-media optimized captions with hashtags and emojis
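
How the two heads are implemented is not described here. One simple possibility, shown purely as an assumption, is a styling layer applied to a single base caption from the decoder. The function names and the default emoji below are illustrative.

```python
from __future__ import annotations


def professional_caption(base: str) -> str:
    """Format a base caption as a clean, sentence-cased statement."""
    text = base.strip().rstrip(".")
    return text[:1].upper() + text[1:] + "."


def casual_caption(base: str, hashtags: list[str], emoji: str = "🔥") -> str:
    """Format a base caption with an emoji and normalized hashtags."""
    text = base.strip().rstrip(".")
    # Normalize each tag: drop empty entries, a leading '#', and inner spaces.
    tags = " ".join(
        "#" + tag.strip().lstrip("#").replace(" ", "")
        for tag in hashtags
        if tag.strip()
    )
    return f"{text} {emoji} {tags}".strip()


base = "a dog running on the beach"
print(professional_caption(base))
print(casual_caption(base, ["dog", "beach life"]))
```

A learned alternative would be to prompt or fine-tune the decoder separately for each style. The rule-based version above only shows the shape of the two outputs.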

⚙️ Tech Stack

Python 🐍
OpenCV
PyTorch
HuggingFace Transformers
Vision Transformer (ViT / BLIP)
Gradio (optional UI layer)

🚀 Workflow

Input image/video loaded via OpenCV
Preprocessing (resize, normalize, color conversion)
Vision Transformer extracts image embeddings
Transformer decoder generates caption
Two outputs produced:
- Professional caption (LinkedIn-ready)
- Casual caption (Instagram-style)
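
The first four workflow steps might be wired together as in the sketch below, which uses the BLIP classes from HuggingFace Transformers. The checkpoint name Salesforce/blip-image-captioning-base is an assumption, since the model this project actually loads is not specified, and caption_bgr_frame is a hypothetical helper.

```python
def caption_bgr_frame(
    frame_bgr,
    model_name: str = "Salesforce/blip-image-captioning-base",  # assumed checkpoint
    max_new_tokens: int = 30,
) -> str:
    """Caption one OpenCV frame (BGR, uint8, H x W x 3) with a pretrained BLIP model."""
    # Heavy dependencies are imported lazily so this module imports cheaply.
    import cv2
    from PIL import Image
    from transformers import BlipForConditionalGeneration, BlipProcessor

    # Loaded on every call for brevity; a real app would load these once and reuse them.
    processor = BlipProcessor.from_pretrained(model_name)
    model = BlipForConditionalGeneration.from_pretrained(model_name)

    # OpenCV stores channels as BGR, while the processor expects RGB.
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)

    # The processor resizes and normalizes the image for the checkpoint.
    inputs = processor(images=Image.fromarray(rgb), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True)


# Example usage (downloads the checkpoint on first run):
#   import cv2
#   frame = cv2.imread("photo.jpg")  # hypothetical path
#   print(caption_bgr_frame(frame))
```

The returned string would be the base caption from which the professional and casual variants are derived.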

📦 Installation

git clone https://github.com/Santhosh-p654/imagecaptiongenerator
cd imagecaptiongenerator

pip install -r requirements.txt

Built With

docker
git
huggingface
opencv
python
torch
transformers

Updates

Santhosh P started this project — Jun 08, 2026 12:14 PM EDT
