🖼️ Image Caption Generator using Vision Transformers + OpenCV

📌 Overview

This project generates natural-language captions for images by pairing a Vision Transformer (ViT) image encoder with a transformer-based language decoder. OpenCV handles image preprocessing and video frame handling, and a pretrained multimodal model (such as BLIP or a ViT-GPT-style encoder-decoder) converts the visual features into text.

The system supports two captioning modes:

  • 🧠 Professional captions (LinkedIn-style)
  • 🔥 Casual captions (Instagram/trend-style with hashtags)

🏗️ Architecture

1. Image Processing (OpenCV)

  • Loads images or video frames
  • Converts BGR → RGB
  • Resizes and normalizes input
  • Optional: frame sampling for video-based captioning

2. Vision Encoder (Vision Transformer / ViT)

  • Splits image into fixed-size patches
  • Converts patches into embeddings
  • Uses self-attention to capture global context
  • Produces high-dimensional visual representation
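
To make the patch step concrete, here is a small NumPy sketch. It assumes a 224×224 RGB input and 16×16 patches (the common ViT-Base/16 setup, which may differ from the model this project uses), and it substitutes a random matrix for the learned projection weights. The function names are hypothetical.

```python
import numpy as np


def split_into_patches(image, patch_size=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be multiples of patch_size")
    rows, cols = h // patch_size, w // patch_size
    # (H, W, C) -> (rows, p, cols, p, C) -> (rows, cols, p, p, C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch: (rows * cols, p * p * C)
    return patches.reshape(rows * cols, patch_size * patch_size * c)


def embed_patches(patches, embed_dim=768, seed=0):
    """Project flattened patches to embeddings (random stand-in weights)."""
    rng = np.random.default_rng(seed)
    weight = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
    return patches @ weight


image = np.zeros((224, 224, 3), dtype=np.float32)
patches = split_into_patches(image)
print(patches.shape)                 # (196, 768): 14 x 14 patches of 16*16*3 values
print(embed_patches(patches).shape)  # (196, 768)
```

In a real ViT the projection is learned, position embeddings are added, and the resulting sequence is passed through the self-attention layers.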

3. Language Decoder (Transformer Model)

  • Receives visual embeddings
  • Generates text autoregressively (token-by-token)
  • Uses cross-attention over image features
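
Token-by-token generation can be illustrated with a toy greedy decoding loop. This is only a sketch: `next_token_scores` is a hypothetical stand-in for the decoder, which in a real model scores the next token from the whole prefix plus the image features (via cross-attention). The lookup table here only looks at the last token.

```python
def greedy_decode(next_token_scores, bos_token, eos_token, max_new_tokens=20):
    """Repeatedly append the highest-scoring next token until EOS or the limit."""
    tokens = [bos_token]
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)   # mapping: token -> score
        best = max(scores, key=scores.get)   # greedy choice
        if best == eos_token:
            break
        tokens.append(best)
    return tokens[1:]  # drop the BOS marker


# Toy "decoder": a fixed table standing in for a trained model.
TOY_TABLE = {
    "<bos>": {"a": 0.9, "dog": 0.1},
    "a": {"dog": 0.8, "<eos>": 0.2},
    "dog": {"running": 0.6, "<eos>": 0.4},
    "running": {"<eos>": 1.0},
}


def toy_scores(tokens):
    return TOY_TABLE[tokens[-1]]


print(" ".join(greedy_decode(toy_scores, "<bos>", "<eos>")))  # a dog running
```

Greedy selection is the simplest strategy; beam search or sampling are common alternatives when more varied captions are wanted.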

4. Dual Caption Head

  • Formal Head: structured, descriptive, professional output
  • Casual Head: social-media optimized captions with hashtags and emojis
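
How the two heads are implemented is not described here. As one possible approximation, the sketch below derives both styles from a single base caption with rule-based post-processing; the project may instead use prompting or separate decoding. The function names, the stopword list, and the emoji are all illustrative assumptions.

```python
# Assumed stopword list, used only to pick hashtag candidates.
STOPWORDS = {"a", "an", "the", "on", "in", "of", "at", "with", "and", "is"}


def formal_caption(base):
    """Capitalize the caption and end it with a period."""
    text = base.strip().rstrip(".")
    if not text:
        return ""
    return text[0].upper() + text[1:] + "."


def casual_caption(base, max_tags=3):
    """Append an emoji and hashtags built from the caption's content words."""
    text = base.strip().rstrip(".")
    tags = []
    for word in text.split():
        word = word.strip(".,!?").lower()
        if word.isalnum() and word not in STOPWORDS and word not in tags:
            tags.append(word)
    hashtags = " ".join("#" + tag for tag in tags[:max_tags])
    return f"{text} ✨ {hashtags}".strip()


base = "a dog running on the beach"
print(formal_caption(base))  # A dog running on the beach.
print(casual_caption(base))  # a dog running on the beach ✨ #dog #running #beach
```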

⚙️ Tech Stack

  • Python 🐍
  • OpenCV
  • PyTorch
  • HuggingFace Transformers
  • Vision Transformer (ViT / BLIP)
  • Gradio (optional UI layer)

🚀 Workflow

  1. Input image/video loaded via OpenCV
  2. Preprocessing (resize, normalize, color conversion)
  3. Vision Transformer extracts image embeddings
  4. Transformer decoder generates caption
  5. Two outputs produced:
    • Professional caption (LinkedIn-ready)
    • Casual caption (Instagram-style)
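
The first four workflow steps might look like the sketch below, assuming BLIP through Hugging Face Transformers. The checkpoint name `Salesforce/blip-image-captioning-base` and the helper names are assumptions for illustration; the project's actual model and code may differ. The BLIP processor performs its own resizing and normalization, so the image is passed in as a plain RGB array.

```python
def load_rgb(path):
    """Load an image with OpenCV and convert it from BGR to RGB."""
    import cv2  # imported here so the other helpers work without OpenCV

    bgr = cv2.imread(path)
    if bgr is None:  # cv2.imread returns None when the file cannot be read
        raise FileNotFoundError(path)
    return cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)


def load_blip(checkpoint="Salesforce/blip-image-captioning-base"):
    """Load a pretrained BLIP processor and model (assumed checkpoint)."""
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(checkpoint)
    model = BlipForConditionalGeneration.from_pretrained(checkpoint)
    return processor, model


def generate_caption(image_rgb, processor, model, max_new_tokens=30):
    """Encode the image and decode a caption autoregressively."""
    inputs = processor(images=image_rgb, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True)


# Example usage (downloads the checkpoint on first run; "photo.jpg" is a placeholder):
# processor, model = load_blip()
# print(generate_caption(load_rgb("photo.jpg"), processor, model))
```

The resulting base caption would then be turned into the professional and casual variants in step 5.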

📦 Installation

git clone https://github.com/Santhosh-p654/imagecaptiongenerator
cd imagecaptiongenerator

pip install -r requirements.txt
