Transformers for Vision: Image Captioning

An End-to-End Implementation of Modern Vision Transformer Architectures

Mahanth Yalla

M.Tech Artificial Intelligence

Indian Institute of Science, Bengaluru

Project Overview

This repository contains end-to-end implementations of Transformer-based architectures for Computer Vision, focusing on the task of Image Captioning. Starting from the foundational "Attention is All You Need" architecture, the framework is progressively extended to visual understanding by implementing and experimenting with several modern Vision Transformer (ViT) variants. The project serves as a practical exploration of applying cutting-edge visual encoders to complex image-text modeling tasks, bridging the gap between image recognition and natural language generation.

Core Goals & Learning Objectives

The primary objective is to recreate, understand, and compare various Transformer architectures for vision tasks within a unified image captioning framework.

Architectural Implementation

Implement and deeply understand ViT, DeiT, Swin Transformer, PVTv2, and DINOv2 from the ground up.

Encoder-Decoder Framework

Apply these advanced visual encoders in combination with a Transformer-based text decoder for caption generation.
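As a concrete illustration of this pairing, here is a minimal sketch assuming PyTorch with a timm ViT-B/16 backbone; the repository's own encoders under src/encoder/ may expose a different interface, and text positional embeddings are omitted for brevity.

# Minimal encoder-decoder captioning sketch (illustrative only).
# Assumes a timm ViT backbone; any encoder that returns patch tokens can be swapped in.
import timm
import torch
import torch.nn as nn

class CaptioningModel(nn.Module):
    """ViT-style visual encoder + Vaswani-style text decoder (sketch)."""

    def __init__(self, vocab_size, d_model=768, num_heads=8, num_layers=6):
        super().__init__()
        self.encoder = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, caption_ids):
        # Patch tokens from the visual encoder serve as cross-attention memory.
        memory = self.encoder.forward_features(images)        # (B, num_patches + 1, 768)
        tgt = self.token_embed(caption_ids)                   # (B, T, d_model); positional
                                                              # embeddings omitted for brevity
        T = caption_ids.size(1)
        causal_mask = torch.triu(                             # block attention to future tokens
            torch.full((T, T), float("-inf"), device=caption_ids.device), diagonal=1
        )
        hidden = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.lm_head(hidden)                           # (B, T, vocab_size)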

End-to-End Training

Build a complete training and evaluation pipeline, including custom tokenizers, dataset loaders, and DDP-based trainers.
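One way such a DDP-based trainer can be structured is sketched below; the batch format, PAD token id, and hyperparameters are illustrative assumptions rather than the repository's actual settings.

# Skeleton of a DDP training loop, expected to be launched with torchrun.
import os
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train_ddp(model, dataset, epochs=10, lr=3e-4, pad_id=0):
    dist.init_process_group(backend="nccl")                   # torchrun sets LOCAL_RANK etc.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(model.cuda(local_rank), device_ids=[local_rank])
    sampler = DistributedSampler(dataset)                     # shards the data across ranks
    loader = DataLoader(dataset, batch_size=32, sampler=sampler, num_workers=4)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)                              # reshuffle shards every epoch
        for images, caption_ids in loader:                    # assumes (image, token-id) batches
            images = images.cuda(local_rank, non_blocking=True)
            caption_ids = caption_ids.cuda(local_rank, non_blocking=True)
            logits = model(images, caption_ids[:, :-1])       # teacher forcing on shifted input
            loss = F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),          # (B*T, vocab)
                caption_ids[:, 1:].reshape(-1),               # next-token targets
                ignore_index=pad_id,                          # do not penalize padding
            )
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()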

Benchmarking & Analysis

Compare various visual encoder–text decoder combinations for caption quality, efficiency, and performance.
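As a small, hedged example of the caption-quality side of this comparison, corpus-level BLEU-4 can be computed with NLTK as follows; the full suite (METEOR, ROUGE-L, SPICE, CIDEr) is usually computed with the pycocoevalcap toolkit, and the captions below are placeholders.

# Corpus-level BLEU-4 over tokenized captions; all captions here are placeholders.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [
    [["a", "dog", "runs", "on", "the", "beach"],
     ["a", "dog", "running", "along", "the", "shore"]],   # multiple references per image
]
hypotheses = [
    ["a", "dog", "runs", "along", "the", "beach"],        # one generated caption per image
]

bleu4 = corpus_bleu(
    references,
    hypotheses,
    weights=(0.25, 0.25, 0.25, 0.25),                     # uniform 1- to 4-gram weights
    smoothing_function=SmoothingFunction().method1,       # avoids zero scores on small corpora
)
print(f"BLEU-4: {bleu4:.3f}")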

Methodology & Key Components

The project uses a standard encoder-decoder architecture; the main point of variation is the visual encoder, which is swapped among several state-of-the-art Vision Transformer models.

Component | Description | Models / Datasets
Encoder | Vision Transformer backbone that extracts visual features | ViT, DeiT, Swin, PVTv2, DINOv2
Decoder | Transformer-based text generator for caption synthesis | Vaswani-style decoder
Training Objective | Cross-entropy loss for text generation | Optional CIDEr fine-tuning
Datasets | Benchmark image captioning datasets | COCO Captions, Flickr30k, NoCaps
Evaluation Metrics | Standard caption quality metrics | BLEU, METEOR, ROUGE-L, SPICE, CIDEr
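At inference time the decoder produces captions autoregressively. The sketch below shows greedy decoding, assuming the CaptioningModel interface from the earlier sketch and hypothetical BOS/EOS token ids; the cross-entropy objective itself appears in the DDP sketch above, while CIDEr fine-tuning would typically rely on self-critical sequence training, which is not sketched here.

# Hedged sketch of greedy caption generation; bos_id, eos_id, and max_len are
# hypothetical values, and the real tokenizer defines the actual special tokens.
import torch

@torch.no_grad()
def greedy_caption(model, image, bos_id=1, eos_id=2, max_len=30):
    model.eval()
    tokens = torch.tensor([[bos_id]], device=image.device)       # start with BOS, shape (1, 1)
    for _ in range(max_len):
        logits = model(image.unsqueeze(0), tokens)                # (1, T, vocab_size)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)      # most likely next token
        tokens = torch.cat([tokens, next_id], dim=1)
        if next_id.item() == eos_id:                              # stop at end-of-sequence
            break
    return tokens.squeeze(0).tolist()                             # map back to text with the tokenizer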

Project Progress: In Progress

The project is under active development. The following checklist outlines the current status of key components and planned architectural implementations:

Core Components

Visual Encoder Implementation

Project Structure

The repository is organized to separate concerns, with distinct modules for encoders, decoders, data handling, and training utilities.

src/
├── encoder/           # Vision Transformer-based visual encoders
├── decoder/           # Transformer-based text decoders
├── datasets/          # COCO / Flickr30k dataset loaders
├── training/          # Training, evaluation, and utility scripts
├── checkpoints/       # Results and model weights
├── notebooks/         # Experiments and visualization
└── main.py            # Entry point

Key Research References

This work builds on the foundational research behind the architectures used here:

Attention Is All You Need (Vaswani et al.)
An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT, Dosovitskiy et al.)
Training Data-Efficient Image Transformers & Distillation Through Attention (DeiT, Touvron et al.)
Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows (Liu et al.)
PVT v2: Improved Baselines with Pyramid Vision Transformer (Wang et al.)
DINOv2: Learning Robust Visual Features without Supervision (Oquab et al.)

Conclusion & Future Direction

This project provides a comprehensive framework for implementing and evaluating modern Vision Transformer architectures on the challenging task of image captioning. By systematically integrating and testing various visual encoders, the goal is to provide clear insights into their effectiveness and trade-offs in a multi-modal context. Future work will focus on completing the planned encoder implementations, conducting extensive hyperparameter tuning, and performing a thorough comparative analysis of all models on benchmark datasets.
