Transformers for Vision: Image Captioning

An End-to-End Implementation of Modern Vision Transformer Architectures

Mahanth Yalla

M.Tech Artificial Intelligence

Indian Institute of Science, Bengaluru

Project Overview

This repository contains end-to-end implementations of Transformer-based architectures for Computer Vision, focusing on the task of Image Captioning. Starting from the foundational "Attention is All You Need" architecture, the framework is progressively extended to visual understanding by implementing and experimenting with several modern Vision Transformer (ViT) variants. The project serves as a practical exploration of applying cutting-edge visual encoders to complex image-text modeling tasks, bridging the gap between image recognition and natural language generation.

Core Goals & Learning Objectives

The primary objective is to recreate, understand, and compare various Transformer architectures for vision tasks within a unified image captioning framework.

Architectural Implementation

Implement and deeply understand ViT, DeiT, Swin Transformer, PVTv2, and DINOv2 from the ground up.

Encoder-Decoder Framework

Apply these advanced visual encoders in combination with a Transformer-based text decoder for caption generation.
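As a concrete illustration of this pairing, here is a minimal sketch assuming PyTorch with a timm ViT-B/16 backbone; the repository's own encoders under src/encoder/ may expose a different interface, and text positional embeddings are omitted for brevity.

# Minimal encoder-decoder captioning sketch (illustrative only).
# Assumes a timm ViT backbone; any encoder that returns patch tokens can be swapped in.
import timm
import torch
import torch.nn as nn

class CaptioningModel(nn.Module):
    """ViT-style visual encoder + Vaswani-style text decoder (sketch)."""

    def __init__(self, vocab_size, d_model=768, num_heads=8, num_layers=6):
        super().__init__()
        self.encoder = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, caption_ids):
        # Patch tokens from the visual encoder serve as cross-attention memory.
        memory = self.encoder.forward_features(images)        # (B, num_patches + 1, 768)
        tgt = self.token_embed(caption_ids)                   # (B, T, d_model); positional
                                                              # embeddings omitted for brevity
        T = caption_ids.size(1)
        causal_mask = torch.triu(                             # block attention to future tokens
            torch.full((T, T), float("-inf"), device=caption_ids.device), diagonal=1
        )
        hidden = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.lm_head(hidden)                           # (B, T, vocab_size)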

End-to-End Training

Build a complete training and evaluation pipeline, including custom tokenizers, dataset loaders, and DDP-based trainers.
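One way such a DDP-based trainer can be structured is sketched below; the batch format, PAD token id, and hyperparameters are illustrative assumptions rather than the repository's actual settings.

# Skeleton of a DDP training loop, expected to be launched with torchrun.
import os
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train_ddp(model, dataset, epochs=10, lr=3e-4, pad_id=0):
    dist.init_process_group(backend="nccl")                   # torchrun sets LOCAL_RANK etc.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(model.cuda(local_rank), device_ids=[local_rank])
    sampler = DistributedSampler(dataset)                     # shards the data across ranks
    loader = DataLoader(dataset, batch_size=32, sampler=sampler, num_workers=4)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)                              # reshuffle shards every epoch
        for images, caption_ids in loader:                    # assumes (image, token-id) batches
            images = images.cuda(local_rank, non_blocking=True)
            caption_ids = caption_ids.cuda(local_rank, non_blocking=True)
            logits = model(images, caption_ids[:, :-1])       # teacher forcing on shifted input
            loss = F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),          # (B*T, vocab)
                caption_ids[:, 1:].reshape(-1),               # next-token targets
                ignore_index=pad_id,                          # do not penalize padding
            )
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()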

Benchmarking & Analysis

Compare various visual encoder–text decoder combinations for caption quality, efficiency, and performance.
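As a small, hedged example of the caption-quality side of this comparison, corpus-level BLEU-4 can be computed with NLTK as follows; the full suite (METEOR, ROUGE-L, SPICE, CIDEr) is usually computed with the pycocoevalcap toolkit, and the captions below are placeholders.

# Corpus-level BLEU-4 over tokenized captions; all captions here are placeholders.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [
    [["a", "dog", "runs", "on", "the", "beach"],
     ["a", "dog", "running", "along", "the", "shore"]],   # multiple references per image
]
hypotheses = [
    ["a", "dog", "runs", "along", "the", "beach"],        # one generated caption per image
]

bleu4 = corpus_bleu(
    references,
    hypotheses,
    weights=(0.25, 0.25, 0.25, 0.25),                     # uniform 1- to 4-gram weights
    smoothing_function=SmoothingFunction().method1,       # avoids zero scores on small corpora
)
print(f"BLEU-4: {bleu4:.3f}")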

Methodology & Key Components

The project uses a standard encoder-decoder architecture; the main point of variation is the visual encoder, which is swapped among several state-of-the-art Vision Transformer models.

Component | Description | Models / Datasets
Encoder | Vision Transformer backbone that extracts visual features | ViT, DeiT, Swin, PVTv2, DINOv2
Decoder | Transformer-based text generator for caption synthesis | Vaswani-style decoder
Training Objective | Cross-entropy loss for text generation | Optional CIDEr fine-tuning
Datasets | Benchmark image captioning datasets | COCO Captions, Flickr30k, NoCaps
Evaluation Metrics | Standard caption quality metrics | BLEU, METEOR, ROUGE-L, SPICE, CIDEr
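At inference time the decoder produces captions autoregressively. The sketch below shows greedy decoding, assuming the CaptioningModel interface from the earlier sketch and hypothetical BOS/EOS token ids; the cross-entropy objective itself appears in the DDP sketch above, while CIDEr fine-tuning would typically rely on self-critical sequence training, which is not sketched here.

# Hedged sketch of greedy caption generation; bos_id, eos_id, and max_len are
# hypothetical values, and the real tokenizer defines the actual special tokens.
import torch

@torch.no_grad()
def greedy_caption(model, image, bos_id=1, eos_id=2, max_len=30):
    model.eval()
    tokens = torch.tensor([[bos_id]], device=image.device)       # start with BOS, shape (1, 1)
    for _ in range(max_len):
        logits = model(image.unsqueeze(0), tokens)                # (1, T, vocab_size)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)      # most likely next token
        tokens = torch.cat([tokens, next_id], dim=1)
        if next_id.item() == eos_id:                              # stop at end-of-sequence
            break
    return tokens.squeeze(0).tolist()                             # map back to text with the tokenizer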

Project Progress: In Progress

The project is under active development. The following checklist outlines the current status of key components and planned architectural implementations:

Core Components

Visual Encoder Implementation

Project Structure

The repository is organized to separate concerns, with distinct modules for encoders, decoders, data handling, and training utilities.

src/
├── encoder/           # Vision Transformer-based visual encoders
├── decoder/           # Transformer-based text decoders
├── datasets/          # COCO / Flickr30k dataset loaders
├── training/          # Training, evaluation, and utility scripts
├── checkpoints/       # Results and model weights
├── notebooks/         # Experiments and visualization
└── main.py            # Entry point

Key Research References

This work builds on the foundational research behind the architectures used here:

Attention Is All You Need (Vaswani et al.)
An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT, Dosovitskiy et al.)
Training Data-Efficient Image Transformers & Distillation Through Attention (DeiT, Touvron et al.)
Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows (Liu et al.)
PVT v2: Improved Baselines with Pyramid Vision Transformer (Wang et al.)
DINOv2: Learning Robust Visual Features without Supervision (Oquab et al.)

Conclusion & Future Direction

This project provides a comprehensive framework for implementing and evaluating modern Vision Transformer architectures on the challenging task of image captioning. By systematically integrating and testing various visual encoders, the goal is to provide clear insights into their effectiveness and trade-offs in a multi-modal context. Future work will focus on completing the planned encoder implementations, conducting extensive hyperparameter tuning, and performing a thorough comparative analysis of all models on benchmark datasets.
