Project Overview
Core Goals & Learning Objectives
The primary objective is to recreate, understand, and compare various Transformer architectures for vision tasks within a unified image captioning framework.
Architectural Implementation
Implement and deeply understand ViT, DeiT, Swin Transformer, PVTv2, and DINOv2 from the ground up.
Encoder-Decoder Framework
Apply these advanced visual encoders in combination with a Transformer-based text decoder for caption generation.
End-to-End Training
Build a complete training and evaluation pipeline, including custom tokenizers, dataset loaders, and DDP-based trainers.
Benchmarking & Analysis
Compare the different visual encoder–text decoder combinations in terms of caption quality and computational efficiency.
Methodology & Key Components
The project uses a standard encoder-decoder architecture in which the text decoder stays fixed and the visual encoder is swapped among several state-of-the-art Vision Transformer backbones; a minimal sketch of this plug-and-play design follows the table below.
| Component | Description | Details |
|---|---|---|
| Encoder | Vision Transformer backbone to extract visual features | ViT, DeiT, Swin, PVTv2, DINOv2 |
| Decoder | Transformer-based text generator for caption synthesis | Vaswani-style Decoder |
| Training Objective | Cross-Entropy loss for text generation | Optional CIDEr fine-tuning |
| Datasets | Benchmark datasets for image captioning | COCO Captions, Flickr30k, NoCaps |
| Evaluation Metrics | Standard metrics for caption quality | BLEU, METEOR, ROUGE-L, SPICE, CIDEr |
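The plug-and-play design can be illustrated with a short PyTorch sketch. The names and sizes below (CaptioningModel, a 768-dimensional encoder output, an 8-head, 6-layer decoder) are illustrative assumptions rather than the repository's actual API; any backbone that returns a sequence of patch features can be dropped in, and training uses the teacher-forced cross-entropy objective listed in the table.

```python
import torch
import torch.nn as nn

class CaptioningModel(nn.Module):
    """Generic encoder-decoder captioner: any ViT-style backbone that
    returns patch features of shape (B, N, enc_dim) can be swapped in."""

    def __init__(self, visual_encoder: nn.Module, vocab_size: int = 10_000,
                 d_model: int = 512, enc_dim: int = 768):
        super().__init__()
        self.visual_encoder = visual_encoder            # ViT / DeiT / Swin / PVTv2 / DINOv2
        self.proj = nn.Linear(enc_dim, d_model)         # map encoder features to decoder width
        self.embed = nn.Embedding(vocab_size, d_model)  # token embeddings (positional encoding omitted for brevity)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)   # Vaswani-style decoder
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, captions):
        memory = self.proj(self.visual_encoder(images))             # (B, N, d_model) visual tokens
        tgt = self.embed(captions)                                  # (B, T, d_model) text tokens
        mask = nn.Transformer.generate_square_subsequent_mask(captions.size(1)).to(captions.device)
        out = self.decoder(tgt, memory, tgt_mask=mask)              # masked self- + cross-attention
        return self.lm_head(out)                                    # (B, T, vocab_size) logits

# Teacher-forced cross-entropy over shifted captions (index 0 assumed to be <pad>):
#   logits = model(images, captions[:, :-1])
#   loss = nn.CrossEntropyLoss(ignore_index=0)(
#       logits.reshape(-1, logits.size(-1)), captions[:, 1:].reshape(-1))
```

Swapping encoders then amounts to passing a different visual_encoder instance and adjusting enc_dim, which is what makes a controlled comparison of the backbones possible.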
Project Progress (In Progress)
The project is under active development. The following checklist outlines the current status of key components and planned architectural implementations:
Core Components
- Tokenizer & Text Preprocessing
- Dataset Streamer (COCO/Flickr)
- Transformer-based Text Decoder
- DDP-based Parallel Trainer (with AutoResume, sketched below this list)
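Since the trainer is the piece that ties the components above together, a condensed sketch of its structure is given below. It assumes a torchrun launch, a DataLoader built with a DistributedSampler, a model whose forward pass returns the training loss, and a checkpoint file named last.pt; all of these are illustrative assumptions, not the repository's actual interface.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train(model, loader, epochs, ckpt_path="last.pt"):
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for every process it launches.
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    model = DDP(model.cuda(local_rank), device_ids=[local_rank])
    optim = torch.optim.AdamW(model.parameters(), lr=3e-4)

    # AutoResume: restore weights, optimizer state, and the last finished epoch if a checkpoint exists.
    start_epoch = 0
    if os.path.exists(ckpt_path):
        ckpt = torch.load(ckpt_path, map_location=f"cuda:{local_rank}")
        model.module.load_state_dict(ckpt["model"])
        optim.load_state_dict(ckpt["optim"])
        start_epoch = ckpt["epoch"] + 1

    for epoch in range(start_epoch, epochs):
        loader.sampler.set_epoch(epoch)               # reshuffle shards per epoch (DistributedSampler)
        for images, captions in loader:
            images, captions = images.cuda(local_rank), captions.cuda(local_rank)
            loss = model(images, captions)            # assumes forward() returns the training loss
            optim.zero_grad()
            loss.backward()
            optim.step()
        if dist.get_rank() == 0:                      # only rank 0 writes the checkpoint
            torch.save({"model": model.module.state_dict(),
                        "optim": optim.state_dict(),
                        "epoch": epoch}, ckpt_path)
    dist.destroy_process_group()
```

Writing the checkpoint only from rank 0 and reading the epoch back on startup is what lets a preempted or crashed run resume where it left off.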
Visual Encoder Implementation
- Vision Transformer (ViT) (minimal sketch after this list)
- Data-Efficient Image Transformer (DeiT)
- Swin Transformer
- Pyramid Vision Transformer (PVT)
- DINO & DINOv2
- Swin Transformer v2
- Pyramid Vision Transformer v2 (PVTv2)
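All of the listed backbones share one skeleton: split the image into patches, embed them, and push the resulting tokens through a stack of transformer blocks (Swin and PVT additionally organize the blocks into hierarchical, multi-scale stages). The minimal ViT-style extractor below sketches only that shared skeleton; the ViT-Base-like sizes (16x16 patches, 768-dimensional tokens, 12 layers) are assumptions, and the classification token is omitted because the captioning decoder consumes the full patch sequence.

```python
import torch
import torch.nn as nn

class MiniViTEncoder(nn.Module):
    """Minimal ViT-style feature extractor: patchify -> linear embed ->
    add positional embeddings -> Transformer encoder blocks."""

    def __init__(self, img_size=224, patch=16, dim=768, depth=12, heads=12):
        super().__init__()
        num_patches = (img_size // patch) ** 2
        # A strided conv is equivalent to cutting non-overlapping patches and projecting them.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        block = nn.TransformerEncoderLayer(dim, nhead=heads, dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=depth)
        self.norm = nn.LayerNorm(dim)

    def forward(self, images):                        # (B, 3, H, W)
        x = self.patch_embed(images)                  # (B, dim, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)              # (B, N, dim) patch tokens
        x = x + self.pos_embed
        return self.norm(self.blocks(x))              # (B, N, dim) features for the decoder
```

The (B, N, dim) output plugs directly into the captioning sketch above as the decoder's cross-attention memory.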
Project Structure
The repository is organized to separate concerns, with distinct modules for encoders, decoders, data handling, and training utilities.
```
src/
├── encoder/      # Vision Transformer-based visual encoders
├── decoder/      # Transformer-based text decoders
├── datasets/     # COCO / Flickr30k dataset loaders
├── training/     # Training, evaluation, and utility scripts
├── checkpoints/  # Results and model weights
├── notebooks/    # Experiments and visualization
└── main.py       # Entry point
```
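For orientation, the entry point would wire these packages together roughly as in the hypothetical snippet below. The import paths mirror the tree, but every class and function name shown (build_encoder, CaptionDecoder, CocoCaptions, Trainer) is a placeholder for illustration, not the repository's actual API.

```python
# Hypothetical wiring inside src/main.py; all names below are illustrative placeholders.
from encoder import build_encoder      # would return a ViT / DeiT / Swin / PVTv2 / DINOv2 backbone
from decoder import CaptionDecoder     # Vaswani-style Transformer text decoder
from datasets import CocoCaptions      # COCO Captions / Flickr30k loader
from training import Trainer           # DDP trainer with AutoResume

def main():
    encoder = build_encoder("swin")                   # select the visual backbone by name
    decoder = CaptionDecoder(vocab_size=10_000)
    train_data = CocoCaptions(split="train")
    Trainer(encoder, decoder, train_data).fit(epochs=30)

if __name__ == "__main__":
    main()
```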
Key Research References
This work is built upon the foundational research in Transformers and their application to computer vision.
- Attention is All You Need (Vaswani et al., 2017)
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al., 2021)
- DeiT: Training Data-Efficient Image Transformers & Distillation through Attention (Touvron et al., 2021)
- Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows (Liu et al., 2021)
- PVTv2: Improved Baselines with Pyramid Vision Transformer (Wang et al., 2022)
- DINOv2: Learning Robust Visual Features without Supervision (Oquab et al., 2023)
Conclusion & Future Direction
This project provides a comprehensive framework for implementing and evaluating modern Vision Transformer architectures on the challenging task of image captioning. By systematically integrating and testing various visual encoders, the goal is to provide clear insights into their effectiveness and trade-offs in a multi-modal context. Future work will focus on completing the planned encoder implementations, conducting extensive hyperparameter tuning, and performing a thorough comparative analysis of all models on benchmark datasets.