
Image Descriptor using Transformer Encoder

I completed this project during the final year of my bachelor's degree in Data Science at Hindustan Institute of Technology and Science. There is much to discuss about it, and I have attached the full project report for reference. Unlike my other projects, I have not included the source code for this one.


Detailed Project Report:





This project explores the development of a compact, self-contained image captioning system, with potential applications in assistive technologies, specifically integration into bionic eyes and neuroprosthetic systems. With the growing need for intelligent systems that can bridge the gap between visual perception and human language, this work aims to design a solution that can interpret visual input and generate meaningful natural language descriptions in real time, without relying on internet connectivity or large-scale language models.

The project was undertaken with a two-phase objective. The first phase involved constructing a CNN-based object classifier capable of recognizing close to 30 commonly encountered classes, which laid the foundation for visual understanding. The second and more complex phase was implementing a CNN-RNN encoder-decoder architecture for image captioning. The encoder, built on a pre-trained convolutional network such as ResNet, extracts high-dimensional visual features from the image. These features are then fed into an LSTM-based decoder that generates a caption word by word, forming a grammatically and contextually coherent sentence. Attention mechanisms were also considered to sharpen the decoder's focus on relevant spatial features during generation.

Image captioning is inherently challenging due to its multimodal nature, requiring tight integration of visual feature extraction and sequence modelling. The additional constraint of operating offline, without access to large external datasets or pretrained LLMs, raises the bar further. The model was therefore optimized for low-latency inference, making it suitable for edge deployment on embedded hardware.
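To make the encoder-decoder idea concrete, here is a minimal PyTorch sketch of the architecture described above. It is an illustration, not the project's actual code (which is not published here): the small convolutional stack stands in for the report's pre-trained ResNet encoder, and the vocabulary size, embedding and hidden dimensions are arbitrary placeholder values.

```python
import torch
import torch.nn as nn

class CNNEncoder(nn.Module):
    """Stand-in for the ResNet encoder: maps an image to a fixed-size feature vector."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),            # global pooling -> (B, 64, 1, 1)
        )
        self.proj = nn.Linear(64, embed_dim)    # project into the decoder's embedding space

    def forward(self, images):                  # images: (B, 3, H, W)
        x = self.features(images).flatten(1)    # (B, 64)
        return self.proj(x)                     # (B, embed_dim)

class LSTMDecoder(nn.Module):
    """Generates a caption word by word, conditioned on the image embedding."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feat, captions):      # img_feat: (B, E), captions: (B, T)
        emb = self.embed(captions)              # (B, T, E)
        # Prepend the image feature as the first "token" of the input sequence.
        inputs = torch.cat([img_feat.unsqueeze(1), emb], dim=1)  # (B, T+1, E)
        out, _ = self.lstm(inputs)              # (B, T+1, H)
        return self.fc(out)                     # per-step vocabulary logits

# Forward pass on dummy data with a hypothetical 1000-word vocabulary.
enc, dec = CNNEncoder(), LSTMDecoder(vocab_size=1000)
images = torch.randn(2, 3, 64, 64)
captions = torch.randint(0, 1000, (2, 12))
logits = dec(enc(images), captions)
print(tuple(logits.shape))  # (2, 13, 1000): one logit vector per decoding step
```

At inference time the decoder would instead be unrolled step by step, feeding each predicted word back in (greedy or beam search) until an end-of-sentence token is produced; an attention variant would keep the encoder's spatial feature map rather than pooling it away.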

The overarching goal of this project is to take a meaningful step toward intelligent, wearable vision systems that could restore or augment visual understanding for users with visual impairments. By ensuring that the system is lightweight, efficient, and fully autonomous, it lays the groundwork for future applications in neural interfaces and assistive robotics.




