Captioning an image involves generating a human-readable textual description of the image, such as a photograph.
It is an easy problem for a human, but very challenging for a machine, as it requires both understanding the content of the image and translating that understanding into natural language.
Recently, deep learning methods have displaced classical methods and are achieving state-of-the-art results for the problem of automatically generating descriptions, called “captions,” for images.
In this post, you will discover how deep neural network models can be used to automatically generate descriptions for images, such as photographs.
After completing this post, you will know:
- About the challenge of generating textual descriptions for images and the need to combine breakthroughs from computer vision and natural language processing.
- About the elements that comprise a neural captioning model, namely the feature extractor and the language model.
- How the elements of the model can be arranged into an Encoder-Decoder architecture, possibly with the use of an attention mechanism.
Let’s get started.
Overview
This post is divided into 3 parts; they are:
- Describing an Image with Text
- Neural Captioning Model
- Encoder-Decoder Architecture
Describing an Image with Text
Describing an image is the problem of generating a human-readable textual description of an image, such as a photograph of an object or scene.
The problem is sometimes called “automatic image annotation” or “image tagging.”
It is an easy problem for a human, but very challenging for a machine.
A quick glance at an image is sufficient for a human to point out and describe an immense amount of details about the visual scene. However, this remarkable ability has proven to be an elusive task for our visual recognition models
— Deep Visual-Semantic Alignments for Generating Image Descriptions, 2015.
A solution requires both that the content of the image be understood and translated into meaning in terms of words, and that the words be strung together so as to be comprehensible. It combines computer vision and natural language processing, and marks a truly challenging problem in broader artificial intelligence.
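The two halves of such a solution can be sketched in miniature: an "encoder" that reduces image pixels to a fixed-length feature vector, and a "decoder" that emits words one at a time conditioned on that vector. The sketch below is a toy illustration with random, untrained weights and a five-word vocabulary (all names and shapes here are assumptions for illustration, not part of any real captioning system), so the generated "caption" is meaningless; it only shows how the two components fit together.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary for the language-model half of the pipeline.
vocab = ["<start>", "<end>", "a", "dog", "runs"]
V = len(vocab)

def encode_image(image, W_enc):
    """'Understand' the image: reduce pixels to a fixed-length feature vector."""
    pooled = image.mean(axis=(0, 1))   # global average pool over H x W
    return W_enc @ pooled              # project channels into feature space

def decode_caption(feature, W_dec, max_len=5):
    """'Translate' the feature into words: greedy, one word at a time."""
    words, state = [], feature
    for _ in range(max_len):
        logits = W_dec @ state         # score every vocabulary word
        idx = int(np.argmax(logits))
        if vocab[idx] == "<end>":
            break
        words.append(vocab[idx])
        # Fold the chosen word back into the state (a stand-in for an RNN step).
        state = 0.9 * state + 0.1 * W_dec[idx]
    return " ".join(words)

image = rng.random((8, 8, 3))           # fake 8x8 RGB "photograph"
W_enc = rng.standard_normal((4, 3))     # feature-extractor weights (untrained)
W_dec = rng.standard_normal((V, 4))     # language-model weights (untrained)

feature = encode_image(image, W_enc)
caption = decode_caption(feature, W_dec)
print(caption)
```

In a real system, the random projection would be replaced by a deep convolutional network pre-trained for image classification, and the greedy word loop by a trained recurrent or Transformer language model, but the division of labour is the same.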
Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing.