Captioning an image involves generating a human-readable textual description of the image, such as a photograph.
It is an easy problem for a human, but very challenging for a machine, as it requires both understanding the content of the image and translating that understanding into natural language.
Recently, deep learning methods have displaced classical methods and are achieving state-of-the-art results for the problem of automatically generating descriptions, called “captions,” for images.
In this post, you will discover how deep neural network models can be used to automatically generate descriptions for images, such as photographs.
After completing this post, you will know:
- About the challenge of generating textual descriptions for images and the need to combine breakthroughs from computer vision and natural language processing.
- About the elements that comprise a neural captioning model, namely the feature extractor and the language model.
- How the elements of the model can be arranged into an Encoder-Decoder architecture, possibly with the use of an attention mechanism.
Let’s get started.
Overview
This post is divided into 3 parts; they are:
- Describing an Image with Text
- Neural Captioning Model
- Encoder-Decoder Architecture
Describing an Image with Text
Describing an image is the problem of generating a human-readable textual description of an image, such as a photograph of an object or scene.
The problem is sometimes called “automatic image annotation” or “image tagging.”
It is an easy problem for a human, but very challenging for a machine.
A quick glance at an image is sufficient for a human to point out and describe an immense amount of details about the visual scene. However, this remarkable ability has proven to be an elusive task for our visual recognition models
— Deep Visual-Semantic Alignments for Generating Image Descriptions, 2015.
A solution requires both that the content of the image be understood and translated into meaning in terms of words, and that the words be strung together so as to be comprehensible. It combines computer vision and natural language processing, and marks a truly challenging problem in broader artificial intelligence.
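The two halves of such a solution can be sketched in miniature: an "encoder" that reduces image pixels to a fixed-length feature vector, and a "decoder" that emits words one at a time conditioned on that vector. The sketch below is a toy illustration with random, untrained weights and a five-word vocabulary (all names and shapes here are assumptions for illustration, not part of any real captioning system), so the generated "caption" is meaningless; it only shows how the two components fit together.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary for the language-model half of the pipeline.
vocab = ["<start>", "<end>", "a", "dog", "runs"]
V = len(vocab)

def encode_image(image, W_enc):
    """'Understand' the image: reduce pixels to a fixed-length feature vector."""
    pooled = image.mean(axis=(0, 1))   # global average pool over H x W
    return W_enc @ pooled              # project channels into feature space

def decode_caption(feature, W_dec, max_len=5):
    """'Translate' the feature into words: greedy, one word at a time."""
    words, state = [], feature
    for _ in range(max_len):
        logits = W_dec @ state         # score every vocabulary word
        idx = int(np.argmax(logits))
        if vocab[idx] == "<end>":
            break
        words.append(vocab[idx])
        # Fold the chosen word back into the state (a stand-in for an RNN step).
        state = 0.9 * state + 0.1 * W_dec[idx]
    return " ".join(words)

image = rng.random((8, 8, 3))           # fake 8x8 RGB "photograph"
W_enc = rng.standard_normal((4, 3))     # feature-extractor weights (untrained)
W_dec = rng.standard_normal((V, 4))     # language-model weights (untrained)

feature = encode_image(image, W_enc)
caption = decode_caption(feature, W_dec)
print(caption)
```

In a real system, the random projection would be replaced by a deep convolutional network pre-trained for image classification, and the greedy word loop by a trained recurrent or Transformer language model, but the division of labour is the same.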
Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing.