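Column: Machine Learning Research Society (机器学习研究会)
The Machine Learning Research Society is a student organization under the Peking University Center for Big Data and Machine Learning Innovation. It aims to build a platform where machine learning practitioners can exchange ideas. Besides sharing timely news from the field, the society hosts talks by industry leaders and leading academics, salon-style sharing sessions with top researchers, real-data innovation competitions, and other events.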

[Recommended] Faster R-CNN Video Object Detection

Machine Learning Research Society (机器学习研究会)  · WeChat official account  · AI  · 2017-04-01 19:36





I implemented the work of Ross Girshick and colleagues, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", for real-time object detection, which addresses the image-level problem. Here, I extend it to the video level by treating a video as a series of frames while also taking the relation between frames into account: a tracker follows the detections frame by frame, and the final result is then visualized.
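The post does not say which tracker is used. As a rough sketch of the frame-to-frame idea only, the snippet below assumes a simple greedy matching on intersection over union (IoU) between boxes in consecutive frames. This is a stand-in technique, not the author's tracker, and the names `iou` and `link_detections` as well as the 0.5 threshold are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_detections(prev_boxes, curr_boxes, threshold=0.5):
    """Greedily match each previous-frame box to its best unused current box.

    Returns (prev_index, curr_index) pairs whose IoU is at least `threshold`
    (an assumed value, not taken from the post).
    """
    pairs = []
    used = set()
    for i, p in enumerate(prev_boxes):
        best_j, best_iou = None, threshold
        for j, c in enumerate(curr_boxes):
            if j in used:
                continue
            score = iou(p, c)
            if score >= best_iou:
                best_j, best_iou = j, score
        if best_j is not None:
            used.add(best_j)
            pairs.append((i, best_j))
    return pairs


# Two objects that move slightly between frames and are detected in a different order.
prev_frame = [(10, 10, 50, 50), (100, 100, 160, 160)]
curr_frame = [(102, 98, 162, 158), (12, 11, 52, 51)]
links = link_detections(prev_frame, curr_frame)
```

Greedy matching is the simplest choice that keeps an object's identity stable across frames. A real tracker would typically also handle objects that appear, disappear, or are occluded.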


Object detection is an age-old problem. Many applications, such as IoT and self-driving cars, rely on object detection techniques. So here we introduce the state of the art: Faster R-CNN, which achieves high performance and can be used in real time.

Region-based Convolutional Neural Network

Region-based Convolutional Neural Network, aka R-CNN, is a visual object detection system that combines bottom-up region proposals with features computed by a convolutional neural network.
R-CNN first computes region proposals with a technique such as selective search, then feeds the candidates to the convolutional neural network for classification. Here's the system flow of R-CNN:

However, it has some notable disadvantages:

  1. Training is a multi-stage pipeline (three training stages).

  2. Training is expensive in space (due to the multi-stage training pipeline).

  3. Object detection is slow. Detection with VGG16 takes 47s / image (on a GPU).

R-CNN is slow because it performs a ConvNet forward pass for each object proposal, without sharing computation. Many works have been proposed to solve this problem, such as spatial pyramid pooling networks (SPPnets), which try to avoid repeatedly computing the convolutional features.
Thus, R-CNN is not good enough for practical applications, even though its performance is barely adequate.

Fast R-CNN

"Fast" R-CNN lives up to its name: it is quite fast, achieving 0.3s per image for detection when the region proposal step is ignored. So how does this magic work?
The most important factor is that it shares computation. After the region proposal step, we get a set of bounding boxes. The previous algorithm simply feeds each warped image region to the CNN. That is, with 2000 proposals we have to run 2000 forward passes, which wastes a lot of time. We can instead exploit the relation between these proposals: many of them overlap, so the overlapping parts are fed into the CNN many times. Maybe we can compute them just once.
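To make the overlap argument concrete, here is a toy sketch that counts how many times each pixel is covered by a proposal and compares that with the number of distinct pixels covered. The helper names `coverage_counts` and `redundancy` and the three boxes are made up for illustration. The sketch only measures overlap: in R-CNN each proposal is warped to a fixed size, so the real cost scales with the number of forward passes.

```python
def coverage_counts(boxes, width, height):
    """Count, for every pixel, how many proposal boxes cover it."""
    counts = [[0] * width for _ in range(height)]
    for x1, y1, x2, y2 in boxes:
        for y in range(y1, y2):
            for x in range(x1, x2):
                counts[y][x] += 1
    return counts


def redundancy(boxes, width, height):
    """Ratio of pixels processed proposal-by-proposal to distinct pixels covered."""
    counts = coverage_counts(boxes, width, height)
    processed = sum(sum(row) for row in counts)
    distinct = sum(1 for row in counts for c in row if c > 0)
    return processed / distinct if distinct else 0.0


# Toy 8x8 image with three heavily overlapping proposals (x1, y1, x2, y2),
# where x2 and y2 are exclusive.
proposals = [(0, 0, 6, 6), (2, 2, 8, 8), (1, 1, 7, 7)]
ratio = redundancy(proposals, 8, 8)
print(f"each covered pixel is processed {ratio:.2f} times on average")
```

With thousands of proposals on a real image, the same ratio grows much larger, which is the waste that sharing computation removes.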
Fast R-CNN utilizes this property well. Here's the illustration of how it really works:

First, we feed the whole image into the ConvNet (up to conv5). This is where the magic lies: convolutional layers preserve the spatial relation between adjacent pixels. That is, the topmost pixels still fall on the topmost part of the conv5 feature map. Based on this, we can project coordinates in the raw image onto the corresponding neurons in conv5! In this way, the image passes through the ConvNet only once. After the features for each bounding box are extracted, they are fed into the RoI pooling layer, which is a special case of the spatial pyramid pooling layer. The rest of the pipeline is similar to the previous work.
The following figure is a clearer illustration of Fast R-CNN:
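The projection and pooling steps can be sketched in a few lines. The snippet below is a minimal illustration, not the reference implementation. It assumes a total downsampling stride of 16 (as in a VGG16-style conv5), a single-channel feature map, and a 2x2 output grid chosen for readability. The names `project_roi` and `roi_max_pool` are hypothetical, and the exact rounding used in published implementations may differ.

```python
import math


def project_roi(box, stride=16):
    """Map an image-space box (x1, y1, x2, y2) onto feature-map cells.

    stride=16 is an assumption (total downsampling of a VGG16-style conv5).
    The returned window always spans at least one cell in each direction.
    """
    x1, y1, x2, y2 = box
    fx1, fy1 = x1 // stride, y1 // stride
    fx2 = max(fx1 + 1, math.ceil(x2 / stride))
    fy2 = max(fy1 + 1, math.ceil(y2 / stride))
    return (fx1, fy1, fx2, fy2)


def roi_max_pool(feature, roi, out_h=2, out_w=2):
    """Max-pool the feature-map window `roi` into a fixed out_h x out_w grid."""
    fx1, fy1, fx2, fy2 = roi
    h, w = fy2 - fy1, fx2 - fx1
    pooled = []
    for i in range(out_h):
        # Bin boundaries along y; every bin covers at least one cell.
        ys = fy1 + (i * h) // out_h
        ye = max(ys + 1, fy1 + math.ceil((i + 1) * h / out_h))
        row = []
        for j in range(out_w):
            xs = fx1 + (j * w) // out_w
            xe = max(xs + 1, fx1 + math.ceil((j + 1) * w / out_w))
            row.append(max(feature[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# Toy 8x8 single-channel feature map, standing in for conv5 of a 128x128 image.
feature_map = [[y * 8 + x for x in range(8)] for y in range(8)]
roi = project_roi((32, 16, 96, 80))       # image-space proposal
pooled = roi_max_pool(feature_map, roi)   # fixed-size 2x2 output
```

Whatever the size of the proposal, the pooled output has the same fixed shape, so it can be passed to fully connected layers. The feature map itself is computed only once and shared by every proposal.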






