[Recommended] Deep Learning for Satellite Image Segmentation (4th Place in a Kaggle Competition)

机器学习研究会 (Machine Learning Research Society) · WeChat official account · AI · 2017-04-16 19:05


In the recent Kaggle competition Dstl Satellite Imagery Feature Detection, our deepsense.io team took 4th place among 419 teams. We applied a modified U-Net, an artificial neural network for image segmentation. In this blog post we present our deep learning solution and share the lessons we learned along the way.


The challenge was organized on the Kaggle platform by the Defence Science and Technology Laboratory (Dstl), an Executive Agency of the United Kingdom’s Ministry of Defence. As a training set, they provided 25 high-resolution satellite images, each representing a 1 km² area. The task was to locate 10 different types of objects:

  1. Buildings

  2. Miscellaneous manmade structures

  3. Roads

  4. Tracks

  5. Trees

  6. Crops

  7. Waterway

  8. Standing water

  9. Large vehicles

  10. Small vehicles

Sample image from the training set with labels.

These classes were not completely disjoint: you can find examples with vehicles on roads or trees within crops. The distribution of classes was uneven, ranging from very common ones such as crops (28% of the total area) and trees (10%) to much rarer ones such as roads (0.8%) or vehicles (0.02%). Moreover, most images contained only a subset of the classes.

Prediction quality was measured using Intersection over Union (IoU, also known as the Jaccard index) between the predictions and the ground truth. A score of 0 meant a complete mismatch, whereas 1 meant complete overlap. The score was calculated for each class separately and then averaged. For our solution the average IoU was 0.46, whereas for the winning solution it was 0.49.
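To make the metric concrete, here is a minimal sketch of per-class IoU on binary pixel masks, averaged over classes. It is an illustration only and not the competition's official scorer: the function names `jaccard` and `mean_jaccard` are hypothetical, the masks are assumed to be arrays of shape (n_classes, H, W), and returning 1.0 when both masks are empty is our own convention.

```python
import numpy as np

def jaccard(pred, truth):
    """IoU between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        # Assumption: two empty masks count as a perfect match.
        return 1.0
    intersection = np.logical_and(pred, truth).sum()
    return float(intersection / union)

def mean_jaccard(pred, truth):
    """Average of per-class IoU; inputs have shape (n_classes, H, W)."""
    return float(np.mean([jaccard(p, t) for p, t in zip(pred, truth)]))
```

For example, a prediction that covers two pixels, of which one lies inside a two-pixel ground-truth region, has an intersection of 1 and a union of 3, so its IoU is 1/3.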


For each image we were given three versions: grayscale, 3-band and 16-band. Details are presented in the table below:

| Type | Wavebands | Pixel resolution | #channels | Size |
|---|---|---|---|---|
| grayscale | Panchromatic | 0.31 m | 1 | 3348 x 3392 |
| 3-band | RGB | 0.31 m | 3 | 3348 x 3392 |
| 16-band | Multispectral | 1.24 m | 8 | 837 x 848 |
| 16-band | Short-wave infrared | 7.5 m | 8 | 134 x 136 |

We resized the 16-band channels to the resolution of the 3-band images and aligned them, which was necessary to remove shifts between channels. Finally, all channels (1 panchromatic + 3 RGB + 8 multispectral + 8 short-wave infrared) were concatenated into a single 20-channel input image.
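The resize-and-concatenate step can be sketched as follows. This is a minimal illustration under stated assumptions, not our actual pipeline: the helper names `resize_nearest` and `stack_channels` are hypothetical, nearest-neighbour interpolation is used only to keep the example dependency-free, arrays are assumed to be laid out as (height, width, channels), and the alignment step that removes shifts between channels is deliberately left out.

```python
import numpy as np

def resize_nearest(image, out_h, out_w):
    """Nearest-neighbour resize of an (H, W, C) array to (out_h, out_w, C)."""
    h, w = image.shape[:2]
    rows = np.arange(out_h) * h // out_h   # source row for each output row
    cols = np.arange(out_w) * w // out_w   # source column for each output column
    return image[rows[:, None], cols]

def stack_channels(pan, rgb, multispectral, swir):
    """Bring every band to the RGB resolution and concatenate the channels.

    pan: (H, W); rgb: (H, W, 3); multispectral: (h1, w1, 8); swir: (h2, w2, 8).
    Channel alignment (removing shifts between bands) is NOT done here.
    """
    h, w = rgb.shape[:2]
    layers = [
        resize_nearest(pan[..., None], h, w),  # 1 channel
        rgb,                                   # 3 channels
        resize_nearest(multispectral, h, w),   # 8 channels
        resize_nearest(swir, h, w),            # 8 channels
    ]
    return np.concatenate(layers, axis=-1)     # (H, W, 20)
```

In practice a smoother interpolation (for example bilinear or bicubic) would likely be preferable when upsampling the 7.5 m short-wave infrared bands by such a large factor.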


Our fully convolutional model was inspired by the family of U-Net architectures, in which low-level feature maps are combined with higher-level ones, enabling precise localization. This type of network architecture was designed specifically to solve image segmentation problems. U-Net was the default choice for us and for other competitors. For more insight into the architecture, we suggest reading the original paper. Our final architecture is depicted below:
