Image Segmentation
Image segmentation refers to the process of dividing an image into meaningful and distinct regions or objects at the pixel level. It involves assigning a label or class to each pixel in an image to identify different objects, boundaries, or areas of interest. The goal of image segmentation is to separate and distinguish different objects or regions within an image, enabling a computer or an algorithm to understand and analyze the image at a more detailed level.
Segmentation benefits downstream tasks such as object recognition and tracking, scene understanding, medical image analysis, and robotics, to name a few.
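To make the pixel-level idea concrete, a segmentation mask is just an array with one class index per pixel. A minimal sketch with made-up class ids and labels (the real FloodNet classes are covered below):

```python
import numpy as np

# A hypothetical 4x4 segmentation mask: each entry is the class
# index assigned to that pixel (0 = background, 1 = water, 2 = building).
mask = np.array([
    [0, 0, 1, 1],
    [0, 2, 1, 1],
    [2, 2, 0, 1],
    [2, 2, 0, 0],
])

labels = {0: 'background', 1: 'water', 2: 'building'}

# Count pixels per class to see how much of the image each label occupies
counts = {labels[c]: int((mask == c).sum()) for c in labels}
print(counts)  # {'background': 6, 'water': 5, 'building': 5}
```

A segmentation model's job is to predict exactly such an array of class indices from the raw image.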
U-Net
The U-Net is a convolutional neural net (CNN) that was originally developed in 2015 at the Computer Science Department of the University of Freiburg for the task of biomedical image segmentation.
U-Net introduced an encoder-decoder architecture with skip connections. The contracting path captured context and abstract features, while the expansive path recovered spatial resolution using skip connections. U-Net's design made it highly effective for biomedical image segmentation and subsequently gained popularity in other domains.
from fastai.vision.all import *
from fastai.data.all import *
FloodNet
The description below is from the FloodNet GitHub repository:
FloodNet provides high-resolution UAV (Unmanned Aerial Vehicle) imageries with detailed semantic annotation regarding the damages. To advance the damage assessment process for post-disaster scenarios, we present a unique challenge considering classification, semantic segmentation, visual question answering highlighting the UAS imagery-based FloodNet dataset.
Track 1
In this track, participants are required to complete two semi-supervised tasks: image classification and semantic segmentation.
- Semi-Supervised Classification: Classification for the FloodNet dataset requires classifying the images into 'Flooded' and 'Non-Flooded' classes. Only a few of the training images have their labels available, while most of the training images are unlabeled.
- Semi-Supervised Semantic Segmentation: The semantic segmentation labels include: 1) Background, 2) Building Flooded, 3) Building Non-Flooded, 4) Road Flooded, 5) Road Non-Flooded, 6) Water, 7) Tree, 8) Vehicle, 9) Pool, 10) Grass. Only a small portion of the training images have their corresponding masks available.
Links
import numpy as np
import pandas as pd
from pathlib import Path
Segmentation datasets usually consist of image files, mask files, and codes, which are the segmentation pixel labels.
path = Path.cwd()/'floodnet_data'
path
# get labels / codes
col_map = {'Class Index.1': 'class_id', 'Class Name.1': 'label'}
df_codes = pd.read_csv(
    path/'class_mapping.csv',
    header=2
).iloc[:, -2:].rename(columns=col_map)
codes = df_codes.label.values
df_codes.head()
# Get all the files in path with optional extensions
# mask files are PNG so we can exclude these by specifying the extensions
fnames = get_files(path/"train", extensions='.jpg')
fnames[0]
def label_func(fn):
    # map an image file to its mask: <stem>.jpg -> mask/<stem>_lab.png
    return path/'train'/fn.parts[-3]/'mask'/f'{fn.stem}_lab.png'
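To sanity-check the image-to-mask mapping, the same path logic can be applied to a hypothetical filename. The directory and file names here are made up for illustration; only the `mask` sibling layout and the `_lab.png` suffix come from `label_func` above:

```python
from pathlib import Path

def to_mask_path(root, fn):
    # Same logic as label_func: .../train/<subdir>/<imgdir>/<name>.jpg
    # maps to .../train/<subdir>/mask/<name>_lab.png
    return root/'train'/fn.parts[-3]/'mask'/f'{fn.stem}_lab.png'

root = Path('floodnet_data')  # hypothetical dataset root
img = Path('floodnet_data/train/labeled/image/6467.jpg')  # hypothetical image file
print(to_mask_path(root, img))
# floodnet_data/train/labeled/mask/6467_lab.png
```

`fn.parts[-3]` picks out the split directory two levels above the file, so images and masks live in sibling folders under the same split.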
dls = SegmentationDataLoaders.from_label_func(
path,
bs=8,
fnames=fnames,
label_func=label_func,
codes=codes,
item_tfms=Resize(128)
)
dls.show_batch(max_n=4)
Model: U-Net
Traditional convolutional neural networks (CNNs) are effective for various computer vision tasks, such as image classification, object detection, and localization. However, they have limitations when it comes to image segmentation. Reasons for this include:
- Resolution Loss: CNNs typically downsample the input image as they progress through the network to capture higher-level features. This downsampling reduces the resolution of the feature maps, making it challenging to accurately localize and segment small objects or fine details in the image.
- Contextual Information: Segmentation tasks often require capturing contextual information to distinguish between objects with similar appearances or to handle complex object boundaries. Traditional CNNs, with their hierarchical feature extraction, may struggle to capture long-range dependencies and global context, which are crucial for accurate segmentation.
- Limited Localization Accuracy: CNNs designed for classification or localization tasks focus on identifying the presence of objects within an image but do not provide precise information about their boundaries. Segmenting an image requires pixel-level localization accuracy, which is not emphasized in traditional CNNs.
The U-Net is specifically designed for semantic segmentation and addresses the above limitations. It employs a U-shaped architecture, consisting of a contracting path (encoder) and an expansive path (decoder), with skip connections between corresponding encoder and decoder layers. Advantages of using a U-Net include:
- U-shaped Architecture: U-Net's U-shaped design enables the preservation of high-resolution feature maps through skip connections, which helps in localizing objects accurately.
- Context Aggregation: Skip connections in U-Net allow the decoder to receive feature maps from different resolutions, incorporating both local and global contextual information. This aids in better segmentation by capturing fine details and understanding the overall context.
- Dense Feature Propagation: U-Net uses upsampling and concatenation operations during the decoding phase, which helps in recovering the lost spatial resolution. This dense feature propagation aids in precise segmentation by retaining spatial information.
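As a rough illustration of these ideas (not the architecture `unet_learner` builds below, which attaches a decoder to a pretrained ResNet), a single-level encoder/decoder with one skip connection can be sketched in plain PyTorch:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """A minimal one-level U-Net sketch: encoder, bottleneck,
    decoder with a skip connection, and a 1x1 classification head."""
    def __init__(self, in_ch=3, n_classes=10):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)                        # contracting path: downsample
        self.bottleneck = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)  # expansive path: upsample
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(16, n_classes, 1)            # per-pixel class logits

    def forward(self, x):
        e = self.enc(x)                    # high-resolution features
        b = self.bottleneck(self.pool(e))  # low-resolution context
        u = self.up(b)                     # recover spatial resolution
        u = torch.cat([u, e], dim=1)       # skip connection: concat encoder features
        return self.head(self.dec(u))

model = TinyUNet()
out = model(torch.randn(1, 3, 64, 64))
print(out.shape)  # torch.Size([1, 10, 64, 64])
```

The output has the same spatial size as the input, with one logit per class at every pixel, which is exactly what a segmentation loss expects.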
model = unet_learner(dls, resnet34)
model.fine_tune(10)
model.show_results(max_n=4, figsize=(7,10))
interp = SegmentationInterpretation.from_learner(model)
# top_losses returns (losses, indices); take the indices of the worst predictions
idxs = interp.top_losses(4, largest=True)[1]
interp.show_results(idxs)
The interpreter shows the model makes some reasonable predictions, but there is still room for improvement!
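One way to quantify that room for improvement, beyond eyeballing predictions, is per-class intersection-over-union (IoU). A small self-contained sketch on toy masks (the helper and arrays are made up for illustration; fastai also ships segmentation metrics such as `DiceMulti` that can be passed to the learner):

```python
import numpy as np

def per_class_iou(pred, targ, n_classes):
    """IoU per class between two integer label masks (hypothetical helper)."""
    ious = []
    for c in range(n_classes):
        p, t = (pred == c), (targ == c)
        union = (p | t).sum()
        # Classes absent from both masks get NaN rather than a misleading score
        ious.append(float((p & t).sum() / union) if union else float('nan'))
    return ious

# Toy 2x2 masks with two classes: class 0 has 1 correct pixel out of 2 in the
# union, class 1 has 2 correct pixels out of 3 in the union
pred = np.array([[0, 0], [1, 1]])
targ = np.array([[0, 1], [1, 1]])
print(per_class_iou(pred, targ, 2))  # [0.5, 0.6666666666666666]
```

Tracking IoU per class is useful here because FloodNet's classes are imbalanced, so a high overall pixel accuracy can hide poor performance on small classes like Vehicle or Pool.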
Bibliography
arXiv preprint arXiv:2012.02951, 2020.