Supporting Fully Convolutional Networks (and U-Net) for Image Segmentation

We are excited to introduce two new models on Nexus - Fully Convolutional Networks (FCN) and U-Nets, both popular semantic segmentation models.

Akanksha Chokshi
Editor

Introduction

Machine learning-based computer vision has evolved to take on increasingly complex and precise tasks. In many fields and industry verticals, practitioners are seeking to determine object class outlines in an automated fashion, and segmentation models have become a larger focus of development as a result of this demand. With this in mind, Datature has introduced two extremely popular semantic segmentation models to Nexus - Fully Convolutional Networks (FCN) and U-Nets. Below, we outline the tasks these models are designed to target and the unique strengths that have allowed them to excel in certain contexts, so that you can identify which one best aligns with your own use case.

What is Semantic Segmentation? 

Semantic segmentation is a computer vision task that requires assigning a label to each pixel in an image based on what it represents. The result of image segmentation is a semantic map: a high-resolution image where each pixel is colour-coded based on the object it belongs to. This is in contrast to general image classification models, which output a single label for the entire image.
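
To make the contrast concrete, here is a minimal PyTorch sketch (the batch size, class count and image size are purely illustrative) showing how a classifier's output differs in shape from a segmentation model's output:

```python
import torch

# Illustrative shapes only: a classifier predicts one score per class for the
# whole image, while a segmentation model predicts a score per class for every pixel.
batch, num_classes, height, width = 1, 21, 320, 320

classification_output = torch.randn(batch, num_classes)               # one label per image
segmentation_output = torch.randn(batch, num_classes, height, width)  # one label per pixel

# The semantic map keeps the most likely class at each pixel.
semantic_map = segmentation_output.argmax(dim=1)
print(classification_output.shape, segmentation_output.shape, semantic_map.shape)
```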

Semantic segmentation has several advantages over classification models, such as increased granularity, better feature extraction and improved performance. By dividing an image into multiple segments, it becomes possible to identify and analyse specific parts of objects within the image, as well as extract meaningful features from them. It also allows the model to learn information about the image as a whole, increasing accuracy and performance in cases where certain objects appear more often in certain positions or near other specific objects.

Image from: https://towardsai.net/p/l/machine-learning-7

Fully Convolutional Networks (FCNs) are commonly used for semantic segmentation tasks. FCNs are a specialised version of Convolutional Neural Networks (CNNs), designed specifically to generate these high-resolution segmentation maps. We shall first introduce CNNs - their basic structure, building blocks and benefits - before exploring how FCNs evolved to handle semantic segmentation tasks even more effectively.

Introducing Fully Convolutional Networks (FCNs) 

Fully Convolutional Networks (FCNs) are CNNs that have been modified for image segmentation rather than classification. Unlike traditional CNNs that output a single class prediction, FCNs generate a segmentation map with the same spatial resolution as the input image. The main difference between CNNs and FCNs is the absence of dense layers. Since FCNs do not need to collapse the image into a single class prediction, they can omit the dense layers that would discard the spatial information needed for the segmentation map. Instead, true to their name, FCNs are built using only a combination of convolutional layers and upsampling layers, as we shall explore further in the next section.
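
As an illustration of this idea, here is a toy, hypothetical FCN in PyTorch - not any particular published architecture - in which a 1x1 convolution and an upsampling step stand in for the dense layers of a classifier:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFCN(nn.Module):
    """A toy fully convolutional network: only convolutions, pooling and upsampling,
    so it accepts images of any size and outputs per-pixel class scores."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        # Encoder: convolutions + pooling shrink the image while extracting features.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        # A 1x1 convolution replaces the dense classification layers,
        # producing class scores while keeping spatial structure.
        self.classifier = nn.Conv2d(64, num_classes, kernel_size=1)

    def forward(self, x):
        features = self.encoder(x)          # (N, 64, H/4, W/4)
        scores = self.classifier(features)  # (N, num_classes, H/4, W/4)
        # Decoder: upsample the coarse score map back to the input resolution.
        return F.interpolate(scores, size=x.shape[-2:], mode="bilinear", align_corners=False)

model = TinyFCN(num_classes=2)
print(model(torch.randn(1, 3, 320, 320)).shape)  # torch.Size([1, 2, 320, 320])
```

Because no layer depends on a fixed input size, the same network can segment images of any resolution.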

Image from: https://library.itc.utwente.nl/papers_2019/msc/gfm/LiuLi.pdf, Semantic Segmentation of Urban Airborne Oblique Images, (Li Liu 2019)

How do FCNs work?

FCNs have two main components: an encoder section and a decoder section. The encoder portion of the network consists of a series of convolutional and pooling layers that are used to extract features from the input image and reduce its spatial resolution. The decoder portion of the network consists of a series of upsampling layers that are used to increase the spatial resolution of the predictions produced by the network.

Each convolutional layer in an FCN performs multiple convolutions on its input, using different filters to detect different features of the data. Stacking multiple convolutions allows the network to learn increasingly complex features of the input. Once the network has learned to detect these local patterns, the feature maps produced by the convolutional layers are upsampled to produce a dense segmentation map for the entire input image.

This upsampling from the feature maps to the segmentation map is accomplished with zero-padding and transposed convolutions. Zero-padding adds rows and columns of zeros around the feature maps produced by the convolutional layers, which helps the network maintain their spatial dimensions through subsequent convolutions. A transposed convolution can be thought of as a convolution run in reverse: each value in the input feature map is spread across a region of the output, weighted by the filter, thereby increasing the spatial resolution of the feature maps.
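
As a quick shape check in PyTorch (the channel counts here are arbitrary), a 2x2 transposed convolution with stride 2 doubles the height and width of a feature map:

```python
import torch
import torch.nn as nn

# A 2x2 transposed convolution with stride 2 doubles the spatial resolution,
# which is how FCN decoders upsample coarse feature maps.
feature_map = torch.randn(1, 64, 40, 40)  # coarse encoder output
upsample = nn.ConvTranspose2d(in_channels=64, out_channels=32, kernel_size=2, stride=2)
print(upsample(feature_map).shape)        # torch.Size([1, 32, 80, 80])
```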

In an FCN, this upsampling process is repeated multiple times, allowing the network to gradually recover the spatial resolution that was reduced by the convolutional and pooling layers, resulting in a dense segmentation map. In the next section, we shall delve deeper into the architecture of an FCN.

FCN Architecture

Image from: https://arxiv.org/abs/1605.06211v1, Fully Convolutional Networks for Semantic Segmentation (Shelhamer et al, 2016)

The above architecture represents the first proposed FCN, built on top of VGG16, a popular CNN for image classification. It consists of a series of pooling and convolutional layers in the encoder section. In the decoder section, the image is upsampled over three steps. At each step, the upsampled feature map is combined with the prediction derived from an earlier pooling layer (by element-wise summation in the original FCN) in order to recover some of the detail lost during the downsampling process.
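
The sketch below illustrates one such fusion step in PyTorch; the shapes and layer names (e.g. the pool4 features) are made up for illustration and do not reproduce the published VGG16-based network exactly:

```python
import torch
import torch.nn as nn

# Rough sketch of one FCN-style fusion ("skip") step, with made-up shapes.
# The coarse prediction is upsampled 2x and added to a prediction computed
# from an earlier, higher-resolution pooling layer.
num_classes = 21
pool4_features = torch.randn(1, 512, 40, 40)          # features from an earlier pooling layer
coarse_scores = torch.randn(1, num_classes, 20, 20)   # class scores from a deeper layer

score_pool4 = nn.Conv2d(512, num_classes, kernel_size=1)  # score the skip features
upsample_2x = nn.ConvTranspose2d(num_classes, num_classes, kernel_size=2, stride=2)

fused = upsample_2x(coarse_scores) + score_pool4(pool4_features)
print(fused.shape)  # torch.Size([1, 21, 40, 40])
```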

Image from: https://arxiv.org/abs/1605.06211v1, Fully Convolutional Networks for Semantic Segmentation (Shelhamer et al, 2016)

The researchers call these "skip connections": paths through which information from the encoder section flows directly into the decoder section to help preserve image resolution. Different FCNs implement these skip connections differently. U-Net is a specialised version of the FCN whose architecture is designed specifically around symmetric skip connections.

Image from: https://arxiv.org/abs/1605.06211v1, Fully Convolutional Networks for Semantic Segmentation (Shelhamer et al, 2016)

U-Net: An Extension of the FCN

U-Net is an extension of the FCN architecture that was introduced in 2015 for image segmentation tasks. The architecture was initially designed for biomedical image segmentation, and its success in this field has led to its use in a wide range of other computer vision tasks. U-Net combines the strengths of traditional FCNs with additional features that make it more effective for image segmentation. The key difference between the two models is the symmetry between the encoder and decoder portions of the network and the skip connections between them. We shall now explore how this symmetry is built into the architecture of the network.

U-Net Architecture

Image from: https://arxiv.org/abs/1505.04597v1, U-Net: Convolutional Networks for Biomedical Image Segmentation (Ronneberger et al, 2015)

True to its name, U-Net's architecture follows the shape of a "U" - its encoder and decoder sections are symmetric and mirror each other. Each step in the encoder section consists of two 3x3 convolutions, each followed by a rectified linear unit (ReLU), and a 2x2 max pooling layer for downsampling. At each downsampling step, the number of feature channels is doubled. Each step in the decoder section consists of an upsampling of the feature map via a 2x2 transposed convolution ("up-convolution") that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the encoder section, and two 3x3 convolutions, each followed by a ReLU. In total, the network has 23 convolutional layers, counting the 3x3 convolutions in both sections, the 2x2 up-convolutions, and a final 1x1 convolution that maps the features to the output segmentation map.
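
As a concrete (and simplified) PyTorch sketch of these building blocks, the snippet below traces one encoder step and one mirrored decoder step. It uses padded 3x3 convolutions so that the feature maps line up without the cropping used in the original paper, and the channel counts follow the network's first level:

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    """Two 3x3 convolutions, each followed by a ReLU - the basic U-Net block."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    )

# Encoder step: double conv, then 2x2 max pooling halves the resolution while
# the number of feature channels is doubled at the next level.
enc1 = double_conv(3, 64)
pool = nn.MaxPool2d(2)
enc2 = double_conv(64, 128)

# Decoder step: a 2x2 up-convolution halves the channels and doubles the
# resolution, the result is concatenated with the mirrored encoder features
# (the skip connection), then refined by another double conv.
up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
dec1 = double_conv(128, 64)

x = torch.randn(1, 3, 320, 320)
skip = enc1(x)                                # (1, 64, 320, 320), kept for the skip connection
bottom = enc2(pool(skip))                     # (1, 128, 160, 160)
merged = torch.cat([up(bottom), skip], dim=1) # (1, 128, 320, 320)
print(dec1(merged).shape)                     # torch.Size([1, 64, 320, 320])
```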

Each series of layers in the encoder section is connected to the corresponding, mirroring series of layers in the decoder section through skip connections. These skip connections allow information from each encoder segment to be combined directly with the corresponding decoder segment by concatenating the respective feature maps. This helps preserve high-resolution spatial detail from the input image that would otherwise be lost during downsampling. The result is a network that can produce high-resolution predictions with fine-grained details, making U-Net well suited to image segmentation tasks, especially in fields like biomedical research.

Comparing FCNs and U-Nets

Both general FCNs and U-Nets share the same strengths: they are pre-trained on large amounts of image data, work well with new images of any size, and offer all the benefits of semantic segmentation models discussed earlier.

Compared to U-Nets, general FCNs may lose image resolution during downsampling and may fail to preserve fine-grained details during upsampling. U-Nets are also designed to work well on smaller datasets with limited training data. However, U-Net can be more computationally complex and has relatively more parameters, which may make it prone to overfitting.

In general, FCNs are a good choice for large-scale datasets and real-time segmentation, while U-Nets are better suited for tasks requiring high precision and the preservation of fine-grained details.

Semantic Segmentation in the Real World

U-Nets were originally created for the field of biomedical research, given their ability to pick up fine-grained details from images and to work well on much smaller training datasets. Semantic segmentation in general is very popular for medical image diagnosis and is often used to classify abnormalities within scans. Semantic segmentation maps generated by FCNs also help distinguish between crops and weeds within a landscape, allowing precision-farming robots to minimise the herbicides sprayed in real time. Semantic segmentation is also incredibly useful for autonomous vehicles and satellite imagery, as it can detect and identify objects (for example, traffic signs, open lanes, forest cover) across very different landscapes and locations. Hence, semantic segmentation has applications across fields including medicine, agriculture, urban planning and transportation.

Image from: https://developer.nvidia.com/blog/using-multi-scale-attention-for-semantic-segmentation/

How to Train FCNs and U-Nets on Nexus?

The steps to training any model on Nexus are as follows:

  1. Create your project
  2. Upload your images
  3. Label your images
  4. Define your training workflow
  5. Monitor your training progress

Check out Section 4 of our How To Train YOLOX Object Detection Model On A Custom Dataset article for more details on creating a project, uploading and annotating your images, and defining your training workflow. Since that article was published, we have added FCNs and U-Nets to the models you can add to your workflow.

When defining your workflow, you can right-click anywhere on the canvas and hover over Models to view FCNs and U-Nets, as well as the different base model options we provide.

Choosing the Right Base Model on Nexus

Our FCN models come with four base model options: ResNet50 (320x320 and 640x640) and ResNet101 (also 320x320 and 640x640). ResNet is a popular CNN that has been pre-trained on large datasets, such as ImageNet, for computer vision tasks such as image classification and object detection. ResNet50 has 50 layers and ResNet101 has 101, so ResNet101 is deeper and can learn more complex representations of the data. However, this also means ResNet101 has more parameters and is more computationally expensive to train and use.

Our U-Net models also have four options: ResNet50 (320x320 and 640x640) and VGG16 (also 320x320 and 640x640). Similar to ResNet, VGG16 is a popular pre-trained deep learning model for computer vision. The main difference is that ResNet's residual connections allow it to learn more complex representations while remaining computationally efficient, whereas VGG16 is a shallower network with a large number of parameters that can offer robustness on simpler tasks. Overall, users should select a base model based on the accuracy required, the computational budget available, and the type of task at hand.
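
These backbones mirror what is available in open-source libraries, so as a rough local illustration - using torchvision's FCN implementations rather than Nexus itself, and assuming torchvision v0.13 or later - you can compare the size of the two ResNet variants directly:

```python
import torch
from torchvision.models.segmentation import fcn_resnet50, fcn_resnet101

# Illustrative comparison of the two ResNet backbones; this is a local sketch,
# not the Nexus training pipeline.
num_classes = 3  # e.g. background, crop, weed

small_model = fcn_resnet50(weights=None, weights_backbone=None, num_classes=num_classes)
large_model = fcn_resnet101(weights=None, weights_backbone=None, num_classes=num_classes)

def count_params(model):
    return sum(p.numel() for p in model.parameters())

# ResNet101 is deeper, so it has noticeably more parameters and is costlier to train.
print(f"fcn_resnet50 : {count_params(small_model) / 1e6:.1f}M parameters")
print(f"fcn_resnet101: {count_params(large_model) / 1e6:.1f}M parameters")

# Both accept a batch of RGB images and return a dict whose "out" entry is the
# per-pixel score map at the input resolution.
small_model.eval()
with torch.no_grad():
    scores = small_model(torch.randn(1, 3, 320, 320))["out"]
print(scores.shape)  # torch.Size([1, 3, 320, 320])
```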

Conclusion

As we have explored, semantic segmentation is an extremely popular and effective technique across a wide range of computer vision applications, including real-time segmentation. Convolutional Neural Networks were designed to handle image-related tasks, and FCNs emerged from CNNs as a way of generating dense semantic maps that consider the image as a whole, preserve its granularity and improve the quality of the features extracted. U-Nets evolved as a specialised version of FCNs designed for relatively high-precision tasks and smaller datasets. Nexus now gives you the ability to easily train and analyse these models so that you can understand your data better. We hope this article gave you a good idea of how these models work, and how you can determine the right model for your use case and train it yourself on Nexus.

What’s Next?

While U-Nets and FCNs have performed well at the industry level, machine learning and computer vision remain largely unexplored and underutilized in the biomedical space. One startup that has embraced this emerging technology is BrainScanology. They use the visual data from DICOM images and other medical tests to automate pre-diagnostic processes with computer vision - processes that can otherwise be time-consuming and prone to human error. There is far more data that could be efficiently analyzed through computer vision methods. Through Datature's Nexus, companies and medical practitioners interested in leveraging computer vision in biomedical contexts can easily upload and annotate their data, train state-of-the-art models, and deploy them to production quickly. If you have any questions about whether Datature can help with your use case, please feel free to reach out!

Our Developer’s Roadmap

At Datature, we are committed to accommodating and facilitating the onboarding of more types of visual data. With the incredible amount of interest developing around analyzing DICOM images, Datature continues to work to make the onboarding of medical imaging as seamless as possible, whether through our platform, our Python SDK, or otherwise. DICOM images can currently be converted to individual image frames or to a video format, which can then be uploaded onto the Datature platform. We will soon support DICOM image upload directly, and other medical imaging formats such as NIfTI will also be prioritized for support soon!
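
As a rough illustration of that frame conversion (not the Nexus upload flow itself), the sketch below uses pydicom and Pillow to split an uncompressed DICOM file into 8-bit PNG frames; the file path and the simple min-max normalisation are placeholders:

```python
import numpy as np
import pydicom
from PIL import Image

# Hypothetical sketch: split a DICOM file into 8-bit PNG frames that can be
# uploaded like ordinary images. "scan.dcm" is a placeholder path; real scans
# may need modality-specific rescaling and windowing.
dataset = pydicom.dcmread("scan.dcm")
pixels = dataset.pixel_array  # (H, W) for single-frame, (frames, H, W) for multi-frame

frames = pixels if pixels.ndim == 3 else pixels[np.newaxis, ...]
for i, frame in enumerate(frames):
    frame = frame.astype(np.float32)
    lo, hi = float(frame.min()), float(frame.max())
    frame = (frame - lo) / max(hi - lo, 1e-6) * 255.0  # normalise to 0-255
    Image.fromarray(frame.astype(np.uint8)).save(f"scan_frame_{i:03d}.png")
```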

Want to Get Started?

If you have questions, feel free to join our Community Slack to post them, or contact us about how FCNs or U-Nets might fit in with your use case.

For more detailed information about the Model Training pipeline and other state-of-the-art model offerings in Nexus, or answers to any common questions you might have, read more about Nexus on our Developer Portal.
