A Guide to Using DeepLabV3 for Semantic Segmentation

We are excited to introduce one of our Nexus models: DeepLabV3, a state-of-the-art multi-scale semantic segmentation model suited to a wide range of use cases.

Akanksha Chokshi

In this article, we are going to explore DeepLabV3, an extremely popular semantic segmentation model. As we explored in a previous article, semantic segmentation is a computer vision task that requires assigning a label to each pixel in an image based on what it represents. The result of semantic segmentation is a semantic map, a high-resolution image where each pixel is colour-coded based on the class of object it belongs to. This is in contrast to general image classification models, which output a single label for the whole image.

To learn more about semantic segmentation, its advantages and current applications, as well as two other semantic segmentation models we offer on Nexus (FCNs and U-Nets), check out our previous article.

Introducing DeepLabV3

DeepLabV3 is a state-of-the-art deep learning architecture best suited for semantic segmentation tasks. It evolved from its previous versions, DeepLabV1 and DeepLabV2, both of which were developed by the Google Research team. DeepLabV3 was first introduced in 2017 and has since been used in various applications such as medical image analysis, autonomous driving, and satellite image analysis. 

One very important benefit that DeepLabV3 has over other semantic segmentation and classification models is that it is extremely accurate when it comes to multi-scale segmentation. Multi-scale segmentation involves analysing the image at different scales to capture objects of different sizes and shapes. DeepLabV3 uses atrous (or dilated) convolutions, a technique that allows it to capture context information at different scales without increasing the complexity of the model. 

Before diving into the architecture of DeepLabV3, we should first understand the basic idea behind Atrous Convolutions and ASPP (Atrous Spatial Pyramid Pooling) that make DeepLabV3 uniquely suited to pick up image features across different shapes, sizes and scales.

Atrous Convolutions and ASPP

As we explored in our FCN article, traditional convolutions involve a filter which slides over the input image or feature map with a fixed stride. Each pixel in the resulting feature map is computed by performing a dot product between the filter and the patch of the input that the filter currently covers. Dilated convolutions, by contrast, introduce gaps or "holes" in the filter, allowing it to capture more spatial context without increasing the number of parameters or the computation time.

In such dilated (or atrous) convolutions, the filter is applied to the input image or feature map with a fixed stride, but with gaps between the filter elements. The gap size or dilation rate determines the amount of spatial context that the filter can capture. For example, a dilation rate of 2 means that there is one empty space between each filter element, while a dilation rate of 3 means that there are two empty spaces between each filter element. By increasing the dilation rate, the filter can capture more spatial context and better preserve the spatial resolution of the input image or feature map. This can be especially useful in semantic segmentation tasks, where atrous convolutions can help the network better capture the global context and identify objects at different scales, leading to more accurate segmentation results.

Image From: https://paperswithcode.com/method/deeplabv3
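
To make the dilation rate concrete, the following minimal Python sketch applies the same three-tap filter to a 1D signal at dilation rates 1, 2 and 3. The function names (effective_kernel_size, dilated_conv1d) and the toy signal are illustrative assumptions, not part of DeepLabV3 or Nexus; real models apply the same idea in 2D over feature maps.

```python
def effective_kernel_size(kernel_size, dilation):
    """Number of input positions spanned by a dilated filter."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv1d(signal, weights, dilation=1):
    """Stride-1 1D convolution without padding, with gaps between filter taps.

    As in deep learning libraries, the filter is not flipped (cross-correlation).
    """
    span = effective_kernel_size(len(weights), dilation)
    output = []
    for start in range(len(signal) - span + 1):
        # Tap k reads the input at start + k * dilation,
        # skipping (dilation - 1) samples between neighbouring taps.
        total = sum(w * signal[start + k * dilation] for k, w in enumerate(weights))
        output.append(total)
    return output


signal = [1, 2, 3, 4, 5, 6, 7, 8, 9]
weights = [1, 1, 1]  # three parameters, whatever the dilation rate

for rate in (1, 2, 3):
    span = effective_kernel_size(len(weights), rate)
    print(f"rate={rate} span={span} output={dilated_conv1d(signal, weights, rate)}")
```

The filter always has three weights, yet it spans 3, 5 and 7 input positions respectively, which is how a larger dilation rate widens the context without adding parameters.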

Atrous Spatial Pyramid Pooling (ASPP) is a feature extraction technique first introduced in the DeepLab network for improving the segmentation accuracy of natural images.

ASPP applies a set of parallel dilated convolutions with different dilation rates to extract features at different scales. In DeepLabV3, these run alongside a 1x1 convolution and an image-level pooling branch, which applies global average pooling to the feature map to capture context from the whole image. Finally, the output feature maps of each parallel path are concatenated and processed through a 1x1 convolution layer to obtain the final feature representation. This allows the network to capture both local and global contextual information, which can be critical for accurate segmentation in images with multi-scale data.

Image From: https://paperswithcode.com/method/deeplabv3
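
The ASPP data flow can be sketched on a single row of a feature map. This is a toy, single-channel illustration under stated assumptions: the helper names (atrous_branch, aspp_1d), the fixed weights and the small rates (1, 2, 4) are made up for readability, whereas the DeepLabV3 paper uses learned multi-channel 2D filters with rates such as 6, 12 and 18. An image-level average pooling branch runs alongside the atrous branches, as in the paper.

```python
def atrous_branch(row, weights, rate):
    """3-tap dilated filter with zero padding, so the output keeps the input length."""
    n = len(row)
    out = []
    for i in range(n):
        total = 0.0
        for k, w in enumerate(weights):
            j = i + (k - 1) * rate  # taps at i - rate, i, i + rate
            if 0 <= j < n:          # positions outside the row count as zero
                total += w * row[j]
        out.append(total)
    return out


def aspp_1d(row, rates=(1, 2, 4)):
    """Toy ASPP: parallel branches, channel concatenation, then a 1x1 fusion."""
    n = len(row)
    branches = [[float(v) for v in row]]          # 1x1 convolution branch (weight 1.0)
    for rate in rates:                            # parallel atrous branches
        branches.append(atrous_branch(row, [1.0, 1.0, 1.0], rate))
    mean = sum(row) / n
    branches.append([mean] * n)                   # image-level pooling, broadcast back
    # Concatenate along the channel axis: one vector of branch outputs per position.
    stacked = [[branch[i] for branch in branches] for i in range(n)]
    # 1x1 fusion convolution: a weighted sum over channels at each position.
    # Equal weights are a placeholder; a real network learns them.
    fuse = [1.0 / len(branches)] * len(branches)
    fused = [sum(w * v for w, v in zip(fuse, vec)) for vec in stacked]
    return stacked, fused


row = [0, 0, 0, 0, 8, 0, 0, 0, 0]  # a single "object" in the middle of the row
stacked, fused = aspp_1d(row)
for position, channels in enumerate(stacked):
    print(position, [round(c, 2) for c in channels])
```

Only the branch with the largest rate responds to the object at position 0, while all branches respond at position 4, so each position ends up with a description of its surroundings at several scales.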

Architecture of DeepLabV3

Semantic segmentation models typically have two sections: an encoder section which is designed to extract features from the image and a decoder section which upsamples and combines the extracted feature maps together to produce the output segmentation map. 

The encoder section for DeepLabV3 typically uses a modified version of the ResNet architecture, but some versions use lighter base models like MobileNet for cases where the task is simpler or demands real-time speed and efficiency. After the last block of its base model architecture, DeepLabV3 adds the ASPP module, which consists of multiple parallel atrous convolutional layers with different dilation rates. The outputs of these parallel convolutions are then concatenated and passed through a 1x1 convolutional layer to produce a fused feature map.

Upsampling of feature maps within the decoder section is done using bilinear interpolation. When upsampling by a factor of 2, for example, each pixel in the original feature map corresponds to a block of four pixels in the upsampled feature map. The value of each pixel in the upsampled feature map is computed as a distance-weighted average of the nearest four pixels in the original feature map. This increases the spatial resolution of the feature map and is done in stages to gradually generate a segmentation map that matches the resolution of the original input image.
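
As a rough illustration of bilinear interpolation, the sketch below resizes a small 2D grid in plain Python. The function name bilinear_upsample is hypothetical, and the corner-aligned sampling used here is only one possible convention; deep learning libraries offer more than one way of mapping output pixels to input coordinates, so their results can differ slightly at the borders.

```python
def bilinear_upsample(grid, out_h, out_w):
    """Resize a 2D grid with bilinear interpolation (corner-aligned sampling)."""
    in_h, in_w = len(grid), len(grid[0])
    # Map output coordinates to input coordinates so that the corners line up.
    scale_y = (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
    scale_x = (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
    result = []
    for i in range(out_h):
        y = i * scale_y
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0                      # vertical distance from the upper neighbour
        out_row = []
        for j in range(out_w):
            x = j * scale_x
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0                  # horizontal distance from the left neighbour
            # Blend the four surrounding input pixels by distance.
            top = (1 - wx) * grid[y0][x0] + wx * grid[y0][x1]
            bottom = (1 - wx) * grid[y1][x0] + wx * grid[y1][x1]
            out_row.append((1 - wy) * top + wy * bottom)
        result.append(out_row)
    return result


small = [[0, 1],
         [2, 3]]
for out_row in bilinear_upsample(small, 3, 3):
    print(out_row)
```

The new centre pixel is the average of all four original pixels, and the new edge pixels are averages of their two nearest originals, which is why bilinear upsampling produces smooth rather than blocky feature maps.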

The decoder in DeepLabV3 is also designed to refine the segmentation output by combining the high-level semantic features extracted by the encoder with low-level features from the early layers of the network. This is done using skip connections, which connect the encoder and decoder at multiple resolutions. Similar to the ones used in FCNs, these skip connections concatenate the feature maps from the encoder with the upsampled feature maps from the decoder at each stage to preserve fine-grained details in the segmentation output. (This refinement stage was formalised in the follow-up DeepLabV3+ paper; the original DeepLabV3 paper upsamples the encoder output directly.) The output of the decoder is finally passed through a 1x1 convolutional layer to produce a pixel-wise probability map for each class. This probability map is then upsampled to the original image resolution using bilinear interpolation to generate the final segmentation map.
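
The last steps of the decoder can be sketched per pixel: concatenate the skip-connection features with the upsampled decoder features, apply a 1x1 convolution to get one score per class, and convert the scores to probabilities. Everything named here (classify_pixel, CLASS_WEIGHTS, the tiny 1x2 feature maps) is a hypothetical stand-in with made-up values; a trained network learns the weights and works with far more channels.

```python
import math


def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_pixel(decoder_feats, skip_feats, class_weights):
    """Concatenate features, apply a 1x1 convolution, return class probabilities."""
    features = decoder_feats + skip_feats  # skip connection: channel-wise concatenation
    # A 1x1 convolution is a per-pixel dot product: one weight vector per class.
    logits = [sum(w * f for w, f in zip(weights, features)) for weights in class_weights]
    return softmax(logits)


# Made-up weights for 3 classes over 4 concatenated channels.
CLASS_WEIGHTS = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]

# A 1x2 "image": every pixel has 2 upsampled decoder channels and 2 skip channels.
decoder_map = [[[2.0, 0.5], [0.1, 0.2]]]
skip_map = [[[0.0, 0.0], [3.0, 1.0]]]

prob_map = [
    [classify_pixel(d, s, CLASS_WEIGHTS) for d, s in zip(d_row, s_row)]
    for d_row, s_row in zip(decoder_map, skip_map)
]
# The segmentation map keeps the most probable class index for each pixel.
label_map = [[probs.index(max(probs)) for probs in row] for row in prob_map]
print(label_map)
```

For the second pixel, the deciding evidence comes from the skip channels, which illustrates how low-level encoder features can change the predicted class of an individual pixel.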

The use of both atrous convolutions and ASPP in the encoder and skip connections in the decoder allows DeepLabV3 to achieve state-of-the-art performance in semantic image segmentation tasks.

Image From: https://paperswithcode.com/lib/detectron2/deeplabv3-1

How to Train DeepLabV3 on Nexus?

The steps to training any model on Nexus are as follows:

  1. Create your project
  2. Upload your images
  3. Label your images
  4. Define your training workflow
  5. Monitor your training progress

Check out Section 4 of our How To Train YOLOX Object Detection Model On A Custom Dataset article for more details on creating a project, uploading and annotating your images and defining your training workflow. Since the article was published, we have added DeepLabV3 to the list of models we provide. 

When defining your workflow, you can right-click anywhere in the canvas and hover over Models to select DeepLabV3 and view the different base model options we provide: ResNet50, ResNet101 and MobileNetV3 in different resolutions (320x320, 640x640 and 1024x1024).

ResNet is a popular CNN architecture that is commonly pretrained on large datasets, such as ImageNet, and used for computer vision tasks such as image classification and object detection. ResNet50 has 50 layers and ResNet101 has 101. Hence, ResNet101 is deeper and can learn more complex representations of the data compared to ResNet50. However, this also means that ResNet101 has a larger number of parameters and is more computationally expensive to train and use. MobileNetV3, on the other hand, is a lightweight model that was introduced by Google in 2019. It is designed to be efficient and fast, making it ideal for use on mobile and embedded devices with limited computational resources.

Comparing DeepLabV3 with FCNs and U-Nets

The first version of DeepLab was built on a base FCN, and subsequent versions have seen more improvements and increased capabilities. Since then, specialised versions of FCNs (like U-Nets) have also emerged to tackle specific types of tasks effectively.

Some benefits of DeepLabV3 over FCNs (and U-Nets) include the ability to pick up larger context information due to atrous convolutions and the ability to extract features at different scales. These factors make it extremely effective when dealing with images containing objects of different sizes. DeepLabV3 is, however, more computationally expensive to train. In some cases, the symmetric skip connections of U-Nets allow them to combine high-level and low-level features, making them better able to detect objects with complex shapes, something DeepLabV3 might struggle with.


Though DeepLabV3 was released in 2017, it remains a staple choice for performant models in the semantic segmentation space. State-of-the-art techniques still use it as a baseline for segmentation quality, or as the backbone for more complicated segmentation tasks such as video object segmentation. Through its architectural innovations in the use of ASPP, DeepLabV3 has demonstrated its ability to learn to detect a broad range of objects. Through this technical overview of the DeepLabV3 architecture, we hope that users can feel confident in its adaptability to unique use cases when using this architecture in their Nexus workflows.

What’s Next?

Once you have trained your DeepLabV3 model using the steps above, you can compare your trained model with other semantic segmentation architectures such as FCN and U-Net to see which architecture best suits your use case. You can use our model training analysis tools, which range from metrics in the Training Dashboard and visual comparisons in Advanced Evaluation to aggregated comparisons through our Neural Training Insights graph.

If you have any questions about whether Datature can help you with your use case, please feel free to reach out!

Our Developer’s Roadmap

Datature always strives to provide the best resources for users to build high quality computer vision applications for their use cases. As such, we continually review and tune our model architectures to improve accuracy, inference speed, and deployment flexibility. We will also continue to add to our model selection as new research is published or in response to popular demand.

Want to Get Started?

If you have questions, feel free to join our Community Slack to post your questions or contact us about how DeepLabV3 might fit in with your usage. 

For more detailed information about the Model Training pipeline and other state-of-the-art model offerings in Nexus, or answers to any common questions you might have, read more about Nexus on our Developer Portal.
