Introducing Vision Transformers for Robust Segmentation

Datature Introduces Vision Transformer (ViT) Model Support to Improve Segmentation on Complex Datasets

Wei Loon Cheng

What are Vision Transformers?

Vision transformers are a class of neural networks that apply the transformer architecture, originally designed for sequence modelling tasks like language translation, to image processing tasks. Unlike convolutional neural networks (CNNs), which rely on convolutions to capture spatial information hierarchically, vision transformers divide an image into patches and process those patches as a sequence of vectors. This allows vision transformers to capture global relationships among image elements without the need for explicit spatial hierarchies, offering potential advantages in capturing long-range dependencies and facilitating easier parallelization during training. Moreover, vision transformers scale well: because an image is treated as a sequence of patches, the same architecture can be applied to images of different sizes, typically with only an adjustment (such as interpolation) of the positional encodings.

Their recent success has spurred research into transformer-based architectures for various computer vision tasks, driving innovation and paving the way for more flexible and powerful models in the field. They’ve led to the introduction of the first computer vision foundational models and have demonstrated promising results as the basis for universal computer vision models.

How Do Vision Transformers Work?

Vision transformers operate by combining patch-based processing with self-attention mechanisms in typical transformer architectures.

  1. Patch Extraction: The first step is to divide the input image into smaller fixed-size square or rectangular regions called "patches". Each patch typically contains a local region of the image and is represented as a vector of pixel values.

  2. Patch Embeddings: After extracting patches from the input image, each patch is flattened and linearly projected into a fixed-size vector space called the "embedding space". This projection transforms the raw pixel values of each patch into a learned representation that captures essential visual features. The embedding dimension is a design choice and is not necessarily smaller than the number of pixel values in a patch.

  3. Positional Encoding: Since transformers do not inherently understand the spatial relationships between patches, positional encoding is added to provide spatial information. Positional encodings are additional learnable parameters or fixed functions injected into the input embeddings to encode the position of each patch within the image. This enables the transformer to capture spatial relationships between patches during processing.

  4. Transformer Encoder: Once the patches are embedded and positional encodings are added, the resulting sequence of patch embeddings is fed into the transformer encoder. The transformer encoder consists of multiple layers, each comprising a self-attention mechanism and a feed-forward neural network. Self-attention allows each patch to attend to all other patches in the sequence, capturing global relationships between patches. The feed-forward network is then applied to each attended patch embedding independently, transforming its features without mixing information between patches.

  5. Layer Stacking and Output: The process of self-attention and feed-forward computation is typically repeated across multiple layers in the encoder stack. Each layer refines the representation of the input patches by capturing different levels of abstraction and contextual information. Finally, the output of the last layer of the transformer encoder is used for downstream tasks such as image classification, object detection, or image generation.
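The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation of any particular model: it assumes single-head attention, a ReLU feed-forward network, learnable-style positional encodings initialised at random, and no class token. All function and parameter names (`patchify`, `vit_encode`, `init_params`, and so on) are our own, chosen for this example.

```python
import numpy as np

def patchify(image, patch):
    """Step 1: split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly into patches"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)                 # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)   # (num_patches, patch_dim)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)          # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def init_params(rng, patch_dim, num_patches, dim, depth, scale=0.02):
    """Random weights for illustration only; a real model would learn these."""
    def w(*shape):
        return rng.normal(0.0, scale, shape)
    layers = [
        {"wq": w(dim, dim), "wk": w(dim, dim), "wv": w(dim, dim), "wo": w(dim, dim),
         "w1": w(dim, 4 * dim), "w2": w(4 * dim, dim)}
        for _ in range(depth)
    ]
    return {"embed": w(patch_dim, dim), "pos": w(num_patches, dim), "layers": layers}

def encoder_layer(x, p):
    """Step 4: self-attention across patches, then a per-patch feed-forward network."""
    h = layer_norm(x)
    q, k, v = h @ p["wq"], h @ p["wk"], h @ p["wv"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (N, N): each patch attends to all patches
    x = x + (attn @ v) @ p["wo"]                    # residual connection
    h = layer_norm(x)
    x = x + np.maximum(h @ p["w1"], 0.0) @ p["w2"]  # ReLU MLP applied to each patch separately
    return x, attn

def vit_encode(image, params, patch):
    x = patchify(image, patch) @ params["embed"]    # step 2: linear patch embedding
    x = x + params["pos"]                           # step 3: add positional encodings
    attn = None
    for layer in params["layers"]:                  # step 5: stack encoder layers
        x, attn = encoder_layer(x, layer)
    return x, attn

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
params = init_params(rng, patch_dim=8 * 8 * 3, num_patches=16, dim=64, depth=2)
features, attn = vit_encode(image, params, patch=8)
print(features.shape, attn.shape)  # (16, 64) (16, 16)
```

A 32x32 image with 8x8 patches yields 16 patch tokens, and the final attention matrix has one row per patch showing how strongly it attends to every other patch.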

Vision Transformers vs. Convolutional Neural Networks

Though vision transformers are certainly more than just a fad in computer vision given the results they've demonstrated, convolutional neural networks still have clear practical use cases in which they remain the preferred option.

Choosing between vision transformers and convolutional neural networks (CNNs) depends on several factors. Vision transformers excel in tasks that require capturing long-range dependencies and global context within images, where understanding the relationships between distant image elements is crucial. They offer scalability benefits for large-scale datasets or high-resolution images and provide a degree of interpretability through their attention maps. On the other hand, CNNs still perform very well on tasks where spatial hierarchies and local features are the important characteristics. They typically require fewer computational resources, making them suitable for resource-constrained environments. Ultimately, the choice between vision transformers and CNNs is contextual and comes down to the specific requirements of the task, such as the deployment device and the type of image data being used.
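To get a feel for the compute side of this trade-off, the sketch below counts patch tokens and attention-matrix entries for a plain, global self-attention layer. It assumes square 16x16 patches and one attention head, and it ignores the cheaper attention variants that many practical models use (SegFormer, for example, reduces the sequence length inside attention). The function name `attention_cost` is ours.

```python
def attention_cost(height, width, patch):
    """Return (num_tokens, attention_entries) for one head of one global attention layer."""
    tokens = (height // patch) * (width // patch)
    return tokens, tokens * tokens

for side in (224, 448, 896):
    tokens, entries = attention_cost(side, side, patch=16)
    print(f"{side}x{side}: {tokens} tokens, {entries:,} attention entries")
# 224x224: 196 tokens, 38,416 attention entries
# 448x448: 784 tokens, 614,656 attention entries
# 896x896: 3136 tokens, 9,834,496 attention entries
```

Doubling the image side quadruples the number of tokens and multiplies the attention matrix by sixteen, which is one reason input resolution and deployment hardware matter when choosing an architecture.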

Vision Transformers on Nexus

In line with Nexus' continual effort to keep up with state-of-the-art technologies, we are introducing our first wave of vision transformers for custom model training and fine-tuning on semantic segmentation tasks: Mask2Former and SegFormer. These two models and their variants are at the forefront of semantic segmentation model benchmarks.

Mask2Former (Cheng et al., 2022a) builds on earlier vision transformers for semantic segmentation, such as MaskFormer, with three significant changes. First, the transformer decoder uses masked attention, which restricts attention to localized features within the predicted segments instead of the full feature map. Limiting attention in this way improves efficiency, and the increased focus on local features improves performance. Second, high-resolution features are fully utilized via a multi-scale feature pyramid, with each transformer decoder layer receiving one scale of the pyramid at a time. Third, computation is reduced by calculating the mask loss on only K randomly sampled points instead of the entire mask.
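Two of these ideas, masked attention and the point-sampled mask loss, can be illustrated with a short NumPy sketch. This is a simplified reading of the description above, not the reference Mask2Former code: it assumes single-head attention, a 0.5 threshold on the previous layer's mask prediction, uniform random point sampling, and a fallback to full attention when a query's predicted mask is empty. The function names are ours.

```python
import numpy as np

def masked_cross_attention(queries, keys, values, prev_mask_probs, threshold=0.5):
    """Attend only to locations inside each query's previously predicted mask.

    queries: (Q, D) object queries; keys, values: (P, D) image features;
    prev_mask_probs: (Q, P) mask probabilities predicted by the previous layer.
    """
    logits = queries @ keys.T / np.sqrt(queries.shape[-1])   # (Q, P)
    keep = prev_mask_probs >= threshold                       # foreground locations per query
    keep[~keep.any(axis=1)] = True       # assumption: an empty mask falls back to full attention
    logits = np.where(keep, logits, -np.inf)                  # masked locations get zero weight
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ values, weights

def point_sampled_bce(mask_logits, target, num_points, rng):
    """Binary cross-entropy computed on num_points uniformly sampled pixels only."""
    idx = rng.choice(mask_logits.size, size=num_points, replace=False)
    z = mask_logits.reshape(-1)[idx]
    t = target.reshape(-1)[idx]
    # Numerically stable form of BCE with logits.
    return float(np.mean(np.maximum(z, 0) - z * t + np.log1p(np.exp(-np.abs(z)))))

rng = np.random.default_rng(0)
queries, keys, values = rng.normal(size=(2, 4)), rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
prev = np.array([[0.9, 0.8, 0.1, 0.2, 0.0, 0.3],    # query 0: only locations 0 and 1 are foreground
                 [0.1, 0.2, 0.1, 0.0, 0.3, 0.2]])   # query 1: empty mask, so full attention
out, weights = masked_cross_attention(queries, keys, values, prev)
print(np.round(weights, 3))
```

In the printed weights, the first query places all of its attention on the two locations inside its mask, while the second query attends everywhere because its mask was empty.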


SegFormer (Xie et al., 2021) is a simple, lightweight transformer-based network that makes straightforward use of the features extracted by its transformer encoder. It consists of a hierarchical transformer encoder and a simple multi-layer perceptron (MLP) decoder. The encoder is a series of transformer stages that operate at progressively lower resolutions, with a patch merging step between stages, so it produces both fine, high-resolution features and coarse, low-resolution features. Finally, the lightweight MLP decoder combines the information from these multiscale features and predicts a final pixel mask for semantic segmentation.
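The decoder side of this design is small enough to sketch directly. The NumPy example below assumes four encoder feature maps at 1/4, 1/8, 1/16, and 1/32 of the input resolution, and uses nearest-neighbour upsampling and a plain ReLU for brevity (a real implementation would typically use bilinear upsampling and normalization). The channel sizes, weight shapes, and function names are illustrative choices of ours, not values taken from Nexus.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of an (H, W, C) feature map by an integer factor."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def mlp_decoder(features, proj_weights, fuse_weight, cls_weight):
    """Combine multiscale features into per-pixel class logits at the finest feature scale."""
    target_h = features[0].shape[0]
    unified = []
    for feat, w in zip(features, proj_weights):
        x = feat @ w                                   # per-pixel linear layer to a shared width
        x = upsample_nearest(x, target_h // feat.shape[0])
        unified.append(x)
    fused = np.concatenate(unified, axis=-1) @ fuse_weight   # mix information across scales
    fused = np.maximum(fused, 0.0)
    return fused @ cls_weight                          # (H/4, W/4, num_classes)

rng = np.random.default_rng(0)
channels, dim, num_classes = (32, 64, 160, 256), 64, 3
# Stand-in encoder outputs for a 64x64 input at strides 4, 8, 16, and 32.
features = [rng.random((64 // s, 64 // s, c)) for s, c in zip((4, 8, 16, 32), channels)]
proj_weights = [rng.normal(0, 0.02, (c, dim)) for c in channels]
fuse_weight = rng.normal(0, 0.02, (dim * len(channels), dim))
cls_weight = rng.normal(0, 0.02, (dim, num_classes))

logits = mlp_decoder(features, proj_weights, fuse_weight, cls_weight)
print(logits.shape)  # (16, 16, 3)
```

Taking the argmax over the last axis, then upsampling to the input size, would give the predicted class for each pixel.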


How to Train Vision Transformers on Datature Nexus?

Create Your Semantic Segmentation Project

To get started, log in or create an account with Datature Nexus. In your workspace, you can create a project with the project type of Segmentation and specify your data type as Images, Videos, or both Images and Videos.

Create Your Segmentation Project on Nexus.

Onboard and Label Your Data

You can upload your image and video data once the project has been created. If you already have existing segmentation annotations, you can easily import them into your project via popular formats such as COCO Mask or LabelMe Mask. If not, you can use our Annotator to label your data with a variety of tools, including Intellibrush, our smart labelling tool for segmenting objects precisely and efficiently.

Upload Your Image Data onto Nexus.
Use Intellibrush for Fast, Precise Segmentation Annotation.

How to Fine-Tune Mask2Former on Your Custom Data?

To fine-tune a Mask2Former model, you can create a training workflow that uses your annotated data. With Datature, you can choose to train a Mask2Former model starting from weights pre-trained on the COCO dataset, or continue from a trained artifact of the same model type on Nexus. Datature offers four architectural variants with differing numbers of parameters and input resolutions. Choosing a suitable model depends on the nature and complexity of your data, as well as your requirements on inference speed versus accuracy.

Build Training Workflows with Different Vision Transformer Architectures.

You can also tune hyperparameters such as the batch size and the number of training steps, and even choose a previously trained checkpoint as the starting point of the new training for fine-tuning purposes.

Customize Vision Transformer Model Hyperparameters on Nexus.

Once the training workflow has been set up, you can select Run Training and customize your hardware and training settings, such as the number of GPUs as well as the checkpoint strategy.

Customize Training and Hardware Settings on Nexus.

To monitor your training and model performance, you can view the metrics curves in real-time on the Trainings page, as well as visualize predictions through our Advanced Evaluation and Confusion Matrix tools.

Visualize Real-Time Metrics Curves on Nexus.
Visualize Model Predictions Across Different Training Checkpoints.

How to Deploy Your Trained SegFormer Model for Inference?

With your trained artifacts, you can quickly deploy your model on our cloud servers with a choice of CPU or GPU depending on your needs. Using our Inference API, you can easily send your unseen data through an API request for inference, and the response will contain the predictions in JSON format.

Test the Model Deployment with Unseen Test Data.

Alternatively, if you wish to deploy your model locally or on edge devices, we provide a variety of model export formats, including edge-optimized frameworks, to better integrate with any existing tech stack you may have. These exports come bundled with simple prediction scripts to quickly get you started with the inference stage. You can also opt to quantize or prune your model for faster inference and compatibility on resource-constrained hardware. Check out our articles on Post-Training Quantization and Post-Training Pruning to learn more.
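For intuition about what these two optimizations do to a model's weights, here is a minimal NumPy sketch of symmetric int8 quantization and magnitude-based pruning applied to a single weight matrix. It illustrates the general techniques only and is not how Nexus implements its export options; the function names and the per-tensor scaling scheme are assumptions made for this example.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to int8 with a single scale factor."""
    max_abs = float(np.abs(weights).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return q.astype(np.float32) * scale

def prune_by_magnitude(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest absolute value."""
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    # Ties at the threshold are pruned too, so slightly more than k weights may be zeroed.
    return np.where(np.abs(weights) <= threshold, 0.0, weights)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.1, (64, 64))

q, scale = quantize_int8(w)
error = np.abs(w - dequantize(q, scale)).max()
print(f"int8 storage: {q.nbytes} bytes vs float32: {w.astype(np.float32).nbytes} bytes")
print(f"max quantization error: {error:.6f} (about half a step, step = {scale:.6f})")

pruned = prune_by_magnitude(w, sparsity=0.5)
print(f"zeroed weights: {np.mean(pruned == 0):.0%}")
```

Quantization shrinks storage by about four times relative to float32 at the cost of a small rounding error, while pruning removes the weights that contribute least. Both usually call for re-checking accuracy on your validation data afterwards.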

Export Your Model to Various Frameworks and Apply Model Optimizations.

Vision Transformers Results Comparison

From our benchmark tests conducted on a range of semantic segmentation models trained on a squirrel and butterfly segmentation dataset, we found that the Mask2Former and SegFormer architectures predicted the most precise masks, with F1 scores that exceeded popular CNN architectures like DeepLabV3 and UNet by up to 20%.
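For readers who want to reproduce this kind of comparison on their own masks, the sketch below computes a pixel-wise F1 score between a predicted binary mask and a ground-truth binary mask. The exact evaluation protocol behind the benchmark (per-class averaging, thresholds, and so on) is not described here, so treat this as one common definition rather than the one used for the numbers above. The function name `mask_f1` is ours.

```python
import numpy as np

def mask_f1(pred, target):
    """Pixel-wise F1 score between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = np.logical_and(pred, target).sum()     # pixels correctly predicted as foreground
    fp = np.logical_and(pred, ~target).sum()    # predicted foreground, actually background
    fn = np.logical_and(~pred, target).sum()    # missed foreground pixels
    denom = 2 * tp + fp + fn
    # Two empty masks agree perfectly, so define that case as 1.0.
    return 1.0 if denom == 0 else float(2 * tp / denom)

pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(round(mask_f1(pred, target), 4))  # 0.6667
```

Here there are two true positives, one false positive, and one false negative, giving F1 = 4 / 6.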

Although higher accuracy usually comes at the cost of inference speed, Mask2Former and SegFormer models are able to outperform FCN and UNet variants of similar input resolution, with the smallest SegFormer B0 architecture achieving almost 8 FPS when deployed on a CPU. This shows promising potential for such vision transformers to achieve real-time inference speeds on edge devices once model optimization techniques such as quantization and pruning are applied.
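If you want to check throughput on your own hardware, a simple timing loop is usually enough for a first estimate. The helper below is a generic sketch: `predict` stands for whatever callable runs your exported model on one input, and the warm-up and run counts are arbitrary defaults of ours.

```python
import time

def measure_fps(predict, sample, warmup=3, runs=20):
    """Estimate frames per second for a single-input prediction callable."""
    for _ in range(warmup):              # warm-up runs are excluded from the timing
        predict(sample)
    start = time.perf_counter()
    for _ in range(runs):
        predict(sample)
    elapsed = time.perf_counter() - start
    return float("inf") if elapsed == 0 else runs / elapsed

# Stand-in workload; replace with a call into your own model.
fps = measure_fps(lambda n: sum(range(n)), 100_000)
print(f"{fps:.1f} FPS")
```

Measured numbers vary with hardware, input resolution, and batch size, so compare models under the same conditions.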

Try It On Your Own Data

You can easily train your own custom Mask2Former or SegFormer models on your own image or video data by following the steps above with your Datature Nexus account. With our Free tier account, you can perform these steps without a credit card or any payment, within the limits of the account quota.

What’s Next?

You can always compare whether the new transformer-based architectures are better suited to your context than convolutional neural networks, such as YOLOv8-SEG for instance segmentation or DeepLabV3 for semantic segmentation. To learn more about training segmentation models, you can read this article.

Our Developer’s Roadmap

We will be developing support for more transformer-based architectures, including foundational models. As such, users can look out for model training support for a larger variety of use cases and computer vision tasks. As always, user feedback is welcome and if there are any particular model architectures that you feel should be on the platform, please feel free to reach out and let us know!
