Simple Science

Understanding Occlusion in CNNs and ViTs

A look at how CNNs and ViTs handle occlusion and patch selectivity.


Figure: CNNs vs. ViTs on occlusion. Examining model performance in occluded scenarios.

In recent years, two main types of models have become popular for tasks in computer vision: Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). Both are used to help machines see, recognize, and understand images. CNNs have been around longer and have been the standard choice for many applications, but ViTs have recently gained attention because they often perform just as well, or even better, on some important tasks.

Despite both models proving their usefulness, they operate differently due to their unique structures. CNNs process images in a more local manner, focusing on small parts of an image at a time, while ViTs look at the entire image at once and can connect information from distant parts. This difference leads to varied performance in certain situations, especially when images are partially blocked or occluded.

What is Occlusion?

Occlusion refers to situations where objects in an image are partially hidden or blocked by other objects. For instance, if a person is standing behind a tree in a photograph, the tree's leaves and branches can obscure parts of the person's figure. Understanding how models deal with occlusion is essential, as it has significant real-world implications. For example, in self-driving cars, accurately detecting pedestrians or other vehicles is crucial, even when they are only partially visible.

While some earlier research explored how CNNs and ViTs handle occlusion, there is still much to learn, especially regarding newer CNN architectures.

The Importance of Patch Selectivity

In analyzing how these models perform, researchers have introduced a concept known as "patch selectivity." This term refers to the ability of a model to ignore parts of an image that are not relevant or that may confuse it, instead focusing on the parts that matter. ViTs have shown a natural talent for this, allowing them to perform well even when there are occluded areas in images.

In contrast, CNNs have traditionally struggled with this challenge and are often thrown off by irrelevant, out-of-context patches. However, CNNs can be trained to improve their patch selectivity using a method called Patch Mixing.

What is Patch Mixing?

Patch Mixing is a training technique in which pieces (or patches) from different images are combined while training a model. For instance, patches from one image can be placed onto another image, with the labels (the information we use to tell the model what the images represent) adjusted to match. This technique exposes CNNs to a wider variety of information during training, making them more resilient to occlusions.

By using Patch Mixing, researchers found that CNNs could gain the ability to ignore out-of-context information much like ViTs do. Essentially, this method aims to bridge the gap between the two model types regarding their robustness to occlusion.

Contributions of the Research

This research presents several key contributions to our understanding of how CNNs and ViTs deal with occlusion and patch selectivity:

  1. Identifying Differences: The study identifies a clear performance difference between CNNs and ViTs when faced with out-of-context information. ViTs naturally handle the addition of irrelevant information better than CNNs, showcasing their patch selectivity.

  2. Revisiting Data Augmentation: The research revisits the Patch Mixing technique as a data augmentation method to help CNNs learn to ignore these irrelevant details. By training CNNs with Patch Mixing, their performance improves, allowing them to become more robust against occlusions.

  3. New Datasets for Evaluation: The researchers introduce two new datasets specifically designed to test model performance in occluded scenarios: the Superimposed Masked Dataset (SMD) and the Realistic Occlusion Dataset (ROD). These datasets help evaluate how well models handle real-world situations where parts of objects might be hidden.

  4. New Explainability Method: The study presents a new way to understand how models make decisions called contrastive RISE (c-RISE). This method helps visualize and quantify patch selectivity for both CNNs and ViTs.

CNNs vs. ViTs: How They Process Information

CNNs are structured with layers of convolutional operations. They focus on small areas of images, gradually building up an understanding of the whole image. Older CNN models were good at recognizing patterns but had limitations in terms of how they related distant parts of an image.

ViTs, on the other hand, work by breaking down images into smaller patches and using self-attention to relate all parts of the image to one another. This allows them to learn relationships between pixels that are far apart and helps them ignore irrelevant patches more effectively.
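To make the patch-and-attend idea concrete, here is a minimal NumPy sketch. It is an illustration under simplifying assumptions, not the architecture used in the paper: the function names `patchify` and `self_attention`, the random projection weights, and the 32x32 image are all made up for the example, and real ViTs add patch embeddings, positional encodings, multiple heads, and many layers.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into a (num_patches, patch*patch*C) matrix."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c), then flatten each patch
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product attention over all tokens."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
tokens = patchify(image, patch=8)  # 16 patches, each flattened to 192 values
d = 16                             # hypothetical projection size
wq, wk, wv = (rng.normal(size=(tokens.shape[1], d)) * 0.05 for _ in range(3))
out, weights = self_attention(tokens, wq, wk, wv)
```

Each row of `weights` says how much one patch draws on every other patch, including distant ones. A patch that receives a weight near zero is effectively ignored, which is the mechanism behind patch selectivity.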

The Challenge of Early Layer Dependence

One significant difference between these two types of models lies in their early layers. In a CNN, each unit in an early layer sees only a small neighborhood of pixels (its receptive field), so the information gathered there is strictly local. In contrast, ViTs can attend to any part of the image from the very first layer. As a result, ViTs can capture broader relationships in an image early on, while CNNs must stack many layers before distant pixels can influence one another.
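The locality of early CNN layers can be quantified with the standard receptive-field recurrence. The sketch below is illustrative only: the helper name `receptive_field` and the three-layer stack are assumptions, not a model from the paper.

```python
def receptive_field(layers):
    """Receptive field (in input pixels, along one axis) after each layer.

    `layers` is a list of (kernel_size, stride) pairs, applied in order.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    sizes = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
        sizes.append(rf)
    return sizes

# Hypothetical stack: three 3x3 convolutions with stride 1
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # [3, 5, 7]
```

After three such layers a unit still sees only a 7x7 window of the input, whereas a single self-attention layer already spans every patch in the image.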

Empirical Evidence of Performance

The research tests this hypothesis empirically. Experiments comparing modern CNNs with ViTs under occlusion showed that ViTs were consistently better at ignoring irrelevant patches.

How Patch Mixing Works

Patch Mixing involves taking patches from multiple images and merging them. While mixing these patches, the labels attached to the images are also blended to reflect the changes made. By exposing CNNs to different patches, these models learn to rely less on spatial relationships and adapt to the presence of occluded areas.
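One plausible way to write this down is shown here. The summary does not give the paper's exact formula, so the notation is an assumption: $x_A, x_B$ are the two images, $y_A, y_B$ their one-hot labels, $M$ is a binary mask that is 1 on replaced patches, and $\lambda$ is the fraction of patches replaced.

```latex
\tilde{x} = (1 - M) \odot x_A + M \odot x_B,
\qquad
\tilde{y} = (1 - \lambda)\, y_A + \lambda\, y_B,
\qquad
\lambda = \frac{\text{number of replaced patches}}{\text{total number of patches}}
```

For example, if a quarter of the patches in a "dog" image are replaced with patches from a "cat" image, then $\lambda = 0.25$ and the training target becomes 0.75 dog and 0.25 cat.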

Implementing Patch Mixing

To implement Patch Mixing, a mask is created to decide which patches of the image will be replaced. Patches are randomly chosen from the selected images, and the mix is created based on a set percentage of how many patches to replace. This strategy helps to improve the robustness of CNNs.
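The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' released implementation. The function name `patch_mix`, its parameters, the grid-aligned square patches, and the rule that label weights equal the fraction of replaced patches are all assumptions made for the example.

```python
import numpy as np

def patch_mix(img_a, img_b, label_a, label_b, patch=16, ratio=0.3, rng=None):
    """Replace a random fraction of img_a's grid patches with img_b's.

    img_a, img_b: float arrays of identical shape (H, W, C).
    label_a, label_b: one-hot (or soft) label vectors of equal length.
    Returns the mixed image, the mixed label, and the boolean patch mask.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w, _ = img_a.shape
    assert img_a.shape == img_b.shape and h % patch == 0 and w % patch == 0
    gh, gw = h // patch, w // patch

    # Choose which grid cells to replace (True = take the patch from img_b).
    n_replace = int(round(ratio * gh * gw))
    mask = np.zeros(gh * gw, dtype=bool)
    mask[rng.choice(gh * gw, size=n_replace, replace=False)] = True
    mask = mask.reshape(gh, gw)

    # Upsample the grid mask to pixel resolution and combine the two images.
    pixel_mask = mask.repeat(patch, axis=0).repeat(patch, axis=1)[..., None]
    mixed = np.where(pixel_mask, img_b, img_a)

    # Assumption: label weights follow the fraction of patches replaced.
    lam = n_replace / (gh * gw)
    mixed_label = (1.0 - lam) * label_a + lam * label_b
    return mixed, mixed_label, mask

rng = np.random.default_rng(0)
a, b = np.zeros((64, 64, 3)), np.ones((64, 64, 3))   # stand-in images
ya, yb = np.eye(10)[3], np.eye(10)[7]                # one-hot labels
mixed, y, mask = patch_mix(a, b, ya, yb, patch=16, ratio=0.25, rng=rng)
```

With a 4x4 grid and `ratio=0.25`, four patches come from the second image, so the mixed label puts weight 0.75 on class 3 and 0.25 on class 7. In training, the second image would typically be another sample from the same batch.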

The Benefits of Patch Mixing

The application of Patch Mixing has shown promising results. CNNs trained with this method have demonstrated improved abilities to ignore out-of-context information. This enhancement allows CNNs to better handle real-world scenarios where objects are not always fully visible.

Evaluating Model Performance

To assess how well models manage occlusion, two new datasets were created. These datasets provide challenging scenarios to further understand model behavior when parts of images are hidden.

Realistic Occlusion Dataset (ROD)

The ROD is designed to test models on realistic occlusion scenarios using real objects captured under controlled conditions. Images are made by placing occluding objects in various positions relative to the main object to simulate how occlusion occurs naturally.

Superimposed Masked Dataset (SMD)

The SMD provides an occluded version of the ImageNet-1K validation dataset, using well-defined occluders that are not part of the main label set. This added complexity helps in evaluating how models respond to different occlusion types.
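The summary does not describe how the SMD images were produced. As a rough illustration of what "superimposing" an occluder can mean, the sketch below pastes an occluder onto an image using a binary mask of the occluder's shape. The function name `superimpose`, its arguments, and the toy arrays are assumptions for the example and may differ from the authors' pipeline.

```python
import numpy as np

def superimpose(image, occluder, occ_mask, top, left):
    """Paste `occluder` onto a copy of `image` wherever `occ_mask` is True.

    image: (H, W, C); occluder: (h, w, C); occ_mask: (h, w) boolean.
    (top, left) is where the occluder's upper-left corner lands.
    Returns the composite and the fraction of image pixels that are covered.
    """
    h, w = occ_mask.shape
    assert top >= 0 and left >= 0
    assert top + h <= image.shape[0] and left + w <= image.shape[1]
    out = image.copy()
    region = out[top:top + h, left:left + w]  # a view into `out`
    region[occ_mask] = occluder[occ_mask]
    return out, occ_mask.sum() / (image.shape[0] * image.shape[1])

image = np.zeros((8, 8, 3))
occluder = np.ones((4, 4, 3))
occ_mask = np.zeros((4, 4), dtype=bool)
occ_mask[1:3, 1:3] = True  # a 2x2 opaque blob inside the occluder crop
occluded, covered = superimpose(image, occluder, occ_mask, top=2, left=2)
```

Returning the covered fraction makes it easy to group test images by how much of the picture is hidden and to report accuracy at each occlusion level.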

Testing and Results

In testing, CNNs trained with Patch Mixing typically performed better than their original counterparts on occlusion benchmarks. ViTs, by contrast, neither improved nor degraded when trained with Patch Mixing, which suggests that the technique gives CNNs an ability ViTs already possess.

How Models Handle Changes in Image Structure

The study also examined how well models retain their accuracy when presented with shuffled or altered versions of images. During these tests, models trained using Patch Mixing showed a significant reduction in reliance on spatial structures. This result indicated an advancement in their capability to adapt to image variations.
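A shuffled test image of this kind can be produced by cutting the image into a grid of tiles and permuting them. The sketch below is an assumption about how such a test might be set up: the function name `shuffle_patches`, the grid size, and the random stand-in image are not taken from the paper.

```python
import numpy as np

def shuffle_patches(image, grid, rng=None):
    """Cut an (H, W, C) image into grid x grid tiles and permute them."""
    rng = np.random.default_rng() if rng is None else rng
    h, w, c = image.shape
    assert h % grid == 0 and w % grid == 0
    ph, pw = h // grid, w // grid
    tiles = image.reshape(grid, ph, grid, pw, c).transpose(0, 2, 1, 3, 4)
    tiles = tiles.reshape(grid * grid, ph, pw, c)
    tiles = tiles[rng.permutation(grid * grid)]
    # Reassemble the permuted tiles into an image of the original shape.
    tiles = tiles.reshape(grid, grid, ph, pw, c).transpose(0, 2, 1, 3, 4)
    return tiles.reshape(h, w, c)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
shuffled = shuffle_patches(image, grid=4, rng=rng)
```

Comparing a model's accuracy on `image` and on `shuffled` versions of the same data gives a simple measure of how strongly it depends on the spatial arrangement of patches.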

Conclusion

This research sheds light on the critical differences between CNNs and ViTs in handling occlusion and ignoring irrelevant information. The concept of patch selectivity has proven to be an essential aspect of model performance under these conditions. By introducing Patch Mixing, a method that enhances this ability in CNNs, the researchers have provided a pathway to improve these models significantly.

The development of new datasets for evaluation and the introduction of c-RISE for better explainability further advance our understanding of how these models operate. As applications of computer vision continue to grow in importance, understanding these differences and improvements is vital for deploying robust models in real-world situations.

In summary, both CNNs and ViTs have strengths and weaknesses in computer vision tasks. However, with techniques like Patch Mixing, we can enhance traditional models, making them increasingly versatile in handling challenges like occlusion. This advancement holds promise for many fields, including autonomous vehicles, medical imaging, and security systems, where accurate image recognition is essential even in less-than-ideal conditions.

Original Source

Title: Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing

Abstract: Vision transformers (ViTs) have significantly changed the computer vision landscape and have periodically exhibited superior performance in vision tasks compared to convolutional neural networks (CNNs). Although the jury is still out on which model type is superior, each has unique inductive biases that shape their learning and generalization performance. For example, ViTs have interesting properties with respect to early layer non-local feature dependence, as well as self-attention mechanisms which enhance learning flexibility, enabling them to ignore out-of-context image information more effectively. We hypothesize that this power to ignore out-of-context information (which we name $\textit{patch selectivity}$), while integrating in-context information in a non-local manner in early layers, allows ViTs to more easily handle occlusion. In this study, our aim is to see whether we can have CNNs $\textit{simulate}$ this ability of patch selectivity by effectively hardwiring this inductive bias using Patch Mixing data augmentation, which consists of inserting patches from another image onto a training image and interpolating labels between the two image classes. Specifically, we use Patch Mixing to train state-of-the-art ViTs and CNNs, assessing its impact on their ability to ignore out-of-context patches and handle natural occlusions. We find that ViTs do not improve nor degrade when trained using Patch Mixing, but CNNs acquire new capabilities to ignore out-of-context information and improve on occlusion benchmarks, leaving us to conclude that this training method is a way of simulating in CNNs the abilities that ViTs already possess. We will release our Patch Mixing implementation and proposed datasets for public use. Project page: https://arielnlee.github.io/PatchMixing/

Authors: Ariel N. Lee, Sarah Adel Bargal, Janavi Kasera, Stan Sclaroff, Kate Saenko, Nataniel Ruiz

Last Update: 2023-06-30 00:00:00

Language: English

Source URL: https://arxiv.org/abs/2306.17848

Source PDF: https://arxiv.org/pdf/2306.17848

Licence: https://creativecommons.org/licenses/by/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to arxiv for use of its open access interoperability.
