📖 Revisit the Open Nature of
Open Vocabulary Semantic Segmentation

The MIx Group, University of Birmingham
ICLR 2025
Example image

Category ambiguity in open vocabulary semantic segmentation. One object can be assigned multiple plausible labels, of which the human annotation is only one. For example, the area on the left marked with a yellow star was annotated as ‘‘plant’’ by humans but predicted as ‘‘flower’’ by the OVS model; the bottom region annotated as ‘‘seat’’ was predicted as ‘‘chair’’.

Abstract

In Open Vocabulary Semantic Segmentation (OVS), we observe a consistent drop in model performance as the query vocabulary set expands, especially when it includes semantically similar and ambiguous vocabularies, such as ‘sofa’ and ‘couch’. The previous OVS evaluation protocol, however, does not account for such ambiguity: any mismatch between model-predicted and human-annotated pairs is simply treated as incorrect on a pixel-wise basis. This contradicts the open nature of OVS, where ambiguous categories may all be correct from an open-world perspective.

To address this, in this work we study the open nature of OVS and propose a mask-wise evaluation protocol based on matched and mismatched mask pairs between predictions and annotations. Extensive experimental evaluations show that the proposed mask-wise protocol provides a more effective and reliable evaluation framework for OVS models than the previous pixel-wise approach from an open-world perspective. Moreover, analysis of mismatched mask pairs reveals that a large number of ambiguous categories exist in commonly used OVS datasets.

Interestingly, we find that reducing these ambiguities during both training and inference enhances the capabilities of OVS models. These findings and the new evaluation protocol encourage further exploration of the open nature of OVS, as well as broader open-world challenges.

The proposed mask-wise evaluation protocol

Example image

Our method differs from traditional pixel-wise evaluation by retaining multiple reasonable class predictions based on a threshold, rather than selecting only the highest-probability class per pixel. We introduce three types of match relationships for evaluation:



We further utilize out-matched pairs to construct an ambiguous vocabulary graph and perform community discovery analysis to better understand the model’s open-set prediction capabilities.
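As a rough sketch of the thresholding idea above (the function names, the threshold value, and the match-type labels beyond ‘‘out-matched’’ are our illustrative assumptions, not the paper's exact formulation), retaining every class whose score clears a threshold instead of only the argmax class could look like:

```python
def retain_candidates(class_probs, threshold=0.5):
    """Keep every class whose probability clears the threshold,
    ordered by probability, rather than only the argmax class."""
    order = sorted(range(len(class_probs)),
                   key=class_probs.__getitem__, reverse=True)
    keep = [c for c in order if class_probs[c] >= threshold]
    # Fall back to the top-1 class if nothing clears the threshold.
    return keep or [order[0]]


def match_type(pred_classes, gt_class):
    """Classify a predicted/annotated mask pair (illustrative labels)."""
    if pred_classes[0] == gt_class:
        return "matched"          # top-1 prediction agrees with annotation
    if gt_class in pred_classes:
        return "ambiguous-match"  # annotation is among retained candidates
    return "out-matched"          # annotation is not retained at all
```

For instance, with per-mask probabilities `[0.10, 0.60, 0.55, 0.05]`, classes 1 and 2 are both retained, so an annotation of class 2 counts as an ambiguous match rather than a plain error.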

Visualization of the ambiguous vocabulary graph

Example image

A community extracted from the COCO-Stuff171 dataset (showing only 50 classes). Example images from the same community, where images from the same vocabulary community exhibit visually similar semantics (best viewed in colour).

Using community discovery methods, we can partition the ambiguous vocabulary graph into communities. Each community represents a cluster of classes that are often confused with one another. For example, in an object detection or segmentation dataset, we might observe that the categories “sofa”, “couch”, and “armchair” form a tightly connected community, indicating that these classes are frequently confused by the model.


This suggests that the dataset contains ambiguous annotations, where these objects are not clearly distinguishable or where multiple terms are used interchangeably across regions and contexts. Visualising the labels within the same community, we find that they are visually very similar and hard to tell apart through subtle visual differences. From a human perspective, these labels would likely be classified as the same thing; their high similarity may indicate that they share core features or attributes.
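A minimal sketch of this partitioning step, using connected components as the simplest possible community structure (the edge list is made up for illustration, and real community discovery algorithms such as modularity-based methods would refine this further):

```python
from collections import defaultdict

# Toy ambiguity graph: an edge connects two vocabulary items that were
# out-matched against each other. Categories here are illustrative only.
edges = [
    ("sofa", "couch"), ("sofa", "armchair"), ("couch", "armchair"),
    ("plant", "flower"), ("plant", "tree"),
    ("seat", "chair"),
]

adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)


def communities(adj):
    """Partition the graph into connected components, the simplest
    stand-in for a community discovery method."""
    seen, comms = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comm = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in comm:
                continue
            comm.add(node)
            stack.extend(adj[node] - comm)
        seen |= comm
        comms.append(comm)
    return comms
```

On this toy graph the partition yields three clusters, e.g. {sofa, couch, armchair}, mirroring the kind of confusable-class community described above.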

Vocabulary co-occurring relationship

Example image

During the OVS inference stage, the segmentation performance for a particular object depends on the other vocabularies provided alongside it. For instance, the segmentation of ‘chair’ is most effective when related vocabularies such as ‘table’ and ‘floor’ are supplied at the same time.
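One simple way to surface such co-occurring vocabularies (a toy sketch over made-up per-image label sets, not the paper's actual analysis pipeline) is to count how often pairs of labels appear in the same image's annotations:

```python
from collections import Counter
from itertools import combinations

# Per-image annotation label sets (illustrative, not from a real dataset).
images = [
    {"chair", "table", "floor"},
    {"chair", "table", "wall"},
    {"chair", "floor", "lamp"},
    {"sofa", "floor", "wall"},
]

# Count how often each unordered pair of labels co-occurs in one image;
# frequently co-occurring labels are candidates to query together.
cooc = Counter()
for labels in images:
    for a, b in combinations(sorted(labels), 2):
        cooc[(a, b)] += 1

print(cooc[("chair", "table")])  # 'chair' and 'table' co-occur in 2 images
```

High co-occurrence counts suggest vocabulary pairs (such as ‘chair’ and ‘table’) that are worth providing jointly in the query set at inference time.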

Re-benchmarking Results

Table 1: Quantitative results of our proposed mask-wise evaluation protocol. The symbol ★ indicates using a joint-dataset vocabulary set during testing. NULL denotes that no out-matched mask is predicted. Results of the conventional argmax pixel-wise approach are shown in the first four rows, which report a single score per dataset (back↑ and err↓ do not apply).

| Method | Venue | PC59 front↑ | back↑ | err↓ | ADE150 front↑ | back↑ | err↓ | PC459 front↑ | back↑ | err↓ | ADE847 front↑ | back↑ | err↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SAN | CVPR'23 | 57.70 | – | – | 32.10 | – | – | 15.70 | – | – | 12.40 | – | – |
| CAT-Seg | CVPR'24 | 63.30 | – | – | 37.90 | – | – | 23.80 | – | – | 16.00 | – | – |
| SED | CVPR'24 | 60.90 | – | – | 35.30 | – | – | 22.10 | – | – | 13.70 | – | – |
| MAFT+ | ECCV'24 | 59.40 | – | – | 36.10 | – | – | 21.60 | – | – | 15.10 | – | – |
| SAN | CVPR'23 | 65.91 | 93.75 | 9.99 | 42.89 | 93.12 | 8.56 | 27.65 | 70.87 | 6.67 | 22.84 | 92.46 | 8.41 |
| CAT-Seg | CVPR'24 | 68.46 | 94.24 | NULL | 45.74 | 94.61 | 5.53 | 30.95 | 68.96 | 3.86 | 26.39 | 93.66 | 5.20 |
| SED | CVPR'24 | 66.29 | 94.21 | 6.43 | 44.90 | 93.50 | 5.20 | 31.41 | 70.72 | 4.93 | 26.99 | 92.61 | 5.07 |
| MAFT+ | ECCV'24 | 64.95 | 93.57 | 9.10 | 46.51 | 93.10 | 7.31 | 31.89 | 70.82 | 7.12 | 28.72 | 92.15 | 7.84 |

Poster

BibTeX

@inproceedings{huangrevisit,
  title={Revisit the open nature of open vocabulary segmentation},
  author={Huang, Qiming and Hu, Han and Jiao, Jianbo},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025}
}