📖 Revisit the Open Nature of
Open Vocabulary Semantic Segmentation

The MIx Group, University of Birmingham
ICLR 2025
Example image

Category ambiguity in open vocabulary semantic segmentation. One object can be assigned multiple plausible labels, of which the human annotation is only one. For example, the area on the left marked with a yellow star was annotated as ‘‘plant’’ by humans but predicted as ‘‘flower’’ by the OVS model; the bottom region annotated as ‘‘seat’’ was predicted as ‘‘chair’’.

Abstract

In Open Vocabulary Semantic Segmentation (OVS), we observe a consistent drop in model performance as the query vocabulary set expands, especially when it includes semantically similar and ambiguous vocabularies, such as ‘sofa’ and ‘couch’. The previous OVS evaluation protocol, however, does not account for such ambiguity: any mismatch between model-predicted and human-annotated pairs is simply treated as incorrect on a pixel-wise basis. This contradicts the open nature of OVS, where ambiguous categories may all be correct from an open-world perspective.

To address this, in this work we study the open nature of OVS and propose a mask-wise evaluation protocol based on matched and mismatched mask pairs between predictions and annotations. Extensive experimental evaluations show that the proposed mask-wise protocol provides a more effective and reliable evaluation framework for OVS models than the previous pixel-wise approach from an open-world perspective. Moreover, analysis of mismatched mask pairs reveals that a large number of ambiguous categories exist in commonly used OVS datasets.

Interestingly, we find that reducing these ambiguities during both training and inference enhances the capabilities of OVS models. These findings and the new evaluation protocol encourage further exploration of the open nature of OVS, as well as broader open-world challenges.

The proposed mask-wise evaluation protocol

Example image

Our method differs from traditional pixel-wise evaluation by retaining multiple reasonable class predictions based on a threshold, rather than selecting only the highest-probability class per pixel. We introduce three types of match relationships for evaluation:



We further utilize out-matched pairs to construct an ambiguous vocabulary graph and perform community discovery analysis to better understand the model’s open-set prediction capabilities.
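As a rough sketch of the thresholding idea above (the function names, the threshold value, and the match-type labels beyond ‘‘out-matched’’ are our illustrative assumptions, not the paper's exact formulation), retaining every class whose score clears a threshold instead of only the argmax class could look like:

```python
def retain_candidates(class_probs, threshold=0.5):
    """Keep every class whose probability clears the threshold,
    ordered by probability, rather than only the argmax class."""
    order = sorted(range(len(class_probs)),
                   key=class_probs.__getitem__, reverse=True)
    keep = [c for c in order if class_probs[c] >= threshold]
    # Fall back to the top-1 class if nothing clears the threshold.
    return keep or [order[0]]


def match_type(pred_classes, gt_class):
    """Classify a predicted/annotated mask pair (illustrative labels)."""
    if pred_classes[0] == gt_class:
        return "matched"          # top-1 prediction agrees with annotation
    if gt_class in pred_classes:
        return "ambiguous-match"  # annotation is among retained candidates
    return "out-matched"          # annotation is not retained at all
```

For instance, with per-mask probabilities `[0.10, 0.60, 0.55, 0.05]`, classes 1 and 2 are both retained, so an annotation of class 2 counts as an ambiguous match rather than a plain error.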

Visualization of the ambiguous vocabulary graph

Example image

A community extracted from the COCO-Stuff171 dataset (showing only 50 classes). Example images from the same community, where images from the same vocabulary community exhibit visually similar semantics (best viewed in colour).

Using community discovery methods, we can partition the ambiguous vocabulary graph into communities. Each community represents a cluster of classes that are often confused with one another. For example, in an object detection or segmentation dataset, we might observe that the categories “sofa”, “couch”, and “armchair” form a tightly connected community, indicating that these classes are frequently confused by the model.


This suggests that the dataset contains ambiguous annotations, where these objects are not clearly distinguishable or where multiple terms are used interchangeably across regions and contexts. Visualising the labels within the same community, we find that they are visually very similar and hard to tell apart through subtle visual differences. From a human perspective, these labels would likely be classified as the same thing; their high similarity may indicate that they share core features or attributes.
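A minimal sketch of this partitioning step, using connected components as the simplest possible community structure (the edge list is made up for illustration, and real community discovery algorithms such as modularity-based methods would refine this further):

```python
from collections import defaultdict

# Toy ambiguity graph: an edge connects two vocabulary items that were
# out-matched against each other. Categories here are illustrative only.
edges = [
    ("sofa", "couch"), ("sofa", "armchair"), ("couch", "armchair"),
    ("plant", "flower"), ("plant", "tree"),
    ("seat", "chair"),
]

adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)


def communities(adj):
    """Partition the graph into connected components, the simplest
    stand-in for a community discovery method."""
    seen, comms = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comm = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in comm:
                continue
            comm.add(node)
            stack.extend(adj[node] - comm)
        seen |= comm
        comms.append(comm)
    return comms
```

On this toy graph the partition yields three clusters, e.g. {sofa, couch, armchair}, mirroring the kind of confusable-class community described above.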

Vocabulary co-occurring relationship

Example image

During the OVS inference stage, the segmentation performance for a particular object depends on the other vocabularies provided alongside it. For instance, the segmentation of ‘chair’ is most effective when related vocabularies such as ‘table’ and ‘floor’ are supplied at the same time.
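One simple way to surface such co-occurring vocabularies (a toy sketch over made-up per-image label sets, not the paper's actual analysis pipeline) is to count how often pairs of labels appear in the same image's annotations:

```python
from collections import Counter
from itertools import combinations

# Per-image annotation label sets (illustrative, not from a real dataset).
images = [
    {"chair", "table", "floor"},
    {"chair", "table", "wall"},
    {"chair", "floor", "lamp"},
    {"sofa", "floor", "wall"},
]

# Count how often each unordered pair of labels co-occurs in one image;
# frequently co-occurring labels are candidates to query together.
cooc = Counter()
for labels in images:
    for a, b in combinations(sorted(labels), 2):
        cooc[(a, b)] += 1

print(cooc[("chair", "table")])  # 'chair' and 'table' co-occur in 2 images
```

High co-occurrence counts suggest vocabulary pairs (such as ‘chair’ and ‘table’) that are worth providing jointly in the query set at inference time.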

Re-benchmarking Results

Table 1: Quantitative results of our proposed mask-wise evaluation protocol. The symbol ★ indicates using a joint-dataset vocabulary set during testing. NULL denotes that no out-matched mask is predicted. Results of the conventional argmax pixel-wise approach are shown in the first four rows, which report a single score per dataset (back↑ and err↓ do not apply).

| Method | Venue | PC59 front↑ | back↑ | err↓ | ADE150 front↑ | back↑ | err↓ | PC459 front↑ | back↑ | err↓ | ADE847 front↑ | back↑ | err↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SAN | CVPR'23 | 57.70 | – | – | 32.10 | – | – | 15.70 | – | – | 12.40 | – | – |
| CAT-Seg | CVPR'24 | 63.30 | – | – | 37.90 | – | – | 23.80 | – | – | 16.00 | – | – |
| SED | CVPR'24 | 60.90 | – | – | 35.30 | – | – | 22.10 | – | – | 13.70 | – | – |
| MAFT+ | ECCV'24 | 59.40 | – | – | 36.10 | – | – | 21.60 | – | – | 15.10 | – | – |
| SAN | CVPR'23 | 65.91 | 93.75 | 9.99 | 42.89 | 93.12 | 8.56 | 27.65 | 70.87 | 6.67 | 22.84 | 92.46 | 8.41 |
| CAT-Seg | CVPR'24 | 68.46 | 94.24 | NULL | 45.74 | 94.61 | 5.53 | 30.95 | 68.96 | 3.86 | 26.39 | 93.66 | 5.20 |
| SED | CVPR'24 | 66.29 | 94.21 | 6.43 | 44.90 | 93.50 | 5.20 | 31.41 | 70.72 | 4.93 | 26.99 | 92.61 | 5.07 |
| MAFT+ | ECCV'24 | 64.95 | 93.57 | 9.10 | 46.51 | 93.10 | 7.31 | 31.89 | 70.82 | 7.12 | 28.72 | 92.15 | 7.84 |

Poster

BibTeX

@inproceedings{huangrevisit,
  title={Revisit the open nature of open vocabulary segmentation},
  author={Huang, Qiming and Hu, Han and Jiao, Jianbo},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025}
}