DeCo-Adapter: Enhancing Zero-Shot Robustness via Decoupled Negative Semantic Suppression
Abstract:
Large-scale Vision-Language Models (VLMs) like Contrastive Language-Image Pre-training (CLIP) have demonstrated impressive zero-shot capabilities. However, adapting them to downstream tasks remains challenging, especially under domain shifts where visual features become unreliable. Existing training-free methods, such as Tip-Adapter, rely heavily on visual similarity, which often fails in out-of-distribution (OOD) scenarios. To address this, the Decoupled Correction Adapter (DeCo-Adapter), a robust adaptation framework that integrates a Decoupled Knowledge Stream into the visual baseline, is proposed. Specifically, a novel Negative Semantic Suppression mechanism is introduced, leveraging Large Language Models (LLMs) to generate and penalize distractor descriptions. This mechanism effectively corrects visual ambiguities without requiring any training. Extensive experiments on ImageNet-Sketch, ImageNet-V2, and ImageNet-A demonstrate that DeCo-Adapter consistently outperforms state-of-the-art methods. Notably, it achieves a top-1 accuracy of 54.11% on ImageNet-Sketch, surpassing the strong Tip-Adapter baseline by leveraging negative knowledge for error correction.
1. Introduction
Vision-Language Models (VLMs), represented by Contrastive Language-Image Pre-training (CLIP) [1] and A Large-scale Image and Noisy-text Embedding (ALIGN) [2], have revolutionized computer vision by aligning images and texts into a unified embedding space through large-scale pre-training. This alignment enables remarkable zero-shot recognition capabilities, allowing the model to identify unseen categories by simply computing the similarity between images and their corresponding text prompts without additional training. Such a paradigm provides a natural advantage for open-vocabulary recognition tasks.
To further adapt VLMs to downstream tasks, training-free adaptation methods have emerged, with Tip-Adapter [3] being a prominent example. It constructs a key-value “Visual Cache” from few-shot training data to refine predictions via feature matching. However, a critical “Visual Dependency Trap” is identified in this mechanism: it implicitly assumes that test data follows the same distribution as the cache (in-domain). When facing out-of-distribution (OOD) scenarios [4], such as domain shifts [5] or adversarial attacks [6], visual matching often fails or introduces severe noise. As observed in preliminary experiments, when adapting from sketch data to real photos, the performance gain of pure visual adaptation drops to zero, highlighting its fragility.
To overcome this limitation, incorporating external semantic knowledge is proposed. Unlike visual features which are highly sensitive to domain shifts, the essential semantic concepts of objects remain relatively stable across domains (e.g., a “cat” is semantically consistent in both sketches and photos). Large Language Models (LLMs) [7] are leveraged to generate rich textual knowledge to assist recognition. More importantly, it is argued that effective recognition requires not only identifying “what an object is” but also explicitly ruling out “what it is not”. Therefore, a “Negative Semantic Suppression” mechanism is introduced to exclude confusing distractors. This plays a crucial role in correcting predictions when visual features are ambiguous.
To address these challenges, we propose the Decoupled Correction Adapter (DeCo-Adapter) framework, a robust, training-free adaptation architecture that integrates a Decoupled Knowledge Stream with the visual baseline. Furthermore, a Hybrid Sharpening Strategy is designed to ensure optimal alignment between visual and textual modalities.
In summary, the main contributions of this work are threefold:
$\bullet$ We propose DeCo-Adapter, a robust training-free framework that systematically alleviates the “Visual Dependency Trap” by seamlessly integrating decoupled semantic knowledge with visual caches.
$\bullet$ We introduce a novel Negative Semantic Suppression mechanism. Unlike existing knowledge-enhanced methods (e.g., Customized Prompts via Language models (CuPL)) that rely solely on entangled positive descriptions, our method explicitly penalizes visually similar but semantically distinct distractors, significantly refining the decision boundaries.
$\bullet$ Extensive experiments demonstrate that DeCo-Adapter consistently improves upon strong baselines. Notably, it exhibits superior cross-domain resilience on ImageNet-V2 (+0.91%), proving its effectiveness in OOD scenarios where visual-only methods collapse.
2. Related Work
Recent years have witnessed the rise of VLMs trained on large-scale web data. These models utilize contrastive learning [8] to align multi-modal representations, extending to diverse architectures like Florence [9]. However, their performance heavily relies on the quality of prompts. To avoid manual design, Prompt Tuning methods like Context Optimization (CoOp) [10] replace hard prompts with learnable continuous vectors. Conditional Context Optimization (CoCoOp) [11] further improves this by conditioning prompts on visual features, while recent works explore deep multi-modal prompting strategies, such as Multimodal Prompt Learning (MaPLe) [12]. Despite their effectiveness, these methods require gradient-based training, which is computationally expensive and prone to overfitting.
To overcome the efficiency bottleneck of prompt tuning, training-free adaptation methods have gained significant attention. A prominent example is Tip-Adapter [3], which constructs a cache model to dynamically refine predictions. Similarly, CLIP-Adapter [13], DenseCLIP [14], Parameter Free Attention for CLIP (CALIP) [15], and SuSX [16] propose various parameter-free attention or feature blending mechanisms. While extremely efficient, they rely fundamentally on visual similarity. This dependency becomes a critical vulnerability under domain shifts [17], where visual features become unreliable. The proposed DeCo-Adapter addresses this fragility by supplementing visual features with domain invariant semantic knowledge.
Another line of research seeks to enhance VLMs by incorporating rich external knowledge. CuPL [18] and Visual Classification via Description [19] leverage LLMs, such as Generative Pre-trained Transformer 3 (GPT-3), to generate detailed visual descriptions, creating customized prompts that capture fine-grained semantics. Other methods incorporate external knowledge bases like WordNet (e.g., K-Lite [20]) or hierarchical label sets (e.g., Zero-Shot Image Classification with Hierarchical Label Sets (CHiLS) [21], Radenovic et al. [22]) to enrich concept representations. While effective, they universally focus on integrating positive attributes into a single, coupled representation. Such an entangled representation often introduces noise and struggles to discriminate fine-grained inter-class ambiguities. To overcome this limitation, the decoupled approach presented in this work separates semantic knowledge into distinct structural attributes and explicitly introduces a negative suppression mechanism, proving that subtracting incorrect semantics is critical for robust adaptation. Such robustness and adaptability have also been extensively explored in recent journal studies on test-time generalization [23] and video-language understanding [24], further highlighting the importance of refined prompt optimization.
3. Methodology
The proposed DeCo-Adapter aims to enhance the robustness of zero-shot recognition by integrating decoupled semantic knowledge. As illustrated in Figure 1, the method constructs three parallel information streams: (1) the zero-shot CLIP stream, (2) the visual cache stream, and (3) the decoupled knowledge stream. Specifically, the decoupled knowledge stream introduces a negative semantic suppression (NSS) mechanism that leverages negative knowledge from LLMs to explicitly penalize distracting categories. This process is conducted without any non-linear sharpening, thereby maintaining a broad suppression field to correct visual ambiguities. Finally, the prediction is obtained by adaptively fusing these streams without any parameter optimization.

A key-value visual cache is constructed utilizing a few-shot training set. Let $F_{test}$ denote the visual features of test images extracted by the vision encoder. The visual adaptation logits $L_{vis}$ are computed via a sharp attention mechanism:
\[ L_{vis} = \exp\!\left(-\beta_{vis}\left(1 - F_{test} K_{cache}^{\top}\right)\right) V_{cache} \tag{1} \]
where $K_{cache} \in \mathbb{R}^{C \times d}$ and $V_{cache} \in \mathbb{R}^{C \times C}$ represent the cached visual keys and one-hot labels for $C$ categories, respectively, and $F_{test} \in \mathbb{R}^{1 \times d}$ denotes the $d$-dimensional visual feature of the test image. The parameter $\beta_{vis}$ serves as a sharpening factor. In the experiments, a relaxed sharpening strategy ($\beta_{vis} = 1.0$) is employed for the sketch domain to accommodate visual variations.
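As a concrete illustration, the cache lookup described above can be sketched in a few lines of NumPy. This is a minimal sketch rather than the authors' implementation; the function and variable names are assumptions, and all features are assumed L2-normalized so that cosine similarity reduces to a plain inner product.

```python
import numpy as np

def visual_cache_logits(f_test, k_cache, v_cache, beta_vis=1.0):
    """Sketch of the visual cache stream.
    f_test:  (1, d) L2-normalized test image feature
    k_cache: (C, d) cached visual keys (one key per class here, for simplicity)
    v_cache: (C, C) one-hot labels of the cached samples
    """
    affinity = f_test @ k_cache.T                   # (1, C) cosine similarities
    weights = np.exp(-beta_vis * (1.0 - affinity))  # sharpened attention weights
    return weights @ v_cache                        # (1, C) class logits
```

With a small `beta_vis`, distant keys still contribute non-trivially, which is why the text calls this a "relaxed" sharpening strategy for the sketch domain.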
To mitigate the ambiguity of visual features in OOD scenarios, a decoupled knowledge injection mechanism is introduced. Unlike previous methods that utilize holistic sentence descriptions, semantic knowledge is decoupled into distinct attributes:
\[ \mathit{Shape}\ (W_{shape}),\ \mathit{Texture}\ (W_{text}),\ \text{and}\ \mathit{Negative}\ (W_{neg}). \]
Positive Enhancement with Sharpening:
For positive attributes (Shape and Texture), a non-linear sharpening function is applied to filter out noisy semantic matches. The positive knowledge score $L_{pos}$ is calculated as:
\[ L_{pos} = \exp\!\left(-\beta_{knw}\left(1 - F_{test} W_{shape}^{\top}\right)\right) + \exp\!\left(-\beta_{knw}\left(1 - F_{test} W_{text}^{\top}\right)\right) \tag{2} \]
A high sharpening factor $\beta_{knw}$ = 5.5 is set to ensure that only highly confident semantic matches contribute to the prediction, preventing noise accumulation.
Negative Semantic Suppression:
A core contribution of this work is the explicit introduction of negative knowledge. For negative descriptions (e.g., “it is not a dog”), a linear penalty mechanism without sharpening is employed:
\[ L_{neg} = F_{test} W_{neg}^{\top} \tag{3} \]
Keeping $L_{neg}$ linear ensures a broad suppression field, allowing the model to penalize any potential distractors even if the visual similarity is relatively low.
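The contrast between the two knowledge scores can be sketched as follows. This is a minimal NumPy sketch under the same normalization assumption as before; the function name and shapes are illustrative, not the authors' code. The positive attribute affinities pass through the exponential sharpening, while the negative affinity is deliberately left linear.

```python
import numpy as np

def knowledge_scores(f_test, w_shape, w_text, w_neg, beta_knw=5.5):
    """Sketch of the decoupled knowledge stream.
    f_test: (1, d) test feature; w_shape / w_text: (C, d) per-class attribute
    text embeddings; w_neg: (C, d) per-class distractor embeddings.
    All inputs are assumed L2-normalized.
    """
    # Positive enhancement: sharpening suppresses low-confidence matches,
    # so only strong semantic agreement contributes to the score.
    s_shape = np.exp(-beta_knw * (1.0 - f_test @ w_shape.T))
    s_text = np.exp(-beta_knw * (1.0 - f_test @ w_text.T))
    l_pos = s_shape + s_text                        # (1, C)
    # Negative suppression: kept linear so that even moderate affinity to a
    # distractor incurs a penalty (a broad suppression field).
    l_neg = f_test @ w_neg.T                        # (1, C)
    return l_pos, l_neg
```

Had the negative branch also been sharpened, only near-perfect distractor matches would be penalized, defeating the purpose of a broad suppression field.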
The final classification logits $L_{final}$ are derived by fusing the three streams:
\[ L_{final} = L_{clip} + \alpha_{vis} L_{vis} + \alpha_{pos} L_{pos} - \gamma_{neg} L_{neg} \tag{4} \]
where $L_{clip}$ denotes the zero-shot CLIP logits, and $\alpha_{vis}$, $\alpha_{pos}$, and $\gamma_{neg}$ are hyperparameters balancing the contribution of visual adaptation, positive enhancement, and negative suppression, respectively.
4. Experiments
DeCo-Adapter is evaluated on three challenging datasets to verify its robustness: ImageNet-Sketch (strong domain shift), ImageNet-A (adversarial examples), and ImageNet-V2 (cross-domain generalization). The model utilizes CLIP-RN50 as the backbone, and the visual cache is constructed using 16-shot samples from ImageNet-Sketch to simulate a realistic OOD adaptation scenario.
To ensure reproducibility, the experimental configurations are detailed. The pre-trained ResNet-50 version of CLIP is employed as both the vision and text encoder. For the visual cache construction, K = 16 images per class are randomly sampled from the ImageNet-Sketch dataset. To enhance the robustness of the visual keys, a 10-view augmentation strategy (e.g., random cropping and flipping) is applied during feature extraction, and the augmented features are averaged to represent each few-shot sample.
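The multi-view averaging step for cache keys can be sketched as follows. The helper name `build_cache_key` is hypothetical, and the sketch assumes CLIP features are used L2-normalized, so the averaged key is re-normalized before being stored.

```python
import numpy as np

def build_cache_key(view_features):
    """Average the features of the augmented views of one few-shot sample
    into a single cache key, then re-normalize to unit length.
    view_features: (n_views, d) array of L2-normalized view features.
    """
    key = np.mean(view_features, axis=0)
    return key / np.linalg.norm(key)
```

Averaging over random crops and flips smooths out view-specific noise, which is the stated purpose of the 10-view strategy.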
For the Decoupled Knowledge Stream, an LLM (GPT-4) is queried to construct the semantic knowledge base. To guarantee concise and structured representations, the prompt strictly restricts the output to a JSON format containing three distinct keys: “Shape” (keywords for geometric outlines or body parts), “Texture” (keywords for surface appearance, material, or colors), and exactly ONE “Negative” category (a visually similar but semantically distinct object that is easily confused with the target). For instance, given the category “clownfish”, the generated shape is “oval body, fins”, the texture is “shiny scales, orange and white stripes”, and the designated negative distractor is “goldfish”. This rigorous prompt constraint ensures that the suppression mechanism precisely targets the most highly confusing distractor without introducing irrelevant noise.
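A hypothetical LLM reply and its parsing might look like the sketch below. The reply text follows the JSON schema described above, but the exact wording and the prompt templates fed to the text encoder are assumptions, not the authors' prompts.

```python
import json

# Hypothetical raw LLM reply for the category "clownfish", matching the
# three-key JSON schema described in the text.
raw_reply = """{
  "Shape": "oval body, fins",
  "Texture": "shiny scales, orange and white stripes",
  "Negative": "goldfish"
}"""

entry = json.loads(raw_reply)

# Each field would later be encoded by the frozen CLIP text encoder to
# obtain W_shape, W_text, and W_neg for this class. The templates below
# are illustrative placeholders.
prompts = {
    "shape": f"a photo with {entry['Shape']}",
    "texture": f"a photo with {entry['Texture']}",
    "negative": f"a photo of a {entry['Negative']}",
}
```

Restricting the reply to strict JSON makes the parsing step trivially machine-checkable, which matters when generating entries for 1000 classes.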
To find the optimal fusion weights without training, a coarse-to-fine grid search is conducted. The search spaces are defined as follows:
$\alpha_{vis} \in$ {0.0, 0.5, 1.0, 2.0, 3.0, 4.0, 5.0} for the visual stream,
$\alpha_{pos} \in$ {0.0, 0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 0.8, 1.0} for the positive knowledge,
$\gamma_{neg} \in$ {0.0, 0.001, 0.005, 0.01, 0.05, 0.1} for the negative penalty.
The search process is extremely efficient as all visual and textual features are pre-calculated and cached in memory.
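The search over pre-computed logits can be sketched as a plain exhaustive loop. This is a minimal sketch with assumed names; the actual coarse-to-fine procedure would additionally refine the grid around the best coarse setting.

```python
import itertools
import numpy as np

def grid_search_weights(l_clip, l_vis, l_pos, l_neg, labels,
                        alphas_vis, alphas_pos, gammas_neg):
    """Exhaustive search over fusion weights. All per-sample logits have
    shape (N, C) and are assumed pre-computed and cached, so each candidate
    costs only a few array operations and an accuracy check.
    """
    best_acc, best_cfg = -1.0, None
    for a_v, a_p, g_n in itertools.product(alphas_vis, alphas_pos, gammas_neg):
        logits = l_clip + a_v * l_vis + a_p * l_pos - g_n * l_neg
        acc = float((logits.argmax(axis=1) == labels).mean())
        if acc > best_acc:
            best_acc, best_cfg = acc, (a_v, a_p, g_n)
    return best_acc, best_cfg
```

Because no encoder forward pass appears inside the loop, evaluating the full 7 × 9 × 6 grid amounts to a few hundred cheap array fusions.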
DeCo-Adapter is compared with the original zero-shot CLIP and the state-of-the-art Tip-Adapter-ZS across three benchmarks representing distinct adaptation scenarios. The overall performance is summarized in Table 1 and the robustness comparison is visualized in Figure 2.
| Method | I-Sketch | I-V2 | I-A |
|---|---|---|---|
| CLIP (zero-shot) | 35.50% | 53.25% | 10.50% |
| Tip-Adapter-ZS | 53.95% | 53.25% | 10.59% |
| DeCo-Adapter (Ours) | 54.11% | 54.16% | 10.61% |
| Improvement | +0.16% | +0.91% | +0.02% |

In-Domain Superiority (ImageNet-Sketch):
On ImageNet-Sketch, which shares the same domain as the visual cache, Tip-Adapter-ZS establishes a highly saturated baseline of 53.95%, improving upon zero-shot CLIP by over 18%. Despite encountering this performance bottleneck, DeCo-Adapter further pushes the limit to 54.11% (+0.16%). While the numerical gain appears marginal, it carries theoretical significance: in scenarios where visual features have already exhausted almost all available structural cues, pure visual adaptation reaches an asymptote. The decoupled negative suppression proves capable of rectifying hard, ambiguous boundary cases (e.g., highly confusing fine-grained categories) that visual similarity alone cannot resolve, thereby enhancing the absolute robustness of the decision boundaries.
Cross-Domain Resilience (ImageNet-V2):
The most significant advantage of the proposed method is observed on ImageNet-V2. Here, the model is tested on real photos using a cache constructed from sketches, creating a severe domain shift. As a result, the pure visual stream fails completely, yielding zero gain over CLIP (53.25%). In stark contrast, DeCo-Adapter achieves a remarkable improvement of +0.91% (reaching 54.16%), demonstrating superior generalization capability. This proves that while visual features are sensitive to domain changes, semantic knowledge (e.g., “a cat has ears”) is domain-invariant. DeCo-Adapter effectively leverages this property to serve as a robust stabilizer when visual adaptation collapses.
Adversarial Robustness (ImageNet-A):
On the challenging ImageNet-A dataset, which contains naturally adversarial examples explicitly designed to fool visual classifiers, the Tip-Adapter-ZS baseline shows extremely limited effectiveness (+0.09% over zero-shot CLIP). DeCo-Adapter adds a further +0.02%, reaching 10.61%. Consistent with the in-domain analysis, this indicates that explicit negative suppression helps the model resist severe, targeted visual perturbations. By anchoring predictions in high-level semantic reasoning rather than vulnerable visual textures, the method demonstrates enhanced stability under adversarial noise.
To dissect the specific contributions of the proposed components, a detailed ablation study is conducted on ImageNet-Sketch. The quantitative ablation results are shown in Table 2 and the corresponding trends are illustrated in Figure 3. As observed, adding positive knowledge alone (+Pos) yields negligible gains, while introducing negative suppression (+Neg) significantly boosts performance (+0.16%). The full model achieves the best accuracy, validating the synergy of the decoupled streams.
| Variant | Accuracy (%) | Gain |
|---|---|---|
| Tip-Adapter (Visual Only) | 53.95 | - |
| + Positive Only | 53.97 | +0.02 |
| + Negative Only | 54.11 | +0.16 |
| DeCo-Adapter (Full) | 54.11 | +0.16 |

The Limited Role of Positive Knowledge:
When only positive knowledge is added (Visual + Positive), the performance gain is marginal (+0.02%) or even negligible. This indicates that in a strong visual baseline, positive semantic descriptions (e.g., shape and texture attributes) are largely redundant with the visual features already captured. The model already “knows” what the object looks like, so reaffirming it provides little extra value.
The Critical Role of Negative Suppression:
However, when negative knowledge is introduced (Visual + Negative), a significant performance boost of +0.16% is observed. This finding is pivotal. It suggests that the primary bottleneck of current visual adapters is not the lack of positive features, but the inability to reject confusing distractors. By explicitly penalizing features that match negative descriptions (e.g., “this is not a tiger”), DeCo-Adapter effectively prunes the decision space.
Synergy in the Full Model:
Finally, the full model combines both streams to achieve the highest accuracy (54.11%). This confirms that while negative suppression is the primary driver of performance, positive enhancement still plays a supportive role, and the decoupled architecture successfully integrates them without conflict.
Superiority of Decoupling:
The benefit of decoupling positive knowledge is further investigated. In the cross-domain setting (ImageNet-V2), the decoupled positive stream is compared against a simulated CuPL baseline (coupled) in Table 3. The decoupled approach achieves 54.16%, outperforming the coupled baseline (53.68%) by 0.48%. This indicates that independently modeling shape and texture captures more transferable semantics than holistic descriptions.
| Method | Accuracy (%) | Gain |
|---|---|---|
| Tip-Adapter (Visual Only) | 53.25 | +0.00 |
| CuPL (Simulated, Coupled) | 53.68 | +0.43 |
| DeCo (Decoupled Pos) | 54.16 | +0.91 |
Impact of Domain-Specific Knowledge Quality:
It is investigated whether the quality and domain-relevance of the external knowledge impact the adaptation performance. In Table 4, the performance of DeCo-Adapter on ImageNet Sketch using two different knowledge bases is compared: a generic knowledge base (originally generated for standard ImageNet photos) and a domain-specific knowledge base (explicitly prompting the LLM to describe features typical of sketch drawings).
| Knowledge Source | Accuracy (%) |
|---|---|
| Tip-Adapter (No Knowledge) | 53.95 |
| DeCo (Generic Photo Knowledge) | 53.96 |
| DeCo (Sketch-Specific Knowledge) | 54.11 |
The results show that the generic knowledge base yields only a marginal gain over the baseline (53.96%), whereas switching to the domain-specific knowledge base pushes the accuracy to 54.11%. This indicates that while semantic concepts are largely domain-invariant, tailoring the textual descriptions to match the modality of the target domain (e.g., emphasizing outlines over colors for sketches) leads to more precise feature alignment, highlighting the importance of thoughtful prompt engineering in multi-modal fusion.
Hyperparameter Sensitivity Analysis:
To validate the stability of the proposed mechanism, a sensitivity analysis on the crucial negative penalty weight $\gamma_{neg}$ is conducted on ImageNet-Sketch. While keeping the other hyperparameters fixed, $\gamma_{neg}$ is varied across a predefined spectrum. As shown in Table 5, the performance remains at or above the visual-only baseline (53.95%) across a broad range of $\gamma_{neg}$. The accuracy peaks at $\gamma_{neg} = 0.05$ (54.11%) and only exhibits a slight degradation when the penalty becomes excessively aggressive (e.g., $\gamma_{neg} \geq 0.1$). This demonstrates that the negative semantic suppression is not overly sensitive to hyperparameter tuning and provides stable robustness improvements.
| $\gamma_{neg}$ | Accuracy (%) |
|---|---|
| 0.001 | 53.95 |
| 0.005 | 53.98 |
| 0.010 | 53.97 |
| 0.050 | 54.11 |
| 0.100 | 53.99 |
To intuitively understand why DeCo-Adapter succeeds where pure visual or holistic text adapters fail, the underlying mechanisms of the two core designs are discussed.
The Power of Negative Suppression (Boundary Shaping):
In domains with severe information dropout, such as sketches where color and fine-grained textures are missing, structural contours become highly ambiguous. Pure visual adapters often fall into geometric traps, assigning high confidences to visually similar but semantically distinct distractors. In such scenarios, injecting additional holistic positive descriptions provides negligible help, as visual features are already saturated with structural cues. Instead, Negative Semantic Suppression acts as a boundary-shaping regularizer. By explicitly evaluating the affinity between the ambiguous image features and negative distractors, DeCo-Adapter dynamically penalizes incorrect semantic clusters in the embedding space. This subtractive reasoning proves to be a highly efficient error correction mechanism, confirming that pushing the prediction away from known distractors is critical for fine-grained OOD recognition.
The Superiority of Positive Decoupling (Attribute Resilience):
Conversely, when evaluating cross-domain generalization, the visual stream collapses due to severe domain shift. Semantic knowledge becomes the sole reliable driver. Coupled methods employ holistic descriptions that entangle various object attributes (e.g., shape, texture, background). If a target image presents an atypical texture but a standard shape, a holistic text prompt might yield a low overall similarity score, leading to misclassification.
By decoupling the positive knowledge into independent “Shape” and “Texture” streams, these attributes are evaluated separately before late fusion. This structural unbinding ensures that as long as one essential attribute strongly aligns with the target, the model maintains high confidence. This decoupling strategy prevents the catastrophic score degradation seen in coupled prompts, making it significantly more robust against domain variations.
A pivotal advantage of the proposed DeCo-Adapter is its extreme computational efficiency. As a purely training-free framework, it requires no gradient backpropagation or parameter updates. More importantly, similar to the construction of the visual cache, all textual embeddings for the decoupled semantic knowledge ($W_{shape}$, $W_{text}$, and $W_{neg}$) are pre-calculated offline using the frozen text encoder and cached in memory prior to evaluation. During the inference phase, integrating the decoupled knowledge stream introduces only negligible overhead: a few low-dimensional vector inner products and scalar additions (as formulated in Eq. (4)). Consequently, the multi-stream fusion achieves $\mathcal{O}(1)$ additional time complexity with respect to the feature dimension. In practice, DeCo-Adapter maintains identical inference latency to the baseline Tip-Adapter while significantly bolstering semantic robustness.
5. Conclusions
In this paper, DeCo-Adapter, a robust training-free adaptation framework for VLMs, is proposed. By introducing a Decoupled Knowledge Stream with explicit Negative Semantic Suppression, the limitations of visual based methods in cross-domain and adversarial scenarios are addressed. Extensive experiments on ImageNet-Sketch, ImageNet-V2, and ImageNet-A demonstrate that the proposed method consistently outperforms state-of-the-art baselines. Specifically, it is revealed that while positive knowledge may be redundant in strong visual baselines, negative suppression serves as a critical error-correction mechanism. This work is expected to inspire future research into subtractive reasoning for multi-modal adaptation.
Conceptualization, Y.H.C. and P.J.Z.; methodology, Y.H.C.; software, Y.H.C.; validation, Y.H.C. and P.J.Z.; formal analysis, Y.H.C.; investigation, Y.H.C.; resources, P.J.Z.; data curation, Y.H.C.; writing—original draft preparation, Y.H.C.; writing—review and editing, Y.H.C. and P.J.Z.; visualization, Y.H.C.; supervision, P.J.Z.; project administration, P.J.Z.; funding acquisition, P.J.Z. All authors have read and agreed to the published version of the manuscript.
The data used to support the research findings are available from the corresponding author upon request.
The authors declare no conflict of interest.
