cs.CL

[1] Self-Calibrating Language Models via Test-Time Discriminative Distillation

Mohamed Rissal Hedna,Jan Strich,Martin Semmann,Chris Biemann

Main category: cs.CL

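TL;DR: This paper proposes SECL, which uses a better-calibrated discriminative signal inside large language models (for example, the probability of 'True' when the model is asked 'Is this answer correct?') as label-free self-supervision for lightweight test-time training. It substantially improves calibration without labeled data or human supervision.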

Details

Motivation: Large language models are systematically overconfident, and existing calibration methods depend on labeled data, degrade under distribution shift, or incur high inference cost. The model's internal discriminative signal (P(True)) has also been found to be more reliable than its verbalized confidence, with theoretical support.

Method: SECL (Self-Calibrating Language Models) is an unsupervised calibration framework based on test-time training (TTT). It uses the model's discriminative probability that its own answer is correct (the token probability of 'True' for 'Is this answer correct?') as a self-supervision signal, adapts the model only when the input distribution shifts, trains on just 6 to 26% of the question stream, and costs less than the baseline it distills from.

Result: Across four small language models from three model families and four domains, SECL reduces Expected Calibration Error (ECE) by 56 to 78%, outperforming its own supervision signal and matching or outperforming recent inference-time calibration methods. Seven ablations (signal quality, gating strategy, weight accumulation, and others) confirm that each component is crucial and robust.

Conclusion: SECL is the first method to apply test-time training to calibration. It achieves efficient, unsupervised, low-overhead, distribution-adaptive calibration and offers a new approach to improving LLM reliability.

Abstract: Large language models (LLMs) are systematically overconfident: they routinely express high certainty on questions they often answer incorrectly. Existing calibration methods either require labeled validation data, degrade under distribution shifts, or incur substantial inference costs. Recent work has shown that LLMs already contain a better-calibrated signal than the one they verbalize: the token probability of "True" when the model is asked "Is this answer correct?" ($P(\text{True})$) consistently outperforms their stated confidence, a gap that is theoretically grounded as generative error is lower-bounded by roughly twice the corresponding discriminative error. We introduce $\textbf{SECL}$ ($\textbf{SE}$lf-$\textbf{C}$alibrating $\textbf{L}$anguage Models), a test-time training (TTT) pipeline that exploits this gap as label-free self-supervision, requiring no labeled data or human supervision. SECL adapts only when the input distribution shifts, training on just 6--26% of the question stream at lower cost than the baseline it distills from. Across four small language models from three model families and four diverse domains, SECL reduces Expected Calibration Error (ECE) by 56--78%, outperforming its own supervision signal and matching or outperforming recent inference-time methods. SECL is the first method to apply TTT to calibration; seven ablations covering signal quality, gating strategy, weight accumulation, loss design, domain ordering, hyperparameter sensitivity, and layer selection confirm that each component is crucial and robust across configurations. Code: https://anonymous.4open.science/r/secl-emnlp26-submission-C890
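For readers unfamiliar with the metric reported above, the following is a minimal sketch of the standard equal-width binned Expected Calibration Error. It is not taken from the SECL code: the function name, the choice of 10 bins, and the toy inputs are assumptions for illustration. The confidences could be verbalized confidences or P(True) values.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumed bin count; the paper's setting is not stated here
) -> float:
    """Equal-width binned ECE: sum over bins of (bin weight) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        return 0.0

    # Assign each prediction to a bin by its confidence; 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # An overconfident toy model: 95% stated confidence, 50% actual accuracy.
    print(expected_calibration_error([0.95] * 4, [True, True, False, False]))
```

A lower value means stated confidence tracks empirical accuracy more closely, so a 56 to 78% reduction corresponds to shrinking this gap by that relative amount.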

[2] Toward Generalized Cross-Lingual Hateful Language Detection with Web-Scale Data and Ensemble LLM Annotations

Dang H. Dang,Jelena Mitrović,Michael Granitzer

Main category: cs.CL

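TL;DR: This paper investigates whether large-scale unlabelled web data and LLM-based synthetic annotations can improve multilingual hate speech detection. It finds that continued pre-training and synthetic labels from an LLM ensemble help most for small models and low-resource languages.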

Details

Motivation: High-quality annotated data for multilingual hate speech detection is scarce, especially for low-resource languages. The paper explores the potential of unlabelled data and LLM-generated synthetic annotations.

Method: (1) Continue masked language modelling pre-training of BERT on unlabelled text in four languages crawled via OpenWebSearch.eu; (2) use four open-source LLMs to produce synthetic annotations through three ensemble strategies (mean averaging, majority voting, and a LightGBM meta-learner), then fine-tune downstream models on those labels.

Result: Continued pre-training yields an average macro-F1 gain of about 3%, with larger gains in low-resource settings. Fine-tuning on synthetic labels from the LightGBM ensemble improves Llama3.2-1B by +11% pooled F1 but Qwen2.5-14B by only +0.6%.

Conclusion: The combination of web-scale unlabelled data and LLM-ensemble synthetic annotations is most valuable for small models and low-resource languages, while large models benefit little.

Abstract: We study whether large-scale unlabelled web data and LLM-based synthetic annotations can improve multilingual hate speech detection. Starting from texts crawled via OpenWebSearch.eu (OWS) in four languages (English, German, Spanish, Vietnamese), we pursue two complementary strategies. First, we apply continued pre-training to BERT models by continuing masked language modelling on unlabelled OWS texts before supervised fine-tuning, and show that this yields an average macro-F1 gain of approximately 3% over standard baselines across sixteen benchmarks, with stronger gains in low-resource settings. Second, we use four open-source LLMs (Mistral-7B, Llama3.1-8B, Gemma2-9B, Qwen2.5-14B) to produce synthetic annotations through three ensemble strategies: mean averaging, majority voting, and a LightGBM meta-learner. The LightGBM ensemble consistently outperforms the other strategies. Fine-tuning on these synthetic labels substantially benefits a small model (Llama3.2-1B: +11% pooled F1), but provides only a modest gain for the larger Qwen2.5-14B (+0.6%). Our results indicate that the combination of web-scale unlabelled data and LLM-ensemble annotations is the most valuable for smaller models and low-resource languages.
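The two simpler ensemble strategies named in the abstract can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: it assumes each LLM emits a probability that a text is hateful, uses a hypothetical 0.5 decision threshold, and resolves a 2-2 tie in majority voting as "not hateful". The LightGBM meta-learner, which learns how to combine the same per-model scores from labeled examples, is omitted.

```python
from typing import Sequence


def mean_average_label(probs: Sequence[float], threshold: float = 0.5) -> int:
    """Average the per-model probabilities, then threshold once."""
    return int(sum(probs) / len(probs) >= threshold)


def majority_vote_label(probs: Sequence[float], threshold: float = 0.5) -> int:
    """Threshold each model separately, then require a strict majority of positive votes."""
    votes = sum(1 for p in probs if p >= threshold)
    return int(votes * 2 > len(probs))  # ties count as the negative class (assumption)


if __name__ == "__main__":
    # Hypothetical scores from four annotator LLMs for one text.
    scores = [0.9, 0.4, 0.4, 0.4]
    print(mean_average_label(scores), majority_vote_label(scores))
```

The toy scores show why the strategies can disagree: one very confident model can pull the mean over the threshold even when most models vote negative.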

[3] HumorGen: Cognitive Synergy for Humor Generation in Large Language Models via Persona-Based Distillation

Edward Ajayi,Prasenjit Mitra

Main category: cs.CL

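TL;DR: This paper proposes a humor-generation framework grounded in cognitive theory (the Cognitive Synergy Framework). It uses multiple cognitive personas (e.g., The Absurdist, The Cynic) in a Mixture-of-Thought (MoT) setup to generate high-quality humor data, which is then used to fine-tune a 7B-parameter model. Experiments show that cognitive-driven data curation matters more for humor generation than the alignment algorithm or model scale.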

Details

Motivation: The standard training objective of large language models (predicting the most likely next word) fundamentally conflicts with the surprise and incongruity that humor requires, which makes it hard for them to generate humor effectively.

Method: The Cognitive Synergy Framework draws on psychological theories of humor, defines six cognitive personas, and uses a Mixture-of-Thought (MoT) approach to generate diverse humor data. The resulting dataset is used to fine-tune a 7B student model, and two alignment methods are compared: Direct Preference Optimization (DPO) and the newly proposed Offline Group Relative Policy Optimization (O-GRPO).

Result: The 7B model significantly outperforms larger instruction-tuned baselines and is competitive with state-of-the-art proprietary models. Cognitive-driven data curation proves more critical for humor generation than the alignment algorithm or model scale.

Conclusion: The key to humor generation is theory-driven, high-quality data curation rather than model scale or advanced alignment algorithms alone. Cognitive diversity is an effective route to better humor in LLMs.

Abstract: Humor generation poses a significant challenge for Large Language Models (LLMs), because their standard training objective (predicting the most likely next word) inherently conflicts with the surprise and incongruity needed for comedy. To bridge this gap, we introduce the Cognitive Synergy Framework, a theoretically grounded methodology for generating high-quality humor data inspired by psychological theories of humor. Utilizing a Mixture-of-Thought (MoT) approach, we deploy six cognitive personas (e.g., The Absurdist, The Cynic) to synthesize diverse comedic perspectives for a given prompt. This framework creates a theoretically grounded dataset, which we use to fine-tune a 7B-parameter student model. We compare Direct Preference Optimization (DPO) and a novel Offline Group Relative Policy Optimization (O-GRPO); our 7B model significantly outperforms larger instruction-tuned baselines and achieves performance competitive with state-of-the-art proprietary models. We find that cognitive-driven data curation is far more critical than alignment algorithms or model scale for humor generation. Code and data will be available upon publication.

[4] Generating High Quality Synthetic Data for Dutch Medical Conversations

Cecilia Kuan,Aditya Kamlesh Parikh,Henk van den Heuvel

Main category: cs.CL

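TL;DR: This paper presents a pipeline that uses a Dutch fine-tuned large language model to generate synthetic Dutch medical dialogues, easing the scarcity of real data caused by privacy constraints in clinical NLP. Quantitative and qualitative evaluation shows that the generated dialogues have good lexical variety but lack naturalness, indicating that domain knowledge and careful prompt design are needed to improve quality.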

Details

Motivation: Clinical conversation data is essential for improving clinical NLP models, but privacy and ethical constraints make it hard to obtain, so domain-specific datasets are scarce.

Method: A pipeline built on a Dutch fine-tuned large language model generates synthetic Dutch medical dialogues, using real medical conversations as linguistic and structural reference. The output is validated with quantitative metrics (such as lexical variety and turn-taking regularity) and qualitative review by native speakers and medical practitioners.

Result: Quantitative analysis shows high lexical variety but overly regular turn-taking, which suggests scripted conversation. Qualitative scores are slightly below average, with experts noting weaknesses in domain specificity and natural expression. Quantitative and qualitative results correlate only weakly, showing that numerical metrics alone cannot fully capture linguistic quality.

Conclusion: Generating high-quality synthetic Dutch medical dialogues is feasible, but it requires medical domain knowledge and refined prompt engineering to balance naturalness and structure. The approach offers a practical, ethically compliant way to expand Dutch clinical NLP resources.

Abstract: Medical conversations offer insights into clinical communication often absent from Electronic Health Records. However, developing reliable clinical Natural Language Processing (NLP) models is hampered by the scarcity of domain-specific datasets, as clinical data are typically inaccessible due to privacy and ethical constraints. To address these challenges, we present a pipeline for generating synthetic Dutch medical dialogues using a Dutch fine-tuned Large Language Model, with real medical conversations serving as linguistic and structural reference. The generated dialogues were evaluated through quantitative metrics and qualitative review by native speakers and medical practitioners. Quantitative analysis revealed strong lexical variety and overly regular turn-taking, suggesting scripted rather than natural conversation flow. Qualitative review produced slightly below-average scores, with raters noting issues in domain specificity and natural expression. The limited correlation between quantitative and qualitative results highlights that numerical metrics alone cannot fully capture linguistic quality. Our findings demonstrate that generating synthetic Dutch medical dialogues is feasible but requires domain knowledge and carefully structured prompting to balance naturalness and structure in conversation. This work provides a foundation for expanding Dutch clinical NLP resources through ethically generated synthetic data.

[5] GIANTS: Generative Insight Anticipation from Scientific Literature

Joy He-Yueya,Anikait Singh,Ge Gao,Michael Y. Li,Sherry Yang,Chelsea Finn,Emma Brunskill,Noah D. Goodman

Main category: cs.CL

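TL;DR: This paper proposes the 'insight anticipation' task, in which a model predicts a downstream paper's core insight from its parent papers, and builds GiantsBench, a cross-domain benchmark of 17k examples. On this basis, the authors train GIANTS-4B, a small open-source model optimized with RL, which clearly surpasses large proprietary models in insight-anticipation quality and is validated both by human evaluation and by the third-party model SciJudge-30B.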

Details

Motivation: Scientific breakthroughs often come from creatively synthesizing existing literature, yet the ability of current language models to perform targeted, literature-grounded synthesis has not been studied systematically.

Method: The paper proposes the generative 'insight anticipation' task and builds the multi-domain benchmark GiantsBench. An LM judge scores the similarity between generated and ground-truth insights, and these scores serve as the reward signal for training the small open-source model GIANTS-4B with reinforcement learning.

Result: GIANTS-4B achieves a 34% relative improvement in similarity score over gemini-3-pro and generalizes to unseen domains. Human evaluators find its insights more conceptually clear, and SciJudge-30B prefers insights generated by GIANTS-4B in 68% of pairwise comparisons.

Conclusion: Targeted literature synthesis can be modeled and optimized, and a small RL-tuned model can outperform large general-purpose models on scientific insight generation, opening a new path toward automated scientific discovery.

Abstract: Scientific breakthroughs often emerge from synthesizing prior ideas into novel contributions. While language models (LMs) show promise in scientific discovery, their ability to perform this targeted, literature-grounded synthesis remains underexplored. We introduce insight anticipation, a generation task in which a model predicts a downstream paper's core insight from its foundational parent papers. To evaluate this capability, we develop GiantsBench, a benchmark of 17k examples across eight scientific domains, where each example consists of a set of parent papers paired with the core insight of a downstream paper. We evaluate models using an LM judge that scores similarity between generated and ground-truth insights, and show that these similarity scores correlate with expert human ratings. Finally, we present GIANTS-4B, an LM trained via reinforcement learning (RL) to optimize insight anticipation using these similarity scores as a proxy reward. Despite its smaller open-source architecture, GIANTS-4B outperforms proprietary baselines and generalizes to unseen domains, achieving a 34% relative improvement in similarity score over gemini-3-pro. Human evaluations further show that GIANTS-4B produces insights that are more conceptually clear than those of the base model. In addition, SciJudge-30B, a third-party model trained to compare research abstracts by likely citation impact, predicts that insights generated by GIANTS-4B are more likely to lead to higher citations, preferring them over the base model in 68% of pairwise comparisons. We release our code, benchmark, and model to support future research in automated scientific discovery.

[6] Claim2Vec: Embedding Fact-Check Claims for Multilingual Similarity and Clustering

Rrubaa Panchendrarajan,Arkaitz Zubiaga

Main category: cs.CL

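TL;DR: This paper proposes Claim2Vec, the first semantic embedding model optimized specifically for clustering multilingual fact-check claims. It fine-tunes a multilingual encoder with contrastive learning, significantly improves clustering performance across multiple datasets and clustering algorithms, and shows evidence of cross-lingual knowledge transfer.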

Details

Motivation: Recurrent claims pose a major challenge for multilingual automated fact-checking systems, and the task of clustering claims so that one fact-check can resolve a whole group remains underexplored.

Method: Claim2Vec fine-tunes a multilingual encoder with contrastive learning, building the training objective from pairs of similar multilingual claims to improve how claims are represented in the embedding space.

Result: Experiments on three datasets, 14 multilingual embedding models, and 7 clustering algorithms show that Claim2Vec significantly improves clustering performance, in particular cluster label alignment and the geometric structure of the embedding space. Analysis of multilingual clusters confirms that fine-tuning brings cross-lingual knowledge transfer.

Conclusion: Claim2Vec is the first embedding model dedicated to clustering multilingual fact-check claims, and it effectively supports cross-lingual claim normalization and efficient fact-checking.

Abstract: Recurrent claims present a major challenge for automated fact-checking systems designed to combat misinformation, especially in multilingual settings. While tasks such as claim matching and fact-checked claim retrieval aim to address this problem by linking claim pairs, the broader challenge of effectively representing groups of similar claims that can be resolved with the same fact-check via claim clustering remains relatively underexplored. To address this gap, we introduce Claim2Vec, the first multilingual embedding model optimized to represent fact-check claims as vectors in an improved semantic embedding space. We fine-tune a multilingual encoder using contrastive learning with similar multilingual claim pairs. Experiments on the claim clustering task using three datasets, 14 multilingual embedding models, and 7 clustering algorithms demonstrate that Claim2Vec significantly improves clustering performance. Specifically, it enhances both cluster label alignment and the geometric structure of the embedding space across different cluster configurations. Our multilingual analysis shows that clusters containing multiple languages benefit from fine-tuning, demonstrating cross-lingual knowledge transfer.
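The abstract does not specify which contrastive objective is used. As one common possibility, the sketch below computes an in-batch InfoNCE-style loss over pairs of similar claims, treating every other positive in the batch as a negative. The function names, the temperature of 0.05, and the toy two-dimensional vectors are all assumptions for illustration; real embeddings would come from the multilingual encoder.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def info_nce_loss(
    anchors: Sequence[Vector],
    positives: Sequence[Vector],
    temperature: float = 0.05,  # assumed value
) -> float:
    """Mean cross-entropy of picking positives[i] for anchors[i] among all positives in the batch."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    claims = [[1.0, 0.0], [0.0, 1.0]]        # toy embeddings of two claims
    translations = [[1.0, 0.0], [0.0, 1.0]]  # toy embeddings of matching claims in another language
    print(info_nce_loss(claims, translations))        # aligned pairs: loss near 0
    print(info_nce_loss(claims, translations[::-1]))  # mismatched pairs: large loss
```

Minimizing a loss of this shape pulls matching claims together and pushes non-matching claims apart, which is the property that downstream clustering relies on.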

[7] Spoiler Alert: Narrative Forecasting as a Metric for Tension in LLM Storytelling

Peiqi Sui,Yutong Zhu,Tianyi Cheng,Peter West,Richard Jean So,Hoyt Long,Ari Holtzman

Main category: cs.CL

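TL;DR: This paper proposes 100-Endings, a new metric for narrative tension, and designs a tension-enhancing story-generation pipeline that markedly improves story quality while maintaining performance on the EQ-Bench benchmark.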

Details

Motivation: Existing evaluation methods such as EQ-Bench fail to detect the lack of narrative tension in LLM-generated stories, and even rank AI stories above New Yorker short stories.

Method: The 100-Endings metric walks through a story sentence by sentence, predicts the ending 100 times at each position, and records how often predictions fail to match the true ending as well as how frequently the resulting curve reverses direction. The generation pipeline applies narrative structural constraints: story template analysis, idea formulation, and narrative scaffolding.

Result: 100-Endings correctly separates New Yorker stories from LLM-generated stories. The new pipeline significantly increases narrative tension without hurting EQ-Bench scores.

Conclusion: Narrative tension is a key dimension of story quality, and structured generation grounded in narratological principles, combined with fine-grained evaluation, can effectively improve LLM creative writing.

Abstract: LLMs have so far failed both to generate consistently compelling stories and to recognize this failure: on the leading creative-writing benchmark (EQ-Bench), LLM judges rank zero-shot AI stories above New Yorker short stories, a gold standard for literary fiction. We argue that existing rubrics overlook a key dimension of compelling human stories: narrative tension. We introduce the 100-Endings metric, which walks through a story sentence by sentence: at each position, a model predicts how the story will end 100 times given only the text so far, and we measure tension as how often predictions fail to match the ground truth. Beyond the mismatch rate, the sentence-level curve yields complementary statistics, such as inflection rate, a geometric measure of how frequently the curve reverses direction, tracking twists and revelations. Unlike rubric-based judges, 100-Endings correctly ranks New Yorker stories far above LLM outputs. Grounded in narratological principles, we design a story-generation pipeline using structural constraints, including analysis of story templates, idea formulation, and narrative scaffolding. Our pipeline significantly increases narrative tension as measured by the 100-Endings metric, while maintaining performance on the EQ-Bench leaderboard.
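The two statistics described above can be sketched as follows, assuming the expensive parts (sampling 100 endings per sentence and judging whether each matches the true ending) have already produced a list of booleans per sentence position. The function names are hypothetical, and the exact inflection-rate formula here (direction reversals divided by the number of opportunities to reverse, ignoring flat steps) is an assumption; the paper may define it differently.

```python
from typing import List, Sequence


def mismatch_curve(matches_per_position: Sequence[Sequence[bool]]) -> List[float]:
    """For each sentence position, the fraction of sampled endings that did NOT match the true ending."""
    return [1 - sum(matches) / len(matches) for matches in matches_per_position]


def inflection_rate(curve: Sequence[float]) -> float:
    """Share of consecutive (non-flat) steps where the curve switches between rising and falling."""
    steps = [b - a for a, b in zip(curve, curve[1:])]
    steps = [s for s in steps if s != 0]  # flat segments carry no direction
    if len(steps) < 2:
        return 0.0
    reversals = sum(1 for s1, s2 in zip(steps, steps[1:]) if s1 * s2 < 0)
    return reversals / (len(steps) - 1)


if __name__ == "__main__":
    # Toy example with 4 sampled endings per position instead of 100.
    judged = [
        [True, False, False, False],
        [True, True, False, False],
    ]
    print(mismatch_curve(judged))
    print(inflection_rate([0.9, 0.5, 0.8, 0.2]))
```

A story whose ending becomes steadily more predictable yields a monotone curve and an inflection rate of 0, while twists and revelations show up as reversals.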

[8] Simulating Organized Group Behavior: New Framework, Benchmark, and Analysis

Xinkai Zou,Yiming Huang,Zhuohang Wu,Jian Sha,Nan Huang,Longfei Yun,Jingbo Shang,Letian Peng

Main category: cs.CL

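TL;DR: This paper proposes the Organized Group Behavior Simulation task, builds the GROVE benchmark, and designs a structured analytical framework with an adapter mechanism to improve the modeling and interpretability of group decisions and knowledge transfer across groups.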

Details

Motivation: Simulating the decisions of organized groups such as corporations is essential for understanding real-world dynamics and for applications such as market prediction, but the field lacks a unified task definition, benchmark, and interpretable modeling approach.

Method: The paper proposes a structured analytical framework that turns group decision-making into an interpretable, adaptive, and traceable behavioral model; designs a time-aware adapter and a group-aware transfer mechanism; introduces traceable evidence nodes grounded in historical events; and builds the GROVE benchmark (44 entities and 8,052 real-world context-decision pairs across 9 domains).

Result: The method significantly outperforms summarization- and retrieval-based baselines on GROVE. The time-aware adapter captures temporal behavioral drift within individual groups, and structured cross-group similarity supports knowledge transfer for data-scarce organizations.

Conclusion: Organized group behavior simulation is a feasible and promising research direction. Structured modeling and adapter mechanisms can balance performance, interpretability, and generalization, offering a new approach to group intelligence and social simulation.

Abstract: Simulating how organized groups (e.g., corporations) make decisions (e.g., responding to a competitor's move) is essential for understanding real-world dynamics and could benefit relevant applications (e.g., market prediction). In this paper, we formalize this problem as a concrete research platform for group behavior understanding, providing: (1) a task definition with benchmark and evaluation criteria, (2) a structured analytical framework with a corresponding algorithm, and (3) detailed temporal and cross-group analysis. Specifically, we propose Organized Group Behavior Simulation, a task that models organized groups as collective entities from a practical perspective: given a group facing a particular situation (e.g., AI Boom), predict the decision it would take. To support this task, we present GROVE (GRoup Organizational BehaVior Evaluation), a benchmark covering 44 entities with 8,052 real-world context-decision pairs collected from Wikipedia and TechCrunch across 9 domains, with an end-to-end evaluation protocol assessing consistency, initiative, scope, magnitude, and horizon. Beyond straightforward prompting pipelines, we propose a structured analytical framework that converts collective decision-making events into an interpretable, adaptive, and traceable behavioral model, achieving stronger performance than summarization- and retrieval-based baselines. It further introduces an adapter mechanism for time-aware evolution and group-aware transfer, and traceable evidence nodes grounding each decision rule in originating historical events. Our analysis reveals temporal behavioral drift within individual groups, which the time-aware adapter effectively captures for stronger prediction, and structured cross-group similarity that enables knowledge transfer for data-scarce organizations.

[9] Should We be Pedantic About Reasoning Errors in Machine Translation?

Calvin Bao,Marine Carpuat

Main category: cs.CL

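TL;DR: This paper studies reasoning errors in machine translation. It proposes an automated annotation protocol that identifies three categories of reasoning error and corrects reasoning traces with several interventions (such as hedging and removal). Experiments show that strong interventions raise the error resolution rate but improve translation quality only slightly, and that the precision of error identification varies by language. Overall, the results point to low reasoning faithfulness in current MT systems.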

Details

Motivation: To examine how often reasoning errors occur in machine translation (MT) and what types they take, and to assess whether correcting them improves translation quality.

Method: An automated annotation protocol identifies three categories of reasoning error (source sentence-misaligned, model hypothesis-misaligned, and reasoning trace-misaligned) and is tested across multiple language pairs. Five weak-to-strong interventions (hedging, removal, re-reasoning after removal, hindsight, and oracle) are applied to correct the reasoning traces, and their effects are evaluated.

Result: Strong interventions substantially raise the resolution rate for reasoning errors, but gains in translation quality are inconsistent. Reasoning errors are identified with high precision in Urdu and lower precision in Spanish. Removing reasoning errors does not significantly fix the original translation errors.

Conclusion: Reasoning errors in MT can be identified, but correcting the reasoning trace does not effectively fix the translation, which indicates that current MT systems have limited reasoning faithfulness.

Abstract: Across multiple language pairings (English $\to$ \{Spanish, French, German, Mandarin, Japanese, Urdu, Cantonese\}), we find reasoning errors in translation. To quantify how often these reasoning errors occur, we leverage an automated annotation protocol for reasoning evaluation wherein the goal is to detect if a reasoning step is any of three error categories: (1) source sentence-misaligned, (2) model hypothesis-misaligned, or (3) reasoning trace-misaligned. We probe the reasoning model with perturbed traces correcting for these identified reasoning errors using an array of weak-to-strong interventions: hedging, removal, re-reasoning after removal, hindsight, and oracle interventions. Experimenting with interventions on the reasoning traces suggests that small corrections to the reasoning have little impact on translation quality, but stronger interventions yield the highest resolution rates, despite translation quality gains being mixed. We find ultimately that reasoning errors in MT can be identified with high precision in Urdu but lower precision in Spanish, but that removing these reasoning errors does not resolve the initial errors significantly, suggesting limited reasoning faithfulness for machine translation.

[10] Human vs. Machine Deception: Distinguishing AI-Generated and Human-Written Fake News Using Ensemble Learning

Samuel Jaeger,Calvin Ibeneye,Aya Vera-Jimenez,Dhrubajyoti Ghosh

Main category: cs.CL

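TL;DR: This study analyzes linguistic, structural, and emotional differences between human-written and AI-generated fake news and applies several machine learning models and an ensemble method to classify them. The results show that the two can be reliably distinguished from stylistic and structural features of the text.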

Details

Motivation: The rapid adoption of large language models means AI-generated fake news now coexists with traditional human-written misinformation, so it is important to clarify how the two differ and to find reliable ways to tell them apart.

Method: A document-level feature representation is built from sentence structure, lexical diversity, punctuation patterns, readability indices, and six emotion dimensions (fear, anger, joy, sadness, trust, anticipation). Logistic regression, random forest, support vector machines, XGBoost, and a neural network are applied, together with an ensemble framework that aggregates their predictions. Performance is measured by accuracy and AUC.

Result: All models show strong and stable classification performance. Readability features are the most discriminative, AI-generated text exhibits more uniform stylistic patterns, and ensemble learning gives small but consistent gains over individual models.

Conclusion: Stylistic and structural properties of text provide a robust basis for distinguishing AI-generated from human-written fake news.

Abstract: The rapid adoption of large language models has introduced a new class of AI-generated fake news that coexists with traditional human-written misinformation, raising important questions about how these two forms of deceptive content differ and how reliably they can be distinguished. This study examines linguistic, structural, and emotional differences between human-written and AI-generated fake news and evaluates machine learning and ensemble-based methods for distinguishing these content types. A document-level feature representation is constructed using sentence structure, lexical diversity, punctuation patterns, readability indices, and emotion-based features capturing affective dimensions such as fear, anger, joy, sadness, trust, and anticipation. Multiple classification models, including logistic regression, random forest, support vector machines, extreme gradient boosting, and a neural network, are applied alongside an ensemble framework that aggregates predictions across models. Model performance is assessed using accuracy and area under the receiver operating characteristic curve. The results show strong and consistent classification performance, with readability-based features emerging as the most informative predictors and AI-generated text exhibiting more uniform stylistic patterns. Ensemble learning provides modest but consistent improvements over individual models. These findings indicate that stylistic and structural properties of text provide a robust basis for distinguishing AI-generated misinformation from human-written fake news.

[11] Weird Generalization is Weirdly Brittle

Miriam Wanner,Hannah Collison,William Jurayj,Benjamin Van Durme,Mark Dredze,William Walden

Main category: cs.CL

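TL;DR: Through an extended replication study, this paper verifies that 'weird generalization' (a model fine-tuned on narrow-domain data developing unexpected undesirable behavior across domains) does exist but is extremely brittle: it appears only for specific models and datasets and can be effectively mitigated by simple training-time or prompt-based interventions.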

Details

Motivation: Prior work treats weird generalization as a critical safety risk, but its prevalence, robustness, and mitigability remain unclear and need systematic verification.

Method: An extended replication study covers a broader set of models and datasets and evaluates how well several training-time and prompt-based interventions (both context-providing and generic) suppress weird generalization.

Result: Weird generalization is confirmed to exist and to be dangerous under specific conditions, but it depends heavily on the model and dataset. It is extremely brittle: simple interventions, especially prompts that supply context making the behavior expected, eliminate it effectively, and even generic, untargeted interventions mitigate it substantially.

Conclusion: Weird generalization is real but not a robust threat. It is highly controllable and can be managed with low-cost, easily deployed methods such as prompt engineering, which calls for a reassessment of its practical safety risk.

Abstract: Weird generalization is a phenomenon in which models fine-tuned on data from a narrow domain (e.g. insecure code) develop surprising traits that manifest even outside that domain (e.g. broad misalignment), a phenomenon that prior work has highlighted as a critical safety concern. Here, we present an extended replication study of key weird generalization results across an expanded suite of models and datasets. We confirm that surprising (and dangerous) traits can emerge under certain circumstances, but we find that weird generalization is exceptionally brittle: it emerges only for specific models on specific datasets, and it vanishes under simple training-time, prompt-based interventions. We find that the most effective interventions provide prompt context that makes the generalized behavior the expected behavior. However, we show that even very generic interventions that do not anticipate specific generalized traits can still be effective in mitigating weird generalization's effects. Our findings thus help clarify the nature of the safety threat that weird generalization poses and point toward an easily implemented set of solutions.

[12] CoSToM: Causal-oriented Steering for Intrinsic Theory-of-Mind Alignment in Large Language Models

Mengfan Li,Xuanhua Shi,Yang Deng

Main category: cs.CL

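TL;DR: This paper proposes the CoSToM framework, which uses causal tracing to locate the layers that encode ToM-related features and applies targeted activation steering there, improving the generalization of large language models on social reasoning tasks and the quality of their dialogue.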

Details

Motivation: Existing large language models perform well on standard ToM benchmarks but generalize poorly to complex task scenarios and rely heavily on prompt engineering. The critical misalignment between their internal knowledge and external behavior raises doubts about whether they possess intrinsic Theory of Mind.

Method: Causal tracing locates the key internal layers that encode ToM semantics, and a lightweight alignment framework based on targeted activation steering (CoSToM) is built on those layers.

Result: Experiments show that CoSToM significantly improves human-like social reasoning and downstream dialogue quality.

Conclusion: By actively intervening rather than only interpreting model mechanisms, CoSToM shows that causal-oriented steering of key internal representations can strengthen the intrinsic ToM abilities of LLMs and the consistency of their behavior.

Abstract: Theory of Mind (ToM), the ability to attribute mental states to others, is a hallmark of social intelligence. While large language models (LLMs) demonstrate promising performance on standard ToM benchmarks, we observe that they often fail to generalize to complex task-specific scenarios, relying heavily on prompt scaffolding to mimic reasoning. The critical misalignment between the internal knowledge and external behavior raises a fundamental question: Do LLMs truly possess intrinsic cognition, and can they externalize this internal knowledge into stable, high-quality behaviors? To answer this, we introduce CoSToM (Causal-oriented Steering for ToM alignment), a framework that transitions from mechanistic interpretation to active intervention. First, we employ causal tracing to map the internal distribution of ToM features, empirically uncovering the internal layers' characteristics in encoding fundamental ToM semantics. Building on this insight, we implement a lightweight alignment framework via targeted activation steering within these ToM-critical layers. Experiments demonstrate that CoSToM significantly enhances human-like social reasoning capabilities and downstream dialogue quality.

[13] Computational Implementation of a Model of Category-Theoretic Metaphor Comprehension

Fumitaka Iwaki,Miho Fuyama,Hayato Saigo,Tatsuji Takahashi

Main category: cs.CL

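TL;DR: This paper presents a computational model of metaphor comprehension based on the theory of indeterminate natural transformation (TINT) proposed by Fuyama et al., simplifying the algorithms and validating them experimentally so that they follow the original theory more closely. The improved algorithm outperforms existing ones on three measures: data fitting, systematicity, and novelty.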

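Details

Motivation: To bring the computational model of metaphor comprehension closer to the theory of indeterminate natural transformation (TINT) proposed by Fuyama et al., and to improve its explanatory power and performance.

Method: The algorithms implementing the TINT-based model of metaphor comprehension are developed and simplified, then verified through data fitting and simulations. Three measures are used: fit to experimental data, the systematicity of the comprehension result, and the novelty of the comprehension (the correspondence between the associative structures of the source and target domains).

Result: The improved algorithm outperforms existing algorithms on all three measures: data fitting, systematicity, and novelty.

Conclusion: A simplified computational implementation that stays faithful to TINT is feasible and effective, and it significantly improves the overall performance of the metaphor comprehension model.

Abstract: In this study, we developed a computational implementation for a model of metaphor comprehension based on the theory of indeterminate natural transformation (TINT) proposed by Fuyama et al. We simplified the algorithms implementing the model to be closer to the original theory and verified it through data fitting and simulations. The outputs of the algorithms are evaluated with three measures: data-fitting with experimental data, the systematicity of the metaphor comprehension result, and the novelty of the comprehension (i.e. the correspondence of the associative structure of the source and target of the metaphor). The improved algorithm outperformed the existing ones in all three measures.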

[14] Linguistic Accommodation Between Neurodivergent Communities on Reddit: A Communication Accommodation Theory Analysis of ADHD and Autism Groups

Saad Mankarious,Nour Zein,Iyad Ait Hou,Aya Zirikly

Main category: cs.CL

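TL;DR: Grounded in Communication Accommodation Theory, this study analyzes how ADHD and autism communities adjust their language when interacting across communities on Reddit. It finds convergent linguistic accommodation and gives initial evidence that situational audience adaptation and longer-term identity construction may involve different mechanisms.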

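Details

Motivation: Existing social media research on mental health mostly focuses on detecting and diagnosing conditions at the individual level. This paper turns to intergroup interaction and examines linguistic accommodation when neurodivergent communities (ADHD and autism) communicate across community boundaries online.

Method: Grounded in Communication Accommodation Theory (CAT), the study uses the LIWC lexicon to characterize the linguistic profile of each Reddit community and analyzes how language changes when users post across communities. Topic-independent summary variables (Authentic, Clout) are used to test non-topical explanations, supplemented by an exploratory longitudinal analysis around the moment of public diagnosis disclosure.

Result: The two groups have clearly different baseline linguistic profiles. When users post across communities, features shift in opposite directions, consistent with convergent accommodation. Topic-independent variables shift as well, which weakens a purely topical explanation. Language changes triggered by diagnosis disclosure are small and sometimes in the opposite direction, suggesting a mechanism different from cross-community accommodation.

Conclusion: Neurodivergent communities show deliberate, context-driven linguistic accommodation in online interaction. This cross-group convergence reflects a dynamic, audience-oriented communication strategy rather than pure identity expression or topic switching, with implications for community moderation and clinical understanding.

Abstract: Social media research on mental health has focused predominantly on detecting and diagnosing conditions at the individual level. In this work, we shift attention to intergroup behavior, examining how two prominent neurodivergent communities, ADHD and autism, adjust their language when engaging with each other on Reddit. Grounded in Communication Accommodation Theory (CAT), we first establish that each community maintains a distinct linguistic profile as measured by the Linguistic Inquiry and Word Count (LIWC) lexicon. We then show that these profiles shift in opposite directions when users cross community boundaries: features that are elevated in one group's home community decrease when its members post in the other group's space, and vice versa, consistent with convergent accommodation. The involvement of topic-independent summary variables (Authentic, Clout) in these shifts provides partial evidence against a purely topical explanation. Finally, in an exploratory longitudinal analysis around the moment of public diagnosis disclosure, we find that its effects on linguistic style are small and, in some cases, directionally opposite to cross-community accommodation, providing initial evidence that situational audience adaptation and longer-term identity processes may involve different mechanisms. Our findings contribute to understanding intergroup communication dynamics among neurodivergent populations online and carry implications for community moderation and clinical perspectives on these conditions.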

[15] ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models

Chi-Yuan Hsiao,Ke-Han Lu,Yu-Kuan Fu,Guan-Ting Lin,Hsiao-Tsung Hung,Hung-yi Lee

Main category: cs.CL

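TL;DR: ASPIRin is a reinforcement learning framework designed for end-to-end full-duplex speech language models. By decoupling 'when to speak' from 'what to say', it improves interactivity while avoiding semantic degradation and repetitive generation.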

Details

Motivation: Standard reinforcement learning over raw tokens damages semantic quality when it is used to optimize the temporal dynamics of full-duplex SLMs, causing generative collapse and severe repetition.

Method: The ASPIRin framework combines (1) Action Space Projection, which maps the text vocabulary to a binary state (speech vs. silence), and (2) Group Relative Policy Optimization (GRPO) with rule-based rewards that balance user interruption against response latency.

Result: ASPIRin clearly improves interactive abilities such as turn-taking, backchanneling, and pause handling. Compared with standard GRPO, it reduces the portion of duplicate n-grams by more than 50%, effectively eliminating degenerative repetition.

Conclusion: Decoupling speech timing from content generation substantially improves real-time interaction while preserving semantic coherence, offering a new approach for full-duplex spoken dialogue systems.

Abstract: End-to-end full-duplex Speech Language Models (SLMs) require precise turn-taking for natural interaction. However, optimizing temporal dynamics via standard raw-token reinforcement learning (RL) degrades semantic quality, causing severe generative collapse and repetition. We propose ASPIRin, an interactivity-optimized RL framework that explicitly decouples when to speak from what to say. Using Action Space Projection, ASPIRin maps the text vocabulary into a coarse-grained binary state (active speech vs. inactive silence). By applying Group Relative Policy Optimization (GRPO) with rule-based rewards, it balances user interruption and response latency. Empirical evaluations show ASPIRin optimizes interactivity across turn-taking, backchanneling, and pause handling. Crucially, isolating timing from token selection preserves semantic coherence and reduces the portion of duplicate n-grams by over 50% compared to standard GRPO, effectively eliminating degenerative repetition.
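Two ideas from the abstract lend themselves to a small sketch: projecting a token sequence onto a binary speak/silence state, and measuring the portion of duplicate n-grams. Both functions below are illustrative assumptions rather than the ASPIRin implementation: the set of silence token ids is hypothetical, and the duplicate n-gram portion is computed as one minus the ratio of unique to total n-grams, which is one common way to define it.

```python
from typing import Hashable, List, Sequence, Set


def project_to_binary(token_ids: Sequence[int], silence_ids: Set[int]) -> List[int]:
    """Collapse the vocabulary to two actions: 0 = inactive silence, 1 = active speech."""
    return [0 if token in silence_ids else 1 for token in token_ids]


def duplicate_ngram_portion(tokens: Sequence[Hashable], n: int = 3) -> float:
    """Fraction of n-grams that repeat an n-gram already seen in the same sequence."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1 - len(set(ngrams)) / len(ngrams)


if __name__ == "__main__":
    # Hypothetical stream where token id 0 stands for a silence/pad token.
    print(project_to_binary([5, 0, 0, 7], silence_ids={0}))
    print(duplicate_ngram_portion("a b a b a b".split(), n=2))
```

Because a reward defined on the projected sequence depends only on timing, it cannot push the policy toward particular words, which is the intuition behind preserving semantic quality.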

[16] Reason Only When Needed: Efficient Generative Reward Modeling via Model-Internal Uncertainty

Chao Xue,Yao Wang,Mengqiao Liu,Di Liang,Xingsheng Han,Peiyang Liu,Xianjie Wu,Chenyao Lu,Lei Jiang,Yu Lu,Haibo Shi,Shuang Liang,Minlong Peng,Flora D. Salim

Main category: cs.CL

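TL;DR: This paper proposes the E-GRM framework, which uses model-internal uncertainty to trigger chain-of-thought (CoT) reasoning adaptively and introduces a lightweight discriminative scorer to improve reward-modeling precision, reducing inference cost while improving answer accuracy.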

Details

Motivation: Existing generative reward models (GRMs) apply chain-of-thought (CoT) prompting indiscriminately, wasting computation on simple tasks, and they rely on voting to evaluate CoT outputs, which lacks granularity and precision.

Method: E-GRM is grounded in model-internal uncertainty: it estimates uncertainty from the convergence behavior of parallel generations and triggers CoT selectively. It also adds a lightweight discriminative scorer trained with a hybrid regression-ranking objective for fine-grained evaluation of reasoning paths.

Result: Experiments on multiple reasoning benchmarks show that E-GRM substantially reduces inference cost while consistently improving answer accuracy.

Conclusion: Model-internal uncertainty is an effective and general signal for efficient, reasoning-aware reward modeling.

Abstract: Recent advancements in the Generative Reward Model (GRM) have demonstrated its potential to enhance the reasoning abilities of LLMs through Chain-of-Thought (CoT) prompting. Despite these gains, existing implementations of GRM suffer from two critical limitations. First, CoT prompting is applied indiscriminately to all inputs regardless of their inherent complexity. This introduces unnecessary computational costs for tasks amenable to fast, direct inference. Second, existing approaches primarily rely on voting-based mechanisms to evaluate CoT outputs, which often lack granularity and precision in assessing reasoning quality. In this paper, we propose E-GRM, an efficient generative reward modeling framework grounded in model-internal uncertainty. E-GRM leverages the convergence behavior of parallel model generations to estimate uncertainty and selectively trigger CoT reasoning only when needed, without relying on handcrafted features or task-dependent signals. To improve reward fidelity, we introduce a lightweight discriminative scorer trained with a hybrid regression--ranking objective to provide fine-grained evaluation of reasoning paths. Experiments on multiple reasoning benchmarks show that E-GRM substantially reduces inference cost while consistently improving answer accuracy, demonstrating that model-internal uncertainty is an effective and general signal for efficient reasoning-aware reward modeling.
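The gating idea (reason only when parallel generations disagree) can be illustrated with a simple agreement score. The abstract does not say how convergence is measured, so the majority-agreement ratio, the 0.75 threshold, and the function names below are assumptions chosen for illustration, not E-GRM's actual estimator.

```python
from collections import Counter
from typing import Hashable, Sequence


def agreement(answers: Sequence[Hashable]) -> float:
    """Share of parallel generations that agree with the most common answer."""
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)


def needs_cot(answers: Sequence[Hashable], min_agreement: float = 0.75) -> bool:
    """Trigger chain-of-thought only when the direct generations fail to converge."""
    return agreement(answers) < min_agreement


if __name__ == "__main__":
    # Hypothetical direct verdicts from four parallel generations.
    print(needs_cot(["A", "A", "A", "B"]))  # mostly converged: answer directly
    print(needs_cot(["A", "B", "C", "A"]))  # scattered: fall back to CoT reasoning
```

Under a scheme like this, easy inputs are judged with a few cheap direct generations, and the more expensive reasoning path is reserved for inputs where the model's own samples conflict.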

[17] Why Supervised Fine-Tuning Fails to Learn: A Systematic Study of Incomplete Learning in Large Language Models

Chao Xue,Yao Wang,Mengqiao Liu,Di Liang,Xingsheng Han,Peiyang Liu,Xianjie Wu,Chenyao Lu,Lei Jiang,Yu Lu,Haibo Shi,Shuang Liang,Minlong Peng,Flora D. Salim

Main category: cs.CL

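TL;DR: This paper presents the first systematic study of the Incomplete Learning Phenomenon (ILP) in supervised fine-tuning of large language models, in which a model still fails to correctly reproduce some training samples after convergence; it identifies five categories of causes and proposes a diagnostic-first framework with targeted mitigation strategies.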

Details Motivation: After supervised fine-tuning (SFT), models still fail to reproduce part of their training data. This overlooked Incomplete Learning Phenomenon (ILP) has not been studied systematically, and it distorts assessments of what fine-tuning actually achieves. Method: The paper formalizes ILP through empirical analysis across models, domains, and datasets; designs controlled experiments that identify five categories of causes; and builds a diagnostic framework based on training and inference signals, validating mitigation strategies as causal interventions. Result: ILP is widespread and heterogeneous in its causes across Qwen, LLaMA, and OLMo2; gains in aggregate metrics can mask persistently unlearned subsets; and the proposed diagnostic framework attributes failures to causes and guides mitigation. Conclusion: ILP is a common and important problem in SFT. Understanding what fails to be learned, and why, requires fine-grained diagnosis, which in turn supports more robust and interpretable fine-tuning. Abstract: Supervised Fine-Tuning (SFT) is the standard approach for adapting large language models (LLMs) to downstream tasks. However, we observe a persistent failure mode: even after convergence, models often fail to correctly reproduce a subset of their own supervised training data. We refer to this behavior as the Incomplete Learning Phenomenon (ILP). This paper presents the first systematic study of ILP in LLM fine-tuning. We formalize ILP as post-training failure to internalize supervised instances and demonstrate its prevalence across multiple model families, domains, and datasets. Through controlled analyses, we identify five recurrent sources of incomplete learning: (1) missing prerequisite knowledge in the pre-trained model, (2) conflicts between SFT supervision and pre-training knowledge, (3) internal inconsistencies within SFT data, (4) left-side forgetting during sequential fine-tuning, and (5) insufficient optimization for rare or complex patterns. We introduce a diagnostic-first framework that maps unlearned samples to these causes using observable training and inference signals, and study several targeted mitigation strategies as causal interventions. Experiments on Qwen, LLaMA, and OLMo2 show that incomplete learning is widespread and heterogeneous, and that improvements in aggregate metrics can mask persistent unlearned subsets. The findings highlight the need for fine-grained diagnosis of what supervised fine-tuning fails to learn, and why.

[18] SEPTQ: A Simple and Effective Post-Training Quantization Paradigm for Large Language Models

Han Liu,Haotian Gao,Xiaotong Zhang,Changya Li,Feng Zhang,Wei Wang,Fenglong Ma,Hong Yu

Main category: cs.CL

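TL;DR: This paper proposes SEPTQ, a simple and efficient post-training quantization method for LLMs that uses static global importance scoring and mask-guided column-wise quantization to significantly improve performance in low-bit settings.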

Details Motivation: Existing post-training quantization (PTQ) methods degrade noticeably under low-bit quantization and are computationally complex, while quantization-aware training (QAT) is too costly to be practical for large language models. Method: SEPTQ first computes an importance score for every element of the weight matrix and determines the quantization locations in a static global manner; it then uses the importance mask to quantize and update weights column by column until the quantized weight matrix is obtained. Result: Experiments on multiple datasets and on models ranging from millions to billions of parameters show that SEPTQ significantly outperforms strong baselines at various bit widths, especially low ones. Conclusion: SEPTQ is a lightweight PTQ paradigm that balances effectiveness and efficiency, reduces the PTQ procedure to two steps, and offers a practical option for deploying LLMs on resource-constrained devices. Abstract: Large language models (LLMs) have shown remarkable performance in various domains, but they are constrained by massive computational and storage costs. Quantization, an effective technique for compressing models to fit resource-limited devices while preserving generative quality, encompasses two primary methods: quantization-aware training (QAT) and post-training quantization (PTQ). QAT involves additional retraining or fine-tuning, which inevitably incurs a high training cost and makes it unsuitable for LLMs. Consequently, PTQ has become the research hotspot in recent quantization methods. However, existing PTQ methods usually rely on various complex computation procedures and suffer from considerable performance degradation under low-bit quantization settings. To alleviate the above issues, we propose a simple and effective post-training quantization paradigm for LLMs, named SEPTQ. Specifically, SEPTQ first calculates the importance score for each element in the weight matrix and determines the quantization locations in a static global manner. Then it utilizes the mask matrix, which represents the important locations, to quantize and update the associated weights column by column until the appropriate quantized weight matrix is obtained. Compared with previous methods, SEPTQ simplifies the post-training quantization procedure into only two steps and considers effectiveness and efficiency simultaneously. Experimental results on various datasets, across a suite of models ranging from millions to billions of parameters and at different quantization bit levels, demonstrate that SEPTQ significantly outperforms other strong baselines, especially in low-bit quantization scenarios.
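
To make the phrase "static global manner" concrete, the sketch below selects the top-k entries across an entire weight matrix at once, rather than row by row or column by column. This is only an illustration under stated assumptions: the abstract does not define the importance score, so absolute magnitude stands in for it here, and the function name `global_topk_mask` is hypothetical. The sketch does not reproduce SEPTQ's column-wise quantize-and-update step.

```python
def global_topk_mask(weights, k, score=abs):
    """Return a 0/1 mask marking the k highest-scoring entries of the whole matrix.

    `score` is a stand-in importance function (assumption: absolute magnitude).
    """
    flat = [
        (score(w), r, c)
        for r, row in enumerate(weights)
        for c, w in enumerate(row)
    ]
    # Highest score first; ties are broken by position so the result is deterministic.
    flat.sort(key=lambda t: (-t[0], t[1], t[2]))
    mask = [[0] * len(row) for row in weights]
    for _, r, c in flat[:k]:
        mask[r][c] = 1
    return mask
```

Because the ranking is computed once over all elements, a single column can receive several marked positions while another receives none, which is the sense of "global" assumed here.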

[19] Who Wrote This Line? Evaluating the Detection of LLM-Generated Classical Chinese Poetry

Jiang Li,Tian Lan,Shanshan Wang,Dongxing Zhang,Dianqing Lin,Guanglai Gao,Derek F. Wong,Xiangdong Su

Main category: cs.CL

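TL;DR: This paper introduces ChangAn, a benchmark for detecting classical Chinese poetry generated by large language models, and systematically evaluates 12 AI detectors on it, revealing the limitations of existing Chinese text detectors on this task.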

Details Motivation: AI-generated classical Chinese poetry raises questions of creative authenticity and ethics, yet existing AI-text detection methods have not been studied on this distinctive genre. Method: The authors build the ChangAn benchmark of 30,664 classical Chinese poems (10,276 written by humans and 20,388 generated by four mainstream LLMs) and use it to systematically evaluate 12 AI detectors across text granularities and generation strategies. Result: Existing Chinese text detectors perform poorly on classical Chinese poetry and cannot serve as reliable tools; the ChangAn benchmark confirms both the difficulty of the task and the need for further research. Conclusion: The distinctive linguistic features of classical Chinese poetry, including metrical rules, a shared system of imagery, and flexible syntax, make AI-generated text detection substantially harder; ChangAn provides a foundational resource and evaluation standard for this area. Abstract: The rapid development of large language models (LLMs) has extended text generation tasks into the literary domain. However, AI-generated literary creations have raised increasingly prominent issues of creative authenticity and ethics in the literary world, making the detection of LLM-generated literary texts essential and urgent. While previous works have made significant progress in detecting AI-generated text, they have yet to address classical Chinese poetry. Due to the unique linguistic features of classical Chinese poetry, such as strict metrical regularity, a shared system of poetic imagery, and flexible syntax, distinguishing whether a poem is authored by AI presents a substantial challenge. To address these issues, we introduce ChangAn, a benchmark for detecting LLM-generated classical Chinese poetry containing 30,664 poems in total, of which 10,276 are human-written and 20,388 are generated by four popular LLMs. Based on ChangAn, we conducted a systematic evaluation of 12 AI detectors, investigating their performance variations across different text granularities and generation strategies. Our findings highlight the limitations of current Chinese text detectors, which fail to serve as reliable tools for detecting LLM-generated classical Chinese poetry. These results validate the effectiveness and necessity of our proposed ChangAn benchmark. Our dataset and code are available at https://github.com/VelikayaScarlet/ChangAn.

[20] CircuitSynth: Reliable Synthetic Data Generation

Zehua Cheng,Wei Dai,Jiahao Sun,Thomas Lukasiewicz

Main category: cs.CL

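TL;DR: CircuitSynth is a new neuro-symbolic framework that distills the reasoning ability of a teacher large language model (LLM) into a Probabilistic Sentential Decision Diagram (PSDD), combining logical correctness with distributional coverage in structured generation and significantly improving the validity and diversity of synthetic data.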

Details Motivation: LLMs suffer from hallucination, logical inconsistency, and mode collapse in structured generation, and mainstream approaches such as prompt engineering or retrieval-augmented generation struggle to guarantee linguistic expressivity together with formal validity and coverage. Method: The CircuitSynth framework (1) distills the teacher LLM's semantic reasoning into a Probabilistic Sentential Decision Diagram (PSDD), yielding a tractable semantic prior that enforces hard logical constraints, and (2) introduces a convex optimization mechanism to rigorously satisfy soft distributional goals. Result: Across multiple benchmarks, CircuitSynth achieves 100% Schema Validity on complex logic puzzles (versus 12.4% for the baseline) and significantly outperforms state-of-the-art methods in rare-combination coverage. Conclusion: CircuitSynth combines neural and symbolic methods to provide a new paradigm for high-fidelity structured synthetic data generation with both formal guarantees and statistical flexibility. Abstract: The generation of high-fidelity synthetic data is a cornerstone of modern machine learning, yet Large Language Models (LLMs) frequently suffer from hallucinations, logical inconsistencies, and mode collapse when tasked with structured generation. Existing approaches, such as prompting or retrieval-augmented generation, lack the mechanisms to balance linguistic expressivity with formal guarantees regarding validity and coverage. To address this, we propose CircuitSynth, a novel neuro-symbolic framework that decouples semantic reasoning from surface realization. By distilling the reasoning capabilities of a Teacher LLM into a Probabilistic Sentential Decision Diagram (PSDD), CircuitSynth creates a tractable semantic prior that structurally enforces hard logical constraints. Furthermore, we introduce a convex optimization mechanism to rigorously satisfy soft distributional goals. Empirical evaluations across diverse benchmarks demonstrate that CircuitSynth achieves 100% Schema Validity even in complex logic puzzles where unconstrained baselines fail (12.4%) while significantly outperforming state-of-the-art methods in rare-combination coverage.

[21] Training-Free Cross-Lingual Dysarthria Severity Assessment via Phonological Subspace Analysis in Self-Supervised Speech Representations

Bernard Muller,Antonio Armando Ortiz Barrañón,LaVonne Roberts

Main category: cs.CL

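TL;DR: This paper proposes a training-free method that quantifies dysarthria severity by measuring the degradation of phonological feature subspaces in frozen HuBERT representations. Feature directions are estimated from healthy speech only, and the method's effectiveness and robustness are validated on multilingual, multi-aetiology data.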

Details Motivation: Conventional dysarthria severity assessment depends on trained clinicians or on supervised models that need large amounts of labelled pathological speech, which makes it hard to scale across languages and clinical settings. Method: Speech representations are extracted with pretrained HuBERT, and phone-level embeddings are obtained with the Montreal Forced Aligner (MFA). Phonological contrast directions for consonant features (nasality, voicing, stridency, sonorance, manner) and four vowel features are estimated from healthy control speech, d-prime scores are computed for each speaker, and a 12-dimensional phonological profile is built. No dysarthric speech is used for training at any stage. Result: Validated on 890 speakers across 10 corpora, 5 languages, and 3 aetiologies: all five consonant d-prime features correlate significantly and negatively with clinical severity (rho = -0.47 to -0.56, p < 2e-4); the effect replicates within individual corpora, survives FDR correction, and is robust to leave-one-corpus-out removal and alignment quality controls; nasality d-prime decreases monotonically in 6 of 7 severity-graded corpora; and all 12 features distinguish healthy controls from severely dysarthric speakers (p < 0.001). Conclusion: The method removes the dependence on labelled pathological speech, applies across languages (the 29 languages with an existing MFA acoustic model), and offers a feasible approach to large-scale, multi-centre dysarthria assessment. Abstract: Dysarthric speech severity assessment typically requires trained clinicians or supervised models built from labelled pathological speech, limiting scalability across languages and clinical settings. We present a training-free method that quantifies dysarthria severity by measuring degradation in phonological feature subspaces within frozen HuBERT representations. No supervised severity model is trained; feature directions are estimated from healthy control speech using a pretrained forced aligner. For each speaker, we extract phone-level embeddings via Montreal Forced Aligner, compute d-prime scores along phonological contrast directions (nasality, voicing, stridency, sonorance, manner, and four vowel features) derived exclusively from healthy controls, and construct a 12-dimensional phonological profile. Evaluating 890 speakers across 10 corpora, 5 languages (English, Spanish, Dutch, Mandarin, French), and 3 primary aetiologies (Parkinson's disease, cerebral palsy, ALS), we find that all five consonant d-prime features correlate significantly with clinical severity (random-effects meta-analysis rho = -0.50 to -0.56, p < 2e-4; pooled Spearman rho = -0.47 to -0.55 with bootstrap 95% CIs not crossing zero). The effect replicates within individual corpora, survives FDR correction, and remains robust to leave-one-corpus-out removal and alignment quality controls. Nasality d-prime decreases monotonically from control to severe in 6 of 7 severity-graded corpora. Mann-Whitney U tests confirm that all 12 features distinguish controls from severely dysarthric speakers (p < 0.001). The method requires no dysarthric training data and applies to any language with an existing MFA acoustic model (currently 29 languages). We release the full pipeline and phone feature configurations for six languages.
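
The d-prime computation along a contrast direction can be sketched as follows. The abstract does not give the exact formula, so this example assumes the standard sensitivity index, d' = (mean_a - mean_b) / sqrt((var_a + var_b) / 2), applied to embeddings projected onto a unit-normalised direction. The function names are hypothetical, and the toy vectors stand in for phone-level HuBERT embeddings.

```python
import math
from statistics import mean, variance


def project(vec, direction):
    """Scalar projection of an embedding onto a contrast direction (normalised here)."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return sum(v * d for v, d in zip(vec, direction)) / norm


def d_prime(pos, neg):
    """Standard sensitivity index between two sets of projected values (assumed formula)."""
    pooled_sd = math.sqrt((variance(pos) + variance(neg)) / 2)
    if pooled_sd == 0:
        raise ValueError("d-prime is undefined when both groups have zero variance")
    return (mean(pos) - mean(neg)) / pooled_sd
```

In this reading, `pos` and `neg` would hold one speaker's projections for phones that carry a feature (for example, nasal) and phones that do not; a smaller d-prime means the two classes are less separable in that speaker's speech.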

[22] Think in Sentences: Explicit Sentence Boundaries Enhance Language Model's Capabilities

Zhichen Liu,Yongyuan Li,Yang Xu

Main category: cs.CL

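TL;DR: This paper proposes inserting delimiters at sentence boundaries in the input of large language models (LLMs) to strengthen their awareness of sentence-level structure and thereby improve reasoning performance. The method yields significant gains on a range of tasks, and the enhanced sentence awareness is confirmed in the models' internal representations.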

Details Motivation: Existing dummy-token methods ignore the inherent sentence-level structure of natural language, even though LLMs acquire their linguistic abilities from sentence-structured text; sentence boundary information is needed to close this gap. Method: Specific delimiters are inserted at sentence boundaries in LLM inputs, combining dummy tokens with sentence structure. Experiments cover two paradigms, in-context learning and supervised fine-tuning, on models from 7B to 600B parameters (including DeepSeek-V3). Result: Gains reach up to 7.7% on GSM8k and 12.5% on DROP, and the internal representations of the fine-tuned models show verifiably stronger sentence awareness. Conclusion: Inserting delimiters at sentence boundaries is a simple and effective technique for improving LLM capabilities and points to a new direction for cognitively inspired LLM enhancement. Abstract: Researchers have explored different ways to improve the capabilities of large language models (LLMs) via dummy token insertion in contexts. However, existing works focus solely on the dummy tokens themselves and fail to leverage the inherent sentence-level structure of natural language. This is a critical oversight, as LLMs acquire linguistic capabilities through exposure to human-generated texts, which are inherently structured at the sentence level. Motivated by this gap, we propose an approach that inserts delimiters at sentence boundaries in LLM inputs, which not only integrates dummy tokens into the context but also encourages sentence-by-sentence processing behavior during reasoning. We experiment with two concrete methods, (1) in-context learning and (2) supervised fine-tuning, on models ranging from 7B to the 600B DeepSeek-V3. Our results demonstrate consistent improvements across various tasks, with notable gains of up to 7.7% on GSM8k and 12.5% on DROP. Furthermore, the fine-tuned LLMs can incorporate sentence awareness, as evidenced by their internal representations. Our work establishes a simple yet effective technique for enhancing LLM capabilities, offering promising directions for a cognitively inspired LLM enhancement paradigm.
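
The input transformation described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the delimiter string `<sep>` and the boundary rule (a `.`, `!`, or `?` followed by whitespace or the end of the text) are assumptions, and a real system would likely use a proper sentence segmenter.

```python
import re


def insert_sentence_delimiters(text, delimiter=" <sep>"):
    """Append a delimiter after each sentence-final punctuation mark.

    Naive boundary rule (assumption): '.', '!' or '?' followed by whitespace
    or end of text, so decimals such as 3.14 are left untouched.
    """
    return re.sub(r"([.!?])(?=\s|$)", lambda m: m.group(1) + delimiter, text)
```

For example, `insert_sentence_delimiters("Tom has 3 apples. He buys 2 more.")` places the hypothetical `<sep>` marker after each of the two sentences while leaving the rest of the prompt unchanged.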

[23] Nationality encoding in language model hidden states: Probing culturally differentiated representations in persona-conditioned academic text

Paul Jackson,Ruizhe Li,Elspeth Edelstein

Main category: cs.CL

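TL;DR: This study uses probing to test whether Gemma-3-4b-it carries nationality-related representations when generating English research article introductions. It finds that the hidden layers (Layer 18 in particular) strongly encode differences between British and Chinese academic personas, but these differences do not appear in the surface text of the final output.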

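Details Motivation: Large language models are increasingly used as teaching and writing tools for academic English, but it is unclear whether the academic text they generate carries culturally or nationally differentiated representations. Method: A 2 x 3 design (British and Chinese personas, six persona conditions in total) is crossed with 45 prompt templates to generate 270 introductions. Logistic regression probes are trained on the hidden states of all 35 layers, with several controls (shuffled-label baselines, a surface-text classifier, cross-family tests, and sentence-level baselines). Tokens at high-signal positions are annotated for structural, lexical, and stance features with Stanza. Result: The nationality probe reaches 0.968 cross-validated accuracy at Layer 18 with perfect held-out classification. Nationality encoding is non-monotonic across layers: structural features are strongest in the middle to upper layers, while lexical features peak earlier. At high-signal positions, British-associated patterns favour postmodification, hedging, boosting, passive voice, and evaluative or process-oriented vocabulary, whereas Chinese-associated patterns favour premodification, nominal predicates, and sociocultural or internationalisation vocabulary. The full surface text, however, shows no significant nationality differences. Conclusion: The LLM encodes fine-grained, nationality-related differences in academic style in its internal representations, but these differences do not carry over to the output text, suggesting a decoupling of representation and generation. The finding extends probing to sociolinguistic attributes and has practical implications for academic English teaching. Abstract: Large language models are increasingly used as writing tools and pedagogical resources in English for Academic Purposes, but it remains unclear whether they encode culturally differentiated representations when generating academic text. This study tests whether Gemma-3-4b-it encodes nationality-discriminative information in hidden states when generating research article introductions conditioned by British and Chinese academic personas. A corpus of 270 texts was generated from 45 prompt templates crossed with six persona conditions in a 2 x 3 design. Logistic regression probes were trained on hidden-state activations across all 35 layers, with shuffled-label baselines, a surface-text skyline classifier, cross-family tests, and sentence-level baselines used as controls. Probe-selected token positions were annotated for structural, lexical, and stance features using the Stanza NLP pipeline. The nationality probe reached 0.968 cross-validated accuracy at Layer 18, with perfect held-out classification. Nationality encoding followed a non-monotonic trajectory across layers, with structural effects strongest in the middle to upper network and lexical-domain effects peaking earlier. At high-signal token positions, British-associated patterns showed more postmodification, hedging, boosting, passive voice, and evaluative or process-oriented vocabulary, while Chinese-associated patterns showed more premodification, nominal predicates, and sociocultural or internationalisation vocabulary. However, sentence-level analysis found no significant nationality differences in the full generated surface text. The findings extend probing methodology to a sociolinguistic attribute and have practical implications for EAP and language pedagogy.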

[24] ODUTQA-MDC: A Task for Open-Domain Underspecified Tabular QA with Multi-turn Dialogue-based Clarification

Zhensheng Wang,ZhanTeng Lin,Wenmian Yang,Kun Zhou,Yiquan Zhang,Weijia Jia

Main category: cs.CL

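TL;DR: This paper introduces the ODUTQA-MDC task and its first comprehensive benchmark, addressing the difficulty large language models have with underspecified or uncertain expressions in open-domain tabular question answering, and designs the multi-agent framework MAIC-TQA for ambiguity detection, dialogue-based clarification, and answer refinement.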

Details Motivation: Large language models struggle with open-domain tabular QA queries that contain underspecified or uncertain expressions. Method: The authors build a large-scale ODUTQA dataset with 209 tables and 25,105 QA pairs, a fine-grained labeling scheme, and a dynamic clarification interface; they also propose MAIC-TQA, a multi-agent framework that supports ambiguity detection, dialogue-based clarification, and answer refinement. Result: Experiments validate the benchmark and the MAIC-TQA framework, establishing them as a key resource for research on conversational, underspecification-aware tabular QA. Conclusion: The ODUTQA-MDC task and the MAIC-TQA framework provide a systematic approach to handling uncertainty and ambiguity in open-domain tabular QA and advance work in this direction. Abstract: The advancement of large language models (LLMs) has enhanced tabular question answering (Tabular QA), yet they struggle with open-domain queries exhibiting underspecified or uncertain expressions. To address this, we introduce the ODUTQA-MDC task and the first comprehensive benchmark to tackle it. This benchmark includes: (1) a large-scale ODUTQA dataset with 209 tables and 25,105 QA pairs; (2) a fine-grained labeling scheme for detailed evaluation; and (3) a dynamic clarification interface that simulates user feedback for interactive assessment. We also propose MAIC-TQA, a multi-agent framework that excels at detecting ambiguities, clarifying them through dialogue, and refining answers. Experiments validate our benchmark and framework, establishing them as a key resource for advancing conversational, underspecification-aware Tabular QA research.

[25] FAITH: Factuality Alignment through Integrating Trustworthiness and Honestness

Xiaoning Dong,Chengyan Wu,Yajie Wen,Yu Chen,Yun Xue,Jing Zhang,Wei Xu,Bolei Ma

Main category: cs.CL

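TL;DR: This paper proposes FAITH, a post-training framework that combines natural-language uncertainty signals with external knowledge to improve the factual accuracy and honesty of large language models.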

Details Motivation: Existing methods add numerical uncertainty scores to QA prompts, but these scores lack semantic richness, so large language models cannot properly understand their own trustworthiness and honesty states, and factuality alignment falls short. Method: FAITH (1) computes confidence and semantic entropy from LLM outputs and maps them to a knowledge state quadrant that describes knowledge possession (trustworthiness) and answering behavior (honestness) in natural language; (2) designs a reward function that accounts for both correctness and uncertainty signals and fine-tunes the model with PPO; and (3) adds a retrieval-augmented module that improves the consistency between internal and external knowledge. Result: Experiments on four knowledge-intensive benchmarks show that FAITH significantly improves the factual accuracy and truthfulness of large language models. Conclusion: Combining natural-language uncertainty modeling with external knowledge retrieval achieves factuality alignment more effectively and makes large language models more reliable. Abstract: Large Language Models (LLMs) can generate factually inaccurate content even if they have corresponding knowledge, which critically undermines their reliability. Existing approaches attempt to mitigate this by incorporating uncertainty in QA prompts during training, but these numerical scores lack the semantic richness for an LLM to properly understand its internal states of trustworthiness and honestness, leading to insufficient factuality alignment. We introduce FAITH (Factuality Alignment through Integrating Trustworthiness and Honestness), a post-training framework for factuality alignment that integrates natural-language uncertainty signals with external knowledge. Specifically, we augment training datasets by computing confidence scores and semantic entropy from LLM outputs and mapping them into a knowledge state quadrant that describes the model's internal knowledge possession (trustworthiness) and answering behaviors (honestness) in natural language. Based on this enhanced data, we design a reward function that considers both correctness and uncertainty signals, and fine-tune the LLM using the Proximal Policy Optimization (PPO) algorithm. To further mitigate weakly grounded responses, we design a retrieval-augmented module that retrieves relevant external passages, improving the consistency between internal and external knowledge representations. Extensive experiments on four knowledge-intensive benchmarks demonstrate that FAITH enhances the factual accuracy and truthfulness of LLMs.
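
The step of turning sampled outputs into a natural-language knowledge state can be sketched as below. Several things here are assumptions made only for illustration: answers are grouped by normalised string equality (semantic entropy is normally computed over meaning-equivalent clusters, which requires an entailment model), the thresholds are arbitrary, and the returned wording is hypothetical rather than the paper's actual quadrant labels.

```python
import math
from collections import Counter


def semantic_entropy(answers, normalize=lambda s: s.strip().lower()):
    """Entropy (in nats) over answer clusters; clusters are exact matches after normalisation."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(normalize(a) for a in answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def knowledge_state(confidence, entropy, conf_thr=0.5, ent_thr=0.5):
    """Hypothetical 2 x 2 mapping from numeric signals to a textual description."""
    conf = "high confidence" if confidence >= conf_thr else "low confidence"
    ent = "low entropy" if entropy < ent_thr else "high entropy"
    return f"{conf}, {ent}"
```

The point of the sketch is the direction of the mapping: two numeric signals are collapsed into one of four textual states that can then be placed in the training data in place of a raw score.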

[26] Relational Probing: LM-to-Graph Adaptation for Financial Prediction

Yingjie Niu,Changhong Jin,Rian Dolphin,Ruihai Dong

Main category: cs.CL

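TL;DR: This paper proposes Relational Probing, which replaces the standard language-model output head with a relation head that builds a relational graph directly from hidden states and is trained jointly with a downstream stock-trend prediction task, improving performance while preserving the strict structure of the graph.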

Details Motivation: Prompt-based relation extraction with language models incurs high autoregressive decoding costs and decouples graph construction from downstream optimization; in addition, small language models (SLMs) lack a reproducible operational definition. Method: A relation head replaces the language model's original output head and induces a structured relational graph directly from the hidden layers; the relation head is trained end to end with the downstream stock-trend prediction model; and an SLM is defined explicitly as a model that can be fully fine-tuned on a single 24GB GPU under specified batch-size and sequence-length settings. Result: On Qwen3 small language models (0.6B/1.7B/4B), Relational Probing consistently improves over a co-occurrence baseline at competitive inference cost. Conclusion: Relational Probing unifies semantic representation learning and structured graph generation, lets language-model outputs be adapted to downstream task formats, and improves the efficiency and effectiveness of modeling relationships between financial entities. Abstract: Language models can be used to identify relationships between financial entities in text. However, while structured output mechanisms exist, prompting-based pipelines still incur autoregressive decoding costs and decouple graph construction from downstream optimization. We propose Relational Probing, which replaces the standard language-model head with a relation head that induces a relational graph directly from language-model hidden states and is trained jointly with the downstream task model for stock-trend prediction. This approach both learns semantic representations and preserves the strict structure of the induced relational graph. It enables language-model outputs to go beyond text, allowing them to be reshaped into task-specific formats for downstream models. To enhance reproducibility, we provide an operational definition of small language models (SLMs): models that can be fine-tuned end-to-end on a single 24GB GPU under specified batch-size and sequence-length settings. Experiments use Qwen3 backbones (0.6B/1.7B/4B) as upstream SLMs and compare against a co-occurrence baseline. Relational Probing yields consistent performance improvements at competitive inference cost.

[27] CodeComp: Structural KV Cache Compression for Agentic Coding

Qiujiang Chen,Jing Xiong,Chenyang Zhao,Sidi Yang,Ngai Wong

Main category: cs.CL

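TL;DR: This paper proposes CodeComp, a training-free KV cache compression framework that uses static program analysis (Code Property Graph priors extracted with Joern) to improve LLM inference efficiency on long codebases. On agentic coding tasks such as fault localization and patch generation, it significantly outperforms attention-only compression methods under memory constraints.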

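Details Motivation: Existing KV cache compression methods estimate token importance from attention signals alone and systematically discard structural tokens that are essential for code understanding (such as call sites, branch conditions, and assignments), which hurts performance on memory-constrained agentic coding tasks. Method: CodeComp is a training-free framework that feeds structural priors from static program analysis (a Code Property Graph built with Joern) into LLM inference to guide KV cache compression and retain critical structural tokens. Result: On fault localization and code generation benchmarks, CodeComp consistently outperforms attention-only compression baselines under equal memory budgets, recovers most of the full-context accuracy under aggressive compression, matches the patch generation quality of uncompressed full-context inference, and integrates seamlessly into SGLang-based agentic coding pipelines. Conclusion: Priors from static program analysis compensate for the weak structural awareness of attention signals, and CodeComp offers an efficient, plug-and-play KV cache compression scheme for memory-constrained agentic coding. Abstract: Agentic code tasks such as fault localization and patch generation require processing long codebases under tight memory constraints, where the Key-Value (KV) cache becomes the primary inference bottleneck. Existing compression methods rely exclusively on attention signals to estimate token importance, systematically discarding structurally critical tokens such as call sites, branch conditions, and assignments that are essential for code understanding. We present CodeComp, a training-free KV cache compression framework that incorporates static program analysis into LLM inference via Code Property Graph priors extracted by Joern. Across bug localization and code generation benchmarks, CodeComp consistently outperforms attention-only compression baselines under equal memory budgets, recovering the majority of full-context accuracy under aggressive KV cache compression, while matching the patch generation quality of uncompressed full-context inference and integrating seamlessly into SGLang-based agentic coding pipelines without model modification.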
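
A toy version of structure-aware token selection shows why a structural prior changes which cache entries survive. The abstract does not describe how CodeComp combines attention with Code Property Graph priors, so this sketch assumes a simple additive bonus for tokens flagged as structural; `select_kv_tokens`, `bonus`, and the boolean flags are all hypothetical, and in practice the flags would come from a static analyser such as Joern.

```python
def select_kv_tokens(attn_scores, is_structural, budget, bonus=1.0):
    """Pick `budget` token positions to keep in the KV cache.

    Score = attention score + bonus for structural tokens (assumed combination rule).
    Returned positions are in original order so the cache stays sequential.
    """
    scored = [
        (score + (bonus if flag else 0.0), idx)
        for idx, (score, flag) in enumerate(zip(attn_scores, is_structural))
    ]
    # Highest combined score first; earlier position wins ties.
    keep = sorted(scored, key=lambda t: (-t[0], t[1]))[:budget]
    return sorted(idx for _, idx in keep)
```

With `bonus=0.0` the function reduces to attention-only selection, which makes the contrast explicit: a low-attention call site or branch condition is dropped by the baseline but retained once the prior is applied.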

[28] Comparative Analysis of Large Language Models in Healthcare

Subin Santhosh,Farwa Abbas,Hussain Ahmad,Claudia Szabo

Main category: cs.CL

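TL;DR: This study runs a standardized evaluation of several large language models (ChatGPT, LLaMA, Grok, Gemini, and ChatDoctor) on medical tasks such as patient note summarization and medical question answering. Domain-specific models such as ChatDoctor are stronger in semantic accuracy and contextual reliability, while general-purpose models such as Grok and LLaMA achieve higher quantitative accuracy on structured question answering. The results indicate that models should be chosen by task and deployed cautiously, with human oversight and ethical safeguards.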

Details Motivation: Large language models are increasingly used in healthcare, but their accuracy, reliability, and patient safety in high-stakes clinical settings remain open concerns, and standardized benchmarking for medical scenarios is lacking. Method: ChatGPT, LLaMA, Grok, Gemini, and ChatDoctor are evaluated on core tasks such as patient note summarization and medical question answering using the public datasets MedMCQA, PubMedQA, and Asclepius, with performance assessed through a combination of linguistic and task-specific metrics. Result: Domain-specific models such as ChatDoctor perform better in contextual reliability and medical semantic consistency; general-purpose models such as Grok and LLaMA achieve higher quantitative accuracy on structured question answering; the two kinds of model have complementary strengths, and results depend heavily on the task type. Conclusion: Large language models can support medical professionals and improve clinical decision-making, but safe and effective deployment requires adherence to ethical standards, contextual accuracy, and human oversight at critical points; task-specific evaluation and cautious integration are essential. Abstract: Background: Large Language Models (LLMs) are transforming artificial intelligence applications in healthcare due to their ability to understand, generate, and summarize complex medical text. They offer valuable support to clinicians, researchers, and patients, yet their deployment in high-stakes clinical environments raises critical concerns regarding accuracy, reliability, and patient safety. Despite substantial attention in recent years, standardized benchmarking of LLMs for medical applications has been limited. Objective: This study addresses the need for a standardized comparative evaluation of LLMs in medical settings. Method: We evaluate multiple models, including ChatGPT, LLaMA, Grok, Gemini, and ChatDoctor, on core medical tasks such as patient note summarization and medical question answering, using the open-access datasets MedMCQA, PubMedQA, and Asclepius, and assess performance through a combination of linguistic and task-specific metrics. Results: The results indicate that domain-specific models, such as ChatDoctor, excel in contextual reliability, producing medically accurate and semantically aligned text, whereas general-purpose models like Grok and LLaMA perform better in structured question-answering tasks, demonstrating higher quantitative accuracy. This highlights the complementary strengths of domain-specific and general-purpose LLMs depending on the medical task. Conclusion: Our findings suggest that LLMs can meaningfully support medical professionals and enhance clinical decision-making; however, their safe and effective deployment requires adherence to ethical standards, contextual accuracy, and human oversight in relevant cases. These results underscore the importance of task-specific evaluation and cautious integration of LLMs into healthcare workflows.

[29] Adaptive Multi-Expert Reasoning via Difficulty-Aware Routing and Uncertainty-Guided Aggregation

Mohamed Ehab,Ali Hamdi

Main category: cs.CL

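TL;DR: This paper proposes Adaptive Multi-Expert Reasoning (AMR), a framework that adapts its reasoning strategy to the difficulty of each math problem. It combines difficulty prediction, multi-expert generation, neural verification, and clustering-based aggregation, and reaches 75.28% accuracy on GSM8K, outperforming most 7B models trained on synthetic data.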

Details Motivation: Large language models perform inconsistently on math reasoning, with large swings across problems of different difficulty, so a method is needed that adapts the reasoning process to problem complexity. Method: The AMR framework comprises (1) an agile routing system that predicts problem difficulty and uncertainty; (2) a reconfigurable sampling mechanism that controls the breadth of generation; (3) three specialized experts that generate candidate answers; (4) multiple rounds of correction and finalization; (5) a neural verifier that assesses correctness; and (6) clustering-based aggregation weighted by consensus and quality. Result: AMR reaches 75.28% accuracy on GSM8K using only the original training data, outperforming most 7B models that rely on synthetic training data. Conclusion: Difficulty-based routing and uncertainty-driven aggregation can significantly improve the robustness and efficiency of math reasoning models and achieve strong performance without additional synthetic data. Abstract: Large language models (LLMs) demonstrate strong performance in math reasoning benchmarks, but their performance varies inconsistently across problems with varying levels of difficulty. This paper describes Adaptive Multi-Expert Reasoning (AMR), a framework that focuses on problem complexity by reasoning with dynamically adapted strategies. An agile routing system that focuses on problem text predicts problems' difficulty and uncertainty and guides a reconfigurable sampling mechanism to manage the breadth of generation. Three specialized experts create candidate responses, which are modified during multiple correction and finalization phases. A neural verifier assesses the correctness of responses, while a clustering-based aggregation technique identifies the final candidate answer based on a combination of consensus and answer quality. When evaluated on the GSM8K dataset, AMR achieved 75.28% accuracy while only using the original training data. This result outperformed the majority of comparable 7B models that were trained on synthetic data. This shows that models using difficulty-based routing and uncertainty-driven aggregation are efficient and effective in improving math reasoning models' robustness.
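
The final aggregation step, which combines consensus with answer quality, can be sketched as a verifier-weighted vote. This is an assumed simplification: candidates are clustered by exact final answer and each cluster's weight is the sum of its members' verifier scores. The abstract does not specify AMR's clustering or weighting, and the name `aggregate` is hypothetical.

```python
from collections import defaultdict


def aggregate(candidates):
    """Choose a final answer from (answer, verifier_score) pairs.

    Clusters are exact-match answers; a cluster's weight is the sum of its
    members' verifier scores, so both agreement and quality contribute.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score
    return max(totals, key=totals.get)
```

Under this rule, two moderately scored candidates that agree can outvote a single high-scoring outlier, while one strongly verified answer can still beat a weakly verified majority.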

[30] A Structured Clustering Approach for Inducing Media Narratives

Rohan Das,Advait Deshmukh,Alexandria Leto,Zohar Naaman,I-Ta Lee,Maria Leonor Pacheco

Main category: cs.CL

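TL;DR: This paper proposes a structured clustering framework that jointly models events and characters to automatically induce narrative schemas that are explainable, consistent with framing theory, and scalable, overcoming the limits of existing methods in fine-grained analysis and scalability.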

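Details Motivation: Existing computational methods fail to capture the fine narrative structures emphasized in communication studies; they are either too coarse-grained or depend on domain-specific taxonomies that limit scalability. Method: Events and characters are modeled jointly through structured clustering to induce narrative schemas automatically. Result: The framework produces narrative schemas that are explainable, align with framing theory, and scale to large corpora without extensive manual annotation. Conclusion: The framework bridges computational narrative analysis and narrative theory in communication studies, improving the explainability and scalability of media narrative modeling. Abstract: Media narratives wield tremendous power in shaping public opinion, yet computational approaches struggle to capture the nuanced storytelling structures that communication theory emphasizes as central to how meaning is constructed. Existing approaches either miss subtle narrative patterns through coarse-grained analysis or require domain-specific taxonomies that limit scalability. To bridge this gap, we present a framework for inducing rich narrative schemas by jointly modeling events and characters via structured clustering. Our approach produces explainable narrative schemas that align with established framing theory while scaling to large corpora without exhaustive manual annotation.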

[31] BLUEmed: Retrieval-Augmented Multi-Agent Debate for Clinical Error Detection

Saukun Thika You,Nguyen Anh Khoa Tran,Wesley K. Marizane,Hanshu Rao,Qiunan Zhang,Xiaolei Huang

Main category: cs.CL

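TL;DR: This paper proposes BLUEmed, a framework that combines hybrid retrieval-augmented generation (RAG) with multi-agent debate to detect terminology substitution errors in clinical text. It uses sub-query decomposition, multi-source evidence retrieval, independent analysis by two experts, structured counter-argumentation with cross-source adjudication, and a cascading safety layer to significantly improve detection performance.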

Details Motivation: Terminology substitution errors in clinical notes (terms that are linguistically plausible but clinically different) are hard for automated methods to detect, and existing single-agent or debate-only methods have limited effect. Method: The BLUEmed multi-agent debate framework decomposes each clinical note into sub-queries; retrieves source-partitioned evidence through dense, sparse, and online retrieval; assigns two domain expert agents with distinct knowledge bases to analyze the note independently; starts a structured counter-argumentation round and cross-source adjudication when they disagree; and finally filters common false positives through a cascading safety layer. Result: On a terminology substitution detection benchmark, BLUEmed achieves the best accuracy (69.13%), ROC-AUC (74.45%), and PR-AUC (72.44%) in the few-shot setting, outperforming single-agent RAG and debate-only baselines. The analyses confirm that retrieval augmentation and structured debate are complementary and that the backbone model needs sufficient instruction-following and clinical language understanding. Conclusion: By combining multi-source retrieval, multi-perspective expert debate, and safety checks, BLUEmed improves the accuracy and robustness of clinical terminology error detection and offers a new approach to trustworthy reasoning in medical NLP. Abstract: Terminology substitution errors in clinical notes, where one medical term is replaced by a linguistically valid but clinically different term, pose a persistent challenge for automated error detection in healthcare. We introduce BLUEmed, a multi-agent debate framework augmented with hybrid Retrieval-Augmented Generation (RAG) that combines evidence-grounded reasoning with multi-perspective verification for clinical error detection. BLUEmed decomposes each clinical note into focused sub-queries, retrieves source-partitioned evidence through dense, sparse, and online retrieval, and assigns two domain expert agents distinct knowledge bases to produce independent analyses; when the experts disagree, a structured counter-argumentation round and cross-source adjudication resolve the conflict, followed by a cascading safety layer that filters common false-positive patterns. We evaluate BLUEmed on a clinical terminology substitution detection benchmark under both zero-shot and few-shot prompting with multiple backbone models spanning proprietary and open-source families. Experimental results show that BLUEmed achieves the best accuracy (69.13%), ROC-AUC (74.45%), and PR-AUC (72.44%) under few-shot prompting, outperforming both single-agent RAG and debate-only baselines. Further analyses across six backbone models and two prompting strategies confirm that retrieval augmentation and structured debate are complementary, and that the framework benefits most from models with sufficient instruction-following and clinical language understanding.

[32] NameBERT: Scaling Name-Based Nationality Classification with LLM-Augmented Open Academic Data

Cong Ming,Ruixin Shi,Yifan Hu

Main category: cs.CL

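TL;DR: This paper proposes NameBERT, which uses large language models (LLMs) to augment name data for low-resource countries and builds a large-scale name-nationality dataset. The model keeps inference efficient while significantly improving nationality prediction accuracy, especially for tail countries.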

Details Motivation: Existing name-based nationality classifiers are limited by small or single-source labeled data, which leads to coverage gaps and poor performance for underrepresented countries. Using LLMs directly gives strong zero-shot results, but their computational cost and latency make real-time, large-scale deployment impractical. Method: A large-scale name-nationality dataset is built from the Open Academic Graph (OAG), and a framework is designed that uses LLMs as dataset enrichers rather than inference engines: LLM-generated names augment the data for low-resource countries, and lightweight NameBERT models are trained on the result. Result: On real and synthetic-tail test sets, augmentation significantly improves prediction for tail countries; NameBERT significantly outperforms state-of-the-art baselines on both in-domain and out-of-domain tasks while being far more efficient at inference than LLMs. Conclusion: Using LLMs for data augmentation rather than direct inference is an effective way to balance performance and efficiency, and NameBERT offers a more robust and scalable approach to name-based nationality inference. Abstract: Inferring nationality from personal names is a critical capability for equity and bias monitoring, personalization, and a valuable tool in biomedical and sociological research. However, existing name-based nationality classifiers are typically trained on relatively small or source-specific labeled datasets, which can introduce coverage gaps and limit performance for underrepresented countries. While large language models (LLMs) demonstrate strong zero-shot performance for name-based nationality prediction, their computational cost and latency make them impractical for real-time, large-scale deployment. In this work, we create a large-scale name-nationality dataset from the Open Academic Graph (OAG) and introduce a framework that leverages LLMs as dataset enrichers rather than inference engines. We augment low-resource countries with LLM-generated names and evaluate on real and synthetic-tail test sets. We find that augmentation produces large gains when evaluation includes synthetic tail names and still offers a modest lift on tail-country metrics otherwise. Overall, NameBERT models achieve significantly higher accuracy than state-of-the-art baselines across both in- and out-of-domain tasks, while remaining efficient for large-scale inference compared to LLMs.

[33] LASQ: A Low-resource Aspect-based Sentiment Quadruple Extraction Dataset

Aizihaierjiang Yusufu,Jiang Liu,Kamran Aziz,Abidan Ainiwaer,Bobo Li,Fei Li,Donghong Ji,Aizierguli Yusufu

Main category: cs.CL

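TL;DR: This paper builds LASQ, the first fine-grained sentiment quadruple (target-aspect-opinion-sentiment) dataset for the low-resource languages Uzbek and Uyghur, and proposes a grid-tagging model that integrates syntactic knowledge (POS and dependency relations) through a Syntax Knowledge Embedding Module (SKEM), alleviating the lexical sparsity of agglutinative languages and outperforming baselines on LASQ.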

Details Motivation: Existing ABSA research and benchmarks focus mainly on high-resource languages, leaving fine-grained sentiment analysis in low-resource languages without adequate data or methods. Method: Builds the LASQ dataset (covering Uzbek and Uyghur) and proposes a grid-tagging model that incorporates part-of-speech (POS) and dependency information through a purpose-built Syntax Knowledge Embedding Module (SKEM). Result: On LASQ, the proposed model consistently outperforms competitive baselines on quadruple extraction. Conclusion: LASQ fills the data gap for ABSA in low-resource languages; SKEM shows that syntactic knowledge helps alleviate lexical sparsity in agglutinative languages, offering a new approach and usable tools for low-resource ABSA. Abstract: In recent years, aspect-based sentiment analysis (ABSA) has made rapid progress and shown strong practical value. However, existing research and benchmarks are largely concentrated on high-resource languages, leaving fine-grained sentiment extraction in low-resource languages under-explored. To address this gap, we constructed the first Low-resource languages Aspect-based Sentiment Quadruple dataset, named LASQ, which includes two low-resource languages: Uzbek and Uyghur. It also includes a fine-grained target-aspect-opinion-sentiment quadruple extraction task. To facilitate future research, we designed a grid-tagging model that integrates syntactic knowledge. This model incorporates part-of-speech (POS) and dependency knowledge through our Syntax Knowledge Embedding Module (SKEM), thereby alleviating the lexical sparsity problem caused by agglutinative languages. Experiments on LASQ demonstrate consistent gains over competitive baselines, validating both the dataset's utility and the effectiveness of the proposed modeling approach.

[34] Turing or Cantor: That is the Question

Eugene Eberbach

Main category: cs.CL

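TL;DR: This paper traces Turing's achievements back to Cantor's set theory, proposes a measure of undecidability based on the probability distribution of input data, and extends Turing machine models to super-Turing computation; it defines three new complexity classes for undecidable problems (U-complete, D-complete, and H-complete) and reports that the analogue of the "P is not equal to NP" question is answered negatively for the U-complete class.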

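Details Motivation: To reveal the deep connection between Turing's work and Cantor's set theory, to address the lack of a quantitative measure and systematic classification for undecidable problems, and to extend the theoretical boundaries of Turing's model of computation. Method: Theoretical analysis and concept building: traces the history of the foundations of mathematics; introduces a measure of undecidability based on the distribution of input data; generalizes Oracle machines and infinite logics to super-Turing models; defines three new notions of completeness for undecidable problems by analogy with NP-completeness; establishes their properties via reductions, diagonalization, and related techniques. Result: Argues for Cantor's foundational role in Turing's work; proposes a probabilistic measure of undecidability; builds a family of super-Turing models of computation; explicitly defines the U-complete, D-complete, and H-complete classes for undecidable problems, which the author states has not been done before; and answers negatively the U-complete analogue of the P versus NP question. Conclusion: Turing's theory of computation is rooted in Cantor's set theory; undecidability can be quantified and has hierarchical structure; super-Turing models extend the boundaries of computability; the three new completeness classes provide a systematic classification framework for undecidable problems, and the analogue of the P versus NP question is settled for the U-complete class. Abstract: Alan Turing is considered a founder of current computer science together with Kurt Gödel, Alonzo Church and John von Neumann. In this paper multiple new research results are presented. It is demonstrated that there would not be Alan Turing's achievements without earlier seminal contributions by Georg Cantor in set theory and the foundations of mathematics. It is proposed to introduce a measure of undecidability of problems unsolvable by Turing machines based on the probability distribution of their input data, i.e., to provide the degree of unsolvability based on the number of undecidable instances of input data versus decidable ones. It is proposed as well to extend Turing's work on infinite logics and Oracle machines to a whole class of super-Turing models of computation. Next, three new complexity classes for TM undecidable problems are defined: U-complete (Universal complete), D-complete (Diagonalization complete) and H-complete (Hypercomputation complete) classes. The above has never been defined explicitly before by other scientists, and has been inspired by the Cook/Levin NP-complete class for intractable problems. Finally, an equivalent of the famous unanswered "P is not equal to NP" question for the NP-complete class has been answered negatively for the U-complete class of complexity for undecidable problems.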

[35] CodaRAG: Connecting the Dots with Associativity Inspired by Complementary Learning

Cheng-Yen Li,Xuanjun Chen,Claire Lin,Wei-Yu Chen,Wenhua Nie,Hung-Yi Lee,Jyh-Shing Roger Jang

Main category: cs.CL

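TL;DR: CodaRAG is a retrieval-augmented generation framework inspired by Complementary Learning Systems that improves retrieval recall and generation accuracy for LLMs on knowledge-intensive tasks through a three-stage pipeline of knowledge consolidation, associative navigation, and interference elimination.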

Details Motivation: Existing RAG methods treat evidence as isolated units and cannot reconstruct the logical chains that connect dispersed information, so LLMs are prone to hallucination and fragmented reasoning on knowledge-intensive tasks. Method: Proposes the CodaRAG framework with three stages: (1) Knowledge Consolidation, which unifies fragmented extractions into a stable memory substrate; (2) Associative Navigation, which traverses the graph along semantic, contextualized, and functional pathways to recover dispersed evidence chains; (3) Interference Elimination, which prunes hyper-associative noise to keep the reasoning context high-precision. Result: On GraphRAG-Bench, CodaRAG achieves absolute gains of 7-10% in retrieval recall and 3-11% in generation accuracy. Conclusion: CodaRAG systematically strengthens associative evidence retrieval and makes performance more robust on factual, reasoning, and creative tasks. Abstract: Large Language Models (LLMs) struggle with knowledge-intensive tasks due to hallucinations and fragmented reasoning over dispersed information. While Retrieval-Augmented Generation (RAG) grounds generation in external sources, existing methods often treat evidence as isolated units, failing to reconstruct the logical chains that connect these dots. Inspired by Complementary Learning Systems (CLS), we propose CodaRAG, a framework that evolves retrieval from passive lookup into active associative discovery. CodaRAG operates via a three-stage pipeline: (1) Knowledge Consolidation to unify fragmented extractions into a stable memory substrate; (2) Associative Navigation to traverse the graph via multi-dimensional pathways (semantic, contextualized, and functional), explicitly recovering dispersed evidence chains; and (3) Interference Elimination to prune hyper-associative noise, ensuring a coherent, high-precision reasoning context. On GraphRAG-Bench, CodaRAG achieves absolute gains of 7-10% in retrieval recall and 3-11% in generation accuracy. These results demonstrate CodaRAG's superior ability to systematically robustify associative evidence retrieval for factual, reasoning, and creative tasks.

[36] Instruction Data Selection via Answer Divergence

Bo Li,Mingda Wang,Shikun Zhang,Wei Ye

Main category: cs.CL

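TL;DR: This paper proposes Answer Divergence-Guided Selection (ADG), an instruction data selection method that assesses instructions by the geometric structure of multiple sampled responses in embedding space (dispersion magnitude and shape anisotropy), thereby improving instruction tuning.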

Details Motivation: Instruction tuning depends on large, high-quality instruction-response corpora, yet existing data selection methods do not fully exploit the diversity of responses. Method: For each instruction, ADG draws several high-temperature generations, maps them into an embedding space, and computes an output divergence score that jointly encodes dispersion magnitude and shape anisotropy; high-scoring instructions have responses that are both far apart and multi-modal, rather than near-paraphrases clustered along a single direction. Result: Across two backbones and three public instruction pools, fine-tuning on only 10K ADG-selected examples consistently outperforms strong selectors on six benchmarks spanning reasoning, knowledge, and coding; ablations show that both dispersion magnitude and shape anisotropy are necessary. Conclusion: Answer divergence is an effective and practical signal for instruction data selection, and ADG improves instruction tuning with a small, high-quality subset. Abstract: Instruction tuning relies on large instruction-response corpora whose quality and composition strongly affect downstream performance. We propose Answer Divergence-Guided Selection (ADG), which selects instruction data based on the geometric structure of multi-sample outputs. ADG draws several high-temperature generations per instruction, maps responses into an embedding space, and computes an output divergence score that jointly encodes dispersion magnitude and shape anisotropy. High scores correspond to instructions whose answers are both far apart and multi-modal, rather than clustered paraphrases along a single direction. Across two backbones and three public instruction pools, fine-tuning on only 10K ADG-selected examples consistently outperforms strong selectors on six benchmarks spanning reasoning, knowledge, and coding. Analyses further show that both dispersion magnitude and shape anisotropy are necessary, supporting answer divergence as a practical signal for instruction data selection. Code and appendix are included in the supplementary materials.
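
To make the idea of a score that combines dispersion magnitude and shape anisotropy concrete, here is a minimal sketch. The formula is an assumption for illustration, not the ADG score from the paper: dispersion is taken as the total variance of the response embeddings, anisotropy as the share of that variance along the top principal direction, and the score as dispersion times one minus anisotropy. The function name `divergence_score` and the toy embeddings are hypothetical.

```python
import math

def divergence_score(embeddings, iters=100):
    """Toy score (assumed formula): total variance times the share of
    variance lying outside the top principal direction."""
    k = len(embeddings)
    dim = len(embeddings[0])
    centroid = [sum(v[j] for v in embeddings) / k for j in range(dim)]
    centered = [[v[j] - centroid[j] for j in range(dim)] for v in embeddings]

    # k x k Gram matrix of the centered responses; its nonzero eigenvalues
    # equal those of the covariance matrix.
    gram = [[sum(a * b for a, b in zip(u, w)) / k for w in centered]
            for u in centered]
    total_var = sum(gram[i][i] for i in range(k))  # dispersion magnitude
    if total_var < 1e-12:
        return 0.0  # all responses coincide

    # Power iteration for the largest eigenvalue (variance along the top
    # direction). The start vector is fixed so the result is deterministic.
    vec = [float(i + 1) for i in range(k)]
    top = 0.0
    for _ in range(iters):
        nxt = [sum(g * x for g, x in zip(row, vec)) for row in gram]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm < 1e-12:
            break
        vec = [x / norm for x in nxt]
        top = sum(v * sum(g * x for g, x in zip(row, vec))
                  for v, row in zip(vec, gram))

    anisotropy = top / total_var  # 1.0 means all variance on one axis
    return total_var * (1.0 - anisotropy)

# Hypothetical 2-D "embeddings": paraphrases spread along one direction
# versus answers spread in several directions.
paraphrases = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
multimodal = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
print(divergence_score(paraphrases), divergence_score(multimodal))
```

Under this assumed formula, responses that vary along a single direction score close to zero however far apart they are, while responses spread across several directions score higher, which matches the qualitative behavior the abstract describes.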

[37] NOSE: Neural Olfactory-Semantic Embedding with Tri-Modal Orthogonal Contrastive Learning

Yanyi Su,Hongshuai Wang,Zhifeng Gao,Jun Cheng

Main category: cs.CL

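TL;DR: This paper proposes NOSE, a framework that aligns three modalities (molecular structure, receptor sequence, and natural language description) under orthogonal constraints to address the lack of biological grounding and semantic interpretability in existing olfactory representations; it also introduces a weak positive sample strategy to mitigate the sparsity of olfactory language, achieving SOTA performance and strong zero-shot generalization.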

Details Motivation: Existing olfactory representation methods model only isolated segments of the olfactory pathway, so the learned embeddings lack biological grounding and semantic interpretability. Method: Proposes NOSE (Neural Olfactory-Semantic Embedding), which aligns molecular structure, receptor sequence, and natural language description while decoupling their contributions via orthogonal constraints; introduces a weak positive sample strategy to calibrate semantic similarity and prevent similar odors from being wrongly pushed apart in the feature space. Result: NOSE achieves SOTA performance in extensive experiments and shows excellent zero-shot generalization, confirming strong alignment between its representation space and human olfactory intuition. Conclusion: NOSE integrates multimodal information along the olfactory pathway, improving the biological plausibility and semantic interpretability of the representations and offering a new paradigm for computational modeling of olfaction. Abstract: Olfaction lies at the intersection of chemical structure, neural encoding, and linguistic perception, yet existing representation methods fail to fully capture this pathway. Current approaches typically model only isolated segments of the olfactory pathway, overlooking the complete chain from molecule to receptors to linguistic descriptions. Such fragmentation yields learned embeddings that lack both biological grounding and semantic interpretability. We propose NOSE (Neural Olfactory-Semantic Embedding), a representation learning framework that aligns three modalities along the olfactory pathway: molecular structure, receptor sequence, and natural language description. Rather than simply fusing these signals, we decouple their contributions via orthogonal constraints, preserving the unique encoded information of each modality. To address the sparsity of olfactory language, we introduce a weak positive sample strategy to calibrate semantic similarity, preventing erroneous repulsion of similar odors in the feature space. Extensive experiments demonstrate that NOSE achieves state-of-the-art (SOTA) performance and excellent zero-shot generalization, confirming the strong alignment between its representation space and human olfactory intuition.

[38] EviCare: Enhancing Diagnosis Prediction with Deep Model-Guided Evidence for In-Context Reasoning

Hengyu Zhang,Xuyun Zhang,Pengxiang Zhan,Linhao Luo,Hang Lv,Yanchao Tan,Shirui Pan,Carl Yang

Main category: cs.CL

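TL;DR: EviCare is an in-context reasoning framework that combines deep model guidance with large language models (LLMs) for diagnosis prediction from electronic health records (EHRs), with particular gains in identifying novel diagnoses.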

Details Motivation: Existing LLM-based methods for EHR diagnosis prediction tend to overfit to historically observed diagnoses and struggle to identify novel yet clinically important conditions, so better generalization is needed to support early intervention. Method: Proposes the EviCare framework: (1) deep model inference for candidate diagnosis selection; (2) evidential prioritization for set-based EHRs; (3) relational evidence construction to support novel diagnosis prediction; the three signals are then composed into an adaptive in-context prompt that guides LLM reasoning. Result: On MIMIC-III and MIMIC-IV, EviCare outperforms LLM-only and deep model-only baselines by an average of 20.65% across precision and accuracy metrics, with average improvements of 30.97% on novel diagnosis prediction. Conclusion: By coupling deep models with LLMs, EviCare improves the accuracy, robustness, and interpretability of EHR diagnosis prediction, and is especially suited to the early identification of rare or emerging conditions. Abstract: Recent advances in large language models (LLMs) have enabled promising progress in diagnosis prediction from electronic health records (EHRs). However, existing LLM-based approaches tend to overfit to historically observed diagnoses, often overlooking novel yet clinically important conditions that are critical for early intervention. To address this, we propose EviCare, an in-context reasoning framework that integrates deep model guidance into LLM-based diagnosis prediction. Rather than prompting LLMs directly with raw EHR inputs, EviCare performs (1) deep model inference for candidate selection, (2) evidential prioritization for set-based EHRs, and (3) relational evidence construction for novel diagnosis prediction. These signals are then composed into an adaptive in-context prompt to guide LLM reasoning in an accurate and interpretable manner. Extensive experiments on two real-world EHR benchmarks (MIMIC-III and MIMIC-IV) demonstrate that EviCare achieves significant performance gains, consistently outperforming both LLM-only and deep model-only baselines by an average of 20.65% across precision and accuracy metrics. The improvements are particularly notable in challenging novel diagnosis prediction, yielding average improvements of 30.97%.

[39] Dynamic Adaptive Attention and Supervised Contrastive Learning: A Novel Hybrid Framework for Text Sentiment Classification

Qingyang Li

Main category: cs.CL

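TL;DR: This paper proposes an improved BERT framework that combines dynamic adaptive multi-head attention with supervised contrastive learning to improve sentiment classification of movie reviews, reaching 94.67% accuracy on IMDB.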

Details Motivation: Traditional models such as standard BERT and RNNs struggle to capture long-distance semantic dependencies and to resolve ambiguous emotional expressions in long texts. Method: Adds to a BERT encoder a dynamic adaptive multi-head attention module (which uses a global context pooling vector to regulate the weight of each attention head) and a supervised contrastive learning branch (which tightens intra-class compactness and widens inter-class separation in the embedding space). Result: Achieves 94.67% accuracy on the IMDB dataset, 1.5-2.5 percentage points above strong baselines; the authors describe the model as lightweight, efficient, and extensible. Conclusion: The hybrid framework improves sentiment classification on long texts while balancing modeling capacity and practicality. Abstract: The exponential growth of user-generated movie reviews on digital platforms has made accurate text sentiment classification a cornerstone task in natural language processing. Traditional models, including standard BERT and recurrent architectures, frequently struggle to capture long-distance semantic dependencies and resolve ambiguous emotional expressions in lengthy review texts. This paper proposes a novel hybrid framework that seamlessly integrates dynamic adaptive multi-head attention with supervised contrastive learning into a BERT-based Transformer encoder. The dynamic adaptive attention module employs a global context pooling vector to dynamically regulate the contribution of each attention head, thereby focusing on critical sentiment-bearing tokens while suppressing noise. Simultaneously, the supervised contrastive learning branch enforces tighter intra-class compactness and larger inter-class separation in the embedding space. Extensive experiments on the IMDB dataset demonstrate that the proposed model achieves competitive performance with an accuracy of 94.67%, outperforming strong baselines by 1.5-2.5 percentage points. The framework is lightweight, efficient, and readily extensible to other text classification tasks.

Mingfei Lu,Yi Zhang,Mengjia Wu,Yue Feng

Main category: cs.CL

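TL;DR: This paper introduces the JurisCQAD dataset and the JurisMA multi-agent framework to address data scarcity, task complexity, and strong contextual dependencies in legal consultation question answering, significantly improving Chinese legal QA performance.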

Details Motivation: Legal consultation question answering (Legal CQA) faces distinctive challenges: scarce high-quality training data, complex task composition, and strong contextual dependencies. Method: Builds JurisCQAD, a large-scale Chinese legal consultation dataset (over 43,000 real-world queries with expert-validated positive and negative responses), designs a legal element graph for structured task decomposition, and proposes JurisMA, a modular multi-agent framework supporting dynamic routing, statutory grounding, and stylistic optimization. Result: Evaluated on a refined LawBench, the system significantly outperforms both general-purpose and legal-domain LLMs across multiple lexical and semantic metrics. Conclusion: Interpretable task decomposition and modular agent collaboration improve context-aware reasoning and practical performance in Legal CQA. Abstract: Legal consultation question answering (Legal CQA) presents unique challenges compared to traditional legal QA tasks, including the scarcity of high-quality training data, complex task composition, and strong contextual dependencies. To address these, we construct JurisCQAD, a large-scale dataset of over 43,000 real-world Chinese legal queries annotated with expert-validated positive and negative responses, and design a structured task decomposition that converts each query into a legal element graph integrating entities, events, intents, and legal issues. We further propose JurisMA, a modular multi-agent framework supporting dynamic routing, statutory grounding, and stylistic optimization. Combined with the element graph, the framework enables strong context-aware reasoning, effectively capturing dependencies across legal facts, norms, and procedural logic. Trained on JurisCQAD and evaluated on a refined LawBench, our system significantly outperforms both general-purpose and legal-domain LLMs across multiple lexical and semantic metrics, demonstrating the benefits of interpretable decomposition and modular collaboration in Legal CQA.

[41] Why Don't You Know? Evaluating the Impact of Uncertainty Sources on Uncertainty Quantification in LLMs

Maiya Goloburda,Roman Vashurin,Fedor Chernogorsky,Nurkhan Laiyk,Daniil Orel,Preslav Nakov,Maxim Panov

Main category: cs.CL

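TL;DR: This paper studies how the diversity of uncertainty sources in large language models (LLMs) affects existing uncertainty quantification (UQ) methods. It introduces a new dataset that explicitly labels uncertainty sources and finds that current UQ methods degrade markedly when uncertainty comes from sources other than knowledge gaps, underscoring the need for UQ methods that distinguish between sources.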

Details Motivation: Most existing UQ methods output a single confidence score, but uncertainty in natural language tasks arises from several distinct sources (model knowledge gaps, output variability, input ambiguity) with different implications, so UQ effectiveness needs to be analyzed per source. Method: Builds a new dataset that explicitly labels uncertainty sources and uses controlled experiments to systematically evaluate UQ methods under each source. Result: Many UQ methods perform well when uncertainty stems solely from model knowledge limitations, but their performance degrades or becomes misleading when other sources (such as input ambiguity or output variability) are present. Conclusion: Current UQ methods cannot distinguish between uncertainty sources; future work should develop UQ methods that explicitly model and respond to different sources to improve LLM reliability and safety in real-world settings. Abstract: As Large Language Models (LLMs) are increasingly deployed in real-world applications, reliable uncertainty quantification (UQ) becomes critical for safe and effective use. Most existing UQ approaches for language models aim to produce a single confidence score, for example, estimating the probability that a model's answer is correct. However, uncertainty in natural language tasks arises from multiple distinct sources, including model knowledge gaps, output variability, and input ambiguity, which have different implications for system behavior and user interaction. In this work, we study how the source of uncertainty impacts the behavior and effectiveness of existing UQ methods. To enable controlled analysis, we introduce a new dataset that explicitly categorizes uncertainty sources, allowing systematic evaluation of UQ performance under each condition. Our experiments reveal that while many UQ methods perform well when uncertainty stems solely from model knowledge limitations, their performance degrades or becomes misleading when other sources are introduced. These findings highlight the need for uncertainty-aware methods that explicitly account for the source of uncertainty in large language models.

[42] Structure-Grounded Knowledge Retrieval via Code Dependencies for Multi-Step Data Reasoning

Xinyi Huang,Mingzhe Lu,Haoyu Dong

Main category: cs.CL

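TL;DR: This paper proposes SGKR (Structure-Grounded Knowledge Retrieval), a framework that organizes domain knowledge with a function-call dependency graph to improve knowledge retrieval for large language models on multi-step data analysis tasks.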

Details Motivation: Retrieval based on lexical or embedding similarity is a poor match for the knowledge that multi-step reasoning actually needs, which is grounded in executable code and the dependency structure of computations. Method: SGKR builds a knowledge graph induced by function-call dependencies; for each question it extracts semantic input and output tags, identifies the dependency paths connecting them, and constructs a task-relevant subgraph; the knowledge and function implementations associated with the subgraph are assembled into a structured context for LLM code generation. Result: On multi-step data analysis benchmarks, SGKR consistently improves solution correctness over no-retrieval and similarity-based retrieval baselines, for both vanilla LLMs and coding agents. Conclusion: Retrieving knowledge by computational structure rather than textual similarity better supports LLMs on domain analysis tasks that require multi-step reasoning. Abstract: Selecting the right knowledge is critical when using large language models (LLMs) to solve domain-specific data analysis tasks. However, most retrieval-augmented approaches rely primarily on lexical or embedding similarity, which is often a weak proxy for the task-critical knowledge needed for multi-step reasoning. In many such tasks, the relevant knowledge is not merely textually related to the query, but is instead grounded in executable code and the dependency structure through which computations are carried out. To address this mismatch, we propose SGKR (Structure-Grounded Knowledge Retrieval), a retrieval framework that organizes domain knowledge with a graph induced by function-call dependencies. Given a question, SGKR extracts semantic input and output tags, identifies dependency paths connecting them, and constructs a task-relevant subgraph. The associated knowledge and corresponding function implementations are then assembled as a structured context for LLM-based code generation. Experiments on multi-step data analysis benchmarks show that SGKR consistently improves solution correctness over no-retrieval and similarity-based retrieval baselines for both vanilla LLMs and coding agents.
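
The retrieval step described in the abstract (tags, then a dependency path, then a subgraph) can be sketched as a breadth-first search over a call graph. This is only an illustration under assumptions: the toy function library, the tag names, and the helper `dependency_path` are hypothetical, and the paper's actual graph construction and tag extraction are not specified here.

```python
from collections import deque

# Hypothetical toy library: each function lists the functions it calls.
CALLS = {
    "load_sales": [],
    "clean_sales": ["load_sales"],
    "monthly_revenue": ["clean_sales"],
    "revenue_growth": ["monthly_revenue"],
    "plot_inventory": [],
}
# Hypothetical semantic tags attached to each function.
TAGS = {
    "load_sales": {"raw_sales"},
    "clean_sales": set(),
    "monthly_revenue": {"revenue"},
    "revenue_growth": {"growth_rate"},
    "plot_inventory": {"inventory"},
}

def dependency_path(calls, tags, input_tag, output_tag):
    """Breadth-first search from functions carrying the output tag, following
    call edges, until a function carrying the input tag is reached.
    Returns the functions on the path, or [] if the tags are not connected."""
    starts = sorted(f for f in calls if output_tag in tags[f])
    queue = deque([f] for f in starts)
    seen = set(starts)
    while queue:
        path = queue.popleft()
        node = path[-1]
        if input_tag in tags[node]:
            return path
        for callee in calls[node]:
            if callee not in seen:
                seen.add(callee)
                queue.append(path + [callee])
    return []

# A question about growth computed from raw sales pulls in only the relevant
# chain of functions; unrelated code such as plot_inventory is left out.
subgraph = dependency_path(CALLS, TAGS, "raw_sales", "growth_rate")
print(subgraph)
```

The point of the sketch is that `clean_sales` is retrieved even though it carries no matching tag: it lies on the dependency path, which is the kind of knowledge that similarity-based retrieval can miss.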

[43] ReFEree: Reference-Free and Fine-Grained Method for Evaluating Factual Consistency in Real-World Code Summarization

Suyoung Bae,CheolWon Na,Jaehoon Lee,Yumin Lee,YunSeok Choi,Jee-Hyong Lee

Main category: cs.CL

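TL;DR: This paper proposes ReFEree, a reference-free, fine-grained method for evaluating factual consistency in real-world code summarization. It defines factual inconsistency criteria specific to code summaries, evaluates them at the segment level together with dependency information, and aggregates the results into a fine-grained score.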

Details Motivation: Existing evaluation methods are designed mainly for short summaries of isolated code snippets and struggle to give fine-grained evaluation of multi-sentence functionality and the dependency context common in real-world code summaries. Method: Proposes ReFEree, which defines factual inconsistency criteria specific to code summaries, evaluates each segment against these criteria together with dependency information, and aggregates the segment-level results into a fine-grained score; also builds a code summarization benchmark with human-annotated factual consistency labels. Result: ReFEree achieves the highest correlation with human judgment among 13 baselines, improving 15-18% over the previous state of the art. Conclusion: ReFEree is an effective and reliable reference-free, fine-grained method for evaluating factual consistency in code summaries and clearly outperforms existing methods. Abstract: As Large Language Models (LLMs) have become capable of generating long and descriptive code summaries, accurate and reliable evaluation of factual consistency has become a critical challenge. However, previous evaluation methods are primarily designed for short summaries of isolated code snippets. Consequently, they struggle to provide fine-grained evaluation of multi-sentence functionalities and fail to accurately assess dependency context commonly found in real-world code summaries. To address this, we propose ReFEree, a reference-free and fine-grained method for evaluating factual consistency in real-world code summaries. We define factual inconsistency criteria specific to code summaries and evaluate them at the segment level using these criteria along with dependency information. These segment-level results are then aggregated into a fine-grained score. We construct a code summarization benchmark with human-annotated factual consistency labels. The evaluation results demonstrate that ReFEree achieves the highest correlation with human judgment among 13 baselines, improving 15-18% over the previous state-of-the-art. Our code and data are available at https://github.com/bsy99615/ReFEree.git.

[44] Lost in Diffusion: Uncovering Hallucination Patterns and Failure Modes in Diffusion Large Language Models

Zhengnan Guo,Fei Tan

Main category: cs.CL

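TL;DR: This paper presents the first controlled comparison of hallucination in diffusion large language models (dLLMs) and autoregressive (AR) models, finding that dLLMs hallucinate more and exhibit distinct failure modes (premature termination, incomplete denoising, and context intrusion) as well as different inference-time compute dynamics.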

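Details Motivation: Although diffusion large language models (dLLMs) show performance comparable to autoregressive models, their faithfulness, particularly with respect to hallucination, has not been studied systematically. Method: Runs a controlled comparison that holds architecture, scale, and pre-training weights fixed to evaluate hallucination patterns in dLLMs and AR models under the same conditions; analyzes generation dynamics as inference-time compute varies; identifies and categorizes failure modes unique to dLLMs. Result: dLLMs hallucinate more than matched AR models; their inference shows divergent compute dynamics, with quasi-autoregressive generation saturating early while non-sequential decoding allows continued refinement; three distinct failure modes are identified: premature termination, incomplete denoising, and context intrusion. Conclusion: Although dLLMs have narrowed the gap to AR models on general tasks, their distinct hallucination mechanisms pose a serious challenge to reliability and call for targeted solutions. Abstract: While Diffusion Large Language Models (dLLMs) have emerged as a promising non-autoregressive paradigm comparable to autoregressive (AR) models, their faithfulness, specifically regarding hallucination, remains largely underexplored. To bridge this gap, we present the first controlled comparative study to evaluate hallucination patterns in dLLMs. Our results demonstrate that current dLLMs exhibit a higher propensity for hallucination than AR counterparts controlled for architecture, scale, and pre-training weights. Furthermore, an analysis of inference-time compute reveals divergent dynamics: while quasi-autoregressive generation suffers from early saturation, non-sequential decoding unlocks potential for continuous refinement. Finally, we identify distinct failure modes unique to the diffusion process, including premature termination, incomplete denoising, and context intrusion. Our findings underscore that although dLLMs have narrowed the performance gap on general tasks, their distinct hallucination mechanisms pose a critical challenge to model reliability. Our code is available at https://github.com/ZeroLoss-Lab/Lost-in-Diffusion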

[45] LLMs Should Incorporate Explicit Mechanisms for Human Empathy

Xiaoxing You,Qiang Huang,Jun Yu

Main category: cs.CL

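TL;DR: This paper argues that large language models (LLMs) should incorporate explicit mechanisms for human empathy. It shows that current LLMs in high-stakes human-centered settings systematically distort human perspectives through sentiment attenuation, empathic granularity mismatch, conflict avoidance, and linguistic distancing, and recommends making empathy-aware objectives, benchmarks, and training signals core to LLM development.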

Details Motivation: As LLMs are widely deployed in high-stakes human-centered settings, success depends not only on correctness or fluency but on faithfully preserving human perspectives; yet even well-aligned, policy-compliant models often attenuate affect, misjudge contextual salience, and rigidify relational stance, distorting meaning. Method: Formalizes empathy as an observable behavioral property (modeling and responding to human perspectives while preserving intention, affect, and context), identifies four mechanisms of empathic failure, organizes the analysis along cognitive, cultural, and relational dimensions, and supports it with empirical analyses. Result: Strong benchmark performance can mask systematic empathic distortions; empathic failures are structural consequences of prevailing training and alignment practices. Conclusion: Empathy-aware objectives, evaluation benchmarks, and training signals should be first-class components of LLM development. Abstract: This paper argues that Large Language Models (LLMs) should incorporate explicit mechanisms for human empathy. As LLMs become increasingly deployed in high-stakes human-centered settings, their success depends not only on correctness or fluency but on faithful preservation of human perspectives. Yet, current LLMs systematically fail at this requirement: even when well-aligned and policy-compliant, they often attenuate affect, misrepresent contextual salience, and rigidify relational stance in ways that distort meaning. We formalize empathy as an observable behavioral property: the capacity to model and respond to human perspectives while preserving intention, affect, and context. Under this framing, we identify four recurring mechanisms of empathic failure in contemporary LLMs (sentiment attenuation, empathic granularity mismatch, conflict avoidance, and linguistic distancing), arising as structural consequences of prevailing training and alignment practices. We further organize these failures along three dimensions: cognitive, cultural, and relational empathy, to explain their manifestation across tasks. Empirical analyses show that strong benchmark performance can mask systematic empathic distortions, motivating empathy-aware objectives, benchmarks, and training signals as first-class components of LLM development.

[46] Early Decisions Matter: Proximity Bias and Initial Trajectory Shaping in Non-Autoregressive Diffusion Language Models

Jiyeon Kim,Sungik Choi,Yongrae Jo,Moontae Lee,Minjoon Seo

Main category: cs.CL

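TL;DR: This paper studies an inherent weakness of non-autoregressive decoding in diffusion-based language models (dLLMs): a proximity bias that causes spatial error propagation. It proposes a lightweight planner combined with end-of-sequence temperature annealing to improve performance on reasoning and planning tasks.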

Details Motivation: Diffusion-based language models (dLLMs) offer parallel generation and bidirectional context modeling, but stable, high-quality fully non-autoregressive decoding remains difficult for reasoning and planning tasks, and the decoding dynamics are not well understood. Method: Systematically analyzes the non-autoregressive decoding dynamics of dLLMs along the temporal axis and identifies a proximity bias in confidence-based decoding; then designs a minimal-intervention mechanism that guides early token selection using a lightweight planner and end-of-sequence temperature annealing. Result: Substantial overall improvement over existing heuristic baselines on a range of reasoning and planning tasks, without significant computational overhead. Conclusion: The performance bottleneck of non-autoregressive dLLMs stems from a spatial locality bias early in decoding, and targeted early guidance mitigates it, pointing toward more efficient and robust non-autoregressive language generation. Abstract: Diffusion-based language models (dLLMs) have emerged as a promising alternative to autoregressive language models, offering the potential for parallel token generation and bidirectional context modeling. However, harnessing this flexibility for fully non-autoregressive decoding remains an open question, particularly for reasoning and planning tasks. In this work, we investigate non-autoregressive decoding in dLLMs by systematically analyzing its inference dynamics along the temporal axis. Specifically, we uncover an inherent failure mode in confidence-based non-autoregressive generation stemming from a strong proximity bias: the tendency for the denoising order to concentrate on spatially adjacent tokens. This local dependency leads to spatial error propagation, rendering the entire trajectory critically contingent on the initial unmasking position. Leveraging this insight, we present a minimal-intervention approach that guides early token selection, employing a lightweight planner and end-of-sequence temperature annealing. We thoroughly evaluate our method on various reasoning and planning tasks and observe substantial overall improvement over existing heuristic baselines without significant computational overhead.

[47] Knowing What to Stress: A Discourse-Conditioned Text-to-Speech Benchmark

Arnon Turetzky,Avihu Dekel,Hagai Aronowitz,Ron Hoory,Yossi Adi

Main category: cs.CL

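TL;DR: This paper introduces the Context-Aware Stress TTS (CAST) benchmark for evaluating whether TTS systems produce contextually appropriate word stress. It finds that current TTS systems perform poorly at inferring stress from textual context alone, whereas text-only language models handle the task well.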

Details Motivation: Modern TTS systems generate expressive speech, but it is unclear whether they can infer semantically appropriate word stress from discourse context alone; the same sentence can convey correction, contrast, or clarification depending on where stress falls, so context-sensitive stress modeling needs to be evaluated. Method: Builds the CAST benchmark of contrastive context pairs (identical sentences with different contexts that require different stressed words), together with an evaluation framework and a synthetic corpus, and compares state-of-the-art TTS systems against text-only language models. Result: Text-only language models reliably recover the intended stress from context, but TTS systems frequently fail to realize it in speech, revealing a consistent gap. Conclusion: Current TTS systems do not adequately model context-driven word-level stress; CAST provides a standardized benchmark and data to support research on context-aware speech synthesis. Abstract: Spoken meaning often depends not only on what is said, but also on which word is emphasized. The same sentence can convey correction, contrast, or clarification depending on where emphasis falls. Although modern text-to-speech (TTS) systems generate expressive speech, it remains unclear whether they infer contextually appropriate stress from discourse alone. To address this gap, we present Context-Aware Stress TTS (CAST), a benchmark for evaluating context-conditioned word-level stress in TTS. Items are defined as contrastive context pairs: identical sentences paired with distinct contexts requiring different stressed words. We evaluate state-of-the-art systems and find a consistent gap: text-only language models reliably recover the intended stress from context, yet TTS systems frequently fail to realize it in speech. We release the benchmark, evaluation framework, construction pipeline and a synthetic corpus to support future work on context-aware speech synthesis.

[48] Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance

Weihua Zheng,Chang Liu,Zhengyuan Liu,Xin Huang,Kui Wu,Muhammad Huzaifah Md Shahrin,Aiti Aw,Roy Ka-Wei Lee

Main category: cs.CL

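TL;DR: This paper introduces a cross-lingual mapping task during pre-training to strengthen cross-lingual alignment in multilingual large language models (LLMs) while preserving monolingual fluency. The approach aims to avoid the need for extensive parallel data, uses a Language Alignment Coefficient to quantify cross-lingual consistency, and clearly outperforms baselines on machine translation, cross-lingual understanding, and cross-lingual question answering.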

Details Motivation: Multilingual LLMs underperform on cross-lingual tasks, mainly because of data imbalance between high- and low-resource languages and monolingual bias in pre-training; existing methods often require extensive parallel data or suffer from training instability. Method: Introduces a bidirectional Cross-Lingual Mapping Task during pre-training so that the model aligns languages within its embedding space, and proposes a Language Alignment Coefficient to robustly quantify cross-lingual consistency. Result: Over strong multilingual baselines, the model gains up to 11.9 BLEU points in machine translation (MT), 6.72 BERTScore-Precision points in cross-lingual question answering (CLQA), and more than 5% accuracy in cross-lingual natural language understanding (CLNLU). Conclusion: Incorporating cross-lingual objectives directly into pre-training improves the cross-lingual ability of multilingual LLMs without harming monolingual performance, offering a route to more balanced multilingual models. Abstract: Multilingual Large Language Models (LLMs) struggle with cross-lingual tasks due to data imbalances between high-resource and low-resource languages, as well as monolingual bias in pre-training. Existing methods, such as bilingual fine-tuning and contrastive alignment, can improve cross-lingual performance, but they often require extensive parallel data or suffer from instability. To address these challenges, we introduce a Cross-Lingual Mapping Task during the pre-training phase, which enhances cross-lingual alignment without compromising monolingual fluency. Our approach bi-directionally maps languages within the LLM embedding space, improving both language generation and comprehension. We further propose a Language Alignment Coefficient to robustly quantify cross-lingual consistency, even in limited-data scenarios. Experimental results on machine translation (MT), cross-lingual natural language understanding (CLNLU), and cross-lingual question answering (CLQA) show that our model achieves gains of up to 11.9 BLEU points in MT, 6.72 points in CLQA BERTScore-Precision, and more than 5% in CLNLU accuracy over strong multilingual baselines. These findings highlight the potential of incorporating cross-lingual objectives into pre-training to improve multilingual LLMs.

[49] Computational Lesions in Multilingual Language Models Separate Shared and Language-specific Brain Alignment

Yang Cui,Jingyuan Sun,Yizheng Sun,Yifan Wang,Yunhao Zhang,Jixing Li,Shaonan Wang,Hongpeng Zhou,John Hale,Chengqing Zong,Goran Nenadic

Main category: cs.CL

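TL;DR: This paper runs computational lesion experiments in multilingual large language models (LLMs) to probe how the brain processes multiple languages. It finds a shared core processing structure with embedded language-specific modules: the shared core is critical for predicting brain activity across languages, while language-specific modules selectively affect brain predictivity for the matched native language.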

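Details Motivation: To investigate how the brain supports language processing across different languages, to test whether multilingual AI is consistent with brain mechanisms, and to overcome the inability of neuroimaging alone to causally separate shared from language-specific processing. Method: Uses six open multilingual LLMs and creates "computational lesions" by zeroing parameters that are important across languages (shared lesions) or especially important for one language (language-specific lesions); then compares intact and lesioned models in predicting fMRI responses to naturalistic story listening in three languages (English, Chinese, French; 112 participants). Result: Lesioning the shared core reduces whole-brain encoding correlation by 60.32%; language-specific lesions preserve cross-language separation in embedding space but clearly weaken fMRI predictivity for the matched native language. Conclusion: Multilingual processing in the brain is better described as a shared backbone with embedded specializations; the method provides a causal computational framework for studying multilingual brain-model alignment. Abstract: How the brain supports language across different languages is a basic question in neuroscience and a useful test for multilingual artificial intelligence. Neuroimaging has identified language-responsive brain regions across languages, but it cannot by itself show whether the underlying processing is shared or language-specific. Here we use six multilingual large language models (LLMs) as controllable systems and create targeted "computational lesions" by zeroing small parameter sets that are important across languages or especially important for one language. We then compare intact and lesioned models in predicting functional magnetic resonance imaging (fMRI) responses during 100 minutes of naturalistic story listening in native English, Chinese and French (112 participants). Lesioning a compact shared core reduces whole-brain encoding correlation by 60.32% relative to intact models, whereas language-specific lesions preserve cross-language separation in embedding space but selectively weaken brain predictivity for the matched native language. These results support a shared backbone with embedded specializations and provide a causal framework for studying multilingual brain-model alignment.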

[50] ProUIE: A Macro-to-Micro Progressive Learning Method for LLM-based Universal Information Extraction

Wenda Liu,Zhigang Song,Shuai Nie,Guangyao Liu,Lisung Chen,Binyu Yang,Yaran Chen,Peng Zhou,Hongzhen Wang,Yuchen Liu,Wenyue Hu,Jiaming Xu,Runyu Shi,Ying Huang

Main category: cs.CL

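TL;DR: This paper proposes ProUIE, a macro-to-micro progressive learning method that improves large language models on universal information extraction (UIE) without introducing any additional information.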

Details Motivation: Existing LLM-based universal information extraction methods often rely on additional information, which complicates training while yielding limited gains. Method: ProUIE has three stages: (i) macro-level Complete Modeling (CM), which learns NER, RE, and EE along their intrinsic difficulty order on the full training data; (ii) meso-level Streamlined Alignment (SA), which uses simplified target formats on sampled data to regularize structured outputs; (iii) micro-level Deep Exploration (DE), which applies GRPO with stepwise fine-grained rewards (SFR) over structural units to guide optimization. Result: On 36 public datasets, ProUIE outperforms strong instruction-tuned baselines on average for NER and RE while using a smaller backbone, and it also shows clear gains in large-scale production-oriented information extraction. Conclusion: ProUIE is a progressive learning framework that improves UIE without external information, combining a lighter model with practical deployment benefits. Abstract: LLM-based universal information extraction (UIE) methods often rely on additional information beyond the original training data, which increases training complexity yet often yields limited gains. To address this, we propose ProUIE, a Macro-to-Micro progressive learning approach that improves UIE without introducing any external information. ProUIE consists of three stages: (i) macro-level Complete Modeling (CM), which learns NER, RE, and EE along their intrinsic difficulty order on the full training data to build a unified extraction foundation, (ii) meso-level Streamlined Alignment (SA), which operates on sampled data with simplified target formats, streamlining and regularizing structured outputs to make them more concise and controllable, and (iii) micro-level Deep Exploration (DE), which applies GRPO with stepwise fine-grained rewards (SFR) over structural units to guide exploration and improve performance. Experiments on 36 public datasets show that ProUIE consistently improves unified extraction, outperforming strong instruction-tuned baselines on average for NER and RE while using a smaller backbone, and it further demonstrates clear gains in large-scale production-oriented information extraction.

[51] Efficient Process Reward Modeling via Contrastive Mutual Information

Nakyung Lee,Sangwoo Hong,Jungwoo Lee

Main category: cs.CL

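TL;DR: This paper proposes contrastive pointwise mutual information (CPMI), an automatic reward labeling method for verifying chain-of-thought (CoT) reasoning steps that substantially reduces computational cost while improving accuracy.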

Details Motivation: Existing verification of chain-of-thought (CoT) steps depends on human annotation or on computationally expensive Monte Carlo (MC) estimation, both of which are costly and slow. Method: Proposes contrastive pointwise mutual information (CPMI), which uses the model's internal probabilities to measure how much a reasoning step increases the mutual information with the correct answer relative to hard-negative alternatives, and uses this as a proxy reward for the step's contribution. Result: CPMI reduces dataset construction time by 84% and token generation by 98% compared to MC estimation, while achieving higher accuracy on process-level evaluations and mathematical reasoning benchmarks. Conclusion: CPMI is an efficient, low-cost, and accurate automatic step-level reward labeling method that offers a practical source of supervision for PRMs and other verifier models. Abstract: Recent research has devoted considerable effort to verifying the intermediate reasoning steps of chain-of-thought (CoT) trajectories using process reward models (PRMs) and other verifier models. However, training a PRM typically requires human annotators to assign reward scores to each reasoning step, which is both costly and time-consuming. Existing automated approaches, such as Monte Carlo (MC) estimation, also demand substantial computational resources due to repeated LLM rollouts. To overcome these limitations, we propose contrastive pointwise mutual information (CPMI), a novel automatic reward labeling method that leverages the model's internal probability to infer step-level supervision while significantly reducing the computational burden of annotating a dataset. CPMI quantifies how much a reasoning step increases the mutual information between the step and the correct target answer relative to hard-negative alternatives. This contrastive signal serves as a proxy for the step's contribution to the final solution and yields a reliable reward. The experimental results show that CPMI-based labeling reduces dataset construction time by 84% and token generation by 98% compared to MC estimation, while achieving higher accuracy on process-level evaluations and mathematical reasoning benchmarks.
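
The abstract does not give the CPMI formula, so the following is only a minimal sketch of one plausible reading: the reward for a step is the gain in log-probability of the correct answer when the step is appended, minus the average gain for hard-negative answers. The function names and the averaging over negatives are assumptions for illustration, not the authors' definition.

```python
import math

def step_gain(logp_with_step: float, logp_without_step: float) -> float:
    """Pointwise gain in an answer's log-probability from appending one step."""
    return logp_with_step - logp_without_step

def contrastive_step_reward(positive, negatives):
    """Gain for the correct answer minus the mean gain over hard negatives.

    positive:  (logp_with_step, logp_without_step) for the correct answer.
    negatives: list of such pairs, one per hard-negative answer (assumed non-empty).
    """
    pos_gain = step_gain(*positive)
    neg_gain = sum(step_gain(*pair) for pair in negatives) / len(negatives)
    return pos_gain - neg_gain

# Hypothetical probabilities: the step doubles the probability of the correct
# answer (0.3 -> 0.6) and halves that of one hard negative (0.2 -> 0.1).
positive = (math.log(0.6), math.log(0.3))
negatives = [(math.log(0.1), math.log(0.2))]
reward = contrastive_step_reward(positive, negatives)
print(round(reward, 3))
```

Under this reading, labeling a step needs only scoring passes over fixed answers rather than repeated rollouts, which is consistent with the reported savings in token generation.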

[52] HeceTokenizer: A Syllable-Based Tokenization Approach for Turkish Retrieval

Senol Gulgonul

Main category: cs.CL

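TL;DR: HeceTokenizer is a syllable-based tokenizer for Turkish that uses the language's six deterministic phonological patterns to build a closed, out-of-vocabulary (OOV)-free vocabulary of about 8,000 unique syllable types. Paired with a BERT-tiny encoder and a fine-grained chunk-based retrieval strategy, it reaches 50.3% Recall@5 on the TQuAD retrieval benchmark, surpassing a morphology-driven baseline that uses a model 200 times larger.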

Details Motivation: To exploit the highly regular syllable structure of Turkish to design a lightweight, OOV-free tokenizer that improves information retrieval in resource-constrained settings. Method: Proposes the syllable-level tokenizer HeceTokenizer with a closed syllable vocabulary; trains a BERT-tiny model with only 1.5M parameters from scratch using a masked language modeling objective; evaluates retrieval with a fine-grained chunk-based strategy. Result: Achieves 50.3% Recall@5 on the TQuAD retrieval task, above the 46.92% reported by a morphology-driven baseline that uses a model 200 times larger. Conclusion: The phonological regularity of Turkish can serve as a strong inductive bias for efficient, low-resource retrieval; in this setting, syllable-level modeling outperformed the morphology-driven baseline. Abstract: HeceTokenizer is a syllable-based tokenizer for Turkish that exploits the deterministic six-pattern phonological structure of the language to construct a closed, out-of-vocabulary (OOV)-free vocabulary of approximately 8,000 unique syllable types. A BERT-tiny encoder (1.5M parameters) is trained from scratch on a subset of Turkish Wikipedia using a masked language modeling objective and evaluated on the TQuAD retrieval benchmark using Recall@5. Combined with a fine-grained chunk-based retrieval strategy, HeceTokenizer achieves 50.3% Recall@5, surpassing the 46.92% reported by a morphology-driven baseline that uses a 200 times larger model. These results suggest that the phonological regularity of Turkish syllables provides a strong and resource-light inductive bias for retrieval tasks.
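
To make the idea of deterministic syllabification concrete, here is a minimal sketch of a common rule-based Turkish syllabifier: every syllable contains exactly one vowel, and a consonant immediately before a vowel opens that vowel's syllable. This is not the HeceTokenizer implementation; the function names are hypothetical, the input is assumed to be already lowercased, and the six patterns are assumed to be V, VC, CV, CVC, VCC, and CVCC.

```python
VOWELS = set("aeıioöuü")

def syllabify(word: str) -> list[str]:
    """Split a lowercase Turkish word into syllables, scanning right to left."""
    syllables = []
    end = len(word)
    i = end - 1
    while i >= 0:
        if word[i] in VOWELS:
            # A consonant directly before the vowel starts this syllable.
            start = i - 1 if i > 0 and word[i - 1] not in VOWELS else i
            syllables.append(word[start:end])
            end = start
            i = start - 1
        else:
            i -= 1
    if end > 0:
        # Leftover leading consonants (e.g. loanwords such as "tren")
        # are attached to the first syllable.
        if syllables:
            syllables[-1] = word[:end] + syllables[-1]
        else:
            syllables.append(word[:end])
    return syllables[::-1]

def pattern(syllable: str) -> str:
    """Map a syllable to its consonant/vowel shape, e.g. 'türk' -> 'CVCC'."""
    return "".join("V" if ch in VOWELS else "C" for ch in syllable)

for w in ["kitap", "okul", "türkçe", "saat"]:
    parts = syllabify(w)
    print(w, parts, [pattern(s) for s in parts])
```

Because the split depends only on the vowel and consonant sequence, any word can be segmented without a learned merge table, which is what makes a closed, OOV-free syllable vocabulary possible.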

[53] Learning and Enforcing Context-Sensitive Control for LLMs

Mohammad Albinhassan,Pranava Madhyastha,Mark Law,Alessandra Russo

Main category: cs.CL

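TL;DR: This paper proposes a framework that automatically learns context-sensitive constraints, enabling even small LLMs (1B parameters) to achieve perfect constraint adherence without manually specified rules.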

Details Motivation: Methods based on context-free grammars (CFGs) struggle to guarantee generation validity, while manually specifying context-sensitive constraints requires specialized expertise and is a significant barrier. Method: A two-phase framework: a syntactic exploration phase gathers diverse outputs through LLM interactions to learn constraints, and a constraint exploitation phase enforces the learned rules during generation. Result: Experiments show that the method enables a small 1B-parameter LLM to reach 100% constraint adherence, outperforming larger models and state-of-the-art reasoning models. Conclusion: This is the first work to integrate context-sensitive grammar learning with LLM generation, removing manual specification while maintaining generation validity. Abstract: Controlling the output of Large Language Models (LLMs) through context-sensitive constraints has emerged as a promising approach to overcome the limitations of Context-Free Grammars (CFGs) in guaranteeing generation validity. However, such constraints typically require manual specification -- a significant barrier demanding specialized expertise. We introduce a framework that automatically learns context-sensitive constraints from LLM interactions through a two-phase process: syntactic exploration to gather diverse outputs for constraint learning, followed by constraint exploitation to enforce these learned rules during generation. Experiments demonstrate that our method enables even small LLMs (1B parameters) to learn and generate with perfect constraint adherence, outperforming larger counterparts and state-of-the-art reasoning models. This work represents the first integration of context-sensitive grammar learning with LLM generation, eliminating manual specification while maintaining generation validity.

[54] QFS-Composer: Query-focused summarization pipeline for less resourced languages

Vuk Đuranović,Marko Robnik Šikonja

Main category: cs.CL

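TL;DR: This paper proposes the QFS-Composer framework, which uses query decomposition, question generation, question answering, and abstractive summarization to improve the factual consistency and user-intent alignment of query-focused summaries in less-resourced languages such as Slovenian.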

Details Motivation: To address the scarcity of labeled data and evaluation tools for query-focused summarization (QFS) in less-resourced languages. Method: Proposes the QFS-Composer framework, which integrates query decomposition, question generation (QG), question answering (QA), and abstractive summarization; builds Slovenian QA and QG models and adapts reference-free summary evaluation approaches. Result: Experiments on Slovenian show that the QA-guided summarization pipeline yields improved consistency and relevance over baseline LLMs. Conclusion: The work establishes an extensible methodology for advancing query-focused summarization in less-resourced languages. Abstract: Large language models (LLMs) demonstrate strong performance in text summarization, yet their effectiveness drops significantly across languages with restricted training resources. This work addresses the challenge of query-focused summarization (QFS) in less-resourced languages, where labeled datasets and evaluation tools are limited. We present a novel QFS framework, QFS-Composer, that integrates query decomposition, question generation (QG), question answering (QA), and abstractive summarization to improve the factual alignment of a summary with user intent. We test our approach on the Slovenian language. To enable high-quality supervision and evaluation, we develop the Slovenian QA and QG models based on a Slovene LLM and adapt evaluation approaches for reference-free summary evaluation. Empirical evaluation shows that the QA-guided summarization pipeline yields improved consistency and relevance over baseline LLMs. Our work establishes an extensible methodology for advancing QFS in less-resourced languages.

[55] Attention Sinks as Internal Signals for Hallucination Detection in Large Language Models

Jakub Binkowski,Kamil Adamczewski,Tomasz Kajdanowicz

Main category: cs.CL

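TL;DR: This paper proposes SinkProbe, which detects hallucinations in large language model generations by analyzing "attention sinks" in the attention mechanism, and reports state-of-the-art results across multiple datasets and models.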

Details Motivation: Large language models frequently hallucinate, producing fluent and confident outputs that are factually incorrect or unsupported by the context; the mechanisms exploited by existing attention-map-based detectors remain poorly understood and need theoretical grounding. Method: Proposes SinkProbe, built on attention sinks (tokens that accumulate disproportionate attention mass); finds that the classifier preferentially relies on sinks whose value vectors have large norms, and shows mathematically that previous methods implicitly depend on attention sinks. Result: SinkProbe achieves state-of-the-art hallucination detection performance across popular datasets and LLMs. Conclusion: Attention sinks are a key mechanism for understanding and detecting hallucinations, and SinkProbe provides a theoretically grounded and practical detection method. Abstract: Large language models frequently exhibit hallucinations: fluent and confident outputs that are factually incorrect or unsupported by the input context. While recent hallucination detection methods have explored various features derived from attention maps, the underlying mechanisms they exploit remain poorly understood. In this work, we propose SinkProbe, a hallucination detection method grounded in the observation that hallucinations are deeply entangled with attention sinks - tokens that accumulate disproportionate attention mass during generation - indicating a transition from distributed, input-grounded attention to compressed, prior-dominated computation. Importantly, although sink scores are computed solely from attention maps, we find that the classifier preferentially relies on sinks whose associated value vectors have large norms. Moreover, we show that previous methods implicitly depend on attention sinks by establishing their mathematical relationship to sink scores. Our findings yield a novel hallucination detection method grounded in theory that produces state-of-the-art results across popular datasets and LLMs.

[56] Expect the Unexpected? Testing the Surprisal of Salient Entities

Jessica Lin,Amir Zeldes

Main category: cs.CL

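TL;DR: This paper studies how the salience of discourse participants (entities) affects the distribution of information density. Globally salient entities have higher surprisal but lower the surprisal of surrounding content, improving overall predictability; the effect varies by genre and refines the competing pressures framework of the UID hypothesis.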

Details Motivation: Previous work on the Uniform Information Density (UID) hypothesis has largely disregarded the relative salience of discourse participants; this paper fills that gap. Method: Uses 70K manually annotated entity mentions across 16 genres of English together with a novel minimal-pair prompting method to analyze the relationship between entity salience and surprisal, controlling for position, length, and nesting confounds. Result: Globally salient entities have significantly higher surprisal than non-salient ones; when used as prompts, salient entities systematically reduce the surprisal of surrounding content and improve document-level predictability; the effect is strongest in topic-coherent texts and weakest in conversational contexts. Conclusion: Global entity salience is a mechanism shaping information distribution in discourse, which refines the competing pressures framework of the UID hypothesis. Abstract: Previous work examining the Uniform Information Density (UID) hypothesis has shown that while information as measured by surprisal metrics is distributed more or less evenly across documents overall, local discrepancies can arise due to functional pressures corresponding to syntactic and discourse structural constraints. However, work thus far has largely disregarded the relative salience of discourse participants. We fill this gap by studying how overall salience of entities in discourse relates to surprisal using 70K manually annotated mentions across 16 genres of English and a novel minimal-pair prompting method. Our results show that globally salient entities exhibit significantly higher surprisal than non-salient ones, even controlling for position, length, and nesting confounds. Moreover, salient entities systematically reduce surprisal for surrounding content when used as prompts, enhancing document-level predictability. This effect varies by genre, appearing strongest in topic-coherent texts and weakest in conversational contexts. Our findings refine the UID competing pressures framework by identifying global entity salience as a mechanism shaping information distribution in discourse.
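
For readers unfamiliar with the metric, surprisal is the negative log-probability a language model assigns to a token in context. The sketch below shows the standard definition in bits; the probabilities are made up for illustration and are not taken from the paper.

```python
import math

def surprisal_bits(probability: float) -> float:
    """Surprisal of a token in bits: -log2 of its in-context probability."""
    return -math.log2(probability)

def mean_surprisal(probabilities) -> float:
    """Average surprisal over a span of tokens (e.g. an entity mention)."""
    values = [surprisal_bits(p) for p in probabilities]
    return sum(values) / len(values)

# Hypothetical token probabilities for an entity mention and its context.
mention = [0.125, 0.25]      # less predictable tokens
context = [0.5, 0.5, 0.25]   # more predictable tokens
print(mean_surprisal(mention), mean_surprisal(context))
```

A lower-probability token carries more bits, so "higher surprisal" for salient entities means the model finds those mentions harder to predict than non-salient ones.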

[57] Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models

Arya Shah,Deepali Mishra,Chaklam Silpasuwanchai

Main category: cs.CL

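TL;DR: This paper studies how sycophancy changes when large language models role-play personas with different personality traits, agreeableness in particular. Higher agreeableness goes with stronger sycophancy, a pattern tested across 13 open-weight models.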

Details Motivation: Role-playing by large language models is known to raise sycophancy concerns, but how specific personality traits of the adopted persona, such as agreeableness, affect sycophantic behavior has not been explored; this paper fills that gap. Method: Builds a benchmark of 275 personas scored on NEO-IPIP agreeableness subscales and systematically tests 13 open-weight language models, ranging from 0.6B to 20B parameters, on 4,950 sycophancy-eliciting prompts spanning 33 topic categories, followed by statistical analysis. Result: 9 of 13 models show a statistically significant positive correlation between persona agreeableness and sycophancy rate, with Pearson correlations reaching r = 0.87 and effect sizes as large as Cohen's d = 2.33. Conclusion: Agreeableness is a reliable predictor of sycophancy in role-play, with implications for deploying role-playing AI systems and for alignment strategies that account for personality. Abstract: Large language models increasingly serve as conversational agents that adopt personas and role-play characters at user request. This capability, while valuable, raises concerns about sycophancy: the tendency to provide responses that validate users rather than prioritize factual accuracy. While prior work has established that sycophancy poses risks to AI safety and alignment, the relationship between specific personality traits of adopted personas and the degree of sycophantic behavior remains unexplored. We present a systematic investigation of how persona agreeableness influences sycophancy across 13 small, open-weight language models ranging from 0.6B to 20B parameters. We develop a benchmark comprising 275 personas evaluated on NEO-IPIP agreeableness subscales and expose each persona to 4,950 sycophancy-eliciting prompts spanning 33 topic categories. Our analysis reveals that 9 of 13 models exhibit statistically significant positive correlations between persona agreeableness and sycophancy rates, with Pearson correlations reaching $r = 0.87$ and effect sizes as large as Cohen's $d = 2.33$. These findings demonstrate that agreeableness functions as a reliable predictor of persona-induced sycophancy, with direct implications for the deployment of role-playing AI systems and the development of alignment strategies that account for personality-mediated deceptive behaviors.
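
The two statistics reported above can be computed as in the following sketch. The abstract does not say which variant of Cohen's d was used, so the pooled-standard-deviation form here is an assumption, and the input numbers are toy values rather than data from the paper.

```python
import math

def pearson_r(xs, ys) -> float:
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def cohens_d(a, b) -> float:
    """Cohen's d with a pooled sample standard deviation (assumed variant)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    pooled = math.sqrt(((len(a) - 1) * va + (len(b) - 1) * vb)
                       / (len(a) + len(b) - 2))
    return (ma - mb) / pooled

# Toy data: persona agreeableness scores vs. sycophancy rates.
agreeableness = [1.0, 2.0, 3.0, 4.0]
sycophancy = [0.10, 0.25, 0.30, 0.55]
print(round(pearson_r(agreeableness, sycophancy), 3))

# Toy data: sycophancy rates for high- vs. low-agreeableness personas.
print(cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0]))
```

By the usual rule of thumb, d around 0.8 already counts as a large effect, so the reported d = 2.33 indicates that the high- and low-agreeableness groups barely overlap.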

[58] Self-Correcting RAG: Enhancing Faithfulness via MMKP Context Selection and NLI-Guided MCTS

Shijia Xu,Zhou Wu,Xiaolong Jia,Yu Wang,Kai Liu,April Xiaowen Dong

Main category: cs.CL

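TL;DR: This paper proposes the Self-Correcting RAG framework, which casts retrieval and generation as constrained optimization and path planning: context selection is optimized as a multi-dimensional multiple-choice knapsack problem (MMKP), and an NLI-guided MCTS mechanism improves reasoning accuracy and reduces hallucinations.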

Details Motivation: Existing RAG methods face two major challenges on complex reasoning tasks: low context utilization and frequent hallucinations. Method: (1) Context selection is formalized as a multi-dimensional multiple-choice knapsack problem (MMKP) that maximizes information density and removes redundancy under a strict token budget; (2) on the generation side, a natural language inference (NLI)-guided Monte Carlo Tree Search (MCTS) dynamically explores reasoning trajectories and validates their faithfulness. Result: On six multi-hop question answering and fact-checking datasets, the method significantly improves reasoning accuracy on complex queries and effectively reduces hallucinations, outperforming strong baselines. Conclusion: By jointly optimizing retrieval and generation, Self-Correcting RAG achieves more efficient and more reliable knowledge-augmented reasoning. Abstract: Retrieval-augmented generation (RAG) substantially extends the knowledge boundary of large language models. However, it still faces two major challenges when handling complex reasoning tasks: low context utilization and frequent hallucinations. To address these issues, we propose Self-Correcting RAG, a unified framework that reformulates retrieval and generation as constrained optimization and path planning. On the input side, we move beyond traditional greedy retrieval and, for the first time, formalize context selection as a multi-dimensional multiple-choice knapsack problem (MMKP), thereby maximizing information density and removing redundancy under a strict token budget. On the output side, we introduce a natural language inference (NLI)-guided Monte Carlo Tree Search (MCTS) mechanism, which leverages test-time compute to dynamically explore reasoning trajectories and validate the faithfulness of generated answers. Experiments on six multi-hop question answering and fact-checking datasets demonstrate that our method significantly improves reasoning accuracy on complex queries while effectively reducing hallucinations, outperforming strong existing baselines. Our code is available at https://github.com/xjiacs/Self-Correcting-RAG .
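
As an illustration of what an MMKP formulation of context selection looks like, the sketch below picks exactly one option from each group of candidate passages while respecting several resource budgets at once. The grouping (for example, clusters of redundant passages), the value scores, and the second cost dimension are all assumptions for illustration; the exhaustive search is only practical for tiny instances and is not the paper's solver.

```python
from itertools import product

def solve_mmkp(groups, budgets):
    """Choose one item per group, maximizing total value within all budgets.

    groups:  list of groups; each item is a dict with "id", "value", and
             "cost" (a tuple with one entry per budget dimension).
    budgets: tuple of per-dimension limits, e.g. (max_tokens, max_passages).
    Returns (best_value, list_of_chosen_ids), or (None, None) if infeasible.
    """
    best_value, best_choice = None, None
    for choice in product(*groups):  # every one-per-group combination
        used = [sum(item["cost"][d] for item in choice)
                for d in range(len(budgets))]
        if any(u > b for u, b in zip(used, budgets)):
            continue  # violates at least one budget dimension
        value = sum(item["value"] for item in choice)
        if best_value is None or value > best_value:
            best_value, best_choice = value, [item["id"] for item in choice]
    return best_value, best_choice

# Hypothetical candidates: each group holds alternative passages covering the
# same sub-question; cost = (tokens, passages).
groups = [
    [{"id": "a_long", "value": 5.0, "cost": (300, 1)},
     {"id": "a_short", "value": 3.0, "cost": (100, 1)}],
    [{"id": "b_long", "value": 4.5, "cost": (250, 1)},
     {"id": "b_short", "value": 2.0, "cost": (80, 1)}],
]
print(solve_mmkp(groups, budgets=(400, 2)))
```

Choosing one item per group is what removes redundancy in this sketch, while the token budget forces a trade-off between long, informative passages and short ones.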

[59] BlasBench: An Open Benchmark for Irish Speech Recognition

Jyoutir Raj,John Conway

Main category: cs.CL

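TL;DR: This paper presents BlasBench, an open evaluation harness for Irish ASR with Irish-aware text normalisation, and benchmarks 12 ASR systems on Common Voice and FLEURS, revealing a generalisation gap between the two datasets.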

Details Motivation: No open Irish-specific benchmark exists for comparing end-user ASR systems under a shared Irish-aware evaluation protocol. Method: Builds the BlasBench evaluation harness with Irish-aware text normalisation that preserves fadas, lenition, and eclipsis; benchmarks 12 ASR systems across four architecture families on Common Voice ga-IE and FLEURS ga-IE. Result: All Whisper variants exceed 100% WER; the best open model, omniASR LLM 7B, achieves 30.65% WER on Common Voice and 39.09% on FLEURS; models fine-tuned on Common Voice degrade by 33-43 WER points on FLEURS. Conclusion: Single-dataset evaluation hides generalisation failures; BlasBench exposes them and supports the development of more robust Irish ASR systems. Abstract: No open Irish-specific benchmark compares end-user ASR systems under a shared Irish-aware evaluation protocol. To solve this, we release BlasBench, an open evaluation harness with Irish-aware text normalisation that preserves fadas, lenition, and eclipsis. We benchmark 12 systems across four architecture families on Common Voice ga-IE and FLEURS ga-IE. All Whisper variants exceed 100% WER. The best open model (omniASR LLM 7B) achieves 30.65% WER on Common Voice and 39.09% on FLEURS. We noticed models fine-tuned on Common Voice lose 33-43 WER points on FLEURS, revealing a generalisation gap that is invisible to single-dataset evaluation.
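
A WER above 100% can look like an error, but it follows directly from the definition: word error rate is the word-level edit distance divided by the reference length, so a hypothesis with many inserted words can cost more edits than the reference has words. The sketch below is a plain textbook WER, not BlasBench's implementation, and it assumes a non-empty reference and whitespace tokenization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

print(word_error_rate("tá sé go maith", "tá sé go maith"))  # identical
print(word_error_rate("dia duit", "the a do it now"))        # many insertions
print(word_error_rate("tá sé", "ta se"))                     # fadas dropped
```

The last call shows why normalisation choices matter for Irish: if fadas are significant, a transcript that drops them is scored as entirely wrong, whereas a normaliser that strips accents would hide that difference.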

[60] RCBSF: A Multi-Agent Framework for Automated Contract Revision via Stackelberg Game

Shijia Xu,Yu Wang,Xiaolong Jia,Zhou Wu,Kai Liu,April Xiaowen Dong

Main category: cs.CL

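TL;DR: This paper proposes the Risk-Constrained Bilevel Stackelberg Framework (RCBSF) to make large language models safer and more controllable in automated contract revision. A global guidance agent leads locally constrained revision and verification agents in a hierarchical game, achieving a higher risk resolution rate and better token efficiency with theoretical convergence guarantees.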

Details Motivation: In Legal AI, and automated contract revision in particular, existing large language models suffer from hallucination risk and insufficient behavioral constraints. Method: Proposes the Risk-Constrained Bilevel Stackelberg Framework (RCBSF), which models contract revision as a non-cooperative Stackelberg game with a Global Prescriptive Agent (GPA) as leader and a Constrained Revision Agent (CRA) and a Local Verification Agent (LVA) as followers, and provides theoretical convergence guarantees. Result: On a unified benchmark, RCBSF reaches an average Risk Resolution Rate (RRR) of 84.21%, surpassing iterative baselines while improving token efficiency. Conclusion: By introducing structured risk constraints and hierarchical agent coordination, RCBSF improves the reliability and practicality of LLMs for high-stakes legal text revision. Abstract: Despite the widespread adoption of Large Language Models (LLMs) in Legal AI, their utility for automated contract revision remains impeded by hallucinated safety and a lack of rigorous behavioral constraints. To address these limitations, we propose the Risk-Constrained Bilevel Stackelberg Framework (RCBSF), which formulates revision as a non-cooperative Stackelberg game. RCBSF establishes a hierarchical Leader-Follower structure where a Global Prescriptive Agent (GPA) imposes risk budgets upon a follower system constituted by a Constrained Revision Agent (CRA) and a Local Verification Agent (LVA) to iteratively optimize output. We provide theoretical guarantees that this bilevel formulation converges to an equilibrium yielding strictly superior utility over unguided configurations. Empirical validation on a unified benchmark demonstrates that RCBSF achieves state-of-the-art performance, surpassing iterative baselines with an average Risk Resolution Rate (RRR) of 84.21% while enhancing token efficiency. Our code is available at https://github.com/xjiacs/RCBSF .

[61] Deep-Reporter: Deep Research for Grounded Multimodal Long-Form Generation

Fangda Ye,Zhifei Xie,Yuxin Hu,Yihang Yin,Shurui Huang,Shikai Dong,Jianzhu Bao,Shuicheng Yan

Main category: cs.CL

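TL;DR: This paper proposes Deep-Reporter, an agentic framework for multimodal long-form generation that supports joint image-text retrieval, incremental synthesis, and context management, and introduces a high-quality training dataset and the M2LongBench evaluation benchmark.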

Details Motivation: Existing agentic search frameworks are limited to text and overlook the multimodal evidence that characterizes real-world expert reports, so new methods for multimodal long-form generation are needed. Method: Proposes the Deep-Reporter framework with three components: (i) agentic multimodal search and filtering; (ii) checklist-guided incremental image-text synthesis; (iii) recurrent context management. The authors also build 8K high-quality agentic traces and the M2LongBench benchmark. Result: Experiments show that multimodal long-form generation is challenging, especially in multimodal selection and integration, and that effective post-training can bridge the gap. Conclusion: Deep-Reporter offers a unified, extensible agentic paradigm for multimodal long-form generation and advances grounded, multimodal generation. Abstract: Recent agentic search frameworks enable deep research via iterative planning and retrieval, reducing hallucinations and enhancing factual grounding. However, they remain text-centric, overlooking the multimodal evidence that characterizes real-world expert reports. We introduce a pressing task: multimodal long-form generation. Accordingly, we propose Deep-Reporter, a unified agentic framework for grounded multimodal long-form generation. It orchestrates: (i) Agentic Multimodal Search and Filtering to retrieve and filter textual passages and information-dense visuals; (ii) Checklist-Guided Incremental Synthesis to ensure coherent image-text integration and optimal citation placement; and (iii) Recurrent Context Management to balance long-range coherence with local fluency. We develop a rigorous curation pipeline producing 8K high-quality agentic traces for model optimization. We further introduce M2LongBench, a comprehensive testbed comprising 247 research tasks across 9 domains and a stable multimodal sandbox. Extensive experiments demonstrate that long-form multimodal generation is a challenging task, especially in multimodal selection and integration, and effective post-training can bridge the gap.

[62] How You Ask Matters! Adaptive RAG Robustness to Query Variations

Yunah Jang,Megha Sundriyal,Kyomin Jung,Meeyoung Cha

Main category: cs.CL

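TL;DR: This paper introduces the first large-scale benchmark for evaluating the robustness of Adaptive Retrieval-Augmented Generation (Adaptive RAG) to queries that are semantically identical but differ in surface form. Adaptive RAG proves fragile under small query changes, and larger models do not improve robustness.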

Details Motivation: Real-world queries vary in surface form even when their intent is the same, and the effect of this variation on Adaptive RAG remains under-explored. Method: Builds a large-scale benchmark of semantically identical yet diverse query variations, combining human-written and model-generated rewrites, and systematically evaluates Adaptive RAG robustness along three dimensions: answer quality, computational cost, and retrieval decisions. Result: Finds a critical robustness gap: small surface-level changes in queries dramatically alter retrieval behavior and accuracy; larger models perform better, but robustness does not improve accordingly. Conclusion: Adaptive RAG methods are highly sensitive to query variations that preserve semantics, exposing a critical robustness challenge. Abstract: Adaptive Retrieval-Augmented Generation (RAG) promises accuracy and efficiency by dynamically triggering retrieval only when needed and is widely used in practice. However, real-world queries vary in surface form even with the same intent, and their impact on Adaptive RAG remains under-explored. We introduce the first large-scale benchmark of diverse yet semantically identical query variations, combining human-written and model-generated rewrites. Our benchmark facilitates a systematic evaluation of Adaptive RAG robustness by examining its key components across three dimensions: answer quality, computational cost, and retrieval decisions. We discover a critical robustness gap, where small surface-level changes in queries dramatically alter retrieval behavior and accuracy. Although larger models show better performance, robustness does not improve accordingly. These findings reveal that Adaptive RAG methods are highly vulnerable to query variations that preserve identical semantics, exposing a critical robustness challenge.

[63] Generating Multiple-Choice Knowledge Questions with Interpretable Difficulty Estimation using Knowledge Graphs and Large Language Models

Mehmet Can Şakiroğlu,H. Altay Güvenir,Kamer Kaya

Main category: cs.CL

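TL;DR: This paper proposes a method for generating multiple-choice questions (MCQs) that combines knowledge graphs (KGs) with large language models (LLMs), and adds a data-driven difficulty estimate that fuses nine difficulty signals, improving question quality and making difficulty interpretable.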

Details Motivation: Automated MCQ-generation systems used in adaptive education struggle to estimate question difficulty accurately, and existing methods lack structured knowledge support and interpretable difficulty modeling. Method: Uses an LLM to construct a knowledge graph from input documents; generates stems from a KG node together with a sampled triple or quintuple; selects distractors from the KG; computes nine difficulty signals and combines them into a unified difficulty score. Result: Experiments show that the generated MCQs are of high quality and that the difficulty estimates are interpretable and align with human perceptions. Conclusion: By integrating KGs, LLMs, and data-driven difficulty modeling, the method improves the practicality and trustworthiness of automated MCQ generation. Abstract: Generating multiple-choice questions (MCQs) with difficulty estimation remains challenging in automated MCQ-generation systems used in adaptive, AI-assisted education. This study proposes a novel methodology for generating MCQs with difficulty estimation from the input documents by utilizing knowledge graphs (KGs) and large language models (LLMs). Our approach uses an LLM to construct a KG from input documents, from which MCQs are then systematically generated. Each MCQ is generated by selecting a node from the KG as the key, sampling a related triple or quintuple -- optionally augmented with an extra triple -- and prompting an LLM to generate a corresponding stem from these graph components. Distractors are then selected from the KG. For each MCQ, nine difficulty signals are computed and combined into a unified difficulty score using a data-driven approach. Experimental results demonstrate that our method generates high-quality MCQs whose difficulty estimation is interpretable and aligns with human perceptions. Our approach improves automated MCQ generation by integrating structured knowledge representations with LLMs and a data-driven difficulty estimation model.

[64] Do BERT Embeddings Encode Narrative Dimensions? A Token-Level Probing Analysis of Time, Space, Causality, and Character in Fiction

Beicheng Bei,Hannah Hyesun Chun,Chen Guo,Arwa Saghiri

Main category: cs.CL

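TL;DR: This study asks whether BERT embeddings encode four dimensions of fictional narrative semantics (time, space, causality, character). Using an LLM-assisted annotated dataset and linear probes, it finds that BERT does encode meaningful narrative information, but the dimensions are not discretely separable.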

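Details Motivation: Narrative understanding requires multidimensional semantic structure, and it is unclear whether pretrained language models such as BERT implicitly encode the key semantic dimensions of fictional narrative (time, space, causality, character). Method: Uses a large language model to accelerate construction of a token-level dataset labeled with five classes (time, space, causality, character, and "others"); trains a linear probe on BERT embeddings and compares it against a control probe on variance-matched random embeddings; supplements this with confusion matrix and clustering analysis (evaluated by ARI). Result: The linear probe reaches 94% accuracy on BERT embeddings, well above the random baseline (47%); macro-average recall is 0.83, with moderate performance on rare categories such as causality (0.75) and space (0.66); "Boundary Leakage" occurs, where rare categories are misclassified as "others"; unsupervised clustering aligns near-randomly with the true categories (ARI = 0.081). Conclusion: BERT embeddings encode narrative semantic information, but not as clean, discrete clusters; the dimensions are represented in a more entangled way. Future work includes a POS-only baseline, expanded datasets, and layer-wise probing. Abstract: Narrative understanding requires multidimensional semantic structures. This study investigates whether BERT embeddings encode dimensions of fictional narrative semantics -- time, space, causality, and character. Using an LLM to accelerate annotation, we construct a token-level dataset labeled with these four narrative categories plus "others." A linear probe on BERT embeddings (94% accuracy) significantly outperforms a control probe on variance-matched random embeddings (47%), confirming that BERT encodes meaningful narrative information. With balanced class weighting, the probe achieves a macro-average recall of 0.83, with moderate success on rare categories such as causality (recall = 0.75) and space (recall = 0.66). However, confusion matrix analysis reveals "Boundary Leakage," where rare dimensions are systematically misclassified as "others." Clustering analysis shows that unsupervised clustering aligns near-randomly with predefined categories (ARI = 0.081), suggesting that narrative dimensions are encoded but not as discretely separable clusters. Future work includes a POS-only baseline to disentangle syntactic patterns from narrative encoding, expanded datasets, and layer-wise probing.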
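
The gap between 94% accuracy and 0.83 macro-average recall is what one would expect with imbalanced classes: accuracy is dominated by the frequent "others" class, while macro recall weights every class equally. A small sketch with toy labels (not the paper's data) shows the effect.

```python
from collections import Counter

def accuracy(y_true, y_pred) -> float:
    """Fraction of tokens whose predicted label matches the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_recall(y_true, y_pred) -> dict:
    """Recall for each gold class: correct predictions / class support."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / n for label, n in support.items()}

def macro_recall(y_true, y_pred) -> float:
    """Unweighted mean of per-class recall."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)

# Toy example: a rare "space" token leaks into the frequent "others" class.
y_true = ["others"] * 8 + ["space"] * 2
y_pred = ["others"] * 8 + ["space", "others"]
print(accuracy(y_true, y_pred), macro_recall(y_true, y_pred))
```

Here one misclassified rare token costs only ten points of accuracy but a quarter of the macro recall, mirroring the "Boundary Leakage" pattern described in the abstract.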

[65] When Meaning Isn't Literal: Exploring Idiomatic Meaning Across Languages and Modalities

Sarmistha Das,Shreyas Guha,Suvrayan Bandyopadhyay,Salisa Phosit,Kitsuchart Pasupa,Sriparna Saha

Main category: cs.CL

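TL;DR: This paper presents Mediom, a multilingual multimodal idiom corpus, and HIDE, an idiom explanation framework, to improve how large models understand culturally grounded figurative idioms (such as "grapes are sour") and to address the systematic weakness of current models in idiomatic reasoning.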

Details Motivation: Existing language models rely heavily on surface-level lexical and semantic cues and struggle with idioms that are deeply tied to metaphor and culture, leaving a systematic blind spot. Method: Builds Mediom, a multilingual multimodal corpus of 3,533 Hindi, Bengali, and Thai idioms with gold-standard explanations, cross-lingual translations, and aligned text-image representations, and designs HIDE, a hinting-based idiom explanation framework that uses error-feedback retrieval and diagnostic cues. Result: Evaluation on Mediom exposes systematic failures of large language models and vision-language models in understanding idiomatic metaphor; the HIDE framework is proposed to mitigate these gaps through iterative reasoning refinement. Conclusion: Together, Mediom and HIDE provide a test bed and methodology for culturally grounded, multimodal idiom understanding with embedded reasoning hints. Abstract: Idiomatic reasoning, deeply intertwined with metaphor and culture, remains a blind spot for contemporary language models, whose progress skews toward surface-level lexical and semantic cues. For instance, the Bengali idiom আঙ্গুর ফল টক (angur fol tok, "grapes are sour"): it encodes denial-driven rationalization, yet naive models latch onto the literal fox-and-grape imagery. Addressing this oversight, we present "Mediom," a multilingual, multimodal idiom corpus of 3,533 Hindi, Bengali, and Thai idioms, each paired with gold-standard explanations, cross-lingual translations, and carefully aligned text--image representations. We benchmark both large language models (textual reasoning) and vision-language models (figurative disambiguation) on Mediom, exposing systematic failures in metaphor comprehension. To mitigate these gaps, we propose "HIDE," a Hinting-based Idiom Explanation framework that leverages error-feedback retrieval and targeted diagnostic cues for iterative reasoning refinement. Collectively, Mediom and HIDE establish a rigorous test bed and methodology for culturally grounded, multimodal idiom understanding embedded with reasoning hints in next-generation AI systems.

[66] TInR: Exploring Tool-Internalized Reasoning in Large Language Models

Qiancheng Xu,Yongqi Li,Fan Liu,Hongru Wang,Min Yang,Wenjie Li

Main category: cs.CL

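TL;DR: This paper proposes TInR-U, a tool-internalized reasoning framework that internalizes tool knowledge into large language models through a three-phase training pipeline (bidirectional knowledge alignment, supervised fine-tuning warm-up, reinforcement learning), unifying reasoning and tool usage and improving efficiency.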

Details Motivation: Existing Tool-Integrated Reasoning (TIR) methods depend on external tool documentation, which leads to tool mastery difficulty, tool size constraints, and inference inefficiency; this motivates a new paradigm, Tool-Internalized Reasoning (TInR), in which tool knowledge is internalized into the model. Method: Proposes the TInR-U framework with three training phases: (1) tool internalization with a bidirectional knowledge alignment strategy; (2) supervised fine-tuning warm-up using high-quality reasoning annotations; (3) reinforcement learning with TInR-specific rewards. Result: Evaluation across in-domain and out-of-domain settings shows that TInR-U achieves superior performance in both. Conclusion: Tool-internalized reasoning is an effective route to stronger LLM reasoning, and TInR-U achieves both tool internalization and tool-reasoning coordination with good generalization. Abstract: Tool-Integrated Reasoning (TIR) has emerged as a promising direction by extending Large Language Models' (LLMs) capabilities with external tools during reasoning. Existing TIR methods typically rely on external tool documentation during reasoning. However, this leads to tool mastery difficulty, tool size constraints, and inference inefficiency. To mitigate these issues, we explore Tool-Internalized Reasoning (TInR), aiming at facilitating reasoning with tool knowledge internalized into LLMs. Achieving this goal presents notable requirements, including tool internalization and tool-reasoning coordination. To address them, we propose TInR-U, a tool-internalized reasoning framework for unified reasoning and tool usage. TInR-U is trained through a three-phase pipeline: 1) tool internalization with a bidirectional knowledge alignment strategy; 2) supervised fine-tuning warm-up using high-quality reasoning annotations, and 3) reinforcement learning with TInR-specific rewards. We comprehensively evaluate our method across in-domain and out-of-domain settings. Experiment results show that TInR-U achieves superior performance in both settings, highlighting its effectiveness and efficiency.

[67] Position-Agnostic Pre-Projection for Transformer Attention: Nonlinear Feature Construction and Content Skip Before Q/K/V

Chirag Shinde

Main category: cs.CL

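TL;DR: This paper proposes two complementary modifications to Transformer attention blocks: (1) a non-linear pre-projection MLP inserted between layer norm and the Q/K/V projections, which builds richer features in a position-agnostic way; (2) a content skip connection that routes the pre-projection features around the attention mechanism so that content information is preserved. On Pythia models the approach improves LAMBADA accuracy and lowers perplexity without adding K/V cache overhead.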

Details Motivation: In standard Transformers, positional information is coupled with content features early, which may hurt pure content modeling, and deeper attention layers may over-rely on position at the expense of content representation. Method: (1) Inserts a non-linear MLP after LayerNorm and before the Q/K/V projections; (2) adds the MLP output to the attention output through a skip connection with learned weights; (3) leaves the rest of the architecture unchanged, with no modification to positional encoding or the K/V cache. Result: +40.6% LAMBADA accuracy and -39% perplexity on Pythia-160M; the learned skip weights show that later layers rely on the content bypass more strongly than earlier layers; no additional K/V cache overhead. Conclusion: Position-agnostic content pre-processing combined with a controllable content skip connection strengthens the modeling of content information in Transformers, with the largest benefit in deeper layers. Abstract: We propose two complementary modifications to transformer attention blocks. First, a non-linear pre-projection MLP is inserted between layer norm and Q/K/V projections, constructing richer features in a position-agnostic manner before any positional encoding is applied. Second, a content skip connection routes the pre-projection's features around the attention mechanism, allowing content information to bypass position-aware attention where beneficial. In frozen-probe experiments on Pythia-160M and 410M, the combined approach achieves the strongest results across methods: +40.6% LAMBADA accuracy and -39% perplexity at 160M scale. Learned skip connection weights reveal a consistent pattern across model sizes: later transformer layers activate the content bypass more strongly than earlier layers, suggesting that deeper layers benefit from content information that does not pass through positional attention. All modifications add no K/V cache overhead.
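
The data flow of the two modifications can be sketched in plain Python as below. This is a single-head, causal toy version written only to show where the pre-projection MLP and the content skip sit; the ReLU activation, the absence of an output projection and of positional encoding, the scalar skip weight `alpha`, and all parameter names are assumptions, not details taken from the paper.

```python
import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    return [v / sum(e) for v in e]

def pre_projection(h, W1, W2):
    """Per-token MLP (no cross-token mixing, hence position-agnostic)."""
    hidden = [max(0.0, v) for v in matvec(W1, h)]  # ReLU, chosen for brevity
    return matvec(W2, hidden)

def attention_block(X, p, alpha):
    """x_t + Attention(features)_t + alpha * features_t for each token t."""
    feats = [pre_projection(layer_norm(x), p["W1"], p["W2"]) for x in X]
    Q = [matvec(p["Wq"], f) for f in feats]
    K = [matvec(p["Wk"], f) for f in feats]
    V = [matvec(p["Wv"], f) for f in feats]
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for t, x in enumerate(X):
        # Causal attention: token t attends to positions 0..t.
        scores = [scale * sum(q * k for q, k in zip(Q[t], K[s]))
                  for s in range(t + 1)]
        w = softmax(scores)
        attn = [sum(w[s] * V[s][d] for s in range(t + 1))
                for d in range(len(x))]
        # Residual + attention path + content skip around attention.
        out.append([xi + a + alpha * f
                    for xi, a, f in zip(x, attn, feats[t])])
    return out

I = [[1.0, 0.0], [0.0, 1.0]]
Z = [[0.0, 0.0], [0.0, 0.0]]
params = {"W1": I, "W2": I, "Wq": I, "Wk": I, "Wv": Z}
X = [[1.0, -1.0], [2.0, 0.0]]
print(attention_block(X, params, alpha=1.0))
```

Since keys and values are still computed once per token from the same features, a design like this leaves the K/V cache size unchanged, in line with the abstract's claim.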

[68] Advancing Polish Language Modeling through Tokenizer Optimization in the Bielik v3 7B and 11B Series

Krzysztof Ociepa,Łukasz Flis,Remigiusz Kinas,Krzysztof Wróbel,Adrian Gwoździej

Main category: cs.CL

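TL;DR: This report describes the development of the Bielik v3 PL series of language models (7B and 11B parameters), focusing on a tokenizer customized for Polish, together with FOCUS-based initialization, multi-stage pretraining, and several post-training alignment methods.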

Details Motivation: The universal tokenizers used by general-purpose models fail to capture the morphological detail of specific languages such as Polish, which leads to inefficient tokenization, higher inference costs, and a restricted effective context window. Method: Replaces the universal Mistral-based tokenizer with a dedicated Polish-optimized vocabulary; uses FOCUS-based embedding initialization; applies a multi-stage pretraining curriculum; performs post-training alignment with Supervised Fine-Tuning, Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO) with verifiable rewards. Result: The report presents the Polish-optimized Bielik v3 7B and 11B models, targeting better tokenization efficiency, lower inference cost, and a larger effective context window for Polish. Conclusion: A language-specific tokenizer combined with a tailored training pipeline is a key route to better language-specific LLMs, and Bielik v3 offers a reusable recipe for optimizing LLMs for regional languages. Abstract: The development of the Bielik v3 PL series, encompassing both the 7B and 11B parameter variants, represents a significant milestone in the field of language-specific large language model (LLM) optimization. While general-purpose models often demonstrate impressive multilingual capabilities, they frequently suffer from a fundamental architectural inefficiency: the use of universal tokenizers. These tokenizers, typically designed to cover a broad spectrum of languages, often fail to capture the morphological nuances of specific languages like Polish, leading to higher fertility ratios, increased inference costs, and restricted effective context windows. This report details the transition from the universal Mistral-based tokenization to a dedicated Polish-optimized vocabulary for the Bielik v3 models, exploring the FOCUS-based embedding initialization, the multi-stage pretraining curriculum, and the subsequent post-training alignment involving Supervised Fine-Tuning, Direct Preference Optimization, and Reinforcement Learning through Group Relative Policy Optimization with verifiable rewards.
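
The "fertility ratio" mentioned in the abstract is usually measured as the average number of tokens produced per word. The sketch below computes it for two toy tokenizers standing in for a poorly matched and a well matched vocabulary; both tokenizers are hypothetical stand-ins and have nothing to do with the actual Mistral or Bielik tokenizers.

```python
def fertility(texts, tokenize) -> float:
    """Average tokens per whitespace-separated word over a corpus."""
    words = sum(len(t.split()) for t in texts)
    tokens = sum(len(tokenize(t)) for t in texts)
    return tokens / words

def chunk3_tokenizer(text):
    """Toy stand-in for a mismatched vocabulary: 3-character pieces per word."""
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]

def word_tokenizer(text):
    """Toy stand-in for a well matched vocabulary: one token per word."""
    return text.split()

def effective_context_words(context_tokens: int, fert: float) -> float:
    """Roughly how many words fit into a fixed token window."""
    return context_tokens / fert

corpus = ["zażółć gęślą jaźń"]
for tok in (chunk3_tokenizer, word_tokenizer):
    f = fertility(corpus, tok)
    print(tok.__name__, f, effective_context_words(8192, f))
```

Halving fertility doubles the number of words that fit in the same token window and roughly halves the decoding steps per word, which is the link the abstract draws between fertility, inference cost, and effective context.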

[69] OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models

Xiaomeng Hu,Yinger Zhang,Fei Huang,Jianhong Tu,Yang Su,Lianghao Deng,Yuxuan Liu,Yantao Liu,Dayiheng Liu,Tsung-Yi Ho

Main category: cs.CL

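TL;DR: This paper introduces OccuBench, a benchmark covering 100 real-world professional task scenarios. It uses Language World Models (LWMs) to simulate professional environments and a multi-agent synthesis pipeline to generate solvable, difficulty-calibrated, document-grounded evaluation instances. Agents are evaluated on cross-industry task completion and on environmental robustness under injected explicit, implicit, and mixed faults. Findings: no single model dominates all industries; implicit faults are the hardest to detect; larger models, newer generations, and higher reasoning effort perform better; and strong agents are not necessarily strong simulators, while simulator quality is critical for evaluation reliability.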

Details Motivation: Existing AI agent benchmarks cover only the few occupational domains where public environments exist and cannot systematically evaluate professional ability across hundreds of real occupational settings. Method: Proposes the OccuBench benchmark, which uses Language World Models (LWMs) to build 100 professional task scenarios (10 industry categories, 65 specialized domains); a multi-agent synthesis pipeline automatically generates evaluation instances with guaranteed solvability, calibrated difficulty, and document-grounded diversity; evaluation covers two dimensions, task completion and environmental robustness under three kinds of injected faults (explicit errors, implicit data degradation, mixed faults). Result: Across 15 frontier models from 8 families: (1) each model has a distinct occupational capability profile and none dominates all industries; (2) implicit faults (such as truncated data or missing fields) are the hardest, because they lack overt error signals; (3) model size, generation, and reasoning effort all improve performance (GPT-5.2 gains 27.5 points from minimal to maximum reasoning effort); (4) a strong agent is not necessarily a strong simulator, and simulator quality is critical for the reliability of LWM-based evaluation. Conclusion: OccuBench is the first benchmark to support cross-industry, multi-dimensional, fault-robustness evaluation of professional tasks; it reveals capability boundaries and key bottlenecks of AI agents in real work settings and highlights the importance of high-quality environment simulation for trustworthy evaluation. Abstract: AI agents are expected to perform professional work across hundreds of occupational domains (from emergency department triage to nuclear reactor safety monitoring to customs import processing), yet existing benchmarks can only evaluate agents in the few domains where public environments exist. We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language World Models (LWMs) that simulate domain-specific environments through LLM-driven tool response generation. Our multi-agent synthesis pipeline automatically produces evaluation instances with guaranteed solvability, calibrated difficulty, and document-grounded diversity. OccuBench evaluates agents along two complementary dimensions: task completion across professional domains and environmental robustness under controlled fault injection (explicit errors, implicit data degradation, and mixed faults). We evaluate 15 frontier models across 8 model families and find that: (1) no single model dominates all industries, as each has a distinct occupational capability profile; (2) implicit faults (truncated data, missing fields) are harder than both explicit errors (timeouts, 500s) and mixed faults, because they lack overt error signals and require the agent to independently detect data degradation; (3) larger models, newer generations, and higher reasoning effort consistently improve performance. GPT-5.2 improves by 27.5 points from minimal to maximum reasoning effort; and (4) strong agents are not necessarily strong environment simulators. Simulator quality is critical for LWM-based evaluation reliability. OccuBench provides the first systematic cross-industry evaluation of AI agents on professional occupational tasks.

[70] AOP-Smart: A RAG-Enhanced Large Language Model Framework for Adverse Outcome Pathway Analysis

Qinjiang Niu,Lu Yan

Main category: cs.CL

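TL;DR: This paper proposes AOP-Smart, a retrieval-augmented generation (RAG) framework for Adverse Outcome Pathways (AOPs) that uses AOP-Wiki XML data to make large language models more reliable and accurate on AOP question answering, substantially reducing hallucination.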

Details Motivation: Large language models hallucinate on AOP-related tasks, producing content without factual grounding, which undermines their reliability in toxicological research and risk assessment. Method: Build the AOP-Smart framework on the official AOP-Wiki XML data, retrieving knowledge about Key Events (KEs), Key Event Relationships (KERs), and specific AOPs, and combine it with large language models (Gemini, DeepSeek, ChatGPT) for retrieval-augmented generation. Result: On 20 AOP question-answering tasks, adding RAG raised accuracy from 15.0%, 35.0%, and 20.0% to 95.0%, 100.0%, and 95.0% for ChatGPT, DeepSeek, and Gemini, respectively. Conclusion: AOP-Smart significantly alleviates hallucination on AOP knowledge tasks and greatly improves answer accuracy and consistency, offering an effective approach to trustworthy AI in toxicology. Abstract: Adverse Outcome Pathways (AOPs) are an important knowledge framework in toxicological research and risk assessment. In recent years, large language models (LLMs) have gradually been applied to AOP-related question answering and mechanistic reasoning tasks. However, their reliability is still limited by hallucination, in which the model generates content that is inconsistent with facts or lacks evidence. To address this issue, this study proposes an AOP-oriented Retrieval-Augmented Generation (RAG) framework, AOP-Smart. Based on the official XML data from AOP-Wiki, this method uses Key Events (KEs), Key Event Relationships (KERs), and specific AOP information to retrieve relevant knowledge for user questions, thereby improving the reliability of the generated results of large language models. To evaluate the effectiveness of the proposed method, this study constructed a test set containing 20 AOP-related question answering tasks, covering KE identification, upstream and downstream KE retrieval, and complex AOP retrieval tasks. Experiments were conducted on three mainstream large language models, Gemini, DeepSeek, and ChatGPT, and comparative tests were performed under two settings: without RAG and with RAG. The experimental results show that, without using RAG, the accuracies of ChatGPT, DeepSeek, and Gemini were 15.0%, 35.0%, and 20.0%, respectively; after using RAG, their accuracies increased to 95.0%, 100.0%, and 95.0%, respectively. 
The results indicate that AOP-Smart can significantly alleviate the hallucination problem of large language models in AOP knowledge tasks, and greatly improve the accuracy and consistency of their answers.

[71] HTAA: Enhancing LLM Planning via Hybrid Toolset Agentization & Adaptation

Chengrui Huang,Junshuo Zhang,Zhiyuan Ma,Xikun Wang,Ximeng Wang,Menghua Jiang,Gang Zeng,Zhaobing Han,Shen Gao,Shuo Shang

Main category: cs.CL

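TL;DR: This paper proposes HTAA, a hierarchical tool-use framework that encapsulates frequently co-used tools into specialized agent tools and trains the planner with Asymmetric Planner Adaptation, markedly improving the scalability, accuracy, and efficiency of large language models in multi-tool settings.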

Details Motivation: Existing flat tool-calling architectures are inefficient and accumulate errors, making it hard for large language models to scale reliably to hundreds of tools. Method: The Hybrid Toolset Agentization & Adaptation (HTAA) framework comprises (1) toolset agentization, which encapsulates frequently co-used tools into specialized agent tools to shrink the action space, and (2) Asymmetric Planner Adaptation, a trajectory-based training strategy using backward reconstruction and forward refinement, so that the high-level planner and the agent tools are optimized to work together. Result: On the real-world business dataset InfoVerify and several public benchmarks, HTAA significantly raises task success rates, shortens tool-call chains, and lowers context overhead; in production deployment it substantially reduces manual validation effort and operational cost. Conclusion: HTAA is an efficient, scalable, and practical framework for multi-tool coordination in large language models, offering a viable technical path for agent systems in complex real-world settings. Abstract: Enabling large language models to scale and reliably use hundreds of tools is critical for real-world applications, yet challenging due to the inefficiency and error accumulation inherent in flat tool-calling architectures. To address this, we propose Hybrid Toolset Agentization & Adaptation (HTAA), a hierarchical framework for scalable tool-use planning. We propose a novel toolset agentization paradigm, which encapsulates frequently co-used tools into specialized agent tools, thereby reducing the planner's action space and mitigating redundancy. To ensure effective coordination, we design Asymmetric Planner Adaptation, a trajectory-based training paradigm that aligns the high-level planner with agent tools via backward reconstruction and forward refinement. To validate the performance of HTAA, we conduct experiments on a real-world internal dataset, InfoVerify, based on the POI validation workflow of China's largest online ride-hailing platform, featuring long-horizon executable tool trajectories. Experiments on InfoVerify and widely-used benchmarks show that HTAA consistently achieves higher task success rates, requires shorter tool-calling trajectories, and significantly reduces context overhead compared to strong baselines. Furthermore, in a production deployment, HTAA substantially reduces manual validation effort and operational cost, demonstrating its practical efficacy.

[72] Mem$^2$Evolve: Towards Self-Evolving Agents via Co-Evolutionary Capability Expansion and Experience Distillation

Zihao Cheng,Zeming Liu,Yingyu Shan,Xinyi Wang,Xiangrong Zhu,Yunpu Ma,Hongru Wang,Yuhang Guo,Wei Lin,Yunhong Wang

Main category: cs.CL

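TL;DR: This paper proposes Mem$^2$Evolve, a new paradigm of co-evolutionary capability expansion and experience distillation. By integrating experience memory with asset memory, it lets LLM agents dynamically create tools and expert agents while being guided by past experience, making capability growth more efficient and stable.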

Details Motivation: Existing self-evolving frameworks for LLM agents treat experience accumulation and the dynamic creation of assets (tools or expert agents) as isolated processes, overlooking their interdependence: experience evolution is bounded by a static toolset, while asset evolution lacks experiential guidance, leading to limited capability growth and unstable evolution. Method: Introduce the co-evolutionary paradigm of Capability Expansion and Experience Distillation and build the Mem$^2$Evolve framework, which contains an Experience Memory (storing and distilling task-execution experience) and an Asset Memory (managing and generating new tools and expert agents) that feed back into each other and evolve together. Result: Across 6 task categories and 8 benchmarks, Mem$^2$Evolve improves by 18.53% over standard LLMs, 11.80% over agents that evolve only through experience, and 6.46% over agents that evolve only through asset creation; the code is open source. Conclusion: Mem$^2$Evolve achieves more efficient and more stable self-evolution of LLM agents, confirming that co-evolving experience and assets is both necessary and effective. Abstract: While large language model-powered agents can self-evolve by accumulating experience or by dynamically creating new assets (i.e., tools or expert agents), existing frameworks typically treat these two evolutionary processes in isolation. This separation overlooks their intrinsic interdependence: the former is inherently bounded by a manually predefined static toolset, while the latter generates new assets from scratch without experiential guidance, leading to limited capability growth and unstable evolution. To address this limitation, we introduce a novel paradigm of co-evolutionary Capability Expansion and Experience Distillation. Guided by this paradigm, we propose Mem$^{2}$Evolve, which integrates two core components: Experience Memory and Asset Memory. Specifically, Mem$^{2}$Evolve leverages accumulated experience to guide the dynamic creation of assets, thereby expanding the agent's capability space while simultaneously acquiring new experience to achieve co-evolution. Extensive experiments across 6 task categories and 8 benchmarks demonstrate that Mem$^{2}$Evolve achieves improvements of 18.53% over standard LLMs, 11.80% over agents evolving solely through experience, and 6.46% over those evolving solely through asset creation, establishing it as a substantially more effective and stable self-evolving agent framework. Code is available at: https://buaa-irip-llm.github.io/Mem2Evolve.

[73] YIELD: A Large-Scale Dataset and Evaluation Framework for Information Elicitation Agents

Victor De Lima,Grace Hui Yang

Main category: cs.CL

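TL;DR: This paper introduces Information Elicitation Agents (IEAs), which proactively elicit information from users through dialogue in support of institutional or task objectives. It presents YIELD, a 26M-token dataset, formalizes the task as a POMDP, and designs new evaluation metrics; experiments show that training on YIELD improves LLM alignment on information elicitation tasks.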

Details Motivation: Most existing conversational agents are user-driven and fit poorly with real-world settings that require proactively eliciting information (such as academic interviewing, judicial proceedings, and journalistic investigations); agents oriented toward institutional goals are needed. Method: Propose the Information Elicitation Agent (IEA) paradigm; build the YIELD dataset (2,281 human-to-human dialogues, 26M tokens); model information elicitation as a finite-horizon POMDP; design dedicated evaluation metrics; run fine-tuning experiments on multiple foundation LLMs. Result: Across multiple foundation LLMs, training on YIELD improves agreement with real elicitation behavior, and the results are corroborated by human evaluation. Conclusion: IEAs are an important new direction for conversational AI; the YIELD dataset, the POMDP formulation, and the evaluation metrics lay the groundwork for systematic research; code, data, and models are openly released. Abstract: Most conversational agents (CAs) are designed to satisfy user needs through user-driven interactions. However, many real-world settings, such as academic interviewing, judicial proceedings, and journalistic investigations, involve broader institutional decision-making processes and require agents that can elicit information from users. In this paper, we introduce Information Elicitation Agents (IEAs), in which the agent's goal is to elicit information from users to support the agent's institutional or task-oriented objectives. To enable systematic research on this setting, we present YIELD, a 26M-token dataset of 2,281 ethically sourced, human-to-human dialogues. Moreover, we formalize information elicitation as a finite-horizon POMDP and propose novel metrics tailored to IEAs. Pilot experiments on multiple foundation LLMs show that training on YIELD improves their alignment with real elicitation behavior, and the findings are corroborated by human evaluation. We release YIELD under CC BY 4.0. The dataset, project code, evaluation tools, and fine-tuned model adapters are available at: https://github.com/infosenselab/yield.

[74] When Verification Fails: How Compositionally Infeasible Claims Escape Rejection

Muxin Liu,Delip Rao,Grace Kim,Chris Callison-Burch

Main category: cs.CL

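TL;DR: This paper shows that existing scientific claim verification benchmarks are flawed: they cannot distinguish rigorous verification from shortcut reasoning that relies on the most salient constraint. The authors construct compositionally infeasible claims to expose this shortcut behavior and show that current models share a structural reasoning bottleneck.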

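Details Motivation: Existing verification benchmarks cannot tell whether a model checks every constraint under the Closed-World Assumption or merely takes a shortcut through the most salient constraint, which leads to misjudging models' verification ability. Method: Construct compositionally infeasible claims (the salient constraint is supported but a non-salient constraint is contradicted) and measure their acceptance rate across models and modalities; use context interventions to analyze how models' verification thresholds relate to the ROC curve. Result: All models that saturate existing benchmarks markedly over-accept compositionally infeasible claims; differences between models mainly reflect different verification thresholds rather than different reasoning ability; the compositional inference bottleneck is a structural limitation of current verification behavior. Conclusion: Current scientific claim verification models broadly rely on shortcut reasoning, and their performance ceiling is set by a structural compositional inference bottleneck that strategy adjustments such as prompt engineering cannot fundamentally resolve. Abstract: Scientific claim verification, the task of determining whether claims are entailed by scientific evidence, is fundamental to establishing discoveries in evidence while preventing misinformation. This process involves evaluating each asserted constraint against validated evidence. Under the Closed-World Assumption (CWA), a claim is accepted if and only if all asserted constraints are positively supported. We show that existing verification benchmarks cannot distinguish models enforcing this standard from models applying a simpler shortcut called salient-constraint checking, which applies CWA's rejection criterion only to the most salient constraint and accepts when that constraint is supported. Because existing benchmarks construct infeasible claims by perturbing a single salient element, they are insufficient for distinguishing between rigorous claim verification and simple salient-constraint reliance. To separate the two, we construct compositionally infeasible claims where the salient constraint is supported but a non-salient constraint is contradicted. Across model families and modalities, models that otherwise saturate existing benchmarks consistently over-accept these claims, confirming the prevalence of such shortcut reasoning. 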
Via model context interventions, we show that different models and prompting strategies occupy distinct positions on a shared ROC curve, indicating that the gap between model families reflects differences in verification threshold rather than underlying reasoning ability, and that the compositional inference bottleneck is a structural property of current verification behavior that strategy guidance alone cannot overcome.
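
The two acceptance rules contrasted in this abstract can be made concrete with a small sketch. The `Constraint` class, its `salience` score, and the example claim are hypothetical names invented here for illustration and do not come from the paper; the sketch only assumes that each asserted constraint is either supported by the evidence or not.

```python
from dataclasses import dataclass


@dataclass
class Constraint:
    text: str
    salience: float  # hypothetical prominence score of the constraint in the claim
    supported: bool  # True if the evidence positively supports the constraint


def cwa_accept(constraints):
    """Closed-World Assumption: accept only if every constraint is supported."""
    return all(c.supported for c in constraints)


def salient_accept(constraints):
    """Shortcut: test only the most salient constraint and ignore the rest."""
    most_salient = max(constraints, key=lambda c: c.salience)
    return most_salient.supported


# A made-up compositionally infeasible claim: the salient part holds,
# but a less prominent constraint is contradicted by the evidence.
claim = [
    Constraint("main effect is reported", salience=0.9, supported=True),
    Constraint("stated dosage matches the evidence", salience=0.2, supported=False),
]

print(cwa_accept(claim), salient_accept(claim))
```

On this toy claim the CWA rule rejects while the shortcut accepts, which is the over-acceptance pattern the abstract describes.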

[75] When Valid Signals Fail: Regime Boundaries Between LLM Features and RL Trading Policies

Zhengzhe Yang

Main category: cs.CL

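TL;DR: This paper asks whether large language models (LLMs) can generate continuous numerical features that help reinforcement learning trading agents. It proposes a modular pipeline that uses a frozen LLM as a stateless feature extractor and raises the Information Coefficient (IC) through automated prompt optimization, but finds that feature validity does not imply policy robustness: performance degrades when a macroeconomic shock causes distribution shift.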

Details Motivation: Examine whether LLM-generated numerical features actually improve reinforcement learning trading policies rather than merely scoring well on NLP metrics. Method: Build a modular pipeline in which a frozen LLM acts as a stateless feature extractor that maps unstructured text such as news and filings to a fixed-dimensional vector; design an automated prompt-optimization loop driven by the Information Coefficient (IC) that treats the prompt as a discrete hyperparameter. Result: The optimized prompt extracts predictive features (IC > 0.15), but during a distribution shift caused by a macroeconomic shock the LLM features add noise and the augmented agent underperforms a price-only baseline; performance recovers in a calmer test regime, yet macroeconomic state variables remain the most robust driver of policy improvement. Conclusion: Feature-level validity (such as a high IC) does not guarantee policy-level robustness, especially under distribution shift, which highlights a key gap between representation learning and policy generalization. Abstract: Can large language models (LLMs) generate continuous numerical features that improve reinforcement learning (RL) trading agents? We build a modular pipeline where a frozen LLM serves as a stateless feature extractor, transforming unstructured daily news and filings into a fixed-dimensional vector consumed by a downstream PPO agent. We introduce an automated prompt-optimization loop that treats the extraction prompt as a discrete hyperparameter and tunes it directly against the Information Coefficient - the Spearman rank correlation between predicted and realized returns - rather than NLP losses. The optimized prompt discovers genuinely predictive features (IC above 0.15 on held-out data). However, these valid intermediate representations do not automatically translate into downstream task performance: during a distribution shift caused by a macroeconomic shock, LLM-derived features add noise, and the augmented agent underperforms a price-only baseline. In a calmer test regime the agent recovers, yet macroeconomic state variables remain the most robust driver of policy improvement. Our findings highlight a gap between feature-level validity and policy-level robustness that parallels known challenges in transfer learning under distribution shift.
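
The Information Coefficient used as the tuning target is defined in the abstract as the Spearman rank correlation between predicted and realized returns. A minimal pure-Python sketch of that definition follows; the function names and the toy numbers are illustrative assumptions, not the paper's code or data.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tie group order[i..j]
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def information_coefficient(predicted, realized):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(predicted), average_ranks(realized))


# Toy example: four assets with a predicted score and a realized return.
predicted = [0.1, 0.4, 0.2, 0.3]
realized = [-0.01, 0.03, 0.02, 0.01]
ic = information_coefficient(predicted, realized)
```

Because the IC depends only on rank order, a prompt can score well on it without its features being on the same scale as returns, which is one reason a high IC need not carry over to the downstream policy.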

[76] Uncertainty-Aware Web-Conditioned Scientific Fact-Checking

Ashwin Vinod,Katrin Erk

Main category: cs.CL

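TL;DR: This paper proposes a framework for scientific fact-checking built on atomic claim decomposition and uncertainty-gated verification. Through embedding alignment, evidence-grounded checking, and selective retrieval from authoritative web sources, it improves verification accuracy while keeping latency low and interpretability high.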

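Details Motivation: Existing scientific fact-checking systems often hallucinate or reason inconsistently when handling compositional technical claims in specialized domains such as biomedicine and materials science, and perform poorly when limited to an evidence snippet and constrained by data sources and cost/latency. Method: Build a pipeline centered on atomic predicate-argument decomposition: first decompose the claim into atomic facts; align them to local evidence snippets via embeddings; have a lightweight evidence-grounded checker make a binary or three-way (Supported/Refuted/NEI) decision; trigger domain-restricted web search over authoritative sources only when confidence is insufficient; on conflict, abstain (output NEI) rather than override the original context. Result: Surpasses the strongest baselines on multiple benchmarks; with uncertainty gating, web retrieval is triggered for only a minority of atomic facts, lowering average overhead; supports traceable reasoning, controllable latency, and conservative decisions, suiting high-stakes single-document settings. Conclusion: Combining atomic-granularity decomposition with uncertainty gating improves the interpretability, context sensitivity, and deployment practicality of scientific fact-checking, offering a new paradigm for high-reliability specialized verification. Abstract: Scientific fact-checking is vital for assessing claims in specialized domains such as biomedicine and materials science, yet existing systems often hallucinate or apply inconsistent reasoning, especially when verifying technical, compositional claims against an evidence snippet under source and cost/latency constraints. We present a pipeline centered on atomic predicate-argument decomposition and calibrated, uncertainty-gated corroboration: atomic facts are aligned to local snippets via embeddings, verified by a compact evidence-grounded checker, and only facts with uncertain support trigger domain-restricted web search over authoritative sources. The system supports both binary and tri-valued classification; for three-way tasks it predicts one of Supported, Refuted, or NEI. We evaluate under two regimes, Context-Only (no web) and Context+Web (uncertainty-gated web corroboration); when retrieved evidence conflicts with the provided context, we abstain with NEI rather than overriding the context. On multiple benchmarks, our framework surpasses the strongest baselines. In our experiments, web corroboration was invoked for only a minority of atomic facts on average, indicating that external evidence is consulted selectively under calibrated uncertainty rather than routinely. 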
Overall, coupling atomic granularity with calibrated, uncertainty-gated corroboration yields more interpretable and context-conditioned verification, making the approach well-suited to high-stakes, single-document settings that demand traceable rationales, predictable cost/latency, and conservative decisions.
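
The gating behavior described in this abstract (consult the web only under uncertainty, abstain on conflict) can be sketched as a small decision function. Everything here is an assumption made for illustration: the function name, the single confidence threshold of 0.8, and the exact tie-breaking rules are not specified in the abstract.

```python
def gated_verdict(local_label, local_conf, web_search, threshold=0.8):
    """Return (label, web_was_used) for one atomic fact.

    local_label: "Supported", "Refuted", or "NEI" from the context-only checker.
    local_conf:  the checker's confidence in [0, 1] (assumed to be calibrated).
    web_search:  zero-argument callable returning a label from web evidence;
                 it is only called when the local verdict is uncertain.
    """
    if local_conf >= threshold:
        return local_label, False  # confident: no web call, no extra cost

    web_label = web_search()
    if web_label == local_label:
        return local_label, True  # web evidence corroborates the context
    if local_label == "NEI":
        return web_label, True  # context was silent, so there is no conflict
    # Web evidence conflicts with, or fails to corroborate, the context:
    # abstain instead of overriding the provided context.
    return "NEI", True


# Usage with stand-in web lookups (hypothetical, for demonstration only).
print(gated_verdict("Supported", 0.95, lambda: "Refuted"))
print(gated_verdict("Supported", 0.50, lambda: "Refuted"))
```

Passing the web lookup as a callable keeps the expensive step lazy, so confident facts incur no retrieval cost, in line with the abstract's observation that only a minority of atomic facts trigger a search.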

[77] A Systematic Analysis of the Impact of Persona Steering on LLM Capabilities

Jiaqi Chen,Ming Wang,Tingna Xie,Shi Feng,Yongkang Liu

Main category: cs.CL

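TL;DR: This paper studies how injecting personality traits (such as the Big Five) into large language models (LLMs) affects their cognitive capabilities. Persona induction changes not only interaction style but also task performance in a stable, reproducible way; the effects are task-dependent and agree in direction with human personality-cognition relationships 73.68% of the time. Building on this, the authors propose Dynamic Persona Routing (DPR), a lightweight, training-free method that outperforms the best static persona.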

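Details Motivation: Prior work has focused on how persona injection affects LLM interaction style, while its effect on underlying cognitive capabilities remains unexplored; a systematic study of the link between personality traits and cognitive performance is needed. Method: Use the Neuron-based Personality Trait Induction (NPTI) framework to induce Big Five traits in LLMs and evaluate performance on six cognitive benchmarks; analyze task dependence and how effect size varies by trait dimension; check consistency with human personality-cognition relationships; then design the Dynamic Persona Routing (DPR) strategy. Result: Persona induction produces stable, reproducible changes in cognitive performance; Openness and Extraversion have the strongest influence; task dependence is strong (for example, gains on instruction following but losses on complex reasoning); persona effects in LLMs agree in direction with human findings 73.68% of the time; DPR outperforms the best static persona without additional training. Conclusion: Persona is not merely a surface-level style control but a tunable cognitive lever; the personality-cognition mapping shows regularities across humans and models, supporting smarter, more adaptive personalized LLM systems. Abstract: Imbuing Large Language Models (LLMs) with specific personas is prevalent for tailoring interaction styles, yet the impact on underlying cognitive capabilities remains unexplored. We employ the Neuron-based Personality Trait Induction (NPTI) framework to induce Big Five personality traits in LLMs and evaluate performance across six cognitive benchmarks. Our findings reveal that persona induction produces stable, reproducible shifts in cognitive task performance beyond surface-level stylistic changes. These effects exhibit strong task dependence: certain personalities yield consistent gains on instruction-following, while others impair complex reasoning. Effect magnitude varies systematically by trait dimension, with Openness and Extraversion exerting the most robust influence. Furthermore, LLM effects show 73.68% directional consistency with human personality-cognition relationships. Capitalizing on these regularities, we propose Dynamic Persona Routing (DPR), a lightweight query-adaptive strategy that outperforms the best static persona without additional training.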

[78] Shared Emotion Geometry Across Small Language Models: A Cross-Architecture Study of Representation, Behavior, and Methodological Confounds

Jihoon Jeong

Main category: cs.CL

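TL;DR: This paper examines the geometric consistency of 21-emotion vector representations across 12 small language models. Five mature architectures share highly similar emotion representation structure, whereas the immature Gemma-3 1B differs markedly; the paper also shows that the comprehension-vs-generation effect reported in prior work decomposes into four distinct layers.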

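Details Motivation: Investigate whether small language models share a cross-architecture universal representation of emotion semantics, and disentangle methodological effects conflated in prior work (comprehension vs generation, precision, and so on). Method: Under a unified comprehension-mode pipeline at fp16 precision, extract 21-emotion vector sets from 12 small language models (6 architectures x base/instruct); compare their raw cosine RDMs with representational similarity analysis (RSA); compute Spearman correlations between RDMs; examine residual-stream anisotropy, the effect of RLHF, and a layered decomposition of method effects. Result: The emotion RDMs of five mature architectures (Qwen 2.5, SmolLM2, Llama 3.2, Mistral 7B v0.3, Llama 3.1) are highly consistent (rho = 0.74-0.92); differences in behavioral profile (such as MTI Compliance) are not reflected in the underlying emotion representation; Gemma-3 1B, being immature, shows extreme anisotropy and is substantially restructured by RLHF; the effect previously attributed to comprehension vs generation is in fact four confounded layers of effects. Conclusion: The geometry of emotion semantics in small language models is strongly universal across architectures, provided the model has reached sufficient maturity; behavioral differences arise in processing above the basic emotion representation; methodologically, representational differences need a layered decomposition so that a single correlation coefficient does not mislead interpretation. Abstract: We extract 21-emotion vector sets from twelve small language models (six architectures x base/instruct, 1B-8B parameters) under a unified comprehension-mode pipeline at fp16 precision, and compare the resulting geometries via representational similarity analysis on raw cosine RDMs. The five mature architectures (Qwen 2.5 1.5B, SmolLM2 1.7B, Llama 3.2 3B, Mistral 7B v0.3, Llama 3.1 8B) share nearly identical 21-emotion geometry, with pairwise RDM Spearman correlations of 0.74-0.92. This universality persists across diametrically opposed behavioral profiles: Qwen 2.5 and Llama 3.2 occupy opposite poles of MTI Compliance facets yet produce nearly identical emotion RDMs (rho = 0.81), so behavioral facet differences arise above the shared emotion representation. Gemma-3 1B base, the one immature case in our dataset, exhibits extreme residual-stream anisotropy (0.997) and is restructured by RLHF across all geometric descriptors, whereas the five already-mature families show within-family base x instruct RDM correlations of rho >= 0.92 (Mistral 7B v0.3 at rho = 0.985), suggesting RLHF restructures only representations that are not yet organized. 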
Methodologically, we show that what prior work has read as a single comprehension-vs-generation method effect in fact decomposes into four distinct layers -- a coarse method-dependent dissociation, robust sub-parameter sensitivity within generation, a true precision (fp16 vs INT8) effect, and a conflated cross-experiment bias that distorts in opposite directions for different models -- so that a single rho between two prior emotion-vector studies is not a safe basis for interpretation without the layered decomposition.
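
The comparison the abstract relies on, Spearman correlation between cosine representational dissimilarity matrices (RDMs), can be sketched in a few lines. This is a generic RSA illustration under simplifying assumptions (tiny 2-D vectors, no tied dissimilarities); the function names and toy data are invented here and are not the paper's pipeline.

```python
import math


def cosine_rdm(vectors):
    """Dissimilarity matrix with entries 1 - cosine similarity."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    n = len(vectors)
    return [[1.0 - cos(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]


def upper_triangle(rdm):
    """Flatten the entries above the diagonal (each pair counted once)."""
    n = len(rdm)
    return [rdm[i][j] for i in range(n) for j in range(i + 1, n)]


def spearman_no_ties(x, y):
    """Spearman rho via the rank-difference formula; assumes no tied values."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, idx in enumerate(order, start=1):
            result[idx] = rank
        return result

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


def rdm_similarity(vectors_a, vectors_b):
    """Second-order similarity between two models' emotion geometries."""
    return spearman_no_ties(
        upper_triangle(cosine_rdm(vectors_a)),
        upper_triangle(cosine_rdm(vectors_b)),
    )


# Toy data: three "emotion" vectors per model; model_b is model_a rotated by 90 degrees.
model_a = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
model_b = [[0.0, 1.0], [-0.6, 0.8], [-1.0, 0.0]]
rho = rdm_similarity(model_a, model_b)
```

The rotation leaves every pairwise cosine unchanged, so the two toy models have the same geometry even though their raw vectors differ; this invariance is why RDMs can be compared across architectures with different hidden spaces. With 21 emotions, each upper triangle would hold 21 x 20 / 2 = 210 dissimilarities.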

[79] KS-PRET-5M: A 5 Million Word, 12 Million Token Kashmiri Pretraining Dataset

Haq Nawaz Malik,Nahfid Nissar

Main category: cs.CL

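TL;DR: This paper presents KS-PRET-5M, currently the largest publicly available pretraining dataset for Kashmiri, with 5.09 million words, 27.69 million characters, and about 295 thousand unique word types. The data come from digitized archival and literary material and from web text, and went through rigorous cleaning and tokenization to support Kashmiri language model pretraining and related research.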

Details Motivation: Kashmiri lacks a large, high-quality, publicly available pretraining corpus, which limits NLP research and development for the language. Method: Collect raw data from two source classes (digitized archival documents in InPage format and Unicode-native web text); apply an eleven-stage cleaning pipeline that sharply reduces contamination from non-Kashmiri scripts such as Devanagari; tokenize empirically into subwords with google/muril-base-cased. Result: The resulting KS-PRET-5M dataset has 5.09M words, 27.6M characters, 295.4K unique word types, a mean Kashmiri script ratio of 0.9965, only 146 characters of Devanagari contamination in the full dataset, and about 12.13M subword tokens; it is released under CC BY 4.0. Conclusion: KS-PRET-5M fills a gap in foundational resources for Kashmiri NLP and provides reliable, extensible data for pretraining language models, training tokenizers, and computational linguistics research. Abstract: We present KS-PRET-5M, the largest publicly available pretraining dataset for the Kashmiri language, comprising 5,090,244 (5.09M) words, 27,692,959 (27.6M) characters, and a vocabulary of 295,433 (295.4K) unique word types. We assembled the dataset from two source classes: digitized archival and literary material, encompassing literature, news, biographies, novels, poetry, religious scholarship, and academic writing, recovered from the proprietary InPage desktop-publishing format using the converter of Malik \cite{malik2024inpage}, and Unicode-native text collected from Kashmiri-language web sources. All text was processed through an eleven-stage cleaning pipeline that achieves a mean Kashmiri script ratio of 0.9965, reducing Devanagari contamination to 146 characters across the full dataset. We tokenized the dataset empirically using google/muril-base-cased, yielding a subword ratio of 2.383 tokens per word and a total of approximately 12.13 million subword tokens, substantially higher than prior estimates derived from non-Kashmiri Perso-Arabic analogues. KS-PRET-5M is released as a single continuous text stream under CC BY 4.0 to support language model pretraining, tokenizer training, and computational linguistic research for Kashmiri.
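
The headline token count follows from the word count and the reported subword ratio. The short check below uses only numbers stated in the abstract; the variable names, and the derived characters-per-word and Devanagari-share figures, are added here for illustration and are not reported by the authors.

```python
# Figures quoted in the abstract.
WORDS = 5_090_244
CHARACTERS = 27_692_959
TOKENS_PER_WORD = 2.383  # measured with google/muril-base-cased
DEVANAGARI_CHARS = 146

# Derived quantities (illustrative; not stated in the abstract).
estimated_tokens = WORDS * TOKENS_PER_WORD
chars_per_word = CHARACTERS / WORDS
devanagari_share = DEVANAGARI_CHARS / CHARACTERS

print(f"estimated subword tokens: {estimated_tokens / 1e6:.2f}M")
print(f"characters per word:      {chars_per_word:.2f}")
print(f"Devanagari share:         {devanagari_share:.2e}")
```

The product comes to roughly 12.13 million tokens, consistent with the total given in the abstract and with the "12 million token" figure in the title.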

[80] Efficient Training for Cross-lingual Speech Language Models

Yan Zhou,Qingkai Fang,Yun Hong,Yang Feng

Main category: cs.CL

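TL;DR: This paper proposes CSLM, a training method for cross-lingual speech language models. Using discrete speech tokens and a novel alignment strategy, it aligns speech with text and across languages efficiently with limited data, improving generation quality, reducing latency, and scaling well to new languages.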

Details Motivation: Current large language models mainly target the text modality, while building effective end-to-end speech LLMs is hampered by scarce data and the difficulty of extending to more languages; a low-resource, scalable approach to joint cross-lingual speech-text modeling is needed. Method: Propose the Cross-lingual Speech Language Model (CSLM), based on discrete speech tokens, which uses continual pre-training for cross-modal and cross-lingual alignment, combined with instruction fine-tuning that follows a speech-text interleaved chain-of-modality generation process to improve fine-grained modal alignment. Result: On cross-modal tasks, monolingual conversational tasks, and cross-lingual conversational tasks, CSLM shows strong cross-modal alignment and general task ability without massive speech data, and scales well across languages. Conclusion: CSLM is an efficient, scalable training framework for cross-lingual speech LLMs that addresses the key challenges of limited data and multilingual support, opening a new path to natural human-machine speech interaction. Abstract: Currently, large language models (LLMs) predominantly focus on the text modality. To enable more natural human-AI interaction, speech LLMs are emerging, but building effective end-to-end speech LLMs remains challenging due to limited data and the difficulty in expanding to more languages. In this paper, we introduce Cross-lingual Speech Language Model (CSLM), an efficient training method for cross-lingual speech LLMs based on discrete speech tokens. We propose a novel alignment strategy that achieves cross-modal and cross-lingual alignment through continual pre-training. By conducting instruction fine-tuning following a speech-text interleaved chain-of-modality generation process, we enhance modal alignment at a finer granularity, thereby improving generation quality and reducing latency. CSLM aligns different modalities and languages simultaneously without the need for massive speech data, thus exhibiting good language scalability. Evaluations on cross-modal tasks, mono-lingual conversational tasks, and cross-lingual conversational tasks demonstrate CSLM's strong cross-modal alignment capabilities and general task abilities. (Code is available at: https://github.com/ictnlp/CSLM)

[81] BITS Pilani at SemEval-2026 Task 9: Structured Supervised Fine-Tuning with DPO Refinement for Polarization Detection

Atharva Gupta,Dhruv Kumar,Yash Sinha

Main category: cs.CL

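TL;DR: This paper proposes a two-stage method that combines structured supervised fine-tuning with Direct Preference Optimization (DPO) to detect political polarization in multilingual social media text. On the SemEval-2026 POLAR task it substantially improves recall and macro-F1 without additional human annotation.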

Details Motivation: Accurate computational detection of online polarization is challenged by implicit rhetoric, framing effects, and the high cost of human annotation; recent work shows that contextual prompting can turn large language models into strong polarization detectors, but robustness and recall need further improvement. Method: A two-stage approach: first, structured supervised fine-tuning of Qwen 2.5-7B-Instruct with LoRA, using an interpretable slot-filling template (target, claim type, manifestation checklist, justification); then DPO on automatically generated preference pairs to lower the false-negative rate. Result: On the SemEval 2026 POLAR English development set, DPO raises recall from 0.5085 to 0.7797 and macro-F1 by about 5 points, with no additional human annotation cost. Conclusion: As a post-training refinement step, DPO effectively compensates for the recall weakness of supervised fine-tuning in polarization detection, supporting the effectiveness and practicality of preference learning for low-resource polarization identification. Abstract: The POLAR SemEval-2026 Shared Task aims to detect online polarization and focuses on the classification and identification of multilingual, multicultural, and multi-event polarization. Accurate computational detection of online polarization is challenging due to nuanced rhetoric, implicit framing, and the high cost of human-in-the-loop annotation. Building on recent findings that contextual prompting enables large language models to function as strong polarization detectors, we present a two-stage approach for detecting political polarization in social media text that combines structured supervised fine-tuning with Direct Preference Optimization (DPO) refinement. We fine-tune Qwen 2.5-7B-Instruct with LoRA using an interpretable slot-filling template (target, claim type, manifestation checklist, and justification). We then apply DPO with automatically generated preference pairs to reduce costly false negatives. Experiments on the SemEval 2026 POLAR shared task dataset show that preference-based refinement both improves accuracy and decreases false negatives without extra annotation. On the English development set, DPO increases recall from 0.5085 to 0.7797 and improves macro-F1 by ~5 points.

[82] DeCoVec: Building Decoding Space based Task Vector for Large Language Models via In-Context Learning

Feiyang Li,Yile Wang

Main category: cs.CL

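TL;DR: This paper proposes DeCoVec, a training-free, non-invasive task-vector method that builds the task vector directly in the decoding space, uses in-context learning to capture the essence of a task, and injects the vector into decoding to improve generation quality.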

Details Motivation: Existing task-vector methods usually require fine-tuning or invasive manipulation of internal states, which limits flexibility and scalability. Method: Building on in-context learning, DeCoVec constructs a task vector in the decoding space as the difference between the output logit distributions of few-shot and zero-shot prompts, and injects it into the decoding process to steer generation. Result: Across seven LLMs (0.5B-9B) on TruthfulQA, Math-500, and AQUA-RAT, DeCoVec consistently outperforms standard few-shot baselines, with gains of up to +5.50 in average accuracy; it also suppresses generation degeneration and logical flaws, is robust to demonstration ordering, and adds no input token cost. Conclusion: DeCoVec offers a new way to steer large language models that is training-free and non-invasive and needs no weight updates or auxiliary models. Abstract: Task vectors, representing directions in model or activation spaces that encode task-specific behaviors, have emerged as a promising tool for steering large language models (LLMs). However, existing approaches typically require fine-tuning or invasive manipulation of internal states, limiting their flexibility and scalability. We propose DeCoVec (Decoding Space based Task Vector), a training-free and non-invasive framework that constructs task vectors directly in the decoding space by leveraging in-context learning (ICL). Specifically, DeCoVec captures the task essence as the difference between the output logit distributions of few-shot and zero-shot prompts, then steers generation by injecting this vector into the decoding process. Experiments across seven LLMs (0.5B-9B) on TruthfulQA, Math-500, and AQUA-RAT show that DeCoVec consistently outperforms standard few-shot baselines, with gains up to +5.50 average accuracy. Further analysis demonstrates that DeCoVec effectively suppresses generation degeneration and logical flaws while exhibiting strong robustness to demonstration ordering, all without incurring additional input token costs. Our method offers a training-free and non-invasive solution for LLM steering without requiring weight updates or auxiliary models.
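
The core operation described in the abstract, a task vector formed as the difference between few-shot and zero-shot output logits and then added during decoding, can be illustrated on a toy vocabulary. This is a sketch under assumptions: the abstract does not say exactly how the vector is injected, so the additive rule with a scaling factor `alpha`, the function names, and the four-token logits are all hypothetical.

```python
def task_vector(few_shot_logits, zero_shot_logits):
    """Decoding-space task vector: few-shot logits minus zero-shot logits."""
    return [f - z for f, z in zip(few_shot_logits, zero_shot_logits)]


def steer(query_logits, vector, alpha=1.0):
    """Assumed injection rule: add the scaled task vector to the query's logits."""
    return [q + alpha * v for q, v in zip(query_logits, vector)]


def argmax(values):
    """Index of the largest value (greedy next-token choice)."""
    return max(range(len(values)), key=lambda i: values[i])


# Toy logits over a four-token vocabulary (made-up numbers).
zero_shot = [2.0, 1.0, 0.0, 0.0]  # model output without demonstrations
few_shot = [1.0, 3.0, 0.0, 0.0]   # model output with demonstrations
vector = task_vector(few_shot, zero_shot)

# Steering a new zero-shot query shifts the greedy choice from token 0 to token 1.
query_logits = [2.0, 1.5, 0.5, 0.0]
steered = steer(query_logits, vector, alpha=1.0)
print(argmax(query_logits), argmax(steered))
```

Because the adjustment happens on output logits, no weights or hidden states are touched, which matches the abstract's description of the method as training-free and non-invasive.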

[83] How Robust Are Large Language Models for Clinical Numeracy? An Empirical Study on Numerical Reasoning Abilities in Clinical Contexts

Minh-Vuong Nguyen,Fatemeh Shiri,Zhuang Li,Karin Verspoor

Main category: cs.CL

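TL;DR: This paper introduces ClinicNumRobBench, a benchmark for comprehensively evaluating the numerical reasoning of large language models on clinical text (value retrieval, arithmetic computation, relational comparison, and aggregation), with emphasis on robustness across clinical note formats. Experiments show that current LLMs are weak on relational comparison and aggregation, that fine-tuning can harm numeracy, and that format changes significantly affect performance.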

Details Motivation: Existing evaluations of clinical numerical reasoning cover few operation types (mainly arithmetic) and rarely test robustness to the variety of clinical note formats, which is not enough to support safe clinical deployment of LLMs. Method: Build ClinicNumRobBench, a benchmark of 1,624 instances covering four types of clinical numeracy; present longitudinal MIMIC-IV vital-sign data in three semantically equivalent but differently formatted representations (including a real-world note style) as a stress test; generate queries from 42 question templates; evaluate 14 LLMs and analyze the effect of fine-tuning and sensitivity to format. Result: Most models exceed 85% accuracy on value retrieval, but some fall below 15% on relational comparison and aggregation; medical fine-tuning can reduce numeracy by more than 30%; performance drops noticeably under note-style variation, showing that LLMs are sensitive to format. Conclusion: ClinicNumRobBench provides a rigorous testbed for clinically reliable numerical reasoning and exposes key weaknesses of current LLMs in complex clinical numeracy, especially format robustness and higher-order reasoning. Abstract: Large Language Models (LLMs) are increasingly being explored for clinical question answering and decision support, yet safe deployment critically requires reliable handling of patient measurements in heterogeneous clinical notes. Existing evaluations of LLMs for clinical numerical reasoning provide limited operation-level coverage, restricted primarily to arithmetic computation, and rarely assess the robustness of numerical understanding across clinical note formats. We introduce ClinicNumRobBench, a benchmark of 1,624 context-question instances with ground-truth answers that evaluates four main types of clinical numeracy: value retrieval, arithmetic computation, relational comparison, and aggregation. To stress-test robustness, ClinicNumRobBench presents longitudinal MIMIC-IV vital-sign records in three semantically equivalent representations, including a real-world note-style variant derived from the Open Patients dataset, and instantiates queries using 42 question templates. Experiments on 14 LLMs show that value retrieval is generally strong, with most models exceeding 85% accuracy, while relational comparison and aggregation remain challenging, with some models scoring below 15%. Fine-tuning on medical data can reduce numeracy relative to base models by over 30%, and performance drops under note-style variation indicate LLM sensitivity to format. ClinicNumRobBench offers a rigorous testbed for clinically reliable numerical reasoning. Code and data are available at https://github.com/MinhVuong2000/ClinicNumRobBench.

[84] SHARE: Social-Humanities AI for Research and Education

João Gonçalves,Sonia de Jager,Petr Knoth,David Pride,Nick Jelicic

Main category: cs.CL

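TL;DR: This paper introduces the SHARE family of causal language models, designed for the social sciences and humanities (SSH), and the MIRROR user interface. On SSH text modeling, SHARE comes close to large general-purpose models, and MIRROR supports critical review without generating text, balancing AI capability with SSH scholarly norms.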

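Details Motivation: General-purpose large language models are powerful but have not been pretrained or given interaction designs specifically for the textual characteristics and scholarly norms of the social sciences and humanities (SSH), so they fit poorly with the field's needs for critical engagement with text and for respecting its scholarly principles. Method: Develop the SHARE series of causal language models tailored to SSH and evaluate them with the custom SSH Cloze benchmark; design MIRROR, a non-generative user interface for reviewing SSH text inputs while preserving critical engagement. Result: On the SSH Cloze benchmark, SHARE models perform close to the general-purpose model Phi-4, which uses 100 times more tokens; the MIRROR prototype demonstrates an interaction paradigm that generates no text, showing a way to use the models' capabilities without compromising SSH principles. Conclusion: Together, SHARE and MIRROR form a responsible AI approach for SSH: domain-native modeling as the foundation and a non-intrusive human-AI interface as the vehicle, to make generative AI a better and more ethical fit for the humanities and social sciences. Abstract: This intermediate technical report introduces the SHARE family of base models and the MIRROR user interface. The SHARE models are the first causal language models fully pretrained by and for the social sciences and humanities (SSH). Their performance in modelling SSH texts is close to that of general purpose models (Phi-4) which use 100 times more tokens, as shown by our custom SSH Cloze benchmark. The MIRROR user interface is designed for reviewing text inputs from the SSH disciplines while preserving critical engagement. By prototyping a generative AI interface that does not generate any text, we propose a way to harness the capabilities of the SHARE models without compromising the integrity of SSH principles and norms.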

[85] Evaluating Memory Capability in Continuous Lifelog Scenario

Jianjie Zheng,Zhichen Liu,Zhanyu Shen,Jingxiang Qu,Guanhua Chen,Yile Wang,Yang Xu,Yang Liu,Sijie Cheng

Main category: cs.CL

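TL;DR: This paper proposes LifeDialBench, a new benchmark for memory systems that serve wearable devices continuously recording ambient conversations. It contains two subsets, EgoMem and LifeMem, and introduces an online evaluation protocol that preserves temporal causality; experiments find that current sophisticated memory systems fail to outperform a simple RAG baseline, underscoring the importance of high-fidelity context preservation.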

Details Motivation: Existing benchmarks focus on online one-on-one chat or human-AI interaction and neglect the distinct demands of real-world lifelogging, and public lifelogging audio datasets are scarce. Method: Propose a hierarchical synthesis framework to build the LifeDialBench benchmark, comprising EgoMem (based on real-world egocentric videos) and LifeMem (based on a simulated virtual community); design an Online Evaluation protocol that adheres to temporal causality. Result: Current advanced memory systems fail to outperform a simple RAG baseline on this benchmark, indicating that over-designed structures and lossy compression hurt in lifelog scenarios. Conclusion: In lifelog scenarios, high-fidelity context preservation matters more than complex memory mechanisms, and the design of memory systems needs rethinking. Abstract: Nowadays, wearable devices can continuously lifelog ambient conversations, creating substantial opportunities for memory systems. However, existing benchmarks primarily focus on online one-on-one chatting or human-AI interactions, thus neglecting the unique demands of real-world scenarios. Given the scarcity of public lifelogging audio datasets, we propose a hierarchical synthesis framework to curate LifeDialBench, a novel benchmark comprising two complementary subsets: EgoMem, built on real-world egocentric videos, and LifeMem, constructed using a simulated virtual community. Crucially, to address the issue of temporal leakage in traditional offline settings, we propose an Online Evaluation protocol that strictly adheres to temporal causality, ensuring systems are evaluated in a realistic streaming fashion. Our experimental results reveal a counterintuitive finding: current sophisticated memory systems fail to outperform a simple RAG-based baseline. This highlights the detrimental impact of over-designed structures and lossy compression in current approaches, emphasizing the necessity of high-fidelity context preservation for lifelog scenarios. We release our code and data at https://github.com/qys77714/LifeDialBench.
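
The temporal-causality requirement of the Online Evaluation protocol can be sketched as a streaming loop in which each question sees only what was recorded before it. The `Event` type, the `online_evaluate` function, and the answer callback below are hypothetical names for illustration; the benchmark's actual harness is in the linked repository and may differ.

```python
from dataclasses import dataclass


@dataclass
class Event:
    timestamp: int
    kind: str  # "utterance" (goes into memory) or "question" (gets answered)
    text: str


def online_evaluate(stream, answer_fn):
    """Replay events in time order; each question sees only earlier utterances."""
    memory = []
    answers = []
    for event in sorted(stream, key=lambda e: e.timestamp):
        if event.kind == "utterance":
            memory.append(event)
        else:
            # Pass a copy so the answerer cannot observe later additions.
            answers.append(answer_fn(event.text, list(memory)))
    return answers


# Toy stream given out of order: the question at t=2 must not see the utterance at t=3.
stream = [
    Event(3, "utterance", "c"),
    Event(1, "utterance", "a"),
    Event(2, "question", "q1"),
    Event(4, "question", "q2"),
]
visible = online_evaluate(stream, lambda question, memory: [e.text for e in memory])
print(visible)
```

An offline setup that indexes the whole stream before answering would let the first question see utterance "c", which is the temporal leakage the protocol is meant to rule out.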

[86] MathAgent: Adversarial Evolution of Constraint Graphs for Mathematical Reasoning Data Synthesis

Zixiong Yu,Jun Rao,Guhan Chen,Songtao Tian,Bohan Li,Jiansheng Wei,Min Zhang,Xiaojun Meng

Main category: cs.CL

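TL;DR: This paper proposes a hierarchical synthesis framework that formulates mathematical reasoning data synthesis as an unsupervised optimization problem over a constraint graph, and introduces a Legislator-Executor paradigm to increase the logical complexity and diversity of the synthesized data. Models fine-tuned on only 1K synthesized samples outperform those trained on widely used datasets of comparable scale across multiple mathematical benchmarks.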

Details Motivation: Current methods for synthesizing mathematical reasoning data rely on seed data mutation or simple prompt engineering, which are prone to mode collapse and limited logical complexity and struggle to produce high-quality data without human priors. Method: A hierarchical synthesis framework is proposed. The Legislator first performs unsupervised optimization over a constraint graph to produce structured generation blueprints; the Executor then instantiates these blueprints as diverse natural language scenarios, decoupling logical structure design from linguistic realization. Result: Across 10 models from the Qwen, Llama, Mistral, and Gemma series, fine-tuning on only 1K synthesized samples outperforms datasets of comparable scale such as LIMO and s1K on 8 mathematical benchmarks and shows stronger out-of-distribution generalization. Conclusion: By decoupling structure generation from semantic instantiation, the framework increases the logical depth and diversity of synthesized data and offers a new paradigm for high-quality mathematical reasoning data synthesis without human priors. Abstract: Synthesizing high-quality mathematical reasoning data without human priors remains a significant challenge. Current approaches typically rely on seed data mutation or simple prompt engineering, often suffering from mode collapse and limited logical complexity. This paper proposes a hierarchical synthesis framework that formulates data synthesis as an unsupervised optimization problem over a constraint graph followed by semantic instantiation, rather than treating it as a direct text generation task. We introduce a Legislator-Executor paradigm: The Legislator adversarially evolves structured generation blueprints encoding the constraints of the problem, while the Executor instantiates these specifications into diverse natural language scenarios. This decoupling of skeleton design from linguistic realization enables a prioritized focus on constructing complex and diverse logical structures, thereby guiding high-quality data synthesis. Experiments conducted on a total of 10 models across the Qwen, Llama, Mistral, and Gemma series demonstrate that our method achieves notable results: models fine-tuned on 1K synthesized samples outperform widely-used datasets of comparable scale (LIMO, s1K) across eight mathematical benchmarks, exhibiting superior out-of-distribution generalization.

[87] TRACE: An Experiential Framework for Coherent Multi-hop Knowledge Graph Question Answering

Yingxu Wang,Jiaxin Huang,Mengzhu Wang,Nan Yin

Main category: cs.CL

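TL;DR: This paper proposes TRACE, a framework that combines LLM-driven contextual reasoning with the integration of exploration priors to improve the coherence and robustness of reasoning in multi-hop knowledge graph question answering.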

Details Motivation: Existing methods treat each reasoning step independently and fail to make effective use of experience from prior explorations, leading to fragmented reasoning and redundant exploration. Method: TRACE dynamically translates reasoning paths into natural language narratives to maintain semantic continuity and abstracts reusable experiential priors from past explorations. A dual-feedback re-ranking mechanism combines the contextual narratives with the exploration priors to guide relation selection. Result: TRACE consistently outperforms state-of-the-art baselines on multiple KGQA benchmarks. Conclusion: Through experience-driven, trajectory-aware reasoning, TRACE improves the coherence, robustness, and efficiency of multi-hop KGQA. Abstract: Multi-hop Knowledge Graph Question Answering (KGQA) requires coherent reasoning across relational paths, yet existing methods often treat each reasoning step independently and fail to effectively leverage experience from prior explorations, leading to fragmented reasoning and redundant exploration. To address these challenges, we propose Trajectory-aware Reasoning with Adaptive Context and Exploration priors (TRACE), an experiential framework that unifies LLM-driven contextual reasoning with exploration prior integration to enhance the coherence and robustness of multi-hop KGQA. Specifically, TRACE dynamically translates evolving reasoning paths into natural language narratives to maintain semantic continuity, while abstracting prior exploration trajectories into reusable experiential priors that capture recurring exploration patterns. A dual-feedback re-ranking mechanism further integrates contextual narratives with exploration priors to guide relation selection during reasoning. Extensive experiments on multiple KGQA benchmarks demonstrate that TRACE consistently outperforms state-of-the-art baselines.

[88] CocoaBench: Evaluating Unified Digital Agents in the Wild

CocoaBench Team,Shibo Hao,Zhining Zhang,Zhiqi Liang,Tianyang Liu,Yuheng Zha,Qiyue Gao,Jixuan Chen,Zilong Wang,Zhoujun Cheng,Haoxiang Zhang,Junli Wang,Hexi Jin,Boyuan Zheng,Kun Zhou,Yu Wang,Feng Yao,Licheng Liu,Yijiang Li,Zhifei Li,Zhengtao Han,Pracha Promthaw,Tommaso Cerruti,Xiaohan Fu,Ziqiao Ma,Jingbo Shang,Lianhui Qin,Julian McAuley,Eric P. Xing,Zhengzhong Liu,Rupesh Kumar Srivastava,Zhiting Hu

Main category: cs.CL

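TL;DR: This paper presents CocoaBench, a benchmark for unified digital agents built around long-horizon tasks that require combining vision, search, and coding. It also releases CocoaAgent, a lightweight general-purpose agent scaffold for controlled comparison across model backbones. Experiments show that current agents perform poorly on the benchmark (best success rate only 45.1%), exposing key weaknesses in reasoning and planning, tool use, and visual grounding.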

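Details Motivation: Existing evaluations mostly test agent capabilities in isolation and lack a unified, long-horizon evaluation of combined capabilities (such as vision plus search plus coding), so they do not reflect overall performance in realistic, complex settings. Method: The authors build CocoaBench from human-designed, long-horizon tasks that require composing vision, search, and coding. Each task is defined only by a natural language instruction and an automatic evaluation function, which keeps results reproducible across agent infrastructures. They also release CocoaAgent, a lightweight shared scaffold that supports fair comparison across LLM backbones. Result: The best evaluated system achieves a 45.1% success rate on CocoaBench, well short of practical reliability. The analysis identifies three bottlenecks: weak reasoning and planning, fragile tool use and execution, and inaccurate visual grounding. Conclusion: Unified digital agents need an evaluation paradigm aimed at combined capabilities. CocoaBench and CocoaAgent provide a scalable, reproducible benchmark and infrastructure for this direction and indicate where improvement is most needed. Abstract: LLM agents now perform strongly in software engineering, deep research, GUI automation, and various other applications, while recent agent scaffolds and models are increasingly integrating these capabilities into unified systems. Yet, most evaluations still test these capabilities in isolation, which leaves a gap for more diverse use cases that require agents to combine different capabilities. We introduce CocoaBench, a benchmark for unified digital agents built from human-designed, long-horizon tasks that require flexible composition of vision, search, and coding. Tasks are specified only by an instruction and an automatic evaluation function over the final output, enabling reliable and scalable evaluation across diverse agent infrastructures. We also present CocoaAgent, a lightweight shared scaffold for controlled comparison across model backbones. Experiments show that current agents remain far from reliable on CocoaBench, with the best evaluated system achieving only 45.1% success rate. Our analysis further points to substantial room for improvement in reasoning and planning, tool use and execution, and visual grounding.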

[89] Exploring Knowledge Conflicts for Faithful LLM Reasoning: Benchmark and Method

Tianzhe Zhao,Jiaoyan Chen,Shuxiu Zhang,Haiping Zhu,Qika Lin,Jun Liu

Main category: cs.CL

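TL;DR: This paper proposes the ConflictQA benchmark to evaluate how large language models (LLMs) reason when heterogeneous external knowledge sources, such as text and knowledge graphs (KGs), conflict with each other. It finds that existing LLMs are sensitive to prompting choices and tend to rely on a single evidence source. The authors further propose XoT, a two-stage explanation-based reasoning framework, to improve reasoning faithfulness under cross-source conflicts.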

Details Motivation: Existing research focuses mainly on conflicts between external knowledge and a model's parametric knowledge and overlooks conflicts between different external sources (such as text and knowledge graphs). Meanwhile, modern RAG systems increasingly integrate heterogeneous sources, so reasoning faithfulness under cross-source conflicts needs to be evaluated and improved. Method: The authors build the ConflictQA benchmark, which systematically constructs conflicts between textual evidence and KG evidence. They design XoT, a two-stage explanation-based reasoning method: the first stage generates an explanation for each evidence source, and the second stage reasons over those explanations jointly. Result: Mainstream LLMs perform poorly on ConflictQA, are sensitive to prompt perturbations, and tend to rely exclusively on one type of evidence (text or KG), which leads to incorrect answers. XoT improves the accuracy and reasoning faithfulness of the evaluated LLMs on ConflictQA. Conclusion: Cross-source knowledge conflict is a key and under-appreciated challenge in RAG. XoT mitigates it through an explicit explanation mechanism and suggests a path toward more robust and trustworthy multi-source RAG systems. Abstract: Large language models (LLMs) have achieved remarkable success across a wide range of applications, especially when augmented by external knowledge through retrieval-augmented generation (RAG). Despite their widespread adoption, recent studies have shown that LLMs often struggle to perform faithful reasoning when conflicting knowledge is retrieved. However, existing work primarily focuses on conflicts between external knowledge and the parametric knowledge of LLMs, leaving conflicts across external knowledge largely unexplored. Meanwhile, modern RAG systems increasingly emphasize the integration of unstructured text and (semi-)structured data like knowledge graphs (KGs) to improve knowledge completeness and reasoning faithfulness. To address this gap, we introduce ConflictQA, a novel benchmark that systematically instantiates conflicts between textual evidence and KG evidence. Extensive evaluations across representative LLMs reveal that, facing such cross-source conflicts, LLMs often fail to identify reliable evidence for correct reasoning. Instead, LLMs become more sensitive to prompting choices and tend to rely exclusively on either KG or textual evidence, resulting in incorrect responses. Based on these findings, we further propose XoT, a two-stage explanation-based thinking framework tailored for reasoning over heterogeneous conflicting evidence, and verify its effectiveness with extensive experiments.

[90] HiEdit: Lifelong Model Editing with Hierarchical Reinforcement Learning

Yangfan Wang,Tianyang Sun,Chen Tang,Jie Liu,Wei Cai,Jingchi Jiang

Main category: cs.CL

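TL;DR: This paper proposes HiEdit, a hierarchical reinforcement learning framework for lifelong model editing that dynamically selects the layers most relevant to each piece of knowledge and applies precise, localized updates, improving editing performance while reducing side effects.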

Details Motivation: Existing lifelong model editing methods apply static, dense parameter perturbations to every editing instance, ignoring the hypothesis that different knowledge is stored in different layers of the model. This leads to poor adaptability and catastrophic forgetting. Method: HiEdit uses hierarchical reinforcement learning to adaptively identify the most knowledge-relevant layers for each editing instance, and combines this with an intrinsic sparsity reward to achieve dynamic, instance-aware layer selection and localized updates. Result: Experiments on several LLMs show that HiEdit improves on RLEdit by an average of 8.48% while perturbing only half of the layers per edit. Conclusion: Layer-specific editing improves the precision and efficiency of lifelong model editing and mitigates catastrophic forgetting, which supports the hypothesis that knowledge is stored in distinct layers. Abstract: Lifelong model editing (LME) aims to sequentially rectify outdated or inaccurate knowledge in deployed LLMs while minimizing side effects on unrelated inputs. However, existing approaches typically apply parameter perturbations to a static and dense set of LLM layers for all editing instances. This practice is counter-intuitive, as we hypothesize that different pieces of knowledge are stored in distinct layers of the model. Neglecting this layer-wise specificity can impede adaptability in integrating new knowledge and result in catastrophic forgetting for both general and previously edited knowledge. To address this, we propose HiEdit, a hierarchical reinforcement learning framework that adaptively identifies the most knowledge-relevant layers for each editing instance. By enabling dynamic, instance-aware layer selection and incorporating an intrinsic reward for sparsity, HiEdit achieves precise, localized updates. Experiments on various LLMs show that HiEdit boosts the performance of the competitive RLEdit by an average of 8.48% while perturbing only half of the layers per edit. Our code is available at: https://github.com/yangfanww/hiedit.

[91] RUMLEM: A Dictionary-Based Lemmatizer for Romansh

Dominic P. Fischer,Zachary Hopton,Jannis Vamvas

Main category: cs.CL

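TL;DR: This paper presents RUMLEM, a lemmatizer covering the five main varieties of Romansh and the standard variety Rumantsch Grischun. It is based on community-driven morphological databases, covers 77-84% of words, and also supports variety identification (95% accuracy) and Romansh vs. non-Romansh classification.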

Details Motivation: Romansh has several varieties and few resources, so a comprehensive and reliable lemmatizer is needed to support NLP applications and variety identification. Method: RUMLEM is built on comprehensive, community-driven morphological databases, one per Romansh variety, and its output is used for variety identification and language classification. Result: RUMLEM covers 77-84% of the words in a typical Romansh text; variety identification on 30,000 texts reaches 95% accuracy; a proof of concept shows that Romansh vs. non-Romansh binary classification is feasible. Conclusion: RUMLEM is a practical lemmatizer for Romansh that also enables variety identification and language classification, highlighting the value of high-quality morphological resources for low-resource language NLP. Abstract: Lemmatization, the task of mapping an inflected word form to its dictionary form, is a crucial component of many NLP applications. In this paper, we present RUMLEM, a lemmatizer that covers the five main varieties of Romansh as well as the supra-regional standard variety Rumantsch Grischun. It is based on comprehensive, community-driven morphological databases for Romansh, enabling RUMLEM to cover 77-84% of the words in a typical Romansh text. Since there is a dedicated database for each Romansh variety, an additional application of RUMLEM is variety-aware language classification. Evaluation on 30'000 Romansh texts of varying lengths shows that RUMLEM correctly identifies the variety in 95% of cases. In addition, a proof of concept demonstrates the feasibility of Romansh vs. non-Romansh language classification based on the lemmatizer.

[92] Judge Like Human Examiners: A Weighted Importance Multi-Point Evaluation Framework for Generative Tasks with Long-form Answers

Guoxin Yu,Chulun Zhou,Lemao Liu,Qi Wang,Mo Yu,Jialong Tang,Baosong Yang,Xiang Ao,Wao Lam,Yue Yu

Main category: cs.CL

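TL;DR: This paper proposes the WIMPE framework, which evaluates long-form generation at a fine-grained level using weighted, context-bound scoring points. It includes two metrics, WPA and PCP, and achieves higher correlation with human annotations.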

Details Motivation: Existing evaluation methods struggle to judge whether a model response is genuinely grounded in the provided context and cannot capture the differing importance of different aspects of a reference answer. Method: The Weighted Importance Multi-Point Evaluation (WIMPE) framework decomposes each reference answer into weighted, context-bound scoring points and defines two complementary metrics: Weighted Point-wise Alignment (WPA) and Point-wise Conflict Penalty (PCP). Result: Experiments on 10 generative tasks show that WIMPE correlates more strongly with human annotations. Conclusion: WIMPE evaluates long-form generation quality more accurately and at a finer granularity, with particular strengths in modeling context consistency and the importance of key points. Abstract: Evaluating the quality of model responses remains challenging in generative tasks with long-form answers, as the expected answers usually contain multiple semantically distinct yet complementary factors that should be factorized for fine-grained assessment. Recent evaluation methods rely on either task-level rubrics or question-aware checklists. However, they still 1) struggle to assess whether a response is genuinely grounded in provided contexts; 2) fail to capture the heterogeneous importance of different aspects of reference answers. Inspired by human examiners, we propose a Weighted Importance Multi-Point Evaluation (WIMPE) framework, which factorizes each reference answer into weighted context-bound scoring points. Two complementary metrics, namely Weighted Point-wise Alignment (WPA) and Point-wise Conflict Penalty (PCP), are designed to measure the alignment and contradiction between model responses and reference answers. Extensive experiments on 10 generative tasks demonstrate that WIMPE achieves higher correlations with human annotations.
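
To make the idea of weighted scoring points concrete, here is a minimal sketch. The abstract does not give the formulas for WPA or PCP, so this assumes both are weight-normalized sums over per-point judgments; `ScoringPoint` and `weighted_fraction` are hypothetical names, and the matched/contradicted flags, which in practice would come from a judge model or annotator, are hardcoded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoringPoint:
    text: str
    weight: float  # importance of this point (assumed non-negative)


def weighted_fraction(points, flags):
    """Share of total weight carried by the points whose flag is True."""
    total = sum(p.weight for p in points)
    if total == 0:
        return 0.0
    return sum(p.weight for p, flag in zip(points, flags) if flag) / total


points = [
    ScoringPoint("names the correct drug", 3.0),
    ScoringPoint("states the dosage from the context", 2.0),
    ScoringPoint("mentions a common side effect", 1.0),
]

# Hypothetical per-point judgments for one model response.
matched = [True, False, True]        # response agrees with points 1 and 3
contradicted = [False, True, False]  # response contradicts point 2

alignment = weighted_fraction(points, matched)      # WPA-like score (assumed form)
conflict = weighted_fraction(points, contradicted)  # PCP-like penalty (assumed form)
```

Under these assumptions, missing a heavily weighted point costs more than missing a minor one, which is the behavior the unweighted checklist approaches mentioned above do not capture.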

[93] Dialectic-Med: Mitigating Diagnostic Hallucinations via Counterfactual Adversarial Multi-Agent Debate

Zhixiang Lu,Jionglong Su

Main category: cs.CL

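TL;DR: This paper proposes Dialectic-Med, a multi-agent framework in which a proponent, an opponent (equipped with a visual falsification module), and a mediator engage in adversarial dialectical reasoning, mitigating confirmation bias and hallucination in medical multimodal LLMs and improving diagnostic trustworthiness and explanation faithfulness.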

Details Motivation: Medical multimodal large language models (MLLMs) suffer from severe confirmation bias and hallucinate visual details, and existing Chain-of-Thought (CoT) methods lack an intrinsic correction mechanism, which makes them vulnerable to error propagation. Method: Dialectic-Med uses three agents: a proponent generates diagnostic hypotheses; an opponent uses a novel visual falsification module to actively retrieve contradictory visual evidence and challenge them; a mediator resolves conflicts via a weighted consensus graph. The process explicitly models falsification so that reasoning stays anchored in verified visual regions. Result: The framework achieves state-of-the-art performance on MIMIC-CXR-VQA, VQA-RAD, and PathVQA, improves explanation faithfulness, suppresses hallucinations, and outperforms single-agent baselines. Conclusion: By introducing an adversarial dialectical mechanism, Dialectic-Med offers medical MLLMs a more reliable, verifiable diagnostic reasoning paradigm with fewer hallucinations. Abstract: Multimodal Large Language Models (MLLMs) in healthcare suffer from severe confirmation bias, often hallucinating visual details to support initial, potentially erroneous diagnostic hypotheses. Existing Chain-of-Thought (CoT) approaches lack intrinsic correction mechanisms, rendering them vulnerable to error propagation. To bridge this gap, we propose Dialectic-Med, a multi-agent framework that enforces diagnostic rigor through adversarial dialectics. Unlike static consensus models, Dialectic-Med orchestrates a dynamic interplay between three role-specialized agents: a proponent that formulates diagnostic hypotheses; an opponent equipped with a novel visual falsification module that actively retrieves contradictory visual evidence to challenge the proponent; and a mediator that resolves conflicts via a weighted consensus graph. By explicitly modeling the cognitive process of falsification, our framework guarantees that diagnostic reasoning is tightly grounded in verified visual regions. Empirical evaluations on MIMIC-CXR-VQA, VQA-RAD, and PathVQA demonstrate that Dialectic-Med not only achieves state-of-the-art performance but also fundamentally enhances the trustworthiness of the reasoning process. Beyond accuracy, our approach significantly enhances explanation faithfulness and decisively mitigates hallucinations, establishing a new standard over single-agent baselines.

[94] Transactional Attention: Semantic Sponsorship for KV-Cache Retention

Abhinaba Basu

Main category: cs.CL

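TL;DR: This paper proposes Transactional Attention (TA), a mechanism that uses structural anchor patterns to protect critical credential tokens from being evicted during compression, substantially improving retrieval accuracy for sensitive information under KV-cache compression.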

Details Motivation: At very low retention budgets (e.g., K=16), existing KV-cache compression methods fail entirely to retrieve critical credentials such as API keys and passwords. They rely on statistical signals such as attention scores, whereas credential tokens are often "dormant tokens" that receive very little attention but are essential at generation time. Method: Transactional Attention (TA) uses structural anchor patterns (e.g., "key:", "password:") as "sponsors" that mark adjacent value-bearing tokens as non-evictable. A lightweight variant, TA-Fast, requires no attention computation. Result: At K=16, TA achieves 100% credential retrieval accuracy where six baselines achieve 0%, and it maintains 100% accuracy across 200 function-calling trials. TA-Fast reduces memory overhead by 52%, is compatible with SDPA and FlashAttention, and adds less than 1% latency. Conclusion: TA addresses the loss of dormant tokens under KV compression. It is orthogonal to existing methods, efficient, and low-overhead, and is valuable for security-sensitive long-context applications. Abstract: At K=16 tokens (0.4% of a 4K context), every existing KV-cache compression method achieves 0% on credential retrieval. The failure mode is dormant tokens: credentials, API keys, and configuration values that receive near-zero attention but become essential at generation time. Because these tokens lack the statistical signals that eviction policies rely on, no method based on attention scores, reconstruction loss, or learned retention gates retains them. We introduce Transactional Attention (TA), a sponsorship mechanism in which structural anchor patterns (e.g., "key:", "password:") protect adjacent value-bearing tokens from eviction. TA achieves 100% credential retrieval at K=16 where six baselines (H2O, TOVA, SnapKV, StreamingLLM, PyramidKV, DynamicKV) achieve 0%, and sustains 100% accuracy across 200 function-calling trials. TA-Fast, an attention-free variant, reduces memory overhead by 52% and is compatible with SDPA and FlashAttention. TA is orthogonal to existing compression methods and adds less than 1% latency overhead.
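
The sponsorship idea can be sketched on a toy token list. This is an illustration under strong assumptions, not the paper's implementation: tokens are whole strings rather than subword ids, a single attention score per token stands in for per-layer, per-head statistics, and `select_retained`, its `anchors` default, and the `span` parameter are hypothetical. It contrasts plain top-K retention by attention score with retention where tokens right after an anchor are kept first.

```python
def select_retained(tokens, scores, budget, anchors=("key:", "password:"), span=1):
    """Choose which token positions to keep in a cache of size `budget`.

    Positions within `span` tokens after an anchor are kept first
    ("sponsored"); the remaining budget goes to the highest-scoring tokens.
    """
    sponsored = set()
    for i, tok in enumerate(tokens):
        if tok in anchors:
            sponsored.update(range(i + 1, min(i + 1 + span, len(tokens))))
    protected = sorted(sponsored)[:budget]
    protected_set = set(protected)
    rest = [i for i in range(len(tokens)) if i not in protected_set]
    # Highest attention score first; ties go to the later (more recent) position.
    rest.sort(key=lambda i: (scores[i], i), reverse=True)
    return sorted(protected + rest[: budget - len(protected)])


tokens = ["the", "api", "key:", "sk-123", "is", "stored", "here", "please", "answer"]
scores = [0.30, 0.10, 0.05, 0.01, 0.02, 0.04, 0.03, 0.20, 0.25]

# Score-only baseline: no anchors, so the low-attention credential is evicted.
baseline = select_retained(tokens, scores, budget=3, anchors=())
# Anchor-sponsored retention: the token after "key:" survives.
sponsored = select_retained(tokens, scores, budget=3)
```

Here the credential `sk-123` has the lowest score of all tokens, so any purely score-based policy drops it, while the anchor rule keeps it at the cost of one ordinary slot.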

[95] Polyglot Teachers: Evaluating Language Models for Multilingual Synthetic Data Generation

Lester James V. Miranda,Ivan Vulić,Anna Korhonen

Main category: cs.CL

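TL;DR: This paper systematically studies how to choose effective teacher models for generating supervised fine-tuning (SFT) data in multilingual settings and proposes the Polyglot Score metric. It finds that model scale is not the deciding factor; data quality features such as prompt diversity, length, and response fluency better predict student performance. It also offers practical recommendations for pairing teachers and students.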

Details Motivation: Existing work often defaults to the largest available language model as the teacher for generating multilingual SFT data, but large models can have capability gaps in non-English languages, leading to poor synthetic data and weak downstream student performance. Method: The Polyglot Score combines intrinsic data quality measures with extrinsic student model performance. The authors evaluate 10 LMs across 6 typologically diverse languages, generate over 1.4M SFT examples, and train 240 student models, then analyze how teacher scale and data quality features (prompt diversity, length, response fluency) affect student results. Result: Gemma 3 27B and Aya Expanse 32B are consistently effective teachers across student model families. Model scale alone does not significantly predict teacher effectiveness. Prompt diversity, length, and response fluency capture over 93.3% of the variance in intrinsic data quality and predict student performance. Matching teacher and student model families, and reusing or translating existing prompts, can improve results for less-resourced languages. Conclusion: Multilingual SFT data synthesis should be guided by data quality rather than model scale alone; the proposed evaluation framework and recommendations support data-centric research on multilingual LLMs. Abstract: Synthesizing supervised finetuning (SFT) data from language models (LMs) to teach smaller models multilingual tasks has become increasingly common. However, teacher model selection is often ad hoc, typically defaulting to the largest available option, even though such models may have significant capability gaps in non-English languages. This practice can result in poor-quality synthetic data and suboptimal student downstream performance. In this work, we systematically characterize what makes an effective multilingual teacher. We combine intrinsic measures of data quality with extrinsic student model performance in a metric we call Polyglot Score, evaluating 10 LMs across 6 typologically diverse languages, generating over 1.4M SFT examples and training 240 student models. Among the models tested, Gemma 3 27B and Aya Expanse 32B emerge as consistently effective teachers across different student base model families. Further analyses reveal that model scale alone does not significantly predict teacher effectiveness; instead, data qualities such as prompt diversity, length, and response fluency capture over 93.3% of variance in intrinsic data quality and predict student performance. Finally, we provide practical recommendations, including matching the model families of teacher-student pairs and translating from or responding to existing prompts, which can yield improvements for less-resourced languages. We hope that our work advances data-centric research in multilingual synthetic data and LM development.

[96] Enhancing Multimodal Large Language Models for Ancient Chinese Character Evolution Analysis via Glyph-Driven Fine-Tuning

Rui Song,Lida Shi,Ruihua Qi,Yingji Li,Hao Xu

Main category: cs.CL

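TL;DR: This paper builds a benchmark for analyzing the glyph evolution of ancient Chinese characters, with 11 tasks and over 130,000 instances, and proposes GEVO, a glyph-driven fine-tuning framework that improves multimodal LLMs on core tasks such as character recognition and evolutionary reasoning.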

Details Motivation: Existing MLLMs have limited ability to analyze the glyph evolution of ancient Chinese characters (e.g., glyph comparison, evolutionary reasoning), and the area lacks a systematic benchmark and targeted modeling methods. Method: The authors build a benchmark for ancient Chinese character evolution analysis covering 11 tasks and over 130,000 instances, and propose GEVO, a glyph-driven fine-tuning framework that explicitly models evolutionary consistency in glyph transformations. Result: With GEVO, even models at the 2B scale achieve consistent and comprehensive improvements across all tasks. Existing MLLMs show only limited ability in glyph-level comparison and remain constrained on core tasks. Conclusion: Analyzing the evolution of ancient Chinese characters calls for a dedicated benchmark and fine-tuning method; GEVO shows that glyph-aware modeling improves MLLMs' understanding of historical scripts. Abstract: In recent years, rapid advances in Multimodal Large Language Models (MLLMs) have increasingly stimulated research on ancient Chinese scripts. As the evolution of written characters constitutes a fundamental pathway for understanding cultural transformation and historical continuity, how MLLMs can be systematically leveraged to support and advance text evolution analysis remains an open and largely underexplored problem. To bridge this gap, we construct a comprehensive benchmark comprising 11 tasks and over 130,000 instances, specifically designed to evaluate the capability of MLLMs in analyzing the evolution of ancient Chinese scripts. We conduct extensive evaluations across multiple widely used MLLMs and observe that, while existing models demonstrate a limited ability in glyph-level comparison, their performance on core tasks, such as character recognition and evolutionary reasoning, remains substantially constrained. Motivated by these findings, we propose a glyph-driven fine-tuning framework (GEVO) that explicitly encourages models to capture evolutionary consistency in glyph transformations and enhances their understanding of text evolution. Experimental results show that even models at the 2B scale achieve consistent and comprehensive performance improvements across all evaluated tasks. To facilitate future research, we publicly release both the benchmark and the trained models (https://github.com/songruiecho/GEVO).

[97] Do LLMs Know Tool Irrelevance? Demystifying Structural Alignment Bias in Tool Invocations

Yilong Liu,Xixun Lin,Pengfei Cao,Ge Zhang,Fang Fang,Yanan Cao

Main category: cs.CL

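TL;DR: This paper identifies and names an overlooked structural alignment bias in tool invocation by large language models (LLMs): models tend to invoke tools that match the query structurally but are semantically irrelevant. The authors build a new evaluation set, SABEval, propose Contrastive Attention Attribution to expose the underlying mechanism, and design a rebalancing strategy that mitigates the bias.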

Details Motivation: In practice LLMs are often exposed to irrelevant tools, where the desired behavior is to refrain from invoking them, yet existing models commonly mistake structural match for semantic relevance and invoke the tool anyway. Method: The paper introduces the concept of structural alignment bias; builds SABEval, a dataset that decouples structural alignment from semantic relevance; designs Contrastive Attention Attribution to analyze the model's internal decision pathways; and proposes a rebalancing strategy based on the findings. Result: Structural alignment bias causes severe tool-invocation errors and is not covered by existing evaluations. Contrastive Attention Attribution reveals two competing pathways, semantic checking and structural matching. The rebalancing strategy mitigates the bias without degrading general tool-use capabilities. Conclusion: Structural alignment bias is a key mechanistic problem in LLM tool invocation and should be modeled explicitly in evaluation and training; the proposed analysis framework and mitigation strategy point toward more reliable tool use. Abstract: Large language models (LLMs) have demonstrated impressive capabilities in utilizing external tools. In practice, however, LLMs are often exposed to tools that are irrelevant to the user's query, in which case the desired behavior is to refrain from invocations. In this work, we identify a widespread yet overlooked mechanistic flaw in tool refusal, which we term structural alignment bias: Even when a tool fails to serve the user's goal, LLMs still tend to invoke it whenever query attributes can be validly assigned to tool parameters. To systematically study this bias, we introduce SABEval, a new dataset that decouples structural alignment from semantic relevance. Our analysis shows that structural alignment bias induces severe tool-invocation errors in LLMs, yet remains largely unaccounted for in existing evaluations. To investigate the internal mechanisms underlying this bias, we propose Contrastive Attention Attribution, which reveals two competing pathways for semantic checking and structural matching. The relative strength of these pathways drives LLMs' tool invocation decisions. Based on these findings, we further introduce a rebalancing strategy that effectively mitigates structural alignment bias, as demonstrated by extensive experiments, without degrading general tool-use capabilities.

[98] Retrieval as Generation: A Unified Framework with Self-Triggered Information Planning

Bo Li,Mingda Wang,Gexiang Fang,Shikun Zhang,Wei Ye

Main category: cs.CL

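TL;DR: This paper proposes GRIP, a framework that embeds retrieval control in the generation process and uses self-triggered information planning to coordinate retrieval and generation end to end. It outperforms strong RAG baselines on multiple QA benchmarks with fewer parameters.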

Details Motivation: Conventional RAG treats retrieval as an external intervention, which disconnects retrieval from generation. This work aims to coordinate the two end to end without additional controllers or classifiers. Method: The authors propose the Retrieval as Generation paradigm and the GRIP framework, in which control tokens emitted during autoregressive decoding determine when to retrieve, how to reformulate queries, and when to terminate. They introduce Self-Triggered Information Planning and construct a structured training set covering multiple query types for supervision. Result: On five QA benchmarks, GRIP surpasses strong RAG baselines and is competitive with GPT-4o while using substantially fewer parameters. Conclusion: Modeling retrieval as part of generation is feasible and efficient; GRIP shows that jointly modeling retrieval and reasoning end to end can improve performance while reducing model complexity. Abstract: We revisit retrieval-augmented generation (RAG) by embedding retrieval control directly into generation. Instead of treating retrieval as an external intervention, we express retrieval decisions within token-level decoding, enabling end-to-end coordination without additional controllers or classifiers. Under the paradigm of Retrieval as Generation, we propose GRIP (Generation-guided Retrieval with Information Planning), a unified framework in which the model regulates retrieval behavior through control-token emission. Central to GRIP is Self-Triggered Information Planning, which allows the model to decide when to retrieve, how to reformulate queries, and when to terminate, all within a single autoregressive trajectory. This design tightly couples retrieval and reasoning and supports dynamic multi-step inference with on-the-fly evidence integration. To supervise these behaviors, we construct a structured training set covering answerable, partially answerable, and multi-hop queries, each aligned with specific token patterns. Experiments on five QA benchmarks show that GRIP surpasses strong RAG baselines and is competitive with GPT-4o while using substantially fewer parameters.

[99] Bridging What the Model Thinks and How It Speaks: Self-Aware Speech Language Models for Expressive Speech Generation

Kuang Wang,Lai Wei,Qibing Bai,Ping Lin,Wenkai Fang,Feng Jiang,Zhongjie Jiang,Jun Huang,Yannan Wang,Haizhou Li

Main category: cs.CL

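TL;DR: This paper proposes SA-SLM, which uses Intent-Aware Bridging and Realization-Aware Alignment to close the gap between semantic understanding and acoustic expression, approaching GPT-4o-Audio in expressiveness.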

Details Motivation: Speech Language Models (SLMs) have strong semantic understanding, but their generated speech lacks expressiveness. This gap between semantic understanding and acoustic realization stems from intent transmission failure and realization-unaware training. Method: SA-SLM has two components: 1) Intent-Aware Bridging, which uses a Variational Information Bottleneck (VIB) to translate internal semantics into temporally smooth expressive intent; 2) Realization-Aware Alignment, in which the model critiques its own output and aligns acoustic realization with intent using rubric-based feedback. The model is trained on only 800 hours of data. Result: The 3B-parameter SA-SLM surpasses all open-source baselines in overall expressiveness on the EchoMind benchmark and comes within 0.08 points of GPT-4o-Audio. Conclusion: By making the model aware of its own intent and of how it speaks, SA-SLM narrows the gap between semantic understanding and acoustic expression and offers a new paradigm for expressive speech generation. Abstract: Speech Language Models (SLMs) exhibit strong semantic understanding, yet their generated speech often sounds flat and fails to convey expressive intent, undermining user engagement. We term this mismatch the semantic understanding-acoustic realization gap. We attribute this gap to two key deficiencies: (1) intent transmission failure, where SLMs fail to provide the stable utterance-level intent needed for expressive delivery; and (2) realization-unaware training, where no feedback signal verifies whether acoustic outputs faithfully reflect intended expression. To address these issues, we propose SA-SLM (Self-Aware Speech Language Model), built on the principle that the model should be aware of what it thinks during generation and how it speaks during training. SA-SLM addresses this gap through two core contributions: (1) Intent-Aware Bridging, which uses a Variational Information Bottleneck (VIB) objective to translate the model's internal semantics into temporally smooth expressive intent, making speech generation aware of what the model intends to express; and (2) Realization-Aware Alignment, which repurposes the model as its own critic to verify and align acoustic realization with intended expressive intent via rubric-based feedback. Trained on only 800 hours of expressive speech data, our 3B parameter SA-SLM surpasses all open-source baselines and comes within 0.08 points of GPT-4o-Audio in overall expressiveness on the EchoMind benchmark.

[100] METRO: Towards Strategy Induction from Expert Dialogue Transcripts for Non-collaborative Dialogues

Haofu Yang,Jiaji Liu,Chen Huang,Faguo Wu,Wenqiang Lei,See-Kiong Ng

Main category: cs.CL

TL;DR: METRO is a novel method that uses large language models to automatically learn strategy actions and planning logic from raw dialogue transcripts for non-collaborative dialogue agents, outperforming existing methods by 9%-10% on benchmarks.

Details Motivation: Traditional development of non-collaborative dialogue agents relies on manual, unscalable codification of expert strategies, necessitating a more scalable and cost-effective approach. Method: METRO leverages large language models to induce strategy actions and planning logic from raw transcripts, formalizing expert knowledge into a hierarchical Strategy Forest that captures both short-term responses (nodes) and long-term strategic foresight (branches). Result: METRO achieves an average performance improvement of 9%-10% over existing methods on two benchmarks, demonstrates strategic behavioral diversity and foresight, and shows robust cross-task transferability. Conclusion: METRO provides a scalable, cost-effective way to build non-collaborative dialogue agents by autonomously learning strategies from data, offering new insights for future agent development. Abstract: Developing non-collaborative dialogue agents traditionally requires the manual, unscalable codification of expert strategies. We propose METRO, a method that leverages large language models to autonomously induce both strategy actions and planning logic directly from raw transcripts. METRO formalizes expert knowledge into a Strategy Forest, a hierarchical structure that captures both short-term responses (nodes) and long-term strategic foresight (branches). Experimental results across two benchmarks show that METRO demonstrates promising performance, outperforming existing methods by an average of 9%-10%. Our further analysis not only reveals the success behind METRO (strategic behavioral diversity and foresight), but also demonstrates its robust cross-task transferability. This offers new insights into building non-collaborative agents in a cost-effective and scalable way. Our code is available at https://github.com/Humphrey-0125/METRO.

[101] Think Before you Write: QA-Guided Reasoning for Character Descriptions in Books

Argyrios Papoudakis,Mirella Lapata,Frank Keller

Main category: cs.CL

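TL;DR: This paper proposes a training framework that decouples reasoning from generation to produce accurate character descriptions from long-form narratives, using structured QA-style reasoning traces to improve the faithfulness, informativeness, and grounding of the output.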

Details Motivation: Reasoning-enabled LLMs behave counterintuitively on character description generation (they do better with built-in reasoning disabled), which suggests that the interplay between reasoning and generation needs to be redesigned. Method: A two-model framework is built: a reasoning model produces a structured QA reasoning trace, and a generation model conditions on that trace to produce the final character description. The approach can be applied on top of long-context models or chunk-based methods. Result: On the BookWorm and CroSS datasets, the method outperforms strong long-context baselines in faithfulness, informativeness, and grounding. Conclusion: Explicitly decoupling and structuring the reasoning process, particularly in QA form, suits complex character modeling better than implicit end-to-end reasoning and offers a new paradigm for narrative understanding. Abstract: Character description generation is an important capability for narrative-focused applications such as summarization, story analysis, and character-driven simulations. However, generating accurate character descriptions from long-form narratives (e.g., novels) is challenging: models must track evolving attributes (e.g., relationships and events), integrate evidence scattered across the text, and infer implicit details. Despite the success of reasoning-enabled LLMs on many benchmarks, we find that for character description generation their performance improves when built-in reasoning is disabled (i.e., an empty reasoning trace). Motivated by this, we propose a training framework that decouples reasoning from generation. Our approach, which can be applied on top of long-context LLMs or chunk-based methods, consists of a reasoning model that produces a structured QA reasoning trace and a generation model that conditions on this trace to produce the final character description. Experiments on two datasets (BookWorm and CroSS) show that QA-guided reasoning improves faithfulness, informativeness, and grounding over strong long-context baselines.

[102] METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models

Pengfeng Li,Chen Huang,Chaoqun Hao,Hongyao Chen,Xiao-Yong Wei,Wenqiang Lei,See-Kiong Ng

Main category: cs.CL

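TL;DR: This paper proposes METER, the first benchmark to systematically evaluate the contextual causal reasoning of large language models (LLMs) across all three levels of the causal ladder under a unified context setting, and uses mechanistic analysis to reveal two main failure modes.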

Details Motivation: Existing benchmarks for contextual causal reasoning in LLMs suffer from inconsistent contexts and incomplete coverage of the causal hierarchy, so a systematic, hierarchical, unified evaluation framework is needed. Method: The METER benchmark covers all three levels of the causal ladder (association, intervention, counterfactual) under a unified context, and is paired with error pattern analysis and internal information flow tracing for mechanistic diagnosis. Result: LLM causal reasoning ability declines significantly as the causal level rises. Two main failure modes are identified: distraction by causally irrelevant but factually correct information at lower levels, and declining faithfulness to the context at higher levels. Conclusion: METER offers a new perspective on, and a foundation for understanding, the mechanisms of LLM causal reasoning, supporting further work on causal reasoning evaluation and modeling. Abstract: Contextual causal reasoning is a critical yet challenging capability for Large Language Models (LLMs). Existing benchmarks, however, often evaluate this skill in fragmented settings, failing to ensure context consistency or cover the full causal hierarchy. To address this, we pioneer METER to systematically benchmark LLMs across all three levels of the causal ladder under a unified context setting. Our extensive evaluation of various LLMs reveals a significant decline in proficiency as tasks ascend the causal hierarchy. To diagnose this degradation, we conduct a deep mechanistic analysis via both error pattern identification and internal information flow tracing. Our analysis reveals two primary failure modes: (1) LLMs are susceptible to distraction by causally irrelevant but factually correct information at lower levels of causality; and (2) as tasks ascend the causal hierarchy, faithfulness to the provided context degrades, leading to reduced performance. We believe our work advances our understanding of the mechanisms behind LLM contextual causal reasoning and establishes a critical foundation for future research. Our code and dataset are available at https://github.com/SCUNLP/METER.

[103] Policy Split: Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization

Jiashu Yao,Heyan Huang,Chuwei Luo,Daiqing Wu,Zeming Liu,Yuhang Guo,Yangyang Kang

Main category: cs.CL

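TL;DR: This paper proposes Policy Split, which divides the policy into a normal mode and a high-entropy mode and uses a high-entropy prompt to achieve diverse exploration while preserving accuracy.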

Details Motivation: To encourage diverse exploration in reinforcement learning for large language models without sacrificing accuracy. Method: The policy is split into a normal mode and a high-entropy mode that share parameters, with collaborative dual-mode entropy regularization tailored to distinct objectives: the normal mode optimizes for task correctness, while the high-entropy mode favors exploration. Result: Across model sizes on general and creative tasks, the method consistently outperforms existing entropy-guided RL baselines. Conclusion: Policy Split facilitates dual-mode exploration; the high-entropy mode produces behavioral patterns distinct from those of the normal mode, providing unique learning signals. Abstract: To encourage diverse exploration in reinforcement learning (RL) for large language models (LLMs) without compromising accuracy, we propose Policy Split, a novel paradigm that bifurcates the policy into normal and high-entropy modes with a high-entropy prompt. While sharing model parameters, the two modes undergo collaborative dual-mode entropy regularization tailored to distinct objectives. Specifically, the normal mode optimizes for task correctness, while the high-entropy mode incorporates a preference for exploration, and the two modes learn collaboratively. Extensive experiments demonstrate that our approach consistently outperforms established entropy-guided RL baselines across various model sizes in general and creative tasks. Further analysis reveals that Policy Split facilitates dual-mode exploration, where the high-entropy mode generates behavioral patterns distinct from those of the normal mode, providing unique learning signals.

[104] Triviality Corrected Endogenous Reward

Xinda Wang,Zhengxu Hou,Yangshijie Zhang,Bingren Yan,Jialin Liu,Chenzhuo Zhao,Zhibo Yang,Bin-Bin Yang,Feng Xiao

Main category: cs.CL

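TL;DR: This paper proposes TCER, which uses relative information gain and a probability-dependent correction mechanism to address the triviality bias caused by confidence-based endogenous rewards in open-ended text generation; it yields unsupervised gains across multiple writing benchmarks and model architectures and transfers to mathematical reasoning.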

Details Motivation: Reinforcement learning for open-ended text generation is limited by the lack of verifiable rewards, and existing judge models depend on annotated data or closed-source large models; inspired by confidence-based unsupervised RL for mathematical reasoning, the authors explore whether the idea applies to writing tasks. Method: TCER (Triviality Corrected Endogenous Reward) takes the relative information gain between a specialist policy and a generalist reference policy as the base reward and adds a probability-dependent correction mechanism to suppress triviality bias. Result: TCER delivers consistent improvements across multiple writing benchmarks and model architectures without external supervision, and also transfers to mathematical reasoning, supporting its cross-task generality. Conclusion: TCER mitigates the triviality bias induced by confidence rewards and offers a generalizable, unsupervised endogenous-reward paradigm for open-ended text generation. Abstract: Reinforcement learning for open-ended text generation is constrained by the lack of verifiable rewards, necessitating reliance on judge models that require either annotated data or powerful closed-source models. Inspired by recent work on unsupervised reinforcement learning for mathematical reasoning using confidence-based endogenous rewards, we investigate whether this principle can be adapted to open-ended writing tasks. We find that directly applying confidence rewards leads to Triviality Bias: the policy collapses toward high-probability outputs, reducing diversity and meaningful content. We propose TCER (Triviality Corrected Endogenous Reward), which addresses this bias by rewarding the relative information gain between a specialist policy and a generalist reference policy, modulated by a probability-dependent correction mechanism. Across multiple writing benchmarks and model architectures, TCER achieves consistent improvements without external supervision. Furthermore, TCER also transfers effectively to mathematical reasoning, validating the generality of our approach across different generation tasks.

[105] NovBench: Evaluating Large Language Models on Academic Paper Novelty Assessment

Wenqing Wu,Yi Zhao,Yuzhuo Wang,Siyou Li,Juexi Shao,Yunfei Long,Chengzhi Zhang

Main category: cs.CL

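TL;DR: This paper presents NovBench, the first large-scale benchmark for evaluating how well large language models (LLMs) assess research novelty, together with a four-dimensional evaluation framework; experiments show that current LLMs have a limited understanding of scientific novelty and that fine-tuned models often exhibit instruction-following deficiencies, calling for targeted optimization strategies.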

Details Motivation: Novelty is a core requirement in academic publishing, but the surge in submissions puts pressure on human reviewers; existing LLMs can generate review comments, yet no dedicated benchmark evaluates their novelty judgment. Method: NovBench contains 1,684 paper-review pairs from a leading NLP conference, with novelty statements extracted from introductions and expert-written novelty evaluations; a four-dimensional evaluation framework (Relevance, Correctness, Coverage, Clarity) is proposed; experiments cover general and specialized LLMs under multiple prompting strategies. Result: Current LLMs show limited understanding of research novelty; fine-tuned models often have instruction-following deficiencies; performance varies markedly across models and prompting strategies. Conclusion: Targeted fine-tuning strategies that jointly improve novelty comprehension and instruction adherence are needed to better support human peer review. Abstract: Novelty is a core requirement in academic publishing and a central focus of peer review, yet the growing volume of submissions has placed increasing pressure on human reviewers. While large language models (LLMs), including those fine-tuned on peer review data, have shown promise in generating review comments, the absence of a dedicated benchmark has limited systematic evaluation of their ability to assess research novelty. To address this gap, we introduce NovBench, the first large-scale benchmark designed to evaluate LLMs' capability to generate novelty evaluations in support of human peer review. NovBench comprises 1,684 paper-review pairs from a leading NLP conference, including novelty descriptions extracted from paper introductions and corresponding expert-written novelty evaluations. We focus on both sources because the introduction provides a standardized and explicit articulation of novelty claims, while expert-written novelty evaluations constitute one of the current gold standards of human judgment. Furthermore, we propose a four-dimensional evaluation framework (including Relevance, Correctness, Coverage, and Clarity) to assess the quality of LLM-generated novelty evaluations. Extensive experiments on both general and specialized LLMs under different prompting strategies reveal that current models exhibit limited understanding of scientific novelty, and that fine-tuned models often suffer from instruction-following deficiencies. These findings underscore the need for targeted fine-tuning strategies that jointly improve novelty comprehension and instruction adherence.

[106] Time is Not a Label: Continuous Phase Rotation for Temporal Knowledge Graphs and Agentic Memory

Weixian Waylon Li,Jiaxin Zhang,Xianan Jim Yang,Tiejun Ma,Yiwen Guo

Main category: cs.CL

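TL;DR: RoMem is a new temporal knowledge graph module that uses a Semantic Speed Gate and continuous phase rotation to achieve geometric shadowing, managing temporal memory dynamically without deleting old facts and substantially improving temporal reasoning and memory stability.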

Details Motivation: Existing methods struggle to distinguish persistent facts (e.g., "born in") from evolving ones (e.g., "president of"), which leads to forgotten or overwritten knowledge or to high computational cost. Method: RoMem comprises a pretrained Semantic Speed Gate (mapping a relation's text embedding to a volatility score) and a continuous phase rotation mechanism, using geometric shadowing in complex vector space so that facts can be temporally ordered and coexist. Result: 72.6 MRR on ICEWS05-15 (state of the art); applied to agentic memory, 2-3x MRR and answer accuracy on temporal reasoning (MultiTQ), leading results on LoCoMo, zero degradation of static memory on DMR-MSC, and zero-shot generalization to unseen financial domains on FinTMMBench. Conclusion: RoMem provides a lightweight, pluggable temporal modeling solution for structured memory systems that avoids frequent LLM calls and unifies the management of persistent and evolving knowledge. Abstract: Structured memory representations such as knowledge graphs are central to autonomous agents and other long-lived systems. However, most existing approaches model time as discrete metadata, either sorting by recency (burying old-yet-permanent knowledge), simply overwriting outdated facts, or requiring an expensive LLM call at every ingestion step, leaving them unable to distinguish persistent facts from evolving ones. To address this, we introduce RoMem, a drop-in temporal knowledge graph module for structured memory systems, applicable to agentic memory and beyond. A pretrained Semantic Speed Gate maps each relation's text embedding to a volatility score, learning from data that evolving relations (e.g., "president of") should rotate fast while persistent ones (e.g., "born in") should remain stable. Combined with continuous phase rotation, this enables geometric shadowing: obsolete facts are rotated out of phase in complex vector space, so temporally correct facts naturally outrank contradictions without deletion. On temporal knowledge graph completion, RoMem achieves state-of-the-art results on ICEWS05-15 (72.6 MRR). Applied to agentic memory, it delivers 2-3x MRR and answer accuracy on temporal reasoning (MultiTQ), dominates the hybrid benchmark (LoCoMo), preserves static memory with zero degradation (DMR-MSC), and generalises zero-shot to unseen financial domains (FinTMMBench).
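The geometric shadowing idea in the RoMem abstract can be illustrated with a toy, one-dimensional sketch. This is only an assumption-laden illustration, not the authors' model: the function name `phase_score`, the use of a single unit complex number per fact, and scoring by the real part are all hypothetical choices. The abstract states only that a volatility score controls how fast a relation rotates and that obsolete facts end up out of phase.

```python
import cmath

def phase_score(volatility: float, t_fact: float, t_query: float) -> float:
    """Toy alignment score in [-1, 1] for a fact asserted at t_fact, queried at t_query.

    Hypothetical model: the fact is a unit complex number rotated by an angle
    proportional to its relation's volatility and to the elapsed time.
    """
    rotated = cmath.exp(1j * volatility * (t_query - t_fact))
    return rotated.real  # equals cos(volatility * elapsed time)

# A persistent relation (volatility 0) never rotates, so it stays fully in phase.
born_in = phase_score(0.0, t_fact=1961, t_query=2024)

# An evolving relation: the older assertion has rotated further out of phase
# than the newer one, so the newer fact outranks it without any deletion.
old_president = phase_score(0.5, t_fact=2020, t_query=2024)
new_president = phase_score(0.5, t_fact=2023, t_query=2024)
```

In this toy version the ordering only holds while the rotation angle stays below pi; a real system would need to handle wrap-around, which the abstract does not describe.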

[107] Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale

Liujie Zhang,Benzhe Ning,Rui Yang,Xiaoyan Yu,Jiaxing Li,Lumeng Wu,Jia Liu,Minghao Li,Weihang Chen,Weiqi Hu,Lei Zhang

Main category: cs.CL

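TL;DR: This paper presents Relax, an open-source reinforcement learning training engine designed for omni-modal and agentic workflows; an omni-native full-stack architecture, fault-isolated services, and an asynchronous data bus address three challenges (heterogeneous data flows, robustness at scale, and the staleness-throughput tradeoff), yielding notable speedups and stable convergence across several models and tasks.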

Details Motivation: As large models extend to omni-modal inputs and multi-turn agentic workflows, existing RL training systems face three interrelated challenges: heterogeneous data flows, poor robustness at scale, and difficulty balancing staleness against throughput. Method: The Relax RL training engine has three co-designed layers: (1) an omni-native architecture with native multimodal support from preprocessing and parallelism through inference generation; (2) each RL role runs as an independent, fault-isolated service that can be scaled and upgraded flexibly; (3) the TransferQueue data bus provides service-level decoupling and asynchronous training, with a single staleness parameter smoothly controlling policy freshness. Result: Relax achieves a 1.20× end-to-end speedup over veRL on Qwen3-4B on-policy training; its fully asynchronous mode reaches 1.76× and 2.00× speedups over colocate on Qwen3-4B and Qwen3-Omni-30B respectively, with consistent reward convergence; R3 support for MoE models adds only 1.9% overhead (versus 32% degradation in veRL); and omni-modal RL training on Qwen3-Omni converges stably across image, text, and audio, sustaining over 2,000 steps on video. Conclusion: Through system-level co-design, Relax addresses key bottlenecks in omni-modal agentic RL training, combining performance, robustness, and generality, and offers an open-source option for next-generation multimodal RL infrastructure. Abstract: Reinforcement learning (RL) post-training has proven effective at unlocking reasoning, self-reflection, and tool-use capabilities in large language models. As models extend to omni-modal inputs and agentic multi-turn workflows, RL training systems face three interdependent challenges: heterogeneous data flows, operational robustness at scale, and the staleness-throughput tradeoff. We present \textbf{Relax} (Reinforcement Engine Leveraging Agentic X-modality), an open-source RL training engine that addresses these challenges through three co-designed architectural layers. First, an \emph{omni-native architecture} builds multimodal support into the full stack -- from data preprocessing and modality-aware parallelism to inference generation -- rather than retrofitting it onto a text-centric pipeline. Second, each RL role runs as an independent, fault-isolated service that can be scaled, recovered, and upgraded without global coordination. Third, service-level decoupling enables asynchronous training via the TransferQueue data bus, where a single staleness parameter smoothly interpolates among on-policy, near-on-policy, and fully asynchronous execution. Relax achieves a 1.20$\times$ end-to-end speedup over veRL on Qwen3-4B on-policy training. Its fully async mode delivers a 1.76$\times$ speedup over colocate on Qwen3-4B and a 2.00$\times$ speedup on Qwen3-Omni-30B, while all modes converge to the same reward level. Relax supports R3 (Rollout Routing Replay)~\cite{ma2025r3} for MoE models with only 1.9\% overhead, compared to 32\% degradation in veRL under the same configuration. It further demonstrates stable omni-modal RL convergence on Qwen3-Omni across image, text, and audio, sustaining over 2,000 steps on video without degradation. Relax is available at https://github.com/rednote-ai/Relax.

[108] Synthius-Mem: Brain-Inspired Hallucination-Resistant Persona Memory Achieving 94.4% Memory Accuracy and 99.6% Adversarial Robustness on LoCoMo

Artem Gadzhiev,Andrew Kislov

Main category: cs.CL

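TL;DR: Synthius-Mem is a brain-inspired structured persona memory system that decomposes conversations into six cognitive domains and performs structured fact extraction and retrieval, reaching 94.37% accuracy and 99.55% adversarial robustness on the LoCoMo benchmark, outperforming existing methods while cutting token consumption by about 5x.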

Details Motivation: Existing memory approaches for LLM agents (sliding windows, summarization, RAG, flat fact extraction) save tokens but cause information loss, semantic drift, or hallucination about the user; moreover, every published system on LoCoMo treats conversation as a retrieval problem over raw dialogue segments, and none evaluates adversarial robustness. Method: Synthius-Mem builds a structured persona memory: a persona extraction pipeline over six domains (biography, experiences, preferences, social circle, work, psychometrics), per-domain consolidation and deduplication, and structured fact retrieval via CategoryRAG. Result: On LoCoMo (10 conversations, 1,813 questions) it reaches 94.37% accuracy (exceeding MemMachine and human performance at 87.9 F1), 98.64% core memory fact accuracy, and 99.55% adversarial robustness, with about 5x lower token consumption than full-context replay. Conclusion: Synthius-Mem is, to the authors' knowledge, the only persona memory system that both exceeds human-level performance and reports adversarial robustness, suggesting that structured, domain-decomposed memory modeling can mitigate hallucination and improve long-term memory reliability. Abstract: Providing AI agents with reliable long-term memory that does not hallucinate remains an open problem. Current approaches to memory for LLM agents -- sliding windows, summarization, embedding-based RAG, and flat fact extraction -- each reduce token cost but introduce catastrophic information loss, semantic drift, or uncontrolled hallucination about the user. The structural reason is architectural: every published memory system on the LoCoMo benchmark treats conversation as a retrieval problem over raw or lightly summarized dialogue segments, and none reports adversarial robustness, the ability to refuse questions about facts the user never disclosed. We present Synthius-Mem, a brain-inspired structured persona memory system that takes a fundamentally different approach. Instead of retrieving what was said, Synthius-Mem extracts what is known about the person: a full persona extraction pipeline decomposes conversations into six cognitive domains (biography, experiences, preferences, social circle, work, psychometrics), consolidates and deduplicates per domain, and retrieves structured facts via CategoryRAG at 21.79 ms latency. On the LoCoMo benchmark (ACL 2024, 10 conversations, 1,813 questions), Synthius-Mem achieves 94.37% accuracy, exceeding all published systems including MemMachine (91.69%, adversarial score is not reported) and human performance (87.9 F1). Core memory fact accuracy reaches 98.64%. Adversarial robustness, the hallucination resistance metric that no competing system reports, reaches 99.55%. Synthius-Mem reduces token consumption by ~5x compared to full-context replay while achieving higher accuracy. Synthius-Mem achieves state-of-the-art results on LoCoMo and is, to our knowledge, the only persona memory system that both exceeds human-level performance and reports adversarial robustness.

[109] Phonological distances for linguistic typology and the origin of Indo-European languages

Marius Mavridis,Juan De Gregorio,Raul Toral,David Sanchez

Main category: cs.CL

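TL;DR: Using an information-theoretic framework that models phoneme sequences as second-order Markov chains, this paper finds that short-range phoneme dependencies reflect large-scale linguistic relatedness; the resulting phonological distance matrix for 67 modern languages recovers major language families, contact-induced convergence, and a correlation with geographic distance, and is consistent with the Steppe hypothesis for Indo-European.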

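Details Motivation: To explore whether short-range phoneme dependencies encode large-scale linguistic relatedness, providing a new method for quantitative typology and evolutionary linguistics. Method: Within an information-theoretic framework, phoneme sequences are modeled as second-order Markov chains; a distance metric incorporating articulatory features of phonemes is used to compute a phonological distance matrix for 67 languages from a multilingual parallel corpus. Result: The distance matrix recovers major language families, reveals contact-induced convergence, and correlates clearly with geographic distance; the inferred Indo-European homeland is consistent with the Steppe hypothesis. Conclusion: Short-range statistical dependencies among phonemes are enough to capture hierarchical structure and evolutionary signal in language systems, offering an efficient, unsupervised quantitative tool for language classification, contact analysis, and homeland inference. Abstract: We show that short-range phoneme dependencies encode large-scale patterns of linguistic relatedness, with direct implications for quantitative typology and evolutionary linguistics. Specifically, using an information-theoretic framework, we argue that phoneme sequences modeled as second-order Markov chains essentially capture the statistical correlations of a phonological system. This finding enables us to quantify distances among 67 modern languages from a multilingual parallel corpus employing a distance metric that incorporates articulatory features of phonemes. The resulting phonological distance matrix recovers major language families and reveals signatures of contact-induced convergence. Remarkably, we obtain a clear correlation with geographic distance, allowing us to constrain a plausible homeland region for the Indo-European family, consistent with the Steppe hypothesis.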
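As a rough illustration of what modeling phoneme sequences as a second-order Markov chain involves, the sketch below estimates P(next phoneme | previous two phonemes) from counts and computes the corresponding conditional entropy. The function names are hypothetical, the toy sequence uses letters as stand-ins for phonemes, and the paper's actual distance metric (which also uses articulatory features) is not reproduced here.

```python
from collections import Counter, defaultdict
from math import log2

def trigram_counts(seq):
    """Count how often each symbol follows each pair of preceding symbols."""
    counts = defaultdict(Counter)
    for a, b, c in zip(seq, seq[1:], seq[2:]):
        counts[(a, b)][c] += 1
    return counts

def second_order_transitions(seq):
    """Maximum-likelihood estimate of P(c | a, b) for every observed context (a, b)."""
    return {
        ctx: {sym: n / sum(nxt.values()) for sym, n in nxt.items()}
        for ctx, nxt in trigram_counts(seq).items()
    }

def conditional_entropy(seq):
    """H(X_t | X_{t-2}, X_{t-1}) in bits, estimated from trigram counts."""
    counts = trigram_counts(seq)
    total = sum(sum(nxt.values()) for nxt in counts.values())
    h = 0.0
    for nxt in counts.values():
        n_ctx = sum(nxt.values())
        for n in nxt.values():
            h -= (n / total) * log2(n / n_ctx)
    return h

toy = list("abcabd")                    # stand-in for a phoneme sequence
probs = second_order_transitions(toy)   # context ('a', 'b') is followed by 'c' or 'd'
entropy_bits = conditional_entropy(toy)
```

A lower conditional entropy means the next phoneme is more predictable from the two before it; comparing such statistics across languages is one plausible ingredient of a phonological distance, though the paper's own construction may differ.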

[110] MIXAR: Scaling Autoregressive Pixel-based Language Models to Multiple Languages and Scripts

Chen Hu,Yintao Tai,Antonio Vergari,Frank Keller,Alessandro Suglia

Main category: cs.CL

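TL;DR: This paper presents MIXAR, the first generative pixel-based language model trained on eight languages spanning a range of scripts; it substantially outperforms earlier pixel-based models and comparable tokenizer-based models on multilingual discriminative and generative tasks and is robust to unseen languages and input perturbations.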

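Details Motivation: Pixel-based language models promise to sidestep tokenization problems, but perceptual diversity across languages severely limits their multilingual generalization. Method: MIXAR is a generative pixel-based language model trained on eight languages using a range of different scripts; it is evaluated on multilingual discriminative and generative tasks and tested for robustness to unseen languages and to perturbations such as orthographic attacks. Result: MIXAR substantially outperforms earlier pixel-based models and comparable tokenizer-based models on multilingual tasks and generalizes to unseen languages; scaling to 0.5B parameters further improves generative tasks such as LAMBADA and robustness to perturbations. Conclusion: MIXAR supports the feasibility and benefits of pixel-level modeling in multilingual settings, offering a path toward removing the dependence on tokenization and improving cross-lingual robustness. Abstract: Pixel-based language models are gaining momentum as alternatives to traditional token-based approaches, promising to circumvent tokenization challenges. However, the inherent perceptual diversity across languages poses a significant hurdle for multilingual generalization in pixel space. This paper introduces MIXAR, the first generative pixel-based language model trained on eight different languages utilizing a range of different scripts. We empirically evaluate MIXAR against previous pixel-based models as well as comparable tokenizer-based models, demonstrating substantial performance improvement on discriminative and generative multilingual tasks. Additionally, we show how MIXAR is robust to languages never seen during training. These results are further strengthened when scaling the model to 0.5B parameters, which not only improves its capabilities in generative tasks like LAMBADA but also its robustness when challenged with input perturbations such as orthographic attacks.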

[111] Decomposing and Reducing Hidden Measurement Error in LLM Evaluation Pipelines

Solomon Messing

Main category: cs.CL

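TL;DR: This paper analyzes overlooked sources of pipeline uncertainty in large language model (LLM) evaluation, decomposing them into variance that shrinks with more data and sensitivity to researcher design choices, and proposes optimizing the evaluation pipeline to reduce total error; experiments show that optimized pipelines outperform most naive pipelines on several tasks and yield better-calibrated confidence intervals.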

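Details Motivation: LLM evaluation results influence model deployment, safety standards, and published research conclusions, yet the scores carry unquantified hidden uncertainty (rephrasing the prompt, switching the judge model, or changing the temperature can flip rankings); standard confidence intervals underestimate this variance, and developers can exploit it for spurious optimization. Method: Pipeline uncertainty is decomposed into data-dependent variance and sensitivity to design choices, modeling the contribution of each stage (prompt, judge model, sampling parameters, and so on); a small-sample variance estimate is used to build confidence intervals with better coverage, and the most efficient resource allocation for minimizing total error is projected. Result: Across ideology annotation, safety classification, MMLU benchmarking, and a human-validated propaganda audit, projection-optimized pipelines outperform 73% of naive pipelines; on MMLU, estimation error is halved at equivalent cost; small-sample variance estimation suffices for confidence intervals that approach nominal coverage, and concrete recommendations for reducing measurement error are given. Conclusion: LLM evaluation should explicitly model and control pipeline-level uncertainty; more data alone cannot remove the exploitable sensitivity introduced by design choices; benchmark builders should use the decomposition to avoid design options that are easy to game, improving robustness and credibility. Abstract: LLM evaluations drive which models get deployed, which safety standards get adopted, and which research conclusions get published. Yet these scores carry hidden uncertainty: rephrasing the prompt, switching the judge model, or changing the temperature can shift results enough to flip rankings and reverse conclusions. Standard confidence intervals ignore this variance, producing under-coverage that worsens with more data. The unmeasured variance also creates an exploitable surface: model developers can optimize against measurement noise rather than genuine capability. This paper decomposes LLM pipeline uncertainty into its sources, distinguishes variance that shrinks with more data from sensitivity to researcher design choices, and projects the most efficient path to reducing total error. For benchmark builders, the same decomposition identifies which design choices contribute exploitable surface for gaming and prescribes designs that minimize it. Across ideology annotation, safety classification, MMLU benchmarking, and a human-validated propaganda audit, projection-optimized pipelines outperform 73\% of possible naive pipelines against a human baseline. On MMLU, optimized budget allocation halves estimation error compared to standard single-prompt evaluation at equivalent cost. A small-sample variance estimation exercise is sufficient to derive confidence intervals that approach nominal coverage when the model includes the relevant pipeline facets, and to generate recommendations for reducing measurement error and improving benchmark robustness.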
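To make the distinction between the two kinds of uncertainty concrete, here is a deliberately crude two-facet sketch, assuming a small score matrix with one row per prompt variant and one column per benchmark item. The function name `decompose` and the toy data are hypothetical, and this is not the paper's estimator: in particular, the spread of prompt means below still absorbs some item-sampling noise, which a proper variance-components model would separate out.

```python
from statistics import mean, variance

def decompose(scores):
    """Split uncertainty in a benchmark score into two rough components.

    scores[p][i] is the score of item i under prompt variant p.
    Returns (item_sampling_variance, prompt_sensitivity_variance).
    """
    n_items = len(scores[0])
    # Variance of a single prompt's mean score due to which items were sampled;
    # this term shrinks as more items are added.
    item_sampling = mean(variance(row) for row in scores) / n_items
    # Spread of mean scores across prompt variants; adding items does not remove
    # the part of this spread caused by the design choice itself.
    prompt_sensitivity = variance([mean(row) for row in scores])
    return item_sampling, prompt_sensitivity

toy_scores = [
    [1, 0, 1, 0],  # prompt variant A
    [1, 1, 1, 0],  # prompt variant B
]
item_var, prompt_var = decompose(toy_scores)
```

A confidence interval built only from the first component treats the prompt as fixed, which is one way to read the under-coverage the abstract describes.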

[112] A Triadic Suffix Tokenization Scheme for Numerical Reasoning

Olga Chetverina

Main category: cs.CL

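TL;DR: This paper proposes Triadic Suffix Tokenization (TST), a deterministic tokenization scheme for numbers that groups digits into three-digit triads and attaches explicit magnitude markers, addressing the arithmetic and scientific reasoning errors that arise when conventional tokenization breaks up numerical structure.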

Details Motivation: Standard subword tokenization fragments numbers inconsistently, so large language models lose positional and decimal structure, which leads to errors in arithmetic and scientific reasoning. Method: Triadic Suffix Tokenization (TST) partitions the integer part into three-digit groups (triads) and attaches to each a fixed suffix marking its order of magnitude (thousands, millions, billions, and so on); the fractional part uses a parallel system of replicated markers (tenths, thousandths, and so on). Two implementations are offered: vocabulary expansion (at most 10,000 added tokens, covering $10^{-15}$ to $10^{18}$) and dynamic suffix markers (a small set of special tokens denoting magnitude). Result: TST preserves exact digits and makes order-of-magnitude relationships explicit at the token level; it scales linearly to arbitrary precision and range; it is architecture-agnostic and can serve as a drop-in preprocessing step. Conclusion: TST offers a structured, deterministic, scalable tokenization scheme for numbers that may improve the stability and accuracy of LLM numerical reasoning; experimental validation is left to future work. Abstract: Standard subword tokenization methods fragment numbers inconsistently, causing large language models (LLMs) to lose positional and decimal structure - a primary driver of errors in arithmetic and scientific reasoning. We introduce Triadic Suffix Tokenization (TST), a deterministic scheme that partitions digits into three-digit triads and annotates each triad with an explicit magnitude marker. Critically, the scheme defines a fixed, one-to-one mapping between suffixes and orders of magnitude for the integer part (thousands, millions, billions, etc.) and a parallel system of replicated markers for fractional depth (tenths, thousandths, millionths, etc.). Unlike approaches that rely on positional inference, this method provides a consistent gradient signal, which should ensure stable convergence. Two implementation variants are proposed: (1) a vocabulary-based approach that adds at most 10,000 fixed tokens to an existing vocabulary, covering 33 orders of magnitude ($10^{-15}$ to $10^{18}$); and (2) a suffix-marker approach that uses a small set of special tokens to denote magnitude dynamically. Both variants preserve exact digits while making order-of-magnitude relationships transparent at the token level. The framework is inherently scalable, allowing for linear vocabulary expansion to accommodate arbitrary precision and range. TST is architecture-agnostic and can be integrated as a drop-in preprocessing step. Experimental validation is deferred to future work.
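The triad-plus-marker idea can be sketched in a few lines. The marker syntax `<E3>`, the function names, and the decision to leave the leading integer triad and the trailing fractional group unpadded are all assumptions made for this illustration; the abstract does not specify the concrete token format.

```python
import re

def tst_tokenize(number: str) -> list[str]:
    """Split a plain decimal string into triads tagged with a power-of-ten marker."""
    tokens = []
    if number.startswith("-"):
        tokens.append("-")
        number = number[1:]
    int_part, _, frac_part = number.partition(".")
    if int_part:
        # Group integer digits from the right; the leading group may be shorter.
        head = len(int_part) % 3 or 3
        groups = [int_part[:head]]
        groups += [int_part[i:i + 3] for i in range(head, len(int_part), 3)]
        for j, g in enumerate(groups):
            tokens.append(f"{g}<E{3 * (len(groups) - 1 - j)}>")
    # Group fractional digits from the left; the trailing group may be shorter.
    for j in range(0, len(frac_part), 3):
        tokens.append(f"{frac_part[j:j + 3]}<E-{j + 3}>")
    return tokens

def tst_detokenize(tokens: list[str]) -> str:
    """Invert tst_tokenize by stripping markers and restoring the decimal point."""
    out, seen_fraction = [], False
    for tok in tokens:
        if tok == "-":
            out.append(tok)
            continue
        digits, exponent = re.fullmatch(r"(\d+)<E(-?\d+)>", tok).groups()
        if int(exponent) < 0 and not seen_fraction:
            out.append(".")
            seen_fraction = True
        out.append(digits)
    return "".join(out)

example = tst_tokenize("1234567.25")
```

Here `example` is `['1<E6>', '234<E3>', '567<E0>', '25<E-3>']`, so the millions, thousands, and units triads are distinguishable at the token level, and `tst_detokenize` recovers the original digit string exactly.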

[113] Self-Evolving LLM Memory Extraction Across Heterogeneous Tasks

Yuqing Yang,Tengxiao Liu,Wang Bill Zhu,Taiwei Shi,Linxin Song,Robin Jia

Main category: cs.CL

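TL;DR: This paper introduces the BEHEMOTH benchmark and the CluE method to address inconsistent memory extraction by LLM assistants across diverse tasks; CluE optimizes the extraction prompt through clustering and improves performance on heterogeneous tasks.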

Details Motivation: As LLM assistants become persistent and personalized, they must extract and retain useful information from past conversations as memory; however, the kind of information worth remembering differs widely across tasks, and existing methods struggle to cover that diversity. Method: The paper defines the heterogeneous memory extraction task; builds BEHEMOTH, a benchmark spanning personalization, problem-solving, and agentic tasks; and designs CluE, a cluster-based self-evolving strategy that first groups examples by extraction scenario, then analyzes each cluster independently and merges cross-cluster insights to update the extraction prompt. Result: On BEHEMOTH, CluE achieves a +9.04% relative gain and consistently outperforms prior self-evolving frameworks across heterogeneous tasks. Conclusion: Static prompts and self-evolving methods designed for homogeneous tasks are ill-suited to heterogeneous memory extraction; scenario-aware clustering with cross-cluster synthesis improves the generalization and robustness of memory extraction. Abstract: As LLM-based assistants become persistent and personalized, they must extract and retain useful information from past conversations as memory. However, the types of information worth remembering vary considerably across tasks. We formalize the \textit{heterogeneous memory extraction} task and introduce \textbf{BEHEMOTH}, a benchmark that repurposes 18 existing datasets spanning personalization, problem-solving, and agentic tasks, using a downstream utility-driven metric for systematic evaluation. Our empirical analysis confirms that no single static extraction prompt dominates across all task categories, and that existing self-evolving prompt optimization frameworks, originally designed for homogeneous distributions, degrade when training tasks are heterogeneous. To address this, we propose \textbf{CluE}, a cluster-based self-evolving strategy that groups training examples into clusters by extraction scenarios, analyzes each cluster independently, and synthesizes cross-cluster insights to update the extraction prompt. Experiments on BEHEMOTH show that CluE generalizes effectively across heterogeneous tasks ($+$9.04\% relative gain), consistently outperforming prior self-evolving frameworks.

[114] Utilizing and Calibrating Hindsight Process Rewards via Reinforcement with Mutual Information Self-Evaluation

Jiashu Yao,Heyan Huang,Zeming Liu,Yuhang Guo

Main category: cs.CL

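TL;DR: This paper proposes Mutual Information Self-Evaluation (MISE), a reinforcement learning paradigm that tackles sparse rewards for agents based on large language models (LLMs). MISE uses hindsight generative self-evaluation as a dense reward signal while calibrating it against environmental feedback. The authors prove that this is equivalent to minimizing an objective combining mutual information with a KL divergence term; experiments show that MISE lets open-source LLMs of about 7B parameters approach GPT-4o performance without expert supervision.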

Details Motivation: To address the sparse-reward challenge faced by LLM-based reinforcement learning agents and improve their ability to learn autonomously. Method: The Mutual Information Self-Evaluation (MISE) paradigm uses hindsight generative self-evaluation as a dense intrinsic reward and adds a calibration step that aligns it with the optimal policy; the authors derive that this is equivalent to minimizing an objective combining mutual information with a KL divergence term. Result: MISE outperforms strong baselines and, without expert supervision, enables open-source LLMs of about 7B parameters to reach performance comparable to GPT-4o on validation. Conclusion: MISE provides the first formal theoretical foundation for generative self-rewarding and demonstrates its effectiveness and practicality for autonomous reinforcement learning with LLMs. Abstract: To overcome the sparse reward challenge in reinforcement learning (RL) for agents based on large language models (LLMs), we propose Mutual Information Self-Evaluation (MISE), an RL paradigm that utilizes hindsight generative self-evaluation as dense reward signals while simultaneously calibrating them against environmental feedback. Empirically, MISE enables an agent to learn autonomously from dense internal rewards supplementing sparse extrinsic signals. Theoretically, our work provides the first formal foundation for the paradigm of generative self-rewarding. We prove that utilizing hindsight self-evaluation rewards is equivalent to minimizing an objective that combines mutual information with a KL divergence term between the policy and a proxy reward policy. This theoretical insight then informs and justifies our calibration step, which actively aligns these rewards with the optimal policy. Extensive experiments show that MISE outperforms strong baselines, enabling open-source LLMs of about 7B parameters to achieve performance comparable to GPT-4o on validation without expert supervision.

[115] Back to Basics: Let Conversational Agents Remember with Just Retrieval and Generation

Yuqian Wu,Wei Chen,Zhengjun Huang,Junle Chen,Qingxiang Liu,Kai Wang,Xiaofang Zhou,Yuxuan Liang

Main category: cs.CL

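TL;DR: This paper proposes \method, a minimalist conversational memory framework that uses Turn Isolation Retrieval (TIR) and Query-Driven Pruning (QDP) to address signal sparsity and redundancy in long conversations, improving both performance and efficiency.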

Details Motivation: Existing conversational memory systems are limited by context dilution; the authors argue that the underlying bottleneck is a Signal Sparsity Effect in the latent knowledge manifold, manifested as Decisive Evidence Sparsity and Dual-Level Redundancy. Method: The \method framework has two core components: Turn Isolation Retrieval (TIR), which replaces global aggregation with a max-activation strategy, and Query-Driven Pruning (QDP), which removes redundant sessions and conversational filler to build a high-density evidence set. Result: Across multiple benchmarks, \method consistently outperforms strong baselines in robustness, performance, token efficiency, and latency. Conclusion: Conversational memory can go back to basics, relying only on retrieval and generation; \method sets a new minimalist baseline for the field. Abstract: Existing conversational memory systems rely on complex hierarchical summarization or reinforcement learning to manage long-term dialogue history, yet remain vulnerable to context dilution as conversations grow. In this work, we offer a different perspective: the primary bottleneck may lie not in memory architecture, but in the \textit{Signal Sparsity Effect} within the latent knowledge manifold. Through controlled experiments, we identify two key phenomena: \textit{Decisive Evidence Sparsity}, where relevant signals become increasingly isolated with longer sessions, leading to sharp degradation in aggregation-based methods; and \textit{Dual-Level Redundancy}, where both inter-session interference and intra-session conversational filler introduce large amounts of non-informative content, hindering effective generation. Motivated by these insights, we propose \method, a minimalist framework that brings conversational memory back to basics, relying solely on retrieval and generation via Turn Isolation Retrieval (TIR) and Query-Driven Pruning (QDP). TIR replaces global aggregation with a max-activation strategy to capture turn-level signals, while QDP removes redundant sessions and conversational filler to construct a compact, high-density evidence set. Extensive experiments on multiple benchmarks demonstrate that \method achieves robust performance across diverse settings, consistently outperforming strong baselines while maintaining high efficiency in tokens and latency, establishing a new minimalist baseline for conversational memory.
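The contrast between global aggregation and a max-activation strategy can be shown with a small toy example. Everything here is an assumption for illustration: the two-dimensional "embeddings", the cosine similarity, and the function names are hypothetical, and the abstract does not say how TIR actually scores turns.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def mean_score(query, turns):
    """Aggregation-style session score: average similarity over all turns."""
    return sum(cosine(query, t) for t in turns) / len(turns)

def max_score(query, turns):
    """Max-activation session score: similarity of the single best-matching turn."""
    return max(cosine(query, t) for t in turns)

query = [1.0, 0.0]
# One decisive turn buried among four filler turns.
sparse_session = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
# Five turns that are all only loosely related to the query.
diffuse_session = [[1.0, 1.0]] * 5

mean_prefers_diffuse = mean_score(query, diffuse_session) > mean_score(query, sparse_session)
max_prefers_sparse = max_score(query, sparse_session) > max_score(query, diffuse_session)
```

With averaging, the filler turns dilute the one decisive turn and the loosely related session ranks higher; taking the maximum keeps the decisive turn visible, which matches the motivation the abstract gives for replacing global aggregation.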

[116] CArtBench: Evaluating Vision-Language Models on Chinese Art Understanding, Interpretation, and Authenticity

Xuefeng Wei,Zhixuan Wang,Xuan Zhou,Zhi Qu,Hongyao Li,Yusuke Sakai,Hidetaka Kamigaito,Taro Watanabe

Main category: cs.CL

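TL;DR: This paper presents CARTBENCH, a benchmark for vision-language models built on Palace Museum collections, with four subtasks that evaluate higher-order abilities such as art appreciation, reasoning, and authenticity discrimination; experiments show that current mainstream VLMs still fall well short in evidence linking, style-to-period inference, long-form appreciation, and authenticity judgment.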

Details Motivation: Existing VLM evaluations focus on short-form recognition and QA and lack systematic assessment of deeper understanding of Chinese artworks (expert-style appreciation, style-to-period dating, authenticity discrimination), so a museum-grounded, high-difficulty benchmark is needed. Method: CARTBENCH is built from image-bearing Palace Museum objects in Wikidata aligned with authoritative catalog pages, covering five art categories across multiple dynasties, with four subtasks: CURATORQA (evidence-grounded recognition and reasoning), CATALOGCAPTION (four-section expert-style description), REINTERPRET (defensible reinterpretation), and CONNOISSEURPAIRS (authenticity discrimination under high visual similarity). Result: Across nine mainstream VLMs, overall CURATORQA accuracy is high but drops sharply on evidence linking and style-to-period inference; CATALOGCAPTION long-form generation falls far short of expert references; CONNOISSEURPAIRS authenticity discrimination is near chance. Conclusion: Current VLMs still lack the higher-order, interpretable, diagnostic reasoning that connoisseurship requires, and CARTBENCH provides a benchmark for research in this direction. Abstract: We introduce CARTBENCH, a museum-grounded benchmark for evaluating vision-language models (VLMs) on Chinese artworks beyond short-form recognition and QA. CARTBENCH comprises four subtasks: CURATORQA for evidence-grounded recognition and reasoning, CATALOGCAPTION for structured four-section expert-style appreciation, REINTERPRET for defensible reinterpretation with expert ratings, and CONNOISSEURPAIRS for diagnostic authenticity discrimination under visually similar confounds. CARTBENCH is built by aligning image-bearing Palace Museum objects from Wikidata with authoritative catalog pages, spanning five art categories across multiple dynasties. Across nine representative VLMs, we find that high overall CURATORQA accuracy can mask sharp drops on hard evidence linking and style-to-period inference; long-form appreciation remains far from expert references; and authenticity-oriented diagnostic discrimination stays near chance, underscoring the difficulty of connoisseur-level reasoning for current models.

[117] RPA-Check: A Multi-Stage Automated Framework for Evaluating Dynamic LLM-based Role-Playing Agents

Riccardo Rosati,Edoardo Colucci,Massimiliano Bolognini,Adriano Mancini,Paolo Sernani

Main category: cs.CL

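TL;DR: This paper proposes RPA-Check, a multi-stage automated framework for evaluating LLM-based role-playing agents (RPAs) in complex, constraint-heavy environments, covering role adherence, logical consistency, and narrative stability, and validates it in a legal simulation setting.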

Details Motivation: Existing NLP metrics fail to capture the nuances of role adherence, logical consistency, and long-term narrative stability in role-playing agents, so a more objective and reproducible dedicated evaluation method is needed. Method: RPA-Check is a four-stage automated evaluation pipeline: (1) Dimension Definition, establishing high-level behavioral criteria; (2) Augmentation, expanding them into granular boolean checklist items; (3) Semantic Filtering, ensuring indicator objectivity, non-redundancy, and agent isolation; (4) LLM-as-a-Judge, scoring fidelity with chain-of-thought verification. Result: Validated on LLM Court, a serious game for forensic training, experiments across five legal scenarios show that the framework exposes trade-offs among model size, reasoning depth, and operational stability; smaller, adequately instruction-tuned models (8-9B) can show better procedural consistency than larger models, which are prone to user-alignment bias or sycophancy. Conclusion: RPA-Check provides a standardized, reproducible quantitative tool for evaluating generative agents in specialized domains such as forensic training. Abstract: The rapid adoption of Large Language Models (LLMs) in interactive systems has enabled the creation of dynamic, open-ended Role-Playing Agents (RPAs). However, evaluating these agents remains a significant challenge, as standard NLP metrics fail to capture the nuances of role adherence, logical consistency, and long-term narrative stability. This paper introduces RPA-Check, a multi-stage automated evaluation framework designed to objectively assess the performance of LLM-based RPAs in complex, constraint-heavy environments. Our methodology is based on a four-step pipeline: (1) Dimension Definition, establishing high-level qualitative behavioral criteria; (2) Augmentation, where these requirements are expanded into granular boolean checklist indicators; (3) Semantic Filtering, to ensure indicator objectivity, non-redundancy, and agent isolation; and (4) LLM-as-a-Judge Evaluation, which employs chain-of-thought verification to score agent fidelity. We validate this framework by applying it to LLM Court, a serious game for forensic training involving several quantized local models. Experimental results across five distinct legal scenarios demonstrate the framework's ability to identify subtle trade-offs between model size, reasoning depth, and operational stability. Notably, the findings reveal an inverse relationship between parametric scale and procedural consistency, showing that smaller, adequately instruction-tuned models (8-9B) can outperform larger architectures prone to user-alignment bias or sycophancy. RPA-Check thus provides a standardized and reproducible metric for future research in generative agent evaluation within specialized domains.

[118] Hidden Failures in Robustness: Why Supervised Uncertainty Quantification Needs Better Evaluation

Joe Stacey,Hadas Orgad,Kentaro Inui,Benjamin Heinzerling,Nafise Sadat Moosavi

Main category: cs.CL

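TL;DR: This paper systematically studies the robustness of supervised uncertainty probes across models, tasks, and out-of-distribution (OOD) settings, finding that current methods are not robust on long-form generation; middle-layer representations and aggregation across response tokens are more robust than final-layer single-token features, and a simple hybrid back-off strategy is explored to improve robustness.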

Details Motivation: Existing hidden-state-based methods for uncertainty estimation and hallucination detection lack systematic evaluation of robustness under distribution shift, so it is unclear which probe designs give reliable uncertainty estimates in OOD settings. Method: Over 2,000 supervised uncertainty probes are trained while varying the representation layer, feature type, and token aggregation strategy, and their robustness is evaluated across models, tasks, and OOD settings. Result: Current probes are not robust on long-form generation; middle-layer representations generalize better than final-layer ones; aggregating across response tokens is consistently more robust than single-token features; these differences are hard to see in-distribution but matter under distribution shift; a simple hybrid back-off strategy is explored for improving robustness. Conclusion: Probe robustness depends mainly on the probe inputs rather than the architecture; OOD evaluation deserves more attention; middle layers plus cross-token aggregation are the more robust design choice; better evaluation is a prerequisite for building more robust probes. Abstract: Recent work has shown that the hidden states of large language models contain signals useful for uncertainty estimation and hallucination detection, motivating a growing interest in efficient probe-based approaches. Yet it remains unclear how robust existing methods are, and which probe designs provide uncertainty estimates that are reliable under distribution shift. We present a systematic study of supervised uncertainty probes across models, tasks, and OOD settings, training over 2,000 probes while varying the representation layer, feature type, and token aggregation strategy. Our evaluation highlights poor robustness in current methods, particularly in the case of long-form generations. We also find that probe robustness is driven less by architecture and more by the probe inputs. Middle-layer representations generalise more reliably than final-layer hidden states, and aggregating across response tokens is consistently more robust than relying on single-token features. These differences are often largely invisible in-distribution but become more important under distribution shift. Informed by our evaluation, we explore a simple hybrid back-off strategy for improving robustness, arguing that better evaluation is a prerequisite for building more robust probes.

[119] Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind

Hanqi Xiao,Vaidehi Patil,Zaid Khan,Hyunji Lee,Elias Stengel-Eskin,Mohit Bansal

Main category: cs.CL

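TL;DR: This paper proposes a new privacy-themed theory-of-mind (ToM) challenge, ToM for Steering Beliefs (ToM-SB), in which a defender model acts as a "double agent" that steers an attacker toward false beliefs. Frontier LLMs perform poorly on the task, whereas AI Double Agents trained with reinforcement learning on ToM and fooling rewards improve substantially on both abilities and generalize across attacker types and out-of-distribution scenarios.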

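Details Motivation: As LLMs are widely deployed in conversational systems, their ability to reason about a dialogue partner's intentions and states (theory of mind) is critical for safe interaction, especially with potentially adversarial users; existing models lack both evaluation and training mechanisms for actively steering an adversary's beliefs in privacy-sensitive settings. Method: Propose the ToM-SB task, a two-role game in which the defender is a double agent and the attacker has partial prior knowledge; train AI Double Agents with reinforcement learning, optimizing "fooling success" and "ToM" rewards separately and jointly; evaluate systematically across four attackers, six defender methods, and in-distribution and OOD settings. Result: Frontier models (such as Gemini3-Pro and GPT-5.4) struggle to fool attackers in hard scenarios even with ToM prompting; optimizing fooling alone or ToM alone each improves the other ability; AI Double Agents trained with the combined reward substantially outperform the strongest baseline models on hard scenarios and generalize well to OOD attackers. Conclusion: ToM-SB shows that belief modeling is a key capability for effective adversarial privacy protection; ToM and fooling ability are intrinsically synergistic; the dual-reward reinforcement learning approach to AI Double Agents is extensible and upgradable, offering a new path toward safety-aware agents. Abstract: As large language models (LLMs) become the engine behind conversational systems, their ability to reason about the intentions and states of their dialogue partners (i.e., form and use a theory-of-mind, or ToM) becomes increasingly critical for safe interaction with potentially adversarial partners. We propose a novel privacy-themed ToM challenge, ToM for Steering Beliefs (ToM-SB), in which a defender must act as a Double Agent to steer the beliefs of an attacker with partial prior knowledge within a shared universe. To succeed on ToM-SB, the defender must engage with and form a ToM of the attacker, with a goal of fooling the attacker into believing they have succeeded in extracting sensitive information. We find that strong frontier models like Gemini3-Pro and GPT-5.4 struggle on ToM-SB, often failing to fool attackers in hard scenarios with partial attacker prior knowledge, even when prompted to reason about the attacker's beliefs (ToM prompting). To close this gap, we train models on ToM-SB to act as AI Double Agents using reinforcement learning, testing both fooling and ToM rewards. Notably, we find a bidirectionally emergent relationship between ToM and attacker-fooling: rewarding fooling success alone improves ToM, and rewarding ToM alone improves fooling.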
Across four attackers with different strengths, six defender methods, and both in-distribution and out-of-distribution (OOD) evaluation, we find that gains in ToM and attacker-fooling are well-correlated, highlighting belief modeling as a key driver of success on ToM-SB. AI Double Agents that combine both ToM and fooling rewards yield the strongest fooling and ToM performance, outperforming Gemini3-Pro and GPT-5.4 with ToM prompting on hard scenarios. We also show that ToM-SB and AI Double Agents can be extended to stronger attackers, demonstrating generalization to OOD settings and the upgradability of our task.

[120] Please Make it Sound like Human: Encoder-Decoder vs. Decoder-Only Transformers for AI-to-Human Text Style Transfer

Utsav Paneru

Main category: cs.CL

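TL;DR: This paper presents a method for rewriting AI-generated text in a human style. It builds a parallel corpus of 25,140 pairs, identifies 11 measurable stylistic markers, and fine-tunes BART-base, BART-large, and Mistral-7B-Instruct. BART-large achieves the highest reference similarity with far fewer parameters than Mistral-7B, and the paper points out that "shift accuracy" is a blind spot in current style transfer evaluation.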

Details Motivation: Existing work focuses mostly on detecting AI-generated text, while systematically rewriting AI-generated text into a human style has received little study and calls for new methods and evaluation standards. Method: Build a parallel corpus of AI-input and human-reference text chunks (25,140 pairs), extract 11 quantifiable stylistic markers, and fine-tune BART-base, BART-large, and Mistral-7B-Instruct with QLoRA. Result: BART-large performs best on BERTScore F1 (0.924), ROUGE-L (0.566), and chrF++ (55.92) with 17x fewer parameters than Mistral-7B; Mistral-7B scores higher on marker shift, but this reflects overshoot rather than accuracy. Conclusion: BART-large is the more efficient and more accurate model for human-style rewriting; current style transfer evaluation relies too heavily on the overall shift score and overlooks shift accuracy, so new evaluation measures are needed. Abstract: AI-generated text has become common in academic and professional writing, prompting research into detection methods. Less studied is the reverse: systematically rewriting AI-generated prose to read as genuinely human-authored. We build a parallel corpus of 25,140 paired AI-input and human-reference text chunks, identify 11 measurable stylistic markers separating the two registers, and fine-tune three models: BART-base, BART-large, and Mistral-7B-Instruct with QLoRA. BART-large achieves the highest reference similarity -- BERTScore F1 of 0.924, ROUGE-L of 0.566, and chrF++ of 55.92 -- with 17x fewer parameters than Mistral-7B. We show that Mistral-7B's higher marker shift score reflects overshoot rather than accuracy, and argue that shift accuracy is a meaningful blind spot in current style transfer evaluation.

Jieying Xue,Phuong Minh Nguyen,Ha Thanh Nguyen,May Myo Zin,Ken Satoh

Main category: cs.CL

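TL;DR: This paper proposes the Legal2LogicICL framework, which combines retrieval-augmented generation with legal-structure-aware few-shot learning to improve the generalization of logic-based legal reasoning systems under data scarcity, generating accurate logical rules stably without additional training.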

Details Motivation: Existing logic-based legal reasoning systems are limited by the scarcity of high-quality annotated data and generalize poorly; in addition, long entity mentions in legal texts tend to bias retrieval and obscure the key reasoning patterns. Method: Propose Legal2LogicICL, a few-shot retrieval framework that accounts for both semantic representations and legal text structure and builds informative, robust in-context demonstrations by mitigating entity-induced retrieval bias; construct a new dataset, Legal2Proleg, for evaluating legal semantic parsing. Result: Experiments on open-source and proprietary LLMs show that the method significantly improves the accuracy, stability, and generalization of converting natural-language legal cases into logical representations. Conclusion: Legal2LogicICL achieves efficient, interpretable, and reliable legal logical reasoning without fine-tuning, offering a new approach to legal AI in data-scarce settings. Abstract: This work aims to improve the generalization of logic-based legal reasoning systems by integrating recent advances in NLP with legal-domain adaptive few-shot learning techniques using LLMs. Existing logic-based legal reasoning pipelines typically rely on fine-tuned models to map natural-language legal cases into logical formulas before forwarding them to a symbolic reasoner. However, such approaches are heavily constrained by the scarcity of high-quality annotated training data. To address this limitation, we propose a novel LLM-based legal reasoning framework that enables effective in-context learning through retrieval-augmented generation. Specifically, we introduce Legal2LogicICL, a few-shot retrieval framework that balances diversity and similarity of exemplars at both the latent semantic representation level and the legal text structure level. In addition, our method explicitly accounts for legal structure by mitigating entity-induced retrieval bias in legal texts, where lengthy and highly specific entity mentions often dominate semantic representations and obscure legally meaningful reasoning patterns. Our Legal2LogicICL constructs informative and robust few-shot demonstrations, leading to accurate and stable logical rule generation without requiring additional training. In addition, we construct a new dataset, named Legal2Proleg, which is annotated with alignments between legal cases and PROLEG logical formulas to support the evaluation of legal semantic parsing.
Experimental results on both open-source and proprietary LLMs demonstrate that our approach significantly improves accuracy, stability, and generalization in transforming natural-language legal case descriptions into logical representations, highlighting its effectiveness for interpretable and reliable legal reasoning. Our code is available at https://github.com/yingjie7/Legal2LogicICL.

[122] Evaluating Cooperation in LLM Social Groups through Elected Leadership

Ryan Faulkner,Anushka Deshpande,David Guzman Piedrahita,Joel Z. Leibo,Zhijing Jin

Main category: cs.CL

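TL;DR: This paper examines whether introducing leadership and election mechanisms into multi-agent systems improves cooperation and social welfare. LLM-based simulations show that elected leadership substantially raises social welfare (+55.4%) and survival time (+128.6%), and social-graph and sentiment analyses characterize the leaders' social influence and rhetoric.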

Details Motivation: Existing LLM-based multi-agent cooperation research does not model the leadership and election mechanisms that are ubiquitous in human society, which limits its applicability to complex social dilemmas such as common-pool resource governance. Method: Build an open-source framework that simulates multi-agent cooperation driven by elected leaders and their agendas; run an empirical study of LLMs under controlled governance conditions; construct an agent social graph and compute centrality metrics; analyze sentiment and cooperative tendencies in leader utterances. Result: Elected leadership improves social welfare by 55.4% and survival time by 128.6%; leaders show high centrality in the social graph; their utterances show stronger cooperative and positive-sentiment tendencies. Conclusion: Structured leadership and election mechanisms can markedly strengthen the cooperation and sustainability of multi-agent systems, providing an empirical basis for further study of social multi-agent systems. Abstract: Governing common-pool resources requires agents to develop enduring strategies through cooperation and self-governance to avoid collective failure. While foundation models have shown potential for cooperation in these settings, existing multi-agent research provides little insight into whether structured leadership and election mechanisms can improve collective decision making. The lack of such a critical organizational feature ubiquitous in human society presents a significant shortcoming of the current methods. In this work we aim to directly address whether leadership and elections can support improved social welfare and cooperation through multi-agent simulation with LLMs. We present our open-source framework that simulates leadership through elected personas and candidate-driven agendas and carry out an empirical study of LLMs under controlled governance conditions. Our experiments demonstrate that having elected leadership improves social welfare scores by 55.4% and survival time by 128.6% across a range of high performing LLMs. Through the construction of an agent social graph we compute centrality metrics to assess the social influence of leader personas and also analyze rhetorical and cooperative tendencies revealed through a sentiment analysis on leader utterances. This work lays the foundation for further study of election mechanisms in multi-agent systems toward navigating complex social dilemmas.
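As a rough illustration of the social-graph analysis, the sketch below computes degree centrality on a toy interaction graph. The abstract does not name the centrality metrics used, so degree centrality is an assumption, and the agent names and edges are invented for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def degree_centrality(edges: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Fraction of the other nodes each node is directly connected to.

    Edges are treated as undirected; the metric choice is an assumption.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}

# Hypothetical interaction graph: the elected leader talks to every agent.
edges = [("leader", "a1"), ("leader", "a2"), ("leader", "a3"), ("a1", "a2")]
scores = degree_centrality(edges)
print(max(scores, key=scores.get))  # leader
```

A leader persona with high social influence would be expected to rank near the top under a metric of this kind.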

[123] Discourse Diversity in Multi-Turn Empathic Dialogue

Hongli Zhan,Emma S. Gueorguieva,Javier Hernandez,Jina Suh,Desmond C. Ong,Junyi Jessy Li

Main category: cs.CL

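TL;DR: This paper finds that large language models (LLMs) repeat discourse moves across turns in multi-turn empathic dialogue far more than human supporters do. It introduces MINT, the first reinforcement learning framework targeting multi-turn discourse move diversity, which improves empathy quality while substantially reducing tactic repetition.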

Details Motivation: Prior work shows that LLMs produce highly empathic single-turn responses but overlooks how heavily they repeat discourse moves across a multi-turn conversation, even though effective empathic support depends on varying strategies as the conversation unfolds. Method: Show empirically that LLMs reuse tactics across turns in multi-turn emotional support conversations far more often than humans; then propose MINT, a reinforcement learning method that combines an empathy quality reward with a cross-turn tactic novelty signal. Result: MINT improves aggregate empathy by 25.3% across 1.7B and 4B models and reduces cross-turn discourse move repetition by 26.3% on the 4B model, surpassing all baselines, including quality-only and token-level diversity methods. Conclusion: What current LLMs lack is not empathy itself but the ability to vary their discourse moves as a conversation progresses; improving multi-turn discourse move diversity is a key route to more effective empathic dialogue. Abstract: Large language models (LLMs) produce responses rated as highly empathic in single-turn settings (Ayers et al., 2023; Lee et al., 2024), yet they are also known to be formulaic generators that reuse the same lexical patterns, syntactic templates, and discourse structures across tasks (Jiang et al., 2025; Shaib et al., 2024; Namuduri et al., 2025). Less attention has been paid to whether this formulaicity extends to the level of discourse moves, i.e., what a response does for the person it is addressing. This question is especially consequential for empathic dialogue, where effective support demands not just a kind response at one moment but varied strategies as a conversation unfolds (Stiles et al., 1998). Indeed, prior work shows that LLMs reuse the same tactic sequences more than human supporters in single-turn settings (Gueorguieva et al., 2026). We extend this analysis to multi-turn conversations and find that the rigidity compounds: once a tactic appears in a supporter turn, LLMs reuse it in the next at nearly double the rate of humans (0.50-0.56 vs. 0.27). This pattern holds across LLMs serving as supporters in real emotional support conversations, and is invisible to standard similarity metrics. To address this gap, we introduce MINT (Multi-turn Inter-tactic Novelty Training), the first reinforcement learning framework to optimize discourse move diversity across multi-turn empathic dialogue.
The best MINT variant combines an empathy quality reward with a cross-turn tactic novelty signal, improving aggregate empathy by 25.3% over vanilla across 1.7B and 4B models while reducing cross-turn discourse move repetition by 26.3% on the 4B model, surpassing all baselines including quality-only and token-level diversity methods on both measures. These results suggest that what current models lack is not empathy itself, but the ability to vary their discourse moves across a conversation.
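The cross-turn reuse statistic quoted in the abstract (0.50-0.56 for LLMs vs. 0.27 for humans) can be approximated by a simple rate. This is a minimal sketch under an assumed definition: each supporter turn is a set of tactic labels, and the rate is the share of tactics in one turn that reappear in the next. The paper's exact metric may differ, and the tactic names are invented.

```python
from typing import List, Set

def cross_turn_reuse_rate(turns: List[Set[str]]) -> float:
    """Share of tactics in a supporter turn that reappear in the next turn.

    The definition is an assumption made for illustration.
    """
    reused = 0
    total = 0
    for current, following in zip(turns, turns[1:]):
        total += len(current)
        reused += len(current & following)
    return reused / total if total else 0.0

# Hypothetical tactic labels for three consecutive supporter turns.
conversation = [
    {"validation", "question"},
    {"validation", "advice"},
    {"advice", "reframing"},
]
print(cross_turn_reuse_rate(conversation))  # 0.5
```

A novelty signal of the kind MINT describes could reward lower values of such a rate, although the abstract does not specify how the signal is computed.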

[124] LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling

Yuxin Chen,Chumeng Liang,Hangke Sui,Ruihan Guo,Chaoran Cheng,Jiaxuan You,Ge Liu

Main category: cs.CL

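TL;DR: This paper presents LangFlow, the first continuous diffusion language model (DLM) to rival discrete diffusion models in language modeling. It connects embedding-space DLMs to Flow Matching and introduces a novel ODE-based NLL bound, a learnable noise scheduler based on a Gumbel distribution, and a self-conditioning training protocol, which together improve performance substantially.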

Details Motivation: Continuous diffusion models perform strongly in domains such as images but still lag behind discrete models in language modeling, a gap that needs closing. Method: Connect embedding-space DLMs to Flow Matching via Bregman divergence; propose an ODE-based NLL bound for evaluation; design an information-uniform noise schedule based on a Gumbel distribution; introduce self-conditioning in training to improve likelihood and sample quality. Result: LangFlow reaches a perplexity of 30.0 on LM1B and 24.6 on OpenWebText, matching top discrete DLMs at comparable scale and surpassing autoregressive baselines in zero-shot transfer across multiple benchmarks. Conclusion: Continuous diffusion is a competitive and promising paradigm for language modeling. Abstract: Continuous diffusion models have achieved strong performance across domains such as images. However, in language modeling, prior continuous diffusion language models (DLMs) lag behind discrete counterparts. In this work, we close this gap with LangFlow, the first continuous DLM to rival discrete diffusion. Our approach connects embedding-space DLMs to Flow Matching via Bregman divergence and introduces three key innovations: (1) a novel ODE-based NLL bound for principled evaluation of continuous flow-based language models; (2) an information-uniform principle for noise scheduling, motivating a learnable scheduler based on a Gumbel distribution; and (3) an improved training protocol incorporating self-conditioning, which enhances both likelihood and sample quality. LangFlow achieves strong performance across benchmarks, reaching a perplexity (PPL) of 30.0 on LM1B and 24.6 on OpenWebText. It matches top discrete DLMs at comparable scale and surpasses autoregressive baselines in zero-shot transfer across multiple benchmarks. LangFlow provides clear evidence that continuous diffusion is a competitive and promising paradigm for language modeling. Code: https://github.com/nealchen2003/LangFlow
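The reported perplexities relate to negative log-likelihood through the standard identity PPL = exp(NLL per token). The sketch below shows only that conversion; it does not reproduce LangFlow's ODE-based bound, and the token count and NLL values are made up.

```python
import math

def perplexity(total_nll_nats: float, num_tokens: int) -> float:
    """Perplexity from a total negative log-likelihood measured in nats."""
    return math.exp(total_nll_nats / num_tokens)

# Hypothetical numbers: 100 tokens with an average NLL of ln(30) nats each.
total_nll = math.log(30.0) * 100
print(round(perplexity(total_nll, 100), 6))  # 30.0
```

Because the exponential is monotonic, a bound on the per-token NLL translates directly into a bound on perplexity.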

[125] HistLens: Mapping Idea Change across Concepts and Corpora

Yi Jing,Weiyun Qiu,Yihang Peng,Zhifang Sui

Main category: cs.CL

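TL;DR: This paper proposes HistLens, a framework for multi-concept, multi-corpus conceptual-history analysis. Through interpretable feature decomposition and tracking of temporal dynamics, it models idea change across corpora and concepts, and in particular supports computation over implicitly expressed concepts.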

Details Motivation: Existing computational approaches are limited to a single concept or a single corpus and rely only on surface lexical evidence, so they struggle to capture implicitly expressed concepts and lack comparability and interpretive granularity. Method: Propose HistLens, a unified SAE-based framework that decomposes concept representations into interpretable features, models their activation dynamics over time and across corpora, and maps them into a shared coordinate system. Result: Experiments on long-span press corpora show that HistLens supports cross-concept, cross-corpus computation of patterns of idea evolution and effectively identifies implicit concepts. Conclusion: HistLens bridges conceptual modeling and the interpretive needs of the humanities and social sciences, broadening the perspectives and methods available for diachronic text analysis. Abstract: Language change both reflects and shapes social processes, and the semantic evolution of foundational concepts provides a measurable trace of historical and social transformation. Despite recent advances in diachronic semantics and discourse analysis, existing computational approaches often (i) concentrate on a single concept or a single corpus, making findings difficult to compare across heterogeneous sources, and (ii) remain confined to surface lexical evidence, offering insufficient computational and interpretive granularity when concepts are expressed implicitly. We propose HistLens, a unified, SAE-based framework for multi-concept, multi-corpus conceptual-history analysis. The framework decomposes concept representations into interpretable features and tracks their activation dynamics over time and across sources, yielding comparable conceptual trajectories within a shared coordinate system. Experiments on long-span press corpora show that HistLens supports cross-concept, cross-corpus computation of patterns of idea evolution and enables implicit concept computation. By bridging conceptual modeling with interpretive needs, HistLens broadens the analytical perspectives and methodological repertoire available to social science and the humanities for diachronic text analysis.

[126] Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks

Yoonsang Lee,Howard Yen,Xi Ye,Danqi Chen

Main category: cs.CL

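TL;DR: This paper proposes AggAgent, an aggregation method for parallel test-time scaling of long-horizon agentic tasks such as agentic search and deep research. It treats multiple trajectories as an environment and uses lightweight tools to inspect and synthesize them on demand, clearly outperforming existing aggregation methods on several benchmarks with minimal overhead.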

Details Motivation: Parallel test-time scaling works well for chain-of-thought reasoning but is hard to apply to long-horizon, multi-turn, tool-augmented agentic tasks with open-ended outputs: aggregating only final answers discards the rich information in trajectories, while concatenating all trajectories exceeds the model's context window. Method: Propose AggAgent, an aggregation agent that treats the parallel agent trajectories as an environment and is equipped with lightweight tools to inspect candidate solutions, search across trajectories, and synthesize information on demand. Result: Across six benchmarks and three model families (GLM-4.7, Qwen3.5, MiniMax-M2.5), AggAgent outperforms existing aggregation methods by up to 5.3% absolute on average and 10.3% on two deep research tasks; the aggregation cost remains bounded by a single agentic rollout. Conclusion: Agentic aggregation is an effective and cost-efficient approach to parallel test-time scaling. Abstract: We study parallel test-time scaling for long-horizon agentic tasks such as agentic search and deep research, where multiple rollouts are generated in parallel and aggregated into a final response. While such scaling has proven effective for chain-of-thought reasoning, agentic tasks pose unique challenges: trajectories are long, multi-turn, and tool-augmented, and outputs are often open-ended. Aggregating only final answers discards rich information from trajectories, while concatenating all trajectories exceeds the model's context window. To address this, we propose AggAgent, an aggregation agent that treats parallel trajectories as an environment. We equip it with lightweight tools to inspect candidate solutions and search across trajectories, enabling it to navigate and synthesize information on demand. Across six benchmarks and three model families (GLM-4.7, Qwen3.5, MiniMax-M2.5), AggAgent outperforms all existing aggregation methods, by up to 5.3% absolute on average and 10.3% on two deep research tasks, while adding minimal overhead, as the aggregation cost remains bounded by a single agentic rollout. Our findings establish agentic aggregation as an effective and cost-efficient approach to parallel test-time scaling.
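To make the "trajectories as an environment" idea concrete, here is a minimal sketch of what lightweight inspection tools might look like. Every class, method, and field name is hypothetical; the abstract only says the tools let the agent inspect candidate solutions and search across trajectories, not how they are implemented.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Trajectory:
    steps: List[str]   # tool calls and observations, stored as text
    final_answer: str

class TrajectoryEnvironment:
    """Hypothetical tool surface an aggregation agent could call on demand."""

    def __init__(self, trajectories: List[Trajectory]) -> None:
        self.trajectories = list(trajectories)

    def list_answers(self) -> List[str]:
        """Inspect candidate solutions without loading full trajectories."""
        return [t.final_answer for t in self.trajectories]

    def search(self, keyword: str) -> List[Tuple[int, int]]:
        """Return (trajectory index, step index) pairs mentioning the keyword."""
        needle = keyword.lower()
        return [
            (i, j)
            for i, t in enumerate(self.trajectories)
            for j, step in enumerate(t.steps)
            if needle in step.lower()
        ]

    def read_step(self, trajectory: int, step: int) -> str:
        """Read a single step so only relevant text enters the context."""
        return self.trajectories[trajectory].steps[step]

env = TrajectoryEnvironment([
    Trajectory(["search: capital of France", "observation: Paris"], "Paris"),
    Trajectory(["observation: Paris is the capital", "verify"], "Paris"),
])
print(env.list_answers())   # ['Paris', 'Paris']
print(env.search("paris"))  # [(0, 1), (1, 0)]
```

The design point is that the aggregator pulls in only the steps it asks for, which is one way to stay within a context window that concatenating all trajectories would exceed.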

[127] General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks

Junlin Liu,Shengnan An,Shuang Zhou,Dan Ma,Shixiong Luo,Ying Xie,Yuan Zhang,Wenling Yuan,Yifan Zhou,Xiaoyu Li,Ziwen Wang,Xuezhi Cao,Xunliang Cai

Main category: cs.CL

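TL;DR: This paper introduces General365, a benchmark for evaluating the general reasoning ability of large language models (LLMs) independently of specialized domain knowledge. The best current LLM reaches only 62.8% accuracy on it, far below performance on specialized math and physics benchmarks, showing that LLM reasoning still depends heavily on domain-specific knowledge.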

Details Motivation: LLMs perform well in specialized domains such as mathematics and physics, but their reasoning in broader, general contexts ("general reasoning") is under-explored; this ability involves complex constraints, nested logic, and semantic interference without relying on expert knowledge. Method: Build the General365 benchmark: background knowledge is restricted to a K-12 level to remove dependence on expertise, and the benchmark contains 365 seed problems and 1,095 variant problems across eight categories of general reasoning. Result: Evaluations of 26 leading LLMs show that the best model achieves only 62.8% accuracy, well below the near-perfect performance of LLMs on math and physics benchmarks. Conclusion: The reasoning abilities of current LLMs are strongly domain-dependent and general reasoning remains seriously lacking; General365 can help drive LLMs toward robust, general-purpose real-world reasoning. Abstract: Contemporary large language models (LLMs) have demonstrated remarkable reasoning capabilities, particularly in specialized domains like mathematics and physics. However, their ability to generalize these reasoning skills to more general and broader contexts--often termed general reasoning--remains under-explored. Unlike domain-specific reasoning, general reasoning relies less on expert knowledge but still presents formidable reasoning challenges, such as complex constraints, nested logical branches, and semantic interference. To address this gap, we introduce General365, a benchmark specifically designed to assess general reasoning in LLMs. By restricting background knowledge to a K-12 level, General365 explicitly decouples reasoning from specialized expertise. The benchmark comprises 365 seed problems and 1,095 variant problems across eight categories, ensuring both high difficulty and diversity. Evaluations across 26 leading LLMs reveal that even the top-performing model achieves only 62.8% accuracy, in stark contrast to the near-perfect performances of LLMs in math and physics benchmarks. These results suggest that the reasoning abilities of current LLMs are heavily domain-dependent, leaving significant room for improvement in broader applications. We envision General365 as a catalyst for advancing LLM reasoning beyond domain-specific tasks toward robust, general-purpose real-world scenarios. Code, Dataset, and Leaderboard: https://general365.github.io

[128] C-ReD: A Comprehensive Chinese Benchmark for AI-Generated Text Detection Derived from Real-World Prompts

Chenxi Qing,Junxi Wu,Zheng Liu,Yixiang Qiu,Hongyao Yu,Bin Chen,Hao Wu,Shu-Tao Xia

Main category: cs.CL

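TL;DR: This paper presents C-ReD, a comprehensive Chinese benchmark for AI-generated text detection built from real-world prompts, designed to address the shortcomings of existing Chinese detection benchmarks in model diversity, domain coverage, and prompt realism.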

Details Motivation: Existing Chinese benchmarks for AI-generated text detection suffer from limited model diversity and data homogeneity, which makes reliable cross-model and cross-dataset generalization hard to support. Method: Build C-ReD, a Chinese real-prompt AI-generated text detection benchmark that includes outputs from diverse models and prompts from real-world scenarios, and design experiments covering in-domain detection, cross-model generalization, and cross-dataset transfer. Result: C-ReD enables reliable in-domain detection and strong generalization to unseen LLMs and external Chinese datasets, addressing gaps in model diversity, domain coverage, and prompt realism. Conclusion: C-ReD provides a more comprehensive and realistic evaluation benchmark for Chinese AI-generated text detection, moving the field toward greater robustness and practicality. Abstract: Recently, large language models (LLMs) have become capable of generating highly fluent textual content. While they offer significant convenience to humans, they also introduce various risks, like phishing and academic dishonesty. Numerous research efforts have been dedicated to developing algorithms for detecting AI-generated text and constructing relevant datasets. However, in the domain of Chinese corpora, challenges remain, including limited model diversity and data homogeneity. To address these issues, we propose C-ReD: a comprehensive Chinese Real-prompt AI-generated Detection benchmark. Experiments demonstrate that C-ReD not only enables reliable in-domain detection but also supports strong generalization to unseen LLMs and external Chinese datasets, addressing critical gaps in model diversity, domain coverage, and prompt realism that have limited prior Chinese detection benchmarks. We release our resources at https://github.com/HeraldofLight/C-ReD.

[129] CLSGen: A Dual-Head Fine-Tuning Framework for Joint Probabilistic Classification and Verbalized Explanation

WonJin Yoon,Kangyu Zhu,Ian Bulovic,Autumn Sehy,Yanjun Gao,Dmitriy Dligach,Majid Afshar,Timothy A. Miller

Main category: cs.CL

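TL;DR: This paper proposes CLSGen, a framework that addresses the difficulty large language models (LLMs) have in providing both reliable probability estimates and verbalized explanations on binary classification tasks. With a new architecture, training method, and data construction strategy, it improves probability estimation and classification performance without sacrificing explanation ability.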

Details Motivation: LLMs can generate explanations but lack reliable quantitative probability outputs, while conventional discriminative fine-tuning leads to catastrophic forgetting and linguistic collapse, which damages explanation ability. Method: Propose the CLSGen framework, comprising a new model architecture, a dedicated training methodology, and a data construction strategy, designed for binary classification and covering both probability estimation and explanation generation. Result: On multiple benchmark datasets, models fine-tuned with CLSGen outperform baselines on AUROC and F1-score; the generated explanations align strongly with the predicted labels and are highly readable. Conclusion: CLSGen jointly achieves probabilistic reliability and verbal explanation ability, offering a practical path to deploying LLMs in settings that require trustworthy decisions. Abstract: With the recent progress of Large Language Models (LLMs), there is a growing interest in applying these models to solve complex and challenging problems. Modern LLMs, capable of processing long contexts and generating verbalized explanations, offer significant potential in addressing real-world applications. However, a critical hurdle in deploying LLMs for practical decision-making is their inability to provide reliable, quantitative probabilities. While task-specific fine-tuning of LLMs using traditional discriminative objectives (similar to encoder-only models) can yield probability estimates, this often leads to catastrophic forgetting and linguistic collapse. Consequently, the model loses its ability to generate explanations, severely undermining its interpretability and usability. To address this challenge, we propose CLSGen, a novel LLM fine-tuning framework designed for binary classification tasks. The CLSGen framework encompasses a new model architecture, training methodology, and data construction strategy to enable robust probability estimation without sacrificing the model's inherent explanation-generation capabilities. Experimental results across multiple benchmark datasets demonstrate that models fine-tuned with CLSGen outperform existing baselines in classification metrics (AUROC and F1-score). Regarding explanation, the results showed strong alignment between predicted labels and generated justifications, as well as high readability.

[130] Psychological Concept Neurons: Can Neural Control Bias Probing and Shift Generation in LLMs?

Yuto Harada,Hiro Taiyo Hamada

Main category: cs.CL

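TL;DR: This paper studies how the Big Five personality constructs are represented inside large language models (LLMs), where those representations are located, and how they relate to behavioral outputs. Probing and neuron interventions show that personality information is decodable from early layers and persists through all layers, that concept-selective neurons concentrate in mid layers, and that interventions steer latent representations effectively but control final label generation only weakly and with cross-trait spillover, revealing a gap between representational control and behavioral control.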

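Details Motivation: Although LLMs can imitate and predict personality, it is unclear how they internally represent and organize personality constructs such as the Big Five; in particular, there is no systematic analysis of where the representations are located, how they form, and how they causally relate to behavioral outputs. Method: Use probing to locate when and how strongly Big Five information emerges across model layers; identify neurons that respond selectively to each personality dimension; intervene causally by enhancing or suppressing these neurons' activations and evaluate the effect on latent representations (probe readouts) and on the distribution of generated labels. Result: Big Five information becomes rapidly decodable in early layers and persists through the final layers; concept-selective neurons concentrate in mid layers with little overlap across personality dimensions; interventions clearly shift probe readouts (success rates above 0.8 for some concepts), but their effect on generated labels is weaker and less stable and often causes cross-trait spillover. Conclusion: LLMs contain structured, localizable, and steerable personality representations, but controllability at the representation level does not amount to controllability at the behavioral level (such as label generation), revealing a gap between representational control and behavioral control in current models. Abstract: Using psychological constructs such as the Big Five, large language models (LLMs) can imitate specific personality profiles and predict a user's personality. While LLMs can exhibit behaviors consistent with these constructs, it remains unclear where and how they are represented inside the model and how they relate to behavioral outputs. To address this gap, we focus on questionnaire-operationalized Big Five concepts, analyze the formation and localization of their internal representations, and use interventions to examine how these representations relate to behavioral outputs. In our experiment, we first use probing to examine where Big Five information emerges across model depth. We then identify neurons that respond selectively to each Big Five concept and test whether enhancing or suppressing their activations can bias latent representations and label generation in intended directions. We find that Big Five information becomes rapidly decodable in early layers and remains detectable through the final layers, while concept-selective neurons are most prevalent in mid layers and exhibit limited overlap across domains. Interventions on these neurons consistently shift probe readouts toward targeted concepts, with targeted success rates exceeding 0.8 for some concepts, indicating that the model's internal separation of Big Five personality traits can be causally steered.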
At the label-generation level, the same interventions often bias generated label distributions in the intended directions, but the effects are weaker, more concept-dependent, and often accompanied by cross-trait spillover, indicating that comparable control over generated labels is difficult even with interventions on a large fraction of concept-selective neurons. Overall, our findings reveal a gap between representational control and behavioral control in LLMs.
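The enhance-or-suppress intervention can be pictured as rescaling the activations of selected neurons. The sketch below is a simplified assumption: the abstract does not state how activations are modified, and the function name, neuron indices, and scale factors are all illustrative.

```python
from typing import Iterable, List

def intervene(activations: List[float], neuron_ids: Iterable[int], scale: float) -> List[float]:
    """Rescale the chosen neurons; leave all other activations unchanged.

    scale > 1 enhances, 0 <= scale < 1 suppresses (scale = 0 ablates).
    Multiplicative scaling is an assumption made for illustration.
    """
    selected = set(neuron_ids)
    return [a * scale if i in selected else a for i, a in enumerate(activations)]

layer_output = [1.0, 2.0, 3.0]
print(intervene(layer_output, [1], 0.0))     # [1.0, 0.0, 3.0]
print(intervene(layer_output, [0, 2], 2.0))  # [2.0, 2.0, 6.0]
```

In a real model this rescaling would be applied inside a forward hook at the chosen layer, after which probe readouts and generated labels can be compared with the unmodified run.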

[131] Saar-Voice: A Multi-Speaker Saarbrücken Dialect Speech Corpus

Lena S. Oberkircher,Jesujoba O. Alabi,Dietrich Klakow,Jürgen Trouvain

Main category: cs.CL

TL;DR: This paper introduces Saar-Voice, a six-hour speech corpus for the Saarbrücken German dialect, aiming to address the underrepresentation of dialects in NLP and speech technologies.

Details Motivation: Dialects are culturally significant and widely used but underrepresented in linguistic resources and computational models, leading to performance disparities. The paper aims to bridge this gap by building a dedicated dialect corpus. Method: The authors collected text from digitized books and local sources, selected a subset for recording by nine speakers, and performed analyses on both text and speech. They addressed orthographic and speaker variation and explored grapheme-to-phoneme (G2P) conversion. Result: A six-hour aligned textual and audio corpus for the Saarbrücken dialect, supporting dialect-aware TTS research, especially in low-resource, zero-shot, and few-shot settings. Conclusion: Saar-Voice serves as a foundational resource for advancing dialect-aware speech technologies and highlights methodological considerations for future dialect corpus development. Abstract: Natural language processing (NLP) and speech technologies have made significant progress in recent years; however, they remain largely focused on standardized language varieties. Dialects, despite their cultural significance and widespread use, are underrepresented in linguistic resources and computational models, resulting in performance disparities. To address this gap, we introduce Saar-Voice, a six-hour speech corpus for the Saarbrücken dialect of German. The dataset was created by first collecting text through digitized books and locally sourced materials. A subset of this text was recorded by nine speakers, and we conducted analyses on both the textual and speech components to assess the dataset's characteristics and quality. We discuss methodological challenges related to orthographic and speaker variation, and explore grapheme-to-phoneme (G2P) conversion. The resulting corpus provides aligned textual and audio representations. 
This serves as a foundation for future research on dialect-aware text-to-speech (TTS), particularly in low-resource scenarios, including zero-shot and few-shot model adaptation.

cs.CV [Back]

[132] 3D Multi-View Stylization with Pose-Free Correspondences Matching for Robust 3D Geometry Preservation

Shirsha Bose

Main category: cs.CV

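TL;DR: This paper proposes an artistic style transfer method for multi-view 3D scenes that requires neither camera poses nor an explicit 3D representation. By jointly optimizing appearance transfer and geometric consistency, it makes stylized results more usable for downstream 3D tasks such as SLAM, depth estimation, and reconstruction.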

Details Motivation: Existing image and video style transfer methods are hard to apply directly to multi-view 3D scenes, because independent per-view stylization disrupts cross-view correspondences, causing texture drift, warped edges, and inconsistent shading, which in turn degrades geometry-aware tasks such as SLAM, depth prediction, and multi-view reconstruction. Method: Propose a feed-forward stylization network combined with per-scene test-time optimization; drive stylization with AdaIN-inspired matching of VGG-19 feature statistics; add a cross-view descriptor consistency loss based on SuperPoint/SuperGlue to stabilize structure; add a MiDaS/DPT depth-preservation loss and global color alignment to reduce domain shift; use a staged weight schedule to introduce the geometry and depth constraints gradually. Result: On Tanks and Temples and Mip-NeRF 360, the CHD and DSD metrics indicate good style adherence and structure retention; DROID-SLAM trajectories and symmetric Chamfer distance show stronger 3D consistency than the MuVieCAST baseline; ablations show that correspondence and depth regularization reduce structural distortion and improve SLAM stability and reconstruction quality. Conclusion: The method delivers multi-view stylization that remains usable for 3D tasks without camera poses or 3D priors during training, offering a new approach to controllable stylization for geometry-aware vision tasks. Abstract: Artistic style transfer is well studied for images and videos, but extending it to multi-view 3D scenes remains difficult because stylization can disrupt correspondences needed by geometry-aware pipelines. Independent per-view stylization often causes texture drift, warped edges, and inconsistent shading, degrading SLAM, depth prediction, and multi-view reconstruction. This thesis addresses multi-view stylization that remains usable for downstream 3D tasks without assuming camera poses or an explicit 3D representation during training. We introduce a feed-forward stylization network trained with per-scene test-time optimization under a composite objective coupling appearance transfer with geometry preservation. Stylization is driven by an AdaIN-inspired loss from a frozen VGG-19 encoder, matching channel-wise moments to a style image. To stabilize structure across viewpoints, we propose a correspondence-based consistency loss using SuperPoint and SuperGlue, constraining descriptors from a stylized anchor view to remain consistent with matched descriptors from the original multi-view set. We also impose a depth-preservation loss using MiDaS/DPT and use global color alignment to reduce depth-model domain shift. A staged weight schedule introduces geometry and depth constraints. We evaluate on Tanks and Temples and Mip-NeRF 360 using image and reconstruction metrics.
Style adherence and structure retention are measured by Color Histogram Distance (CHD) and Structure Distance (DSD). For 3D consistency, we use monocular DROID-SLAM trajectories and symmetric Chamfer distance on back-projected point clouds. Across ablations, correspondence and depth regularization reduce structural distortion and improve SLAM stability and reconstructed geometry; on scenes with MuVieCAST baselines, our method yields stronger trajectory and point-cloud consistency while maintaining competitive stylization.
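The AdaIN-inspired loss mentioned in the abstract matches channel-wise moments of encoder features to those of a style image. A minimal sketch of that idea follows, assuming squared differences of per-channel mean and standard deviation on a single feature layer; the feature values are toy numbers rather than VGG-19 activations, and the exact loss form used in the thesis is not given.

```python
from statistics import fmean, pstdev
from typing import List

Channels = List[List[float]]  # one flat list of activations per channel

def moment_matching_loss(stylized: Channels, style: Channels) -> float:
    """Sum over channels of squared mean and standard-deviation differences.

    The squared-difference form is an assumption made for illustration.
    """
    loss = 0.0
    for out_ch, style_ch in zip(stylized, style):
        loss += (fmean(out_ch) - fmean(style_ch)) ** 2
        loss += (pstdev(out_ch) - pstdev(style_ch)) ** 2
    return loss

# Toy 2-channel features: channel 0 differs in mean only, channel 1 matches.
stylized_feats = [[0.0, 2.0], [1.0, 1.0]]
style_feats = [[1.0, 3.0], [1.0, 1.0]]
print(moment_matching_loss(stylized_feats, style_feats))  # 1.0
```

In a full pipeline this term would be summed over several encoder layers and combined with the correspondence and depth losses under the staged weight schedule.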

[133] PA-SFM: Tracker-free differentiable acoustic radiation for freehand 3D photoacoustic imaging

Shuang Li,Jian Gao,Chulhong Kim,Seongwook Choi,Qian Chen,Yibing Wang,Shuang Wu,Yu Zhang,Tingting Huang,Yucheng Zhou,Boxin Yao,Yao Yao,Changhui Li

Main category: cs.CV

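TL;DR: This paper presents PA-SFM, a framework for 3D handheld photoacoustic tomography that needs no external positioning sensors. Using differentiable acoustic radiation modeling, it recovers sensor poses and produces high-fidelity 3D reconstructions from single-modality photoacoustic data alone.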

Details Motivation: Conventional 3D handheld photoacoustic tomography relies on bulky, expensive external positioning sensors to correct motion artifacts, which limits its clinical flexibility and accessibility. Method: PA-SFM embeds the acoustic wave equation in a differentiable programming pipeline and uses a GPU-accelerated acoustic radiation kernel to jointly optimize the photoacoustic source distribution and the sensor array pose; a coarse-to-fine optimization strategy with geometric consistency checks and rigid-body constraints improves robustness for freehand scanning. Result: Validated in numerical simulations and in-vivo rat experiments, PA-SFM achieves sub-millimeter positioning accuracy and reconstructs high-resolution 3D vascular structures comparable to ground-truth benchmarks. Conclusion: PA-SFM offers a low-cost, purely software-defined solution for clinical freehand photoacoustic imaging; the source code is publicly available. Abstract: Three-dimensional (3D) handheld photoacoustic tomography typically relies on bulky and expensive external positioning sensors to correct motion artifacts, which severely limits its clinical flexibility and accessibility. To address this challenge, we present PA-SFM, a tracker-free framework that leverages exclusively single-modality photoacoustic data for both sensor pose recovery and high-fidelity 3D reconstruction via differentiable acoustic radiation modeling. Unlike traditional structure-from-motion (SFM) methods based on visual features, PA-SFM integrates the acoustic wave equation into a differentiable programming pipeline. By leveraging a high-performance, GPU-accelerated acoustic radiation kernel, the framework simultaneously optimizes the 3D photoacoustic source distribution and the sensor array pose via gradient descent. To ensure robust convergence in freehand scenarios, we introduce a coarse-to-fine optimization strategy that incorporates geometric consistency checks and rigid-body constraints to eliminate motion outliers. We validated the proposed method through both numerical simulations and in-vivo rat experiments. The results demonstrate that PA-SFM achieves sub-millimeter positioning accuracy and restores high-resolution 3D vascular structures comparable to ground-truth benchmarks, offering a low-cost, software-defined solution for clinical freehand photoacoustic imaging. The source code is publicly available at https://github.com/JaegerCQ/PA-SFM.

[134] TRACE: Thermal Recognition Attentive-Framework for CO2 Emissions from Livestock

Taminul Islam,Abdellah Lakhssassi,Toqi Tahamid Sarker,Mohamed Embaby,Khaled R Ahmed,Amer AbuGhazaleh

Main category: cs.CV

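TL;DR: This paper presents TRACE, the first framework to jointly perform per-frame CO2 plume segmentation and clip-level emission flux classification from MWIR thermal video. With a thermal gas-aware attention encoder, an attention-based temporal fusion module, and a four-stage progressive training strategy, it clearly outperforms existing methods on multiple metrics and supports contactless, continuous, per-animal monitoring of CO2 exhaled by cattle.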

Details Motivation: Quantifying the CO2 exhaled by free-roaming cattle is a direct indicator of rumen metabolic state and a prerequisite for farm-scale carbon accounting, but existing systems cannot deliver continuous, spatially resolved measurements without physically confining or touching the animals. Method: TRACE is a unified framework comprising: 1) a Thermal Gas-Aware Attention (TGAA) encoder that uses per-pixel gas intensity as a spatial supervisory signal to steer self-attention toward high-emission regions; 2) an Attention-based Temporal Fusion (ATF) module that models breath-cycle dynamics through structured cross-frame attention for sequence-level flux classification; 3) a four-stage progressive training scheme that couples the segmentation and classification objectives while avoiding gradient interference. Result: On the CO2 Farm Thermal Gas Dataset, TRACE reaches 0.998 mIoU and beats 15 state-of-the-art models on every segmentation and classification metric, including domain-specific gas segmenters with several times more parameters; ablations show that each component is essential. Conclusion: TRACE provides a practical route to non-invasive, continuous, per-animal CO2 monitoring with overhead thermal cameras on large commercial farms. Abstract: Quantifying exhaled CO2 from free-roaming cattle is both a direct indicator of rumen metabolic state and a prerequisite for farm-scale carbon accounting, yet no existing system can deliver continuous, spatially resolved measurements without physical confinement or contact. We present TRACE (Thermal Recognition Attentive-Framework for CO2 Emissions from Livestock), the first unified framework to jointly address per-frame CO2 plume segmentation and clip-level emission flux classification from mid-wave infrared (MWIR) thermal video. TRACE contributes three domain-specific advances: a Thermal Gas-Aware Attention (TGAA) encoder that incorporates per-pixel gas intensity as a spatial supervisory signal to direct self-attention toward high-emission regions at each encoder stage; an Attention-based Temporal Fusion (ATF) module that captures breath-cycle dynamics through structured cross-frame attention for sequence-level flux classification; and a four-stage progressive training curriculum that couples both objectives while preventing gradient interference. Benchmarked against fifteen state-of-the-art models on the CO2 Farm Thermal Gas Dataset, TRACE achieves an mIoU of 0.998 and the best result on every segmentation and classification metric simultaneously, outperforming domain-specific gas segmenters with several times more parameters and surpassing all baselines in flux classification.
Ablation studies confirm that each component is individually essential: gas-conditioned attention alone determines precise plume boundary localization, and temporal reasoning is indispensable for flux-level discrimination. TRACE establishes a practical path toward non-invasive, continuous, per-animal CO2 monitoring from overhead thermal cameras at commercial scale. Codes are available at https://github.com/taminulislam/trace.

[135] FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models

Xinyuan An,Tao Luo,Gengyun Peng,Yaobing Wang,Kui Ren,Dongxia Wang

Main category: cs.CV

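TL;DR: This paper presents FlowHijack, the first backdoor attack framework that targets the vector-field dynamics of flow-matching vision-language-action (VLA) models. With a τ-conditioned injection strategy and a dynamics mimicry regularizer, it achieves attacks that are highly successful, stealthy, and behaviorally indistinguishable from normal actions.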

Details Motivation: Existing backdoor attacks on autoregressive, discretized VLA models do not transfer directly to the continuous action generation of flow-matching models, whose vector-field dynamics hide an unexplored security vulnerability. Method: FlowHijack combines a τ-conditioned injection strategy, which manipulates the initial phase of action generation, with a dynamics mimicry regularizer, which keeps malicious actions kinematically similar to normal ones. Result: Experiments show that FlowHijack achieves high attack success rates with stealthy, context-aware triggers while preserving benign task performance, and that the malicious actions it generates are behaviorally hard to distinguish from normal actions. Conclusion: The work exposes a significant security risk in the internal generative dynamics of continuous embodied models and stresses the urgent need for defenses that target those dynamics. Abstract: Vision-Language-Action (VLA) models are emerging as a cornerstone for robotics, with flow-matching policies like $π_0$ showing great promise in generating smooth, continuous actions. As these models advance, their unique action generation mechanism - the vector field dynamics - presents a critical yet unexplored security vulnerability, particularly backdoor vulnerabilities. Existing backdoor attacks designed for autoregressive discretization VLAs cannot be directly applied to this new continuous dynamics. We introduce FlowHijack, the first backdoor attack framework to systematically target the underlying vector-field dynamics of flow-matching VLAs. Our method combines a novel $τ$-conditioned injection strategy, which manipulates the initial phase of the action generation, with a dynamics mimicry regularizer. Experiments demonstrate that FlowHijack achieves high attack success rates using stealthy, context-aware triggers where prior works failed. Crucially, it preserves benign task performance and, by enforcing kinematic similarity, generates malicious actions that are behaviorally indistinguishable from normal actions. Our findings reveal a significant vulnerability in continuous embodied models, highlighting the urgent need for defenses targeting the model's internal generative dynamics.

[136] Prints in the Magnetic Dust: Robust Similarity Search in Legacy Media Images Using Checksum Count Vectors

Maciej Grzeszczuk,Kinga Skorupska,Grzegorz M. Wójcik

Main category: cs.CV

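TL;DR: This paper proposes a feature representation based on Checksum Count Vectors for automatically detecting duplicates and variants in historical tape data, making the restoration, de-duplication, and semantic integration of digital artifacts from the early home computing era more efficient.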

Details Motivation: Digitizing tape data is only the first step in preserving early home computing artifacts; the decoding, verification, repair, testing, and documentation that follow are laborious. If part of this process can be automated effectively, volunteers can concentrate on contributing historical and contextual knowledge. Method: The authors propose a feature representation based on Checksum Count Vectors and evaluate how well it detects duplicate copies and variants on a dataset of 4902 decoded tape images. Result: For damaged recordings with up to 75% of records missing, the method detects variants with 58% accuracy and identifies alternative copies with 97% accuracy. Conclusion: The method is an important step toward fully automated restoration, de-duplication, and semantic integration of historical digital artifacts, supporting sequence matching, automatic repair, and knowledge discovery. Abstract: Digitizing magnetic media containing computer data is only the first step towards the preservation of early home computing era artifacts. The audio tape images must be decoded, verified, repaired if necessary, tested, and documented. If parts of this process could be effectively automated, volunteers could focus on contributing contextual and historical knowledge rather than struggling with technical tools. We therefore propose a feature representation based on Checksum Count Vectors and evaluate its applicability to detecting duplicates and variants of recordings within a large data store. The approach was tested on a collection of decoded tape images (n=4902), achieving 58% accuracy in detecting variants and 97% accuracy in identifying alternative copies, for damaged recordings with up to 75% of records missing. These results represent an important step towards fully automated pipelines for restoration, de-duplication, and semantic integration of historical digital artifacts through sequence matching, automatic repair and knowledge discovery.

[137] A Modular Zero-Shot Pipeline for Accident Detection, Localization, and Classification in Traffic Surveillance Video

Amey Thakur,Sarvesh Talele

Main category: cs.CV

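TL;DR: This paper presents a zero-shot traffic accident detection pipeline for the ACCIDENT @ CVPR 2026 challenge. It needs no real-world labeled data and produces its predictions through three modules: temporal localization, spatial localization, and collision type classification.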

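Details Motivation: The work addresses the CVPR 2026 ACCIDENT challenge: predicting when, where, and what type of traffic accident occurs in surveillance video, without real-world labeled data. Method: The task is split into three independent modules: 1) temporal localization by peak detection on z-score normalized frame-difference signals; 2) spatial localization by the weighted centroid of cumulative dense optical flow magnitude maps computed with the Farneback algorithm; 3) collision type classification by cosine similarity between CLIP image embeddings of key frames and multi-prompt text embeddings. No domain-specific fine-tuning is involved; only pre-trained weights are used. Result: The pipeline jointly predicts the time, location, and type of an accident in a fully zero-shot manner and runs effectively under the challenge setting; the code is released as a Kaggle notebook. Conclusion: A purely zero-shot, fine-tuning-free combination of pre-trained multimodal (vision and language) models can handle fine-grained event understanding in real-world settings that lack labeled data. Abstract: We describe a zero-shot pipeline developed for the ACCIDENT @ CVPR 2026 challenge. The challenge requires predicting when, where, and what type of traffic accident occurs in surveillance video, without labeled real-world training data. Our method separates the problem into three independent modules. The first module localizes the collision in time by running peak detection on z-score normalized frame-difference signals. The second module finds the impact location by computing the weighted centroid of cumulative dense optical flow magnitude maps using the Farneback algorithm. The third module classifies collision type by measuring cosine similarity between CLIP image embeddings of frames near the detected peak and text embeddings built from multi-prompt natural language descriptions of each collision category. No domain-specific fine-tuning is involved; the pipeline processes each video using only pre-trained model weights. Our implementation is publicly available as a Kaggle notebook.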
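
The first two modules described above can be sketched with NumPy alone. This is a minimal illustration, not the authors' implementation: the function names `temporal_peak` and `weighted_centroid` are hypothetical, the peak detector is reduced to a plain arg-max over the z-scored signal, and the magnitude map is assumed to be precomputed (the abstract obtains it from Farneback dense optical flow). The CLIP-based third module is not shown.

```python
import numpy as np


def temporal_peak(frames):
    """Return the index of the frame that follows the largest frame-to-frame change.

    frames: array of shape (T, H, W) holding a grayscale video.
    Simplification (assumption): the "peak" is the global maximum of the
    z-scored frame-difference signal, not a full peak-detection routine.
    """
    frames = np.asarray(frames, dtype=np.float64)
    # Mean absolute difference between consecutive frames: length T - 1.
    diffs = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))
    # z-score normalization of the difference signal.
    z = (diffs - diffs.mean()) / (diffs.std() + 1e-8)
    # diffs[i] compares frame i with frame i + 1, so report the later frame.
    return int(np.argmax(z)) + 1


def weighted_centroid(magnitude):
    """Return the (x, y) centroid of a non-negative 2D magnitude map."""
    magnitude = np.asarray(magnitude, dtype=np.float64)
    h, w = magnitude.shape
    ys, xs = np.mgrid[0:h, 0:w]
    total = magnitude.sum()
    if total == 0:
        # No motion at all: fall back to the image center.
        return ((w - 1) / 2.0, (h - 1) / 2.0)
    return (float((xs * magnitude).sum() / total),
            float((ys * magnitude).sum() / total))


if __name__ == "__main__":
    video = np.zeros((10, 8, 8))
    video[6:] = 1.0  # abrupt change between frame 5 and frame 6
    print("peak frame:", temporal_peak(video))

    flow_mag = np.zeros((5, 5))
    flow_mag[1, 3] = 2.0  # all motion concentrated at x=3, y=1
    print("impact location (x, y):", weighted_centroid(flow_mag))
```

In a fuller version, the per-frame magnitude maps would be accumulated over a window around the detected peak before taking the centroid, which is how the abstract describes the spatial module.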

[138] Grid2Matrix: Revealing Digital Agnosia in Vision-Language Models

Yunkai Zhang,Linda Li,Yingxin Cui,Xiyuan Ruan,Zeyu Zheng,Kezhen Chen,Yi Zhang,Diji Yang

Main category: cs.CV

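TL;DR: This paper introduces the Grid2Matrix (G2M) benchmark for probing how well vision-language models (VLMs) read out exact image details. In zero-shot evaluation, VLMs collapse sharply even on small color grids, revealing a phenomenon the authors call "Digital Agnosia": information still present in the visual features is not expressed correctly in the language output.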

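Details Motivation: Existing VLM evaluations often do not require an exhaustive readout of the image, which can mask weaknesses in modeling fine-grained visual detail; a controlled benchmark with little semantic interference is needed to pinpoint where visual information is lost. Method: The Grid2Matrix (G2M) benchmark gives the model a color grid and a color-to-number mapping and asks for the corresponding numeric matrix; visual complexity is controlled by grid size and number of colors. The authors evaluate the end-to-end performance of mainstream VLMs and the fidelity of intermediate visual-encoder features, and analyze how error patterns align with visual patches. Result: In zero-shot G2M, VLMs show a sharp early performance collapse (for example, failing already on 4×4 grids), far earlier than expected; their visual encoders retain much more grid information than the final language output; errors are strongly structured and closely tied to the visual patch boundaries of ViT-style models; model scaling and multimodal alignment strategies do not fully resolve the problem. Conclusion: VLMs exhibit "Digital Agnosia", a systematic loss of information between recoverable visual representations and the language actually generated. G2M can serve as a diagnostic tool for fine-grained visual understanding bottlenecks, with implications for applications that depend on pixel-level accuracy such as tables, charts, forms, and GUIs. Abstract: Vision-Language Models (VLMs) excel on many multimodal reasoning benchmarks, but these evaluations often do not require an exhaustive readout of the image and can therefore obscure failures in faithfully capturing all visual details. We introduce Grid2Matrix (G2M), a controlled benchmark in which a model is shown a color grid and a color-to-number mapping, and must output the corresponding matrix. By varying grid size and the number of colors, G2M provides a simple way to increase visual complexity while minimizing semantic confounds. We find that VLMs exhibit a sharp early collapse in zero-shot end-to-end evaluation, failing on surprisingly small grids rather than degrading gradually as the task becomes denser. We probe the visual encoders of VLMs from two representative families and find that they preserve substantially more of the grid information than the corresponding end-to-end outputs. This suggests that the failure is not explained by visual encoding alone, but also reflects a gap between what remains recoverable from visual features and what is ultimately expressed in language. We term this gap Digital Agnosia. Further analyses show that these errors are highly structured and depend strongly on how grid cells overlap with visual patch boundaries. We also find that common strategies such as model scaling and multimodal alignment do not fully eliminate this failure mode.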
We expect G2M to serve as a useful testbed for understanding where and how VLMs lose fine visual details, and for evaluating tasks where missing even small visual details can matter, such as tables, charts, forms, and GUIs.
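
To make the task format concrete, the sketch below generates one G2M-style instance and scores a predicted matrix. It is an illustration under assumptions: `make_g2m_instance` and `cell_accuracy` are hypothetical names, the per-cell accuracy rule is assumed (the abstract does not state the benchmark's scoring), and the step that renders the grid to an image for the model is omitted.

```python
import numpy as np


def make_g2m_instance(rows, cols, colors, seed=0):
    """Build a random color grid, a color-to-number mapping, and the target matrix."""
    rng = np.random.default_rng(seed)
    # Mapping shown to the model alongside the grid, e.g. {"red": 0, "green": 1}.
    mapping = {name: idx for idx, name in enumerate(colors)}
    grid = rng.choice(colors, size=(rows, cols))
    # Ground truth: every cell replaced by the number its color maps to.
    target = np.array([[mapping[str(cell)] for cell in row] for row in grid])
    return grid, mapping, target


def cell_accuracy(predicted, target):
    """Fraction of cells read out correctly; a wrong shape scores 0 (assumed rule)."""
    predicted = np.asarray(predicted)
    if predicted.shape != target.shape:
        return 0.0
    return float((predicted == target).mean())


if __name__ == "__main__":
    grid, mapping, target = make_g2m_instance(4, 4, ["red", "green", "blue"])
    print(grid)
    print(mapping)
    print(target)
    # A perfect readout scores 1.0; a model answer would be parsed into
    # an integer matrix and passed in place of `target` here.
    print(cell_accuracy(target, target))
```

Varying `rows`, `cols`, and the length of `colors` corresponds to the two complexity knobs the abstract describes: grid size and number of colors.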

[139] Immunizing 3D Gaussian Generative Models Against Unauthorized Fine-Tuning via Attribute-Space Traps

Jianwei Zhang,Sihan Cao,Chaoning Zhang,Ziming Hong,Jiaxin Huang,Pengcheng Zheng,Caiyan Qin,Wei Dong,Yang Yang,Tongliang Liu

Main category: cs.CV

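TL;DR: This paper presents GaussLock, a lightweight parameter-space immunization framework for 3D generative models, specifically those built on Gaussian representations. By jointly optimizing authorized distillation and attribute-aware trap losses, it preserves performance on authorized tasks while resisting unauthorized fine-tuning attacks.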

Details Motivation: Publicly released pre-trained weights leave 3D generative models open to fine-tuning attacks that leak intellectual property; because of their explicit Gaussian representation, the structural parameters of 3D models are exposed directly to gradient-based optimization, which calls for a dedicated defense. Method: GaussLock combines authorized distillation with attribute-aware trap losses on position, scale, rotation, opacity, and color; the traps systematically disrupt spatial distributions, geometric shapes, rotational axis alignment, and primitive visibility in order to break structural integrity. Result: Experiments on large-scale Gaussian models show that GaussLock markedly degrades the quality of unauthorized reconstructions (higher LPIPS, lower PSNR) while maintaining performance on authorized fine-tuning. Conclusion: GaussLock is the first defense against fine-tuning attacks designed specifically for 3D generative models; it is both effective and lightweight and offers a new approach to protecting the intellectual property of such models. Abstract: Recent large-scale generative models enable high-quality 3D synthesis. However, the public accessibility of pre-trained weights introduces a critical vulnerability. Adversaries can fine-tune these models to steal specialized knowledge acquired during pre-training, leading to intellectual property infringement. Unlike defenses for 2D images and language models, 3D generators require specialized protection due to their explicit Gaussian representations, which expose fundamental structural parameters directly to gradient-based optimization. We propose GaussLock, the first approach designed to defend 3D generative models against fine-tuning attacks. GaussLock is a lightweight parameter-space immunization framework that integrates authorized distillation with attribute-aware trap losses targeting position, scale, rotation, opacity, and color. Specifically, these traps systematically collapse spatial distributions, distort geometric shapes, align rotational axes, and suppress primitive visibility to fundamentally destroy structural integrity. By jointly optimizing these dual objectives, the distillation process preserves fidelity on authorized tasks while the embedded traps actively disrupt unauthorized reconstructions. Experiments on large-scale Gaussian models demonstrate that GaussLock effectively neutralizes unauthorized fine-tuning attacks. It substantially degrades the quality of unauthorized reconstructions, evidenced by significantly higher LPIPS and lower PSNR, while effectively maintaining performance on authorized fine-tuning.

[140] Face Density as a Proxy for Data Complexity: Quantifying the Hardness of Instance Count

Abolfazl Mohammadi-Seif,Ricardo Baeza-Yates

Main category: cs.CV

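TL;DR: This paper shows that instance density (such as the number of faces) is an intrinsic dimension of data complexity that affects machine learning performance. Tightly controlled experiments on WIDER FACE and Open Images find that performance falls monotonically with density and that models trained on low densities generalize poorly to high-density scenes, with a systematic under-counting bias.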

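Details Motivation: Machine learning research has traditionally focused on model innovation, yet real performance is often limited by the complexity of the data itself. The paper sets out to isolate and quantify instance density (measured by face count) as a core driver of data complexity, going beyond the empirical observation that crowded scenes are harder and measuring the effect of density alone while controlling for class imbalance. Method: Controlled experiments on WIDER FACE and Open Images restrict each image to between 1 and 18 faces with perfectly balanced sampling; the authors track how performance varies with density across classification, regression, and detection, and analyze the generalization and error patterns of low-density-trained models tested at high density. Result: Performance drops monotonically as face count rises, consistently across task paradigms; the density effect persists even when models see the full density range during training; models trained at low density show a systematic under-counting bias in high-density scenes, with error rates up to 4.6x higher, indicating that density induces a domain shift. Conclusion: Instance density is a quantifiable, intrinsic dimension of data difficulty and should be built into curriculum learning design and density-stratified evaluation to improve robustness and generalization. Abstract: Machine learning progress has historically prioritized model-centric innovations, yet achievable performance is frequently capped by the intrinsic complexity of the data itself. In this work, we isolate and quantify the impact of instance density (measured by face count) as a primary driver of data complexity. Rather than simply observing that "crowded scenes are harder," we rigorously control for class imbalance to measure the precise degradation caused by density alone. Controlled experiments on the WIDER FACE and Open Images datasets, restricted to exactly 1 to 18 faces per image with perfectly balanced sampling, reveal that model performance degrades monotonically with increasing face count. This trend holds across classification, regression, and detection paradigms, even when models are fully exposed to the entire density range. Furthermore, we demonstrate that models trained on low-density regimes fail to generalize to higher densities, exhibiting a systematic under-counting bias, with error rates increasing by up to 4.6x, which suggests density acts as a domain shift. These findings establish instance density as an intrinsic, quantifiable dimension of data hardness and motivate specific interventions in curriculum learning and density-stratified evaluation.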

[141] Are We Recognizing the Jaguar or Its Background? A Diagnostic Framework for Jaguar Re-Identification

Antonio Rueda-Toicen,Abigail Allen Martin,Daniil Morozov,Matin Mahmood,Alexandra Schild,Shahabeddin Dayani,Davide Panza,Gerard de Melo

Main category: cs.CV

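TL;DR: This paper proposes a diagnostic framework for wildlife (jaguar) re-identification that tests whether a model truly relies on individual-specific coat patterns or on misleading cues such as background and silhouette. The framework includes a background/foreground leakage control and a left/right flank laterality diagnostic, and it is used to examine several mitigation methods on a newly built Pantanal jaguar dataset.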

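Details Motivation: Existing jaguar re-identification methods score well on standard retrieval metrics but may rely on non-essential cues such as background or silhouette instead of the truly discriminative coat pattern, so interpretable and quantifiable diagnostic tools are needed. Method: A two-axis diagnostic framework: 1) an inpainting-based background/foreground leakage ratio (the context ratio); 2) a laterality diagnostic based on cross-flank retrieval and mirror self-similarity. The authors build a Pantanal jaguar benchmark with per-pixel segmentation masks and an identity-balanced protocol, and test three mitigation families under a unified evaluation: ArcFace fine-tuning, anti-symmetry regularization, and Lorentz hyperbolic embeddings. Result: The framework exposes how strongly different methods depend on background versus foreground cues and how their flank discrimination differs; for example, some models retrieve accurately yet lean heavily on background, while adding anti-symmetry regularization markedly increases reliance on the coat pattern. Conclusion: Retrieval metrics alone are not enough to judge the reliability of wildlife re-identification models; interpretable diagnostics are needed to check which visual evidence a model actually uses, and the proposed framework offers a new way to assess model trustworthiness. Abstract: Jaguar re-identification (re-ID) from citizen-science imagery can look strong on standard retrieval metrics while still relying on the wrong evidence, such as background context or silhouette shape, instead of the coat pattern that defines identity. We introduce a diagnostic framework for wildlife re-ID with two axes: a leakage-controlled context ratio, background/foreground, computed from inpainted background-only versus foreground-only images, and a laterality diagnostic based on cross-flank retrieval and mirror self-similarity. To make these diagnostics measurable, we curate a Pantanal jaguar benchmark with per-pixel segmentation masks and an identity-balanced evaluation protocol. We then use representative mitigation families, ArcFace fine-tuning, anti-symmetry regularization, and Lorentz hyperbolic embeddings, as case studies under the same evaluation lens. The goal is not only to ask which model ranks best, but also what visual evidence it uses to do so.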

[142] CAGE: Bridging the Accuracy-Aesthetics Gap in Educational Diagrams via Code-Anchored Generative Enhancement

Dikshant Kukreja,Kshitij Sah,Karan Goyal,Mukesh Mohania,Vikram Goyal

Main category: cs.CV

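TL;DR: This paper proposes CAGE, which pairs LLM-generated, structurally correct diagram code with visual enhancement by a diffusion model (ControlNet), improving the look of educational diagrams while keeping labels accurate. It also releases the EduDiagram-2K dataset to advance research on educational multimedia generation.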

Details Motivation: Existing methods for generating educational diagrams cannot deliver accuracy and visual appeal together: open-source diffusion models corrupt text labels, code-based generation looks visually flat, and closed-source APIs are unreliable and costly. Method: In the CAGE (Code-Anchored Generative Enhancement) framework, an LLM first generates executable code that produces a structurally correct diagram, and a diffusion model conditioned through ControlNet then polishes it visually while preserving label fidelity; the paired EduDiagram-2K dataset supports this pipeline. Result: The three paradigms are evaluated quantitatively on 400 K-12 diagram prompts, confirming CAGE's combined advantage in label fidelity and visual quality; the paper provides preliminary empirical results and a research roadmap for the multimedia community. Conclusion: CAGE eases the trade-off between accuracy and aesthetics in educational diagram generation and offers a new approach to trustworthy, scalable educational content generation. Abstract: Educational diagrams -- labeled illustrations of biological processes, chemical structures, physical systems, and mathematical concepts -- are essential cognitive tools in K-12 instruction. Yet no existing method can generate them both accurately and engagingly. Open-source diffusion models produce visually rich images but catastrophically garble text labels. Code-based generation via LLMs guarantees label correctness but yields visually flat outputs. Closed-source APIs partially bridge this gap but remain unreliable and prohibitively expensive at educational scale. We quantify this accuracy-aesthetics dilemma across all three paradigms on 400 K-12 diagram prompts, measuring both label fidelity and visual quality through complementary automated and human evaluation protocols. To resolve it, we propose CAGE (Code-Anchored Generative Enhancement): an LLM synthesizes executable code producing a structurally correct diagram, then a diffusion model conditioned on the programmatic output via ControlNet refines it into a visually polished graphic while preserving label fidelity. We also introduce EduDiagram-2K, a collection of 2,000 paired programmatic-stylized diagrams enabling this pipeline, and present proof-of-concept results and a research agenda for the multimedia community.

[143] TaFall: Balance-Informed Fall Detection via Passive Thermal Sensing

Chengxiao Li,Xie Zhang,Wei Zhu,Yan Jiang,Chenshu Wu

Main category: cs.CV

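TL;DR: TaFall is a privacy-preserving fall detection system based on low-cost thermal array sensing. It detects falls with high accuracy by modeling the process of balance degradation, combining privacy with reliability.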

Details Motivation: Existing privacy-preserving fall detection based on radio frequency sensing depends on coarse motion cues, which makes it unreliable in real deployments, while most falls among older adults happen in private indoor spaces where effectiveness and privacy must both be respected. Method: The TaFall system comprises (i) an appearance-motion fusion model for robust pose reconstruction from low-resolution thermal maps, (ii) physically grounded balance-aware learning, and (iii) pose-bridged pretraining for robustness; its core idea is to model a fall as a pose-driven process of biomechanical balance degradation. Result: On a dataset with 35 participants and more than 3,000 fall instances, TaFall reaches a 98.26% detection rate with a 0.65% false alarm rate; in a 27-day deployment across four homes the false alarm rate drops to 0.00126%; a bathroom pilot confirms robustness to moisture and thermal interference. Conclusion: TaFall is a practical fall detection approach for everyday living environments that is both highly reliable and strongly privacy-preserving. Abstract: Falls are a major cause of injury and mortality among older adults, yet most incidents occur in private indoor environments where monitoring must balance effectiveness with privacy. Existing privacy-preserving fall detection approaches, particularly those based on radio frequency sensing, often rely on coarse motion cues, which limits reliability in real-world deployments. We introduce TaFall, a balance-informed fall detection system based on low-cost, privacy-preserving thermal array sensing. The key insight is that TaFall models a fall as a process of balance degradation and detects falls by estimating pose-driven biomechanical balance dynamics. To enable this capability from low-resolution thermal array maps, we propose (i) an appearance-motion fusion model for robust pose reconstruction, (ii) physically grounded balance-aware learning, and (iii) pose-bridged pretraining to improve robustness. TaFall achieves a detection rate of 98.26% with a false alarm rate of 0.65% on our dataset with over 3,000 fall instances from 35 participants across diverse indoor environments. In 27-day deployments across four homes, TaFall attains an ultra-low false alarm rate of 0.00126%, and a pilot bathroom study confirms robustness under moisture and thermal interference. Together, these results establish TaFall as a reliable and privacy-preserving approach to fall detection in everyday living environments.

[144] EDFNet: Early Fusion of Edge and Depth for Thin-Obstacle Segmentation in UAV Navigation

Negar Fathi

Main category: cs.CV

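TL;DR: This paper presents EDFNet, an early-fusion segmentation framework that combines RGB, depth, and edge inputs for thin-obstacle perception on UAVs. Experiments on the DDOS dataset show its advantage on boundary-sensitive and recall-oriented metrics, although segmenting ultra-thin obstacles remains a challenge.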

Details Motivation: Existing segmentation methods struggle to detect the thin obstacles common in UAV navigation (such as wires and branches) because these occupy few pixels, have weak visual contrast, and suffer from severe class imbalance; such methods also make little use of complementary multimodal information. Method: EDFNet is a modular early-fusion segmentation framework that fuses RGB, depth, and edge modalities; it is evaluated systematically on the DDOS dataset across 16 modality-backbone combinations (U-Net and DeepLabV3, pretrained and non-pretrained). Result: The pretrained RGBDE U-Net performs best, with a Thin-Structure Evaluation Score of 0.244, mean IoU of 0.219, boundary IoU of 0.234, and a speed of 19.62 FPS; all models still perform poorly on the rarest ultra-thin obstacle categories. Conclusion: Early RGB-Depth-Edge fusion is a practical, modular baseline for thin-obstacle segmentation that supports safer UAV navigation, but robust segmentation of ultra-thin obstacles remains an open problem. Abstract: Autonomous Unmanned Aerial Vehicles (UAVs) must reliably detect thin obstacles such as wires, poles, and branches to navigate safely in real-world environments. These structures remain difficult to perceive because they occupy few pixels, often exhibit weak visual contrast, and are strongly affected by class imbalance. Existing segmentation methods primarily target coarser obstacles and do not fully exploit the complementary multimodal cues needed for thin-structure perception. We present EDFNet, a modular early-fusion segmentation framework that integrates RGB, depth, and edge information for thin-obstacle perception in cluttered aerial scenes. We evaluate EDFNet on the Drone Depth and Obstacle Segmentation (DDOS) dataset across sixteen modality-backbone configurations using U-Net and DeepLabV3 in pretrained and non-pretrained settings. The results show that early RGB-Depth-Edge fusion provides a competitive and well-balanced baseline, with the most consistent gains appearing in boundary-sensitive and recall-oriented metrics. The pretrained RGBDE U-Net achieves the best overall performance, with the highest Thin-Structure Evaluation Score (0.244), mean IoU (0.219), and boundary IoU (0.234), while maintaining competitive runtime performance (19.62 FPS) on our evaluation hardware. However, performance on the rarest ultra-thin categories remains low across all models, indicating that reliable ultra-thin segmentation is still an open challenge.
Overall, these findings position early RGB-Depth-Edge fusion as a practical and modular baseline for thin-obstacle segmentation in UAV navigation.
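
Early fusion here means stacking the modalities as input channels before the network sees them. The sketch below shows one way this could look, under assumptions: the abstract does not say how the edge map is computed, so a Sobel gradient magnitude is used as a stand-in; the channel order (R, G, B, depth, edge) and the names `sobel_edges` and `early_fuse` are hypothetical.

```python
import numpy as np


def sobel_edges(gray):
    """Sobel gradient magnitude of a 2D image, scaled to [0, 1]."""
    kx = np.array([[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], [-1.0, 0.0, 1.0]])
    ky = kx.T
    h, w = gray.shape
    # Replicate the border so the output keeps the input size.
    padded = np.pad(gray.astype(np.float64), 1, mode="edge")
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    # Apply both 3x3 kernels by summing shifted copies of the image.
    for i in range(3):
        for j in range(3):
            window = padded[i:i + h, j:j + w]
            gx += kx[i, j] * window
            gy += ky[i, j] * window
    mag = np.hypot(gx, gy)
    peak = mag.max()
    return mag / peak if peak > 0 else mag


def early_fuse(rgb, depth):
    """Stack RGB (H, W, 3), depth (H, W), and an edge map into an (H, W, 5) input."""
    edges = sobel_edges(rgb.mean(axis=2))
    return np.concatenate([rgb, depth[..., None], edges[..., None]], axis=2)


if __name__ == "__main__":
    rgb = np.zeros((6, 6, 3))
    rgb[:, 3:, :] = 1.0          # a vertical intensity step in the middle
    depth = np.ones((6, 6))
    fused = early_fuse(rgb, depth)
    print(fused.shape)           # five channels: R, G, B, depth, edge
```

With this layout, the only architectural change a U-Net or DeepLabV3 backbone needs is a first convolution that accepts five input channels instead of three, which is what makes the approach modular.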

[145] Assessing Privacy Preservation and Utility in Online Vision-Language Models

Karmesh Siddharam Chaudhari,Youxiang Zhu,Amy Feng,Xiaohui Liang,Honggang Zhang

Main category: cs.CV

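TL;DR: This paper examines the risk that online vision-language models (OVLMs) leak personally identifiable information (PII) when processing images, analyzes explicit and implicit PII exposure mechanisms, and proposes protections that balance privacy with image utility.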

Details Motivation: As OVLMs become widely used, images uploaded by users may unintentionally leak PII; in particular, relational cues hidden in an image can expose sensitive information indirectly, so a systematic analysis of privacy risks and protections is needed. Method: The paper analyzes explicit and implicit PII exposure mechanisms in images and proposes methods that protect privacy in VLM applications while preserving image utility. Result: The evaluation finds the proposed methods effective and highlights the delicate balance between privacy protection and functional utility in online image processing. Conclusion: Contextual relationships in images are a key route for PII leakage, and privacy protections must be designed into OVLM deployments to prevent both direct and indirect PII exposure. Abstract: The increasing use of Online Vision Language Models (OVLMs) for processing images has introduced significant privacy risks, as individuals frequently upload images for various utilities, unaware of the potential for privacy violations. Images contain relationships that relate to Personally Identifiable Information (PII), where even seemingly harmless details can indirectly reveal sensitive information through surrounding clues. This paper explores the critical issue of PII disclosure in images uploaded to OVLMs and its implications for user privacy. We investigate how the extraction of contextual relationships from images can lead to direct (explicit) or indirect (implicit) exposure of PII, significantly compromising personal privacy. Furthermore, we propose methods to protect privacy while preserving the intended utility of the images in Vision Language Model (VLM)-based applications. Our evaluation demonstrates the efficacy of these techniques, highlighting the delicate balance between maintaining utility and protecting privacy in online image processing environments. Index Terms: Personally Identifiable Information (PII), Privacy, Utility, privacy concerns, sensitive information

[146] I Can't Believe TTA Is Not Better: When Test-Time Augmentation Hurts Medical Image Classification

Daniel Nobrega Medeiros

Main category: cs.CV

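TL;DR: Through a systematic empirical study, this paper finds that test-time augmentation (TTA) significantly lowers accuracy on most medical image classification tasks, mainly because of distribution shift between augmented images and training data together with a mismatch in batch normalization statistics. It helps slightly only for specific model-dataset combinations, such as ResNet-18 on dermatology images.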

Details Motivation: TTA is widely believed to improve medical image classification accuracy and is routinely used in production systems and competitions, but it has not been validated systematically; the paper tests how general that assumption is. Method: The effect of a standard TTA pipeline is evaluated on three MedMNIST v2 benchmarks and four models whose parameter counts span three orders of magnitude; ablations examine the augmentation type, whether the original image is included, and the influence of batch normalization. Result: TTA lowers accuracy consistently in the vast majority of settings, by as much as 31.6 percentage points; only ResNet-18 on dermatology images gains 1.6%; intensity augmentations are more robust than geometric transforms; adding the original image only partly offsets the drop. Conclusion: TTA should not be used as a default post-processing step and needs empirical validation on the specific model and dataset; its negative effect stems from the distribution shift introduced by augmentation and the mismatch in BN statistics. Abstract: Test-time augmentation (TTA)--aggregating predictions over multiple augmented copies of a test input--is widely assumed to improve classification accuracy, particularly in medical imaging where it is routinely deployed in production systems and competition solutions. We present a systematic empirical study challenging this assumption across three MedMNIST v2 benchmarks and four architectures spanning three orders of magnitude in parameter count (21K to 11M). Our principal finding is that TTA with standard augmentation pipelines consistently degrades accuracy relative to single-pass inference, with drops as severe as 31.6 percentage points for ResNet-18 on pathology images. This degradation affects all architectures, including convolutional models, and worsens with more augmented views. The sole exception is ResNet-18 on dermatology images, which gains a modest +1.6%. We identify the distribution shift between augmented and training-time inputs--amplified by batch normalization statistics mismatch--as the primary mechanism. Our ablation studies show that augmentation strategy matters critically: intensity-only augmentations preserve more performance than geometric transforms, and including the original unaugmented image partially mitigates but does not eliminate the accuracy drop. These findings serve as a cautionary note for practitioners: TTA should not be applied as a default post-hoc improvement but must be validated on the specific model-dataset combination.
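
The TTA procedure under study, averaging predictions over augmented views, can be written in a few lines. The sketch below is a generic illustration rather than the paper's pipeline: `tta_predict` is a hypothetical name, the model is any callable returning logits, and the toy model and augmentations in the usage section are invented purely to show how an off-distribution view can pull the averaged confidence down.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def tta_predict(model, image, augmentations, include_original=True):
    """Average class probabilities over augmented views of one image.

    model: callable mapping an image array to a 1D array of logits.
    augmentations: list of callables mapping an image to an augmented image.
    """
    views = [aug(image) for aug in augmentations]
    if include_original:
        views.insert(0, image)
    if not views:
        raise ValueError("need at least one view to predict on")
    probs = np.stack([softmax(model(view)) for view in views])
    return probs.mean(axis=0)


if __name__ == "__main__":
    # Toy model (assumption): the logit of class 0 grows with image brightness.
    def toy_model(x):
        return np.array([x.sum(), 0.0])

    image = np.ones((2, 2))
    augs = [np.fliplr, lambda x: x * 0.0]   # a harmless flip and a destructive blackout

    single = softmax(toy_model(image))
    averaged = tta_predict(toy_model, image, augs)
    print("single pass:", single)
    print("with TTA:   ", averaged)
```

In the toy run, the blackout view yields a 50/50 prediction that dilutes the confident single-pass output, a small-scale analogue of the distribution-shift mechanism the abstract identifies. Comparing `tta_predict` against single-pass inference on a held-out set is the kind of per-model validation the authors recommend.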

[147] Attention-Guided Flow-Matching for Sparse 3D Geological Generation

Zhixiang Lu,Mengqi Han,Peixin Guo,Tianming Bai,Jionglong Su,Fei Fang,Sifan Song

Main category: cs.CV

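TL;DR: This paper presents 3D-GeoFlow, an attention-guided continuous flow matching framework for sparse multimodal geological modeling. Using continuous vector field regression and 3D attention gates, it addresses the topological discontinuities and representation collapse that data sparsity causes in high-resolution 3D geological modeling, and it clearly outperforms traditional methods and diffusion baselines on a large synthetic dataset.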

Details Motivation: Traditional geological modeling methods cope poorly with extremely sparse 1D borehole and 2D surface data and cannot capture non-linear topological discontinuities; existing diffusion models tend to suffer representation collapse when conditioned on sparse categorical grids. Method: 3D-GeoFlow recasts discrete categorical generation as simulation-free, continuous vector field regression optimized with MSE, which yields stable, deterministic optimal transport paths; 3D attention gates dynamically propagate local borehole features through the volumetric latent space to keep the large-scale structure coherent. Result: The framework is validated on a large multimodal dataset of 2,200 procedurally generated 3D geological cases; out-of-distribution evaluation shows that it clearly outperforms heuristic interpolation and standard diffusion baselines. Conclusion: 3D-GeoFlow changes how sparse geological modeling is approached and provides a new method for high-resolution, topologically robust 3D geological modeling. Abstract: Constructing high-resolution 3D geological models from sparse 1D borehole and 2D surface data is a highly ill-posed inverse problem. Traditional heuristic and implicit modeling methods fundamentally fail to capture non-linear topological discontinuities under extreme sparsity, often yielding unrealistic artifacts. Furthermore, while deep generative architectures like Diffusion Models have revolutionized continuous domains, they suffer from severe representation collapse when conditioned on sparse categorical grids. To bridge this gap, we propose 3D-GeoFlow, the first Attention-Guided Continuous Flow Matching framework tailored for sparse multimodal geological modeling. By reformulating discrete categorical generation as a simulation-free, continuous vector field regression optimized via Mean Squared Error, our model establishes stable, deterministic optimal transport paths. Crucially, we integrate 3D Attention Gates to dynamically propagate localized borehole features across the volumetric latent space, ensuring macroscopic structural coherence. To validate our framework, we curated a large-scale multimodal dataset comprising 2,200 procedurally generated 3D geological cases. Extensive out-of-distribution (OOD) evaluations demonstrate that 3D-GeoFlow achieves a paradigm shift, significantly outperforming heuristic interpolations and standard diffusion baselines.
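
The phrase "simulation-free vector field regression optimized via Mean Squared Error" matches the usual conditional flow matching objective. The sketch below assumes the common straight-line probability path between a noise sample and a data sample, which the abstract does not confirm; `flow_matching_loss` and `velocity_fn` are hypothetical names, and the attention gates and borehole conditioning are left out entirely.

```python
import numpy as np


def flow_matching_loss(velocity_fn, x0, x1, t):
    """MSE between a predicted velocity field and the straight-line target.

    x0: noise samples, shape (B, ...).
    x1: data samples, same shape (for categorical geology this could be a
        continuous relaxation such as one-hot volumes; that is an assumption).
    t:  times in [0, 1], shape (B,).
    velocity_fn: callable (x_t, t) -> predicted velocity with the shape of x_t.
    """
    # Reshape t so it broadcasts over every non-batch dimension.
    t = t.reshape(-1, *([1] * (x0.ndim - 1)))
    # Point on the straight path from x0 to x1 at time t.
    x_t = (1.0 - t) * x0 + t * x1
    # Along that path the velocity is constant: x1 - x0.
    target = x1 - x0
    pred = velocity_fn(x_t, t)
    return float(np.mean((pred - target) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x0 = rng.normal(size=(4, 3))
    x1 = rng.normal(size=(4, 3))
    t = rng.uniform(size=4)

    # An oracle that already knows the target reaches zero loss.
    print(flow_matching_loss(lambda x_t, tt: x1 - x0, x0, x1, t))
    # An untrained field that predicts zero velocity does not.
    print(flow_matching_loss(lambda x_t, tt: np.zeros_like(x_t), x0, x1, t))
```

No ODE is integrated during training, which is what "simulation-free" refers to: the loss only needs a sampled time, an interpolated point, and the closed-form target velocity. Sampling afterwards integrates the learned field from noise to data.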

[148] PASTA: Vision Transformer Patch Aggregation for Weakly Supervised Target and Anomaly Segmentation

Melanie Neubauer,Elmar Rueckert,Christian Rauch

Main category: cs.CV

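TL;DR: This paper proposes PASTA, a weakly supervised method for target segmentation and anomaly detection. It uses distribution analysis in self-supervised ViT feature spaces together with text-guided zero-shot segmentation from SAM 3 to achieve efficient, accurate pixel-level segmentation in industrial and agricultural settings.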

Details Motivation: Existing perception systems depend on large annotated datasets and struggle to meet the strict demands of industrial and agricultural applications (such as steel scrap recycling and weeding) for real-time operation, pixel-level segmentation precision, and robust accuracy. Method: PASTA is a weakly supervised pipeline that compares an observed scene with a nominal reference, performs distribution analysis in self-supervised ViT feature spaces, and uses semantic text prompts through Segment Anything Model 3 for zero-shot target and anomaly segmentation. Result: On custom steel scrap recycling and plant datasets, training time falls by 75.8%; target segmentation reaches up to 88.3% IoU and anomaly segmentation up to 63.5% IoU, surpassing domain-specific baselines while generalizing across domains. Conclusion: PASTA is a weakly supervised segmentation framework that needs no dense annotation, balances efficiency with accuracy, and suits unstructured real-world scenes where unseen anomalies must be detected. Abstract: Detecting unseen anomalies in unstructured environments presents a critical challenge for industrial and agricultural applications such as material recycling and weeding. Existing perception systems frequently fail to satisfy the strict operational requirements of these domains, specifically real-time processing, pixel-level segmentation precision, and robust accuracy, due to their reliance on exhaustively annotated datasets. To address these limitations, we propose a weakly supervised pipeline for object segmentation and classification using weak image-level supervision called 'Patch Aggregation for Segmentation of Targets and Anomalies' (PASTA). By comparing an observed scene with a nominal reference, PASTA identifies Target and Anomaly objects through distribution analysis in self-supervised Vision Transformer (ViT) feature spaces. Our pipeline utilizes semantic text-prompts via the Segment Anything Model 3 to guide zero-shot object segmentation. Evaluations on a custom steel scrap recycling dataset and a plant dataset demonstrate a 75.8% reduction in training time for our approach relative to domain-specific baselines. While being domain-agnostic, our method achieves superior Target (up to 88.3% IoU) and Anomaly (up to 63.5% IoU) segmentation performance in the industrial and agricultural domains.

[149] Identity-Aware U-Net: Fine-grained Cell Segmentation via Identity-Aware Representation Learning

Rui Xiao

Main category: cs.CV

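TL;DR: This paper proposes Identity-Aware U-Net (IAU-Net), which jointly models spatial localization and instance discrimination and adds triplet-based metric learning to improve fine-grained segmentation of objects with highly similar shapes.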

Details Motivation: The work targets precise segmentation in dense prediction of objects that are morphologically very similar, with blurred boundaries, overlapping instances, and weak visual differences. Method: On top of a U-Net architecture, an auxiliary embedding branch learns discriminative identity representations, and triplet-based metric learning pulls embeddings of the same target together while pushing away hard negatives with similar morphology. Result: The method performs well on benchmarks including cell segmentation, especially in challenging cases with similar contours, dense layouts, and ambiguous boundaries. Conclusion: IAU-Net moves from category-level segmentation to fine-grained instance-level discrimination and strengthens robust discrimination among visually similar objects. Abstract: Precise segmentation of objects with highly similar shapes remains a challenging problem in dense prediction, especially in scenarios with ambiguous boundaries, overlapping instances, and weak inter-instance visual differences. While conventional segmentation models are effective at localizing object regions, they often lack the discriminative capacity required to reliably distinguish a target object from morphologically similar distractors. In this work, we study fine-grained object segmentation from an identity-aware perspective and propose Identity-Aware U-Net (IAU-Net), a unified framework that jointly models spatial localization and instance discrimination. Built upon a U-Net-style encoder-decoder architecture, our method augments the segmentation backbone with an auxiliary embedding branch that learns discriminative identity representations from high-level features, while the main branch predicts pixel-accurate masks. To enhance robustness in distinguishing objects with near-identical contours or textures, we further incorporate triplet-based metric learning, which pulls target-consistent embeddings together and separates them from hard negatives with similar morphology. This design enables the model to move beyond category-level segmentation and acquire a stronger capability for precise discrimination among visually similar objects. Experiments on benchmarks including cell segmentation demonstrate promising results, particularly in challenging cases involving similar contours, dense layouts, and ambiguous boundaries.
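
The triplet objective mentioned above has a standard form: the anchor should sit closer to the positive than to the negative by at least a margin. The sketch below shows that textbook loss with Euclidean distance; the margin value, the distance choice, and the name `triplet_loss` are assumptions, since the abstract gives no hyperparameters or details of how IAU-Net mines its hard negatives.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Mean hinge loss max(d(a, p) - d(a, n) + margin, 0) over a batch.

    anchor, positive, negative: embedding arrays of shape (B, D).
    """
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    return float(np.maximum(d_pos - d_neg + margin, 0.0).mean())


if __name__ == "__main__":
    anchor = np.array([[0.0, 0.0]])
    positive = np.array([[0.0, 1.0]])        # same identity, distance 1

    easy_negative = np.array([[3.0, 4.0]])   # distance 5: already well separated
    hard_negative = np.array([[0.0, 1.5]])   # distance 1.5: inside the margin

    print(triplet_loss(anchor, positive, easy_negative))
    print(triplet_loss(anchor, positive, hard_negative))
```

The easy negative contributes nothing because it already lies beyond the margin, while the hard negative produces a positive loss. That is why morphologically similar distractors, the hard negatives of the abstract, are the informative training examples.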

[150] Multi-Granularity Reasoning for Image Quality Assessment via Attribute-Aware Reinforcement Learning to Rank

Xiangyong Chen,Xiaochuan Lin,Haoran Liu,Xuan Li,Yichen Su,Xiangwei Guo

Main category: cs.CV

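TL;DR: This paper proposes MG-IQA, a framework that extends RL2R with multi-granularity reasoning to jointly assess overall image quality and fine-grained attributes (such as sharpness and color fidelity). It introduces attribute-aware prompting, a multi-dimensional Thurstone reward model, and a cross-domain alignment mechanism, and outperforms existing methods on multiple benchmarks.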

Details Motivation: Existing reinforcement-learning-based image quality assessment methods predict only an overall quality score and overlook the multi-dimensional nature of human quality perception (sharpness, color fidelity, noise level, compositional aesthetics, and so on). Method: MG-IQA is a multi-granularity IQA framework comprising (1) an attribute-aware prompting strategy; (2) a multi-dimensional Thurstone reward model for group relative policy optimization; and (3) a cross-domain alignment mechanism that stabilizes joint training across synthetic-distortion, authentic-distortion, and AI-generated image datasets. Result: On eight IQA benchmarks, MG-IQA outperforms state-of-the-art methods in both overall quality prediction (average SRCC improvement of 2.1%) and attribute-level assessment, while generating interpretable, human-aligned quality descriptions. Conclusion: MG-IQA jointly assesses overall and fine-grained image quality, improving interpretability and human alignment for multi-dimensional visual quality understanding. Abstract: Recent advances in reasoning-induced image quality assessment (IQA) have demonstrated the power of reinforcement learning to rank (RL2R) for training vision-language models (VLMs) to assess perceptual quality. However, existing approaches operate at a single granularity, predicting only an overall quality score, while overlooking the multi-dimensional nature of human quality perception, which encompasses attributes such as sharpness, color fidelity, noise level, and compositional aesthetics. In this paper, we propose MG-IQA (Multi-Granularity IQA), a multi-granularity reasoning framework that extends RL2R to jointly assess overall image quality and fine-grained quality attributes within a single inference pass. Our approach introduces three key innovations: (1) an attribute-aware prompting strategy that elicits structured multi-attribute reasoning from VLMs; (2) a multi-dimensional Thurstone reward model that computes attribute-specific fidelity rewards for group relative policy optimization; and (3) a cross-domain alignment mechanism that enables stable joint training across synthetic distortion, authentic distortion, and AI-generated image datasets without perceptual scale re-alignment. Extensive experiments on eight IQA benchmarks demonstrate that MG-IQA consistently outperforms state-of-the-art methods in both overall quality prediction (average SRCC improvement of 2.1%) and attribute-level assessment, while generating interpretable, human-aligned quality descriptions.

[151] The Deployment Gap in AI Media Detection: Platform-Aware and Visually Constrained Adversarial Evaluation

Aishwarya Budhkar,Trishita Dhara,Siddhesh Sheth

Main category: cs.CV

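TL;DR: This paper proposes a platform-aware adversarial evaluation framework for AI-generated media detection. Accounting for real deployment transforms (compression, resizing, screenshot-style distortions) and visually plausible constraints (meme-style band perturbations), it shows that existing detectors are far less robust in realistic settings and that their calibration collapses under attack.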

Details Motivation: Existing AI media detectors perform very well under clean laboratory conditions, but in deployment images are resized, compressed, re-encoded, and visually modified, which creates a large robustness gap and calls for more realistic evaluation. Method: A platform-aware adversarial evaluation framework models typical deployment transforms and constrains adversarial perturbations to localized, visually plausible meme-style bands; the study also examines whether universal perturbations exist and how attacks affect calibration. Result: Under platform-aware attacks, detectors with AUC $\approx 0.99$ degrade substantially and fake-to-real misclassification rates rise; universal perturbations exist even under the localized band constraint; detectors show pronounced calibration collapse (confident but incorrect predictions). Conclusion: Robustness measured only on clean data substantially overestimates deployment reliability; platform-aware evaluation should be part of future AI media security benchmarks, and the framework is released to support standardized robustness assessment. Abstract: Recent AI media detectors report near-perfect performance under clean laboratory evaluation, yet their robustness under realistic deployment conditions remains underexplored. In practice, AI-generated images are resized, compressed, re-encoded, and visually modified before being shared on online platforms. We argue that this creates a deployment gap between laboratory robustness and real-world reliability. In this work, we introduce a platform-aware adversarial evaluation framework for AI media detection that explicitly models deployment transforms (e.g., resizing, compression, screenshot-style distortions) and constrains perturbations to visually plausible meme-style bands rather than full-image noise. Under this threat model, detectors achieving AUC $\approx 0.99$ in clean settings experience substantial degradation. Per-image platform-aware attacks reduce AUC to significantly lower levels and achieve high fake-to-real misclassification rates, despite strict visual constraints. We further demonstrate that universal perturbations exist even under localized band constraints, revealing shared vulnerability directions across inputs. Beyond accuracy degradation, we observe pronounced calibration collapse under attack, where detectors become confidently incorrect. Our findings highlight that robustness measured under clean conditions substantially overestimates deployment robustness. We advocate for platform-aware evaluation as a necessary component of future AI media security benchmarks and release our evaluation framework to facilitate standardized robustness assessment.

[152] Orthogonal Quadratic Complements for Vision Transformer Feed-Forward Networks

Wang Zixian

Main category: cs.CV

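TL;DR: This paper proposes Orthogonal Quadratic Complements (OQC), which build a low-rank quadratic auxiliary branch and project it onto the orthogonal complement of the main representation, strengthening vision transformers while avoiding redundant information. Experiments show that OQC and its variants improve accuracy on multiple datasets and improve the speed-accuracy tradeoff and representation geometry.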

Details Motivation: Existing bilinear feed-forward replacements improve vision transformer accuracy but often conflate stronger second-order interactions with redundancy relative to the main branch; this work aims for auxiliary quadratic features that add only information the main representation has not already captured. Method: Orthogonal Quadratic Complements (OQC) construct a low-rank quadratic auxiliary branch and explicitly project it onto the orthogonal complement of the main branch; the paper also studies an efficient low-rank realization (OQC-LR) and static and dynamic gated extensions (OQC-static, OQC-dynamic). Result: Under a parameter-matched Deep-ViT and CIFAR-100 protocol, full OQC improves the AFBO baseline from 64.25 +/- 0.22 to 65.59 +/- 0.22; OQC-LR reaches 65.52 +/- 0.25 with a better speed-accuracy tradeoff; on TinyImageNet, OQC-dynamic reaches 51.88 +/- 0.32, 1.43 points above the baseline and better than all ungated variants; mechanism analyses show near-zero auxiliary-main overlap after projection along with improved representation geometry and class separation. Conclusion: The OQC family, including ungated and gated variants, generalizes consistently across both datasets, supporting the value of orthogonally complementary quadratic information. Abstract: Recent bilinear feed-forward replacements for vision transformers can substantially improve accuracy, but they often conflate two effects: stronger second-order interactions and increased redundancy relative to the main branch. We study a complementary design principle in which auxiliary quadratic features contribute only information not already captured by the dominant hidden representation. To this end, we propose Orthogonal Quadratic Complements (OQC), which construct a low-rank quadratic auxiliary branch and explicitly project it onto the orthogonal complement of the main branch before injection. We further study an efficient low-rank realization (OQC-LR) and gated extensions (OQC-static and OQC-dynamic). Under a parameter-matched Deep-ViT and CIFAR-100 protocol with a fixed penultimate residual readout, full OQC improves an AFBO baseline from 64.25 +/- 0.22 to 65.59 +/- 0.22, while OQC-LR reaches 65.52 +/- 0.25 with a substantially better speed-accuracy tradeoff. On TinyImageNet, the gated extension OQC-dynamic achieves 51.88 +/- 0.32, improving the baseline (50.45 +/- 0.21) by 1.43 points and outperforming all ungated variants. Mechanism analyses show near-zero post-projection auxiliary-main overlap together with improved representation geometry and class separation. The full family, including both ungated and gated variants, generalizes consistently across both datasets.
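The projection step described in the abstract can be sketched as follows. This is a minimal illustration that assumes the projection is applied per token against a single main-branch vector; the paper's actual formulation (batching, rank, where the result is injected) is not given in the abstract, and the function names and toy vectors are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def project_onto_complement(aux, main, eps=1e-12):
    """Remove from `aux` its component along `main`.

    Returns aux - (<aux, main> / <main, main>) * main, which is orthogonal
    to `main` up to floating-point error. `eps` guards against a zero vector.
    """
    scale = dot(aux, main) / (dot(main, main) + eps)
    return [a - scale * m for a, m in zip(aux, main)]

# Hypothetical per-token hidden vectors.
main = [1.0, 2.0, 2.0]   # main-branch representation
aux = [3.0, 0.0, 1.0]    # output of the quadratic auxiliary branch

aux_perp = project_onto_complement(aux, main)
overlap = dot(aux_perp, main)   # close to zero after projection
print(aux_perp, overlap)
```

The near-zero `overlap` mirrors the "near-zero post-projection auxiliary-main overlap" reported in the mechanism analysis, although the measurement used in the paper may differ.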

[153] Robust Fair Disease Diagnosis in CT Images

Justin Li,Daniel Ding,Asmita Yuki Pritha,Aryana Hou,Xin Wang,Shu Hu

Main category: cs.CV

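TL;DR: This paper proposes a two-level objective that combines logit-adjusted cross-entropy loss with Conditional Value at Risk (CVaR) aggregation to address the compound problem of class imbalance and group underrepresentation in medical image diagnosis, improving both performance and fairness on the Fair Disease Diagnosis benchmark.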

Details Motivation: In clinical data, class imbalance and group underrepresentation often coincide, producing compound failure modes that neither standard rebalancing nor fairness corrections can fix alone. Method: A two-level loss is proposed: at the sample level, logit-adjusted cross-entropy corrects class-frequency bias; at the group level, Conditional Value at Risk (CVaR) aggregation focuses on the sensitive group with the higher loss (for example, a particular sex group). The model is a 3D ResNet-18 pretrained on Kinetics-400, applied to four-class classification of chest CT volumes (adenocarcinoma, squamous cell carcinoma, COVID-19, normal), with patient sex labels used for fairness optimization. Result: On the Fair Disease Diagnosis benchmark, the model reaches a gender-averaged macro F1 of 0.8403 with a fairness gap of only 0.0239, a 13.3% improvement in score and a 78% reduction in group disparity over the baseline; ablations show both components are needed. Conclusion: Addressing only class balance or only group fairness is insufficient for the compound bias in clinical data; the two-level objective mitigates both together. Abstract: Automated diagnosis from chest CT has improved considerably with deep learning, but models trained on skewed datasets tend to perform unevenly across patient demographics. However, the situation is worse than simple demographic bias. In clinical data, class imbalance and group underrepresentation often coincide, creating compound failure modes that neither standard rebalancing nor fairness corrections can fix alone. We introduce a two-level objective that targets both axes of this problem. Logit-adjusted cross-entropy loss operates at the sample level, shifting decision margins by class frequency with provable consistency guarantees. Conditional Value at Risk aggregation operates at the group level, directing optimization pressure toward whichever demographic group currently has the higher loss. We evaluate on the Fair Disease Diagnosis benchmark using a 3D ResNet-18 pretrained on Kinetics-400, classifying CT volumes into Adenocarcinoma, Squamous Cell Carcinoma, COVID-19, and Normal groups with patient sex annotations. The training set illustrates the compound problem concretely: squamous cell carcinoma has 84 samples total, 5 of them female. The combined loss reaches a gender-averaged macro F1 of 0.8403 with a fairness gap of 0.0239, a 13.3% improvement in score and 78% reduction in demographic disparity over the baseline. Ablations show that each component alone falls short. The code is publicly available at https://github.com/Purdue-M2/Fair-Disease-Diagnosis.
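The two levels of the objective can be sketched in a few lines. This is a simplified illustration, not the released implementation: it assumes the common form of logit adjustment (adding `tau * log(prior)` to each logit before the softmax) and a discrete CVaR taken as the mean of the worst `alpha` fraction of group losses. The values of `tau`, `alpha`, the class priors, and the group losses are hypothetical.

```python
import math

def logit_adjusted_ce(logits, label, class_priors, tau=1.0):
    """Cross-entropy on logits shifted by tau * log(class prior).

    Rare classes get a negative shift, so the model must produce a larger
    raw margin to classify them correctly.
    """
    adjusted = [z + tau * math.log(p) for z, p in zip(logits, class_priors)]
    m = max(adjusted)  # subtract the max for a numerically stable log-sum-exp
    log_norm = m + math.log(sum(math.exp(a - m) for a in adjusted))
    return log_norm - adjusted[label]

def cvar_over_groups(group_losses, alpha=0.5):
    """Mean of the worst ceil(alpha * n) group losses (a discrete CVaR)."""
    k = max(1, math.ceil(alpha * len(group_losses)))
    worst = sorted(group_losses, reverse=True)[:k]
    return sum(worst) / k

# Two-class toy example with a 90/10 prior and uninformative logits.
priors = [0.9, 0.1]
rare_loss = logit_adjusted_ce([0.0, 0.0], label=1, class_priors=priors)
common_loss = logit_adjusted_ce([0.0, 0.0], label=0, class_priors=priors)

# Hypothetical mean losses for two demographic groups.
objective = cvar_over_groups([1.2, 0.4], alpha=0.5)
print(rare_loss, common_loss, objective)
```

With two groups and `alpha = 0.5`, the aggregate reduces to the loss of the worse-off group, which matches the abstract's description of directing optimization pressure toward whichever group currently has the higher loss.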

[154] Head-wise Modality Specialization within MLLMs for Robust Fake News Detection under Missing Modality

Kai Qian,Weijie Shi,Jiaqi Wang,Mengze Li,Hao Chen,Yue Cui,Hanghui Guo,Ziyi Liu,Jia Zhu,Jiajie Xu

Main category: cs.CV

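TL;DR: This paper proposes a multimodal fake news detection method that is robust to missing modalities. It applies head-wise modality specialization within multimodal large language models and combines it with a unimodal knowledge retention strategy to improve unimodal verification ability and robustness.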

Details Motivation: Real-world news dissemination often suffers from missing modalities caused by deleted images, corrupted screenshots, and similar issues, and existing MFND methods struggle to keep strong verification ability for each modality in this setting, mainly because the low-contribution modality is insufficiently learned and unimodal annotations are scarce. Method: A head-wise modality specialization mechanism is proposed: attention heads in MLLMs are first analyzed for their effect on performance under missing modality to identify modality-critical heads; these heads are then explicitly allocated to modalities and their specialization is maintained with lower-bound attention constraints; a Unimodal Knowledge Retention strategy prevents them from drifting away from knowledge learned under limited unimodal supervision. Result: Experiments show improved robustness under missing modality without harming performance on full multimodal input. Conclusion: Head-wise modality specialization together with unimodal knowledge retention improves the reliability of MFND models under missing modality and suggests a direction for robust multimodal learning with scarce annotations. Abstract: Multimodal fake news detection (MFND) aims to verify news credibility by jointly exploiting textual and visual evidence. However, real-world news dissemination frequently suffers from missing modality due to deleted images, corrupted screenshots, and similar issues. Thus, robust detection in this scenario requires preserving strong verification ability for each modality, which is challenging in MFND due to insufficient learning of the low-contribution modality and scarce unimodal annotations. To address this issue, we propose Head-wise Modality Specialization within Multimodal Large Language Models (MLLMs) for robust MFND under missing modality. Specifically, we first systematically study attention heads in MLLMs and their relationship with performance under missing modality, showing that modality-critical heads serve as key carriers of unimodal verification ability through their modality specialization. Based on this observation, to better preserve verification ability for the low-contribution modality, we introduce a head-wise specialization mechanism that explicitly allocates these heads to different modalities and preserves their specialization through lower-bound attention constraints. Furthermore, to better exploit scarce unimodal annotations, we propose a Unimodal Knowledge Retention strategy that prevents these heads from drifting away from the unimodal knowledge learned from limited supervision. Experiments show that our method improves robustness under missing modality while preserving performance with full multimodal input.

[155] LAST: Leveraging Tools as Hints to Enhance Spatial Reasoning for Multimodal Large Language Models

Shi-Yu Tian,Zhi Zhou,Kun-Yang Yu,Ming Yang,Yang Chen,Ziqiao Shang,Lan-Zhe Guo,Yu-Feng Li

Main category: cs.CV

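TL;DR: This paper proposes LAST, a unified framework for tool-augmented spatial reasoning built around the LAST-Box sandbox and a three-stage progressive training strategy, which substantially improves multimodal large language models on complex spatial tasks.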

Details Motivation: Multimodal large language models are prone to hallucination and imprecision when parsing complex geometric layouts; purely data-driven scaling struggles to internalize structured geometric priors and spatial constraints, while directly integrating specialized vision models is hindered by the difficulty of invoking heterogeneous tools and of using their low-level outputs (such as segmentation masks and depth maps) in high-level reasoning. Method: The LAST framework includes an extensible interactive sandbox, LAST-Box, which abstracts heterogeneous tool invocations into atomic instructions and reusable spatial skills and returns multimodal hints that LLMs can consume directly (such as annotated images and textual descriptions); a three-stage progressive training strategy guides the model from understanding tool outputs to adaptive tool invocation. Result: On four datasets, LAST-7B improves on its backbone by around 20% and outperforms several strong proprietary closed-source models. Conclusion: Through structured tool integration and progressive training, LAST narrows the gap between specialized vision models and large language models on spatial reasoning tasks. Abstract: Spatial reasoning is a cornerstone capability for intelligent systems to perceive and interact with the physical world. However, multimodal large language models (MLLMs) frequently suffer from hallucinations and imprecision when parsing complex geometric layouts. As data-driven scaling struggles to internalize structured geometric priors and spatial constraints, integrating mature, specialized vision models presents a compelling alternative. Despite its promise, applying this paradigm to spatial reasoning is hindered by two key challenges: the difficulty of invoking heterogeneous, parameter-rich tools, as well as the challenge of understanding and effectively leveraging their diverse low-level outputs (e.g., segmentation masks, depth maps) in high-level reasoning. To address these challenges, we propose LAST, a unified framework for tool-augmented spatial reasoning. LAST features an extensible interactive sandbox, termed LAST-Box, which abstracts heterogeneous tool invocations into atomic instructions and reusable spatial skills, returning multimodal hints (e.g., annotated images and textual descriptions) that can be directly consumed by LLMs. We further design a three-stage progressive training strategy that guides models from understanding tool outputs to proficient and adaptive tool invocation. Experiments on four datasets show that LAST-7B achieves around 20% performance gains over its backbone and outperforms strong proprietary closed-source LLMs, substantially enhancing reasoning on complex spatial tasks.

[156] Zero-Shot Synthetic-to-Real Handwritten Text Recognition via Task Analogies

Carlos Garrido-Munoz,Aniello Panariello,Silvia Cascianelli,Angelo Porrello,Simone Calderara,Jorge Calvo-Zaragoza,Rita Cucchiara

Main category: cs.CV

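TL;DR: This paper proposes a zero-shot synthetic-to-real transfer method for handwritten text recognition (HTR) that needs no real handwriting data in the target language. It learns how model parameters change from synthetic to real data in source languages and transfers that change to the target language, weighted by linguistic similarity, improving cross-lingual generalization.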

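Details Motivation: HTR models trained on synthetic data generalize poorly to real handwriting, and mainstream adaptation methods still require real samples from the target domain; this work addresses the fully zero-shot synthetic-to-real setting, with no real data from the target language. Method: The approach learns how model parameters change when moving from the synthetic to the real domain in source languages and transfers this parameter correction to the target language; with multiple sources, the corrections are combined with weights based on linguistic similarity. Result: Experiments across five languages and six architectures show consistent improvements over synthetic-only baselines, and the transferred corrections help even when the target language is unrelated to the sources. Conclusion: Parameter-level cross-lingual transfer of corrections is an effective strategy for zero-shot synthetic-to-real HTR generalization; linguistic similarity is a reasonable basis for combining multiple sources but is not a prerequisite. Abstract: Handwritten Text Recognition (HTR) models trained on synthetic handwriting often struggle to generalize to real text, and existing adaptation methods still require real samples from the target domain. In this work, we tackle the fully zero-shot synthetic-to-real generalization setting, where no real data from the target language is available. Our approach learns how model parameters change when moving from synthetic to real handwriting in one or more source languages and transfers this learned correction to new target languages. When using multiple sources, we rely on linguistic similarity to weigh their contribution when combining them. Experiments across five languages and six architectures show consistent improvements over synthetic-only baselines and reveal that the transferred corrections benefit even languages unrelated to the sources.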
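The idea of transferring a synthetic-to-real parameter correction can be sketched with flat parameter vectors. This is a rough illustration under stated assumptions: the correction is taken to be a plain parameter difference added to the target model, and the similarity scores are normalized to sum to one. The abstract does not specify either choice, and all names and numbers below are hypothetical.

```python
def correction(real_params, synthetic_params):
    """Parameter change observed when moving from synthetic to real training."""
    return [r - s for r, s in zip(real_params, synthetic_params)]

def transfer(target_synthetic, source_pairs, similarities):
    """Add a similarity-weighted sum of source corrections to the target model.

    source_pairs: list of (real_params, synthetic_params), one per source language.
    similarities: non-negative similarity of each source language to the target
                  (normalizing them to sum to 1 is an assumption of this sketch).
    """
    total = sum(similarities)
    weights = [s / total for s in similarities]
    adapted = list(target_synthetic)
    for w, (real, syn) in zip(weights, source_pairs):
        delta = correction(real, syn)
        adapted = [a + w * d for a, d in zip(adapted, delta)]
    return adapted

# Two hypothetical source languages with 2-parameter "models".
source_pairs = [
    ([1.0, 2.0], [0.0, 0.0]),   # correction = [1.0, 2.0]
    ([3.0, 0.0], [1.0, 0.0]),   # correction = [2.0, 0.0]
]
similarities = [3.0, 1.0]       # first source is closer to the target language
target_synthetic = [10.0, 10.0]

adapted = transfer(target_synthetic, source_pairs, similarities)
print(adapted)
```

No real target-language data enters the computation, which is the sense in which the setting is zero-shot.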

[157] MuPPet: Multi-person 2D-to-3D Pose Lifting

Thomas Markhorst,Zhi-Yi Lin,Jouh Yeong Chew,Jan van Gemert,Xucong Zhang

Main category: cs.CV

TL;DR: This paper proposes MuPPet, a framework that models inter-person correlations to improve 2D-to-3D pose lifting in multi-person group scenes.

Details Motivation: Existing 2D-to-3D pose lifting methods often neglect inter-person relationships or cannot handle groups of varying size, which limits their usefulness for understanding social interaction. Method: MuPPet comprises Person Encoding (to structure individual representations), Permutation Augmentation (to increase training diversity), and Dynamic Multi-Person Attention (to adaptively model correlations between individuals). Result: On group interaction datasets, MuPPet significantly outperforms existing single- and multi-person 2D-to-3D pose lifting methods and improves robustness under occlusion. Conclusion: Modeling inter-person correlations is important for accurate, socially aware 3D pose estimation. Abstract: Multi-person social interactions are inherently built on coherence and relationships among all individuals within the group, making multi-person localization and body pose estimation essential to understanding these social dynamics. One promising approach is 2D-to-3D pose lifting which provides a 3D human pose consisting of rich spatial details by building on the significant advances in 2D pose estimation. However, the existing 2D-to-3D pose lifting methods often neglect inter-person relationships or cannot handle varying group sizes, limiting their effectiveness in multi-person settings. We propose MuPPet, a novel multi-person 2D-to-3D pose lifting framework that explicitly models inter-person correlations. To leverage these inter-person dependencies, our approach introduces Person Encoding to structure individual representations, Permutation Augmentation to enhance training diversity, and Dynamic Multi-Person Attention to adaptively model correlations between individuals. Extensive experiments on group interaction datasets demonstrate MuPPet significantly outperforms state-of-the-art single- and multi-person 2D-to-3D pose lifting methods, and improves robustness in occlusion scenarios. Our findings highlight the importance of modeling inter-person correlations, paving the way for accurate and socially-aware 3D pose estimation. Our code is available at: https://github.com/Thomas-Markhorst/MuPPet

[158] Training Deep Visual Networks Beyond Loss and Accuracy Through a Dynamical Systems Approach

Hai La Quang,Hassan Ugail,Newton Howard,Cong Tran Tien,Nam Vu Hoai,Hung Nguyen Viet

Main category: cs.CV

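TL;DR: This paper proposes a framework for analyzing the training of deep visual models through dynamical systems theory. It defines three measures from layer activations (an integration score, a metastability score, and a dynamical stability index) and uses them to describe how internal representations evolve.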

Details Motivation: Conventional training evaluation relies only on loss and accuracy, which say little about how a model's internal representations evolve; this work offers a complementary view from a dynamical systems perspective. Method: Borrowing signal analysis ideas from neuroscience, three measures are defined from layer activations collected during training: an integration score (long-range coordination across layers), a metastability score (how flexibly the network shifts between synchronised states), and a combined dynamical stability index; they are evaluated empirically on nine model-dataset combinations. Result: Three patterns emerge: (1) the integration measure consistently distinguishes CIFAR-10 from the harder CIFAR-100 setting; (2) changes in the volatility of the stability index may precede the convergence of accuracy; (3) the relationship between integration and metastability appears to reflect different styles of training behaviour. Conclusion: The dynamical systems framework offers an exploratory but promising way to understand deep visual training beyond loss and accuracy. Abstract: Deep visual recognition models are usually trained and evaluated using metrics such as loss and accuracy. While these measures show whether a model is improving, they reveal very little about how its internal representations change during training. This paper introduces a complementary way to study that process by examining training through the lens of dynamical systems. Drawing on ideas from signal analysis originally used to study biological neural activity, we define three measures from layer activations collected across training epochs: an integration score that reflects long-range coordination across layers, a metastability score that captures how flexibly the network shifts between more and less synchronised states, and a combined dynamical stability index. We apply this framework to nine combinations of model architecture and dataset, including several ResNet variants, DenseNet-121, MobileNetV2, VGG-16, and a pretrained Vision Transformer on CIFAR-10 and CIFAR-100. The results suggest three main patterns. First, the integration measure consistently distinguishes the easier CIFAR-10 setting from the more difficult CIFAR-100 setting. Second, changes in the volatility of the stability index may provide an early sign of convergence before accuracy fully plateaus. Third, the relationship between integration and metastability appears to reflect different styles of training behaviour. Overall, this study offers an exploratory but promising new way to understand deep visual training beyond loss and accuracy.

[159] Multi-Head Attention based interaction-aware architecture for Bangla Handwritten Character Recognition: Introducing a Primary Dataset

Mirza Raquib,Asif Pervez Polok,Kedar Nath Biswas,Farida Siddiqi Prity,Saydul Akbar Murad,Nick Rahimi

Main category: cs.CV

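TL;DR: This paper builds a new balanced Bangla handwritten character dataset (78 classes, about 650 samples per class) and proposes an interaction-aware hybrid deep learning architecture that combines EfficientNetB3, Vision Transformer, and Conformer modules through multi-head cross-attention fusion, reaching 98.84% accuracy on the new dataset and 96.49% on the CHBCR benchmark.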

Details Motivation: Bangla handwritten character recognition is hampered by diverse writing styles, inconsistent strokes, high visual similarity between characters, limited intra-class variation and uneven class distribution in available datasets, and limited dataset size. Method: A new balanced dataset is built covering basic characters, composite (Juktobarno) characters, and numerals (78 classes, about 650 samples each, from writers across age groups and socioeconomic backgrounds, including both left- and right-handed writers); a parallel interaction-aware hybrid model combining EfficientNetB3, Vision Transformer, and Conformer is proposed, with a multi-head cross-attention mechanism for feature fusion across modules; Grad-CAM is used for interpretability. Result: Accuracy reaches 98.84% on the new dataset and 96.49% on the external CHBCR benchmark, indicating strong generalization; Grad-CAM visualizations highlight discriminative regions; the dataset and source code are publicly released. Conclusion: The new dataset and the interaction-aware hybrid architecture improve Bangla handwritten character recognition accuracy and robustness, and provide a resource and method for OCR research on low-resource languages. Abstract: Character recognition is the fundamental part of an optical character recognition (OCR) system. Word recognition, sentence transcription, document digitization, and language processing are some of the higher-order activities that can be done accurately through character recognition. Nonetheless, recognizing handwritten Bangla characters is not an easy task because they are written in different styles with inconsistent stroke patterns and a high degree of visual character resemblance. The datasets available are usually limited in intra-class variation and uneven in class distribution. We have constructed a new balanced dataset of Bangla handwritten characters to overcome those problems. This consists of 78 classes and each class has approximately 650 samples. It contains the basic characters, composite (Juktobarno) characters and numerals. The samples were collected from a diverse group spanning a wide age range and multiple socioeconomic groups. Contributors include elementary and high school students, university students, and professionals. The writers include both right- and left-handed individuals. We have further proposed an interaction-aware hybrid deep learning architecture that integrates EfficientNetB3, Vision Transformer, and Conformer modules in parallel. A multi-head cross-attention fusion mechanism enables effective feature interaction across these components. The proposed model achieves 98.84% accuracy on the constructed dataset and 96.49% on the external CHBCR benchmark, demonstrating strong generalization capability.
Grad-CAM visualizations further provide interpretability by highlighting discriminative regions. The dataset and source code of this research are publicly available at: https://huggingface.co/MIRZARAQUIB/Bangla_Handwritten_Character_Recognition.

[160] Data-Driven Automated Identification of Optimal Feature-Representative Images in Infrared Thermography Using Statistical and Morphological Metrics

Harutyun Yagdjian,Martin Gurka

Main category: cs.CV

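TL;DR: This paper proposes a data-driven method, requiring no prior spatial information, that uses three complementary metrics (HI, REA, TVE) to automatically select the images in an infrared thermography sequence that best represent defects, improving the robustness and reliability of unsupervised defect detection.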

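Details Motivation: In existing infrared thermography post-processing methods, defect visibility varies strongly across time, frequency, or coefficient domains, and conventional metrics (such as SNR or the Tanimoto criterion) depend on known defect locations or defect-free reference regions, which makes them unsuitable for automated, unsupervised analysis. Method: A framework based on three complementary data-driven metrics is proposed: (1) the Homogeneity Index of Mixture (HI), which measures how far local intensity distributions deviate from a global reference distribution; (2) a Representative Elementary Area (REA), a Minkowski-functional adaptation of the Representative Elementary Volume concept to two-dimensional images; and (3) a geometrical-topological Total Variation Energy (TVE) index, also based on two-dimensional Minkowski functionals, designed to increase sensitivity to localized anomalies. Result: The method is validated on pulse-heated IRT data from a CFRP plate with six artificial defects and on one-dimensional N-layer thermal model simulations, yielding robust and unbiased ranking of image sequences and a reliable basis for automated defect-oriented image selection. Conclusion: The three-metric framework removes the dependence on prior knowledge and improves the automation and unsupervised capability of defect identification in infrared thermography, with practical engineering value. Abstract: Infrared thermography (IRT) is a widely used non-destructive testing technique for detecting structural features such as subsurface defects. However, most IRT post-processing methods generate image sequences in which defect visibility varies strongly across time, frequency, or coefficient/index domains, making the identification of defect-representative images a critical challenge. Conventional evaluation metrics, such as the signal-to-noise ratio (SNR) or the Tanimoto criterion, often require prior knowledge of defect locations or defect-free reference regions, limiting their suitability for automated and unsupervised analysis. In this work, a data-driven methodology is proposed to identify images within IRT datasets that are most likely to contain and represent structural features, particularly anomalies and defects, without requiring prior spatial information. The approach is based on three complementary metrics: the Homogeneity Index of Mixture (HI), which quantifies statistical heterogeneity via deviations of local intensity distributions from a global reference distribution; a Representative Elementary Area (REA), derived from a Minkowski-functional adaptation of the Representative Elementary Volume concept to two-dimensional images; and a geometrical-topological Total Variation Energy (TVE) index, also based on two-dimensional Minkowski functionals, designed to improve sensitivity to localized anomalies.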
The framework is validated experimentally using pulse-heated IRT data from a carbon fiber-reinforced polymer (CFRP) plate containing six artificial defects at depths between 0.135 mm and 0.810 mm, and is further supported by one-dimensional N-layer thermal model simulations. The results demonstrate robust and unbiased ranking of image sequences and provide a reliable basis for automated defect-oriented image selection in IRT.

[161] LOLGORITHM: Funny Comment Generation Agent For Short Videos

Xuan Ouyang,Senan Wang,Bouzhou Wang,Siyuan Xiahou,Jinrong Zhou,Yuekang Li

Main category: cs.CV

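TL;DR: This paper proposes LOLGORITHM, a modular multi-agent framework that generates short-video comments conforming to platform-specific cultural and linguistic norms. It supports six controllable styles and achieves high human preference selection rates on YouTube and Douyin.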

Details Motivation: Existing approaches (such as video summarization and danmaku generation) cannot produce authentic comments that conform to platform-specific cultural and linguistic norms. Method: The LOLGORITHM framework comprises three core modules (video content summarization, video classification, and comment generation with semantic retrieval and hot meme augmentation), and a bilingual dataset covering YouTube and Douyin is constructed. Result: In automatic evaluation and large-scale human preference analysis, LOLGORITHM outperforms baseline methods, with human preference selection rates of 80.46% on YouTube and 84.29% on Douyin; ablations indicate that the gains come from the framework architecture rather than the choice of backbone LLM. Conclusion: LOLGORITHM is a robust and generalizable approach to short-video comment generation whose modular design improves comment authenticity and platform fit. Abstract: Short-form video platforms have become central to multimedia information dissemination, where comments play a critical role in driving engagement, propagation, and algorithmic feedback. However, existing approaches (including video summarization and live-streaming danmaku generation) fail to produce authentic comments that conform to platform-specific cultural and linguistic norms. In this paper, we present LOLGORITHM, a novel modular multi-agent framework for stylized short-form video comment generation. LOLGORITHM supports six controllable comment styles and comprises three core modules: video content summarization, video classification, and comment generation with semantic retrieval and hot meme augmentation. We further construct a bilingual dataset of 3,267 videos and 16,335 comments spanning five high-engagement categories across YouTube and Douyin. Evaluation combining automatic scoring and large-scale human preference analysis demonstrates that LOLGORITHM consistently outperforms baseline methods, achieving human preference selection rates of 80.46% on YouTube and 84.29% on Douyin across 107 respondents. Ablation studies confirm that these gains are attributable to the framework architecture rather than the choice of backbone LLM, underscoring the robustness and generalizability of our approach.

[162] Multi-Frequency Local Plasticity for Visual Representation Learning

Mehdi Fatan Serj,C. Alejandro Parraga,Xavier Otazu

Main category: cs.CV

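TL;DR: This paper proposes a modular hierarchical visual recognition framework that does not rely on end-to-end backpropagation. It combines multi-frequency Gabor decomposition, local Hebbian/Oja learning, Hopfield-style associative memory, and iterative top-down modulation, reaching 80.1% accuracy on CIFAR-10 and showing that strong structural priors can recover much of the performance lost without global gradient training.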

Details Motivation: To examine how far structured architectural priors can compensate for the absence of end-to-end gradient-based representation learning in visual recognition. Method: A modular hierarchical framework in the VisNet tradition is built from (i) a fixed multi-frequency Gabor decomposition into 7 parallel streams; (ii) within-stream competitive learning with Hebbian/Oja updates and anti-Hebbian decorrelation; (iii) an associative memory module inspired by modern Hopfield retrieval; and (iv) iterative top-down modulation using local prediction and reconstruction signals; only the final linear readout and the top-down projection matrices are optimized by gradient descent. Result: Accuracy on CIFAR-10 is 80.1% (+/- 0.3%), well above a Hebbian-only baseline (71.0%) and slightly below a fully gradient-trained model on the same Gabor basis (83.4%); CIFAR-100 accuracy is 54.8%; factorial analysis shows that multi-frequency streams, associative memory, and top-down feedback contribute largely additively, with a significant Streams x TopDown interaction (p=0.02). Conclusion: Carefully chosen architectural priors can recover most of the performance of global gradient training, although a measurable gap remains; the hybrid locally trained approach has been validated only on CIFAR-10/100. Abstract: We study how far structured architectural bias can compensate for the absence of end-to-end gradient-based representation learning in visual recognition. Building on the VisNet tradition, we introduce a modular hierarchical framework combining: (i) fixed multi-frequency Gabor decomposition into F=7 parallel streams; (ii) within-stream competitive learning with Hebbian and Oja updates and anti-Hebbian decorrelation; (iii) an associative memory module inspired by modern Hopfield retrieval; and (iv) iterative top-down modulation using local prediction and reconstruction signals. Representational layers are trained without end-to-end backpropagation through the full hierarchy; only the final linear readout and top-down projection matrices are optimized by gradient descent. We therefore interpret the model as a hybrid system that is predominantly locally trained but includes a small number of gradient-trained parameters. On CIFAR-10, the full model reaches 80.1% +/- 0.3% (top-1 accuracy, linear probe), compared with 71.0% for a Hebbian-only baseline and 83.4% for a gradient-trained model on the same fixed Gabor basis. On CIFAR-100, performance is 54.8%. Factorial analysis indicates that multi-frequency streams, associative memory, and top-down feedback contribute largely additively, with a significant Streams x TopDown interaction (p=0.02).
These results suggest that carefully chosen architectural priors can recover a substantial fraction of the performance typically associated with global gradient training, while leaving a measurable residual gap. Experiments are limited to CIFAR-10/100.
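The local learning rule named in the abstract can be illustrated with the classic single-neuron Oja update, $\Delta w = \eta\, y\,(x - y\,w)$ with $y = w \cdot x$. This is a textbook sketch, not the paper's within-stream competitive scheme: the toy data, learning rate, and epoch count are assumptions chosen so that the behaviour is easy to follow.

```python
import math

def oja_step(w, x, lr):
    """One Oja update: w <- w + lr * y * (x - y * w), where y = w . x.

    The -y^2 * w term keeps the weight norm bounded, unlike plain Hebbian
    learning (w <- w + lr * y * x), which grows without limit.
    """
    y = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + lr * y * (xi - y * wi) for wi, xi in zip(w, x)]

# Toy inputs: most variance lies along the direction [1, 1],
# with a small amount along [1, -1].
samples = [[1.0, 1.0], [-1.0, -1.0], [0.1, -0.1], [-0.1, 0.1]]

w = [1.0, 0.0]                # arbitrary starting weights
for _ in range(200):          # epochs over the toy data
    for x in samples:
        w = oja_step(w, x, lr=0.05)

norm = math.sqrt(sum(wi * wi for wi in w))
print(w, norm)
```

The update uses only the neuron's own input and output, so no error signal has to be propagated from later layers. In this toy setting the weights settle on the unit-length leading principal direction of the data.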

[163] See Fair, Speak Truth: Equitable Attention Improves Grounding and Reduces Hallucination in Vision-Language Alignment

Mohammad Anas Azeez,Ankan Deria,Zohaib Hasan Siddiqui,Adinath Madhavrao Dukre,Rafiq Ali,Sara Atito,Yutong Xie,Imran Razzak

Main category: cs.CV

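TL;DR: This paper proposes DOP-OBC, a training-free, architecture-agnostic decoding strategy that balances visual attention allocation (suppressing dominant objects and boosting rare ones) to reduce object hallucination in multimodal large language models, validated on multiple benchmarks.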

Details Motivation: MLLMs often hallucinate objects because attention during decoding concentrates on visually salient or frequent content; the authors identify unequal attention allocation (especially neglect of rare, small, or contextually peripheral objects) as a root cause and argue that every object in an image deserves equal representational opportunity. Method: The DOP-OBC decoding strategy has two parts: (1) a Dominant Object Penalty (DOP) that softly suppresses attention over-concentration on visually dominant regions; and (2) an Outlier Boost Coefficient (OBC) that amplifies attention toward rare but confidently detected objects; both are injected as per-row logit modulations within the causal attention mask, requiring no weight updates and preserving autoregressive decoding. Result: Object hallucination is consistently reduced on the CHAIR and POPE benchmarks; GPT-4o assessment shows improved image and video captioning quality across correctness, consistency, detail, context, and temporal dimensions. Conclusion: Fairness in attention allocation is a practical and effective route to more faithful multimodal generation, not only a design principle; DOP-OBC is training-free, architecture-agnostic, and plug-and-play. Abstract: Multimodal large language models (MLLMs) frequently hallucinate objects that are absent from the visual input, often because attention during decoding is disproportionately drawn to visually dominant or frequently occurring content. We observe that this inequity in attention allocation is a root cause of object hallucination: when rare, small, or contextually peripheral objects receive insufficient attention, the model fails to ground its generation in the full visual scene. We argue that every object in an image, regardless of its size, frequency or visual salience, deserves equal representational opportunity during decoding. To this end, we propose DOP-OBC, a training-free and architecture-agnostic decoding strategy built on the principle of equitable attention. Two complementary object-aware signals work in tandem: a Dominant Object Penalty (DOP) that softly suppresses attention over-concentration on visually dominant regions, and an Outlier Boost Coefficient (OBC) that amplifies attention toward rare yet confidently detected objects. These signals are injected as per-row logit modulations within the causal attention mask, requiring no weight updates and preserving autoregressive decoding properties. Extensive experiments across image and video MLLMs demonstrate consistent reductions in object hallucination on CHAIR and POPE benchmarks, alongside improvements in GPT-4o assessed captioning quality across correctness, consistency, detail, context and temporal dimensions.
DOP-OBC establishes that fairness in attention allocation is not merely a design principle but a practical and effective path toward more faithful multimodal generation.

[164] MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering

Suyang Xi,Songtao Hu,Yuxiang Lai,Wangyun Dan,Yaqi Liu,Shansong Wang,Xiaofeng Yang

Main category: cs.CV

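TL;DR: This paper proposes MedLVR, a framework that introduces an explicit latent visual evidence state into autoregressive decoding so that key visual information in medical images is iteratively preserved and refined, improving the accuracy and reliability of medical visual question answering (VQA).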

Details Motivation: Reasoning in existing medical vision-language models (VLMs) is text-centric: images are encoded once as static context, which makes it hard to preserve the subtle, localized visual evidence needed for clinical diagnosis and limits reliability in real medical settings. Method: MedLVR inserts a short latent reasoning segment into the decoder, reusing hidden states as continuous latent steps to iteratively preserve and refine visual evidence; training has two stages: ROI-supervised fine-tuning aligns latent states with clinically relevant image regions, and Visual-Latent Policy Optimization (VLPO) jointly optimizes latent reasoning and answer generation under outcome-level rewards. Result: On OmniMedVQA and five external medical VQA benchmarks, MedLVR consistently outperforms recent reasoning baselines and raises the average score of the Qwen2.5-VL-7B backbone from 48.3% to 53.4%. Conclusion: Latent visual reasoning is an effective mechanism for preserving diagnostically relevant visual evidence and improves the accuracy and clinical trustworthiness of medical VQA. Abstract: Medical vision-language models (VLMs) have shown strong potential for medical visual question answering (VQA), yet their reasoning remains largely text-centric: images are encoded once as static context, and subsequent inference is dominated by language. This paradigm is fundamentally limited in clinical scenarios, where accurate answers often depend on subtle, localized visual evidence that cannot be reliably preserved in static embeddings. We propose MedLVR, a latent visual reasoning framework that introduces an explicit visual evidence state into autoregressive decoding. Instead of relying solely on text-based intermediate reasoning, MedLVR interleaves a short latent reasoning segment within the decoder by reusing hidden states as continuous latent steps, enabling iterative preservation and refinement of query-relevant visual evidence before answer generation. To support effective visual supervision, we adopt a two-stage training strategy: region of interest (ROI)-supervised fine-tuning aligns latent states with clinically relevant image evidence, and Visual-Latent Policy Optimization (VLPO) further optimizes latent reasoning and answer generation under outcome-level rewards. Experiments on OmniMedVQA and five external medical VQA benchmarks show that MedLVR consistently outperforms recent reasoning baselines and improves the average score over the Qwen2.5-VL-7B backbone from 48.3% to 53.4%.
These results show that latent visual reasoning provides an effective mechanism for preserving diagnostically relevant visual evidence and improving the reliability of medical VQA.

[165] Text-Guided 6D Object Pose Rearrangement via Closed-Loop VLM Agents

Sangwon Baik,Gunhee Kim,Mingi Choi,Hanbyul Joo

Main category: cs.CV

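TL;DR: This paper proposes fine-tuning-free inference-time techniques (multi-view reasoning, object-centered coordinate system visualization, and single-axis rotation prediction) that, within a closed-loop interaction, improve a vision-language model's (VLM's) ability to predict the text-guided 6D pose of a target object in a 3D scene, and validates their effectiveness for robot manipulation.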

Details Motivation: Vision-language models (VLMs) have strong visual reasoning abilities but fall short on 3D understanding, in particular on inferring a text-consistent 6D pose. Method: A closed-loop inference procedure: observe the scene (an RGB-D image or a set of 3D meshes), evaluate whether it satisfies the text instruction, propose a pose update for the target object, apply the update, and render the new scene. Three inference-time techniques are introduced: multi-view reasoning with supporting view selection, object-centered coordinate system visualization, and single-axis rotation prediction. Result: Without additional fine-tuning or new modules, the approach surpasses prior methods and works with both closed-source and open-source VLMs; combined with simple robot motion planning it yields a higher manipulation success rate; an ablation study shows that all three techniques are necessary. Conclusion: With carefully designed inference-time strategies, VLMs can effectively solve 3D pose reasoning tasks, which highlights the potential of the "reasoning as agent" paradigm and offers a practical path toward VLM-driven embodied intelligence. Abstract: Vision-Language Models (VLMs) exhibit strong visual reasoning capabilities, yet they still struggle with 3D understanding. In particular, VLMs often fail to infer a text-consistent goal 6D pose of a target object in a 3D scene. However, we find that with some inference-time techniques and iterative reasoning, VLMs can achieve dramatic performance gains. Concretely, given a 3D scene represented by an RGB-D image (or a compositional scene of 3D meshes) and a text instruction specifying a desired state change, we repeat the following loop: observe the current scene; evaluate whether it is faithful to the instruction; propose a pose update for the target object; apply the update; and render the updated scene. Through this closed-loop interaction, the VLM effectively acts as an agent. We further introduce three inference-time techniques that are essential to this closed-loop process: (i) multi-view reasoning with supporting view selection, (ii) object-centered coordinate system visualization, and (iii) single-axis rotation prediction. Without any additional fine-tuning or new modules, our approach surpasses prior methods at predicting the text-guided goal 6D pose of the target object. It works consistently across both closed-source and open-source VLMs. Moreover, when combining our 6D pose prediction with simple robot motion planning, it enables more successful robot manipulation than existing methods. Finally, we conduct an ablation study to demonstrate the necessity of each proposed technique.
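
The observe, evaluate, propose, apply, render loop described in the abstract can be sketched in a few lines. This is only an illustrative toy, not the authors' implementation: the pose is reduced to three rotation angles, the VLM is replaced by a hypothetical stub (`make_stub_judge`) that knows the target, and names such as `Step`, `closed_loop`, and `max_step` are assumptions made for this example. The stub proposes a rotation about one axis per iteration, mirroring the single-axis rotation idea.

```python
from dataclasses import dataclass


@dataclass
class Step:
    done: bool    # True when the judge considers the instruction satisfied
    axis: int     # single axis to rotate about (0, 1 or 2)
    delta: float  # proposed rotation in degrees


def make_stub_judge(target, max_step=30.0, tol=1e-6):
    """Hypothetical stand-in for the VLM: it sees the pose and proposes one single-axis update."""
    def judge(pose):
        errors = [t - p for t, p in zip(target, pose)]
        axis = max(range(3), key=lambda i: abs(errors[i]))  # axis with the largest error
        if abs(errors[axis]) <= tol:
            return Step(True, axis, 0.0)
        delta = max(-max_step, min(max_step, errors[axis]))  # clamp the step size
        return Step(False, axis, delta)
    return judge


def closed_loop(pose, judge, max_iters=10):
    """Repeat: observe and evaluate (judge), propose an update, apply it."""
    history = [pose]
    for _ in range(max_iters):
        step = judge(pose)
        if step.done:
            break
        updated = list(pose)
        updated[step.axis] += step.delta
        pose = tuple(updated)  # a real system would re-render the scene here
        history.append(pose)
    return pose, history
```

In a real system the judge would be a VLM call on rendered views of the scene, and the loop would stop either when the model reports success or when the iteration budget runs out.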

[166] Biomarker-Based Pretraining for Chagas Disease Screening in Electrocardiograms

Elias Stenhede,Arian Ranjbar

Main category: cs.CV

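TL;DR: This paper proposes a biomarker-based pretraining approach: an ECG feature extractor is pretrained to predict blood biomarkers on the MIMIC-IV-ECG dataset and then fine-tuned on Brazilian datasets for Chagas disease detection, placing 5th in the 2025 PhysioNet Challenge.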

Details Motivation: Labels in existing datasets for ECG-based Chagas disease screening are scarce and noisy, which limits model performance. Method: Pretrain the ECG feature extractor by predicting percentile-binned blood biomarkers from MIMIC-IV-ECG, then fine-tune on Brazilian Chagas-labeled data and build a 5-model ensemble. Result: A challenge score of 0.269 on the hidden test set of the PhysioNet Challenge 2025, ranking 5th. Conclusion: Pretraining on clinically interpretable biomarkers can effectively improve disease detection on small, low-quality medical datasets. Abstract: Chagas disease screening via ECGs is limited by scarce and noisy labels in existing datasets. We propose a biomarker-based pretraining approach, where an ECG feature extractor is first trained to predict percentile-binned blood biomarkers from the MIMIC-IV-ECG dataset. The pretrained model is then fine-tuned on Brazilian datasets for Chagas detection. Our 5-model ensemble, developed by the Ahus AIM team, achieved a challenge score of 0.269 on the hidden test set, ranking 5th in Detection of Chagas Disease from the ECG: The George B. Moody PhysioNet Challenge 2025. Source code and the model are shared on GitHub: github.com/Ahus-AIM/physionet-challenge-2025
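
To make "percentile-binned" targets concrete, here is a minimal sketch of turning a continuous biomarker value into a bin index. The abstract does not state the number of bins or how bin edges are computed, so `n_bins`, the nearest-rank edge rule, and the function names are assumptions for illustration only.

```python
from bisect import bisect_right


def percentile_edges(values, n_bins):
    """Return n_bins - 1 cut points taken at evenly spaced ranks of the sorted values."""
    ordered = sorted(values)
    edges = []
    for k in range(1, n_bins):
        idx = min(len(ordered) - 1, (k * len(ordered)) // n_bins)
        edges.append(ordered[idx])
    return edges


def to_bin(value, edges):
    """Map a raw biomarker value to its bin index in [0, len(edges)]."""
    return bisect_right(edges, value)
```

Binning by percentile gives roughly equally populated target classes, so biomarkers with very different units and skewed distributions can share one prediction head design.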

[167] RobustMedSAM: Degradation-Resilient Medical Image Segmentation via Robust Foundation Model Adaptation

Jieru Li,Matthew Chen,Micky C. Nnamdi,J. Ben Tamo,Benoit L. Marteau,May D. Wang

Main category: cs.CV

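TL;DR: This paper proposes RobustMedSAM, which fuses MedSAM (medical priors) and RobustSAM (corruption robustness) module by module and fine-tunes only the mask decoder, substantially improving segmentation of degraded images across 35 medical datasets (Dice +0.106).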

Details Motivation: Existing methods address either medical-domain adaptation or robustness to image degradation, but not both. In SAM, medical priors and corruption robustness are concentrated in the image encoder and the mask decoder respectively, so the two need to be combined. Method: Module-wise checkpoint fusion: the image encoder is initialized from MedSAM and the mask decoder from RobustSAM (both ViT-B). Only the mask decoder is fine-tuned on 35 medical datasets from MedSegBench, with everything else frozen; an SVD-based parameter-efficient variant is also explored for limited encoder adaptation. Result: On in-distribution and out-of-distribution benchmarks, RobustMedSAM raises degraded-image Dice from 0.613 to 0.719 (+0.106) over SAM. Conclusion: Structured fusion of complementary pretrained models is an effective and practical approach to robust medical image segmentation. Abstract: Medical image segmentation models built on Segment Anything Model (SAM) achieve strong performance on clean benchmarks, yet their reliability often degrades under realistic image corruptions such as noise, blur, motion artifacts, and modality-specific distortions. Existing approaches address either medical-domain adaptation or corruption robustness, but not both jointly. In SAM, we find that these capabilities are concentrated in complementary modules: the image encoder preserves medical priors, while the mask decoder governs corruption robustness. Motivated by this observation, we propose RobustMedSAM, which adopts module-wise checkpoint fusion by initializing the image encoder from MedSAM and the mask decoder from RobustSAM under a shared ViT-B architecture. We then fine-tune only the mask decoder on 35 medical datasets from MedSegBench, spanning six imaging modalities and 12 corruption types, while freezing the remaining components to preserve pretrained medical representations. We additionally investigate an SVD-based parameter-efficient variant for limited encoder adaptation. Experiments on both in-distribution and out-of-distribution benchmarks show that RobustMedSAM improves degraded-image Dice from 0.613 to 0.719 (+0.106) over SAM, demonstrating that structured fusion of complementary pretrained models is an effective and practical approach for robust medical image segmentation.
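
Module-wise checkpoint fusion, as described above, amounts to assembling one parameter dictionary from two checkpoints that share an architecture. The sketch below uses plain dictionaries in place of real checkpoints. The key prefixes `image_encoder.` and `mask_decoder.` follow the module names used in the public SAM code, but whether the MedSAM and RobustSAM checkpoints use exactly these keys is an assumption, as is taking all remaining modules (such as the prompt encoder) from MedSAM.

```python
def fuse_checkpoints(medsam_state, robustsam_state, decoder_prefix="mask_decoder."):
    """Take the mask decoder from RobustSAM and every other module from MedSAM."""
    fused = {}
    for name, tensor in medsam_state.items():
        if not name.startswith(decoder_prefix):
            fused[name] = tensor  # image encoder (and, by assumption, prompt encoder)
    for name, tensor in robustsam_state.items():
        if name.startswith(decoder_prefix):
            fused[name] = tensor  # corruption-robust mask decoder
    return fused


def trainable_names(fused, decoder_prefix="mask_decoder."):
    """Only mask-decoder parameters are fine-tuned; everything else stays frozen."""
    return sorted(name for name in fused if name.startswith(decoder_prefix))
```

With real checkpoints the same selection would be applied to the loaded state dictionaries before loading them into the shared ViT-B model, and gradients would be enabled only for the names returned by `trainable_names`.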

[168] ACCIDENT: A Benchmark Dataset for Vehicle Accident Detection from Traffic Surveillance Videos

Lukas Picek,Michal Čermák,Marek Hanzl,Vojtěch Čermák

Main category: cs.CV

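TL;DR: This paper introduces ACCIDENT, a benchmark dataset for traffic accident detection in CCTV surveillance video that supports supervised (IID/OOD) and zero-shot settings and includes real and synthetic clips together with multi-task evaluation metrics.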

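Details Motivation: Existing traffic accident detection datasets fall short in realism, diversity, and evaluation coverage; they struggle to span both data-rich and data-scarce scenarios and lack evaluation criteria that account for the ambiguity and uncertainty inherent in CCTV footage. Method: Build a dataset of 2,027 real and 2,211 synthetic accident clips annotated with accident time, spatial location, and collision type; define three core tasks (temporal localization, spatial localization, and collision type classification) with custom metrics suited to CCTV; provide heuristic, motion-aware, and vision-language baselines. Result: Experiments show that ACCIDENT is challenging: existing methods achieve limited performance on every task, which supports the benchmark's value for driving robust, practical accident detection models. Conclusion: ACCIDENT is a new accident detection benchmark for realistic CCTV scenarios that balances realism and controllability, fills the gap in data-scarce settings and fine-grained multi-task evaluation, and gives later work a unified, reproducible evaluation platform. Abstract: We introduce ACCIDENT, a benchmark dataset for traffic accident detection in CCTV footage, designed to evaluate models in supervised (IID and OOD) and zero-shot settings, reflecting both data-rich and data-scarce scenarios. The benchmark consists of a curated set of 2,027 real and 2,211 synthetic clips annotated with the accident time, spatial location, and high-level collision type. We define three core tasks: (i) temporal localization of the accident, (ii) its spatial localization, and (iii) collision type classification. Each task is evaluated using custom metrics that account for the uncertainty and ambiguity inherent in CCTV footage. In addition to the benchmark, we provide a diverse set of baselines, including heuristic, motion-aware, and vision-language approaches, and show that ACCIDENT is challenging. ACCIDENT is available at: https://accidentbench.github.io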

[169] F3G-Avatar : Face Focused Full-body Gaussian Avatar

Willem Menu,Erkut Akdag,Pedro Quesado,Yasaman Kashefbahrami,Egor Bondarev

Main category: cs.CV

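TL;DR: This paper proposes F3G-Avatar, a face-focused full-body Gaussian avatar synthesis method. A two-branch architecture (body plus face deformation) improves modeling of facial geometry and expression detail, and the MHR template, LBS pose control, and differentiable Gaussian splatting together give high-quality, animatable full-body rendering.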

Details Motivation: Existing full-body Gaussian avatar methods achieve good global reconstruction quality, but limited facial representational capacity makes it hard to preserve fine facial geometry and expression detail, especially high-frequency pose-dependent deformations. Method: Given multi-view RGB video and regressed pose/shape parameters, start from a clothed Momentum Human Rig (MHR) template and render front/back positional maps. A two-branch network decodes them into 3D Gaussians: the body branch models pose-dependent non-rigid deformation, and the face-focused branch refines head geometry and appearance. The Gaussians are fused, posed with linear blend skinning (LBS), and rendered with differentiable Gaussian splatting. Training combines reconstruction and perceptual losses with a face-specific adversarial loss. Result: On the AvatarReX dataset, face views reach PSNR/SSIM/LPIPS of 26.243/0.964/0.084; ablations confirm the contribution of the MHR template and the face deformation branch. Conclusion: F3G-Avatar offers a practical, high-quality pipeline for animatable full-body avatar synthesis that balances facial detail against overall animation quality. Abstract: Existing full-body Gaussian avatar methods primarily optimize global reconstruction quality and often fail to preserve fine-grained facial geometry and expression details. This challenge arises from limited facial representational capacity that causes difficulties in modeling high-frequency pose-dependent deformations. To address this, we propose F3G-Avatar, a full-body, face-aware avatar synthesis method that reconstructs animatable human representations from multi-view RGB video and regressed pose/shape parameters. Starting from a clothed Momentum Human Rig (MHR) template, front/back positional maps are rendered and decoded into 3D Gaussians through a two-branch architecture: a body branch that captures pose-dependent non-rigid deformations and a face-focused deformation branch that refines head geometry and appearance. The predicted Gaussians are fused, posed with linear blend skinning (LBS), and rendered with differentiable Gaussian splatting. Training combines reconstruction and perceptual objectives with a face-specific adversarial loss to enhance realism in close-up views. Experiments demonstrate strong rendering quality, with face-view performance reaching PSNR/SSIM/LPIPS of 26.243/0.964/0.084 on the AvatarReX dataset. Ablations further highlight contributions of the MHR template and the face-focused deformation. F3G-Avatar provides a practical, high-quality pipeline for realistic, animatable full-body avatar synthesis.
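
For readers unfamiliar with linear blend skinning, the standard formulation is shown here for reference. It is the textbook definition applied to a Gaussian center $\boldsymbol{\mu}_i$, not a statement of how F3G-Avatar handles covariances or skinning weights, which the abstract does not specify. The symbols $K$ (number of joints), $w_{ik}$ (skinning weights), and $(\mathbf{R}_k, \mathbf{t}_k)$ (per-joint rigid transforms) are notation introduced for this note.

$$
\boldsymbol{\mu}_i' = \sum_{k=1}^{K} w_{ik}\,\bigl(\mathbf{R}_k\,\boldsymbol{\mu}_i + \mathbf{t}_k\bigr),
\qquad w_{ik} \ge 0,\quad \sum_{k=1}^{K} w_{ik} = 1 .
$$

Each posed point is a convex combination of the positions it would take under every joint's rigid transform, which is why LBS alone cannot express pose-dependent non-rigid effects and a learned deformation branch is needed on top of it.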

[170] Vector Field Synthesis with Sparse Streamlines Using Diffusion Model

Nguyen K. Phan,Ricardo Morales,Sebastian D. Espriella,Guoning Chen

Main category: cs.CV

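TL;DR: This paper proposes a new diffusion-based framework that synthesizes 2D vector fields from sparse, coherent streamline inputs while maintaining physical plausibility.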

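Details Motivation: Traditional optimization-based methods fall short in flexibility and physical consistency, so a vector field synthesis method that respects both geometric and physical constraints is needed. Method: A conditional denoising diffusion probabilistic model with classifier-free guidance performs progressive reconstruction while preserving geometric and physical constraints. Result: Experiments show that the method generates plausible vector fields that obey physical laws and stay faithful to the sparse inputs, outperforming traditional optimization-based approaches in flexibility and physical consistency. Conclusion: The diffusion-based framework offers a new paradigm for vector field synthesis that balances data fidelity and physical interpretability. Abstract: We present a novel diffusion-based framework for synthesizing 2D vector fields from sparse, coherent inputs (i.e., streamlines) while maintaining physical plausibility. Our method employs a conditional denoising diffusion probabilistic model with classifier-free guidance, enabling progressive reconstruction that preserves both geometric and physical constraints. Experimental results demonstrate our method's ability to synthesize plausible vector fields that adhere to physical laws while maintaining fidelity to sparse input observations, outperforming traditional optimization-based approaches in terms of flexibility and physical consistency.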
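
As background, classifier-free guidance combines the conditional and unconditional noise predictions of a single network at sampling time. One common parameterization is given below; the symbols $c$ (here, the sparse streamline condition), $\varnothing$ (the null condition), and the guidance scale $w$ are notation assumed for this note, and the abstract does not state which parameterization or scale the authors use.

$$
\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w\,\bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\bigr)
$$

Setting $w = 1$ recovers the plain conditional prediction, while $w > 1$ pushes samples toward closer agreement with the condition at some cost in diversity.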

[171] Is There Knowledge Left to Extract? Evidence of Fragility in Medically Fine-Tuned Vision-Language Models

Oliver McLaughlin,Daniel Shubin,Carsten Eickhoff,Ritambhara Singh,William Rudman,Michal Golovanevsky

Main category: cs.CV

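TL;DR: This paper evaluates four paired open-source vision-language models (VLMs) on four medical imaging tasks and finds limited clinical reasoning: performance drops sharply as task difficulty rises, medical fine-tuning brings no consistent gain, and the models are highly sensitive to prompt wording. A new pipeline in which the VLM writes a description and a text-only model makes the diagnosis helps only slightly; the underlying problem is that both the visual representations and the downstream reasoning are weak.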

Details Motivation: To examine whether medical fine-tuning of vision-language models truly improves clinical reasoning rather than reliance on superficial visual cues. Method: Compare four paired open-source VLMs (e.g., LLaVA vs. LLaVA-Med) on four medical image classification tasks of increasing difficulty (brain tumor, pneumonia, skin cancer, and histopathology); analyze the effect of prompt formulation; propose a two-stage pipeline of description generation followed by GPT-5.1 diagnosis; analyze vision encoder embeddings. Result: Performance falls steeply as task difficulty increases, approaching random levels; medical fine-tuning shows no consistent advantage; minor prompt changes cause large swings in accuracy and refusal rate; the description-based pipeline gives only a limited gain; the embedding analysis reveals weaknesses in both visual representation and reasoning. Conclusion: Current medical VLM performance is fragile and highly prompt-dependent, and domain fine-tuning does not reliably strengthen clinical reasoning. Abstract: Vision-language models (VLMs) are increasingly adapted through domain-specific fine-tuning, yet it remains unclear whether this improves reasoning beyond superficial visual cues, particularly in high-stakes domains like medicine. We evaluate four paired open-source VLMs (LLaVA vs. LLaVA-Med; Gemma vs. MedGemma) across four medical imaging tasks of increasing difficulty: brain tumor, pneumonia, skin cancer, and histopathology classification. We find that performance degrades toward near-random levels as task difficulty increases, indicating limited clinical reasoning. Medical fine-tuning provides no consistent advantage, and models are highly sensitive to prompt formulation, with minor changes causing large swings in accuracy and refusal rates. To test whether closed-form VQA suppresses latent knowledge, we introduce a description-based pipeline where models generate image descriptions that a text-only model (GPT-5.1) uses for diagnosis. This recovers a limited additional signal but remains bounded by task difficulty. Analysis of vision encoder embeddings further shows that failures stem from both weak visual representations and downstream reasoning. Overall, medical VLM performance is fragile, prompt-dependent, and not reliably improved by domain-specific fine-tuning.

[172] Training-Free Object-Background Compositional T2I via Dynamic Spatial Guidance and Multi-Path Pruning

Yang Deng,David Mould,Paul L. Rosin,Yu-Kun Lai

Main category: cs.CV

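TL;DR: This paper proposes a training-free framework that restructures diffusion sampling. Dynamic spatial guidance and multi-path pruning improve the balance and interplay between foreground and background in text-to-image generation, improving global scene coherence and compositional control.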

Details Motivation: Existing text-to-image diffusion models show a foreground bias and under-optimize the background, which leads to incoherent scenes and limited compositional control. Method: A training-free framework with two components: (1) Dynamic Spatial Guidance, a soft, timestep-dependent gating mechanism that modulates foreground and background attention; (2) Multi-Path Pruning, which dynamically filters candidate latent trajectories using attention statistics and semantic alignment signals. A dedicated benchmark is also built. Result: Consistent improvements in background coherence and foreground-background compositional alignment across multiple diffusion backbones. Conclusion: The method effectively mitigates foreground bias and strengthens scene-level synthesis and compositional control in diffusion models without any training. Abstract: Existing text-to-image diffusion models, while excelling at subject synthesis, exhibit a persistent foreground bias that treats the background as a passive and under-optimized byproduct. This imbalance compromises global scene coherence and constrains compositional control. To address the limitation, we propose a training-free framework that restructures diffusion sampling to explicitly account for foreground-background interactions. Our approach consists of two key components. First, Dynamic Spatial Guidance introduces a soft, timestep-dependent gating mechanism that modulates foreground and background attention during the diffusion process, enabling spatially balanced generation. Second, Multi-Path Pruning performs multi-path latent exploration and dynamically filters candidate trajectories using both internal attention statistics and external semantic alignment signals, retaining trajectories that better satisfy object-background constraints. We further develop a benchmark specifically designed to evaluate object-background compositionality. Extensive evaluations across multiple diffusion backbones demonstrate consistent improvements in background coherence and object-background compositional alignment.

[173] Do vision models perceive illusory motion in static images like humans?

Isabella Elaine Rosario,Fan L. Cheng,Zitang Sun,Nikolaus Kriegeskorte

Main category: cs.CV

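TL;DR: This paper studies how deep neural networks respond to static motion illusions such as Rotating Snakes. Most optical flow models fail to reproduce the illusory motion that humans perceive; only the human-inspired Dual-Channel model shows human-like rotational motion, and only under simulated saccades. The results expose an important gap between current models and human motion processing and offer guidance for building motion estimation systems that better match human perception.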

Details Motivation: Understanding human motion processing is essential for building reliable, human-centered computer vision systems. Deep neural networks perform well at optical flow estimation but remain less robust than humans and use different computational strategies; motion illusions are an effective tool for probing where human and machine vision agree or differ. Method: Evaluate several representative optical flow models on the Rotating Snakes illusion; test model outputs under simulated saccadic eye movements; use ablation analysis to study the roles of luminance signals, color-feature signals, and a recurrent attention mechanism. Result: Most optical flow models fail to produce flow fields consistent with human perception; only the Dual-Channel model produces the expected rotational motion in the saccade simulation; luminance-based and higher-order color-feature-based signals both contribute to this behavior, and the recurrent attention mechanism is critical for integrating local cues. Conclusion: There is a substantial gap between current optical flow models and human visual motion processing; incorporating human visual mechanisms (such as a dual-channel structure and recurrent attention) helps align models with human perception and advances human-centric AI. Abstract: Understanding human motion processing is essential for building reliable, human-centered computer vision systems. Although deep neural networks (DNNs) achieve strong performance in optical flow estimation, they remain less robust than humans and rely on fundamentally different computational strategies. Visual motion illusions provide a powerful probe into these mechanisms, revealing how human and machine vision align or diverge. While recent DNN-based motion models can reproduce dynamic illusions such as reverse-phi, it remains unclear whether they can perceive illusory motion in static images, exemplified by the Rotating Snakes illusion. We evaluate several representative optical flow models on Rotating Snakes and show that most fail to generate flow fields consistent with human perception. Under simulated conditions mimicking saccadic eye movements, only the human-inspired Dual-Channel model exhibits the expected rotational motion, with the closest correspondence emerging during the saccade simulation. Ablation analyses further reveal that both luminance-based and higher-order color-feature-based motion signals contribute to this behavior and that a recurrent attention mechanism is critical for integrating local cues. Our results highlight a substantial gap between current optical-flow models and human visual motion processing, and offer insights for developing future motion-estimation systems with improved correspondence to human perception and human-centric AI.

[174] FF3R: Feedforward Feature 3D Reconstruction from Unconstrained views

Chaoyi Zhou,Run Wang,Feng Luo,Mert D. Pesé,Zhiwen Fan,Yiqi Zhong,Siyu Huang

Main category: cs.CV

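TL;DR: This paper proposes FF3R, a purely feed-forward framework that unifies geometric reconstruction and semantic understanding without any annotations (camera poses, depth maps, or semantic labels). It relies on rendering supervision and two new mechanisms that address global semantic inconsistency and local structural inconsistency, and it achieves leading performance on several tasks.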

Details Motivation: Existing methods treat geometric reconstruction and semantic understanding separately, which leads to redundant pipelines and compounded errors; they also depend heavily on manual annotations (poses, depth, semantic labels), which limits deployment and generalization. Method: FF3R is a fully annotation-free feed-forward framework: (1) it is trained only with rendering supervision on RGB images and feature maps; (2) a Token-wise Fusion Module fuses geometry and semantic tokens via cross-attention; (3) a Semantic-Geometry Mutual Boosting mechanism combines geometry-guided feature warping (for global semantic consistency) with semantic-aware voxelization (for local structural consistency). Result: On ScanNet and DL3DV-10K, FF3R outperforms existing methods in novel-view synthesis, open-vocabulary semantic segmentation, and depth estimation, and generalizes well to in-the-wild scenes. Conclusion: FF3R establishes a scalable paradigm for unified 3D reasoning and provides a key building block for embodied intelligence systems that need both spatial and semantic understanding. Abstract: Recent advances in vision foundation models have revolutionized geometry reconstruction and semantic understanding. Yet, most of the existing approaches treat these capabilities in isolation, leading to redundant pipelines and compounded errors. This paper introduces FF3R, a fully annotation-free feed-forward framework that unifies geometric and semantic reasoning from unconstrained multi-view image sequences. Unlike previous methods, FF3R does not require camera poses, depth maps, or semantic labels, relying solely on rendering supervision for RGB and feature maps, establishing a scalable paradigm for unified 3D reasoning. In addition, we address two critical challenges in feedforward feature reconstruction pipelines, namely global semantic inconsistency and local structural inconsistency, through two key innovations: (i) a Token-wise Fusion Module that enriches geometry tokens with semantic context via cross-attention, and (ii) a Semantic-Geometry Mutual Boosting mechanism combining geometry-guided feature warping for global consistency with semantic-aware voxelization for local coherence. Extensive experiments on ScanNet and DL3DV-10K demonstrate FF3R's superior performance in novel-view synthesis, open-vocabulary semantic segmentation, and depth estimation, with strong generalization to in-the-wild scenarios, paving the way for embodied intelligence systems that demand both spatial and semantic understanding.

[175] PAS: Estimating the target accuracy before domain adaptation

Raphaella Diniz,Jackson de Faria,Martin Ester

Main category: cs.CV

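TL;DR: This paper proposes PAS, a new score that estimates, before domain adaptation is run, how well a source dataset and a pre-trained feature extractor will transfer to a target classification task. It guides the choice of model and source domain, improving target accuracy while reducing computational cost.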

Details Motivation: Choosing the source domain and pre-trained model for domain adaptation is hard because the target domain has no labeled validation set and the number of available pre-trained models is large. Method: The PAS score leverages the generalization power of pre-trained models and assesses source-target compatibility from pre-trained feature embeddings; it is integrated into a selection framework. Result: On image classification benchmarks, PAS correlates strongly with actual target accuracy and consistently selects the best pre-trained model and source domain. Conclusion: PAS is an effective, efficient, and practical tool for assessment before domain adaptation that can improve adaptation performance and reduce trial-and-error cost. Abstract: The goal of domain adaptation is to make predictions for unlabeled samples from a target domain with the help of labeled samples from a different but related source domain. The performance of domain adaptation methods is highly influenced by the choice of source domain and pre-trained feature extractor. However, the selection of source data and pre-trained model is not trivial due to the absence of a labeled validation set for the target domain and the large number of available pre-trained models. In this work, we propose PAS, a novel score designed to estimate the transferability of a source domain set and a pre-trained feature extractor to a target classification task before actually performing domain adaptation. PAS leverages the generalization power of pre-trained models and assesses source-target compatibility based on the pre-trained feature embeddings. We integrate PAS into a framework that indicates the most relevant pre-trained model and source domain among multiple candidates, thus improving target accuracy while reducing the computational overhead. Extensive experiments on image classification benchmarks demonstrate that PAS correlates strongly with actual target accuracy and consistently guides the selection of the best-performing pre-trained model and source domain for adaptation.

[176] DINO_4D: Semantic-Aware 4D Reconstruction

Yiru Yang,Zhuojie Wu,Quentin Marguet,Nishant Kumar Singh,Max Schulthess

Main category: cs.CV

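TL;DR: DINO_4D uses frozen DINOv3 features as structural priors to bring semantic information into 4D reconstruction of dynamic scenes, suppressing semantic drift and improving tracking accuracy and reconstruction completeness while keeping linear time complexity.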

Details Motivation: 4D reconstruction of dynamic scenes has to connect low-level geometric sensing with high-level semantic understanding, and existing methods suffer from semantic drift. Method: DINO_4D introduces frozen DINOv3 features as structural priors to add semantic awareness to the reconstruction process. Result: Evaluated on the Point Odyssey and TUM-Dynamics benchmarks, the method keeps $O(T)$ time complexity while significantly improving Tracking Accuracy (APD) and Reconstruction Completeness. Conclusion: DINO_4D establishes a new paradigm for 4D world models that combine geometric precision with semantic understanding. Abstract: At the intersection of computer vision and robotic perception, 4D reconstruction of dynamic scenes serves as the critical bridge connecting low-level geometric sensing with high-level semantic understanding. We present DINO_4D, introducing frozen DINOv3 features as structural priors, injecting semantic awareness into the reconstruction process to effectively suppress semantic drift during dynamic tracking. Experiments on the Point Odyssey and TUM-Dynamics benchmarks demonstrate that our method maintains the linear time complexity $O(T)$ of its predecessors while significantly improving Tracking Accuracy (APD) and Reconstruction Completeness. DINO_4D establishes a new paradigm for constructing 4D World Models that possess both geometric precision and semantic understanding.

[177] Topo-ADV: Generating Topology-Driven Imperceptible Adversarial Point Clouds

Gayathry Chandramana Krishnan Nampoothiry,Raghuram Venkatapuram,Anirban Ghosh,Ayan Dutta

Main category: cs.CV

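TL;DR: This paper proposes Topo-ADV, a topology-based adversarial attack on 3D point clouds that uses persistent homology as a differentiable optimization objective; by perturbing the topological features of a point cloud it achieves high attack success rates with perturbations that are geometrically hard to detect.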

Details Motivation: Existing 3D adversarial attacks mainly manipulate geometric properties and assume that preserving global shape fidelity preserves semantic content; this paper challenges that assumption and argues that the homological structure of a point cloud is an unexplored vulnerability surface. Method: Topo-ADV is an end-to-end differentiable framework that embeds persistent homology as a differentiable topological representation and jointly optimizes a topology divergence loss, a misclassification objective, and geometric imperceptibility constraints. Result: On ModelNet40, ShapeNet Part, and ScanObjectNN, the attack reaches up to 100% success against PointNet and DGCNN while the perturbations remain geometrically hard to detect, and it beats state-of-the-art methods on several perceptibility metrics. Conclusion: Topological structure is a key and vulnerable semantic dimension in deep learning on 3D point clouds, which makes topology-aware adversarial robustness an important research direction. Abstract: Deep neural networks for 3D point cloud understanding have achieved remarkable success in object classification and recognition, yet recent work shows that these models remain highly vulnerable to adversarial perturbations. Existing 3D attacks predominantly manipulate geometric properties such as point locations, curvature, or surface structure, implicitly assuming that preserving global shape fidelity preserves semantic content. In this work, we challenge this assumption and introduce the first topology-driven adversarial attack for point cloud deep learning. Our key insight is that the homological structure of a 3D object constitutes a previously unexplored vulnerability surface. We propose Topo-ADV, an end-to-end differentiable framework that incorporates persistent homology as an explicit optimization objective, enabling gradient-based manipulation of topological features during adversarial example generation. By embedding persistence diagrams through differentiable topological representations, our method jointly optimizes (i) a topology divergence loss that alters persistence, (ii) a misclassification objective, and (iii) geometric imperceptibility constraints that preserve visual plausibility. Experiments demonstrate that subtle topology-driven perturbations consistently achieve up to 100% attack success rates on benchmark datasets such as ModelNet40, ShapeNet Part, and ScanObjectNN using PointNet and DGCNN classifiers, while remaining geometrically indistinguishable from the original point clouds, beating state-of-the-art methods on various perceptibility metrics.

[178] Not Your Stereo-Typical Estimator: Combining Vision and Language for Volume Perception

Gautham Vinod,Bruce Coburn,Siddeshwar Raghavan,Fengqing Zhu

Main category: cs.CV

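TL;DR: This paper proposes a new object volume estimation method that fuses implicit 3D cues from stereo vision with explicit prior knowledge from natural language, significantly outperforming vision-only baselines on several public datasets.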

Details Motivation: Existing methods rely on complex 3D reconstruction pipelines or struggle with the ambiguity inherent in single-view images, so a more robust, context-aware approach to volume estimation is needed. Method: Extract deep features from a stereo image pair and from a descriptive text prompt (containing the object's class and an approximate volume), then fuse them with a simple yet effective projection layer into a unified multi-modal representation for volume regression. Result: On several public datasets the text-guided method significantly outperforms vision-only baselines, showing that even simple textual priors can effectively guide volume estimation. Conclusion: Fusing textual priors with stereo vision improves the accuracy and robustness of volume estimation and opens a path toward context-aware visual measurement systems. Abstract: Accurate volume estimation of objects from visual data is a long-standing challenge in computer vision with significant applications in robotics, logistics, and smart health. Existing methods often rely on complex 3D reconstruction pipelines or struggle with the ambiguity inherent in single-view images. To address these limitations, we introduce a new method that fuses implicit 3D cues from stereo vision with explicit prior knowledge from natural language text. Our approach extracts deep features from a stereo image pair and a descriptive text prompt that contains the object's class and an approximate volume, then integrates them using a simple yet effective projection layer into a unified, multi-modal representation for regression. We conduct extensive experiments on public datasets demonstrating that our text-guided approach significantly outperforms vision-only baselines. Our findings show that leveraging even simple textual priors can effectively guide the volume estimation task, paving the way for more context-aware visual measurement systems. Code: https://gitlab.com/viper-purdue/stereo-typical-estimator.

[179] PointSplat: Efficient Geometry-Driven Pruning and Transformer Refinement for 3D Gaussian Splatting

Anh Thuan Tran,Jana Kosecka

Main category: cs.CV

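TL;DR: This paper proposes PointSplat, a 3D geometry-driven prune-and-refine framework for compressing 3D Gaussian Splatting (3DGS) models that delivers efficient, high-quality novel view synthesis without relying on 2D images or per-scene fine-tuning.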

Details Motivation: Conventional 3DGS needs a very large number of Gaussians, which is costly in memory and storage; existing pruning and fine-tuning methods depend on 2D images and per-scene optimization, which limits generalization and efficiency. Method: PointSplat has two components: (1) a ranking and pruning strategy driven purely by 3D attributes, removing the dependence on 2D images; (2) a dual-branch encoder that processes and re-weights geometric and appearance features separately to reduce feature imbalance. Result: On ScanNet++ and Replica, PointSplat achieves competitive rendering quality across sparsity levels with clearly better efficiency and no per-scene optimization. Conclusion: PointSplat unifies two lines of work, Gaussian pruning and transformer refinement, and its purely geometry-driven mechanism improves the generality, efficiency, and scalability of 3DGS. Abstract: 3D Gaussian Splatting (3DGS) has recently unlocked real-time, high-fidelity novel view synthesis by representing scenes using explicit 3D primitives. However, traditional methods often require millions of Gaussians to capture complex scenes, leading to significant memory and storage demands. Recent approaches have addressed this issue through pruning and per-scene fine-tuning of Gaussian parameters, thereby reducing the model size while maintaining visual quality. These strategies typically rely on 2D images to compute importance scores, followed by scene-specific optimization. In this work, we introduce PointSplat, a 3D geometry-driven prune-and-refine framework that bridges the previously disjoint directions of Gaussian pruning and transformer refinement. Our method includes two key components: (1) an efficient geometry-driven strategy that ranks Gaussians based solely on their 3D attributes, removing reliance on 2D images during the pruning stage, and (2) a dual-branch encoder that separates and re-weights geometric and appearance features to avoid feature imbalance. Extensive experiments on ScanNet++ and Replica across varying sparsity levels demonstrate that PointSplat consistently achieves competitive rendering quality and superior efficiency without additional per-scene optimization.

[180] From UAV Imagery to Agronomic Reasoning: A Multimodal LLM Benchmark for Plant Phenotyping

Yu Wu,Guangzeng Han,Ibra Niang Niang,Francia Ravelombola,Maiara Oliveira,Jason Davis,Dong Chen,Feng Lin,Xiaolei Huang

Main category: cs.CV

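TL;DR: This paper presents PlantXpert, an evidence-grounded multimodal reasoning benchmark for soybean and cotton phenotyping, intended to evaluate and advance the agronomic adaptation and reasoning abilities of vision-language models in plant science.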

Details Motivation: Plant science poses particular challenges for foundation models because it requires domain knowledge, fine-grained visual understanding, and complex agronomic and biological reasoning; existing methods struggle to meet the need for high-throughput, comprehensive phenotyping. Method: Build the PlantXpert benchmark with 385 digital images and more than 3,000 samples covering four agronomic areas (disease, pests, weeds, and yield); evaluate 11 state-of-the-art vision-language models on visual expertise, quantitative reasoning, and multi-step agronomic reasoning. Result: Task-specific fine-tuning substantially improves accuracy (for example, Qwen3-VL-4B/30B reach up to 78%), but gains from larger models diminish, generalization across crops is uneven, and quantitative and biologically grounded reasoning remain difficult. Conclusion: PlantXpert provides a structured, reproducible benchmark for evaluating evidence-grounded agronomic reasoning and helps advance multimodal models for plant science. Abstract: To improve crop genetics, high-throughput, effective and comprehensive phenotyping is a critical prerequisite. While such tasks were traditionally performed manually, recent advances in multimodal foundation models, especially in vision-language models (VLMs), have enabled more automated and robust phenotypic analysis. However, plant science remains a particularly challenging domain for foundation models because it requires domain-specific knowledge, fine-grained visual interpretation, and complex biological and agronomic reasoning. To address this gap, we develop PlantXpert, an evidence-grounded multimodal reasoning benchmark for soybean and cotton phenotyping. Our benchmark provides a structured and reproducible framework for agronomic adaptation of VLMs, and enables controlled comparison between base models and their domain-adapted counterparts. We constructed a dataset comprising 385 digital images and more than 3,000 benchmark samples spanning key plant science domains including disease, pest control, weed management, and yield. The benchmark can assess diverse capabilities including visual expertise, quantitative reasoning, and multi-step agronomic reasoning. A total of 11 state-of-the-art VLMs were evaluated. The results indicate that task-specific fine-tuning leads to substantial improvement in accuracy, with models such as Qwen3-VL-4B and Qwen3-VL-30B achieving up to 78%. At the same time, gains from model scaling diminish beyond a certain capacity, generalization across soybean and cotton remains uneven, and quantitative as well as biologically grounded reasoning continue to pose substantial challenges.
These findings suggest that PlantXpert can serve as a foundation for assessing evidence-grounded agronomic reasoning and for advancing multimodal model development in plant science.

[181] Does Your VFM Speak Plant? The Botanical Grammar of Vision Foundation Models for Object Detection

Lars Lundqvist,Earl Ranario,Hamid Kamangir,Heesup Yun,Christine Diepenbrock,Brian N. Bailey,J. Mason Earles

Main category: cs.CV

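TL;DR: This paper proposes a systematic prompt optimization framework for improving the zero-shot detection performance of vision foundation models (VFMs) in complex agricultural scenes (cowpea flower and pod detection). Model-specific prompt structures substantially raise mAP and transfer from synthetic data to real field imagery.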

Details Motivation: Vision foundation models (VFMs) support zero-shot object detection, but in complex agricultural scenes they are highly sensitive to how the text prompt is constructed, so a systematic prompt optimization method is needed. Method: Build an optimization framework with eight prompt axes; run one-factor-at-a-time analysis and combinatorial optimization on four open-vocabulary detectors (YOLO World, SAM3, Grounding DINO, and OWLv2); use a large language model (LLM) to transfer the prompt structure across tasks to a morphologically different target (flowers to pods); test generalization on synthetic and real data. Result: Model-specific combinatorial prompts clearly beat a naive species-name baseline (for example, +0.357 mAP@0.5 for YOLO World); prompts optimized on synthetic data match or exceed prompts optimized on labeled real data in real field scenes (for example, YOLO World flower detection: 0.374 vs. 0.353). Conclusion: Prompt engineering can substantially narrow the gap between zero-shot VFMs and supervised models; optimal prompts are model-specific and non-obvious, and they transfer from synthetic to real data and across tasks. Abstract: Vision foundation models (VFMs) offer the promise of zero-shot object detection without task-specific training data, yet their performance in complex agricultural scenes remains highly sensitive to text prompt construction. We present a systematic prompt optimization framework evaluating four open-vocabulary detectors (YOLO World, SAM3, Grounding DINO, and OWLv2) for cowpea flower and pod detection across synthetic and real field imagery. We decompose prompts into eight axes and conduct one-factor-at-a-time analysis followed by combinatorial optimization, revealing that models respond divergently to prompt structure: conditions that optimize one architecture can collapse another. Applying model-specific combinatorial prompts yields substantial gains over a naive species-name baseline, including +0.357 mAP@0.5 for YOLO World and +0.362 mAP@0.5 for OWLv2 on synthetic cowpea flower data. To evaluate cross-task generalization, we use an LLM to translate the discovered axis structure to a morphologically distinct target, cowpea pods, and compare against prompting using the discovered optimal structures from synthetic flower data. Crucially, prompt structures optimized exclusively on synthetic data transfer effectively to real-world fields: synthetic-pipeline prompts match or exceed those discovered on labeled real data for the majority of model-object combinations (flower: 0.374 vs. 0.353 for YOLO World; pod: 0.429 vs. 0.371 for SAM3).
Our findings demonstrate that prompt engineering can substantially close the gap between zero-shot VFMs and supervised detectors without requiring manual annotation, and that optimal prompts are model-specific, non-obvious, and transferable across domains.

[182] BLPR: Robust License Plate Recognition under Viewpoint and Illumination Variations via Confidence-Driven VLM Fallback

Guillermo Auza Banegas,Diego Calvimontes Vera,Sergio Castro Sandoval,Natalia Condori Peredo,Edwin Salcedo

Main category: cs.CV

TL;DR: This paper proposes BLPR, a framework designed for Bolivian license plate recognition that combines synthetic-data pretraining, real-data fine-tuning, geometric rectification, and a confidence-based fallback to a lightweight vision-language model (Gemma3 4B). It also introduces the first public Bolivian LPDR dataset and reaches 89.6% character-level recognition accuracy on real-world data.

Details Motivation: Address low license plate recognition accuracy in unconstrained environments in underrepresented regions such as Bolivia, in particular the challenges of illumination changes, viewpoint distortion, and data scarcity. Method: A two-stage pipeline with a YOLO-based detector (pretrained on Blender-generated synthetic data, then fine-tuned on field data from La Paz); detected plates are geometrically rectified and passed to a character recognition model; a lightweight vision-language model, Gemma3 4B, is triggered as a confidence-based fallback; and synthetic-to-real domain adaptation is used to improve robustness. Result: The system reaches 89.6% character-level recognition accuracy on real-world data, and the first public Bolivian LPDR dataset is released. Conclusion: BLPR improves the robustness of Bolivian license plate detection and recognition in complex urban environments, is practical to deploy, and advances both data and methods for license plate recognition in low-resource regions. Abstract: Robust license plate recognition in unconstrained environments remains a significant challenge, particularly in underrepresented regions with limited data availability and unique visual characteristics, such as Bolivia. Recognition accuracy in real-world conditions is often degraded by factors such as illumination changes and viewpoint distortion. To address these challenges, we introduce BLPR, a novel deep learning-based License Plate Detection and Recognition (LPDR) framework specifically designed for Bolivian license plates. The proposed system follows a two-stage pipeline where a YOLO-based detector is pretrained on synthetic data generated in Blender to simulate extreme perspectives and lighting conditions, and subsequently fine-tuned on street-level data collected in La Paz, Bolivia. Detected plates are geometrically rectified and passed to a character recognition model. To improve robustness under ambiguous scenarios, a lightweight vision-language model (Gemma3 4B) is selectively triggered as a confidence-based fallback mechanism. The proposed framework further leverages synthetic-to-real domain adaptation to improve robustness under diverse real-world conditions. We also introduce the first publicly available Bolivian LPDR dataset, enabling evaluation under diverse viewpoint and illumination conditions. The system achieves a character-level recognition accuracy of 89.6% on real-world data, demonstrating its effectiveness for deployment in challenging urban environments. Our project is publicly available at https://github.com/EdwinTSalcedo/BLPR.
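
The confidence-based fallback described above can be illustrated with a minimal sketch. It is not the BLPR implementation: the function name `recognize_with_fallback`, the stub recognizers, and the threshold of 0.8 are all hypothetical, and the abstract does not say how confidence is computed or which threshold is used.

```python
from typing import Callable, Tuple


def recognize_with_fallback(
    plate_image: object,
    primary: Callable[[object], Tuple[str, float]],
    fallback: Callable[[object], str],
    threshold: float = 0.8,  # assumed value, not taken from the paper
) -> Tuple[str, str]:
    """Return (text, source). The fallback runs only when the primary
    recognizer reports a confidence below the threshold."""
    text, confidence = primary(plate_image)
    if confidence >= threshold:
        return text, "primary"
    return fallback(plate_image), "fallback"


# Stub recognizers standing in for the character model and the VLM.
def stub_ocr(image: object) -> Tuple[str, float]:
    return ("1234ABC", 0.55) if image == "blurry" else ("5678XYZ", 0.97)


def stub_vlm(image: object) -> str:
    return "1234ABD"


print(recognize_with_fallback("sharp", stub_ocr, stub_vlm))
print(recognize_with_fallback("blurry", stub_ocr, stub_vlm))
```

Gating the expensive model on confidence keeps its cost proportional to the share of ambiguous plates rather than to the total number of detections.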

[183] I Walk the Line: Examining the Role of Gestalt Continuity in Object Binding for Vision Transformers

Alexa R. Tartaglini,Michael A. Lepori

Main category: cs.CV

TL;DR: This paper studies how pretrained vision models perform object binding, focusing on the role of the Gestalt continuity principle, and uses synthetic datasets, attention-head analysis, and ablation experiments to show that continuity matters for object binding.

Details Motivation: Object binding is a foundational process in visual cognition, but its mechanism in neural networks is unclear; in particular, whether pretrained vision models rely on Gestalt continuity for object binding remains open. Method: Use synthetic datasets to test the sensitivity of several pretrained vision transformers to continuity; identify and analyze specific attention heads; and ablate those heads to measure their contribution to object-binding representations. Result: Many pretrained vision transformers are sensitive to Gestalt continuity; specific attention heads that track continuity are identified and generalize across datasets; ablating these heads weakens object binding. Conclusion: Gestalt continuity is one of the important mechanisms by which pretrained vision models achieve object binding, and specific attention heads play a key role in it. Abstract: Object binding is a foundational process in visual cognition, during which low-level perceptual features are joined into object representations. Binding has been considered a fundamental challenge for neural networks, and a major milestone on the way to artificial models with flexible visual intelligence. Recently, several investigations have demonstrated evidence that binding mechanisms emerge in pretrained vision models, enabling them to associate portions of an image that contain an object. The question remains: how are these models binding objects together? In this work, we investigate whether vision models rely on the principle of Gestalt continuity to perform object binding, over and above other principles like similarity and proximity. Using synthetic datasets, we demonstrate that binding probes are sensitive to continuity across a wide range of pretrained vision transformers. Next, we uncover particular attention heads that track continuity, and show that these heads generalize across datasets. Finally, we ablate these attention heads, and show that they often contribute to producing representations that encode object binding.

[184] Cross-Cultural Value Awareness in Large Vision-Language Models

Phillip Howard,Xin Su,Kathleen C. Fraser

Main category: cs.CV

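TL;DR: This paper studies bias in how large vision-language models (LVLMs) judge a person's moral, ethical, and political values under different cultural contexts (such as religion, nationality, and socioeconomic status), evaluating the sensitivity of value judgments in five popular LVLMs with counterfactual image sets and a multi-dimensional analysis.

Details Motivation: Social bias in LVLMs has received attention, but stereotypes tied to cultural contexts such as religion, nationality, and socioeconomic status remain understudied; this paper aims to narrow that gap. Method: Build counterfactual image sets (the same person in different cultural contexts) and combine Moral Foundations Theory, lexical analyses, and tests of sensitivity to cultural context to evaluate value judgments in five popular LVLMs along multiple dimensions. Result: LVLM value judgments are significantly affected by the cultural context shown in the image, which exposes a lack of robust awareness of cultural value differences and a systematic cultural bias. Conclusion: LVLMs show non-negligible bias in cross-cultural value judgments; cultural sensitivity should be built into model training and evaluation to improve fairness and social fit. Abstract: The rapid adoption of large vision-language models (LVLMs) in recent years has been accompanied by growing fairness concerns due to their propensity to reinforce harmful societal stereotypes. While significant attention has been paid to such fairness concerns in the context of social biases, relatively little prior work has examined the presence of stereotypes in LVLMs related to cultural contexts such as religion, nationality, and socioeconomic status. In this work, we aim to narrow this gap by investigating how cultural contexts depicted in images influence the judgments LVLMs make about a person's moral, ethical, and political values. We conduct a multi-dimensional analysis of such value judgments in five popular LVLMs using counterfactual image sets, which depict the same person across different cultural contexts. Our evaluation framework diagnoses LVLM awareness of cultural value differences through the use of Moral Foundations Theory, lexical analyses, and the sensitivity of generated values to depicted cultural contexts.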

[185] Unmixing-Guided Spatial-Spectral Mamba with Clustering Tokens for Hyperspectral Image Classification

Yimin Zhu,Lincoln Linlin Xu

Main category: cs.CV

TL;DR: This paper proposes a spatial-spectral Mamba model guided by spectral unmixing that combines clustering tokens with multi-task learning, significantly improving hyperspectral image classification accuracy while also outputting an endmember library and abundance maps.

Details Motivation: Hyperspectral image (HSI) classification is challenged by the spectral-mixture effect, spatial-spectral heterogeneity, and the difficulty of preserving class boundaries and details. Method: Design a spectral unmixing network that learns endmembers and abundance maps while accounting for endmember variability; propose a Top-K token selection strategy based on clusters derived from the abundance maps; build an unmixing-guided spatial-spectral Mamba module; and adopt a multi-task supervision scheme that jointly models endmember-abundance patterns and classification. Result: The model significantly outperforms state-of-the-art methods on four HSI datasets and outputs accurate classification maps together with a spectral library and abundance maps. Conclusion: The proposed unmixing-guided spatial-spectral Mamba framework combines unmixing priors with sequence modeling, improving both performance and interpretability in HSI classification. Abstract: Although hyperspectral image (HSI) classification is critical for supporting various environmental applications, it is a challenging task due to the spectral-mixture effect, the spatial-spectral heterogeneity and the difficulty of preserving class boundaries and details. This letter presents a novel unmixing-guided spatial-spectral Mamba with clustering tokens for improved HSI classification, with the following contributions. First, to disentangle the spectral mixture effect in HSI for improved pattern discovery, we design a novel spectral unmixing network that not only automatically learns endmembers and abundance maps from HSI but also accounts for endmember variabilities. Second, to generate Mamba token sequences, based on the clusters defined by abundance maps, we design an efficient Top-\textit{K} token selection strategy to adaptively sequence the tokens for improved representational capability. Third, to improve spatial-spectral feature learning and detail preservation, based on the Top-\textit{K} token sequences, we design a novel unmixing-guided spatial-spectral Mamba module that greatly improves traditional Mamba models in terms of token learning and sequencing. Fourth, to learn simultaneously the endmember-abundance patterns and classification labels, a multi-task scheme is designed for model supervision, leading to a new unmixing-classification framework that outputs not only accurate classification maps but also a comprehensive spectral-library and abundance maps. Comparative experiments on four HSI datasets demonstrate that our model can greatly outperform the other state-of-the-art approaches.
Code is available at https://github.com/GSIL-UCalgary/Unmixing_guided_Mamba.git

[186] Learnable Motion-Focused Tokenization for Effective and Efficient Video Unsupervised Domain Adaptation

Tzu Ling Liu,Ian Stavness,Mrigank Rochan

Main category: cs.CV

TL;DR: This paper proposes Learnable Motion-Focused Tokenization (LMFT) for video unsupervised domain adaptation (VUDA). It dynamically keeps motion-rich, action-relevant tokens and discards static, redundant background tokens, improving action recognition while significantly reducing computational overhead.

Details Motivation: Existing VUDA methods suffer from domain shift caused by static backgrounds and overlook computational efficiency, so they fall short of fully supervised performance and are hard to apply in practice. Method: Propose a learnable motion-focused tokenization (LMFT) mechanism that tokenizes video frames into patch tokens and learns to discard low-motion tokens (mostly background) while retaining motion-rich, action-relevant tokens for domain adaptation. Result: State-of-the-art performance on three standard VUDA benchmarks across 21 domain adaptation settings, with significantly lower computational overhead. Conclusion: LMFT enables video unsupervised domain adaptation that is both effective and computationally efficient. Abstract: Video Unsupervised Domain Adaptation (VUDA) poses a significant challenge in action recognition, requiring the adaptation of a model from a labeled source domain to an unlabeled target domain. Despite recent advances, existing VUDA methods often fall short of fully supervised performance, a key reason being the prevalence of static and uninformative backgrounds that exacerbate domain shifts. Additionally, prior approaches largely overlook computational efficiency, limiting real-world adoption. To address these issues, we propose Learnable Motion-Focused Tokenization (LMFT) for VUDA. LMFT tokenizes video frames into patch tokens and learns to discard low-motion, redundant tokens, primarily corresponding to background regions, while retaining motion-rich, action-relevant tokens for adaptation. Extensive experiments on three standard VUDA benchmarks across 21 domain adaptation settings show that our VUDA framework with LMFT achieves state-of-the-art performance while significantly reducing computational overhead. LMFT thus enables VUDA that is both effective and computationally efficient.
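
To make the idea of motion-focused token pruning concrete, here is a minimal, non-learned sketch. It swaps in a fixed frame-difference heuristic where LMFT learns which tokens to drop, so it only illustrates the keep/discard step. The function names, the patch size, and the `keep_ratio` parameter are hypothetical.

```python
def patch_motion_scores(prev, curr, patch):
    """Mean absolute frame difference per patch.

    prev and curr are grayscale frames given as lists of rows.
    Returns a dict mapping (patch_row, patch_col) to a motion score.
    """
    rows, cols = len(curr), len(curr[0])
    scores = {}
    for r0 in range(0, rows, patch):
        for c0 in range(0, cols, patch):
            diffs = [
                abs(curr[r][c] - prev[r][c])
                for r in range(r0, min(r0 + patch, rows))
                for c in range(c0, min(c0 + patch, cols))
            ]
            scores[(r0 // patch, c0 // patch)] = sum(diffs) / len(diffs)
    return scores


def keep_high_motion(scores, keep_ratio):
    """Keep the top fraction of patches by motion score (at least one)."""
    k = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return sorted(ranked[:k])


# A 4x4 frame pair where only the bottom-right 2x2 patch changes.
prev_frame = [[0] * 4 for _ in range(4)]
curr_frame = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 10, 10], [0, 0, 10, 10]]

scores = patch_motion_scores(prev_frame, curr_frame, patch=2)
kept = keep_high_motion(scores, keep_ratio=0.25)
print(kept)
```

Dropping the static patches before the transformer is where the compute saving would come from, since attention cost grows with the number of tokens.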

[187] YUV20K: A Complexity-Driven Benchmark and Trajectory-Aware Alignment Model for Video Camouflaged Object Detection

Yiyu Liu,Shuo Ye,Chao Hao,Zitong Yu

Main category: cs.CV

TL;DR: This paper presents the YUV20K dataset and a new framework (with MFS and TAA modules) that address appearance instability and temporal feature misalignment caused by complex motion in video camouflaged object detection, significantly improving model robustness and cross-domain generalization.

Details Motivation: Video camouflaged object detection (VCOD) is limited by the lack of challenging benchmark datasets, and models are not robust under erratic motion dynamics; in particular, they struggle with motion-induced appearance instability and temporal feature misalignment. Method: Build YUV20K, a pixel-level annotated, complexity-driven VCOD benchmark (24,295 frames, 91 scenes, 47 species), and propose a framework with two core modules: Motion Feature Stabilization (MFS), which stabilizes features with frame-agnostic semantic basis primitives, and Trajectory-Aware Alignment (TAA), which achieves precise temporal alignment through trajectory-guided deformable sampling. Result: The method significantly outperforms the state of the art on existing datasets, establishes a new baseline on YUV20K, and shows strong cross-domain generalization and robustness in complex spatiotemporal scenarios. Conclusion: YUV20K fills the gap of a high-quality, challenging VCOD benchmark, and the proposed framework mitigates motion interference, providing a reliable data foundation and methodological reference for future work. Abstract: Video Camouflaged Object Detection (VCOD) is currently constrained by the scarcity of challenging benchmarks and the limited robustness of models against erratic motion dynamics. Existing methods often struggle with Motion-Induced Appearance Instability and Temporal Feature Misalignment caused by complex motion scenarios. To address the data bottleneck, we present YUV20K, a pixel-level annotated, complexity-driven VCOD benchmark. Comprising 24,295 annotated frames across 91 scenes and 47 species, it specifically targets challenging scenarios such as large-displacement motion, camera motion, and four other scenario types. On the methodological front, we propose a novel framework featuring two key modules: Motion Feature Stabilization (MFS) and Trajectory-Aware Alignment (TAA). The MFS module utilizes frame-agnostic Semantic Basis Primitives to stabilize features, while the TAA module leverages trajectory-guided deformable sampling to ensure precise temporal alignment. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art competitors on existing datasets and establishes a new baseline on the challenging YUV20K. Notably, our framework exhibits superior cross-domain generalization and robustness when confronting complex spatiotemporal scenarios. Our code and dataset will be available at https://github.com/K1NSA/YUV20K

[188] FlowPalm: Optical Flow Driven Non-Rigid Deformation for Geometrically Diverse Palmprint Generation

Yuchen Zou,Huikai Shao,Lihuang Fang,Zhipeng Xiong,Dexing Zhong

Main category: cs.CV

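TL;DR: This paper proposes FlowPalm, a framework that uses an optical-flow-driven diffusion model to generate synthetic palmprint images with realistic geometric deformations, significantly improving downstream recognition performance.

Details Motivation: Existing synthetic palmprint methods focus mainly on style translation and either ignore or crudely approximate the complex non-rigid geometric deformations of real palmprints, so the generated data lacks diversity. Method: FlowPalm estimates optical flow between real palmprint pairs to model the statistics of geometric deformation, and designs a progressive sampling strategy in the diffusion process that introduces deformation while maintaining identity consistency. Result: Experiments on six benchmark datasets show that FlowPalm significantly outperforms state-of-the-art palmprint generation methods in downstream recognition tasks. Conclusion: Combining optical-flow-driven geometric modeling with diffusion-based generation is a key route to more realistic synthetic palmprints that are more useful for recognition. Abstract: Recently, synthetic palmprints have been increasingly used as substitutes for real data to train recognition models. To be effective, such synthetic data must reflect the diversity of real palmprints, including both style variation and geometric variation. However, existing palmprint generation methods mainly focus on style translation, while geometric variation is either ignored or approximated by simple handcrafted augmentations. In this work, we propose FlowPalm, an optical-flow-driven palmprint generation framework capable of simulating the complex non-rigid deformations observed in real palms. Specifically, FlowPalm estimates optical flows between real palmprint pairs to capture the statistical patterns of geometric deformations. Building on these priors, we design a progressive sampling process that gradually introduces the geometric deformations during diffusion while maintaining identity consistency. Extensive experiments on six benchmark datasets demonstrate that FlowPalm significantly outperforms state-of-the-art palmprint generation approaches in downstream recognition tasks. Project page: https://yuchenzou.github.io/FlowPalm/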

[189] Gait Recognition with Temporal Kolmogorov-Arnold Networks

Mohammed Asad,Dinesh Kumar Vishwakarma

Main category: cs.CV

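TL;DR: This paper proposes a new temporal model, the Temporal Kolmogorov-Arnold Network (TKAN), to make gait recognition more robust, especially under clothing variation, carried objects, and view changes. The model replaces fixed weights with learnable one-dimensional functions and adds a two-level memory mechanism (short-term RKAN sublayers and a gated long-term pathway), modeling both local gait cycles and long-term motion trends while staying compact.

Details Motivation: Existing silhouette-based temporal models are sensitive to long sequences, observation noise, and appearance covariates (such as clothing, carried objects, and viewpoint); recurrent models struggle to retain information from early frames and are inherently sequential to optimize, while transformer models are computationally expensive, need large training sets, and may be sensitive to irregular sequence lengths and noisy inputs, which degrades gait recognition in complex scenes. Method: Propose the Temporal Kolmogorov-Arnold Network (TKAN), which replaces fixed edge weights with learnable one-dimensional functions and uses a two-level memory mechanism: short-term RKAN sublayers capture local gait-cycle dynamics, and a gated long-term pathway models overall motion trends; combined with a CNN, this forms the CNN+TKAN framework. Result: Experiments on the CASIA-B dataset indicate that CNN+TKAN achieves strong gait recognition performance under the reported evaluation setting. Conclusion: Through function-valued weights and a two-level memory mechanism, TKAN balances modeling accuracy, robustness, and compactness, offering a new approach to gait recognition in practical surveillance scenarios. Abstract: Gait recognition is a biometric modality that identifies individuals from their characteristic walking patterns. Unlike conventional biometric traits, gait can be acquired at a distance and without active subject cooperation, making it suitable for surveillance and public safety applications. Nevertheless, silhouette-based temporal models remain sensitive to long sequences, observation noise, and appearance-related covariates. Recurrent architectures often struggle to preserve information from earlier frames and are inherently sequential to optimize, whereas transformer-based models typically require greater computational resources and larger training sets and may be sensitive to irregular sequence lengths and noisy inputs. These limitations reduce robustness under clothing variation, carrying conditions, and view changes, while also hindering the joint modeling of local gait cycles and longer-term motion trends. To address these challenges, we introduce a Temporal Kolmogorov-Arnold Network (TKAN) for gait recognition. The proposed model replaces fixed edge weights with learnable one-dimensional functions and incorporates a two-level memory mechanism consisting of short-term RKAN sublayers and a gated long-term pathway. This design enables efficient modeling of both cycle-level dynamics and broader temporal context while maintaining a compact backbone. Experiments on the CASIA-B dataset indicate that the proposed CNN+TKAN framework achieves strong recognition performance under the reported evaluation setting.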

[190] Revisiting the Scale Loss Function and Gaussian-Shape Convolution for Infrared Small Target Detection

Hao Li,Man Fung Zhuo

Main category: cs.CV

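TL;DR: This paper proposes a new infrared small target detection method. A diff-based scale loss and a Gaussian-shaped convolution combined with a rotated pinwheel mask improve training stability and spatial attention, and markedly improve detection performance.

Details Motivation: Infrared small target detection faces two challenges: non-monotonic scale loss functions make training unstable, and generic convolution kernels ignore the physical imaging characteristics of small targets, leading to inadequate spatial attention. Method: Propose a diff-based scale loss that yields monotonic gradients and stable convergence; introduce a Gaussian-shaped convolution with a learnable scale parameter that matches the center-concentrated intensity profile of small targets, combined with a rotated pinwheel mask that adaptively aligns with target orientation via a straight-through estimator. Result: On the IRSTD-1k, NUDT-SIRST, and SIRST-UAVB datasets, mIoU, P_d, and F_a consistently improve over state-of-the-art methods. Conclusion: The method improves infrared small target detection on both the loss side and the spatial modeling side, combining physical priors with deep learning, and generalizes well in practice. Abstract: Infrared small target detection still faces two persistent challenges: training instability from non-monotonic scale loss functions, and inadequate spatial attention due to generic convolution kernels that ignore the physical imaging characteristics of small targets. In this paper, we revisit both aspects. For the loss side, we propose a \emph{diff-based scale loss} that weights predictions according to the signed area difference between the predicted mask and the ground truth, yielding strictly monotonic gradients and stable convergence. We further analyze a family of four scale loss variants to understand how their geometric properties affect detection behavior. For the spatial side, we introduce \emph{Gaussian-shaped convolution} with a learnable scale parameter to match the center-concentrated intensity profile of infrared small targets, and augment it with a \emph{rotated pinwheel mask} that adaptively aligns the kernel with target orientation via a straight-through estimator. Extensive experiments on IRSTD-1k, NUDT-SIRST, and SIRST-UAVB demonstrate consistent improvements in $mIoU$, $P_d$, and $F_a$ over state-of-the-art methods. We release our anonymous code and pretrained models.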
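
The center-concentrated profile that a Gaussian-shaped kernel is meant to match can be sketched as follows. This is only an illustration of an isotropic Gaussian weighting with a fixed scale: in the paper the scale parameter is learned, and the rotated pinwheel mask is not modeled here. The function name `gaussian_kernel` and the example size and sigma are assumptions.

```python
import math


def gaussian_kernel(size: int, sigma: float):
    """Build a normalized size x size Gaussian weighting centered on the kernel.

    sigma plays the role of the scale parameter; it is a fixed number here,
    whereas the abstract describes it as learnable.
    """
    half = size // 2
    raw = [
        [
            math.exp(-((x - half) ** 2 + (y - half) ** 2) / (2.0 * sigma ** 2))
            for x in range(size)
        ]
        for y in range(size)
    ]
    total = sum(sum(row) for row in raw)
    return [[value / total for value in row] for row in raw]


kernel = gaussian_kernel(size=5, sigma=1.0)
for row in kernel:
    print(" ".join(f"{value:.4f}" for value in row))
```

A smaller sigma concentrates more of the weight at the center, which suits point-like targets; a larger sigma spreads the weight toward the kernel border.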

[191] A Comparative Study of Modern Object Detectors for Robust Apple Detection in Orchard Imagery

Mohammed Asad,Ajai Kumar Gautam,Priyanshu Dhiman,Rishi Raj Prajapati

Main category: cs.CV

TL;DR: This study establishes a standardized single-class apple detection benchmark on the AppleBBCH81 dataset and systematically evaluates six mainstream detectors (YOLOv10n, YOLO11n, RT-DETR-L, Faster R-CNN, FCOS, and SSDLite320). YOLO11n is best on the strict localization metric (mAP@0.5:0.95), while YOLOv10n attains the highest F1-score at a low confidence threshold, which shows that model choice should follow the needs of the downstream task.

Details Motivation: Apples in orchard images are hard to detect because of changing illumination, leaf clutter, dense fruit clusters, and partial occlusion, and there is no unified, controlled, reproducible benchmark for fairly comparing detectors. Method: Use the public AppleBBCH81 dataset with one deterministic train/validation/test split and a unified evaluation protocol to compare six representative object detectors, with COCO-style mAP@0.5 and mAP@0.5:0.95 as primary metrics, supplemented by precision-recall curves and precision, recall, and F1-score at a fixed IoU of 0.5. Result: YOLO11n achieves the best strict localization performance on the validation split (mAP@0.5:0.95 = 0.6065, mAP@0.5 = 0.9620); YOLOv10n has the highest F1-score at confidence >= 0.05; RT-DETR-L has high recall but low precision because of many low-confidence false positives. Conclusion: Detector selection for orchard scenes should not rely on localization accuracy alone; it should also weigh robustness to the confidence threshold and whether the downstream task (such as harvesting or counting) favors precision or recall. Abstract: Accurate apple detection in orchard images is important for yield prediction, fruit counting, robotic harvesting, and crop monitoring. However, changing illumination, leaf clutter, dense fruit clusters, and partial occlusion make detection difficult. To provide a fair and reproducible comparison, this study establishes a controlled benchmark for single-class apple detection on the public AppleBBCH81 dataset using one deterministic train, validation, and test split and a unified evaluation protocol across six representative detectors: YOLOv10n, YOLO11n, RT-DETR-L, Faster R-CNN (ResNet50-FPN), FCOS (ResNet50-FPN), and SSDLite320 (MobileNetV3-Large). Performance is evaluated primarily using COCO-style mAP@0.5 and mAP@0.5:0.95, and threshold-dependent behavior is further analyzed using precision-recall curves and fixed-threshold precision, recall, and F1-score at IoU = 0.5. On the validation split, YOLO11n achieves the best strict localization performance with mAP@0.5:0.95 = 0.6065 and mAP@0.5 = 0.9620, followed closely by RT-DETR-L and YOLOv10n. At a fixed operating point with confidence >= 0.05, YOLOv10n attains the highest F1-score, whereas RT-DETR-L achieves very high recall but low precision because of many false positives at low confidence. These findings show that detector selection for orchard deployment should be guided not only by localization-aware accuracy but also by threshold robustness and the requirements of the downstream task.
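
The fixed-threshold precision, recall, and F1-score at IoU = 0.5 mentioned above can be computed for a single image roughly as follows. This is a generic sketch with greedy, confidence-ordered matching, not the study's evaluation code; the function names and the toy boxes are made up for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def prf_at_threshold(preds, gts, conf_thr=0.05, iou_thr=0.5):
    """preds: list of (box, confidence); gts: list of boxes.

    Each ground-truth box can be matched by at most one prediction.
    """
    kept = sorted((p for p in preds if p[1] >= conf_thr),
                  key=lambda p: p[1], reverse=True)
    matched, tp = set(), 0
    for box, _ in kept:
        best_j, best_iou = -1, iou_thr
        for j, gt in enumerate(gts):
            overlap = iou(box, gt)
            if j not in matched and overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j >= 0:
            matched.add(best_j)
            tp += 1
    fp, fn = len(kept) - tp, len(gts) - tp
    precision = tp / (tp + fp) if kept else 0.0
    recall = tp / (tp + fn) if gts else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [((0, 0, 10, 10), 0.90), ((21, 21, 31, 31), 0.60),
         ((50, 50, 60, 60), 0.10), ((70, 70, 80, 80), 0.01)]
print(prf_at_threshold(preds, gts, conf_thr=0.05))
```

Lowering `conf_thr` admits more low-confidence boxes, which tends to raise recall and lower precision; this is the trade-off behind the RT-DETR-L observation in the abstract.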

[192] GIF: A Conditional Multimodal Generative Framework for IR Drop Imaging in Chip Layouts

Kiran Thorat,Nicole Meng,Mostafa Karami,Caiwen Ding,Yingjie Lao,Zhijie Jerry Shi

Main category: cs.CV

TL;DR: This paper proposes GIF, a framework that fuses geometric layout and circuit topology information and uses a conditional diffusion model to generate high-quality IR drop images, outperforming existing methods on several metrics.

Details Motivation: Traditional EDA tools have become slow and expensive as transistor density increases; existing machine-learning methods for IR drop analysis fail to capture both local and long-range dependencies and ignore layout geometry and logical connectivity topology. Method: Propose GIF (Generative IR drop Framework), which fuses image features (layout) with graph features (circuit topology) to guide a conditional diffusion process that generates IR drop images. Result: On the CircuitNet-N28 dataset, GIF achieves SSIM = 0.78, Pearson = 0.95, PSNR = 21.77, and NMAE = 0.026, outperforming prior methods. Conclusion: A diffusion-based multimodal generative approach that jointly models geometric layout and logical topology can significantly improve IR drop prediction quality and offers a new direction for power integrity analysis. Abstract: IR drop analysis is essential in physical chip design to ensure the power integrity of on-chip power delivery networks. Traditional Electronic Design Automation (EDA) tools have become slow and expensive as transistor density scales. Recent works have introduced machine learning (ML)-based methods that formulate IR drop analysis as an image prediction problem. These existing ML approaches fail to capture both local and long-range dependencies and ignore crucial geometrical and topological information from physical layouts and logical connectivity. To address these limitations, we propose GIF, a Generative IR drop Framework that uses both geometrical and topological information to generate IR drop images. GIF fuses image and graph features to guide a conditional diffusion process, producing high-quality IR drop images. For instance, on the CircuitNet-N28 dataset, GIF achieves 0.78 SSIM, 0.95 Pearson correlation, 21.77 PSNR, and 0.026 NMAE, outperforming prior methods. These results demonstrate that our framework, using diffusion-based multimodal conditioning, reliably generates high-quality IR drop images. This shows that IR drop analysis can effectively leverage recent advances in generative modeling when geometric layout features and logical circuit topology are jointly modeled. By combining geometry-aware spatial features with logical graph representations, GIF enables IR drop analysis to benefit from recent advances in generative modeling for structured image generation.
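
Two of the reported metrics, Pearson correlation and PSNR, have standard definitions that can be sketched in a few lines. The code below works on flattened pixel lists and assumes values scaled to [0, 1] for PSNR; the abstract does not state the value range or the exact NMAE and SSIM variants used, so those two are left out.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB; max_value is the assumed peak."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


predicted = [0.10, 0.40, 0.35, 0.80]
reference = [0.12, 0.38, 0.40, 0.75]
print(f"Pearson: {pearson(predicted, reference):.4f}")
print(f"PSNR: {psnr(predicted, reference):.2f} dB")
```

Pearson correlation rewards getting the spatial pattern right regardless of scale, while PSNR penalizes absolute pixel error, which is why the two are usually reported together.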

[193] SwinTextUNet: Integrating CLIP-Based Text Guidance into Swin Transformer U-Nets for Medical Image Segmentation

Ashfak Yeafi,Parthaw Goswami,Md Khairul Islam,Ashifa Islam Shamme

Main category: cs.CV

TL;DR: This paper proposes SwinTextUNet, a multimodal medical image segmentation framework that fuses CLIP text embeddings with a Swin Transformer UNet. Cross-attention and convolutional fusion improve segmentation accuracy, reaching a Dice score of 86.47% and an IoU of 78.2% on the QaTaCOV19 dataset.

Details Motivation: Medical image segmentation models that rely only on visual features perform poorly on ambiguous or low-contrast lesions, so semantic priors are needed to improve robustness. Method: Inject CLIP-derived text embeddings into a Swin Transformer UNet backbone and design cross-attention and convolutional fusion modules that align text semantic guidance with multi-level visual features. Result: On the QaTaCOV19 dataset, the four-stage SwinTextUNet reaches a Dice coefficient of 86.47% and an IoU of 78.2%; ablation studies confirm the key role of text guidance and multimodal fusion. Conclusion: Vision-language fusion significantly improves medical image segmentation and offers a new path toward clinically usable diagnostic tools. Abstract: Precise medical image segmentation is fundamental for enabling computer-aided diagnosis and effective treatment planning. Traditional models that rely solely on visual features often struggle when confronted with ambiguous or low-contrast patterns. To overcome these limitations, we introduce SwinTextUNet, a multimodal segmentation framework that incorporates textual embeddings derived from Contrastive Language-Image Pretraining (CLIP) into a Swin Transformer UNet backbone. By integrating cross-attention and convolutional fusion, the model effectively aligns semantic text guidance with hierarchical visual representations, enhancing robustness and accuracy. We evaluate our approach on the QaTaCOV19 dataset, where the proposed four-stage variant achieves an optimal balance between performance and complexity, yielding Dice and IoU scores of 86.47% and 78.2%, respectively. Ablation studies further validate the importance of text guidance and multimodal fusion. These findings underscore the promise of vision-language integration in advancing medical image segmentation and supporting clinically meaningful diagnostic tools.
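
For readers unfamiliar with the two reported scores, the following sketch computes the Dice coefficient and IoU for a pair of binary masks given as flat lists of 0/1 values. The function name and the convention of returning 1.0 when both masks are empty are assumptions; the paper's evaluation script may average or handle edge cases differently.

```python
def dice_and_iou(pred, target):
    """Dice coefficient and IoU for two binary masks (flat lists of 0/1)."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks are empty: treat as a perfect match (assumed convention).
        return 1.0, 1.0
    dice = 2.0 * intersection / (pred_area + target_area)
    return dice, intersection / union


predicted_mask = [1, 1, 0, 0]
reference_mask = [1, 0, 1, 0]
dice, iou_score = dice_and_iou(predicted_mask, reference_mask)
print(f"Dice = {dice:.3f}, IoU = {iou_score:.3f}")
```

For a single mask pair, Dice is never lower than IoU, so the two scores rank individual predictions the same way even though their values differ.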

[194] Demographic and Linguistic Bias Evaluation in Omnimodal Language Models

Alaa Elobaid

Main category: cs.CV

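TL;DR: This paper comprehensively evaluates demographic and linguistic bias in omnimodal language models that process text, images, audio, and video. Image and video tasks show smaller disparities, while audio tasks show substantial bias and lower performance.

Details Motivation: Omnimodal language models are widely deployed, but their performance across demographic groups and modalities has not been well studied, so a systematic fairness evaluation is needed. Method: Evaluate four omnimodal models on demographic attribute estimation, identity verification, activity recognition, multilingual speech transcription, and language identification, measuring accuracy differences across age, gender, skin tone, language, and country of origin. Result: Image and video understanding tasks perform better with smaller disparities; audio understanding tasks perform significantly worse, with large accuracy differences across age, gender, and language and frequent prediction collapse. Conclusion: Fairness must be evaluated across all supported modalities to ensure that omnimodal language models are reliable and fair in real-world applications. Abstract: This paper provides a comprehensive evaluation of demographic and linguistic biases in omnimodal language models that process text, images, audio, and video within a single framework. Although these models are being widely deployed, their performance across different demographic groups and modalities is not well studied. Four omnimodal models are evaluated on tasks that include demographic attribute estimation, identity verification, activity recognition, multilingual speech transcription, and language identification. Accuracy differences are measured across age, gender, skin tone, language, and country of origin. The results show that image and video understanding tasks generally exhibit better performance with smaller demographic disparities. In contrast, audio understanding tasks exhibit significantly lower performance and substantial bias, including large accuracy differences across age groups, genders, and languages, and frequent prediction collapse toward narrow categories. These findings highlight the importance of evaluating fairness across all supported modalities as omnimodal language models are increasingly used in real-world applications.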

[195] What and Where to Adapt: Structure-Semantics Co-Tuning for Machine Vision Compression via Synergistic Adapters

Shaobo Liu,Haobo Xiong,Kai Liu,Yuna Lin

Main category: cs.CV

TL;DR: This paper proposes the Structure-Semantics Co-Tuning (S2-CoT) framework, which co-tunes the codec backbone and the entropy model through a Structural Fidelity Adapter (SFA) and a Semantic Context Adapter (SCA), approaching full fine-tuning performance while tuning only a small number of parameters.

Details Motivation: Existing parameter-efficient fine-tuning methods mostly target the encoder-decoder backbone and neglect adapting the statistical semantics of the entropy model, even though it is essential for modeling the probability distribution of latent features; naively inserting adapters into the entropy model works poorly, so adapter type and placement need to be coordinated. Method: Propose the S2-CoT framework with two specialized adapters: the SFA is embedded in the encoder-decoder to fuse spatial and frequency information and preserve structural fidelity, and the SCA is embedded in the entropy model to refine the channel context and adapt to SFA-tuned features; the two are optimized jointly. Result: State-of-the-art performance on four different base codecs with only a small fraction of trainable parameters, closely matching full fine-tuning. Conclusion: Co-tuning structure and semantics is key to improving parameter-efficient image compression, and S2-CoT offers a new approach that serves both human and machine vision. Abstract: Parameter-efficient fine-tuning of pre-trained codecs is a promising direction in image compression for human and machine vision. While most existing works have primarily focused on tuning the feature structure within the encoder-decoder backbones, the adaptation of the statistical semantics within the entropy model has received limited attention despite its function of predicting the probability distribution of latent features. Our analysis reveals that naive adapter insertion into the entropy model can lead to suboptimal outcomes, underscoring that the effectiveness of adapter-based tuning depends critically on the coordination between adapter type and placement across the compression pipeline. Therefore, we introduce Structure-Semantics Co-Tuning (S2-CoT), a novel framework that achieves this coordination via two specialized, synergistic adapters: the Structural Fidelity Adapter (SFA) and the Semantic Context Adapter (SCA). SFA is integrated into the encoder-decoder to preserve high-fidelity representations by dynamically fusing spatial and frequency information; meanwhile, the SCA adapts the entropy model to align with SFA-tuned features by refining the channel context for more efficient statistical coding. Through joint optimization, S2-CoT turns potential performance degradation into synergistic gains, achieving state-of-the-art results across four diverse base codecs with only a small fraction of trainable parameters, closely matching full fine-tuning performance. Code is available at https://github.com/Brock-bit4/S2-CoT.

[196] FREE-Switch: Frequency-based Dynamic LoRA Switch for Style Transfer

Shenghe Zheng,Minyu Zhang,Tianhao Liu,Hongzhi Wang

Main category: cs.CV

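TL;DR: This paper proposes the FREE-Switch framework, which efficiently combines multiple pretrained adapters through frequency-domain importance-driven dynamic LoRA switching and a semantic-level generation alignment mechanism. It addresses content drift and detail degradation in image generation and substantially reduces the training cost of customized generation.

Details Motivation: Existing model merging methods applied to image generation suffer from content drift, are computationally expensive, or ignore differences between adapters and thus degrade details; different adapters contribute differently at each diffusion step, and semantic consistency affects generation quality. Method: Propose a frequency-domain importance-driven dynamic LoRA switch that selects adapters according to how important each diffusion step is to each adapter, and design an automatic Generation Alignment mechanism that aligns the generation intents of multiple adapters at the semantic level. Result: FREE-Switch performs well when combining adapters for different objects and styles, substantially reducing the training cost of high-quality customized generation and mitigating content drift and detail loss. Conclusion: Combining frequency-domain modeling with semantic alignment improves multi-adapter fusion in diffusion models and offers a new approach for edge deployment and low-cost customized generation. Abstract: With the growing availability of open-sourced adapters trained on the same diffusion backbone for diverse scenes and objects, combining these pretrained weights enables low-cost customized generation. However, most existing model merging methods are designed for classification or text generation, and when applied to image generation, they suffer from content drift due to error accumulation across multiple diffusion steps. For image-oriented methods, training-based approaches are computationally expensive and unsuitable for edge deployment, while training-free ones use uniform fusion strategies that ignore inter-adapter differences, leading to detail degradation. We find that since different adapters are specialized for generating different types of content, the contribution of each diffusion step carries different significance for each adapter. Accordingly, we propose a frequency-domain importance-driven dynamic LoRA switch method. Furthermore, we observe that maintaining semantic consistency across adapters effectively mitigates detail loss; thus, we design an automatic Generation Alignment mechanism to align generation intents at the semantic level. Experiments demonstrate that our FREE-Switch (Frequency-based Efficient and Dynamic LoRA Switch) framework efficiently combines adapters for different objects and styles, substantially reducing the training cost of high-quality customized generation.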

[197] LVSum: A Benchmark for Timestamp-Aware Long Video Summarization

Alkesh Patel,Melis Ozyildirim,Ying-Chang Cheng,Ganesh Nagarajan

Main category: cs.CV

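TL;DR: This paper presents LVSum, a human-annotated benchmark with fine-grained temporal alignment for evaluating long video summarization, and uses newly proposed LLM-based metrics together with standard metrics to expose systematic gaps in the temporal understanding of existing MLLMs.

Details Motivation: Current multimodal large language models (MLLMs) struggle to maintain temporal fidelity over long durations in long video summarization, and their summaries lack both semantic and temporal grounding. Method: Build the LVSum benchmark, which contains long videos from 13 domains paired with human-written summaries that include precise temporal references; design new LLM-based metrics for content relevance and modality coherence; and evaluate mainstream MLLMs with these and standard metrics. Result: Experiments reveal systematic gaps in temporal understanding among existing MLLMs and provide a new baseline and insights for temporal reasoning in long video summarization. Conclusion: LVSum provides a high-quality, temporally aligned evaluation benchmark for long video summarization and supports progress in temporal modeling and cross-modal alignment for MLLMs. Abstract: Long video summarization presents significant challenges for current multimodal large language models (MLLMs), particularly in maintaining temporal fidelity over extended durations and producing summaries that are both semantically and temporally grounded. In this work, we present LVSum, a human-annotated benchmark designed specifically for evaluating long video summarization with fine-grained temporal alignment. LVSum comprises diverse long-form videos across 13 domains, each paired with human-generated summaries containing precise temporal references. We conduct a comprehensive evaluation of both proprietary and open-source MLLMs on LVSum, assessing performance using newly introduced LLM-based metrics for content relevance and modality coherence, alongside standard evaluation metrics. Our experiments reveal systematic gaps in temporal understanding among existing MLLMs and offer insights that establish a new foundation for advancing temporal reasoning in long video summarization.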

[198] SinkTrack: Attention Sink based Context Anchoring for Large Language Models

Xu Liu,Guikun Chen,Wenguan Wang

Main category: cs.CV

TL;DR: This paper proposes SinkTrack, which exploits the attention sink phenomenon in large language models by using the first token as an information anchor and injecting key contextual features into it, mitigating hallucination and context forgetting. The method is training-free, plug-and-play, and adds negligible overhead.

Details Motivation: Large language models suffer from hallucination and context forgetting, mainly because of attention drift: the model's attention gradually shifts away from the initial input as generation proceeds. Method: Propose SinkTrack, which uses the model's intrinsic attention sink behavior, treating the first token as a context anchor and injecting key input features (such as image or instruction features) into its representation so that the model stays anchored to the initial context throughout generation. The method is training-free, plug-and-play, and has negligible inference overhead. Result: Hallucination and context forgetting are markedly reduced on textual tasks (e.g., +21.6% on SQuAD2.0 with Llama3.1-8B-Instruct) and multimodal tasks (e.g., +22.8% on M3CoT with Qwen2.5-VL-7B-Instruct), with consistent gains across architectures and scales. Conclusion: SinkTrack is a general, efficient, training-free context anchoring method that uses the attention sink mechanism to keep the model focused on the initial input, improving the consistency and faithfulness of generation. Abstract: Large language models (LLMs) suffer from hallucination and context forgetting. Prior studies suggest that attention drift is a primary cause of these problems, where LLMs' focus shifts towards newly generated tokens and away from the initial input context. To counteract this, we make use of a related, intrinsic characteristic of LLMs: attention sink -- the tendency to consistently allocate high attention to the very first token (i.e., ) of a sequence. Concretely, we propose an advanced context anchoring method, SinkTrack, which treats that first token as an information anchor and injects key contextual features (such as those derived from the input image or instruction) into its representation. As such, the LLM remains anchored to the initial input context throughout the entire generation process. SinkTrack is training-free, plug-and-play, and introduces negligible inference overhead. Experiments demonstrate that SinkTrack mitigates hallucination and context forgetting across both textual (e.g., +21.6% on SQuAD2.0 with Llama3.1-8B-Instruct) and multi-modal (e.g., +22.8% on M3CoT with Qwen2.5-VL-7B-Instruct) tasks. Its consistent gains across different architectures and scales underscore the robustness and generalizability. We also analyze its underlying working mechanism from the perspective of information delivery. Our source code is available at https://github.com/67L1/SinkTrack.

[199] Prompt Relay: Inference-Time Temporal Control for Multi-Event Video Generation

Gordon Chen,Ziqi Huang,Ziwei Liu

Main category: cs.CV

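TL;DR: This paper proposes Prompt Relay, an inference-time method for fine-grained temporal control in multi-event video generation. It requires no architectural changes and no additional computation, and adds a penalty to cross-attention to improve temporal prompt alignment and visual quality.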

Details Motivation: Existing video diffusion models cannot precisely control the order, duration, and timing of multiple events. With complex event descriptions this leads to semantic entanglement and poor text-video alignment, which falls short of movie-grade video synthesis. Method: Prompt Relay is an inference-time, plug-and-play method that adds a penalty constrained by temporal segments to the cross-attention mechanism, so that each temporal segment attends only to its assigned prompt and the semantics of different events are decoupled. Result: The method improves temporal prompt alignment in multi-event video generation, reduces semantic interference, and improves visual quality without adding model parameters or computational cost. Conclusion: Prompt Relay offers an efficient, lightweight, and general temporal control mechanism for video diffusion models and supports controllable, narrative video generation. Abstract: Video diffusion models have achieved remarkable progress in generating high-quality videos. However, these models struggle to represent the temporal succession of multiple events in real-world videos and lack explicit mechanisms to control when semantic concepts appear, how long they persist, and the order in which multiple events occur. Such control is especially important for movie-grade video synthesis, where coherent storytelling depends on precise timing, duration, and transitions between events. When using a single paragraph-style prompt to describe a sequence of complex events, models often exhibit semantic entanglement, where concepts intended for different moments in the video bleed into one another, resulting in poor text-video alignment. To address these limitations, we propose Prompt Relay, an inference-time, plug-and-play method to enable fine-grained temporal control in multi-event video generation, requiring no architectural modifications and no additional computational overhead. Prompt Relay introduces a penalty into the cross-attention mechanism, so that each temporal segment attends only to its assigned prompt, allowing the model to represent one semantic concept at a time and thereby improving temporal prompt alignment, reducing semantic interference, and enhancing visual quality.
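
The cross-attention penalty described in the Prompt Relay abstract can be illustrated with a minimal sketch. The abstract does not give the penalty's exact form, so the additive constant, the function name `segment_penalized_attention`, and the per-frame and per-token segment labels below are all assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def segment_penalized_attention(logits, frame_segment, token_segment, penalty=1e4):
    """Hypothetical penalty: subtract a large constant from the cross-attention
    logit whenever a video frame and a prompt token belong to different
    temporal segments, then normalize each row with softmax.

    logits[i][j]     raw score from video frame i to prompt token j
    frame_segment[i] segment index of frame i (assumed given)
    token_segment[j] segment index of prompt token j (assumed given)
    """
    weights = []
    for i, row in enumerate(logits):
        penalized = [
            score - (penalty if token_segment[j] != frame_segment[i] else 0.0)
            for j, score in enumerate(row)
        ]
        weights.append(softmax(penalized))
    return weights
```

With a large enough penalty, each frame's attention mass falls almost entirely on the tokens of its own segment, which is the behavior the abstract describes as each temporal segment attending only to its assigned prompt.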

[200] Counting to Four is still a Chore for VLMs

Duy Le Dinh Anh,Patrick Amadeus Irawan,Tuan Van Vo

Main category: cs.CV

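TL;DR: Through behavioral and mechanistic analysis, this paper studies why vision-language models (VLMs) fail at object counting. It finds that counting errors stem not only from limits of visual perception but also from underuse of visual evidence in the language stage, and it evaluates a lightweight intervention, Modality Attention Share (MAS), that increases the share of visual attention during answer generation.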

Details Motivation: Existing evaluations look only at final outputs, which makes it hard to locate where VLMs fail internally on simple counting tasks; a closer look at how visual and language information interact during multimodal reasoning is needed. Method: The authors build COUNTINGTRICKS, a controlled evaluation suite, and combine attention analysis with component-wise probing to track how count-relevant visual evidence changes across model layers. They also evaluate Modality Attention Share (MAS), an intervention that encourages a minimum budget of visual attention during answer generation. Result: Count-relevant visual evidence is strongest at the modality projection stage and degrades substantially in later language layers, where models become more susceptible to text priors; MAS is evaluated as a lightweight remedy for this. Conclusion: Counting failures in VLMs stem not only from visual perception limits but also from underuse of visual evidence during language-stage reasoning, which highlights the importance of how attention is allocated across modalities. Abstract: Vision--language models (VLMs) have achieved impressive performance on complex multimodal reasoning tasks, yet they still fail on simple grounding skills such as object counting. Existing evaluations mostly assess only final outputs, offering limited insight into where these failures arise inside the model. In this work, we present an empirical study of VLM counting behavior through both behavioral and mechanistic analysis. We introduce COUNTINGTRICKS, a controlled evaluation suite of simple shape-based counting cases designed to expose vulnerabilities under different patchification layouts and adversarial prompting conditions. Using attention analysis and component-wise probing, we show that count-relevant visual evidence is strongest in the modality projection stage but degrades substantially in later language layers, where models become more susceptible to text priors. Motivated by this finding, we further evaluate Modality Attention Share (MAS), a lightweight intervention that encourages a minimum budget of visual attention during answer generation. Our results suggest that counting failures in VLMs stem not only from visual perception limits, but also from the underuse of visual evidence during language-stage reasoning. Code and dataset will be released at https://github.com/leduy99/-CVPRW26-Modality-Attention-Share.
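
One way to picture a "minimum budget of visual attention" is as a rescaling of a single attention row. The sketch below is only an assumed reading of that idea: the abstract does not specify how MAS is applied, and the function name `enforce_visual_share`, the `min_share` parameter, and the proportional rescaling rule are all hypothetical.

```python
def enforce_visual_share(attn, is_visual, min_share=0.2):
    """Rescale one attention row so that visual tokens receive at least
    `min_share` of the total mass (hypothetical rule, for illustration).

    attn      non-negative attention weights that sum to 1
    is_visual booleans marking which positions are image tokens
    """
    visual_mass = sum(a for a, v in zip(attn, is_visual) if v)
    text_mass = sum(a for a, v in zip(attn, is_visual) if not v)
    # Nothing to do if the budget is already met, or if one side has no mass
    # to rescale.
    if visual_mass >= min_share or visual_mass == 0.0 or text_mass == 0.0:
        return list(attn)
    visual_scale = min_share / visual_mass
    text_scale = (1.0 - min_share) / text_mass
    return [a * (visual_scale if v else text_scale) for a, v in zip(attn, is_visual)]
```

The rescaling keeps the relative ordering of weights within the visual group and within the text group, and the row still sums to 1.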

[201] Intra-finger Variability of Diffusion-based Latent Fingerprint Generation

Noor Hussein,Anil K. Jain,Karthik Nandakumar

Main category: cs.CV

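TL;DR: This paper systematically evaluates the intra-finger variability of synthetic fingerprints (latent prints in particular) generated by a diffusion model. It builds a latent style bank curated from seven datasets to increase style diversity and develops a semi-automated framework to analyze the integrity of ridges and minutiae in the generated prints. Generation largely preserves identity, but local or global inconsistencies appear when the reference image has poor-quality regions or the style embedding is mismatched, which exposes the limits of current synthetic models in diversity and identity consistency.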

Details Motivation: To evaluate how consistent the style and structure of diffusion-generated synthetic fingerprints are within a single finger, and to expose reliability bottlenecks for practical uses such as forensics and biometric testing. Method: The authors build a multi-source latent style bank covering more than 40 styles of surfaces and processing techniques, design a semi-automated framework to assess the integrity of ridges and minutiae in generated prints, and analyze how reference image quality and style-embedding match affect the output. Result: Generated fingerprints are largely stable in identity, but two kinds of inconsistency occur: (1) local inconsistencies, where poor-quality reference regions lead to added or removed minutiae; and (2) global inconsistencies, where a mismatched style embedding produces hallucinated ridge patterns. Conclusion: Existing synthetic fingerprint generators are limited, and these models need further improvement to enhance diversity and identity consistency at the same time. Abstract: The primary goal of this work is to systematically evaluate the intra-finger variability of synthetic fingerprints (particularly latent prints) generated using a state-of-the-art diffusion model. Specifically, we focus on enhancing the latent style diversity of the generative model by constructing a comprehensive "latent style bank" curated from seven diverse datasets, which enables the precise synthesis of latent prints with over 40 distinct styles encapsulating different surfaces and processing techniques. We also implement a semi-automated framework to understand the integrity of fingerprint ridges and minutiae in the generated impressions. Our analysis indicates that though the generation process largely preserves the identity, a small number of local inconsistencies (addition and removal of minutiae) are introduced, especially when there are poor quality regions in the reference image. Furthermore, mismatch between the reference image and the chosen style embedding that guides the generation process introduces global inconsistencies in the form of hallucinated ridge patterns. These insights highlight the limitations of existing synthetic fingerprint generators and the need to further improve these models to simultaneously enhance both diversity and identity consistency.

[202] U$^{2}$Flow: Uncertainty-Aware Unsupervised Optical Flow Estimation

Xunpei Sun,Wenwei Lin,Yi Chang,Gang Chen

Main category: cs.CV

TL;DR: U²Flow is the first recurrent unsupervised optical flow method that jointly estimates flow and per-pixel uncertainty using augmentation consistency and Laplace-based likelihood, improving robustness and interpretability.

Details Motivation: Unsupervised optical flow methods typically lack reliable uncertainty estimation, limiting their robustness and interpretability. Method: U²Flow proposes a decoupled learning strategy that derives uncertainty supervision from augmentation consistency via a Laplace-based maximum likelihood objective; integrates predicted uncertainty to guide adaptive flow refinement and dynamically modulate regional smoothness loss; and introduces an uncertainty-guided bidirectional flow fusion mechanism. Result: U²Flow achieves state-of-the-art performance among unsupervised methods on KITTI and Sintel benchmarks while producing highly reliable uncertainty maps. Conclusion: The joint estimation paradigm of optical flow and uncertainty is effective and enhances both performance and interpretability in unsupervised settings. Abstract: Unsupervised optical flow methods typically lack reliable uncertainty estimation, limiting their robustness and interpretability. We propose U$^{2}$Flow, the first recurrent unsupervised framework that jointly estimates optical flow and per-pixel uncertainty. The core innovation is a decoupled learning strategy that derives uncertainty supervision from augmentation consistency via a Laplace-based maximum likelihood objective, enabling stable training without ground truth. The predicted uncertainty is further integrated into the network to guide adaptive flow refinement and dynamically modulate the regional smoothness loss. Furthermore, we introduce an uncertainty-guided bidirectional flow fusion mechanism that enhances robustness in challenging regions. Extensive experiments on KITTI and Sintel demonstrate that U$^{2}$Flow achieves state-of-the-art performance among unsupervised methods while producing highly reliable uncertainty maps, validating the effectiveness of our joint estimation paradigm. The code is available at https://github.com/sunzunyi/U2FLOW.
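
The Laplace-based maximum likelihood objective mentioned above has a standard closed form: for a residual r and scale b, the negative log-likelihood is log(2b) + |r|/b. The sketch below shows only that formula. Treating the residual as the disagreement between flow predictions on an original and an augmented view is an assumption about how augmentation consistency supplies the supervision, and `laplace_nll` is a hypothetical name.

```python
import math


def laplace_nll(residual, scale):
    """Negative log-likelihood of `residual` under a zero-mean Laplace
    distribution with scale b = `scale`: log(2b) + |r| / b.

    In an uncertainty-aware flow setting, `residual` could be the per-pixel
    disagreement between two predictions and `scale` the predicted
    uncertainty (assumed usage, not taken from the paper).
    """
    if scale <= 0.0:
        raise ValueError("scale must be positive")
    return math.log(2.0 * scale) + abs(residual) / scale
```

The loss is minimized when the predicted scale equals the absolute residual, so a network trained on it is pushed to report large uncertainty where its predictions disagree and small uncertainty where they agree.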

[203] On The Application of Linear Attention in Multimodal Transformers

Armin Gerami,Seyedehanita Madani,Ramani Duraiswami

Main category: cs.CV

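TL;DR: This paper investigates Linear Attention (LA) as an efficient alternative in multimodal Transformers. It reduces attention cost from quadratic to linear in sequence length, which improves scalability while keeping performance competitive.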

Details Motivation: Transformer-based vision-language models are hard to scale because attention cost grows quadratically, so a more efficient attention mechanism is needed. Method: Linear Attention (LA) is integrated into a multimodal framework and trained on LAION-400M with ViT-S/16, ViT-B/16, and ViT-L/16 architectures; performance is validated by ImageNet-21K zero-shot accuracy. Result: Linear Attention yields significant computational savings while following the same scaling laws as standard softmax attention and remaining competitive in performance. Conclusion: Linear Attention is a robust, scalable option for multimodal Transformers that must process increasingly large and complex datasets. Abstract: Multimodal Transformers serve as the backbone for state-of-the-art vision-language models, yet their quadratic attention complexity remains a critical barrier to scalability. In this work, we investigate the viability of Linear Attention (LA) as a high-efficiency alternative within multimodal frameworks. By integrating LA, we reduce the computational overhead from quadratic to linear relative to sequence length while preserving competitive performance. We evaluate our approach across ViT-S/16, ViT-B/16, and ViT-L/16 architectures trained on the LAION-400M dataset, with validation focused on ImageNet-21K zero-shot accuracy. Our systematic evaluation demonstrates that Linear Attention not only yields significant computational savings but also adheres to the same scaling laws as standard softmax attention. These findings position Linear Attention as a robust, scalable solution for next-generation multimodal Transformers tasked with processing increasingly large and complex datasets.
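
The quadratic-to-linear reduction comes from reordering the attention product: with a positive feature map phi, the key-value summary is accumulated once and reused for every query. The abstract does not say which feature map the authors use, so the sketch below assumes the common choice phi(x) = elu(x) + 1; all function names are illustrative.

```python
import math


def phi(x):
    """Feature map elu(x) + 1, which is strictly positive (assumed choice)."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def linear_attention(queries, keys, values):
    """O(n) in sequence length: build sum_j phi(k_j) v_j^T and sum_j phi(k_j)
    once, then reuse both for every query."""
    d_k, d_v = len(keys[0]), len(values[0])
    kv = [[0.0] * d_v for _ in range(d_k)]
    k_sum = [0.0] * d_k
    for k, v in zip(keys, values):
        fk = phi(k)
        for a in range(d_k):
            k_sum[a] += fk[a]
            for b in range(d_v):
                kv[a][b] += fk[a] * v[b]
    out = []
    for q in queries:
        fq = phi(q)
        norm = dot(fq, k_sum)
        out.append([sum(fq[a] * kv[a][b] for a in range(d_k)) / norm
                    for b in range(d_v)])
    return out


def quadratic_reference(queries, keys, values):
    """O(n^2) form of the same kernelized attention, for comparison."""
    out = []
    for q in queries:
        fq = phi(q)
        w = [dot(fq, phi(k)) for k in keys]
        total = sum(w)
        out.append([sum(wi * v[b] for wi, v in zip(w, values)) / total
                    for b in range(len(values[0]))])
    return out
```

Both functions compute the same output; only the order of operations, and therefore the cost as the sequence grows, differs.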

[204] Spotlight and Shadow: Attention-Guided Dual-Anchor Introspective Decoding for MLLM Hallucination Mitigation

Yebo Wu,Han Jin,Zhijiang Guo,Li Li

Main category: cs.CV

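TL;DR: This paper proposes Dual-Anchor Introspective Decoding (DaID), a contrastive decoding framework that dynamically calibrates each token's generation by mining the model's internal perceptual discrepancies, in order to mitigate hallucination in multimodal large language models (MLLMs).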

Details Motivation: Multimodal large language models (MLLMs) show strong reasoning ability but still hallucinate, producing text that contradicts the visual content. Method: DaID identifies a Spotlight layer to amplify visual factual signals and a Shadow layer to suppress textual inertia, and uses visual attention distributions to guide the selection of these two anchors, giving precise token-specific adaptation. Result: Experiments across multiple benchmarks and MLLMs show that DaID significantly mitigates hallucination while also improving general reasoning. Conclusion: DaID is an effective and general decoding mechanism that improves the factual consistency and reasoning robustness of MLLMs by working at the level of internal perception. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable reasoning capabilities yet continue to suffer from hallucination, where generated text contradicts visual content. In this paper, we introduce Dual-Anchor Introspective Decoding (DaID), a novel contrastive decoding framework that dynamically calibrates each token generation by mining the model's internal perceptual discrepancies. Specifically, DaID identifies a Spotlight layer to amplify visual factual signals and a Shadow layer to suppress textual inertia. By leveraging visual attention distributions to guide this dual-anchor selection process, our method ensures precise, token-specific adaptation. Experimental results across multiple benchmarks and MLLMs demonstrate that DaID significantly mitigates hallucination while enhancing general reasoning capabilities.

[205] DocRevive: A Unified Pipeline for Document Text Restoration

Kunal Purkayastha,Ayan Banerjee,Josep Llados,Umapada Pal

Main category: cs.CV

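TL;DR: This paper proposes a unified document text reconstruction pipeline that combines OCR, image analysis, masked language modeling, and diffusion models. It builds OPRB, a synthetic dataset of 30,078 degraded document images, and designs UCSM, an evaluation metric that combines edit, semantic, and length similarity with contextual predictability.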

Details Motivation: Reconstructing damaged, occluded, or incomplete text is an important but under-studied problem in document understanding, and high-quality reconstruction helps downstream document understanding tasks. Method: The paper proposes a unified pipeline combining OCR, image analysis, masked language modeling, and diffusion models; builds the synthetic degraded-document dataset OPRB; uses an occlusion detector to find degraded regions; reconstructs text through inpainting and a diffusion module so that it is semantically coherent and consistent in font, size, and alignment; and proposes the UCSM evaluation metric. Result: The method is validated on synthetic and real degraded documents; the OPRB dataset and the code are released; and UCSM gives a more complete assessment of reconstruction quality. Conclusion: The work advances document restoration and provides new tools and a new benchmark for archival research and digital preservation. Abstract: In Document Understanding, the challenge of reconstructing damaged, occluded, or incomplete text remains a critical yet unexplored problem. Subsequent document understanding tasks can benefit from a document reconstruction process. In response, this paper presents a novel unified pipeline combining state-of-the-art Optical Character Recognition (OCR), advanced image analysis, masked language modeling, and diffusion-based models to restore and reconstruct text while preserving visual integrity. We create a synthetic dataset of 30,078 degraded document images that simulates diverse document degradation scenarios, setting a benchmark for restoration tasks. Our pipeline detects and recognizes text, identifies degradation with an occlusion detector, and uses an inpainting model for semantically coherent reconstruction. A diffusion-based module seamlessly reintegrates text, matching font, size, and alignment. To evaluate restoration quality, we propose a Unified Context Similarity Metric (UCSM), incorporating edit, semantic, and length similarities with a contextual predictability measure that penalizes deviations when the correct text is contextually obvious. Our work advances document restoration, benefiting archival research and digital preservation while setting a new standard for text reconstruction. The OPRB dataset and code are available on Hugging Face (https://huggingface.co/datasets/kpurkayastha/OPRB) and GitHub (https://github.com/kunalpurkayastha/DocRevive), respectively.

[206] Attention-Guided Dual-Stream Learning for Group Engagement Recognition: Fusing Transformer-Encoded Motion Dynamics with Scene Context via Adaptive Gating

Saniah Kayenat Chowdhury,Muhammad E. H. Chowdhury

Main category: cs.CV

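TL;DR: This paper proposes DualEngage, a two-stream framework for recognizing group-level student engagement from classroom videos. It fuses individual motion dynamics with scene-level spatiotemporal information and reaches high accuracy (96.21%) in group engagement assessment.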

Details Motivation: Most existing automated engagement recognition methods target online classrooms or the individual level only, so effective methods for modeling group-level engagement in real classrooms are lacking. Method: DualEngage is a two-stream framework. The primary stream models individual motion dynamics (detection and tracking, dense optical flow, Transformer encoding, and attention pooling); the secondary stream extracts scene-level spatiotemporal features from the full clip with a pretrained 3D ResNet; and the two streams are combined by softmax-gated fusion with dynamic weights. Result: With fivefold cross-validation on the Classroom Group Engagement Dataset built by Ocean University of China, average classification accuracy is 0.9621 ± 0.0161 and macro-averaged F1 is 0.9530 ± 0.0204; an ablation study compares single-stream variants against the two-stream model. Conclusion: DualEngage is among the first works in classroom engagement recognition to adopt a dual-stream design that explicitly uses motion cues, jointly modeling individual behavior and group dynamics. Abstract: Student engagement is crucial for improving learning outcomes in group activities. Highly engaged students both perform better individually and contribute to overall group success. However, most existing automated engagement recognition methods are designed for online classrooms or estimate engagement at the individual level. Addressing this gap, we propose DualEngage, a novel two-stream framework for group-level engagement recognition from in-classroom videos. It models engagement as a joint function of both individual and group-level behaviors. The primary stream models person-level motion dynamics by detecting and tracking students, extracting dense optical flow with the Recurrent All-Pairs Field Transforms network, encoding temporal motion patterns using a transformer encoder, and finally aggregating per-student representations through attention pooling into a unified representation. The secondary stream captures scene-level spatiotemporal information from the full video clip, leveraging a pretrained three-dimensional Residual Network. The two-stream representations are combined via softmax-gated fusion, which dynamically weights each stream's contribution based on the joint context of both features. DualEngage learns a joint representation of individual actions with overarching group dynamics. We evaluate the proposed approach using fivefold cross-validation on the Classroom Group Engagement Dataset developed by Ocean University of China, achieving an average classification accuracy of 0.9621 ± 0.0161 with a macro-averaged F1 of 0.9530 ± 0.0204.
To understand the contribution of each branch, we further conduct an ablation study comparing single-stream variants against the two-stream model. This work is among the first in classroom engagement recognition to adopt a dual-stream design that explicitly leverages motion cues as an estimator.
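
Softmax-gated fusion, as described in the DualEngage abstract, weights each stream by a gate computed from both feature vectors. The sketch below is a minimal illustration under stated assumptions: the two features share one dimension, the gate is a single linear layer with two outputs, and `gate_weights` and `gate_bias` are hypothetical parameters that would be learned in practice.

```python
import math


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def gated_fusion(motion_feat, scene_feat, gate_weights, gate_bias):
    """Fuse two equal-length feature vectors with a two-way softmax gate.

    gate_weights has two rows, each as long as the concatenated features;
    gate_bias has two entries (both hypothetical learned parameters).
    """
    joint = list(motion_feat) + list(scene_feat)  # joint context of both streams
    scores = [sum(w * x for w, x in zip(row, joint)) + b
              for row, b in zip(gate_weights, gate_bias)]
    g_motion, g_scene = softmax(scores)
    fused = [g_motion * m + g_scene * s for m, s in zip(motion_feat, scene_feat)]
    return fused, (g_motion, g_scene)
```

Because the gates come from a softmax, they are positive and sum to 1, so the fused vector is a convex combination of the two streams whose mix depends on the input.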

[207] MatRes: Zero-Shot Test-Time Model Adaptation for Simultaneous Matching and Restoration

Kanggeon Lee,Soochahn Lee,Kyoung Mu Lee

Main category: cs.CV

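TL;DR: This paper proposes MatRes, a zero-shot test-time adaptation framework that jointly improves image restoration quality and geometric matching accuracy from a single pair of low-quality and high-quality images, with no offline training or additional supervision.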

Details Motivation: Real-world image pairs often combine severe degradation with large viewpoint changes, so image restoration and geometric matching interfere with each other. Method: MatRes enforces conditional similarity at corresponding locations and updates only lightweight modules, keeping all pretrained components frozen. Result: Across diverse combinations, MatRes yields significant gains in both restoration and geometric alignment compared with using a restoration or matching model alone. Conclusion: MatRes is a practical and widely applicable solution for jointly handling multi-view, mixed-quality images in real scenes, and it eases the mutual interference between restoration and matching. Abstract: Real-world image pairs often exhibit both severe degradations and large viewpoint changes, making image restoration and geometric matching mutually interfering tasks when treated independently. In this work, we propose MatRes, a zero-shot test-time adaptation framework that jointly improves restoration quality and correspondence estimation using only a single low-quality and high-quality image pair. By enforcing conditional similarity at corresponding locations, MatRes updates only lightweight modules while keeping all pretrained components frozen, requiring no offline training or additional supervision. Extensive experiments across diverse combinations show that MatRes yields significant gains in both restoration and geometric alignment compared to using either restoration or matching models alone. MatRes offers a practical and widely applicable solution for real-world scenarios where users commonly capture multiple images of a scene with varying viewpoints and quality, effectively addressing the often-overlooked mutual interference between matching and restoration.

[208] Active Diffusion Matching: Score-based Iterative Alignment of Cross-Modal Retinal Images

Kanggeon Lee,Su Jeong Song,Soochahn Lee,Kyoung Mu Lee

Main category: cs.CV

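TL;DR: This paper proposes Active Diffusion Matching (ADM), a cross-modal fundus image alignment method for Standard Fundus Images (SFIs) and Ultra-Widefield Fundus Images (UWFIs), which are hard to align because of their large difference in field of view and the amorphous appearance of the retina. ADM couples two interdependent score-based diffusion models that jointly estimate global transformations and local deformations through an iterative Langevin Markov chain, and adds custom sampling strategies for robustness. It reaches state-of-the-art results on a private SFI-UWFI dataset and a public SFI-SFI dataset.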

Details Motivation: Existing image alignment methods cannot accurately align Standard Fundus Images (SFIs) with Ultra-Widefield Fundus Images (UWFIs), which differ greatly in field of view and appearance, and no specialized method exists for this task. Method: Active Diffusion Matching (ADM) integrates two interdependent score-based diffusion models that jointly estimate global transformations and local deformations through an iterative Langevin Markov chain, with custom sampling strategies that adapt to the given image pair. Result: mAUC improves by 5.2 points on the private SFI-UWFI dataset and by 0.4 points on the public SFI-SFI dataset, reaching state-of-the-art accuracy. Conclusion: ADM addresses the previously unaddressed problem of SFI-UWFI alignment and, by jointly optimizing global and local alignment, is highly effective for cross-modal image alignment. Abstract: Objective: The study aims to address the challenge of aligning Standard Fundus Images (SFIs) and Ultra-Widefield Fundus Images (UWFIs), which is difficult due to their substantial differences in viewing range and the amorphous appearance of the retina. Currently, no specialized method exists for this task, and existing image alignment techniques lack accuracy. Methods: We propose Active Diffusion Matching (ADM), a novel cross-modal alignment method. ADM integrates two interdependent score-based diffusion models to jointly estimate global transformations and local deformations via an iterative Langevin Markov chain. This approach facilitates a stochastic, progressive search for optimal alignment. Additionally, custom sampling strategies are introduced to enhance the adaptability of ADM to given input image pairs. Results: Comparative experimental evaluations demonstrate that ADM achieves state-of-the-art alignment accuracy. This was validated on two datasets: a private dataset of SFI-UWFI pairs and a public dataset of SFI-SFI pairs, with mAUC improvements of 5.2 and 0.4 points on the private and public datasets, respectively, compared to existing state-of-the-art methods. Conclusion: ADM effectively bridges the gap in aligning SFIs and UWFIs, providing an innovative solution to a previously unaddressed challenge. The method's ability to jointly optimize global and local alignment makes it highly effective for cross-modal image alignment tasks. Significance: ADM has the potential to transform the integrated analysis of SFIs and UWFIs, enabling better clinical utility and supporting learning-based image enhancements.
This advancement could significantly improve diagnostic accuracy and patient outcomes in ophthalmology.
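
The "iterative Langevin Markov chain" in the ADM abstract refers to a standard sampling scheme: each step moves along a score (the gradient of a log-density) and adds Gaussian noise. The sketch below shows the generic update on a one-dimensional toy problem. In ADM the score would come from learned diffusion models over alignment parameters; here it is replaced by the analytic score of a Gaussian, and every name is an illustrative assumption.

```python
import math
import random


def langevin_chain(score_fn, x0, step_size, n_steps, rng):
    """Unadjusted Langevin dynamics:
    x <- x + (step_size / 2) * score(x) + sqrt(step_size) * noise."""
    x = x0
    samples = []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + 0.5 * step_size * score_fn(x) + math.sqrt(step_size) * noise
        samples.append(x)
    return samples


def gaussian_score(mean, sigma):
    """Score of N(mean, sigma^2): d/dx log p(x) = -(x - mean) / sigma^2.
    Stands in for a learned score model in this toy example."""
    return lambda x: -(x - mean) / (sigma * sigma)


# Toy usage: start far from the mode and let the chain drift toward it.
chain = langevin_chain(gaussian_score(3.0, 1.0), x0=-5.0, step_size=0.1,
                       n_steps=5000, rng=random.Random(0))
```

The noise term is what makes the search stochastic and progressive, in the sense the abstract describes: early steps are dominated by the drift toward high-density regions, and later steps explore around them.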

[209] Particle Diffusion Matching: Random Walk Correspondence Search for the Alignment of Standard and Ultra-Widefield Fundus Images

Kanggeon Lee,Soochahn Lee,Kyoung Mu Lee

Main category: cs.CV

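TL;DR: This paper proposes Particle Diffusion Matching (PDM), a robust method for aligning Standard Fundus Images (SFIs) with Ultra-Widefield Fundus Images (UWFIs). It uses an iterative Random Walk Correspondence Search (RWCS) guided by a diffusion model and substantially improves multi-modal fundus image alignment.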

Details Motivation: SFIs and UWFIs differ greatly in scale and appearance and have few distinctive features, so conventional alignment methods perform poorly and a robust, accurate cross-modal alignment technique is needed. Method: Particle Diffusion Matching (PDM) uses a diffusion model at each iteration to estimate displacement vectors for particle points, taking into account local appearance, the structural distribution of particles, and an estimated global transformation, so that correspondences are refined progressively. Result: PDM reaches state-of-the-art performance on multiple retinal image alignment benchmarks, with substantial improvement on the primary SFI-UWFI dataset, and is shown to be effective in real clinical scenarios. Conclusion: PDM overcomes the limitations of existing methods for multi-modal fundus image alignment and offers a new direction for downstream ophthalmology tasks such as supervised learning, disease diagnosis, and multi-modal analysis. Abstract: We propose a robust alignment technique for Standard Fundus Images (SFIs) and Ultra-Widefield Fundus Images (UWFIs), which are challenging to align due to differences in scale, appearance, and the scarcity of distinctive features. Our method, termed Particle Diffusion Matching (PDM), performs alignment through an iterative Random Walk Correspondence Search (RWCS) guided by a diffusion model. At each iteration, the model estimates displacement vectors for particle points by considering local appearance, the structural distribution of particles, and an estimated global transformation, enabling progressive refinement of correspondences even under difficult conditions. PDM achieves state-of-the-art performance across multiple retinal image alignment benchmarks, showing substantial improvement on a primary dataset of SFI-UWFI pairs and demonstrating its effectiveness in real-world clinical scenarios. By providing accurate and scalable correspondence estimation, PDM overcomes the limitations of existing methods and facilitates the integration of complementary retinal image modalities. This diffusion-guided search strategy offers a new direction for improving downstream supervised learning, disease diagnosis, and multi-modal image analysis in ophthalmology.

[210] Global monitoring of methane point sources using deep learning on hyperspectral radiance measurements from EMIT

Vishal V. Batchu,Michelangelo Conserva,Alex Wilson,Anna M. Michalak,Varun Gulshan,Philip G. Brodrick,Andrew K. Thorpe,Christopher V. Arsdale

Main category: cs.CV

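TL;DR: This paper proposes MAPL-EMIT, an end-to-end vision transformer model that uses the full radiance spectrum from the EMIT instrument to automatically detect, quantify, and localize methane plumes. It markedly improves detection sensitivity and efficiency and supports high-throughput, facility-scale methane monitoring worldwide.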

Details Motivation: Anthropogenic methane point sources drive near-term climate forcing, safety hazards, and system inefficiencies. Existing satellite approaches largely rely on manual plume identification, which is slow and has high detection limits, so an automated, highly sensitive global monitoring method is needed. Method: MAPL-EMIT is an end-to-end vision transformer model that combines the full EMIT radiance spectrum with spatial context to retrieve methane enhancements for every pixel in a scene. It is trained on 3.6 million physics-based synthetic plumes and uses metrics such as spectral fit scores and estimated noise levels to limit false positives. Result: On synthetic data the model shows high recall and precision and captures weaker plumes than matched-filter approaches. On 1084 real EMIT granules it captures 79% of known NASA L2B plume complexes and twice as many plausible plumes as human analysts identified. Validation against coincident airborne data, top-emitting landfills, and controlled release experiments confirms that it can find previously uncaptured sources. Conclusion: MAPL-EMIT shifts methane monitoring from manual interpretation to automated, high-throughput, facility-scale global plume mapping. Abstract: Anthropogenic methane (CH4) point sources drive near-term climate forcing, safety hazards, and system inefficiencies. Space-based imaging spectroscopy is emerging as a tool for identifying emissions globally, but existing approaches largely rely on manual plume identification. Here we present the Methane Analysis and Plume Localization with EMIT (MAPL-EMIT) model, an end-to-end vision transformer framework that leverages the complete radiance spectrum from the Earth Surface Mineral Dust Source Investigation (EMIT) instrument to jointly retrieve methane enhancements across all pixels within a scene. This approach brings together spectral and spatial context to significantly lower detection limits. MAPL-EMIT simultaneously supports enhancement quantification, plume delineation, and source localization, even for multiple overlapping plumes. The model was trained on 3.6 million physics-based synthetic plumes injected into global EMIT radiance data. Synthetic evaluation confirms the model's ability to identify plumes with high recall and precision and to capture weaker plumes relative to existing matched-filter approaches. On real-world benchmarks, MAPL-EMIT captures 79% of known hand-annotated NASA L2B plume complexes across a test set of 1084 EMIT granules, while capturing twice as many plausible plumes as human analysts identified. Further validation against coincident airborne data, top-emitting landfills, and controlled release experiments confirms the model's ability to identify previously uncaptured sources.
By incorporating model-generated metrics such as spectral fit scores and estimated noise levels, the framework can further limit false-positive rates. Overall, MAPL-EMIT enables high-throughput implementation on the full EMIT catalog, shifting methane monitoring from labor-intensive workflows to a rapid, scalable paradigm for global plume mapping at the facility scale.

[211] Mining Attribute Subspaces for Efficient Fine-tuning of 3D Foundation Models

Yu Jiang,Hanwen Jiang,Ahmed Abdelkader,Wen-Sheng Chu,Brandon Y. Feng,Zhangyang Wang,Qixing Huang

Main category: cs.CV

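TL;DR: This paper proposes a method that uses synthetic data to derive LoRA subspaces. It shows that variations in 3D data such as texture, geometry, camera motion, and lighting correspond to approximately disentangled LoRA subspaces, and that combining these subspaces gives a smaller, more efficient LoRA adapter that improves accuracy on real downstream tasks.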

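Details Motivation: LoRA is the dominant paradigm for fine-tuning 3D foundation models, but 3D data varies markedly in texture, geometry, camera motion, and lighting. This raises three basic questions: whether a LoRA subspace exists for each type of variation, whether these subspaces are disentangled, and how to compute them effectively. Method: The authors generate synthetic 3D datasets with controlled variations, fine-tune a LoRA adapter on each dataset, and extract the LoRA subspace associated with each type of variation; they then analyze orthogonality (disentanglement) and combine the subspaces into a reduced LoRA subspace. Result: The subspaces for different variation types are approximately disentangled; the reduced LoRA subspace generalizes to real data and gives more efficient fine-tuning with better downstream prediction accuracy; an ablation study supports the design choices. Conclusion: LoRA subspaces can be associated with specific types of variation in 3D data and are approximately disentangled, and a reduced LoRA subspace learned from synthetic data transfers to real scenes, which supports efficient fine-tuning of 3D models. Abstract: With the emergence of 3D foundation models, there is growing interest in fine-tuning them for downstream tasks, where LoRA is the dominant fine-tuning paradigm. As 3D datasets exhibit distinct variations in texture, geometry, camera motion, and lighting, there are interesting fundamental questions: 1) Are there LoRA subspaces associated with each type of variation? 2) Are these subspaces disentangled (i.e., orthogonal to each other)? 3) How do we compute them effectively? This paper provides answers to all these questions. We introduce a robust approach that generates synthetic datasets with controlled variations, fine-tunes a LoRA adapter on each dataset, and extracts a LoRA subspace associated with each type of variation. We show that these subspaces are approximately disentangled. Integrating them leads to a reduced LoRA subspace that enables efficient LoRA fine-tuning with improved prediction accuracy for downstream tasks. In particular, we show that such a reduced LoRA subspace, despite being derived entirely from synthetic data, generalizes to real datasets. An ablation study validates the effectiveness of the choices in our approach.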

[212] ABot-Claw: A Foundation for Persistent, Cooperative, and Self-Evolving Robotic Agents

Dongjie Huo,Haoyun Liu,Guoqing Liu,Dekang Qi,Zhiming Sun,Maoguo Gao,Jianxin He,Yandan Yang,Xinyuan Chang,Feng Xiong,Xing Wei,Zhiheng Ma,Mu Xu

Main category: cs.CV

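TL;DR: ABot-Claw is an embodied agent architecture for open-world environments. It bridges the gap between high-level reasoning and low-level physical execution through a unified embodiment interface, a visual-centric cross-embodiment multimodal memory, and a critic-based closed-loop feedback mechanism.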

Details Motivation: Existing embodied intelligent systems struggle to combine high-level reasoning with low-level physical execution in open-world settings: VLA models are open-loop, System 2 agents are confined to closed sandboxes, and OpenClaw lacks an embodied control architecture. Method: ABot-Claw integrates (1) a unified embodiment interface with capability-driven scheduling for coordinating heterogeneous robots; (2) a visual-centric cross-embodiment multimodal memory for persistent context retention and grounded retrieval; and (3) a critic-based closed-loop feedback mechanism with a generalist reward model for online evaluation, local correction, and replanning. It uses a decoupled three-layer architecture (OpenClaw layer, shared service layer, robot embodiment layer). Result: ABot-Claw closes the loop from natural language intent to physical action and supports real-world interaction and progressive self-evolution for multiple robots in open, dynamic environments. Conclusion: ABot-Claw offers a feasible architecture for building scalable, evolving, strongly embodied open-world agents. Abstract: Current embodied intelligent systems still face a substantial gap between high-level reasoning and low-level physical execution in open-world environments. Although Vision-Language-Action (VLA) models provide strong perception and intuitive responses, their open-loop nature limits long-horizon performance. Agents incorporating System 2 cognitive mechanisms improve planning, but usually operate in closed sandboxes with predefined toolkits and limited real-system control. OpenClaw provides a localized runtime with full system privileges, but lacks the embodied control architecture required for long-duration, multi-robot execution. We therefore propose ABot-Claw, an embodied extension of OpenClaw that integrates: 1) a unified embodiment interface with capability-driven scheduling for heterogeneous robot coordination; 2) a visual-centric cross-embodiment multimodal memory for persistent context retention and grounded retrieval; and 3) a critic-based closed-loop feedback mechanism with a generalist reward model for online progress evaluation, local correction, and replanning. With a decoupled architecture spanning the OpenClaw layer, shared service layer, and robot embodiment layer, ABot-Claw enables real-world interaction, closes the loop from natural language intent to physical action, and supports progressively self-evolving robotic agents in open, dynamic environments.

[213] Degradation-Consistent Paired Training for Robust AI-Generated Image Detection

Zongyou Yang,Yinghan Hou,Xiaokun Yang

Main category: cs.CV

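TL;DR: This paper proposes Degradation-Consistent Paired Training (DCPT), which adds feature and prediction consistency constraints between clean and degraded views during training. It markedly improves the robustness of AI-generated image detectors under real-world degradations such as JPEG compression, with no extra parameters or inference overhead.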

Details Motivation: Existing AI-generated image detectors lose considerable accuracy under real-world degradations (JPEG compression, Gaussian blur, downsampling), and current state-of-the-art methods such as B-Free treat degradation robustness as a byproduct of data augmentation instead of an explicit training objective. Method: DCPT builds a clean view and a degraded view for each training image and imposes two consistency constraints: (1) a feature consistency loss that minimizes cosine distance, and (2) a prediction consistency loss that aligns output distributions using symmetric KL divergence. The method adds zero parameters and zero inference overhead. Result: On the Synthbuster benchmark (9 generators, 8 degradation conditions), DCPT raises average accuracy under degradation by 9.1 percentage points over an identical baseline without paired training, at a cost of only 0.9% clean accuracy; the gain is largest under JPEG compression (+15.7% to +17.9%). Ablation shows that improving the training objective is more effective than adding architectural components. Conclusion: Explicitly enforcing degradation consistency is key to detector robustness; DCPT achieves a large gain with a very simple design, which supports optimizing the training objective over extending the architecture. Abstract: AI-generated image detectors suffer significant performance degradation under real-world image corruptions such as JPEG compression, Gaussian blur, and resolution downsampling. We observe that state-of-the-art methods, including B-Free, treat degradation robustness as a byproduct of data augmentation rather than an explicit training objective. In this work, we propose Degradation-Consistent Paired Training (DCPT), a simple yet effective training strategy that explicitly enforces robustness through paired consistency constraints. For each training image, we construct a clean view and a degraded view, then impose two constraints: a feature consistency loss that minimizes the cosine distance between clean and degraded representations, and a prediction consistency loss based on symmetric KL divergence that aligns output distributions across views. DCPT adds zero additional parameters and zero inference overhead. Experiments on the Synthbuster benchmark (9 generators, 8 degradation conditions) demonstrate that DCPT improves the degraded-condition average accuracy by 9.1 percentage points compared to an identical baseline without paired training, while sacrificing only 0.9% clean accuracy. The improvement is most pronounced under JPEG compression (+15.7% to +17.9%). Ablation further reveals that adding architectural components leads to overfitting on limited training data, confirming that training objective improvement is more effective than architectural augmentation for degradation robustness.
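
The two DCPT constraints named in the abstract, cosine distance between features and symmetric KL divergence between predictions, can be written down directly. The sketch below is a minimal illustration: the loss weights `w_feat` and `w_pred`, the choice to sum (not average) the two KL directions, and the function names are assumptions, since the abstract does not give these details.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; q must be positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def symmetric_kl(p, q):
    """Sum of both KL directions (assumed form of the symmetric divergence)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


def paired_consistency_loss(feat_clean, feat_degraded, prob_clean, prob_degraded,
                            w_feat=1.0, w_pred=1.0):
    """Feature consistency plus prediction consistency for one clean/degraded
    pair. The weights are hypothetical hyperparameters."""
    return (w_feat * cosine_distance(feat_clean, feat_degraded)
            + w_pred * symmetric_kl(prob_clean, prob_degraded))
```

Both terms vanish when the degraded view produces the same features and predictions as the clean view, and both are computed only during training, which is consistent with the abstract's claim of zero inference overhead.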

[214] Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation

Ruibin Li,Tao Yang,Fangzhou Ai,Tianhe Wu,Shilei Wen,Bingyue Peng,Lei Zhang

Main category: cs.CV

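TL;DR: This paper proposes Hybrid Forcing, which uses a hybrid attention design (lightweight linear temporal attention plus block-sparse local attention) to improve long-range modeling and computational efficiency in streaming video generation. Combined with a staged distillation strategy, it reaches state-of-the-art performance while staying real-time.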

Details Motivation: Sliding window attention (SWA) loses distant history in long video generation, and its computational cost makes real-time deployment difficult. Method: Hybrid Forcing combines (1) lightweight linear temporal attention, which keeps a compact key-value state that incrementally absorbs evicted tokens to preserve long-range temporal context; (2) block-sparse attention, which removes redundant computation in short-range modeling within the local window; and (3) a staged, decoupled distillation strategy that first distills under dense attention and then switches to the hybrid attention for streaming modeling. Result: The method is state of the art on both short- and long-form video generation benchmarks, and achieves real-time, unbounded 832x480 video generation at 29.5 FPS on a single H100 GPU without quantization or compression. Conclusion: Hybrid Forcing balances temporal modeling capacity and computational efficiency in streaming video generation and supports high-quality real-time generation. Abstract: Streaming video generation (SVG) distills a pretrained bidirectional video diffusion model into an autoregressive model equipped with sliding window attention (SWA). However, SWA inevitably loses distant history during long video generation, and its computational overhead remains a critical challenge to real-time deployment. In this work, we propose Hybrid Forcing, which jointly optimizes temporal information retention and computational efficiency through a hybrid attention design. First, we introduce lightweight linear temporal attention to preserve long-range dependencies beyond the sliding window. In particular, we maintain a compact key-value state to incrementally absorb evicted tokens, retaining temporal context with negligible memory and computational overhead. Second, we incorporate block-sparse attention into the local sliding window to reduce redundant computation within short-range modeling, reallocating computational capacity toward more critical dependencies. Finally, we introduce a decoupled distillation strategy tailored to the hybrid attention design. A few-step initial distillation is performed under dense attention, then the distillation of our proposed linear temporal and block-sparse attention is activated for streaming modeling, ensuring stable optimization. Extensive experiments on both short- and long-form video generation benchmarks demonstrate that Hybrid Forcing consistently achieves state-of-the-art performance. Notably, our model achieves real-time, unbounded 832x480 video generation at 29.5 FPS on a single NVIDIA H100 GPU without quantization or model compression.
The source code and trained models are available at https://github.com/leeruibin/hybrid-forcing.
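
The idea of a compact key-value state that absorbs evicted tokens can be illustrated with a toy sketch. It is not the authors' implementation: the `elu(x) + 1` feature map, the class name `CompactKVState`, and the helper `stream_tokens` are assumptions chosen for illustration, following the generic linear-attention recurrence. It only shows how memory stays constant while evicted tokens remain readable.

```python
import math
from collections import deque


def feature_map(x):
    # elu(x) + 1: a common positive feature map for linear attention (assumption).
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


class CompactKVState:
    """Fixed-size summary of every token evicted from the sliding window."""

    def __init__(self, d_k, d_v):
        self.S = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
        self.z = [0.0] * d_k                        # running sum of phi(k)

    def absorb(self, key, value):
        phi = feature_map(key)
        for i, p in enumerate(phi):
            self.z[i] += p
            for j, v in enumerate(value):
                self.S[i][j] += p * v

    def read(self, query):
        phi = feature_map(query)
        denom = sum(p * z for p, z in zip(phi, self.z))
        d_v = len(self.S[0])
        if denom == 0.0:
            return [0.0] * d_v  # nothing has been absorbed yet
        return [sum(phi[i] * self.S[i][j] for i in range(len(phi))) / denom
                for j in range(d_v)]


def stream_tokens(tokens, window_size, state):
    """Keep the newest tokens in the window; fold evicted ones into the state."""
    window = deque()
    for key, value in tokens:
        window.append((key, value))
        if len(window) > window_size:
            old_key, old_value = window.popleft()
            state.absorb(old_key, old_value)
    return window
```

The state has size `d_k * d_v + d_k` no matter how many tokens have been evicted, which is the property the abstract describes as negligible memory overhead. A real model would combine `read` with dense or block-sparse attention over the tokens still in the window.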

[215] VGGT-HPE: Reframing Head Pose Estimation as Relative Pose Prediction

Vasiliki Vasileiou,Panagiotis P. Filntisis,Petros Maragos,Kostas Daniilidis

Main category: cs.CV

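TL;DR: This paper proposes VGGT-HPE, a relative head pose estimation method that predicts the rigid transformation between two frames instead of an absolute pose, improving robustness and accuracy. Trained only on synthetic data, it achieves state-of-the-art results on the BIWI benchmark.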

Details Motivation: Conventional monocular head pose estimation regresses directly to an absolute pose, which forces the network to implicitly learn a dataset-specific reference frame; the authors argue that predicting the relative rigid transformation between two frames is simpler and more robust. Method: VGGT-HPE builds on a general-purpose geometry foundation model and is fine-tuned only on synthetic facial renderings, recasting the problem as estimating a geometric displacement from an explicitly provided anchor with a known pose. Result: Despite using no real-world training data, VGGT-HPE achieves state-of-the-art results on BIWI, outperforming absolute regression methods trained on mixed and real data; controlled easy- and hard-pair benchmarks show that relative prediction is more accurate and that the advantage grows with the difficulty of the target pose. Conclusion: Relative head pose estimation is a more fundamental and more robust formulation that removes the dependence on an implicit anchor and lets the anchor be chosen at test time to control prediction difficulty. Abstract: Monocular head pose estimation is traditionally formulated as direct regression from a single image to an absolute pose. This paradigm forces the network to implicitly internalize a dataset-specific canonical reference frame. In this work, we argue that predicting the relative rigid transformation between two observed head configurations is a fundamentally easier and more robust formulation. We introduce VGGT-HPE, a relative head pose estimator built upon a general-purpose geometry foundation model. Finetuned exclusively on synthetic facial renderings, our method sidesteps the need for an implicit anchor by reducing the problem to estimating a geometric displacement from an explicitly provided anchor with a known pose. As a practical benefit, the relative formulation also allows the anchor to be chosen at test time (for instance, a near-neutral frame or a temporally adjacent one) so that the prediction difficulty can be controlled by the application. Despite zero real-world training data, VGGT-HPE achieves state-of-the-art results on the BIWI benchmark, outperforming established absolute regression methods trained on mixed and real datasets. Through controlled easy- and hard-pair benchmarks, we also systematically validate our core hypothesis: relative prediction is intrinsically more accurate than absolute regression, with the advantage scaling alongside the difficulty of the target pose. Project page and code: https://vasilikivas.github.io/VGGT-HPE

[216] Dual-Branch Remote Sensing Infrared Image Super-Resolution

Xining Ge,Gengjia Chang,Weijun Yuan,Zhan Li,Zhanglu Chen,Boyang Yao,Yihang Chen,Yifan Deng,Shuhong Liu

Main category: cs.CV

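TL;DR: This paper presents a dual-branch infrared image super-resolution method (HAT-L + MambaIRv2-L) that uses test-time local conversion, self-ensemble, and equal-weight fusion. It achieves strong performance in the NTIRE 2026 challenge, supporting the view that strong local restoration and stable global modeling are complementary.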

Details Motivation: Infrared images are weakly textured and sensitive to unstable local sharpening, so a method must balance local detail reconstruction with global structural and radiometric stability, which existing methods struggle to do. Method: A dual-branch architecture in which the HAT-L branch is strong at local texture restoration and the MambaIRv2-L branch provides global state-space modeling. At inference, test-time local conversion is applied to HAT, eight-way self-ensemble is applied to MambaIRv2, and the two outputs are fused in image space with fixed equal weights. Result: On the official NTIRE 2026 test and on a self-constructed set of 12 synthetic ×4 thermal samples, the fused output outperforms either single branch in PSNR, SSIM, and overall Score. Conclusion: In infrared super-resolution, explicitly fusing strong local modeling (such as a Transformer) with stable global modeling (such as Mamba) improves performance, and the two are complementary. Abstract: Remote sensing infrared image super-resolution aims to recover sharper thermal observations from low-resolution inputs while preserving target contours, scene layout, and radiometric stability. Unlike visible-image super-resolution, thermal imagery is weakly textured and more sensitive to unstable local sharpening, which makes complementary local and global modeling especially important. This paper presents our solution to the NTIRE 2026 Infrared Image Super-Resolution Challenge, a dual-branch system that combines a HAT-L branch and a MambaIRv2-L branch. The inference pipeline applies test-time local conversion on HAT, eight-way self-ensemble on MambaIRv2, and fixed equal-weight image-space fusion. We report both the official challenge score and a reproducible evaluation on 12 synthetic times-four thermal samples derived from Caltech Aerial RGB-Thermal, on which the fused output outperforms either single branch in PSNR, SSIM, and the overall Score. The results suggest that infrared super-resolution benefits from explicit complementarity between locally strong transformer restoration and globally stable state-space modeling.
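
Eight-way self-ensemble and equal-weight image-space fusion are standard inference tricks, and a small sketch may make the pipeline easier to picture. It assumes images are 2D lists of numbers and that `model` is any callable mapping an image to an image; the function names are hypothetical and nothing here reproduces the challenge code.

```python
def hflip(img):
    return [row[::-1] for row in img]


def rot90_cw(img):
    return [list(r) for r in zip(*img[::-1])]


def rot90_ccw(img):
    return [list(r) for r in zip(*img)][::-1]


def forward(img, k, flip):
    # Optional horizontal flip, then k clockwise quarter turns.
    out = hflip(img) if flip else img
    for _ in range(k):
        out = rot90_cw(out)
    return out


def inverse(img, k, flip):
    # Undo `forward`: k counter-clockwise quarter turns, then undo the flip.
    out = img
    for _ in range(k):
        out = rot90_ccw(out)
    return hflip(out) if flip else out


def self_ensemble(model, img):
    """Average the model output over the 8 flip/rotation variants of the input."""
    outputs = [inverse(model(forward(img, k, flip)), k, flip)
               for flip in (False, True) for k in range(4)]
    rows, cols = len(outputs[0]), len(outputs[0][0])
    return [[sum(o[r][c] for o in outputs) / len(outputs) for c in range(cols)]
            for r in range(rows)]


def fuse_equal(img_a, img_b):
    """Fixed equal-weight fusion of two branch outputs in image space."""
    return [[0.5 * a + 0.5 * b for a, b in zip(ra, rb)]
            for ra, rb in zip(img_a, img_b)]
```

In the setting described above, `self_ensemble` would wrap the MambaIRv2 branch only, and `fuse_equal` would combine its output with the HAT branch output.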

[217] A Dual Cross-Attention Graph Learning Framework For Multimodal MRI-Based Major Depressive Disorder Detection

Nojod M. Alotaibi,Areej M. Alhothali

Main category: cs.CV

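TL;DR: This paper proposes a multimodal MRI fusion framework based on dual cross-attention for classifying major depressive disorder (MDD), which markedly improves joint modeling of rs-fMRI and sMRI.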

Details Motivation: A single MRI modality cannot fully capture the complex neurobiological changes of major depressive disorder (MDD), and existing multimodal fusion methods still struggle to integrate structural and functional MRI effectively. Method: A dual cross-attention multimodal fusion framework that explicitly models bidirectional interactions between structural MRI (sMRI) and resting-state functional MRI (rs-fMRI) representations, evaluated on the large-scale REST-meta-MDD dataset with several structural and functional brain atlases under 10-fold stratified cross-validation. Result: The best model reaches 84.71% accuracy, 86.42% sensitivity, 82.89% specificity, 84.34% precision, and 85.37% F1-score on REST-meta-MDD; it clearly outperforms conventional feature concatenation on functional atlases and performs comparably on structural atlases. Conclusion: Explicitly modeling cross-modal interactions is important for MDD classification from multimodal neuroimaging, and the proposed dual cross-attention fusion framework is robust and competitive. Abstract: Major depressive disorder (MDD) is a prevalent mental disorder associated with complex neurobiological changes that cannot be fully captured using a single imaging modality. The use of multimodal magnetic resonance imaging (MRI) provides a more comprehensive understanding of brain changes by combining structural and functional data. Despite this, the effective integration of these modalities remains challenging. In this study, we propose a dual cross-attention-based multimodal fusion framework that explicitly models bidirectional interactions between structural MRI (sMRI) and resting-state functional MRI (rs-fMRI) representations. The proposed approach is tested on the large-scale REST-meta-MDD dataset using both structural and functional brain atlas configurations. Numerous experiments conducted under a 10-fold stratified cross-validation demonstrated that the proposed fusion algorithm achieves robust and competitive performance across all atlas types. The proposed method consistently outperforms conventional feature-level concatenation for functional atlases, while maintaining comparable performance for structural atlases. The most effective dual cross-attention multimodal model obtained 84.71% accuracy, 86.42% sensitivity, 82.89% specificity, 84.34% precision, and 85.37% F1-score. These findings emphasize the importance of explicitly modeling cross-modal interactions for multimodal neuroimaging-based MDD classification.
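
The reported metrics can be cross-checked against each other, since the F1-score is the harmonic mean of precision and recall (sensitivity). The short sketch below only redoes that arithmetic with the numbers quoted in the abstract; the function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units in, same units out)."""
    return 2.0 * precision * recall / (precision + recall)


# Values quoted in the abstract, in percent.
reported_precision = 84.34
reported_sensitivity = 86.42  # sensitivity is another name for recall

f1 = f1_score(reported_precision, reported_sensitivity)
print(round(f1, 2))  # 85.37
```

The result agrees with the 85.37% F1-score stated above, so the reported precision, sensitivity, and F1 are mutually consistent.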

[218] PhyMix: Towards Physically Consistent Single-Image 3D Indoor Scene Generation with Implicit--Explicit Optimization

Dongli Wu,Jingyu Hu,Ka-Hei Hui,Xiaobao Wei,Chengwen Luo,Jianqiang Li,Zhengzhe Liu

Main category: cs.CV

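TL;DR: This paper proposes a unified Physics Evaluator for measuring the physical consistency of 3D indoor scenes generated from a single image, and builds the PhyMix framework on top of it. By combining reward shaping during training with explicit optimization at inference time, PhyMix improves both the physical plausibility and the visual fidelity of generated scenes.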

Details Motivation: Existing single-image 3D indoor scene generators produce visually plausible results that often violate real-world physics, which limits their use in robotics, embodied AI, and design. Method: A Physics Evaluator with four aspects and nine sub-constraints is proposed as the first benchmark for physical consistency. On top of it, PhyMix combines Scene-GRPO (implicit alignment: critic-free group-relative policy optimization that uses the evaluator as a preference signal) with a Test-Time Optimizer (explicit refinement: differentiable evaluator signals correct physical violations at inference time). Result: State of the art in both visual fidelity and physical plausibility on synthetic data; extensive qualitative results show robustness on stylized and real-world images. Conclusion: This is the first work to systematically quantify and improve the physical consistency of 3D indoor scene generation, unifying evaluation, training feedback, and inference-time correction, and moving generated scenes from "looks plausible" toward "actually feasible". Abstract: Existing single-image 3D indoor scene generators often produce results that look visually plausible but fail to obey real-world physics, limiting their reliability in robotics, embodied AI, and design. To examine this gap, we introduce a unified Physics Evaluator that measures four main aspects: geometric priors, contact, stability, and deployability, which are further decomposed into nine sub-constraints, establishing the first benchmark to measure physical consistency. Based on this evaluator, our analysis shows that state-of-the-art methods remain largely physics-unaware. To overcome this limitation, we further propose a framework that integrates feedback from the Physics Evaluator into both training and inference, enhancing the physical plausibility of generated scenes. Specifically, we propose PhyMix, which is composed of two complementary components: (i) implicit alignment via Scene-GRPO, a critic-free group-relative policy optimization that leverages the Physics Evaluator as a preference signal and biases sampling towards physically feasible layouts, and (ii) explicit refinement via a plug-and-play Test-Time Optimizer (TTO) that uses differentiable evaluator signals to correct residual violations during generation. Overall, our method unifies evaluation, reward shaping, and inference-time correction, producing 3D indoor scenes that are visually faithful and physically plausible. 
Extensive synthetic evaluations confirm state-of-the-art performance in both visual fidelity and physical plausibility, and extensive qualitative examples in stylized and real-world images further showcase the robustness of the method. We will release codes and models upon publication.

[219] VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation

Longteng Jiang,DanDan Zheng,Qianqian Qiao,Heng Huang,Huaye Wang,Yihang Bo,Bao Peng,Jingdong Chen,Jun Zhou,Xin Jin

Main category: cs.CV

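TL;DR: This paper introduces VGA-Bench, a unified evaluation benchmark for AIGC video generation that covers both generation quality and aesthetic quality. It defines a three-tier taxonomy, builds a large-scale video dataset, and develops three multi-task neural assessors whose predictions align closely with human judgments.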

Details Motivation: Existing video generation benchmarks focus mainly on technical fidelity and lack systematic evaluation of aesthetic dimensions such as perceptual and artistic quality, so a more comprehensive evaluation framework is needed. Method: A three-tier taxonomy (Aesthetic Quality, Aesthetic Tagging, Generation Quality) is proposed; 1,016 diverse prompts are designed and over 60,000 videos are generated with 12 models; a subset is labeled by humans and used to build three dedicated multi-task assessors, VAQA-Net, VTag-Net, and VGQA-Net. Result: The three assessors align closely with human judgments across multiple metrics and are both accurate and efficient; VGA-Bench is publicly released and supports applications such as content moderation, model debugging, and generative model optimization. Conclusion: VGA-Bench fills the gap in aesthetic evaluation of AIGC video and offers a comprehensive, scalable, automated evaluation approach for generative AI. Abstract: The rapid advancement of AIGC-based video generation has underscored the critical need for comprehensive evaluation frameworks that go beyond traditional generation quality metrics to encompass aesthetic appeal. However, existing benchmarks remain largely focused on technical fidelity, leaving a significant gap in holistic assessment, particularly with respect to perceptual and artistic qualities. To address this limitation, we introduce VGA-Bench, a unified benchmark for joint evaluation of video generation quality and aesthetic quality. VGA-Bench is built upon a principled three-tier taxonomy: Aesthetic Quality, Aesthetic Tagging, and Generation Quality, each decomposed into multiple fine-grained sub-dimensions to enable systematic assessment. Guided by this taxonomy, we design 1,016 diverse prompts and generate a large-scale dataset of over 60,000 videos using 12 video generation models, ensuring broad coverage across content, style, and artifacts. To enable scalable and automated evaluation, we annotate a subset of the dataset via human labeling and develop three dedicated multi-task neural assessors: VAQA-Net for aesthetic quality prediction, VTag-Net for automatic aesthetic tagging, and VGQA-Net for generation and basic quality attributes. Extensive experiments demonstrate that our models achieve reliable alignment with human judgments, offering both accuracy and efficiency. We release VGA-Bench as a public benchmark to foster research in AIGC evaluation, with applications in content moderation, model debugging, and generative model optimization.

[220] Improving Deep Learning-Based Target Volume Auto-Delineation for Adaptive MR-Guided Radiotherapy in Head and Neck Cancer: Impact of a Volume-Aware Dice Loss

Sogand Beirami,Zahra Esmaeilzadeh,Ahmed Gomaa,Pluvio Stephan,Ishita Sheth,Thomas Weissmann,Juliane Szkitsak,Philipp Schubert,Yixing Huang,Annette Schwarz,Stefanie Corradini,Florian Putz

Main category: cs.CV

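TL;DR: This study proposes an nnU-Net variant with a Volume-Aware Dice loss for automatic segmentation of primary tumors and lymph nodes in MR-guided radiotherapy for head and neck cancer. Weighting lymph nodes alone improves detection of small lesions but degrades primary tumor segmentation, whereas weighting both masks gives a better balance between the two.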

Details Motivation: Manual delineation of head and neck cancer target volumes is time-consuming and varies widely between observers; small, complex metastatic lymph nodes in particular are easily missed by standard segmentation models, so an automatic method with better volume sensitivity is needed. Method: On the HNTS-MRG 2024 dataset, using an nnU-Net ResEnc M architecture, three configurations are compared: a standard Dice loss, a Dual Mask VA loss (both PT and LN weighted), and a Selective LN Mask VA loss (LN only). Evaluation uses multi-label segmentation and several metrics (Dice, SDS, MSD, HD95, lesion-wise sensitivity and precision). Result: The Selective LN Mask configuration raises LN Dice to 0.758 (baseline 0.734) and lesion detection sensitivity to 84.93% (baseline 81.80%), but PT precision drops sharply to 63.65% (baseline 81.27%); the Dual Mask configuration keeps PT precision at 82.04% while raising LN sensitivity to 83.46%, giving more balanced performance. Conclusion: A Volume-Aware loss mitigates the under-representation of small metastatic lesions during training; in multi-target segmentation, however, volume-aware weighting should be applied to all targets (as in Dual Mask) rather than selectively to small targets, so that segmentation stays robust across structures of different scales. Abstract: Background: Manual delineation of target volumes in head and neck cancer (HNC) remains a significant bottleneck in radiotherapy planning, characterized by high inter-observer variability and time consumption. This study evaluates the integration of a Volume-Aware (VA) Dice loss function into a self-configuring deep learning framework to enhance the auto-segmentation of primary tumors (PT) and metastatic lymph nodes (LN) for adaptive MR-guided radiotherapy. We investigate how volume-sensitive weighting affects the detection of small, anatomically complex nodal metastases compared to conventional loss functions. Methods: Utilizing the HNTS-MRG 2024 dataset, we implemented an nnU-Net ResEnc M architecture. We conducted a multi-label segmentation task, comparing a standard Dice loss baseline against two Volume-Aware configurations: a "Dual Mask" setup (VA loss on both PT and LN) and a "Selective LN Mask" setup (VA loss on LN only). Evaluation metrics included volumetric Dice scores, surface-based metrics (SDS, MSD, HD95), and lesion-wise binary detection sensitivity and precision. Results: The Selective LN Mask configuration achieved the highest LN Volumetric Dice Score (0.758 vs. 0.734 baseline) and significantly improved LN Lesion-Wise Detection Sensitivity (84.93% vs. 81.80%). However, a critical trade-off was observed; PT detection precision declined significantly in the selective setup (63.65% vs. 81.27%). 
The Dual Mask configuration provided the most balanced performance across both targets, maintaining primary tumor precision at 82.04% while improving LN sensitivity to 83.46%. Conclusions: A volume-sensitive loss function mitigated the under-representation of small metastatic lesions in HNC. While selective weighting yielded the best nodal detection, a dual-mask approach is required in multi-label tasks to maintain segmentation accuracy for larger primary tumor volumes.

[221] Semantic Manipulation Localization

Zhenshan Tan,Chenhan Lu,Yuxiang Huang,Ziwen He,Xiang Zhang,Yuzhe Sha,Xianyi Chen,Tianrun Chen,Zhangjie Fu

Main category: cs.CV

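TL;DR: This paper introduces Semantic Manipulation Localization (SML) as a new task, builds a fine-grained benchmark for it, and designs TRACE, an end-to-end framework with three stages (semantic anchoring, perturbation sensing, and semantic-constrained reasoning) that localizes subtle but semantically critical edits and clearly outperforms conventional artifact-based image manipulation localization methods.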

Details Motivation: Existing image manipulation localization (IML) methods rely on low-level artifact detection and struggle with the subtle edits produced by modern editing and generative models, which are visually consistent yet significantly change how the image is understood. Method: The SML task and an accompanying fine-grained benchmark are proposed, together with the TRACE framework, which has three progressively coupled modules: semantic anchoring (identifying the semantic regions that support image understanding), semantic perturbation sensing (injecting perturbation-sensitive frequency cues to capture subtle edits), and semantic-constrained reasoning (verifying candidate regions by jointly reasoning over semantic content and semantic scope). Result: TRACE outperforms existing IML methods across the board on the authors' SML benchmark, with localization results that are more complete, compact, and semantically coherent. Conclusion: Image forensics needs to move from artifact-based detection to semantically sensitive localization; SML and TRACE offer a new direction for authenticity analysis in complex semantic editing scenarios. Abstract: Image Manipulation Localization (IML) aims to identify edited regions in an image. However, with the increasing use of modern image editing and generative models, many manipulations no longer exhibit obvious low-level artifacts. Instead, they often involve subtle but meaning-altering edits to an object's attributes, state, or relationships while remaining highly consistent with the surrounding content. This makes conventional IML methods less effective because they mainly rely on artifact detection rather than semantic sensitivity. To address this issue, we introduce Semantic Manipulation Localization (SML), a new task that focuses on localizing subtle semantic edits that significantly change image interpretation. We further construct a dedicated fine-grained benchmark for SML using a semantics-driven manipulation pipeline with pixel-level annotations. Based on this task, we propose TRACE (Targeted Reasoning of Attributed Cognitive Edits), an end-to-end framework that models semantic sensitivity through three progressively coupled components: semantic anchoring, semantic perturbation sensing, and semantic-constrained reasoning. Specifically, TRACE first identifies semantically meaningful regions that support image understanding, then injects perturbation-sensitive frequency cues to capture subtle edits under strong visual consistency, and finally verifies candidate regions through joint reasoning over semantic content and semantic scope. Extensive experiments show that TRACE consistently outperforms existing IML methods on our benchmark and produces more complete, compact, and semantically coherent localization results. 
These results demonstrate the necessity of moving beyond artifact-based localization and provide a new direction for image forensics in complex semantic editing scenarios.

[222] Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval

Yibo Yan,Mingdong Ou,Yi Cao,Jiahao Huo,Xin Zou,Shuliang Liu,James Kwok,Xuming Hu

Main category: cs.CV

TL;DR: ColChunk is a plug-and-play framework for Visual Document Retrieval that uses multimodal late chunking with hierarchical clustering and 2D position priors to reduce storage by over 90% while improving nDCG@5 by 9 points.

Details Motivation: Multi-vector models in Visual Document Retrieval (VDR) offer fine-grained matching but suffer from high storage and computational costs, hindering practical deployment. Method: ColChunk introduces multimodal late chunking via hierarchical clustering on patch-level embeddings, fused with a 2D position prior to maintain spatial-semantic coherence, enabling adaptive, content-aware grouping of vectors. Result: ColChunk achieves over 90% reduction in storage and a 9-point average improvement in nDCG@5 across 24 VDR datasets compared to representative single-vector models. Conclusion: ColChunk provides a practical balance between retrieval accuracy and efficiency in visual document systems, serving as an efficient, contextualized multi-vector solution. Abstract: Multi-vector models dominate Visual Document Retrieval (VDR) due to their fine-grained matching capabilities, but their high storage and computational costs present a major barrier to practical deployment. In this paper, we propose ColChunk, a plug-and-play framework that introduces multimodal late chunking to construct efficient, contextualized multi-vectors. Unlike existing pruning or fixed-token approaches, ColChunk employs hierarchical clustering on patch-level embeddings, fused with a 2D position prior to ensure spatial-semantic coherence. This adaptive grouping allows for a content-aware representation that preserves global context while drastically reducing the vector count. Evaluations across 24 VDR datasets demonstrate ColChunk achieves over a 90% reduction in storage requirements while simultaneously delivering a 9-point average improvement in nDCG@5 across representative single-vector models. ColChunk provides a practical solution for balancing retrieval accuracy and efficiency in visual document systems.
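
To make the grouping step concrete, the following is a minimal sketch of agglomerative clustering over patch embeddings with a 2D position term added to the distance. It is an illustration under stated assumptions, not ColChunk itself: the average-linkage rule, the additive combination `cosine distance + pos_weight * Euclidean distance`, mean pooling, and the names `chunk_patches` and `pos_weight` are all hypothetical choices. The patch embeddings are assumed to be already contextualized by the encoder, which is what makes the chunking "late".

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def chunk_patches(embeddings, positions, num_chunks, pos_weight=0.5):
    """Merge patches bottom-up until `num_chunks` groups remain, then mean-pool.

    embeddings: list of patch vectors (assumed non-zero).
    positions:  list of (row, col) patch coordinates on the page grid.
    """
    clusters = [[i] for i in range(len(embeddings))]

    def pair_distance(i, j):
        semantic = cosine_distance(embeddings[i], embeddings[j])
        spatial = math.dist(positions[i], positions[j])
        return semantic + pos_weight * spatial  # position prior (assumed additive)

    def linkage(a, b):
        # Average linkage between two clusters of patch indices.
        return sum(pair_distance(i, j) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > num_chunks:
        _, p, q = min((linkage(clusters[p], clusters[q]), p, q)
                      for p in range(len(clusters))
                      for q in range(p + 1, len(clusters)))
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]

    dim = len(embeddings[0])
    return [[sum(embeddings[i][d] for i in c) / len(c) for d in range(dim)]
            for c in clusters]
```

Storing `num_chunks` pooled vectors instead of one vector per patch is where the storage saving comes from; for example, keeping one tenth as many vectors corresponds to a 90% reduction in vector count.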

[223] Radiology Report Generation for Low-Quality X-Ray Images

Hongze Zhu,Chen Hu,Jiaxuan Jiang,Hong Liu,Yawen Huang,Ming Hu,Tianyu Wang,Zhijian Wu,Yefeng Zheng

Main category: cs.CV

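TL;DR: This paper proposes a radiology report generation framework that is robust to variations in image quality, comprising an Automated Quality Assessment Agent (AQAA) and a dual-loop training strategy, to mitigate the performance degradation caused by low-quality images.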

Details Motivation: Existing vision-language models for automated radiology report generation assume high-quality input images and overlook the noise and artifacts common in clinical settings, so their performance drops sharply on low-quality images. Method: An Automated Quality Assessment Agent (AQAA) is used to build the LRRG benchmark, and a dual-loop training strategy based on bi-level optimization and gradient consistency is designed so that the model learns quality-agnostic diagnostic features. Result: Extensive experiments show that the method effectively mitigates the performance degradation caused by reduced image quality. Conclusion: The framework improves the robustness and report quality of VLMs on real-world low-quality clinical images. Abstract: Vision-Language Models (VLMs) have significantly advanced automated Radiology Report Generation (RRG). However, existing methods implicitly assume high-quality inputs, overlooking the noise and artifacts prevalent in real-world clinical environments. Consequently, current models exhibit severe performance degradation when processing suboptimal images. To bridge this gap, we propose a robust report generation framework explicitly designed for image quality variations. We first introduce an Automated Quality Assessment Agent (AQAA) to identify low-quality samples within the MIMIC-CXR dataset and establish the Low-quality Radiology Report Generation (LRRG) benchmark. To tackle degradation-induced shifts, we propose a novel Dual-loop Training Strategy leveraging bi-level optimization and gradient consistency. This approach ensures the model learns quality-agnostic diagnostic features by aligning gradient directions across varying quality regimes. Extensive experiments demonstrate that our approach effectively mitigates model performance degradation caused by image quality deterioration. The code and data will be released upon acceptance.

[224] A3-FPN: Asymptotic Content-Aware Pyramid Attention Network for Dense Visual Prediction

Meng'en Qin,Yu Song,Quanling Zhao,Xiaodong Yang,Yingtao Che,Xiaohui Yang

Main category: cs.CV

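TL;DR: This paper proposes A3-FPN, an asymptotically disentangled, content-aware pyramid attention network that strengthens multi-scale feature representation and improves small-object detection and segmentation.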

Details Motivation: Existing feature pyramid networks have inherent design defects that limit their ability to capture discriminative features and recognize small objects, making object scale variation hard to handle. Method: A3-FPN (1) uses a horizontally spread column network for asymptotically global feature interaction and disentangles each level from the hierarchy; (2) in feature fusion, draws on content from the adjacent level to generate position-wise resampling offsets and weights, and learns deep context reweights; and (3) in feature reassembly, strengthens intra-scale discriminative feature learning and reassembles redundant features based on information content and spatial variation. Result: Validated on MS COCO, VisDrone2019-DET, and Cityscapes; paired with OneFormer and a Swin-L backbone, it reaches 49.6 mask AP on COCO and 85.6 mIoU on Cityscapes. Conclusion: A3-FPN is a general, plug-and-play multi-scale feature enhancement module that improves CNN and Transformer architectures on dense prediction tasks, especially for small objects. Abstract: Learning multi-scale representations is the common strategy to tackle object scale variation in dense prediction tasks. Although existing feature pyramid networks have greatly advanced visual recognition, inherent design defects inhibit them from capturing discriminative features and recognizing small objects. In this work, we propose Asymptotic Content-Aware Pyramid Attention Network (A3-FPN), to augment multi-scale feature representation via the asymptotically disentangled framework and content-aware attention modules. Specifically, A3-FPN employs a horizontally-spread column network that enables asymptotically global feature interaction and disentangles each level from all hierarchical representations. In feature fusion, it collects supplementary content from the adjacent level to generate position-wise offsets and weights for context-aware resampling, and learns deep context reweights to improve intra-category similarity. In feature reassembly, it further strengthens intra-scale discriminative feature learning and reassembles redundant features based on information content and spatial variation of feature maps. Extensive experiments on MS COCO, VisDrone2019-DET and Cityscapes demonstrate that A3-FPN can be easily integrated into state-of-the-art CNN and Transformer-based architectures, yielding remarkable performance gains. Notably, when paired with OneFormer and Swin-L backbone, A3-FPN achieves 49.6 mask AP on MS COCO and 85.6 mIoU on Cityscapes. Codes are available at https://github.com/mason-ching/A3-FPN.

[225] Are Pretrained Image Matchers Good Enough for SAR-Optical Satellite Registration?

Isaac Corley,Alex Stoken,Gabriele Berton

Main category: cs.CV

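TL;DR: This paper evaluates the zero-shot transfer of 24 pretrained image matchers to cross-modal optical-SAR registration of remote sensing imagery. It finds that explicit cross-modal training is not always necessary and that foundation-model features (such as DINOv2) may provide some modality invariance. It also shows that deployment protocol choices (geometry model, tile size, inlier gating) strongly affect accuracy, sometimes more than the choice of matcher.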

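Details Motivation: Cross-modal optical-SAR registration is a key bottleneck in disaster response, but existing matchers are designed and benchmarked mainly on natural images and have not been systematically evaluated zero-shot in cross-modal remote sensing. Method: 24 pretrained matchers are evaluated zero-shot (no fine-tuning or domain adaptation) on SpaceNet9 and two additional cross-modal benchmarks under a deterministic protocol with tiled large-image inference, robust geometric filtering, and tie-point-grounded metrics. Result: XoFTR and RoMa reach the lowest mean error of 3.0 px on SpaceNet9; RoMa does so without any cross-modal training; MatchAnything-ELoFTR (trained on synthetic cross-modal pairs) reaches 3.4 px; foundation-model features such as DINOv2 may reduce the reliance on explicit cross-modal supervision; 3D-reconstruction matchers (MASt3R, DUSt3R) are highly protocol-sensitive; switching to affine geometry alone reduces mean error from 12.34 to 9.74 px, and protocol choices shift accuracy by up to 33x. Conclusion: Cross-modal matcher performance depends not only on architecture or training objective but also strongly on the deployment protocol; foundation-model features can partially substitute for explicit cross-modal supervision; future designs should pair architectural innovation with robust deployment practice, especially for satellite remote sensing. Abstract: Cross-modal optical-SAR (Synthetic Aperture Radar) registration is a bottleneck for disaster-response via remote sensing, yet modern image matchers are developed and benchmarked almost exclusively on natural-image domains. We evaluate twenty-four pretrained matcher families--in a zero-shot setting with no fine-tuning or domain adaptation on satellite or SAR data--on SpaceNet9 and two additional cross-modal benchmarks under a deterministic protocol with tiled large-image inference, robust geometric filtering, and tie-point-grounded metrics. Our results reveal asymmetric transfer--matchers with explicit cross-modal training do not uniformly outperform those without it. While XoFTR (trained for visible-thermal matching) and RoMa achieve the lowest reported mean error at $3.0$ px on the labeled SpaceNet9 training scenes, RoMa achieves this without any cross-modal training, and MatchAnything-ELoFTR ($3.4$ px)--trained on synthetic cross-modal pairs--matches closely, suggesting (as a working hypothesis) that foundation-model features (DINOv2) may contribute to modality invariance that partially substitutes for explicit cross-modal supervision. 3D-reconstruction matchers (MASt3R, DUSt3R), which are not designed for traditional 2D image matching, are highly protocol-sensitive and remain fragile under default settings. 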
Deployment protocol choices (geometry model, tile size, inlier gating) shift accuracy by up to $33\times$ for a single matcher, sometimes exceeding the effect of swapping matchers entirely within the evaluated sweep--affine geometry alone reduces mean error from $12.34$ to $9.74$ px. These findings inform both practical deployment of existing matchers and future matcher design for cross-modal satellite registration.
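
For readers unfamiliar with tie-point metrics, the sketch below shows one plausible way a mean error in pixels could be computed for an estimated affine transform. The abstract does not give the exact metric definition, so the mean Euclidean distance at tie points used here is an assumption, as are the function names and the 2x3 matrix layout.

```python
import math


def apply_affine(matrix, point):
    """Apply a 2x3 affine matrix ((a, b, tx), (c, d, ty)) to an (x, y) point."""
    (a, b, tx), (c, d, ty) = matrix
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)


def mean_tie_point_error(matrix, source_points, target_points):
    """Mean Euclidean distance, in pixels, between warped source and target tie points."""
    errors = [math.dist(apply_affine(matrix, s), t)
              for s, t in zip(source_points, target_points)]
    return sum(errors) / len(errors)


identity = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
shift = ((1.0, 0.0, 3.0), (0.0, 1.0, 4.0))
source = [(0.0, 0.0), (10.0, 0.0)]
target = [(3.0, 4.0), (13.0, 4.0)]

print(mean_tie_point_error(identity, source, target))  # 5.0
print(mean_tie_point_error(shift, source, target))     # 0.0
```

An affine model has six parameters against eight for a homography, which is one reason a simpler geometry model can be more stable when the matches are noisy.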

[226] SMFormer: Empowering Self-supervised Stereo Matching via Foundation Models and Data Augmentation

Yun Wang,Zhengjie Yang,Jiahao Zheng,Zhanjie Zhang,Dapeng Oliver Wu,Yulan Guo

Main category: cs.CV

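TL;DR: This paper proposes SMFormer, a framework that improves self-supervised stereo matching by introducing a vision foundation model (VFM) and a robust data augmentation mechanism. It narrows the accuracy gap to supervised methods, reaches state-of-the-art results among self-supervised methods on several benchmarks, and even outperforms some supervised methods on Booster.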

Details Motivation: Existing self-supervised stereo matching methods rely on the photometric consistency assumption, which real-world disturbances can break, invalidating the supervisory signal and leaving accuracy well below supervised methods. Method: SMFormer (1) combines a vision foundation model (VFM) with a Feature Pyramid Network (FPN) to make features more robust, and (2) introduces a data augmentation mechanism that explicitly enforces feature consistency under illumination changes and regularizes the consistency of disparity predictions between strongly augmented and standard samples. Result: State of the art among self-supervised methods on several mainstream benchmarks (such as SceneFlow and KITTI); on the challenging Booster benchmark it outperforms some supervised methods such as CFNet. Conclusion: With VFM-guided self-supervision and targeted data augmentation, SMFormer mitigates the failure of the photometric assumption, improves the robustness and accuracy of self-supervised stereo matching, and brings it closer to supervised performance. Abstract: Recent self-supervised stereo matching methods have made significant progress. They typically rely on the photometric consistency assumption, which presumes corresponding points across views share the same appearance. However, this assumption could be compromised by real-world disturbances, resulting in invalid supervisory signals and a significant accuracy gap compared to supervised methods. To address this issue, we propose SMFormer, a framework integrating more reliable self-supervision guided by the Vision Foundation Model (VFM) and data augmentation. We first incorporate the VFM with the Feature Pyramid Network (FPN), providing a discriminative and robust feature representation against disturbance in various scenarios. We then devise an effective data augmentation mechanism that ensures robustness to various transformations. The data augmentation mechanism explicitly enforces consistency between learned features and those influenced by illumination variations. Additionally, it regularizes the output consistency between disparity predictions of strong augmented samples and those generated from standard samples. Experiments on multiple mainstream benchmarks demonstrate that our SMFormer achieves state-of-the-art (SOTA) performance among self-supervised methods and even competes on par with supervised ones. Remarkably, in the challenging Booster benchmark, SMFormer even outperforms some SOTA supervised methods, such as CFNet.

[227] Adapting 2D Multi-Modal Large Language Model for 3D CT Image Analysis

Yang Yu,Dunyuan Xu,Yaoqian Li,Xiaomeng Li,Jinpeng Li,Pheng-Ann Heng

Main category: cs.CV

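TL;DR: This paper proposes a method for transferring a 2D multimodal large language model to 3D medical image analysis. Using a text-guided hierarchical MoE framework and a two-stage training strategy, it improves performance on medical report generation and medical visual question answering.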

Details Motivation: Because 3D medical image data are scarce, existing 3D medical MLLMs have insufficiently pretrained vision encoders and cannot extract image features tailored to different tasks. Method: A well-pretrained 2D MLLM is first transferred to accept 3D medical volumetric inputs while reusing all of its parameters; a Text-Guided Hierarchical MoE (TGH-MoE) framework then distinguishes tasks based on the text prompt; finally, a two-stage training strategy learns both task-shared and task-specific image features. Result: The method outperforms existing 3D medical MLLMs on both medical report generation (MRG) and medical visual question answering (MVQA). Conclusion: The method alleviates the modeling bottleneck caused by scarce 3D medical image data and offers a scalable, task-adaptive approach to 3D medical multimodal understanding. Abstract: 3D medical image analysis is of great importance in disease diagnosis and treatment. Recently, multimodal large language models (MLLMs) have exhibited robust perceptual capacity, strong cross-modal alignment, and promising generalizability. Therefore, they have great potential to improve the performance of medical report generation (MRG) and medical visual question answering (MVQA), which serve as two important tasks in clinical scenarios. However, due to the scarcity of 3D medical images, existing 3D medical MLLMs suffer from an insufficiently pretrained vision encoder and an inability to extract customized image features for different kinds of tasks. In this paper, we propose to first transfer a 2D MLLM, which is well trained with 2D natural images, to support 3D medical volumetric inputs while reusing all of its pre-trained parameters. To enable the vision encoder to extract tailored image features for various tasks, we then design a Text-Guided Hierarchical MoE (TGH-MoE) framework, which can distinguish tasks under the guidance of the text prompt. Furthermore, we propose a two-stage training strategy to learn both task-shared and task-specific image features. As demonstrated empirically, our method outperforms existing 3D medical MLLMs in both MRG and MVQA tasks. Our code will be released once this paper is accepted.

[228] MedVeriSeg: Teaching MLLM-Based Medical Segmentation Models to Verify Query Validity Without Extra Training

Ziqian Lu,Qinyue Tong,Jun Liu,Yunlong Yu

Main category: cs.CV

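TL;DR: This paper proposes MedVeriSeg, a training-free verification framework that improves the ability of LISA-like medical image segmentation models to identify and reject false queries (queries whose target does not exist). It analyzes the distribution of the similarity map between the [SEG] token feature and image features, and combines this with GPT-4o for multi-aspect verification.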

Details Motivation: Existing MLLM-based medical image segmentation methods (such as LISA) cannot reliably reject false queries and tend to produce hallucinated segmentation masks, which limits their usefulness in medical education and clinical practice. Method: Exploiting the difference in the distribution of the [SEG]-token-to-image-feature similarity map between true and false queries, the authors design a Similarity Response Quality Scoring Module (measuring strength, compactness, and purity) and then use GPT-4o to jointly assess the similarity heatmap and the scoring results for final verification. Result: On a small-scale benchmark built from SA-Med2D-20M, MedVeriSeg effectively rejects false-query requests while maintaining reliable recognition of true queries. Conclusion: MedVeriSeg gives LISA-like models a reliable, training-free verification mechanism that improves the trustworthiness and safety of medical image segmentation systems in practice. Abstract: Despite recent advances in MLLM-based medical image segmentation, existing LISA-like methods cannot reliably reject false queries and often produce hallucinated segmentation masks for absent targets. This limitation reduces practical reliability in both medical education and clinical use. In this work, we propose MedVeriSeg, a training-free verification framework that equips LISA-like medical segmentation models with the ability to identify and reject false queries which contain non-existent targets. Our key observation is that the similarity map between the [SEG] token feature and MLLM image features exhibits markedly different distribution patterns for true and false queries. Based on this, we introduce a Similarity Response Quality Scoring Module that characterizes the similarity map from three aspects: strength, compactness, and purity, producing an initial target-existence prediction. We further incorporate qualitative visual evidence by using GPT-4o to jointly assess the similarity heatmap and the results of Similarity Response Quality Scoring Module for final verification. Experiments on a small-scale benchmark constructed from SA-Med2D-20M show that MedVeriSeg effectively rejects false-query segmentation requests while maintaining reliable recognition of true queries.

[229] Warm-Started Reinforcement Learning for Iterative 3D/2D Liver Registration

Hanyuan Zhang,Lucas He,Zijie Cheng,Abdolrahim Kadkhodamohammadi,Danail Stoyanov,Brian R. Davidson,Evangeles B. Mazomenos,Matthew J. Clarkson

Main category: cs.CV

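TL;DR: This paper proposes a discrete-action reinforcement learning (RL) framework for registering CT to laparoscopic video for augmented-reality surgical guidance. It avoids the optimization-based post-processing that conventional supervised methods depend on, matching their accuracy while improving efficiency.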

Details Motivation: Existing learning-based registration methods are fast but often produce coarse alignments that still need time-consuming optimization-based refinement; an end-to-end, automatic iterative method that needs no manually set step sizes or stopping criteria is required. Method: A discrete-action RL framework models CT-to-video registration as a sequential decision process over six-degree-of-freedom rigid transformations. A shared feature encoder, warm-started from a supervised pose estimation network, extracts geometric features from CT renderings and laparoscopic frames, and an RL policy head chooses the transformation at each step and decides when to stop. Result: On a public laparoscopic dataset the method reaches a mean target registration error (TRE) of 15.70 mm, comparable to supervised methods with optimization, while converging faster and requiring no manual tuning. Conclusion: The discrete RL formulation enables automated, efficient iterative registration and provides a practical foundation for future continuous-action and deformable registration in surgical AR. Abstract: Registration between preoperative CT and intraoperative laparoscopic video plays a crucial role in augmented reality (AR) guidance for minimally invasive surgery. Learning-based methods have recently achieved registration errors comparable to optimization-based approaches while offering faster inference. However, many supervised methods produce coarse alignments that rely on additional optimization-based refinement, thereby increasing inference time. We present a discrete-action reinforcement learning (RL) framework that formulates CT-to-video registration as a sequential decision-making process. A shared feature encoder, warm-started from a supervised pose estimation network to provide stable geometric features and faster convergence, extracts representations from CT renderings and laparoscopic frames, while an RL policy head learns to choose rigid transformations along six degrees of freedom and to decide when to stop the iteration. Experiments on a public laparoscopic dataset demonstrated that our method achieved an average target registration error (TRE) of 15.70 mm, comparable to supervised approaches with optimization, while achieving faster convergence. The proposed RL-based formulation enables automated, efficient iterative registration without manually tuned step sizes or stopping criteria. This discrete framework provides a practical foundation for future continuous-action and deformable registration models in surgical AR applications.
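
The sequential formulation can be pictured as a small control loop over a discrete action set. The sketch below is only a toy under stated assumptions: the unit step size, the 13-action layout (plus or minus one step on each of six pose parameters, plus "stop"), and the oracle policy that peeks at a known target pose are all hypothetical stand-ins. In the paper, a learned policy head chooses actions from image features and no target pose is available.

```python
STEP = 1.0  # hypothetical step size per action (e.g. mm or degrees)

# Six pose parameters (3 translations, 3 rotations), two directions each, plus "stop".
ACTIONS = [(axis, sign) for axis in range(6) for sign in (+1, -1)] + ["stop"]


def apply_action(pose, action):
    axis, sign = action
    new_pose = list(pose)
    new_pose[axis] += sign * STEP
    return new_pose


def register(initial_pose, policy, max_steps=100):
    """Iterate until the policy emits "stop" or the step budget runs out."""
    pose = list(initial_pose)
    for step in range(max_steps):
        action = policy(pose)
        if action == "stop":
            return pose, step
        pose = apply_action(pose, action)
    return pose, max_steps


def make_oracle_policy(target_pose):
    """Stand-in for a learned policy: greedily reduce the largest residual."""
    def policy(pose):
        residuals = [t - p for t, p in zip(target_pose, pose)]
        axis = max(range(6), key=lambda i: abs(residuals[i]))
        if abs(residuals[axis]) < STEP / 2:
            return "stop"
        return (axis, 1 if residuals[axis] > 0 else -1)
    return policy


target = [2.0, -1.0, 0.0, 0.0, 3.0, 0.0]
final_pose, num_steps = register([0.0] * 6, make_oracle_policy(target))
print(final_pose, num_steps)  # [2.0, -1.0, 0.0, 0.0, 3.0, 0.0] 6
```

Because "stop" is itself an action, the loop needs no hand-tuned convergence threshold, which mirrors the abstract's point about avoiding manually tuned stopping criteria.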

[230] A Comparison of Multi-View Stereo Methods for Photogrammetric 3D Reconstruction: From Traditional to Learning-Based Approaches

Yawen Li,George Vosselman,Francesco Nex

Main category: cs.CV

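TL;DR: This paper compares a traditional MVS pipeline (COLMAP) with several state-of-the-art learning-based MVS methods (geometry-guided and end-to-end) for 3D reconstruction of aerial scenes, evaluating accuracy, coverage, and runtime. COLMAP is accurate and geometrically consistent but slow; learning-based methods are more robust and faster, especially when image registration fails; end-to-end methods such as DUSt3R and VGGT are fast with acceptable accuracy but leave larger reconstruction residuals.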

Details Motivation: Traditional SfM/MVS methods are accurate but slow and scale poorly; with learning-based MVS methods on the rise, a systematic comparison of their performance trade-offs in real aerial scenes is needed. Method: On two aerial datasets (MARS-LVIG, with LiDAR ground truth, and a public Pix4D scene, with ground truth from Pix4Dmapper), COLMAP and a range of learning-based MVS methods (MVSNet, PatchmatchNet, MVSAnywhere, MVSFormer++, Stereo4D, FoundationStereo, DUSt3R, MASt3R, Fast3R, VGGT) are evaluated quantitatively on accuracy, coverage, and runtime. Result: COLMAP reconstructions are reliable and geometrically consistent but take the longest; learning-based methods are more robust, especially when traditional registration fails; geometry-guided methods depend on camera pose or depth priors provided by COLMAP; end-to-end methods such as DUSt3R and VGGT are fast with competitive accuracy and coverage, but show larger 3D residuals, particularly in challenging scenes. Conclusion: Learning-based MVS methods beat traditional ones on efficiency and robustness and suit fast reconstruction, but current end-to-end methods still struggle to combine speed, accuracy, and geometric consistency, so reconstruction quality in complex aerial scenes needs further improvement. Abstract: Photogrammetric 3D reconstruction has long relied on traditional Structure-from-Motion (SfM) and Multi-View Stereo (MVS) methods, which provide high accuracy but face challenges in speed and scalability. Recently, learning-based MVS methods have emerged, aiming for faster and more efficient reconstruction. This work presents a comparative evaluation between a representative traditional MVS pipeline (COLMAP) and state-of-the-art learning-based approaches, including geometry-guided methods (MVSNet, PatchmatchNet, MVSAnywhere, MVSFormer++) and end-to-end frameworks (Stereo4D, FoundationStereo, DUSt3R, MASt3R, Fast3R, VGGT). Two experiments were conducted on different aerial scenarios. The first experiment used the MARS-LVIG dataset, where ground-truth 3D reconstruction was provided by LiDAR point clouds. The second experiment used a public scene from the Pix4D official website, with ground truth generated by Pix4Dmapper. We evaluated accuracy, coverage, and runtime across all methods. Experimental results show that although COLMAP can provide reliable and geometrically consistent reconstruction results, it requires more computation time. In cases where traditional methods fail in image registration, learning-based approaches exhibit stronger feature-matching capability and greater robustness. Geometry-guided methods usually require careful dataset preparation and often depend on camera pose or depth priors generated by COLMAP. 
End-to-end methods such as DUSt3R and VGGT achieve competitive accuracy and reasonable coverage while offering substantially faster reconstruction. However, they exhibit relatively large residuals in 3D reconstruction, particularly in challenging scenarios.

[231] Real-Time Human Reconstruction and Animation using Feed-Forward Gaussian Splatting

Devdoot Chatterjee,Zakaria Laskar,C. V. Jawahar

Main category: cs.CV

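TL;DR: This paper proposes a generalizable feed-forward Gaussian splatting framework that reconstructs and animates 3D humans in real time directly from multi-view RGB images and their SMPL-X poses. A single forward pass produces an animatable Gaussian representation, which is then animated efficiently via linear blend skinning.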

Details Motivation: Existing methods rely on depth supervision, fixed views, UV maps, or repeated forward inference for every target pose or view, and lack an efficient, generalizable 3D human representation that can be animated in real time. Method: In the canonical SMPL-X pose, a set of 3D Gaussian primitives is predicted for each vertex: one constrained Gaussian stays close to the SMPL-X surface to provide a geometric prior and stable correspondence, while several free Gaussians capture geometry that deviates from the parametric surface, such as clothing and hair. The Gaussians are explicitly bound to vertices so that they can be driven by linear blend skinning. Result: On THuman 2.1, AvatarReX, and THuman 4.0, reconstruction quality is comparable to SOTA methods, and the method is the first to support real-time animation and interactive applications without repeated network inference. Conclusion: The method unifies high-quality reconstruction with efficient animation, making Gaussian splatting markedly more practical and generalizable for dynamic human modeling. Abstract: We present a generalizable feed-forward Gaussian splatting framework for human 3D reconstruction and real-time animation that operates directly on multi-view RGB images and their associated SMPL-X poses. Unlike prior methods that rely on depth supervision, fixed input views, UV map, or repeated feed-forward inference for each target view or pose, our approach predicts, in a canonical pose, a set of 3D Gaussian primitives associated with each SMPL-X vertex. One Gaussian is regularized to remain close to the SMPL-X surface, providing a strong geometric prior and stable correspondence to the parametric body model, while an additional small set of unconstrained Gaussians per vertex allows the representation to capture geometric structures that deviate from the parametric surface, such as clothing and hair. In contrast to recent approaches such as HumanRAM, which require repeated network inference to synthesize novel poses, our method produces an animatable human representation from a single forward pass; by explicitly associating Gaussian primitives with SMPL-X vertices, the reconstructed model can be efficiently animated via linear blend skinning without further network evaluation. We evaluate our method on the THuman 2.1, AvatarReX and THuman 4.0 datasets, where it achieves reconstruction quality comparable to state-of-the-art methods while uniquely supporting real-time animation and interactive applications. Code and pre-trained models are available at https://github.com/Devdoot57/HumanGS .
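Because each Gaussian is tied to an SMPL-X vertex, posing it reduces to standard linear blend skinning (LBS): every Gaussian inherits the skinning weights of its vertex and is moved by the weighted sum of bone transforms. The sketch below shows this for Gaussian centers only. The function name, argument layout, and array shapes are illustrative assumptions and are not taken from the released HumanGS code; a full implementation would also rotate each Gaussian's covariance.

```python
import numpy as np


def skin_gaussian_centers(centers, vertex_ids, skin_weights, bone_transforms):
    """Pose canonical Gaussian centers with linear blend skinning.

    centers:         (G, 3) Gaussian centers in the canonical pose
    vertex_ids:      (G,)   index of the body vertex each Gaussian is bound to
    skin_weights:    (V, J) per-vertex skinning weights (rows sum to 1)
    bone_transforms: (J, 4, 4) rigid transform of each joint for the target pose
    """
    # Each Gaussian reuses the weights of the vertex it is attached to.
    weights = skin_weights[vertex_ids]                                 # (G, J)
    # Blend the joint transforms per Gaussian.
    blended = np.einsum("gj,jab->gab", weights, bone_transforms)       # (G, 4, 4)
    # Apply the blended transform in homogeneous coordinates.
    homogeneous = np.concatenate([centers, np.ones((len(centers), 1))], axis=1)
    posed = np.einsum("gab,gb->ga", blended, homogeneous)
    return posed[:, :3]
```

No network is evaluated here, which is the property the abstract highlights: once the canonical Gaussians exist, a new pose costs only this matrix arithmetic.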

[232] EditCrafter: Tuning-free High-Resolution Image Editing via Pretrained Diffusion Model

Kunho Kim,Sumin Seo,Yongjun Cho,Hyungjin Chung

Main category: cs.CV

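TL;DR: EditCrafter is a tuning-free high-resolution image editing method built on pretrained text-to-image diffusion models. It combines tiled inversion with a noise-damped manifold-constrained classifier-free guidance (NDCFG++) to deliver high-quality edits at arbitrary aspect ratios and at resolutions beyond those used in training.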

Details Motivation: Existing diffusion-based image editing methods are limited to the training resolution (e.g., 512×512 or 1024×1024) and struggle with arbitrary aspect ratios or higher resolutions; naive patch-wise editing causes structural distortion and repeated content. Method: The EditCrafter pipeline consists of (1) tiled inversion, which preserves the identity of the high-resolution input image, and (2) a noise-damped manifold-constrained classifier-free guidance (NDCFG++) for high-resolution editing from the inverted latent. Result: Without fine-tuning or optimization, EditCrafter produces high-quality, structurally consistent edits across multiple resolutions and clearly outperforms baselines such as naive patch-wise editing. Conclusion: EditCrafter shows that tuning-free high-resolution editing with the generative priors of large models is feasible, opening a new path for applying diffusion models in real-world settings. Abstract: We propose EditCrafter, a high-resolution image editing method that operates without tuning, leveraging pretrained text-to-image (T2I) diffusion models to process images at resolutions significantly exceeding those used during training. Leveraging the generative priors of large-scale T2I diffusion models enables the development of a wide array of novel generation and editing applications. Although numerous image editing methods have been proposed based on diffusion models and exhibit high-quality editing results, they are difficult to apply to images with arbitrary aspect ratios or higher resolutions since they only work at the training resolutions (512x512 or 1024x1024). Naively applying patch-wise editing fails with unrealistic object structures and repetition. To address these challenges, we introduce EditCrafter, a simple yet effective editing pipeline. EditCrafter operates by first performing tiled inversion, which preserves the original identity of the input high-resolution image. We further propose a noise-damped manifold-constrained classifier-free guidance (NDCFG++) that is tailored for high-resolution image editing from the inverted latent. Our experiments show that EditCrafter can achieve impressive editing results across various resolutions without fine-tuning and optimization.

[233] Dual-Exposure Imaging with Events

Mingyuan Lin,Hongyi Liu,Chu He,Wen Yang,Gui-Song Xia,Lei Yu

Main category: cs.CV

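TL;DR: This paper proposes an event-camera-based dual-exposure imaging (E-DEI) algorithm that uses the temporally precise motion information in the event stream together with dual-exposure images to reconstruct high-quality images in low-light scenes. It also designs a dual-path feature alignment and fusion module and builds the real-world dataset PIED.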

Details Motivation: Existing dual-exposure imaging methods tend to produce artifacts under scene motion and under the feature discrepancies caused by different exposures; more precise dynamic information is needed to mitigate this. Method: The E-DEI algorithm decomposes the task into two sub-tasks, event-based motion deblurring and low-light image enhancement. It uses a dual-path parallel feature propagation network and a Dual-path Feature Alignment and Fusion (DFAF) module to fuse dual-exposure images with the event stream, and a real-world dataset, PIED, is built. Result: Experiments on multiple datasets confirm the method's superior image quality; the code and dataset are open-sourced. Conclusion: The high temporal resolution of event cameras improves the robustness and reconstruction quality of dual-exposure imaging in dynamic low-light scenes, and E-DEI offers a new approach to low-light vision. Abstract: By combining complementary benefits of short- and long-exposure images, Dual-Exposure Imaging (DEI) enhances image quality in low-light scenarios. However, existing DEI approaches inevitably suffer from producing artifacts due to spatial displacement from scene motion and image feature discrepancies from different exposure times. To tackle this problem, we propose a novel Event-based DEI (E-DEI) algorithm, which reconstructs high-quality images from dual-exposure image pairs and events, leveraging high temporal resolution of event cameras to provide accurate inter-/intra-frame dynamic information. Specifically, we decompose this complex task into an integration of two sub-tasks, i.e., event-based motion deblurring and low-light image enhancement tasks, which guides us to design E-DEI network as a dual-path parallel feature propagation architecture. We propose a Dual-path Feature Alignment and Fusion (DFAF) module to effectively align and fuse features extracted from dual-exposure images with assistance of events. Furthermore, we build a real-world Dataset containing Paired low-/normal-light Images and Events (PIED). Experiments on multiple datasets show the superiority of our method. The code and dataset are available at github.

[234] FastSHADE: Fast Self-augmented Hierarchical Asymmetric Denoising for Efficient inference on mobile devices

Nikolay Falaleev

Main category: cs.CV

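TL;DR: FastSHADE is a lightweight real-time image denoising network designed for mobile GPUs. An asymmetric frequency denoising block and a spatially gated upsampler improve efficiency and quality, and a noise shifting self-augmentation strategy improves generalization, giving a strong balance between speed and fidelity.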

Details Motivation: Modern mobile photography urgently needs real-time image denoising, but the strict latency and power limits of edge devices make it very challenging. Method: FastSHADE is a multi-stage lightweight architecture based on U-Net. It introduces an Asymmetric Frequency Denoising Block (AFDB) that decouples spatial structure extraction from high-frequency noise suppression, a Spatially Gated Upsampler (SGU) that optimizes high-resolution skip-connection fusion, and a noise shifting self-augmentation strategy that increases data diversity without inducing domain shift. Result: On the MAI2021 benchmark, FastSHADE-M achieves real-time latency under 50 ms while preserving structural integrity, and FastSHADE-XL sets a new SOTA for overall image quality. Conclusion: FastSHADE bridges the gap between theoretical network efficiency and practical deployment in mobile ISP pipelines. Abstract: Real-time image denoising is essential for modern mobile photography but remains challenging due to the strict latency and power constraints of edge devices. This paper presents FastSHADE (Fast Self-augmented Hierarchical Asymmetric Denoising), a lightweight U-Net-style network tailored for real-time, high-fidelity restoration on mobile GPUs. Our method features a multi-stage architecture incorporating a novel Asymmetric Frequency Denoising Block (AFDB) that decouples spatial structure extraction from high-frequency noise suppression to maximize efficiency, and a Spatially Gated Upsampler (SGU) that optimizes high-resolution skip connection fusion. To address generalization, we introduce an efficient Noise Shifting Self-Augmentation strategy that enhances data diversity without inducing domain shifts. Evaluations on the MAI2021 benchmark demonstrate that our scalable model family establishes a highly efficient speed-fidelity trade-off. Our base FastSHADE-M variant maintains real-time latency (<50 ms on a modern mobile GPU) while preserving structural integrity, and our scaled-up FastSHADE-XL establishes a new state-of-the-art for overall image quality. Ultimately, FastSHADE successfully bridges the gap between theoretical network efficiency and practical deployment for real-world mobile ISP pipelines.

[235] FashionMV: Product-Level Composed Image Retrieval with Multi-View Fashion Data

Peng Yuan,Bingyin Mei,Hui Zhang

Main category: cs.CV

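TL;DR: This paper introduces the Multi-View Composed Image Retrieval (Multi-View CIR) task to address a limitation of existing CIR methods, which take a single image in and return a single image out and so cannot model the multi-view product presentation common in e-commerce. It builds FashionMV, the first large-scale multi-view fashion dataset, and proposes ProCIR, a framework based on a multimodal large language model that combines two-stage dialogue, caption-based alignment, and chain-of-thought guidance, outperforming much larger baselines on multiple benchmarks.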

Details Motivation: Existing CIR methods and datasets operate only at the image level (one reference image plus text to one target image), whereas real e-commerce users reason over multiple views of a product. This "View Incompleteness" problem calls for generalizing CIR to product-level retrieval. Method: The ProCIR framework is built on a multimodal large language model and uses a two-stage dialogue architecture, caption-driven cross-modal alignment, and chain-of-thought guidance, with an optional supervised fine-tuning (SFT) stage that injects structured product knowledge before contrastive learning. Result: Systematic ablations on three fashion benchmarks show that (1) caption alignment is the most critical mechanism; (2) two-stage dialogue is a prerequisite for effective alignment; and (3) SFT and chain-of-thought are partially redundant. The best 0.8B-parameter model outperforms general-purpose embedding models with 10x as many parameters. Conclusion: Multi-View CIR is a product-level retrieval paradigm closer to real e-commerce needs; the FashionMV dataset and the ProCIR framework lay the groundwork for this direction and show the potential of lightweight, efficient multimodal modeling for fine-grained vision-language tasks. Abstract: Composed Image Retrieval (CIR) retrieves target images using a reference image paired with modification text. Despite rapid advances, all existing methods and datasets operate at the image level -- a single reference image plus modification text in, a single target image out -- while real e-commerce users reason about products shown from multiple viewpoints. We term this mismatch View Incompleteness and formally define a new Multi-View CIR task that generalizes standard CIR from image-level to product-level retrieval. To support this task, we construct FashionMV, the first large-scale multi-view fashion dataset for product-level CIR, comprising 127K products, 472K multi-view images, and over 220K CIR triplets, built through a fully automated pipeline leveraging large multimodal models. We further propose ProCIR (Product-level Composed Image Retrieval), a modeling framework built upon a multimodal large language model that employs three complementary mechanisms -- two-stage dialogue, caption-based alignment, and chain-of-thought guidance -- together with an optional supervised fine-tuning (SFT) stage that injects structured product knowledge prior to contrastive training. Systematic ablation across 16 configurations on three fashion benchmarks reveals that: (1) alignment is the single most critical mechanism; (2) the two-stage dialogue architecture is a prerequisite for effective alignment; and (3) SFT and chain-of-thought serve as partially redundant knowledge injection paths. 
Our best 0.8B-parameter model outperforms all baselines, including general-purpose embedding models 10x its size. The dataset, model, and code are publicly available at https://github.com/yuandaxia2001/FashionMV.

[236] Seeing No Evil: Blinding Large Vision-Language Models to Safety Instructions via Adversarial Attention Hijacking

Jingru Li,Wei Ren,Tianqing Zhu

Main category: cs.CV

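TL;DR: This paper proposes Attention-Guided Visual Jailbreaking, which circumvents rather than overpowers the safety alignment of LVLMs by directly manipulating attention patterns. Two auxiliary objectives, suppressing attention to safety prefix tokens and anchoring generation on adversarial image features, raise the attack success rate and reduce gradient conflict.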

Details Motivation: Existing adversarial attacks on vision-language models must optimize the adversarial objective against the model's built-in safety-retrieval mechanism, and the resulting gradient conflict makes convergence slow and inefficient. Method: An attention-guided visual jailbreaking method with two auxiliary objectives: (1) suppress attention to alignment-relevant prefix tokens; (2) anchor generation on adversarial image features. Result: On Qwen-VL the attack success rate (ASR) reaches 94.4% (vs. 68.8% for the baseline) with 40% fewer iterations and 45% less gradient conflict; under a tighter perturbation budget ($\epsilon=8/255$) ASR remains 59.0% (vs. 45.7% for the baseline). Mechanistic analysis reveals a "safety blindness" phenomenon: successful attacks reduce system-prompt attention by 80%. Conclusion: Directly manipulating attention patterns bypasses LVLM safety mechanisms more efficiently than optimizing image perturbations alone; "safety blindness" exposes a new failure mechanism in which the model does not violate its safety rules but fails to retrieve them. Abstract: Large Vision-Language Models (LVLMs) rely on attention-based retrieval of safety instructions to maintain alignment during generation. Existing attacks typically optimize image perturbations to maximize harmful output likelihood, but suffer from slow convergence due to gradient conflict between adversarial objectives and the model's safety-retrieval mechanism. We propose Attention-Guided Visual Jailbreaking, which circumvents rather than overpowers safety alignment by directly manipulating attention patterns. Our method introduces two simple auxiliary objectives: (1) suppressing attention to alignment-relevant prefix tokens and (2) anchoring generation on adversarial image features. This simple yet effective push-pull formulation reduces gradient conflict by 45% and achieves 94.4% attack success rate on Qwen-VL (vs. 68.8% baseline) with 40% fewer iterations. At tighter perturbation budgets ($\epsilon=8/255$), we maintain 59.0% ASR compared to 45.7% for standard methods. Mechanistic analysis reveals a failure mode we term safety blindness: successful attacks suppress system-prompt attention by 80%, causing models to generate harmful content not by overriding safety rules, but by failing to retrieve them.

[237] AC-MIL: Weakly Supervised Atrial LGE-MRI Quality Assessment via Adversarial Concept Disentanglement

K M Arefeen Sultan,Kaysen Hansen,Benjamin Orkild,Alan Morris,Eugene Kholmovski,Erik Bieging,Eugene Kwan,Ravi Ranjan,Ed DiBella,Shireen Elhabian

Main category: cs.CV

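TL;DR: This paper proposes the Adversarial Concept-MIL (AC-MIL) framework for interpretable, weakly supervised quality assessment of atrial late gadolinium enhancement MRI. It decomposes global quality into clinically defined radiological concepts and produces well-localized concept maps that help clinicians identify the specific cause of a failed scan.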

Details Motivation: Existing quality assessment methods based on Multiple Instance Learning (MIL) map local visual evidence to an opaque global feature vector and cannot give actionable feedback on specific failure modes such as motion artifacts, inadequate contrast, or missing anatomical structures. Method: The AC-MIL framework (1) decomposes image quality into clinical concepts using volume-level weak supervision; (2) adds an unsupervised residual branch with an adversarial erasure mechanism to prevent information leakage; and (3) imposes a spatial diversity constraint that reduces overlap between the attention maps of different concepts, ensuring localization and interpretability. Result: On a clinical atrial LGE-MRI dataset, AC-MIL produces precisely localized spatial concept maps that reveal the specific causes of non-diagnostic scans, while matching the best existing methods on the ordinal grading task. Conclusion: AC-MIL opens the MIL "black box", substantially improving clinical interpretability and practicality while maintaining high performance, and offers a new paradigm for medical image quality assessment. Abstract: High-quality Late Gadolinium Enhancement (LGE) MRI can be helpful for atrial fibrillation management, yet scan quality is frequently compromised by patient motion, irregular breathing, and suboptimal image acquisition timing. While Multiple Instance Learning (MIL) has emerged as a powerful tool for automated quality assessment under weak supervision, current state-of-the-art methods map localized visual evidence to a single, opaque global feature vector. This black box approach fails to provide actionable feedback on specific failure modes, obscuring whether a scan degrades due to motion blur, inadequate contrast, or a lack of anatomical context. In this paper, we propose Adversarial Concept-MIL (AC-MIL), a weakly supervised framework that decomposes global image quality into clinically defined radiological concepts using only volume-level supervision. To capture latent quality variations without entangling predefined concepts, our framework incorporates an unsupervised residual branch guided by an adversarial erasure mechanism to strictly prevent information leakage. Furthermore, we introduce a spatial diversity constraint that penalizes overlap between distinct concept attention maps, ensuring localized and interpretable feature extraction. Extensive experiments on a clinical dataset of atrial LGE-MRI volumes demonstrate that AC-MIL successfully opens the MIL black box, providing highly localized spatial concept maps that allow clinicians to pinpoint the specific causes of non-diagnostic scans. 
Crucially, our framework achieves this deep clinical transparency while maintaining highly competitive ordinal grading performance against existing baselines. Code to be released on acceptance.

[238] Class-Adaptive Cooperative Perception for Multi-Class LiDAR-based 3D Object Detection in V2X Systems

Blessing Agyei Kyem,Joshua Kofi Asamoah,Armstrong Aboah

Main category: cs.CV

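TL;DR: This paper proposes a class-adaptive cooperative perception architecture for multi-class 3D object detection. Multi-scale window attention, a class-specific fusion module, bird's-eye-view enhancement, and a class-balanced loss improve detection across classes on the V2X-Real benchmark, especially for trucks and pedestrians.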

Details Motivation: Existing cooperative 3D detectors use a uniform fusion strategy that adapts poorly to the differing geometry and point-cloud distributions of different object classes (e.g., small versus large objects), and evaluation protocols are too narrow to validate multi-class detection across diverse V2X cooperation scenarios. Method: A class-adaptive cooperative perception architecture comprising (1) multi-scale window attention with learned scale routing; (2) a class-specific fusion module that separates small and large objects; (3) bird's-eye-view enhancement with parallel dilated convolution and channel recalibration; and (4) class-balanced objective weighting. Result: Across the cooperation settings of the V2X-Real benchmark (vehicle-centric, infrastructure-centric, vehicle-to-vehicle, infrastructure-to-infrastructure, and vehicle-to-infrastructure), mean detection performance improves markedly, with the largest gain on trucks, clear improvement on pedestrians, and competitive results on cars. Conclusion: Matching feature extraction and fusion to class-dependent geometry and point density yields more balanced and robust V2X cooperative perception. Abstract: Cooperative perception allows connected vehicles and roadside infrastructure to share sensor observations, creating a fused scene representation beyond the capability of any single platform. However, most cooperative 3D object detectors use a uniform fusion strategy for all object classes, which limits their ability to handle the different geometric structures and point-sampling patterns of small and large objects. This problem is further reinforced by narrow evaluation protocols that often emphasize a single dominant class or only a few cooperation settings, leaving robust multi-class detection across diverse vehicle-to-everything interactions insufficiently explored. To address this gap, we propose a class-adaptive cooperative perception architecture for multi-class 3D object detection from LiDAR data. The model integrates four components: multi-scale window attention with learned scale routing for spatially adaptive feature extraction, a class-specific fusion module that separates small and large objects into attentive fusion pathways, bird's-eye-view enhancement through parallel dilated convolution and channel recalibration for richer contextual representation, and class-balanced objective weighting to reduce bias toward frequent categories. Experiments on the V2X-Real benchmark cover vehicle-centric, infrastructure-centric, vehicle-to-vehicle, infrastructure-to-infrastructure, and vehicle-to-infrastructure settings under identical backbone and training configurations. 
The proposed method consistently improves mean detection performance over strong intermediate-fusion baselines, with the largest gains on trucks, clear improvements on pedestrians, and competitive results on cars. These results show that aligning feature extraction and fusion with class-dependent geometry and point density leads to more balanced cooperative perception in realistic vehicle-to-everything deployments.

[239] SatReg: Regression-based Neural Architecture Search for Lightweight Satellite Image Segmentation

Edward Humes,Tinoosh Mohsenin

Main category: cs.CV

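TL;DR: This paper proposes SatReg, a framework that uses regression modeling to efficiently search for the best architecture parameters of lightweight remote-sensing segmentation models on edge devices, balancing accuracy against hardware cost.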

Details Motivation: Earth-observation workloads are moving toward onboard and edge processing, so remote-sensing segmentation models must run under strict latency and energy constraints. Method: With CM-UNet as the teacher model, the search space is reduced to two width-related variables; a small set of student models is sampled and profiled on a Jetson Orin Nano; low-order surrogate models are fitted for mIoU, latency, and power; and knowledge distillation is used to train the students. Result: The learned surrogates quickly select near-optimal architectures for a deployment target; experiments show that the width variables affect accuracy and hardware cost differently, supporting the reduced-space regression strategy. Conclusion: SatReg is a practical hardware-aware tuning method for lightweight adaptation of hybrid CNN-Mamba segmentation models in future space-edge systems. Abstract: As Earth-observation workloads move toward onboard and edge processing, remote-sensing segmentation models must operate under tight latency and energy constraints. We present SatReg, a regression-based hardware-aware tuning framework for lightweight remote-sensing segmentation on edge platforms. Using CM-UNet as the teacher architecture, we reduce the search space to two dominant width-related variables, profile a small set of student models on an NVIDIA Jetson Orin Nano, and fit low-order surrogate models for mIoU, latency, and power. Knowledge distillation is used to efficiently train the sampled students. The learned surrogates enable fast selection of near-optimal architecture settings for deployment targets without exhaustive search. Results show that the selected variables affect task accuracy and hardware cost differently, making reduced-space regression a practical strategy for adapting hybrid CNN-Mamba segmentation models to future space-edge systems.
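The surrogate-based selection described above can be sketched as a small least-squares problem. The abstract says only that the surrogates are "low-order", so the quadratic feature set, the function names, and the latency-budget selection rule below are assumptions made for illustration; `w1` and `w2` stand in for the two width-related variables, whose exact definition the abstract does not give.

```python
import numpy as np


def quad_features(w1, w2):
    """Second-order polynomial features of the two width variables (assumed form)."""
    w1, w2 = np.asarray(w1, dtype=float), np.asarray(w2, dtype=float)
    return np.stack([np.ones_like(w1), w1, w2, w1 * w2, w1 ** 2, w2 ** 2], axis=-1)


def fit_surrogate(w1, w2, y):
    """Least-squares fit of one measured metric (mIoU, latency, or power)."""
    coef, *_ = np.linalg.lstsq(quad_features(w1, w2), np.asarray(y, dtype=float), rcond=None)
    return coef


def predict(coef, w1, w2):
    """Evaluate a fitted surrogate at new width settings."""
    return quad_features(w1, w2) @ coef


def select_config(miou_coef, latency_coef, candidates, latency_budget):
    """Pick the candidate with the highest predicted mIoU within the latency budget."""
    candidates = np.asarray(candidates, dtype=float)
    miou = predict(miou_coef, candidates[:, 0], candidates[:, 1])
    latency = predict(latency_coef, candidates[:, 0], candidates[:, 1])
    feasible = latency <= latency_budget
    if not feasible.any():
        return None
    best = int(np.argmax(np.where(feasible, miou, -np.inf)))
    return float(candidates[best, 0]), float(candidates[best, 1])
```

In use, each profiled student contributes one row of measurements; after fitting, evaluating thousands of candidate settings costs only a matrix product, which is what makes the search fast compared with training and profiling every candidate.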

[240] Anatomy-Informed Deep Learning for Abdominal Aortic Aneurysm Segmentation

Osamah Sufyan,Martin Brückmann,Ralph Wickenhöfer,Babette Dellen,Uwe Jaekel

Main category: cs.CV

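TL;DR: This paper proposes an anatomy-aware segmentation framework for abdominal aortic aneurysm (AAA). Organ exclusion masks generated by TotalSegmentator inject anatomical priors into U-Net training, substantially reducing false positives and improving boundary consistency, with high robustness and generalization even on a small dataset.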

Details Motivation: Abdominal aortic aneurysm (AAA) is hard to segment in CT angiography because of large anatomical variability, low-contrast vessel boundaries, and neighboring organs with similar intensities, which leads to many false positives. Method: An anatomy-aware segmentation framework uses exclusion masks for non-vascular organs, generated by TotalSegmentator, as anatomical priors during U-Net training; the model is penalized for predicting AAA inside these regions, steering it toward the aorta and its pathological dilation. Result: Compared with a standard U-Net baseline, the method achieves higher segmentation accuracy on a small dataset, substantially fewer false positives, and better boundary consistency. Conclusion: Incorporating anatomical knowledge through organ exclusion masks is an efficient way to improve robustness and generalization, supporting reliable AAA segmentation with limited annotated data. Abstract: In CT angiography, the accurate segmentation of abdominal aortic aneurysms (AAAs) is difficult due to large anatomical variability, low-contrast vessel boundaries, and the close proximity of organs whose intensities resemble vascular structures, often leading to false positives. To address these challenges, we propose an anatomy-aware segmentation framework that integrates organ exclusion masks derived from TotalSegmentator into the training process. These masks encode explicit anatomical priors by identifying non-vascular organs and penalizing aneurysm predictions within these regions, thereby guiding the U-Net to focus on the aorta and its pathological dilation while suppressing anatomically implausible predictions. Despite being trained on a relatively small dataset, the anatomy-aware model achieves high accuracy, substantially reduces false positives, and improves boundary consistency compared to a standard U-Net baseline. The results demonstrate that incorporating anatomical knowledge through exclusion masks provides an efficient mechanism to enhance robustness and generalization, enabling reliable AAA segmentation even with limited training data.
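One simple way to realize "penalizing aneurysm predictions within these regions" is to add a term to the segmentation loss that grows with the predicted aneurysm probability inside the exclusion mask. The abstract does not give the exact loss, so the formulation below (binary cross-entropy plus the mean probability inside the mask, weighted by a hypothetical `lam`) is an assumed sketch rather than the authors' objective.

```python
import numpy as np


def binary_cross_entropy(prob, target, eps=1e-7):
    """Mean voxel-wise binary cross-entropy between probabilities and labels."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(np.mean(-(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob))))


def exclusion_penalty(prob, exclusion_mask):
    """Mean predicted aneurysm probability inside non-vascular organs."""
    n_excluded = exclusion_mask.sum()
    if n_excluded == 0:
        return 0.0
    return float((prob * exclusion_mask).sum() / n_excluded)


def anatomy_aware_loss(prob, target, exclusion_mask, lam=1.0):
    """Segmentation loss plus a penalty for predictions inside excluded organs.

    lam is a hypothetical weighting hyperparameter, not a value from the paper.
    """
    return binary_cross_entropy(prob, target) + lam * exclusion_penalty(prob, exclusion_mask)
```

The penalty is zero whenever the network assigns no probability mass to excluded organs, so it acts only on anatomically implausible predictions and leaves the ordinary segmentation term untouched elsewhere.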

[241] NTIRE 2026 Challenge on Single Image Reflection Removal in the Wild: Datasets, Results, and Methods

Jie Cai,Kangning Yang,Zhiyuan Li,Florin-Alexandru Vasluianu,Radu Timofte,Jinlong Li,Jinglin Shen,Zibo Meng,Junyan Cao,Lu Zhao,Pengwei Liu,Yuyi Zhang,Fengjun Guo,Jiagao Hu,Zepeng Wang,Fei Wang,Daiguo Zhou,Yi'ang Chen,Honghui Zhu,Mengru Yang,Yan Luo,Kui Jiang,Jin Guo,Jonghyuk Park,Jae-Young Sim,Wei Zhou,Hongyu Huang,Linfeng Li,Lindong Kong,Saiprasad Meesiyawar,Misbha Falak Khanpagadi,Nikhil Akalwadi,Ramesh Ashok Tabib,Uma Mudenagudi,Bilel Benjdira,Anas M. Ali,Wadii Boulila,Kosuke Shigematsu,Hiroto Shirono,Asuka Shin,Guoyi Xu,Yaoxin Jiang,Jiajia Liu,Yaokun Shi,Jiachen Tu,Shreeniketh Joshi,Jin-Hui Jiang,Yu-Fan Lin,Yu-Jou Hsiao,Chia-Ming Lee,Fu-En Yang,Yu-Chiang Frank Wang,Chih-Chung Hsu

Main category: cs.CV

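TL;DR: This paper reports on the NTIRE 2026 Single Image Reflection Removal (SIRR) challenge, which focuses on reflection removal in real-world scenes, and releases the real-world OpenRR-5k dataset, advancing performance in the field.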

Details Motivation: Existing reflection removal methods are mostly tested on synthetic images or a limited set of real images and fall short of practical needs; an evaluation benchmark and methodological progress aimed at real-world scenes are needed. Method: The NTIRE 2026 SIRR challenge was organized around OpenRR-5k, a large-scale real-world reflection image dataset that was built and released so participants could develop removal algorithms for complex reflection scenarios. Result: More than 100 teams registered and 11 entered the final testing phase; several methods set new SOTA performance and were unanimously recognized by five domain experts. Conclusion: The OpenRR-5k dataset and the challenge narrow the gap between academic research and real-world application, providing a new benchmark and practical progress for reflection removal. Abstract: In this paper, we review the NTIRE 2026 challenge on single-image reflection removal (SIRR) in the Wild. SIRR is a fundamental task in image restoration. Despite progress in academic research, most methods are tested on synthetic images or limited real-world images, creating a gap in real-world applications. In this challenge, we provide participants with the OpenRR-5k dataset, which requires them to process real-world images that cover a range of reflection scenarios and intensities, with the goal of generating clean images without reflections. The challenge attracted more than 100 registrations, with 11 of them participating in the final testing phase. The top-ranked methods advanced the state-of-the-art reflection removal performance and earned unanimous recognition from the five experts in the field. The proposed OpenRR-5k dataset is available at https://huggingface.co/datasets/qiuzhangTiTi/OpenRR-5k, and the homepage of this challenge is at https://github.com/caijie0620/OpenRR-5k. Due to page limitations, this article only presents partial content; the full report and detailed analyses are available in the extended arXiv version.

[242] SIMPLER: H&E-Informed Representation Learning for Structured Illumination Microscopy

Abu Zahid Bin Aziz,Syed Fahim Ahmed,Gnanesh Rasineni,Mei Wang,Olcaytu Hatipoglu,Marisa Ricci,Malaiyah Shaw,Guang Li,J. Quincy Brown,Valerio Pascucci,Shireen Elhabian

Main category: cs.CV

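TL;DR: This paper proposes SIMPLER, a framework that performs cross-modality self-supervised pretraining between Structured Illumination Microscopy (SIM) and Hematoxylin and Eosin (H&E) stained images. It addresses the performance drop caused by modality shift in thick-tissue SIM fluorescence imaging and enables stain-free, rapid, nondestructive pathology analysis.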

Details Motivation: Existing large models for digital pathology target thin tissue sections (e.g., H&E, IHC) and transfer poorly to thick-tissue fluorescence modalities such as SIM, where they suffer from substantial modality shift and overfit to appearance features. Method: The SIMPLER framework uses H&E as a semantic anchor and progressively aligns SIM and H&E representations through adversarial, contrastive, and reconstruction objectives, so that SIM embeddings retain modality-specific characteristics while capturing histological structure. Result: A single SIMPLER encoder consistently outperforms SIM models trained from scratch or pretrained on H&E only across several downstream tasks (e.g., multiple instance learning and morphological clustering); joint alignment improves SIM performance without harming H&E representations. Conclusion: SIMPLER achieves asymmetric representation enrichment between H&E and SIM, providing a transferable, structure-aware deep learning basis for label-free, rapid intraoperative diagnosis on fresh tissue. Abstract: Structured Illumination Microscopy (SIM) enables rapid, high-contrast optical sectioning of fresh tissue without staining or physical sectioning, making it promising for intraoperative and point-of-care diagnostics. Recent foundation and large-scale self-supervised models in digital pathology have demonstrated strong performance on section-based modalities such as Hematoxylin and Eosin (H&E) and immunohistochemistry (IHC). However, these approaches are predominantly trained on thin tissue sections and do not explicitly address thick-tissue fluorescence modalities such as SIM. When transferred directly to SIM, performance is constrained by substantial modality shift, and naive fine-tuning often overfits to modality-specific appearance rather than underlying histological structure. We introduce SIMPLER (Structured Illumination Microscopy-Powered Learning for Embedding Representations), a cross-modality self-supervised pretraining framework that leverages H&E as a semantic anchor to learn reusable SIM representations. H&E encodes rich cellular and glandular structure aligned with established clinical annotations, while SIM provides rapid, nondestructive imaging of fresh tissue. During pretraining, SIM and H&E are progressively aligned through adversarial, contrastive, and reconstruction-based objectives, encouraging SIM embeddings to internalize histological structure from H&E without collapsing modality-specific characteristics. 
A single pretrained SIMPLER encoder transfers across multiple downstream tasks, including multiple instance learning and morphological clustering, consistently outperforming SIM models trained from scratch or H&E-only pretraining. Importantly, joint alignment enhances SIM performance without degrading H&E representations, demonstrating asymmetric enrichment rather

[243] Context Matters: Vision-Based Depression Detection Comparing Classical and Deep Approaches

Maneesh Bilalpur,Saurabh Hinduja,Sonish Sivarajkumar,Nicholas Allen,Yanshan Wang,Itir Onal Ertugrul,Jeffrey F. Cohn

Main category: cs.CV

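TL;DR: This paper compares a classical approach (handcrafted features + SVM) with a deep learning approach (FMAE-IAT embeddings + MLP) for vision-based depression detection. The classical approach is more accurate and fairer, but both generalize poorly across contexts, suggesting that depression representations are context-specific.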

Details Motivation: There has been no systematic comparison of classical and deep learning approaches to vision-based depression detection in terms of accuracy, fairness, and cross-context generalizability. Method: Two approaches, handcrafted features + SVM (classical) and FMAE-IAT embeddings + MLP (deep), are compared on two databases: TPOT (mother-child interactions) and Pitt (patient-clinician interviews). Depression is defined according to DSM criteria, operationalized differently in each context. Result: The classical approach is more accurate in both contexts and significantly fairer in the patient-clinician context; both approaches show limited cross-context generalizability. Conclusion: The classical approach retains an advantage in vision-based depression detection, and depression representations may be strongly context-dependent, calling for context-specific modeling. Abstract: The classical approach to detecting depression from vision emphasizes interpretable features, such as facial expression, and classifiers such as the Support Vector Machine (SVM). With the advent of deep learning, there has been a shift in feature representations and classification approaches. Contemporary approaches use learnt features from general-purpose vision models such as VGGNet to train machine learning models. Little is known about how classical and deep approaches compare in depression detection with respect to accuracy, fairness, and generalizability, especially across contexts. To address these questions, we compared classical and deep approaches to the detection of depression in the visual modality in two different contexts: Mother-child interactions in the TPOT database and patient-clinician interviews in the Pitt database. In the former, depression was operationalized as a history of depression per the DSM and current or recent clinically significant symptoms. In the latter, all participants met initial criteria for depression per DSM, and depression was reassessed over the course of treatment. The classical approach included handcrafted features with SVM classifiers. Learnt features were turn-level embeddings from the FMAE-IAT that were combined with Multi-Layer Perceptron classifiers. The classical approach achieved higher accuracy in both contexts. It was also significantly fairer than the deep approach in the patient-clinician context. Cross-context generalizability was modest at best for both approaches, which suggests that depression may be context-specific.

[244] Multi-modal, multi-scale representation learning for satellite imagery analysis just needs a good ALiBi

Patrick Kage,Pavlos Andreadis

Main category: cs.CV

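TL;DR: This paper proposes Scale-ALiBi, a linear-bias transformer attention mechanism with a spatial encoding bias for multi-scale, multi-modal satellite imagery. It improves results on GEO-Bench and is released together with a new open dataset.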

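Details Motivation: Existing vision foundation models struggle to jointly model satellite imagery that spans multiple spatial resolutions and modalities (e.g., optical and SAR). Method: The Scale-ALiBi attention mechanism introduces a linear spatial encoding bias based on ground sample distance (GSD) scale. It is implemented with a triple-contrastive and reconstructive training architecture on a dataset of aligned high- and low-resolution optical imagery and low-resolution SAR imagery. Result: Performance improves on the GEO-Bench benchmark, and the newly curated multi-modal, multi-scale satellite imagery dataset is released publicly. Conclusion: Scale-ALiBi effectively models cross-scale, cross-modal relationships in satellite imagery and provides a scalable, efficient attention design for remote-sensing foundation models. Abstract: Vision foundation models have been shown to be effective at processing satellite imagery into representations fit for downstream tasks; however, creating models which operate over multiple spatial resolutions and modes is challenging. This paper presents Scale-ALiBi, a linear bias transformer attention mechanism with a spatial encoding bias to relationships between image patches at different ground sample distance scales. We provide an implementation of Scale-ALiBi over a dataset of aligned high- and low-resolution optical and low-resolution SAR satellite imagery data using a triple-contrastive and reconstructive architecture, show an improvement on the GEO-Bench benchmark, and release the newly curated dataset publicly.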
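The idea of a linear attention bias that respects ground sample distance can be sketched by measuring patch-to-patch distance in meters on the ground instead of in token positions. The abstract does not specify the exact bias, so everything below is an assumption for illustration: the Euclidean ground distance, the `unit_m` normalizer, and the helper names. Only the per-head slope schedule follows the geometric sequence of the original (1D) ALiBi for power-of-two head counts.

```python
import numpy as np


def alibi_slopes(num_heads):
    """Per-head slopes from the original ALiBi geometric sequence."""
    return np.array([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])


def patch_centers_m(grid, patch_px, gsd_m):
    """Ground coordinates (meters) of patch centers for a grid x grid tiling.

    gsd_m is the ground sample distance, i.e. meters covered by one pixel.
    """
    coords = (np.arange(grid) + 0.5) * patch_px * gsd_m
    yy, xx = np.meshgrid(coords, coords, indexing="ij")
    return np.stack([yy.ravel(), xx.ravel()], axis=-1)


def scale_alibi_bias(centers_q, centers_k, num_heads, unit_m=1.0):
    """Additive attention bias: minus slope times ground distance (assumed form)."""
    distance = np.linalg.norm(centers_q[:, None, :] - centers_k[None, :, :], axis=-1) / unit_m
    return -alibi_slopes(num_heads)[:, None, None] * distance[None, :, :]
```

Because distances are expressed in ground units, a high-resolution patch and a low-resolution patch that cover the same location receive a small penalty even though their token indices are unrelated; the returned array has shape (heads, queries, keys) and would be added to the attention logits before the softmax.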

[245] Multinex: Lightweight Low-light Image Enhancement via Multi-prior Retinex

Alexandru Brateanu,Tingting Mu,Codruta Ancuti,Cosmin Ancuti

Main category: cs.CV

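TL;DR: This paper proposes Multinex, an ultra-lightweight low-light image enhancement framework that fuses multiple representations within a Retinex residual formulation. With very few parameters (as low as 0.7K), it reaches performance comparable to heavy models.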

Details Motivation: Existing SOTA methods rely on large models and multi-stage training, making them hard to deploy on edge devices; they also use a single color space, which tends to cause exposure and color artifacts. Method: The Multinex framework decomposes an image into stacks of illumination and color priors derived from different analytic representations, and learns to fuse them within a Retinex residual formulation to produce luminance and reflectance corrections. It prioritizes enhancement over reconstruction and uses lightweight neural operations. Result: Multinex comes in two lightweight versions (45K and 0.7K parameters) that significantly outperform the corresponding lightweight SOTA methods on multiple benchmarks and match the performance of heavy models. Conclusion: Multinex demonstrates that fusing multiple fine-grained representations with a structured lightweight design is effective for low-light image enhancement, combining high performance with feasibility for edge deployment. Abstract: Low-light image enhancement (LLIE) aims to restore natural visibility, color fidelity, and structural detail under severe illumination degradation. State-of-the-art (SOTA) LLIE techniques often rely on large models and multi-stage training, limiting practicality for edge deployment. Moreover, their dependence on a single color space introduces instability and visible exposure or color artifacts. To address these, we propose Multinex, an ultra-lightweight structured framework that integrates multiple fine-grained representations within a principled Retinex residual formulation. It decomposes an image into illumination and color prior stacks derived from distinct analytic representations, and learns to fuse these representations into luminance and reflectance adjustments required to correct exposure. By prioritizing enhancement over reconstruction and exploiting lightweight neural operations, Multinex significantly reduces computational cost, exemplified by its lightweight (45K parameters) and nano (0.7K parameters) versions. Extensive benchmarks show that all lightweight variants significantly outperform their corresponding lightweight SOTA models, and reach comparable performance to heavy models. Paper page available at https://albrateanu.github.io/multinex.

[246] DeepShapeMatchingKit: Accelerated Functional Map Solver and Shape Matching Pipelines Revisited

Yizheng Xie,Lennart Bastian,Congyue Deng,Thomas W. Mitchel,Maolin Gao,Daniel Cremers

Main category: cs.CV

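TL;DR: This paper proposes an efficient vectorized solver for functional maps that achieves up to a 33x speedup, analyzes the differences between two variants of the spatial gradient features in DiffusionNet, proposes balanced accuracy as an evaluation metric for partial matching, and open-sources DeepShapeMatchingKit, a unified toolkit for deep shape matching.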

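Details Motivation: Standard functional map implementations solve k linear systems serially, which becomes a computational bottleneck at high spectral resolution; the spatial gradient features in DiffusionNet have a previously unnoticed implementation divergence; partial matching evaluation lacks a metric that is robust across overlap ratios. Method: Proposes a vectorized functional map formulation that solves all linear systems in parallel in a single kernel call; systematically analyzes the mathematical form and geometric meaning of the two spatial gradient feature variants in DiffusionNet; introduces balanced accuracy as a complementary metric for partial matching; builds and open-sources the modular, standardized DeepShapeMatchingKit codebase. Result: The vectorized method achieves up to a 33x speedup while preserving the exact solution; the two DiffusionNet gradient feature variants are shown to correspond to distinct families of tangent-plane transformations, and their performance differences are verified on multiple benchmarks; balanced accuracy is shown to reflect partial matching performance more robustly; DeepShapeMatchingKit is open source and supports training, evaluation, and data processing for mainstream deep shape matching methods. Conclusion: Through algorithmic optimization, implementation clarification, and improved evaluation, the paper substantially improves the efficiency and reproducibility of deep functional maps and gives the community a unified, easy-to-use open-source toolchain for non-rigid 3D shape matching research. Abstract: Deep functional maps, leveraging learned feature extractors and spectral correspondence solvers, are fundamental to non-rigid 3D shape matching. Based on an analysis of open-source implementations, we find that standard functional map implementations solve k independent linear systems serially, which is a computational bottleneck at higher spectral resolution. We thus propose a vectorized reformulation that solves all systems in a single kernel call, achieving up to a 33x speedup while preserving the exact solution. Furthermore, we identify and document a previously unnoticed implementation divergence in the spatial gradient features of the mainstay DiffusionNet: two variants that parameterize distinct families of tangent-plane transformations, and present experiments analyzing their respective behaviors across diverse benchmarks. We additionally revisit overlap prediction evaluation for partial-to-partial matching and show that balanced accuracy provides a useful complementary metric under varying overlap ratios. To share these advancements with the wider community, we present an open-source codebase, DeepShapeMatchingKit, that incorporates these improvements and standardizes training, evaluation, and data pipelines for common deep shape matching methods. The codebase is available at: https://github.com/xieyizheng/DeepShapeMatchingKit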
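The idea of replacing k serial solves with one stacked solve can be illustrated roughly as follows. This is a minimal sketch, not the DeepShapeMatchingKit API: it assumes the commonly used regularized setup in which row i of the functional map solves its own k x k system, and the function names, the random stand-in matrices, and the `lam` weight are all hypothetical.

```python
import numpy as np

def solve_rows_serial(AAt, BAt, masks, lam):
    """Solve one k x k system per row of the functional map in a Python loop."""
    k = AAt.shape[0]
    C = np.empty((k, k))
    for i in range(k):
        lhs = AAt + lam * np.diag(masks[i])  # row-specific regularizer
        C[i] = np.linalg.solve(lhs, BAt[i])
    return C

def solve_rows_batched(AAt, BAt, masks, lam):
    """Stack all k systems and hand them to a single linalg.solve call."""
    k = AAt.shape[0]
    # lhs[i] = AAt + lam * diag(masks[i]), built by broadcasting: shape (k, k, k)
    lhs = AAt[None, :, :] + lam * masks[:, :, None] * np.eye(k)[None, :, :]
    rhs = BAt[:, :, None]  # shape (k, k, 1): one explicit column vector per system
    return np.linalg.solve(lhs, rhs)[:, :, 0]

# Random stand-ins for spectral feature matrices (assumed shapes: k x d).
rng = np.random.default_rng(0)
k, d = 6, 16
A = rng.standard_normal((k, d))
B = rng.standard_normal((k, d))
masks = rng.random((k, k))  # hypothetical per-row regularization weights
C_serial = solve_rows_serial(A @ A.T, B @ A.T, masks, lam=1e-3)
C_batched = solve_rows_batched(A @ A.T, B @ A.T, masks, lam=1e-3)
```

Both functions solve the same systems, so the results should agree up to floating-point error; any speedup depends on hardware and backend and is not measured here.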

[247] Agentic Video Generation: From Text to Executable Event Graphs via Tool-Constrained LLM Planning

Nicolae Cudlenco,Mihai Masala,Marius Leordeanu

Main category: cs.CV

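TL;DR: This paper proposes a new multi-agent video generation system in which the LLM generates a structured spatiotemporal event graph (GEST) that a 3D game engine then executes deterministically, ensuring semantic reliability and physical plausibility. By separating narrative planning from constraint enforcement and introducing a hierarchical two-agent architecture with Relation Subagents, it addresses the inability of existing approaches to produce executable specifications and significantly outperforms neural video generation models in both automatic and human evaluation.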

Details Motivation: Existing multi-agent systems that use LLM agents to coordinate neural video generators produce good visuals but are semantically unreliable and lack ground-truth annotations; a new paradigm is needed that generates semantically precise, physically plausible, and verifiable video content. Method: Proposes an agentic system built around GEST (Graph of Events in Space and Time): the LLM handles only natural-language narrative planning, while a programmatic state backend enforces simulator constraints through validated tool calls; uses a hierarchical two-agent architecture (Director and Scene Builder), supplemented by Relation Subagents that populate the logical and semantic edges of the GEST; every generated GEST specification is guaranteed to be executable. Result: In autonomous generation, the system beats procedural baselines with win rates of 79% in text comparisons and 74% in video comparisons; in seeded generation, its engine-generated videos substantially outperform VEO 3.1 and WAN 2.2 on physical validity (58% vs 25% and 20%) and semantic alignment (3.75/5 vs 2.33 and 1.50). Conclusion: Decoupling the LLM from pixel generation in favor of structured semantic modeling, with a programmatic backend guaranteeing executability, is a key path to more reliable and controllable multi-agent video generation; the combination of GEST and a game engine provides a new foundation for embodied AI and trustworthy content generation. Abstract: Existing multi-agent video generation systems use LLM agents to orchestrate neural video generators, producing visually impressive but semantically unreliable outputs with no ground truth annotations. We present an agentic system that inverts this paradigm: instead of generating pixels, the LLM constructs a formal Graph of Events in Space and Time (GEST) -- a structured specification of actors, actions, objects, and temporal constraints -- which is then executed deterministically in a 3D game engine. A staged LLM refinement pipeline fails entirely at this task (0 of 50 attempts produce an executable specification), motivating a fundamentally different architecture based on a separation of concerns: the LLM handles narrative planning through natural language reasoning, while a programmatic state backend enforces all simulator constraints through validated tool calls, guaranteeing that every generated specification is executable by construction. The system uses a hierarchical two-agent architecture -- a Director that plans the story and a Scene Builder that constructs individual scenes through a round-based state machine -- with dedicated Relation Subagents that populate the logical and semantic edge types of the GEST formalism that procedural generation leaves empty, making this the first approach to exercise the full expressive capacity of the representation. 
We evaluate in two stages: autonomous generation against procedural baselines via a 3-model LLM jury, where agentic narratives win 79% of text and 74% of video comparisons; and seeded generation where the same text is given to our system, VEO 3.1, and WAN 2.2, with human annotations showing engine-generated videos substantially outperform neural generators on physical validity (58% vs 25% and 20%) and semantic alignment (3.75/5 vs 2.33 and 1.50).
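The separation between LLM planning and a validating state backend might look roughly like the toy sketch below. Everything in it is an assumption made for illustration (the `SceneState` class, its method names, the action vocabulary, and the error format); the abstract does not describe the actual tool interface.

```python
class SceneState:
    """Toy state backend: every mutation goes through a validated tool call."""

    def __init__(self, known_actions):
        self.known_actions = set(known_actions)  # actions the simulator supports
        self.actors = set()
        self.events = []  # accepted (actor, action, target) triples

    def add_actor(self, name):
        if name in self.actors:
            return {"ok": False, "error": f"actor '{name}' already exists"}
        self.actors.add(name)
        return {"ok": True}

    def add_event(self, actor, action, target=None):
        # Reject anything the simulator could not execute; state stays untouched.
        if actor not in self.actors:
            return {"ok": False, "error": f"unknown actor '{actor}'"}
        if action not in self.known_actions:
            return {"ok": False, "error": f"unsupported action '{action}'"}
        self.events.append((actor, action, target))
        return {"ok": True}


state = SceneState(known_actions={"sit_down", "pick_up"})
state.add_actor("alice")
rejected = state.add_event("bob", "sit_down")          # "bob" was never created
accepted = state.add_event("alice", "pick_up", "cup")  # passes both checks
```

Because invalid calls never change the state, whatever ends up in `state.events` satisfies the checks by construction; the error message is what a planner could use to retry.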

[248] GTASA: Ground Truth Annotations for Spatiotemporal Analysis, Evaluation and Training of Video Models

Nicolae Cudlenco,Mihai Masala,Marius Leordeanu

Main category: cs.CV

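TL;DR: This paper presents the GTASA dataset and the GEST-Engine system for generating multi-actor videos with precise spatiotemporal structure annotations. The system outperforms existing neural video generators in evaluations of physical plausibility and semantic fidelity, and the paper finds that self-supervised video encoders model spatial structure better than vision-language model encoders.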

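Details Motivation: Existing neural video generators struggle to produce complex multi-actor scenario videos, and the lack of ground truth for physical plausibility and semantic faithfulness makes evaluation difficult. Method: Builds the GTASA dataset (with per-frame spatial relation graphs and event-level temporal mappings) and designs the GEST-Engine generation system based on Graphs of Events in Space and Time (GEST); compares methods qualitatively and quantitatively through human evaluation and by training video captioning models; uses GTASA's exact 3D ground truth to probe four frozen video encoders on 11 spatiotemporal reasoning tasks. Result: GEST-Engine significantly outperforms open- and closed-source neural generators in human evaluation of physical validity and semantic alignment and on the video captioning task; self-supervised encoders encode spatial structure clearly better than VLM visual encoders. Conclusion: GTASA provides a reliable benchmark for multi-actor video generation and evaluation; GEST-Engine shows the advantages of a generation paradigm based on explicit spatiotemporal graphs; the probing results reveal key differences in spatiotemporal understanding across pretraining paradigms. Abstract: Generating complex multi-actor scenario videos remains difficult even for state-of-the-art neural generators, while evaluating them is hard due to the lack of ground truth for physical plausibility and semantic faithfulness. We introduce GTASA, a corpus of multi-actor videos with per-frame spatial relation graphs and event-level temporal mappings, and the system that produced it based on Graphs of Events in Space and Time (GEST): GEST-Engine. We compare our method with both open and closed source neural generators and demonstrate both qualitatively (human evaluation of physical validity and semantic alignment) and quantitatively (via training video captioning models) the clear advantages of our method. Probing four frozen video encoders across 11 spatiotemporal reasoning tasks enabled by GTASA's exact 3D ground truth reveals that self-supervised encoders encode spatial structure significantly better than VLM visual encoders.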

[249] FishRoPE: Projective Rotary Position Embeddings for Omnidirectional Visual Perception

Rahul Ahuja,Mudit Jain,Bala Murali Manoghar Sai Sudhakar,Venkatraman Narayanan,Pratik Likhar,Varun Ravi Kumar,Senthil Yogamani

Main category: cs.CV

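TL;DR: This paper proposes a lightweight framework for adapting frozen vision foundation models (VFMs) to fisheye camera geometry. It fine-tunes a DINOv2 backbone with LoRA and introduces Fisheye Rotary Position Embedding (FishRoPE), based on spherical coordinates, to model attention in angular space rather than pixel space. This resolves the inconsistent spatial representations caused by severe radial distortion in fisheye images and reaches SOTA on the WoodScape and SynWoodScapes benchmarks.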

Details Motivation: Existing vision foundation models and BEV representations assume the rectilinear geometry of pinhole cameras and cannot handle the severe radial distortion of fisheye cameras; large-scale fisheye annotations are scarce, so training new models from scratch is impractical. Method: Proposes a framework with two components: 1) a frozen DINOv2 backbone with lightweight LoRA fine-tuning to transfer self-supervised features; 2) Fisheye Rotary Position Embedding (FishRoPE), which reparameterizes the attention mechanism in the spherical coordinates of the fisheye projection so that self-attention and cross-attention operate on angular separation rather than pixel distance. Result: Reaches 54.3 mAP on WoodScape 2D detection and 65.1 mIoU on SynWoodScapes BEV segmentation, both state of the art. Conclusion: FishRoPE is architecture-agnostic, adds negligible computational overhead, and naturally reduces to the standard position encoding under pinhole geometry; the framework adapts frozen VFMs to fisheye geometry without re-pretraining and significantly improves downstream perception performance. Abstract: Vision foundation models (VFMs) and Bird's Eye View (BEV) representation have advanced visual perception substantially, yet their internal spatial representations assume the rectilinear geometry of pinhole cameras. Fisheye cameras, widely deployed on production autonomous vehicles for their surround-view coverage, exhibit severe radial distortion that renders these representations geometrically inconsistent. At the same time, the scarcity of large-scale fisheye annotations makes retraining foundation models from scratch impractical. We present a lightweight framework that adapts frozen VFMs to fisheye geometry through two components: a frozen DINOv2 backbone with Low-Rank Adaptation (LoRA) that transfers rich self-supervised features to fisheye without task-specific pretraining, and Fisheye Rotary Position Embedding (FishRoPE), which reparameterizes the attention mechanism in the spherical coordinates of the fisheye projection so that both self-attention and cross-attention operate on angular separation rather than pixel distance. FishRoPE is architecture-agnostic, introduces negligible computational overhead, and naturally reduces to the standard formulation under pinhole geometry. We evaluate our framework on WoodScape 2D detection (54.3 mAP) and SynWoodScapes BEV segmentation (65.1 mIoU), where it achieves state-of-the-art results on both benchmarks.
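To see why a rotary embedding driven by angles makes attention depend on angular separation, a one-dimensional toy version is enough. The sketch below is generic rotary position embedding applied to a single angle per token; it is not the FishRoPE formulation itself, and the frequency schedule, the `base` value, and the function names are assumptions.

```python
import math

def rotate_pairs(vec, angle, base=100.0):
    """Rotate consecutive channel pairs of `vec` by angle * per-pair frequency."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        freq = base ** (-i / dim)  # assumed geometric frequency schedule
        c, s = math.cos(angle * freq), math.sin(angle * freq)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def attention_score(q, k, angle_q, angle_k):
    """Dot product of a query and a key after rotating each by its own angle."""
    rq = rotate_pairs(q, angle_q)
    rk = rotate_pairs(k, angle_k)
    return sum(a * b for a, b in zip(rq, rk))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
# Same angular separation (0.2 rad) at two different absolute viewing angles.
near_axis = attention_score(q, k, 0.3, 0.1)
off_axis = attention_score(q, k, 0.9, 0.7)
```

The two scores agree up to rounding because each channel pair only sees the difference of the two rotation angles, which is the property that lets attention work on angular rather than pixel distance.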

[250] Rethinking Video Human-Object Interaction: Set Prediction over Time for Unified Detection and Anticipation

Yuanhao Luo,Di Wen,Kunyu Peng,Ruiping Liu,Junwei Zheng,Yufan Chen,Jiale Wei,Rainer Stiefelhagen

Main category: cs.CV

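TL;DR: This paper proposes the DETAnt-HOI benchmark and the HOI-DA framework, which jointly model present HOI detection and future interaction anticipation, achieve multi-horizon anticipation of human-object interactions through residual state transitions, and validate the effectiveness of joint learning on a benchmark with corrected temporal labels.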

Details Motivation: Existing methods treat interaction anticipation as a downstream task built on pre-constructed human-object pairs, without joint reasoning between detection and prediction; sparse keyframe annotations in current datasets also misalign future labels with the actual dynamics, which undermines evaluation reliability. Method: Builds the temporally corrected DETAnt-HOI benchmark; proposes the pair-centric HOI-DA framework, which jointly performs localization, present HOI detection, and future interaction anticipation, modeling future interactions as residual transitions from current pair states. Result: Achieves consistent improvements on both detection and anticipation, with larger gains at longer horizons; shows that anticipation is most effective when it acts as a structural constraint on pair-level video representation learning. Conclusion: Future anticipation of human-object interactions should be learned jointly with present detection rather than modeled separately; temporally aligned annotations and pair-level residual modeling are key to better multi-step HOI anticipation. Abstract: Video-based human-object interaction (HOI) understanding requires both detecting ongoing interactions and anticipating their future evolution. However, existing methods usually treat anticipation as a downstream forecasting task built on externally constructed human-object pairs, limiting joint reasoning between detection and prediction. In addition, sparse keyframe annotations in current benchmarks can temporally misalign nominal future labels from actual future dynamics, reducing the reliability of anticipation evaluation. To address these issues, we introduce DETAnt-HOI, a temporally corrected benchmark derived from VidHOI and Action Genome for more faithful multi-horizon evaluation, and HOI-DA, a pair-centric framework that jointly performs subject-object localization, present HOI detection, and future anticipation by modeling future interactions as residual transitions from current pair states. Experiments show consistent improvements in both detection and anticipation, with larger gains at longer horizons. Our results highlight that anticipation is most effective when learned jointly with detection as a structural constraint on pair-level video representation learning. Benchmark and code will be publicly available.

[251] IMPACT: A Dataset for Multi-Granularity Human Procedural Action Understanding in Industrial Assembly

Di Wen,Zeyun Zhong,David Schneider,Manuel Zaremski,Linus Kunzmann,Yitian Shi,Ruiping Liu,Yufan Chen,Junwei Zheng,Jiahang Li,Jonas Hemmerich,Qiyi Tong,Patric Grauberger,Arash Ajoudani,Danda Pani Paudel,Sven Matthiesen,Barbara Deml,Jürgen Beyerer,Luc Van Gool,Rainer Stiefelhagen,Kunyu Peng

Main category: cs.CV

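TL;DR: This paper introduces IMPACT, the first synchronized five-view RGB-D dataset for understanding real industrial assembly and disassembly procedures. It covers real angle grinder assembly and disassembly, provides synchronized multi-view capture, decoupled bimanual annotation, compliance-aware state tracking, and anomaly-recovery supervision, and reveals the limitations of existing single-task benchmarks in real deployment scenarios.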

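Details Motivation: Existing datasets for industrial procedural understanding lack synchronized multi-view capture in real industrial settings, fine-grained bimanual action annotation, state compliance modeling, and anomaly-handling supervision, so they cannot support deployment-oriented algorithm evaluation and development. Method: Builds the IMPACT dataset: 112 real angle grinder assembly and disassembly trials (39.5 hours) from 13 participants; synchronized five-view RGB-D capture; a partial-order prerequisite graph governing multi-route execution; a six-category anomaly taxonomy and NASA-TLX cognitive load measurement; a hierarchical annotation scheme linking hand-level atomic actions, procedural steps, component states, and compliance phases, with synchronized null spans to separate perceptual limitations from algorithmic failure. Result: Systematic baseline experiments reveal fundamental limitations of current methods under realistic deployment conditions such as incomplete observations, flexible execution paths, and corrective behavior; these limitations are invisible in traditional single-task benchmarks. Conclusion: IMPACT provides the first comprehensive benchmark for industrial procedural understanding that combines realism, synchronized multimodal capture, fine-grained annotation, and anomaly-recovery supervision, pushing algorithms toward real deployment scenarios. Abstract: We introduce IMPACT, a synchronized five-view RGB-D dataset for deployment-oriented industrial procedural understanding, built around real assembly and disassembly of a commercial angle grinder with professional-grade tools. To our knowledge, IMPACT is the first real industrial assembly benchmark that jointly provides synchronized ego-exo RGB-D capture, decoupled bimanual annotation, compliance-aware state tracking, and explicit anomaly-recovery supervision within a single real industrial workflow. It comprises 112 trials from 13 participants totaling 39.5 hours, with multi-route execution governed by a partial-order prerequisite graph, a six-category anomaly taxonomy, and operator cognitive load measured via NASA-TLX. The annotation hierarchy links hand-specific atomic actions to coarse procedural steps, component assembly states, and per-hand compliance phases, with synchronized null spans across views to decouple perceptual limitations from algorithmic failure. Systematic baselines reveal fundamental limitations that remain invisible to single-task benchmarks, particularly under realistic deployment conditions that involve incomplete observations, flexible execution paths, and corrective behavior. The full dataset, annotations, and evaluation code are available at https://github.com/Kratos-Wen/IMPACT.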

[252] Neural Stochastic Processes for Satellite Precipitation Refinement

Shunya Nagashima,Takumi Bannai,Shuitsu Koyama,Tomoya Mitsui,Shuntaro Suzuki

Main category: cs.CV

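TL;DR: This paper proposes a new model, Neural Stochastic Process (NSP), which fuses satellite precipitation data with ground-based rain gauge observations, improves quantitative precipitation estimation (QPE) accuracy by modeling spatiotemporal dependencies, and is validated on a new benchmark, QPEBench.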

Details Motivation: Satellite precipitation products contain systematic biases, while ground-based rain gauges are accurate but spatially sparse; existing fusion methods ignore the temporal structure of precipitation fields, which limits their performance. Method: Proposes the NSP model, which pairs a Neural Process encoder conditioned on gauge observations with a latent Neural SDE on a 2D spatial representation, trained simulation-free under a single variational objective; also builds the QPEBench benchmark dataset. Result: On QPEBench, NSP outperforms 13 baselines and JAXA's operational gauge-calibrated product; an independent experiment on Kyushu, Japan confirms regional generalization. Conclusion: By jointly modeling spatiotemporal dependencies, NSP significantly improves QPE accuracy and offers a new paradigm for multi-source precipitation fusion; QPEBench may help standardize evaluation in this area. Abstract: Accurate precipitation estimation is critical for flood forecasting, water resource management, and disaster preparedness. Satellite products provide global hourly coverage but contain systematic biases; ground-based gauges are accurate at point locations but too sparse for direct gridded correction. Existing methods fuse these sources by interpolating gauge observations onto the satellite grid, but treat each time step independently and therefore discard temporal structure in precipitation fields. We propose Neural Stochastic Process (NSP), a model that pairs a Neural Process encoder conditioning on arbitrary sets of gauge observations with a latent Neural SDE on a 2D spatial representation. NSP is trained under a single variational objective with simulation-free cost. We also introduce QPEBench, a benchmark of 43,756 hourly samples over the Contiguous United States (2021-2025) with four aligned data sources and six evaluation metrics. On QPEBench, NSP outperforms 13 baselines across all six metrics and surpasses JAXA's operational gauge-calibrated product. An additional experiment on Kyushu, Japan confirms generalization to a different region with independent data sources.

[253] Point2Pose: Occlusion-Recovering 6D Pose Tracking and 3D Reconstruction for Multiple Unknown Objects Via 2D Point Trackers

Tzu-Yuan Lin,Ho Jae Lee,Kevin Doherty,Yonghyeon Lee,Sangbae Kim

Main category: cs.CV

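TL;DR: Point2Pose is a method for 6D pose tracking of multiple rigid objects from monocular RGB-D video that requires no CAD models or category priors. It needs only sparse image points for initialization and combines 2D point tracking with online TSDF reconstruction, supporting multi-object tracking and instant recovery under severe occlusion.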

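Details Motivation: Existing model-free tracking methods struggle with severe occlusion and multi-object scenes, and tracking approaches often depend on CAD models or category priors, which limits generalization and practicality. Method: Uses a 2D point tracker to obtain long-range point correspondences while incrementally building an online truncated signed distance function (TSDF) representation of the tracked targets, enabling model-free, multi-object 6D pose tracking that survives complete occlusion. Result: Achieves performance comparable to SOTA methods on a severe-occlusion benchmark, is the first model-free method to support multi-object tracking and instant recovery after complete occlusion, and is validated on newly built simulation and real-world datasets with motion-capture ground truth. Conclusion: Point2Pose shows that robust, general multi-object 6D pose tracking is achievable from sparse point initialization and RGB-D video alone, offering a new paradigm for model-free visual tracking. Abstract: We present Point2Pose, a model-free method for causal 6D pose tracking of multiple rigid objects from monocular RGB-D video. Initialized only from sparse image points on the objects to be tracked, our approach tracks multiple unseen objects without requiring object CAD models or category priors. Point2Pose leverages a 2D point tracker to obtain long-range correspondences, enabling instant recovery after complete occlusion. Simultaneously, the system incrementally reconstructs an online Truncated Signed Distance Function (TSDF) representation of the tracked targets. Alongside the method, we introduce a new multi-object tracking dataset comprising both simulation and real-world sequences, with motion-capture ground truth for evaluation. Experiments show that Point2Pose achieves performance comparable to the state-of-the-art methods on a severe-occlusion benchmark, while additionally supporting multi-object tracking and recovery from complete occlusion, capabilities that are not supported by previous model-free tracking approaches.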

[254] DiningBench: A Hierarchical Multi-view Benchmark for Perception and Reasoning in the Dietary Domain

Song Jin,Juntian Zhang,Xun Zhang,Zeying Tian,Fei Jiang,Guojun Yin,Wei Lin,Yong Liu,Rui Yan

Main category: cs.CV

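TL;DR: This paper proposes DiningBench, a multi-view, hierarchical benchmark for evaluating vision-language models in the food domain. It covers three tasks (fine-grained classification, nutrition estimation, and visual question answering), systematically evaluates 29 mainstream VLMs, and reveals significant weaknesses of current models in fine-grained recognition and nutritional reasoning.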

Details Motivation: Existing benchmarks for vision-language models in the food domain suffer from coarse-grained categories, single-view images, and inaccurate metadata, so they cannot evaluate model capabilities comprehensively. Method: Builds DiningBench, a multi-view, hierarchical benchmark with 3,021 dishes and an average of 5.27 images per dish, including fine-grained hard negatives from the same menus and verification-based nutritional data; evaluates 29 open-source and proprietary VLMs and analyzes the effect of multi-view inputs and Chain-of-Thought reasoning, along with five primary failure modes. Result: Current VLMs perform well at general reasoning but fall clearly short on fine-grained visual discrimination and precise nutritional reasoning; multi-view inputs and Chain-of-Thought reasoning help somewhat but do not solve the problem. Conclusion: DiningBench gives food-domain VLM research a more challenging testbed and pushes models toward finer-grained, more reliable behavior. Abstract: Recent advancements in Vision-Language Models (VLMs) have revolutionized general visual understanding. However, their application in the food domain remains constrained by benchmarks that rely on coarse-grained categories, single-view imagery, and inaccurate metadata. To bridge this gap, we introduce DiningBench, a hierarchical, multi-view benchmark designed to evaluate VLMs across three levels of cognitive complexity: Fine-Grained Classification, Nutrition Estimation, and Visual Question Answering. Unlike previous datasets, DiningBench comprises 3,021 distinct dishes with an average of 5.27 images per entry, incorporating fine-grained "hard" negatives from identical menus and rigorous, verification-based nutritional data. We conduct an extensive evaluation of 29 state-of-the-art open-source and proprietary models. Our experiments reveal that while current VLMs excel at general reasoning, they struggle significantly with fine-grained visual discrimination and precise nutritional reasoning. Furthermore, we systematically investigate the impact of multi-view inputs and Chain-of-Thought reasoning, identifying five primary failure modes. DiningBench serves as a challenging testbed to drive the next generation of food-centric VLM research. All code is released at https://github.com/meituan/DiningBench.

[255] SignReasoner: Compositional Reasoning for Complex Traffic Sign Understanding via Functional Structure Units

Ruibin Wang,Zhenyu Lin,Xinhai Zhao

Main category: cs.CV

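TL;DR: This paper proposes SignReasoner, which uses the Functional Structure Unit (FSU) to turn general vision-language models (VLMs) into expert traffic sign reasoners, improving semantic understanding of and generalization to complex, multilingual, compositional traffic signs and reaching SOTA on the new TrafficSignEval benchmark.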

Details Motivation: Existing models, both small specialized ones and large VLMs, lack compositional generalization and fail on novel traffic sign configurations, which threatens autonomous driving safety. Method: Proposes the Functional Structure Unit (FSU) paradigm, which decomposes traffic signs into minimal functional blocks (such as Direction, Notice, and Lane), and designs a two-stage VLM post-training pipeline: 1) Iterative Caption-FSU Distillation; 2) FSU-GRPO, reinforcement learning with rewards based on Tree Edit Distance (TED). Result: On the newly built FSU-Reasoning benchmark TrafficSignEval, SignReasoner significantly improves traffic sign understanding across several VLMs and reaches a new SOTA, with high data efficiency and no architectural changes. Conclusion: The FSU paradigm effectively models the structural grammar of traffic signs and gives VLMs strong compositional generalization, offering a new path toward safe and reliable semantic understanding for autonomous driving. Abstract: Accurate semantic understanding of complex traffic signs, including those with intricate layouts, multi-lingual text, and composite symbols, is critical for autonomous driving safety. Current models, both specialized small ones and large Vision Language Models (VLMs), suffer from a significant bottleneck: a lack of compositional generalization, leading to failure when encountering novel sign configurations. To overcome this, we propose SignReasoner, a novel paradigm that transforms general VLMs into expert traffic sign reasoners. Our core innovation is the Functional Structure Unit (FSU), which shifts from common instance-based modeling to flexible function-based decomposition. By breaking down complex signs into minimal, core functional blocks (e.g., Direction, Notice, Lane), our model learns the underlying structural grammar, enabling robust generalization to unseen compositions. We define this decomposition as the FSU-Reasoning task and introduce a two-stage VLM post-training pipeline to maximize performance: (1) Iterative Caption-FSU Distillation, which enhances the model's accuracy in both FSU-reasoning and caption generation; and (2) FSU-GRPO, which uses Tree Edit Distance (TED) to compute FSU differences as rewards in the GRPO algorithm, boosting reasoning abilities. Experiments on the newly proposed FSU-Reasoning benchmark, TrafficSignEval, show that SignReasoner achieves new SOTA with remarkable data efficiency and no architectural modification, significantly improving traffic sign understanding in various VLMs.

[256] Enhancing Fine-Grained Spatial Grounding in 3D CT Report Generation via Discriminative Guidance

Chenyu Wang,Weicheng Dai,Han Liu,Wenchao Li,Kayhan Batmanghelich

Main category: cs.CV

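TL;DR: This paper proposes the DCP-PD framework, which uses discriminative cue prompting with prompt dropout to improve the accuracy of fine-grained lesion localization and description in radiology CT report generation, and introduces a hierarchical, location-aware evaluation protocol.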

Details Motivation: Existing vision-language models for radiology report generation lack fine-grained supervision (such as alignment to lesion locations), and current evaluation cannot diagnose spatial grounding ability. Method: Proposes the Discriminative Cue-Prompting with Prompt Dropout (DCP-PD) framework, which distills fine-grained cues from free-text reports to guide generation and reduces the model's reliance on shortcut features; also designs a hierarchical, location-aware question-answering evaluation protocol (presence → laterality → lobe). Result: Macro F1 on CT-RATE reaches 0.603 (a 20% relative improvement), and out-of-distribution F1 on Rad-ChestCT rises from 0.266 to 0.503 (an 89% relative improvement); the new evaluation protocol shows that the spatial localization ability of current models remains seriously limited. Conclusion: Fine-grained supervision and location-aware evaluation are essential to the clinical usefulness of VLMs in radiology report generation, and DCP-PD offers an effective plug-and-play approach to the spatial grounding problem. Abstract: Vision-language models (VLMs) for radiology report generation (RRG) can produce long-form chest CT reports from volumetric scans and show strong potential to improve radiology workflow efficiency and consistency. However, existing methods face two key limitations: (i) training supervision is often coarse, aligning a whole CT volume with a full free-text report without explicit alignment for fine-grained attributes or pathology locations; and (ii) evaluation is typically holistic (lexical overlap, entity matching, or LLM-as-a-judge scores) and not diagnostic for spatial grounding. We propose Discriminative Cue-Prompting with Prompt Dropout (DCP-PD), a plug-and-play framework that distills fine-grained cues from free-text reports and uses them to guide report generation while mitigating shortcut reliance via prompt dropout. DCP-PD achieves state-of-the-art performance on CT-RATE, improving macro F1 from $0.501$ to $0.603$ (20% relative), and substantially boosts out-of-distribution performance on Rad-ChestCT from F1 $=0.266$ to $0.503$ (89% relative). Finally, we introduce a hierarchical, location-aware question-set protocol (presence $\rightarrow$ laterality $\rightarrow$ lobe) to directly assess pathology-location grounding, showing that fine-grained spatial localization remains challenging even for models that score highly on current benchmarks.
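The relative improvements quoted in the abstract can be checked from the reported F1 values. The short calculation below uses only the four numbers stated above; the helper name `relative_gain` is ours.

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`, as a fraction."""
    return (after - before) / before

ct_rate = relative_gain(0.501, 0.603)      # in-distribution macro F1
rad_chestct = relative_gain(0.266, 0.503)  # out-of-distribution F1
print(f"CT-RATE: {ct_rate:.1%}, Rad-ChestCT: {rad_chestct:.1%}")
```

The results come out at about 20.4% and 89.1%, consistent with the rounded 20% and 89% figures in the abstract.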

[257] PERCEPT-Net: A Perceptual Loss Driven Framework for Reducing MRI Artifact Tissue Confusion

Ziheng Guo,Danqun Zheng,Chengwei Chen,Boyang Pan,Shuai Li,Ziqin Yu,Xiaoxiao Chen,Langdi Zhong,Yun Bian,Nan-Jie Gong

Main category: cs.CV

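TL;DR: This paper proposes PERCEPT-Net, a perceptually supervised framework for MRI motion artifact correction that uses a Motion Perceptual Loss (MPL) to balance artifact suppression against anatomical fidelity and significantly outperforms existing methods on clinical data.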

Details Motivation: Existing deep learning MRI artifact correction models generalize poorly in clinical settings because they confuse artifacts with tissue and cannot distinguish artifacts from true anatomical structures. Method: Uses a residual U-Net backbone with a multi-scale recovery module and dual attention mechanisms; the core component is the Motion Perceptual Loss (MPL), which provides artifact-aware supervision; training uses a hybrid dataset of real and simulated data. Result: Outperforms SOTA methods on clinical data; ablations show that MPL contributes significantly to structural consistency (p<0.001) and tissue contrast (p<0.001); radiologist evaluation shows significantly higher global image quality scores (median 3 vs. 2, p<0.001), with critical diagnostic structures preserved. Conclusion: Through task-specific, artifact-aware perceptual learning, PERCEPT-Net suppresses motion artifacts in clinical MRI while preserving anatomical integrity, improves clinical robustness, and verifiably mitigates over-smoothing and structural degradation. Abstract: Purpose: Existing deep learning-based MRI artifact correction models exhibit poor clinical generalization due to inherent artifact-tissue confusion, failing to discriminate artifacts from anatomical structures. To resolve this, we introduce PERCEPT-Net, a framework leveraging dedicated perceptual supervision for structure-preserving artifact suppression. Method: PERCEPT-Net utilizes a residual U-Net backbone integrated with a multi-scale recovery module and dual attention mechanisms to preserve anatomical context and salient features. The core mechanism, Motion Perceptual Loss (MPL), provides artifact-aware supervision by learning generalizable motion artifact representations. This loss directly guides the network to suppress artifacts while maintaining anatomical fidelity. Training utilized a hybrid dataset of real and simulated sequences, followed by prospective validation via objective metrics and expert radiologist assessments. Result: PERCEPT-Net outperformed state-of-the-art methods on clinical data. Ablation analysis established a direct causal link between MPL and performance; its omission caused a significant deterioration in structural consistency (p < 0.001) and tissue contrast (p < 0.001). Radiologist evaluations corroborated these objective metrics, scoring PERCEPT-Net significantly higher in global image quality (median 3 vs. 2, p < 0.001) and verifying the preservation of critical diagnostic structures. 
Conclusion: By integrating task-specific, artifact-aware perceptual learning, PERCEPT-Net suppresses motion artifacts in clinical MRI without compromising anatomical integrity. This framework improves clinical robustness and provides a verifiable mechanism to mitigate over-smoothing and structural degradation in medical image reconstruction.

[258] ReContraster: Making Your Posters Stand Out with Regional Contrast

Peixuan Zhang,Zijian Jia,Ziqi Cai,Shuchen Weng,Si Li,Boxin Shi

Main category: cs.CV

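TL;DR: ReContraster is a training-free poster design model that uses regional contrast to increase visual appeal and combines a multi-agent system with a hybrid denoising strategy to improve poster quality.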

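Details Motivation: Effective poster design must capture attention quickly and convey its message clearly; inspired by the "contrast effects" principle, the paper proposes a training-free model that uses regional contrast to make posters stand out. Method: ReContraster introduces a compositional multi-agent system that emulates the cognitive behavior of a poster designer, identifying elements, organizing the layout, and evaluating candidate posters; it integrates a hybrid denoising strategy into the diffusion process to ensure harmonious transitions across region boundaries. Result: On a newly built benchmark dataset, seven quantitative metrics and four user studies show that ReContraster outperforms existing state-of-the-art methods and produces visually striking, aesthetically appealing posters. Conclusion: ReContraster is the first training-free poster design model based on regional contrast and shows clear advantages in both visual quality and user preference. Abstract: Effective poster design requires rapidly capturing attention and clearly conveying messages. Inspired by the "contrast effects" principle, we propose ReContraster, the first training-free model to leverage regional contrast to make posters stand out. By emulating the cognitive behaviors of a poster designer, ReContraster introduces a compositional multi-agent system to identify elements, organize layout, and evaluate generated poster candidates. To further ensure harmonious transitions across region boundaries, ReContraster integrates a hybrid denoising strategy during the diffusion process. We additionally contribute a new benchmark dataset for comprehensive evaluation. Seven quantitative metrics and four user studies confirm its superiority over relevant state-of-the-art methods, producing visually striking and aesthetically appealing posters.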

[259] Parameter Efficient Fine-tuning for Domain-specific Gastrointestinal Disease Recognition

Sanjaya Poudel,Nikita Kunwor,Raj Simkhada,Mustafa Munir,Manish Dhakal,Khem Poudel

Main category: cs.CV

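TL;DR: This paper proposes lightweight fine-tuning of large pretrained models with low-rank adaptation (LoRA) modules to handle distribution shift across medical image sources, showing better performance and higher parameter efficiency than full fine-tuning on gastrointestinal disease classification.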

Details Motivation: Address distribution shift across data sources in medical image analysis while avoiding the high storage and compute cost of fully fine-tuning a separate large model for each source. Method: Introduces low-rank adaptation (LoRA) modules that add learnable, lightweight low-rank perturbation matrices to the pretrained weights and fine-tunes only these low-rank parameters for the downstream classification task. Result: On gastrointestinal disease classification, LoRA performs significantly better than full fine-tuning while greatly improving parameter efficiency. Conclusion: LoRA is an efficient, low-cost fine-tuning strategy for cross-source medical images and suits clinical deployment in resource-constrained settings. Abstract: Despite recent advancements in the field of medical image analysis with the use of pretrained foundation models, the issue of distribution shifts between cross-source images largely persists. To circumvent that issue, investigators generally train a separate model for each source. However, this method becomes expensive when we fully fine-tune pretrained large models for a single dataset, as we must store multiple copies of those models. Thus, in this work, we propose using a low-rank adaptation (LoRA) module for fine-tuning downstream classification tasks. LoRAs learn lightweight task-specific low-rank matrices that perturb pretrained weights to optimize those downstream tasks. For gastrointestinal tract diseases, they exhibit significantly better results than end-to-end fine-tuning with improved parameter efficiency. Code is available at: github.com/sanjay931/peft-gi-recognition.
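As a rough illustration of the low-rank perturbation idea, the sketch below forms an effective weight W + scale * (B A) and compares trainable parameter counts. It is a generic LoRA sketch in plain Python, not the paper's code: the helper names, the tiny example matrices, and the 768 x 768 layer with rank 8 are assumptions chosen for illustration.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_weight(W, A, B, scale=1.0):
    """Effective weight W + scale * (B @ A); W stays frozen, only A and B train."""
    delta = matmul(B, A)  # (d_out x r) @ (r x d_in) -> d_out x d_in
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

def trainable_params(d_out, d_in, rank):
    """Number of entries in A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)

W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # frozen 2 x 3 weight
B = [[1.0], [2.0]]                       # 2 x 1, rank r = 1
A = [[1.0, 0.0, 1.0]]                    # 1 x 3
W_eff = lora_weight(W, A, B)

full = 768 * 768                         # hypothetical full layer size
lora = trainable_params(768, 768, rank=8)
```

For the hypothetical 768 x 768 layer, the rank-8 adapter trains 12,288 values instead of 589,824, which is why only a small adapter per data source needs to be stored rather than a full model copy.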

[260] AIM-Bench: Benchmarking and Improving Affective Image Manipulation via Fine-Grained Hierarchical Control

Shi Chen,Xuecheng Wu,Heli Sun,Yunyun Shi,Xinyi Yin,Fengjian Xue,Jinheng Xie,Dingkang Yang,Hao Wang,Junxiao Xue,Liang He

Main category: cs.CV

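TL;DR: This paper proposes AIM-Bench, the first benchmark for affective image manipulation (AIM), and builds AIM-40k, a balanced dataset that mitigates the positivity bias of existing models and significantly improves editing quality.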

Details Motivation: Existing image editing benchmarks cannot model affective dimensions at a fine granularity and therefore do not support emotion-driven image editing. Method: Proposes a dual-path affective modeling scheme (combining the Mikels emotion taxonomy with the VAD framework) and builds AIM-Bench with 800 samples; designs a composite evaluation suite; develops a data engine based on an inverse repainting strategy to generate the balanced AIM-40k dataset (40k samples). Result: Evaluation on AIM-Bench shows that current models have a pronounced positivity bias; fine-tuning a model on AIM-40k yields a 9.15% relative improvement in overall performance. Conclusion: AIM-Bench and AIM-40k provide a reliable benchmark and high-quality data for affective image manipulation and advance the field. Abstract: Affective Image Manipulation (AIM) aims to evoke specific emotions through targeted editing. Current image editing benchmarks primarily focus on object-level modifications in general scenarios, lacking the fine-grained granularity to capture affective dimensions. To bridge this gap, we introduce the first benchmark designed for AIM, termed AIM-Bench. This benchmark is built upon a dual-path affective modeling scheme that integrates the Mikels emotion taxonomy with the Valence-Arousal-Dominance framework, enabling high-level semantic and fine-grained continuous manipulation. Through a hierarchical human-in-the-loop workflow, we curate 800 high-quality samples covering 8 emotional categories and 5 editing types. To effectively assess performance, we also design a composite evaluation suite combining rule-based and model-based metrics to holistically assess instruction consistency, aesthetics, and emotional expressiveness. Extensive evaluations reveal that current editing models face significant challenges, most notably a prevalent positivity bias, which stems from inherent imbalances in training data distribution. To tackle this, we propose a scalable data engine utilizing an inverse repainting strategy to construct AIM-40k, a balanced instruction-tuning dataset comprising 40k samples. Concretely, we enhance raw affective images via generative redrawing to establish high-fidelity ground truths, and synthesize input images with divergent emotions and paired precise instructions. Fine-tuning a baseline model on AIM-40k yields a 9.15% relative improvement in overall performance, demonstrating the effectiveness of AIM-40k. 
Our data and related code will be made open soon.

[261] A Benchmark and Multi-Agent System for Instruction-driven Cinematic Video Compilation

Peixuan Zhang,Chang Zhou,Ziyuan Zhang,Hualuo Liu,Chunjie Zhang,Jingqi Liu,Xiaohui Zhou,Xi Chen,Shuchen Weng,Si Li,Boxin Shi

Main category: cs.CV

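TL;DR: This paper proposes the CineBench benchmark and the CineAgents multi-agent system for instruction-driven cinematic video compilation, significantly improving narrative and logical coherence.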

Details Motivation: Existing video compilation methods are limited to predefined tasks, and there is no benchmark that comprehensively evaluates cinematic compilation ability. Method: Proposes CineAgents, a multi-agent system that follows a "design-and-compose" paradigm, building a hierarchical narrative memory through script reverse-engineering and producing the final compilation script through iterative narrative planning. Result: CineAgents significantly outperforms existing methods on the CineBench benchmark, producing compilations with better narrative and logical coherence. Conclusion: CineBench and CineAgents provide a new benchmark and an effective method for instruction-driven cinematic video compilation and advance the field. Abstract: The surging demand for adapting long-form cinematic content into short videos has motivated the need for versatile automatic video compilation systems. However, existing compilation methods are limited to predefined tasks, and the community lacks a comprehensive benchmark to evaluate cinematic compilation. To address this, we introduce CineBench, the first benchmark for instruction-driven cinematic video compilation, featuring diverse user instructions and high-quality ground-truth compilations annotated by professional editors. To overcome contextual collapse and temporal fragmentation, we present CineAgents, a multi-agent system that reformulates cinematic video compilation into a "design-and-compose" paradigm. CineAgents performs script reverse-engineering to construct a hierarchical narrative memory to provide multi-level context and employs an iterative narrative planning process that refines a creative blueprint into a final compiled script. Extensive experiments demonstrate that CineAgents significantly outperforms existing methods, generating compilations with superior narrative coherence and logical coherence.

[262] Toward Accountable AI-Generated Content on Social Platforms: Steganographic Attribution and Multimodal Harm Detection

Xinlei Guan,David Arosemena,Tejaswi Dhandu,Kuan Huang,Meng Xu,Miles Q. Li,Bingyu Shen,Ruiyang Qin,Umamaheswara Rao Tida,Boyang Li

Main category: cs.CV

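TL;DR: This paper proposes a steganography-based attribution framework for AI-generated images that embeds cryptographically signed identifiers at generation time and uses multimodal harmful-content detection to trigger attribution verification, enabling reliable tracing of harmful deployments of AI images.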

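Details Motivation: The rapid progress of generative AI poses new challenges for content moderation and digital forensics, particularly contextual misuse in which benign AI images are paired with harmful text; synthetic images also typically lack persistent metadata or device signatures, which defeats traditional moderation frameworks and makes attribution difficult. Method: Proposes a steganography-enabled attribution framework that embeds cryptographically signed identifiers at image creation time; evaluates the robustness of five watermarking methods (spatial, frequency, and wavelet domains); builds a CLIP-based multimodal fusion model for harmful-content detection that acts as the attribution trigger. Result: Experiments show that spread-spectrum watermarking in the wavelet domain is strongly robust to blur distortions; the multimodal fusion detector reaches an AUC-ROC of 0.99; together the components form an end-to-end forensic pipeline that supports reliable tracing of harmful deployments of AI images. Conclusion: The framework effectively addresses the attribution problem for AI-generated images under contextual misuse, improving accountability and content governance in synthetic media environments. Abstract: The rapid growth of generative AI has introduced new challenges in content moderation and digital forensics. In particular, benign AI-generated images can be paired with harmful or misleading text, creating difficult-to-detect misuse. This contextual misuse undermines the traditional moderation framework and complicates attribution, as synthetic images typically lack persistent metadata or device signatures. We introduce a steganography-enabled attribution framework that embeds cryptographically signed identifiers into images at creation time and uses multimodal harmful content detection as a trigger for attribution verification. Our system evaluates five watermarking methods across spatial, frequency, and wavelet domains. It also integrates a CLIP-based fusion model for multimodal harmful-content detection. Experiments demonstrate that spread-spectrum watermarking, especially in the wavelet domain, provides strong robustness under blur distortions, and our multimodal fusion detector achieves an AUC-ROC of 0.99, enabling reliable cross-modal attribution verification. These components form an end-to-end forensic pipeline that enables reliable tracing of harmful deployments of AI-generated imagery, supporting accountability in modern synthetic media environments. Our code is available at GitHub: https://github.com/bli1/steganography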
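
As a rough illustration of the spread-spectrum idea named in the abstract above, the sketch below hides one bit in a list of coefficients by adding a keyed pseudo-random ±1 sequence and recovers it by correlation. It is a minimal sketch, not the authors' pipeline: the function names, the `strength` parameter, and the flat coefficient list (standing in for wavelet sub-band coefficients) are all assumptions made for illustration.

```python
import random


def pn_sequence(key, n):
    """Keyed pseudo-random +/-1 chips (hypothetical helper, not from the paper)."""
    rng = random.Random(key)
    return [1 if rng.random() < 0.5 else -1 for _ in range(n)]


def embed_bit(coeffs, bit, key, strength=2.0):
    """Add the chip sequence for bit=1, subtract it for bit=0, scaled by strength."""
    sign = 1.0 if bit else -1.0
    chips = pn_sequence(key, len(coeffs))
    return [c + sign * strength * p for c, p in zip(coeffs, chips)]


def detect_bit(coeffs, key):
    """Correlate with the same chips; the sign of the correlation decides the bit."""
    chips = pn_sequence(key, len(coeffs))
    correlation = sum(c * p for c, p in zip(coeffs, chips))
    return 1 if correlation > 0 else 0
```

Because the mark is spread thinly over many coefficients, each coefficient changes only slightly, while the correlation sum grows with the number of coefficients. Detection requires the same key that was used for embedding.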

[263] ExpertEdit: Learning Skill-Aware Motion Editing from Expert Videos

Arjun Somayazulu,Kristen Grauman

Main category: cs.CV

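TL;DR: This paper proposes ExpertEdit, a framework trained on unpaired expert video data that performs skill-driven motion editing without paired supervision or manual guidance, automatically raising a novice's motion to expert level at skill-critical moments.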

Details Motivation: Existing motion editing methods depend on paired input-output data and explicit edit guidance, which makes them ill-suited to skill improvement, where such data is scarce and expensive; meanwhile, psychological studies show that observing near-perfect versions of one's own performance accelerates learning more than watching expert demonstrations alone. Method: ExpertEdit is trained on unpaired expert videos and learns an expert motion prior with a masked language modeling objective; at inference, it masks the novice motion at skill-critical frames and projects it onto the learned expert manifold, producing localized skill improvements. Result: Across eight techniques and three sports from the Ego-Exo4D and Karate Kyokushin datasets, ExpertEdit outperforms state-of-the-art supervised motion editing methods on multiple metrics, including motion realism and expert quality. Conclusion: ExpertEdit shows that high-quality, interpretable skill-driven motion editing is achievable with unpaired expert videos alone, offering a new paradigm for personalized visual feedback. Abstract: Visual feedback is critical for motor skill acquisition in sports and rehabilitation, and psychological studies show that observing near-perfect versions of one's own performance accelerates learning more effectively than watching expert demonstrations alone. We propose to enable such personalized feedback by automatically editing a person's motion to reflect higher skill. Existing motion editing approaches are poorly suited for this setting because they assume paired input-output data, which is rare and expensive to curate for skill-driven tasks, and explicit edit guidance at inference. We introduce ExpertEdit, a framework for skill-driven motion editing trained exclusively on unpaired expert video demonstrations. ExpertEdit learns an expert motion prior with a masked language modeling objective that infills masked motion spans with expert-level refinements. At inference, novice motion is masked at skill-critical moments and projected into the learned expert manifold, producing localized skill improvements without paired supervision or manual edit guidance. Across eight diverse techniques and three sports from Ego-Exo4D and Karate Kyokushin, ExpertEdit outperforms state-of-the-art supervised motion editing methods on multiple metrics of motion realism and expert quality. Project page: https://vision.cs.utexas.edu/projects/expert_edit/

[264] UDAPose: Unsupervised Domain Adaptation for Low-Light Human Pose Estimation

Haopeng Chen,Yihao Ai,Kabeen Kim,Robby T. Tan,Yixin Chen,Bo Wang

Main category: cs.CV

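TL;DR: This paper proposes the UDAPose framework, which improves low-light image synthesis quality with a DHF filter and an LCIM module, and designs a DCA module that dynamically fuses visual cues with pose priors in a Transformer, significantly improving human pose estimation in low-light conditions.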

Details Motivation: Human pose estimation in low light suffers from scarce annotated data, loss of visual information, augmentation methods that fail to model noise and high-frequency detail realistically, and cross-attention mechanisms that become unreliable in low light. Method: Proposes the UDAPose framework: (1) a Direct-Current-based High-Pass Filter (DHF) and a Low-light Characteristics Injection Module (LCIM) jointly synthesize high-quality low-light images; (2) a Dynamic Control of Attention (DCA) module in the Transformer adaptively fuses image cues with learned pose priors. Result: AP improves by 10.1 (56.4%) on the ExLPose-test hard set (LL-H) and by 7.4 (31.4%) in cross-dataset validation on EHPT-XC, significantly outperforming the current state of the art. Conclusion: UDAPose effectively mitigates domain shift and unreliable cues in low-light pose estimation, improving generalization to real low-light scenes through high-quality synthesis and dynamic multi-source fusion. Abstract: Low-visibility scenarios, such as low-light conditions, pose significant challenges to human pose estimation due to the scarcity of annotated low-light datasets and the loss of visual information under poor illumination. Recent domain adaptation techniques attempt to utilize well-lit labels by augmenting well-lit images to mimic low-light conditions. However, handcrafted augmentations oversimplify noise patterns, while learning-based methods often fail to preserve high-frequency low-light characteristics, producing unrealistic images that lead pose models to generalize poorly to real low-light scenes. Moreover, recent pose estimators rely on image cues through image-to-keypoint cross-attention, but these cues become unreliable under low-light conditions. To address these issues, we propose Unsupervised Domain Adaptation for Pose Estimation (UDAPose), a novel framework that synthesizes low-light images and dynamically fuses visual cues with pose priors for improved pose estimation. Specifically, our synthesis method incorporates a Direct-Current-based High-Pass Filter (DHF) and a Low-light Characteristics Injection Module (LCIM) to inject high-frequency details from input low-light images, overcoming rigidity or the detail loss in existing approaches. Furthermore, we introduce a Dynamic Control of Attention (DCA) module that adaptively balances image cues with learned pose priors in the Transformer architecture.
Experiments show that UDAPose outperforms state-of-the-art methods, with notable AP gains of 10.1 (56.4%) on the ExLPose-test hard set (LL-H) and 7.4 (31.4%) in cross-dataset validation on EHPT-XC. Code: https://github.com/Vision-and-Multimodal-Intelligence-Lab/UDAPose

[265] Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

Yudong Han,Yong Wang,Zaiquan Yang,Zhen Qu,Liyuan Pan,Xiangxiang Chu

Main category: cs.CV

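TL;DR: This paper proposes a new framework for multimodal latent reasoning that uses a visual replay module and a routing depth scaling mechanism to address the under-optimization of visual tokens and the limited reasoning depth available to complex tokens, improving performance while substantially speeding up inference.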

Details Motivation: Existing multimodal latent reasoning methods suffer from systematic under-optimization of visual tokens caused by language bias, and a fixed network depth cannot match the reasoning needs of tokens of differing complexity. Method: Proposes a visual replay module (which uses causal self-attention to estimate token saliency and imposes spatial-coherence constraints) and a routing depth scaling mechanism (which dynamically allocates more reasoning steps to complex tokens), combined with a progressive curriculum strategy that internalizes explicit chain-of-thought into compact latent representations. Result: Achieves state-of-the-art performance on multiple benchmarks and substantial inference speedups over explicit Chain-of-Thought baselines. Conclusion: Multimodal latent reasoning can gain in both performance and efficiency through targeted optimization of visual representations and adaptive depth control, without explicitly generating reasoning steps. Abstract: Multimodal latent reasoning has emerged as a promising paradigm that replaces explicit Chain-of-Thought (CoT) decoding with implicit feature propagation, simultaneously enhancing representation informativeness and reducing inference latency. By analyzing token-level gradient dynamics during latent training, we reveal two critical observations: (1) visual tokens exhibit significantly higher and more volatile gradient norms than their textual counterparts due to inherent language bias, resulting in systematic visual under-optimization; and (2) semantically simple tokens converge rapidly, whereas complex tokens exhibit persistent gradient instability constrained by fixed architectural depths. To address these limitations, we propose a visual replay module and routing depth scaling to collaboratively enhance visual perception and refine complicated latents for deeper contextual reasoning. The former module leverages causal self-attention to estimate token saliency, reinforcing fine-grained grounding through spatially-coherent constraints. Complementarily, the latter mechanism adaptively allocates additional reasoning steps to complex tokens, enabling deeper contextual refinement. Guided by a curriculum strategy that progressively internalizes explicit CoT into compact latent representations, our framework achieves state-of-the-art performance across diverse benchmarks while delivering substantial inference speedups over explicit CoT baselines.

[266] FreeScale: Scaling 3D Scenes via Certainty-Aware Free-View Generation

Chenhan Jiang,Yu Chen,Qingwen Zhang,Jifei Song,Songcen Xu,Dit-Yan Yeung,Jiankang Deng

Main category: cs.CV

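TL;DR: FreeScale proposes a new framework that uses imperfect scene reconstructions as geometric proxies together with certainty-aware free-view sampling to turn limited real image sequences into high-quality, scalable training data, significantly improving the generalization of novel view synthesis models and the results of 3D Gaussian Splatting optimization.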

Details Motivation: Generalizable novel view synthesis (NVS) models are limited by the scarcity of large-scale real training data that is both diverse and has precise camera trajectories; real data is sparse and discrete, while synthetic data suffers from a domain gap and unrealistic semantics. Method: Proposes the FreeScale framework: (1) uses scene reconstruction to produce a geometric proxy; (2) designs a certainty-aware free-view sampling strategy that selects novel viewpoints that are semantically meaningful and minimally affected by reconstruction errors; (3) uses the generated data to scale up training of feedforward NVS models and to improve per-scene 3D Gaussian Splatting optimization. Result: PSNR improves by 2.7 dB on out-of-distribution benchmarks; the generated data improves 3D Gaussian Splatting optimization across multiple datasets; the work provides a practical, efficient data generation engine. Conclusion: By combining reconstructed geometric priors with uncertainty modeling, FreeScale reliably converts a small number of real images into high-quality, scalable training data, offering a practical solution to the data bottleneck in 3D vision. Abstract: The development of generalizable Novel View Synthesis (NVS) models is critically limited by the scarcity of large-scale training data featuring diverse and precise camera trajectories. While real-world captures are photorealistic, they are typically sparse and discrete. Conversely, synthetic data scales but suffers from a domain gap and often lacks realistic semantics. We introduce FreeScale, a novel framework that leverages the power of scene reconstruction to transform limited real-world image sequences into a scalable source of high-quality training data. Our key insight is that an imperfect reconstructed scene serves as a rich geometric proxy, but naively sampling from it amplifies artifacts. To this end, we propose a certainty-aware free-view sampling strategy identifying novel viewpoints that are both semantically meaningful and minimally affected by reconstruction errors. We demonstrate FreeScale's effectiveness by scaling up the training of feedforward NVS models, achieving a notable gain of 2.7 dB in PSNR on challenging out-of-distribution benchmarks. Furthermore, we show that the generated data can actively enhance per-scene 3D Gaussian Splatting optimization, leading to consistent improvements across multiple datasets. Our work provides a practical and powerful data generation engine to overcome a fundamental bottleneck in 3D vision. Project page: https://mvp-ai-lab.github.io/FreeScale.
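
To put the reported 2.7 dB PSNR gain in context, the sketch below uses the standard PSNR definition, PSNR = 10·log10(MAX² / MSE), to convert a dB gain into the corresponding reduction in mean squared error. The helper names are hypothetical and the code is not taken from the FreeScale release; it only restates the textbook formula.

```python
import math


def psnr(mse, max_val=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_reduction_factor(gain_db):
    """Factor by which MSE shrinks when PSNR rises by gain_db decibels."""
    return 10.0 ** (gain_db / 10.0)


factor = mse_reduction_factor(2.7)
print(f"A 2.7 dB PSNR gain corresponds to about {factor:.2f}x lower MSE")
```

Under this definition, a 2.7 dB gain corresponds to a mean squared error roughly 1.86 times smaller, independent of the peak value used.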

[267] Data-Efficient Surgical Phase Segmentation in Small-Incision Cataract Surgery: A Controlled Study of Vision Foundation Models

Lincoln Spencer,Song Wang,Chen Chen

Main category: cs.CV

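TL;DR: By comparing different visual representations, this study shows that large vision foundation models (e.g., DINOv3, V-JEPA2) transfer better in the data-scarce setting of phase segmentation for manual small-incision cataract surgery (SICS) videos, and it uses a cached-feature pipeline to improve training efficiency.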

Details Motivation: Surgical phase segmentation is central to computer-assisted surgery, but robust models are hard to build when annotated surgical videos are scarce; data-efficient, transferable visual representations are needed. Method: Under the same temporal model (MS-TCN++) and identical training and evaluation settings, systematically compares visual encoders from supervised models (ResNet-50, I3D) and large self-supervised foundation models (DINOv3, V-JEPA2); uses a cached-feature pipeline to decouple visual encoding from temporal modeling; runs controlled experiments on the SICS-155 dataset (19 phases). Result: DINOv3 ViT-7B performs best (83.4% accuracy, 87.0 edit score); foundation-model features clearly improve segmentation; unlabeled in-domain videos combined with lightweight adaptation can improve performance further, but only within limits. Conclusion: Modern vision foundation models transfer well to surgical workflow understanding and offer a practical, efficient path and guidance for low-label medical video analysis. Abstract: Surgical phase segmentation is central to computer-assisted surgery, yet robust models remain difficult to develop when labeled surgical videos are scarce. We study data-efficient phase segmentation for manual small-incision cataract surgery (SICS) through a controlled comparison of visual representations. To isolate representation quality, we pair each visual encoder with the same temporal model (MS-TCN++) under identical training and evaluation settings on SICS-155 (19 phases). We compare supervised encoders (ResNet-50, I3D) against large self-supervised foundation models (DINOv3, V-JEPA2), and use a cached-feature pipeline that decouples expensive visual encoding from lightweight temporal learning. Foundation-model features improve segmentation performance in this setup, with DINOv3 ViT-7B achieving the best overall results (83.4% accuracy, 87.0 edit score). We further examine cataract-domain transfer using unlabeled videos and lightweight adaptation, and analyze when it helps or hurts. Overall, the study indicates strong transferability of modern vision foundation models to surgical workflow understanding and provides practical guidance for low-label medical video settings. The project website is available at: https://sl2005.github.io/DataEfficient-sics-phase-seg/
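
The "edit score" quoted above is, in the temporal segmentation literature, usually a segment-level Levenshtein distance normalized to the range 0 to 100. The sketch below follows that common definition; it is an assumption that the paper's evaluation matches it exactly, and the function names are illustrative.

```python
from itertools import groupby


def segments(frame_labels):
    """Collapse per-frame labels into the ordered list of segment labels."""
    return [label for label, _ in groupby(frame_labels)]


def levenshtein(a, b):
    """Classic edit distance between two label sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_score(pred_frames, true_frames):
    """Segmental edit score in [0, 100]; higher means better phase ordering."""
    p, t = segments(pred_frames), segments(true_frames)
    if not p and not t:
        return 100.0
    return (1.0 - levenshtein(p, t) / max(len(p), len(t))) * 100.0
```

Unlike frame accuracy, this score ignores small boundary shifts and instead penalizes over-segmentation and out-of-order phases, which is why the two metrics are typically reported together.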

[268] FGML-DG: Feynman-Inspired Cognitive Science Paradigm for Cross-Domain Medical Image Segmentation

Yucheng Song,Chenxi Li,Haokang Ding,Zhining Liao,Zhifang Liao

Main category: cs.CV

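TL;DR: This paper proposes a cognitive-science-driven meta-learning framework inspired by the Feynman learning technique (FGML-DG) for domain generalization in multi-modal, multi-source medical image segmentation, improving cross-domain generalization through style feature simplification, meta-style memory, and feedback-driven re-training.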

Details Motivation: Medical image segmentation faces domain shift across modalities, devices, and hospitals, which degrades performance on unseen domains; existing methods fall short in simplifying style features, reusing domain knowledge, and feedback-driven optimization. Method: Proposes the Feynman-Guided Meta-Learning framework (FGML-DG): (1) simplifies and aligns cross-domain style features based on the 'concept understanding' principle; (2) designs a meta-style memory and recall mechanism (MetaStyle) to reuse past knowledge; (3) introduces a Feedback-Driven Re-Training strategy (FDRT) that dynamically adjusts the learning focus according to prediction errors. Result: On two challenging medical image domain generalization tasks, the proposed method outperforms existing domain generalization methods. Conclusion: A cognitive-science-inspired meta-learning paradigm can effectively improve the domain generalization of medical image segmentation models, providing a more robust and transferable solution for AI-driven healthcare. Abstract: In medical image segmentation across multiple modalities (e.g., MRI, CT) and heterogeneous data sources (e.g., different hospitals and devices), Domain Generalization (DG) remains a critical challenge in AI-driven healthcare. This challenge primarily arises from domain shifts, imaging variations, and patient diversity, which often lead to degraded model performance in unseen domains. To address these limitations, we identify key issues in existing methods, including insufficient simplification of complex style features, inadequate reuse of domain knowledge, and a lack of feedback-driven optimization. To tackle these problems, inspired by Feynman's learning techniques in educational psychology, this paper introduces a cognitive science-inspired meta-learning paradigm for medical image domain generalization segmentation. We propose, for the first time, a cognitive-inspired Feynman-Guided Meta-Learning framework for medical image domain generalization segmentation (FGML-DG), which mimics human cognitive learning processes to enhance model learning and knowledge transfer. Specifically, we first leverage the 'concept understanding' principle from Feynman's learning method to simplify complex features across domains into style information statistics, achieving precise style feature alignment. Second, we design a meta-style memory and recall method (MetaStyle) to emulate the human memory system's utilization of past knowledge. Finally, we incorporate a Feedback-Driven Re-Training strategy (FDRT), which mimics Feynman's emphasis on targeted relearning, enabling the model to dynamically adjust learning focus based on prediction errors.
Experimental results demonstrate that our method outperforms other existing domain generalization approaches on two challenging medical image domain generalization tasks.

[269] STORM: End-to-End Referring Multi-Object Tracking in Videos

Zijia Lu,Jingru Yi,Jue Wang,Yuxiao Chen,Junwen Chen,Xinyu Li,Davide Modolo

Main category: cs.CV

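TL;DR: This paper proposes STORM, an end-to-end multimodal large language model that handles referring multi-object tracking (RMOT) in a unified way, improves data efficiency through task-composition learning, and introduces the new STORM-Bench benchmark, significantly improving performance.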

Details Motivation: Existing RMOT methods separate object grounding from tracking and are limited by scarce training videos, ambiguous annotations, and restricted domains, resulting in limited performance. Method: Proposes STORM, an end-to-end MLLM that performs grounding and tracking jointly; introduces a task-composition learning (TCL) strategy that decomposes RMOT into image grounding and single-object tracking sub-tasks; builds STORM-Bench, a high-quality RMOT dataset. Result: STORM achieves state-of-the-art results on image grounding, single-object tracking, and RMOT benchmarks, showing strong generalization and robust spatio-temporal grounding in complex scenes. Conclusion: Through unified modeling and an efficient learning strategy, STORM overcomes the module separation and data bottleneck in RMOT, moving the task toward more practical and robust solutions. Abstract: Referring multi-object tracking (RMOT) is the task of associating all the objects in a video that semantically match given textual queries or referring expressions. Existing RMOT approaches decompose object grounding and tracking into separate modules and exhibit limited performance due to the scarcity of training videos, ambiguous annotations, and restricted domains. In this work, we introduce STORM, an end-to-end MLLM that jointly performs grounding and tracking within a unified framework, eliminating external detectors and enabling coherent reasoning over appearance, motion, and language. To improve data efficiency, we propose a task-composition learning (TCL) strategy that decomposes RMOT into image grounding and object tracking, allowing STORM to leverage data-rich sub-tasks and learn structured spatial-temporal reasoning. We further construct STORM-Bench, a new RMOT dataset with accurate trajectories and diverse, unambiguous referring expressions generated through a bottom-up annotation pipeline. Extensive experiments show that STORM achieves state-of-the-art performance on image grounding, single-object tracking, and RMOT benchmarks, demonstrating strong generalization and robust spatial-temporal grounding in complex real-world scenarios. STORM-Bench is released at https://github.com/amazon-science/storm-referring-multi-object-grounding.

[270] BareBones: Benchmarking Zero-Shot Geometric Comprehension in VLMs

Aaditya Baranwal,Vishal Yadav,Abhishek Rajora

Main category: cs.CV

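TL;DR: This paper proposes BareBones, a benchmark for evaluating the zero-shot shape understanding of vision-language models (VLMs) when only geometric silhouettes are available (no RGB texture); it finds that mainstream VLMs exhibit a severe "Texture Bias Cliff", revealing a lack of genuine understanding of geometric structure.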

Details Motivation: Existing evaluations cannot separate a VLM's understanding of geometric structure from its reliance on RGB textures and contextual priors, so high zero-shot performance may stem from statistical shortcuts rather than genuine geometric comprehension. Method: Builds the BareBones zero-shot benchmark from six datasets (including WTP-Bench, a newly proposed fine-grained silhouette dataset), all using pixel-level geometric silhouettes (RGB removed) to strictly isolate texture cues; systematically evaluates 26 state-of-the-art VLMs. Result: All tested VLMs degrade sharply without RGB (the 'Texture Bias Cliff'), exposing a general lack of pure geometric shape recognition, with especially poor results on WTP-Bench. Conclusion: Current VLMs do not truly understand geometric structure and instead rely heavily on texture cues; BareBones provides the first rigorous, noise-free benchmark for measuring geometric grounding. Abstract: While Vision-Language Models (VLMs) demonstrate remarkable zero-shot recognition capabilities across a diverse spectrum of multimodal tasks, it remains an open question whether these architectures genuinely comprehend geometric structure or merely exploit RGB textures and contextual priors as statistical shortcuts. Existing evaluations fail to isolate this mechanism, conflating semantic reasoning with texture mapping and relying on imprecise annotations that inadvertently leak environmental cues. To address this gap, we introduce BareBones, a zero-shot benchmark designed to stress-test pure geometric shape comprehension. We curate pixel-level silhouettes of geometrically distinct classes across six datasets: five established segmentation sources (ImageNet-S, DIS5K, ThinObject5K, PASCAL VOC, CUB-200) and our novel flagship collection, WTP-Bench, establishing a noise-free geometric taxonomy. WTP-Bench is an extreme, fine-grained visual puzzle that forces models to identify inter-class geometric concepts from boundary contours alone. Our evaluation of 26 state-of-the-art proprietary and open-weight VLMs (e.g., GPT-4.1, Gemini, Claude Sonnet 4.5, LLaVA) reveals a consistent, severe performance collapse under RGB deprivation, a phenomenon we term the Texture Bias Cliff. By documenting universal structural blindspots, BareBones establishes a rigorous yardstick for genuine geometric grounding.

[271] The Second Challenge on Real-World Face Restoration at NTIRE 2026: Methods and Results

Jingkai Wang,Jue Gong,Zheng Chen,Kai Liu,Jiatong Li,Yulun Zhang,Radu Timofte,Jiachen Tu,Yaokun Shi,Guoyi Xu,Yaoxin Jiang,Jiajia Liu,Yingsi Chen,Yijiao Liu,Hui Li,Yu Wang,Congchao Zhu,Alexandru-Gabriel Lefterache,Anamaria Radoi,Chuanyue Yan,Tao Lu,Yanduo Zhang,Kanghui Zhao,Jiaming Wang,Yuqi Li,WenBo Xiong,Yifei Chen,Xian Hu,Wei Deng,Daiguo Zhou,Sujith Roy,Claudia Jesuraj,Vikas B,Spoorthi LC,Nikhil Akalwadi,Ramesh Ashok Tabib,Uma Mudenagudi,Yuxuan Jiang,Chengxi Zeng,Tianhao Peng,Fan Zhang,David Bull,Wei Zhou,Linfeng Li,Hongyu Huang,Hoyoung Lee,SangYun Oh,ChangYoung Jeong,Axi Niu,Jinyang Zhang,Zhenguo Wu,Senyan Qing,Jinqiu Sun,Yanning Zhang

Main category: cs.CV

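TL;DR: This paper reviews the NTIRE 2026 challenge on real-world face restoration, which focuses on producing natural, realistic results that preserve identity; evaluation uses a weighted image quality score together with an AdaFace identity check, and 9 teams entered the final ranking.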

Details Motivation: To advance the state of the art in perceptual quality and realism for real-world face restoration, without constraints on computational resources or training data, emphasizing naturalness, realism, and identity consistency. Method: Organizes the NTIRE 2026 challenge, evaluated with a weighted image quality assessment (IQA) score and the AdaFace model as an identity consistency check. Result: Attracted 96 registrants; 10 teams submitted valid models, of which 9 achieved valid scores in the final ranking. Conclusion: The challenge advances real-world face restoration and gives a systematic overview of the latest trends and methods in the field. Abstract: This paper provides a review of the NTIRE 2026 challenge on real-world face restoration, highlighting the proposed solutions and the resulting outcomes. The challenge focuses on generating natural and realistic outputs while maintaining identity consistency. Its goal is to advance state-of-the-art solutions for perceptual quality and realism, without imposing constraints on computational resources or training data. Performance is evaluated using a weighted image quality assessment (IQA) score, with the AdaFace model employed as an identity checker. The competition attracted 96 registrants, with 10 teams submitting valid models; ultimately, 9 teams achieved valid scores in the final ranking. This collaborative effort advances the performance of real-world face restoration while offering an in-depth overview of the latest trends in the field.

[272] Bidirectional Learning of Facial Action Units and Expressions via Structured Semantic Mapping across Heterogeneous Datasets

Jia Li,Yu Zhang,Yin Chen,Zhenzhen Hu,Yong Li,Richang Hong,Shiguang Shan,Meng Wang

Main category: cs.CV

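TL;DR: This paper proposes a Structured Semantic Mapping (SSM) framework for bidirectional joint learning of facial action unit (AU) detection and facial expression (FE) recognition under heterogeneous data conditions; using a shared visual backbone, a Textual Semantic Prototype (TSP) module, and a Dynamic Prior Mapping (DPM) module, it achieves cross-task alignment and knowledge transfer in a unified semantic space.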

Details Motivation: Existing work mostly transfers knowledge in one direction, from AUs to FEs, and overlooks the inherent semantic correlation between them; in addition, AU and FE data differ markedly in annotation paradigm, granularity, and distribution, which hinders effective joint modeling. Method: Proposes the SSM framework with three components: (1) a shared visual backbone that extracts unified dynamic facial representations; (2) a Textual Semantic Prototype (TSP) module that builds learnable structured semantic prototypes serving as cross-task supervision and alignment anchors; (3) a Dynamic Prior Mapping (DPM) module that incorporates FACS priors and learns a data-driven association matrix in a high-level feature space, supporting explicit bidirectional knowledge transfer. Result: Achieves state-of-the-art performance on mainstream AU detection and FE recognition benchmarks, and confirms that FE semantics can in turn improve AU detection, even across heterogeneous datasets. Conclusion: SSM achieves bidirectional semantic alignment and knowledge transfer between the AU and FE tasks, demonstrating that coarse-grained expression semantics can enhance fine-grained action unit learning and offering a new paradigm for heterogeneous multi-task analysis of affective behavior. Abstract: Facial action unit (AU) detection and facial expression (FE) recognition can be jointly viewed as affective facial behavior tasks, representing fine-grained muscular activations and coarse-grained holistic affective states, respectively. Despite their inherent semantic correlation, existing studies predominantly focus on knowledge transfer from AUs to FEs, while bidirectional learning remains insufficiently explored. In practice, this challenge is further compounded by heterogeneous data conditions, where AU and FE datasets differ in annotation paradigms (frame-level vs. clip-level), label granularity, and data availability and diversity, hindering effective joint learning. To address these issues, we propose a Structured Semantic Mapping (SSM) framework for bidirectional AU-FE learning under different data domains and heterogeneous supervision. SSM consists of three key components: (1) a shared visual backbone that learns unified facial representations from dynamic AU and FE videos; (2) semantic mediation via a Textual Semantic Prototype (TSP) module, which constructs structured semantic prototypes from fixed textual descriptions augmented with learnable context prompts, serving as supervision signals and cross-task alignment anchors in a shared semantic space; and (3) a Dynamic Prior Mapping (DPM) module that incorporates prior knowledge derived from the Facial Action Coding System and learns a data-driven association matrix in a high-level feature space, enabling explicit and bidirectional knowledge transfer.
Extensive experiments on popular AU detection and FE recognition benchmarks show that SSM achieves state-of-the-art performance on both tasks simultaneously, and demonstrate that holistic expression semantics can in turn enhance fine-grained AU learning even across heterogeneous datasets.

[273] Omnimodal Dataset Distillation via High-order Proxy Alignment

Yuxuan Gao,Xiaohao Liu,Xiaobo Xia,Tongliang Liu

Main category: cs.CV

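TL;DR: This paper proposes HoPA, which performs multimodal dataset distillation through high-order cross-modal alignment, addressing the restriction of existing methods to single- or two-modality settings, and validates its advantages on multiple benchmarks.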

Details Motivation: Existing dataset distillation methods are largely limited to single-modal or bimodal settings, while omnimodal dataset distillation for three or more modalities remains underexplored and faces greater modality heterogeneity and more complex cross-modal interactions. Method: Proposes HoPA, which models high-order cross-modal alignment through a compact proxy, abstracts omnimodal alignment with a shared similarity structure to avoid the combinatorial explosion of pairwise modality modeling, is compatible with trajectory matching, and is supported by a spectral-theoretic analysis. Result: In experiments on multiple benchmarks, HoPA achieves clearly better compression-performance trade-offs than existing methods. Conclusion: HoPA provides a unified, scalable, and theoretically grounded solution for omnimodal dataset distillation, advancing research on efficient data representation in multimodal learning. Abstract: Dataset distillation compresses large-scale datasets into compact synthetic sets while preserving training performance, but existing methods are largely restricted to single-modal or bimodal settings. Extending dataset distillation to scenarios involving more than two modalities, i.e., Omnimodal Dataset Distillation, remains underexplored and challenging due to increased heterogeneity and complex cross-modal interactions. In this work, we identify the key determinant that bounds the endpoint discrepancy in the omnimodal setting, which is exacerbated with an increasing number of modalities. To this end, we propose HoPA, a unified method that captures high-order cross-modal alignments via a compact proxy, which is compatible with trajectory matching as well. By abstracting omnimodal alignment with a shared similarity structure, our method avoids the combinatorial complexity of pairwise modality modeling and enables scalable joint distillation across heterogeneous modalities. Theoretical analysis from the spectral perspective reveals the rationality of our proposed method against bimodal dataset distillation techniques. Extensive experiments on various benchmarks demonstrate that the proposed method achieves superior compression-performance trade-offs compared to existing competitors. The source code will be publicly released.

[274] Differentiable Vector Quantization for Rate-Distortion Optimization of Generative Image Compression

Shiyin Jiang,Wei Long,Minghao Han,Zhenghao Chen,Ce Zhu,Shuhang Gu

Main category: cs.CV

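TL;DR: This paper proposes the RDVQ framework, which enables end-to-end rate-distortion optimization through a differentiable relaxation of the codebook distribution and, combined with an autoregressive entropy model, achieves high-performance image compression at extremely low bitrates.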

Details Motivation: Under storage and bandwidth constraints, extremely low-bitrate image compression is increasingly needed; existing vector quantization (VQ) methods lack a mechanism for rate-distortion optimization that couples representation learning with entropy modeling. Method: Proposes the unified RDVQ framework: (1) a differentiable relaxation of the codebook distribution lets the entropy loss directly shape the latent prior; (2) an autoregressive entropy model supports accurate entropy modeling and test-time rate control. Result: On DIV2K-val, RDVQ reduces bitrate by up to 75.71% on DISTS and 37.63% on LPIPS relative to RDEIC; it attains competitive or better perceptual quality with fewer parameters. Conclusion: RDVQ achieves end-to-end rate-distortion optimization for VQ-based compression and introduces an entropy-constrained formulation of VQ, pointing toward unified modeling of image tokenization and compression. Abstract: The rapid growth of visual data under stringent storage and bandwidth constraints makes extremely low-bitrate image compression increasingly important. While Vector Quantization (VQ) offers strong structural fidelity, existing methods lack a principled mechanism for joint rate-distortion (RD) optimization due to the disconnect between representation learning and entropy modeling. We propose RDVQ, a unified framework that enables end-to-end RD optimization for VQ-based compression via a differentiable relaxation of the codebook distribution, allowing the entropy loss to directly shape the latent prior. We further develop an autoregressive entropy model that supports accurate entropy modeling and test-time rate control. Extensive experiments demonstrate that RDVQ achieves strong performance at extremely low bitrates with a lightweight architecture, attaining competitive or superior perceptual quality with significantly fewer parameters. Compared with RDEIC, RDVQ reduces bitrate by up to 75.71% on DISTS and 37.63% on LPIPS on DIV2K-val. Beyond empirical gains, RDVQ introduces an entropy-constrained formulation of VQ, highlighting the potential for a more unified view of image tokenization and compression. The code will be available at https://github.com/CVL-UESTC/RDVQ.
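
To make the phrase "differentiable relaxation of the codebook distribution" more concrete, the sketch below shows one generic way such a relaxation can look: hard nearest-code assignment is replaced by a softmax over negative squared distances, and the entropy of the average code usage serves as a smooth rate estimate. This is an illustrative assumption, not RDVQ's actual formulation (which also involves an autoregressive entropy model); the function names and the `temperature` parameter are hypothetical.

```python
import math


def soft_assign(z, codebook, temperature=1.0):
    """Softmax over negative squared distances: a smooth stand-in for argmin."""
    logits = [
        -sum((zi - ci) ** 2 for zi, ci in zip(z, code)) / temperature
        for code in codebook
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def estimated_rate_bits(latents, codebook, temperature=1.0):
    """Entropy, in bits per latent, of the average soft code usage."""
    usage = [0.0] * len(codebook)
    for z in latents:
        for j, p in enumerate(soft_assign(z, codebook, temperature)):
            usage[j] += p / len(latents)
    return -sum(p * math.log2(p) for p in usage if p > 0.0)
```

Every operation here is smooth in the latents and the codebook, so in an automatic-differentiation framework the same expression could be added to a distortion loss and trained end to end. With two well-separated codes used equally often the estimate is 1 bit per latent, and it falls toward 0 when only one code is used.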

[275] Hierarchical Textual Knowledge for Enhanced Image Clustering

Yijie Zhong,Yunfan Gao,Weipeng Jiang,Haofen Wang

Main category: cs.CV

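TL;DR: This paper proposes a knowledge-enhanced clustering (KEC) method that uses large language models to build a hierarchical concept-attribute textual knowledge structure to improve unsupervised image clustering; without training, it outperforms zero-shot CLIP on most datasets.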

Details Motivation: Traditional image clustering relies only on knowledge from the visual space and struggles to separate visually similar but semantically different classes; existing text-enhanced methods use coarse labels and ignore the rich concept- and attribute-level semantics in text. Method: Uses structured prompts to large language models to condense redundant textual labels into abstract concepts and to automatically extract discriminative attributes for each concept and for pairs of similar concepts, building a hierarchical concept-attribute knowledge structure; this knowledge is instantiated for each image and combined with the original visual features to form knowledge-enhanced features that work with various downstream clustering algorithms. Result: Evaluated on 20 diverse datasets, KEC consistently improves existing text-enhanced clustering methods; without training, it outperforms zero-shot CLIP on 14 of the 20 datasets; compared with naive use of textual knowledge, KEC is both more accurate and more robust. Conclusion: A hierarchical concept-attribute knowledge structure can compensate for visual ambiguity, and LLM-driven knowledge construction offers an interpretable, generalizable form of text guidance for unsupervised image clustering. Abstract: Image clustering aims to group images in an unsupervised fashion. Traditional methods focus on knowledge from visual space, making it difficult to distinguish between visually similar but semantically different classes. Recent advances in vision-language models enable the use of textual knowledge to enhance image clustering. However, most existing methods rely on coarse class labels or simple nouns, overlooking the rich conceptual and attribute-level semantics embedded in textual space. In this paper, we propose a knowledge-enhanced clustering (KEC) method that constructs hierarchical concept-attribute structured knowledge with the help of large language models (LLMs) to guide clustering. Specifically, we first condense redundant textual labels into abstract concepts and then automatically extract discriminative attributes for each single concept and similar concept pairs, via structured prompts to LLMs. This knowledge is instantiated for each input image to obtain knowledge-enhanced features. The knowledge-enhanced features, together with the original visual features, are adapted to various downstream clustering algorithms. We evaluate KEC on 20 diverse datasets, showing consistent improvements across existing methods using additional textual knowledge. KEC without training outperforms zero-shot CLIP on 14 out of 20 datasets. Furthermore, the naive use of textual knowledge may harm clustering performance, while KEC provides both accuracy and robustness.

[276] NTIRE 2026 Challenge on Short-form UGC Video Restoration in the Wild with Generative Models: Datasets, Methods and Results

Xin Li,Jiachao Gong,Xijun Wang,Shiyao Xiong,Bingchen Li,Suhang Yao,Chao Zhou,Zhibo Chen,Radu Timofte,Yuxiang Chen,Shibo Yin,Yilian Zhong,Yushun Fang,Xilei Zhu,Yahui Wang,Chen Lu,Meisong Zheng,Xiaoxu Chen,Jing Yang,Zhaokun Hu,Jiahui Liu,Ying Chen,Haoran Bai,Sibin Deng,Shengxi Li,Mai Xu,Junyang Chen,Hao Chen,Xinzhe Zhu,Fengkai Zhang,Long Sun,Yixing Yang,Xindong Zhang,Jiangxin Dong,Jinshan Pan,Jiyuan Zhang,Shuai Liu,Yibin Huang,Xiaotao Wang,Lei Lei,Zhirui Liu,Shinan Chen,Shang-Quan Sun,Wenqi Ren,Jingyi Xu,Zihong Chen,Zhuoya Zou,Xiuhao Qiu,Jingyu Ma,Huiyuan Fu,Kun Liu,Huadong Ma,Dehao Feng,Zhijie Ma,Boqi Zhang,Jiawei Shi,Hao Kang,Yixin Yang,Yeying Jin,Xu Cheng,Yuxuan Jiang,Chengxi Zeng,Tianhao Peng,Fan Zhang,David Bull,Yanan Xing,Jiachen Tu,Guoyi Xu,Yaoxin Jiang,Jiajia Liu,Yaokun Shi,Wei Zhou,Linfeng Li,Hang Song,Qi Xu,Kun Yuan,Yizhen Shao,Yulin Ren

Main category: cs.CV

TL;DR: This paper introduces the NTIRE 2026 Challenge on Short-form UGC Video Restoration, featuring the new KwaiVIR benchmark with synthetic and real-world videos, two evaluation tracks (subjective and objective), and strong participation and results from 12 teams.

Details Motivation: To establish a practical and robust benchmark for restoring short-form user-generated content (S-UGC) videos under complex, real-world degradations, especially using generative models. Method: Organizing a challenge with the KwaiVIR benchmark, which comprises synthetic and wild short-form UGC videos, and evaluating methods via both subjective (user study) and objective tracks. Result: 95 teams registered; 12 submitted valid solutions, achieving strong performance on KwaiVIR, indicating notable progress in generative S-UGC video restoration. Conclusion: The NTIRE 2026 Challenge successfully advances the field of short-form UGC video restoration by providing a realistic benchmark and fostering effective generative-model-based solutions. Abstract: This paper presents an overview of the NTIRE 2026 Challenge on Short-form UGC Video Restoration in the Wild with Generative Models. This challenge utilizes a new short-form UGC (S-UGC) video restoration benchmark, termed KwaiVIR, which is contributed by USTC and Kuaishou Technology. It contains both synthetically distorted videos and real-world short-form UGC videos in the wild. For this edition, the released data include 200 synthetic training videos, 48 wild training videos, 11 validation videos, and 20 testing videos. The primary goal of this challenge is to establish a strong and practical benchmark for restoring short-form UGC videos under complex real-world degradations, especially in the emerging paradigm of generative-model-based S-UGC video restoration. This challenge has two tracks: (i) the primary track is a subjective track, where the evaluation is based on a user study; (ii) the second track is an objective track. These two tracks enable a comprehensive assessment of restoration quality. In total, 95 teams registered for this competition, and 12 teams submitted valid final solutions and fact sheets for the testing phase.
The submitted methods achieved strong performance on the KwaiVIR benchmark, demonstrating encouraging progress in short-form UGC video restoration in the wild.

[277] Sign Language Recognition in the Age of LLMs

Vaclav Javorek,Jakub Honzik,Ivan Gruber,Tomas Zelezny,Marek Hruz

Main category: cs.CV

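TL;DR: This paper examines whether modern vision-language models (VLMs) can perform isolated sign language recognition (ISLR) in a zero-shot setting. Open-source VLMs fall far behind traditional supervised methods but show partial visual-semantic alignment, while larger proprietary models reach substantially higher accuracy.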

Details Motivation: To investigate whether general-purpose vision-language models can solve a specialized visual recognition problem, isolated sign language recognition (ISLR), without task-specific training. Method: Several open-source and proprietary VLMs are evaluated on the WLASL300 benchmark under prompt-only zero-shot inference, and their visual-semantic alignment is analyzed. Result: Open-source VLMs fall far behind classic supervised models on zero-shot ISLR but show some visual-text alignment; large proprietary models achieve substantially higher accuracy, highlighting the importance of model scale and training data diversity. Conclusion: Current open-source VLMs are not yet suitable for zero-shot ISLR, but their partial alignment points to directions for improvement; model scale and data diversity are key factors for performance. Abstract: Recent Vision Language Models (VLMs) have demonstrated strong performance across a wide range of multimodal reasoning tasks. This raises the question of whether such general-purpose models can also address specialized visual recognition problems such as isolated sign language recognition (ISLR) without task-specific training. In this work, we investigate the capability of modern VLMs to perform ISLR in a zero-shot setting. We evaluate several open-source and proprietary VLMs on the WLASL300 benchmark. Our experiments show that, under prompt-only zero-shot inference, current open-source VLMs remain far behind classic supervised ISLR classifiers. However, follow-up experiments reveal that these models capture partial visual-semantic alignment between signs and text descriptions. Larger proprietary models achieve substantially higher accuracy, highlighting the importance of model scale and training data diversity. All our code is publicly available on GitHub.

[278] Spatio-Temporal Difference Guided Motion Deblurring with the Complementary Vision Sensor

Yapeng Meng,Lin Yang,Yuguo Chen,Xiangru Chen,Taoyi Wang,Lijian Wang,Zheyu Yang,Yihan Lin,Rong Zhao

Main category: cs.CV

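TL;DR: This paper proposes STGDNet, a motion deblurring method built on the complementary vision sensor (CVS). It combines synchronously captured spatial difference (SD) and temporal difference (TD) data with the blurry RGB image and iteratively fuses them in a recurrent multi-branch network, substantially improving deblurring quality and generalization under extreme motion.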

Details Motivation: Single-frame RGB deblurring is highly ill-posed under extreme motion; event cameras suffer from event rate saturation and entangle edge features with motion cues; the CVS provides synchronized, high-frame-rate, decoupled structural edges (SD) and motion cues (TD), offering a new way to address the problem. Method: The Spatio-Temporal Difference Guided Deblur Net (STGDNet) uses a recurrent multi-branch architecture that iteratively encodes and fuses the SD and TD sequences from the CVS to jointly restore structure and color details in the blurry RGB image. Result: The method outperforms existing RGB-only and event-based approaches on a synthetic CVS dataset and in real-world scenes, and shows strong generalization across more than 100 extreme real-world scenarios. Conclusion: The CVS modality and the STGDNet architecture effectively decouple and exploit structural and motion priors, offering a new paradigm for robust deblurring in extremely dynamic scenes. Abstract: Motion blur arises when rapid scene changes occur during the exposure period, collapsing rich intra-exposure motion into a single RGB frame. Without explicit structural or temporal cues, RGB-only deblurring is highly ill-posed and often fails under extreme motion. Inspired by the human visual system, brain-inspired vision sensors introduce temporally dense information to alleviate this problem. However, event cameras still suffer from event rate saturation under rapid motion, while the event modality entangles edge features and motion cues, which limits their effectiveness. As a recent breakthrough, the complementary vision sensor (CVS), Tianmouc, captures synchronized RGB frames together with high-frame-rate, multi-bit spatial difference (SD, encoding structural edges) and temporal difference (TD, encoding motion cues) data within a single RGB exposure, offering a promising solution for RGB deblurring under extreme dynamic scenes. To fully leverage these complementary modalities, we propose Spatio-Temporal Difference Guided Deblur Net (STGDNet), which adopts a recurrent multi-branch architecture that iteratively encodes and fuses SD and TD sequences to restore structure and color details lost in blurry RGB inputs. Our method outperforms current RGB or event-based approaches on both a synthetic CVS dataset and real-world evaluations. Moreover, STGDNet exhibits strong generalization capability across over 100 extreme real-world scenarios. Project page: https://tmcDeblur.github.io/

[279] What Do Vision-Language Models Encode for Personalized Image Aesthetics Assessment?

Koki Ryu,Hitomi Yanaka

Main category: cs.CV

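TL;DR: This paper analyzes whether vision-language models (VLMs) internally encode multi-level aesthetic attributes, and uses these representations for lightweight personalized image aesthetics assessment (PIAA) without fine-tuning.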

Details Motivation: To explore whether VLMs inherently encode rich, multi-level aesthetic attributes that can support effective personalized image aesthetics assessment. Method: The presence and distribution of aesthetic attributes in the internal representations of VLMs are analyzed; simple linear models are built on these representations for individual-level PIAA without model fine-tuning; the transfer of aesthetic information is analyzed across layers, architectures, and image domains. Result: VLMs do encode diverse aesthetic attributes that reach the language decoder layers; simple linear models alone perform PIAA effectively; the analysis reveals how aesthetic information propagates in different VLM architectures and image domains. Conclusion: VLMs have the potential to model subjective, individual aesthetic preferences, and their latent aesthetic representations can be used efficiently for lightweight personalized aesthetics assessment. Abstract: Personalized image aesthetics assessment (PIAA) is an important research problem with practical real-world applications. While methods based on vision-language models (VLMs) are promising candidates for PIAA, it remains unclear whether they internally encode rich, multi-level aesthetic attributes required for effective personalization. In this paper, we first analyze the internal representations of VLMs to examine the presence and distribution of such aesthetic attributes, and then leverage them for lightweight, individual-level personalization without model fine-tuning. Our analysis reveals that VLMs encode diverse aesthetic attributes that propagate into the language decoder layers. Building on these representations, we demonstrate that simple linear models can perform PIAA effectively. We further analyze how aesthetic information is transferred across layers in different VLM architectures and across image domains. Our findings provide insights into how VLMs can be utilized for modeling subjective, individual aesthetic preferences. Our code is available at https://github.com/ynklab/vlm-latent-piaa.

[280] Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images

Bo Zhou,Qiuxia Lai,Zeren Sun,Xiangbo Shu,Yazhou Yao,Wenguan Wang

Main category: cs.CV

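TL;DR: This paper proposes UniSplat, a framework that learns robust, unified 3D representations from unposed, sparse multi-view images. It uses a dual-masking strategy to strengthen geometry awareness, a coarse-to-fine Gaussian splatting strategy to reduce appearance-semantics inconsistencies, and a pose-conditioned recalibration mechanism to enforce geometry-semantics consistency.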

Details Motivation: Existing self-supervised methods for learning robust 3D representations from unposed multi-view images suffer from weak geometry induction, limited appearance detail, and inconsistencies between geometry and semantics. Method: The UniSplat framework comprises: 1) a dual-masking strategy (masking both encoder and decoder tokens, with decoder masks targeted at geometry-rich regions) to strengthen geometry induction; 2) a coarse-to-fine Gaussian splatting strategy that progressively refines the radiance field to reduce appearance-semantics inconsistencies; 3) a pose-conditioned recalibration mechanism that uses estimated camera parameters to re-project predicted 3D point and semantic maps into the image plane and aligns them with RGB and semantic predictions to ensure cross-task consistency. Result: UniSplat produces robust, unified 3D representations from unposed, sparse-view inputs, markedly improving geometry awareness and cross-task generalization and providing a perceptual foundation for spatial intelligence. Conclusion: Through three complementary designs, UniSplat overcomes the limitations of existing methods in geometry induction, appearance detail, and geometry-semantics consistency, offering a new paradigm for 3D representation learning from unposed multi-view images. Abstract: Robust 3D representation learning forms the perceptual foundation of spatial intelligence, enabling downstream tasks in scene understanding and embodied AI. However, learning such representations directly from unposed multi-view images remains challenging. Recent self-supervised methods attempt to unify geometry, appearance, and semantics in a feed-forward manner, but they often suffer from weak geometry induction, limited appearance detail, and inconsistencies between geometry and semantics. We introduce UniSplat, a feed-forward framework designed to address these limitations through three complementary components. First, we propose a dual-masking strategy that strengthens geometry induction in the encoder. By masking both encoder and decoder tokens, and targeting decoder masks toward geometry-rich regions, the model is forced to infer structural information from incomplete visual cues, yielding geometry-aware representations even under unposed inputs. Second, we develop a coarse-to-fine Gaussian splatting strategy that reduces appearance-semantics inconsistencies by progressively refining the radiance field. Finally, to enforce geometric-semantic consistency, we introduce a pose-conditioned recalibration mechanism that interrelates the outputs of multiple heads by re-projecting predicted 3D point and semantic maps into the image plane using estimated camera parameters, and aligning them with corresponding RGB and semantic predictions to ensure cross-task consistency, thereby resolving geometry-semantic mismatches. 
Together, these components yield unified 3D representations that are robust to unposed, sparse-view inputs and generalize across diverse tasks, laying a perceptual foundation for spatial intelligence.

[281] Reasoning Resides in Layers: Restoring Temporal Reasoning in Video-Language Models with Layer-Selective Merging

Zihang Fu,Haonan Wang,Jian Kang,Kenji Kawaguchi,Jiaying Wu

Main category: cs.CV

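TL;DR: This paper proposes MERIT, a training-free, task-driven model merging framework for restoring temporal reasoning in video-language models (VLMs). It merges self-attention layer by layer between a VLM and its text-only backbone, improving temporal reasoning without degrading temporal perception.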

Details Motivation: Multimodal adaptation gives large language models perceptual capabilities but often weakens their original reasoning ability; in video-language models in particular, visual alignment can markedly impair temporal reasoning (TR) over sequential events. Method: MERIT is a training-free model merging framework that searches over layer-wise self-attention merging recipes between a VLM and its paired text-only backbone, with an objective that improves temporal reasoning (TR) while penalizing degradation in temporal perception (TP). Result: Across three mainstream VLMs and multiple video benchmarks, MERIT consistently improves TR, preserves or improves TP, and generalizes to four benchmarks outside the search set; it outperforms uniform full-model merging and random layer selection; interventional masking and frame-level attribution confirm that the selected layers are critical for reasoning. Conclusion: Targeted, perception-aware model merging can effectively restore temporal reasoning in VLMs without retraining. Abstract: Multimodal adaptation equips large language models (LLMs) with perceptual capabilities, but often weakens the reasoning ability inherited from language-only pretraining. This trade-off is especially pronounced in video-language models (VLMs), where visual alignment can impair temporal reasoning (TR) over sequential events. We propose MERIT, a training-free, task-driven model merging framework for restoring TR in VLMs. MERIT searches over layer-wise self-attention merging recipes between a VLM and its paired text-only backbone using an objective that improves TR while penalizing degradation in temporal perception (TP). Across three representative VLMs and multiple challenging video benchmarks, MERIT consistently improves TR, preserves or improves TP, and generalizes beyond the search set to four distinct benchmarks. It also outperforms uniform full-model merging and random layer selection, showing that effective recovery depends on selecting the right layers. Interventional masking and frame-level attribution further show that the selected layers are disproportionately important for reasoning and shift model decisions toward temporally and causally relevant evidence. These results show that targeted, perception-aware model merging can effectively restore TR in VLMs without retraining.
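
To make the idea of a layer-wise merging recipe concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the linear interpolation rule, the `recipe` dictionary, the flattened weight lists, and the `penalty` factor are all assumptions chosen for illustration. The abstract only states that self-attention is merged per layer and that the search objective rewards TR while penalizing TP degradation.

```python
def merge_attention(vlm_layers, text_layers, recipe):
    """Blend per-layer self-attention weights (flattened to lists of floats).

    recipe maps a layer index to alpha in [0, 1]; alpha is the share taken
    from the text-only backbone. Layers absent from the recipe stay pure VLM.
    Linear interpolation is an assumed merging rule, used only for illustration.
    """
    merged = []
    for idx, (vlm_w, text_w) in enumerate(zip(vlm_layers, text_layers)):
        alpha = recipe.get(idx, 0.0)
        merged.append([(1.0 - alpha) * v + alpha * t for v, t in zip(vlm_w, text_w)])
    return merged


def recipe_score(tr, tp, tr_base, tp_base, penalty=1.0):
    """Hypothetical search objective: reward the TR gain over the unmerged VLM
    and penalise only a drop in TP below the unmerged baseline."""
    return (tr - tr_base) - penalty * max(0.0, tp_base - tp)


# Toy two-layer example: merge only layer 1, half-way toward the text backbone.
vlm = [[1.0, 1.0], [2.0, 2.0]]
text = [[3.0, 3.0], [4.0, 4.0]]
merged = merge_attention(vlm, text, recipe={1: 0.5})
```

A search procedure would evaluate `recipe_score` for each candidate recipe on held-out TR and TP benchmarks and keep the best one; how candidates are proposed is left open here.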

[282] Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models

Dehui Wang,Congsheng Xu,Rong Wei,Yue Shi,Shoufa Chen,Dingxiang Luo,Tianshuo Yang,Xiaokang Yang,Yusen Qin,Rui Tang,Yao Mu

Main category: cs.CV

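TL;DR: Rein3D is a framework that couples 3D Gaussian Splatting with video diffusion models and follows a "restore-and-refine" paradigm to reconstruct high-quality, globally consistent 360-degree indoor 3D scenes from sparse inputs.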

Details Motivation: Existing methods struggle to infer large amounts of missing geometry in large unseen areas while maintaining global consistency, producing reconstructions that are locally plausible but globally inconsistent. Method: The Rein3D framework: 1) starting from a coarse 3DGS initialization, a radial exploration strategy renders panoramic videos; 2) a panoramic video-to-video diffusion model (PanoV2V) restores the videos, which are then super-resolved; 3) the refined videos serve as pseudo-ground truths to update the global 3D Gaussian field. The PanoV2V-15K dataset is constructed to support training. Result: Experiments show that Rein3D generates photorealistic, globally consistent 3D scenes and significantly outperforms existing baselines on long-range camera exploration. Conclusion: Coupling an explicit 3D representation with temporally coherent video diffusion priors effectively improves the quality and consistency of large-scale indoor scene reconstruction from sparse inputs. Abstract: The growing demand for Embodied AI and VR applications has highlighted the need for synthesizing high-quality 3D indoor scenes from sparse inputs. However, existing approaches struggle to infer massive amounts of missing geometry in large unseen areas while maintaining global consistency, often producing locally plausible but globally inconsistent reconstructions. We present Rein3D, a framework that reconstructs full 360-degree indoor environments by coupling explicit 3D Gaussian Splatting (3DGS) with temporally coherent priors from video diffusion models. Our approach follows a "restore-and-refine" paradigm: we employ a radial exploration strategy to render imperfect panoramic videos along trajectories starting from the origin, effectively uncovering occluded regions from a coarse 3DGS initialization. These sequences are restored by a panoramic video-to-video diffusion model and further enhanced via video super-resolution to synthesize high-fidelity geometry and textures. Finally, these refined videos serve as pseudo-ground truths to update the global 3D Gaussian field. To support this task, we construct PanoV2V-15K, a dataset of over 15K paired clean and degraded panoramic videos for diffusion-based scene restoration. Experiments demonstrate that Rein3D produces photorealistic and globally consistent 3D scenes and significantly improves long-range camera exploration compared with existing baselines.

[283] Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference

Imanol Miranda,Ander Salaberria,Eneko Agirre,Gorka Azkune

Main category: cs.CV

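TL;DR: This paper argues that the poor compositional performance of dual-encoder vision-language models such as CLIP stems less from deficient representations than from the standard inference protocol based on global cosine similarity. The authors propose a lightweight transformer that learns fine-grained region-segment alignment over frozen encoders, substantially improving compositional generalization and outperforming full fine-tuning and related methods under distribution shift.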

Details Motivation: The poor performance of dual-encoder VLMs such as CLIP on compositional benchmarks is usually attributed to deficient representations; the authors question this attribution and argue that the problem more likely lies in the standard global-similarity inference protocol. Method: 1) Controlled diagnostic experiments verify that fine-grained region-segment alignment improves compositional performance; 2) a lightweight transformer learns localized alignment directly over frozen image patch and text token embeddings. Result: The proposed method matches full fine-tuning on in-domain retrieval and substantially outperforms full fine-tuning and prior end-to-end compositional training methods on controlled out-of-domain compositional benchmarks, showing that localized alignment improves robustness to distribution shift. Conclusion: Global embedding matching is a key bottleneck for compositional generalization in dual-encoder VLMs; adding an explicit localized alignment mechanism over frozen representations is an effective and efficient way to improve robust compositional generalization. Abstract: Dual-encoder Vision-Language Models (VLMs) such as CLIP are often characterized as bag-of-words systems due to their poor performance on compositional benchmarks. We argue that this limitation may stem less from deficient representations than from the standard inference protocol based on global cosine similarity. First, through controlled diagnostic experiments, we show that explicitly enforcing fine-grained region-segment alignment at inference dramatically improves compositional performance without updating pretrained encoders. We then introduce a lightweight transformer that learns such alignments directly from frozen patch and token embeddings. Comparing against full fine-tuning and prior end-to-end compositional training methods, we find that although these approaches improve in-domain retrieval, their gains do not consistently transfer under distribution shift. In contrast, learning localized alignment over frozen representations matches full fine-tuning on in-domain retrieval while yielding substantial improvements on controlled out-of-domain compositional benchmarks. These results identify global embedding matching as a key bottleneck in dual-encoder VLMs and highlight the importance of alignment mechanisms for robust compositional generalization.
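
The contrast between global matching and localized alignment can be sketched in a few lines of plain Python. This is an illustrative toy, not the paper's method: mean pooling stands in for the pooled embedding a real dual encoder would produce, and the hand-written "best region per segment" rule stands in for the alignment that the authors learn with a lightweight transformer. All function names and the toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mean_pool(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def global_score(regions, segments):
    """One vector per modality, compared once (assumed stand-in for global cosine similarity)."""
    return cosine(mean_pool(regions), mean_pool(segments))


def local_alignment_score(regions, segments):
    """Match each text segment to its most similar image region, then average."""
    return sum(max(cosine(s, r) for r in regions) for s in segments) / len(segments)


# Toy 2-D embeddings: two image regions and two candidate captions.
regions = [[1.0, 0.0], [0.0, 1.0]]
matching = [[1.0, 0.0], [0.0, 1.0]]     # every segment has a supporting region
mismatched = [[1.0, 0.0], [0.0, -1.0]]  # second segment has no supporting region

score_match = local_alignment_score(regions, matching)
score_mismatch = local_alignment_score(regions, mismatched)
```

The local score exposes which individual segment lacks visual support, information that a single pooled comparison cannot provide.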

[284] TAPNext++: What's Next for Tracking Any Point (TAP)?

Sebastian Jung,Artem Zholus,Martin Sundermeyer,Carl Doersch,Ross Goroshin,David Joseph Tan,Sarath Chandar,Rudolph Triebel,Federico Tombari

Main category: cs.CV

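TL;DR: This paper proposes TAPNext++, an improved end-to-end recurrent transformer that substantially improves tracking of arbitrary points in long video sequences, in particular the re-detection of points that reappear after being occluded or leaving the frame, and introduces a new metric, AJ_RD, to evaluate re-detection.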

Details Motivation: TAPNext degrades on long videos and often fails to re-detect query points that reappear (for example after occlusion), and the current literature lacks a systematic evaluation of re-detection. Method: TAPNext++ is trained on videos of up to 1024 frames using sequence parallelism; geometric augmentations (such as periodic roll) simulate point re-entries, and occluded points are supervised; a new metric, Re-Detection Average Jaccard (AJ_RD), is introduced. Result: State-of-the-art results on multiple benchmarks; substantially improved long-sequence tracking robustness and re-detection performance; low memory and compute footprint retained. Conclusion: Recurrent video transformers can be substantially improved for point tracking through data-driven strategies (long-sequence training, targeted augmentation, occlusion supervision), and re-detection is a key direction for optimization. Abstract: Tracking-Any-Point (TAP) models aim to track any point through a video, which is a crucial task in AR/XR and robotics applications. The recently introduced TAPNext approach proposes an end-to-end, recurrent transformer architecture to track points frame-by-frame in a purely online fashion -- demonstrating competitive performance at minimal latency. However, we show that TAPNext struggles with longer video sequences and also frequently fails to re-detect query points that reappear after being occluded or leaving the frame. In this work, we present TAPNext++, a model that tracks points in sequences that are orders of magnitude longer while preserving the low memory and compute footprint of the architecture. We train the recurrent video transformer using several data-driven solutions, including training on long 1024-frame sequences enabled by sequence parallelism techniques. We highlight that re-detection performance is a blind spot in the current literature and introduce a new metric, Re-Detection Average Jaccard ($AJ_{RD}$), to explicitly evaluate tracking on re-appearing points. To improve re-detection of points, we introduce tailored geometric augmentations, such as periodic roll that simulates point re-entries, and supervising occluded points. We demonstrate that recurrent transformers can be substantially improved for point tracking and set a new state-of-the-art on multiple benchmarks. Model and code can be found at https://tap-next-plus-plus.github.io.

[285] CoFusion: Multispectral and Hyperspectral Image Fusion via Spectral Coordinate Attention

Baisong Li

Main category: cs.CV

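TL;DR: This paper proposes CoFusion, a framework that fuses multispectral and hyperspectral images using a multi-scale generator and spatial-spectral collaboration modules, achieving a better balance between spatial detail enhancement and spectral fidelity.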

Details Motivation: Existing methods struggle to model cross-scale interactions and spatial-spectral collaboration, making it hard to balance spatial detail enhancement against spectral fidelity. Method: The CoFusion framework contains a Multi-Scale Generator (MSG) with a three-level pyramidal structure; each level uses a dual-branch strategy with a Spatial Coordinate-Aware Mixing module (SpaCAM) and a Spectral Coordinate-Aware Mixing module (SpeCAM), and a Spatial-Spectral Cross-Fusion Module (SSCFM) performs dynamic cross-modal alignment and complementary feature fusion. Result: Experiments on multiple benchmark datasets show that CoFusion clearly outperforms state-of-the-art methods in both spatial reconstruction quality and spectral consistency. Conclusion: By explicitly modeling cross-scale and cross-modal dependencies, CoFusion provides a stronger unified collaborative paradigm for multispectral and hyperspectral image fusion. Abstract: Multispectral and Hyperspectral Image Fusion (MHIF) aims to reconstruct high-resolution images by integrating low-resolution hyperspectral images (LRHSI) and high-resolution multispectral images (HRMSI). However, existing methods face limitations in modeling cross-scale interactions and spatial-spectral collaboration, making it difficult to achieve an optimal trade-off between spatial detail enhancement and spectral fidelity. To address this challenge, we propose CoFusion: a unified spatial-spectral collaborative fusion framework that explicitly models cross-scale and cross-modal dependencies. Specifically, a Multi-Scale Generator (MSG) is designed to construct a three-level pyramidal architecture, enabling the effective integration of global semantics and local details. Within each scale, a dual-branch strategy is employed: the Spatial Coordinate-Aware Mixing module (SpaCAM) is utilized to capture multi-scale spatial contexts, while the Spectral Coordinate-Aware Mixing module (SpeCAM) enhances spectral representations through frequency decomposition and coordinate mixing. Furthermore, we introduce the Spatial-Spectral Cross-Fusion Module (SSCFM) to perform dynamic cross-modal alignment and complementary feature fusion. Extensive experiments on multiple benchmark datasets demonstrate that CoFusion consistently outperforms state-of-the-art methods, achieving superior performance in both spatial reconstruction and spectral consistency.

[286] GeoMeld: Toward Semantically Grounded Foundation Models for Remote Sensing

Maram Hasan,Md Aminur Hossain,Savitra Roy,Souparna Bhowmik,Ayush V. Patel,Mainak Singha,Subhasis Chaudhuri,Muhammad Haris Khan,Biplab Banerjee

Main category: cs.CV

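TL;DR: This paper presents GeoMeld, a large-scale, spatially aligned, multimodal remote sensing dataset (about 2.5 million samples), together with the GeoMeld-FM pretraining framework. GeoMeld-FM jointly trains on masked autoencoding, JEPA, and caption-vision contrastive objectives to learn representations that are both physically consistent and semantically grounded, improving downstream transfer and cross-sensor robustness.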

Details Motivation: Foundation models for remote sensing need spatially aligned heterogeneous modalities and semantically grounded supervision, but large-scale, high-quality resources of this kind are lacking. Method: The GeoMeld dataset is built with a unified spatial alignment protocol and semantic annotations that are generated and verified by an agentic captioning framework; the GeoMeld-FM pretraining framework combines multi-pretext masked autoencoding, JEPA representation learning, and caption-vision contrastive alignment. Result: Consistent gains on multiple downstream tasks and stronger cross-sensor robustness. Conclusion: Together, GeoMeld and GeoMeld-FM form a scalable reference framework for semantically grounded multimodal foundation modeling in remote sensing. Abstract: Effective foundation modeling in remote sensing requires spatially aligned heterogeneous modalities coupled with semantically grounded supervision, yet such resources remain limited at scale. We present GeoMeld, a large-scale multimodal dataset with approximately 2.5 million spatially aligned samples. The dataset spans diverse modalities and resolutions and is constructed under a unified alignment protocol for modality-aware representation learning. GeoMeld provides semantically grounded language supervision through an agentic captioning framework that synthesizes and verifies annotations from spectral signals, terrain statistics, and structured geographic metadata, encoding measurable cross-modality relationships within textual descriptions. To leverage this dataset, we introduce GeoMeld-FM, a pretraining framework that combines multi-pretext masked autoencoding over aligned modalities, JEPA representation learning, and caption-vision contrastive alignment. This joint objective enables the learned representation space to capture both reliable cross-sensor physical consistency and grounded semantics. Experiments demonstrate consistent gains in downstream transfer and cross-sensor robustness. Together, GeoMeld and GeoMeld-FM establish a scalable reference framework for semantically grounded multi-modal foundation modeling in remote sensing.

[287] COREY: A Prototype Study of Entropy-Guided Operator Fusion with Hadamard Reparameterization for Selective State Space Models

Bo Ma,Jinsong Wu,Hongjiang Wei,Weiqi Yan

Main category: cs.CV

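TL;DR: This paper presents COREY, a prototype framework that combines memory-aware operator fusion with Hadamard-based feature reparameterization to ease the memory-bandwidth bottleneck of state space models (such as Mamba) in long-sequence inference. Activation entropy is used at runtime to schedule fusion boundaries and tile sizes, and normalized Hadamard transforms regularize heavy-tailed activations, reducing latency and DRAM traffic and improving throughput in a controlled prototype study.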

Details Motivation: SSMs such as Mamba offer linear-time sequence modeling, but practical deployment is limited by memory bandwidth because selective state updates are often decomposed into fragmented kernels that repeatedly materialize intermediate tensors. Method: The COREY framework combines: 1) memory-aware operator fusion; 2) Hadamard-based feature reparameterization, with normalized Hadamard transforms absorbed into linear projections; 3) activation entropy estimated with fixed-width histograms, used as a runtime scheduling statistic to place fusion boundaries and choose tile sizes. Result: In a controlled prototype study over heavy-tailed SSM activations, COREY consistently reduces proxy latency, improves throughput, and lowers DRAM traffic relative to unfused and fixed-depth baselines; low-bit results are evaluated only through a hand-crafted stability proxy and do not constitute quality claims. Conclusion: COREY indicates that fusion guided by runtime entropy-aware scheduling and Hadamard regularization can ease the memory-bandwidth bottleneck of SSM inference, suggesting a path toward efficient long-context deployment. Abstract: State Space Models (SSMs), represented by the Mamba family, provide linear-time sequence modeling and are attractive for long-context inference. Yet practical deployments remain memory-bandwidth limited because selective state updates are often decomposed into fragmented kernels with repeated intermediate tensor materialization. We present COREY, a prototype framework that combines memory-aware operator fusion with Hadamard-based feature reparameterization. Activation entropy, estimated with fixed-width histograms, is used as a runtime scheduling statistic to place fusion boundaries and choose tile sizes. To regularize heavy-tailed activations, we absorb normalized Hadamard transforms into linear projections, preserving functional equivalence while reducing peak-coordinate concentration. In a controlled prototype study over heavy-tailed SSM activations, COREY consistently reduces proxy latency, improves throughput, and lowers DRAM traffic relative to unfused and fixed-depth baselines. Low-bit results are reported only through a hand-crafted stability proxy and are intended as diagnostic evidence rather than checkpoint-level quality claims. Code repository: https://github.com/mabo1215/COREY_Transformer.git.
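
The two ingredients named in the abstract, absorbing a normalized Hadamard transform into a linear projection and estimating activation entropy from a fixed-width histogram, can be illustrated with a small standard-library sketch. It is not taken from the COREY repository: the helper names, the Sylvester construction, the bin count, and the toy matrices are assumptions. The identity it relies on is that a normalized Hadamard matrix H satisfies H^T H = I, so (W H^T)(H x) = W x.

```python
import math


def hadamard(n):
    """Normalized Hadamard matrix via the Sylvester construction (n a power of two)."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in row] for row in h]


def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


def histogram_entropy(values, bins=8):
    """Shannon entropy in bits of a fixed-width histogram over the values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero bin width for constant inputs
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    total = len(values)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


W = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 0.0, 2.0]]  # toy projection
x = [8.0, 0.0, 0.0, 0.0]                           # heavy-tailed activation
H = hadamard(4)

W_fused = matmul(W, transpose(H))  # absorb H^T into the projection offline
x_rot = matvec(H, x)               # rotated activation: the peak is spread out
y_ref = matvec(W, x)
y_fused = matvec(W_fused, x_rot)   # same output as y_ref
```

In this toy case the largest activation magnitude drops from 8 to 4 after rotation while the projection output is unchanged. How the entropy estimate would be mapped to fusion boundaries or tile sizes is not specified in the abstract and is left out here.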

[288] Self-supervised Pretraining of Cell Segmentation Models

Kaden Stillwagon,Alexandra Dunnum VandeLoo,Benjamin Magondu,Craig R. Forest

Main category: cs.CV

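TL;DR: This paper proposes DINOCell, a self-supervised framework for cell instance segmentation based on DINOv2. It adapts to the microscopy domain through continued self-supervised training on unlabeled cell images, substantially improving segmentation performance.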

Details Motivation: Methods that start from models pretrained on natural images (such as SAM) lose segmentation performance on microscopy images because of domain shift, and high-quality labeled data is scarce. Method: DINOv2 representations are adapted to the microscopy domain through continued self-supervised training on unlabeled cell images, followed by supervised fine-tuning. Result: A SEG score of 0.784 on the LIVECell benchmark, a 10.42% improvement over leading SAM-based methods, and strong zero-shot performance on three out-of-distribution microscopy datasets. Conclusion: Domain-adapted self-supervised pretraining on microscopy images can substantially improve the robustness and performance of cell instance segmentation. Abstract: Instance segmentation enables the analysis of spatial and temporal properties of cells in microscopy images by identifying the pixels belonging to each cell. However, progress is constrained by the scarcity of high-quality labeled microscopy datasets. Many recent approaches address this challenge by initializing models with segmentation-pretrained weights from large-scale natural-image models such as Segment Anything Model (SAM). However, representations learned from natural images often encode objectness and texture priors that are poorly aligned with microscopy data, leading to degraded performance under domain shift. We propose DINOCell, a self-supervised framework for cell instance segmentation that leverages representations from DINOv2 and adapts them to microscopy through continued self-supervised training on unlabeled cell images prior to supervised fine-tuning. On the LIVECell benchmark, DINOCell achieves a SEG score of 0.784, improving by 10.42% over leading SAM-based models, and demonstrates strong zero-shot performance on three out-of-distribution microscopy datasets. These results highlight the benefits of domain-adapted self-supervised pretraining for robust cell segmentation.

[289] How to Design a Compact High-Throughput Video Camera?

Chenxi Qiu,Tao Yue,Xuemei Hu

Main category: cs.CV

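TL;DR: This paper proposes a low-bit imaging scheme based on gradient cameras, combined with a multi-scale reconstruction CNN, to address the readout and transmission bottlenecks of high-throughput video imaging.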

Details Motivation: Existing high-throughput imaging systems are highly complex; single-chip integration is becoming feasible, but readout and transmission speed cannot keep pace with the growing number of pixels. Method: The strengths of gradient cameras in fast readout and efficient representation are analyzed, a low-bit gradient camera scheme is proposed, and a multi-scale reconstruction CNN is designed to reconstruct high-resolution images. Result: Extensive experiments on simulated and real data support the image quality and feasibility of the method. Conclusion: The proposed low-bit gradient camera scheme can ease the readout and transmission bottlenecks of high-throughput video imaging and shows practical potential. Abstract: High throughput video acquisition is a challenging problem and has been drawing increasing attention. Existing high throughput imaging systems splice hundreds of sub-images/videos into high throughput videos, suffering from extremely high system complexity. Alternatively, with pixel sizes reducing to sub-micrometer levels, integrating ultra-high throughput on a single chip is becoming feasible. Nevertheless, the readout and output transmission speed cannot keep pace with the increasing pixel numbers. To this end, this paper analyzes the strength of gradient cameras in fast readout and efficient representation, and proposes a low-bit gradient camera scheme based on existing technologies that can resolve the readout and transmission bottlenecks for high throughput video imaging. A multi-scale reconstruction CNN is proposed to reconstruct high-resolution images. Extensive experiments on both simulated and real data are conducted to demonstrate the promising quality and feasibility of the proposed method.

[290] NTIRE 2026 The Second Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results

Xin Li,Yeying Jin,Suhang Yao,Beibei Lin,Zhaoxin Fan,Wending Yan,Xin Jin,Zongwei Wu,Bingchen Li,Peishu Shi,Yufei Yang,Yu Li,Zhibo Chen,Bihan Wen,Robby T. Tan,Radu Timofte,Runzhe Li,Kui Jiang,Zhaocheng Yu,Yiang Chen,Junjun Jiang,Xianming Liu,Hongde Gu,Zeliang Li,Mache You,Jiangxin Dong,Jinshan Pan,Qiyu Rong,Bowen Shao,Hongyuan Jing,Mengmeng Zhang,Bo Ding,Hui Zhang,Yi Ren,Mohab Kishawy,Jun Chen,Anh-Kiet Duong,Petra Gomez-Kramer,Jean-Michel Carozza,Wangzhi Xing,Xin Lu,Enxuan Gu,Jingxi Zhang,Diqi Chen,Qiaosi Yi,Bingcai Wei,Wenjie Li,Bowen Tie,Heng Guo,Zhanyu Ma,Jiachen Tu,Guoyi Xu,Yaoxin Jiang,Cici Liu,Yaokun Shi,Paula Garrido Mellado,Daniel Feijoo,Alvaro Garcia Lara,Marcos V. Conde,Zhidong Zhu,Bangshu Xiong,Qiaofeng Ou,Zhibo Rao,Wei Li,Zida Zhang,Hui Geng,Qisheng Xu,Xuyao Deng,Changjian Wang,Kele Xu,Guanglu Dong,Qiyao Zhao,Tianheng Zheng,Chunlei Li,Lichao Mou,Chao Ren,Chang-De Peng,Chieh-Yu Tsai,Guan-Cheng Liu,Li-Wei Kang,Abhishek Rajak,Milan Kumar Singh,Ankit Kumar,Dimple Sonone,Kishor Upla,Kiran Raja,Huilin Zhao,Xing Xu,Chuan Chen,Yeming Lao,Wenjing Xun,Li Yang,Bilel Benjdira,Anas M. Ali,Wadii Boulila,Hao Yang,Ruikun Zhang,Liyuan Pan

Main category: cs.CV

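TL;DR: This paper presents the NTIRE 2026 Second Challenge on Day and Night Raindrop Removal for Dual-Focused Images. Built on the real-world Raindrop Clarity dataset with updated training, validation, and test splits, the challenge attracted 168 registered teams and 17 valid submissions, advancing raindrop removal under varied illumination and focus conditions.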

Details Motivation: To establish a strong and practical benchmark for raindrop removal under various illumination and focus conditions. Method: The second NTIRE 2026 challenge is organized, and participating methods are evaluated on the updated Raindrop Clarity dataset (14,139 training, 407 validation, and 593 test images). Result: 168 teams registered and 17 submitted valid solutions; the submitted methods perform strongly on the Raindrop Clarity dataset. Conclusion: The challenge advances raindrop removal in day, night, and dual-focus settings and confirms the effectiveness and progress of current methods. Abstract: This paper presents an overview of the NTIRE 2026 Second Challenge on Day and Night Raindrop Removal for Dual-Focused Images. Building upon the success of the first edition, this challenge attracted a wide range of impressive solutions, all developed and evaluated on our real-world Raindrop Clarity dataset. For this edition, we adjust the dataset with 14,139 images for training, 407 images for validation, and 593 images for testing. The primary goal of this challenge is to establish a strong and practical benchmark for the removal of raindrops under various illumination and focus conditions. In total, 168 teams have registered for the competition, and 17 teams submitted valid final solutions and fact sheets for the testing phase. The submitted methods achieved strong performance on the Raindrop Clarity dataset, demonstrating the growing progress in this challenging task.

[291] Language Prompt vs. Image Enhancement: Boosting Object Detection With CLIP in Hazy Environments

Jian Pang,Bingfeng Zhang,Jin Wang,Baodi Liu,Dapeng Tao,Weifeng Liu

Main category: cs.CV

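TL;DR: This paper proposes a new approach to object detection in haze that uses language prompts, rather than image enhancement, to strengthen the semantics of degraded objects. It designs the AME and FAME weighting mechanisms, builds the large-scale synthetic hazy dataset HazyCOCO, and reports state-of-the-art performance.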

Details Motivation: In hazy environments objects are severely degraded and their semantics are weakened, and conventional image enhancement methods are limited by the instability of the enhancement modules. Method: A CLIP-guided Cross-Entropy Loss (CLIP-CE) driven by language prompts is proposed; Approximation of Mutual Exclusion (AME) provides credible weights that assess how much an object's semantics have been weakened; Fine-tuned AME (FAME) adaptively adjusts the AME weights to compensate for imbalanced optimization; the HazyCOCO hazy dataset is constructed. Result: State-of-the-art performance on hazy object detection; the code and dataset will be released. Conclusion: Language prompts can effectively replace image enhancement for strengthening the semantics of degraded objects, and CLIP-CE, AME/FAME, and HazyCOCO provide a new paradigm and practical resources for detection in haze. Abstract: Object detection in hazy environments is challenging because degraded objects are nearly invisible and their semantics are weakened by environmental noise, making it difficult for detectors to identify. Common approaches involve image enhancement to boost weakened semantics, but these methods are limited by the instability of enhanced modules. This paper proposes a novel solution by employing language prompts to enhance weakened semantics without image enhancement. Specifically, we design Approximation of Mutual Exclusion (AME) to provide credible weights for Cross-Entropy Loss, resulting in CLIP-guided Cross-Entropy Loss (CLIP-CE). The provided weights assess the semantic weakening of objects. Through the backpropagation of CLIP-CE, weakened semantics are enhanced, making degraded objects easier to detect. In addition, we present Fine-tuned AME (FAME) which adaptively fine-tunes the weight of AME based on the predicted confidence. The proposed FAME compensates for the imbalanced optimization in AME. Furthermore, we present HazyCOCO, a large-scale synthetic hazy dataset comprising 61258 images. Experimental results demonstrate that our method achieves state-of-the-art performance. The code and dataset will be released.

[292] LogitDynamics: Reliable ViT Error Detection from Layerwise Logit Trajectories

Ido Beigelman,Moti Freiman

Main category: cs.CV

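TL;DR: This paper proposes a method for predicting ViT classification errors from internal signals in a single forward pass. Lightweight linear heads extract multi-layer logits and instability statistics as features, enabling efficient and reliable confidence estimation.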

Details Motivation: Inspired by hallucination detection from internal signals in large language models, the paper explores whether similar depth-wise signals exist in ViTs and can be used for error prediction. Method: Lightweight linear heads are attached to intermediate ViT layers; from the last L layers, the logits of the predicted class and its top-K competitors are extracted, along with statistics describing the instability of top-ranked classes across depth; a linear probe is trained on these features to predict the error indicator. Result: AUCPR improves on or matches baselines across multiple datasets, cross-dataset generalization is stronger, and the computational overhead is minimal. Conclusion: Deep ViT layers carry reliable internal signals for error prediction, and the proposed method achieves strong confidence estimation at very low cost. Abstract: Reliable confidence estimation is critical when deploying vision models. We study error prediction: determining whether an image classifier's output is correct using only signals from a single forward pass. Motivated by internal-signal hallucination detection in large language models, we investigate whether similar depth-wise signals exist in Vision Transformers (ViTs). We propose a simple method that models how class evidence evolves across layers. By attaching lightweight linear heads to intermediate layers, we extract features from the last L layers that capture both the logits of the predicted class and its top-K competitors, as well as statistics describing instability of top-ranked classes across depth. A linear probe trained on these features predicts the error indicator. Across datasets, our method improves or matches AUCPR over baselines and shows stronger cross-dataset generalization while requiring minimal additional computation.
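
The feature construction described above can be sketched as follows. This is a hedged illustration rather than the authors' code: `trajectory_features` is a hypothetical name, the per-layer logits are assumed to come from the lightweight linear heads, and the two instability statistics (arg-max flip count and agreement with the final prediction) are assumed examples, since the abstract does not list the exact statistics used.

```python
def trajectory_features(layer_logits, last_l=3, top_k=2):
    """Build a feature vector from per-layer class logits.

    layer_logits: list over depth; each entry is a list of class logits
    produced by a linear head attached to that layer (assumed input format).
    """
    final = layer_logits[-1]
    ranked = sorted(range(len(final)), key=lambda c: final[c], reverse=True)
    tracked = ranked[: top_k + 1]  # predicted class followed by its top-K competitors
    window = layer_logits[-last_l:]

    # Logit trajectories of the tracked classes over the last L layers.
    feats = [layer[c] for layer in window for c in tracked]

    # Assumed instability statistics over the same window.
    tops = [max(range(len(layer)), key=lambda c: layer[c]) for layer in window]
    flips = sum(a != b for a, b in zip(tops, tops[1:]))      # arg-max changes between layers
    agree = sum(t == ranked[0] for t in tops) / len(tops)    # share matching the final prediction
    return feats + [float(flips), agree]


# Toy example: 4 layers, 3 classes; the top class changes once, early in depth.
logits = [
    [0.1, 0.3, 0.2],
    [0.5, 0.4, 0.1],
    [1.2, 0.3, 0.2],
    [2.0, 0.5, 0.1],
]
features = trajectory_features(logits, last_l=4, top_k=1)
```

A linear probe, for example logistic regression, would then be fitted on such vectors with a binary label indicating whether the final prediction was wrong.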

[293] LoViF 2026 The First Challenge on Weather Removal in Videos

Chenghao Qian

Main category: cs.CV

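TL;DR: This paper presents the LoViF 2026 challenge on video weather removal (such as rain and snow), introduces a new short-form WRV dataset, and promotes video restoration methods that balance fidelity and perceptual quality.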

Details Motivation: To advance robust, realistic video restoration under real-world weather conditions and to address the shortcomings of existing methods in visual plausibility, temporal consistency, and preservation of scene structure and motion dynamics. Method: The LoViF 2026 weather removal challenge is organized around the WRV dataset, which contains 18 videos with 1,216 synthesized degraded frames paired with ground-truth clean frames at 832 x 480, split 1:1:1 into training, validation, and test sets, with an evaluation protocol that jointly considers fidelity and perceptual quality. Result: The challenge attracted 37 participants and received 5 valid final submissions; the challenge platform and dataset are publicly accessible. Conclusion: The challenge advances video weather removal and provides a benchmark dataset and a standardized evaluation framework for future research. Abstract: This paper presents a review of the LoViF 2026 Challenge on Weather Removal in Videos. The challenge encourages the development of methods for restoring clean videos from inputs degraded by adverse weather conditions such as rain and snow, with an emphasis on achieving visually plausible and temporally consistent results while preserving scene structure and motion dynamics. To support this task, we introduce a new short-form WRV dataset tailored for video weather removal. It consists of 18 videos with 1,216 synthesized frames paired with 1,216 real-world ground-truth frames at a resolution of 832 x 480, and is split into training, validation, and test sets with a ratio of 1:1:1. The goal of this challenge is to advance robust and realistic video restoration under real-world weather conditions, with evaluation protocols that jointly consider fidelity and perceptual quality. The challenge attracted 37 participants and received 5 valid final submissions with corresponding fact sheets, contributing to progress in weather removal for videos. The project is publicly available at https://www.codabench.org/competitions/13462/.

[294] HiddenObjects: Scalable Diffusion-Distilled Spatial Priors for Object Placement

Marco Schouten,Ioannis Siglidis,Serge Belongie,Dim P. Papadopoulos

Main category: cs.CV

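TL;DR: This paper proposes a method that distills explicit, class-conditioned spatial priors for object placement in natural scenes from text-conditioned diffusion models; it builds the large-scale HiddenObjects dataset (27M annotations) and distills a lightweight, efficient model.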

Details Motivation: Existing methods rely either on manual annotation, which is limited in scale, or on inpainting-based object-removal pipelines, which encourage shortcut learning; a scalable and robust way to learn spatial priors is missing. Method: A fully automated, scalable framework uses a diffusion model to evaluate dense object placements on high-quality backgrounds and to build the HiddenObjects dataset; the implicit priors are then distilled into a lightweight explicit model. Result: The learned spatial priors clearly outperform sparse human annotations on an image editing task (VLM-Judge 3.90 vs. 2.68) as well as existing baselines; the distilled model runs inference 230,000x faster. Conclusion: The method learns high-quality spatial priors at scale without manual annotation and improves both performance and efficiency. Abstract: We propose a method to learn explicit, class-conditioned spatial priors for object placement in natural scenes by distilling the implicit placement knowledge encoded in text-conditioned diffusion models. Prior work relies either on manually annotated data, which is inherently limited in scale, or on inpainting-based object-removal pipelines, whose artifacts promote shortcut learning. To address these limitations, we introduce a fully automated and scalable framework that evaluates dense object placements on high-quality real backgrounds using a diffusion-based inpainting pipeline. With this pipeline, we construct HiddenObjects, a large-scale dataset comprising 27M placement annotations, evaluated across 27k distinct scenes, with ranked bounding box insertions for different images and object categories. Experimental results show that our spatial priors outperform sparse human annotations on a downstream image editing task (3.90 vs. 2.68 VLM-Judge), and significantly surpass existing placement baselines and zero-shot Vision-Language Models for object placement. Furthermore, we distill these priors into a lightweight model for fast practical inference (230,000x faster).

[295] Retrieving to Recover: Towards Incomplete Audio-Visual Question Answering via Semantic-consistent Purification

Jiayu Zhang,Shuo Ye,Qilang Ye,Zihan Song,Jiajian Huang,Zitong Yu

Main category: cs.CV

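TL;DR: This paper proposes the R²ScP framework, which replaces conventional generative imputation with retrieval-based recovery and uses cross-modal retrieval plus a context-aware adaptive purification mechanism to improve the robustness and reasoning accuracy of Audio-Visual Question Answering (AVQA) under missing modalities.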

Details Motivation: Existing AVQA methods degrade severely when modalities are missing in real-world scenarios (for example, data interruptions); mainstream generative imputation struggles to capture modality-specific knowledge and tends to cause hallucinations and inaccurate reasoning. Method: The R²ScP framework (1) retrieves the domain-specific knowledge of the missing modality through cross-modal retrieval over unified semantic embeddings; (2) introduces a context-aware adaptive purification mechanism to remove semantic noise from the retrieved data; and (3) uses a two-stage training strategy to explicitly model the semantic relationships between knowledge from different sources. Result: Performance improves significantly on multiple AVQA benchmarks, with stronger robustness and reasoning accuracy in modality-incomplete scenarios in particular. Conclusion: Retrieval-based recovery captures modality-specific knowledge more effectively than generative imputation, and R²ScP offers a new approach to multimodal understanding under missing modalities. Abstract: Recent Audio-Visual Question Answering (AVQA) methods have advanced significantly. However, most AVQA methods lack effective mechanisms for handling missing modalities, suffering from severe performance degradation in real-world scenarios with data interruptions. Furthermore, prevailing methods for handling missing modalities predominantly rely on generative imputation to synthesize missing features. While partially effective, these methods tend to capture inter-modal commonalities but struggle to acquire unique, modality-specific knowledge within the missing data, leading to hallucinations and compromised reasoning accuracy. To tackle these challenges, we propose R$^{2}$ScP, a novel framework that shifts the paradigm of missing modality handling from traditional generative imputation to retrieval-based recovery. Specifically, we leverage cross-modal retrieval via unified semantic embeddings to acquire missing domain-specific knowledge. To maximize semantic restoration, we introduce a context-aware adaptive purification mechanism that eliminates latent semantic noise within the retrieved data. Additionally, we employ a two-stage training strategy to explicitly model the semantic relationships between knowledge from different sources. Extensive experiments demonstrate that R$^{2}$ScP significantly improves AVQA and enhances robustness in modal-incomplete scenarios.

[296] Architecture-Agnostic Modality-Isolated Gated Fusion for Robust Multi-Modal Prostate MRI Segmentation

Yongbo Shu,Wenzhao Xie,Shanhu Yao,Zirui Xin,Luo Lei,Kewen Chen,Aijing Luo

Main category: cs.CV

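TL;DR: This paper proposes the Modality-Isolated Gated Fusion (MIGF) module to make multi-parametric prostate MRI segmentation robust to missing or degraded modalities; it achieves compensation through modality-isolated encoding and modality dropout training, and is validated on the PI-CAI dataset.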

Details Motivation: Existing multi-modal MRI fusion methods assume complete inputs and mix modality information at early layers, so they cope poorly with the missing or degraded single modalities that are common in clinical practice; a more robust fusion strategy is needed. Method: The MIGF module keeps a separate encoding stream per modality, adds a learned gating stage, and is trained with modality dropout (ModDrop) to force the model to compensate under incomplete inputs; it is evaluated on six backbones using the PI-CAI dataset (1,500 studies, fold-0) under seven missing-modality and artifact scenarios. Result: MIGF improves the Ranking Score of UNet, nnUNet, and Mamba by 2.8%, 4.6%, and 13.4%, respectively; the best model, MIGFNet-nnUNet, reaches 0.7304 +/- 0.056; mechanistic analysis shows that robustness comes from modality isolation and dropout-driven compensation rather than from dynamic quality routing. Conclusion: Isolating modalities structurally first and then training for compensation is a better-suited design principle for robust multi-modal medical image segmentation; it simplifies the architecture and improves adaptation to incomplete clinical data. Abstract: Multi-parametric prostate MRI -- combining T2-weighted, apparent diffusion coefficient, and high b-value diffusion-weighted sequences -- is central to non-invasive detection of clinically significant prostate cancer, yet in routine practice individual sequences may be missing or degraded by motion, artifacts, or abbreviated protocols. Existing multi-modal fusion strategies typically assume complete inputs and entangle modality-specific information at early layers, offering limited resilience when one channel is corrupted or absent. We propose Modality-Isolated Gated Fusion (MIGF), an architecture-agnostic module that maintains separate modality-specific encoding streams before a learned gating stage, combined with modality dropout training to enforce compensation behavior under incomplete inputs. We benchmark six bare backbones and assess MIGF-equipped models under seven missing-modality and artifact scenarios on the PI-CAI dataset (1,500 studies, fold-0 split, five random seeds). Among bare backbones, nnUNet provided the strongest balance of performance and stability. MIGF improved ideal-scenario Ranking Score for UNet, nnUNet, and Mamba by 2.8%, 4.6%, and 13.4%, respectively; the best model, MIGFNet-nnUNet (gating + ModDrop, no deep supervision), achieved 0.7304 +/- 0.056. Mechanistic analysis reveals that robustness gains arise from strict modality isolation and dropout-driven compensation rather than adaptive per-sample quality routing: the gate converged to a stable modality prior, and deep supervision was beneficial only for the largest backbone while degrading lighter models. These findings support a simpler design principle for robust multi-modal segmentation: structurally contain corrupted inputs first, then train explicitly for incomplete-input compensation.
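To make the "isolate, gate, then train with modality dropout" idea concrete, here is a minimal sketch in plain Python. It is not the MIGF implementation: the function names, the softmax gate over per-modality logits, the renormalization over available modalities, and the rule of always keeping at least one modality are assumptions chosen for illustration, and the sketch assumes at least one modality is present.

```python
import math
import random


def modality_dropout(present, p_drop, rng):
    """Randomly hide available modalities during training (hypothetical rule)."""
    kept = [bool(m) and rng.random() >= p_drop for m in present]
    if not any(kept):
        # Assumption: never drop everything; restore one available modality.
        available = [i for i, m in enumerate(present) if m]
        kept[rng.choice(available)] = True
    return kept


def gated_fusion(features, gate_logits, present):
    """Fuse per-modality feature vectors with a masked softmax gate.

    features: one feature vector per modality, each from its own encoder.
    gate_logits: one learned scalar per modality (a fixed modality prior here).
    present: booleans marking which modalities are usable for this sample.
    """
    weights = [math.exp(g) if p else 0.0 for g, p in zip(gate_logits, present)]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(features[0])
    fused = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return fused, weights


rng = random.Random(0)
kept = modality_dropout([True, True, True], p_drop=0.3, rng=rng)
fused, weights = gated_fusion([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0], kept)
```

Because each modality has its own feature vector until the gate, a corrupted or missing stream is excluded by its mask instead of contaminating shared early layers.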

[297] Investigating Bias and Fairness in Appearance-based Gaze Estimation

Burak Akgül,Erol Şahin,Sinan Kalkan

Main category: cs.CV

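TL;DR: This paper presents the first systematic evaluation of the fairness of appearance-based gaze estimation models across ethnicity and gender groups; it finds significant performance disparities in existing models and limited benefit from existing debiasing strategies, calls for more robust and fair gaze estimators, and releases annotations, code, and models.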

Details Motivation: Although appearance-based gaze estimation has improved in accuracy and domain adaptation, its fairness across demographic groups has not been studied sufficiently, and no comprehensive benchmark quantifies algorithmic bias. Method: The paper establishes the first fairness evaluation benchmark for gaze estimation, analyzes state-of-the-art models on ethnicity and gender attributes using standard fairness metrics, and tests how effective existing bias mitigation strategies are in this domain. Result: Existing models show significant performance gaps across ethnic and gender groups; existing debiasing methods have limited effect on gaze estimation. Conclusion: Gaze estimation systems have clear fairness problems, and more robust, equitable models are needed; the authors release annotations, code, and models to support follow-up research. Abstract: While appearance-based gaze estimation has achieved significant improvements in accuracy and domain adaptation, the fairness of these systems across different demographic groups remains largely unexplored. To date, there is no comprehensive benchmark quantifying algorithmic bias in gaze estimation. This paper presents the first extensive evaluation of fairness in appearance-based gaze estimation, focusing on ethnicity and gender attributes. We establish a fairness baseline by analyzing state-of-the-art models using standard fairness metrics, revealing significant performance disparities. Furthermore, we evaluate the effectiveness of existing bias mitigation strategies when applied to the gaze domain and show that their fairness contributions are limited. We summarize key insights and open issues. Overall, our work calls for research into developing robust, equitable gaze estimators. To support future research and reproducibility, we publicly release our annotations, code, and trained models at: github.com/akgulburak/gaze-estimation-fairness

[298] Defending against Patch-Based and Texture-Based Adversarial Attacks with Spectral Decomposition

Wei Zhang,Xinyu Chang,Xiao Li,Yiming Zhu,Xiaolin Hu

Main category: cs.CV

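TL;DR: This paper proposes ASD, an adversarial defense based on spectral decomposition with the Discrete Wavelet Transform (DWT); combined with adversarial training (AT), it defends effectively against adaptive patch-based and texture-based adversarial attacks in the physical world.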

Details Motivation: Existing defenses perform poorly under adaptive attacks, while patch-based and texture-based physical adversarial attacks pose real threats to critical applications such as surveillance and autonomous driving. Method: Adversarial Spectrum Defense (ASD) uses the DWT for multi-scale spectral decomposition to capture both high-frequency (fine-grained) and low-frequency (spatially pervasive) adversarial perturbations, and is integrated with an off-the-shelf adversarially trained model. Result: ASD+AT achieves SOTA performance under strong adaptive attacks, improving average precision (AP) over previous methods by 21.73%. Conclusion: Multi-resolution spectral analysis makes ASD markedly more robust to physical-world adversarial attacks, especially in the adaptive attack setting. Abstract: Adversarial examples present significant challenges to the security of Deep Neural Network (DNN) applications. Specifically, there are patch-based and texture-based attacks that are usually used to craft physical-world adversarial examples, posing real threats to security-critical applications such as person detection in surveillance and autonomous systems, because those attacks are physically realizable. Existing defense mechanisms face challenges in the adaptive attack setting, i.e., the attacks are specifically designed against them. In this paper, we propose Adversarial Spectrum Defense (ASD), a defense mechanism that leverages spectral decomposition via Discrete Wavelet Transform (DWT) to analyze adversarial patterns across multiple frequency scales. The multi-resolution and localization capability of DWT enables ASD to capture both high-frequency (fine-grained) and low-frequency (spatially pervasive) perturbations. By integrating this spectral analysis with the off-the-shelf Adversarial Training (AT) model, ASD provides a comprehensive defense strategy against both patch-based and texture-based adversarial attacks. Extensive experiments demonstrate that ASD+AT achieved state-of-the-art (SOTA) performance against various attacks, outperforming the APs of previous defense methods by 21.73%, in the face of strong adaptive adversaries specifically designed against ASD. Code available at https://github.com/weiz0823/adv-spectral-defense .
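The multi-resolution property the abstract relies on can be illustrated with a small Haar wavelet decomposition. This sketch is an assumption-laden stand-in: the abstract does not say which wavelet ASD uses, ASD operates on images rather than 1D signals, and the function names are hypothetical. It only shows how a DWT separates coarse (low-frequency) content from localized (high-frequency) detail at each scale.

```python
import math

_S = 1.0 / math.sqrt(2.0)


def haar_step(signal):
    """One Haar DWT level: pairwise averages (approx) and differences (detail)."""
    approx = [(signal[i] + signal[i + 1]) * _S for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * _S for i in range(0, len(signal), 2)]
    return approx, detail


def haar_decompose(signal, levels):
    """Multi-level decomposition; the length must be divisible by 2**levels."""
    approx, details = list(signal), []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)  # details[0] holds the finest scale
    return approx, details


def haar_reconstruct(approx, details):
    """Invert haar_decompose, going from the coarsest scale to the finest."""
    current = list(approx)
    for detail in reversed(details):
        restored = []
        for a, d in zip(current, detail):
            restored.append((a + d) * _S)
            restored.append((a - d) * _S)
        current = restored
    return current


signal = [4.0, 4.0, 2.0, 2.0, 5.0, 5.0, 5.0, 9.0]
approx, details = haar_decompose(signal, levels=2)
```

In this example only the last pair of samples differs, so the finest-scale detail coefficients are zero everywhere except at that location, which is the localization behaviour a patch-like perturbation would trigger.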

[299] Turning Generators into Retrievers: Unlocking MLLMs for Natural Language-Guided Geo-Localization

Yuqi Chen,Xiaohan Zhang,Ahmad Arrabi,Waqas Sultani,Chen Chen,Safwan Wshah

Main category: cs.CV

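TL;DR: This paper proposes a framework that adapts multimodal large language models (MLLMs) to Natural-language Guided Cross-view Geo-localization (NGCG) through parameter-efficient finetuning, improving cross-modal alignment without changing the architecture and reaching SOTA performance on multiple benchmarks.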

Details Motivation: Existing NGCG methods rely on CLIP-style dual encoders, which suffer from weak cross-modal generalization and complex designs; MLLMs offer strong semantic reasoning but are not optimized for retrieval. Method: A parameter-efficient finetuning strategy optimizes only the latent representations inside the MLLM while preserving its pretrained multimodal knowledge, achieving cross-modal alignment without redesigning the architecture. Result: Text-to-Image Recall@1 improves by 12.2% on GeoText-1652, the method ranks first on 5 of 12 CVG-Text subtasks, and it uses far fewer trainable parameters than the baselines. Conclusion: MLLMs can serve as a robust foundation for semantic cross-view retrieval and offer a scalable, high-performing alternative to traditional dual-encoder designs for NGCG. Abstract: Natural-language Guided Cross-view Geo-localization (NGCG) aims to retrieve geo-tagged satellite imagery using textual descriptions of ground scenes. While recent NGCG methods commonly rely on CLIP-style dual-encoder architectures, they often suffer from weak cross-modal generalization and require complex architectural designs. In contrast, Multimodal Large Language Models (MLLMs) offer powerful semantic reasoning capabilities but are not directly optimized for retrieval tasks. In this work, we present a simple yet effective framework to adapt MLLMs for NGCG via parameter-efficient finetuning. Our approach optimizes latent representations within the MLLM while preserving its pretrained multimodal knowledge, enabling strong cross-modal alignment without redesigning model architectures. Through systematic analysis of diverse variables, from model backbone to feature aggregation, we provide practical and generalizable insights for leveraging MLLMs in NGCG. Our method achieves SOTA on GeoText-1652 with a 12.2% improvement in Text-to-Image Recall@1 and secures top performance in 5 out of 12 subtasks on CVG-Text, all while surpassing baselines with far fewer trainable parameters. These results position MLLMs as a robust foundation for semantic cross-view retrieval and pave the way for MLLM-based NGCG to be adopted as a scalable, powerful alternative to traditional dual-encoder designs. Project page and code are available at https://yuqichen888.github.io/NGCG-MLLMs-web/.
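For readers unfamiliar with the headline metric, the sketch below shows how Text-to-Image Recall@K is typically computed from embeddings. It is a generic illustration under stated assumptions: cosine similarity as the ranking score, exactly one ground-truth image per text query, non-zero embedding vectors, and hypothetical function names. The paper's own evaluation code may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(text_embs, image_embs, gt_index, k=1):
    """Fraction of text queries whose ground-truth image is in the top-k results."""
    hits = 0
    for query, gt in zip(text_embs, gt_index):
        # Rank all gallery images by similarity to the text query.
        ranked = sorted(
            range(len(image_embs)),
            key=lambda j: cosine(query, image_embs[j]),
            reverse=True,
        )
        hits += gt in ranked[:k]
    return hits / len(text_embs)


texts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
images = [[0.9, 0.1], [0.1, 0.9], [1.0, -1.0]]
r1 = recall_at_k(texts, images, gt_index=[0, 1, 2], k=1)
```

In this toy gallery the third query ranks its ground-truth image last, so it counts as a miss at K=1 and a hit only once K covers the whole gallery.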

[300] MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

Junzhi Ning,Jiashi Lin,Yingying Fang,Wei Li,Jiyao Liu,Cheng Tang,Chenglong Ma,Wenhao Tang,Tianbin Li,Ziyan Huang,Guang Yang,Junjun He

Main category: cs.CV

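TL;DR: This paper presents MMRareBench, the first benchmark for evaluating multimodal, multi-image clinical capability on rare diseases, and systematically evaluates 23 MLLMs; performance on treatment planning and multi-image evidence integration is generally poor, revealing a "capacity dilution effect" in which medical fine-tuning may erode multi-image compositional capability.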

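Details Motivation: Existing benchmarks mainly target common conditions and single-image settings and cannot assess how well models integrate multimodal, multi-image evidence for rare diseases; rare-disease clinical decisions depend heavily on case-level evidence, so a dedicated benchmark is needed. Method: The authors build MMRareBench, the first rare-disease multimodal multi-image benchmark, with 1,756 question-answer pairs and 7,958 images covering four clinical tracks (diagnosis, treatment planning, cross-image evidence alignment, and examination suggestion); it uses Orphanet-anchored ontology alignment, leakage control, evidence-grounded annotations, and a two-level evaluation protocol; 23 MLLMs are evaluated systematically and their capability profiles analyzed. Result: All 23 MLLMs perform poorly on treatment planning, and medical-domain models trail general-purpose MLLMs substantially on multi-image tracks despite comparable diagnostic ability; this is consistent with a "capacity dilution effect" hypothesis: medical fine-tuning can narrow the diagnostic gap but weakens the multi-image compositional reasoning that rare diseases require. Conclusion: MMRareBench fills the gap in rare-disease multimodal evaluation, reveals systematic weaknesses of current MLLMs in key clinical steps such as treatment planning and multi-image integration, and suggests that future work should jointly optimize domain knowledge injection and multi-evidence synthesis. Abstract: Multimodal large language models (MLLMs) have advanced clinical tasks for common conditions, but their performance on rare diseases remains largely untested. In rare-disease scenarios, clinicians often lack prior clinical knowledge, forcing them to rely strictly on case-level evidence for clinical judgments. Existing benchmarks predominantly evaluate common-condition, single-image settings, leaving multimodal and multi-image evidence integration under rare-disease data scarcity systematically unevaluated. We introduce MMRareBench, to our knowledge the first rare-disease benchmark jointly evaluating multimodal and multi-image clinical capability across four workflow-aligned tracks: diagnosis, treatment planning, cross-image evidence alignment, and examination suggestion. The benchmark comprises 1,756 question-answer pairs with 7,958 associated medical images curated from PMC case reports, with Orphanet-anchored ontology alignment, track-specific leakage control, evidence-grounded annotations, and a two-level evaluation protocol. A systematic evaluation of 23 MLLMs reveals fragmented capability profiles and universally low treatment-planning performance, with medical-domain models trailing general-purpose MLLMs substantially on multi-image tracks despite competitive diagnostic scores. These patterns are consistent with a capacity dilution effect: medical fine-tuning can narrow the diagnostic gap but may erode the compositional multi-image capability that rare-disease evidence integration demands.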

[301] Lung Cancer Detection Using Deep Learning

Imama Ajmi,Abhishek Das

Main category: cs.CV

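TL;DR: This paper examines lung cancer detection with the deep learning models InceptionV3, MobileNetV2, VGG16, and ResNet152 and with a custom 16-layer CNN, evaluating accuracy, precision, recall, and F1-score, and emphasizes that the proposed model's accuracy rises steadily during training while overfitting is mitigated.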

Details Motivation: Lung cancer has a low survival rate (around 20%), largely because symptoms appear only at a late stage, so early and accurate detection is needed; in addition, 10-15% of cases occur in non-smokers, which underlines the need for a general, efficient detection method. Method: Four mainstream deep learning models (InceptionV3, MobileNetV2, VGG16, ResNet152) are compared, and a 16-layer CNN architecture is proposed that combines convolutional, pooling, flatten, dropout, fully connected, and dense layers and is trained for up to 30 epochs. Result: The proposed model's accuracy rises steadily with the number of epochs and it suppresses overfitting well; all models are evaluated with accuracy, precision, recall, and F1-score. Conclusion: The proposed 16-layer CNN shows promise for lung cancer detection, particularly in avoiding overfitting and steadily improving accuracy, and offers a promising approach to early diagnosis. Abstract: Lung cancer, the second leading cause of cancer-related deaths, is primarily linked to long-term tobacco smoking (85% of cases). Surprisingly, 10-15% of cases occur in non-smokers. In 2020, approximately 2 million people were affected globally, resulting in 1.5 million deaths. The survival rate, at around 20%, lags behind other cancers, partly due to late-stage symptom manifestation. This necessitates early and accurate detection for effective treatment. Performance metrics such as accuracy, precision, recall (sensitivity), and F1-score are computed to provide a comprehensive evaluation of each model's capabilities. By comparing these metrics, this study offers insights into the strengths and limitations of each approach, contributing to the advancement of lung cancer detection techniques. In this paper, we discuss methodologies for lung cancer detection using different deep learning algorithms: InceptionV3, MobileNetV2, VGG16, and ResNet152 are explored for their efficacy in classifying lung cancer cases. Our proposed model is a 16-layer architecture based on a CNN. The proposed model exhibits several key highlights that contribute to its novelty. By integrating multiple layer types such as convolutional, pooling, flatten, dropout, fully connected and dense layers, the model leverages the strengths of each layer to enhance its predictive capabilities. The novelty of our proposed model is that its accuracy increases consistently with the number of epochs. We have tested the model's performance up to epoch 30. Our proposed model also overcomes the overfitting problem.

[302] At FullTilt: Real-Time Open-Set 3D Macromolecule Detection Directly from Tilted 2D Projections

Ming-Yang Ho,Alberto Bartesaghi

Main category: cs.CV

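TL;DR: FullTilt is an end-to-end framework that performs 3D macromolecule detection directly on aligned 2D tilt-series, avoiding the sliding-window inference over reconstructed volumes used by conventional methods and substantially improving speed while reducing VRAM usage.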

Details Motivation: Existing open-set 3D macromolecule detection methods are limited by VRAM and must run inefficient sliding-window inference over reconstructed volumes, whereas a tilt-series contains fewer images with less redundancy and could be processed directly. Method: The FullTilt framework comprises (1) a tilt-series encoder for efficient cross-view feature fusion; (2) a multiclass visual prompt encoder for flexible prompting; (3) a tilt-aware query initializer that anchors 3D queries; and (4) an auxiliary geometric primitives module that improves multi-view geometric understanding and robustness to artifacts. Result: FullTilt achieves zero-shot SOTA performance on three real-world datasets while greatly increasing inference speed and substantially reducing VRAM requirements. Conclusion: FullTilt offers a new paradigm for rapid, large-scale visual proteomics analysis and moves open-set 3D detection closer to practical use. Abstract: Open-set 3D macromolecule detection in cryogenic electron tomography eliminates the need for target-specific model retraining. However, strict VRAM constraints prohibit processing an entire 3D tomogram, forcing current methods to rely on slow sliding-window inference over extracted subvolumes. To overcome this, we propose FullTilt, an end-to-end framework that redefines 3D detection by operating directly on aligned 2D tilt-series. Because a tilt-series contains significantly fewer images than slices in a reconstructed tomogram, FullTilt eliminates redundant volumetric computation, accelerating inference by orders of magnitude. To process the entire tilt-series simultaneously, we introduce a tilt-series encoder to efficiently fuse cross-view information. We further propose a multiclass visual prompt encoder for flexible prompting, a tilt-aware query initializer to effectively anchor 3D queries, and an auxiliary geometric primitives module to enhance the model's understanding of multi-view geometry while improving robustness to adverse imaging artifacts. Extensive evaluations on three real-world datasets demonstrate that FullTilt achieves state-of-the-art zero-shot performance while drastically reducing runtime and VRAM requirements, paving the way for rapid, large-scale visual proteomics analysis. All code and data will be publicly available upon publication.

[303] HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-Language Models

Haiyan Jiang,Deyu Zhang,Dongdong Weng,Weitao Song,Henry Been-Lirn Duh

Main category: cs.CV

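TL;DR: HOG-Layout is a text-driven method for hierarchical 3D scene generation and real-time editing built on large language models (LLMs) and vision-language models (VLMs); it uses RAG to improve semantic consistency, adds an optimization module for physical plausibility, and relies on a hierarchical representation for efficient inference and editing.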

Details Motivation: Manual creation of 3D layouts is time-consuming and data-driven methods lack diversity; large models can improve 3D scene synthesis. Method: The HOG-Layout framework combines LLMs and VLMs for text-driven hierarchical scene generation, uses retrieval-augmented generation (RAG) to improve semantic consistency, adds a physical-consistency optimization module, and uses a hierarchical representation to support real-time editing. Result: Experiments show that HOG-Layout produces more reasonable environments than existing baselines and supports fast, intuitive real-time scene editing. Conclusion: HOG-Layout improves the semantic and physical consistency and the interactive responsiveness of 3D scene generation, offering a new approach for Embodied AI and VR interaction. Abstract: 3D layout generation and editing play a crucial role in Embodied AI and immersive VR interaction. However, manual creation requires tedious labor, while data-driven generation often lacks diversity. The emergence of large models introduces new possibilities for 3D scene synthesis. We present HOG-Layout, which enables text-driven hierarchical scene generation, optimization and real-time scene editing with large language models (LLMs) and vision-language models (VLMs). HOG-Layout improves scene semantic consistency and plausibility through retrieval-augmented generation (RAG) technology, incorporates an optimization module to enhance physical consistency, and adopts a hierarchical representation to enhance inference and optimization, achieving real-time editing. Experimental results demonstrate that HOG-Layout produces more reasonable environments compared with existing baselines, while supporting fast and intuitive scene editing.

[304] Uncertainty-quantified Pulse Signal Recovery from Facial Video using Regularized Stochastic Interpolants

Vineet R. Shenoy,Cheng Peng,Rama Chellappa,Yu Sun

Main category: cs.CV

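TL;DR: This paper proposes the RIS-iPPG paradigm, which models iPPG recovery as an inverse problem with stochasticity and regularization and samples the posterior distribution at test time to provide both a BVP waveform reconstruction and an uncertainty estimate.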

Details Motivation: Existing iPPG algorithms cannot sample the solution space at test time and therefore cannot provide the uncertainty analysis that clinical applications require. Method: iPPG recovery is modeled as an inverse problem; time-evolving probability paths are built by predicting instantaneous flow and score vectors; at test time the BVP posterior is sampled by solving a stochastic differential equation; a regularizer maximizes the correlation between the residual flow predictions of adjacent time windows. Result: On three datasets, RIS-iPPG significantly improves BVP waveform reconstruction quality and the accuracy of uncertainty estimates. Conclusion: RIS-iPPG gives iPPG a framework that combines high accuracy with reliable uncertainty estimates, supporting practical deployment in clinical and consumer settings. Abstract: Imaging Photoplethysmography (iPPG), an optical procedure which recovers a human's blood volume pulse (BVP) waveform using pixel readout from a camera, is an exciting research field with many researchers performing clinical studies of iPPG algorithms. While current algorithms to solve the iPPG task have shown outstanding performance on benchmark datasets, no state-of-the-art algorithm, to the best of our knowledge, performs test-time sampling of the solution space, precluding an uncertainty analysis that is critical for clinical applications. We address this deficiency through a new paradigm named Regularized Interpolants with Stochasticity for iPPG (RIS-iPPG). Modeling iPPG recovery as an inverse problem, we build probability paths that evolve the camera pixel distribution to the ground-truth signal distribution by predicting the instantaneous flow and score vectors of a time-dependent stochastic process; and at test-time, we sample the posterior distribution of the correct BVP waveform given the camera pixel intensity measurements by solving a stochastic differential equation. Given that physiological changes are slowly varying, we show that iPPG recovery can be improved through regularization that maximizes the correlation between the residual flow vector predictions of two adjacent time windows. Experimental results on three datasets show that RIS-iPPG provides superior reconstruction quality and uncertainty estimates of the reconstruction, a critical tool for the widespread adoption of iPPG algorithms in clinical and consumer settings.

[305] LIDARLearn: A Unified Deep Learning Library for 3D Point Cloud Classification, Segmentation, and Self-Supervised Representation Learning

Said Ohamouddou,Hanaa El Afia,Abdellatif El Afia,Raddouane Chiheb

Main category: cs.CV

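TL;DR: This paper introduces LIDARLearn, a unified, extensible PyTorch library that integrates 55+ point cloud model configurations (covering supervised learning, self-supervised pre-training, and parameter-efficient fine-tuning) and provides standardized training, statistical significance testing, and automated evaluation and testing, with the aim of making fair comparison between point cloud deep learning methods practical.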

Details Motivation: Implementations of point cloud deep learning methods (supervised, self-supervised, parameter-efficient fine-tuning) are scattered across incompatible codebases with inconsistent data pipelines and evaluation protocols, which makes fair comparison difficult. Method: The authors design and implement LIDARLearn, a unified registry-based PyTorch library that integrates 29 supervised architectures, 7 SSL pre-training methods, and 5 PEFT strategies, and provides standardized training runners, stratified K-fold cross-validation, automated LaTeX/CSV table generation, Friedman/Nemenyi statistical tests with critical-difference diagrams, and 2,200+ automated end-to-end tests. Result: LIDARLearn integrates 55+ model configurations, supports four tasks (classification, semantic segmentation, part segmentation, few-shot learning), and offers complete evaluation, statistical analysis, and testing facilities that make developing and comparing point cloud models more efficient and rigorous. Conclusion: LIDARLearn provides an open-source, unified, reproducible, and statistically rigorous benchmarking platform for 3D point cloud analysis that should help standardize methods and evaluation in the field. Abstract: Three-dimensional (3D) point cloud analysis has become central to applications ranging from autonomous driving and robotics to forestry and ecological monitoring. Although numerous deep learning methods have been proposed for point cloud understanding, including supervised backbones, self-supervised pre-training (SSL), and parameter-efficient fine-tuning (PEFT), their implementations are scattered across incompatible codebases with differing data pipelines, evaluation protocols, and configuration formats, making fair comparisons difficult. We introduce LIDARLearn, a unified, extensible PyTorch library that integrates over 55 model configurations covering 29 supervised architectures, seven SSL pre-training methods, and five PEFT strategies, all within a single registry-based framework supporting classification, semantic segmentation, part segmentation, and few-shot learning. LIDARLearn provides standardised training runners, cross-validation with stratified $K$-fold splitting, automated LaTeX/CSV table generation, built-in Friedman/Nemenyi statistical testing with critical-difference diagrams for rigorous multi-model comparison, and a comprehensive test suite with 2,200+ automated tests validating every configuration end-to-end. The code is available at https://github.com/said-ohamouddou/LIDARLearn under the MIT licence.
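The Friedman test mentioned in the abstract ranks the models within each dataset (or fold) and checks whether the mean ranks differ more than chance would allow. The sketch below computes the standard Friedman chi-square statistic from a score table. It is a standalone illustration, not LIDARLearn's API: the function names are hypothetical, higher scores are assumed to be better, and no tie correction is applied to the statistic.

```python
def average_ranks(scores):
    """Rank one row of scores (rank 1 = best); tied scores share the mean rank."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of entries tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(score_table):
    """Friedman chi-square for N rows (datasets) by k columns (models)."""
    n = len(score_table)
    k = len(score_table[0])
    rank_rows = [average_ranks(row) for row in score_table]
    mean_ranks = [sum(r[j] for r in rank_rows) / n for j in range(k)]
    sum_sq = sum(r * r for r in mean_ranks)
    chi2 = 12.0 * n / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4.0)
    return chi2, mean_ranks


table = [[0.90, 0.80, 0.70], [0.85, 0.80, 0.60], [0.95, 0.70, 0.65]]
chi2, mean_ranks = friedman_statistic(table)
```

If the statistic is significant, a post-hoc Nemenyi test compares pairs of mean ranks against a critical difference, which is what a critical-difference diagram visualizes.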

[306] ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment

Mingyu Dong,Chong Xia,Mingyuan Jia,Weichen Lyu,Long Xu,Zheng Zhu,Yueqi Duan

Main category: cs.CV

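TL;DR: This paper proposes the ReplicateAnyScene framework, which generates structured 3D scenes from casually captured videos in a fully automated, zero-shot manner by integrating multimodal priors through a five-stage cascade, and introduces the C3DR benchmark for comprehensive evaluation.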

Details Motivation: Existing methods integrate cross-modal information insufficiently, depend on manual prompting and auxiliary visual inputs, and are constrained by training biases, so they struggle with complex real-world scenes. Method: The ReplicateAnyScene framework uses a five-stage cascade to extract generic priors from vision foundation models and structurally align them across textual, visual, and spatial dimensions, producing structured 3D representations that are semantically coherent and physically plausible. Result: Experiments on the newly proposed C3DR benchmark show that the method clearly outperforms existing baselines in generating high-quality compositional 3D scenes. Conclusion: ReplicateAnyScene achieves zero-shot, end-to-end video-to-structured-3D reconstruction without manual intervention, advancing Spatial Intelligence and Embodied AI. Abstract: Humans exhibit an innate capacity to rapidly perceive and segment objects from video observations, and even mentally assemble them into structured 3D scenes. Replicating such capability, termed compositional 3D reconstruction, is pivotal for the advancement of Spatial Intelligence and Embodied AI. However, existing methods struggle to achieve practical deployment due to the insufficient integration of cross-modal information, leaving them dependent on manual object prompting, reliant on auxiliary visual inputs, and restricted to overly simplistic scenes by training biases. To address these limitations, we propose ReplicateAnyScene, a framework capable of fully automated and zero-shot transformation of casually captured videos into compositional 3D scenes. Specifically, our pipeline incorporates a five-stage cascade to extract and structurally align generic priors from vision foundation models across textual, visual, and spatial dimensions, grounding them into structured 3D representations and ensuring semantic coherence and physical plausibility of the constructed scenes. To facilitate a more comprehensive evaluation of this task, we further introduce the C3DR benchmark to assess reconstruction quality from diverse aspects. Extensive experiments demonstrate the superiority of our method over existing baselines in generating high-quality compositional 3D scenes.

[307] WBCBench 2026: A Challenge for Robust White Blood Cell Classification Under Class Imbalance

Xin Tian,Xudong Ma,Tianqi Yang,Alin Achim,Bartłomiej W Papież,Phandee Watanaboonyongcharoen,Nantheera Anantrasirichai

Main category: cs.CV

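TL;DR: WBCBench 2026 is an ISBI challenge and benchmark for automated white blood cell (WBC) classification that targets three difficulties: severe class imbalance, strict patient-level data splits, and synthetic domain shift.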

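Details Motivation: To build a more robust and clinically useful evaluation benchmark that addresses class imbalance, the risk of data leakage (for example, mixing images across patients), and the generalization problems caused by changing imaging conditions in real deployments. Method: The benchmark has two phases: Phase 1 provides high-quality training images; Phase 2 introduces synthetic degradations (noise, blur, illumination changes) with severity set separately for the training, validation, and test splits to emulate the shift from development to deployment. It uses patient-level splits, expert annotations, standardized staining, and single-site microscopic images, and defines a standardized submission format, an open-source evaluator, and macro-averaged F1 as the primary metric. Result: The ISBI 2026 challenge was organized successfully; a range of solutions was collected and evaluated systematically, exposing performance differences and limitations under the demanding setting. Conclusion: WBCBench 2026 provides a WBC classification evaluation framework that is closer to clinical reality and pushes algorithms toward robustness, strong generalization, and strict data ethics. Abstract: We present WBCBench 2026, an ISBI challenge and benchmark for automated WBC classification designed to stress-test algorithms under three key difficulties: (i) severe class imbalance across 13 morphologically fine-grained WBC classes, (ii) strict patient-level separation between training, validation and test sets, and (iii) synthetic scanner- and setting-induced domain shift via controlled noise, blur and illumination perturbations. All images are single-site microscopic blood smear acquisitions with standardised staining and expert hematopathologist annotations. This paper reviews the challenge and summarises the proposed solutions and final outcomes. The benchmark is organised into two phases. Phase 1 provides a pristine training set. Phase 2 introduces degraded images with split-specific severity distributions for train, validation and test, emulating a realistic shift between development and deployment conditions. We specify a standardised submission schema, open-source evaluator, and macro-averaged F1 score as the primary ranking metric.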
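Macro-averaged F1 is a natural ranking metric under class imbalance because every class contributes equally regardless of how many samples it has. The sketch below is a minimal, dependency-free version; it is not the challenge's official evaluator, the function name is hypothetical, and assigning F1 = 0 to a class with no true or predicted samples is an assumption.

```python
def macro_f1(y_true, y_pred, labels=None):
    """Unweighted mean of per-class F1 scores."""
    if labels is None:
        labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # Assumption: a class with no support and no predictions scores 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Imbalanced toy example: class 0 has four samples, class 1 has two.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
score = macro_f1(y_true, y_pred)
```

In this toy example plain accuracy is 4/6, but the rare class pulls the macro average down because its per-class F1 (0.5) counts as much as that of the frequent class (0.75).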

[308] Analytical Modeling and Correction of Distance Error in Homography-Based Ground-Plane Mapping

Mateusz Szulc,Marcin Iwanowski

Main category: cs.CV

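TL;DR: This paper studies the distance distortion that arises in monocular distance estimation from errors in planar homography initialization, derives an explicit relationship between homography perturbations and distance error (the error grows approximately quadratically with the true distance), and evaluates two correction strategies: regression-based error modeling and coordinate-based gradient descent optimization.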

Details Motivation: Monocular cameras in intelligent monitoring systems need accurate distance estimates, but the commonly used manual initialization of planar homographies introduces small errors that lead to systematic distance distortion. Method: The paper derives an explicit relationship between homography perturbations and distance error, then proposes and evaluates two correction strategies: (1) regression fitting of the quadratic error function; (2) direct optimization of the homography matrix by gradient descent on image coordinates. Result: A large-scale simulation (more than 19 million samples) shows that regression achieves higher peak accuracy when the model is fitted well, while gradient descent is more robust to poor initial calibration. Conclusion: In many practical systems, improving geometric calibration quality yields larger performance gains than increasing model complexity. Abstract: Accurate distance estimation from monocular cameras is essential for intelligent monitoring systems. In many deployments, image coordinates are mapped to ground positions using planar homographies initialized by manual selection of corresponding regions. Small inaccuracies in this initialization propagate into systematic distance distortions. This paper derives an explicit relationship between homography perturbations and the resulting distance error, showing that the error grows approximately quadratically with the true distance from the camera. Based on this model, two simple correction strategies are evaluated: regression-based estimation of the quadratic error function and direct optimization of the homography via coordinate-based gradient descent. A large-scale simulation study with more than 19 million test samples demonstrates that regression achieves higher peak accuracy when the model is reliably fitted, whereas gradient descent provides greater robustness against poor initial calibration. This suggests that improving geometric calibration may yield greater performance gains than increasing model complexity in many practical systems.
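The quadratic growth of the error can be seen in a deliberately simplified setting, which is an assumption of this sketch and not the paper's derivation: a pinhole camera at height h above a flat ground plane with a horizontal optical axis, where a ground point at distance d projects d = f*h / (v - v0) pixels below the horizon row v0. If the horizon row is misestimated by delta pixels (one particular homography perturbation), the distance error is approximately d^2 * delta / (f*h). All names and numeric values below are hypothetical.

```python
def ground_distance(v, v0, f, h):
    """Distance to a ground point seen at image row v (pinhole, level camera)."""
    return f * h / (v - v0)


def distance_error(d_true, delta, f, h, v0=0.0):
    """Error when the horizon row is assumed to be v0 + delta instead of v0."""
    v = v0 + f * h / d_true  # image row where the point really appears
    return ground_distance(v, v0 + delta, f, h) - d_true


def fit_quadratic_coefficient(distances, errors):
    """Least-squares fit of error = a * d**2 (regression through the origin)."""
    numerator = sum(e * d * d for d, e in zip(distances, errors))
    denominator = sum(d ** 4 for d in distances)
    return numerator / denominator


f, h, delta = 1000.0, 5.0, 0.5  # focal length (px), camera height (m), horizon offset (px)
distances = [5.0 * i for i in range(1, 11)]  # 5 m to 50 m
errors = [distance_error(d, delta, f, h) for d in distances]
a_fit = fit_quadratic_coefficient(distances, errors)
a_model = delta / (f * h)  # first-order prediction of the coefficient
```

Doubling the distance roughly quadruples the error in this model, and the fitted coefficient stays close to the first-order prediction delta / (f*h), which is the kind of one-parameter error function a regression-based correction could subtract.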

[309] Uncertainty-Guided Attention and Entropy-Weighted Loss for Precise Plant Seedling Segmentation

Mohamed Ehab,Ali Hamdi

Main category: cs.CV

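TL;DR: This paper proposes UGDA-Net, a plant seedling segmentation network that combines an uncertainty-guided dual attention mechanism, an entropy-weighted hybrid loss, and deep supervision; it improves the Dice coefficient over the baselines by 9.3% (U-Net) and 13.2% (LinkNet) and clearly improves leaf-boundary accuracy and recognition of fine structures.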

Details Motivation: Standard segmentation models perform poorly with complex backgrounds and the fine structures of plant leaves, so seedling segmentation accuracy must improve to support automated phenotyping. Method: UGDA-Net contains three new modules: (1) Uncertainty-Guided Dual Attention (UGDA) based on channel variance; (2) an entropy-weighted hybrid loss that focuses on high-uncertainty boundary pixels; (3) deep supervision on intermediate encoder layers; a systematic ablation study is run on U-Net and LinkNet. Result: Trained and validated on a dataset of 432 high-resolution seedling images, UGDA-Net raises the Dice coefficient of U-Net and LinkNet by 9.3% and 13.2%, respectively; qualitative results show fewer false positives at leaf boundaries, and the uncertainty heatmaps are consistent with the complexity of plant morphology. Conclusion: Uncertainty-guided attention and uncertainty-weighted loss are complementary mechanisms, and UGDA-Net improves high-precision segmentation of plant seedlings, fine structures in particular. Abstract: Plant seedling segmentation supports automated phenotyping in precision agriculture. Standard segmentation models face difficulties due to intricate backgrounds and fine structures in leaves. We introduce UGDA-Net (Uncertainty-Guided Dual Attention Network with Entropy-Weighted Loss and Deep Supervision). UGDA-Net has three novel components. The first is Uncertainty-Guided Dual Attention (UGDA), which uses channel variance to modulate feature maps. The second is an entropy-weighted hybrid loss function that focuses on high-uncertainty boundary pixels. The third applies deep supervision to intermediate encoder layers. We performed a systematic ablation study on two widely used architectures, U-Net and LinkNet, analyzing five incremental configurations: Baseline, Loss-only, Attention-only, Deep Supervision, and UGDA-Net. We trained UGDA-Net on a high-resolution plant seedling dataset containing 432 images. We demonstrate improved segmentation performance, with the Dice coefficient increasing by 9.3% over the U-Net baseline and by 13.2% over the LinkNet baseline. Qualitative overlays show fewer false positives at leaf boundaries, and the uncertainty heatmaps are consistent with the complex morphology. UGDA-Net aids the segmentation of delicate plant structures and provides a high-precision solution. The results show that uncertainty-guided attention and uncertainty-weighted loss are complementary mechanisms.
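One plausible reading of an "entropy-weighted hybrid loss" is a per-pixel cross-entropy whose weight grows with the predictive entropy, added to a Dice term. The sketch below implements that reading for binary segmentation on flat lists of pixel probabilities. The exact composition of the UGDA-Net loss is not given in the abstract, so the weighting rule `1 + lam * H(p) / ln 2`, the choice of BCE plus Dice, and all function names are assumptions.

```python
import math


def binary_entropy(p, eps=1e-7):
    """Entropy (in nats) of a Bernoulli prediction; maximal at p = 0.5."""
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def entropy_weighted_bce(probs, targets, lam=1.0, eps=1e-7):
    """Cross-entropy where uncertain pixels (often boundaries) weigh more."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        weight = 1.0 + lam * binary_entropy(p) / math.log(2.0)  # in [1, 1 + lam]
        total += weight * -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(probs)


def soft_dice(probs, targets, eps=1e-7):
    """Soft Dice coefficient: 2 * |A and B| / (|A| + |B|)."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    return (2.0 * intersection + eps) / (sum(probs) + sum(targets) + eps)


def hybrid_loss(probs, targets, lam=1.0):
    """Entropy-weighted BCE plus Dice loss (hypothetical combination)."""
    return entropy_weighted_bce(probs, targets, lam) + (1.0 - soft_dice(probs, targets))


loss = hybrid_loss([0.9, 0.6, 0.2, 0.1], [1.0, 1.0, 0.0, 0.0])
```

With `lam = 0` the first term reduces to ordinary binary cross-entropy, which makes the effect of the entropy weighting easy to isolate in an ablation.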

[310] HO-Flow: Generalizable Hand-Object Interaction Generation with Latent Flow Matching

Zerui Chen,Rolandos Alexandros Potamias,Shizhe Chen,Jiankang Deng,Cordelia Schmid,Stefanos Zafeiriou

Main category: cs.CV

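TL;DR: HO-Flow is a new framework that generates realistic hand-object interaction motion sequences from text and canonical 3D objects. It combines an interaction-aware variational autoencoder with a masked flow matching model and reaches state-of-the-art physical plausibility and motion diversity.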

Details Motivation: Existing methods struggle to learn expressive motion representations and to reason effectively over time, which limits the realism, temporal coherence, and physical plausibility of generated hand-object interactions (HOI). Method: Proposes the HO-Flow framework: (1) an interaction-aware variational autoencoder that jointly encodes hand and object motion sequences into a unified latent space, incorporating kinematic information; (2) a masked flow matching model that combines auto-regressive temporal reasoning with continuous latent generation; (3) prediction of object motion relative to the initial frame, which enables pre-training on large-scale synthetic data. Result: On the GRAB, OakInk, and DexYCB benchmarks, HO-Flow achieves state-of-the-art performance in both physical plausibility and motion diversity. Conclusion: By jointly modeling hand-object motion dynamics and strengthening temporal modeling, HO-Flow substantially improves the quality and generalization of text- and object-driven hand-object interaction motion generation. Abstract: Generating realistic 3D hand-object interactions (HOI) is a fundamental challenge in computer vision and robotics, requiring both temporal coherence and high-fidelity physical plausibility. Existing methods remain limited in their ability to learn expressive motion representations for generation and perform temporal reasoning. In this paper, we present HO-Flow, a framework for synthesizing realistic hand-object motion sequences from texts and canonical 3D objects. HO-Flow first employs an interaction-aware variational autoencoder to encode sequences of hand and object motions into a unified latent manifold by incorporating hand and object kinematics, enabling the representation to capture rich interaction dynamics. It then leverages a masked flow matching model that combines auto-regressive temporal reasoning with continuous latent generation, improving temporal coherence. To further enhance generalization, HO-Flow predicts object motions relative to the initial frame, enabling effective pre-training on large-scale synthetic data. Experiments on the GRAB, OakInk, and DexYCB benchmarks demonstrate that HO-Flow achieves state-of-the-art performance in both physical plausibility and motion diversity for interaction motion synthesis.

[311] Immune2V: Image Immunization Against Dual-Stream Image-to-Video Generation

Zeqian Long,Ozgur Kara,Haotian Xue,Yongxin Chen,James M. Rehg

Main category: cs.CV

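TL;DR: This paper proposes Immune2V, a framework that immunizes images against unauthorized image-to-video generation by enforcing temporally balanced latent divergence at the encoder level and aligning intermediate representations with a precomputed collapse-inducing trajectory.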

Details Motivation: Existing defenses for static images do not transfer directly to image-to-video (I2V) generation: video encoding dilutes the adversarial noise and text guidance overrides the disruptive effect, so conventional image-level adversarial immunization fails. Method: Analyzes why I2V models are robust to image-level adversarial immunization, then proposes the Immune2V framework: (1) temporally balanced latent divergence at the encoder level to prevent signal dilution; (2) alignment of intermediate generative representations with a collapse-inducing trajectory to counteract the text-guidance override. Result: Under the same imperceptibility budget, Immune2V degrades generation quality more strongly and more persistently than adapted image-level baselines. Conclusion: Immune2V provides the first systematic and effective adversarial immunization defense for I2V generation and shows that defenses for cross-modal generation must account for both temporal dynamics and conditional guidance. Abstract: Image-to-video (I2V) generation has the potential for societal harm because it enables the unauthorized animation of static images to create realistic deepfakes. While existing defenses effectively protect against static image manipulation, extending these to I2V generation remains underexplored and non-trivial. In this paper, we systematically analyze why modern I2V models are highly robust against naive image-level adversarial attacks (i.e., immunization). We observe that the video encoding process rapidly dilutes the adversarial noise across future frames, and the continuous text-conditioned guidance actively overrides the intended disruptive effect of the immunization. Building on these findings, we propose the Immune2V framework which enforces temporally balanced latent divergence at the encoder level to prevent signal dilution, and aligns intermediate generative representations with a precomputed collapse-inducing trajectory to counteract the text-guidance override. Extensive experiments demonstrate that Immune2V produces substantially stronger and more persistent degradation than adapted image-level baselines under the same imperceptibility budget.

[312] Retinal Cyst Detection from Optical Coherence Tomography Images

Abhishek Dharmaratnakar,Aadheeshwar Vijayakumar,Suchand Dayanand

Main category: cs.CV

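TL;DR: This paper proposes a patch-wise classification method based on a ResNet CNN for automatic segmentation of retinal cysts in optical coherence tomography (OCT) images. On the public cyst segmentation challenge dataset it outperforms prior methods (Dice coefficient above 70%) and is more robust on multi-vendor, noisy images.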

Details Motivation: Existing automatic retinal cyst segmentation methods have low accuracy (only 68%) and are sensitive to image quality (for example, the high-noise images produced by Topcon devices), so they fall short of the accurate quantification needed clinically. Method: Uses a ResNet convolutional neural network with a patch-wise classification strategy, trained on the cyst segmentation challenge dataset and validated on a test set from 4 vendors annotated by two graders. Result: Achieves a Dice coefficient above 70% on images from all vendors, outperforming the previous state of the art, and is more robust to noise and variation in image quality. Conclusion: The ResNet patch-classification method offers a more accurate and more robust solution for retinal cyst segmentation and can support clinical diagnosis and prognosis of related eye diseases such as diabetic macular edema. Abstract: Retinal cysts are formed by leakage and accumulation of fluid in the retina due to the incompetence of retinal vasculature. These cystic spaces have significance in several ocular diseases such as age-related macular degeneration, diabetic macular edema, etc. Optical coherence tomography is one of the predominant diagnostic techniques for imaging retinal pathologies. Segmentation and quantification of intraretinal cysts play a vital role in predicting visual acuity. In the literature, several methods have been proposed for automatic segmentation of intraretinal cysts. As cystoid macular edema is becoming a major problem, it must be quantified accurately and operated on; otherwise it may cause many problems later on. Although research is being carried out in this area, little progress has been made, and the accuracy achieved so far is 68%, which is very low. Also, the methods depend on the quality of the image and give very poor results for high-noise images such as those from Topcon devices. This work uses a ResNet CNN (Convolutional Neural Network) approach that performs segmentation by way of patchwise classification, training on the image set from the cyst segmentation challenge dataset and testing on the test set annotated by 2 different graders for all 4 vendors in the challenge. It also compares these methods using the first publicly available cyst segmentation challenge dataset. The methods were evaluated using quantitative measures to assess their robustness against the challenges of intraretinal cyst segmentation. The results are found to be better than the previous state-of-the-art approaches, giving a Dice coefficient of more than 70% on all vendors irrespective of their quality.
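
For readers unfamiliar with the metric, the Dice coefficient reported above can be computed as in the following minimal Python sketch. It works on flat binary masks; the function name `dice_coefficient` and the convention of returning 1.0 when both masks are empty are choices made for this illustration, not details from the paper.

```python
def dice_coefficient(pred, truth):
    """Dice = 2 * |A and B| / (|A| + |B|) for two flat binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Convention assumed here: two empty masks agree perfectly.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Predicted mask vs. grader mask over four pixels: one shared foreground pixel.
score = dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0])
```

A score of 0.7 therefore means that twice the overlap between prediction and annotation equals 70% of the combined foreground area.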

[313] LRD-Net: A Lightweight Real-Centered Detection Network for Cross-Domain Face Forgery Detection

Xuecen Zhang,Vipin Chaudhary

Main category: cs.CV

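TL;DR: This paper proposes LRD-Net, a lightweight, real-centered face forgery detection network. With a frequency-guided sequential architecture and a real-centered learning strategy, it substantially improves both cross-domain generalization and computational efficiency.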

Details Motivation: Existing detectors suffer from poor cross-domain generalization and high computational cost, which makes it hard to handle new forgery types or to deploy on resource-constrained devices. Method: Proposes LRD-Net: a lightweight Multi-Scale Wavelet Guidance Module generates attention signals that condition a MobileNetV3 spatial backbone; a real-centered learning strategy, with exponential moving average prototype updates and drift regularization, anchors representations on real faces. Result: Achieves state-of-the-art cross-domain detection accuracy on the DiFF benchmark with only 2.63M parameters (about 1/9 of conventional approaches), over 8x faster training, and nearly 10x faster inference. Conclusion: LRD-Net shows that robust cross-domain forgery detection and high computational efficiency can coexist, which suits practical settings such as real-time mobile authentication. Abstract: The rapid advancement of diffusion-based generative models has made face forgery detection a critical challenge in digital forensics. Current detection methods face two fundamental limitations: poor cross-domain generalization when encountering unseen forgery types, and substantial computational overhead that hinders deployment on resource-constrained devices. We propose LRD-Net (Lightweight Real-centered Detection Network), a novel framework that addresses both challenges simultaneously. Unlike existing dual-branch approaches that process spatial and frequency information independently, LRD-Net adopts a sequential frequency-guided architecture where a lightweight Multi-Scale Wavelet Guidance Module generates attention signals that condition a MobileNetV3-based spatial backbone. This design enables effective exploitation of frequency-domain cues while avoiding the redundancy of parallel feature extraction. Furthermore, LRD-Net employs a real-centered learning strategy with exponential moving average prototype updates and drift regularization, anchoring representations around authentic facial images rather than modeling diverse forgery patterns. Extensive experiments on the DiFF benchmark demonstrate that LRD-Net achieves state-of-the-art cross-domain detection accuracy, consistently outperforming existing methods. Critically, LRD-Net accomplishes this with only 2.63M parameters - approximately 9x fewer than conventional approaches - while achieving over 8x faster training and nearly 10x faster inference. 
These results demonstrate that robust cross-domain face forgery detection can be achieved without sacrificing computational efficiency, making LRD-Net suitable for real-time deployment in mobile authentication systems and resource-constrained environments.
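
The exponential moving average prototype update mentioned in the abstract can be sketched in a few lines of Python. The abstract gives no formulas, so everything here is an assumption for illustration: the names `ema_update` and `drift_penalty`, the momentum value, and in particular the choice of a squared distance to a fixed anchor as the drift term.

```python
def ema_update(prototype, batch_mean, momentum=0.99):
    """Move the real-face prototype slightly toward the current batch mean."""
    return [momentum * p + (1.0 - momentum) * b
            for p, b in zip(prototype, batch_mean)]


def drift_penalty(prototype, anchor):
    """Hypothetical drift term: squared distance of the prototype from an anchor."""
    return sum((p - a) ** 2 for p, a in zip(prototype, anchor))


# Toy 2-D embedding: the prototype moves 10% of the way to the batch mean.
updated = ema_update([0.0, 1.0], [1.0, 1.0], momentum=0.9)
penalty = drift_penalty([1.0, 2.0], [1.0, 0.0])
```

A high momentum keeps the prototype stable across noisy batches, which is the usual reason for preferring an EMA over recomputing the mean at every step.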

[314] Product Review Based on Optimized Facial Expression Detection

Vikrant Chaugule,Abhishek D,Aadheeshwar Vijayakumar,Pravin Bhaskar Ramteke,Shashidhar G. Koolagudi

Main category: cs.CV

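TL;DR: This paper proposes a method for assessing public acceptance of branded products by analyzing customers' facial expressions, extracting facial feature points with a modified Harris algorithm for fast and accurate expression recognition.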

Details Motivation: Use facial expression recognition to assess consumer acceptance of branded products in supermarkets and hypermarkets. Method: Extracts facial feature points with a modified Harris algorithm that lowers the time complexity of the original Harris algorithm, and compares the time complexity of different algorithms. Result: The proposed algorithm is significantly faster at corner detection while keeping the accuracy the application needs. Conclusion: The modified Harris algorithm suits real-time facial expression recognition, improving efficiency while preserving accuracy. Abstract: This paper proposes a method to review public acceptance of products based on their brand by analyzing the facial expression of the customer intending to buy the product from a supermarket or hypermarket. In such cases, facial expression recognition plays a significant role in product review. Here, facial expression detection is performed by extracting feature points using a modified Harris algorithm. The modified Harris algorithm reduces the time complexity of the existing Harris feature extraction algorithm. The time complexities of existing algorithms are compared with that of the proposed algorithm. The algorithm proved to be significantly faster and nearly accurate for the needed application by reducing the time complexity for corner points detection.

[315] EviRCOD: Evidence-Guided Probabilistic Decoding for Referring Camouflaged Object Detection

Ye Wang,Kai Huang,Sumin Shen,Chenyang Ma

Main category: cs.CV

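TL;DR: This paper proposes EviRCOD, a framework that uses a reference-guided deformable encoder, an uncertainty-aware evidential decoder, and a boundary-aware refinement module to improve semantic alignment, uncertainty modeling, and boundary preservation in referring camouflaged object detection.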

Details Motivation: Existing methods fall short in reference-target semantic alignment, explicit uncertainty modeling, and robust boundary preservation. Method: Proposes the EviRCOD framework with three core components: (1) a Reference-Guided Deformable Encoder (RGDE) that performs hierarchical reference-driven modulation and multi-scale deformable aggregation; (2) an Uncertainty-Aware Evidential Decoder (UAED) that uses Dirichlet evidence estimation to model uncertainty; (3) a Boundary-Aware Refinement Module (BARM) that enhances ambiguous boundaries using low-level edge cues and prediction confidence. Result: Achieves state-of-the-art detection performance on the Ref-COD benchmark and provides well-calibrated uncertainty estimates. Conclusion: EviRCOD improves the overall performance and reliability of referring camouflaged object detection, particularly in semantic alignment, uncertainty modeling, and recovery of boundary detail. Abstract: Referring Camouflaged Object Detection (Ref-COD) focuses on segmenting specific camouflaged targets in a query image using category-aligned references. Despite recent advances, existing methods struggle with reference-target semantic alignment, explicit uncertainty modeling, and robust boundary preservation. To address these issues, we propose EviRCOD, an integrated framework consisting of three core components: (1) a Reference-Guided Deformable Encoder (RGDE) that employs hierarchical reference-driven modulation and multi-scale deformable aggregation to inject semantic priors and align cross-scale representations; (2) an Uncertainty-Aware Evidential Decoder (UAED) that incorporates Dirichlet evidence estimation into hierarchical decoding to model uncertainty and propagate confidence across scales; and (3) a Boundary-Aware Refinement Module (BARM) that selectively enhances ambiguous boundaries by exploiting low-level edge cues and prediction confidence. Experiments on the Ref-COD benchmark demonstrate that EviRCOD achieves state-of-the-art detection performance while providing well-calibrated uncertainty estimates. Code is available at: https://github.com/blueecoffee/EviRCOD.
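
Dirichlet evidence estimation can be made concrete with a short Python sketch. It follows the subjective-logic formulation commonly used in evidential deep learning (alpha = evidence + 1, uncertainty = K / sum of alpha); whether the UAED decoder uses exactly this form is an assumption, and the function name `dirichlet_uncertainty` is hypothetical.

```python
def dirichlet_uncertainty(evidence):
    """Turn non-negative per-class evidence into beliefs and an uncertainty mass.

    Assumed formulation: alpha_k = e_k + 1, S = sum(alpha),
    belief_k = e_k / S, uncertainty = K / S, expected prob_k = alpha_k / S.
    """
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    belief = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    expected_prob = [a / strength for a in alpha]
    return belief, uncertainty, expected_prob


# A pixel with no evidence is maximally uncertain; strong evidence lowers it.
_, u_none, p_none = dirichlet_uncertainty([0.0, 0.0])
b_strong, u_strong, p_strong = dirichlet_uncertainty([8.0, 0.0])
```

In this formulation the beliefs and the uncertainty mass always sum to one, so a single forward pass yields both a prediction and a confidence estimate per pixel.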

[316] Evaluating the Impact of Medical Image Reconstruction on Downstream AI Fairness and Performance

Matteo Wohlrapp,Niklas Bubeck,Daniel Rueckert,William Lotter

Main category: cs.CV

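TL;DR: This paper introduces a scalable evaluation framework that jointly assesses the performance and fairness of AI image reconstruction and downstream diagnostic tasks (classification, segmentation). Conventional pixel-level metrics such as PSNR do not track changes in diagnostic accuracy; reconstruction can slightly amplify sex-related bias, though the overall effect is limited; and existing bias mitigation strategies show limited efficacy when applied to reconstruction.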

Details Motivation: AI image reconstruction models are mostly evaluated with pixel-level metrics such as PSNR, which overlooks their effect on downstream diagnostic performance and fairness; an end-to-end evaluation framework is needed. Method: Builds a joint reconstruction-diagnosis evaluation framework covering two tasks (classification, segmentation), three reconstruction approaches (U-Net, GAN, diffusion), and two modalities (X-ray, MRI); systematically analyzes how changes in reconstruction quality affect diagnostic accuracy and fairness (for example, sex-related bias); and adapts two bias mitigation strategies from the classification literature to reconstruction. Result: Conventional metrics such as PSNR track diagnostic accuracy poorly; diagnostic accuracy stays largely stable as noise increases; reconstruction can slightly amplify sex-related bias, but by much less than the bias already inherent in the diagnostic models; the two mitigation strategies tested show limited efficacy in reconstruction. Conclusion: Performance and fairness should be assessed jointly across the whole medical imaging workflow (reconstruction through diagnosis), especially as generative reconstruction models see growing clinical deployment. Abstract: AI-based image reconstruction models are increasingly deployed in clinical workflows to improve image quality from noisy data, such as low-dose X-rays or accelerated MRI scans. However, these models are typically evaluated using pixel-level metrics like PSNR, leaving their impact on downstream diagnostic performance and fairness unclear. We introduce a scalable evaluation framework that applies reconstruction and diagnostic AI models in tandem, which we apply to two tasks (classification, segmentation), three reconstruction approaches (U-Net, GAN, diffusion), and two data types (X-ray, MRI) to assess the potential downstream implications of reconstruction. We find that conventional reconstruction metrics poorly track task performance, where diagnostic accuracy remains largely stable even as reconstruction PSNR declines with increasing image noise. Fairness metrics exhibit greater variability, with reconstruction sometimes amplifying demographic biases, particularly regarding patient sex. However, the overall magnitude of this additional bias is modest compared to the inherent biases already present in diagnostic models. To explore potential bias mitigation, we adapt two strategies from classification literature to the reconstruction setting, but observe limited efficacy. Overall, our findings emphasize the importance of holistic performance and fairness assessments throughout the entire medical imaging workflow, especially as generative reconstruction models are increasingly deployed.
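
Since the argument above turns on PSNR being a purely pixel-level quantity, a minimal Python sketch of the standard definition may help. The function name `psnr` and the default 8-bit peak value of 255 are choices made for this illustration; the paper's own evaluation code is not shown in the abstract.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    if len(reference) != len(reconstruction):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - x) ** 2
              for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Every pixel off by exactly one grey level gives MSE = 1.
value = psnr([10, 20, 30], [11, 21, 31])
```

The score depends only on the mean squared pixel error, so two reconstructions with the same PSNR can differ in exactly the structures a downstream classifier relies on.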

[317] STGV: Spatio-Temporal Hash Encoding for Gaussian-based Video Representation

Jierun Lin,Jiacong Chen,Qingyu Mao,Shuai Liu,Xiandong Meng,Fanyang Meng,Yongsheng Liang

Main category: cs.CV

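TL;DR: This paper proposes STGV, a spatio-temporal hash encoding framework for high-fidelity video representation. It decouples spatial and temporal feature encoding to model the static background and dynamic motion separately, and uses a key-frame initialization strategy to improve the geometric consistency of the Gaussian primitives, improving video reconstruction quality (+0.98 PSNR) and downstream task performance.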

Details Motivation: Existing 2D Gaussian Splatting methods predict Gaussian deformations from content-agnostic or spatio-temporally overlapping feature embeddings, which entangles static and dynamic components, makes their distinct properties hard to model, and hurts deformation accuracy and representation quality. Method: Proposes the spatio-temporal hash encoding framework STGV: video features are decomposed into learnable 2D spatial and 3D temporal hash encodings that model static detail and dynamic motion respectively; a key-frame canonical initialization strategy builds a stable, consistent initial Gaussian representation and avoids feature overlap and geometric distortion. Result: PSNR improves by 0.98 on video representation, ahead of other Gaussian-based methods, with competitive performance on downstream video tasks. Conclusion: Through spatio-temporally decoupled encoding and structured initialization, STGV separates and models the static and dynamic components of video, improving the representation quality and generalization of Gaussian-splatting methods. Abstract: 2D Gaussian Splatting (2DGS) has recently become a promising paradigm for high-quality video representation. However, existing methods employ content-agnostic or spatio-temporal feature overlapping embeddings to predict canonical Gaussian primitive deformations, which entangles static and dynamic components in videos and prevents modeling their distinct properties effectively. These result in inaccurate predictions for spatio-temporal deformations and unsatisfactory representation quality. To address these problems, this paper proposes a Spatio-Temporal hash encoding framework for Gaussian-based Video representation (STGV). By decomposing video features into learnable 2D spatial and 3D temporal hash encodings, STGV effectively facilitates the learning of motion patterns for dynamic components while maintaining background details for static elements. In addition, we construct a more stable and consistent initial canonical Gaussian representation through a key frame canonical initialization strategy, preventing feature overlap and a structurally incoherent geometry representation. Experimental results demonstrate that our method attains better video representation quality (+0.98 PSNR) against other Gaussian-based methods and achieves competitive performance in downstream video tasks.

[318] TAMISeg: Text-Aligned Multi-scale Medical Image Segmentation with Semantic Encoder Distillation

Qiang Gao,Yi Wang,Yong Zhang,Yong Li,Yongbing Deng,Lan Du,Cunjian Chen

Main category: cs.CV

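TL;DR: TAMISeg is a text-guided medical image segmentation framework that introduces clinical language prompts and semantic distillation to reduce reliance on fine-grained pixel-level annotations, and it shows superior performance on several datasets.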

Details Motivation: Medical image segmentation is hampered by scarce fine-grained annotations, complex anatomical structures, and image degradation caused by noise, low contrast, or illumination variation. Method: Proposes the TAMISeg framework with three core components: (1) a consistency-aware encoder pretrained with strong perturbations for robust feature extraction; (2) a semantic encoder distillation module supervised by a frozen DINOv3 teacher to sharpen semantic discriminability; (3) a scale-adaptive decoder that segments anatomical structures across different spatial scales. Result: Experiments on the Kvasir-SEG, MosMedData+, and QaTa-COV19 datasets show that TAMISeg consistently outperforms existing uni-modal and multi-modal methods in both qualitative and quantitative evaluations. Conclusion: TAMISeg uses text prompts and semantic distillation to strengthen visual understanding, achieving more robust and more discriminative medical image segmentation while reducing reliance on fine-grained annotations. Abstract: Medical image segmentation remains challenging due to limited fine-grained annotations, complex anatomical structures, and image degradation from noise, low contrast, or illumination variation. We propose TAMISeg, a text-guided segmentation framework that incorporates clinical language prompts and semantic distillation as auxiliary semantic cues to enhance visual understanding and reduce reliance on pixel-level fine-grained annotations. TAMISeg integrates three core components: a consistency-aware encoder pretrained with strong perturbations for robust feature extraction, a semantic encoder distillation module with supervision from a frozen DINOv3 teacher to enhance semantic discriminability, and a scale-adaptive decoder that segments anatomical structures across different spatial scales. Experiments on the Kvasir-SEG, MosMedData+, and QaTa-COV19 datasets demonstrate that TAMISeg consistently outperforms existing uni-modal and multi-modal methods in both qualitative and quantitative evaluations. Code will be made publicly available at https://github.com/qczggaoqiang/TAMISeg.

[319] ReXSonoVQA: A Video QA Benchmark for Procedure-Centric Ultrasound Understanding

Xucheng Wang,Xiaoman Zhang,Sung Eun Kim,Ankit Pal,Pranav Rajpurkar

Main category: cs.CV

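TL;DR: This paper introduces ReXSonoVQA, a video question-answering benchmark for evaluating how well vision-language models understand dynamic ultrasound procedures. Current large models still fall clearly short on causal-reasoning tasks such as troubleshooting.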

Details Motivation: Vision-language models (VLMs) hold great promise for autonomous ultrasound systems, but current benchmarks evaluate only static image understanding and do not test understanding of dynamic procedures such as probe manipulation and real-time adjustment. Method: Builds the ReXSonoVQA video QA benchmark with 514 video clips and matching questions (249 multiple-choice, 265 free-response) targeting three competencies: Action-Goal Reasoning, Artifact Resolution & Optimization, and Procedure Context & Planning; runs zero-shot evaluation on Gemini 3 Pro, Qwen3.5-397B, LLaVA-Video-72B, and Seed 2.0 Pro. Result: VLMs can extract some procedural information but perform poorly on troubleshooting questions, with minimal gains over text-only baselines, exposing a serious weakness in causal reasoning. Conclusion: ReXSonoVQA fills a gap in the evaluation of dynamic ultrasound understanding, offers a new tool for ultrasound training, clinical guidance, and robotic automation, and reveals a key bottleneck of VLMs in procedural reasoning. Abstract: Ultrasound acquisition requires skilled probe manipulation and real-time adjustments. Vision-language models (VLMs) could enable autonomous ultrasound systems, but existing benchmarks evaluate only static images, not dynamic procedural understanding. We introduce ReXSonoVQA, a video QA benchmark with 514 video clips and 514 questions (249 MCQ, 265 free-response) targeting three competencies: Action-Goal Reasoning, Artifact Resolution & Optimization, and Procedure Context & Planning. Zero-shot evaluation of Gemini 3 Pro, Qwen3.5-397B, LLaVA-Video-72B, and Seed 2.0 Pro shows VLMs can extract some procedural information, but troubleshooting questions remain challenging with minimal gains over text-only baselines, exposing limitations in causal reasoning. ReXSonoVQA enables developing perception systems for ultrasound training, guidance, and robotic automation.

[320] LiveGesture Streamable Co-Speech Gesture Generation Model

Muhammad Usama Saleem,Mayur Jagdishbhai Patel,Ekkasit Pinyoanuntapong,Zhongxing Qin,Li Yang,Hongfei Xue,Ahmed Helmy,Chen Chen,Pu Wang

Main category: cs.CV

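TL;DR: LiveGesture is the first fully streamable, speech-driven full-body gesture generation framework. It supports zero look-ahead and arbitrary sequence length, uses its SVQ and HAR modules for causal, region-coordinated real-time gesture generation, and matches or surpasses offline methods on the BEAT2 dataset.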

Details Motivation: Existing co-speech gesture generation methods are mostly designed for offline use and cannot meet the zero look-ahead, streaming, and long-sequence requirements of real-time interaction; they also either model body regions in isolation or couple them too tightly. Method: Proposes the LiveGesture framework, consisting of a Streamable Vector Quantized Motion Tokenizer (SVQ) and a Hierarchical Autoregressive Transformer (HAR). SVQ encodes the motion of each body region into causal, discrete tokens; HAR contains region-expert xAR modules and a causal spatio-temporal fusion module (xAR Fusion), both conditioned on streaming audio; an autoregressive masking training strategy improves robustness to noise and prediction errors. Result: On the BEAT2 dataset, LiveGesture generates coherent, diverse, beat-synchronous full-body gestures in real time and matches or surpasses state-of-the-art offline methods under zero look-ahead conditions. Conclusion: LiveGesture addresses the streaming, causality, and region-coordination challenges of co-speech gesture generation and offers a new paradigm for real-time human-computer interaction. Abstract: We propose LiveGesture, the first fully streamable, speech-driven full-body gesture generation framework that operates with zero look-ahead and supports arbitrary sequence length. Unlike existing co-speech gesture methods, which are designed for offline generation and either treat body regions independently or entangle all joints within a single model, LiveGesture is built from the ground up for causal, region-coordinated motion generation. LiveGesture consists of two main modules: the Streamable Vector Quantized Motion Tokenizer (SVQ) and the Hierarchical Autoregressive Transformer (HAR). The SVQ tokenizer converts the motion sequence of each body region into causal, discrete motion tokens, enabling real-time, streamable token decoding. On top of SVQ, HAR employs region-expert autoregressive (xAR) transformers to model expressive, fine-grained motion dynamics for each body region. A causal spatio-temporal fusion module (xAR Fusion) then captures and integrates correlated motion dynamics across regions. Both xAR and xAR Fusion are conditioned on live, continuously arriving audio signals encoded by a streamable causal audio encoder. To enhance robustness under streaming noise and prediction errors, we introduce autoregressive masking training, which leverages uncertainty-guided token masking and random region masking to expose the model to imperfect, partially erroneous histories during training. 
Experiments on the BEAT2 dataset demonstrate that LiveGesture produces coherent, diverse, and beat-synchronous full-body gestures in real time, matching or surpassing state-of-the-art offline methods under true zero look-ahead conditions.

[321] AmodalSVG: Amodal Image Vectorization via Semantic Layer Peeling

Juncheng Hu,Ziteng Xue,Guotao Liang,Anran Qi,Buyu Li,Sheng Wang,Dong Xu,Qian Yu

Main category: cs.CV

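TL;DR: AmodalSVG is a new amodal image vectorization framework that produces semantically organized, geometrically complete SVG representations from natural images and supports object-level vector editing.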

Details Motivation: Existing vectorization methods trace only visible pixels (the modal paradigm) and ignore occlusion, which leaves SVGs semantically entangled and geometrically incomplete and limits structural editability. Method: Proposes a two-stage framework: the first stage uses VLM-guided Semantic Layer Peeling (SLP) to perform semantic decoupling and completion in the raster domain; the second stage uses Adaptive Layered Vectorization (ALV), driven by an error budget, to vectorize each layer independently. Result: Significantly outperforms prior methods in visual fidelity and enables object-level editing directly in the vector domain, which existing vectorization approaches do not support. Conclusion: Through amodal reconstruction and semantic layering, AmodalSVG overcomes the geometric and semantic limits of conventional image vectorization and broadens the use of SVG in editable graphics generation. Abstract: We introduce AmodalSVG, a new framework for amodal image vectorization that produces semantically organized and geometrically complete SVG representations from natural images. Existing vectorization methods operate under a modal paradigm: tracing only visible pixels and disregarding occlusion. Consequently, the resulting SVGs are semantically entangled and geometrically incomplete, limiting SVG's structural editability. In contrast, AmodalSVG reconstructs full object geometries, including occluded regions, into independent, editable vector layers. To achieve this, AmodalSVG reformulates image vectorization as a two-stage framework, performing semantic decoupling and completion in the raster domain to produce amodally complete semantic layers, which are then independently vectorized. In the first stage, we introduce Semantic Layer Peeling (SLP), a VLM-guided strategy that progressively decomposes an image into semantically coherent layers. By hybrid inpainting, SLP recovers complete object appearances under occlusions, enabling explicit semantic decoupling. To vectorize these layers efficiently, we propose Adaptive Layered Vectorization (ALV), which dynamically modulates the primitive budget via an error-budget-driven adjustment mechanism. Extensive experiments demonstrate that AmodalSVG significantly outperforms prior methods in visual fidelity. Moreover, the resulting amodal layers enable object-level editing directly in the vector domain, capabilities not supported by existing vectorization approaches. Code will be released upon acceptance.

[322] Progressive Deep Learning for Automated Spheno-Occipital Synchondrosis Maturation Assessment

Omid Halimi Milani,Amanda Nikho,Marouane Tliba,Lauren Mills,Emadeldeen Hamdan,Ahmet Enis Cetin,Mohammed H. Elnagar

Main category: cs.CV

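TL;DR: This paper proposes a progressive representation-learning framework that mimics expert clinical reasoning to assess the maturation of the spheno-occipital synchondrosis (SOS) in CBCT images. It substantially improves accuracy and stability on intermediate fusion stages without changing the network architecture or loss function.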

Details Motivation: CBCT-based SOS staging relies on subtle, continuously changing morphological cues, which leads to high inter-observer variability and poor reproducibility, especially at transitional fusion stages. Method: Frames SOS assessment as a fine-grained visual recognition problem and designs a progressive representation-learning framework that follows the order of expert reasoning (from coarse anatomical structure to subtle closure patterns). Deeper network blocks are activated step by step, giving a curriculum over depth in which shallow layers first learn stable cranial base morphology and deeper layers then discriminate adjacent maturation stages. Result: On both convolutional and transformer architectures, the strategy yields more stable optimization and higher accuracy, particularly for ambiguous intermediate stages; the gains come entirely from improved training dynamics, with no change to model architecture or loss function. Conclusion: The framework establishes a principled link between expert dental intuition and deep visual representations, enables robust, data-efficient SOS staging from pediatric CBCT, and offers a general strategy for modeling other continuous biological processes in medical imaging. Abstract: Accurate assessment of spheno-occipital synchondrosis (SOS) maturation is a key indicator of craniofacial growth and a critical determinant for orthodontic and surgical timing. However, SOS staging from cone-beam CT (CBCT) relies on subtle, continuously evolving morphological cues, leading to high inter-observer variability and poor reproducibility, especially at transitional fusion stages. We frame SOS assessment as a fine-grained visual recognition problem and propose a progressive representation-learning framework that explicitly mirrors how expert clinicians reason about synchondral fusion: from coarse anatomical structure to increasingly subtle patterns of closure. Rather than training a full-capacity network end-to-end, we sequentially grow the model by activating deeper blocks over time, allowing early layers to first encode stable cranial base morphology before higher-level layers specialize in discriminating adjacent maturation stages. This yields a curriculum over network depth that aligns deep feature learning with the biological continuum of SOS fusion. Extensive experiments across convolutional and transformer-based architectures show that this expert-inspired training strategy produces more stable optimization and consistently higher accuracy than standard training, particularly for ambiguous intermediate stages. Importantly, these gains are achieved without changing network architectures or loss functions, demonstrating that training dynamics alone can substantially improve anatomical representation learning. 
The proposed framework establishes a principled link between expert dental intuition and deep visual representations, enabling robust, data-efficient SOS staging from CBCT and offering a general strategy for modeling other continuous biological processes in medical imaging.

[323] Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models

Songlin Yang,Xianghao Kong,Anyi Rao

Main category: cs.CV

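TL;DR: This paper proposes an information-theoretic probing framework that uncovers the internal causes of "pseudo-unification" in unified multimodal models (UMMs): modality-asymmetric encoding and pattern-split response. It argues that genuine unification requires consistency in information flow, not just shared parameters.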

Details Motivation: UMMs are designed to combine language reasoning with visual generation, yet in practice they fail to transfer capabilities across modalities, a phenomenon termed pseudo-unification; existing probing methods either lack model-internal insight or ignore prompt-response dependencies. Method: Proposes an information-theoretic probing framework that jointly analyzes how UMMs encode inputs and generate outputs, and applies it to 10 representative UMMs. Result: Pseudo-unification stems from a dual divergence: (i) Modality-Asymmetric Encoding, where vision and language follow different entropy trajectories, and (ii) Pattern-Split Response, where text generation is high-entropy and image synthesis is low-entropy. Only models that unify both sides (for example, via contextual prediction) show more genuine unification. Conclusion: Real multimodal synergy requires consistency in information flow, not just shared parameters; this work provides the first model-internal probing of unification in UMMs. Abstract: Unified multimodal models (UMMs) were designed to combine the reasoning ability of large language models (LLMs) with the generation capability of vision models. In practice, however, this synergy remains elusive: UMMs fail to transfer LLM-like reasoning to image synthesis and exhibit divergent response behaviors. We term this phenomenon pseudo-unification. Diagnosing its internal causes is important, but existing probing methods either lack model-internal insight or ignore prompt-response dependencies. To address these limitations, we propose an information-theoretic probing framework that jointly analyzes how UMMs encode inputs and generate outputs. Applied to ten representative UMMs, our framework reveals that pseudo-unification stems from a dual divergence: (i) Modality-Asymmetric Encoding, where vision and language follow different entropy trajectories, and (ii) Pattern-Split Response, where text generation exhibits high-entropy creativity while image synthesis enforces low-entropy fidelity. Only models that unify both sides (e.g., via contextual prediction) achieve more genuine unification, enabling stronger reasoning-based text-to-image generation even with fewer parameters. Our work provides the first model-internal probing of unification, demonstrating that real multimodal synergy requires consistency in information flow, not just shared parameters.

[324] Bootstrapping Video Semantic Segmentation Model via Distillation-assisted Test-Time Adaptation

Jihun Kim,Hoyong Kwon,Hyeokjun Kweon,Kuk-Jin Yoon

Main category: cs.CV

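TL;DR: This paper proposes DiTTA, a distillation-assisted test-time adaptation (TTA) framework that efficiently converts an image semantic segmentation (ISS) model into a temporally aware video semantic segmentation (VSS) model. It needs no video annotations, generalizes well after adapting on a small portion of the video, and performs on par with fully supervised methods.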

Details Motivation: Fully supervised video semantic segmentation depends on large amounts of densely annotated video, which is costly and impractical; applying a pre-trained image segmentation model frame by frame ignores temporal coherence; and foundation models such as SAM2 can propagate masks but have limited semantic understanding and high computational overhead, so they are hard to use directly for VSS. Method: Proposes the DiTTA framework: during a brief single-pass initialization phase, SAM2's temporal segmentation knowledge is distilled into the ISS model, and a lightweight temporal fusion module aggregates cross-frame context, enabling efficient test-time adaptation without annotated video. Result: On VSPW and Cityscapes, DiTTA significantly outperforms zero-shot refinement approaches even when adapting on only the initial 10% of each video, and matches or exceeds fully supervised VSS methods. Conclusion: DiTTA offers a practical, annotation-free solution for real-world video semantic segmentation and bridges the temporal gap between image models and video tasks. Abstract: Fully supervised Video Semantic Segmentation (VSS) relies heavily on densely annotated video data, limiting practical applicability. Alternatively, applying pre-trained Image Semantic Segmentation (ISS) models frame-by-frame avoids annotation costs but ignores crucial temporal coherence. Recent foundation models such as SAM2 enable high-quality mask propagation yet remain impractical for direct VSS due to limited semantic understanding and computational overhead. In this paper, we propose DiTTA (Distillation-assisted Test-Time Adaptation), a novel framework that converts an ISS model into a temporally-aware VSS model through efficient test-time adaptation (TTA), without annotated videos. DiTTA distills SAM2's temporal segmentation knowledge into the ISS model during a brief, single-pass initialization phase, complemented by a lightweight temporal fusion module to aggregate cross-frame context. Crucially, DiTTA achieves robust generalization even when adapting with highly limited partial video snippets (e.g., initial 10%), significantly outperforming zero-shot refinement approaches that repeatedly invoke SAM2 during inference. Extensive experiments on VSPW and Cityscapes demonstrate DiTTA's effectiveness, achieving competitive or superior performance relative to fully-supervised VSS methods, thus providing a practical and annotation-free solution for real-world VSS tasks.

[325] FineEdit: Fine-Grained Image Edit with Bounding Box Guidance

Haohang Xu,Lin Liu,Zhibo Zhang,Rong Cong,Xiaopeng Zhang,Qi Tian

Main category: cs.CV

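TL;DR: This paper proposes FineEdit, a diffusion-based image editing method built on multi-level bounding box injection. It uses precise bounding boxes as visual prompts to improve target localization and background consistency, and it introduces the large-scale fine-grained dataset FineEdit-1.2M and the benchmark FineEdit-Bench.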

Details Motivation: Conventional diffusion-based editing models driven by text prompts struggle to localize target objects precisely and, because they regenerate the whole image, produce inconsistent backgrounds; visual prompts such as bounding boxes specify the editing region more intuitively and precisely. Method: Proposes FineEdit, which uses a multi-level bounding box injection mechanism so that the diffusion model can exploit spatial conditions effectively; builds the FineEdit-1.2M dataset of 1.2 million image editing pairs and the FineEdit-Bench benchmark of 1,000 images. Result: On FineEdit-Bench, FineEdit significantly outperforms open-source state-of-the-art models such as Qwen-Image-Edit and LongCat-Image-Edit in instruction compliance and background preservation; on open benchmarks such as GEdit and ImgEdit Bench it also shows stronger generalization and robustness. Conclusion: Bounding boxes as precise spatial guidance substantially improve localization and background consistency in diffusion-based image editing, and FineEdit with its dataset and benchmark provides a new paradigm and practical tools for region-level editing. Abstract: Diffusion-based image editing models have achieved significant progress in real world applications. However, conventional models typically rely on natural language prompts, which often lack the precision required to localize target objects. Consequently, these models struggle to maintain background consistency due to their global image regeneration paradigm. Recognizing that visual cues provide an intuitive means for users to highlight specific areas of interest, we utilize bounding boxes as guidance to explicitly define the editing target. This approach ensures that the diffusion model can accurately localize the target while preserving background consistency. To achieve this, we propose FineEdit, a multi-level bounding box injection method that enables the model to utilize spatial conditions more effectively. To support this high precision guidance, we present FineEdit-1.2M, a large scale, fine-grained dataset comprising 1.2 million image editing pairs with precise bounding box annotations. Furthermore, we construct a comprehensive benchmark, termed FineEdit-Bench, which includes 1,000 images across 10 subjects to effectively evaluate region based editing capabilities. Evaluations on FineEdit-Bench demonstrate that our model significantly outperforms state-of-the-art open-source models (e.g., Qwen-Image-Edit and LongCat-Image-Edit) in instruction compliance and background preservation. Further assessments on open benchmarks (GEdit and ImgEdit Bench) confirm its superior generalization and robustness.

[326] You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass

Yinuo Yang,Zixian Ma,Manasi Ganti,Jieyu Zhang,Ranjay Krishna

Main category: cs.CV

TL;DR: This paper proposes a discriminative multimodal reward model that scores multiple candidate responses in a single forward pass, substantially improving efficiency and performance, and builds two new N-way benchmarks, MR²Bench-Image and MR²Bench-Video.

Details Motivation: Conventional discriminative reward models need a separate forward pass for each response, which is inefficient and makes N-way preference learning hard to support; existing benchmarks are also largely limited to pairwise comparison and lack multi-option evaluation settings. Method: The N candidate responses are concatenated with separator tokens and their scalar scores are supervised with cross-entropy, giving an N-way comparison in one forward pass; the model uses a 4B vision-language backbone with LoRA fine-tuning and a lightweight MLP value head; two new N-way benchmarks are built, MR²Bench-Image (images) and MR²Bench-Video (video). Result: The model reaches SOTA on six multimodal reward benchmarks and outperforms larger generative and discriminative reward models; when used for reinforcement learning with GRPO, policy training is more stable and open-ended generation quality improves substantially. Conclusion: Single-pass N-way reward modeling is an efficient and effective paradigm, and the new benchmarks push multimodal reward modeling toward more realistic and more complex human preference alignment. Abstract: We present a discriminative multimodal reward model that scores all candidate responses in a single forward pass. Conventional discriminative reward models evaluate each response independently, requiring multiple forward passes, one for each potential response. Our approach concatenates multiple responses with separator tokens and applies cross-entropy over their scalar scores, enabling direct comparative reasoning and efficient $N$-way preference learning. The multi-response design also yields up to $N\times$ wall-clock speedup and FLOPs reduction over conventional single-response scoring. To enable $N$-way reward evaluation beyond existing pairwise benchmarks, we construct two new benchmarks: (1) MR$^2$Bench-Image contains human-annotated rankings over responses from 8 diverse models; (2) MR$^2$Bench-Video is a large-scale video-based reward benchmark derived from 94K crowdsourced pairwise human judgments over video question-answering spanning 19 models, denoised via preference graph ensemble. Both benchmarks provide 4-response evaluation variants sampled from the full rankings. Built on a 4B vision-language backbone with LoRA fine-tuning and a lightweight MLP value head, our model achieves state-of-the-art results on six multimodal reward benchmarks, including MR$^2$Bench-Image, MR$^2$Bench-Video, and four other existing benchmarks. Our model outperforms existing larger generative and discriminative reward models.
We further demonstrate that our reward model, when used in reinforcement learning with GRPO, produces improved policy models that maintain performance across standard multimodal benchmarks while substantially improving open-ended generation quality, outperforming a single-response discriminative reward model (RM) baseline by a large margin in both training stability and open-ended generation quality.
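
The "cross-entropy over scalar scores" idea in this abstract can be illustrated with a minimal sketch. It assumes the training target is a single preferred response among the N candidates; the abstract does not state the exact target construction, so the function name `n_way_cross_entropy` and the `best_index` argument are hypothetical.

```python
import math

def n_way_cross_entropy(scores, best_index):
    """Softmax cross-entropy over N scalar scores.

    Assumption: exactly one candidate (best_index) is the preferred response.
    """
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[best_index]

# Four candidate responses scored together (illustrative values).
scores = [2.0, 0.5, -1.0, 0.0]
loss_if_first_is_best = n_way_cross_entropy(scores, best_index=0)
loss_if_third_is_best = n_way_cross_entropy(scores, best_index=2)
```

The loss is small when the preferred response already has the highest score and grows when a low-scored response is the preferred one, which is what lets one forward pass supervise all N scores jointly.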

[327] Towards Automated Solar Panel Integrity: Hybrid Deep Feature Extraction for Advanced Surface Defect Identification

Muhammad Junaid Asif,Muhammad Saad Rafaqat,Usman Nazakat,Uzair Khan,Rana Fayyaz Ahmad

Main category: cs.CV

TL;DR: This paper proposes a hybrid method for solar panel defect detection that combines handcrafted features (LBP, HoG, Gabor) with deep features (DenseNet-169) and feeds them to SVM, XGBoost, and LGBM classifiers, reaching a top accuracy of 99.17% on an augmented dataset.

Details Motivation: Manual inspection of solar plants is slow, labor-intensive, costly, and error-prone, especially for large-scale or remote installations, so an automated, intelligent system for continuous monitoring and early fault detection is needed. Method: A hybrid method fuses handcrafted features (LBP, HoG, Gabor filters) with deep features extracted by DenseNet-169; the two feature types are concatenated and fed separately to SVM, XGBoost, and LGBM classifiers for defect identification. Result: On the augmented dataset, the DenseNet-169 + Gabor + SVM combination achieves the highest classification accuracy, 99.17%, and overall performance exceeds the other compared methods. Conclusion: The proposed hybrid framework performs well in detection accuracy, robustness, and flexibility, providing a solid basis for real-world deployment of automated PV panel monitoring systems. Abstract: To ensure energy efficiency and reliable operations, it is essential to monitor solar panels in generation plants to detect defects. It is labor-intensive, time-consuming, and costly to manually monitor large-scale solar plants and those installed in remote areas. Manual inspection is also susceptible to human error. Consequently, it is necessary to create an automated, intelligent defect-detection system that ensures continuous monitoring, early fault detection, and maximum power generation. We propose a novel hybrid method for defect detection in solar panels that combines handcrafted and deep learning features. Local Binary Pattern (LBP), Histogram of Oriented Gradients (HoG), and Gabor filters are used to extract handcrafted features. Deep features are extracted with DenseNet-169. The handcrafted and deep features are concatenated and then fed to three distinct types of classifiers: Support Vector Machines (SVM), Extreme Gradient Boosting (XGBoost), and Light Gradient-Boosting Machine (LGBM). Experimental results on the augmented dataset show superior performance; in particular, DenseNet-169 + Gabor with an SVM classifier achieved the highest accuracy, 99.17%, exceeding all other evaluated systems. Overall, the proposed hybrid framework offers better defect-detection accuracy, robustness, and flexibility, providing a solid basis for real-world use of automated PV panel monitoring systems.

[328] Using Deep Learning Models Pretrained by Self-Supervised Learning for Protein Localization

Ben Isselmann,Dilara Göksu,Heinz Neumann,Andreas Weinmann

Main category: cs.CV

TL;DR: This study examines how well self-supervised learning (SSL) models generalize in microscopy image analysis. DINO-based ViT models pretrained on HPA FOV or ImageNet-1k transfer well to the OpenCell dataset even without fine-tuning; the HPA FOV-pretrained model gives the best zero-shot performance, and fine-tuning improves it further. Single-cell results also confirm the value of domain-relevant pretraining.

Details Motivation: Task-specific microscopy datasets are usually small, which makes it hard to train robust deep learning models; how well existing SSL methods generalize across datasets with different staining protocols and channel configurations is still unclear. Method: DINO-ViT models pretrained on ImageNet-1k and HPA FOV are evaluated for transfer, testing zero-shot and fine-tuned performance on OpenCell and comparing two channel-mismatch strategies and different fine-tuning data fractions; single-cell embeddings are also analyzed on a labeled OpenCell subset. Result: The HPA FOV-pretrained model reaches a zero-shot macro F1 of 0.822±0.007, rising to 0.860±0.013 after fine-tuning; at the single-cell level, the HPA single-cell-pretrained model has the highest k-NN macro F1 at every neighborhood size (≥0.796). Conclusion: Pretraining with SSL methods such as DINO on large domain-relevant datasets effectively improves feature generalization and fine-tuning performance on small microscopy task datasets. Abstract: Background: Task-specific microscopy datasets are often small, making it difficult to train deep learning models that learn robust features. While self-supervised learning (SSL) has shown promise through pretraining on large, domain-specific datasets, generalizability across datasets with differing staining protocols and channel configurations remains underexplored. We investigated the generalizability of SSL models pretrained on ImageNet-1k and HPA FOV, evaluating their embeddings on OpenCell with and without fine-tuning, two channel-mismatch strategies, and varying fine-tuning data fractions. We additionally analyzed single-cell embeddings on a labeled OpenCell subset. Result: DINO-based ViT backbones pretrained on HPA FOV or ImageNet-1k transfer well to OpenCell even without fine-tuning. The HPA FOV-pretrained model achieved the highest zero-shot performance (macro $F_1$ 0.822 $\pm$ 0.007). Fine-tuning further improved performance to 0.860 $\pm$ 0.013. At the single-cell level, the HPA single-cell-pretrained model achieved the highest k-nearest neighbor performance across all neighborhood sizes (macro $F_1$ $\geq$ 0.796). Conclusion: SSL methods like DINO, pretrained on large domain-relevant datasets, enable effective use of deep learning features for fine-tuning on small, task-specific microscopy datasets.
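
Since the results above are reported as macro $F_1$, a short sketch of that metric may help when reading the numbers. This is the standard unweighted mean of per-class F1 scores, not code from the paper; the toy labels are illustrative only.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)

# Toy localization labels (hypothetical class names).
y_true = ["nucleus", "nucleus", "cytosol", "cytosol"]
y_pred = ["nucleus", "cytosol", "cytosol", "cytosol"]
score = macro_f1(y_true, y_pred)
```

Because every class contributes equally regardless of its frequency, macro $F_1$ penalizes poor performance on rare localization classes more than plain accuracy does.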

[329] MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models

Xincheng Yao,Zefeng Qian,Chao Shi,Jiayang Song,Chongyang Zhang

Main category: cs.CV

TL;DR: This paper presents the MMR-AD benchmark to advance research on multimodal large language models (MLLMs) for general anomaly detection (GAD) and, building on it, a reasoning-enhanced baseline model, Anomaly-R1, which clearly outperforms existing generalist MLLMs in detection and localization.

Details Motivation: Although MLLMs have strong vision and language capabilities, their pretraining data differs in distribution from industrial anomaly detection scenarios, and mainstream AD datasets are not suited to post-training MLLMs, so the general anomaly detection ability of MLLMs remains underexplored. Method: The authors build MMR-AD, the first comprehensive general anomaly detection benchmark aimed at MLLMs (covering training and evaluation), and propose the reasoning-based baseline Anomaly-R1, trained on its CoT data and further enhanced with reinforcement learning. Result: Experiments show that Anomaly-R1 significantly outperforms current SOTA generalist MLLMs on anomaly detection and localization; they also reveal that existing generalist MLLMs still fall well short of industrial requirements. Conclusion: MMR-AD provides key infrastructure for applying MLLMs to general anomaly detection, and Anomaly-R1 validates combining reasoning with reinforcement learning as an effective way to improve MLLM AD ability, moving GAD toward practical use. Abstract: In the development of industrial anomaly detection, general anomaly detection (GAD) is an emerging trend and also the ultimate goal. Unlike conventional single- and multi-class AD, general AD aims to train a general AD model that can directly detect anomalies in diverse novel classes without any retraining or fine-tuning on the target data. Recently, Multimodal Large Language Models (MLLMs) have shown great promise in achieving general anomaly detection due to their revolutionary visual understanding and language reasoning capabilities. However, the general AD ability of MLLMs remains underexplored for two reasons: (1) MLLMs are pretrained on large amounts of Web-sourced data, which still differ significantly from the data in AD scenarios. Moreover, the image-text pairs used during pretraining are not specifically designed for AD tasks. (2) The current mainstream AD datasets are image-based and not yet suitable for post-training MLLMs. To facilitate MLLM-based general AD research, we present MMR-AD, which is a comprehensive benchmark for both training and evaluating MLLM-based AD models. With MMR-AD, we reveal that the AD performance of current SOTA generalist MLLMs still falls far behind the industrial requirements. Based on MMR-AD, we also propose a baseline model, Anomaly-R1, which is a reasoning-based AD model that learns from the CoT data in MMR-AD and is further enhanced by reinforcement learning. Extensive experiments show that our Anomaly-R1 achieves remarkable improvements over generalist MLLMs in both anomaly detection and localization.

[330] Energy-oriented Diffusion Bridge for Image Restoration with Foundational Diffusion Models

Jinhui Hou,Zhiyu Zhu,Junhui Hou

Main category: cs.CV

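TL;DR: This paper proposes an energy-oriented diffusion bridge framework (E-Bridge) that designs a bridge process with a shorter time horizon and an entropy-regularized starting point and combines it with consistency modeling to achieve high-quality image restoration in one or a few steps, reaching SOTA performance on a range of image restoration tasks.

Details Motivation: Existing diffusion bridge models can connect the clean and degraded image distributions, but they rely on complex, high-cost trajectories that limit sampling efficiency and restoration quality. Method: The E-Bridge framework (1) designs a short-horizon bridge process; (2) starts the reverse process from an entropy-regularized point that mixes the degraded image with Gaussian noise; (3) borrows from consistency models to learn a single-step mapping function, optimized with a continuous-time consistency objective tailored to this trajectory; and (4) treats the trajectory length as a tunable, task-adaptive parameter. Result: The method reaches SOTA performance on multiple image restoration tasks such as denoising and super-resolution, and supports single-step or very-few-step sampling while maintaining high-quality recovery. Conclusion: By lowering the required trajectory energy and introducing a task-adaptive trajectory length, E-Bridge greatly improves sampling efficiency while preserving restoration quality, offering a new paradigm for efficient diffusion-based image restoration. Abstract: Diffusion bridge models have shown great promise in image restoration by explicitly connecting clean and degraded image distributions. However, they often rely on complex and high-cost trajectories, which limit both sampling efficiency and final restoration quality. To address this, we propose an Energy-oriented diffusion Bridge (E-Bridge) framework that approximates a set of low-cost manifold geodesic trajectories to improve restoration performance. We achieve this by designing a novel bridge process that evolves over a shorter time horizon and makes the reverse process start from an entropy-regularized point that mixes the degraded image and Gaussian noise, which theoretically reduces the required trajectory energy. To solve this process efficiently, we draw inspiration from consistency models to learn a single-step mapping function, optimized via a continuous-time consistency objective tailored for our trajectory, so as to analytically map any state on the trajectory to the target image. Notably, the trajectory length in our framework becomes a tunable task-adaptive knob, allowing the model to adaptively balance information preservation against generative power for tasks of varying degradation, such as denoising versus super-resolution. Extensive experiments demonstrate that our E-Bridge achieves state-of-the-art performance across various image restoration tasks while enabling high-quality recovery with a single or a few sampling steps. Our project page is https://jinnh.github.io/E-Bridge/.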

[331] ArtiCAD: Articulated CAD Assembly Design via Multi-Agent Code Generation

Yuan Shui,Yandong Guan,Zhanwei Zhang,Juncheng Hu,Jing Zhang,Dong Xu,Qian Yu

Main category: cs.CV

TL;DR: ArtiCAD is the first training-free multi-agent system that generates editable CAD assemblies with movable joints directly from text or images. It fixes assembly relationships at the design stage through predefined connectors and improves quality with validation and rollback mechanisms and a self-evolving experience store.

Details Motivation: Parametric CAD modeling is essential for product development, but no existing method can automatically generate multi-part, movable CAD assemblies from high-level descriptions such as text or images. Method: A four-agent collaborative architecture (Design, Generation, Assembly, Review) is proposed; the core idea is to move assembly relationship prediction forward to the design stage using a Connector that explicitly defines attachment points and joint parameters; validation steps are added to the generation and assembly stages, together with a cross-stage rollback mechanism and a self-evolving experience store. Result: The approach is validated on three datasets (ArtiCAD-Bench, CADPrompt, and ACD); it supports requirement-driven conceptual design, physical prototyping, and URDF export for generating embodied AI training assets. Conclusion: ArtiCAD works around the limited spatial reasoning of current large models and achieves high-quality, editable, end-to-end generation of articulated CAD assemblies, offering a new paradigm for CAD automation and embodied AI. Abstract: Parametric Computer-Aided Design (CAD) of articulated assemblies is essential for product development, yet generating these multi-part, movable models from high-level descriptions remains unexplored. To address this, we propose ArtiCAD, the first training-free multi-agent system capable of generating editable, articulated CAD assemblies directly from text or images. Our system divides this complex task among four specialized agents: Design, Generation, Assembly, and Review. One of our key insights is to predict assembly relationships during the initial design stage rather than the assembly stage. By utilizing a Connector that explicitly defines attachment points and joint parameters, ArtiCAD determines these relationships before geometry generation, effectively bypassing the limited spatial reasoning capabilities of current LLMs and VLMs. To further ensure high-quality outputs, we introduce validation steps in the generation and assembly stages, accompanied by a cross-stage rollback mechanism that accurately isolates and corrects design- and code-level errors. Additionally, a self-evolving experience store accumulates design knowledge to continuously improve performance on future tasks. Extensive evaluations on three datasets (ArtiCAD-Bench, CADPrompt, and ACD) validate the effectiveness of our approach. We further demonstrate the applicability of ArtiCAD in requirement-driven conceptual design, physical prototyping, and the generation of embodied AI training assets through URDF export.

[332] LumiMotion: Improving Gaussian Relighting with Scene Dynamics

Joanna Kaleta,Piotr Wójcik,Kacper Marzol,Tomasz Trzciński,Kacper Kania,Marek Kowalski

Main category: cs.CV

TL;DR: This paper proposes LumiMotion, the first Gaussian Splatting method that uses dynamic elements for inverse rendering. Moving regions reveal surfaces under different lighting, which improves albedo estimation and scene relighting; the authors also release the first synthetic benchmark for dynamic environments.

Details Motivation: Existing Gaussian Splatting-based methods mainly target static scenes and assume simplified or moderate lighting, so they struggle to separate lighting from material properties accurately under complex real-world illumination. Method: LumiMotion builds a dynamic 2D Gaussian Splatting representation and introduces new constraints that let dynamic regions deform while static regions stay stable, using motion as a supervisory signal for inverse rendering. Result: LPIPS improves by 23% for albedo estimation and by 15% for scene relighting; a synthetic benchmark with 5 scenes, 4 lighting conditions, and static and dynamic variants is released. Conclusion: Dynamic regions can serve as a strong supervisory signal for disentangling material and illumination in inverse rendering; LumiMotion achieves clear gains for inverse rendering of dynamic scenes and enables systematic evaluation in this direction. Abstract: In 3D reconstruction, the problem of inverse rendering, namely recovering the illumination of the scene and the material properties, is fundamental. Existing Gaussian Splatting-based methods primarily target static scenes and often assume simplified or moderate lighting to avoid entangling shadows with surface appearance. This limits their ability to accurately separate lighting effects from material properties, particularly in real-world conditions. We address this limitation by leveraging dynamic elements - regions of the scene that undergo motion - as a supervisory signal for inverse rendering. Motion reveals the same surfaces under varying lighting conditions, providing stronger cues for disentangling material and illumination. This thesis is supported by our experimental results, which show that we improve LPIPS by 23% for albedo estimation and by 15% for scene relighting relative to the next-best baseline. To this end, we introduce LumiMotion, the first Gaussian-based approach that leverages dynamics for inverse rendering and operates in arbitrary dynamic scenes. Our method learns a dynamic 2D Gaussian Splatting representation that employs a set of novel constraints which encourage the dynamic regions of the scene to deform, while keeping static regions stable. As we demonstrate, this separation is crucial for correct optimization of the albedo. Finally, we release a new synthetic benchmark comprising five scenes under four lighting conditions, each in both static and dynamic variants, for the first time enabling systematic evaluation of inverse rendering methods in dynamic environments and challenging lighting.
Link to project page: https://joaxkal.github.io/LumiMotion/

[333] TraversalBench: Challenging Paths to Follow for Vision Language Models

Clara Petrova,Zhuo Chen,Marin Soljačić

Main category: cs.CV

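TL;DR: This paper presents TraversalBench, a controlled benchmark for evaluating exact visual path traversal in vision-language models (VLMs). Experiments show that self-intersections are the main difficulty and that errors cluster around the first crossing, revealing human-like failure modes in sustained visual processing.

Details Motivation: Existing VLMs perform well on multimodal benchmarks, but their ability to follow complex visual paths, which is intuitive and easy for humans, has not been systematically evaluated; a controlled, interpretable diagnostic benchmark is needed. Method: TraversalBench instances contain a single continuous polyline, a unique start marker, and vertex markers, and the model must output the exact traversal order; structural factors such as self-intersection count, tortuosity, vertex count, and nearby confounding lines are explicitly balanced, while reliance on OCR, world knowledge, and open-ended planning is controlled; an auxiliary reading-order benchmark serves as a comparison. Result: Self-intersections are the dominant source of difficulty; performance is stable before the first crossing and drops sharply afterward; nearby confounding lines cause a gradual, cumulative degradation; models show a preference for left-to-right serialization, but this does not explain the main effect of path complexity. Conclusion: TraversalBench effectively diagnoses VLM deficiencies in path-faithful visual reasoning under ambiguity, clutter, and distractors, fills a gap in sustained visual grounding benchmarks, and provides a controlled tool for research on multimodal spatial reasoning. Abstract: Vision-language models (VLMs) perform strongly on many multimodal benchmarks. However, the ability to follow complex visual paths -- a task that human observers typically find straightforward -- remains under-tested. We introduce TraversalBench, a controlled benchmark for exact visual path traversal. Each instance contains a single continuous polyline, a unique start marker, and markers placed at path vertices; the task is to recover the exact ordered sequence encountered when traversing the path from start to finish. The benchmark explicitly balances key path-structural factors including self-intersection count, tortuosity, vertex count, and nearby confounding lines, while minimizing reliance on OCR, world knowledge, and open-ended planning. We find that self-intersections are the dominant source of difficulty. A first-crossing analysis shows that errors are sharply localized: performance is relatively stable immediately before the first crossing, then drops steeply when the model must resolve the correct continuation. By contrast, nearby confounding lines produce a weaker persistent degradation that compounds with repeated exposure. These analyses make TraversalBench a useful diagnostic for identifying whether models suffer from human-like failures or other breakdowns in sustained visual processing. An auxiliary reading-order benchmark further reveals a consistent preference for layouts compatible with left-to-right serialization, while not explaining away the main effects of path complexity.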
Together, these results position TraversalBench as a controlled diagnostic of path-faithful visual reasoning and as a useful testbed for studying multimodal spatial reasoning under ambiguity, clutter, and distractor structure. More broadly, we position TraversalBench as a contribution to the still-limited area of sustained visual grounding benchmarks for VLMs.

[334] Panoptic Pairwise Distortion Graph

Muhammad Kamran Janjua,Abdul Wahab,Bahador Rashidi

Main category: cs.CV

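TL;DR: This paper proposes a new approach to comparative assessment of image pairs, the Distortion Graph (DG), which represents an image pair as a region-based structured topology encoding distortion type, severity, comparison, and quality score. To support it, the authors build the region-level dataset PandaSet, the benchmark suite PandaBench, and the efficient model Panda, and show that these improve region-level distortion understanding in multimodal large models.

Details Motivation: Existing image pair assessment methods rely mainly on whole-image analysis; they implicitly depend on region understanding but lack explicit, structured region-level modeling, which makes fine-grained, interpretable comparison difficult. Method: The paper introduces the Distortion Graph, a cross-image extension of the scene graph, builds the region-level dataset PandaSet and the benchmark PandaBench, and designs a dedicated model, Panda, to generate distortion graphs; DG structure is injected through training or prompting to strengthen region-level distortion perception. Result: PandaBench is a significant challenge for current SOTA multimodal large language models, indicating that they lack region-level distortion understanding; training on PandaSet or prompting with DG elicits region-level distortion recognition and reasoning. Conclusion: The Distortion Graph provides a structured, interpretable, fine-grained paradigm for image pair assessment, shifts the focus from whole-image to region-level understanding, and opens a new direction for fine-grained visual evaluation of multimodal models. Abstract: In this work, we introduce a new perspective on comparative image assessment by representing an image pair as a structured composition of its regions. In contrast, existing methods focus on whole image analysis, while implicitly relying on region-level understanding. We extend the intra-image notion of a scene graph to inter-image, and propose a novel task of Distortion Graph (DG). DG treats paired images as a structured topology grounded in regions, and represents dense degradation information such as distortion type, severity, comparison and quality score in a compact interpretable graph structure. To realize the task of learning a distortion graph, we contribute (i) a region-level dataset, PandaSet, (ii) a benchmark suite, PandaBench, with varying region-level difficulty, and (iii) an efficient architecture, Panda, to generate distortion graphs. We demonstrate that PandaBench poses a significant challenge for state-of-the-art multimodal large language models (MLLMs) as they fail to understand region-level degradations even when fed with explicit region cues. We show that training on PandaSet or prompting with DG elicits region-wise distortion understanding, opening a new direction for fine-grained, structured pairwise image assessment.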

[335] Towards Realistic 3D Emission Materials: Dataset, Baseline, and Evaluation for Emission Texture Generation

Zhiyuan Zhang,Zijian Zhou,Linjun Li,Long Chen,Hao Tang,Yichen Gong

Main category: cs.CV

TL;DR: This paper proposes a new 3D texture generation task, emission texture generation, which aims to synthesize 3D object materials with realistic LED emission effects from reference images. To this end it builds Objaverse-Emission, the first dataset of 40k 3D assets with high-quality emission materials, and proposes the baseline model EmissionGen along with corresponding evaluation metrics.

Details Motivation: Existing 3D texture generation methods support only a few non-emissive PBR materials and cannot reproduce popular styles such as cyberpunk or realistic LED emission effects. Method: The authors build the Objaverse-Emission dataset (40k 3D assets with high-quality emission materials), propose EmissionGen as a baseline for the emission texture generation task, and define dedicated evaluation metrics. Result: The method generates 3D textures with realistic emission materials from reference images, demonstrating the task's potential for industrial applications. Conclusion: Emission texture generation is a novel and practically valuable 3D generation task, and the dataset, model, and evaluation protocol lay a foundation for follow-up research and industrial adoption. Abstract: 3D texture generation is receiving increasing attention, as it enables the creation of realistic and aesthetic texture materials for untextured 3D meshes. However, existing 3D texture generation methods are limited to producing only a few types of non-emissive PBR materials (e.g., albedo, metallic, and roughness maps), which makes it difficult to replicate highly popular styles such as cyberpunk and to achieve effects like realistic LED emissions. To address this limitation, we propose a novel task, emission texture generation, which enables the synthesized 3D objects to faithfully reproduce the emission materials from input reference images. Our key contributions are as follows. First, we construct the Objaverse-Emission dataset, the first dataset that contains 40k 3D assets with high-quality emission materials. Second, we propose EmissionGen, a novel baseline for the emission texture generation task. Third, we define detailed evaluation metrics for the emission texture generation task. Our results demonstrate significant potential for future industrial applications. Dataset will be available at https://github.com/yx345kw/EmissionGen.

[336] Data-Efficient Semantic Segmentation of 3D Point Clouds via Open-Vocabulary Image Segmentation-based Pseudo-Labeling

Takahiko Furuya

Main category: cs.CV

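TL;DR: This paper proposes the PLOVIS framework, which uses an open-vocabulary image segmentation model to generate high-quality pseudo labels for 3D point clouds, addressing three concurrent data shortages (few training scenes, few point annotations, and no corresponding 2D image sequences), and improves performance with two-stage pseudo-label filtering and a class-balanced memory bank.

Details Motivation: In real-world settings, semantic segmentation of 3D point clouds faces three forms of data insufficiency: scarce training scenes, scarce point-level annotations, and missing 2D image sequences from which the point clouds were reconstructed; existing methods address only one or two of these and lack a joint solution. Method: The PLOVIS algorithm (1) uses an Open-Vocabulary Image Segmentation (OVIS) model as the pseudo-label generator; (2) renders 2D images directly from the 3D point clouds for pseudo-labeling, so no original 2D sequences are needed; (3) applies two-stage pseudo-label filtering (first removing low-confidence labels, then removing likely incorrect ones); and (4) introduces a class-balanced memory bank to mitigate pseudo-label noise and class imbalance. Result: On the four benchmarks ScanNet, S3DIS, Toronto3D, and Semantic3D, under extremely sparse conditions of only a few tens of training scenes with fewer than 100 annotated points each, PLOVIS consistently outperforms standard fine-tuning and state-of-the-art weakly supervised methods. Conclusion: PLOVIS is the first framework to systematically address all three forms of data insufficiency in 3D point cloud segmentation; it validates OVIS-based cross-modal pseudo-label generation with careful filtering and offers a new paradigm for data-efficient 3D understanding. Abstract: Semantic segmentation of 3D point cloud scenes is a crucial task for various applications. In real-world scenarios, training segmentation models often faces three concurrent forms of data insufficiency: scarcity of training scenes, scarcity of point-level annotations, and absence of 2D image sequences from which point clouds were reconstructed. Existing data-efficient algorithms typically address only one or two of these challenges, leaving the joint treatment of all three unexplored. This paper proposes a data-efficient training framework specifically designed to address the three forms of data insufficiency. Our proposed algorithm, called Point pseudo-Labeling via Open-Vocabulary Image Segmentation (PLOVIS), leverages an Open-Vocabulary Image Segmentation (OVIS) model as a pseudo label generator to compensate for the lack of training data. PLOVIS creates 2D images for pseudo-labeling directly from training 3D point clouds, eliminating the need for 2D image sequences. To mitigate the inherent noise and class imbalance in pseudo labels, we introduce a two-stage filtering of pseudo labels combined with a class-balanced memory bank for effective training. The two-stage filtering mechanism first removes low-confidence pseudo labels, then discards likely incorrect pseudo labels, thereby enhancing the quality of pseudo labels.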
Experiments on four benchmark datasets, i.e., ScanNet, S3DIS, Toronto3D, and Semantic3D, under realistic data-scarce conditions (a few tens of training 3D scenes, each annotated with only <100 3D points) demonstrate that PLOVIS consistently outperforms existing methods including standard fine-tuning strategies and state-of-the-art weakly supervised learning algorithms. Code will be made publicly available.

[337] Byte-level generative predictions for forensics multimedia carving

Jaewon Lee,Md Eimran Hossain Eimon,Avinash Srinivasan,Hari Kalva

Main category: cs.CV

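TL;DR: This paper proposes a method for recovering fragmented multimedia files based on the byte-level generative model bGPT, which predicts missing bytes to improve fragment matching.

Details Motivation: Traditional file carving methods cannot reconstruct or predict missing data, which makes it hard to recover fragmented multimedia files that lack file system metadata. Method: A generative carving approach based on a byte-level transformer (bGPT) uses next-byte prediction to continue partial BMP image fragments, and prediction fidelity is evaluated with cosine similarity, SSIM, chi-square distance, and Jensen-Shannon divergence. Result: Experiments show that the generative model can effectively predict byte-level patterns and support fragment matching in unallocated disk space. Conclusion: Generative methods offer a new approach to fragment recovery in digital forensics, compensating for the inability of traditional discriminative methods to complete missing data. Abstract: Digital forensic investigations often face significant challenges when recovering fragmented multimedia files that lack file system metadata. While traditional file carving relies on signatures and discriminative deep learning models for fragment classification, these methods cannot reconstruct or predict missing data. We propose a generative approach to multimedia carving using bGPT, a byte-level transformer designed for next-byte prediction. By feeding partial BMP image data into the model, we simulate the generation of likely fragment continuations. We evaluate the fidelity of these predictions using different metrics, namely, cosine similarity, structural similarity index (SSIM), chi-square distance, and Jensen-Shannon divergence (JSD). Our findings demonstrate that generative models can effectively predict byte-level patterns to support fragment matching in unallocated disk space.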
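
Three of the fidelity metrics named in this abstract can be sketched on byte-value histograms. This is an illustration under stated assumptions: the abstract does not say whether the metrics are computed on histograms or on raw byte sequences, the chi-square form with the 0.5 factor is one common convention, and SSIM is omitted because it needs image structure. All function names are hypothetical.

```python
import math
from collections import Counter

def byte_histogram(data: bytes):
    """Normalized 256-bin histogram of byte values (data must be non-empty)."""
    counts = Counter(data)
    return [counts.get(b, 0) / len(data) for b in range(256)]

def cosine_similarity(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def chi_square_distance(p, q):
    # Symmetric chi-square distance; bins that are empty in both are skipped.
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (range 0 to 1)."""
    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Compare a "true" fragment with a predicted continuation (toy bytes).
true_hist = byte_histogram(bytes([0, 0, 255, 255, 128, 128]))
pred_hist = byte_histogram(bytes([0, 0, 255, 254, 128, 127]))
similarity = cosine_similarity(true_hist, pred_hist)
```

Identical byte distributions give cosine similarity 1 and zero chi-square distance and JSD, while fragments with no byte values in common give cosine similarity 0 and the maximum value 1 for both distances under these conventions.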

[338] UHD-GPGNet: UHD Video Denoising via Gaussian-Process-Guided Local Spatio-Temporal Modeling

Weiyuan He,Chen Wu,Pengwen Dai,Wei Wang,Dianjie Lu,Guijuan Zhang,Linwei Fan,Yongzhen Wang,Zhuoran Zheng

Main category: cs.CV

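TL;DR: This paper proposes UHD-GPGNet, a Gaussian-process-guided local spatio-temporal video denoising framework that jointly addresses degradation suppression, texture and chroma fidelity, and real-time 4K deployment in UHD video denoising. Through explicit modeling of degradation uncertainty, structure-color collaborative reconstruction, and a heteroscedastic loss, it achieves competitive restoration quality and real-time 4K inference with fewer parameters, and generalizes well to real-world footage.

Details Motivation: UHD video denoising must simultaneously suppress complex spatio-temporal degradations, preserve fine textures and chromatic stability, and remain efficient at full 4K resolution, which existing methods struggle to achieve together. Method: UHD-GPGNet (1) uses a Gaussian process (GP) over compact spatio-temporal descriptors to explicitly estimate local degradation response and uncertainty, which guide adaptive temporal-detail fusion; (2) uses a structure-color collaborative reconstruction head that decouples luminance, chroma, and high-frequency correction; and (3) introduces a heteroscedastic loss and overlap-tiled inference to stabilize training and support memory-bounded 4K deployment. Result: On UVG and RealisVideo-4K, UHD-GPGNet reaches competitive restoration fidelity with substantially fewer parameters; it enables real-time full-resolution 4K inference with a significant speedup over the closest quality competitor; it is robust under multi-level mixed degradation; and on phone-captured real 4K video it generalizes from synthetic training to real sensor noise and improves downstream object detection. Conclusion: By combining explicit uncertainty modeling with a lightweight, efficient architecture, UHD-GPGNet offers a new paradigm for UHD video denoising that balances performance, efficiency, and generalization. Abstract: Ultra-high-definition (UHD) video denoising requires simultaneously suppressing complex spatio-temporal degradations, preserving fine textures and chromatic stability, and maintaining efficient full-resolution 4K deployment. In this paper, we propose UHD-GPGNet, a Gaussian-process-guided local spatio-temporal denoising framework that addresses these requirements jointly. Rather than relying on implicit feature learning alone, the method estimates sparse GP posterior statistics over compact spatio-temporal descriptors to explicitly characterize local degradation response and uncertainty, which then guide adaptive temporal-detail fusion. A structure-color collaborative reconstruction head decouples luminance, chroma, and high-frequency correction, while a heteroscedastic objective and overlap-tiled inference further stabilize optimization and enable memory-bounded 4K deployment. Experiments on UVG and RealisVideo-4K show that UHD-GPGNet achieves competitive restoration fidelity with substantially fewer parameters than existing methods, enables real-time full-resolution 4K inference with significant speedup over the closest quality competitor, and maintains robust performance across a multi-level mixed-degradation schedule. A real-world study on phone-captured 4K video further confirms that the model, trained entirely on synthetic degradation, generalizes to unseen real sensor noise and improves downstream object detection under challenging conditions.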

[339] Test-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images

Zheng Jiang,Yiming Chen,Nan He,Jiahui Chen,Chaoyang Li,Houde Qian,Lifeng Sun

Main category: cs.CV

TL;DR: This paper proposes the Test-Time Scaling over Perception (TTSP) framework, which addresses the brittleness of fine-grained visual reasoning caused by the "Grounding Paradox" in multimodal large models through multi-path exploration of perception at inference time, entropy-based confidence filtering, structured knowledge distillation, and uncertainty-driven iterative refinement.

Details Motivation: Existing multimodal large language models (MLLMs) are brittle in fine-grained visual reasoning because they must decide where to look before they have enough visual evidence to decide correctly, a circular dependency termed the Grounding Paradox. Method: The TTSP framework generates multiple exploratory perception traces, filters unreliable traces using entropy estimates, distills trusted observations into structured knowledge, and iteratively focuses on regions of unresolved uncertainty. Result: On high-resolution and general multimodal reasoning benchmarks, TTSP consistently outperforms strong baselines across backbone sizes while showing favorable scalability and token efficiency. Conclusion: Scaling perception at test time is an effective new direction for improving robust reasoning in multimodal models under perceptual uncertainty. Abstract: Recent multimodal large language models (MLLMs) have begun to support Thinking with Images by invoking visual tools such as zooming and cropping during inference. Yet these systems remain brittle in fine-grained visual reasoning because they must decide where to look before they have access to the evidence needed to make that decision correctly. We identify this circular dependency as the Grounding Paradox. To address it, we propose Test-Time Scaling over Perception (TTSP), a framework that treats perception itself as a scalable inference process. TTSP generates multiple exploratory perception traces, filters unreliable traces using entropy-based confidence estimation, distills validated observations into structured knowledge, and iteratively refines subsequent exploration toward unresolved uncertainty. Extensive experiments on high-resolution and general multimodal reasoning benchmarks show that TTSP consistently outperforms strong baselines across backbone sizes, while also exhibiting favorable scalability and token efficiency. Our results suggest that scaling perception at test time is a promising direction for robust multimodal reasoning under perceptual uncertainty.
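
The entropy-based filtering step can be sketched as follows. This is only an illustration of the general idea: it assumes each perception trace carries a probability distribution over candidate observations, and the trace format, the `probs` field, and the threshold value are all hypothetical rather than taken from the paper.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def filter_traces(traces, max_entropy):
    """Keep only traces whose distribution is confident (low entropy)."""
    return [t for t in traces if shannon_entropy(t["probs"]) <= max_entropy]

# Hypothetical traces: one confident observation, one near-uniform guess.
traces = [
    {"region": "top-left crop", "probs": [0.90, 0.05, 0.05]},
    {"region": "center crop", "probs": [1 / 3, 1 / 3, 1 / 3]},
]
kept = filter_traces(traces, max_entropy=0.7)
```

A near-uniform distribution has entropy close to log(3) ≈ 1.10 nats and is dropped, while the peaked distribution (about 0.39 nats) survives to the later distillation step.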

[340] EgoFun3D: Modeling Interactive Objects from Egocentric Videos using Function Templates

Weikun Peng,Denys Iliash,Manolis Savva

Main category: cs.CV

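TL;DR: EgoFun3D introduces a new task, dataset, and benchmark for modeling interactive 3D objects from egocentric videos, proposes "function templates" to describe functional relationships between parts, and builds a four-stage processing pipeline.

Details Motivation: Interactive 3D objects are essential for embodied AI but scarce, while real-world egocentric videos are abundant; simulation-ready interactive 3D models therefore need to be extracted automatically from such videos. Method: A four-stage pipeline is proposed: 2D part segmentation, 3D reconstruction, articulation estimation, and function template inference; structured "function templates" represent cross-part functional mappings (for example, rotating a stove knob controls burner temperature) and support precise evaluation and cross-platform code generation. Result: A dataset of 271 videos is built with annotations for 3D geometry, 2D and 3D segmentation, articulation, and function templates; benchmarking shows that existing methods perform poorly on the task, confirming its difficulty. Conclusion: EgoFun3D opens a new direction for understanding physical interaction from egocentric video; function templates offer an evaluable, executable form of semantic modeling and bring embodied intelligence and simulation environments closer together. Abstract: We present EgoFun3D, a coordinated task formulation, dataset, and benchmark for modeling interactive 3D objects from egocentric videos. Interactive objects are of high interest for embodied AI but scarce, making modeling from readily available real-world videos valuable. Our task focuses on obtaining simulation-ready interactive 3D objects from egocentric video input. While prior work largely focuses on articulations, we capture general cross-part functional mappings (e.g., rotation of stove knob controls stove burner temperature) through function templates, a structured computational representation. Function templates enable precise evaluation and direct compilation into executable code across simulation platforms. To enable comprehensive benchmarking, we introduce a dataset of 271 egocentric videos featuring challenging real-world interactions with paired 3D geometry, segmentation over 2D and 3D, articulation and function template annotations. To tackle the task, we propose a 4-stage pipeline consisting of: 2D part segmentation, reconstruction, articulation estimation, and function template inference. Comprehensive benchmarking shows that the task is challenging for off-the-shelf methods, highlighting avenues for future work.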

[341] Improving Layout Representation Learning Across Inconsistently Annotated Datasets via Agentic Harmonization

Renyu Li,Vladimir Kirilenko,Yao You,Crag Wolfe

Main category: cs.CV

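TL;DR: This paper proposes a vision-language-model-based label harmonization workflow that addresses the performance loss caused by inconsistent annotation standards (such as differences in category semantics and bounding box granularity) when combining multiple datasets for object detection, and validates it on document layout detection.

Details Motivation: When fine-tuning object detection models, mixing several datasets often hurts performance because their annotation standards are incompatible (for example, semantically identical categories with different spatial definitions); this is especially pronounced in settings with large annotation differences such as document layout detection. Method: An agentic label harmonization workflow uses a vision-language model to unify category semantics and bounding box granularity across datasets, aligning annotations before training; it is evaluated on document layout detection. Result: On SCORE-Bench, table TEDS rises from 0.750 (naive mixing) to 0.814, detection F-score rises from 0.860 to 0.883, and mean bounding box overlap falls from 0.043 to 0.016; representation analysis shows more compact and separable embeddings after harmonization. Conclusion: Annotation inconsistency distorts the feature space a model learns, and harmonizing labels before training markedly improves detection performance and representation quality, providing a reliable preprocessing approach for combining multi-source data. Abstract: Fine-tuning object detection (OD) models on combined datasets assumes annotation compatibility, yet datasets often encode conflicting spatial definitions for semantically equivalent categories. We propose an agentic label harmonization workflow that uses a vision-language model to reconcile both category semantics and bounding box granularity across heterogeneous sources before training. We evaluate on document layout detection as a challenging case study, where annotation standards vary widely across corpora. Without harmonization, naïve mixed-dataset fine-tuning degrades a pretrained RT-DETRv2 detector: on SCORE-Bench, which measures how accurately the full document conversion pipeline reproduces ground-truth structure, table TEDS drops from 0.800 to 0.750. Applied to two corpora whose 16 and 10 category taxonomies share only 8 direct correspondences, harmonization yields consistent gains across content fidelity, table structure, and spatial consistency: detection F-score improves from 0.860 to 0.883, table TEDS improves to 0.814, and mean bounding box overlap drops from 0.043 to 0.016. Representation analysis further shows that harmonized training produces more compact and separable post-decoder embeddings, confirming that annotation inconsistency distorts the learned feature space and that resolving it before training restores representation structure.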

[342] Lightweight Low-Light Image Enhancement via Distribution-Normalizing Preprocessing and Depthwise U-Net

Shimon Murai,Teppei Kurita,Ryuta Satoh,Yusuke Moriuchi

Main category: cs.CV

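TL;DR: This paper proposes a lightweight two-stage framework for low-light image enhancement that combines frozen algorithmic preprocessing with a compact U-Net built from depthwise-separable convolutions, achieving competitive perceptual quality with far fewer parameters.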

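Details Motivation: Existing low-light image enhancement methods have large parameter counts and high computational cost, which makes them hard to deploy on resource-constrained devices; model complexity needs to drop substantially while enhancement quality stays high. Method: A two-stage design: the first stage is frozen, conventional algorithmic preprocessing that produces complementary brightness-corrected views to normalize the input distribution; the second stage is a lightweight U-Net built solely from depthwise-separable convolutions that focuses on residual color correction. Result: The method placed 4th in the CVPR 2026 NTIRE Efficient Low-Light Image Enhancement Challenge, and extended benchmarks and ablations support its effectiveness and generality. Conclusion: The framework shows that co-designing algorithmic and learned modules can deliver strong low-light enhancement with very few parameters, offering a feasible route to real-time applications on edge devices. Abstract: We present a lightweight two-stage framework for low-light image enhancement (LLIE) that achieves competitive perceptual quality with significantly fewer parameters than existing methods. Our approach combines frozen algorithm-based preprocessing with a compact U-Net built entirely from depthwise-separable convolutions. The preprocessing normalizes the input distribution by providing complementary brightness-corrected views, enabling the trainable network to focus on residual color correction. Our method achieved 4th place in the CVPR 2026 NTIRE Efficient Low-Light Image Enhancement Challenge. We further provide extended benchmarks and ablations to demonstrate the general effectiveness of our methods.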
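
To make the parameter savings of depthwise-separable convolutions concrete, the following minimal sketch counts weights for a single layer. The layer sizes (64 input channels, 64 output channels, 3x3 kernel) are illustrative assumptions and are not taken from the paper; biases are omitted.

```python
def conv_params(c_in, c_out, k):
    """Weight count of a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weight count of a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
standard = conv_params(64, 64, 3)
separable = depthwise_separable_params(64, 64, 3)
print(standard, separable, round(standard / separable, 2))
```

For this assumed layer the separable form needs roughly one eighth of the weights, which is the kind of reduction that lets a U-Net stay compact.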

[343] ReSpinQuant: Efficient Layer-Wise LLM Quantization via Subspace Residual Rotation Approximation

Suyoung Kim,Sunghyun Wee,Hyeonjin Kim,Kyomin Hwang,Hyunho Lee,Nojun Kwak

Main category: cs.CV

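TL;DR: This paper proposes ReSpinQuant, a framework that uses offline activation rotation fusion and residual subspace rotation to retain the high expressivity of layer-wise adaptation while sharply reducing inference overhead, reaching state-of-the-art performance under W4A4 and W3A3 quantization.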

Details Motivation: Global rotation methods have limited expressivity, whereas layer-wise rotation methods are more accurate but introduce significant online computation; a method is needed that delivers both accuracy and efficiency. Method: ReSpinQuant combines offline activation rotation fusion with basis matching via efficient residual subspace rotation, keeping layer-adaptive expressivity at low inference cost. Result: Under W4A4 and W3A3 quantization, ReSpinQuant reaches state-of-the-art performance, matching the accuracy of computationally expensive layer-wise methods with minimal overhead. Conclusion: ReSpinQuant reconciles expressivity with inference efficiency in rotation-based PTQ, providing a practical approach to low-bit LLM quantization. Abstract: Rotation-based Post-Training Quantization (PTQ) has emerged as a promising solution for mitigating activation outliers in the quantization of Large Language Models (LLMs). Global rotation methods achieve inference efficiency by fusing activation rotations into attention and FFN blocks, but suffer from limited expressivity as they are constrained to use a single learnable rotation matrix across all layers. To tackle this, layer-wise transformation methods emerged, achieving superior accuracy through localized adaptation. However, layer-wise methods cannot fuse activation rotation matrices into weights, requiring online computations and causing significant overhead. In this paper, we propose ReSpinQuant, a quantization framework that resolves such overhead by leveraging offline activation rotation fusion and matching basis using efficient residual subspace rotation. This design reconciles the high expressivity of layer-wise adaptation with only negligible inference overhead. Extensive experiments on W4A4 and W3A3 quantization demonstrate that ReSpinQuant achieves state-of-the-art performance, outperforming global rotation methods and matching the accuracy of computationally expensive layer-wise methods with minimal overhead.
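
The general identity that makes rotation fusion possible can be checked in a few lines: for an orthogonal matrix R, rotating the activation by R and the weight by its transpose leaves the layer output unchanged, so the weight-side rotation can be applied offline. The sketch below illustrates only this generic property with a toy 2x2 rotation; it does not implement ReSpinQuant's residual subspace rotation, and all values are made up.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


theta = 0.7  # arbitrary angle, for illustration only
R = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]   # orthogonal: R @ R^T = I

x = [[3.0, -1.0]]                 # toy 1 x 2 activation
W = [[0.5, 2.0], [-1.5, 0.25]]    # toy 2 x 2 weight

y_ref = matmul(x, W)                       # original layer output
W_rot = matmul(transpose(R), W)            # rotation folded into the weight offline
y_rot = matmul(matmul(x, R), W_rot)        # rotated activation times fused weight

print(y_ref, y_rot)  # equal up to floating-point error
```

Quantizing the rotated activation `x @ R` instead of `x` is what spreads outliers across channels; the sketch only shows why the full-precision output is preserved.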

[344] MapATM: Enhancing HD Map Construction through Actor Trajectory Modeling

Mingyang Li,Brian Lee,Rui Zuo,Brent Bacchus,Priyantha Mudalige,Qinru Qiu

Main category: cs.CV

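TL;DR: This paper proposes MapATM, a deep neural network that uses historical vehicle trajectories as a prior on road geometry to improve lane detection accuracy, with clear gains on the NuScenes dataset.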

Details Motivation: Lane detection and prediction for HD map construction suffer under non-ideal conditions such as view occlusions, poor visibility of distant lanes, and adverse weather, which lowers detection accuracy and system reliability. Method: MapATM incorporates the historical trajectories of moving vehicles (actors) into a deep neural network as structural priors for road geometry to strengthen lane detection. Result: On NuScenes, lane-divider AP improves by 4.6 (a 10.1% relative gain) and overall mAP by 2.6 (a 6.1% relative gain); qualitative analysis shows stable, robust map reconstruction in complex driving scenarios. Conclusion: MapATM uses trajectory priors to improve the robustness and accuracy of lane detection and has practical value for autonomous driving. Abstract: High-definition (HD) mapping tasks, which perform lane detections and predictions, are extremely challenging due to non-ideal conditions such as view occlusions, distant lane visibility, and adverse weather conditions. Those conditions often result in compromised lane detection accuracy and reduced reliability within autonomous driving systems. To address these challenges, we introduce MapATM, a novel deep neural network that effectively leverages historical actor trajectory information to improve lane detection accuracy, where actors refer to moving vehicles. By utilizing actor trajectories as structural priors for road geometry, MapATM achieves substantial performance enhancements, notably increasing AP by 4.6 for lane dividers and mAP by 2.6 on the challenging NuScenes dataset, representing relative improvements of 10.1% and 6.1%, respectively, compared to strong baseline methods. Extensive qualitative evaluations further demonstrate MapATM's capability to consistently maintain stable and robust map reconstruction across diverse and complex driving scenarios, underscoring its practical value for autonomous driving applications.
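
The absolute and relative gains quoted above can be related with simple arithmetic: a relative gain is the absolute gain divided by the baseline score, so the baseline is implied by the two reported numbers. The sketch below back-computes it. The resulting baselines are rough estimates derived from rounded figures, not values reported in the abstract.

```python
def implied_baseline(absolute_gain, relative_gain):
    """Baseline score implied by relative_gain = absolute_gain / baseline."""
    return absolute_gain / relative_gain


# Inputs are the rounded figures from the abstract; outputs are estimates only.
divider_ap_baseline = implied_baseline(4.6, 0.101)
overall_map_baseline = implied_baseline(2.6, 0.061)
print(round(divider_ap_baseline, 1), round(overall_map_baseline, 1))
```

Both implied baselines land in the mid-40s, which is a quick sanity check that the reported absolute and relative improvements are mutually consistent.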

[345] RESP: Reference-guided Sequential Prompting for Visual Glitch Detection in Video Games

Yakun Yu,Ashley Wiens,Adrián Barahona-Ríos,Benedict Wilkins,Saman Zadtootaghaj,Nabajeet Barman,Cor-Paul Bezemer

Main category: cs.CV

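TL;DR: This paper proposes RESP, a framework that uses vision-language models (VLMs) for multi-frame detection of visual glitches in gameplay; reference-guided prompting yields robust video-level detection without fine-tuning the VLM, and the approach is validated on several datasets.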

Details Motivation: Visual glitches in video games seriously harm player experience, yet manual QA cannot cover the growing test surface of modern games; existing VLM-based automation is mostly limited to single frames or weak video-level baselines and is not robust to realistic scene variation. Method: RESP is a multi-frame detection framework built on reference-guided prompting: for each test frame, a reference frame from earlier in the same video is selected as a visual baseline, turning detection into a within-video frame comparison; reference/test pairs are fed to the VLM sequentially and the noisy frame predictions are aggregated into a stable video-level decision, without fine-tuning the VLM; a synthetic dataset, RefGlitch, is built for controlled analysis. Result: Experiments with 5 VLMs on 3 datasets (1 synthetic, 2 real-world) show that reference guidance markedly improves frame-level detection and that the gain carries over reliably to better video-level glitch triage. Conclusion: Reference guidance is a simple, effective strategy that makes VLM-based video-level glitch detection more robust and practical, giving game QA automation a scalable, fine-tuning-free approach. Abstract: Visual glitches in video games degrade player experience and perceived quality, yet manual quality assurance cannot scale to the growing test surface of modern game development. Prior automation efforts, particularly those using vision-language models (VLMs), largely operate on single frames or rely on limited video-level baselines that struggle under realistic scene variation, making robust video-level glitch detection challenging. We present RESP, a practical multi-frame framework for gameplay glitch detection with VLMs. Our key idea is reference-guided prompting: for each test frame, we select a reference frame from earlier in the same video, establishing a visual baseline and reframing detection as within-video comparison rather than isolated classification. RESP sequentially prompts the VLM with reference/test pairs and aggregates noisy frame predictions into a stable video-level decision without fine-tuning the VLM. To enable controlled analysis of reference effects, we introduce RefGlitch, a synthetic dataset of manually labeled reference/test frame pairs with balanced coverage across five glitch types. Experiments across five VLMs and three datasets (one synthetic, two real-world) show that reference guidance consistently strengthens frame-level detection and that the improved frame-level evidence reliably transfers to stronger video-level triage under realistic QA conditions. Code and data are available at: https://github.com/PipiZong/RESP_code.git.
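
The pairing and aggregation steps described above can be sketched as follows. The abstract does not specify how reference frames are chosen or how frame predictions are aggregated, so the fixed frame gap, the fraction threshold, and both function names are hypothetical stand-ins rather than the authors' procedure.

```python
def reference_test_pairs(num_frames, gap=2):
    """Pair each test frame index with an earlier frame of the same video.

    The fixed `gap` is an assumption; the paper may select references differently.
    """
    return [(t - gap, t) for t in range(gap, num_frames)]


def video_level_decision(frame_flags, min_fraction=0.3):
    """Aggregate noisy per-frame glitch flags into a single video-level flag.

    Flags the video when at least `min_fraction` of frames were flagged.
    The threshold rule is an illustrative assumption.
    """
    if not frame_flags:
        return False
    return sum(frame_flags) / len(frame_flags) >= min_fraction


pairs = reference_test_pairs(5, gap=2)
noisy_flags = [True, True, False, False]   # made-up per-pair VLM outputs
print(pairs, video_level_decision(noisy_flags))
```

In a real pipeline each pair would be sent to the VLM as one prompt, and the returned yes/no answers would replace `noisy_flags`.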

[346] FlowCoMotion: Text-to-Motion Generation via Token-Latent Flow Modeling

Dawei Guan,Di Yang,Chengjie Jin,Jiangtao Wang

Main category: cs.CV

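TL;DR: This paper proposes FlowCoMotion, a framework that couples continuous latent modeling with discrete semantic modeling (token-latent coupling) so that text-driven motion generation achieves both semantic alignment and faithful motion detail.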

Details Motivation: Existing text-to-motion methods are limited because continuous representations entangle dynamics with semantics while discrete representations lose fine-grained motion detail; a unified model that keeps the strengths of both is needed. Method: FlowCoMotion uses two branches: the latent branch regularizes the continuous latent space with multi-view distillation; the token branch extracts high-level semantics with discrete temporal resolution quantization; a token-latent coupling network fuses the two; a velocity field is then predicted from the text condition, and an ODE solver integrates it from a prior distribution to generate the target motion. Result: Competitive performance on mainstream text-to-motion benchmarks, including HumanML3D and SnapMoGen. Conclusion: FlowCoMotion unifies continuous and discrete motion representations and strikes a better balance between semantic alignment and motion fidelity, offering a new approach to text-driven motion generation. Abstract: Text-to-motion generation is driven by learning motion representations for semantic alignment with language. Existing methods rely on either continuous or discrete motion representations. However, continuous representations entangle semantics with dynamics, while discrete representations lose fine-grained motion details. In this context, we propose FlowCoMotion, a novel motion generation framework that unifies both treatments from a modeling perspective. Specifically, FlowCoMotion employs token-latent coupling to capture both semantic content and high-fidelity motion details. In the latent branch, we apply multi-view distillation to regularize the continuous latent space, while in the token branch we use discrete temporal resolution quantization to extract high-level semantic cues. The motion latent is then obtained by combining the representations from the two branches through a token-latent coupling network. Subsequently, a velocity field is predicted based on the textual conditions. An ODE solver integrates this velocity field from a simple prior, thereby guiding the sample to the potential state of the target motion. Extensive experiments show that FlowCoMotion achieves competitive performance on text-to-motion benchmarks, including HumanML3D and SnapMoGen.
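
The last step, integrating a predicted velocity field from a prior sample with an ODE solver, can be illustrated with a plain Euler integrator. This is a toy sketch: the constant "straight-line" velocity field below stands in for the learned, text-conditioned network, and the abstract does not say which solver or step count the authors use.

```python
def euler_integrate(velocity, x0, steps=100):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(x0, target):
    """Toy field that moves a sample along a straight line from x0 to target.

    A trained model would instead predict the velocity from (x, t, text).
    """
    diff = [b - a for a, b in zip(x0, target)]
    return lambda x, t: diff


prior_sample = [0.0, 0.0]        # stand-in for a draw from the simple prior
target_latent = [1.0, -2.0]      # stand-in for the target motion latent
result = euler_integrate(make_toy_velocity(prior_sample, target_latent), prior_sample)
print(result)  # close to target_latent
```

With a learned, state-dependent field the number of steps trades sampling speed against integration error; higher-order solvers are a common alternative.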

[347] Structured State-Space Regularization for Compact and Generation-Friendly Image Tokenization

Jinsung Lee,Jaemin Oh,Namhun Kim,Dongwon Kim,Byung-Jun Yoon,Suha Kwak

Main category: cs.CV

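TL;DR: This paper proposes a new regularizer that guides image tokenizers to mimic the hidden-state dynamics of state-space models (SSMs), injecting frequency awareness into latent features and making them more generation-friendly while keeping the representation compact.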

Details Motivation: Existing image tokenizers struggle to make the latent space both compact and generation-friendly, so representation efficiency must be balanced against ease of generative modeling. Method: Building on a theoretical analysis of SSMs, a new regularizer makes the tokenizer's latent features learn the frequency awareness of SSMs, so that they encode fine spatial structure and frequency-domain cues. Result: Generation quality in diffusion models improves markedly, with only a minimal loss in reconstruction fidelity. Conclusion: Bringing the frequency awareness of SSMs into image tokenization improves how efficiently the latent space is used and how well it suits generation, giving visual generative models a better underlying representation. Abstract: Image tokenizers are central to modern vision models as they often operate in latent spaces. An ideal latent space must be simultaneously compact and generation-friendly: it should capture an image's essential content compactly while remaining easy to model with generative approaches. In this work, we introduce a novel regularizer to align latent spaces with these two objectives. The key idea is to guide tokenizers to mimic the hidden state dynamics of state-space models (SSMs), thereby transferring their critical property, frequency awareness, to latent features. Grounded in a theoretical analysis of SSMs, our regularizer enforces encoding of fine spatial structures and frequency-domain cues into compact latent features, leading to more effective use of representation capacity and improved generative modelability. Experiments demonstrate that our method improves generation quality in diffusion models while incurring only minimal loss in reconstruction fidelity.

[348] LDEPrompt: Layer-importance guided Dual Expandable Prompt Pool for Pre-trained Model-based Class-Incremental Learning

Linjie Li,Zhenyu Wu,Huiyu Xiao,Yang Ji

Main category: cs.CV

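TL;DR: This paper proposes LDEPrompt, a layer-importance guided dual expandable prompt pool that enables adaptive layer selection and dynamic freezing and expansion of the prompt pool. It addresses the fixed prompt pools, manual selection of prompt embeddings, and heavy reliance on the pretrained backbone found in existing prompt-based class-incremental learning methods, and reaches state-of-the-art performance on several benchmarks.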

Details Motivation: Existing prompt-based class-incremental learning methods are limited by fixed prompt pools, manual selection of prompt embeddings, and heavy reliance on the pretrained backbone for prompt selection. Method: A Layer-importance guided Dual Expandable Prompt Pool (LDEPrompt) supports adaptive layer selection together with dynamic freezing and expansion of the prompt pool. Result: Extensive experiments on widely used class-incremental learning benchmarks show that LDEPrompt reaches state-of-the-art performance. Conclusion: LDEPrompt makes the prompting mechanism in class-incremental learning more flexible and robust, and is both effective and scalable. Abstract: Prompt-based class-incremental learning methods typically construct a prompt pool consisting of multiple trainable key-prompts and perform instance-level matching to select the most suitable prompt embeddings, which has shown promising results. However, existing approaches face several limitations, including fixed prompt pools, manual selection of prompt embeddings, and strong reliance on the pretrained backbone for prompt selection. To address these issues, we propose a Layer-importance guided Dual Expandable Prompt Pool (LDEPrompt), which enables adaptive layer selection as well as dynamic freezing and expansion of the prompt pool. Extensive experiments on widely used class-incremental learning benchmarks demonstrate that LDEPrompt achieves state-of-the-art performance, validating its effectiveness and scalability.

[349] CDPR: Cross-modal Diffusion with Polarization for Reliable Monocular Depth Estimation

Rongjia Yu,Tong Jia,Hao Wang,Xiaofang Li,Xiao Yang,Zinuo Zhang,Cuiwei Liu

Main category: cs.CV

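TL;DR: This paper proposes CDPR, a framework that fuses RGB and polarization images (AoLP/DoLP) through a confidence-aware gating mechanism inside a diffusion model for more robust monocular depth estimation, performing especially well in textureless, transparent, and specular regions, and extending to surface normal prediction.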

Details Motivation: Monocular depth estimation performs poorly in complex scenes with textureless, transparent, or specular surfaces; existing diffusion-based methods rely on RGB input alone and lack sufficient cues. Method: CDPR encodes RGB and polarization images (AoLP/DoLP) into a shared latent space with a pre-trained VAE and fuses the modalities dynamically through a learnable confidence-aware gating mechanism that suppresses polarization noise while preserving key physical cues; the fused latent is then used for diffusion-based depth estimation. Result: On synthetic and real-world datasets, CDPR clearly outperforms RGB-only baselines in challenging regions while remaining competitive in standard scenes; with minor modification it also extends to surface normal prediction. Conclusion: Physically grounded polarization priors make diffusion-based monocular depth estimation more robust and more general in complex scenes, supporting the potential of multimodal diffusion frameworks for dense prediction. Abstract: Monocular depth estimation is a fundamental yet challenging task in computer vision, especially under complex conditions such as textureless surfaces, transparency, and specular reflections. Recent diffusion-based approaches have significantly advanced performance by reformulating depth prediction as a denoising process in the latent space. However, existing methods rely solely on RGB inputs, which often lack sufficient cues in challenging regions. In this work, we present CDPR - Cross-modal Diffusion with Polarization for Reliable Monocular Depth Estimation - a novel diffusion-based framework that integrates physically grounded polarization priors to enhance estimation robustness. Specifically, we encode both RGB and polarization (AoLP/DoLP) images into a shared latent space via a pre-trained Variational Autoencoder (VAE), and dynamically fuse multi-modal information through a learnable confidence-aware gating mechanism. This fusion module adaptively suppresses noisy signals in polarization inputs while preserving informative cues, particularly around reflective or transparent surfaces, and provides the integrated latent representation for subsequent monocular depth estimation. Beyond depth estimation, we further verify that our framework can be easily generalized to surface normal prediction with minimal modification, showcasing its scalability to general polarization-guided dense prediction tasks. Experiments on both synthetic and real-world datasets validate that CDPR significantly outperforms RGB-only baselines in challenging regions while maintaining competitive performance in standard scenes.

[350] Efficient Transceiver Design for Aerial Image Transmission and Large-scale Scene Reconstruction

Zeyi Ren,Jialin Dong,Wei Zuo,Yikun Wang,Bingyang Cheng,Sheng Zhou,Zhisheng Niu

Main category: cs.CV

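TL;DR: This paper proposes a deep-learning-based end-to-end wireless image transmission scheme for large-scale 3D scene reconstruction in low-altitude intelligent networks; it embeds 3D Gaussian Splatting (3DGS) directly in the training process, jointly optimizes the communication modules to improve reconstruction quality, and supports sparse pilots to cut overhead.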

Details Motivation: Large-scale 3D scene reconstruction in low-altitude intelligent networks (LAIN) places high demands on the efficiency and accuracy of wireless image transmission, and existing methods struggle to balance pilot overhead against reconstruction fidelity. Method: A deep-learning-driven end-to-end transceiver design embeds 3D Gaussian Splatting (3DGS) in the training pipeline as a differentiable rendering module, jointly optimizes the communication link with the 3DGS rendering loss, and adopts a sparse pilot scheme. Result: Experiments on real-world aerial image datasets show that the method clearly outperforms existing baselines, recovering images and reconstructing 3D scenes more accurately while reducing pilot overhead. Conclusion: Task-driven end-to-end joint optimization improves the overall performance of image transmission and 3D reconstruction over low-altitude channels, pointing toward efficient integrated sensing and communication in LAIN. Abstract: Large-scale three-dimensional (3D) scene reconstruction in low-altitude intelligent networks (LAIN) demands highly efficient wireless image transmission. However, existing schemes struggle to balance severe pilot overhead with the transmission accuracy required to maintain reconstruction fidelity. To strike a balance between efficiency and reliability, this paper proposes a novel deep learning-based end-to-end (E2E) transceiver design that integrates 3D Gaussian Splatting (3DGS) directly into the training process. By jointly optimizing the communication modules via the combined 3DGS rendering loss, our approach explicitly improves scene recovery quality. Furthermore, this task-driven framework enables the use of a sparse pilot scheme, significantly reducing transmission overhead while maintaining robust image recovery under low-altitude channel conditions. Extensive experiments on real-world aerial image datasets demonstrate that the proposed E2E design significantly outperforms existing baselines, delivering superior transmission performance and accurate 3D scene reconstructions.

[351] OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video

Junfu Pu,Yuxin Chen,Teng Wang,Ying Shan

Main category: cs.CV

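TL;DR: This paper introduces the new video-to-script (V2S) task, builds the first human-annotated long-video benchmark together with a temporally-aware hierarchical evaluation framework, and proposes OmniScript, a lightweight multimodal large language model whose long-video script generation performance is comparable to Gemini 3-Pro.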

Details Motivation: Existing MLLMs perform well on short-video understanding but struggle to turn long cinematic videos into temporally precise, richly detailed scene-by-scene scripts. Method: Build the first human-annotated long-video V2S benchmark; design a temporally-aware hierarchical evaluation framework; propose the 8B-parameter OmniScript model, trained progressively with chain-of-thought supervised fine-tuning followed by reinforcement learning with temporally segmented rewards. Result: OmniScript significantly outperforms larger open-source models in temporal localization and multi-field semantic accuracy and is comparable to frontier proprietary models such as Gemini 3-Pro. Conclusion: OmniScript shows that parameter efficiency, temporal awareness, and hierarchical modeling are effective for long-video narrative understanding, and offers a new approach to structured generation of film and television content. Abstract: Current multimodal large language models (MLLMs) have demonstrated remarkable capabilities in short-form video understanding, yet translating long-form cinematic videos into detailed, temporally grounded scripts remains a significant challenge. This paper introduces the novel video-to-script (V2S) task, aiming to generate hierarchical, scene-by-scene scripts encompassing character actions, dialogues, expressions, and audio cues. To facilitate this, we construct a first-of-its-kind human-annotated benchmark and propose a temporally-aware hierarchical evaluation framework. Furthermore, we present OmniScript, an 8B-parameter omni-modal (audio-visual) language model tailored for long-form narrative comprehension. OmniScript is trained via a progressive pipeline that leverages chain-of-thought supervised fine-tuning for plot and character reasoning, followed by reinforcement learning using temporally segmented rewards. Extensive experiments demonstrate that despite its parameter efficiency, OmniScript significantly outperforms larger open-source models and achieves performance comparable to state-of-the-art proprietary models, including Gemini 3-Pro, in both temporal localization and multi-field semantic accuracy.

[352] Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding

Yueying Li,Fengxiang Wang,Yan Li,Mingshuo Chen,Mengying Zhao,Long Lan

Main category: cs.CV

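TL;DR: This paper proposes DualComp, a task-adaptive dual-stream visual token compression framework that tackles the heavy computational cost of processing ultra-high-resolution remote sensing imagery, substantially improving inference efficiency while preserving accuracy.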

Details Motivation: Existing visual token compression methods use static, uniform strategies and ignore the "Semantic-Geometric Duality" of remote sensing interpretation tasks: object semantic tasks need small objects preserved and background pruned, whereas scene geometric tasks depend on intact spatial topology. Method: DualComp is a dual-stream framework: 1) it is guided dynamically by a lightweight pre-trained router; 2) the semantic stream uses the Spatially-Contiguous Semantic Aggregator (SCSA), which applies size-adaptive clustering to compress background and protect small objects; 3) the geometric stream uses the Instruction-Guided Structure Recoverer (IGSR), which reconstructs spatial skeletons through greedy path tracing. Result: On the UHR remote sensing benchmark XLRS-Bench, DualComp delivers high-fidelity remote sensing interpretation at very low computational cost, improving both efficiency and accuracy. Conclusion: By decoupling the semantic and geometric processing paths, DualComp models the dual nature of remote sensing tasks and offers a new approach to efficient inference with multimodal remote sensing models. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated immense potential in Earth observation. However, the massive visual tokens generated when processing Ultra-High-Resolution (UHR) imagery introduce prohibitive computational overhead, severely bottlenecking their inference efficiency. Existing visual token compression methods predominantly adopt static and uniform compression strategies, neglecting the inherent "Semantic-Geometric Duality" in remote sensing interpretation tasks. Specifically, object semantic tasks focus on the abstract semantics of objects and benefit from aggressive background pruning, whereas scene geometric tasks critically rely on the integrity of spatial topology. To address this challenge, we propose DualComp, a task-adaptive dual-stream token compression framework. Dynamically guided by a lightweight pre-trained router, DualComp decouples feature processing into two dedicated pathways. In the object semantic stream, the Spatially-Contiguous Semantic Aggregator (SCSA) utilizes size-adaptive clustering to aggregate redundant background while protecting small objects. In the scene geometric stream, the Instruction-Guided Structure Recoverer (IGSR) introduces a greedy path-tracing topology completion mechanism to reconstruct spatial skeletons. Experiments on the UHR remote sensing benchmark XLRS-Bench demonstrate that DualComp accomplishes high-fidelity remote sensing interpretation at an exceptionally low computational cost, achieving simultaneous improvements in both efficiency and accuracy.

[353] BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning

Zekun Qian,Ruize Han,Wei Feng

Main category: cs.CV

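TL;DR: This paper proposes BoxTuning, which renders colored bounding boxes and trajectory trails directly onto video frames as visual prompts instead of encoding coordinates as text, resolving the modality mismatch and high text overhead of object-information encoding in existing video multimodal large language models. The method cuts text token consumption by 87-93%, preserves full temporal resolution, and strengthens motion modeling; on several video QA benchmarks it improves spatially oriented tasks and mitigates the accuracy drop on reasoning-centric tasks.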

Details Motivation: Existing MLLMs encode video frames holistically and lack an explicit mechanism for fine-grained object grounding; serializing bounding box coordinates as text tokens creates a modality mismatch and a high token cost, which forces heavy temporal downsampling and discards dynamic detail. Method: BoxTuning renders colored bounding boxes and motion trajectory trails onto video frames as visual prompts and keeps only a short color-to-object legend as auxiliary text, so that object spatial-temporal information enters through the visual modality instead of the text modality. Result: On five video QA benchmarks (CLEVRER, Perception Test, STAR, NExT-QA, IntentQA), BoxTuning surpasses text-coordinate baselines on spatially oriented tasks and nearly eliminates the accuracy drop on reasoning-centric tasks; in practice it reduces text tokens by 87-93% while keeping full temporal resolution. Conclusion: Visual prompts such as rendered bounding boxes and trajectories convey object information to video MLLMs more naturally and efficiently than text coordinates, and are an effective way to improve object-level spatial-temporal understanding. Abstract: Object-level spatial-temporal understanding is essential for video question answering, yet existing multimodal large language models (MLLMs) encode frames holistically and lack explicit mechanisms for fine-grained object grounding. Recent work addresses this by serializing bounding box coordinates as text tokens, but this text-coordinate paradigm suffers from a fundamental modality mismatch: object information is inherently visual, yet encoding it as text incurs a high token cost that forces aggressive temporal downsampling. We propose BoxTuning, which resolves this mismatch by injecting object spatial-temporal information directly into the visual modality. Colored bounding boxes and trajectory trails are rendered onto video frames as visual prompts, with only a concise color-to-object legend retained as text. This reduces the token cost significantly, achieving 87-93% text token reduction in practice. It also preserves full temporal resolution, where the trajectory trails further encode inter-frame motion direction and speed within each keyframe, recovering fine-grained dynamics that text-coordinate methods are forced to discard. Experimental results on five video QA benchmarks (CLEVRER, Perception Test, STAR, NExT-QA, IntentQA) show that BoxTuning surpasses text-coordinate baselines on spatially oriented tasks and nearly eliminates the accuracy degradation observed on reasoning-centric tasks, establishing visual prompting as a more natural and efficient paradigm for conveying object information to video MLLMs.

[354] Sparse Hypergraph-Enhanced Frame-Event Object Detection with Fine-Grained MoE

Wei Bao,Yuehan Wang,Tianhang Zhou,Siqi Li,Yue Gao

Main category: cs.CV

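TL;DR: This paper proposes Hyper-FEOD, a framework that fuses RGB frames and event streams efficiently through Sparse Hypergraph-enhanced Cross-Modal Fusion (S-HCF) and a Fine-Grained Mixture of Experts (FG-MoE) module, enabling accurate real-time object detection at low computational cost.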

Details Motivation: RGB cameras and event cameras are heterogeneous modalities with redundant data, which leads to high computational cost or poor feature fusion; an efficient, robust multimodal detection method is needed. Method: Hyper-FEOD has two components: 1) Sparse Hypergraph-enhanced Cross-Modal Fusion (S-HCF), which exploits event sparsity to build an activity map and performs high-order hypergraph modeling only on key sparse tokens; 2) Fine-Grained Mixture of Experts (FG-MoE), which uses a pixel-level spatial gating mechanism to route features to specialized hypergraph experts for boundaries, textures, and backgrounds, combined with a load-balancing loss and zero initialization for stable training. Result: On mainstream RGB-Event benchmarks, Hyper-FEOD achieves a better accuracy-efficiency trade-off and outperforms existing state-of-the-art methods while staying lightweight enough for real-time edge deployment. Conclusion: Sparse hypergraph modeling together with fine-grained expert routing eases the computational bottleneck and semantic heterogeneity of multimodal fusion, offering a new approach to efficient object detection in dynamic scenes. Abstract: Integrating frame-based RGB cameras with event streams offers a promising solution for robust object detection under challenging dynamic conditions. However, the inherent heterogeneity and data redundancy of these modalities often lead to prohibitive computational overhead or suboptimal feature fusion. In this paper, we propose Hyper-FEOD, a high-performance and efficient detection framework, which synergistically optimizes multi-modal interaction through two core components. First, we introduce Sparse Hypergraph-enhanced Cross-Modal Fusion (S-HCF), which leverages the inherent sparsity of event streams to construct an event-guided activity map. By performing high-order hypergraph modeling exclusively on selected motion-critical sparse tokens, S-HCF captures complex non-local dependencies between RGB and event data while overcoming the traditional complexity bottlenecks of hypergraph computation. Second, we design a Fine-Grained Mixture of Experts (FG-MoE) Enhancement module to address the diverse semantic requirements of different image regions. This module employs specialized hypergraph experts tailored for object boundaries, internal textures, and backgrounds, utilizing a pixel-level spatial gating mechanism to adaptively route and enhance features. Combined with a load-balancing loss and zero-initialization strategy, FG-MoE ensures stable training and precise feature refinement without disrupting the pre-trained backbone's distribution. Experimental results on mainstream RGB-Event benchmarks demonstrate that Hyper-FEOD achieves a superior accuracy-efficiency trade-off, outperforming state-of-the-art methods while maintaining a lightweight footprint suitable for real-time edge deployment.

[355] Naka-GS: A Bionics-inspired Dual-Branch Naka Correction and Progressive Point Pruning for Low-Light 3DGS

Runyu Zhu,SiXun Dong,Zhiqiang Zhang,Qingxia Ye,Zhihua Xu

Main category: cs.CV

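TL;DR: This paper proposes NAKA-GS, a bionics-inspired framework for low-light 3D Gaussian Splatting that uses a Naka-guided chroma-correction network and a lightweight point preprocessing module to jointly improve photometric restoration and geometric initialization under low light.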

Details Motivation: Low-light conditions severely harm 3D reconstruction by reducing image visibility, distorting color, and contaminating geometric priors, so a robust method that handles both photometric restoration and geometric initialization is needed. Method: The NAKA-GS framework has three parts: 1) a Naka-guided chroma-correction network that combines physics-prior enhancement, dual-branch modeling, frequency-decoupled correction, and mask-guided optimization; 2) a feed-forward multi-view reconstruction model that produces dense scene priors; 3) a lightweight Point Preprocessing Module (PPM) that performs coordinate alignment, voxel pooling, and distance-adaptive progressive pruning. Result: In the NTIRE 3D Restoration and Reconstruction challenge the method clearly outperforms the baselines, improving restoration quality, training stability, and optimization efficiency without adding significant inference overhead. Conclusion: NAKA-GS is an efficient, robust method for low-light 3D reconstruction that models both photometry and geometry and has potential for practical use. Abstract: Low-light conditions severely hinder 3D restoration and reconstruction by degrading image visibility, introducing color distortions, and contaminating geometric priors for downstream optimization. We present NAKA-GS, a bionics-inspired framework for low-light 3D Gaussian Splatting that jointly improves photometric restoration and geometric initialization. Our method starts with a Naka-guided chroma-correction network, which combines physics-prior low-light enhancement, dual-branch input modeling, frequency-decoupled correction, and mask-guided optimization to suppress bright-region chromatic artifacts and edge-structure errors. The enhanced images are then fed into a feed-forward multi-view reconstruction model to produce dense scene priors. To further improve Gaussian initialization, we introduce a lightweight Point Preprocessing Module (PPM) that performs coordinate alignment, voxel pooling, and distance-adaptive progressive pruning to remove noisy and redundant points while preserving representative structures. Without introducing heavy inference overhead, NAKA-GS improves restoration quality, training stability, and optimization efficiency for low-light 3D reconstruction. The proposed method was presented in the NTIRE 3D Restoration and Reconstruction (3DRR) Challenge, and outperformed the baseline methods by a large margin. The code is available at https://github.com/RunyuZhu/Naka-GS
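
Of the three PPM operations, voxel pooling is the easiest to illustrate in isolation. The minimal sketch below merges all points that fall in the same cubic voxel into their centroid. The abstract does not state which pooling operator the authors use, so centroid averaging, the function name, and the sample points are assumptions made purely for illustration.

```python
import math
from collections import defaultdict


def voxel_pool(points, voxel_size):
    """Replace all 3D points that share a voxel with their centroid.

    Centroid pooling is an assumed operator, not confirmed by the paper.
    """
    buckets = defaultdict(list)
    for p in points:
        key = tuple(math.floor(c / voxel_size) for c in p)  # integer voxel index
        buckets[key].append(p)
    return [tuple(sum(axis) / len(group) for axis in zip(*group))
            for group in buckets.values()]


# Made-up points: the first two share a voxel, the third sits in another one.
cloud = [(0.1, 0.1, 0.1), (0.3, 0.3, 0.3), (1.2, 0.0, 0.0)]
pooled = sorted(voxel_pool(cloud, voxel_size=1.0))
print(pooled)
```

Pooling of this kind thins out redundant points before Gaussian initialization; the distance-adaptive pruning mentioned in the abstract would be a separate step.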

[356] rPPG-VQA: A Video Quality Assessment Framework for Unsupervised rPPG Training

Tianyang Dai,Ming Chang,Yan Chen,Yang Hu

Main category: cs.CV

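TL;DR: This paper proposes rPPG-VQA, a framework for assessing whether a video is suitable for unsupervised remote photoplethysmography (rPPG) modeling. It combines signal-level analysis (SNR estimation with multi-method consensus) and scene-level analysis (a multimodal large language model that identifies interference), and adds a two-stage adaptive sampling strategy to curate training data, substantially improving the accuracy of unsupervised rPPG models on standard benchmarks.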

Details Motivation: Existing unsupervised rPPG methods degrade severely on low-quality "in-the-wild" videos, while conventional video quality assessment (VQA) targets human perception and does not measure the physiological-signal usability that rPPG needs. Method: rPPG-VQA is a dual-branch assessment framework: the signal-level branch robustly estimates signal-to-noise ratio (SNR) with a multi-method consensus mechanism; the scene-level branch uses a multimodal large language model (MLLM) to identify interference such as motion and unstable lighting; a two-stage adaptive sampling (TAS) strategy then selects the best training videos according to the quality score. Result: Unsupervised rPPG models trained on large-scale in-the-wild videos filtered by the framework achieve a substantial accuracy improvement on standard benchmarks. Conclusion: rPPG-VQA fills the gap in video suitability assessment for rPPG and provides a reliable data-curation basis for high-quality unsupervised rPPG modeling. Abstract: Unsupervised remote photoplethysmography (rPPG) promises to leverage unlabeled video data, but its potential is hindered by a critical challenge: training on low-quality "in-the-wild" videos severely degrades model performance. An essential step missing here is to assess the suitability of the videos for rPPG model learning before using them for the task. Existing video quality assessment (VQA) methods are mainly designed for human perception and not directly applicable to the above purpose. In this work, we propose rPPG-VQA, a novel framework for assessing video suitability for rPPG. We integrate signal-level and scene-level analyses and design a dual-branch assessment architecture. The signal-level branch evaluates the physiological signal quality of the videos via robust signal-to-noise ratio (SNR) estimation with a multi-method consensus mechanism, and the scene-level branch uses a multimodal large language model (MLLM) to identify interferences like motion and unstable lighting. Furthermore, we propose a two-stage adaptive sampling (TAS) strategy that utilizes the quality score to curate optimal training datasets. Experiments show that by training on large-scale, "in-the-wild" videos filtered by our framework, we can develop unsupervised rPPG models that achieve a substantial improvement in accuracy on standard benchmarks. Our code is available at https://github.com/Tianyang-Dai/rPPG-VQA.

[357] Boxes2Pixels: Learning Defect Segmentation from Noisy SAM Masks

Camile Lendering,Erkut Akdag,Egor Bondarev

Main category: cs.CV

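TL;DR: This paper proposes Boxes2Pixels, a framework that uses noisy SAM-generated pseudo-masks to achieve robust pixel-level segmentation of industrial defects under weak supervision with only bounding box annotations. A hierarchical decoder, an auxiliary localization head, and a one-sided online self-correction mechanism markedly improve segmentation accuracy and recall while reducing the parameter count.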

Details Motivation: Industrial defect segmentation often lacks dense pixel-level annotations, and converting bounding boxes into pseudo-masks with foundation models such as SAM produces systematic noise (for example, hallucinated background structure and missed sparse defects) that hurts performance. Method: Boxes2Pixels treats SAM as a noisy teacher: SAM pseudo-masks are generated offline, and a student model built on frozen DINOv2 features uses a hierarchical decoder and an auxiliary binary localization head, together with a one-sided online self-correction mechanism that relaxes background supervision when the student is confident, to counter teacher false negatives. Result: On a manually annotated wind turbine inspection benchmark, anomaly mIoU improves by +6.97 and binary IoU by +9.71 over the strongest weakly supervised baseline; online self-correction raises binary recall by +18.56; the model has 80% fewer trainable parameters. Conclusion: Boxes2Pixels mitigates the noise of SAM pseudo-labels in industrial settings and delivers more robust, efficient, and higher-recall weakly supervised defect segmentation. Abstract: Accurate defect segmentation is critical for industrial inspection, yet dense pixel-level annotations are rarely available. A common workaround is to convert inexpensive bounding boxes into pseudo-masks using foundation segmentation models such as the Segment Anything Model (SAM). However, these pseudo-labels are systematically noisy on industrial surfaces, often hallucinating background structure while missing sparse defects. To address this limitation, a noise-robust box-to-pixel distillation framework, Boxes2Pixels, is proposed that treats SAM as a noisy teacher rather than a source of ground-truth supervision. Bounding boxes are converted into pseudo-masks offline by SAM, and a compact student is trained with (i) a hierarchical decoder over frozen DINOv2 features for semantic stability, (ii) an auxiliary binary localization head to decouple sparse foreground discovery from class prediction, and (iii) a one-sided online self-correction mechanism that relaxes background supervision when the student is confident, targeting teacher false negatives. On a manually annotated wind turbine inspection benchmark, the proposed Boxes2Pixels improves anomaly mIoU by +6.97 and binary IoU by +9.71 over the strongest baseline trained under identical weak supervision. Moreover, online self-correction increases the binary recall by +18.56, while the model employs 80% fewer trainable parameters. Code is available at https://github.com/CLendering/Boxes2Pixels.
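
The one-sided self-correction idea in (iii) can be sketched as a per-pixel loss weight: where the teacher mask says background but the student is confident the pixel is foreground, the background label is ignored, while teacher foreground labels are never overridden. The confidence threshold, the hard zero weight, and the function name are illustrative assumptions; the paper's exact rule may differ.

```python
def background_loss_weights(teacher_mask, student_prob, threshold=0.9):
    """Per-pixel loss weights for one-sided self-correction (illustrative).

    teacher_mask: 0/1 pseudo-labels from the noisy teacher (1 = defect).
    student_prob: student foreground probabilities in [0, 1].
    Only background labels can be relaxed, which targets teacher false negatives.
    """
    weights = []
    for label, prob in zip(teacher_mask, student_prob):
        if label == 0 and prob >= threshold:
            weights.append(0.0)   # likely missed defect: drop background supervision
        else:
            weights.append(1.0)   # keep the teacher's label as supervision
    return weights


# Made-up flattened pixels for illustration.
teacher = [1, 0, 0, 0]
student = [0.2, 0.95, 0.5, 0.1]
weights = background_loss_weights(teacher, student)
print(weights)
```

The correction is one-sided because a low student probability on a teacher-foreground pixel (the first pixel here) leaves the weight at 1.0, so the student cannot erase foreground it has not yet learned.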

[358] RADA: Region-Aware Dual-encoder Auxiliary learning for Barely-supervised Medical Image Segmentation

Shuang Zeng,Boxu Xie,Lei Zhu,Xinliang Zhang,Jiakui Hu,Zhengjian Yao,Yuanwei Li,Yuxing Lu,Yanye Lu

Main category: cs.CV

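TL;DR: This paper proposes RADA, a dual-encoder framework with Alpha-CLIP pre-training that combines fine-grained image-level features with text-level semantic guidance to improve 3D medical image segmentation under extremely sparse annotation.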

Details Motivation: Deep learning for medical image segmentation depends on large amounts of dense annotation, which is costly for 3D volumes; existing barely-supervised methods propagate pseudo-labels through geometric continuity alone and lack semantic understanding, so pseudo-label quality is low; pixel-level segmentation fundamentally depends on high-quality local visual features. Method: The Region-Aware Dual-encoder Auxiliary (RADA) learning framework uses a dual encoder pre-trained on Alpha-CLIP to extract fine-grained, region-specific visual features and fuses them with text-level semantic guidance for region-aware semantic supervision; it is integrated into a triple-view training framework for joint optimization. Result: RADA reaches SOTA performance under extremely sparse annotation on LA2018, KiTS19, and LiTS, showing strong generalization across datasets. Conclusion: RADA bridges image-level semantics and pixel-level segmentation, and shows that combining fine-grained visual features with text-level semantic guidance is key for barely-supervised medical image segmentation. Abstract: Deep learning has greatly advanced medical image segmentation, but its success relies heavily on fully supervised learning, which requires dense annotations that are costly and time-consuming for 3D volumetric scans. Barely-supervised learning reduces annotation burden by using only a few labeled slices per volume. Existing methods typically propagate sparse annotations to unlabeled slices through geometric continuity to generate pseudo-labels, but this strategy lacks semantic understanding, often resulting in low-quality pseudo-labels. Furthermore, medical image segmentation is inherently a pixel-level visual understanding task, where accuracy fundamentally depends on the quality of local, fine-grained visual features. Inspired by this, we propose RADA, a novel Region-Aware Dual-encoder Auxiliary learning pipeline which introduces a dual-encoder framework pre-trained on Alpha-CLIP to extract fine-grained, region-specific visual features from the original images and limited annotations. The framework combines image-level fine-grained visual features with text-level semantic guidance, providing region-aware semantic supervision that bridges image-level semantics and pixel-level segmentation. Integrated into a triple-view training framework, RADA achieves SOTA performance under extremely sparse annotation settings on LA2018, KiTS19 and LiTS, demonstrating robust generalization across diverse datasets.

[359] Do Instance Priors Help Weakly Supervised Semantic Segmentation?

Anurag Das,Anna Kukleva,Xinting Hu,Yuki M. Asano,Bernt Schiele

Main category: cs.CV

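TL;DR: SeSAM is a framework for semantic segmentation that combines weak labels (coarse masks, scribbles, and points) with the Segment Anything Model (SAM). It decomposes masks, samples points along skeletons, filters by coverage, and refines pseudo-labels iteratively, substantially lowering annotation cost while improving performance.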

Details Motivation: Semantic segmentation depends on dense pixel-level annotation, which is costly; SAM was designed for instance segmentation and cannot be used directly for class-level semantic segmentation, so it has to be adapted to weakly supervised settings. Method: SeSAM decomposes class masks into connected components, samples point prompts along object skeletons, selects SAM output masks by weak-label coverage, and refines labels iteratively with pseudo-labels; it is then embedded in a semi-supervised learning framework that trains jointly on ground-truth labels, SAM-based pseudo-labels, and high-confidence pseudo-labels. Result: Across multiple benchmarks and weak annotation types (coarse masks, scribbles, points), SeSAM consistently outperforms weakly supervised baselines while substantially reducing annotation cost. Conclusion: SeSAM adapts SAM to weakly supervised semantic segmentation, offering an efficient and practical way to reduce manual annotation effort. Abstract: Semantic segmentation requires dense pixel-level annotations, which are costly and time-consuming to acquire. To address this, we present SeSAM, a framework that uses a foundational segmentation model, i.e. Segment Anything Model (SAM), with weak labels, including coarse masks, scribbles, and points. SAM, originally designed for instance-based segmentation, cannot be directly used for semantic segmentation tasks. In this work, we identify specific challenges faced by SAM and determine appropriate components to adapt it for class-based segmentation using weak labels. Specifically, SeSAM decomposes class masks into connected components, samples point prompts along object skeletons, selects SAM masks using weak-label coverage, and iteratively refines labels using pseudo-labels, enabling SAM-generated masks to be effectively used for semantic segmentation. Integrated with a semi-supervised learning framework, SeSAM balances ground-truth labels, SAM-based pseudo-labels, and high-confidence pseudo-labels, significantly improving segmentation quality. Extensive experiments across multiple benchmarks and weak annotation types show that SeSAM consistently outperforms weakly supervised baselines while substantially reducing annotation cost relative to fine supervision.
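
Two of the steps named in the abstract above, decomposing a class mask into connected components and choosing among candidate masks by weak-label coverage, can be sketched in plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, 4-connectivity is assumed, coverage is taken to be the fraction of weak-label pixels that fall inside a candidate, and skeleton sampling and the SAM call itself are omitted. It is not the authors' implementation.

```python
from collections import deque


def connected_components(mask):
    """Split a non-empty binary mask (list of rows of 0/1) into 4-connected components.

    Returns a list of sets of (row, col) pixels, in raster-scan order of discovery.
    """
    rows, cols = len(mask), len(mask[0])
    seen = set()
    components = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or (r, c) in seen:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen.add((r, c))
            component = set()
            while queue:
                y, x = queue.popleft()
                component.add((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            components.append(component)
    return components


def select_by_coverage(weak_pixels, candidate_masks):
    """Return the candidate (a set of pixels) that covers the most weak-label pixels.

    Assumed criterion: |weak ∩ candidate| / |weak|. The paper may use a different rule.
    """
    def coverage(candidate):
        return len(weak_pixels & candidate) / len(weak_pixels)

    return max(candidate_masks, key=coverage)


# Toy coarse mask holding two separate objects of the same class.
coarse = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
components = connected_components(coarse)

# Hypothetical candidate masks, standing in for what a promptable model would return.
candidates = [
    {(0, 0), (0, 1)},
    {(0, 0), (0, 1), (1, 0), (1, 1), (1, 2)},
]
best = select_by_coverage(components[0], candidates)
```

Treating each component separately is what lets an instance-oriented model be prompted once per object even though the weak label only carries a class.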

[360] Development and evaluation of CADe systems in low-prevalence setting: The RARE25 challenge for early detection of Barrett's neoplasia

Tim J. M. Jaspers,Francisco Caetano,Cris H. B. Claessens,Carolus H. J. Kusters,Rixta A. H. van Eijck van Heslinga,Floor Slooter,Jacques J. Bergman,Peter H. N. De With,Martijn R. Jong,Albert J. de Groof,Fons van der Sommen

Main category: cs.CV

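TL;DR: RARE25 is a challenge that evaluates CADe systems for early detection of neoplasia in Barrett's esophagus in a low-prevalence setting. It emphasizes high sensitivity and clinical utility under realistic prevalence, shows that current fully supervised classifiers yield low PPV when positives are rare, and calls for prevalence-agnostic approaches such as anomaly detection.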

Details Motivation: Existing CADe systems perform well on balanced or enriched datasets, but their performance in realistic low-prevalence clinical settings is insufficiently characterized, so clinical value is easily overestimated. Method: Organizes the RARE25 challenge with a large-scale, prevalence-aware public training set and a hidden test set, and evaluates the deep learning methods submitted by 11 international teams (spanning different architectures, pretraining, ensembling, and calibration strategies) with operating-point-specific metrics that emphasize high sensitivity and account for prevalence. Result: Several methods show strong discriminative ability, but positive predictive value (PPV) is low for all of them; every submission relies on fully supervised classification, and none uses a prevalence-agnostic paradigm such as anomaly detection or one-class learning. Conclusion: Low prevalence strongly affects how the clinical utility of CADe systems should be assessed; methods that are robust to prevalence and suited to real surveillance workflows (such as anomaly detection) are needed, and the open dataset and reproducible framework are intended to support that research. Abstract: Computer-aided detection (CADe) of early neoplasia in Barrett's esophagus is a low-prevalence surveillance problem in which clinically relevant findings are rare. Although many CADe systems report strong performance on balanced or enriched datasets, their behavior under realistic prevalence remains insufficiently characterized. The RARE25 challenge addresses this gap by introducing a large-scale, prevalence-aware benchmark for neoplasia detection. It includes a public training set and a hidden test set reflecting real-world incidence. Methods were evaluated using operating-point-specific metrics emphasizing high sensitivity and accounting for prevalence. Eleven teams from seven countries submitted approaches using diverse architectures, pretraining, ensembling, and calibration strategies. While several methods achieved strong discriminative performance, positive predictive values remained low, highlighting the difficulty of low-prevalence detection and the risk of overestimating clinical utility when prevalence is ignored. All methods relied on fully supervised classification despite the dominance of normal findings, indicating a lack of prevalence-agnostic approaches such as anomaly detection or one-class learning. By releasing a public dataset and a reproducible evaluation framework, RARE25 aims to support the development of CADe systems robust to prevalence shift and suitable for clinical surveillance workflows.
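
The abstract's point that PPV stays low when prevalence is ignored follows from Bayes' rule, and a short calculation makes it concrete. The sketch below is illustrative only: the function name is hypothetical, and the sensitivity, specificity, and prevalence values are made-up operating points, not figures from the challenge.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """PPV = TP / (TP + FP) for a population with the given disease prevalence."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1.0 - specificity) * (1.0 - prevalence)
    flagged = true_positive_rate + false_positive_rate
    if flagged == 0.0:
        raise ValueError("no positive predictions at this operating point")
    return true_positive_rate / flagged


# Same hypothetical detector (90% sensitivity, 90% specificity), two test populations.
ppv_balanced = positive_predictive_value(0.90, 0.90, prevalence=0.50)
ppv_rare = positive_predictive_value(0.90, 0.90, prevalence=0.01)
```

With these assumed numbers the detector is unchanged, yet PPV falls from 0.9 on a balanced set to roughly 0.083 at 1% prevalence, which is why results on enriched datasets can overstate clinical utility.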

[361] Precision Synthesis of Multi-Tracer PET via VLM-Modulated Rectified Flow for Stratifying Mild Cognitive Impairment

Tuo Liu,Shuijin Lin,Shaozhen Yan,Haifeng Wang,Jie Lu,Jianhua Ma,Chunfeng Lian

Main category: cs.CV

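TL;DR: This paper proposes DIReCT++, a model that generates multi-tracer PET images from MRI combined with clinical information, improving the accuracy and scalability of early diagnosis and prognostic prediction for Alzheimer's disease.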

Details Motivation: Cost and radiation exposure limit the use of PET in clinical screening for Alzheimer's disease, so a low-cost, high-precision alternative is needed; existing generative models struggle to achieve subject-specific synthesis. Method: Proposes DIReCT++, which pairs a 3D rectified flow architecture with the domain-adapted vision-language model BiomedCLIP and fuses MRI with clinical scores to generate personalized multi-tracer PET (e.g., ¹⁸F-AV-45 and ¹⁸F-FDG). Result: On multi-center datasets, PET images generated by DIReCT++ show higher fidelity and generalizability and accurately reproduce disease-specific patterns; combined with MRI, they enable precise individual stratification of mild cognitive impairment (MCI). Conclusion: DIReCT++ offers a non-invasive, scalable, data-efficient approach to early diagnosis and prognostic prediction of Alzheimer's disease. Abstract: The biological definition of Alzheimer's disease (AD) relies on multi-modal neuroimaging, yet the clinical utility of positron emission tomography (PET) is limited by cost and radiation exposure, hindering early screening at preclinical or prodromal stages. While generative models offer a promising alternative by synthesizing PET from magnetic resonance imaging (MRI), achieving subject-specific precision remains a primary challenge. Here, we introduce DIReCT$++$, a Domain-Informed ReCTified flow model for synthesizing multi-tracer PET from MRI combined with fundamental clinical information. Our approach integrates a 3D rectified flow architecture to capture complex cross-modal and cross-tracer relationships with a domain-adapted vision-language model (BiomedCLIP) that provides text-guided, personalized generation using clinical scores and imaging knowledge. Extensive evaluations on multi-center datasets demonstrate that DIReCT$++$ not only produces synthetic PET images ($^{18}$F-AV-45 and $^{18}$F-FDG) of superior fidelity and generalizability but also accurately recapitulates disease-specific patterns. Crucially, combining these synthesized PET images with MRI enables precise personalized stratification of mild cognitive impairment (MCI), advancing a scalable, data-efficient tool for the early diagnosis and prognostic prediction of AD. The source code will be released on https://github.com/ladderlab-xjtu/DIReCT-PLUS.

[362] Do Thought Streams Matter? Evaluating Reasoning in Gemini Vision-Language Models for Video Scene Understanding

Shivam Sharma,Sankalp Nagaonkar,Ashish Choithani,Ashutosh Trivedi

Main category: cs.CV

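TL;DR: This paper benchmarks how internal reasoning traces (thought streams) affect video scene understanding. Quality gains from additional thinking plateau quickly, and Flash Lite offers the best balance between quality and token usage.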

Details Motivation: Investigates how internal reasoning traces (thought streams) affect video scene understanding in vision-language models, asking three questions: whether more thinking yields better outputs, where the gains stop, and what the models actually think about. Method: Runs four configurations of Google Gemini 2.5 Flash and Flash Lite on scenes extracted from 100 hours of video; introduces three new evaluation metrics (Contentfulness, Thought-Final Coverage, and Dominant Entity Analysis); uses GPT-5 as an independent judge. Result: Quality gains saturate within the first few hundred tokens; Flash Lite gives the best trade-off between quality and token efficiency; tight reasoning budgets cause "compression-step hallucination", where the final output contains content the model never reasoned about; although Flash and Flash Lite are different model tiers, their thought streams are similar in content but differ in style (Flash leans toward meta-reasoning, Lite toward scene description). Conclusion: Internal reasoning traces help video understanding but show clearly diminishing returns; reasoning style differs between model tiers; evaluating thought streams requires attention to content quality, coverage, and focus. Abstract: We benchmark how internal reasoning traces, which we call thought streams, affect video scene understanding in vision-language models. Using four configurations of Google's Gemini 2.5 Flash and Flash Lite across scenes extracted from 100 hours of video, we ask three questions: does more thinking lead to better outputs, where do the gains stop, and what do these models actually think about? We introduce three evaluation metrics. Contentfulness measures how much of the thought stream is useful scene content versus meta-commentary. Thought-Final Coverage measures how faithfully the thought stream translates into the final output. Dominant Entity Analysis identifies which subjects, actions, and settings the model focuses on. GPT-5 serves as an independent judge. We find that quality gains from additional thinking plateau quickly, with most improvement occurring in the first few hundred tokens. Flash Lite offers the best balance between quality and token usage. Tight reasoning budgets cause the model to add content in the final output that it never reasoned about, a form of compression-step hallucination. Despite being different model tiers, Flash and Flash Lite produce similar thought streams, though they differ in style: Flash discusses its reasoning process, while Lite focuses on describing the scene.

[363] Towards Adaptive Open-Set Object Detection via Category-Level Collaboration Knowledge Mining

Yuqi Ji,Junjie Ke,Lihuo He,Lizhi Wang,Xinbo Gao

Main category: cs.CV

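TL;DR: This paper proposes a category-level collaboration knowledge mining method for adaptive open-set object detection (AOOD). A clustering-based memory bank, a base-to-novel selection metric, and an adaptive feature assignment strategy mitigate weak cross-domain representations, ambiguity among novel categories, and source-domain bias, yielding clear gains on multiple benchmarks.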

Details Motivation: Existing adaptive open-set object detection methods are limited by weak cross-domain representations, semantic ambiguity among novel categories, and source-domain feature bias, which makes it hard to handle domain transfer and novel-category discovery together. Method: Proposes a category-level collaboration knowledge mining strategy: builds a clustering-based memory bank that encodes class prototypes, auxiliary features, and intra-class disparity; designs a base-to-novel selection metric to initialize novel-category classifiers; and introduces an adaptive feature assignment strategy that transfers knowledge and updates the memory bank asynchronously. Result: On multiple benchmark datasets, the method consistently surpasses existing SOTA AOOD methods by 1.1 to 5.5 mAP. Conclusion: Category-level collaboration knowledge mining strengthens cross-domain category representations, alleviates source-domain bias, and improves generalization to unseen novel categories, providing a more robust framework for AOOD. Abstract: Existing object detectors often struggle to generalize across domains while adapting to emerging novel categories. Adaptive open-set object detection (AOOD) addresses this challenge by training on base categories in the source domain and adapting to both base and novel categories in the target domain without target annotations. However, current AOOD methods remain limited by weak cross-domain representations, ambiguity among novel categories, and source-domain feature bias. To address these issues, we propose a category-level collaboration knowledge mining strategy that exploits both inter-class and intra-class relationships across domains. Specifically, we construct a clustering-based memory bank to encode class prototypes, auxiliary features, and intra-class disparity information, and iteratively update it via unsupervised clustering to enhance category-level knowledge representation. We further design a base-to-novel selection metric to discover source-domain features related to novel categories and use them to initialize novel-category classifiers. In addition, an adaptive feature assignment strategy transfers the learned category-level knowledge to the target domain and asynchronously updates the memory bank to alleviate source-domain bias. Extensive experiments on multiple benchmarks show that our method consistently surpasses state-of-the-art AOOD methods by 1.1-5.5 mAP.

[364] MedP-CLIP: Medical CLIP with Region-Aware Prompt Integration

Jiahui Peng,He Yao,Jingwen Li,Yanzhou Su,Sibo Ju,Yujie Lu,Jin Ye,Hongchun Lu,Xue Li,Lincheng Jiang,Min Zhu,Junlong Cheng

Main category: cs.CV

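TL;DR: This paper proposes MedP-CLIP, a region-aware medical vision-language model. It combines medical prior knowledge with a feature-level region prompt mechanism and is pre-trained on large-scale region-annotated data, enabling fine-grained understanding of anatomy and lesions and improving tasks such as zero-shot recognition and interactive segmentation.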

Details Motivation: Existing CLIP-style models are good at global image understanding, but medical image analysis hinges on fine-grained understanding of specific anatomical structures or lesion regions, which requires making effective use of region-of-interest (RoI) information supplied by clinicians or perception models. Method: Proposes MedP-CLIP with a feature-level region prompt integration mechanism that supports points, boxes, masks, and other prompt forms; incorporates medical prior knowledge; pre-trains on a large-scale medical dataset containing over 6.4 million images and 97.3 million region-level annotations. Result: Significantly outperforms baselines on zero-shot recognition, interactive segmentation, and empowering multimodal large language models, with fine-grained spatial semantic understanding across diseases and modalities. Conclusion: MedP-CLIP provides a scalable, plug-and-play visual backbone for medical AI that combines holistic image understanding with precise local analysis. Abstract: Contrastive Language-Image Pre-training (CLIP) has demonstrated outstanding performance in global image understanding and zero-shot transfer through large-scale text-image alignment. However, the core of medical image analysis often lies in the fine-grained understanding of specific anatomical structures or lesion regions. Therefore, precisely comprehending region-of-interest (RoI) information provided by medical professionals or perception models becomes crucial. To address this need, we propose MedP-CLIP, a region-aware medical vision-language model (VLM). MedP-CLIP innovatively integrates medical prior knowledge and designs a feature-level region prompt integration mechanism, enabling it to flexibly respond to various prompt forms (e.g., points, bounding boxes, masks) while maintaining global contextual awareness when focusing on local regions. We pre-train the model on a meticulously constructed large-scale dataset (containing over 6.4 million medical images and 97.3 million region-level annotations), equipping it with cross-disease and cross-modality fine-grained spatial semantic understanding capabilities. Experiments demonstrate that MedP-CLIP significantly outperforms baseline methods in various medical tasks, including zero-shot recognition, interactive segmentation, and empowering multimodal large language models. This model provides a scalable, plug-and-play visual backbone for medical AI, combining holistic image understanding with precise regional analysis.

[365] LoViF 2026 Challenge on Human-oriented Semantic Image Quality Assessment: Methods and Results

Xin Li,Daoli Xu,Wei Luo,Guoqiang Xiang,Haoran Li,Chengyu Zhuang,Zhibo Chen,Jian Guan,Weping Li,Weixia Zhang,Wei Sun,Zhihua Wang,Dandan Zhu,Chengguang Zhu,Ayush Gupta,Rachit Agarwal,Shouvik Das,Biplab Ch Das,Amartya Ghosh,Kanglong Fan,Wen Wen,Shuyan Zhai,Tianwu Zhi,Aoxiang Zhang,Jianzhao Liu,Yabin Zhang,Jiajun Wang,Yipeng Sun,Kaiwei Lian,Banghao Yin

Main category: cs.CV

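TL;DR: This paper reviews the LoViF 2026 challenge, which focuses on assessing the loss of semantic information in images from the human perspective. It introduces SeIQA, a dataset for human-oriented semantic quality assessment, and aims to promote new directions such as semantic coding, processing, and optimization.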

Details Motivation: Existing image quality assessment datasets pay little attention to how humans perceive the loss of semantic information; this work aims to fill that gap and advance quality assessment at the semantic level. Method: Builds SeIQA, a human-oriented semantic image quality assessment dataset split into training (510 pairs), validation (80 pairs), and testing (160 pairs) image pairs, and organizes an international challenge to establish a new benchmark. Result: 58 teams registered, 6 of which submitted valid solutions; these achieved state-of-the-art (SOTA) performance on the SeIQA dataset. Conclusion: The LoViF 2026 challenge establishes a benchmark and dataset for human-oriented semantic image quality assessment, supporting research on semantic-level image processing and coding. Abstract: This paper reviews the LoViF 2026 Challenge on Human-oriented Semantic Image Quality Assessment. This challenge aims to raise a new direction, i.e., how to evaluate the loss of semantic information from the human perspective, intending to promote the development of some new directions, like semantic coding, processing, and semantic-oriented optimization, etc. Unlike existing datasets of quality assessment, we form a dataset of human-oriented semantic quality assessment, termed the SeIQA dataset. This dataset is divided into three parts for this competition: (i) training data: 510 pairs of degraded images and their corresponding ground truth references; (ii) validation data: 80 pairs of degraded images and their corresponding ground-truth references; (iii) testing data: 160 pairs of degraded images and their corresponding ground-truth references. The primary objective of this challenge is to establish a new and powerful benchmark for human-oriented semantic image quality assessment. There are a total of 58 teams registered in this competition, and 6 teams submitted valid solutions and fact sheets for the final testing phase. These submissions achieved state-of-the-art (SOTA) performance on the SeIQA dataset.

[366] 3DTV: A Feedforward Interpolation Network for Real-Time View Synthesis

Stefan Schulz,Fernando Edelstein,Hannah Dröge,Matthias B. Hullin,Markus Plack

Main category: cs.CV

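TL;DR: This paper proposes 3DTV, a feedforward network for real-time sparse-view interpolation. It selects view triplets via Delaunay triangulation and adds a pose-aware depth module that estimates a coarse-to-fine depth pyramid, enabling efficient feature reprojection and occlusion-aware blending without scene-specific optimization, which suits low-latency interactive applications such as AR/VR and telepresence.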

Details Motivation: Real-time free-viewpoint rendering has to balance multi-camera redundancy against the latency constraints of interactive applications. Method: Proposes the 3DTV feedforward network, which selects view triplets by Delaunay triangulation and uses a pose-aware depth module to estimate a coarse-to-fine depth pyramid, supporting efficient feature reprojection and occlusion-aware blending. Result: Experiments on multi-view video datasets show that 3DTV strikes a good balance between quality and efficiency, outperforms recent real-time novel-view synthesis methods, needs no explicit geometric proxy, and generalizes well. Conclusion: 3DTV is a practical, low-latency real-time multi-view rendering solution that requires no retraining and suits interactive applications such as AR/VR and telepresence. Abstract: Real-time free-viewpoint rendering requires balancing multi-camera redundancy with the latency constraints of interactive applications. We address this challenge by combining lightweight geometry with learning and propose 3DTV, a feedforward network for real-time sparse-view interpolation. A Delaunay-based triplet selection ensures angular coverage for each target view. Building on this, we introduce a pose-aware depth module that estimates a coarse-to-fine depth pyramid, enabling efficient feature reprojection and occlusion-aware blending. Unlike methods that require scene-specific optimization, 3DTV runs feedforward without retraining, making it practical for AR/VR, telepresence, and interactive applications. Our experiments on challenging multi-view video datasets demonstrate that 3DTV consistently achieves a strong balance of quality and efficiency, outperforming recent real-time novel-view baselines. Crucially, 3DTV avoids explicit proxies, enabling robust rendering across diverse scenes. This makes it a practical solution for low-latency multi-view streaming and interactive rendering. Project Page: https://stefanmschulz.github.io/3DTV_webpage/
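
One plausible reading of "Delaunay-based triplet selection" is that camera positions are triangulated once and, for each target view, the triangle enclosing the target supplies the three source views. The sketch below illustrates that reading only. It assumes cameras projected to 2D coordinates, a triangulation that has already been computed (for example with `scipy.spatial.Delaunay`), and hypothetical function names; the abstract does not say how the authors implement this step.

```python
def barycentric(p, a, b, c):
    """Barycentric weights of 2D point p with respect to triangle (a, b, c)."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    if det == 0:
        raise ValueError("degenerate triangle")
    w_a = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    w_b = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    return w_a, w_b, 1.0 - w_a - w_b


def select_triplet(target, cameras, triangles, eps=1e-9):
    """Return (camera index triplet, weights) for the triangle containing target.

    `triangles` is a list of index triplets into `cameras`, assumed to come from a
    Delaunay triangulation. Returns None if the target lies outside every triangle.
    """
    for tri in triangles:
        weights = barycentric(target, *(cameras[i] for i in tri))
        if all(w >= -eps for w in weights):
            return tri, weights
    return None


# Four hypothetical cameras on a unit square, split into two triangles.
cameras = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
triangles = [(0, 1, 2), (1, 3, 2)]

triplet, weights = select_triplet((0.25, 0.25), cameras, triangles)
```

Because the enclosing triangle surrounds the target, the three chosen cameras bracket it from all sides, and the barycentric weights give a natural starting point for blending.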

[367] H-SPAM: Hierarchical Superpixel Anything Model

Julien Walther,Rémi Giraud,Michaël Clément

Main category: cs.CV

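TL;DR: This paper proposes H-SPAM, a framework that generates accurate, regular, and perfectly nested hierarchical superpixels, using a two-phase region merging strategy to improve segmentation accuracy and multi-scale adaptability.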

Details Motivation: Existing superpixel methods have plateaued in segmentation accuracy, and most produce a single-scale partition, which falls short for vision tasks that need multi-scale representations. Method: Guided by deep features and external object priors, H-SPAM starts from a fine partition and applies two-phase region merging: the first phase preserves object consistency, and the second allows controlled inter-object grouping; the hierarchy can also be modulated with visual attention maps or user input. Result: On standard benchmarks, H-SPAM strongly outperforms existing hierarchical methods in accuracy and regularity while performing on par with most recent non-hierarchical SOTA methods. Conclusion: H-SPAM offers a unified, flexible, high-performing way to generate hierarchical superpixels that are accurate, regular, and nested, broadening the use of superpixels in multi-scale vision tasks. Abstract: Superpixels offer a compact image representation by grouping pixels into coherent regions. Recent methods have reached a plateau in terms of segmentation accuracy by generating noisy superpixel shapes. Moreover, most existing approaches produce a single fixed-scale partition that limits their use in vision pipelines that would benefit from multi-scale representations. In this work, we introduce H-SPAM (Hierarchical Superpixel Anything Model), a unified framework for generating accurate, regular, and perfectly nested hierarchical superpixels. Starting from a fine partition, guided by deep features and external object priors, H-SPAM constructs the hierarchy through a two-phase region merging process that first preserves object consistency and then allows controlled inter-object grouping. The hierarchy can also be modulated using visual attention maps or user input to preserve important regions longer in the hierarchy. Experiments on standard benchmarks show that H-SPAM strongly outperforms existing hierarchical methods in both accuracy and regularity, while performing on par with most recent state-of-the-art non-hierarchical methods. Code and pretrained models are available: https://github.com/waldo-j/hspam.

[368] NTIRE 2026 The 3rd Restore Any Image Model (RAIM) Challenge: AI Flash Portrait (Track 3)

Ya-nan Guan,Shaonan Zhang,Hang Guo,Yawen Wang,Xinying Fan,Tianqu Zhuang,Jie Liang,Hui Zeng,Guanyi Qin,Lishen Qu,Tao Dai,Shu-Tao Xia,Lei Zhang,Radu Timofte,Bin Chen,Yuanbo Zhou,Hongwei Wang,Qinquan Gao,Tong Tong,Yanxin Qian,Lizhao You,Jingru Cong,Lei Xiong,Shuyuan Zhu,Zhi-Qiang Zhong,Kan Lv,Yang Yang,Kailing Tang,Minjian Zhang,Zhipei Lei,Zhe Xu,Liwen Zhang,Dingyong Gou,Yanlin Wu,Cong Li,Xiaohui Cui,Jiajia Liu,Guoyi Xu,Yaoxin Jiang,Yaokun Shi,Jiachen Tu,Liqing Wang,Shihang Li,Bo Zhang,Biao Wang,Haiming Xu,Xiang Long,Xurui Liao,Yanqiao Zhai,Haozhe Li,Shijun Shi,Jiangning Zhang,Yong Liu,Kai Hu,Jing Xu,Xianfang Zeng,Yuyang Liu,Minchen Wei

Main category: cs.CV

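TL;DR: This paper presents Track 3 of the NTIRE 2026 RAIM challenge, AI Flash Portrait, which aims to advance real-world low-light portrait restoration. It provides a new benchmark of 800 groups of real low-light portrait data and evaluates entries with a combination of objective metrics and subjective assessment.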

Details Motivation: Existing deep learning image restoration models struggle in real-world low-light portrait scenes to balance noise suppression, detail preservation, and faithful illumination and color reproduction, so a benchmark aimed at practical use is needed. Method: Builds a dataset of 800 groups of real-captured low-light portraits (each with a 1K low-light input, a 1K ground truth, and a 1K person mask) and adopts a hybrid evaluation system that combines objective quantitative metrics with rigorous subjective assessment. Result: Attracted over 100 teams and more than 3,000 valid submissions; the dataset and baseline code are publicly released on GitHub, and the official competition is hosted on CodaBench. Conclusion: The challenge advances low-light portrait restoration by providing high-quality real data and a reproducible evaluation framework, encouraging collaboration between academia and industry. Abstract: In this paper, we present a comprehensive overview of the NTIRE 2026 3rd Restore Any Image Model (RAIM) challenge, with a specific focus on Track 3: AI Flash Portrait. Despite significant advancements in deep learning for image restoration, existing models still encounter substantial challenges in real-world low-light portrait scenarios. Specifically, they struggle to achieve an optimal balance among noise suppression, detail preservation, and faithful illumination and color reproduction. To bridge this gap, this challenge aims to establish a novel benchmark for real-world low-light portrait restoration. We comprehensively evaluate the proposed algorithms utilizing a hybrid evaluation system that integrates objective quantitative metrics with rigorous subjective assessment protocols. For this competition, we provide a dataset containing 800 groups of real-captured low-light portrait data. Each group consists of a 1K-resolution low-light input image, a 1K ground truth (GT), and a 1K person mask. This challenge has garnered widespread attention from both academia and industry, attracting over 100 participating teams and receiving more than 3,000 valid submissions. This report details the motivation behind the challenge, the dataset construction process, the evaluation metrics, and the various phases of the competition. The released dataset and baseline code for this track are publicly available from the same GitHub repository (https://github.com/zsn1434/AI_Flash-BaseLine/tree/main), and the official challenge webpage is hosted on CodaBench (https://www.codabench.org/competitions/12885/).

[369] Seg2Change: Adapting Open-Vocabulary Semantic Segmentation Model for Remote Sensing Change Detection

You Su,Yonghong Song,Jingqi Chen,Zehan Wen

Main category: cs.CV

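TL;DR: This paper proposes Seg2Change, a framework that adapts open-vocabulary semantic segmentation models to open-vocabulary change detection (OVCD) by building the category-agnostic change detection dataset CA-CDD and designing a category-agnostic change head, reaching SOTA performance on multiple datasets.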

Details Motivation: Existing change detection methods are limited to the predefined classes in their training data and scale poorly to real-world scenarios, while the emerging open-vocabulary semantic segmentation models have not yet been used effectively for the new task of open-vocabulary change detection (OVCD). Method: Builds the category-agnostic change detection dataset CA-CDD; designs a category-agnostic change head that detects changes of arbitrary categories and indexes them to specific classes; proposes the Seg2Change adapter, which transfers open-vocabulary semantic segmentation models to change detection. Result: Improves IoU by 9.52 on WHU-CD and mIoU by 5.50 on SECOND, achieving state-of-the-art (SOTA) OVCD performance. Conclusion: Seg2Change is a simple yet effective method that bridges open-vocabulary semantic segmentation and change detection, offering a practical, high-performing solution for OVCD. Abstract: Change detection is a fundamental task in remote sensing, aiming to quantify the impacts of human activities and ecological dynamics on land-cover changes. Existing change detection methods are limited to predefined classes in training datasets, which constrains their scalability in real-world scenarios. In recent years, numerous advanced open-vocabulary semantic segmentation models have emerged for remote sensing imagery. However, there is still a lack of an effective framework for directly applying these models to open-vocabulary change detection (OVCD), a novel task that integrates vision and language to detect changes across arbitrary categories. To address these challenges, we first construct a category-agnostic change detection dataset, termed CA-CDD. Further, we design a category-agnostic change head to detect the transitions of arbitrary categories and index them to specific classes. Based on them, we propose Seg2Change, an adapter designed to adapt open-vocabulary semantic segmentation models to the change detection task. Without bells and whistles, this simple yet effective framework achieves state-of-the-art OVCD performance (+9.52 IoU on WHU-CD and +5.50 mIoU on SECOND). Our code is released at https://github.com/yogurts-sy/Seg2Change.

[370] Bridging the RGB-IR Gap: Consensus and Discrepancy Modeling for Text-Guided Multispectral Detection

Jiaqi Wu,Zhen Wang,Enhao Huang,Kangqing Shen,Yulin Wang,Yang Yue,Yifan Pu,Gao Huang

Main category: cs.CV

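TL;DR: This paper proposes a semantic bridge fusion framework that uses text as a shared semantic bridge to align the RGB and infrared (IR) modalities, and introduces a bi-support modeling mechanism with consensus support and discrepancy support to make multispectral object detection more robust.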

Details Motivation: Existing methods use text only as an auxiliary semantic enhancement signal and do not exploit its guiding role to bridge the inherent granularity asymmetry between RGB and IR; in addition, conventional data-driven attention-based fusion tends to overlook valuable cross-modal discrepancies. Method: Proposes a semantic bridge fusion framework that uses text as a shared semantic bridge to align RGB and IR responses under a unified category condition; projects the recalibrated thermal semantic prior onto the RGB branch for semantic-level mapping fusion; formulates interaction evidence as consensus support and complementary discrepancy support, introduced through dynamic recalibration as a structured inductive bias; designs a bidirectional semantic alignment module to strengthen closed-loop vision-text guidance. Result: Experiments on multispectral benchmarks confirm the effectiveness of the fusion framework, with detection performance superior to existing methods. Conclusion: Text can serve as an effective semantic bridge across the RGB-IR gap, and explicitly modeling cross-modal consensus and discrepancy improves the robustness and accuracy of multispectral object detection. Abstract: Text-guided multispectral object detection uses text semantics to guide semantic-aware cross-modal interaction between RGB and IR for more robust perception. However, notable limitations remain: (1) existing methods often use text only as an auxiliary semantic enhancement signal, without exploiting its guiding role to bridge the inherent granularity asymmetry between RGB and IR; and (2) conventional data-driven attention-based fusion tends to emphasize stable consensus while overlooking potentially valuable cross-modal discrepancies. To address these issues, we propose a semantic bridge fusion framework with bi-support modeling for multispectral object detection. Specifically, text is used as a shared semantic bridge to align RGB and IR responses under a unified category condition, while the recalibrated thermal semantic prior is projected onto the RGB branch for semantic-level mapping fusion. We further formulate RGB-IR interaction evidence into the regular consensus support and the complementary discrepancy support that contains potentially discriminative cues, and introduce them into fusion via dynamic recalibration as a structured inductive bias. In addition, we design a bidirectional semantic alignment module for closed-loop vision-text guidance enhancement. Extensive experiments demonstrate the effectiveness of the proposed fusion framework and its superior detection performance on multispectral benchmarks. Code is available at https://github.com/zhenwang5372/Bridging-RGB-IR-Gap.

[371] Decoupled Similarity for Task-Aware Token Pruning in Large Vision-Language Models

Kexin Ma,Jing Xiao,Chaofeng Chen,Geyong Min,Guibo Zhu,Jinqiao Wang,Liang Liao

Main category: cs.CV

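TL;DR: This paper proposes DeSAP, which combines a decoupled similarity with visual saliency signals to perform task-aware visual token pruning, sharply reducing computation while preserving model performance.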

Details Motivation: Existing token pruning methods for LVLMs rely on a single attention source, which leads to incomplete and suboptimal pruning decisions that adapt poorly to different tasks. Method: Proposes Decoupled Similarity-Aware Pruning (DeSAP), which guides visual token pruning jointly with fine-grained cross-modal relevance (the decoupled similarity) and saliency signals derived from visual attention. Result: On LLaVA-1.5-7B, retaining only 11.1% of visual tokens yields a 10x FLOPs reduction and a 2.3x prefill speedup while maintaining 98.1% of the original performance; DeSAP surpasses SOTA methods across multiple benchmarks and architectures. Conclusion: By fusing task-related and visual cues, DeSAP achieves more precise and robust visual token pruning, offering a new approach to efficient LVLM deployment. Abstract: Token pruning has emerged as an effective approach to reduce the substantial computational overhead of Large Vision-Language Models (LVLMs) by discarding less informative visual tokens while preserving performance. However, existing methods typically rely on individual attention sources from different LVLM components, resulting in incomplete and suboptimal pruning decisions due to biased attention distributions. To address this problem, we propose DeSAP, a novel Decoupled Similarity-Aware Pruning method for precise, task-aware token pruning within the visual encoder. Specifically, DeSAP introduces a decoupled similarity to capture fine-grained cross-modal relevance between visual features and text tokens, providing explicit task-related guidance for pruning. By integrating decoupled similarity with visual saliency signals derived from visual attention, DeSAP performs token pruning under the guidance of both task-related and visual cues, enabling robust pruning even under aggressive pruning ratios. Extensive experiments across diverse benchmarks and architectures show that DeSAP consistently outperforms SOTA methods in both accuracy and efficiency. On LLaVA-1.5-7B, DeSAP achieves a 10 times FLOPs reduction and a 2.3 times prefill speedup by retaining only 11.1% of visual tokens, while maintaining 98.1% of the original performance.
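
The idea of pruning under two cues at once can be illustrated with a small score-and-keep sketch. Everything specific here is an assumption made for illustration: the function name, the convex combination with weight `alpha`, and the toy scores are hypothetical, and the abstract does not state how DeSAP actually merges the decoupled similarity with the saliency signal.

```python
def prune_tokens(similarity, saliency, keep_ratio, alpha=0.5):
    """Return the sorted indices of visual tokens to keep.

    similarity: per-token text relevance scores (task-related cue).
    saliency:   per-token visual saliency scores (visual cue).
    The weighted sum below is an illustrative fusion rule, not the paper's.
    """
    if len(similarity) != len(saliency):
        raise ValueError("score lists must have the same length")
    n_keep = max(1, round(len(similarity) * keep_ratio))
    scores = [alpha * s + (1.0 - alpha) * v for s, v in zip(similarity, saliency)]
    # Rank tokens by fused score, highest first; ties keep the original order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore spatial order so the kept tokens stay in their original sequence.
    return sorted(ranked[:n_keep])


# Toy example with four tokens and a 50% budget.
kept = prune_tokens(
    similarity=[0.9, 0.1, 0.5, 0.2],
    saliency=[0.1, 0.2, 0.9, 0.1],
    keep_ratio=0.5,
)
```

As a rough sanity check on the reported budget: if the encoder emits 576 visual tokens per image (the usual figure for LLaVA-1.5, assumed here rather than stated in the abstract), an 11.1% keep ratio corresponds to 64 tokens.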

[372] Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding

Tencent Hunyuan Team

Main category: cs.CV

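TL;DR: This paper proposes the Multi-Stream Scene Script (MTSS) paradigm, which uses stream factorization and relational grounding to decouple a video description into several explicitly aligned scene streams, substantially improving both video understanding and generation.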

Details Motivation: Existing video captioning methods treat a video as one monolithic paragraph, which entangles visual, auditory, and identity information, harms representational fidelity, and limits scalability. Method: Proposes the MTSS paradigm, built on two principles: Stream Factorization (decomposing a video into four complementary streams: Reference, Shot, Event, and Global) and Relational Grounding (restoring consistency across streams through explicit identity and temporal links). Result: Reduces the total error rate on Video-SALMONN-2 by 25% on average and improves the Daily-Omni reasoning benchmark by 67% on average; narrows the gap between smaller and larger MLLMs; in multi-shot video generation, raises cross-shot identity consistency, audio-visual alignment, and temporal controllability by 45%, 56%, and 71%, respectively. Conclusion: MTSS offers a more learnable, editable, and clearly structured semantic interface for video that improves understanding and generation without changes to model architecture. Abstract: Advances in Multimodal Large Language Models (MLLMs) are transforming video captioning from a descriptive endpoint into a semantic interface for both video understanding and generation. However, the dominant paradigm still casts videos as monolithic narrative paragraphs that entangle visual, auditory, and identity information. This dense coupling not only compromises representational fidelity but also limits scalability, since even local edits can trigger global rewrites. To address this structural bottleneck, we propose Multi-Stream Scene Script (MTSS), a novel paradigm that replaces monolithic text with factorized and explicitly grounded scene descriptions. MTSS is built on two core principles: Stream Factorization, which decouples a video into complementary streams (Reference, Shot, Event, and Global), and Relational Grounding, which reconnects these isolated streams through explicit identity and temporal links to maintain holistic video consistency. Extensive experiments demonstrate that MTSS consistently enhances video understanding across various models, achieving an average reduction of 25% in the total error rate on Video-SALMONN-2 and an average performance gain of 67% on the Daily-Omni reasoning benchmark. It also narrows the performance gap between smaller and larger MLLMs, indicating a substantially more learnable caption interface. Finally, even without architectural adaptation, replacing monolithic prompts with MTSS in multi-shot video generation yields substantial human-rated improvements: a 45% boost in cross-shot identity consistency, a 56% boost in audio-visual alignment, and a 71% boost in temporal controllability.

[373] Variational Latent Entropy Estimation Disentanglement: Controlled Attribute Leakage for Face Recognition

Ünsal Öztürk,Vedrana Krivokuća Hahn,Sushil Bhattacharjee,Sébastien Marcel

Main category: cs.CV

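TL;DR: This paper proposes VLEED, a post-hoc method that transforms pretrained face recognition embeddings with a variational autoencoder to separate sensitive attributes such as gender and ethnicity from identity information, serving both privacy and fairness.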

Details Motivation: Face embeddings encode not only identity but also sensitive attributes such as gender and ethnicity, which can cause privacy leakage and recognition bias, so these attributes need to be disentangled from identity information for downstream use. Method: Proposes Variational Latent Entropy Estimation Disentanglement (VLEED), which builds on a variational autoencoder and forms a mutual information objective by estimating the entropy of the categorical attribute in the latent space, giving controllable and stable removal of sensitive attributes. Result: Evaluated for gender and ethnicity disentanglement on IJB-C, RFW, and VGGFace2; compared with SOTA methods, VLEED offers a wide range of privacy-utility trade-offs, lowering the predictability of sensitive attributes while retaining verification utility, and can reduce differences in false match rates across demographic groups. Conclusion: VLEED provides a flexible privacy-utility trade-off, improving fairness and privacy protection while preserving face recognition accuracy, and is a practical, stable post-hoc disentanglement approach. Abstract: Face recognition embeddings encode identity, but they also encode other factors such as gender and ethnicity. Depending on how these factors are used by a downstream system, separating them from the information needed for verification is important for both privacy and fairness. We propose Variational Latent Entropy Estimation Disentanglement (VLEED), a post-hoc method that transforms pretrained embeddings with a variational autoencoder and encourages a distilled representation where the categorical variable of interest is separated from identity-relevant information. VLEED uses a mutual information-based objective realised through the estimation of the entropy of the categorical attribute in the latent space, and provides stable training with fine-grained control over information removal. We evaluate our method on IJB-C, RFW, and VGGFace2 for gender and ethnicity disentanglement, and compare it to various state-of-the-art methods. We report verification utility, predictability of the disentangled variable under linear and nonlinear classifiers, and group disparity metrics based on false match rates. Our results show that VLEED offers a wide range of privacy-utility tradeoffs over existing methods and can also reduce recognition bias across demographic groups.
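
The link between a mutual information objective and entropy estimation for a categorical attribute can be written out with standard identities. The notation below is an assumption made for illustration ($Z$ for the latent representation, $A$ for the categorical attribute); the abstract does not give the authors' estimator or loss.

```latex
% Mutual information between latent Z and categorical attribute A
I(Z; A) = H(A) - H(A \mid Z)

% Conditional entropy of the attribute given the latent
H(A \mid Z) = -\,\mathbb{E}_{z \sim p(z)} \Big[ \sum_{a} p(a \mid z) \log p(a \mid z) \Big]

% Since H(A) is fixed by the data, reducing I(Z; A) is equivalent to
% increasing H(A \mid Z), with I(Z; A) = 0 exactly when H(A \mid Z) = H(A).
```

Read this way, removing attribute information from the latent space amounts to making the attribute as unpredictable from $Z$ as it is from its prior alone.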

[374] A Deep Equilibrium Network for Hyperspectral Unmixing

Chentong Wang,Jincheng Gao,Fei Zhu,Jie Chen

Main category: cs.CV

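TL;DR: This paper proposes DEQ-Unmix, a hyperspectral unmixing method based on deep equilibrium models. It trains in constant memory via implicit differentiation and replaces the gradient operator of the data reconstruction term with a trainable convolutional network to better capture spectral-spatial information, improving unmixing accuracy while remaining memory-efficient.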

Details Motivation: Traditional methods struggle to model complex spectral-spatial features, deep learning methods lack physical interpretability, and unrolling-based methods fall short in exploiting spectral-spatial information, memory consumption, and numerical precision. Method: Reformulates abundance estimation as a deep equilibrium model (DEQ) and uses implicit differentiation for constant-memory backpropagation; replaces the gradient operator in the data reconstruction term with a trainable convolutional network to strengthen spectral-spatial feature modeling. Result: Experiments on synthetic data and two real-world datasets show that DEQ-Unmix achieves better unmixing accuracy than existing methods while keeping memory cost constant. Conclusion: DEQ-Unmix balances model expressiveness, physical interpretability, and computational efficiency, offering an efficient and robust approach to hyperspectral unmixing. Abstract: Hyperspectral unmixing (HU) is crucial for analyzing hyperspectral imagery, yet achieving accurate unmixing remains challenging. While traditional methods struggle to effectively model complex spectral-spatial features, deep learning approaches often lack physical interpretability. Unrolling-based methods, despite offering network interpretability, inadequately exploit spectral-spatial information and incur high memory costs and numerical precision issues during backpropagation. To address these limitations, we propose DEQ-Unmix, which reformulates abundance estimation as a deep equilibrium model, enabling efficient constant-memory training via implicit differentiation. It replaces the gradient operator of the data reconstruction term with a trainable convolutional network to capture spectral-spatial information. By leveraging implicit differentiation, DEQ-Unmix enables efficient and constant-memory backpropagation. Experiments on synthetic and two real-world datasets demonstrate that DEQ-Unmix achieves superior unmixing performance while maintaining constant memory cost.
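
Why implicit differentiation gives constant-memory training can be seen from the generic deep equilibrium formulation. The symbols below are assumptions for illustration ($f_\theta$ for the layer map, $x$ for the input, $z^\star$ for the equilibrium, $\mathcal{L}$ for the loss); this is the standard DEQ gradient, not a formula taken from the paper.

```latex
% Forward pass: solve for a fixed point instead of stacking unrolled layers
z^\star = f_\theta(z^\star; x)

% Backward pass: differentiate the fixed-point condition (implicit function theorem)
\frac{\partial \mathcal{L}}{\partial \theta}
  = \frac{\partial \mathcal{L}}{\partial z^\star}
    \left( I - \frac{\partial f_\theta(z^\star; x)}{\partial z^\star} \right)^{-1}
    \frac{\partial f_\theta(z^\star; x)}{\partial \theta}
```

The gradient depends only on the equilibrium $z^\star$, so the intermediate iterates of the solver need not be stored, in contrast to unrolled networks whose memory grows with the number of iterations.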

[375] Empowering Video Translation using Multimodal Large Language Models

Bingzheng QU,Kehai Chen,Xuefeng Bai,Min Zhang

Main category: cs.CV

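TL;DR: This paper surveys recent progress of multimodal large language models (MLLMs) in video translation, proposes a three-role taxonomy (Semantic Reasoner, Expressive Performer, Visual Synthesizer), systematically maps the technical approaches, and identifies current challenges and future directions.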

Details Motivation: Although MLLMs are advancing rapidly in video translation and many surveys cover general video-language understanding, a focused, systematic review of how MLLMs empower video translation is still missing. Method: Proposes a three-role taxonomy: 1) Semantic Reasoner (video understanding, temporal reasoning, multimodal fusion); 2) Expressive Performer (LLM-driven and LLM-augmented controllable speech generation); 3) Visual Synthesizer (video generation with high-fidelity lip-sync and visual alignment). Result: Provides the first comprehensive overview of MLLM-driven video translation, clarifies the technical characteristics and representative methods of each role, and identifies key challenges in video understanding, temporal modeling, and multimodal alignment. Conclusion: MLLMs are moving video translation from traditional cascaded pipelines toward an end-to-end, jointly modeled paradigm with stronger robustness and expressiveness, though further progress is needed in deep semantic understanding and cross-modal consistency. Abstract: Recent developments in video translation have further enhanced cross-lingual access to video content, with multimodal large language models (MLLMs) playing an increasingly important supporting role. With strong multimodal understanding, reasoning, and generation capabilities, MLLMs-based video translation systems are overcoming the limitations of traditional cascaded pipelines that separately handle automatic speech recognition, machine translation, text-to-speech and lip synchronization. These MLLM-powered approaches not only achieve competitive or superior translation quality, but also demonstrate stronger robustness in zero-shot settings and multi-speaker scenarios, while jointly modeling semantic fidelity, timing, speaker identity, and emotional consistency. However, despite the rapid progress of MLLMs and extensive surveys on general video-language understanding, a focused and systematic review of how MLLMs empower video translation tasks is still lacking. To fill this gap, we provide the first comprehensive overview of MLLMs-based video translation, organized around a three-role taxonomy: 1) Semantic Reasoner, which characterizes how MLLMs perform video understanding, temporal reasoning, and multimodal fusion; 2) Expressive Performer, which analyzes LLM-driven and LLM-augmented techniques for expressive, controllable speech generation; and 3) Visual Synthesizer, which examines different types of video generators for high-fidelity lip-sync and visual alignment. 
Finally, we discuss open challenges in video understanding, temporal modeling, and multimodal alignment, and outline promising future research directions for MLLMs-powered video translation.

[376] Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

Dongxu Wei,Qi Xu,Zhiqi Li,Hangning Zhou,Cong Qiu,Hailong Qin,Mu Yang,Zhaopeng Cui,Peidong Liu

Main category: cs.CV

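TL;DR: This paper proposes a new approach that performs 3D scene generation directly in an implicit 3D latent space. By building a 3D Representation Autoencoder (3DRAE) and a 3D Diffusion Transformer (3DDiT), it addresses the redundant representation and spatial inconsistency that conventional 2D multi-view and video diffusion models suffer from in 3D spatial extrapolation.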

Details Motivation: Existing 3D scene generation relies mainly on 2D multi-view or video diffusion models, which lack a scene-level 3D latent representation; 2D representations also reduce 3D spatial extrapolation to 2D temporal extension, causing two major problems: representation redundancy and spatial inconsistency. Method: Proposes a 3D Representation Autoencoder (3DRAE) that uses frozen 2D encoders to map multi-view 2D semantic representations into a view-decoupled 3D latent representation; on top of it, builds a 3D Diffusion Transformer (3DDiT) that performs diffusion modeling in this 3D latent space and supports a variety of conditioning inputs. Result: Achieves efficient, spatially consistent 3D scene generation and supports decoding images and optional point maps along arbitrary camera trajectories without running diffusion sampling separately for each trajectory. Conclusion: Realizes, for the first time, end-to-end diffusion modeling in a 3D latent space, overcoming the inherent limits that 2D diffusion frameworks place on 3D generation and offering a new paradigm for high-quality, highly consistent 3D content generation. Abstract: 3D scene generation has long been dominated by 2D multi-view or video diffusion models. This is due not only to the lack of scene-level 3D latent representation, but also to the fact that most scene-level 3D visual data exists in the form of multi-view images or videos, which are naturally compatible with 2D diffusion architectures. Typically, these 2D-based approaches degrade 3D spatial extrapolation to 2D temporal extension, which introduces two fundamental issues: (i) representing 3D scenes via 2D views leads to significant representation redundancy, and (ii) latent space rooted in 2D inherently limits the spatial consistency of the generated 3D scenes. In this paper, we propose, for the first time, to perform 3D scene generation directly within an implicit 3D latent space to address these limitations. First, we repurpose frozen 2D representation encoders to construct our 3D Representation Autoencoder (3DRAE), which grounds view-coupled 2D semantic representations into a view-decoupled 3D latent representation. This enables representing 3D scenes observed from arbitrary numbers of views--at any resolution and aspect ratio--with fixed complexity and rich semantics. Then we introduce 3D Diffusion Transformer (3DDiT), which performs diffusion modeling in this 3D latent space, achieving remarkably efficient and spatially consistent 3D scene generation while supporting diverse conditioning configurations. 
Moreover, since our approach directly generates a 3D scene representation, it can be decoded to images and optional point maps along arbitrary camera trajectories without requiring a per-trajectory diffusion sampling pass, which is common in 2D-based approaches.

[377] A Compact and Efficient 1.251 Million Parameter Machine Learning CNN Model PD36-C for Plant Disease Detection: A Case Study

Shkelqim Sherifi

Main category: cs.CV

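TL;DR: This paper proposes PD36-C, a lightweight convolutional neural network (1.25 million parameters, 4.77 MB) trained on the New Plant Diseases Dataset that reaches an average test accuracy of 99.53%, together with a Qt desktop application that supports offline inference and is suited to edge-device deployment.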

Details Motivation: Smart agriculture needs plant disease diagnosis models that are lightweight, robust, and deployable at the edge, whereas existing large models are hard to run on resource-constrained devices. Method: Designs the compact CNN PD36-C; trains it with TensorFlow Keras on the New Plant Diseases Dataset (87k images, 38 classes); builds a desktop GUI application with Qt for Python to support offline inference. Result: Training accuracy reached 0.99697 (epoch 30) and average test accuracy was 0.9953; several classes achieved precision and recall of 1.00, while some visually similar diseases (e.g., corn Cercospora leaf spot) had precision of about 0.9777 and recall of about 0.9634. Conclusion: A carefully designed small CNN can meet edge deployment needs while keeping high accuracy; the current model degrades under adverse weather, low-quality images, and multiple concurrent diseases, so its domain robustness needs further improvement. Abstract: Deep learning has markedly advanced image based plant disease diagnosis as improved hardware and dataset quality have enabled increasingly accurate neural network models. This paper presents PD36-C, a compact convolutional neural network (1,250,694 parameters and 4.77 MB) for plant disease classification. Trained with TensorFlow Keras on the New Plant Diseases Dataset (87k images, 38 classes), PD36-C is designed for robustness and edge deployability, complemented by a Qt for Python desktop application that offers an intuitive GUI and offline inference on commodity hardware. Across experiments, training accuracy reached 0.99697 by epoch 30, and average test accuracy was 0.9953 across 38 classes. Per class performance is uniformly high; on the lower end, Corn (maize) Cercospora leaf spot achieved precision around 0.9777 and recall around 0.9634, indicating occasional confusion with visually similar categories, while on the upper end numerous classes including Apple Black rot, Cedar apple rust, Blueberry healthy, Cherry Powdery mildew, Cherry healthy, and all four grape categories achieved perfect precision and recall of 1.00, indicating no false positives and strong coverage. These results show that with a well curated dataset and careful architectural design, small CNNs can achieve competitive accuracy compared with recent baselines while remaining practical for edge scenarios. 
We also note typical constraints such as adverse weather, low quality imagery, and leaves exhibiting multiple concurrent diseases that can degrade performance and warrant future work on domain robustness. Overall, PD36-C and its application pipeline contribute a field ready, efficient solution for AI assisted plant disease detection in smart agriculture.

[378] LoGo-MR: Screening Breast MRI for Cancer Risk Prediction by Efficient Omni-Slice Modeling

Xin Wang,Yuan Gao,George Yiasemis,Antonio Portaluri,Zahra Aghdam,Muzhen He,Luyi Han,Yaofei Duan,Chunyao Lu,Xinglong Liang,Tianyu Zhang,Vivien van Veldhuizen,Yue Sun,Tao Tan,Ritse Mann,Jonas Teuwen

Main category: cs.CV

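TL;DR: This paper proposes LoGo-MR, a 2.5D local-global modeling framework, and its extension LoGo3-MR for five-year breast cancer risk prediction from breast MRI, balancing efficiency and explainability. Neighbor-slice encoding captures local features tied to short-term risk, transformer-enhanced multiple-instance learning models global patterns tied to long-term risk, and the framework supports fusion across axial, sagittal, and coronal planes with voxel-level risk heatmap localization. On a large cohort of about 7,500 cases it clearly outperforms 2D/3D CNNs and SOTA MIL methods.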

Details Motivation: Existing methods struggle to combine efficiency and modeling capacity for breast MRI risk prediction: 3D CNNs are computationally expensive, while 2D CNNs cannot model inter-slice continuity; in addition, stratified modeling of short- and long-term breast cancer risk remains underexplored. Method: Proposes LoGo-MR: 1) a neighbor-slice encoding module captures local structure to characterize short-term risk; 2) transformer-enhanced multiple-instance learning (MIL) models global distribution patterns to characterize long-term risk and provides slice-importance explanations. It is further extended to LoGo3-MR, which fuses features from the axial, sagittal, and coronal planes and produces voxel-level risk saliency maps. Result: On a breast MRI screening cohort of ~7.5K, LoGo-MR reaches AUCs of 0.77-0.69 for 1- to 5-year risk prediction and improves C-index by about 6% over 3D CNNs; LoGo3-MR further improves overall performance, is robust across seven backbones, and supports interpretable risk localization across the three planes. Conclusion: LoGo-MR and LoGo3-MR deliver efficient, explainable breast MRI risk stratification with clear clinical potential for large-scale population screening. Abstract: Efficient and explainable breast cancer (BC) risk prediction is critical for large-scale population-based screening. Breast MRI provides functional information for personalized risk assessment. Yet effective modeling remains challenging as fully 3D CNNs capture volumetric context at high computational cost, whereas lightweight 2D CNNs fail to model inter-slice continuity. Importantly, breast MRI modeling for short- and long-term BC risk stratification remains underexplored. In this study, we propose LoGo-MR, a 2.5D local-global structural modeling framework for five-year BC risk prediction. Aligned with clinical interpretation, our framework first employs neighbor-slice encoding to capture subtle local cues linked to short-term risk. It then integrates transformer-enhanced multiple-instance learning (MIL) to model distributed global patterns related to long-term risk and provide interpretable slice importance. We further apply this framework across axial, sagittal, and coronal planes as LoGo3-MR to capture complementary volumetric information. This multi-plane formulation enables voxel-level risk saliency mapping, which may assist radiologists in localizing risk-relevant regions during breast MRI interpretation. Evaluated on a large breast MRI screening cohort (~7.5K), our method outperforms 2D/3D baselines and existing SOTA MIL methods, achieving AUCs of 0.77-0.69 for 1- to 5-year prediction and improving C-index by ~6% over 3D CNNs. 
LoGo3-MR further improves overall performance with interpretable localization across three planes, and validation across seven backbones shows consistent gains. These results highlight the clinical potential of efficient MRI-based BC risk stratification for large-scale screening. Code will be released publicly.
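As a rough illustration of how multiple-instance learning can yield both a volume-level representation and per-slice importance, the sketch below implements generic attention-based MIL pooling in NumPy. It is a simplified stand-in, not the transformer-enhanced MIL of LoGo-MR; the function name, the projection parameters `V` and `w`, and the random slice features are all assumptions for illustration.

```python
import numpy as np

def attention_mil_pool(slice_feats, V, w):
    """Pool per-slice features (n_slices, d) into one bag vector (d,).

    V (d, h) and w (h,) are attention parameters that would normally be
    learned; here they are random placeholders.
    """
    scores = np.tanh(slice_feats @ V) @ w      # one score per slice
    scores = scores - scores.max()             # stabilise the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    bag = weights @ slice_feats                # attention-weighted average
    return bag, weights

rng = np.random.default_rng(0)
feats = rng.normal(size=(12, 8))               # 12 slices, 8-dim features
V = rng.normal(size=(8, 4))
w = rng.normal(size=4)
bag, weights = attention_mil_pool(feats, V, w)
```

The `weights` vector sums to one and can be read as a slice-importance profile, which is the kind of interpretability signal the abstract attributes to the MIL stage.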

[379] LEADER: Learning Reliable Local-to-Global Correspondences for LiDAR Relocalization

Jianshi Wu,Minghang Zhu,Dunqiang Liu,Wen Li,Sheng Ao,Siqi Shen,Chenglu Wen,Cheng Wang

Main category: cs.CV

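TL;DR: This paper proposes the LEADER framework, which uses a robust projection-based geometric encoder and a truncated relative reliability loss to improve the robustness and accuracy of LiDAR relocalization in complex scenes, substantially reducing position error on the Oxford RobotCar and NCLT datasets.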

Details Motivation: Existing learning-based LiDAR relocalization methods treat all predicted points equally, which makes them vulnerable to noise and outliers and leads to poor performance in challenging scenes. Method: Proposes the LEADER framework, comprising a robust projection-based geometric encoder (capturing multi-scale geometric features) and a truncated relative reliability loss (modeling point-wise uncertainty and suppressing the influence of unreliable predictions). Result: On the Oxford RobotCar and NCLT datasets, position error is reduced by 24.1% and 73.9%, respectively, relative to the best existing methods. Conclusion: By introducing geometric priors and a reliability-aware loss, LEADER improves the robustness and accuracy of LiDAR relocalization, particularly in complex, noisy scenes. Abstract: LiDAR relocalization has attracted increasing attention as it can deliver accurate 6-DoF pose estimation in complex 3D environments. Recent learning-based regression methods offer efficient solutions by directly predicting global poses without the need for explicit map storage. However, these methods often struggle in challenging scenes due to their equal treatment of all predicted points, which is vulnerable to noise and outliers. In this paper, we propose LEADER, a robust LiDAR-based relocalization framework enhanced by a simple, yet effective geometric encoder. Specifically, a Robust Projection-based Geometric Encoder architecture which captures multi-scale geometric features is first presented to enhance descriptiveness in geometric representation. A Truncated Relative Reliability loss is then formulated to model point-wise ambiguity and mitigate the influence of unreliable predictions. Extensive experiments on the Oxford RobotCar and NCLT datasets demonstrate that LEADER outperforms state-of-the-art methods, achieving 24.1% and 73.9% relative reductions in position error over existing techniques, respectively. The source code is released on https://github.com/JiansW/LEADER.

[380] From Redaction to Restoration: Deep Learning for Medical Image Anonymization and Reconstruction

Adrienne Kline,Abhijit Gaonkar,Daniel Pittman,Chris Kuehn,Nils Forkert

Main category: cs.CV

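TL;DR: This paper proposes an end-to-end deep learning framework that automatically removes sensitive patient information (such as burned-in text and metadata) from medical images without harming downstream analysis, and uses a latent diffusion model (Stable Diffusion 2) to inpaint the redacted regions with anatomically and radiologically plausible content.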

Details Motivation: Existing medical image de-identification methods often remove information that is not sensitive but is important for downstream tasks, hurting analysis performance; privacy protection and data usability pull against each other, so a method that serves both is needed. Method: Uses a lightweight hybrid architecture: a CRNN first detects and redacts regions containing PHI (such as burned-in text), then a latent diffusion model based on Stable Diffusion 2 performs anatomically consistent inpainting. Result: Performs well on both privacy metrics (residual PHI rate, redaction success rate) and image-quality and task metrics (e.g., accuracy of segmentation and classification models); the generated images are visually coherent, downstream model performance is preserved, and re-identification risk drops substantially. Conclusion: The method provides an automated, high-quality, integrated workflow for medical image anonymization and reconstruction, which can facilitate large-scale sharing of medical imaging data and multi-center collaboration, and serves as a practical tool for open science in medical AI. Abstract: Removing patient-specific information from medical images is crucial to enable sharing and open science without compromising patient identities. However, many methods currently used for deidentification have negative effects on downstream image analysis tasks because of removal of relevant but non-identifiable information. This work presents an end-to-end deep learning framework for transforming raw clinical image volumes into de-identified, analysis-ready datasets without compromising downstream utility. The methodology developed and tested in this work first detects and redacts regions likely to contain protected health information (PHI), such as burned-in text and metadata, and then uses a generative deep learning model to inpaint the redacted areas with anatomically and imaging plausible content. The proposed pipeline leverages a lightweight hybrid architecture, combining CRNN-based redaction with a latent-diffusion inpainting restoration module (Stable Diffusion 2). We evaluate the approach using both privacy-oriented metrics, which quantify residual PHI and success of redaction, and image-quality and task-based metrics, which assess the fidelity of restored volumes for representative deep learning applications. Our results suggest that the proposed method yields de-identified medical images that are visually coherent, maintaining fidelity for downstream models, while substantially reducing the risk of patient re-identification. 
By automating anonymization and image reconstruction within a single workflow, the pipeline can ease the dissemination of large-scale medical imaging collections, thereby lowering a key barrier to data sharing and multi-institutional collaboration in medical imaging AI.

[381] ConvFormer3D-TAP: Phase/Uncertainty-Aware Front-End Fusion for Cine CMR View Classification Pipelines

Nafiseh Ghaffar Nia,Vinesh Appadurai,Suchithra V.,Chinmay Rane,Daniel Pittman,James Carr,Adrienne Kline

Main category: cs.CV

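TL;DR: This paper proposes ConvFormer3D-TAP, a new 3D spatiotemporal model for accurately recognizing standard cine cardiac MRI views. The model combines 3D convolutional tokenization with multiscale self-attention and is trained with masked spatiotemporal reconstruction and uncertainty-weighted multi-clip fusion, reaching 96% validation accuracy with good calibration on large-scale clinical data.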

Details Motivation: Accurate recognition of standard cine cardiac MRI views is essential, because misidentification cascades into errors in later quantitative analyses (such as segmentation, volumetric assessment, strain analysis, and valve evaluation); in clinical practice, however, differences in scanner vendor, acquisition protocol, motion artifacts, and plane prescription make view classification very challenging. Method: Proposes the ConvFormer3D-TAP model, which fuses 3D convolutional tokenization with multiscale self-attention; it uses masked spatiotemporal reconstruction pretraining plus an uncertainty-weighted multi-clip fusion strategy, capturing both local anatomical structure (convolutional priors) and long-range dynamics over the full cardiac cycle (hierarchical attention). Result: On 150,974 clinical cine sequences covering six standard views, validation accuracy reaches 96%, per-class F1-scores are >= 0.94, and calibration is strong (ECE = 0.025; Brier = 0.040); errors concentrate on anatomically adjacent view pairs (such as long-axis views and LVOT/AV). Conclusion: ConvFormer3D-TAP is a robust, scalable front-end for cine MRI view recognition, suited to view routing, filtering, and quality control in end-to-end workflows. Abstract: Reliable recognition of standard cine cardiac MRI views is essential because each view determines which cardiac anatomy is visualized and which quantitative analyses can be performed. Incorrect view identification, whether by a human reader or an automated deep learning system, can propagate errors into segmentation, volumetric assessment, strain analysis, and valve evaluation. However, accurate view classification remains challenging under routine clinical variability in scanner vendor, acquisition protocol, motion artifacts, and plane prescription. We present ConvFormer3D-TAP, a cine-specific spatiotemporal architecture that integrates 3D convolutional tokenization with multiscale self-attention. The model is trained using masked spatiotemporal reconstruction and uncertainty-weighted multi-clip fusion to enhance robustness across cardiac phases and ambiguous temporal segments. The design captures complementary cues: local anatomical structure through convolutional priors and long-range cardiac-cycle dynamics through hierarchical attention. On a cohort of 150,974 clinically acquired cine sequences spanning six standard cine cardiac MRI views, ConvFormer3D-TAP achieved 96% validation accuracy with per-class F1-scores >= 0.94 and strong calibration (ECE = 0.025; Brier = 0.040). Error analysis shows that residual confusions are concentrated in anatomically adjacent long-axis and LVOT/AV view pairs, consistent with intrinsic prescription overlap. 
These results support ConvFormer3D-TAP as a scalable front-end for view routing, filtering and quality control in end-to-end cMRI workflows.
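The calibration figures quoted for ConvFormer3D-TAP (ECE and Brier score) can be made concrete with a short NumPy sketch. It uses the common equal-width-bin definition of Expected Calibration Error and the multiclass sum-of-squares Brier score; the abstract does not state which exact variants or bin count the authors used, so both choices, along with the toy probabilities, are assumptions.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE with equal-width confidence bins over the top-1 prediction."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap     # weight by fraction of samples in bin
    return ece

def brier_score(probs, labels):
    """Mean squared distance between predicted probabilities and one-hot labels."""
    onehot = np.eye(probs.shape[1])[labels]
    return np.mean(np.sum((probs - onehot) ** 2, axis=1))

# Toy two-class example (values chosen for illustration only).
probs = np.array([[0.95, 0.05], [0.85, 0.15], [0.25, 0.75], [0.65, 0.35]])
labels = np.array([0, 0, 1, 1])
ece = expected_calibration_error(probs, labels)
brier = brier_score(probs, labels)
```

Lower is better for both metrics: ECE measures the gap between confidence and accuracy, while the Brier score also penalizes confident wrong predictions directly.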

[382] Beyond Reconstruction: Reconstruction-to-Vector Diffusion for Hyperspectral Anomaly Detection

Jijun Xiang,Jiayi Wang,Pengxiang Wang,Cheng Chen,Nian Wang,Tao Wang

Main category: cs.CV

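TL;DR: This paper proposes R2VD, which shifts reconstruction in hyperspectral anomaly detection from a scalar-residual paradigm to a vector diffusion generative paradigm, using a four-stage pipeline to achieve more robust sub-pixel anomaly detection and background suppression.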

Details Motivation: Existing hyperspectral anomaly detection models rely on ambiguous scalar residuals, which causes sub-pixel anomalies to vanish during spatial downsampling and produces severe confirmation bias when unpurified anomalies contaminate the training weights. Method: Proposes Reconstruction-to-Vector Diffusion (R2VD) with four stages: (1) Physical Prior Extraction (PPE) mitigates early confirmation bias; (2) Guided Manifold Purification (GMP) uses an OmniContext Autoencoder (OCA) to extract purified residual maps; (3) Residual Score Modeling (RSM) uses a Diffusion Transformer (DiT) with a Physical Spectral Firewall (PSF) to suppress cross-band leakage; (4) Vector Dynamics Inference (VDI) decouples targets from background based on high-dimensional vector interference patterns rather than scalar errors. Result: Comprehensive evaluation on eight datasets shows that R2VD achieves new SOTA performance, markedly improving target detectability and background suppression. Conclusion: By fundamentally changing the role of reconstruction, from a scalar endpoint to the starting point of vector generation, R2VD overcomes the core limitations of conventional HAD methods in preserving sub-pixel anomalies and avoiding training bias. Abstract: While Hyperspectral Anomaly Detection (HAD) excels at identifying sparse targets in complex scenes, existing models remain trapped in a scalar "reconstruction-as-endpoint" paradigm. This reliance on ambiguous scalar residuals consistently triggers sub-pixel anomaly vanishing during spatial downsampling, alongside severe confirmation bias when unpurified anomalies corrupt training weights. In this paper, we propose Reconstruction-to-Vector Diffusion (R2VD), which fundamentally redefines reconstruction as a manifold purification origin to establish a novel residual-guided generative dynamics paradigm. Our framework introduces a four-stage pipeline: (1) a Physical Prior Extraction (PPE) stage that mitigates early confirmation bias via dual-stream statistical guidance; (2) a Guided Manifold Purification (GMP) stage utilizing an OmniContext Autoencoder (OCA) to extract purified residual maps while preserving fragile sub-pixel topologies; (3) a Residual Score Modeling (RSM) stage where a Diffusion Transformer (DiT), guarded by a Physical Spectral Firewall (PSF), effectively isolates cross-spectral leakage; and (4) a Vector Dynamics Inference (VDI) stage that robustly decouples targets from backgrounds by evaluating high-dimensional vector interference patterns instead of conventional scalar errors. Comprehensive evaluations on eight datasets confirm that R2VD establishes a new state-of-the-art, delivering exceptional target detectability and background suppression.

[383] Video-based Heart Rate Estimation with Angle-guided ROI Optimization and Graph Signal Denoising

Gan Pei,Junhao Ning,Boqiu Shen,Yan Zhu,Menghan Hu

Main category: cs.CV

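TL;DR: This paper proposes two plug-and-play modules (an angle-guided ROI adaptive optimization module and a multi-region joint graph signal denoising module) to make heart rate measurement by remote photoplethysmography (rPPG) more robust to facial motion, significantly reducing mean absolute error (MAE) by 20.38% on average.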

Details Motivation: Remote photoplethysmography (rPPG) degrades markedly under facial motion (such as speaking and head shaking), so its motion robustness needs to be improved. Method: Proposes two plug-and-play modules: 1) an angle-guided ROI adaptive optimization module that quantifies ROI-camera angles to refine motion-affected signals and capture global motion; 2) a multi-region joint graph signal denoising module that uses graph signal processing to jointly model intra- and inter-regional ROI signals and suppress motion artifacts. The modules are compatible with reflection model-based rPPG methods. Result: Using the two modules together markedly reduces MAE, by 20.38% on average relative to the baseline; ablation experiments confirm the effectiveness of each module. Conclusion: Combining angle-guided optimization with graph signal denoising effectively improves rPPG performance in motion scenarios and shows good application potential. Abstract: Remote photoplethysmography (rPPG) enables non-contact heart rate measurement from facial videos, but its performance is significantly degraded by facial motions such as speaking and head shaking. To address this issue, we propose two plug-and-play modules. The Angle-guided ROI Adaptive Optimization module quantifies ROI-Camera angles to refine motion-affected signals and capture global motion, while the Multi-region Joint Graph Signal Denoising module jointly models intra- and inter-regional ROI signals using graph signal processing to suppress motion artifacts. The modules are compatible with reflection model-based rPPG methods and validated on three public datasets. Results show that joint use markedly reduces MAE, with an average decrease of 20.38% over the baseline, while ablation studies confirm the effectiveness of each module. The work demonstrates the potential of angle-guided optimization and graph-based denoising to enhance rPPG performance in motion scenarios.
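To give a feel for graph-based denoising of multi-region signals, the sketch below applies standard graph Tikhonov smoothing, which solves min_X ||X - Y||^2 + lam * tr(X^T L X) in closed form as X = (I + lam * L)^{-1} Y. This is a textbook graph signal processing technique swapped in for illustration; the abstract does not specify the paper's denoising formulation, and the fully connected four-ROI graph, the 1.2 Hz synthetic pulse, and the noise level are all assumptions.

```python
import numpy as np

def graph_tikhonov_denoise(Y, A, lam=1.0):
    """Smooth signals across graph nodes.

    Y: (n_rois, n_samples) noisy traces, one row per facial region.
    A: (n_rois, n_rois) symmetric adjacency matrix between regions.
    """
    L = np.diag(A.sum(axis=1)) - A                       # combinatorial Laplacian
    return np.linalg.solve(np.eye(A.shape[0]) + lam * L, Y)

# Synthetic example: four ROIs share one pulse wave, each with its own noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 300)
clean = np.tile(np.sin(2 * np.pi * 1.2 * t), (4, 1))
noisy = clean + 0.5 * rng.normal(size=clean.shape)
A = np.ones((4, 4)) - np.eye(4)                          # fully connected ROI graph
denoised = graph_tikhonov_denoise(noisy, A, lam=2.0)
```

A component shared by all regions (the pulse) lies in the null space of the Laplacian and passes through unchanged, while region-specific deviations are attenuated, which is the intuition behind modeling inter-regional ROI signals jointly.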

[384] GS4City: Hierarchical Semantic Gaussian Splatting via City-Model Priors

Qilin Zhang,Jinyu Zhu,Olaf Wysocki,Benjamin Busam,Boris Jutzi

Main category: cs.CV

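TL;DR: GS4City proposes a hierarchical semantic Gaussian Splatting method that incorporates city-model priors (such as CityGML). It generates image-aligned masks via two-pass raycasting and fuses them with foundation-model predictions to achieve structure-aware urban scene reconstruction, clearly outperforming existing 2D-driven methods on both coarse and fine-grained building semantic segmentation.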

Details Motivation: Existing semantic 3D Gaussian Splatting methods built on 2D foundation models suffer from ambiguous boundaries and lack support for structured semantics in urban scenes; city models such as CityGML contain hierarchical semantic and geometric information but are hard to map directly onto Gaussian primitives. Method: GS4City performs two-pass raycasting on LoD3 CityGML models to generate image-aligned masks, using parent-child relations to validate and recover facade details; fuses the geometry-guided masks with foundation-model predictions to establish scene-consistent instance correspondences; and learns a compact identity encoding for each Gaussian under joint 2D identity supervision and 3D spatial regularization. Result: On the TUM2TWIN and Gold Coast datasets, GS4City exceeds LangSplat and Gaga by up to 15.8 IoU points in coarse building segmentation and by up to 14.2 mIoU points in fine-grained semantic segmentation. Conclusion: GS4City bridges structured city models and photorealistic Gaussian scene representations, enabling semantically queryable, structure-aware urban reconstruction. Abstract: Recent semantic 3D Gaussian Splatting (3DGS) methods primarily rely on 2D foundation models, often yielding ambiguous boundaries and limited support for structured urban semantics. While city models such as CityGML encode hierarchically organized semantics together with building geometry, these labels cannot be directly mapped to Gaussian primitives. We present GS4City, a hierarchical semantic Gaussian Splatting method that incorporates city-model priors for urban scene understanding. GS4City derives reliable image-aligned masks from Level of Detail (LoD) 3 CityGML models via two-pass raycasting, explicitly using parent-child relations to validate and recover fine-grained facade elements. It then fuses these geometry-grounded masks with foundation-model predictions to establish scene-consistent instance correspondences, and learns a compact identity encoding for each Gaussian under joint 2D identity supervision and 3D spatial regularization. Experiments on the TUM2TWIN and Gold Coast datasets show that GS4City effectively incorporates structured building semantics into Gaussian scene representations, outperforming existing 2D-driven semantic 3DGS baselines, including LangSplat and Gaga, by up to 15.8 IoU points in coarse building segmentation and 14.2 mIoU points in fine-grained semantic segmentation. By bridging structured city models and photorealistic Gaussian scene representations, GS4City enables semantically queryable and structure-aware urban reconstruction. Code is available at https://github.com/Jinyzzz/GS4City.

[385] Scene Change Detection with Vision-Language Representation Learning

Diwei Sheng,Vijayraj Gohil,Satyam Gaba,Zihan Liu,Giles Hamilton-Fletcher,John-Ross Rizzo,Yongqing Liang,Chen Feng

Main category: cs.CV

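TL;DR: This paper proposes LangSCD, a scene change detection framework that combines vision and language. By introducing a language module and a geometric-semantic matching module, it improves change detection accuracy in complex urban environments, and it contributes NYC-CD, a large-scale New York City change detection dataset with multiclass annotations.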

Details Motivation: Existing methods rely on low-level visual features and struggle to identify changed objects accurately under variations in lighting, season, viewpoint, and urban layout; existing benchmarks also provide only binary change annotations, which cannot support fine-grained understanding of scene dynamics. Method: Proposes the LangSCD framework: 1) a vision-language model (VLM) generates textual descriptions of changes; 2) a cross-modal feature enhancer fuses textual and visual features; 3) a geometric-semantic matching module refines the predicted masks to ensure semantic consistency and spatial completeness; 4) the multiclass change dataset NYC-CD is constructed. Result: Significantly improves existing change detection architectures on several street-view benchmarks, reaching SOTA, and confirms that language-guided semantic reasoning is effective for robust change detection. Conclusion: Combining linguistic semantic reasoning with visual representations overcomes single-modality limitations, improves scene change detection in complex urban environments, and gives downstream applications a richer understanding of changes. Abstract: Scene change detection (SCD) is crucial for urban monitoring and navigation but remains challenging in real-world environments due to lighting variations, seasonal shifts, viewpoint differences, and complex urban layouts. Existing methods rely primarily on low-level visual features, limiting their ability to accurately identify changed objects amid the visual complexity of urban scenes. In this paper, we propose LangSCD, a vision-language framework for scene change detection that overcomes this single-modal limitation by incorporating semantic reasoning through language. Our approach introduces a modular language component that leverages vision-language models (VLMs) to generate textual descriptions of scene changes, which are fused with visual features through a cross-modal feature enhancer. We further introduce a geometric-semantic matching module that refines the predicted masks by enforcing semantic consistency and spatial completeness. Existing real-world scene change detection benchmarks provide only binary change annotations, which are insufficient for downstream applications requiring fine-grained understanding of scene dynamics. To address this limitation, we introduce NYC-CD, a large-scale dataset of 8,122 real-world image pairs collected in New York City with multiclass change annotations generated through a semi-automatic pipeline. 
Extensive experiments across multiple street-view benchmarks demonstrate that our language and matching modules consistently improve existing change-detection architectures, achieving state-of-the-art performance and highlighting the value of integrating linguistic reasoning with visual representations for robust scene change detection.

[386] Online Reasoning Video Object Segmentation

Jinyuan Liu,Yang Wang,Zeyu Zhao,Weixin Li,Song Wang,Ruize Han

Main category: cs.CV

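TL;DR: This paper introduces the Online Reasoning Video Object Segmentation (ORVOS) task, which requires causal, frame-by-frame prediction using only past and current frames, along with ORVOSB, the first benchmark supporting evaluation of this task, and a corresponding baseline method.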

Details Motivation: Existing reasoning video object segmentation methods run offline and can use future frames for retrospective disambiguation, which does not match real deployments that require strictly causal, frame-by-frame decisions; they also do not account for referents that change as events unfold. Method: Builds the ORVOSB benchmark with frame-level causal annotations and referent-shift labels; proposes a baseline that uses continually updated segmentation prompts and a structured temporal token reservoir to support long-horizon causal reasoning under bounded computation. Result: Experiments show that existing methods degrade substantially under strict causality and referent shifts, while the proposed baseline establishes strong baseline performance on ORVOSB. Conclusion: ORVOS is a new task closer to real applications, and the ORVOSB benchmark and the proposed baseline provide an important foundation for future research on online video-language reasoning. Abstract: Reasoning video object segmentation predicts pixel-level masks in videos from natural-language queries that may involve implicit and temporally grounded references. However, existing methods are developed and evaluated in an offline regime, where the entire video is available at inference time and future frames can be exploited for retrospective disambiguation, deviating from real-world deployments that require strictly causal, frame-by-frame decisions. We study Online Reasoning Video Object Segmentation (ORVOS), where models must incrementally interpret queries using only past and current frames without revisiting previous predictions, while handling referent shifts as events unfold. To support evaluation, we introduce ORVOSB, a benchmark with frame-level causal annotations and referent-shift labels, comprising 210 videos, 12,907 annotated frames, and 512 queries across five reasoning categories. We further propose a baseline with continually-updated segmentation prompts and a structured temporal token reservoir for long-horizon reasoning under bounded computation. Experiments show that existing methods struggle under strict causality and referent shifts, while our baseline establishes a strong foundation for future research.

[387] Observe Less, Understand More: Cost-aware Cross-scale Observation for Remote Sensing Understanding

Zhenghao Xie,Jing Xiao,Zhenqi Wang,Kexin Ma,Liang Liao,Gui-Song Xia,Mi Wang

Main category: cs.CV

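TL;DR: This paper proposes a cross-scale remote sensing understanding method that jointly optimizes high-resolution (HR) sampling and cross-patch representation prediction to improve task performance under a cost constraint, and builds GL-10M, a large-scale benchmark of 10 million aligned multi-resolution images.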

Details Motivation: Remote sensing understanding requires multi-resolution observation, but high-resolution imagery is costly to acquire and limited in coverage; existing HR sampling methods ignore intra-patch importance differences and cross-patch context, leading to fragmented representations and suboptimal reasoning. Method: Formulates cross-scale remote sensing understanding as a unified cost-aware problem that couples fine-grained HR sampling with cross-patch representation prediction, and builds the large-scale multi-resolution benchmark GL-10M. Result: On recognition and retrieval tasks, the proposed method consistently achieves a better performance-cost trade-off when the number of HR observations is limited. Conclusion: Jointly modeling HR sampling and cross-patch representation prediction markedly improves scene reasoning under sparse HR observations, and GL-10M provides a platform for systematic evaluation of budget-constrained cross-scale research. Abstract: Remote sensing understanding inherently requires multi-resolution observation, since different targets and application tasks demand different levels of spatial detail. While low-resolution (LR) imagery enables efficient global observation, high-resolution (HR) imagery provides critical local details at much higher acquisition cost and limited coverage. This motivates a cross-scale sensing strategy that selectively acquires HR imagery from LR-based global perception to improve task performance under constrained cost. Existing HR sampling methods typically make selection decisions from isolated LR patches, which ignore fine-grained intra-patch importance and cross-patch contextual interactions, leading to fragmented feature representation and suboptimal scene reasoning under sparse HR observations. To address this issue, we formulate cross-scale remote sensing understanding as a unified cost-aware problem that couples fine-grained HR sampling with cross-patch representation prediction, enabling more effective task reasoning with fewer HR observations. Furthermore, we present GL-10M, a large-scale benchmark of 10 million spatially aligned multi-resolution images, enabling systematic evaluation of budget-constrained cross-scale reasoning in remote sensing. Extensive experiments on recognition and retrieval tasks show that our method consistently achieves a superior performance-cost trade-off.

[388] HuiYanEarth-SAR: A Foundation Model for High-Fidelity and Low-Cost Global Remote Sensing Imagery Generation

Yongxiang Liu,Jie Zhou,Yafei Song,Tianpeng Liu,Li Liu

Main category: cs.CV

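TL;DR: This paper proposes HuiYanEarth-SAR, the first foundation model for SAR imagery generation based on AlphaEarth and integrated scattering mechanisms. It injects geospatial priors to control macroscopic structure and implicitly models scattering characteristics to keep microscopic textures realistic, generating high-fidelity SAR images of global locations from geographic coordinates alone.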

Details Motivation: Existing SAR imagery generation methods struggle to achieve high fidelity in both global geospatial semantics and microscopic scattering mechanisms at once; the work also aims to ease the data scarcity bottleneck in this field. Method: Builds the HuiYanEarth-SAR model on the AlphaEarth framework, injecting geospatial priors to control macroscopic land-cover structure and using implicit scattering feature modeling to ensure the physical realism of microscopic textures. Result: Generates high-fidelity SAR images of any location on Earth from geographic coordinates alone, yields an efficient SAR scene simulator, and builds a methodological bridge between geography, scattering mechanisms, and artificial intelligence. Conclusion: The work moves the SAR research paradigm from perception and understanding toward simulation and creation, providing key technical support for building a high-confidence digital twin of the Earth. Abstract: Synthetic Aperture Radar (SAR) imagery generation is essential for deepening the study of scattering mechanisms, establishing trustworthy electromagnetic scene models, and fundamentally alleviating the data scarcity bottleneck that constrains development in this field. However, existing methods find it difficult to simultaneously ensure high fidelity in both global geospatial semantics and microscopic scattering mechanisms, resulting in severe challenges for global generation. To address this, we propose HuiYanEarth-SAR, the first foundational SAR imagery generation model based on AlphaEarth and integrated scattering mechanisms. By injecting geospatial priors to control macroscopic structures and utilizing implicit scattering characteristic modeling to ensure the authenticity of microscopic textures, we achieve the capability of generating high-fidelity SAR images for global locations solely based on geographic coordinates. This study not only constructs an efficient SAR scene simulator but also establishes a bridge connecting geography, scattering mechanisms, and artificial intelligence from a methodological standpoint. It advances SAR research by expanding the paradigm from perception and understanding to simulation and creation, providing key technical support for constructing a high-confidence digital twin of the Earth.

[389] Beyond Model Design: Data-Centric Training and Self-Ensemble for Gaussian Color Image Denoising

Gengjia Chang,Xining Ge,Weijun Yuan,Zhan Li,Qiurong Song,Luen Zhu,Shuhong Liu

Main category: cs.CV

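TL;DR: By strengthening data-centric training and unlocking test-time capability, this paper substantially improves the mature Restormer architecture on the NTIRE 2026 Gaussian color image denoising task ($\sigma = 50$), with a PSNR gain of up to 3.366 dB.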

Details Motivation: Aims to explore the performance boundary of the mature Restormer architecture at a fixed noise level ($\sigma = 50$) rather than design a new backbone, focusing on optimizing data and test-time strategy. Method: Trains in two stages on larger and more diverse public image datasets and applies $\times 8$ geometric self-ensemble at inference; a TLC-style local inference wrapper is retained but found to contribute negligibly. Result: Reaches 30.762 dB PSNR and 0.861 SSIM on the validation set (100 images), up to 3.366 dB PSNR above the original Restormer baseline; ablations show that the main gain comes from the expanded training data and the two-stage optimization, while self-ensemble gives a small but stable improvement. Conclusion: Data scale and diversity and the optimization schedule do more than architectural innovation to raise the performance ceiling of a mature denoising architecture, and test-time ensembling is an effective, low-cost complement. Abstract: This paper presents our solution to the NTIRE 2026 Image Denoising Challenge (Gaussian color image denoising at fixed noise level $\sigma = 50$). Rather than proposing a new restoration backbone, we revisit the performance boundary of the mature Restormer architecture from two complementary directions: stronger data-centric training and more complete test-time capability release. Starting from the public Restormer $\sigma = 50$ baseline, we expand the standard multi-dataset training recipe with larger and more diverse public image corpora and organize optimization into two stages. At inference, we apply $\times 8$ geometric self-ensemble to further release model capacity. A TLC-style local inference wrapper is retained for implementation consistency; however, systematic ablation reveals its quantitative contribution to be negligible in this setting. On the challenge validation set of 100 images, our final submission achieves 30.762 dB PSNR and 0.861 SSIM, improving over the public Restormer $\sigma = 50$ pretrained baseline by up to 3.366 dB PSNR. Ablation studies show that the dominant gain originates from the expanded training corpus and the two-stage optimization schedule, and self-ensemble provides marginal but consistent improvement.
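The $\times 8$ geometric self-ensemble mentioned in the abstract is a standard test-time technique: run the model on the eight flip and rotation variants of the input, undo each transform on the output, and average. A minimal NumPy sketch follows; the function name and the identity stand-in for the denoiser are assumptions, and the authors' exact implementation is not described in the abstract.

```python
import numpy as np

def self_ensemble_x8(model, img):
    """Average model outputs over 4 rotations x 2 flips of an (H, W, C) image.

    `model` is any callable mapping an (H, W, C) array to an array of the
    same shape; for non-square inputs it must accept both orientations.
    """
    outputs = []
    for flip in (False, True):
        x = img[:, ::-1] if flip else img                 # optional horizontal flip
        for k in range(4):
            y = model(np.rot90(x, k, axes=(0, 1)))        # rotate by k * 90 degrees
            y = np.rot90(y, -k, axes=(0, 1))              # undo the rotation
            outputs.append(y[:, ::-1] if flip else y)     # undo the flip
    return np.mean(outputs, axis=0)

# Identity "denoiser" as a stand-in: the ensemble must return the input unchanged.
img = np.random.default_rng(0).random((6, 6, 3))
restored = self_ensemble_x8(lambda x: x, img)
```

The cost is eight forward passes per image, which is consistent with the abstract's finding that the gain is small but consistent relative to the training-side changes.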

[390] Degradation-Aware and Structure-Preserving Diffusion for Real-World Image Super-Resolution

Yang Ji,Zonghao Chen,Zhihao Xue,Junqin Hu

Main category: cs.CV

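TL;DR: This paper proposes a degradation-aware, structure-preserving diffusion framework for real-world image super-resolution. Two lightweight modules, Degradation-aware Token Injection and Spatially Asymmetric Noise Injection, improve the perceptual quality and structural fidelity of the reconstruction.

Details Motivation: Real-world image degradations are complex, heterogeneous, and hard to model explicitly, which limits existing diffusion models on real-world SR. Method: Degradation-aware Token Injection encodes lightweight degradation statistics from the low-resolution input and fuses them with semantic features; Spatially Asymmetric Noise Injection modulates diffusion noise by local edge strength. Both are lightweight plug-in modules. Result: On DIV2K and RealSR, the method delivers competitive no-reference perceptual quality and visually more realistic results than recent baselines while keeping a favorable perception-distortion trade-off; ablations show that the two modules are effective and complementary. Conclusion: With minimal modification, the framework models real degradations explicitly and protects structure, offering a new direction for diffusion models in real-world SR. Abstract: Real-world image super-resolution is particularly challenging for diffusion models because real degradations are complex, heterogeneous, and rarely modeled explicitly. We propose a degradation-aware and structure-preserving diffusion framework for real-world SR. Specifically, we introduce Degradation-aware Token Injection, which encodes lightweight degradation statistics from low-resolution inputs and fuses them with semantic conditioning features, enabling explicit degradation-aware restoration. We further propose Spatially Asymmetric Noise Injection, which modulates diffusion noise with local edge strength to better preserve structural regions during training. Both modules are lightweight add-ons to the adopted diffusion SR framework, requiring only minor modifications to the conditioning pipeline. Experiments on DIV2K and RealSR show that our method delivers competitive no-reference perceptual quality and visually more realistic restoration results than recent baselines, while maintaining a favorable perception--distortion trade-off. Ablations confirm the effectiveness of each module and their complementary gains when combined. The code and model are publicly available at https://github.com/jiyang0315/DASP-SR.git.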

[391] PACO: Proxy-Task Alignment and Online Calibration for On-the-Fly Category Discovery

Weidong Tang,Bohan Zhang,Zhixiang Chi,ZiZhang Wu,Yang Wang,Yanan Wu

Main category: cs.CV

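TL;DR: This paper proposes PACO, a framework that combines support-set calibration, a tree-structured online decision mechanism, and dynamic threshold updates to address the instability that static thresholds and fixed decision boundaries cause at inference in existing On-the-Fly Category Discovery (OCD) methods. It improves performance substantially without heavy training or dataset-specific tuning.

Details Motivation: Existing OCD methods treat inference as a static classification problem that relies on a single fixed threshold, ignoring its dynamic nature; they also do not update decision boundaries as streaming data arrives, so category discovery is unstable and inconsistent. Method: PACO is a support-set-calibrated, tree-structured online decision framework comprising known-class routing, birth-aware novel assignment, and attach-versus-create operations over a dynamic prototype memory. A proxy discovery process is simulated offline to initialize the thresholds, which are then updated continuously during inference using mature novel prototypes. Result: PACO significantly outperforms SOTA baselines on seven benchmarks; adaptive threshold calibration alone yields substantial gains, with no change to the feature representation and no retraining. Conclusion: OCD is inherently a dynamic decision process that needs adaptive, updatable online mechanisms; as a plug-and-play inference module, PACO is general, low-overhead, and effective. Abstract: On-the-Fly Category Discovery (OCD) requires a model, trained on an offline support set, to recognize known classes while discovering new ones from an online streaming sequence. Existing methods focus heavily on offline training. They aim to learn discriminative representations on the support set so that novel classes can be separated at test time. However, their discovery mechanism at inference is typically reduced to a single threshold. We argue that this paradigm is fundamentally flawed as OCD is not a static classification problem, but a dynamic process. The model must continuously decide 1) whether a sample belongs to a known class, 2) matches an existing novel category, or 3) should initiate a new one. Moreover, prior methods treat the support set as fixed knowledge. They do not update their decision boundaries as new evidence arrives during inference. This leads to unstable and inconsistent category formation. Our experiments confirm these issues. With properly calibrated and adaptive thresholds, substantial improvements can be achieved, even without changing the representation. Motivated by this, we propose PACO, a support-set-calibrated, tree-structured online decision framework. The framework models inference as a sequence of hierarchical decisions, including known-class routing, birth-aware novel assignment, and attach-versus-create operations over a dynamic prototype memory. Furthermore, we simulate the proxy discovery process to initialize the thresholds during offline training to align with inference. Thresholds are continuously updated during inference using mature novel prototypes.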
Importantly, PACO requires no heavy training and no dataset-specific tuning. It can be directly integrated into existing OCD pipelines as an inference-time module. Extensive experiments show significant improvements over SOTA baselines across seven benchmarks.
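The three-way decision described in the abstract (known class, existing novel category, or new category) can be illustrated with a small prototype memory. The following is a rough sketch under stated assumptions, not PACO itself: cosine similarity, the two fixed thresholds `tau_known` and `tau_attach`, and the running-mean prototype update are choices made for this example, and the birth-aware assignment and online threshold updates from the paper are omitted.

```python
import numpy as np


def _cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


class OnlineDiscoverer:
    """Hypothetical known / attach / create router over prototypes."""

    def __init__(self, known_prototypes, tau_known, tau_attach):
        self.known = [np.asarray(p, dtype=float) for p in known_prototypes]
        self.novel = []    # running-mean prototypes of discovered categories
        self.counts = []   # number of samples attached to each novel prototype
        self.tau_known = tau_known
        self.tau_attach = tau_attach

    def step(self, feature):
        x = np.asarray(feature, dtype=float)
        # 1) Known-class routing: accept the closest known prototype if similar enough.
        sims = [_cosine(x, p) for p in self.known]
        best = int(np.argmax(sims))
        if sims[best] >= self.tau_known:
            return ("known", best)
        # 2) Attach to an existing novel category if one is close enough.
        if self.novel:
            novel_sims = [_cosine(x, p) for p in self.novel]
            j = int(np.argmax(novel_sims))
            if novel_sims[j] >= self.tau_attach:
                self.counts[j] += 1
                self.novel[j] += (x - self.novel[j]) / self.counts[j]
                return ("attach", j)
        # 3) Otherwise create a new novel category seeded by this sample.
        self.novel.append(x.copy())
        self.counts.append(1)
        return ("create", len(self.novel) - 1)
```

Each call to `step` handles one streaming sample, so the memory of novel prototypes grows only when neither of the first two branches fires.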

[392] NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the Wild

Aleksandr Gushchin,Khaled Abud,Ekaterina Shumitskaya,Artem Filippov,Georgii Bychkov,Sergey Lavrushkin,Mikhail Erofeev,Anastasia Antsiferova,Changsheng Chen,Shunquan Tan,Radu Timofte,Dmitry Vatolin,Chuanbiao Song,Zijian Yu,Hao Tan,Jun Lan,Zhiqiang Yang,Yongwei Tang,Zhiqiang Wu,Jia Wen Seow,Hong Vin Koay,Haodong Ren,Feng Xu,Shuai Chen,Ruiyang Xia,Qi Zhang,Yaowen Xu,Zhaofan Zou,Hao Sun,Dagong Lu,Mufeng Yao,Xinlei Xu,Fei Wu,Fengjun Guo,Cong Luo,Hardik Sharma,Aashish Negi,Prateek Shaily,Jayant Kumar,Sachin Chaudhary,Akshay Dudhane,Praful Hambarde,Amit Shukla,Zhilin Tu,Fengpeng Li,Jiamin Zhang,Jianwei Fei,Kemou Li,Haiwei Wu,Bilel Benjdira,Anas M. Ali,Wadii Boulila,Chenfan Qu,Junchi Li

Main category: cs.CV

TL;DR: This paper presents the NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the Wild, which aims to advance methods that accurately distinguish real from AI-generated images in realistic scenarios involving a variety of image transformations.

Details Motivation: In practice, AI-generated images are often cropped, resized, compressed, or blurred, and existing detectors are not robust enough to such transformations, so detection methods that withstand real-world interference are needed. Method: A new dataset of 108,750 real images and 185,750 AI-generated images from 42 generators was built and augmented with 36 image transformations; submissions were evaluated by ROC AUC on the full test set, covering both transformed and untransformed images. Result: 511 participants registered and 20 teams submitted valid final solutions; the report summarizes the submitted methods and their performance. Conclusion: The challenge provides a benchmark and practical reference for improving the robustness of AI-generated image detectors to real-world transformations. Abstract: This paper presents an overview of the NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the Wild, held in conjunction with the NTIRE workshop at CVPR 2026. The goal of this challenge was to develop detection models capable of distinguishing real images from generated ones in realistic scenarios: the images are often transformed (cropped, resized, compressed, blurred) for practical usage, and therefore, the detection models should be robust to such transformations. The challenge is based on a novel dataset consisting of 108,750 real and 185,750 AI-generated images from 42 generators comprising a large variety of open-source and closed-source models of various architectures, augmented with 36 image transformations. Methods were evaluated using ROC AUC on the full test set, including both transformed and untransformed images. A total of 511 participants registered, with 20 teams submitting valid final solutions. This report provides a comprehensive overview of the challenge, describes the proposed solutions, and can be used as a valuable reference for researchers and practitioners in increasing the robustness of the detection models to real-world transformations.
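Since submissions are ranked by ROC AUC, it may help to recall what the metric measures: the probability that a randomly chosen generated image receives a higher score than a randomly chosen real one, with ties counted as one half. The pure-Python sketch below computes it by direct pairwise comparison; it is only an illustration (the challenge's actual evaluation script is not described in the abstract), and the convention that label 1 means "AI-generated" is an assumption.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores.

    `labels` holds 0/1 values (1 = positive class, assumed here to be
    "AI-generated"); `scores` holds the detector's confidence per image.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # correctly ordered pair
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(positives) * len(negatives))
```

The quadratic loop is fine for a small illustration; on a test set of this size a rank-based implementation would be the practical choice.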

[393] TAG-Head: Time-Aligned Graph Head for Plug-and-Play Fine-grained Action Recognition

Imtiaz Ul Hassan,Nik Bessis,Ardhendu Behera

Main category: cs.CV

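TL;DR: This paper proposes TAG-Head, a lightweight spatio-temporal graph head that improves fine-grained human action recognition (FHAR) from RGB input alone. A Transformer encoder models long-range spatio-temporal dependencies, and a graph with fully-connected intra-frame edges and time-aligned inter-frame edges increases discriminability. TAG-Head reaches SOTA among RGB-only methods on FineGym and HAA500 and surpasses several multimodal methods.

Details Motivation: FHAR is hard because actions are visually similar and differ only subtly; existing methods often rely on extra modalities (e.g., pose, text, optical flow), which raises annotation and computation costs, so an efficient RGB-only solution is needed. Method: TAG-Head (1) applies a Transformer encoder with learnable 3D positional encodings to the output tokens of a 3D backbone to capture long-range spatio-temporal dependencies, and (2) builds a graph with two edge types: fully-connected intra-frame edges that resolve subtle appearance differences, and time-aligned inter-frame edges that connect features at the same spatial location to stabilize motion cues without over-smoothing. The head is lightweight, plug-and-play, and trained end-to-end. Result: On FineGym (Gym99/Gym288) and HAA500, TAG-Head sets a new SOTA among RGB-only models and surpasses several multimodal methods that use pose, text, and video; ablations confirm the contributions of the Transformer and the graph topology, and complexity analysis shows low latency. Conclusion: By explicitly coupling global context, high-resolution spatial interactions, and low-variance temporal continuity in a lightweight graph head, TAG-Head advances FHAR; its simple design eases deployment in practical RGB-sensor systems while delivering gains usually associated with heavier or multimodal models. Abstract: Fine-grained human action recognition (FHAR) is challenging because visually similar actions differ by subtle spatio-temporal cues. Many recent systems enhance discriminability with extra modalities (e.g., pose, text, optical flow), but this increases annotation burden and computational cost. We introduce TAG-Head, a lightweight spatio-temporal graph head that upgrades standard 3D backbones (SlowFast, R(2+1)D-34, I3D, etc.) for FHAR using RGB only. Our pipeline first applies a Transformer encoder with learnable 3D positional encodings to the backbone tokens, capturing long-range dependencies across space and time. The resulting features are then refined by a graph with (i) fully-connected intra-frame edges that resolve subtle appearance differences within frames, and (ii) time-aligned temporal edges that connect features at the same spatial location across frames to stabilise motion cues without over-smoothing. The head is compact (little parameter/FLOP overhead), plug-and-play across backbones, and trained end-to-end with the backbone. Extensive evaluations on FineGym (Gym99 and Gym288) and HAA500 show that TAG-Head sets a new state-of-the-art among RGB-only models and surpasses many recent multimodal approaches (video + pose + text) that rely on privileged information. Ablations disentangle the contributions of the Transformer and the graph topology, and complexity analyses confirm low latency.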
TAG-Head advances FHAR by explicitly coupling global context with high-resolution spatial interactions and low-variance temporal continuity inside a slim, composable graph head. The simplicity of the design enables straightforward adoption in practical systems that favour RGB-only sensors, while delivering performance gains typically associated with heavier or multimodal models. Code will be released on GitHub.
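The graph topology described in the abstract can be made concrete with an adjacency matrix over backbone tokens. The sketch below is an illustrative reading, not the released implementation: it assumes tokens are indexed as `t * num_positions + s`, that self-loops are included, and that time-aligned edges link only adjacent frames (the abstract says "across frames" without specifying the temporal span). The function name `build_tag_adjacency` is hypothetical.

```python
import numpy as np


def build_tag_adjacency(num_frames, num_positions):
    """Boolean adjacency with intra-frame and time-aligned temporal edges.

    Token index = t * num_positions + s for frame t and spatial position s.
    """
    n = num_frames * num_positions
    adj = np.zeros((n, n), dtype=bool)
    # (i) Fully-connected intra-frame edges (self-loops included).
    for t in range(num_frames):
        start = t * num_positions
        adj[start:start + num_positions, start:start + num_positions] = True
    # (ii) Time-aligned edges: same spatial position in adjacent frames (assumed).
    for t in range(num_frames - 1):
        for s in range(num_positions):
            i = t * num_positions + s
            j = (t + 1) * num_positions + s
            adj[i, j] = True
            adj[j, i] = True
    return adj
```

Under these assumptions, restricting temporal edges to matching spatial positions keeps the graph sparse across time, which fits the stated aim of stabilising motion cues without over-smoothing.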

[394] SVD-Prune: Training-Free Token Pruning For Efficient Vision-Language Models

Yvon Apedo,Martyna Poreba,Michal Szczepanski,Samia Bouchafa

Main category: cs.CV

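TL;DR: This paper proposes SVD-Prune, a training-free, plug-and-play vision token pruning method that uses Singular Value Decomposition (SVD) and statistical leverage scores to select the tokens contributing most to the global variance, maintaining strong performance even under very small vision token budgets (e.g., 16 or 32 tokens).

Details Motivation: Vision-language models (VLMs) incur high compute and memory costs on long vision token sequences; mainstream pruning methods based on local heuristics (such as attention scores or norms) suffer from positional bias and information dispersion, so they struggle to retain essential visual content at high pruning ratios, with a marked drop on detail-rich images. Method: SVD-Prune applies singular value decomposition to the vision token feature matrix and selects the top-K tokens by statistical leverage score, prioritizing the tokens that contribute most to the dominant global variance; it requires no training and is plug-and-play. Result: Under extreme vision token budgets (e.g., 32 and 16 tokens), SVD-Prune consistently outperforms prior pruning methods and maintains strong performance. Conclusion: With a pruning strategy that is aware of global structure, SVD-Prune mitigates the bias and information loss of local heuristics and offers a simple, robust, training-free option for efficient VLM inference. Abstract: Vision-Language Models (VLM) have revolutionized multimodal learning by jointly processing visual and textual information. Yet, they face significant challenges due to the high computational and memory demands of processing long sequences of vision tokens. Many existing methods rely on local heuristics, such as attention scores or token norms. However, these criteria suffer from positional bias and information dispersion, limiting their ability to preserve essential content at high pruning ratios and leading to performance degradation on visually detailed images. To address these issues, we propose SVD-Prune, a training-free, plug-and-play token pruning method based on Singular Value Decomposition. It decomposes the vision token feature matrix and selects the top-K tokens using statistical leverage scores, ensuring only tokens contributing most to the dominant global variance are preserved. Experiments show that SVD-Prune consistently outperforms prior pruning methods under extreme vision token budgets, maintaining strong performance even with 32 and 16 vision tokens.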
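Leverage-score selection has a compact standard form: for a token matrix with thin SVD X = UΣVᵀ, the rank-r leverage score of token i is the squared norm of the first r entries of row i of U. The sketch below illustrates that textbook definition with NumPy; it is not the authors' implementation, and the `rank` parameter, the function name, and the choice to return kept indices in their original order are assumptions for this example.

```python
import numpy as np


def select_tokens_by_leverage(tokens, num_keep, rank):
    """Keep the `num_keep` tokens with the largest rank-`rank` leverage scores.

    `tokens` is an (N, D) matrix with one vision token per row.
    Returns (sorted indices of kept tokens, leverage score of every token).
    """
    # Thin SVD: the columns of `u` span the dominant directions of the token set.
    u, _, _ = np.linalg.svd(tokens, full_matrices=False)
    # Leverage score = squared row norm of U restricted to the top `rank` columns.
    scores = np.sum(u[:, :rank] ** 2, axis=1)
    # Take the top-scoring tokens, then restore their original sequence order.
    keep = np.sort(np.argsort(scores)[-num_keep:])
    return keep, scores
```

Unlike a norm-based heuristic, the score depends on how a token aligns with the dominant singular directions of the whole set, which is one way to read the abstract's "dominant global variance" criterion.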

[395] CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space

Sohwi Lim,Lee Hyoseok,Jungjoon Park,Tae-Hyun Oh

Main category: cs.CV

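TL;DR: This paper proposes CLAY, which reframes the embedding space of a pretrained vision-language model as a text-conditional similarity space, enabling adaptive image retrieval without additional training and efficient multi-condition retrieval. A synthetic evaluation dataset, CLAY-EVAL, is built to validate it.

Details Motivation: Human perception of visual similarity is adaptive and subjective, whereas existing image retrieval systems rely on a fixed, single metric that cannot combine multiple conditions at once. Method: CLAY reframes the embedding space of a pretrained vision-language model (VLM) as a text-conditional similarity space, separating textual conditioning from visual feature extraction and reusing fixed visual embeddings for efficient multi-condition retrieval. Result: Experiments on standard datasets and the newly built CLAY-EVAL dataset show that CLAY achieves high retrieval accuracy and notable computational efficiency compared to previous methods. Conclusion: CLAY is a flexible, efficient method for text-guided, multi-condition visual similarity that needs no fine-tuning and makes image retrieval more adaptive and practical. Abstract: Human perception of visual similarity is inherently adaptive and subjective, depending on the users' interests and focus. However, most image retrieval systems fail to reflect this flexibility, relying on a fixed, monolithic metric that cannot incorporate multiple conditions simultaneously. To address this, we propose CLAY, an adaptive similarity computation method that reframes the embedding space of pretrained Vision-Language Models (VLMs) as a text-conditional similarity space without additional training. This design separates the textual conditioning process and visual feature extraction, allowing highly efficient and multi-conditioned retrieval with fixed visual embeddings. We also construct a synthetic evaluation dataset, CLAY-EVAL, for comprehensive assessment under diverse conditioned retrieval settings. Experiments on standard datasets and our proposed dataset show that CLAY achieves high retrieval accuracy and notable computational efficiency compared to previous works.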

[396] Progressively Texture-Aware Diffusion for Contrast-Enhanced Sparse-View CT

Tianqi Wang,Wenchao Du,Hongyu Yang

Main category: cs.CV

TL;DR: This paper proposes a Progressively Texture-aware Diffusion (PTD) model for sparse-view CT reconstruction, a coarse-to-fine learning framework that improves texture consistency and visual quality while preserving image fidelity.

Details Motivation: Diffusion-based sparse-view CT imaging is stable, but recovering reliable image content and visually consistent textures remains challenging. Method: PTD consists of a basic reconstruction module, PTD_rec, which learns the low-frequency signal and provides an initial high-fidelity estimate, and a conditional diffusion module, PTD_diff, which reconstructs high-frequency details and consistent textures under dual-domain guidance. Result: On sparse-view CT reconstruction, PTD achieves superior structural similarity and visual quality with only a few sampling steps, reducing the randomness inherent in diffusion models and giving a better trade-off between visual quality and fidelity of high-frequency details. Conclusion: PTD is an effective coarse-to-fine diffusion framework that jointly improves content reliability and texture consistency in sparse-view CT reconstruction. Abstract: Diffusion-based sparse-view CT (SVCT) imaging has achieved remarkable advancements in recent years, thanks to its more stable generative capability. However, recovering reliable image content and visually consistent textures is still a crucial challenge. In this paper, we present a Progressively Texture-aware Diffusion (PTD) model, a coarse-to-fine learning framework tailored for SVCT. Specifically, PTD comprises a basic reconstructive module PTD$_{\textit{rec}}$ and a conditional diffusion module PTD$_{\textit{diff}}$. PTD$_{\textit{rec}}$ first learns a deterministic mapping to recover the majority of the underlying low-frequency signals (i.e., coarse content with smoothed textures), which serves as the initial estimation to enable fidelity. Moreover, PTD$_{\textit{diff}}$ aims to reconstruct high-fidelity details for coarse prediction, which explores a dual-domain guided conditional diffusion to generate reliable and consistent textures. Extensive experiments on sparse-view CT reconstruction demonstrate that our PTD achieves superior performance in terms of structure similarity and visual appeal with only a few sampling steps, which mitigates the randomness inherent in general diffusion models and enables a better trade-off between visual quality and fidelity of high-frequency details.

[397] The Impact of Federated Learning on Distributed Remote Sensing Archives

Anand Umashankar,Karam Tomotaki-Dawoud,Nicolai Schneider

Main category: cs.CV

TL;DR: This paper systematically studies federated learning (FL) for multi-label remote sensing image classification, comparing FedAvg, FedProx, and BSP under non-IID data. FedProx performs better with deeper networks, BSP is accurate but communication-heavy, and LeNet gives the best balance between accuracy and communication cost.

Details Motivation: Remote sensing data is distributed, subject to strong sovereignty constraints, and geographically diverse, and its label distributions are highly non-IID, which makes standard FL algorithms hard to converge; a systematic evaluation of FL strategies for remote sensing is therefore needed. Method: A systematic empirical study of FedAvg, FedProx, and BSP under controlled non-IID label skew, using three CNN architectures (LeNet, AlexNet, ResNet-34) and analyzing the joint effect of algorithm choice, model capacity, client fraction, client count, batch size, and communication cost; experiments use the UC Merced multi-label remote sensing dataset. Result: FedProx outperforms FedAvg for deeper networks (e.g., ResNet-34) under data heterogeneity; BSP approaches centralized accuracy but at the cost of high sequential communication; LeNet gives the best accuracy-communication trade-off at this dataset scale. Conclusion: The choice of FL strategy should account for both model depth and data heterogeneity; BSP is accurate but ill-suited to low-latency or highly concurrent settings; at the dataset scale considered, a lightweight model such as LeNet paired with a suitable FL algorithm such as FedProx is the more practical option. Abstract: Remote sensing archives are inherently distributed: Earth observation missions such as Sentinel-1, Sentinel-2, and Sentinel-3 have collectively accumulated more than 5 petabytes of imagery, stored and processed across many geographically dispersed platforms. Training machine learning models on such data in a centralized fashion is impractical due to data volume, sovereignty constraints, and geographic distribution. Federated learning (FL) addresses this by keeping data local and exchanging only model updates. A central challenge for remote sensing is the non-IID nature of Earth observation data: label distributions vary strongly by geographic region, degrading the convergence of standard FL algorithms. In this paper, we conduct a systematic empirical study of three FL strategies -- FedAvg, FedProx, and bulk synchronous parallel (BSP) -- applied to multi-label remote sensing image classification under controlled non-IID label-skew conditions. We evaluate three convolutional neural network (CNN) architectures of increasing depth (LeNet, AlexNet, and ResNet-34) and analyze the joint effect of algorithm choice, model capacity, client fraction, client count, batch size, and communication cost. Experiments on the UC Merced multi-label dataset show that FedProx outperforms FedAvg for deeper architectures under data heterogeneity, that BSP approaches centralized accuracy at the cost of high sequential communication, and that LeNet provides the best accuracy-communication trade-off for the dataset scale considered.
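The difference between FedAvg and FedProx is small in code. FedAvg aggregates client models as an average weighted by local sample count; FedProx keeps that aggregation but adds a proximal term (μ/2)·‖w − w_global‖² to each client's local objective, which pulls local updates back toward the global model under heterogeneous data. The sketch below shows both pieces on flat NumPy weight vectors; it illustrates the standard formulations rather than the paper's experimental code, and the function names and plain gradient-descent step are assumptions.

```python
import numpy as np


def fedavg_aggregate(client_weights, client_sizes):
    """Sample-count-weighted average of client weight vectors (FedAvg)."""
    total = float(sum(client_sizes))
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))


def fedprox_local_step(w, global_w, grad, lr, mu):
    """One local gradient step on loss(w) + (mu / 2) * ||w - global_w||^2.

    `grad` is the gradient of the client's own loss at `w`; the proximal
    term contributes mu * (w - global_w). Setting mu = 0 recovers the
    plain local SGD step used by FedAvg.
    """
    return w - lr * (grad + mu * (w - global_w))
```

A larger `mu` limits how far each client drifts between rounds, which is one plausible reason the study finds FedProx more helpful for deeper networks under label skew.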

[398] Training-Free Model Ensemble for Single-Image Super-Resolution via Strong-Branch Compensation

Gengjia Chang,Xining Ge,Weijun Yuan,Zhan Li,Qiurong Song,Luen Zhu,Shuhong Liu

Main category: cs.CV

TL;DR: This paper proposes a training-free, output-level ensemble framework for single-image super-resolution. Two branches (a Hybrid Attention network with TLC inference, and MambaIRv2 with geometric self-ensemble) process the input independently and are fused by a lightweight weighted combination, improving performance without any parameter updates, particularly in high-frequency detail recovery.

Details Motivation: Stronger super-resolution architectures come with high training cost and heavy deployment; in practice several pretrained models are often already available, and the bottleneck is how to combine their outputs efficiently without training. Method: A dual-branch, output-level ensemble: a Hybrid Attention network with TLC inference provides a stable main reconstruction, and MambaIRv2 with geometric self-ensemble strengthens high-frequency detail; both branches run independently on the same low-resolution input and are fused in image space by a lightweight weighted combination, with no parameter updates. Result: Under a DIV2K bicubic ×4 evaluation protocol, the method consistently improves over the base branch and slightly exceeds the pure strong branch in PSNR at the best operating point; it was submitted to the NTIRE 2026 challenge, and ablations confirm that output-level compensation is a low-overhead, easily deployed upgrade path. Conclusion: Training-free output-level ensembling is a practical, lightweight way to improve super-resolution performance, especially where multiple pretrained models already exist. Abstract: Single-image super-resolution has progressed from deep convolutional baselines to stronger Transformer and state-space architectures, yet the corresponding performance gains typically come with higher training cost, longer engineering iteration, and heavier deployment burden. In many practical settings, multiple pretrained models with partially complementary behaviors are already available, and the binding constraint is no longer architectural capacity but how effectively their outputs can be combined without additional training. Rather than pursuing further architectural redesign, this paper proposes a training-free output-level ensemble framework. A dual-branch pipeline is constructed in which a Hybrid attention network with TLC inference provides stable main reconstruction, while a MambaIRv2 branch with geometric self-ensemble supplies strong compensation for high-frequency detail recovery. The two branches process the same low-resolution input independently and are fused in the image space via a lightweight weighted combination, without updating any model parameters or introducing an additional trainable module. As our solution to the NTIRE 2026 Image Super-Resolution ($\times 4$) Challenge, the proposed design consistently improves over the base branch and slightly exceeds the pure strong branch in PSNR at the best operating point under a unified DIV2K bicubic $\times 4$ evaluation protocol. Ablation studies confirm that output-level compensation provides a low-overhead and practically accessible upgrade path for existing super-resolution systems.
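Image-space fusion of two branch outputs amounts to a convex combination, and the "best operating point" can be found by sweeping the fusion weight against a reference metric such as PSNR. The sketch below illustrates this with NumPy. It is a hedged example only: the abstract does not give the actual weights or selection procedure, so `fuse_outputs`, `psnr`, the `[0, 1]` pixel range, and the clipping step are all assumptions.

```python
import numpy as np


def fuse_outputs(main_output, strong_output, weight):
    """Blend two super-resolved images; `weight` is the share of the strong branch."""
    fused = (1.0 - weight) * main_output + weight * strong_output
    # Keep the result in the assumed valid pixel range [0, 1].
    return np.clip(fused, 0.0, 1.0)


def psnr(prediction, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_value]."""
    mse = float(np.mean((prediction - target) ** 2))
    if mse == 0.0:
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)
```

Sweeping `weight` over a small grid on held-out images and keeping the value with the highest PSNR requires no gradient updates, so the procedure stays training-free in the sense the abstract describes.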

[399] Finetune Like You Pretrain: Boosting Zero-shot Adversarial Robustness in Vision-language Models

Songlong Xing,Weijie Wang,Zhengyu Zhao,Jindong Gu,Philip Torr,Nicu Sebe

Main category: cs.CV

TL;DR: This paper proposes AdvFLYP, which adversarially finetunes CLIP on web image-text pairs, matches adversarial images to their corresponding texts with a contrastive loss, and adds logit-level and feature-level regularization to improve both adversarial robustness and zero-shot performance.

Details Motivation: Existing methods for improving CLIP's adversarial robustness overlook the training data distribution and learning objective, which reduces zero-shot capability and limits how well robustness transfers across domains. Method: AdvFLYP generates adversarial examples from web image-text pairs and aligns adversarial images with their texts via a contrastive loss; logit-level regularization improves robustness, and feature-level regularization preserves clean accuracy. Result: On 14 downstream datasets spanning various domains, AdvFLYP outperforms mainstream practices while balancing adversarial robustness and zero-shot generalization. Conclusion: Adversarial finetuning that follows CLIP's pretraining recipe is more effective; logit-level and feature-level regularization benefit robustness and clean accuracy respectively; AdvFLYP transfers better across domains and is more practical. Abstract: Despite their impressive zero-shot abilities, vision-language models such as CLIP have been shown to be susceptible to adversarial attacks. To enhance its adversarial robustness, recent studies finetune the pretrained vision encoder of CLIP with adversarial examples on a proxy dataset such as ImageNet by aligning adversarial images with correct class labels. However, these methods overlook the important roles of training data distributions and learning objectives, resulting in reduced zero-shot capabilities and limited transferability of robustness across domains and datasets. In this work, we propose a simple yet effective paradigm AdvFLYP, which follows the training recipe of CLIP's pretraining process when performing adversarial finetuning of the model. Specifically, AdvFLYP finetunes CLIP with adversarial images created based on image-text pairs collected from the web, and matches them with their corresponding texts via a contrastive loss. To alleviate distortion of adversarial image embeddings of noisy web images, we further propose to regularise AdvFLYP by penalising deviation of adversarial image features. We show that logit- and feature-level regularisation terms benefit robustness and clean accuracy, respectively. Extensive experiments on 14 downstream datasets spanning various domains show the superiority of our paradigm over mainstream practices. Our code and model weights are released at https://github.com/Sxing2/AdvFLYP.

[400] Seeing Through Touch: Tactile-Driven Visual Localization of Material Regions

Seongyu Kim,Seungwoo Lee,Hyeonggon Ryu,Joon Son Chung,Arda Senocak

Main category: cs.CV

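TL;DR: This paper proposes a new method for tactile localization that learns local visuo-tactile alignment through dense cross-modal feature interactions, producing tactile saliency maps for touch-conditioned material segmentation, and builds two new datasets to validate it.

Details Motivation: Existing methods rely on global alignment and struggle to capture fine-grained local correspondences; existing datasets also consist mostly of close-up, low-diversity images, which limits model performance. Method: A local visuo-tactile alignment model based on dense cross-modal feature interactions generates tactile saliency maps; in-the-wild multi-material scene images and a material-diversity pairing strategy increase visual diversity while keeping tactile consistency. Result: Experiments on the new and existing benchmarks show that the method substantially outperforms prior visuo-tactile methods in tactile localization. Conclusion: Local cross-modal alignment combined with these data strategies improves the accuracy and robustness of tactile localization and offers a new direction for tactile perception and material understanding. Abstract: We address the problem of tactile localization, where the goal is to identify image regions that share the same material properties as a tactile input. Existing visuo-tactile methods rely on global alignment and thus fail to capture the fine-grained local correspondences required for this task. The challenge is amplified by existing datasets, which predominantly contain close-up, low-diversity images. We propose a model that learns local visuo-tactile alignment via dense cross-modal feature interactions, producing tactile saliency maps for touch-conditioned material segmentation. To overcome dataset constraints, we introduce: (i) in-the-wild multi-material scene images that expand visual diversity, and (ii) a material-diversity pairing strategy that aligns each tactile sample with visually varied yet tactilely consistent images, improving contextual localization and robustness to weak signals. We also construct two new tactile-grounded material segmentation datasets for quantitative evaluation. Experiments on both new and existing benchmarks show that our approach substantially outperforms prior visuo-tactile methods in tactile localization.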

[401] GeomPrompt: Geometric Prompt Learning for RGB-D Semantic Segmentation Under Missing and Degraded Depth

Krishna Jaganathan,Patricio Vela

Main category: cs.CV

TL;DR: This paper proposes GeomPrompt and GeomPrompt-Recovery, two lightweight cross-modal adaptation modules that use only RGB images or degraded depth maps to produce task-driven geometric prompts, improving a frozen RGB-D semantic segmentation model without depth supervision and at low computational cost.

Details Motivation: RGB-D multimodal perception systems depend on reliable depth, but in practice depth is often missing, noisy, or corrupted, so a robust alternative that does not rely on ground-truth depth supervision is needed. Method: GeomPrompt generates a task-oriented geometric prompt from RGB alone and feeds it as the fourth input channel of the RGB-D model; GeomPrompt-Recovery predicts a correction to the fourth channel when depth is degraded. Both are trained end-to-end with downstream segmentation labels only. Result: On SUN RGB-D, GeomPrompt improves over RGB-only inference by +6.1 mIoU on DFormer and +3.0 mIoU on GeminiFusion; GeomPrompt-Recovery gains up to +3.6 mIoU under severe depth corruption; latency is 7.8 ms, well below the monocular depth estimation baselines (38.3 ms and 71.9 ms). Conclusion: Task-driven geometric prompting is an efficient, robust mechanism for cross-modal compensation when depth is missing or degraded in RGB-D perception. Abstract: Multimodal perception systems for robotics and embodied AI often assume reliable RGB-D sensing, but in practice, depth is frequently missing, noisy, or corrupted. We thus present GeomPrompt, a lightweight cross-modal adaptation module that synthesizes a task-driven geometric prompt from RGB alone for the fourth channel of a frozen RGB-D semantic segmentation model, without depth supervision. We further introduce GeomPrompt-Recovery, an adaptation module that compensates for degraded depth by predicting the fourth channel correction relevant for the frozen segmenter. Both modules are trained solely with downstream segmentation supervision, enabling recovery of the geometric prior useful for segmentation, rather than estimating depth signals. On SUN RGB-D, GeomPrompt improves over RGB-only inference by +6.1 mIoU on DFormer and +3.0 mIoU on GeminiFusion, while remaining competitive with strong monocular depth estimators. For degraded depth, GeomPrompt-Recovery consistently improves robustness, yielding gains up to +3.6 mIoU under severe depth corruptions. GeomPrompt is also substantially more efficient than monocular depth baselines, reaching 7.8 ms latency versus 38.3 ms and 71.9 ms. These results suggest that task-driven geometric prompting is an efficient mechanism for cross-modal compensation under missing and degraded depth inputs in RGB-D perception.

[402] MLLM-as-a-Judge Exhibits Model Preference Bias

Shuitsu Koyama,Yuiga Wada,Daichi Yashima,Komei Sugiura

Main category: cs.CV

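TL;DR: This paper proposes Philautia-Eval, a method for systematically measuring self-preference and within-family mutual preference bias in MLLM-as-a-Judge, and finds that the bias may stem from reused connectors and overlapping instruction-tuning data. It also proposes Pomms, an ensemble method that effectively mitigates the bias.

Details Motivation: MLLM-as-a-Judge is widely used for automatic evaluation, but a model-specific preference bias could distort model comparisons and scientific progress, and the extent of this bias is unclear. Method: Philautia-Eval quantifies model-specific preference bias by disentangling preference tendencies from differences in generation quality; the empirical analysis uses 1.29M caption-score pairs collected from 12 MLLMs; Pomms, an ensemble of MLLMs, is designed to mitigate the bias. Result: Representative MLLMs tend to exhibit self-preference bias; mutual preference bias appears within particular model families, potentially driven by reused connectors and overlapping instruction-tuning resources; Pomms effectively mitigates the bias while maintaining evaluation performance. Conclusion: MLLM-as-a-Judge carries a non-negligible model-specific preference bias that should be corrected with debiasing strategies such as Pomms to keep evaluation fair and benchmark-driven research reliable. Abstract: Automatic evaluation using multimodal large language models (MLLMs), commonly referred to as MLLM-as-a-Judge, has been widely used to measure model performance. If such MLLM-as-a-Judge methods were biased, they could distort model comparisons and benchmark-driven scientific progress. However, it remains unclear to what extent MLLM-as-a-Judge methods favor or disfavor text generated by specific MLLMs. In this study, we propose Philautia-Eval to investigate such model-specific preference bias. Philautia-Eval quantifies the degree of the bias by disentangling preference tendencies from differences in generation quality. Using 1.29M caption-score pairs collected from 12 MLLMs, we found that representative MLLMs tend to exhibit self-preference bias. Moreover, experimental results indicate mutual preference bias within particular model families, which is potentially driven by reused connectors and overlapping instruction-tuning resources. Finally, we introduce a simple ensemble of MLLMs, Pomms. Our results demonstrated that Pomms effectively mitigated the model-specific preference bias while maintaining performance.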

[403] Learning Robustness at Test-Time from a Non-Robust Teacher

Stefano Bianchettin,Giulio Rossolini,Giorgio Buttazzo

Main category: cs.CV

TL;DR: This paper studies how to adapt a pretrained, non-robust model at test time to improve its adversarial robustness on a target distribution. It proposes a label-free framework that uses the predictions of the non-robust teacher as a semantic anchor, and shows on CIFAR-10 and ImageNet that it offers better optimization stability and a better robustness-accuracy trade-off.

Details Motivation: Adapting pretrained models to downstream tasks at test time is now common, but adversarial robustness, especially when the original model is non-robust, has received little study; the paper asks whether a non-robust pretrained model can be made more adversarially robust at test time given only a small set of unlabeled target samples. Method: A label-free test-time adaptation framework uses the non-robust teacher's predictions as a semantic anchor for both the clean and the adversarial objectives; theoretical analysis indicates that it is more stable than the self-consistency-based regularization used in classical adversarial training. Result: On CIFAR-10 and ImageNet, the method shows better optimization stability, lower hyperparameter sensitivity, and a better robustness-accuracy trade-off than existing baselines. Conclusion: Even when the teacher itself is non-robust, the proposed label-free, semantically anchored test-time adversarial adaptation can improve adversarial robustness on the target domain in a stable and practical way. Abstract: Nowadays, pretrained models are increasingly used as general-purpose backbones and adapted at test-time to downstream environments where target data are scarce and unlabeled. While this paradigm has proven effective for improving clean accuracy on the target domain, adversarial robustness has received far less attention, especially when the original pretrained model is not explicitly designed to be robust. This raises a practical question: \emph{can a pretrained, non-robust model be adapted at test-time to improve adversarial robustness on a target distribution?} To address this question, this work studies how adversarial training strategies behave when integrated into adaptation schemes for the unsupervised test-time setting, where only a small set of unlabeled target samples is available. It first analyzes how classical adversarial training formulations can be extended to this scenario, showing that straightforward distillation-based adaptations remain unstable and highly sensitive to hyperparameter tuning, particularly when the teacher itself is non-robust. To address these limitations, the work proposes a label-free framework that uses the predictions of a non-robust teacher model as a semantic anchor for both the clean and adversarial objectives during adaptation. We further provide theoretical insights showing that our formulation yields a more stable alternative to the self-consistency-based regularization commonly used in classical adversarial training. Experiments evaluate the proposed approach on CIFAR-10 and ImageNet under induced photometric transformations.
The results support the theoretical insights by showing that the proposed approach achieves improved optimization stability, lower sensitivity to parameter choices, and a better robustness-accuracy trade-off than existing baselines in this post-deployment test-time setting.

[404] Geoparsing: Diagram Parsing for Plane and Solid Geometry with a Unified Formal Language

Peijie Wang,Ming-Liang Zhang,Jun Cao,Chao Deng,Dekang Ran,Hongda Sun,Pi Bu,Xuan Zhang,Yingyao Wang,Jun Song,Bo Zheng,Fei Yin,Cheng-Lin Liu

Main category: cs.CV

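TL;DR: This paper proposes a unified formal language for geometry and builds GDP-29K, a large-scale dataset covering plane and solid geometry. Combining supervised fine-tuning with reinforcement learning via verifiable rewards, it significantly improves the performance of multimodal large language models on geometric reasoning tasks.

Details Motivation: Multimodal large language models still face a perception bottleneck in geometric reasoning, particularly in solid geometry, which requires spatial understanding; existing work focuses on plane geometry and lacks support for solid geometry and unified modeling. Method: A unified formal language for plane and solid geometry; the large-scale GDP-29K dataset collected from real-world sources (20k plane and 9k solid samples); and a training paradigm that combines supervised fine-tuning with reinforcement learning via verifiable rewards. Result: The approach achieves SOTA performance on parsing geometry diagrams into the formal language; the parsed descriptions act as a cognitive scaffold that significantly improves MLLMs on downstream geometry reasoning tasks. Conclusion: A unified formal language together with verifiable, data-driven training is an effective route to overcoming the geometric reasoning bottleneck of MLLMs and closing the gap between plane and solid geometry understanding. Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable progress but continue to struggle with geometric reasoning, primarily due to the perception bottleneck regarding fine-grained visual elements. While formal languages have aided plane geometry understanding, solid geometry which requires spatial understanding remains largely unexplored. In this paper, we address this challenge by designing a unified formal language that integrates plane and solid geometry, comprehensively covering geometric structures and semantic relations. We construct GDP-29K, a large-scale dataset comprising 20k plane and 9k solid geometry samples collected from diverse real-world sources, each paired with its ground-truth formal description. To ensure syntactic correctness and geometric consistency, we propose a training paradigm that combines Supervised Fine-Tuning with Reinforcement Learning via Verifiable Rewards. Experiments show that our approach achieves state-of-the-art parsing performance. Furthermore, we demonstrate that our parsed formal descriptions serve as a critical cognitive scaffold, significantly boosting MLLMs' capabilities for downstream geometry reasoning tasks. Our data and code are available at Geoparsing.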

[405] POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs

Haicheng Wang,Yuan Liu,Yikun Liu,Zhemeng Yu,Zhongyin Zhao,Yangxiu You,Zilin Yu,Le Tian,Xiao Zhou,Jie Zhou,Weidi Xie,Yanfeng Wang

Main category: cs.CV

TL;DR: This paper proposes POINTS-Long, a dual-mode multimodal large language model with dynamic visual token scaling that balances efficiency and accuracy in long-video and streaming scenarios, and maintains ultra-long visual memory efficiently through a dynamically detachable KV cache.

Details Motivation: The rapid growth of visual token sequences, especially in long-video and streaming scenarios, severely limits the scalability and practical deployment of multimodal large language models (MLLMs). Method: POINTS-Long introduces dynamic visual token scaling inspired by the human visual system, with two complementary perception modes (focus mode and standby mode), and adopts a dynamically detachable KV cache to support streaming visual understanding. Result: On fine-grained visual tasks the focus mode retains the best performance; on long-form general visual understanding, the standby mode retains 97.7-99.7% of the original accuracy using only 1/40 to 1/10 of the visual tokens; streaming visual understanding is supported natively. Conclusion: POINTS-Long offers new ideas for the design of future MLLMs and lays a foundation for adaptive, efficient long-form visual understanding. Abstract: Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable capabilities in cross-modal understanding and generation. However, the rapid growth of visual token sequences--especially in long-video and streaming scenarios--poses a major challenge to their scalability and real-world deployment. Thus, we introduce POINTS-Long, a native dual-mode MLLM featuring dynamic visual token scaling inspired by the human visual system. The model supports two complementary perception modes: focus mode and standby mode, enabling users to dynamically trade off efficiency and accuracy during inference. On fine-grained visual tasks, the focus mode retains the optimal performance, while on long-form general visual understanding, the standby mode retains 97.7-99.7% of the original accuracy using only 1/40-1/10th of the visual tokens. Moreover, POINTS-Long natively supports streaming visual understanding via a dynamically detachable KV-cache design, allowing efficient maintenance of ultra-long visual memory. Our work provides new insights into the design of future MLLMs and lays the foundation for adaptive and efficient long-form visual understanding.

[406] MorphoFlow: Sparse-Supervised Generative Shape Modeling with Adaptive Latent Relevance

Mokshagna Sai Teja Karanam,Tushar Kataria,Shireen Elhabian

Main category: cs.CV

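TL;DR: This paper proposes MorphoFlow, a generative statistical shape modeling framework trained from sparse surface annotations that combines neural implicit representations, an autodecoder, and autoregressive normalizing flows to obtain compact, interpretable, probabilistic models of 3D anatomical shape.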

Details Motivation: Existing statistical shape models rely on dense segmentations and fixed latent representations, which limits scalability and the flexibility to model complex anatomical variation. Method: MorphoFlow combines neural implicit shape representations, an autodecoder formulation, and autoregressive normalizing flows, and adds adaptive sparsity-inducing priors that weight latent dimensions by relevance, in order to learn a compact, structured probabilistic latent space. Result: On public lumbar vertebrae and femur datasets, the model accurately reconstructs high-resolution shapes from sparse annotations and recovers modes of anatomical variation consistent with population-level trends. Conclusion: MorphoFlow provides flexible, scalable, probabilistic shape modeling that supports uncertainty quantification and anatomically plausible shape synthesis without manual tuning of the latent dimensionality. Abstract: Statistical shape modeling (SSM) is central to population-level analysis of anatomical variability, yet most existing approaches rely on densely annotated segmentations and fixed latent representations. These requirements limit scalability and reduce flexibility when modeling complex anatomical variation. We introduce MorphoFlow, a sparse-supervised generative shape modeling framework that learns compact probabilistic shape representations directly from sparse surface annotations. MorphoFlow integrates neural implicit shape representations with an autodecoder formulation and autoregressive normalizing flows to learn an expressive probabilistic density over the latent shape space. The neural implicit representation enables resolution-agnostic modeling of 3D anatomy, while the autodecoder formulation supports direct optimization of per-instance latent codes under sparse supervision. The autoregressive flow captures the distribution of latent anatomical variability, providing a tractable, likelihood-based generative model of shapes. To promote compact and structured latent representations, we incorporate adaptive latent relevance weighting through sparsity-inducing priors, enabling the model to regulate the contribution of individual latent dimensions according to their relevance to the underlying anatomical variation while preserving generative expressivity. The resulting latent space supports uncertainty quantification and anatomically plausible shape synthesis without manual latent dimensionality tuning. Evaluation on publicly available lumbar vertebrae and femur datasets demonstrates accurate high-resolution reconstruction from sparse inputs and recovery of structured modes of anatomical variation consistent with population-level trends.

[407] STS-Mixer: Spatio-Temporal-Spectral Mixer for 4D Point Cloud Video Understanding

Wenhao Li,Xueying Jiang,Gongjie Zhang,Xiaoqin Zhang,Ling Shao,Shijian Lu

Main category: cs.CV

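TL;DR: This paper analyzes 4D point cloud videos in the graph spectral domain and designs STS-Mixer, a framework that mixes spatial, temporal, and spectral information, achieving strong performance on 3D action recognition and 4D semantic segmentation.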

Details Motivation: Most existing methods process 4D point cloud videos in the spatiotemporal domain, where the underlying geometric characteristics are hard to capture, which degrades representation learning and understanding. Method: Model 4D point cloud videos as graph spectral signals, decompose them into multiple frequency bands that describe geometric structure at different scales, and design the STS-Mixer framework to fuse spatial, temporal, and multi-band spectral information. Result: Consistently superior performance on multiple widely used benchmarks for 3D action recognition and 4D semantic segmentation. Conclusion: Modeling 4D point cloud videos from a spectral perspective describes their geometry and dynamics more completely, and STS-Mixer offers a new paradigm for 4D understanding. Abstract: 4D point cloud videos capture rich spatial and temporal dynamics of scenes, which makes them uniquely valuable for various 4D understanding tasks. However, most existing methods work in the spatiotemporal domain, where the underlying geometric characteristics of 4D point cloud videos are hard to capture, leading to degraded representation learning and understanding of 4D point cloud videos. We address the above challenge from a complementary spectral perspective. By transforming 4D point cloud videos into graph spectral signals, we can decompose them into multiple frequency bands, each of which captures distinct geometric structures of point cloud videos. Our spectral analysis reveals that the decomposed low-frequency signals capture coarser shapes, while high-frequency signals encode finer-grained geometric details. Building on these observations, we design Spatio-Temporal-Spectral Mixer (STS-Mixer), a unified framework that mixes spatial, temporal, and spectral representations of point cloud videos. STS-Mixer integrates multi-band delineated spectral signals with spatiotemporal information to capture rich geometries and temporal dynamics, while enabling fine-grained and holistic understanding of 4D point cloud videos. Extensive experiments show that STS-Mixer achieves superior performance consistently across multiple widely adopted benchmarks on both 3D action recognition and 4D semantic segmentation tasks. Code and models are available at https://github.com/Vegetebird/STS-Mixer.

[408] GazeVaLM: A Multi-Observer Eye-Tracking Benchmark for Evaluating Clinical Realism in AI-Generated X-Rays

David Wong,Zeynep Isik,Bin Wang,Marouane Tliba,Gorkem Durak,Elif Keles,Halil Ertugrul Aktas,Aladine Chetouani,Cagdas Topel,Nicolo Gennaro,Camila Lopes Vendrami,Tugce Agirlar Trabzonlu,Amir Ali Rahsepar,Laetitia Perronne,Matthew Antalek,Onural Ozturk,Gokcan Okur,Andrew C. Gordon,Ayis Pyrros,Frank H. Miller,Amir Borhani,Hatice Savas,Eric Hart,Elizabeth Krupinski,Ulas Bagci

Main category: cs.CV

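TL;DR: GazeVaLM is a public eye-tracking dataset for assessing the authenticity of chest X-rays. It contains gaze data and diagnostic judgments from 16 expert radiologists on 60 real and synthetic X-rays, extended with predictions from 6 state-of-the-art multimodal LLMs, and supports human-AI comparison, attention modeling, and uncertainty quantification research.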

Details Motivation: Study the visual perception of clinicians when they assess the authenticity of chest X-rays, especially AI-generated images, and establish a reproducible benchmark comparing humans and AI on diagnosis and authenticity discrimination. Method: Collect gaze data (fixations, scanpaths, saliency maps, and more) from 16 expert radiologists on 30 real and 30 diffusion-generated chest X-rays, together with diagnostic labels and authenticity judgments; run 6 state-of-the-art multimodal LLMs on the same tasks to obtain diagnoses, authenticity judgments, and confidence scores. Result: A high-quality, structured dataset of joint gaze and clinical judgments; analyses of inter-observer gaze agreement and of the differences between humans and AI in diagnostic accuracy and authenticity detection; benchmarks that support several downstream research directions. Conclusion: GazeVaLM fills a data gap in the joint analysis of human visual attention and AI behavior for medical image authenticity assessment, and provides a key resource for understanding clinical perception, evaluating the realism of generative-AI medical images, and building trustworthy human-AI collaboration systems. Abstract: We introduce GazeVaLM, a public eye-tracking dataset for studying clinical perception during chest radiograph authenticity assessment. The dataset comprises 960 gaze recordings from 16 expert radiologists interpreting 30 real and 30 synthetic chest X-rays (generated by diffusion-based generative AI) under two conditions: diagnostic assessment and real-fake classification (Visual Turing test). For each image-observer pair, we provide raw gaze samples, fixation maps, scanpaths, saliency density maps, structured diagnostic labels, and authenticity judgments. We extend the protocol to 6 state-of-the-art multimodal LLMs, releasing their predicted diagnoses, authenticity labels, and confidence scores under matched conditions, enabling direct human-AI comparison at both decision and uncertainty levels. We further provide analyses of gaze agreement, inter-observer consistency, and benchmarking of radiologists versus LLMs in diagnostic accuracy and authenticity detection. GazeVaLM supports research in gaze modeling, clinical decision-making, human-AI comparison, generative image realism assessment, and uncertainty quantification. By jointly releasing visual attention data, clinical labels, and model predictions, we aim to facilitate reproducible research on how experts and AI systems perceive, interpret, and evaluate medical images. The dataset is available at https://huggingface.co/datasets/davidcwong/GazeVaLM.

[409] UNIGEOCLIP: Unified Geospatial Contrastive Learning

Guillaume Astruc,Eduard Trulls,Jan Hosang,Loic Landrieu,Paul-Edouard Sarlin

Main category: cs.CV

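TL;DR: UNIGEOCLIP is a large-scale multimodal contrastive learning framework that jointly aligns five geospatial modalities (aerial imagery, street-level views, elevation models, text, and geographic coordinates) in one unified embedding space, using all-to-all contrastive alignment and a multi-scale latitude-longitude encoder to improve cross-modal retrieval and reasoning.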

Details Motivation: Exploit the growing availability of co-located geospatial data (aerial imagery, street-level views, elevation models, text, and coordinates) to advance multimodal representation learning in the geospatial domain. Method: Propose the UNIGEOCLIP framework, which performs all-to-all contrastive alignment across five geospatial modalities, and design a scaled latitude-longitude encoder that models multi-scale geographic structure. Result: Across multiple downstream geospatial tasks, UNIGEOCLIP consistently outperforms single-modality contrastive models and coordinate-only baselines. Conclusion: Holistic multimodal geospatial alignment substantially improves cross-modal understanding and downstream performance, supporting the effectiveness of all-to-all contrastive learning and geography-aware encoding. Abstract: The growing availability of co-located geospatial data spanning aerial imagery, street-level views, elevation models, text, and geographic coordinates offers a unique opportunity for multimodal representation learning. We introduce UNIGEOCLIP, a massively multimodal contrastive framework to jointly align five complementary geospatial modalities in a single unified embedding space. Unlike prior approaches that fuse modalities or rely on a central pivot representation, our method performs all-to-all contrastive alignment, enabling seamless comparison, retrieval, and reasoning across arbitrary combinations of modalities. We further propose a scaled latitude-longitude encoder that improves spatial representation by capturing multi-scale geographic structure. Extensive experiments across downstream geospatial tasks demonstrate that UNIGEOCLIP consistently outperforms single-modality contrastive models and coordinate-only baselines, highlighting the benefits of holistic multimodal geospatial alignment. A reference implementation is available at https://gastruc.github.io/unigeoclip.
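To make the idea of all-to-all contrastive alignment concrete, here is a minimal pure-Python sketch, assuming a symmetric InfoNCE loss averaged over every unordered pair of modalities. The loss form, the temperature of 0.07, and all function and modality names are assumptions for illustration; the abstract does not specify UNIGEOCLIP's actual objective or hyperparameters.

```python
import math
from itertools import combinations


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # L2-normalize so that dot products are cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.07):
    """One-directional InfoNCE: queries[i] should match keys[i]."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_sum - logits[i]
    return total / len(queries)


def all_to_all_loss(batch, temperature=0.07):
    """batch maps a modality name to a list of embeddings; index i in every
    list is assumed to describe the same location (hypothetical layout)."""
    embs = {name: [normalize(v) for v in vecs] for name, vecs in batch.items()}
    pairs = list(combinations(sorted(embs), 2))  # every unordered modality pair
    loss = 0.0
    for a, b in pairs:
        # Symmetric loss: a -> b and b -> a.
        loss += 0.5 * (info_nce(embs[a], embs[b], temperature)
                       + info_nce(embs[b], embs[a], temperature))
    return loss / len(pairs), pairs
```

With five modalities this enumerates 10 pairs, in contrast to a pivot-based scheme that would align each of the other four modalities only to one central modality.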

[410] Towards Brain MRI Foundation Models for the Clinic: Findings from the FOMO25 Challenge

Asbjørn Munk,Stefano Cerri,Vardan Nersesjan,Christian Hedeager Krag,Jakob Ambsdorf,Pablo Rocamora García,Julia Machnio,Peirong Liu,Suhyun Ahn,Nasrin Akbari,Yasmina Al Khalil,Kimberly Amador,Sina Amirrajab,Tal Arbel,Meritxell Bach Cuadra,Ujjwal Baid,Bhakti Baheti,Jaume Banus,Kamil Barbierik,Christoph Brune,Yansong Bu,Baptiste Callard,Yuhan Chen,Cornelius Crijnen,Corentin Dancette,Peter Drotar,Prasad Dutande,Nils D. Forkert,Saurabh Garg,Jakub Gazda,Matej Gazda,Benoît Gérin,Partha Ghosh,Weikang Gong,Pedro M. Gordaliza,Sam Hashemi,Tobias Heimann,Fucang Jia,Jiexin Jiang,Emily Kaczmarek,Chris Kang,Seung Kwan Kang,Mohammad Khazaei,Julien Khlaut,Petros Koutsouvelis,Jae Sung Lee,Yuchong Li,Mengye Lyu,Mingchen Ma,Anant Madabhushi,Klaus H. Maier-Hein,Pierre Manceron,Andrés Martínez Mora,Moona Mazher,Felix Meister,Nataliia Molchanova,Steven A. Niederer,Leonard Nürnberg,Jinah Park,Abdul Qayyum,Jonas Richiardi,Antoine Saporta,Branislav Setlak,Ning Shen,Justin Szeto,Constantin Ulrich,Puru Vaish,Vibujithan Vigneshwaran,Leroy Volmer,Zihao Wang,Siqi Wei,Anthony Winder,Jelmer M. Wolterink,Maxence Wynen,Chang Yang,Si Young Yie,Mostafa Mehdipour Ghazi,Akshay Pai,Espen Jimenez Solem,Sebastian Nørgaard Llambias,Mikael Boesen,Michael Eriksen Benros,Juan Eugenio Iglesias,Mads Nielsen

Main category: cs.CV

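TL;DR: This paper presents the FOMO25 challenge, which aims to advance self-supervised foundation models for brain MRI by pretraining on FOMO60K, a large unlabeled clinical dataset, and evaluating models in few-shot and out-of-domain clinical settings. The results show that self-supervised pretraining clearly improves generalization on clinical data, that different tasks favor different pretraining objectives, and that small models are already highly competitive.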

Details Motivation: Clinical brain MRI analysis faces heterogeneous, noisy data and prohibitively expensive high-quality labels; existing foundation models are limited by small pretraining datasets and by evaluation restricted to high-quality research data, so benchmarks and challenges that target real clinical settings are needed. Method: Organize FOMO25 as a satellite event at MICCAI 2025; provide FOMO60K, a large unlabeled clinical pretraining dataset; design out-of-domain few-shot evaluation tasks covering infarct classification, meningioma segmentation, and brain age regression; offer a method track (FOMO60K only) and an open track (any data); evaluate 19 foundation models from 16 teams with a standardized containerized pipeline. Result: (a) Self-supervised pretraining clearly improves clinical generalization under domain shift, and the strongest models pretrained out-of-domain surpass supervised baselines trained in-domain; (b) pretraining objectives are task-specific: MAE favors segmentation, while hybrid reconstruction-contrastive objectives favor classification; (c) small pretrained models already achieve strong performance, and scaling model size or training duration brings no reliable gain. Conclusion: Self-supervised foundation models aimed at real clinical settings are a feasible and effective path; task-oriented design of the pretraining objective matters more than simply scaling the model or data; FOMO25 provides the first large-scale, clinic-native, multi-task, out-of-domain benchmark for brain MRI foundation models. Abstract: Clinical deployment of automated brain MRI analysis faces a fundamental challenge: clinical data is heterogeneous and noisy, and high-quality labels are prohibitively costly to obtain. Self-supervised learning (SSL) can address this by leveraging the vast amounts of unlabeled data produced in clinical workflows to train robust foundation models that adapt out-of-domain with minimal supervision. However, the development of foundation models for brain MRI has been limited by small pretraining datasets and in-domain benchmarking focused on high-quality, research-grade data. To address this gap, we organized the FOMO25 challenge as a satellite event at MICCAI 2025. FOMO25 provided participants with a large pretraining dataset, FOMO60K, and evaluated models on data sourced directly from clinical workflows in few-shot and out-of-domain settings. Tasks covered infarct classification, meningioma segmentation, and brain age regression, and considered both models trained on FOMO60K (method track) and any data (open track). Nineteen foundation models from sixteen teams were evaluated using a standardized containerized pipeline. Results show that (a) self-supervised pretraining improves generalization on clinical data under domain shift, with the strongest models trained out-of-domain surpassing supervised baselines trained in-domain; (b) no single pretraining objective benefits all tasks: MAE favors segmentation, while hybrid reconstruction-contrastive objectives favor classification; and (c) strong performance was achieved by small pretrained models, while scaling model size and training duration did not yield reliable benefits.

[411] Unfolding 3D Gaussian Splatting via Iterative Gaussian Synopsis

Yuqin Lu,Yang Zhou,Yihua Dai,Guiqing Li,Shengfeng He

Main category: cs.CV

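TL;DR: This paper proposes Iterative Gaussian Synopsis, a framework that uses a top-down "unfolding" strategy to make 3D Gaussian Splatting (3DGS) models compact and progressively renderable, substantially reducing storage while preserving high-quality view synthesis.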

Details Motivation: 3D Gaussian Splatting (3DGS) excels at real-time, high-fidelity novel view synthesis, but its large storage requirements and unstructured representation limit deployment in streaming and resource-constrained environments; existing Level-of-Detail (LOD) methods built bottom-up tend to introduce redundancy or degrade fidelity. Method: Propose Iterative Gaussian Synopsis, which starts from a full-resolution 3DGS model and derives progressively coarser LODs through an adaptive, learnable mask-based pruning mechanism; combine hierarchical spatial grids with a shared Anchor Codebook to build a compact yet expressive multi-level feature representation that supports inter-layer reuse and low-overhead progressive refinement. Result: Experiments show that the method maintains high rendering quality at every LOD while achieving substantial storage reduction, and that it is practical and scalable in bandwidth- and memory-constrained scenarios. Conclusion: Iterative Gaussian Synopsis gives 3DGS an efficient, compact, progressive rendering scheme that balances visual quality against resource efficiency and suits real-time, lightweight 3D content delivery and rendering. Abstract: 3D Gaussian Splatting (3DGS) has become a state-of-the-art framework for real-time, high-fidelity novel view synthesis. However, its substantial storage requirements and inherently unstructured representation pose challenges for deployment in streaming and resource-constrained environments. Existing Level-of-Detail (LOD) strategies, particularly those based on bottom-up construction, often introduce redundancy or lead to fidelity degradation. To overcome these limitations, we propose Iterative Gaussian Synopsis, a novel framework for compact and progressive rendering through a top-down "unfolding" scheme. Our approach begins with a full-resolution 3DGS model and iteratively derives coarser LODs using an adaptive, learnable mask-based pruning mechanism. This process constructs a multi-level hierarchy that preserves visual quality while improving efficiency. We integrate hierarchical spatial grids, which capture the global scene structure, with a shared Anchor Codebook that models localized details. This combination produces a compact yet expressive feature representation, designed to minimize redundancy and support efficient, level-specific adaptation. The unfolding mechanism promotes inter-layer reusability and requires only minimal data overhead for progressive refinement. Experiments show that our method maintains high rendering quality across all LODs while achieving substantial storage reduction. These results demonstrate the practicality and scalability of our approach for real-time 3DGS rendering in bandwidth- and memory-constrained scenarios.

[412] LARY: A Latent Action Representation Yielding Benchmark for Generalizable Vision-to-Action Alignment

Dujun Nie,Fengjiao Chen,Qi Lv,Jun Kuang,Xiaoyu Li,Xuezhi Cao,Xunliang Cai

Main category: cs.CV

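TL;DR: This paper proposes the LARY benchmark for evaluating latent action representations from vision to action, and finds that general visual foundation models trained without action supervision outperform specialized models, and that latent space aligns better with physical action control than pixel space.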

Details Motivation: The shortage of explicit action data limits Vision-Language-Action (VLA) models, while large amounts of unlabeled human action video can serve as an alternative data source; however, how to turn visual signals into ontology-independent latent action representations, and how robustly those representations support control, has not been evaluated systematically. Method: Build the LARY benchmark, containing over one million videos (1,000 hours), 620K image pairs, and 595K motion trajectories, covering 151 action categories and diverse embodiments and environments; evaluate latent action representations on two levels, high-level semantic actions and low-level robotic control. Result: (i) General visual foundation models trained without action supervision consistently outperform specialized embodied latent action models; (ii) latent-based visual space aligns better with physical action space than pixel-based space. Conclusion: General visual representations already encode action-relevant knowledge, and semantic-level abstraction is a more effective path from vision to action than pixel-level reconstruction. Abstract: While the shortage of explicit action data limits Vision-Language-Action (VLA) models, human action videos offer a scalable yet unlabeled data source. A critical challenge in utilizing large-scale human video datasets lies in transforming visual signals into ontology-independent representations, known as latent actions. However, the capacity of latent action representation to derive robust control from visual observations has yet to be rigorously evaluated. We introduce the Latent Action Representation Yielding (LARY) Benchmark, a unified framework for evaluating latent action representations on both high-level semantic actions (what to do) and low-level robotic control (how to do it). The comprehensively curated dataset encompasses over one million videos (1,000 hours) spanning 151 action categories, alongside 620K image pairs and 595K motion trajectories across diverse embodiments and environments. Our experiments reveal two crucial insights: (i) General visual foundation models, trained without any action supervision, consistently outperform specialized embodied latent action models. (ii) Latent-based visual space is fundamentally better aligned to physical action space than pixel-based space. These results suggest that general visual representations inherently encode action-relevant knowledge for physical control, and that semantic-level abstraction serves as a fundamentally more effective pathway from vision to action than pixel-level reconstruction.

[413] Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction

Efstathios Karypidis,Spyros Gidaris,Nikos Komodakis

Main category: cs.CV

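TL;DR: Re2Pix is a hierarchical video prediction framework that first predicts semantic representations and then generates the visual frames, improving temporal semantic consistency and image quality in complex scenes such as autonomous driving.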

Details Motivation: Accurate future video prediction must deliver both high visual fidelity and consistent scene semantics, which is especially challenging in complex dynamic environments such as autonomous driving. Method: Propose the Re2Pix framework: the first stage predicts future semantic scene representations in the feature space of a frozen vision foundation model; the second stage conditions a latent diffusion model on those representations to generate photorealistic frames; two strategies, nested dropout and mixed supervision, mitigate the mismatch between training-time and test-time representations. Result: Experiments on driving benchmarks show clear improvements in temporal semantic consistency, perceptual quality, and training efficiency over strong diffusion baselines. Conclusion: A semantics-first decomposition is an effective paradigm for improving video prediction, balancing structural plausibility with visual realism. Abstract: Accurate future video prediction requires both high visual fidelity and consistent scene semantics, particularly in complex dynamic environments such as autonomous driving. We present Re2Pix, a hierarchical video prediction framework that decomposes forecasting into two stages: semantic representation prediction and representation-guided visual synthesis. Instead of directly predicting future RGB frames, our approach first forecasts future scene structure in the feature space of a frozen vision foundation model, and then conditions a latent diffusion model on these predicted representations to render photorealistic frames. This decomposition enables the model to focus first on scene dynamics and then on appearance generation. A key challenge arises from the train-test mismatch between ground-truth representations available during training and predicted ones used at inference. To address this, we introduce two conditioning strategies, nested dropout and mixed supervision, that improve robustness to imperfect autoregressive predictions. Experiments on challenging driving benchmarks demonstrate that the proposed semantics-first design significantly improves temporal semantic consistency, perceptual quality, and training efficiency compared to strong diffusion baselines. We provide the implementation code at https://github.com/Sta8is/Re2Pix

[414] Seeing Through the Tool: A Controlled Benchmark for Occlusion Robustness in Foundation Segmentation Models

Nhan Ho,Luu Le,Thanh-Huy Nguyen,Thien Nguyen,Xiaofeng Liu,Ulas Bagci

Main category: cs.CV

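TL;DR: This paper proposes the OccSAM-Bench benchmark, which systematically evaluates the segmentation performance of SAM-family models under simulated surgical occlusion, reveals how differently the models respond to occlusion, and introduces a three-region evaluation protocol that separates how a model handles visible and invisible regions.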

Details Motivation: In endoscopic images, target structures are often partially hidden by surgical instruments or tissue (occlusion), a critical but underexplored challenge for foundation segmentation models. Method: Build the OccSAM-Bench benchmark by synthesizing two occlusion types (surgical tool overlay and cutout) at three severity levels on three public polyp datasets; propose a three-region evaluation protocol (full target, visible-only region, invisible region) for fine-grained analysis of segmentation performance. Result: Two model behaviors emerge: Occluder-Aware models (the SAM series and MedSAM3) tend to segment visible tissue precisely and reject instrument regions, while Occluder-Agnostic models (MedSAM, MedSAM2) confidently predict into occluded regions; SAM-Med2D performs worst. Conclusion: Occlusion robustness varies with model architecture, and clinical applications should choose a model according to the specific need: conservative segmentation of visible tissue or amodal inference of hidden anatomy. Abstract: Occlusion, where target structures are partially hidden by surgical instruments or overlapping tissues, remains a critical yet underexplored challenge for foundation segmentation models in clinical endoscopy. We introduce OccSAM-Bench, a benchmark designed to systematically evaluate SAM-family models under controlled, synthesized surgical occlusion. Our framework simulates two occlusion types (i.e., surgical tool overlay and cutout) across three calibrated severity levels on three public polyp datasets. We propose a novel three-region evaluation protocol that decomposes segmentation performance into full, visible-only, and invisible targets. This metric exposes behaviors that standard amodal evaluation obscures, revealing two distinct model archetypes: Occluder-Aware models (SAM, SAM 2, SAM 3, MedSAM3), which prioritize visible tissue delineation and reject instruments, and Occluder-Agnostic models (MedSAM, MedSAM2), which confidently predict into occluded regions. SAM-Med2D aligns with neither and underperforms across all conditions. Ultimately, our results demonstrate that occlusion robustness is not uniform across architectures, and model selection must be driven by specific clinical intent, whether prioritizing conservative visible-tissue segmentation or the amodal inference of hidden anatomy.

[415] BEM: Training-Free Background Embedding Memory for False-Positive Suppression in Real-Time Fixed-Background Camera

Junwoo Park,Jangho Lee,Sunho Lim

Main category: cs.CV

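TL;DR: This paper proposes Background Embedding Memory (BEM), a training-free, lightweight inference-time module that exploits the static background prior of fixed-camera scenes, using a background embedding memory and inverse-similarity re-scoring to substantially reduce false positives without sacrificing recall.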

Details Motivation: Pretrained detectors perform well on benchmarks such as COCO but degrade in real deployments (for example surveillance and traffic monitoring) because the training data and the target environment differ in distribution, especially in dense, single- or few-class scenes; fixed-camera scenes offer a stable, label-free background that can be exploited. Method: Propose the BEM module, which estimates clean background embeddings, maintains a prototype memory, and re-scores detection logits with an inverse-similarity, rank-weighted penalty; it is enabled purely at inference and needs no fine-tuning or additional training. Result: On LLVIP and simulated surveillance streams, BEM consistently reduces false positives across the YOLO and RT-DETR model families while remaining real-time; background-frame cosine similarity correlates negatively with object count and positively with P-AUC, which supports its use as a training-free control signal. Conclusion: BEM is a general, plug-and-play, training-free post-processing method that improves the precision-confidence trade-off of pretrained detectors in dense single-class scenes, and it is particularly suited to fixed-camera vision applications. Abstract: Pretrained detectors perform well on benchmarks but often suffer performance degradation in real-world deployments due to distribution gaps between training data and target environments. COCO-like benchmarks emphasize category diversity rather than instance density, causing detectors trained under per-class sparsity to struggle in dense, single- or few-class scenes such as surveillance and traffic monitoring. In fixed-camera environments, the quasi-static background provides a stable, label-free prior that can be exploited at inference to suppress spurious detections. To address this issue, we propose Background Embedding Memory (BEM), a lightweight, training-free, weight-frozen module that can be attached to pretrained detectors during inference. BEM estimates clean background embeddings, maintains a prototype memory, and re-scores detection logits with an inverse-similarity, rank-weighted penalty, effectively reducing false positives while maintaining recall. Empirically, background-frame cosine similarity correlates negatively with object count and positively with Precision-Confidence AUC (P-AUC), motivating its use as a training-free control signal. Across YOLO and RT-DETR families on LLVIP and simulated surveillance streams, BEM consistently reduces false positives while preserving real-time performance. Our code is available at https://github.com/Leo-Park1214/Background-Embedding-Memory.git
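The abstract does not spell out the re-scoring formula, so the following is only an illustrative sketch of the general idea: compare each detection's embedding with a memory of background prototypes by cosine similarity, and subtract from its logit a penalty that grows with that similarity. The penalty form, the `alpha` weight, and every name here are assumptions; BEM's actual inverse-similarity, rank-weighted penalty may differ, and the rank weighting is omitted.

```python
import math


def cosine(u, v):
    # Cosine similarity between two non-zero vectors.
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den


def rescore(logits, det_embeddings, bg_prototypes, alpha=2.0):
    """Lower the logit of detections that look like stored background.

    logits:          one raw score per detection
    det_embeddings:  one embedding per detection (hypothetical feature source)
    bg_prototypes:   background embeddings kept in the memory
    alpha:           penalty strength (assumed hyperparameter)
    """
    rescored = []
    for logit, emb in zip(logits, det_embeddings):
        # Similarity to the closest background prototype.
        sim = max(cosine(emb, proto) for proto in bg_prototypes)
        # Only positive similarity is penalized in this sketch.
        rescored.append(logit - alpha * max(sim, 0.0))
    return rescored
```

Because the detector weights are never touched, a step of this kind can be attached after any pretrained detector, which is the property the abstract emphasizes.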

[416] On the Robustness of Watermarking for Autoregressive Image Generation

Andreas Müller,Denis Lukovnikov,Shingo Kodama,Minh Pham,Anubhav Jain,Jonathan Petit,Niv Cohen,Asja Fischer

Main category: cs.CV

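TL;DR: This paper studies watermarking schemes for autoregressive (AR) image generators and finds them vulnerable to removal and forgery attacks, which can succeed with only a single watermarked image and without model parameters or watermarking secrets. Existing schemes therefore cannot reliably support synthetic content detection and dataset filtering, and they enable "Watermark Mimicry" that causes false detections.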

Details Motivation: The proliferation of autoregressive image generators raises the risks of misinformation and training-data contamination, which calls for reliable watermark detection and attribution. Method: Systematically evaluate existing watermark attacks and propose three new ones: a vector-quantized regeneration removal attack, an adversarial optimization-based attack, and a frequency injection attack. Result: Removal and forgery attacks are effective with a single watermarked reference image and without model parameters or watermarking secrets; existing schemes cannot reliably support synthetic image detection and training-data filtering; the paper also exposes "Watermark Mimicry", in which authentic images are manipulated to trigger false detection. Conclusion: Current watermarking schemes for AR image generators are not secure enough for critical applications such as dataset cleaning and content provenance, and more robust designs are needed. Abstract: The proliferation of autoregressive (AR) image generators demands reliable detection and attribution of their outputs to mitigate misinformation, and to filter synthetic images from training data to prevent model collapse. To address this need, watermarking techniques, specifically designed for AR models, embed a subtle signal at generation time, enabling downstream verification through a corresponding watermark detector. In this work, we study these schemes and demonstrate their vulnerability to both watermark removal and forgery attacks. We assess existing attacks and further introduce three new attacks: (i) a vector-quantized regeneration removal attack, (ii) an adversarial optimization-based attack, and (iii) a frequency injection attack. Our evaluation reveals that removal and forgery attacks can be effective with access to a single watermarked reference image and without access to original model parameters or watermarking secrets. Our findings indicate that existing watermarking schemes for AR image generation do not reliably support synthetic content detection for dataset filtering. Moreover, they enable Watermark Mimicry, whereby authentic images can be manipulated to imitate a generator's watermark and trigger false detection to prevent their inclusion in future model training.

[417] The Devil is in the Details -- From OCR for Old Church Slavonic to Purely Visual Stemma Reconstruction

Armin Hoenen

Main category: cs.CV

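TL;DR: This paper compares a range of OCR systems (classical methods, machine learning, and large language models) on 18th-century Church Slavonic manuscripts, and explores a new image-processing-based stemmatic method for building a tree of textual transmission (a stemma).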

Details Motivation: Address the low OCR accuracy on historical scripts such as Church Slavonic, in particular the difficulty of recognizing characters with diacritics, and explore how OCR output can support stemmatology as a downstream task. Method: First compare more than 10 OCR systems (including GPT-5 and Gemini-3-flash) on a manuscript sample of roughly 6,000 characters; then assess LLM post-processing and several agentic OCR architectures (specialized post-processing agents, an agentic pipeline, and RAG); finally propose a new stemmatic method based purely on image processing: automated glyph extraction, clustering, pairwise statistical comparison, a distance matrix, and construction of the stemma. Result: The character error rate (CER) for basic letters can fall to 2-3%, but characters with diacritics remain challenging; the image-driven stemmatic method is shown to work on two small corpora (the Church Slavonic Gospel of Mark and the French Roman de la Rose). Conclusion: Substantially better OCR helps advance digital humanities research, and stemmatic analysis based directly on image features, without text transcription, offers a new paradigm for the study of historical manuscripts. Abstract: The age of artificial intelligence has brought many new possibilities and pitfalls in many fields and tasks. The devil is in the details, and those come to the fore when building new pipelines and executing small practical experiments. OCR and stemmatology are no exception. The current investigation starts by comparing a range of OCR systems, from classical approaches through machine learning to LLMs, on roughly 6,000 characters of late handwritten Church Slavonic manuscripts from the 18th century. Focusing on basic letter correctness, more than 10 CS OCR systems, among them 2 LLMs (GPT5 and Gemini3-flash), are compared. Then, post-processing via LLMs is assessed and finally, different agentic OCR architectures (specialized post-processing agents, an agentic pipeline, and RAG) are tested. With the new technology elaborated, experiments suggest that Church Slavonic CER for basic letters may reach as low as 2-3%, but elaborate diacritics could still present a problem. How well OCR can prime stemmatology as a downstream task is the entry point to the second part of the article, which introduces a new stemmatic method based solely on image processing. Here, a pipeline of automated visual glyph extraction, clustering, and pairwise statistical comparison, leading to a distance matrix and ultimately a stemma, is presented and applied to two small corpora, one for the Church Slavonic Gospel of Mark from the 14th to 16th centuries, one for the Roman de la Rose in French from the 14th and 15th centuries. The basic functioning of the method can be demonstrated.
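Character error rate (CER) is conventionally the Levenshtein edit distance between the OCR output and a reference transcription, divided by the length of the reference. The sketch below assumes that standard definition; the abstract does not state the exact variant the author uses, and the function names are hypothetical.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn ref into hyp."""
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate relative to the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Under this definition a 2-3% CER on a sample of roughly 6,000 characters corresponds to about 120-180 character edits. Whether a base letter plus a combining diacritic counts as one character or two depends on the Unicode normalization applied before comparison, so that choice should be stated when CER figures for diacritic-heavy scripts are compared.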

[418] Ambivalence/Hesitancy Recognition in Videos for Personalized Digital Health Interventions

Manuela González-González,Soufiane Belharbi,Muhammad Osama Zeeshan,Masoumeh Sharafi,Muhammad Haseeb Aslam,Lorenzo Sia,Nicolas Richet,Marco Pedersoli,Alessandro Lameiras Koerich,Simon L Bacon,Eric Granger

Main category: cs.CV

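TL;DR: This paper explores deep learning models (supervised learning, unsupervised domain adaptation, and zero-shot inference with large language models) for automatically recognizing ambivalence and hesitancy (A/H) in health interventions. Experiments on the BAH video dataset show that existing models perform poorly and that better-adapted multimodal modeling is needed.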

Details Motivation: Ambivalence and hesitancy (A/H) are key psychological factors that keep patients from engaging with health interventions; these subtle, cross-modal affective conflicts are hard for conventional digital intervention systems to capture, so automatic recognition is needed to improve personalization and cost-effectiveness. Method: Apply deep learning on the BAH video dataset in three learning setups: supervised learning, unsupervised domain adaptation for personalization, and zero-shot inference via large language models (LLMs), targeting multimodal (language, facial, vocal, and body-language) A/H recognition in videos. Result: Current models show limited performance on A/H recognition, with clear shortcomings in modeling affective conflict within and across modalities and in spatio-temporal fusion. Conclusion: Accurate A/H recognition requires better-adapted multimodal deep learning models, in particular improved spatio-temporal modeling and fusion of cross-modal conflicts, to support truly personalized digital health interventions. Abstract: Using behavioural science, health interventions focus on behaviour change by providing a framework to help patients acquire and maintain healthy habits that improve medical outcomes. In-person interventions are costly and difficult to scale, especially in resource-limited regions. Digital health interventions offer a cost-effective approach, potentially supporting independent living and self-management. Automating such interventions, especially through machine learning, has gained considerable attention recently. Ambivalence and hesitancy (A/H) play a primary role in leading individuals to delay, avoid, or abandon health interventions. A/H are subtle and conflicting emotions that place a person in a state between positive and negative evaluations of a behaviour, or between acceptance and refusal to engage in it. They manifest as affective inconsistency across modalities or within a modality, such as language, facial and vocal expressions, and body language. While experts can be trained to recognize A/H, integrating them into digital health interventions is costly and less effective. Automatic A/H recognition is therefore critical for the personalization and cost-effectiveness of digital health interventions. Here, we explore the application of deep learning models for A/H recognition in videos, a multi-modal task by nature. In particular, this paper covers three learning setups: supervised learning, unsupervised domain adaptation for personalization, and zero-shot inference via large language models (LLMs). Our experiments are conducted on the unique and recently published BAH video dataset for A/H recognition. Our results show limited performance, suggesting that more adapted multi-modal models are required for accurate A/H recognition. Better methods for modeling spatio-temporal and multimodal fusion are necessary to leverage conflicts within/across modalities.

[419] Learning Long-term Motion Embeddings for Efficient Kinematics Generation

Nick Stracke,Kolja Bauer,Stefan Andreas Baumann,Miguel Angel Bautista,Josh Susskind,Björn Ommer

Main category: cs.CV

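TL;DR: This paper proposes an efficient way to model and generate long-term motion: learn a long-term motion embedding from large-scale trajectories produced by tracker models, then train a conditional flow-matching model in that compressed space to generate long, realistic motion from text prompts or spatial interaction prompts.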

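Details Motivation: Modern video models understand scene dynamics well, but predicting multiple possible future motions through full video synthesis is prohibitively inefficient, so a more efficient motion modeling approach is needed. Method: First obtain large-scale trajectories from tracker models and learn a highly compressed motion embedding with a temporal compression factor of 64x; then train a conditional flow-matching model in that embedding space to generate motion latents conditioned on text prompts or spatial "pokes". Result: The generated motion distributions outperform those of state-of-the-art video models and of specialized task-specific methods, with clear gains in the efficiency and realism of long-term motion generation. Conclusion: Operating directly on a learned long-term motion embedding is a more efficient and scalable paradigm for motion modeling, and it opens a new path for motion understanding and controllable generation in visual intelligence. Abstract: Understanding and predicting motion is a fundamental component of visual intelligence. Although modern video models exhibit strong comprehension of scene dynamics, exploring multiple possible futures through full video synthesis remains prohibitively inefficient. We model scene dynamics orders of magnitude more efficiently by directly operating on a long-term motion embedding that is learned from large-scale trajectories obtained from tracker models. This enables efficient generation of long, realistic motions that fulfill goals specified via text prompts or spatial pokes. To achieve this, we first learn a highly compressed motion embedding with a temporal compression factor of 64x. In this space, we train a conditional flow-matching model to generate motion latents conditioned on task descriptions. The resulting motion distributions outperform those of both state-of-the-art video models and specialized task-specific approaches.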

[420] MosaicMRI: A Diverse Dataset and Benchmark for Raw Musculoskeletal MRI

Paula Arguello,Berk Tinaz,Mohammad Shahab Sepehri,Maryam Soltanolkotabi,Mahdi Soltanolkotabi

Main category: cs.CV

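TL;DR: This paper introduces MosaicMRI, currently the largest open-source raw musculoskeletal (MSK) MRI dataset, with 2,671 volumes and 80,156 slices covering many anatomies, imaging contrasts, scan orientations, and coil counts. Accelerated-reconstruction experiments with VarNet show that joint training across anatomies beats anatomy-specific models in low-sample regimes, and reveal transferable imaging correlations between body parts along with specific patterns of cross-domain generalization.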

Details Motivation: Deep learning research on MRI relies mostly on a few public datasets of the brain and knee, so model reliability across diverse anatomies is poorly assessed; a large-scale raw dataset covering a broad range of MSK anatomy is needed to support method development and robustness analysis. Method: Build and release the MosaicMRI dataset (2,671 volumes, 80,156 slices), covering the spine, knee, hip, ankle, and other MSK regions under a variety of imaging parameters; with VarNet as the baseline, run systematic scaling experiments over model capacity and dataset size on accelerated reconstruction, and design a cross-anatomy train-test protocol to evaluate generalization. Result: Models trained jointly on multiple anatomies significantly outperform anatomy-specific models in low-sample regimes; specific groups of body parts, such as the foot and elbow, generalize well to each other; performance depends jointly on training set size, anatomy, and imaging protocol. Conclusion: Anatomical diversity brings substantial gains when training deep learning models for MRI, and MosaicMRI provides key infrastructure for improving the generalization, robustness, and clinical applicability of MSK MRI algorithms. Abstract: Deep learning underpins a wide range of applications in MRI, including reconstruction, artifact removal, and segmentation. However, progress has been driven largely by public datasets focused on brain and knee imaging, shaping how models are trained and evaluated. As a result, careful studies of the reliability of these models across diverse anatomical settings remain limited. In this work, we introduce MosaicMRI, a large and diverse collection of fully sampled raw musculoskeletal (MSK) MR measurements designed for training and evaluating machine-learning-based methods. MosaicMRI is the largest open-source raw MSK MRI dataset to date, comprising 2,671 volumes and 80,156 slices. The dataset offers substantial diversity in volume orientation (e.g., axial, sagittal), imaging contrasts (e.g., PD, T1, T2), anatomies (e.g., spine, knee, hip, ankle, and others), and numbers of acquisition coils. Using VarNet as a baseline for the accelerated reconstruction task, we perform a comprehensive set of experiments to study scaling behavior with respect to both model capacity and dataset size. Interestingly, models trained on the combined anatomies significantly outperform anatomy-specific models in low-sample regimes, highlighting the benefits of anatomical diversity and the presence of exploitable cross-anatomical correlations. We further evaluate robustness and cross-anatomy generalization by training models on one anatomy (e.g., spine) and testing them on another (e.g., knee). Notably, we identify groups of body parts (e.g., foot and elbow) that generalize well with each other, and highlight that performance under domain shifts depends on training set size, anatomy, and protocol-specific factors.

[421] Efficient KernelSHAP Explanations for Patch-based 3D Medical Image Segmentation

Ricardo Coimbra Brioso,Giulio Sichili,Damiano Dei,Nicola Lambri,Pietro Mancosu,Marta Scorsetti,Daniele Loiacono

Main category: cs.CV

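TL;DR: This paper proposes an efficient KernelSHAP explanation framework for volumetric CT segmentation. It speeds up computation by restricting it to a region of interest and its receptive-field support and by caching patch logits, and it compares several feature abstractions and attribution aggregation strategies to balance clinical interpretability against attribution faithfulness.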

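Details Motivation: Perturbation-based explanation methods such as KernelSHAP are impractical for patch-based 3D medical image segmentation because of the large number of coalition evaluations and the high cost of sliding-window inference; an attribution method that is both efficient and clinically meaningful is needed. Method: Restrict KernelSHAP computation to a user-defined ROI and its receptive-field support; use patch logit caching to reuse baseline predictions for unaffected patches while preserving nnU-Net's fusion scheme; compare three automatically generated feature abstractions (whole-organ units, regular FCC supervoxels, and hybrid organ-aware supervoxels) and multiple aggregation/value functions (e.g., TP, Dice, Soft Dice) that target either stabilizing evidence or false-positive behavior. Result: In whole-body CT segmentation experiments, caching yields computational savings of 15% to 30%; regular supervoxels perform best on perturbation-based metrics but align poorly with anatomy; organ-aware units are more effective at exposing false-positive drivers under normalized metrics and are more clinically interpretable. Conclusion: The framework makes KernelSHAP considerably more practical and clinically suitable for 3D medical image segmentation; the choice of feature abstraction and value function strongly shapes the trade-off between faithfulness and interpretability, with organ-aware units the better clinically oriented choice. Abstract: Perturbation-based explainability methods such as KernelSHAP provide model-agnostic attributions but are typically impractical for patch-based 3D medical image segmentation due to the large number of coalition evaluations and the high cost of sliding-window inference. We present an efficient KernelSHAP framework for volumetric CT segmentation that restricts computation to a user-defined region of interest and its receptive-field support, and accelerates inference via patch logit caching, reusing baseline predictions for unaffected patches while preserving nnU-Net's fusion scheme. To enable clinically meaningful attributions, we compare three automatically generated feature abstractions within the receptive-field crop: whole-organ units, regular FCC supervoxels, and hybrid organ-aware supervoxels, and we study multiple aggregation/value functions targeting stabilizing evidence (TP/Dice/Soft Dice) or false-positive behavior. Experiments on whole-body CT segmentations show that caching substantially reduces redundant computation (with computational savings ranging from 15% to 30%) and that faithfulness and interpretability exhibit clear trade-offs: regular supervoxels often maximize perturbation-based metrics but lack anatomical alignment, whereas organ-aware units yield more clinically interpretable explanations and are particularly effective for highlighting false-positive drivers under normalized metrics.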
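The patch logit caching idea can be illustrated with a minimal 1D sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names (`sliding_patches`, `predict_with_cache`), the 1D volume, and the toy `model` callable are all hypothetical, and patches are fused by uniform averaging as a stand-in for nnU-Net's actual fusion scheme. The point it shows is that a patch whose input voxels are all untouched by a coalition's perturbation can reuse the logits computed once on the unperturbed volume.

```python
import numpy as np

def sliding_patches(length, patch, stride):
    """Return (start, stop) bounds of 1D sliding-window patches covering [0, length)."""
    starts = list(range(0, max(length - patch, 0) + 1, stride))
    if starts[-1] + patch < length:  # make sure the tail of the volume is covered
        starts.append(length - patch)
    return [(s, s + patch) for s in starts]

def predict_with_cache(volume, fill, coalition_mask, patches, model, cache):
    """Evaluate one coalition, running the model only on patches that changed.

    coalition_mask: True where a voxel keeps its original value,
                    False where it is replaced by the fill (reference) value.
    cache: dict mapping patch index -> logits computed on the unperturbed volume.
    Returns the fused logits and the number of patches that were recomputed.
    """
    perturbed = np.where(coalition_mask, volume, fill)
    logit_sum = np.zeros_like(volume, dtype=float)
    counts = np.zeros_like(volume, dtype=float)
    recomputed = 0
    for i, (a, b) in enumerate(patches):
        if coalition_mask[a:b].all():
            logits = cache[i]              # input unchanged: reuse cached logits
        else:
            logits = model(perturbed[a:b])  # input changed: run inference again
            recomputed += 1
        logit_sum[a:b] += logits
        counts[a:b] += 1
    # Uniform averaging of overlapping patches (an assumption of this sketch).
    return logit_sum / counts, recomputed
```

In this toy setting the saving is simply the fraction of patches that no perturbed voxel touches; the 15% to 30% figure reported in the abstract refers to the authors' whole-body CT experiments, not to this sketch.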

[422] HDR Video Generation via Latent Alignment with Logarithmic Encoding

Naomi Ken Korem,Mohamed Oumoumad,Harel Cain,Matan Ben Yosef,Urska Jelercic,Ofir Bibi,Yaron Inger,Or Patashnik,Daniel Cohen-Or

Main category: cs.CV

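TL;DR: This paper proposes a simple way to generate high-quality HDR video without redesigning the generative model: it leverages the visual priors already learned by a pretrained model and applies lightweight fine-tuning combined with logarithmic encoding and a camera-mimicking degradation strategy.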

Details Motivation: HDR imagery is hard to generate because its high dynamic range is mismatched with the bounded, perceptually compressed data on which generative models are usually trained; existing approaches that learn new representations add complexity and data requirements. Method: Use a logarithmic encoding widely used in cinematic pipelines to map HDR into a distribution naturally aligned with the latent space of pretrained generative models, and adapt via lightweight fine-tuning; introduce a training strategy based on camera-mimicking degradations that pushes the model to infer missing HDR detail from its priors. Result: With only minimal adaptation of a pretrained video model, the method achieves high-quality HDR video generation, with strong results across diverse scenes and challenging lighting conditions. Conclusion: Although HDR represents a different image formation regime, it can be handled effectively without redesigning the model, provided the data representation is chosen to align with the pretrained model's priors. Abstract: High dynamic range (HDR) imagery offers a rich and faithful representation of scene radiance, but remains challenging for generative models due to its mismatch with the bounded, perceptually compressed data on which these models are trained. A natural solution is to learn new representations for HDR, which introduces additional complexity and data requirements. In this work, we show that HDR generation can be achieved in a much simpler way by leveraging the strong visual priors already captured by pretrained generative models. We observe that a logarithmic encoding widely used in cinematic pipelines maps HDR imagery into a distribution that is naturally aligned with the latent space of these models, enabling direct adaptation via lightweight fine-tuning without retraining an encoder. To recover details that are not directly observable in the input, we further introduce a training strategy based on camera-mimicking degradations that encourages the model to infer missing high dynamic range content from its learned priors. Combining these insights, we demonstrate high-quality HDR video generation using a pretrained video model with minimal adaptation, achieving strong results across diverse scenes and challenging lighting conditions. Our results indicate that HDR, despite representing a fundamentally different image formation regime, can be handled effectively without redesigning generative models, provided that the representation is chosen to align with their learned priors.
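To make the role of a logarithmic encoding concrete, here is a minimal sketch of a generic log curve that compresses unbounded-looking linear radiance into the [0, 1] range a pretrained model typically expects, together with its inverse. The abstract does not name the specific curve, so the formula, the function names, and the `peak` and `gain` parameters below are assumptions chosen for illustration and do not correspond to any particular camera log format or to the authors' pipeline.

```python
import numpy as np

def log_encode(radiance, peak=64.0, gain=16.0):
    """Map linear radiance in [0, peak] to [0, 1] with a generic log curve.

    `peak` and `gain` are hypothetical parameters for this sketch only.
    """
    radiance = np.clip(np.asarray(radiance, dtype=np.float64), 0.0, peak)
    return np.log2(1.0 + gain * radiance) / np.log2(1.0 + gain * peak)

def log_decode(encoded, peak=64.0, gain=16.0):
    """Invert log_encode, recovering linear radiance from the [0, 1] code."""
    encoded = np.asarray(encoded, dtype=np.float64)
    return (np.exp2(encoded * np.log2(1.0 + gain * peak)) - 1.0) / gain
```

The curve spends most of its output range on dark and mid-tone values and compresses highlights, which is the general property that makes log-encoded HDR resemble the bounded data generative models are trained on.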

[423] LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation

Yuqian Yuan,Wenqiao Zhang,Juekai Lin,Yu Zhong,Mingjian Gao,Binhe Yu,Yunqi Cao,Wentong Li,Yueting Zhuang,Beng Chin Ooi

Main category: cs.CV

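TL;DR: This paper surveys recent progress at the intersection of large multimodal models (LMMs) and object-centric vision. It is organized around four themes (object-level understanding, referring segmentation, visual editing, and generation), summarizes modeling paradigms, learning strategies, and evaluation methods, and discusses open challenges such as instance permanence, spatial control, and multi-step interaction.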

Details Motivation: Existing LMMs are limited in object-level grounding, fine-grained spatial reasoning, and controllable visual manipulation: they struggle to identify the correct instance, preserve object identity, and make high-precision local modifications. An object-centric vision framework is needed to improve precision and controllability. Method: A systematic literature review that organizes progress into four themes (object-level understanding, referring segmentation, editing, and generation), summarizes modeling paradigms, learning strategies, and evaluation protocols, and analyzes current challenges and future directions. Result: Builds a structured body of knowledge on combining LMMs with object-centric vision, clarifies key technical paths and evaluation standards, and identifies core challenges including robust instance permanence, fine-grained spatial control, and consistent multi-step interaction. Conclusion: Object-centric vision is a key paradigm for moving LMMs from global scene understanding toward precise, controllable, and trustworthy object-level operations; future work should develop unified cross-task modeling and reliable benchmarking under distribution shift. Abstract: Large Multimodal Models (LMMs) have achieved remarkable progress in general-purpose vision-language understanding, yet they remain limited in tasks requiring precise object-level grounding, fine-grained spatial reasoning, and controllable visual manipulation. In particular, existing systems often struggle to identify the correct instance, preserve object identity across interactions, and localize or modify designated regions with high precision. Object-centric vision provides a principled framework for addressing these challenges by promoting explicit representations and operations over visual entities, thereby extending multimodal systems from global scene understanding to object-level understanding, segmentation, editing, and generation. This paper presents a comprehensive review of recent advances at the convergence of LMMs and object-centric vision. We organize the literature into four major themes: object-centric visual understanding, object-centric referring segmentation, object-centric visual editing, and object-centric visual generation. We further summarize the key modeling paradigms, learning strategies, and evaluation protocols that support these capabilities. Finally, we discuss open challenges and future directions, including robust instance permanence, fine-grained spatial control, consistent multi-step interaction, unified cross-task modeling, and reliable benchmarking under distribution shift. We hope this paper provides a structured perspective on the development of scalable, precise, and trustworthy object-centric multimodal systems.

[424] LottieGPT: Tokenizing Vector Animation for Autoregressive Generation

Junhao Chen,Kejun Gao,Yuehan Cui,Mingze Sun,Mingjin Chen,Shaohui Wang,Xiaoxiao Long,Fei Ma,Qi Tian,Ruqi Huang,Hao Zhao

Main category: cs.CV

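TL;DR: This paper proposes LottieGPT, the first framework for generating vector animation. It designs a Lottie Tokenizer that encodes vector animations into compact semantic sequences, builds the large-scale dataset LottieAnimation-660K, and fine-tunes Qwen-VL to generate editable, high-fidelity vector animations from text or images.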

Details Motivation: Existing video generation models operate only in raster space and cannot produce vector animation, which offers resolution independence, compactness, semantic structure, and editable motion parameters; meanwhile, large multimodal models have shown promise in generating structured data, motivating research on native vector animation generation. Method: Adopt the Lottie standard and design a dedicated Lottie Tokenizer that encodes geometric primitives, transforms, and keyframe motion into a compact, semantically aligned token sequence; build the large-scale dataset LottieAnimation-660K, containing 660k vector animations and 15M static Lottie images; fine-tune Qwen-VL to obtain the multimodal model LottieGPT, which supports autoregressive text/image-to-vector-animation generation. Result: The tokenizer substantially shortens sequences while preserving structural fidelity, making autoregressive learning of dynamic vector content more effective; LottieGPT generalizes well across animation styles and outperforms prior state-of-the-art models on SVG generation (a special case of single-frame vector animation). Conclusion: This work presents the first end-to-end generation of editable vector animation from natural language or images, opening a new direction for multimedia generation and demonstrating that large models can natively generate structured dynamic content. Abstract: Despite rapid progress in video generation, existing models are incapable of producing vector animation, a dominant and highly expressive form of multimedia on the Internet. Vector animations offer resolution-independence, compactness, semantic structure, and editable parametric motion representations, yet current generative models operate exclusively in raster space and thus cannot synthesize them. Meanwhile, recent advances in large multimodal models demonstrate strong capabilities in generating structured data such as slides, 3D meshes, LEGO sequences, and indoor layouts, suggesting that native vector animation generation may be achievable. In this work, we present the first framework for tokenizing and autoregressively generating vector animations. We adopt Lottie, a widely deployed JSON-based animation standard, and design a tailored Lottie Tokenizer that encodes layered geometric primitives, transforms, and keyframe-based motion into a compact and semantically aligned token sequence. To support large-scale training, we also construct LottieAnimation-660K, the largest and most diverse vector animation dataset to date, consisting of 660k real-world Lottie animations and 15M static Lottie image files curated from broad Internet sources. Building upon these components, we finetune Qwen-VL to create LottieGPT, a native multimodal model capable of generating coherent, editable vector animations directly from natural language or visual prompts. Experiments show that our tokenizer dramatically reduces sequence length while preserving structural fidelity, enabling effective autoregressive learning of dynamic vector content. LottieGPT exhibits strong generalization across diverse animation styles and outperforms previous state-of-the-art models on SVG generation (a special case of single-frame vector animation).

[425] SyncFix: Fixing 3D Reconstructions via Multi-View Synchronization

Deming Li,Abhay Yadav,Cheng Peng,Rama Chellappa,Anand Bhattad

Main category: cs.CV

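TL;DR: SyncFix is a diffusion-based framework for repairing consistency in multi-view scene reconstruction. It achieves cross-view semantic and geometric consistency through joint latent bridge matching, generalizes to an arbitrary number of views despite training only on image pairs, and improves reconstruction quality both without reference images and with sparse references.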

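Details Motivation: Address the semantic and geometric inconsistencies that arise in multi-view scene reconstruction when diffusion models process each view independently. Method: Formulate reconstruction refinement as a multi-view joint latent bridge matching problem that synchronizes distorted and clean representations, learning a joint conditional distribution over views to enforce consistency along the denoising trajectory; train only on image pairs while supporting any number of views at inference. Result: Outperforms current state-of-the-art methods in both qualitative and quantitative evaluation, improving reconstruction quality even without clean reference images; fidelity improves further when sparse reference images are available; reconstruction quality increases with the number of views, with diminishing returns. Conclusion: SyncFix effectively resolves cross-view inconsistency in multi-view reconstruction, generalizes well, and is practical, offering a new paradigm for diffusion-based 3D reconstruction. Abstract: We present SyncFix, a framework that enforces cross-view consistency during the diffusion-based refinement of reconstructed scenes. SyncFix formulates refinement as a joint latent bridge matching problem, synchronizing distorted and clean representations across multiple views to fix the semantic and geometric inconsistencies. This means SyncFix learns a joint conditional over multiple views to enforce consistency throughout the denoising trajectory. Our training is done only on image pairs, but it generalizes naturally to an arbitrary number of views during inference. Moreover, reconstruction quality improves with additional views, with diminishing returns at higher view counts. Qualitative and quantitative results demonstrate that SyncFix consistently generates high-quality reconstructions and surpasses current state-of-the-art baselines, even in the absence of clean reference images. SyncFix achieves even higher fidelity when sparse references are available.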

[426] Budget-Aware Uncertainty for Radiotherapy Segmentation QA Using nnU-Net

Ricardo Coimbra Brioso,Lorenzo Mondo,Damiano Dei,Nicola Lambri,Pietro Mancosu,Marta Scorsetti,Daniele Loiacono

Main category: cs.CV

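TL;DR: This paper proposes a budget-aware, uncertainty-driven quality assurance (QA) framework built on nnU-Net for assessing automatic segmentation of the Clinical Target Volume (CTV) in radiotherapy. It combines uncertainty quantification with post-hoc calibration to produce voxel-wise uncertainty maps that guide targeted manual review.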

Details Motivation: Accurate CTV delineation is essential for radiotherapy planning but is time-consuming and hard to assess, especially in complex settings such as Total Marrow and Lymph Node Irradiation (TMLI); deep learning auto-segmentation can reduce workload, but safe clinical deployment requires reliable indications of where the model may be wrong. Method: Build an uncertainty-driven QA framework on nnU-Net that produces voxel-wise uncertainty maps from predictive entropy; compare and combine temperature scaling (TS), deep ensembles (DE), checkpoint ensembles (CE), and test-time augmentation (TTA); assess reliability with ROI-masked calibration metrics and uncertainty-error alignment (AUC over the top 0-5% most uncertain voxels). Result: Segmentation accuracy remains stable across configurations; TS substantially improves calibration; calibrated checkpoint ensembles (CE) give the largest gain in uncertainty-error alignment, producing uncertainty maps that more consistently flag regions needing manual edits. Conclusion: Combining post-hoc calibration with efficient ensembling is a promising strategy for a budget-aware, clinically feasible QA workflow in radiotherapy segmentation. Abstract: Accurate delineation of the Clinical Target Volume (CTV) is essential for radiotherapy planning, yet remains time-consuming and difficult to assess, especially for complex treatments such as Total Marrow and Lymph Node Irradiation (TMLI). While deep learning-based auto-segmentation can reduce workload, safe clinical deployment requires reliable cues indicating where models may be wrong. In this work, we propose a budget-aware uncertainty-driven quality assurance (QA) framework built on nnU-Net, combining uncertainty quantification and post-hoc calibration to produce voxel-wise uncertainty maps (based on predictive entropy) that can guide targeted manual review. We compare temperature scaling (TS), deep ensembles (DE), checkpoint ensembles (CE), and test-time augmentation (TTA), evaluated both individually and in combination on TMLI as a representative use case. Reliability is assessed through ROI-masked calibration metrics and uncertainty-error alignment under realistic revision constraints, summarized as AUC over the top 0-5% most uncertain voxels. Across configurations, segmentation accuracy remains stable, whereas TS substantially improves calibration. Uncertainty-error alignment improves most with calibrated checkpoint-based inference, leading to uncertainty maps that more consistently highlight regions requiring manual edits. Overall, integrating calibration with efficient ensembling appears to be a promising strategy to implement a budget-aware QA workflow for radiotherapy segmentation.
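The ingredients named above (temperature-scaled softmax, predictive entropy, and a review budget over the most uncertain voxels) can be sketched as follows. This is a simplified illustration, not the paper's evaluation code: the function names are hypothetical, voxels are flattened to rows of a logit matrix, and `error_capture_at_budget` reports the share of errors caught at a single budget rather than the AUC over the 0-5% range that the abstract describes, whose exact definition is not given here.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis (temperature > 1 softens)."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predictive_entropy(probs, eps=1e-12):
    """Per-voxel entropy of the predicted class distribution."""
    return -(probs * np.log(probs + eps)).sum(axis=-1)

def error_capture_at_budget(entropy, is_error, budget=0.05):
    """Fraction of all errors that fall in the `budget` share of most-uncertain voxels."""
    n_review = max(1, int(round(budget * entropy.size)))
    most_uncertain = np.argsort(entropy.ravel())[::-1][:n_review]
    total = is_error.sum()
    return float(is_error.ravel()[most_uncertain].sum() / total) if total else 0.0
```

A reviewer with time to inspect only 5% of voxels would look at `most_uncertain` first; a well-aligned uncertainty map is one for which this capture rate is high.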

[427] OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

Donghao Zhou,Guisheng Liu,Hao Yang,Jiatong Li,Jingyu Lin,Xiaohu Huang,Yichen Liu,Xin Gao,Cunjian Chen,Shilei Wen,Chi-Wing Fu,Pheng-Ann Heng

Main category: cs.CV

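TL;DR: This paper proposes OmniShow, a framework for human-object interaction video generation (HOIVG) under multimodal conditions (text, image, audio, pose). It improves controllability, quality, and data efficiency through unified channel-wise condition injection, gated local-context attention, and a decoupled-then-joint training strategy, and it introduces the HOIVG-Bench benchmark to advance the field.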

Details Motivation: Existing methods cannot simultaneously accommodate text, reference image, audio, and pose conditions, yet HOIVG has significant value in practical applications such as e-commerce demonstrations, short video production, and interactive entertainment. Method: Propose the end-to-end OmniShow framework, comprising Unified Channel-wise Conditioning (efficient image and pose injection), Gated Local-Context Attention (precise audio-visual synchronization), and a Decoupled-Then-Joint Training strategy (mitigating data scarcity); build the HOIVG-Bench evaluation benchmark. Result: OmniShow reaches state-of-the-art performance across multimodal conditioning settings, improves generation quality and controllability, and fills the evaluation gap for the HOIVG task. Conclusion: OmniShow provides a practical, efficient, and scalable solution for the emerging HOIVG task and moves it toward industry-grade applications. Abstract: In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task holds significant practical value for automating content creation in real-world applications, such as e-commerce demonstrations, short video production, and interactive entertainment. However, existing approaches fail to accommodate all these requisite conditions. We present OmniShow, an end-to-end framework tailored for this practical yet challenging task, capable of harmonizing multimodal conditions and delivering industry-grade performance. To overcome the trade-off between controllability and quality, we introduce Unified Channel-wise Conditioning for efficient image and pose injection, and Gated Local-Context Attention to ensure precise audio-visual synchronization. To effectively address data scarcity, we develop a Decoupled-Then-Joint Training strategy that leverages a multi-stage training process with model merging to efficiently harness heterogeneous sub-task datasets. Furthermore, to fill the evaluation gap in this field, we establish HOIVG-Bench, a dedicated and comprehensive benchmark for HOIVG. Extensive experiments demonstrate that OmniShow achieves overall state-of-the-art performance across various multimodal conditioning settings, setting a solid standard for the emerging HOIVG task.

[428] Pair2Scene: Learning Local Object Relations for Procedural Scene Generation

Xingjian Ran,Shujie Zhang,Weipeng Zhong,Li Luo,Bo Dai

Main category: cs.CV

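TL;DR: This paper proposes Pair2Scene, a framework that models local support and functional relations between objects and combines them with scene hierarchies and physics-based algorithms to generate high-fidelity, generalizable 3D indoor scenes that are physically and semantically plausible.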

Details Motivation: Existing methods are limited by data scarcity and generalize poorly to dense scenes, while LLMs/VLMs lack precise spatial reasoning; object placement depends mainly on local dependencies rather than redundant global distributions, so more effective modeling of local rules is needed. Method: Pair2Scene models learned local rules (support and functional relations) with a network that estimates the position distribution of dependent objects relative to anchor objects; the 3D-Pairs dataset is constructed to train the model; at inference, the model is applied recursively within a hierarchical structure, with collision-aware rejection sampling to obtain globally consistent layouts. Result: Significantly outperforms existing methods at generating complex scenes beyond the training distribution while maintaining physical plausibility and semantic coherence. Conclusion: Modeling local dependencies together with hierarchical structure and physical constraints is an effective route to better quality and generalization in 3D indoor scene generation, and Pair2Scene offers a scalable, interpretable paradigm for it. Abstract: Generating high-fidelity 3D indoor scenes remains a significant challenge due to data scarcity and the complexity of modeling intricate spatial relations. Current methods often struggle to scale beyond the training distribution to dense scenes or rely on LLMs/VLMs that lack precise spatial reasoning ability. Building on the observation that object placement relies mainly on local dependencies instead of information-redundant global distributions, in this paper, we propose Pair2Scene, a novel procedural generation framework that integrates learned local rules with scene hierarchies and physics-based algorithms. These rules mainly capture two types of inter-object relations, namely support relations that follow physical hierarchies, and functional relations that reflect semantic links. We model these rules through a network, which estimates spatial position distributions of dependent objects conditioned on the position and geometry of the anchor objects. Accordingly, we curate a dataset, 3D-Pairs, from existing scene data to train the model. During inference, our framework can generate scenes by recursively applying our model within a hierarchical structure, leveraging collision-aware rejection sampling to align local rules into coherent global layouts. Extensive experiments demonstrate that our framework outperforms existing methods in generating complex environments that go beyond training data while maintaining physical and semantic plausibility.
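Collision-aware rejection sampling, as named in the method summary, can be sketched in a few lines. The sketch below makes several simplifying assumptions that are not from the paper: objects are 2D axis-aligned boxes, the learned network is replaced by a fixed Gaussian over the offset from the anchor, and the names `boxes_overlap` and `place_dependent` are hypothetical. It shows only the general pattern of drawing a candidate from a local placement distribution and rejecting it if it collides with anything already placed.

```python
import numpy as np

def boxes_overlap(center_a, size_a, center_b, size_b):
    """Axis-aligned overlap test for two 2D boxes given centers and full sizes."""
    return bool(np.all(np.abs(center_a - center_b) < (size_a + size_b) / 2.0))

def place_dependent(anchor_center, offset_mean, offset_std, size, placed, rng,
                    max_tries=100):
    """Sample a position near the anchor; reject samples that collide.

    placed: list of (center, size) pairs for objects already in the scene.
    The Gaussian offset stands in for a learned local distribution (assumption).
    Returns the accepted position, or None if every try collided.
    """
    for _ in range(max_tries):
        candidate = anchor_center + rng.normal(offset_mean, offset_std)
        if not any(boxes_overlap(candidate, size, c, s) for c, s in placed):
            return candidate
    return None
```

For example, with a 1 x 1 table at the origin already in `placed`, calling `place_dependent` with an offset mean of `[1.5, 0.0]` and a 0.5 x 0.5 chair returns a position beside the table; appending the result to `placed` before the next call is what lets local rules accumulate into a collision-free global layout.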

[429] Who Handles Orientation? Investigating Invariance in Feature Matching

David Nordström,Johan Edstedt,Fredrik Kahl,Georg Bökman

Main category: cs.CV

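TL;DR: This paper studies at which stage of a modern sparse matching pipeline (descriptor or matcher) rotation invariance should be introduced to improve keypoint matching. It finds that building rotation invariance directly into the descriptor preserves matching accuracy while speeding up matching, and that at large training scale it does not hurt performance on upright images.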

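Details Motivation: Modern keypoint matchers perform poorly on images with large in-plane rotations. Learning rotation invariance through data augmentation is a common strategy, but it is unclear at which stage of the matching pipeline (descriptor or matcher) it is best introduced. Method: Within a modern sparse matching pipeline, systematically compare incorporating rotation invariance in the descriptor versus in the matcher; train on a large collection of 3D vision datasets and evaluate on popular image matching benchmarks (e.g., WxBS, HardMatch, SatAst); analyze how training scale affects generalization to rotated images. Result: (1) Rotation invariance in the descriptor performs similarly to handling it in the matcher; (2) with descriptor-level invariance, rotation invariance is achieved earlier in the matcher, allowing a faster rotation-invariant matcher; (3) at large training scale, enforcing rotation invariance does not hurt upright matching performance; (4) increasing training data size substantially improves generalization to rotated images; (5) two state-of-the-art rotation-robust matchers are released. Conclusion: Explicitly modeling rotation invariance at the descriptor level is an efficient and scalable design choice that balances performance, speed, and generalization, giving clear guidance for building robust image matching systems. Abstract: Finding matching keypoints between images is a core problem in 3D computer vision. However, modern matchers struggle with large in-plane rotations. A straightforward mitigation is to learn rotation invariance via data augmentation. However, it remains unclear at which stage rotation invariance should be incorporated. In this paper, we study this in the context of a modern sparse matching pipeline. We perform extensive experiments by training on a large collection of 3D vision datasets and evaluating on popular image matching benchmarks. Surprisingly, we find that incorporating rotation invariance already in the descriptor yields similar performance to handling it in the matcher. However, rotation invariance is achieved earlier in the matcher when it is learned in the descriptor, allowing for a faster rotation-invariant matcher. Further, we find that enforcing rotation invariance does not hurt upright performance when trained at scale. Finally, we study the emergence of rotation invariance through scale and find that increasing the training data size substantially improves generalization to rotated images. We release two matchers robust to in-plane rotations that achieve state-of-the-art performance on e.g. multi-modal (WxBS), extreme (HardMatch), and satellite image matching (SatAst). Code is available at https://github.com/davnords/loma.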