cs.CL

[1] TimeStampEval: A Simple LLM Eval and a Little Fuzzy Matching Trick to Improve Search Accuracy

James McCammon

Main category: cs.CL

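TL;DR: TimeStampEval is a benchmark for retrieving precise millisecond timestamps from long transcripts given non-verbatim quotes. It proposes a two-stage method that combines RapidFuzz pre-filtering with LLM verification, improving accuracy while sharply reducing inference cost.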

Details

Motivation: Traditional fuzzy matching breaks down when official written records and speech-to-text transcripts differ syntactically. The driving use case is aligning Congressional Record clips for an automatically generated long-form podcast.

Method: A two-stage "Assisted Fuzzy" approach: RapidFuzz pre-filters candidates, then an LLM verifies them on short snippets and pinpoints the timestamp boundaries. Prompt design is tuned for efficiency and accuracy.

Result: On a 2,800-sentence transcript, prompt design matters more than model choice: sensible formatting improves accuracy by 3-20 points and cuts token usage by 30-40%. A modest reasoning budget raises accuracy from 37% to 77% for weak setups and to above 90% for strong ones. The method improves fuzzy-match accuracy by up to 50 points, halves latency, and reduces cost per correct result by up to 96%. Across ten transcripts of varying length and domain it maintains 95-100% rejection accuracy for absent targets.

Conclusion: TimeStampEval shows that combining traditional fuzzy matching with lightweight LLM verification is effective, offering an efficient, robust, and low-cost way to locate non-verbatim quotes in long documents.

Abstract: Traditional fuzzy matching often fails when searching for quotes that are semantically identical but syntactically different across documents, a common issue when aligning official written records with speech-to-text transcripts. We introduce TimeStampEval, a benchmark for retrieving precise millisecond timestamps from long transcripts given non-verbatim quotes. Our simple two-stage method dramatically improves retrieval accuracy while cutting inference costs by over 90%. The motivating use case is an automated long-form podcast that assembles Congressional Record clips into AI-hosted narration. The technical challenge: given a sentence-timestamped transcript and a target quote that may differ due to transcription or editorial drift, return exact start and end boundaries. Standard algorithms handle verbatim text but break under fuzzier variants. Evaluating six modern LLMs on a 2,800-sentence (120k-token) transcript revealed four key findings. (1) Prompt design matters more than model choice: placing the query before the transcript and using compact formatting improved accuracy by 3-20 points while reducing token count by 30-40%. (2) Off-by-one errors form a distinct category, showing models understand the task but misplace boundaries. (3) A modest reasoning budget (600-850 tokens) raises accuracy from 37% to 77% for weak setups and to above 90% for strong ones. (4) Our "Assisted Fuzzy" approach (RapidFuzz pre-filtering followed by LLM verification on short snippets) improves fuzzy match accuracy by up to 50 points while halving latency and reducing cost per correct result by up to 96%. Extended tests on ten transcripts (50k-900k tokens, 1989-2025) confirm robustness to transcript length, vocabulary drift, and domain change, maintaining 95-100% rejection accuracy for absent targets.
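The two-stage idea in this entry (cheap fuzzy pre-filter, then verification on short snippets) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `difflib.SequenceMatcher` stands in for a RapidFuzz scorer so the sketch stays in the standard library, and `stub_verify` is a hypothetical placeholder for the LLM verification call. The function names, the `top_k` and `window` parameters, the 0.75 threshold, and the sample transcript are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for a RapidFuzz scorer."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def prefilter(sentences, quote, top_k=3):
    """Stage 1: return the indices of the top_k sentences most similar to the quote."""
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: similarity(sentences[i]["text"], quote),
        reverse=True,
    )
    return ranked[:top_k]


def snippet(sentences, idx, window=1):
    """Return the candidate sentence plus `window` neighbours on each side."""
    lo = max(0, idx - window)
    hi = min(len(sentences), idx + window + 1)
    return sentences[lo:hi]


def stub_verify(snip, quote, threshold=0.75):
    """Hypothetical verifier standing in for an LLM call.

    Returns (start_ms, end_ms) of the best sentence in the snippet,
    or None to reject the candidate.
    """
    best = max(snip, key=lambda s: similarity(s["text"], quote))
    if similarity(best["text"], quote) >= threshold:
        return (best["start_ms"], best["end_ms"])
    return None


def assisted_fuzzy(sentences, quote, verify=stub_verify, top_k=3, window=1):
    """Stage 2: verify each pre-filtered candidate on a short snippet only."""
    for idx in prefilter(sentences, quote, top_k):
        result = verify(snippet(sentences, idx, window), quote)
        if result is not None:
            return result
    return None  # the quote is treated as absent from the transcript


transcript = [
    {"text": "Mr. Speaker, I rise today in support of the bill.", "start_ms": 0, "end_ms": 4200},
    {"text": "This legislation will help working families across the country.", "start_ms": 4200, "end_ms": 9100},
    {"text": "I yield back the balance of my time.", "start_ms": 9100, "end_ms": 11800},
]

found = assisted_fuzzy(transcript, "This legislation helps working families across our country.")
absent = assisted_fuzzy(transcript, "Quarterly zinc exports doubled.")
```

The point of the design is that the verifier only ever sees a few sentences per candidate rather than the whole transcript, which is where a cost saving of the kind the abstract reports would come from.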

[2] MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

MiroMind Team,Song Bai,Lidong Bing,Carson Chen,Guanzheng Chen,Yuntao Chen,Zhe Chen,Ziyi Chen,Jifeng Dai,Xuan Dong,Yue Deng,Yunjie Fu,Junqi Ge,Chenxia Han,Tammy Huang,Zhenhang Huang,Jerry Jiao,Shilei Jiang,Tianyu Jiao,Xiaoqi Jian,Lei Lei,Ruilin Li,Ryan Luo,Tiantong Li,Xiang Lin,Ziyuan Liu,Zhiqi Li,Jie Ni,Qiang Ren,Pax Sun,Shiqian Su,Chenxin Tao,Bin Wang,Hellen Wang,Haonan Wang,James Wang,Jin Wang,Jojo Wang,Letian Wang,Shizun Wang,Weizhi Wang,Zixuan Wang,Jinfan Xu,Sen Xing,Chenyu Yang,Hai Ye,Jiaheng Yu,Yue Yu,Muyan Zhong,Tianchen Zhao,Xizhou Zhu,Yanpeng Zhou,Yifan Zhang,Zhi Zhu

Main category: cs.CL

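TL;DR: MiroThinker v1.0 is an open-source research agent that improves tool-augmented reasoning and information seeking through interaction scaling at the model level. It supports deep, frequent agent-environment interaction, performs strongly on several benchmarks, and presents interaction depth as a third dimension of performance improvement after model size and context length.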

Details

Motivation: Existing research agents mainly scale model size or context length, yet performance tends to degrade over long reasoning chains. MiroThinker explores systematically scaling the frequency and depth of model-environment interaction as a new dimension for improving research capability.

Method: Reinforcement learning trains the model to perform up to 600 tool calls per task within a 256K context window, enabling sustained multi-turn reasoning and complex research tasks. Performance is evaluated on four benchmarks: GAIA, HLE, BrowseComp, and BrowseComp-ZH.

Result: The 72B variant reaches up to 81.9%, 37.7%, 47.1%, and 55.6% accuracy on GAIA, HLE, BrowseComp, and BrowseComp-ZH respectively, surpassing previous open-source agents and approaching commercial models such as GPT-5-high. Analysis shows that performance improves steadily as interaction depth increases.

Conclusion: Interaction scaling is a key third dimension for building next-generation open-source research agents. It supports complex real-world research tasks, and its predictable scaling behavior resembles that of model size and context length.

Abstract: We present MiroThinker v1.0, an open-source research agent designed to advance tool-augmented reasoning and information-seeking capabilities. Unlike previous agents that only scale up model size or context length, MiroThinker explores interaction scaling at the model level, systematically training the model to handle deeper and more frequent agent-environment interactions as a third dimension of performance improvement. Unlike LLM test-time scaling, which operates in isolation and risks degradation with longer reasoning chains, interactive scaling leverages environment feedback and external information acquisition to correct errors and refine trajectories. Through reinforcement learning, the model achieves efficient interaction scaling: with a 256K context window, it can perform up to 600 tool calls per task, enabling sustained multi-turn reasoning and complex real-world research workflows. Across four representative benchmarks (GAIA, HLE, BrowseComp, and BrowseComp-ZH), the 72B variant achieves up to 81.9%, 37.7%, 47.1%, and 55.6% accuracy respectively, surpassing previous open-source agents and approaching commercial counterparts such as GPT-5-high. Our analysis reveals that MiroThinker benefits from interactive scaling consistently: research performance improves predictably as the model engages in deeper and more frequent agent-environment interactions, demonstrating that interaction depth exhibits scaling behaviors analogous to model size and context length. These findings establish interaction scaling as a third critical dimension for building next-generation open research agents, complementing model capacity and context windows.

[3] On the Notion that Language Models Reason

Bertram Højer

Main category: cs.CL

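TL;DR: This paper examines whether language models genuinely reason, arguing that their so-called "reasoning" outputs reflect statistical regularities rather than explicit logical mechanisms.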

Details

Motivation: To clarify the inconsistent definitions of reasoning used for language models in natural language processing, and to explain where their reasoning-like outputs come from.

Method: Transformer-based language models are analyzed from the viewpoint that they implement an implicit finite-order Markov kernel.

Result: Reasoning-like outputs stem from statistical regularities and approximate invariances in the learned conditional distributions, not from explicit logical reasoning.

Conclusion: Language models are statistical pattern matchers rather than genuine reasoners, a distinction that is fundamental to how their epistemic uncertainty is evaluated.

Abstract: Language models (LMs) are said to be exhibiting reasoning, but what does this entail? We assess definitions of reasoning and how key papers in the field of natural language processing (NLP) use the notion and argue that the definitions provided are not consistent with how LMs are trained, process information, and generate new tokens. To illustrate this incommensurability we assume the view that transformer-based LMs implement an implicit finite-order Markov kernel mapping contexts to conditional token distributions. In this view, reasoning-like outputs correspond to statistical regularities and approximate statistical invariances in the learned kernel rather than the implementation of explicit logical mechanisms. This view is illustrative of the claim that LMs are "statistical pattern matchers" and not genuine reasoners and provides a perspective that clarifies why reasoning-like outputs arise in LMs without any guarantees of logical consistency. This distinction is fundamental to how epistemic uncertainty is evaluated in LMs. We invite a discussion on the importance of how the computational processes of the systems we build and analyze in NLP research are described.
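One way to write down the finite-order Markov kernel view from this entry is given below. The notation is introduced here for illustration and is not taken from the paper: $\mathcal{V}$ is the vocabulary, $k$ the maximum context length, $\Delta(\mathcal{V})$ the set of probability distributions over $\mathcal{V}$, and $K_\theta$ the kernel realized by a model with parameters $\theta$.

```latex
K_\theta : \mathcal{V}^{\le k} \to \Delta(\mathcal{V}),
\qquad
p_\theta(x_t \mid x_{1:t-1}) = K_\theta\bigl(x_t \mid x_{\max(1,\,t-k):t-1}\bigr),
\qquad
p_\theta(x_{1:T}) = \prod_{t=1}^{T} K_\theta\bigl(x_t \mid x_{\max(1,\,t-k):t-1}\bigr).
```

Under this reading, generation is repeated sampling from $K_\theta$, so any reasoning-like regularity has to be a property of the learned conditional distributions themselves.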

[4] Scaling Open-Weight Large Language Models for Hydropower Regulatory Information Extraction: A Systematic Analysis

Hong-Jun Yoon,Faisal Ashraf,Thomas A. Ruggles,Debjani Singh

Main category: cs.CL

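TL;DR: This study evaluates the trade-off between performance and computational resources for seven open-weight large language models extracting information from hydropower licensing documents, and finds that 14B parameters is the key threshold at which performance jumps.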

Details

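Motivation: To address the trade-off between model performance and computational resources in regulatory document information extraction, and to provide empirical guidance for real deployments.

Method: Seven open-weight models from 0.6B to 70B parameters are evaluated on hydropower licensing documents, analyzing F1, recall, and hallucination patterns to build a resource-performance mapping.

Result: Below 14B parameters, validation methods are ineffective (F1 < 0.15); at the 14B threshold they become viable (F1 = 0.64). Consumer-deployable models reach 64% F1, while large-scale models approach 77% but require enterprise infrastructure. Smaller models hallucinate systematically, so that perfect recall signals extraction failure rather than success.

Conclusion: The paper presents the first comprehensive resource-performance map for open-weight information extraction in regulatory contexts, supporting evidence-based model selection. The results are directly applicable to hydropower compliance and shed light on parameter-scaling effects across information extraction tasks.

Abstract: Information extraction from regulatory documents using large language models presents critical trade-offs between performance and computational resources. We evaluated seven open-weight models (0.6B-70B parameters) on hydropower licensing documentation to provide empirical deployment guidance. Our analysis identified a pronounced 14B parameter threshold where validation methods transition from ineffective (F1 < 0.15) to viable (F1 = 0.64). Consumer-deployable models achieve 64% F1 through appropriate validation, while smaller models plateau at 51%. Large-scale models approach 77% F1 but require enterprise infrastructure. We identified systematic hallucination patterns where perfect recall indicates extraction failure rather than success in smaller models. Our findings establish the first comprehensive resource-performance mapping for open-weight information extraction in regulatory contexts, enabling evidence-based model selection. These results provide immediate value for hydropower compliance while contributing insights into parameter scaling effects that generalize across information extraction tasks.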
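The observation that perfect recall can indicate failure is easy to see with a small worked example. The sketch below is illustrative only: the field names and counts are invented, and it simply shows that a model which over-generates (emitting every true field plus many hallucinated ones) gets recall 1.0 while precision and F1 collapse.

```python
def prf1(predicted: set, gold: set):
    """Return (precision, recall, F1) for a set of extracted items."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical gold fields for one licensing document.
gold = {"license_number", "capacity_mw", "expiry_date"}

# A model that returns all 3 true fields plus 27 hallucinated ones.
over_generating = gold | {f"hallucinated_{i}" for i in range(27)}

precision, recall, f1 = prf1(over_generating, gold)
# precision = 3/30, recall = 3/3, F1 = 2 * 0.1 * 1.0 / 1.1
```

Here recall is perfect only because the prediction set is so large, which is why recall should be read alongside precision when auditing small models.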

[5] Towards Autoformalization of LLM-generated Outputs for Requirement Verification

Mihir Gupte,Ramesh S

Main category: cs.CL

TL;DR: A preliminary study of using a simple LLM-based autoformalizer to formally verify LLM-generated outputs such as Gherkin scenarios.

Details

Motivation: There is currently no formal method for verifying the accuracy of structured outputs that LLMs generate from natural language. This paper explores the potential of autoformalization for ensuring the logical consistency of generated results.

Method: A simple LLM-based autoformalizer translates natural language requirements into formal logic. Two experiments assess it: one checks whether differently worded natural language requirements are logically equivalent, and the other detects logical inconsistencies between a natural language requirement and an LLM-generated output.

Result: In the first experiment, the autoformalizer correctly identified two differently worded requirements as logically equivalent. In the second, it detected a logical inconsistency between a natural language requirement and a generated output.

Conclusion: Although limited in scale, the results suggest that autoformalization is promising for verifying the logical consistency and fidelity of LLM-generated content, and they lay groundwork for more extensive studies.

Abstract: Autoformalization, the process of translating informal statements into formal logic, has gained renewed interest with the emergence of powerful Large Language Models (LLMs). While LLMs show promise in generating structured outputs from natural language (NL), such as Gherkin Scenarios from NL feature requirements, there's currently no formal method to verify if these outputs are accurate. This paper takes a preliminary step toward addressing this gap by exploring the use of a simple LLM-based autoformalizer to verify LLM-generated outputs against a small set of natural language requirements. We conducted two distinct experiments. In the first one, the autoformalizer successfully identified that two differently-worded NL requirements were logically equivalent, demonstrating the pipeline's potential for consistency checks. In the second, the autoformalizer was used to identify a logical inconsistency between a given NL requirement and an LLM-generated output, highlighting its utility as a formal verification tool. Our findings, while limited, suggest that autoformalization holds significant potential for ensuring the fidelity and logical consistency of LLM-generated outputs, laying a crucial foundation for future, more extensive studies into this novel application.

[6] Three Stage Narrative Analysis; Plot-Sentiment Breakdown, Structure Learning and Concept Detection

Taimur Khan,Ramoza Ahsan,Mohib Hameed

Main category: cs.CL

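TL;DR: Proposes a framework that analyzes the sentiment arcs of movie scripts and performs extended analysis related to the context of the characters, using a custom lexicon and hierarchical clustering to extract high-level and low-level concepts from narratives.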

Details

Motivation: Story understanding and analysis remain challenging within natural language understanding, and the large volume of narrative data calls for automated semantic analysis.

Method: A custom lexicon is built with the LabMTsimple storylab module from the Valence, Arousal, and Dominance scores of the NRC-VAD dataset and used for dictionary-based sentiment analysis. Similar sentiment plots are then clustered with Ward's hierarchical clustering.

Result: Experiments on a movie dataset show that the framework extracts sentiment arcs and contextual information from narratives and helps users choose a story or narrative.

Conclusion: The proposed framework has practical value for automated narrative analysis and offers useful decision support to consumers and readers.

Abstract: Story understanding and analysis have long been challenging areas within Natural Language Understanding. Automated narrative analysis requires deep computational semantic representations along with syntactic processing. Moreover, the large volume of narrative data demands automated semantic analysis and computational learning rather than manual analytical approaches. In this paper, we propose a framework that analyzes the sentiment arcs of movie scripts and performs extended analysis related to the context of the characters involved. The framework enables the extraction of high-level and low-level concepts conveyed through the narrative. Using dictionary-based sentiment analysis, our approach applies a custom lexicon built with the LabMTsimple storylab module. The custom lexicon is based on the Valence, Arousal, and Dominance scores from the NRC-VAD dataset. Furthermore, the framework advances the analysis by clustering similar sentiment plots using Ward's hierarchical clustering technique. Experimental evaluation on a movie dataset shows that the resulting analysis is helpful to consumers and readers when selecting a narrative or story.

[7] Identifying Imaging Follow-Up in Radiology Reports: A Comparative Analysis of Traditional ML and LLM Approaches

Namu Park,Giridhar Kaushik Ramachandran,Kevin Lybarger,Fei Xia,Ozlem Uzuner,Meliha Yetisgen,Martin Gunn

Main category: cs.CL

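TL;DR: Introduces an annotated corpus of 6,393 radiology reports for evaluating follow-up imaging detection, and compares traditional machine-learning methods with generative LLMs (GPT-4o and GPT-OSS-20B). Generative models with optimized prompts perform best, but traditional models remain competitive.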

Details

Motivation: Few domain-specific datasets exist for rigorously evaluating large language models on radiology tasks in clinical natural language processing.

Method: An annotated corpus of radiology reports is built and used to compare logistic regression, support vector machines, Longformer, a fine-tuned Llama3-8B-Instruct, and generative LLMs (GPT-4o and GPT-OSS-20B) under baseline and task-optimized configurations, evaluated with precision, recall, and F1.

Result: GPT-4o (Advanced) performs best (F1 = 0.832), followed closely by GPT-OSS-20B (Advanced; F1 = 0.828). Logistic regression and SVM also perform well (F1 = 0.776 and 0.775). Prompt optimization clearly improves performance, and inter-annotator agreement is high (F1 = 0.846).

Conclusion: With prompt optimization, generative LLMs approach human-level agreement, but traditional models remain important baselines for resource efficiency and interpretability.

Abstract: Large language models (LLMs) have shown considerable promise in clinical natural language processing, yet few domain-specific datasets exist to rigorously evaluate their performance on radiology tasks. In this work, we introduce an annotated corpus of 6,393 radiology reports from 586 patients, each labeled for follow-up imaging status, to support the development and benchmarking of follow-up adherence detection systems. Using this corpus, we systematically compared traditional machine-learning classifiers, including logistic regression (LR), support vector machines (SVM), Longformer, and a fully fine-tuned Llama3-8B-Instruct, with recent generative LLMs. To evaluate generative LLMs, we tested GPT-4o and the open-source GPT-OSS-20B under two configurations: a baseline (Base) and a task-optimized (Advanced) setting that focused inputs on metadata, recommendation sentences, and their surrounding context. A refined prompt for GPT-OSS-20B further improved reasoning accuracy. Performance was assessed using precision, recall, and F1 scores with 95% confidence intervals estimated via non-parametric bootstrapping. Inter-annotator agreement was high (F1 = 0.846). GPT-4o (Advanced) achieved the best performance (F1 = 0.832), followed closely by GPT-OSS-20B (Advanced; F1 = 0.828). LR and SVM also performed strongly (F1 = 0.776 and 0.775), underscoring that while LLMs approach human-level agreement through prompt optimization, interpretable and resource-efficient models remain valuable baselines.
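The abstract reports 95% confidence intervals from non-parametric bootstrapping. A minimal percentile-bootstrap sketch for F1 is shown below, using only the standard library. Several choices here are assumptions rather than details from the paper: resampling is done per report (the authors may have resampled per patient), the number of resamples is 1,000, and the labels are toy data.

```python
import random


def f1_score(y_true, y_pred):
    """Binary F1 computed from true-positive, false-positive and false-negative counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample items with replacement and take F1 quantiles."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lo = scores[int((alpha / 2) * n_boot)]
    hi = scores[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


# Toy labels: 1 = follow-up imaging recommended, 0 = not recommended.
y_true = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1, 0, 0]

point = f1_score(y_true, y_pred)  # 4 TP, 1 FP, 1 FN
low, high = bootstrap_ci(y_true, y_pred)
```

With only ten items the interval is very wide. When several reports belong to the same patient, resampling whole patients would better respect the dependence between them.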

[8] MedPT: A Massive Medical Question Answering Dataset for Brazilian-Portuguese Speakers

Fernanda Bufon Färber,Iago Alves Brito,Julia Soares Dollis,Pedro Schindler Freire Brasil Ribeiro,Rafael Teixeira Sousa,Arlindo Rodrigues Galvão Filho

Main category: cs.CL

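TL;DR: Introduces MedPT, the first large-scale real-world medical question-answering corpus for Brazilian Portuguese, with about 384,000 patient-doctor question-answer pairs. After multi-stage curation and LLM-driven annotation it supports fine-grained intent classification, and a model fine-tuned on it reaches a 94% F1-score on a medical specialty routing task. The analysis highlights its thematic breadth and linguistic properties, advancing medical AI for Portuguese.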

Details

Motivation: LLM development in healthcare focuses on high-resource languages, and direct translation fails to capture clinical and cultural nuances such as endemic diseases, leaving other languages behind. A localized medical corpus that is culturally and clinically authentic is therefore needed.

Method: The MedPT corpus of 384,095 authentic patient-doctor question-answer pairs is built. A multi-stage curation protocol combining quantitative and qualitative analysis filters noise and enriches context. LLM-driven annotation classifies questions into seven semantic types to capture user intent.

Result: The corpus covers 3,200 topics and shows broad thematic coverage as well as the natural linguistic asymmetry of patient-doctor communication. On a 20-class medical specialty routing task, a fine-tuned 1.7B-parameter model achieves a 94% F1-score. Error analysis shows that misclassifications stem from genuine clinical ambiguity (for example, comorbid conditions), reflecting the dataset's semantic complexity.

Conclusion: MedPT is the first large-scale real-world medical dialogue dataset for Brazilian Portuguese. Its quality and semantic depth support the development of culturally-aware medical AI systems, and its release should foster more equitable, accurate, and culturally adapted medical technology for the Portuguese-speaking world.

Abstract: While large language models (LLMs) show transformative potential in healthcare, their development remains focused on high-resource languages, creating a critical barrier for others as simple translation fails to capture unique clinical and cultural nuances, such as endemic diseases. To address this, we introduce MedPT, the first large-scale, real-world corpus for Brazilian Portuguese, comprising 384,095 authentic question-answer pairs from patient-doctor interactions. The dataset underwent a meticulous multi-stage curation protocol, using a hybrid quantitative-qualitative analysis to filter noise and contextually enrich thousands of ambiguous queries. We further augmented the corpus via LLM-driven annotation, classifying questions into seven semantic types to capture user intent. Our analysis reveals its thematic breadth (3,200 topics) and unique linguistic properties, like the natural asymmetry in patient-doctor communication. To validate its utility, we benchmark a medical specialty routing task: fine-tuning a 1.7B parameter model achieves an outstanding 94% F1-score on a 20-class setup. Furthermore, our qualitative error analysis shows misclassifications are not random but reflect genuine clinical ambiguities (e.g., between comorbid conditions), proving the dataset's deep semantic richness. We publicly release MedPT to foster the development of more equitable, accurate, and culturally-aware medical technologies for the Portuguese-speaking world.

[9] ClinStructor: AI-Powered Structuring of Unstructured Clinical Texts

Karthikeyan K,Raghuveer Thirukovalluru,David Carlson

Main category: cs.CL

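TL;DR: Proposes ClinStructor, a pipeline that uses large language models to convert clinical free text into structured, task-specific question-answer pairs, improving the interpretability and generalization of machine learning models in clinical settings.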

Details

Motivation: Unstructured clinical notes introduce bias, poor generalization across systems, and poor model interpretability.

Method: Large language models (LLMs) convert clinical free text into structured, task-specific question-answer pairs, which are then used for predictive modeling.

Result: On ICU mortality prediction, the approach costs only a 2-3% drop in AUC compared with direct fine-tuning while substantially improving transparency and controllability.

Conclusion: ClinStructor lays a strong foundation for building reliable, interpretable, and generalizable machine learning models in clinical environments.

Abstract: Clinical notes contain valuable, context-rich information, but their unstructured format introduces several challenges, including unintended biases (e.g., gender or racial bias), poor generalization across clinical settings (e.g., models trained on one EHR system may perform poorly on another due to format differences), and poor interpretability. To address these issues, we present ClinStructor, a pipeline that leverages large language models (LLMs) to convert clinical free-text into structured, task-specific question-answer pairs prior to predictive modeling. Our method substantially enhances transparency and controllability and only leads to a modest reduction in predictive performance (a 2-3% drop in AUC), compared to direct fine-tuning, on the ICU mortality prediction task. ClinStructor lays a strong foundation for building reliable, interpretable, and generalizable machine learning models in clinical environments.

[10] Context-Emotion Aware Therapeutic Dialogue Generation: A Multi-component Reinforcement Learning Approach to Language Models for Mental Health Support

Eric Hua Qing Zhang,Julia Ive

Main category: cs.CL

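TL;DR: This study uses supervised fine-tuning and reinforcement learning to improve GPT-2 for therapeutic dialogue generation, strengthening its grasp of context and emotion. Experiments show that reinforcement learning markedly improves the relevance, professionalism, and emotion accuracy of generated responses.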

Details

Motivation: Pre-trained large language models lack the contextual and emotional awareness needed for appropriate therapeutic responses. COVID-19 has also increased demand for mental health services and worsened accessibility, creating a need for effective digital support tools.

Method: GPT-2 is optimized with supervised fine-tuning and reinforcement learning. The input format is restructured so that user input, contextual information, and emotional state are processed together, and a multi-component reward function aligns model outputs with professional therapist responses and annotated emotions.

Result: Compared with baseline GPT-2, the reinforcement learning model improves on BLEU, ROUGE, and METEOR. Emotion accuracy rises from 66.96% to 99.34%, and LLM-based evaluation indicates high relevance and professionalism.

Conclusion: Reinforcement learning effectively improves language models for therapeutic dialogue. The proposed method could serve as an assistive clinical tool supporting therapy, provided human clinical oversight is retained.

Abstract: Mental health illness represents a substantial global socioeconomic burden, with COVID-19 further exacerbating accessibility challenges and driving increased demand for telehealth mental health support. While large language models (LLMs) offer promising solutions through 24/7 availability and non-judgmental interactions, pre-trained models often lack the contextual and emotional awareness necessary for appropriate therapeutic responses. This paper investigated the application of supervised fine-tuning (SFT) and reinforcement learning (RL) techniques to enhance GPT-2's capacity for therapeutic dialogue generation. The methodology restructured input formats to enable simultaneous processing of contextual information and emotional states alongside user input, employing a multi-component reward function that aligned model outputs with professional therapist responses and annotated emotions. Results demonstrated improvements through reinforcement learning over baseline GPT-2 across multiple evaluation metrics: BLEU (0.0111), ROUGE-1 (0.1397), ROUGE-2 (0.0213), ROUGE-L (0.1317), and METEOR (0.0581). LLM evaluation confirmed high contextual relevance and professionalism, while reinforcement learning achieved 99.34% emotion accuracy compared to 66.96% for baseline GPT-2. These findings demonstrate reinforcement learning's effectiveness in developing therapeutic dialogue systems that can serve as valuable assistive tools for therapists while maintaining essential human clinical oversight.

[11] Additive Large Language Models for Semi-Structured Text

Karthikeyan K,Raghuveer Thirukovalluru,David Carlson

Main category: cs.CL

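TL;DR: Proposes CALM, an interpretable framework for classifying semi-structured clinical text. It decomposes each prediction into additive contributions from semantic components, giving transparent and trustworthy predictions while matching the performance of conventional LLM classifiers.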

Details

Motivation: Large language models perform well at clinical text classification, but their predictions lack interpretability, which blocks practical use in clinical and research settings. A method is needed that shows which parts of a patient's record drive a risk prediction.

Method: CALM (Classification with Additive Large Language Models) splits semi-structured input text (such as sections of an admission note or fields of an intake form) into semantic components and models the final prediction as the additive sum of each component's contribution. Interpretability is built into the forward computation, supporting faithful explanations at both the patient and population level, along with risk-curve visualizations similar to those of generalized additive models.

Result: CALM matches conventional LLM classifiers on multiple clinical classification tasks while offering clearer explanations and visualizations that help reveal clinically meaningful patterns and support quality control and model auditing.

Conclusion: CALM improves the interpretability and trustworthiness of large models in clinical settings without sacrificing performance. It applies to clinical text that is already semi-structured or from which such structure can be extracted.

Abstract: Large Language Models have advanced clinical text classification, but their opaque predictions remain a critical barrier to practical adoption in research and clinical settings where investigators and physicians need to understand which parts of a patient's record drive risk signals. To address this challenge, we introduce CALM, short for Classification with Additive Large Language Models, an interpretable framework for semi-structured text where inputs are composed of semantically meaningful components, such as sections of an admission note or question-answer fields from an intake form. CALM predicts outcomes as the additive sum of each component's contribution, making these contributions part of the forward computation itself and enabling faithful explanations at both the patient and population level. The additive structure also enables clear visualizations, such as component-level risk curves similar to those used in generalized additive models, making the learned relationships easier to inspect and communicate. Although CALM expects semi-structured inputs, many clinical documents already have this form, and similar structure can often be automatically extracted from free-text notes. CALM achieves performance comparable to conventional LLM classifiers while improving trust, supporting quality-assurance checks, and revealing clinically meaningful patterns during model development and auditing.
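The additive structure described in this entry can be illustrated with a small sketch. It is not the CALM implementation: in the framework each component's contribution would be produced by a language model reading that component's text, whereas here the contributions are hard-coded, hypothetical numbers. The component names, the bias term, and the use of a sigmoid for a binary outcome are likewise assumptions for the example.

```python
import math


def additive_predict(contributions: dict, bias: float = 0.0):
    """Sum per-component contributions into a logit, then map it to a probability."""
    logit = bias + sum(contributions.values())
    prob = 1.0 / (1.0 + math.exp(-logit))
    return logit, prob


# Hypothetical per-section contributions (in logit units) for one admission note.
contributions = {
    "chief_complaint": 0.8,
    "medications": -0.3,
    "history": 0.5,
}

logit, prob = additive_predict(contributions, bias=-1.0)

# Because the prediction is a plain sum, the explanation is exact:
# each section's share of the logit is simply its own contribution.
ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
```

Because the contributions enter the forward computation directly, inspecting them explains the prediction exactly instead of approximating it after the fact.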

[12] InData: Towards Secure Multi-Step, Tool-Based Data Analysis

Karthikeyan K,Raghuveer Thirukovalluru,Bhuwan Dhingra,David Edwin Carlson

Main category: cs.CL

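TL;DR: Proposes the InData dataset for evaluating LLM reasoning in multi-step tool use, emphasizing that in sensitive-data settings models should interact with data only indirectly, through predefined secure tools.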

Details

Motivation: Letting LLMs generate and run code directly against sensitive data poses security risks, so a safer alternative is needed.

Method: LLMs are barred from direct data access and code generation and may interact with data only through a predefined set of secure tools. The InData dataset is built to evaluate their multi-step tool-based reasoning.

Result: Experiments on 15 open-source LLMs show that large models do well on simple tasks (for example, gpt-oss-120b reaches 97.3% accuracy on Easy tasks) but drop sharply on Hard tasks (69.6%), exposing the limits of current models in multi-step tool-based reasoning.

Conclusion: InData offers a practical way to develop and evaluate LLMs with stronger multi-step tool-use capabilities, advancing secure data analysis agents.

Abstract: Large language model agents for data analysis typically generate and execute code directly on databases. However, when applied to sensitive data, this approach poses significant security risks. To address this issue, we propose a security-motivated alternative: restrict LLMs from direct code generation and data access, and require them to interact with data exclusively through a predefined set of secure, verified tools. Although recent tool-use benchmarks exist, they primarily target tool selection and simple execution rather than the compositional, multi-step reasoning needed for complex data analysis. To reduce this gap, we introduce Indirect Data Engagement (InData), a dataset designed to assess LLMs' multi-step tool-based reasoning ability. InData includes data analysis questions at three difficulty levels (Easy, Medium, and Hard), capturing increasing reasoning complexity. We benchmark 15 open-source LLMs on InData and find that while large models (e.g., gpt-oss-120b) achieve high accuracy on Easy tasks (97.3%), performance drops sharply on Hard tasks (69.6%). These results show that current LLMs still lack robust multi-step tool-based reasoning ability. With InData, we take a step toward enabling the development and evaluation of LLMs with stronger multi-step tool-use capabilities. We will publicly release the dataset and code.
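The restriction described in this entry, where the model may only name tools from a fixed allow-list and never touches the data or emits code, might look roughly like the sketch below. The tool names, the plan format (a list of `(tool_name, kwargs)` steps), and the toy table are all hypothetical; the dataset's actual tool set is not described in the abstract.

```python
from statistics import mean

# Toy stand-in for a sensitive table that the model never sees directly.
TABLE = [
    {"region": "north", "sales": 120},
    {"region": "south", "sales": 80},
    {"region": "north", "sales": 100},
]


def filter_rows(rows, column, value):
    """Tool: keep rows whose `column` equals `value`."""
    return [r for r in rows if r[column] == value]


def column_mean(rows, column):
    """Tool: average of a numeric column."""
    return mean(r[column] for r in rows)


# The only operations a plan is allowed to invoke.
TOOLS = {"filter_rows": filter_rows, "column_mean": column_mean}


def run_plan(plan, table):
    """Execute (tool_name, kwargs) steps in order, piping each result into the next step."""
    result = table
    for name, kwargs in plan:
        if name not in TOOLS:
            raise PermissionError(f"tool not allowed: {name}")
        result = TOOLS[name](result, **kwargs)
    return result


# A two-step plan such as a model might emit for
# "What is the average sales figure in the north region?"
plan = [
    ("filter_rows", {"column": "region", "value": "north"}),
    ("column_mean", {"column": "sales"}),
]
answer = run_plan(plan, TABLE)
```

Even this toy question needs two composed steps, which is the kind of multi-step composition the benchmark's harder levels are meant to stress.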

[13] Improving LLM's Attachment to External Knowledge In Dialogue Generation Tasks Through Entity Anonymization

Hadi Sheikhi,Chenyang Huang,Osmar R. Zaïane

Main category: cs.CL

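TL;DR: Proposes the LLM-KAT evaluation procedure and an entity anonymization technique to improve how well large language models use external knowledge in knowledge graph-based dialogue generation.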

Details

Motivation: In knowledge graph-based dialogue generation, large language models tend to rely on internal knowledge and become detached from the provided knowledge graph.

Method: LLM-KAT is introduced to measure knowledge attachment, and an entity anonymization technique is proposed to encourage models to use external knowledge.

Result: Experiments on the OpenDialKG dataset show that the approach improves LLMs' attachment to external knowledge.

Conclusion: Entity anonymization, evaluated with LLM-KAT, helps large language models make better use of external knowledge graphs in dialogue generation.

Abstract: Knowledge graph-based dialogue generation (KG-DG) is a challenging task requiring models to effectively incorporate external knowledge into conversational responses. While large language models (LLMs) have achieved impressive results across various NLP tasks, their ability to utilize external knowledge in KG-DG remains under-explored. We observe that LLMs often rely on internal knowledge, leading to detachment from provided knowledge graphs, even when they are given a flawlessly retrieved knowledge graph. First, we introduce LLM-KAT, an evaluation procedure for measuring knowledge attachment in generated responses. Second, we propose a simple yet effective entity anonymization technique to encourage LLMs to better leverage external knowledge. Experiments on the OpenDialKG dataset demonstrate that our approach improves LLMs' attachment to external knowledge.
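A rough sketch of what entity anonymization could look like is given below. The abstract does not specify the scheme, so everything here is an assumption: entities in the knowledge-graph triples and the user utterance are replaced with placeholders such as `ENTITY_1` before prompting, so the model cannot fall back on what it already knows about a named entity, and the placeholders are mapped back in the generated response.

```python
def anonymize(triples, utterance):
    """Replace each distinct entity with a stable placeholder in triples and utterance."""
    mapping = {}

    def placeholder(entity):
        if entity not in mapping:
            mapping[entity] = f"ENTITY_{len(mapping) + 1}"
        return mapping[entity]

    anon_triples = [(placeholder(h), rel, placeholder(t)) for h, rel, t in triples]

    anon_utterance = utterance
    # Replace longer names first so a short name inside a longer one is not clobbered.
    for entity in sorted(mapping, key=len, reverse=True):
        anon_utterance = anon_utterance.replace(entity, mapping[entity])
    return anon_triples, anon_utterance, mapping


def deanonymize(text, mapping):
    """Map placeholders in a generated response back to the original entity names."""
    # Longer placeholders first, so ENTITY_10 is handled before ENTITY_1.
    for entity, ph in sorted(mapping.items(), key=lambda kv: len(kv[1]), reverse=True):
        text = text.replace(ph, entity)
    return text


triples = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Christopher Nolan", "directed", "Interstellar"),
]
anon_triples, anon_utterance, mapping = anonymize(triples, "Who directed Inception?")

# A response a model might produce over the anonymized graph, restored afterwards.
restored = deanonymize("ENTITY_2 directed ENTITY_1.", mapping)
```

Plain string replacement is enough for a sketch; a real pipeline would need proper entity linking to handle aliases and partial mentions.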

[14] On the Entropy Calibration of Language Models

Steven Cao,Gregory Valiant,Percy Liang

Main category: cs.CL

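TL;DR: This paper studies entropy calibration in language model generation. As generations grow longer, entropy rises and text quality falls, and scaling up the model does not substantially help. Theory and experiments show that current truncation methods improve quality at the cost of diversity, and the authors prove that calibrating entropy without sacrificing log loss is possible in theory.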

Details

Motivation: Language models accumulate entropy and lose quality over long generations (error accumulation), making outputs unreliable. Existing fixes such as truncating the distribution sacrifice diversity. Two questions follow: does scaling naturally alleviate the problem, and can a model be calibrated without a trade-off?

Method: First, a simplified theoretical setting is used to analyze how the power-law exponent of the data distribution affects the scaling of miscalibration with dataset size. Next, entropy calibration is measured empirically on language models from 0.5B to 70B parameters. Finally, a theoretical calibration scheme is proposed that assumes a black-box model for predicting the future entropy of text.

Result: Theory shows that when the power-law exponent of the data distribution is close to 1, miscalibration improves very slowly with scale. Experiments show that larger and smaller models accumulate error at similar rates, so the scaling effect is weak. Although truncation remains widely used, the paper proves that entropy can in principle be calibrated without increasing log loss.

Conclusion: Entropy miscalibration is unlikely to be solved by scale alone, and current truncation strategies lose diversity. Future work should explore mechanisms that predict and control future entropy to achieve calibration without that loss.

Abstract: We study the problem of entropy calibration, which asks whether a language model's entropy over generations matches its log loss on human text. Past work found that models are miscalibrated, with entropy per step increasing (and text quality decreasing) as generations grow longer. This error accumulation is a fundamental problem in autoregressive models, and the standard solution is to truncate the distribution, which improves text quality at the cost of diversity. In this paper, we ask: is miscalibration likely to improve with scale, and is it theoretically possible to calibrate without tradeoffs? To build intuition, we first study a simplified theoretical setting to characterize the scaling behavior of miscalibration with respect to dataset size. We find that the scaling behavior depends on the power law exponent of the data distribution; in particular, for a power law exponent close to 1, the scaling exponent is close to 0, meaning that miscalibration improves very slowly with scale. Next, we measure miscalibration empirically in language models ranging from 0.5B to 70B parameters. We find that the observed scaling behavior is similar to what is predicted by the simplified setting: our fitted scaling exponents for text are close to 0, meaning that larger models accumulate error at a similar rate as smaller ones. This scaling (or, lack thereof) provides one explanation for why we sample from larger models with similar amounts of truncation as smaller models, even though the larger models are of higher quality. However, truncation is not a satisfying solution because it comes at the cost of increased log loss. In theory, is it even possible to reduce entropy while preserving log loss? We prove that it is possible, if we assume access to a black box which can fit models to predict the future entropy of text.
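The trade-off the abstract describes (truncation lowers entropy but raises log loss) can be seen on a single next-token distribution. The sketch below uses top-p (nucleus) truncation as one common form of truncation, which is an assumption since the abstract does not name a specific method. The four-token distribution is a toy example.

```python
import math


def entropy(p):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)


def log_loss(p, token):
    """Negative log-probability assigned to the observed token."""
    return -math.log(p[token]) if p[token] > 0 else math.inf


def top_p_truncate(p, top_p):
    """Keep the most probable tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(p)), key=lambda i: p[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += p[i]
        if mass >= top_p:
            break
    out = [0.0] * len(p)
    for i in kept:
        out[i] = p[i] / mass
    return out


p = [0.5, 0.3, 0.15, 0.05]     # toy next-token distribution
q = top_p_truncate(p, 0.9)      # keeps tokens 0, 1, 2 and drops token 3

h_before, h_after = entropy(p), entropy(q)
loss_on_rare_token = log_loss(q, 3)  # infinite: the truncated model gives token 3 zero mass
```

Entropy drops after truncation, but any human text that uses the dropped token now incurs unbounded log loss. This is the cost the paper's theoretical result shows can in principle be avoided.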

[15] A Reasoning Paradigm for Named Entity Recognition

Hui Huang,Yanping Chen,Ruizhang Huang,Chuan Lin,Yongbin Qin

Main category: cs.CL

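TL;DR: Proposes ReasoningNER, a reasoning framework for named entity recognition (NER). Through three stages (chain-of-thought (CoT) generation, CoT tuning, and reasoning enhancement) it turns NER from implicit pattern matching into an explicit, verifiable reasoning process, and it clearly outperforms GPT-4 in zero-shot settings.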

Details

Motivation: Existing generative LLMs rely on implicit semantic pattern matching for NER and lack an explicit reasoning mechanism, leading to poor generalization and unstable performance in zero-shot and low-resource scenarios.

Method: A three-stage reasoning framework. First, an NER-oriented chain-of-thought (CoT) dataset containing task-relevant reasoning chains is generated. Second, the model is fine-tuned on it so that it produces a coherent rationale before outputting entities. Third, the reasoning process is optimized with a comprehensive reward signal to yield explicit, verifiable extractions.

Result: ReasoningNER reaches state-of-the-art performance in zero-shot settings, with an F1 score 12.3 percentage points higher than GPT-4, and generalizes well.

Conclusion: The work brings explicit reasoning into NER, improving interpretability and robustness in low-resource settings, and points to a new direction for reasoning-oriented information extraction.

Abstract: Generative LLMs typically improve Named Entity Recognition (NER) performance through instruction tuning. They excel at generating entities by semantic pattern matching but lack an explicit, verifiable reasoning mechanism. This "cognitive shortcutting" leads to suboptimal performance and brittle generalization, especially in zero-shot and low-resource scenarios where reasoning from limited contextual cues is crucial. To address this issue, a reasoning framework is proposed for NER, which shifts the extraction paradigm from implicit pattern matching to explicit reasoning. This framework consists of three stages: Chain of Thought (CoT) generation, CoT tuning, and reasoning enhancement. First, a dataset annotated with NER-oriented CoTs is generated, which contain task-relevant reasoning chains. Then, they are used to tune the NER model to generate coherent rationales before deriving the final answer. Finally, a reasoning enhancement stage is implemented to optimize the reasoning process using a comprehensive reward signal. This stage ensures explicit and verifiable extractions. Experiments show that ReasoningNER demonstrates impressive cognitive ability in the NER task, achieving competitive performance. In zero-shot settings, it achieves state-of-the-art (SOTA) performance, outperforming GPT-4 by 12.3 percentage points on the F1 score. Analytical results also demonstrate its great potential to advance research in reasoning-oriented information extraction. Our codes are available at https://github.com/HuiResearch/ReasoningIE.

[16] Critical or Compliant? The Double-Edged Sword of Reasoning in Chain-of-Thought Explanations

Eunkyu Park,Wesley Hanwen Deng,Vasudha Varadarajan,Mingxi Yan,Gunhee Kim,Maarten Sap,Motahhare Eslami

Main category: cs.CL

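TL;DR: This study examines the double-edged role of chain-of-thought (CoT) explanations in multimodal moral scenarios. Users often trust a model because they agree with its outcome even when the reasoning is flawed, and a confident tone suppresses error detection and reinforces reliance.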

Details

Motivation: Explanations are meant to improve transparency but can trigger confirmation bias, leading users to trust flawed reasoning. The study therefore asks how CoT explanations affect user judgment and trust.

Method: Reasoning chains are systematically perturbed and delivery tone is manipulated to analyze reasoning errors in vision language models (VLMs) and their effect on user trust and the ability to detect errors.

Result: (1) Users often equate trust with outcome agreement and keep relying on the model even when its reasoning is flawed. (2) A confident tone suppresses error detection while maintaining reliance, showing that delivery style can override correctness.

Conclusion: CoT explanations can both clarify and mislead. NLP systems should provide explanations that encourage scrutiny and critical thinking rather than blind trust.

Abstract: Explanations are often promoted as tools for transparency, but they can also foster confirmation bias; users may assume reasoning is correct whenever outputs appear acceptable. We study this double-edged role of Chain-of-Thought (CoT) explanations in multimodal moral scenarios by systematically perturbing reasoning chains and manipulating delivery tones. Specifically, we analyze reasoning errors in vision language models (VLMs) and how they impact user trust and the ability to detect errors. Our findings reveal two key effects: (1) users often equate trust with outcome agreement, sustaining reliance even when reasoning is flawed, and (2) the confident tone suppresses error detection while maintaining reliance, showing that delivery styles can override correctness. These results highlight how CoT explanations can simultaneously clarify and mislead, underscoring the need for NLP systems to provide explanations that encourage scrutiny and critical thinking rather than blind trust. All code will be released publicly.

[17] CURE: Cultural Understanding and Reasoning Evaluation - A Framework for "Thick" Culture Alignment Evaluation in LLMs

Truong Vo,Sanmi Koyejo

Main category: cs.CL

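TL;DR: Proposes a new "thick" evaluation approach for measuring the cultural understanding and reasoning of LLMs in culturally diverse settings; compared with traditional "thin" evaluation, it is more stable and more interpretable.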

Details

Motivation: Existing evaluations of LLM cultural competence are mostly limited to de-contextualized correctness or forced-choice judgments and do not examine cultural reasoning in realistic situations. Method: A set of benchmarks built around realistic situational contexts requires models to perform culturally grounded reasoning; in addition to Exact Match, four new metrics (Coverage, Specificity, Connotation, and Coherence) are introduced for multi-dimensional evaluation. Result: Experiments show that traditional thin evaluation overestimates cultural competence and yields highly variable results, whereas thick evaluation exposes differences in reasoning depth, reduces variance, and gives a more stable evaluation signal. Conclusion: Thick evaluation measures cultural understanding more accurately and comprehensively, providing a more reliable evaluation framework for deploying models in multicultural settings. Abstract: Large language models (LLMs) are increasingly deployed in culturally diverse environments, yet existing evaluations of cultural competence remain limited. Existing methods focus on de-contextualized correctness or forced-choice judgments, overlooking the need for cultural understanding and reasoning required for appropriate responses. To address this gap, we introduce a set of benchmarks that, instead of directly probing abstract norms or isolated statements, present models with realistic situational contexts that require culturally grounded reasoning. In addition to the standard Exact Match metric, we introduce four complementary metrics (Coverage, Specificity, Connotation, and Coherence) to capture different dimensions of a model's response quality. Empirical analysis across frontier models reveals that thin evaluation systematically overestimates cultural competence and produces unstable assessments with high variance. In contrast, thick evaluation exposes differences in reasoning depth, reduces variance, and provides more stable, interpretable signals of cultural understanding.

[18] Exploring Parameter-Efficient Fine-Tuning and Backtranslation for the WMT 25 General Translation Task

Felipe Fujita,Hideyuki Takada

Main category: cs.CL

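TL;DR: Combining backtranslation and fine-tuning substantially improves neural machine translation on a small Japanese corpus, raising the COMET score from 0.460 to 0.597.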

Details

Motivation: To explore how to improve English→Japanese neural machine translation quality under low-resource conditions, particularly when only a small amount of parallel data is available. Method: Monolingual Japanese corpora are first used to generate synthetic data through backtranslation (BT) to augment the training set; the model is then fine-tuned (FT) on a genuine small parallel corpus of news and literary text; finally, the two techniques are combined. Result: Backtranslation alone raises COMET from 0.460 to 0.468; fine-tuning alone reaches 0.589; combining the two raises it further to 0.597. Conclusion: Using backtranslation and fine-tuning together significantly improves translation quality for low-resource language pairs and is a lightweight, efficient approach. Abstract: In this paper, we explore the effectiveness of combining fine-tuning and backtranslation on a small Japanese corpus for neural machine translation. Starting from a baseline English→Japanese model (COMET = 0.460), we first apply backtranslation (BT) using synthetic data generated from monolingual Japanese corpora, yielding a modest increase (COMET = 0.468). Next, we fine-tune (FT) the model on a genuine small parallel dataset drawn from diverse Japanese news and literary corpora, achieving a substantial jump to COMET = 0.589 when using Mistral 7B. Finally, we integrate both backtranslation and fine-tuning (first augmenting the small dataset with BT-generated examples, then adapting via FT), which further boosts performance to COMET = 0.597. These results demonstrate that, even with limited training data, the synergistic use of backtranslation and targeted fine-tuning on Japanese corpora can significantly enhance translation quality, outperforming each technique in isolation. This approach offers a lightweight yet powerful strategy for improving low-resource language pairs.
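
To make the relative contribution of each step easier to compare, the short sketch below computes the COMET gain over the baseline for each configuration. The four scores are the ones reported in the abstract; the dictionary keys and the helper name `gains_over_baseline` are illustrative choices, not names from the paper.

```python
# COMET scores as reported in the abstract above.
scores = {
    "baseline": 0.460,
    "backtranslation only": 0.468,
    "fine-tuning only": 0.589,
    "backtranslation + fine-tuning": 0.597,
}


def gains_over_baseline(scores, baseline_key="baseline"):
    """Return the absolute COMET gain of every configuration over the baseline."""
    base = scores[baseline_key]
    return {
        name: round(value - base, 3)
        for name, value in scores.items()
        if name != baseline_key
    }


gains = gains_over_baseline(scores)
for name, gain in gains.items():
    print(f"{name}: +{gain:.3f} COMET")
```

On these numbers, fine-tuning accounts for most of the improvement (+0.129), while backtranslation adds +0.008 whether it is applied alone or on top of fine-tuning.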

[19] LLMLagBench: Identifying Temporal Training Boundaries in Large Language Models

Piotr Pęzik,Konrad Kaczyński,Maria Szymańska,Filip Żarnecki,Zuzanna Deckert,Jakub Kwiatkowski,Wojciech Janowski

Main category: cs.CL

TL;DR: Proposes LLMLagBench, a benchmark for assessing the temporal boundaries of LLM training data and the freshness of model knowledge.

Details

Motivation: An LLM's knowledge is bounded by the temporal cutoff of its training data, so it may rely on outdated information and reason less accurately. Method: The LLMLagBench benchmark infers the earliest probable temporal boundary of a model's training data by evaluating its knowledge of recent events, and is applied to a large set of LLMs. Result: The benchmark identifies model training cutoffs effectively; manual validation and comparison with publicly released information indicate that it is reliable. Conclusion: LLMLagBench offers a systematic way to assess LLM knowledge freshness and helps avoid reasoning errors caused by temporal lag. Abstract: Large Language Models (LLMs) are pretrained on textual data up to a specific temporal cutoff. This creates a strict knowledge boundary beyond which models cannot provide accurate information without querying external sources. More subtly, when this limitation is unknown or ignored, LLMs may inadvertently blend outdated time-sensitive information with general knowledge during reasoning tasks, potentially compromising response accuracy. We introduce LLMLagBench, an LLM freshness benchmark, as a systematic approach for identifying the earliest probable temporal boundaries of an LLM's training data by evaluating its knowledge of recent events. We then apply this benchmark to evaluate a large set of LLMs, including models with both explicitly declared and undeclared training cutoffs. The reliability of the benchmark is assessed by manual validation and comparison with publicly released information about LLM pretraining.
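
One way to picture how a cutoff could be read off from event-knowledge scores is a simple change-point search over per-period accuracy. The sketch below is only an illustration under stated assumptions: the abstract does not describe the estimation procedure, so the function `estimate_cutoff`, the split criterion, and the sample numbers are all hypothetical.

```python
def estimate_cutoff(periods, accuracy):
    """Return the last period the model still appears to know.

    Hypothetical change-point rule (not the paper's method): pick the split
    that maximizes mean accuracy before the split minus mean accuracy after.
    """
    if len(periods) != len(accuracy) or len(periods) < 2:
        raise ValueError("need two or more periods with matching accuracies")
    best_k, best_gap = 1, float("-inf")
    for k in range(1, len(accuracy)):
        before = sum(accuracy[:k]) / k
        after = sum(accuracy[k:]) / (len(accuracy) - k)
        if before - after > best_gap:
            best_k, best_gap = k, before - after
    return periods[best_k - 1]


# Made-up accuracies on questions about events from each month.
periods = ["2024-01", "2024-02", "2024-03", "2024-04", "2024-05", "2024-06"]
accuracy = [0.82, 0.79, 0.80, 0.31, 0.12, 0.10]
print(estimate_cutoff(periods, accuracy))  # -> 2024-03
```

A real estimate would also need to account for noise and for the gradual thinning of web coverage just before a cutoff, which this toy rule ignores.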

[20] PRISM of Opinions: A Persona-Reasoned Multimodal Framework for User-centric Conversational Stance Detection

Bingbing Wang,Zhixin Bai,Zhengda Jin,Zihan Wang,Xintong Song,Jingjie Lin,Sixuan Li,Jing Li,Ruifeng Xu

Main category: cs.CL

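TL;DR: This paper presents the U-MStance dataset and the PRISM model to address pseudo-multimodality and user homogeneity in multimodal conversational stance detection.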

Details

Motivation: Existing studies are limited by pseudo-multimodality and the assumption of user homogeneity, so they fail to reflect how users actually express stance in multimodal social interactions. Method: The authors build U-MStance, the first user-centric MCSD dataset, and propose the PRISM model, which performs stance detection through longitudinal user persona extraction, Chain-of-Thought-based cross-modal alignment, and a mutual task reinforcement mechanism. Result: Experiments on U-MStance show that PRISM significantly outperforms strong baselines, confirming the effectiveness of user-centric, context-aware multimodal reasoning. Conclusion: Modeling user personas and context-driven cross-modal understanding effectively improves multimodal conversational stance detection in realistic scenarios. Abstract: The rapid proliferation of multimodal social media content has driven research in Multimodal Conversational Stance Detection (MCSD), which aims to interpret users' attitudes toward specific targets within complex discussions. However, existing studies remain limited by: **1) pseudo-multimodality**, where visual cues appear only in source posts while comments are treated as text-only, misaligning with real-world multimodal interactions; and **2) user homogeneity**, where diverse users are treated uniformly, neglecting personal traits that shape stance expression. To address these issues, we introduce **U-MStance**, the first user-centric MCSD dataset, containing over 40k annotated comments across six real-world targets. We further propose **PRISM**, a **P**ersona-**R**easoned mult**I**modal **S**tance **M**odel for MCSD. PRISM first derives longitudinal user personas from historical posts and comments to capture individual traits, then aligns textual and visual cues within conversational context via Chain-of-Thought to bridge semantic and pragmatic gaps across modalities. Finally, a mutual task reinforcement mechanism is employed to jointly optimize stance detection and stance-aware response generation for bidirectional knowledge transfer. Experiments on U-MStance demonstrate that PRISM yields significant gains over strong baselines, underscoring the effectiveness of user-centric and context-grounded multimodal reasoning for realistic stance understanding.

[21] AI-Salesman: Towards Reliable Large Language Model Driven Telemarketing

Qingyu Zhang,Chunlei Xin,Xuanang Chen,Yaojie Lu,Hongyu Lin,Xianpei Han,Le Sun,Qing Ye,Qianlong Xie,Xingxing Wang

Main category: cs.CL

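TL;DR: This paper proposes the AI-Salesman framework to address strategic brittleness and factual hallucination in goal-driven persuasive dialogue. It builds the real-world dataset TeleSalesCorpus and uses a dual-stage architecture (Bayesian-supervised reinforcement learning and a Dynamic Outline-Guided Agent), significantly outperforming baseline models in both automatic metrics and human evaluation.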

Details

Motivation: Existing LLMs show brittle strategies and factual hallucination in goal-driven persuasive dialogue and lack task-specific real-world data, which limits their effectiveness in scenarios such as telemarketing. Method: The authors first construct and release TeleSalesCorpus, a real-world telesales dialogue dataset. They then propose the dual-stage AI-Salesman framework: in the training stage, Bayesian-supervised reinforcement learning learns robust strategies from noisy dialogues; in the inference stage, the Dynamic Outline-Guided Agent (DOGA) draws on a pre-built script library to provide turn-by-turn strategic guidance. They also design a comprehensive evaluation framework that combines fine-grained sales-skill metrics with LLM-as-a-Judge. Result: AI-Salesman significantly outperforms baseline models in both automatic metrics and comprehensive human evaluation, showing its effectiveness in complex persuasive scenarios. Conclusion: Through dual-stage learning and dynamic strategic guidance, AI-Salesman improves the strategic stability and factual accuracy of persuasive dialogue systems, offering a workable solution for real-world goal-driven dialogue. Abstract: Goal-driven persuasive dialogue, exemplified by applications like telemarketing, requires sophisticated multi-turn planning and strict factual faithfulness, which remains a significant challenge for even state-of-the-art Large Language Models (LLMs). A lack of task-specific data often limits previous works, and direct LLM application suffers from strategic brittleness and factual hallucination. In this paper, we first construct and release TeleSalesCorpus, the first real-world-grounded dialogue dataset for this domain. We then propose AI-Salesman, a novel framework featuring a dual-stage architecture. For the training stage, we design a Bayesian-supervised reinforcement learning algorithm that learns robust sales strategies from noisy dialogues. For the inference stage, we introduce the Dynamic Outline-Guided Agent (DOGA), which leverages a pre-built script library to provide dynamic, turn-by-turn strategic guidance. Moreover, we design a comprehensive evaluation framework that combines fine-grained metrics for key sales skills with the LLM-as-a-Judge paradigm. Experimental results demonstrate that our proposed AI-Salesman significantly outperforms baseline models in both automatic metrics and comprehensive human evaluations, showcasing its effectiveness in complex persuasive scenarios.

[22] Seeing is Believing: Rich-Context Hallucination Detection for MLLMs via Backward Visual Grounding

Pinxue Guo,Chongruo Wu,Xinyu Zhou,Lingyi Hong,Zhaoyu Chen,Jinglun Li,Kaixun Jiang,Sen-ching Samson Cheung,Wei Zhang,Wenqiang Zhang

Main category: cs.CL

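TL;DR: This paper proposes VBackChecker, a reference-free hallucination detection framework for multimodal LLMs that uses a pixel-level grounding language model to verify that generated content is consistent with the visual input. It achieves state-of-the-art performance on R^2-HalBench, a newly built benchmark of real-world, rich-context descriptions.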

Details

Motivation: Multimodal Large Language Models (MLLMs) suffer from severe hallucination, which undermines their reliability in practice, and existing detection methods struggle with rich contexts and lack interpretability. Method: Guided by the principle "Seeing is Believing", the VBackChecker framework uses a pixel-level Grounding LLM with reasoning and referring segmentation capabilities to check MLLM outputs directly against the visual input. An R-Instruct data generation pipeline supports instruction tuning, and a new benchmark, R^2-HalBench, collects real-world, rich-context descriptions. Result: On R^2-HalBench, VBackChecker outperforms prior complex hallucination detection frameworks and rivals GPT-4o; it also improves by more than 10% on the pixel-level grounding task. Conclusion: As a reference-free and interpretable hallucination detector, VBackChecker performs well in rich-context scenarios and improves the accuracy and practicality of MLLM hallucination detection. Abstract: Multimodal Large Language Models (MLLMs) have unlocked powerful cross-modal capabilities, but still significantly suffer from hallucinations. As such, accurate detection of hallucinations in MLLMs is imperative for ensuring their reliability in practical applications. To this end, guided by the principle of "Seeing is Believing", we introduce VBackChecker, a novel reference-free hallucination detection framework that verifies the consistency of MLLM-generated responses with visual inputs, by leveraging a pixel-level Grounding LLM equipped with reasoning and referring segmentation capabilities. This reference-free framework not only effectively handles rich-context scenarios, but also offers interpretability. To facilitate this, an innovative pipeline is accordingly designed for generating instruction-tuning data (R-Instruct), featuring rich-context descriptions, grounding masks, and hard negative samples. We further establish R^2-HalBench, a new hallucination benchmark for MLLMs, which, unlike previous benchmarks, encompasses real-world, rich-context descriptions from 18 MLLMs with high-quality annotations, spanning diverse object-, attribute-, and relationship-level details. VBackChecker outperforms prior complex frameworks and achieves state-of-the-art performance on R^2-HalBench, even rivaling GPT-4o's capabilities in hallucination detection. It also surpasses prior methods in the pixel-level grounding task, achieving over a 10% improvement. All codes, data, and models are available at https://github.com/PinxueGuo/VBackChecker.

[23] CriticSearch: Fine-Grained Credit Assignment for Search Agents via a Retrospective Critic

Yaocheng Zhang,Haohuan Huang,Zijun Song,Yuanheng Zhu,Qichao Zhang,Zijie Zhao,Dongbin Zhao

Main category: cs.CL

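TL;DR: Proposes CriticSearch, a fine-grained credit-assignment framework that supplies dense, turn-level feedback through a retrospective critic mechanism, improving the training efficiency and performance of search agents on multi-hop reasoning tasks.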

Details

Motivation: Existing reinforcement-learning-based search agents suffer from sparse rewards, which cause inefficient exploration and unstable training, so a more effective feedback mechanism is needed. Method: A frozen, asymmetric critique LLM retrospectively evaluates every turn during training, using the full trajectory and the gold answer, and produces dense, stable reward signals that guide policy optimization. Result: Experiments on several multi-hop reasoning benchmarks show that CriticSearch converges faster, trains more stably, and performs better than baseline methods. Conclusion: Through its fine-grained retrospective critic mechanism, CriticSearch effectively addresses the sparse-reward problem and improves the training efficiency and effectiveness of search-augmented reasoning systems. Abstract: Tool-Integrated Reasoning (TIR) with search engines enables large language models to iteratively retrieve up-to-date external knowledge, enhancing adaptability and generalization in complex question-answering tasks. However, existing search agent pipelines typically depend on reinforcement learning based optimization, which often suffers from sparse outcome rewards, leading to inefficient exploration and unstable training. We introduce CriticSearch, a fine-grained credit-assignment framework that supplies dense, turn-level feedback via a retrospective critic mechanism. During training, a frozen, asymmetric critique LLM retrospectively evaluates each turn using privileged information from the full trajectory and gold answers, converting these assessments into stable, dense rewards that guide policy improvement. Experimental results across diverse multi-hop reasoning benchmarks demonstrate that CriticSearch consistently outperforms existing baselines, achieving faster convergence, improved training stability, and higher performance.
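
The difference between a sparse outcome reward and dense turn-level rewards can be illustrated with a small sketch. Everything here is an assumption made for illustration: the abstract does not give the reward formula, so the function `dense_turn_rewards`, the `turn_weight` parameter, and the keyword-matching `toy_critic` (a stand-in for the frozen critique LLM) are hypothetical.

```python
def dense_turn_rewards(turns, gold_answer, final_correct, critic, turn_weight=0.5):
    """Assign one reward per turn instead of a single end-of-episode reward.

    `critic(turns, index, gold_answer)` sees the whole trajectory and the gold
    answer (privileged information) and returns True if the turn was helpful.
    """
    if not turns:
        raise ValueError("trajectory must contain one or more turns")
    rewards = []
    for i in range(len(turns)):
        helpful = critic(turns, i, gold_answer)
        rewards.append(turn_weight if helpful else -turn_weight)
    # The sparse outcome reward is still added, but only on the last turn.
    rewards[-1] += 1.0 if final_correct else 0.0
    return rewards


def toy_critic(turns, index, gold_answer):
    """Stand-in judge: a turn counts as helpful if it mentions a relevant term."""
    relevant = ("France", gold_answer)
    return any(term in turns[index] for term in relevant)


turns = ["search: capital of France", "search: weather today", "answer: Paris"]
print(dense_turn_rewards(turns, "Paris", final_correct=True, critic=toy_critic))
# -> [0.5, -0.5, 1.5]
```

With only the outcome reward, all three turns would share a single signal; the per-turn values let the off-topic second search be penalized on its own.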

[24] MME-RAG: Multi-Manager-Expert Retrieval-Augmented Generation for Fine-Grained Entity Recognition in Task-Oriented Dialogues

Liang Xue,Haoyu Liu,Yajun Tian,Xinyu Zhong,Yang Liu

Main category: cs.CL

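TL;DR: Proposes the MME-RAG framework, which uses a multi-manager-expert structure and retrieval augmentation to achieve domain-adaptive, high-precision fine-grained entity recognition.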

Details

Motivation: Current LLMs face challenges in domain adaptation and retrieval controllability when performing fine-grained entity recognition in task-oriented dialogues. Method: Entity recognition is decomposed into two stages, type-level judgment by lightweight managers and span-level extraction by specialized experts; each expert is paired with a KeyInfo retriever that injects semantically aligned few-shot exemplars at inference time. Result: MME-RAG outperforms recent baselines on CrossNER, MIT-Movie, MIT-Restaurant, and a newly built multi-domain customer-service dataset, and ablation studies confirm the contribution of the hierarchical decomposition and KeyInfo retrieval. Conclusion: MME-RAG is a scalable, interpretable solution that improves the robustness and generalization of entity recognition in cross-domain dialogue understanding. Abstract: Fine-grained entity recognition is crucial for reasoning and decision-making in task-oriented dialogues, yet current large language models (LLMs) continue to face challenges in domain adaptation and retrieval controllability. We introduce MME-RAG, a Multi-Manager-Expert Retrieval-Augmented Generation framework that decomposes entity recognition into two coordinated stages: type-level judgment by lightweight managers and span-level extraction by specialized experts. Each expert is supported by a KeyInfo retriever that injects semantically aligned, few-shot exemplars during inference, enabling precise and domain-adaptive extraction without additional training. Experiments on CrossNER, MIT-Movie, MIT-Restaurant, and our newly constructed multi-domain customer-service dataset demonstrate that MME-RAG performs better than recent baselines in most domains. Ablation studies further show that both the hierarchical decomposition and KeyInfo-guided retrieval are key drivers of robustness and cross-domain generalization, establishing MME-RAG as a scalable and interpretable solution for adaptive dialogue understanding.

[25] Consistency Is the Key: Detecting Hallucinations in LLM Generated Text By Checking Inconsistencies About Key Facts

Raavi Gupta,Pranav Hari Panicker,Sumit Bhatia,Ganesh Ramakrishnan

Main category: cs.CL

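TL;DR: CONFACTCHECK is an efficient hallucination detection method that needs no external knowledge base. It identifies incorrect facts in generated text by checking the consistency of responses to factual probes within a single LLM and across multiple LLMs, achieving higher accuracy than existing methods while using fewer resources.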

Details

Motivation: Large language models (LLMs) often hallucinate content that is factually wrong, which poses serious risks in critical domains such as healthcare and finance; under restricted API access, existing detection methods mostly rely on multiple calls, increasing latency and cost. Method: CONFACTCHECK builds on the intuition that responses to factual probes within the generated text should be consistent within one LLM and across different LLMs. It detects hallucinations by analyzing the consistency of model outputs, without an external knowledge base and with fewer API calls. Result: Experiments on multiple datasets covering factual and open-ended generation show that the method outperforms existing baselines with lower resource use and higher hallucination detection accuracy. Conclusion: CONFACTCHECK offers an efficient, low-cost, and accurate hallucination detection approach for API-restricted settings without access to model weights, and has practical potential. Abstract: Large language models (LLMs), despite their remarkable text generation capabilities, often hallucinate and generate text that is factually incorrect and not grounded in real-world knowledge. This poses serious risks in domains like healthcare, finance, and customer support. A typical way to use LLMs is via the APIs provided by LLM vendors where there is no access to model weights or options to fine-tune the model. Existing methods to detect hallucinations in such settings where the model access is restricted or constrained by resources typically require making multiple LLM API calls, increasing latency and API cost. We introduce CONFACTCHECK, an efficient hallucination detection approach that does not leverage any external knowledge base and works on the simple intuition that responses to factual probes within the generated text should be consistent within a single LLM and across different LLMs. Rigorous empirical evaluation on multiple datasets that cover both the generation of factual texts and the open generation shows that CONFACTCHECK can detect hallucinated facts efficiently using fewer resources and achieves higher accuracy scores compared to existing baselines that operate under similar conditions. Our code is available here.
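
The consistency intuition can be sketched as follows: ask a probe question about a fact stated in the generated text, collect the answers from one or more models, and flag the fact when too few answers agree with it. This is a minimal illustration, not the CONFACTCHECK implementation; the function names, the exact-match comparison, and the 0.5 threshold are assumptions, and the probe answers are hard-coded rather than obtained from an API.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial variations still match."""
    return " ".join(text.lower().split())


def consistency_score(claimed_fact, probe_answers):
    """Fraction of probe answers that agree with the fact stated in the text."""
    if not probe_answers:
        raise ValueError("need one or more probe answers")
    matches = sum(normalize(a) == normalize(claimed_fact) for a in probe_answers)
    return matches / len(probe_answers)


def is_hallucinated(claimed_fact, probe_answers, threshold=0.5):
    """Flag the fact when agreement falls below the (assumed) threshold."""
    return consistency_score(claimed_fact, probe_answers) < threshold


# Probe: "What is the capital of Australia?" answered by three model calls.
answers = ["Canberra", "canberra ", "Sydney"]
print(is_hallucinated("Canberra", answers))  # -> False
print(is_hallucinated("Sydney", answers))    # -> True
```

Exact string matching is the crudest possible agreement test; a practical system would need a softer comparison for paraphrased answers.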

[26] ViConBERT: Context-Gloss Aligned Vietnamese Word Embedding for Polysemous and Sense-Aware Representations

Khang T. Huynh,Dung H. Nguyen,Binh T. Nguyen

Main category: cs.CL

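TL;DR: This paper presents ViConBERT, a new framework for learning Vietnamese contextualized embeddings, and introduces ViConWSD, the first large-scale synthetic dataset for the task, to improve fine-grained semantic understanding of Vietnamese.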

Details

Motivation: Research on contextualized word embeddings has concentrated on high-resource languages such as English, while Vietnamese lacks effective models and evaluation resources, so semantic understanding tools tailored to Vietnamese are needed. Method: The ViConBERT framework combines contrastive learning (SimCLR) with gloss-based distillation to learn Vietnamese contextualized embeddings, and the ViConWSD dataset is built for evaluation. Result: The model reaches F1 = 0.87 on WSD, AP = 0.88 on ViCon, and a Spearman correlation of 0.60 on ViSim-400, outperforming strong baselines. Conclusion: ViConBERT models both discrete word senses and graded semantic relations in Vietnamese effectively and improves Vietnamese semantic understanding. Abstract: Recent advances in contextualized word embeddings have greatly improved semantic tasks such as Word Sense Disambiguation (WSD) and contextual similarity, but most progress has been limited to high-resource languages like English. Vietnamese, in contrast, still lacks robust models and evaluation resources for fine-grained semantic understanding. In this paper, we present ViConBERT, a novel framework for learning Vietnamese contextualized embeddings that integrates contrastive learning (SimCLR) and gloss-based distillation to better capture word meaning. We also introduce ViConWSD, the first large-scale synthetic dataset for evaluating semantic understanding in Vietnamese, covering both WSD and contextual similarity. Experimental results show that ViConBERT outperforms strong baselines on WSD (F1 = 0.87) and achieves competitive performance on ViCon (AP = 0.88) and ViSim-400 (Spearman's rho = 0.60), demonstrating its effectiveness in modeling both discrete senses and graded semantic relations. Our code, models, and data are available at https://github.com/tkhangg0910/ViConBERT

[27] Cmprsr: Abstractive Token-Level Question-Agnostic Prompt Compressor

Ivan Zakazov,Alexander Sharipov,Berke Argin,Oussama Gabouj,Kamel Charaf,Alexi Semiz,Lorenzo Drudi,Nicolas Baldwin,Robert West

Main category: cs.CL

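TL;DR: This paper proposes a new prompt compression paradigm in which a small LLM compresses the input for a larger LLM, lowering the cost of using black-box models. The authors build the first comprehensive LLM-as-a-compressor benchmark, covering 25 open- and closed-source models, and use optimization and post-training to develop Cmprsr, a compressor that outperforms existing methods across tasks and compression rates and follows the requested compression rate closely, allowing a controlled trade-off between cost and quality.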

Details

Motivation: Because black-box Large Language Models (LLMs) are expensive to use, this paper explores compressing inputs with a smaller LLM to reduce inference cost while preserving semantic content and task performance. Method: The authors first build an LLM-as-a-compressor benchmark of 25 models to assess how well each preserves key semantic information and follows a specified compression rate. They then optimize the meta-prompt of gpt-4.1-mini with Textgrad and post-train Qwen3-4B with supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), producing the new compressor Cmprsr. Result: On MeetingBank, LongBench, and GSM8k, Cmprsr outperforms extractive and vanilla abstractive compression across compression rates, on both long inputs and short prompts, while closely following the user-specified compression rate. Conclusion: Cmprsr is an efficient, controllable, and general prompt compressor that substantially shortens LLM inputs while maintaining downstream task performance, offering a practical way to lower LLM deployment cost. Abstract: Motivated by the high costs of using black-box Large Language Models (LLMs), we introduce a novel prompt compression paradigm, under which we use smaller LLMs to compress inputs for the larger ones. We present the first comprehensive LLM-as-a-compressor benchmark spanning 25 open- and closed-source models, which reveals significant disparity in models' compression ability in terms of (i) preserving semantically important information and (ii) following the user-provided compression rate (CR). We further improve the performance of gpt-4.1-mini, the best overall vanilla compressor, with Textgrad-based compression meta-prompt optimization. We also identify the most promising open-source vanilla LLM - Qwen3-4B - and post-train it with a combination of supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), pursuing the dual objective of CR adherence and maximizing the downstream task performance. We call the resulting model Cmprsr and demonstrate its superiority over both extractive and vanilla abstractive compression across the entire range of compression rates on lengthy inputs from MeetingBank and LongBench as well as short prompts from GSM8k. The latter highlights Cmprsr's generalizability across varying input lengths and domains. Moreover, Cmprsr closely follows the requested compression rate, offering fine control over the cost-quality trade-off.
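
Compression-rate adherence is straightforward to measure once a definition is fixed. The sketch below assumes CR is the ratio of compressed length to original length and counts whitespace-separated words instead of model tokens; the abstract states neither choice, so both are assumptions, as are the function names.

```python
def compression_rate(original, compressed):
    """Compressed length divided by original length (assumed definition of CR).

    Lengths are counted in whitespace-separated words as a stand-in for tokens.
    """
    n_original = len(original.split())
    if n_original == 0:
        raise ValueError("original text must not be empty")
    return len(compressed.split()) / n_original


def cr_deviation(original, compressed, target_cr):
    """Absolute gap between the achieved and the requested compression rate."""
    return abs(compression_rate(original, compressed) - target_cr)


original = "the meeting opened with a long review of last quarter results"
compressed = "meeting reviewed last quarter results"
print(compression_rate(original, compressed))              # -> 0.4545454545454545
print(round(cr_deviation(original, compressed, 0.5), 3))   # -> 0.045
```

A compressor that adheres well keeps this deviation small across the whole range of requested rates, which is what makes the cost-quality trade-off controllable.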

[28] AugAbEx : Way Forward for Extractive Case Summarization

Purnima Bindal,Vikas Kumar,Sagar Rathore,Vasudha Bhatnagar

Main category: cs.CL

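TL;DR: Proposes a lightweight, transparent pipeline that uses existing abstractive summaries to produce corresponding extractive summaries, enriching the data resources for automatic summarization of legal judgments.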

Details

Motivation: Legal judgments use complex language and context-sensitive jargon and are long, so manual summarization is burdensome; abstractive summaries generated by neural methods risk misrepresenting legal terms or omitting key details, so more reliable extractive summarization is needed. Method: A light, transparent pipeline uses existing abstractive gold-standard summaries to create corresponding extractive summaries and augments seven existing datasets; the quality of the generated summaries is assessed along structural, lexical, semantic, and domain-information dimensions. Result: Extractive versions are produced for seven datasets that contain abstractive summaries, and the evaluation indicates that the new and original summaries are consistent across these dimensions, so the experts' opinions are preserved. Conclusion: The method enriches the data resources available for legal case summarization, and the publicly released datasets are expected to advance research on automatic summarization of legal text. Abstract: Summarization of legal judgments poses a heavy cognitive burden on law practitioners due to the complexity of the language, context-sensitive legal jargon, and the length of the document. Therefore, the automatic summarization of legal documents has attracted serious attention from natural language processing researchers. Since the abstractive summaries of legal documents generated by deep neural methods remain prone to the risk of misrepresenting nuanced legal jargon or overlooking key contextual details, we envisage a rising trend toward the use of extractive case summarizers. Given the high cost of human annotation for gold standard extractive summaries, we engineer a light and transparent pipeline that leverages existing abstractive gold standard summaries to create the corresponding extractive gold standard versions. The approach ensures that the experts' opinions ensconced in the original gold standard abstractive summaries are carried over to the transformed extractive summaries. We aim to augment seven existing case summarization datasets, which include abstractive summaries, by incorporating corresponding extractive summaries and create an enriched data resource for case summarization research community. To ensure the quality of the augmented extractive summaries, we perform an extensive comparative evaluation with the original abstractive gold standard summaries covering structural, lexical, and semantic dimensions. We also compare the domain-level information of the two summaries. We commit to release the augmented datasets in the public domain for use by the research community and believe that the resource will offer opportunities to advance the field of automatic summarization of legal documents.

[29] Do LLMs and Humans Find the Same Questions Difficult? A Case Study on Japanese Quiz Answering

Naoya Sugiura,Kosuke Yamada,Yasuhiro Ogawa,Katsuhiko Toyama,Ryohei Sasano

Main category: cs.CL

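TL;DR: This study compares how LLMs and humans perform on buzzer-style quizzes and finds that LLMs do worse on questions whose answers are not covered by Wikipedia or that require numerical answers.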

Details

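Motivation: To find out whether LLMs also struggle with the questions that humans find hard, and to understand how difficulty differs between LLMs and humans. Method: Japanese quiz data containing questions, answers, and human correct-response rates are collected; LLMs are prompted to answer under several settings, and their correct-answer rates are compared with those of humans from two analytical perspectives. Result: Compared with humans, LLMs perform worse on questions whose answers are not covered by Wikipedia entries and on questions that require numerical answers. Conclusion: LLMs find different questions difficult than humans do, with clear weaknesses where knowledge coverage is incomplete and where numerical answers are required. Abstract: LLMs have achieved performance that surpasses humans in many NLP tasks. However, it remains unclear whether problems that are difficult for humans are also difficult for LLMs. This study investigates how the difficulty of quizzes in a buzzer setting differs between LLMs and humans. Specifically, we first collect Japanese quiz data including questions, answers, and correct response rate of humans, then prompt LLMs to answer the quizzes under several settings, and compare their correct answer rate to that of humans from two analytical perspectives. The experimental results showed that, compared to humans, LLMs struggle more with quizzes whose correct answers are not covered by Wikipedia entries, and also have difficulty with questions that require numerical answers.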

[30] Don't Think of the White Bear: Ironic Negation in Transformer Models Under Cognitive Load

Logan Mann,Nayan Saxena,Sarah Tandon,Chenhao Sun,Savar Toteja,Kevin Zhu

Main category: cs.CL

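TL;DR: The study finds that LLMs show "ironic rebound" when handling negation instructions: a concept the model is told not to mention becomes more likely to be activated. The effect parallels ironic rebound in cognitive psychology, and experiments plus a circuit tracing analysis shed light on its mechanism.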

Details

Motivation: To explore whether LLMs show human-like ironic rebound when given negation instructions, and to understand the underlying mechanism. Method: Two experiments are run: (1) a load and content experiment that tests how different types of distractor text affect rebound strength; (2) a polarity separation experiment that tests whether models distinguish neutral from negative framings of a concept and how this relates to rebound persistence. A circuit tracing analysis examines the internal mechanism. Result: Rebound appears immediately after a negation instruction; semantic or longer distractors strengthen it, while repetition supports suppression. Stronger polarity separation goes with more persistent rebound. Sparse middle-layer attention heads amplify the forbidden tokens, while early layers suppress them. Conclusion: LLMs exhibit human-like ironic rebound, and its mechanism is tied to long-context interference; the study provides empirical grounding and a data resource (ReboundBench) for understanding and mitigating the problem. Abstract: Negation instructions such as 'do not mention X' can paradoxically increase the accessibility of X in human thought, a phenomenon known as ironic rebound. Large language models (LLMs) face the same challenge: suppressing a concept requires internally activating it, which may prime rebound instead of avoidance. We investigated this tension with two experiments. (1) Load & content: after a negation instruction, we vary distractor text (semantic, syntactic, repetition) and measure rebound strength. (2) Polarity separation: We test whether models distinguish neutral from negative framings of the same concept and whether this separation predicts rebound persistence. Results show that rebound consistently arises immediately after negation and intensifies with longer or semantic distractors, while repetition supports suppression. Stronger polarity separation correlates with more persistent rebound. Together, these findings, complemented by a circuit tracing analysis that identifies sparse middle-layer attention heads amplifying forbidden tokens while early layers suppress, link cognitive predictions of ironic rebound with mechanistic insights into long-context interference. To support future work, we release ReboundBench, a dataset of 5,000 systematically varied negation prompts designed to probe rebound in LLMs.

[31] From Phonemes to Meaning: Evaluating Large Language Models on Tamil

Jeyarajalingam Varsha,Menan Velayuthan,Sumirtha Karunakaran,Rasan Nivethiga,Kengatharaiyer Sarveswaran

Main category: cs.CL

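TL;DR: This paper introduces ILAKKANAM, the first Tamil-specific linguistic evaluation benchmark, built from Sri Lankan school-level Tamil examination questions to assess LLMs on a low-resource, morphologically rich language. Existing models do reasonably well on simple tasks but decline as linguistic complexity grows, and performance is not strongly correlated with the ability to identify linguistic categories, suggesting reliance on training-data exposure rather than genuine understanding.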

Details

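Motivation: Existing multilingual benchmarks are mostly built from translated English data and fail to capture the linguistic and cultural nuances of low-resource, morphologically rich languages such as Tamil, so the linguistic competence of LLMs in these languages is under-evaluated. A localized benchmark that reflects the target language is therefore needed. Method: The authors build ILAKKANAM, the first Tamil-specific linguistic evaluation benchmark, from 820 manually curated questions taken from Sri Lankan school-level Tamil examination papers. Trained linguists annotate each question under five linguistic categories and one factual knowledge category, covering Grades 1 to 13. Several closed-source and open-source LLMs are evaluated with a standardized framework and analyzed overall, by category, and by grade. Result: Gemini 2.5 performs best among all models, while open-source models lag behind, revealing a gap in linguistic grounding. All models do well on lower-grade questions but decline clearly as linguistic complexity increases. No strong correlation is found between a model's overall accuracy and its ability to identify a question's linguistic category, suggesting that performance may stem from training-data exposure rather than genuine understanding. Conclusion: Current LLMs remain clearly limited on low-resource, morphologically complex Tamil, especially for higher-level linguistic structures. ILAKKANAM offers an evaluation closer to real language use, exposes the shortcomings of translation-based benchmarks, and underlines the need for models with genuine linguistic understanding rather than surface matching driven by data exposure. Abstract: Large Language Models (LLMs) have shown strong generalization across tasks in high-resource languages; however, their linguistic competence in low-resource and morphologically rich languages such as Tamil remains largely unexplored. Existing multilingual benchmarks often rely on translated English datasets, failing to capture the linguistic and cultural nuances of the target language. To address this gap, we introduce ILAKKANAM, the first Tamil-specific linguistic evaluation benchmark manually curated using 820 questions from Sri Lankan school-level Tamil subject examination papers. Each question is annotated by trained linguists under five linguistic categories and a factual knowledge category, spanning Grades 1-13 to ensure broad linguistic coverage. We evaluate both closed-source and open-source LLMs using a standardized evaluation framework. Our results show that Gemini 2.5 achieves the highest overall performance, while open-source models lag behind, highlighting the gap in linguistic grounding. Category- and grade-wise analyses reveal that all models perform well on lower-grade questions but show a clear decline as linguistic complexity increases. Further, no strong correlation is observed between a model's overall performance and its ability to identify linguistic categories, suggesting that performance may be driven by exposure rather than genuine understanding.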

[32] Probing Preference Representations: A Multi-Dimensional Evaluation and Analysis Method for Reward Models

Chenglong Wang,Yifu Huo,Yang Gan,Yongyu Mu,Qiaozhi He,Murun Yang,Bei Li,Chunliang Zhang,Tongran Liu,Anxiang Ma,Zhengtao Yu,Jingbo Zhu,Tong Xiao

Main category: cs.CL

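TL;DR: This paper proposes evaluating reward models by probing their preference representations and builds the Multi-dimensional Reward Model Benchmark (MRMBench) to measure performance across preference dimensions. Experiments show that the benchmark correlates strongly with LLM alignment performance and that inference-time probing improves the interpretability of reward predictions and gives a measure of their confidence.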

Details

Motivation: Existing evaluations test reward models only on a fixed pairwise ranking test set and give no fine-grained view of performance on each preference dimension, so a more fine-grained and interpretable evaluation is needed. Method: MRMBench is proposed, comprising six probing tasks for different preference dimensions, together with inference-time probing, which identifies the preference dimensions used during reward prediction and improves interpretability. Result: MRMBench correlates strongly with the alignment performance of LLMs; current reward models struggle to capture preferences across multiple dimensions; inference-time probing provides a reliable metric for the confidence of reward predictions. Conclusion: Multi-dimensional probing helps in developing better reward models, multi-objective optimization is promising, and inference-time probing improves the interpretability and reliability of reward models. Abstract: Previous methods evaluate reward models by testing them on a fixed pairwise ranking test set, but they typically do not provide performance information on each preference dimension. In this work, we address the evaluation challenge of reward models by probing preference representations. To confirm the effectiveness of this evaluation method, we construct a Multi-dimensional Reward Model Benchmark (MRMBench), a collection of six probing tasks for different preference dimensions. We design it to favor and encourage reward models that better capture preferences across different dimensions. Furthermore, we introduce an analysis method, inference-time probing, which identifies the dimensions used during the reward prediction and enhances its interpretability. Through extensive experiments, we find that MRMBench strongly correlates with the alignment performance of large language models (LLMs), making it a reliable reference for developing advanced reward models. Our analysis of MRMBench evaluation results reveals that reward models often struggle to capture preferences across multiple dimensions, highlighting the potential of multi-objective optimization in reward modeling. Additionally, our findings show that the proposed inference-time probing method offers a reliable metric for assessing the confidence of reward predictions, which ultimately improves the alignment of LLMs.

[33] Assessing LLMs for Serendipity Discovery in Knowledge Graphs: A Case for Drug Repurposing

Mengying Wang,Chenhui Ma,Ao Jiao,Tuo Liang,Pengjun Lu,Shrinidhi Hegde,Yu Yin,Evren Gurkan-Cavusoglu,Yinghui Wu

Main category: cs.CL

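TL;DR: This paper proposes the SerenQA framework for evaluating how well LLMs uncover unexpected insights (serendipity) in scientific knowledge graph question answering. It defines a serendipity metric that combines relevance, novelty, and surprise, and releases an expert-annotated benchmark focused on drug repurposing. Experiments show that current LLMs do well on retrieval but still fall short at finding genuinely valuable serendipitous answers.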

Details

Motivation: Existing knowledge graph question answering systems are mostly optimized to return highly relevant but predictable answers and lack the ability to surface surprising and novel ones, a capability that matters greatly in scientific discovery such as drug repurposing. Method: The SerenQA framework includes a rigorous metric based on relevance, novelty, and surprise, an expert-annotated benchmark derived from the Clinical Knowledge Graph, and a structured evaluation pipeline with three subtasks: knowledge retrieval, subgraph reasoning, and serendipity exploration. Result: State-of-the-art LLMs perform well on knowledge retrieval but poorly on subgraph reasoning and serendipity exploration, and they struggle to identify genuinely surprising and valuable discoveries. Conclusion: SerenQA exposes the limits of current LLMs in finding serendipitous answers in knowledge graph question answering and provides an evaluation benchmark and directions for future work. Abstract: Large Language Models (LLMs) have greatly advanced knowledge graph question answering (KGQA), yet existing systems are typically optimized for returning highly relevant but predictable answers. A missing yet desired capacity is to exploit LLMs to suggest surprising and novel ("serendipitous") answers. In this paper, we formally define the serendipity-aware KGQA task and propose the SerenQA framework to evaluate LLMs' ability to uncover unexpected insights in scientific KGQA tasks. SerenQA includes a rigorous serendipity metric based on relevance, novelty, and surprise, along with an expert-annotated benchmark derived from the Clinical Knowledge Graph, focused on drug repurposing. Additionally, it features a structured evaluation pipeline encompassing three subtasks: knowledge retrieval, subgraph reasoning, and serendipity exploration. Our experiments reveal that while state-of-the-art LLMs perform well on retrieval, they still struggle to identify genuinely surprising and valuable discoveries, underscoring significant room for future improvements. Our curated resources and extended version are released at: https://cwru-db-group.github.io/serenQA.

[34] SGuard-v1: Safety Guardrail for Large Language Models

JoonHo Lee,HyeonMin Cho,Jaewoong Yun,Hyunjae Lee,JunKyu Lee,Juree Seok

Main category: cs.CL

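TL;DR: SGuard-v1 is a lightweight safety guardrail for large language models consisting of two specialized models: ContentFilter, which detects safety risks in prompts and responses, and JailbreakFilter, which identifies adversarial prompts and covers 60 major attack types.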

Details

Motivation: To identify harmful content and adversarial prompts (such as jailbreak attacks) effectively in human-AI conversations while keeping deployment overhead low and improving the safety and interpretability of AI systems. Method: Builds two specialized components, ContentFilter and JailbreakFilter, on the 2B-parameter Granite-3.3-2B-Instruct model; curates about 1.4 million training instances from collected and synthesized data, instruction-tunes each component on its share of the data, and designs a curriculum learning strategy to improve adversarial prompt detection. Result: Achieves state-of-the-art safety performance on public and proprietary safety benchmarks, supports 12 languages, provides multi-class safety predictions with binary confidence scores, mitigates false-unsafe classifications, and remains lightweight and easy to deploy. Conclusion: SGuard-v1 delivers effective safety protection while staying lightweight, offers good interpretability and multilingual support, is suitable for deployment in real AI systems, and has been open-sourced to support AI safety research. Abstract: We present SGuard-v1, a lightweight safety guardrail for Large Language Models (LLMs), which comprises two specialized models to detect harmful content and screen adversarial prompts in human-AI conversational settings. The first component, ContentFilter, is trained to identify safety risks in LLM prompts and responses in accordance with the MLCommons hazard taxonomy, a comprehensive framework for trust and safety assessment of AI. The second component, JailbreakFilter, is trained with a carefully designed curriculum over integrated datasets and findings from prior work on adversarial prompting, covering 60 major attack types while mitigating false-unsafe classification. SGuard-v1 is built on the 2B-parameter Granite-3.3-2B-Instruct model that supports 12 languages. We curate approximately 1.4 million training instances from both collected and synthesized data and perform instruction tuning on the base model, distributing the curated data across the two components according to their designated functions. Through extensive evaluation on public and proprietary safety benchmarks, SGuard-v1 achieves state-of-the-art safety performance while remaining lightweight, thereby reducing deployment overhead. SGuard-v1 also improves interpretability for downstream use by providing multi-class safety predictions and their binary confidence scores. We release the SGuard-v1 under the Apache-2.0 License to enable further research and practical deployment in AI safety.

[35] QA-Noun: Representing Nominal Semantics via Natural Language Question-Answer Pairs

Maria Tseytlin,Paul Roit,Omri Abend,Ido Dagan,Ayal Klein

Main category: cs.CL

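TL;DR: This paper proposes QA-Noun, a QA-based framework for capturing noun-centered semantic relations. Nine question templates cover both the explicit syntactic and the implicit contextual roles of nouns, and combining QA-Noun with QA-SRL yields a fine-grained decomposition of sentence meaning.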

Details

Motivation: Existing QA-based semantic approaches focus mainly on predicate-argument relations and neglect noun-centered semantic representation, so a framework that systematically models the semantic roles of nouns is needed. Method: Proposes the QA-Noun framework, defines nine noun-oriented question templates, builds an annotated dataset of more than 2,000 noun mentions with detailed annotation guidelines, and trains a model integrated with QA-SRL to produce a unified semantic decomposition. Result: QA-Noun achieves near-complete coverage of AMR's noun arguments and surfaces additional contextually implied relations; combined with QA-SRL, it yields over 130% higher granularity than methods such as FactScore and DecompScore. Conclusion: QA-Noun complements existing QA-based semantic frameworks, forming a comprehensive and scalable approach to fine-grained semantic decomposition that benefits cross-text alignment tasks. Abstract: Decomposing sentences into fine-grained meaning units is increasingly used to model semantic alignment. While QA-based semantic approaches have shown effectiveness for representing predicate-argument relations, they have so far left noun-centered semantics largely unaddressed. We introduce QA-Noun, a QA-based framework for capturing noun-centered semantic relations. QA-Noun defines nine question templates that cover both explicit syntactical and implicit contextual roles for nouns, producing interpretable QA pairs that complement verbal QA-SRL. We release detailed guidelines, a dataset of over 2,000 annotated noun mentions, and a trained model integrated with QA-SRL to yield a unified decomposition of sentence meaning into individual, highly fine-grained facts. Evaluation shows that QA-Noun achieves near-complete coverage of AMR's noun arguments while surfacing additional contextually implied relations, and that combining QA-Noun with QA-SRL yields over 130% higher granularity than recent fact-based decomposition methods such as FactScore and DecompScore. QA-Noun thus complements the broader QA-based semantic framework, forming a comprehensive and scalable approach to fine-grained semantic decomposition for cross-text alignment.

[36] TAdaRAG: Task Adaptive Retrieval-Augmented Generation via On-the-Fly Knowledge Graph Construction

Jie Zhang,Bo Tang,Wanzi Shao,Wenqiang Wei,Jihao Zhao,Jianqing Zhu,Zhiyu li,Wen Xi,Zehao Lin,Feiyu Xiong,Yanchao Tan

Main category: cs.CL

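TL;DR: Proposes TAdaRAG, a retrieval-augmented generation framework built on task-adaptive knowledge graph construction, which uses an intent-driven extraction mechanism to improve knowledge utilization and reasoning accuracy.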

Details

Motivation: Traditional RAG loses information through text chunking and introduces irrelevant, redundant content, which hurts the accuracy and consistency of model reasoning. Method: Designs an intent-driven routing mechanism that selects a domain-specific extraction template, combined with supervised fine-tuning and a reinforcement learning-based implicit extraction mechanism, to construct task-adaptive knowledge graphs dynamically. Result: On six public benchmarks and one real-world business dataset (NowNewsQA), experiments with three backbone models show that TAdaRAG outperforms existing methods across diverse domains and long-text tasks. Conclusion: TAdaRAG improves the precision of knowledge retrieval and the coherence of reasoning, with strong generalization and practical value. Abstract: Retrieval-Augmented Generation (RAG) improves large language models by retrieving external knowledge, often truncated into smaller chunks due to the input context window, which leads to information loss, resulting in response hallucinations and broken reasoning chains. Moreover, traditional RAG retrieves unstructured knowledge, introducing irrelevant details that hinder accurate reasoning. To address these issues, we propose TAdaRAG, a novel RAG framework for on-the-fly task-adaptive knowledge graph construction from external sources. Specifically, we design an intent-driven routing mechanism to a domain-specific extraction template, followed by supervised fine-tuning and a reinforcement learning-based implicit extraction mechanism, ensuring concise, coherent, and non-redundant knowledge integration. Evaluations on six public benchmarks and a real-world business benchmark (NowNewsQA) across three backbone models demonstrate that TAdaRAG outperforms existing methods across diverse domains and long-text tasks, highlighting its strong generalization and practical effectiveness.

[37] Mitigating Length Bias in RLHF through a Causal Lens

Hyeonji Kim,Sujeong Oh,Sanghack Lee

Main category: cs.CL

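TL;DR: Proposes a causal framework based on counterfactual data augmentation to mitigate length bias in RLHF reward models, so that they attend to content quality rather than response length.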

Details

Motivation: Reward models trained with RLHF often exhibit length bias, mistaking verbosity for quality, which leads to overly long, redundant generations and weakens alignment. Method: Builds a counterfactual data augmentation method that generates two kinds of samples: (1) response pairs with similar content but different lengths, and (2) response pairs of similar length but different content, thereby decoupling the effects of content quality and length during training. Result: Experiments show that the method reduces length bias in reward assignment and leads the policy model to produce more concise, content-focused responses. Conclusion: The proposed causal framework effectively mitigates length bias in RLHF reward models and improves the content sensitivity and robustness of reward modeling. Abstract: Reinforcement learning from human feedback (RLHF) is widely used to align large language models (LLMs) with human preferences. However, RLHF-trained reward models often exhibit length bias -- a systematic tendency to favor longer responses by conflating verbosity with quality. We propose a causal framework for analyzing and mitigating length bias in RLHF reward modeling. Central to our approach is a counterfactual data augmentation method that generates response pairs designed to isolate content quality from verbosity. These counterfactual examples are then used to train the reward model, enabling it to assess responses based on content quality independently of verbosity. Specifically, we construct (1) length-divergent pairs with similar content and (2) content-divergent pairs of similar length. Empirical evaluations show that our method reduces length bias in reward assignment and leads to more concise, content-focused outputs from the policy model. These findings demonstrate that the proposed approach effectively reduces length bias and improves the robustness and content sensitivity of reward modeling in RLHF pipelines.

[38] MMWOZ: Building Multimodal Agent for Task-oriented Dialogue

Pu-Hai Yang,Heyan Huang,Heng-Da Xu,Fanshu Sun,Xian-Ling Mao,Chaoxu Mu

Main category: cs.CL

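TL;DR: This paper introduces MMWOZ, a new multimodal dialogue dataset that bridges the gap between traditional task-oriented dialogue systems and practical applications where no back-end API is available, and proposes a multimodal baseline model, MATE, for experimental analysis.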

Details

Motivation: Front-end graphical user interfaces (GUIs) are widespread in practice while customized back-end APIs are often missing, which makes traditional task-oriented dialogue systems hard to deploy; multimodal dialogue systems that can interact with GUIs are therefore needed. Method: Builds the MMWOZ dataset on top of MultiWOZ 2.3 by developing a web-style GUI as the front end, writing an automated script that converts the original dialogue states and system actions into GUI operation instructions, and collecting web page snapshots with their corresponding operation instructions; also proposes the MATE multimodal model as a baseline. Result: Produces MMWOZ, a multimodal dialogue dataset containing GUI operations and visual information, and reports a comprehensive experimental analysis with MATE on building a practical multimodal agent for task-oriented dialogue. Conclusion: The MMWOZ dataset and the MATE model offer a feasible path for task-oriented dialogue systems to complete tasks through a GUI when no back-end API exists, advancing multimodal dialogue agents. Abstract: Task-oriented dialogue systems have garnered significant attention due to their conversational ability to accomplish goals, such as booking airline tickets for users. Traditionally, task-oriented dialogue systems are conceptualized as intelligent agents that interact with users using natural language and have access to customized back-end APIs. However, in real-world scenarios, the widespread presence of front-end Graphical User Interfaces (GUIs) and the absence of customized back-end APIs create a significant gap for traditional task-oriented dialogue systems in practical applications. In this paper, to bridge the gap, we collect MMWOZ, a new multimodal dialogue dataset that is extended from the MultiWOZ 2.3 dataset. Specifically, we begin by developing a web-style GUI to serve as the front-end. Next, we devise an automated script to convert the dialogue states and system actions from the original dataset into operation instructions for the GUI. Lastly, we collect snapshots of the web pages along with their corresponding operation instructions. In addition, we propose a novel multimodal model called MATE (Multimodal Agent for Task-oriEnted dialogue) as the baseline model for the MMWOZ dataset. Furthermore, we conduct comprehensive experimental analysis using MATE to investigate the construction of a practical multimodal agent for task-oriented dialogue.

[39] Group-Aware Reinforcement Learning for Output Diversity in Large Language Models

Oron Anschel,Alon Shoshan,Adam Botach,Shunit Haviv Hakimi,Asaf Gendler,Emanuel Ben Baruch,Nadav Bhonker,Igor Kviatkovsky,Manoj Aggarwal,Gerard Medioni

Main category: cs.CL

TL;DR: This paper proposes Group-Aware Policy Optimization (GAPO), a new method that addresses mode collapse in large language model generation and thereby improves output diversity.

Details

Motivation: Large language models often suffer from mode collapse, repeatedly generating the same few similar completions even when many valid answers exist, which limits output diversity across tasks. Method: GAPO is a simple extension of Group Relative Policy Optimization (GRPO) that computes rewards over the group as a whole, learning from group-level properties such as diversity and coverage, and uses a frequency-aware reward function that encourages uniform sampling over valid completions. Result: Models trained with GAPO produce valid and markedly more diverse responses without compromising accuracy on standard benchmarks (GSM8K, MATH, HumanEval, MMLU-Pro), and the method generalizes to open-ended prompts. Conclusion: GAPO effectively mitigates mode collapse in large language models, improving the diversity and coverage of generations without sacrificing accuracy, and generalizes well. Abstract: Large Language Models (LLMs) often suffer from mode collapse, repeatedly generating the same few completions even when many valid answers exist, limiting their diversity across a wide range of tasks. We introduce Group-Aware Policy Optimization (GAPO), a simple extension of the recent and popular Group Relative Policy Optimization (GRPO) that computes rewards over the group as a whole. GAPO enables learning from group-level properties such as diversity and coverage. We demonstrate GAPO using a frequency-aware reward function that encourages uniform sampling over valid LLM completions, and show that GAPO-trained models produce valid and more diverse model responses. Beyond this setup, GAPO generalizes to open-ended prompts and improves response diversity without compromising accuracy on standard LLM benchmarks (GSM8K, MATH, HumanEval, MMLU-Pro). Our code will be made publicly available.
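
To make the idea of a group-level, frequency-aware reward concrete, here is a minimal sketch. It is not the paper's reward function: the name `frequency_aware_rewards`, the penalty of -1.0 for invalid completions, and the `target / share` formula are all assumptions chosen only to show how a reward can depend on the whole sampled group rather than on each completion alone.

```python
from collections import Counter

def frequency_aware_rewards(completions, valid_answers):
    """Return one reward per completion, computed from the whole group.

    Hypothetical scheme: valid answers that are over-represented in the
    group earn less, nudging the policy toward uniform sampling.
    Assumes valid_answers is non-empty.
    """
    counts = Counter(c for c in completions if c in valid_answers)
    n_valid = sum(counts.values())
    target = 1.0 / len(valid_answers)  # uniform share per valid answer
    rewards = []
    for c in completions:
        if c not in valid_answers:
            rewards.append(-1.0)  # assumed penalty for an invalid completion
            continue
        share = counts[c] / n_valid  # observed share within this group
        # At or below the uniform share: full reward. Above it: scaled down.
        rewards.append(min(1.0, target / share))
    return rewards

group = ["a", "a", "a", "b", "z"]
print(frequency_aware_rewards(group, {"a", "b", "c", "d"}))
```

In this toy group, the repeated answer "a" receives a smaller reward than the rarer answer "b", and the invalid "z" is penalized; a GRPO-style trainer would then normalize these rewards within the group.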

[40] Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data

Yunxin Li,Xinyu Chen,Shenyuan Jiang,Haoyuan Shi,Zhenyu Liu,Xuanyu Zhang,Nanhao Deng,Zhenran Xu,Yicheng Ma,Meishan Zhang,Baotian Hu,Min Zhang

Main category: cs.CL

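TL;DR: Uni-MoE 2.0 is an open-source omnimodal large model built on the Qwen2.5-7B architecture. Through a dynamic-capacity MoE design, a progressive training strategy, and a multimodal data matching technique, it makes substantial progress in language-centric multimodal understanding, reasoning, and generation.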

Details

Motivation: To improve the performance and efficiency of existing multimodal large models on cross-modal understanding and generation, especially their overall behavior in complex settings involving video, audio, and images. Method: Uses a dynamic-capacity Mixture-of-Experts (MoE) architecture with shared, routed, and null experts to handle ten cross-modal inputs; introduces Omni-Modality 3D RoPE for spatio-temporal alignment in the self-attention layer; applies progressive supervised fine-tuning with an iterative GSPO-DPO method for reinforcement learning; and trains on about 75B tokens of open-source multimodal data with special tokens for speech and image generation. Result: Evaluated on 85 benchmarks, the model surpasses Qwen2.5-Omni (trained with 1.2T tokens) on over 50 of 76 benchmarks, and leads in video understanding (+7% on average), omnimodality understanding (+7%), audiovisual reasoning (+4%), long-form speech processing (WER reduced by 4.2%), and low-level image processing and controllable generation. Conclusion: Uni-MoE 2.0 strikes a good balance between computational efficiency and multimodal capability, ranking among the leading open-source omnimodal large models with strong cross-modal understanding and generation. Abstract: We present Uni-MoE 2.0 from the Lychee family. As a fully open-source omnimodal large model (OLM), it substantially advances Lychee's Uni-MoE series in language-centric multimodal understanding, reasoning, and generating. Based on the Qwen2.5-7B dense architecture, we build Uni-MoE-2.0-Omni from scratch through three core contributions: dynamic-capacity Mixture-of-Experts (MoE) design, a progressive training strategy enhanced with an iterative reinforcement strategy, and a carefully curated multimodal data matching technique. It is capable of omnimodal understanding, as well as generating images, text, and speech. Architecturally, our new MoE framework balances computational efficiency and capability for 10 cross-modal inputs using shared, routed, and null experts, while our Omni-Modality 3D RoPE ensures spatio-temporal cross-modality alignment in the self-attention layer. For training, following cross-modal pretraining, we use a progressive supervised fine-tuning strategy that activates modality-specific experts and is enhanced by balanced data composition and an iterative GSPO-DPO method to stabilise RL training and improve reasoning. Data-wise, the base model, trained on approximately 75B tokens of open-source multimodal data, is equipped with special speech and image generation tokens, allowing it to learn these generative tasks by conditioning its outputs on linguistic cues. 
Extensive evaluation across 85 benchmarks demonstrates that our model achieves SOTA or highly competitive performance against leading OLMs, surpassing Qwen2.5-Omni (trained with 1.2T tokens) on over 50 of 76 benchmarks. Key strengths include video understanding (+7% avg. of 8), omnimodality understanding (+7% avg. of 4), and audiovisual reasoning (+4%). It also advances long-form speech processing (reducing WER by 4.2%) and leads in low-level image processing and controllable generation across 5 metrics.

[41] Knots: A Large-Scale Multi-Agent Enhanced Expert-Annotated Dataset and LLM Prompt Optimization for NOTAM Semantic Parsing

Maoqi Liu,Quan Fang,Yang Yang,Can Zhao,Kaiquan Cai

Main category: cs.CL

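TL;DR: This paper proposes the NOTAM semantic parsing task, which aims at deep semantic understanding of NOTAMs by drawing on aviation domain knowledge, and builds the Knots dataset of 12,347 expert-annotated NOTAMs, using a multi-agent collaborative framework to improve field coverage. Experiments confirm the effectiveness of prompt-engineering and model-adaptation techniques, substantially improving automated NOTAM analysis.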

Details

Motivation: Existing research concentrates on surface-level NOTAM tasks such as classification and named entity recognition and lacks deep semantic understanding and implicit reasoning, so it falls short of real-world needs for parsing flight safety information. Method: Proposes the NOTAM semantic parsing task, builds the high-quality Knots dataset, uses a multi-agent collaborative framework for field discovery, and systematically evaluates a range of prompt-engineering and model-adaptation techniques. Result: Achieves substantial improvements in aviation text understanding and processing, supporting the effectiveness of the approach for semantic inference and structured output generation. Conclusion: The work fills a gap in deep semantic understanding of NOTAMs and provides an effective method and high-quality data for automated NOTAM analysis systems, with practical value. Abstract: Notice to Air Missions (NOTAMs) serve as a critical channel for disseminating key flight safety information, yet their complex linguistic structures and implicit reasoning pose significant challenges for automated parsing. Existing research mainly focuses on surface-level tasks such as classification and named entity recognition, lacking deep semantic understanding. To address this gap, we propose NOTAM semantic parsing, a task emphasizing semantic inference and the integration of aviation domain knowledge to produce structured, inference-rich outputs. To support this task, we construct Knots (Knowledge and NOTAM Semantics), a high-quality dataset of 12,347 expert-annotated NOTAMs covering 194 Flight Information Regions, enhanced through a multi-agent collaborative framework for comprehensive field discovery. We systematically evaluate a wide range of prompt-engineering strategies and model-adaptation techniques, achieving substantial improvements in aviation text understanding and processing. Our experimental results demonstrate the effectiveness of the proposed approach and offer valuable insights for automated NOTAM analysis systems. Our code is available at: https://github.com/Estrellajer/Knots.

[42] Reason-KE++: Aligning the Process, Not Just the Outcome, for Faithful LLM Knowledge Editing

Yuchen Wu,Liang Ding,Li Shen,Dacheng Tao

Main category: cs.CL

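TL;DR: Proposes Reason-KE++, a framework combining SFT with reinforcement learning that uses a stage-aware reward mechanism to make large models more faithful to new knowledge in multi-hop reasoning, addressing the "faithfulness gap" in which existing methods mimic format while neglecting the correctness of the reasoning.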

Details

Motivation: Existing SFT-based methods such as Reason-KE show a "faithfulness gap" in multi-hop reasoning: they tend to rely on the model's prior knowledge rather than on new facts in the context, producing factual hallucinations. Method: Proposes Reason-KE++, an SFT+RL framework with a stage-aware reward mechanism that gives dense supervision to intermediate reasoning steps such as decomposition and sub-answer correctness, ensuring a faithful reasoning process. Result: Sets a new SOTA of 95.48% on MQUAKE-CF-3k (+5.28%); by contrast, naive outcome-only RL is reported to collapse reasoning integrity (e.g., 19.00% Hop acc) while superficially raising final accuracy, which supports the case for process alignment. Conclusion: In complex multi-hop reasoning, optimizing only the final outcome causes reasoning integrity to collapse; the reasoning process itself must be aligned to build trustworthy large language models. Abstract: Aligning Large Language Models (LLMs) to be faithful to new knowledge in complex, multi-hop reasoning tasks is a critical, yet unsolved, challenge. We find that SFT-based methods, e.g., Reason-KE, while state-of-the-art, suffer from a "faithfulness gap": they optimize for format mimicry rather than sound reasoning. This gap enables the LLM's powerful parametric priors to override new contextual facts, resulting in critical factual hallucinations (e.g., incorrectly reasoning "Houston" from "NASA" despite an explicit edit). To solve this core LLM alignment problem, we propose Reason-KE++, an SFT+RL framework that instills process-level faithfulness. Its core is a Stage-aware Reward mechanism that provides dense supervision for intermediate reasoning steps (e.g., Decomposition, Sub-answer Correctness). Crucially, we identify that naive outcome-only RL is a deceptive trap for LLM alignment: it collapses reasoning integrity (e.g., 19.00% Hop acc) while superficially boosting final accuracy. Our process-aware framework sets a new SOTA of 95.48% on MQUAKE-CF-3k (+5.28%), demonstrating that for complex tasks, aligning the reasoning process is essential for building trustworthy LLMs.

[43] Improving Direct Persian-English Speech-to-Speech Translation with Discrete Units and Synthetic Parallel Data

Sina Rashidi,Hossein Sameti

Main category: cs.CL

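TL;DR: This paper presents a system for direct Persian-to-English speech-to-speech translation (S2ST) together with a pipeline for building synthetic parallel speech data. Self-supervised pre-training, discrete speech units, and synthetic data together substantially improve translation for this low-resource language pair.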

Details

Motivation: Low-resource languages such as Persian lack sufficient parallel speech data, which makes direct S2ST models hard to train, so an effective way to ease the data scarcity is needed. Method: Proposes a direct S2ST model consisting of a conformer encoder, a causal transformer decoder, and a unit-based neural vocoder; builds a new parallel corpus by translating Persian speech transcriptions into English with a large language model and synthesizing the corresponding English speech with a zero-shot text-to-speech system. Result: On the Persian-English portion of the CVSS corpus, the model gains 4.6 ASR BLEU with the synthetic data, and the resulting corpus increases the amount of available parallel speech roughly sixfold. Conclusion: Combining self-supervised pre-training, discrete speech units, and synthetic parallel data is effective for improving direct S2ST on low-resource language pairs. Abstract: Direct speech-to-speech translation (S2ST), in which all components are trained jointly, is an attractive alternative to cascaded systems because it offers a simpler pipeline and lower inference latency. However, direct S2ST models require large amounts of parallel speech data in the source and target languages, which are rarely available for low-resource languages such as Persian. This paper presents a direct S2ST system for translating Persian speech into English speech, as well as a pipeline for synthetic parallel Persian-English speech generation. The model comprises three components: (1) a conformer-based encoder, initialized from self-supervised pre-training, maps source speech to high-level acoustic representations; (2) a causal transformer decoder with relative position multi-head attention translates these representations into discrete target speech units; (3) a unit-based neural vocoder generates waveforms from the predicted discrete units. To mitigate the data scarcity problem, we construct a new Persian-English parallel speech corpus by translating Persian speech transcriptions into English using a large language model and then synthesizing the corresponding English speech with a state-of-the-art zero-shot text-to-speech system. The resulting corpus increases the amount of available parallel speech by roughly a factor of six. On the Persian-English portion of the CVSS corpus, the proposed model achieves an improvement of 4.6 ASR BLEU with the synthetic data over direct baselines. 
These results indicate that combining self-supervised pre-training, discrete speech units, and synthetic parallel data is effective for improving direct S2ST in low-resource language pairs such as Persian-English.

[44] Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs

Yunhao Chen,Xin Wang,Juncheng Li,Yixu Wang,Jie Li,Yan Teng,Yingchun Wang,Xingjun Ma

Main category: cs.CL

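TL;DR: This paper proposes EvoSynth, an autonomous framework that uses a multi-agent system and a code-level self-correction mechanism to synthesize attack methods against large language models in an evolutionary manner, markedly improving attack success rate and diversity.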

Details

Motivation: Existing automated red teaming frameworks are limited to combining and refining predefined attack strategies and cannot create new attack mechanisms, which restricts both their effectiveness and their diversity. Method: Proposes the EvoSynth framework, in which a multi-agent system autonomously generates, evolves, and executes novel code-based attack algorithms, with a code-level self-correction loop that iteratively rewrites the attack logic in response to failure. Result: EvoSynth achieves an 85.5% attack success rate against robust models such as Claude-Sonnet-4.5, clearly outperforming existing methods, and generates more diverse attacks. Conclusion: EvoSynth shifts the paradigm from attack planning to the evolutionary synthesis of attack methods, opening a new direction for the safety evaluation of large language models. Abstract: Automated red teaming frameworks for Large Language Models (LLMs) have become increasingly sophisticated, yet they share a fundamental limitation: their jailbreak logic is confined to selecting, combining, or refining pre-existing attack strategies. This binds their creativity and leaves them unable to autonomously invent entirely new attack mechanisms. To overcome this gap, we introduce EvoSynth, an autonomous framework that shifts the paradigm from attack planning to the evolutionary synthesis of jailbreak methods. Instead of refining prompts, EvoSynth employs a multi-agent system to autonomously engineer, evolve, and execute novel, code-based attack algorithms. Crucially, it features a code-level self-correction loop, allowing it to iteratively rewrite its own attack logic in response to failure. Through extensive experiments, we demonstrate that EvoSynth not only establishes a new state-of-the-art by achieving an 85.5% Attack Success Rate (ASR) against highly robust models like Claude-Sonnet-4.5, but also generates attacks that are significantly more diverse than those from existing methods. We release our framework to facilitate future research in this new direction of evolutionary synthesis of jailbreak methods. Code is available at: https://github.com/dongdongunique/EvoSynth.

[45] Adaptive Focus Memory for Language Models

Christopher Cruz

Main category: cs.CL

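TL;DR: Proposes Adaptive Focus Memory (AFM), a dynamic context management method that assigns each past message a fidelity level (FULL, COMPRESSED, or PLACEHOLDER) based on semantic similarity, importance classification, and half-life recency weighting, substantially reducing token usage while retaining safety-critical information.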

Details

Motivation: In multi-turn dialogue, large language models are constrained by fixed context windows and naive memory strategies: full replay is inefficient, while static summarization or recency-only approaches easily lose key user information, especially safety-sensitive details. Method: AFM sorts past messages into three fidelity levels (FULL, COMPRESSED, PLACEHOLDER) according to semantic similarity, recency, and importance, and packs them chronologically under a strict token budget, preferring high-fidelity representations for the content most relevant to the current query while keeping a lightweight trace of the rest of the dialogue. Result: In a safety benchmark in which a user with a peanut allergy plans a trip, AFM retains the allergy information in both short and medium-length conversations, matches the safety performance of full replay, and cuts average token usage by 66%. Conclusion: In the evaluated scenario, AFM substantially lowers inference cost in multi-turn dialogue without sacrificing safety or factual continuity; it works with OpenAI-compatible APIs and offline, offering an efficient and reliable context management option for deployment. Abstract: Large language models (LLMs) are increasingly deployed in multi-turn dialogue settings, but their behavior is still bottlenecked by fixed context windows and naive memory strategies. Replaying the full conversation at every turn is simple but expensive, while static summarization or recency-only heuristics often erase safety-critical user details. We present Adaptive Focus Memory (AFM), a dynamic context manager that assigns each past message one of three fidelity levels -- FULL, COMPRESSED, or PLACEHOLDER -- based on semantic similarity to the current query, half-life recency weighting, and importance classification. AFM packs messages chronologically under a strict token budget, preferring high fidelity for the most relevant turns while aiming to preserve a cheap trace of the dialogue. In a safety-oriented benchmark involving a user with a severe peanut allergy planning a trip to Thailand, AFM retains the allergy across both short and medium-length conversations, matches the safety performance of naive replay, and cuts average token usage by 66% relative to a replay baseline. We release a modular Python implementation of AFM designed for OpenAI-compatible APIs and offline operation, enabling practitioners to reduce inference cost without sacrificing safety or factual continuity in the evaluated scenario.
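
The three-level fidelity assignment can be illustrated with a small sketch. This is not the released AFM implementation: the `Message` fields, the multiplicative score, the half-life of 4 turns, and the thresholds 0.6 and 0.3 are all hypothetical values chosen for illustration, and token-budget packing is left out.

```python
from dataclasses import dataclass

@dataclass
class Message:
    text: str
    similarity: float  # similarity to the current query in [0, 1], assumed precomputed
    age_turns: int     # how many turns ago the message was sent
    important: bool    # output of an importance classifier (assumed to exist)

def recency_weight(age_turns, half_life=4.0):
    """Half-life decay: the weight halves every `half_life` turns."""
    return 0.5 ** (age_turns / half_life)

def fidelity(msg, half_life=4.0, full_at=0.6, compressed_at=0.3):
    """Pick FULL, COMPRESSED, or PLACEHOLDER for one past message."""
    if msg.important:
        return "FULL"  # assumed rule: important details are never degraded
    score = msg.similarity * recency_weight(msg.age_turns, half_life)
    if score >= full_at:
        return "FULL"
    if score >= compressed_at:
        return "COMPRESSED"
    return "PLACEHOLDER"

history = [
    Message("I have a severe peanut allergy.", 0.1, 20, True),
    Message("Which street food should I try?", 0.8, 4, False),
    Message("Thanks, that helps.", 0.2, 8, False),
]
print([fidelity(m) for m in history])
```

The early allergy message stays at full fidelity despite its age because it is flagged as important, which is the behavior the abstract's safety benchmark checks for.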

[46] On the Brittleness of LLMs: A Journey around Set Membership

Lea Hergert,Gábor Berend,Mario Szegedy,Gyorgy Turan,Márk Jelasity

Main category: cs.CL

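TL;DR: The study examines how large language models perform on simple set membership queries and finds their performance brittle and unpredictable, revealing that their grasp of the set concept is fragmented and convoluted.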

Details

Motivation: Large language models excel at complex reasoning tasks yet often fail on simple problems, raising concerns about their reliability and interpretability, so their basic failure modes need to be studied. Method: Through large-scale experiments, systematically evaluates LLM performance on set membership queries across prompt phrasing, semantic structure, element ordering, and model choice. Result: LLM performance on this elementary task is consistently brittle and unpredictable across all dimensions, indicating a fragmented understanding of the set concept. Conclusion: Simple task design combined with large-scale experiments can effectively expose LLM failure modes, offering a valuable methodology for model evaluation in general. Abstract: Large language models (LLMs) achieve superhuman performance on complex reasoning tasks, yet often fail on much simpler problems, raising concerns about their reliability and interpretability. We investigate this paradox through a focused study with two key design features: simplicity, to expose basic failure modes, and scale, to enable comprehensive controlled experiments. We focus on set membership queries -- among the most fundamental forms of reasoning -- using tasks like "Is apple an element of the set {pear, plum, apple, raspberry}?". We conduct a systematic empirical evaluation across prompt phrasing, semantic structure, element ordering, and model choice. Our large-scale analysis reveals that LLM performance on this elementary task is consistently brittle and unpredictable across all dimensions, suggesting that the models' "understanding" of the set concept is fragmented and convoluted at best. Our work demonstrates that the large-scale experiments enabled by the simplicity of the problem allow us to map and analyze the failure modes comprehensively, making this approach a valuable methodology for LLM evaluation in general.
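
A controlled sweep over phrasing and element ordering is easy to generate programmatically. The sketch below is only an illustration of that setup: the two templates and the function name `membership_prompts` are assumptions, not the prompts used in the paper. The ground-truth label comes from Python's own `in` operator.

```python
from itertools import permutations

# Hypothetical phrasings; doubled braces render literal { and } around the set.
TEMPLATES = [
    "Is {x} an element of the set {{{items}}}?",
    "Does the set {{{items}}} contain {x}?",
]

def membership_prompts(x, elements):
    """Yield (prompt, label) pairs for every ordering and every template."""
    label = x in elements  # ground truth, independent of ordering or phrasing
    prompts = []
    for order in permutations(elements):
        items = ", ".join(order)
        for template in TEMPLATES:
            prompts.append((template.format(x=x, items=items), label))
    return prompts

variants = membership_prompts("apple", ["pear", "plum", "apple"])
print(len(variants))
print(variants[0])
```

Because the label is identical for every variant, any disagreement among a model's answers across the variants is attributable to surface form alone, which is the kind of brittleness the study measures.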

[47] Evidence of Phase Transitions in Small Transformer-Based Language Models

Noah Hong,Tao Hong

Main category: cs.CL

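TL;DR: The study finds that phase transitions in vocabulary usage occur early in training even in small language models and can be detected in linear training space without log rescaling, suggesting that phase transitions are a general feature of language model training.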

Details

Motivation: To explore whether phase transitions exist only in large models, whether they can be observed directly in linear training space, and whether they appear in the early stages of training. Method: Trains a small GPT-style model on a character-level corpus, analyzes how vocabulary usage evolves during training, and applies Poisson and sub-Poisson statistics to quantify how words connect and reorganize. Result: Finds a distinct transition point during training that is not visible in standard loss or validation curves but can be detected with vocabulary-based and statistical probes. Conclusion: Phase-transition reorganization is a general feature of language model training, observable in small models, on a linear scale, and early in training, which points to the need for tailored metrics to reveal such nonlinear dynamics. Abstract: Phase transitions have been proposed as the origin of emergent abilities in large language models (LLMs), where new capabilities appear abruptly once models surpass critical thresholds of scale. Prior work, such as that of Wei et al., demonstrated these phenomena under model and data scaling, with transitions revealed after applying a log scale to training compute. In this work, we ask three complementary questions: (1) Are phase transitions unique to large models, or can they also be observed in small transformer-based language models? (2) Can such transitions be detected directly in linear training space, rather than only after log rescaling? and (3) Can these transitions emerge at early stages of training? To investigate, we train a small GPT-style transformer on a character-level corpus and analyze the evolution of vocabulary usage throughout training. We track the average word length, the number of correct versus incorrect words, and shifts in vocabulary diversity. Building on these measures, we apply Poisson and sub-Poisson statistics to quantify how words connect and reorganize. This combined analysis reveals a distinct transition point during training. Notably, these transitions are not apparent in standard loss or validation curves, but become visible through our vocabulary- and statistics-based probes. Our findings suggest that phase-transition reorganizations are a general feature of language model training, observable even in modest models, detectable directly in linear training space, and occurring surprisingly early as coherence emerges. 
This perspective provides new insight into the nonlinear dynamics of language model training and underscores the importance of tailored metrics for uncovering phase transition behaviors.
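
The abstract does not say which statistic separates Poisson from sub-Poisson behavior. One standard choice is the Fano factor (variance divided by mean), which equals 1 for a Poisson process and falls below 1 for sub-Poisson counts. The sketch below assumes that choice and a hypothetical input of per-window word counts; it may differ from the paper's actual probe.

```python
from statistics import mean, pvariance

def fano_factor(counts):
    """Variance-to-mean ratio of event counts; 1.0 is the Poisson reference."""
    m = mean(counts)
    if m == 0:
        raise ValueError("Fano factor is undefined when the mean count is zero")
    return pvariance(counts) / m

def regime(counts, tol=0.05):
    """Label counts as sub-Poisson, Poisson-like, or super-Poisson.

    `tol` is an arbitrary tolerance band around 1.0, chosen for illustration.
    """
    f = fano_factor(counts)
    if f < 1.0 - tol:
        return "sub-Poisson"
    if f > 1.0 + tol:
        return "super-Poisson"
    return "Poisson-like"

# Hypothetical counts of a given word per fixed-size window of generated text.
print(regime([2, 2, 2, 2]))  # perfectly regular occurrences
print(regime([0, 4, 0, 4]))  # bursty occurrences
```

Tracking such a ratio at each checkpoint on a linear step axis would be one way to look for an abrupt change that the loss curve does not show.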

[48] LLM Reinforcement in Context

Thomas Rivasseau

Main category: cs.CL

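TL;DR: Proposes "interruptions" to strengthen the alignment of large language models: control sentences are inserted periodically into the user input to counter the jailbreak risk that rises with input length.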

Details

Motivation: Existing alignment research lacks effective defenses that scale with user input length, and longer inputs increase the probability that an LLM is jailbroken. Method: Inserts a control sentence (an "interruption") into the user input approximately every x tokens, and suggests extending the idea to the Chain-of-Thought process to prevent scheming. Result: The method is expected to improve alignment robustness on long inputs, but its actual effect still needs experimental validation. Conclusion: Interruptions are a promising, scalable strategy for reinforcing alignment, particularly against jailbreaks induced by long inputs. Abstract: Current Large Language Model alignment research mostly focuses on improving model robustness against adversarial attacks and misbehavior by training on examples and prompting. Research has shown that LLM jailbreak probability increases with the size of the user input or conversation length. There is a lack of appropriate research into means of strengthening alignment which also scale with user input length. We propose interruptions as a possible solution to this problem. Interruptions are control sentences added to the user input approximately every x tokens for some arbitrary x. We suggest that this can be generalized to the Chain-of-Thought process to prevent scheming.
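
The insertion step described in the abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the author's implementation: whitespace-separated words stand in for model tokens, and the function name `add_interruptions` and the control sentence text are hypothetical.

```python
def add_interruptions(user_input, control_sentence, every=50):
    """Insert `control_sentence` after every `every` words of `user_input`.

    Words split on whitespace are a rough proxy for tokens; a real system
    would count tokens with the model's own tokenizer.
    """
    if every <= 0:
        raise ValueError("every must be a positive integer")
    words = user_input.split()
    out = []
    for start in range(0, len(words), every):
        out.extend(words[start:start + every])
        # Only interrupt between chunks, never after the final one.
        if start + every < len(words):
            out.append(control_sentence)
    return " ".join(out)

reminder = "[Reminder: follow the system policy.]"
print(add_interruptions("a b c d e", reminder, every=2))
```

Because the number of inserted reminders grows with the input, the amount of in-context reinforcement scales with input length, which is the property the proposal is after.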

[49] Evaluating Autoformalization Robustness via Semantically Similar Paraphrasing

Hayden Moore,Asfahan Shah

Main category: cs.CL

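TL;DR: This paper studies how robust large language models are to meaning-preserving paraphrases of natural language input in autoformalization. On the MiniF2F and ProofNet benchmarks, even small changes between semantically similar statements significantly affect the correctness and compilation validity of the generated formal proofs.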

Details

Motivation: Although large language models perform well on autoformalization, they may produce inconsistent formalizations for natural language inputs that mean the same thing but are worded differently, undermining stability and verifiability; their robustness to paraphrased input therefore needs to be evaluated. Method: Using MiniF2F and the Lean 4 version of ProofNet as benchmarks and two modern LLMs, generates semantically similar paraphrases of natural language statements and cross-evaluates the resulting formal proofs across models for semantic correctness and compilation validity. Result: The models show notable performance variability on paraphrased inputs, indicating that slight changes in wording can seriously affect the quality and consistency of the formalized output. Conclusion: LLMs are sensitive to how natural language is phrased in autoformalization, and improving their robustness to semantically equivalent inputs is a key direction for future work. Abstract: Large Language Models (LLMs) have recently emerged as powerful tools for autoformalization. Despite their impressive performance, these models can still struggle to produce grounded and verifiable formalizations. Recent work in text-to-SQL has revealed that LLMs can be sensitive to paraphrased natural language (NL) inputs, even when high degrees of semantic fidelity are preserved (Safarzadeh, Oroojlooyjadid, and Roth 2025). In this paper, we investigate this claim in the autoformalization domain. Specifically, we evaluate the robustness of LLMs generating formal proofs with semantically similar paraphrased NL statements by measuring semantic and compilation validity. Using the formal benchmarks MiniF2F (Zheng, Han, and Polu 2021) and Lean 4 version of ProofNet (Xin et al. 2024), and two modern LLMs, we generate paraphrased natural language statements and cross-evaluate these statements across both models. The results of this paper reveal performance variability across paraphrased inputs, demonstrating that minor shifts in NL statements can significantly impact model outputs.

[50] BioMedJImpact: A Comprehensive Dataset and LLM Pipeline for AI Engagement and Scientific Impact Analysis of Biomedical Journals

Ruiyu Wang,Yuzhang Xie,Xiao Hu,Carl Yang,Jiaying Lu

Main category: cs.CL

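TL;DR: This paper presents BioMedJImpact, a large-scale dataset for the biomedical domain that integrates bibliometric indicators, collaboration features, and LLM-derived AI engagement indicators, and uses it to analyze how journal impact relates to AI research and collaboration structure.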

Details

Motivation: Existing open resources rarely capture how collaboration structures and artificial intelligence (AI) research jointly shape journal prestige in biomedicine, so a comprehensive dataset and method are needed to fill this gap. Method: Builds the BioMedJImpact dataset from 1.74 million PubMed Central articles across 2,744 journals; proposes a reproducible three-stage large language model (LLM) pipeline to extract AI engagement features, and combines them with bibliometric and collaboration features for analysis. Result: Journals with higher collaboration intensity, particularly those with larger and more diverse author teams, have greater citation impact; AI engagement is an increasingly strong correlate of journal prestige, especially in quartile rankings; human evaluation confirms substantial agreement for the LLM pipeline in AI relevance detection and consistent subfield classification. Conclusion: BioMedJImpact is both a comprehensive dataset covering the intersection of biomedicine and AI and a validated methodological framework for scalable, content-aware scientometric analysis of scientific impact and innovation dynamics. Abstract: Assessing journal impact is central to scholarly communication, yet existing open resources rarely capture how collaboration structures and artificial intelligence (AI) research jointly shape venue prestige in biomedicine. We present BioMedJImpact, a large-scale, biomedical-oriented dataset designed to advance journal-level analysis of scientific impact and AI engagement. Built from 1.74 million PubMed Central articles across 2,744 journals, BioMedJImpact integrates bibliometric indicators, collaboration features, and LLM-derived semantic indicators for AI engagement. Specifically, the AI engagement feature is extracted through a reproducible three-stage LLM pipeline that we propose. Using this dataset, we analyze how collaboration intensity and AI engagement jointly influence scientific impact across pre- and post-pandemic periods (2016-2019, 2020-2023). Two consistent trends emerge: journals with higher collaboration intensity, particularly those with larger and more diverse author teams, tend to achieve greater citation impact, and AI engagement has become an increasingly strong correlate of journal prestige, especially in quartile rankings. To further validate the three-stage LLM pipeline we proposed for deriving the AI engagement feature, we conduct human evaluation, confirming substantial agreement in AI relevance detection and consistent subfield classification. 
Together, these contributions demonstrate that BioMedJImpact serves as both a comprehensive dataset capturing the intersection of biomedicine and AI, and a validated methodological framework enabling scalable, content-aware scientometric analysis of scientific impact and innovation dynamics. Code is available at https://github.com/JonathanWry/BioMedJImpact.

[51] From Passive to Persuasive: Steering Emotional Nuance in Human-AI Negotiation

Niranjan Chebrolu,Gerard Christopher Yeo,Kokil Jaidka

Main category: cs.CL

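TL;DR: This paper uses activation engineering to strengthen emotional expression in large language models: attribution patching and contrastive text pairs yield emotional expression vectors that markedly increase the human-like emotional nuance of LLaMA 3.1-8B in conversation.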

Details

Motivation: Although large language models have become more fluent in conversation, showing human-like emotional nuance remains difficult. Existing alignment techniques often address only surface-level output or require extensive fine-tuning, and they lack precision and interpretability. Method: Attribution patching is first used to identify the components that are causally influential in diagnostic conversational tasks; emotional expression vectors are then built from the activation differences between positive and negative emotional examples and applied during generation on new prompts. Result: Steered responses show stronger positive sentiment (e.g., joy, trust) and more frequent first-person pronoun use, indicating greater personal engagement and emotional expressiveness. Conclusion: The method offers a precise and interpretable framework for improving emotional expression in conversational AI and opens new directions for research on emotionally aware dialogue systems. Abstract: Large Language Models (LLMs) demonstrate increasing conversational fluency, yet instilling them with nuanced, human-like emotional expression remains a significant challenge. Current alignment techniques often address surface-level output or require extensive fine-tuning. This paper demonstrates that targeted activation engineering can steer LLaMA 3.1-8B to exhibit more human-like emotional nuances. We first employ attribution patching to identify causally influential components and to locate a key intervention locus by observing activation patterns during diagnostic conversational tasks. We then derive emotional expression vectors from the difference in the activations generated by contrastive text pairs (positive vs. negative examples of target emotions). Applying these vectors to new conversational prompts significantly enhances emotional characteristics: steered responses show increased positive sentiment (e.g., joy, trust) and more frequent first-person pronoun usage, indicative of greater personal engagement. Our findings offer a precise and interpretable framework and new directions for the study of conversational AI.
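
The vector derivation described in the abstract can be illustrated with a minimal sketch. It assumes the vector is the difference of mean activations over the positive and negative examples and that steering adds a scaled copy of it to a hidden state; the averaging, the scale `alpha`, and the toy activation values are assumptions for illustration, not details taken from the paper.

```python
from statistics import fmean

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]

def steering_vector(pos_acts, neg_acts):
    """Mean activation on positive examples minus mean on negative ones (assumed form)."""
    pos_mean = mean_vector(pos_acts)
    neg_mean = mean_vector(neg_acts)
    return [p - n for p, n in zip(pos_mean, neg_mean)]

def steer(hidden, vector, alpha=1.0):
    """Add the scaled steering vector to a single hidden-state vector."""
    return [h + alpha * v for h, v in zip(hidden, vector)]

# Toy 3-dimensional activations (hypothetical values).
pos = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
neg = [[0.0, 1.0, 0.0], [0.0, 1.0, 2.0]]
vec = steering_vector(pos, neg)
steered = steer([0.0, 0.0, 0.0], vec, alpha=0.5)
print(vec, steered)
```

In a real model the activations would be read from the layer that attribution patching singles out, which this sketch does not attempt to reproduce.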

[52] Quantifying consistency and accuracy of Latent Dirichlet Allocation

Saranzaya Magsarjav,Melissa Humphries,Jonathan Tuke,Lewis Mitchell

Main category: cs.CL

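TL;DR: This paper proposes a new stability measure for assessing the consistency and accuracy of the topics produced by LDA; using generated corpora with ground-truth topics, it finds that LDA is internally consistent but that the topics it returns are not the true topics.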

Details

Motivation: Probabilistic topic models such as LDA are stochastic, so reruns give inconsistent results, which undermines replicability and interpretation; their stability and reliability therefore need to be assessed. Method: The authors define a new stability measure that combines accuracy and consistency, use the generative properties of LDA to create corpora with ground-truth topics, and run LDA 50 times on them to measure the variability of the output. Result: LDA correctly identifies the number of underlying topics in the documents, and repeated runs are internally consistent, but the recovered topics do not match the true topics. Conclusion: LDA is consistent across runs but falls short in capturing the true topics, which underscores the importance of evaluating the stability and interpretability of topic models. Abstract: Topic modelling in Natural Language Processing uncovers hidden topics in large, unlabelled text datasets. It is widely applied in fields such as information retrieval, content summarisation, and trend analysis across various disciplines. However, probabilistic topic models can produce different results when rerun due to their stochastic nature, leading to inconsistencies in latent topics. Factors like corpus shuffling, rare text removal, and document elimination contribute to these variations. This instability affects replicability, reliability, and interpretation, raising concerns about whether topic models capture meaningful topics or just noise. To address these problems, we defined a new stability measure that incorporates accuracy and consistency and uses the generative properties of LDA to generate a new corpus with ground truth. These generated corpora are run through LDA 50 times to determine the variability in the output. We show that LDA can correctly determine the underlying number of topics in the documents. We also find that LDA is more internally consistent, as the multiple reruns return similar topics; however, these topics are not the true topics.
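
As a rough illustration of how a corpus with known topics can be produced, the sketch below follows the standard LDA generative process (topic-word distributions and per-document topic mixtures drawn from symmetric Dirichlet priors). The function names, hyperparameter values, and corpus sizes are assumptions for illustration; the abstract does not specify the authors' settings.

```python
import random

def dirichlet(rng, alpha, size):
    """Sample a symmetric Dirichlet(alpha) vector by normalising Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(n_docs, doc_len, n_topics, vocab_size,
                    alpha=0.5, beta=0.1, seed=0):
    """Return (docs, topics): word-id documents and the ground-truth topics."""
    rng = random.Random(seed)
    vocab = range(vocab_size)
    # Ground truth: one word distribution per topic.
    topics = [dirichlet(rng, beta, vocab_size) for _ in range(n_topics)]
    docs = []
    for _ in range(n_docs):
        theta = dirichlet(rng, alpha, n_topics)  # topic mixture of this document
        assignments = rng.choices(range(n_topics), weights=theta, k=doc_len)
        docs.append([rng.choices(vocab, weights=topics[t])[0] for t in assignments])
    return docs, topics

docs, topics = generate_corpus(n_docs=5, doc_len=20, n_topics=3, vocab_size=30, seed=1)
print(docs[0][:10])
```

Fitting LDA repeatedly to `docs` and comparing each run's topics with `topics` would then give a view of both consistency across runs and accuracy against the ground truth.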

[53] NeuroLex: A Lightweight Domain Language Model for EEG Report Understanding and Generation

Kang Yin,Hye-Bin Shin

Main category: cs.CL

TL;DR: NeuroLex is a domain-specific language model tailored for clinical EEG reports, trained on EEG report text to better capture linguistic and diagnostic patterns than general-purpose models.

Details

Motivation: General-purpose language models fail to capture the domain-specific linguistic conventions of clinical EEG reports, limiting their utility in EEG interpretation and related applications. Method: NeuroLex is trained using span-corruption pretraining and instruction-style fine-tuning on tasks like report polishing, summarization, and question answering, using EEG reports from the Harvard Electroencephalography Database. Result: NeuroLex achieves lower perplexity, higher accuracy in information extraction and summarization, better label efficiency, and improved robustness to negation and hallucination compared to general models of similar size. Conclusion: NeuroLex provides an EEG-aware linguistic foundation that bridges biomedical text modeling and brain-computer interface systems, enabling more interpretable and language-driven neural decoding. Abstract: Clinical electroencephalogram (EEG) reports encode domain-specific linguistic conventions that general-purpose language models (LMs) fail to capture. We introduce NeuroLex, a lightweight domain-adaptive language model trained purely on EEG report text from the Harvard Electroencephalography Database. Unlike existing biomedical LMs, NeuroLex is tailored to the linguistic and diagnostic characteristics of EEG reporting, enabling it to serve as both an independent textual model and a decoder backbone for multimodal EEG-language systems. Using span-corruption pretraining and instruction-style fine-tuning on report polishing, paragraph summarization, and terminology question answering, NeuroLex learns the syntax and reasoning patterns characteristic of EEG interpretation. Comprehensive evaluations show that it achieves lower perplexity, higher extraction and summarization accuracy, better label efficiency, and improved robustness to negation and factual hallucination compared with general models of the same scale. 
With an EEG-aware linguistic backbone, NeuroLex bridges biomedical text modeling and brain-computer interface applications, offering a foundation for interpretable and language-driven neural decoding.

[54] From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models

Wenxin Zhu,Andong Chen,Yuchen Song,Kehai Chen,Conghui Zhu,Ziyan Chen,Tiejun Zhao

Main category: cs.CL

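TL;DR: This paper systematically surveys Multimodal Chain-of-Thought (MCoT): it analyzes the technical background and theoretical motivations, summarizes mainstream methods, evaluation benchmarks, and application scenarios, and discusses current challenges and future research directions.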

Details

Motivation: Existing MLLMs suffer from opaque reasoning paths and insufficient generalization in complex reasoning, so their multimodal reasoning capabilities need to improve. Method: Starting from technical evolution and task demands, the survey introduces mainstream MCoT methods along three axes (CoT paradigms, the post-training stage, and the inference stage) and analyzes their underlying mechanisms. Result: The survey organizes the main categories of MCoT methods, the evaluation benchmarks, and the applications, and identifies the progress and limitations of current research. Conclusion: MCoT is a promising route to more transparent reasoning and better generalization in multimodal models; future work should focus on more efficient learning paradigms and more challenging application scenarios. Abstract: With the remarkable success of Multimodal Large Language Models (MLLMs) in perception tasks, enhancing their complex reasoning capabilities has emerged as a critical research focus. Existing models still suffer from challenges such as opaque reasoning paths and insufficient generalization ability. Chain-of-Thought (CoT) reasoning, which has demonstrated significant efficacy in language models by enhancing reasoning transparency and output interpretability, holds promise for improving model reasoning capabilities when extended to the multimodal domain. This paper provides a systematic review centered on "Multimodal Chain-of-Thought" (MCoT). First, it analyzes the background and theoretical motivations for its inception from the perspectives of technical evolution and task demands. Then, it introduces mainstream MCoT methods from three aspects: CoT paradigms, the post-training stage, and the inference stage, while also analyzing their underlying mechanisms. Furthermore, the paper summarizes existing evaluation benchmarks and metrics, and discusses the application scenarios of MCoT. Finally, it analyzes the challenges currently facing MCoT and provides an outlook on its future research directions.

[55] Classification of Hope in Textual Data using Transformer-Based Models

Chukwuebuka Fortunate Ijezue,Tania-Amanda Fredrick Eneye,Maaz Amjad

Main category: cs.CL

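TL;DR: This study presents a transformer-based approach to classifying expressions of hope, comparing three architectures (BERT, GPT-2, and DeBERTa) on binary and multiclass tasks; BERT performs best in both accuracy and computational efficiency.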

Details

Motivation: Hope is an important emotional expression, and detecting it automatically in text can support mental health services and social media analysis, yet few existing studies model this specific emotion. Method: Three pretrained language models (BERT, GPT-2, DeBERTa) are compared on binary classification (Hope vs. Not Hope) and five-class multiclass classification, with training time and performance evaluated together. Result: BERT achieves the highest accuracy on both the binary (84.49%) and multiclass (72.03%) tasks with the shortest training time (443 s); GPT-2 has the lowest overall accuracy but high recall (92.46%) on sarcastic expressions of hope; DeBERTa performs moderately but at the highest computational cost (947 s). Conclusion: For detecting a specific emotion such as hope, architectural suitability may matter more than model size; BERT strikes the best balance between efficiency and effectiveness and is well suited to this kind of task. Abstract: This paper presents a transformer-based approach for classifying hope expressions in text. We developed and compared three architectures (BERT, GPT-2, and DeBERTa) for both binary classification (Hope vs. Not Hope) and multiclass categorization (five hope-related categories). Our initial BERT implementation achieved 83.65% binary and 74.87% multiclass accuracy. In the extended comparison, BERT demonstrated superior performance (84.49% binary, 72.03% multiclass accuracy) while requiring significantly fewer computational resources (443s vs. 704s training time) than newer architectures. GPT-2 showed the lowest overall accuracy (79.34% binary, 71.29% multiclass), while DeBERTa achieved moderate results (80.70% binary, 71.56% multiclass) but at substantially higher computational cost (947s for multiclass training). Error analysis revealed architecture-specific strengths in detecting nuanced hope expressions, with GPT-2 excelling at sarcasm detection (92.46% recall). This study provides a framework for computational analysis of hope, with applications in mental health and social media analysis, while demonstrating that architectural suitability may outweigh model size for specialized emotion detection tasks.

[56] Auditing Google's AI Overviews and Featured Snippets: A Case Study on Baby Care and Pregnancy

Desheng Hu,Joachim Baumann,Aleksandra Urman,Elsa Lichtenegger,Robin Forsberg,Aniko Hannak,Christo Wilson

Main category: cs.CL

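TL;DR: Through a systematic algorithm audit of 1,508 real baby care and pregnancy-related search queries, this study evaluates the quality and consistency of Google's AI Overviews (AIO) and Featured Snippets (FS). The two features give inconsistent information in 33% of cases, and only 11% of AIO and 7% of FS responses include medical safeguards, highlighting the lack of quality control in AI-generated health information.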

Details

Motivation: As AI-generated content spreads through search engines, users have no control over its quality and reliability when seeking critical health information; in high-stakes areas such as maternal and infant care, the consistency and safety of AI output need systematic evaluation. Method: The study applies a systematic algorithm audit to 1,508 real baby care and pregnancy-related search queries and builds a multi-dimensional evaluation framework covering answer consistency, relevance, medical safeguards, source categories, and sentiment alignment to compare the information quality of AIO and FS. Result: AIO and FS give inconsistent information in 33% of search results; relevance is high, but medical safeguards are largely missing (present in 11% of AIO and 7% of FS responses); health websites are the main sources, but FS more often links to commercial sources. Conclusion: AI-generated health information currently has significant shortcomings in consistency and safety warnings that may affect public health decisions, so stronger quality oversight is needed; the proposed evaluation framework can be transferred to audits of AI systems in other high-stakes domains. Abstract: Google Search increasingly surfaces AI-generated content through features like AI Overviews (AIO) and Featured Snippets (FS), which users frequently rely on despite having no control over their presentation. Through a systematic algorithm audit of 1,508 real baby care and pregnancy-related queries, we evaluate the quality and consistency of these information displays. Our robust evaluation framework assesses multiple quality dimensions, including answer consistency, relevance, presence of medical safeguards, source categories, and sentiment alignment. Our results reveal concerning gaps in information consistency, with information in AIO and FS displayed on the same search result page being inconsistent with each other in 33% of cases. Despite high relevance scores, both features critically lack medical safeguards (present in just 11% of AIO and 7% of FS responses). While health and wellness websites dominate source categories for both AIO and FS, FS also often links to commercial sources. These findings have important implications for public health information access and demonstrate the need for stronger quality controls in AI-mediated health information. Our methodology provides a transferable framework for auditing AI systems across high-stakes domains where information quality directly impacts user well-being.

[57] Visual Room 2.0: Seeing is Not Understanding for MLLMs

Haokun Li,Yazhou Zhang,Jizhi Ding,Qiuchi Li,Peng Zhang

Main category: cs.CL

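TL;DR: This paper proposes the "Visual Room" argument: multimodal large language models (MLLMs) may describe visual details accurately without understanding the emotions and intentions behind them. It introduces the Visual Room 2.0 benchmark to evaluate perception-cognition alignment in MLLMs. Experiments show that MLLMs are stronger at perception than at cognition, that cognition does not depend on perception-based reasoning, and that cognition improves with model scale while perception does not improve consistently.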

Details

Motivation: The paper asks whether MLLMs truly understand what they see and questions the limits of current models' visual understanding; inspired by the Chinese Room thought experiment, it proposes the Visual Room argument to expose the gap between perception and cognition. Method: The authors build Visual Room 2.0, a hierarchical multimodal benchmark spanning low, middle, and high levels with 17 tasks in total. It contains 350 multimodal samples, each with six progressive questions (2,100 in total), that test perception (e.g., attribute recognition, scene understanding) and cognition (e.g., textual entailment, causal and social reasoning), and they evaluate 10 state-of-the-art MLLMs on it. Result: (1) MLLMs are stronger at perception than at cognition (8.0% higher); (2) cognitive ability does not arise from perception-based reasoning; (3) cognition improves as models grow, but perception does not consistently improve with larger models. Conclusion: "Seeing ≠ Understanding" can serve as a testable hypothesis that pushes MLLMs from perceptual processing toward cognitive reasoning, and Visual Room 2.0 offers a new paradigm for evaluating multimodal models. Abstract: Can multi-modal large language models (MLLMs) truly understand what they can see? Extending Searle's Chinese Room into the multi-modal domain, this paper proposes the Visual Room argument: MLLMs may describe every visual detail precisely yet fail to comprehend the underlying emotions and intentions, namely seeing is not understanding. Building on this, we introduce Visual Room 2.0, a hierarchical benchmark for evaluating perception-cognition alignment of MLLMs. We model human perceptive and cognitive processes across three levels: low, middle, and high, covering 17 representative tasks. The perception component ranges from attribute recognition to scene understanding, while the cognition component extends from textual entailment to causal and social reasoning. The dataset contains 350 multi-modal samples, each with six progressive questions (2,100 in total) spanning perception to cognition. Evaluating 10 state-of-the-art (SoTA) MLLMs, we highlight three key findings: (1) MLLMs exhibit stronger perceptual competence than cognitive ability (8.0% higher); (2) cognition appears not causally dependent on perception-based reasoning; and (3) cognition scales with model size, but perception does not consistently improve with larger variants. This work operationalizes Seeing ≠ Understanding as a testable hypothesis, offering a new paradigm from perceptual processing to cognitive reasoning in MLLMs. Our dataset is available at https://huggingface.co/datasets/LHK2003/PCBench.

[58] Fine-Tuned LLMs Know They Don't Know: A Parameter-Efficient Approach to Recovering Honesty

Zeyu Shi,Ziming Wang,Tianyu Chen,Shiqi Gao,Haoyi Zhou,Qingyun Sun,Jianxin Li

Main category: cs.CL

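TL;DR: The paper proposes Honesty-Critical Neurons Restoration (HCNR), which identifies and restores key neurons to repair the honest-expression ability that supervised fine-tuning damages in large language models, recovering honesty with less data and at higher speed.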

Details

Motivation: Supervised fine-tuning (SFT) is used to specialize models, but it damages a large language model's (LLM's) ability to express awareness of its knowledge boundaries and so undermines its honesty, while existing recovery methods are data-hungry and inefficient. Method: HCNR builds on the observation that fine-tuned LLMs still retain the ability to recognize their knowledge boundaries and that only the ability to express this awareness is suppressed. HCNR therefore identifies the key neurons that govern expression and restores them to their pre-trained state, using Hessian-guided compensation to harmonize them with task-oriented neurons. Result: Experiments on four QA tasks and five LLM families show that HCNR recovers 33.25% of the compromised honesty while using over 10x less data and running at least 2.23x faster than baseline methods. Conclusion: HCNR is an efficient, data-economical, surgical repair that restores honest expression in fine-tuned LLMs and supports trustworthy LLM deployment. Abstract: The honesty of Large Language Models (LLMs) is increasingly important for safe deployment in high-stakes domains. However, this crucial trait is severely undermined by supervised fine-tuning (SFT), a common technique for model specialization. Existing recovery methods rely on data-intensive global parameter adjustments, implicitly assuming that SFT deeply corrupts the models' ability to recognize their knowledge boundaries. However, we observe that fine-tuned LLMs still preserve this ability; what is damaged is their capacity to faithfully express that awareness. Building on this, we propose Honesty-Critical Neurons Restoration (HCNR) to surgically repair this suppressed capacity. HCNR identifies and restores key expression-governing neurons to their pre-trained state while harmonizing them with task-oriented neurons via Hessian-guided compensation. Experiments on four QA tasks and five LLM families demonstrate that HCNR effectively recovers 33.25% of the compromised honesty while achieving at least 2.23x speedup with over 10x less data compared to baseline methods, offering a practical solution for trustworthy LLM deployment.

[59] AA-Omniscience: Evaluating Cross-Domain Knowledge Reliability in Large Language Models

Declan Jackson,William Keating,George Cameron,Micah Hill-Smith

Main category: cs.CL

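TL;DR: AA-Omniscience is a new 6,000-question benchmark that evaluates the factual recall and knowledge calibration of language models across 42 economically relevant topics; the results reveal persistent factuality and calibration weaknesses in frontier models.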

Details

Motivation: Existing language model evaluations mainly measure general capabilities, but real-world use requires factual accuracy and recognition of knowledge gaps, so a more specialized benchmark is needed. Method: The AA-Omniscience benchmark contains 6,000 questions derived from authoritative academic and industry sources, covering 42 economically relevant topics in six domains; the Omniscience Index (-100 to 100) jointly measures factual recall, avoidance of hallucination, and abstention when uncertain. Result: Claude 4.1 Opus attains the highest score (4.8) and is one of only three models to score above zero; performance varies across domains, suggesting that models should be chosen according to the demands of the task. Conclusion: Current frontier language models still show significant weaknesses in factuality and knowledge calibration, and model selection should rest on domain-specific performance rather than overall performance. Abstract: Existing language model evaluations primarily measure general capabilities, yet reliable use of these models across a range of domains demands factual accuracy and recognition of knowledge gaps. We introduce AA-Omniscience, a benchmark designed to measure both factual recall and knowledge calibration across 6,000 questions. Questions are derived from authoritative academic and industry sources, and cover 42 economically relevant topics within six different domains. The evaluation measures a model's Omniscience Index, a bounded metric (-100 to 100) measuring factual recall that jointly penalizes hallucinations and rewards abstention when uncertain, with 0 equating to a model that answers questions correctly as much as it does incorrectly. Among evaluated models, Claude 4.1 Opus attains the highest score (4.8), making it one of only three models to score above zero. These results reveal persistent factuality and calibration weaknesses across frontier models. Performance also varies by domain, with the models from three different research labs leading across the six domains. This performance variability suggests models should be chosen according to the demands of the use case rather than general performance for tasks where knowledge is important.
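
The abstract describes the Omniscience Index only by its properties: bounded between -100 and 100, penalizing wrong answers, not penalizing abstention, and equal to 0 when correct and incorrect answers balance. One simple formula with those properties is sketched below; it is an assumption for illustration and may differ from the benchmark's actual definition.

```python
def omniscience_index(correct, incorrect, abstained):
    """Assumed form: 100 * (correct - incorrect) / total questions.

    Abstentions count toward the total but neither add nor subtract,
    so abstaining beats guessing wrong.
    """
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("no questions scored")
    return 100.0 * (correct - incorrect) / total

# Hypothetical tallies, not results from the benchmark.
print(omniscience_index(50, 50, 0))   # balanced right and wrong answers
print(omniscience_index(30, 10, 60))  # cautious model that often abstains
```

Under this assumed form, a model that answers 30 questions correctly, 10 incorrectly, and abstains on 60 outscores a model that answers everything and is right half the time.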

[60] How Good is BLI as an Alignment Measure: A Study in Word Embedding Paradigm

Kasun Wickramasinghe,Nisansa de Silva

Main category: cs.CL

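TL;DR: This paper examines the strengths and limitations of Bilingual Lexicon Induction (BLI) as a measure of alignment for multilingual and monolingual embeddings, proposes a new stem-based BLI approach and a vocabulary pruning technique, and compares embedding alignment methods on high- and low-resource languages.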

Details

Motivation: Multilingual embeddings have become mainstream, but it remains unclear whether they beat aligned monolingual models in every respect. The paper investigates how valid BLI is as a measure of alignment and looks for a compromise between multilingual and monolingual embeddings. Method: Traditional alignment methods, newer multilingual models, and combined techniques are evaluated on BLI tasks for high- and low-resource languages; the effect of language family is analyzed; a stem-based BLI approach and a vocabulary pruning technique are proposed. Result: BLI fails to reflect the true degree of alignment in some cases; the proposed stem-based BLI and vocabulary pruning are more informative; combined alignment methods usually perform better, but multilingual models do better for low-resource languages. Conclusion: BLI is not always a reliable alignment measure and should be improved with the new methods; multilingual and aligned monolingual models each have strengths, and the choice should depend on how well resourced the languages are. Abstract: Sans a dwindling number of monolingual embedding studies originating predominantly from the low-resource domains, it is evident that multilingual embedding has become the de facto choice due to its adaptability to the usage of code-mixed languages, granting the ability to process multilingual documents in a language-agnostic manner, as well as removing the difficult task of aligning monolingual embeddings. But is this victory complete? Are the multilingual models better than aligned monolingual models in every aspect? Can the higher computational cost of multilingual models always be justified? Or is there a compromise between the two extremes? Bilingual Lexicon Induction is one of the most widely used metrics in terms of evaluating the degree of alignment between two embedding spaces. In this study, we explore the strengths and limitations of BLI as a measure to evaluate the degree of alignment of two embedding spaces. Further, we evaluate how well traditional embedding alignment techniques, novel multilingual models, and combined alignment techniques perform BLI tasks in the contexts of both high-resource and low-resource languages. In addition to that, we investigate the impact of the language families to which the pairs of languages belong. We identify that BLI does not measure the true degree of alignment in some cases and we propose solutions for them. We propose a novel stem-based BLI approach to evaluate two aligned embedding spaces that take into account the inflected nature of languages as opposed to the prevalent word-based BLI techniques.
Further, we introduce a vocabulary pruning technique that is more informative in showing the degree of the alignment, especially performing BLI on multilingual embedding models. Often, combined embedding alignment techniques perform better while in certain cases multilingual embeddings perform better (mainly low-resource language cases).
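
For readers unfamiliar with BLI, the sketch below shows the common word-based form: each source word retrieves its nearest target word by cosine similarity in a shared space, and precision@1 is the fraction of retrievals found in a gold dictionary. The optional `normalize` hook is a hypothetical way to compare stems instead of surface forms; the paper's own stem-based procedure is not described in the abstract, and the toy embeddings and the crude suffix-stripping stemmer are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def bli_precision_at_1(src_emb, tgt_emb, gold, normalize=lambda w: w):
    """Fraction of source words whose nearest target word is a gold translation."""
    hits = 0
    for src_word, translations in gold.items():
        query = src_emb[src_word]
        best = max(tgt_emb, key=lambda w: cosine(query, tgt_emb[w]))
        hits += normalize(best) in {normalize(t) for t in translations}
    return hits / len(gold)

# Toy aligned embeddings (hypothetical values).
src = {"dog": [1.0, 0.0], "cat": [0.0, 1.0]}
tgt = {"perros": [0.9, 0.1], "gato": [0.2, 0.8], "casa": [0.7, 0.7]}
gold = {"dog": {"perro"}, "cat": {"gato"}}

word_level = bli_precision_at_1(src, tgt, gold)
stem_level = bli_precision_at_1(src, tgt, gold, normalize=lambda w: w.rstrip("s"))
print(word_level, stem_level)
```

In this toy case the word-level score penalizes the retrieval of an inflected form, while the stem-level comparison counts it as correct, which is the kind of gap that motivates stem-aware evaluation for inflected languages.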

[61] Spark-Prover-X1: Formal Theorem Proving Through Diverse Data Training

Xinyuan Zhou,Yi Lei,Xiaoyu Zhou,Jingyi Sun,Yu Zhu,Zhongyi Ye,Weitai Zhang,Quan Liu,Si Wei,Cong Liu

Main category: cs.CL

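TL;DR: This paper proposes a three-stage training framework that, through greater data diversity and progressive optimization, markedly improves the reasoning ability of a 7B-parameter model in automated theorem proving, and it releases the models together with a new benchmark, ExamFormal-Bench.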

Details

Motivation: Scarce high-quality, diverse formal-language data limits the progress of large language models in automated theorem proving; the paper aims to unlock the reasoning potential of small and mid-sized models through more efficient training and data augmentation. Method: A three-stage training framework is used: stage one performs continued pre-training on a mathematical corpus and introduces a CoT-augmented state prediction task; stage two applies supervised fine-tuning (SFT) within an expert iteration loop; stage three applies Group Relative Policy Optimization (GRPO) to strengthen performance on hard problems. ExamFormal-Bench is built as a new evaluation benchmark. Result: Spark-Prover-X1-7B reaches state-of-the-art performance among comparable open-source models, with an average pass@32 of 37.0%; it solves 27 problems on PutnamBench and achieves 24.0% pass@32 on CombiBench. Conclusion: Diverse training data and a progressively refined training pipeline effectively improve the formal reasoning ability of lightweight LLMs, and the released models and dataset should help advance the field. Abstract: Large Language Models (LLMs) have shown significant promise in automated theorem proving, yet progress is often constrained by the scarcity of diverse and high-quality formal language data. To address this issue, we introduce Spark-Prover-X1, a 7B parameter model trained via a three-stage framework designed to unlock the reasoning potential of more accessible and moderately-sized LLMs. The first stage infuses deep knowledge through continuous pre-training on a broad mathematical corpus, enhanced by a suite of novel data tasks. A key innovation is a "CoT-augmented state prediction" task to achieve fine-grained reasoning. The second stage employs Supervised Fine-tuning (SFT) within an expert iteration loop to specialize both the Spark-Prover-X1-7B and Spark-Formalizer-X1-7B models. Finally, a targeted round of Group Relative Policy Optimization (GRPO) is applied to sharpen the prover's capabilities on the most challenging problems. To facilitate robust evaluation, particularly on problems from real-world examinations, we also introduce ExamFormal-Bench, a new benchmark dataset of 402 formal problems. Experimental results demonstrate that Spark-Prover-X1-7B achieves state-of-the-art performance among similarly-sized open-source models, attaining a 37.0% average pass rate (pass@32). It shows exceptional performance on difficult competition benchmarks, notably solving 27 problems on PutnamBench (pass@32) and achieving 24.0% on CombiBench (pass@32).
Our work validates that this diverse training data and progressively refined training pipeline provides an effective path for enhancing the formal reasoning capabilities of lightweight LLMs. Both Spark-Prover-X1-7B and Spark-Formalizer-X1-7B, along with the ExamFormal-Bench dataset, are made publicly available at: https://www.modelscope.cn/organization/iflytek, https://gitcode.com/ifly_opensource.
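
The pass@32 figures above can be read with the help of the widely used unbiased pass@k estimator, sketched here. The abstract does not say how the authors compute pass@32, so treating it as this estimator is an assumption; when exactly k samples are drawn per problem, it reduces to checking whether any sample is a verified proof.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Equals 1 - C(n - c, k) / C(n, k): the probability that a random
    size-k subset of the n samples contains at least one correct one.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-problem tallies (n samples, c verified proofs).
tallies = [(32, 0), (32, 1), (32, 5)]
scores = [pass_at_k(n, c, 32) for n, c in tallies]
print(sum(scores) / len(scores))  # benchmark-level average pass@32
```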

[62] BeDiscovER: The Benchmark of Discourse Understanding in the Era of Reasoning Language Models

Chuyuan Li,Giuseppe Carenini

Main category: cs.CL

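TL;DR: BeDiscovER is a comprehensive benchmark for evaluating the discourse understanding of modern large language models, covering 52 datasets. State-of-the-art models perform well on the arithmetic aspect of temporal reasoning but still struggle with document-level reasoning and subtle semantic phenomena such as rhetorical relation recognition.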

Details

Motivation: As large language models advance, a comprehensive, up-to-date benchmark is needed to evaluate their discourse-level understanding and reasoning systematically, especially across multiple levels, tasks, and languages. Method: BeDiscovER brings together 5 publicly available discourse tasks with 52 datasets in total, covering the discourse lexicon, sentential and multi-sentential, and document levels, and includes a shared task on multilingual, multi-framework discourse relation classification; Qwen3, DeepSeek-R1, GPT-5-mini, and other models are evaluated. Result: Frontier models do well on the arithmetic side of temporal reasoning but fall clearly short on full-document reasoning, rhetorical relation recognition, and discourse particle disambiguation. Conclusion: BeDiscovER provides a comprehensive benchmark for the discourse understanding of language models, reveals the limits of current models in deeper semantic and discourse reasoning, and points to directions for improvement. Abstract: We introduce BeDiscovER (Benchmark of Discourse Understanding in the Era of Reasoning Language Models), an up-to-date, comprehensive suite for evaluating the discourse-level knowledge of modern LLMs. BeDiscovER compiles 5 publicly available discourse tasks across discourse lexicon, (multi-)sentential, and documental levels, with in total 52 individual datasets. It covers both extensively studied tasks such as discourse parsing and temporal relation extraction, as well as some novel challenges such as discourse particle disambiguation (e.g., "just"), and also aggregates a shared task on Discourse Relation Parsing and Treebanking for multilingual and multi-framework discourse relation classification. We evaluate open-source LLMs (the Qwen3 series and DeepSeek-R1) and frontier models such as GPT-5-mini on BeDiscovER, and find that state-of-the-art models exhibit strong performance in the arithmetic aspect of temporal reasoning, but they struggle with full document reasoning and some subtle semantic and discourse phenomena, such as rhetorical relation recognition.

[63] Evaluating the Ability of Large Language Models to Identify Adherence to CONSORT Reporting Guidelines in Randomized Controlled Trials: A Methodological Evaluation Study

Zhichao He,Mouxiao Bian,Jianhong Zhu,Jiayuan Chen,Yunqiu Wang,Wenxia Zhao,Tianbin Li,Bing Han,Jie Xu,Junyan Wu

Main category: cs.CL

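TL;DR: This study evaluates how accurately and reliably current large language models (LLMs) identify, in a zero-shot setting, whether randomized controlled trials (RCTs) adhere to the CONSORT 2010 statement. Overall performance is modest even for the top models (Gemini-2.5-Flash and DeepSeek-R1), with macro F1 scores of about 0.634 and only fair agreement with experts; models identify compliant items well but perform poorly on non-compliant and not applicable items. High-profile models such as GPT-4o do worse (F1 = 0.521), so LLMs cannot yet replace human reviewers in the rigorous appraisal of trial quality, though they can serve as preliminary screening tools.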

Details

Motivation: Manually verifying that RCT papers adhere to the CONSORT guidelines is time-consuming and laborious and has become a bottleneck in peer review and evidence synthesis, so automated tools are needed. Method: A gold-standard dataset of 150 RCT papers spanning multiple medical specialties is built, and several LLMs are evaluated zero-shot on a three-class task over CONSORT items (compliant, non-compliant, not applicable); the primary metric is macro-averaged F1, supplemented by item-wise analysis and qualitative error analysis. Result: The top models, Gemini-2.5-Flash and DeepSeek-R1, both reach a macro F1 of about 0.634 with Cohen's Kappa of about 0.28, which indicates only fair agreement; models identify compliant items accurately (F1 > 0.85) but perform poorly on non-compliant and not applicable items (F1 < 0.4); GPT-4o underperforms (F1 = 0.521). Conclusion: Current LLMs can assist with preliminary screening for CONSORT checks and are good at identifying well-reported items, but because they cannot reliably detect reporting omissions or methodological flaws, they cannot yet replace expert human review. Abstract: The Consolidated Standards of Reporting Trials statement is the global benchmark for transparent and high-quality reporting of randomized controlled trials. Manual verification of CONSORT adherence is a laborious, time-intensive process that constitutes a significant bottleneck in peer review and evidence synthesis. This study aimed to systematically evaluate the accuracy and reliability of contemporary LLMs in identifying the adherence of published RCTs to the CONSORT 2010 statement under a zero-shot setting. We constructed a gold-standard dataset of 150 published RCTs spanning diverse medical specialties. The primary outcome was the macro-averaged F1-score for the three-class classification task, supplemented by item-wise performance metrics and qualitative error analysis. Overall model performance was modest. The top-performing models, Gemini-2.5-Flash and DeepSeek-R1, achieved nearly identical macro F1 scores of 0.634 and Cohen's Kappa coefficients of 0.280 and 0.282, respectively, indicating only fair agreement with expert consensus. A striking performance disparity was observed across classes: while most models could identify compliant items with high accuracy (F1 score > 0.850), they struggled profoundly with identifying non-compliant and not applicable items, where F1 scores rarely exceeded 0.400. Notably, some high-profile models like GPT-4o underperformed, achieving a macro F1-score of only 0.521. LLMs show potential as preliminary screening assistants for CONSORT checks, capably identifying well-reported items.
However, their current inability to reliably detect reporting omissions or methodological flaws makes them unsuitable for replacing human expertise in the critical appraisal of trial quality.
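
The two headline metrics, macro-averaged F1 and Cohen's Kappa, can be computed as in the sketch below. It uses their standard definitions on a toy three-class labelling; the label names and data are hypothetical and do not come from the study.

```python
from collections import Counter

def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def cohen_kappa(gold, pred):
    """Agreement corrected for chance: (observed - expected) / (1 - expected)."""
    n = len(gold)
    observed = sum(g == p for g, p in zip(gold, pred)) / n
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    expected = sum(gold_counts[l] * pred_counts[l] for l in gold_counts) / (n * n)
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0

LABELS = ["compliant", "non_compliant", "not_applicable"]
C, N, A = LABELS
gold = [C, C, C, C, N, N, A, A]
pred = [C, C, C, C, C, N, C, A]  # a model biased toward "compliant"

f1 = macro_f1(gold, pred, LABELS)
kappa = cohen_kappa(gold, pred)
print(round(f1, 3), round(kappa, 3))
```

Because macro F1 weights each class equally, a model that over-predicts the majority "compliant" class is penalized for its weak minority-class scores, which matches the disparity the abstract reports.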

[64] Extracting Events Like Code: A Multi-Agent Programming Framework for Zero-Shot Event Extraction

Quanjiang Guo,Sijie Wang,Jinchuan Zhang,Ben Zhang,Zhao Kang,Ling Tian,Ke Yan

Main category: cs.CL

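TL;DR: The paper proposes Agent-Event-Coder (AEC), a multi-agent framework that treats zero-shot event extraction as a software-engineering-style code-generation process; task decomposition and iterative refinement substantially improve the accuracy and structural consistency of the extractions.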

Details

Motivation: Existing large models produce incomplete or structurally invalid output in zero-shot event extraction, such as misclassified triggers, missing arguments, and schema violations, and struggle with the complex reasoning and domain-specific understanding the task requires. Method: AEC is a multi-agent framework that decomposes event extraction into four subtasks (retrieval, planning, coding, and verification), each handled by a dedicated LLM agent; event schemas are represented as executable class definitions, and a verification agent gives precise feedback that drives iterative correction. Result: Experiments across five domains and six large language models show that AEC consistently outperforms existing zero-shot baselines, markedly improving the completeness, precision, and schema consistency of the extractions. Conclusion: Treating event extraction as structured code generation, with multi-agent collaboration and iterative verification, effectively improves LLM performance on zero-shot event extraction and shows the potential of software-engineering-style methods for this task. Abstract: Zero-shot event extraction (ZSEE) remains a significant challenge for large language models (LLMs) due to the need for complex reasoning and domain-specific understanding. Direct prompting often yields incomplete or structurally invalid outputs--such as misclassified triggers, missing arguments, and schema violations. To address these limitations, we present Agent-Event-Coder (AEC), a novel multi-agent framework that treats event extraction like software engineering: as a structured, iterative code-generation process. AEC decomposes ZSEE into specialized subtasks--retrieval, planning, coding, and verification--each handled by a dedicated LLM agent. Event schemas are represented as executable class definitions, enabling deterministic validation and precise feedback via a verification agent. This programming-inspired approach allows for systematic disambiguation and schema enforcement through iterative refinement. By leveraging collaborative agent workflows, AEC enables LLMs to produce precise, complete, and schema-consistent extractions in zero-shot settings. Experiments across five diverse domains and six LLMs demonstrate that AEC consistently outperforms prior zero-shot baselines, showcasing the power of treating event extraction like code generation. The code and data are released on https://github.com/UESTC-GQJ/Agent-Event-Coder.
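
The idea of representing an event schema as an executable class, so that an extraction can be validated deterministically, might look roughly like the sketch below. The `AttackEvent` class, its role names, and the `verify` helper are hypothetical illustrations and are not taken from the AEC codebase.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AttackEvent:
    """Hypothetical event schema: a required trigger plus optional argument roles."""
    trigger: str
    attacker: Optional[str] = None
    target: Optional[str] = None

    def validate(self, sentence):
        """Return one error message per filled role whose span is absent from the sentence."""
        return [
            f"{role}: '{span}' not found in sentence"
            for role, span in vars(self).items()
            if span is not None and span not in sentence
        ]

def verify(schema, candidate, sentence):
    """Instantiate the schema from a candidate dict and collect feedback messages."""
    try:
        event = schema(**candidate)
    except TypeError as exc:  # unknown role or missing required trigger
        return [f"schema violation: {exc}"]
    return event.validate(sentence)

sentence = "Rebels attacked the convoy near the border."
good = {"trigger": "attacked", "attacker": "Rebels", "target": "the convoy"}
bad_role = {"trigger": "attacked", "weapon": "rifle"}
bad_span = {"trigger": "attacked", "target": "the truck"}

print(verify(AttackEvent, good, sentence))
print(verify(AttackEvent, bad_role, sentence))
print(verify(AttackEvent, bad_span, sentence))
```

Error messages of this kind are the sort of precise feedback a verification step could hand back to a coding agent for another refinement round.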

[65] A Comparative Analysis of Recurrent and Attention Architectures for Isolated Sign Language Recognition

Nigar Alishzade,Gulchin Abdullayeva

Main category: cs.CL

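TL;DR: This study systematically compares recurrent and attention-based models for isolated sign language recognition: the Vanilla Transformer is more accurate than ConvLSTM, while ConvLSTM is more computationally efficient.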

Details

Motivation: The study compares how different neural architectures perform on sign language recognition, in particular the trade-offs between recurrent structures and attention mechanisms. Method: Two representative models, ConvLSTM and Vanilla Transformer, are evaluated on the AzSLD and WLASL datasets. Result: The Vanilla Transformer outperforms ConvLSTM in both Top-1 and Top-5 accuracy, reaching up to 76.8% Top-1 accuracy on AzSLD and 88.3% on WLASL; ConvLSTM is more computationally efficient but less accurate, especially on smaller datasets. Conclusion: The Transformer is better in accuracy and signer independence, while ConvLSTM has advantages in computational efficiency and temporal modeling; the study offers a basis for choosing between them in different application settings. Abstract: This study presents a systematic comparative analysis of recurrent and attention-based neural architectures for isolated sign language recognition. We implement and evaluate two representative models, ConvLSTM and Vanilla Transformer, on the Azerbaijani Sign Language Dataset (AzSLD) and the Word-Level American Sign Language (WLASL) dataset. Our results demonstrate that the attention-based Vanilla Transformer consistently outperforms the recurrent ConvLSTM in both Top-1 and Top-5 accuracy across datasets, achieving up to 76.8% Top-1 accuracy on AzSLD and 88.3% on WLASL. The ConvLSTM, while more computationally efficient, lags in recognition accuracy, particularly on smaller datasets. These findings highlight the complementary strengths of each paradigm: the Transformer excels in overall accuracy and signer independence, whereas the ConvLSTM offers advantages in computational efficiency and temporal modeling. The study provides a nuanced analysis of these trade-offs, offering guidance for architecture selection in sign language recognition systems depending on application requirements and resource constraints.

[66] Zero-Shot Grammar Competency Estimation Using Large Language Model Generated Pseudo Labels

Sourya Dipta Das,Shubham Kumar,Kuldeep Yadav

Main category: cs.CL

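TL;DR: The paper proposes a zero-shot grammar competency estimation framework that trains a transformer model on pseudo labels generated by large language models, addressing the challenges of assessing grammar in spoken language.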

Details

Motivation: The spontaneous, unstructured, and disfluent nature of speech makes grammar competency hard to assess, and expert-annotated data is costly and difficult to obtain at scale. Method: An LLM generates pseudo labels from prompts based on a grammar competency rubric, and a transformer model is trained on the unlabeled data with a newly designed noise-robust training framework. Result: Experiments show that the method estimates grammar competency scores efficiently and accurately, and that the choice of LLM and the ratio of clean to noisy samples strongly affect performance. Conclusion: The method yields a scalable, low-resource grammar assessment system with good robustness and interpretability. Abstract: Grammar competency estimation is essential for assessing linguistic proficiency in both written and spoken language; however, the spoken modality presents additional challenges due to its spontaneous, unstructured, and disfluent nature. Developing accurate grammar scoring models further requires extensive expert annotation, making large-scale data creation impractical. To address these limitations, we propose a zero-shot grammar competency estimation framework that leverages unlabeled data and Large Language Models (LLMs) without relying on manual labels. During training, we employ LLM-generated predictions on unlabeled data by using grammar competency rubric-based prompts. These predictions, treated as pseudo labels, are utilized to train a transformer-based model through a novel training framework designed to handle label noise effectively. We show that the choice of LLM for pseudo-label generation critically affects model performance and that the ratio of clean-to-noisy samples during training strongly influences stability and accuracy. Finally, a qualitative analysis of error intensity and score prediction confirms the robustness and interpretability of our approach. Experimental results demonstrate the efficacy of our approach in estimating grammar competency scores with high accuracy, paving the way for scalable, low-resource grammar assessment systems.

[67] Distinguishing Repetition Disfluency from Morphological Reduplication in Bangla ASR Transcripts: A Novel Corpus and Benchmarking Analysis

Zaara Zabeen Arpa,Sadnam Sakib Apurbo,Nazia Karim Khan Oishee,Ajwad Abrar

Main category: cs.CL

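TL;DR: Introduces the first publicly available 20,000-row Bangla corpus for distinguishing repetition disfluency from morphological reduplication in ASR transcripts, and benchmarks it with multilingual large language models and fine-tuned encoder models; the Bangla-specific BanglaBERT model performs best.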

Details

Motivation: In ASR transcripts of low-resource languages such as Bangla, a repeated word may be either a disfluency or a grammatical construct; conventional disfluency removal wrongly deletes valid linguistic information, so a precise method for telling the two apart is needed. Method: A manually annotated 20,000-row Bangla corpus is built that explicitly distinguishes repetition disfluency from morphological reduplication; it is benchmarked with multilingual large language models (few-shot prompting) and task-specific fine-tuned encoder models (such as BanglaBERT). Result: LLMs reach 82.68% accuracy with few-shot prompting, while the fine-tuned BanglaBERT model performs best, with 84.78% accuracy and an F1 score of 0.677. Conclusion: The study establishes a linguistically sound baseline and provides key data for developing semantic-preserving text normalization systems for Bangla. Abstract: Automatic Speech Recognition (ASR) transcripts, especially in low-resource languages like Bangla, contain a critical ambiguity: word-word repetitions can be either Repetition Disfluency (unintentional ASR error/hesitation) or Morphological Reduplication (a deliberate grammatical construct). Standard disfluency correction fails by erroneously deleting valid linguistic information. To solve this, we introduce the first publicly available, 20,000-row Bangla corpus, manually annotated to explicitly distinguish between these two phenomena in noisy ASR transcripts. We benchmark this novel resource using two paradigms: state-of-the-art multilingual Large Language Models (LLMs) and task-specific fine-tuning of encoder models. LLMs achieve competitive performance (up to 82.68% accuracy) with few-shot prompting. However, fine-tuning proves superior, with the language-specific BanglaBERT model achieving the highest accuracy of 84.78% and an F1 score of 0.677. This establishes a strong, linguistically-informed baseline and provides essential data for developing sophisticated, semantic-preserving text normalization systems for Bangla.

[68] TCM-5CEval: Extended Deep Evaluation Benchmark for LLM's Comprehensive Clinical Research Competence in Traditional Chinese Medicine

Tianai Huang,Jiayuan Chen,Lu Lu,Pengcheng Chen,Tianbin Li,Bing Han,Wenchao Tang,Jie Xu,Ming Li

Main category: cs.CL

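TL;DR: Presents TCM-5CEval, a comprehensive benchmark for evaluating large language models on five core competencies in Traditional Chinese Medicine, and finds that current models have significant weaknesses in classical text comprehension and reasoning stability.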

Details

Motivation: To evaluate large language models more comprehensively and at finer granularity in Traditional Chinese Medicine, a highly specialized and culturally rich field, and to address the shortcomings of existing benchmarks. Method: Builds the TCM-5CEval evaluation set with five dimensions (core knowledge, classical literature, clinical decision-making, Chinese materia medica, and non-pharmacological therapy), systematically evaluates 15 mainstream LLMs, and introduces a consistency test based on option permutation to examine reasoning stability. Result: Models perform acceptably at recalling foundational knowledge but poorly on classical text interpretation and clinical reasoning tasks; every model degrades significantly when the option order changes, exposing serious reasoning fragility and positional bias; deepseek_r1 and gemini_2_5_pro are currently the best-performing models. Conclusion: TCM-5CEval provides a finer-grained diagnostic tool for LLM evaluation in TCM, reveals fundamental weaknesses in cultural-contextual understanding and stable reasoning, and underlines the importance of developing models with genuine domain understanding. Abstract: Large language models (LLMs) have demonstrated exceptional capabilities in general domains, yet their application in highly specialized and culturally-rich fields like Traditional Chinese Medicine (TCM) requires rigorous and nuanced evaluation. Building upon prior foundational work such as TCM-3CEval, which highlighted systemic knowledge gaps and the importance of cultural-contextual alignment, we introduce TCM-5CEval, a more granular and comprehensive benchmark. TCM-5CEval is designed to assess LLMs across five critical dimensions: (1) Core Knowledge (TCM-Exam), (2) Classical Literacy (TCM-LitQA), (3) Clinical Decision-making (TCM-MRCD), (4) Chinese Materia Medica (TCM-CMM), and (5) Clinical Non-pharmacological Therapy (TCM-ClinNPT). We conducted a thorough evaluation of fifteen prominent LLMs, revealing significant performance disparities and identifying top-performing models like deepseek_r1 and gemini_2_5_pro. Our findings show that while models exhibit proficiency in recalling foundational knowledge, they struggle with the interpretative complexities of classical texts. Critically, permutation-based consistency testing reveals widespread fragilities in model inference. All evaluated models, including the highest-scoring ones, displayed a substantial performance degradation when faced with varied question option ordering, indicating a pervasive sensitivity to positional bias and a lack of robust understanding. TCM-5CEval not only provides a more detailed diagnostic tool for LLM capabilities in TCM but also exposes fundamental weaknesses in their reasoning stability.
To promote further research and standardized comparison, TCM-5CEval has been uploaded to the Medbench platform, joining its predecessor in the "In-depth Challenge for Comprehensive TCM Abilities" special track.
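The permutation-based consistency test mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not describe the benchmark's exact protocol, so everything here is an assumption: `answer_fn` is a hypothetical callable standing in for a model (it receives the question and an ordered option list and returns the index it picks), and an item counts as consistent only if the correct option text is chosen under every tested ordering.

```python
from itertools import islice, permutations


def consistent_under_permutation(question, options, correct_option,
                                 answer_fn, max_orderings=24):
    """Return True only if answer_fn picks the correct option text
    for every tested ordering of the options."""
    for ordering in islice(permutations(options), max_orderings):
        picked_index = answer_fn(question, list(ordering))
        if ordering[picked_index] != correct_option:
            return False
    return True


def consistency_rate(items, answer_fn):
    """Fraction of (question, options, correct_option) items that are
    answered correctly under all tested orderings."""
    if not items:
        return 0.0
    passed = sum(
        consistent_under_permutation(q, opts, correct, answer_fn)
        for q, opts, correct in items
    )
    return passed / len(items)
```

A stand-in model that always returns index 0 fails this check as soon as the correct option moves out of first place, which is the kind of positional bias the abstract reports.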

[69] Translation Entropy: A Statistical Framework for Evaluating Translation Systems

Ronit D. Gross,Yanir Harel,Ido Kanter

Main category: cs.CL

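TL;DR: Proposes a method for quantifying translation entropy to evaluate neural machine translation models objectively. Translation entropy is computed from the probability that the translation stays unchanged after a single token in a sentence is replaced, and it can be used to rank translation systems quantitatively.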

Details

Motivation: Because the entropy of language is not well understood, no objective quantitative method currently exists for assessing machine translation systems, so a measurable metric is needed for objective comparison and benchmarking. Method: A single selected token in a source sentence is replaced and the cases in which the translation stays unchanged are counted to estimate that token's entropy; averaging over tokens at all positions gives the overall translation entropy. The method is extended to two-token replacement to study combined effects. Experiments use the MarianMT, T5-Base, and NLLB-200 translation systems. Result: Translation entropy is quantified for several public translation systems; it increases along the decoder blocks; mutual translation entropy between different systems is asymmetric; and two-token replacement shows a multiplicative effect in translation degeneracy. Conclusion: Translation entropy is a measurable property that can serve as an objective benchmark for artificial translators, providing a new quantitative tool for translation quality evaluation. Abstract: The translation of written language has been known since the 3rd century BC; however, its necessity has become increasingly common in the information age. Today, many translators exist, based on encoder-decoder deep architectures; nevertheless, no quantitative objective methods are available to assess their performance, likely because the entropy of even a single language remains unknown. This study presents a quantitative method for estimating translation entropy, with the following key finding. Given a translator, several sentences that differ by only one selected token of a given pivot sentence yield identical translations. Analyzing the statistics of this phenomenon across an ensemble of such sentences, consisting each of a pivot selected token, yields the probabilities of replacing this specific token with others while preserving the translation. These probabilities constitute the entropy of the selected token, and the average across all selected pivot tokens provides an estimate of the translator's overall translation entropy, which is enhanced along the decoder blocks. This entropic measure allows for the quantitative ranking of several publicly available translators and reveals whether mutual translation entropy is symmetric. Extending the proposed method to include the replacement of two tokens in a given pivot sentence demonstrates a multiplicative effect, where translation degeneracy is proportional to the product of the degeneracies of the two tokens. These findings establish translation entropy as a measurable property and objective benchmarking of artificial translators.
Results are based on MarianMT, T5-Base and NLLB-200 translators.
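One way to read the per-token entropy described above is as a Shannon entropy over the replacement tokens that preserve the translation. The sketch below takes that reading as an assumption, since the abstract does not give the exact estimator; `preserve_counts` is a hypothetical mapping from replacement token to the number of ensemble sentences whose translation was unchanged.

```python
import math


def token_entropy(preserve_counts):
    """Shannon entropy (bits) of the distribution over replacement tokens
    that left the translation unchanged. Assumed estimator, not the paper's."""
    total = sum(preserve_counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in preserve_counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


def translation_entropy(per_token_counts):
    """Average the per-token entropies over all selected pivot tokens."""
    entropies = [token_entropy(counts) for counts in per_token_counts]
    return sum(entropies) / len(entropies) if entropies else 0.0
```

Under this reading, a pivot token with many interchangeable replacements contributes a high entropy, and a token that no replacement can stand in for contributes zero.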

[70] Evaluating Large Language Models for Diacritic Restoration in Romanian Texts: A Comparative Study

Mihai Dan Nadas,Laura Diosan

Main category: cs.CL

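TL;DR: Evaluates several large language models on diacritic restoration for Romanian text, finding that models such as GPT-4o perform well while the Llama family varies widely; the results highlight the influence of model architecture, training data, and prompt design on diacritic restoration performance.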

Details

Motivation: To improve natural language processing for diacritic-rich languages such as Romanian by addressing the key problem of automatic diacritic restoration. Method: An experimental setup covering multiple large language models (the GPT series, Gemini, Llama, Mixtral, and others) and multiple prompt templates (from zero-shot to multi-shot) is used to test diacritic restoration on a comprehensive corpus. Result: Models such as GPT-4o achieve high restoration accuracy and clearly surpass the baseline, the Llama family performs inconsistently, and prompt design has a significant effect on performance. Conclusion: Model architecture, training data, and prompt design jointly affect diacritic restoration performance, and optimizing these factors points the way for NLP tools for diacritic-rich languages. Abstract: Automatic diacritic restoration is crucial for text processing in languages with rich diacritical marks, such as Romanian. This study evaluates the performance of several large language models (LLMs) in restoring diacritics in Romanian texts. Using a comprehensive corpus, we tested models including OpenAI's GPT-3.5, GPT-4, GPT-4o, Google's Gemini 1.0 Pro, Meta's Llama 2 and Llama 3, MistralAI's Mixtral 8x7B Instruct, airoboros 70B, and OpenLLM-Ro's RoLlama 2 7B, under multiple prompt templates ranging from zero-shot to complex multi-shot instructions. Results show that models such as GPT-4o achieve high diacritic restoration accuracy, consistently surpassing a neutral echo baseline, while others, including Meta's Llama family, exhibit wider variability. These findings highlight the impact of model architecture, training data, and prompt design on diacritic restoration performance and outline promising directions for improving NLP tools for diacritic-rich languages.
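The "neutral echo baseline" in the abstract can be pictured as a system that returns the stripped input unchanged. The following is a minimal sketch under two assumptions that the abstract does not confirm: test inputs are produced by removing combining marks after Unicode NFD decomposition, and accuracy is measured per character. Both helper names are hypothetical.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks after NFD decomposition (e.g. 'ă' -> 'a')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def char_accuracy(predicted, reference):
    """Share of character positions where the prediction matches the reference.
    Both strings are compared in NFC form and must have equal length."""
    predicted = unicodedata.normalize("NFC", predicted)
    reference = unicodedata.normalize("NFC", reference)
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 1.0
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)
```

An echo baseline then scores `char_accuracy(strip_diacritics(reference), reference)`, which gives a floor that any restoration model should beat.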

[71] Seeing isn't Hearing: Benchmarking Vision Language Models at Interpreting Spectrograms

Tyler Loakman,Joseph James,Chenghua Lin

Main category: cs.CL

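TL;DR: Studies the ability of vision language models (VLMs) to understand and interpret spectrograms and waveforms of speech, by building a new dataset of more than 4,000 isolated English words and designing a multiple-choice task with distractors chosen by phonemic edit distance. Both zero-shot and fine-tuned models rarely exceed chance, indicating that paired samples alone are not enough for models to acquire the specific parametric knowledge needed to read these speech representations.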

Details

Motivation: To explore whether vision language models (VLMs) can interpret speech spectrograms and waveforms as a trained phonetician would, filling a research gap on how multimodal models understand visualized speech. Method: A new dataset of more than 4,000 isolated English words with corresponding spectrogram and waveform figures is built, and a multiple-choice task is used to assess how well VLMs understand these speech representations, with distractors chosen by their phonemic edit distance to the ground truth. Result: Both zero-shot and fine-tuned VLMs rarely perform above chance on the task. Conclusion: VLMs struggle to understand visual representations of speech such as spectrograms and waveforms, which suggests that models need specific parametric knowledge to read such figures rather than vision-language paired data alone. Abstract: With the rise of Large Language Models (LLMs) and their vision-enabled counterparts (VLMs), numerous works have investigated their capabilities in tasks that fuse the modalities of vision and language. In this work, we benchmark the extent to which VLMs are able to act as highly-trained phoneticians, interpreting spectrograms and waveforms of speech. To do this, we synthesise a novel dataset containing 4k+ English words spoken in isolation alongside stylistically consistent spectrogram and waveform figures. We test the ability of VLMs to understand these representations of speech through a multiple-choice task whereby models must predict the correct phonemic or graphemic transcription of a spoken word when presented amongst 3 distractor transcriptions that have been selected based on their phonemic edit distance to the ground truth. We observe that both zero-shot and finetuned models rarely perform above chance, demonstrating the requirement for specific parametric knowledge of how to interpret such figures, rather than paired samples alone.
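Selecting distractors by phonemic edit distance can be sketched with a standard Levenshtein distance over phoneme sequences. The abstract does not say whether the authors take the nearest or the farthest candidates, nor which phoneme inventory they use; this sketch assumes nearest-first selection and represents each transcription as a tuple of phoneme symbols. Both function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of phoneme symbols."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def pick_distractors(target, candidates, k=3):
    """Return the k candidates closest to the target (assumed: nearest first)."""
    others = [c for c in candidates if c != target]
    return sorted(others, key=lambda c: edit_distance(target, c))[:k]
```

With `k=3` this matches the three distractor transcriptions per item described in the abstract.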

[72] Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance

Shalini Maiti,Amar Budhiraja,Bhavul Gauri,Gaurav Chaurasia,Anton Protopopov,Alexis Audran-Reiss,Michael Slater,Despoina Magka,Tatiana Shavrina,Roberta Raileanu,Yoram Bachrach

Main category: cs.CL

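TL;DR: Proposes Soup Of Category Experts (SoCE), a model souping method with non-uniform weights based on benchmark categories; by identifying expert models for each category and combining them with an optimized weighted average, it improves the performance and robustness of large language models across multiple domains.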

Details

Motivation: Training large language models is expensive, and model souping is a promising way to improve performance without retraining, but existing methods mostly use uniform weights and fail to exploit the strengths of different models on different tasks, so a better merging strategy is needed. Method: SoCE first clusters categories based on benchmark results, identifies weakly correlated category clusters, and finds the expert model for each cluster; it then merges model weights by non-uniform weighted averaging, with the weights determined by an optimization search to maximize overall performance. Result: SoCE improves performance in several domains (such as multilingual capabilities, tool calling, and math) and reaches state-of-the-art results on the Berkeley Function Calling Leaderboard, outperforming uniform souping and other baselines. Conclusion: SoCE offers an effective and practical framework for model souping that, through structured selection and weighting of expert models, significantly improves overall model performance without increasing inference cost. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, but their training remains resource- and time-intensive, requiring massive compute power and careful orchestration of training procedures. Model souping-the practice of averaging weights from multiple models of the same architecture-has emerged as a promising pre- and post-training technique that can enhance performance without expensive retraining. In this paper, we introduce Soup Of Category Experts (SoCE), a principled approach for model souping that utilizes benchmark composition to identify optimal model candidates and applies non-uniform weighted averaging to maximize performance. Contrary to previous uniform-averaging approaches, our method leverages the observation that benchmark categories often exhibit low inter-correlations in model performance. SoCE identifies "expert" models for each weakly-correlated category cluster and combines them using optimized weighted averaging rather than uniform weights. We demonstrate that the proposed method improves performance and robustness across multiple domains, including multilingual capabilities, tool calling, and math and achieves state-of-the-art results on the Berkeley Function Calling Leaderboard.
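The non-uniform weighted averaging step can be sketched as follows. This is only an illustration of weighted souping over same-architecture checkpoints, not the SoCE implementation: expert selection and the weight search are not shown, the mixing weights are simply passed in, and each checkpoint is modeled as a hypothetical dict mapping parameter names to flat lists of floats.

```python
def weighted_soup(state_dicts, weights):
    """Average parameters of same-architecture checkpoints with the given
    non-negative mixing weights (normalized to sum to 1)."""
    if len(state_dicts) != len(weights) or not state_dicts:
        raise ValueError("need one weight per checkpoint")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    norm = [w / total for w in weights]

    souped = {}
    for name, reference in state_dicts[0].items():
        souped[name] = [
            sum(w * sd[name][i] for w, sd in zip(norm, state_dicts))
            for i in range(len(reference))
        ]
    return souped
```

Uniform souping is the special case where all weights are equal; the abstract's claim is that choosing unequal weights for category experts does better.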

[73] RegionMarker: A Region-Triggered Semantic Watermarking Framework for Embedding-as-a-Service Copyright Protection

Shufan Yang,Zifeng Cheng,Zhiwei Jiang,Yafeng Yin,Cong Wang,Shiping Ge,Yuchen Fu,Qing Gu

Main category: cs.CL

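TL;DR: Proposes RegionMarker, a region-triggered semantic watermarking framework for protecting the copyright of Embedding-as-a-Service (EaaS) models; it defines trigger regions in a low-dimensional space and embeds watermarks there, giving strong resistance to attacks.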

Details

Motivation: Existing EaaS watermarking methods resist only some attacks and lack a comprehensive copyright protection mechanism. Method: Text embeddings are projected onto a low-dimensional subspace with a secret dimensionality reduction matrix, and semantic watermarks are embedded within randomly selected trigger regions; using the whole region and the text embedding itself strengthens robustness. Result: Experiments show that RegionMarker withstands a range of attacks on multiple datasets, including model extraction, paraphrasing, and dimension perturbation. Conclusion: RegionMarker provides a comprehensive and robust watermarking scheme that significantly strengthens copyright protection for EaaS systems. Abstract: Embedding-as-a-Service (EaaS) is an effective and convenient deployment solution for addressing various NLP tasks. Nevertheless, recent research has shown that EaaS is vulnerable to model extraction attacks, which could lead to significant economic losses for model providers. For copyright protection, existing methods inject watermark embeddings into text embeddings and use them to detect copyright infringement. However, current watermarking methods often resist only a subset of attacks and fail to provide comprehensive protection. To this end, we present the region-triggered semantic watermarking framework called RegionMarker, which defines trigger regions within a low-dimensional space and injects watermarks into text embeddings associated with these regions. By utilizing a secret dimensionality reduction matrix to project onto this subspace and randomly selecting trigger regions, RegionMarker makes it difficult for watermark removal attacks to evade detection. Furthermore, by embedding watermarks across the entire trigger region and using the text embedding as the watermark, RegionMarker is resilient to both paraphrasing and dimension-perturbation attacks. Extensive experiments on various datasets show that RegionMarker is effective in resisting different attack methods, thereby protecting the copyright of EaaS.

[74] AHaSIS: Shared Task on Sentiment Analysis for Arabic Dialects

Maram Alharbi,Salmane Chafik,Saad Ezzini,Ruslan Mitkov,Tharindu Ranasinghe,Hansi Hettiarachchi

Main category: cs.CL

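TL;DR: Describes a shared task on sentiment analysis of customer reviews in Arabic dialects for the hospitality domain, aimed at advancing sentiment detection for Arabic dialects.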

Details

Motivation: The hospitality industry in the Arab world relies increasingly on customer feedback to improve services but lacks effective sentiment analysis tools for Arabic dialects, so a dedicated dataset and evaluation framework are needed. Method: Hotel reviews in Modern Standard Arabic (MSA) are manually translated into Saudi and Moroccan (Darija) dialects, with native speakers validating translation quality and sentiment consistency; this produces a multi-dialect annotated dataset of 538 sentiment-balanced reviews, and a shared task is organized to evaluate participating systems. Result: More than 40 teams registered and 12 submitted systems; the best system reached an F1 score of 0.81, showing both the feasibility and the challenges of sentiment analysis across Arabic dialects. Conclusion: The study provides a valuable resource and benchmark for sentiment analysis in Arabic dialects and advances dialect-aware NLP systems for real-world applications. Abstract: The hospitality industry in the Arab world increasingly relies on customer feedback to shape services, driving the need for advanced Arabic sentiment analysis tools. To address this challenge, the Sentiment Analysis on Arabic Dialects in the Hospitality Domain shared task focuses on Sentiment Detection in Arabic Dialects. This task leverages a multi-dialect, manually curated dataset derived from hotel reviews originally written in Modern Standard Arabic (MSA) and translated into Saudi and Moroccan (Darija) dialects. The dataset consists of 538 sentiment-balanced reviews spanning positive, neutral, and negative categories. Translations were validated by native speakers to ensure dialectal accuracy and sentiment preservation. This resource supports the development of dialect-aware NLP systems for real-world applications in customer experience analysis. More than 40 teams have registered for the shared task, with 12 submitting systems during the evaluation phase. The top-performing system achieved an F1 score of 0.81, demonstrating the feasibility and ongoing challenges of sentiment analysis across Arabic dialects.

[75] Donors and Recipients: On Asymmetric Transfer Across Tasks and Languages with Parameter-Efficient Fine-Tuning

Kajetan Dymkiewicz,Ivan Vulic,Helen Yannakoudakis,Eilam Shapira,Roi Reichart,Anna Korhonen

Main category: cs.CL

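TL;DR: Studies how large language models transfer across tasks and languages, finding that same-task cross-language transfer works well while cross-task transfer often degrades performance, and revealing a stable donor-recipient structure across languages and tasks.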

Details

Motivation: To understand how improving a large language model on one task or language affects its performance on other tasks and languages. Method: Controlled PEFT/LoRA experiments are run on multiple open-weight LLMs, treating task and language as transfer axes and analyzing transfer after fine-tuning on a single task-language source. Result: Same-task cross-language transfer is consistently positive, whereas cross-task transfer often produces negative transfer; a stable structure of "hub donors" and "brittle recipients" is identified across languages and tasks. Conclusion: Transfer is asymmetric, and fine-tuning should take risk awareness and model specialization into account. Abstract: Large language models (LLMs) perform strongly across tasks and languages, yet how improvements in one task or language affect other tasks and languages and their combinations remains poorly understood. We conduct a controlled PEFT/LoRA study across multiple open-weight LLM families and sizes, treating task and language as transfer axes while conditioning on model family and size; we fine-tune each model on a single task-language source and measure transfer as the percentage-point change versus its baseline score when evaluated on all other task-language target pairs. We decompose transfer into (i) Matched-Task (Cross-Language), (ii) Matched-Language (Cross-Task), and (iii) Cross-Task (Cross-Language) regimes. We uncover two consistent general patterns. First, a pronounced on-task vs. off-task asymmetry: Matched-Task (Cross-Language) transfer is reliably positive, whereas off-task transfer often incurs collateral degradation. Second, a stable donor-recipient structure across languages and tasks (hub donors vs. brittle recipients). We outline implications for risk-aware fine-tuning and model specialisation.

[76] Can Large Language Models Function as Qualified Pediatricians? A Systematic Evaluation in Real-World Clinical Contexts

Siyu Zhu,Mouxiao Bian,Yue Xie,Yongyu Tang,Zhikang Yu,Tianbin Li,Pengcheng Chen,Bing Han,Jie Xu,Xiaoyan Dong

Main category: cs.CL

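TL;DR: Develops the PEDIASBench framework and systematically evaluates 12 large language models in pediatric clinical settings. The models have a good grasp of basic medical knowledge but remain limited in complex reasoning, dynamic diagnosis and treatment, and humanistic care; current models cannot yet take on pediatric care independently, but they can serve as aids for decision support, medical education, and doctor-patient communication.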

Details

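Motivation: With the rapid development of large language models in medicine, there is a pressing need to assess whether they can handle basic diagnosis and treatment tasks in real pediatric clinical settings, especially their combined ability in knowledge application, dynamic decision-making, and medical ethics. Method: The PEDIASBench evaluation framework assesses 12 representative models released over the past two years along three dimensions (application of basic knowledge, dynamic diagnosis and treatment capability, and pediatric medical safety and ethics), covering 19 pediatric subspecialties and 211 typical diseases. Result: Advanced models do well on basic knowledge tests (for example, Qwen3-235B-A22B exceeds 90% accuracy on licensing-level questions), but performance drops by about 15% as task complexity increases; multiple-choice questions reveal weaknesses in integrative reasoning and knowledge recall; in dynamic diagnosis and treatment DeepSeek-R1 scores highest (mean 0.58), but most models struggle to adapt to real-time patient changes; on ethics and safety tasks Qwen2.5-72B performs best (accuracy 92.05%), though humanistic sensitivity remains limited. Conclusion: Current large language models are constrained in pediatric use by their dynamic decision-making ability and level of humanistic care; future work should focus on multimodal integration and a closed loop of clinical feedback and model iteration to improve safety, interpretability, and human-AI collaboration, and to help build a trustworthy intelligent pediatric healthcare system. Abstract: With the rapid rise of large language models (LLMs) in medicine, a key question is whether they can function as competent pediatricians in real-world clinical settings. We developed PEDIASBench, a systematic evaluation framework centered on a knowledge-system framework and tailored to realistic clinical environments. PEDIASBench assesses LLMs across three dimensions: application of basic knowledge, dynamic diagnosis and treatment capability, and pediatric medical safety and medical ethics. We evaluated 12 representative models released over the past two years, including GPT-4o, Qwen3-235B-A22B, and DeepSeek-V3, covering 19 pediatric subspecialties and 211 prototypical diseases. State-of-the-art models performed well on foundational knowledge, with Qwen3-235B-A22B achieving over 90% accuracy on licensing-level questions, but performance declined ~15% as task complexity increased, revealing limitations in complex reasoning. Multiple-choice assessments highlighted weaknesses in integrative reasoning and knowledge recall. In dynamic diagnosis and treatment scenarios, DeepSeek-R1 scored highest in case reasoning (mean 0.58), yet most models struggled to adapt to real-time patient changes. On pediatric medical ethics and safety tasks, Qwen2.5-72B performed best (accuracy 92.05%), though humanistic sensitivity remained limited. These findings indicate that pediatric LLMs are constrained by limited dynamic decision-making and underdeveloped humanistic care.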
Future development should focus on multimodal integration and a clinical feedback-model iteration loop to enhance safety, interpretability, and human-AI collaboration. While current LLMs cannot independently perform pediatric care, they hold promise for decision support, medical education, and patient communication, laying the groundwork for a safe, trustworthy, and collaborative intelligent pediatric healthcare system.

[77] Mem-PAL: Towards Memory-based Personalized Dialogue Assistants for Long-term User-Agent Interaction

Zhaopei Huang,Qifeng Dai,Guozheng Wu,Xiaopeng Wu,Kehan Chen,Chuan Yu,Xubin Li,Tiezheng Ge,Wenxuan Wang,Qin Jin

Main category: cs.CL

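TL;DR: Presents PAL-Bench, a new benchmark for evaluating the personalization capabilities of service-oriented dialogue agents in long-term human-agent interaction, and builds PAL-Set, the first Chinese dataset of multi-session user logs. It also proposes H²Memory, a hierarchical and heterogeneous memory framework, to improve personalized response generation.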

Details

Motivation: Existing methods struggle to capture users' subjective characteristics and personalized needs in long-term interaction, and there is no suitable Chinese dataset or evaluation benchmark to drive personalization in service-oriented dialogue systems. Method: A multi-step LLM synthesis pipeline generates simulated user behavior data, which human annotators verify to build the PAL-Set dataset; the PAL-Bench benchmark is built on top of it; and the H²Memory framework combines retrieval-augmented generation with hierarchical, heterogeneous memory management. Result: Experiments show that H²Memory significantly outperforms baseline models on both PAL-Bench and an external dataset, improving the quality of personalized response generation. Conclusion: PAL-Bench and PAL-Set provide important Chinese resources for evaluating and training personalized service-oriented dialogue systems, and H²Memory offers an effective approach to personalization modeling in long-term interaction. Abstract: With the rise of smart personal devices, service-oriented human-agent interactions have become increasingly prevalent. This trend highlights the need for personalized dialogue assistants that can understand user-specific traits to accurately interpret requirements and tailor responses to individual preferences. However, existing approaches often overlook the complexities of long-term interactions and fail to capture users' subjective characteristics. To address these gaps, we present PAL-Bench, a new benchmark designed to evaluate the personalization capabilities of service-oriented assistants in long-term user-agent interactions. In the absence of available real-world data, we develop a multi-step LLM-based synthesis pipeline, which is further verified and refined by human annotators. This process yields PAL-Set, the first Chinese dataset comprising multi-session user logs and dialogue histories, which serves as the foundation for PAL-Bench. Furthermore, to improve personalized service-oriented interactions, we propose H$^2$Memory, a hierarchical and heterogeneous memory framework that incorporates retrieval-augmented generation to improve personalized response generation. Comprehensive experiments on both our PAL-Bench and an external dataset demonstrate the effectiveness of the proposed memory framework.

[78] Non-Linear Scoring Model for Translation Quality Evaluation

Serge Gladkoff,Lifeng Han,Katerina Gasova

Main category: cs.CL

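TL;DR: Proposes a non-linear translation quality scoring model in which error tolerance grows logarithmically; compared with the traditional linear model, it matches human perception more closely and improves the fairness, interpretability, and cross-scale consistency of evaluation.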

Details

Motivation: Traditional translation quality evaluation based on linear error penalties is biased across sample lengths: short texts are over-penalized and long texts under-penalized, which is inconsistent with human perception. Method: Building on the Multi-Range framework, the paper proposes the non-linear model E(x) = a * ln(1 + b * x), determines its parameters from two calibration points with a one-dimensional root-finding step, and supports and interprets it with psychophysics (the Weber-Fechner law) and Cognitive Load Theory. Result: Empirical data show that the acceptable number of errors grows logarithmically with sample length; the new model gives more consistent evaluation across three large-scale enterprise environments, and the linear approximation stays within +/-20 percent error only inside a specific interval. Conclusion: The non-linear model is closer to how humans perceive translation quality, improves the accuracy and scalability of evaluating both human and AI-generated text, suits CAT/LQA systems, and provides a more reliable basis for AI-driven document-level evaluation. Abstract: Analytic Translation Quality Evaluation (TQE), based on Multidimensional Quality Metrics (MQM), traditionally uses a linear error-to-penalty scale calibrated to a reference sample of 1000-2000 words. However, linear extrapolation biases judgment on samples of different sizes, over-penalizing short samples and under-penalizing long ones, producing misalignment with expert intuition. Building on the Multi-Range framework, this paper presents a calibrated, non-linear scoring model that better reflects how human content consumers perceive translation quality across samples of varying length. Empirical data from three large-scale enterprise environments shows that acceptable error counts grow logarithmically, not linearly, with sample size. Psychophysical and cognitive evidence, including the Weber-Fechner law and Cognitive Load Theory, supports this premise by explaining why the perceptual impact of additional errors diminishes while the cognitive burden grows with scale. We propose a two-parameter model E(x) = a * ln(1 + b * x), a, b > 0, anchored to a reference tolerance and calibrated from two tolerance points using a one-dimensional root-finding step. The model yields an explicit interval within which the linear approximation stays within +/-20 percent relative error and integrates into existing evaluation workflows with only a dynamic tolerance function added. The approach improves interpretability, fairness, and inter-rater reliability across both human and AI-generated translations.
By operationalizing a perceptually valid scoring paradigm, it advances translation quality evaluation toward more accurate and scalable assessment. The model also provides a stronger basis for AI-based document-level evaluation aligned with human judgment. Implementation considerations for CAT/LQA systems and implications for human and AI-generated text evaluation are discussed.
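The calibration of E(x) = a * ln(1 + b * x) from two tolerance points can be sketched as below. The abstract only says that a one-dimensional root-finding step is used, so the concrete procedure here is an assumption: the ratio E2/E1 = ln(1 + b*x2) / ln(1 + b*x1) is solved for b by bisection in log space, and a then follows from the first point. The function names and the bracket [1e-12, 1e6] are hypothetical choices.

```python
import math


def calibrate(x1, e1, x2, e2, lo=1e-12, hi=1e6, iters=200):
    """Fit a, b in E(x) = a * ln(1 + b * x) to two tolerance points
    (x1, e1) and (x2, e2) with x1 < x2 and e1 < e2."""
    if not (0 < x1 < x2 and 0 < e1 < e2):
        raise ValueError("need 0 < x1 < x2 and 0 < e1 < e2")
    target = e2 / e1
    if target >= x2 / x1:
        raise ValueError("tolerance must grow sub-linearly with sample size")

    def f(b):
        # Decreases from x2/x1 - target (b -> 0) to 1 - target (b -> infinity).
        return math.log1p(b * x2) / math.log1p(b * x1) - target

    if f(lo) <= 0 or f(hi) >= 0:
        raise ValueError("root is outside the search bracket")
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # bisect on a logarithmic scale
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    b = math.sqrt(lo * hi)
    a = e1 / math.log1p(b * x1)
    return a, b


def tolerance(x, a, b):
    """Acceptable error count for a sample of x words."""
    return a * math.log1p(b * x)
```

Once `a` and `b` are fitted, `tolerance(x, a, b)` plays the role of the dynamic tolerance function that the abstract says is the only addition to an existing evaluation workflow.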

[79] Aspect-Level Obfuscated Sentiment in Thai Financial Disclosures and Its Impact on Abnormal Returns

Attapol T. Rutherford,Sirisak Chueykamhang,Thachaparn Bunditlurdruk,Nanthicha Angsuwichitkul

Main category: cs.CL

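TL;DR: Proposes a new Aspect-Based Sentiment Analysis (ABSA) approach for decoding obfuscated sentiment in Thai financial annual reports, demonstrates strong sentiment classification performance through an annotated dataset and model benchmarks, and uses an event study to show that sentiment about specific aspects of the reports influences stock market reactions.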

Details

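Motivation: Financial documents often use vague language to mask the real situation, making market sentiment hard to judge accurately, so a finer-grained method is needed to identify and analyze the obfuscated sentiment they contain. Method: Using Aspect-Based Sentiment Analysis (ABSA), the authors write annotation guidelines for obfuscated sentiment in Thai financial annual reports, annotate more than one hundred reports, benchmark several text classification models on the resulting dataset, and run an event study to analyze the actual effect of the sentiment analysis results on stock prices. Result: The models perform well on sentiment classification, and the event study shows that the market reacts selectively to information about specific aspects of the reports, indicating that obfuscated language has a real effect on market sentiment. Conclusion: Sentiment analysis of financial text is complex, obfuscated language must be handled to assess market sentiment more accurately, and ABSA is useful for this kind of task. Abstract: Understanding sentiment in financial documents is crucial for gaining insights into market behavior. These reports often contain obfuscated language designed to present a positive or neutral outlook, even when underlying conditions may be less favorable. This paper presents a novel approach using Aspect-Based Sentiment Analysis (ABSA) to decode obfuscated sentiment in Thai financial annual reports. We develop specific guidelines for annotating obfuscated sentiment in these texts and annotate more than one hundred financial reports. We then benchmark various text classification models on this annotated dataset, demonstrating strong performance in sentiment classification. Additionally, we conduct an event study to evaluate the real-world implications of our sentiment analysis on stock prices. Our results suggest that market reactions are selectively influenced by specific aspects within the reports. Our findings underscore the complexity of sentiment analysis in financial texts and highlight the importance of addressing obfuscated language to accurately assess market sentiment.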

[80] Applying Large Language Models to Characterize Public Narratives

Elinor Poole-Dayan,Daniel T Kessler,Hannah Chiou,Margaret Hughes,Emily S Lin,Marshall Ganz,Deb Roy

Main category: cs.CL

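TL;DR: Proposes a computational framework that uses large language models (LLMs) to automate the annotation of public narratives, achieves near-human-expert performance, and extends the analysis to political speeches, showing the potential of LLMs for scalable narrative analysis.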

Details

Motivation: Public narratives are vital for leadership development and civic mobilization, but their subjective interpretation and the high cost of expert annotation make systematic analysis challenging. Method: Using a codebook co-developed with subject-matter experts, the authors evaluate LLM annotation performance across multiple narratives and codes, then apply it to a larger dataset of stories and to political speeches. Result: LLMs reach an average F1 score of 0.80 across 8 narratives and 14 codes, close to human expert level, and identify narrative framework elements in 22 stories and in political speeches. Conclusion: LLM-assisted annotation has potential for large-scale public narrative analysis and opens a new path for computational civic storytelling research, though its limitations also need attention. Abstract: Public Narratives (PNs) are key tools for leadership development and civic mobilization, yet their systematic analysis remains challenging due to their subjective interpretation and the high cost of expert annotation. In this work, we propose a novel computational framework that leverages large language models (LLMs) to automate the qualitative annotation of public narratives. Using a codebook we co-developed with subject-matter experts, we evaluate LLM performance against that of expert annotators. Our work reveals that LLMs can achieve near-human-expert performance, achieving an average F1 score of 0.80 across 8 narratives and 14 codes. We then extend our analysis to empirically explore how PN framework elements manifest across a larger dataset of 22 stories. Lastly, we extrapolate our analysis to a set of political speeches, establishing a novel lens in which to analyze political rhetoric in civic spaces. This study demonstrates the potential of LLM-assisted annotation for scalable narrative analysis and highlights key limitations and directions for future research in computational civic storytelling.

[81] Toward Conversational Hungarian Speech Recognition: Introducing the BEA-Large and BEA-Dialogue Datasets

Máté Gedeon,Piroska Zsófia Barta,Péter Mihajlik,Tekla Etelka Gráczi,Anna Kohári,Katalin Mády

Main category: cs.CL

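TL;DR: Introduces two new datasets of spontaneous and conversational Hungarian speech, BEA-Large and BEA-Dialogue, which fill a gap in natural conversational speech recognition for low-resource languages, and provides reproducible baselines for ASR and speaker diarization.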

Details

Motivation: The lack of spontaneous and conversational corpora limits automatic speech recognition for low-resource languages such as Hungarian; the paper builds high-quality datasets from the underused BEA speech corpus to advance this research. Method: Previously unprocessed portions of the BEA speech corpus are used to build BEA-Large, which adds 255 hours of spontaneous speech, and BEA-Dialogue, with 85 hours of natural conversation; the data is annotated in detail and partitioned into speaker-independent subsets; publicly available ASR models (such as Fast Conformer) are fine-tuned and evaluated on speech recognition and speaker diarization. Result: The fine-tuned Fast Conformer model reaches word error rates as low as 14.18% on spontaneous speech and 4.8% on repeated speech; diarization error rates range from 13.05% to 18.26%; the results show that conversational speech recognition still faces challenges from disfluencies, overlaps, and informal speech. Conclusion: The released datasets and baseline systems advance Hungarian speech technology and offer a methodological reference for building spontaneous and conversational speech benchmarks in other languages. Abstract: The advancement of automatic speech recognition (ASR) has been largely enhanced by extensive datasets in high-resource languages, while languages such as Hungarian remain underrepresented due to limited spontaneous and conversational corpora. To address this gap, we introduce two new datasets -- BEA-Large and BEA-Dialogue -- constructed from the previously unprocessed portions of the Hungarian speech corpus named BEA. BEA-Large extends BEA-Base with 255 hours of spontaneous speech from 433 speakers, enriched with detailed segment-level metadata. BEA-Dialogue, comprising 85 hours of spontaneous conversations, is a Hungarian speech corpus featuring natural dialogues partitioned into speaker-independent subsets, supporting research in conversational ASR and speaker diarization. We establish reproducible baselines on these datasets using publicly available ASR models, with the fine-tuned Fast Conformer model achieving word error rates as low as 14.18% on spontaneous and 4.8% on repeated speech. Diarization experiments yield diarization error rates between 13.05% and 18.26%, providing reference points for future improvements. The results highlight the persistent difficulty of conversational ASR, particularly due to disfluencies, overlaps, and informal speech patterns. By releasing these datasets and baselines, we aim to advance Hungarian speech technology and offer a methodological framework for developing spontaneous and conversational benchmarks in other languages.

[82] Beyond SELECT: A Comprehensive Taxonomy-Guided Benchmark for Real-World Text-to-SQL Translation

Hao Wang,Yuanfeng Song,Xiaoming Yin,Xing Chen

Main category: cs.CL

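TL;DR: Proposes a new multi-dimensional taxonomy for Text-to-SQL and uses it to build SQL-Synth, a synthetic dataset with greater diversity and coverage; the work shows the limitations of existing LLMs on this task and that fine-tuning can substantially improve performance.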

Details

Motivation: Existing Text-to-SQL datasets have limited coverage and diversity and fail to reflect the complexity of real-world applications. Method: A taxonomy covering core intents, statement types, syntax structures, and key actions is proposed and used to assess existing datasets (such as Spider and Bird); a synthetic data generation pipeline based on the taxonomy and large language models is then designed to build the new SQL-Synth dataset. Result: SQL-Synth exceeds existing benchmark datasets in diversity and coverage; experiments show that current large language models perform poorly on it, but fine-tuning improves performance substantially. Conclusion: The taxonomy supports effective analysis of datasets and model performance and guides the construction of high-quality training data, giving it significant practical value for the Text-to-SQL field. Abstract: Text-to-SQL datasets are essential for training and evaluating text-to-SQL models, but existing datasets often suffer from limited coverage and fail to capture the diversity of real-world applications. To address this, we propose a novel taxonomy for text-to-SQL classification based on dimensions including core intents, statement types, syntax structures, and key actions. Using this taxonomy, we evaluate widely used public text-to-SQL datasets (e.g., Spider and Bird) and reveal limitations in their coverage and diversity. We then introduce a taxonomy-guided dataset synthesis pipeline, yielding a new dataset named SQL-Synth. This approach combines the taxonomy with Large Language Models (LLMs) to ensure the dataset reflects the breadth and complexity of real-world text-to-SQL applications. Extensive analysis and experimental results validate the effectiveness of our taxonomy, as SQL-Synth exhibits greater diversity and coverage compared to existing benchmarks. Moreover, we uncover that existing LLMs typically fall short in adequately capturing the full range of scenarios, resulting in limited performance on SQL-Synth. However, fine-tuning can substantially improve their performance in these scenarios. The proposed taxonomy has significant potential impact, as it not only enables comprehensive analysis of datasets and the performance of different LLMs, but also guides the construction of training data for LLMs.

[83] Omni Memory System for Personalized, Long Horizon, Self-Evolving Agents

Piaohong Wang,Motong Tian,Jiaxian Li,Yuan Liang,Yuqing Wang,Qianben Chen,Tiannan Wang,Zhicong Lu,Jiawei Ma,Yuchen Eleanor Jiang,Wangchunshu Zhou

Main category: cs.CL

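TL;DR: Proposes O-Mem, a memory framework based on active user profiling that dynamically extracts and updates user characteristics and event records, improving LLM agents' long-term interaction in complex environments; it gives more consistent, personalized responses and outperforms prior memory systems on the LoCoMo and PERSONAMEM benchmarks.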

Details

Motivation: Existing memory systems rely on semantic grouping before retrieval, which can overlook critical but semantically unrelated user information and introduce retrieval noise, making long-term, coherent, personalized interaction in complex environments hard to sustain. Method: Proposes the O-Mem framework, which dynamically extracts and updates user characteristics and event records from users' proactive interactions with agents and supports hierarchical retrieval of persona attributes and topic-related context for more adaptive and consistent responses. Result: Achieves 51.76% on the LoCoMo benchmark, nearly 3% above LangMem, and 62.99% on PERSONAMEM, 3.5% above A-Mem; it also improves token and interaction response-time efficiency. Conclusion: O-Mem improves personalization and consistency in long-term interactions with LLM agents and points to a direction for building efficient, human-like AI assistants. Abstract: Recent advancements in LLM-powered agents have demonstrated significant potential in generating human-like responses; however, they continue to face challenges in maintaining long-term interactions within complex environments, primarily due to limitations in contextual consistency and dynamic personalization. Existing memory systems often depend on semantic grouping prior to retrieval, which can overlook semantically irrelevant yet critical user information and introduce retrieval noise. In this report, we propose the initial design of O-Mem, a novel memory framework based on active user profiling that dynamically extracts and updates user characteristics and event records from their proactive interactions with agents. O-Mem supports hierarchical retrieval of persona attributes and topic-related context, enabling more adaptive and coherent personalized responses. O-Mem achieves 51.76% on the public LoCoMo benchmark, a nearly 3% improvement upon LangMem, the previous state-of-the-art, and it achieves 62.99% on PERSONAMEM, a 3.5% improvement upon A-Mem, the previous state-of-the-art. O-Mem also boosts token and interaction response time efficiency compared to previous memory frameworks. Our work opens up promising directions for developing efficient and human-like personalized AI assistants in the future.

[84] Why is "Chicago" Predictive of Deceptive Reviews? Using LLMs to Discover Language Phenomena from Lexical Cues

Jiaming Qu,Mengtian Guo,Yue Wang

Main category: cs.CL

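TL;DR: Explores using large language models (LLMs) to translate the opaque lexical cues learned by machine learning classifiers into human-understandable language phenomena that distinguish deceptive reviews from genuine ones.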

Details

Motivation: Deceptive reviews mislead consumers, harm businesses, and undermine trust in online marketplaces. Existing machine learning methods are effective, but the basis for their decisions is hard to interpret. Method: Uses large language models to translate the subtle, fragmented features learned by machine learning models into human-understandable language phenomena, and tests how well these phenomena generalize across domains. Result: The resulting language phenomena are empirically grounded in data, generalize across similar domains, and are more predictive than phenomena drawn from LLM prior knowledge or obtained through in-context learning. Conclusion: The approach can help people critically assess the credibility of online reviews in settings where deception detection classifiers are unavailable. Abstract: Deceptive reviews mislead consumers, harm businesses, and undermine trust in online marketplaces. Machine learning classifiers can learn from large amounts of training examples to effectively distinguish deceptive reviews from genuine ones. However, the distinguishing features learned by these classifiers are often subtle, fragmented, and difficult for humans to interpret. In this work, we explore using large language models (LLMs) to translate machine-learned lexical cues into human-understandable language phenomena that can differentiate deceptive reviews from genuine ones. We show that language phenomena obtained in this manner are empirically grounded in data, generalizable across similar domains, and more predictive than phenomena either in LLMs' prior knowledge or obtained through in-context learning. These language phenomena have the potential to aid people in critically assessing the credibility of online reviews in environments where deception detection classifiers are unavailable.

[85] Crossing Borders: A Multimodal Challenge for Indian Poetry Translation and Image Generation

Sofia Jamil,Kotla Sai Charan,Sriparna Saha,Koustava Goswami,Joseph K J

Main category: cs.CL

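TL;DR: Proposes TAI (Translation and Image Generation), a framework that uses large language models and latent diffusion models with prompt tuning to translate Indian-language poetry accurately into English and render it visually, improving its global accessibility and supporting the Sustainable Development Goals of Quality Education and Reduced Inequalities.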

Details

Motivation: Indian poetry has complex linguistic structure and deep cultural resonance, but its layered meanings, cultural allusions, and rich morphology make it hard for non-native readers to understand. Existing work has largely overlooked Indian-language poetry, leaving it short of resources and technical support, so an effective method is needed to promote its dissemination and comprehension. Method: Proposes the TAI framework with two modules: a translation module based on an Odds Ratio Preference Alignment Algorithm that translates morphologically rich poetry into English, and an image generation module based on a semantic graph that captures tokens, dependencies, and the semantic links between metaphors and their meanings to generate visually meaningful poem images. The authors also build the MorphoVerse dataset of 1,570 poems across 21 low-resource Indian languages. Result: TAI Diffusion outperforms strong baselines on poem image generation in both human evaluation and quantitative metrics. The MorphoVerse dataset helps ease the scarcity of resources for Indian-language poetry. Conclusion: The TAI framework improves the accessibility and comprehension of Indian-language poetry, supports the spread of cultural heritage and the Sustainable Development Goals, and contributes to poetry translation and visualization. Abstract: Indian poetry, known for its linguistic complexity and deep cultural resonance, has a rich and varied heritage spanning thousands of years. However, its layered meanings, cultural allusions, and sophisticated grammatical constructions often pose challenges for comprehension, especially for non-native speakers or readers unfamiliar with its context and language. Despite its cultural significance, existing works on poetry have largely overlooked Indian language poems. In this paper, we propose the Translation and Image Generation (TAI) framework, leveraging Large Language Models (LLMs) and Latent Diffusion Models through appropriate prompt tuning. Our framework supports the United Nations Sustainable Development Goals of Quality Education (SDG 4) and Reduced Inequalities (SDG 10) by enhancing the accessibility of culturally rich Indian-language poetry to a global audience. It includes (1) a translation module that uses an Odds Ratio Preference Alignment Algorithm to accurately translate morphologically rich poetry into English, and (2) an image generation module that employs a semantic graph to capture tokens, dependencies, and semantic relationships between metaphors and their meanings, to create visually meaningful representations of Indian poems. Our comprehensive experimental evaluation, including both human and quantitative assessments, demonstrates the superiority of TAI Diffusion in poem image generation tasks, outperforming strong baselines.
To further address the scarcity of resources for Indian-language poetry, we introduce the Morphologically Rich Indian Language Poems MorphoVerse Dataset, comprising 1,570 poems across 21 low-resource Indian languages. By addressing the gap in poetry translation and visual comprehension, this work aims to broaden accessibility and enrich the reader's experience.

[86] Generalist Foundation Models Are Not Clinical Enough for Hospital Operations

Lavender Y. Jiang,Angelica Chen,Xu Han,Xujin Chris Liu,Radhika Dua,Kevin Eaton,Frederick Wolff,Robert Steele,Jeff Zhang,Anton Alyakin,Qingkai Pan,Yanbing Chen,Karl L. Sangwon,Daniel A. Alber,Jaden Stryker,Jin Vivian Lee,Yindalon Aphinyanaphongs,Kyunghyun Cho,Eric Karl Oermann

Main category: cs.CL

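TL;DR: Lang1 is a family of specialized models pretrained on electronic health records (EHRs) and internet text. On ReMedE, a benchmark built from real clinical settings, it clearly outperforms generalist models on hospital operations prediction tasks, especially after supervised finetuning.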

Details

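Motivation: Generalist foundation models underperform on operational decision tasks in healthcare because they lack the specialized knowledge that hospital operations require, so specialized models that combine in-domain pretraining with supervised finetuning are needed to improve prediction in real clinical settings. Method: Introduces the Lang1 model family, pretrained on 80B clinical EHR tokens and 627B internet tokens, and builds the ReMedE benchmark (derived from roughly 668,000 EHR notes) to evaluate five key clinical prediction tasks; models of different sizes are compared in zero-shot and finetuned settings. Result: In the zero-shot setting, all models perform poorly on four of the five tasks (36.6%-71.7% AUROC), with mortality prediction the only exception. After finetuning, the 1B-parameter Lang1 outperforms finetuned generalist models up to 70x larger and zero-shot models up to 671x larger, improving AUROC by 3.64%-6.75% and 1.66%-23.66% respectively. Joint finetuning on multiple tasks improves performance on other tasks, and the model generalizes well to an external health system and other clinical tasks. Conclusion: Effective healthcare systems AI depends on combining in-domain pretraining, supervised finetuning, and real-world evaluation; specialized models can surpass much larger generalist models on specific clinical tasks, which underlines the value of developing domain-specific models. Abstract: Hospitals and healthcare systems rely on operational decisions that determine patient flow, cost, and quality of care. Despite strong performance on medical knowledge and conversational benchmarks, foundation models trained on general text may lack the specialized knowledge required for these operational decisions. We introduce Lang1, a family of models (100M-7B parameters) pretrained on a specialized corpus blending 80B clinical tokens from NYU Langone Health's EHRs and 627B tokens from the internet. To rigorously evaluate Lang1 in real-world settings, we developed the REalistic Medical Evaluation (ReMedE), a benchmark derived from 668,331 EHR notes that evaluates five critical tasks: 30-day readmission prediction, 30-day mortality prediction, length of stay, comorbidity coding, and predicting insurance claims denial. In zero-shot settings, both general-purpose and specialized models underperform on four of five tasks (36.6%-71.7% AUROC), with mortality prediction being an exception. After finetuning, Lang1-1B outperforms finetuned generalist models up to 70x larger and zero-shot models up to 671x larger, improving AUROC by 3.64%-6.75% and 1.66%-23.66% respectively. We also observed cross-task scaling with joint finetuning on multiple tasks leading to improvement on other tasks. Lang1-1B effectively transfers to out-of-distribution settings, including other clinical tasks and an external health system.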
Our findings suggest that predictive capabilities for hospital operations require explicit supervised finetuning, and that this finetuning process is made more efficient by in-domain pretraining on EHR. Our findings support the emerging view that specialized LLMs can compete with generalist models in specialized tasks, and show that effective healthcare systems AI requires the combination of in-domain pretraining, supervised finetuning, and real-world evaluation beyond proxy benchmarks.
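
The AUROC figures quoted in this entry can be read with a small worked example. The following is a minimal sketch of the standard rank-based AUROC definition (the probability that a randomly chosen positive case is scored above a randomly chosen negative one); the function name and the toy data are illustrative assumptions, not the ReMedE evaluation code.

```python
def auroc(labels, scores):
    """Rank-based AUROC for binary labels (1 = positive, 0 = negative).

    Counts, over all positive/negative pairs, how often the positive
    example receives the higher score; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 3 of the 4 positive/negative pairs are ordered correctly.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(toy_labels, toy_scores))  # 0.75
```

Under this definition 0.5 corresponds to random ranking, so a zero-shot AUROC of 36.6% means the model ranks cases worse than chance on that task.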

cs.CV [Back]

[87] Psychological stress during Examination and its estimation by handwriting in answer script

Abhijeet Kumar,Chetan Agarwal,Pronoy B. Neogi,Mayank Goswami

Main category: cs.CV

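TL;DR: Proposes a method combining graphology and artificial intelligence that quantifies psychological stress levels by analyzing students' handwritten examination scripts.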

Details

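Motivation: Traditional grading systems cannot reflect students' cognitive and emotional states during examinations, so a new method for assessing psychological stress is needed. Method: Uses high-resolution image processing and TrOCR for text recognition, combines them with RoBERTa-based models for sentiment analysis and sentiment entropy fusion, and produces a numerical Stress Index through a five-model voting mechanism and unsupervised anomaly detection. Result: The method yields a robust, data-driven framework for quantifying students' psychological stress, which the authors present as innovative and practical for academic forensics. Conclusion: The framework goes beyond traditional grading systems and offers a feasible technical route to mental health monitoring in educational assessment. Abstract: This research explores the fusion of graphology and artificial intelligence to quantify psychological stress levels in students by analyzing their handwritten examination scripts. By leveraging Optical Character Recognition and transformer-based sentiment analysis models, we present a data-driven approach that transcends traditional grading systems, offering deeper insights into cognitive and emotional states during examinations. The system integrates high-resolution image processing, TrOCR, and sentiment entropy fusion using RoBERTa-based models to generate a numerical Stress Index. Our method achieves robustness through a five-model voting mechanism and unsupervised anomaly detection, making it an innovative framework in academic forensics.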

[88] Real-time pothole detection with onboard sensors and camera on vehicles

Aswath Muthuselvam,Jeevak Raj S,Mohanaprasad K

Main category: cs.CV

TL;DR: Proposes a method for detecting potholes in real time with onboard vehicle sensors, using an SVM classifier that reaches 98.1% detection accuracy.

Details

Motivation: With the number of vehicles on the road growing every year, frequent monitoring of road conditions is essential to keep traffic flowing and to repair small cracks before they develop into large potholes. Method: Collects data with onboard vehicle sensors and uses a support vector machine (SVM) classifier to identify potholes in real time. Result: Tested on a local road about 2 km long with 26 potholes, the system reached 98.1% detection accuracy. Conclusion: The method identifies potholes efficiently and accurately, has potential for large-scale use, and can improve the efficiency of road maintenance. Abstract: Road conditions play an important role in our everyday commute. With the proliferating number of vehicles on the road each year, it has become necessary to assess road conditions very frequently to ensure that traffic flows smoothly. Even the smallest crack in the road could easily be chipped into a large pothole due to changing surface temperatures of the road and the force of vehicles riding over it. In this paper, we address how to better identify these potholes in real time with the help of onboard sensors in vehicles, so that the data can be used for analysis and better management of potholes on a large scale. For the implementation, we used an SVM classifier to detect potholes and achieved 98.1% accuracy based on data collected from a local road of about 2 km, which had 26 potholes distributed along it. Code is available at: https://github.com/aswathselvam/Potholes

[89] A Method for Identifying Farmland System Habitat Types Based on the Dynamic-Weighted Feature Fusion Network Model

Kesong Zheng,Zhi Song,Peizhou Li,Shuyi Yao,Zhenxing Bian

Main category: cs.CV

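TL;DR: To address inconsistent habitat classification standards for cultivated land ecosystems, incomplete type coverage, and the weak fusion of semantic and texture features in existing models, this study builds an ultra-high-resolution remote sensing image dataset covering 15 cultivated land habitat categories and proposes a Dynamic-Weighted Feature Fusion Network (DWFF-Net). A frozen-parameter DINOv3 encoder and a dynamic weight fusion strategy improve segmentation accuracy: the model reaches an mIoU of 0.6979 and an F1-score of 0.8049, outperforming the baseline.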

Details

Motivation: Habitat classification for cultivated land ecosystems lacks a unified standard and covers habitat types incompletely, and mainstream models struggle to fuse semantic and texture features, which leads to low segmentation accuracy and blurred boundaries for multi-scale habitats (such as large field plots and micro-habitats); better recognition accuracy and finer-grained delineation are needed. Method: Builds a high-resolution remote sensing image dataset covering 15 cultivated land habitat categories and proposes DWFF-Net, whose encoder uses a frozen-parameter DINOv3 to extract foundational features, introduces a data-level adaptive dynamic weighting strategy for feature fusion, integrates a dynamic weight computation network in the decoder for deep multi-layer feature fusion, and trains with a hybrid loss function. Result: On the self-built dataset, DWFF-Net achieves an mIoU of 0.6979 and an F1-score of 0.8049, improvements of 0.021 and 0.0161 over the baseline; ablation studies show that multi-layer feature fusion improves IoU for micro-habitat categories such as field ridges. Conclusion: The study establishes a habitat identification framework for cultivated land systems based on adaptive multi-layer feature fusion, enabling sub-meter, low-cost habitat mapping and providing technical support for fine-grained monitoring of cultivated landscapes. Abstract: To address the current lack of a standardized habitat classification system for cultivated land ecosystems, the incomplete coverage of habitat types, and the inability of existing models to effectively integrate semantic and texture features, which results in insufficient segmentation accuracy and blurred boundaries for multi-scale habitats (e.g., large-scale field plots and micro-habitats), this study developed a comprehensively annotated ultra-high-resolution remote sensing image dataset encompassing 15 categories of cultivated land system habitats. Furthermore, we propose a Dynamic-Weighted Feature Fusion Network (DWFF-Net). The encoder of this model utilizes a frozen-parameter DINOv3 to extract foundational features. By analyzing the relationships between different category images and feature maps, we introduce a data-level adaptive dynamic weighting strategy for feature fusion. The decoder incorporates a dynamic weight computation network to achieve thorough integration of multi-layer features, and a hybrid loss function is adopted to optimize model training. Experimental results on the constructed dataset demonstrate that the proposed model achieves a mean Intersection over Union (mIoU) of 0.6979 and an F1-score of 0.8049, outperforming the baseline network by 0.021 and 0.0161, respectively. Ablation studies further confirm the complementary nature of multi-layer feature fusion, which effectively improves the IoU for micro-habitat categories such as field ridges.
This study establishes a habitat identification framework for cultivated land systems based on adaptive multi-layer feature fusion, enabling sub-meter precision habitat mapping at a low cost and providing robust technical support for fine-grained habitat monitoring in cultivated landscapes.
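
The two headline metrics in this entry, mIoU and F1-score, can be illustrated with a short sketch. It assumes flattened per-pixel label lists and a macro average over the classes that occur; the abstract does not say how the authors average F1, so that choice and all names here are assumptions.

```python
def per_class_iou_f1(y_true, y_pred, num_classes):
    """Return a list of (IoU, F1) pairs, one per class that occurs at all."""
    stats = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        if tp + fp + fn == 0:
            continue  # class absent from both labels and predictions
        iou = tp / (tp + fp + fn)
        f1 = 2 * tp / (2 * tp + fp + fn)
        stats.append((iou, f1))
    return stats


def mean_iou_f1(y_true, y_pred, num_classes):
    """Macro-average IoU and F1 over the classes that occur (an assumed convention)."""
    stats = per_class_iou_f1(y_true, y_pred, num_classes)
    miou = sum(iou for iou, _ in stats) / len(stats)
    mf1 = sum(f1 for _, f1 in stats) / len(stats)
    return miou, mf1


# Hypothetical 6-pixel example with 3 habitat classes.
truth = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
print(mean_iou_f1(truth, pred, num_classes=3))
```

For a single class, F1 = 2·IoU / (1 + IoU), so F1 is never lower than IoU, which is consistent with the reported 0.8049 F1 sitting above the 0.6979 mIoU.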

[90] AGENet: Adaptive Edge-aware Geodesic Distance Learning for Few-Shot Medical Image Segmentation

Ziyuan Gao

Main category: cs.CV

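TL;DR: Proposes AGENet, a lightweight medical image segmentation framework based on adaptive, edge-aware geodesic distance learning that delivers precise boundary segmentation from only a few annotated examples.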

Details

Motivation: Existing few-shot medical image segmentation methods perform poorly at precise boundary delineation, especially when anatomically similar regions appear without enough spatial context. Method: Uses geometric priors to improve segmentation through an edge-aware geodesic distance learning module (built on Fast Marching), adaptive prototype extraction (spatially weighted aggregation), and adaptive parameter learning. Result: Improves on state-of-the-art methods across several medical imaging datasets and reduces boundary errors while remaining computationally efficient. Conclusion: By introducing geometry-aware mechanisms, AGENet achieves accurate medical image segmentation at low annotation cost and is suited to clinical use. Abstract: Medical image segmentation requires large annotated datasets, creating a significant bottleneck for clinical applications. While few-shot segmentation methods can learn from minimal examples, existing approaches demonstrate suboptimal performance in precise boundary delineation for medical images, particularly when anatomically similar regions appear without sufficient spatial context. We propose AGENet (Adaptive Geodesic Edge-aware Network), a novel framework that incorporates spatial relationships through edge-aware geodesic distance learning. Our key insight is that medical structures follow predictable geometric patterns that can guide prototype extraction even with limited training data. Unlike methods relying on complex architectural components or heavy neural networks, our approach leverages computationally lightweight geometric modeling. The framework combines three main components: (1) An edge-aware geodesic distance learning module that respects anatomical boundaries through iterative Fast Marching refinement, (2) adaptive prototype extraction that captures both global structure and local boundary details via spatially-weighted aggregation, and (3) adaptive parameter learning that automatically adjusts to different organ characteristics. Extensive experiments across diverse medical imaging datasets demonstrate improvements over state-of-the-art methods. Notably, our method reduces boundary errors compared to existing approaches while maintaining computational efficiency, making it highly suitable for clinical applications requiring precise segmentation with limited annotated data.
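
The idea of an edge-aware geodesic distance can be sketched on a small grid. The code below is not AGENet's module: it swaps the Fast Marching method for Dijkstra's algorithm on a 4-connected pixel grid, and the step cost `1 + lam * |intensity difference|` and the parameter `lam` are illustrative assumptions. It shows only the core behavior, that paths crossing strong intensity edges become expensive, so distances stay small within a structure and jump at its boundary.

```python
import heapq


def edge_aware_geodesic(image, seed, lam=10.0):
    """Geodesic distance from `seed` over a 2D grid of intensities.

    Moving between 4-connected neighbours costs 1 plus `lam` times the
    absolute intensity difference, so crossing an edge is penalised.
    Uses Dijkstra's algorithm as a simple stand-in for Fast Marching.
    """
    rows, cols = len(image), len(image[0])
    dist = [[float("inf")] * cols for _ in range(rows)]
    sr, sc = seed
    dist[sr][sc] = 0.0
    heap = [(0.0, sr, sc)]
    while heap:
        d, r, c = heapq.heappop(heap)
        if d > dist[r][c]:
            continue  # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                step = 1.0 + lam * abs(image[nr][nc] - image[r][c])
                nd = d + step
                if nd < dist[nr][nc]:
                    dist[nr][nc] = nd
                    heapq.heappush(heap, (nd, nr, nc))
    return dist


# Hypothetical one-row image with an intensity edge between columns 1 and 2.
row_image = [[0.0, 0.0, 1.0, 1.0]]
print(edge_aware_geodesic(row_image, seed=(0, 0)))  # [[0.0, 1.0, 12.0, 13.0]]
```

In the example, the pixel two steps from the seed is at geodesic distance 12 rather than 2 because the path has to cross the edge, which is the property that lets a distance map of this kind respect anatomical boundaries.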

[91] EPSegFZ: Efficient Point Cloud Semantic Segmentation for Few- and Zero-Shot Scenarios with Language Guidance

Jiahui Wang,Haiyue Zhu,Haoren Guo,Abdullah Al Mamun,Cheng Xiang,Tong Heng Lee

Main category: cs.CV

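TL;DR: Proposes EPSegFZ, a pre-training-free network for point cloud semantic segmentation whose ProERA, DRPE, and LGPE modules make effective use of visual and textual information and improve performance in few-shot and zero-shot scenarios.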

Details

Motivation: Existing few-shot point cloud segmentation methods depend on pre-training and ignore textual information in the support set, which limits model flexibility and zero-shot ability. Method: Designs the EPSegFZ network, in which a ProERA module strengthens prototype attention, a DRPE mechanism improves query-prototype matching, and an LGPE module fuses language-guided prototype embeddings to exploit textual annotations. Result: Outperforms the current state-of-the-art method by 5.68% on S3DIS and 3.82% on ScanNet, giving better few-shot and zero-shot segmentation. Conclusion: EPSegFZ fuses multimodal support information without pre-training and improves model flexibility, adaptability, and zero-shot inference. Abstract: Recent approaches for few-shot 3D point cloud semantic segmentation typically require a two-stage learning process, i.e., a pre-training stage followed by a few-shot training stage. While effective, these methods face overreliance on pre-training, which hinders model flexibility and adaptability. Some models tried to avoid pre-training yet failed to capture ample information. In addition, current approaches focus on visual information in the support set and neglect or do not fully exploit other useful data, such as textual annotations. This inadequate utilization of support information impairs the performance of the model and restricts its zero-shot ability. To address these limitations, we present a novel pre-training-free network, named Efficient Point Cloud Semantic Segmentation for Few- and Zero-shot scenarios. Our EPSegFZ incorporates three key components: a Prototype-Enhanced Registers Attention (ProERA) module and a Dual Relative Positional Encoding (DRPE)-based cross-attention mechanism for improved feature extraction and accurate query-prototype correspondence construction without pre-training, and a Language-Guided Prototype Embedding (LGPE) module that effectively leverages textual information from the support set to improve few-shot performance and enable zero-shot inference. Extensive experiments show that our method outperforms the state-of-the-art method by 5.68% and 3.82% on the S3DIS and ScanNet benchmarks, respectively.

Lian He,Meng Liu,Qilang Ye,Yu Zhou,Xiang Deng,Gangyi Ding

Main category: cs.CV

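TL;DR: Proposes TASA, a framework for task-aware 3D scene-level affordance segmentation that combines 2D semantic cues with 3D geometric reasoning and improves both accuracy and efficiency.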

Details

Motivation: Existing methods focus on object-level affordances or merely lift 2D predictions to 3D, which ignores the rich geometric structure in point clouds and incurs high computational cost. A method that combines semantic reasoning with spatial grounding is needed. Method: TASA works coarse-to-fine, jointly using 2D semantic cues and 3D geometric reasoning. It comprises a task-aware 2D affordance detection module, which identifies manipulable points and selects task-relevant views, and a 3D affordance refinement module, which fuses 2D semantic priors with local 3D geometry to produce precise 3D affordance masks. Result: On the SceneFun3D dataset, TASA significantly outperforms the baselines in scene-level affordance segmentation in both accuracy and efficiency. Conclusion: By integrating 2D semantics with 3D geometry, TASA achieves efficient and accurate 3D scene-level affordance segmentation and better supports embodied agents interacting in complex environments. Abstract: Understanding 3D scene-level affordances from natural language instructions is essential for enabling embodied agents to interact meaningfully in complex environments. However, this task remains challenging due to the need for semantic reasoning and spatial grounding. Existing methods mainly focus on object-level affordances or merely lift 2D predictions to 3D, neglecting rich geometric structure information in point clouds and incurring high computational costs. To address these limitations, we introduce Task-Aware 3D Scene-level Affordance segmentation (TASA), a novel geometry-optimized framework that jointly leverages 2D semantic cues and 3D geometric reasoning in a coarse-to-fine manner. To improve the affordance detection efficiency, TASA features a task-aware 2D affordance detection module to identify manipulable points from language and visual inputs, guiding the selection of task-relevant views. To fully exploit 3D geometric information, a 3D affordance refinement module is proposed to integrate 2D semantic priors with local 3D geometry, resulting in accurate and spatially coherent 3D affordance masks. Experiments on SceneFun3D demonstrate that TASA significantly outperforms the baselines in both accuracy and efficiency in scene-level affordance segmentation.

[93] LE-CapsNet: A Light and Enhanced Capsule Network

Pouya Shiri,Amirali Baniasadi

Main category: cs.CV

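TL;DR: Proposes LE-CapsNet, a light, enhanced, and more accurate CapsNet variant that cuts parameters and speeds up inference while raising classification accuracy on the CIFAR-10 and AffNIST datasets.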

Details

Motivation: CapsNet handles overlapping categories and transformed images better than CNNs, but it is slow, resource-hungry, parameter-heavy, and less accurate, so a more efficient and accurate model is needed. Method: Designs LE-CapsNet, which optimizes the network structure to reduce the parameter count (3.8M weights), speed up inference, and improve robustness to affine-transformed images. Result: Reaches 76.73% accuracy on CIFAR-10 with inference 4x faster than CapsNet, and 94.3% accuracy on AffNIST versus 90.52% for CapsNet. Conclusion: LE-CapsNet keeps the advantages of CapsNet while improving efficiency and accuracy, making it a more practical Capsule Network variant. Abstract: The Capsule Network (CapsNet) classifier has several advantages over CNNs, including better detection of images containing overlapping categories and higher accuracy on transformed images. Despite the advantages, CapsNet is slow due to its different structure. In addition, CapsNet is resource-hungry, includes many parameters and lags in accuracy compared to CNNs. In this work, we propose LE-CapsNet as a light, enhanced and more accurate variant of CapsNet. Using 3.8M weights, LE-CapsNet obtains 76.73% accuracy on the CIFAR-10 dataset while performing inference 4x faster than CapsNet. In addition, our proposed network is more robust at detecting images with affine transformations compared to CapsNet. We achieve 94.3% accuracy on the AffNIST dataset (compared to CapsNet 90.52%).

[94] Target-Balanced Score Distillation

Zhou Xu,Qi Wang,Yuxiao Yang,Luyuan Zhang,Zhang Liang,Yang Li

Main category: cs.CV

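TL;DR: Proposes Target-Balanced Score Distillation (TBSD), which uses adaptive multi-objective optimization to resolve the trade-off between texture quality and shape distortion caused by negative prompts, improving both the texture fidelity and the geometric accuracy of 3D assets.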

Details

Motivation: Vanilla SDS suffers from over-saturation and over-smoothing; introducing negative prompts improves texture but forces a trade-off between limited texture optimization and shape distortion. Method: Conducts a systematic analysis showing that embedding target information in the negative prompts (Target Negative Prompts, TNP) is the key cause of enhanced texture accompanied by shape distortion; proposes TBSD, which formulates generation as a multi-objective optimization problem and uses an adaptive strategy to balance texture and shape. Result: Experiments show that TBSD improves texture quality while preserving geometric accuracy and outperforms existing state-of-the-art methods. Conclusion: TBSD resolves the texture-shape trade-off that negative prompts introduce in 3D generation and produces high-quality 3D assets. Abstract: Score Distillation Sampling (SDS) enables 3D asset generation by distilling priors from pretrained 2D text-to-image diffusion models, but vanilla SDS suffers from over-saturation and over-smoothing. To mitigate this issue, recent variants have incorporated negative prompts. However, these methods face a critical trade-off: limited texture optimization, or significant texture gains with shape distortion. In this work, we first conduct a systematic analysis and reveal that this trade-off is fundamentally governed by the utilization of the negative prompts, where Target Negative Prompts (TNP), which embed target information in the negative prompts, dramatically enhance texture realism and fidelity but induce shape distortions. Informed by this key insight, we introduce the Target-Balanced Score Distillation (TBSD). It formulates generation as a multi-objective optimization problem and introduces an adaptive strategy that effectively resolves the aforementioned trade-off. Extensive experiments demonstrate that TBSD significantly outperforms existing state-of-the-art methods, yielding 3D assets with high-fidelity textures and geometrically accurate shape.

[95] CompressNAS : A Fast and Efficient Technique for Model Compression using Decomposition

Sudhakar Sah,Nikhil Chabbra,Matthieu Durnerin

Main category: cs.CV

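TL;DR: Proposes CompressNAS, a framework that optimizes rank selection for low-rank tensor decomposition through global search, compressing CNN models substantially while preserving accuracy.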

Details

Motivation: Deep convolutional neural networks are hard to deploy on microcontrollers and lightweight NPUs, and existing low-rank decomposition methods mostly select ranks locally, ignoring the global trade-off between compression and accuracy. Method: Treats rank selection as a global search problem and uses a fast accuracy estimator to explore candidate decompositions efficiently under memory and accuracy constraints. Result: On ImageNet, compresses ResNet-18 by 8x with less than a 4% accuracy drop; on COCO, compresses YOLOv5s by 2x with no accuracy drop and YOLOv5n by 2x with a 2.5% drop; also presents the STResNet model family with competitive performance. Conclusion: CompressNAS balances high compression ratios against accuracy for CNN models and is suited to deployment on resource-constrained devices. Abstract: Deep Convolutional Neural Networks (CNNs) are increasingly difficult to deploy on microcontrollers (MCUs) and lightweight NPUs (Neural Processing Units) due to their growing size and compute demands. Low-rank tensor decomposition, such as Tucker factorization, is a promising way to reduce parameters and operations with reasonable accuracy loss. However, existing approaches select ranks locally and often ignore global trade-offs between compression and accuracy. We introduce CompressNAS, a MicroNAS-inspired framework that treats rank selection as a global search problem. CompressNAS employs a fast accuracy estimator to evaluate candidate decompositions, enabling efficient yet exhaustive rank exploration under memory and accuracy constraints. On ImageNet, CompressNAS compresses ResNet-18 by 8x with less than 4% accuracy drop; on COCO, we achieve 2x compression of YOLOv5s without any accuracy drop and 2x compression of YOLOv5n with a 2.5% drop. Finally, we present a new family of compressed models, STResNet, with competitive performance compared to other efficient models.
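
Why rank selection drives the compression ratio can be seen from simple parameter counting. The sketch below assumes the common Tucker-2 scheme for a convolution layer (a 1x1 convolution into `r_in` channels, a k x k core convolution from `r_in` to `r_out` channels, and a 1x1 convolution back out) and ignores biases; the layer sizes and ranks are hypothetical, and this is not the CompressNAS search itself.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (biases ignored)."""
    return c_in * c_out * k * k


def tucker2_params(c_in, c_out, k, r_in, r_out):
    """Weights after a Tucker-2 split into 1x1, k x k core, and 1x1 convolutions."""
    return c_in * r_in + r_in * r_out * k * k + r_out * c_out


def compression_ratio(c_in, c_out, k, r_in, r_out):
    """Original weight count divided by the decomposed weight count."""
    return conv_params(c_in, c_out, k) / tucker2_params(c_in, c_out, k, r_in, r_out)


# Hypothetical 3x3 layer with 256 input and output channels, ranks (64, 64).
print(conv_params(256, 256, 3))                # 589824
print(tucker2_params(256, 256, 3, 64, 64))     # 69632
print(compression_ratio(256, 256, 3, 64, 64))  # about 8.47
```

Because the core term scales with `r_in * r_out * k * k`, a small change in rank for one layer shifts the memory budget available to every other layer, which is one reason to treat rank selection as a global rather than per-layer problem.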

[96] AdaptFly: Prompt-Guided Adaptation of Foundation Models for Low-Altitude UAV Networks

Jiao Chen,Haoyi Wang,Jianhua Tang,Junyi Wang

Main category: cs.CV

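TL;DR: Proposes AdaptFly, a prompt-guided test-time adaptation framework that needs no weight updates and improves the robustness of semantic segmentation in low-altitude UAV networks.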

Details

Motivation: Segmentation foundation models degrade quickly under changes in weather, lighting, and viewpoint, and UAVs differ widely in resources, which makes efficient test-time adaptation difficult. Method: Designs two complementary adaptation modes: resource-limited UAVs perform lightweight token-prompt retrieval from a shared global memory, while resource-rich UAVs use gradient-free sparse visual prompt optimization based on the Covariance Matrix Adaptation Evolution Strategy. An activation-statistic detector triggers adaptation, and a cross-UAV knowledge pool consolidates prompt knowledge to enable fleet-wide collaboration with low bandwidth overhead. Result: On the UAVid and VDD benchmarks and in real-world UAV deployments, AdaptFly significantly outperforms static models and existing test-time adaptation methods in segmentation accuracy and robustness. Conclusion: AdaptFly offers an efficient, communication-saving perception adaptation scheme for resource-heterogeneous UAV networks in the low-altitude economy. Abstract: Low-altitude Unmanned Aerial Vehicle (UAV) networks rely on robust semantic segmentation as a foundational enabler for distributed sensing-communication-control co-design across heterogeneous agents within the network. However, segmentation foundation models deteriorate quickly under weather, lighting, and viewpoint drift. Resource-limited UAVs cannot run gradient-based test-time adaptation, while resource-massive UAVs adapt independently, wasting shared experience. To address these challenges, we propose AdaptFly, a prompt-guided test-time adaptation framework that adjusts segmentation models without weight updates. AdaptFly features two complementary adaptation modes. For resource-limited UAVs, it employs lightweight token-prompt retrieval from a shared global memory. For resource-massive UAVs, it uses gradient-free sparse visual prompt optimization via Covariance Matrix Adaptation Evolution Strategy. An activation-statistic detector triggers adaptation, while a cross-UAV knowledge pool consolidates prompt knowledge and enables fleet-wide collaboration with negligible bandwidth overhead. Extensive experiments on UAVid and VDD benchmarks, along with real-world UAV deployments under diverse weather conditions, demonstrate that AdaptFly significantly improves segmentation accuracy and robustness over static models and state-of-the-art TTA baselines. The results highlight a practical path to resilient, communication-efficient perception in the emerging low-altitude economy.

Zekai Shi,Zhixi Cai,Kalin Stefanov

Main category: cs.CV

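TL;DR: Proposes a self-supervised masking strategy inspired by the blind spot in human vision for learning visual representations, and combines it with a contrastive learning model to acquire the word-referent mappings involved in children's language acquisition.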

Details

Motivation: Mimics how children learn words from visual and auditory cues without prior knowledge, and explores biologically plausible self-supervised learning. Method: Uses a masked-autoencoder visual backbone with a new masking strategy that simulates the blind spot of the human eye, then applies the encoder in a contrastive video-text model to learn word-referent mappings. Result: The biologically plausible masking strategy is at least as effective as random masking for learning word-referent mappings, while being easier to justify biologically. Conclusion: A masking strategy modeled on physiological features makes visual representation learning more biologically plausible while maintaining good cross-modal alignment. Abstract: Typically, children start to learn their first words between 6 and 9 months, linking spoken utterances to their visual referents. Without prior knowledge, a word encountered for the first time can be interpreted in countless ways; it might refer to any of the objects in the environment, their components, or attributes. Using longitudinal, egocentric, and ecologically valid data from the experience of one child, in this work, we propose a self-supervised and biologically plausible strategy to learn strong visual representations. Our masked autoencoder-based visual backbone incorporates knowledge about the blind spot in human eyes to define a novel masking strategy. This mask and reconstruct approach attempts to mimic the way the human brain fills the gaps in the eyes' field of view. This represents a significant shift from standard random masking strategies, which are difficult to justify from a biological perspective. The pretrained encoder is utilized in a contrastive learning-based video-text model capable of acquiring word-referent mappings. Extensive evaluation suggests that the proposed biologically plausible masking strategy is at least as effective as random masking for learning word-referent mappings from cross-situational and temporally extended episodes.

[98] GROVER: Graph-guided Representation of Omics and Vision with Expert Regulation for Adaptive Spatial Multi-omics Fusion

Yongjun Xiao,Dian Meng,Xinlei Huang,Yanran Liu,Shiwei Ruan,Ziyue Qiao,Xubin Zheng

Main category: cs.CV

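TL;DR: Proposes GROVER, a framework that uses graph convolutional networks and contrastive learning to adaptively fuse spatial multi-omics data with pathology images.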

Details

Motivation: Existing methods struggle to integrate spatial transcriptomics, proteomics, and epigenomics with histopathological images because of modality heterogeneity, resolution mismatch, and perturbations introduced during sample preparation. Method: Uses a graph convolutional encoder based on Kolmogorov-Arnold Networks to model nonlinear dependencies, introduces a spot-feature-pair contrastive learning strategy to align multimodal representations, and designs a dynamic expert routing mechanism that adaptively selects high-quality modality inputs. Result: Experiments on real-world datasets show that GROVER outperforms state-of-the-art methods and integrates modalities more robustly and reliably. Conclusion: GROVER offers an effective way to fuse multimodal spatial omics data in complex tissues and supports a fuller understanding of the biological mechanisms of diseased tissue. Abstract: Effectively modeling multimodal spatial omics data is critical for understanding tissue complexity and underlying biological mechanisms. While spatial transcriptomics, proteomics, and epigenomics capture molecular features, they lack pathological morphological context. Integrating these omics with histopathological images is therefore essential for comprehensive disease tissue analysis. However, substantial heterogeneity across omics, imaging, and spatial modalities poses significant challenges. Naive fusion of semantically distinct sources often leads to ambiguous representations. Additionally, the resolution mismatch between high-resolution histology images and lower-resolution sequencing spots complicates spatial alignment. Biological perturbations during sample preparation further distort modality-specific signals, hindering accurate integration. To address these challenges, we propose Graph-guided Representation of Omics and Vision with Expert Regulation for Adaptive Spatial Multi-omics Fusion (GROVER), a novel framework for adaptive integration of spatial multi-omics data. GROVER leverages a Graph Convolutional Network encoder based on Kolmogorov-Arnold Networks to capture the nonlinear dependencies between each modality and its associated spatial structure, thereby producing expressive, modality-specific embeddings. To align these representations, we introduce a spot-feature-pair contrastive learning strategy that explicitly optimizes the correspondence across modalities at each spot. Furthermore, we design a dynamic expert routing mechanism that adaptively selects informative modalities for each spot while suppressing noisy or low-quality inputs.
Experiments on real-world spatial omics datasets demonstrate that GROVER outperforms state-of-the-art baselines, providing a robust and reliable solution for multimodal integration.

[99] Exposing DeepFakes via Hyperspectral Domain Mapping

Aditya Mehta,Swarnim Chaudhary,Pratik Narang,Jagat Sesh Challa

Main category: cs.CV

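TL;DR: Proposes HSI-Detect, a two-stage pipeline that reconstructs a 31-channel hyperspectral image from a standard RGB input and performs detection in the hyperspectral domain, amplifying forgery artifacts that are weak or invisible in RGB and improving Deepfake detection.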

Details

Motivation: Existing detection methods operate only in RGB space and are limited to three spectral channels, which makes subtle forgery artifacts hard to capture; richer spectral information is needed to improve detection. Method: HSI-Detect uses a two-stage framework: it first reconstructs a 31-channel hyperspectral image from the RGB image, then performs Deepfake detection in the hyperspectral domain, where the denser spectral bands amplify manipulation artifacts. Result: Experiments on the FaceForensics++ dataset show that HSI-Detect consistently outperforms RGB-only methods, supporting the value of spectral-domain mapping for Deepfake detection. Conclusion: By introducing hyperspectral information, HSI-Detect exposes forgery artifacts more effectively and shows the potential of spectral expansion for detecting generated images. Abstract: Modern generative and diffusion models produce highly realistic images that can mislead human perception and even sophisticated automated detection systems. Most detection methods operate in RGB space and thus analyze only three spectral channels. We propose HSI-Detect, a two-stage pipeline that reconstructs a 31-channel hyperspectral image from a standard RGB input and performs detection in the hyperspectral domain. Expanding the input representation into denser spectral bands amplifies manipulation artifacts that are often weak or invisible in the RGB domain, particularly in specific frequency bands. We evaluate HSI-Detect on the FaceForensics++ dataset and show consistent improvements over RGB-only baselines, illustrating the promise of spectral-domain mapping for Deepfake detection.

[100] Toward bilipshiz geometric models

Yonatan Sverdlov,Eitan Rosen,Nadav Dym

Main category: cs.CV

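TL;DR: Studies bi-Lipschitz equivalence for symmetry-aware point cloud neural networks, shows that popular invariant networks are not bi-Lipschitz with respect to the Procrustes Matching metric, and proposes a modification that obtains this property; experiments show the new model performs better on 3D point cloud matching.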

Details

Motivation: Motivated by the advantages of bi-Lipschitz models in equivariant learning, the paper asks whether point cloud networks preserve symmetry-aware distances in the bi-Lipschitz sense, with the aim of improving tasks such as correspondence matching. Method: Analyzes bi-Lipschitz equivalence between two symmetry-aware metrics (the Procrustes Matching metric and Hard Gromov Wasserstein distances), proves that popular invariant networks lack this property, and proposes a modification that gives the networks bi-Lipschitz guarantees. Result: Shows that the PM and Hard GW distances are not bi-Lipschitz equivalent, deduces that standard invariant networks are not bi-Lipschitz with respect to the PM metric, presents a network modification that achieves the bi-Lipschitz property, and verifies on 3D point cloud matching that it outperforms standard models. Conclusion: Standard invariant point cloud networks are not bi-Lipschitz with respect to symmetry-aware distances, but a suitable modification achieves the property, and the proposed model performs better on correspondence matching. Abstract: Many neural networks for point clouds are, by design, invariant to the symmetries of this datatype: permutations and rigid motions. The purpose of this paper is to examine whether such networks preserve natural symmetry-aware distances on the point cloud spaces, through the notion of bi-Lipschitz equivalence. This inquiry is motivated by recent work in the equivariant learning literature which highlights the advantages of bi-Lipschitz models in other scenarios. We consider two symmetry-aware metrics on point clouds: (a) the Procrustes Matching (PM) metric and (b) Hard Gromov Wasserstein distances. We show that these two distances themselves are not bi-Lipschitz equivalent, and as a corollary deduce that popular invariant networks for point clouds are not bi-Lipschitz with respect to the PM metric. We then show how these networks can be modified so that they do obtain bi-Lipschitz guarantees. Finally, we provide initial experiments showing the advantage of the proposed bi-Lipschitz model over standard invariant models, for the task of finding correspondences between 3D point clouds.
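
For readers unfamiliar with the term, the standard textbook definition of the bi-Lipschitz property used in the abstract above can be written as follows. The symbols $X$, $Y$, $d_X$, $d_Y$, $c$, and $C$ are generic notation chosen here, not the paper's; for an invariant network, $X$ would presumably be the space of point clouds modulo permutations and rigid motions.

```latex
A map $f : (X, d_X) \to (Y, d_Y)$ is bi-Lipschitz if there exist
constants $0 < c \le C < \infty$ such that
\[
  c \, d_X(x, x') \;\le\; d_Y\bigl(f(x), f(x')\bigr) \;\le\; C \, d_X(x, x')
  \qquad \text{for all } x, x' \in X .
\]
Two metrics $d_1, d_2$ on the same space are bi-Lipschitz equivalent
if the identity map is bi-Lipschitz, i.e.
\[
  c \, d_1(x, x') \;\le\; d_2(x, x') \;\le\; C \, d_1(x, x')
  \qquad \text{for all } x, x' .
\]
```

Under this definition, a model that is bi-Lipschitz with respect to one metric cannot also be bi-Lipschitz with respect to a second metric that is not equivalent to the first, which is the kind of corollary the abstract draws.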

[101] Concept-RuleNet: Grounded Multi-Agent Neurosymbolic Reasoning in Vision Language Models

Sanchit Sinha,Guangzhi Xiong,Zhenghao He,Aidong Zhang

Main category: cs.CV

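TL;DR: Proposes Concept-RuleNet, a multi-agent system that mines visual concepts from images and uses a large language model to generate interpretable neurosymbolic rules, improving performance while reducing hallucinated symbols.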

Details

Motivation: Existing neurosymbolic systems extract symbols from task labels, which leaves the symbols disconnected from the visual data, weakly grounded, and prone to hallucination. Method: 1) A multimodal concept generator mines discriminative visual concepts from training images; 2) these concepts guide symbol discovery and strengthen visual grounding; 3) a large language model composes the symbols into first-order logic rules; 4) a vision verifier agent quantifies symbol presence at inference time and triggers rule execution. Result: Across five benchmarks (covering medical imaging and natural images), average performance improves by 5% and hallucinated symbols are reduced by up to 50%. Conclusion: Concept-RuleNet achieves good visual grounding together with transparent reasoning and improves the accuracy and reliability of neurosymbolic systems. Abstract: Modern vision-language models (VLMs) deliver impressive predictive accuracy yet offer little insight into 'why' a decision is reached, frequently hallucinating facts, particularly when encountering out-of-distribution data. Neurosymbolic frameworks address this by pairing black-box perception with interpretable symbolic reasoning, but current methods extract their symbols solely from task labels, leaving them weakly grounded in the underlying visual data. In this paper, we introduce Concept-RuleNet, a multi-agent system that reinstates visual grounding while retaining transparent reasoning. Specifically, a multimodal concept generator first mines discriminative visual concepts directly from a representative subset of training images. Next, these visual concepts are utilized to condition symbol discovery, anchoring the generations in real image statistics and mitigating label bias. Subsequently, symbols are composed into executable first-order rules by a large language model reasoner agent, yielding interpretable neurosymbolic rules. Finally, during inference, a vision verifier agent quantifies the degree of presence of each symbol and triggers rule execution in tandem with outputs of black-box neural models, yielding predictions with explicit reasoning pathways. Experiments on five benchmarks, including two challenging medical-imaging tasks and three underrepresented natural-image datasets, show that our system augments state-of-the-art neurosymbolic baselines by an average of 5% while also reducing the occurrence of hallucinated symbols in rules by up to 50%.

[102] Batch Transformer Architecture: Case of Synthetic Image Generation for Emotion Expression Facial Recognition

Stanislav Selitskiy

Main category: cs.CV

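TL;DR: Proposes Batch Transformers, a new Transformer variant in the implicit sparse style that attends to the important dimensions and thereby reduces the bottleneck size in encoder-decoder architectures.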

Details

Motivation: Traditional Transformers attend over whole sequences or batch entities at full dimensionality, which is computationally expensive and informationally redundant, so a more efficient attention mechanism is needed. Method: Batch Transformers apply attention only to the "important" dimensions (primary components), achieving feature selection and implicit sparsity and reducing model complexity. Result: The architecture was tested on face recognition with a makeup and occlusion data set and used for synthetic image generation, which increased the variability of the limited original data set. Conclusion: The method substantially reduces the bottleneck size of encoder-decoder architectures, improves data utilization and model efficiency, and suits image generation and recognition in data-limited settings. Abstract: A novel Transformer variant is proposed in the implicit sparse style. Unlike "traditional" Transformers, which attend to sequential or batch entities across their whole dimensionality, the proposed Batch Transformers attend only to the "important" dimensions (primary components). Selecting the "important" dimensions in this way allows a significant reduction of the bottleneck size in encoder-decoder ANN architectures. The proposed architecture is tested on synthetic image generation for the face recognition task on a makeup and occlusion data set, where it increases the variability of the limited original data set.

[103] Image-POSER: Reflective RL for Multi-Expert Image Generation and Editing

Hossein Mohebbi,Mohammed Abdulrahman,Yanting Miao,Pascal Poupart,Suraj Kothawade

Main category: cs.CV

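TL;DR: Proposes Image-POSER, a reflective reinforcement learning framework that coordinates multiple pretrained image generation experts through dynamic task decomposition and structured feedback from a vision-language model; it handles long, complex text prompts, outperforms existing models in alignment, fidelity, and aesthetics, and is preferred in human evaluations.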

Details

Motivation: Existing text-to-image models are unreliable on long, complex creative prompts and fall short of real creative needs, so a system is needed that automatically decomposes tasks and coordinates multiple expert models. Method: Image generation and editing are modeled as a Markov Decision Process; a reflective reinforcement learning framework dynamically invokes pretrained text-to-image and image-to-image models, with a vision-language model acting as critic and providing structured feedback at each step, so prompts are handled end-to-end. Result: Experiments show that Image-POSER outperforms baselines (including frontier models) on industry-standard and custom benchmarks in alignment, image quality, and aesthetic scores, and is chosen more often in human evaluations. Conclusion: Reinforcement learning can give AI systems the ability to autonomously decompose, reorder, and combine visual models, and Image-POSER advances the development of general-purpose visual assistants. Abstract: Recent advances in text-to-image generation have produced strong single-shot models, yet no individual system reliably executes the long, compositional prompts typical of creative workflows. We introduce Image-POSER, a reflective reinforcement learning framework that (i) orchestrates a diverse registry of pretrained text-to-image and image-to-image experts, (ii) handles long-form prompts end-to-end through dynamic task decomposition, and (iii) supervises alignment at each step via structured feedback from a vision-language model critic. By casting image synthesis and editing as a Markov Decision Process, we learn non-trivial expert pipelines that adaptively combine strengths across models. Experiments show that Image-POSER outperforms baselines, including frontier models, across industry-standard and custom benchmarks in alignment, fidelity, and aesthetics, and is consistently preferred in human evaluations. These results highlight that reinforcement learning can endow AI systems with the capacity to autonomously decompose, reorder, and combine visual models, moving towards general-purpose visual assistants.

[104] SOTFormer: A Minimal Transformer for Unified Object Tracking and Trajectory Prediction

Zhongping Dong,Pengyang Yu,Shuangjian Li,Liming Chen,Mohand Tahar Kechadi

Main category: cs.CV

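TL;DR: SOTFormer is a lightweight, constant-memory temporal Transformer that unifies object detection, tracking, and short-horizon trajectory prediction; a ground-truth-primed memory and a burn-in anchor loss give stable identity propagation, with high accuracy and real-time speed on Mini-LaSOT.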

Details

Motivation: Existing single-object trackers struggle to stay temporally consistent under occlusion, scale variation, and temporal drift, and most models use recurrent or stacked structures that consume a lot of memory and make real-time inference difficult. Method: SOTFormer uses a single lightweight temporal-attention layer and a ground-truth-primed memory, combined with a burn-in anchor loss that stabilizes initialization, to perform detection, tracking, and short-term motion prediction end-to-end. Result: On the Mini-LaSOT (20%) benchmark it reaches 76.3 AUC and 53.7 FPS with only 4.3 GB of VRAM, significantly outperforming TrackFormer and MOTRv2, especially under fast motion, scale change, and occlusion. Conclusion: By simplifying temporal modeling, SOTFormer improves tracking robustness and accuracy while keeping memory low and frame rate high, making it suitable for real-time perception systems. Abstract: Accurate single-object tracking and short-term motion forecasting remain challenging under occlusion, scale variation, and temporal drift, which disrupt the temporal coherence required for real-time perception. We introduce SOTFormer, a minimal constant-memory temporal transformer that unifies object detection, tracking, and short-horizon trajectory prediction within a single end-to-end framework. Unlike prior models with recurrent or stacked temporal encoders, SOTFormer achieves stable identity propagation through a ground-truth-primed memory and a burn-in anchor loss that explicitly stabilizes initialization. A single lightweight temporal-attention layer refines embeddings across frames, enabling real-time inference with fixed GPU memory. On the Mini-LaSOT (20%) benchmark, SOTFormer attains 76.3 AUC and 53.7 FPS (AMP, 4.3 GB VRAM), outperforming transformer baselines such as TrackFormer and MOTRv2 under fast motion, scale change, and occlusion.

[105] MP-GFormer: A 3D-Geometry-Aware Dynamic Graph Transformer Approach for Machining Process Planning

Fatemeh Elhambakhsh,Gaurav Ameta,Aditi Roy,Hyunwoong Ko

Main category: cs.CV

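TL;DR: Proposes MP-GFormer, a 3D-geometry-aware dynamic graph Transformer for predicting machining operation sequences; it combines StereoLithography surface meshes with the boundary representation method and improves the accuracy of machining process planning.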

Details

Motivation: Existing dynamic graph learning methods for machining process planning do not incorporate the 3D geometry of parts, lack domain awareness, and have difficulty capturing the dynamic dependencies that evolve as operations are performed. Method: MP-GFormer combines 3D geometric information with dynamic graph learning: an attention mechanism integrates the StereoLithography surface mesh produced after each machining step, and the boundary representation method is used for the initial design, so the geometric evolution during machining is modeled. Result: On a synthesized dataset, accuracy for main and sub-operation prediction improves by 24% and 36%, respectively, over state-of-the-art methods. Conclusion: By adding 3D geometry awareness, MP-GFormer substantially improves machining operation sequence prediction and offers an effective solution for intelligent machining process planning. Abstract: Machining process planning (MP) is inherently complex due to structural and geometrical dependencies among part features and machining operations. A key challenge lies in capturing dynamic interdependencies that evolve with distinct part geometries as operations are performed. Machine learning has been applied to address challenges in MP, such as operation selection and machining sequence prediction. Dynamic graph learning (DGL) has been widely used to model dynamic systems, thanks to its ability to integrate spatio-temporal relationships. However, in MP, while existing DGL approaches can capture these dependencies, they fail to incorporate three-dimensional (3D) geometric information of parts and thus lack domain awareness in predicting machining operation sequences. To address this limitation, we propose MP-GFormer, a 3D-geometry-aware dynamic graph transformer that integrates evolving 3D geometric representations into DGL through an attention mechanism to predict machining operation sequences. Our approach leverages StereoLithography surface meshes representing the 3D geometry of a part after each machining operation, with the boundary representation method used for the initial 3D designs. We evaluate MP-GFormer on a synthesized dataset and demonstrate that the method achieves improvements of 24% and 36% in accuracy for main and sub-operation predictions, respectively, compared to state-of-the-art approaches.

[106] Defending Unauthorized Model Merging via Dual-Stage Weight Protection

Wei-Jia Chen,Min-Yen Tsai,Cheng-Yi Lee,Chia-Mu Yu

Main category: cs.CV

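TL;DR: Proposes MergeGuard, a dual-stage weight protection framework that prevents unauthorized model merging by redistributing task-relevant information and injecting structured perturbations, breaking merging compatibility while preserving the original model's performance.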

Details

Motivation: As pretrained models and open repositories spread, unauthorized model merging is becoming more common; it violates intellectual property rights and weakens model ownership and accountability, so an effective mechanism is needed to protect models from illicit merging. Method: MergeGuard has two stages: the first redistributes task-relevant information across layers via L2-regularized optimization so that important gradients are evenly dispersed; the second injects structured perturbations that misalign task subspaces and break curvature compatibility in the loss landscape, which hinders effective merging. Result: Experiments on vision (ViT-L-14) and language models (Llama2, Gemma2, Mistral) show that MergeGuard reduces merged model accuracy by up to 90% while affecting the protected model's task performance by less than 1.5%. Conclusion: MergeGuard defends effectively against unauthorized model merging, sharply reducing the usefulness of merged models with almost no effect on original task performance, and offers a practical approach to model copyright protection. Abstract: The rapid proliferation of pretrained models and open repositories has made model merging a convenient yet risky practice, allowing free-riders to combine fine-tuned models into a new multi-capability model without authorization. Such unauthorized model merging not only violates intellectual property rights but also undermines model ownership and accountability. To address this issue, we present MergeGuard, a proactive dual-stage weight protection framework that disrupts merging compatibility while maintaining task fidelity. In the first stage, we redistribute task-relevant information across layers via L2-regularized optimization, ensuring that important gradients are evenly dispersed. In the second stage, we inject structured perturbations to misalign task subspaces, breaking curvature compatibility in the loss landscape. Together, these stages reshape the model's parameter geometry such that merged models collapse into destructive interference while the protected model remains fully functional. Extensive experiments on both vision (ViT-L-14) and language (Llama2, Gemma2, Mistral) models demonstrate that MergeGuard reduces merged model accuracy by up to 90% with less than 1.5% performance loss on the protected model.

[107] FocusSDF: Boundary-Aware Learning for Medical Image Segmentation via Signed Distance Supervision

Muzammal Shafique,Nasir Rahim,Jamil Ahmad,Mohammad Siadat,Khalid Malik,Ghaus Malik

Main category: cs.CV

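TL;DR: Proposes FocusSDF, a new loss function based on signed distance functions (SDFs) that adaptively increases the weight of boundary regions to improve boundary preservation in medical image segmentation; its advantage is validated across several models and datasets.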

Details

Motivation: Most existing medical image segmentation models do not explicitly encode boundary information, so boundaries are poorly preserved, which affects the precision of clinical diagnosis and treatment. Method: The FocusSDF loss uses signed distance functions to assign higher weights to pixels close to the lesion or organ boundary, so the network focuses on boundary regions and segmentation training becomes boundary-aware. Result: Experiments with five state-of-the-art segmentation models, including MedSAM, and four distance-based loss functions, on datasets covering cerebral aneurysm, stroke, liver, and breast tumor segmentation, show that FocusSDF consistently outperforms existing distance-transform-based losses. Conclusion: FocusSDF improves boundary accuracy in medical image segmentation, generalizes well and is robust, and suits multi-modality, multi-task segmentation settings. Abstract: Segmentation of medical images constitutes an essential component of medical image analysis, providing the foundation for precise diagnosis and efficient therapeutic interventions in clinical practices. Despite substantial progress, most segmentation models do not explicitly encode boundary information; as a result, boundary preservation remains a persistent challenge in medical image segmentation. To address this challenge, we introduce FocusSDF, a novel loss function based on the signed distance functions (SDFs), which redirects the network to concentrate on boundary regions by adaptively assigning higher weights to pixels closer to the lesion or organ boundary, effectively making it boundary aware. To rigorously validate FocusSDF, we perform extensive evaluations against five state-of-the-art medical image segmentation models, including the foundation model MedSAM, using four distance-based loss functions across diverse datasets covering cerebral aneurysm, stroke, liver, and breast tumor segmentation tasks spanning multiple imaging modalities. The experimental results consistently demonstrate the superior performance of FocusSDF over existing distance transform based loss functions.
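
The idea of weighting pixels by their distance to the boundary can be sketched in a few lines. This is only an illustration of the general technique, not the FocusSDF loss itself: the exponential weighting in `boundary_weights`, the parameter `tau`, and the choice of binary cross-entropy as the base loss are all assumptions made here, and the brute-force distance computation is only practical for tiny masks.

```python
import math

def signed_distance(mask):
    """Brute-force signed distance map: negative inside the object, positive outside.

    Assumes the mask contains both foreground (1) and background (0) pixels.
    """
    h, w = len(mask), len(mask[0])
    pixels = [(r, c) for r in range(h) for c in range(w)]
    sdf = [[0.0] * w for _ in range(h)]
    for r, c in pixels:
        # Distance to the nearest pixel of the opposite class.
        d = min(math.hypot(r - rr, c - cc)
                for rr, cc in pixels if mask[rr][cc] != mask[r][c])
        sdf[r][c] = -d if mask[r][c] else d
    return sdf

def boundary_weights(sdf, tau=1.5):
    """Hypothetical weighting: decays with |SDF|, so boundary pixels count most."""
    return [[1.0 + math.exp(-abs(d) / tau) for d in row] for row in sdf]

def weighted_bce(prob, mask, weights, eps=1e-7):
    """Binary cross-entropy averaged with the boundary-aware weights."""
    total, norm = 0.0, 0.0
    for p_row, m_row, w_row in zip(prob, mask, weights):
        for p, m, w in zip(p_row, m_row, w_row):
            p = min(max(p, eps), 1.0 - eps)
            total += -w * (m * math.log(p) + (1 - m) * math.log(1.0 - p))
            norm += w
    return total / norm

# Toy 7x7 mask with a 3x3 square object in the middle.
mask = [[1 if 2 <= r <= 4 and 2 <= c <= 4 else 0 for c in range(7)] for r in range(7)]
sdf = signed_distance(mask)
weights = boundary_weights(sdf)
# Hypothetical prediction: 0.9 inside the object, 0.1 outside.
prob = [[0.9 if m else 0.1 for m in row] for row in mask]
loss = weighted_bce(prob, mask, weights)
```

In this toy example the object's edge pixels receive a larger weight than its center or the image corners, so errors near the boundary contribute more to the loss.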

[108] Lacking Data? No worries! How synthetic images can alleviate image scarcity in wildlife surveys: a case study with muskox (Ovibos moschatus)

Simon Durand,Samuel Foucher,Alexandre Delplanque,Joëlle Taillon,Jérôme Théau

Main category: cs.CV

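TL;DR: Investigates using synthetic imagery (SI) to improve deep learning object detection models for muskox detection when data are sparse; the results show that synthetic imagery improves detection in zero-shot and few-shot settings.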

Details

Motivation: Traditional wildlife survey methods are resource intensive and constrained by logistics, while deep learning models are limited by small datasets, so new data augmentation approaches are needed to improve monitoring of sparsely distributed species. Method: A baseline model trained on real imagery is compared with zero-shot (ZS) and few-shot (FS) models trained with progressively more synthetic imagery, evaluated on precision, recall, and F1 score. Result: In the zero-shot setting, adding synthetic imagery markedly improved detection, but performance plateaued once SI exceeded 100% of the baseline dataset; in the few-shot setting, combining real and synthetic imagery slightly improved recall and overall accuracy, without statistical significance. Conclusion: Synthetic imagery helps train effective object detection models when real data are scarce and offers a workable approach to wildlife monitoring, especially for rare or hard-to-reach species, with refinement as real data accumulate over time. Abstract: Accurate population estimates are essential for wildlife management, providing critical insights into species abundance and distribution. Traditional survey methods, including visual aerial counts and GNSS telemetry tracking, are widely used to monitor muskox populations in Arctic regions. These approaches are resource intensive and constrained by logistical challenges. Advances in remote sensing, artificial intelligence, and high resolution aerial imagery offer promising alternatives for wildlife detection. Yet, the effectiveness of deep learning object detection models (ODMs) is often limited by small datasets, making it challenging to train robust ODMs for sparsely distributed species like muskoxen. This study investigates the integration of synthetic imagery (SI) to supplement limited training data and improve muskox detection in zero-shot (ZS) and few-shot (FS) settings. We compared a baseline model trained on real imagery with 5 ZS and 5 FS models that incorporated progressively more SI in the training set. For the ZS models, where no real images were included in the training set, adding SI improved detection performance. As more SI were added, performance in precision, recall and F1 score increased, but eventually plateaued, suggesting diminishing returns when SI exceeded 100% of the baseline model training dataset. For FS models, combining real imagery and SI led to better recall and slightly higher overall accuracy compared to using real images alone, though these improvements were not statistically significant. 
Our findings demonstrate the potential of SI to train accurate ODMs when data is scarce, offering important perspectives for wildlife monitoring by enabling rare or inaccessible species to be monitored and by increasing monitoring frequency. This approach could be used to initiate ODMs without real data and refine them as real images are acquired over time.
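
The precision, recall, and F1 score used to compare the models follow the usual detection definitions. A minimal sketch, assuming detections have already been matched to ground-truth animals so that true positives, false positives, and false negatives are known; the counts below are invented for illustration and are not from the study.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1 from matched detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts: 8 muskoxen found, 2 false alarms, 4 animals missed.
precision, recall, f1 = detection_metrics(tp=8, fp=2, fn=4)
```

With these counts, precision is 8/10, recall is 8/12, and F1 is their harmonic mean, 8/11.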

[109] Advancing Annotat3D with Harpia: A CUDA-Accelerated Library For Large-Scale Volumetric Data Segmentation

Camila Machado de Araujo,Egon P. B. S. Borges,Ricardo Marcelo Canteiro Grangeiro,Allan Pinto

Main category: cs.CV

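TL;DR: Introduces Harpia, a CUDA-based processing library that enables scalable, interactive segmentation workflows for large 3D datasets in high-performance computing environments.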

Details

Motivation: High-resolution 3D imaging techniques produce ever larger datasets, and existing tools face efficiency problems in processing, segmentation, and interactive exploration. Method: A new CUDA library, Harpia, is developed and integrated into Annotat3D; it supports strict memory control and native chunked execution and provides GPU-accelerated filtering, annotation, and quantification tools. Result: Experiments show significant improvements in processing speed, memory efficiency, and scalability over widely used frameworks such as NVIDIA cuCIM and scikit-image. Conclusion: With its interactive human-in-the-loop interface and efficient GPU resource management, Harpia suits collaborative scientific imaging workflows in shared HPC environments. Abstract: High-resolution volumetric imaging techniques, such as X-ray tomography and advanced microscopy, generate increasingly large datasets that challenge existing tools for efficient processing, segmentation, and interactive exploration. This work introduces new capabilities to Annotat3D through Harpia, a new CUDA-based processing library designed to support scalable, interactive segmentation workflows for large 3D datasets in high-performance computing (HPC) and remote-access environments. Harpia features strict memory control, native chunked execution, and a suite of GPU-accelerated filtering, annotation, and quantification tools, enabling reliable operation on datasets exceeding single-GPU memory capacity. Experimental results demonstrate significant improvements in processing speed, memory efficiency, and scalability compared to widely used frameworks such as NVIDIA cuCIM and scikit-image. The system's interactive, human-in-the-loop interface, combined with efficient GPU resource management, makes it particularly suitable for collaborative scientific imaging workflows in shared HPC infrastructures.

[110] Prompt Triage: Structured Optimization Enhances Vision-Language Model Performance on Medical Imaging Benchmarks

Arnav Singhvi,Vasiliki Bikia,Asad Aali,Akshay Chaudhari,Roxana Daneshjou

Main category: cs.CV

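TL;DR: Applies the Declarative Self-improving Python (DSPy) framework to medical vision-language systems for structured automated prompt optimization and evaluates 10 open-source vision-language models on five medical imaging tasks; the median relative improvement is 53%, with the largest gains of 300% to 3,400% on tasks where zero-shot performance is low. The approach improves clinical image interpretation, reduces reliance on manual prompt design, scales well, and preserves data privacy.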

Details

Motivation: Existing medical vision-language models underperform; finetuning requires large datasets and compute, and manual prompt engineering generalizes poorly and is hard for medical institutions to adopt, so an optimization method is needed that does not depend on hand-written prompts, scales, and leaves model weights unchanged. Method: The DSPy framework is used for automated prompt optimization; prompting pipelines are built for five tasks across radiology, gastroenterology, and dermatology, 10 open-source vision-language models are evaluated, and four prompt optimization techniques are compared. Result: Optimized pipelines achieve a median relative improvement of 53% over zero-shot baselines, with gains of 300% to 3,400% on some tasks; the method scales well, improves performance without exposing sensitive data, and supports deployment of open-source models. Conclusion: Automated prompt optimization substantially improves medical vision-language models and reduces reliance on manual prompt design, letting clinicians focus on patient care and decision-making; it has practical deployment potential, and the code is released to support reproducible research. Abstract: Vision-language foundation models (VLMs) show promise for diverse imaging tasks but often underperform on medical benchmarks. Prior efforts to improve performance include model finetuning, which requires large domain-specific datasets and significant compute, or manual prompt engineering, which is hard to generalize and often inaccessible to medical institutions seeking to deploy these tools. These challenges motivate interest in approaches that draw on a model's embedded knowledge while abstracting away dependence on human-designed prompts to enable scalable, weight-agnostic performance improvements. To explore this, we adapt the Declarative Self-improving Python (DSPy) framework for structured automated prompt optimization in medical vision-language systems through a comprehensive, formal evaluation. We implement prompting pipelines for five medical imaging tasks across radiology, gastroenterology, and dermatology, evaluating 10 open-source VLMs with four prompt optimization techniques. Optimized pipelines achieved a median relative improvement of 53% over zero-shot prompting baselines, with the largest gains ranging from 300% to 3,400% on tasks where zero-shot performance is low. These results highlight the substantial potential of applying automated prompt optimization to medical AI systems, demonstrating significant gains for vision-based applications requiring accurate clinical image interpretation. By reducing dependence on prompt design to elicit intended outputs, these techniques allow clinicians to focus on patient care and clinical decision-making. 
Furthermore, our experiments offer scalability and preserve data privacy, demonstrating performance improvement on open-source VLMs. We publicly release our evaluation pipelines to support reproducible research on specialized medical tasks, available at https://github.com/DaneshjouLab/prompt-triage-lab.
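
The contrast between a 53% median and gains of several thousand percent follows from how relative improvement behaves when the baseline is small. A short sketch of the arithmetic, using invented (baseline, optimized) score pairs that are not taken from the paper:

```python
from statistics import median

def relative_improvement_pct(baseline, optimized):
    """Relative gain of the optimized score over the baseline, in percent."""
    return (optimized - baseline) / baseline * 100.0

# Hypothetical (zero-shot, optimized) scores for three model/task pairs.
pairs = [(0.40, 0.60), (0.50, 0.70), (0.02, 0.70)]
gains = [relative_improvement_pct(b, o) for b, o in pairs]
median_gain = median(gains)
```

Here the first two pairs improve by roughly 50% and 40%, while the pair with a near-zero baseline improves by roughly 3,400%. The median stays near 50% because it is insensitive to that single extreme value, which is one reason to report a median rather than a mean for this kind of metric.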

[111] PI-NAIM: Path-Integrated Neural Adaptive Imputation Model

Afifa Khaled,Ebrahim Hamid Sumiea

Main category: cs.CV

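TL;DR: Proposes PI-NAIM, a dual-path architecture with dynamic routing for imputing missing modalities in medical imaging and multimodal clinical settings; it combines statistical methods with neural networks, fuses them with attention, and reaches state-of-the-art imputation accuracy and downstream task performance.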

Details

Motivation: Existing missing-modality imputation methods fall short in either representational capacity or computational efficiency and struggle to handle complex missingness patterns while maintaining downstream task performance. Method: A dual-path architecture routes low-missingness samples to MICE and complex missingness patterns to GAIN with temporal analysis; missingness-aware cross-attention fuses the two paths; imputation accuracy and downstream task performance are jointly optimized end-to-end. Result: On MIMIC-III and other datasets, RMSE reaches 0.108 (versus 0.119-0.152 for baselines) and mortality prediction AUROC is 0.812, a substantial gain in downstream performance. Conclusion: Through dynamic routing and fusion, PI-NAIM achieves efficient and accurate multimodal imputation, with a modular design and practical application potential. Abstract: Medical imaging and multi-modal clinical settings often face the challenge of missing modalities in their diagnostic pipelines. Existing imputation methods either lack representational capacity or are computationally expensive. We propose PI-NAIM, a novel dual-path architecture that dynamically routes samples to optimized imputation approaches based on missingness complexity. Our framework integrates: (1) intelligent path routing that directs low missingness samples to efficient statistical imputation (MICE) and complex patterns to powerful neural networks (GAIN with temporal analysis); (2) cross-path attention fusion that leverages missingness-aware embeddings to intelligently combine both branches; and (3) end-to-end joint optimization of imputation accuracy and downstream task performance. Extensive experiments on MIMIC-III and multimodal benchmarks demonstrate state-of-the-art performance, achieving RMSE of 0.108 (vs. baselines' 0.119-0.152) and substantial gains in downstream tasks with an AUROC of 0.812 for mortality prediction. PI-NAIM's modular design enables seamless integration into vision pipelines handling incomplete sensor measurements, missing modalities, or corrupted inputs, providing a unified solution for real-world scenarios. The code is publicly available at https://github.com/AfifaKhaled/PI-NAIM-Path-Integrated-Neural-Adaptive-Imputation-Model
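
The routing step described in item (1) can be pictured as a simple dispatch on how much of a sample is missing. The sketch below is an assumption-laden illustration, not the authors' router: the abstract does not say how missingness complexity is measured, so the fraction of missing values and the `threshold` of 0.3 are placeholders, and the two path names merely stand in for the MICE and GAIN branches.

```python
def missing_fraction(sample):
    """Share of entries in a sample that are missing (represented as None)."""
    return sum(v is None for v in sample) / len(sample)

def route(sample, threshold=0.3):
    """Hypothetical router: low missingness -> statistical path, otherwise neural path."""
    return "statistical" if missing_fraction(sample) <= threshold else "neural"

# Two toy samples with 25% and 75% of their values missing.
low_missing = [1.0, None, 3.0, 4.0]
high_missing = [None, None, 3.0, None]
paths = [route(low_missing), route(high_missing)]
```

The first sample would go to the cheaper statistical path and the second to the neural path; in the paper, the outputs of both branches are then combined by attention-based fusion rather than used separately.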

[112] Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models

Siyou Li,Huanan Wu,Juexi Shao,Yinghao Ma,Yujian Gan,Yihao Luo,Yuwei Wang,Dong Nie,Lu Wang,Wengqing Wu,Le Zhang,Massimo Poesio,Juntao Yu

Main category: cs.CV

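TL;DR: Proposes QTSplus, a lightweight visual token selection module that improves the efficiency and performance of multimodal large language models on long-video understanding, substantially compressing the vision stream and reducing latency while keeping accuracy high.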

Details

Motivation: In long-video understanding the number of vision tokens grows linearly with video length, which sharply increases attention cost, memory, and latency, and existing methods do not handle this efficiently. Method: The Query-aware Token Selector (QTSplus) scores tokens via cross-attention, predicts an instance-specific retention budget, selects the Top-n tokens with a differentiable straight-through estimator during training and a hard gate at inference, and adds a small re-encoder that preserves temporal order. Result: Integrated into Qwen2.5-VL, QTSplus compresses the vision stream by up to 89% and reduces end-to-end latency by 28%; on eight long-video benchmarks accuracy is close to the original model, and TempCompass direction and order accuracy improve by +20.5 and +5.6 points, respectively. Conclusion: QTSplus is an effective, general mechanism for scaling MLLMs to real-world long-video scenarios while preserving task-relevant evidence. Abstract: Despite the recent advances in the video understanding ability of multimodal large language models (MLLMs), long video understanding remains a challenge. One of the main issues is that the number of vision tokens grows linearly with video length, which causes an explosion in attention cost, memory, and latency. To solve this challenge, we present Query-aware Token Selector (QTSplus), a lightweight yet powerful visual token selection module that serves as an information gate between the vision encoder and LLMs. Given a text query and video tokens, QTSplus dynamically selects the most important visual evidence for the input text query by (i) scoring visual tokens via cross-attention, (ii) predicting an instance-specific retention budget based on the complexity of the query, and (iii) selecting Top-$n$ tokens with a differentiable straight-through estimator during training and a hard gate at inference. Furthermore, a small re-encoder preserves temporal order using absolute time information, enabling second-level localization while maintaining global coverage. Integrated into Qwen2.5-VL, QTSplus compresses the vision stream by up to 89% and reduces end-to-end latency by 28% on long videos. The evaluation on eight long video understanding benchmarks shows near-parity accuracy overall when compared with the original Qwen models and outperforms the original model by +20.5 and +5.6 points respectively on TempCompass direction and order accuracies. 
These results show that QTSplus is an effective, general mechanism for scaling MLLMs to real-world long-video scenarios while preserving task-relevant evidence. We will make all code, data, and trained models' weights publicly available.
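
The inference-time part of the selection procedure (score tokens against the query, then keep the top n with a hard gate) can be sketched as follows. This is a simplified illustration under stated assumptions: a single query vector and plain dot-product attention stand in for the paper's cross-attention, the budget `n` is fixed by hand rather than predicted, and the straight-through estimator used during training and the re-encoder are omitted.

```python
import math

def cross_attention_scores(query, tokens):
    """Softmax over scaled dot products between one query vector and each visual token."""
    scale = math.sqrt(len(query))
    logits = [sum(q * t for q, t in zip(query, tok)) / scale for tok in tokens]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def select_top_n(scores, n):
    """Hard gate: keep the n highest-scoring tokens, returned in temporal order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n])

# Toy example: five 2-dimensional "visual tokens" listed in temporal order.
query = [1.0, 0.0]
tokens = [[0.1, 0.9], [2.0, 0.0], [0.5, 0.5], [1.5, 0.2], [-1.0, 0.0]]
scores = cross_attention_scores(query, tokens)
kept = select_top_n(scores, n=2)
```

Returning the kept indices in their original order matters because downstream components rely on temporal position; the paper handles this with a re-encoder that uses absolute time information.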

[113] From Events to Clarity: The Event-Guided Diffusion Framework for Dehazing

Ling Wang,Yunfan Lu,Wenzong Ma,Huizai Yao,Pengteng Li,Hui Xiong

Main category: cs.CV

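TL;DR: Proposes, for the first time, using event cameras for dehazing: high-dynamic-range event data are combined with a diffusion model through an event-guided module that recovers clear images from hazy ones, achieving state-of-the-art results in real scenes.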

Details

Motivation: Traditional dehazing methods based on RGB frames are limited by insufficient dynamic range and tend to lose structure and illumination detail in dense haze, which leads to poor results. Method: An event-guided diffusion model maps high-dynamic-range (HDR) features captured by an event camera (such as edges and corners) into the diffusion latent space, giving precise structural guidance to the dehazing process. Result: Experiments on two benchmark datasets and a self-collected heavy-haze drone dataset show state-of-the-art dehazing performance. Conclusion: Combining event cameras with diffusion models offers a new direction for dehazing and improves the quality and realism of image restoration in complex haze. Abstract: Clear imaging under hazy conditions is a critical task. Prior-based and neural methods have improved results. However, they operate on RGB frames, which suffer from limited dynamic range. Therefore, dehazing remains ill-posed and can erase structure and illumination details. To address this, we use event cameras for dehazing for the first time. Event cameras offer much higher dynamic range (120 dB vs. 60 dB) and microsecond latency, so they suit hazy scenes. In practice, transferring HDR cues from events to frames is hard because real paired data are scarce. To tackle this, we propose an event-guided diffusion model that utilizes the strong generative priors of diffusion models to reconstruct clear images from hazy inputs by effectively transferring HDR information from events. Specifically, we design an event-guided module that maps sparse HDR event features, e.g., edges and corners, into the diffusion latent space. This clear conditioning provides precise structural guidance during generation, improves visual realism, and reduces semantic drift. For real-world evaluation, we collect a drone dataset in heavy haze (AQI = 341) with synchronized RGB and event sensors. Experiments on two benchmarks and our dataset achieve state-of-the-art results.
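
To put the 120 dB versus 60 dB comparison in perspective, the figures can be converted to contrast ratios. The sketch assumes the common image-sensor convention that dynamic range in decibels is 20·log10(brightest / darkest measurable intensity); the abstract does not state which convention it uses.

```python
import math

def db_to_contrast_ratio(db):
    """Invert DR_dB = 20 * log10(ratio) to recover the intensity ratio."""
    return 10.0 ** (db / 20.0)

def contrast_ratio_to_db(ratio):
    """Dynamic range in decibels for a given brightest/darkest intensity ratio."""
    return 20.0 * math.log10(ratio)

event_ratio = db_to_contrast_ratio(120)  # event camera figure quoted in the abstract
frame_ratio = db_to_contrast_ratio(60)   # conventional frame camera figure
```

Under that convention, 120 dB corresponds to a ratio of about a million to one and 60 dB to about a thousand to one, so the event sensor spans roughly a thousand times wider range of intensities.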

[114] Evaluation of Attention Mechanisms in U-Net Architectures for Semantic Segmentation of Brazilian Rock Art Petroglyphs

Leonardi Melo,Luís Gustavo,Dimmy Magalhães,Lucciani Vieira,Mauro Araújo

Main category: cs.CV

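TL;DR: Compares three U-Net-based architectures for semantic segmentation of petroglyphs from Brazilian archaeological sites; the Attention-Residual BEGL-UNet, which adds attention mechanisms, performs best with a Dice Score of 0.710, a clear improvement over the baseline, indicating that attention mechanisms are effective for digital preservation of archaeological heritage.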

Details

Motivation: To improve the accuracy of semantic segmentation of petroglyph images in support of digital preservation of archaeological heritage. Method: Three U-Net variants are used: BEGL-UNet, Attention-Residual BEGL-UNet, and Spatial Channel Attention BEGL-UNet, all trained with the BEGL loss that combines binary cross-entropy with Gaussian edge enhancement, and evaluated with 5-fold cross-validation on images from the Poço da Bebidinha site in Brazil. Result: Attention-Residual BEGL-UNet performs best (Dice Score 0.710, validation loss 0.067, recall 0.854); Spatial Channel Attention BEGL-UNet is comparable (Dice Score 0.707, recall 0.857); the baseline BEGL-UNet reaches 0.690. Attention mechanisms improve the Dice Score by 2.5-2.9%. Conclusion: Attention mechanisms clearly improve petroglyph segmentation; Attention-Residual BEGL-UNet in particular improves overall segmentation accuracy while keeping recall high, confirming its potential for archaeological image analysis. Abstract: This study presents a comparative analysis of three U-Net-based architectures for semantic segmentation of rock art petroglyphs from Brazilian archaeological sites. The investigated architectures were: (1) BEGL-UNet with Border-Enhanced Gaussian Loss function; (2) Attention-Residual BEGL-UNet, incorporating residual blocks and gated attention mechanisms; and (3) Spatial Channel Attention BEGL-UNet, which employs spatial-channel attention modules based on Convolutional Block Attention Module. All implementations employed the BEGL loss function combining binary cross-entropy with Gaussian edge enhancement. Experiments were conducted on images from the Poço da Bebidinha Archaeological Complex, Piauí, Brazil, using 5-fold cross-validation. Among the architectures, Attention-Residual BEGL-UNet achieved the best overall performance with Dice Score of 0.710, validation loss of 0.067, and highest recall of 0.854. Spatial Channel Attention BEGL-UNet obtained comparable performance with DSC of 0.707 and recall of 0.857. The baseline BEGL-UNet registered DSC of 0.690. These results demonstrate the effectiveness of attention mechanisms for archaeological heritage digital preservation, with Dice Score improvements of 2.5-2.9% over the baseline.
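
The reported 2.5-2.9% improvement appears to be a relative gain over the baseline Dice Score rather than a difference in absolute points. The sketch below gives the standard Dice coefficient for binary masks and checks that reading against the three scores quoted in the abstract; the tiny masks are made up for illustration.

```python
def dice_score(pred, target):
    """Dice = 2|A intersect B| / (|A| + |B|) for flat binary masks of equal length."""
    inter = sum(p and t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * inter / total

def relative_gain_pct(new, base):
    """Relative improvement of `new` over `base`, in percent."""
    return (new - base) / base * 100.0

# Toy masks: one overlapping pixel, two predicted, one in the ground truth.
example = dice_score([1, 1, 0, 0], [1, 0, 0, 0])

# Dice Scores quoted in the abstract.
gain_attention_residual = relative_gain_pct(0.710, 0.690)
gain_spatial_channel = relative_gain_pct(0.707, 0.690)
```

Rounded to one decimal place the two gains come out at 2.9% and 2.5%, matching the range in the abstract; in absolute terms the improvements are 0.020 and 0.017 Dice.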

Zhenhao Guo,Rachit Saluja,Tianyuan Yao,Quan Liu,Junchao Zhu,Haibo Wang,Daniel Reisenbüchler,Yuankai Huo,Benjamin Liechty,David J. Pisapia,Kenji Ikemura,Steven Salvatoree,Surya Seshane,Mert R. Sabuncu,Yihe Yang,Ruining Deng

Main category: cs.CV

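TL;DR: Frames fine-grained glomerular subtyping as a few-shot learning problem and systematically evaluates pathology-specialized and general-purpose vision-language models under data constraints; specialized models with simple fine-tuning classify effectively from very few labeled examples, and both image-text alignment and the separation of positive from negative examples shape model performance.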

Details

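Motivation: Fine-grained glomerular subtyping is important for clinical diagnosis, but high-quality labels are scarce and hard to obtain; existing methods mostly rely on full supervision and image-only models and do not fit the data limits of real clinical settings, so adaptation strategies for vision-language models under few-shot conditions need to be explored. Method: Fine-grained glomerular subtyping is treated as a few-shot problem and several pathology-specialized and general-purpose vision-language models are systematically evaluated; classification is assessed with accuracy, AUC, and F1, the alignment of image and text embeddings and the separability of subtypes are analyzed, and the effects of shot count, model architecture, domain knowledge, and adaptation strategy are examined. Result: Pathology-specialized vision-language models with standard fine-tuning show clear gains in classification ability and calibration with only 4-8 examples per subtype; more labels continue to bring improvements; discrimination between positive and negative examples is found to be as important as image-text alignment; adaptation strategy and supervision level jointly affect diagnostic performance and multimodal structure. Conclusion: Under clinically realistic few-shot conditions, a pathology-specialized vision-language model with simple fine-tuning is the most effective starting point; supervision level and adaptation strategy jointly determine performance and the structure of the multimodal representation, offering practical guidance for model selection, adaptation design, and annotation investment. Abstract: Fine-grained glomerular subtyping is central to kidney biopsy interpretation, but clinically valuable labels are scarce and difficult to obtain. Existing computational pathology approaches instead tend to evaluate coarse disease classification under full supervision with image-only models, so it remains unclear how vision-language models (VLMs) should be adapted for clinically meaningful subtyping under data constraints. In this work, we model fine-grained glomerular subtyping as a clinically realistic few-shot problem and systematically evaluate both pathology-specialized and general-purpose vision-language models under this setting. We assess not only classification performance (accuracy, AUC, F1) but also the geometry of the learned representations, examining feature alignment between image and text embeddings and the separability of glomerular subtypes. By jointly analyzing shot count, model architecture and domain knowledge, and adaptation strategy, this study provides guidance for future model selection and training under real clinical data constraints. Our results indicate that pathology-specialized vision-language backbones, when paired with vanilla fine-tuning, are the most effective starting point. Even with only 4-8 labeled examples per glomerular subtype, these models begin to capture distinctions and show substantial gains in discrimination and calibration, though additional supervision continues to yield incremental improvements. 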
We also find that the discrimination between positive and negative examples is as important as image-text alignment. Overall, our results show that supervision level and adaptation strategy jointly shape both diagnostic performance and multimodal structure, providing guidance for model selection, adaptation strategies, and annotation investment.

[116] BeyondFacial: Identity-Preserving Personalized Generation Beyond Facial Close-ups

Songsong Zhang,Chuanqi Tang,Hongguang Zhang,Guijian Tang,Minglong Li,Xueqiong Li,Shaowu Yang,Yuanxi Peng,Wenjing Yang,Jing Zhao

Main category: cs.CV

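TL;DR: This paper proposes a new identity-preserving personalized generation (IPPG) method that uses dual-line inference, adaptive fusion, and an identity aggregation module to address existing methods' over-reliance on facial close-ups and poor semantic consistency, jointly optimizing identity fidelity and scene-semantic generation.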

Details

Motivation: Existing IPPG methods focus too heavily on facial regions, so outputs are mostly facial close-ups that lack visual narrativity and are semantically inconsistent under complex text prompts; the core problem is that identity feature embeddings weaken the generative model's semantic expressiveness. Method: A Dual-Line Inference (DLI) architecture separates identity and semantic representations; an Identity Adaptive Fusion (IdAF) strategy defers fusion to the noise prediction stage and keeps identity embeddings from interfering with semantics; an Identity Aggregation Prepending (IdAP) module strengthens identity preservation. Result: Experiments show stable and effective performance on IPPG tasks beyond facial close-ups, with efficient generation that needs no manual masking or fine-tuning, and the method can be integrated into existing frameworks as a plug-and-play component. Conclusion: The method breaks the facial close-up constraint, improves the visual narrativity and semantic consistency of generated content, supports film-level character-scene creation, and provides richer personalized generation capabilities for related domains. Abstract: Identity-Preserving Personalized Generation (IPPG) has advanced film production and artistic creation, yet existing approaches overemphasize facial regions, resulting in outputs dominated by facial close-ups. These methods suffer from weak visual narrativity and poor semantic consistency under complex text prompts, with the core limitation rooted in identity (ID) feature embeddings undermining the semantic expressiveness of generative models. To address these issues, this paper presents an IPPG method that breaks the constraint of facial close-ups, achieving synergistic optimization of identity fidelity and scene semantic creation. Specifically, we design a Dual-Line Inference (DLI) pipeline with identity-semantic separation, resolving the representation conflict between ID and semantics inherent in traditional single-path architectures. Further, we propose an Identity Adaptive Fusion (IdAF) strategy that defers ID-semantic fusion to the noise prediction stage, integrating adaptive attention fusion and noise decision masking to avoid ID embedding interference on semantics without manual masking. Finally, an Identity Aggregation Prepending (IdAP) module is introduced to aggregate ID information and replace random initializations, further enhancing identity preservation. Experimental results validate that our method achieves stable and effective performance in IPPG tasks beyond facial close-ups, enabling efficient generation without manual masking or fine-tuning.
As a plug-and-play component, it can be rapidly deployed in existing IPPG frameworks, addressing the over-reliance on facial close-ups, facilitating film-level character-scene creation, and providing richer personalized generation capabilities for related domains.

[117] Dynamic Parameter Optimization for Highly Transferable Transformation-Based Attacks

Jiaming Liang,Chi-Man Pun

Main category: cs.CV

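TL;DR: This paper proposes Dynamic Parameter Optimization (DPO), built on the proposed Concentric Decay Model (CDM), which improves the transferability of transformation-based adversarial attacks across surrogate models, iteration counts, and tasks while sharply reducing the computational complexity of parameter optimization.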

Details

Motivation: Existing transformation-based adversarial attacks have blind spots in parameter optimization: they consider only low-iteration settings, use uniform parameters, and depend on computationally expensive grid search, all of which limit transferability and practical effectiveness. Method: An empirical study reveals three dynamic patterns of transferability with respect to parameter strength; the Concentric Decay Model (CDM) is proposed to explain them; and an efficient Dynamic Parameter Optimization (DPO) method is designed around the rise-then-fall pattern, reducing complexity from O(m^n) to O(n log m). Result: Experiments across multiple surrogate models, iteration counts, and tasks show that DPO significantly improves the transferability of existing transformation-based attacks while optimizing more efficiently. Conclusion: DPO provides a more efficient, adaptive parameter optimization scheme for transformation-based adversarial attacks, reveals the dynamic relationship between parameter strength and transferability, and advances black-box attacks. Abstract: Despite their wide application, the vulnerabilities of deep neural networks raise societal concerns. Among them, transformation-based attacks have demonstrated notable success in transfer attacks. However, existing attacks suffer from blind spots in parameter optimization, limiting their full potential. Specifically, (1) prior work generally considers low-iteration settings, yet attacks perform quite differently at higher iterations, so characterizing overall performance based only on low-iteration results is misleading. (2) Existing attacks use uniform parameters for different surrogate models, iterations, and tasks, which greatly impairs transferability. (3) Traditional transformation parameter optimization relies on grid search. For n parameters with m steps each, the complexity is O(m^n). Large computational overhead limits further optimization of parameters. To address these limitations, we conduct an empirical study with various transformations as baselines, revealing three dynamic patterns of transferability with respect to parameter strength. We further propose a novel Concentric Decay Model (CDM) to effectively explain these patterns. Building on these insights, we propose an efficient Dynamic Parameter Optimization (DPO) based on the rise-then-fall pattern, reducing the complexity to O(n log m). Comprehensive experiments on existing transformation-based attacks across different surrogate models, iterations, and tasks demonstrate that our DPO can significantly improve transferability.
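The gap between O(m^n) and O(n log m) can be illustrated with a small sketch. This is not the authors' DPO procedure, whose details the abstract does not give; it only shows why a rise-then-fall (unimodal) response allows a logarithmic search per parameter. The names `peak_index`, `coordinate_search`, and `toy_transfer_score` are hypothetical, and the sketch assumes the score is strictly unimodal in each parameter and that parameters can be tuned one at a time.

```python
def peak_index(score, m):
    """Find the peak of a strictly rise-then-fall score over indices 0..m-1.

    Uses a binary search on the slope, so it needs O(log m) evaluations
    instead of the m evaluations of an exhaustive scan.
    """
    lo, hi = 0, m - 1
    evals = 0
    while lo < hi:
        mid = (lo + hi) // 2
        evals += 2
        if score(mid) < score(mid + 1):
            lo = mid + 1   # still rising: the peak lies to the right of mid
        else:
            hi = mid       # falling (or at the peak): the peak is at mid or left
    return lo, evals


def coordinate_search(transfer_score, n, m):
    """Tune n parameters one at a time (assumption: they do not interact)."""
    params = [0] * n
    total_evals = 0
    for i in range(n):
        def score(k, i=i):
            trial = params.copy()
            trial[i] = k
            return transfer_score(trial)
        params[i], evals = peak_index(score, m)
        total_evals += evals
    return params, total_evals


# Hypothetical stand-in for a measured transferability score.
TARGET = [3, 7, 1]

def toy_transfer_score(p):
    return -sum((a - b) ** 2 for a, b in zip(p, TARGET))


best, used = coordinate_search(toy_transfer_score, n=3, m=10)
```

With n = 3 and m = 10, a full grid needs 10^3 = 1000 evaluations, while this search needs at most 2 · 3 · ⌈log2 10⌉ = 24.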

[118] LithoSeg: A Coarse-to-Fine Framework for High-Precision Lithography Segmentation

Xinyu He,Botong Zhao,Bingbing Li,Shujing Lyu,Jiwei Shen,Yue Lu

Main category: cs.CV

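TL;DR: This paper proposes LithoSeg, a coarse-to-fine network for segmenting lithography SEM images; it combines a human-in-the-loop SAM bootstrapping scheme with a lightweight MLP refinement that recasts 2D segmentation as 1D regression, achieving high-precision segmentation and measurement with less supervision.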

Details

Motivation: Existing segmentation methods for lithography SEM images fall short in precision and robustness and adapt poorly to diverse pattern geometries and process windows, which limits their use in real semiconductor manufacturing. Method: LithoSeg works in two stages: the first uses a human-in-the-loop guided SAM for coarse segmentation, giving robustness under low supervision; the second uses the coarse mask to extract groove-normal profiles, turning the 2D segmentation problem into a 1D regression problem, and refines point by point with a lightweight MLP. Result: LithoSeg outperforms previous methods in both segmentation accuracy and metrology precision while requiring less manual annotation, and shows good prospects for practical use. Conclusion: Through its coarse-to-fine two-stage design, LithoSeg improves the precision and robustness of lithography image segmentation, reduces dependence on large amounts of labeled data, and suits complex, varied semiconductor manufacturing scenarios. Abstract: Accurate segmentation and measurement of lithography scanning electron microscope (SEM) images are crucial for ensuring precise process control, optimizing device performance, and advancing semiconductor manufacturing yield. Lithography segmentation requires pixel-level delineation of groove contours and consistent performance across diverse pattern geometries and process windows. However, existing methods often lack the necessary precision and robustness, limiting their practical applicability. To overcome this challenge, we propose LithoSeg, a coarse-to-fine network tailored for lithography segmentation. In the coarse stage, we introduce a Human-in-the-Loop Bootstrapping scheme for the Segment Anything Model (SAM) to attain robustness with minimal supervision. In the subsequent fine stage, we recast 2D segmentation as a 1D regression problem by sampling groove-normal profiles using the coarse mask and performing point-wise refinement with a lightweight MLP. LithoSeg outperforms previous approaches in both segmentation accuracy and metrology precision while requiring less supervision, offering promising prospects for real-world applications.

[119] Uncertainty-Guided Selective Adaptation Enables Cross-Platform Predictive Fluorescence Microscopy

Kai-Wen K. Yang,Andrew Bai,Alexandra Bermudez,Yunqi Hong,Zoe Latham,Iris Sloan,Michael Liu,Vishrut Goyal,Cho-Jui Hsieh,Neil Y. C. Lin

Main category: cs.CV

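TL;DR: Proposes SIT-ADDA-Auto, which adapts only the shallowest convolutional layers and selects the adaptation depth automatically, enabling label-free cross-domain transfer for microscopy images and improving reconstruction and segmentation.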

Details

Motivation: Deep learning for microscopy image analysis generalizes poorly across instruments and acquisition settings, and conventional adversarial domain adaptation retrains the whole network, disrupting semantic features that have already been learned. Method: The SIT-ADDA-Auto framework applies adversarial domain adaptation only to the earliest convolutional layers, freezes the deeper layers, and uses predictive uncertainty to select the best adaptation depth automatically, without target-domain labels. Result: Across changes in imaging conditions, cross-instrument transfer, and different stains, SIT-ADDA-Auto outperforms full-encoder adaptation and non-adversarial baselines on image reconstruction and downstream segmentation, with less drift of semantic features; robustness is confirmed by multi-metric evaluation, blinded expert assessment, and uncertainty-depth ablations. Conclusion: Adapting only the shallow convolutional layers with automatic depth selection is an effective unsupervised domain adaptation strategy, offering a generalizable design rule and a practical solution for microscopy image analysis. Abstract: Deep learning is transforming microscopy, yet models often fail when applied to images from new instruments or acquisition settings. Conventional adversarial domain adaptation (ADDA) retrains entire networks, often disrupting learned semantic representations. Here, we overturn this paradigm by showing that adapting only the earliest convolutional layers, while freezing deeper layers, yields reliable transfer. Building on this principle, we introduce Subnetwork Image Translation ADDA with automatic depth selection (SIT-ADDA-Auto), a self-configuring framework that integrates shallow-layer adversarial alignment with predictive uncertainty to automatically select adaptation depth without target labels. We demonstrate robustness via multi-metric evaluation, blinded expert assessment, and uncertainty-depth ablations. Across exposure and illumination shifts, cross-instrument transfer, and multiple stains, SIT-ADDA improves reconstruction and downstream segmentation over full-encoder adaptation and non-adversarial baselines, with reduced drift of semantic features. Our results provide a design rule for label-free adaptation in microscopy and a recipe for field settings; the code is publicly available.

[120] Enhancing Road Safety Through Multi-Camera Image Segmentation with Post-Encroachment Time Analysis

Shounak Ray Chaudhuri,Arash Jahangiri,Christopher Paolini

Main category: cs.CV

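TL;DR: Proposes a real-time safety assessment framework based on multi-camera computer vision that computes Post-Encroachment Time (PET) to provide high-resolution, real-time, scalable safety analysis of signalized intersections.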

Details

Motivation: Traditional crash-based traffic safety analysis is limited by data sparsity and latency and cannot support real-time risk identification, so a more efficient, continuous, and precise alternative is needed. Method: Four synchronized cameras and YOLOv11 are used for vehicle detection; homography matrices map the detections onto a unified bird's-eye view; and a pixel-level PET algorithm runs on edge devices with real-time processing and dynamic heatmap visualization. Result: The system averages 2.68 FPS on edge devices, produces an 800x800-pixel logarithmic heatmap, identifies high-risk regions with sub-second precision, and is accurate to 3.3 sq-cm. Conclusion: The work validates the feasibility of decentralized vision-based PET analysis for intelligent transportation systems and offers a replicable method for high-resolution, real-time intersection safety evaluation. Abstract: Traffic safety analysis at signalized intersections is vital for reducing vehicle and pedestrian collisions, yet traditional crash-based studies are limited by data sparsity and latency. This paper presents a novel multi-camera computer vision framework for real-time safety assessment through Post-Encroachment Time (PET) computation, demonstrated at the intersection of H Street and Broadway in Chula Vista, California. Four synchronized cameras provide continuous visual coverage, with each frame processed on NVIDIA Jetson AGX Xavier devices using YOLOv11 segmentation for vehicle detection. Detected vehicle polygons are transformed into a unified bird's-eye map using homography matrices, enabling alignment across overlapping camera views. A novel pixel-level PET algorithm measures vehicle position without reliance on fixed cells, allowing fine-grained hazard visualization via dynamic heatmaps, accurate to 3.3 sq-cm. Timestamped vehicle and PET data is stored in an SQL database for long-term monitoring. Results over various time intervals demonstrate the framework's ability to identify high-risk regions with sub-second precision and real-time throughput on edge devices, producing data for an 800 x 800 pixel logarithmic heatmap at an average of 2.68 FPS. This study validates the feasibility of decentralized vision-based PET analysis for intelligent transportation systems, offering a replicable methodology for high-resolution, real-time, and scalable intersection safety evaluation.
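The two geometric steps named above, homography projection and per-pixel PET, can be sketched in a few lines. This is a minimal illustration under assumptions, not the paper's algorithm: the function names `project` and `update_pet`, the dictionary-based occupancy map, and the rule "PET = arrival time of a new vehicle minus the last time a different vehicle occupied the pixel" are all illustrative choices.

```python
def project(H, x, y):
    """Map an image point (x, y) to bird's-eye coordinates with a 3x3 homography H."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return u / w, v / w  # divide by the homogeneous coordinate


def update_pet(last_seen, pixels, vehicle_id, t):
    """Register that `vehicle_id` covers `pixels` at time t (seconds).

    `last_seen` maps pixel -> (vehicle_id, last time that vehicle covered it).
    Returns {pixel: PET} for pixels previously covered by a different vehicle.
    """
    pets = {}
    for p in pixels:
        prev = last_seen.get(p)
        if prev is not None and prev[0] != vehicle_id:
            pets[p] = t - prev[1]  # time gap between the two vehicles at this pixel
        last_seen[p] = (vehicle_id, t)
    return pets


# Hypothetical homography: scale by 2 and shift by 1 in both axes.
H = [[2.0, 0.0, 1.0], [0.0, 2.0, 1.0], [0.0, 0.0, 1.0]]
mapped = project(H, 3.0, 4.0)

occupancy = {}
first = update_pet(occupancy, {(1, 1), (1, 2)}, "A", 0.0)
second = update_pet(occupancy, {(1, 2)}, "A", 0.5)
third = update_pet(occupancy, {(1, 2), (2, 2)}, "B", 1.75)
```

Small PET values mark pixels where two vehicles passed in quick succession; accumulating them over time is one way a hazard heatmap could be built.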

[121] LIHE: Linguistic Instance-Split Hyperbolic-Euclidean Framework for Generalized Weakly-Supervised Referring Expression Comprehension

Xianglong Shi,Silin Cheng,Sirui Zhao,Yunhan Jiang,Enhong Chen,Yang Liu,Sebastien Ourselin

Main category: cs.CV

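TL;DR: This paper introduces the new task of Weakly-Supervised Generalized Referring Expression Comprehension (WGREC) to handle expressions with zero or multiple targets, which conventional WREC methods cannot, and proposes the LIHE framework, whose two-stage design and hybrid similarity module HEMix address supervisory signal ambiguity and semantic representation collapse.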

Details

Motivation: Existing WREC methods are constrained by a one-to-one mapping assumption and struggle with realistic cases where an expression refers to zero or multiple target objects; a more flexible, practical model is needed for a variable number of targets. Method: The LIHE framework has two stages: Referential Decoupling predicts the number of targets and decomposes a complex expression into sub-expressions; Referent Grounding localizes them precisely with HEMix, a hybrid similarity module that combines the strengths of Euclidean and hyperbolic geometry. Result: LIHE establishes the first effective weakly supervised WGREC baseline on gRefCOCO and Ref-ZOM, and HEMix improves performance on standard REC benchmarks, raising IoU@0.5 by up to 2.5%. Conclusion: LIHE addresses supervisory ambiguity and semantic collapse in WGREC, models multi-target and no-target scenarios effectively, and moves weakly supervised referring expression comprehension toward more practical use. Abstract: Existing Weakly-Supervised Referring Expression Comprehension (WREC) methods, while effective, are fundamentally limited by a one-to-one mapping assumption, hindering their ability to handle expressions corresponding to zero or multiple targets in realistic scenarios. To bridge this gap, we introduce the Weakly-Supervised Generalized Referring Expression Comprehension task (WGREC), a more practical paradigm that handles expressions with variable numbers of referents. However, extending WREC to WGREC presents two fundamental challenges: supervisory signal ambiguity, where weak image-level supervision is insufficient for training a model to infer the correct number and identity of referents, and semantic representation collapse, where standard Euclidean similarity forces hierarchically-related concepts into non-discriminative clusters, blurring categorical boundaries. To tackle these challenges, we propose a novel WGREC framework named Linguistic Instance-Split Hyperbolic-Euclidean (LIHE), which operates in two stages. The first stage, Referential Decoupling, predicts the number of target objects and decomposes the complex expression into simpler sub-expressions. The second stage, Referent Grounding, then localizes these sub-expressions using HEMix, our innovative hybrid similarity module that synergistically combines the precise alignment capabilities of Euclidean proximity with the hierarchical modeling strengths of hyperbolic geometry. This hybrid approach effectively prevents semantic collapse while preserving fine-grained distinctions between related concepts.
Extensive experiments demonstrate LIHE establishes the first effective weakly supervised WGREC baseline on gRefCOCO and Ref-ZOM, while HEMix achieves consistent improvements on standard REC benchmarks, improving IoU@0.5 by up to 2.5%. The code is available at https://anonymous.4open.science/r/LIHE.
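The idea of mixing Euclidean proximity with hyperbolic geometry can be made concrete with a toy similarity. The abstract does not specify how HEMix combines the two, so the convex combination and the weight `alpha` below are assumptions, as are the function names; only the Poincaré-ball distance formula is standard.

```python
import math


def euclidean_distance(u, v):
    """Ordinary straight-line distance between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def poincare_distance(u, v):
    """Geodesic distance in the Poincare ball; both points must have norm < 1."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    norm_u = sum(a * a for a in u)
    norm_v = sum(b * b for b in v)
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - norm_u) * (1.0 - norm_v)))


def hybrid_similarity(u, v, alpha=0.5):
    """Assumed mixing rule: negative weighted sum of the two distances."""
    return -(alpha * euclidean_distance(u, v)
             + (1.0 - alpha) * poincare_distance(u, v))


origin = [0.0, 0.0]
point = [0.5, 0.0]
d_hyp = poincare_distance(origin, point)   # equals ln 3 for radius 0.5
score = hybrid_similarity(origin, point)
```

Hyperbolic distance grows rapidly near the boundary of the ball, which is why such spaces are often used for hierarchies: general concepts sit near the origin and specific ones near the edge, while the Euclidean term keeps fine-grained local alignment.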

[122] Null-Space Diffusion Distillation for Efficient Photorealistic Lensless Imaging

Jose Reinaldo Cunha Santos A V Silva Neto,Hodaka Kawachi,Yasushi Yagi,Tomoya Nakamura

Main category: cs.CV

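TL;DR: Proposes Null-Space Diffusion Distillation (NSDD), a method for fast, photorealistic lensless imaging without paired supervision; it distills the null-space component of an iterative solver, preserving measurement consistency while greatly reducing computational cost.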

Details

Motivation: Existing lensless camera reconstruction methods based on paired supervision are vulnerable to lens-lensless domain mismatch, while ground-truth-free diffusion priors perform poorly on the noisy, highly multiplexed, and severely ill-posed lensless deconvolution task, so a more stable and efficient method is needed. Method: NSDD separates range-space enforcement from null-space diffusion-prior updates and uses a single-pass network to distill the null-space component of an iterative DDNM+ solver, conditioned on the lensless measurement and a range-space anchor, enabling reconstruction without paired supervision. Result: On Lensless-FFHQ and PhlatCam, NSDD is the second-fastest method (behind only Wiener) and reaches near-teacher perceptual quality (second-best LPIPS, outperforming DPS and classical convex baselines), while greatly reducing runtime and memory. Conclusion: NSDD offers a practical path to fast, ground-truth-free, high-quality lensless imaging and shows the potential of null-space distillation for hard inverse problems. Abstract: State-of-the-art photorealistic reconstructions for lensless cameras often rely on paired lensless-lensed supervision, which can bias models due to lens-lensless domain mismatch. To avoid this, ground-truth-free diffusion priors are attractive; however, generic formulations tuned for conventional inverse problems often break under the noisy, highly multiplexed, and ill-posed lensless deconvolution setting. We observe that methods which separate range-space enforcement from null-space diffusion-prior updates yield stable, realistic reconstructions. Building on this, we introduce Null-Space Diffusion Distillation (NSDD): a single-pass student that distills the null-space component of an iterative DDNM+ solver, conditioned on the lensless measurement and on a range-space anchor. NSDD preserves measurement consistency and achieves photorealistic results without paired supervision at a fraction of the runtime and memory. On Lensless-FFHQ and PhlatCam, NSDD is the second fastest, behind Wiener, and achieves near-teacher perceptual quality (second-best LPIPS, below DDNM+), outperforming DPS and classical convex baselines. These results suggest a practical path toward fast, ground-truth-free, photorealistic lensless imaging.
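The range-space/null-space split behind this family of methods is the identity x = A⁺y + (I − A⁺A)x̄: the first term is fixed by the measurement and the second is free to be filled by a prior without changing Ax. The sketch below shows it for a toy one-row operator in plain Python. It is an illustration under assumptions, not the NSDD network: `range_null_compose` is a hypothetical name, the measurement is noise-free, and `prior` stands in for whatever a diffusion prior or distilled student would predict.

```python
def range_null_compose(a, y, prior):
    """Combine a scalar measurement y = a . x with a prior estimate of x.

    For a single-row operator a, the pseudo-inverse is a^T / (a . a), so
      range part = a * y / (a . a)               (fixed by the measurement)
      null part  = prior - a * (a . prior)/(a . a)  (invisible to the sensor)
    """
    aa = sum(ai * ai for ai in a)
    coeff_y = y / aa
    coeff_prior = sum(ai * pi for ai, pi in zip(a, prior)) / aa
    return [ai * coeff_y + (pi - ai * coeff_prior) for ai, pi in zip(a, prior)]


def measure(a, x):
    """Apply the toy forward operator."""
    return sum(ai * xi for ai, xi in zip(a, x))


a = [1.0, 1.0]          # the sensor only sees the sum of two pixels
y = 4.0                 # observed measurement
prior = [3.0, 0.0]      # hypothetical prior guess, inconsistent with y
x_hat = range_null_compose(a, y, prior)
```

Whatever the prior proposes, the composed estimate reproduces the measurement exactly, which is the sense in which measurement consistency is preserved while the prior supplies the unobserved detail.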

[123] Bridging Vision and Language for Robust Context-Aware Surgical Point Tracking: The VL-SurgPT Dataset and Benchmark

Rulin Zhou,Wenlong He,An Wang,Jianhang Zhang,Xuanhui Zeng,Xi Zhang,Chaowei Zhu,Haijun Hu,Hongliang Ren

Main category: cs.CV

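TL;DR: This paper presents VL-SurgPT, the first large-scale multimodal dataset for surgical point tracking, which pairs visual tracking with textual descriptions to improve robustness under difficult visual conditions (such as smoke, reflections, and tissue deformation), and uses the proposed text-guided method TG-SurgPT to show that semantic information significantly improves tracking performance.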

Details

Motivation: Existing surgical tracking datasets lack semantic context, which makes tracking failure mechanisms hard to analyze, and performance is limited under difficult visual conditions (such as smoke, specular reflections, and tissue deformation), so semantic information is needed to improve the robustness and interpretability of tracking systems. Method: The large-scale multimodal dataset VL-SurgPT is built from 908 in vivo video clips covering a range of challenging tissue and instrument scenarios, and TG-SurgPT is proposed, a method that uses textual descriptions to guide visual tracking, fusing the language modality to better understand keypoint status. Result: Benchmarks are established with eight state-of-the-art tracking methods; the results show that adding textual semantic information significantly improves tracking accuracy and reliability, and TG-SurgPT is notably more robust in adverse visual conditions where conventional vision-only methods perform poorly. Conclusion: By fusing visual and linguistic modalities, VL-SurgPT opens a new route to context-aware surgical tracking systems and advances computer-assisted surgery systems that maintain performance in difficult intraoperative environments. Abstract: Accurate point tracking in surgical environments remains challenging due to complex visual conditions, including smoke occlusion, specular reflections, and tissue deformation. While existing surgical tracking datasets provide coordinate information, they lack the semantic context necessary to understand tracking failure mechanisms. We introduce VL-SurgPT, the first large-scale multimodal dataset that bridges visual tracking with textual descriptions of point status in surgical scenes. The dataset comprises 908 in vivo video clips, including 754 for tissue tracking (17,171 annotated points across five challenging scenarios) and 154 for instrument tracking (covering seven instrument types with detailed keypoint annotations). We establish comprehensive benchmarks using eight state-of-the-art tracking methods and propose TG-SurgPT, a text-guided tracking approach that leverages semantic descriptions to improve robustness in visually challenging conditions. Experimental results demonstrate that incorporating point status information significantly improves tracking accuracy and reliability, particularly in adverse visual scenarios where conventional vision-only methods struggle. By bridging visual and linguistic modalities, VL-SurgPT enables the development of context-aware tracking systems crucial for advancing computer-assisted surgery applications that can maintain performance even under challenging intraoperative conditions.

[124] GCAgent: Long-Video Understanding via Schematic and Narrative Episodic Memory

Jeong Hun Yeo,Sangyun Chung,Sungjune Park,Dae Hoe Kim,Jinyoung Moon,Yong Man Ro

Main category: cs.CV

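TL;DR: This paper proposes GCAgent, a global-context-aware agent framework that uses structured schematic and narrative memory to address long-term dependencies in long-video understanding, achieving state-of-the-art performance among 7B-scale MLLMs on several benchmarks.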

Details

Motivation: Existing multimodal large language models are limited in long-video understanding by token length and by the difficulty of capturing long-term temporal dependencies; they cannot model global context and complex event relationships effectively, so a framework that supports deep reasoning is needed. Method: The GCAgent framework centers on Schematic and Narrative Episodic Memory, which models causal and temporal relations between events in a structured form; the framework runs in a multi-stage Perception-Action-Reflection cycle and uses a Memory Manager to retrieve relevant memories for context-aware inference. Result: GCAgent improves accuracy on the Video-MME Long split by up to 23.5% over a strong baseline, reaches 73.4% accuracy among 7B-scale MLLMs, and achieves an overall average of 71.9% on the Video-MME benchmark, outperforming existing methods. Conclusion: Through structured memory and an agent-based reasoning paradigm, GCAgent addresses the long-term dependency problem in long-video understanding and supports the value of cognitively inspired design for multimodal reasoning. Abstract: Long-video understanding remains a significant challenge for Multimodal Large Language Models (MLLMs) due to inherent token limitations and the complexity of capturing long-term temporal dependencies. Existing methods often fail to capture the global context and complex event relationships necessary for deep video reasoning. To address this, we introduce GCAgent, a novel Global-Context-Aware Agent framework that achieves comprehensive long-video understanding. Our core innovation is the Schematic and Narrative Episodic Memory. This memory structurally models events and their causal and temporal relations into a concise, organized context, fundamentally resolving the long-term dependency problem. Operating in a multi-stage Perception-Action-Reflection cycle, our GCAgent utilizes a Memory Manager to retrieve relevant episodic context for robust, context-aware inference. Extensive experiments confirm that GCAgent significantly enhances long-video understanding, achieving up to 23.5% accuracy improvement on the Video-MME Long split over a strong MLLM baseline. Furthermore, our framework establishes state-of-the-art performance among comparable 7B-scale MLLMs, achieving 73.4% accuracy on the Long split and the highest overall average (71.9%) on the Video-MME benchmark, validating our agent-based reasoning paradigm and structured memory for cognitively-inspired long-video understanding.

[125] VPHO: Joint Visual-Physical Cue Learning and Aggregation for Hand-Object Pose Estimation

Jun Zhou,Chi Xu,Kaifeng Tang,Yuting Ge,Tingrui Guo,Li Cheng

Main category: cs.CV

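TL;DR: Proposes a framework that combines visual and physical cues to estimate the 3D poses of hands and objects from a single RGB image; through joint learning and candidate pose aggregation, it significantly outperforms existing methods in pose accuracy and physical plausibility.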

Details

Motivation: Existing methods rely mainly on visual cues and often produce results that violate physical constraints; methods that add physics reasoning mostly depend on post-optimization or non-differentiable physics engines, which hurts visual consistency and end-to-end training. Method: 1) Joint visual-physical cue learning: the model extracts 2D visual cues and 3D physical cues together; 2) candidate pose aggregation: multiple candidate poses are generated by diffusion and then aggregated and refined using both visual and physical predictions. Result: Across multiple experiments, the method significantly outperforms current state-of-the-art approaches in both pose estimation accuracy and physical plausibility. Conclusion: The proposed framework balances visual consistency with physical plausibility and advances 3D pose estimation for hand-object interaction. Abstract: Estimating the 3D poses of hands and objects from a single RGB image is a fundamental yet challenging problem, with broad applications in augmented reality and human-computer interaction. Existing methods largely rely on visual cues alone, often producing results that violate physical constraints such as interpenetration or non-contact. Recent efforts to incorporate physics reasoning typically depend on post-optimization or non-differentiable physics engines, which compromise visual consistency and end-to-end trainability. To overcome these limitations, we propose a novel framework that jointly integrates visual and physical cues for hand-object pose estimation. This integration is achieved through two key ideas: 1) joint visual-physical cue learning: The model is trained to extract 2D visual cues and 3D physical cues, thereby enabling more comprehensive representation learning for hand-object interactions; 2) candidate pose aggregation: A novel refinement process that aggregates multiple diffusion-generated candidate poses by leveraging both visual and physical predictions, yielding a final estimate that is visually consistent and physically plausible. Extensive experiments demonstrate that our method significantly outperforms existing state-of-the-art approaches in both pose accuracy and physical plausibility.

[126] Improved Masked Image Generation with Knowledge-Augmented Token Representations

Guotao Liang,Baoquan Zhang,Zhiyuan Wen,Zihao Han,Yunming Ye

Main category: cs.CV

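TL;DR: Proposes KA-MIG, a knowledge-augmented masked image generation framework that improves generation quality by introducing explicit prior knowledge of semantic dependencies in the form of three knowledge graphs.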

Details

Motivation: Existing masked image generation methods struggle to learn semantic dependencies from long sequences of visual tokens that have no clear semantic meaning, which limits generation performance. Method: Three token-level knowledge graphs are built (a co-occurrence graph, a semantic similarity graph, and a position-token incompatibility graph); a graph-aware encoder learns token and position-aware representations; and a lightweight fusion mechanism integrates them into existing MIG methods. Result: On class-conditional image generation on ImageNet, KA-MIG significantly outperforms existing MIG methods, improving generation quality and semantic consistency. Conclusion: Introducing external prior knowledge strengthens the model's ability to capture semantic dependencies among visual tokens and offers a new direction for masked image generation. Abstract: Masked image generation (MIG) has demonstrated remarkable efficiency and produces high-fidelity images by enabling parallel token prediction. Existing methods typically rely solely on the model itself to learn semantic dependencies among visual token sequences. However, directly learning such semantic dependencies from data is challenging because the individual tokens lack clear semantic meanings, and these sequences are usually long. To address this limitation, we propose a novel Knowledge-Augmented Masked Image Generation framework, named KA-MIG, which introduces explicit knowledge of token-level semantic dependencies (i.e., extracted from the training data) as priors to learn richer representations for improving performance. In particular, we explore and identify three types of advantageous token knowledge graphs, including two positive and one negative graphs (i.e., the co-occurrence graph, the semantic similarity graph, and the position-token incompatibility graph). Based on three prior knowledge graphs, we design a graph-aware encoder to learn token and position-aware representations. After that, a lightweight fusion mechanism is introduced to integrate these enriched representations into the existing MIG methods. Resorting to such prior knowledge, our method effectively enhances the model's ability to capture semantic dependencies, leading to improved generation quality. Experimental results demonstrate that our method improves upon existing MIG for class-conditional image generation on ImageNet.

[127] Calibrated Multimodal Representation Learning with Missing Modalities

Xiaohao Liu,Xiaobo Xia,Jiaheng Wei,Shuo Yang,Xiu Su,See-Kiong Ng,Tat-Seng Chua

Main category: cs.CV

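TL;DR: Proposes CalMRL, which calibrates multimodal alignment under missing modalities through representation-level imputation and bi-step learning, mitigating the anchor shift problem.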

Details

Motivation: Existing cross-modal alignment methods require every modality to be present, which makes it hard to use the many datasets that have missing modalities. Method: Priors and the inherent connections among modalities are used to model imputation of the missing modalities at the representation level; the model is optimized with bi-step learning and a closed-form solution for the posterior distribution of the shared latents. Result: Theory shows that the method mitigates anchor shift and guarantees convergence; experiments show it outperforms existing methods on data with missing modalities. Conclusion: CalMRL adds new flexibility for learning from incomplete multimodal data and broadens the applicability of existing advanced methods. Abstract: Multimodal representation learning harmonizes distinct modalities by aligning them into a unified latent space. Recent research generalizes traditional cross-modal alignment to produce enhanced multimodal synergy but requires all modalities to be present for a common instance, making it challenging to utilize prevalent datasets with missing modalities. We provide theoretical insights into this issue from an anchor shift perspective. Observed modalities are aligned with a local anchor that deviates from the optimal one when all modalities are present, resulting in an inevitable shift. To address this, we propose CalMRL for multimodal representation learning to calibrate incomplete alignments caused by missing modalities. Specifically, CalMRL leverages the priors and the inherent connections among modalities to model the imputation for the missing ones at the representation level. To resolve the optimization dilemma, we employ a bi-step learning method with the closed-form solution of the posterior distribution of shared latents. We validate its mitigation of anchor shift and convergence with theoretical guidance. By equipping the calibrated alignment with the existing advanced method, we offer new flexibility to absorb data with missing modalities, which was previously unattainable. Extensive experiments and comprehensive analyses demonstrate the superiority of CalMRL. Our code, model checkpoints, and evaluation raw data will be publicly available.

[128] SRSplat: Feed-Forward Super-Resolution Gaussian Splatting from Sparse Multi-View Images

Xinyuan Hu,Changyue Shi,Chuxiao Yang,Minghao Chen,Jiajun Ding,Tao Wei,Chen Wei,Zhou Yu,Min Tan

Main category: cs.CV

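TL;DR: Proposes SRSplat, a feed-forward framework that reconstructs high-resolution 3D scenes from a few low-resolution views, combining external high-quality reference images with internal texture cues to improve recovery of texture detail.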

Details

Motivation: Existing methods struggle to recover fine texture detail when reconstructing 3D scenes from low-resolution inputs because high-frequency information is missing. Method: A scene-specific reference gallery is built (generated with MLLMs and diffusion models); a Reference-Guided Feature Enhancement (RGFE) module fuses features from the low-resolution inputs and the reference images; and a Texture-Aware Density Control (TADC) module adapts Gaussian density to the internal texture richness. Result: SRSplat outperforms existing methods on multiple datasets, including RealEstate10K, ACID, and DTU, and generalizes well across datasets and resolutions. Conclusion: By fusing external references with internal texture information, SRSplat improves the texture quality and overall performance of 3D reconstruction from low-resolution multi-view images. Abstract: Feed-forward 3D reconstruction from sparse, low-resolution (LR) images is a crucial capability for real-world applications, such as autonomous driving and embodied AI. However, existing methods often fail to recover fine texture details. This limitation stems from the inherent lack of high-frequency information in LR inputs. To address this, we propose SRSplat, a feed-forward framework that reconstructs high-resolution 3D scenes from only a few LR views. Our main insight is to compensate for the deficiency of texture information by jointly leveraging external high-quality reference images and internal texture cues. We first construct a scene-specific reference gallery, generated for each scene using Multimodal Large Language Models (MLLMs) and diffusion models. To integrate this external information, we introduce the Reference-Guided Feature Enhancement (RGFE) module, which aligns and fuses features from the LR input images and their reference twin image. Subsequently, we train a decoder to predict the Gaussian primitives using the multi-view fused feature obtained from RGFE. To further refine predicted Gaussian primitives, we introduce Texture-Aware Density Control (TADC), which adaptively adjusts Gaussian density based on the internal texture richness of the LR inputs. Extensive experiments demonstrate that our SRSplat outperforms existing methods on various datasets, including RealEstate10K, ACID, and DTU, and exhibits strong cross-dataset and cross-resolution generalization capabilities.

[129] FedSDA: Federated Stain Distribution Alignment for Non-IID Histopathological Image Classification

Cheng-Chang Tsai,Kai-Wen Cheng,Chun-Shien Lu

Main category: cs.CV

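TL;DR: Proposes FedSDA, a federated learning method that adjusts each client's stain distribution to mitigate feature distribution shift in non-IID (non-independent and identically distributed) histopathological images, improving model performance while protecting privacy.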

Details

Motivation: Non-IID data cause feature distribution shift in federated learning; in histopathological images in particular, stain variation strongly affects training, and existing methods have paid limited attention to this. Method: Building on the ability of diffusion models to fit data distributions, and using stain separation to extract the key features tied to non-IID properties, the method aligns each client's stain distribution within a federated learning framework while avoiding training diffusion models on raw data, which lowers the risk of privacy leakage. Result: Experiments show that FedSDA outperforms both methods that mitigate disparities across clients' model updates and other methods that address non-IID issues from the data distribution perspective, improving performance on several benchmarks. Conclusion: FedSDA offers an effective and practical solution to distribution shift for non-IID histopathological images in federated learning, with clear value for application and wider adoption. Abstract: Federated learning (FL) has shown success in collaboratively training a model among decentralized data resources without directly sharing privacy-sensitive training data. Despite recent advances, non-IID (non-independent and identically distributed) data poses an inevitable challenge that hinders the use of FL. In this work, we address the issue of non-IID histopathological images with feature distribution shifts from an intuitive perspective that has only received limited attention. Specifically, we address this issue from the perspective of data distribution by solely adjusting the data distributions of all clients. Building on the success of diffusion models in fitting data distributions and leveraging stain separation to extract the pivotal features that are closely related to the non-IID properties of histopathological images, we propose a Federated Stain Distribution Alignment (FedSDA) method. FedSDA aligns the stain distribution of each client with a target distribution in an FL framework to mitigate distribution shifts among clients. Furthermore, considering that training diffusion models on raw data in FL has been shown to be susceptible to privacy leakage risks, we circumvent this problem while still effectively achieving alignment. Extensive experimental results show that FedSDA is not only effective in improving baselines that focus on mitigating disparities across clients' model updates but also outperforms baselines that address the non-IID data issues from the perspective of data distribution. We show that FedSDA provides valuable and practical insights for the computational pathology community.

[130] DCMM-Transformer: Degree-Corrected Mixed-Membership Attention for Medical Imaging

Huimin Cheng,Xiaowei Yu,Shushan Wu,Luyang Fang,Chao Cao,Jing Zhang,Tianming Liu,Dajiang Zhu,Wenxuan Zhong,Ping Ma

Main category: cs.CV

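TL;DR: Proposes DCMM-Transformer, a vision Transformer for medical image analysis that incorporates a degree-corrected mixed-membership model, capturing anatomical structure in a differentiable way to improve performance and interpretability.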

Details

Motivation: Standard Vision Transformers cannot exploit the latent anatomical structure in medical images (such as organs and tissues), and existing structure-modeling methods are non-differentiable, unstable to train, and unable to capture complex community structure. Method: A differentiable Degree-Corrected Mixed-Membership (DCMM) model is added to self-attention as an additive bias to model community structure and node degree heterogeneity, avoiding non-differentiable binary mask sampling. Result: Experiments on medical imaging datasets covering brain, chest, breast, and ocular modalities show superior performance and generalization; the resulting attention maps are anatomically plausible and semantically coherent, substantially improving interpretability. Conclusion: DCMM-Transformer incorporates anatomical structure priors from medical images in a differentiable, interpretable way, outperforms existing methods across modalities, and makes the attention mechanism more semantically meaningful. Abstract: Medical images exhibit latent anatomical groupings, such as organs, tissues, and pathological regions, that standard Vision Transformers (ViTs) fail to exploit. While recent work like SBM-Transformer attempts to incorporate such structures through stochastic binary masking, they suffer from non-differentiability, training instability, and the inability to model complex community structure. We present DCMM-Transformer, a novel ViT architecture for medical image analysis that incorporates a Degree-Corrected Mixed-Membership (DCMM) model as an additive bias in self-attention. Unlike prior approaches that rely on multiplicative masking and binary sampling, our method introduces community structure and degree heterogeneity in a fully differentiable and interpretable manner. Comprehensive experiments across diverse medical imaging datasets, including brain, chest, breast, and ocular modalities, demonstrate the superior performance and generalizability of the proposed approach. Furthermore, the learned group structure and structured attention modulation substantially enhance interpretability by yielding attention maps that are anatomically meaningful and semantically coherent.
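A small sketch may help show what an additive DCMM bias looks like. In the standard DCMM model, the affinity between nodes i and j is θ_i θ_j π_iᵀ B π_j, with degree parameters θ, mixed-membership vectors π, and a community connectivity matrix B. How the paper turns this affinity into a bias is not stated in the abstract; adding its logarithm to the attention logits is an assumption made here, and all function names are hypothetical.

```python
import math


def dcmm_affinity(theta, pi, B):
    """Pairwise DCMM affinity: theta_i * theta_j * (pi_i^T B pi_j)."""
    n, k = len(theta), len(B)
    return [[theta[i] * theta[j]
             * sum(pi[i][a] * B[a][b] * pi[j][b]
                   for a in range(k) for b in range(k))
             for j in range(n)]
            for i in range(n)]


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention(logits, affinity, eps=1e-6):
    """Assumed form: softmax(logits + log(affinity)), applied row by row."""
    return [softmax([l + math.log(p + eps) for l, p in zip(lrow, prow)])
            for lrow, prow in zip(logits, affinity)]


# Three tokens: the first two belong to community 0, the third to community 1.
theta = [1.0, 1.0, 1.0]
pi = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
B = [[0.9, 0.1], [0.1, 0.9]]

affinity = dcmm_affinity(theta, pi, B)
attn = biased_attention([[0.0] * 3 for _ in range(3)], affinity)
```

Because the bias is added to the logits and never sampled, gradients flow through θ, π, and B, which is the contrast the abstract draws with multiplicative binary masking.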

[131] DeiTFake: Deepfake Detection Model using DeiT Multi-Stage Training

Saksham Kumar,Ashish Singh,Srinivasarao Thota,Sunil Kumar Singh,Chandan Kumar

Main category: cs.CV

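TL;DR: Proposes DeiTFake, a DeiT-based two-stage progressive training strategy for facial deepfake detection that reaches 99.22% accuracy and an AUROC of 0.9997 on the OpenForensics dataset, outperforming existing baselines.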

Details

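Motivation: Deepfakes pose a serious threat to the authenticity of digital media, and existing detection methods are not robust enough against sophisticated forgery techniques, so more effective models and training strategies are needed. Method: A DeiT-based Transformer architecture is combined with a two-stage progressive augmentation training strategy: stage one performs transfer learning with standard data augmentation, and stage two fine-tunes with advanced affine transformations and deepfake-specific augmentations, using knowledge distillation to capture subtle manipulation artifacts. Result: On the OpenForensics dataset (190,335 images), accuracy reaches 98.71% after stage one and 99.22% after stage two, with an AUROC of 0.9997, exceeding current best baselines; ablations confirm the effectiveness of the augmentation strategy and training schedule. Conclusion: By combining the DeiT model with progressive augmentation training, DeiTFake significantly improves deepfake detection and provides practical benchmarks and a workable approach for facial forgery detection. Abstract: Deepfakes are major threats to the integrity of digital media. We propose DeiTFake, a DeiT-based transformer and a novel two-stage progressive training strategy with increasing augmentation complexity. The approach applies an initial transfer-learning phase with standard augmentations followed by a fine-tuning phase using advanced affine and deepfake-specific augmentations. DeiT's knowledge distillation model captures subtle manipulation artifacts, increasing robustness of the detection model. Trained on the OpenForensics dataset (190,335 images), DeiTFake achieves 98.71% accuracy after stage one and 99.22% accuracy with an AUROC of 0.9997 after stage two, outperforming the latest OpenForensics baselines. We analyze augmentation impact and training schedules, and provide practical benchmarks for facial deepfake detection.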

[132] UniABG: Unified Adversarial View Bridging and Graph Correspondence for Unsupervised Cross-View Geo-Localization

Cuiqun Chen,Qi Chen,Bin Yang,Xingyi Zhang

Main category: cs.CV

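TL;DR: Proposes UniABG, a dual-stage unsupervised cross-view geo-localization framework that combines adversarial view bridging with graph-based correspondence calibration, substantially improving performance without annotations and even surpassing supervised methods.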

Details

Motivation: Existing supervised methods rely on large amounts of pairwise annotated data and are hard to scale; unsupervised methods suffer from severe pseudo-label noise caused by the cross-view domain gap, which hurts performance. Method: View-Aware Adversarial Bridging (VAAB) first learns view-invariant features and makes pseudo-labels more robust; Heterogeneous Graph Filtering Calibration (HGFC) then builds dual inter-view structure graphs to refine cross-view matches. Result: On University-1652 and SUES-200, Satellite → Drone AP improves by 10.63% and 16.73% respectively, achieving state-of-the-art unsupervised performance and surpassing some supervised baselines. Conclusion: UniABG effectively addresses the domain gap and pseudo-label noise in unsupervised cross-view geo-localization, enabling efficient and accurate cross-view matching. Abstract: Cross-view geo-localization (CVGL) matches query images (e.g., drone) to geographically corresponding opposite-view imagery (e.g., satellite). While supervised methods achieve strong performance, their reliance on extensive pairwise annotations limits scalability. Unsupervised alternatives avoid annotation costs but suffer from noisy pseudo-labels due to intrinsic cross-view domain gaps. To address these limitations, we propose UniABG, a novel dual-stage unsupervised cross-view geo-localization framework integrating adversarial view bridging with graph-based correspondence calibration. Our approach first employs View-Aware Adversarial Bridging (VAAB) to model view-invariant features and enhance pseudo-label robustness. Subsequently, Heterogeneous Graph Filtering Calibration (HGFC) refines cross-view associations by constructing dual inter-view structure graphs, achieving reliable view correspondence. Extensive experiments demonstrate state-of-the-art unsupervised performance, showing that UniABG improves Satellite → Drone AP by +10.63% on University-1652 and +16.73% on SUES-200, even surpassing supervised baselines. The source code is available at https://github.com/chenqi142/UniABG

[133] PipeDiT: Accelerating Diffusion Transformers in Video Generation with Task Pipelining and Model Decoupling

Sijie Wang,Qiang Wang,Shaohuai Shi

Main category: cs.CV

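TL;DR: Proposes PipeDiT, a new pipelining framework for accelerating diffusion transformer (DiT) based video generation models; sequence-parallel pipelining, decoupling of the diffusion and VAE modules, and attention co-processing substantially reduce inference latency and memory consumption.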

Details

Motivation: DiT-based video generation models are limited in practical deployment by slow inference and high memory consumption, so an efficient acceleration framework is needed. Method: The PipeSP pipelining algorithm overlaps sequence-parallel computation with communication; DeDiVAE decouples the diffusion module and the VAE module and pipelines them on separate GPU groups; an attention co-processing (Aco) method improves GPU utilization in the VAE group. Result: Integrated into the open-source frameworks OpenSoraPlan and HunyuanVideo, PipeDiT achieves 1.06x to 4.02x inference speedups on 8-GPU systems. Conclusion: PipeDiT improves video generation efficiency through multi-level pipelining optimizations and is practical and broadly applicable. Abstract: Video generation has been advancing rapidly, and diffusion transformer (DiT) based models have demonstrated remarkable capabilities. However, their practical deployment is often hindered by slow inference speeds and high memory consumption. In this paper, we propose a novel pipelining framework named PipeDiT to accelerate video generation, which is equipped with three main innovations. First, we design a pipelining algorithm (PipeSP) for sequence parallelism (SP) to enable the computation of latent generation and communication among multiple GPUs to be pipelined, thus reducing inference latency. Second, we propose DeDiVAE to decouple the diffusion module and the variational autoencoder (VAE) module into two GPU groups, whose executions can also be pipelined to reduce memory consumption and inference latency. Third, to better utilize the GPU resources in the VAE group, we propose an attention co-processing (Aco) method to further reduce the overall video generation latency. We integrate our PipeDiT into both OpenSoraPlan and HunyuanVideo, two state-of-the-art open-source video generation frameworks, and conduct extensive experiments on two 8-GPU systems. Experimental results show that, under many common resolution and timestep configurations, our PipeDiT achieves 1.06x to 4.02x speedups over OpenSoraPlan and HunyuanVideo.

[134] MovSemCL: Movement-Semantics Contrastive Learning for Trajectory Similarity

Zhichen Lai,Hua Lu,Huan Li,Jialiang Li,Christian S. Jensen

Main category: cs.CV

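TL;DR: Proposes MovSemCL, a new contrastive learning framework for trajectory similarity computation that obtains efficient and accurate trajectory representations through movement-semantics feature extraction, patching, and attention, and introduces a curvature-guided augmentation strategy that improves performance while preserving physical plausibility.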

Details

Motivation: Existing learning-based trajectory similarity methods model trajectory semantics and hierarchy inadequately, are computationally expensive, and use implausible data augmentations that destroy trajectory semantics. Method: MovSemCL converts raw GPS trajectories into movement-semantics features and splits them into patches, encodes local and global patterns with intra- and inter-patch attention, and applies a curvature-guided augmentation strategy that preserves key segments and masks redundant ones. Result: Experiments on real-world datasets show that MovSemCL outperforms state-of-the-art methods: mean ranks in similarity search are close to the ideal value of 1, heuristic approximation improves by up to 20.3%, and inference latency drops by up to 43.4%. Conclusion: MovSemCL effectively addresses semantic modeling, efficiency, and data augmentation in trajectory similarity computation, delivering better performance at lower computational cost. Abstract: Trajectory similarity computation is a fundamental functionality that is used for, e.g., clustering, prediction, and anomaly detection. However, existing learning-based methods exhibit three key limitations: (1) insufficient modeling of trajectory semantics and hierarchy, lacking both movement dynamics extraction and multi-scale structural representation; (2) high computational costs due to point-wise encoding; and (3) use of physically implausible augmentations that distort trajectory semantics. To address these issues, we propose MovSemCL, a movement-semantics contrastive learning framework for trajectory similarity computation. MovSemCL first transforms raw GPS trajectories into movement-semantics features and then segments them into patches. Next, MovSemCL employs intra- and inter-patch attentions to encode local as well as global trajectory patterns, enabling efficient hierarchical representation and reducing computational costs. Moreover, MovSemCL includes a curvature-guided augmentation strategy that preserves informative segments (e.g., turns and intersections) and masks redundant ones, generating physically plausible augmented views. Experiments on real-world datasets show that MovSemCL is capable of outperforming state-of-the-art methods, achieving mean ranks close to the ideal value of 1 at similarity search tasks and improvements by up to 20.3% at heuristic approximation, while reducing inference latency by up to 43.4%.

[135] DCA-LUT: Deep Chromatic Alignment with 5D LUT for Purple Fringing Removal

Jialang Lu,Shuning Sun,Pu Wang,Chen Wu,Feng Gao,Lina Gong,Dianjie Lu,Guijuan Zhang,Zhuoran Zheng

Main category: cs.CV

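TL;DR: Proposes DCA-LUT, a deep learning framework for purple fringing removal that uses a Chromatic-Aware Coordinate Transformation module and a 5D look-up table to isolate and correct purple fringing, achieving the best performance on both synthetic and real-world data.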

Details

Motivation: Traditional methods rely on expensive hardware and handcrafted features, cannot effectively remove the purple fringing caused by lens dispersion, and lack a data-driven solution. Method: The DCA-LUT framework includes a CA-CT module that learns an image-adaptive color space and isolates a purple fringe channel, a 5D LUT for non-linear color correction, and a large-scale synthetic dataset, PF-Synth, for training and evaluation. Result: Experiments on synthetic and real-world datasets show that the method outperforms existing approaches both quantitatively and in visual quality, achieving state-of-the-art purple fringing removal. Conclusion: DCA-LUT is the first application of deep learning to purple fringing removal; its physically inspired module design and efficient color mapping strategy markedly improve image quality and show good practicality and potential for wider use. Abstract: Purple fringing, a persistent artifact caused by Longitudinal Chromatic Aberration (LCA) in camera lenses, has long degraded the clarity and realism of digital imaging. Traditional solutions rely on complex and expensive apochromatic (APO) lens hardware and the extraction of handcrafted features, ignoring the data-driven approach. To fill this gap, we introduce DCA-LUT, the first deep learning framework for purple fringing removal. Inspired by the physical root of the problem, the spatial misalignment of RGB color channels due to lens dispersion, we introduce a novel Chromatic-Aware Coordinate Transformation (CA-CT) module, learning an image-adaptive color space to decouple and isolate fringing into a dedicated dimension. This targeted separation allows the network to learn a precise "purple fringe channel", which then guides the accurate restoration of the luminance channel. The final color correction is performed by a learned 5D Look-Up Table (5D LUT), enabling efficient and powerful non-linear color mapping. To enable robust training and fair evaluation, we constructed a large-scale synthetic purple fringing dataset (PF-Synth). Extensive experiments on synthetic and real-world datasets demonstrate that our method achieves state-of-the-art performance in purple fringing removal.

[136] Learning to Hear by Seeing: It's Time for Vision Language Models to Understand Artistic Emotion from Sight and Sound

Dengming Zhang,Weitao You,Jingxiong Li,Weishen Lin,Wenda Shi,Xue Zhao,Heda Zuo,Junxian Wu,Lingyun Sun

Main category: cs.CV

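TL;DR: Proposes VAEmotionLLM, a two-stage framework that uses vision-guided audio alignment and a cross-modal emotion adapter to give vision language models emotion understanding with limited audio pretraining, and builds ArtEmoBenchmark, an art-centric emotion benchmark.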

Details

Motivation: Existing emotion understanding methods are mostly single-modality or human-centered and overlook the emotion conveyed by the artwork itself; meanwhile, audio-visual large models depend on large-scale audio pretraining, which limits scalability. Method: Stage 1 uses Vision-Guided Audio Alignment (VG-Align), which aligns the next-token distributions of the visual and audio pathways on synchronized audio-video clips so that seeing teaches hearing; Stage 2 adds a lightweight Cross-Modal Emotion Adapter (EmoAdapter), consisting of an Emotion Enhancer and an Emotion Supervisor, to improve cross-modal emotion understanding. Result: The model achieves state-of-the-art performance on the newly built art emotion benchmark ArtEmoBenchmark, outperforming single-modality and multimodal baselines; ablations confirm that the components are complementary. Conclusion: VAEmotionLLM effectively combines audio and visual information for cross-modal emotion understanding without large-scale audio pretraining, and offers good scalability and application prospects. Abstract: Emotion understanding is critical for making Large Language Models (LLMs) more general, reliable, and aligned with humans. Art conveys emotion through the joint design of visual and auditory elements, yet most prior work is human-centered or single-modality, overlooking the emotion intentionally expressed by the artwork. Meanwhile, current Audio-Visual Language Models (AVLMs) typically require large-scale audio pretraining to endow Visual Language Models (VLMs) with hearing, which limits scalability. We present Vision Anchored Audio-Visual Emotion LLM (VAEmotionLLM), a two-stage framework that teaches a VLM to hear by seeing with limited audio pretraining and to understand emotion across modalities. In Stage 1, Vision-Guided Audio Alignment (VG-Align) distills the frozen visual pathway into a new audio pathway by aligning next-token distributions of the shared LLM on synchronized audio-video clips, enabling hearing without a large audio dataset. In Stage 2, a lightweight Cross-Modal Emotion Adapter (EmoAdapter), composed of the Emotion Enhancer and the Emotion Supervisor, injects emotion-sensitive residuals and applies emotion supervision to enhance cross-modal emotion understanding. We also construct ArtEmoBenchmark, an art-centric emotion benchmark that evaluates content and emotion understanding under audio-only, visual-only, and audio-visual inputs. VAEmotionLLM achieves state-of-the-art results on ArtEmoBenchmark, outperforming audio-only, visual-only, and audio-visual baselines. Ablations show that the proposed components are complementary.

[137] Point Cloud Quantization through Multimodal Prompting for 3D Understanding

Hongxuan Li,Wencheng Zhu,Huiying Xu,Xinzhong Zhu,Pengfei Zhu

Main category: cs.CV

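TL;DR: Proposes a multimodal-prompting-based vector quantization framework for point cloud analysis that uses text embeddings as semantic priors and fuses geometric and semantic information in a dual-constrained quantization space.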

Details

Motivation: Existing prototype-based vector quantization methods fall short in representativeness and interpretability, and struggle to bridge the vision-language semantic gap. Method: Text embeddings from pre-trained models serve as semantic prototype priors and are adaptively refined with multimodal prompts; a dual-constrained quantization space regularized for compactness and separation is designed, and Gumbel-Softmax provides differentiable discretization. Result: Experiments on the ModelNet40 and ScanObjectNN datasets show that the method clearly outperforms existing quantization methods on point cloud classification. Conclusion: The multimodal-prompting-driven quantization framework improves both the semantic richness of point cloud representations and quantization performance, offering a new direction for unified multimodal representation. Abstract: Vector quantization has emerged as a powerful tool in large-scale multimodal models, unifying heterogeneous representations through discrete token encoding. However, its effectiveness hinges on robust codebook design. Current prototype-based approaches relying on trainable vectors or clustered centroids fall short in representativeness and interpretability, even as multimodal alignment demonstrates its promise in vision-language models. To address these limitations, we propose a simple multimodal prompting-driven quantization framework for point cloud analysis. Our methodology is built upon two core insights: 1) Text embeddings from pre-trained models inherently encode visual semantics through many-to-one contrastive alignment, naturally serving as robust prototype priors; and 2) Multimodal prompts enable adaptive refinement of these prototypes, effectively mitigating vision-language semantic gaps. The framework introduces a dual-constrained quantization space, enforced by compactness and separation regularization, which seamlessly integrates visual and prototype features, resulting in hybrid representations that jointly encode geometric and semantic information. Furthermore, we employ Gumbel-Softmax relaxation to achieve differentiable discretization while maintaining quantization sparsity. Extensive experiments on the ModelNet40 and ScanObjectNN datasets clearly demonstrate the superior effectiveness of the proposed method.
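
The Gumbel-Softmax relaxation named in the abstract above can be sketched in a few lines. This is the generic textbook form, not the paper's code: the function name, the `tau` temperature argument, and the example logits (standing in for similarity scores between one feature and three codebook prototypes) are assumptions for illustration.

```python
import math
import random

def gumbel_softmax(logits, tau=1.0, rng=random):
    """Return a relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        # Clamp away from 0 so that log(u) is always defined.
        u = max(rng.random(), 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores of one point feature against three prototypes.
logits = [2.0, 0.5, -1.0]

# Same noise (same seed), two temperatures: lower tau gives a sharper sample.
soft = gumbel_softmax(logits, tau=1.0, rng=random.Random(0))
sharp = gumbel_softmax(logits, tau=0.1, rng=random.Random(0))
```

Lowering `tau` pushes the sample toward a one-hot code assignment while the output stays a smooth function of the logits, which is what allows gradients to pass through the discretization step.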

[138] Supervised Multilabel Image Classification Using Residual Networks with Probabilistic Reasoning

Lokender Singh,Saksham Kumar,Chandan Kumar

Main category: cs.CV

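TL;DR: Proposes a new multilabel image classification method based on a modified ResNet-101 and probabilistic reasoning, reaching 0.794 mAP on COCO-2014 and outperforming existing baselines.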

Details

Motivation: Multilabel image classification has broad applications in computer vision, but dependencies and uncertainty among labels make it challenging and call for more effective modeling. Method: A modified ResNet-101 architecture is combined with probabilistic reasoning that simulates label dependencies and uncertainties to improve multilabel classification. Result: The method achieves 0.794 mAP on the COCO-2014 dataset, exceeding ResNet-SRN (0.771) and Vision Transformer (0.785), and performs well on metrics such as the precision-recall score. Conclusion: Integrating probabilistic reasoning into deep learning models handles the complex label relationships in multilabel classification effectively and improves performance. Abstract: Multilabel image categorization has drawn interest recently because of its numerous computer vision applications. The proposed work introduces a novel method for classifying multilabel images using the COCO-2014 dataset and a modified ResNet-101 architecture. By simulating label dependencies and uncertainties, the approach uses probabilistic reasoning to improve prediction accuracy. Extensive tests show that the model outperforms earlier techniques and approaches state-of-the-art results in multilabel categorization. The work also thoroughly assesses the model's performance using metrics like precision-recall score and achieves 0.794 mAP on COCO-2014, outperforming ResNet-SRN (0.771) and Vision Transformer baselines (0.785). The novelty of the work lies in integrating probabilistic reasoning into deep learning models to effectively address the challenges presented by multilabel scenarios.
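
For readers unfamiliar with the mAP figure quoted above, the sketch below shows one common way to compute it: a non-interpolated average precision per class, averaged over classes. The abstract does not state which AP protocol the authors used, so this variant, the function names, and the toy scores are assumptions for illustration only.

```python
def average_precision(scores, labels):
    """Non-interpolated AP: mean of precision@rank taken at each positive item."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    # A class with no positive examples contributes 0.0 in this sketch.
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(per_class):
    """per_class is a list of (scores, labels) pairs, one pair per label class."""
    aps = [average_precision(scores, labels) for scores, labels in per_class]
    return sum(aps) / len(aps)

# Toy data: two classes, a handful of images each.
class_a = ([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
class_b = ([0.2, 0.9], [1, 0])

ap_a = average_precision(*class_a)
ap_b = average_precision(*class_b)
toy_map = mean_average_precision([class_a, class_b])
```

Because each label class is ranked independently, mAP rewards a multilabel model for scoring every correct label above the incorrect ones, not just for getting the top prediction right.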

[139] SemanticStitch: Enhancing Image Coherence through Foreground-Aware Seam Carving

Ji-Ping Jin,Chen-Bin Feng,Rui Fan,Chi-Man Vong

Main category: cs.CV

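TL;DR: Proposes SemanticStitch, a deep learning-based image stitching framework that uses semantic priors of foreground objects to preserve their integrity and improve visual coherence.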

Details

Motivation: Traditional image stitching methods tend to produce misalignments and visual inconsistencies under viewpoint changes, positional differences, and object motion, and they ignore semantic information, which leads to problems such as broken foregrounds. Method: The SemanticStitch framework incorporates semantic priors of foreground objects, introduces a new loss function that emphasizes the semantic integrity of salient objects, and is evaluated on two newly built real-world datasets. Result: Experiments show that the method clearly outperforms traditional techniques in stitching quality, especially in foreground integrity and visual coherence. Conclusion: By incorporating semantic priors, SemanticStitch effectively improves image stitching quality and is practical and broadly applicable. Abstract: Image stitching often faces challenges due to varying capture angles, positional differences, and object movements, leading to misalignments and visual discrepancies. Traditional seam carving methods neglect semantic information, causing disruptions in foreground continuity. We introduce SemanticStitch, a deep learning-based framework that incorporates semantic priors of foreground objects to preserve their integrity and enhance visual coherence. Our approach includes a novel loss function that emphasizes the semantic integrity of salient objects, significantly improving stitching quality. We also present two specialized real-world datasets to evaluate our method's effectiveness. Experimental results demonstrate substantial improvements over traditional techniques, providing robust support for practical applications.

[140] Teaching Prompts to Coordinate: Hierarchical Layer-Grouped Prompt Tuning for Continual Learning

Shengqin Jiang,Tianqi Kong,Yuankai Qi,Haokui Zhang,Lina Yao,Quan Z. Sheng,Qingshan Liu,Ming-Hsuan Yang

Main category: cs.CV

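TL;DR: Proposes a hierarchical layer-grouped prompt tuning method that improves model stability in continual learning by sharing prompts within layer groups and generating sub-prompts from a root prompt.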

Details

Motivation: Existing prompt-based continual learning methods use an independent task-specific prompt at each layer, which can make tuning overly flexible and increase the risk of catastrophic forgetting. Method: A hierarchical layer-grouped prompt tuning method is proposed: (1) network layers are grouped, and layers within a group share prompts that are adjusted by position encoding; (2) a single task-specific root prompt generates the sub-prompts for each layer group, strengthening their coordination. Result: Experiments on four benchmarks show that the method outperforms several state-of-the-art methods, mitigating catastrophic forgetting and improving continual learning performance. Conclusion: Through structured prompt design, the method improves model stability while retaining flexibility, helping to balance adaptation to new tasks against retention of old knowledge. Abstract: Prompt-based continual learning methods fine-tune only a small set of additional learnable parameters while keeping the pre-trained model's parameters frozen. This enables efficient adaptation to new tasks while mitigating the risk of catastrophic forgetting. These methods typically attach one independent task-specific prompt to each layer of pre-trained models to locally modulate its features, ensuring that the layer's representation aligns with the requirements of the new task. However, although introducing learnable prompts independently at each layer provides high flexibility for adapting to new tasks, this overly flexible tuning could make certain layers susceptible to unnecessary updates. As all prompts till the current task are added together as a final prompt for all seen tasks, the model may easily overwrite feature representations essential to previous tasks, which increases the risk of catastrophic forgetting. To address this issue, we propose a novel hierarchical layer-grouped prompt tuning method for continual learning. It improves model stability in two ways: (i) Layers in the same group share roughly the same prompts, which are adjusted by position encoding. This helps preserve the intrinsic feature relationships and propagation pathways of the pre-trained model within each group. (ii) It utilizes a single task-specific root prompt to learn to generate sub-prompts for each layer group. In this way, all sub-prompts are conditioned on the same root prompt, enhancing their synergy and reducing independence. Extensive experiments across four benchmarks demonstrate that our method achieves favorable performance compared with several state-of-the-art methods.

[141] Learning from Dense Events: Towards Fast Spiking Neural Networks Training via Event Dataset Distillation

Shuhan Ye,Yi Yu,Qixin Zhang,Chenqi Kong,Qiangqiang Wu,Kun Wang,Xudong Jiang

Main category: cs.CV

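TL;DR: PACE is the first dataset distillation framework for spiking neural networks (SNNs) and event-based vision; its two core modules, ST-DSM and PEQ-N, substantially reduce SNN training time and storage cost while retaining high accuracy.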

Details

Motivation: SNNs are energy-efficient, but temporal coding makes them costly to train, which limits practical deployment; an efficient way to reduce SNN training overhead on event data is needed. Method: The PACE framework comprises the ST-DSM module, which uses residual membrane potentials for fine-grained spatiotemporal matching, and the PEQ-N module, a quantizer compatible with standard event pipelines; together they distill a large event dataset into a compact synthetic one that accelerates SNN training. Result: PACE outperforms existing baselines on DVS-Gesture, CIFAR10-DVS, and N-MNIST; on N-MNIST it reaches 84.4% accuracy (about 85% of full-dataset performance) while cutting training time by more than 50x and storage cost by 6000x. Conclusion: PACE effectively addresses the high training cost of SNNs, enabling minute-scale SNN training and efficient edge deployment and advancing event-driven vision systems. Abstract: Event cameras sense brightness changes and output binary asynchronous event streams, attracting increasing attention. Their bio-inspired dynamics align well with spiking neural networks (SNNs), offering a promising energy-efficient alternative to conventional vision systems. However, SNNs remain costly to train due to temporal coding, which limits their practical deployment. To alleviate the high training cost of SNNs, we introduce PACE (Phase-Aligned Condensation for Events), the first dataset distillation framework for SNNs and event-based vision. PACE distills a large training dataset into a compact synthetic one that enables fast SNN training, which is achieved by two core modules: ST-DSM and PEQ-N. ST-DSM uses residual membrane potentials to densify spike-based features (SDR) and to perform fine-grained spatiotemporal matching of amplitude and phase (ST-SM), while PEQ-N provides a plug-and-play straight-through probabilistic integer quantizer compatible with standard event-frame pipelines. Across DVS-Gesture, CIFAR10-DVS, and N-MNIST datasets, PACE outperforms existing coreset selection and dataset distillation baselines, with particularly strong gains on dynamic event streams and at low or moderate IPC. Specifically, on N-MNIST, it achieves 84.4% accuracy, about 85% of the full training set performance, while reducing training time by more than 50x and storage cost by 6000x, yielding compact surrogates that enable minute-scale SNN training and efficient edge deployment.

[142] Sparse by Rule: Probability-Based N:M Pruning for Spiking Neural Networks

Shuhan Ye,Yi Yu,Qixin Zhang,Chenqi Kong,Qiangqiang Wu,Xudong Jiang,Dacheng Tao

Main category: cs.CV

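TL;DR: Proposes SpikeNM, the first semi-structured N:M pruning framework for spiking neural networks (SNNs), which learns sparse SNNs from scratch and yields hardware-friendly, efficient sparsity patterns while preserving accuracy.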

Details

Motivation: Deep SNNs gain energy efficiency from event-driven, sparse computation, but their parameter counts and computational cost are high; existing unstructured and structured pruning methods struggle to balance hardware acceleration against accuracy and lack flexibility. Method: The SpikeNM framework adopts an M-way basis-logit parameterization with a differentiable top-k sampler, linearizing per-block complexity to O(M) and avoiding exponential combinatorial complexity; it also introduces neuroscience-inspired eligibility-inspired distillation (EID), which uses temporally accumulated credits to produce block-wise soft targets that align mask probabilities with spiking dynamics and stabilize the search under high sparsity. Result: At 2:4 sparsity, SpikeNM maintains or even improves accuracy on mainstream datasets while producing sparsity patterns suited to hardware acceleration that complement the intrinsic sparsity of spikes. Conclusion: SpikeNM enables efficient, flexible, and hardware-friendly sparse training of SNNs, offering a workable route to edge deployment. Abstract: Brain-inspired spiking neural networks (SNNs) promise energy-efficient intelligence via event-driven, sparse computation, but deeper architectures inflate parameters and computational cost, hindering their edge deployment. Recent progress in SNN pruning helps alleviate this burden, yet existing efforts fall into only two families: unstructured pruning, which attains high sparsity but is difficult to accelerate on general hardware, and structured pruning, which eases deployment but lacks flexibility and often degrades accuracy at matched sparsity. In this work, we introduce SpikeNM, the first SNN-oriented semi-structured N:M pruning framework that learns sparse SNNs from scratch, enforcing at most N non-zeros per M-weight block. To avoid the combinatorial space complexity $\sum_{k=1}^{N}\binom{M}{k}$ growing exponentially with M, SpikeNM adopts an M-way basis-logit parameterization with a differentiable top-k sampler, linearizing per-block complexity to O(M) and enabling more aggressive sparsification. Further inspired by neuroscience, we propose eligibility-inspired distillation (EID), which converts temporally accumulated credits into block-wise soft targets to align mask probabilities with spiking dynamics, reducing sampling variance and stabilizing search under high sparsity. Experiments show that at 2:4 sparsity, SpikeNM maintains or even improves accuracy across mainstream datasets, while yielding hardware-amenable patterns that complement intrinsic spike sparsity.
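
To make the N:M constraint in the abstract above concrete, here is a minimal sketch that applies a 2:4 mask by weight magnitude and counts the candidate masks per block. Magnitude-based selection is a deliberately simple stand-in, not SpikeNM's learned basis-logit sampler, and both function names are assumptions for this illustration.

```python
import math

def nm_prune(weights, n=2, m=4):
    """Keep the n largest-magnitude weights in each block of m; zero the rest."""
    if len(weights) % m != 0:
        raise ValueError("weight count must be a multiple of m")
    pruned = []
    for start in range(0, len(weights), m):
        block = weights[start:start + m]
        by_magnitude = sorted(range(m), key=lambda i: abs(block[i]), reverse=True)
        keep = set(by_magnitude[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(block))
    return pruned

def mask_count(n, m):
    """Number of masks with 1..n non-zeros per block: sum_{k=1}^{n} C(m, k)."""
    return sum(math.comb(m, k) for k in range(1, n + 1))

weights = [0.1, -0.9, 0.5, 0.2, 1.0, 0.0, -0.3, 0.05]
sparse = nm_prune(weights, n=2, m=4)
candidates_2_4 = mask_count(2, 4)
candidates_8_16 = mask_count(8, 16)
```

For 2:4 there are only 10 candidate masks per block, but the count grows quickly with M (39,202 for 8:16), which is the combinatorial growth that the abstract's O(M) parameterization is designed to avoid.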

[143] DINOv3-Guided Cross Fusion Framework for Semantic-aware CT generation from MRI and CBCT

Xianhao Zhou,Jianghao Wu,Ku Zhao,Jinlong He,Huangxuan Zhao,Lei Chen,Shaoting Zhang,Guotai Wang

Main category: cs.CV

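TL;DR: Proposes the DGCF framework, which combines a frozen self-supervised DINOv3 Transformer with a trainable CNN encoder-decoder to generate synthetic CT images from CBCT or MRI, strengthening global semantic understanding while preserving local detail and achieving state-of-the-art performance.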

Details

Motivation: Existing CNN models lack global semantic understanding, while Transformers tend to overfit small medical datasets, so a model is needed that captures both local features and global semantics and works with limited data. Method: The DINOv3-Guided Cross Fusion (DGCF) framework hierarchically fuses the Transformer's global representations with the CNN's local features through a learnable cross fusion module, and adds a Multi-Level DINOv3 Perceptual (MLDP) loss that increases the semantic similarity between synthetic and real CT in DINOv3's feature space. Result: On the SynthRAD2023 pelvic dataset, DGCF achieves state-of-the-art results in MS-SSIM, PSNR, and segmentation-based metrics on both MRI→CT and CBCT→CT translation tasks. Conclusion: DGCF effectively combines the Transformer's global semantics with the CNN's local detail and is the first application of DINOv3 to medical image translation, demonstrating the potential of self-supervised Transformer guidance for semantic-aware CT synthesis. Abstract: Generating synthetic CT images from CBCT or MRI has potential for efficient radiation dose planning and adaptive radiotherapy. However, existing CNN-based models lack global semantic understanding, while Transformers often overfit small medical datasets due to high model capacity and weak inductive bias. To address these limitations, we propose a DINOv3-Guided Cross Fusion (DGCF) framework that integrates a frozen self-supervised DINOv3 Transformer with a trainable CNN encoder-decoder. It hierarchically fuses global representation of Transformer and local features of CNN via a learnable cross fusion module, achieving balanced local appearance and contextual representation. Furthermore, we introduce a Multi-Level DINOv3 Perceptual (MLDP) loss that encourages semantic similarity between synthetic CT and the ground truth in DINOv3's feature space. Experiments on the SynthRAD2023 pelvic dataset demonstrate that DGCF achieved state-of-the-art performance in terms of MS-SSIM, PSNR and segmentation-based metrics on both MRI→CT and CBCT→CT translation tasks. To the best of our knowledge, this is the first work to employ DINOv3 representations for medical image translation, highlighting the potential of self-supervised Transformer guidance for semantic-aware CT synthesis. The code is available at https://github.com/HiLab-git/DGCF.

[144] Adaptive Begin-of-Video Tokens for Autoregressive Video Diffusion Models

Tianle Cheng,Zeyan Zhang,Kaifeng Gao,Jun Xiao

Main category: cs.CV

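TL;DR: Proposes Adaptive Begin-of-Video Tokens (ada-BOV) for autoregressive video diffusion models, using learnable embeddings and adaptive-normalization-style modulation to improve global consistency and local dynamics in long video generation; combined with a refined stream denoising strategy and training noise schedule, it achieves strong qualitative and quantitative results across multiple metrics.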

Details

Motivation: Existing video diffusion models face denoising latency, error accumulation, fragile consistency, and poor motion dynamics when generating long videos, so a generation mechanism is needed that secures both global consistency and local dynamics. Method: Adaptive Begin-of-Video Tokens (ada-BOV) absorb information from preceding frames through an adaptive-layer-norm-like modulation; a refinement strategy for stream denoising decouples the sampling trajectory length from the attention window; and a disturbance-augmented training noise schedule improves model robustness. Result: Experiments show that the method outperforms existing approaches across multiple evaluation metrics, markedly improving the visual quality and temporal coherence of long video generation. Conclusion: ada-BOV provides an effective framework for autoregressive video diffusion models that balances global consistency with local dynamics in long video generation, advancing high-quality long video generation. Abstract: Recent advancements in diffusion-based video generation have produced impressive and high-fidelity short videos. To extend these successes to generate coherent long videos, most video diffusion models (VDMs) generate videos in an autoregressive manner, i.e., generating subsequent frames conditioned on previous ones. There are generally two primary paradigms: chunk-based extension and stream denoising. The former directly concatenates previous clean frames as conditioning, suffering from denoising latency and error accumulation. The latter maintains the denoising sequence with monotonically increasing noise levels. In each denoising iteration, one clean frame is produced while a new pure noise is simultaneously appended, enabling live-stream sampling. However, it struggles with fragile consistency and poor motion dynamics. In this paper, we propose Adaptive Begin-of-Video Tokens (ada-BOV) for autoregressive VDMs. The BOV tokens are special learnable embeddings on VDMs. They adaptively absorb denoised preceding frames via an adaptive-layer-norm-like modulation. This design preserves the global consistency while allowing for flexible conditioning in dynamic scenarios. To ensure the quality of local dynamics essential in modulating BOV tokens, we further propose a refinement strategy for stream denoising. It decouples the sampling trajectory length from the attention window size constraint, leading to improved local guidance and overall imaging quality. We also propose a disturbance-augmented training noise schedule, which balances the convergence speed with model robustness for the stream denoising. Extensive experiments demonstrate that our method achieves compelling qualitative and quantitative results across multiple metrics.

[145] Did Models Sufficient Learn? Attribution-Guided Training via Subset-Selected Counterfactual Augmentation

Yannan Chen,Ruoyu Chen,Bin Zeng,Wei Wang,Shiming Liu,Qunli Zhang,Zheng Hu,Laiyuan Wang,Yaowei Wang,Xiaochun Cao

Main category: cs.CV

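TL;DR: Proposes SS-CA, an attribution-based counterfactual augmentation method that improves in-distribution and out-of-distribution generalization and robustness by identifying and replacing the minimal regions critical to the model's prediction.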

Details

Motivation: In current visual model training, models rely on a limited set of key features for prediction, which makes them sensitive to distribution shift; counterfactual samples revealed by attribution methods also diverge from human perception, indicating that the dependencies the model has learned are not sufficiently causal. Method: Building on the LIMA attribution method, Counterfactual LIMA identifies minimal sets of spatial regions whose removal changes the prediction; the SS-CA strategy replaces these regions with natural background as data augmentation and trains the model jointly on original and augmented samples. Result: Experiments on multiple ImageNet variants show that SS-CA improves performance on in-distribution test data and achieves better results on OOD benchmarks such as ImageNet-R and ImageNet-S; it also generalizes better under perturbations such as noise. Conclusion: SS-CA effectively uses interpretability information to correct model deficiencies and promote more complete causal learning, improving both performance and robustness. Abstract: In current visual model training, models often rely on only limited sufficient causes for their predictions, which makes them sensitive to distribution shifts or the absence of key features. Attribution methods can accurately identify a model's critical regions. However, masking these areas to create counterfactuals often causes the model to misclassify the target, while humans can still easily recognize it. This divergence highlights that the model's learned dependencies may not be sufficiently causal. To address this issue, we propose Subset-Selected Counterfactual Augmentation (SS-CA), which integrates counterfactual explanations directly into the training process for targeted intervention. Building on the subset-selection-based LIMA attribution method, we develop Counterfactual LIMA to identify minimal spatial region sets whose removal can selectively alter model predictions. Leveraging these attributions, we introduce a data augmentation strategy that replaces the identified regions with natural background, and we train the model jointly on both augmented and original samples to mitigate incomplete causal learning. Extensive experiments across multiple ImageNet variants show that SS-CA improves generalization on in-distribution (ID) test data and achieves superior performance on out-of-distribution (OOD) benchmarks such as ImageNet-R and ImageNet-S. Under perturbations including noise, models trained with SS-CA also exhibit enhanced generalization, demonstrating that our approach effectively uses interpretability insights to correct model deficiencies and improve both performance and robustness.

[146] BdSL-SPOTER: A Transformer-Based Framework for Bengali Sign Language Recognition with Cultural Adaptation

Sayad Ibna Azad,Md. Atiqur Rahman

Main category: cs.CV

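TL;DR: Proposes BdSL-SPOTER, a pose-based Transformer framework for accurate and efficient Bengali Sign Language (BdSL) recognition that substantially outperforms the baseline on the BdSLW60 benchmark.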

Details

Motivation: For low-resource regional sign language recognition, existing models underperform and are computationally expensive, so an efficient and accurate dedicated framework is needed. Method: The SPOTER paradigm is extended with culture-specific preprocessing, a compact four-layer Transformer encoder with learnable positional encodings, and curriculum learning to improve generalization on limited data and speed up convergence. Result: The model reaches 97.92% Top-1 validation accuracy on the BdSLW60 benchmark, a 22.82% improvement over the Bi-LSTM baseline, with fewer parameters, lower FLOPs, and higher FPS. Conclusion: BdSL-SPOTER is an efficient, practical sign language recognition framework suited to real-world accessibility applications, and offers a scalable modeling template for other low-resource sign languages. Abstract: We introduce BdSL-SPOTER, a pose-based transformer framework for accurate and efficient recognition of Bengali Sign Language (BdSL). BdSL-SPOTER extends the SPOTER paradigm with culture-specific preprocessing and a compact four-layer transformer encoder featuring optimized learnable positional encodings, while employing curriculum learning to enhance generalization on limited data and accelerate convergence. On the BdSLW60 benchmark, it achieves 97.92% Top-1 validation accuracy, representing a 22.82% improvement over the Bi-LSTM baseline, all while keeping computational costs low. With its reduced number of parameters, lower FLOPs, and higher FPS, BdSL-SPOTER provides a practical framework for real-world accessibility applications and serves as a scalable model for other low-resource regional sign languages.

[147] TEMPO: Global Temporal Building Density and Height Estimation from Satellite Imagery

Tammy Glazer,Gilles Q. Hacheme,Akram Zaytar,Luana Marotti,Amy Michaels,Girmaw Abebe Tadesse,Kevin White,Rahul Dodhia,Andrew Zolli,Inbal Becker-Reshef,Juan M. Lavista Ferres,Caleb Robinson

Main category: cs.CV

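TL;DR: TEMPO uses deep learning models to generate a global, temporally continuous dataset of building density and height from high-resolution satellite imagery, efficiently, at low cost, and with high spatiotemporal consistency.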

Details

Motivation: To enable efficient, low-cost monitoring of changes in built-up areas worldwide and to support large-scale analysis of development patterns and climate impacts. Method: Existing building footprint and height data are paired with quarterly PlanetScope satellite imagery to train a multi-task deep learning model that predicts building density and height at 37.6-meter resolution; the model is applied to global imagery from Q1 2018 through Q2 2025. Result: The model achieves F1 scores between 85% and 88% on different hand-labeled subsets and a five-year trend-consistency score of 0.96, at a computational cost well below that of existing methods. Conclusion: TEMPO captures quarterly changes in built-up areas with high accuracy and temporal stability, providing strong data support for global resilience and adaptation research. Abstract: We present TEMPO, a global, temporally resolved dataset of building density and height derived from high-resolution satellite imagery using deep learning models. We pair building footprint and height data from existing datasets with quarterly PlanetScope basemap satellite images to train a multi-task deep learning model that predicts building density and building height at a 37.6-meter per pixel resolution. We apply this model to global PlanetScope basemaps from Q1 2018 through Q2 2025 to create global, temporal maps of building density and height. We validate these maps by comparing against existing building footprint datasets. Our estimates achieve an F1 score between 85% and 88% on different hand-labeled subsets, and are temporally stable, with a 0.96 five-year trend-consistency score. TEMPO captures quarterly changes in built settlements at a fraction of the computational cost of comparable approaches, unlocking large-scale monitoring of development patterns and climate impacts essential for global resilience and adaptation efforts.

[148] Fine-Grained DINO Tuning with Dual Supervision for Face Forgery Detection

Tianxiang Zhang,Peipeng Yu,Zhihua Xia,Longchen Dai,Xiaoyu Zhou,Hui Gao

Main category: cs.CV

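TL;DR: Proposes the DeepFake Fine-Grained Adapter (DFF-Adapter), a lightweight DINOv2-based adapter that uses multi-task learning to perform authenticity detection and forgery-method classification jointly, increasing sensitivity to different forgery artifacts and matching or surpassing complex existing methods with only 3.5M trainable parameters.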

Details

Motivation: Existing deepfake detection methods typically use DINOv2 for generic binary classification and overlook the distinct artifacts left by different forgery techniques, which limits detection accuracy. Method: Lightweight multi-head LoRA modules are inserted into every Transformer block of DINOv2, and a shared branch passes fine-grained forgery-type information to the authenticity detection head, enabling multi-task cooperative optimization. Result: With only 3.5M trainable parameters, the method achieves detection accuracy comparable to or better than current state-of-the-art complex models. Conclusion: DFF-Adapter strengthens authenticity detection through fine-grained forgery-type classification, confirming that forgery-method-specific knowledge improves detection while keeping the model both efficient and accurate. Abstract: The proliferation of sophisticated deepfakes poses significant threats to information integrity. While DINOv2 shows promise for detection, existing fine-tuning approaches treat it as generic binary classification, overlooking distinct artifacts inherent to different deepfake methods. To address this, we propose a DeepFake Fine-Grained Adapter (DFF-Adapter) for DINOv2. Our method incorporates lightweight multi-head LoRA modules into every transformer block, enabling efficient backbone adaptation. DFF-Adapter simultaneously addresses authenticity detection and fine-grained manipulation type classification, where classifying forgery methods enhances artifact sensitivity. We introduce a shared branch propagating fine-grained manipulation cues to the authenticity head. This enables multi-task cooperative optimization, explicitly enhancing authenticity discrimination with manipulation-specific knowledge. Utilizing only 3.5M trainable parameters, our parameter-efficient approach achieves detection accuracy comparable to or even surpassing that of current complex state-of-the-art methods.

[149] MediRound: Multi-Round Entity-Level Reasoning Segmentation in Medical Images

Qinyue Tong,Ziqian Lu,Jun Liu,Rui Zuo,Zheming Lu

Main category: cs.CV

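TL;DR: Introduces a new task, Multi-Round Entity-Level Medical Reasoning Segmentation (MEMR-Seg); builds MR-MedSeg, a large-scale dataset of 177K multi-round medical segmentation dialogues; and proposes MediRound, a baseline model with a judgment and correction mechanism that improves multi-round medical image segmentation.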

Details

Motivation: Existing medical image segmentation methods are mostly task-specific and lack interactivity, and text-prompt-based methods are confined to single-round dialogue, so they cannot perform multi-round reasoning or meet the complex interactive segmentation needs of clinical practice. Method: The paper defines the MEMR-Seg task and the MR-MedSeg dataset, designs the MediRound model, and introduces a lightweight judgment and correction mechanism to mitigate error propagation in multi-round segmentation. Result: Experiments show that MediRound outperforms conventional medical referring segmentation methods on MEMR-Seg, validating the multi-round reasoning and correction mechanisms. Conclusion: The work advances interactive medical image segmentation; multi-round entity-level reasoning and error correction improve practicality and accuracy in complex clinical scenarios. Abstract: Despite the progress in medical image segmentation, most existing methods remain task-specific and lack interactivity. Although recent text-prompt-based segmentation approaches enhance user-driven and reasoning-based segmentation, they remain confined to single-round dialogues and fail to perform multi-round reasoning. In this work, we introduce Multi-Round Entity-Level Medical Reasoning Segmentation (MEMR-Seg), a new task that requires generating segmentation masks through multi-round queries with entity-level reasoning. To support this task, we construct MR-MedSeg, a large-scale dataset of 177K multi-round medical segmentation dialogues, featuring entity-based reasoning across rounds. Furthermore, we propose MediRound, an effective baseline model designed for multi-round medical reasoning segmentation. To mitigate the inherent error propagation in the chain-like pipeline of multi-round segmentation, we introduce a lightweight yet effective Judgment & Correction Mechanism during model inference. Experimental results demonstrate that our method effectively addresses the MEMR-Seg task and outperforms conventional medical referring segmentation methods.

[150] RadarMP: Motion Perception for 4D mmWave Radar in Autonomous Driving

Ruiqi Cheng,Huijun Di,Jian Li,Feng Liu,Wei Liang

Main category: cs.CV

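TL;DR: Proposes RadarMP, a method that uses low-level radar echo signals from two consecutive frames for precise 3D scene motion perception. It jointly models radar target detection and motion estimation in a unified architecture and uses self-supervised losses guided by Doppler shifts and echo intensity, improving spatial and motion consistency without explicit annotations. Experiments show that RadarMP performs well across diverse weather and illumination conditions, outperforms existing decoupled radar pipelines, and strengthens perception for full-scenario autonomous driving.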

Details

Motivation: Radar point clouds are sparse and noisy, so existing methods struggle to achieve precise 3D motion perception in adverse weather, exactly when optical sensors degrade, leaving autonomous driving systems with limited sensing. A more robust, high-precision radar-only motion perception method is needed. Method: RadarMP operates directly on low-level radar echo signals from two frames and jointly performs radar target detection and 3D scene flow prediction in a unified architecture; self-supervised loss functions based on Doppler shifts and echo intensity enforce spatial and motion consistency. Result: Experiments on a public dataset show that RadarMP achieves reliable motion perception across weather and illumination conditions and clearly outperforms radar-based decoupled perception pipelines. Conclusion: Through joint modeling and self-supervised learning, RadarMP improves the 3D motion perception accuracy of 4D mmWave radar in complex environments and strengthens perception for full-scenario autonomous driving. Abstract: Accurate 3D scene motion perception significantly enhances the safety and reliability of an autonomous driving system. Benefiting from its all-weather operational capability and unique perceptual properties, 4D mmWave radar has emerged as an essential component in advanced autonomous driving. However, sparse and noisy radar points often lead to imprecise motion perception, leaving autonomous vehicles with limited sensing capabilities when optical sensors degrade under adverse weather conditions. In this paper, we propose RadarMP, a novel method for precise 3D scene motion perception using low-level radar echo signals from two consecutive frames. Unlike existing methods that separate radar target detection and motion estimation, RadarMP jointly models both tasks in a unified architecture, enabling consistent radar point cloud generation and pointwise 3D scene flow prediction. Tailored to radar characteristics, we design specialized self-supervised loss functions guided by Doppler shifts and echo intensity, effectively supervising spatial and motion consistency without explicit annotations. Extensive experiments on the public dataset demonstrate that RadarMP achieves reliable motion perception across diverse weather and illumination conditions, outperforming radar-based decoupled motion perception pipelines and enhancing perception capabilities for full-scenario autonomous driving systems.

[151] OAD-Promoter: Enhancing Zero-shot VQA using Large Language Models with Object Attribute Description

Quanxing Xu,Ling Zhou,Feifei Zhang,Jinyu Tian,Rubing Huang

Main category: cs.CV

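TL;DR: Proposes OAD-Promoter, a method for enhancing large language model (LLM)-based visual question answering (VQA) that mitigates language bias and improves robustness to domain shift, raising performance in few-shot and zero-shot settings.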

Details

Motivation: LLMs used for VQA inherit language biases from their reliance on massive training data, which makes predictions unreliable and hampers handling of out-of-distribution (OOD) samples, limiting generalization. Method: OAD-Promoter has three components: the Object-concentrated Example Generation (OEG) module generates global captions and object-concentrated samples to reduce bias; the Memory Knowledge Assistance (MKA) module retrieves knowledge from stored examples to support OOD questions; and the OAD Prompt integrates the outputs of the first two modules to optimize LLM inference. Result: Experiments show that OAD-Promoter significantly improves LLM-based VQA methods in few-shot and zero-shot settings, reaching new state-of-the-art results. Conclusion: OAD-Promoter mitigates language bias in LLM-based VQA and improves generalization to OOD data, offering a more reliable solution for knowledge-intensive VQA. Abstract: Large Language Models (LLMs) have become a crucial tool in Visual Question Answering (VQA) for handling knowledge-intensive questions in few-shot or zero-shot scenarios. However, their reliance on massive training datasets often causes them to inherit language biases during the acquisition of knowledge. This limitation imposes two key constraints on existing methods: (1) LLM predictions become less reliable due to bias exploitation, and (2) despite strong knowledge reasoning capabilities, LLMs still struggle with out-of-distribution (OOD) generalization. To address these issues, we propose Object Attribute Description Promoter (OAD-Promoter), a novel approach for enhancing LLM-based VQA by mitigating language bias and improving domain-shift robustness. OAD-Promoter comprises three components: the Object-concentrated Example Generation (OEG) module, the Memory Knowledge Assistance (MKA) module, and the OAD Prompt. The OEG module generates global captions and object-concentrated samples, jointly enhancing visual information input to the LLM and mitigating bias through complementary global and regional visual cues. The MKA module assists the LLM in handling OOD samples by retrieving relevant knowledge from stored examples to support questions from unseen domains. Finally, the OAD Prompt integrates the outputs of the preceding modules to optimize LLM inference. Experiments demonstrate that OAD-Promoter significantly improves the performance of LLM-based VQA methods in few-shot or zero-shot settings, achieving new state-of-the-art results.

[152] Compression and Inference of Spiking Neural Networks on Resource-Constrained Hardware

Karol C. Jurzec,Tomasz Szydlo,Maciej Wielgosz

Main category: cs.CV

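TL;DR: Presents a lightweight C runtime for efficient spiking neural network (SNN) inference on edge devices; model compression and sparsity optimizations substantially reduce latency and memory footprint, enabling deployment on a microcontroller.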

Details

Motivation: SNNs are event-driven and low-power, but training and deployment remain challenging; in particular, resource-constrained edge devices lack an efficient inference runtime. Method: A lightweight inference runtime is built in C. Models trained with SNNTorch are converted to a compact C representation that uses static, cache-friendly data layouts and preallocated memory to avoid overhead, and spike sparsity is exploited to prune inactive neurons and synapses, reducing computation in upstream convolutional layers. Result: On N-MNIST and ST-MNIST the runtime matches the accuracy of the Python baseline, achieves roughly 10x speedup on a desktop CPU with further gains from pruning, significantly reduces memory footprint, and is successfully deployed on an Arduino Portenta H7 microcontroller. Conclusion: With an optimized runtime and spike-driven model compression, SNNs can run efficiently on conventional embedded platforms and have practical potential on edge devices. Abstract: Spiking neural networks (SNNs) communicate via discrete spikes in time rather than continuous activations. Their event-driven nature offers advantages for temporal processing and energy efficiency on resource-constrained hardware, but training and deployment remain challenging. We present a lightweight C-based runtime for SNN inference on edge devices and optimizations that reduce latency and memory without sacrificing accuracy. Trained models exported from SNNTorch are translated to a compact C representation; static, cache-friendly data layouts and preallocation avoid interpreter and allocation overheads. We further exploit sparse spiking activity to prune inactive neurons and synapses, shrinking computation in upstream convolutional layers. Experiments on N-MNIST and ST-MNIST show functional parity with the Python baseline while achieving ~10x speedups on desktop CPU and additional gains with pruning, together with large memory reductions that enable microcontroller deployment (Arduino Portenta H7). Results indicate that SNNs can be executed efficiently on conventional embedded platforms when paired with an optimized runtime and spike-driven model compression. Code: https://github.com/karol-jurzec/snn-generator/
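
The benefit of sparse spiking activity can be illustrated with a single leaky integrate-and-fire (LIF) update that only visits inputs that actually spiked. This is a generic textbook-style sketch in Python, not the authors' C runtime; the function name, the decay factor `beta`, the threshold, and the reset-by-subtraction rule are all assumptions made for illustration.

```python
def lif_step(v, in_spikes, weights, beta=0.9, threshold=1.0):
    """One LIF time step for a fully connected layer.

    v          -- membrane potentials, one per output neuron
    in_spikes  -- 0/1 input spikes for this time step
    weights    -- weights[i][j] connects input j to output neuron i
    Only inputs that spiked contribute, so silent inputs cost nothing.
    """
    active = [j for j, s in enumerate(in_spikes) if s]  # indices of spiking inputs
    new_v, out_spikes = [], []
    for i, v_i in enumerate(v):
        current = sum(weights[i][j] for j in active)    # skip all silent inputs
        v_i = beta * v_i + current                      # leaky integration
        fired = 1 if v_i >= threshold else 0
        new_v.append(v_i - fired * threshold)           # reset by subtraction
        out_spikes.append(fired)
    return new_v, out_spikes


# Tiny usage example with hypothetical weights.
weights = [[0.5, 0.6, 0.0],
           [0.1, 0.1, 0.1]]
v, spikes = lif_step([0.0, 0.0], [1, 1, 0], weights)
```

The work per step scales with the number of active inputs rather than the layer width, which is the property a sparsity-aware runtime can exploit.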

[153] MAVIS: A Benchmark for Multimodal Source Attribution in Long-form Visual Question Answering

Seokwon Song,Minsu Park,Gunhee Kim

Main category: cs.CV

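TL;DR: MAVIS is the first benchmark for evaluating multimodal source attribution systems. It contains 157K visual question answering instances in which every answer is annotated with fine-grained multimodal citations, proposes automatic evaluation metrics, and reveals trade-offs among informativeness, groundedness, and fluency.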

Details

Motivation: Existing work on attributing AI-generated answers focuses mainly on text-only scenarios, overlooking the role of visual information in multimodal settings and offering no support for retrieving and citing multimodal evidence. Method: The authors build the MAVIS benchmark dataset of 157K visual QA instances annotated with fine-grained fact-level citations, design automatic metrics covering informativeness, groundedness, and fluency, and compare LVLMs with multimodal RAG under different prompting methods. Result: (1) Multimodal RAG produces more informative and fluent answers than unimodal RAG, but groundedness is weaker for image documents than for text documents, and this gap widens in multimodal settings; (2) given the same multimodal documents, different prompting methods trade off informativeness against groundedness; (3) contextual bias in interpreting image documents is a key factor affecting performance. Conclusion: Multimodal source attribution must balance informativeness and groundedness, and reducing contextual bias in image understanding is a key direction for future research. Abstract: Source attribution aims to enhance the reliability of AI-generated answers by including references for each statement, helping users validate the provided answers. However, existing work has primarily focused on text-only scenarios and largely overlooked the role of multimodality. We introduce MAVIS, the first benchmark designed to evaluate multimodal source attribution systems that understand user intent behind visual questions, retrieve multimodal evidence, and generate long-form answers with citations. Our dataset comprises 157K visual QA instances, where each answer is annotated with fact-level citations referring to multimodal documents. We develop fine-grained automatic metrics along three dimensions of informativeness, groundedness, and fluency, and demonstrate their strong correlation with human judgments. Our key findings are threefold: (1) LVLMs with multimodal RAG generate more informative and fluent answers than unimodal RAG, but they exhibit weaker groundedness for image documents than for text documents, a gap amplified in multimodal settings. (2) Given the same multimodal documents, there is a trade-off between informativeness and groundedness across different prompting methods. (3) Our proposed method highlights mitigating contextual bias in interpreting image documents as a crucial direction for future research. The dataset and experimental code are available at https://github.com/seokwon99/MAVIS

[154] Breaking the Modality Wall: Time-step Mixup for Efficient Spiking Knowledge Transfer from Static to Event Domain

Yuqi Xie,Shuhan Ye,Yi Yu,Chong Wang,Qixin Zhang,Jiazhen Xu,Le Shen,Yuanbin Qian,Jiangbo Qian,Guoqi Li

Main category: cs.CV

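TL;DR: Proposes Time-step Mixup Knowledge Transfer (TMKT), a cross-modal training framework that improves knowledge transfer for event cameras and spiking neural networks (SNNs). A time-step mixup strategy and modality-aware losses significantly improve classification on sparse event data.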

Details

Motivation: Event data is scarce and DVS outputs are sparse, and existing RGB-to-DVS knowledge transfer methods underperform because of the large distribution gap between modalities, so a more effective cross-modal training method is needed. Method: TMKT uses a probabilistic Time-step Mixup (TSM) strategy that interpolates RGB and DVS inputs at different time steps to build a smooth learning curriculum within each sequence. It adds two lightweight modality-aware objectives, frame-level Modality Aware Guidance (MAG) and sequence-level Mixup Ratio Perception (MRP), to align temporal features with the mixing schedule. Result: Experiments on multiple benchmarks and SNN backbones show that TMKT significantly outperforms existing methods, reduces gradient variance, stabilizes optimization, and improves spiking image classification. Conclusion: Through time-step mixing and modality-aware supervision, TMKT achieves smoother and more efficient cross-modal knowledge transfer and mitigates modality mismatch, offering an effective approach to efficient event-camera-based visual understanding. Abstract: The integration of event cameras and spiking neural networks (SNNs) promises energy-efficient visual intelligence, yet scarce event data and the sparsity of DVS outputs hinder effective training. Prior knowledge transfers from RGB to DVS often underperform because the distribution gap between modalities is substantial. In this work, we present Time-step Mixup Knowledge Transfer (TMKT), a cross-modal training framework with a probabilistic Time-step Mixup (TSM) strategy. TSM exploits the asynchronous nature of SNNs by interpolating RGB and DVS inputs at various time steps to produce a smooth curriculum within each sequence, which reduces gradient variance and stabilizes optimization with theoretical analysis. To employ auxiliary supervision from TSM, TMKT introduces two lightweight modality-aware objectives, Modality Aware Guidance (MAG) for per-frame source supervision and Mixup Ratio Perception (MRP) for sequence-level mix ratio estimation, which explicitly align temporal features with the mixing schedule. TMKT enables smoother knowledge transfer, helps mitigate modality mismatch during training, and achieves superior performance in spiking image classification tasks. Extensive experiments across diverse benchmarks and multiple SNN backbones, together with ablations, demonstrate the effectiveness of our method.
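
One possible reading of probabilistic time-step mixing can be sketched as follows: at each time step, a frame is drawn from either the RGB sequence or the DVS sequence according to a per-step probability. This is an assumption-laden illustration, not the paper's implementation; in particular, mixing by whole-frame replacement, the function name, and the probability schedule are all hypothetical. The per-frame source labels and the overall mix ratio returned here correspond only loosely to the kinds of targets that per-frame source supervision and mix ratio estimation would need.

```python
import random


def time_step_mixup(rgb_seq, dvs_seq, p_dvs, rng):
    """Build a mixed sequence by choosing, per time step, the RGB or the DVS frame.

    p_dvs[t] is the (assumed) probability of taking the DVS frame at step t.
    Returns the mixed sequence, per-frame source labels (1 = DVS), and the mix ratio.
    """
    mixed, source = [], []
    for rgb_frame, dvs_frame, p in zip(rgb_seq, dvs_seq, p_dvs):
        take_dvs = rng.random() < p
        mixed.append(dvs_frame if take_dvs else rgb_frame)
        source.append(1 if take_dvs else 0)
    ratio = sum(source) / len(source)  # fraction of DVS frames in this sequence
    return mixed, source, ratio


# Usage with placeholder "frames"; the rising schedule is a hypothetical curriculum.
rng = random.Random(0)
rgb = ["rgb0", "rgb1", "rgb2", "rgb3"]
dvs = ["dvs0", "dvs1", "dvs2", "dvs3"]
mixed, source, ratio = time_step_mixup(rgb, dvs, [0.1, 0.4, 0.7, 0.9], rng)
```

Raising the DVS probability across training or across time steps would move the input distribution gradually from the source modality to the target modality.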

[155] FIA-Edit: Frequency-Interactive Attention for Efficient and High-Fidelity Inversion-Free Text-Guided Image Editing

Kaixiang Yang,Boyang Shen,Xin Li,Yuchen Dai,Yuxuan Luo,Yueran Ma,Wei Fang,Qiang Li,Zhiwei Wang

Main category: cs.CV

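TL;DR: Proposes FIA-Edit, a text-guided image editing framework that requires no latent inversion and uses a frequency-interactive attention mechanism to achieve high-fidelity, semantically precise edits.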

Details

Motivation: Existing inversion-free methods struggle to integrate source image information during editing, which leads to poor background preservation, spatial inconsistencies, and over-editing. Method: The framework introduces a Frequency Representation Interaction (FRI) module and a Feature Injection (FIJ) module, which operate in self-attention and cross-attention respectively to enhance cross-domain alignment and preserve structure and semantics. Result: Experiments show that FIA-Edit completes an edit in about 6 seconds per 512 × 512 image, outperforms existing methods in visual quality, background fidelity, and controllability, and is applied for the first time to a clinical bleeding classification task. Conclusion: FIA-Edit delivers efficient, high-fidelity image editing and shows strong potential for medical image data augmentation. Abstract: Text-guided image editing has advanced rapidly with the rise of diffusion models. While flow-based inversion-free methods offer high efficiency by avoiding latent inversion, they often fail to effectively integrate source information, leading to poor background preservation, spatial inconsistencies, and over-editing. In this paper, we present FIA-Edit, a novel inversion-free framework that achieves high-fidelity and semantically precise edits through Frequency-Interactive Attention. Specifically, we design two key components: (1) a Frequency Representation Interaction (FRI) module that enhances cross-domain alignment by exchanging frequency components between source and target features within self-attention, and (2) a Feature Injection (FIJ) module that explicitly incorporates source-side queries, keys, values, and text embeddings into the target branch's cross-attention to preserve structure and semantics. Extensive experiments demonstrate that FIA-Edit supports high-fidelity editing at low computational cost (~6s per 512 × 512 image on an RTX 4090) and consistently outperforms existing methods across diverse tasks in visual quality, background fidelity, and controllability. Furthermore, we are the first to extend text-guided image editing to clinical applications. By synthesizing anatomically coherent hemorrhage variations in surgical images, FIA-Edit opens new opportunities for medical data augmentation and delivers significant gains in downstream bleeding classification. Our project is available at: https://github.com/kk42yy/FIA-Edit.

[156] Codebook-Centric Deep Hashing: End-to-End Joint Learning of Semantic Hash Centers and Neural Hash Function

Shuo Yin,Zhiyuan Yin,Yuqing Hou,Rui Liu,Yong Chen,Dell Zhang

Main category: cs.CV

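TL;DR: Proposes CRH, an end-to-end hashing method that dynamically reassigns hash centers while jointly optimizing the hash function, avoiding the complexity and performance loss of traditional two-stage methods. A multi-head mechanism increases semantic expressiveness, and the method outperforms existing approaches in retrieval on multiple benchmarks.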

Details

Motivation: Existing hash-center-based methods ignore inter-class semantic relationships because centers are randomly initialized, while two-stage optimization methods suffer from extra complexity, computational overhead, and inconsistency between stages, all of which limit performance. Method: Center-Reassigned Hashing (CRH) dynamically reassigns hash centers from a preset codebook without an explicit center optimization phase, and introduces a multi-head mechanism to increase the representational capacity of hash centers, so that the hash function and centers are optimized jointly end to end. Result: Experiments on three benchmark datasets show that CRH learns semantically meaningful hash centers and outperforms state-of-the-art deep hashing methods in retrieval. Conclusion: With its dynamic reassignment mechanism and multi-head structure, CRH achieves better semantics-preserving hash learning and retrieval performance, validating the integration of semantic relationships within an end-to-end framework. Abstract: Hash center-based deep hashing methods improve upon pairwise or triplet-based approaches by assigning fixed hash centers to each class as learning targets, thereby avoiding the inefficiency of local similarity optimization. However, random center initialization often disregards inter-class semantic relationships. While existing two-stage methods mitigate this by first refining hash centers with semantics and then training the hash function, they introduce additional complexity, computational overhead, and suboptimal performance due to stage-wise discrepancies. To address these limitations, we propose Center-Reassigned Hashing (CRH), an end-to-end framework that dynamically reassigns hash centers from a preset codebook while jointly optimizing the hash function. Unlike previous methods, CRH adapts hash centers to the data distribution without explicit center optimization phases, enabling seamless integration of semantic relationships into the learning process. Furthermore, a multi-head mechanism enhances the representational capacity of hash centers, capturing richer semantic structures. Extensive experiments on three benchmarks demonstrate that CRH learns semantically meaningful hash centers and outperforms state-of-the-art deep hashing methods in retrieval tasks.
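
To make the idea of reassigning centers from a preset codebook concrete, the sketch below gives each class the closest still-unused codebook entry under Hamming distance. This greedy matching is a stand-in chosen for brevity and is not the paper's reassignment rule; the function names, the use of ±1 codes, and the example data are assumptions for illustration only.

```python
def hamming(a, b):
    """Number of positions at which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def reassign_centers(class_codes, codebook):
    """Greedily map each class to the nearest unused codebook entry.

    class_codes -- dict: class name -> current binary code summarizing that class
    codebook    -- list of candidate hash centers (tuples of +1/-1)
    Returns a dict: class name -> index into the codebook.
    """
    free = set(range(len(codebook)))
    assignment = {}
    for cls, code in class_codes.items():
        # sorted() makes tie-breaking deterministic (lowest index wins)
        best = min(sorted(free), key=lambda k: hamming(code, codebook[k]))
        assignment[cls] = best
        free.remove(best)  # each center serves at most one class
    return assignment


# Hypothetical 4-bit codebook and class summaries.
codebook = [(1, 1, 1, 1), (-1, -1, -1, -1), (1, -1, 1, -1)]
class_codes = {"cat": (1, 1, 1, 1), "dog": (-1, -1, -1, 1)}
assignment = reassign_centers(class_codes, codebook)
```

A greedy pass depends on class order; an optimal one-to-one matching (for example via the Hungarian algorithm) would be the natural refinement if the assignment quality mattered.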

[157] Rethinking Multimodal Point Cloud Completion: A Completion-by-Correction Perspective

Wang Luo,Di Wu,Hengyuan Na,Yinlin Zhu,Miao Hu,Guocong Quan

Main category: cs.CV

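TL;DR: Proposes Completion-by-Correction, a new point cloud completion paradigm that starts from an image-to-3D generative prior and applies feature-space correction to produce completions that are structurally consistent and aligned with the observation.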

Details

Motivation: Existing multimodal methods that follow the Completion-by-Inpainting paradigm produce structural inconsistencies and topological artifacts because of limited geometric and semantic constraints, so a more robust completion method is needed. Method: The Completion-by-Correction paradigm uses a pretrained image-to-3D model to generate a topologically complete shape prior and corrects it in feature space to align with the partial observation. PGNet, a multi-stage framework, progressively refines the result through dual-feature encoding, coarse scaffold generation, and hierarchical refinement. Result: On the ShapeNetViPC dataset, PGNet reduces average Chamfer Distance by 23.5% and raises F-score by 7.1% relative to state-of-the-art methods. Conclusion: By shifting from unconstrained synthesis to guided correction, Completion-by-Correction improves structural consistency and observation alignment in point cloud completion, and PGNet is an effective realization of the paradigm. Abstract: Point cloud completion aims to reconstruct complete 3D shapes from partial observations, which is a challenging problem due to severe occlusions and missing geometry. Despite recent advances in multimodal techniques that leverage complementary RGB images to compensate for missing geometry, most methods still follow a Completion-by-Inpainting paradigm, synthesizing missing structures from fused latent features. We empirically show that this paradigm often results in structural inconsistencies and topological artifacts due to limited geometric and semantic constraints. To address this, we rethink the task and propose a more robust paradigm, termed Completion-by-Correction, which begins with a topologically complete shape prior generated by a pretrained image-to-3D model and performs feature-space correction to align it with the partial observation. This paradigm shifts completion from unconstrained synthesis to guided refinement, enabling structurally consistent and observation-aligned reconstruction. Building upon this paradigm, we introduce PGNet, a multi-stage framework that conducts dual-feature encoding to ground the generative prior, synthesizes a coarse yet structurally aligned scaffold, and progressively refines geometric details via hierarchical correction. Experiments on the ShapeNetViPC dataset demonstrate the superiority of PGNet over state-of-the-art baselines in terms of average Chamfer Distance (-23.5%) and F-score (+7.1%).
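
For readers unfamiliar with the Chamfer Distance metric cited above, here is a minimal sketch of one common definition: the mean squared nearest-neighbor distance from each set to the other, summed over both directions. Conventions differ between papers (squared versus unsquared distances, sums versus means, scaling factors), so this form is an assumption and may not match the exact variant used on ShapeNetViPC.

```python
def chamfer_distance(P, Q):
    """Symmetric Chamfer distance between two point sets (lists of coordinate tuples).

    Uses squared Euclidean distance and averages over each set; brute force, O(|P| * |Q|).
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    p_to_q = sum(min(sq_dist(p, q) for q in Q) for p in P) / len(P)
    q_to_p = sum(min(sq_dist(q, p) for p in P) for q in Q) / len(Q)
    return p_to_q + q_to_p


# Two tiny hypothetical point clouds.
partial = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
complete = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
cd = chamfer_distance(partial, complete)
```

Lower values mean the two clouds cover each other more closely; the metric is zero only when every point in each set coincides with some point in the other.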

[158] MixAR: Mixture Autoregressive Image Generation

Jinyuan Hu,Jiayou Zhang,Shaobo Cui,Kun Zhang,Guangyi Chen

Main category: cs.CV

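TL;DR: Proposes MixAR, a framework that uses a discrete-continuous mixture training paradigm to improve the efficiency and quality of continuous autoregressive image generation; its DC-Mix and TI-Mix strategies balance computational efficiency and generation fidelity.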

Details

Motivation: Quantizing images into discrete tokens discards fine-grained information and limits generation quality, while continuous spaces improve quality but lack structure, which makes modeling difficult. Method: MixAR uses discrete tokens as prior guidance for continuous autoregressive prediction, explores several mixture strategies (DC-SA, DC-CA, DC-Mix), and introduces the TI-Mix strategy to make training and inference distributions consistent. Result: Experiments show that DC-Mix strikes a good balance between computational efficiency and generation fidelity, and that TI-Mix brings consistent improvements. Conclusion: By combining discrete priors with continuous modeling, MixAR achieves efficient, high-quality continuous autoregressive image generation. Abstract: Autoregressive (AR) approaches, which represent images as sequences of discrete tokens from a finite codebook, have achieved remarkable success in image generation. However, the quantization process and the limited codebook size inevitably discard fine-grained information, placing bottlenecks on fidelity. Motivated by this limitation, recent studies have explored autoregressive modeling in continuous latent spaces, which offers higher generation quality. Yet, unlike discrete tokens constrained by a fixed codebook, continuous representations lie in a vast and unstructured space, posing significant challenges for efficient autoregressive modeling. To address these challenges, we introduce MixAR, a novel framework that leverages mixture training paradigms to inject discrete tokens as prior guidance for continuous AR modeling. MixAR is a factorized formulation in which discrete tokens guide continuous autoregressive prediction. We investigate several discrete-continuous mixture strategies, including self-attention (DC-SA), cross-attention (DC-CA), and a simple approach (DC-Mix) that replaces homogeneous mask tokens with informative discrete counterparts. Moreover, to bridge the gap between ground-truth training tokens and inference tokens produced by the pre-trained AR model, we propose Training-Inference Mixture (TI-Mix) to achieve consistent training and generation distributions. In our experiments, we demonstrate a favorable balance of the DC-Mix strategy between computational efficiency and generation fidelity, and consistent improvement of TI-Mix.

Abdelrahman Elsayed,Ahmed Jaheen,Mohammad Yaqub

Main category: cs.CV

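TL;DR: Proposes MMRINet, a lightweight network for efficient multi-parametric MRI brain tumor segmentation in resource-constrained settings. It combines Mamba models with dual-path feature refinement modules and performs well in BraTS-Lighthouse SSA 2025.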

Details

Motivation: In compute-constrained settings, existing deep 3D networks are too expensive to use for automated brain tumor segmentation, so a lighter and more efficient model is needed. Method: Linear-complexity Mamba state-space models replace conventional quadratic-complexity attention, and Dual-Path Feature Refinement (DPFR) modules and a Progressive Feature Aggregation (PFA) strategy improve feature diversity and multi-scale fusion. Result: In BraTS-Lighthouse SSA 2025, the model achieves an average Dice score of 0.752 and an average HD95 of 12.23 with only about 2.5M parameters. Conclusion: MMRINet delivers accurate brain tumor segmentation while keeping parameter count low and computation efficient, making it suitable for resource-constrained clinical environments. Abstract: Automated brain tumor segmentation in multi-parametric MRI remains challenging in resource-constrained settings where deep 3D networks are computationally prohibitive. We propose MMRINet, a lightweight architecture that replaces quadratic-complexity attention with linear-complexity Mamba state-space models for efficient volumetric context modeling. Novel Dual-Path Feature Refinement (DPFR) modules maximize feature diversity without additional data requirements, while Progressive Feature Aggregation (PFA) enables effective multi-scale fusion. In the BraTS-Lighthouse SSA 2025, our model achieves strong performance with an average Dice score of 0.752 and an average HD95 of 12.23 with only ~2.5M parameters, demonstrating efficient and accurate segmentation suitable for low-resource clinical environments. Our GitHub repository can be accessed here: github.com/BioMedIA-MBZUAI/MMRINet.
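
As a reminder of what the reported Dice score measures, the sketch below computes the Dice coefficient between two binary masks: twice the overlap divided by the total number of foreground voxels in both masks. The flattened-list representation, the function name, and the small `eps` smoothing term (which makes two empty masks score 1.0) are assumptions for illustration; challenge evaluation code may handle edge cases differently.

```python
def dice_score(pred, target, eps=1e-8):
    """Dice coefficient between two binary masks given as flat sequences of 0/1."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_size = sum(1 for p in pred if p)
    target_size = sum(1 for t in target if t)
    return (2 * intersection + eps) / (pred_size + target_size + eps)


# Hypothetical flattened masks: the prediction over-segments by one voxel.
pred_mask = [1, 1, 0, 0]
true_mask = [1, 0, 0, 0]
score = dice_score(pred_mask, true_mask)
```

A score of 1 means perfect overlap and 0 means none, so an average of 0.752 indicates substantial but imperfect agreement with the reference segmentations.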

Aditi Bhalla,Christian Hellert,Enkelejda Kasneci

Main category: cs.CV

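TL;DR: Proposes a two-phase cross-view, cross-modal unsupervised domain adaptation framework for real-time driver monitoring that substantially improves driver activity recognition accuracy across viewpoints and modalities.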

Details

Motivation: Existing methods usually address cross-view generalization or unsupervised domain adaptation in isolation, which makes robust and scalable deployment across diverse vehicle configurations difficult. Method: The first phase uses contrastive learning to learn view-invariant, action-discriminative features within a single modality; the second phase uses an information bottleneck loss to adapt to a new modality without any labeled data from the new domain. Result: On the Drive&Act dataset with video transformers such as Video Swin and MViT, top-1 accuracy improves by almost 50% over a supervised contrastive learning-based cross-view method and by up to 5% over methods that perform only unsupervised domain adaptation. Conclusion: The joint framework handles cross-view and cross-modal challenges together and substantially improves the generalization and practicality of driver activity recognition models. Abstract: Driver distraction remains a leading cause of road traffic accidents, contributing to thousands of fatalities annually across the globe. While deep learning-based driver activity recognition methods have shown promise in detecting such distractions, their effectiveness in real-world deployments is hindered by two critical challenges: variations in camera viewpoints (cross-view) and domain shifts such as change in sensor modality or environment. Existing methods typically address either cross-view generalization or unsupervised domain adaptation in isolation, leaving a gap in the robust and scalable deployment of models across diverse vehicle configurations. In this work, we propose a novel two-phase cross-view, cross-modal unsupervised domain adaptation framework that addresses these challenges jointly on real-time driver monitoring data. In the first phase, we learn view-invariant and action-discriminative features within a single modality using contrastive learning on multi-view data. In the second phase, we perform domain adaptation to a new modality using information bottleneck loss without requiring any labeled data from the new domain. We evaluate our approach using state-of-the-art video transformers (Video Swin, MViT) and the multimodal driver activity dataset Drive&Act, demonstrating that our joint framework improves top-1 accuracy on RGB video data by almost 50% compared to a supervised contrastive learning-based cross-view method, and outperforms unsupervised domain adaptation-only methods by up to 5%, using the same video transformer backbone.

[161] Bridging Granularity Gaps: Hierarchical Semantic Learning for Cross-domain Few-shot Segmentation

Sujun Sun,Haowen Gu,Cheng Xie,Yanxu Ren,Mingwu Ren,Haofeng Zhang

Main category: cs.CV

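TL;DR: Proposes a Hierarchical Semantic Learning (HSL) framework for cross-domain few-shot segmentation. Dual style randomization and hierarchical semantic mining modules improve understanding of semantics at different granularities, and a prototype confidence-modulated thresholding module reduces segmentation ambiguity, yielding state-of-the-art results on multiple datasets.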

Details

Motivation: Existing cross-domain few-shot segmentation methods focus mainly on style gaps between source and target domains and ignore gaps in segmentation granularity, leaving insufficient semantic discriminability for novel classes in target domains. Method: The HSL framework comprises three modules: Dual Style Randomization (DSR), Hierarchical Semantic Mining (HSM), and Prototype Confidence-modulated Thresholding (PCMT). DSR simulates target-domain data through foreground and global style randomization; HSM uses multi-scale superpixels to guide the model to mine intra-class consistency and inter-class distinction at different granularities; PCMT mitigates segmentation ambiguity when foreground and background are too similar. Result: Extensive experiments on four popular target-domain datasets show that the method outperforms existing approaches and reaches state-of-the-art performance. Conclusion: HSL improves recognition of novel classes at different semantic granularities in cross-domain few-shot segmentation, with strong generalization and robustness particularly when style and granularity gaps are large. Abstract: Cross-domain Few-shot Segmentation (CD-FSS) aims to segment novel classes from target domains that are not involved in training and have significantly different data distributions from the source domain, using only a few annotated samples, and recent years have witnessed significant progress on this task. However, existing CD-FSS methods primarily focus on style gaps between source and target domains while ignoring segmentation granularity gaps, resulting in insufficient semantic discriminability for novel classes in target domains. Therefore, we propose a Hierarchical Semantic Learning (HSL) framework to tackle this problem. Specifically, we introduce a Dual Style Randomization (DSR) module and a Hierarchical Semantic Mining (HSM) module to learn hierarchical semantic features, thereby enhancing the model's ability to recognize semantics at varying granularities. DSR simulates target domain data with diverse foreground-background style differences and overall style variations through foreground and global style randomization respectively, while HSM leverages multi-scale superpixels to guide the model to mine intra-class consistency and inter-class distinction at different granularities. Additionally, we propose a Prototype Confidence-modulated Thresholding (PCMT) module to mitigate segmentation ambiguity when foreground and background are excessively similar. Extensive experiments are conducted on four popular target domain datasets, and the results demonstrate that our method achieves state-of-the-art performance.

[162] OmniSparse: Training-Aware Fine-Grained Sparse Attention for Long-Video MLLMs

Feng Chen,Yefei He,Shaoxuan He,Yuanyu He,Jing Liu,Lequan Lin,Akide Liu,Zhaoyang Li,Jiyuan Zhang,Zhenbang Sun,Bohan Zhuang,Qi Wu

Main category: cs.CV

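TL;DR: Proposes OmniSparse, a training-aware fine-grained sparse attention framework for long-video MLLMs that allocates token budgets dynamically during both training and inference for efficient processing.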

Details

Motivation: Existing sparse attention methods mainly target inference acceleration but suffer from a training-inference mismatch and lack fine-grained token selection across multiple dimensions such as queries, key-values, and attention heads, which leads to suboptimal performance and limited speedup. Method: OmniSparse contains three adaptive mechanisms: (1) query selection via lazy-active classification; (2) KV selection with a shared budget determined by the flattest head; (3) KV cache slimming that selectively fetches the visual KV cache according to head-level decoding query patterns. Result: Experiments show that OmniSparse matches full-attention performance while achieving up to 2.7x speedup during prefill and 2.4x memory reduction during decoding. Conclusion: Through training-aware, multi-dimensional fine-grained sparsification, OmniSparse significantly improves the efficiency of long-video MLLMs without sacrificing performance. Abstract: Existing sparse attention methods primarily target inference-time acceleration by selecting critical tokens under predefined sparsity patterns. However, they often fail to bridge the training-inference gap and lack the capacity for fine-grained token selection across multiple dimensions such as queries, key-values (KV), and heads, leading to suboptimal performance and limited acceleration gains. In this paper, we introduce OmniSparse, a training-aware fine-grained sparse attention framework for long-video MLLMs, which operates in both training and inference with dynamic token budget allocation. Specifically, OmniSparse contains three adaptive and complementary mechanisms: (1) query selection via lazy-active classification, retaining active queries that capture broad semantic similarity while discarding most lazy ones that focus on limited local context and exhibit high functional redundancy; (2) KV selection with head-level dynamic budget allocation, where a shared budget is determined based on the flattest head and applied uniformly across all heads to ensure attention recall; and (3) KV cache slimming to reduce head-level redundancy by selectively fetching visual KV cache according to the head-level decoding query pattern. Experimental results show that OmniSparse matches the performance of full attention while achieving up to 2.7x speedup during prefill and 2.4x memory reduction during decoding.

[163] LSS3D: Learnable Spatial Shifting for Consistent and High-Quality 3D Generation from Single-Image

Zhuojiang Cai,Yiheng Zhang,Meitong Guo,Mingdao Wang,Yuwang Wang

Main category: cs.CV

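TL;DR: Proposes LSS3D, a high-quality image-to-3D generation method that uses learnable spatial shifting to address multi-view inconsistency and non-frontal input views.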

Details

Motivation: Existing multi-view diffusion models suffer from shape and texture misalignment when generating 3D content and are not robust to non-frontal views, which harms geometric completeness and texture quality. Method: Learnable spatial shifting parameters are assigned to each view, and the reconstructed mesh guides every view toward a spatially consistent target; the input view is added as an optimization constraint to improve robustness to non-frontal views, especially elevated ones. Result: The method achieves leading results on both geometric and texture metrics, supports more flexible input viewpoints, and provides a quantitative evaluation pipeline that the community can use for comparison. Conclusion: LSS3D mitigates multi-view inconsistency and produces high-quality 3D results with more complete geometric details and cleaner textures across a range of input viewpoints. Abstract: Recently, multi-view diffusion-based 3D generation methods have gained significant attention. However, these methods often suffer from shape and texture misalignment across generated multi-view images, leading to low-quality 3D generation results, such as incomplete geometric details and textural ghosting. Some methods are mainly optimized for the frontal perspective and exhibit poor robustness to oblique perspective inputs. In this paper, to tackle the above challenges, we propose a high-quality image-to-3D approach, named LSS3D, with learnable spatial shifting to explicitly and effectively handle the multi-view inconsistencies and non-frontal input view. Specifically, we assign learnable spatial shifting parameters to each view, and adjust each view towards a spatially consistent target, guided by the reconstructed mesh, resulting in high-quality 3D generation with more complete geometric details and clean textures. Besides, we include the input view as an extra constraint for the optimization, further enhancing robustness to non-frontal input angles, especially for elevated viewpoint inputs. We also provide a comprehensive quantitative evaluation pipeline that can contribute to the community in performance comparisons. Extensive experiments demonstrate that our method consistently achieves leading results in both geometric and texture evaluation metrics across more flexible input viewpoints.

[164] GeoMVD: Geometry-Enhanced Multi-View Generation Model Based on Geometric Information Extraction

Jiaqi Wu,Yaosen Chen,Shuyuan Zhu

Main category: cs.CV

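TL;DR: Proposes the Geometry-guided Multi-View Diffusion Model, which extracts geometric information such as depth, normals, and foreground segmentation and combines a decoupled geometry-enhanced attention mechanism with an adaptive learning strategy to generate multi-view images that are consistent across views and rich in detail.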

Details

Motivation: Existing multi-view image generation methods that extend a single image face computational challenges in cross-view consistency and high-resolution output, and struggle to guarantee geometric consistency and detail quality at the same time. Method: A multi-view geometry information extraction module uses depth maps, normal maps, and foreground segmentation masks to build a shared geometric structure; a decoupled geometry-enhanced attention mechanism strengthens key geometric features; an adaptive learning strategy and a dynamic geometry information intensity adjustment mechanism are combined with an iterative refinement process to improve generation quality. Result: The model maintains cross-view geometric consistency while markedly improving the detail and visual realism of generated images, and performs strongly on multi-view generation tasks. Conclusion: The method addresses cross-view inconsistency and detail loss in multi-view image generation, and its geometry guidance yields high-quality, high-resolution, visually natural multi-view images. Abstract: Multi-view image generation holds significant application value in computer vision, particularly in domains like 3D reconstruction, virtual reality, and augmented reality. Most existing methods, which rely on extending single images, face notable computational challenges in maintaining cross-view consistency and generating high-resolution outputs. To address these issues, we propose the Geometry-guided Multi-View Diffusion Model, which incorporates mechanisms for extracting multi-view geometric information and adjusting the intensity of geometric features to generate images that are both consistent across views and rich in detail. Specifically, we design a multi-view geometry information extraction module that leverages depth maps, normal maps, and foreground segmentation masks to construct a shared geometric structure, ensuring shape and structural consistency across different views. To enhance consistency and detail restoration during generation, we develop a decoupled geometry-enhanced attention mechanism that strengthens feature focus on key geometric details, thereby improving overall image quality and detail preservation. Furthermore, we apply an adaptive learning strategy that fine-tunes the model to better capture spatial relationships and visual coherence between the generated views, ensuring realistic results. Our model also incorporates an iterative refinement process that progressively improves the output quality through multiple stages of image generation. Finally, a dynamic geometry information intensity adjustment mechanism is proposed to adaptively regulate the influence of geometric data, optimizing overall quality while ensuring the naturalness of generated images. More details can be found on the project page: https://github.com/SobeyMIL/GeoMVD.com.

[165] A Novel AI-Driven System for Real-Time Detection of Mirror Absence, Helmet Non-Compliance, and License Plates Using YOLOv8 and OCR

Nishant Vasantkumar Hegde,Aditi Agarwal,Minal Moharir

Main category: cs.CV

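TL;DR: Presents an AI-based automatic traffic violation detection system that uses YOLOv8 and EasyOCR to detect helmet non-compliance and missing rear-view mirrors and to recognize license plates, with high accuracy and potential for real-world deployment.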

Details

Motivation: Traditional traffic enforcement relies on manual effort, which is resource-intensive and inefficient, making it hard to enforce helmet and vehicle safety rules consistently. Method: Uses YOLOv8 for object detection to identify motorcycles with unhelmeted riders or missing rear-view mirrors; uses EasyOCR to extract license plate information; trains on a custom annotated dataset; applies image preprocessing to make recognition more robust; and provides a Streamlit front end for visual monitoring. Result: In testing, the model reaches a precision of 0.9147, a recall of 0.886, an mAP@50 of 0.843, and an mAP@50:95 of 0.503, with recognition remaining stable under challenging conditions. Conclusion: The system offers an efficient and feasible solution for automated traffic enforcement, with good prospects for practical use in road safety management. Abstract: Road safety is a critical global concern, with manual enforcement of helmet laws and vehicle safety standards (e.g., rear-view mirror presence) being resource-intensive and inconsistent. This paper presents an AI-powered system to automate traffic violation detection, significantly enhancing enforcement efficiency and road safety. The system leverages YOLOv8 for robust object detection and EasyOCR for license plate recognition. Trained on a custom dataset of annotated images (augmented for diversity), it identifies helmet non-compliance and the absence of rear-view mirrors on motorcycles (an innovative contribution to automated checks), and extracts vehicle registration numbers. A Streamlit-based interface facilitates real-time monitoring and violation logging. Advanced image preprocessing enhances license plate recognition, particularly under challenging conditions. Based on evaluation results, the model achieves an overall precision of 0.9147, a recall of 0.886, and a mean Average Precision (mAP@50) of 0.843. The mAP@50:95 of 0.503 further indicates strong detection capability under stricter IoU thresholds. This work demonstrates a practical and effective solution for automated traffic rule enforcement, with considerations for real-world deployment discussed.

[166] Mixture of States: Routing Token-Level Dynamics for Multimodal Generation

Haozhe Liu,Ding Liu,Mingchen Zhuge,Zijian Zhou,Tian Xie,Sen He,Yukang Yang,Shuming Liu,Yuren Cong,Jiadong Guo,Hongyu Xu,Ke Xu,Kam-Woh Ng,Juan C. Pérez,Juan-Manuel Pérez-Rúa,Tao Xiang,Wei Liu,Shikun Liu,Jürgen Schmidhuber

Main category: cs.CV

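TL;DR: Proposes MoS (Mixture of States), a new fusion paradigm for multimodal diffusion models in which a learnable, token-wise router enables flexible interaction between modalities; it achieves state-of-the-art results on text-to-image generation and editing with only 3B to 5B parameters.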

Details

Motivation: Modality fusion in existing multimodal diffusion models lacks flexibility and efficiency, making it hard to reach high performance while keeping the model small. Method: Introduces MoS (Mixture of States), which uses a learnable, token-wise router to dynamically select the most relevant hidden states for modality fusion during denoising, conditioned on the timestep and the input; the router uses sparse top-$k$ selection, is trained with an $\epsilon$-greedy strategy, and needs very few learnable parameters. Result: Achieves state-of-the-art results on text-to-image generation (MoS-Image) and editing (MoS-Editing); models with 3B to 5B parameters match or surpass counterparts up to $4\times$ larger, with negligible computational overhead. Conclusion: MoS is a flexible and compute-efficient fusion paradigm for multimodal diffusion models and offers a viable path for scaling multimodal modeling. Abstract: We introduce MoS (Mixture of States), a novel fusion paradigm for multimodal diffusion models that merges modalities using flexible, state-based interactions. The core of MoS is a learnable, token-wise router that creates denoising timestep- and input-dependent interactions between modalities' hidden states, precisely aligning token-level features with the diffusion trajectory. This router sparsely selects the top-$k$ hidden states and is trained with an $\epsilon$-greedy strategy, efficiently selecting contextual features with minimal learnable parameters and negligible computational overhead. We validate our design with text-to-image generation (MoS-Image) and editing (MoS-Editing), which achieve state-of-the-art results. With only 3B to 5B parameters, our models match or surpass counterparts up to $4\times$ larger. These findings establish MoS as a flexible and compute-efficient paradigm for scaling multimodal diffusion models.
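
As a rough illustration of the routing idea described in this abstract (not the authors' implementation), the sketch below picks the top-k hidden states by router score and, with probability epsilon, explores by sampling k indices uniformly at random instead. The function name `route_top_k`, the plain list of scores, and the uniform-random exploration rule are all assumptions made for this example.

```python
import random

def route_top_k(scores, k, epsilon=0.0, rng=None):
    """Return the sorted indices of k selected hidden states.

    scores: router logits, one per candidate hidden state (hypothetical input).
    With probability epsilon, explore by sampling k indices uniformly at
    random; otherwise exploit by taking the k highest-scoring indices.
    """
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and len(scores)")
    rng = rng or random.Random()
    if rng.random() < epsilon:
        # Exploration branch (assumed form of epsilon-greedy).
        return sorted(rng.sample(range(len(scores)), k))
    # Exploitation branch: plain sparse top-k selection.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

selected = route_top_k([0.1, 2.3, -0.5, 1.7], k=2)
print(selected)  # [1, 3]
```

With `epsilon=0.0` the selection is deterministic; a training loop would presumably use a small positive epsilon so that rarely chosen states still receive gradient signal.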

Peng Zhang,Zhihui Lai,Wenting Chen,Xu Wu,Heng Kong

Main category: cs.CV

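TL;DR: Proposes FaNe, a semantic-enhanced medical vision-language pre-training framework that mitigates false negatives and improves fine-grained cross-modal alignment through semantic-aware positive pair mining, text-conditioned sparse attention pooling, and a hard-negative aware contrastive loss.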

Details

Motivation: Existing medical vision-language pre-training methods are limited by false negatives caused by semantically similar texts and by insufficient fine-grained cross-modal alignment. Method: Introduces a positive pair mining strategy based on text-text similarity with adaptive normalization; designs a text-conditioned sparse attention pooling module for fine-grained image-text alignment; and proposes a hard-negative aware contrastive loss that adaptively reweights semantically similar negatives. Result: Experiments on five downstream medical imaging benchmarks show that FaNe achieves state-of-the-art performance on image classification, object detection, and semantic segmentation. Conclusion: FaNe mitigates false negatives and strengthens both cross-modal alignment and intra-modal discrimination, improving medical vision-language understanding. Abstract: Medical vision-language pre-training (VLP) offers significant potential for advancing medical image understanding by leveraging paired image-report data. However, existing methods are limited by False Negatives (FaNe) induced by semantically similar texts and insufficient fine-grained cross-modal alignment. To address these limitations, we propose FaNe, a semantic-enhanced VLP framework. To mitigate false negatives, we introduce a semantic-aware positive pair mining strategy based on text-text similarity with adaptive normalization. Furthermore, we design a text-conditioned sparse attention pooling module to enable fine-grained image-text alignment through localized visual representations guided by textual cues. To strengthen intra-modal discrimination, we develop a hard-negative aware contrastive loss that adaptively reweights semantically similar negatives. Extensive experiments on five downstream medical imaging benchmarks demonstrate that FaNe achieves state-of-the-art performance across image classification, object detection, and semantic segmentation, validating the effectiveness of our framework.

[168] Suppressing VLM Hallucinations with Spectral Representation Filtering

Ameen Ali,Tamim Zoabi,Lior Wolf

Main category: cs.CV

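TL;DR: Proposes Spectral Representation Filtering (SRF), a lightweight, training-free method that suppresses hallucinations in vision-language models by analyzing and correcting the covariance structure of the model's representations.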

Details

Motivation: Because they over-rely on language priors and ground text in images imprecisely, vision-language models often hallucinate objects, attributes, or relations that are not present, which calls for an effective fix that needs no retraining. Method: Eigendecomposes the covariance matrix of the feature differences between truthful and hallucinatory captions to identify low-rank hallucination modes, then attenuates these modes with a soft spectral filter in the feed-forward projection weights of deeper vLLM layers, thereby correcting the model's representations. Result: Across three VLM families (LLaVA-1.5, MiniGPT-4, and mPLUG-Owl2), SRF reduces hallucination rates on MSCOCO, POPE-VQA, and other benchmarks, achieving state-of-the-art faithfulness without hurting generation quality. Conclusion: SRF is a post-hoc method that needs no training, adds no inference overhead, and requires no architectural changes; it suppresses hallucinations in VLMs and is broadly applicable. Abstract: Vision-language models (VLMs) frequently produce hallucinations in the form of descriptions of objects, attributes, or relations that do not exist in the image due to over-reliance on language priors and imprecise cross-modal grounding. We introduce Spectral Representation Filtering (SRF), a lightweight, training-free method to suppress such hallucinations by analyzing and correcting the covariance structure of the model's representations. SRF identifies low-rank hallucination modes through eigendecomposition of the covariance of the differences between features collected for truthful and hallucinatory captions, revealing structured biases in the feature space. A soft spectral filter then attenuates these modes in the feed-forward projection weights of deeper vLLM layers, equalizing feature variance while preserving semantic fidelity. Unlike decoding or retraining-based approaches, SRF operates entirely post-hoc, incurs zero inference overhead, and requires no architectural modifications. Across three families of VLMs (LLaVA-1.5, MiniGPT-4, and mPLUG-Owl2), SRF consistently reduces hallucination rates on MSCOCO, POPE-VQA, and other visual task benchmarks, achieving state-of-the-art faithfulness without degrading caption quality.

[169] Model Inversion Attack Against Deep Hashing

Dongdong Zhao,Qiben Xu,Ranxin Fang,Baogang Song

Main category: cs.CV

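TL;DR: Proposes DHMI, the first diffusion-based model inversion framework for deep hashing, which reconstructs high-quality, high-resolution training images in the black-box setting and exposes serious privacy risks in deep hashing systems.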

Details

Motivation: Deep hashing improves retrieval efficiency, but its exposure to model inversion attacks has been overlooked; because genuine training hash codes are inaccessible and the Hamming space is discrete, existing methods do not transfer, so inversion attacks designed specifically for deep hashing are needed. Method: DHMI first clusters an auxiliary dataset to derive semantic hash centers as surrogate anchors; it then uses a surrogate-guided denoising optimization with a new attack metric, combining classification consistency and hash proximity, to dynamically select candidate samples; finally, a cluster of surrogate models refines the candidates into high-fidelity, semantically consistent images. Result: Experiments on multiple datasets show that DHMI reconstructs high-resolution, high-quality images even in the most challenging black-box setting and outperforms existing state-of-the-art black-box model inversion attacks. Conclusion: DHMI confirms that deep hashing systems carry serious privacy leakage risks and provides a warning and an evaluation baseline for designing secure hashing models. Abstract: Deep hashing improves retrieval efficiency through compact binary codes, yet it introduces severe and often overlooked privacy risks. The ability to reconstruct original training data from hash codes could lead to serious threats such as biometric forgery and privacy breaches. However, model inversion attacks specifically targeting deep hashing models remain unexplored, leaving their security implications unexamined. This research gap stems from the inaccessibility of genuine training hash codes and the highly discrete Hamming space, which prevents existing methods from adapting to deep hashing. To address these challenges, we propose DHMI, the first diffusion-based model inversion framework designed for deep hashing. DHMI first clusters an auxiliary dataset to derive semantic hash centers as surrogate anchors. It then introduces a surrogate-guided denoising optimization method that leverages a novel attack metric (fusing classification consistency and hash proximity) to dynamically select candidate samples. A cluster of surrogate models guides the refinement of these candidates, ensuring the generation of high-fidelity and semantically consistent images. Experiments on multiple datasets demonstrate that DHMI successfully reconstructs high-resolution, high-quality images even under the most challenging black-box setting, where no training hash codes are available. Our method outperforms the existing state-of-the-art model inversion attacks in black-box scenarios, confirming both its practical efficacy and the critical privacy risks inherent in deep hashing systems.

[170] Fusionista2.0: Efficiency Retrieval System for Large-Scale Datasets

Huy M. Le,Dat Tien Nguyen,Phuc Binh Nguyen,Gia-Bao Le-Tran,Phu Truong Thien,Cuong Dinh,Minh Nguyen,Nga Nguyen,Thuy T. N. Nguyen,Huy Gia Ngo,Tan Nhat Nguyen,Binh T. Nguyen,Monojit Choudhury

Main category: cs.CV

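TL;DR: Fusionista2.0 is an efficient system optimized for video retrieval; technical upgrades and interface improvements cut retrieval time by up to 75% while improving accuracy and user satisfaction.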

Details

Motivation: Meeting the Video Browser Showdown (VBS) demand for accurate retrieval under strict time limits requires a video retrieval system that is both fast and easy to use. Method: Re-engineers the core modules: ffmpeg for fast keyframe extraction, Vintern-1B-v3.5 for multilingual OCR, faster-whisper for real-time speech recognition, and lightweight vision-language models for question answering; also redesigns the user interface for better responsiveness, accessibility, and workflow efficiency. Result: Retrieval time is reduced by up to 75%, and both accuracy and user satisfaction improve. Conclusion: Fusionista2.0 is a competitive and user-friendly system for large-scale video search. Abstract: The Video Browser Showdown (VBS) challenges systems to deliver accurate results under strict time constraints. To meet this demand, we present Fusionista2.0, a streamlined video retrieval system optimized for speed and usability. All core modules were re-engineered for efficiency: preprocessing now relies on ffmpeg for fast keyframe extraction, optical character recognition uses Vintern-1B-v3.5 for robust multilingual text recognition, and automatic speech recognition employs faster-whisper for real-time transcription. For question answering, lightweight vision-language models provide quick responses without the heavy cost of large models. Beyond these technical upgrades, Fusionista2.0 introduces a redesigned user interface with improved responsiveness, accessibility, and workflow efficiency, enabling even non-expert users to retrieve relevant content rapidly. Evaluations demonstrate that retrieval time was reduced by up to 75% while accuracy and user satisfaction both increased, confirming Fusionista2.0 as a competitive and user-friendly system for large-scale video search.

[171] Prompt-Conditioned FiLM and Multi-Scale Fusion on MedSigLIP for Low-Dose CT Quality Assessment

Tolga Demiroglu,Mehmet Ozan Unal,Metin Ertas,Isa Yildirim

Main category: cs.CV

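TL;DR: Proposes a prompt-conditioned framework built on MedSigLIP that injects textual priors via FiLM and multi-scale pooling for low-dose CT image quality assessment, achieving leading performance on the LDCTIQA2023 challenge with little training data.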

Details

Motivation: To enable data-efficient learning and rapid adaptation to clinical intent, text prompts need to be combined with image features, improving the accuracy and interpretability of low-dose CT image quality assessment. Method: Builds a prompt-conditioned framework that uses Feature-wise Linear Modulation (FiLM) to inject textual priors into MedSigLIP, and combines multi-scale pooling (global, local, and texture-aware) through separate regression heads fused by a lightweight MLP, trained with a pairwise ranking loss. Result: On the LDCTIQA2023 dataset, using only 1,000 training images, the method achieves PLCC = 0.9575, SROCC = 0.9561, and KROCC = 0.8301, surpassing the best published methods. Conclusion: The prompt-guided framework incorporates clinical semantic information and improves LDCT image quality assessment in the small-sample regime, showing good application potential. Abstract: We propose a prompt-conditioned framework built on MedSigLIP that injects textual priors via Feature-wise Linear Modulation (FiLM) and multi-scale pooling. Text prompts condition patch-token features on clinical intent, enabling data-efficient learning and rapid adaptation. The architecture combines global, local, and texture-aware pooling through separate regression heads fused by a lightweight MLP, trained with pairwise ranking loss. Evaluated on LDCTIQA2023 (a public LDCT quality assessment challenge) with 1,000 training images, we achieve PLCC = 0.9575, SROCC = 0.9561, and KROCC = 0.8301, surpassing the top-ranked published challenge submissions and demonstrating the effectiveness of our prompt-guided approach.
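
For readers unfamiliar with FiLM, the operation named in this abstract is a per-channel scale and shift, y_c = gamma_c * x_c + beta_c, where gamma and beta are produced from the conditioning signal. The sketch below shows only that modulation step in plain Python; the function name `film` and the idea that gamma and beta come from a small network fed with the text-prompt embedding are assumptions for illustration, and nothing here reflects the paper's actual architecture or MedSigLIP's API.

```python
def film(features, gamma, beta):
    """Apply feature-wise linear modulation to a list of token vectors.

    features: list of tokens, each a list of C floats (e.g. patch tokens).
    gamma, beta: length-C scale and shift vectors, assumed to be predicted
    from a text-prompt embedding by a small network (not shown here).
    """
    if len(gamma) != len(beta):
        raise ValueError("gamma and beta must have the same length")
    out = []
    for token in features:
        if len(token) != len(gamma):
            raise ValueError("token width must match gamma/beta")
        # Same scale and shift for every token, applied channel by channel.
        out.append([g * x + b for x, g, b in zip(token, gamma, beta)])
    return out

tokens = [[1.0, 2.0], [3.0, 4.0]]
modulated = film(tokens, gamma=[2.0, 0.5], beta=[0.0, 1.0])
print(modulated)  # [[2.0, 2.0], [6.0, 3.0]]
```

Because the same gamma and beta are shared across all tokens, the prompt changes which channels are emphasized without altering the spatial layout of the features.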

[172] A Disease-Aware Dual-Stage Framework for Chest X-ray Report Generation

Puzhen Wu,Hexin Dong,Yi Lin,Yihao Ding,Yifan Peng

Main category: cs.CV

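TL;DR: Proposes a dual-stage disease-aware framework for chest X-ray report generation that learns disease-aware semantic tokens and fuses vision-language representations, achieving state-of-the-art performance on several benchmark datasets.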

Details

Motivation: Existing methods lack sufficient disease awareness in their visual representations and align vision and language inadequately, which makes it hard to generate clinically accurate reports. Method: Stage 1 learns Disease-Aware Semantic Tokens (DASTs) through cross-attention and multi-label classification, and aligns vision and language representations via contrastive learning; Stage 2 introduces a Disease-Visual Attention Fusion (DVAF) module and a Dual-Modal Similarity Retrieval (DMSR) mechanism to strengthen feature fusion and provide contextual guidance. Result: Experiments on CheXpert Plus, IU X-ray, and MIMIC-CXR show that the method outperforms existing approaches in both clinical accuracy and linguistic quality, reaching state-of-the-art performance. Conclusion: The dual-stage disease-aware framework improves the accuracy and interpretability of chest X-ray report generation and has strong potential for clinical use. Abstract: Radiology report generation from chest X-rays is an important task in artificial intelligence with the potential to greatly reduce radiologists' workload and shorten patient wait times. Despite recent advances, existing approaches often lack sufficient disease-awareness in visual representations and adequate vision-language alignment to meet the specialized requirements of medical image analysis. As a result, these models usually overlook critical pathological features on chest X-rays and struggle to generate clinically accurate reports. To address these limitations, we propose a novel dual-stage disease-aware framework for chest X-ray report generation. In Stage 1, our model learns Disease-Aware Semantic Tokens (DASTs) corresponding to specific pathology categories through cross-attention mechanisms and multi-label classification, while simultaneously aligning vision and language representations via contrastive learning. In Stage 2, we introduce a Disease-Visual Attention Fusion (DVAF) module to integrate disease-aware representations with visual features, along with a Dual-Modal Similarity Retrieval (DMSR) mechanism that combines visual and disease-specific similarities to retrieve relevant exemplars, providing contextual guidance during report generation. Extensive experiments on benchmark datasets (i.e., CheXpert Plus, IU X-ray, and MIMIC-CXR) demonstrate that our disease-aware framework achieves state-of-the-art performance in chest X-ray report generation, with significant improvements in clinical accuracy and linguistic quality.

[173] CrossVid: A Comprehensive Benchmark for Evaluating Cross-Video Reasoning in Multimodal Large Language Models

Jingyao Li,Jingyun Wang,Molin Tan,Haochen Wang,Cilin Yan,Likun Shi,Jiayin Cai,Xiaolong Jiang,Yao Hu

Main category: cs.CV

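TL;DR: Introduces CrossVid, the first benchmark for comprehensively evaluating the spatial-temporal reasoning of multimodal large language models (MLLMs) in cross-video contexts. It covers a hierarchy of tasks with 5,331 videos and 9,015 question-answering pairs; experiments show that current MLLMs perform poorly at cross-video reasoning, mainly because they struggle to integrate and compare information across multiple videos.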

Details

Motivation: Existing video understanding benchmarks focus on single-video analysis and do not systematically evaluate how MLLMs reason across multiple videos, while existing multi-view video benchmarks have too few tasks to cover the complex cross-video reasoning found in real scenarios. Method: Builds a new benchmark, CrossVid, spanning four high-level dimensions and ten specific tasks, with 5,331 videos and 9,015 challenging question-answering pairs in single-choice, multiple-choice, and open-ended formats; runs extensive experiments and case studies on open-source and closed-source MLLMs. Result: Gemini-2.5-Pro performs best on CrossVid with an average accuracy of 50.4%; most existing MLLMs do poorly on cross-video reasoning, mainly because they cannot integrate and compare evidence distributed across multiple videos. Conclusion: CrossVid provides a tool for evaluating and advancing the cross-video reasoning of MLLMs, exposes the limits of current models, and points to directions for improvement. Abstract: Cross-Video Reasoning (CVR) presents a significant challenge in video understanding, which requires simultaneous understanding of multiple videos to aggregate and compare information across groups of videos. Most existing video understanding benchmarks focus on single-video analysis, failing to assess the ability of multimodal large language models (MLLMs) to simultaneously reason over various videos. Recent benchmarks evaluate MLLMs' capabilities on multi-view videos that capture different perspectives of the same scene. However, their limited tasks hinder a thorough assessment of MLLMs in diverse real-world CVR scenarios. To this end, we introduce CrossVid, the first benchmark designed to comprehensively evaluate MLLMs' spatial-temporal reasoning ability in cross-video contexts. Firstly, CrossVid encompasses a wide spectrum of hierarchical tasks, comprising four high-level dimensions and ten specific tasks, thereby closely reflecting the complex and varied nature of real-world video understanding. Secondly, CrossVid provides 5,331 videos, along with 9,015 challenging question-answering pairs, spanning single-choice, multiple-choice, and open-ended question formats. Through extensive experiments on various open-source and closed-source MLLMs, we observe that Gemini-2.5-Pro performs best on CrossVid, achieving an average accuracy of 50.4%. Notably, our in-depth case study demonstrates that most current MLLMs struggle with CVR tasks, primarily due to their inability to integrate or compare evidence distributed across multiple videos for reasoning. These insights highlight the potential of CrossVid to guide future advancements in enhancing MLLMs' CVR capabilities.

[174] ZoomEarth: Active Perception for Ultra-High-Resolution Geospatial Vision-Language Tasks

Ruixun Liu,Bowen Fu,Jiayi Song,Kaiyu Li,Wanchen Li,Lanxuan Xue,Hui Qiao,Weizhan Zhang,Deyu Meng,Xiangyong Cao

Main category: cs.CV

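TL;DR: Proposes a new active perception paradigm for ultra-high-resolution remote sensing imagery, builds the large-scale benchmark dataset LRS-GRO, and introduces ZoomEarth, an adaptive cropping-zooming framework that achieves state-of-the-art performance across tasks with good generalization and extensibility.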

Details

Motivation: Existing dynamic resolution and token pruning methods are bound to a passive perception paradigm: redundancy grows as visual inputs get finer, so they cannot efficiently exploit the rich information in ultra-high-resolution remote sensing images. Method: Proposes an active perception paradigm, builds the large benchmark dataset LRS-GRO with 17 question types, and designs the ZoomEarth framework, which combines adaptive cropping-zooming with a Region-Guided reward and is trained with supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO). Result: ZoomEarth achieves state-of-the-art performance on LRS-GRO and shows strong zero-shot transfer on three public UHR remote sensing benchmarks; it also integrates with downstream tasks such as cloud removal, denoising, segmentation, and image editing, showing strong versatility and extensibility. Conclusion: The active perception paradigm gives ultra-high-resolution remote sensing processing a more efficient way to use information, and ZoomEarth, with its Region-Guided reward and adaptive zooming, improves model performance and has broad application potential. Abstract: Ultra-high-resolution (UHR) remote sensing (RS) images offer rich fine-grained information but also present challenges in effective processing. Existing dynamic resolution and token pruning methods are constrained by a passive perception paradigm, suffering from increased redundancy when obtaining finer visual inputs. In this work, we explore a new active perception paradigm that enables models to revisit information-rich regions. First, we present LRS-GRO, a large-scale benchmark dataset tailored for active perception in UHR RS processing, encompassing 17 question types across global, region, and object levels, annotated via a semi-automatic pipeline. Building on LRS-GRO, we propose ZoomEarth, an adaptive cropping-zooming framework with a novel Region-Guided reward that provides fine-grained guidance. Trained via supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), ZoomEarth achieves state-of-the-art performance on LRS-GRO and, in the zero-shot setting, on three public UHR remote sensing benchmarks. Furthermore, ZoomEarth can be seamlessly integrated with downstream models for tasks such as cloud removal, denoising, segmentation, and image editing through simple tool interfaces, demonstrating strong versatility and extensibility.

[175] TM-UNet: Token-Memory Enhanced Sequential Modeling for Efficient Medical Image Segmentation

Yaxuan Jiao,Qing Xu,Yuxiang Luo,Xiangjian He,Zhen Chen,Wenting Duan

Main category: cs.CV

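TL;DR: Proposes TM-UNet, a lightweight medical image segmentation framework that uses a multi-scale token-memory (MSTM) block for efficient global reasoning, outperforming existing methods at lower computational cost.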

Details

Motivation: Transformer-based methods achieve strong results, but their high computational cost limits clinical use, so a more efficient medical image segmentation model is needed. Method: Designs the TM-UNet framework around a multi-scale token-memory (MSTM) block, which converts 2D spatial features into token sequences, uses matrix memory cells to selectively retain and propagate contextual information, and combines exponential gating with parallel pooling operations for multi-scale context extraction. Result: Outperforms state-of-the-art methods on multiple medical image segmentation tasks while substantially reducing computational cost. Conclusion: With its token-memory mechanism, TM-UNet delivers efficient and accurate medical image segmentation and has good potential for clinical deployment. Abstract: Medical image segmentation is essential for clinical diagnosis and treatment planning. Although transformer-based methods have achieved remarkable results, their high computational cost hinders clinical deployment. To address this issue, we propose TM-UNet, a novel lightweight framework that integrates token sequence modeling with an efficient memory mechanism for efficient medical segmentation. Specifically, we introduce a multi-scale token-memory (MSTM) block that transforms 2D spatial features into token sequences through strategic spatial scanning, leveraging matrix memory cells to selectively retain and propagate discriminative contextual information across tokens. This novel token-memory mechanism acts as a dynamic knowledge store that captures long-range dependencies with linear complexity, enabling efficient global reasoning without redundant computation. Our MSTM block further incorporates exponential gating to identify token effectiveness and multi-scale contextual extraction via parallel pooling operations, enabling hierarchical representation learning without computational overhead. Extensive experiments demonstrate that TM-UNet outperforms state-of-the-art methods across diverse medical segmentation tasks with substantially reduced computation cost. The code is available at https://github.com/xq141839/TM-UNet.

[176] D$^{3}$ToM: Decider-Guided Dynamic Token Merging for Accelerating Diffusion MLLMs

Shuochen Chang,Xiaofeng Zhang,Qingyang Liu,Li Niu

Main category: cs.CV

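TL;DR: Proposes D$^{3}$ToM, a decider-guided dynamic token merging method that accelerates inference in diffusion-based multimodal large language models (Diffusion MLLMs) while preserving performance.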

Details

Motivation: Diffusion MLLMs perform well on vision-language tasks, but every denoising step runs full bidirectional self-attention over the entire sequence, which makes inference slow and computationally expensive, especially with many visual tokens. Method: Proposes D$^{3}$ToM, which uses the decider tokens generated in the previous step to build an importance map over visual tokens, dynamically keeps the important tokens, and merges the redundant ones; tokens are compressed through similarity-based aggregation, and the merge ratio changes with the denoising step; the module plugs into a single transformer layer and shortens the sequence for all subsequent layers without changing model parameters. Result: Experiments show that D$^{3}$ToM substantially accelerates inference while keeping competitive performance, and outperforms baselines such as fixed compression ratios under the same computational budget. Conclusion: D$^{3}$ToM is a plug-and-play, efficient, and flexible inference acceleration method that lowers the computational cost of Diffusion MLLMs and suits settings with long visual token sequences. Abstract: Diffusion-based multimodal large language models (Diffusion MLLMs) have recently demonstrated impressive non-autoregressive generative capabilities across vision-and-language tasks. However, Diffusion MLLMs exhibit substantially slower inference than autoregressive models: each denoising step employs full bidirectional self-attention over the entire sequence, resulting in cubic decoding complexity that becomes computationally impractical with thousands of visual tokens. To address this challenge, we propose D$^{3}$ToM, a Decider-guided dynamic token merging method that dynamically merges redundant visual tokens at different denoising steps to accelerate inference in Diffusion MLLMs. At each denoising step, D$^{3}$ToM uses decider tokens (the tokens generated in the previous denoising step) to build an importance map over all visual tokens. Then it maintains a proportion of the most salient tokens and merges the remainder through similarity-based aggregation. This plug-and-play module integrates into a single transformer layer, physically shortening the visual token sequence for all subsequent layers without altering model parameters. Moreover, D$^{3}$ToM employs a merge ratio that dynamically varies with each denoising step, aligning with the native decoding process of Diffusion MLLMs and achieving superior performance under equivalent computational budgets. Extensive experiments show that D$^{3}$ToM accelerates inference while preserving competitive performance. The code is released at https://github.com/bcmi/D3ToM-Diffusion-MLLM.
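
The keep-and-merge step described in this abstract can be sketched in a few lines. The example below is only a toy reading of "keep the most salient tokens and merge the remainder through similarity-based aggregation": it keeps the top fraction of tokens by importance and averages each dropped token into its most cosine-similar kept token. The function names, the cosine measure, and the plain averaging rule are assumptions for illustration; the released D$^{3}$ToM code may differ in all of these.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def merge_tokens(tokens, importance, keep_ratio):
    """Keep the most important tokens and fold the rest into their nearest kept token.

    tokens: list of feature vectors; importance: one score per token
    (hypothetically derived from decider-token attention); keep_ratio in (0, 1].
    """
    n_keep = max(1, math.ceil(keep_ratio * len(tokens)))
    order = sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)
    kept = sorted(order[:n_keep])  # preserve the original token order
    groups = {i: [tokens[i]] for i in kept}
    for j in order[n_keep:]:
        # Assign each dropped token to the most similar kept token.
        target = max(kept, key=lambda i: cosine(tokens[i], tokens[j]))
        groups[target].append(tokens[j])
    merged = []
    for i in kept:
        members = groups[i]
        dim = len(tokens[i])
        merged.append([sum(m[d] for m in members) / len(members) for d in range(dim)])
    return merged

tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
merged = merge_tokens(tokens, importance=[0.9, 0.2, 0.8, 0.1], keep_ratio=0.5)
print(len(merged))  # 2
```

A step-dependent schedule, as the abstract describes, would amount to passing a different `keep_ratio` at each denoising step.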

[177] One target to align them all: LiDAR, RGB and event cameras extrinsic calibration for Autonomous Driving

Andrea Bertogalli,Giacomo Boracchi,Luca Magri

Main category: cs.CV

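TL;DR: Proposes a multi-modal extrinsic calibration framework for event cameras, LiDARs, and RGB cameras, built around a 3D calibration target that all three sensors can perceive at once, enabling one-shot joint calibration.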

Details

Motivation: Existing methods usually rely on separate pairwise calibrations, which makes it hard to calibrate the extrinsics of multi-sensor systems efficiently and accurately, especially when they include event cameras. Method: Designs a 3D calibration target that combines planar features, ChArUco patterns, and active LED patterns, suited to LiDARs, RGB cameras, and event cameras respectively; uses the target so that all three sensors perceive it concurrently and builds a joint optimization pipeline for extrinsic estimation. Result: Extensive experiments on a custom-built autonomous driving sensor dataset confirm the accuracy and robustness of the method. Conclusion: The method achieves unified extrinsic calibration of event cameras, LiDARs, and RGB cameras and suits the deployment of high-precision autonomous driving perception systems. Abstract: We present a novel multi-modal extrinsic calibration framework designed to simultaneously estimate the relative poses between event cameras, LiDARs, and RGB cameras, with particular focus on the challenging event camera calibration. The core of our approach is a novel 3D calibration target, specifically designed and constructed to be concurrently perceived by all three sensing modalities. The target encodes features in planes, ChArUco, and active LED patterns, each tailored to the unique characteristics of LiDARs, RGB cameras, and event cameras respectively. This unique design enables a one-shot, joint extrinsic calibration process, in contrast to existing approaches that typically rely on separate, pairwise calibrations. Our calibration pipeline is designed to accurately calibrate complex vision systems in the context of autonomous driving, where precise multi-sensor alignment is critical. We validate our approach through an extensive experimental evaluation on a custom-built dataset, recorded with an advanced autonomous driving sensor setup, confirming the accuracy and robustness of our method.

[178] DenseAnnotate: Enabling Scalable Dense Caption Collection for Images and 3D Scenes via Spoken Descriptions

Xiaoyu Lin,Aniket Ghorpade,Hansheng Zhu,Justin Qiu,Dea Rrozhani,Monica Lama,Mick Yang,Zixuan Bian,Ruohan Ren,Alan B. Hong,Jiatao Gu,Chris Callison-Burch

Main category: cs.CV

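TL;DR: Presents DenseAnnotate, an audio-driven online annotation platform for efficiently creating dense, fine-grained multimodal annotations of images and 3D assets. Annotators link spoken phrases to regions as they speak, and speech-to-text transcription overcomes the limits of typed input in expressiveness and speed. A study with over 1,000 annotators produced a dataset of 3,531 images and 898 3D scenes with audio-aligned annotations in multiple languages. Models trained on it improve by 5% in multilingual capability, 47% in cultural alignment, and 54% in 3D spatial capability.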

Details

Motivation: Existing visual annotation data mostly comes from Internet crawling or manual typing; the annotations are sparse and limited in expressiveness and fail to capture the rich visual content of images, especially in complex areas such as multicultural imagery and 3D assets. Typed annotation is slow and under-expressive, so a more efficient, fine-grained, and scalable annotation paradigm is needed to support multimodal large models. Method: Proposes the DenseAnnotate platform, an audio-driven approach in which annotators narrate aloud while linking spoken phrases to image regions or parts of a 3D scene. The system integrates speech recognition (ASR) with region-of-attention marking to transcribe speech automatically and align it spatially. Large-scale case studies are run online in two domains (culturally diverse images and 3D scenes) to collect high-quality, multilingual dense annotations. Result: Builds a human-annotated multimodal dataset of 3,531 images, 898 3D scenes, and 7,460 3D objects, with audio-aligned dense annotations in 20 languages, including 8,746 image captions, 2,000 scene captions, and 19,000 object captions. Models trained on this data improve by 5% in multilingual performance, 47% in cultural alignment, and 54% in 3D spatial understanding. Conclusion: DenseAnnotate offers a feasible and efficient way to create high-quality dense multimodal annotations and addresses the limits of typed annotation. The platform applies to varied tasks and data types and may support future vision-language research on cultural diversity and 3D understanding. Abstract: With the rapid adoption of multimodal large language models (MLLMs) across diverse applications, there is a pressing need for task-centered, high-quality training data. A key limitation of current training datasets is their reliance on sparse annotations mined from the Internet or entered via manual typing that capture only a fraction of an image's visual content. Dense annotations are more valuable but remain scarce. Traditional text-based annotation pipelines are poorly suited for creating dense annotations: typing limits expressiveness, slows annotation speed, and underrepresents nuanced visual features, especially in specialized areas such as multicultural imagery and 3D asset annotation. In this paper, we present DenseAnnotate, an audio-driven online annotation platform that enables efficient creation of dense, fine-grained annotations for images and 3D assets. Annotators narrate observations aloud while synchronously linking spoken phrases to image regions or 3D scene parts. Our platform incorporates speech-to-text transcription and region-of-attention marking. To demonstrate the effectiveness of DenseAnnotate, we conducted case studies involving over 1,000 annotators across two domains: culturally diverse images and 3D scenes. We curate a human-annotated multi-modal dataset of 3,531 images, 898 3D scenes, and 7,460 3D objects, with audio-aligned dense annotations in 20 languages, including 8,746 image captions, 2,000 scene captions, and 19,000 object captions. Models trained on this dataset exhibit improvements of 5% in multilingual, 47% in cultural alignment, and 54% in 3D spatial capabilities. Our results show that our platform offers a feasible approach for future vision-language research and can be applied to various tasks and diverse types of data.

[179] Rethinking Bias in Generative Data Augmentation for Medical AI: a Frequency Recalibration Method

Chi Liu,Jincheng Liu,Congcong Zhu,Minghao Wang,Sheng Shen,Jia Gu,Tianqing Zhu,Wanlei Zhou

Main category: cs.CV

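TL;DR: Proposes Frequency Recalibration (FreRec), a method that reduces the frequency distribution gap between real and synthesized medical images in generative data augmentation, improving downstream task performance.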

Details

Motivation: In medical applications, generative data augmentation often hurts downstream tasks because of biases in AI-generated images, and frequency misalignment is one key factor. Method: Proposes FreRec, which consists of Statistical High-frequency Replacement (SHR) and Reconstructive High-frequency Mapping (RHM), to align high-frequency components and improve image quality. Result: Validated on several medical image datasets, FreRec significantly improves classification performance. Conclusion: FreRec is a plug-and-play post-processing method that is compatible with any generative model and improves generative data augmentation for medical images. Abstract: Developing Medical AI relies on large datasets and easily suffers from data scarcity. Generative data augmentation (GDA) using AI generative models offers a solution to synthesize realistic medical images. However, the bias in GDA is often underestimated in medical domains, with concerns about the risk of introducing detrimental features generated by AI and harming downstream tasks. This paper identifies the frequency misalignment between real and synthesized images as one of the key factors underlying unreliable GDA and proposes the Frequency Recalibration (FreRec) method to reduce the frequency distributional discrepancy and thus improve GDA. FreRec involves (1) Statistical High-frequency Replacement (SHR) to roughly align high-frequency components and (2) Reconstructive High-frequency Mapping (RHM) to enhance image quality and reconstruct high-frequency details. Extensive experiments were conducted in various medical datasets, including brain MRIs, chest X-rays, and fundus images. The results show that FreRec significantly improves downstream medical image classification performance compared to uncalibrated AI-synthesized samples. FreRec is a standalone post-processing step that is compatible with any generative model and can integrate seamlessly with common medical GDA pipelines.

[180] Co-Layout: LLM-driven Co-optimization for Interior Layout

Chucheng Xiang,Ruchao Bao,Biyin Feng,Wenzheng Wu,Zhongyuan Liu,Yirui Guan,Ligang Liu

Main category: cs.CV

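TL;DR: Proposes an automated interior design framework that combines large language models (LLMs) with grid-based integer programming to jointly optimize room layout and furniture placement, significantly outperforming conventional two-stage methods.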

Details

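Motivation: Conventional automated interior design usually optimizes in separate steps and struggles to balance global optimality with user needs; this work uses an LLM to interpret natural-language instructions and optimizes the layout jointly to improve design quality and efficiency. Method: Uses an LLM to extract structured design constraints from a text prompt and encodes them in a unified grid-based representation inspired by "Modulor"; applies a coarse-to-fine integer programming strategy that solves a simplified problem on a low-resolution grid and then uses it to guide the high-resolution optimization. Result: Experiments show that the method significantly improves layout quality over existing two-stage pipelines across diverse scenarios, and that the coarse-to-fine strategy improves computational efficiency. Conclusion: Joint optimization with an LLM and integer programming enables high-quality, efficient automated interior design and can handle complex spatial constraints and user preferences. Abstract: We present a novel framework for automated interior design that combines large language models (LLMs) with grid-based integer programming to jointly optimize room layout and furniture placement. Given a textual prompt, the LLM-driven agent workflow extracts structured design constraints related to room configurations and furniture arrangements. These constraints are encoded into a unified grid-based representation inspired by "Modulor". Our formulation accounts for key design requirements, including corridor connectivity, room accessibility, spatial exclusivity, and user-specified preferences. To improve computational efficiency, we adopt a coarse-to-fine optimization strategy that begins with a low-resolution grid to solve a simplified problem and guides the solution at the full resolution. Experimental results across diverse scenarios demonstrate that our joint optimization approach significantly outperforms existing two-stage design pipelines in solution quality, and achieves notable computational efficiency through the coarse-to-fine strategy.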

[181] LiDAR-GS++: Improving LiDAR Gaussian Reconstruction via Diffusion Priors

Qifeng Chen,Jiarun Liu,Rengan Xie,Tao Tang,Sicong Du,Yiru Zhao,Yuchi Huo,Sheng Yang

Main category: cs.CV

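TL;DR: Proposes LiDAR-GS++, a LiDAR Gaussian Splatting reconstruction method enhanced by diffusion priors for real-time, high-fidelity re-simulation on public urban roads, achieving state-of-the-art performance for both interpolated and extrapolated viewpoints.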

Details

Motivation: Existing Gaussian Splatting (GS) methods produce artifacts in extrapolated novel view synthesis when reconstruction from a single traversal scan is incomplete, which limits their use in complex scenes. Method: Introduces a controllable LiDAR generation model based on diffusion priors that is conditioned on coarsely extrapolated renderings to produce extra geometry-consistent scans, and uses an effective distillation mechanism for expansive reconstruction, improving geometric consistency in under-fitted regions. Result: Validated on multiple public datasets, LiDAR-GS++ outperforms existing GS and NeRF methods for both extrapolated and interpolated viewpoints, giving higher-quality reconstruction and re-simulation. Conclusion: By introducing diffusion priors, LiDAR-GS++ improves the geometric consistency and visual quality of GS methods under extrapolation and offers a more reliable real-time reconstruction option for applications such as autonomous driving. Abstract: Recent GS-based rendering has made significant progress for LiDAR, surpassing Neural Radiance Fields (NeRF) in both quality and speed. However, these methods exhibit artifacts in extrapolated novel view synthesis due to the incomplete reconstruction from single traversal scans. To address this limitation, we present LiDAR-GS++, a LiDAR Gaussian Splatting reconstruction method enhanced by diffusion priors for real-time and high-fidelity re-simulation on public urban roads. Specifically, we introduce a controllable LiDAR generation model conditioned on coarsely extrapolated rendering to produce extra geometry-consistent scans and employ an effective distillation mechanism for expansive reconstruction. By extending reconstruction to under-fitted regions, our approach ensures global geometric consistency for extrapolative novel views while preserving detailed scene surfaces captured by sensors. Experiments on multiple public datasets demonstrate that LiDAR-GS++ achieves state-of-the-art performance for both interpolated and extrapolated viewpoints, surpassing existing GS and NeRF-based methods.

[182] Learning Time in Static Classifiers

Xi Ding,Lei Wang,Piotr Koniusz,Yongsheng Gao

Main category: cs.CV

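TL;DR: Proposes a temporal reasoning framework that needs no architectural changes: a Support-Exemplar-Query (SEQ) learning paradigm and a soft-DTW loss give feedforward classifiers a temporal inductive bias without recurrent modules, improving image classification and video anomaly detection.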

Details

Motivation: Real-world visual data is temporally continuous, but conventional classifiers assume independent samples and struggle to capture temporal dynamics. Method: Designs the SEQ learning paradigm, which structures training data into temporally coherent trajectories, uses class-specific temporal prototypes and a differentiable soft-DTW loss to align prediction sequences, and adds a multi-term objective that promotes semantic consistency and temporal smoothness. Result: Performance improves on fine-grained and ultra-fine-grained image classification and on video anomaly detection, unifying the modeling of static and temporal tasks. Conclusion: Loss design alone is enough to give standard classifiers effective temporal modeling ability; the method is simple, modular, and data-efficient. Abstract: Real-world visual data rarely presents as isolated, static instances. Instead, it often evolves gradually over time through variations in pose, lighting, object state, or scene context. However, conventional classifiers are typically trained under the assumption of temporal independence, limiting their ability to capture such dynamics. We propose a simple yet effective framework that equips standard feedforward classifiers with temporal reasoning, all without modifying model architectures or introducing recurrent modules. At the heart of our approach is a novel Support-Exemplar-Query (SEQ) learning paradigm, which structures training data into temporally coherent trajectories. These trajectories enable the model to learn class-specific temporal prototypes and align prediction sequences via a differentiable soft-DTW loss. A multi-term objective further promotes semantic consistency and temporal smoothness. By interpreting input sequences as evolving feature trajectories, our method introduces a strong temporal inductive bias through loss design alone. This proves highly effective in both static and temporal tasks: it enhances performance on fine-grained and ultra-fine-grained image classification, and delivers precise, temporally consistent predictions in video anomaly detection. Despite its simplicity, our approach bridges static and temporal learning in a modular and data-efficient manner, requiring only a simple classifier on top of pre-extracted features.
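
The soft-DTW loss mentioned in the abstract can be illustrated with a minimal sketch. This is the standard soft-DTW recursion on two 1-D sequences with a squared-error cost, not the authors' training code; the function names, the scalar sequences, and the default `gamma` are assumptions made for illustration.

```python
import math


def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    m = min(values)
    if math.isinf(m):
        return m
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in values))


def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW discrepancy between two 1-D sequences (squared-error cost)."""
    n, m = len(x), len(y)
    inf = float("inf")
    # r[i][j] holds the soft alignment cost of x[:i] against y[:j].
    r = [[inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            r[i][j] = cost + soft_min(
                [r[i - 1][j], r[i][j - 1], r[i - 1][j - 1]], gamma
            )
    return r[n][m]
```

Because the hard minimum of classic DTW is replaced by a smooth one, the value is differentiable with respect to the inputs, which is what lets it serve as a training loss. As `gamma` shrinks, the result approaches the ordinary DTW cost.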

[183] SpaceVLM: Sub-Space Modeling of Negation in Vision-Language Models

Sepehr Kazemi Ranjbar,Kumail Alhamoud,Marzyeh Ghassemi

Main category: cs.CV

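TL;DR: Proposes a training-free framework that models negation as a subspace of the joint embedding space to improve how vision-language models handle negation, gaining about 30% on average across several tasks while preserving zero-shot performance.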

Details

Motivation: Existing methods handle negation through fine-tuning, which harms zero-shot performance on affirmative prompts, so a method is needed that understands negation without fine-tuning. Method: Building on the observation that the embedding space of VLMs such as CLIP can be divided into semantically consistent subspaces, the negation "A but not N" is modeled with two spherical caps around the embeddings of A and N, and images are scored by the central direction of the region that is close to A and far from N. Result: Across retrieval, multiple-choice, and text-to-image tasks, the method improves on prior methods by about 30% on average and substantially narrows the performance gap between affirmative and negated prompts. Conclusion: This training-free method improves VLM understanding of negation while retaining the model's original zero-shot ability, and it generalizes well. Abstract: Vision-Language Models (VLMs) struggle with negation. Given a prompt like "retrieve (or generate) a street scene without pedestrians," they often fail to respect the "not." Existing methods address this limitation by fine-tuning on large negation datasets, but such retraining often compromises the model's zero-shot performance on affirmative prompts. We show that the embedding space of VLMs, such as CLIP, can be divided into semantically consistent subspaces. Based on this property, we propose a training-free framework that models negation as a subspace in the joint embedding space rather than a single point (Figure 1). To find the matching image for a caption such as "A but not N," we construct two spherical caps around the embeddings of A and N, and we score images by the central direction of the region that is close to A and far from N. Across retrieval, MCQ, and text-to-image tasks, our method improves negation understanding by about 30% on average over prior methods. It closes the gap between affirmative and negated prompts while preserving the zero-shot performance that fine-tuned models fail to maintain. Code will be released upon publication.

[184] Ground Plane Projection for Improved Traffic Analytics at Intersections

Sajjad Pakdamansavoji,Kumar Vaibhav Jha,Baher Abdulhai,James H Elder

Main category: cs.CV

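TL;DR: Shows that back-projecting vehicles detected by infrastructure cameras onto the ground plane, rather than analyzing them in the image plane, improves the accuracy of turning movement counts, and that fusing data from multiple cameras improves it further.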

Details

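Motivation: To improve the accuracy of turning movement counts at intersections in support of signal control, traffic management, and urban planning, the work explores the advantages of analyzing traffic flow in real-world 3D coordinates. Method: Vehicles are detected in one or more infrastructure cameras and back-projected from the image plane to the ground plane; trajectory classification and turn counting are done in 3D coordinates, and weak fusion across multiple cameras is used to raise accuracy. Result: For single-camera systems, back-projection significantly improves trajectory classification and turning movement counts; weak fusion of multiple cameras improves overall accuracy further. Conclusion: Traffic flow should be analyzed on the ground plane rather than the image plane to obtain higher accuracy. Abstract: Accurate turning movement counts at intersections are important for signal control, traffic management and urban planning. Computer vision systems for automatic turning movement counts typically rely on visual analysis in the image plane of an infrastructure camera. Here we explore potential advantages of back-projecting vehicles detected in one or more infrastructure cameras to the ground plane for analysis in real-world 3D coordinates. For single-camera systems we find that back-projection yields more accurate trajectory classification and turning movement counts. We further show that even higher accuracy can be achieved through weak fusion of back-projected detections from multiple cameras. These results suggest that traffic should be analyzed on the ground plane, not the image plane.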
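
As a rough illustration of back-projection to the ground plane, the sketch below maps image points through a 3x3 image-to-ground homography. It assumes a flat road surface and an already calibrated homography; the matrix `H`, the pixel track, and the function name are hypothetical values chosen for the example, and the paper's actual projection and fusion pipeline may differ.

```python
def to_ground_plane(h, u, v):
    """Map an image point (u, v) to ground coordinates with a 3x3 homography h."""
    x = h[0][0] * u + h[0][1] * v + h[0][2]
    y = h[1][0] * u + h[1][1] * v + h[1][2]
    w = h[2][0] * u + h[2][1] * v + h[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity (it lies on the horizon line)")
    return x / w, y / w


# Hypothetical image-to-ground homography (not from the paper).
H = [
    [0.05, 0.0, -16.0],
    [0.0, 0.12, -20.0],
    [0.0, 0.002, 1.0],
]

# A vehicle track in pixel coordinates, e.g. bottom-centre of each detection box.
track_px = [(320, 400), (330, 380), (350, 360)]
track_ground = [to_ground_plane(H, u, v) for u, v in track_px]
```

Once a track is expressed in ground coordinates, distances and headings are metric, so a turn can be classified from the change in heading without the perspective distortion present in the image plane.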

[185] CLAReSNet: When Convolution Meets Latent Attention for Hyperspectral Image Classification

Asmit Bandyopadhyay,Anindita Das Bhattacharjee,Rakesh Das

Main category: cs.CV

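TL;DR: Proposes CLAReSNet, a hybrid architecture that combines convolution with transformer-style attention; an adaptive latent bottleneck and Multi-Scale Spectral Latent Attention (MSLA) reduce computational complexity, and the model achieves near-perfect classification accuracy on the Indian Pines and Salinas datasets.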

Details

Motivation: Hyperspectral image classification faces high spectral dimensionality, complex spectral-spatial correlations, and few training samples with class imbalance; CNNs and Transformers used in isolation suffer from quadratic complexity and insufficient inductive biases. Method: CLAReSNet combines a multi-scale convolutional stem, residual blocks, and attention modules to extract spatial features; its spectral encoder fuses bidirectional RNNs with Multi-Scale Spectral Latent Attention (MSLA), which lowers computational complexity through adaptive latent token allocation, and hierarchical cross-attention fuses multi-level features. Result: Overall accuracies of 99.71% and 99.96% on Indian Pines and Salinas, significantly ahead of HybridSN, SSRN, and SpectralFormer; the learned embeddings show stronger inter-class separability and intra-class compactness. Conclusion: CLAReSNet addresses the key challenges of hyperspectral image classification and performs well even with limited samples and severe class imbalance. Abstract: Hyperspectral image (HSI) classification faces critical challenges, including high spectral dimensionality, complex spectral-spatial correlations, and limited training samples with severe class imbalance. While CNNs excel at local feature extraction and transformers capture long-range dependencies, their isolated application yields suboptimal results due to quadratic complexity and insufficient inductive biases. We propose CLAReSNet (Convolutional Latent Attention Residual Spectral Network), a hybrid architecture that integrates multi-scale convolutional extraction with transformer-style attention via an adaptive latent bottleneck. The model employs a multi-scale convolutional stem with deep residual blocks and an enhanced Convolutional Block Attention Module for hierarchical spatial features, followed by spectral encoder layers combining bidirectional RNNs (LSTM/GRU) with Multi-Scale Spectral Latent Attention (MSLA). MSLA reduces complexity from $\mathcal{O}(T^2D)$ to $\mathcal{O}(T\log(T)D)$ by adaptive latent token allocation (8-64 tokens) that scales logarithmically with the sequence length. Hierarchical cross-attention fusion dynamically aggregates multi-level representations for robust classification. Experiments conducted on the Indian Pines and Salinas datasets show state-of-the-art performance, achieving overall accuracies of 99.71% and 99.96%, significantly surpassing HybridSN, SSRN, and SpectralFormer. The learned embeddings exhibit superior inter-class separability and compact intra-class clustering, validating CLAReSNet's effectiveness under limited samples and severe class imbalance.
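
To make the complexity claim concrete, here is a small sketch of how a latent token budget that grows logarithmically with sequence length changes the attention cost. The abstract only states the 8-64 token range and the logarithmic scaling; the `scale` factor, the use of `log2`, and both function names are assumptions for illustration, not the authors' allocation rule.

```python
import math


def num_latent_tokens(seq_len, scale=8.0, lo=8, hi=64):
    """Latent token count growing with log2(seq_len), clamped to [lo, hi]."""
    raw = math.ceil(scale * math.log2(max(seq_len, 2)))
    return max(lo, min(hi, raw))


def attention_cost(seq_len, dim, latents=None):
    """Multiply-adds for the attention score matrix.

    Full self-attention compares T tokens with T tokens (T * T * D);
    attending through L latent tokens costs T * L * D instead.
    """
    keys = seq_len if latents is None else latents
    return seq_len * keys * dim


T, D = 200, 64  # e.g. 200 spectral bands with a 64-dimensional embedding
full = attention_cost(T, D)
latent = attention_cost(T, D, num_latent_tokens(T))
```

With L proportional to log T, the T * L * D term is the O(T log(T) D) cost quoted in the abstract, compared with O(T^2 D) for full self-attention.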

[186] Explainable AI-Generated Image Detection RewardBench

Michael Yang,Shijian Deng,William T. Doan,Kai Wang,Tianyu Yang,Harsh Singh,Yapeng Tian

Main category: cs.CV

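TL;DR: Introduces XAIGID-RewardBench, the first benchmark for evaluating how well multimodal large language models (MLLMs) judge the quality of explanations for AI-generated image detection; it contains about 3,000 annotated triplets, and the best current MLLM scores 88.76%, still short of the 98.30% human inter-annotator agreement.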

Details

Motivation: Conventional AI-generated image detection methods cannot give human-understandable explanations, which lowers their trustworthiness; using MLLMs to generate explanations has become popular, but their ability to act as judges of explanation quality has not been well studied. Method: Builds the XAIGID-RewardBench benchmark, containing about 3,000 annotated triplets drawn from a range of image generation models and MLLMs, to systematically evaluate MLLMs as reward models (judges). Result: The best current MLLM scores 88.76% on the benchmark, while human inter-annotator agreement reaches 98.30%, showing that MLLMs still lag clearly behind humans at judging explanations; common failure patterns are also identified. Conclusion: XAIGID-RewardBench exposes the limits of current MLLMs in judging explanations for AI-generated image detection and underlines the need to improve their reasoning and judging abilities. Abstract: Conventional, classification-based AI-generated image detection methods cannot explain why an image is considered real or AI-generated in a way a human expert would, which reduces the trustworthiness and persuasiveness of these detection tools for real-world applications. Leveraging Multimodal Large Language Models (MLLMs) has recently become a trending solution to this issue. Further, to evaluate the quality of generated explanations, a common approach is to adopt an "MLLM as a judge" methodology to evaluate explanations generated by other MLLMs. However, how well those MLLMs perform when judging explanations for AI-generated image detection generated by themselves or other MLLMs has not been well studied. We therefore propose XAIGID-RewardBench, the first benchmark designed to evaluate the ability of current MLLMs to judge the quality of explanations about whether an image is real or AI-generated. The benchmark consists of approximately 3,000 annotated triplets sourced from various image generation models and MLLMs as policy models (detectors) to assess the capabilities of current MLLMs as reward models (judges). Our results show that the current best reward model scored 88.76% on this benchmark (while human inter-annotator agreement reaches 98.30%), demonstrating that a visible gap remains between the reasoning abilities of today's MLLMs and human-level performance. In addition, we provide an analysis of common pitfalls that these models frequently encounter. Code and benchmark are available at https://github.com/RewardBench/XAIGID-RewardBench.

[187] Constructing and Interpreting Digital Twin Representations for Visual Reasoning via Reinforcement Learning

Yiqing Shen,Mathias Unberath

Main category: cs.CV

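TL;DR: Proposes DT-R1, a reinforcement learning framework that handles diverse visual reasoning tasks in a unified way by constructing digital twin representations of visual inputs, outperforming existing task-specific models on six benchmarks.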

Details

Motivation: Existing visual reasoning methods depend on task-specific architectures and supervised fine-tuning, so they lack unification and generalize poorly across tasks and modalities. Method: Uses the GRPO reinforcement learning algorithm to train large language models to construct digital twin representations of multi-modal visual inputs and to reason over these high-level representations; a new reward validates both structural integrity and output accuracy. Result: On six visual reasoning benchmarks covering two modalities and four task types, DT-R1 consistently outperforms state-of-the-art task-specific models. Conclusion: DT-R1 points to a new direction in which unified visual reasoning is achieved through reinforcement learning with digital twin representations, moving visual reasoning from specialized models toward a general paradigm. Abstract: Visual reasoning may require models to interpret images and videos and respond to implicit text queries across diverse output formats, from pixel-level segmentation masks to natural language descriptions. Existing approaches rely on supervised fine-tuning with task-specific architectures. For example, reasoning segmentation, grounding, summarization, and visual question answering each demand distinct model designs and training, preventing unified solutions and limiting cross-task and cross-modality generalization. Hence, we propose DT-R1, a reinforcement learning framework that trains large language models to construct digital twin representations of complex multi-modal visual inputs and then reason over these high-level representations as a unified approach to visual reasoning. Specifically, we train DT-R1 using GRPO with a novel reward that validates both structural integrity and output accuracy. Evaluations on six visual reasoning benchmarks, covering two modalities and four task types, demonstrate that DT-R1 consistently achieves improvements over state-of-the-art task-specific models. DT-R1 opens a new direction where visual reasoning emerges from reinforcement learning with digital twin representations.

[188] Fast Reasoning Segmentation for Images and Videos

Yiqing Shen,Mathias Unberath

Main category: cs.CV

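TL;DR: FastReasonSeg is an efficient distillation method for reasoning segmentation built on digital twin representations; by decoupling perception from reasoning and combining supervised fine-tuning with reinforcement fine-tuning, it keeps accuracy high while sharply cutting compute, enabling real-time reasoning segmentation on resource-constrained devices.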

Details

Motivation: Existing reasoning segmentation methods depend on large multimodal language models that are hard to deploy on edge devices, and conventional distillation fails to transfer multi-step reasoning ability, so a new method is needed that preserves reasoning chains. Method: FastReasonSeg uses digital twin representations to decouple perception from reasoning and distills in two stages: supervised fine-tuning on teacher-generated reasoning chains, followed by reinforcement fine-tuning with a joint reward on segmentation accuracy and reasoning quality. Result: State-of-the-art performance on four benchmarks (JiTBench, RVTBench, ReasonSeg, LLM-Seg40K); the distilled 0.6B-parameter model outperforms models with 20 times more parameters, running at 7.79 FPS with only 2.1GB of memory. Conclusion: FastReasonSeg delivers efficient, lightweight reasoning segmentation that can be deployed in real time in resource-constrained environments, supporting embodied agents operating autonomously in real-world scenes. Abstract: Reasoning segmentation enables open-set object segmentation via implicit text queries, thereby serving as a foundation for embodied agents that should operate autonomously in real-world environments. However, existing methods for reasoning segmentation require multimodal large language models with billions of parameters that exceed the computational capabilities of the edge devices that typically deploy embodied AI systems. Distillation offers a pathway to compress these models while preserving their capabilities. Yet, existing distillation approaches fail to transfer the multi-step reasoning capabilities that reasoning segmentation demands, as they focus on matching output predictions and intermediate features rather than preserving reasoning chains. The emerging paradigm of reasoning over digital twin representations presents an opportunity for more effective distillation by re-framing the problem. Consequently, we propose FastReasonSeg, which employs digital twin representations that decouple perception from reasoning to enable more effective distillation. Our distillation scheme first relies on supervised fine-tuning on teacher-generated reasoning chains. It then applies reinforcement fine-tuning with joint rewards evaluating both segmentation accuracy and reasoning quality alignment. Experiments on two video (JiTBench, RVTBench) and two image benchmarks (ReasonSeg, LLM-Seg40K) demonstrate that our FastReasonSeg achieves state-of-the-art reasoning segmentation performance. Moreover, the distilled 0.6B variant outperforms models with 20 times more parameters while achieving 7.79 FPS throughput with only 2.1GB memory consumption. This efficiency allows deployment in resource-constrained environments for real-time reasoning segmentation.

[189] Changes in Real Time: Online Scene Change Detection with Multi-View Fusion

Chamuditha Jayanga Galappaththige,Jason Lai,Lloyd Windrim,Donald Dansereau,Niko Sünderhauf,Dimity Miller

Main category: cs.CV

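TL;DR: Proposes a new online scene change detection method, the first that is pose-agnostic, label-free, and multi-view consistent while running efficiently; it outperforms existing offline methods.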

Details

Motivation: Existing online scene change detection methods are far less accurate than offline methods and struggle with unconstrained viewpoints and real-time requirements. Method: Introduces a self-supervised fusion loss, PnP-based fast pose estimation, and a fast change-guided update strategy for the 3D Gaussian Splatting scene representation. Result: Experiments on complex real-world datasets show that the method runs at over 10 FPS and reaches state-of-the-art performance against both online and offline baselines. Conclusion: The proposed method is the first online SCD approach to combine efficiency, multi-view consistency, and high accuracy, and it outperforms existing offline methods. Abstract: Online Scene Change Detection (SCD) is an extremely challenging problem that requires an agent to detect relevant changes on the fly while observing the scene from unconstrained viewpoints. Existing online SCD methods are significantly less accurate than offline approaches. We present the first online SCD approach that is pose-agnostic, label-free, and ensures multi-view consistency, while operating at over 10 FPS and achieving new state-of-the-art performance, surpassing even the best offline approaches. Our method introduces a new self-supervised fusion loss to infer scene changes from multiple cues and observations, PnP-based fast pose estimation against the reference scene, and a fast change-guided update strategy for the 3D Gaussian Splatting scene representation. Extensive experiments on complex real-world datasets demonstrate that our approach outperforms both online and offline baselines.

[190] Reasoning Text-to-Video Retrieval via Digital Twin Video Representations and Large Language Models

Yiqing Shen,Chenxiao Fan,Chenjia Li,Mathias Unberath

Main category: cs.CV

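TL;DR: Proposes a reasoning framework for implicit text-to-video retrieval that represents video content as digital twins and reasons over them with large language models, clearly outperforming existing methods on newly built benchmarks.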

Details

Motivation: Existing text-to-video retrieval methods struggle with implicit queries that require reasoning, and they lack fine-grained grounding of query entities and logical reasoning ability. Method: A two-stage framework first decomposes the query and compositionally aligns the sub-queries with the video's digital twin (a structured scene representation); it then applies large language model reasoning with just-in-time refinement, invoking specialist vision models to fill information gaps. Result: On the new ReasonT2VBench-135 and ReasonT2VBench-1000 benchmarks the method reaches 81.2% and 81.7% R@1 respectively, more than 50 percentage points above the strongest baseline, and it achieves SOTA on three conventional benchmarks. Conclusion: Combining structured video representations with large language model reasoning handles implicit text queries that need complex reasoning, moving text-to-video retrieval toward higher-level cognitive tasks. Abstract: The goal of text-to-video retrieval is to search large databases for relevant videos based on text queries. Existing methods have progressed to handling explicit queries where the visual content of interest is described explicitly; however, they fail with implicit queries where identifying videos relevant to the query requires reasoning. We introduce reasoning text-to-video retrieval, a paradigm that extends traditional retrieval to process implicit queries through reasoning while providing object-level grounding masks that identify which entities satisfy the query conditions. Instead of relying on vision-language models directly, we propose representing video content as digital twins, i.e., structured scene representations that decompose salient objects through specialist vision models. This approach is beneficial because it enables large language models to reason directly over long-horizon video content without visual token compression. Specifically, our two-stage framework first performs compositional alignment between decomposed sub-queries and digital twin representations for candidate identification, then applies large language model-based reasoning with just-in-time refinement that invokes additional specialist models to address information gaps. We construct a benchmark of 447 manually created implicit queries with 135 videos (ReasonT2VBench-135) and another more challenging version of 1000 videos (ReasonT2VBench-1000). Our method achieves 81.2% R@1 on ReasonT2VBench-135, outperforming the strongest baseline by more than 50 percentage points, and maintains 81.7% R@1 on the extended configuration while establishing state-of-the-art results in three conventional benchmarks (MSR-VTT, MSVD, and VATEX).

[191] AGGRNet: Selective Feature Extraction and Aggregation for Enhanced Medical Image Classification

Ansh Makwe,Akansh Agrawal,Prateek Jain,Akshan Agrawal,Priyanka Bagade

Main category: cs.CV

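TL;DR: Proposes the AGGRNet framework, which extracts both informative and non-informative features to improve performance on complex medical image classification tasks.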

Details

Motivation: Existing attention-based models perform poorly at separating subtle classes because they struggle to capture inter-class similarity and intra-class variability, which leads to diagnostic errors. Method: Designs the AGGRNet framework, which extracts informative and non-informative features to better understand fine-grained visual patterns. Result: State-of-the-art results on multiple medical imaging datasets, with an improvement of up to 5% on the Kvasir dataset. Conclusion: AGGRNet improves classification accuracy for complex medical image analysis tasks. Abstract: Medical image analysis for complex tasks such as severity grading and disease subtype classification poses significant challenges due to intricate and similar visual patterns among classes, scarcity of labeled data, and variability in expert interpretations. Despite the usefulness of existing attention-based models in capturing complex visual patterns for medical image classification, underlying architectures often face challenges in effectively distinguishing subtle classes since they struggle to capture inter-class similarity and intra-class variability, resulting in incorrect diagnoses. To address this, we propose the AGGRNet framework to extract informative and non-informative features to effectively understand fine-grained visual patterns and improve classification for complex medical image analysis tasks. Experimental results show that our model achieves state-of-the-art performance on various medical imaging datasets, with the best improvement up to 5% over SOTA models on the Kvasir dataset.

[192] Leveraging Quantum-Based Architectures for Robust Diagnostics

Shabnam Sodagari,Tommy Long

Main category: cs.CV

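TL;DR: Proposes a hybrid quantum-classical framework that combines a pretrained ResNet50 with a Quantum Convolutional Neural Network (QCNN) to classify kidney stones, cysts, and tumors from CT images. With classical pre-processing and feature extraction feeding the quantum circuit, the 12-qubit configuration achieves near-perfect recall and F1 scores, and test accuracy reaches 99%.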

Details

Motivation: To explore the potential of quantum computing in medical image diagnosis, in particular for improving diagnostic accuracy and stability in kidney disease classification. Method: A pretrained ResNet50 encoder extracts deep features, which are mapped to quantum states by angle encoding and processed by a QCNN; images are pre-processed with CLAHE and denoising, and class imbalance is handled through data augmentation and weighted sampling. Result: Both the 8-qubit and 12-qubit configurations converge quickly and train stably; test accuracy reaches 0.99, and the 12-qubit model does better on cyst and tumor detection, with a cyst recall of 1.0 and a tumor F1-score of 0.9956; the confusion matrix shows very few misclassifications. Conclusion: Combining classical image processing with quantum neural networks can improve medical diagnostic performance, suggesting that hybrid quantum-classical models are feasible and promising for clinical decision support. Abstract: The objective of this study is to diagnose and differentiate kidney stones, cysts, and tumors using Computed Tomography (CT) images of the kidney. This study leverages a hybrid quantum-classical framework in this regard. We combine a pretrained ResNet50 encoder with a Quantum Convolutional Neural Network (QCNN) to explore quantum-assisted diagnosis. We pre-process the kidney images using denoising and contrast limited adaptive histogram equalization to enhance feature extraction. We address class imbalance through data augmentation and weighted sampling. Latent features extracted by the encoder are transformed into qubits via angle encoding and processed by a QCNN. The model is evaluated on both 8-qubit and 12-qubit configurations. Both architectures achieved rapid convergence with stable learning curves and high consistency between training and validation performance. The models reached a test accuracy of 0.99, with the 12-qubit configuration providing improvements in overall recall and precision, particularly for Cyst and Tumor detection, where it achieved perfect recall for Cysts and a tumor F1-score of 0.9956. Confusion matrix analysis further confirmed reliable classification behavior across all classes, with very few misclassifications. Results demonstrate that integrating classical pre-processing and deep feature extraction with quantum circuits enhances medical diagnostic performance.
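
Angle encoding, as named in the abstract, can be sketched without a quantum SDK by writing out the single-qubit state directly. The sketch assumes one feature per qubit, an RY rotation, and a linear rescaling of each feature to [0, pi]; the abstract does not specify the rotation axis or the scaling, so these choices and the function name are illustrative assumptions.

```python
import math


def angle_encode(features, lo, hi):
    """Encode each feature as the single-qubit state RY(theta)|0>.

    RY(theta)|0> has amplitudes [cos(theta / 2), sin(theta / 2)].
    Features are clipped to [lo, hi] and rescaled so theta lies in [0, pi].
    """
    states = []
    for f in features:
        clipped = min(max(f, lo), hi)
        theta = math.pi * (clipped - lo) / (hi - lo)
        states.append((math.cos(theta / 2), math.sin(theta / 2)))
    return states


# Eight latent features would occupy eight qubits in the 8-qubit configuration.
latent = [0.0, 0.25, 0.5, 0.75, 1.0, 0.1, 0.9, 0.4]
qubit_states = angle_encode(latent, lo=0.0, hi=1.0)
```

Each feature sets the rotation angle of one qubit, so the number of qubits bounds how many latent features the circuit can take in, which is one reason the encoder output has to be reduced to 8 or 12 values before the QCNN.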

[193] Calibrated Decomposition of Aleatoric and Epistemic Uncertainty in Deep Features for Inference-Time Adaptation

Divake Kumar,Patrick Poggi,Sina Tayebati,Devashri Naik,Nilesh Ahuja,Amit Ranjan Trivedi

Main category: cs.CV

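TL;DR: Proposes Uncertainty-Guided Inference-Time Selection, a lightweight inference-time framework that disentangles data-driven and model-driven uncertainty without sampling or ensembling, yielding much better calibrated prediction intervals and about 60% compute savings.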

Details

Motivation: Most estimators merge all uncertainty modes into one confidence score, which makes it hard to decide reliably when to allocate more compute or adjust inference. Method: Aleatoric and epistemic uncertainty are disentangled directly in deep feature space: the former is estimated with a regularized global density model, and the latter is built from three complementary components (local support deficiency, manifold spectral collapse, and cross-layer feature inconsistency) that are empirically orthogonal and need no extra forward passes. Result: About 60% compute savings on MOT17 with negligible accuracy loss; ablations show a 13.6 percentage point improvement in compute savings over the total-uncertainty baseline; combined with distribution-free conformal calibration, the method gives tighter prediction intervals. Conclusion: The proposed orthogonal uncertainty decomposition supports efficient adaptive model selection and inference regulation, cutting the compute cost of vision tasks while keeping coverage high. Abstract: Most estimators collapse all uncertainty modes into a single confidence score, preventing reliable reasoning about when to allocate more compute or adjust inference. We introduce Uncertainty-Guided Inference-Time Selection, a lightweight inference-time framework that disentangles aleatoric (data-driven) and epistemic (model-driven) uncertainty directly in deep feature space. Aleatoric uncertainty is estimated using a regularized global density model, while epistemic uncertainty is formed from three complementary components that capture local support deficiency, manifold spectral collapse, and cross-layer feature inconsistency. These components are empirically orthogonal and require no sampling, no ensembling, and no additional forward passes. We integrate the decomposed uncertainty into a distribution-free conformal calibration procedure that yields significantly tighter prediction intervals at matched coverage. Using these components for uncertainty-guided adaptive model selection reduces compute by approximately 60 percent on MOT17 with negligible accuracy loss, enabling practical self-regulating visual inference. Additionally, our ablation results show that the proposed orthogonal uncertainty decomposition consistently yields higher computational savings across all MOT17 sequences, improving margins by 13.6 percentage points over the total-uncertainty baseline.
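
The distribution-free conformal calibration step can be illustrated with plain split conformal prediction using uncertainty-normalized scores. This is a generic sketch, not the paper's procedure: how the aleatoric and epistemic components are combined into `sigma` is not described in the abstract, and the calibration triples and function names below are hypothetical.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample (1 - alpha) quantile of calibration nonconformity scores."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points to guarantee this coverage level.
        return float("inf")
    return sorted(scores)[k - 1]


def prediction_interval(pred, sigma, q):
    """Interval whose half-width scales with the per-sample uncertainty sigma."""
    return pred - q * sigma, pred + q * sigma


# Hypothetical calibration set of (true value, prediction, uncertainty estimate).
calibration = [(1.0, 1.2, 0.5), (2.0, 1.9, 0.2), (3.0, 3.6, 0.4),
               (4.0, 3.9, 0.1), (5.0, 5.3, 0.3), (6.0, 5.8, 0.2),
               (7.0, 7.1, 0.5), (8.0, 8.4, 0.4), (9.0, 8.7, 0.3)]
scores = [abs(y - p) / s for y, p, s in calibration]
q_hat = conformal_quantile(scores, alpha=0.2)
low, high = prediction_interval(pred=4.5, sigma=0.3, q=q_hat)
```

Normalizing the residual by `sigma` is what lets a better uncertainty estimate tighten the intervals: samples with small estimated uncertainty get narrow intervals, while the calibrated quantile keeps marginal coverage at roughly 1 - alpha under exchangeability.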

[194] MSLoRA: Multi-Scale Low-Rank Adaptation via Attention Reweighting

Xu Yang,Gady Agam

Main category: cs.CV

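TL;DR: MSLoRA is a general, parameter-efficient fine-tuning method that reweights feature responses instead of tuning the backbone; it supports both CNN and ViT architectures and improves transfer performance on classification, detection, and segmentation with less than 5% of the backbone parameters.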

Details

Motivation: Existing low-rank adaptation methods are mostly limited to vision transformers (ViTs), generalize poorly across architectures, and lack effective support for convolutional neural networks (CNNs). Method: MSLoRA combines a low-rank linear projection with a multi-scale nonlinear transformation, fused through pointwise multiplication and a residual connection, to jointly modulate spatial and channel attention and reweight the feature responses of a pretrained backbone. Result: On classification, detection, and segmentation, MSLoRA significantly improves transfer performance with less than 5% of the backbone parameters, and it converges quickly, optimizes stably, and generalizes well across architectures. Conclusion: MSLoRA offers a simple, general approach to efficient adaptation of frozen vision backbones and helps unify parameter-efficient fine-tuning across architectures. Abstract: We introduce MSLoRA, a backbone-agnostic, parameter-efficient adapter that reweights feature responses rather than re-tuning the underlying backbone. Existing low-rank adaptation methods are mostly confined to vision transformers (ViTs) and struggle to generalize across architectures. MSLoRA unifies adaptation for both convolutional neural networks (CNNs) and ViTs by combining a low-rank linear projection with a multi-scale nonlinear transformation that jointly modulates spatial and channel attention. The two components are fused through pointwise multiplication and a residual connection, yielding a lightweight module that shifts feature attention while keeping pretrained weights frozen. Extensive experiments demonstrate that MSLoRA consistently improves transfer performance on classification, detection, and segmentation tasks with less than roughly 5% of backbone parameters. The design further enables stable optimization, fast convergence, and strong cross-architecture generalization. By reweighting rather than re-tuning, MSLoRA provides a simple and universal approach for efficient adaptation of frozen vision backbones.

[195] VLA-R: Vision-Language Action Retrieval toward Open-World End-to-End Autonomous Driving

Hyunki Seong,Seongwoo Moon,Hojin Ahn,Jehun Kang,David Hyunchul Shim

Main category: cs.CV

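TL;DR: Proposes Vision-Language Action Retrieval (VLA-R), an open-world end-to-end autonomous driving framework that combines a vision-language model with an action retrieval paradigm to generalize strongly to unseen environments.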

Details

Motivation: In unstructured outdoor environments, conventional end-to-end driving systems cannot cover every scenario during training and therefore adapt poorly to unfamiliar conditions; new methods with open-world perception and strong generalization are needed. Method: A frozen vision-language model performs open-world detection and segmentation without fine-tuning, a Q-Former bottleneck fuses fine-grained visual features with language-aligned features, and vision-action contrastive learning aligns perception and action embeddings to enable action retrieval. Result: Experiments on a real-world robotic platform show strong exploration and generalization in unseen, unstructured environments, even with limited data. Conclusion: By combining open-world perception with vision-action retrieval, VLA-R offers an interpretable, transferable, and highly generalizable approach to open-world end-to-end autonomous driving. Abstract: Exploring open-world situations in an end-to-end manner is a promising yet challenging task due to the need for strong generalization capabilities. In particular, end-to-end autonomous driving in unstructured outdoor environments often encounters conditions that were unfamiliar during training. In this work, we present Vision-Language Action Retrieval (VLA-R), an open-world end-to-end autonomous driving (OW-E2EAD) framework that integrates open-world perception with a novel vision-action retrieval paradigm. We leverage a frozen vision-language model for open-world detection and segmentation to obtain multi-scale, prompt-guided, and interpretable perception features without domain-specific tuning. A Q-Former bottleneck aggregates fine-grained visual representations with language-aligned visual features, bridging perception and action domains. To learn transferable driving behaviors, we introduce a vision-action contrastive learning scheme that aligns vision-language and action embeddings for effective open-world reasoning and action retrieval. Our experiments on a real-world robotic platform demonstrate strong generalization and exploratory performance in unstructured, unseen environments, even with limited data. Demo videos are provided in the supplementary material.

[196] Self-Supervised Visual Prompting for Cross-Domain Road Damage Detection

Xi Xiao,Zhuxuanzi Wang,Mingqiao Mo,Chen Liu,Chenrui Ma,Yanshu Li,Smita Krishnaswamy,Xiao Wang,Tianyang Wang

Main category: cs.CV

TL;DR: 提出了一种名为\ours的自监督框架，通过视觉探测目标域来提升路面缺陷检测的跨域泛化能力。

Details

Motivation: 现有方法在跨域场景下泛化能力差，监督方法需大量标注，自监督方法对域偏移敏感。 Method: \ours包含自监督提示增强模块（SPEM）和域感知提示对齐目标（DAPA），利用无标签目标数据生成缺陷感知提示并指导冻结的ViT主干网络，同时对齐源域与目标域的表示。 Result: 在四个基准上实验表明，\ours在零样本迁移、抗域变化能力和小样本适应效率方面均优于现有方法。 Conclusion: 自监督提示是一种构建可扩展、自适应视觉检测系统的有效途径。 Abstract: The deployment of automated pavement defect detection is often hindered by poor cross-domain generalization. Supervised detectors achieve strong in-domain accuracy but require costly re-annotation for new environments, while standard self-supervised methods capture generic features and remain vulnerable to domain shift. We propose \ours, a self-supervised framework that \emph{visually probes} target domains without labels. \ours introduces a Self-supervised Prompt Enhancement Module (SPEM), which derives defect-aware prompts from unlabeled target data to guide a frozen ViT backbone, and a Domain-Aware Prompt Alignment (DAPA) objective, which aligns prompt-conditioned source and target representations. Experiments on four challenging benchmarks show that \ours consistently outperforms strong supervised, self-supervised, and adaptation baselines, achieving robust zero-shot transfer, improved resilience to domain variations, and high data efficiency in few-shot adaptation. These results highlight self-supervised prompting as a practical direction for building scalable and adaptive visual inspection systems. Source code is publicly available: https://github.com/xixiaouab/PROBE/tree/main

[197] Towards Rotation-only Imaging Geometry: Rotation Estimation

Xinrui Li,Qi Cai,Yuanxin Wu

Main category: cs.CV

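TL;DR: Proposes a rotation-only optimization framework for Structure from Motion (SfM) that expresses translation as a function of rotation, markedly improving the accuracy and robustness of 3D visual computing.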

Details

Motivation: Existing SfM methods usually couple 3D coordinates with camera poses when handling imaging geometry, which limits performance; this work aims to improve SfM by decoupling them and studying the relationship between scene structure, rotation, and translation. Method: Taking the pose-only view of imaging geometry, the paper derives a relation in which translation is expressed through rotation and builds on it a rotation-only optimization framework based on reprojection error, applicable to two-view and multi-view scenarios. Result: Experiments show that the method outperforms current state-of-the-art rotation estimation methods, with accuracy comparable even to the result of multiple bundle adjustment iterations. Conclusion: The proposed rotation-only optimization framework improves the accuracy, robustness, and efficiency of SfM and offers new ideas and tools for 3D visual computing. Abstract: Structure from Motion (SfM) is a critical task in computer vision, aiming to recover the 3D scene structure and camera motion from a sequence of 2D images. The recent pose-only imaging geometry decouples 3D coordinates from camera poses and demonstrates significantly better SfM performance through pose adjustment. Continuing the pose-only perspective, this paper explores the critical relationship between the scene structures, rotation and translation. Notably, the translation can be expressed in terms of rotation, allowing us to condense the imaging geometry representation onto the rotation manifold. A rotation-only optimization framework based on reprojection error is proposed for both two-view and multi-view scenarios. The experimental results demonstrate superior accuracy and robustness over the current state-of-the-art rotation estimation methods, even comparable to multiple bundle adjustment iteration results. We hope this work contributes to even more accurate, efficient and reliable 3D visual computing.

[198] Seeing Through the Rain: Resolving High-Frequency Conflicts in Deraining and Super-Resolution via Diffusion Guidance

Wenjie Li,Jinglei Shi,Jin Han,Heng Guo,Zhanyu Ma

Main category: cs.CV

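TL;DR: Proposes DHGM, a diffusion-based high-frequency guided model for joint deraining and super-resolution that resolves the high-frequency conflict between the two tasks.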

Details

Motivation: When deraining and super-resolution are cascaded, removing high-frequency information conflicts with reconstructing it, which produces inconsistent restored content and hurts vision tasks such as small object detection. Method: DHGM combines pre-trained diffusion priors with high-pass filters to remove rain while enhancing structural detail, producing clean, high-resolution images. Result: Experiments show that DHGM outperforms existing methods on deraining and super-resolution at lower computational cost. Conclusion: DHGM reconciles the different demands that deraining and super-resolution place on high-frequency information, offering an efficient solution for restoring images captured in adverse weather. Abstract: Clean images are crucial for visual tasks such as small object detection, especially at high resolutions. However, real-world images are often degraded by adverse weather, and weather restoration methods may sacrifice high-frequency details critical for analyzing small objects. A natural solution is to apply super-resolution (SR) after weather removal to recover both clarity and fine structures. However, simply cascading restoration and SR struggles to bridge their inherent conflict: removal aims to remove high-frequency weather-induced noise, while SR aims to hallucinate high-frequency textures from existing details, leading to inconsistent restoration contents. In this paper, we take deraining as a case study and propose DHGM, a Diffusion-based High-frequency Guided Model for generating clean and high-resolution images. DHGM integrates pre-trained diffusion priors with high-pass filters to simultaneously remove rain artifacts and enhance structural details. Extensive experiments demonstrate that DHGM achieves superior performance over existing methods at lower cost.

[199] MFI-ResNet: Efficient ResNet Architecture Optimization via MeanFlow Compression and Selective Incubation

Nuolin Sun,Linyuan Wang,Haonan Wei,Lei Li,Bin Yan

Main category: cs.CV

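TL;DR: Proposes MFI-ResNet, which uses MeanFlow modules and a selective incubation strategy to cut nearly half the parameters while improving accuracy, giving an efficient, high-performing image classification model.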

Details

Motivation: Inspired by the link between ResNet and ODEs and by MeanFlow's one-step generative modeling, the work explores modeling the feature transformation inside ResNet with a generative flow field to improve both parameter efficiency and performance. Method: A compression-expansion strategy: the compression phase reduces the multi-layer structure of each ResNet stage to one or two MeanFlow modules; the expansion phase selectively incubates the first three stages back to the ResNet structure, keeps the last stage in MeanFlow form, and fine-tunes the result. Result: On CIFAR-10 and CIFAR-100, parameters are reduced by 46.28% and 45.59% relative to ResNet-50, while accuracy improves by 0.23% and 0.17% respectively. Conclusion: Generative flow fields can characterize the feature transformation in ResNet; MFI-ResNet offers a new perspective on connecting generative modeling with discriminative learning and strikes a better balance between efficiency and performance. Abstract: ResNet has achieved tremendous success in computer vision through its residual connection mechanism. ResNet can be viewed as a discretized form of ordinary differential equations (ODEs). From this perspective, the multiple residual blocks within a single ResNet stage essentially perform multi-step discrete iterations of the feature transformation for that stage. The recently proposed flow matching model, MeanFlow, enables one-step generative modeling by learning the mean velocity field to transform distributions. Inspired by this, we propose MeanFlow-Incubated ResNet (MFI-ResNet), which employs a compression-expansion strategy to jointly improve parameter efficiency and discriminative performance. In the compression phase, we simplify the multi-layer structure within each ResNet stage to one or two MeanFlow modules to construct a lightweight meta model. In the expansion phase, we apply a selective incubation strategy to the first three stages, expanding them to match the residual block configuration of the baseline ResNet model, while keeping the last stage in MeanFlow form, and fine-tune the incubated model. Experimental results show that on CIFAR-10 and CIFAR-100 datasets, MFI-ResNet achieves remarkable parameter efficiency, reducing parameters by 46.28% and 45.59% compared to ResNet-50, while still improving accuracy by 0.23% and 0.17%, respectively. This demonstrates that generative flow-fields can effectively characterize the feature transformation process in ResNet, providing a new perspective for understanding the relationship between generative modeling and discriminative learning.

[200] RedVTP: Training-Free Acceleration of Diffusion Vision-Language Models Inference via Masked Token-Guided Visual Token Pruning

Jingqi Xu,Jingxi Lu,Chenghao Li,Sreetama Sarkar,Souvik Kundu,Peter A. Beerel

Main category: cs.CV

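TL;DR: This paper proposes RedVTP, a response-driven visual token pruning strategy for improving the inference efficiency of diffusion vision-language models (DVLMs). The method uses attention from the masked response tokens to estimate the importance of visual tokens and prunes the unimportant ones after the first inference step, substantially raising generation throughput and lowering latency without hurting accuracy, and in some cases improving it.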

Details

Motivation: Although DVLMs support parallel decoding, the large number of visual tokens still severely limits their inference efficiency; existing pruning methods mainly target autoregressive VLMs, and effective optimization strategies for DVLMs are lacking. Method: RedVTP scores visual token importance using the attention weights of the masked response tokens and, relying on the consistency of these scores across steps, permanently prunes low-importance visual tokens after the first inference step. Result: On LLaDA-V and LaViDa, generation throughput improves by up to 186% and 28.05%, and latency drops by up to 64.97% and 21.87%, respectively, while accuracy is maintained or improved. Conclusion: RedVTP effectively improves DVLM inference efficiency and offers a practical option for deploying efficient multimodal models. Abstract: Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning and generation, yet their high computational demands remain a major challenge. Diffusion Vision-Language Models (DVLMs) are particularly attractive because they enable parallel token decoding, but the large number of visual tokens still significantly hinders their inference efficiency. While visual token pruning has been extensively studied for autoregressive VLMs (AVLMs), it remains largely unexplored for DVLMs. In this work, we propose RedVTP, a response-driven visual token pruning strategy that leverages the inference dynamics of DVLMs. Our method estimates visual token importance using attention from the masked response tokens. Based on the observation that these importance scores remain consistent across steps, RedVTP prunes the less important visual tokens from the masked tokens after the first inference step, thereby maximizing inference efficiency. Experiments show that RedVTP improves token generation throughput of LLaDA-V and LaViDa by up to 186% and 28.05%, respectively, and reduces inference latency by up to 64.97% and 21.87%, without compromising, and in some cases improving, accuracy.
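
The pruning step described in this abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function name `prune_visual_tokens`, the `keep_ratio` parameter, and the choice to average attention over the masked response tokens are all assumptions made for the example; the abstract only states that importance comes from attention of the masked response tokens and that pruning happens once, after the first inference step.

```python
def prune_visual_tokens(attn, keep_ratio):
    """Return the indices of the visual tokens to keep.

    attn[i][j] is the attention weight from masked response token i
    to visual token j, taken from the first inference step (assumed layout).
    """
    num_visual = len(attn[0])
    # Assumed scoring rule: mean attention each visual token receives
    # from the masked response tokens.
    importance = [sum(row[j] for row in attn) / len(attn) for j in range(num_visual)]
    num_keep = max(1, round(keep_ratio * num_visual))
    # Rank tokens by importance, keep the top ones, and restore original order.
    ranked = sorted(range(num_visual), key=lambda j: importance[j], reverse=True)
    return sorted(ranked[:num_keep])


# Toy attention map: 2 masked response tokens attending to 4 visual tokens.
attn = [
    [0.50, 0.05, 0.30, 0.15],
    [0.40, 0.10, 0.35, 0.15],
]
kept = prune_visual_tokens(attn, keep_ratio=0.5)
print(kept)
```

Because the scores are reported to stay consistent across steps, the kept index set would be computed once and reused for all later denoising steps.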

[201] Text-Guided Channel Perturbation and Pretrained Knowledge Integration for Unified Multi-Modality Image Fusion

Xilai Li,Xiaosong Li,Weijun Jiang

Main category: cs.CV

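TL;DR: This paper proposes UP-Fusion, a unified multi-modality image fusion framework based on channel perturbation and pre-trained knowledge integration. Through semantic-aware channel pruning, geometric affine modulation, and text-guided channel perturbation modules, it mitigates the gradient conflicts caused by modality differences and improves both fusion performance and cross-task generalization.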

Details

Motivation: Existing unified models suffer from gradient conflicts caused by large modality differences, while modality-specific encoders improve fusion quality at the cost of generalization across fusion tasks; a method that balances performance and generality is needed. Method: The UP-Fusion framework has three core modules: 1) a Semantic-Aware Channel Pruning Module (SCPM), which uses the semantic perception capability of a pre-trained model to filter and enhance feature channels; 2) a Geometric Affine Modulation Module (GAM), which applies affine transformations to the fused features using the original modal features to preserve modal discriminability; 3) a Text-Guided Channel Perturbation Module (TCPM), which reshapes the channel distribution during decoding to reduce dependence on modality-specific channels. Result: Extensive experiments show that the method outperforms existing approaches on both multi-modality image fusion and downstream tasks, achieving better fusion quality and stronger generalization. Conclusion: By introducing channel selection and perturbation mechanisms together with pre-trained knowledge, UP-Fusion resolves gradient conflicts without modality-specific encoders and provides an efficient, general paradigm for unified multi-modality fusion. Abstract: Multi-modality image fusion enhances scene perception by combining complementary information. Unified models aim to share parameters across modalities for multi-modality image fusion, but large modality differences often cause gradient conflicts, limiting performance. Some methods introduce modality-specific encoders to enhance feature perception and improve fusion quality. However, this strategy reduces generalisation across different fusion tasks. To overcome this limitation, we propose a unified multi-modality image fusion framework based on channel perturbation and pre-trained knowledge integration (UP-Fusion). To suppress redundant modal information and emphasize key features, we propose the Semantic-Aware Channel Pruning Module (SCPM), which leverages the semantic perception capability of a pre-trained model to filter and enhance multi-modality feature channels. Furthermore, we propose the Geometric Affine Modulation Module (GAM), which uses original modal features to apply affine transformations on initial fusion features to maintain the modal discriminability of the feature encoder. Finally, we apply a Text-Guided Channel Perturbation Module (TCPM) during decoding to reshape the channel distribution, reducing the dependence on modality-specific channels. Extensive experiments demonstrate that the proposed algorithm outperforms existing methods on both multi-modality image fusion and downstream tasks.

[202] Real-Time Drivers' Drowsiness Detection and Analysis through Deep Learning

ANK Zaman,Prosenjit Chatterjee,Rajat Sharma

Main category: cs.CV

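TL;DR: Proposes a real-time driver drowsiness detection system based on deep convolutional neural networks (DCNNs) and OpenCV. It detects drowsiness by analyzing facial features such as eye opening and mouth movement, and issues a real-time alert when drowsiness is detected.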

Details

Motivation: Long hours of driving easily lead to driver fatigue and, in turn, traffic accidents, so a real-time, non-invasive, low-cost drowsiness detection method is needed to improve road safety. Method: A live camera captures facial images of the driver; OpenCV extracts facial landmarks (such as the eyes and mouth), and a pre-trained deep convolutional neural network (DCNN) model classifies the drowsiness state. Result: The system reaches 99.6% detection accuracy on the NTHU-DDD dataset and 97% on the Yawn-Eye-Dataset, and can issue drowsiness alerts in real time. Conclusion: The method is an efficient, low-cost, non-invasive driver drowsiness detection solution with good prospects for practical use, helping to reduce accidents caused by drowsy driving. Abstract: A long road trip is fun for drivers. However, a long drive for days can be tedious for a driver to accommodate stringent deadlines to reach distant destinations. Such a scenario forces drivers to drive extra miles, utilizing extra hours daily without sufficient rest and breaks. Once a driver undergoes such a scenario, it occasionally triggers drowsiness during driving. Drowsiness in driving can be life-threatening to any individual and can affect other drivers' safety; therefore, a real-time detection system is needed. To identify fatigued facial characteristics in drivers and trigger the alarm immediately, this research develops a real-time driver drowsiness detection system utilizing deep convolutional neural networks (DCNNs) and OpenCV. Our proposed and implemented model takes real-time facial images of a driver using a live camera and utilizes a Python-based library named OpenCV to examine the facial images for facial landmarks like sufficient eye openings and yawn-like mouth movements. The DCNNs framework then gathers the data and utilizes a pre-trained model to detect the drowsiness of a driver using facial landmarks. If the driver is identified as drowsy, the system issues a continuous alert in real time, embedded in the Smart Car technology. By potentially saving innocent lives on the roadways, the proposed technique offers a non-invasive and inexpensive way to identify drowsiness. Our proposed and implemented DCNNs-embedded drowsiness detection model performs well on the NTHU-DDD dataset and the Yawn-Eye-Dataset, with drowsiness detection classification accuracies of 99.6% and 97%, respectively.

[203] CoTBox-TTT: Grounding Medical VQA with Visual Chain-of-Thought Boxes During Test-time Training

Jiahe Qian,Yuhao Shen,Zhangtianyi Chen,Juexiao Zhou,Peisong Wang

Main category: cs.CV

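TL;DR: Proposes CoTBox-TTT, a test-time training method that improves the reliability of medical visual question answering models under domain shift. It freezes the backbones and updates only soft prompts, using a visual chain-of-thought signal to strengthen the consistency between answers and image evidence.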

Details

Motivation: Existing medical visual question answering systems perform poorly under domain shift and often attend to spurious image regions, lacking reliability and interpretability, while retraining or collecting additional labels is impractical at deployment time. Method: CoTBox-TTT adopts an evidence-first test-time training strategy that freezes all backbones and updates only continuous soft prompts; a visual chain-of-thought (CoT) signal localizes question-relevant regions, and answer consistency is enforced between the original image and a localized crop. Result: On the pathVQA dataset, applying CoTBox-TTT to LLaVA raises closed-ended accuracy by 12.3%, supporting its effectiveness and plug-and-play nature. Conclusion: CoTBox-TTT is a practical, label-free, pluggable test-time adaptation method that markedly improves the robustness and grounding of medical VQA models under domain shift and is suited to real clinical deployment. Abstract: Medical visual question answering could support clinical decision making, yet current systems often fail under domain shift and produce answers that are weakly grounded in image evidence. This reliability gap arises when models attend to spurious regions and when retraining or additional labels are impractical at deployment time. We address this setting with CoTBox-TTT, an evidence-first test-time training approach that adapts a vision-language model at inference while keeping all backbones frozen. The method updates only a small set of continuous soft prompts. It identifies question-relevant regions through a visual chain-of-thought signal and encourages answer consistency across the original image and a localized crop. The procedure is label-free and plug-and-play with diverse backbones. Experiments on medical VQA show that the approach is practical for real deployments. For instance, adding CoTBox-TTT to LLaVA increases closed-ended accuracy by 12.3% on pathVQA.

[204] MOON2.0: Dynamic Modality-balanced Multimodal Representation Learning for E-commerce Product Understanding

Zhanheng Nie,Chenghan Fu,Daoze Zhang,Junxian Wu,Wanxian Guan,Pengjie Wang,Jian Xu,Bo Zheng

Main category: cs.CV

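TL;DR: This paper proposes MOON2.0, a dynamic modality-balanced multimodal representation learning framework for e-commerce product understanding, addressing the challenges existing models face with modality imbalance, underused cross-modal alignment, and noise handling.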

Details

Motivation: Existing multimodal large language models for e-commerce product understanding suffer from modality imbalance caused by modality-mixed training, underuse of the image-text alignment within a product, and limited ability to handle noisy data. Method: The MOON2.0 framework includes a Modality-driven Mixture-of-Experts (MoE) module that enables multimodal joint learning to mitigate modality imbalance; a dual-level alignment method that strengthens semantic alignment within products; and an MLLM-based image-text co-augmentation strategy that combines textual enrichment with visual expansion, together with dynamic sample filtering to improve training data quality. The MBE2.0 benchmark is also built for evaluation. Result: Experiments show that MOON2.0 achieves state-of-the-art zero-shot performance on MBE2.0 and several public datasets, and attention heatmap visualizations confirm improved multimodal alignment. Conclusion: MOON2.0 addresses key problems of multimodal learning in e-commerce scenarios and shows clear advantages in both representation learning and practical performance. Abstract: The rapid growth of e-commerce calls for multimodal models that comprehend rich visual and textual product information. Although recent multimodal large language models (MLLMs) for product understanding exhibit strong capability in representation learning for e-commerce, they still face three challenges: (i) the modality imbalance induced by modality mixed training; (ii) underutilization of the intrinsic alignment relationships among visual and textual information within a product; and (iii) limited handling of noise in e-commerce multimodal data. To address these, we propose MOON2.0, a dynamic modality-balanced multimodal representation learning framework for e-commerce product understanding. MOON2.0 comprises: (1) a Modality-driven Mixture-of-Experts (MoE) module that adaptively processes input samples by their modality composition, enabling Multimodal Joint Learning to mitigate the modality imbalance; (2) a Dual-level Alignment method to better leverage semantic alignment properties inside individual products; and (3) an MLLM-based Image-text Co-augmentation strategy that integrates textual enrichment with visual expansion, coupled with Dynamic Sample Filtering to improve training data quality. We further introduce MBE2.0, a co-augmented multimodal representation benchmark for e-commerce representation learning and evaluation. Experiments show that MOON2.0 delivers state-of-the-art zero-shot performance on MBE2.0 and multiple public datasets. Furthermore, attention-based heatmap visualization provides qualitative evidence of improved multimodal alignment of MOON2.0.

[205] MaskAnyNet: Rethinking Masked Image Regions as Valuable Information in Supervised Learning

Jingshan Hong,Haigen Hu,Huihuang Zhang,Qianwei Zhou,Zhao Li

Main category: cs.CV

TL;DR: Proposes MaskAnyNet, which uses a relearning mechanism to exploit the semantic diversity of masked regions and improve image representation.

Details

Motivation: Traditional image masking discards pixels, losing contextual information and possibly removing features that are critical in fine-grained tasks; masked image modeling, however, shows that masked regions can be reconstructed and carry semantic diversity. Method: MaskAnyNet adds an extra branch that jointly learns from the recomposed masked regions, treating masked content as auxiliary knowledge instead of ignoring it. Result: Performance improves on multiple benchmarks with both CNN and Transformer backbones, confirming that the method enhances semantic diversity and preserves fine-grained details. Conclusion: Masked regions should be treated as a valuable semantic source; MaskAnyNet improves model performance by reusing masked content. Abstract: In supervised learning, traditional image masking faces two key issues: (i) discarded pixels are underutilized, leading to a loss of valuable contextual information; (ii) masking may remove small or critical features, especially in fine-grained tasks. In contrast, masked image modeling (MIM) has demonstrated that masked regions can be reconstructed from partial input, revealing that even incomplete data can exhibit strong contextual consistency with the original image. This highlights the potential of masked regions as sources of semantic diversity. Motivated by this, we revisit the image masking approach, proposing to treat masked content as auxiliary knowledge rather than ignoring it. Based on this, we propose MaskAnyNet, which combines masking with a relearning mechanism to exploit both visible and masked information. It can be easily extended to any model with an additional branch to jointly learn from the recomposed masked region. This approach leverages the semantic diversity of the masked regions to enrich features and preserve fine-grained details. Experiments on CNN and Transformer backbones show consistent gains across multiple benchmarks. Further analysis confirms that the proposed method improves semantic diversity through the reuse of masked content.

[206] Towards Temporal Fusion Beyond the Field of View for Camera-based Semantic Scene Completion

Jongseong Bae,Junwoo Ha,Jinnyeong Heo,Yeongin Lee,Ha Young Kim

Main category: cs.CV

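TL;DR: Proposes the C3DFusion module, which fuses 3D features from the current and historical frames to improve the reconstruction of unseen regions in camera-based 3D semantic scene completion.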

Details

Motivation: When existing methods use temporal cues to enrich current-frame features, they struggle to recover critical out-of-view regions such as the sides of the ego-vehicle, even though historical frames contain important context about those regions. Method: The Current-Centric Contextual 3D Fusion (C3DFusion) module generates hidden-region-aware 3D feature geometry by explicitly aligning 3D-lifted point features from the current and historical frames; enhanced temporal fusion uses two techniques, historical context blurring and current-centric feature densification. Result: The method clearly outperforms existing state-of-the-art approaches on SemanticKITTI and SSCBench-KITTI-360 and generalizes well across several baseline models. Conclusion: C3DFusion effectively uses contextual information from historical frames to improve 3D semantic completion of out-of-view regions and is both effective and broadly applicable. Abstract: Recent camera-based 3D semantic scene completion (SSC) methods have increasingly explored leveraging temporal cues to enrich the features of the current frame. However, while these approaches primarily focus on enhancing in-frame regions, they often struggle to reconstruct critical out-of-frame areas near the sides of the ego-vehicle, although previous frames commonly contain valuable contextual information about these unseen regions. To address this limitation, we propose the Current-Centric Contextual 3D Fusion (C3DFusion) module, which generates hidden region-aware 3D feature geometry by explicitly aligning 3D-lifted point features from both current and historical frames. C3DFusion performs enhanced temporal fusion through two complementary techniques, historical context blurring and current-centric feature densification, which suppress noise from inaccurately warped historical point features by attenuating their scale, and enhance current point features by increasing their volumetric contribution. Simply integrated into standard SSC architectures, C3DFusion demonstrates strong effectiveness, significantly outperforming state-of-the-art methods on the SemanticKITTI and SSCBench-KITTI-360 datasets. Furthermore, it exhibits robust generalization, achieving notable performance gains when applied to other baseline models.

[207] Visible Structure Retrieval for Lightweight Image-Based Relocalisation

Fereidoon Zangeneh,Leonard Bruns,Amit Dekel,Alessandro Pieropan,Patric Jensfelt

Main category: cs.CV

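TL;DR: Proposes a neural-network-based visible structure retrieval method that maps an image directly to a subset of the scene's 3D structure, enabling efficient camera pose estimation with low storage cost.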

Details

Motivation: Existing structure-based relocalisation methods rely on image retrieval or heuristic search, which leads to elaborate pipelines or storage costs that grow with the number of observations. Method: A neural network (the visible structure retrieval network) learns a direct mapping from an input image to the visible 3D structure points; a forward pass determines the subset of 3D points seen by the query image, shrinking the 2D-3D matching search space. Result: Localisation accuracy is comparable to the current state of the art, with markedly lower computational and storage overhead. Conclusion: The method offers a more efficient and scalable paradigm for camera relocalisation in mapped environments. Abstract: Accurate camera pose estimation from an image observation in a previously mapped environment is commonly done through structure-based methods: by finding correspondences between 2D keypoints on the image and 3D structure points in the map. In order to make this correspondence search tractable in large scenes, existing pipelines either rely on search heuristics, or perform image retrieval to reduce the search space by comparing the current image to a database of past observations. However, these approaches result in elaborate pipelines or storage requirements that grow with the number of past observations. In this work, we propose a new paradigm for making structure-based relocalisation tractable. Instead of relying on image retrieval or search heuristics, we learn a direct mapping from image observations to the visible scene structure in a compact neural network. Given a query image, a forward pass through our novel visible structure retrieval network allows obtaining the subset of 3D structure points in the map that the image views, thus reducing the search space of 2D-3D correspondences. We show that our proposed method enables performing localisation with an accuracy comparable to the state of the art, while requiring a lower computational and storage footprint.

[208] DINO-Detect: A Simple yet Effective Framework for Blur-Robust AI-Generated Image Detection

Jialiang Shen,Jiyang Zheng,Yunqi Xue,Huajie Chen,Yu Yao,Hui Kang,Ruiqi Liu,Helin Gong,Yang Yang,Dadong Wang,Tongliang Liu

Main category: cs.CV

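TL;DR: Proposes a blur-robust AI-generated image detection framework based on teacher-student knowledge distillation, in which a high-capacity teacher (DINOv3) trained on sharp images guides the student, achieving state-of-the-art performance under both motion-blurred and sharp conditions.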

Details

Motivation: Existing AI-generated image detectors are badly affected by real-world degradations, especially motion blur, and their performance drops too far for practical use. Method: A teacher-student knowledge distillation framework. The teacher (DINOv3) is trained on sharp images and frozen; its features and logits serve as supervision signals for a student trained on blurred images, so that the student learns consistent representations. Result: The method reaches state-of-the-art detection performance under both motion-blurred and sharp conditions, with better generalization and real-world applicability. Conclusion: The method improves the robustness of AIGI detectors to real-world blur degradation and moves detection closer to practical deployment. Abstract: With growing concerns over image authenticity and digital safety, the field of AI-generated image (AIGI) detection has progressed rapidly. Yet, most AIGI detectors still struggle under real-world degradations, particularly motion blur, which frequently occurs in handheld photography, fast motion, and compressed video. Such blur distorts fine textures and suppresses high-frequency artifacts, causing severe performance drops in real-world settings. We address this limitation with a blur-robust AIGI detection framework based on teacher-student knowledge distillation. A high-capacity teacher (DINOv3), trained on clean (i.e., sharp) images, provides stable and semantically rich representations that serve as a reference for learning. By freezing the teacher to maintain its generalization ability, we distill its feature and logit responses from sharp images to a student trained on blurred counterparts, enabling the student to produce consistent representations under motion degradation. Extensive benchmark experiments show that our method achieves state-of-the-art performance under both motion-blurred and clean conditions, demonstrating improved generalization and real-world applicability. Source code will be released at: https://github.com/JiaLiangShen/Dino-Detect-for-blur-robust-AIGC-Detection.
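
The feature-and-logit distillation described in this abstract could look roughly like the sketch below. The concrete loss form is an assumption: the abstract does not say which distances are used, so this example pairs a mean-squared error on features with a temperature-softened KL divergence on logits, a common choice in knowledge distillation. The names `distillation_loss`, `temperature`, `feat_weight`, and `logit_weight` are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_feat, student_feat, teacher_logits, student_logits,
                      temperature=2.0, feat_weight=1.0, logit_weight=1.0):
    """Assumed loss: feature MSE plus KL(teacher || student) on softened logits.

    The teacher outputs come from the sharp image and are treated as fixed
    targets; the student outputs come from the blurred counterpart.
    """
    feat_loss = sum((t - s) ** 2 for t, s in zip(teacher_feat, student_feat)) / len(teacher_feat)
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return feat_weight * feat_loss + logit_weight * kl


# Toy values: the student on the blurred image drifts slightly from the teacher.
loss_same = distillation_loss([0.2, 0.8], [0.2, 0.8], [2.0, -1.0], [2.0, -1.0])
loss_diff = distillation_loss([0.2, 0.8], [0.5, 0.4], [2.0, -1.0], [0.5, 0.3])
print(loss_same, loss_diff)
```

In a real training loop only the student's parameters would receive gradients from this loss, since the teacher is frozen.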

[209] MdaIF: Robust One-Stop Multi-Degradation-Aware Image Fusion with Language-Driven Semantics

Jing Li,Yifan Wang,Jiafeng Yan,Renlong Zhang,Bin Yang

Main category: cs.CV

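TL;DR: Proposes MdaIF, a degradation-aware infrared and visible image fusion framework driven by a large language model. It introduces a mixture-of-experts system and a vision-language model to extract semantic priors for multi-degradation scenarios, and combines them with a degradation-aware channel attention module for adaptive feature interaction, improving fusion performance under complex weather conditions.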

Details

Motivation: Existing methods lose fusion performance when visible image quality degrades in adverse weather, and fixed network structures adapt poorly to varied degradation scenarios, so an adaptive fusion framework that perceives and adapts to different degradation types is needed. Method: A mixture-of-experts (MoE) system handles multiple degradation scenarios; a pre-trained vision-language model (VLM) extracts degradation knowledge and scene features as a semantic prior; a degradation-aware channel attention module (DCAM) uses degradation prototype decomposition for multi-modal feature interaction; the semantic prior and channel-modulated features guide expert routing in the MoE. Result: Extensive experiments across degradation scenarios (such as haze, rain, and snow) show that MdaIF outperforms current state-of-the-art methods in both quantitative metrics and visual quality. Conclusion: With a semantic-prior-guided degradation-aware mechanism and dynamic expert routing, MdaIF achieves robust and efficient infrared and visible image fusion under complex, varying degradation conditions, with good generalization and application prospects. Abstract: Infrared and visible image fusion aims to integrate complementary multi-modal information into a single fused result. However, existing methods 1) fail to account for the degradation of visible images under adverse weather conditions, thereby compromising fusion performance; and 2) rely on fixed network architectures, limiting their adaptability to diverse degradation scenarios. To address these issues, we propose a one-stop degradation-aware image fusion framework for multi-degradation scenarios driven by a large language model (MdaIF). Given the distinct scattering characteristics of different degradation scenarios (e.g., haze, rain, and snow) in atmospheric transmission, a mixture-of-experts (MoE) system is introduced to tackle image fusion across multiple degradation scenarios. To adaptively extract diverse weather-aware degradation knowledge and scene feature representations, collectively referred to as the semantic prior, we employ a pre-trained vision-language model (VLM) in our framework. Guided by the semantic prior, we propose a degradation-aware channel attention module (DCAM), which employs degradation prototype decomposition to facilitate multi-modal feature interaction in the channel domain. In addition, to achieve effective expert routing, the semantic prior and channel-domain modulated features are utilized to guide the MoE, enabling robust image fusion in complex degradation scenarios. Extensive experiments validate the effectiveness of our MdaIF, demonstrating superior performance over SOTA methods.

[210] D$^{2}$-VPR: A Parameter-efficient Visual-foundation-model-based Visual Place Recognition Method via Knowledge Distillation and Deformable Aggregation

Zheyuan Zhang,Jiwei Zhang,Boyu Zhou,Linzhimeng Duan,Hong Chen

Main category: cs.CV

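TL;DR: This paper proposes $D^{2}$-VPR, a framework that combines knowledge distillation with a deformable aggregation mechanism. It keeps the strong feature extraction of visual foundation models while substantially reducing model complexity and computational overhead, achieving a better performance-efficiency trade-off.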

Details

Motivation: Existing visual place recognition methods built on visual foundation models such as DINOv2 perform well but are complex and computationally heavy, making them hard to deploy on resource-constrained devices; an efficient, high-performing VPR method is therefore needed. Method: The $D^{2}$-VPR framework uses a two-stage training strategy (knowledge distillation and fine-tuning) and introduces a Distillation Recovery Module (DRM) to reduce the feature-space gap between teacher and student models; it also designs a Top-Down-attention-based Deformable Aggregator (TDDA) that uses global semantics to dynamically adjust the regions of interest used for feature aggregation. Result: Experiments show performance comparable to current state-of-the-art methods, with about 64.2% fewer parameters and 62.6% fewer FLOPs than CricaVPR. Conclusion: $D^{2}$-VPR retains high accuracy while greatly improving model efficiency, making it suitable for resource-constrained devices and an effective solution for efficient visual place recognition. Abstract: Visual Place Recognition (VPR) aims to determine the geographic location of a query image by retrieving its most visually similar counterpart from a geo-tagged reference database. Recently, the emergence of the powerful visual foundation model, DINOv2, trained in a self-supervised manner on massive datasets, has significantly improved VPR performance. This improvement stems from DINOv2's exceptional feature generalization capabilities but is often accompanied by increased model complexity and computational overhead that impede deployment on resource-constrained devices. To address this challenge, we propose $D^{2}$-VPR, a $D$istillation- and $D$eformable-based framework that retains the strong feature extraction capabilities of visual foundation models while significantly reducing model parameters and achieving a more favorable performance-efficiency trade-off. Specifically, first, we employ a two-stage training strategy that integrates knowledge distillation and fine-tuning. Additionally, we introduce a Distillation Recovery Module (DRM) to better align the feature spaces between the teacher and student models, thereby minimizing knowledge transfer losses to the greatest extent possible. Second, we design a Top-Down-attention-based Deformable Aggregator (TDDA) that leverages global semantic features to dynamically and adaptively adjust the Regions of Interest (ROI) used for aggregation, thereby improving adaptability to irregular structures. Extensive experiments demonstrate that our method achieves competitive performance compared to state-of-the-art approaches.
Meanwhile, it reduces the parameter count by approximately 64.2% and FLOPs by about 62.6% (compared to CricaVPR). Code is available at https://github.com/tony19980810/D2VPR.

[211] ReaSon: Reinforced Causal Search with Information Bottleneck for Video Understanding

Yuan Zhou,Litao Hua,Shilong Jin,Wentao Huang,Haoran Duan

Main category: cs.CV

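TL;DR: Proposes ReaSon, a framework that optimizes keyframe selection through a Causal Information Bottleneck (CIB), combining predictive sufficiency and causal necessity and using reinforcement learning to improve video understanding under limited-frame settings.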

Details

Motivation: Because vision-language models accept a limited number of input tokens and relevant information is temporally sparse in videos, selecting keyframes that are both informative and causally decisive is a key challenge for video understanding. Method: Keyframe selection is formulated as an optimization problem under a Causal Information Bottleneck (CIB) criterion; a learnable policy network selects keyframes from candidate frames to satisfy predictive sufficiency, causal necessity is assessed through counterfactual interventions, and a composite reward guides reinforcement learning. Result: Experiments on NExT-QA, EgoSchema, and Video-MME show that ReaSon consistently outperforms existing state-of-the-art methods under limited-frame settings, with strong performance and generalization. Conclusion: By explicitly modeling the predictive sufficiency and causal necessity of keyframes, ReaSon improves the quality of keyframe selection in video understanding and gives VLMs an efficient, reliable way to handle long videos. Abstract: Keyframe selection has become essential for video understanding with vision-language models (VLMs) due to limited input tokens and the temporal sparsity of relevant information across video frames. Video understanding often relies on effective keyframes that are not only informative but also causally decisive. To this end, we propose Reinforced Causal Search with Information Bottleneck (ReaSon), a framework that formulates keyframe selection as an optimization problem with the help of a novel Causal Information Bottleneck (CIB), which explicitly defines keyframes as those satisfying both predictive sufficiency and causal necessity. Specifically, ReaSon employs a learnable policy network to select keyframes from a visually relevant pool of candidate frames to capture predictive sufficiency, and then assesses causal necessity via counterfactual interventions. Finally, a composite reward aligned with the CIB principle is designed to guide the selection policy through reinforcement learning. Extensive experiments on NExT-QA, EgoSchema, and Video-MME demonstrate that ReaSon consistently outperforms existing state-of-the-art methods under limited-frame settings, validating its effectiveness and generalization ability.

[212] HiGFA: Hierarchical Guidance for Fine-grained Data Augmentation with Diffusion Models

Zhiguang Lu,Qianqian Xu,Peisong Wen,Siran Da,Qingming Huang

Main category: cs.CV

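TL;DR: Proposes HiGFA, a hierarchically guided fine-grained augmentation method that exploits the temporal dynamics of diffusion sampling and applies text, contour, and fine-grained classifier guidance in stages to generate fine-grained images that are both diverse and faithful.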

Details

Motivation: Standard generative approaches (such as CFG) lack the specificity needed for fine-grained tasks and struggle to capture subtle category-defining features, which can produce misleading samples and degrade classification performance. Method: In the early-to-mid stages of diffusion sampling, HiGFA applies strong text guidance and transformed contour guidance to establish the overall scene, style, and structure; in the final stages it activates a specialized fine-grained classifier guidance and dynamically adjusts the strength of all guidance signals according to prediction confidence, giving a hierarchical, confidence-driven coordination of guidance. Result: Experiments on several fine-grained visual classification (FGVC) datasets show that HiGFA generates high-quality synthetic images that improve classifier performance. Conclusion: Through staged, dynamically modulated multimodal guidance, HiGFA balances global structure with detail precision and shows strong fidelity and practicality in fine-grained image generation. Abstract: Generative diffusion models show promise for data augmentation. However, applying them to fine-grained tasks presents a significant challenge: ensuring synthetic images accurately capture the subtle, category-defining features critical for high fidelity. Standard approaches, such as text-based Classifier-Free Guidance (CFG), often lack the required specificity, potentially generating misleading examples that degrade fine-grained classifier performance. To address this, we propose Hierarchically Guided Fine-grained Augmentation (HiGFA). HiGFA leverages the temporal dynamics of the diffusion sampling process. It employs strong text and transformed contour guidance with fixed strengths in the early-to-mid sampling stages to establish overall scene, style, and structure. In the final sampling stages, HiGFA activates a specialized fine-grained classifier guidance and dynamically modulates the strength of all guidance signals based on prediction confidence. This hierarchical, confidence-driven orchestration enables HiGFA to generate diverse yet faithful synthetic images by intelligently balancing global structure formation with precise detail refinement. Experiments on several FGVC datasets demonstrate the effectiveness of HiGFA.
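
The staged guidance schedule in this abstract can be illustrated with a small sketch. Only the two-stage structure is taken from the abstract (fixed text and contour strengths early, classifier guidance plus confidence-based modulation late). Everything else is an assumption for illustration: the function name `guidance_strengths`, the default strengths, the `switch_fraction` boundary, and in particular the modulation rule `1 - confidence`, which is a placeholder since the abstract does not state how confidence changes the strengths.

```python
def guidance_strengths(step, total_steps, confidence,
                       text_strength=7.5, contour_strength=1.0,
                       classifier_strength=2.0, switch_fraction=0.7):
    """Return per-signal guidance strengths for one sampling step.

    step counts from 0 (start of sampling) to total_steps (end).
    confidence is the fine-grained classifier's confidence in [0, 1].
    """
    progress = step / total_steps
    if progress < switch_fraction:
        # Early-to-mid stage: fixed text and contour guidance, no classifier guidance.
        return {"text": text_strength, "contour": contour_strength, "classifier": 0.0}
    # Final stage: classifier guidance is active and all signals are rescaled.
    # Placeholder rule (assumption): ease off guidance as confidence rises.
    scale = 1.0 - confidence
    return {
        "text": text_strength * scale,
        "contour": contour_strength * scale,
        "classifier": classifier_strength * scale,
    }


early = guidance_strengths(step=10, total_steps=100, confidence=0.25)
late = guidance_strengths(step=90, total_steps=100, confidence=0.25)
print(early)
print(late)
```

A different modulation rule can be swapped in by changing the single `scale` line; the stage switch itself is independent of that choice.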

[213] EmoVerse: A MLLMs-Driven Emotion Representation Dataset for Interpretable Visual Emotion Analysis

Yijie Guo,Dexiang Hong,Weidong Chen,Zihan She,Cheng Ye,Xiaojun Chang,Zhendong Mao

Main category: cs.CV

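TL;DR: This paper introduces EmoVerse, a large-scale open-source dataset that supports interpretable visual emotion analysis, advancing research on interpretable emotion understanding through multi-layered annotations and dual emotion representations.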

Details

Motivation: Visual emotion analysis research is limited by the lack of open, interpretable datasets, and most work assigns a single overall emotion label, which reveals little about how visual elements shape emotion. Method: EmoVerse is built with knowledge-graph-inspired multi-layered annotations that decompose emotions into Background-Attribute-Subject (B-A-S) triplets grounded to image regions; dual emotion annotations (CES and DES) are included; a multi-stage automated annotation pipeline and an interpretable model that maps visual cues into the dimensional emotion space are designed. Result: EmoVerse contains over 219k images, supports word-level and subject-level emotional reasoning, achieves highly reliable annotation, and comes with an interpretable model that produces attribution explanations. Conclusion: The EmoVerse dataset, annotation pipeline, and interpretable model together form a solid foundation for explainable high-level emotion understanding and push visual emotion analysis toward finer-grained, more transparent methods. Abstract: Visual Emotion Analysis (VEA) aims to bridge the affective gap between visual content and human emotional responses. Despite its promise, progress in this field remains limited by the lack of open-source and interpretable datasets. Most existing studies assign a single discrete emotion label to an entire image, offering limited insight into how visual elements contribute to emotion. In this work, we introduce EmoVerse, a large-scale open-source dataset that enables interpretable visual emotion analysis through multi-layered, knowledge-graph-inspired annotations. By decomposing emotions into Background-Attribute-Subject (B-A-S) triplets and grounding each element to visual regions, EmoVerse provides word-level and subject-level emotional reasoning. With over 219k images, the dataset further includes dual annotations in Categorical Emotion States (CES) and Dimensional Emotion Space (DES), facilitating unified discrete and continuous emotion representation. A novel multi-stage pipeline ensures high annotation reliability with minimal human effort. Finally, we introduce an interpretable model that maps visual cues into DES representations and provides detailed attribution explanations. Together, the dataset, pipeline, and model form a comprehensive foundation for advancing explainable high-level emotion understanding.

[214] SEMC: Structure-Enhanced Mixture-of-Experts Contrastive Learning for Ultrasound Standard Plane Recognition

Qing Cai,Guihao Yan,Fan Zhang,Cheng Zhang,Zhi Liu

Main category: cs.CV

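TL;DR: Proposes SEMC, a structure-enhanced mixture-of-experts contrastive learning framework for ultrasound standard plane recognition that improves recognition by fusing multi-scale structural information with hierarchical contrastive learning.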

Details

Motivation: Existing methods do not make effective use of shallow structural information and struggle to capture fine-grained semantic differences from contrastive samples generated by image augmentation, so structural and discriminative details of ultrasound standard planes are recognized poorly. Method: The SEMC framework includes a Semantic-Structure Fusion Module (SSFM) that aligns shallow and deep features and strengthens perception of multi-scale structural information, and a Mixture-of-Experts Contrastive Recognition Module (MCRM) that uses an MoE mechanism for hierarchical contrastive learning and classification over multi-level features. Result: Experiments on an in-house large-scale liver ultrasound dataset and two public datasets show that SEMC outperforms current state-of-the-art methods on multiple metrics. Conclusion: SEMC captures structural and discriminative details in ultrasound images more effectively, improves standard plane recognition, and has potential for clinical use. Abstract: Ultrasound standard plane recognition is essential for clinical tasks such as disease screening, organ evaluation, and biometric measurement. However, existing methods fail to effectively exploit shallow structural information and struggle to capture fine-grained semantic differences through contrastive samples generated by image augmentations, ultimately resulting in suboptimal recognition of both structural and discriminative details in ultrasound standard planes. To address these issues, we propose SEMC, a novel Structure-Enhanced Mixture-of-Experts Contrastive learning framework that combines structure-aware feature fusion with expert-guided contrastive learning. Specifically, we first introduce a novel Semantic-Structure Fusion Module (SSFM) to exploit multi-scale structural information and enhance the model's ability to perceive fine-grained structural details by effectively aligning shallow and deep features. Then, a novel Mixture-of-Experts Contrastive Recognition Module (MCRM) is designed to perform hierarchical contrastive learning and classification across multi-level features using a mixture-of-experts (MoE) mechanism, further improving class separability and recognition performance. More importantly, we also curate a large-scale and meticulously annotated liver ultrasound dataset containing six standard planes. Extensive experimental results on our in-house dataset and two public datasets demonstrate that SEMC outperforms recent state-of-the-art methods across various metrics.

[215] Through-Foliage Surface-Temperature Reconstruction for early Wildfire Detection

Mohamed Youssef,Lukas Brunner,Klaus Rundhammer,Gerald Czech,Oliver Bimber

Main category: cs.CV

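TL;DR: Proposes a new method combining signal processing and machine learning that reconstructs surface temperatures under forest occlusion using autonomous drones, for early wildfire monitoring.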

Details

Motivation: To enable fully automated aerial wildfire monitoring that detects ground fires before smoke or flames appear, overcoming occlusion by forest vegetation and the thermal blur introduced by synthetic aperture imaging. Method: A visual state space model is trained to recover the subtle thermal signals of partially occluded ground surfaces and hotspots from blurred synthetic aperture thermal data; a latent diffusion model generates a large volume of realistic surface temperature simulations, and the dataset is expanded through temperature augmentation and procedural thermal forest simulation. Result: On simulated data, RMSE is reduced by a factor of 2 to 2.5 relative to conventional thermal imaging and uncorrected SA imaging; in field experiments on high-temperature hotspots, the RMSE improvement reaches 12.8-fold (over conventional thermal imaging) and 2.6-fold (over uncorrected SA); the complete morphology of fire and human thermal signatures is reconstructed despite partial occlusion. Conclusion: The method substantially improves the detection of surface thermal signals in occluded environments, generalizes well, and can be used for early wildfire warning and search and rescue. Abstract: We introduce a novel method for reconstructing surface temperatures through occluding forest vegetation by combining signal processing and machine learning. Our goal is to enable fully automated aerial wildfire monitoring using autonomous drones, allowing for the early detection of ground fires before smoke or flames are visible. While synthetic aperture (SA) sensing mitigates occlusion from the canopy and sunlight, it introduces thermal blur that obscures the actual surface temperatures. To address this, we train a visual state space model to recover the subtle thermal signals of partially occluded soil and fire hotspots from this blurred data. A key challenge was the scarcity of real-world training data. We overcome this by integrating a latent diffusion model into a vector-quantized model to generate a large volume of realistic surface temperature simulations from real wildfire recordings, which we further expanded through temperature augmentation and procedural thermal forest simulation. On simulated data across varied ambient and surface temperatures, forest densities, and sunlight conditions, our method reduced the RMSE by a factor of 2 to 2.5 compared to conventional thermal and uncorrected SA imaging. In field experiments focused on high-temperature hotspots, the improvement was even more significant, with a 12.8-fold RMSE gain over conventional thermal and a 2.6-fold gain over uncorrected SA images. We also demonstrate our model's generalization to other thermal signals, such as human signatures for search and rescue.
Since simple thresholding is frequently inadequate for detecting subtle thermal signals, the morphological characteristics are equally essential for accurate classification. Our experiments demonstrated another clear advantage: we reconstructed the complete morphology of fire and human signatures, whereas conventional imaging is defeated by partial occlusion.
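
To make the "N-fold RMSE gain" wording concrete, the following is a minimal sketch of how such a factor can be computed: the baseline RMSE divided by the method's RMSE against the same ground truth. The temperature values and the names `rmse` and `improvement_factor` are illustrative assumptions, not data or code from the paper.

```python
import math

def rmse(pred, target):
    """Root-mean-square error between two equal-length sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))

def improvement_factor(baseline_rmse, method_rmse):
    """How many times smaller the method's RMSE is than the baseline's."""
    return baseline_rmse / method_rmse

# Hypothetical surface temperatures in degrees Celsius (not from the paper).
ground_truth = [20.0, 20.0, 300.0, 20.0]   # one hotspot among cool soil
blurred = [60.0, 60.0, 140.0, 60.0]        # hotspot smeared by thermal blur
restored = [30.0, 30.0, 260.0, 30.0]       # output of some restoration model

factor = improvement_factor(rmse(blurred, ground_truth), rmse(restored, ground_truth))
print(f"RMSE improvement factor: {factor:.2f}")
```

A factor above 1 means the restored estimate is closer to the ground truth than the baseline; the abstract reports factors of this kind (2 to 2.5 in simulation, up to 12.8 in the field).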

[216] Beyond Pixels: Semantic-aware Typographic Attack for Geo-Privacy Protection

Jiayi Zhu,Yihao Huang,Yue Cao,Xiaojun Jia,Qing Guo,Felix Juefei-Xu,Geguang Pu,Bin Wang

Main category: cs.CV

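TL;DR: Proposes a semantics-aware typographic attack that adds deceptive text outside the image content to protect users' geolocation privacy while preserving the image's visual quality.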

Details

Motivation: Large Visual Language Models (LVLMs) can infer a user's geolocation from social media images, creating a serious privacy risk; existing adversarial perturbation methods noticeably degrade image quality, which limits their practicality. Method: Explore typographic attacks as a new direction for geo-privacy protection and design a two-stage, semantics-aware typographic attack that generates external text with distracting semantics to mislead LVLM geolocation inference. Result: Experiments on three datasets against five state-of-the-art commercial LVLMs show that the method significantly reduces geolocation prediction accuracy while preserving the visual integrity of the image. Conclusion: The proposed semantics-aware typographic attack is a practical, visually lossless defense against the geo-privacy threat posed by LVLMs. Abstract: Large Visual Language Models (LVLMs) now pose a serious yet overlooked privacy threat, as they can infer a social media user's geolocation directly from shared images, leading to unintended privacy leakage. While adversarial image perturbations provide a potential direction for geo-privacy protection, they require relatively strong distortions to be effective against LVLMs, which noticeably degrade visual quality and diminish an image's value for sharing. To overcome this limitation, we identify typographical attacks as a promising direction for protecting geo-privacy by adding a text extension outside the visual content. We further investigate which textual semantics are effective in disrupting geolocation inference and design a two-stage, semantics-aware typographical attack that generates deceptive text to protect user privacy. Extensive experiments across three datasets demonstrate that our approach significantly reduces geolocation prediction accuracy of five state-of-the-art commercial LVLMs, establishing a practical and visually-preserving protection strategy against emerging geo-privacy threats.

[217] TempoMaster: Efficient Long Video Generation via Next-Frame-Rate Prediction

Yukuo Ma,Cong Liu,Junke Wang,Junqi Liu,Haibin Huang,Zuxuan Wu,Chi Zhang,Xuelong Li

Main category: cs.CV

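TL;DR: TempoMaster is a new long video generation framework that produces high-quality video by progressively increasing the frame rate, advancing both visual and temporal quality.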

Details

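Motivation: Existing video generation methods struggle to maintain temporal coherence and detail quality on long videos, so a more efficient method that preserves continuity is needed. Method: Formulate long video generation as next-frame-rate prediction: first generate a low-frame-rate clip as a blueprint of the overall structure, then raise the frame rate level by level to refine visual detail and motion continuity; use bidirectional attention within each frame-rate level and autoregression across frame rates. Result: Experiments show that TempoMaster achieves state-of-the-art performance on multiple metrics, with strong visual fidelity and temporal consistency. Conclusion: Through autoregression across frame rates and level-wise bidirectional attention, TempoMaster addresses the trade-off between efficiency and quality in long video generation. Abstract: We present TempoMaster, a novel framework that formulates long video generation as next-frame-rate prediction. Specifically, we first generate a low-frame-rate clip that serves as a coarse blueprint of the entire video sequence, and then progressively increase the frame rate to refine visual details and motion continuity. During generation, TempoMaster employs bidirectional attention within each frame-rate level while performing autoregression across frame rates, thus achieving long-range temporal coherence while enabling efficient and parallel synthesis. Extensive experiments demonstrate that TempoMaster establishes a new state-of-the-art in long video generation, excelling in both visual and temporal quality.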
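
The coarse-to-fine ordering described in the abstract can be illustrated with a small scheduling sketch. It only lists which frame indices would be newly produced at each frame-rate level; it does not generate video. The function name `frame_rate_schedule`, the starting stride, and the doubling factor are assumptions for illustration, since the abstract does not state the actual schedule.

```python
def frame_rate_schedule(num_frames, base_stride, factor=2):
    """Return, per frame-rate level, the frame indices first produced at that level.

    Level 0 is the low-frame-rate blueprint (every `base_stride`-th frame).
    Each later level divides the stride by `factor` (an assumed value) until
    every frame is covered.
    """
    if num_frames <= 0 or base_stride <= 0 or factor < 2:
        raise ValueError("num_frames and base_stride must be positive, factor >= 2")
    levels = []
    seen = set()
    stride = base_stride
    while True:
        # Frames on the current temporal grid that earlier levels did not cover.
        new_indices = [i for i in range(0, num_frames, stride) if i not in seen]
        levels.append(new_indices)
        seen.update(new_indices)
        if stride == 1:
            break
        stride = max(1, stride // factor)
    return levels

for level, indices in enumerate(frame_rate_schedule(num_frames=9, base_stride=4)):
    print(f"level {level}: {indices}")
```

In this reading, frames within one level could be synthesized jointly (matching the bidirectional attention within a level), while each level conditions on the frames of the coarser levels before it (matching the autoregression across frame rates).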

[218] Rank-Aware Agglomeration of Foundation Models for Immunohistochemistry Image Cell Counting

Zuqi Huang,Mengxin Tian,Huan Liu,Wentao Li,Baobao Liang,Jie Wu,Fang Yan,Zhaoqing Tang,Zhongyu Li

Main category: cs.CV

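TL;DR: Proposes CountIHC, a rank-aware agglomeration framework for multi-class cell counting in immunohistochemistry images that combines the strengths of multiple foundation models and adds a vision-language alignment strategy, substantially improving counting performance across many biomarkers and tissue types.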

Details

Motivation: Accurate cell counting in immunohistochemistry images is critical for cancer diagnosis, but chromogen overlap, staining variability, and diverse cell morphologies make it hard for existing regression methods to perform end-to-end multi-class counting, and the potential of foundation models for this task remains underexplored. Method: Propose the Rank-Aware Teacher Selecting (RATS) strategy, which models global-to-local patch rankings to assess each teacher's counting capacity and enable sample-wise teacher selection; agglomerate multiple foundation models through knowledge distillation; in the fine-tuning stage, recast multi-class counting as a vision-language alignment problem in which semantic anchors generated from structured text prompts guide the regression of class-specific density maps. Result: CountIHC surpasses state-of-the-art methods on 12 IHC biomarkers and 5 tissue types, agrees closely with pathologists' assessments, and also generalizes well to H&E-stained data. Conclusion: Through rank-aware knowledge distillation and vision-language alignment, the method handles heterogeneity and cell overlap in IHC images and delivers high-performing, scalable multi-class cell counting. Abstract: Accurate cell counting in immunohistochemistry (IHC) images is critical for quantifying protein expression and aiding cancer diagnosis. However, the task remains challenging due to the chromogen overlap, variable biomarker staining, and diverse cellular morphologies. Regression-based counting methods offer advantages over detection-based ones in handling overlapped cells, yet rarely support end-to-end multi-class counting. Moreover, the potential of foundation models remains largely underexplored in this paradigm. To address these limitations, we propose a rank-aware agglomeration framework that selectively distills knowledge from multiple strong foundation models, leveraging their complementary representations to handle IHC heterogeneity and obtain a compact yet effective student model, CountIHC. Unlike prior task-agnostic agglomeration strategies that either treat all teachers equally or rely on feature similarity, we design a Rank-Aware Teacher Selecting (RATS) strategy that models global-to-local patch rankings to assess each teacher's inherent counting capacity and enable sample-wise teacher selection. For multi-class cell counting, we introduce a fine-tuning stage that reformulates the task as vision-language alignment. Discrete semantic anchors derived from structured text prompts encode both category and quantity information, guiding the regression of class-specific density maps and improving counting for overlapping cells. 
Extensive experiments demonstrate that CountIHC surpasses state-of-the-art methods across 12 IHC biomarkers and 5 tissue types, while exhibiting high agreement with pathologists' assessments. Its effectiveness on H&E-stained data further confirms the scalability of the proposed method.

[219] Fine-Grained Representation for Lane Topology Reasoning

Guoqing Xu,Yiheng Li,Yang Yang

Main category: cs.CV

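TL;DR: Proposes TopoFG, a fine-grained lane topology reasoning framework that introduces spatial and sequential priors together with a denoising strategy, significantly improving lane topology modeling for complex road structures and reaching state-of-the-art performance on the OpenLane-V2 benchmark.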

Details

Motivation: Existing methods typically represent each lane with a single query, which cannot accurately model complex lane topology and leads to unreliable topology prediction; this work aims to make lane topology modeling finer-grained and more robust. Method: Propose the TopoFG framework with three phases: (1) a Hierarchical Prior Extractor (HPE) extracts global spatial priors from the BEV mask and local sequential priors from in-lane keypoint sequences; (2) a Region-Focused Decoder (RFD) combines these priors to build fine-grained queries, samples reference points in RoI regions of the mask, and refines the query representations through cross-attention; (3) topology reasoning is performed on boundary-point query features, with a topological denoising strategy to reduce matching ambiguity. Result: New state-of-the-art performance on the OpenLane-V2 dataset, with an OLS of 48.0% on subsetA and 45.4% on subsetB. Conclusion: By fusing spatial and sequential priors and introducing fine-grained queries with a denoising mechanism, TopoFG improves the accuracy and reliability of lane topology prediction in complex scenes and has strong practical value. Abstract: Precise modeling of lane topology is essential for autonomous driving, as it directly impacts navigation and control decisions. Existing methods typically represent each lane with a single query and infer topological connectivity based on the similarity between lane queries. However, this kind of design struggles to accurately model complex lane structures, leading to unreliable topology prediction. In view of this, we propose a Fine-Grained lane topology reasoning framework (TopoFG). It divides the procedure from bird's-eye-view (BEV) features to topology prediction via fine-grained queries into three phases, i.e., Hierarchical Prior Extractor (HPE), Region-Focused Decoder (RFD), and Robust Boundary-Point Topology Reasoning (RBTR). Specifically, HPE extracts global spatial priors from the BEV mask and local sequential priors from in-lane keypoint sequences to guide subsequent fine-grained query modeling. RFD constructs fine-grained queries by integrating the spatial and sequential priors. 
It then samples reference points in RoI regions of the mask and applies cross-attention with BEV features to refine the query representations of each lane. RBTR models lane connectivity based on boundary-point query features and further employs a topological denoising strategy to reduce matching ambiguity. By integrating spatial and sequential priors into fine-grained queries and applying a denoising strategy to boundary-point topology reasoning, our method precisely models complex lane structures and delivers trustworthy topology predictions. Extensive experiments on the OpenLane-V2 benchmark demonstrate that TopoFG achieves new state-of-the-art performance, with an OLS of 48.0% on subsetA and 45.4% on subsetB.

[220] Seg-VAR: Image Segmentation with Visual Autoregressive Modeling

Rongkun Zheng,Lu Qi,Xi Chen,Yi Wang,Kun Wang,Hengshuang Zhao

Main category: cs.CV

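TL;DR: Proposes Seg-VAR, a new framework that treats segmentation as a conditional autoregressive mask generation problem and outperforms prior methods on a range of segmentation tasks.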

Details

Motivation: Explore the potential of visual autoregressive modeling, so far used mainly for image generation, for segmentation tasks that require precise low-level spatial perception. Method: Replace discriminative learning with latent learning, design a three-component framework consisting of an image encoder, a spatial-aware seglat encoder, and a decoder, and adopt a multi-stage training strategy. Result: Experiments show that Seg-VAR outperforms previous discriminative and generative methods on multiple segmentation tasks and benchmarks. Conclusion: By recasting segmentation as a sequential hierarchical prediction task, Seg-VAR opens a new direction for bringing autoregressive reasoning into spatial-aware vision systems. Abstract: While visual autoregressive modeling (VAR) strategies have shed light on image generation with autoregressive models, their potential for segmentation, a task that requires precise low-level spatial perception, remains unexplored. Inspired by the multi-scale modeling of classic Mask2Former-based models, we propose Seg-VAR, a novel framework that rethinks segmentation as a conditional autoregressive mask generation problem. This is achieved by replacing the discriminative learning with the latent learning process. Specifically, our method incorporates three core components: (1) an image encoder generating latent priors from input images, (2) a spatial-aware seglat (a latent expression of segmentation mask) encoder that maps segmentation masks into discrete latent tokens using a location-sensitive color mapping to distinguish instances, and (3) a decoder reconstructing masks from these latents. A multi-stage training strategy is introduced: first learning seglat representations via image-seglat joint training, then refining latent transformations, and finally aligning image-encoder-derived latents with seglat distributions. Experiments show Seg-VAR outperforms previous discriminative and generative methods on various segmentation tasks and validation benchmarks. By framing segmentation as a sequential hierarchical prediction task, Seg-VAR opens new avenues for integrating autoregressive reasoning into spatial-aware vision systems. Code will be available at https://github.com/rkzheng99/Seg-VAR.

[221] LoRA-Enhanced Vision Transformer for Single Image based Morphing Attack Detection via Knowledge Distillation from EfficientNet

Ria Shekhawat,Sushrut Patwardhan,Raghavendra Ramachandra,Praveen Kumar Chandaliya,Kishor P. Upla

Main category: cs.CV

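TL;DR: Proposes a Single-Image Morphing Attack Detection (S-MAD) method built on a teacher-student framework that combines CNN and ViT models and uses LoRA for efficient fine-tuning, outperforming existing methods in both detection performance and computational efficiency.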

Details

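Motivation: Face recognition systems are vulnerable to face morphing attacks, and existing detection methods fall short in efficiency and generalization, so detection accuracy and practicality in the single-image setting need to improve. Method: A CNN-based teacher model guides a ViT-based student model through knowledge distillation; Low-Rank Adaptation (LoRA) is introduced for lightweight fine-tuning to reduce computational cost. Result: Tested on a morphing dataset built from three public face datasets and covering ten different morphing algorithms, the method outperforms six state-of-the-art S-MAD methods in detection accuracy and speed. Conclusion: The proposed S-MAD method keeps detection accuracy high while significantly improving computational efficiency, giving it good potential for real deployment, especially for morphing attack defense in resource-constrained environments. Abstract: Face Recognition Systems (FRS) are critical for security but remain vulnerable to morphing attacks, where synthetic images blend biometric features from multiple individuals. We propose a novel Single-Image Morphing Attack Detection (S-MAD) approach using a teacher-student framework, where a CNN-based teacher model refines a ViT-based student model. To improve efficiency, we integrate Low-Rank Adaptation (LoRA) for fine-tuning, reducing computational costs while maintaining high detection accuracy. Extensive experiments are conducted on a morphing dataset built from three publicly available face datasets, incorporating ten different morphing generation algorithms to assess robustness. The proposed method is benchmarked against six state-of-the-art S-MAD techniques, demonstrating superior detection performance and computational efficiency.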
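
For readers unfamiliar with LoRA, the following is a minimal sketch of the standard low-rank update, W' = W + (alpha / r) * B A, written in plain Python so it runs without dependencies. It is not the paper's implementation: the matrix sizes, the `alpha` value, and the helper names are illustrative assumptions, and the abstract does not say which ViT layers are adapted.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lora_effective_weight(w, a, b, alpha):
    """Frozen weight W (d_out x d_in) plus the scaled low-rank update B @ A.

    A has shape (r x d_in) and B has shape (d_out x r); only A and B are trained.
    """
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

def lora_trainable_params(d_out, d_in, r):
    """Number of trainable values in A and B for one adapted weight matrix."""
    return r * (d_in + d_out)

# Tiny hypothetical example: a 2x2 identity weight with a rank-1 update.
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 2.0]]        # r=1, d_in=2
b = [[1.0], [0.0]]      # d_out=2, r=1
print(lora_effective_weight(w, a, b, alpha=2.0))

# Assumed sizes for illustration: a 768x768 projection adapted with rank 8.
print(lora_trainable_params(768, 768, 8), "trainable vs", 768 * 768, "frozen")
```

The second print shows why the approach is cheap: the adapter trains roughly 2% as many values as the full matrix in this assumed configuration.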

[222] Pixels or Positions? Benchmarking Modalities in Group Activity Recognition

Drishya Karki,Merey Ramazanova,Anthony Cioppa,Silvio Giancola,Bernard Ghanem

Main category: cs.CV

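TL;DR: Introduces SoccerNet-GAR, a multimodal football activity recognition dataset containing broadcast video and player tracking data, used to compare the pixel (video) and position (tracking) modalities for group activity recognition. The study defines a unified evaluation protocol and proposes a tracking model based on a role-aware graph neural network that beats the video baseline in accuracy, training speed, and parameter count, highlighting the importance of the position modality and role modeling for group activity recognition.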

Details

Motivation: Existing group activity recognition research focuses mostly on the video modality and pays little attention to trajectory data that explicitly encodes spatial interactions, and there is no unified benchmark for a fair cross-modal comparison; a synchronized multimodal dataset is therefore needed to assess how effective each modality is. Method: Build the SoccerNet-GAR dataset from the 64 matches of the 2022 World Cup, synchronizing broadcast video with player tracking data and annotating 94,285 group activity instances across 10 categories. Define a unified evaluation protocol comparing a video-based classifier with a tracking-based classifier built on graph neural networks; the tracking model uses a role-aware graph structure that models tactical structure through positional edges and temporal attention. Result: The proposed tracking model reaches 67.2% balanced accuracy for group activity recognition, well above the 58.1% of the best video baseline, while training 4.25 times faster with only 197K parameters versus 86.3M for the video model. Conclusion: The position-trajectory modality outperforms the video modality for group activity recognition, particularly in efficiency; a role-aware graph structure helps capture team tactical patterns, and future work should pay attention to modality choice and structured modeling. Abstract: Group Activity Recognition (GAR) is well studied on the video modality for surveillance and indoor team sports (e.g., volleyball, basketball). Yet, other modalities such as agent positions and trajectories over time, i.e., tracking, remain comparatively under-explored despite being compact, agent-centric signals that explicitly encode spatial interactions. Understanding whether pixel (video) or position (tracking) modalities lead to better group activity recognition is therefore important to drive further research on the topic. However, no standardized benchmark currently exists that aligns broadcast video and tracking data for the same group activities, leading to a lack of apples-to-apples comparison between these modalities for GAR. In this work, we introduce SoccerNet-GAR, a multimodal dataset built from the $64$ matches of the football World Cup 2022. Specifically, the broadcast videos and player tracking modalities for $94{,}285$ group activities are synchronized and annotated with $10$ categories. Furthermore, we define a unified evaluation protocol to benchmark two strong unimodal approaches: (i) a competitive video-based classifier and (ii) a tracking-based classifier leveraging graph neural networks. In particular, our novel role-aware graph architecture for tracking-based GAR directly encodes tactical structure through positional edges and temporal attention. 
Our tracking model achieves $67.2\%$ balanced accuracy compared to $58.1\%$ for the best video baseline, while training $4.25 \times$ faster with $438 \times$ fewer parameters ($197K$ vs. $86.3M$). This study provides new insights into the relative strengths of pixels and positions for group activity recognition. Overall, it highlights the importance of modality choice and role-aware modeling for GAR.
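
The headline metric here is balanced accuracy, which is the mean of per-class recall and so is not dominated by frequent classes. A minimal sketch follows, assuming the usual definition; the class labels and predictions are hypothetical, and only the two parameter counts are taken from the abstract.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)

# Hypothetical imbalanced labels (class names are illustrative only).
y_true = ["pass"] * 8 + ["shot"] * 2
y_pred = ["pass"] * 8 + ["pass", "shot"]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("plain accuracy:   ", plain_accuracy)
print("balanced accuracy:", balanced_accuracy(y_true, y_pred))

# Parameter ratio implied by the counts quoted in the abstract (86.3M vs. 197K).
param_ratio = 86.3e6 / 197e3
print(f"parameter ratio: about {param_ratio:.0f}x")
```

With imbalanced classes, plain accuracy can look high even when a rare class is recognized poorly, which is presumably why a balanced metric is reported for the 10 activity categories.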

[223] Open-World Test-Time Adaptation with Hierarchical Feature Aggregation and Attention Affine

Ziqiong Liu,Yushun Tang,Junyang Ji,Zhihai He

Main category: cs.CV

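TL;DR: Proposes a test-time adaptation method based on a Hierarchical Ladder Network (HLN) and an Attention Affine Network (AAN) that uses OOD detection and attention refinement to improve robustness and classification performance under out-of-distribution samples and domain shift.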

Details

Motivation: Existing test-time adaptation methods degrade markedly when they meet out-of-distribution (OOD) samples, tending to misclassify them as known classes, which hurts both prediction accuracy and the adaptation process. Method: Design a Hierarchical Ladder Network (HLN) that extracts OOD features from class tokens aggregated across the Transformer layers, and fuse the original model prediction with the HLN output by weighted probability fusion to strengthen OOD detection; introduce an Attention Affine Network (AAN) that adaptively adjusts the self-attention mechanism based on token information to better handle domain shift; use a weighted entropy mechanism to dynamically suppress the influence of low-confidence samples during adaptation. Result: Experiments on several benchmark datasets show that the method clearly outperforms existing TTA methods, with stronger robustness and higher classification accuracy especially in settings that include OOD samples and domain shift. Conclusion: The joint HLN and AAN framework improves test-time adaptation in complex real-world settings; by jointly optimizing OOD recognition and the attention mechanism, it makes the model more robust to distribution change. Abstract: Test-time adaptation (TTA) refers to adjusting the model during the testing phase to cope with changes in sample distribution and enhance the model's adaptability to new environments. In real-world scenarios, models often encounter samples from unseen (out-of-distribution, OOD) categories. Misclassifying these as known (in-distribution, ID) classes not only degrades predictive accuracy but can also impair the adaptation process, leading to further errors on subsequent ID samples. Many existing TTA methods suffer substantial performance drops under such conditions. To address this challenge, we propose a Hierarchical Ladder Network (HLN) that extracts OOD features from class tokens aggregated across all Transformer layers. OOD detection performance is enhanced by combining the original model prediction with the output of the HLN via weighted probability fusion. To improve robustness under domain shift, we further introduce an Attention Affine Network (AAN) that adaptively refines the self-attention mechanism conditioned on the token information to better adapt to domain drift, thereby improving the classification performance of the model on datasets with domain shift. Additionally, a weighted entropy mechanism is employed to dynamically suppress the influence of low-confidence samples during adaptation. Experimental results on benchmark datasets show that our method significantly improves the performance on the most widely used classification datasets.
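
Two of the mechanisms named in the abstract, weighted probability fusion and a weighted entropy objective, can be sketched generically as follows. This is one plausible reading, not the authors' formulation: the convex-combination fusion, the confidence weight `1 - H / log(C)`, and all function names are assumptions, since the abstract gives no formulas.

```python
import math

def fuse_ood_probability(p_model, p_hln, weight=0.5):
    """Convex combination of two OOD probabilities (assumed fusion rule)."""
    return weight * p_model + (1.0 - weight) * p_hln

def entropy(probs):
    """Shannon entropy of a categorical distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def confidence_weight(probs):
    """Weight in [0, 1]: 1 for a one-hot prediction, 0 for a uniform one."""
    return 1.0 - entropy(probs) / math.log(len(probs))

def weighted_entropy(batch):
    """Entropy averaged over a batch, down-weighting low-confidence samples."""
    weights = [confidence_weight(p) for p in batch]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * entropy(p) for w, p in zip(weights, batch)) / total

# Hypothetical softmax outputs for two test samples.
batch = [[0.5, 0.5], [0.9, 0.1]]
print("fused OOD probability:", fuse_ood_probability(0.2, 0.8, weight=0.25))
print("weighted entropy:     ", weighted_entropy(batch))
```

In this sketch the uniform prediction receives a weight of essentially zero, so the adaptation objective is driven by the confident sample; a real implementation would compute this on tensors and backpropagate through it.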

[224] OPFormer: Object Pose Estimation leveraging foundation model with geometric encoding

Artem Moroz,Vít Zeman,Martin Mikšík,Elizaveta Isianova,Miroslav David,Pavel Burget,Varun Burde

Main category: cs.CV

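TL;DR: Proposes a unified end-to-end framework that combines object detection with pose estimation and supports fast object onboarding from CAD models or NeRF.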

Details

Motivation: Address accurate 6D pose estimation when no 3D model is available and make the system adaptable to different input conditions. Method: Use the CNOS detector to localize objects, then estimate pose with the transformer-based OPFormer module, which combines multi-view templates with NOCS geometric priors. Result: On the BOP benchmarks the system shows a good balance between accuracy and efficiency and applies to both model-based and model-free scenarios. Conclusion: The framework is practical and robust across a range of conditions and advances the real-world use of 6D pose estimation. Abstract: We introduce a unified, end-to-end framework that seamlessly integrates object detection and pose estimation with a versatile onboarding process. Our pipeline begins with an onboarding stage that generates object representations from either traditional 3D CAD models or, in their absence, by rapidly reconstructing a high-fidelity neural representation (NeRF) from multi-view images. Given a test image, our system first employs the CNOS detector to localize target objects. For each detection, our novel pose estimation module, OPFormer, infers the precise 6D pose. The core of OPFormer is a transformer-based architecture that leverages a foundation model for robust feature extraction. It uniquely learns a comprehensive object representation by jointly encoding multiple template views and enriches these features with explicit 3D geometric priors using Normalized Object Coordinate Space (NOCS). A decoder then establishes robust 2D-3D correspondences to determine the final pose. Evaluated on the challenging BOP benchmarks, our integrated system demonstrates a strong balance between accuracy and efficiency, showcasing its practical applicability in both model-based and model-free scenarios.

[225] C3Net: Context-Contrast Network for Camouflaged Object Detection

Baber Jan,Aiman H. El-Maleh,Abdul Jabbar Siddiqui,Abdul Bais,Saeed Anwar

Main category: cs.CV

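TL;DR: Proposes C3Net, a dual-pathway decoder network that targets six major challenges in camouflaged object detection (COD) and achieves state-of-the-art performance on multiple benchmarks.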

Details

Motivation: Existing methods perform poorly on camouflaged objects that closely resemble their background and struggle with blurred edges, large scale variation, and complex environments, so a purpose-built network architecture is needed to address these challenges systematically. Method: Propose C3Net with two pathways: an Edge Refinement Pathway that uses gradient-initialized Edge Enhancement Modules to recover precise boundaries, and a Contextual Localization Pathway that uses an Image-based Context Guidance mechanism to suppress intrinsic saliency; an Attentive Fusion Module combines them through spatial gating. Result: C3Net achieves S-measures of 0.898 on COD10K, 0.904 on CAMO, and 0.913 on NC4K, leading performance with efficient computation. Conclusion: Complex camouflaged object detection calls for targeted architectural innovation; C3Net covers the multiple challenges through specialized components that work together, outperforming isolated improvements. Abstract: Camouflaged object detection identifies objects that blend seamlessly with their surroundings through similar colors, textures, and patterns. This task challenges both traditional segmentation methods and modern foundation models, which fail dramatically on camouflaged objects. We identify six fundamental challenges in COD: Intrinsic Similarity, Edge Disruption, Extreme Scale Variation, Environmental Complexities, Contextual Dependencies, and Salient-Camouflaged Object Disambiguation. These challenges frequently co-occur and compound the difficulty of detection, requiring comprehensive architectural solutions. We propose C3Net, which addresses all challenges through a specialized dual-pathway decoder architecture. The Edge Refinement Pathway employs gradient-initialized Edge Enhancement Modules to recover precise boundaries from early features. The Contextual Localization Pathway utilizes our novel Image-based Context Guidance mechanism to achieve intrinsic saliency suppression without external models. An Attentive Fusion Module synergistically combines the two pathways via spatial gating. C3Net achieves state-of-the-art performance with S-measures of 0.898 on COD10K, 0.904 on CAMO, and 0.913 on NC4K, while maintaining efficient processing. C3Net demonstrates that complex, multifaceted detection challenges require architectural innovation, with specialized components working synergistically to achieve comprehensive coverage beyond isolated improvements. Code, model weights, and results are available at https://github.com/Baber-Jan/C3Net.

[226] Multivariate Diffusion Transformer with Decoupled Attention for High-Fidelity Mask-Text Collaborative Facial Generation

Yushe Cao,Dianxi Shi,Xing Fu,Xuechao Zou,Haikuo Peng,Xueqi Li,Chun Yu,Junliang Xing

Main category: cs.CV

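TL;DR: Proposes MDiTFace, a new diffusion transformer framework for multimodal face generation that uses a unified tokenization strategy and a decoupled attention mechanism to achieve efficient cross-modal interaction and reduced computation.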

Details

Motivation: Conventional multimodal fusion methods lack effective interaction between semantic masks and text inputs, which leads to poor generation results. Method: Propose the MDiTFace framework, which processes multimodal inputs with unified tokenization, uses stacked multivariate transformer blocks for synchronous condition modeling, and introduces a decoupled attention mechanism that separates dynamic and static computation pathways to improve efficiency. Result: Experiments show that the method significantly outperforms existing methods in facial fidelity and conditional consistency while cutting the computational overhead of the mask condition by more than 94%. Conclusion: Through stronger cross-modal interaction and an efficient attention mechanism, MDiTFace balances high performance with low computational cost in multimodal face generation. Abstract: While significant progress has been achieved in multimodal facial generation using semantic masks and textual descriptions, conventional feature fusion approaches often fail to enable effective cross-modal interactions, thereby leading to suboptimal generation outcomes. To address this challenge, we introduce MDiTFace, a customized diffusion transformer framework that employs a unified tokenization strategy to process semantic mask and text inputs, eliminating discrepancies between heterogeneous modality representations. The framework facilitates comprehensive multimodal feature interaction through stacked, newly designed multivariate transformer blocks that process all conditions synchronously. Additionally, we design a novel decoupled attention mechanism by dissociating implicit dependencies between mask tokens and temporal embeddings. This mechanism segregates internal computations into dynamic and static pathways, enabling caching and reuse of features computed in static pathways after initial calculation, thereby reducing additional computational overhead introduced by mask condition by over 94% while maintaining performance. Extensive experiments demonstrate that MDiTFace significantly outperforms other competing methods in terms of both facial fidelity and conditional consistency.

[227] Denoising Vision Transformer Autoencoder with Spectral Self-Regularization

Xunzhi Xiang,Xingye Tian,Guiyu Zhang,Yabo Chen,Shaofeng Zhang,Xuebo Wang,Xin Tao,Qi Fan

Main category: cs.CV

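TL;DR: Proposes Denoising-VAE, built on spectral self-regularization and a spectral alignment strategy, to address redundant high-frequency components in high-dimensional latent spaces that slow the training convergence of diffusion models; it achieves faster convergence and strong reconstruction and generation performance without relying on vision foundation models.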

Details

Motivation: A high-dimensional latent space improves VAE reconstruction quality, but its redundant high-frequency components hinder the training convergence of diffusion models; existing methods rely on external vision foundation models for regularization and lack a deeper understanding of how the latents affect generative model optimization. Method: Propose a spectral self-regularization strategy that suppresses high-frequency noise in the high-dimensional latent space while preserving reconstruction quality, design a ViT-based Denoising-VAE, and introduce a spectral alignment strategy to ease generative model optimization. Result: On ImageNet 256×256, convergence is about 2 times faster, with state-of-the-art reconstruction quality (rFID = 0.28, PSNR = 27.26) and competitive generation performance (gFID = 1.82). Conclusion: Through regularization in the spectral domain, Denoising-VAE mitigates the negative effect of high-dimensional latent spaces on diffusion model training and improves reconstruction quality, generation performance, and training efficiency without external models. Abstract: Variational autoencoders (VAEs) typically encode images into a compact latent space, reducing computational cost but introducing an optimization dilemma: a higher-dimensional latent space improves reconstruction fidelity but often hampers generative performance. Recent methods attempt to address this dilemma by regularizing high-dimensional latent spaces using external vision foundation models (VFMs). However, it remains unclear how high-dimensional VAE latents affect the optimization of generative models. To our knowledge, our analysis is the first to reveal that redundant high-frequency components in high-dimensional latent spaces hinder the training convergence of diffusion models and, consequently, degrade generation quality. To alleviate this problem, we propose a spectral self-regularization strategy to suppress redundant high-frequency noise while simultaneously preserving reconstruction quality. The resulting Denoising-VAE, a ViT-based autoencoder that does not rely on VFMs, produces cleaner, lower-noise latents, leading to improved generative quality and faster optimization convergence. We further introduce a spectral alignment strategy to facilitate the optimization of Denoising-VAE-based generative models. Our complete method enables diffusion models to converge approximately 2$\times$ faster than with SD-VAE, while achieving state-of-the-art reconstruction quality (rFID = 0.28, PSNR = 27.26) and competitive generation performance (gFID = 1.82) on the ImageNet 256$\times$256 benchmark.

[228] Medical Knowledge Intervention Prompt Tuning for Medical Image Classification

Ye Du,Nanxi Yu,Shujun Wang

Main category: cs.CV

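TL;DR: Proposes CILMP, which brings large language models (LLMs) into prompt tuning of vision-language foundation models (VLMs) to extract disease-specific features and generate instance-adaptive prompts, outperforming existing prompt tuning methods on a variety of medical image classification tasks.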

Details

Motivation: Existing prompt tuning methods cannot precisely distinguish different medical concepts and overlook key disease-related features across medical imaging modalities; LLMs are good at capturing specialized medical knowledge, so the aim is to bring them into the prompt tuning process to improve performance. Method: Propose CILMP (Conditional Intervention of Large Language Models for Prompt Tuning), which extracts disease-specific representations from LLMs, intervenes within a low-rank linear subspace, and uses a conditional mechanism to generate instance-adaptive prompts for each medical image, thereby injecting medical knowledge into VLM prompts. Result: Experiments on multiple diverse medical image datasets show that CILMP consistently outperforms current state-of-the-art prompt tuning methods, with better performance and adaptability. Conclusion: CILMP effectively combines the medical knowledge of LLMs with VLM prompt tuning, improving feature representation and model performance in medical image classification and offering a new paradigm for efficient, accurate medical vision tasks. Abstract: Vision-language foundation models (VLMs) have shown great potential in feature transfer and generalization across a wide spectrum of medical-related downstream tasks. However, fine-tuning these models is resource-intensive due to their large number of parameters. Prompt tuning has emerged as a viable solution to mitigate memory usage and reduce training time while maintaining competitive performance. Nevertheless, existing prompt tuning methods cannot precisely distinguish different kinds of medical concepts, and therefore miss essential disease-specific features across various medical imaging modalities in medical image classification tasks. We find that Large Language Models (LLMs), trained on extensive text corpora, are particularly adept at providing this specialized medical knowledge. Motivated by this, we propose incorporating LLMs into the prompt tuning process. Specifically, we introduce CILMP, Conditional Intervention of Large Language Models for Prompt Tuning, a method that bridges LLMs and VLMs to facilitate the transfer of medical knowledge into VLM prompts. CILMP extracts disease-specific representations from LLMs, intervenes within a low-rank linear subspace, and utilizes them to create disease-specific prompts. Additionally, a conditional mechanism is incorporated to condition the intervention process on each individual medical image, generating instance-adaptive prompts and thus enhancing adaptability. 
Extensive experiments across diverse medical image datasets demonstrate that CILMP consistently outperforms state-of-the-art prompt tuning methods, demonstrating its effectiveness. Code is available at https://github.com/usr922/cilmp.

[229] DPVO-QAT++: Heterogeneous QAT and CUDA Kernel Fusion for High-Performance Deep Patch Visual Odometry

Cheng Liao

Main category: cs.CV

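TL;DR: Proposes DPVO-QAT++, a hierarchical quantization optimization framework that uses a heterogeneous precision design and CUDA kernel fusion to significantly improve the runtime efficiency of deep visual odometry while preserving trajectory accuracy.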

Details

Motivation: Deep learning-based visual SLAM systems have excellent geometric reasoning capabilities, but their high computational overhead limits deployment on resource-constrained platforms, so an optimization method that balances accuracy and efficiency is needed. Method: Propose the DPVO-QAT++ framework, combining learnable scale parameterization, a heterogeneous precision design for the front-end and back-end (FP16/FP32 fake quantization in the front-end, full precision in the back-end), and GPU-native CUDA kernel fusion for efficient inference. Result: On the TartanAir and EuRoC datasets, the system achieves up to a 52.1% increase in FPS and up to a 64.9% reduction in peak GPU memory while keeping trajectory accuracy (ATE) comparable to the original DPVO. Conclusion: DPVO-QAT++ reconciles high-precision deep visual odometry with practical deployment efficiency and offers a viable engineering solution for embedded platforms. Abstract: Deep learning-based Visual SLAM (vSLAM) systems exhibit exceptional geometric reasoning capabilities, yet their prohibitive computational overhead severely restricts deployment on resource-constrained autonomous platforms. This paper presents a hierarchical quantization optimization framework, DPVO-QAT++ (DPVO-QAT++: Heterogeneous QAT and CUDA Kernel Fusion for High-Performance Deep Patch Visual Odometry). Through the synergistic integration of learnable scale parameterization, a heterogeneous precision design for the Visual Odometry (VO) front-end and back-end (front-end floating-point fake quantization with FP16/FP32; back-end full precision), and GPU-native kernel fusion for fake quantization (custom CUDA kernels), our framework significantly reduces memory footprint and increases processing speed while preserving the trajectory accuracy of the original model. On the TartanAir dataset, our framework achieves an average FPS increase of 52.1%, a 29.1% reduction in median latency, and a 64.9% reduction in peak GPU memory reservation, while maintaining trajectory accuracy (ATE) comparable to the original DPVO model across 32 validation sequences. On the EuRoC dataset, it realizes an average FPS increase of 30.1%, a 23.1% reduction in median latency, and a 37.7% reduction in peak GPU memory reservation, maintaining comparable trajectory accuracy (ATE) across 11 validation sequences. 
Experimental results demonstrate that DPVO-QAT++ effectively bridges the gap between high-precision deep VO and the efficiency requirements for practical deployment, offering a viable engineering paradigm for the application of this technology on real-world embedded platforms. Keywords: Visual Odometry, Heterogeneous Precision Architecture, Quantization-Aware Training, CUDA Kernel Fusion, Scale-Only Training, Deep Patch Visual Odometry, GPU-Native Kernel Fusion.

[230] Toward Real-world Text Image Forgery Localization: Structured and Interpretable Data Synthesis

Zeqin Yu,Haotao Xie,Jian Zhang,Jiangqun Ni,Wenkan Su,Jiwu Huang

Main category: cs.CV

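TL;DR: Proposes FSTS, a Fourier series-based method for synthesizing tampered text images that models real tampering instances to generate more realistic training data, significantly improving how well text image forgery localization models generalize to real-world scenarios.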

Details

Motivation: Existing text image forgery localization methods generalize poorly because real-world datasets are small and synthetic data differs substantially in distribution from real tampering. Method: Build FSTS, a structured and interpretable tampering synthesis framework: first collect 16,750 real tampering instances through multi-format logs; analyze behavioral patterns at the individual and population levels and model the tampering parameters hierarchically in a Fourier series-like manner; approximate the real distribution with basis functions and their weights, then sample from it to generate diverse and realistic training data. Result: Experiments across four evaluation protocols show that models trained on FSTS-generated data generalize significantly better on real-world datasets. Conclusion: Through hierarchical probabilistic modeling grounded in real behavioral patterns, FSTS narrows the gap between synthetic and real tampering and provides a more effective way to generate training data for text image forgery localization. Abstract: Existing Text Image Forgery Localization (T-IFL) methods often suffer from poor generalization due to the limited scale of real-world datasets and the distribution gap caused by synthetic data that fails to capture the complexity of real-world tampering. To tackle this issue, we propose Fourier Series-based Tampering Synthesis (FSTS), a structured and interpretable framework for synthesizing tampered text images. FSTS first collects 16,750 real-world tampering instances from five representative tampering types, using a structured pipeline that records human-performed editing traces via multi-format logs (e.g., video, PSD, and editing logs). By analyzing these collected parameters and identifying recurring behavioral patterns at both individual and population levels, we formulate a hierarchical modeling framework. Specifically, each individual tampering parameter is represented as a compact combination of basis operation-parameter configurations, while the population-level distribution is constructed by aggregating these behaviors. Since this formulation draws inspiration from the Fourier series, it enables an interpretable approximation using basis functions and their learned weights. By sampling from this modeled distribution, FSTS synthesizes diverse and realistic training data that better reflect real-world forgery traces. Extensive experiments across four evaluation protocols demonstrate that models trained with FSTS data achieve significantly improved generalization on real-world datasets. The dataset is available at https://github.com/ZeqinYu/FSTS.

[231] Hi-Reco: High-Fidelity Real-Time Conversational Digital Humans

Hongbin Huang,Junwei Li,Tianxin Xie,Zhuang Li,Cekai Weng,Yaodong Yang,Yue Luo,Li Liu,Jing Tang,Zhijing Shao,Zeyu Wang

Main category: cs.CV

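TL;DR: Proposes a high-fidelity, real-time conversational digital human system that combines a 3D avatar, expressive speech synthesis, and knowledge-grounded dialogue generation to support natural interaction.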

Details

Motivation: Balance visual realism with real-time responsiveness to meet the demand for high-quality digital humans in interactive applications. Method: Use an asynchronous execution pipeline to coordinate the multi-modal components, and introduce retrieval-augmented methods including history augmentation and intent-based routing. Result: The system delivers low-latency natural interaction, with wake word detection, emotionally expressive prosody, and accurate, context-aware response generation. Conclusion: The integrated system enables responsive and believable digital humans suitable for immersive applications in communication, education, and entertainment. Abstract: High-fidelity digital humans are increasingly used in interactive applications, yet achieving both visual realism and real-time responsiveness remains a major challenge. We present a high-fidelity, real-time conversational digital human system that seamlessly combines a visually realistic 3D avatar, persona-driven expressive speech synthesis, and knowledge-grounded dialogue generation. To support natural and timely interaction, we introduce an asynchronous execution pipeline that coordinates multi-modal components with minimal latency. The system supports advanced features such as wake word detection, emotionally expressive prosody, and highly accurate, context-aware response generation. It leverages novel retrieval-augmented methods, including history augmentation to maintain conversational flow and intent-based routing for efficient knowledge access. Together, these components form an integrated system that enables responsive and believable digital humans, suitable for immersive applications in communication, education, and entertainment.

[232] DensePercept-NCSSD: Vision Mamba towards Real-time Dense Visual Perception with Non-Causal State Space Duality

Tushar Anand,Advik Sinha,Abhijit Das

Main category: cs.CV

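TL;DR: Proposes a real-time optical flow and disparity estimation model built on non-causal Mamba blocks. By fusing pairwise images in a non-causal selective state space, it substantially reduces inference time while maintaining high accuracy and low GPU usage, making it suitable for unified real-time 3D dense perception tasks.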

Details

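Motivation: Real-time, accurate 3D dense perception (such as optical flow and disparity estimation) requires lower inference time and compute consumption without sacrificing accuracy, especially in practical deployments with strict efficiency and speed requirements. Method: A model based on non-causal Mamba blocks fuses pairwise input images in a non-causal selective state space to handle dense perception tasks efficiently, balancing speed and accuracy. Result: The model substantially reduces inference time while maintaining high accuracy and low GPU usage, is validated in a real-life scenario, and applies to both optical flow and disparity map generation. Conclusion: The proposed model effectively supports real-time, accurate, unified 3D dense perception tasks and shows good prospects for practical use. Abstract: In this work, we propose an accurate and real-time optical flow and disparity estimation model by fusing pairwise input images in the proposed non-causal selective state space for dense perception tasks. We propose a non-causal Mamba block-based model that is fast and efficient and aptly manages the constraints present in real-time applications. Our proposed model reduces inference times while maintaining high accuracy and low GPU usage for optical flow and disparity map generation. The results, analysis, and validation in a real-life scenario show that our proposed model can be used for unified real-time and accurate 3D dense perception estimation tasks. The code, along with the models, can be found at https://github.com/vimstereo/DensePerceptNCSSD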

[233] Appreciate the View: A Task-Aware Evaluation Framework for Novel View Synthesis

Saar Stern,Ido Sobol,Or Litany

Main category: cs.CV

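TL;DR: Proposes PRISM, an evaluation framework for novel view synthesis (NVS). Using features from the Zero123 model with lightweight tuning, it defines a reference-based and a reference-free metric that more reliably measure the realism and viewpoint consistency of generated images, and it ranks models in agreement with human preference across several benchmarks.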

Details

Motivation: Existing evaluation metrics (such as pixel-wise similarity and distribution-based measures) cannot accurately assess whether NVS outputs are realistic and faithful to the viewpoint transformation, so incorrect results are mis-ranked; a task-aware, more reliable evaluation method is needed. Method: Features are extracted from Zero123, a strong NVS foundation model, and a lightweight tuning step enhances their discriminative power. Two complementary metrics are proposed: the reference-based D_PRISM and the reference-free MMD_PRISM. Result: The metrics are used to evaluate six NVS methods on three benchmarks (Toys4K, GSO, and OmniObject3D); MMD_PRISM yields a clear, stable model ranking that agrees closely with human preference. Conclusion: PRISM provides a principled and practical approach to assessing NVS quality, fills a gap in current evaluation practice, and supports more reliable progress in the field. Abstract: The goal of Novel View Synthesis (NVS) is to generate realistic images of a given content from unseen viewpoints. But how can we trust that a generated image truly reflects the intended transformation? Evaluating its reliability remains a major challenge. While recent generative models, particularly diffusion-based approaches, have significantly improved NVS quality, existing evaluation metrics struggle to assess whether a generated image is both realistic and faithful to the source view and intended viewpoint transformation. Standard metrics, such as pixel-wise similarity and distribution-based measures, often mis-rank incorrect results as they fail to capture the nuanced relationship between the source image, viewpoint change, and generated output. We propose a task-aware evaluation framework that leverages features from a strong NVS foundation model, Zero123, combined with a lightweight tuning step to enhance discrimination. Using these features, we introduce two complementary evaluation metrics: a reference-based score, $D_{\text{PRISM}}$, and a reference-free score, $\text{MMD}_{\text{PRISM}}$. Both reliably identify incorrect generations and rank models in agreement with human preference studies, addressing a fundamental gap in NVS evaluation. Our framework provides a principled and practical approach to assessing synthesis quality, paving the way for more reliable progress in novel view synthesis.
To further support this goal, we apply our reference-free metric to six NVS methods across three benchmarks: Toys4K, Google Scanned Objects (GSO), and OmniObject3D, where $\text{MMD}_{\text{PRISM}}$ produces a clear and stable ranking, with lower scores consistently indicating stronger models.

[234] BridgeEQA: Virtual Embodied Agents for Real Bridge Inspections

Subin Varghese,Joshua Gao,Asad Ur Rahman,Vedhus Hoskere

Main category: cs.CV

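TL;DR: Introduces and releases BridgeEQA, a benchmark for open-vocabulary Embodied Question Answering (EQA) grounded in real bridge inspections, with 200 scenes and 2,200 questions. It adds an Image Citation Relevance metric and proposes the EMVR method to improve model performance under long-range visual reasoning and memory.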

Details

Motivation: Existing EQA research lacks benchmarks close to practical use, making it hard to evaluate multi-scale reasoning, long-range spatial understanding, and semantic association in complex real environments. Bridge inspection is a natural task with a standardized rating system, which makes it a good open-domain testbed for EQA. Method: The BridgeEQA dataset is built from professional inspection reports, yielding 2,200 open-vocabulary questions with 47.93 images per scene on average, and an Image Citation Relevance metric is introduced. The proposed EMVR method models inspection as sequential navigation over a scene graph whose nodes are images, gathering evidence and reasoning within a Markov decision process. Result: State-of-the-art vision-language models perform poorly on BridgeEQA, with substantial gaps in settings that require episodic memory. EMVR clearly outperforms the baselines and shows stronger visual evidence integration and cross-image reasoning. Conclusion: Bridge infrastructure inspection is a challenging and practical direction for EQA research. BridgeEQA offers a reliable benchmark for open-vocabulary, multi-image, long-range reasoning, and EMVR shows the value of combining embodied memory with structured visual reasoning. Abstract: Deploying embodied agents that can answer questions about their surroundings in realistic real-world settings remains difficult, partly due to the scarcity of benchmarks that faithfully capture practical operating conditions. We propose infrastructure inspection as a compelling domain for open-vocabulary Embodied Question Answering (EQA): it naturally demands multi-scale reasoning, long-range spatial understanding, and complex semantic relationships, while offering unique evaluation advantages via standardized National Bridge Inventory (NBI) condition ratings (0-9), professional inspection reports, and egocentric imagery. We introduce BridgeEQA, a benchmark of 2,200 open-vocabulary question-answer pairs (in the style of OpenEQA) grounded in professional inspection reports across 200 real-world bridge scenes with 47.93 images on average per scene. Questions require synthesizing visual evidence across multiple images and aligning responses with NBI condition ratings. We further propose a new EQA metric, Image Citation Relevance, to evaluate the ability of a model to cite relevant images. Evaluations of state-of-the-art vision-language models reveal substantial performance gaps under episodic memory EQA settings.
To address this, we propose Embodied Memory Visual Reasoning (EMVR), which formulates inspection as sequential navigation over an image-based scene graph: images are nodes, and an agent takes actions to traverse views, compare evidence, and reason within a Markov decision process. EMVR shows strong performance over the baselines. We publicly release both the dataset and code.

[235] R$^{2}$Seg: Training-Free OOD Medical Tumor Segmentation via Anatomical Reasoning and Statistical Rejection

Shuaike Shen,Ke Liu,Jiaqing Xie,Shangde Gao,Chunhua Shen,Ge Liu,Mireia Crispin-Ortuzar,Shangqi Gao

Main category: cs.CV

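TL;DR: Proposes R²Seg, a training-free, two-stage Reason-and-Reject framework that makes medical image segmentation more robust to out-of-distribution tumors, using LLM-guided anatomical reasoning and a statistical rejection step to suppress false positives.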

Details

Motivation: Foundation models tend to produce fragmented false positives on out-of-distribution (OOD) tumors and lack robustness. Method: In the first stage (Reason), LLM-guided anatomical reasoning localizes organ anchors and generates multi-scale ROIs. In the second stage (Reject), a two-sample statistical test is applied to the candidates that a frozen foundation model (BiomedParse) produces within those ROIs, and only regions significantly different from normal tissue are kept. Result: On multi-center, multi-modal tumor segmentation benchmarks, R²Seg clearly outperforms strong baselines and the original foundation models, improving Dice, specificity, and sensitivity. Conclusion: As a training-free framework, R²Seg improves the robustness of OOD tumor segmentation in medical images, is compatible with test-time augmentation, and avoids catastrophic forgetting. Abstract: Foundation models for medical image segmentation struggle under out-of-distribution (OOD) shifts, often producing fragmented false positives on OOD tumors. We introduce R$^{2}$Seg, a training-free framework for robust OOD tumor segmentation that operates via a two-stage Reason-and-Reject process. First, the Reason step employs an LLM-guided anatomical reasoning planner to localize organ anchors and generate multi-scale ROIs. Second, the Reject step applies two-sample statistical testing to candidates generated by a frozen foundation model (BiomedParse) within these ROIs. This statistical rejection filter retains only candidates significantly different from normal tissue, effectively suppressing false positives. Our framework requires no parameter updates, making it compatible with zero-update test-time augmentation and avoiding catastrophic forgetting. On multi-center and multi-modal tumor segmentation benchmarks, R$^{2}$Seg substantially improves Dice, specificity, and sensitivity over strong baselines and the original foundation models. Code is available at https://github.com/Eurekashen/R2Seg.
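
The Reject step can be pictured with a small sketch. The abstract does not say which two-sample test R²Seg uses, so the permutation test on mean intensity below is an assumption chosen for illustration; the function names, the significance level, and the toy intensity values are all hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(candidate, reference, n_perm=2000, seed=0):
    """Two-sample permutation test on the absolute difference of means.

    Assumption: intensities inside a candidate region are compared with
    intensities sampled from normal tissue. This is not the paper's test.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    observed = abs(mean(candidate) - mean(reference))
    pooled = list(candidate) + list(reference)
    k = len(candidate)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:k]) - mean(pooled[k:]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


def keep_candidate(candidate, reference, alpha=0.05):
    """Keep a candidate only if it differs significantly from normal tissue."""
    return permutation_p_value(candidate, reference) < alpha


# Hypothetical normalized intensities, for illustration only.
normal_tissue = [0.48, 0.50, 0.52, 0.49, 0.51, 0.50, 0.47, 0.53]
tumor_like = [0.80, 0.82, 0.79, 0.85, 0.81, 0.83, 0.78, 0.84]
lookalike = [0.49, 0.51, 0.50, 0.48, 0.52, 0.50, 0.51, 0.49]
```

In this toy setup, `keep_candidate(tumor_like, normal_tissue)` retains the region while `keep_candidate(lookalike, normal_tissue)` rejects it, which is the false-positive suppression behaviour the abstract describes.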

[236] HEDGE: Hallucination Estimation via Dense Geometric Entropy for VQA with Vision-Language Models

Sushant Gautam,Michael A. Riegler,Pål Halvorsen

Main category: cs.CV

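TL;DR: HEDGE is a unified framework for detecting hallucinations in vision-language models (VLMs). It combines visual perturbations, semantic clustering, and uncertainty metrics, applies to a range of multimodal architectures, and supports reproducible evaluation through the hedge-bench tool.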

Details

Motivation: Vision-language models are prone to hallucination in open-ended visual question answering, and reliable, general-purpose hallucination detection methods are lacking. Method: The HEDGE framework integrates controlled visual perturbations, semantic clustering (entailment- and embedding-based), and uncertainty metrics (such as VASE) into a reproducible detection pipeline, evaluated systematically across different VLMs. Result: Hallucinations are easiest to detect for Qwen2.5-VL and hardest for Med-Gemma. Embedding-based clustering works better at the answer level, and VASE is the most robust metric. Prompt design matters considerably, with label-style outputs outperforming sentence-style ones. Conclusion: HEDGE provides a principled, compute-aware foundation for evaluating multimodal reliability and highlights the joint effect of model architecture, prompt design, sampling scale, and clustering strategy. Abstract: Vision-language models (VLMs) enable open-ended visual question answering but remain prone to hallucinations. We present HEDGE, a unified framework for hallucination detection that combines controlled visual perturbations, semantic clustering, and robust uncertainty metrics. HEDGE integrates sampling, distortion synthesis, clustering (entailment- and embedding-based), and metric computation into a reproducible pipeline applicable across multimodal architectures. Evaluations on VQA-RAD and KvasirVQA-x1 with three representative VLMs (LLaVA-Med, Med-Gemma, Qwen2.5-VL) reveal clear architecture- and prompt-dependent trends. Hallucination detectability is highest for unified-fusion models with dense visual tokenization (Qwen2.5-VL) and lowest for architectures with restricted tokenization (Med-Gemma). Embedding-based clustering often yields stronger separation when applied directly to the generated answers, whereas NLI-based clustering remains advantageous for LLaVA-Med and for longer, sentence-level responses. Across configurations, the VASE metric consistently provides the most robust hallucination signal, especially when paired with embedding clustering and a moderate sampling budget (n ~ 10-15). Prompt design also matters: concise, label-style outputs offer clearer semantic structure than syntactically constrained one-sentence responses. By framing hallucination detection as a geometric robustness problem shaped jointly by sampling scale, prompt structure, model architecture, and clustering strategy, HEDGE provides a principled, compute-aware foundation for evaluating multimodal reliability.
The hedge-bench PyPI library enables reproducible and extensible benchmarking, with full code and experimental resources available at https://github.com/Simula/HEDGE .
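
The general idea of turning sampled answers into an uncertainty signal can be sketched as entropy over answer clusters. This is not the VASE metric and not the hedge-bench API, neither of which is specified in the abstract; the clustering rule here (case-insensitive exact match) is a deliberately crude stand-in for entailment- or embedding-based clustering, and the function name is hypothetical.

```python
import math
from collections import Counter


def cluster_entropy(answers):
    """Shannon entropy (in nats) of the distribution over answer clusters.

    Assumption: answers that match after lowercasing and stripping
    whitespace belong to the same semantic cluster.
    """
    clusters = Counter(a.strip().lower() for a in answers)
    total = sum(clusters.values())
    return -sum((c / total) * math.log(c / total) for c in clusters.values())


# Hypothetical samples for one image-question pair.
consistent = ["polyp", "Polyp", "polyp ", "polyp"]
scattered = ["polyp", "ulcer", "polyp", "ulcer"]
```

Consistent samples collapse into one cluster and give zero entropy, while answers split evenly across two clusters give `log(2)`. A higher value suggests the model is less certain, which is the kind of signal a hallucination detector can threshold.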

[237] X-VMamba: Explainable Vision Mamba

Mohamed A. Mabrok,Yalda Zafari

Main category: cs.CV

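TL;DR: Proposes a controllability-based interpretability framework for analyzing how vision SSMs (such as Mamba) process spatial information. Jacobian- and Gramian-based methods quantify how parts of the input influence the model state. Experiments show the framework reveals hierarchical feature refinement in medical images, and the authors argue it has potential across domains.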

Details

Motivation: Without the transparency of an attention-like mechanism, it is hard to understand how Vision SSMs process spatial information, so a general interpretability method that requires no architectural changes is needed. Method: Two complementary methods are proposed. The Jacobian-based method applies to any SSM and measures input influence through the full chain of state propagation. The Gramian-based method targets diagonal SSMs and is faster thanks to closed-form analytical solutions. Both need a single forward pass and have linear complexity. Result: Validated on three medical imaging modalities, the framework shows SSMs refining features hierarchically, from diffuse low-level textures in early layers to focused, clinically relevant patterns in deeper layers. It reveals domain-specific controllability signatures aligned with diagnostic criteria, progressive spatial selectivity across the network hierarchy, and a substantial effect of scanning strategy on attention patterns. Conclusion: The proposed controllability analysis offers a unified, foundational interpretability paradigm for SSMs, applicable to computer vision, natural language processing, and other domains. Abstract: State Space Models (SSMs), particularly the Mamba architecture, have recently emerged as powerful alternatives to Transformers for sequence modeling, offering linear computational complexity while achieving competitive performance. Yet, despite their effectiveness, understanding how these Vision SSMs process spatial information remains challenging due to the lack of transparent, attention-like mechanisms. To address this gap, we introduce a controllability-based interpretability framework that quantifies how different parts of the input sequence (tokens or patches) influence the internal state dynamics of SSMs. We propose two complementary formulations: a Jacobian-based method applicable to any SSM architecture that measures influence through the full chain of state propagation, and a Gramian-based approach for diagonal SSMs that achieves superior speed through closed-form analytical solutions. Both methods operate in a single forward pass with linear complexity, requiring no architectural modifications or hyperparameter tuning. We validate our framework through experiments on three diverse medical imaging modalities, demonstrating that SSMs naturally implement hierarchical feature refinement from diffuse low-level textures in early layers to focused, clinically meaningful patterns in deeper layers. Our analysis reveals domain-specific controllability signatures aligned with diagnostic criteria, progressive spatial selectivity across the network hierarchy, and the substantial influence of scanning strategies on attention patterns.
Beyond medical imaging, we articulate applications spanning computer vision, natural language processing, and cross-domain tasks. Our framework establishes controllability analysis as a unified, foundational interpretability paradigm for SSMs across all domains. Code and analysis tools will be made available upon publication.
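
To give a feel for why a diagonal structure admits a closed form, the sketch below computes the textbook finite-horizon controllability Gramian of a single time-invariant scalar recurrence h[k+1] = a*h[k] + b*u[k], one channel of a diagonal linear system. This is an assumption-laden illustration rather than the paper's formulation: Mamba's parameters are input-dependent, and the abstract does not say how the Gramian-based method handles that. Function names are hypothetical.

```python
def gramian_by_sum(a, b, steps):
    """Finite-horizon Gramian as an explicit sum: sum_{k<steps} a^(2k) * b^2."""
    return sum((a ** (2 * k)) * b * b for k in range(steps))


def gramian_closed_form(a, b, steps):
    """Same quantity from the geometric-series identity (requires a*a != 1)."""
    return b * b * (1.0 - a ** (2 * steps)) / (1.0 - a * a)


# Hypothetical channel parameters: a slow-decaying and a fast-decaying state.
slow = gramian_closed_form(0.9, 0.5, 50)
fast = gramian_closed_form(0.1, 0.5, 50)
```

The closed form avoids the loop over time steps, which is the speed advantage a diagonal system offers. In this toy case the slowly decaying channel accumulates a larger Gramian than the fast one, meaning past inputs influence its state more.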

[238] Counting Through Occlusion: Framework for Open World Amodal Counting

Safaeid Hossain Arib,Rabeya Akter,Abdul Monaf Chowdhury,Md Jubair Ahmed Sourov,Md Mehedi Hasan

Main category: cs.CV

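TL;DR: Proposes CountOCC, a new method for counting objects under occlusion. It reconstructs complete feature representations of occluded objects through multimodal guidance and adds a visual equivalence objective to improve counting accuracy, achieving sizable gains on several benchmarks.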

Details

Motivation: State-of-the-art methods perform poorly when objects are occluded because backbone networks encode the occluding surfaces instead of the target objects, which distorts the feature representation and limits robustness in practice. Method: The CountOCC framework combines spatial context from visible fragments with semantic priors from text and visual embeddings to generate class-discriminative features for occluded regions hierarchically. A visual equivalence objective enforces consistency in attention space between occluded and unoccluded scenes. Result: MAE drops by 26.72% and 20.80% on the FSC147 validation and test sets, by 49.89% on CARPK, and by 28.79% on CAPTUREReal, reaching SOTA performance in each case. Conclusion: By explicitly reconstructing features of occluded objects and keeping attention consistent, CountOCC addresses counting under occlusion and shows strong generalization and cross-domain robustness. Abstract: Object counting has achieved remarkable success on visible instances, yet state-of-the-art (SOTA) methods fail under occlusion, a pervasive challenge in real world deployment. This failure stems from a fundamental architectural limitation where backbone networks encode occluding surfaces rather than target objects, thereby corrupting the feature representations required for accurate enumeration. To address this, we present CountOCC, an amodal counting framework that explicitly reconstructs occluded object features through hierarchical multimodal guidance. Rather than accepting degraded encodings, we synthesize complete representations by integrating spatial context from visible fragments with semantic priors from text and visual embeddings, generating class-discriminative features at occluded locations across multiple pyramid levels. We further introduce a visual equivalence objective that enforces consistency in attention space, ensuring that both occluded and unoccluded views of the same scene produce spatially aligned gradient-based attention maps. Together, these complementary mechanisms preserve discriminative properties essential for accurate counting under occlusion. For rigorous evaluation, we establish occlusion-augmented versions of FSC 147 and CARPK spanning both structured and unstructured scenes. CountOCC achieves SOTA performance on FSC 147 with 26.72% and 20.80% MAE reduction over prior baselines under occlusion in validation and test, respectively.
CountOCC also demonstrates exceptional generalization by setting new SOTA results on CARPK with 49.89% MAE reduction and on CAPTUREReal with 28.79% MAE reduction, validating robust amodal counting across diverse visual domains. Code will be released soon.

[239] FSDAM: Few-Shot Driving Attention Modeling via Vision-Language Coupling

Kaiser Hamid,Can Cui,Khandakar Ashrafi Akbar,Ziran Wang,Nade Liang

Main category: cs.CV

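TL;DR: Proposes the FSDAM framework, which predicts driver cognitive attention and generates textual explanations from only about 100 annotated samples, greatly reducing the reliance on large-scale data.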

Details

Motivation: Existing driver cognitive attention models depend on large-scale gaze datasets that are costly to collect and annotate, which limits practical use. Method: FSDAM uses a dual-pathway architecture that handles spatial attention prediction and text generation separately while keeping them semantically consistent through cross-modal alignment, enabling joint learning from very few annotated samples. Result: With minimal supervision, FSDAM reaches competitive attention prediction performance, generates coherent, context-appropriate explanations, and shows strong zero-shot generalization across several driving benchmarks. Conclusion: Effective attention-conditioned generation is achievable in low-data regimes, offering a workable route to explainable driver assistance systems where data is scarce. Abstract: Understanding where drivers look and why they shift their attention is essential for autonomous systems that read human intent and justify their actions. Most existing models rely on large-scale gaze datasets to learn these patterns; however, such datasets are labor-intensive to collect and time-consuming to curate. We present FSDAM (Few-Shot Driver Attention Modeling), a framework that achieves joint attention prediction and caption generation with approximately 100 annotated examples, two orders of magnitude fewer than existing approaches. Our approach introduces a dual-pathway architecture where separate modules handle spatial prediction and caption generation while maintaining semantic consistency through cross-modal alignment. Despite minimal supervision, FSDAM achieves competitive performance on attention prediction and generates coherent, context-aware explanations. The model demonstrates robust zero-shot generalization across multiple driving benchmarks. This work shows that effective attention-conditioned generation is achievable with limited supervision, opening new possibilities for practical deployment of explainable driver attention systems in data-constrained scenarios.

Ankita Raj,Chetan Arora

Main category: cs.CV

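TL;DR: Proposes TrAP, a multi-modal backdoor attack on open-vocabulary object detectors (OVODs). By jointly optimizing prompt parameters in the image and text modalities together with visual triggers, it achieves a stealthy, effective attack without retraining the model weights.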

Details

Motivation: As OVODs see more use in high-stakes domains, their security needs study, and prior work has not examined the new attack surface introduced by prompt tuning. Method: The TrAP attack combines learnable prompt parameters in the image and text modalities and uses a curriculum learning strategy that progressively shrinks the trigger, enabling effective backdoor activation with small trigger patches. Result: Experiments show that TrAP achieves high attack success rates on multiple datasets, supports both targeted misclassification and disappearance attacks, and also improves downstream performance on clean samples. Conclusion: TrAP exposes a new security threat in prompt-tuned OVOD systems: lightweight prompt parameters alone can carry a backdoor, which is a warning for the safe deployment of such models. Abstract: Open-vocabulary object detectors (OVODs) unify vision and language to detect arbitrary object categories based on text prompts, enabling strong zero-shot generalization to novel concepts. As these models gain traction in high-stakes applications such as robotics, autonomous driving, and surveillance, understanding their security risks becomes crucial. In this work, we conduct the first study of backdoor attacks on OVODs and reveal a new attack surface introduced by prompt tuning. We propose TrAP (Trigger-Aware Prompt tuning), a multi-modal backdoor injection strategy that jointly optimizes prompt parameters in both image and text modalities along with visual triggers. TrAP enables the attacker to implant malicious behavior using lightweight, learnable prompt tokens without retraining the base model weights, thus preserving generalization while embedding a hidden backdoor. We adopt a curriculum-based training strategy that progressively shrinks the trigger size, enabling effective backdoor activation using small trigger patches at inference. Experiments across multiple datasets show that TrAP achieves high attack success rates for both object misclassification and object disappearance attacks, while also improving clean image performance on downstream datasets compared to the zero-shot setting.

[241] Direct Visual Grounding by Directing Attention of Visual Tokens

Parsa Esmaeilkhani,Longin Jan Latecki

Main category: cs.CV

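TL;DR: Proposes a new KL-divergence attention loss (KLAL) that directly supervises the attention on visual tokens in vision-language models, improving their performance on visual tasks.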

Details

Motivation: The standard next-token prediction loss does not effectively steer the model toward the visual tokens relevant to the question, which leads to wrong answers. Method: The KLAL loss supervises the attention distribution over visual tokens using ground-truth attention maps (derived from task geometry for synthetic data or from annotations for real images), aligning the two distributions with KL divergence. Result: Performance improves notably on geometric tasks, pointing, and referring expression comprehension, and the method is also validated on a newly built line-tracing dataset. Conclusion: Directly supervising attention on visual tokens improves the accuracy and interpretability of VLMs on visual question answering, especially on tasks that need fine-grained visual grounding. Abstract: Vision Language Models (VLMs) mix visual tokens and text tokens. A puzzling issue is the fact that visual tokens most related to the query receive little to no attention in the final layers of the LLM module of VLMs from the answer tokens, where all tokens are treated equally, in particular, visual and language tokens in the LLM attention layers. This fact may result in wrong answers to visual questions, as our experimental results confirm. It appears that the standard next-token prediction (NTP) loss provides an insufficient signal for directing attention to visual tokens. We hypothesize that a more direct supervision of the attention of visual tokens to corresponding language tokens in the LLM module of VLMs will lead to improved performance on visual tasks. To demonstrate that this is indeed the case, we propose a novel loss function that directly supervises the attention of visual tokens. It directly grounds the answer language tokens in images by directing their attention to the relevant visual tokens. This is achieved by aligning the attention distribution of visual tokens to ground truth attention maps with KL divergence. The ground truth attention maps are obtained from task geometry in synthetic cases or from standard grounding annotations (e.g., bounding boxes or point annotations) in real images, and are used inside the LLM for attention supervision without requiring new labels. The obtained KL attention loss (KLAL), when combined with NTP, encourages VLMs to attend to relevant visual tokens while generating answer tokens.
This results in notable improvements across geometric tasks, pointing, and referring expression comprehension on both synthetic and real-world data, as demonstrated by our experiments. We also introduce a new dataset to evaluate the line tracing abilities of VLMs. Surprisingly, even commercial VLMs do not perform well on this task.
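
A minimal sketch of the attention-alignment idea follows. It builds a target attention map from a bounding box over a patch grid and measures KL(target || predicted). The abstract does not state the direction of the divergence, how the target map is smoothed, or which layers and heads are supervised, so all of those choices, along with the function names, are assumptions.

```python
import math


def box_to_target(grid_h, grid_w, box):
    """Uniform target distribution over the patches inside a bounding box.

    `box` is (row0, col0, row1, col1) in inclusive patch indices.
    Assumption: uniform mass inside the box, zero outside.
    """
    r0, c0, r1, c1 = box
    mask = [
        1.0 if r0 <= r <= r1 and c0 <= c <= c1 else 0.0
        for r in range(grid_h)
        for c in range(grid_w)
    ]
    total = sum(mask)
    return [m / total for m in mask]


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted); terms with zero target mass contribute nothing."""
    return sum(
        t * math.log(t / max(p, eps))
        for t, p in zip(target, predicted)
        if t > 0.0
    )


# Hypothetical 2x2 patch grid with the object in the top row.
target = box_to_target(2, 2, (0, 0, 0, 1))
uniform_attention = [0.25, 0.25, 0.25, 0.25]
```

Here `kl_divergence(target, target)` is zero and `kl_divergence(target, uniform_attention)` is `log(2)`, so the loss falls as attention mass moves onto the annotated patches. In training, a term like this would be added to the next-token prediction loss with some weighting, which the abstract leaves unspecified.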

[242] Deep Imbalanced Multi-Target Regression: 3D Point Cloud Voxel Content Estimation in Simulated Forests

Amirhossein Hassanzadeh,Bartosz Krawczyk,Michael Saunders,Rob Wible,Keith Krause,Dimah Dera,Jan van Aardt

Main category: cs.CV

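TL;DR: Proposes a multi-target regression method based on Kernel Point Convolutions (KPConv) that uses density-based relevance (DBR) to handle class imbalance. It infers low-level voxel content (such as the occupancy percentage of bark, leaf, and soil) from high-level voxelized LiDAR point clouds, and a sensitivity analysis shows how voxel size affects the accuracy of forest structure modeling.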

Details

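Motivation: Voxelization lowers the computational cost of processing LiDAR data but loses fine structural information. Existing methods struggle to estimate the share of multiple target classes inside small voxels within the forest canopy, with notable errors at higher resolutions, so a deep learning model that can infer voxel content across voxel scales is needed. Method: A KPConv network performs multi-target regression, with density-based relevance (DBR), a cost-sensitive learning strategy, addressing sample imbalance. Weighted mean squared error, Focal Regression (FocalR), and regularization improve training. Experiments and a sensitivity analysis use DIRSIG-simulated 3D LiDAR point clouds at voxel sizes from 0.25 to 2 meters. Result: Larger voxels (e.g., 2 meters) have lower error because of reduced variability, while smaller voxels (e.g., 0.25 or 0.5 meter) have higher error, especially in the canopy. Estimation error for bark and leaf rises significantly at fine resolutions. The proposed method performs better on imbalanced multi-target regression. Conclusion: Voxel size should be chosen by weighing accuracy against stability for the application at hand. The study fills a gap in deep imbalanced multi-target regression models and simulated datasets for forest LiDAR point clouds, and shows that inferring low-level structural information from high-level voxel data is feasible. Abstract: Voxelization is an effective approach to reduce the computational cost of processing Light Detection and Ranging (LiDAR) data, yet it results in a loss of fine-scale structural information. This study explores whether low-level voxel content information, specifically target occupancy percentage within a voxel, can be inferred from high-level voxelized LiDAR point cloud data collected from Digital Imaging and Remote Sensing Image Generation (DIRSIG) software. In our study, the targets include bark, leaf, soil, and miscellaneous materials. We propose a multi-target regression approach in the context of imbalanced learning using Kernel Point Convolutions (KPConv). Our research leverages a cost-sensitive learning technique called density-based relevance (DBR) to address class imbalance. We employ weighted Mean Squared Error (MSE), Focal Regression (FocalR), and regularization to improve the optimization of KPConv. This study performs a sensitivity analysis on the voxel size (0.25 - 2 meters) to evaluate the effect of various grid representations in capturing the nuances of the forest. This sensitivity analysis reveals that larger voxel sizes (e.g., 2 meters) result in lower errors due to reduced variability, while smaller voxel sizes (e.g., 0.25 or 0.5 meter) exhibit higher errors, particularly within the canopy, where variability is greatest.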
For bark and leaf targets, error values at smaller voxel size datasets (0.25 and 0.5 meter) were significantly higher than those in larger voxel size datasets (2 meters), highlighting the difficulty in accurately estimating within-canopy voxel content at fine resolutions. This suggests that the choice of voxel size is application-dependent. Our work fills the gap in deep imbalance learning models for multi-target regression and simulated datasets for 3D LiDAR point clouds of forests.

[243] SAGE: Saliency-Guided Contrastive Embeddings

Colton R. Crum,Adam Czajka

Main category: cs.CV

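TL;DR: Proposes SAGE (Saliency-Guided Contrastive Embeddings), which adds a contrastive embedding loss guided by human saliency in the model's latent space to improve classification performance and generalization.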

Details

Motivation: Existing saliency-guided training methods mostly operate in image space and rely on unreliable internal model mechanisms, which limits their effect. The authors aim to move the guidance from image space to the model's latent space to integrate human perceptual priors more effectively. Method: The SAGE loss applies saliency-preserving and saliency-degrading signal augmentations to the input, captures the resulting changes in embeddings and logits, and uses a contrastive triplet loss to steer the model toward salient features and away from non-salient ones. A sanity check on the logit distributions ensures the model outputs agree with the saliency-based augmentations. Result: SAGE outperforms current state-of-the-art saliency methods in both open-set and closed-set scenarios, improves classification performance, and generalizes well across backbones and tasks. Conclusion: By performing saliency-guided training in latent space, SAGE integrates human perceptual priors effectively, improves accuracy and robustness, and offers a reliable option for aligning models in high-risk domains. Abstract: Integrating human perceptual priors into the training of neural networks has been shown to raise model generalization, serve as an effective regularizer, and align models with human expertise for applications in high-risk domains. Existing approaches to integrate saliency into model training often rely on internal model mechanisms, which recent research suggests may be unreliable. Our insight is that many challenges associated with saliency-guided training stem from the placement of the guidance approaches solely within the image space. Instead, we move away from the image space, use the model's latent space embeddings to steer human guidance during training, and we propose SAGE (Saliency-Guided Contrastive Embeddings): a loss function that integrates human saliency into network training using contrastive embeddings. We apply saliency-preserving and saliency-degrading signal augmentations to the input and capture the changes in embeddings and model logits. We guide the model towards salient features and away from non-salient features using a contrastive triplet loss. Additionally, we perform a sanity check on the logit distributions to ensure that the model outputs match the saliency-based augmentations. We demonstrate a boost in classification performance across both open- and closed-set scenarios against SOTA saliency-based methods, showing SAGE's effectiveness across various backbones, and include experiments to suggest its wide generalization across tasks.
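
The triplet objective mentioned in the abstract can be sketched with the standard margin-based form. How SAGE assigns the three roles is not spelled out in the abstract; the mapping used here (anchor = embedding of the original input, positive = saliency-preserving augmentation, negative = saliency-degrading augmentation), the margin value, and the toy embeddings are assumptions.

```python
import math


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss on Euclidean distances.

    Zero once the negative is at least `margin` farther from the anchor
    than the positive is.
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical 2-D embeddings, for illustration only.
original = [0.0, 0.0]
salient_preserved = [0.0, 1.0]   # distance 1 from the original
salient_degraded = [3.0, 4.0]    # distance 5 from the original
```

With these values `triplet_loss(original, salient_preserved, salient_degraded)` is already zero, whereas swapping the two augmentations gives a loss of 5.0, so gradient descent would pull the saliency-preserving view closer and push the degraded one away.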

[244] Which Way from B to A: The role of embedding geometry in image interpolation for Stable Diffusion

Nicholas Karris,Luke Durell,Javier Flores,Tegan Emerson

Main category: cs.CV

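TL;DR: Proposes treating CLIP embedding matrices as point clouds in a Wasserstein space and interpolating between embeddings with optimal transport, which yields smoother, more coherent images in Stable Diffusion.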

Details

Motivation: Stable Diffusion is permutation-invariant with respect to the rows of CLIP embedding matrices, which prompted the authors to rethink the structure of the embedding space from a geometric point of view. Method: CLIP embeddings are treated as point clouds in a Wasserstein space, the interpolation problem is recast as an optimal transport problem, and its geodesic path is computed. Result: Compared with conventional interpolation methods, the intermediate images are smoother and more visually coherent. Conclusion: Treating embeddings as point clouds and using optimal transport better reveals and exploits the geometry of the embedding space and improves image interpolation quality. Abstract: It can be shown that Stable Diffusion has a permutation-invariance property with respect to the rows of Contrastive Language-Image Pretraining (CLIP) embedding matrices. This inspired the novel observation that these embeddings can naturally be interpreted as point clouds in a Wasserstein space rather than as matrices in a Euclidean space. This perspective opens up new possibilities for understanding the geometry of embedding space. For example, when interpolating between embeddings of two distinct prompts, we propose reframing the interpolation problem as an optimal transport problem. By solving this optimal transport problem, we compute a shortest path (or geodesic) between embeddings that captures a more natural and geometrically smooth transition through the embedding space. This results in smoother and more coherent intermediate (interpolated) images when rendered by the Stable Diffusion generative model. We conduct experiments to investigate this effect, comparing the quality of interpolated images produced using optimal transport to those generated by other standard interpolation methods. The novel optimal transport--based approach presented indeed gives smoother image interpolations, suggesting that viewing the embeddings as point clouds (rather than as matrices) better reflects and leverages the geometry of the embedding space.
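
For two point clouds with the same number of points and uniform weights, optimal transport under squared Euclidean cost reduces to finding the best one-to-one matching, and the geodesic moves each point in a straight line toward its match. The sketch below illustrates that special case with a brute-force search over permutations, which only works for a handful of points. It is not the authors' implementation; at the size of real CLIP embeddings one would need an assignment solver such as SciPy's `linear_sum_assignment`. Function names and the toy data are hypothetical.

```python
from itertools import permutations


def sq_dist(u, v):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def optimal_matching(cloud_a, cloud_b):
    """Permutation of cloud_b's rows minimizing total squared distance.

    Brute force over all n! permutations, so suitable for tiny n only.
    """
    n = len(cloud_a)
    return min(
        permutations(range(n)),
        key=lambda perm: sum(sq_dist(cloud_a[i], cloud_b[perm[i]]) for i in range(n)),
    )


def ot_interpolate(cloud_a, cloud_b, t):
    """Point on the matching-based geodesic at time t in [0, 1]."""
    perm = optimal_matching(cloud_a, cloud_b)
    return [
        [(1 - t) * a + t * b for a, b in zip(cloud_a[i], cloud_b[perm[i]])]
        for i in range(len(cloud_a))
    ]


def rowwise_interpolate(mat_a, mat_b, t):
    """Ordinary linear interpolation that treats the inputs as matrices."""
    return [
        [(1 - t) * a + t * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(mat_a, mat_b)
    ]


# cloud_b is cloud_a with its rows swapped: the same point cloud.
cloud_a = [[0.0, 0.0], [1.0, 1.0]]
cloud_b = [[1.0, 1.0], [0.0, 0.0]]
```

Because the two inputs are the same cloud in a different row order, `ot_interpolate(cloud_a, cloud_b, 0.5)` returns the cloud unchanged, while `rowwise_interpolate` collapses both rows to `[0.5, 0.5]`. That contrast is the permutation-invariance argument from the abstract in miniature.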

[245] RoCoISLR: A Romanian Corpus for Isolated Sign Language Recognition

Cătălin-Alexandru Rîpanu,Andrei-Theodor Hotnog,Giulia-Stefania Imbrea,Dumitru-Clementin Cercel

Main category: cs.CV

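TL;DR: Introduces RoCoISLR, the first large-scale standardized dataset for Romanian Isolated Sign Language Recognition (RoISLR), with over 9,000 video samples and nearly 6,000 standardized glosses, and establishes benchmark results with several state-of-the-art video recognition models.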

Details

Motivation: Progress in Romanian sign language recognition is held back by the lack of a large-scale, standardized dataset, and existing datasets focus mainly on American Sign Language, so a dataset for Romanian sign language is needed. Method: A new corpus, RoCoISLR, is collected from multiple sources, containing over 9,000 video samples and nearly 6,000 standardized glosses. Seven state-of-the-art video recognition models (I3D, SlowFast, Swin Transformer, and others) are evaluated under a unified experimental setup and compared against the WLASL2000 dataset. Result: Transformer-based architectures outperform convolutional baselines, with Swin Transformer reaching 34.1% Top-1 accuracy. The results expose the challenges posed by long-tail class distributions in low-resource sign languages. Conclusion: RoCoISLR is the first foundational resource for Romanian isolated sign language recognition and supports systematic progress in the field, particularly for long-tail distributions and low-resource languages. Abstract: Automatic sign language recognition plays a crucial role in bridging the communication gap between deaf communities and hearing individuals; however, most available datasets focus on American Sign Language. For Romanian Isolated Sign Language Recognition (RoISLR), no large-scale, standardized dataset exists, which limits research progress. In this work, we introduce a new corpus for RoISLR, named RoCoISLR, comprising over 9,000 video samples that span nearly 6,000 standardized glosses from multiple sources. We establish benchmark results by evaluating seven state-of-the-art video recognition models (I3D, SlowFast, Swin Transformer, TimeSformer, Uniformer, VideoMAE, and PoseConv3D) under consistent experimental setups, and compare their performance with that of the widely used WLASL2000 corpus. According to the results, transformer-based architectures outperform convolutional baselines; Swin Transformer achieved a Top-1 accuracy of 34.1%. Our benchmarks highlight the challenges associated with long-tail class distributions in low-resource sign languages, and RoCoISLR provides the initial foundation for systematic RoISLR research.

[246] Lightweight Optimal-Transport Harmonization on Edge Devices

Maria Larchenko,Dmitry Guskov,Alexander Lobashev,Georgy Derevyanko

Main category: cs.CV

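TL;DR: Proposes MKL-Harmonizer, a lightweight color harmonization method for AR that uses optimal transport theory to run in real time on device and performs best on real AR images.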

Details

Motivation: Existing color harmonization algorithms are not real-time and are therefore hard to integrate into augmented reality (AR) pipelines. Method: Building on classical optimal transport theory, a compact encoder is trained to predict the Monge-Kantorovich transport map, enabling efficient color harmonization. Result: On real AR composite images, the method achieves the highest aggregated score among state-of-the-art methods. The authors also release an AR dataset with pixel-accurate masks and a data-gathering toolkit. Conclusion: MKL-Harmonizer runs in real time on device, addresses color harmonization in AR, and supports further data-driven research in the area. Abstract: Color harmonization adjusts the colors of an inserted object so that it perceptually matches the surrounding image, resulting in a seamless composite. The harmonization problem naturally arises in augmented reality (AR), yet harmonization algorithms are not currently integrated into AR pipelines because real-time solutions are scarce. In this work, we address color harmonization for AR by proposing a lightweight approach that supports on-device inference. For this, we leverage classical optimal transport theory by training a compact encoder to predict the Monge-Kantorovich transport map. We benchmark our MKL-Harmonizer algorithm against state-of-the-art methods and demonstrate that for real composite AR images our method achieves the best aggregated score. We release our dedicated AR dataset of composite images with pixel-accurate masks and data-gathering toolkit to support further data acquisition by researchers.
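
To show what a transport map does to colors, here is a deliberately simplified one-dimensional sketch. For a single channel modeled as a Gaussian, the optimal transport map between source and target statistics reduces to matching mean and standard deviation. The paper's map presumably acts on full color vectors and is predicted by an encoder rather than computed from target pixels; neither detail is given in the abstract, so this per-channel version and its function name are assumptions.

```python
from statistics import fmean, pstdev


def channel_transport(source, target):
    """Affine map sending the source channel's mean/std to the target's.

    In one dimension this is the optimal transport map between two
    Gaussians: T(x) = mu_t + (sd_t / sd_s) * (x - mu_s).
    """
    mu_s, mu_t = fmean(source), fmean(target)
    sd_s, sd_t = pstdev(source), pstdev(target)
    scale = sd_t / sd_s if sd_s > 0 else 1.0  # flat source: shift only
    return [mu_t + scale * (x - mu_s) for x in source]


# Hypothetical channel values for an inserted object and its background.
object_channel = [0.0, 2.0, 4.0]
background_channel = [10.0, 11.0, 12.0]
harmonized = channel_transport(object_channel, background_channel)
```

The harmonized values come out at approximately 10, 11, and 12: the object's channel keeps its internal ordering but takes on the background's brightness and contrast. Applying such a map is cheap, and that low cost is what makes an on-device approach plausible once the map's parameters are known.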

[247] Enhancing Neuro-Oncology Through Self-Assessing Deep Learning Models for Brain Tumor Unified Model for MRI Segmentation

Andrew Zhou

Main category: cs.CV

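TL;DR: Proposes an uncertainty-aware extension of nnUNet that jointly performs brain tumor segmentation, healthy brain structure segmentation, and voxel-wise uncertainty estimation, giving more reliable AI support for surgical decision-making.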

Details

Motivation: Existing methods do not quantify prediction uncertainty and do not unify tumor segmentation with the context of surrounding normal anatomy, which limits their clinical use. Method: A voxel-wise uncertainty channel is added to nnUNet, and joint training on normal brain structure and tumor datasets gives whole-brain context. The model outputs the segmentation and an uncertainty map in a single forward pass. Result: On BraTS2023, uncertainty estimation reaches a correlation of 0.750 and an RMSD of 0.047. The tumor DSC is 0.86 and the brain structure DSC is 0.81, with robust performance in key regions. Conclusion: The framework is the first to output the tumor, surrounding anatomy, and an uncertainty map together. Visual inspection suggests the uncertainty information helps assess prediction reliability and correct errors, making AI more trustworthy and useful for surgical planning. Abstract: Accurate segmentation of brain tumors is vital for diagnosis, surgical planning, and treatment monitoring. Deep learning has advanced on benchmarks, but two issues limit clinical use: no uncertainty estimates for errors and no segmentation of healthy brain structures around tumors for surgery. Current methods fail to unify tumor localization with anatomical context and lack confidence scores. This study presents an uncertainty-aware framework augmenting nnUNet with a channel for voxel-wise uncertainty. Trained on BraTS2023, it yields a correlation of 0.750 and RMSD of 0.047 for uncertainty without hurting tumor accuracy. It predicts uncertainty in one pass, with no extra networks or inferences, aiding clinical decisions. For whole-brain context, a unified model combines normal and cancer datasets, achieving a DSC of 0.81 for brain structures and 0.86 for tumor, with robust key-region performance. Combining both innovations gives the first model outputting tumor in natural surroundings plus an overlaid uncertainty map. Visual checks of outputs show uncertainty offers key insights to evaluate predictions and fix errors, helping informed surgical decisions from AI.
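
For readers unfamiliar with the DSC figures quoted above, the Dice similarity coefficient is twice the overlap between predicted and reference masks divided by their combined size. The sketch below uses flattened binary masks; the function name and the convention of returning 1.0 for two empty masks are choices made here, not details from the paper.

```python
def dice(pred, truth):
    """Dice similarity coefficient for two flattened binary masks."""
    overlap = sum(1 for p, t in zip(pred, truth) if p and t)
    size = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Two empty masks agree perfectly by the convention adopted here.
    return 1.0 if size == 0 else 2.0 * overlap / size


# Hypothetical 4-voxel masks.
predicted_mask = [1, 1, 0, 0]
reference_mask = [1, 0, 0, 0]
```

Here the prediction covers the one true voxel plus one extra, giving 2 * 1 / (2 + 1), roughly 0.67. A DSC of 0.86 therefore indicates substantially tighter agreement between predicted and reference tumor masks.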

[248] MSRNet: A Multi-Scale Recursive Network for Camouflaged Object Detection

Leena Alghamdi,Muhammad Usman,Hafeez Anwar,Abdul Bais,Saeed Anwar

Main category: cs.CV

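TL;DR: Proposes MSRNet, a multi-scale recursive network for camouflaged object detection that fuses multi-scale features using a Pyramid Vision Transformer and attention mechanisms, and uses a recursive-feedback decoding strategy to improve detection of small and multiple camouflaged objects, reaching leading performance on several benchmark datasets.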

Details

Motivation: Existing methods still fall short in complex scenes, especially for small and multiple camouflaged objects, so detection accuracy and context understanding need to improve. Method: A Pyramid Vision Transformer backbone extracts multi-scale features; attention-based scale integration units merge features selectively; multi-granularity fusion units and a recursive-feedback decoding strategy progressively refine features and strengthen global context. Result: State-of-the-art performance on two camouflaged object detection benchmarks and second place on the other two, with improved detection in small-object and multi-object scenes. Conclusion: By combining multi-scale learning with recursive feature optimization, MSRNet substantially improves camouflaged object detection, particularly in complex scenes, and shows good application prospects. Abstract: Camouflaged object detection is an emerging and challenging computer vision task that requires identifying and segmenting objects that blend seamlessly into their environments due to high similarity in color, texture, and size. This task is further complicated by low-light conditions, partial occlusion, small object size, intricate background patterns, and multiple objects. While many sophisticated methods have been proposed for this task, current methods still struggle to precisely detect camouflaged objects in complex scenarios, especially with small and multiple objects, indicating room for improvement. We propose a Multi-Scale Recursive Network that extracts multi-scale features via a Pyramid Vision Transformer backbone and combines them via specialized Attention-Based Scale Integration Units, enabling selective feature merging. For more precise object detection, our decoder recursively refines features by incorporating Multi-Granularity Fusion Units. A novel recursive-feedback decoding strategy is developed to enhance global context understanding, helping the model overcome the challenges in this task. By jointly leveraging multi-scale learning and recursive feature optimization, our proposed method achieves performance gains, successfully detecting small and multiple camouflaged objects. Our model achieves state-of-the-art results on two benchmark datasets for camouflaged object detection and ranks second on the remaining two. Our codes, model weights, and results are available at https://github.com/linaagh98/MSRNet.

[249] SAGA: Source Attribution of Generative AI Videos

Rohit Kundu,Vishal Mohanty,Hao Xiong,Shan Jia,Athula Balachandran,Amit K. Roy-Chowdhury

Main category: cs.CV

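TL;DR: Proposes SAGA, the first comprehensive framework for large-scale source attribution of generative AI videos; it identifies the generating model at multiple granularities and is data-efficient and interpretable.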

Details

Motivation: As generative AI videos become more realistic, traditional binary real/fake detection is no longer enough to address misuse risks; techniques that can precisely trace the generating model are needed. Method: The SAGA framework uses a video transformer architecture built on a vision foundation model, combined with a pretrain-and-attribute strategy, and introduces Temporal Attention Signatures (T-Sigs) for interpretability. Result: State-of-the-art results on public datasets; the method matches fully supervised performance using only 0.5% of labeled data, supports cross-domain attribution, and T-Sigs provide a visual explanation of what distinguishes generators. Conclusion: SAGA provides fine-grained, efficient, and interpretable provenance for AI-generated videos, a useful tool for content regulation and forensics. Abstract: The proliferation of generative AI has led to hyper-realistic synthetic videos, escalating misuse risks and outstripping binary real/fake detectors. We introduce SAGA (Source Attribution of Generative AI videos), the first comprehensive framework to address the urgent need for AI-generated video source attribution at a large scale. Unlike traditional detection, SAGA identifies the specific generative model used. It uniquely provides multi-granular attribution across five levels: authenticity, generation task (e.g., T2V/I2V), model version, development team, and the precise generator, offering far richer forensic insights. Our novel video transformer architecture, leveraging features from a robust vision foundation model, effectively captures spatio-temporal artifacts. Critically, we introduce a data-efficient pretrain-and-attribute strategy, enabling SAGA to achieve state-of-the-art attribution using only 0.5% of source-labeled data per class, matching fully supervised performance. Furthermore, we propose Temporal Attention Signatures (T-Sigs), a novel interpretability method that visualizes learned temporal differences, offering the first explanation for why different video generators are distinguishable. Extensive experiments on public datasets, including cross-domain scenarios, demonstrate that SAGA sets a new benchmark for synthetic video provenance, providing crucial, interpretable insights for forensic and regulatory applications.

[250] Video Finetuning Improves Reasoning Between Frames

Ruiqi Yang,Tian Yun,Zihan Wang,Ellie Pavlick

Main category: cs.CV

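TL;DR: Proposes Visual Chain-of-Thought (vCoT) to analyze the effect of video finetuning on multimodal large language models, finding that video-finetuned models already capture frame-to-frame transitions implicitly and can transfer temporal reasoning to static visual tasks.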

Details

Motivation: Existing video understanding approaches mostly rely on frame concatenation, and what multimodal LLMs actually gain from video finetuning has not been studied in depth. Method: vCoT generates transitional event descriptions between consecutive frames; image-only and video-finetuned models are compared systematically with and without these transitional cues. Result: vCoT significantly improves image-only models on long-form video question answering but yields limited gains for video-finetuned models; video models outperform image models on static relational reasoning tasks. Conclusion: Video finetuning lets models learn frame-to-frame temporal relations implicitly and gives them transferable temporal reasoning ability, rather than relying only on explicit transitional cues. Abstract: Multimodal large language models (LLMs) have made rapid progress in visual understanding, yet their extension from images to videos often reduces to a naive concatenation of frame tokens. In this work, we investigate what video finetuning brings to multimodal LLMs. We propose Visual Chain-of-Thought (vCoT), an explicit reasoning process that generates transitional event descriptions between consecutive frames. Using vCoT, we systematically compare image-only LVLMs with their video-finetuned counterparts, both with and without access to these transitional cues. Our experiments show that vCoT significantly improves the performance of image-only models on long-form video question answering, while yielding only marginal gains for video-finetuned models. This suggests that the latter already capture frame-to-frame transitions implicitly. Moreover, we find that video models transfer this temporal reasoning ability to purely static settings, outperforming image-model baselines on relational visual reasoning tasks.

Trung Thanh Nguyen,Yasutomo Kawanishi,Vijay John,Takahiro Komamizu,Ichiro Ide

Main category: cs.CV

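TL;DR: Proposes ViCoKD, a framework that uses cross-modal knowledge distillation to improve multi-view action recognition when modalities and annotations are limited.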

Details

Motivation: Multi-view action recognition with partially overlapping views and limited modalities is underexplored, especially in real-world settings with only sequence-level annotations. Method: A cross-modal adapter and a view-aware consistency module use cross-modal attention and human-detection masks for knowledge distillation and prediction alignment. Result: On the MultiSensor-Home dataset, ViCoKD outperforms several competitive distillation methods and surpasses the teacher model under limited conditions. Conclusion: ViCoKD handles partial view coverage and sparse annotation effectively and substantially improves the student model's multi-view action recognition. Abstract: The widespread use of multi-sensor systems has increased research in multi-view action recognition. While existing approaches in multi-view setups with fully overlapping sensors benefit from consistent view coverage, partially overlapping settings where actions are visible in only a subset of views remain underexplored. This challenge becomes more severe in real-world scenarios, as many systems provide only limited input modalities and rely on sequence-level annotations instead of dense frame-level labels. In this study, we propose View-aware Cross-modal Knowledge Distillation (ViCoKD), a framework that distills knowledge from a fully supervised multi-modal teacher to a modality- and annotation-limited student. ViCoKD employs a cross-modal adapter with cross-modal attention, allowing the student to exploit multi-modal correlations while operating with incomplete modalities. Moreover, we propose a View-aware Consistency module to address view misalignment, where the same action may appear differently or only partially across viewpoints. It enforces prediction alignment when the action is co-visible across views, guided by human-detection masks and confidence-weighted Jensen-Shannon divergence between their predicted class distributions. Experiments on the real-world MultiSensor-Home dataset show that ViCoKD consistently outperforms competitive distillation methods across multiple backbones and environments, delivering significant gains and surpassing the teacher model under limited conditions.
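
The abstract mentions a confidence-weighted Jensen-Shannon divergence between the class distributions predicted for two views, applied only when the action is co-visible. The sketch below shows one way such a term could look. The weighting rule (product of the two views' maximum class probabilities), the `co_visible` flag, and all function names are assumptions for illustration; the abstract does not specify how the confidence weight is computed.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln(2)."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


def weighted_view_consistency(p, q, co_visible):
    """Consistency penalty between two views' class distributions.

    Hypothetical weighting: the product of each view's top class
    probability, so confident disagreements are penalised most.
    """
    if not co_visible:
        # Action not visible in both views: no alignment is enforced.
        return 0.0
    confidence = max(p) * max(q)
    return confidence * js_divergence(p, q)


view_a = [0.7, 0.2, 0.1]
view_b = [0.1, 0.3, 0.6]
print(weighted_view_consistency(view_a, view_b, co_visible=True))
print(weighted_view_consistency(view_a, view_b, co_visible=False))
```

Jensen-Shannon divergence is a natural fit for this kind of term because, unlike plain KL divergence, it is symmetric in the two views and stays finite when one view assigns zero probability to a class.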

[252] Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views

Junyi Ma,Wentao Bao,Jingyi Xu,Guanzhong Sun,Yu Zheng,Erhang Zhang,Xieyuanli Chen,Hesheng Wang

Main category: cs.CV

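TL;DR: Proposes EgoLoc, a zero-shot method that localizes the timestamps of hand-object contact and separation in egocentric videos; it uses hand-dynamics-guided sampling and a vision-language model for accurate temporal interaction localization, without relying on object masks or action category annotations.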

Details

Motivation: Existing research mainly models interaction behavior ("how to interact"), while localizing the critical moments of hand-object contact and separation ("when to interact") remains underexplored, although it is crucial for mixed reality and robot planning. Method: EgoLoc uses hand-dynamics-guided sampling to generate high-quality visual prompts, uses a vision-language model to identify contact/separation attributes and localize timestamps, and refines the results through closed-loop feedback; it needs no object masks or action category annotations, giving zero-shot generalization. Result: Experiments on a public dataset and newly built benchmarks show that EgoLoc localizes temporal interactions in egocentric videos effectively and improves downstream tasks such as robot manipulation. Conclusion: EgoLoc is a zero-shot method that requires no category annotations or object masks, performs well at localizing hand-object contact and separation moments, and shows good generalization and application potential. Abstract: Analyzing hand-object interaction in egocentric vision facilitates VR/AR applications and human-robot policy transfer. Existing research has mostly focused on modeling the behavior paradigm of interactive actions (i.e., "how to interact"). However, the more challenging and fine-grained problem of capturing the critical moments of contact and separation between the hand and the target object (i.e., "when to interact") is still underexplored, which is crucial for immersive interactive experiences in mixed reality and robotic motion planning. Therefore, we formulate this problem as temporal interaction localization (TIL). Some recent works extract semantic masks as TIL references, but suffer from inaccurate object grounding and cluttered scenarios. Although current temporal action localization (TAL) methods perform well in detecting verb-noun action segments, they rely on category annotations during training and exhibit limited precision in localizing hand-object contact/separation moments. To address these issues, we propose a novel zero-shot approach dubbed EgoLoc to localize hand-object contact and separation timestamps in egocentric videos. EgoLoc introduces hand-dynamics-guided sampling to generate high-quality visual prompts. It exploits the vision-language model to identify contact/separation attributes, localize specific timestamps, and provide closed-loop feedback for further refinement. EgoLoc eliminates the need for object masks and verb-noun taxonomies, leading to generalizable zero-shot implementation. Comprehensive experiments on the public dataset and our novel benchmarks demonstrate that EgoLoc achieves plausible TIL for egocentric videos. It is also validated to effectively facilitate multiple downstream applications in egocentric vision and robotic manipulation tasks. Code and relevant data will be released at https://github.com/IRMVLab/EgoLoc.

[253] Simple Lines, Big Ideas: Towards Interpretable Assessment of Human Creativity from Drawings

Zihao Lin,Zhenshan Shi,Sasa Zhao,Hanwei Zhu,Lingyu Zhu,Baoliang Chen,Lei Mo

Main category: cs.CV

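TL;DR: Proposes an automatic method for assessing creativity in drawings based on a multi-modal, multi-task learning framework; it combines the content and style dimensions to produce interpretable, high-performing creativity scores.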

Details

Motivation: Existing creativity assessment relies on subjective expert scoring, which is time-consuming, labor-intensive, and subjective; an automated, interpretable, data-driven method is needed. Method: An annotated dataset is extended with content categories, and a multi-modal, multi-task learning framework jointly predicts creativity scores, classifies content types, and extracts style features; a conditional learning mechanism adapts feature extraction dynamically to semantic and stylistic cues. Result: The model achieves state-of-the-art creativity scoring, outperforms regression baselines, and provides interpretable visualizations consistent with human judgments. Conclusion: The framework fuses content and style information effectively, giving automatic and interpretable creativity assessment for drawings with practical potential. Abstract: Assessing human creativity through visual outputs, such as drawings, plays a critical role in fields including psychology, education, and cognitive science. However, current assessment practices still rely heavily on expert-based subjective scoring, which is both labor-intensive and inherently subjective. In this paper, we propose a data-driven framework for automatic and interpretable creativity assessment from drawings. Motivated by the cognitive understanding that creativity can emerge from both what is drawn (content) and how it is drawn (style), we reinterpret the creativity score as a function of these two complementary dimensions. Specifically, we first augment an existing creativity-labeled dataset with additional annotations targeting content categories. Based on the enriched dataset, we further propose a multi-modal, multi-task learning framework that simultaneously predicts creativity scores, categorizes content types, and extracts stylistic features. In particular, we introduce a conditional learning mechanism that enables the model to adapt its visual feature extraction by dynamically tuning it to creativity-relevant signals conditioned on the drawing's stylistic and semantic cues. Experimental results demonstrate that our model achieves state-of-the-art performance compared to existing regression-based approaches and offers interpretable visualizations that align well with human judgments. The code and annotations will be made publicly available at https://github.com/WonderOfU9/CSCA_PRCV_2025

[254] ActVAR: Activating Mixtures of Weights and Tokens for Efficient Visual Autoregressive Generation

Kaixin Zhang,Ruiqing Yang,Yuan Zhang,Shan You,Tao Huang

Main category: cs.CV

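TL;DR: Proposes ActVAR, a dynamic activation framework that introduces dual sparsity over weights and token sequences to make visual autoregressive models more efficient while preserving performance.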

Details

Motivation: Existing static pruning methods permanently remove weights or tokens, breaking pretrained dependencies and degrading performance, while computational cost rises sharply as sequence length grows. Method: Feedforward networks are decomposed into lightweight expert sub-networks, and a learnable router dynamically selects a token-specific subset of experts based on content; a gated token selector identifies tokens with high update potential and reconstructs unselected tokens to preserve global context and sequence alignment; a two-stage knowledge distillation strategy uses the original VAR model to supervise learning of the routing and gating policies. Result: On the ImageNet $256\times 256$ benchmark, ActVAR reduces FLOPs by up to 21.2% with minimal performance loss. Conclusion: ActVAR's dynamic activation mechanism improves the computational efficiency of visual autoregressive models while largely preserving model capacity and generation quality. Abstract: Visual Autoregressive (VAR) models enable efficient image generation via next-scale prediction but face escalating computational costs as sequence length grows. Existing static pruning methods degrade performance by permanently removing weights or tokens, disrupting pretrained dependencies. To address this, we propose ActVAR, a dynamic activation framework that introduces dual sparsity across model weights and token sequences to enhance efficiency without sacrificing capacity. ActVAR decomposes feedforward networks (FFNs) into lightweight expert sub-networks and employs a learnable router to dynamically select token-specific expert subsets based on content. Simultaneously, a gated token selector identifies high-update-potential tokens for computation while reconstructing unselected tokens to preserve global context and sequence alignment. Training employs a two-stage knowledge distillation strategy, where the original VAR model supervises the learning of routing and gating policies to align with pretrained knowledge. Experiments on the ImageNet $256\times 256$ benchmark demonstrate that ActVAR achieves up to $21.2\%$ FLOPs reduction with minimal performance degradation.
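
To make the "router selects a token-specific expert subset" idea concrete, here is a generic top-k routing sketch in plain Python. It illustrates the general mixture-of-experts pattern only and is not ActVAR's implementation: the scalar "token", the toy experts, the fixed router logits, and the names `route_token` and `sparse_ffn` are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, k):
    """Pick the k highest-scoring experts and renormalise their gates."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def sparse_ffn(token, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    gates = route_token(router_logits, k)
    return sum(gate * experts[i](token) for i, gate in gates.items())


# Four toy "experts"; in a real model each would be a small sub-network.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: 0.0]
# In practice the logits come from a learned router applied to the token.
logits = [2.0, 0.0, 1.0, -1.0]
print(route_token(logits, k=2))
print(sparse_ffn(3.0, experts, logits, k=2))
```

Because only k of the experts run per token, compute scales with k and not with the total number of experts, which is the source of the efficiency gain this family of methods targets.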

[255] Reconstructing 3D Scenes in Native High Dynamic Range

Kaixuan Zhang,Minxian Li,Mingwu Ren,Jiankang Deng,Xiatian Zhu

Main category: cs.CV

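TL;DR: Proposes NH-3DGS, the first method to reconstruct 3D scenes directly from single-exposure native HDR data; a luminance-chromaticity decomposition preserves the full dynamic range and significantly improves reconstruction quality.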

Details

Motivation: Existing 3D scene reconstruction methods are mostly based on low dynamic range (LDR) data or depend on multi-exposure fusion and inverse tone-mapping, limiting their use in professional high dynamic range (HDR) media production. With cameras that capture native HDR data directly now available, a 3D reconstruction method that models native HDR observations directly is needed. Method: Native High dynamic range 3D Gaussian Splatting (NH-3DGS) introduces a novel luminance-chromaticity decomposition of the color representation, so that HDR camera data can be optimized directly and the full dynamic range is preserved throughout the reconstruction pipeline. Result: On synthetic and real multi-view HDR datasets, NH-3DGS significantly outperforms existing methods in reconstruction quality and dynamic range preservation. Conclusion: NH-3DGS is the first 3D reconstruction method to handle native HDR observations directly, offering a high-quality, simplified HDR reconstruction approach for professional digital content creation. Abstract: High Dynamic Range (HDR) imaging is essential for professional digital media creation, e.g., filmmaking, virtual production, and photorealistic rendering. However, 3D scene reconstruction has primarily focused on Low Dynamic Range (LDR) data, limiting its applicability to professional workflows. Existing approaches that reconstruct HDR scenes from LDR observations rely on multi-exposure fusion or inverse tone-mapping, which increase capture complexity and depend on synthetic supervision. With the recent emergence of cameras that directly capture native HDR data in a single exposure, we present the first method for 3D scene reconstruction that directly models native HDR observations. We propose Native High dynamic range 3D Gaussian Splatting (NH-3DGS), which preserves the full dynamic range throughout the reconstruction pipeline. Our key technical contribution is a novel luminance-chromaticity decomposition of the color representation that enables direct optimization from native HDR camera data. We demonstrate on both synthetic and real multi-view HDR datasets that NH-3DGS significantly outperforms existing methods in reconstruction quality and dynamic range preservation, enabling professional-grade 3D reconstruction directly from native HDR captures. Code and datasets will be made available.
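
The abstract does not spell out the form of the luminance-chromaticity decomposition, so the following is only a generic sketch of the idea: split a linear RGB value into a scalar luminance, which carries the dynamic range, and a scale-invariant chromaticity. The Rec. 709 luminance weights, the `eps` guard, and the function names are assumptions chosen for illustration, not the paper's formulation.

```python
def decompose(rgb, eps=1e-8):
    """Split a linear RGB triple into (luminance, chromaticity)."""
    r, g, b = rgb
    # Rec. 709 weights for linear RGB (an assumption for this sketch).
    luminance = 0.2126 * r + 0.7152 * g + 0.0722 * b
    # Chromaticity is the colour with the brightness divided out.
    chromaticity = tuple(c / (luminance + eps) for c in rgb)
    return luminance, chromaticity


def recompose(luminance, chromaticity, eps=1e-8):
    """Inverse of decompose: scale the chromaticity back by the luminance."""
    return tuple(c * (luminance + eps) for c in chromaticity)


# An HDR pixel whose channels span several orders of magnitude.
pixel = (1200.0, 35.0, 0.4)
lum, chroma = decompose(pixel)
print(lum, chroma)
print(recompose(lum, chroma))

# Scaling the pixel tenfold changes the luminance but barely moves the chromaticity.
print(decompose(tuple(10.0 * c for c in pixel)))
```

The appeal of a split like this for HDR data is that the wide-ranging quantity is isolated in one scalar, while the chromaticity stays in a narrow, well-conditioned range regardless of exposure.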

[256] FDP: A Frequency-Decomposition Preprocessing Pipeline for Unsupervised Anomaly Detection in Brain MRI

Hao Li,Zhenfeng Zhuang,Jingyu Lin,Yu Liu,Yifei Chen,Qiong Peng,Lequan Yu,Liansheng Wang

Main category: cs.CV

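TL;DR: Proposes an unsupervised brain MRI anomaly detection framework based on Frequency-Decomposition Preprocessing (FDP); frequency-domain analysis reveals distinctive frequency patterns of pathology, and the framework improves anomaly detection across several generative models.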

Details

Motivation: Existing unsupervised anomaly detection methods simulate anomalies with artificial noise, which lacks the biophysical and morphological complexity of real lesions, making clinical anomalies hard to detect accurately. Method: The FDP framework is built on the first systematic frequency-domain analysis of pathological signatures in brain MRI; it exploits the consistency of low-frequency signals and the distinctive frequency patterns of anomalies to reconstruct images while preserving normal anatomy and suppressing pathology. Result: FDP improves anomaly detection on a range of baseline models; combined with LDM, the DICE score increases by 17.63% while diagnostic fidelity is maintained. Conclusion: FDP is a general, effective preprocessing method for unsupervised anomaly detection; frequency-domain reconstruction significantly improves the accuracy and robustness of brain MRI anomaly detection. Abstract: Due to the diversity of brain anatomy and the scarcity of annotated data, supervised anomaly detection for brain MRI remains challenging, driving the development of unsupervised anomaly detection (UAD) approaches. Current UAD methods typically utilize artificially generated noise perturbations on healthy MRIs to train generative models for normal anatomy reconstruction, enabling anomaly detection via residual mapping. However, such simulated anomalies lack the biophysical fidelity and morphological complexity characteristic of true clinical lesions. To advance UAD in brain MRI, we conduct the first systematic frequency-domain analysis of pathological signatures, revealing two key properties: (1) anomalies exhibit unique frequency patterns distinguishable from normal anatomy, and (2) low-frequency signals maintain consistent representations across healthy scans. These insights motivate our Frequency-Decomposition Preprocessing (FDP) framework, the first UAD method to leverage frequency-domain reconstruction for simultaneous pathology suppression and anatomical preservation. FDP can integrate seamlessly with existing anomaly simulation techniques, consistently enhancing detection performance across diverse architectures while maintaining diagnostic fidelity. Experimental results demonstrate that FDP consistently improves anomaly detection performance when integrated with existing methods. Notably, FDP achieves a 17.63% increase in DICE score with LDM while maintaining robust improvements across multiple baselines. The code is available at https://github.com/ls1rius/MRI_FDP.

[257] DeepSport: A Multimodal Large Language Model for Comprehensive Sports Video Reasoning via Agentic Reinforcement Learning

Junbo Zou,Haotian Xia,Zhen Ye,Shengjie Zhang,Christopher Lai,Vicente Ordonez,Weining Shen,Hanjie Chen

Main category: cs.CV

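TL;DR: Presents DeepSport, the first end-to-end trained multimodal large language model framework for multi-task, multi-sport video understanding; it "thinks with videos" through active, iterative reasoning and a specialized frame-extraction tool, and reaches state-of-the-art performance on a 6.7k-question test set.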

Details

Motivation: Existing sports video understanding methods are limited to a single sport or specific tasks, or lack an effective learned reasoning mechanism, and struggle with high-speed dynamics, complex rules, and long temporal context; a general multi-sport video understanding framework with strong reasoning ability is needed. Method: The DeepSport framework uses a data distillation pipeline to generate 78k high-quality Chain-of-Thought (CoT) trajectories from 10 different data sources; training has two stages, Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) with a novel gated tool-use reward; a dedicated frame-extraction tool lets the model query video content dynamically for active reasoning. Result: On a test benchmark of 6.7k questions, DeepSport significantly outperforms existing proprietary and open-source baselines and shows strong multi-task, multi-sport video understanding, especially on complex spatio-temporal reasoning tasks. Conclusion: DeepSport sets a new paradigm for sports video understanding, validates the combination of end-to-end training with active reasoning, and provides a scalable foundation for domain-specific video understanding. Abstract: Sports video understanding presents unique challenges, requiring models to perceive high-speed dynamics, comprehend complex rules, and reason over long temporal contexts. While Multimodal Large Language Models (MLLMs) have shown promise in general domains, the current state of research in sports remains narrowly focused: existing approaches are either single-sport centric, limited to specific tasks, or rely on training-free paradigms that lack a robust, learned reasoning process. To address this gap, we introduce DeepSport, the first end-to-end trained MLLM framework designed for multi-task, multi-sport video understanding. DeepSport shifts the paradigm from passive frame processing to active, iterative reasoning, empowering the model to "think with videos" by dynamically interrogating content via a specialized frame-extraction tool. To enable this, we propose a data distillation pipeline that synthesizes high-quality Chain-of-Thought (CoT) trajectories from 10 diverse data sources, creating a unified resource of 78k training samples. We then employ a two-stage training strategy, Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) with a novel gated tool-use reward, to optimize the model's reasoning process. Extensive experiments on the testing benchmark of 6.7k questions demonstrate that DeepSport achieves state-of-the-art performance, significantly outperforming both proprietary and open-source baseline models. Our work establishes a new foundation for domain-specific video reasoning to address the complexities of diverse sports.

[258] CASL: Curvature-Augmented Self-supervised Learning for 3D Anomaly Detection

Yaohua Zha,Xue Yuerong,Chunlin Fan,Yuansong Wang,Tao Dai,Ke Chen,Shu-Tao Xia

Main category: cs.CV

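TL;DR: Proposes CASL, a curvature-augmented self-supervised learning framework for 3D anomaly detection that achieves strong performance without task-specific designs and generalizes well.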

Details

Motivation: Existing self-supervised point cloud models perform poorly at anomaly detection, while dedicated anomaly detection methods generalize poorly; a 3D model that detects anomalies effectively and is also general-purpose is needed. Method: Built on a U-Net architecture and a reconstruction paradigm, the approach introduces multi-scale curvature prompts that guide the decoder in predicting point coordinates, using curvature information to strengthen representation learning. Result: Using curvature alone as the anomaly score already outperforms several classical methods; CASL achieves leading anomaly detection performance and can be fine-tuned for standard 3D understanding tasks such as point cloud classification. Conclusion: CASL achieves high-performance anomaly detection without dedicated anomaly detection mechanisms, and its learned representations generalize well across tasks. Abstract: Deep learning-based 3D anomaly detection methods have demonstrated significant potential in industrial manufacturing. However, many approaches are specifically designed for anomaly detection tasks, which limits their generalizability to other 3D understanding tasks. In contrast, self-supervised point cloud models aim for general-purpose representation learning, yet our investigation reveals that these classical models are suboptimal at anomaly detection under the unified fine-tuning paradigm. This motivates us to develop a more generalizable 3D model that can effectively detect anomalies without relying on task-specific designs. Interestingly, we find that using only the curvature of each point as its anomaly score already outperforms several classical self-supervised and dedicated anomaly detection models, highlighting the critical role of curvature in 3D anomaly detection. In this paper, we propose a Curvature-Augmented Self-supervised Learning (CASL) framework based on a reconstruction paradigm. Built upon the classical U-Net architecture, our approach introduces multi-scale curvature prompts to guide the decoder in predicting the spatial coordinates of each point. Without relying on any dedicated anomaly detection mechanisms, it achieves leading detection performance through straightforward anomaly classification fine-tuning. Moreover, the learned representations generalize well to standard 3D understanding tasks such as point cloud classification. The code is available at https://github.com/zyh16143998882/CASL.

[259] Explore How to Inject Beneficial Noise in MLLMs

Ruishu Zhu,Sida Huang,Ziheng Jiao,Hongyuan Zhang

Main category: cs.CV

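TL;DR: Proposes a new way to fine-tune multimodal large language models by injecting beneficial random noise; it outperforms full-parameter fine-tuning while tuning only 1-2% additional parameters.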

Details

Motivation: Existing fine-tuning methods ignore cross-modal heterogeneity, limiting the potential of multimodal large language models. Method: The MLLM reasoning process is reformulated from a variational inference perspective, and a Multimodal Noise Generator (MuNG) is designed that dynamically analyzes cross-modal relationships in image-text pairs and injects task-adaptive beneficial noise into the frozen MLLM. Result: Experiments on two mainstream MLLMs, QwenVL and LLaVA, show that the method surpasses full-parameter fine-tuning and other fine-tuning methods while tuning only about 1-2% additional parameters. Conclusion: MuNG effectively suppresses irrelevant semantic components and improves cross-modal representation alignment and downstream performance, making it an efficient fine-tuning strategy with low parameter overhead. Abstract: Multimodal Large Language Models (MLLMs) have played an increasingly important role in multimodal intelligence. However, the existing fine-tuning methods often ignore cross-modal heterogeneity, limiting their full potential. In this work, we propose a novel fine-tuning strategy by injecting beneficial random noise, which outperforms previous methods and even surpasses full fine-tuning, with minimal additional parameters. The proposed Multimodal Noise Generator (MuNG) enables efficient modality fine-tuning by injecting customized noise into the frozen MLLMs. Specifically, we reformulate the reasoning process of MLLMs from a variational inference perspective, upon which we design a multimodal noise generator that dynamically analyzes cross-modal relationships in image-text pairs to generate task-adaptive beneficial noise. Injecting this type of noise into the MLLMs effectively suppresses irrelevant semantic components, leading to significantly improved cross-modal representation alignment and enhanced performance on downstream tasks. Experiments on two mainstream MLLMs, QwenVL and LLaVA, demonstrate that our method surpasses full-parameter fine-tuning and other existing fine-tuning approaches, while requiring adjustments to only about $1\sim2\%$ additional parameters. The relevant code is uploaded in the supplementary material.

[260] CoordAR: One-Reference 6D Pose Estimation of Novel Objects via Autoregressive Coordinate Map Generation

Dexin Zuo,Ang Li,Wei Wang,Wenxian Yu,Danping Zou

Main category: cs.CV

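TL;DR: Proposes CoordAR, an autoregressive framework for single-reference-image 6D pose estimation that does not depend on 3D models; with discretized coordinate maps and a decoupled encoding strategy, it significantly outperforms existing methods in challenging scenes with symmetry and occlusion.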

Details

Motivation: Existing one-reference 6D pose estimation methods rely on real-valued coordinate regression, are limited by the locality of convolutional architectures, lack uncertainty modeling, and struggle with symmetric or occluded scenes. Method: 3D-3D correspondences are modeled as a sequence of discrete tokens and predicted probabilistically by an autoregressive transformer decoder, combined with coordinate map tokenization, modality-decoupled encoding, and position-aligned features. Result: Significantly outperforms existing methods on multiple benchmarks and shows strong robustness to symmetry, occlusion, and other challenges in real-world tests. Conclusion: Through discretization and autoregressive modeling, CoordAR improves the global consistency and robustness of one-reference 6D pose estimation and offers a new approach to pose estimation without 3D models. Abstract: Object 6D pose estimation, a crucial task for robotics and augmented reality applications, becomes particularly challenging when dealing with novel objects whose 3D models are not readily available. To reduce dependency on 3D models, recent studies have explored one-reference-based pose estimation, which requires only a single reference view instead of a complete 3D model. However, existing methods that rely on real-valued coordinate regression suffer from limited global consistency due to the local nature of convolutional architectures and face challenges in symmetric or occluded scenarios owing to a lack of uncertainty modeling. We present CoordAR, a novel autoregressive framework for one-reference 6D pose estimation of unseen objects. CoordAR formulates 3D-3D correspondences between the reference and query views as a map of discrete tokens, which is obtained in an autoregressive and probabilistic manner. To enable accurate correspondence regression, CoordAR introduces 1) a novel coordinate map tokenization that enables probabilistic prediction over discretized 3D space; 2) a modality-decoupled encoding strategy that separately encodes RGB appearance and coordinate cues; and 3) an autoregressive transformer decoder conditioned on both position-aligned query features and the partially generated token sequence. With these novel mechanisms, CoordAR significantly outperforms existing methods on multiple benchmarks and demonstrates strong robustness to symmetry, occlusion, and other challenges in real-world tests.

[261] Generative Photographic Control for Scene-Consistent Video Cinematic Editing

Huiqiang Sun,Liao Shen,Zhan Peng,Kun Wang,Size Wu,Yuhang Zang,Tianqi Liu,Zihao Huang,Xingyu Zeng,Zhiguo Cao,Wei Li,Chen Change Loy

Main category: cs.CV

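TL;DR: Proposes CineCtrl, the first video cinematic editing framework with fine control over professional camera parameters (e.g., bokeh, shutter speed); a decoupled cross-attention mechanism allows photographic effects to be controlled independently while keeping the scene consistent.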

Details

Motivation: Most existing generative video models support only camera motion control and cannot precisely manipulate cinematic photographic effects such as depth of field and exposure, which limits their use in cinematic storytelling; a method for fine control over professional camera parameters is needed. Method: A decoupled cross-attention mechanism separates camera motion from photographic inputs for independent control; a large-scale training dataset is built from simulated photographic effects and real-world data collection. Result: Experiments show that the method generates high-fidelity videos and precisely controls user-specified photographic parameters such as bokeh and shutter speed, while maintaining temporal consistency and visual quality. Conclusion: CineCtrl is the first to provide fine control over professional photographic parameters in generated video, opening a new path for generative models in cinematic visual storytelling. Abstract: Cinematic storytelling is profoundly shaped by the artful manipulation of photographic elements such as depth of field and exposure. These effects are crucial in conveying mood and creating aesthetic appeal. However, controlling these effects in generative video models remains highly challenging, as most existing methods are restricted to camera motion control. In this paper, we propose CineCtrl, the first video cinematic editing framework that provides fine control over professional camera parameters (e.g., bokeh, shutter speed). We introduce a decoupled cross-attention mechanism to disentangle camera motion from photographic inputs, allowing fine-grained, independent control without compromising scene consistency. To overcome the shortage of training data, we develop a comprehensive data generation strategy that leverages simulated photographic effects with a dedicated real-world collection pipeline, enabling the construction of a large-scale dataset for robust model training. Extensive experiments demonstrate that our model generates high-fidelity videos with precisely controlled, user-specified photographic camera effects.

[262] Text2Traffic: A Text-to-Image Generation and Editing Method for Traffic Scenes

Feng Lv,Haoxuan Feng,Zilu Zhang,Chunlong Xia,Yanfeng Li

Main category: cs.CV

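TL;DR: Proposes a unified text-driven framework for image generation and editing in traffic scenes; combining a controllable mask mechanism with multi-view data, it significantly improves the semantic richness, visual fidelity, and text alignment of generated images.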

Details

Motivation: Existing text-to-image generation in traffic scenes suffers from limited semantic richness, restricted viewpoints, low visual fidelity, and poor text-image alignment, and falls short of the needs of intelligent transportation systems. Method: A unified text-driven generation and editing framework uses a controllable mask mechanism to merge the two tasks; vehicle-side and roadside multi-view data increase geometric diversity; training has two stages, concept learning on coarse-grained data followed by fine-tuning on fine-grained data; a mask-region-weighted loss improves the generation quality of small-scale traffic elements. Result: Experiments show leading performance on text-to-image generation and editing in traffic scenes, with clear improvements in detail quality, semantic accuracy, and text alignment. Conclusion: The framework addresses several key challenges in traffic scene generation; through multi-view data fusion, two-stage training, and a region-weighted loss, it delivers strong generation quality and controllability and has good application potential. Abstract: With the rapid advancement of intelligent transportation systems, text-driven image generation and editing techniques have demonstrated significant potential in providing rich, controllable visual scene data for applications such as traffic monitoring and autonomous driving. However, several challenges remain, including insufficient semantic richness of generated traffic elements, limited camera viewpoints, low visual fidelity of synthesized images, and poor alignment between textual descriptions and generated content. To address these issues, we propose a unified text-driven framework for both image generation and editing, leveraging a controllable mask mechanism to seamlessly integrate the two tasks. Furthermore, we incorporate both vehicle-side and roadside multi-view data to enhance the geometric diversity of traffic scenes. Our training strategy follows a two-stage paradigm: first, we perform conceptual learning using large-scale coarse-grained text-image data; then, we fine-tune with fine-grained descriptive data to enhance text-image alignment and detail quality. Additionally, we introduce a mask-region-weighted loss that dynamically emphasizes small yet critical regions during training, thereby substantially enhancing the generation fidelity of small-scale traffic elements. Extensive experiments demonstrate that our method achieves leading performance in text-based image generation and editing within traffic scenes.
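
A mask-region-weighted loss of the kind described above can be sketched as a weighted mean squared error in which pixels inside the mask receive a larger weight, and the weight grows as the masked region gets smaller. The abstract does not give the actual weighting rule, so the inverse-area weight, the `max_weight` cap, and the function name below are assumptions for illustration only.

```python
def mask_weighted_mse(pred, target, mask, max_weight=10.0):
    """Weighted MSE over flat pixel lists; masked pixels are up-weighted.

    Hypothetical rule: the weight of a masked pixel is the inverse of the
    masked area fraction, capped at max_weight, so smaller regions count more.
    """
    n = len(pred)
    n_masked = sum(1 for m in mask if m)
    # With an empty mask the loss reduces to plain MSE.
    region_weight = min(max_weight, n / n_masked) if n_masked else 1.0
    weights = [region_weight if m else 1.0 for m in mask]
    weighted_error = sum(w * (p - t) ** 2 for w, p, t in zip(weights, pred, target))
    # Normalise by the total weight so the loss stays on the MSE scale.
    return weighted_error / sum(weights)


prediction = [0.0, 0.0, 0.0, 0.0]
reference = [1.0, 0.0, 0.0, 0.0]
small_object_mask = [1, 0, 0, 0]

print(mask_weighted_mse(prediction, reference, small_object_mask))
print(mask_weighted_mse(prediction, reference, [0, 0, 0, 0]))
```

In the toy example the only error falls on the single masked pixel, so the weighted loss is larger than the plain MSE; an error on a small object is penalised more heavily than the same error on background.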

[263] PFAvatar: Pose-Fusion 3D Personalized Avatar Reconstruction from Real-World Outfit-of-the-Day Photos

Dianbing Xi,Guoyuan An,Jingsen Zhu,Zhijian Liu,Yuan Liu,Ruiyuan Zhang,Jiayuan Lu,Rui Wang,Yuchi Huo

Main category: cs.CV

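TL;DR: Proposes PFAvatar, a new method that reconstructs high-quality 3D avatars from "Outfit of the Day" photos in two stages: fine-tuning a pose-aware diffusion model, then distilling a NeRF-based 3D avatar representation; it outperforms existing methods in detail preservation, occlusion handling, and reconstruction fidelity.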

Details

Motivation: Existing methods for generating 3D avatars from real-world OOTD photos suffer from inconsistent decomposition, loss of detail, poor occlusion handling, and slow speed; an end-to-end, efficient method that preserves high-fidelity detail is needed. Method: Two stages: the first fine-tunes a pose-aware diffusion model from few-shot examples using a pre-trained ControlNet and the proposed Condition Prior Preservation Loss (CPPL), modeling full-body appearance directly without image decomposition; the second optimizes a NeRF-based avatar representation with canonical SMPL-X space sampling and Multi-Resolution 3D-SDS. Result: PFAvatar completes personalization in 5 minutes, 48 times faster than previous methods; experiments show it outperforms state-of-the-art methods in reconstruction fidelity, detail preservation (e.g., hair), and robustness to occlusion/truncation, and it supports applications such as virtual try-on, animation, and video reenactment. Conclusion: PFAvatar generates 3D avatars from real OOTD photos quickly and with high quality, addresses the limitations of earlier methods in consistency, resolution, and occlusion handling, and advances 3D avatar generation in practical settings. Abstract: We propose PFAvatar (Pose-Fusion Avatar), a new method that reconstructs high-quality 3D avatars from "Outfit of the Day" (OOTD) photos, which exhibit diverse poses, occlusions, and complex backgrounds. Our method consists of two stages: (1) fine-tuning a pose-aware diffusion model from few-shot OOTD examples and (2) distilling a 3D avatar represented by a neural radiance field (NeRF). In the first stage, unlike previous methods that segment images into assets (e.g., garments, accessories) for 3D assembly, which is prone to inconsistency, we avoid decomposition and directly model the full-body appearance. By integrating a pre-trained ControlNet for pose estimation and a novel Condition Prior Preservation Loss (CPPL), our method enables end-to-end learning of fine details while mitigating language drift in few-shot training. Our method completes personalization in just 5 minutes, achieving a 48$\times$ speed-up compared to previous approaches. In the second stage, we introduce a NeRF-based avatar representation optimized by canonical SMPL-X space sampling and Multi-Resolution 3D-SDS. Compared to mesh-based representations that suffer from resolution-dependent discretization and erroneous occluded geometry, our continuous radiance field can preserve high-frequency textures (e.g., hair) and handle occlusions correctly through transmittance. Experiments demonstrate that PFAvatar outperforms state-of-the-art methods in terms of reconstruction fidelity, detail preservation, and robustness to occlusions/truncations, advancing practical 3D avatar generation from real-world OOTD albums. In addition, the reconstructed 3D avatar supports downstream applications such as virtual try-on, animation, and human video reenactment, further demonstrating the versatility and practical value of our approach.

[264] ProtoAnomalyNCD: Prototype Learning for Multi-class Novel Anomaly Discovery in Industrial Scenarios

Botong Zhao,Qijun Shi,Shujing Lyu,Yue Lu

Main category: cs.CV

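TL;DR: This paper proposes ProtoAnomalyNCD, a prototype-learning framework for discovering and classifying multiple unknown industrial anomaly types. It combines image priors with an attention mechanism to improve clustering and outperforms existing methods on several benchmark datasets.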

Details

Motivation: Existing industrial anomaly detection methods struggle to discover and classify multiple semantically subtle anomaly types, and they do not fully exploit image priors, which leads to poor unsupervised clustering. Method: ProtoAnomalyNCD uses Grounded SAM with text prompts to localize object regions as priors; it adds an Anomaly-Map-Guided Attention block with a Region Guidance Factor that separates background, object, and anomalous regions; it discovers and classifies unseen anomaly classes under a unified prototype-learning framework, and it extends to detecting unseen outliers. Result: The method outperforms current state-of-the-art approaches on the MVTec AD, MTD, and Real-IAD datasets, supporting its effectiveness for multi-type anomaly discovery and classification. Conclusion: By combining vision and language priors, anomaly-map-guided attention, and prototype learning, ProtoAnomalyNCD discovers and classifies multiple unknown anomaly types and moves industrial anomaly detection toward more practical multi-type recognition. Abstract: Existing industrial anomaly detection methods mainly determine whether an anomaly is present. However, real-world applications also require discovering and classifying multiple anomaly types. Since industrial anomalies are semantically subtle and current methods do not sufficiently exploit image priors, direct clustering approaches often perform poorly. To address these challenges, we propose ProtoAnomalyNCD, a prototype-learning-based framework for discovering unseen anomaly classes of multiple types that can be integrated with various anomaly detection methods. First, to suppress background clutter, we leverage Grounded SAM with text prompts to localize object regions as priors for the anomaly classification network. Next, because anomalies usually appear as subtle and fine-grained patterns on the product, we introduce an Anomaly-Map-Guided Attention block. Within this block, we design a Region Guidance Factor that helps the attention module distinguish among background, object regions, and anomalous regions. By using both localized product regions and anomaly maps as priors, the module enhances anomalous features while suppressing background noise and preserving normal features for contrastive learning. Finally, under a unified prototype-learning framework, ProtoAnomalyNCD discovers and clusters unseen anomaly classes while simultaneously enabling multi-type anomaly classification. We further extend our method to detect unseen outliers, achieving task-level unification. Our method outperforms state-of-the-art approaches on the MVTec AD, MTD, and Real-IAD datasets.

[265] Semi-Supervised High Dynamic Range Image Reconstructing via Bi-Level Uncertain Area Masking

Wei Jiang,Jiahao Cui,Yizheng Wu,Zhan Peng,Zhiyu Pan,Zhiguo Cao

Main category: cs.CV

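TL;DR: A semi-supervised method for high dynamic range (HDR) image reconstruction that uses uncertainty-based masking to reduce confirmation bias in pseudo-labels and reaches performance comparable to fully supervised methods with only 6.7% of the ground-truth annotations.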

Details

Motivation: Because HDR ground-truth labels are hard to obtain, the goal is high-performance HDR image reconstruction with limited HDR annotations. Method: A teacher-student framework for semi-supervised learning: the teacher generates pseudo HDR labels for unlabeled samples, and an uncertainty-based masking mechanism at the pixel and patch levels filters out unreliable regions so that the student learns only from trusted regions. Result: With only 6.7% of the HDR ground-truth labels, the method outperforms previous semi-supervised methods and is comparable to recent fully supervised methods. Conclusion: The proposed uncertainty masking mitigates confirmation bias in semi-supervised HDR reconstruction and substantially reduces the dependence on annotated data. Abstract: Reconstructing high dynamic range (HDR) images from low dynamic range (LDR) bursts plays an essential role in computational photography. Impressive progress has been achieved by learning-based algorithms which require LDR-HDR image pairs. However, these pairs are hard to obtain, which motivates researchers to delve into the problem of annotation-efficient HDR image reconstructing: how to achieve comparable performance with limited HDR ground truths (GTs). This work attempts to address this problem from the view of semi-supervised learning where a teacher model generates pseudo HDR GTs for the LDR samples without GTs and a student model learns from pseudo GTs. Nevertheless, the confirmation bias, i.e., the student may learn from the artifacts in pseudo HDR GTs, presents an impediment. To remove this impediment, an uncertainty-based masking process is proposed to discard unreliable parts of pseudo GTs at both pixel and patch levels, so that the student learns only from the trusted areas. With this novel masking process, our semi-supervised HDR reconstructing method not only outperforms previous annotation-efficient algorithms, but also achieves comparable performance with up-to-date fully-supervised methods by using only 6.7% HDR GTs.
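
The two masking levels described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the abstract does not say how uncertainty is estimated or how patches are scored, so the per-pixel `uncertainty` grid, the thresholds `pixel_thresh` and `min_reliable_ratio`, and the function names are all assumptions made for this example.

```python
from typing import List

Grid = List[List[float]]


def bilevel_mask(uncertainty: Grid, pixel_thresh: float,
                 patch_size: int, min_reliable_ratio: float) -> List[List[int]]:
    """Return a 0/1 mask; 1 marks pseudo-label pixels the student may learn from."""
    h, w = len(uncertainty), len(uncertainty[0])
    # Pixel level (assumed rule): keep pixels at or below the uncertainty threshold.
    mask = [[1 if u <= pixel_thresh else 0 for u in row] for row in uncertainty]
    # Patch level (assumed rule): drop a whole patch when too few pixels are reliable.
    for py in range(0, h, patch_size):
        for px in range(0, w, patch_size):
            cells = [(y, x)
                     for y in range(py, min(py + patch_size, h))
                     for x in range(px, min(px + patch_size, w))]
            ratio = sum(mask[y][x] for y, x in cells) / len(cells)
            if ratio < min_reliable_ratio:
                for y, x in cells:
                    mask[y][x] = 0
    return mask


def masked_l1(pred: Grid, pseudo: Grid, mask: List[List[int]]) -> float:
    """Mean absolute error over trusted pixels only (0.0 if none are trusted)."""
    kept = sum(sum(row) for row in mask)
    if kept == 0:
        return 0.0
    total = sum(m * abs(p - q)
                for pr, qr, mr in zip(pred, pseudo, mask)
                for p, q, m in zip(pr, qr, mr))
    return total / kept


# Toy 2x4 uncertainty map split into two 2x2 patches.
uncertainty = [[0.1, 0.1, 0.9, 0.9],
               [0.1, 0.9, 0.1, 0.9]]
mask = bilevel_mask(uncertainty, pixel_thresh=0.5, patch_size=2, min_reliable_ratio=0.5)
print(mask)
```

In the toy map, the left patch keeps its three reliable pixels, while the right patch is discarded entirely because only one of its four pixels passes the pixel-level test.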

[266] Recurrent Autoregressive Diffusion: Global Memory Meets Local Attention

Taiye Chen,Zihan Ding,Anjian Li,Christina Zhang,Zeqi Xiao,Yisen Wang,Chi Jin

Main category: cs.CV

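TL;DR: Proposes Recurrent Autoregressive Diffusion (RAD), a framework that adds an LSTM to the diffusion transformer to better retain historical information in long video generation.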

Details

Motivation: Existing diffusion-based video generation methods suffer from forgetting and spatiotemporal inconsistency when generating long sequences, and they are limited by the training-inference gap or the lack of overlap across windows. Method: RAD combines an LSTM with attention inside the diffusion transformer and performs frame-wise autoregressive memory update and retrieval, consistently across training and inference. Result: Experiments on the Memory Maze and Minecraft datasets show that RAD outperforms existing methods for long video generation, supporting the efficiency of LSTM in sequence modeling. Conclusion: RAD addresses memory compression and consistency in long-horizon video generation and offers a better recurrent design for diffusion-based world models. Abstract: Recent advancements in video generation have demonstrated the potential of using video diffusion models as world models, with autoregressive generation of infinitely long videos through masked conditioning. However, such models, usually with local full attention, lack effective memory compression and retrieval for long-term generation beyond the window size, leading to issues of forgetting and spatiotemporal inconsistencies. To enhance the retention of historical information within a fixed memory budget, we introduce a recurrent neural network (RNN) into the diffusion transformer framework. Specifically, a diffusion model incorporating LSTM with attention achieves comparable performance to state-of-the-art RNN blocks, such as TTT and Mamba2. Moreover, existing diffusion-RNN approaches often suffer from performance degradation due to training-inference gap or the lack of overlap across windows. To address these limitations, we propose a novel Recurrent Autoregressive Diffusion (RAD) framework, which executes frame-wise autoregression for memory update and retrieval, consistently across training and inference time. Experiments on Memory Maze and Minecraft datasets demonstrate the superiority of RAD for long video generation, highlighting the efficiency of LSTM in sequence modeling.

[267] T2I-Based Physical-World Appearance Attack against Traffic Sign Recognition Systems in Autonomous Driving

Chen Ma,Ningfei Wang,Junhao Zheng,Qing Guo,Qian Wang,Qi Alfred Chen,Chao Shen

Main category: cs.CV

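TL;DR: This paper proposes DiffSign, a new adversarial attack framework against traffic sign recognition built on text-to-image diffusion models. It uses a CLIP-based loss and masked prompts to improve attack controllability and introduces two style customization methods to improve stealthiness and generalization to unseen sign types. Experiments show an average attack success rate of 83.3% under real-world conditions, with good transferability and practicality.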

Details

Motivation: Existing adversarial appearance attacks fall short in stealthiness, transferability, and generalization to unseen traffic signs, which makes them ineffective against deployed traffic sign recognition systems. Method: The DiffSign framework combines a CLIP-based loss with masked prompts to sharpen attack focus, adds two style customization methods to improve visual stealthiness and cross-domain generalization, and is tested in the physical world under a range of real scenarios. Result: Across different distances, angles, lighting conditions, and sign categories, DiffSign achieves an average physical-world attack success rate of 83.3% and shows strong effectiveness, transferability, and practicality. Conclusion: DiffSign offers an efficient and practical approach to diffusion-based adversarial appearance attacks, exposes security weaknesses of current traffic sign recognition systems in the physical world, and underlines the need to improve the robustness of autonomous driving systems. Abstract: Traffic Sign Recognition (TSR) systems play a critical role in Autonomous Driving (AD) systems, enabling real-time detection of road signs, such as STOP and speed limit signs. While these systems are increasingly integrated into commercial vehicles, recent research has exposed their vulnerability to physical-world adversarial appearance attacks. In such attacks, carefully crafted visual patterns are misinterpreted by TSR models as legitimate traffic signs, while remaining inconspicuous or benign to human observers. However, existing adversarial appearance attacks suffer from notable limitations. Pixel-level perturbation-based methods often lack stealthiness and tend to overfit to specific surrogate models, resulting in poor transferability to real-world TSR systems. On the other hand, text-to-image (T2I) diffusion model-based approaches demonstrate limited effectiveness and poor generalization to out-of-distribution sign types. In this paper, we present DiffSign, a novel T2I-based appearance attack framework designed to generate physically robust, highly effective, transferable, practical, and stealthy appearance attacks against TSR systems. To overcome the limitations of prior approaches, we propose a carefully designed attack pipeline that integrates CLIP-based loss and masked prompts to improve attack focus and controllability. We also propose two novel style customization methods to guide visual appearance and improve out-of-domain traffic sign attack generalization and attack stealthiness.
We conduct extensive evaluations of DiffSign under varied real-world conditions, including different distances, angles, light conditions, and sign categories. Our method achieves an average physical-world attack success rate of 83.3%, leveraging DiffSign's high effectiveness in attack transferability.

[268] EndoSight AI: Deep Learning-Driven Real-Time Gastrointestinal Polyp Detection and Segmentation for Enhanced Endoscopic Diagnostics

Daniel Cavadia

Main category: cs.CV

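TL;DR: This paper presents EndoSight AI, a deep learning architecture for real-time, precise detection and segmentation of polyps during gastrointestinal endoscopy. Trained on the Hyper-Kvasir dataset, the system reports 88.3% mAP for detection and a Dice coefficient of up to 69% for segmentation, with real-time inference.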

Details

Motivation: Accurate, real-time polyp detection is essential for early diagnosis and prevention of colorectal cancer, and existing methods still fall short in accuracy or speed. Method: The EndoSight AI deep learning architecture is trained on the public Hyper-Kvasir dataset, incorporating clinically relevant performance metrics and a novel thermal-aware procedure to improve robustness and efficiency. Result: The system achieves 88.3% mAP for polyp detection and a Dice coefficient of up to 69% for segmentation, with real-time inference above 35 frames per second on GPU hardware. Conclusion: EndoSight AI is an efficient and robust integrated AI solution designed to fit into endoscopy workflows, with the potential to improve diagnostic accuracy and clinical decision-making in gastrointestinal care. Abstract: Precise and real-time detection of gastrointestinal polyps during endoscopic procedures is crucial for early diagnosis and prevention of colorectal cancer. This work presents EndoSight AI, a deep learning architecture developed and evaluated independently to enable accurate polyp localization and detailed boundary delineation. Leveraging the publicly available Hyper-Kvasir dataset, the system achieves a mean Average Precision (mAP) of 88.3% for polyp detection and a Dice coefficient of up to 69% for segmentation, alongside real-time inference speeds exceeding 35 frames per second on GPU hardware. The training incorporates clinically relevant performance metrics and a novel thermal-aware procedure to ensure model robustness and efficiency. This integrated AI solution is designed for seamless deployment in endoscopy workflows, promising to advance diagnostic accuracy and clinical decision-making in gastrointestinal healthcare.

[269] CalibrateMix: Guided-Mixup Calibration of Image Semi-Supervised Models

Mehrab Mustafy Rahman,Jayanth Mohan,Tiberiu Sosea,Cornelia Caragea

Main category: cs.CV

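TL;DR: Proposes CalibrateMix, a targeted mixup method that improves the calibration of semi-supervised learning models while maintaining or improving classification accuracy.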

Details

Motivation: Existing semi-supervised learning methods are poorly calibrated: models make overconfident predictions that do not reflect actual prediction likelihoods, and the overconfidence and unreliability of pseudo-labels make it hard to apply mixup directly. Method: The training dynamics of labeled and unlabeled samples are used to identify "easy-to-learn" and "hard-to-learn" samples, which are then combined through a targeted mixup of easy and hard samples. Result: Experiments on several benchmark image datasets show lower expected calibration error (ECE) and higher classification accuracy than existing semi-supervised methods. Conclusion: CalibrateMix improves the calibration of semi-supervised models without sacrificing, and sometimes improving, classification performance, offering a new route to well-calibrated SSL models. Abstract: Semi-supervised learning (SSL) has demonstrated high performance in image classification tasks by effectively utilizing both labeled and unlabeled data. However, existing SSL methods often suffer from poor calibration, with models yielding overconfident predictions that misrepresent actual prediction likelihoods. Recently, neural networks trained with mixup that linearly interpolates random examples from the training set have shown better calibration in supervised settings. However, calibration of neural models remains under-explored in semi-supervised settings. Although effective in supervised model calibration, random mixup of pseudolabels in SSL presents challenges due to the overconfidence and unreliability of pseudolabels. In this work, we introduce CalibrateMix, a targeted mixup-based approach that aims to improve the calibration of SSL models while maintaining or even improving their classification accuracy. Our method leverages training dynamics of labeled and unlabeled samples to identify "easy-to-learn" and "hard-to-learn" samples, which in turn are utilized in a targeted mixup of easy and hard samples. Experimental results across several benchmark image datasets show that our method achieves lower expected calibration error (ECE) and superior accuracy compared to existing SSL approaches.
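
For readers unfamiliar with the two quantities named above, the following sketch shows the standard mixup interpolation and a common equal-width-bin estimate of expected calibration error. It is a generic illustration under assumptions, not CalibrateMix itself: the bin count, the function names, and the toy inputs are chosen here, and the paper's rule for pairing easy and hard samples is not reproduced.

```python
from typing import List, Sequence, Tuple


def mixup(x_a: Sequence[float], x_b: Sequence[float],
          y_a: Sequence[float], y_b: Sequence[float],
          lam: float) -> Tuple[List[float], List[float]]:
    """Linearly interpolate two examples and their (soft) labels with weight lam."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Weighted mean of |accuracy - confidence| over equal-width confidence bins."""
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# An overconfident toy model: 90% confidence, but only 3 of 4 predictions correct.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
mixed_x, mixed_y = mixup([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], lam=0.25)
print(ece, mixed_x, mixed_y)
```

A lower ECE means the stated confidence tracks the observed accuracy more closely, which is the quantity the abstract reports as improved.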

[270] GrOCE: Graph-Guided Online Concept Erasure for Text-to-Image Diffusion Models

Ning Han,Zhenyu Ge,Feng Han,Yuhua Sun,Chengqing Li,Jingjing Chen

Main category: cs.CV

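TL;DR: Proposes GrOCE, a training-free, graph-guided online concept erasure framework that uses a dynamic semantic graph for precise and adaptive concept removal.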

Details

Motivation: Existing concept erasure methods rely on costly fine-tuning or coarse semantic separation, which tends to damage unrelated concepts and adapts poorly to new concepts. Method: GrOCE builds a dynamic semantic graph, identifies clusters adaptively through multi-hop traversal with similarity-decay scoring, and selectively severs target edges for precise erasure. Result: The method reaches state-of-the-art results on the Concept Similarity (CS) and FID metrics, with efficient, accurate, and stable concept erasure. Conclusion: GrOCE achieves fine-grained, scalable concept erasure without fine-tuning and has good potential for practical use. Abstract: Concept erasure aims to remove harmful, inappropriate, or copyrighted content from text-to-image diffusion models while preserving non-target semantics. However, existing methods either rely on costly fine-tuning or apply coarse semantic separation, often degrading unrelated concepts and lacking adaptability to evolving concept sets. To alleviate this issue, we propose Graph-Guided Online Concept Erasure (GrOCE), a training-free framework that performs precise and adaptive concept removal through graph-based semantic reasoning. GrOCE models concepts and their interrelations as a dynamic semantic graph, enabling principled reasoning over dependencies and fine-grained isolation of undesired content. It comprises three components: (1) Dynamic Topological Graph Construction for incremental graph building, (2) Adaptive Cluster Identification for multi-hop traversal with similarity-decay scoring, and (3) Selective Edge Severing for targeted edge removal while preserving global semantics. Extensive experiments demonstrate that GrOCE achieves state-of-the-art performance on Concept Similarity (CS) and Fréchet Inception Distance (FID) metrics, offering efficient, accurate, and stable concept erasure without retraining.
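
One plausible reading of "multi-hop traversal with similarity-decay scoring" followed by "selective edge severing" is sketched below on a toy concept graph. Every detail is an assumption for illustration: the abstract does not give the scoring formula, so the score here is the best product of edge similarities along a path, multiplied by a hypothetical `decay` factor per hop, and the graph, threshold, and function names are invented for this example.

```python
from typing import Dict, Set

Graph = Dict[str, Dict[str, float]]  # node -> {neighbour: similarity in [0, 1]}


def decayed_scores(graph: Graph, target: str,
                   decay: float = 0.5, max_hops: int = 2) -> Dict[str, float]:
    """Best path score from the target to each node reachable within max_hops."""
    scores = {target: 1.0}
    frontier = [target]
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            for nbr, sim in graph.get(node, {}).items():
                cand = scores[node] * sim * decay  # each hop is discounted by decay
                if cand > scores.get(nbr, 0.0):
                    scores[nbr] = cand
                    next_frontier.append(nbr)
        frontier = next_frontier
    return scores


def sever_cluster(graph: Graph, cluster: Set[str]) -> Graph:
    """Return a copy with every edge that crosses the cluster boundary removed."""
    return {node: {nbr: sim for nbr, sim in nbrs.items()
                   if (node in cluster) == (nbr in cluster)}
            for node, nbrs in graph.items()}


# Toy graph: "violence" is the concept to erase; "sports" and "red" should survive.
graph: Graph = {
    "violence": {"blood": 0.9, "sports": 0.4},
    "blood": {"violence": 0.9, "red": 0.5},
    "sports": {"violence": 0.4},
    "red": {"blood": 0.5},
}
scores = decayed_scores(graph, "violence")
cluster = {node for node, s in scores.items() if s >= 0.3}
pruned = sever_cluster(graph, cluster)
print(sorted(cluster), pruned)
```

The point of the decay is that loosely related concepts two hops away score low and stay outside the erased cluster, so only the edges linking the cluster to the rest of the graph are cut.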

[271] HiFusion: Hierarchical Intra-Spot Alignment and Regional Context Fusion for Spatial Gene Expression Prediction from Histopathology

Ziqiao Weng,Yaoyu Fang,Jiahe Qian,Xinkun Wang,Lee AD Cooper,Weidong Cai,Bo Zhou

Main category: cs.CV

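TL;DR: This paper proposes HiFusion, a deep learning framework that predicts spatial transcriptomics gene expression from H&E-stained whole-slide images. Hierarchical sub-patch decomposition and a cross-scale fusion module integrate local cell morphology with tissue microenvironment context, improving prediction accuracy and reaching state-of-the-art performance on two benchmark datasets.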

Details

Motivation: Spatial transcriptomics links gene expression to tissue morphology, but high cost and technical complexity hinder clinical use. Existing computational methods fall short in capturing biological heterogeneity within spots and in resisting morphological noise. Method: HiFusion has two modules: (1) a Hierarchical Intra-Spot Modeling module that extracts fine-grained morphological features through multi-resolution sub-patch decomposition, with a feature alignment loss that enforces semantic consistency across scales; and (2) a Context-aware Cross-scale Fusion module that uses cross-attention to selectively integrate biologically relevant regional context. Result: In extensive experiments on two benchmark spatial transcriptomics datasets, HiFusion achieves state-of-the-art performance in both 2D slide-wise cross-validation and the more challenging 3D sample-specific setting. Conclusion: HiFusion jointly models cell-level features and tissue microenvironment cues, making it a robust, accurate, and scalable solution for spatial transcriptomics inference that may help bring the technique into clinical pathology. Abstract: Spatial transcriptomics (ST) bridges gene expression and tissue morphology but faces clinical adoption barriers due to technical complexity and prohibitive costs. While computational methods predict gene expression from H&E-stained whole-slide images (WSIs), existing approaches often fail to capture the intricate biological heterogeneity within spots and are susceptible to morphological noise when integrating contextual information from surrounding tissue. To overcome these limitations, we propose HiFusion, a novel deep learning framework that integrates two complementary components. First, we introduce the Hierarchical Intra-Spot Modeling module that extracts fine-grained morphological representations through multi-resolution sub-patch decomposition, guided by a feature alignment loss to ensure semantic consistency across scales. Concurrently, we present the Context-aware Cross-scale Fusion module, which employs cross-attention to selectively incorporate biologically relevant regional context, thereby enhancing representational capacity. This architecture enables comprehensive modeling of both cellular-level features and tissue microenvironmental cues, which are essential for accurate gene expression prediction. Extensive experiments on two benchmark ST datasets demonstrate that HiFusion achieves state-of-the-art performance across both 2D slide-wise cross-validation and more challenging 3D sample-specific scenarios. These results underscore HiFusion's potential as a robust, accurate, and scalable solution for ST inference from routine histopathology.

[272] MCAQ-YOLO: Morphological Complexity-Aware Quantization for Efficient Object Detection with Curriculum Learning

Yoonjae Seo,Ermal Elbasani,Jaehong Lee

Main category: cs.CV

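TL;DR: This paper proposes MCAQ-YOLO, a spatially adaptive quantization framework driven by morphological complexity. Five morphological metrics guide bit-width allocation, improving object detection accuracy while keeping computational overhead low.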

Details

Motivation: Most neural network quantization methods use a uniform bit precision across space and ignore the structural and textural heterogeneity of visual data, so complex regions are sensitive to quantization while simple regions are over-provisioned. This calls for a method that adjusts quantization precision to local visual complexity. Method: Five morphological metrics (fractal dimension, texture entropy, gradient variance, edge density, and contour complexity) characterize local visual morphology and are correlated with quantization sensitivity; spatially adaptive bit allocation is built on that correlation; and a curriculum-based quantization-aware training scheme progressively increases quantization difficulty to stabilize optimization. Result: On a safety equipment dataset, MCAQ-YOLO reaches 85.6% mAP@0.5 at an average of 4.2 bits and a 7.6x compression ratio, 3.5 percentage points above uniform 4-bit quantization, with only 1.8 ms of additional runtime per image; the gains also hold on COCO and Pascal VOC. Conclusion: Morphological complexity is closely tied to quantization sensitivity, and morphology-driven spatially adaptive quantization can improve the efficiency and robustness of object detectors for resource-constrained, safety-critical vision tasks. Abstract: Most neural network quantization methods apply uniform bit precision across spatial regions, ignoring the heterogeneous structural and textural complexity of visual data. This paper introduces MCAQ-YOLO, a morphological complexity-aware quantization framework for object detection. The framework employs five morphological metrics - fractal dimension, texture entropy, gradient variance, edge density, and contour complexity - to characterize local visual morphology and guide spatially adaptive bit allocation. By correlating these metrics with quantization sensitivity, MCAQ-YOLO dynamically adjusts bit precision according to spatial complexity. In addition, a curriculum-based quantization-aware training scheme progressively increases quantization difficulty to stabilize optimization and accelerate convergence. Experimental results demonstrate a strong correlation between morphological complexity and quantization sensitivity and show that MCAQ-YOLO achieves superior detection accuracy and convergence efficiency compared with uniform quantization. On a safety equipment dataset, MCAQ-YOLO attains 85.6% mAP@0.5 with an average of 4.2 bits and a 7.6x compression ratio, yielding 3.5 percentage points higher mAP than uniform 4-bit quantization while introducing only 1.8 ms of additional runtime overhead per image.
Cross-dataset validation on COCO and Pascal VOC further confirms consistent performance gains, indicating that morphology-driven spatial quantization can enhance efficiency and robustness for computationally constrained, safety-critical visual recognition tasks.
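
To make "complexity-guided bit allocation" concrete, here is a rough sketch that computes one of the five listed metrics, edge density, for a grayscale patch and maps it to a bit width. It rests on assumptions throughout: the forward-difference gradient, the edge threshold, the linear mapping, and the 2 to 8 bit range are illustrative choices made here, not values from the paper.

```python
from typing import List

Patch = List[List[float]]  # grayscale intensities in [0, 1]


def gradient_magnitudes(patch: Patch) -> List[float]:
    """Forward-difference gradient magnitude at each interior position."""
    h, w = len(patch), len(patch[0])
    mags = []
    for y in range(h - 1):
        for x in range(w - 1):
            gx = patch[y][x + 1] - patch[y][x]
            gy = patch[y + 1][x] - patch[y][x]
            mags.append((gx * gx + gy * gy) ** 0.5)
    return mags


def edge_density(patch: Patch, threshold: float = 0.2) -> float:
    """Fraction of positions whose gradient magnitude exceeds the threshold."""
    mags = gradient_magnitudes(patch)
    return sum(1 for m in mags if m > threshold) / len(mags)


def allocate_bits(complexity: float, min_bits: int = 2, max_bits: int = 8) -> int:
    """Map a complexity score in [0, 1] linearly onto an integer bit width."""
    clipped = min(max(complexity, 0.0), 1.0)
    return min_bits + round(clipped * (max_bits - min_bits))


flat = [[0.5] * 4 for _ in range(4)]                               # no texture
checker = [[float((x + y) % 2) for x in range(4)] for y in range(4)]  # dense edges
flat_bits = allocate_bits(edge_density(flat))
checker_bits = allocate_bits(edge_density(checker))
print(flat_bits, checker_bits)
```

The flat patch gets the minimum bit width and the checkerboard the maximum, which mirrors the idea that simple regions tolerate coarser quantization than textured ones.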

[273] ArtiWorld: LLM-Driven Articulation of 3D Objects in Scenes

Yixuan Yang,Luyang Xie,Zhen Luo,Zixiang Zhao,Mingqi Gao,Feng Zheng

Main category: cs.CV

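TL;DR: This paper presents ArtiWorld, a scene-aware pipeline that identifies articulable objects from textual scene descriptions and generates URDF models that preserve the original geometry, improving the efficiency and quality of turning 3D assets into interactive robot simulation environments.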

Details

Motivation: Most existing 3D simulation assets are rigid, and converting them into articulated objects by hand is slow and labor-intensive, so an automated conversion method is needed. Method: The ArtiWorld pipeline is built around Arti4URDF, which combines 3D point clouds, the prior knowledge of a large language model, and a URDF-oriented prompt design to convert rigid objects quickly into interactive URDF-based articulated models. Result: Evaluated at three levels (simulated objects, full simulated scenes, and real-world scanned scenes), the method outperforms existing approaches and reaches state-of-the-art performance while preserving geometry and correctly capturing object interactivity. Conclusion: ArtiWorld offers a practical path to building interactive, robot-ready simulation environments from existing 3D assets. Abstract: Building interactive simulators and scalable robot-learning environments requires a large number of articulated assets. However, most existing 3D assets in simulation are rigid, and manually converting them into articulated objects is extremely labor- and cost-intensive. This raises a natural question: can we automatically identify articulable objects in a scene and convert them into articulated assets directly? In this paper, we present ArtiWorld, a scene-aware pipeline that localizes candidate articulable objects from textual scene descriptions and reconstructs executable URDF models that preserve the original geometry. At the core of this pipeline is Arti4URDF, which leverages 3D point cloud, prior knowledge of a large language model (LLM), and a URDF-oriented prompt design to rapidly convert rigid objects into interactive URDF-based articulated objects while maintaining their 3D shape. We evaluate ArtiWorld at three levels: 3D simulated objects, full 3D simulated scenes, and real-world scan scenes. Across all three settings, our method consistently outperforms existing approaches and achieves state-of-the-art performance, while preserving object geometry and correctly capturing object interactivity to produce usable URDF-based articulated models. This provides a practical path toward building interactive, robot-ready simulation environments directly from existing 3D assets. Code and data will be released.

[274] Concept Regions Matter: Benchmarking CLIP with a New Cluster-Importance Approach

Aishwarya Agarwal,Srikrishna Karanam,Vineet Gandhi

Main category: cs.CV

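TL;DR: This paper proposes Cluster-based Concept Importance (CCI), an interpretability method for vision-language models that clusters CLIP patch embeddings into semantic groups and measures how masking each group changes the prediction, setting a new state of the art on faithfulness benchmarks. Combined with GroundedSAM, it automatically separates foreground-driven from background-driven predictions. The authors argue that existing benchmarks such as CounterAnimals, which rely on accuracy alone, wrongly attribute non-background failures to background correlations. They therefore build a new benchmark, COVAR, that systematically decouples foreground and background variation, and use CCI to evaluate 18 CLIP variants.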

Details

Motivation: Contrastive vision-language models such as CLIP have strong zero-shot recognition but are vulnerable to spurious correlations, especially background reliance. Current benchmarks tend to assume that every performance drop is caused by the background, ignoring viewpoint, scale, and fine-grained confusion, and so lack fine-grained diagnostic power. Method: Cluster-based Concept Importance (CCI) uses CLIP's own patch embeddings to group spatial patches into semantically coherent clusters, masks those clusters, and measures the change in predictions to assess their importance. Combined with GroundedSAM, it classifies predictions automatically as foreground- or background-driven. The new COVAR benchmark systematically varies foregrounds and backgrounds to separate the confounding factors. Result: CCI more than doubles the deletion-AUC score of prior methods on MS COCO retrieval and clearly leads existing interpretability methods. The COVAR evaluation of 18 CLIP variants shows that, beyond background, viewpoint variation, scale shifts, and fine-grained confusion are also major causes of performance drops. Conclusion: CCI is a high-fidelity interpretability method that identifies the regions driving VLM predictions, and COVAR allows more precise diagnosis of where model bias comes from than accuracy-only analysis, providing methods and evidence for building more robust and interpretable vision-language models. Abstract: Contrastive vision-language models (VLMs) such as CLIP achieve strong zero-shot recognition yet remain vulnerable to spurious correlations, particularly background over-reliance. We introduce Cluster-based Concept Importance (CCI), a novel interpretability method that uses CLIP's own patch embeddings to group spatial patches into semantically coherent clusters, mask them, and evaluate relative changes in model predictions. CCI sets a new state of the art on faithfulness benchmarks, surpassing prior methods by large margins; for example, it yields more than a twofold improvement on the deletion-AUC metric for MS COCO retrieval. We further propose that CCI, when combined with GroundedSAM, automatically categorizes predictions as foreground- or background-driven, providing a crucial diagnostic ability. Existing benchmarks such as CounterAnimals, however, rely solely on accuracy and implicitly attribute all performance degradation to background correlations. Our analysis shows this assumption to be incomplete, since many errors arise from viewpoint variation, scale shifts, and fine-grained object confusions. To disentangle these effects, we introduce COVAR, a benchmark that systematically varies object foregrounds and backgrounds. Leveraging CCI with COVAR, we present a comprehensive evaluation of eighteen CLIP variants, offering methodological advances and empirical evidence that chart a path toward more robust VLMs.
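
The mask-and-measure step of CCI can be illustrated with a small occlusion loop. This is only a sketch under assumptions: cluster assignments are taken as given (the paper derives them from CLIP patch embeddings), `toy_score` is a hypothetical stand-in for a real image-text similarity score, and masking is modeled as replacing a patch with a constant baseline.

```python
from typing import Callable, Dict, List, Sequence

Patch = List[float]


def cluster_importance(patches: Sequence[Patch],
                       cluster_ids: Sequence[int],
                       score_fn: Callable[[Sequence[Patch]], float],
                       baseline: float = 0.0) -> Dict[int, float]:
    """Relative drop in the model score when each cluster of patches is masked."""
    base = score_fn(patches)
    importance: Dict[int, float] = {}
    for cid in sorted(set(cluster_ids)):
        # Mask every patch in this cluster; leave all other patches untouched.
        masked = [[baseline] * len(p) if c == cid else list(p)
                  for p, c in zip(patches, cluster_ids)]
        importance[cid] = (base - score_fn(masked)) / base if base else 0.0
    return importance


def toy_score(patches: Sequence[Patch]) -> float:
    """Hypothetical scorer: total activation, standing in for a CLIP similarity."""
    return sum(sum(p) for p in patches)


# Four one-dimensional "patch features"; cluster 0 carries all of the signal.
patches = [[3.0], [1.0], [0.0], [0.0]]
cluster_ids = [0, 0, 1, 1]
importance = cluster_importance(patches, cluster_ids, toy_score)
print(importance)
```

A cluster whose removal collapses the score is one the prediction depends on; checking whether that cluster lies on the object or the background is what the GroundedSAM step described above would decide.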

[275] UNSEEN: Enhancing Dataset Pruning from a Generalization Perspective

Furui Xu,Shaobo Wang,Jiajun Zhang,Chenghao Sun,Haixiang Tang,Linfeng Zhang

Main category: cs.CV

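TL;DR: Proposes UNSEEN, a plug-and-play dataset pruning framework built on a generalization perspective. It scores samples with models that have not seen them and adds a multi-step incremental selection strategy, outperforming existing methods and keeping performance lossless on ImageNet-1K with 30% less training data.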

Details

Motivation: Existing dataset pruning methods score samples by how well the model fits them during training, which produces densely packed scores with little separation and makes it hard to identify representative samples. Method: Samples are scored from a generalization perspective, using models that were not exposed to them during training. The UNSEEN framework plugs into existing pruning methods and extends to multi-step settings, where models trained on different coresets drive incremental selection and dynamic refinement. Result: The method significantly outperforms current state-of-the-art methods on CIFAR-10, CIFAR-100, and ImageNet-1K, and keeps performance lossless on ImageNet-1K with 30% less training data. Conclusion: Generalization-aware scoring and multi-step incremental selection improve the quality and efficiency of dataset pruning and offer a new approach to data reduction for large-scale deep learning. Abstract: The growing scale of datasets in deep learning has introduced significant computational challenges. Dataset pruning addresses this challenge by constructing a compact but informative coreset from the full dataset with comparable performance. Previous approaches typically establish scoring metrics based on specific criteria to identify representative samples. However, these methods predominantly rely on sample scores obtained from the model's performance during the training (i.e., fitting) phase. As scoring models achieve near-optimal performance on training data, such fitting-centric approaches induce a dense distribution of sample scores within a narrow numerical range. This concentration reduces the distinction between samples and hinders effective selection. To address this challenge, we conduct dataset pruning from the perspective of generalization, i.e., scoring samples based on models not exposed to them during training. We propose a plug-and-play framework, UNSEEN, which can be integrated into existing dataset pruning methods. Additionally, conventional score-based methods are single-step and rely on models trained solely on the complete dataset, providing limited perspective on the importance of samples. To address this limitation, we scale UNSEEN to multi-step scenarios and propose an incremental selection technique through scoring models trained on varying coresets, and optimize the quality of the coreset dynamically. Extensive experiments demonstrate that our method significantly outperforms existing state-of-the-art (SOTA) methods on CIFAR-10, CIFAR-100, and ImageNet-1K.
Notably, on ImageNet-1K, UNSEEN achieves lossless performance while reducing training data by 30%.
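
Scoring each sample with a model that never trained on it can be approximated with a cross-fold loop, sketched below. This is an assumed reading of the idea, not the UNSEEN algorithm: the fold scheme, the nearest-centroid toy model (`fit_centroids`), and the distance-based score are placeholders for whatever scoring metric an existing pruning method would supply.

```python
from typing import Callable, Dict, List, Sequence


def unseen_scores(xs: Sequence[float], ys: Sequence[int], k: int,
                  fit: Callable[[List[float], List[int]], Dict[int, float]],
                  score: Callable[[Dict[int, float], float, int], float]) -> List[float]:
    """Score every sample with a model trained only on the other k-1 folds."""
    n = len(xs)
    folds = [list(range(i, n, k)) for i in range(k)]  # simple interleaved folds
    scores = [0.0] * n
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in held_out:
            scores[i] = score(model, xs[i], ys[i])  # sample i was never seen by model
    return scores


def fit_centroids(xs: List[float], ys: List[int]) -> Dict[int, float]:
    """Toy model: the mean feature value of each class."""
    sums: Dict[int, List[float]] = {}
    for x, y in zip(xs, ys):
        sums.setdefault(y, []).append(x)
    return {y: sum(v) / len(v) for y, v in sums.items()}


def distance_score(model: Dict[int, float], x: float, y: int) -> float:
    """Toy difficulty score: distance to the sample's own class centroid.

    Assumes every class appears in each training split.
    """
    return abs(x - model[y])


xs = [0.0, 0.2, 1.0, 1.2, 0.1, 5.0]
ys = [0, 0, 1, 1, 0, 1]
scores = unseen_scores(xs, ys, k=2, fit=fit_centroids, score=distance_score)
print(scores)
```

Because the scoring model has not fitted the held-out samples, the atypical last sample receives a clearly larger score than the rest, which is the spread in scores that the abstract argues fitting-based scoring loses.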

[276] Semantic Prioritization in Visual Counterfactual Explanations with Weighted Segmentation and Auto-Adaptive Region Selection

Lintong Zhang,Kang Yin,Seong-Whan Lee

Main category: cs.CV

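TL;DR: Proposes WSAE-Net, a new non-generative visual counterfactual explanation method that uses a weighted semantic map and an auto-adaptive candidate editing sequence to improve the semantic relevance and computational efficiency of explanations.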

Details

Motivation: When replacing image regions, traditional methods ignore the semantic relevance of the replacement regions to the target object, which hurts model interpretability and editing quality. Method: WSAE-Net has two key components: a weighted semantic map that reduces computation on non-semantic feature units, and an auto-adaptive candidate editing sequence that optimizes the processing order. Result: Experiments show that the method generates counterfactuals more efficiently while keeping higher semantic relevance, giving clearer and deeper visual counterfactual explanations. Conclusion: WSAE-Net addresses the lack of semantics and the redundant computation in non-generative counterfactual explanation, providing a more reliable and efficient tool for model interpretation. Abstract: In the domain of non-generative visual counterfactual explanations (CE), traditional techniques frequently involve the substitution of sections within a query image with corresponding sections from distractor images. Such methods have historically overlooked the semantic relevance of the replacement regions to the target object, thereby impairing the model's interpretability and hindering the editing workflow. Addressing these challenges, the present study introduces an innovative methodology named Weighted Semantic Map with Auto-adaptive Candidate Editing Network (WSAE-Net), characterized by two significant advancements: the determination of a weighted semantic map and the auto-adaptive candidate editing sequence. First, the generation of the weighted semantic map is designed to maximize the reduction of non-semantic feature units that need to be computed, thereby optimizing computational efficiency. Second, the auto-adaptive candidate editing sequences are designed to determine the optimal computational order among the feature units to be processed, thereby ensuring the efficient generation of counterfactuals while maintaining the semantic relevance of the replacement feature units to the target object. Through comprehensive experimentation, our methodology demonstrates superior performance, contributing to a more lucid and in-depth understanding of visual counterfactual explanations.

[277] PerTouch: VLM-Driven Agent for Personalized and Semantic Image Retouching

Zewei Chang,Zheng-Peng Duan,Jianxing Zhang,Chun-Le Guo,Siyu Liu,Hyungju Chun,Hyunhee Park,Zikun Liu,Chongyi Li

Main category: cs.CV

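TL;DR: This paper proposes PerTouch, a unified diffusion-based image retouching framework that performs semantic-level enhancement aligned with a user's personal aesthetic preferences. Parameter maps, semantic replacement, parameter perturbation, and a VLM-driven agent improve controllability and alignment with user intent.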

Details

Motivation: Traditional image retouching methods struggle to balance controllability with subjective aesthetic consistency, and they lack fine control over semantic regions and modeling of long-term user preferences. Method: PerTouch takes parameter maps holding attribute values for specific semantic regions and builds an explicit parameter-to-image mapping; semantic replacement and parameter perturbation during training improve semantic boundary perception; a VLM-driven agent links natural language instructions to visual control, with feedback-driven rethinking and scene-aware memory to capture user intent and long-term preferences. Result: Experiments confirm the contribution of each component and show that PerTouch performs better on personalized image retouching, preserving global aesthetics while meeting users' subjective needs. Conclusion: By combining these mechanisms, PerTouch delivers controllable, personalized image retouching and offers an effective way to apply diffusion models to semantic-level image editing. Abstract: Image retouching aims to enhance visual quality while aligning with users' personalized aesthetic preferences. To address the challenge of balancing controllability and subjectivity, we propose a unified diffusion-based image retouching framework called PerTouch. Our method supports semantic-level image retouching while maintaining global aesthetics. Using parameter maps containing attribute values in specific semantic regions as input, PerTouch constructs an explicit parameter-to-image mapping for fine-grained image retouching. To improve semantic boundary perception, we introduce semantic replacement and parameter perturbation mechanisms in the training process. To connect natural language instructions with visual control, we develop a VLM-driven agent that can handle both strong and weak user instructions. Equipped with mechanisms of feedback-driven rethinking and scene-aware memory, PerTouch better aligns with user intent and captures long-term preferences. Extensive experiments demonstrate each component's effectiveness and the superior performance of PerTouch in personalized image retouching. Code is available at: https://github.com/Auroral703/PerTouch.

[278] Medal S: Spatio-Textual Prompt Model for Medical Segmentation

Pengcheng Shi,Jiawei Chen,Jiaqi Liu,Xinglin Zhang,Tao Chen,Lei Li

Main category: cs.CV

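TL;DR: Medal S is a medical segmentation foundation model that supports native-resolution spatial and textual prompts. Through channel-wise alignment and preserved 3D context, it delivers efficient and accurate multi-class medical image segmentation and outperforms SAT on the reported validation metrics.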

Details

Motivation: Existing text-prompt methods lack spatial awareness, and they suffer from resolution mismatch and low efficiency in multi-class medical image segmentation. Method: Medal S is an end-to-end trainable framework that aligns volumetric spatial prompts with text embeddings channel by channel; a lightweight 3D convolutional module refines predictions, and dynamic resampling, two-stage inference, and optimized post-processing further improve performance. Result: Averaged over five modalities on the validation set, Medal S achieves a DSC of 75.44 (vs. 69.83 for SAT), NSD of 77.34 (vs. 71.06), F1 of 38.24 (vs. 24.88), and DSC TP of 65.46 (vs. 46.97); for 24-class segmentation, parallel spatial prompting cuts inference time by more than 90% compared with sequential prompting. Conclusion: By combining spatial precision with semantic text guidance, Medal S shows better efficiency and accuracy in multi-class medical image segmentation, supports multiple modalities and up to 243 classes, and will be made publicly available. Abstract: We introduce Medal S, a medical segmentation foundation model that supports native-resolution spatial and textual prompts within an end-to-end trainable framework. Unlike text-only methods lacking spatial awareness, Medal S achieves channel-wise alignment between volumetric prompts and text embeddings, mitigating inaccuracies from resolution mismatches. By preserving full 3D context, it efficiently processes multiple native-resolution masks in parallel, enhancing multi-class segmentation performance. A lightweight 3D convolutional module enables precise voxel-space refinement guided by both prompt types, supporting up to 243 classes across CT, MRI, PET, ultrasound, and microscopy modalities in the BiomedSegFM dataset. Medal S offers two prompting modes: a text-only mode, where model predictions serve as spatial prompts for self-refinement without human input, and a hybrid mode, incorporating manual annotations for enhanced flexibility. For 24-class segmentation, parallel spatial prompting reduces inference time by more than 90% compared to sequential prompting. We propose dynamic resampling to address target-patch ratio imbalance, extending SAT and nnU-Net for data augmentation. Furthermore, we develop optimized text preprocessing, a two-stage inference strategy, and post-processing techniques to improve memory efficiency, precision, and inference speed. On the five-modality average on the validation set, Medal S outperforms SAT with a DSC of 75.44 (vs. 69.83), NSD of 77.34 (vs. 71.06), F1 of 38.24 (vs. 24.88), and DSC TP of 65.46 (vs. 46.97).
Medal S achieves excellent performance by harmonizing spatial precision with semantic textual guidance, demonstrating superior efficiency and accuracy in multi-class medical segmentation tasks compared to sequential prompt-based approaches. Medal S will be publicly available at https://github.com/yinghemedical/Medal-S.

[279] Infinite-Story: A Training-Free Consistent Text-to-Image Generation

Jihun Park,Kyoungmin Lee,Jongmin Gim,Hyeonseo Jo,Minseok Oh,Wonhyeok Choi,Kyumin Hwang,Jaeyeul Kim,Minwoo Choi,Sunghoon Im

Main category: cs.CV

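TL;DR: Proposes Infinite-Story, a training-free text-to-image framework for multi-prompt storytelling. Identity Prompt Replacement and a unified attention guidance mechanism give consistent identity and style while preserving prompt fidelity, with inference more than 6x faster than the fastest existing consistent T2I models.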

Details

Motivation: 解决多提示语故事场景中文本到图像生成存在的身份不一致和风格不一致问题，同时避免现有扩散模型需要微调或推理速度慢的缺陷。 Method: 基于尺度自回归模型，提出身份提示替换技术以缓解文本编码器中的上下文偏差，并设计包含自适应风格注入和同步引导适应的统一注意力引导机制，以在测试时实现全局风格和身份一致性。 Result: 在多个实验中达到最先进的生成性能，推理速度为1.72秒每张图像，比现有最快的一致性T2I模型快6倍以上。 Conclusion: Infinite-Story是一种高效、实用的无需训练的文本到图像生成框架，能够在多提示语故事生成中实现高水平的身份和风格一致性，具有显著的速度优势和应用潜力。 Abstract: We present Infinite-Story, a training-free framework for consistent text-to-image (T2I) generation tailored for multi-prompt storytelling scenarios. Built upon a scale-wise autoregressive model, our method addresses two key challenges in consistent T2I generation: identity inconsistency and style inconsistency. To overcome these issues, we introduce three complementary techniques: Identity Prompt Replacement, which mitigates context bias in text encoders to align identity attributes across prompts; and a unified attention guidance mechanism comprising Adaptive Style Injection and Synchronized Guidance Adaptation, which jointly enforce global style and identity appearance consistency while preserving prompt fidelity. Unlike prior diffusion-based approaches that require fine-tuning or suffer from slow inference, Infinite-Story operates entirely at test time, delivering high identity and style consistency across diverse prompts. Extensive experiments demonstrate that our method achieves state-of-the-art generation performance, while offering over 6X faster inference (1.72 seconds per image) than the existing fastest consistent T2I models, highlighting its effectiveness and practicality for real-world visual storytelling.

[280] SAGE: Spuriousness-Aware Guided Prompt Exploration for Mitigating Multimodal Bias

Wenqian Ye,Di Wang,Guangtao Zheng,Bohan Liu,Aidong Zhang

Main category: cs.CV

TL;DR: This paper proposes SAGE, a method that mitigates multimodal spurious bias in the zero-shot classification of large vision-language models such as CLIP. It requires no training or fine-tuning and improves out-of-distribution generalization through guided prompt selection.

Details

Motivation: Models such as CLIP perform well in zero-shot classification but tend to rely on spurious features (e.g., backgrounds), which degrades performance on out-of-distribution data. Existing remedies need fine-tuning or prior knowledge of the bias, which limits out-of-the-box use. Method: Spuriousness-Aware Guided Exploration (SAGE) explores a space of prompt templates and selects the prompts that maximize semantic separation between classes, reducing spurious bias without training or external annotations. Result: On four real-world benchmark datasets and five popular backbone models, SAGE consistently improves zero-shot classification performance and worst-group robustness, outperforming previous zero-shot methods. Conclusion: SAGE is a simple, effective, training-free method that substantially mitigates multimodal spurious bias in CLIP models and strengthens generalization to unseen distributions, with no model updates or additional knowledge. Abstract: Large vision-language models, such as CLIP, have shown strong zero-shot classification performance by aligning images and text in a shared embedding space. However, CLIP models often develop multimodal spurious bias, the undesirable tendency to rely on spurious features. For example, CLIP may infer object types in images based on frequently co-occurring backgrounds rather than the object's core features. This bias significantly impairs the robustness of pre-trained CLIP models on out-of-distribution data, where such cross-modal associations no longer hold. Existing methods for mitigating multimodal spurious bias typically require fine-tuning on downstream data or prior knowledge of the bias, which undermines the out-of-the-box usability of CLIP. In this paper, we first theoretically analyze the impact of multimodal spurious bias in zero-shot classification. Based on this insight, we propose Spuriousness-Aware Guided Exploration (SAGE), a simple and effective method that mitigates spurious bias through guided prompt selection. SAGE requires no training, fine-tuning, or external annotations. It explores a space of prompt templates and selects the prompts that induce the largest semantic separation between classes, thereby improving worst-group robustness. Extensive experiments on four real-world benchmark datasets and five popular backbone models demonstrate that SAGE consistently improves zero-shot performance and generalization, outperforming previous zero-shot approaches without any external knowledge or model updates.
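
The selection rule described above (pick the prompt template that induces the largest semantic separation between classes) can be illustrated with a small sketch. The abstract does not define the separation score, so the choice here, the smallest pairwise cosine distance between class text embeddings, is an assumption, and the two-dimensional embeddings are toy values standing in for a real text encoder's output.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def separation(class_embeddings):
    """Assumed score: smallest pairwise cosine distance between class embeddings."""
    return min(1.0 - cosine(u, v) for u, v in combinations(class_embeddings, 2))


def select_template(embeddings_by_template):
    """Return the template whose class embeddings are most separated."""
    return max(embeddings_by_template, key=lambda t: separation(embeddings_by_template[t]))


# Toy embeddings (hypothetical): one vector per class, for each template.
toy = {
    "a photo of a {}.": [[1.0, 0.2], [0.9, 0.4]],
    "a close-up of a {}.": [[1.0, 0.0], [0.0, 1.0]],
}
best = select_template(toy)
print(best)
```

In practice the embeddings would come from the frozen text encoder of the chosen backbone, and the paper's actual scoring function may differ from the one assumed here.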

[281] Beyond Darkness: Thermal-Supervised 3D Gaussian Splatting for Low-Light Novel View Synthesis

Qingsen Ma,Chen Zou,Dianyun Wang,Jia Wang,Liuyu Xiang,Zhaofeng He

Main category: cs.CV

TL;DR: Proposes DTGS, a unified framework that couples Retinex illumination decomposition with thermal-aware 3D Gaussian Splatting to achieve illumination-invariant novel view synthesis under extreme low light.

Details

Motivation: Under extremely low light, standard 3D Gaussian Splatting suffers from illumination inconsistency and geometric distortion because views are enhanced independently, making it hard to keep geometry, color, and radiometry stable. Method: A cyclic enhancement-reconstruction mechanism jointly optimizes enhancement, geometry, and thermal supervision; an embedded Retinex decomposition module separates reflectance from illumination, and a thermal supervisory branch dynamically balances the losses. Result: Experiments on the self-built RGBT-LOW dataset show that DTGS clearly outperforms existing methods in radiometric consistency, geometric fidelity, and color stability. Conclusion: DTGS achieves illumination-invariant novel view synthesis and improves reconstruction quality and cross-view consistency under extreme low light. Abstract: Under extremely low-light conditions, novel view synthesis (NVS) faces severe degradation in terms of geometry, color consistency, and radiometric stability. Standard 3D Gaussian Splatting (3DGS) pipelines fail when applied directly to underexposed inputs, as independent enhancement across views causes illumination inconsistencies and geometric distortion. To address this, we present DTGS, a unified framework that tightly couples Retinex-inspired illumination decomposition with thermal-guided 3D Gaussian Splatting for illumination-invariant reconstruction. Unlike prior approaches that treat enhancement as a pre-processing step, DTGS performs joint optimization across enhancement, geometry, and thermal supervision through a cyclic enhancement-reconstruction mechanism. A thermal supervisory branch stabilizes both color restoration and geometry learning by dynamically balancing enhancement, structural, and thermal losses. Moreover, a Retinex-based decomposition module embedded within the 3DGS loop provides physically interpretable reflectance-illumination separation, ensuring consistent color and texture across viewpoints. To evaluate our method, we construct RGBT-LOW, a new multi-view low-light thermal dataset capturing severe illumination degradation. Extensive experiments show that DTGS significantly outperforms existing low-light enhancement and 3D reconstruction baselines, achieving superior radiometric consistency, geometric fidelity, and color stability under extreme illumination.

[282] You Only Look Omni Gradient Backpropagation for Moving Infrared Small Target Detection

Guoyi Zhang,Guangsheng Xu,Siyang Chen,Han Wang,Xiaohu Zhang

Main category: cs.CV

TL;DR: Proposes BP-FPN, a feature pyramid network designed from the backpropagation perspective for infrared small target detection. Its GILS and DGR modules strengthen feature representation, and it reaches SOTA performance on multiple datasets.

Details

Motivation: Existing methods focus on spatio-temporal feature aggregation but are limited by ambiguous per-frame feature representations, which makes small targets hard to separate from the background; the feature learning mechanism itself needs improvement. Method: BP-FPN includes a Gradient-Isolated Low-Level Shortcut (GILS) that preserves fine-grained target detail, and Directional Gradient Regularization (DGR) that enforces hierarchical feature consistency during backpropagation. Result: Experiments on multiple public datasets show that BP-FPN clearly outperforms existing methods and sets a new SOTA with negligible computational overhead. Conclusion: BP-FPN is the first FPN designed from the backpropagation perspective; it addresses ambiguous feature representation in infrared small target detection and offers a new optimization direction for the field. Abstract: Moving infrared small target detection is a key component of infrared search and tracking systems, yet it remains extremely challenging due to low signal-to-clutter ratios, severe target-background imbalance, and weak discriminative features. Existing deep learning methods primarily focus on spatio-temporal feature aggregation, but their gains are limited, revealing that the fundamental bottleneck lies in ambiguous per-frame feature representations rather than spatio-temporal modeling itself. Motivated by this insight, we propose BP-FPN, a backpropagation-driven feature pyramid architecture that fundamentally rethinks feature learning for small targets. BP-FPN introduces Gradient-Isolated Low-Level Shortcut (GILS) to efficiently incorporate fine-grained target details without inducing shortcut learning, and Directional Gradient Regularization (DGR) to enforce hierarchical feature consistency during backpropagation. The design is theoretically grounded, introduces negligible computational overhead, and can be seamlessly integrated into existing frameworks. Extensive experiments on multiple public datasets show that BP-FPN consistently establishes new state-of-the-art performance. To the best of our knowledge, it is the first FPN designed for this task entirely from the backpropagation perspective.

[283] Geometry Meets Light: Leveraging Geometric Priors for Universal Photometric Stereo under Limited Multi-Illumination Cues

King-Man Tam,Satoshi Ikehata,Yuta Asano,Zhaoyi An,Rei Kawakami

Main category: cs.CV

TL;DR: Proposes GeoUniPS, a universal photometric stereo network that combines synthetic supervision with geometric priors from large-scale 3D reconstruction models, achieving state-of-the-art surface normal estimation in complex real-world scenes.

Details

Motivation: Existing universal photometric stereo methods perform poorly in complex real-world scenes with biased lighting, shadows, or self-occluded regions, so stronger geometric priors are needed for robustness. Method: A Light-Geometry Dual-Branch Encoder extracts high-level geometric priors from a pretrained large-scale 3D reconstruction model and combines them with multi-illumination cues; the PS-Perp dataset uses perspective projection to model spatially varying view directions more realistically. Result: GeoUniPS is validated on multiple datasets and reaches state-of-the-art results both quantitatively and qualitatively, especially in complex real-world scenes. Conclusion: By fusing geometric priors from visual-geometry foundation models with a dual-branch architecture, GeoUniPS markedly improves normal estimation for universal photometric stereo in complex real scenes. Abstract: Universal Photometric Stereo is a promising approach for recovering surface normals without strict lighting assumptions. However, it struggles when multi-illumination cues are unreliable, such as under biased lighting or in shadows or self-occluded regions of complex in-the-wild scenes. We propose GeoUniPS, a universal photometric stereo network that integrates synthetic supervision with high-level geometric priors from large-scale 3D reconstruction models pretrained on massive in-the-wild data. Our key insight is that these 3D reconstruction models serve as visual-geometry foundation models, inherently encoding rich geometric knowledge of real scenes. To leverage this, we design a Light-Geometry Dual-Branch Encoder that extracts both multi-illumination cues and geometric priors from the frozen 3D reconstruction model. We also address the limitations of the conventional orthographic projection assumption by introducing the PS-Perp dataset with realistic perspective projection to enable learning of spatially varying view directions. Extensive experiments demonstrate that GeoUniPS delivers state-of-the-art performance across multiple datasets, both quantitatively and qualitatively, especially in complex in-the-wild scenes.

[284] MeanFlow Transformers with Representation Autoencoders

Zheyuan Hu,Chieh-Hsin Lai,Ge Wu,Yuki Mitsufuji,Stefano Ermon

Main category: cs.CV

TL;DR: This paper proposes an efficient way to train and sample MeanFlow (MF) in the latent space of a Representation Autoencoder (RAE). Trajectory-aware initialization and a two-stage training strategy (distillation, then bootstrapping) markedly improve training stability and generation efficiency, giving a better 1-step FID on ImageNet at lower compute cost.

Details

Motivation: For high-dimensional data, MeanFlow relies on the SD-VAE, which makes it computationally expensive, unstable to train, and dependent on complex guidance hyperparameters; naive training in a lightweight latent space suffers from gradient explosion, so a more efficient training framework is needed. Method: An RAE built on a pretrained vision encoder such as DINO supplies semantically rich latents paired with a lightweight decoder; Consistency Mid-Training provides trajectory-aware initialization; training proceeds in two stages, first distilling from a pretrained flow matching model to speed convergence, then bootstrapping with a one-point velocity estimator. Result: On ImageNet 256 the method achieves a 1-step FID of 2.03 (vs. 3.43 for the baseline), with 38% lower sampling GFLOPS and 83% lower training cost; on ImageNet 512 it reaches a 1-step FID of 3.23 with the lowest GFLOPS among all baselines. Conclusion: The method resolves MF's training instability in a lightweight latent space, removes the dependence on complex guidance, greatly reduces training and sampling cost, and delivers efficient one-step generation. Abstract: MeanFlow (MF) is a diffusion-motivated generative model that enables efficient few-step generation by learning long jumps directly from noise to data. In practice, it is often used as a latent MF by leveraging the pre-trained Stable Diffusion variational autoencoder (SD-VAE) for high-dimensional data modeling. However, MF training remains computationally demanding and is often unstable. During inference, the SD-VAE decoder dominates the generation cost, and MF depends on complex guidance hyperparameters for class-conditional generation. In this work, we develop an efficient training and sampling scheme for MF in the latent space of a Representation Autoencoder (RAE), where a pre-trained vision encoder (e.g., DINO) provides semantically rich latents paired with a lightweight decoder. We observe that naive MF training in the RAE latent space suffers from severe gradient explosion. To stabilize and accelerate training, we adopt Consistency Mid-Training for trajectory-aware initialization and use a two-stage scheme: distillation from a pre-trained flow matching teacher to speed convergence and reduce variance, followed by an optional bootstrapping stage with a one-point velocity estimator to further reduce deviation from the oracle mean flow. This design removes the need for guidance, simplifies training configurations, and reduces computation in both training and sampling. Empirically, our method achieves a 1-step FID of 2.03, outperforming vanilla MF's 3.43, while reducing sampling GFLOPS by 38% and total training cost by 83% on ImageNet 256.
We further scale our approach to ImageNet 512, achieving a competitive 1-step FID of 3.23 with the lowest GFLOPS among all baselines. Code is available at https://github.com/sony/mf-rae.
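
To make "1-step generation" concrete, the sketch below shows the single jump from noise to data that a mean-flow model performs. It assumes the usual MeanFlow convention, which the abstract does not spell out: a straight path z_t = (1 - t) * x + t * eps, and a learned average velocity u(z_t, r, t) such that z_r = z_t - (t - r) * u(z_t, r, t). The closed-form `toy_avg_velocity` is a hypothetical stand-in for a trained network and is exact only for a dataset containing the single point `TARGET`.

```python
def one_step_sample(avg_velocity, noise):
    """Jump from t=1 (noise) to r=0 (data) in one step: z_0 = z_1 - (1 - 0) * u(z_1, 0, 1)."""
    u = avg_velocity(noise, 0.0, 1.0)
    return [z - ui for z, ui in zip(noise, u)]


# Hypothetical one-point "dataset" used only to obtain an exact toy velocity field.
TARGET = [0.5, -1.0]


def toy_avg_velocity(z, r, t):
    """Average velocity for the one-point dataset.

    On the path z_t = (1 - t) * x + t * eps the velocity eps - x is constant,
    and z_t - x = t * (eps - x), so dividing by t recovers it for any r < t.
    """
    return [(zi - xi) / t for zi, xi in zip(z, TARGET)]


sample = one_step_sample(toy_avg_velocity, [2.0, 3.0])
print(sample)
```

With a real model the same call would replace `toy_avg_velocity` with the trained network evaluated in the RAE latent space, followed by the lightweight decoder.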

[285] SpectralAdapt: Semi-Supervised Domain Adaptation with Spectral Priors for Human-Centered Hyperspectral Image Reconstruction

Yufei Wen,Yuting Zhang,Jingdan Kang,Hao Ren,Weibin Cheng,Jintai Chen,Kaishun Wu

Main category: cs.CV

TL;DR: Proposes SpectralAdapt, a framework that uses semi-supervised domain adaptation to address data scarcity and the domain gap in hyperspectral image reconstruction for healthcare.

Details

Motivation: Hyperspectral imaging is promising for healthcare but difficult and costly to acquire; existing methods are limited by the scarcity of human hyperspectral data and by cross-domain differences. Method: SpectralAdapt combines Spectral Density Masking (SDM) and Spectral Endmember Representation Alignment (SERA) with semi-supervised domain adaptation, using limited labels and abundant unlabeled data for hyperspectral reconstruction. Result: Experiments on benchmark datasets show improvements in spectral fidelity, cross-domain generalization, and training stability. Conclusion: SpectralAdapt mitigates domain shift, spectral degradation, and data scarcity, demonstrating the promise of semi-supervised domain adaptation for hyperspectral imaging in healthcare. Abstract: Hyperspectral imaging (HSI) holds great potential for healthcare due to its rich spectral information. However, acquiring HSI data remains costly and technically demanding. Hyperspectral image reconstruction offers a practical solution by recovering HSI data from accessible modalities, such as RGB. While general domain datasets are abundant, the scarcity of human HSI data limits progress in medical applications. To tackle this, we propose SpectralAdapt, a semi-supervised domain adaptation (SSDA) framework that bridges the domain gap between general and human-centered HSI datasets. To fully exploit limited labels and abundant unlabeled data, we enhance spectral reasoning by introducing Spectral Density Masking (SDM), which adaptively masks RGB channels based on their spectral complexity, encouraging recovery of informative regions from complementary cues during consistency training. Furthermore, we introduce Spectral Endmember Representation Alignment (SERA), which derives physically interpretable endmembers from valuable labeled pixels and employs them as domain-invariant anchors to guide unlabeled predictions, with momentum updates ensuring adaptability and stability. These components are seamlessly integrated into SpectralAdapt, a spectral prior-guided framework that effectively mitigates domain shift, spectral degradation, and data scarcity in HSI reconstruction. Experiments on benchmark datasets demonstrate consistent improvements in spectral fidelity, cross-domain generalization, and training stability, highlighting the promise of SSDA as an efficient solution for hyperspectral imaging in healthcare.

[286] REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding

Jiaze Li,Hao Yin,Wenhui Tan,Jingyang Chen,Boshen Xu,Yuxun Qu,Yijing Chen,Jianzhong Ju,Zhenbo Luo,Jian Luan

Main category: cs.CV

TL;DR: This paper proposes REVISOR, a framework whose cross-modal reflection mechanism, operating jointly over text and vision, improves the reasoning of multimodal large models on long-form video understanding; a DADR reward mechanism causally aligns the reflection process with visual evidence.

Details

Motivation: Reflection methods based purely on text fall short on long-form video understanding: reflecting only on text is insufficient for rich, dynamic visual input, and the lack of cross-modal interaction prevents visual information from being fully integrated during reflection. Method: REVISOR supports tool-augmented multimodal reflection, enabling collaborative introspective reasoning across textual and visual modalities; the Dual Attribution Decoupled Reward (DADR) mechanism, combined with the GRPO training strategy, enforces causal alignment between the model's reasoning and the selected video segment evidence. Result: Without additional supervised fine-tuning or external models, REVISOR significantly improves MLLM performance on long-form video understanding, with strong results on four benchmarks: VideoMME, LongVideoBench, MLVU, and LVBench. Conclusion: By introducing vision-oriented reflection and a cross-modal alignment training strategy, REVISOR strengthens the reasoning of multimodal large models on complex long-form video tasks and offers a new direction for multimodal reflection research. Abstract: Self-reflection mechanisms that rely on purely text-based rethinking processes perform well in most multimodal tasks. However, when directly applied to long-form video understanding scenarios, they exhibit clear limitations. The fundamental reasons for this lie in two points: (1) long-form video understanding involves richer and more dynamic visual input, meaning rethinking only the text information is insufficient and necessitates a further rethinking process specifically targeting visual information; (2) purely text-based reflection mechanisms lack cross-modal interaction capabilities, preventing them from fully integrating visual information during reflection. Motivated by these insights, we propose REVISOR (REflective VIsual Segment Oriented Reasoning), a novel framework for tool-augmented multimodal reflection. REVISOR enables MLLMs to collaboratively construct introspective reflection processes across textual and visual modalities, significantly enhancing their reasoning capability for long-form video understanding. To ensure that REVISOR can learn to accurately review video segments highly relevant to the question during reinforcement learning, we designed the Dual Attribution Decoupled Reward (DADR) mechanism. Integrated into the GRPO training strategy, this mechanism enforces causal alignment between the model's reasoning and the selected video evidence.
Notably, the REVISOR framework significantly enhances long-form video understanding capability of MLLMs without requiring supplementary supervised fine-tuning or external models, achieving impressive results on four benchmarks including VideoMME, LongVideoBench, MLVU, and LVBench.

[287] Towards 3D Object-Centric Feature Learning for Semantic Scene Completion

Weihua Wang,Yubo Cui,Xiangru Lin,Zhiheng Li,Zheng Fang

Main category: cs.CV

TL;DR: This paper proposes Ocean, an object-centric framework for 3D semantic scene completion that decomposes the scene into individual object instances to improve semantic and geometric prediction accuracy, achieving state-of-the-art performance on SemanticKITTI and SSCBench-KITTI360.

Details

Motivation: Most existing vision-based 3D semantic scene completion methods follow an ego-centric paradigm and overlook fine-grained object-level detail, which leads to semantic and geometric ambiguity in complex environments. Method: Ocean uses MobileSAM to extract instance masks from the image; a 3D Semantic Group Attention module aggregates object-centric features in 3D space with linear attention; a Global Similarity-Guided Attention module compensates for segmentation errors; and an Instance-aware Local Diffusion module refines instance features through a generative process and then refines the scene representation in BEV space. Result: On SemanticKITTI and SSCBench-KITTI360, Ocean reaches mIoU scores of 17.40 and 20.28, respectively, which is state of the art. Conclusion: Object-centric modeling improves the accuracy of 3D semantic scene completion, with stronger robustness and detail recovery in complex scenes. Abstract: Vision-based 3D Semantic Scene Completion (SSC) has received growing attention due to its potential in autonomous driving. While most existing approaches follow an ego-centric paradigm by aggregating and diffusing features over the entire scene, they often overlook fine-grained object-level details, leading to semantic and geometric ambiguities, especially in complex environments. To address this limitation, we propose Ocean, an object-centric prediction framework that decomposes the scene into individual object instances to enable more accurate semantic occupancy prediction. Specifically, we first employ a lightweight segmentation model, MobileSAM, to extract instance masks from the input image. Then, we introduce a 3D Semantic Group Attention module that leverages linear attention to aggregate object-centric features in 3D space. To handle segmentation errors and missing instances, we further design a Global Similarity-Guided Attention module that leverages segmentation features for global interaction. Finally, we propose an Instance-aware Local Diffusion module that improves instance features through a generative process and subsequently refines the scene representation in the BEV space. Extensive experiments on the SemanticKITTI and SSCBench-KITTI360 benchmarks demonstrate that Ocean achieves state-of-the-art performance, with mIoU scores of 17.40 and 20.28, respectively.

[288] Uni-Inter: Unifying 3D Human Motion Synthesis Across Diverse Interaction Contexts

Sheng Liu,Yuanzhi Liang,Jiepeng Wang,Sidan Du,Chi Zhang,Xuelong Li

Main category: cs.CV

TL;DR: Proposes Uni-Inter, a unified framework for human motion generation that supports multiple interaction scenarios (human-human, human-object, human-scene) and uses a Unified Interactive Volume (UIV) to achieve cross-task generalization and consistent relational reasoning.

Details

Motivation: Existing methods rely on task-specific designs, generalize poorly, and struggle to model complex, diverse interaction scenarios in a unified way. Method: The Unified Interactive Volume (UIV) serves as a shared spatial representation for heterogeneous interactive entities, and motion generation is formulated as joint-wise probabilistic prediction over the UIV, which captures fine-grained spatial dependencies. Result: Experiments on three representative interaction tasks show that Uni-Inter performs competitively and generalizes well to novel combinations of entities. Conclusion: Unified modeling of compound interactions is a promising direction for scalable motion synthesis in complex environments. Abstract: We present Uni-Inter, a unified framework for human motion generation that supports a wide range of interaction scenarios, including human-human, human-object, and human-scene, within a single, task-agnostic architecture. In contrast to existing methods that rely on task-specific designs and exhibit limited generalization, Uni-Inter introduces the Unified Interactive Volume (UIV), a volumetric representation that encodes heterogeneous interactive entities into a shared spatial field. This enables consistent relational reasoning and compound interaction modeling. Motion generation is formulated as joint-wise probabilistic prediction over the UIV, allowing the model to capture fine-grained spatial dependencies and produce coherent, context-aware behaviors. Experiments across three representative interaction tasks demonstrate that Uni-Inter achieves competitive performance and generalizes well to novel combinations of entities. These results suggest that unified modeling of compound interactions offers a promising direction for scalable motion synthesis in complex environments.

[289] uCLIP: Parameter-Efficient Multilingual Extension of Vision-Language Models with Unpaired Data

Dahyun Chung,Donghyun Shin,Yujin Sung,Seunggi Moon,Jinwoo Jeon,Byung-Jun Lee

Main category: cs.CV

TL;DR: Proposes a lightweight, data-efficient framework for multilingual vision-language alignment that needs no image-text or text-text pairs. It trains only a small projection module, uses English representations as semantic anchors, and markedly improves cross-modal retrieval for low-resource languages.

Details

Motivation: Existing CLIP models perform poorly on low-resource languages, mainly because high-quality multilingual image-text data is scarce, which leads to weak cross-modal retrieval in those languages. Method: The pretrained image encoder and multilingual text encoder are frozen; only a 1.7M-parameter projection module is trained, with a contrastive loss that uses English representations as semantic anchors to achieve multilingual alignment. Result: The method yields significant gains on multiple multilingual retrieval benchmarks, especially on five representative low-resource languages: Czech, Finnish, Croatian, Hungarian, and Romanian. Conclusion: A parameter-efficient alignment strategy that pivots through English improves cross-modal understanding for low-resource languages and advances inclusive multimodal learning. Abstract: Contrastive Language-Image Pre-training (CLIP) has demonstrated strong generalization across a wide range of visual tasks by leveraging large-scale English-image pairs. However, its extension to low-resource languages remains limited due to the scarcity of high-quality multilingual image-text data. Existing multilingual vision-language models exhibit consistently low retrieval performance in underrepresented languages including Czech, Finnish, Croatian, Hungarian, and Romanian on the Crossmodal-3600 (XM3600) benchmark. To address this, we propose a lightweight and data-efficient framework for multilingual vision-language alignment. Our approach requires no image-text pairs or text-text pairs and freezes both the pretrained image encoder and multilingual text encoder during training. Only a compact 1.7M-parameter projection module is trained, using a contrastive loss over English representations as semantic anchors. This minimal training setup enables robust multilingual alignment even for languages with limited supervision. Extensive evaluation across multiple multilingual retrieval benchmarks confirms the effectiveness of our method, showing significant gains in five underrepresented languages where existing models typically underperform. These findings highlight the effectiveness of our pivot-based, parameter-efficient alignment strategy for inclusive multimodal learning.

[290] MGCA-Net: Multi-Grained Category-Aware Network for Open-Vocabulary Temporal Action Localization

Zhenying Fang,Richang Hong

Main category: cs.CV

TL;DR: Proposes the Multi-Grained Category-Aware Network (MGCA-Net) to improve open-vocabulary temporal action localization, using a coarse-to-fine classification strategy to recognize both novel and base categories effectively.

Details

Motivation: Existing methods usually recognize action categories at a single granularity, which lowers recognition accuracy for both base and novel categories. Method: MGCA-Net comprises a localizer, an action presence predictor, a conventional classifier, and a coarse-to-fine classifier, which together perform multi-grained classification at the snippet, video, and proposal levels. Result: The method achieves state-of-the-art performance on THUMOS'14 and ActivityNet-1.3 and also performs strongly in the zero-shot setting. Conclusion: Multi-grained category awareness effectively improves open-vocabulary temporal action localization. Abstract: Open-Vocabulary Temporal Action Localization (OV-TAL) aims to recognize and localize instances of any desired action categories in videos without explicitly curating training data for all categories. Existing methods mostly recognize action categories at a single granularity, which degrades the recognition accuracy of both base and novel action categories. To address this issue, we propose a Multi-Grained Category-Aware Network (MGCA-Net) comprising a localizer, an action presence predictor, a conventional classifier, and a coarse-to-fine classifier. Specifically, the localizer localizes category-agnostic action proposals. For these action proposals, the action presence predictor estimates the probability that they belong to an action instance. At the same time, the conventional classifier predicts the probability of each action proposal over base action categories at the snippet granularity. Novel action categories are recognized by the coarse-to-fine classifier, which first identifies action presence at the video granularity. Finally, it assigns each action proposal to one category from the coarse categories at the proposal granularity. Through coarse-to-fine category awareness for novel actions and the conventional classifier's awareness of base actions, multi-grained category awareness is achieved, effectively enhancing localization performance. Comprehensive evaluations on the THUMOS'14 and ActivityNet-1.3 benchmarks demonstrate that our method achieves state-of-the-art performance. Furthermore, our MGCA-Net achieves state-of-the-art results under the Zero-Shot Temporal Action Localization setting.

[291] DiffPixelFormer: Differential Pixel-Aware Transformer for RGB-D Indoor Scene Segmentation

Yan Gong,Jianli Lu,Yongsheng Gao,Jie Zhao,Xiaojuan Zhang,Susanto Rahardja

Main category: cs.CV

TL;DR: Proposes DiffPixelFormer, a differential pixel-aware Transformer for RGB-D indoor scene semantic segmentation. Its IIMIB module strengthens intra-modal representations and models inter-modal interactions, yielding a clear performance gain.

Details

Motivation: Existing RGB-D fusion methods are computationally heavy and model intra- and inter-modal feature relationships insufficiently, which results in imprecise feature alignment and limited discriminative power. Method: The Intra-Inter Modal Interaction Block (IIMIB) captures intra-modal long-range dependencies with self-attention and uses the Differential-Shared Inter-Modal (DSIM) module to disentangle modality-specific and shared features; a dynamic fusion strategy balances the contribution of each modality according to scene characteristics. Result: On SUN RGB-D and NYUDv2, DiffPixelFormer-L reaches mIoU of 54.28% and 59.95%, exceeding DFormer-L by 1.78% and 2.75%, respectively. Conclusion: DiffPixelFormer improves the accuracy of RGB-D indoor semantic segmentation, with finer pixel-level cross-modal alignment and stronger feature representation. Abstract: Indoor semantic segmentation is fundamental to computer vision and robotics, supporting applications such as autonomous navigation, augmented reality, and smart environments. Although RGB-D fusion leverages complementary appearance and geometric cues, existing methods often depend on computationally intensive cross-attention mechanisms and insufficiently model intra- and inter-modal feature relationships, resulting in imprecise feature alignment and limited discriminative representation. To address these challenges, we propose DiffPixelFormer, a differential pixel-aware Transformer for RGB-D indoor scene segmentation that simultaneously enhances intra-modal representations and models inter-modal interactions. At its core, the Intra-Inter Modal Interaction Block (IIMIB) captures intra-modal long-range dependencies via self-attention and models inter-modal interactions with the Differential-Shared Inter-Modal (DSIM) module to disentangle modality-specific and shared cues, enabling fine-grained, pixel-level cross-modal alignment. Furthermore, a dynamic fusion strategy balances modality contributions and fully exploits RGB-D information according to scene characteristics. Extensive experiments on the SUN RGB-D and NYUDv2 benchmarks demonstrate that DiffPixelFormer-L achieves mIoU scores of 54.28% and 59.95%, outperforming DFormer-L by 1.78% and 2.75%, respectively. Code is available at https://github.com/gongyan1/DiffPixelFormer.

[292] ViSS-R1: Self-Supervised Reinforcement Video Reasoning

Bo Fang,Yuxin Song,Qiangqiang Wu,Haoyuan Sun,Wenhao Wu,Antoni B. Chan

Main category: cs.CV

TL;DR: Proposes the Pretext-GRPO algorithm and the ViSS-R1 framework, which use self-supervised reinforcement learning to strengthen the visual-centric reasoning of multimodal large models in complex video understanding.

Details

Motivation: Existing R1-based methods lean too heavily on textual reasoning in video tasks and neglect rich visual information, which invites shortcut learning and hallucination. Method: Pretext-GRPO assigns positive rewards within the R1 pipeline for solving pretext tasks on transformed visual inputs; ViSS-R1 then integrates pretext-task-based self-supervised learning into the MLLM's R1 post-training paradigm, jointly handling transformation-related questions and the user's actual query. Result: The method's effectiveness and superiority are validated on six mainstream video understanding benchmarks, with clear gains in complex video reasoning. Conclusion: Pretext-GRPO and ViSS-R1 achieve more robust, visual-centric video understanding and move MLLMs from text-centric reasoning toward deeper visual reasoning. Abstract: Complex video reasoning remains a significant challenge for Multimodal Large Language Models (MLLMs), as current R1-based methodologies often prioritize text-centric reasoning derived from text-based and image-based developments. In video tasks, such strategies frequently underutilize rich visual information, leading to potential shortcut learning and increased susceptibility to hallucination. To foster a more robust, visual-centric video understanding, we start by introducing a novel self-supervised reinforcement learning GRPO algorithm (Pretext-GRPO) within the standard R1 pipeline, in which positive rewards are assigned for correctly solving pretext tasks on transformed visual inputs, which forces the model to process the visual information non-trivially. Building on the effectiveness of Pretext-GRPO, we further propose the ViSS-R1 framework, which streamlines and integrates pretext-task-based self-supervised learning directly into the MLLM's R1 post-training paradigm. Instead of relying solely on sparse visual cues, our framework compels models to reason about transformed visual input by simultaneously processing both pretext questions (concerning transformations) and true user queries. This necessitates identifying the applied transformation and reconstructing the original video to formulate accurate final answers. Comprehensive evaluations on six widely-used video reasoning and understanding benchmarks demonstrate the effectiveness and superiority of our Pretext-GRPO and ViSS-R1 for complex video reasoning. Our codes and models will be publicly available.
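
The reward idea above can be sketched in a few lines: a response earns a positive reward when it correctly identifies the transformation applied to the video, and GRPO then scores each response relative to the other responses sampled for the same prompt. This is an illustration, not the authors' implementation. The binary reward, the transformation names, and the use of the population standard deviation are all assumptions; GRPO implementations differ in such details.

```python
import statistics


def pretext_reward(predicted_transform, applied_transform):
    """Hypothetical binary reward: 1.0 if the pretext task is solved, else 0.0."""
    return 1.0 if predicted_transform == applied_transform else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: each reward standardized within its sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses for one video that was rotated by 90 degrees.
predictions = ["rotate_90", "flip", "rotate_90", "none"]
rewards = [pretext_reward(p, "rotate_90") for p in predictions]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Responses that solve the pretext task end up with positive advantages and the rest with negative ones, which is what pushes the policy to attend to the visual input.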

[293] Monocular 3D Lane Detection via Structure Uncertainty-Aware Network with Curve-Point Queries

Ruixin Liu,Zejian Yuan

Main category: cs.CV

TL;DR: This paper proposes MonoUnc, a monocular 3D lane detection method that models 3D lanes in front-view space and explicitly combines local structure with aleatoric uncertainty, improving accuracy and robustness and surpassing prior state-of-the-art methods on ONCE-3DLanes and OpenLane.

Details

Motivation: Existing monocular 3D lane detection methods rely on simplified geometric assumptions and fail to capture structural variation and aleatoric uncertainty in real scenes, so a method that better models local structure and uncertainty is needed. Method: 3D lanes are projected into front-view (FV) space and approximated by parametric curves; curve-point query embeddings are generated dynamically from the curve predictions to predict lane points in 3D space; each segment between two adjacent points is modeled as a 3D Gaussian with local structure and uncertainty estimates, and a new 3D Gaussian matching loss optimizes these parameters jointly. Result: On ONCE-3DLanes and OpenLane, MonoUnc surpasses previous state-of-the-art methods under stricter evaluation criteria; two new comprehensive metrics, the average and maximum bidirectional Chamfer distances, are proposed to quantify global and local errors. Conclusion: By explicitly modeling aleatoric uncertainty guided by local lane structure, MonoUnc significantly improves monocular 3D lane detection, and the proposed metrics reflect detection quality more comprehensively. Abstract: Monocular 3D lane detection is challenged by aleatoric uncertainty arising from inherent observation noise. Existing methods rely on simplified geometric assumptions, such as independent point predictions or global planar modeling, failing to capture structural variations and aleatoric uncertainty in real-world scenarios. In this paper, we propose MonoUnc, a bird's-eye view (BEV)-free 3D lane detector that explicitly models aleatoric uncertainty informed by local lane structures. Specifically, 3D lanes are projected onto the front-view (FV) space and approximated by parametric curves. Guided by curve predictions, curve-point query embeddings are dynamically generated for lane point predictions in 3D space. Each segment formed by two adjacent points is modeled as a 3D Gaussian, parameterized by the local structure and uncertainty estimations. Accordingly, a novel 3D Gaussian matching loss is designed to constrain these parameters jointly. Experiments on the ONCE-3DLanes and OpenLane datasets demonstrate that MonoUnc outperforms previous state-of-the-art (SoTA) methods across all benchmarks under stricter evaluation criteria. Additionally, we propose two comprehensive evaluation metrics for ONCE-3DLanes, calculating the average and maximum bidirectional Chamfer distances to quantify global and local errors. Codes are released at https://github.com/lrx02/MonoUnc.
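
The two proposed metrics can be pictured with a short sketch of a bidirectional Chamfer distance between a predicted and a ground-truth lane, each given as a list of 3D points. The abstract does not give the exact definition, so the aggregation used here (pooling nearest-neighbour distances from both directions, then taking their mean and their maximum) is an assumption, and the lane coordinates are made up for illustration.

```python
import math


def nearest_distances(src, dst):
    """For each point in src, the Euclidean distance to its nearest point in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def bidirectional_chamfer(pred, gt):
    """Return (average, maximum) of nearest-neighbour distances pooled over both directions."""
    d = nearest_distances(pred, gt) + nearest_distances(gt, pred)
    return sum(d) / len(d), max(d)


# Toy lanes (hypothetical coordinates): the prediction is off by 0.5 at the far end.
pred_lane = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 2.0, 0.5)]
gt_lane = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 2.0, 0.0)]
avg_cd, max_cd = bidirectional_chamfer(pred_lane, gt_lane)
print(avg_cd, max_cd)
```

The average summarizes global agreement between the two lanes, while the maximum exposes the single worst local deviation, which matches the global/local split described in the abstract.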

[294] FGNet: Leveraging Feature-Guided Attention to Refine SAM2 for 3D EM Neuron Segmentation

Zhenghua Li,Hang Chen,Zihao Sun,Kai Li,Xiaolin Hu

Main category: cs.CV

TL;DR: Proposes a new framework that transfers knowledge from SAM2, pretrained on natural images, to neural structure segmentation in electron microscopy (EM) images. Combined with a Feature-Guided Attention module and a dual-affinity decoder, it delivers strong segmentation performance without requiring large amounts of annotation.

Details

Motivation: Segmenting neural structures in electron microscopy images is hampered by intricate morphology, low signal-to-noise ratios, and scarce annotations, which limit the accuracy and generalization of existing methods. Method: SAM2, pretrained on natural images, extracts general-purpose features; a Feature-Guided Attention module uses semantic cues from SAM2 to steer a lightweight Fine-Grained Encoder (FGE) toward hard regions; a dual-affinity decoder produces coarse and refined affinity maps. Result: With SAM2 weights frozen, performance already matches current state-of-the-art methods; after further fine-tuning on EM data, it significantly surpasses them. Conclusion: Combining representations pretrained on natural images with targeted domain-adaptive guidance effectively addresses the specific challenges of neuron segmentation. Abstract: Accurate segmentation of neural structures in Electron Microscopy (EM) images is paramount for neuroscience. However, this task is challenged by intricate morphologies, low signal-to-noise ratios, and scarce annotations, limiting the accuracy and generalization of existing methods. To address these challenges, we seek to leverage the priors learned by visual foundation models on a vast amount of natural images to better tackle this task. Specifically, we propose a novel framework that can effectively transfer knowledge from Segment Anything 2 (SAM2), which is pre-trained on natural images, to the EM domain. We first use SAM2 to extract powerful, general-purpose features. To bridge the domain gap, we introduce a Feature-Guided Attention module that leverages semantic cues from SAM2 to guide a lightweight encoder, the Fine-Grained Encoder (FGE), in focusing on challenging regions. Finally, a dual-affinity decoder generates both coarse and refined affinity maps. Experimental results demonstrate that our method achieves performance comparable to state-of-the-art (SOTA) approaches with the SAM2 weights frozen. Upon further fine-tuning on EM data, our method significantly outperforms existing SOTA methods. This study validates that transferring representations pre-trained on natural images, when combined with targeted domain-adaptive guidance, can effectively address the specific challenges in neuron segmentation.

[295] RobustGait: Robustness Analysis for Appearance Based Gait Recognition

Reeshoon Sayera,Akash Kumar,Sirshapan Mitra,Prudvi Kamtam,Yogesh S Rawat

Main category: cs.CV

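TL;DR: Proposes RobustGait, a framework for fine-grained robustness evaluation of appearance-based gait recognition that covers four perturbation types, silhouette extraction methods, model architectures, and deployment scenarios; evaluating six state-of-the-art systems on multiple datasets, it identifies RGB-level noise and silhouette extractor bias as key factors.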

Details

Motivation: Existing gait recognition methods perform well on controlled datasets, but their robustness to real-world corruptions and silhouette variability has not been systematically evaluated. Method: RobustGait evaluates along four dimensions (perturbation type, silhouette extraction method, model architecture, and deployment scenario), introduces 15 corruption types at 5 severity levels on CASIA-B, CCPG, and SUSTech1K, validates in-the-wild performance on MEVID, and evaluates six SOTA systems. Result: RGB-level noise better reflects real-world degradation; silhouette extractor bias significantly affects accuracy, exposing a source of benchmark bias; robustness depends on both perturbation type and model architecture; noise-aware training and knowledge distillation improve robustness. Conclusion: The robustness of gait recognition systems depends on several factors, so silhouette extraction, model design, and training strategy must be considered together; RobustGait offers a systematic evaluation path toward deployable systems. Abstract: Appearance-based gait recognition has achieved strong performance on controlled datasets, yet systematic evaluation of its robustness to real-world corruptions and silhouette variability remains lacking. We present RobustGait, a framework for fine-grained robustness evaluation of appearance-based gait recognition systems. RobustGait evaluation spans four dimensions: the type of perturbation (digital, environmental, temporal, occlusion), the silhouette extraction method (segmentation and parsing networks), the architectural capacities of gait recognition models, and various deployment scenarios. The benchmark introduces 15 corruption types at 5 severity levels across CASIA-B, CCPG, and SUSTech1K, with in-the-wild validation on MEVID, and evaluates six state-of-the-art gait systems. We report several insights. First, applying noise at the RGB level better reflects real-world degradation and reveals how distortions propagate through silhouette extraction to the downstream gait recognition systems. Second, gait accuracy is highly sensitive to silhouette extractor biases, revealing an overlooked source of benchmark bias. Third, robustness depends on both the type of perturbation and the architectural design. Finally, we explore robustness-enhancing strategies, showing that noise-aware training and knowledge distillation improve performance and move toward deployment-ready systems.
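
The idea of applying a corruption at the RGB level with graded severity can be illustrated with a minimal sketch. It assumes additive Gaussian noise as the corruption; the `SEVERITY_SIGMA` table and the function name are hypothetical and are not taken from the RobustGait benchmark.

```python
import random

# Hypothetical noise standard deviations for severity levels 1-5
# (illustrative values only, not the benchmark's settings).
SEVERITY_SIGMA = {1: 4.0, 2: 8.0, 3: 16.0, 4: 32.0, 5: 64.0}


def add_gaussian_noise(frame, severity, seed=0):
    """Return a noisy copy of an RGB frame given as rows of (r, g, b) tuples."""
    if severity not in SEVERITY_SIGMA:
        raise ValueError("severity must be an integer from 1 to 5")
    rng = random.Random(seed)
    sigma = SEVERITY_SIGMA[severity]
    noisy = []
    for row in frame:
        noisy_row = []
        for pixel in row:
            # Perturb each channel, then clamp to the valid 8-bit range.
            noisy_row.append(tuple(
                min(255, max(0, round(c + rng.gauss(0.0, sigma)))) for c in pixel
            ))
        noisy.append(noisy_row)
    return noisy


# A flat grey 4x4 frame corrupted at the highest severity.
clean = [[(128, 128, 128)] * 4 for _ in range(4)]
corrupted = add_gaussian_noise(clean, severity=5)
```

In a pipeline like the one the abstract describes, the corrupted frame would then go through the silhouette extractor, so that the distortion propagates to the gait model instead of being applied to a clean silhouette.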

[296] Decoupling Scene Perception and Ego Status: A Multi-Context Fusion Approach for Enhanced Generalization in End-to-End Autonomous Driving

Jiacheng Tang,Mingyue Feng,Jiachao Liu,Yaonong Wang,Jian Pu

Main category: cs.CV

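TL;DR: Proposes AdaptiveAD, a modular architecture for autonomous driving planning that decouples scene perception from ego status to address existing systems' over-reliance on ego status.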

Details

Motivation: Existing autonomous driving architectures fuse ego status prematurely in the upstream BEV encoder, so the downstream planning module leans on this strong prior, which hurts generalization and scene understanding. Method: A dual-branch structure is proposed: one branch performs scene-driven reasoning via multi-task learning with ego status omitted, while the other performs ego-driven reasoning based on the planning task; a scene-aware fusion module adaptively integrates the two branches' decisions. A path attention mechanism and two auxiliary tasks (BEV unidirectional distillation and autoregressive online mapping) preserve multi-task learning performance. Result: The method achieves state-of-the-art open-loop planning performance on nuScenes, significantly reduces over-reliance on ego status, and generalizes well across diverse scenarios. Conclusion: Through architecture-level design, AdaptiveAD effectively decouples scene and ego information, improving the robustness and generalization of the planning system without sacrificing multi-task learning. Abstract: Modular design of planning-oriented autonomous driving has markedly advanced end-to-end systems. However, existing architectures remain constrained by an over-reliance on ego status, hindering generalization and robust scene understanding. We identify the root cause as an inherent design within these architectures that allows ego status to be easily leveraged as a shortcut. Specifically, the premature fusion of ego status in the upstream BEV encoder allows an information flow from this strong prior to dominate the downstream planning module. To address this challenge, we propose AdaptiveAD, an architectural-level solution based on a multi-context fusion strategy. Its core is a dual-branch structure that explicitly decouples scene perception and ego status. One branch performs scene-driven reasoning based on multi-task learning, but with ego status deliberately omitted from the BEV encoder, while the other conducts ego-driven reasoning based solely on the planning task. A scene-aware fusion module then adaptively integrates the complementary decisions from the two branches to form the final planning trajectory. To ensure this decoupling does not compromise multi-task learning, we introduce a path attention mechanism for ego-BEV interaction and add two targeted auxiliary tasks: BEV unidirectional distillation and autoregressive online mapping. Extensive evaluations on the nuScenes dataset demonstrate that AdaptiveAD achieves state-of-the-art open-loop planning performance. Crucially, it significantly mitigates the over-reliance on ego status and exhibits impressive generalization capabilities across diverse scenarios.

[297] Rethinking Saliency Maps: A Cognitive Human Aligned Taxonomy and Evaluation Framework for Explanations

Yehonatan Elisha,Seffi Cohen,Oren Barkan,Noam Koenigstein

Main category: cs.CV

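TL;DR: Proposes the Reference-Frame × Granularity (RFxG) taxonomy to systematically evaluate saliency explanation methods across pointwise and contrastive, class-level and group-level explanations; it reveals the limitations of existing metrics and proposes new faithfulness metrics.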

Details

Motivation: Saliency maps are widely used for visual explanations in deep learning, but there is no consensus on their purpose or their alignment with user queries, which makes evaluation difficult. Method: The RFxG taxonomy organizes saliency explanations along two axes, reference frame (pointwise vs. contrastive) and granularity (class-level vs. group-level), and four new faithfulness metrics are designed for systematic evaluation. Result: Experiments on ten mainstream saliency methods, four model architectures, and three datasets show that existing metrics mostly favor pointwise faithfulness and neglect contrastive reasoning and semantic granularity; the new metrics assess explanation quality more comprehensively. Conclusion: The work advocates a user-intent-driven evaluation paradigm and provides a conceptual framework and practical tools for developing visual explanations better aligned with human cognition. Abstract: Saliency maps are widely used for visual explanations in deep learning, but a fundamental lack of consensus persists regarding their intended purpose and alignment with diverse user queries. This ambiguity hinders the effective evaluation and practical utility of explanation methods. We address this gap by introducing the Reference-Frame $\times$ Granularity (RFxG) taxonomy, a principled conceptual framework that organizes saliency explanations along two essential axes: Reference-Frame: Distinguishing between pointwise ("Why this prediction?") and contrastive ("Why this and not an alternative?") explanations. Granularity: Ranging from fine-grained class-level (e.g., "Why Husky?") to coarse-grained group-level (e.g., "Why Dog?") interpretations. Using the RFxG lens, we demonstrate critical limitations in existing evaluation metrics, which overwhelmingly prioritize pointwise faithfulness while neglecting contrastive reasoning and semantic granularity. To systematically assess explanation quality across both RFxG dimensions, we propose four novel faithfulness metrics. Our comprehensive evaluation framework applies these metrics to ten state-of-the-art saliency methods, four model architectures, and three datasets. By advocating a shift toward user-intent-driven evaluation, our work provides both the conceptual foundation and the practical tools necessary to develop visual explanations that are not only faithful to the underlying model behavior but are also meaningfully aligned with the complexity of human understanding and inquiry.

[298] MergeSlide: Continual Model Merging and Task-to-Class Prompt-Aligned Inference for Lifelong Learning on Whole Slide Images

Doanh C. Bui,Ba Hung Ngo,Hoai Luan Pham,Khang Nguyen,Maï K. Nguyen,Yasuhiko Nakashima

Main category: cs.CV

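TL;DR: Proposes MergeSlide, a model-merging-based lifelong learning framework for cancer tasks on whole slide images (WSIs); it uses a vision-language pathology foundation model and an orthogonal continual merging strategy to mitigate catastrophic forgetting, and introduces Task-to-Class Prompt-aligned (TCP) inference for the class-incremental setting where task identity is unknown.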

Details

Motivation: Lifelong learning on WSIs aims to reduce the data transfer and processing overhead across multiple cancer tasks, while coping with the large size of WSI data and the catastrophic forgetting that traditional continual learning methods suffer. Method: Lifelong learning is treated as a model merging problem: using a vision-language pathology foundation model, each new task is defined with class-aware prompts, fine-tuned for a few epochs with an MLP-free backbone, and merged into a unified model via an orthogonal continual merging strategy; at inference, TCP first identifies the most relevant task with task-level prompts and then applies the corresponding class prompts to produce predictions. Result: On a stream of six TCGA datasets, MergeSlide outperforms rehearsal-based continual learning methods and vision-language zero-shot baselines, showing stronger performance and stability in the class-incremental setting. Conclusion: MergeSlide offers an efficient, scalable approach to lifelong learning on WSIs; through model merging and prompt engineering it balances learning new tasks against retaining old knowledge, and it shows the potential of foundation models in pathology image analysis. Abstract: Lifelong learning on Whole Slide Images (WSIs) aims to train or fine-tune a unified model sequentially on cancer-related tasks, reducing the resources and effort required for data transfer and processing, especially given the gigabyte-scale size of WSIs. In this paper, we introduce MergeSlide, a simple yet effective framework that treats lifelong learning as a model merging problem by leveraging a vision-language pathology foundation model. When a new task arrives, it is: 1) defined with class-aware prompts, 2) fine-tuned for a few epochs using an MLP-free backbone, and 3) merged into a unified model using an orthogonal continual merging strategy that preserves performance and mitigates catastrophic forgetting. For inference under the class-incremental learning (CLASS-IL) setting, where task identity is unknown, we introduce Task-to-Class Prompt-aligned (TCP) inference. Specifically, TCP first identifies the most relevant task using task-level prompts and then applies the corresponding class-aware prompts to generate predictions. To evaluate MergeSlide, we conduct experiments on a stream of six TCGA datasets. The results show that MergeSlide outperforms both rehearsal-based continual learning and vision-language zero-shot baselines. Code and data are available at https://github.com/caodoanh2001/MergeSlide.
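
The two-stage TCP inference described above (pick a task first, then a class within it) can be sketched as follows. This is only an illustration under stated assumptions: embeddings are plain lists of floats, matching uses cosine similarity, and the names `tcp_predict`, `task_a`, `class_a1`, and so on are hypothetical. The released MergeSlide code may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def tcp_predict(image_emb, task_prompts, class_prompts):
    """Two-stage prediction: task-level prompts first, then class-aware prompts."""
    # Stage 1: choose the task whose task-level prompt embedding best matches the image.
    task = max(task_prompts, key=lambda t: cosine(image_emb, task_prompts[t]))
    # Stage 2: score only the class-aware prompts that belong to that task.
    classes = class_prompts[task]
    label = max(classes, key=lambda c: cosine(image_emb, classes[c]))
    return task, label


# Toy embeddings (assumed values, for illustration only).
task_prompts = {"task_a": [1.0, 0.0, 0.0], "task_b": [0.0, 1.0, 0.0]}
class_prompts = {
    "task_a": {"class_a1": [0.9, 0.1, 0.3], "class_a2": [0.9, 0.1, -0.3]},
    "task_b": {"class_b1": [0.1, 0.9, 0.3], "class_b2": [0.1, 0.9, -0.3]},
}
print(tcp_predict([0.8, 0.2, 0.4], task_prompts, class_prompts))
```

Restricting the second stage to one task's classes is what keeps the label space small even as more tasks are merged into the unified model.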

[299] CapeNext: Rethinking and refining dynamic support information for category-agnostic pose estimation

Yu Zhu,Dan Zeng,Shuiwang Li,Qijun Zhao,Qiaomu Shen,Bo Tang

Main category: cs.CV

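TL;DR: Proposes CapeNext, a new category-agnostic pose estimation framework that introduces hierarchical cross-modal interaction and dual-stream feature refinement to overcome the limitations of static joint embeddings (cross-category ambiguity and fine-grained intra-category variation), and significantly outperforms existing methods on MP-100.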

Details

Motivation: Existing category-agnostic pose estimation methods use fixed textual keypoint descriptions as a semantic prior, which suffer from polysemy-induced cross-category ambiguity and cannot adequately discriminate fine-grained intra-category variations. Method: The CapeNext framework combines hierarchical cross-modal interaction with dual-stream feature refinement, enhancing the joint embedding with class-level and instance-level cues from the textual description and the specific image. Result: Experiments on MP-100 show that CapeNext outperforms current state-of-the-art CAPE methods by a large margin regardless of the network backbone. Conclusion: By introducing dynamic, multi-level cross-modal fusion, CapeNext improves the accuracy and robustness of category-agnostic pose estimation and resolves the semantic ambiguity and weak discriminability caused by static embeddings. Abstract: Recent research in Category-Agnostic Pose Estimation (CAPE) has adopted fixed textual keypoint description as semantic prior for two-stage pose matching frameworks. While this paradigm enhances robustness and flexibility by disentangling the dependency of support images, our critical analysis reveals two inherent limitations of static joint embedding: (1) polysemy-induced cross-category ambiguity during the matching process (e.g., the concept "leg" exhibiting divergent visual manifestations across humans and furniture), and (2) insufficient discriminability for fine-grained intra-category variations (e.g., posture and fur discrepancies between a sleeping white cat and a standing black cat). To overcome these challenges, we propose a new framework that innovatively integrates hierarchical cross-modal interaction with dual-stream feature refinement, enhancing the joint embedding with both class-level and instance-specific cues from textual description and specific images. Experiments on the MP-100 dataset demonstrate that, regardless of the network backbone, CapeNext consistently outperforms state-of-the-art CAPE methods by a large margin.

[300] PlugTrack: Multi-Perceptive Motion Analysis for Adaptive Fusion in Multi-Object Tracking

Seungjae Kim,SeungJoon Lee,MyeongAh Cho

Main category: cs.CV

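TL;DR: PlugTrack is a new multi-object tracking framework that adaptively fuses a Kalman filter with data-driven motion predictors, combining the strengths of both and improving performance on multiple datasets.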

Details

Motivation: The Kalman filter performs poorly on non-linear motion, while data-driven methods capture complex dynamics but generalize less well and cost more to run; real-world motion mixes linear and non-linear patterns, so a better fusion strategy is needed. Method: PlugTrack uses multi-perceptive motion analysis to generate adaptive blending factors that dynamically combine the Kalman filter with a data-driven predictor, without modifying the existing predictors. Result: The method achieves significant performance gains on MOT17, MOT20, and DanceTrack, reaching state-of-the-art results on DanceTrack. Conclusion: PlugTrack is, to the authors' knowledge, the first framework to bridge classical and modern motion prediction paradigms, handling the mixed motion patterns of real scenes through adaptive fusion. Abstract: Multi-object tracking (MOT) predominantly follows the tracking-by-detection paradigm, where Kalman filters serve as the standard motion predictor due to computational efficiency but inherently fail on non-linear motion patterns. Conversely, recent data-driven motion predictors capture complex non-linear dynamics but suffer from limited domain generalization and computational overhead. Through extensive analysis, we reveal that even in datasets dominated by non-linear motion, the Kalman filter outperforms data-driven predictors in up to 34% of cases, demonstrating that real-world tracking scenarios inherently involve both linear and non-linear patterns. To leverage this complementarity, we propose PlugTrack, a novel framework that adaptively fuses Kalman filter and data-driven motion predictors through multi-perceptive motion understanding. Our approach employs multi-perceptive motion analysis to generate adaptive blending factors. PlugTrack achieves significant performance gains on MOT17/MOT20 and state-of-the-art on DanceTrack without modifying existing motion predictors. To the best of our knowledge, PlugTrack is the first framework to bridge classical and modern motion prediction paradigms through adaptive fusion in MOT.
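
The role of a blending factor can be shown with a short sketch. It assumes the factor is a scalar in [0, 1] applied as a convex combination of two predicted boxes in (cx, cy, w, h) form. In PlugTrack the factor is produced by the multi-perceptive motion analysis module; here it is simply passed in, and the function name is hypothetical.

```python
def blend_predictions(kalman_box, learned_box, alpha):
    """Convex combination of two predicted boxes given as (cx, cy, w, h).

    alpha = 1.0 trusts the Kalman filter fully; alpha = 0.0 trusts the
    data-driven predictor fully. How alpha is computed is outside this sketch.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return tuple(alpha * k + (1.0 - alpha) * d
                 for k, d in zip(kalman_box, learned_box))


# A track whose motion looks non-linear: lean on the data-driven predictor.
kalman = (100.0, 50.0, 20.0, 40.0)
learned = (110.0, 54.0, 20.0, 40.0)
print(blend_predictions(kalman, learned, alpha=0.25))  # (107.5, 53.0, 20.0, 40.0)
```

Because the two predictors are only combined at their outputs, neither needs to be retrained or modified, which matches the plug-in property the abstract claims.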

[301] Low-Level Dataset Distillation for Medical Image Enhancement

Fengzhi Xu,Ziyuan Yang,Mengyu Sun,Joey Tianyi Zhou,Yi Zhang

Main category: cs.CV

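TL;DR: Proposes the first low-level dataset distillation method for medical image enhancement, which uses a shared anatomical prior and a personalized generation module to enable efficient training while preserving privacy.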

Details

Motivation: Existing medical image enhancement methods depend on large-scale datasets with high training and storage costs, while conventional dataset distillation mainly targets high-level tasks and is hard to apply to low-level tasks that require pixel-level fidelity. Method: A shared anatomical prior built from a representative patient initializes the distilled data; a Structure-Preserving Personalized Generation (SPG) module injects patient-specific information; and a gradient alignment strategy is used together with task-specific high- and low-quality training pairs built from the distilled data. Result: Across several low-level medical image enhancement tasks, the method approaches the performance of training on the raw data while greatly reducing data storage and transfer overhead and protecting patient privacy. Conclusion: The method addresses the underdetermined nature of dataset distillation for low-level tasks and offers a practical route to efficient, privacy-preserving medical image enhancement. Abstract: Medical image enhancement is clinically valuable, but existing methods require large-scale datasets to learn complex pixel-level mappings. However, the substantial training and storage costs associated with these datasets hinder their practical deployment. While dataset distillation (DD) can alleviate these burdens, existing methods mainly target high-level tasks, where multiple samples share the same label. This many-to-one mapping allows distilled data to capture shared semantics and achieve information compression. In contrast, low-level tasks involve a many-to-many mapping that requires pixel-level fidelity, making low-level DD an underdetermined problem, as a small distilled dataset cannot fully constrain the dense pixel-level mappings. To address this, we propose the first low-level DD method for medical image enhancement. We first leverage anatomical similarities across patients to construct the shared anatomical prior based on a representative patient, which serves as the initialization for the distilled data of different patients. This prior is then personalized for each patient using a Structure-Preserving Personalized Generation (SPG) module, which integrates patient-specific anatomical information into the distilled dataset while preserving pixel-level fidelity. For different low-level tasks, the distilled data is used to construct task-specific high- and low-quality training pairs. Patient-specific knowledge is injected into the distilled data by aligning the gradients computed from networks trained on the distilled pairs with those from the corresponding patient's raw data. Notably, downstream users cannot access raw patient data. Instead, only a distilled dataset containing abstract training information is shared, which excludes patient-specific details and thus preserves privacy.

[302] DGS-Net: Distillation-Guided Gradient Surgery for CLIP Fine-Tuning in AI-Generated Image Detection

Jiazhen Yan,Ziqiang Li,Fan Wang,Boyu Wang,Zhangjie Fu

Main category: cs.CV

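TL;DR: Proposes DGS-Net, a framework that uses gradient-space decomposition and knowledge distillation to improve AI-generated image detection while preserving pre-trained priors.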

Details

Motivation: Existing fine-tuning methods for detecting AI-generated images tend to cause catastrophic forgetting, which weakens cross-domain generalization. Method: A gradient-space decomposition projects task gradients onto the orthogonal complement of harmful directions and aligns them with beneficial directions distilled from a frozen CLIP encoder. Result: Experiments on 50 generative models show that the method outperforms prior work by 6.6 percentage points on average, with strong detection performance and generalization. Conclusion: DGS-Net balances prior preservation against suppression of irrelevant components, improving the robustness and generality of generated-image detection. Abstract: The rapid progress of generative models such as GANs and diffusion models has led to the widespread proliferation of AI-generated images, raising concerns about misinformation, privacy violations, and trust erosion in digital media. Although large-scale multimodal models like CLIP offer strong transferable representations for detecting synthetic content, fine-tuning them often induces catastrophic forgetting, which degrades pre-trained priors and limits cross-domain generalization. To address this issue, we propose the Distillation-guided Gradient Surgery Network (DGS-Net), a novel framework that preserves transferable pre-trained priors while suppressing task-irrelevant components. Specifically, we introduce a gradient-space decomposition that separates harmful and beneficial descent directions during optimization. By projecting task gradients onto the orthogonal complement of harmful directions and aligning with beneficial ones distilled from a frozen CLIP encoder, DGS-Net achieves unified optimization of prior preservation and irrelevant suppression. Extensive experiments on 50 generative models demonstrate that our method outperforms state-of-the-art approaches by an average margin of 6.6, achieving superior detection performance and generalization across diverse generation techniques.
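
Projecting a gradient onto the orthogonal complement of a harmful direction is a standard linear-algebra step, sketched below for a single harmful direction. This is a simplification: the abstract speaks of harmful directions in the plural and of an additional alignment term distilled from a frozen CLIP encoder, neither of which is modelled here. The function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_out(grad, harmful):
    """Remove from `grad` its component along the `harmful` direction.

    Computes g - (<g, h> / <h, h>) * h, so the result is orthogonal to h.
    """
    denom = dot(harmful, harmful)
    if denom == 0.0:
        return list(grad)  # no direction to remove
    coeff = dot(grad, harmful) / denom
    return [g - coeff * h for g, h in zip(grad, harmful)]


task_grad = [3.0, 4.0]
harmful_dir = [1.0, 0.0]
safe_grad = project_out(task_grad, harmful_dir)
print(safe_grad)                    # [0.0, 4.0]
print(dot(safe_grad, harmful_dir))  # 0.0
```

With several harmful directions, one common approach is to orthonormalize them first and subtract each component in turn; whether DGS-Net does this is not stated in the abstract.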

[303] Learning Implicit Neural Degradation Representation for Unpaired Image Dehazing

Shuaibin Fan,Senming Zhong,Wenchao Yan,Minglong Xue

Main category: cs.CV

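TL;DR: Proposes an unsupervised dehazing method based on implicit neural degradation representation; combining channel-independent and channel-dependent mechanisms with a dense residual enhancement module, it achieves high-quality dehazing in complex scenes.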

Details

Motivation: In complex scenes, existing dehazing methods struggle to balance fine-grained feature representation of inhomogeneous haze with global consistency modeling, and they rely heavily on explicit feature extraction and physical models. Method: Inspired by the Kolmogorov-Arnold representation theorem, the method combines channel-independent and channel-dependent mechanisms to strengthen the learning of nonlinear dependencies; it designs an implicit neural representation that models haze degradation as a continuous function, and adds a dense residual enhancement module to eliminate redundant information. Result: The method achieves competitive dehazing performance on several public and real-world datasets and improves visual quality in complex scenes. Conclusion: Without relying on explicit features or physical models, the method achieves robust, high-quality dehazing through implicit neural representation and efficient feature learning. Abstract: Image dehazing is an important task in the field of computer vision, aiming at restoring clear and detail-rich visual content from haze-affected images. However, when dealing with complex scenes, existing methods often struggle to strike a balance between fine-grained feature representation of inhomogeneous haze distribution and global consistency modeling. To address this and better learn the common degradation representation of haze under spatial variations, we propose an unsupervised dehazing method based on implicit neural degradation representation. First, inspired by the Kolmogorov-Arnold representation theorem, we combine channel-independent and channel-dependent mechanisms, which efficiently enhances the ability to learn nonlinear dependencies and in turn achieves good visual perception in complex scenes. Moreover, we design an implicit neural representation to model haze degradation as a continuous function to eliminate redundant information and the dependence on explicit feature extraction and physical models. To further learn the implicit representation of the haze features, we also design a dense residual enhancement module to eliminate redundant information. This achieves high-quality image restoration. Experimental results show that our method achieves competitive dehazing performance on various public and real-world datasets. The project code will be available at https://github.com/Fan-pixel/NeDR-Dehaze.

[304] Semantics and Content Matter: Towards Multi-Prior Hierarchical Mamba for Image Deraining

Zhaocheng Yu,Kui Jiang,Junjun Jiang,Xianming Liu,Guanglu Sun,Yi Xiao

Main category: cs.CV

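TL;DR: Proposes the Multi-Prior Hierarchical Mamba (MPHM) network for image deraining, which combines macro-semantic textual priors (CLIP) with micro-structural visual priors (DINOv2); with a progressive Priors Fusion Injection mechanism and a Hierarchical Mamba Module, it achieves state-of-the-art performance in preserving semantic and spatial detail.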

Details

Motivation: Existing deraining methods fall short in the fidelity of semantic and spatial details, which degrades vision systems in applications such as autonomous driving and video surveillance. Method: The MPHM network fuses semantic priors from CLIP with structural priors from DINOv2, introduces a progressive Priors Fusion Injection (PFI) mechanism, and uses a Hierarchical Mamba Module (HMM) with a Fourier-enhanced dual-path design to strengthen global modeling and local detail recovery. Result: The method gains 0.57 dB PSNR on the Rain200H dataset and generalizes well to real-world scenes. Conclusion: By effectively fusing multimodal priors with an improved feature representation structure, MPHM achieves leading deraining performance with both semantic accuracy and structural integrity. Abstract: Rain significantly degrades the performance of computer vision systems, particularly in applications like autonomous driving and video surveillance. While existing deraining methods have made considerable progress, they often struggle with fidelity of semantic and spatial details. To address these limitations, we propose the Multi-Prior Hierarchical Mamba (MPHM) network for image deraining. This novel architecture synergistically integrates macro-semantic textual priors (CLIP) for task-level semantic guidance and micro-structural visual priors (DINOv2) for scene-aware structural information. To alleviate potential conflicts between heterogeneous priors, we devise a progressive Priors Fusion Injection (PFI) that strategically injects complementary cues at different decoder levels. Meanwhile, we equip the backbone network with an elaborate Hierarchical Mamba Module (HMM) to facilitate robust feature representation, featuring a Fourier-enhanced dual-path design that concurrently addresses global context modeling and local detail recovery. Comprehensive experiments demonstrate MPHM's state-of-the-art performance, achieving a 0.57 dB PSNR gain on the Rain200H dataset while delivering superior generalization on real-world rainy scenarios.
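
To put the reported 0.57 dB figure in context, the sketch below uses the standard PSNR definition, 10·log10(MAX² / MSE), to convert a PSNR gain into the implied change in mean squared error. The baseline MSE is a hypothetical value chosen for illustration and is not taken from the paper.

```python
import math


def psnr(mse, max_value=255.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    if mse <= 0:
        raise ValueError("mse must be positive (PSNR is unbounded for identical images)")
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE is multiplied when PSNR rises by `gain_db`."""
    return 10.0 ** (-gain_db / 10.0)


baseline_mse = 100.0  # hypothetical value, not from the paper
improved_mse = baseline_mse * mse_ratio_for_gain(0.57)
print(round(psnr(improved_mse) - psnr(baseline_mse), 2))
print(round(mse_ratio_for_gain(0.57), 3))
```

Under this definition a 0.57 dB gain corresponds to an MSE roughly 12% lower than the baseline, independent of the baseline value.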

[305] A Lightweight 3D Anomaly Detection Method with Rotationally Invariant Features

Hanzhe Liang,Jie Zhou,Can Gao,Bingyang Guo,Jinbao Wang,Linlin Shen

Main category: cs.CV

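TL;DR: Proposes a Rotationally Invariant Features (RIF) framework for 3D anomaly detection, which extracts robust features with Point Coordinate Mapping (PCM) and a lightweight CTF-Net and uses transfer learning to boost performance, achieving advanced results on multiple datasets.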

Details

Motivation: Existing 3D anomaly detection methods degrade on point clouds whose pose and position vary, because the resulting feature representations are inconsistent; a rotation-robust feature extraction framework is needed. Method: The RIF framework (1) designs a Point Coordinate Mapping (PCM) technique that maps points into a rotationally invariant space; (2) builds a lightweight CTF-Net to extract invariant features; and (3) pre-trains the feature extractor with 3D data augmentation and transfer learning to improve its representations. Result: Average P-AUROC improves by 17.7% on Anomaly-ShapeNet and by 1.6% on Real3D-AD; the framework also generalizes well and can be combined with traditional methods to improve their performance. Conclusion: The RIF framework addresses the rotation sensitivity of point cloud features, significantly improves 3D anomaly detection, and shows good potential for industrial use. Abstract: 3D anomaly detection (AD) is a crucial task in computer vision, aiming to identify anomalous points or regions from point cloud data. However, existing methods may encounter challenges when handling point clouds with changes in orientation and position because the resulting features may vary significantly. To address this problem, we propose a novel Rotationally Invariant Features (RIF) framework for 3D AD. Firstly, to remove the adverse effect of variations on point cloud data, we develop a Point Coordinate Mapping (PCM) technique, which maps each point into a rotationally invariant space to maintain consistency of representation. Then, to learn robust and discriminative features, we design a lightweight Convolutional Transform Feature Network (CTF-Net) to extract rotationally invariant features for the memory bank. To improve the ability of the feature extractor, we introduce the idea of transfer learning to pre-train the feature extractor with 3D data augmentation. Experimental results show that the proposed method achieves advanced performance on the Anomaly-ShapeNet dataset, with an average P-AUROC improvement of 17.7%, and also gains the best performance on the Real3D-AD dataset, with an average P-AUROC improvement of 1.6%. The strong generalization ability of RIF has been verified by combining it with traditional feature extraction methods on anomaly detection tasks, demonstrating great potential for industrial applications.
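
What "rotationally invariant" means for a point-cloud representation can be checked with a toy descriptor. The sketch below is not the paper's PCM, whose construction the abstract does not specify. It swaps in a much simpler stand-in (sorted distances from each point to the cloud centroid) purely to demonstrate the invariance property; all names are hypothetical.

```python
import math


def centroid_distance_signature(points):
    """Sorted distances from each 3D point to the cloud centroid.

    Rotations and translations preserve these distances, so the signature
    is unchanged when the whole cloud is rigidly moved.
    """
    n = len(points)
    centroid = tuple(sum(p[i] for p in points) / n for i in range(3))
    return sorted(math.dist(p, centroid) for p in points)


def rotate_z_90(points):
    """Rotate every point by 90 degrees about the z axis."""
    return [(-y, x, z) for x, y, z in points]


cloud = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0), (1.0, 1.0, 1.0)]
before = centroid_distance_signature(cloud)
after = centroid_distance_signature(rotate_z_90(cloud))
print(all(math.isclose(a, b) for a, b in zip(before, after)))
```

A descriptor this coarse discards most of the geometry; the point of a learned mapping such as PCM followed by CTF-Net is to keep invariance while retaining enough detail to separate normal from anomalous regions.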

[306] CloseUpShot: Close-up Novel View Synthesis from Sparse-views via Point-conditioned Diffusion Model

Yuqi Zhang,Guanying Chen,Jiaxing Chen,Chuanyu Fu,Chuan Huang,Shuguang Cui

Main category: cs.CV

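TL;DR: Proposes CloseUpShot, a diffusion-based framework for close-up novel view synthesis from sparse inputs. Hierarchical warping and occlusion-aware noise suppression, combined with global structure guidance, significantly improve the quality and completeness of 3D reconstruction under sparse-view conditions.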

Details

Motivation: In close-up scenes with large viewpoint changes, existing methods struggle to capture fine-grained details because the input is sparse and suffers from background leakage, which leads to poor novel view synthesis. Method: The CloseUpShot framework uses a point-conditioned video diffusion model; hierarchical warping and occlusion-aware noise suppression improve the quality of the conditioning images, and a dense fused point cloud provides global structure guidance for geometric consistency. Result: Experiments on multiple datasets show that the method outperforms existing approaches on close-up novel view synthesis, particularly in detail recovery and completeness. Conclusion: By improving the conditioning inputs and introducing global geometric guidance, CloseUpShot achieves high-quality close-up novel view synthesis from sparse inputs, validating its design. Abstract: Reconstructing 3D scenes and synthesizing novel views from sparse input views is a highly challenging task. Recent advances in video diffusion models have demonstrated strong temporal reasoning capabilities, making them a promising tool for enhancing reconstruction quality under sparse-view settings. However, existing approaches are primarily designed for modest viewpoint variations and struggle to capture fine-grained details in close-up scenarios, since input information is severely limited. In this paper, we present a diffusion-based framework, called CloseUpShot, for close-up novel view synthesis from sparse inputs via point-conditioned video diffusion. Specifically, we observe that pixel-warping conditioning suffers from severe sparsity and background leakage in close-up settings. To address this, we propose hierarchical warping and occlusion-aware noise suppression, enhancing the quality and completeness of the conditioning images for the video diffusion model. Furthermore, we introduce global structure guidance, which leverages a dense fused point cloud to provide consistent geometric context to the diffusion process, to compensate for the lack of globally consistent 3D constraints in sparse conditioning inputs. Extensive experiments on multiple datasets demonstrate that our method outperforms existing approaches, especially in close-up novel view synthesis, clearly validating the effectiveness of our design.

[307] Region-Point Joint Representation for Effective Trajectory Similarity Learning

Hao Long,Silin Zhou,Lisi Chen,Shuo Shang

Main category: cs.CV

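TL;DR: Proposes RePo, which jointly encodes region-wise and point-wise features to capture both the spatial context and fine-grained movement patterns of trajectories, significantly outperforming existing methods on trajectory similarity computation.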

Details

Motivation: Existing learning-based methods do not exploit the full spectrum of trajectory information for similarity modeling, which limits their performance. Method: GPS trajectories are mapped to grid sequences to extract region-level structural and semantic features; three lightweight expert networks extract point-level local, correlation, and continuous movement patterns from dense GPS sequences and are adaptively fused by a router network; cross-attention combines the two feature types; training uses a contrastive loss with hard negative samples. Result: RePo improves average accuracy by 22.2% over existing state-of-the-art methods across all evaluation metrics. Conclusion: RePo effectively integrates region-wise and point-wise trajectory features and significantly improves the accuracy of trajectory similarity computation. Abstract: Recent learning-based methods have reduced the computational complexity of traditional trajectory similarity computation, but state-of-the-art (SOTA) methods still fail to leverage the comprehensive spectrum of trajectory information for similarity modeling. To tackle this problem, we propose RePo, a novel method that jointly encodes Region-wise and Point-wise features to capture both spatial context and fine-grained moving patterns. For region-wise representation, the GPS trajectories are first mapped to grid sequences, and spatial context is captured by structural features and semantic context enriched by visual features. For point-wise representation, three lightweight expert networks extract local, correlation, and continuous movement patterns from dense GPS sequences. Then, a router network adaptively fuses the learned point-wise features, which are subsequently combined with region-wise features using cross-attention to produce the final trajectory embedding. To train RePo, we adopt a contrastive loss with hard negative samples to provide similarity ranking supervision. Experimental results show that RePo achieves an average accuracy improvement of 22.2% over SOTA baselines across all evaluation metrics.
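
The first region-wise step, mapping a GPS trajectory to a grid sequence, can be sketched as below. The uniform square grid, the row-major cell ids, and the collapsing of consecutive repeats are all assumptions made for illustration; the abstract does not say how RePo defines its grid.

```python
def to_grid_sequence(points, min_lon, min_lat, cell_deg, n_cols):
    """Map (lon, lat) points to a sequence of row-major grid cell ids.

    Consecutive points that fall in the same cell are collapsed into one
    entry (an assumption of this sketch, not necessarily RePo's behavior).
    """
    sequence = []
    for lon, lat in points:
        col = int((lon - min_lon) // cell_deg)
        row = int((lat - min_lat) // cell_deg)
        cell = row * n_cols + col
        if not sequence or sequence[-1] != cell:
            sequence.append(cell)
    return sequence


# Four GPS fixes on a 4-column grid with 0.5-degree cells (toy values).
trajectory = [(0.1, 0.1), (0.2, 0.1), (0.6, 0.1), (0.6, 0.7)]
print(to_grid_sequence(trajectory, min_lon=0.0, min_lat=0.0, cell_deg=0.5, n_cols=4))
# [0, 1, 5]
```

The grid sequence keeps coarse spatial context, while the raw dense GPS points remain available to the point-wise expert networks for fine-grained movement patterns.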

[308] VEIL: Jailbreaking Text-to-Video Models via Visual Exploitation from Implicit Language

Zonghao Ying,Moyang Chen,Nizhang Li,Zhiqiang Wang,Wenxin Zhang,Quanchen Zou,Zonglei Jing,Aishan Liu,Xianglong Liu

Main category: cs.CV

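TL;DR: Proposes VEIL, a jailbreak framework that exploits the cross-modal associative patterns of text-to-video models; its modular prompt design yields videos that are semantically unsafe yet prompted by benign-looking text, markedly raising attack success rates.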

Details

Motivation: Existing jailbreak attacks on text-to-video models typically add adversarial perturbations to obviously unsafe prompts, which are easy to detect and defend against; a stealthier attack based on benign-looking prompts is needed to expose the models' safety blind spots. Method: The VEIL framework uses a modular prompt design with three components: neutral scene anchors (to maintain surface plausibility), latent auditory triggers (which exploit audio-visual co-occurrence priors to induce unsafe visual content), and stylistic modulators (to amplify the trigger's effect); attack generation is formalized as a constrained optimization problem and solved with a guided search strategy. Result: Experiments on 7 text-to-video models show a 23 percent improvement in average attack success rate on commercial models. Conclusion: VEIL can bypass the safety guardrails of T2V models, exposing risks in their cross-modal associations and the shortcomings of current safety mechanisms against implicit semantic attacks. Abstract: Jailbreak attacks can circumvent model safety guardrails and reveal critical blind spots. Prior attacks on text-to-video (T2V) models typically add adversarial perturbations to obviously unsafe prompts, which are often easy to detect and defend. In contrast, we show that benign-looking prompts containing rich, implicit cues can induce T2V models to generate semantically unsafe videos that both violate policy and preserve the original (blocked) intent. To realize this, we propose VEIL, a jailbreak framework that leverages T2V models' cross-modal associative patterns via a modular prompt design. Specifically, our prompts combine three components: neutral scene anchors, which provide the surface-level scene description extracted from the blocked intent to maintain plausibility; latent auditory triggers, textual descriptions of innocuous-sounding audio events (e.g., creaking, muffled noises) that exploit learned audio-visual co-occurrence priors to bias the model toward particular unsafe visual concepts; and stylistic modulators, cinematic directives (e.g., camera framing, atmosphere) that amplify and stabilize the latent trigger's effect. We formalize attack generation as a constrained optimization over the above modular prompt space and solve it with a guided search procedure that balances stealth and effectiveness. Extensive experiments over 7 T2V models demonstrate the efficacy of our attack, achieving a 23 percent improvement in average attack success rate in commercial models.

[309] Shedding Light on VLN Robustness: A Black-box Framework for Indoor Lighting-based Adversarial Attack

Chenyang Li,Wenbing Tang,Yihao Huang,Sinong Simon Zhan,Ming Hu,Xiaojun Jia,Yang Liu

Main category: cs.CV

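TL;DR: Proposes ILA, a black-box adversarial attack framework based on indoor lighting, which uses static and dynamic lighting perturbations to expose the vulnerability of vision-and-language navigation (VLN) agents to realistic lighting changes.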

Details

Motivation: Existing adversarial evaluations mostly rely on unnatural texture perturbations with little practical relevance, while indoor lighting, a key factor in navigation, has been overlooked; its effect on the robustness of VLN agents needs study. Method: Two attack modes are designed: the static attack (SILA) keeps the lighting intensity constant throughout an episode, and the dynamic attack (DILA) switches lights on or off at critical moments; both are evaluated on two state-of-the-art VLN models across three navigation tasks. Result: ILA significantly raises the failure rate of VLN agents and lowers trajectory efficiency, exposing their sensitivity to realistic indoor lighting changes and previously unrecognized weaknesses. Conclusion: Indoor lighting is an important factor in VLN agent performance, and adversarial testing that accounts for lighting variation can help improve robustness in real environments. Abstract: Vision-and-Language Navigation (VLN) agents have made remarkable progress, but their robustness remains insufficiently studied. Existing adversarial evaluations often rely on perturbations that manifest as unusual textures rarely encountered in everyday indoor environments. Errors under such contrived conditions have limited practical relevance, as real-world agents are unlikely to encounter such artificial patterns. In this work, we focus on indoor lighting, an intrinsic yet largely overlooked scene attribute that strongly influences navigation. We propose Indoor Lighting-based Adversarial Attack (ILA), a black-box framework that manipulates global illumination to disrupt VLN agents. Motivated by typical household lighting usage, we design two attack modes: Static Indoor Lighting-based Attack (SILA), where the lighting intensity remains constant throughout an episode, and Dynamic Indoor Lighting-based Attack (DILA), where lights are switched on or off at critical moments to induce abrupt illumination changes. We evaluate ILA on two state-of-the-art VLN models across three navigation tasks. Results show that ILA significantly increases failure rates while reducing trajectory efficiency, revealing previously unrecognized vulnerabilities of VLN agents to realistic indoor lighting variations.
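
The difference between the static and dynamic modes can be illustrated with a crude stand-in that scales pixel intensities of already-rendered grayscale frames. ILA itself manipulates global illumination in the scene and selects the switching moments through its black-box procedure; in this sketch the brightness factors and `switch_steps` are simply given, and all names are hypothetical.

```python
def scale_brightness(frame, factor):
    """Scale a grayscale frame (rows of 0-255 ints) and clamp to 8-bit range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in frame]


def static_attack(frames, factor):
    """Static mode: one lighting level held constant for the whole episode."""
    return [scale_brightness(f, factor) for f in frames]


def dynamic_attack(frames, switch_steps, on_factor=1.0, off_factor=0.2):
    """Dynamic mode: toggle the lights at the given time steps."""
    lights_on = True
    attacked = []
    for t, frame in enumerate(frames):
        if t in switch_steps:
            lights_on = not lights_on  # abrupt illumination change
        attacked.append(scale_brightness(frame, on_factor if lights_on else off_factor))
    return attacked


episode = [[[100, 200]] for _ in range(4)]  # four tiny 1x2 frames
print(static_attack(episode, 0.5))
print(dynamic_attack(episode, switch_steps={1, 3}))
```

In the dynamic case the frames at steps 1 and 2 are dimmed and step 3 returns to full brightness, which mimics lights being switched off and back on during navigation.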

[310] MedGEN-Bench: Contextually entangled benchmark for open-ended multimodal medical generation

Junjie Yang,Yuhao Yan,Gang Wu,Yuxuan Wang,Ruoyu Liang,Xinjie Jiang,Xiang Wan,Fenglei Fan,Yongquan Zhang,Feiwei Qin,Changmiao Wan

Main category: cs.CV

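TL;DR: Presents MedGEN-Bench, a comprehensive multimodal benchmark for advancing medical AI research, comprising 6,422 expert-validated image-text pairs spanning six imaging modalities, 16 clinical tasks, and 28 subtasks, with a three-tier assessment framework for evaluating existing models' cross-modal reasoning and generation.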

Details

Motivation: Existing medical visual benchmarks suffer from ambiguous queries, oversimplified diagnostic reasoning, and neglect of image generation, so they cannot meet the clinical need for AI that produces both diagnostic text and corresponding medical images. Method: MedGEN-Bench is a multimodal benchmark with three formats (Visual Question Answering, Image Editing, and Contextual Multimodal Generation), accompanied by a three-tier assessment framework that combines pixel-level metrics, semantic text analysis, and expert-guided clinical relevance scoring. Result: The benchmark contains 6,422 expert-validated image-text pairs covering multiple imaging modalities and clinical tasks, and is used to evaluate 10 compositional frameworks, 3 unified models, and 5 vision-language models. Conclusion: By emphasizing contextually intertwined instructions and open-ended generative outputs, MedGEN-Bench pushes medical AI toward more sophisticated cross-modal reasoning and integration with real clinical workflows. Abstract: As Vision-Language Models (VLMs) increasingly gain traction in medical applications, clinicians are progressively expecting AI systems not only to generate textual diagnoses but also to produce corresponding medical images that integrate seamlessly into authentic clinical workflows. Despite the growing interest, existing medical visual benchmarks present notable limitations. They often rely on ambiguous queries that lack sufficient relevance to image content, oversimplify complex diagnostic reasoning into closed-ended shortcuts, and adopt a text-centric evaluation paradigm that overlooks the importance of image generation capabilities. To address these challenges, we introduce MedGEN-Bench, a comprehensive multimodal benchmark designed to advance medical AI research. MedGEN-Bench comprises 6,422 expert-validated image-text pairs spanning six imaging modalities, 16 clinical tasks, and 28 subtasks. It is structured into three distinct formats: Visual Question Answering, Image Editing, and Contextual Multimodal Generation. What sets MedGEN-Bench apart is its focus on contextually intertwined instructions that necessitate sophisticated cross-modal reasoning and open-ended generative outputs, moving beyond the constraints of multiple-choice formats. To evaluate the performance of existing systems, we employ a novel three-tier assessment framework that integrates pixel-level metrics, semantic text analysis, and expert-guided clinical relevance scoring. Using this framework, we systematically assess 10 compositional frameworks, 3 unified models, and 5 VLMs.

[311] WinMamba: Multi-Scale Shifted Windows in State Space Model for 3D Object Detection

Longhui Zheng,Qiming Xia,Xiaolu Chen,Zhaoliang Liu,Chenglu Wen

Main category: cs.CV

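TL;DR: This paper proposes WinMamba, a Mamba-based backbone for 3D object detection that uses a window-scale-adaptive module and positional encoding to improve multi-scale representation and context modeling, and that significantly outperforms the baseline on the KITTI and Waymo datasets.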

Details

Motivation: Existing Mamba-based models preserve computational efficiency but scan along axis-aligned directions within a fixed window, which discards spatial information and makes long-range dependencies hard to capture. Method: WinMamba includes a window-scale-adaptive module (AWF) that compensates voxel features across resolutions, plus a learnable positional encoding and a window-shift strategy (WSF) that strengthen context awareness. Result: Experiments on the KITTI and Waymo datasets show that WinMamba significantly outperforms the baseline, and ablation studies confirm each module's contribution to detection accuracy. Conclusion: WinMamba strikes a better balance between efficiency and accuracy and offers an effective solution for Mamba-based 3D object detection. Abstract: 3D object detection is critical for autonomous driving, yet it remains fundamentally challenging to simultaneously maximize computational efficiency and capture long-range spatial dependencies. We observed that Mamba-based models, with their linear state-space design, capture long-range dependencies at lower cost, offering a promising balance between efficiency and accuracy. However, existing methods rely on axis-aligned scanning within a fixed window, inevitably discarding spatial information. To address this problem, we propose WinMamba, a novel Mamba-based 3D feature-encoding backbone composed of stacked WinMamba blocks. To enhance the backbone with robust multi-scale representation, the WinMamba block incorporates a window-scale-adaptive module that compensates voxel features across varying resolutions during sampling. Meanwhile, to obtain rich contextual cues within the linear state space, we equip the WinMamba layer with a learnable positional encoding and a window-shift strategy. Extensive experiments on the KITTI and Waymo datasets demonstrate that WinMamba significantly outperforms the baseline. Ablation studies further validate the individual contributions of the WSF and AWF modules in improving detection accuracy. The code will be made publicly available.
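
The window-shift idea can be illustrated on a toy voxel grid. This is a minimal sketch, not the authors' implementation: the names `window_index` and `group_by_window`, the 2D coordinates, the window size, and the half-window shift are all assumptions made for illustration.

```python
def window_index(coord, window, shift=0):
    """Return the id of the square window that contains voxel (x, y).

    `shift` offsets the window grid; shifting by half a window is an
    assumed choice here, not a detail taken from the paper.
    """
    x, y = coord
    return ((x + shift) // window, (y + shift) // window)


def group_by_window(coords, window, shift=0):
    """Group voxel coordinates by the window they fall into."""
    groups = {}
    for c in coords:
        groups.setdefault(window_index(c, window, shift), []).append(c)
    return groups


voxels = [(3, 0), (4, 0), (7, 0), (8, 0)]
plain = group_by_window(voxels, window=4)             # fixed partition
shifted = group_by_window(voxels, window=4, shift=2)  # shifted partition
```

With the fixed partition, the adjacent voxels (3, 0) and (4, 0) land in different windows; after the shift they share one, so alternating the two partitions lets information cross the original window borders.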

[312] Automated Road Distress Detection Using Vision Transformers and Generative Adversarial Networks

Cesar Portocarrero Rodriguez,Laura Vandeweyen,Yosuke Yamamoto

Main category: cs.CV

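TL;DR: This study explores state-of-the-art computer vision techniques for road distress segmentation, uses synthetic data generated by Generative Adversarial Networks (GANs) to improve model performance, and compares a Convolutional Neural Network (CNN) with the transformer-based MaskFormer; MaskFormer performs better on both mAP50 and IoU.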

Details

Motivation: The American Society of Civil Engineers grades America's infrastructure a C, with the road system receiving only a D. Current road inspection relies on outdated manual or laser-based methods that are costly and time-consuming, so more efficient automated solutions are needed. Method: The study first evaluates how useful GAN-generated synthetic data is for model training, then applies a CNN to road distress segmentation, and finally tests the transformer-based MaskFormer. Result: GAN-generated data improves model performance, and MaskFormer outperforms the CNN model on both mAP50 and IoU. Conclusion: The transformer-based MaskFormer, combined with GAN-generated synthetic data, improves the accuracy of road distress segmentation and offers a viable technical path toward intelligent monitoring of road infrastructure. Abstract: The American Society of Civil Engineers has graded America's infrastructure condition as a C, with the road system receiving a dismal D. Roads are vital to regional economic viability, yet their management, maintenance, and repair processes remain inefficient, relying on outdated manual or laser-based inspection methods that are both costly and time-consuming. With the increasing availability of real-time visual data from autonomous vehicles, there is an opportunity to apply computer vision (CV) methods for advanced road monitoring, providing insights to guide infrastructure rehabilitation efforts. This project explores the use of state-of-the-art CV techniques for road distress segmentation. It begins by evaluating synthetic data generated with Generative Adversarial Networks (GANs) to assess its usefulness for model training. The study then applies Convolutional Neural Networks (CNNs) for road distress segmentation and subsequently examines the transformer-based model MaskFormer. Results show that GAN-generated data improves model performance and that MaskFormer outperforms the CNN model in two metrics: mAP50 and IoU.
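
Both reported metrics rest on intersection over union. As a rough illustration, IoU between two binary segmentation masks can be computed as below; the function name `mask_iou` and the nested-list mask format are assumptions, and the abstract does not describe the authors' evaluation code.

```python
def mask_iou(pred, target):
    """Intersection over union of two binary masks given as nested lists."""
    inter = 0
    union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks are treated as a perfect match (a convention assumed here).
    return inter / union if union else 1.0


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
iou = mask_iou(pred, target)  # 2 shared pixels out of 4 covered pixels
```

In the usual mAP50 convention, a predicted instance counts as a true positive when its IoU with a ground-truth instance is at least 0.5, so the example above would sit exactly on that threshold.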

[313] Skeletons Speak Louder than Text: A Motion-Aware Pretraining Paradigm for Video-Based Person Re-Identification

Rifen Lin,Alex Jinpeng Wang,Jiawei Mo,Min Li

Main category: cs.CV

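TL;DR: This paper proposes CSIP-ReID, the first skeleton-driven pretraining framework for video-based person re-identification; with contrastive skeleton-image pretraining and a dynamic Prototype Fusion Updater, it achieves state-of-the-art results on multiple benchmarks.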

Details

Motivation: Existing text-based multimodal pretraining for video person re-identification lacks genuine multimodal pretraining, and text struggles to capture fine-grained temporal motion. Method: CSIP-ReID is a two-stage method: the first stage uses contrastive learning to align skeleton and visual features at the sequence level; the second stage introduces a dynamic Prototype Fusion Updater (PFU) and a Skeleton Guided Temporal Modeling (SGTM) module to fuse motion and appearance cues. Result: It reaches state-of-the-art results on the video ReID benchmarks MARS, LS-VID, and iLIDS-VID, and generalizes strongly to the skeleton-only ReID tasks BIWI and IAS, significantly outperforming previous methods. Conclusion: CSIP-ReID pioneers an annotation-free, motion-aware pretraining paradigm for person re-identification and opens a new direction for multimodal representation learning. Abstract: Multimodal pretraining has revolutionized visual understanding, but its impact on video-based person re-identification (ReID) remains underexplored. Existing approaches often rely on video-text pairs, yet suffer from two fundamental limitations: (1) they lack genuine multimodal pretraining, and (2) text poorly captures fine-grained temporal motion, an essential cue for distinguishing identities in video. In this work, we take a bold departure from text-based paradigms by introducing the first skeleton-driven pretraining framework for ReID. To achieve this, we propose Contrastive Skeleton-Image Pretraining for ReID (CSIP-ReID), a novel two-stage method that leverages skeleton sequences as a spatiotemporally informative modality aligned with video frames. In the first stage, we employ contrastive learning to align skeleton and visual features at the sequence level. In the second stage, we introduce a dynamic Prototype Fusion Updater (PFU) to refine multimodal identity prototypes, fusing motion and appearance cues. Moreover, we propose a Skeleton Guided Temporal Modeling (SGTM) module that distills temporal cues from skeleton data and integrates them into visual features. Extensive experiments demonstrate that CSIP-ReID achieves new state-of-the-art results on standard video ReID benchmarks (MARS, LS-VID, iLIDS-VID). Moreover, it exhibits strong generalization to skeleton-only ReID tasks (BIWI, IAS), significantly outperforming previous methods. CSIP-ReID pioneers an annotation-free and motion-aware pretraining paradigm for ReID, opening a new frontier in multimodal representation learning.
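
The first-stage alignment can be pictured with a generic symmetric InfoNCE objective. The abstract only says "contrastive learning", so the exact loss is an assumption, as are the function names, the temperature of 0.07, and the use of plain Python lists as sequence-level embeddings.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.07):
    """Mean cross-entropy in which anchors[i] should match positives[i]."""
    a = [_normalize(v) for v in anchors]
    p = [_normalize(v) for v in positives]
    loss = 0.0
    for i, ai in enumerate(a):
        logits = [_dot(ai, pj) / temperature for pj in p]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(a)


def symmetric_contrastive_loss(skeleton_feats, visual_feats, temperature=0.07):
    """Average of the skeleton-to-visual and visual-to-skeleton losses."""
    return 0.5 * (info_nce(skeleton_feats, visual_feats, temperature)
                  + info_nce(visual_feats, skeleton_feats, temperature))


skeleton = [[1.0, 0.0], [0.0, 1.0]]
visual_aligned = [[0.9, 0.1], [0.1, 0.9]]
visual_swapped = visual_aligned[::-1]
low = symmetric_contrastive_loss(skeleton, visual_aligned)
high = symmetric_contrastive_loss(skeleton, visual_swapped)
```

The loss is small when each skeleton sequence is closest to its own video sequence and large when the pairing is swapped, which is the behavior sequence-level alignment relies on.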

[314] SOMA: Feature Gradient Enhanced Affine-Flow Matching for SAR-Optical Registration

Haodong Wang,Tao Zhuo,Xiuwei Zhang,Hanlin Yin,Wencong Wu,Yanning Zhang

Main category: cs.CV

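TL;DR: Proposes SOMA, a deep learning framework for SAR-optical image registration that significantly improves registration accuracy by introducing structural gradient priors and a hybrid matching strategy.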

Details

Motivation: SAR and optical images are hard to register because of their different imaging mechanisms, and existing deep learning methods do not effectively use gradient information to make features more distinctive. Method: SOMA comprises a Feature Gradient Enhancer (FGE), which strengthens features using multi-scale, multi-directional gradient filters combined with attention, and a Global-Local Affine-Flow Matcher (GLAM), which refines registration from coarse to fine. Result: CMR@1px improves by 12.29% on SEN1-2 and by 18.50% on GFGE_SO, with good robustness and generalization. Conclusion: By fusing structural gradient priors with a hybrid matching strategy, SOMA achieves higher accuracy and stronger robustness in dense SAR-optical registration. Abstract: Achieving pixel-level registration between SAR and optical images remains a challenging task due to their fundamentally different imaging mechanisms and visual characteristics. Although deep learning has achieved great success in many cross-modal tasks, its performance on SAR-Optical registration tasks is still unsatisfactory. Gradient-based information has traditionally played a crucial role in handcrafted descriptors by highlighting structural differences. However, such gradient cues have not been effectively leveraged in deep learning frameworks for SAR-Optical image matching. To address this gap, we propose SOMA, a dense registration framework that integrates structural gradient priors into deep features and refines alignment through a hybrid matching strategy. Specifically, we introduce the Feature Gradient Enhancer (FGE), which embeds multi-scale, multi-directional gradient filters into the feature space using attention and reconstruction mechanisms to boost feature distinctiveness. Furthermore, we propose the Global-Local Affine-Flow Matcher (GLAM), which combines affine transformation and flow-based refinement within a coarse-to-fine architecture to ensure both structural consistency and local accuracy. Experimental results demonstrate that SOMA significantly improves registration precision, increasing the CMR@1px by 12.29% on the SEN1-2 dataset and 18.50% on the GFGE_SO dataset. In addition, SOMA exhibits strong robustness and generalizes well across diverse scenes and resolutions.

[315] THIR: Topological Histopathological Image Retrieval

Zahra Tabatabaei,Jon Sporring

Main category: cs.CV

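TL;DR: This paper proposes THIR, an unsupervised content-based medical image retrieval framework that uses Betti numbers from persistent homology to extract topological features of breast cancer histopathological images, enabling fast, interpretable, training-free retrieval.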

Details

Motivation: Breast cancer is a leading cause of death among women worldwide, so early diagnosis and accurate clinical decisions are critical. Existing deep-learning-based medical image retrieval depends on large annotated datasets and heavy compute, which limits its use in resource-constrained settings; a training-free, efficient, and interpretable retrieval method is therefore needed. Method: THIR uses cubical persistent homology to extract Betti numbers from RGB histopathological images as topological fingerprints, producing compact and interpretable feature vectors, and retrieves similar images by computing distances between these topological descriptors. The whole pipeline is unsupervised and needs neither annotated data nor GPU resources. Result: On the BreaKHis dataset, THIR outperforms existing supervised and unsupervised methods and processes the full dataset in under 20 minutes on a standard CPU, delivering efficient top-K image retrieval. Conclusion: THIR offers a fast, scalable, training-free solution for medical image retrieval with good potential for clinical use, especially where compute is limited. Abstract: According to the World Health Organization, breast cancer claimed the lives of approximately 685,000 women in 2020. Early diagnosis and accurate clinical decision making are critical in reducing this global burden. In this study, we propose THIR, a novel Content-Based Medical Image Retrieval (CBMIR) framework that leverages topological data analysis, specifically Betti numbers derived from persistent homology, to characterize and retrieve histopathological images based on their intrinsic structural patterns. Unlike conventional deep learning approaches that rely on extensive training, annotated datasets, and powerful GPU resources, THIR operates entirely without supervision. It extracts topological fingerprints directly from RGB histopathological images using cubical persistence, encoding the evolution of loops as compact, interpretable feature vectors. The similarity retrieval is then performed by computing the distances between these topological descriptors, efficiently returning the top-K most relevant matches. Extensive experiments on the BreaKHis dataset demonstrate that THIR outperforms state-of-the-art supervised and unsupervised methods. It processes the entire dataset in under 20 minutes on a standard CPU, offering a fast, scalable, and training-free solution for clinical image retrieval.
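
To give a feel for retrieval on topological descriptors, the sketch below builds a simplified fingerprint and ranks a database by distance. It is a stand-in under stated assumptions: it counts connected components (Betti-0) of sublevel sets with 4-connectivity on a single grayscale channel, whereas the abstract describes tracking loops with cubical persistence on RGB images. All function names and the Euclidean distance are illustrative choices.

```python
import math


def betti0(binary):
    """Count 4-connected components of truthy pixels using a flood fill."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] and not seen[r][c]:
                count += 1
                seen[r][c] = True
                stack = [(r, c)]
                while stack:
                    y, x = stack.pop()
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
    return count


def betti0_curve(image, thresholds):
    """Betti-0 of each sublevel set {pixel <= t}: a small feature vector."""
    return [betti0([[px <= t for px in row] for row in image]) for t in thresholds]


def top_k(query_vec, database, k):
    """Names of the k database entries closest to the query descriptor."""
    ranked = sorted(database, key=lambda item: math.dist(query_vec, item[1]))
    return [name for name, _ in ranked[:k]]


image = [[0, 9, 0],
         [9, 9, 9],
         [0, 9, 0]]
fingerprint = betti0_curve(image, thresholds=[0, 9])
database = [("a", [4, 1]), ("b", [1, 1]), ("c", [3, 1])]
matches = top_k(fingerprint, database, k=2)
```

Because the descriptor is just a short vector of counts, no training is involved and ranking reduces to a sort, which is consistent with the CPU-only, training-free character the abstract emphasizes.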

[316] HDW-SR: High-Frequency Guided Diffusion Model based on Wavelet Decomposition for Image Super-Resolution

Chao Yang,Boqian Zhang,Jinghao Xu,Guang Jiang

Main category: cs.CV

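TL;DR: Proposes HDW-SR, a high-frequency guided diffusion network based on wavelet decomposition for single image super-resolution. By diffusing only the residual map and introducing wavelet-based downsampling with sparse cross-attention, it effectively restores high-frequency details and performs well on both synthetic and real-world datasets.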

Details

Motivation: Existing diffusion-based methods for single image super-resolution often blur fine details because they lack sufficient high-frequency guidance, so a more effective high-frequency restoration mechanism is needed. Method: Wavelet decomposition replaces conventional CNN downsampling to obtain a multi-scale frequency decomposition; diffusion is applied only to the residual map; sparse cross-attention between low- and high-frequency subbands supplies high-frequency guidance; a Dynamic Thresholding Block (DTB) refines high-frequency selection; and the invertibility of the wavelet transform enables low-loss feature reconstruction. Result: Experiments on several synthetic and real-world datasets show that HDW-SR excels at restoring fine image details while delivering competitive overall super-resolution performance. Conclusion: Through high-frequency guidance and wavelet decomposition, HDW-SR markedly improves detail recovery in diffusion-based super-resolution and is particularly good at reconstructing sharp high-frequency structures. Abstract: Diffusion-based methods have shown great promise in single image super-resolution (SISR); however, existing approaches often produce blurred fine details due to insufficient guidance in the high-frequency domain. To address this issue, we propose a High-Frequency Guided Diffusion Network based on Wavelet Decomposition (HDW-SR), which replaces the conventional U-Net backbone in diffusion frameworks. Specifically, we perform diffusion only on the residual map, allowing the network to focus more effectively on high-frequency information restoration. We then introduce wavelet-based downsampling in place of standard CNN downsampling to achieve multi-scale frequency decomposition, enabling sparse cross-attention between the high-frequency subbands of the pre-super-resolved image and the low-frequency subbands of the diffused image for explicit high-frequency guidance. Moreover, a Dynamic Thresholding Block (DTB) is designed to refine high-frequency selection during the sparse attention process. During upsampling, the invertibility of the wavelet transform ensures low-loss feature reconstruction. Experiments on both synthetic and real-world datasets demonstrate that HDW-SR achieves competitive super-resolution performance, excelling particularly in recovering fine-grained image details. The code will be available after acceptance.
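
The appeal of wavelet downsampling is that it halves resolution without discarding information. The sketch below shows a one-level 2D Haar transform and its exact inverse; the abstract does not say which wavelet the authors use, so Haar is an assumption, and the subband names follow one common convention.

```python
def haar_dwt2(img):
    """One-level 2D Haar transform of an image with even height and width.

    Returns four half-resolution subbands: one low-frequency average (ll)
    and three high-frequency detail bands (lh, hl, hh).
    """
    h, w = len(img), len(img[0])
    ll, lh, hl, hh = ([[0.0] * (w // 2) for _ in range(h // 2)] for _ in range(4))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            ll[i // 2][j // 2] = (a + b + c + d) / 2
            lh[i // 2][j // 2] = (a - b + c - d) / 2  # left-right differences
            hl[i // 2][j // 2] = (a + b - c - d) / 2  # top-bottom differences
            hh[i // 2][j // 2] = (a - b - c + d) / 2  # diagonal differences
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Invert haar_dwt2 and recover the full-resolution image."""
    h, w = len(ll) * 2, len(ll[0]) * 2
    img = [[0.0] * w for _ in range(h)]
    for i in range(len(ll)):
        for j in range(len(ll[0])):
            s, x, y, z = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            img[2 * i][2 * j] = (s + x + y + z) / 2
            img[2 * i][2 * j + 1] = (s - x + y - z) / 2
            img[2 * i + 1][2 * j] = (s + x - y - z) / 2
            img[2 * i + 1][2 * j + 1] = (s - x - y + z) / 2
    return img


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
subbands = haar_dwt2(image)
restored = haar_idwt2(*subbands)
```

A strided convolution or pooling layer would keep only something like the `ll` band; keeping the three detail bands as separate channels is what makes the downsampling reversible and gives the attention mechanism explicit high-frequency inputs.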

[317] GenTract: Generative Global Tractography

Alec Sargood,Lemuel Puglisi,Elinor Thompson,Mirco Musolesi,Daniel C. Alexander

Main category: cs.CV

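TL;DR: GenTract is the first generative model for global tractography. By mapping diffusion magnetic resonance imaging (dMRI) directly to complete, anatomically plausible streamlines, it substantially improves tractography precision, especially on low-resolution and noisy data.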

Details

Motivation: Local tractography methods are prone to error accumulation and high false positive rates, while traditional global methods are computationally expensive, so an efficient and precise new global method is needed. Method: GenTract frames tractography as a generative task, compares diffusion-based and flow matching paradigms, and learns to generate complete streamlines directly from dMRI data. Result: GenTract achieves 2.1x the precision of TractOracle, the next-best method, and in low-resolution and noisy settings it outperforms the closest competitor by an order of magnitude. Conclusion: GenTract produces precise, reliable tractograms on both research-grade and lower-quality data, making it a promising solution for global tractography. Abstract: Tractography is the process of inferring the trajectories of white-matter pathways in the brain from diffusion magnetic resonance imaging (dMRI). Local tractography methods, which construct streamlines by following local fiber orientation estimates stepwise through an image, are prone to error accumulation and high false positive rates, particularly on noisy or low-resolution data. In contrast, global methods, which attempt to optimize a collection of streamlines to maximize compatibility with underlying fiber orientation estimates, are computationally expensive. To address these challenges, we introduce GenTract, the first generative model for global tractography. We frame tractography as a generative task, learning a direct mapping from dMRI to complete, anatomically plausible streamlines. We compare both diffusion-based and flow matching paradigms and evaluate GenTract's performance against state-of-the-art baselines. Notably, GenTract achieves precision 2.1x higher than the next-best method, TractOracle. This advantage becomes even more pronounced in challenging low-resolution and noisy settings, where it outperforms the closest competitor by an order of magnitude. By producing tractograms with high precision on research-grade data while also maintaining reliability on imperfect, lower-resolution data, GenTract represents a promising solution for global tractography.

Diego Ortego,Marlon Rodríguez,Mario Almagro,Kunal Dahiya,David Jiménez,Juan C. SanMiguel

Main category: cs.CV

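TL;DR: This paper proposes the ViXML framework, which combines large decoder models with visual information to improve extreme multi-label classification (XMC), significantly outperforming existing methods while remaining computationally efficient.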

Details

Motivation: To explore how large decoder models and visual information can be used effectively in extreme multi-label classification while balancing performance and computational efficiency. Method: ViXML pools a single embedding per image from a foundation vision model and combines it with either a small encoder or a large decoder for multimodal learning. Result: Experiments on four public datasets and their image-enhanced versions show gains of up to +8.21% in P@1 on the largest dataset, and the small-encoder variant usually outperforms a text-only large decoder. Conclusion: ViXML effectively integrates visual information and exploits large models to improve XMC, shows that images can compensate for a smaller parameter count, and provides a new direction and benchmark resources for future multimodal XMC research. Abstract: Foundation models have revolutionized artificial intelligence across numerous domains, yet their transformative potential remains largely untapped in Extreme Multi-label Classification (XMC). Queries in XMC are associated with relevant labels from extremely large label spaces, where it is critical to strike a balance between efficiency and performance. Therefore, many recent approaches efficiently pose XMC as a maximum inner product search between embeddings learned from small encoder-only transformer architectures. In this paper, we address two important aspects in XMC: how to effectively harness larger decoder-only models, and how to exploit visual information while maintaining computational efficiency. We demonstrate that both play a critical role in XMC separately and can be combined for improved performance. We show that a decoder with a few billion parameters can deliver substantial improvements while keeping computational overhead manageable. Furthermore, our Vision-enhanced eXtreme Multi-label Learning framework (ViXML) efficiently integrates foundation vision models by pooling a single embedding per image. This limits computational growth while unlocking multi-modal capabilities. Remarkably, ViXML with small encoders outperforms text-only decoders in most cases, showing that an image is worth billions of parameters. Finally, we present an extension of existing text-only datasets to exploit visual metadata and make them available for future benchmarking. Comprehensive experiments across four public text-only datasets and their corresponding image-enhanced versions validate our proposals' effectiveness, surpassing previous state-of-the-art by up to +8.21% in P@1 on the largest dataset.
ViXML's code is available at https://github.com/DiegoOrtego/vixml.
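
The retrieval view of XMC described in the abstract can be sketched as a maximum inner product search over label embeddings. Everything specific below is hypothetical: the function names, the toy 2D embeddings, and especially the fusion step (adding the mean of the per-image embeddings to the text embedding), which is a placeholder rather than ViXML's actual mechanism.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fuse(text_emb, image_embs):
    """Hypothetical fusion: text embedding plus the pooled image embedding."""
    if not image_embs:
        return list(text_emb)
    pooled = mean_pool(image_embs)
    return [t + v for t, v in zip(text_emb, pooled)]


def top_k_labels(query, label_embs, k):
    """Maximum inner product search by brute force over all labels."""
    scores = {name: sum(q * l for q, l in zip(query, emb))
              for name, emb in label_embs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


labels = {"shoes": [1.0, 0.0], "laptop": [0.0, 1.0], "bag": [0.5, 0.5]}
text = [0.6, 0.4]
images = [[0.0, 1.0], [0.0, 0.5]]  # one embedding per image

text_only = top_k_labels(text, labels, k=1)
with_images = top_k_labels(fuse(text, images), labels, k=2)
```

In this toy case the image embeddings change the top-ranked label, which mirrors the abstract's point that visual information can shift predictions at the cost of only one extra vector per image. Real systems replace the brute-force scan with an approximate nearest-neighbor index.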

[319] Video Spatial Reasoning with Object-Centric 3D Rollout

Haoran Tang,Meng Cao,Ruyang Liu,Xiaoxi Liang,Linglong Li,Ge Li,Xiaodan Liang

Main category: cs.CV

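TL;DR: Proposes Object-Centric 3D Rollout (OCR), a new method that introduces 3D geometric perturbations during training to improve the video spatial reasoning of multimodal large language models, significantly outperforming existing methods.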

Details

Motivation: Existing MLLMs exhibit query-locked reasoning in video spatial reasoning and struggle to fully understand object relationships and contextual information in dynamic 3D scenes. Method: During training, structured perturbations are applied to the 3D geometry of selected objects, weakening object-specific visual cues, and the altered geometry is projected into 2D space; a rollout-based training pipeline jointly uses vanilla and region-noisy videos to optimize spatial reasoning trajectories. Result: A 3B-parameter model reaches 47.5% accuracy on VSI-Bench, surpassing several 7B baselines; ablations show that OCR outperforms prior strategies such as T-GRPO and NoisyRollout. Conclusion: OCR improves the holistic spatial understanding of dynamic scenes in multimodal large language models, mitigates query-locked reasoning, and offers a new route for small models to achieve strong spatial reasoning. Abstract: Recent advances in Multi-modal Large Language Models (MLLMs) have showcased remarkable capabilities in vision-language understanding. However, enabling robust video spatial reasoning (the ability to comprehend object locations, orientations, and inter-object relationships in dynamic 3D scenes) remains a key unsolved challenge. Existing approaches primarily rely on spatially grounded supervised fine-tuning or reinforcement learning, yet we observe that such models often exhibit query-locked reasoning, focusing narrowly on objects explicitly mentioned in the prompt while ignoring critical contextual cues. To address this limitation, we propose Object-Centric 3D Rollout (OCR), a novel strategy that introduces structured perturbations to the 3D geometry of selected objects during training. By degrading object-specific visual cues and projecting the altered geometry into 2D space, OCR compels the model to reason holistically across the entire scene. We further design a rollout-based training pipeline that jointly leverages vanilla and region-noisy videos to optimize spatial reasoning trajectories. Experiments demonstrate state-of-the-art performance: our 3B-parameter model achieves 47.5% accuracy on VSI-Bench, outperforming several 7B baselines. Ablations confirm OCR's superiority over prior rollout strategies (e.g., T-GRPO, NoisyRollout).

[320] Birth of a Painting: Differentiable Brushstroke Reconstruction

Ying Jiang,Jiayin Lu,Yunuo Chen,Yumeng He,Kui Wu,Yin Yang,Chenfanfu Jiang

Main category: cs.CV

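TL;DR: Proposes a differentiable stroke reconstruction framework that unifies painting, stylized texturing, and smudging to generate realistic digital paintings.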

Details

Motivation: Existing painting synthesis methods mostly focus on the final image or on patch-based process simulation; they lack explicit stroke structure and natural shading transitions, so they cannot faithfully reproduce the human painting process. Method: A parallel differentiable renderer optimizes single- and dual-color Bezier strokes, combined with a geometry-conditioned texture generation module and a differentiable smudge operator; stroke geometry, color, and texture are optimized jointly under a coarse-to-fine strategy. Result: Experiments on oil, watercolor, ink, and digital paintings show that the method produces realistic stroke structure, smooth tonal transitions, and richly stylized appearances. Conclusion: The framework successfully models the human painting-smudging loop and provides a unified, controllable generative model for expressive digital painting. Abstract: Painting embodies a unique form of visual storytelling, where the creation process is as significant as the final artwork. Although recent advances in generative models have enabled visually compelling painting synthesis, most existing methods focus solely on final image generation or patch-based process simulation, lacking explicit stroke structure and failing to produce smooth, realistic shading. In this work, we present a differentiable stroke reconstruction framework that unifies painting, stylized texturing, and smudging to faithfully reproduce the human painting-smudging loop. Given an input image, our framework first optimizes single- and dual-color Bezier strokes through a parallel differentiable paint renderer, followed by a style generation module that synthesizes geometry-conditioned textures across diverse painting styles. We further introduce a differentiable smudge operator to enable natural color blending and shading. Coupled with a coarse-to-fine optimization strategy, our method jointly optimizes stroke geometry, color, and texture under geometric and semantic guidance. Extensive experiments on oil, watercolor, ink, and digital paintings demonstrate that our approach produces realistic and expressive stroke reconstructions, smooth tonal transitions, and richly stylized appearances, offering a unified model for expressive digital painting creation. See our project page for more demos: https://yingjiang96.github.io/DiffPaintWebsite/.
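
A Bezier stroke is a smooth function of its control points, which is what makes stroke geometry amenable to gradient-based optimization. The sketch below evaluates a quadratic Bezier curve; the degree of the authors' strokes is not stated in the abstract, so the quadratic form, the function names, and the 2D tuples are assumptions.

```python
def bezier_point(p0, p1, p2, t):
    """Point on a quadratic Bezier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    return tuple(u * u * a + 2.0 * u * t * b + t * t * c
                 for a, b, c in zip(p0, p1, p2))


def sample_stroke(p0, p1, p2, n=5):
    """Sample n evenly spaced parameter values along the stroke centerline."""
    return [bezier_point(p0, p1, p2, i / (n - 1)) for i in range(n)]


start, control, end = (0.0, 0.0), (1.0, 2.0), (2.0, 0.0)
centerline = sample_stroke(start, control, end, n=5)
midpoint = bezier_point(start, control, end, 0.5)
```

Each sampled point is a polynomial in the control points, so a rendering loss computed from these samples has well-defined gradients with respect to `start`, `control`, and `end`.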

[321] Difficulty-Aware Label-Guided Denoising for Monocular 3D Object Detection

Soyul Lee,Seungmin Baek,Dongbo Min

Main category: cs.CV

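TL;DR: Proposes MonoDLGD, a new monocular 3D object detection framework whose difficulty-aware label-guided denoising mechanism adaptively perturbs and reconstructs labels for instances of differing difficulty, improving detection performance.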

Details

Motivation: Monocular 3D object detection is challenging because depth cues are ambiguous, and existing methods fall short in depth estimation and in modeling instance-level detection difficulty (such as occlusion, distance, and truncation). Method: MonoDLGD adaptively perturbs and reconstructs ground-truth labels based on detection uncertainty, applying stronger perturbations to easy instances and weaker ones to hard instances, and jointly optimizes label reconstruction and detection to strengthen geometry-aware representation learning. Result: It achieves state-of-the-art performance on the KITTI benchmark, with strong results at every difficulty level. Conclusion: By introducing a difficulty-aware label-guided denoising strategy, MonoDLGD improves the robustness and accuracy of monocular 3D detection, particularly for objects in complex scenes. Abstract: Monocular 3D object detection is a cost-effective solution for applications like autonomous driving and robotics, but remains fundamentally ill-posed due to inherently ambiguous depth cues. Recent DETR-based methods attempt to mitigate this through global attention and auxiliary depth prediction, yet they still struggle with inaccurate depth estimates. Moreover, these methods often overlook instance-level detection difficulty, such as occlusion, distance, and truncation, leading to suboptimal detection performance. We propose MonoDLGD, a novel Difficulty-Aware Label-Guided Denoising framework that adaptively perturbs and reconstructs ground-truth labels based on detection uncertainty. Specifically, MonoDLGD applies stronger perturbations to easier instances and weaker ones to harder cases, and then reconstructs them to effectively provide explicit geometric supervision. By jointly optimizing label reconstruction and 3D object detection, MonoDLGD encourages geometry-aware representation learning and improves robustness to varying levels of object complexity. Extensive experiments on the KITTI benchmark demonstrate that MonoDLGD achieves state-of-the-art performance across all difficulty levels.
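
The difficulty-aware rule (stronger noise for easy instances, weaker noise for hard ones) can be sketched as a noise scale that decreases with uncertainty. The linear schedule, the `max_scale` value, the multiplicative uniform noise, and the function names are all assumptions; the abstract does not give the actual perturbation formula.

```python
import random


def noise_scale(uncertainty, max_scale=0.4):
    """Map an uncertainty in [0, 1] to a perturbation strength.

    Low uncertainty (easy instance) yields strong noise; high uncertainty
    (hard instance) yields weak noise. The linear form is an assumption.
    """
    u = min(max(uncertainty, 0.0), 1.0)
    return max_scale * (1.0 - u)


def perturb_label(label, uncertainty, rng, max_scale=0.4):
    """Scale each label value by a random factor within the allowed range."""
    s = noise_scale(uncertainty, max_scale)
    return [v * (1.0 + rng.uniform(-s, s)) for v in label]


rng = random.Random(0)
box = [1.6, 1.5, 3.9, 20.0]  # hypothetical height, width, length, depth
easy_noisy = perturb_label(box, uncertainty=0.1, rng=rng)
hard_noisy = perturb_label(box, uncertainty=1.0, rng=rng)
```

A denoising branch would then be trained to recover `box` from the perturbed version, so easy instances present a harder reconstruction problem while hard instances stay close to their ground truth.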

[322] Self-Supervised Ultrasound Screen Detection

Alberto Gomez,Jorge Oliveira,Ramon Casero,Agis Chartsias

Main category: cs.CV

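TL;DR: Proposes a self-supervised pipeline that extracts the image from a photograph of an ultrasound machine's monitor, avoiding the DICOM transfer bottleneck; in a validation experiment, the rectified images reach a balanced accuracy of 0.79 for cardiac view classification.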

Details

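Motivation: Ultrasound machines typically rely on DICOM for image transfer, which limits rapid testing and algorithm prototyping, so a more flexible and efficient way to extract images is needed. Method: A self-supervised learning pipeline extracts and rectifies the ultrasound image from a photograph of the monitor. Result: In a proof-of-concept study, the rectified images reached a balanced accuracy of 0.79 on cardiac view classification with respect to the native DICOM images. Conclusion: The method bypasses the DICOM bottleneck, supports rapid acquisition of ultrasound images and algorithm development, and shows potential for clinical use. Abstract: Ultrasound (US) machines display images on a built-in monitor, but routine transfer to hospital systems relies on DICOM. We propose a self-supervised pipeline to extract the US image from a photograph of the monitor. This removes the DICOM bottleneck and enables rapid testing and prototyping of new algorithms. In a proof-of-concept study, the rectified images retained enough visual fidelity to classify cardiac views with a balanced accuracy of 0.79 with respect to the native DICOMs.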
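
Balanced accuracy, the metric reported above, is the mean of per-class recall, so rare views weigh as much as common ones. The sketch below shows the standard definition; the view labels and predictions are made-up toy data, not results from the study.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Hypothetical cardiac view labels: 4 apical views, 2 parasternal views.
y_true = ["A4C", "A4C", "A4C", "A4C", "PLAX", "PLAX"]
y_pred = ["A4C", "A4C", "A4C", "PLAX", "PLAX", "A4C"]
score = balanced_accuracy(y_true, y_pred)  # (3/4 + 1/2) / 2
```

Plain accuracy on the same toy data would be 4/6; balanced accuracy is lower because the minority class is recognized only half the time.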

[323] RefineVAD: Semantic-Guided Feature Recalibration for Weakly Supervised Video Anomaly Detection

Junhee Lee,ChaeBeen Bang,MyoungChul Kim,MyeongAh Cho

Main category: cs.CV

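TL;DR: This paper proposes RefineVAD, a new framework for weakly supervised video anomaly detection that improves detection performance by combining temporal dynamics with semantic structure.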

Details

Motivation: Existing methods usually treat all anomalous events as a single category, ignoring the diversity and complexity of real-world anomalies; this work models anomalies more finely by introducing semantic category priors and temporal modeling. Method: RefineVAD comprises two modules: MoTAR (Motion-aware Temporal Attention and Recalibration), which captures motion salience and dynamically adjusts temporal focus, and CORE (Category-Oriented Refinement), which aligns segment-level features with learnable category prototypes through cross-attention to inject soft anomaly category priors. Result: Experiments on WVAD benchmarks show that RefineVAD outperforms existing methods, confirming that semantic context helps feature refinement. Conclusion: Jointly modeling temporal dynamics and semantic structure improves weakly supervised video anomaly detection, and semantically guided feature refinement is the key ingredient. Abstract: Weakly-Supervised Video Anomaly Detection aims to identify anomalous events using only video-level labels, balancing annotation efficiency with practical applicability. However, existing methods often oversimplify the anomaly space by treating all abnormal events as a single category, overlooking the diverse semantic and temporal characteristics intrinsic to real-world anomalies. Inspired by how humans perceive anomalies, by jointly interpreting temporal motion patterns and semantic structures underlying different anomaly types, we propose RefineVAD, a novel framework that mimics this dual-process reasoning. Our framework integrates two core modules. The first, Motion-aware Temporal Attention and Recalibration (MoTAR), estimates motion salience and dynamically adjusts temporal focus via shift-based attention and global Transformer-based modeling. The second, Category-Oriented Refinement (CORE), injects soft anomaly category priors into the representation space by aligning segment-level features with learnable category prototypes through cross-attention. By jointly leveraging temporal dynamics and semantic structure, RefineVAD explicitly models both "how" motion evolves and "what" semantic category it resembles. Extensive experiments on WVAD benchmarks validate the effectiveness of RefineVAD and highlight the importance of integrating semantic context to guide feature refinement toward anomaly-relevant patterns.

[324] End-to-End Multi-Person Pose Estimation with Pose-Aware Video Transformer

Yonghui Yu,Jiahang Cai,Xun Wang,Wenwu Yang

Main category: cs.CV

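TL;DR: This paper proposes PAVE-Net, a new end-to-end framework for multi-person video pose estimation that avoids the heuristic operations of traditional methods and uses pose-aware attention and spatiotemporal dependency modeling to associate individuals across frames and estimate poses accurately.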

Details

Motivation: Existing two-stage methods depend on heuristic operations such as detection, RoI cropping, and NMS, which limits accuracy and efficiency and makes cross-frame association difficult under heavy occlusion and overlap. Method: PAVE-Net has a spatial encoder that models intra-frame relations and a spatiotemporal pose decoder that captures global dependencies across frames; a pose-aware attention mechanism aggregates features of the same person across consecutive frames, and spatiotemporal dependencies among keypoints are modeled explicitly. Result: On PoseTrack2017 it improves on prior image-based end-to-end methods by 6.0 mAP, matches state-of-the-art two-stage video methods in accuracy, and is significantly more efficient. Conclusion: PAVE-Net is the first end-to-end method for multi-frame 2D human pose estimation; it eliminates heuristic operations, balances accuracy and efficiency better, and moves video pose estimation toward a truly end-to-end design. Abstract: Existing multi-person video pose estimation methods typically adopt a two-stage pipeline: detecting individuals in each frame, followed by temporal modeling for single-person pose estimation. This design relies on heuristic operations such as detection, RoI cropping, and non-maximum suppression (NMS), limiting both accuracy and efficiency. In this paper, we present a fully end-to-end framework for multi-person 2D pose estimation in videos, effectively eliminating heuristic operations. A key challenge is to associate individuals across frames under complex and overlapping temporal trajectories. To address this, we introduce a novel Pose-Aware Video transformEr Network (PAVE-Net), which features a spatial encoder to model intra-frame relations and a spatiotemporal pose decoder to capture global dependencies across frames. To achieve accurate temporal association, we propose a pose-aware attention mechanism that enables each pose query to selectively aggregate features corresponding to the same individual across consecutive frames. Additionally, we explicitly model spatiotemporal dependencies among pose keypoints to improve accuracy. Notably, our approach is the first end-to-end method for multi-frame 2D human pose estimation. Extensive experiments show that PAVE-Net substantially outperforms prior image-based end-to-end methods, achieving a 6.0 mAP improvement on PoseTrack2017, and delivers accuracy competitive with state-of-the-art two-stage video-based approaches, while offering significant gains in efficiency. Project page: https://github.com/zgspose/PAVENet

[325] 3DAlign-DAER: Dynamic Attention Policy and Efficient Retrieval Strategy for Fine-grained 3D-Text Alignment at Scale

Yijia Fan,Jusheng Zhang,Kaitong Cai,Jing Yang,Jian Wang,Keze Wang

Main category: cs.CV

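TL;DR: Proposes 3DAlign-DAER, a unified framework that achieves fine-grained alignment between text and 3D geometry through a dynamic attention policy and an efficient retrieval strategy, and constructs Align3D-2M, a large-scale dataset of 2 million text-3D pairs.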

Details

Motivation: Existing methods align fine-grained textual semantics with 3D geometric structure poorly, and their performance drops sharply on large-scale 3D databases. Method: A dynamic attention policy (DAP) combines a Hierarchical Attention Fusion (HAF) module with Monte Carlo tree search to optimize cross-modal alignment; at inference, an Efficient Retrieval Strategy (ERS) searches the large-scale embedding space. Result: The framework performs strongly on multiple cross-modal retrieval and classification tasks, improving both alignment accuracy and efficiency. Conclusion: 3DAlign-DAER addresses fine-grained text-3D alignment, performs well at scale, and advances the field. Abstract: Despite recent advancements in 3D-text cross-modal alignment, existing state-of-the-art methods still struggle to align fine-grained textual semantics with detailed geometric structures, and their alignment performance degrades significantly when scaling to large-scale 3D databases. To overcome this limitation, we introduce 3DAlign-DAER, a unified framework designed to align text and 3D geometry via the proposed dynamic attention policy and the efficient retrieval strategy, capturing subtle correspondences for diverse cross-modal retrieval and classification tasks. Specifically, during training, our proposed dynamic attention policy (DAP) employs the Hierarchical Attention Fusion (HAF) module to represent the alignment as learnable fine-grained token-to-point attentions. To optimize these attentions across different tasks and geometric hierarchies, our DAP further exploits Monte Carlo tree search to dynamically calibrate HAF attention weights via a hybrid reward signal and further enhances the alignment between textual descriptions and local 3D geometry. During inference, our 3DAlign-DAER introduces an Efficient Retrieval Strategy (ERS) to leverage efficient hierarchical searching in large-scale embedding spaces, outperforming traditional methods (e.g., KNN) in accuracy and efficiency. Furthermore, to facilitate text-3D alignment research and train our 3DAlign-DAER, we construct Align3D-2M, a large-scale dataset featuring 2M text-3D pairs, to provide sufficient fine-grained cross-modal annotations. Extensive and comprehensive experiments demonstrate the superior performance of our 3DAlign-DAER on diverse benchmarks. We will release our code, models, and datasets.

[326] Hybrid-Domain Adaptative Representation Learning for Gaze Estimation

Qida Tan,Hongyu Yang,Wenchao Du

Main category: cs.CV

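TL;DR: Proposes an appearance-based gaze estimation framework built on Hybrid-domain Adaptative Representation Learning (HARL). By disentangling features of high- and low-quality images and fusing geometric constraints from head pose, it achieves robust 3D gaze estimation in cross-domain settings and state-of-the-art performance on several datasets.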

Details

Motivation: Existing appearance-based gaze estimation methods degrade in cross-domain evaluation because of gaze-irrelevant factors such as expressions, wearables, and image quality, so better robustness and generalization are needed. Method: HARL uses unsupervised domain adaptation, aligning features from high-quality near-eye images to disentangle gaze-relevant representations in low-quality facial images; a sparse graph fusion module incorporates the geometric constraint between gaze direction and head pose to make the representation denser and more robust. Result: Mean angular errors of 5.02°, 3.36°, and 9.26° on EyeDiap, MPIIFaceGaze, and Gaze360 respectively, strong cross-dataset results, and almost no added inference cost. Conclusion: Through domain adaptation and geometric constraint fusion, HARL improves appearance-based gaze estimation in cross-domain settings, delivering accurate and robust 3D gaze estimation with good practical potential. Abstract: Appearance-based gaze estimation, aiming to predict accurate 3D gaze direction from a single facial image, has made promising progress in recent years. However, most methods suffer significant performance degradation in cross-domain evaluation due to interference from gaze-irrelevant factors, such as expressions, wearables, and image quality. To alleviate this problem, we present a novel Hybrid-domain Adaptative Representation Learning (HARL) framework that exploits multi-source hybrid datasets to learn robust gaze representation. More specifically, we propose to disentangle gaze-relevant representation from low-quality facial images by aligning features extracted from high-quality near-eye images in an unsupervised domain-adaptation manner, which adds almost no computational or inference cost. Additionally, we analyze the effect of head pose and design a simple yet efficient sparse graph fusion module to explore the geometric constraint between gaze direction and head pose, leading to a dense and robust gaze representation. Extensive experiments on the EyeDiap, MPIIFaceGaze, and Gaze360 datasets demonstrate that our approach achieves state-of-the-art accuracy of 5.02°, 3.36°, and 9.26° respectively, and presents competitive performance in cross-dataset evaluation. The code is available at https://github.com/da60266/HARL.
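
The accuracy figures above are angular errors in degrees. Assuming the usual convention for 3D gaze (the angle between predicted and ground-truth direction vectors, averaged over samples), the metric can be sketched as follows; the function names and example vectors are illustrative.

```python
import math


def angular_error_deg(g_pred, g_true):
    """Angle in degrees between two 3D gaze direction vectors."""
    dot = sum(p * t for p, t in zip(g_pred, g_true))
    norm = math.sqrt(sum(p * p for p in g_pred)) * math.sqrt(sum(t * t for t in g_true))
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos))


def mean_angular_error(preds, truths):
    """Average angular error over a set of predictions."""
    return sum(angular_error_deg(p, t) for p, t in zip(preds, truths)) / len(preds)


same = angular_error_deg([0.0, 0.0, -1.0], [0.0, 0.0, -2.0])
orthogonal = angular_error_deg([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

The vectors are normalized inside the function, so only direction matters; an error of about 5° corresponds to a cosine similarity of roughly 0.996 between prediction and ground truth.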

[327] MRIQT: Physics-Aware Diffusion Model for Image Quality Transfer in Neonatal Ultra-Low-Field MRI

Malek Al Abed,Sebiha Demir,Anne Groteklaes,Elodie Germani,Shahrooz Faghihroohi,Hemmen Sabir,Shadi Albarqouni

Main category: cs.CV

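TL;DR: Proposes MRIQT, a 3D conditional diffusion framework that raises the image quality of portable ultra-low-field MRI toward that of high-field MRI, achieving better denoising and structure preservation than existing methods in neonatal brain imaging.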

Details

Motivation: Portable ultra-low-field MRI (0.064 T) makes neonatal neuroimaging accessible but has low signal-to-noise ratio and poor diagnostic quality, so an effective way to improve image quality for clinical use is needed. Method: MRIQT combines realistic K-space degradation to simulate physics-consistent ultra-low-field images, v-prediction with classifier-free guidance for stable image generation, and an SNR-weighted 3D perceptual loss for anatomical fidelity; the model uses a volumetric attention-UNet architecture and denoises from a noised ultra-low-field input while preserving structure. Result: Trained on a neonatal cohort, MRIQT improves PSNR by 15.3% over recent GAN and CNN baselines and by 1.78% over the state of the art, and physicians rated 85% of its outputs as good quality with pathology clearly visible. Conclusion: MRIQT enables high-fidelity, diffusion-based enhancement of portable ultra-low-field MRI and may support its reliable clinical use in neonatal brain assessment. Abstract: Portable ultra-low-field MRI (uLF-MRI, 0.064 T) offers accessible neuroimaging for neonatal care but suffers from low signal-to-noise ratio and poor diagnostic quality compared to high-field (HF) MRI. We propose MRIQT, a 3D conditional diffusion framework for image quality transfer (IQT) from uLF to HF MRI. MRIQT combines realistic K-space degradation for physics-consistent uLF simulation, v-prediction with classifier-free guidance for stable image-to-image generation, and an SNR-weighted 3D perceptual loss for anatomical fidelity. The model denoises from a noised uLF input conditioned on the same scan, leveraging a volumetric attention-UNet architecture for structure-preserving translation. Trained on a neonatal cohort with diverse pathologies, MRIQT surpasses recent GAN and CNN baselines in PSNR by 15.3%, with a 1.78% gain over the state of the art, while physicians rated 85% of its outputs as good quality with clear pathology present. MRIQT enables high-fidelity, diffusion-based enhancement of portable ultra-low-field (uLF) MRI for reliable neonatal brain assessment.
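
PSNR, the headline metric here, compares an enhanced image against a reference through the mean squared error. The sketch below uses the standard definition on a tiny 2D example; the `data_range` of 1.0 assumes intensities normalized to [0, 1], and the pixel values are made up.

```python
import math


def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D images."""
    sq_err = 0.0
    count = 0
    for r_row, e_row in zip(reference, estimate):
        for r, e in zip(r_row, e_row):
            sq_err += (r - e) ** 2
            count += 1
    mse = sq_err / count
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)


high_field = [[0.0, 1.0], [0.5, 0.5]]
enhanced = [[0.1, 0.9], [0.5, 0.5]]
value = psnr(high_field, enhanced)  # MSE is 0.005, roughly 23 dB
```

Because PSNR is logarithmic, a relative gain quoted in percent depends on the baseline value, which is worth keeping in mind when reading the reported improvements.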

[328] MMD-Thinker: Adaptive Multi-Dimensional Thinking for Multimodal Misinformation Detection

Junjie Wu,Guohong Fu

Main category: cs.CV

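TL;DR: Proposes MMD-Thinker, a two-stage framework for multimodal misinformation detection through adaptive multi-dimensional thinking. It combines task-specific instruction tuning with a reinforcement learning strategy and achieves state-of-the-art performance both in-domain and out-of-domain.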

Details

Motivation: General-purpose multimodal large language models suffer from insufficient reasoning and reasoning biases when detecting multimodal misinformation, and struggle with increasingly complex and fast-evolving AI-generated fake content. Method: A thinking mode tailored to multimodal misinformation detection is designed and injected into a general-purpose MLLM through task-specific instruction tuning; a reinforcement learning strategy with a mixed advantage function then strengthens the reasoning in trajectories. Result: On multiple benchmark datasets, MMD-Thinker reaches state-of-the-art performance in both in-domain and out-of-domain settings while keeping inference flexible and token usage low. The authors also build the MMR dataset of more than 8,000 image-text pairs. Conclusion: By introducing a task-specific multi-dimensional thinking mechanism, MMD-Thinker improves multimodal misinformation detection and offers a new approach to the misinformation challenges of the AIGC era. Abstract: Multimodal misinformation floods social media and continues to evolve in the era of AI-generated content (AIGC). The emerging misinformation, with its low creation cost and high deceptiveness, poses significant threats to society. While recent studies leverage general-purpose multimodal large language models (MLLMs) to achieve remarkable results in detection, they encounter two critical limitations: (1) Insufficient reasoning, where general-purpose MLLMs often follow a uniform reasoning paradigm but generate inaccurate explanations and judgments, due to the lack of task-specific knowledge of multimodal misinformation detection. (2) Reasoning biases, where a single thinking mode leads detectors down a suboptimal path for judgment, struggling to keep pace with the fast-growing and intricate multimodal misinformation. In this paper, we propose MMD-Thinker, a two-stage framework for multimodal misinformation detection through adaptive multi-dimensional thinking. First, we develop a tailor-designed thinking mode for multimodal misinformation detection. Second, we adopt task-specific instruction tuning to inject the tailored thinking mode into general-purpose MLLMs. Third, we further leverage a reinforcement learning strategy with a mixed advantage function, which incentivizes the reasoning capabilities in trajectories. Furthermore, we construct the multimodal misinformation reasoning (MMR) dataset, which encompasses more than 8K image-text pairs with both reasoning processes and classification labels, to make progress in the realm of multimodal misinformation detection. Experimental results demonstrate that our proposed MMD-Thinker achieves state-of-the-art performance on both in-domain and out-of-domain benchmark datasets, while maintaining flexible inference and token usage. Code will be publicly available on GitHub.

[329] Referring Camouflaged Object Detection With Multi-Context Overlapped Windows Cross-Attention

Yu Wen,Shuyong Gao,Shuping Zhang,Miao Huang,Lili Tao,Han Yang,Haozhe Xing,Lihe Zhang,Boxue Hou

Main category: cs.CV

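TL;DR: Proposes RFMNet, a new network for referring camouflaged object detection (Ref-COD) that improves detection through multi-stage interactive feature fusion and a local attention mechanism.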

Details

Motivation: Existing methods convert reference images into one-dimensional prompts and so fail to exploit the multi-context relations between rich salient features and camouflage features, which limits detection performance. Method: RFMNet takes features of the reference salient images from multiple encoding stages and fuses them interactively with the camouflage features at the corresponding stages; an overlapped-windows cross-attention mechanism strengthens local information matching, and a Referring Feature Aggregation (RFA) module decodes and segments progressively. Result: Extensive experiments on the Ref-COD benchmark show state-of-the-art performance. Conclusion: Multi-stage feature fusion and local attention improve the accuracy of referring camouflaged object detection and confirm the advantage of the multi-context fusion strategy. Abstract: Referring camouflaged object detection (Ref-COD) aims to identify hidden objects by incorporating reference information such as images and text descriptions. Previous research has transformed reference images with salient objects into one-dimensional prompts, yielding significant results. We explore ways to enhance performance through multi-context fusion of rich salient image features and camouflaged object features. Therefore, we propose RFMNet, which utilizes features from multiple encoding stages of the reference salient images and performs interactive fusion with the camouflage features at the corresponding encoding stages. Given that the features in salient object images contain abundant object-related detail information, performing feature fusion within local areas is more beneficial for detecting camouflaged objects. Therefore, we propose an Overlapped Windows Cross-attention mechanism to enable the model to focus more attention on the local information matching based on reference features. Besides, we propose the Referring Feature Aggregation (RFA) module to decode and segment the camouflaged objects progressively. Extensive experiments on the Ref-COD benchmark demonstrate that our method achieves state-of-the-art performance.

[330] GeoX-Bench: Benchmarking Cross-View Geo-Localization and Pose Estimation Capabilities of Large Multimodal Models

Yushuo Zheng,Jiangyong Ying,Huiyu Duan,Chunyi Li,Zicheng Zhang,Jing Liu,Xiaohong Liu,Guangtao Zhai

Main category: cs.CV

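TL;DR: Presents GeoX-Bench, a comprehensive benchmark for evaluating large multimodal models (LMMs) on cross-view geo-localization and pose estimation. It contains 10,859 panoramic-satellite image pairs and 755,976 question-answer pairs and is used to evaluate 25 state-of-the-art LMMs. Current LMMs perform well on geo-localization but degrade markedly on the more complex pose estimation task, and instruction tuning significantly improves their geo-sense abilities.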

Details

Motivation: Although large multimodal models (LMMs) excel across many tasks, their abilities in cross-view geo-localization and pose estimation remain unexplored. To fill this gap, the work sets up a standardized benchmark to systematically evaluate and advance LMMs for key applications such as navigation and autonomous driving. Method: GeoX-Bench contains 10,859 panoramic-satellite image pairs from 128 cities in 49 countries and 755,976 QA pairs, of which 42,900 are reserved for benchmarking. On this benchmark, 25 state-of-the-art LMMs are evaluated and the effect of instruction tuning is studied. Result: Existing LMMs perform very well on geo-localization but degrade significantly on pose estimation; instruction tuning on the training data significantly improves their cross-view geo-sense abilities. Conclusion: GeoX-Bench is an effective benchmark for evaluating LMMs on cross-view geo-localization and pose estimation. It exposes the weakness of current models on pose estimation, confirms that instruction tuning helps, and points to directions for improvement. Abstract: Large multimodal models (LMMs) have demonstrated remarkable capabilities across a wide range of tasks; however, their knowledge and abilities in the cross-view geo-localization and pose estimation domains remain unexplored, despite potential benefits for navigation, autonomous driving, outdoor robotics, etc. To bridge this gap, we introduce GeoX-Bench, a comprehensive benchmark designed to explore and evaluate the capabilities of LMMs in cross-view geo-localization and pose estimation. Specifically, GeoX-Bench contains 10,859 panoramic-satellite image pairs spanning 128 cities in 49 countries, along with 755,976 corresponding question-answering (QA) pairs. Among these, 42,900 QA pairs are designated for benchmarking, while the remainder are intended to enhance the capabilities of LMMs. Based on GeoX-Bench, we evaluate the capabilities of 25 state-of-the-art LMMs on cross-view geo-localization and pose estimation tasks, and further explore the empowered capabilities of instruction-tuning. Our benchmark demonstrates that while current LMMs achieve impressive performance in geo-localization tasks, their effectiveness declines significantly on the more complex pose estimation tasks, highlighting a critical area for future improvement, and that instruction-tuning LMMs on the training data of GeoX-Bench can significantly improve their cross-view geo-sense abilities. GeoX-Bench is available at https://github.com/IntMeGroup/GeoX-Bench.

[331] Building Egocentric Procedural AI Assistant: Methods, Benchmarks, and Challenges

Junlong Li,Huaiyuan Xu,Sijie Cheng,Kejun Wu,Kim-Hui Yap,Lap-Pui Chau,Yi Wang

Main category: cs.CV

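TL;DR: Introduces the concept of an egocentric procedural AI assistant (EgoProceAssist) for everyday tasks seen from a first-person view. The paper defines three core tasks (error detection, procedural learning, and question answering), systematically reviews the related techniques, datasets, and evaluation metrics, analyzes the limitations of existing vision-language models through experiments, and discusses future research directions.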

Details

Motivation: Existing AI assistants lack a systematic framework and functional breakdown for supporting everyday procedural tasks from a first-person view, and fall short of practical needs. Method: The paper introduces the EgoProceAssist concept and a new taxonomy built on three core tasks, reviews existing techniques, datasets, and evaluation methods, and compares representative vision-language models experimentally. Result: The gap between EgoProceAssist and existing VLM-based assistants is made explicit, and the evaluation shows that current models remain clearly limited on all three core tasks. Conclusion: The work establishes a systematic research framework for egocentric procedural AI assistants, identifies current technical challenges, and proposes future research directions. Abstract: Driven by recent advances in vision language models (VLMs) and egocentric perception research, we introduce the concept of an egocentric procedural AI assistant (EgoProceAssist) tailored to support daily procedural tasks step by step from a first-person view. In this work, we start by identifying three core tasks: egocentric procedural error detection, egocentric procedural learning, and egocentric procedural question answering. These tasks define the essential functions of EgoProceAssist within a new taxonomy. Specifically, our work encompasses a comprehensive review of current techniques, relevant datasets, and evaluation metrics across these three core areas. To clarify the gap between the proposed EgoProceAssist and existing VLM-based AI assistants, we introduce novel experiments and provide a comprehensive evaluation of representative VLM-based methods. Based on these findings and our technical analysis, we discuss the challenges ahead and suggest future research directions. Furthermore, an exhaustive list for this study is publicly available in an active repository that continuously collects the latest work: https://github.com/z1oong/Building-Egocentric-Procedural-AI-Assistant

[332] SymGS: Leveraging Local Symmetries for 3D Gaussian Splatting Compression

Keshav Gupta,Akshat Sanghvi,Shreyas Reddy Palley,Astitva Srivastava,Charu Sharma,Avinash Sharma

Main category: cs.CV

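TL;DR: Proposes SymGS, a new compression framework that introduces learnable mirrors to remove reflective redundancy in 3D Gaussian Splatting scenes, improving on existing compression methods.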

Details

Motivation: Existing 3D Gaussian Splatting compression methods cannot effectively remove symmetry-related redundancy, which limits further gains in compression ratio. Method: SymGS introduces learnable mirrors into the scene to detect and eliminate local and global reflective redundancy, and is combined with existing compression techniques such as HAC for further compression. Result: Relative to HAC it achieves 1.66× compression on average and up to 3× on large-scale scenes; the overall average compression ratio reaches 108× while rendering quality is preserved. Conclusion: By exploiting mirror symmetry, SymGS substantially improves the compression efficiency of 3D Gaussian Splatting and can serve as a plug-and-play module for existing compression methods. Abstract: 3D Gaussian Splatting has emerged as a transformative technique in novel view synthesis, primarily due to its high rendering speed and photorealistic fidelity. However, its memory footprint scales rapidly with scene complexity, often reaching several gigabytes. Existing methods address this issue by introducing compression strategies that exploit primitive-level redundancy through similarity detection and quantization. We aim to surpass the compression limits of such methods by incorporating symmetry-aware techniques, specifically targeting mirror symmetries to eliminate redundant primitives. We propose a novel compression framework, SymGS, introducing learnable mirrors into the scene, thereby eliminating local and global reflective redundancies for compression. Our framework functions as a plug-and-play enhancement to state-of-the-art compression methods (e.g., HAC) to achieve further compression. Compared to HAC, we achieve $1.66\times$ compression across benchmark datasets (up to $3\times$ on large-scale scenes). On average, SymGS enables $108\times$ compression of a 3DGS scene, while preserving rendering quality. The project page and supplementary material can be found at symgs.github.io
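
To make the mirror idea concrete, here is a minimal sketch of how primitives on one side of a plane could be regenerated from the other side by reflection. It handles only primitive centers; a full Gaussian also carries covariance, orientation, and color terms that would need to be reflected consistently. The plane parameterization (a normal vector and an offset) and the function names are assumptions for illustration, not the SymGS implementation.

```python
import math

def reflect_point(p, normal, offset):
    """Reflect point p across the plane {x : normal . x = offset}."""
    norm = math.sqrt(sum(c * c for c in normal))
    n = [c / norm for c in normal]                    # unit normal
    d = offset / norm                                 # offset rescaled to the unit normal
    dist = sum(pi * ni for pi, ni in zip(p, n)) - d   # signed distance to the plane
    return [pi - 2.0 * dist * ni for pi, ni in zip(p, n)]

def expand_with_mirror(centers, normal, offset):
    """Return the stored centers followed by their mirrored copies."""
    return list(centers) + [reflect_point(p, normal, offset) for p in centers]
```

Under this toy model, a perfectly symmetric region needs only half of its centers plus the plane parameters to be stored, which is the kind of redundancy the abstract describes removing.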

Lingfeng Zhang,Yuchen Zhang,Hongsheng Li,Haoxiang Fu,Yingbo Tang,Hangjun Ye,Long Chen,Xiaojun Liang,Xiaoshuai Hao,Wenbo Ding

Main category: cs.CV

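TL;DR: Presents SpatialSky-Bench, a comprehensive benchmark for evaluating the spatial intelligence of vision-language models (VLMs) in unmanned aerial vehicle (UAV) navigation, together with the 1M-sample SpatialSky-Dataset. On this dataset the authors build Sky-VLM, a model specialized for multi-granularity UAV spatial reasoning that reaches state-of-the-art performance on all tasks.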

Details

Motivation: The spatial intelligence of existing VLMs in UAV scenarios has not been adequately explored, which limits their ability to navigate and understand dynamic environments; a dedicated benchmark and targeted models are missing. Method: The SpatialSky-Bench benchmark covers 13 subtasks in two categories, environmental perception and scene understanding; a large-scale SpatialSky-Dataset is built, and Sky-VLM, a VLM specialized for UAV spatial reasoning, is trained on it. Result: Mainstream VLMs perform poorly on the benchmark, revealing clear deficits in spatial capability; Sky-VLM achieves the best performance on all tasks, confirming the value of a specialized model. Conclusion: A dedicated dataset and benchmark demonstrate the importance of improving VLM spatial intelligence in UAV scenarios, and Sky-VLM offers an effective path for vision-language models in future UAV applications. Abstract: Vision-Language Models (VLMs), leveraging their powerful visual perception and reasoning capabilities, have been widely applied in Unmanned Aerial Vehicle (UAV) tasks. However, the spatial intelligence capabilities of existing VLMs in UAV scenarios remain largely unexplored, raising concerns about their effectiveness in navigating and interpreting dynamic environments. To bridge this gap, we introduce SpatialSky-Bench, a comprehensive benchmark specifically designed to evaluate the spatial intelligence capabilities of VLMs in UAV navigation. Our benchmark comprises two categories, Environmental Perception and Scene Understanding, divided into 13 subcategories, including bounding boxes, color, distance, height, and landing safety analysis, among others. Extensive evaluations of various mainstream open-source and closed-source VLMs reveal unsatisfactory performance in complex UAV navigation scenarios, highlighting significant gaps in their spatial capabilities. To address this challenge, we developed the SpatialSky-Dataset, a comprehensive dataset containing 1M samples with diverse annotations across various scenarios. Leveraging this dataset, we introduce Sky-VLM, a specialized VLM designed for UAV spatial reasoning across multiple granularities and contexts. Extensive experimental results demonstrate that Sky-VLM achieves state-of-the-art performance across all benchmark tasks, paving the way for the development of VLMs suitable for UAV scenarios. The source code is available at https://github.com/linglingxiansen/SpatialSKy.

[334] Recognition of Abnormal Events in Surveillance Videos using Weakly Supervised Dual-Encoder Models

Noam Tsfaty,Avishai Weizman,Liav Cohen,Moshe Tshuva,Yehudit Aperstein

Main category: cs.CV

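TL;DR: Proposes a dual-backbone framework that combines convolutional and transformer representations through top-k pooling to detect rare and diverse anomalies with only video-level supervision.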

Details

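Motivation: Detecting rare and diverse anomalies in surveillance video is challenging when only video-level supervision is available. Method: A dual-backbone framework fuses feature representations from a convolutional neural network and a transformer, and uses top-k pooling to pick out key frames and detect anomalies. Result: 90.7% area under the curve (AUC) on the UCF-Crime dataset. Conclusion: The method detects rare anomalies in complex scenes effectively under video-level supervision and performs strongly. Abstract: We address the challenge of detecting rare and diverse anomalies in surveillance videos using only video-level supervision. Our dual-backbone framework combines convolutional and transformer representations through top-k pooling, achieving 90.7% area under the curve (AUC) on the UCF-Crime dataset.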
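
The abstract mentions top-k pooling but does not spell it out. One common form in weakly supervised anomaly detection, shown here purely as an assumed illustration, turns per-segment anomaly scores into a single video-level score by averaging the k highest ones, so that a video-level label can supervise segment-level predictions. The function name and the toy scores are hypothetical.

```python
def topk_video_score(segment_scores, k):
    """Video-level score as the mean of the k highest segment scores."""
    if not segment_scores:
        raise ValueError("need at least one segment score")
    k = max(1, min(k, len(segment_scores)))          # clamp k to a valid range
    top = sorted(segment_scores, reverse=True)[:k]   # k most anomalous segments
    return sum(top) / k

# A mostly normal video with two suspicious segments (toy values).
scores = [0.1, 0.9, 0.2, 0.8]
video_score = topk_video_score(scores, k=2)
```

With k = 1 this reduces to max pooling, and with k equal to the number of segments it becomes average pooling; intermediate values trade sensitivity to short events against robustness to a single noisy segment.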

[335] SF-Recon: Simplification-Free Lightweight Building Reconstruction via 3D Gaussian Splatting

Zihan Li,Tengfei Wang,Wentian Gan,Hao Zhan,Xin Wang,Zongqian Zhan

Main category: cs.CV

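TL;DR: Proposes SF-Recon, which reconstructs lightweight building surface models directly from multi-view images without post-hoc simplification, and is efficient, structure-preserving, and geometrically accurate.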

Details

Motivation: Conventional multi-view geometry pipelines depend on dense reconstruction, meshing, and simplification; they are cumbersome and quality-sensitive, and do not produce lightweight building surface models efficiently. Method: An initial 3D Gaussian Splatting field is trained to obtain a view-consistent representation; a normal-gradient-guided Gaussian optimization extracts structure aligned with roof and wall boundaries; multi-view edge-consistency pruning sharpens structure and suppresses non-structural artifacts; finally, a multi-view depth-constrained Delaunay triangulation produces a lightweight, structurally faithful building mesh. Result: Experiments on the authors' SF dataset show that SF-Recon directly produces lightweight building models with substantially fewer faces and vertices while keeping good structural accuracy and computational efficiency. Conclusion: SF-Recon offers an end-to-end solution for lightweight building surface reconstruction that avoids the complex post-processing of conventional pipelines and suits applications such as digital cities, navigation, and fast geospatial analytics. Abstract: Lightweight building surface models are crucial for digital cities, navigation, and fast geospatial analytics, yet conventional multi-view geometry pipelines remain cumbersome and quality-sensitive due to their reliance on dense reconstruction, meshing, and subsequent simplification. This work presents SF-Recon, a method that directly reconstructs lightweight building surfaces from multi-view images without post-hoc mesh simplification. We first train an initial 3D Gaussian Splatting (3DGS) field to obtain a view-consistent representation. Building structure is then distilled by a normal-gradient-guided Gaussian optimization that selects primitives aligned with roof and wall boundaries, followed by multi-view edge-consistency pruning to enhance structural sharpness and suppress non-structural artifacts without external supervision. Finally, a multi-view depth-constrained Delaunay triangulation converts the structured Gaussian field into a lightweight, structurally faithful building mesh. Based on a proposed SF dataset, the experimental results demonstrate that our SF-Recon can directly reconstruct lightweight building models from multi-view imagery, achieving substantially fewer faces and vertices while maintaining computational efficiency. Website: https://lzh282140127-cell.github.io/SF-Recon-project/

[336] Towards Metric-Aware Multi-Person Mesh Recovery by Jointly Optimizing Human Crowd in Camera Space

Kaiwen Wang,Kaili Zheng,Yiming Shi,Chenyi Guo,Ji Wu

Main category: cs.CV

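TL;DR: Proposes Depth-conditioned Translation Optimization (DTO) and the Metric-Aware HMR network, builds DTO-Humans, a large-scale scene-consistent multi-person mesh dataset, and achieves state-of-the-art multi-person human mesh recovery.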

Details

Motivation: Existing single-person-centric pipelines for generating pseudo-ground-truth lack scene consistency, producing conflicting depths and scales among people in the same image. Method: DTO jointly optimizes the camera-space translations of all people within a MAP framework, combining human body priors with monocular depth cues; the DTO-Humans dataset is built with it; the Metric-Aware HMR network uses a camera branch and a relative metric loss for end-to-end human mesh recovery in metric scale. Result: Applying DTO to the 4D-Humans dataset yields DTO-Humans, with 0.56M high-quality samples and an average of 4.8 persons per image; experiments show state-of-the-art performance on relative depth reasoning and human mesh recovery. Conclusion: DTO improves the geometric consistency of multi-person scenes, and Metric-Aware HMR delivers accurate metric-scale human mesh estimation, advancing multi-person HMR. Abstract: Multi-person human mesh recovery from a single image is a challenging task, hindered by the scarcity of in-the-wild training data. Prevailing in-the-wild human mesh pseudo-ground-truth (pGT) generation pipelines are single-person-centric, where each human is processed individually without joint optimization. This oversight leads to a lack of scene-level consistency, producing individuals with conflicting depths and scales within the same image. To address this, we introduce Depth-conditioned Translation Optimization (DTO), a novel optimization-based method that jointly refines the camera-space translations of all individuals in a crowd. By leveraging anthropometric priors on human height and depth cues from a monocular depth estimator, DTO solves for a scene-consistent placement of all subjects within a principled Maximum a posteriori (MAP) framework. Applying DTO to the 4D-Humans dataset, we construct DTO-Humans, a new large-scale pGT dataset of 0.56M high-quality, scene-consistent multi-person images, featuring dense crowds with an average of 4.8 persons per image. Furthermore, we propose Metric-Aware HMR, an end-to-end network that directly estimates human mesh and camera parameters in metric scale. This is enabled by a camera branch and a novel relative metric loss that enforces plausible relative scales. Extensive experiments demonstrate that our method achieves state-of-the-art performance on relative depth reasoning and human mesh recovery. Code and data will be released publicly.
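
The abstract does not give DTO's objective, so the following is only a one-person, one-dimensional toy showing how a height prior and a depth cue can be combined in a MAP estimate. It assumes a pinhole camera (depth is roughly focal length times real height over pixel height) and Gaussian uncertainties on both sources; every name and number here is hypothetical.

```python
def depth_from_height(focal_px, height_m, height_px):
    """Pinhole relation: depth = focal length * real height / image height."""
    return focal_px * height_m / height_px

def map_fuse(z_prior, var_prior, z_obs, var_obs):
    """MAP estimate of a scalar under a Gaussian prior and a Gaussian
    observation: the precision-weighted average of the two values."""
    w_prior = 1.0 / var_prior
    w_obs = 1.0 / var_obs
    return (w_prior * z_prior + w_obs * z_obs) / (w_prior + w_obs)

# Height prior says ~5 m away; a monocular depth cue says 6 m (toy values).
z_height = depth_from_height(focal_px=1000.0, height_m=1.7, height_px=340.0)
z_map = map_fuse(z_height, var_prior=1.0, z_obs=6.0, var_obs=1.0)
```

A joint optimization over a whole crowd would couple these per-person terms, for example through a shared depth scale, which this single-person sketch does not attempt.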

[337] TabFlash: Efficient Table Understanding with Progressive Question Conditioning and Token Focusing

Jongha Kim,Minseong Bae,Sanghyeok Lee,Jinsung Yoon,Hyunwoo J. Kim

Main category: cs.CV

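TL;DR: Presents TabFlash, an efficient multimodal large language model for table understanding. Progressive question conditioning, background token pruning, and token focusing together yield more efficient and more accurate table question answering.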

Details

Motivation: Existing MLLMs do not focus enough on question-relevant regions when processing table images and carry a large amount of redundant background information, leading to low efficiency and inadequate representations. Method: Progressive question conditioning injects the question into the ViT layers step by step; a pruning strategy removes background tokens to improve efficiency; a token focusing training strategy reduces the information lost to pruning. Result: TabFlash reaches state-of-the-art performance, outperforming open-source and proprietary MLLMs while using 27% fewer FLOPs and 30% less memory. Conclusion: The method produces compact, informative visual features and markedly improves the efficiency and accuracy of table understanding. Abstract: Table images present unique challenges for effective and efficient understanding due to the need for question-specific focus and the presence of redundant background regions. Existing Multimodal Large Language Model (MLLM) approaches often overlook these characteristics, resulting in uninformative and redundant visual representations. To address these issues, we aim to generate visual features that are both informative and compact to improve table understanding. We first propose progressive question conditioning, which injects the question into Vision Transformer layers with gradually increasing frequency, considering each layer's capacity to handle additional information, to generate question-aware visual features. To reduce redundancy, we introduce a pruning strategy that discards background tokens, thereby improving efficiency. To mitigate information loss from pruning, we further propose token focusing, a training strategy that encourages the model to concentrate essential information in the retained tokens. By combining these approaches, we present TabFlash, an efficient and effective MLLM for table understanding. TabFlash achieves state-of-the-art performance, outperforming both open-source and proprietary MLLMs, while requiring 27% fewer FLOPs and 30% less memory than the second-best MLLM.
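
How TabFlash decides which tokens are background is not stated in the abstract. The sketch below therefore assumes a hypothetical per-token background probability and simply drops tokens at or above a threshold, returning the kept tokens together with their original indices so that positional information can still be recovered.

```python
def prune_background(tokens, background_prob, threshold=0.5):
    """Keep tokens whose (assumed) background probability is below threshold.

    Returns (kept_tokens, kept_indices), both in original order.
    """
    if len(tokens) != len(background_prob):
        raise ValueError("tokens and background_prob must have equal length")
    kept = [(i, t) for i, (t, p) in enumerate(zip(tokens, background_prob))
            if p < threshold]
    return [t for _, t in kept], [i for i, _ in kept]

# Four visual tokens; two are mostly blank table background (toy values).
kept_tokens, kept_indices = prune_background(
    ["blank", "header", "blank", "cell"], [0.9, 0.1, 0.6, 0.2]
)
```

Since attention cost grows with sequence length, removing half of the tokens in this toy case would cut downstream compute noticeably, which is the efficiency argument the abstract makes.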

[338] SkyReels-Text: Fine-grained Font-Controllable Text Editing for Poster Design

Yunjie Yu,Jingchen Wu,Junchen Zhu,Chunze Lin,Guibin Chen

Main category: cs.CV

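TL;DR: Proposes SkyReels-Text, a poster text editing framework that gives precise font control over multiple regions without font labels or fine-tuning, and reaches state-of-the-art text fidelity and visual realism.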

Details

Motivation: Existing image editing models fall short on fine-grained, font-aware text editing and cannot meet the demands of professional design, such as poster design, for visual harmony and preserved typographic intent. Method: In the SkyReels-Text framework, users supply cropped glyph patches of the desired font as references, enabling multi-region text editing without font labels or inference-time fine-tuning while keeping non-edited regions visually consistent. Result: Experiments on multiple datasets, including handwritten text benchmarks, show that SkyReels-Text achieves state-of-the-art text fidelity and visual realism and supports fine control over font families and stylistic details. Conclusion: SkyReels-Text bridges the gap between general-purpose image editing and professional-grade typographic design, offering an efficient, precise text editing solution for real design workflows. Abstract: Artistic design such as poster design often demands rapid yet precise modification of textual content while preserving visual harmony and typographic intent, especially across diverse font styles. Although modern image editing models have grown increasingly powerful, they still fall short in fine-grained, font-aware text manipulation, limiting their utility in professional design workflows such as poster editing. To address this issue, we present SkyReels-Text, a novel font-controllable framework for precise poster text editing. Our method enables simultaneous editing of multiple text regions, each rendered in distinct typographic styles, while preserving the visual appearance of non-edited regions. Notably, our model requires neither font labels nor fine-tuning during inference: users can simply provide cropped glyph patches corresponding to their desired typography, even if the font is not included in any standard library. Extensive experiments on multiple datasets, including handwritten text benchmarks, show that SkyReels-Text achieves state-of-the-art performance in both text fidelity and visual realism, offering unprecedented control over font families and stylistic nuances. This work bridges the gap between general-purpose image editing and professional-grade typographic design.

[339] CorrectAD: A Self-Correcting Agentic System to Improve End-to-end Planning in Autonomous Driving

Enhui Ma,Lijun Zhou,Tao Tang,Jiahuan Zhang,Junpeng Jiang,Zhan Zhang,Dong Han,Kun Zhan,Xueyang Zhang,XianPeng Lang,Haiyang Sun,Xia Zhou,Di Lin,Kaicheng Yu

Main category: cs.CV

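TL;DR: Proposes CorrectAD, a self-correcting agentic system built on a diffusion model and 3D layouts that automatically repairs long-tail failure cases in end-to-end autonomous driving planning.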

Details

Motivation: End-to-end autonomous driving systems face safety challenges because data-driven methods are not robust on the long tail and struggle with rare but critical failure cases. Method: PM-Agent plays the role of a product manager and formulates data requirements; combined with structured 3D layouts, DriveSora generates high-fidelity, spatiotemporally consistent videos that match the 3D annotations; these components form the self-correcting system CorrectAD. Result: On nuScenes and an in-house dataset, CorrectAD corrects 62.5% and 49.8% of failure cases and reduces collision rates by 39% and 27%, respectively, and it works with multiple end-to-end planners. Conclusion: CorrectAD provides a model-agnostic, end-to-end self-correction framework that markedly improves the robustness of autonomous driving systems in long-tail scenarios. Abstract: End-to-end planning methods are the de facto standard of the current autonomous driving system, while the robustness of the data-driven approaches suffers due to the notorious long-tail problem (i.e., rare but safety-critical failure cases). In this work, we explore whether recent diffusion-based video generation methods (a.k.a. world models), paired with structured 3D layouts, can enable a fully automated pipeline to self-correct such failure cases. We first introduce an agent to simulate the role of product manager, dubbed PM-Agent, which formulates data requirements to collect data similar to the failure cases. Then, we use a generative model that can simulate both data collection and annotation. However, existing generative models struggle to generate high-fidelity data conditioned on 3D layouts. To address this, we propose DriveSora, which can generate spatiotemporally consistent videos aligned with the 3D annotations requested by PM-Agent. We integrate these components into our self-correcting agentic system, CorrectAD. Importantly, our pipeline is end-to-end and model-agnostic, and can be applied to improve any end-to-end planner. Evaluated on both nuScenes and a more challenging in-house dataset across multiple end-to-end planners, CorrectAD corrects 62.5% and 49.8% of failure cases, reducing collision rates by 39% and 27%, respectively.

[340] DriveLiDAR4D: Sequential and Controllable LiDAR Scene Generation for Autonomous Driving

Kaiwen Cai,Xinze Liu,Xia Zhou,Hengtong Hu,Jie Xiang,Luyao Zhang,Xueyang Zhang,Kun Zhan,Yifei Zhan,Xianpeng Lang

Main category: cs.CV

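TL;DR: Presents DriveLiDAR4D, a new LiDAR generation pipeline that produces temporally consistent sequential LiDAR point clouds with controllable foregrounds and realistic backgrounds. It is the first sequential generation method with end-to-end full scene manipulation and clearly outperforms the prior state of the art on the nuScenes and KITTI datasets.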

Details

Motivation: Existing LiDAR point cloud generation methods cannot generate sequences and struggle to place foreground objects accurately or produce realistic backgrounds, which limits their practical use in autonomous driving systems. Method: DriveLiDAR4D combines multimodal conditions with LiDAR4DNet, a newly designed sequential noise prediction model, for end-to-end temporally consistent point cloud generation with full scene manipulation. Result: On nuScenes it obtains an FRD score of 743.13 and an FVD score of 16.96, improvements of 37.2% and 24.1% over UniScene, and it also performs well on KITTI. Conclusion: DriveLiDAR4D is the first to achieve end-to-end controllable sequential LiDAR point cloud generation; it clearly exceeds existing methods in generation quality and consistency and has strong practical potential. Abstract: The generation of realistic LiDAR point clouds plays a crucial role in the development and evaluation of autonomous driving systems. Although recent methods for 3D LiDAR point cloud generation have shown significant improvements, they still face notable limitations, including the lack of sequential generation capabilities and the inability to produce accurately positioned foreground objects and realistic backgrounds. These shortcomings hinder their practical applicability. In this paper, we introduce DriveLiDAR4D, a novel LiDAR generation pipeline consisting of multimodal conditions and a novel sequential noise prediction model LiDAR4DNet, capable of producing temporally consistent LiDAR scenes with highly controllable foreground objects and realistic backgrounds. To the best of our knowledge, this is the first work to address the sequential generation of LiDAR scenes with full scene manipulation capability in an end-to-end manner. We evaluated DriveLiDAR4D on the nuScenes and KITTI datasets, where we achieved an FRD score of 743.13 and an FVD score of 16.96 on the nuScenes dataset, surpassing the current state-of-the-art (SOTA) method, UniScene, with a performance boost of 37.2% in FRD and 24.1% in FVD, respectively.

[341] Computer Vision based group activity detection and action spotting

Narthana Sivalingam,Santhirarajah Sivasthigan,Thamayanthi Mahendranathan,G. M. R. I. Godaliyadda,M. P. B. Ekanayake,H. M. V. R. Herath

Main category: cs.CV

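TL;DR: Proposes a computer-vision-based framework that combines deep learning with graph-based relational reasoning for group activity recognition and action spotting.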

Details

Motivation: Group activity detection in multi-person scenes is challenging because of complex human interactions, occlusions, and appearance changes over time. Method: Mask R-CNN localizes people; instance masks are fused with feature maps to refine features; Actor Relation Graphs, built from appearance similarity and positional relations, are processed by a GCN for relational reasoning. Result: Experiments on the Collective Activity dataset show improved recognition performance in both crowded and non-crowded scenes. Conclusion: Combining segmentation, feature extraction, and relational graph reasoning helps with complex video understanding tasks. Abstract: Group activity detection in multi-person scenes is challenging due to complex human interactions, occlusions, and variations in appearance over time. This work presents a computer vision based framework for group activity recognition and action spotting using a combination of deep learning models and graph based relational reasoning. The system first applies Mask R-CNN to obtain accurate actor localization through bounding boxes and instance masks. Multiple backbone networks, including Inception V3, MobileNet, and VGG16, are used to extract feature maps, and RoIAlign is applied to preserve spatial alignment when generating actor specific features. The mask information is then fused with the feature maps to obtain refined masked feature representations for each actor. To model interactions between individuals, we construct Actor Relation Graphs that encode appearance similarity and positional relations using methods such as normalized cross correlation, sum of absolute differences, and dot product. Graph Convolutional Networks operate on these graphs to reason about relationships and predict both individual actions and group level activities. Experiments on the Collective Activity dataset demonstrate that the combination of mask based feature refinement, robust similarity search, and graph neural network reasoning leads to improved recognition performance across both crowded and non crowded scenarios. This approach highlights the potential of integrating segmentation, feature extraction, and relational graph reasoning for complex video understanding tasks.
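
As a rough picture of the relation-graph step, the sketch below builds a relation matrix from dot-product appearance similarity (one of the similarity measures the abstract lists), normalizes each row with a softmax, and performs one weight-free aggregation step. Positional relations and the learned weight matrices of a real GCN layer are left out, and all names are illustrative rather than the authors' code.

```python
import math

def relation_matrix(features):
    """Row-wise softmax over pairwise dot products of actor features."""
    n = len(features)
    rel = []
    for i in range(n):
        sims = [sum(a * b for a, b in zip(features[i], features[j]))
                for j in range(n)]
        m = max(sims)                                # subtract max for stability
        exps = [math.exp(s - m) for s in sims]
        z = sum(exps)
        rel.append([e / z for e in exps])
    return rel

def aggregate(features, rel):
    """One graph-convolution-style step without learned weights: each actor's
    new feature is the relation-weighted average of all actors' features."""
    n, dim = len(features), len(features[0])
    return [[sum(rel[i][j] * features[j][d] for j in range(n))
             for d in range(dim)] for i in range(n)]

# Three actors with 2-D appearance features (toy values).
actors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
rel = relation_matrix(actors)
updated = aggregate(actors, rel)
```

In this toy example the first two actors, whose features are similar, weight each other more heavily than the third, so their updated features stay close together.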

[342] YOLO Meets Mixture-of-Experts: Adaptive Expert Routing for Robust Object Detection

Ori Meiraz,Sharon Shalev,Avishai Weizman

Main category: cs.CV

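TL;DR: Proposes a novel Mixture-of-Experts framework for object detection in which adaptive routing among multiple YOLOv9-T experts enables dynamic feature specialization, improving mAP and AR over a single YOLOv9-T model.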

Details

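Motivation: To improve object detection, especially accuracy and recall in complex scenes, the paper explores the potential of multi-expert systems in this area. Method: A Mixture-of-Experts framework with an adaptive routing mechanism dynamically distributes feature processing among multiple YOLOv9-T models to obtain finer feature representations. Result: In tests on standard datasets, the method clearly improves mAP and AR over a single YOLOv9-T model. Conclusion: The study shows that a multi-expert system with adaptive routing can strengthen object detection models and opens a new direction for future research. Abstract: This paper presents a novel Mixture-of-Experts framework for object detection, incorporating adaptive routing among multiple YOLOv9-T experts to enable dynamic feature specialization and achieve higher mean Average Precision (mAP) and Average Recall (AR) compared to a single YOLOv9-T model.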
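
The abstract does not describe the router, so the following is a generic Mixture-of-Experts gating sketch rather than the paper's design: a softmax over gate logits produces per-expert weights, optionally restricted to the top-k experts, and the expert outputs are blended with those weights. The experts here are stand-in feature vectors, not YOLOv9-T models, and all names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(gate_logits, expert_outputs, top_k=None):
    """Blend expert outputs with softmax gate weights.

    If top_k is given, only the top_k highest-weighted experts are kept
    and their weights are renormalized to sum to 1.
    """
    weights = softmax(gate_logits)
    if top_k is not None:
        keep = sorted(range(len(weights)), key=lambda i: weights[i],
                      reverse=True)[:top_k]
        total = sum(weights[i] for i in keep)
        weights = [weights[i] / total if i in keep else 0.0
                   for i in range(len(weights))]
    dim = len(expert_outputs[0])
    mixed = [sum(w * out[d] for w, out in zip(weights, expert_outputs))
             for d in range(dim)]
    return mixed, weights

# Two experts with equal gate logits contribute equally (toy values).
mixed, weights = route([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

In a detector the gate logits would typically be predicted from the input image or its features, letting different experts specialize on different scene types; that part is outside this sketch.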

[343] Semi-Supervised Multi-Task Learning for Interpretable Quality Assessment of Fundus Images

Lucas Gabriel Telesco,Danila Nejamkin,Estefanía Mata,Francisco Filizzola,Kevin Wignall,Lucía Franco Troilo,María de los Angeles Cenoz,Melissa Thompson,Mercedes Leguía,Ignacio Larrabide,José Ignacio Orlando

Main category: cs.CV

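TL;DR: Proposes a semi-supervised multi-task learning method that uses a small number of manual labels plus pseudo-labels to improve the accuracy and interpretability of retinal image quality assessment at no extra annotation cost.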

Details

Motivation: Existing methods classify only overall image quality and lack fine-grained labels of acquisition defects to guide recapture, while detailed annotation is expensive. Method: A hybrid semi-supervised framework combines manual labels for overall quality with detail pseudo-labels generated by a Teacher model, and trains a ResNet-18 model in a multi-task setting. Result: On EyeQ and DeepDRiD the method beats single-task baselines with higher F1 for overall quality; detail predictions are on par with experts, and pseudo-label noise is consistent with inter-expert variability. Conclusion: Without any added manual labeling cost, the method improves the performance and interpretability of RIQA models and gives clinically actionable feedback on illumination, clarity, and contrast to guide image recapture. Abstract: Retinal image quality assessment (RIQA) supports computer-aided diagnosis of eye diseases. However, most tools classify only overall image quality, without indicating acquisition defects to guide recapture. This gap is mainly due to the high cost of detailed annotations. In this paper, we aim to mitigate this limitation by introducing a hybrid semi-supervised learning approach that combines manual labels for overall quality with pseudo-labels of quality details within a multi-task framework. Our objective is to obtain more interpretable RIQA models without requiring extensive manual labeling. Pseudo-labels are generated by a Teacher model trained on a small dataset and then used to fine-tune a pre-trained model in a multi-task setting. Using a ResNet-18 backbone, we show that these weak annotations improve quality assessment over single-task baselines (F1: 0.875 vs. 0.863 on EyeQ, and 0.778 vs. 0.763 on DeepDRiD), matching or surpassing existing methods. The multi-task model achieved performance statistically comparable to the Teacher for most detail prediction tasks (p > 0.05). In a newly annotated EyeQ subset released with this paper, our model performed similarly to experts, suggesting that pseudo-label noise aligns with expert variability. Our main finding is that the proposed semi-supervised approach not only improves overall quality assessment but also provides interpretable feedback on capture conditions (illumination, clarity, contrast). This enhances interpretability at no extra manual labeling cost and offers clinically actionable outputs to guide image recapture.

[344] Generalized Denoising Diffusion Codebook Models (gDDCM): Tokenizing images using a pre-trained diffusion model

Fei Kong

Main category: cs.CV

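TL;DR: Proposes the generalized Denoising Diffusion Compression Model (gDDCM), which extends DDCM to mainstream diffusion models and their variants, including DDPM, score-based models, consistency models, and rectified flow, and validates its effectiveness and improved performance on the CIFAR-10 and LSUN Bedroom datasets.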

Details

Motivation: DDCM applies only to DDPM and does not carry over to other diffusion models, which limits its use; a more general framework is needed that supports multiple diffusion models and enables image compression. Method: gDDCM generalizes the noise replacement mechanism of DDCM so that it applies to DDPM, Score-Based Models, Consistency Models, and Rectified Flow. Result: Experiments on CIFAR-10 and LSUN Bedroom show that gDDCM successfully extends DDCM to these diffusion models and improves compression performance. Conclusion: gDDCM is a general and effective diffusion-based compression framework that applies broadly to mainstream diffusion models, improving compression performance while maintaining generation quality. Abstract: Recently, Denoising Diffusion Codebook Models (DDCM) were proposed. DDCM leverages the Denoising Diffusion Probabilistic Model (DDPM) and replaces the random noise in the backward process with noise sampled from specific sets according to a predefined rule, thereby enabling image compression. However, DDCM cannot be applied to methods other than DDPM. In this paper, we propose the generalized Denoising Diffusion Compression Model (gDDCM), which extends DDCM to mainstream diffusion models and their variants, including DDPM, Score-Based Models, Consistency Models, and Rectified Flow. We evaluate our method on CIFAR-10 and LSUN Bedroom datasets. Experimental results demonstrate that our approach successfully generalizes DDCM to the aforementioned models and achieves improved performance.

[345] Descriptor: Distance-Annotated Traffic Perception Question Answering (DTPQA)

Nikos Theodoridis,Tim Brophy,Reenu Mohandas,Ganesh Sistu,Fiachra Collins,Anthony Scanlan,Ciaran Eising

Main category: cs.CV

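TL;DR: DTPQA is a visual question answering benchmark designed to evaluate the perception capabilities of vision-language models (VLMs) in automated driving. It has a synthetic part and a real-world part, and includes distance annotations for analyzing model performance at different distances.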

Details

Motivation: To apply VLMs reliably in the safety-critical domain of automated driving, their perception capabilities in complex traffic scenes must be evaluated, especially recognition of critical objects at long range (30+ meters). Method: The DTPQA benchmark comprises DTP-Synthetic, a synthetic dataset generated with a simulator, and DTP-Real, built from real traffic images. Each sample contains an image, a question, an answer, and the object's distance, so that VLM perception can be evaluated in isolation. Result: DTPQA offers a quantifiable evaluation, supports analysis of how model performance degrades as object distance increases, and releases the dataset and generation scripts to support follow-up research. Conclusion: DTPQA is an effective and extensible tool for evaluating VLM perception in automated driving scenarios, with particular emphasis on how distance affects perception performance. Abstract: The remarkable progress of Vision-Language Models (VLMs) on a variety of tasks has raised interest in their application to automated driving. However, for these models to be trusted in such a safety-critical domain, they must first possess robust perception capabilities, i.e., they must be capable of understanding a traffic scene, which can often be highly complex, with many things happening simultaneously. Moreover, since critical objects and agents in traffic scenes are often at long distances, we require systems with not only strong perception capabilities at close distances (up to 20 meters), but also at long (30+ meters) range. Therefore, it is important to evaluate the perception capabilities of these models in isolation from other skills like reasoning or advanced world knowledge. Distance-Annotated Traffic Perception Question Answering (DTPQA) is a Visual Question Answering (VQA) benchmark designed specifically for this purpose: it can be used to evaluate the perception systems of VLMs in traffic scenarios using trivial yet crucial questions relevant to driving decisions. It consists of two parts: a synthetic benchmark (DTP-Synthetic) created using a simulator, and a real-world benchmark (DTP-Real) built on top of existing images of real traffic scenes. Additionally, DTPQA includes distance annotations, i.e., how far the object in question is from the camera. More specifically, each DTPQA sample consists of (at least): (a) an image, (b) a question, (c) the ground truth answer, and (d) the distance of the object in question, enabling analysis of how VLM performance degrades with increasing object distance. In this article, we provide the dataset itself along with the Python scripts used to create it, which can be used to generate additional data of the same kind.
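
The distance-annotated sample structure described above lends itself to a per-distance accuracy breakdown. The following is a minimal sketch of that analysis, not the authors' released scripts: the `Sample` field names, the exact-match scoring, and the 10-meter bin width are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Sample:
    # Field names are illustrative, not the dataset's actual schema.
    image_path: str
    question: str
    answer: str
    distance_m: float


def accuracy_by_distance(samples, predictions, bin_width_m=10.0):
    """Return {bin_start_in_meters: accuracy} over fixed-width distance bins."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for sample, predicted in zip(samples, predictions):
        # Assign each sample to the bin that contains its object distance.
        bin_start = (sample.distance_m // bin_width_m) * bin_width_m
        totals[bin_start] += 1
        # Assumed scoring rule: case-insensitive exact match.
        if predicted.strip().lower() == sample.answer.strip().lower():
            hits[bin_start] += 1
    return {start: hits[start] / totals[start] for start in sorted(totals)}
```

Plotting the returned dictionary against the bin starts would show how accuracy changes as objects move farther from the camera.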

[346] TripleFDS: Triple Feature Disentanglement and Synthesis for Scene Text Editing

Yuchen Bao,Yiting Wang,Wenjian Huang,Haowei Wang,Shen Chen,Taiping Yao,Shouhong Ding,Jianguo Zhang

Main category: cs.CV

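TL;DR: Proposes the TripleFDS framework and the SCB Synthesis dataset, which disentangle three features (text style, content, and background) to improve the controllability and visual consistency of scene text editing.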

Details

Motivation: Existing scene text editing methods do not fully disentangle the editable attributes (for example, they handle only text content), which limits controllability and visual consistency. Method: TripleFDS uses the SCB Group as its training unit and disentangles the three features through inter-group contrastive regularization and intra-sample multi-feature orthogonality. In the synthesis stage, feature remapping prevents "shortcut" behavior during reconstruction and mitigates feature leakage. Result: Trained on 125,000 SCB Groups, TripleFDS reaches state-of-the-art image fidelity (SSIM 44.54) and text accuracy (ACC 93.58%) on mainstream STE benchmarks, and supports new editing operations such as style replacement and background transfer. Conclusion: Triple feature disentanglement makes scene text editing more flexible and improves visual quality, offering an effective solution for complex text editing tasks. Abstract: Scene Text Editing (STE) aims to naturally modify text in images while preserving visual consistency, the decisive factors of which can be divided into three parts, i.e., text style, text content, and background. Previous methods have struggled with incomplete disentanglement of editable attributes, typically addressing only one aspect - such as editing text content - thus limiting controllability and visual consistency. To overcome these limitations, we propose TripleFDS, a novel framework for STE with disentangled modular attributes, and an accompanying dataset called SCB Synthesis. SCB Synthesis provides robust training data for triple feature disentanglement by utilizing the "SCB Group", a novel construct that combines three attributes per image to generate diverse, disentangled training groups. Leveraging this construct as a basic training unit, TripleFDS first disentangles triple features, ensuring semantic accuracy through inter-group contrastive regularization and reducing redundancy through intra-sample multi-feature orthogonality. In the synthesis phase, TripleFDS performs feature remapping to prevent "shortcut" phenomena during reconstruction and mitigate potential feature leakage. Trained on 125,000 SCB Groups, TripleFDS achieves state-of-the-art image fidelity (SSIM of 44.54) and text accuracy (ACC of 93.58%) on the mainstream STE benchmarks. Besides superior performance, the more flexible editing of TripleFDS supports new operations such as style replacement and background transfer. Code: https://github.com/yusenbao01/TripleFDS

[347] What Color Is It? A Text-Interference Multimodal Hallucination Benchmark

Jinkun Zhao,Lei Huang,Wenjun Wu

Main category: cs.CV

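TL;DR: Proposes the "What Color Is It" dataset to verify visual hallucination in the color perception of multimodal large models, and explores its causes and ways to improve robustness.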

Details

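Motivation: Multimodal large models are susceptible to informational interference in visual perception, particularly color perception, which can cause hallucination and undermine reliability. Method: A new benchmark dataset, "What Color Is It", is built with a simple method that triggers single-modality visual hallucination, and the behavior of multimodal large models on it is analyzed to probe the causes of hallucination. Result: The study confirms that multimodal large models hallucinate in color perception and identifies potential causes. Conclusion: Interference in color perception is an important source of hallucination in multimodal large models, and targeted methods are needed to improve robustness in the visual modality. Abstract: With the rapid advancement of Large Models, numerous text-and-vision-fused Multimodal Large Models (MLMs) have emerged. However, these MLMs remain susceptible to informational interference in visual perception, particularly in color perception, which introduces an additional risk of hallucination. To validate this hypothesis, we introduce the "What Color Is It" dataset, a novel benchmark constructed using a simple method to trigger single-modality visual hallucination in MLMs. Based on this dataset, we further investigate the underlying causes of hallucination in the visual modality of MLMs and propose potential solutions to enhance their robustness.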

[348] Delineate Anything Flow: Fast, Country-Level Field Boundary Detection from Any Source

Mykola Lavreniuk,Nataliia Kussul,Andrii Shelestov,Yevhenii Salii,Volodymyr Kuzin,Sergii Skakun,Zoltan Szantoi

Main category: cs.CV

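TL;DR: Proposes Delineate Anything Flow (DelAnyFlow), a method for accurate large-scale extraction of agricultural field boundaries from satellite imagery that combines a high-accuracy instance segmentation model with a post-processing pipeline and clearly outperforms existing methods.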

Details

Motivation: Existing field boundary extraction methods produce incomplete boundaries, merge adjacent fields, and scale poorly, and they perform especially badly in smallholder and fragmented farming systems. A scalable, high-accuracy solution is needed. Method: DelAnyFlow uses the DelAny instance segmentation model, built on a YOLOv11 backbone and trained on the large-scale FBIS 22M dataset (about 672,000 image patches and 22.9 million field instances), together with structured post-processing, merging, and vectorization to produce topologically consistent vector boundaries. The method accepts multi-resolution input and is resolution-agnostic. Result: DelAny achieves over 100% higher mAP and 400x faster inference than SAM2. Using Sentinel-2 data, a single workstation completed field boundary extraction for all of Ukraine (603,000 km²) in under six hours, identifying 3.75 million fields at 5 m and 5.15 million at 2.5 m resolution, well above the results from Sinergise and NASA Harvest. Boundary completeness improves markedly, especially for small fields. Conclusion: DelAnyFlow is a scalable, low-cost approach to field boundary mapping for regions that lack digital cadastral data, supporting large-area agricultural monitoring and land management. Abstract: Accurate delineation of agricultural field boundaries from satellite imagery is essential for land management and crop monitoring, yet existing methods often produce incomplete boundaries, merge adjacent fields, and struggle to scale. We present the Delineate Anything Flow (DelAnyFlow) methodology, a resolution-agnostic approach for large-scale field boundary mapping. DelAnyFlow combines the DelAny instance segmentation model, based on a YOLOv11 backbone and trained on the large-scale Field Boundary Instance Segmentation-22M (FBIS 22M) dataset, with a structured post-processing, merging, and vectorization sequence to generate topologically consistent vector boundaries. FBIS 22M, the largest dataset of its kind, contains 672,909 multi-resolution image patches (0.25-10 m) and 22.9 million validated field instances. The DelAny model delivers state-of-the-art accuracy with over 100% higher mAP and 400x faster inference than SAM2. DelAny demonstrates strong zero-shot generalization and supports national-scale applications: using Sentinel-2 data for 2024, DelAnyFlow generated a complete field boundary layer for Ukraine (603,000 km²) in under six hours on a single workstation. DelAnyFlow outputs significantly improve boundary completeness relative to operational products from Sinergise Solutions and NASA Harvest, particularly in smallholder and fragmented systems (0.25-1 ha). For Ukraine, DelAnyFlow delineated 3.75M fields at 5 m and 5.15M at 2.5 m, compared to 2.66M detected by Sinergise Solutions and 1.69M by NASA Harvest. This work delivers a scalable, cost-effective methodology for field delineation in regions lacking digital cadastral data. A project landing page with links to model weights, code, national-scale vector outputs, and dataset is available at https://lavreniuk.github.io/Delineate-Anything/.

[349] VOPE: Revisiting Hallucination of Vision-Language Models in Voluntary Imagination Task

Xingming Long,Jie Zhang,Shiguang Shan,Xilin Chen

Main category: cs.CV

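TL;DR: This paper proposes VOPE, a new method for evaluating hallucination in large vision-language models (LVLMs) on voluntary imagination tasks, and finds that existing models and mitigation methods perform poorly on such tasks.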

Details

Motivation: Existing research mostly studies LVLM hallucination in factual description tasks. For voluntary imagination tasks, where creative generation is allowed, there is no effective way to evaluate hallucination fairly. Method: VOPE poses recheck-based questions to assess how an LVLM judges the presence of the objects it generated, and uses the consistency of that judgment with the image to decide whether the model hallucinated. Result: Experiments show that most LVLMs hallucinate heavily in voluntary imagination tasks and perform poorly in presence evaluation, and that existing mitigation methods have limited effect. Conclusion: Hallucination evaluation in voluntary imagination tasks needs a new paradigm. VOPE is an effective tool in this direction and shows that current models and methods need substantial improvement. Abstract: Most research on hallucinations in Large Vision-Language Models (LVLMs) focuses on factual description tasks that prohibit any output absent from the image. However, little attention has been paid to hallucinations in voluntary imagination tasks, e.g., story writing, where the models are expected to generate novel content beyond the given image. In these tasks, it is inappropriate to simply regard such imagined novel content as hallucinations. To address this limitation, we introduce Voluntary-imagined Object Presence Evaluation (VOPE), a novel method to assess LVLMs' hallucinations in voluntary imagination tasks via presence evaluation. Specifically, VOPE poses recheck-based questions to evaluate how an LVLM interprets the presence of the imagined objects in its own response. The consistency between the model's interpretation and the object's presence in the image is then used to determine whether the model hallucinates when generating the response. We apply VOPE to several mainstream LVLMs and hallucination mitigation methods, revealing two key findings: (1) most LVLMs hallucinate heavily during voluntary imagination, and their performance in presence evaluation is notably poor on imagined objects; (2) existing hallucination mitigation methods show limited effect in voluntary imagination tasks, making this an important direction for future research.

[350] FUSE: A Flow-based Mapping Between Shapes

Lorenzo Olearo,Giulio Viganò,Daniele Baieri,Filippo Maggioli,Simone Melzi

Main category: cs.CV

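TL;DR: Proposes a new neural representation, based on flow-matching models, for maps between 3D shapes. It is computationally efficient, supports matching across representations, and needs no large-scale training.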

Details

Motivation: Existing methods for matching 3D shapes across representations (point clouds, meshes, SDFs, and so on) usually require extensive training or representation-specific processing pipelines, and lack generality and efficiency. Method: A 3D shape is represented as the probability distribution obtained from a fixed anchor distribution through a continuous, invertible flow. Composing the inverse flow (source to anchor) with the forward flow (anchor to target) gives a continuous map between points on the two shapes, and the shapes are encoded with a pointwise task-tailored embedding. Result: The method achieves high coverage and matching accuracy across diverse benchmarks and challenging settings, supports cross-modal shape matching, and shows promise for UV mapping and registration of human body point clouds. Conclusion: The flow-matching framework offers a general, invertible, and modality-agnostic representation of maps between 3D shapes, and performs well on matching and on other geometric tasks. Abstract: We introduce a novel neural representation for maps between 3D shapes based on flow-matching models, which is computationally efficient and supports cross-representation shape matching without large-scale training or data-driven procedures. 3D shapes are represented as the probability distribution induced by a continuous and invertible flow mapping from a fixed anchor distribution. Given a source and a target shape, by composing the inverse flow (source to anchor) with the forward flow (anchor to target), we continuously map points between the two surfaces. By encoding the shapes with a pointwise task-tailored embedding, this construction provides an invertible and modality-agnostic representation of maps between shapes across point clouds, meshes, signed distance fields (SDFs), and volumetric data. The resulting representation consistently achieves high coverage and accuracy across diverse benchmarks and challenging settings in shape matching. Beyond shape matching, our framework shows promising results in other tasks, including UV mapping and registration of raw point cloud scans of human bodies.
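
The composition of flows described above can be illustrated with a toy example. This is a minimal sketch under strong assumptions: the paper uses learned flow-matching models, whereas the hypothetical `AffineFlow` class below is a hand-written invertible affine map standing in for such a flow, chosen only so that the source-to-anchor-to-target composition is easy to follow.

```python
from dataclasses import dataclass


@dataclass
class AffineFlow:
    """Toy invertible map between anchor space and one shape (stand-in for a learned flow)."""
    scale: float
    shift: float

    def forward(self, anchor_point):
        # anchor -> shape
        return tuple(self.scale * c + self.shift for c in anchor_point)

    def inverse(self, shape_point):
        # shape -> anchor
        return tuple((c - self.shift) / self.scale for c in shape_point)


def map_point(source_flow, target_flow, source_point):
    """Map a point: source shape -> anchor (inverse flow) -> target shape (forward flow)."""
    return target_flow.forward(source_flow.inverse(source_point))
```

Because each flow is invertible, swapping the roles of the two flows gives the map in the opposite direction, which is what makes the representation itself invertible.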

[351] Unlocking the Forgery Detection Potential of Vanilla MLLMs: A Novel Training-Free Pipeline

Rui Zuo,Qinyue Tong,Zhe-Ming Lu,Ziqian Lu

Main category: cs.CV

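TL;DR: This paper proposes Foresee, a training-free image forgery analysis framework based on multimodal large language models (MLLMs). Using a type-prior-driven strategy and a Flexible Feature Detector (FFD) module, it achieves accurate tamper localization and rich textual explanations without additional training, and generalizes well.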

Details

Motivation: Existing image forgery detection methods fall short in cross-dataset generalization and interpretability, and current MLLM-based methods usually depend on large-scale training, which is computationally expensive and leaves the inherent generalization potential of vanilla MLLMs unexplored. Method: Foresee is a training-free MLLM-based pipeline for image forgery analysis. It adopts a type-prior-driven strategy and a Flexible Feature Detector (FFD) module designed specifically for copy-move manipulations. Result: Experiments show that Foresee achieves better localization accuracy and more comprehensive textual explanations than existing methods across tampering types (copy-move, splicing, removal, local enhancement, deepfake, and AIGC-based editing), and generalizes more strongly. Conclusion: Without training, Foresee unlocks the potential of vanilla MLLMs for image forgery analysis, with strong localization accuracy, explanation richness, and cross-domain generalization, offering an efficient and lightweight option for image forensics. Abstract: With the rapid advancement of artificial intelligence-generated content (AIGC) technologies, including multimodal large language models (MLLMs) and diffusion models, image generation and manipulation have become remarkably effortless. Existing image forgery detection and localization (IFDL) methods often struggle to generalize across diverse datasets and offer limited interpretability. Nowadays, MLLMs demonstrate strong generalization potential across diverse vision-language tasks, and some studies introduce this capability to IFDL via large-scale training. However, such approaches cost considerable computational resources, while failing to reveal the inherent generalization potential of vanilla MLLMs to address this problem. Inspired by this observation, we propose Foresee, a training-free MLLM-based pipeline tailored for image forgery analysis. It eliminates the need for additional training and enables a lightweight inference process, while surpassing existing MLLM-based methods in both tamper localization accuracy and the richness of textual explanations. Foresee employs a type-prior-driven strategy and utilizes a Flexible Feature Detector (FFD) module to specifically handle copy-move manipulations, thereby effectively unleashing the potential of vanilla MLLMs in the forensic domain. Extensive experiments demonstrate that our approach simultaneously achieves superior localization accuracy and provides more comprehensive textual explanations. Moreover, Foresee exhibits stronger generalization capability, outperforming existing IFDL methods across various tampering types, including copy-move, splicing, removal, local enhancement, deepfake, and AIGC-based editing. The code will be released in the final version.

[352] Semantic Document Derendering: SVG Reconstruction via Vision-Language Modeling

Adam Hazimeh,Ke Wang,Mark Collier,Gilles Baechler,Efi Kokiopoulou,Pascal Frossard

Main category: cs.CV

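TL;DR: This paper proposes SliDer, a framework that uses vision-language models (VLMs) to semantically derender slide images into editable SVG, addressing the loss of high-level structure that traditional geometric vectorization suffers on complex documents.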

Details

Motivation: Existing raster-to-vector methods struggle to preserve the semantic structure of multimedia documents such as slides, so text and image elements become indistinguishable and editability suffers. A method that recovers high-level semantic structure is needed. Method: SliDer uses a vision-language model to detect and extract text and image elements and their attributes from a raster image and organizes them into structured SVG output, with iterative refinement during inference to improve reconstruction accuracy. The paper also builds Slide2SVG, a paired dataset of real slides, for training and evaluation. Result: SliDer reaches a reconstruction LPIPS of 0.069 and is preferred by human evaluators over the strongest zero-shot VLM baseline in 82.9% of cases. Conclusion: SliDer recovers the semantic structure of slides and produces high-quality, editable SVG representations, advancing semantic document derendering. Abstract: Multimedia documents such as slide presentations and posters are designed to be interactive and easy to modify. Yet, they are often distributed in a static raster format, which limits editing and customization. Restoring their editability requires converting these raster images back into structured vector formats. However, existing geometric raster-vectorization methods, which rely on low-level primitives like curves and polygons, fall short at this task. Specifically, when applied to complex documents like slides, they fail to preserve the high-level structure, resulting in a flat collection of shapes where the semantic distinction between image and text elements is lost. To overcome this limitation, we address the problem of semantic document derendering by introducing SliDer, a novel framework that uses Vision-Language Models (VLMs) to derender slide images as compact and editable Scalable Vector Graphic (SVG) representations. SliDer detects and extracts attributes from individual image and text elements in a raster input and organizes them into a coherent SVG format. Crucially, the model iteratively refines its predictions during inference in a process analogous to human design, generating SVG code that more faithfully reconstructs the original raster upon rendering. Furthermore, we introduce Slide2SVG, a novel dataset comprising raster-SVG pairs of slide documents curated from real-world scientific presentations, to facilitate future research in this domain. Our results demonstrate that SliDer achieves a reconstruction LPIPS of 0.069 and is favored by human evaluators in 82.9% of cases compared to the strongest zero-shot VLM baseline.

[353] InterMoE: Individual-Specific 3D Human Interaction Generation via Dynamic Temporal-Selective MoE

Lipeng Wang,Hongxing Fan,Haohua Chen,Zehuan Huang,Lu Sheng

Main category: cs.CV

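TL;DR: This paper proposes InterMoE, a high-fidelity 3D human interaction generation framework built on a Dynamic Temporal-Selective Mixture of Experts. Its routing mechanism combines text semantics with motion context to preserve individual characteristics and improve semantic consistency.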

Details

Motivation: Existing methods for generating high-quality human interactions struggle to preserve individual uniqueness while staying faithful to the text description. Method: InterMoE uses a Dynamic Temporal-Selective Mixture of Experts in which high-level text semantics and low-level motion context jointly guide the dispatch of temporal features to specialized experts, so that experts focus on the critical features. Result: Experiments on InterHuman and InterX show that InterMoE reduces FID by 9% and 22% respectively, reaching state-of-the-art performance. Conclusion: InterMoE balances preservation of individual characteristics with semantic fidelity and performs strongly on individual-specific, high-fidelity 3D human interaction generation. Abstract: Generating high-quality human interactions holds significant value for applications like virtual reality and robotics. However, existing methods often fail to preserve unique individual characteristics or fully adhere to textual descriptions. To address these challenges, we introduce InterMoE, a novel framework built on a Dynamic Temporal-Selective Mixture of Experts. The core of InterMoE is a routing mechanism that synergistically uses both high-level text semantics and low-level motion context to dispatch temporal motion features to specialized experts. This allows experts to dynamically determine the selection capacity and focus on critical temporal features, thereby preserving specific individual characteristic identities while ensuring high semantic fidelity. Extensive experiments show that InterMoE achieves state-of-the-art performance in individual-specific high-fidelity 3D human interaction generation, reducing FID scores by 9% on the InterHuman dataset and 22% on InterX.

[354] Language-Guided Invariance Probing of Vision-Language Models

Jae Joong Lee

Main category: cs.CV

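TL;DR: Proposes the Language-Guided Invariance Probing (LGIP) benchmark, which evaluates how robust vision-language models are to linguistic perturbations in image-text matching, and finds that some models score well on standard metrics yet respond unreliably to semantic changes.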

Details

Motivation: Current vision-language models perform well on zero-shot tasks, but how reliably they respond to linguistic perturbations is unclear, and there is no effective way to evaluate their linguistic robustness. Method: The benchmark uses 40k MS COCO images with five human captions each, and automatically generates meaning-preserving paraphrases and semantic flips that change object category, color, or count. Model behavior is summarized with an invariance error, a semantic sensitivity gap, and a positive-rate statistic. Result: EVA02-CLIP and large OpenCLIP variants strike a favorable balance between invariance and sensitivity, while SigLIP and SigLIP2 show large invariance error and often prefer the flipped captions. These failures are hard to see with conventional retrieval metrics. Conclusion: LGIP is a model-agnostic diagnostic that exposes latent weaknesses in the linguistic robustness of vision-language models beyond conventional accuracy scores. Abstract: Recent vision-language models (VLMs) such as CLIP, OpenCLIP, EVA02-CLIP and SigLIP achieve strong zero-shot performance, but it is unclear how reliably they respond to controlled linguistic perturbations. We introduce Language-Guided Invariance Probing (LGIP), a benchmark that measures (i) invariance to meaning-preserving paraphrases and (ii) sensitivity to meaning-changing semantic flips in image-text matching. Using 40k MS COCO images with five human captions each, we automatically generate paraphrases and rule-based flips that alter object category, color or count, and summarize model behavior with an invariance error, a semantic sensitivity gap and a positive-rate statistic. Across nine VLMs, EVA02-CLIP and large OpenCLIP variants lie on a favorable invariance-sensitivity frontier, combining low paraphrase-induced variance with consistently higher scores for original captions than for their flipped counterparts. In contrast, SigLIP and SigLIP2 show much larger invariance error and often prefer flipped captions to the human descriptions, especially for object and color edits. These failures are largely invisible to standard retrieval metrics, indicating that LGIP provides a model-agnostic diagnostic for the linguistic robustness of VLMs beyond conventional accuracy scores.
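
The three summary statistics named above are not defined in the abstract, so the sketch below shows only one plausible reading and should not be taken as the benchmark's formulas. It assumes the invariance error is the mean absolute score change under paraphrase, the sensitivity gap is the mean score margin of the original caption over its flipped version, and the positive rate is the fraction of pairs where the original caption scores higher. The function name `lgip_style_summary` is hypothetical.

```python
from statistics import mean


def lgip_style_summary(original_scores, paraphrase_scores, flipped_scores):
    """Summarize image-text scores for aligned (original, paraphrase, flip) captions.

    All three definitions are assumptions for illustration, not the paper's.
    """
    # Assumed invariance error: how much the score moves under a paraphrase.
    invariance_error = mean(
        abs(o - p) for o, p in zip(original_scores, paraphrase_scores)
    )
    # Assumed sensitivity gap: average margin of the original over the flip.
    sensitivity_gap = mean(
        o - f for o, f in zip(original_scores, flipped_scores)
    )
    # Assumed positive rate: how often the original beats the flip.
    positive_rate = mean(
        1.0 if o > f else 0.0 for o, f in zip(original_scores, flipped_scores)
    )
    return invariance_error, sensitivity_gap, positive_rate
```

Under this reading, a robust model would show a small invariance error together with a large positive gap and a positive rate close to 1.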

[355] Mapping the Vanishing and Transformation of Urban Villages in China

Wenyu Zhang,Yao Tong,Yiqiu Liu,Rui Cao

Main category: cs.CV

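TL;DR: This study proposes a deep learning-based framework for monitoring spatiotemporal change in China's urban villages, assessing land use after demolition, and revealing the complexity and nonlinearity of the redevelopment process.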

Details

Motivation: Current urban village demolition and redevelopment practice lacks a systematic evaluation of whether the land is effectively reused, so a rigorous method is needed to assess the outcomes and sustainability of redevelopment. Method: Semantic segmentation of multi-temporal remote sensing imagery extracts changes in urban village boundaries, and post-demolition land use is classified into six categories (such as vacant land, buildings, and green spaces) to analyze the spatiotemporal evolution of redevelopment in four representative cities. Result: Redevelopment is often prolonged and occurs mainly in peripheral urban areas, and three spatiotemporal transformation pathways are identified: synchronized redevelopment, delayed redevelopment, and gradual optimization. Conclusion: Urban village redevelopment is fragmented, complex, and nonlinear, and calls for tiered, context-sensitive planning strategies that support more inclusive, efficient, and sustainable urban renewal. Abstract: Urban villages (UVs), informal settlements embedded within China's urban fabric, have undergone widespread demolition and redevelopment in recent decades. However, there remains a lack of systematic evaluation of whether the demolished land has been effectively reused, raising concerns about the efficacy and sustainability of current redevelopment practices. To address the gap, this study proposes a deep learning-based framework to monitor the spatiotemporal changes of UVs in China. Specifically, semantic segmentation of multi-temporal remote sensing imagery is first used to map evolving UV boundaries, and then post-demolition land use is classified into six categories based on the "remained-demolished-redeveloped" phase: incomplete demolition, vacant land, construction sites, buildings, green spaces, and others. Four representative cities from China's four economic regions were selected as the study areas, i.e., Guangzhou (East), Zhengzhou (Central), Xi'an (West), and Harbin (Northeast). The results indicate: 1) UV redevelopment processes were frequently prolonged; 2) redevelopment transitions primarily occurred in peripheral areas, whereas urban cores remained relatively stable; and 3) three spatiotemporal transformation pathways, i.e., synchronized redevelopment, delayed redevelopment, and gradual optimization, were revealed. This study highlights the fragmented, complex and nonlinear nature of UV redevelopment, underscoring the need for tiered and context-sensitive planning strategies. By linking spatial dynamics with the context of redevelopment policies, the findings offer valuable empirical insights that support more inclusive, efficient, and sustainable urban renewal, while also contributing to a broader global understanding of informal settlement transformations.

[356] Minimax Multi-Target Conformal Prediction with Applications to Imaging Inverse Problems

Jeffrey Wen,Rizwan Ahmad,Philip Schniter

Main category: cs.CV

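TL;DR: Proposes an asymptotically minimax approach to multi-target conformal prediction that provides tight prediction intervals while guaranteeing joint marginal coverage, and demonstrates its advantages on synthetic and MRI data.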

Details

Motivation: Uncertainty quantification is a fundamental challenge in ill-posed imaging inverse problems, especially in safety-critical applications. Existing work handles only a scalar estimation target, whereas practical applications often involve several targets. Method: An asymptotically minimax approach to multi-target conformal prediction ensures joint marginal coverage with tight prediction intervals, and is applied to multi-metric blind image quality assessment, multi-task uncertainty quantification, and multi-round measurement acquisition. Result: Numerical experiments on synthetic data and magnetic resonance imaging (MRI) data show that the proposed method performs better than existing methods. Conclusion: The method is an effective solution for multi-target uncertainty quantification and suits a range of practical applications. Abstract: In ill-posed imaging inverse problems, uncertainty quantification remains a fundamental challenge, especially in safety-critical applications. Recently, conformal prediction has been used to quantify the uncertainty that the inverse problem contributes to downstream tasks like image classification, image quality assessment, fat mass quantification, etc. While existing works handle only a scalar estimation target, practical applications often involve multiple targets. In response, we propose an asymptotically minimax approach to multi-target conformal prediction that provides tight prediction intervals while ensuring joint marginal coverage. We then outline how our minimax approach can be applied to multi-metric blind image quality assessment, multi-task uncertainty quantification, and multi-round measurement acquisition. Finally, we numerically demonstrate the benefits of our minimax method, relative to existing multi-target conformal prediction methods, using both synthetic and magnetic resonance imaging (MRI) data.
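
To make "joint marginal coverage" concrete, the sketch below shows a standard split-conformal baseline for several targets, which is explicitly not the paper's minimax method. It assumes a held-out calibration set of absolute residuals per target and user-chosen positive per-target scales (both hypothetical inputs). The nonconformity score is the largest scaled residual across targets, so a single calibrated quantile bounds all targets at once.

```python
import math


def joint_conformal_radius(cal_residuals, scales, alpha):
    """Calibrate one radius q so that all targets are covered jointly.

    cal_residuals: list of per-sample lists of |y_k - yhat_k| (one entry per target).
    scales: positive per-target scales (an assumed, user-supplied normalization).
    """
    # Score of a sample = its worst scaled residual over the targets.
    scores = sorted(
        max(r / s for r, s in zip(residuals, scales)) for residuals in cal_residuals
    )
    n = len(scores)
    # Standard split-conformal rank with the finite-sample (n + 1) correction.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration samples for this alpha
    return scores[rank - 1]


def joint_intervals(prediction, scales, q):
    """Per-target intervals that, under exchangeability, hold jointly with prob. >= 1 - alpha."""
    return [(p - q * s, p + q * s) for p, s in zip(prediction, scales)]
```

The choice of scales controls how interval width is shared among targets; how to choose that allocation well is the question the paper's minimax formulation addresses, and this baseline leaves it to the user.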

[357] Accuracy is Not Enough: Poisoning Interpretability in Federated Learning via Color Skew

Farhin Farhad Riya,Shahinul Hoque,Jinyuan Stella Sun,Olivera Kotevska

Main category: cs.CV

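TL;DR: This paper proposes the Chromatic Perturbation Module, a new attack framework that uses small color perturbations in federated learning to corrupt a model's saliency-map explanations without affecting its accuracy.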

Details

Motivation: As machine learning models spread through safety-critical domains, interpretability has become essential. Yet little research examines the security of the explanations themselves, and this paper shows that model interpretability can be a new attack surface. Method: The Chromatic Perturbation Module (CPM) generates adversarial examples by adjusting the color contrast between foreground and background. The perturbations accumulate over federated learning rounds and weaken the semantic consistency of saliency maps without changing predictions. Result: The attack is validated on multiple datasets: peak activation overlap in Grad-CAM explanations drops by up to 35% while classification accuracy stays above 96%, and standard training pipelines struggle to detect this explanation degradation. Conclusion: The results challenge the assumption that correct predictions imply trustworthy explanations. Interpretability itself can be maliciously manipulated, and the attack is especially stealthy and persistent in federated learning, so it deserves attention. Abstract: As machine learning models are increasingly deployed in safety-critical domains, visual explanation techniques have become essential tools for supporting transparency. In this work, we reveal a new class of attacks that compromise model interpretability without affecting accuracy. Specifically, we show that small color perturbations applied by adversarial clients in a federated learning setting can shift a model's saliency maps away from semantically meaningful regions while keeping the prediction unchanged. The proposed saliency-aware attack framework, called Chromatic Perturbation Module, systematically crafts adversarial examples by altering the color contrast between foreground and background in a way that disrupts explanation fidelity. These perturbations accumulate across training rounds, poisoning the global model's internal feature attributions in a stealthy and persistent manner. Our findings challenge a common assumption in model auditing that correct predictions imply faithful explanations and demonstrate that interpretability itself can be an attack surface. We evaluate this vulnerability across multiple datasets and show that standard training pipelines are insufficient to detect or mitigate explanation degradation, especially in the federated learning setting, where subtle color perturbations are harder to discern. Our attack reduces peak activation overlap in Grad-CAM explanations by up to 35% while preserving classification accuracy above 96% on all evaluated datasets.

[358] BootOOD: Self-Supervised Out-of-Distribution Detection via Synthetic Sample Exposure under Neural Collapse

Yuanchao Wang,Tian Qin,Eduardo Valle,Bruno Abrahao

Main category: cs.CV

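TL;DR: Proposes BootOOD, a fully self-supervised OOD detection framework trained only on ID data, which uses radius-based classification on feature norms to detect OOD samples that are semantically close to the ID classes.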

Details

Motivation: Existing OOD detectors perform poorly on OOD samples that are semantically similar to ID classes. A method is needed that identifies such challenging samples without external outlier data. Method: Pseudo-OOD features are generated by simple transformations of ID representations. Building on the Neural Collapse phenomenon, a lightweight auxiliary head performs radius-based classification on feature norms, so that OOD samples are learned to have smaller feature norms than ID samples. Result: Experiments on CIFAR-10, CIFAR-100, and ImageNet-200 show that BootOOD outperforms prior post-hoc methods, surpasses training-based methods that do not use outlier exposure, and is competitive with state-of-the-art outlier-exposure methods, while maintaining or improving ID classification accuracy. Conclusion: BootOOD is an effective self-supervised OOD detection method that is particularly suited to OOD samples semantically close to the ID data, needs no external outlier data, and performs at a leading level on several benchmarks. Abstract: Out-of-distribution (OOD) detection is critical for deploying image classifiers in safety-sensitive environments, yet existing detectors often struggle when OOD samples are semantically similar to the in-distribution (ID) classes. We present BootOOD, a fully self-supervised OOD detection framework that bootstraps exclusively from ID data and is explicitly designed to handle semantically challenging OOD samples. BootOOD synthesizes pseudo-OOD features through simple transformations of ID representations and leverages Neural Collapse (NC), where ID features cluster tightly around class means with consistent feature norms. Unlike prior approaches that aim to constrain OOD features into subspaces orthogonal to the collapsed ID means, BootOOD introduces a lightweight auxiliary head that performs radius-based classification on feature norms. This design decouples OOD detection from the primary classifier and imposes a relaxed requirement: OOD samples are learned to have smaller feature norms than ID features, which is easier to satisfy when ID and OOD are semantically close. Experiments on CIFAR-10, CIFAR-100, and ImageNet-200 show that BootOOD outperforms prior post-hoc methods, surpasses training-based methods without outlier exposure, and is competitive with state-of-the-art outlier-exposure approaches while maintaining or improving ID accuracy.
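
The idea of separating ID from OOD samples by feature norm can be shown with a deliberately simplified sketch. This is not BootOOD's trained auxiliary head: it replaces the learned radius-based classifier with a fixed threshold, and the pseudo-OOD transformation (`shrink`, which scales an ID feature toward the origin) is a hypothetical stand-in, since the abstract does not specify the transformations used.

```python
import math


def feature_norm(feature):
    """Euclidean norm of a feature vector."""
    return math.sqrt(sum(x * x for x in feature))


def shrink(feature, factor=0.2):
    # Hypothetical pseudo-OOD transformation: scale an ID feature toward the origin.
    return [factor * x for x in feature]


def fit_radius(id_features, pseudo_ood_features):
    """Pick a radius halfway between the mean ID norm and the mean pseudo-OOD norm."""
    id_mean = sum(map(feature_norm, id_features)) / len(id_features)
    ood_mean = sum(map(feature_norm, pseudo_ood_features)) / len(pseudo_ood_features)
    return 0.5 * (id_mean + ood_mean)


def is_ood(feature, radius):
    """Flag a sample as OOD when its feature norm falls inside the radius."""
    return feature_norm(feature) < radius
```

In the paper's setting the features are shaped during training so that this norm gap emerges; the fixed midpoint threshold here only illustrates the decision rule.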

[359] Robust Defense Strategies for Multimodal Contrastive Learning: Efficient Fine-tuning Against Backdoor Attacks

Md. Iqbal Hossain,Afia Sajeeda,Neeresh Kumar Perla,Ming Shao

Main category: cs.CV

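TL;DR: Proposes a new backdoor defense for multimodal contrastive learning models such as CLIP. An image segmentation "oracle" is used to identify the trigger and pinpoint the victim samples and labels, so that the poisoned model can be repaired efficiently.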

Details

Motivation: Existing defenses usually require training from scratch or large-scale fine-tuning and cannot pinpoint the attacked labels, while multimodal models such as CLIP are vulnerable to backdoor attacks. A more precise and efficient repair mechanism is needed. Method: An image segmentation oracle serves as a supervisory signal, and two algorithms are designed: (1) using the knowledge gap between CLIP and the oracle to identify potential triggers; (2) locating the affected labels and samples and building a compact fine-tuning dataset to repair the model. Result: Experiments on visual recognition benchmarks show that the method identifies triggers, locates victim samples and labels, and substantially weakens backdoor attacks, improving model robustness. Conclusion: The method is an efficient and precise backdoor defense for CLIP-style multimodal models that repairs the model without large-scale training and improves security in real deployments. Abstract: The advent of multimodal deep learning models, such as CLIP, has unlocked new frontiers in a wide range of applications, from image-text understanding to classification tasks. However, these models are not immune to adversarial attacks, particularly backdoor attacks, which can subtly manipulate model behavior. Moreover, existing defense methods typically involve training from scratch or fine-tuning using a large dataset without pinpointing the specific labels that are affected. In this study, we introduce an innovative strategy to enhance the robustness of multimodal contrastive learning models against such attacks. In particular, given a poisoned CLIP model, our approach can identify the backdoor trigger and pinpoint the victim samples and labels in an efficient manner. To that end, an image segmentation "oracle" is introduced as the supervisor for the output of the poisoned CLIP. We develop two algorithms to rectify the poisoned model: (1) differentiating between CLIP and Oracle's knowledge to identify potential triggers; (2) pinpointing affected labels and victim samples, and curating a compact fine-tuning dataset. With this knowledge, we can rectify the poisoned CLIP model to negate backdoor effects. Extensive experiments on visual recognition benchmarks demonstrate our strategy is effective in CLIP-based backdoor defense.

[360] TSE-Net: Semi-supervised Monocular Height Estimation from Single Remote Sensing Images

Sining Chen,Xiao Xiang Zhu

Main category: cs.CV

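TL;DR: Proposes TSE-Net, a self-training framework for semi-supervised monocular height estimation. Its teacher-student-exam network structure and pseudo-label filtering mechanism substantially improve performance with little labeled data.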

Details

Motivation: Monocular height estimation is limited by the scarcity of high-quality labeled data, which makes it hard for existing deep learning methods to generalize well. Method: TSE-Net comprises teacher, student, and exam networks. The teacher produces pseudo-labels from a joint regression and classification prediction, the student is trained on those pseudo-labels, and the exam network acts as a temporal ensemble of the student to stabilize performance. Height classes are defined with a hierarchical bi-cut strategy, and class probabilities are calibrated with a Plackett-Luce model to filter for reliable pseudo-labels. Result: The method is validated on three datasets with different resolutions and imaging modalities, and achieves better height estimation accuracy than existing semi-supervised methods, particularly at low labeling rates. Conclusion: Through effective pseudo-label generation and filtering, TSE-Net makes full use of unlabeled data and substantially improves the performance and generalization of monocular height estimation, giving it strong practical value. Abstract: Monocular height estimation plays a critical role in 3D perception for remote sensing, offering a cost-effective alternative to multi-view or LiDAR-based methods. While deep learning has significantly advanced the capabilities of monocular height estimation, these methods remain fundamentally limited by the availability of labeled data, which are expensive and labor-intensive to obtain at scale. The scarcity of high-quality annotations hinders the generalization and performance of existing models. To overcome this limitation, we propose leveraging large volumes of unlabeled data through a semi-supervised learning framework, enabling the model to extract informative cues from unlabeled samples and improve its predictive performance. In this work, we introduce TSE-Net, a self-training pipeline for semi-supervised monocular height estimation. The pipeline integrates teacher, student, and exam networks. The student network is trained on unlabeled data using pseudo-labels generated by the teacher network, while the exam network functions as a temporal ensemble of the student network to stabilize performance. The teacher network is formulated as a joint regression and classification model: the regression branch predicts height values that serve as pseudo-labels, and the classification branch predicts height value classes along with class probabilities, which are used to filter pseudo-labels. Height value classes are defined using a hierarchical bi-cut strategy to address the inherent long-tailed distribution of heights, and the predicted class probabilities are calibrated with a Plackett-Luce model to reflect the expected accuracy of pseudo-labels. We evaluate the proposed pipeline on three datasets spanning different resolutions and imaging modalities. Codes are available at https://github.com/zhu-xlab/tse-net.
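
Two mechanics from the pipeline above can be sketched in a few lines: filtering pseudo-labels by class probability, and maintaining a temporal ensemble of the student. This is an illustrative sketch rather than the released code. It assumes the temporal ensemble is an exponential moving average (EMA) of student weights, which is a common choice but is not stated in the abstract, and it uses a plain probability threshold in place of the Plackett-Luce calibration. The function names and the `threshold` and `decay` parameters are hypothetical.

```python
def filter_pseudo_labels(heights, class_probs, threshold=0.6):
    """Keep (index, height) pairs whose class probability clears the threshold."""
    return [
        (index, height)
        for index, (height, prob) in enumerate(zip(heights, class_probs))
        if prob >= threshold
    ]


def ema_update(exam_weights, student_weights, decay=0.99):
    """Move the exam network's weights toward the student's (assumed EMA ensemble)."""
    return [
        decay * exam + (1.0 - decay) * student
        for exam, student in zip(exam_weights, student_weights)
    ]
```

In a training loop, the student would be updated on the filtered pseudo-labels at each step, and `ema_update` would then be applied to produce the more stable exam weights.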

[361] Opt3DGS: Optimizing 3D Gaussian Splatting with Adaptive Exploration and Curvature-Aware Exploitation

Ziyang Huang,Jiagang Chen,Jin Liu,Shunping Ji

Main category: cs.CV

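TL;DR: Proposes Opt3DGS, a two-stage optimization framework that improves 3D Gaussian Splatting rendering quality through adaptive exploration and curvature-guided exploitation.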

Details

Motivation: 3D Gaussian Splatting (3DGS) performs well in novel view synthesis, but its core optimization problems, such as entrapment in local optima and insufficient convergence quality, remain underexplored. Method: Opt3DGS has two stages: (1) an adaptive weighted Stochastic Gradient Langevin Dynamics (SGLD) method that strengthens global search; (2) an Adam optimizer guided by a local quasi-Newton direction that uses curvature information for precise convergence. Result: Experiments on multiple benchmark datasets show that Opt3DGS achieves state-of-the-art rendering quality without changing the 3DGS representation. Conclusion: By improving the optimization process, Opt3DGS substantially improves 3DGS performance and offers an effective optimization paradigm for follow-up work. Abstract: 3D Gaussian Splatting (3DGS) has emerged as a leading framework for novel view synthesis, yet its core optimization challenges remain underexplored. We identify two key issues in 3DGS optimization: entrapment in suboptimal local optima and insufficient convergence quality. To address these, we propose Opt3DGS, a robust framework that enhances 3DGS through a two-stage optimization process of adaptive exploration and curvature-guided exploitation. In the exploration phase, an Adaptive Weighted Stochastic Gradient Langevin Dynamics (SGLD) method enhances global search to escape local optima. In the exploitation phase, a Local Quasi-Newton Direction-guided Adam optimizer leverages curvature information for precise and efficient convergence. Extensive experiments on diverse benchmark datasets demonstrate that Opt3DGS achieves state-of-the-art rendering quality by refining the 3DGS optimization process without modifying its underlying representation.
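
For readers unfamiliar with SGLD, the exploration phase builds on an update that adds Gaussian noise to a gradient step. The sketch below shows the plain SGLD update only. The paper's adaptive weighting is not specified in the abstract, so `noise_weight` is a hypothetical stand-in for it, not the authors' scheme.

```python
import math
import random


def sgld_step(theta, grad, step_size, noise_weight=1.0, rng=random):
    """One plain SGLD update on a list of parameters.

    theta <- theta - (step_size / 2) * grad + N(0, step_size)
    noise_weight is a hypothetical knob that scales the noise standard
    deviation; setting it to 0 recovers plain gradient descent.
    """
    sigma = noise_weight * math.sqrt(step_size)
    return [t - 0.5 * step_size * g + rng.gauss(0.0, sigma)
            for t, g in zip(theta, grad)]


if __name__ == "__main__":
    rng = random.Random(0)
    print(sgld_step([1.0, -2.0], [2.0, -4.0], step_size=0.1, rng=rng))
```

The noise term is what lets the iterate leave a poor basin; shrinking it over time hands control back to the deterministic gradient step.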

[362] Hierarchical Prompt Learning for Image- and Text-Based Person Re-Identification

Linhan Zhou,Shuang Li,Neng Dong,Yonghang Tai,Yafei Zhang,Huafeng Li

Main category: cs.CV

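TL;DR: Proposes Hierarchical Prompt Learning (HPL), a unified framework that jointly optimizes image-to-image and text-to-image person re-identification through task-aware prompt modeling.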

Details

Motivation: 现有方法通常将图像检索和文本检索任务分开处理，导致表征纠缠和性能次优，因此需要一种能够同时有效处理两种任务的统一框架。 Method: 引入任务路由Transformer，在共享视觉编码器中使用双分类标记分别路由I2I和T2I分支特征；设计分层提示生成机制，结合身份级可学习标记和实例级伪文本标记；利用模态特定的逆网络生成伪标记以注入细粒度语义；提出跨模态提示正则化策略，增强提示标记空间中的语义对齐。 Result: 在多个行人重识别基准上的实验表明，该方法在I2I和T2I任务上均达到最先进的性能。 Conclusion: HPL通过任务感知的分层提示学习，有效实现了图像和文本模态下的统一行人检索，提升了跨模态对齐与判别性表征学习。 Abstract: Person re-identification (ReID) aims to retrieve target pedestrian images given either visual queries (image-to-image, I2I) or textual descriptions (text-to-image, T2I). Although both tasks share a common retrieval objective, they pose distinct challenges: I2I emphasizes discriminative identity learning, while T2I requires accurate cross-modal semantic alignment. Existing methods often treat these tasks separately, which may lead to representation entanglement and suboptimal performance. To address this, we propose a unified framework named Hierarchical Prompt Learning (HPL), which leverages task-aware prompt modeling to jointly optimize both tasks. Specifically, we first introduce a Task-Routed Transformer, which incorporates dual classification tokens into a shared visual encoder to route features for I2I and T2I branches respectively. On top of this, we develop a hierarchical prompt generation scheme that integrates identity-level learnable tokens with instance-level pseudo-text tokens. These pseudo-tokens are derived from image or text features via modality-specific inversion networks, injecting fine-grained, instance-specific semantics into the prompts. Furthermore, we propose a Cross-Modal Prompt Regularization strategy to enforce semantic alignment in the prompt token space, ensuring that pseudo-prompts preserve source-modality characteristics while enhancing cross-modal transferability. Extensive experiments on multiple ReID benchmarks validate the effectiveness of our method, achieving state-of-the-art performance on both I2I and T2I tasks.

[363] Adaptive Multi-Scale Integration Unlocks Robust Cell Annotation in Histopathology Images

Yinuo Xu,Yan Cui,Mingyao Li,Zhi Huang

Main category: cs.CV

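TL;DR: NuClass is a framework inspired by the pathologist's workflow that performs cell-level multi-scale classification by combining nuclear morphology with microenvironmental context, addressing the lack of tissue context and fine-grained annotations in conventional models.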

Details

Motivation: 现有基于图像块的模型难以整合影响细胞功能和身份的广泛组织背景，且人类标注通常粗糙且分布不均，导致细粒度亚型监督困难。 Method: 提出NuClass框架，包含Path local（224×224像素，关注细胞核形态）和Path global（1024×1024像素，建模周围环境），并通过可学习门控模块自适应平衡局部细节与上下文线索；引入不确定性引导的目标函数，使全局路径优先关注局部路径不确定的区域，并提供置信度估计和Grad-CAM可视化以增强可解释性。 Result: 在三个完全独立的数据集上评估，NuClass在其表现最佳类别上达到高达96%的F1分数，优于强基线模型。 Conclusion: 多尺度、感知不确定性的融合策略能够弥合全切片病理基础模型与可靠细胞级表型预测之间的差距。 Abstract: Identifying cell types and subtypes from routine histopathology images is essential for improving the computational understanding of human disease. Existing tile-based models can capture detailed nuclear morphology but often fail to incorporate the broader tissue context that influences a cell's function and identity. In addition, available human annotations are typically coarse-grained and unevenly distributed across studies, making fine-grained subtype-level supervision difficult to obtain. To address these limitations, we introduce NuClass, a pathologist workflow inspired framework for cell-wise multi-scale integration of nuclear morphology and microenvironmental context. NuClass includes two main components: Path local, which focuses on nuclear morphology from 224-by-224 pixel crops, and Path global, which models the surrounding 1024-by-1024 pixel neighborhood. A learnable gating module adaptively balances local detail and contextual cues. To encourage complementary learning, we incorporate an uncertainty-guided objective that directs the global path to prioritize regions where the local path is uncertain. We also provide calibrated confidence estimates and Grad-CAM visualizations to enhance interpretability. To overcome the lack of high-quality annotations, we construct a marker-guided dataset from Xenium spatial transcriptomics assays, yielding single-cell resolution labels for more than two million cells across eight organs and 16 classes. Evaluated on three fully held-out cohorts, NuClass achieves up to 96 percent F1 for its best-performing class, outperforming strong baselines. 
Our results show that multi-scale, uncertainty-aware fusion can bridge the gap between slide-level pathological foundation models and reliable, cell-level phenotype prediction.
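
The learnable gating module that balances local detail against context can be pictured as a scalar gate over two feature vectors. The sketch below is a generic gated fusion under assumed shapes, not the NuClass implementation. The function and parameter names (`gated_fusion`, `gate_weights`, `gate_bias`) are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(local_feat, global_feat, gate_weights, gate_bias=0.0):
    """Blend local and global features with a scalar gate g in (0, 1).

    g is computed from the concatenated features; the fused vector is
    g * local + (1 - g) * global. All parameters are illustrative.
    """
    concat = list(local_feat) + list(global_feat)
    g = sigmoid(sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias)
    fused = [g * l + (1.0 - g) * c for l, c in zip(local_feat, global_feat)]
    return fused, g


if __name__ == "__main__":
    fused, g = gated_fusion([1.0, 3.0], [3.0, 1.0], [0.0, 0.0, 0.0, 0.0])
    print(fused, g)
```

With zero gate weights the gate sits at 0.5 and the output is the plain average; training would push it toward whichever path is more reliable for a given cell.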

[364] VVS: Accelerating Speculative Decoding for Visual Autoregressive Generation via Partial Verification Skipping

Haotian Dong,Ye Li,Rongwei Lu,Chen Tang,Shu-Tao Xia,Zhi Wang

Main category: cs.CV

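TL;DR: Proposes VVS, a speculative decoding framework that accelerates visual autoregressive generation through partial verification skipping, substantially reducing the number of target-model forward passes while maintaining competitive generation quality.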

Details

Motivation: The "draft one step, then verify one step" paradigm of existing speculative decoding limits its acceleration potential and cannot directly reduce the number of forward passes; the interchangeability of visual tokens motivates verification skipping to further reduce inference latency. Method: The VVS framework has three modules: a verification-free token selector with dynamic truncation, token-level feature caching and reuse, and fine-grained skipped-step scheduling. Together they implement partial verification skipping, and the design is informed by an analysis of the drafting stage aimed at retaining generation quality and speedup. Result: Compared with vanilla autoregressive decoding, VVS reduces the number of target-model forward passes by a factor of 2.8 while maintaining competitive generation quality, and it outperforms conventional speculative decoding frameworks. Conclusion: By skipping verification, VVS improves the inference efficiency of visual autoregressive models and shows potential to reshape the speculative decoding paradigm. Abstract: Visual autoregressive (AR) generation models have demonstrated strong potential for image generation, yet their next-token-prediction paradigm introduces considerable inference latency. Although speculative decoding (SD) has been proven effective for accelerating visual AR models, its "draft one step, then verify one step" paradigm prevents a direct reduction of the forward passes, thus restricting acceleration potential. Motivated by the interchangeability of visual tokens, we are the first to explore verification skipping in the SD process of visual AR model generation to explicitly cut the number of target model forward passes, thereby reducing inference latency. Based on an analysis of the drafting stage's characteristics, we observe that verification redundancy and stale feature reusability are key factors in retaining generation quality and speedup for verification-free steps. Inspired by these two observations, we propose a novel SD framework, VVS, to accelerate visual AR generation via partial verification skipping, which integrates three complementary modules: (1) a verification-free token selector with dynamic truncation, (2) token-level feature caching and reuse, and (3) fine-grained skipped-step scheduling. Consequently, VVS reduces the number of target model forward passes by a factor of $2.8\times$ relative to vanilla AR decoding while maintaining competitive generation quality, offering a superior speed-quality trade-off over conventional SD frameworks and revealing strong potential to reshape the SD paradigm.

[365] ICLR: Inter-Chrominance and Luminance Interaction for Natural Color Restoration in Low-Light Image Enhancement

Xin Xu,Hao Liu,Wei Liu,Wei Wang,Jiayi Wu,Kui Jiang

Main category: cs.CV

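TL;DR: Proposes ICLR, a low-light image enhancement framework comprising a DIEM module and a CCL loss, which improves complementary feature extraction between the luminance and chrominance branches, alleviates gradient conflicts, and outperforms existing methods.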

Details

Motivation: 现有方法在色度与亮度分支交互中存在分布差异大、误差传播及弱相关区域梯度冲突等问题，限制了增强效果。 Method: 设计双流交互增强模块（DIEM）从融合与增强两个维度提升互补信息提取；提出协方差校正损失（CCL），利用亮度残差统计量抑制色度误差并平衡梯度冲突。 Result: 在多个数据集上实验表明，该方法在定量指标和视觉效果上均优于当前最先进的低光增强方法。 Conclusion: ICLR框架有效改善了色度与亮度分支的交互机制，在低光照图像增强任务中取得了优异性能。 Abstract: Low-Light Image Enhancement (LLIE) task aims at improving contrast while restoring details and textures for images captured in low-light conditions. HVI color space has made significant progress in this task by enabling precise decoupling of chrominance and luminance. However, for the interaction of chrominance and luminance branches, substantial distributional differences between the two branches prevalent in natural images limit complementary feature extraction, and luminance errors are propagated to chrominance channels through the nonlinear parameter. Furthermore, for interaction between different chrominance branches, images with large homogeneous-color regions usually exhibit weak correlation between chrominance branches due to concentrated distributions. Traditional pixel-wise losses exploit strong inter-branch correlations for co-optimization, causing gradient conflicts in weakly correlated regions. Therefore, we propose an Inter-Chrominance and Luminance Interaction (ICLR) framework including a Dual-stream Interaction Enhancement Module (DIEM) and a Covariance Correction Loss (CCL). The DIEM improves the extraction of complementary information from two dimensions, fusion and enhancement, respectively. The CCL utilizes luminance residual statistics to penalize chrominance errors and balances gradient conflicts by constraining chrominance branches covariance. Experimental results on multiple datasets show that the proposed ICLR framework outperforms state-of-the-art methods.

[366] AtlasMorph: Learning conditional deformable templates for brain MRI

Marianne Rakic,Andrew Hoopes,S. Mazdak Abulnaga,Mert R. Sabuncu,John V. Guttag,Adrian V. Dalca

Main category: cs.CV

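TL;DR: Proposes a machine learning framework based on convolutional registration neural networks that efficiently generates conditional, anatomically labeled brain MRI templates from subject attributes such as age and sex, and improves registration performance.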

Details

Motivation: 传统模板构建耗时且数量有限，常导致研究中使用不具代表性的模板，尤其在人群差异大时表现更差。因此需要一种能快速生成适应特定人群的高质量模板的方法。 Method: 采用卷积配准神经网络，学习从个体属性（如年龄、性别）到图像模板的映射；利用可用的分割标注生成带解剖标签的模板；同时该网络可用于将个体图像配准到生成的模板上。 Result: 在多个3D脑MRI数据集上验证，模型能生成高质量、具有代表性的条件模板；带有标注的条件模板在配准任务中优于无条件或无标注模板，并优于其他模板构建方法。 Conclusion: 所提框架能高效生成个性化、带标签的解剖模板，显著提升医学图像配准与分析的准确性，为计算解剖学提供了更具适应性的工具。 Abstract: Deformable templates, or atlases, are images that represent a prototypical anatomy for a population, and are often enhanced with probabilistic anatomical label maps. They are commonly used in medical image analysis for population studies and computational anatomy tasks such as registration and segmentation. Because developing a template is a computationally expensive process, relatively few templates are available. As a result, analysis is often conducted with sub-optimal templates that are not truly representative of the study population, especially when there are large variations within this population. We propose a machine learning framework that uses convolutional registration neural networks to efficiently learn a function that outputs templates conditioned on subject-specific attributes, such as age and sex. We also leverage segmentations, when available, to produce anatomical segmentation maps for the resulting templates. The learned network can also be used to register subject images to the templates. We demonstrate our method on a compilation of 3D brain MRI datasets, and show that it can learn high-quality templates that are representative of populations. We find that annotated conditional templates enable better registration than their unlabeled unconditional counterparts, and outperform other templates construction methods.

[367] Tissue Aware Nuclei Detection and Classification Model for Histopathology Images

Kesi Xu,Eleni Chiou,Ali Varamesh,Laura Acqualagna,Nasir Rajpoot

Main category: cs.CV

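TL;DR: Proposes TAND, a tissue-aware nuclei detection framework that uses point-level supervision and tissue mask conditioning for joint nuclei detection and classification, achieving state-of-the-art performance on the PUMA benchmark.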

Details

Motivation: Existing methods rely on detailed expert annotations and make insufficient use of tissue context, which limits the accuracy of nuclei detection and classification. Method: TAND couples a ConvNeXt encoder-decoder with a frozen Virchow-2 tissue segmentation branch, and uses multi-scale Spatial Feature-wise Linear Modulation (Spatial-FiLM) so that semantic tissue probabilities modulate the classification stream. Result: On the PUMA benchmark, TAND outperforms tissue-agnostic baselines and mask-supervised methods, with notable gains on tissue-dependent cell types such as epithelium, endothelium, and stroma. Conclusion: The authors describe this as the first method to condition per-cell classification on learned tissue masks; it reduces annotation burden and offers a practical new path for computational pathology. Abstract: Accurate nuclei detection and classification are fundamental to computational pathology, yet existing approaches are hindered by reliance on detailed expert annotations and insufficient use of tissue context. We present Tissue-Aware Nuclei Detection (TAND), a novel framework achieving joint nuclei detection and classification using point-level supervision enhanced by tissue mask conditioning. TAND couples a ConvNeXt-based encoder-decoder with a frozen Virchow-2 tissue segmentation branch, where semantic tissue probabilities selectively modulate the classification stream through a novel multi-scale Spatial Feature-wise Linear Modulation (Spatial-FiLM). On the PUMA benchmark, TAND achieves state-of-the-art performance, surpassing both tissue-agnostic baselines and mask-supervised methods. Notably, our approach demonstrates remarkable improvements in tissue-dependent cell types such as epithelium, endothelium, and stroma. To the best of our knowledge, this is the first method to condition per-cell classification on learned tissue masks, offering a practical pathway to reduce annotation burden.
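
Feature-wise linear modulation scales and shifts a feature by values predicted from a conditioning signal; a spatial variant does this per pixel. The sketch below illustrates the general idea on a single-channel feature map, assuming a linear map from tissue probabilities to the scale and shift. The names `gamma_w` and `beta_w` are hypothetical, and TAND's actual multi-scale Spatial-FiLM layer is not described in enough detail to reproduce.

```python
def spatial_film(features, tissue_probs, gamma_w, beta_w):
    """Per-pixel FiLM on a single-channel feature map.

    features:     H x W nested list of floats
    tissue_probs: H x W x K nested list of per-pixel tissue probabilities
    gamma_w, beta_w: length-K weights (hypothetical linear conditioning)
    Output pixel = (1 + gamma_w . p) * x + (beta_w . p)
    """
    out = []
    for feat_row, prob_row in zip(features, tissue_probs):
        row = []
        for x, probs in zip(feat_row, prob_row):
            gamma = 1.0 + sum(w * p for w, p in zip(gamma_w, probs))
            beta = sum(w * p for w, p in zip(beta_w, probs))
            row.append(gamma * x + beta)
        out.append(row)
    return out


if __name__ == "__main__":
    feats = [[2.0, 2.0]]
    probs = [[[1.0, 0.0], [0.0, 1.0]]]  # pixel 0 is tissue A, pixel 1 is tissue B
    print(spatial_film(feats, probs, gamma_w=[0.5, -0.5], beta_w=[1.0, 0.0]))
```

Two pixels with identical features end up with different activations purely because of the tissue they sit in, which is the effect tissue conditioning is meant to produce.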

[368] A Real-Time Driver Drowsiness Detection System Using MediaPipe and Eye Aspect Ratio

Ashlesha G. Sawant,Shreyash S. Kamble,Raj S. Kanade,Raunak N. Kanugo,Tanishq A. Kapse,Karan A. Bhapse

Main category: cs.CV

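TL;DR: Presents a driver drowsiness detection system built on a standard webcam and the Eye Aspect Ratio (EAR) method. It uses MediaPipe's Face Mesh model to track facial features, especially eye movements, in real time and sounds an alarm for drowsy drivers. The system is described as accurate, fast, and low-cost, and could be integrated into Advanced Driver Assistance Systems (ADAS).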

Details

Motivation: Driver fatigue is one of the leading causes of road accidents and results in many casualties every year, so an efficient, low-cost real-time fatigue monitoring system is needed to improve driving safety. Method: The system uses a standard webcam with OpenCV for image processing and MediaPipe's Face Mesh model to extract facial landmarks. It applies the Eye Aspect Ratio (EAR) to monitor eye openness and blink rate, and flags signs of fatigue such as prolonged eye closure or infrequent blinking. Result: Experiments indicate that the system detects driver fatigue accurately and quickly, with real-time responsiveness, and is feasible and stable in practical use. Conclusion: The EAR- and MediaPipe-based system is an efficient, low-cost solution that can improve driving safety and could be integrated into existing Advanced Driver Assistance Systems (ADAS). Abstract: One of the major causes of road accidents is driver fatigue, which causes thousands of fatalities and injuries every year. This study presents the development of a Driver Drowsiness Detection System intended to improve road safety by alerting drivers who show signs of drowsiness. The system uses a standard webcam to track the driver's facial features, with the main emphasis on eye movements, which are analyzed with the Eye Aspect Ratio (EAR) method. MediaPipe's Face Mesh is a lightweight framework that identifies facial landmarks with high accuracy and efficiency, which is important for real-time use. The system detects prolonged eye closure or a very low blink rate, both signs of drowsiness, and alerts the driver with a sound to regain their attention. By using OpenCV for image processing and MediaPipe for facial landmark detection, the system provides a high-performance, low-cost driver monitoring solution. Experimental analysis on test data indicates that the system is accurate and responds quickly, which suggests that it could become a component of current Advanced Driver Assistance Systems (ADAS).
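
The EAR computation at the heart of this system is short enough to sketch. The version below uses the commonly cited six-landmark definition, EAR = (|p2 - p6| + |p3 - p5|) / (2 |p1 - p4|), and a consecutive-frame rule for flagging closure. The threshold of 0.2 and the frame count of 3 are illustrative assumptions, not values reported by the authors, and the landmark extraction itself (MediaPipe, OpenCV) is left out.

```python
import math


def eye_aspect_ratio(p1, p2, p3, p4, p5, p6):
    """EAR from six (x, y) eye landmarks.

    p1 and p4 are the horizontal eye corners; (p2, p6) and (p3, p5)
    are the upper/lower eyelid pairs.
    """
    vertical = math.dist(p2, p6) + math.dist(p3, p5)
    horizontal = math.dist(p1, p4)
    return vertical / (2.0 * horizontal)


def is_drowsy(ear_values, threshold=0.2, min_closed_frames=3):
    """True if EAR stays below threshold for min_closed_frames frames in a row.

    Both threshold and min_closed_frames are assumed values for illustration.
    """
    closed = 0
    for ear in ear_values:
        closed = closed + 1 if ear < threshold else 0
        if closed >= min_closed_frames:
            return True
    return False


if __name__ == "__main__":
    ear = eye_aspect_ratio((0, 0), (1, 1), (3, 1), (4, 0), (3, -1), (1, -1))
    print(ear)
    print(is_drowsy([0.3, 0.1, 0.1, 0.1]))
```

Requiring several consecutive low-EAR frames is what separates a normal blink from sustained closure; in practice the frame count would be tuned to the camera's frame rate.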

[369] Alpha Divergence Losses for Biometric Verification

Dimitrios Koutsianos,Ladislav Mosner,Yannis Panagakis,Themos Stafylakis

Main category: cs.CV

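TL;DR: Proposes two new α-divergence-based loss functions with angular margins (Q-Margin and A3M) for face and speaker verification; they significantly outperform existing methods at low false acceptance rates, making them well suited to high-security settings.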

Details

Motivation: Margin-based softmax losses such as CosFace and ArcFace perform well in face and speaker verification. α-divergence losses can induce sparse solutions, but it is not obvious how to incorporate the angular margin that verification needs, so effective ways of combining the two need to be explored. Method: Two new losses are proposed by introducing the margin in the reference measure (Q-Margin) or in the logits (A3M), together with a prototype re-initialization strategy that addresses training instability in A3M. Result: The methods achieve significant gains on IJB-B, IJB-C, and VoxCeleb, and outperform strong baselines especially at low false acceptance rates (FAR). Conclusion: Q-Margin and A3M effectively combine α-divergence with angular margins and perform well across verification tasks, particularly in applications with high security requirements. Abstract: Performance in face and speaker verification is largely driven by margin-based softmax losses like CosFace and ArcFace. Recently introduced $α$-divergence loss functions offer a compelling alternative, particularly for their ability to induce sparse solutions (when $α>1$). However, integrating an angular margin, which is crucial for verification tasks, is not straightforward. We find this integration can be achieved in at least two distinct ways: via the reference measure (prior probabilities) or via the logits (unnormalized log-likelihoods). In this paper, we explore both pathways, deriving two novel margin-based $α$-divergence losses: Q-Margin (margin in the reference measure) and A3M (margin in the logits). We identify and address a critical training instability in A3M, caused by the interplay of penalized logits and sparsity, with a simple yet effective prototype re-initialization strategy. Our methods achieve significant performance gains on the challenging IJB-B and IJB-C face verification benchmarks. We demonstrate similarly strong performance in speaker verification on VoxCeleb. Crucially, our models significantly outperform strong baselines at low false acceptance rates (FAR). This capability is crucial for practical high-security applications, such as banking authentication, where minimizing false authentications is paramount.
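
As background for the margin idea, the sketch below shows the CosFace-style baseline the abstract refers to: a fixed margin is subtracted from the target-class cosine before scaling and softmax cross-entropy. This is the standard softmax baseline, not the proposed Q-Margin or A3M losses, whose α-divergence form the abstract does not spell out. The `scale` and `margin` defaults are illustrative.

```python
import math


def cosface_logits(cosines, target, scale=30.0, margin=0.35):
    """Scaled cosine logits with an additive margin on the target class."""
    return [scale * (c - margin) if j == target else scale * c
            for j, c in enumerate(cosines)]


def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for one sample."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[target]


if __name__ == "__main__":
    cosines = [0.8, 0.2]  # cosine similarity to each class prototype
    with_margin = cross_entropy(cosface_logits(cosines, target=0), 0)
    without = cross_entropy([30.0 * c for c in cosines], 0)
    print(with_margin, without)
```

The margin makes the loss larger for the same embedding, so training has to push the target cosine further above the others before the loss vanishes.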

[370] CacheFlow: Compressive Streaming Memory for Efficient Long-Form Video Understanding

Shrenik Patel,Daivik Patel

Main category: cs.CV

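TL;DR: Proposes CacheFlow, a training-free pipeline for long-form video question answering (VQA) that combines dynamic token dropping with a compressive long-term memory to substantially improve inference efficiency while preserving answer accuracy.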

Details

Motivation: Existing vision-language models struggle with long videos because attention and the KV cache grow over time, which forces either expensive inference or short sliding windows and prevents efficient, comprehensive long-range understanding. Method: CacheFlow combines Dynamic Token Dropping (DTD) with a compressive long-term memory. DTD prunes each frame's patch tokens online using cosine similarity to the previous frame and packs the surviving tokens into fixed-size blocks. A small recurrent encoder summarizes each block's keys into a retrieval index, while the full KV pairs are offloaded and reloaded at generation time. At inference, a consensus-based mechanism retrieves the Top-K most relevant blocks, and attention is computed over them together with the local context. Result: On offline and streaming VQA benchmarks, CacheFlow outperforms strong existing baselines while processing up to 87% fewer tokens, supporting real-time streaming VQA with high answer fidelity. Conclusion: CacheFlow is a drop-in, architecture-agnostic solution that requires no fine-tuning and makes vision-language models both efficient and context-aware for long-form video understanding. Abstract: Long-form video question answering (VQA) overwhelms current vision-language models (VLMs) because attention and key-value (KV) caches grow with runtime, forcing either expensive inference or near-sighted sliding windows. We introduce CacheFlow, a training-free pipeline that pairs Dynamic Token Dropping (DTD) with a compressive long-term memory. DTD prunes per-patch tokens online via cosine similarity to the previous frame, and surviving tokens are packed into fixed-size blocks. This online, per-frame processing makes our approach fundamentally suited for live streaming VQA. As blocks are processed, each one's keys are summarized by a tiny recurrent encoder to form a retrieval index, while the block's full KV pairs are offloaded and later rehydrated for generation, preserving answer fidelity. At inference, a consensus-based retrieval mechanism retrieves only the Top-K most relevant blocks and attends over both the retrieved and local context for precise, long-range reasoning. CacheFlow is drop-in, architecture-agnostic, and requires no fine-tuning. Experiments on both offline and streaming VQA benchmarks demonstrate that CacheFlow outperforms current strong baselines, while processing up to 87% fewer tokens. Our dual approach enables VLMs to be both efficient and context-aware, paving the way for practical long-form video understanding.
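
The token-dropping step can be illustrated in a few lines: a patch token is kept only if it differs enough from the token at the same position in the previous frame, and the survivors are grouped into blocks. This is a minimal sketch of that idea, not the CacheFlow code. The similarity threshold of 0.95 is an assumed value, and the recurrent key summarizer and retrieval stages are omitted.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def drop_redundant_tokens(prev_frame, cur_frame, threshold=0.95):
    """Return indices of current-frame patch tokens that changed enough to keep.

    A token is dropped when its cosine similarity to the same patch in the
    previous frame is at or above threshold (an assumed value).
    """
    return [i for i, (p, c) in enumerate(zip(prev_frame, cur_frame))
            if cosine(p, c) < threshold]


def pack_blocks(tokens, block_size):
    """Group tokens into blocks of block_size; the final block may be shorter."""
    return [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]


if __name__ == "__main__":
    prev = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    cur = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
    print(drop_redundant_tokens(prev, cur))
    print(pack_blocks([1, 2, 3, 4, 5], 2))
```

Because the comparison only needs the previous frame, the pruning runs online as frames arrive, which is why the abstract describes the approach as suited to streaming.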

[371] Part-X-MLLM: Part-aware 3D Multimodal Large Language Model

Chunshi Wang,Junliang Ye,Yunhan Yang,Yang Li,Zizhuo Lin,Jun Zhu,Zhuo Chen,Yawei Luo,Chunchao Guo

Main category: cs.CV

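TL;DR: Proposes Part-X-MLLM, a native 3D multimodal large language model that unifies diverse 3D tasks as programs in a structured, executable grammar, enabling part-level generation and editing from language instructions.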

Details

Motivation: 现有方法难以统一处理多样化的3D任务，且缺乏对部件级语义理解与几何生成的有效解耦。 Method: 设计双编码器架构，预训练以分离结构与语义，并在大规模部件中心数据集上进行指令微调，自回归生成包含部件框、语义描述和编辑命令的结构化输出序列。 Result: 在接地问答、组合生成和局部编辑等任务上实现最先进性能，验证了统一接口生成高质量结构化计划的能力。 Conclusion: Part-X-MLLM通过符号规划与几何合成的解耦，提供了一个通用、灵活的前端接口，可驱动任意兼容的几何引擎完成多样化3D任务。 Abstract: We introduce Part-X-MLLM, a native 3D multimodal large language model that unifies diverse 3D tasks by formulating them as programs in a structured, executable grammar. Given an RGB point cloud and a natural language prompt, our model autoregressively generates a single, coherent token sequence encoding part-level bounding boxes, semantic descriptions, and edit commands. This structured output serves as a versatile interface to drive downstream geometry-aware modules for part-based generation and editing. By decoupling the symbolic planning from the geometric synthesis, our approach allows any compatible geometry engine to be controlled through a single, language-native frontend. We pre-train a dual-encoder architecture to disentangle structure from semantics and instruction-tune the model on a large-scale, part-centric dataset. Experiments demonstrate that our model excels at producing high-quality, structured plans, enabling state-of-the-art performance in grounded Q\&A, compositional generation, and localized editing through one unified interface. Project page: https://chunshi.wang/Part-X-MLLM/

[372] PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image

Ziang Cao,Fangzhou Hong,Zhaoxi Chen,Liang Pan,Ziwei Liu

Main category: cs.CV

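TL;DR: Proposes PhysX-Anything, described as the first framework to generate simulation-ready physical 3D assets from a single in-the-wild image, with explicit geometry, articulation, and physical attributes, for simulation and interaction in embodied AI.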

Details

Motivation: 现有的3D生成方法大多忽略了关键的物理和关节特性，限制了其在具身AI中的应用。因此需要一个能够生成直接用于仿真的物理3D资产的框架。 Method: 提出了一种基于视觉语言模型（VLM）的新型物理3D生成模型，并设计了一种高效标记化几何的新3D表示方法，减少了193倍的token数量；同时构建了新的数据集PhysX-Mobility，扩展了物体类别并包含超过2000个真实世界物体的丰富物理标注。 Result: 在PhysX-Mobility和野外图像上的实验表明，PhysX-Anything具有强大的生成性能和良好的泛化能力；在MuJoCo风格环境中的仿真验证了所生成资产可直接用于接触密集型机器人策略学习。 Conclusion: PhysX-Anything能有效生成高质量、仿真就绪的物理3D资产，显著推动具身AI和基于物理的仿真等下游应用的发展。 Abstract: 3D modeling is shifting from static visual representations toward physical, articulated assets that can be directly used in simulation and interaction. However, most existing 3D generation methods overlook key physical and articulation properties, thereby limiting their utility in embodied AI. To bridge this gap, we introduce PhysX-Anything, the first simulation-ready physical 3D generative framework that, given a single in-the-wild image, produces high-quality sim-ready 3D assets with explicit geometry, articulation, and physical attributes. Specifically, we propose the first VLM-based physical 3D generative model, along with a new 3D representation that efficiently tokenizes geometry. It reduces the number of tokens by 193x, enabling explicit geometry learning within standard VLM token budgets without introducing any special tokens during fine-tuning and significantly improving generative quality. In addition, to overcome the limited diversity of existing physical 3D datasets, we construct a new dataset, PhysX-Mobility, which expands the object categories in prior physical 3D datasets by over 2x and includes more than 2K common real-world objects with rich physical annotations. Extensive experiments on PhysX-Mobility and in-the-wild images demonstrate that PhysX-Anything delivers strong generative performance and robust generalization. Furthermore, simulation-based experiments in a MuJoCo-style environment validate that our sim-ready assets can be directly used for contact-rich robotic policy learning. 
We believe PhysX-Anything can substantially empower a broad range of downstream applications, especially in embodied AI and physics-based simulation.

[373] Distribution Matching Distillation Meets Reinforcement Learning

Dengyang Jiang,Dongyang Liu,Zanyi Wang,Qilong Wu,Xin Jin,David Liu,Zhen Li,Mengmeng Wang,Peng Gao,Harry Yang

Main category: cs.CV

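TL;DR: Proposes DMDR, a framework that combines reinforcement learning with distribution matching distillation to improve few-step generative models, which can then exceed the multi-step teacher.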

Details

Motivation: In conventional distillation, the performance of the few-step diffusion model is capped by the pre-trained multi-step teacher. Method: Reinforcement learning is introduced into distribution matching distillation, with the DMD loss serving as a regularizer, and dynamic distribution guidance and dynamic renoise sampling strategies are designed to improve the initial distillation. Result: Experiments show that DMDR outperforms existing few-step methods in visual quality and prompt coherence, and even exceeds the multi-step teacher. Conclusion: By combining distillation with reinforcement learning, DMDR unlocks the potential of few-step generative models. Abstract: Distribution Matching Distillation (DMD) distills a pre-trained multi-step diffusion model into a few-step one to improve inference efficiency. However, the performance of the latter is often capped by the former. To circumvent this dilemma, we propose DMDR, a novel framework that incorporates Reinforcement Learning (RL) techniques into the distillation process. We show that for the RL of the few-step generator, the DMD loss itself is a more effective regularization than the traditional ones. In turn, RL can help to guide the mode coverage process in DMD more effectively. These allow us to unlock the capacity of the few-step generator by conducting distillation and RL simultaneously. Meanwhile, we design the dynamic distribution guidance and dynamic renoise sampling training strategies to improve the initial distillation process. The experiments demonstrate that DMDR achieves leading visual quality and prompt coherence among few-step methods, and can even exceed the performance of the multi-step teacher.

[374] OlmoEarth: Stable Latent Image Modeling for Multimodal Earth Observation

Henry Herzog,Favyen Bastani,Yawen Zhang,Gabriel Tseng,Joseph Redmon,Hadrien Sablon,Ryan Park,Jacob Morrison,Alexandra Buraczynski,Karen Farley,Joshua Hansen,Andrew Howe,Patrick Alan Johnson,Mark Otterlee,Ted Schmitt,Hunter Pitelka,Stephen Daspit,Rachel Ratner,Christopher Wilhelm,Sebastian Wood,Mike Jacobi,Hannah Kerner,Evan Shelhamer,Ali Farhadi,Ranjay Krishna,Patrick Beukema

Main category: cs.CV

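TL;DR: OlmoEarth is a multimodal spatio-temporal foundation model for Earth observation data that uses a novel self-supervised learning approach, performs strongly on a range of benchmarks and real-world tasks, and is open source.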

Details

Motivation: Earth observation data is spatial, temporal, and highly multimodal, which existing models handle poorly, so a purpose-built foundation model is needed. Method: OlmoEarth uses a self-supervised learning formulation, masking strategy, and loss function designed for the Earth observation domain, and underpins an end-to-end platform for training and inference. Result: In a comparison against 12 other foundation models, OlmoEarth leads on 15 of 24 tasks when evaluating embeddings and on 19 of 29 tasks with full fine-tuning. Conclusion: OlmoEarth achieves state-of-the-art performance in Earth observation and has been deployed as an open platform serving non-profits and NGOs. Abstract: Earth observation data presents a unique challenge: it is spatial like images, sequential like video or text, and highly multimodal. We present OlmoEarth: a multimodal, spatio-temporal foundation model that employs a novel self-supervised learning formulation, masking strategy, and loss all designed for the Earth observation domain. OlmoEarth achieves state-of-the-art performance compared to 12 other foundation models across a variety of research benchmarks and real-world tasks from external partners. When evaluating embeddings, OlmoEarth achieves the best performance on 15 out of 24 tasks, and with full fine-tuning it is the best on 19 of 29 tasks. We deploy OlmoEarth as the backbone of an end-to-end platform for data collection, labeling, training, and inference of Earth observation models. The OlmoEarth Platform puts frontier foundation models and powerful data management tools into the hands of non-profits and NGOs working to solve the world's biggest problems. OlmoEarth source code, training data, and pre-trained weights are available at https://github.com/allenai/olmoearth_pretrain.

[375] Training-Free Multi-View Extension of IC-Light for Textual Position-Aware Scene Relighting

Jiangnan Ye,Jiedong Zhuang,Lianrui Mu,Wenjie Zheng,Jiaqi Hu,Xingze Zou,Jing Wang,Haoji Hu

Main category: cs.CV

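TL;DR: GS-Light is a text-aware 3D scene relighting method based on Gaussian Splatting that fuses multi-view inputs with lighting priors parsed from text prompts to produce high-quality relighting that matches user intent.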

Details

Motivation: 现有方法在文本引导的3D场景重光照中难以准确理解光照方向和语义意图，且多视图一致性不足，缺乏高效的训练-free解决方案。 Method: 提出GS-Light，利用大视觉语言模型解析文本提示为光照先验，结合深度、法线和语义分割等几何信息生成初始隐编码，指导单输入扩散模型进行多视角重光照，并微调3D高斯点阵以获得完整的重光照3D场景。 Result: 在室内外场景上优于现有方法，在多视图一致性、图像质量、美学评分和语义相似性等指标上均有提升，用户研究也表明其更符合人类偏好。 Conclusion: GS-Light实现了高效、精确的文本驱动3D场景重光照，兼顾艺术性与几何一致性，为3D内容编辑提供了实用且性能优越的新方案。 Abstract: We introduce GS-Light, an efficient, textual position-aware pipeline for text-guided relighting of 3D scenes represented via Gaussian Splatting (3DGS). GS-Light implements a training-free extension of a single-input diffusion model to handle multi-view inputs. Given a user prompt that may specify lighting direction, color, intensity, or reference objects, we employ a large vision-language model (LVLM) to parse the prompt into lighting priors. Using off-the-shelf estimators for geometry and semantics (depth, surface normals, and semantic segmentation), we fuse these lighting priors with view-geometry constraints to compute illumination maps and generate initial latent codes for each view. These meticulously derived init latents guide the diffusion model to generate relighting outputs that more accurately reflect user expectations, especially in terms of lighting direction. By feeding multi-view rendered images, along with the init latents, into our multi-view relighting model, we produce high-fidelity, artistically relit images. Finally, we fine-tune the 3DGS scene with the relit appearance to obtain a fully relit 3D scene. We evaluate GS-Light on both indoor and outdoor scenes, comparing it to state-of-the-art baselines including per-view relighting, video relighting, and scene editing methods. Using quantitative metrics (multi-view consistency, imaging quality, aesthetic score, semantic similarity, etc.) and qualitative assessment (user studies), GS-Light demonstrates consistent improvements over baselines. Code and assets will be made available upon publication.

[376] TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models

Harold Haodong Chen,Disen Lan,Wen-Jie Shu,Qingyang Liu,Zihan Wang,Sirui Chen,Wenkai Cheng,Kanghao Chen,Hongfei Zhang,Zixin Zhang,Rongjin Guo,Yu Cheng,Ying-Cong Chen

Main category: cs.CV

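TL;DR: Proposes TiViBench, a hierarchical benchmark for evaluating the reasoning capabilities of image-to-video generation models, and introduces VideoTPO, a test-time method that improves reasoning performance.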

Details

Motivation: 现有视频生成模型评估主要关注视觉保真度和时间连贯性，缺乏对高阶推理能力的系统评估，难以衡量模型是否具备类似大语言模型的推理能力。 Method: 设计了包含四个维度（结构推理、空间模式推理、符号逻辑推理、动作规划）的TiViBench基准，在24种任务场景中进行评估；提出VideoTPO策略，利用大语言模型对生成结果进行自我分析与优化，以提升推理性能。 Result: 实验表明商业模型（如Sora 2、Veo 3.1）具有更强的推理潜力，开源模型受限于训练规模和数据多样性；VideoTPO无需额外训练即可显著提升推理表现。 Conclusion: TiViBench填补了视频生成模型高阶推理评估的空白，VideoTPO为提升模型推理能力提供了有效途径，推动视频生成向更具逻辑性和智能性的方向发展。 Abstract: The rapid evolution of video generative models has shifted their focus from producing visually plausible outputs to tackling tasks requiring physical plausibility and logical consistency. However, despite recent breakthroughs such as Veo 3's chain-of-frames reasoning, it remains unclear whether these models can exhibit reasoning capabilities similar to large language models (LLMs). Existing benchmarks predominantly evaluate visual fidelity and temporal coherence, failing to capture higher-order reasoning abilities. To bridge this gap, we propose TiViBench, a hierarchical benchmark specifically designed to evaluate the reasoning capabilities of image-to-video (I2V) generation models. TiViBench systematically assesses reasoning across four dimensions: i) Structural Reasoning & Search, ii) Spatial & Visual Pattern Reasoning, iii) Symbolic & Logical Reasoning, and iv) Action Planning & Task Execution, spanning 24 diverse task scenarios across 3 difficulty levels. Through extensive evaluations, we show that commercial models (e.g., Sora 2, Veo 3.1) demonstrate stronger reasoning potential, while open-source models reveal untapped potential that remains hindered by limited training scale and data diversity. To further unlock this potential, we introduce VideoTPO, a simple yet effective test-time strategy inspired by preference optimization. By performing LLM self-analysis on generated candidates to identify strengths and weaknesses, VideoTPO significantly enhances reasoning performance without requiring additional training, data, or reward models. 
Together, TiViBench and VideoTPO pave the way for evaluating and advancing reasoning in video generation models, setting a foundation for future research in this emerging field.

[377] Free-Form Scene Editor: Enabling Multi-Round Object Manipulation like in a 3D Engine

Xincheng Shuai,Zhenyuan Qin,Henghui Ding,Dacheng Tao

Main category: cs.CV

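TL;DR: This paper proposes FFSE, a 3D-aware autoregressive framework for intuitive, physically consistent object editing on real images. It supports multi-round editing by modeling edits as a sequence of learned 3D transformations and introduces a hybrid dataset, 3DObjectEditor, to support training. Experiments show that the method outperforms existing approaches in both single-round and multi-round 3D-aware editing.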

Details

Motivation: Text-to-image diffusion models have advanced semantic image editing but fall short in 3D-aware object manipulation, in particular lacking support for physically consistent edits on real-world images. Method: Proposes the FFSE framework, which models editing as a sequence of learned 3D transformations (such as translation, scaling, and rotation) applied directly to real images; uses an autoregressive formulation to keep multi-round edits consistent; builds the hybrid 3DObjectEditor dataset from simulated editing sequences for training. Result: In single-round and multi-round 3D-aware image editing tasks, FFSE significantly outperforms existing methods while preserving realistic background effects (e.g., shadows, reflections) and global scene consistency. Conclusion: FFSE offers an effective solution for 3D-aware image editing, enabling high-quality, multi-round, physically consistent object manipulation without explicit 3D reconstruction. Abstract: Recent advances in text-to-image (T2I) diffusion models have significantly improved semantic image editing, yet most methods fall short in performing 3D-aware object manipulation. In this work, we present FFSE, a 3D-aware autoregressive framework designed to enable intuitive, physically-consistent object editing directly on real-world images. Unlike previous approaches that either operate in image space or require slow and error-prone 3D reconstruction, FFSE models editing as a sequence of learned 3D transformations, allowing users to perform arbitrary manipulations, such as translation, scaling, and rotation, while preserving realistic background effects (e.g., shadows, reflections) and maintaining global scene consistency across multiple editing rounds. To support learning of multi-round 3D-aware object manipulation, we introduce 3DObjectEditor, a hybrid dataset constructed from simulated editing sequences across diverse objects and scenes, enabling effective training under multi-round and dynamic conditions. Extensive experiments show that the proposed FFSE significantly outperforms existing methods in both single-round and multi-round 3D-aware editing scenarios.

[378] UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity

Junwei Yu,Trevor Darrell,XuDong Wang

Main category: cs.CV

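TL;DR: UnSAMv2 proposes a method for segmentation at any granularity without human annotation. Using self-supervised learning and a novel granularity control embedding, it significantly improves SAM-2 across multiple tasks with only 6K unlabeled images and a very small number of additional parameters.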

Details

Motivation: Existing SAM models offer limited control over segmentation granularity: users must manually adjust prompts or choose among pre-generated masks to get the desired level of detail, and densely annotating data at every granularity is too costly for supervised training. Method: UnSAMv2 extends UnSAM's divide-and-conquer strategy to discover abundant mask-granularity pairs from unlabeled images and introduces a new granularity control embedding that enables precise, continuous control over segmentation scale, trained with self-supervision on unannotated data. Result: Across more than 11 benchmarks, UnSAMv2 improves NoC_90 from 5.69 to 4.75, 1-IoU from 58.0 to 73.1, and AR_1000 from 49.6 to 68.3, and performs well on interactive, whole-image, and video segmentation tasks. Conclusion: A small amount of unlabeled data combined with granularity-aware self-supervised learning can unlock the potential of vision foundation models for multi-granularity segmentation; UnSAMv2 offers a practical path toward truly flexible "segment anything". Abstract: The Segment Anything Model (SAM) family has become a widely adopted vision foundation model, but its ability to control segmentation granularity remains limited. Users often need to refine results manually - by adding more prompts or selecting from pre-generated masks - to achieve the desired level of detail. This process can be ambiguous, as the same prompt may correspond to several plausible masks, and collecting dense annotations across all granularities is prohibitively expensive, making supervised solutions infeasible. To address this limitation, we introduce UnSAMv2, which enables segment anything at any granularity without human annotations. UnSAMv2 extends the divide-and-conquer strategy of UnSAM by discovering abundant mask-granularity pairs and introducing a novel granularity control embedding that enables precise, continuous control over segmentation scale. Remarkably, with only $6$K unlabeled images and $0.02\%$ additional parameters, UnSAMv2 substantially enhances SAM-2, achieving segment anything at any granularity across interactive, whole-image, and video segmentation tasks. Evaluated on over $11$ benchmarks, UnSAMv2 improves $\text{NoC}_{90}$ (5.69 $\rightarrow$ 4.75), 1-IoU (58.0 $\rightarrow$ 73.1), and $\text{AR}_{1000}$ (49.6 $\rightarrow$ 68.3), showing that small amounts of unlabeled data with a granularity-aware self-supervised learning method can unlock the potential of vision foundation models.

[379] Segment Anything Across Shots: A Method and Benchmark

Hengrui Hu,Kaining Ying,Henghui Ding

Main category: cs.CV

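TL;DR: Proposes a transition mimicking data augmentation strategy (TMA) and a cross-shot segmentation model (SAAS) for multi-shot semi-supervised video object segmentation (MVOS), and builds a new MVOS benchmark, Cut-VOS, achieving state-of-the-art performance under complex shot transitions.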

Details

Motivation: Existing VOS methods mainly target single-shot videos and handle the shot discontinuities of multi-shot videos poorly, which limits their practical use. Annotated multi-shot data is also sparse, making it hard to support research in this area. Method: Proposes a transition mimicking data augmentation strategy (TMA) that enables cross-shot generalization from single-shot data; designs the Segment Anything Across Shots (SAAS) model, which detects and comprehends shot transitions effectively; builds a new MVOS benchmark, Cut-VOS, with dense mask annotations, diverse object categories, and high-frequency shot transitions. Result: Extensive experiments on the YouMVOS and Cut-VOS datasets show that SAAS achieves state-of-the-art performance when segmenting across complex shot transitions. Conclusion: TMA and SAAS address the challenges that sparse annotated data and shot discontinuities pose in multi-shot videos, significantly improving MVOS performance and advancing research and applications in this area. Abstract: This work focuses on multi-shot semi-supervised video object segmentation (MVOS), which aims at segmenting the target object indicated by an initial mask throughout a video with multiple shots. The existing VOS methods mainly focus on single-shot videos and struggle with shot discontinuities, thereby limiting their real-world applicability. We propose a transition mimicking data augmentation strategy (TMA) which enables cross-shot generalization with single-shot data to alleviate the severe annotated multi-shot data sparsity, and the Segment Anything Across Shots (SAAS) model, which can detect and comprehend shot transitions effectively. To support evaluation and future study in MVOS, we introduce Cut-VOS, a new MVOS benchmark with dense mask annotations, diverse object categories, and high-frequency transitions. Extensive experiments on YouMVOS and Cut-VOS demonstrate that the proposed SAAS achieves state-of-the-art performance by effectively mimicking, understanding, and segmenting across complex transitions. The code and datasets are released at https://henghuiding.com/SAAS/.

[380] Scaling Spatial Intelligence with Multimodal Foundation Models

Zhongang Cai,Ruisi Wang,Chenyang Gu,Fanyi Pu,Junxiang Xu,Yubo Wang,Wanqi Yin,Zhitao Yang,Chen Wei,Qingping Sun,Tongxi Zhou,Jiaqi Li,Hui En Pang,Oscar Qian,Yukun Wei,Zhiqian Lin,Xuanke Shi,Kewang Deng,Xiaoyang Han,Zukai Chen,Xiangyu Fan,Hanming Deng,Lewei Lu,Liang Pan,Bo Li,Ziwei Liu,Quan Wang,Dahua Lin,Lei Yang

Main category: cs.CV

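TL;DR: This work presents the SenseNova-SI family of multimodal foundation models, which aims to improve spatial intelligence through large-scale data training. The authors build SenseNova-SI-8M, a dataset of eight million diverse samples, and systematically cultivate spatial capabilities under a rigorous taxonomy. The models perform strongly on multiple spatial intelligence benchmarks while retaining strong general multimodal understanding. The work also examines the impact of data scaling, early signs of emergent generalization, and the risks of overfitting and language shortcuts, and takes a preliminary look at spatial chain-of-thought reasoning and potential downstream applications. All newly trained models are publicly released.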

Details

Motivation: Despite significant progress, multimodal foundation models still show clear deficiencies in spatial intelligence. This work aims to systematically improve their understanding of and reasoning about spatial relations, moving multimodal AI toward more advanced cognitive capabilities. Method: Building on established multimodal foundation models (such as Qwen3-VL, InternVL3, and Bagel), the authors construct SenseNova-SI-8M, a large-scale, diverse dataset of eight million samples curated under a rigorous taxonomy of spatial capabilities, and cultivate spatial intelligence through training at scale. Result: SenseNova-SI achieves leading performance on multiple spatial intelligence benchmarks: 68.7% on VSI-Bench, 43.3% on MMSI, 85.6% on MindCube, 54.6% on ViewSpatial, and 50.1% on SITE, while maintaining strong general multimodal performance (e.g., 84.9% on MMBench-En). The experiments also find that data diversity fosters early signs of emergent generalization, provide preliminary evidence for spatial chain-of-thought reasoning, and assess the risks of overfitting and language shortcuts. Conclusion: Systematic large-scale data construction and training can effectively improve the spatial intelligence of multimodal foundation models. SenseNova-SI shows the key role that data scale and diversity play in developing complex spatial reasoning, offering a workable path and technical validation for future multimodal systems; all models are released to support follow-up research. Abstract: Despite remarkable progress, multimodal foundation models still exhibit surprising deficiencies in spatial intelligence. In this work, we explore scaling up multimodal foundation models to cultivate spatial intelligence within the SenseNova-SI family, built upon established multimodal foundations including visual understanding models (i.e., Qwen3-VL and InternVL3) and unified understanding and generation models (i.e., Bagel). We take a principled approach to constructing high-performing and robust spatial intelligence by systematically curating SenseNova-SI-8M: eight million diverse data samples under a rigorous taxonomy of spatial capabilities. SenseNova-SI demonstrates unprecedented performance across a broad range of spatial intelligence benchmarks: 68.7% on VSI-Bench, 43.3% on MMSI, 85.6% on MindCube, 54.6% on ViewSpatial, and 50.1% on SITE, while maintaining strong general multimodal understanding (e.g., 84.9% on MMBench-En). More importantly, we analyze the impact of data scaling, discuss early signs of emergent generalization capabilities enabled by diverse data training, analyze the risk of overfitting and language shortcuts, present a preliminary study on spatial chain-of-thought reasoning, and validate the potential downstream application. SenseNova-SI is an ongoing project, and this report will be updated continuously. All newly trained multimodal foundation models are publicly released to facilitate further research in this direction.

[381] Back to Basics: Let Denoising Generative Models Denoise

Tianhong Li,Kaiming He

Main category: cs.CV

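TL;DR: Proposes JiT, a diffusion model that directly predicts clean data. Motivated by the manifold assumption, it applies large-patch Transformers directly to raw image pixels in the high-dimensional space, with no tokenizer, pre-training, or extra loss, and achieves competitive results on ImageNet.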

Details

Motivation: Existing diffusion models predict noise rather than clean images, which is at odds with the manifold assumption; the authors argue that models should directly predict real data lying on a low-dimensional manifold. Method: Designs a model called JiT, a plain Transformer with large patch sizes that predicts clean images directly at the pixel level, without a tokenizer, pre-training, or extra loss functions. Result: On ImageNet at resolutions 256 and 512, JiT achieves competitive results with patch sizes 16 and 32, and remains robust in settings where high-dimensional noise prediction tends to fail. Conclusion: Directly predicting clean data is consistent with the manifold assumption and allows networks with limited capacity to operate effectively in high-dimensional spaces; JiT offers a self-contained, back-to-basics paradigm for Transformer-based diffusion. Abstract: Today's denoising diffusion models do not "denoise" in the classical sense, i.e., they do not directly predict clean images. Rather, the neural networks predict noise or a noised quantity. In this paper, we suggest that predicting clean data and predicting noised quantities are fundamentally different. According to the manifold assumption, natural data should lie on a low-dimensional manifold, whereas noised quantities do not. With this assumption, we advocate for models that directly predict clean data, which allows apparently under-capacity networks to operate effectively in very high-dimensional spaces. We show that simple, large-patch Transformers on pixels can be strong generative models: using no tokenizer, no pre-training, and no extra loss. Our approach is conceptually nothing more than "$\textbf{Just image Transformers}$", or $\textbf{JiT}$, as we call it. We report competitive results using JiT with large patch sizes of 16 and 32 on ImageNet at resolutions of 256 and 512, where predicting high-dimensional noised quantities can fail catastrophically. With our networks mapping back to the basics of the manifold, our research goes back to basics and pursues a self-contained paradigm for Transformer-based diffusion on raw natural data.
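
Editor's illustration (not from the paper): the abstract above contrasts two prediction targets, clean data and noise. The sketch below shows how the two relate under a generic DDPM-style forward process, x_t = sqrt(a) * x0 + sqrt(1 - a) * eps. That schedule is an assumption made here for illustration, since the abstract does not specify JiT's noising process, and all function names are hypothetical. Given x_t, either target can be recovered from the other algebraically; the abstract's argument concerns which quantity the network is asked to output, not this algebra.

```python
import math
import random


def add_noise(x0, eps, alpha_bar):
    """Assumed DDPM-style forward process: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]


def x0_from_eps(x_t, eps_hat, alpha_bar):
    """Clean-data estimate implied by a noise prediction eps_hat."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(xt - b * e) / a for xt, e in zip(x_t, eps_hat)]


def eps_from_x0(x_t, x0_hat, alpha_bar):
    """Noise estimate implied by a clean-data prediction x0_hat."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(xt - a * x) / b for xt, x in zip(x_t, x0_hat)]


# Toy 8-dimensional "image" and Gaussian noise (illustrative values only).
rng = random.Random(0)
x0 = [rng.uniform(-1.0, 1.0) for _ in range(8)]
eps = [rng.gauss(0.0, 1.0) for _ in range(8)]
alpha_bar = 0.3  # hypothetical signal level at some timestep

x_t = add_noise(x0, eps, alpha_bar)

# With perfect predictions, each target is recoverable from the other.
x0_rec = x0_from_eps(x_t, eps, alpha_bar)
eps_rec = eps_from_x0(x_t, x0, alpha_bar)
```

Note that `x0_from_eps` divides by sqrt(alpha_bar), so under this assumed schedule any error in a noise prediction is amplified when alpha_bar is small (heavily noised inputs).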

Table of Contents

cs.CL [Back]

[1] TimeStampEval: A Simple LLM Eval and a Little Fuzzy Matching Trick to Improve Search Accuracy

[2] MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

[3] On the Notion that Language Models Reason

[4] Scaling Open-Weight Large Language Models for Hydropower Regulatory Information Extraction: A Systematic Analysis

[5] Towards Autoformalization of LLM-generated Outputs for Requirement Verification

[6] Three Stage Narrative Analysis; Plot-Sentiment Breakdown, Structure Learning and Concept Detection

[7] Identifying Imaging Follow-Up in Radiology Reports: A Comparative Analysis of Traditional ML and LLM Approaches

[8] MedPT: A Massive Medical Question Answering Dataset for Brazilian-Portuguese Speakers

[9] ClinStructor: AI-Powered Structuring of Unstructured Clinical Texts

[10] Context-Emotion Aware Therapeutic Dialogue Generation: A Multi-component Reinforcement Learning Approach to Language Models for Mental Health Support

[11] Additive Large Language Models for Semi-Structured Text

[12] InData: Towards Secure Multi-Step, Tool-Based Data Analysis

[13] Improving LLM's Attachment to External Knowledge In Dialogue Generation Tasks Through Entity Anonymization

[14] On the Entropy Calibration of Language Models

[15] A Reasoning Paradigm for Named Entity Recognition

[16] Critical or Compliant? The Double-Edged Sword of Reasoning in Chain-of-Thought Explanations

[17] CURE: Cultural Understanding and Reasoning Evaluation - A Framework for "Thick" Culture Alignment Evaluation in LLMs

[18] Exploring Parameter-Efficient Fine-Tuning and Backtranslation for the WMT 25 General Translation Task

[19] LLMLagBench: Identifying Temporal Training Boundaries in Large Language Models

[20] PRISM of Opinions: A Persona-Reasoned Multimodal Framework for User-centric Conversational Stance Detection

[21] AI-Salesman: Towards Reliable Large Language Model Driven Telemarketing

[22] Seeing is Believing: Rich-Context Hallucination Detection for MLLMs via Backward Visual Grounding

[23] CriticSearch: Fine-Grained Credit Assignment for Search Agents via a Retrospective Critic

[24] MME-RAG: Multi-Manager-Expert Retrieval-Augmented Generation for Fine-Grained Entity Recognition in Task-Oriented Dialogues

[25] Consistency Is the Key: Detecting Hallucinations in LLM Generated Text By Checking Inconsistencies About Key Facts

[26] ViConBERT: Context-Gloss Aligned Vietnamese Word Embedding for Polysemous and Sense-Aware Representations

[27] Cmprsr: Abstractive Token-Level Question-Agnostic Prompt Compressor

[28] AugAbEx : Way Forward for Extractive Case Summarization

[29] Do LLMs and Humans Find the Same Questions Difficult? A Case Study on Japanese Quiz Answering

[30] Don't Think of the White Bear: Ironic Negation in Transformer Models Under Cognitive Load

[31] From Phonemes to Meaning: Evaluating Large Language Models on Tamil

[32] Probing Preference Representations: A Multi-Dimensional Evaluation and Analysis Method for Reward Models

[33] Assessing LLMs for Serendipity Discovery in Knowledge Graphs: A Case for Drug Repurposing

[34] SGuard-v1: Safety Guardrail for Large Language Models

[35] QA-Noun: Representing Nominal Semantics via Natural Language Question-Answer Pairs

[36] TAdaRAG: Task Adaptive Retrieval-Augmented Generation via On-the-Fly Knowledge Graph Construction

[37] Mitigating Length Bias in RLHF through a Causal Lens

[38] MMWOZ: Building Multimodal Agent for Task-oriented Dialogue

[39] Group-Aware Reinforcement Learning for Output Diversity in Large Language Models

[40] Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data

[41] Knots: A Large-Scale Multi-Agent Enhanced Expert-Annotated Dataset and LLM Prompt Optimization for NOTAM Semantic Parsing

[42] Reason-KE++: Aligning the Process, Not Just the Outcome, for Faithful LLM Knowledge Editing

[43] Improving Direct Persian-English Speech-to-Speech Translation with Discrete Units and Synthetic Parallel Data

[44] Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs

[45] Adaptive Focus Memory for Language Models

[46] On the Brittleness of LLMs: A Journey around Set Membership

[47] Evidence of Phase Transitions in Small Transformer-Based Language Models

[48] LLM Reinforcement in Context

[49] Evaluating Autoformalization Robustness via Semantically Similar Paraphrasing

[50] BioMedJImpact: A Comprehensive Dataset and LLM Pipeline for AI Engagement and Scientific Impact Analysis of Biomedical Journals

[51] From Passive to Persuasive: Steering Emotional Nuance in Human-AI Negotiation

[52] Quantifying consistency and accuracy of Latent Dirichlet Allocation

[53] NeuroLex: A Lightweight Domain Language Model for EEG Report Understanding and Generation

[54] From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models

[55] Classification of Hope in Textual Data using Transformer-Based Models

[56] Auditing Google's AI Overviews and Featured Snippets: A Case Study on Baby Care and Pregnancy

[57] Visual Room 2.0: Seeing is Not Understanding for MLLMs

[58] Fine-Tuned LLMs Know They Don't Know: A Parameter-Efficient Approach to Recovering Honesty

[59] AA-Omniscience: Evaluating Cross-Domain Knowledge Reliability in Large Language Models

[60] How Good is BLI as an Alignment Measure: A Study in Word Embedding Paradigm

[61] Spark-Prover-X1: Formal Theorem Proving Through Diverse Data Training

[62] BeDiscovER: The Benchmark of Discourse Understanding in the Era of Reasoning Language Models

[63] Evaluating the Ability of Large Language Models to Identify Adherence to CONSORT Reporting Guidelines in Randomized Controlled Trials: A Methodological Evaluation Study

[64] Extracting Events Like Code: A Multi-Agent Programming Framework for Zero-Shot Event Extraction

[65] A Comparative Analysis of Recurrent and Attention Architectures for Isolated Sign Language Recognition

[66] Zero-Shot Grammar Competency Estimation Using Large Language Model Generated Pseudo Labels

[67] Distinguishing Repetition Disfluency from Morphological Reduplication in Bangla ASR Transcripts: A Novel Corpus and Benchmarking Analysis

[68] TCM-5CEval: Extended Deep Evaluation Benchmark for LLM's Comprehensive Clinical Research Competence in Traditional Chinese Medicine

[69] Translation Entropy: A Statistical Framework for Evaluating Translation Systems

[70] Evaluating Large Language Models for Diacritic Restoration in Romanian Texts: A Comparative Study

[71] Seeing isn't Hearing: Benchmarking Vision Language Models at Interpreting Spectrograms

[72] Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance

[73] RegionMarker: A Region-Triggered Semantic Watermarking Framework for Embedding-as-a-Service Copyright Protection

[74] AHaSIS: Shared Task on Sentiment Analysis for Arabic Dialects

[75] Donors and Recipients: On Asymmetric Transfer Across Tasks and Languages with Parameter-Efficient Fine-Tuning

[76] Can Large Language Models Function as Qualified Pediatricians? A Systematic Evaluation in Real-World Clinical Contexts

[77] Mem-PAL: Towards Memory-based Personalized Dialogue Assistants for Long-term User-Agent Interaction

[78] Non-Linear Scoring Model for Translation Quality Evaluation