cs.CL

[1] HARMON-E: Hierarchical Agentic Reasoning for Multimodal Oncology Notes to Extract Structured Data

Shashi Kant Gupta,Arijeet Pramanik,Jerrin John Thomas,Regina Schwind,Lauren Wiener,Avi Raju,Jeremy Kornbluth,Yanshan Wang,Zhaohui Su,Hrituraj Singh

Main category: cs.CL

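TL;DR: This study proposes a large language model (LLM)-based agentic framework for end-to-end extraction of structured clinical data from large-scale real-world oncology text, markedly improving the accuracy and scalability of automated extraction.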

Details

Motivation: Unstructured notes in electronic health records contain rich oncology information, but existing methods are limited in patient-level, cross-document data integration and struggle with textual variability and complex terminology. A comprehensive, scalable automated solution is needed.

Method: Proposes a modular, adaptive agentic framework that uses LLMs as reasoning agents, combining context-sensitive retrieval with iterative synthesis to systematically decompose complex oncology data extraction tasks, and performs end-to-end extraction over more than 400,000 clinical notes and PDF reports.

Result: On a dataset of over 400,000 documents covering 2,250 cancer patients, the method reaches an average F1 score of 0.93; 100 of 103 clinical variables exceed 0.85, and key variables (e.g., biomarkers and medications) exceed 0.95. When applied in a data curation workflow, the direct manual approval rate reached 94%, significantly reducing annotation costs.

Conclusion: According to the authors, this is the first large-scale, end-to-end application of LLM agents to structured oncology data extraction. It demonstrates efficiency and practicality in real clinical settings and offers a new paradigm for automated medical data extraction.

Abstract: Unstructured notes within the electronic health record (EHR) contain rich clinical information vital for cancer treatment decision making and research, yet reliably extracting structured oncology data remains challenging due to extensive variability, specialized terminology, and inconsistent document formats. Manual abstraction, although accurate, is prohibitively costly and unscalable. Existing automated approaches typically address narrow scenarios - either using synthetic datasets, restricting focus to document-level extraction, or isolating specific clinical variables (e.g., staging, biomarkers, histology) - and do not adequately handle patient-level synthesis across the large number of clinical documents containing contradictory information. In this study, we propose an agentic framework that systematically decomposes complex oncology data extraction into modular, adaptive tasks. Specifically, we use large language models (LLMs) as reasoning agents, equipped with context-sensitive retrieval and iterative synthesis capabilities, to exhaustively and comprehensively extract structured clinical variables from real-world oncology notes. Evaluated on a large-scale dataset of over 400,000 unstructured clinical notes and scanned PDF reports spanning 2,250 cancer patients, our method achieves an average F1-score of 0.93, with 100 out of 103 oncology-specific clinical variables exceeding 0.85, and critical variables (e.g., biomarkers and medications) surpassing 0.95. Moreover, integration of the agentic system into a data curation workflow resulted in a 0.94 direct manual approval rate, significantly reducing annotation costs. To our knowledge, this constitutes the first exhaustive, end-to-end application of LLM-based agents for structured oncology data extraction at scale.

[2] How well do Large Language Models Recognize Instructional Moves? Establishing Baselines for Foundation Models in Educational Discourse

Kirk Vanacore,Rene F. Kizilcec

Main category: cs.CL

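TL;DR: This study evaluates how well six large language models (LLMs), without significant customization, classify instructional moves in authentic classroom dialogue. Few-shot prompting significantly improves performance, but results vary by task and reliability remains limited.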

Details

Motivation: As LLMs are widely adopted in educational technology, understanding their out-of-the-box performance in authentic educational settings is essential for setting reasonable expectations and establishing baselines.

Method: Compares six LLMs under zero-shot, one-shot, and few-shot prompting on classifying instructional moves in authentic classroom transcripts, evaluated against expert coding.

Result: Few-shot prompting significantly improved state-of-the-art models (Cohen's Kappa = 0.58), but performance varied considerably across instructional moves, and higher recall often came with more false positives.

Conclusion: Foundation models have some capacity to interpret instructional discourse. Prompt design can surface this capability but cannot fully overcome their reliability limits.

Abstract: Large language models (LLMs) are increasingly adopted in educational technologies for a variety of tasks, from generating instructional materials and assisting with assessment design to tutoring. While prior work has investigated how models can be adapted or optimized for specific tasks, far less is known about how well LLMs perform at interpreting authentic educational scenarios without significant customization. As LLM-based systems become widely adopted by learners and educators in everyday academic contexts, understanding their out-of-the-box capabilities is increasingly important for setting expectations and benchmarking. We compared six LLMs to estimate their baseline performance on a simple but important task: classifying instructional moves in authentic classroom transcripts. We evaluated typical prompting methods: zero-shot, one-shot, and few-shot prompting. We found that while zero-shot performance was moderate, providing comprehensive examples (few-shot prompting) significantly improved performance for state-of-the-art models, with the strongest configuration reaching Cohen's Kappa = 0.58 against expert-coded annotations. At the same time, improvements were neither uniform nor complete: performance varied considerably by instructional move, and higher recall frequently came at the cost of increased false positives. Overall, these findings indicate that foundation models demonstrate meaningful yet limited capacity to interpret instructional discourse, with prompt design helping to surface capability but not eliminating fundamental reliability constraints.
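The agreement statistic quoted for entry [2], Cohen's Kappa, corrects raw agreement between two annotators for the agreement expected by chance. The following is a minimal sketch of the standard two-rater formula, not the authors' evaluation code; the function name and the toy label strings are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences (standard two-rater formula)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data: expert codes versus model predictions.
expert = ["question", "question", "explain", "explain"]
model = ["question", "explain", "explain", "explain"]
kappa = cohens_kappa(expert, model)  # observed 0.75, chance 0.5, so kappa = 0.5
```

A kappa of 0.58, as reported above, therefore means the model closed a bit more than half of the gap between chance-level and perfect agreement with the expert coders.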

[3] Counterfactual LLM-based Framework for Measuring Rhetorical Style

Jingyi Qiu,Hong Chen,Zongyi Li

Main category: cs.CL

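TL;DR: This paper proposes a counterfactual, LLM-based framework that quantifies rhetorical style in machine learning papers and separates its effect from substantive content. An analysis of 8,485 ICLR submissions finds that visionary framing significantly predicts citations and media attention, and that rhetorical strength has risen sharply since 2023, driven largely by LLM writing assistance.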

Details

Motivation: Hype in AI papers is hard to quantify. The aim is to separate rhetorical style from actual scientific contribution so that scientific impact can be assessed more objectively.

Method: Builds a counterfactual, LLM-based framework: multiple LLM rhetorical personas generate different writings from the same content, an LLM judge compares them pairwise, and a Bradley-Terry model aggregates the outcomes to quantify rhetorical style.

Result: More than 250,000 counterfactual writings were generated for 8,485 ICLR papers. Visionary framing significantly predicts downstream attention (such as citations and media coverage), and rhetorical strength rose markedly after 2023; the evidence links this closely to the spread of LLM writing assistance.

Conclusion: LLMs can serve as effective instruments for measuring and improving scientific evaluation, revealing the real effect of rhetorical style on scholarly impact and providing a scalable way to identify hype.

Abstract: The rise of AI has fueled growing concerns about "hype" in machine learning papers, yet a reliable way to quantify rhetorical style independently of substantive content has remained elusive. Because bold language can stem from either strong empirical results or mere rhetorical style, it is often difficult to distinguish between the two. To disentangle rhetorical style from substantive content, we introduce a counterfactual, LLM-based framework: multiple LLM rhetorical personas generate counterfactual writings from the same substantive content, an LLM judge compares them through pairwise evaluations, and the outcomes are aggregated using a Bradley-Terry model. Applying this method to 8,485 ICLR submissions sampled from 2017 to 2025, we generate more than 250,000 counterfactual writings and provide a large-scale quantification of rhetorical style in ML papers. We find that visionary framing significantly predicts downstream attention, including citations and media attention, even after controlling for peer-review evaluations. We also observe a sharp rise in rhetorical strength after 2023, and provide empirical evidence showing that this increase is largely driven by the adoption of LLM-based writing assistance. The reliability of our framework is validated by its robustness to the choice of personas and the high correlation between LLM judgments and human annotations. Our work demonstrates that LLMs can serve as instruments to measure and improve scientific evaluation.
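The aggregation step in entry [3] turns pairwise judge outcomes into one strength score per item with a Bradley-Terry model. The sketch below fits such a model with the classic minorization-maximization update; it is an illustration under assumptions, not the authors' pipeline. The `wins` matrix layout, the iteration count, and the toy counts are all hypothetical, and the fit assumes every item has at least one recorded win.

```python
def bradley_terry(wins, iterations=200):
    """Fit Bradley-Terry strengths from a square matrix of pairwise win counts.

    wins[i][j] is the number of times item i was preferred over item j.
    Returns strengths normalized to sum to 1.
    """
    n = len(wins)
    strengths = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each pair contributes (number of comparisons) / (sum of current strengths).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom if denom > 0 else strengths[i])
        norm = sum(updated)
        strengths = [s / norm for s in updated]
    return strengths


# Hypothetical toy outcome: item 0 preferred over item 1 in 3 of 4 comparisons.
two_items = bradley_terry([[0, 3], [1, 0]])  # strengths 0.75 and 0.25
```

Under this model the probability that item i is preferred over item j is strength_i / (strength_i + strength_j), so the two-item example recovers the observed 3-to-1 preference exactly.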

Zhixiang Lu,Xueyuan Deng,Yiran Liu,Yulong Li,Qiang Yan,Imran Razzak,Jionglong Su

Main category: cs.CL

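TL;DR: This paper proposes PRISM, a new framework that combines stochastic differential equations with an MBTI-personality-based decision model to simulate opinion dynamics on social media, significantly improving personality consistency and the ability to explain polarization.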

Details

Motivation: Traditional agent-based opinion dynamics models assume homogeneous individuals and therefore fail to capture the psychological heterogeneity behind online polarization, which hinders understanding of how ideological divides form.

Method: Proposes the PRISM framework, which couples stochastic differential equations describing emotional evolution with an MBTI-based PC-POMDP decision model, and initializes the personality traits of multimodal LLM agents from large-scale social media data.

Result: PRISM outperforms traditional homogeneous models and Big Five baselines on personality consistency and reproduces emergent phenomena such as rational suppression and affective resonance.

Conclusion: PRISM provides a stronger modeling tool for studying polarization mechanisms driven by cognitive differences in complex social media ecosystems.

Abstract: Traditional agent-based models (ABMs) of opinion dynamics often fail to capture the psychological heterogeneity driving online polarization due to simplistic homogeneity assumptions. This limitation obscures the critical interplay between individual cognitive biases and information propagation, thereby hindering a mechanistic understanding of how ideological divides are amplified. To address this challenge, we introduce the Personality-Refracted Intelligent Simulation Model (PRISM), a hybrid framework coupling stochastic differential equations (SDE) for continuous emotional evolution with a personality-conditional partially observable Markov decision process (PC-POMDP) for discrete decision-making. In contrast to continuous trait approaches, PRISM assigns distinct Myers-Briggs Type Indicator (MBTI) based cognitive policies to multimodal large language model (MLLM) agents, initialized via data-driven priors from large-scale social media datasets. PRISM achieves superior personality consistency aligned with human ground truth, significantly outperforming standard homogeneous and Big Five benchmarks. This framework effectively replicates emergent phenomena such as rational suppression and affective resonance, offering a robust tool for analyzing complex social media ecosystems.

[5] Bias Beneath the Tone: Empirical Characterisation of Tone Bias in LLM-Driven UX Systems

Heet Bodara,Md Masum Mushfiq,Isma Farah Siddiqui

Main category: cs.CL

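TL;DR: This study examines implicit tone bias in large language models used in conversational systems. Using synthetic dialogue data and controllable generation, it finds that models show systematic tonal skew even under neutral prompts, and that classifiers can detect this bias effectively.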

Details

Motivation: To identify and quantify latent tone bias in LLM dialogue, in order to improve the fairness and trustworthiness of conversational systems and the ethics of emotion recognition.

Method: Uses controllable LLM generation to build neutral and emotionally guided dialogue datasets, labels them through weak supervision with a pretrained DistilBERT model, and trains several classifiers (including ensembles) to detect tone bias.

Result: Consistent tonal skew appears even in dialogues generated from neutral prompts. Ensemble classifiers reach macro F1 scores of up to 0.92 in detecting tone bias, indicating that the bias is systematic and measurable.

Conclusion: LLMs carry an implicit tone bias that may stem from their underlying conversational style. It should be controlled when designing conversational systems to achieve fairer, more trustworthy AI interaction.

Abstract: Large Language Models are increasingly used in conversational systems such as digital personal assistants, shaping how people interact with technology through language. While their responses often sound fluent and natural, they can also carry subtle tone biases such as sounding overly polite, cheerful, or cautious even when neutrality is expected. These tendencies can influence how users perceive trust, empathy, and fairness in dialogue. In this study, we explore tone bias as a hidden behavioral trait of large language models. The novelty of this research lies in the integration of controllable large language model based dialogue synthesis with tone classification models, enabling robust and ethical emotion recognition in personal assistant interactions. We created two synthetic dialogue datasets, one generated from neutral prompts and another explicitly guided to produce positive or negative tones. Surprisingly, even the neutral set showed consistent tonal skew, suggesting that bias may stem from the model's underlying conversational style. Using weak supervision through a pretrained DistilBERT model, we labeled tones and trained several classifiers to detect these patterns. Ensemble models achieved macro F1 scores up to 0.92, showing that tone bias is systematic, measurable, and relevant to designing fair and trustworthy conversational AI.

[6] Schoenfeld's Anatomy of Mathematical Reasoning by Language Models

Ming Li,Chenrui Fan,Yize Cheng,Soheil Feizi,Tianyi Zhou

Main category: cs.CL

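TL;DR: Proposes ThinkARM, a framework that uses Schoenfeld's Episode Theory to abstract the reasoning process of large models into functional steps, revealing reproducible thinking dynamics and structural differences in mathematical problem solving.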

Details

Motivation: Existing methods struggle to identify and analyze the cognitive structure of LLM reasoning beyond surface-level statistics, and an intermediate-scale analytical lens is missing.

Method: Adopts Schoenfeld's Episode Theory as an inductive framework and builds ThinkARM, which abstracts reasoning traces into functional steps such as analysis, exploration, implementation, and verification, then applies it to mathematical problem solving across different models with diagnostic analyses.

Result: Reasoning and non-reasoning models differ markedly in structure; exploration is a key branching point associated with correctness; efficiency-oriented methods selectively suppress evaluative feedback steps rather than simply shortening responses.

Conclusion: Episode-based representations make reasoning steps explicit and offer a new way to systematically analyze the structure, stability, and alteration of reasoning in modern language models.

Abstract: Large language models increasingly expose reasoning traces, yet their underlying cognitive structure and steps remain difficult to identify and analyze beyond surface-level statistics. We adopt Schoenfeld's Episode Theory as an inductive, intermediate-scale lens and introduce ThinkARM (Anatomy of Reasoning in Models), a scalable framework that explicitly abstracts reasoning traces into functional reasoning steps such as Analysis, Explore, Implement, Verify, etc. When applied to mathematical problem solving by diverse models, this abstraction reveals reproducible thinking dynamics and structural differences between reasoning and non-reasoning models, which are not apparent from token-level views. We further present two diagnostic case studies showing that exploration functions as a critical branching step associated with correctness, and that efficiency-oriented methods selectively suppress evaluative feedback steps rather than uniformly shortening responses. Together, our results demonstrate that episode-level representations make reasoning steps explicit, enabling systematic analysis of how reasoning is structured, stabilized, and altered in modern language models.

[7] Memory-T1: Reinforcement Learning for Temporal Reasoning in Multi-session Agents

Yiming Du,Baojun Wang,Yifan Xiang,Zhaowei Wang,Wenyu Huang,Boyang Xue,Bin Liang,Xingshan Zeng,Fei Mi,Haoli Bai,Lifeng Shang,Jeff Z. Pan,Yuxin Jiang,Kam-Fai Wong

Main category: cs.CL

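TL;DR: This paper proposes Memory-T1, a reinforcement-learning-based framework for time-aware memory selection that improves temporal reasoning over long dialogue histories. It significantly outperforms existing models on the Time-Dialog benchmark and is highly robust to noise.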

Details

Motivation: As dialogue histories grow longer and noisier, existing long-context models struggle to identify temporally relevant information, which degrades reasoning performance. A more effective time-aware information selection mechanism is needed.

Method: Proposes the Memory-T1 framework with a coarse-to-fine strategy: temporal and relevance filters first prune the dialogue history, then a reinforcement learning agent selects the key evidence sessions. Training uses a multi-level reward function that optimizes answer accuracy, evidence grounding, and temporal consistency (including session-level and utterance-level temporal alignment).

Result: On the Time-Dialog benchmark, Memory-T1 lifts a 7B model to an overall score of 67.0%, exceeding a 14B baseline by 10.2% and setting a new state of the art for open-source models. Ablations show that the temporal consistency and evidence grounding rewards jointly contribute a 15.0% gain. Robustness holds on noisy histories of up to 128k tokens.

Conclusion: Memory-T1's time-aware memory selection effectively improves temporal reasoning in long dialogues, with strong performance and stability on long inputs and under noise.

Abstract: Temporal reasoning over long, multi-session dialogues is a critical capability for conversational agents. However, existing works and our pilot study have shown that as dialogue histories grow in length and accumulate noise, current long-context models struggle to accurately identify temporally pertinent information, significantly impairing reasoning performance. To address this, we introduce Memory-T1, a framework that learns a time-aware memory selection policy using reinforcement learning (RL). It employs a coarse-to-fine strategy, first pruning the dialogue history into a candidate set using temporal and relevance filters, followed by an RL agent that selects the precise evidence sessions. The RL training is guided by a multi-level reward function optimizing (i) answer accuracy, (ii) evidence grounding, and (iii) temporal consistency. In particular, the temporal consistency reward provides a dense signal by evaluating alignment with the query time scope at both the session-level (chronological proximity) and the utterance-level (chronological fidelity), enabling the agent to resolve subtle chronological ambiguities. On the Time-Dialog benchmark, Memory-T1 boosts a 7B model to an overall score of 67.0%, establishing a new state-of-the-art performance for open-source models and outperforming a 14B baseline by 10.2%. Ablation studies show temporal consistency and evidence grounding rewards jointly contribute to a 15.0% performance gain. Moreover, Memory-T1 maintains robustness up to 128k tokens, where baseline models collapse, proving effectiveness against noise in extensive dialogue histories. The code and datasets are publicly available at https://github.com/Elvin-Yiming-Du/Memory-T1/

[8] A Novel Graph-Sequence Learning Model for Inductive Text Classification

Zuo Wang,Ye Yuan

Main category: cs.CL

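TL;DR: This paper proposes TextGSL, a new graph-sequence learning model for inductive text classification. By building text-level graphs with multiple edge types and adding Transformer layers, it fuses diverse structural and sequential information and improves text representation and classification performance.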

Details

Motivation: Existing GNN-based text classification methods do not fully account for the diverse structural information between word pairs (such as co-occurrence, syntax, and semantics), ignore sequence information, and have difficulty handling texts that contain new words and relations.

Method: Builds a single text-level graph per text, with multiple edge types set according to the different word-pair relations; designs an adaptive multi-edge message-passing mechanism to aggregate the diverse structural information; adds Transformer layers to capture sequential information in the text.

Result: Experiments on several benchmark datasets show that TextGSL achieves higher classification accuracy than several strong baselines.

Conclusion: By fusing multi-type graph structure with sequence information, TextGSL learns more discriminative text representations and significantly improves inductive text classification.

Abstract: Text classification plays an important role in various downstream text-related tasks, such as sentiment analysis, fake news detection, and public opinion analysis. Recently, text classification based on Graph Neural Networks (GNNs) has made significant progress due to their strong capabilities of structural relationship learning. However, these approaches still face two major limitations. First, these approaches fail to fully consider the diverse structural information across word pairs, e.g., co-occurrence, syntax, and semantics. Furthermore, they neglect sequence information in the text graph structure information learning module and cannot classify texts with new words and relations. In this paper, we propose a Novel Graph-Sequence Learning Model for Inductive Text Classification (TextGSL) to address the previously mentioned issues. More specifically, we construct a single text-level graph for all words in each text and establish different edge types based on the diverse relationships between word pairs. Building upon this, we design an adaptive multi-edge message-passing paradigm to aggregate diverse structural information between word pairs. Additionally, sequential information among text data can be captured by the proposed TextGSL through the incorporation of Transformer layers. Therefore, TextGSL can learn more discriminative text representations. TextGSL has been comprehensively compared with several strong baselines. The experimental results on diverse benchmarking datasets demonstrate that TextGSL outperforms these baselines in terms of accuracy.

[9] ABBEL: LLM Agents Acting through Belief Bottlenecks Expressed in Language

Aly Lidayan,Jakob Bjorner,Satvik Golechha,Kartik Goyal,Alane Suhr

Main category: cs.CL

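TL;DR: Proposes the ABBEL framework, which compresses long interaction histories into belief states expressed as natural language summaries, keeping memory use nearly constant in multi-step decision tasks. Reinforcement learning is then used to optimize belief generation and action selection, improving performance and reducing error propagation.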

Details

Motivation: As sequential decision-making tasks get longer, keeping the full interaction history becomes computationally impractical. An effective way to compress context is needed to cut memory cost while maintaining decision performance.

Method: Proposes the ABBEL framework, which replaces the long interaction history with a belief state in natural language. At each step the agent updates the prior belief with the latest observation to form a posterior belief, then selects an action from the posterior alone. LLMs are further post-trained with reinforcement learning, using belief grading and length penalties to optimize belief quality and compression.

Result: Evaluating frontier models in six multi-step environments shows that ABBEL keeps memory use nearly constant and produces interpretable beliefs, but suffers from error propagation caused by mistakes in belief updating. After RL optimization, ABBEL outperforms the full-context approach while using less memory.

Conclusion: ABBEL offers an efficient, interpretable context compression mechanism. Combined with reinforcement learning, it mitigates error propagation and achieves better multi-step decision performance with lower memory consumption.

Abstract: As the length of sequential decision-making tasks increases, it becomes computationally impractical to keep full interaction histories in context. We introduce a general framework for LLM agents to maintain concise contexts through multi-step interaction: Acting through Belief Bottlenecks Expressed in Language (ABBEL), and methods to further improve ABBEL agents with RL post-training. ABBEL replaces long multi-step interaction history by a belief state, i.e., a natural language summary of what has been discovered about task-relevant unknowns. Under ABBEL, at each step the agent first updates a prior belief with the most recent observation from the environment to form a posterior belief, then uses only the posterior to select an action. We systematically evaluate frontier models under ABBEL across six diverse multi-step environments, finding that ABBEL supports generating interpretable beliefs while maintaining near-constant memory use over interaction steps. However, bottleneck approaches are generally prone to error propagation, which we observe causing inferior performance when compared to the full context setting due to errors in belief updating. Therefore, we train LLMs to generate and act on beliefs within the ABBEL framework via reinforcement learning (RL). We experiment with belief grading, to reward higher quality beliefs, as well as belief length penalties to reward more compressed beliefs. Our experiments demonstrate the ability of RL to improve ABBEL's performance beyond the full context setting, while using less memory than contemporaneous approaches.
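The control flow described for entry [9] (update a prior belief with the latest observation, then act on the posterior alone) can be illustrated without any LLM. The sketch below is a toy stand-in, not the ABBEL implementation: the number-guessing environment and the functions `update_belief`, `select_action`, and `run_episode` are hypothetical, and simple rule-based code plays the role that an LLM plays in the paper. What it preserves is the bottleneck: the only state carried between steps is one short natural-language belief string.

```python
import re


def parse_bounds(belief):
    """Read the two integers out of a belief sentence."""
    low, high = map(int, re.findall(r"\d+", belief))
    return low, high


def update_belief(prior, observation):
    """Stand-in for the belief update: prior belief + latest observation -> posterior."""
    low, high = parse_bounds(prior)
    if observation is not None:
        guess, feedback = observation
        if feedback == "higher":
            low = max(low, guess + 1)
        elif feedback == "lower":
            high = min(high, guess - 1)
    return f"The secret is between {low} and {high}."


def select_action(posterior):
    """Stand-in for action selection: sees only the posterior belief, never the history."""
    low, high = parse_bounds(posterior)
    return (low + high) // 2


def run_episode(secret, low=1, high=100, max_steps=20):
    """Play one episode; returns (steps used, final belief), or (None, belief) on failure."""
    belief = f"The secret is between {low} and {high}."
    observation = None
    for step in range(1, max_steps + 1):
        belief = update_belief(belief, observation)  # prior -> posterior
        guess = select_action(belief)                # act on the posterior only
        if guess == secret:
            return step, belief
        observation = (guess, "higher" if secret > guess else "lower")
    return None, belief


steps, final_belief = run_episode(secret=37)  # guesses 50, 25, then 37
```

Because each observation is folded into the belief and then discarded, context size stays constant across steps; the flip side, as the abstract notes, is that any mistake in the update step is carried forward with no history available to correct it.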

[10] M$^3$KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation

Hyeongcheol Park,Jiyoung Seo,Jaewon Mun,Hogun Park,Wonmin Byeon,Sung June Kim,Hyeonsoo Im,JeungSub Lee,Sangpil Kim

Main category: cs.CL

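TL;DR: This paper proposes M$^3$KG-RAG, a retrieval-augmented generation framework built on a multi-hop multimodal knowledge graph. By constructing the context-rich M$^3$KG and using the GRASP retrieval-and-pruning mechanism, it improves the reasoning depth and answer accuracy of multimodal large models in the audio-visual domain.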

Details

Motivation: Audio-visual retrieval-augmented generation is held back by multimodal knowledge graphs with limited modality coverage and insufficient multi-hop connectivity, and by retrieval based only on embedding similarity, which returns redundant or off-topic knowledge. Both limit the reasoning ability of multimodal large models.

Method: Designs a lightweight multi-agent pipeline to construct a multi-hop multimodal knowledge graph (M$^3$KG), and proposes the GRASP framework, which performs query-aligned entity grounding, relevance assessment, and pruning of redundant context to support precise knowledge retrieval and selection.

Result: Experiments on multiple multimodal benchmarks show that M$^3$KG-RAG significantly outperforms existing methods on multimodal reasoning and entity grounding tasks.

Conclusion: By building a high-quality multi-hop knowledge graph and a fine-grained retrieval-and-pruning mechanism, M$^3$KG-RAG improves how effectively multimodal large language models use knowledge in audio-visual scenarios and makes their answers more trustworthy.

Abstract: Retrieval-Augmented Generation (RAG) has recently been extended to multimodal settings, connecting multimodal large language models (MLLMs) with vast corpora of external knowledge such as multimodal knowledge graphs (MMKGs). Despite their recent success, multimodal RAG in the audio-visual domain remains challenging due to 1) limited modality coverage and multi-hop connectivity of existing MMKGs, and 2) retrieval based solely on similarity in a shared multimodal embedding space, which fails to filter out off-topic or redundant knowledge. To address these limitations, we propose M$^3$KG-RAG, a Multi-hop Multimodal Knowledge Graph-enhanced RAG that retrieves query-aligned audio-visual knowledge from MMKGs, improving reasoning depth and answer faithfulness in MLLMs. Specifically, we devise a lightweight multi-agent pipeline to construct multi-hop MMKG (M$^3$KG), which contains context-enriched triplets of multimodal entities, enabling modality-wise retrieval based on input queries. Furthermore, we introduce GRASP (Grounded Retrieval And Selective Pruning), which ensures precise entity grounding to the query, evaluates answer-supporting relevance, and prunes redundant context to retain only knowledge essential for response generation. Extensive experiments across diverse multimodal benchmarks demonstrate that M$^3$KG-RAG significantly enhances MLLMs' multimodal reasoning and grounding over existing approaches.

[11] Multi-hop Reasoning via Early Knowledge Alignment

Yuxin Wang,Shicheng Fang,Bo Wang,Qi Luo,Xuanjing Huang,Yining Zheng,Xipeng Qiu

Main category: cs.CL

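TL;DR: This paper proposes the Early Knowledge Alignment (EKA) module, which improves iterative RAG systems by introducing retrieved knowledge before planning, raising retrieval precision, reducing reasoning errors, and improving efficiency.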

Details

Motivation: Existing iterative RAG systems decompose questions without considering information about the retrieval corpus, which makes retrieval and reasoning inefficient and prone to cascading errors.

Method: Proposes the EKA module, which aligns the LLM with contextually relevant retrieved knowledge before planning to strengthen the basis for reasoning, and analyzes from an entropy perspective how this reduces unproductive exploration.

Result: Experiments on six standard RAG datasets show that EKA significantly improves retrieval precision and overall performance, reduces cascading errors, and works as a training-free inference strategy that scales to large models.

Conclusion: Through early knowledge alignment, EKA strengthens the reasoning ability and efficiency of iterative RAG systems, highlights the interplay between structured reasoning and efficient exploration in reinforcement-learning-augmented frameworks, and advances the state of the art in this area.

Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for Large Language Models (LLMs) to address knowledge-intensive queries requiring domain-specific or up-to-date information. To handle complex multi-hop questions that are challenging for single-step retrieval, iterative RAG approaches incorporating reinforcement learning have been proposed. However, existing iterative RAG systems typically plan to decompose questions without leveraging information about the available retrieval corpus, leading to inefficient retrieval and reasoning chains that cascade into suboptimal performance. In this paper, we introduce Early Knowledge Alignment (EKA), a simple but effective module that aligns LLMs with the retrieval set before planning in iterative RAG systems, using contextually relevant retrieved knowledge. Extensive experiments on six standard RAG datasets demonstrate that by establishing a stronger reasoning foundation, EKA significantly improves retrieval precision, reduces cascading errors, and enhances both performance and efficiency. Our analysis from an entropy perspective demonstrates that incorporating early knowledge reduces unnecessary exploration during the reasoning process, enabling the model to focus more effectively on relevant information subsets. Moreover, EKA proves effective as a versatile, training-free inference strategy that scales seamlessly to large models. Generalization tests across diverse datasets and retrieval corpora confirm the robustness of our approach. Overall, EKA advances the state-of-the-art in iterative RAG systems while illuminating the critical interplay between structured reasoning and efficient exploration in reinforcement learning-augmented frameworks. The code is released on GitHub at https://github.com/yxzwang/EarlyKnowledgeAlignment.

[12] Retrieval-augmented Prompt Learning for Pre-trained Foundation Models

Xiang Chen,Yixin Ou,Quan Feng,Lei Li,Piji Li,Haibo Ye,Sheng-Jun Huang,Shuofei Qiao,Shumin Deng,Huajun Chen,Ningyu Zhang

Main category: cs.CL

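TL;DR: This paper proposes RetroPrompt, a new method that balances memorization and generalization in pre-trained foundation models by decoupling knowledge from memorization. It uses a publicly accessible knowledge base generated from the training data and adds a retrieval mechanism at the input, training, and inference stages to strengthen access to contextual information. Experiments show that RetroPrompt performs well in zero-shot and few-shot settings across a range of natural language processing and computer vision tasks, reduces reliance on rote memorization, and improves generalization.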

Details

Motivation: Existing prompt learning methods still follow a parametric learning paradigm, which can undermine the stability of generalization; they have difficulty making full use of atypical instances and tend to overfit shallow patterns when trained with full supervision on limited data. A new method that balances memorization and generalization is therefore needed.

Method: Proposes RetroPrompt, which decouples knowledge from mere memorization, uses a publicly accessible knowledge base built from the training data, and introduces a retrieval mechanism throughout input, training, and inference so that the model can actively retrieve relevant information from the corpus to enrich the input cues.

Result: Experiments on datasets from several NLP and CV tasks show that RetroPrompt outperforms existing methods in both zero-shot and few-shot settings. Analysis shows that it markedly reduces reliance on rote memorization and improves generalization.

Conclusion: By adding a retrieval mechanism and an external knowledge base, RetroPrompt balances memorization and generalization, strengthens the stability and performance of prompt learning, and suggests a new direction for multimodal few-shot learning.

Abstract: The pre-trained foundation models (PFMs) have become essential for facilitating large-scale multimodal learning. Researchers have effectively employed the "pre-train, prompt, and predict" paradigm through prompt learning to induce improved few-shot performance. However, prompt learning approaches for PFMs still follow a parametric learning paradigm. As such, the stability of generalization in memorization and rote learning can be compromised. More specifically, conventional prompt learning might face difficulties in fully utilizing atypical instances and avoiding overfitting to shallow patterns with limited data during the process of fully-supervised training. To overcome these constraints, we present our approach, named RetroPrompt, which aims to achieve a balance between memorization and generalization by decoupling knowledge from mere memorization. Unlike traditional prompting methods, RetroPrompt leverages a publicly accessible knowledge base generated from the training data and incorporates a retrieval mechanism throughout the input, training, and inference stages. This enables the model to actively retrieve relevant contextual information from the corpus, thereby enhancing the available cues. We conduct comprehensive experiments on a variety of datasets across natural language processing and computer vision tasks to demonstrate the superior performance of our proposed approach, RetroPrompt, in both zero-shot and few-shot scenarios. Through detailed analysis of memorization patterns, we observe that RetroPrompt effectively reduces the reliance on rote memorization, leading to enhanced generalization.

[13] Fun-Audio-Chat Technical Report

Qian Chen,Luyao Cheng,Chong Deng,Xiangang Li,Jiaqing Liu,Chao-Hong Tan,Wen Wang,Junhao Xu,Jieping Ye,Qinglin Zhang,Qiquan Zhang,Jingren Zhou

Main category: cs.CL

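TL;DR: Fun-Audio-Chat is a large audio language model that uses dual-resolution speech representations, Core-Cocktail Training, and related techniques to address the temporal resolution mismatch, high computational cost, and catastrophic forgetting found in existing speech-text models, and it performs strongly across a range of speech tasks.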

Details

Motivation: Existing joint speech-text models suffer from a temporal resolution mismatch between speech and text tokens, which leads to loss of semantic information, high computational cost, and catastrophic forgetting of text LLM knowledge, limiting their use in real voice interaction.

Method: Proposes Fun-Audio-Chat, which uses Dual-Resolution Speech Representations (DRSR): audio is processed efficiently at 5 Hz, and a Speech Refined Head generates high-quality speech at 25 Hz. Core-Cocktail Training, a two-stage fine-tuning strategy, mitigates catastrophic forgetting, and Multi-Task DPO Training strengthens robustness, speech understanding, and instruction following.

Result: Fun-Audio-Chat 8B and MoE 30B-A3B perform strongly on speech-to-text and speech-to-speech tasks, rank among the top similar-scale models on spoken QA benchmarks, and reach advanced levels in audio understanding, speech function calling, instruction following, and voice empathy. The full-duplex variant, Fun-Audio-Chat-Duplex, also performs well in real-time interaction.

Conclusion: Through an efficient architecture and a multi-stage post-training strategy, Fun-Audio-Chat balances the efficiency and quality of speech processing and avoids dependence on large-scale audio-text pre-training, offering a practical path to high-performance voice interaction systems.

Abstract: Recent advancements in joint speech-text models show great potential for seamless voice interactions. However, existing models face critical challenges: temporal resolution mismatch between speech tokens (25Hz) and text tokens (~3Hz) dilutes semantic information, incurs high computational costs, and causes catastrophic forgetting of text LLM knowledge. We introduce Fun-Audio-Chat, a Large Audio Language Model addressing these limitations via two innovations from our previous work DrVoice. First, Dual-Resolution Speech Representations (DRSR): the Shared LLM processes audio at efficient 5Hz (via token grouping), while the Speech Refined Head generates high-quality tokens at 25Hz, balancing efficiency (~50% GPU reduction) and quality. Second, Core-Cocktail Training, a two-stage fine-tuning with intermediate merging that mitigates catastrophic forgetting. We then apply Multi-Task DPO Training to enhance robustness, audio understanding, instruction-following and voice empathy. This multi-stage post-training enables Fun-Audio-Chat to retain text LLM knowledge while gaining powerful audio understanding, reasoning, and generation. Unlike recent LALMs requiring large-scale audio-text pre-training, Fun-Audio-Chat leverages pre-trained models and extensive post-training. Fun-Audio-Chat 8B and MoE 30B-A3B achieve competitive performance on Speech-to-Text and Speech-to-Speech tasks, ranking top among similar-scale models on Spoken QA benchmarks. They also achieve competitive to superior performance on Audio Understanding, Speech Function Calling, Instruction-Following and Voice Empathy. We develop Fun-Audio-Chat-Duplex, a full-duplex variant with strong performance on Spoken QA and full-duplex interactions. We open-source Fun-Audio-Chat-8B with training and inference code, and provide an interactive demo.
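The abstract for entry [13] says the shared LLM sees audio at 5 Hz "via token grouping" of a 25 Hz stream, which implies merging every five consecutive frames into one. The sketch below shows only that rate reduction. It is an assumption-laden illustration: the abstract does not specify the grouping operator, so mean pooling is used here as a placeholder, and the function name and toy frame vectors are hypothetical.

```python
def group_frames(frames, group_size=5):
    """Merge consecutive frame vectors into one vector per group by averaging.

    frames: list of equal-length lists of floats (one vector per 25 Hz frame).
    Mean pooling is an assumed placeholder for the unspecified grouping operator.
    A trailing partial group is averaged over the frames it actually contains.
    """
    grouped = []
    for start in range(0, len(frames), group_size):
        chunk = frames[start:start + group_size]
        dim = len(chunk[0])
        grouped.append([sum(frame[d] for frame in chunk) / len(chunk) for d in range(dim)])
    return grouped


# One second of hypothetical 25 Hz features with 2 dimensions per frame.
one_second = [[float(i), float(2 * i)] for i in range(25)]
low_rate = group_frames(one_second)  # 5 vectors, i.e. 5 Hz
```

Grouping by five cuts the sequence length the LLM must attend over by a factor of five, which is the source of the efficiency gain the abstract describes; the 25 Hz detail is then restored by a separate output head.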

[14] AI Security Beyond Core Domains: Resume Screening as a Case Study of Adversarial Vulnerabilities in Specialized LLM Applications

Honglin Mu,Jinghao Liu,Kaiyang Wan,Rui Xing,Xiuying Chen,Timothy Baldwin,Wanxiang Che

Main category: cs.CL

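TL;DR: This paper presents a benchmark for adversarial instruction attacks on large language models used in resume screening and evaluates prompt-based defenses and FIDS, a defense based on LoRA fine-tuning, finding that training-time defenses outperform inference-time defenses in both security and utility.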

Details

Motivation: LLMs are widely used for automated tasks but can be manipulated by adversarial instructions hidden in their input, a particular risk in applications that lack mature defenses, such as resume screening.

Method: Builds a benchmark for adversarial instruction attacks in resume screening, tests two defense mechanisms, prompt engineering and FIDS (Foreign Instruction Detection through Separation) based on LoRA adaptation, and compares their attack mitigation and false rejection rates.

Result: Attack success rates exceed 80% for some attack types. Prompt-based defenses cut attacks by 10.1% but raise false rejections by 12.5%; FIDS cuts attacks by 15.4% with only a 10.4% increase in false rejections; combining the two achieves a 26.3% attack reduction.

Conclusion: Training-time defenses that adapt the model (such as LoRA fine-tuning) are more effective than inference-time prompt defenses, improving security while better preserving model utility.

Abstract: Large Language Models (LLMs) excel at text comprehension and generation, making them ideal for automated tasks like code review and content moderation. However, our research identifies a vulnerability: LLMs can be manipulated by "adversarial instructions" hidden in input data, such as resumes or code, causing them to deviate from their intended task. Notably, while defenses may exist for mature domains such as code review, they are often absent in other common applications such as resume screening and peer review. This paper introduces a benchmark to assess this vulnerability in resume screening, revealing attack success rates exceeding 80% for certain attack types. We evaluate two defense mechanisms: prompt-based defenses achieve 10.1% attack reduction with 12.5% false rejection increase, while our proposed FIDS (Foreign Instruction Detection through Separation) using LoRA adaptation achieves 15.4% attack reduction with 10.4% false rejection increase. The combined approach provides 26.3% attack reduction, demonstrating that training-time defenses outperform inference-time mitigations in both security and utility preservation.

[15] FaithLens: Detecting and Explaining Faithfulness Hallucination

Shuzheng Si,Qingyi Wang,Haozhe Zhao,Yuzhuo Bai,Guanqiao Chen,Kangyang Luo,Gang Chen,Fanchao Qi,Minjia Zhang,Baobao Chang,Maosong Sun

Main category: cs.CL

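TL;DR: This paper proposes FaithLens, an efficient, low-cost model for detecting faithfulness hallucinations that jointly provides binary predictions and matching explanations to improve trustworthiness.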

Details

Motivation: Identifying faithfulness hallucinations in LLM output is essential for real-world applications such as retrieval-augmented generation and summarization. Existing methods struggle to balance cost, explainability, and performance.

Method: First synthesizes training data with explanations using advanced LLMs and applies a strict data filtering strategy to ensure label correctness, explanation quality, and data diversity. The model is then fine-tuned on this curated data as a cold start and further optimized with rule-based reinforcement learning that rewards both prediction correctness and explanation quality.

Result: Experiments on 12 diverse tasks show that the 8B-parameter FaithLens outperforms advanced models such as GPT-4.1 and o3 and produces high-quality explanations.

Conclusion: FaithLens achieves a good balance of trustworthiness, efficiency, and effectiveness in faithfulness hallucination detection and has strong practical value.

Abstract: Recognizing whether outputs from large language models (LLMs) contain faithfulness hallucination is crucial for real-world applications, e.g., retrieval-augmented generation and summarization. In this paper, we introduce FaithLens, a cost-efficient and effective faithfulness hallucination detection model that can jointly provide binary predictions and corresponding explanations to improve trustworthiness. To achieve this, we first synthesize training data with explanations via advanced LLMs and apply a well-defined data filtering strategy to ensure label correctness, explanation quality, and data diversity. Subsequently, we fine-tune the model on these well-curated training data as a cold start and further optimize it with rule-based reinforcement learning, using rewards for both prediction correctness and explanation quality. Results on 12 diverse tasks show that the 8B-parameter FaithLens outperforms advanced models such as GPT-4.1 and o3. Also, FaithLens can produce high-quality explanations, delivering a distinctive balance of trustworthiness, efficiency, and effectiveness.

[16] Corpus of Cross-lingual Dialogues with Minutes and Detection of Misunderstandings

Marko Čechovič,Natália Komorníková,Dominik Macháček,Ondřej Bojar

Main category: cs.CL

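TL;DR: This paper introduces a cross-lingual dialogue corpus for evaluating automatic simultaneous speech translation systems, comprising 5 hours of speech recordings in 12 languages with transcripts and translations, and explores automatic detection of misunderstandings in cross-lingual communication with large language models; the Gemini model achieves relatively high recall in identifying misunderstandings.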

Details

Motivation: Evaluating automatic systems that provide real-time speech translation between individuals with no common language requires a realistic, versatile evaluation corpus. Method: The authors build a cross-lingual dialogue corpus containing speech recordings, ASR and manually corrected transcripts, and translations, and write summaries (minutes) of the meetings; they also annotate misunderstandings manually and test automatic detection with large language models such as Gemini. Result: The corpus contains 5 hours of speech, with automatic and corrected translations from 12 languages into English; the Gemini model reaches 77% recall and 47% precision on misunderstanding detection. Conclusion: The corpus is a valuable resource for research on cross-lingual dialogue and translation, and large language models show promise for automatically identifying misunderstandings, although precision leaves room for improvement. Abstract: Speech processing and translation technology have the potential to facilitate meetings of individuals who do not share any common language. To evaluate automatic systems for such a task, a versatile and realistic evaluation corpus is needed. Therefore, we create and present a corpus of cross-lingual dialogues between individuals without a common language who were facilitated by automatic simultaneous speech translation. The corpus consists of 5 hours of speech recordings with ASR and gold transcripts in 12 original languages and automatic and corrected translations into English. For the purposes of research into cross-lingual summarization, our corpus also includes written summaries (minutes) of the meetings. Moreover, we propose automatic detection of misunderstandings. For an overview of this task and its complexity, we attempt to quantify misunderstandings in cross-lingual meetings. We annotate misunderstandings manually and also test the ability of current large language models to detect them automatically. The results show that the Gemini model is able to identify text spans with misunderstandings with recall of 77% and precision of 47%.
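
To put the reported recall and precision on a single scale, the two figures can be combined into an F1 score. The snippet below is only an illustrative calculation: the abstract itself does not report an F1 value, and the helper name `f1_score` is introduced here for the example.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract for the Gemini model.
# The resulting F1 is derived here and is not a number stated by the authors.
gemini_f1 = f1_score(precision=0.47, recall=0.77)
print(f"F1 = {gemini_f1:.3f}")  # prints "F1 = 0.584"
```

The harmonic mean is pulled toward the weaker of the two figures, which is why the result sits closer to the 47% precision than to the 77% recall.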

[17] SlideTailor: Personalized Presentation Slide Generation for Scientific Papers

Wenzheng Zeng,Mingyu Ouyang,Langyuan Cui,Hwee Tou Ng

Main category: cs.CL

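TL;DR: This paper proposes SlideTailor, a paper-to-slides generation framework driven by user preference examples; it implicitly learns a user's content and visual style preferences to produce personalized, editable slides, and introduces a chain-of-speech mechanism to improve consistency between slide content and oral narration.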

Details

Motivation: Existing automatic slide generation methods yield suboptimal results because they fail to capture individual user preferences, and users find it hard to state complex preferences explicitly in text. A new method is therefore needed that implicitly learns and generalizes preferences from natural, easy-to-provide inputs. Method: The SlideTailor framework takes a user-provided paper-slides example pair and a visual template as input and extracts content and visual style preferences from them; it uses a human behavior-inspired agentic architecture for progressive generation and introduces a chain-of-speech mechanism to align slide content with the planned oral narration. Result: The authors construct the first benchmark dataset for preference-conditioned slide generation and design interpretable evaluation metrics; experiments show that the method outperforms existing approaches in generation quality, user alignment, and downstream applications such as video presentations. Conclusion: SlideTailor effectively learns user preferences from implicit inputs and produces high-quality, personalized slides, moving content generation systems closer to user needs. Abstract: Automatic presentation slide generation can greatly streamline content creation. However, since preferences of each user may vary, existing under-specified formulations often lead to suboptimal results that fail to align with individual user needs. We introduce a novel task that conditions paper-to-slides generation on user-specified preferences. We propose a human behavior-inspired agentic framework, SlideTailor, that progressively generates editable slides in a user-aligned manner. Instead of requiring users to write their preferences in detailed textual form, our system only asks for a paper-slides example pair and a visual template - natural and easy-to-provide artifacts that implicitly encode rich user preferences across content and visual style. Despite the implicit and unlabeled nature of these inputs, our framework effectively distills and generalizes the preferences to guide customized slide generation. We also introduce a novel chain-of-speech mechanism to align slide content with planned oral narration. Such a design significantly enhances the quality of generated slides and enables downstream applications like video presentations. To support this new task, we construct a benchmark dataset that captures diverse user preferences, with carefully designed interpretable metrics for robust evaluation. Extensive experiments demonstrate the effectiveness of our framework.

[18] AprielGuard

Jaykumar Kasundra,Anjaneya Praharaj,Sourabh Surana,Lakshmi Sirisha Chodisetty,Sourav Sharma,Abhigya Verma,Abhishek Bhardwaj,Debasish Kanhar,Aakash Bhagat,Khalil Slimi,Seganrasan Subramanian,Sathwik Tejaswi Madhusudhan,Ranga Prasad Chenna,Srinivas Sunkara

Main category: cs.CL

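TL;DR: AprielGuard is an 8B-parameter unified safeguard model designed to address both safety risks and adversarial threats in large language models, and it performs strongly in multi-turn conversation and complex reasoning scenarios.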

Details

Motivation: Existing safety tools usually handle safety risks and adversarial threats separately, which limits robustness and generalization, so a unified safeguard framework is needed. Method: AprielGuard adopts a unified taxonomy and learning framework, is trained on a mix of open-source and synthetic data covering single-turn prompts, multi-turn conversations, and agentic workflows, and incorporates structured reasoning traces to improve interpretability. Result: Across multiple public and proprietary benchmarks, AprielGuard outperforms existing open-source guard models such as Llama-Guard and Granite Guardian in detecting harmful content and adversarial manipulation, particularly in multi-step reasoning scenarios. Conclusion: AprielGuard provides unified, effective protection against LLM safety risks and adversarial threats, and its open release promotes reproducible and transparent safety research. Abstract: Safeguarding large language models (LLMs) against unsafe or adversarial behavior is critical as they are increasingly deployed in conversational and agentic settings. Existing moderation tools often treat safety risks (e.g. toxicity, bias) and adversarial threats (e.g. prompt injections, jailbreaks) as separate problems, limiting their robustness and generalizability. We introduce AprielGuard, an 8B parameter safeguard model that unifies these dimensions within a single taxonomy and learning framework. AprielGuard is trained on a diverse mix of open and synthetic data covering standalone prompts, multi-turn conversations, and agentic workflows, augmented with structured reasoning traces to improve interpretability. Across multiple public and proprietary benchmarks, AprielGuard achieves strong performance in detecting harmful content and adversarial manipulations, outperforming existing open-source guardrails such as Llama-Guard and Granite Guardian, particularly in multi-step and reasoning-intensive scenarios. By releasing the model, we aim to advance transparent and reproducible research on reliable safeguards for LLMs.

[19] Patterns vs. Patients: Evaluating LLMs against Mental Health Professionals on Personality Disorder Diagnosis through First-Person Narratives

Karolina Drożdż,Kacper Dudzic,Anna Sterna,Marcin Moskalewicz

Main category: cs.CL

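TL;DR: This study presents the first comparison of state-of-the-art large language models (LLMs) and mental health professionals in diagnosing Borderline (BPD) and Narcissistic (NPD) Personality Disorders from Polish-language autobiographical narratives, finding that top LLMs exceed human experts in overall diagnostic accuracy but perform markedly worse on NPD.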

Details

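Motivation: As reliance on large language models for psychiatric self-assessment grows, their ability to interpret qualitative patient narratives and reach reliable diagnoses needs to be evaluated, especially for complex personality disorders. Method: Using Polish-language first-person autobiographical narratives, the study compares the accuracy of state-of-the-art LLMs (such as Gemini Pro) and human mental health professionals in diagnosing BPD and NPD, and assesses differences in reasoning and style through F1 scores and qualitative analysis. Result: Gemini Pro models reached 65.48% overall diagnostic accuracy versus 43.57% for human experts (21.91 percentage points higher); on BPD, models and humans performed similarly (F1 of 83.4 and 80.0, respectively), but models severely underdiagnosed NPD (F1 of 6.7 vs. 50.0) and showed a reluctance toward the value-laden term "narcissism"; model outputs were confident and elaborate, emphasizing pattern recognition, whereas human experts were more cautious and concise, attending to the patient's sense of self and temporal experience. Conclusion: Although LLMs are highly capable of parsing complex clinical narratives, they still show significant bias and reliability problems when diagnosing sensitive personality disorders, and particular caution is needed with diagnostic labels that carry moral judgment. Abstract: Growing reliance on LLMs for psychiatric self-assessment raises questions about their ability to interpret qualitative patient narratives. We present the first direct comparison between state-of-the-art LLMs and mental health professionals in diagnosing Borderline (BPD) and Narcissistic (NPD) Personality Disorders utilizing Polish-language first-person autobiographical accounts. We show that the top-performing Gemini Pro models surpassed human professionals in overall diagnostic accuracy by 21.91 percentage points (65.48% vs. 43.57%). While both models and human experts excelled at identifying BPD (F1 = 83.4 and F1 = 80.0, respectively), models severely underdiagnosed NPD (F1 = 6.7 vs. 50.0), showing a reluctance toward the value-laden term "narcissism." Qualitatively, models provided confident, elaborate justifications focused on patterns and formal categories, while human experts remained concise and cautious, emphasizing the patient's sense of self and temporal experience. Our findings demonstrate that while LLMs are highly competent at interpreting complex first-person clinical data, they remain subject to critical reliability and bias issues.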

[20] SpidR: Learning Fast and Stable Linguistic Units for Spoken Language Models Without Supervision

Maxime Poli,Mahi Luthra,Youssef Benchekroun,Yosuke Higuchi,Martin Gleize,Jiayi Shen,Robin Algayres,Yu-An Chung,Mido Assran,Juan Pino,Emmanuel Dupoux

Main category: cs.CL

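TL;DR: This paper presents SpidR, a self-supervised speech representation model that learns phonetically rich representations from raw waveforms through masked prediction, self-distillation, and online clustering, substantially improving textless spoken language modeling while greatly shortening pretraining time.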

Details

Motivation: Learning language directly from speech without textual intermediates requires speech representations rich in semantic and phonetic information to support textless language modeling. Method: SpidR is trained on raw waveforms with masked prediction combined with self-distillation and online clustering; the intermediate layers of the student model predict cluster assignments derived from the teacher's intermediate layers, improving clustering stability and codebook quality. Result: SpidR outperforms wav2vec 2.0, HuBERT, WavLM, and DinoSR on downstream language modeling tasks (sWUGGY, sBLIMP, tSC); the study validates the correlation between speech unit quality metrics (ABX, PNMI) and language modeling performance; pretraining takes only one day on 16 GPUs, far less than HuBERT's one week. Conclusion: SpidR efficiently learns high-quality speech representations suited to textless spoken language modeling, with fast training and low resource use, advancing text-free spoken language models. Abstract: The parallel advances in language modeling and speech representation learning have raised the prospect of learning language directly from speech without textual intermediates. This requires extracting semantic representations directly from speech. Our contributions are threefold. First, we introduce SpidR, a self-supervised speech representation model that efficiently learns representations with highly accessible phonetic information, which makes it particularly suited for textless spoken language modeling. It is trained on raw waveforms using a masked prediction objective combined with self-distillation and online clustering. The intermediate layers of the student model learn to predict assignments derived from the teacher's intermediate layers. This learning objective stabilizes the online clustering procedure compared to previous approaches, resulting in higher quality codebooks. SpidR outperforms wav2vec 2.0, HuBERT, WavLM, and DinoSR on downstream language modeling benchmarks (sWUGGY, sBLIMP, tSC). Second, we systematically evaluate across models and layers the correlation between speech unit quality (ABX, PNMI) and language modeling performance, validating these metrics as reliable proxies. Finally, SpidR significantly reduces pretraining time compared to HuBERT, requiring only one day of pretraining on 16 GPUs, instead of a week. This speedup is enabled by the pretraining method and an efficient codebase, which allows faster iteration and easier experimentation. We open-source the training code and model checkpoints at https://github.com/facebookresearch/spidr.

[21] Can LLMs Solve My Grandma's Riddle? Evaluating Multilingual Large Language Models on Reasoning Traditional Bangla Tricky Riddles

Nurul Labib Sayeedi,Md. Faiyaz Abdullah Sayeedi,Khushnur Binte Jahangir,Swakkhar Shatabda,Sarah Masud Preum

Main category: cs.CL

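TL;DR: This paper presents BanglaRiddleEval, a new benchmark of 1,244 Bangla riddles for evaluating large language models on low-resource, culturally grounded, figurative reasoning. Current models perform far below human level, showing that the area remains challenging.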

Details

Motivation: To explore the reasoning ability of large language models in low-resource, culturally grounded, and figurative settings, particularly in lower-resource languages such as Bangla. Method: The authors build BanglaRiddleEval, a benchmark of 4,976 riddle-task instances, use an LLM-based pipeline to generate Chain-of-Thought explanations, distractors, and ambiguity annotations, and evaluate a range of open-source and closed-source models under different prompting strategies. Result: On generative QA, semantic overlap is moderate but correctness is low; multiple-choice accuracy peaks at only about 56% (against an 83% human baseline); ambiguity resolution ranges from 26% to 68%; and high-quality explanations come only from the strongest models. Conclusion: Current LLMs capture only some of the cues needed to understand Bangla riddles and remain far from human-level performance, making BanglaRiddleEval a challenging new benchmark for figurative reasoning in low-resource languages. Abstract: Large Language Models (LLMs) show impressive performance on many NLP benchmarks, yet their ability to reason in figurative, culturally grounded, and low-resource settings remains underexplored. We address this gap for Bangla by introducing BanglaRiddleEval, a benchmark of 1,244 traditional Bangla riddles instantiated across four tasks (4,976 riddle-task artifacts in total). Using an LLM-based pipeline, we generate Chain-of-Thought explanations, semantically coherent distractors, and fine-grained ambiguity annotations, and evaluate a diverse suite of open-source and closed-source models under different prompting strategies. Models achieve moderate semantic overlap on generative QA but low correctness, MCQ accuracy peaks at only about 56% versus an 83% human baseline, and ambiguity resolution ranges from roughly 26% to 68%, with high-quality explanations confined to the strongest models. These results show that current LLMs capture some cues needed for Bangla riddle reasoning but remain far from human-level performance, establishing BanglaRiddleEval as a challenging new benchmark for low-resource figurative reasoning. All data, code, and evaluation scripts are available on GitHub: https://github.com/Labib1610/BanglaRiddleEval.

[22] Multi-LLM Thematic Analysis with Dual Reliability Metrics: Combining Cohen's Kappa and Semantic Similarity for Qualitative Research Validation

Nilesh Jain,Seyi Adeyinka,Leor Roseman,Aza Allsop

Main category: cs.CL

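TL;DR: This paper proposes a multi-perspective validation framework for LLM-based thematic analysis that combines ensemble validation with dual reliability metrics (Cohen's Kappa and cosine similarity), supports flexible configuration and consensus theme extraction, and is validated on a psychedelic art therapy interview transcript.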

Details

Motivation: Inter-rater agreement methods in traditional qualitative research depend on multiple human coders, are time-consuming, and often yield only moderate consistency, so a more efficient and reliable automated solution is needed. Method: The authors propose a multi-perspective validation framework that combines results from multiple LLM runs, using Cohen's Kappa to measure coding agreement and cosine similarity to assess semantic consistency; it supports configuration of the number of seeds, temperature, and prompt structure, and extracts consensus themes from JSON output. Result: Experiments on three leading LLMs (Gemini, GPT-4o, Claude) show that all models reach high agreement (κ > 0.80), with Gemini performing best (κ = 0.907, cosine similarity 95.3%), and consensus themes are successfully extracted across runs. Conclusion: The framework provides a transparent, configurable, and structure-agnostic methodological foundation for AI-assisted qualitative research, improving the reliability and reproducibility of thematic analysis. Abstract: Qualitative research faces a critical reliability challenge: traditional inter-rater agreement methods require multiple human coders, are time-intensive, and often yield moderate consistency. We present a multi-perspective validation framework for LLM-based thematic analysis that combines ensemble validation with dual reliability metrics: Cohen's Kappa ($κ$) for inter-rater agreement and cosine similarity for semantic consistency. Our framework enables configurable analysis parameters (1-6 seeds, temperature 0.0-2.0), supports custom prompt structures with variable substitution, and provides consensus theme extraction across any JSON format. As proof-of-concept, we evaluate three leading LLMs (Gemini 2.5 Pro, GPT-4o, Claude 3.5 Sonnet) on a psychedelic art therapy interview transcript, conducting six independent runs per model. Results demonstrate Gemini achieves highest reliability ($κ= 0.907$, cosine=95.3%), followed by GPT-4o ($κ= 0.853$, cosine=92.6%) and Claude ($κ= 0.842$, cosine=92.1%). All three models achieve high agreement ($κ> 0.80$), validating the multi-run ensemble approach. The framework successfully extracts consensus themes across runs, with Gemini identifying 6 consensus themes (50-83% consistency), GPT-4o identifying 5 themes, and Claude identifying 4 themes. Our open-source implementation provides researchers with transparent reliability metrics, flexible configuration, and structure-agnostic consensus extraction, establishing methodological foundations for reliable AI-assisted qualitative research.
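
As a rough illustration of the two reliability metrics named in the abstract, the sketch below computes Cohen's Kappa between the theme labels of two runs and the cosine similarity between two vectors. It is not the authors' implementation: the theme labels, the toy vectors, and the choice to compare exactly two runs are assumptions made for the example, and the abstract does not say how the six runs per model are paired or how texts are embedded.

```python
import math
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both runs.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each run's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_expected == 1.0:  # both runs used one identical label throughout
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical theme labels assigned to six transcript segments by two runs.
run_a = ["grief", "hope", "grief", "awe", "hope", "awe"]
run_b = ["grief", "hope", "grief", "awe", "awe", "awe"]
kappa = cohens_kappa(run_a, run_b)
print(f"kappa = {kappa:.2f}")  # prints "kappa = 0.75"

# Toy vectors standing in for embeddings of two theme descriptions.
print(cosine_similarity([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]))  # identical direction
```

Kappa corrects raw agreement for agreement expected by chance, which is why five matches out of six (83%) yield 0.75 here rather than 0.83.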

[23] Sentiment-Aware Extractive and Abstractive Summarization for Unstructured Text Mining

Junyi Liu,Stanley Kok

Main category: cs.CL

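TL;DR: This paper proposes a sentiment-aware text summarization framework that integrates sentiment signals into extractive (TextRank) and abstractive (UniLM) methods to better capture emotional detail and thematic relevance in informal, emotion-rich social media text.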

Details

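Motivation: Existing summarization methods are optimized mainly for structured news data and handle noisy, informal user-generated social media content poorly; few studies integrate sentiment modeling into short-text summarization, even though sentiment information is critical for information systems tasks such as brand monitoring and market analysis. Method: The authors propose a sentiment-aware summarization framework that extends TextRank and UniLM by embedding sentiment signals into sentence ranking and generation, capturing both emotional nuance and thematic relevance. Result: Experiments show that the method outperforms traditional summarization models in sentiment preservation and thematic coherence and produces more concise, sentiment-rich summaries. Conclusion: The proposed framework improves the quality of sentiment-aware summaries of short user-generated texts and supports timely intervention and strategic decision-making in dynamic online environments. Abstract: With the rapid growth of unstructured data from social media, reviews, and forums, text mining has become essential in Information Systems (IS) for extracting actionable insights. Summarization can condense fragmented, emotion-rich posts, but existing methods, optimized for structured news, struggle with noisy, informal content. Emotional cues are critical for IS tasks such as brand monitoring and market analysis, yet few studies integrate sentiment modeling into summarization of short user-generated texts. We propose a sentiment-aware framework extending extractive (TextRank) and abstractive (UniLM) approaches by embedding sentiment signals into ranking and generation processes. This dual design improves the capture of emotional nuances and thematic relevance, producing concise, sentiment-enriched summaries that enhance timely interventions and strategic decision-making in dynamic online environments.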

[24] Step-DeepResearch Technical Report

Chen Hu,Haikuo Du,Heng Wang,Lin Lin,Mingrui Chen,Peng Liu,Ruihang Miao,Tianchi Yue,Wang You,Wei Ji,Wei Yuan,Wenjin Deng,Xiaojian Yuan,Xiaoyun Zhang,Xiangyu Liu,Xikai Liu,Yanming Xu,Yicheng Cao,Yifei Zhang,Yongyao Wang,Yubo Shu,Yurong Zhang,Yuxiang Zhang,Zheng Gong,Zhichao Chang,Binyan Li,Dan Ma,Furong Jia,Hongyuan Wang,Jiayu Liu,Jing Bai,Junlan Liu,Manjiao Liu,Na Wang,Qiuping Wu,Qinxin Du,Shiwei Li,Wen Sun,Yifeng Gong,Yonglin Chen,Yuling Zhao,Yuxuan Lin,Ziqi Ren,Zixuan Wang,Aihu Zhang,Brian Li,Buyun Ma,Kang An,Li Xie,Mingliang Li,Pan Li,Shidong Yang,Xi Chen,Xiaojia Liu,Yuchu Luo,Yuan Song,YuanHao Ding,Yuanwei Liang,Zexi Li,Zhaoning Zhang,Zixin Zhang,Binxing Jiao,Daxin Jiang,Jiansheng Chen,Jing Li,Xiangyu Zhang,Yibo Zhu

Main category: cs.CL

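TL;DR: This paper presents Step-DeepResearch, an end-to-end agent model for deep research that uses a data synthesis strategy based on atomic capabilities and a progressive training recipe to give a medium-sized model strong performance on open-ended research tasks at low cost, and it builds ADR-Bench, a deep research benchmark for the Chinese domain.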

Details

Motivation: Existing academic benchmarks (such as BrowseComp) fall short of real-world demands for open-ended deep research, particularly in intent recognition, long-horizon decision-making, and cross-source verification, and the Chinese domain lacks a suitable evaluation framework. Method: Step-DeepResearch uses a data synthesis strategy based on atomic capabilities to strengthen planning and report writing, combined with a progressive training path from agentic mid-training to SFT and RL, and adds a Checklist-style Judger to improve robustness; the authors also build ADR-Bench, a Chinese deep research benchmark. Result: Step-DeepResearch (32B) scores 61.4% on the Scale AI Research Rubrics; on ADR-Bench it significantly outperforms comparable models and approaches state-of-the-art closed-source models such as OpenAI and Gemini DeepResearch. Conclusion: Refined training strategies let medium-sized models reach expert-level performance on deep research tasks with industry-leading cost-efficiency, underscoring the importance of an efficient training framework. Abstract: As LLMs shift toward autonomous agents, Deep Research has emerged as a pivotal metric. However, existing academic benchmarks like BrowseComp often fail to meet real-world demands for open-ended research, which requires robust skills in intent recognition, long-horizon decision-making, and cross-source verification. To address this, we introduce Step-DeepResearch, a cost-effective, end-to-end agent. We propose a Data Synthesis Strategy Based on Atomic Capabilities to reinforce planning and report writing, combined with a progressive training path from agentic mid-training to SFT and RL. Enhanced by a Checklist-style Judger, this approach significantly improves robustness. Furthermore, to bridge the evaluation gap in the Chinese domain, we establish ADR-Bench for realistic deep research scenarios. Experimental results show that Step-DeepResearch (32B) scores 61.4% on Scale AI Research Rubrics. On ADR-Bench, it significantly outperforms comparable models and rivals SOTA closed-source models like OpenAI and Gemini DeepResearch. These findings prove that refined training enables medium-sized models to achieve expert-level capabilities at industry-leading cost-efficiency.

[25] Distilling to Hybrid Attention Models via KL-Guided Layer Selection

Yanhong Li,Songlin Yang,Shawn Tan,Mayank Mishra,Rameswar Panda,Jiawei Zhou,Yoon Kim

Main category: cs.CL

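TL;DR: Proposes a simple, efficient method based on layer importance scores for choosing which layers to convert to linear attention when distilling a pretrained Transformer, outperforming existing layer selection strategies.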

Details

Motivation: Improving LLM inference efficiency without pretraining from scratch requires an effective way to choose which softmax attention layers to convert to linear attention. Method: Layer importance scores computed from a small amount of training on generic text data drive layer selection, and distillation follows the RADLADS pipeline: attention weight transfer, hidden state alignment, KL-based distribution matching, and finetuning. Result: The method selects layers better than heuristics that interleave uniformly and than more involved approaches that rely on specialized diagnostic datasets. Conclusion: Lightweight score-based layer selection combined with a standard distillation pipeline efficiently produces hybrid softmax and linear attention architectures. Abstract: Distilling pretrained softmax attention Transformers into more efficient hybrid architectures that interleave softmax and linear attention layers is a promising approach for improving the inference efficiency of LLMs without requiring expensive pretraining from scratch. A critical factor in the conversion process is layer selection, i.e., deciding on which layers to convert to linear attention variants. This paper describes a simple and efficient recipe for layer selection that uses layer importance scores derived from a small amount of training on generic text data. Once the layers have been selected we use a recent pipeline for the distillation process itself (RADLADS; Goldstein et al., 2025), which consists of attention weight transfer, hidden state alignment, KL-based distribution matching, followed by a small amount of finetuning. We find that this approach is more effective than existing approaches for layer selection, including heuristics that uniformly interleave linear attentions based on a fixed ratio, as well as more involved approaches that rely on specialized diagnostic datasets.

[26] Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits

Amirhosein Ghasemabadi,Di Niu

Main category: cs.CL

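TL;DR: This paper presents Gnosis, a lightweight self-awareness mechanism that lets frozen large language models intrinsically predict their own errors by decoding signals from hidden states and attention patterns, without external supervision.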

Details

Motivation: Large language models generate fluent, complex outputs but often fail to recognize their own mistakes and hallucinations. Existing methods rely on external judges or multiple samples, which are computationally expensive and correlate weakly with true correctness. The goal is therefore to let an LLM judge the correctness of its output by inspecting its internal states during inference. Method: Gnosis passively observes the model's hidden states and attention traces, compresses them into fixed-budget descriptors, and trains a small module of only about 5M parameters to predict output correctness. It adds almost no inference cost and operates independently of sequence length. Result: Across math reasoning, open-domain question answering, and academic knowledge benchmarks, and over frozen models from 1.7B to 20B parameters, Gnosis outperforms strong internal baselines and large external judges in both accuracy and calibration; it also generalizes zero-shot to partial generations, enabling early detection of failing trajectories and compute-aware control. Conclusion: The generation process of large language models itself contains reliable correctness cues, which can be extracted efficiently for intrinsic self-verification without extra sampling or external supervision. Abstract: Large language models (LLMs) generate fluent and complex outputs but often fail to recognize their own mistakes and hallucinations. Existing approaches typically rely on external judges, multi-sample consistency, or text-based self-critique, which incur additional compute or correlate weakly with true correctness. We ask: can LLMs predict their own failures by inspecting internal states during inference? We introduce Gnosis, a lightweight self-awareness mechanism that enables frozen LLMs to perform intrinsic self-verification by decoding signals from hidden states and attention patterns. Gnosis passively observes internal traces, compresses them into fixed-budget descriptors, and predicts correctness with negligible inference cost, adding only ~5M parameters and operating independently of sequence length. Across math reasoning, open-domain question answering, and academic knowledge benchmarks, and over frozen backbones ranging from 1.7B to 20B parameters, Gnosis consistently outperforms strong internal baselines and large external judges in both accuracy and calibration. Moreover, it generalizes zero-shot to partial generations, enabling early detection of failing trajectories and compute-aware control. These results show that reliable correctness cues are intrinsic to the generation process and can be extracted efficiently without external supervision.

[27] Cube Bench: A Benchmark for Spatial Visual Reasoning in MLLMs

Dhruv Anand,Ehsan Shareghi

Main category: cs.CL

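TL;DR: This paper presents Cube Bench, a Rubik's-cube benchmark for evaluating spatial and sequential reasoning in multimodal large language models; it decomposes performance into five key skills, and experiments reveal that existing models degrade sharply as complexity increases.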

Details

Motivation: Systematically evaluating spatial and sequential reasoning in multimodal large language models calls for a well-structured, reproducible benchmark task, and the Rubik's cube is a natural choice given its complex logical and spatial properties. Method: The authors design Cube Bench, a Rubik's-cube benchmark with five subtasks, using the same scrambled states, identical prompts and parsers, and distance-to-solved as the evaluation metric; they compare seven multimodal large language models and test the effect of a self-correction mechanism. Result: All models lose accuracy sharply as scramble depth increases; trajectories that stall or diverge rarely recover; high face-reconstruction accuracy does not guarantee competent action selection; closed-source models outperform open-source ones; and the simplest self-correction method yields only limited gains and can cause overthinking. Conclusion: Cube Bench is a compact, reproducible tool for evaluating sequential spatial reasoning in multimodal large language models and exposes the limits of current models on complex reasoning tasks, particularly in error recovery and long-horizon planning. Abstract: We introduce Cube Bench, a Rubik's-cube benchmark for evaluating spatial and sequential reasoning in multimodal large language models (MLLMs). The benchmark decomposes performance into five skills: (i) reconstructing cube faces from images and text, (ii) choosing the optimal next move, (iii) predicting the outcome of a candidate move without applying it, (iv) executing multi-step plans while recovering from mistakes, and (v) detecting and revising one's own errors. Using a shared set of scrambled cube states, identical prompts and parsers, and a single distance-to-solved metric, we compare recent MLLMs side by side as a function of scramble depth. Across seven MLLMs, accuracy drops sharply with depth; once a trajectory stalls or diverges, models rarely recover, and high face-reconstruction accuracy does not guarantee competent action selection or multi-step execution. A pronounced closed- vs open-source gap emerges: the strongest closed model leads on both single-step perception tasks and multi-step control tasks, while open-weight models cluster near chance on the hardest settings; yet even the best MLLM degrades at higher cube complexity. A simple self-correction via reflective thinking yields modest gains but can also introduce overthinking. Cube Bench offers a compact, reproducible probe of sequential spatial reasoning in MLLMs.

[28] MoE-DiffuSeq: Enhancing Long-Document Diffusion Models with Sparse Attention and Mixture of Experts

Alexandros Christoforos,Chadbourne Davis

Main category: cs.CL

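TL;DR: Proposes MoE-DiffuSeq, a mixture-of-experts diffusion model framework that uses sparse attention and a soft absorbing state to improve the efficiency and quality of long-text generation.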

Details

Motivation: Existing diffusion models incur high computational cost and memory overhead in long-text generation and struggle to process long sequences efficiently. Method: The framework combines a sparse attention mechanism with a mixture-of-experts architecture and introduces a soft absorbing state to optimize the diffusion process. Result: Training efficiency and sampling speed improve significantly, with strong results on tasks such as scientific article generation, code repository modeling, and long-form dialogue generation. Conclusion: MoE-DiffuSeq advances the practical use of diffusion models for high-quality long-text generation. Abstract: We present MoE-DiffuSeq, a mixture-of-experts-based framework for enhancing diffusion models in long document generation. Existing diffusion-based text generation models, such as DiffuSeq, suffer from high computational cost and memory overhead when applied to extended sequences. To address these challenges, MoE-DiffuSeq integrates sparse attention with a mixture-of-experts architecture, enabling efficient and scalable long-sequence modeling. Our approach introduces a customized sparse attention mechanism designed to reduce computational complexity while preserving text quality and coherence. In addition, we incorporate a soft absorbing state within the diffusion process to accelerate sequence reconstruction and improve generation precision. Extensive experiments demonstrate that MoE-DiffuSeq significantly improves training efficiency and sampling speed compared to existing diffusion models. These advantages are particularly effective for long document scenarios, including scientific article generation, code repository modeling, and long-form dialogue generation. Benchmark results further show that MoE-DiffuSeq improves efficiency, speed, accuracy, and expressiveness, advancing the practical applicability of diffusion models for high-quality long-form text generation.

cs.CV

[29] PHANTOM: PHysical ANamorphic Threats Obstructing Connected Vehicle Mobility

Md Nahid Hasan Shuvo,Moinul Hossain

Main category: cs.CV

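TL;DR: This paper presents PHANTOM, a physical adversarial attack framework based on anamorphic art that poses a serious threat to the visual perception and V2X communication systems of autonomous vehicles, with high attack success rates and cross-model transferability.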

Details

Motivation: Existing autonomous driving systems have security vulnerabilities to physical-domain adversarial attacks, and the joint fragility of perception and communication in real environments has not been adequately studied. Method: Anamorphic art is used to generate perspective-dependent adversarial examples that attack several mainstream object detectors in a black-box setting with no knowledge of the target model, and the impact on the perception and communication layers is evaluated with CARLA and SUMO-OMNeT++ co-simulation. Result: The attack achieves a success rate above 90% in CARLA and remains 60-80% effective in degraded environments; it activates at 6-10 meters and increases Peak Age of Information in the V2X network by 68-89%, causing cascading communication disruption. Conclusion: PHANTOM exposes security weaknesses in both the perception and communication layers of CAV systems and underscores the need for more robust defenses against physical adversarial threats in the real world. Abstract: Connected autonomous vehicles (CAVs) rely on vision-based deep neural networks (DNNs) and low-latency Vehicle-to-Everything (V2X) communication to navigate safely and efficiently. Despite their advances, these systems remain vulnerable to physical adversarial attacks. In this paper, we introduce PHANTOM (PHysical ANamorphic Threats Obstructing connected vehicle Mobility), a novel framework for crafting and deploying perspective-dependent adversarial examples using anamorphic art. PHANTOM exploits geometric distortions that appear natural to humans but are misclassified with high confidence by state-of-the-art object detectors. Unlike conventional attacks, PHANTOM operates in black-box settings without model access and demonstrates strong transferability across four diverse detector architectures (YOLOv5, SSD, Faster R-CNN, and RetinaNet). Comprehensive evaluation in CARLA across varying speeds, weather conditions, and lighting scenarios shows that PHANTOM achieves over 90% attack success rate under optimal conditions and maintains 60-80% effectiveness even in degraded environments. The attack activates within 6-10 meters of the target, providing insufficient time for safe maneuvering. Beyond individual vehicle deception, PHANTOM triggers network-wide disruption in CAV systems: SUMO-OMNeT++ co-simulation demonstrates that false emergency messages propagate through V2X links, increasing Peak Age of Information by 68-89% and degrading safety-critical communication. These findings expose critical vulnerabilities in both perception and communication layers of CAV ecosystems.

[30] Generating the Past, Present and Future from a Motion-Blurred Image

SaiKiran Tedla,Kelly Zhu,Trevor Canham,Felix Taubner,Michael S. Brown,Kiriakos N. Kutulakos,David B. Lindell

Main category: cs.CV

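TL;DR: This paper proposes a new method that uses a pre-trained video diffusion model to recover, from a motion-blurred image, a video of the scene dynamics at the moment of capture and immediately before and after it, overcoming the limitations of earlier methods in complex scenes.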

Details

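Motivation: Motion blur is usually treated as a source of image degradation, but it also encodes information about scene and camera motion during the exposure. Existing methods rely on handcrafted priors or network architectures, struggle with complex scene dynamics, and cannot recover what happened before or after the image was taken. A method that makes better use of large-scale data priors is therefore needed. Method: A video diffusion model pre-trained on internet-scale datasets is repurposed to recover, from a single motion-blurred image, a video sequence covering the moment of capture and the scene dynamics before and after it. The method needs no handcrafted priors and instead uses large-scale generative image and video priors to resolve the ambiguities in recovering sharp video from blur. Result: The method outperforms prior approaches on recovering sharp video from blurry images, handles real, complex scenes, and supports downstream tasks such as recovering camera trajectories, object motion, and dynamic 3D scene structure. Conclusion: A motion-blurred image contains information not only about the current frame but also about the scene's past and future dynamics. By using a large-scale pre-trained video diffusion model, the method recovers complex scene dynamics and shows stronger generalization and robustness across tasks. Abstract: We seek to answer the question: what can a motion-blurred image reveal about a scene's past, present, and future? Although motion blur obscures image details and degrades visual quality, it also encodes information about scene and camera motion during an exposure. Previous techniques leverage this information to estimate a sharp image from an input blurry one, or to predict a sequence of video frames showing what might have occurred at the moment of image capture. However, they rely on handcrafted priors or network architectures to resolve ambiguities in this inverse problem, and do not incorporate image and video priors on large-scale datasets. As such, existing methods struggle to reproduce complex scene dynamics and do not attempt to recover what occurred before or after an image was taken. Here, we introduce a new technique that repurposes a pre-trained video diffusion model trained on internet-scale datasets to recover videos revealing complex scene dynamics during the moment of capture and what might have occurred immediately into the past or future. Our approach is robust and versatile; it outperforms previous methods for this task, generalizes to challenging in-the-wild images, and supports downstream tasks such as recovering camera trajectories, object motion, and dynamic 3D scene structure. Code and data are available at https://blur2vid.github.io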

[31] Learning to Refocus with Video Diffusion Models

SaiKiran Tedla,Zhoutong Zhang,Xuaner Zhang,Shumian Xin

Main category: cs.CV

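TL;DR: Proposes a post-capture refocusing method for a single defocused image based on video diffusion models; it generates a perceptually accurate focal stack as a video sequence, supports interactive refocusing, and comes with a large-scale real-world smartphone focal stack dataset.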

Details

Motivation: Autofocus systems often fail to capture the intended subject, and users frequently want to adjust focus after capture; existing methods fall short in perceptual quality and robustness. Method: A video diffusion model generates a realistic focal stack from a single defocused image, representing the different focus levels as a video to achieve continuous, natural refocusing. Result: Across a variety of challenging scenarios, the method outperforms existing approaches in both perceptual quality and robustness, and the authors release a large-scale real-world smartphone focal stack dataset for training and evaluation. Conclusion: The method opens a new path to advanced focus editing in everyday photography and advances post-capture focus adjustment. Abstract: Focus is a cornerstone of photography, yet autofocus systems often fail to capture the intended subject, and users frequently wish to adjust focus after capture. We introduce a novel method for realistic post-capture refocusing using video diffusion models. From a single defocused image, our approach generates a perceptually accurate focal stack, represented as a video sequence, enabling interactive refocusing and unlocking a range of downstream applications. We release a large-scale focal stack dataset acquired under diverse real-world smartphone conditions to support this work and future research. Our method consistently outperforms existing approaches in both perceptual quality and robustness across challenging scenarios, paving the way for more advanced focus-editing capabilities in everyday photography. Code and data are available at www.learn2refocus.github.io

[32] RANSAC Scoring Functions: Analysis and Reality Check

A. Shekhovtsov

Main category: cs.CV

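TL;DR: This paper revisits the scoring of geometric models in RANSAC, extends the geometric error to spherical noise, and unifies likelihood-based and M-estimator methods in the robust setting. It finds that the derivation of MAGSAC++ is problematic and that its scoring function is equivalent to a simple Gaussian-uniform mixture model. Experiments show that the different scoring functions perform similarly and that MAGSAC++ has no clear advantage.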

Details

Motivation: Model scoring in existing RANSAC frameworks lacks a unified theoretical basis, and although MAGSAC++ performs well, its theoretical derivation is flawed, so the effectiveness of mainstream scoring methods needs a systematic re-evaluation. Method: Extends the geometric error to spherical noise, builds a mixture model with uniformly distributed outliers, and proposes a threshold-based parameterization that unifies likelihood-based and robust M-estimators; designs an experimental methodology that evaluates scoring functions using either a large validation set or a small random validation set in expectation. Result: The MAGSAC++ scoring function is found to be numerically equivalent to a basic Gaussian-uniform likelihood model; all tested scoring functions (including one using a learned inlier distribution) perform identically, and MAGSAC++ is no more robust to the threshold hyperparameter. Conclusion: Current mainstream scoring functions can be unified theoretically under the proposed framework and show no significant difference in practice, so future improvements should focus on more fundamental modeling rather than on complex score design. Abstract: We revisit the problem of assigning a score (a quality of fit) to candidate geometric models -- one of the key components of RANSAC for robust geometric fitting. In a non-robust setting, the "gold standard" scoring function, known as the geometric error, follows from a probabilistic model with Gaussian noises. We extend it to spherical noises. In a robust setting, we consider a mixture with uniformly distributed outliers and show that a threshold-based parameterization leads to a unified view of likelihood-based and robust M-estimators and associated local optimization schemes. Next we analyze MAGSAC++ which stands out for two reasons. First, it achieves the best results according to existing benchmarks. Second, it makes quite different modeling assumptions and derivation steps. We discovered, however, that the derivation does not correspond to sound principles and the resulting score function is in fact numerically equivalent to a simple Gaussian-uniform likelihood, a basic model within the proposed framework. Finally, we propose an experimental methodology for evaluating scoring functions: assuming either a large validation set, or a small random validation set in expectation. We find that all scoring functions, including using a learned inlier distribution, perform identically. In particular, MAGSAC++ score is found to be neither better performing than simple contenders nor less sensitive to the choice of the threshold hyperparameter. 
Our theoretical and experimental analysis thus comprehensively revisits the state-of-the-art, which is critical for any future research seeking to improve the methods or apply them to other robust fitting problems.
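
To make the two families of scores mentioned in this abstract concrete, the following is a minimal sketch, not the paper's implementation. It assumes scalar (1D) residuals, and the parameter names `sigma`, `inlier_ratio`, `outlier_volume`, and `threshold` are illustrative choices rather than the paper's notation. It contrasts a Gaussian-uniform mixture log-likelihood with a threshold-based truncated quadratic score of the MSAC kind.

```python
import math


def gaussian_uniform_loglik(residuals, sigma, inlier_ratio, outlier_volume):
    """Log-likelihood of residuals under a Gaussian-uniform mixture.

    Each residual r contributes log(g * N(r; 0, sigma^2) + (1 - g) / V),
    where g is the inlier ratio and V the volume of the uniform outlier
    domain. Scalar residuals are an assumption made for this sketch.
    """
    total = 0.0
    for r in residuals:
        gauss = math.exp(-0.5 * (r / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
        total += math.log(inlier_ratio * gauss + (1.0 - inlier_ratio) / outlier_volume)
    return total


def truncated_quadratic_score(residuals, threshold):
    """Negated truncated quadratic cost (MSAC-style); higher is better."""
    return -sum(min(r * r, threshold * threshold) for r in residuals)


# Toy comparison: a model with small residuals and one gross outlier
# versus a model whose residuals are all large.
good_fit = [0.1, -0.2, 0.05, 5.0]
bad_fit = [3.0, -4.0, 2.5, 5.0]
```

Both scores rank `good_fit` above `bad_fit` in this toy case, a small-scale illustration of how differently motivated scores can agree on the ranking of candidate models.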

[33] HyGE-Occ: Hybrid View-Transformation with 3D Gaussian and Edge Priors for 3D Panoptic Occupancy Prediction

Jong Wook Kim,Wonseok Roh,Ha Dam Baek,Pilhyeon Lee,Jonghyun Choi,Sangpil Kim

Main category: cs.CV

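TL;DR: This paper proposes HyGE-Occ, a new framework for 3D panoptic occupancy prediction that uses a hybrid view-transformation branch with 3D Gaussian and edge priors to improve geometric consistency and boundary awareness.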

Details

Motivation: Existing methods struggle to maintain precise geometry and to capture the spatial extent of 3D instances, which hurts panoptic segmentation performance. Method: Proposes the HyGE-Occ framework, whose hybrid view-transformation branch fuses a continuous Gaussian-based depth representation with a discretized depth-bin representation, and uses edge maps extracted from BEV features as auxiliary information to learn edge cues. Result: Experiments on the Occ3D-nuScenes dataset show that HyGE-Occ outperforms existing methods in 3D geometric reasoning. Conclusion: HyGE-Occ effectively improves geometric consistency and boundary awareness in 3D panoptic occupancy prediction, enabling finer-grained scene understanding. Abstract: 3D Panoptic Occupancy Prediction aims to reconstruct a dense volumetric scene map by predicting the semantic class and instance identity of every occupied region in 3D space. Achieving such fine-grained 3D understanding requires precise geometric reasoning and spatially consistent scene representation across complex environments. However, existing approaches often struggle to maintain precise geometry and capture the precise spatial range of 3D instances critical for robust panoptic separation. To overcome these limitations, we introduce HyGE-Occ, a novel framework that leverages a hybrid view-transformation branch with 3D Gaussian and edge priors to enhance both geometric consistency and boundary awareness in 3D panoptic occupancy prediction. HyGE-Occ employs a hybrid view-transformation branch that fuses a continuous Gaussian-based depth representation with a discretized depth-bin formulation, producing BEV features with improved geometric consistency and structural coherence. In parallel, we extract edge maps from BEV features and use them as auxiliary information to learn edge cues. In our extensive experiments on the Occ3D-nuScenes dataset, HyGE-Occ outperforms existing work, demonstrating superior 3D geometric reasoning.

[34] Widget2Code: From Visual Widgets to UI Code via Multimodal LLMs

Houston H. Zhang,Tao Zhang,Baoze Lin,Yuanqi Xue,Yincheng Zhu,Huan Liu,Li Gu,Linfeng Ye,Ziqiang Wang,Xinxin Zuo,Yang Wang,Yuanhao Yu,Zhixiang Chi

Main category: cs.CV

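TL;DR: This paper introduces the Widget-to-Code (Widget2Code) task to address the challenges of generating code for app widgets, builds the first image-only widget benchmark, and proposes WidgetFactory, a new baseline system that combines perceptual understanding with structured code generation and substantially improves visual fidelity.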

Details

Motivation: Widgets are compact, context-free micro-interfaces with dense layouts, rich iconography, and tight spatial constraints, which makes it hard for existing UI2Code methods to generate accurate code for them; the lack of public markup data and of a dedicated evaluation suite further limits research. Method: Proposes the WidgetFactory framework, which assembles atomic components following design principles and adds icon retrieval and visualization modules to strengthen perceptual understanding; designs a domain-specific language, WidgetDSL, and a compiler supporting multiple front-end implementations; and introduces an adaptive rendering module that refines spatial dimensions to satisfy compactness constraints. Result: Evaluation on the new image-only widget benchmark shows that current multimodal large language models outperform specialized UI2Code methods but still produce unreliable and visually inconsistent code; the proposed method substantially improves the visual fidelity and structural accuracy of generated code under fine-grained, multi-dimensional metrics. Conclusion: The paper lays initial groundwork for Widget2Code research, providing unified infrastructure and a strong baseline for future work on widget code generation. Abstract: User interface to code (UI2Code) aims to generate executable code that can faithfully reconstruct a given input UI. Prior work focuses largely on web pages and mobile screens, leaving app widgets underexplored. Unlike web or mobile UIs with rich hierarchical context, widgets are compact, context-free micro-interfaces that summarize key information through dense layouts and iconography under strict spatial constraints. Moreover, while (image, code) pairs are widely available for web or mobile UIs, widget designs are proprietary and lack accessible markup. We formalize this setting as the Widget-to-Code (Widget2Code) task and introduce an image-only widget benchmark with fine-grained, multi-dimensional evaluation metrics. Benchmarking shows that although generalized multimodal large language models (MLLMs) outperform specialized UI2Code methods, they still produce unreliable and visually inconsistent code. To address these limitations, we develop a baseline that jointly advances perceptual understanding and structured code generation. At the perceptual level, we follow widget design principles to assemble atomic components into complete layouts, equipped with icon retrieval and reusable visualization modules. At the system level, we design an end-to-end infrastructure, WidgetFactory, which includes a framework-agnostic widget-tailored domain-specific language (WidgetDSL) and a compiler that translates it into multiple front-end implementations (e.g., React, HTML/CSS). An adaptive rendering module further refines spatial dimensions to satisfy compactness constraints. 
Together, these contributions substantially enhance visual fidelity, establishing a strong baseline and unified infrastructure for future Widget2Code research.

[35] Unified Brain Surface and Volume Registration

S. Mazdak Abulnaga,Andrew Hoopes,Malte Hoffmann,Robin Magnet,Maks Ovsjanikov,Lilla Zöllei,John Guttag,Bruce Fischl,Adrian Dalca

Main category: cs.CV

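TL;DR: NeurAlign is a deep learning framework for brain MRI registration that jointly aligns cortical and subcortical regions through a unified volume-and-surface representation, uses a spherical coordinate space to keep anatomical structures consistent, and outperforms existing methods in accuracy, speed, and ease of use.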

Details

Motivation: Traditional methods handle volumetric and surface registration separately, which produces inconsistent results and limits downstream analyses in neuroscience, so a method that registers cortical and subcortical structures together is needed to improve overall consistency and accuracy. Method: Proposes NeurAlign, which uses an intermediate spherical coordinate space to integrate volumetric and surface anatomy and performs end-to-end joint registration in a deep learning framework, ensuring geometric coherence between the volume and surface domains. Result: Experiments on several datasets (in-domain and out-of-domain) show that NeurAlign improves the Dice score by up to 7 points, maintains regular deformation fields, runs orders of magnitude faster than the standard method, and needs only an MRI scan as input. Conclusion: NeurAlign delivers more accurate, faster, and easier-to-use joint registration of brain MRI and sets a new standard for the joint analysis of cortical and subcortical structures. Abstract: Accurate registration of brain MRI scans is fundamental for cross-subject analysis in neuroscientific studies. This involves aligning both the cortical surface of the brain and the interior volume. Traditional methods treat volumetric and surface-based registration separately, which often leads to inconsistencies that limit downstream analyses. We propose a deep learning framework, NeurAlign, that registers 3D brain MRI images by jointly aligning both cortical and subcortical regions through a unified volume-and-surface-based representation. Our approach leverages an intermediate spherical coordinate space to bridge anatomical surface topology with volumetric anatomy, enabling consistent and anatomically accurate alignment. By integrating spherical registration into the learning, our method ensures geometric coherence between volume and surface domains. In a series of experiments on both in-domain and out-of-domain datasets, our method consistently outperforms both classical and machine learning-based registration methods -- improving the Dice score by up to 7 points while maintaining regular deformation fields. Additionally, it is orders of magnitude faster than the standard method for this task, and is simpler to use because it requires no additional inputs beyond an MRI scan. With its superior accuracy, fast inference, and ease of use, NeurAlign sets a new standard for joint cortical and subcortical registration.
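
The Dice score cited in this abstract measures the overlap between two segmentations of the same anatomical label. The following minimal sketch shows the standard definition on flattened label maps. The function name and the convention of returning 1.0 when a label is absent from both maps are assumptions made for illustration, not details taken from the paper.

```python
def dice_score(pred, target, label):
    """Dice overlap 2|A ∩ B| / (|A| + |B|) for one label.

    `pred` and `target` are equal-length sequences of integer labels,
    for example flattened voxel label maps after registration.
    """
    if len(pred) != len(target):
        raise ValueError("label maps must have the same length")
    pred_count = sum(1 for p in pred if p == label)
    target_count = sum(1 for t in target if t == label)
    overlap = sum(1 for p, t in zip(pred, target) if p == label and t == label)
    if pred_count + target_count == 0:
        return 1.0  # assumed convention: an absent label counts as perfect agreement
    return 2.0 * overlap / (pred_count + target_count)
```

A registration method is typically scored by warping one subject's label map onto another and averaging this quantity over labels and subject pairs.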

[36] Vehicle-centric Perception via Multimodal Structured Pre-training

Wentao Wu,Xiao Wang,Chenglong Li,Jin Tang,Bin Luo

Main category: cs.CV

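TL;DR: This paper proposes VehicleMAE-V2, a new vehicle-centric pre-trained large model that uses three structured priors (symmetry, contour, and semantics) to strengthen representation learning for vehicle perception. It also builds the large-scale Autobot4M dataset to support pre-training and reports superior performance on five downstream tasks.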

Details

Motivation: Existing methods do not learn vehicle-related knowledge effectively during pre-training, which leaves them poor at modeling general representations for vehicle-centric perception. Method: Proposes VehicleMAE-V2 with a Symmetry-guided Mask Module (SMM), a Contour-guided Representation Module (CRM), and a Semantics-guided Representation Module (SRM), which use multimodal structured priors to guide masked token reconstruction and improve the model's ability to learn general representations. Result: Extensive experiments on five downstream tasks show that VehicleMAE-V2 performs strongly; the authors also build Autobot4M, a large-scale dataset with about 4 million vehicle images and 12,693 text descriptions. Conclusion: By introducing structured priors, VehicleMAE-V2 markedly improves representation learning for vehicle-centric perception and offers a stronger foundation model for intelligent systems such as autonomous driving. Abstract: Vehicle-centric perception plays a crucial role in many intelligent systems, including large-scale surveillance systems, intelligent transportation, and autonomous driving. Existing approaches lack effective learning of vehicle-related knowledge during pre-training, resulting in poor capability for modeling general vehicle perception representations. To handle this problem, we propose VehicleMAE-V2, a novel vehicle-centric pre-trained large model. By exploring and exploiting vehicle-related multimodal structured priors to guide the masked token reconstruction process, our approach can significantly enhance the model's capability to learn generalizable representations for vehicle-centric perception. Specifically, we design the Symmetry-guided Mask Module (SMM), Contour-guided Representation Module (CRM) and Semantics-guided Representation Module (SRM) to incorporate three kinds of structured priors into token reconstruction including symmetry, contour and semantics of vehicles respectively. SMM utilizes the vehicle symmetry constraints to avoid retaining symmetric patches and can thus select high-quality masked image patches and reduce information redundancy. CRM minimizes the probability distribution divergence between contour features and reconstructed features and can thus preserve holistic vehicle structure information during pixel-level reconstruction. SRM aligns image-text features through contrastive learning and cross-modal distillation to address the feature confusion caused by insufficient semantic understanding during masked reconstruction. 
To support the pre-training of VehicleMAE-V2, we construct Autobot4M, a large-scale dataset comprising approximately 4 million vehicle images and 12,693 text descriptions. Extensive experiments on five downstream tasks demonstrate the superior performance of VehicleMAE-V2.

[37] Block-Recurrent Dynamics in Vision Transformers

Mozes Jacobs,Thomas Fel,Richard Hakim,Alessandra Brondetta,Demba Ba,T. Andy Keller

Main category: cs.CV

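TL;DR: This paper proposes the Block-Recurrent Hypothesis (BRH), which holds that trained Vision Transformers (ViTs) have a reusable block-recurrent structure along depth. Using the Raptor model, it shows that only 2 blocks suffice to recover 96% of DINOv2 linear probe accuracy, revealing low-complexity dynamics in deep ViT layers.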

Details

Motivation: Although ViTs have become the dominant vision backbones, the computation in their deep layers is not well understood, and there is no unified framework that interprets Transformer depth as a well-characterized dynamical flow, so a mechanistic account of their internal computation is needed. Method: Proposes the Block-Recurrent Hypothesis (BRH), uses between-layer representational similarity matrices to identify phase structure in ViTs, and builds the recurrent surrogate model Raptor (Recurrent Approximations to Phase-structured TransfORmers) to fit the behavior of pretrained ViTs; also combines stochastic depth training with dynamical analysis to study deep-layer dynamics. Result: Experiments show that (1) different ViT models exhibit a small number of contiguous phases; (2) a Raptor model with only 2 blocks recovers 96% of DINOv2 ImageNet-1k linear probe accuracy at equivalent computational cost; and (3) deep layers exhibit directional convergence into class-dependent angular basins, late reorientation of the cls token, late-stage coherence of patch tokens, and low-rank updates. Conclusion: Deep computation in ViTs is not independent layer by layer but follows a compact recurrent program, which supports analyzing ViTs systematically as dynamical systems with low-dimensional attractors and offers a new normative perspective on them. Abstract: As Vision Transformers (ViTs) become standard vision backbones, a mechanistic account of their computational phenomenology is essential. Despite architectural cues that hint at dynamical structure, there is no settled framework that interprets Transformer depth as a well-characterized flow. In this work, we introduce the Block-Recurrent Hypothesis (BRH), arguing that trained ViTs admit a block-recurrent depth structure such that the computation of the original $L$ blocks can be accurately rewritten using only $k \ll L$ distinct blocks applied recurrently. Across diverse ViTs, between-layer representational similarity matrices suggest few contiguous phases. To determine whether these phases reflect genuinely reusable computation, we train block-recurrent surrogates of pretrained ViTs: Recurrent Approximations to Phase-structured TransfORmers (Raptor). In small-scale, we demonstrate that stochastic depth and training promote recurrent structure and subsequently correlate with our ability to accurately fit Raptor. We then provide an empirical existence proof for BRH by training a Raptor model to recover $96\%$ of DINOv2 ImageNet-1k linear probe accuracy in only 2 blocks at equivalent computational cost. Finally, we leverage our hypothesis to develop a program of Dynamical Interpretability. 
We find i) directional convergence into class-dependent angular basins with self-correcting trajectories under small perturbations, ii) token-specific dynamics, where cls executes sharp late reorientations while patch tokens exhibit strong late-stage coherence toward their mean direction, and iii) a collapse to low rank updates in late depth, consistent with convergence to low-dimensional attractors. Altogether, we find a compact recurrent program emerges along ViT depth, pointing to a low-complexity normative solution that enables these models to be studied through principled dynamical systems analysis.
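
The block-recurrent rewriting described in this abstract replaces $L$ distinct blocks with $k \ll L$ blocks, each reused several times. The following is a toy sketch of that control flow only. The schedule format (one repeat count per contiguous phase) and the scalar toy blocks are assumptions for illustration and are not the Raptor architecture.

```python
def run_block_recurrent(x, blocks, repeats):
    """Apply k distinct blocks recurrently, one contiguous phase per block.

    blocks  : list of k callables, each mapping a state to a new state
    repeats : list of k positive integers; sum(repeats) plays the role of
              the original depth L
    """
    if len(blocks) != len(repeats):
        raise ValueError("need exactly one repeat count per block")
    for block, n in zip(blocks, repeats):
        for _ in range(n):
            x = block(x)  # the same parameters are reused at every step of the phase
    return x


# Toy stand-ins for transformer blocks acting on a scalar state.
def early_block(x):
    return x + 1.0


def late_block(x):
    return 2.0 * x


# k = 2 distinct blocks emulate a depth of L = 3 + 2 = 5 applications.
output = run_block_recurrent(0.0, [early_block, late_block], [3, 2])
```

In this toy run, five block applications are produced from two sets of parameters, which is the sense in which compute stays equivalent while the number of distinct blocks shrinks.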

[38] SE360: Semantic Edit in 360$^\circ$ Panoramas via Hierarchical Data Construction

Haoyi Zhong,Fang-Lue Zhang,Andrew Chalmers,Taehyun Rhee

Main category: cs.CV

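TL;DR: This paper proposes SE360, a new framework for multi-condition guided object editing in 360° panoramas. It combines a coarse-to-fine autonomous data generation pipeline that needs no manual intervention, a two-stage data refinement strategy, and a Transformer-based diffusion model, and it outperforms existing methods in both visual quality and semantic accuracy.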

Details

Motivation: Existing 360° panorama editing methods often produce implausible results in both equirectangular projections and perspective views and struggle to deliver object edits that are semantically consistent and geometrically plausible, so a method that ensures both is needed. Method: Proposes the SE360 framework, which includes a coarse-to-fine autonomous data generation pipeline based on a vision-language model (VLM) and adaptive projection adjustment to produce semantically meaningful and geometrically consistent data pairs, plus a low-cost two-stage data refinement strategy to improve realism and reduce overfitting; a Transformer-based diffusion model guided by text, mask, or reference image is trained on the resulting dataset. Result: Experiments show that the method outperforms existing methods in both visual quality and semantic accuracy and achieves more realistic and precise object editing in 360° panoramas. Conclusion: Through autonomous data generation and two-stage data refinement, SE360 addresses geometric distortion and semantic inconsistency in 360° panorama editing and offers an efficient, practical solution for instruction-driven panorama editing. Abstract: While instruction-based image editing is emerging, extending it to 360$^\circ$ panoramas introduces additional challenges. Existing methods often produce implausible results in both equirectangular projections (ERP) and perspective views. To address these limitations, we propose SE360, a novel framework for multi-condition guided object editing in 360$^\circ$ panoramas. At its core is a novel coarse-to-fine autonomous data generation pipeline without manual intervention. This pipeline leverages a Vision-Language Model (VLM) and adaptive projection adjustment for hierarchical analysis, ensuring the holistic segmentation of objects and their physical context. The resulting data pairs are both semantically meaningful and geometrically consistent, even when sourced from unlabeled panoramas. Furthermore, we introduce a cost-effective, two-stage data refinement strategy to improve data realism and mitigate model overfitting to erase artifacts. Based on the constructed dataset, we train a Transformer-based diffusion model to allow flexible object editing guided by text, mask, or reference image in 360$^\circ$ panoramas. Our experiments demonstrate that our method outperforms existing methods in both visual quality and semantic accuracy.

[39] How Much 3D Do Video Foundation Models Encode?

Zixuan Huang,Xiang Li,Zhaoyang Lv,James M. Rehg

Main category: cs.CV

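TL;DR: The study finds that state-of-the-art video generation models, despite not being trained on any 3D data, show strong 3D understanding and scene awareness that can even surpass large expert models trained specifically for 3D tasks.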

Details

Motivation: To investigate whether video foundation models (VidFMs) trained on large-scale video data naturally develop an understanding of the 3D world. Method: Proposes the first model-agnostic framework that quantifies 3D awareness by estimating multiple 3D properties from VidFM features via shallow read-outs. Result: State-of-the-art video generation models show strong understanding of 3D objects and scenes, which can surpass that of some large expert models trained specifically for 3D tasks; the study also establishes a 3D benchmark of major VidFMs. Conclusion: Video foundation models can develop strong 3D understanding without 3D supervision, which suggests that learning global 3D structure from large-scale video is feasible and promising. Abstract: Videos are continuous 2D projections of 3D worlds. After training on large video data, will global 3D understanding naturally emerge? We study this by quantifying the 3D understanding of existing Video Foundation Models (VidFMs) pretrained on vast video data. We propose the first model-agnostic framework that measures the 3D awareness of various VidFMs by estimating multiple 3D properties from their features via shallow read-outs. Our study presents meaningful findings regarding the 3D awareness of VidFMs on multiple axes. In particular, we show that state-of-the-art video generation models exhibit a strong understanding of 3D objects and scenes, despite not being trained on any 3D data. Such understanding can even surpass that of large expert models specifically trained for 3D tasks. Our findings, together with the 3D benchmarking of major VidFMs, provide valuable observations for building scalable 3D models.

[40] HistoWAS: A Pathomics Framework for Large-Scale Feature-Wide Association Studies of Tissue Topology and Patient Outcomes

Yuechen Yang,Junlin Guo,Yanfan Zhu,Jialin Yue,Junchao Zhu,Yu Wang,Shilin Zhao,Haichun Yang,Xingyi Guo,Jovan Tanevski,Laura Barisoni,Avi Z. Rosenberg,Yuankai Huo

Main category: cs.CV

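TL;DR: This paper proposes HistoWAS, a computational framework that links tissue spatial organization to clinical outcomes by combining 30 topological and spatial features with mass univariate regression analysis for high-throughput pathology image analysis.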

Details

Motivation: Tools for quantifying spatial interactions in the tissue microenvironment and their association with clinical parameters are lacking, which limits the clinical relevance of pathology image analysis. Method: Proposes the HistoWAS framework, which has two parts: (1) a feature space that augments conventional metrics with 30 topological and spatial features adapted from Geographic Information Systems (GIS) point pattern analysis; and (2) an association study engine, inspired by PheWAS, that performs mass univariate regression with statistical correction. Result: HistoWAS was applied to 385 PAS-stained whole slide images from the KPMP project, analyzing 102 features in total (72 conventional features and 30 spatial features) and demonstrating its potential for biomarker discovery in pathomics. Conclusion: HistoWAS provides a new tool for high-throughput analysis of whole slide images that connects tissue spatial architecture to clinical outcomes and supports biomarker discovery in precision medicine. Abstract: High-throughput "pathomic" analysis of Whole Slide Images (WSIs) offers new opportunities to study tissue characteristics and for biomarker discovery. However, the clinical relevance of the tissue characteristics at the micro- and macro-environment level is limited by the lack of tools that facilitate the measurement of the spatial interaction of individual structure characteristics and their association with clinical parameters. To address these challenges, we introduce HistoWAS (Histology-Wide Association Study), a computational framework designed to link tissue spatial organization to clinical outcomes. Specifically, HistoWAS implements (1) a feature space that augments conventional metrics with 30 topological and spatial features, adapted from Geographic Information Systems (GIS) point pattern analysis, to quantify tissue micro-architecture; and (2) an association study engine, inspired by Phenome-Wide Association Studies (PheWAS), that performs mass univariate regression for each feature with statistical correction. As a proof of concept, we applied HistoWAS to analyze a total of 102 features (72 conventional object-level features and our 30 spatial features) using 385 PAS-stained WSIs from 206 participants in the Kidney Precision Medicine Project (KPMP). The code and data have been released to https://github.com/hrlblab/histoWAS.
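
When one regression is run per feature, as in the association engine described above, the resulting p-values need a multiple-testing correction. The abstract does not say which correction HistoWAS uses. The sketch below assumes the Benjamini-Hochberg false discovery rate procedure purely for illustration, and takes the per-feature p-values as given rather than computing the regressions.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    The adjusted value for the p-value of rank i (1-based, ascending) is
    min over j >= i of p_(j) * m / j, capped at 1.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the running minimum enforces monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def significant_features(names, p_values, alpha=0.05):
    """Names of features whose adjusted p-value is at most alpha."""
    adjusted = benjamini_hochberg(p_values)
    return [name for name, q in zip(names, adjusted) if q <= alpha]
```

With 102 features, a call such as `significant_features(feature_names, p_values)` would return the subset that survives correction; the feature names here are hypothetical inputs.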

[41] WSD-MIL: Window Scale Decay Multiple Instance Learning for Whole Slide Image Classification

Le Feng,Li Xiao

Main category: cs.CV

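TL;DR: Proposes WSD-MIL, a new method that combines window scale decay attention with a region gate mechanism to better model tumor regions of different scales while keeping computation efficient, and achieves state-of-the-art performance on multiple pathology datasets.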

Details

Motivation: Existing MIL methods overlook the complex semantic relationships among instances when processing whole slide images, and Transformer-based methods, with quadratic computational complexity and fixed-scale attention, struggle to capture multi-scale local correlations and to model the distance-based decay of patch relevance. Method: Proposes WSD-MIL, which has two modules: (1) a window scale decay attention module with cluster-based sampling that progressively shrinks the attention window to capture local instance relationships at multiple scales; and (2) a squeeze-and-excitation based region gate module that dynamically adjusts window weights to strengthen global information modeling. Result: Achieves state-of-the-art performance on the CAMELYON16 and TCGA-BRCA datasets while reducing computational memory by 62%. Conclusion: By introducing scale-decay attention and dynamic gating, WSD-MIL resolves the tension between modeling multi-scale tumor regions and computational efficiency on large WSIs and shows good application prospects. Abstract: In recent years, the integration of pre-trained foundational models with multiple instance learning (MIL) has improved diagnostic accuracy in computational pathology. However, existing MIL methods focus on optimizing feature extractors and aggregation strategies while overlooking the complex semantic relationships among instances within a whole slide image (WSI). Although Transformer-based MIL approaches aim to model instance dependencies, the quadratic computational complexity limits their scalability to large-scale WSIs. Moreover, due to the pronounced variations in tumor region scales across different WSIs, existing Transformer-based methods employing fixed-scale attention mechanisms face significant challenges in precisely capturing local instance correlations and fail to account for the distance-based decay effect of patch relevance. To address these challenges, we propose window scale decay MIL (WSD-MIL), designed to enhance the capacity to model tumor regions of varying scales while improving computational efficiency. WSD-MIL comprises: 1) a window scale decay based attention module, which employs a cluster-based sampling strategy to reduce computational costs while progressively decaying attention window-scale to capture local instance relationships at varying scales; and 2) a squeeze-and-excitation based region gate module, which dynamically adjusts window weights to enhance global information modeling. Experimental results demonstrate that WSD-MIL achieves state-of-the-art performance on the CAMELYON16 and TCGA-BRCA datasets while reducing computational memory by 62%. The code will be publicly available.

[42] A Novel CNN Gradient Boosting Ensemble for Guava Disease Detection

Tamim Ahasan Rijon,Yeasin Arafath

Main category: cs.CV

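TL;DR: This study proposes an ensemble model that combines a CNN with traditional machine learning (a CNN-ML cascade framework) to detect diseases in guava grown in Bangladesh. On the GFDD24 dataset it reaches a classification accuracy of approximately 99.99% and is suited to real-time agricultural monitoring.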

Details

Motivation: Guava is an important cash crop in Bangladesh, but anthracnose and fruit fly infection seriously reduce its yield and quality, so efficient early disease detection is needed to limit losses. Method: Uses the Guava Fruit Disease Dataset 2024 (GFDD24), collected in the Rajshahi and Pabna regions of Bangladesh, to build a cascade model that combines a convolutional neural network (CNN) with traditional machine learning, and adds a Gradient Boosting Machine for ensemble learning to raise classification accuracy. Result: The proposed CNN-ML ensemble reaches a classification accuracy of approximately 99.99% on GFDD24. Conclusion: The CNN-ML cascade framework identifies local guava diseases efficiently and accurately and has potential for use in real-time monitoring systems in practical agricultural settings. Abstract: As a significant agricultural country, Bangladesh utilizes its fertile land for guava cultivation and dedicated labor to boost its economic development. In a nation like Bangladesh, enhancing guava production and agricultural practices plays a crucial role in its economy. Anthracnose and fruit fly infection can lower the quality and productivity of guava, a crucial tropical fruit. Expert systems that detect diseases early can reduce losses and safeguard the harvest. Images of guava fruits classified into the Healthy, Fruit Flies, and Anthracnose classes are included in the Guava Fruit Disease Dataset 2024 (GFDD24), which comes from plantations in Rajshahi and Pabna, Bangladesh. This study aims to create models using CNN alongside traditional machine learning techniques that can effectively identify guava diseases in locally cultivated varieties in Bangladesh. In order to achieve the highest classification accuracy of approximately 99.99% for the guava dataset, we propose utilizing ensemble models that combine CNN-ML with Gradient Boosting Machine. In general, the CNN-ML cascade framework exhibits strong, high-accuracy guava disease detection that is appropriate for real-time agricultural monitoring systems.

[43] A Dual-Branch Local-Global Framework for Cross-Resolution Land Cover Mapping

Peng Gao,Ke Li,Di Wang,Yongshan Zhu,Yiming Zhang,Xuemei Luo,Yifeng Wang

Main category: cs.CV

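TL;DR: This paper proposes DDTM, a dual-branch weakly supervised framework that tackles fine-grained structure alignment in cross-resolution land cover mapping. A diffusion branch refines local semantics, a Transformer branch models global context, and a pseudo-label confidence evaluation module reduces the effect of noise. The method sets a new state of the art on the Chesapeake Bay benchmark.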

Details

Motivation: The severe resolution mismatch between coarse supervision and high-resolution prediction makes it hard for existing methods to align fine-grained spatial structures, which leads to noisy supervision and reduced accuracy. Method: Proposes the DDTM framework: (1) a diffusion-based branch progressively refines local semantics under coarse supervision; (2) a Transformer-based branch models long-range contextual consistency; and (3) a pseudo-label confidence evaluation module selectively exploits reliable supervisory signals. Result: Achieves 66.52% mIoU on the Chesapeake Bay benchmark, substantially outperforming prior weakly supervised methods. Conclusion: By decoupling local refinement from global reasoning, DDTM eases the challenges of cross-resolution learning and offers a new approach to weakly supervised land cover mapping. Abstract: Cross-resolution land cover mapping aims to produce high-resolution semantic predictions from coarse or low-resolution supervision, yet the severe resolution mismatch makes effective learning highly challenging. Existing weakly supervised approaches often struggle to align fine-grained spatial structures with coarse labels, leading to noisy supervision and degraded mapping accuracy. To tackle this problem, we propose DDTM, a dual-branch weakly supervised framework that explicitly decouples local semantic refinement from global contextual reasoning. Specifically, DDTM introduces a diffusion-based branch to progressively refine fine-scale local semantics under coarse supervision, while a transformer-based branch enforces long-range contextual consistency across large spatial extents. In addition, we design a pseudo-label confidence evaluation module to mitigate noise induced by cross-resolution inconsistencies and to selectively exploit reliable supervisory signals. Extensive experiments demonstrate that DDTM establishes a new state-of-the-art on the Chesapeake Bay benchmark, achieving 66.52% mIoU and substantially outperforming prior weakly supervised methods. The code is available at https://github.com/gpgpgp123/DDTM.
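
The mIoU figure reported above is the mean, over classes, of the intersection-over-union between predicted and reference label maps. The following minimal sketch uses flattened label sequences. Skipping classes that appear in neither map is an assumed convention for this illustration, and benchmark-specific evaluation rules may differ.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes 0..num_classes-1.

    `pred` and `target` are equal-length sequences of integer class labels,
    for example flattened land cover maps.
    """
    if len(pred) != len(target):
        raise ValueError("label maps must have the same length")
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumed convention: ignore classes absent from both maps
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0
```

A reported value such as 66.52% corresponds to this quantity multiplied by 100 and computed over the benchmark's full test set.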

[44] Few-Shot-Based Modular Image-to-Video Adapter for Diffusion Models

Zhenhao Li,Shaohan Yi,Zheng Liu,Leonartinus Gao,Minh Ngoc Le,Ambrose Ling,Zhuoran Wang,Md Amirul Islam,Zhixiang Chi,Yuanhao Yu

Main category: cs.CV

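TL;DR: Proposes MIVA, a lightweight sub-network that attaches to a pre-trained diffusion model for image animation and whose modular design allows efficient training and precise motion control.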

Details

Motivation: Diffusion models excel at image and video generation but remain limited for image animation, mainly because the high dimensionality of video signals makes training data scarce, so models tend to memorize rather than follow prompts when generating motion and generalize poorly to new motion patterns. Method: Proposes the Modular Image-to-Video Adapter (MIVA), a lightweight sub-network attachable to a pre-trained diffusion model; each MIVA captures one motion pattern, MIVAs scale through parallelization, and each can be trained efficiently on about ten samples with a single consumer-grade GPU. Result: Experiments show that MIVA trains effectively on very few samples, that users can specify motion by selecting one or more MIVAs without complex prompt engineering, and that it gives more precise motion control while matching or even surpassing the generation quality of models trained on much larger datasets. Conclusion: MIVA offers an efficient and controllable solution for image animation and addresses the weak motion generation and generalization of diffusion models in the few-shot regime. Abstract: Diffusion models (DMs) have recently achieved impressive photorealism in image and video generation. However, their application to image animation remains limited, even when trained on large-scale datasets. Two primary challenges contribute to this: the high dimensionality of video signals leads to a scarcity of training data, causing DMs to favor memorization over prompt compliance when generating motion; moreover, DMs struggle to generalize to novel motion patterns not present in the training set, and fine-tuning them to learn such patterns, especially using limited training data, is still under-explored. To address these limitations, we propose Modular Image-to-Video Adapter (MIVA), a lightweight sub-network attachable to a pre-trained DM, each designed to capture a single motion pattern and scalable via parallelization. MIVAs can be efficiently trained on approximately ten samples using a single consumer-grade GPU. At inference time, users can specify motion by selecting one or multiple MIVAs, eliminating the need for prompt engineering. Extensive experiments demonstrate that MIVA enables more precise motion control while maintaining, or even surpassing, the generation quality of models trained on significantly larger datasets.

[45] PaveSync: A Unified and Comprehensive Dataset for Pavement Distress Analysis and Classification

Blessing Agyei Kyem,Joshua Kofi Asamoah,Anthony Dontoh,Andrews Danyo,Eugene Denteh,Armstrong Aboah

Main category: cs.CV

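TL;DR: This paper presents a standardized, large-scale benchmark dataset for pavement defect detection that consolidates 52747 images and 135277 bounding box annotations from seven countries, covers 13 distress types, and supports cross-environment model training and fair evaluation.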

Details

Motivation: Existing pavement distress datasets differ in annotation style, distress definitions, and format, which makes them hard to integrate and limits generalization; a globally representative benchmark is missing. Method: Consolidates multiple public data sources, standardizes class definitions and annotation formats, builds a large-scale dataset with broad real-world variation, and benchmarks it with state-of-the-art detectors including the YOLO family, Faster R-CNN, and DETR. Result: The dataset contains 52747 images and 135277 bounding box annotations covering 13 distress types; experiments show that mainstream detectors perform competitively on it and that it supports zero-shot transfer. Conclusion: The dataset provides the first globally representative standardized benchmark for pavement defect detection and supports unified training, fair comparison, and cross-domain generalization of models. Abstract: Automated pavement defect detection often struggles to generalize across diverse real-world conditions due to the lack of standardized datasets. Existing datasets differ in annotation styles, distress type definitions, and formats, limiting their integration for unified training. To address this gap, we introduce a comprehensive benchmark dataset that consolidates multiple publicly available sources into a standardized collection of 52747 images from seven countries, with 135277 bounding box annotations covering 13 distinct distress types. The dataset captures broad real-world variation in image quality, resolution, viewing angles, and weather conditions, offering a unique resource for consistent training and evaluation. Its effectiveness was demonstrated through benchmarking with state-of-the-art object detection models including YOLOv8-YOLOv12, Faster R-CNN, and DETR, which achieved competitive performance across diverse scenarios. By standardizing class definitions and annotation formats, this dataset provides the first globally representative benchmark for pavement defect detection and enables fair comparison of models, including zero-shot transfer to new environments.
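
Standardizing annotation formats, as described above, usually involves converting bounding boxes between coordinate conventions. The abstract does not name the source or target formats, so the sketch below is only an assumed example: it converts a corner-style pixel box `(xmin, ymin, xmax, ymax)` into a normalized center-style box `(cx, cy, w, h)` of the kind commonly used with YOLO-family detectors.

```python
def corners_to_normalized_center(box, image_width, image_height):
    """Convert (xmin, ymin, xmax, ymax) in pixels to normalized (cx, cy, w, h).

    All outputs lie in [0, 1] when the box is inside the image.
    """
    xmin, ymin, xmax, ymax = box
    if xmax < xmin or ymax < ymin:
        raise ValueError("box corners are not ordered")
    center_x = (xmin + xmax) / 2.0 / image_width
    center_y = (ymin + ymax) / 2.0 / image_height
    width = (xmax - xmin) / image_width
    height = (ymax - ymin) / image_height
    return (center_x, center_y, width, height)
```

A consolidation script would apply a conversion like this per annotation, alongside a mapping from each source dataset's class names to the unified distress types.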

[46] SegEarth-R2: Towards Comprehensive Language-guided Segmentation for Remote Sensing Images

Zepeng Xin,Kaiyu Li,Luodi Chen,Wanchen Li,Yuchen Xiao,Hui Qiao,Weizhan Zhang,Deyu Meng,Xiangyong Cao

Main category: cs.CV

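TL;DR: This paper presents LaSeRS, the first large-scale dataset for complex language-guided segmentation in remote sensing images, and designs the SegEarth-R2 model to handle hierarchical granularity, multi-target instructions, reasoning requirements, and linguistic variability.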

Details

Motivation: Existing models handle simple instructions well but perform poorly in complex geospatial scenarios (multiple targets, implicit intent, varying granularity), and no dataset evaluates these capabilities comprehensively. Method: Builds the LaSeRS dataset, which covers four key dimensions: hierarchical granularity, target multiplicity, reasoning requirements, and linguistic variability; proposes the SegEarth-R2 model, which introduces a spatial attention supervision mechanism and a flexible, efficient segmentation query mechanism. Result: SegEarth-R2 achieves strong performance on LaSeRS and other benchmarks, markedly improves localization of small objects and their components, and supports both single-target and multi-target segmentation. Conclusion: LaSeRS provides a new benchmark for complex geospatial language understanding, and SegEarth-R2 addresses precise language-to-pixel grounding in remote sensing images through architectural improvements, advancing the field's potential for practical use. Abstract: Effectively grounding complex language to pixels in remote sensing (RS) images is a critical challenge for applications like disaster response and environmental monitoring. Current models can parse simple, single-target commands but fail when presented with complex geospatial scenarios, e.g., segmenting objects at various granularities, executing multi-target instructions, and interpreting implicit user intent. To drive progress against these failures, we present LaSeRS, the first large-scale dataset built for comprehensive training and evaluation across four critical dimensions of language-guided segmentation: hierarchical granularity, target multiplicity, reasoning requirements, and linguistic variability. By capturing these dimensions, LaSeRS moves beyond simple commands, providing a benchmark for complex geospatial reasoning. This addresses a critical gap: existing datasets oversimplify, leading to sensitivity-prone real-world models. We also propose SegEarth-R2, an MLLM architecture designed for comprehensive language-guided segmentation in RS, which directly confronts these challenges. The model's effectiveness stems from two key improvements: (1) a spatial attention supervision mechanism that specifically handles the localization of small objects and their components, and (2) a flexible and efficient segmentation query mechanism that handles both single-target and multi-target scenarios. Experimental results demonstrate that our SegEarth-R2 achieves outstanding performance on LaSeRS and other benchmarks, establishing a powerful baseline for the next generation of geospatial segmentation. 
All data and code will be released at https://github.com/earth-insights/SegEarth-R2.

[47] A Contextual Analysis of Driver-Facing and Dual-View Video Inputs for Distraction Detection in Naturalistic Driving Environments

Anthony Dontoh,Stephanie Ivey,Armstrong Aboah

Main category: cs.CV

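TL;DR: This study examines whether combining road-facing and driver-facing video improves distracted driving detection under naturalistic driving conditions. It finds that gains depend on the model architecture and that naive multi-view input can hurt performance because of representational conflicts.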

Details

Motivation: Most existing distracted driving detection models use only the driver-facing view and ignore environmental context that shapes driving behavior, so it is worth testing whether adding the road-facing view improves detection. Method: Using synchronized dual-camera recordings from real-world driving, compares three leading spatiotemporal action recognition models (SlowFast-R50, X3D-M, SlowOnly-R50) under two input configurations: driver-only and stacked dual-view. Result: SlowOnly-R50 improved by 9.8 percent with dual-view input, while SlowFast-R50 dropped by 7.2 percent in accuracy, which indicates that the architecture determines the effect of multi-view input and that representational conflicts can cause harm. Conclusion: Simply adding visual context is not enough to improve distracted driving detection; future multimodal driver monitoring systems need architectures designed to support multi-view fusion. Abstract: Despite increasing interest in computer vision-based distracted driving detection, most existing models rely exclusively on driver-facing views and overlook crucial environmental context that influences driving behavior. This study investigates whether incorporating road-facing views alongside driver-facing footage improves distraction detection accuracy in naturalistic driving conditions. Using synchronized dual-camera recordings from real-world driving, we benchmark three leading spatiotemporal action recognition architectures: SlowFast-R50, X3D-M, and SlowOnly-R50. Each model is evaluated under two input configurations: driver-only and stacked dual-view. Results show that while contextual inputs can improve detection in certain models, performance gains depend strongly on the underlying architecture. The single-pathway SlowOnly model achieved a 9.8 percent improvement with dual-view inputs, while the dual-pathway SlowFast model experienced a 7.2 percent drop in accuracy due to representational conflicts. These findings suggest that simply adding visual context is not sufficient and may lead to interference unless the architecture is specifically designed to support multi-view integration. This study presents one of the first systematic comparisons of single- and dual-view distraction detection models using naturalistic driving data and underscores the importance of fusion-aware design for future multimodal driver monitoring systems.

[48] MAPI-GNN: Multi-Activation Plane Interaction Graph Neural Network for Multimodal Medical Diagnosis

Ziwei Qin,Xuhui Song,Deqing Huang,Na Qin,Jun Li

Main category: cs.CV

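TL;DR: Proposes the Multi-Activation Plane Interaction Graph Neural Network (MAPI-GNN), which learns a multifaceted graph profile from semantically disentangled feature subspaces to overcome the limits of a single static graph in multimodal medical diagnosis.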

Details

Motivation: Existing graph neural networks rely on a single static graph, which makes it hard to model patient-specific pathological relationships and limits their effectiveness in multimodal medical diagnosis. Method: First uncovers latent graph-aware patterns with a multi-dimensional discriminator; these patterns then guide the dynamic construction of a set of activation graphs; finally, a relational fusion engine aggregates and contextualizes this multifaceted graph profile for robust diagnosis. Result: In extensive experiments on two tasks comprising over 1300 patient samples, MAPI-GNN significantly outperforms state-of-the-art methods. Conclusion: MAPI-GNN improves the relational modeling ability of graph neural networks in multimodal medical diagnosis and shows stronger diagnostic performance and application potential. Abstract: Graph neural networks are increasingly applied to multimodal medical diagnosis for their inherent relational modeling capabilities. However, their efficacy is often compromised by the prevailing reliance on a single, static graph built from indiscriminate features, hindering the ability to model patient-specific pathological relationships. To this end, the proposed Multi-Activation Plane Interaction Graph Neural Network (MAPI-GNN) reconstructs this single-graph paradigm by learning a multifaceted graph profile from semantically disentangled feature subspaces. The framework first uncovers latent graph-aware patterns via a multi-dimensional discriminator; these patterns then guide the dynamic construction of a stack of activation graphs; and this multifaceted profile is finally aggregated and contextualized by a relational fusion engine for a robust diagnosis. Extensive experiments on two diverse tasks, comprising over 1300 patient samples, demonstrate that MAPI-GNN significantly outperforms state-of-the-art methods.

[49] $\text{H}^2$em: Learning Hierarchical Hyperbolic Embeddings for Compositional Zero-Shot Learning

Lin Li,Jiahui Li,Jiaming Lei,Jun Xiao,Feifei Shao,Long Chen

Main category: cs.CV

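TL;DR: Proposes the H2em framework, which uses hierarchical hyperbolic embeddings to model hierarchical structure in large-scale compositional zero-shot learning, achieving state-of-the-art performance on multiple benchmarks.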

Details

Motivation: Existing CZSL methods that model hierarchies in Euclidean space struggle to scale to large taxonomies, because polynomial volume growth cannot match the exponential hierarchical structure found in the real world, which limits generalization. Method: Proposes the H2em framework, which embeds in hyperbolic space to better represent tree-like hierarchies; it designs a Dual-Hierarchical Entailment Loss (using hyperbolic entailment cones) and a Discriminative Alignment Loss (with hard negative mining), and implements cross-modal attention within hyperbolic space. Result: In extensive experiments on three benchmark datasets, H2em achieves the best performance in both closed-world and open-world settings. Conclusion: By modeling large-scale semantic and conceptual hierarchies with hyperbolic geometry, H2em markedly improves generalization and fine-grained discrimination in compositional zero-shot learning. Abstract: Compositional zero-shot learning (CZSL) aims to recognize unseen state-object compositions by generalizing from a training set of their primitives (state and object). Current methods often overlook the rich hierarchical structures, such as the semantic hierarchy of primitives (e.g., apple fruit) and the conceptual hierarchy between primitives and compositions (e.g., sliced apple apple). A few recent efforts have shown effectiveness in modeling these hierarchies through loss regularization within Euclidean space. In this paper, we argue that they fail to scale to the large-scale taxonomies required for real-world CZSL: the space's polynomial volume growth in flat geometry cannot match the exponential structure, impairing generalization capacity. To this end, we propose H2em, a new framework that learns Hierarchical Hyperbolic EMbeddings for CZSL. H2em leverages the unique properties of hyperbolic geometry, a space naturally suited for embedding tree-like structures with low distortion. However, a naive hyperbolic mapping may suffer from hierarchical collapse and poor fine-grained discrimination. We further design two learning objectives to structure this space: a Dual-Hierarchical Entailment Loss that uses hyperbolic entailment cones to enforce the predefined hierarchies, and a Discriminative Alignment Loss with hard negative mining to establish a large geodesic distance between semantically similar compositions. Furthermore, we devise Hyperbolic Cross-Modal Attention to realize instance-aware cross-modal infusion within hyperbolic geometry. Extensive ablations on three benchmarks demonstrate that H2em establishes a new state-of-the-art in both closed-world and open-world scenarios. Our codes will be released.
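
The volume-growth argument in the H2em abstract can be made concrete with a geodesic distance in hyperbolic space. The following is a minimal sketch, assuming the Poincaré ball model; the abstract does not say which hyperbolic model H2em uses, and the function name `poincare_distance` is hypothetical.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball.

    Assumption: the Poincare ball model is used for illustration only; the
    abstract does not specify the hyperbolic model adopted by H2em.
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)


# Distance from the origin grows without bound as a point nears the boundary,
# which is what leaves room for many well-separated leaves of a hierarchy.
origin = [0.0, 0.0]
for radius in (0.5, 0.9, 0.99):
    print(radius, poincare_distance(origin, [radius, 0.0]))
```

In this model the distance from the origin to a point at Euclidean radius r equals 2·atanh(r), so equal Euclidean steps toward the boundary cover ever larger hyperbolic distances.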

Chang Sun,Dongliang Xie,Bo Qin,Hong Yang

Main category: cs.CV

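TL;DR: Proposes VALLR-Pin, a two-stage framework for Mandarin visual speech recognition that combines multi-task learning over characters and Pinyin with a large language model, fine-tuned on Pinyin context and error patterns, to resolve homophone ambiguity in Mandarin lip-reading.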

Details

Motivation: In Mandarin, visemes are ambiguous and homophones are widespread, which makes visual speech recognition hard; existing methods struggle with characters and words that sound alike but differ in meaning. Method: A shared video encoder with dual decoders jointly predicts the Chinese character sequence and the corresponding Pinyin; at inference, candidate transcripts are generated and combined with the Pinyin into a prompt for a large language model that corrects errors; the LLM is fine-tuned on synthetic noisy Pinyin-text pairs to strengthen its recognition of model-specific errors. Result: The method improves lip-reading accuracy on silent Mandarin video and is particularly strong at handling homophone ambiguity and common errors. Conclusion: By fusing visual, phonetic, and linguistic information, VALLR-Pin markedly improves Mandarin visual speech recognition and shows the potential of combining multimodal models with LLMs. Abstract: Visual Speech Recognition aims to transcribe spoken words from silent lip-motion videos. This task is particularly challenging for Mandarin, as visemes are highly ambiguous and homophones are prevalent. We propose VALLR-Pin, a novel two-stage framework that extends the recent VALLR architecture from English to Mandarin. First, a shared video encoder feeds into dual decoders, which jointly predict both Chinese character sequences and their standard Pinyin romanization. The multi-task learning of character and phonetic outputs fosters robust visual-semantic representations. During inference, the text decoder generates multiple candidate transcripts. We construct a prompt by concatenating the Pinyin output with these candidate Chinese sequences and feed it to a large language model to resolve ambiguities and refine the transcription. This provides the LLM with explicit phonetic context to correct homophone-induced errors. Finally, we fine-tune the LLM on synthetic noisy examples: we generate imperfect Pinyin-text pairs from intermediate VALLR-Pin checkpoints using the training data, creating instruction-response pairs for error correction. This endows the LLM with awareness of our model's specific error patterns. In summary, VALLR-Pin synergizes visual features with phonetic and linguistic context to improve Mandarin lip-reading performance.
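
The prompt-construction step described in the VALLR-Pin abstract, concatenating the Pinyin output with the candidate transcripts, might look like the sketch below. The prompt wording, the function name `build_correction_prompt`, and the example sentence are all hypothetical; the abstract does not give the actual template.

```python
def build_correction_prompt(pinyin, candidates):
    """Combine a Pinyin hypothesis with candidate transcripts into one LLM prompt.

    Assumption: the template text is invented for illustration; only the idea of
    concatenating Pinyin with candidate Chinese sequences comes from the abstract.
    """
    if not candidates:
        raise ValueError("at least one candidate transcript is required")
    lines = ["Pinyin: " + " ".join(pinyin), "Candidates:"]
    lines += [f"{i}. {text}" for i, text in enumerate(candidates, start=1)]
    lines.append("Return the single most plausible Chinese transcript.")
    return "\n".join(lines)


# Three homophone-confusable candidates that share the same Pinyin syllables.
prompt = build_correction_prompt(
    ["ta1", "zai4", "kan4", "shu1"],
    ["他在看书", "她在看书", "他再看书"],
)
print(prompt)
```

The point of the design is that the Pinyin line gives the language model explicit phonetic evidence, so it can choose among candidates that differ only in homophonous characters.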

[51] FlashLips: 100-FPS Mask-Free Latent Lip-Sync using Reconstruction Instead of Diffusion or GANs

Andreas Zinonos,Michał Stypułkowski,Antoni Bigata,Stavros Petridis,Maja Pantic,Nikita Drobyshev

Main category: cs.CV

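TL;DR: FlashLips is a mask-free, two-stage, real-time lip-sync system that decouples control from rendering and runs at over 100 FPS on a single GPU while maintaining high visual quality.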

Details

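Motivation: Existing lip-sync methods usually depend on complex generative models (such as GANs or diffusion models) and explicit masks, which leads to high computational cost and complicated pipelines; a more efficient, stable, mask-free method is needed for real-time, high-quality lip-sync. Method: A two-stage framework. Stage 1 uses a compact one-step latent-space editor that reconstructs an image from a reference identity, a target frame, and a low-dimensional lips-pose vector, trained with reconstruction losses only, without GANs or diffusion; self-supervised mouth-altered samples are used for fine-tuning so that the lips are localized precisely without explicit masks. Stage 2 uses an audio-to-pose transformer trained with a flow-matching objective to predict lips-pose vectors from speech. Result: The system runs at over 100 FPS on a single GPU, reaching real-time performance; visual quality is comparable to current state-of-the-art large models; training is more stable because no GAN or diffusion model is involved; and self-supervision removes the need for explicit masks at inference. Conclusion: FlashLips offers a simple, stable, and efficient lip-sync solution that combines deterministic reconstruction with robust audio control, strikes a good balance between speed and quality, and is suitable for practical use. Abstract: We present FlashLips, a two-stage, mask-free lip-sync system that decouples lips control from rendering and achieves real-time performance running at over 100 FPS on a single GPU, while matching the visual quality of larger state-of-the-art models. Stage 1 is a compact, one-step latent-space editor that reconstructs an image using a reference identity, a masked target frame, and a low-dimensional lips-pose vector, trained purely with reconstruction losses - no GANs or diffusion. To remove explicit masks at inference, we use self-supervision: we generate mouth-altered variants of the target image that serve as pseudo ground truth for fine-tuning, teaching the network to localize edits to the lips while preserving the rest. Stage 2 is an audio-to-pose transformer trained with a flow-matching objective to predict lips-pose vectors from speech. Together, these stages form a simple and stable pipeline that combines deterministic reconstruction with robust audio control, delivering high perceptual quality and faster-than-real-time speed.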

Nguyen Lam Phu Quy,Pham Phu Hoa,Tran Chi Nguyen,Dao Sy Duy Minh,Nguyen Hoang Minh Ngoc,Huynh Trung Kiet

Main category: cs.CV

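TL;DR: Proposes a multimodal pipeline that combines visual input with external textual knowledge to generate richer, context-aware image captions, substantially increasing the information conveyed about images of real-world scenes.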

Details

Motivation: Existing image captioning methods lack contextual depth in real-world settings and omit important information that cannot be read directly from the image (such as event background, temporal cues, outcomes, and named entities), which limits their use in journalism, education, and digital archives. Method: A multimodal augmentation pipeline: BEIT-3 and SigLIP retrieve semantically similar images, ORB and SIFT rerank them by geometric alignment, and semantic search extracts context from related articles; a Qwen3 model fine-tuned with QLoRA fuses this context with base captions generated by Instruct BLIP to output event-enriched captions. Result: Evaluation on the OpenEvents v1 dataset shows that the method produces captions that are significantly richer and more informative than those of traditional methods. Conclusion: The proposed method effectively increases the contextual depth and information content of image captions and has potential for wide use in applications that need deeper visual-textual understanding. Abstract: Real-world image captions often lack contextual depth, omitting crucial details such as event background, temporal cues, outcomes, and named entities that are not visually discernible. This gap limits the effectiveness of image understanding in domains like journalism, education, and digital archives, where richer, more informative descriptions are essential. To address this, we propose a multimodal pipeline that augments visual input with external textual knowledge. Our system retrieves semantically similar images using BEIT-3 (Flickr30k-384 and COCO-384) and SigLIP So-384, reranks them using ORB and SIFT for geometric alignment, and extracts contextual information from related articles via semantic search. A fine-tuned Qwen3 model with QLoRA then integrates this context with base captions generated by Instruct BLIP (Vicuna-7B) to produce event-enriched, context-aware descriptions. Evaluated on the OpenEvents v1 dataset, our approach generates significantly more informative captions compared to traditional methods, showing strong potential for real-world applications requiring deeper visual-textual understanding.

[53] Progressive Learned Image Compression for Machine Perception

Jungwoo Kim,Jun-Hyuk Kim,Jong-Seok Lee

Main category: cs.CV

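TL;DR: Proposes PICM-Net, a progressive learned image compression codec for machine perception based on trit-plane coding, together with an adaptive decoding controller that adjusts decoding quality dynamically at inference time, enabling efficient and adaptive progressive transmission while keeping downstream classification performance high.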

Details

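Motivation: Existing progressive image compression methods have not explored fine granular scalability (FGS) for machine perception, and human and machine perception differ in their rate-distortion priorities, so the compression strategy needs to be optimized specifically for machine tasks. Method: Proposes PICM-Net, based on trit-plane coding; analyzes how human- and machine-oriented rate-distortion priorities differ, systematically studies latent prioritization strategies, and introduces an adaptive decoding controller that chooses the decoding level dynamically according to the confidence of the downstream task. Result: Experiments show that the method achieves efficient progressive transmission across multiple quality levels, substantially improves machine classification performance, and adapts well in practice. Conclusion: PICM-Net establishes a new paradigm for machine-oriented progressive image compression, achieving efficient transmission while preserving machine-task performance through fine granular scalability and adaptive decoding. Abstract: Recent advances in learned image codecs have been extended from human perception toward machine perception. However, progressive image compression with fine granular scalability (FGS), which enables decoding a single bitstream at multiple quality levels, remains unexplored for machine-oriented codecs. In this work, we propose a novel progressive learned image compression codec for machine perception, PICM-Net, based on trit-plane coding. By analyzing the difference between human- and machine-oriented rate-distortion priorities, we systematically examine the latent prioritization strategies in terms of machine-oriented codecs. To further enhance real-world adaptability, we design an adaptive decoding controller, which dynamically determines the necessary decoding level during inference time to maintain the desired confidence of downstream machine prediction. Extensive experiments demonstrate that our approach enables efficient and adaptive progressive transmission while maintaining high performance in the downstream classification task, establishing a new paradigm for machine-aware progressive image compression.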
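
The idea behind trit-plane coding can be illustrated on a single quantized latent value: write it in base 3, send the most significant digit ("trit") first, and let the decoder refine its estimate as more planes arrive. The sketch below shows only that general principle. It is not PICM-Net's codec, and the function names and the midpoint reconstruction rule are assumptions.

```python
def to_trit_planes(value, num_planes):
    """Split a non-negative integer into base-3 digits, most significant first."""
    if not 0 <= value < 3 ** num_planes:
        raise ValueError("value does not fit in the requested number of planes")
    trits = []
    for power in range(num_planes - 1, -1, -1):
        digit, value = divmod(value, 3 ** power)
        trits.append(digit)
    return trits


def reconstruct(received_trits, num_planes):
    """Estimate the value from the leading planes received so far.

    Assumption: the estimate is placed at the midpoint of the interval that the
    missing planes could still select; a learned codec would choose differently.
    """
    base = 0
    for i, digit in enumerate(received_trits):
        base += digit * 3 ** (num_planes - 1 - i)
    remaining = 3 ** (num_planes - len(received_trits))
    return base + (remaining - 1) / 2


planes = to_trit_planes(17, 3)  # 17 = 1*9 + 2*3 + 2
for k in range(1, 4):
    print(k, reconstruct(planes[:k], 3))  # the estimate tightens toward 17
```

An adaptive decoding controller of the kind the abstract describes would stop requesting planes once the downstream classifier's confidence reaches its target, instead of always decoding all of them.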

[54] Towards Natural Language-Based Document Image Retrieval: New Dataset and Benchmark

Hao Guo,Xugong Qin,Jun Jie Ou Yang,Peng Zhang,Gangyan Zeng,Yubo Li,Hailun Lin

Main category: cs.CV

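TL;DR: Introduces a new Natural Language-based Document Image Retrieval (NL-DIR) benchmark containing 41K real document images with fine-grained semantic queries, intended to advance the use of vision-language models in document understanding.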

Details

Motivation: Existing document image retrieval methods rely mainly on image queries and are limited to coarse categories, so they cope poorly with the fine-grained text queries found in real-world scenarios; a more realistic natural-language retrieval benchmark is therefore needed. Method: Builds the NL-DIR dataset, in which each document image is paired with five high-quality natural language queries generated by large language models and verified manually; evaluates the zero-shot and fine-tuned performance of mainstream contrastive vision-language models and OCR-free visual document understanding models, and proposes an efficient two-stage retrieval method. Result: The NL-DIR dataset contains 41K document images and 205K fine-grained natural language queries; experiments show that the task remains challenging for existing models, and that the two-stage method improves performance while staying efficient in time and space. Conclusion: NL-DIR provides a new natural-language-driven benchmark for document image retrieval and is expected to foster research at the intersection of vision-language modeling and document understanding. Abstract: Document image retrieval (DIR) aims to retrieve document images from a gallery according to a given query. Existing DIR methods are primarily based on image queries that retrieve documents within the same coarse semantic category, e.g., newspapers or receipts. However, these methods struggle to effectively retrieve document images in real-world scenarios where textual queries with fine-grained semantics are usually provided. To bridge this gap, we introduce a new Natural Language-based Document Image Retrieval (NL-DIR) benchmark with corresponding evaluation metrics. In this work, natural language descriptions serve as semantically rich queries for the DIR task. The NL-DIR dataset contains 41K authentic document images, each paired with five high-quality, fine-grained semantic queries generated and evaluated through large language models in conjunction with manual verification. We perform zero-shot and fine-tuning evaluations of existing mainstream contrastive vision-language models and OCR-free visual document understanding (VDU) models. A two-stage retrieval method is further investigated for performance improvement while achieving both time and space efficiency. We hope the proposed NL-DIR benchmark can bring new opportunities and facilitate research for the VDU community. Datasets and codes will be publicly available at huggingface.co/datasets/nianbing/NL-DIR.

[55] Item Region-based Style Classification Network (IRSN): A Fashion Style Classifier Based on Domain Knowledge of Fashion Experts

Jinyoung Choi,Youngchae Kwon,Injung Kim

Main category: cs.CV

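TL;DR: Proposes the Item Region-based Style Classification Network (IRSN), which improves classification accuracy by combining global features with local item features and their combinations, and significantly outperforms existing methods on multiple datasets.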

Details

Motivation: Fashion style classification is challenging because of large visual variation within a style and the existence of visually similar styles; item attributes and their combinations need to be modeled at a finer level to separate subtle differences. Method: Proposes IRSN, which extracts features of each item region with item region pooling (IRP), combines them with gated feature fusion (GFF), and strengthens feature extraction with a dual-backbone architecture (a domain-specific model plus a general pre-trained model). Result: On the FashionStyle14 and ShowniqV3 datasets, adding IRSN to six mainstream backbones raises accuracy by 6.9% and 7.6% on average, with maximum gains of 14.5% and 15.1%, respectively; visualizations show that the model captures the differences between similar styles better. Conclusion: By explicitly modeling item-region features and their fusion, IRSN improves fashion style classification and validates both local item analysis and the dual-backbone design. Abstract: Fashion style classification is a challenging task because of the large visual variation within the same style and the existence of visually similar styles. Styles are expressed not only by the global appearance, but also by the attributes of individual items and their combinations. In this study, we propose an item region-based fashion style classification network (IRSN) to effectively classify fashion styles by analyzing item-specific features and their combinations in addition to global features. IRSN extracts features of each item region using item region pooling (IRP), analyzes them separately, and combines them using gated feature fusion (GFF). In addition, we improve the feature extractor by applying a dual-backbone architecture that combines a domain-specific feature extractor and a general feature extractor pre-trained with a large-scale image-text dataset. In experiments, applying IRSN to six widely-used backbones, including EfficientNet, ConvNeXt, and Swin Transformer, improved style classification accuracy by an average of 6.9% and a maximum of 14.5% on the FashionStyle14 dataset and by an average of 7.6% and a maximum of 15.1% on the ShowniqV3 dataset. Visualization analysis also supports that the IRSN models are better than the baseline models at capturing differences between similar style classes.

[56] Effect of Activation Function and Model Optimizer on the Performance of Human Activity Recognition System Using Various Deep Learning Models

Subrata Kumer Paul,Dewan Nafiul Islam Noor,Rakhi Rani Paul,Md. Ekramul Hamid,Fahmid Al Farid,Hezerul Abdul Karim,Md. Maruf Al Hossain Prince,Abu Saleh Musa Miah

Main category: cs.CV

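TL;DR: Studies how activation functions and optimizers affect deep learning models for Human Activity Recognition (HAR) using BiLSTM and ConvLSTM architectures; ConvLSTM combined with Adam or RMSprop performs best on medically relevant activities, with accuracy of up to 99.00%.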

Details

Motivation: Existing studies focus mostly on architecture design, while the effect of activation function and optimizer combinations on model performance remains unclear, especially in practical medical settings; a systematic analysis is needed to improve the robustness and accuracy of HAR systems. Method: Two recurrent architectures (BiLSTM and ConvLSTM) are combined with three activation functions (ReLU, Sigmoid, Tanh) and four optimization algorithms (SGD, Adam, RMSprop, Adagrad) and evaluated on six medically relevant action classes from the HMDB51 and UCF101 datasets. Result: ConvLSTM outperforms BiLSTM on both datasets, with a top accuracy of 99.00% (with Adam or RMSprop); BiLSTM reaches about 98.00% on UCF101 but falls to about 60.00% on HMDB51, showing that it is sensitive to the dataset and less stable. Conclusion: The choice of activation function and optimizer significantly affects HAR model performance; ConvLSTM with Adam or RMSprop is the preferred option for accurate, stable action recognition, especially in medical environments that demand real-time operation and precision. Abstract: Human Activity Recognition (HAR) plays a vital role in healthcare, surveillance, and innovative environments, where reliable action recognition supports timely decision-making and automation. Although deep learning-based HAR systems are widely adopted, the impact of Activation Functions (AFs) and Model Optimizers (MOs) on performance has not been sufficiently analyzed, particularly regarding how their combinations influence model behavior in practical scenarios. Most existing studies focus on architecture design, while the interaction between AF and MO choices remains relatively unexplored. In this work, we investigate the effect of three commonly used activation functions (ReLU, Sigmoid, and Tanh) combined with four optimization algorithms (SGD, Adam, RMSprop, and Adagrad) using two recurrent deep learning architectures, namely BiLSTM and ConvLSTM. Experiments are conducted on six medically relevant activity classes selected from the HMDB51 and UCF101 datasets, considering their suitability for healthcare-oriented HAR applications. Our experimental results show that ConvLSTM consistently outperforms BiLSTM across both datasets. ConvLSTM, combined with Adam or RMSprop, achieves an accuracy of up to 99.00%, demonstrating strong spatio-temporal learning capabilities and stable performance. While BiLSTM performs reasonably well on UCF101, with accuracy approaching 98.00%, its performance drops to approximately 60.00% on HMDB51, indicating limited robustness across datasets and weaker sensitivity to AF and MO variations. This study provides practical insights for optimizing HAR systems, particularly for real-world healthcare environments where fast and precise activity detection is critical.
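
The experimental design in the HAR abstract is a full factorial grid. The sketch below enumerates it from the factor levels named there; treating every combination as one run on each dataset is an assumption, since the abstract does not list the individual runs, and the dictionary keys are hypothetical.

```python
from itertools import product

# Factor levels taken from the abstract.
ARCHITECTURES = ("BiLSTM", "ConvLSTM")
ACTIVATIONS = ("ReLU", "Sigmoid", "Tanh")
OPTIMIZERS = ("SGD", "Adam", "RMSprop", "Adagrad")
DATASETS = ("HMDB51", "UCF101")


def experiment_grid():
    """Enumerate every architecture/activation/optimizer/dataset combination.

    Assumption: a full factorial design; the abstract does not confirm that
    every cell of this grid was actually trained.
    """
    return [
        {"architecture": arch, "activation": act, "optimizer": opt, "dataset": data}
        for arch, act, opt, data in product(
            ARCHITECTURES, ACTIVATIONS, OPTIMIZERS, DATASETS
        )
    ]


grid = experiment_grid()
print(len(grid))  # 2 * 3 * 4 * 2 = 48 configurations
```

Each model therefore has 12 activation-optimizer pairings per dataset, which is the space over which the reported best result (ConvLSTM with Adam or RMSprop) was selected.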

[57] LiDARDraft: Generating LiDAR Point Cloud from Versatile Inputs

Haiyun Wei,Fan Lu,Yunwei Zhu,Zehan Zheng,Weiyi Xue,Lin Shao,Xudong Zhang,Ya Wu,Rong Fu,Guang Chen

Main category: cs.CV

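TL;DR: Proposes LiDARDraft, which uses 3D layouts to bridge diverse conditioning signals and LiDAR point cloud generation, producing high-quality, controllable LiDAR point clouds from inputs such as text and images.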

Details

Motivation: Existing methods for generating LiDAR point clouds struggle to combine high quality with flexible control, because the point cloud distribution is complex while the control signals are simple. Method: Text, images, and point clouds are represented as unified 3D layouts and converted into semantic and depth control signals, and a rangemap-based ControlNet guides point cloud generation. Result: The approach achieves pixel-level-aligned, controllable LiDAR point cloud generation and produces high-quality point clouds from diverse inputs. Conclusion: LiDARDraft supports creating self-driving environments from arbitrary textual descriptions, images, and sketches, enabling "simulation from scratch". Abstract: Generating realistic and diverse LiDAR point clouds is crucial for autonomous driving simulation. Although previous methods achieve LiDAR point cloud generation from user inputs, they struggle to attain high-quality results while enabling versatile controllability, due to the imbalance between the complex distribution of LiDAR point clouds and the simple control signals. To address the limitation, we propose LiDARDraft, which utilizes the 3D layout to build a bridge between versatile conditional signals and LiDAR point clouds. The 3D layout can be trivially generated from various user inputs such as textual descriptions and images. Specifically, we represent text, images, and point clouds as unified 3D layouts, which are further transformed into semantic and depth control signals. Then, we employ a rangemap-based ControlNet to guide LiDAR point cloud generation. This pixel-level alignment approach demonstrates excellent performance in controllable LiDAR point cloud generation, enabling "simulation from scratch", allowing self-driving environments to be created from arbitrary textual descriptions, images and sketches.

[58] UMAMI: Unifying Masked Autoregressive Models and Deterministic Rendering for View Synthesis

Thanh-Tung Le,Tuan Pham,Tung Nguyen,Deying Kong,Xiaohui Xie,Stephan Mandt

Main category: cs.CV

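TL;DR: Proposes a hybrid framework that combines the strengths of deterministic and generative methods for novel view synthesis from sparse multi-view input, substantially improving rendering efficiency while keeping image quality high.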

Details

Motivation: Existing deterministic methods blur unobserved regions, while generative diffusion methods produce plausible content at high computational cost; a new method that balances quality and efficiency is needed. Method: A bidirectional transformer encodes multi-view image tokens and Plücker-ray embeddings into a shared latent representation; two lightweight heads operate on it: a feed-forward regression head renders regions where geometry is well constrained, and a masked autoregressive diffusion head completes occluded regions; the model is trained jointly end-to-end. Result: The method achieves state-of-the-art image quality across multiple scenes and renders an order of magnitude faster than fully generative models. Conclusion: The hybrid framework balances realism, consistency, and computational efficiency in novel view synthesis and scales well across scenes. Abstract: Novel view synthesis (NVS) seeks to render photorealistic, 3D-consistent images of a scene from unseen camera poses given only a sparse set of posed views. Existing deterministic networks render observed regions quickly but blur unobserved areas, whereas stochastic diffusion-based methods hallucinate plausible content yet incur heavy training- and inference-time costs. In this paper, we propose a hybrid framework that unifies the strengths of both paradigms. A bidirectional transformer encodes multi-view image tokens and Plücker-ray embeddings, producing a shared latent representation. Two lightweight heads then act on this representation: (i) a feed-forward regression head that renders pixels where geometry is well constrained, and (ii) a masked autoregressive diffusion head that completes occluded or unseen regions. The entire model is trained end-to-end with joint photometric and diffusion losses, without handcrafted 3D inductive biases, enabling scalability across diverse scenes. Experiments demonstrate that our method attains state-of-the-art image quality while reducing rendering time by an order of magnitude compared with fully generative baselines.
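
A Plücker-ray embedding, as mentioned in the UMAMI abstract, represents a camera ray by its unit direction together with its moment about the world origin. The sketch below assumes the ordering (direction, origin × direction); papers differ on the ordering and on normalization, and the abstract does not state UMAMI's convention.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def plucker_ray(origin, direction):
    """Return the 6-D Plucker coordinates (d, o x d) of a ray.

    Assumption: direction-first ordering with a unit-norm direction.
    """
    norm = math.sqrt(sum(c * c for c in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    d = [c / norm for c in direction]
    return d + cross(origin, d)


# The same ray described from two different points along it gives the same
# embedding, which is why the representation suits per-pixel ray conditioning.
print(plucker_ray([0.0, 0.0, 1.0], [1.0, 0.0, 0.0]))
print(plucker_ray([5.0, 0.0, 1.0], [2.0, 0.0, 0.0]))
```

Because the moment is a cross product with the direction, it is always orthogonal to the direction, so the six numbers carry four degrees of freedom, which matches the dimension of the space of lines in 3D.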

Alireza Moayedikia,Sattar Dorafshan

Main category: cs.CV

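TL;DR: Proposes a multimodal attention network for bridge deck delamination detection that fuses radar temporal patterns with thermal spatial features and adds uncertainty quantification; detection performance improves substantially, but the attention mechanism degrades under extreme class imbalance.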

Details

Motivation: Traditional single-modal inspection techniques (such as ground penetrating radar and infrared thermography) have limitations in bridge defect detection, including moisture interference, weather dependency, and limited detection depth, and cannot meet the needs of automated inspection. Method: Proposes a multimodal attention network that applies temporal attention to radar data and spatial attention to thermal data, and fuses the modalities through learnable embeddings; Monte Carlo dropout and learned variance estimation are combined for uncertainty quantification, separating epistemic from aleatoric uncertainty. Result: Experiments on five bridge datasets show that the method beats baselines in accuracy and AUC, especially on balanced to moderately imbalanced data; ablations show that cross-modal attention and multi-head mechanisms are key to the gains, and that uncertainty quantification reduces calibration error and supports selective prediction. Conclusion: The proposed attention architecture performs well in typical scenarios and is capable of real-time inspection, but extreme class imbalance causes the attention mechanism to fail and requires dedicated handling; the results give clear capability boundaries and directions for improvement in real deployments. Abstract: Deteriorating civil infrastructure requires automated inspection techniques overcoming limitations of visual assessment. While Ground Penetrating Radar and Infrared Thermography enable subsurface defect detection, single-modal approaches face complementary constraints: radar struggles with moisture and shallow defects, while thermography exhibits weather dependency and limited depth. This paper presents a multi-modal attention network fusing radar temporal patterns with thermal spatial signatures for bridge deck delamination detection. Our architecture introduces temporal attention for radar processing, spatial attention for thermal features, and cross-modal fusion with learnable embeddings discovering complementary defect patterns invisible to individual sensors. We incorporate uncertainty quantification through Monte Carlo dropout and learned variance estimation, decomposing uncertainty into epistemic and aleatoric components for safety-critical decisions. Experiments on five bridge datasets reveal that on balanced to moderately imbalanced data, our approach substantially outperforms baselines in accuracy and AUC, representing meaningful improvements over single-modal and concatenation-based fusion. Ablation studies demonstrate cross-modal attention provides critical gains beyond within-modality attention, while multi-head mechanisms achieve improved calibration. Uncertainty quantification reduces calibration error, enabling selective prediction by rejecting uncertain cases. However, under extreme class imbalance, attention mechanisms show vulnerability to majority class collapse. These findings provide actionable guidance: attention-based architecture performs well across typical scenarios, while extreme imbalance requires specialized techniques. Our system maintains deployment efficiency, enabling real-time inspection with characterized capabilities and limitations.
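
The uncertainty decomposition mentioned in the bridge-inspection abstract can be sketched with a commonly used recipe: run several stochastic forward passes with dropout active, take the variance of the predicted means as the epistemic part, and take the average of the predicted variances as the aleatoric part. Whether the paper uses exactly this form is an assumption, and the function names and the rejection threshold are hypothetical.

```python
from statistics import fmean, pvariance


def decompose_uncertainty(pass_means, pass_variances):
    """Split predictive uncertainty from Monte Carlo dropout passes.

    pass_means:     predicted mean from each stochastic forward pass
    pass_variances: learned variance output from each pass
    Assumption: the usual variance-of-means / mean-of-variances split; the
    abstract does not give the paper's exact estimator.
    """
    if len(pass_means) != len(pass_variances) or len(pass_means) < 2:
        raise ValueError("need at least two passes with matching lengths")
    epistemic = pvariance(pass_means)   # disagreement between passes
    aleatoric = fmean(pass_variances)   # noise the model attributes to the data
    return epistemic, aleatoric, epistemic + aleatoric


def should_reject(total_uncertainty, threshold):
    """Selective prediction: defer a case whose uncertainty exceeds a threshold."""
    return total_uncertainty > threshold


epistemic, aleatoric, total = decompose_uncertainty(
    [0.2, 0.4, 0.6], [0.01, 0.02, 0.03]
)
print(epistemic, aleatoric, should_reject(total, threshold=0.04))
```

Separating the two parts matters for inspection decisions: high epistemic uncertainty suggests the model has not seen similar decks and more data could help, while high aleatoric uncertainty points to noisy sensor readings that more training would not fix.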

[60] DDAVS: Disentangled Audio Semantics and Delayed Bidirectional Alignment for Audio-Visual Segmentation

Jingqi Tian,Yiheng Du,Haoji Zhang,Yuji Wang,Isaac Ning Lee,Xulong Bai,Tianrui Zhu,Jingxuan Niu,Yansong Tang

Main category: cs.CV

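TL;DR: Proposes DDAVS, an audio-visual segmentation framework that addresses multi-source entanglement and audio-visual misalignment through disentangled audio semantics and a delayed bidirectional alignment mechanism, outperforming existing methods on multiple benchmarks.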

Details

Motivation: Because of multi-source entanglement and audio-visual misalignment, existing audio-visual segmentation methods tend to favor louder or larger objects and overlook weaker or co-occurring sound sources, so more robust models are needed for complex scenes. Method: Proposes the DDAVS framework: (1) learnable queries extract audio semantics and anchor them in a structured semantic space built from an audio prototype memory bank, with contrastive learning to enhance discriminability; (2) dual cross-attention with delayed modality interaction improves audio-visual alignment. Result: In extensive experiments on the AVS-Objects and VPO datasets, DDAVS outperforms existing methods in single-source, multi-source, and multi-instance scenarios, showing stronger robustness and generalization. Conclusion: By disentangling audio semantics and improving cross-modal alignment, DDAVS markedly improves audio-visual segmentation, especially in complex real-world scenes, confirming its effectiveness and generality. Abstract: Audio-Visual Segmentation (AVS) aims to localize sound-producing objects at the pixel level by jointly leveraging auditory and visual information. However, existing methods often suffer from multi-source entanglement and audio-visual misalignment, which lead to biases toward louder or larger objects while overlooking weaker, smaller, or co-occurring sources. To address these challenges, we propose DDAVS, a Disentangled Audio Semantics and Delayed Bidirectional Alignment framework. To mitigate multi-source entanglement, DDAVS employs learnable queries to extract audio semantics and anchor them within a structured semantic space derived from an audio prototype memory bank. This is further optimized through contrastive learning to enhance discriminability and robustness. To alleviate audio-visual misalignment, DDAVS introduces dual cross-attention with delayed modality interaction, improving the robustness of multimodal alignment. Extensive experiments on the AVS-Objects and VPO benchmarks demonstrate that DDAVS consistently outperforms existing approaches, exhibiting strong performance across single-source, multi-source, and multi-instance scenarios. These results validate the effectiveness and generalization ability of our framework under challenging real-world audio-visual segmentation conditions. Project page: https://trilarflagz.github.io/DDAVS-page/

[61] HEART-VIT: Hessian-Guided Efficient Dynamic Attention and Token Pruning in Vision Transformer

Mohammad Helal Uddin,Liam Seymour,Sabur Baidya

Main category: cs.CV

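TL;DR: Proposes HEART-ViT, described as the first Hessian-guided dynamic attention and token pruning framework, which jointly optimizes tokens and attention heads in vision transformers for efficient, accurate inference suited to edge devices.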

Details

Motivation: Vision Transformers achieve leading accuracy, but their high computational cost and redundant operations limit deployment on resource-constrained devices; existing pruning methods handle tokens or attention heads in isolation and depend on heuristics or first-order signals, which causes accuracy loss or poor generalization. Method: Proposes the HEART-ViT framework, which uses efficient Hessian-vector products to estimate curvature-weighted sensitivities of tokens and attention heads and makes principled pruning decisions under an explicit loss budget, yielding input-adaptive joint pruning. Result: On ImageNet-100/1K with ViT-B/16 and DeiT-B/16, the method achieves up to 49.4% fewer FLOPs, 36% lower latency, and 46% higher throughput, and matches or exceeds baseline accuracy after fine-tuning (for example, a 4.7% recovery at 40% token pruning); real inference speed and energy-efficiency gains are verified on edge devices such as AGX Orin. Conclusion: HEART-ViT is presented as the first unified ViT pruning framework based on second-order information, preserving accuracy while being efficient on edge devices and closing the gap between theoretical optimization and practical deployment. Abstract: Vision Transformers (ViTs) deliver state-of-the-art accuracy but their quadratic attention cost and redundant computations severely hinder deployment on latency and resource-constrained platforms. Existing pruning approaches treat either tokens or heads in isolation, relying on heuristics or first-order signals, which often sacrifice accuracy or fail to generalize across inputs. We introduce HEART-ViT, a Hessian-guided efficient dynamic attention and token pruning framework for vision transformers, which to the best of our knowledge is the first unified, second-order, input-adaptive framework for ViT optimization. HEART-ViT estimates curvature-weighted sensitivities of both tokens and attention heads using efficient Hessian-vector products, enabling principled pruning decisions under explicit loss budgets. This dual-view sensitivity reveals an important structural insight: token pruning dominates computational savings, while head pruning provides fine-grained redundancy removal, and their combination achieves a superior trade-off. On ImageNet-100 and ImageNet-1K with ViT-B/16 and DeiT-B/16, HEART-ViT achieves up to 49.4 percent FLOPs reduction, 36 percent lower latency, and 46 percent higher throughput, while consistently matching or even surpassing baseline accuracy after fine-tuning, for example 4.7 percent recovery at 40 percent token pruning. Beyond theoretical benchmarks, we deploy HEART-ViT on different edge devices such as AGX Orin, demonstrating that our reductions in FLOPs and latency translate directly into real-world gains in inference speed and energy efficiency. HEART-ViT bridges the gap between theory and practice, delivering the first unified, curvature-driven pruning framework that is both accuracy-preserving and edge-efficient.
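
A Hessian-vector product lets a pruning method use curvature without ever forming the Hessian. The sketch below approximates it by central differences of the gradient and plugs it into a second-order Taylor estimate of the loss change caused by removing a component. Both the finite-difference estimator and the exact sensitivity formula are assumptions for illustration; the abstract does not specify how HEART-ViT computes either, and a real implementation would more likely use automatic differentiation.

```python
def hessian_vector_product(grad_fn, params, vector, eps=1e-4):
    """Approximate H @ vector via central differences of the gradient."""
    plus = grad_fn([p + eps * v for p, v in zip(params, vector)])
    minus = grad_fn([p - eps * v for p, v in zip(params, vector)])
    return [(gp - gm) / (2.0 * eps) for gp, gm in zip(plus, minus)]


def loss_change_estimate(grad_fn, params, delta):
    """Second-order Taylor estimate: g . delta + 0.5 * delta^T H delta.

    Assumption: this is one plausible reading of "curvature-weighted
    sensitivity", not the paper's stated formula.
    """
    grad = grad_fn(params)
    h_delta = hessian_vector_product(grad_fn, params, delta)
    first = sum(g * d for g, d in zip(grad, delta))
    second = 0.5 * sum(d * hd for d, hd in zip(delta, h_delta))
    return first + second


def toy_grad(theta):
    """Gradient of the toy loss 0.5 * theta^T A theta with A = [[2, 1], [1, 3]]."""
    return [2.0 * theta[0] + theta[1], theta[0] + 3.0 * theta[1]]


# "Pruning" the first coordinate of theta = [1, 1] means applying delta = [-1, 0].
print(loss_change_estimate(toy_grad, [1.0, 1.0], [-1.0, 0.0]))
```

For this quadratic toy loss the Taylor estimate is exact: the loss goes from 3.5 to 1.5, a change of -2. Ranking tokens or heads by such estimates, and pruning only while the summed estimate stays within a loss budget, is the general shape of the decision rule the abstract describes.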

[62] milliMamba: Specular-Aware Human Pose Estimation via Dual mmWave Radar with Multi-Frame Mamba Fusion

Niraj Prakash Kini,Shiau-Rung Tsai,Guan-Hsun Lin,Wen-Hsiao Peng,Ching-Wen Ma,Jenq-Neng Hwang

Main category: cs.CV

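TL;DR: Proposes milliMamba, a millimeter-wave radar framework for 2D human pose estimation that introduces a Cross-View Fusion Mamba encoder and a Spatio-Temporal-Cross Attention decoder to jointly model spatio-temporal dependencies in both feature extraction and decoding, coping with sparse radar signals and clearly outperforming baselines on multiple datasets.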

Details

Motivation: Millimeter-wave radar offers privacy preservation and lighting invariance for human pose estimation, but its signals are sparse because of specular reflection, which makes robust feature extraction difficult; a more effective modeling approach is needed. Method: Proposes the milliMamba framework, which uses a Cross-View Fusion Mamba encoder to extract spatio-temporal features from long sequences with linear complexity and a Spatio-Temporal-Cross Attention decoder to predict joint coordinates over multiple frames; a velocity loss is added during training to encourage smooth motion. Result: On the TransHuPR and HuPR datasets, the method exceeds the baselines by 11.0 AP and 14.6 AP, respectively, while keeping reasonable computational complexity. Conclusion: By jointly modeling spatio-temporal dependencies, milliMamba addresses the challenge of sparse radar signals and delivers a marked improvement in radar-based human pose estimation. Abstract: Millimeter-wave radar offers a privacy-preserving and lighting-invariant alternative to RGB sensors for the Human Pose Estimation (HPE) task. However, the radar signals are often sparse due to specular reflection, making the extraction of robust features from radar signals highly challenging. To address this, we present milliMamba, a radar-based 2D human pose estimation framework that jointly models spatio-temporal dependencies across both the feature extraction and decoding stages. Specifically, given the high dimensionality of radar inputs, we adopt a Cross-View Fusion Mamba encoder to efficiently extract spatio-temporal features from longer sequences with linear complexity. A Spatio-Temporal-Cross Attention decoder then predicts joint coordinates across multiple frames. Together, this spatio-temporal modeling pipeline enables the model to leverage contextual cues from neighboring frames and joints to infer missing joints caused by specular reflections. To reinforce motion smoothness, we incorporate a velocity loss alongside the standard keypoint loss during training. Experiments on the TransHuPR and HuPR datasets demonstrate that our method achieves significant performance improvements, exceeding the baselines by 11.0 AP and 14.6 AP, respectively, while maintaining reasonable complexity. Code: https://github.com/NYCU-MAPL/milliMamba
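
A velocity loss of the kind the milliMamba abstract mentions penalizes differences between predicted and ground-truth frame-to-frame joint displacements, rather than the joint positions themselves. The sketch below uses a mean squared error over displacements; the exact norm and weighting used in milliMamba are not given in the abstract, so this form and the name `velocity_loss` are assumptions.

```python
def velocity_loss(pred, target):
    """Mean squared error between predicted and target joint velocities.

    pred, target: lists of frames; each frame is a list of (x, y) joints.
    Assumption: squared-error form chosen for illustration only.
    """
    if len(pred) != len(target) or len(pred) < 2:
        raise ValueError("need at least two frames with matching lengths")
    total, count = 0.0, 0
    for t in range(1, len(pred)):
        joints = zip(pred[t], pred[t - 1], target[t], target[t - 1])
        for (px1, py1), (px0, py0), (tx1, ty1), (tx0, ty0) in joints:
            dvx = (px1 - px0) - (tx1 - tx0)  # velocity error along x
            dvy = (py1 - py0) - (ty1 - ty0)  # velocity error along y
            total += dvx * dvx + dvy * dvy
            count += 2
    return total / count


target = [[(0.0, 0.0)], [(1.0, 0.0)], [(2.0, 0.0)]]   # steady motion
shifted = [[(5.0, 5.0)], [(6.0, 5.0)], [(7.0, 5.0)]]  # constant offset only
jittery = [[(0.0, 0.0)], [(2.0, 0.0)], [(2.0, 0.0)]]  # jump, then stall
print(velocity_loss(shifted, target), velocity_loss(jittery, target))
```

A constant positional offset costs nothing under this term, which is why it is paired with a standard keypoint loss: the keypoint loss fixes where the joints are, and the velocity term discourages frame-to-frame jitter.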

[63] Enhancing annotations for 5D apple pose estimation through 3D Gaussian Splatting (3DGS)

Robert van de Ven,Trim Bresilla,Bram Nelissen,Ard Nieuwenhuizen,Eldert J. van Henten,Gert Kootstra

Main category: cs.CV

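TL;DR: Proposes an annotation pipeline for apple pose estimation based on 3D Gaussian Splatting that greatly reduces the need for manual annotation, and evaluates training under heavy occlusion.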

Details

Motivation: Orchard environments involve heavy occlusion and variation, which makes apple pose estimation difficult; key points such as the calyx are often occluded, and traditional methods that depend on key-point annotation are time-consuming and prone to conflicting or missing labels. Method: The orchard scene is reconstructed with 3D Gaussian Splatting, annotations are simplified and projected automatically onto the images to form an automated annotation and training pipeline, and a pose estimation model is trained and evaluated. Result: Only 105 manual annotations yielded 28,191 training labels, a 99.6% reduction in annotation effort; training on fruits that are at most 95% occluded worked best, with an F1 score of 0.927 on the original images and 0.970 on the rendered images; dataset size had little effect on performance; less-occluded fruits were localized more accurately, but the model could not learn to estimate apple orientation correctly. Conclusion: The pipeline substantially lowers annotation cost and improves data efficiency in complex orchard environments, making it suitable for pose estimation of heavily occluded objects, though orientation estimation remains a limitation. Abstract: Automating tasks in orchards is challenging because of the large amount of variation in the environment and occlusions. One of the challenges is apple pose estimation, where key points, such as the calyx, are often occluded. Recently developed pose estimation methods no longer rely on these key points, but still require them for annotations, making annotating challenging and time-consuming. Due to the abovementioned occlusions, there can be conflicting and missing annotations of the same fruit between different images. Novel 3D reconstruction methods can be used to simplify annotating and enlarge datasets. We propose a novel pipeline consisting of 3D Gaussian Splatting to reconstruct an orchard scene, simplified annotations, automated projection of the annotations to images, and the training and evaluation of a pose estimation method. Using our pipeline, 105 manual annotations were required to obtain 28,191 training labels, a reduction of 99.6%. Experimental results indicated that training with labels of fruits that are $\leq95\%$ occluded resulted in the best performance, with a neutral F1 score of 0.927 on the original images and 0.970 on the rendered images. Adjusting the size of the training dataset had small effects on the model performance in terms of F1 score and pose estimation accuracy. It was found that the least occluded fruits had the best position estimation, which worsened as the fruits became more occluded. It was also found that the tested pose estimation method was unable to correctly learn the orientation estimation of apples.

[64] CoDi -- an exemplar-conditioned diffusion model for low-shot counting

Grega Šuštar,Jer Pelhan,Alan Lukežič,Matej Kristan

Main category: cs.CV

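TL;DR: Proposes CoDi, a low-shot object counter based on a latent diffusion model; a new exemplar-conditioning module extracts object prototypes and adapts them to the intermediate denoising layers, producing high-quality density maps and accurate object localization.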

Details

Motivation: Existing low-shot counting methods are limited in dense regions with small objects: density-based methods localize poorly, while point-detection-based methods underperform on scenes with very many objects because the number of pre-trained queries is limited. Method: Proposes CoDi, a low-shot counter based on a latent diffusion model; an exemplar-conditioning module extracts object prototypes from the exemplars and adapts them to the intermediate layers of the denoising network, producing accurate density maps on which object locations are found by non-maxima suppression. Result: On the FSC benchmark, CoDi reduces MAE relative to the previous best methods by 15%, 13%, and 10% in the few-shot, one-shot, and reference-less settings, respectively; on the MCAC benchmark it reduces MAE by 44%, setting a new state of the art. Conclusion: CoDi combines the strengths of density estimation and precise localization, substantially improving low-shot object counting, particularly in dense scenes with small objects. Abstract: Low-shot object counting addresses estimating the number of previously unobserved objects in an image using only few or no annotated test-time exemplars. A considerable challenge for modern low-shot counters are dense regions with small objects. While total counts in such situations are typically well addressed by density-based counters, their usefulness is limited by poor localization capabilities. This is better addressed by point-detection-based counters, which are based on query-based detectors. However, due to limited number of pre-trained queries, they underperform on images with very large numbers of objects, and resort to ad-hoc techniques like upsampling and tiling. We propose CoDi, the first latent diffusion-based low-shot counter that produces high-quality density maps on which object locations can be determined by non-maxima suppression. Our core contribution is the new exemplar-based conditioning module that extracts and adjusts the object prototypes to the intermediate layers of the denoising network, leading to accurate object location estimation. On FSC benchmark, CoDi outperforms state-of-the-art by 15% MAE, 13% MAE and 10% MAE in the few-shot, one-shot, and reference-less scenarios, respectively, and sets a new state-of-the-art on MCAC benchmark by outperforming the top method by 44% MAE. The code is available at https://github.com/gsustar/CoDi.
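
The CoDi abstract says object locations are read off the density map by non-maxima suppression. A minimal sketch of that step keeps every cell that exceeds a threshold and is strictly larger than its eight neighbors. The 3×3 window, the threshold value, and the name `find_peaks` are assumptions; the abstract does not give CoDi's actual settings.

```python
def find_peaks(density, threshold):
    """Return (row, col) of cells above threshold that beat all 8 neighbours.

    Assumption: a 3x3 neighbourhood with strict comparison, so flat plateaus
    yield no peak; a production version would need a tie-breaking rule.
    """
    rows, cols = len(density), len(density[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            value = density[r][c]
            if value <= threshold:
                continue
            neighbours = [
                density[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(value > n for n in neighbours):
                peaks.append((r, c))
    return peaks


density_map = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.2, 0.0, 0.0],
    [0.0, 0.1, 0.0, 0.3, 0.0],
    [0.0, 0.0, 0.3, 0.8, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
print(find_peaks(density_map, threshold=0.5))  # two objects: (1, 1) and (3, 3)
```

This only works when the density map is sharp enough that each object produces one distinct peak, which is why the abstract stresses map quality as the enabler for localization.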

[65] AMoE: Agglomerative Mixture-of-Experts Vision Foundation Model

Sofian Chaybouti,Sanath Narayan,Yasser Dahou,Phúc H. Lê Khac,Ankit Singh,Ngoc Dung Huynh,Wamiq Reyaz Para,Hilde Kuehne,Hakim Hacid

Main category: cs.CV

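TL;DR: This paper proposes AMoE, an agglomerative Mixture-of-Experts vision foundation model that learns a unified visual representation through multi-teacher distillation from SigLIP2 and DINOv3. It introduces three key techniques (an asymmetric relation-knowledge distillation loss, token-balanced batching, and hierarchical data sampling) that markedly improve training efficiency and sample efficiency. The authors also build and open-source OpenLVD200M, a 200M-image dataset.

Details

Motivation: To study the learning dynamics and data efficiency of multi-teacher distillation for vision foundation models, reduce computational cost, and achieve efficient knowledge fusion and unified visual representation learning.

Method: Proposes the AMoE framework, which uses an Asymmetric Relation-Knowledge Distillation loss to preserve the geometric properties of each teacher, token-balanced batching to handle multi-resolution inputs, and hierarchical clustering and sampling to improve data efficiency.

Result: Builds the OpenLVD200M dataset and substantially improves sample efficiency in multi-teacher distillation; AMoE trains more efficiently while maintaining performance.

Conclusion: Structured loss design, batching optimization, and data sampling strategies can greatly improve the efficiency of multi-teacher distillation, offering a data- and compute-efficient route to training vision foundation models.

Abstract: Vision foundation models trained via multi-teacher distillation offer a promising path toward unified visual representations, yet the learning dynamics and data efficiency of such approaches remain underexplored. In this paper, we systematically study multi-teacher distillation for vision foundation models and identify key factors that enable training at lower computational cost. We introduce Agglomerative Mixture-of-Experts Vision Foundation Models (AMoE), which distill knowledge from SigLIP2 and DINOv3 simultaneously into a Mixture-of-Experts student. We show that (1) our Asymmetric Relation-Knowledge Distillation loss preserves the geometric properties of each teacher while enabling effective knowledge transfer, (2) token-balanced batching that packs varying-resolution images into sequences with uniform token budgets stabilizes representation learning across resolutions without sacrificing performance, and (3) hierarchical clustering and sampling of training data, typically reserved for self-supervised learning, substantially improves sample efficiency over random sampling for multi-teacher distillation. By combining these findings, we curate OpenLVD200M, a 200M-image corpus that demonstrates superior efficiency for multi-teacher distillation, with the student instantiated as a Mixture-of-Experts. We release OpenLVD200M and the distilled models.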
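Token-balanced batching is described only as packing varying-resolution images into sequences with a uniform token budget. A minimal sketch of that idea, assuming square non-overlapping patches and a greedy first-fit-decreasing packer (the patch size, the budget, and the packing heuristic are all assumptions, not details from the paper):

```python
def token_count(height, width, patch=16):
    """Number of patch tokens for an image, assuming non-overlapping patches."""
    return (height // patch) * (width // patch)


def pack_images(image_sizes, budget, patch=16):
    """Greedily pack images into sequences whose token totals stay within `budget`.

    Returns a list of [tokens_used, [(height, width), ...]] entries, one per
    packed sequence. Images are placed largest-first into the first sequence
    that still has room (first-fit decreasing).
    """
    sized = sorted(
        ((token_count(h, w, patch), (h, w)) for h, w in image_sizes),
        reverse=True,
    )
    sequences = []
    for tokens, size in sized:
        if tokens > budget:
            raise ValueError(f"image {size} needs {tokens} tokens, budget is {budget}")
        for seq in sequences:
            if seq[0] + tokens <= budget:
                seq[0] += tokens
                seq[1].append(size)
                break
        else:
            # No existing sequence has room, so start a new one.
            sequences.append([tokens, [size]])
    return sequences


packed = pack_images([(224, 224), (448, 448), (224, 448), (224, 224)], budget=1024)
```

With a patch size of 16 the four images cost 196, 784, 392, and 196 tokens, so they fit into two sequences instead of four separately padded ones.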

[66] Generative Latent Coding for Ultra-Low Bitrate Image Compression

Zhaoyang Jia,Jiahao Li,Bin Li,Houqiang Li,Yan Lu

Main category: cs.CV

TL;DR: Proposes the Generative Latent Coding (GLC) architecture, which performs transform coding in the latent space of a generative model. Compared with pixel space, this latent space is sparser, semantically richer, and better aligned with human perception, which markedly improves realism and fidelity in low-bitrate image compression.

Details

Motivation: Existing image compression methods perform transform coding in pixel space and struggle to achieve both high realism and high fidelity at low bitrates, because pixel-space distortion does not align with human perception.

Method: Proposes the GLC architecture, which moves transform coding from pixel space into the latent space of a generative VQ-VAE, introduces a categorical hyper module to reduce the bit cost of hyper-information, and uses code-prediction-based supervision to strengthen semantic consistency.

Result: On the CLIC2020 test set, GLC matches the FID of MS-ILLM with 45% fewer bits. It maintains high visual quality at less than 0.04 bpp on natural images and less than 0.01 bpp on facial images, and supports applications such as image restoration and style transfer.

Conclusion: By coding in a perception-aligned generative latent space, GLC achieves high-quality image compression at low bitrates and can be extended to a range of vision tasks.

Abstract: Most existing image compression approaches perform transform coding in the pixel space to reduce its spatial redundancy. However, they encounter difficulties in achieving both high realism and high fidelity at low bitrate, as the pixel-space distortion may not align with human perception. To address this issue, we introduce a Generative Latent Coding (GLC) architecture, which performs transform coding in the latent space of a generative vector-quantized variational auto-encoder (VQ-VAE), instead of in the pixel space. The generative latent space is characterized by greater sparsity, richer semantics, and better alignment with human perception, rendering it advantageous for achieving high-realism and high-fidelity compression. Additionally, we introduce a categorical hyper module to reduce the bit cost of hyper-information, and a code-prediction-based supervision to enhance the semantic consistency. Experiments demonstrate that our GLC maintains high visual quality with less than 0.04 bpp on natural images and less than 0.01 bpp on facial images. On the CLIC2020 test set, we achieve the same FID as MS-ILLM with 45% fewer bits. Furthermore, the powerful generative latent space enables various applications built on our GLC pipeline, such as image restoration and style transfer. The code is available at https://github.com/jzyustc/GLC.

[67] JDPNet: A Network Based on Joint Degradation Processing for Underwater Image Enhancement

Tao Ye,Hongbin Ren,Chongbing Zhang,Haoran Chen,Xiaosong Li

Main category: cs.CV

TL;DR: This paper proposes JDPNet, a joint degradation processing network that mines and jointly handles the multiple, nonlinearly coupled degradations present in underwater images.

Details

Motivation: Underwater images are affected by a complex environment and medium and exhibit several nonlinearly coupled degradations, which existing methods handle poorly.

Method: Proposes JDPNet, which comprises a joint feature-mining module and a probabilistic bootstrap distribution strategy, and designs an AquaBalanceLoss to balance color, clarity, and contrast.

Result: Experiments on six public datasets and two newly constructed datasets show that JDPNet achieves state-of-the-art performance with a better tradeoff between performance, parameter size, and computational cost.

Conclusion: JDPNet handles coupled degradations within a unified framework and clearly improves underwater image enhancement.

Abstract: Given the complexity of underwater environments and the variability of water as a medium, underwater images are inevitably subject to various types of degradation. The degradations present nonlinear coupling rather than simple superposition, which renders the effective processing of such coupled degradations particularly challenging. Most existing methods focus on designing specific branches, modules, or strategies for specific degradations, with little attention paid to the potential information embedded in their coupling. Consequently, they struggle to effectively capture and process the nonlinear interactions of multiple degradations from a bottom-up perspective. To address this issue, we propose JDPNet, a joint degradation processing network that mines and unifies the potential information inherent in coupled degradations within a unified framework. Specifically, we introduce a joint feature-mining module, along with a probabilistic bootstrap distribution strategy, to facilitate effective mining and unified adjustment of coupled degradation features. Furthermore, to balance color, clarity, and contrast, we design a novel AquaBalanceLoss to guide the network in learning from multiple coupled degradation losses. Experiments on six publicly available underwater datasets, as well as two new datasets constructed in this study, show that JDPNet exhibits state-of-the-art performance while offering a better tradeoff between performance, parameter size, and computational cost.

Xiangxuan Ren,Zhongdao Wang,Pin Tang,Guoqing Wang,Jilai Zheng,Chao Ma

Main category: cs.CV

TL;DR: Proposes LiteFusion, a new multi-modal 3D detector that uses LiDAR data as a complementary source of geometric information to enhance camera-based detection. It needs neither a 3D backbone nor a dedicated LiDAR encoder, substantially improves performance on nuScenes, and is deployment-friendly and robust.

Details

Motivation: Existing methods rely too heavily on the LiDAR sensor and on 3D sparse convolution, so performance drops sharply when LiDAR is absent and deployment on hardware such as NPUs and FPGAs is difficult.

Method: Treats LiDAR data as a complementary source of geometric information rather than an independent modality. LiDAR point features are fused with image features in a quaternion space that preserves orthogonal constraints, yielding a compact cross-modal embedding and removing the dependence on a 3D backbone entirely.

Result: On nuScenes, LiteFusion improves the vision-only baseline by +20.4% mAP and +19.7% NDS with only a 1.1% increase in parameters, and it retains strong detection performance even without LiDAR input.

Conclusion: With a simple and effective fusion strategy, LiteFusion combines high performance, robustness, and ease of deployment, offering a new paradigm for multi-modal 3D detection.

Abstract: 3D object detection is fundamental for safe and robust intelligent transportation systems. Current multi-modal 3D object detectors often rely on complex architectures and training strategies to achieve higher detection accuracy. However, these methods depend heavily on the LiDAR sensor, so they suffer large performance drops when LiDAR is absent, which compromises the robustness and safety of autonomous systems in practical scenarios. Moreover, existing multi-modal detectors face difficulties in deployment on diverse hardware platforms, such as NPUs and FPGAs, due to their reliance on 3D sparse convolution operators, which are primarily optimized for NVIDIA GPUs. To address these challenges, we reconsider the role of LiDAR in the camera-LiDAR fusion paradigm and introduce a novel multi-modal 3D detector, LiteFusion. Instead of treating LiDAR point clouds as an independent modality with a separate feature extraction backbone, LiteFusion utilizes LiDAR data as a complementary source of geometric information to enhance camera-based detection. This straightforward approach completely eliminates the reliance on a 3D backbone, making the method highly deployment-friendly. Specifically, LiteFusion integrates complementary features from LiDAR points into image features within a quaternion space, where the orthogonal constraints are well-preserved during network training. This helps model domain-specific relations across modalities, yielding a compact cross-modal embedding. Experiments on the nuScenes dataset show that LiteFusion improves the baseline vision-based detector by +20.4% mAP and +19.7% NDS with a minimal increase in parameters (1.1%) without using dedicated LiDAR encoders. Notably, even in the absence of LiDAR input, LiteFusion maintains strong results, highlighting its favorable robustness and effectiveness across diverse fusion paradigms and deployment scenarios.
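The abstract does not spell out the fusion operator, only that features are combined "within a quaternion space". As background, the basic operation of quaternion-valued layers is the Hamilton product, sketched below in plain Python. This illustrates quaternion algebra only and is an assumption about the kind of building block involved, not LiteFusion's layer.

```python
def hamilton_product(q, p):
    """Hamilton product of two quaternions given as (w, x, y, z) tuples."""
    a1, b1, c1, d1 = q
    a2, b2, c2, d2 = p
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    )


# The basis elements obey i * j = k and j * i = -k, so the product is not commutative.
i = (0, 1, 0, 0)
j = (0, 0, 1, 0)
ij = hamilton_product(i, j)
ji = hamilton_product(j, i)
```

Because the four components are mixed by a fixed, structured rule, a quaternion-valued layer shares weights across components, which is one common reason such layers are described as compact.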

[69] IndicDLP: A Foundational Dataset for Multi-Lingual and Multi-Domain Document Layout Parsing

Oikantik Nath,Sahithi Kukkala,Mitesh Khapra,Ravi Kiran Sarvadevabhatla

Main category: cs.CV

TL;DR: This paper presents IndicDLP, a large-scale multilingual document layout dataset covering 11 Indic languages plus English and a range of document types, addressing the shortcomings of existing datasets in fine-grained annotation, multilingual coverage, and scale.

Details

Motivation: Existing document layout datasets lack fine-grained region labels and multilingual diversity; Indic documents in particular are underrepresented, which limits progress in complex document understanding.

Method: Builds IndicDLP, a large-scale multilingual document layout dataset covering 11 Indic languages plus English and 12 common document domains, and curates the UED-mini dataset for pretraining to improve model understanding of Indic documents.

Result: Experiments show that fine-tuning existing English models on IndicDLP significantly improves performance, and that the resulting models perform well on Indic documents and also generalize across languages and domains.

Conclusion: IndicDLP fills gaps in scale, diversity, and annotation granularity among document layout datasets and supports inclusive and efficient document understanding research.

Abstract: Document layout analysis is essential for downstream tasks such as information retrieval, extraction, OCR, and digitization. However, existing large-scale datasets like PubLayNet and DocBank lack fine-grained region labels and multilingual diversity, making them insufficient for representing complex document layouts. In contrast, human-annotated datasets such as M6Doc and D4LA offer richer labels and greater domain diversity, but are too small to train robust models and lack adequate multilingual coverage. This gap is especially pronounced for Indic documents, which encompass diverse scripts yet remain underrepresented in current datasets, further limiting progress in this space. To address these shortcomings, we introduce IndicDLP, a large-scale foundational document layout dataset spanning 11 representative Indic languages alongside English and 12 common document domains. Additionally, we curate UED-mini, a dataset derived from DocLayNet and M6Doc, to enhance pretraining and provide a solid foundation for Indic layout models. Our experiments demonstrate that fine-tuning existing English models on IndicDLP significantly boosts performance, validating its effectiveness. Moreover, models trained on IndicDLP generalize well beyond Indic layouts, making it a valuable resource for document digitization. This work bridges gaps in scale, diversity, and annotation granularity, driving inclusive and efficient document understanding.

[70] Degradation-Aware Metric Prompting for Hyperspectral Image Restoration

Binfeng Wang,Di Wang,Haonan Guo,Ying Fu,Jing Zhang

Main category: cs.CV

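TL;DR: Proposes DAMP, a unified hyperspectral image restoration framework that needs no explicit degradation priors. Spatial-spectral degradation metrics serve as Degradation Prompts (DP), which are combined with a Spatial-Spectral Adaptive Module (SSAM) and a Mixture-of-Experts architecture to restore a variety of complex degradations adaptively, with strong results on several datasets.

Details

Motivation: Existing hyperspectral image restoration methods depend on explicit degradation priors (such as degradation labels) that are hard to obtain, which limits their use under complex, mixed real-world degradations. A unified method that perceives degradation automatically, without priors, is therefore needed.

Method: Proposes the Degradation-Aware Metric Prompting (DAMP) framework: (1) spatial-spectral degradation metrics produce continuous Degradation Prompts (DP) in place of manual degradation labels; (2) a Spatial-Spectral Adaptive Module (SSAM) dynamically modulates feature extraction; (3) SSAMs act as experts in a Mixture-of-Experts architecture with DP as the gating router, enabling adaptive restoration.

Result: Experiments on natural and remote sensing hyperspectral image datasets show that DAMP reaches state-of-the-art quantitative and visual results and generalizes strongly to unseen degradations.

Conclusion: Through metric-driven prompting and adaptive feature modulation, DAMP achieves efficient unified restoration without explicit degradation labels, improving practicality and robustness in complex real-world scenarios.

Abstract: Unified hyperspectral image (HSI) restoration aims to recover various degraded HSIs using a single model, offering great practical value. However, existing methods often depend on explicit degradation priors (e.g., degradation labels) as prompts to guide restoration, which are difficult to obtain due to complex and mixed degradations in real-world scenarios. To address this challenge, we propose a Degradation-Aware Metric Prompting (DAMP) framework. Instead of relying on predefined degradation priors, we design spatial-spectral degradation metrics to continuously quantify multi-dimensional degradations, serving as Degradation Prompts (DP). These DP enable the model to capture cross-task similarities in degradation distributions and enhance shared feature learning. Furthermore, we introduce a Spatial-Spectral Adaptive Module (SSAM) that dynamically modulates spatial and spectral feature extraction through learnable parameters. By integrating SSAM as experts within a Mixture-of-Experts architecture, and using DP as the gating router, the framework enables adaptive, efficient, and robust restoration under diverse, mixed, or unseen degradations. Extensive experiments on natural and remote sensing HSI datasets show that DAMP achieves state-of-the-art performance and demonstrates exceptional generalization capability. Code is publicly available at https://github.com/MiliLab/DAMP.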

Jinghao Shi,Jianing Song

Main category: cs.CV

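TL;DR: This paper proposes BiCoR-Seg, a bidirectional co-refinement framework for semantic segmentation of high-resolution remote sensing images. A heatmap-driven bidirectional information synergy module and a hierarchical supervision strategy improve feature discriminability, giving strong results on several datasets.

Details

Motivation: Semantic segmentation of high-resolution remote sensing images suffers from high inter-class similarity and large intra-class variability. Existing methods struggle to inject strongly discriminative semantic knowledge into pixel-level feature learning, leading to blurred boundaries and class confusion in complex scenes.

Method: Proposes the BiCoR-Seg framework. Its Heatmap-driven Bidirectional Information Synergy Module (HBIS) establishes a bidirectional information flow between feature maps and class embeddings, and the interpretable heatmaps that HBIS produces are used as low-resolution predictions for hierarchical supervision. A cross-layer class embedding Fisher Discriminative Loss further enforces intra-class compactness and inter-class separability.

Result: Extensive experiments on the LoveDA, Vaihingen, and Potsdam datasets show that BiCoR-Seg performs strongly in both segmentation quality and feature interpretability.

Conclusion: Through bidirectional information synergy, hierarchical supervision, and a discriminative loss, BiCoR-Seg improves the accuracy and interpretability of high-resolution remote sensing segmentation and offers a new way to address inter-class similarity and intra-class variation.

Abstract: High-resolution remote sensing image semantic segmentation (HRSS) is a fundamental yet critical task in the field of Earth observation. However, it has long faced the challenges of high inter-class similarity and large intra-class variability. Existing approaches often struggle to effectively inject abstract yet strongly discriminative semantic knowledge into pixel-level feature learning, leading to blurred boundaries and class confusion in complex scenes. To address these challenges, we propose a Bidirectional Co-Refinement Framework for HRSS (BiCoR-Seg). Specifically, we design a Heatmap-driven Bidirectional Information Synergy Module (HBIS), which establishes a bidirectional information flow between feature maps and class embeddings by generating class-level heatmaps. Based on HBIS, we further introduce a hierarchical supervision strategy, where the interpretable heatmaps generated by each HBIS module are directly utilized as low-resolution segmentation predictions for supervision, thereby enhancing the discriminative capacity of shallow features. In addition, to further improve the discriminability of the embedding representations, we propose a cross-layer class embedding Fisher Discriminative Loss to enforce intra-class compactness and enlarge inter-class separability. Extensive experiments on the LoveDA, Vaihingen, and Potsdam datasets demonstrate that BiCoR-Seg achieves outstanding segmentation performance while offering stronger interpretability. The released code is available at https://github.com/ShiJinghao566/BiCoR-Seg.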
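The Fisher Discriminative Loss is described only by its goal: tighter classes and larger gaps between them. A minimal sketch of the classical Fisher-style criterion that expresses this goal is the ratio of within-class scatter to between-class scatter. The exact loss in BiCoR-Seg, including its cross-layer form, is not given in the abstract, so `fisher_ratio` below is an assumed textbook stand-in.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fisher_ratio(embeddings, labels):
    """Within-class scatter divided by between-class scatter (lower is better).

    Raises ZeroDivisionError if all class means coincide, for example when
    only one class is present.
    """
    by_class = {}
    for vector, label in zip(embeddings, labels):
        by_class.setdefault(label, []).append(vector)

    global_mean = mean_vector(embeddings)
    within = 0.0
    between = 0.0
    for vectors in by_class.values():
        class_mean = mean_vector(vectors)
        within += sum(squared_distance(v, class_mean) for v in vectors)
        between += len(vectors) * squared_distance(class_mean, global_mean)
    return within / between


loose = fisher_ratio([[0, 0], [0, 2], [4, 0], [4, 2]], [0, 0, 1, 1])
tight = fisher_ratio([[0, 0.9], [0, 1.1], [4, 0.9], [4, 1.1]], [0, 0, 1, 1])
```

Minimizing such a ratio pulls embeddings toward their class mean while pushing class means apart, which matches the stated aim of intra-class compactness and inter-class separability.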

[72] LADLE-MM: Limited Annotation based Detector with Learned Ensembles for Multimodal Misinformation

Daniele Cardullo,Simone Teglia,Irene Amerini

Main category: cs.CV

TL;DR: This paper proposes LADLE-MM, a multimodal misinformation detector that runs efficiently with limited annotated data and constrained compute and outperforms existing methods on several benchmarks.

Details

Motivation: Existing multimodal misinformation detectors usually depend on large amounts of annotated data and high computational cost, which makes them hard to use in resource-constrained settings. A detector that stays accurate with few annotations and limited resources is needed.

Method: Proposes LADLE-MM, which consists of two unimodal branches and a multimodal branch that incorporates multimodal embeddings extracted from BLIP. It uses a model-soup initialization strategy and is trained without grounding annotations.

Result: On the DGM4 benchmark, LADLE-MM uses 60.3% fewer trainable parameters while performing on par with prior state-of-the-art methods, and it outperforms them when trained without grounding annotations. On the VERITE dataset it surpasses existing methods built on larger vision-language models, showing strong generalization and robustness to unimodal bias.

Conclusion: LADLE-MM is an efficient, lightweight multimodal misinformation detection framework suited to low-resource settings, with good prospects for practical use.

Abstract: With the rise of easily accessible tools for generating and manipulating multimedia content, realistic synthetic alterations to digital media have become a widespread threat, often involving manipulations across multiple modalities simultaneously. Recently, such techniques have been increasingly employed to distort narratives of important events and to spread misinformation on social media, prompting the development of misinformation detectors. In the context of misinformation conveyed through image-text pairs, several detection methods have been proposed. However, these approaches typically rely on computationally intensive architectures or require large amounts of annotated data. In this work we introduce LADLE-MM: Limited Annotation based Detector with Learned Ensembles for Multimodal Misinformation, a model-soup initialized multimodal misinformation detector designed to operate under a limited annotation setup and constrained training resources. LADLE-MM is composed of two unimodal branches and a third multimodal one that enhances image and text representations with additional multimodal embeddings extracted from BLIP, serving as a fixed reference space. Despite using 60.3% fewer trainable parameters than previous state-of-the-art models, LADLE-MM achieves competitive performance on both binary and multi-label classification tasks on the DGM4 benchmark, outperforming existing methods when trained without grounding annotations. Moreover, when evaluated on the VERITE dataset, LADLE-MM outperforms current state-of-the-art approaches that utilize more complex architectures involving Large Vision-Language Models, demonstrating effective generalization ability in an open-set setting and strong robustness to unimodal bias.
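"Model-soup initialized" refers to starting from weights obtained by averaging several trained checkpoints. The abstract does not say which checkpoints are averaged or with what weights, so the sketch below shows only the simplest variant, a uniform soup, over toy checkpoints stored as dictionaries of flat float lists (a hypothetical storage format chosen to keep the example dependency-free).

```python
def uniform_soup(checkpoints):
    """Average parameters element-wise across checkpoints with identical keys and shapes."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    reference = checkpoints[0]
    for checkpoint in checkpoints[1:]:
        if checkpoint.keys() != reference.keys():
            raise ValueError("checkpoints must share the same parameter names")
    count = len(checkpoints)
    return {
        name: [
            sum(checkpoint[name][index] for checkpoint in checkpoints) / count
            for index in range(len(values))
        ]
        for name, values in reference.items()
    }


# Two toy "fine-tuned" checkpoints of the same tiny model.
run_a = {"w": [1.0, 2.0], "b": [0.0]}
run_b = {"w": [3.0, 4.0], "b": [1.0]}
soup = uniform_soup([run_a, run_b])
```

The averaged weights can then serve as the starting point for further training, at the inference cost of a single model rather than an ensemble.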

[73] ${D}^{3}${ETOR}: ${D}$ebate-Enhanced Pseudo Labeling and Frequency-Aware Progressive ${D}$ebiasing for Weakly-Supervised Camouflaged Object ${D}$etection with Scribble Annotations

Jiawei Ge,Jiuxin Cao,Xinyi Li,Xuelin Zhu,Chang Liu,Bo Liu,Chen Feng,Ioannis Patras

Main category: cs.CV

TL;DR: This paper proposes ${D}^{3}$ETOR, a two-stage weakly-supervised camouflaged object detection framework. Debate-enhanced pseudo labeling and frequency-aware debiasing improve pseudo-label quality and the model's grasp of global structure, substantially narrowing the gap between weakly and fully supervised methods.

Details

Motivation: Existing weakly-supervised camouflaged object detection methods lag far behind fully supervised ones because the pseudo labels produced by general-purpose segmentation models are unreliable and the bias inherent in scribble annotations is ignored.

Method: Proposes the ${D}^{3}$ETOR framework. The first stage refines SAM-generated pseudo masks with adaptive entropy-driven point sampling and a multi-agent debate mechanism. The second stage introduces FADeNet, which fuses multi-level frequency-aware features and dynamically reweights supervision to alleviate annotation bias.

Result: Achieves state-of-the-art performance on multiple benchmark datasets, improving weakly-supervised camouflaged object detection.

Conclusion: By making pseudo labels more reliable and modeling global structure, ${D}^{3}$ETOR substantially narrows the gap between weakly and fully supervised camouflaged object detection.

Abstract: Weakly-Supervised Camouflaged Object Detection (WSCOD) aims to locate and segment objects that are visually concealed within their surrounding scenes, relying solely on sparse supervision such as scribble annotations. Despite recent progress, existing WSCOD methods still lag far behind fully supervised ones due to two major limitations: (1) the pseudo masks generated by general-purpose segmentation models (e.g., SAM) and filtered via rules are often unreliable, as these models lack the task-specific semantic understanding required for effective pseudo labeling in COD; and (2) the neglect of inherent annotation bias in scribbles, which hinders the model from capturing the global structure of camouflaged objects. To overcome these challenges, we propose ${D}^{3}$ETOR, a two-stage WSCOD framework consisting of Debate-Enhanced Pseudo Labeling and Frequency-Aware Progressive Debiasing. In the first stage, we introduce an adaptive entropy-driven point sampling method and a multi-agent debate mechanism to enhance the capability of SAM for COD, improving the interpretability and precision of pseudo masks. In the second stage, we design FADeNet, which progressively fuses multi-level frequency-aware features to balance global semantic understanding with local detail modeling, while dynamically reweighting supervision strength across regions to alleviate scribble bias. By jointly exploiting the supervision signals from both the pseudo masks and scribble semantics, ${D}^{3}$ETOR significantly narrows the gap between weakly and fully supervised COD, achieving state-of-the-art performance on multiple benchmarks.

[74] UbiQVision: Quantifying Uncertainty in XAI for Image Recognition

Akshat Dubey,Aleksandar Anžel,Bahar İlgen,Georges Hattab

Main category: cs.CV

TL;DR: This paper proposes a new framework that combines Dirichlet posterior sampling with Dempster-Shafer theory to quantify the uncertainty of SHAP explanations in medical imaging, improving the interpretability and reliability of deep learning models in healthcare.

Details

Motivation: Deep learning models in medical imaging are growing more complex, which reduces interpretability. SHAP helps explain model predictions, but its explanations can be unstable under epistemic and aleatoric uncertainty, so a reliable way to quantify that uncertainty is needed.

Method: Uses Dirichlet posterior sampling and Dempster-Shafer theory, with belief, plausibility, and fusion maps together with statistical analysis, to model and quantify the uncertainty in SHAP explanations.

Result: The framework is evaluated on three medical imaging datasets with different class distributions, image qualities, and modality types, covering pathology, ophthalmology, and radiology, where it handles noise and uncertainty effectively.

Conclusion: The proposed framework quantifies the uncertainty in SHAP explanations and improves the interpretability and trustworthiness of deep learning models for medical image analysis, making it suitable for high-stakes decision settings.

Abstract: Recent advances in deep learning have led to its widespread adoption across diverse domains, including medical imaging. This progress is driven by increasingly sophisticated model architectures, such as ResNets, Vision Transformers, and Hybrid Convolutional Neural Networks, that offer enhanced performance at the cost of greater complexity. This complexity often compromises model explainability and interpretability. SHAP has emerged as a prominent method for providing interpretable visualizations that aid domain experts in understanding model predictions. However, SHAP explanations can be unstable and unreliable in the presence of epistemic and aleatoric uncertainty. In this study, we address this challenge by using Dirichlet posterior sampling and Dempster-Shafer theory to quantify the uncertainty that arises from these unstable explanations in medical imaging applications. The framework uses belief, plausibility, and fusion maps alongside statistical quantitative analysis to quantify the uncertainty in SHAP. Furthermore, we evaluated our framework on three medical imaging datasets with varying class distributions, image qualities, and modality types, covering examples from pathology, ophthalmology, and radiology. The varying image resolutions and modality-specific aspects of these datasets introduce noise and significant epistemic uncertainty.
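Belief and plausibility are the two standard Dempster-Shafer quantities behind the maps named in the abstract. The sketch below shows them, plus Dempster's rule of combination, on a toy two-class frame. How the paper derives mass functions from SHAP values and Dirichlet samples is not described in the abstract, so the masses here are made-up inputs for illustration only.

```python
def combine(m1, m2):
    """Dempster's rule: fuse two mass functions keyed by frozenset hypotheses."""
    fused = {}
    conflict = 0.0
    for a, mass_a in m1.items():
        for b, mass_b in m2.items():
            intersection = a & b
            if intersection:
                fused[intersection] = fused.get(intersection, 0.0) + mass_a * mass_b
            else:
                conflict += mass_a * mass_b  # mass assigned to the empty set
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    return {h: mass / (1.0 - conflict) for h, mass in fused.items()}


def belief(masses, hypothesis):
    """Total mass of focal sets fully contained in the hypothesis."""
    return sum(mass for h, mass in masses.items() if h <= hypothesis)


def plausibility(masses, hypothesis):
    """Total mass of focal sets that overlap the hypothesis."""
    return sum(mass for h, mass in masses.items() if h & hypothesis)


BENIGN = frozenset({"benign"})
MALIGNANT = frozenset({"malignant"})
EITHER = BENIGN | MALIGNANT  # mass placed here expresses ignorance

source_1 = {BENIGN: 0.6, EITHER: 0.4}
source_2 = {MALIGNANT: 0.3, EITHER: 0.7}
fused = combine(source_1, source_2)
bel_benign = belief(fused, BENIGN)
pl_benign = plausibility(fused, BENIGN)
```

The gap between plausibility and belief for a hypothesis is the mass left on ignorance, which is one natural per-pixel uncertainty signal in an explanation map.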

[75] TAVID: Text-Driven Audio-Visual Interactive Dialogue Generation

Ji-Hoon Kim,Junseok Ahn,Doyeop Kwak,Joon Son Chung,Shinji Watanabe

Main category: cs.CV

TL;DR: This paper presents TAVID, a unified framework that generates interactive face videos and conversational speech in sync from text and reference images, using cross-modal mappers for bidirectional information exchange between the audio and visual modalities.

Details

Motivation: Prior work usually treats talking face generation and conversational speech generation in isolation, overlooking the tightly coupled audio-visual nature of human conversation.

Method: The TAVID framework introduces two cross-modal mappers, a motion mapper and a speaker mapper, to model facial motion and speech generation jointly and synthesize audio and video in sync.

Result: Experiments across four dimensions (talking face realism, listening head responsiveness, dyadic interaction fluency, and speech quality) show that TAVID performs well on all of them.

Conclusion: TAVID generates interactive audio-visual content jointly and offers an effective approach to building more natural human-machine conversational systems.

Abstract: The objective of this paper is to jointly synthesize interactive videos and conversational speech from text and reference images. With the ultimate goal of building human-like conversational systems, recent studies have explored talking or listening head generation as well as conversational speech generation. However, these works are typically studied in isolation, overlooking the multimodal nature of human conversation, which involves tightly coupled audio-visual interactions. In this paper, we introduce TAVID, a unified framework that generates both interactive faces and conversational speech in a synchronized manner. TAVID integrates face and speech generation pipelines through two cross-modal mappers (i.e., a motion mapper and a speaker mapper), which enable bidirectional exchange of complementary information between the audio and visual modalities. We evaluate our system across four dimensions: talking face realism, listening head responsiveness, dyadic interaction fluency, and speech quality. Extensive experiments demonstrate the effectiveness of our approach across all these aspects.

[76] The devil is in the details: Enhancing Video Virtual Try-On via Keyframe-Driven Details Injection

Qingdong He,Xueqin Chen,Yanjie Pan,Peng Tang,Pengcheng Xu,Zhenye Gan,Chengjie Wang,Xiaobin Hu,Jiangning Zhang,Yabiao Wang

Main category: cs.CV

TL;DR: Proposes the KeyTailor framework and the ViT-HD dataset. A keyframe-driven details injection strategy improves garment detail and background consistency in video virtual try-on without adding complexity to the DiT architecture.

Details

Motivation: Existing DiT-based video virtual try-on methods struggle to capture fine-grained garment dynamics and preserve background integrity, incur high computational cost, and are limited by the scale and quality of available datasets.

Method: Designs a keyframe-guided sampling strategy and two tailored modules, a garment details enhancement module and a collaborative background optimization module, which extract information from keyframes and inject it into the DiT to strengthen garment dynamics and background consistency.

Result: Validated on the authors' high-definition dataset ViT-HD (15,070 videos at 810×1080), KeyTailor outperforms existing methods in garment fidelity and background integrity.

Conclusion: Keyframe-driven details injection improves the quality of video virtual try-on without modifying the DiT architecture, balancing efficiency and quality.

Abstract: Although diffusion transformer (DiT)-based video virtual try-on (VVT) has made significant progress in synthesizing realistic videos, existing methods still struggle to capture fine-grained garment dynamics and preserve background integrity across video frames. They also incur high computational costs due to additional interaction modules introduced into DiTs, while the limited scale and quality of existing public datasets also restrict model generalization and effective training. To address these challenges, we propose a novel framework, KeyTailor, along with a large-scale, high-definition dataset, ViT-HD. The core idea of KeyTailor is a keyframe-driven details injection strategy, motivated by the fact that keyframes inherently contain both foreground dynamics and background consistency. Specifically, KeyTailor adopts an instruction-guided keyframe sampling strategy to filter informative frames from the input video. Subsequently, two tailored keyframe-driven modules, the garment details enhancement module and the collaborative background optimization module, are employed to distill garment dynamics into garment-related latents and to optimize the integrity of background latents, both guided by keyframes. These enriched details are then injected into standard DiT blocks together with pose, mask, and noise latents, enabling efficient and realistic try-on video synthesis. This design ensures consistency without explicitly modifying the DiT architecture, while simultaneously avoiding additional complexity. In addition, our dataset ViT-HD comprises 15,070 high-quality video samples at a resolution of 810×1080, covering diverse garments. Extensive experiments demonstrate that KeyTailor outperforms state-of-the-art baselines in terms of garment fidelity and background integrity across both dynamic and static scenarios.

[77] CRAFT: Continuous Reasoning and Agentic Feedback Tuning for Multimodal Text-to-Image Generation

V. Kovalev,A. Kuvshinov,A. Buzovkin,D. Pokidov,D. Timonin

Main category: cs.CV

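TL;DR: This paper presents CRAFT, a training-free, model-agnostic framework that decomposes a prompt into structured visual questions, verifies generated images with a vision-language model, and applies targeted edits only where constraints fail. The result is interpretable, controllable inference-time refinement that markedly improves compositional accuracy and text rendering in multimodal generative models.

Details

Motivation: Existing inference-time refinement methods rely on implicit critiques or unconstrained prompt rewrites, which makes their behavior hard to interpret, control, and stop. This work brings in the explicit, structured style of reasoning that has worked for large language models to make image generation more controllable and reliable.

Method: Proposes the CRAFT framework: (1) decompose the input prompt into dependency-structured visual questions; (2) verify the generated image with a vision-language model; (3) have an LLM agent apply targeted prompt edits only where constraints are not met; (4) stop iterating under an explicit criterion once all constraints are satisfied.

Result: Across multiple model families and challenging benchmarks, CRAFT consistently improves compositional accuracy, text rendering quality, and preference-based scores, with especially strong gains for lightweight generators and only negligible inference overhead.

Conclusion: Explicit, constraint-driven structured reasoning is a key mechanism for making multimodal generative models more reliable, and CRAFT offers an effective training-free route to high-quality image generation.

Abstract: Recent work has shown that inference-time reasoning and reflection can improve text-to-image generation without retraining. However, existing approaches often rely on implicit, holistic critiques or unconstrained prompt rewrites, making their behavior difficult to interpret, control, or stop reliably. In contrast, large language models have benefited from explicit, structured forms of "thinking" based on verification, targeted correction, and early stopping. We introduce CRAFT (Continuous Reasoning and Agentic Feedback Tuning), a training-free, model-agnostic framework that brings this structured reasoning paradigm to multimodal image generation. CRAFT decomposes a prompt into dependency-structured visual questions, verifies generated images using a vision-language model, and applies targeted prompt edits through an LLM agent only where constraints fail. The process iterates with an explicit stopping criterion once all constraints are satisfied, yielding an interpretable and controllable inference-time refinement loop. Across multiple model families and challenging benchmarks, CRAFT consistently improves compositional accuracy, text rendering, and preference-based evaluations, with particularly strong gains for lightweight generators. Importantly, these improvements incur only a negligible inference-time overhead, allowing smaller or cheaper models to approach the quality of substantially more expensive systems. Our results suggest that explicitly structured, constraint-driven inference-time reasoning is a key ingredient for improving the reliability of multimodal generative models.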
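The control flow described in the abstract (generate, verify each constraint, edit only on failure, stop once everything passes) can be sketched as a small loop. Everything below is a hypothetical stand-in: `generate`, `verify`, and `edit` are injected callables, the toy versions operate on word sets rather than images, and the dependency structure between questions is omitted.

```python
def refine(prompt, constraints, generate, verify, edit, max_rounds=5):
    """Iterate generate -> verify -> targeted edit until all constraints hold.

    Returns (final_prompt, final_image, rounds_used). `max_rounds` must be at
    least 1 and acts as a safety cap in addition to the explicit stopping
    criterion.
    """
    for round_index in range(max_rounds):
        image = generate(prompt)
        failed = [c for c in constraints if not verify(image, c)]
        if not failed:
            # Explicit stopping criterion: every constraint is satisfied.
            return prompt, image, round_index + 1
        # Edit the prompt only with respect to the constraints that failed.
        prompt = edit(prompt, failed)
    return prompt, image, max_rounds


# Toy stand-ins: an "image" is just the set of words in the prompt.
def toy_generate(prompt):
    return set(prompt.split())


def toy_verify(image, constraint):
    return constraint in image


def toy_edit(prompt, failed):
    return prompt + " " + " ".join(failed)


final_prompt, final_image, rounds = refine(
    "a red cube", ["red", "cube", "table"], toy_generate, toy_verify, toy_edit
)
```

In a real system the three callables would wrap a text-to-image model, a vision-language verifier, and an LLM editing agent; the loop itself stays the same, which is what makes the approach training-free and model-agnostic.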

[78] Linking Faces and Voices Across Languages: Insights from the FAME 2026 Challenge

Marta Moscati,Ahmed Abdullah,Muhammad Saad Saeed,Shah Nawaz,Rohan Kumar Das,Muhammad Zaigham Zaheer,Junaid Mir,Muhammad Haroon Yousaf,Khalid Mahmood Malik,Markus Schedl

Main category: cs.CV

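TL;DR: The FAME 2026 Challenge aims to advance face-voice association in cross-lingual settings, focusing on model performance when the test language differs from the training language.

Details

Motivation: More than half of the world's population communicates in more than one language, so face-voice association methods need to work across different language conditions.

Method: The FAME 2026 Challenge provides a common dataset and evaluation protocol to encourage participants to develop face-voice association methods that are robust across languages.

Result: The report summarizes the setup and goals of the challenge but does not describe specific submissions or performance results.

Conclusion: The FAME 2026 Challenge provides an important platform for studying cross-modal association in multilingual environments and advances the field.

Abstract: Over half of the world's population is bilingual and people often communicate under multilingual scenarios. The Face-Voice Association in Multilingual Environments (FAME) 2026 Challenge, held at ICASSP 2026, focuses on developing methods for face-voice association that are effective when the language at test time differs from the training language. This report provides a brief summary of the challenge.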

[79] SmartSplat: Feature-Smart Gaussians for Scalable Compression of Ultra-High-Resolution Images

Linfei Li,Lin Zhang,Zhong Wang,Ying Shen

Main category: cs.CV

TL;DR: This paper proposes SmartSplat, a highly adaptive, feature-aware image compression framework based on Gaussian Splatting that reconstructs efficiently at arbitrary resolutions and compression ratios and clearly outperforms existing methods.

Details

Motivation: Existing 2D Gaussian image models struggle to balance compression ratio and reconstruction fidelity for ultra-high-resolution images, so a more efficient representation is needed.

Method: Introduces a Gradient-Color Guided Variational Sampling strategy and an Exclusion-based Uniform Sampling scheme, combined with a Scale-Adaptive Gaussian Color Sampling method, to optimize the spatial layout, scale, and color initialization of the Gaussian primitives.

Result: Experiments on DIV8K and a newly constructed 16K dataset show that SmartSplat outperforms state-of-the-art methods at comparable compression ratios, exceeds their compression limits, and scales well.

Conclusion: By jointly optimizing several aspects of the Gaussian representation, SmartSplat delivers efficient, high-quality reconstruction for ultra-high-resolution image compression, with strong practicality and broad applicability.

Abstract: Recent advances in generative AI have accelerated the production of ultra-high-resolution visual content, posing significant challenges for efficient compression and real-time decoding on end-user devices. Inspired by 3D Gaussian Splatting, recent 2D Gaussian image models improve representation efficiency, yet existing methods struggle to balance compression ratio and reconstruction fidelity in ultra-high-resolution scenarios. To address this issue, we propose SmartSplat, a highly adaptive and feature-aware GS-based image compression framework that supports arbitrary image resolutions and compression ratios. SmartSplat leverages image-aware features such as gradients and color variances, introducing a Gradient-Color Guided Variational Sampling strategy together with an Exclusion-based Uniform Sampling scheme to improve the non-overlapping coverage of Gaussian primitives in pixel space. In addition, we propose a Scale-Adaptive Gaussian Color Sampling method to enhance color initialization across scales. Through joint optimization of spatial layout, scale, and color initialization, SmartSplat efficiently captures both local structures and global textures using a limited number of Gaussians, achieving high reconstruction quality under strong compression. Extensive experiments on DIV8K and a newly constructed 16K dataset demonstrate that SmartSplat consistently outperforms state-of-the-art methods at comparable compression ratios and exceeds their compression limits, showing strong scalability and practical applicability. The code is publicly available at https://github.com/lif314/SmartSplat.

[80] DETACH : Decomposed Spatio-Temporal Alignment for Exocentric Video and Ambient Sensors with Staged Learning

Junho Yoon,Jaemo Jung,Hyunju Kim,Dongman Lee

Main category: cs.CV

TL;DR: This paper proposes DETACH, a decomposed spatio-temporal framework for aligning exocentric video with ambient sensors, addressing the limits of existing global alignment methods in exocentric settings.

Details

Motivation: Existing methods based on egocentric video and wearable sensors raise problems of user discomfort, privacy, and scalability. Exocentric video with ambient sensors is a non-intrusive, scalable alternative, but it suffers from poor capture of local details and misaligned temporal patterns.

Method: Proposes the DETACH framework, which preserves local details through explicit decomposition and uses sensor-spatial features discovered by online clustering for semantic grounding. Alignment proceeds in two stages: spatial correspondence is first established through mutual supervision, and temporal alignment then uses a spatial-temporal weighted contrastive loss that adaptively handles different kinds of negatives.

Result: Downstream-task experiments on the Opportunity++ and HWU-USP datasets show substantial improvements over adapted egocentric-wearable baselines.

Conclusion: With decomposed spatio-temporal modeling and context-aware alignment, DETACH resolves the alignment difficulties of exocentric-ambient sensing and improves action recognition.

Abstract: Aligning egocentric video with wearable sensors has shown promise for human action recognition, but faces practical limitations in user discomfort, privacy concerns, and scalability. We explore exocentric video with ambient sensors as a non-intrusive, scalable alternative. While prior egocentric-wearable works predominantly adopt Global Alignment by encoding entire sequences into unified representations, this approach fails in exocentric-ambient settings due to two problems: (P1) inability to capture local details such as subtle motions, and (P2) over-reliance on modality-invariant temporal patterns, causing misalignment between actions sharing similar temporal patterns with different spatio-semantic contexts. To resolve these problems, we propose DETACH, a decomposed spatio-temporal framework. This explicit decomposition preserves local details, while our novel sensor-spatial features discovered via online clustering provide semantic grounding for context-aware alignment. To align the decomposed features, our two-stage approach establishes spatial correspondence through mutual supervision, then performs temporal alignment via a spatial-temporal weighted contrastive loss that adaptively handles easy negatives, hard negatives, and false negatives. Comprehensive experiments with downstream tasks on the Opportunity++ and HWU-USP datasets demonstrate substantial improvements over adapted egocentric-wearable baselines.

[81] Chain-of-Anomaly Thoughts with Large Vision-Language Models

Pedro Domingos,João Pereira,Vasco Lopes,João Neves,David Semedo

Main category: cs.CV

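TL;DR: Proposes Chain-of-Anomaly-Thoughts (CoAT), a multi-agent reasoning framework that improves the detection of anomalous behavior in video surveillance by introducing an inductive anomaly bias into the reasoning process.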

Details

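Motivation: Existing large vision-language models used for automated video surveillance have an inherent bias towards normality and struggle to detect anomalous events such as crimes, and existing chain-of-thought reasoning strategies lack an inductive bias towards anomalies. Method: Designs CoAT, a multi-agent reasoning framework that adds an anomaly-focused classification module at the end of the reasoning chain, injecting an inductive criminal bias into the reasoning process and strengthening the recognition of anomalous events. Result: F1-score improves by 11.8 percentage points on low-resolution video, and anomaly classification improves by 3.78 percentage points on high-resolution video. Conclusion: By introducing an anomaly-oriented reasoning mechanism, CoAT significantly improves the anomaly detection and classification ability of vision-language models in video surveillance. Abstract: Automated video surveillance with Large Vision-Language Models is limited by their inherent bias towards normality, often failing to detect crimes. While Chain-of-Thought reasoning strategies show significant potential for improving performance in language tasks, the lack of inductive anomaly biases in their reasoning further steers the models towards normal interpretations. To address this, we propose Chain-of-Anomaly-Thoughts (CoAT), a multi-agent reasoning framework that introduces inductive criminal bias in the reasoning process through a final, anomaly-focused classification layer. Our method significantly improves Anomaly Detection, boosting F1-score by 11.8 p.p. on challenging low-resolution footage and Anomaly Classification by 3.78 p.p. in high-resolution videos.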

[82] Skin Lesion Classification Using a Soft Voting Ensemble of Convolutional Neural Networks

Abdullah Al Shafi,Abdul Muntakim,Pintu Chandra Shill,Rowzatul Zannat,Abdullah Al-Amin

Main category: cs.CV

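TL;DR: Proposes an early skin cancer classification method based on a soft voting ensemble of CNNs; with image augmentation, rebalancing, and hybrid dual-encoder segmentation, it achieves high lesion recognition accuracy on three benchmark datasets.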

Details

Motivation: Early detection of skin cancer significantly improves survival rates, but traditional diagnosis depends on the clinician's experience. AI techniques, deep learning in particular, have shown promise in medical image analysis, which motivates a more accurate automated classification system suited to real-world use. Method: Builds a soft voting ensemble of MobileNetV2, VGG19, and InceptionV3, combined with image rebalancing, data augmentation, and filtering, and uses transfer learning to implement a hybrid dual encoder for lesion segmentation, so that the classifiers focus on clinically significant features and background interference is reduced. Result: Classification accuracies of 96.32%, 90.86%, and 93.92% on the HAM10000, ISIC 2016, and ISIC 2019 datasets, respectively, with strong results on the evaluation metrics. Conclusion: The method improves the recognition accuracy of skin cancer lesions while balancing accuracy and computational efficiency, and shows good prospects for clinical decision support. Abstract: Skin cancer can be identified by dermoscopic examination and ocular inspection, but early detection significantly increases survival chances. Artificial intelligence (AI), using annotated skin images and Convolutional Neural Networks (CNNs), improves diagnostic accuracy. This paper presents an early skin cancer classification method using a soft voting ensemble of CNNs. In this investigation, three benchmark datasets, namely HAM10000, ISIC 2016, and ISIC 2019, were used. The process involved rebalancing, image augmentation, and filtering techniques, followed by a hybrid dual encoder for segmentation via transfer learning. Accurate segmentation focused classification models on clinically significant features, reducing background artifacts and improving accuracy. Classification was performed through an ensemble of MobileNetV2, VGG19, and InceptionV3, balancing accuracy and speed for real-world deployment. The method achieved lesion recognition accuracies of 96.32%, 90.86%, and 93.92% for the three datasets. The system performance was evaluated using established skin lesion detection metrics, yielding impressive results.
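
Soft voting, as named in the abstract above, generally means averaging the class probabilities produced by each model and then taking the most probable class. The following is a minimal NumPy sketch of that idea only; the probability values, the equal default weights, and the function name `soft_vote` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def soft_vote(prob_list, weights=None):
    """Average per-model class probabilities and pick the argmax class."""
    probs = np.stack(prob_list, axis=0)  # (n_models, n_samples, n_classes)
    if weights is None:
        weights = np.ones(probs.shape[0])  # assumption: equal model weights
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    avg = np.tensordot(weights, probs, axes=1)  # (n_samples, n_classes)
    return avg, avg.argmax(axis=1)

# Hypothetical softmax outputs of three models for two images and three classes.
p_mobilenet = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
p_vgg = np.array([[0.4, 0.5, 0.1], [0.1, 0.3, 0.6]])
p_inception = np.array([[0.5, 0.2, 0.3], [0.2, 0.2, 0.6]])

avg, labels = soft_vote([p_mobilenet, p_vgg, p_inception])
print(labels)
```

Note that the second image is assigned class 2 even though one model preferred class 1; averaging probabilities lets confident models outweigh a single dissenting vote, which is the usual argument for soft over hard voting.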

[83] High Dimensional Data Decomposition for Anomaly Detection of Textured Images

Ji Song,Xing Wang,Jianguo Wu,Xiaowei Yue

Main category: cs.CV

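TL;DR: Proposes TBSD, an efficient anomaly detection method for textured images that learns texture basis functions and uses them as prior knowledge to reduce misidentification; it targets textured images with smooth backgrounds and sparse anomalies.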

Details

Motivation: Conventional anomaly detection methods applied to textured defect images suffer from high misidentification rates, low robustness, and reliance on large-scale data, so a more efficient method is needed. Method: Proposes texture basis integrated smooth decomposition (TBSD), which consists of two main processes: it first learns texture basis functions to extract quasi-periodic texture patterns, and then uses that basis as prior knowledge for anomaly detection. Result: TBSD outperforms benchmark methods on both simulated and real-world datasets, with less misidentification, a smaller training data requirement, and better detection performance. Conclusion: TBSD is an efficient, robust, and data-efficient anomaly detection method for textured images, particularly suited to smooth backgrounds with sparse anomalies. Abstract: In the realm of diverse high-dimensional data, images play a significant role across various processes of manufacturing systems where efficient image anomaly detection has emerged as a core technology of utmost importance. However, when applied to textured defect images, conventional anomaly detection methods have limitations including non-negligible misidentification, low robustness, and excessive reliance on large-scale and structured datasets. This paper proposes a texture basis integrated smooth decomposition (TBSD) approach, which is targeted at efficient anomaly detection in textured images with smooth backgrounds and sparse anomalies. Mathematical formulation of quasi-periodicity and its theoretical properties are investigated for image texture estimation. The TBSD method consists of two principal processes: the first process learns the texture basis functions to effectively extract quasi-periodic texture patterns; the subsequent anomaly detection process utilizes that texture basis as prior knowledge to prevent texture misidentification and capture potential anomalies with high accuracy. The proposed method surpasses benchmarks with less misidentification, smaller training dataset requirement, and superior anomaly detection performance on both simulation and real-world datasets.

[84] Beyond Motion Pattern: An Empirical Study of Physical Forces for Human Motion Understanding

Anh Dao,Manh Tran,Yufei Zhang,Xiaoming Liu,Zijun Cui

Main category: cs.CV

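TL;DR: This study incorporates physically inferred forces (such as joint actuation forces) into human motion understanding models and systematically evaluates their effect on gait recognition, action recognition, and fine-grained video captioning. Performance improves across multiple benchmarks, with larger gains under challenging conditions such as occlusion or viewpoint change.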

Details

Motivation: Most existing vision-based human motion understanding methods ignore physical cues that are central to biomechanics, such as joint actuation forces, which limits their understanding of complex dynamic scenes. This paper investigates whether and when physical force signals enhance motion understanding. Method: Physically inferred forces are added as complementary features to existing motion understanding pipelines, and their effect on three tasks (gait recognition, action recognition, and video captioning) is evaluated on 8 benchmark datasets. Result: On CASIA-B, Rank-1 gait recognition accuracy improves by 0.87 points, with gains of +2.7% when wearing a coat and +3.0% at the side view; Gait3D improves by 1.3 points; CTR-GCN improves action recognition on Penn Action by 2.00%, and by 6.96% on high-exertion actions; the ROUGE-L score of Qwen2.5-VL on video captioning rises by 0.029. Conclusion: Physically inferred forces effectively complement visual and kinematic features and substantially improve motion understanding under dynamic, occluded, or appearance-varying conditions. Abstract: Human motion understanding has advanced rapidly through vision-based progress in recognition, tracking, and captioning. However, most existing methods overlook physical cues such as joint actuation forces that are fundamental in biomechanics. This gap motivates our study: if and when do physically inferred forces enhance motion understanding? By incorporating forces into established motion understanding pipelines, we systematically evaluate their impact across baseline models on 3 major tasks: gait recognition, action recognition, and fine-grained video captioning. Across 8 benchmarks, incorporating forces yields consistent performance gains; for example, on CASIA-B, Rank-1 gait recognition accuracy improved from 89.52% to 90.39% (+0.87), with larger gains observed under challenging conditions: +2.7% when wearing a coat and +3.0% at the side view. On Gait3D, performance also increases from 46.0% to 47.3% (+1.3). In action recognition, CTR-GCN achieved +2.00% on Penn Action, while high-exertion classes like punching/slapping improved by +6.96%. Even in video captioning, Qwen2.5-VL's ROUGE-L score rose from 0.310 to 0.339 (+0.029), indicating that physics-inferred forces enhance temporal grounding and semantic richness. These results demonstrate that force cues can substantially complement visual and kinematic features under dynamic, occluded, or appearance-varying conditions.

[85] UTDesign: A Unified Framework for Stylized Text Editing and Generation in Graphic Design Images

Yiming Zhao,Yuanpeng Gao,Yuxuan Luo,Jiwei Duan,Shisong Lin,Longfei Xiong,Zhouhui Lian

Main category: cs.CV

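TL;DR: This paper proposes UTDesign, a unified framework for high-precision stylized text editing and conditional text generation that supports both Chinese and English. Built on a DiT architecture and trained on synthetic data, it renders high-quality text in design images, is integrated into a fully automated text-to-design (T2D) pipeline, and outperforms existing open-source methods and some commercial ones.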

Details

Motivation: Existing diffusion models have limited text rendering ability in graphic design, performing poorly on small type and non-Latin scripts such as Chinese, and lacking stylistic consistency and precision. Method: Proposes the UTDesign framework: 1) a DiT-based text style transfer model trained from scratch that generates transparent RGBA text foregrounds preserving the style of reference glyphs; 2) a multi-modal condition encoder that combines background images, prompts, and layout information for conditional text generation; 3) an end-to-end text-to-design (T2D) pipeline that integrates pre-trained T2I models and an MLLM-based layout planner. Result: Experiments show that UTDesign achieves state-of-the-art stylistic consistency and text accuracy among open-source methods and shows distinct advantages over commercial solutions, particularly in mixed Chinese-English typesetting and small-size text generation. Conclusion: UTDesign effectively addresses the challenge of high-precision, multilingual text generation in AI-assisted design and offers an efficient, scalable solution for automated graphic design. Abstract: AI-assisted graphic design has emerged as a powerful tool for automating the creation and editing of design elements such as posters, banners, and advertisements. While diffusion-based text-to-image models have demonstrated strong capabilities in visual content generation, their text rendering performance, particularly for small-scale typography and non-Latin scripts, remains limited. In this paper, we propose UTDesign, a unified framework for high-precision stylized text editing and conditional text generation in design images, supporting both English and Chinese scripts. Our framework introduces a novel DiT-based text style transfer model trained from scratch on a synthetic dataset, capable of generating transparent RGBA text foregrounds that preserve the style of reference glyphs. We further extend this model into a conditional text generation framework by training a multi-modal condition encoder on a curated dataset with detailed text annotations, enabling accurate, style-consistent text synthesis conditioned on background images, prompts, and layout specifications. Finally, we integrate our approach into a fully automated text-to-design (T2D) pipeline by incorporating pre-trained text-to-image (T2I) models and an MLLM-based layout planner. Extensive experiments demonstrate that UTDesign achieves state-of-the-art performance among open-source methods in terms of stylistic consistency and text accuracy, and also exhibits unique advantages compared to proprietary commercial approaches. Code and data for this paper are available at https://github.com/ZYM-PKU/UTDesign.

[86] Multi-temporal Adaptive Red-Green-Blue and Long-Wave Infrared Fusion for You Only Look Once-Based Landmine Detection from Unmanned Aerial Systems

James E. Gallagher,Edward J. Oughton,Jana Kosecka

Main category: cs.CV

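TL;DR: This study evaluates adaptive RGB and long-wave infrared (LWIR) fusion for unmanned aerial system (UAS)-based detection of surface-laid landmines, using the thermal contrast between mines and soil to enhance feature extraction. YOLO models were tested on 114 images across 35,640 model-condition evaluations; YOLOv11 performed best (86.8% mAP), with 10-30% thermal fusion at 5-10 m altitude as the optimal parameters. In a separate architectural comparison, RF-DETR reached the highest accuracy (69.2% mAP), but YOLOv11 trained far faster (41 minutes versus 12 hours), giving a better efficiency-accuracy trade-off. Training on multi-temporal datasets outperformed season-specific datasets. Anti-tank mines were detected with 61.9% accuracy, far above the 19.2% for anti-personnel mines. The study covers only surface-laid mines; future work should extend the thermal contrast analysis to different burial depths and soil types.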

Details

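Motivation: Landmines remain a serious humanitarian threat worldwide and cause large numbers of casualties every year. Manual demining is slow and dangerous, so efficient and safe automated detection is needed. Remote sensing with multimodal sensors mounted on drones is a potential solution, but existing methods trade off detection accuracy, generalization, and real-time performance, which calls for identifying the best sensor fusion strategy and model architecture for practical use. Method: Applies adaptive fusion of RGB and long-wave infrared (LWIR) imagery and runs comparative experiments with the YOLO family (v8, v10, v11) and other mainstream object detectors (RF-DETR, Faster R-CNN, RetinaNet). Using 114 test images and 35,640 model-condition combinations, the study evaluates how the thermal fusion ratio (0%-100%), flight altitude (5-10 m), and training strategy (multi-temporal versus season-specific) affect detection performance. Result: YOLOv11 performs best among the YOLO versions at 86.8% mAP; the optimal detection parameters are 10%-30% thermal fusion at 5-10 m altitude. In the architectural comparison, RF-DETR has the highest accuracy (69.2% mAP) but takes far longer to train than YOLOv11 (12 hours versus 41 minutes). Multi-temporal training data outperforms season-specific data by 1.8%-9.6%. Detection accuracy is 61.9% for anti-tank mines, well above the 19.2% for anti-personnel mines. Conclusion: Adaptive RGB-LWIR fusion substantially improves the detection of surface-laid landmines, and YOLOv11 offers the best balance of accuracy and efficiency for practical deployment. Multi-temporal training improves model robustness. Future work should model the thermal signatures of buried mines in different soil environments to broaden the method's applicability. Abstract: Landmines remain a persistent humanitarian threat, with 110 million actively deployed mines across 60 countries, claiming 26,000 casualties annually. This research evaluates adaptive Red-Green-Blue (RGB) and Long-Wave Infrared (LWIR) fusion for Unmanned Aerial Systems (UAS)-based detection of surface-laid landmines, leveraging the thermal contrast between the ordnance and the surrounding soil to enhance feature extraction. Using You Only Look Once (YOLO) architectures (v8, v10, v11) across 114 test images, generating 35,640 model-condition evaluations, YOLOv11 achieved optimal performance (86.8% mAP), with 10 to 30% thermal fusion at 5 to 10 m altitude identified as the optimal detection parameters. A complementary architectural comparison revealed that while RF-DETR achieved the highest accuracy (69.2% mAP), followed by Faster R-CNN (67.6%), YOLOv11 (64.2%), and RetinaNet (50.2%), YOLOv11 trained 17.7 times faster than the transformer-based RF-DETR (41 minutes versus 12 hours), presenting a critical accuracy-efficiency tradeoff for operational deployment. Aggregated multi-temporal training datasets outperformed season-specific approaches by 1.8 to 9.6%, suggesting that models benefit from exposure to diverse thermal conditions. Anti-Tank (AT) mines achieved 61.9% detection accuracy, compared with 19.2% for Anti-Personnel (AP) mines, reflecting both the size differential and thermal-mass differences between these ordnance classes. As this research examined surface-laid mines where thermal contrast is maximized, future research should quantify thermal contrast effects for mines buried at varying depths across heterogeneous soil types.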
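
The abstract reports thermal fusion as a percentage but does not specify the fusion operator. One plausible reading is a per-pixel linear blend of co-registered RGB and LWIR frames, where "20% thermal fusion" means a thermal weight of 0.2. The sketch below assumes exactly that; the function name `fuse_rgb_lwir`, the linear blend, and the 8-bit value range are assumptions, not the authors' implementation.

```python
import numpy as np

def fuse_rgb_lwir(rgb, lwir, thermal_weight):
    """Linear blend (1 - w) * RGB + w * LWIR for co-registered 8-bit frames.

    rgb:  (H, W, 3) uint8 image
    lwir: (H, W) uint8 single-channel thermal image
    thermal_weight: w in [0, 1]; 0.1 to 0.3 mirrors the 10-30% range reported.
    """
    if not 0.0 <= thermal_weight <= 1.0:
        raise ValueError("thermal_weight must be in [0, 1]")
    rgb_f = rgb.astype(np.float32)
    # Replicate the thermal channel so it can be blended with each color channel.
    lwir_f = np.repeat(lwir.astype(np.float32)[..., None], 3, axis=2)
    fused = (1.0 - thermal_weight) * rgb_f + thermal_weight * lwir_f
    return np.clip(np.rint(fused), 0, 255).astype(np.uint8)

# Synthetic frames: uniform RGB value 100 and uniform thermal value 200.
rgb = np.full((4, 4, 3), 100, dtype=np.uint8)
lwir = np.full((4, 4), 200, dtype=np.uint8)
fused = fuse_rgb_lwir(rgb, lwir, thermal_weight=0.2)
print(fused[0, 0])
```

Real RGB and LWIR sensors differ in resolution and field of view, so an actual pipeline would need image registration before any blend of this kind.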

[87] Bridging Modalities and Transferring Knowledge: Enhanced Multimodal Understanding and Recognition

Gorjan Radevski

Main category: cs.CV

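TL;DR: This manuscript explores multimodal alignment, translation, fusion, and transference to improve machine understanding of complex inputs, with applications in spatial language, medical text, knowledge graphs, and action recognition.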

Details

Motivation: To address key challenges in multimodal machine learning, such as semantic alignment, information ambiguity, and computational efficiency, and to improve model understanding and generalization in real-world scenarios. Method: Proposes Spatial-Reasoning Bert, a loss function based on spatial co-occurrences, a benchmark for mapping structured text to knowledge graphs, a method fusing video and object detection representations, and a multimodal knowledge distillation framework. Result: Achieves text-to-2D/3D spatial mapping, interpretable localization of medical text, fact extraction for knowledge graphs, improved action recognition performance, and reduced computational cost. Conclusion: The proposed methods advance the application of multimodal understanding across several domains and strengthen systems' ability to process complex cross-modal inputs. Abstract: This manuscript explores multimodal alignment, translation, fusion, and transference to enhance machine understanding of complex inputs. We organize the work into five chapters, each addressing unique challenges in multimodal machine learning. Chapter 3 introduces Spatial-Reasoning Bert for translating text-based spatial relations into 2D arrangements between clip-arts. This enables effective decoding of spatial language into visual representations, paving the way for automated scene generation aligned with human spatial understanding. Chapter 4 presents a method for translating medical texts into specific 3D locations within an anatomical atlas. We introduce a loss function leveraging spatial co-occurrences of medical terms to create interpretable mappings, significantly enhancing medical text navigability. Chapter 5 tackles translating structured text into canonical facts within knowledge graphs. We develop a benchmark for linking natural language to entities and predicates, addressing ambiguities in text extraction to provide clearer, actionable insights. Chapter 6 explores multimodal fusion methods for compositional action recognition. We propose a method fusing video frames and object detection representations, improving recognition robustness and accuracy. Chapter 7 investigates multimodal knowledge transference for egocentric action recognition. We demonstrate how multimodal knowledge distillation enables RGB-only models to mimic multimodal fusion-based capabilities, reducing computational requirements while maintaining performance. These contributions advance methodologies for spatial language understanding, medical text interpretation, knowledge graph enrichment, and action recognition, enhancing computational systems' ability to process complex, multimodal inputs across diverse applications.
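
The knowledge distillation mentioned for Chapter 7 can be illustrated with the standard temperature-softened formulation, in which a student (here, an RGB-only model) is trained to match the output distribution of a teacher (a multimodal fusion model). This is a generic sketch of that well-known loss, not the manuscript's exact objective; the logits and the temperature value are made up for illustration.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=-1)
    return float(temperature ** 2 * kl.mean())

# Hypothetical logits for one clip over three action classes.
teacher = np.array([[2.0, 0.5, -1.0]])   # multimodal teacher
student = np.array([[0.0, 0.0, 0.0]])    # untrained RGB-only student

loss_before = distillation_loss(student, teacher)
loss_matched = distillation_loss(teacher, teacher)
print(loss_before, loss_matched)
```

In practice this term is usually combined with the ordinary cross-entropy loss on ground-truth labels; the mixing weight is a further hyperparameter not shown here.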

[88] SirenPose: Dynamic Scene Reconstruction via Geometric Supervision

Kaitong Cai,Jensen Zhang,Jing Yang,Keze Wang

Main category: cs.CV

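TL;DR: SirenPose proposes a geometry-aware loss that combines periodic-activation networks with keypoint-based geometric supervision to achieve high-fidelity, temporally consistent reconstruction of dynamic 3D scenes from monocular video.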

Details

Motivation: Existing methods struggle to maintain motion fidelity and spatiotemporal consistency under fast motion, multi-object interaction, occlusion, and complex scene changes. Method: Introduces physics-inspired constraints, exploits the periodic activations of sinusoidal representation networks, combines them with keypoint supervision, and uses graph neural networks to model keypoint relationships, yielding coherent predictions across space and time. Result: Substantially outperforms existing methods on benchmarks including Sintel, Bonn, and DAVIS; on DAVIS, FVD drops by 17.8%, FID drops by 28.7%, and LPIPS improves by 6.0%; in pose estimation it outperforms Monst3R, with lower trajectory error and relative pose error. Conclusion: Through a geometry-aware loss and high-frequency signal modeling, SirenPose improves the geometric accuracy, temporal consistency, and motion smoothness of dynamic 3D reconstruction and is suited to complex dynamic scenes. Abstract: We introduce SirenPose, a geometry-aware loss formulation that integrates the periodic activation properties of sinusoidal representation networks with keypoint-based geometric supervision, enabling accurate and temporally consistent reconstruction of dynamic 3D scenes from monocular videos. Existing approaches often struggle with motion fidelity and spatiotemporal coherence in challenging settings involving fast motion, multi-object interaction, occlusion, and rapid scene changes. SirenPose incorporates physics-inspired constraints to enforce coherent keypoint predictions across both spatial and temporal dimensions, while leveraging high-frequency signal modeling to capture fine-grained geometric details. We further expand the UniKPT dataset to 600,000 annotated instances and integrate graph neural networks to model keypoint relationships and structural correlations. Extensive experiments on benchmarks including Sintel, Bonn, and DAVIS demonstrate that SirenPose consistently outperforms state-of-the-art methods. On DAVIS, SirenPose achieves a 17.8 percent reduction in FVD, a 28.7 percent reduction in FID, and a 6.0 percent improvement in LPIPS compared to MoSCA. It also improves temporal consistency, geometric accuracy, user score, and motion smoothness. In pose estimation, SirenPose outperforms Monst3R with lower absolute trajectory error as well as reduced translational and rotational relative pose error, highlighting its effectiveness in handling rapid motion, complex dynamics, and physically plausible reconstruction.
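
For readers unfamiliar with sinusoidal representation networks, the periodic activation the abstract refers to is a linear layer followed by a sine with a frequency scale. The sketch below shows one such layer only; the frequency scale `w0 = 30`, the uniform initialization range, and the layer sizes follow common practice for these networks and are assumptions here, not details of SirenPose itself.

```python
import numpy as np

def sine_layer(x, weight, bias, w0=30.0):
    """One sinusoidal layer: sin(w0 * (x W^T + b))."""
    return np.sin(w0 * (x @ weight.T + bias))

rng = np.random.default_rng(0)
in_dim, out_dim = 3, 16  # e.g. (x, y, t) coordinates mapped to 16 features

# Assumed first-layer initialization: uniform in [-1/in_dim, 1/in_dim].
weight = rng.uniform(-1.0 / in_dim, 1.0 / in_dim, size=(out_dim, in_dim))
bias = np.zeros(out_dim)

coords = rng.uniform(-1.0, 1.0, size=(5, in_dim))  # five sample coordinates
features = sine_layer(coords, weight, bias)
print(features.shape)
```

A larger `w0` makes the layer respond to higher spatial or temporal frequencies, which is the property usually cited when such networks are used to capture fine detail.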

[89] AlignPose: Generalizable 6D Pose Estimation via Multi-view Feature-metric Alignment

Anna Šárová Mikeštíková,Médéric Fourmy,Martin Cífka,Josef Sivic,Vladimir Petrik

Main category: cs.CV

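TL;DR: Proposes AlignPose, a multi-view 6D object pose estimation method that requires no object-specific training; multi-view feature-metric refinement yields a pose that is consistent across views, and the method outperforms existing approaches on several datasets.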

Details

Motivation: Single-view RGB methods are limited by depth ambiguity, clutter, and occlusion; existing multi-view methods depend on precise single-view pose estimates or generalize poorly to unseen objects. Method: Proposes AlignPose, which aggregates information from multiple extrinsically calibrated RGB views and introduces a new multi-view feature-metric refinement strategy that jointly optimizes a single object pose consistent in the world frame by minimizing the discrepancy between rendered and observed features. Result: Experiments on four datasets (YCB-V, T-LESS, ITODD-MV, and HouseCat6D) with the BOP benchmark evaluation show that AlignPose outperforms other published methods, especially on challenging industrial datasets. Conclusion: AlignPose achieves accurate multi-view 6D pose estimation without object-specific training, mitigates depth ambiguity and occlusion, and shows strong generalization and practical potential. Abstract: Single-view RGB model-based object pose estimation methods achieve strong generalization but are fundamentally limited by depth ambiguity, clutter, and occlusions. Multi-view pose estimation methods have the potential to solve these issues, but existing works rely on precise single-view pose estimates or lack generalization to unseen objects. We address these challenges via the following three contributions. First, we introduce AlignPose, a 6D object pose estimation method that aggregates information from multiple extrinsically calibrated RGB views and does not require any object-specific training or symmetry annotation. Second, the key component of this approach is a new multi-view feature-metric refinement specifically designed for object pose. It optimizes a single, consistent world-frame object pose minimizing the feature discrepancy between on-the-fly rendered object features and observed image features across all views simultaneously. Third, we report extensive experiments on four datasets (YCB-V, T-LESS, ITODD-MV, HouseCat6D) using the BOP benchmark evaluation and show that AlignPose outperforms other published methods, especially on challenging industrial datasets where multiple views are readily available in practice.

[90] Multi-Grained Text-Guided Image Fusion for Multi-Exposure and Multi-Focus Scenarios

Mingwei Tang,Jiahao Nie,Guang Yang,Ziqing Cui,Jie Li

Main category: cs.CV

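TL;DR: This paper proposes Multi-grained Text-guided Image Fusion (MTIF), which uses textual descriptions at the fine-detail, structural, and semantic levels to improve fusion quality, and outperforms existing methods on multi-exposure and multi-focus fusion tasks.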

Details

Motivation: Existing text-guided image fusion methods use only coarse-grained textual descriptions, which makes fine-grained detail understanding and precise cross-modal alignment difficult and limits fusion performance. Method: Proposes the MTIF framework with three key designs: 1) multi-grained textual descriptions (fine detail, structure, semantics) that guide fusion through a hierarchical cross-modal modulation module; 2) supervision signals at each granularity to strengthen the alignment of visual and textual features; 3) a saliency-driven enrichment module that generates semantically rich training data. Result: Extensive experiments on multi-exposure and multi-focus image fusion show that MTIF consistently outperforms previous methods, improving both fused image quality and cross-modal alignment. Conclusion: Through multi-grained text guidance and hierarchical alignment, MTIF improves image fusion performance and confirms the value of fine-grained cross-modal information in image fusion. Abstract: Image fusion aims to synthesize a single high-quality image from a pair of inputs captured under challenging conditions, such as differing exposure levels or focal depths. A core challenge lies in effectively handling disparities in dynamic range and focus depth between the inputs. With the advent of vision-language models, recent methods incorporate textual descriptions as auxiliary guidance to enhance fusion quality. However, simply incorporating coarse-grained descriptions hampers the understanding of fine-grained details and poses challenges for precise cross-modal alignment. To address these limitations, we propose Multi-grained Text-guided Image Fusion (MTIF), a novel fusion paradigm with three key designs. First, it introduces multi-grained textual descriptions that separately capture fine details, structural cues, and semantic content, guiding image fusion through a hierarchical cross-modal modulation module. Second, it involves supervision signals at each granularity to facilitate alignment between visual and textual features and enhance the utility of auxiliary text. Third, it adopts a saliency-driven enrichment module to augment training data with dense semantic content, further strengthening the cross-modal modulation and alignment. Extensive experiments show that MTIF consistently outperforms previous methods on both multi-exposure and multi-focus image fusion tasks.

[91] Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models

Shengchao Zhou,Yuxin Chen,Yuying Ge,Wei Huang,Jiehong Lin,Ying Shan,Xiaojuan Qi

Main category: cs.CV

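TL;DR: This paper proposes DSR Suite to address the weakness of vision-language models in dynamic spatial reasoning (DSR). It comprises a pipeline that automatically generates 4D-aware question-answer data and a lightweight Geometry Selection Module (GSM), and it significantly improves the models' dynamic 3D spatial reasoning.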

Details

Motivation: Existing vision-language models are weak at dynamic spatial reasoning (for example, reasoning about how the geometric relationships of objects in 3D space evolve over time), mainly because large-scale 4D-aware training data is lacking. Method: Proposes an automated data generation pipeline that uses vision foundation models to extract geometric and motion information (camera poses, point clouds, object trajectories, and so on) from real videos, producing DSR-Train for training and DSR-Bench for evaluation; it also designs a lightweight Geometry Selection Module (GSM) that encodes question-relevant 4D priors into compact geometry tokens and feeds them into the VLM. Result: Builds datasets from real videos that emphasize multi-object interaction, viewpoint changes, and fine-grained reasoning; experiments show that integrating DSR-Train and GSM into Qwen2.5-VL-7B significantly improves its performance on DSR tasks while maintaining accuracy on general video understanding benchmarks. Conclusion: The 4D-aware data generation pipeline and the lightweight geometry integration module GSM effectively strengthen dynamic spatial reasoning in vision-language models and offer a viable path for applying VLMs to complex 3D temporal reasoning tasks. Abstract: Vision-language models (VLMs) excel at general understanding yet remain weak at dynamic spatial reasoning (DSR), i.e., reasoning about the evolution of object geometry and relationships in 3D space over time, largely due to the scarcity of scalable 4D-aware training resources. To bridge this gap across aspects of dataset, benchmark and model, we introduce DSR Suite. First, we propose an automated pipeline that generates multiple-choice question-answer pairs from in-the-wild videos for DSR. By leveraging modern vision foundation models, the pipeline extracts rich geometric and motion information, including camera poses, local point clouds, object masks, orientations, and 3D trajectories. These geometric cues enable the construction of DSR-Train for learning and further human-refined DSR-Bench for evaluation. Compared with previous works, our data emphasize (i) in-the-wild video sources, (ii) object- and scene-level 3D requirements, (iii) viewpoint transformations, (iv) multi-object interactions, and (v) fine-grained, procedural answers. Beyond data, we propose a lightweight Geometry Selection Module (GSM) to seamlessly integrate geometric priors into VLMs, which condenses question semantics and extracts question-relevant knowledge from pretrained 4D reconstruction priors into a compact set of geometry tokens. This targeted extraction avoids overwhelming the model with irrelevant knowledge. Experiments show that integrating DSR-Train and GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning capability, while maintaining accuracy on general video understanding benchmarks.

[92] FlashVLM: Text-Guided Visual Token Selection for Large Multimodal Models

Kaitong Cai,Jusheng Zhang,Jing Yang,Yijia Fan,Pengtao Xie,Jian Wang,Keze Wang

Main category: cs.CV

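TL;DR: FlashVLM proposes a text-guided visual token selection framework that adapts dynamically to the input through cross-modal similarity, matching or even exceeding the unpruned model while heavily compressing the visual tokens.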

Details

Motivation: Existing visual token compression methods often ignore the text query or rely on unstable attention maps, which degrades semantic alignment; a more stable and efficient compression mechanism is needed. Method: FlashVLM computes an explicit cross-modal similarity between image tokens and text embeddings in the language model space, fuses it with intrinsic visual saliency using log-domain weighting and temperature sharpening, and maintains global context through a diversity-preserving partition. Result: Under the same token budget, FlashVLM slightly exceeds the baseline on LLaVA 1.5 while pruning up to 77.8% of visual tokens, retains 92.8% accuracy at 94.4% compression, and achieves state-of-the-art results on 14 image and video benchmarks. Conclusion: FlashVLM delivers efficient and robust visual token compression that balances performance and generalization, offering an effective solution for deploying large vision-language models. Abstract: Large vision-language models (VLMs) typically process hundreds or thousands of visual tokens per image or video frame, incurring quadratic attention cost and substantial redundancy. Existing token reduction methods often ignore the textual query or rely on deep attention maps, whose instability under aggressive pruning leads to degraded semantic alignment. We propose FlashVLM, a text-guided visual token selection framework that dynamically adapts visual inputs to the query. Instead of relying on noisy attention weights, FlashVLM computes an explicit cross-modal similarity between projected image tokens and normalized text embeddings in the language model space. This extrinsic relevance is fused with intrinsic visual saliency using log-domain weighting and temperature-controlled sharpening. In addition, a diversity-preserving partition retains a minimal yet representative set of background tokens to maintain global context. Under identical token budgets and evaluation protocols, FlashVLM achieves beyond-lossless compression, slightly surpassing the unpruned baseline while pruning up to 77.8 percent of visual tokens on LLaVA 1.5, and maintaining 92.8 percent accuracy even under 94.4 percent compression. Extensive experiments on 14 image and video benchmarks demonstrate that FlashVLM delivers state-of-the-art efficiency-performance trade-offs while maintaining strong robustness and generalization across mainstream VLMs.
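
The selection step described above (text relevance fused with visual saliency in the log domain, then keeping the best-scoring tokens) can be sketched roughly as follows. This is a loose illustration under stated assumptions: the mixing weight `lam`, the mapping of cosine similarity to [0, 1], and the plain top-k rule are hypothetical, and the sketch omits the temperature sharpening and the diversity-preserving background partition that the abstract mentions.

```python
import numpy as np

def select_tokens(img_tokens, text_emb, saliency, keep, lam=0.5, eps=1e-8):
    """Return sorted indices of the `keep` highest-scoring visual tokens.

    img_tokens: (n_tokens, dim) projected image tokens
    text_emb:   (dim,) pooled text embedding
    saliency:   (n_tokens,) intrinsic saliency scores in (0, 1]
    """
    img = img_tokens / (np.linalg.norm(img_tokens, axis=1, keepdims=True) + eps)
    txt = text_emb / (np.linalg.norm(text_emb) + eps)
    relevance = (img @ txt + 1.0) / 2.0  # cosine similarity mapped to [0, 1]
    # Log-domain fusion: a weighted geometric mean of relevance and saliency.
    score = lam * np.log(relevance + eps) + (1.0 - lam) * np.log(saliency + eps)
    top = np.argsort(-score)[:keep]
    return np.sort(top)

# Toy example: four 2-D tokens, a text embedding pointing along the first axis.
tokens = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.9, 0.1]])
text = np.array([1.0, 0.0])
saliency = np.full(4, 0.5)

kept = select_tokens(tokens, text, saliency, keep=2)
print(kept)
```

With equal saliency everywhere, the two tokens most aligned with the text embedding are kept; unequal saliency would shift the ranking according to `lam`.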

[93] LEAD: Minimizing Learner-Expert Asymmetry in End-to-End Driving

Long Nguyen,Micha Fauth,Bernhard Jaeger,Daniel Dauner,Maximilian Igl,Andreas Geiger,Kashyap Chitta

Main category: cs.CV

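TL;DR: This paper studies how mismatches between expert demonstrations and student model observations in simulated driving (in visibility, uncertainty, and under-specified navigational intent) limit imitation learning, and proposes remedies that substantially improve results on CARLA closed-loop benchmarks while also yielding gains on real-world benchmarks.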

Details

Motivation: In simulation, the expert has higher visibility and lower uncertainty, while the student model relies only on sensor observations, which makes imitation difficult; in addition, navigational intent is specified at test time by only a single target point, which leaves it ambiguous. These factors limit imitation learning performance in closed-loop driving. Method: Narrows the perception gap between expert and student (for example by accounting for occlusion and modeling uncertainty), improves the representation of navigational intent, and refines the TransFuser v6 policy with perception supervision and the dataset. Result: TransFuser v6 sets a new state of the art on multiple public CARLA closed-loop benchmarks, scoring 95 DS on Bench2Drive and more than doubling prior performance on Longest6 v2 and Town13; it also shows consistent gains on the NAVSIM and Waymo Vision-Based End-to-End benchmarks. Conclusion: Narrowing the gap between expert and student in perception and intent representation substantially improves closed-loop imitation learning for autonomous driving, confirming the importance of aligned design and supporting sim-to-real transfer. Abstract: Simulators can generate virtually unlimited driving data, yet imitation learning policies in simulation still struggle to achieve robust closed-loop performance. Motivated by this gap, we empirically study how misalignment between privileged expert demonstrations and sensor-based student observations can limit the effectiveness of imitation learning. More precisely, experts have significantly higher visibility (e.g., ignoring occlusions) and far lower uncertainty (e.g., knowing other vehicles' actions), making them difficult to imitate reliably. Furthermore, navigational intent (i.e., the route to follow) is under-specified in student models at test time via only a single target point. We demonstrate that these asymmetries can measurably limit driving performance in CARLA and offer practical interventions to address them. After careful modifications to narrow the gaps between expert and student, our TransFuser v6 (TFv6) student policy achieves a new state of the art on all major publicly available CARLA closed-loop benchmarks, reaching 95 DS on Bench2Drive and more than doubling prior performances on Longest6 v2 and Town13. Additionally, by integrating perception supervision from our dataset into a shared sim-to-real pipeline, we show consistent gains on the NAVSIM and Waymo Vision-Based End-to-End driving benchmarks. Our code, data, and models are publicly available at https://github.com/autonomousvision/lead.

[94] Repurposing Video Diffusion Transformers for Robust Point Tracking

Soowon Son,Honggyu An,Chaehyun Kim,Hyunah Ko,Jisu Nam,Dahyun Chung,Siyoon Jin,Jung Yi,Jaewon Min,Junhwa Hur,Seungryong Kim

Main category: cs.CV

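TL;DR: Proposes DiTracker, which uses pre-trained video Diffusion Transformers for point tracking; with query-key attention matching, lightweight LoRA tuning, and cost fusion with a ResNet backbone, it matches or exceeds state-of-the-art methods on multiple benchmarks.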

Details

Motivation: Existing methods rely on shallow convolutional backbones that lack temporal coherence and perform poorly under challenging conditions such as dynamic motion and frequent occlusion. Method: Uses a video DiT pre-trained on large-scale real-world videos, combined with query-key attention matching, lightweight LoRA tuning, and cost fusion with a ResNet backbone, to achieve robust point tracking. Result: DiTracker achieves state-of-the-art performance on the ITTO benchmark and matches or exceeds state-of-the-art models on the TAP-Vid benchmarks, despite training with an 8 times smaller batch size. Conclusion: Video DiT features are validated as an effective and efficient foundation for point tracking. Abstract: Point tracking aims to localize corresponding points across video frames, serving as a fundamental task for 4D reconstruction, robotics, and video editing. Existing methods commonly rely on shallow convolutional backbones such as ResNet that process frames independently, lacking temporal coherence and producing unreliable matching costs under challenging conditions. Through systematic analysis, we find that video Diffusion Transformers (DiTs), pre-trained on large-scale real-world videos with spatio-temporal attention, inherently exhibit strong point tracking capability and robustly handle dynamic motions and frequent occlusions. We propose DiTracker, which adapts video DiTs through: (1) query-key attention matching, (2) lightweight LoRA tuning, and (3) cost fusion with a ResNet backbone. Despite training with an 8 times smaller batch size, DiTracker achieves state-of-the-art performance on the challenging ITTO benchmark and matches or outperforms state-of-the-art models on TAP-Vid benchmarks. Our work validates video DiT features as an effective and efficient foundation for point tracking.
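
Query-key attention matching, the first of the three components listed above, can be pictured as scoring one query feature (the tracked point) against every key feature of a target frame and reading off the best location. The sketch below shows that generic scaled dot-product matching with synthetic features; the feature dimensions, the hard argmax readout, and the function name `match_point` are assumptions for illustration and say nothing about how DiTracker extracts or post-processes its features.

```python
import numpy as np

def match_point(query_feat, key_feats, height, width):
    """Find the best match of one query feature in a flattened key feature map.

    query_feat: (dim,) feature of the tracked point in the source frame
    key_feats:  (height * width, dim) features of the target frame
    Returns ((row, col), attention map of shape (height, width)).
    """
    dim = query_feat.shape[0]
    logits = key_feats @ query_feat / np.sqrt(dim)  # scaled dot-product scores
    logits = logits - logits.max()                  # numerical stability
    weights = np.exp(logits)
    attn = weights / weights.sum()                  # softmax over all locations
    idx = int(attn.argmax())
    return divmod(idx, width), attn.reshape(height, width)

rng = np.random.default_rng(0)
height, width, dim = 4, 5, 8
query = np.ones(dim)

# Synthetic target frame: small noise everywhere, the query planted at (2, 3).
keys = 0.01 * rng.standard_normal((height * width, dim))
keys[2 * width + 3] = query

location, attn_map = match_point(query, keys, height, width)
print(location)
```

The attention map doubles as a matching cost volume, which is presumably what makes fusion with costs from another backbone possible.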

[95] FedPOD: the deployable units of training for federated learning

Daewoon Kim,Si Young Yie,Jae Sung Lee

Main category: cs.CV

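TL;DR: This paper proposes FedPOD (Proportionally Orchestrated Derivative), a method for optimizing learning efficiency and communication cost across multiple clients in federated learning. Inspired by FedPIDAvg, FedPOD introduces round-wise tasks to improve training efficiency and addresses FedPIDAvg's limitations in data utilization and cross-round dependency. Experiments show that FedPOD is comparable to FedPIDAvg in Dice score and convergence while offering better flexibility and scalability; its design draws on the Kubernetes POD and supports an elastic architecture similar to auto-scaling.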

Details

Motivation: FedPIDAvg improves performance with a PID controller and weighting by loss change, but it depends on a fixed set of participants, excludes outliers and therefore underuses data, and adapts poorly to dynamic participation. A more flexible method is needed that does not depend on historical information and makes full use of all clients' data. Method: Proposes FedPOD, which uses a round-wise task mechanism and computes validation loss independently in each round, avoiding dependence on learning information from earlier rounds; it includes participants that would otherwise be classified as outliers to improve data utilization; it models the data distribution with a Poisson distribution and, borrowing the POD concept from Kubernetes, extends round-wise tasks to POD units to support an elastic design similar to auto-scaling. Result: FedPOD reaches average Dice scores of 0.78, 0.71, and 0.72 for WT, ET, and TC, and an average projected convergence score of 0.74, comparable to FedPIDAvg; it also shows greater system flexibility and scalability, suiting federated learning with dynamic participation. Conclusion: FedPOD overcomes FedPIDAvg's limitations in data utilization and cross-round dependency. While keeping comparable or better performance, it improves the training efficiency, communication efficiency, and elasticity of federated learning, and its Kubernetes-compatible design points to a new direction for integrating edge computing with distributed learning. Abstract: This paper proposes FedPOD (Proportionally Orchestrated Derivative) for optimizing learning efficiency and communication cost in federated learning among multiple clients. Inspired by FedPIDAvg, we define a round-wise task for FedPOD to enhance training efficiency. FedPIDAvg achieved performance improvement by incorporating the training loss reduction for prediction entropy as weights using differential terms. Furthermore, by modeling data distribution with a Poisson distribution and using a PID controller, it reduced communication costs even under skewed data distributions. However, excluding participants classified as outliers based on the Poisson distribution can limit data utilization. Additionally, the PID controller requires the same participants to be maintained throughout the federated learning process, as it uses previous rounds' learning information in the current round. In our approach, FedPOD addresses these issues by including participants excluded as outliers, eliminating dependency on previous rounds' learning information, and applying a method for calculating validation loss at each round. In this challenge, FedPOD achieves performance comparable to FedPIDAvg, with average Dice scores of 0.78, 0.71, and 0.72 for WT, ET, and TC, and an average projected convergence score of 0.74. Furthermore, the concept of FedPOD draws inspiration from Kubernetes' smallest computing unit, POD, designed to be compatible with Kubernetes auto-scaling. Extending round-wise tasks of FedPOD to POD units allows flexible design by applying scale-out similar to Kubernetes' auto-scaling. This work demonstrates the potential of FedPOD to enhance federated learning by improving efficiency, flexibility, and performance metrics.
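
The Dice scores quoted above measure the overlap between a predicted segmentation mask and the ground truth, computed as twice the intersection divided by the sum of the two mask sizes. The following sketch computes that standard metric on toy binary masks; the masks are made up and the smoothing term `eps` is a common convention, not something specified in the abstract.

```python
import numpy as np

def dice_score(pred, target, eps=1e-8):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    # eps avoids division by zero when both masks are empty.
    return float((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps))

# Toy 2x3 masks: three predicted pixels, three true pixels, two in common.
pred_mask = np.array([[1, 1, 0],
                      [0, 1, 0]])
true_mask = np.array([[1, 0, 0],
                      [0, 1, 1]])

score = dice_score(pred_mask, true_mask)
print(round(score, 4))
```

For this toy pair the score is 2 * 2 / (3 + 3), roughly 0.667; a perfect prediction gives 1.0.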

[96] Active Intelligence in Video Avatars via Closed-loop World Modeling

Xuanhua He,Tianyu Yang,Ke Cao,Ruiqi Wu,Cheng Meng,Yong Zhang,Zhuoliang Kang,Xiaoming Wei,Qifeng Chen

Main category: cs.CV

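TL;DR: This paper introduces the L-IVA task and benchmark together with the ORCA framework, the first to give video avatars active intelligence; using an internal world model and closed-loop reasoning, it completes multi-step tasks autonomously in open-domain environments.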

Details

Motivation: Existing video avatar generation methods lack genuine agency and cannot interact adaptively with dynamic environments in pursuit of long-term goals. This paper aims to give avatars goal-directed planning ability and active intelligence. Method: Proposes the ORCA framework, comprising a closed-loop OTAR (Observe-Think-Act-Reflect) cycle and a hierarchical dual-system architecture; avatar control is modeled as a partially observable Markov decision process (POMDP), and continuous belief updating with outcome verification provides robust state tracking and decision making. Result: Experiments show that ORCA significantly outperforms open-loop and non-reflective baselines in task success rate and behavioral coherence, and completes multi-step tasks effectively under generative uncertainty. Conclusion: Through its internal world model mechanism, ORCA moves video avatars from passive animation towards active, goal-driven behavior and suggests a new direction for generative agents. Abstract: Current video avatar generation methods excel at identity preservation and motion alignment but lack genuine agency: they cannot autonomously pursue long-term goals through adaptive environmental interaction. We address this by introducing L-IVA (Long-horizon Interactive Visual Avatar), a task and benchmark for evaluating goal-directed planning in stochastic generative environments, and ORCA (Online Reasoning and Cognitive Architecture), the first framework enabling active intelligence in video avatars. ORCA embodies Internal World Model (IWM) capabilities through two key innovations: (1) a closed-loop OTAR cycle (Observe-Think-Act-Reflect) that maintains robust state tracking under generative uncertainty by continuously verifying predicted outcomes against actual generations, and (2) a hierarchical dual-system architecture where System 2 performs strategic reasoning with state prediction while System 1 translates abstract plans into precise, model-specific action captions. By formulating avatar control as a POMDP and implementing continuous belief updating with outcome verification, ORCA enables autonomous multi-step task completion in open-domain scenarios. Extensive experiments demonstrate that ORCA significantly outperforms open-loop and non-reflective baselines in task success rate and behavioral coherence, validating our IWM-inspired design for advancing video avatar intelligence from passive animation to active, goal-oriented behavior.

[97] SpatialTree: How Spatial Abilities Branch Out in MLLMs

Yuxi Xiao,Longfei Li,Shen Yan,Xinhang Liu,Sida Peng,Yunchao Wei,Xiaowei Zhou,Bingyi Kang

Main category: cs.CV

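TL;DR: This paper proposes SpatialTree, a hierarchical framework inspired by cognitive science that divides the spatial abilities of multimodal large language models into four levels. It builds the first capability-centric hierarchical benchmark and uses it to systematically evaluate and improve the spatial abilities of mainstream MLLMs.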

Details

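Motivation: Existing work lacks a systematic understanding of spatial abilities in multimodal large language models (MLLMs) and is mostly confined to narrow tasks, whereas cognitive science suggests that spatial ability develops hierarchically. A more comprehensive, structured evaluation scheme is therefore needed. Method: Drawing on cognitive science, the authors propose SpatialTree, a four-level taxonomy of spatial abilities (L1-L4), and build a hierarchical benchmark covering 27 sub-abilities. They use supervised fine-tuning and reinforcement learning to analyze how abilities transfer between levels, and propose an auto-think strategy to optimize the reasoning process. Result: Experiments show that low-level abilities are relatively independent while high-level abilities are strongly correlated. Negative transfer occurs within the low level, but there is notable positive transfer across levels. Naive RL harms intuitive perception, whereas the auto-think strategy suppresses unnecessary deliberation and lets RL improve performance consistently at every level. Conclusion: SpatialTree offers an effective conceptual framework and practical path for understanding and systematically scaling spatial abilities in MLLMs. It reveals the structural relationships and transfer patterns among abilities and advances multimodal models toward higher-level spatial cognition. Abstract: Cognitive science suggests that spatial ability develops progressively, from perception to reasoning and interaction. Yet in multimodal LLMs (MLLMs), this hierarchy remains poorly understood, as most studies focus on a narrow set of tasks. We introduce SpatialTree, a cognitive-science-inspired hierarchy that organizes spatial abilities into four levels: low-level perception (L1), mental mapping (L2), simulation (L3), and agentic competence (L4). Based on this taxonomy, we construct the first capability-centric hierarchical benchmark, thoroughly evaluating mainstream MLLMs across 27 sub-abilities. The evaluation results reveal a clear structure: L1 skills are largely orthogonal, whereas higher-level skills are strongly correlated, indicating increasing interdependency. Through targeted supervised fine-tuning, we uncover a surprising transfer dynamic: negative transfer within L1, but strong cross-level transfer from low- to high-level abilities with notable synergy. Finally, we explore how to improve the entire hierarchy. We find that naive RL that encourages extensive "thinking" is unreliable: it helps complex reasoning but hurts intuitive perception. We propose a simple auto-think strategy that suppresses unnecessary deliberation, enabling RL to consistently improve performance across all levels. By building SpatialTree, we provide a proof-of-concept framework for understanding and systematically scaling spatial abilities in MLLMs.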

[98] SemanticGen: Video Generation in Semantic Space

Jianhong Bai,Xiaoshi Wu,Xintao Wang,Fu Xiao,Yuanxing Zhang,Qinghe Wang,Xiaoyu Shi,Menghan Xia,Zuozhu Liu,Haoji Hu,Pengfei Wan,Kun Gai

Main category: cs.CV

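TL;DR: This paper proposes SemanticGen, a new method that generates videos in semantic space, using a two-stage diffusion model to achieve more efficient, higher-quality video generation.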

Details

Motivation: Existing video generation models learn in the VAE latent space, which leads to slow convergence, high computational cost, and difficulty generating long videos. Because videos are inherently redundant, directly modeling a large number of low-level video tokens is inefficient. Method: SemanticGen generates in two stages. In the first stage, a diffusion model generates compact semantic video features that fix the global structure. In the second stage, a conditional diffusion model generates VAE latents from those semantic features to produce the final video. Result: Experiments show that generation in semantic space converges faster than in the VAE latent space, generates long videos efficiently, and outperforms current state-of-the-art methods and strong baselines in quality. Conclusion: By starting generation in a high-level semantic space, SemanticGen improves the efficiency and scalability of video generation and offers a new approach to long video generation. Abstract: State-of-the-art video generative models typically learn the distribution of video latents in the VAE space and map them to pixels using a VAE decoder. While this approach can generate high-quality videos, it suffers from slow convergence and is computationally expensive when generating long videos. In this paper, we introduce SemanticGen, a novel solution to address these limitations by generating videos in the semantic space. Our main insight is that, due to the inherent redundancy in videos, the generation process should begin in a compact, high-level semantic space for global planning, followed by the addition of high-frequency details, rather than directly modeling a vast set of low-level video tokens using bi-directional attention. SemanticGen adopts a two-stage generation process. In the first stage, a diffusion model generates compact semantic video features, which define the global layout of the video. In the second stage, another diffusion model generates VAE latents conditioned on these semantic features to produce the final output. We observe that generation in the semantic space leads to faster convergence compared to the VAE latent space. Our method is also effective and computationally efficient when extended to long video generation. Extensive experiments demonstrate that SemanticGen produces high-quality videos and outperforms state-of-the-art approaches and strong baselines.

Table of Contents

cs.CL

[1] HARMON-E: Hierarchical Agentic Reasoning for Multimodal Oncology Notes to Extract Structured Data

[2] How well do Large Language Models Recognize Instructional Moves? Establishing Baselines for Foundation Models in Educational Discourse

[3] Counterfactual LLM-based Framework for Measuring Rhetorical Style

[4] PRISM: A Personality-Driven Multi-Agent Framework for Social Media Simulation

[5] Bias Beneath the Tone: Empirical Characterisation of Tone Bias in LLM-Driven UX Systems

[6] Schoenfeld's Anatomy of Mathematical Reasoning by Language Models

[7] Memory-T1: Reinforcement Learning for Temporal Reasoning in Multi-session Agents

[8] A Novel Graph-Sequence Learning Model for Inductive Text Classification

[9] ABBEL: LLM Agents Acting through Belief Bottlenecks Expressed in Language

[10] M$^3$KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation

[11] Multi-hop Reasoning via Early Knowledge Alignment

[12] Retrieval-augmented Prompt Learning for Pre-trained Foundation Models

[13] Fun-Audio-Chat Technical Report

[14] AI Security Beyond Core Domains: Resume Screening as a Case Study of Adversarial Vulnerabilities in Specialized LLM Applications

[15] FaithLens: Detecting and Explaining Faithfulness Hallucination

[16] Corpus of Cross-lingual Dialogues with Minutes and Detection of Misunderstandings

[17] SlideTailor: Personalized Presentation Slide Generation for Scientific Papers

[18] AprielGuard

[19] Patterns vs. Patients: Evaluating LLMs against Mental Health Professionals on Personality Disorder Diagnosis through First-Person Narratives

[20] SpidR: Learning Fast and Stable Linguistic Units for Spoken Language Models Without Supervision

[21] Can LLMs Solve My Grandma's Riddle? Evaluating Multilingual Large Language Models on Reasoning Traditional Bangla Tricky Riddles

[22] Multi-LLM Thematic Analysis with Dual Reliability Metrics: Combining Cohen's Kappa and Semantic Similarity for Qualitative Research Validation

[23] Sentiment-Aware Extractive and Abstractive Summarization for Unstructured Text Mining

[24] Step-DeepResearch Technical Report

[25] Distilling to Hybrid Attention Models via KL-Guided Layer Selection

[26] Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits

[27] Cube Bench: A Benchmark for Spatial Visual Reasoning in MLLMs

[28] MoE-DiffuSeq: Enhancing Long-Document Diffusion Models with Sparse Attention and Mixture of Experts

cs.CV

[29] PHANTOM: PHysical ANamorphic Threats Obstructing Connected Vehicle Mobility

[30] Generating the Past, Present and Future from a Motion-Blurred Image

[31] Learning to Refocus with Video Diffusion Models

[32] RANSAC Scoring Functions: Analysis and Reality Check

[33] HyGE-Occ: Hybrid View-Transformation with 3D Gaussian and Edge Priors for 3D Panoptic Occupancy Prediction

[34] Widget2Code: From Visual Widgets to UI Code via Multimodal LLMs

[35] Unified Brain Surface and Volume Registration

[36] Vehicle-centric Perception via Multimodal Structured Pre-training

[37] Block-Recurrent Dynamics in Vision Transformers

[38] SE360: Semantic Edit in 360$^\circ$ Panoramas via Hierarchical Data Construction

[39] How Much 3D Do Video Foundation Models Encode?

[40] HistoWAS: A Pathomics Framework for Large-Scale Feature-Wide Association Studies of Tissue Topology and Patient Outcomes

[41] WSD-MIL: Window Scale Decay Multiple Instance Learning for Whole Slide Image Classification

[42] A Novel CNN Gradient Boosting Ensemble for Guava Disease Detection

[43] A Dual-Branch Local-Global Framework for Cross-Resolution Land Cover Mapping

[44] Few-Shot-Based Modular Image-to-Video Adapter for Diffusion Models

[45] PaveSync: A Unified and Comprehensive Dataset for Pavement Distress Analysis and Classification

[46] SegEarth-R2: Towards Comprehensive Language-guided Segmentation for Remote Sensing Images

[47] A Contextual Analysis of Driver-Facing and Dual-View Video Inputs for Distraction Detection in Naturalistic Driving Environments

[48] MAPI-GNN: Multi-Activation Plane Interaction Graph Neural Network for Multimodal Medical Diagnosis

[49] $\text{H}^2$em: Learning Hierarchical Hyperbolic Embeddings for Compositional Zero-Shot Learning

[50] VALLR-Pin: Dual-Decoding Visual Speech Recognition for Mandarin with Pinyin-Guided LLM Refinement

[51] FlashLips: 100-FPS Mask-Free Latent Lip-Sync using Reconstruction Instead of Diffusion or GANs

[52] Beyond Vision: Contextually Enriched Image Captioning with Multi-Modal Retrieval

[53] Progressive Learned Image Compression for Machine Perception

[54] Towards Natural Language-Based Document Image Retrieval: New Dataset and Benchmark

[55] Item Region-based Style Classification Network (IRSN): A Fashion Style Classifier Based on Domain Knowledge of Fashion Experts

[56] Effect of Activation Function and Model Optimizer on the Performance of Human Activity Recognition System Using Various Deep Learning Models

[57] LiDARDraft: Generating LiDAR Point Cloud from Versatile Inputs

[58] UMAMI: Unifying Masked Autoregressive Models and Deterministic Rendering for View Synthesis

[59] Multi Modal Attention Networks with Uncertainty Quantification for Automated Concrete Bridge Deck Delamination Detection

[60] DDAVS: Disentangled Audio Semantics and Delayed Bidirectional Alignment for Audio-Visual Segmentation

[61] HEART-VIT: Hessian-Guided Efficient Dynamic Attention and Token Pruning in Vision Transformer

[62] milliMamba: Specular-Aware Human Pose Estimation via Dual mmWave Radar with Multi-Frame Mamba Fusion

[63] Enhancing annotations for 5D apple pose estimation through 3D Gaussian Splatting (3DGS)

[64] CoDi -- an exemplar-conditioned diffusion model for low-shot counting

[65] AMoE: Agglomerative Mixture-of-Experts Vision Foundation Model

[66] Generative Latent Coding for Ultra-Low Bitrate Image Compression

[67] JDPNet: A Network Based on Joint Degradation Processing for Underwater Image Enhancement

[68] LiteFusion: Taming 3D Object Detectors from Vision-Based to Multi-Modal with Minimal Adaptation

[69] IndicDLP: A Foundational Dataset for Multi-Lingual and Multi-Domain Document Layout Parsing

[70] Degradation-Aware Metric Prompting for Hyperspectral Image Restoration

[71] BiCoR-Seg: Bidirectional Co-Refinement Framework for High-Resolution Remote Sensing Image Segmentation

[72] LADLE-MM: Limited Annotation based Detector with Learned Ensembles for Multimodal Misinformation

[73] ${D}^{3}${ETOR}: ${D}$ebate-Enhanced Pseudo Labeling and Frequency-Aware Progressive ${D}$ebiasing for Weakly-Supervised Camouflaged Object ${D}$etection with Scribble Annotations

[74] UbiQVision: Quantifying Uncertainty in XAI for Image Recognition

[75] TAVID: Text-Driven Audio-Visual Interactive Dialogue Generation

[76] The devil is in the details: Enhancing Video Virtual Try-On via Keyframe-Driven Details Injection

[77] CRAFT: Continuous Reasoning and Agentic Feedback Tuning for Multimodal Text-to-Image Generation