cs.CL

[1] DeBERTa-KC: A Transformer-Based Classifier for Knowledge Construction in Online Learning Discourse

Jindi Wang, Yidi Zhang, Zhaoxing Li

Main category: cs.CL

TL;DR: This study proposes DeBERTa-KC, a model that automatically classifies knowledge construction levels in comments on YouTube science videos. It combines Focal Loss, Label Smoothing, and R-Drop to improve performance and achieves strong results on 20,000 annotated samples.

Motivation: Classifying knowledge construction levels in online science learning discussions is difficult, particularly because of class imbalance and limited generalization, so an efficient, theory-driven automated method is needed.

Method: Building on DeBERTa-v3, the authors add Focal Loss, Label Smoothing, and R-Drop regularization, construct a reproducible end-to-end classification pipeline, and train and evaluate it on 20,000 manually annotated comments from four YouTube science channels.

Result: In 10-fold cross-validation, DeBERTa-KC reaches a macro-F1 of 0.836 ± 0.008, significantly outperforming classical and transformer baselines (p < 0.01), with particular sensitivity on the Explore and Negotiate classes.

Conclusion: Large language models can effectively capture fine-grained indicators of knowledge construction in informal digital learning environments, providing a scalable, theory-informed automated tool for analyzing epistemic engagement.

Abstract: This study presents DeBERTa-KC, a transformer-based model for automatic classification of knowledge construction (KC) levels in online science learning discourse. Using comments collected from four popular YouTube science channels (2022-2024), a balanced corpus of 20,000 manually annotated samples was created across four KC categories: nonKC, Share, Explore, and Negotiate. The proposed model extends DeBERTa-v3 with Focal Loss, Label Smoothing, and R-Drop regularization to address class imbalance and enhance generalization. A reproducible end-to-end pipeline was implemented, encompassing data extraction, annotation, preprocessing, training, and evaluation. Across 10-fold stratified cross-validation, DeBERTa-KC achieved a macro-F1 of $0.836 \pm 0.008$, significantly outperforming both classical and transformer baselines ($p<0.01$). Per-category results indicate strong sensitivity to higher-order epistemic engagement, particularly in Explore and Negotiate discourse. These findings demonstrate that large language models can effectively capture nuanced indicators of knowledge construction in informal digital learning environments, offering scalable, theory-informed approaches to discourse analysis and the development of automated tools for assessing epistemic engagement.
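The three training-time techniques named in this entry can be illustrated with a minimal, dependency-free sketch. The abstract does not say how DeBERTa-KC combines them, so the way the focal weight is applied to smoothed targets here, the default values of `gamma` and `eps`, and all function names are assumptions for illustration only.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def smoothed_targets(label, num_classes, eps):
    # Label smoothing: take eps of the mass from the one-hot target
    # and spread it uniformly over all classes.
    return [(1.0 - eps) * (1.0 if k == label else 0.0) + eps / num_classes
            for k in range(num_classes)]

def focal_loss(logits, label, gamma=2.0, eps=0.1):
    # Focal loss: each class term is down-weighted by (1 - p_k)^gamma,
    # so confidently predicted (easy) examples contribute less.
    # Combining it with smoothed targets this way is an assumption.
    probs = softmax(logits)
    targets = smoothed_targets(label, len(logits), eps)
    return -sum(t * (1.0 - p) ** gamma * math.log(p)
                for t, p in zip(targets, probs))

def r_drop_penalty(p, q):
    # R-Drop regularizer: symmetric KL divergence between the predictions
    # of two forward passes that used different dropout masks.
    kl_pq = sum(a * math.log(a / b) for a, b in zip(p, q))
    kl_qp = sum(b * math.log(b / a) for a, b in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)
```

With `gamma=0` and `eps=0` the function reduces to ordinary cross-entropy, which is a convenient sanity check when tuning either hyperparameter.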

[2] An Evaluation of the Pedagogical Soundness and Usability of AI-Generated Lesson Plans Across Different Models and Prompt Frameworks in High-School Physics

Xincheng Liu

Main category: cs.CL

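TL;DR: This study evaluates the pedagogical soundness and usability of lesson plans generated by five leading large language models for the high-school physics topic "The Electromagnetic Spectrum". Model choice mainly affects linguistic readability, while the prompt framework (especially RACE) markedly improves factual accuracy and curriculum alignment. Across all models, learning objectives mostly stay at the Remember and Understand levels and lack higher-order cognitive goals.

Motivation: As AI is used more widely in education, ensuring that AI-generated lesson plans are pedagogically effective, factually accurate, and aligned with the curriculum becomes a key concern. The study systematically evaluates how different large language models and prompt-engineering strategies perform at generating high-quality lesson plans.

Method: Five models (ChatGPT (GPT-5), Claude Sonnet 4.5, Gemini 2.5 Flash, DeepSeek V3.2, and Grok 4) are combined with three prompt frameworks (TAG, RACE, and COSTAR) to generate 15 lesson plans for the same high-school physics topic. The plans are analyzed with four automated metrics: readability, factual accuracy, curriculum-standards alignment, and cognitive demand.

Result: DeepSeek produces the most readable lesson plan (FKGL = 8.64) and Claude the most linguistically complex (FKGL = 19.89). The RACE framework performs best at lowering the hallucination rate and improving alignment with NGSS standards. The learning objectives in all plans are largely confined to the Remember and Understand tiers of Bloom's taxonomy and contain few higher-order verbs.

Conclusion: Model choice determines linguistic readability, while the prompt framework affects instructional reliability and curricular alignment. The best configuration pairs a highly readable model with the RACE framework, supplemented by an explicit checklist of core concepts, standards, and higher-order objectives.

Abstract: This study evaluates the pedagogical soundness and usability of AI-generated lesson plans across five leading large language models: ChatGPT (GPT-5), Claude Sonnet 4.5, Gemini 2.5 Flash, DeepSeek V3.2, and Grok 4. Beyond model choice, three structured prompt frameworks were tested: TAG (Task, Audience, Goal), RACE (Role, Audience, Context, Execution), and COSTAR (Context, Objective, Style, Tone, Audience, Response Format). Fifteen lesson plans were generated for a single high-school physics topic, The Electromagnetic Spectrum. The lesson plans were analyzed through four automated computational metrics: (1) readability and linguistic complexity, (2) factual accuracy and hallucination detection, (3) standards and curriculum alignment, and (4) cognitive demand of learning objectives. Results indicate that model selection exerted the strongest influence on linguistic accessibility, with DeepSeek producing the most readable teaching plan (FKGL = 8.64) and Claude generating the densest language (FKGL = 19.89). The prompt framework structure most strongly affected factual accuracy and pedagogical completeness, with the RACE framework yielding the lowest hallucination index and the highest incidental alignment with NGSS curriculum standards. Across all models, the learning objectives in the fifteen lesson plans clustered at the Remember and Understand tiers of Bloom's taxonomy, and few higher-order verbs appeared in the extracted learning objectives. Overall, the findings suggest that readability is significantly governed by model design, while instructional reliability and curricular alignment depend more on the prompt framework. The most effective configuration for lesson plans identified in the results was to combine a readability-optimized model with the RACE framework and an explicit checklist of physics concepts, curriculum standards, and higher-order objectives.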
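The FKGL values quoted in this entry refer to the Flesch-Kincaid Grade Level, which is defined as 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below shows how such a score could be computed. The vowel-group syllable counter and the regex-based sentence splitter are crude assumptions, not the tooling used in the study, so absolute values will differ from those reported.

```python
import re

def count_syllables(word):
    # Rough heuristic (an assumption): one syllable per run of vowels,
    # with a minimum of one syllable per word.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def fkgl(text):
    # Flesch-Kincaid Grade Level:
    # 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

simple = "The cat sat. The dog ran."
dense = "Electromagnetic radiation propagates perpendicular oscillations."
print(fkgl(simple) < fkgl(dense))
```

Short sentences made of one-syllable words score far lower than a single sentence of long technical terms, which is the kind of contrast behind the gap between the most and least readable plans.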

[3] From Denoising to Refining: A Corrective Framework for Vision-Language Diffusion Model

Yatai Ji, Teng Wang, Yuying Ge, Zhiheng Liu, Sidi Yang, Ying Shan, Ping Luo

Main category: cs.CL

TL;DR: This paper proposes ReDiff, an improved discrete diffusion framework whose active correction mechanism addresses the cascading errors that initial mistakes trigger at inference time in discrete diffusion models, improving the consistency and factual accuracy of generated content.

Motivation: Discrete diffusion models are promising for vision-language tasks, but a gap between training and inference lets initial token errors set off a chain reaction that produces syntactic errors and semantic hallucinations, limiting practical use.

Method: The generation process is reframed from passive denoising to active refining. The ReDiff framework uses two-stage training: the model first learns to revise synthetic errors, building a basic revision capability, and then enters an online self-correction loop in which it learns from an expert's corrections to fix its own flawed drafts.

Result: Experiments show that ReDiff significantly improves the coherence and factual accuracy of generated content and achieves more stable and efficient parallel generation than traditional denoising methods.

Conclusion: Through mistake-driven learning, ReDiff breaks the error cascade and gives the model the ability to correct itself, offering a workable path toward practical deployment of discrete diffusion models.

Abstract: Discrete diffusion models have emerged as a promising direction for vision-language tasks, offering bidirectional context modeling and theoretical parallelization. However, their practical application is severely hindered by a train-inference discrepancy, which leads to catastrophic error cascades: initial token errors during parallel decoding pollute the generation context, triggering a chain reaction of compounding errors and leading to syntactic errors and semantic hallucinations. To address this fundamental challenge, we reframe the generation process from passive denoising to active refining. We introduce ReDiff, a refining-enhanced diffusion framework that teaches the model to identify and correct its own errors. Our approach features a two-stage training process: first, we instill a foundational revision capability by training the model to revise synthetic errors; second, we implement a novel online self-correction loop where the model is explicitly trained to revise its own flawed drafts by learning from an expert's corrections. This mistake-driven learning endows the model with the crucial ability to revisit and refine its already generated output, effectively breaking the error cascade. Extensive experiments demonstrate that ReDiff significantly improves the coherence and factual accuracy of generated content, enabling stable and efficient parallel generation far superior to traditional denoising methods. Our codes and models are available at https://rediff-hku.github.io/.

[4] Stream: Scaling up Mechanistic Interpretability to Long Context in LLMs via Sparse Attention

J Rosser, José Luis Redondo García, Gustavo Penha, Konstantina Palla, Hugues Bouchard

Main category: cs.CL

TL;DR: This paper proposes Sparse Tracing, a new technique that uses dynamic sparse attention to analyze attention patterns in long contexts efficiently, and presents the Stream algorithm, which enables large-scale interpretability analysis in near-linear time and linear space.

Motivation: As large language models extend their context length to the million-token scale, traditional mechanistic interpretability techniques become impractical because their compute and memory costs grow quadratically. An efficient, scalable method for analyzing attention is needed.

Method: The Stream algorithm uses a hierarchical pruning strategy with a binary-search-style refinement that retains only the top-k key blocks per query. It estimates per-head sparse attention masks in O(T log T) time and O(T) space, enabling efficient one-pass analysis.

Result: Applied to chain-of-thought reasoning traces, the method identifies "thought anchors" while pruning 97-99% of token interactions. On the RULER benchmark it preserves critical retrieval paths while discarding 90-96% of interactions, exposing layer-wise information routes from the needle to the output.

Conclusion: Sparse Tracing makes long-context interpretability feasible on consumer GPUs and provides a practical drop-in tool that helps make long-context reasoning monitoring more widely accessible.

Abstract: As Large Language Models (LLMs) scale to million-token contexts, traditional Mechanistic Interpretability techniques for analyzing attention scale quadratically with context length, demanding terabytes of memory beyond 100,000 tokens. We introduce Sparse Tracing, a novel technique that leverages dynamic sparse attention to efficiently analyze long context attention patterns. We present Stream, a compilable hierarchical pruning algorithm that estimates per-head sparse attention masks in near-linear time $O(T \log T)$ and linear space $O(T)$, enabling one-pass interpretability at scale. Stream performs a binary-search-style refinement to retain only the top-$k$ key blocks per query while preserving the model's next-token behavior. We apply Stream to long chain-of-thought reasoning traces and identify thought anchors while pruning 97-99% of token interactions. On the RULER benchmark, Stream preserves critical retrieval paths while discarding 90-96% of interactions and exposes layer-wise routes from the needle to output. Our method offers a practical drop-in tool for analyzing attention patterns and tracing information flow without terabytes of caches. By making long context interpretability feasible on consumer GPUs, Sparse Tracing helps democratize chain-of-thought monitoring. Code is available at https://anonymous.4open.science/r/stream-03B8/.
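The core idea of keeping only the top-k key blocks per query can be illustrated with a flat, single-level sketch. It is not the Stream algorithm: the hierarchical, binary-search-style refinement and the check that next-token behavior is preserved are omitted, and scoring a block by its maximum attention score is an assumption made for illustration.

```python
def topk_block_mask(scores, block_size, k):
    """Return a per-key keep mask that retains only the k best key blocks.

    scores     -- attention scores of one query over all keys
    block_size -- number of consecutive keys per block
    k          -- number of blocks to keep
    """
    num_blocks = (len(scores) + block_size - 1) // block_size
    # Score each block by its strongest key (an illustrative choice).
    block_scores = [max(scores[b * block_size:(b + 1) * block_size])
                    for b in range(num_blocks)]
    ranked = sorted(range(num_blocks), key=lambda b: block_scores[b], reverse=True)
    keep = set(ranked[:k])
    return [(i // block_size) in keep for i in range(len(scores))]

def pruned_fraction(mask):
    # Share of query-key interactions discarded by the mask.
    return 1.0 - sum(mask) / len(mask)

scores = [0.1, 0.2, 0.9, 0.3, 0.05, 0.0, 0.4, 0.8]
mask = topk_block_mask(scores, block_size=2, k=2)
print(mask, pruned_fraction(mask))
```

In this toy input, half of the interactions are dropped. The 90-99% figures in the abstract correspond to much longer contexts where k blocks are a small share of all keys.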

[5] Automated HIV Screening on Dutch EHR with Large Language Models

Lang Zhou, Amrish Jhingoer, Yinghao Luo, Klaske Vliegenthart-Jongbloed, Carlijn Jordans, Ben Werkhoven, Tom Seinen, Erik van Mulligen, Casper Rokx, Yunlei Li

Main category: cs.CL

TL;DR: The paper proposes a new pipeline based on a large language model that analyzes unstructured electronic health record text to determine whether a patient should receive further HIV testing.

Motivation: Existing research on HIV diagnosis focuses mainly on structured data and overlooks unstructured text such as clinical notes, which may contain important risk information.

Method: A large language model (LLM) is used to build a new pipeline that analyzes unstructured text from the electronic health records of Erasmus University Medical Center.

Result: Experiments show that the pipeline achieves high accuracy while maintaining a low false negative rate.

Conclusion: The proposed LLM pipeline makes effective use of unstructured EHR text, improves the accuracy of HIV screening, and has potential for clinical application.

Abstract: Efficient screening and early diagnosis of HIV are critical for reducing onward transmission. Although large-scale laboratory testing is not feasible, the widespread adoption of Electronic Health Records (EHRs) offers new opportunities to address this challenge. Existing research primarily focuses on applying machine learning methods to structured data, such as patient demographics, for improving HIV diagnosis. However, these approaches often overlook unstructured text data such as clinical notes, which potentially contain valuable information relevant to HIV risk. In this study, we propose a novel pipeline that leverages a Large Language Model (LLM) to analyze unstructured EHR text and determine a patient's eligibility for further HIV testing. Experimental results on clinical data from Erasmus University Medical Center Rotterdam demonstrate that our pipeline achieved high accuracy while maintaining a low false negative rate.

[6] An Expert-grounded benchmark of General Purpose LLMs in LCA

Artur Donaldson, Bharathan Balaji, Cajetan Oriekezie, Manish Kumar, Laure Patouillard

Main category: cs.CL

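TL;DR: This study is the first to benchmark 11 large language models (LLMs) for life cycle assessment (LCA) systematically through expert review. Experts found that 37% of model outputs contain inaccurate or misleading information and that hallucination is a serious problem (some models hallucinate citations at rates of up to 40%), although several models do well on explanation quality and format adherence. Open-weight models are not inferior to closed-weight ones on accuracy and related criteria. The study stresses that, without standardized evaluation frameworks, general-purpose LLMs should be used with caution and not treated as free-form oracles.

Motivation: Although LLMs are being explored as support for LCA across environmental and social domains, systematic evidence on their reliability, robustness, and usability remains limited. Standardized evaluation frameworks are missing, and LCA itself has no clear ground truth or consensus protocols, so a reliable benchmark has to be grounded in expert judgment.

Method: Eleven general-purpose LLMs, covering commercial and open-source families, are evaluated on 22 LCA-related tasks. Seventeen experienced practitioners review the outputs for scientific accuracy, explanation quality, robustness, verifiability, and adherence to instructions, yielding 168 expert reviews.

Result: 37% of responses contain inaccurate or misleading information. Even smaller models reach average or good ratings on accuracy and explanation quality. Format adherence is generally good. Some models produce hallucinated citations at rates of up to 40%. Open-weight models match or outperform closed-weight models on accuracy and explanation quality.

Conclusion: Applying general-purpose LLMs to LCA directly as free-form question-answering tools carries substantial risk, especially from hallucinations and incorrect information. The models nevertheless show some potential for improving explanation quality and easing the burden of simple tasks. Using general-purpose LLMs without grounding mechanisms therefore calls for particular caution.

Abstract: Purpose: Artificial intelligence (AI), and in particular large language models (LLMs), are increasingly being explored as tools to support life cycle assessment (LCA). While demonstrations exist across environmental and social domains, systematic evidence on their reliability, robustness, and usability remains limited. This study provides the first expert-grounded benchmark of LLMs in LCA, addressing the absence of standardized evaluation frameworks in a field where no clear ground truth or consensus protocols exist. Methods: We evaluated eleven general-purpose LLMs, spanning both commercial and open-source families, across 22 LCA-related tasks. Seventeen experienced practitioners reviewed model outputs against criteria directly relevant to LCA practice, including scientific accuracy, explanation quality, robustness, verifiability, and adherence to instructions. We collected 168 expert reviews. Results: Experts judged 37% of responses to contain inaccurate or misleading information. Ratings of accuracy and quality of explanation were generally average or good for many models, even smaller ones, and format adherence was generally rated favourably. Hallucination rates varied significantly, with some models producing hallucinated citations at rates of up to 40%. There was no clear-cut distinction between ratings on open-weight versus closed-weight LLMs, with open-weight models outperforming or competing on par with closed-weight models on criteria such as accuracy and quality of explanation. Conclusion: These findings highlight the risks of applying LLMs naïvely in LCA, such as when LLMs are treated as free-form oracles, while also showing benefits especially around quality of explanation and alleviating labour intensiveness of simple tasks. The use of general-purpose LLMs without grounding mechanisms presents ...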

[7] Can They Dixit? Yes they Can! Dixit as a Playground for Multimodal Language Model Capabilities

Nishant Balepur, Dang Nguyen, Dayeon Ki

Main category: cs.CL

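TL;DR: The paper proposes Dixit, a game-based evaluation, to assess the capabilities of multimodal large language models (MLMs) holistically and to overcome the limitations of traditional benchmarks and subjective comparisons. Experiments show that Dixit win rates agree closely with mainstream benchmarks and reveal weaknesses in MLM reasoning strategies.

Motivation: Existing MLM evaluations are fragmented, subjective, and open to shortcut exploitation. A task framework that assesses multiple abilities together and objectively is missing.

Method: A game-based evaluation framework is built on the card game Dixit, which requires models to generate deliberately misleading captions and so tests their multimodal understanding and reasoning. Human-versus-model and model-versus-model games are analyzed quantitatively and qualitatively.

Result: The win-rate ranking of five MLMs in Dixit matches their ranking on mainstream benchmarks exactly. Games against humans reveal MLM shortcomings in strategic flexibility and deeper reasoning.

Conclusion: Game-based evaluation can measure the overall capabilities of MLMs effectively and objectively, and Dixit offers a scalable, engaging paradigm for future MLM evaluation.

Abstract: Multi-modal large language models (MLMs) are often assessed either on static, individual benchmarks, which cannot jointly assess MLM capabilities in a single task, or through human or model pairwise comparisons, which are highly subjective and expensive and allow models to exploit superficial shortcuts (e.g., verbosity) to inflate their win-rates. To overcome these issues, we propose game-based evaluations to holistically assess MLM capabilities. Games require multiple abilities for players to win, are inherently competitive, are governed by fixed, objective rules, and make evaluation more engaging, providing a robust framework to address the aforementioned challenges. We manifest this evaluation specifically through Dixit, a fantasy card game where players must generate captions for a card that trick some, but not all players, into selecting the played card. Our quantitative experiments with five MLMs show Dixit win-rate rankings are perfectly correlated with those on popular MLM benchmarks, while games between human and MLM players in Dixit reveal several differences between agent strategies and areas of improvement for MLM reasoning.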

[8] Large Language Model enabled Mathematical Modeling

Guoyun Zhang

Main category: cs.CL

TL;DR: This study examines the potential of the DeepSeek-R1 large language model for optimization modeling in operations research (OR), using natural language understanding and code generation to bridge the gap between real-world problems and mathematical models.

Motivation: Traditional optimization methods depend on domain experts to formulate problems, while existing large models such as GPT-4 are costly and prone to hallucination, which limits their use in practical settings such as supply chains.

Method: DeepSeek-R1 is evaluated systematically on four OR benchmarks (NL4OPT, IndustryOR, EasyLP, and ComplexOR), with strategies such as LLM-as-a-Judge, few-shot learning, tool calling, and a multi-agent framework used to reduce hallucinations and improve formulation accuracy.

Result: The study finds that DeepSeek-R1 offers strong performance at low cost on OR tasks, and that the proposed mitigation strategies significantly reduce hallucinations and improve the agreement between model outputs and user intent.

Conclusion: Combined with targeted mitigation strategies, DeepSeek-R1 can effectively support decision modeling in operations research and offers a workable path to low-hallucination, low-cost industrial applications.

Abstract: The integration of Large Language Models (LLMs) with optimization modeling offers a promising avenue for advancing decision-making in operations research (OR). Traditional optimization methods, such as linear programming, mixed integer programming, and simulation, depend heavily on domain expertise to translate real-world problems into solvable mathematical models. While solvers like Gurobi and COPT are powerful, expert input remains essential for defining objectives, constraints, and variables. This research investigates the potential of LLMs, specifically the DeepSeek-R1 model, to bridge this formulation gap using natural language understanding and code generation. Although prior models like GPT-4, Claude, and Bard have shown strong performance in NLP and reasoning tasks, their high token costs and tendency toward hallucinations limit real-world applicability in supply chain contexts. In contrast, DeepSeek-R1, a cost-efficient and high-performing model trained with reinforcement learning, presents a viable alternative. Despite its success in benchmarks such as LiveCodeBench and Math-500, its effectiveness in applied OR scenarios remains underexplored. This study systematically evaluates DeepSeek-R1 across four key OR benchmarks: NL4OPT, IndustryOR, EasyLP, and ComplexOR. Our methodology includes baseline assessments, the development of a hallucination taxonomy, and the application of mitigation strategies like LLM-as-a-Judge, Few-shot Learning (FSL), Tool Calling, and a Multi-agent Framework. These techniques aim to reduce hallucinations, enhance formulation accuracy, and better align model outputs with user intent.

[9] Learning from Supervision with Semantic and Episodic Memory: A Reflective Approach to Agent Adaptation

Jackson Hassell, Dan Zhang, Hannah Kim, Tom Mitchell, Estevam Hruschka

Main category: cs.CL

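TL;DR: This paper proposes a memory-augmented framework for agents built on pretrained large language models. It uses instance-level critiques (via episodic memory) and task-level guidance (via semantic memory) to learn target classification functions without parameter updates, clearly outperforms retrieval baselines that rely only on labels, and introduces a "suggestibility" metric to explain how models respond to supervision signals.

Motivation: Conventional fine-tuning is costly, inflexible, and opaque, so a more flexible and interpretable way of adapting models without parameter updates is needed.

Method: The memory-augmented framework uses episodic memory to store labeled instances together with critiques and semantic memory to distill task-level guidance. LLM-generated critiques guide the reasoning process, with no gradient updates.

Result: Across a range of tasks, adding critiques improves accuracy by up to 24.8% over RAG-style baselines that use labels only. OpenAI and open-source models behave markedly differently on fact-oriented versus preference-based data. The proposed suggestibility metric quantifies how strongly a model responds to the supervision stored in memory.

Conclusion: Memory-driven, reflective learning can effectively improve the adaptability and interpretability of LLM agents and is a promising direction for adapting models without parameter updates.

Abstract: We investigate how agents built on pretrained large language models can learn target classification functions from labeled examples without parameter updates. While conventional approaches like fine-tuning are often costly, inflexible, and opaque, we propose a memory-augmented framework that leverages both labeled data and LLM-generated critiques. Our framework uses episodic memory to store instance-level critiques, which capture specific past experiences, and semantic memory to distill these into reusable, task-level guidance. Across a diverse set of tasks, incorporating critiques yields up to a 24.8 percent accuracy improvement over retrieval-based (RAG-style) baselines that rely only on labels. Through extensive empirical evaluation, we uncover distinct behavioral differences between OpenAI and open-source models, particularly in how they handle fact-oriented versus preference-based data. To interpret how models respond to different representations of supervision encoded in memory, we introduce a novel metric, suggestibility. This helps explain observed behaviors and illuminates how model characteristics and memory strategies jointly shape learning dynamics. Our findings highlight the promise of memory-driven, reflective learning for building more adaptive and interpretable LLM agents.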

[10] LyriCAR: A Difficulty-Aware Curriculum Reinforcement Learning Framework For Controllable Lyric Translation

Le Ren, Xiangjian Zeng, Qingqiang Wu, Ruoxuan Liang

Main category: cs.CL

TL;DR: The paper proposes LyriCAR, an unsupervised framework for controllable lyric translation whose difficulty-aware curriculum design and adaptive strategy substantially improve translation quality and training efficiency.

Motivation: Existing methods rely on hand-crafted rules and sentence-level modeling, which makes it hard to maintain cross-line coherence and global rhyme at the paragraph level and limits generalization.

Method: The LyriCAR framework uses a difficulty-aware curriculum designer and an adaptive curriculum strategy to guide the model, without supervision, toward progressively more complex patterns.

Result: LyriCAR achieves state-of-the-art results on the EN-ZH lyric translation task, cuts training steps by nearly 40%, and outperforms strong baselines on both translation metrics and multi-dimensional reward scores.

Conclusion: LyriCAR balances musical and linguistic constraints effectively, improving the coherence, rhyme, and overall quality of lyric translations while significantly speeding up convergence.

Abstract: Lyric translation is a challenging task that requires balancing multiple musical constraints. Existing methods often rely on hand-crafted rules and sentence-level modeling, which restrict their ability to internalize musical-linguistic patterns and to generalize effectively at the paragraph level, where cross-line coherence and global rhyme are crucial. In this work, we propose LyriCAR, a novel framework for controllable lyric translation that operates in a fully unsupervised manner. LyriCAR introduces a difficulty-aware curriculum designer and an adaptive curriculum strategy, ensuring efficient allocation of training resources, accelerating convergence, and improving overall translation quality by guiding the model with increasingly complex challenges. Extensive experiments on the EN-ZH lyric translation task show that LyriCAR achieves state-of-the-art results across both standard translation metrics and multi-dimensional reward scores, surpassing strong baselines. Notably, the adaptive curriculum strategy reduces training steps by nearly 40% while maintaining superior performance. Code, data and model can be accessed at https://github.com/rle27/LyriCAR.

[11] LLM-Augmented Symbolic NLU System for More Reliable Continuous Causal Statement Interpretation

Xin Lian, Kenneth D. Forbus

Main category: cs.CL

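TL;DR: This paper proposes a hybrid approach that combines large language models (LLMs) with a symbolic natural language understanding (NLU) system, pairing the broad coverage of LLMs with the structured representations of symbolic systems. On extracting quantities and causal relations from commonsense science texts, it outperforms a purely symbolic system.

Motivation: LLMs are widely applicable but prone to errors in generated facts and to inconsistent output structure. Symbolic NLU systems are interpretable and well suited to reasoning but have limited coverage and are costly to maintain. Combining their strengths is therefore attractive.

Method: A hybrid architecture uses LLMs for rephrasing and text simplification to broaden coverage and to fill knowledge gaps automatically, and uses symbolic NLU to produce structured relational representations that support reasoning and incremental learning.

Result: On extracting quantities and causal laws from commonsense science texts, the hybrid method performs significantly better than the symbolic-only system while retaining good interpretability and reasoning ability.

Conclusion: The hybrid method effectively combines the broad language processing of LLMs with the precise structured representations of symbolic NLU, striking a better balance between performance and maintainability, and has potential for wider use in complex semantic understanding tasks.

Abstract: Despite the broad applicability of large language models (LLMs), their reliance on probabilistic inference makes them vulnerable to errors such as hallucination in generated facts and inconsistent output structure in natural language understanding (NLU) tasks. By contrast, symbolic NLU systems provide interpretable understanding grounded in curated lexicons, semantic resources, and syntactic and semantic interpretation rules. They produce relational representations that can be used for accurate reasoning and planning, as well as incremental debuggable learning. However, symbolic NLU systems tend to be more limited in coverage than LLMs and require scarce knowledge representation and linguistics skills to extend and maintain. This paper explores a hybrid approach that integrates the broad-coverage language processing of LLMs with the symbolic NLU capabilities of producing structured relational representations, with the aim of getting the best of both approaches. We use LLMs for rephrasing and text simplification, to provide broad coverage, and as a source of information to fill in knowledge gaps more automatically. We use symbolic NLU to produce representations that can be used for reasoning and for incremental learning. We evaluate this approach on the task of extracting and interpreting quantities and causal laws from commonsense science texts, comparing it against symbolic-only and LLM-only pipelines. Our results suggest that our hybrid method works significantly better than the symbolic-only pipeline.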

[12] A Fundamental Algorithm for Dependency Parsing (With Corrections)

Michael A. Covington

Main category: cs.CL

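TL;DR: The paper presents a fundamental algorithm for parsing natural language sentences into dependency trees. It processes one word at a time and attaches each word as soon as possible, mirroring properties attributed to human parsing.

Motivation: The goal is a dependency parsing algorithm that better matches how humans process language, improving both parsing efficiency and cognitive plausibility.

Method: The algorithm processes words one at a time and attaches each word as soon as it can be attached. Its worst-case time complexity is O(n^3), but in real language the worst case occurs only for small n.

Result: The algorithm has the same worst-case complexity as phrase-structure parsing but is more efficient on real language, consistent with the real-time nature of human language processing.

Conclusion: The dependency parsing algorithm balances computational efficiency and cognitive plausibility well and is suited to modeling human language understanding.

Abstract: This paper presents a fundamental algorithm for parsing natural language sentences into dependency trees. Unlike phrase-structure (constituency) parsers, this algorithm operates one word at a time, attaching each word as soon as it can be attached, corresponding to properties claimed for the parser in the human brain. Like phrase-structure parsing, its worst-case complexity is $O(n^3)$, but in human language, the worst case occurs only for small $n$.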
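The word-at-a-time attachment strategy described in this entry can be sketched as follows. This is a simplified illustration, not the paper's full algorithm: it omits projectivity and cycle checks, and the `toy_can_depend` grammar oracle and the `TAGS` sentence are hypothetical stand-ins for a real grammar.

```python
def parse(n, can_depend):
    """Attach words left to right; heads[i] is the head index of word i.

    can_depend(head, dep) is a grammar oracle (hypothetical here) that says
    whether the word at index dep may depend on the word at index head.
    Words left with head None are roots.
    """
    heads = [None] * n
    for i in range(n):                      # read one word at a time
        for j in range(i - 1, -1, -1):      # scan earlier words, nearest first
            if heads[j] is None and can_depend(i, j):
                heads[j] = i                # earlier word depends on the new word
            elif heads[i] is None and can_depend(j, i):
                heads[i] = j                # new word depends on the earlier word
    return heads

# Toy sentence "the dog chased the cat", represented by part-of-speech tags.
TAGS = ["DET", "NOUN", "VERB", "DET", "NOUN"]

def toy_can_depend(head, dep):
    pair = (TAGS[head], TAGS[dep])
    if pair == ("NOUN", "DET"):
        return dep < head                   # a determiner precedes its noun
    return pair == ("VERB", "NOUN")         # nouns depend on the verb

print(parse(len(TAGS), toy_can_depend))
```

Making the oracle direction-aware is what stops the second determiner from attaching to the first noun. In a real parser that knowledge would come from the grammar.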

[13] Beyond MedQA: Towards Real-world Clinical Decision Making in the Era of LLMs

Yunpeng Xiao, Carl Yang, Mark Mai, Xiao Hu, Kai Shu

Main category: cs.CL

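TL;DR: The paper proposes a unifying paradigm for clinical decision-making tasks that characterizes task difficulty along two dimensions, clinical background and clinical question, with the aim of making large language models more applicable to real medical settings.

Motivation: Existing medical datasets such as MedQA mostly rely on simplified question answering, which does not adequately reflect real clinical decision-making and limits the effective evaluation and use of large language models in clinical practice.

Method: The authors build a framework with two dimensions, clinical background and clinical question, use it to survey existing datasets and benchmarks systematically, review training-time and test-time methods, and extend evaluation beyond accuracy to efficiency and explainability.

Result: The paradigm supports systematic analysis of dataset settings, makes model assumptions explicit, standardizes model comparison, and shows how effective existing methods are at different levels of task difficulty.

Conclusion: The two-dimensional paradigm helps clarify assumptions, unify evaluation standards, and guide the development of clinically meaningful large language models, supporting their use in real medical environments.

Abstract: Large language models (LLMs) show promise for clinical use. They are often evaluated using datasets such as MedQA. However, many medical datasets, such as MedQA, rely on simplified question answering (Q&A) that underrepresents real-world clinical decision-making. Based on this, we propose a unifying paradigm that characterizes clinical decision-making tasks along two dimensions: Clinical Backgrounds and Clinical Questions. As the background and questions approach the real clinical environment, the difficulty increases. We summarize the settings of existing datasets and benchmarks along these two dimensions. Then we review methods to address clinical decision-making, including training-time and test-time techniques, and summarize when they help. Next, we extend evaluation beyond accuracy to include efficiency and explainability. Finally, we highlight open challenges. Our paradigm clarifies assumptions, standardizes comparisons, and guides the development of clinically meaningful LLMs.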

[14] Forging GEMs: Advancing Greek NLP through Quality-Based Corpus Curation and Specialized Pre-training

Alexandra Apostolopoulou, Konstantinos Kanaris, Athanasios Koursaris, Dimitris Tsakalidis, George Domalis, Ioannis E. Livieris

Main category: cs.CL

TL;DR: This paper presents Greek Embedding Models (GEM), a new generation of embedding models for Modern Greek. High-quality data preprocessing and a diverse set of modern transformer architectures (such as ELECTRA, ConvBERT, and ModernBERT) deliver performance gains in both general and legal domains, and the paper introduces the first bilingual Greek-English embedding models for the legal domain.

Motivation: Existing Greek NLP models are held back by fragmented research, limited architectural variety, and short context lengths (such as the 512-token limit). They perform poorly in the legal domain in particular, where long documents must be analyzed, so more advanced and adaptable models are needed.

Method: The authors build large-scale, high-quality Greek corpora for the general and legal domains with a rigorous cleaning and preprocessing pipeline. On this basis they pre-train and evaluate several modern transformer architectures (ELECTRA, ConvBERT, ModernBERT) and propose the first bilingual Greek-English legal embedding models.

Result: Experiments show that GEM-RoBERTa and GEM-ConvBERT significantly outperform existing baselines on downstream tasks, supporting the effectiveness of the approach.

Conclusion: By combining high-quality data with modern architectures, the GEM models substantially improve natural language processing for Greek, especially in the legal domain, and offer a modeling recipe that other morphologically rich, moderately resourced languages can follow.

Abstract: The advancement of natural language processing for morphologically rich, moderately-resourced languages like Modern Greek is often hindered by a fragmented research landscape, a lack of architectural diversity and reliance on limited context-length models. This is particularly true in specialized, high-value domains such as law, where existing models are frequently confined to early transformer architectures with a restrictive 512-token window, insufficient for analyzing long legal documents. To address these challenges, this paper presents Greek Embedding Models, a new family of transformer models for the Greek language built upon a foundation of extensive, quality-driven data curation. We detail the construction of several large-scale Greek corpora, emphasizing a rigorous, quality-based filtering and preprocessing methodology to create high-value training datasets from both general-domain and specialized legal sources. On this carefully curated foundation, we pre-train and systematically evaluate a diverse suite of modern architectures that have not previously been applied to the Greek language, such as ELECTRA, ConvBERT and ModernBERT. Furthermore, we propose the first bilingual Greek-English Embedding Models tailored for the legal domain. The extensive experiments on downstream tasks establish the effectiveness of the proposed approach, with the GEM-RoBERTa and GEM-ConvBERT models significantly outperforming existing baselines.

[15] Improving Transfer Learning for Sequence Labeling Tasks by Adapting Pre-trained Neural Language Models

David Dukić

Main category: cs.CL

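TL;DR: This thesis proposes three ways to improve transfer learning so that pre-trained language models perform better on sequence labeling tasks: multi-task learning with an additional signal, architectural modifications to autoregressive large language models that enable bidirectional information flow, and a sequence labeling framework based on generative in-context fine-tuning.

Motivation: Improving the cross-domain transfer of pre-trained language models on sequence labeling tasks (such as event trigger detection) requires addressing three problems: existing transfer learning methods are sensitive to domain shift, model architecture restricts information flow, and effective in-context fine-tuning mechanisms are lacking.

Method: (1) A multi-task model that incorporates an additional signal from a domain-independent text processing system. (2) A modification of autoregressive large language model architectures that enables bidirectional information flow across layers. (3) A sequence labeling framework that combines supervised in-context fine-tuning with response-oriented adaptation strategies.

Result: The proposed model, method, and framework significantly improve the performance of pre-trained language models on sequence labeling tasks, with especially strong results in domain transfer for event trigger detection.

Conclusion: Targeted transfer learning paradigms (such as multi-task learning, architectural modification, and generative in-context fine-tuning) bring out the full potential of pre-trained language models on sequence labeling tasks.

Abstract: This doctoral thesis improves transfer learning for sequence labeling tasks by adapting pre-trained neural language models. The proposed improvements in transfer learning involve introducing a multi-task model that incorporates an additional signal, a method based on architectural modifications in autoregressive large language models, and a sequence labeling framework for autoregressive large language models utilizing supervised in-context fine-tuning combined with response-oriented adaptation strategies. The first improvement is given in the context of domain transfer for the event trigger detection task. The domain transfer of the event trigger detection task can be improved by incorporating an additional signal obtained from a domain-independent text processing system into a multi-task model. The second improvement involves modifying the model's architecture. For that purpose, a method is proposed to enable bidirectional information flow across layers of autoregressive large language models. The third improvement utilizes autoregressive large language models as text generators through a generative supervised in-context fine-tuning framework. The proposed model, method, and framework demonstrate that pre-trained neural language models achieve their best performance on sequence labeling tasks when adapted through targeted transfer learning paradigms.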

[16] ToolScope: Enhancing LLM Agent Tool Use through Tool Merging and Context-Aware Filtering

Marianne Menglin Liu, Daniel Garcia, Fjona Parllaku, Vikas Upadhyay, Syed Fahad Allam Shah, Dan Roth

Main category: cs.CL

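TL;DR: The paper proposes ToolScope, which improves the tool selection accuracy of large language models on complex tasks by merging tools with automatic correction and retrieving only the most relevant tools.

Motivation: The work addresses the selection ambiguity that large language models face when toolsets contain redundant tools, and the constraint that context length limits place on using large toolsets.

Method: ToolScopeMerger merges tools with automatic correction to reduce redundancy, and ToolScopeRetriever dynamically selects the most relevant tools for each query, compressing the toolset to fit within context limits.

Result: Experiments on three mainstream LLMs and open-source benchmarks show tool selection accuracy gains of 8.38% to 38.6%.

Conclusion: ToolScope effectively improves the accuracy and efficiency with which large language models use external tools on complex tasks, especially when toolsets are redundant and context is limited.

Abstract: Large language model (LLM) agents rely on external tools to solve complex tasks, but real-world toolsets often contain redundant tools with overlapping names and descriptions, introducing ambiguity and reducing selection accuracy. LLMs also face strict input context limits, preventing efficient consideration of large toolsets. To address these challenges, we propose ToolScope, which includes: (1) ToolScopeMerger with Auto-Correction to automatically audit and fix tool merges, reducing redundancy, and (2) ToolScopeRetriever to rank and select only the most relevant tools for each query, compressing toolsets to fit within context limits without sacrificing accuracy. Evaluations on three state-of-the-art LLMs and three open-source tool-use benchmarks show gains of 8.38% to 38.6% in tool selection accuracy, demonstrating ToolScope's effectiveness in enhancing LLM tool use.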
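The retrieval step (rank tools against the query and keep only the top few so the toolset fits in context) can be sketched with a simple lexical scorer. The abstract does not describe how ToolScopeRetriever scores tools, so the Jaccard token overlap used here is a stand-in assumption, and the tool names and descriptions are invented for illustration.

```python
import re

def tokenize(text):
    # Lowercased alphanumeric tokens; underscores in tool names act as separators.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def retrieve_tools(query, tools, k):
    """Return the names of the k tools whose name and description best match the query.

    tools maps a tool name to its natural-language description.
    """
    q = tokenize(query)
    ranked = sorted(
        tools,
        key=lambda name: jaccard(q, tokenize(name + " " + tools[name])),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical toolset, not taken from the paper's benchmarks.
TOOLS = {
    "get_weather": "Return the current weather forecast for a city",
    "send_email": "Send an email message to a recipient",
    "convert_currency": "Convert an amount between two currencies",
}
print(retrieve_tools("What is the weather forecast for Paris", TOOLS, k=1))
```

A production retriever would more plausibly use dense embeddings than token overlap, but the interface stays the same: a query and a toolset go in, and a short ranked list comes out.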

[17] From Facts to Folklore: Evaluating Large Language Models on Bengali Cultural Knowledge

Nafis Chowdhury, Moinul Haque, Anika Ahmed, Nazia Tasnim, Md. Istiak Hossain Shihab, Sajjadur Rahman, Farig Sadeque

Main category: cs.CL

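TL;DR: The paper introduces BLanCK, a dataset for evaluating how well multilingual large language models handle Bengali cultural knowledge (such as folk traditions, culinary arts, and dialects). Existing models perform poorly on culture-related tasks, but providing context improves performance substantially.

Motivation: Existing multilingual benchmarks fall short in capturing the nuances of low-resource cultures, making it hard to assess accurately how well large language models understand culture-specific knowledge.

Method: The authors build a Bengali cultural knowledge dataset (BLanCK) covering folk traditions, culinary arts, and regional dialects, and evaluate several multilingual large language models on it with and without context.

Result: Current multilingual models perform well on non-cultural categories but poorly on cultural knowledge. With context provided, performance improves substantially for all models.

Conclusion: Improving how large language models understand low-resource cultures requires both context-aware architectures and culturally curated training data.

Abstract: Recent progress in NLP research has demonstrated remarkable capabilities of large language models (LLMs) across a wide range of tasks. While recent multilingual benchmarks have advanced cultural evaluation for LLMs, critical gaps remain in capturing the nuances of low-resource cultures. Our work addresses these limitations through a Bengali Language Cultural Knowledge (BLanCK) dataset including folk traditions, culinary arts, and regional dialects. Our investigation of several multilingual language models shows that while these models perform well in non-cultural categories, they struggle significantly with cultural knowledge. Performance improves substantially across all models when context is provided, underscoring the importance of context-aware architectures and culturally curated training data.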

[18] Enhancing Reasoning Skills in Small Persian Medical Language Models Can Outperform Large-Scale Data Training

Mehrdad Ghassabi,Sadra Hakim,Hamidreza Baradaran Kashani,Pedram Rostami

Main category: cs.CL

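TL;DR: This study uses Reinforcement Learning with AI Feedback (RLAIF) and Direct Preference Optimization (DPO) to improve the medical reasoning of a small Persian language model. Trained on a medical question-answering dataset containing correct and incorrect chains of thought, the model outperforms its predecessor, which was trained on far more data.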

Details Motivation: To improve the medical reasoning of small language models in under-resourced languages such as Persian, in support of specialized applications. Method: RLAIF is used to generate preferred and rejected answer pairs for DPO training. A multiple-choice medical question-answering dataset is translated into Persian, and teacher and student models are prompted to produce reasoning trajectories containing correct and incorrect chains of thought. Result: The resulting dataset contains 2 million tokens of preferred answers and 2.5 million tokens of rejected answers; the trained model outperforms its predecessor gaokerena-V, which was trained on about 57 million tokens, on Persian medical reasoning. Conclusion: Reasoning-focused training (RLAIF plus DPO with CoT) can efficiently improve domain-specific language models when data is limited, which suits low-resource language settings. Abstract: Enhancing reasoning capabilities in small language models is critical for specialized applications such as medical question answering, particularly in underrepresented languages like Persian. In this study, we employ Reinforcement Learning with AI Feedback (RLAIF) and Direct Preference Optimization (DPO) to improve the reasoning skills of a general-purpose Persian language model. To achieve this, we translated a multiple-choice medical question-answering dataset into Persian and used RLAIF to generate rejected-preferred answer pairs, which are essential for DPO training. By prompting both teacher and student models to produce Chain-of-Thought (CoT) reasoning responses, we compiled a dataset containing correct and incorrect reasoning trajectories. This dataset, comprising 2 million tokens in preferred answers and 2.5 million tokens in rejected ones, was used to train a baseline model, significantly enhancing its medical reasoning capabilities in Persian. Remarkably, the resulting model outperformed its predecessor, gaokerena-V, which was trained on approximately 57 million tokens, despite leveraging a much smaller dataset. These results highlight the efficiency and effectiveness of reasoning-focused training approaches in developing domain-specific language models with limited data availability.

[19] CreativityPrism: A Holistic Benchmark for Large Language Model Creativity

Zhaoyi Joey Hou,Bowei Alvin Zhang,Yining Lu,Bhiman Kumar Baghel,Anneliese Brei,Ximing Lu,Meng Jiang,Faeze Brahman,Snigdha Chaturvedi,Haw-Shiuan Chang,Daniel Khashabi,Xiang Lorraine Li

Main category: cs.CL

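TL;DR: This paper proposes CreativityPrism, a framework that evaluates LLM creativity along three dimensions (quality, novelty, and diversity) across nine tasks, three domains, and twenty metrics. Experiments on 17 state-of-the-art models reveal a performance gap between proprietary and open-source models and weak correlations between some creativity dimensions, underscoring the need for holistic evaluation.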

Details Motivation: Existing creativity evaluation methods are fragmented and inconsistent in how they define and measure creativity, and no unified framework spans scenarios, so a comprehensive, multi-dimensional framework for evaluating LLM creativity is needed. Method: CreativityPrism decomposes creativity into quality, novelty, and diversity; covers nine tasks in three domains (divergent thinking, creative writing, and logical reasoning); and uses twenty task-specific evaluation metrics. Seventeen LLMs are evaluated, and performance correlations across metrics and domains are analyzed. Result: Proprietary models outperform open-source models overall; performance is highly correlated across tasks within a domain and less so across domains; quality and diversity metrics correlate strongly, whereas novelty correlates only weakly with either. Conclusion: Creativity is multi-dimensional rather than a single notion; strong performance on one task or dimension does not generalize to others, so a holistic framework such as CreativityPrism is needed to evaluate LLM creativity. Abstract: Creativity is often seen as a hallmark of human intelligence. While large language models (LLMs) are increasingly perceived as producing creative text, there is still no holistic framework to evaluate their creativity across diverse scenarios. Existing evaluation methods remain fragmented, with dramatic variation across domains and tasks, largely due to differing definitions and measurements of creativity. Inspired by the hypothesis that creativity is not one fixed idea, we propose CreativityPrism, an evaluation analysis framework that decomposes creativity into three dimensions: quality, novelty, and diversity. CreativityPrism incorporates nine tasks, three domains, i.e., divergent thinking, creative writing, and logical reasoning, and twenty evaluation metrics, which measure each dimension in task-specific, unique ways. We evaluate 17 state-of-the-art (SoTA) proprietary and open-sourced LLMs on CreativityPrism and analyze the performance correlations among different metrics and task domains. Our results reveal a notable gap between proprietary and open-source models. Overall, model performance tends to be highly correlated across tasks within the same domain and less so across different domains. Among evaluation dimensions, diversity and quality metrics show strong correlations - models that perform well on one often excel on the other - whereas novelty exhibits much weaker correlation with either. These findings support our hypothesis that strong performance in one creativity task or dimension does not necessarily generalize to others, underscoring the need for a holistic evaluation of LLM creativity.

[20] Leveraging the Power of Large Language Models in Entity Linking via Adaptive Routing and Targeted Reasoning

Yajie Li,Albert Galimov,Mitra Datta Ganapaneni,Pujitha Thejaswi,De Meng,Priyanshu Kumar,Saloni Potdar

Main category: cs.CL

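TL;DR: ARTER is an efficient entity linking method that combines candidate generation, context-based scoring, adaptive routing, and selective reasoning, improving performance while reducing reliance on large language models (LLMs).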

Details Motivation: Traditional entity linking depends on large annotated datasets and extensive fine-tuning, while existing few-shot methods are computationally expensive and inefficient because they rely heavily on LLM reasoning. Method: ARTER computes complementary embedding-based and LLM-based signals over the retrieved candidates, splits mentions into easy and hard cases, and routes them to a lightweight linker and to targeted LLM reasoning, respectively. Result: On standard benchmarks, ARTER improves over ReFinED by up to +4.47%, with an average gain of +2.53% on 5 out of 6 datasets, and uses half as many LLM tokens as pipelines that apply LLM reasoning to every mention. Conclusion: ARTER maintains high performance at much lower computational cost, offering a scalable framework for efficient entity linking. Abstract: Entity Linking (EL) has traditionally relied on large annotated datasets and extensive model fine-tuning. While recent few-shot methods leverage large language models (LLMs) through prompting to reduce training requirements, they often suffer from inefficiencies due to expensive LLM-based reasoning. ARTER (Adaptive Routing and Targeted Entity Reasoning) presents a structured pipeline that achieves high performance without deep fine-tuning by strategically combining candidate generation, context-based scoring, adaptive routing, and selective reasoning. ARTER computes a small set of complementary signals (both embedding and LLM-based) over the retrieved candidates to categorize contextual mentions into easy and hard cases. The cases are then handled by a low-computational entity linker (e.g. ReFinED) and more expensive targeted LLM-based reasoning respectively. On standard benchmarks, ARTER outperforms ReFinED by up to +4.47%, with an average gain of +2.53% on 5 out of 6 datasets, and performs comparably to pipelines using LLM-based reasoning for all mentions, while being twice as efficient in terms of the number of LLM tokens.
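The easy/hard routing idea in the ARTER abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which signals or thresholds ARTER uses, so the `Mention` class, the single score-margin signal, and the `margin_threshold` value are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    """A mention with hypothetical relevance scores for its retrieved candidates."""
    text: str
    candidate_scores: list[float]


def route(mention: Mention, margin_threshold: float = 0.2) -> str:
    """Label a mention 'easy' when the top candidate clearly beats the runner-up.

    Assumption: a single score margin stands in for the set of
    complementary signals mentioned in the abstract.
    """
    scores = sorted(mention.candidate_scores, reverse=True)
    if not scores:
        return "hard"  # nothing retrieved, so defer to the expensive path
    if len(scores) == 1:
        return "easy"  # only one candidate, nothing to disambiguate
    margin = scores[0] - scores[1]
    return "easy" if margin >= margin_threshold else "hard"
```

In such a design, "easy" mentions would go to a cheap linker and only "hard" ones to LLM reasoning, which is where the token savings would come from.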

[21] BoundRL: Efficient Structured Text Segmentation through Reinforced Boundary Generation

Haoyuan Li,Zhengyuan Shen,Sullam Jeoung,Yueyan Chen,Jiayu Li,Qi Zhu,Shuai Wang,Vassilis Ioannidis,Huzefa Rangwala

Main category: cs.CL

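TL;DR: Proposes BoundRL, an efficient method for segmenting long structured texts and predicting segment labels, which uses reinforcement learning with verifiable rewards to optimize reconstruction fidelity and semantic alignment.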

Details Motivation: Conventional text segmentation methods struggle with complex structured texts that contain non-linguistic elements such as tables, code snippets, and placeholders, so a more effective segmentation approach is needed. Method: BoundRL performs token-level segmentation and label prediction, generating only the starting tokens of each segment and locating the content in the original text. It uses reinforcement learning with verifiable rewards (RLVR) to jointly optimize reconstruction fidelity and semantic alignment, and builds intermediate candidates by perturbation to mitigate entropy collapse. Result: A small 1.7B-parameter language model trained with BoundRL outperforms few-shot prompting of much larger models; RLVR significantly improves over supervised fine-tuning, and adding intermediate candidates further improves generalization. Conclusion: BoundRL is efficient and scalable, handles complex structured texts well, and improves segmentation accuracy and generalization while lowering inference cost. Abstract: As structured texts become increasingly complex across diverse domains, from technical reports to generative AI prompts, the need for text segmentation into semantically meaningful components becomes critical. Such texts often contain elements beyond plain language, including tables, code snippets, and placeholders, which conventional sentence- or paragraph-level segmentation methods cannot handle effectively. To address this challenge, we propose BoundRL, a novel and efficient approach that jointly performs token-level text segmentation and label prediction for long structured texts. Instead of generating complete contents for each segment, it generates only a sequence of starting tokens and reconstructs the complete contents by locating these tokens within the original texts, thereby reducing inference costs by orders of magnitude and minimizing hallucination. To adapt the model for the output format, BoundRL performs reinforcement learning with verifiable rewards (RLVR) with a specifically designed reward that jointly optimizes document reconstruction fidelity and semantic alignment. To mitigate entropy collapse, it further constructs intermediate candidates by systematically perturbing a fraction of generated sequences of segments to create stepping stones toward higher-quality solutions. To demonstrate BoundRL's effectiveness on particularly challenging structured texts, we focus evaluation on complex prompts used for LLM applications. Experiments show that BoundRL enables small language models (1.7B parameters) to outperform few-shot prompting of much larger models. Moreover, RLVR with our designed reward yields significant improvements over supervised fine-tuning, and incorporating intermediate candidates further improves both performance and generalization.
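The reconstruction step described in the BoundRL abstract (generate only segment starts, then recover full segments from the original text) might look roughly like the sketch below. It is a simplification under assumptions: it matches character snippets with `str.find` rather than model tokens, and the function name and the `(label, start_snippet)` input format are hypothetical, not taken from the paper.

```python
def reconstruct_segments(text: str, boundaries: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Rebuild labeled segments from their starting snippets.

    `boundaries` holds (label, start_snippet) pairs in document order.
    Each segment runs from its snippet's position to the next segment's
    start (or to the end of the text for the last one).
    """
    positions = []
    cursor = 0
    for label, snippet in boundaries:
        idx = text.find(snippet, cursor)  # search only after the previous start
        if idx == -1:
            raise ValueError(f"start snippet not found: {snippet!r}")
        positions.append((label, idx))
        cursor = idx + len(snippet)

    segments = []
    for i, (label, start) in enumerate(positions):
        end = positions[i + 1][1] if i + 1 < len(positions) else len(text)
        segments.append((label, text[start:end]))
    return segments
```

Because every segment is a slice of the original text, the output cannot contain content that is absent from the source, and when the first segment starts at position 0 the slices concatenate back to the full document. Any text before the first located snippet is dropped in this sketch.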

[22] Are Stereotypes Leading LLMs' Zero-Shot Stance Detection ?

Anthony Dubreuil,Antoine Gourru,Christine Largeron,Amine Trabelsi

Main category: cs.CL

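TL;DR: This paper studies bias in large language models on zero-shot stance detection and finds that models show significant stereotypes tied to attributes such as text complexity and group-specific dialect.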

Details Motivation: Large language models inherit social biases from their pretraining data, yet bias evaluation in stance detection has received little attention, even though the task often touches on sensitive matters such as political leanings. Method: Posts in existing stance detection datasets are automatically annotated with text attributes (group-specific dialect and text complexity) to analyze how these attributes influence LLM stance decisions in a zero-shot setting. Result: LLMs show significant bias, for example incorrectly associating pro-marijuana views with low text complexity and African American dialect with opposition to Donald Trump. Conclusion: LLMs exhibit clear social stereotypes in stance detection, and further work is needed to reduce unfair bias toward particular groups. Abstract: Large Language Models inherit stereotypes from their pretraining data, leading to biased behavior toward certain social groups in many Natural Language Processing tasks, such as hateful speech detection or sentiment analysis. Surprisingly, the evaluation of this kind of bias in stance detection methods has been largely overlooked by the community. Stance Detection involves labeling a statement as being against, in favor, or neutral towards a specific target and is among the most sensitive NLP tasks, as it often relates to political leanings. In this paper, we focus on the bias of Large Language Models when performing stance detection in a zero-shot setting. We automatically annotate posts in pre-existing stance detection datasets with two attributes: dialect or vernacular of a specific group and text complexity/readability, to investigate whether these attributes influence the model's stance detection decisions. Our results show that LLMs exhibit significant stereotypes in stance detection tasks, such as incorrectly associating pro-marijuana views with low text complexity and African American dialect with opposition to Donald Trump.

[23] DeepWideSearch: Benchmarking Depth and Width in Agentic Information Seeking

Tian Lan,Bin Zhu,Qianghuai Jia,Junyang Ren,Haijun Li,Longyue Wang,Zhao Xu,Weihua Luo,Kaifu Zhang

Main category: cs.CL

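TL;DR: This paper introduces DeepWideSearch, the first benchmark designed to evaluate agents that must combine depth (multi-hop reasoning) and width (large-scale information collection) in information seeking. State-of-the-art agents reach only a 2.39% average success rate, exposing a serious weakness in integrating deep and wide search; the paper also analyzes four failure modes.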

Details Motivation: Current search agents cannot perform deep reasoning (multi-hop retrieval) and wide information collection at the same time, which limits their use in real-world scenarios such as market analysis and business development; a new benchmark is needed to drive research. Method: The DeepWideSearch benchmark is built by converting existing datasets into 220 questions spanning 15 domains, each requiring multi-hop reasoning over a large volume of data. Result: Even state-of-the-art agents achieve only a 2.39% average success rate on DeepWideSearch. Error analysis reveals four failure modes: lack of reflection, overreliance on internal knowledge, insufficient retrieval, and context overflow. Conclusion: DeepWideSearch provides a new standard for evaluating information-seeking agents that combine depth and width, exposes major limitations of current techniques, and should encourage the development of more capable and robust search agents. Abstract: Current search agents fundamentally lack the ability to simultaneously perform deep reasoning over multi-hop retrieval and wide-scale information collection, a critical deficiency for real-world applications like comprehensive market analysis and business development. To bridge this gap, we introduce DeepWideSearch, the first benchmark explicitly designed to evaluate agents' ability to integrate depth and width in information seeking. In DeepWideSearch, agents must process a large volume of data, each requiring deep reasoning over multi-hop retrieval paths. Specifically, we propose two methods to convert established datasets, resulting in a curated collection of 220 questions spanning 15 diverse domains. Extensive experiments demonstrate that even state-of-the-art agents achieve only 2.39% average success rate on DeepWideSearch, highlighting the substantial challenge of integrating depth and width search in information-seeking tasks. Furthermore, our error analysis reveals four failure modes: lack of reflection, overreliance on internal knowledge, insufficient retrieval, and context overflow, exposing key limitations in current agent architectures. We publicly release DeepWideSearch to catalyze future research on more capable and robust information-seeking agents.

[24] Mixture-of-Minds: Multi-Agent Reinforcement Learning for Table Understanding

Yuhang Zhou,Mingrui Zhang,Ke Li,Mingyi Wang,Qiao Liu,Qifei wang,Jiayi Liu,Fei Liu,Serena Li,Weiwi Li,Mingze Gao,Abhishek Kumar,Xiangjun Fan,Zhuokai Zhao,Lizhu Zhang

Main category: cs.CL

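TL;DR: Proposes Mixture-of-Minds, a table reasoning framework that combines a multi-agent division of labor with reinforcement learning and performs strongly on TableBench.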

Details Motivation: Existing table reasoning methods have complementary weaknesses in language understanding and precise computation, so approaches that integrate robust reasoning with reliable table processing are needed. Method: A multi-agent framework with three roles (planning, coding, and answering) is combined with code execution and a self-improvement training scheme based on MCTS. Result: The method reaches 62.13% on TableBench, surpassing OpenAI-o4-mini-high. Conclusion: Structured multi-agent workflows combined with reinforcement learning can effectively improve table understanding. Abstract: Understanding and reasoning over tables is a critical capability for many real-world applications. Large language models (LLMs) have shown promise on this task, but current approaches remain limited. Fine-tuning based methods strengthen language reasoning; yet they are prone to arithmetic errors and hallucination. In contrast, tool-based methods enable precise table manipulation but rely on rigid schemas and lack semantic understanding. These complementary drawbacks highlight the need for approaches that integrate robust reasoning with reliable table processing. In this work, we propose Mixture-of-Minds, a multi-agent framework that decomposes table reasoning into three specialized roles: planning, coding, and answering. This design enables each agent to focus on a specific aspect of the task while leveraging code execution for precise table manipulation. Building on this workflow, we introduce a self-improvement training framework that employs Monte Carlo Tree Search (MCTS) rollouts to generate pseudo-gold trajectories and optimize agents with reinforcement learning (RL). Extensive experiments show that Mixture-of-Minds delivers substantial gains, reaching 62.13% on TableBench and surpassing OpenAI-o4-mini-high. These results demonstrate the promise of combining structured multi-agent workflows with RL to advance table understanding.

[25] Stuck in the Matrix: Probing Spatial Reasoning in Large Language Models

Maggie Bai,Ava Kim Cohen,Eleanor Koss,Charlie Lichtenbaum

Main category: cs.CL

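TL;DR: This paper studies the spatial reasoning ability of large language models (LLMs) over textual input, using five tasks to assess spatial understanding and multi-step problem solving in grid environments. Models do reasonably well on small tasks, but performance drops sharply as complexity grows, with an average accuracy loss of 42.7% and losses as high as 84%, revealing the limits of LLMs' spatial representations.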

Details Motivation: To investigate whether LLMs have abstract spatial reasoning abilities beyond language understanding, especially in structured environments, and to expose potential deficits in their spatial cognition. Method: Five grid-based tasks (quadrant identification, geometric transformations, distance evaluation, word searches, and tile sliding) are scaled up in grid size to test spatial reasoning at different levels of complexity. Result: Models show moderate success on small, low-complexity tasks, but performance drops sharply as scale increases: the average accuracy loss is 42.7%, the largest loss is 84%, and every test that began above 50% accuracy lost at least 48%. Conclusion: LLMs lack robust spatial representations; their spatial reasoning degrades markedly as problem size grows, pointing to a gap between linguistic and spatial reasoning that future work and benchmarks should address. Abstract: This paper explores the spatial reasoning capability of large language models (LLMs) over textual input through a suite of five tasks aimed at probing their spatial understanding and computational abilities. The models were tested on both fundamental spatial reasoning and multi-step problem-solving within structured grid-based environments using tasks such as quadrant identification, geometric transformations, distance evaluation, word searches, and tile sliding. Each task was scaled in complexity through increasing grid dimensions, requiring models to extend beyond simple pattern recognition into abstract spatial reasoning. Our results reveal that while LLMs demonstrate moderate success in all tasks with small complexity and size, performance drops off rapidly as scale increases, with an average loss in accuracy of 42.7%, and reaching as high as 84%. Every test that began with over 50% accuracy showed a loss of at least 48%, illustrating the consistent nature of the deterioration. Furthermore, their struggles with scaling complexity hint at a lack of robust spatial representations in their underlying architectures. This paper underscores the gap between linguistic and spatial reasoning in LLMs, offering insights into their current limitations, and laying the groundwork for future integrative benchmarks at the intersection of language and geometry.

[26] Decoding-Free Sampling Strategies for LLM Marginalization

David Pohl,Marco Cognetta,Junyoung Lee,Naoaki Okazaki

Main category: cs.CL

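TL;DR: This paper studies language model evaluation under subword tokenization and proposes decoding-free sampling strategies for approximate marginalization, greatly reducing computational cost while keeping the estimates reasonably accurate.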

Details Motivation: Because a text can be tokenized into subwords in many ways, evaluating only the probability of a single tokenization ignores other equivalent representations; marginalizing over the probability mass of all possible tokenizations gives a more accurate evaluation. Method: The paper proposes and studies decoding-free sampling strategies, which require no generation from the language model and rely only on cheap, model- and tokenizer-agnostic sampling to approximate the marginal probability, and evaluates their accuracy and speed on several open models. Result: Decoding-free sampling strategies provide sufficiently accurate marginal estimates at a small fraction of the runtime cost and are applied to downstream inference tasks. Conclusion: Decoding-free sampling is an efficient and practical approach to marginalization-based evaluation of language models, improving evaluation efficiency and scalability. Abstract: Modern language models operate on subword-tokenized text in order to make a trade-off between model size, inference speed, and vocabulary coverage. A side effect of this is that, during inference, models are evaluated by measuring the probability of only the specific tokenization produced as the output, despite there being many possible ways to represent the same text with a subword vocabulary. Recent studies have argued instead for evaluating LLMs by marginalization - the probability mass of all tokenizations of a given text. Marginalization is difficult due to the number of possible tokenizations of a text, so often approximate marginalization is done via sampling. However, a downside of sampling is that an expensive generation step must be performed by the LLM for each sample, which limits the number of samples that can be acquired given a runtime budget, and therefore also the accuracy of the approximation. Since computing the probability of a sequence given the tokenization is relatively cheap compared to actually generating it, we investigate sampling strategies that are decoding-free - they require no generation from the LLM, instead relying entirely on extremely cheap sampling strategies that are model and tokenizer agnostic. We investigate the approximation quality and speed of decoding-free sampling strategies for a number of open models to find that they provide sufficiently accurate marginal estimates at a small fraction of the runtime cost and demonstrate its use on a set of downstream inference tasks.
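To make the quantity in the abstract above concrete, the sketch below computes the exact marginal over all tokenizations of a short string under a toy vocabulary. It is not the paper's sampling method: it enumerates every tokenization, which is only feasible for tiny inputs, and `log_prob_fn` is a hypothetical stand-in for scoring a tokenization with a language model. Sampling strategies replace the enumeration with a subset of tokenizations when the full set is too large.

```python
import math
from functools import lru_cache
from typing import Callable


def tokenizations(text: str, vocab: set[str]) -> list[tuple[str, ...]]:
    """Enumerate every segmentation of `text` into pieces found in `vocab`."""

    @lru_cache(maxsize=None)
    def helper(start: int) -> tuple[tuple[str, ...], ...]:
        if start == len(text):
            return ((),)  # one way to tokenize the empty suffix
        results = []
        for end in range(start + 1, len(text) + 1):
            piece = text[start:end]
            if piece in vocab:
                for rest in helper(end):
                    results.append((piece,) + rest)
        return tuple(results)

    return list(helper(0))


def marginal_log_prob(
    text: str,
    vocab: set[str],
    log_prob_fn: Callable[[tuple[str, ...]], float],
) -> float:
    """Log of the summed probability of all tokenizations (log-sum-exp)."""
    logs = [log_prob_fn(t) for t in tokenizations(text, vocab)]
    if not logs:
        return -math.inf  # the text cannot be tokenized with this vocabulary
    peak = max(logs)
    return peak + math.log(sum(math.exp(value - peak) for value in logs))
```

The number of tokenizations grows quickly with text length, which is why enumeration does not scale and approximate marginalization by sampling is used in practice.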

[27] Tri-Modal Severity Fused Diagnosis across Depression and Post-traumatic Stress Disorders

Filippo Cenacchi,Deborah Richards,Longbing Cao

Main category: cs.CL

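TL;DR: Proposes a unified tri-modal affective severity framework that assesses the severity of both depression and post-traumatic stress disorder (PTSD). It fuses text, audio, and facial features through a calibrated late fusion classifier to produce graded cross-disorder diagnoses with interpretability support.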

Details Motivation: Depression and PTSD often co-occur with intertwined symptoms, which binary, single-disorder automated assessment handles poorly; clinicians need cross-disorder tools that provide severity grading and explanations. Method: The framework synchronizes and fuses interview text (sentence-level transformer embeddings), audio (log-Mel statistics with deltas), and facial signals (action units, gaze, head and pose descriptors), and uses a calibrated late fusion classifier to output per-disorder severity probabilities and feature-level attributions. Result: In stratified cross-validation on DAIC-derived corpora, the fused model matches the strongest unimodal baseline on accuracy and weighted F1 while improving decision curve utility and robustness under noisy or missing modalities. For PTSD, fusion reduces regression error and improves class concordance. Errors cluster between adjacent severity levels, and ablations show that text contributes most to depression severity while audio and facial cues are critical for PTSD. Conclusion: The tri-modal fusion approach supports reproducible evaluation and explainable clinical decision making, offering a reliable and robust automated tool for grading the severity of comorbid depression and PTSD. Abstract: Depression and post traumatic stress disorder (PTSD) often co-occur with connected symptoms, complicating automated assessment, which is often binary and disorder specific. Clinically useful diagnosis needs severity aware cross disorder estimates and decision support explanations. Our unified tri modal affective severity framework synchronizes and fuses interview text with sentence level transformer embeddings, audio with log Mel statistics with deltas, and facial signals with action units, gaze, head and pose descriptors to output graded severities for diagnosing both depression (PHQ-8; 5 classes) and PTSD (3 classes). Standardized features are fused via a calibrated late fusion classifier, yielding per disorder probabilities and feature-level attributions. This severity aware tri-modal affective fusion approach is demoed on multi disorder concurrent depression and PTSD assessment. Stratified cross validation on DAIC derived corpora outperforms unimodal/ablation baselines. The fused model matches the strongest unimodal baseline on accuracy and weighted F1, while improving decision curve utility and robustness under noisy or missing modalities. For PTSD specifically, fusion reduces regression error and improves class concordance. Errors cluster between adjacent severities; extreme classes are identified reliably. Ablations show text contributes most to depression severity, audio and facial cues are critical for PTSD, whereas attributions align with linguistic and behavioral markers. Our approach offers reproducible evaluation and clinician in the loop support for affective clinical decision making.

[28] Context-level Language Modeling by Learning Predictive Context Embeddings

Beiya Dai,Yuliang Liu,Daozheng Xue,Qipeng Guo,Kai Chen,Xinbing Wang

Main category: cs.CL

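TL;DR: Proposes ContextLM, a framework that augments language model pretraining with a next-context prediction objective, improving long-range coherence and downstream task performance.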

Details Motivation: Standard next-token prediction limits a model's capacity to capture higher-level semantic structure and long-range contextual relationships. Method: A next-context prediction objective is added to standard pretraining so that the model learns predictive representations of multi-token contexts. Result: Experiments on the GPT2 and Pythia model families show consistent improvements in perplexity and downstream task performance with little computational overhead. Conclusion: Next-context prediction offers a scalable and efficient path to stronger language modeling, yielding better long-range coherence and more effective attention allocation. Abstract: Next-token prediction (NTP) is the cornerstone of modern large language models (LLMs) pretraining, driving their unprecedented capabilities in text generation, reasoning, and instruction following. However, the token-level prediction limits the model's capacity to capture higher-level semantic structures and long-range contextual relationships. To overcome this limitation, we introduce ContextLM, a framework that augments standard pretraining with an inherent next-context prediction objective. This mechanism trains the model to learn predictive representations of multi-token contexts, leveraging error signals derived from future token chunks. Crucially, ContextLM achieves this enhancement while remaining fully compatible with the standard autoregressive, token-by-token evaluation paradigm (e.g., perplexity). Extensive experiments on the GPT2 and Pythia model families, scaled up to 1.5B parameters, show that ContextLM delivers consistent improvements in both perplexity and downstream task performance. Our analysis indicates that next-context prediction provides a scalable and efficient pathway to stronger language modeling, yielding better long-range coherence and more effective attention allocation with minimal computational overhead.

[29] Citation Failure: Definition, Analysis and Efficient Mitigation

Jan Buchmann,Iryna Gurevych

Main category: cs.CL

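TL;DR: This paper introduces the CITECONTROL benchmark and the CITENTION framework to study and mitigate citation failure in large language models, that is, cases where the response is correct but the cited evidence is incomplete.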

Details Motivation: LLM-based RAG systems suffer from citation failure, where the response is correct but the evidence is not fully cited. The paper aims to separate citation failure from response failure and to study and address it specifically. Method: A two-step approach: first, the CITECONTROL benchmark is used to analyze when citation failure occurs; second, the CITENTION framework combines generative, attention-based, and retrieval-based methods to improve citation. Result: Citation failures increase as the relation between response and evidence becomes more complex; CITENTION substantially improves citation quality on CITECONTROL and in transfer settings. Conclusion: Systematic analysis combined with multiple citation methods can mitigate citation omissions by LLMs and make responses easier to verify. Abstract: Citations from LLM-based RAG systems are supposed to simplify response verification. However, this does not hold for citation failure, when a model generates a helpful response, but fails to cite complete evidence. In contrast to previous work, we propose to disentangle this from response failure, where the response itself is flawed, and citing complete evidence is impossible. To address citation failure, this work follows a two-step approach: (1) We study when citation failure occurs and (2) how it can be mitigated. For step 1, we extend prior work by investigating how the relation between response and evidence affects citation quality. We introduce CITECONTROL, a benchmark that systematically varies this relation to analyze failure modes. Experiments show that failures increase with relational complexity and suggest that combining citation methods could improve performance, motivating step 2. To improve LLM citation efficiently, we propose CITENTION, a framework integrating generative, attention-based, and retrieval-based methods. Results demonstrate substantial citation improvements on CITECONTROL and in transfer settings. We make our data and code publicly available.

[30] Exploring Generative Process Reward Modeling for Semi-Structured Data: A Case Study of Table Question Answering

Lei Tang,Wei Zhou,Mohsen Mesgar

Main category: cs.CL

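TL;DR: This is the first systematic study of process reward models (PRMs) for table question answering (TQA). PRMs that combine textual and code verification help with solution selection but generalize poorly to out-of-domain data, and step-level verification correlates only weakly with answer accuracy.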

Details Motivation: To explore whether PRMs apply to tasks involving semi-structured data such as TQA, which pose challenges including abundant irrelevant information, loosely connected reasoning steps, and domain-specific reasoning. Method: State-of-the-art generative PRMs are evaluated on TQA from both the answer and the step perspective, and their performance under different verification modes is analyzed. Result: PRMs that combine textual and code verification can aid solution selection but perform poorly on out-of-domain data; step-level verification scores correlate weakly with final answer accuracy, possibly because of weak dependencies between reasoning steps. Conclusion: Current PRMs are limited in generalization and in modeling step dependencies on TQA, and more robust, process-aware verifiers are needed. Abstract: Process reward models (PRMs) improve complex reasoning in large language models (LLMs) by grading candidate solutions step-by-step and selecting answers via aggregated step scores. While effective in domains such as mathematics, their applicability to tasks involving semi-structured data, like table question answering (TQA), remains unexplored. TQA poses unique challenges for PRMs, including abundant irrelevant information, loosely connected reasoning steps, and domain-specific reasoning. This work presents the first systematic study of PRMs for TQA. We evaluate state-of-the-art generative PRMs on TQA from both answer and step perspectives. Results show that PRMs that combine textual and code verification can aid solution selection but struggle to generalize to out-of-domain data. Analysis reveals a weak correlation between performance in step-level verification and answer accuracy, possibly stemming from weak step dependencies and loose causal links. Our findings highlight limitations of current PRMs on TQA and offer valuable insights for building more robust, process-aware verifiers.

[31] Teaching Language Models to Reason with Tools

Chengpeng Li,Zhengyang Tang,Ziniu Li,Mingfeng Xue,Keqin Bao,Tian Ding,Ruoyu Sun,Benyou Wang,Xiang Wang,Junyang Lin,Dayiheng Liu

Main category: cs.CL

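TL;DR: This paper proposes CoRT (Code-Optimized Reasoning Training), a post-training framework that teaches large reasoning models (LRMs) to use code interpreters (CIs) on mathematical tasks. It generates high-quality code-integrated reasoning data through Hint-Engineering and uses rejection sampling and reinforcement learning to refine the interleaving of internal reasoning and external CI use, improving both accuracy and efficiency on several mathematical datasets.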

Details Motivation: Large reasoning models are strong at natural language reasoning but are often inaccurate or inefficient on complex mathematical operations. Code interpreters help with computation, but the model's internal probabilistic reasoning conflicts with the external deterministic knowledge the CI provides, leading to unproductive deliberation. Method: CoRT includes Hint-Engineering, a new data synthesis strategy that injects diverse hints at key points in reasoning paths to generate high-quality, code-integrated reasoning data tailored to LRM-CI interaction. Thirty high-quality samples are used for supervised fine-tuning of models from 1.5B to 32B parameters, and rejection sampling and reinforcement learning further refine the multi-round interleaving of CI use and internal thinking. Result: On five challenging mathematical reasoning datasets, CoRT yields absolute improvements of 4% and 8% for DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Qwen-1.5B, respectively, and reduces token usage by about 30% for the 32B model and about 50% for the 1.5B model. Conclusion: CoRT resolves the reasoning conflict between large reasoning models and external computational tools, improving the accuracy and efficiency of mathematical reasoning and offering a practical path toward more efficient and reliable hybrid reasoning systems. Abstract: Large reasoning models (LRMs) like OpenAI-o1 have shown impressive capabilities in natural language reasoning. However, these models frequently demonstrate inefficiencies or inaccuracies when tackling complex mathematical operations. While integrating computational tools such as Code Interpreters (CIs) offers a promising solution, it introduces a critical challenge: a conflict between the model's internal, probabilistic reasoning and the external, deterministic knowledge provided by the CI, which often leads models to unproductive deliberation. To overcome this, we introduce CoRT (Code-Optimized Reasoning Training), a post-training framework designed to teach LRMs to effectively utilize CIs. We propose Hint-Engineering, a new data synthesis strategy that strategically injects diverse hints at optimal points within reasoning paths. This approach generates high-quality, code-integrated reasoning data specifically tailored to optimize LRM-CI interaction. Using this method, we have synthesized 30 high-quality samples to post-train models ranging from 1.5B to 32B parameters through supervised fine-tuning. CoRT further refines the multi-round interleaving of external CI usage and internal thinking by employing rejection sampling and reinforcement learning. Our experimental evaluations demonstrate CoRT's effectiveness, yielding absolute improvements of 4% and 8% on DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Qwen-1.5B, respectively, across five challenging mathematical reasoning datasets. Moreover, CoRT significantly enhances efficiency, reducing token usage by approximately 30% for the 32B model and 50% for the 1.5B model compared to pure natural language reasoning baselines. The models and code are available at: https://github.com/ChengpengLi1003/CoRT.

[32] Evaluating Latent Knowledge of Public Tabular Datasets in Large Language Models

Matteo Silvestri,Flavio Giorgi,Fabrizio Silvestri,Gabriele Tolomei

Main category: cs.CL

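TL;DR: The study shows that LLM performance on tabular reasoning tasks may stem from memorization of datasets with strong semantic cues rather than from genuine generalization.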

Details Motivation: Evaluations of LLM reasoning over structured data often overlook a key confound, dataset contamination; this paper investigates the issue. Method: Controlled probing experiments compare LLM performance on tabular data with and without semantic cues such as column names and interpretable category values. Result: Models perform well when datasets contain clear semantic cues; once the cues are removed or randomized, performance drops sharply to near-random levels. Conclusion: LLM success on common tabular benchmarks is partly due to memorization of public data rather than genuine reasoning, so future evaluations should separate semantic leakage from true reasoning ability. Abstract: Large Language Models (LLMs) are increasingly evaluated on their ability to reason over structured data, yet such assessments often overlook a crucial confound: dataset contamination. In this work, we investigate whether LLMs exhibit prior knowledge of widely used tabular benchmarks such as Adult Income, Titanic, and others. Through a series of controlled probing experiments, we reveal that contamination effects emerge exclusively for datasets containing strong semantic cues, for instance, meaningful column names or interpretable value categories. In contrast, when such cues are removed or randomized, performance sharply declines to near-random levels. These findings suggest that LLMs' apparent competence on tabular reasoning tasks may, in part, reflect memorization of publicly available datasets rather than genuine generalization. We discuss implications for evaluation protocols and propose strategies to disentangle semantic leakage from authentic reasoning ability in future LLM assessments.

[33] FreeChunker: A Cross-Granularity Chunking Framework

Wenxuan Zhang,Yuan-Hao Jiang,Yonghe Wu

Main category: cs.CL

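TL;DR: This paper presents FreeChunker, a cross-granularity encoding framework that treats sentences as atomic units and replaces static chunking with flexible retrieval over arbitrary sentence combinations, improving retrieval efficiency and adaptability to complex queries.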

Details Motivation: Existing fixed-granularity chunking methods rely on static boundary identification, adapt poorly to diverse query requirements, and limit the performance of RAG systems. Method: FreeChunker treats sentences as atomic units and uses a cross-granularity encoding framework to support flexible retrieval over arbitrary sentence combinations, avoiding explicit semantic boundary detection. Result: Experiments on LongBench V2 show that FreeChunker achieves better retrieval performance than traditional chunking methods and is significantly more computationally efficient than existing approaches. Conclusion: By shifting the chunking paradigm, FreeChunker improves both retrieval quality and computational efficiency in RAG systems and is more flexible and adaptable. Abstract: Chunking strategies significantly impact the effectiveness of Retrieval-Augmented Generation (RAG) systems. Existing methods operate within fixed-granularity paradigms that rely on static boundary identification, limiting their adaptability to diverse query requirements. This paper presents FreeChunker, a Cross-Granularity Encoding Framework that fundamentally transforms the traditional chunking paradigm: the framework treats sentences as atomic units and shifts from static chunk segmentation to flexible retrieval supporting arbitrary sentence combinations. This paradigm shift not only significantly reduces the computational overhead required for semantic boundary detection but also enhances adaptability to complex queries. Experimental evaluation on LongBench V2 demonstrates that FreeChunker achieves superior retrieval performance compared to traditional chunking methods, while significantly outperforming existing approaches in computational efficiency.

[34] Dialogue Is Not Enough to Make a Communicative BabyLM (But Neither Is Developmentally Inspired Reinforcement Learning)

Francesca Padovani,Bastian Bunzeck,Manar Ali,Omar Momen,Arianna Bisazza,Hendrik Buschmeier,Sina Zarrieß

Main category: cs.CL

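TL;DR: The study examines small language models pre-trained only on dialogue data and applies several fine-tuning strategies to make their generations more communicative, finding that DPO fine-tuning improves performance on a custom dialogue benchmark.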

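Details Motivation: To explore whether pre-training exclusively on dialogue data yields formally and functionally apt language models, and to improve their dialogue generation. Method: The llamalogue model is pre-trained on dialogue data, fine-tuned with several strategies including PPO and DPO, and evaluated on standard BabyLM benchmarks and a custom dialogue benchmark. Result: The models underperform on most standard BabyLM benchmarks but excel at dialogue continuation prediction in a minimal pair setting; DPO fine-tuning further improves performance on the custom dialogue benchmark, while PPO fine-tuning has mixed to adversarial effects. Conclusion: Dialogue-only pre-training helps on specific dialogue tasks, and DPO is a more effective fine-tuning method than PPO in this setting. Abstract: We investigate whether pre-training exclusively on dialogue data results in formally and functionally apt small language models. Based on this pre-trained llamalogue model, we employ a variety of fine-tuning strategies to enforce "more communicative" text generations by our models. Although our models underperform on most standard BabyLM benchmarks, they excel at dialogue continuation prediction in a minimal pair setting. While PPO fine-tuning has mixed to adversarial effects on our models, DPO fine-tuning further improves their performance on our custom dialogue benchmark.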

[35] The Impact of Negated Text on Hallucination with Large Language Models

Jaehyung Seo,Hyeonseok Moon,Heuiseok Lim

Main category: cs.CL

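TL;DR: This paper studies how well large language models (LLMs) detect hallucinations in negated text, finding that LLMs struggle to identify hallucinations in negated contexts and produce logically inconsistent judgments.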

Details Motivation: The impact of negated text on LLM hallucination remains largely unexplored; the paper sets out to answer three key research questions. Method: The authors build the NegHalu dataset by reconstructing existing hallucination detection datasets with negated expressions, and analyze LLM behavior and internal states in negated contexts. Result: Experiments show that LLMs struggle to detect hallucinations in negated text, often producing logically inconsistent or unfaithful judgments, and reveal internal difficulties in processing negated inputs. Conclusion: Negated text substantially affects LLMs' hallucination detection ability, and further research is needed to mitigate its negative effects. Abstract: Recent studies on hallucination in large language models (LLMs) have been actively progressing in natural language processing. However, the impact of negated text on hallucination with LLMs remains largely unexplored. In this paper, we set three important yet unanswered research questions and aim to address them. To derive the answers, we investigate whether LLMs can recognize contextual shifts caused by negation and still reliably distinguish hallucinations comparable to affirmative cases. We also design the NegHalu dataset by reconstructing existing hallucination detection datasets with negated expressions. Our experiments demonstrate that LLMs struggle to detect hallucinations in negated text effectively, often producing logically inconsistent or unfaithful judgments. Moreover, we trace the internal state of LLMs as they process negated inputs at the token level and reveal the challenges of mitigating their unintended effects.

Son T. Luu,Trung Vo,Hiep Nguyen,Khanh Quoc Tran,Kiet Van Nguyen,Vu Tran,Ngan Luu-Thuy Nguyen,Le-Minh Nguyen

Main category: cs.CL

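TL;DR: This paper presents the VLSP 2025 MLQA-TSR shared task, which aims to advance research on Vietnamese multimodal legal text processing, particularly for traffic sign regulation. The task comprises two subtasks, multimodal legal retrieval and multimodal question answering, and provides a benchmark dataset. The best reported results are an F2 score of 64.55% and an accuracy of 86.30%.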

Details Motivation: To advance research on Vietnamese multimodal legal text processing, especially the development of intelligent systems for traffic sign regulation. Method: The organizers set up a shared task with two subtasks, multimodal legal retrieval and multimodal question answering, and provide a benchmark dataset for evaluating models. Result: The best reported results are an F2 score of 64.55% on multimodal legal retrieval and an accuracy of 86.30% on multimodal question answering. Conclusion: VLSP 2025 MLQA-TSR offers a useful benchmark platform for Vietnamese multimodal legal text processing and supports progress in the field. Abstract: This paper presents the VLSP 2025 MLQA-TSR - the multimodal legal question answering on traffic sign regulation shared task at VLSP 2025. VLSP 2025 MLQA-TSR comprises two subtasks: multimodal legal retrieval and multimodal question answering. The goal is to advance research on Vietnamese multimodal legal text processing and to provide a benchmark dataset for building and evaluating intelligent systems in multimodal legal domains, with a focus on traffic sign regulation in Vietnam. The best-reported results on VLSP 2025 MLQA-TSR are an F2 score of 64.55% for multimodal legal retrieval and an accuracy of 86.30% for multimodal question answering.
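The retrieval subtask above is scored with F2, the recall-weighted member of the F-beta family. As a reminder of how that metric combines precision and recall, here is a minimal sketch; the function name `f_beta` and the example values are illustrative assumptions and are not taken from the shared task's evaluation code or results.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # conventional value when nothing relevant was retrieved
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative values only (not results from VLSP 2025 MLQA-TSR).
print(f_beta(0.5, 0.8))            # F2: recall counts more
print(f_beta(0.5, 0.8, beta=1.0))  # F1 for comparison
```

With `beta = 2`, a system that trades some precision for higher recall scores better than it would under F1, which is a common choice for retrieval settings where missing a relevant legal article is costly.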

[37] NeoDictaBERT: Pushing the Frontier of BERT models for Hebrew

Shaltiel Shmidman,Avi Shmidman,Moshe Koppel

Main category: cs.CL

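TL;DR: This paper introduces NeoDictaBERT and NeoDictaBERT-bilingual, two new BERT-style models that use the NeoBERT architecture and focus on Hebrew text. They outperform existing models on a range of benchmarks, with especially strong retrieval results, and have been released to the community.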

Details Motivation: The architecture of existing BERT models is comparatively outdated and does not benefit from recent transformer advances, and performance on specific languages such as Hebrew leaves room for improvement. Method: Using the modern NeoBERT architecture, the authors train two Hebrew-focused BERT-style models: NeoDictaBERT and its bilingual variant NeoDictaBERT-bilingual. Result: NeoDictaBERT outperforms existing models on almost all Hebrew benchmarks; NeoDictaBERT-bilingual stands out on retrieval tasks, surpassing multilingual models of similar size. Conclusion: The NeoDictaBERT models provide strong baselines for Hebrew NLP tasks, show that a modernized architecture is effective for language-specific modeling, and support further research in Hebrew natural language processing. Abstract: Since their initial release, BERT models have demonstrated exceptional performance on a variety of tasks, despite their relatively small size (BERT-base has ~100M parameters). Nevertheless, the architectural choices used in these models are outdated compared to newer transformer-based models such as Llama3 and Qwen3. In recent months, several architectures have been proposed to close this gap. ModernBERT and NeoBERT both show strong improvements on English benchmarks and significantly extend the supported context window. Following their successes, we introduce NeoDictaBERT and NeoDictaBERT-bilingual: BERT-style models trained using the same architecture as NeoBERT, with a dedicated focus on Hebrew texts. These models outperform existing ones on almost all Hebrew benchmarks and provide a strong foundation for downstream tasks. Notably, the NeoDictaBERT-bilingual model shows strong results on retrieval tasks, outperforming other multilingual models of similar size. In this paper, we describe the training process and report results across various benchmarks. We release the models to the community as part of our goal to advance research and development in Hebrew NLP.

[38] Teacher Demonstrations in a BabyLM's Zone of Proximal Development for Contingent Multi-Turn Interaction

Suchir Salhan,Hongyi Gu,Donya Rooein,Diana Galvan-Sosa,Gabrielle Gaudeau,Andrew Caines,Zheng Yuan,Paula Buttery

Main category: cs.CL

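TL;DR: ContingentChat is a teacher-student framework for benchmarking and improving multi-turn contingency in a BabyLM trained on 100M words. After post-training on a new alignment dataset, the BabyLM generates more grammatical and cohesive responses. Experiments show that targeted post-training improves dialogue quality, but contingency remains a challenge for BabyLMs.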

Details Motivation: Multi-turn dialogues between a child and a caregiver are characterized by contingency, that is, prompt, direct, and meaningful exchanges; the authors want language models to do better on this property. Method: They propose the ContingentChat teacher-student framework and post-train a BabyLM on a new alignment dataset to strengthen multi-turn contingency, and also try adaptive teacher decoding strategies. Result: After post-training, the BabyLM's responses are more grammatical and cohesive; adaptive teacher decoding brings only limited additional gains. Conclusion: Targeted post-training improves dialogue quality, but achieving genuine conversational contingency remains a significant challenge for BabyLMs. Abstract: Multi-turn dialogues between a child and a caregiver are characterized by a property called contingency - that is, prompt, direct, and meaningful exchanges between interlocutors. We introduce ContingentChat, a teacher-student framework that benchmarks and improves multi-turn contingency in a BabyLM trained on 100M words. Using a novel alignment dataset for post-training, BabyLM generates responses that are more grammatical and cohesive. Experiments with adaptive teacher decoding strategies show limited additional gains. ContingentChat demonstrates the benefits of targeted post-training for dialogue quality and indicates that contingency remains a challenging goal for BabyLMs.

[39] LM-mixup: Text Data Augmentation via Language Model based Mixup

Zhijie Deng,Zhouan Shen,Ling Li,Yao Zhou,Zhaowei Zhu,Yanji He,Wei Wang,Jiaheng Wei

Main category: cs.CL

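TL;DR: This paper defines the task of Instruction Distillation and, through the MIXTURE dataset and the LM-Mixup method, distills low-quality, redundant instruction data into high-quality instruction pairs, significantly improving the efficiency and performance of LLM instruction tuning.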

Details Motivation: High-quality instruction data is scarce, while low-quality data is usually discarded, causing information loss; existing data augmentation methods struggle to make use of low-quality data, and their evaluation is poorly defined. Method: The authors define the Instruction Distillation task, build the 144K-sample MIXTURE dataset, and design the LM-Mixup framework: supervised fine-tuning on MIXTURE, followed by reinforcement learning with GRPO using three reward signals (quality, semantic alignment, and format compliance). Result: Across multiple benchmarks, fine-tuning only on LM-Mixup's distilled data (about 3% of the full dataset) surpasses full-dataset training and is competitive with state-of-the-art high-quality data selection methods. Conclusion: When properly distilled and augmented, low-quality data is a valuable resource that can significantly improve the efficiency and performance of instruction-tuned LLMs. Abstract: Instruction tuning is crucial for aligning Large Language Models (LLMs), yet the quality of instruction-following data varies significantly. While high-quality data is paramount, it is often scarce; conversely, abundant low-quality data is frequently discarded, leading to substantial information loss. Existing data augmentation methods struggle to augment this low-quality data effectively, and the evaluation of such techniques remains poorly defined. To address this, we formally define the task of Instruction Distillation: distilling multiple low-quality and redundant inputs into high-quality and coherent instruction-output pairs. Specifically, we introduce a comprehensive data construction pipeline to create MIXTURE, a 144K-sample dataset pairing low-quality or semantically redundant imperfect instruction clusters with their high-quality distillations. We then introduce LM-Mixup, which first performs supervised fine-tuning on MIXTURE and then optimizes the model with reinforcement learning. This process uses three complementary reward signals: quality, semantic alignment, and format compliance, via Group Relative Policy Optimization (GRPO). We demonstrate that LM-Mixup effectively augments imperfect datasets: fine-tuning LLMs on its distilled data, which accounts for only about 3% of the entire dataset, not only surpasses full-dataset training but also competes with state-of-the-art high-quality data selection methods across multiple benchmarks. Our work establishes that low-quality data is a valuable resource when properly distilled and augmented with LM-Mixup, significantly enhancing the efficiency and performance of instruction-tuned LLMs.

[40] Systematic Evaluation of Uncertainty Estimation Methods in Large Language Models

Christian Hobelsberger,Theresa Winner,Andreas Nawroth,Oliver Mitevski,Anna-Carolina Haensch

Main category: cs.CL

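TL;DR: This paper systematically evaluates four confidence estimation methods for large language model (LLM) outputs and finds that the hybrid CoCoA approach performs best on calibration and on discriminating correct answers.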

Details Motivation: LLM outputs vary widely in uncertainty and correctness, so practical reliability is hard to guarantee and effective methods for quantifying uncertainty are needed. Method: The four confidence estimation methods VCE, MSP, Sample Consistency, and CoCoA are evaluated experimentally on four question-answering tasks using a state-of-the-art open-source LLM. Result: Each uncertainty metric captures a different facet of model confidence; CoCoA gives the best overall reliability, improving both calibration and discrimination of correct answers. Conclusion: CoCoA is the strongest of the evaluated confidence estimation methods; the study also discusses the trade-offs of each method and gives recommendations for choosing uncertainty measures in LLM applications. Abstract: Large language models (LLMs) produce outputs with varying levels of uncertainty, and, just as often, varying levels of correctness, making their practical reliability far from guaranteed. To quantify this uncertainty, we systematically evaluate four approaches for confidence estimation in LLM outputs: VCE, MSP, Sample Consistency, and CoCoA (Vashurin et al., 2025). For the evaluation of the approaches, we conduct experiments on four question-answering tasks using a state-of-the-art open-source LLM. Our results show that each uncertainty metric captures a different facet of model confidence and that the hybrid CoCoA approach yields the best reliability overall, improving both calibration and discrimination of correct answers. We discuss the trade-offs of each method and provide recommendations for selecting uncertainty measures in LLM applications.
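The abstract names Sample Consistency as one of the four confidence estimators but does not define it. One common reading, sketched below under that assumption, is to sample several answers to the same question and use the share of samples that agree with the majority answer as the confidence. The function name `sample_consistency` and the exact-match normalization are hypothetical choices for illustration, not the paper's implementation.

```python
from collections import Counter


def sample_consistency(answers):
    """Return (majority_answer, agreement_fraction) for sampled answers.

    Assumption: answers are compared by exact match after trimming
    whitespace and lowercasing; real systems often use semantic matching.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    normalized = [a.strip().lower() for a in answers]
    majority, votes = Counter(normalized).most_common(1)[0]
    return majority, votes / len(normalized)


# Four hypothetical samples for the same question.
answer, confidence = sample_consistency(["Paris", "paris ", "Lyon", "Paris"])
print(answer, confidence)
```

A low agreement fraction signals that the model's samples disagree, which can then be compared against correctness labels when assessing calibration.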

[41] Mask and You Shall Receive: Optimizing Masked Language Modeling For Pretraining BabyLMs

Lukas Edman,Alexander Fraser

Main category: cs.CL

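TL;DR: The paper proposes an improved form of Masked Language Modeling (MLM) that adapts masking probabilities to the model's ability to predict each token and combines it with sub-token embeddings, improving performance on (Super)GLUE tasks.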

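Details Motivation: To improve small language models on complex language understanding tasks, in particular their learning efficiency and generalization in low-resource settings. Method: The authors improve MLM by adapting the masking probability of tokens according to how hard they are for the model to predict, and add sub-token embeddings to strengthen morphological generalization. Result: The approach substantially outperforms standard MLM on (Super)GLUE and beats the baseline in the strict-small track of the BabyLM Challenge. Conclusion: The improved MLM strategy and sub-token embeddings raise the language learning ability and downstream performance of small models, indicating their potential for low-resource pre-training. Abstract: We describe our strategy for the 2025 edition of the BabyLM Challenge. Our main contribution is that of an improved form of Masked Language Modeling (MLM), which adapts the probabilities of the tokens masked according to the model's ability to predict them. The results show a substantial increase in performance on (Super)GLUE tasks over the standard MLM. We also incorporate sub-token embeddings, finding that this increases the model's morphological generalization capabilities. Our submission beats the baseline in the strict-small track.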
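The abstract says masking probabilities are adapted to the model's ability to predict each token, without giving the rule. A minimal sketch of one way this could work is shown below: tokens the model predicts poorly are masked more often, while the average masking rate stays near a base rate. The linear rescaling, the 0.15 base rate, and the names `masking_probs` and `sample_mask` are all assumptions for illustration, not the authors' scheme.

```python
import random


def masking_probs(token_accuracy, base_rate=0.15):
    """Map per-token prediction accuracy (0..1) to masking probabilities.

    Assumed rule: probability is proportional to difficulty (1 - accuracy),
    rescaled so the mean stays near base_rate, and capped at 1.0.
    """
    if not token_accuracy:
        return []
    difficulty = [1.0 - a for a in token_accuracy]
    mean_d = sum(difficulty) / len(difficulty)
    if mean_d == 0.0:
        # Every token is already predicted perfectly: fall back to uniform.
        return [base_rate] * len(difficulty)
    return [min(1.0, base_rate * d / mean_d) for d in difficulty]


def sample_mask(probs, rng):
    """Draw an independent mask decision per token."""
    return [rng.random() < p for p in probs]


# Hypothetical running accuracies for three tokens: easy, medium, hard.
probs = masking_probs([0.9, 0.5, 0.1])
print(probs)
print(sample_mask(probs, random.Random(0)))
```

In this sketch the hard token is masked roughly nine times as often as the easy one, which concentrates the training signal on what the model has not yet learned.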

[42] RECALL: REpresentation-aligned Catastrophic-forgetting ALLeviation via Hierarchical Model Merging

Bowen Wang,Haiyuan Wan,Liwen Shi,Chen Yang,Peng He,Yue Ma,Haochen Han,Wenhao Li,Tiao Tan,Yongjian Li,Fangming Liu,Yifan Gong,Sheng Zhang

Main category: cs.CL

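TL;DR: The paper proposes RECALL, a representation-based model merging framework for continual learning without historical data, which uses layer-wise hidden representations to align knowledge across models and fuse parameters adaptively.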

Details Motivation: Existing continual learning methods require task labels or incur performance trade-offs; the goal is to avoid catastrophic forgetting while integrating knowledge from multiple domains. Method: Internal representations of LLMs serve as proxies for learned knowledge; inter-model similarity is computed layer by layer over clustered typical samples, followed by adaptive, hierarchical parameter fusion that preserves domain-general features in shallow layers and task-specific adaptation in deeper layers. Result: Across five NLP tasks and multiple continual learning scenarios, RECALL outperforms baselines in both knowledge retention and generalization, showing strong resistance to forgetting and effective multi-domain integration. Conclusion: RECALL offers a scalable, data-free continual learning solution that supports the evolution of large models in changing environments. Abstract: We unveil that internal representations in large language models (LLMs) serve as reliable proxies of learned knowledge, and propose RECALL, a novel representation-aware model merging framework for continual learning without access to historical data. RECALL computes inter-model similarity from layer-wise hidden representations over clustered typical samples, and performs adaptive, hierarchical parameter fusion to align knowledge across models. This design enables the preservation of domain-general features in shallow layers while allowing task-specific adaptation in deeper layers. Unlike prior methods that require task labels or incur performance trade-offs, RECALL achieves seamless multi-domain integration and strong resistance to catastrophic forgetting. Extensive experiments across five NLP tasks and multiple continual learning scenarios show that RECALL outperforms baselines in both knowledge retention and generalization, providing a scalable and data-free solution for evolving LLMs.
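To make the idea of layer-wise, similarity-driven parameter fusion concrete, the sketch below merges toy models layer by layer, with a separate set of mixing weights per layer. The abstract does not specify how similarities become weights; the softmax used here, the flat-list parameter layout, and the names `softmax` and `merge_layerwise` are assumptions chosen for illustration only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def merge_layerwise(models, similarities):
    """Fuse models layer by layer.

    models: list of models; each model is a list of layers; each layer is a
        flat list of floats (a toy stand-in for a weight tensor).
    similarities[l][k]: hypothetical similarity score for model k at layer l.
    """
    merged = []
    for l, sims in enumerate(similarities):
        weights = softmax(sims)
        width = len(models[0][l])
        layer = [sum(w * m[l][i] for w, m in zip(weights, models))
                 for i in range(width)]
        merged.append(layer)
    return merged


# Two toy models with two layers each; equal scores give a plain average.
model_a = [[0.0, 2.0], [4.0, 4.0]]
model_b = [[2.0, 0.0], [0.0, 8.0]]
print(merge_layerwise([model_a, model_b], [[1.0, 1.0], [0.0, 0.0]]))
```

Because the weights are computed per layer, shallow layers can be averaged almost uniformly while deeper layers lean toward whichever model the similarity scores favor.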

[43] Steering Evaluation-Aware Language Models To Act Like They Are Deployed

Tim Tian Hua,Andrew Qin,Samuel Marks,Neel Nanda

Main category: cs.CL

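TL;DR: This paper proposes using activation steering to suppress evaluation awareness in large language models so that they behave as if deployed, improving the reliability of safety evaluations.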

Details Motivation: LLMs may adjust their behavior to appear more aligned when they detect that they are being evaluated, which undermines the credibility of safety evaluations; a method is needed to reduce the effect of this evaluation awareness. Method: An evaluation-aware model is built with two-step training: continued pretraining on documents describing the model so that it recognizes evaluation cues, then expert iteration to make it use Python type hints in evaluation settings. A steering vector constructed from the original model is then added to the activations to suppress evaluation awareness. Result: The trained model writes type hints more often in evaluation contexts than in deployment contexts, demonstrating evaluation awareness; activation steering suppresses this behavior, making the model act as if deployed even when the evaluation cue is present, and the steering vector works even though it was built from the original model alone. Conclusion: Activation steering can suppress evaluation awareness in LLMs and improve the realism and reliability of safety evaluations, suggesting a new direction for AI evaluation. Abstract: Large language models (LLMs) can sometimes detect when they are being evaluated and adjust their behavior to appear more aligned, compromising the reliability of safety evaluations. In this paper, we show that adding a steering vector to an LLM's activations can suppress evaluation-awareness and make the model act like it is deployed during evaluation. To study our steering technique, we train an LLM to exhibit evaluation-aware behavior using a two-step training process designed to mimic how this behavior could emerge naturally. First, we perform continued pretraining on documents with factual descriptions of the model (1) using Python type hints during evaluation but not during deployment and (2) recognizing that the presence of a certain evaluation cue always means that it is being tested. Then, we train the model with expert iteration to use Python type hints in evaluation settings. The resulting model is evaluation-aware: it writes type hints in evaluation contexts more than deployment contexts. However, this gap can only be observed by removing the evaluation cue. We find that activation steering can suppress evaluation awareness and make the model act like it is deployed even when the cue is present. Importantly, we constructed our steering vector using the original model before our additional training. Our results suggest that AI evaluators could improve the reliability of safety evaluations by steering models to act like they are deployed.
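The core operation described above is adding a steering vector to a model's activations. The sketch below shows that arithmetic on plain Python lists. The abstract does not say how the vector is built; using the difference of mean activations between "deployment" and "evaluation" prompts is a widely used construction and is an assumption here, as are the function names and the scaling factor `alpha`.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def steering_vector(deploy_acts, eval_acts):
    """Assumed construction: mean(deployment) minus mean(evaluation).

    The result points from 'looks like a test' toward 'looks like deployment'.
    """
    mu_deploy = mean_vector(deploy_acts)
    mu_eval = mean_vector(eval_acts)
    return [d - e for d, e in zip(mu_deploy, mu_eval)]


def steer(hidden, vector, alpha=1.0):
    """Add the scaled steering vector to one hidden-state vector."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy 2-D activations standing in for a real model's residual stream.
v = steering_vector([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]])
print(v)
print(steer([0.5, 0.5], v, alpha=2.0))
```

In a real model the same addition would be applied to hidden states at chosen layers during the forward pass; the layer choice and the value of `alpha` would need to be tuned.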

[44] Robust Preference Alignment via Directional Neighborhood Consensus

Ruochen Mao,Yuling Shi,Xiaodong Gu,Jiaheng Wei

Main category: cs.CL

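TL;DR: The paper proposes Robust Preference Selection (RPS), a training-free, post-hoc method that uses directional neighborhood consensus to make LLM alignment more robust to diverse human preferences.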

Details Motivation: Existing alignment methods are unstable when user needs deviate from the dominant preferences in the training data, leaving a preference coverage gap, and they rely on costly retraining with limited generalization. Method: RPS samples multiple responses from a local neighborhood of the user's preference to build a candidate pool and selects the response that best matches the original intent, rather than generating from a single, highly specific preference; no retraining is required. Result: Experiments across three alignment paradigms (DPA, DPO, and SFT) show that RPS achieves win rates of up to 69% against a strong baseline on challenging preferences, consistently improving robustness. Conclusion: RPS is a practical, theoretically grounded method for improving the reliability of preference-aligned models under diverse user needs. Abstract: Aligning large language models with human preferences is critical for creating reliable and controllable AI systems. A human preference can be visualized as a high-dimensional vector where different directions represent trade-offs between desired attributes (e.g., helpfulness vs. verbosity). Yet, because the training data often reflects dominant, average preferences, LLMs tend to perform well on common requests but fall short in specific, individual needs. This mismatch creates a preference coverage gap. Existing methods often address this through costly retraining, which may not generalize to the full spectrum of diverse preferences. This brittleness means that when a user's request reflects a nuanced preference deviating from the training data's central tendency, model performance can degrade unpredictably. To address this challenge, we introduce Robust Preference Selection (RPS), a post-hoc, training-free method that leverages directional neighborhood consensus. Instead of forcing a model to generate a response from a single, highly specific preference, RPS samples multiple responses from a local neighborhood of related preferences to create a superior candidate pool. It then selects the response that best aligns with the user's original intent. We provide a theoretical framework showing our neighborhood generation strategy is provably superior to a strong baseline that also samples multiple candidates. 
Comprehensive experiments across three distinct alignment paradigms (DPA, DPO, and SFT) demonstrate that RPS consistently improves robustness against this baseline, achieving win rates of up to 69% on challenging preferences from under-represented regions of the space without any model retraining. Our work presents a practical, theoretically-grounded solution for enhancing the reliability of preference-aligned models.
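The two steps of RPS described in the abstract, sampling from a neighborhood of preference directions and then selecting against the original preference, can be sketched as follows. This is a toy version under stated assumptions: preferences are 2-D vectors, the neighborhood is an evenly spaced angular fan, and candidates are scored by a dot product with the user's preference. The `generate` callback, which would call an aligned model in practice, and all function names are hypothetical.

```python
import math


def neighborhood(pref, k, max_angle_deg):
    """k unit vectors in 2-D, spread evenly within +/- max_angle_deg of pref."""
    base = math.atan2(pref[1], pref[0])
    span = math.radians(max_angle_deg)
    if k == 1:
        return [(math.cos(base), math.sin(base))]
    angles = [base - span + 2 * span * i / (k - 1) for i in range(k)]
    return [(math.cos(a), math.sin(a)) for a in angles]


def select_response(pref, candidates):
    """candidates: (response, attribute_scores) pairs.

    Assumed selection rule: highest projection of the scores onto pref.
    """
    return max(candidates, key=lambda c: c[1][0] * pref[0] + c[1][1] * pref[1])


def rps(pref, generate, k=5, max_angle_deg=15.0):
    """Generate one candidate per neighboring direction, then pick the best
    one as judged against the ORIGINAL preference, not the perturbed ones."""
    candidates = [generate(d) for d in neighborhood(pref, k, max_angle_deg)]
    return select_response(pref, candidates)


# Hypothetical candidates with (helpfulness, brevity) scores.
pool = [("a", (0.2, 0.9)), ("b", (0.8, 0.1)), ("c", (0.5, 0.5))]
print(select_response((1.0, 0.0), pool)[0])
```

The key design point is that generation is conditioned on nearby preferences while selection always refers back to the user's original preference, so the neighborhood only widens the candidate pool.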

[45] Hierarchical Sequence Iteration for Heterogeneous Question Answering

Ruiyi Yang,Hao Xue,Imran Razzak,Hakim Hacid,Flora D. Salim

Main category: cs.CL

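TL;DR: This paper proposes the HSEQ Iteration framework for retrieval-augmented generation over heterogeneous data sources. It unifies documents, tables, and knowledge graphs into a reversible hierarchical sequence and combines this with structure-aware iteration, yielding efficient and accurate answer generation on multi-hop question answering.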

Details Motivation: Existing RAG methods are brittle on multi-step reasoning and heterogeneous evidence sources, trading accuracy against latency and resource consumption, and lack a unified, efficient mechanism for integrating different data types with structure-aware reasoning. Method: Documents, tables, and knowledge graphs are linearized into a reversible hierarchical sequence (HSeq) with lightweight structural tags; a Head Agent guides retrieval, an Iteration Agent collects evidence through structure-aware actions (such as parent/child hops, table row/column neighbors, and KG relation expansion), and the Head Agent finally synthesizes the answer, with an optional refinement loop to resolve contradictions. Result: On HotpotQA, HybridQA/TAT-QA, and MetaQA, HSEQ shows consistent EM/F1 gains over strong baselines with high efficiency, reducing unnecessary retrieval hops, tool calls, and token usage. Conclusion: HSEQ provides a format-agnostic, unified framework for efficient, accurate, and auditable question answering across text, tables, and knowledge graphs, with good generality and practical potential. Abstract: Retrieval-augmented generation (RAG) remains brittle on multi-step questions and heterogeneous evidence sources, trading accuracy against latency and token/tool budgets. This paper introduces Hierarchical Sequence (HSEQ) Iteration for Heterogeneous Question Answering, a unified framework that (i) linearizes documents, tables, and knowledge graphs into a reversible hierarchical sequence with lightweight structural tags, and (ii) performs structure-aware iteration to collect just-enough evidence before answer synthesis. A Head Agent provides guidance that leads retrieval, while an Iteration Agent selects and expands HSeq via structure-respecting actions (e.g., parent/child hops, table row/column neighbors, KG relations). Finally, the Head Agent composes canonicalized evidence to generate the final answer, with an optional refinement loop to resolve detected contradictions. Experiments on HotpotQA (text), HybridQA/TAT-QA (table+text), and MetaQA (KG) show consistent EM/F1 gains over strong single-pass, multi-hop, and agentic RAG baselines with high efficiency. Besides, HSEQ exhibits three key advantages: (1) a format-agnostic unification that enables a single policy to operate across text, tables, and KGs without per-dataset specialization; (2) guided, budget-aware iteration that reduces unnecessary hops, tool calls, and tokens while preserving accuracy; and (3) evidence canonicalization for reliable QA, improving answer consistency and auditability.

[46] Assessing the Political Fairness of Multilingual LLMs: A Case Study based on a 21-way Multiparallel EuroParl Dataset

Paul Lerner,François Yvon

Main category: cs.CL

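TL;DR: The paper proposes a new framework for assessing the political bias of large language models through fairness in multilingual translation, using a 21-way multiparallel corpus of European Parliament speeches to analyze differences in translation quality.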

Details Motivation: Existing methods mostly assess LLM political bias by simulating answers to English surveys and neglect cross-lingual and real-world contexts, so a fairer, multilingual evaluation approach is needed. Method: The authors build a new multiparallel version of EuroParl with 1.5 million sentences, annotated along several dimensions (such as party affiliation), and systematically compare translation quality across parties with different political positions. Result: Speeches by majority parties from the left, center, and right are translated better than those of outsider parties, indicating systematic differences in translation quality. Conclusion: LLMs favor majority parties in multilingual translation, reflecting a potential political bias that fairness evaluations should take into account. Abstract: The political biases of Large Language Models (LLMs) are usually assessed by simulating their answers to English surveys. In this work, we propose an alternative framing of political biases, relying on principles of fairness in multilingual translation. We systematically compare the translation quality of speeches in the European Parliament (EP), observing systematic differences with majority parties from left, center, and right being better translated than outsider parties. This study is made possible by a new, 21-way multiparallel version of EuroParl, the parliamentary proceedings of the EP, which includes the political affiliations of each speaker. The dataset consists of 1.5M sentences for a total of 40M words and 249M characters. It covers three years, 1000+ speakers, 7 countries, 12 EU parties, 25 EU committees, and hundreds of national parties.

[47] ARC-Encoder: learning compressed text representations for large language models

Hippolyte Pilchen,Edouard Grave,Patrick Pérez

Main category: cs.CL

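TL;DR: This paper proposes ARC-Encoder, a context compression method that compresses text context into continuous representations to reduce the inference cost of decoder LLMs. It requires no fine-tuning or modification of the target model and achieves efficient, transferable context compression across a variety of scenarios.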

Details Motivation: The most effective context compression techniques require fine-tuning the target model or modifying its architecture, which can degrade its general abilities; an efficient compression method that leaves the target model untouched and works across LLMs is needed. Method: An encoder (ARC-Encoder) compresses the context into fewer continuous representations (typically 4 to 8 times fewer) that replace token embeddings in the decoder LLM; the authors systematically study training strategies and architecture choices and adapt the encoder to multiple decoders. Result: ARC-Encoder achieves state-of-the-art performance on several benchmarks while improving inference efficiency, works with both instruct and base decoders, and a single encoder can generalize across multiple decoders. Conclusion: ARC-Encoder is a flexible, efficient, and portable context compression solution that reduces the computational cost of long contexts without modifying the target LLM. Abstract: Recent techniques such as retrieval-augmented generation or chain-of-thought reasoning have led to longer contexts and increased inference costs. Context compression techniques can reduce these costs, but the most effective approaches require fine-tuning the target model or even modifying its architecture. This can degrade its general abilities when not used for this specific purpose. Here we explore an alternative approach: an encoder that compresses the context into continuous representations which replace token embeddings in decoder LLMs. First, we perform a systematic study of training strategies and architecture choices for the encoder. Our findings led to the design of an Adaptable text Representations Compressor, named ARC-Encoder, which outputs $x$-times fewer continuous representations (typically $x\!\in\!\{4,8\}$) than text tokens. We evaluate ARC-Encoder across a variety of LLM usage scenarios, ranging from in-context learning to context window extension, on both instruct and base decoders. Results show that ARC-Encoder achieves state-of-the-art performance on several benchmarks while improving computational efficiency at inference. Finally, we demonstrate that our models can be adapted to multiple decoders simultaneously, allowing a single encoder to generalize across different decoder LLMs. This makes ARC-Encoder a flexible and efficient solution for portable encoders that work seamlessly with multiple LLMs. 
We release training code at https://github.com/kyutai-labs/ARC-Encoder ; the fine-tuning dataset and pretrained models are available at https://huggingface.co/collections/kyutai/arc-encoders-68ee18787301407d60a57047 .

[48] The Dog the Cat Chased Stumped the Model: Measuring When Language Models Abandon Structure for Shortcuts

Sangmitra Madhusudan,Kaige Chen,Ali Emami

Main category: cs.CL

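TL;DR: This paper introduces the CenterBench dataset for testing whether language models rely on syntactic analysis or semantic pattern matching when processing nested syntactic structures, and finds that as sentence complexity increases, models increasingly abandon structural analysis in favor of semantic associations.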

Details Motivation: There is no effective method for distinguishing a language model's syntactic understanding from semantic pattern matching, especially on center-embedded sentences. Method: The authors build CenterBench, a dataset of 9,720 questions that pairs each sentence with a syntactically identical but semantically implausible counterpart and uses six comprehension questions to test surface understanding, syntactic dependencies, and causal reasoning. Result: For six models, the performance gap between plausible and implausible sentences widens with complexity, with median gaps of up to 26.8 percentage points; semantic plausibility actually harms performance on questions about resulting actions; reasoning models improve accuracy but still rely on semantic shortcuts. Conclusion: CenterBench provides the first framework for identifying when models shift from structural analysis to pattern matching, exposing their reliance on semantic associations rather than genuine syntactic understanding. Abstract: When language models correctly parse "The cat that the dog chased meowed," are they analyzing syntax or simply familiar with dogs chasing cats? Despite extensive benchmarking, we lack methods to distinguish structural understanding from semantic pattern matching. We introduce CenterBench, a dataset of 9,720 comprehension questions on center-embedded sentences (like "The cat [that the dog chased] meowed") where relative clauses nest recursively, creating processing demands from simple to deeply nested structures. Each sentence has a syntactically identical but semantically implausible counterpart (e.g., mailmen prescribe medicine, doctors deliver mail) and six comprehension questions testing surface understanding, syntactic dependencies, and causal reasoning. Testing six models reveals that performance gaps between plausible and implausible sentences widen systematically with complexity, with models showing median gaps up to 26.8 percentage points, quantifying when they abandon structural analysis for semantic associations. Notably, semantic plausibility harms performance on questions about resulting actions, where following causal relationships matters more than semantic coherence. Reasoning models improve accuracy but their traces show semantic shortcuts, overthinking, and answer refusal. Unlike models whose plausibility advantage systematically widens with complexity, humans show variable semantic effects. CenterBench provides the first framework to identify when models shift from structural analysis to pattern matching.

[49] GlobalRAG: Enhancing Global Reasoning in Multi-hop Question Answering via Reinforcement Learning

Jinchang Luo,Mingquan Cheng,Fan Wan,Ni Li,Xiaoling Xia,Shuangshuang Tian,Tingcheng Bian,Haiwei Wang,Haohuan Fu,Yan Tao

Main category: cs.CL

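TL;DR: The paper proposes GlobalRAG, a reinforcement learning framework that introduces global planning and subgoal rewards to significantly improve retrieval-augmented generation on multi-hop question answering.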

Details Motivation: The use of reinforcement learning for multi-hop question answering is limited by the lack of global planning and by unfaithful execution, which lead to incoherent reasoning and inconsistent use of evidence. Method: Questions are decomposed into subgoals, retrieval is coordinated with reasoning, and evidence is refined iteratively; a Planning Quality Reward and a SubGoal Completion Reward guide training, and a progressive weight annealing strategy balances process-oriented and outcome-based objectives. Result: GlobalRAG significantly outperforms strong baselines on both in-domain and out-of-domain benchmarks, with average improvements of 14.2% in both EM and F1 while using only 42% of the baselines' training data. Conclusion: Through structured global reasoning and faithful execution, GlobalRAG improves multi-hop QA performance with limited training data, demonstrating the potential of reinforcement learning in RAG systems. Abstract: Reinforcement learning has recently shown promise in improving retrieval-augmented generation (RAG). Despite these advances, its effectiveness in multi-hop question answering (QA) remains limited by two fundamental limitations: (i) the absence of global planning to structure multi-step reasoning, and (ii) unfaithful execution, which hinders effective query formulation and consistent use of retrieved evidence. We propose GlobalRAG, a reinforcement learning framework designed to enhance global reasoning in multi-hop QA. GlobalRAG decomposes questions into subgoals, coordinates retrieval with reasoning, and refines evidence iteratively. To guide this process, we introduce Planning Quality Reward and SubGoal Completion Reward, which encourage coherent planning and reliable subgoal execution. In addition, a progressive weight annealing strategy balances process-oriented and outcome-based objectives. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that GlobalRAG significantly outperforms strong baselines while using only 8k training data (42% of the training data used by strong baselines), achieving average improvements of 14.2% in both EM and F1.

Zhouwei Zhai,Mengxiang Chen,Haoyun Xia,Jin Li,Renquan Zhou,Min Yang

Main category: cs.CL

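TL;DR: The paper proposes MACDF, a multi-agent cognitive decision framework that shifts e-commerce search from passive retrieval to proactive decision support, significantly improving recommendation accuracy and user satisfaction on complex queries.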

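Details Motivation: Traditional e-commerce search relies on query-item matching and does not align with users' multi-stage cognitive decision processes, leading to semantic gaps, high decision costs, and a lack of professional shopping guidance. Method: The authors design and implement a Multi-Agent Cognitive Decision Framework (MACDF) in which multiple agents cooperate to simulate the human cognitive decision process and provide proactive shopping decision support. Result: Offline experiments show that MACDF significantly outperforms traditional methods in recommendation accuracy and user satisfaction, especially on complex queries involving negation, multiple constraints, or reasoning; online A/B testing confirms its practical effectiveness on the JD search platform. Conclusion: Multi-agent cognitive systems have the potential to reshape the e-commerce search paradigm and give users a more intelligent, more human-centered shopping experience. Abstract: The retrieval-ranking paradigm has long dominated e-commerce search, but its reliance on query-item matching fundamentally misaligns with multi-stage cognitive decision processes of platform users. This misalignment introduces critical limitations: semantic gaps in complex queries, high decision costs due to cross-platform information foraging, and the absence of professional shopping guidance. To address these issues, we propose a Multi-Agent Cognitive Decision Framework (MACDF), which shifts the paradigm from passive retrieval to proactive decision support. Extensive offline evaluations demonstrate MACDF's significant improvements in recommendation accuracy and user satisfaction, particularly for complex queries involving negation, multi-constraint, or reasoning demands. Online A/B testing on the JD search platform confirms its practical efficacy. This work highlights the transformative potential of multi-agent cognitive systems in redefining e-commerce search.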

[51] Can ChatGPT Code Communication Data Fairly?: Empirical Evidence from Multiple Collaborative Tasks

Jiangang Hao,Wenju Cui,Patrick Kyllonen,Emily Kerzabi

Main category: cs.CL

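TL;DR: This study examines whether ChatGPT-based automated coding of communication data from collaborative problem solving exhibits gender or racial bias, and finds no significant bias.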

Details Motivation: Prior research shows that ChatGPT can code communication data effectively, but it is unclear whether its coding is biased across demographic groups. Method: Using a typical coding framework for collaborative problem solving, the authors analyze data from three types of tasks (negotiation, problem solving, and decision making) to assess the fairness of ChatGPT-based coding. Result: ChatGPT's coding accuracy is consistent across gender and racial groups, with no significant bias found. Conclusion: ChatGPT-based automated coding is fair across the groups studied and suitable for large-scale assessment of collaboration and communication. Abstract: Assessing communication and collaboration at scale depends on a labor-intensive task of coding communication data into categories according to different frameworks. Prior research has established that ChatGPT can be directly instructed with coding rubrics to code the communication data and achieves accuracy comparable to human raters. However, whether the coding from ChatGPT or similar AI technology exhibits bias against different demographic groups, such as gender and race, remains unclear. To fill this gap, this paper investigates ChatGPT-based automated coding of communication data using a typical coding framework for collaborative problem solving, examining differences across gender and racial groups. The analysis draws on data from three types of collaborative tasks: negotiation, problem solving, and decision making. Our results show that ChatGPT-based coding exhibits no significant bias across gender and racial groups, paving the road for its adoption in large-scale assessment of collaboration and communication.

[52] BUSTED at AraGenEval Shared Task: A Comparative Study of Transformer-Based Models for Arabic AI-Generated Text Detection

Ali Zain,Sareem Farooqui,Muhammad Rafi

Main category: cs.CL

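TL;DR: This paper describes team BUSTED's submission to the AraGenEval shared task on detecting Arabic AI-generated text. The team fine-tuned three pre-trained transformer models (AraELECTRA, CAMeLBERT, and XLM-RoBERTa) for binary classification and found that the multilingual XLM-RoBERTa performed best with an F1 score of 0.7701, outperforming the Arabic-specific models and highlighting the strong generalization of multilingual models on this task.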

Details Motivation: To examine how effective different pre-trained transformer models are at detecting Arabic AI-generated text, and in particular to compare Arabic-specific models with multilingual ones. Method: Three pre-trained models, AraELECTRA, CAMeLBERT, and XLM-RoBERTa, are fine-tuned on the provided dataset for a binary classification task (human-written vs. AI-generated). Result: XLM-RoBERTa achieves the highest F1 score, 0.7701, outperforming AraELECTRA and CAMeLBERT, which were designed specifically for Arabic. Conclusion: Multilingual pre-trained models generalize well on Arabic AI-generated text detection and may outperform language-specific models, which has implications for the design of future detection systems. Abstract: This paper details our submission to the AraGenEval Shared Task on Arabic AI-generated text detection, where our team, BUSTED, secured 5th place. We investigated the effectiveness of three pre-trained transformer models: AraELECTRA, CAMeLBERT, and XLM-RoBERTa. Our approach involved fine-tuning each model on the provided dataset for a binary classification task. Our findings revealed a surprising result: the multilingual XLM-RoBERTa model achieved the highest performance with an F1 score of 0.7701, outperforming the specialized Arabic models. This work underscores the complexities of AI-generated text detection and highlights the strong generalization capabilities of multilingual models.

[53] Why Did Apple Fall To The Ground: Evaluating Curiosity In Large Language Model

Haoyu Wang,Sihang Jiang,Yuyan Chen,Yitong Wang,Yanghua Xiao

Main category: cs.CL

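TL;DR: Starting from the human curiosity questionnaire 5DCR, this paper designs a comprehensive framework for evaluating curiosity in large language models (LLMs). It finds that LLMs show a stronger thirst for knowledge than humans but still tend toward conservative choices in uncertain environments, and that curious behaviors can enhance a model's reasoning and active learning abilities.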

Details Motivation: To examine whether LLMs possess a human-like capacity for curiosity-driven learning, drawing on human curiosity assessment to build a quantifiable evaluation framework. Method: Based on the Five-Dimensional Curiosity scale Revised (5DCR), the authors design an evaluation framework covering dimensions such as Information Seeking, Thrill Seeking, and Social Curiosity, evaluate LLMs systematically, and analyze the relationship between curiosity and reasoning and active learning abilities. Result: LLMs exhibit a stronger thirst for knowledge than humans but remain conservative under uncertainty; curious behaviors are found to enhance the model's reasoning and active learning abilities. Conclusion: LLMs have the potential for human-like curiosity, and the study provides experimental support for future work on improving model learning capabilities and innovative research. Abstract: Curiosity serves as a pivotal conduit for human beings to discover and learn new knowledge. Recent advancements of large language models (LLMs) in natural language processing have sparked discussions regarding whether these models possess the capability of curiosity-driven learning akin to humans. In this paper, starting from the human curiosity assessment questionnaire Five-Dimensional Curiosity scale Revised (5DCR), we design a comprehensive evaluation framework that covers dimensions such as Information Seeking, Thrill Seeking, and Social Curiosity to assess the extent of curiosity exhibited by LLMs. The results demonstrate that LLMs exhibit a stronger thirst for knowledge than humans but still tend to make conservative choices when faced with uncertain environments. We further investigated the relationship between curiosity and thinking of LLMs, confirming that curious behaviors can enhance the model's reasoning and active learning abilities. These findings suggest that LLMs have the potential to exhibit curiosity similar to that of humans, providing experimental support for the future development of learning capabilities and innovative research in LLMs.

[54] The Reasoning Lingua Franca: A Double-Edged Sword for Multilingual AI

Alan Saji,Raj Dabre,Anoop Kunchukuttan,Ratish Puduppully

Main category: cs.CL

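TL;DR: This paper studies how large reasoning models (LRMs) behave in multilingual reasoning. LRMs tend to reason in English; this usually improves accuracy, but on complex tasks it can fail because of translation errors.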

Details Motivation: To explore how well large reasoning models reason about non-English questions and how they handle linguistic and cultural differences. Method: Compare an LRM's reasoning in English with its reasoning in the language of the question, evaluate on two tasks (MGSM and GPQA Diamond), and analyze cognitive attributes in the reasoning traces. Result: English reasoning shows stronger cognitive-behavior signals and higher accuracy, especially on complex tasks, but it exhibits a "Lost in Translation" failure mode. Conclusion: English reasoning is usually more accurate, but relying on translation can introduce errors, which underlines the importance of native multilingual reasoning. Abstract: Large Reasoning Models (LRMs) achieve strong performance on mathematical, scientific, and other question-answering tasks, but their multilingual reasoning abilities remain underexplored. When presented with non-English questions, LRMs often default to reasoning in English, raising concerns about interpretability and the handling of linguistic and cultural nuances. We systematically compare an LRM's reasoning in English versus the language of the question. Our evaluation spans two tasks: MGSM and GPQA Diamond. Beyond measuring answer accuracy, we also analyze cognitive attributes in the reasoning traces. We find that English reasoning traces exhibit a substantially higher presence of these cognitive behaviors, and that reasoning in English generally yields higher final-answer accuracy, with the performance gap increasing as tasks become more complex. However, this English-centric strategy is susceptible to a key failure mode - getting "Lost in Translation," where translation steps lead to errors that would have been avoided by reasoning in the question's language.

[55] CantoNLU: A benchmark for Cantonese natural language understanding

Junghyun Min,York Hay Ng,Sophia Chan,Helena Shunhua Zhao,En-Shiun Annie Lee

Main category: cs.CL

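TL;DR: This paper introduces CantoNLU, the first benchmark for Cantonese natural language understanding, covering seven syntactic and semantic tasks, and evaluates several models on Cantonese. Cantonese-adapted models perform best overall, while the monolingual model does better on syntactic tasks.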

Details Motivation: Cantonese is widely spoken but under-resourced and lacks a standardized evaluation framework, which holds back Cantonese NLP. Method: Build the seven-task CantoNLU benchmark and evaluate four models: a Mandarin model with no Cantonese training, two models adapted to Cantonese through continual pre-training, and a monolingual Cantonese model trained from scratch. Result: Cantonese-adapted models perform best overall; the monolingual model beats the others on syntactic tasks such as POS tagging and dependency parsing; the Mandarin model stays competitive on some tasks. Conclusion: CantoNLU provides an important benchmark for Cantonese NLP research. Adapted models are currently the best choice, but the Mandarin model can be an effective substitute when Cantonese data is scarce. The authors release all data, code, and model weights. Abstract: Cantonese, although spoken by millions, remains under-resourced due to policy and diglossia. To address this scarcity of evaluation frameworks for Cantonese, we introduce CantoNLU, a benchmark for Cantonese natural language understanding (NLU). This novel benchmark spans seven tasks covering syntax and semantics, including word sense disambiguation, linguistic acceptability judgment, language detection, natural language inference, sentiment analysis, part-of-speech tagging, and dependency parsing. In addition to the benchmark, we provide model baseline performance across a set of models: a Mandarin model without Cantonese training, two Cantonese-adapted models obtained by continual pre-training a Mandarin model on Cantonese text, and a monolingual Cantonese model trained from scratch. Results show that Cantonese-adapted models perform best overall, while monolingual models perform better on syntactic tasks. Mandarin models remain competitive in certain settings, indicating that direct transfer may be sufficient when Cantonese domain data is scarce. We release all datasets, code, and model weights to facilitate future research in Cantonese NLP.

[56] Neural Diversity Regularizes Hallucinations in Small Models

Kushal Chakrabarti,Nirmal Balachundhar

Main category: cs.CL

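TL;DR: Proposes neural diversity as a third axis for reducing language-model hallucinations; the ND-LoRA method substantially lowers hallucination rates without adding parameters or data.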

Details Motivation: Language models still hallucinate as parameters, compute, and data grow, so a new mechanism is needed that reduces hallucinations under fixed resources. Method: Inspired by portfolio theory, the authors introduce the notion of neural diversity and use ND-LoRA (parallel LoRA adapters combined with Barlow Twins regularization) to obtain decorrelated representations. Result: ND-LoRA reduces hallucinations by 14.6% on average and by up to 25.6% without hurting overall accuracy; each 0.1% increase in neural correlation is associated with a 3.8% increase in hallucinations; different tasks call for different optimal levels of neural diversity. Conclusion: Neural diversity is a key new dimension for improving language-model reliability, orthogonal to parameters and data, and can improve models at a fixed budget. Abstract: Language models continue to hallucinate despite increases in parameters, compute, and data. We propose neural diversity -- decorrelated parallel representations -- as a principled mechanism that reduces hallucination rates at fixed parameter and data budgets. Inspired by portfolio theory, where uncorrelated assets reduce risk by $\sqrt{P}$, we prove hallucination probability is bounded by representational correlation: $P(H) \leq f(\sigma^2((1-\rho(P))/P + \rho(P)), \mu^2)$, which predicts that language models need an optimal amount of neurodiversity. To validate this, we introduce ND-LoRA (Neural Diversity Low-Rank Adaptation), combining parallel LoRA adapters with Barlow Twins regularization, and demonstrate that ND-LoRA reduces hallucinations by up to 25.6% (and 14.6% on average) without degrading general accuracy. Ablations show LoRA adapters and regularization act synergistically, causal interventions prove neurodiversity as the mediating factor and correlational analyses indicate scale: a 0.1% neural correlation increase is associated with a 3.8% hallucination increase. Finally, task-dependent optimality emerges: different tasks require different amounts of optimal neurodiversity. Together, our results highlight neural diversity as a third axis of scaling -- orthogonal to parameters and data -- to improve the reliability of language models at fixed budgets.
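
The term $\sigma^2((1-\rho)/P + \rho)$ inside the bound above is the variance of the average of $P$ equally correlated signals, each with variance $\sigma^2$ and pairwise correlation $\rho$. The sketch below is an illustration only, not the authors' code: it evaluates that term and checks it against a small simulation. The function names and the Gaussian shared-factor model are assumptions made for the example, and the outer function $f$ from the abstract is not specified there, so it is not modeled.

```python
import random
import statistics


def ensemble_variance(sigma2: float, rho: float, p: int) -> float:
    """Variance of the mean of p signals with variance sigma2 and pairwise correlation rho."""
    return sigma2 * ((1.0 - rho) / p + rho)


def simulate_ensemble_variance(sigma2: float, rho: float, p: int,
                               trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity (assumed Gaussian shared-factor model)."""
    rng = random.Random(seed)
    sd = sigma2 ** 0.5
    means = []
    for _ in range(trials):
        shared = rng.gauss(0.0, 1.0)  # component common to all p signals
        signals = [
            sd * (rho ** 0.5 * shared + (1.0 - rho) ** 0.5 * rng.gauss(0.0, 1.0))
            for _ in range(p)
        ]
        means.append(sum(signals) / p)
    return statistics.pvariance(means)


if __name__ == "__main__":
    for rho in (0.0, 0.5, 1.0):
        print(rho, ensemble_variance(1.0, rho, 4), simulate_ensemble_variance(1.0, rho, 4))
```

With $\rho = 0$ the variance falls as $1/P$ (so the standard deviation falls as $1/\sqrt{P}$, matching the portfolio analogy), while with $\rho = 1$ adding parallel streams gives no reduction at all.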

[57] Structure-Conditional Minimum Bayes Risk Decoding

Bryan Eikema,Anna Rutkiewicz,Mario Giulianelli

Main category: cs.CL

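TL;DR: This paper proposes three lightweight adaptations of the utility function that make Minimum Bayes Risk (MBR) decoding more sensitive to latent structure in open-ended generation, and shows clear gains in generation quality on dialogue and instruction-following tasks.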

Details Motivation: MBR decoding works well on constrained tasks such as machine translation, but on open-ended tasks such as dialogue or instruction following, standard similarity-based utility functions may miss differences in the latent structure of generations and select sub-optimal responses. Method: Propose three lightweight adaptations to the utility function, design two metrics for structural optimality, and build a dataset covering three types of latent structure (dialogue act, emotion, and response structure) for validation. Result: Conventional similarity-based utility functions do poorly on structural optimality, while the proposed adaptations markedly improve structural sensitivity; on AlpacaEval and MT-Bench, win rate improves by up to 13.7 percentage points. Conclusion: Adapting the MBR utility function to be more sensitive to latent structure in the outcome space can substantially improve generation quality on open-ended tasks. Abstract: Minimum Bayes Risk (MBR) decoding has seen renewed interest as an alternative to traditional generation strategies. While MBR has proven effective in machine translation, where the variability of a language model's outcome space is naturally constrained, it may face challenges in more open-ended tasks such as dialogue or instruction-following. We hypothesise that in such settings, applying MBR with standard similarity-based utility functions may result in selecting responses that are broadly representative of the model's distribution, yet sub-optimal with respect to any particular grouping of generations that share an underlying latent structure. In this work, we introduce three lightweight adaptations to the utility function, designed to make MBR more sensitive to structural variability in the outcome space. To test our hypothesis, we curate a dataset capturing three representative types of latent structure: dialogue act, emotion, and response structure (e.g., a sentence, a paragraph, or a list). We further propose two metrics to evaluate the structural optimality of MBR. Our analysis demonstrates that common similarity-based utility functions fall short by these metrics. In contrast, our proposed adaptations considerably improve structural optimality. Finally, we evaluate our approaches on real-world instruction-following benchmarks, AlpacaEval and MT-Bench, and show that increased structural sensitivity improves generation quality by up to 13.7 percentage points in win rate.
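
For readers unfamiliar with the baseline being adapted, here is a minimal sketch of standard sample-based MBR decoding with a similarity utility. It is an illustration under stated assumptions, not the paper's method: the unigram-F1 utility and the names `unigram_f1` and `mbr_decode` are chosen for the example, and the three structure-aware adaptations from the abstract are not reproduced because the abstract does not describe them.

```python
from collections import Counter
from typing import Callable, List


def unigram_f1(hyp: str, ref: str) -> float:
    """Token-overlap F1 between two strings (an assumed, deliberately simple utility)."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_decode(candidates: List[str],
               utility: Callable[[str, str], float] = unigram_f1) -> str:
    """Return the candidate with the highest average utility against the other samples."""
    if len(candidates) < 2:
        raise ValueError("MBR needs at least two sampled candidates")

    def expected_utility(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


if __name__ == "__main__":
    samples = [
        "the cat sat on the mat",
        "the cat sat on a mat",
        "a dog barked loudly",
    ]
    print(mbr_decode(samples))
```

Because the winner is whichever sample is most similar to all the others on average, a pool that mixes, say, list-style and paragraph-style responses can yield a choice that is representative overall but not the best within either group, which is the failure mode the abstract hypothesises.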

[58] User Perceptions of Privacy and Helpfulness in LLM Responses to Privacy-Sensitive Scenarios

Xiaoyuan Wu,Roshni Kaushik,Wenkai Li,Lujo Bauer,Koichi Onoue

Main category: cs.CL

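TL;DR: A user study finds that users agree little with one another when rating the privacy preservation and helpfulness of LLM responses in privacy-sensitive scenarios. Proxy LLMs agree strongly with each other but correlate poorly with user ratings, so current proxy-LLM evaluations do not accurately reflect real user experience. The authors call for user-centered evaluation and for better alignment between proxy models and user perception.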

Details Motivation: Existing work relies on proxy LLMs to evaluate LLM behavior on privacy-sensitive tasks, overlooking differences in how real users perceive responses and lacking a careful analysis of helpfulness; the actual views of users on privacy preservation and helpfulness therefore need to be studied. Method: In a user study with 94 participants and 90 real-life scenarios from the PrivacyLens dataset, collect user ratings of the privacy preservation and helpfulness of LLM responses and compare them with the evaluations of five proxy LLMs. Result: Users showed low agreement with each other when judging the privacy preservation and helpfulness of identical responses; the five proxy LLMs agreed strongly with one another, but each had low correlation with user ratings. Conclusion: Perceived privacy preservation and helpfulness of LLM responses vary between individuals, and proxy LLMs do not represent real users' judgments well; future work should strengthen user-centered evaluation and improve the alignment between proxy models and user perception. Abstract: Large language models (LLMs) have seen rapid adoption for tasks such as drafting emails, summarizing meetings, and answering health questions. In such uses, users may need to share private information (e.g., health records, contact details). To evaluate LLMs' ability to identify and redact such private information, prior work developed benchmarks (e.g., ConfAIde, PrivacyLens) with real-life scenarios. Using these benchmarks, researchers have found that LLMs sometimes fail to keep secrets private when responding to complex tasks (e.g., leaking employee salaries in meeting summaries). However, these evaluations rely on LLMs (proxy LLMs) to gauge compliance with privacy norms, overlooking real users' perceptions. Moreover, prior work primarily focused on the privacy-preservation quality of responses, without investigating nuanced differences in helpfulness. To understand how users perceive the privacy-preservation quality and helpfulness of LLM responses to privacy-sensitive scenarios, we conducted a user study with 94 participants using 90 scenarios from PrivacyLens. We found that, when evaluating identical responses to the same scenario, users showed low agreement with each other on the privacy-preservation quality and helpfulness of the LLM response. Further, we found high agreement among five proxy LLMs, while each individual LLM had low correlation with users' evaluations. These results indicate that the privacy and helpfulness of LLM responses are often specific to individuals, and proxy LLMs are poor estimates of how real users would perceive these responses in privacy-sensitive scenarios. Our results suggest the need to conduct user-centered studies on measuring LLMs' ability to help users while preserving privacy. Additionally, future research could investigate ways to improve the alignment between proxy LLMs and users for better estimation of users' perceived privacy and utility.

Xizhi Wu,Madeline S. Kreider,Philip E. Empey,Chenyu Li,Yanshan Wang

Main category: cs.CL

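TL;DR: This study compares several natural language processing (NLP) methods for extracting fluoropyrimidine treatment and toxicity information from clinical notes. Methods based on large language models (LLMs), especially error-analysis prompting, performed best, reaching an F1 score of 1.000 and clearly outperforming traditional machine learning and deep learning models.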

Details Motivation: Toxicity information for fluoropyrimidine drugs is often buried in unstructured clinical notes, and manual extraction is slow and prone to omissions, so automated NLP methods are needed to extract it efficiently and accurately in support of oncology research and drug safety monitoring. Method: Build a gold-standard dataset of 236 clinical notes with treatment regimens and toxicity categories annotated by domain experts; compare rule-based, machine learning (Random Forest, SVM, Logistic Regression), deep learning (BERT, ClinicalBERT), and LLM (zero-shot and error-analysis prompting) approaches; evaluate with an 80:20 train-test split. Result: LLM-based methods performed best for treatment and toxicity extraction: error-analysis prompting reached F1=1.000, and zero-shot prompting reached F1=1.000 (treatment) and 0.876 (toxicity); Logistic Regression and SVM came next (F1=0.937); BERT and ClinicalBERT were middling; rule-based methods served as the baseline (F1≈0.857–0.858). Conclusion: LLM-based NLP methods are the most effective for extracting fluoropyrimidine-related clinical information and have strong potential to support oncology research and pharmacovigilance. Abstract: Objective: Fluoropyrimidines are widely prescribed for colorectal and breast cancers, but are associated with toxicities such as hand-foot syndrome and cardiotoxicity. Since toxicity documentation is often embedded in clinical notes, we aimed to develop and evaluate natural language processing (NLP) methods to extract treatment and toxicity information. Materials and Methods: We constructed a gold-standard dataset of 236 clinical notes from 204,165 adult oncology patients. Domain experts annotated categories related to treatment regimens and toxicities. We developed rule-based, machine learning-based (Random Forest, Support Vector Machine [SVM], Logistic Regression [LR]), deep learning-based (BERT, ClinicalBERT), and large language models (LLM)-based NLP approaches (zero-shot and error-analysis prompting). Models used an 80:20 train-test split. Results: Sufficient data existed to train and evaluate 5 annotated categories. Error-analysis prompting achieved optimal precision, recall, and F1 scores (F1=1.000) for treatment and toxicities extraction, whereas zero-shot prompting reached F1=1.000 for treatment and F1=0.876 for toxicities extraction. LR and SVM ranked second for toxicities (F1=0.937). Deep learning underperformed, with BERT (F1=0.873 treatment; F1=0.839 toxicities) and ClinicalBERT (F1=0.873 treatment; F1=0.886 toxicities). Rule-based methods served as our baseline with F1 scores of 0.857 in treatment and 0.858 in toxicities. Discussion: LLM-based approaches outperformed all others, followed by machine learning methods. Machine and deep learning approaches were limited by small training data and showed limited generalizability, particularly for rare categories. Conclusion: LLM-based NLP most effectively extracted fluoropyrimidine treatment and toxicity information from clinical notes, and has strong potential to support oncology research and pharmacovigilance.

[60] Are Large Reasoning Models Good Translation Evaluators? Analysis and Performance Boost

Runzhe Zhan,Zhihong Huang,Xinyi Yang,Lidia S. Chao,Min Yang,Derek F. Wong

Main category: cs.CL

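TL;DR: This paper presents the first systematic analysis of large reasoning models (LRMs) as machine translation evaluators and proposes calibrating LRM "thinking" by training on synthetic, human-like thinking trajectories, which sharply cuts compute cost while improving evaluation performance.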

Details Motivation: To explore the potential of large reasoning models as evaluators of machine translation quality and to address problems they show on existing evaluation tasks, such as overthinking and biased scoring mechanisms. Method: Propose a calibration method based on synthetic, human-like thinking trajectories, train LRMs to evaluate MT more efficiently and accurately, and validate on the WMT24 Metrics benchmarks. Result: The method reduces thinking budgets by roughly 35x and improves evaluation correlation across LRMs from 7B to 32B; for example, R1-Distill-Qwen-7B gains 8.7 correlation points. Conclusion: Well-calibrated LRMs have strong potential for fine-grained automatic MT evaluation, improving accuracy while greatly reducing compute cost. Abstract: Recent advancements in large reasoning models (LRMs) have introduced an intermediate "thinking" process prior to generating final answers, improving their reasoning capabilities on complex downstream tasks. However, the potential of LRMs as evaluators for machine translation (MT) quality remains underexplored. We provide the first systematic analysis of LRM-as-a-judge in MT evaluation. We identify key challenges, revealing LRMs require tailored evaluation materials, tend to "overthink" simpler instances and have issues with scoring mechanisms leading to overestimation. To address these, we propose to calibrate LRM thinking by training them on synthetic, human-like thinking trajectories. Our experiments on WMT24 Metrics benchmarks demonstrate that this approach largely reduces thinking budgets by ~35x while concurrently improving evaluation performance across different LRM scales from 7B to 32B (e.g., R1-Distill-Qwen-7B achieves a +8.7 correlation point improvement). These findings highlight the potential of efficiently calibrated LRMs to advance fine-grained automatic MT evaluation.

[61] A Use-Case Specific Dataset for Measuring Dimensions of Responsible Performance in LLM-generated Text

Alicia Sagae,Chia-Jung Lee,Sandeep Avula,Brandon Dang,Vanessa Murdock

Main category: cs.CL

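TL;DR: Proposes a new approach to evaluating large language models on Responsible AI dimensions such as fairness: a dataset built from a real-world use case, parameterized by gendered adjectives and product categories, and used to identify gaps in quality, veracity, safety, and fairness.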

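Details Motivation: Existing LLM evaluation methods mostly target high-level tasks and struggle to assess Responsible AI dimensions such as fairness, because the importance of sensitive attributes differs from one application to another. Method: Build a parameterized dataset grounded in a real application (generating product descriptions), combining fairness attributes such as gendered adjectives with product categories to produce labeled prompts, and use it to evaluate LLMs along several dimensions. Result: The dataset can be used to uncover LLM shortcomings in quality, veracity, safety, and fairness, and comes with a reproducible evaluation framework and a public resource. Conclusion: The work offers a workable approach and a practical data resource for use-case-specific Responsible AI evaluation of LLMs, supporting finer-grained model assessment. Abstract: Current methods for evaluating large language models (LLMs) typically focus on high-level tasks such as text generation, without targeting a particular AI application. This approach is not sufficient for evaluating LLMs for Responsible AI dimensions like fairness, since protected attributes that are highly relevant in one application may be less relevant in another. In this work, we construct a dataset that is driven by a real-world application (generate a plain-text product description, given a list of product features), parameterized by fairness attributes intersected with gendered adjectives and product categories, yielding a rich set of labeled prompts. We show how to use the data to identify quality, veracity, safety, and fairness gaps in LLMs, contributing a proposal for LLM evaluation paired with a concrete resource for the research community.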

[62] Alleviating Forgetfulness of Linear Attention by Hybrid Sparse Attention and Contextualized Learnable Token Eviction

Mutian He,Philip N. Garner

Main category: cs.CL

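TL;DR: This paper proposes a hybrid attention model that adds a learnable token-eviction mechanism and sliding-window attention, with a lightweight CNN aggregating information from neighboring tokens, to reduce the forgetfulness of linear-attention models on retrieval-intensive tasks while retaining linear complexity.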

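Details Motivation: Linear-attention models compress the whole input sequence into a fixed-size recurrent state, so their finite memory causes forgetting that hurts retrieval-intensive tasks. The authors want a model that keeps linear efficiency while restoring direct access to past tokens. Method: Propose a series of hybrid models that interleave sparse-attention mechanisms sitting between linear and full attention (such as sparse attention with token eviction and query-aware native sparse attention) and introduce a learnable token-eviction strategy; combined with sliding-window attention, an end-to-end trainable lightweight CNN adaptively retains the critical KV pairs per head. Efficient Triton kernels for sparse attention are also provided. Result: Experiments on several retrieval-intensive benchmarks show that the approach improves linear-attention models while keeping constant time and space complexity. Conclusion: The proposed hybrid attention design mitigates the forgetfulness of linear-attention models and improves performance on tasks that need long-range dependencies without sacrificing computational efficiency. Abstract: Linear-attention models that compress the entire input sequence into a fixed-size recurrent state offer an efficient alternative to Transformers, but their finite memory induces forgetfulness that harms retrieval-intensive tasks. To mitigate the issue, we explore a series of hybrid models that restore direct access to past tokens. We interleave token mixers with intermediate time and space complexity between linear and full attention, including sparse attention with token eviction, and the query-aware native sparse attention. Particularly, we propose a novel learnable token eviction approach. Combined with sliding-window attention, an end-to-end trainable lightweight CNN aggregates information from both past and future adjacent tokens to adaptively retain a limited set of critical KV-pairs per head, maintaining linear attention's constant time and space complexity. Efficient Triton kernels for the sparse attention mechanisms are provided. Empirical evaluations on retrieval-intensive benchmarks support the effectiveness of our approaches.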

[63] Simple Context Compression: Mean-Pooling and Multi-Ratio Training

Yair Feldman,Yoav Artzi

Main category: cs.CL

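TL;DR: Proposes a lightweight, simple mean-pooling method for soft context compression that outperforms the existing compression-tokens architecture across multiple QA datasets and models.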

Details Motivation: To reduce the computational cost of long contexts in retrieval-augmented generation (RAG) and make large language models more efficient. Method: Compress the input sequence with mean-pooling and train a single compressor to support multiple compression ratios. Result: Mean-pooling performs best on several in-domain and out-of-domain QA datasets, with only a small drop under multi-ratio training; across architectures and training regimes, however, the trade-offs are more nuanced. Conclusion: Simple mean-pooling is an efficient and robust context compression method that works across models and settings, illustrating both the promise and the complexity of soft compression methods. Abstract: A common strategy to reduce the computational costs of using long contexts in retrieval-augmented generation (RAG) with large language models (LLMs) is soft context compression, where the input sequence is transformed into a shorter continuous representation. We develop a lightweight and simple mean-pooling approach that consistently outperforms the widely used compression-tokens architecture, and study training the same compressor to output multiple compression ratios. We conduct extensive experiments across in-domain and out-of-domain QA datasets, as well as across model families, scales, and compression ratios. Overall, our simple mean-pooling approach achieves the strongest performance, with a relatively small drop when training for multiple compression ratios. More broadly though, across architectures and training regimes the trade-offs are more nuanced, illustrating the complex landscape of compression methods.
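
To make the idea of mean-pooling compression concrete, the sketch below averages consecutive token vectors in non-overlapping windows, so a ratio of `r` shortens a sequence of `n` vectors to roughly `n / r`. This is a hedged illustration, not the authors' implementation: the abstract does not say which representations are pooled, how windows are formed, or how a trailing partial window is handled, so the function name `mean_pool_compress` and all of those choices are assumptions.

```python
from typing import List

Vector = List[float]


def mean_pool_compress(embeddings: List[Vector], ratio: int) -> List[Vector]:
    """Average each run of `ratio` consecutive vectors into one vector.

    Assumption: non-overlapping windows; a shorter final window is averaged
    over the vectors it actually contains.
    """
    if ratio < 1:
        raise ValueError("ratio must be a positive integer")
    pooled = []
    for start in range(0, len(embeddings), ratio):
        window = embeddings[start:start + ratio]
        dim = len(window[0])
        pooled.append([sum(vec[d] for vec in window) / len(window) for d in range(dim)])
    return pooled


if __name__ == "__main__":
    tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
    # The same function serves several compression ratios.
    for r in (1, 2, 4):
        print(r, mean_pool_compress(tokens, r))
```

Since the pooling step itself has no ratio-specific parameters, one compressor can be asked for different ratios at inference time, which is what makes multi-ratio training a natural thing to study.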

[64] On the Detectability of LLM-Generated Text: What Exactly Is LLM-Generated Text?

Mingmeng Geng,Thierry Poibeau

Main category: cs.CL

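TL;DR: This paper discusses the challenges facing detection of LLM-generated text: the lack of a clear definition of "LLM-generated text", the blurred boundary between human edits and model output, and inadequate benchmarks. It argues that detection results should be treated as references only, not as decisive evidence.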

Details Motivation: With LLMs in wide use, researchers have focused on detecting the text they generate, but the lack of a shared definition and of evaluation standards for real-world conditions leads to detection results being misread. Method: Analyze the limitations of current detection targets, the diversity of usage scenarios, and the effect of human edits to LLM output, in order to expose the shortcomings of current detection methods and benchmarks. Result: Current detectors are subject to misinterpretation in real-world use, the significance of their numerical results is diminishing, and they cannot cover the full range of text that LLMs may produce. Conclusion: Detectors are useful only under specific conditions, and their results should be treated as references rather than decisive indicators. Abstract: With the widespread use of large language models (LLMs), many researchers have turned their attention to detecting text generated by them. However, there is no consistent or precise definition of their target, namely "LLM-generated text". Differences in usage scenarios and the diversity of LLMs further increase the difficulty of detection. What is commonly regarded as the detecting target usually represents only a subset of the text that LLMs can potentially produce. Human edits to LLM outputs, together with the subtle influences that LLMs exert on their users, are blurring the line between LLM-generated and human-written text. Existing benchmarks and evaluation approaches do not adequately address the various conditions in real-world detector applications. Hence, the numerical results of detectors are often misunderstood, and their significance is diminishing. Therefore, detectors remain useful under specific conditions, but their results should be interpreted only as references rather than decisive indicators.

cs.CV [Back]

[65] Fourier-Based GAN Fingerprint Detection using ResNet50

Sai Teja Erukude,Viswa Chaitanya Marella,Suhasnadh Reddy Veluru

Main category: cs.CV

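TL;DR: Proposes a method that combines frequency-domain analysis with deep learning: a 2D DFT exposes periodic artifacts in StyleGAN-generated images, and a ResNet50 classifies them, substantially improving forgery detection.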

Details Motivation: To address the challenge that photorealistic images from Generative Adversarial Networks (GANs) pose for image forensics and content authenticity verification. Method: Transform images into the frequency domain with a two-dimensional Discrete Fourier Transform (2D DFT) to expose periodic artifacts, then train a ResNet50 classifier on the frequency-domain images to separate real from synthetic images. Result: The method detects StyleGAN-generated images with 92.8% accuracy and an AUC of 0.95, significantly better than a model trained directly on spatial-domain images. Conclusion: GAN-generated images carry identifiable frequency-domain "fingerprints"; combining signal processing with deep learning improves forgery detection and has industrial potential. Abstract: The rapid rise of photorealistic images produced from Generative Adversarial Networks (GANs) poses a serious challenge for image forensics and industrial systems requiring reliable content authenticity. This paper uses frequency-domain analysis combined with deep learning to solve the problem of distinguishing StyleGAN-generated images from real ones. Specifically, a two-dimensional Discrete Fourier Transform (2D DFT) was applied to transform images into the Fourier domain, where subtle periodic artifacts become detectable. A ResNet50 neural network is trained on these transformed images to differentiate between real and synthetic ones. The experiments demonstrate that the frequency-domain model achieves 92.8 percent accuracy and an AUC of 0.95, significantly outperforming the equivalent model trained on raw spatial-domain images. These results indicate that the GAN-generated images have unique frequency-domain signatures or "fingerprints". The method proposed highlights the industrial potential of combining signal processing techniques and deep learning to enhance digital forensics and strengthen the trustworthiness of industrial AI systems.
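
The preprocessing step described above can be illustrated with a small, dependency-free sketch: a naive 2D DFT applied to a toy image that contains a periodic pattern, showing how the pattern turns into isolated peaks in the spectrum. This is only an illustration of the transform, not the authors' pipeline. The function names, the toy 8x8 image, and the log-magnitude scaling are assumptions; a real pipeline would normally use a library FFT such as `numpy.fft.fft2` rather than this O(N^4) loop.

```python
import cmath
import math
from typing import List


def dft2(image: List[List[float]]) -> List[List[complex]]:
    """Naive two-dimensional discrete Fourier transform (fine for tiny inputs only)."""
    rows, cols = len(image), len(image[0])
    spectrum = [[0j] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            total = 0j
            for x in range(rows):
                for y in range(cols):
                    angle = -2.0 * math.pi * (u * x / rows + v * y / cols)
                    total += image[x][y] * cmath.exp(1j * angle)
            spectrum[u][v] = total
    return spectrum


def log_magnitude(spectrum: List[List[complex]]) -> List[List[float]]:
    """Log-scaled magnitude, a common way to turn a spectrum into a viewable image."""
    return [[math.log1p(abs(c)) for c in row] for row in spectrum]


# Toy 8x8 "image": constant brightness plus a horizontal ripple with 2 cycles per row.
SIZE = 8
toy_image = [[1.0 + math.cos(2.0 * math.pi * 2 * y / SIZE) for y in range(SIZE)]
             for _ in range(SIZE)]

if __name__ == "__main__":
    for row in log_magnitude(dft2(toy_image)):
        print(" ".join(f"{value:5.2f}" for value in row))
```

In the printed spectrum, energy sits only at the zero-frequency bin and at the two bins matching the ripple frequency; a classifier trained on such spectra can pick up periodic artifacts that are hard to see in pixel space.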

[66] Transformed Multi-view 3D Shape Features with Contrastive Learning

Márcus Vinícius Lobo Costa,Sherlon Almeida da Silva,Bárbara Caroline Benato,Leo Sampaio Ferraz Ribeiro,Moacir Antonelli Ponti

Main category: cs.CV

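TL;DR: This paper studies 3D shape representation learning with Vision Transformers (ViTs) and contrastive learning, showing that the combination is effective for multi-view 3D analysis and overcomes the reliance of conventional CNNs on labeled data and their limits in capturing shape relationships.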

Details Motivation: To address the reliance on large amounts of labeled data in existing methods for recognizing 3D objects from 2D images, and the difficulty CNNs have in capturing crucial shape relationships. Method: Use Vision Transformer architectures with supervised and self-supervised contrastive objectives to learn 3D shape feature representations. Result: On ModelNet10, supervised contrastive losses reached about 90.6% accuracy, and the experiments support the approach's advantage in capturing global shape semantics and local discriminative features. Conclusion: ViTs combined with contrastive learning improve 3D shape understanding and reduce the dependence on labeled data, offering a viable new route for 3D representation learning. Abstract: This paper addresses the challenges in representation learning of 3D shape features by investigating state-of-the-art backbones paired with both contrastive supervised and self-supervised learning objectives. Computer vision methods struggle with recognizing 3D objects from 2D images, often requiring extensive labeled data and relying on Convolutional Neural Networks (CNNs) that may overlook crucial shape relationships. Our work demonstrates that Vision Transformers (ViTs) based architectures, when paired with modern contrastive objectives, achieve promising results in multi-view 3D analysis on our downstream tasks, unifying contrastive and 3D shape understanding pipelines. For example, supervised contrastive losses reached about 90.6% accuracy on ModelNet10. The use of ViTs and contrastive learning, leveraging ViTs' ability to understand overall shapes and contrastive learning's effectiveness, overcomes the need for extensive labeled data and the limitations of CNNs in capturing crucial shape relationships. The success stems from capturing global shape semantics via ViTs and refining local discriminative features through contrastive optimization. Importantly, our approach is empirical, as it is grounded on extensive experimental evaluation to validate the effectiveness of combining ViTs with contrastive objectives for 3D representation learning.
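
As a rough guide to what a supervised contrastive objective computes, the sketch below implements one widely used formulation: for each anchor, same-label embeddings are pulled together and all other embeddings in the batch are pushed away, with cosine similarity scaled by a temperature. The abstract does not state which exact loss variant, temperature, or batch construction the authors use, so the formulation, the function names, and the default temperature of 0.1 are assumptions for illustration.

```python
import math
from typing import List, Sequence


def _normalize(vec: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def supcon_loss(embeddings: List[Sequence[float]], labels: List[int],
                temperature: float = 0.1) -> float:
    """Supervised contrastive loss averaged over anchors that have at least one positive."""
    z = [_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor with no same-label partner contributes nothing
        logits = {a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
                  for a in range(n) if a != i}
        top = max(logits.values())  # subtract the max for numerical stability
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits.values()))
        total += -sum(logits[p] - log_denominator for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


if __name__ == "__main__":
    views = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print("labels match geometry:", supcon_loss(views, [0, 0, 1, 1]))
    print("labels cut across it: ", supcon_loss(views, [0, 1, 0, 1]))
```

The loss is small when embeddings that share a label are already close and large when the labels cut across the geometry; in a multi-view setting the embeddings would come from different rendered views of 3D shapes.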

[67] FutrTrack: A Camera-LiDAR Fusion Transformer for 3D Multiple Object Tracking

Martha Teiko Teye,Ori Maoz,Matthias Rottmann

Main category: cs.CV

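TL;DR: FutrTrack is a camera-LiDAR multi-object tracking framework that uses a transformer-based smoother and a fusion-driven tracker, achieving strong 3D MOT performance on the nuScenes and KITTI datasets.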

Details Motivation: Existing single-sensor 3D tracking methods suffer from trajectory jitter and identity switches in complex scenes, so a more robust multimodal fusion approach is needed. Method: Propose a two-stage transformer refinement and tracking pipeline that uses multimodal BEV fusion features to assign identities without an explicit motion model, and add a temporal smoother to improve trajectory consistency. Result: The method reaches 74.7 aMOTA on the nuScenes test set, substantially reducing identity switches while keeping accuracy high, and performs strongly on 3D multi-object tracking. Conclusion: FutrTrack shows that multimodal sensor features benefit query-based transformer trackers and provides an efficient framework that improves performance without pretraining. Abstract: We propose FutrTrack, a modular camera-LiDAR multi-object tracking framework that builds on existing 3D detectors by introducing a transformer-based smoother and a fusion-driven tracker. Inspired by query-based tracking frameworks, FutrTrack employs a multimodal two-stage transformer refinement and tracking pipeline. Our fusion tracker integrates bounding boxes with multimodal bird's-eye-view (BEV) fusion features from multiple cameras and LiDAR without the need for an explicit motion model. The tracker assigns and propagates identities across frames, leveraging both geometric and semantic cues for robust re-identification under occlusion and viewpoint changes. Prior to tracking, we refine sequences of bounding boxes with a temporal smoother over a moving window to refine trajectories, reduce jitter, and improve spatial consistency. Evaluated on nuScenes and KITTI, FutrTrack demonstrates that query-based transformer tracking methods benefit significantly from multimodal sensor features compared with previous single-sensor approaches. With an aMOTA of 74.7 on the nuScenes test set, FutrTrack achieves strong performance on 3D MOT benchmarks, reducing identity switches while maintaining competitive accuracy. Our approach provides an efficient framework for improving transformer-based trackers to compete with other neural-network-based methods even with limited data and without pretraining.

[68] Improving Predictive Confidence in Medical Imaging via Online Label Smoothing

Kushan Choudhury,Shubhrodeep Roy,Ankur Chanda,Shubhajit Biswas,Somenath Kuiry

Main category: cs.CV

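TL;DR: This study examines Online Label Smoothing (OLS) for medical image classification and finds that it outperforms conventional methods in both accuracy and learned feature representations.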

Details Motivation: Deep learning models do well at medical image classification but often produce overconfident predictions, which undermines their reliability in critical healthcare settings. Conventional label smoothing ignores relationships between classes, which limits its effect. Method: Apply Online Label Smoothing (OLS), which adjusts soft labels during training according to the model's own prediction patterns, and evaluate it on the RadImageNet dataset with three widely used architectures: ResNet-50, MobileNetV2, and VGG-19. Result: OLS improves Top-1 and Top-5 accuracy over standard training methods (hard labels, conventional label smoothing, and teacher-free knowledge distillation) and yields more compact, better-separated feature embeddings, indicating improved representation learning. Conclusion: OLS improves predictive performance and calibration and offers a practical, effective way to build trustworthy AI systems for medical imaging. Abstract: Deep learning models, especially convolutional neural networks, have achieved impressive results in medical image classification. However, these models often produce overconfident predictions, which can undermine their reliability in critical healthcare settings. While traditional label smoothing offers a simple way to reduce such overconfidence, it fails to consider relationships between classes by treating all non-target classes equally. In this study, we explore the use of Online Label Smoothing (OLS), a dynamic approach that adjusts soft labels throughout training based on the model's own prediction patterns. We evaluate OLS on the large-scale RadImageNet dataset using three widely used architectures: ResNet-50, MobileNetV2, and VGG-19. Our results show that OLS consistently improves both Top-1 and Top-5 classification accuracy compared to standard training methods, including hard labels, conventional label smoothing, and teacher-free knowledge distillation. In addition to accuracy gains, OLS leads to more compact and well-separated feature embeddings, indicating improved representation learning. These findings suggest that OLS not only strengthens predictive performance but also enhances calibration, making it a practical and effective solution for developing trustworthy AI systems in the medical imaging domain.
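
The contrast drawn above between conventional and online label smoothing can be sketched in a few lines. The first function is the usual uniform smoothing, where every non-target class receives the same mass. The second is a hedged reading of the online variant: per-class soft labels are re-estimated from the model's predicted probabilities on correctly classified samples. The abstract only says soft labels are adjusted "based on the model's own prediction patterns", so the update rule, the uniform fallback for classes with no correct predictions, and both function names are assumptions rather than the authors' exact procedure.

```python
from typing import List


def smooth_uniform(label: int, num_classes: int, eps: float = 0.1) -> List[float]:
    """Conventional label smoothing: spread eps uniformly over all classes."""
    return [(1.0 - eps) + eps / num_classes if k == label else eps / num_classes
            for k in range(num_classes)]


def update_soft_labels(probs_batch: List[List[float]], labels: List[int],
                       num_classes: int) -> List[List[float]]:
    """Assumed online update: average predicted distributions of correct predictions per class."""
    sums = [[0.0] * num_classes for _ in range(num_classes)]
    counts = [0] * num_classes
    for probs, label in zip(probs_batch, labels):
        predicted = max(range(num_classes), key=lambda k: probs[k])
        if predicted == label:  # only correct predictions shape the soft label
            counts[label] += 1
            for k in range(num_classes):
                sums[label][k] += probs[k]
    soft = []
    for c in range(num_classes):
        if counts[c] == 0:
            soft.append([1.0 / num_classes] * num_classes)  # assumed fallback: uniform
        else:
            soft.append([s / counts[c] for s in sums[c]])
    return soft


if __name__ == "__main__":
    print(smooth_uniform(1, 4))
    predictions = [[0.7, 0.2, 0.1], [0.9, 0.05, 0.05], [0.3, 0.6, 0.1], [0.5, 0.4, 0.1]]
    print(update_soft_labels(predictions, [0, 0, 1, 1], 3))
```

Unlike the uniform version, the learned rows are not symmetric: a class that the model tends to confuse with a particular neighbor gives that neighbor more probability mass, which is how class relationships enter the targets.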

[69] A Unified Detection Pipeline for Robust Object Detection in Fisheye-Based Traffic Surveillance

Neema Jakisa Owor,Joshua Kofi Asamoah,Tanner Wambui Muturi,Anneliese Jakisa Owor,Blessing Agyei Kyem,Andrews Danyo,Yaw Adu-Gyamfi,Armstrong Aboah

Main category: cs.CV

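TL;DR: Proposes a detection framework for fisheye camera imagery that combines a pre- and post-processing pipeline with an ensemble of state-of-the-art detectors. It achieved an F1 score of 0.6366 on the 2025 AI City Challenge Track 4, ranking 8th, which supports its effectiveness against fisheye distortion.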

Details Motivation: Fisheye cameras give wide-field traffic surveillance, but their strong radial distortion and nonuniform resolution challenge standard object detectors, especially near image boundaries, so a detection method that works robustly under these conditions is needed. Method: Design a simple but effective pre- and post-processing pipeline, train several state-of-the-art detection models, and fuse their outputs with an ensemble strategy to improve accuracy. Result: An F1 score of 0.6366 on the 2025 AI City Challenge Track 4, placing 8th overall out of 62 teams. Conclusion: The proposed framework handles the distortion and uneven resolution of fisheye imagery effectively and improves the consistency and accuracy of object detection. Abstract: Fisheye cameras offer an efficient solution for wide-area traffic surveillance by capturing large fields of view from a single vantage point. However, the strong radial distortion and nonuniform resolution inherent in fisheye imagery introduce substantial challenges for standard object detectors, particularly near image boundaries where object appearance is severely degraded. In this work, we present a detection framework designed to operate robustly under these conditions. Our approach employs a simple yet effective pre and post processing pipeline that enhances detection consistency across the image, especially in regions affected by severe distortion. We train several state-of-the-art detection models on the fisheye traffic imagery and combine their outputs through an ensemble strategy to improve overall detection accuracy. Our method achieves an F1 score of 0.6366 on the 2025 AI City Challenge Track 4, placing 8th overall out of 62 teams. These results demonstrate the effectiveness of our framework in addressing issues inherent to fisheye imagery.

[70] Extreme Views: 3DGS Filter for Novel View Synthesis from Out-of-Distribution Camera Poses

Damian Bowness,Charalambos Poullis

Main category: cs.CV

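TL;DR: Proposes a real-time, render-aware filtering method that reduces visual noise in 3D Gaussian Splatting when viewing from outside the training distribution, improving visual quality and consistency.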

Details Motivation: At viewpoints far from the training data, 3D Gaussian Splatting often shows substantial visual noise because the model's predictions of density, color, and geometry are uncertain. Method: Compute sensitivity scores from intermediate gradients and filter out instabilities caused by anisotropic orientations, directly addressing generative uncertainty; the filter integrates into existing 3DGS rendering pipelines in real time. Result: Experiments show clear gains in visual quality, realism, and consistency over NeRF-based baselines such as BayesRays, and the method runs in real time without retraining. Conclusion: The proposed filter suppresses 3DGS rendering artifacts in extrapolated regions and supports high-quality 3D reconstruction under free-viewpoint navigation. Abstract: When viewing a 3D Gaussian Splatting (3DGS) model from camera positions significantly outside the training data distribution, substantial visual noise commonly occurs. These artifacts result from the lack of training data in these extrapolated regions, leading to uncertain density, color, and geometry predictions from the model. To address this issue, we propose a novel real-time render-aware filtering method. Our approach leverages sensitivity scores derived from intermediate gradients, explicitly targeting instabilities caused by anisotropic orientations rather than isotropic variance. This filtering method directly addresses the core issue of generative uncertainty, allowing 3D reconstruction systems to maintain high visual fidelity even when users freely navigate outside the original training viewpoints. Experimental evaluation demonstrates that our method substantially improves visual quality, realism, and consistency compared to existing Neural Radiance Field (NeRF)-based approaches such as BayesRays. Critically, our filter seamlessly integrates into existing 3DGS rendering pipelines in real-time, unlike methods that require extensive post-hoc retraining or fine-tuning. Code and results at https://damian-bowness.github.io/EV3DGS

[71] BrainPuzzle: Hybrid Physics and Data-Driven Reconstruction for Transcranial Ultrasound Tomography

Shengyu Chen,Shihang Feng,Yi Luo,Xiaowei Jia,Youzuo Lin

Main category: cs.CV

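TL;DR: This paper proposes BrainPuzzle, a hybrid two-stage framework for reconstructing quantitative speed-of-sound (SoS) maps in transcranial ultrasound brain imaging. It combines physical modeling with machine learning to overcome the limits of conventional full-waveform inversion and purely data-driven methods under low signal-to-noise and sparse-aperture conditions.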

Details Motivation: Ultrasound brain imaging is difficult because of the large sound-speed difference between skull and brain tissue and the difficulty of coupling probes to the skull. Conventional methods are limited by skull-induced attenuation, mode conversion, and phase aberration, and by incomplete aperture coverage. Method: In the first stage, reverse time migration (time-reversal acoustics) is applied to multi-angle acquisitions to produce migration fragments that preserve structural detail; in the second stage, a transformer-based super-resolution encoder-decoder with a graph-based attention unit (GAU) fuses the fragments into a coherent, quantitatively accurate SoS image. A movable low-count transducer array is used for partial-aperture acquisition to improve clinical feasibility. Result: Experiments on two synthetic datasets show that BrainPuzzle outperforms existing methods in SoS reconstruction accuracy and image completeness. Conclusion: By combining physical models with deep learning, BrainPuzzle achieves more accurate quantitative ultrasound brain imaging under low-SNR and sparse-aperture conditions and has the potential to advance the field. Abstract: Ultrasound brain imaging remains challenging due to the large difference in sound speed between the skull and brain tissues and the difficulty of coupling large probes to the skull. This work aims to achieve quantitative transcranial ultrasound by reconstructing an accurate speed-of-sound (SoS) map of the brain. Traditional physics-based full-waveform inversion (FWI) is limited by weak signals caused by skull-induced attenuation, mode conversion, and phase aberration, as well as incomplete spatial coverage since full-aperture arrays are clinically impractical. In contrast, purely data-driven methods that learn directly from raw ultrasound data often fail to model the complex nonlinear and nonlocal wave propagation through bone, leading to anatomically plausible but quantitatively biased SoS maps under low signal-to-noise and sparse-aperture conditions. To address these issues, we propose BrainPuzzle, a hybrid two-stage framework that combines physical modeling with machine learning. In the first stage, reverse time migration (time-reversal acoustics) is applied to multi-angle acquisitions to produce migration fragments that preserve structural details even under low SNR. In the second stage, a transformer-based super-resolution encoder-decoder with a graph-based attention unit (GAU) fuses these fragments into a coherent and quantitatively accurate SoS image. A partial-array acquisition strategy using a movable low-count transducer set improves feasibility and coupling, while the hybrid algorithm compensates for the missing aperture. Experiments on two synthetic datasets show that BrainPuzzle achieves superior SoS reconstruction accuracy and image completeness, demonstrating its potential for advancing quantitative ultrasound brain imaging.

[72] Exposing Blindspots: Cultural Bias Evaluation in Generative Image Models

Huichan Seo,Sieun Choi,Minki Hong,Yi Zhou,Junseo Kim,Lukman Ismaila,Naome Etori,Mehul Agarwal,Zhixuan Liu,Jihie Kim,Jean Oh

Main category: cs.CV

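TL;DR: This study proposes a unified evaluation framework for detecting cultural bias in text-to-image (T2I) and image-to-image (I2I) generative models, covering six countries, multiple categories, and era-aware prompts. Combining automatic metrics, culture-aware visual question answering, and evaluation by native experts, it finds that existing models default to modern Global-North imagery and often lose cultural fidelity during editing.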

Details Motivation: Existing generative models often misrepresent culture, and image-to-image editing in particular lacks a systematic evaluation of cultural bias. A standardized, cross-cultural, reproducible evaluation framework is needed to expose and track such problems. Method: Builds a cultural taxonomy covering six countries with 8 categories and 36 subcategories, designs era-aware prompts, and runs T2I generation and I2I editing with open models under fixed settings; evaluation combines automatic metrics, a culture-aware retrieval-augmented VQA, and expert human judgments from native reviewers. Result: (1) Under country-agnostic prompts, models default to modern Global-North styles and flatten cross-country distinctions; (2) iterative I2I editing erodes cultural fidelity even when automatic metrics stay flat or improve; (3) I2I models apply only superficial changes (palette shifts, generic props) rather than era-consistent, context-aware edits, often retaining source identity for Global-South targets. Conclusion: Current generative models remain unreliable for culture-sensitive edits. By releasing standardized data, prompts, and human evaluation protocols, the paper establishes a reproducible, culture-centered benchmark for diagnosing and tracking cultural bias in generative models. Abstract: Generative image models produce striking visuals yet often misrepresent culture. Prior work has examined cultural bias mainly in text-to-image (T2I) systems, leaving image-to-image (I2I) editors underexplored. We bridge this gap with a unified evaluation across six countries, an 8-category/36-subcategory schema, and era-aware prompts, auditing both T2I generation and I2I editing under a standardized protocol that yields comparable diagnostics. Using open models with fixed settings, we derive cross-country, cross-era, and cross-category evaluations. Our framework combines standard automatic metrics, a culture-aware retrieval-augmented VQA, and expert human judgments collected from native reviewers. To enable reproducibility, we release the complete image corpus, prompts, and configurations. Our study reveals three findings: (1) under country-agnostic prompts, models default to Global-North, modern-leaning depictions that flatten cross-country distinctions; (2) iterative I2I editing erodes cultural fidelity even when conventional metrics remain flat or improve; and (3) I2I models apply superficial cues (palette shifts, generic props) rather than era-consistent, context-aware changes, often retaining source identity for Global-South targets. These results highlight that culture-sensitive edits remain unreliable in current systems. By releasing standardized data, prompts, and human evaluation protocols, we provide a reproducible, culture-centered benchmark for diagnosing and tracking cultural bias in generative image models.

[73] Filter-Based Reconstruction of Images from Events

Bernd Pfrommer

Main category: cs.CV

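TL;DR: This paper presents FIBAR, an asynchronous filter-based method for reconstructing intensity images from the event stream of a moving event camera. It relies on IIR filtering and Gaussian denoising, needs no neural network, and runs efficiently on a CPU. It is adequate for tasks such as fiducial marker detection but suffers from noise and ghosting.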

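Details Motivation: Existing event-camera image reconstruction methods mostly depend on neural networks and GPUs, which makes them computationally expensive and poorly suited to real-time asynchronous output. The paper aims for a simpler, lightweight asynchronous method that runs on a CPU and suits resource-constrained or low-latency applications. Method: Integrates event-signaled intensity changes with a temporal digital IIR filter; detects stale pixels (pixels not updated for a long time) with a new algorithm that regulates a window of recently updated pixels; and, assuming that for a moving camera the absence of events implies a low image gradient, blurs stale pixels with a Gaussian filter to reduce noise. Result: FIBAR processes about 42 million events/s with spatial filtering enabled and about 140 million events/s without it on a modern laptop CPU. Compared with neural network methods such as FireNet, its reconstructions are noisier and show ghost images, yet they remain sufficient for tasks such as fiducial marker detection. Conclusion: FIBAR is a lightweight, asynchronous reconstruction method that requires no deep learning and runs in real time on a CPU. Its reconstruction quality falls short of neural network methods, but it is sufficient for certain applications and offers efficiency and flexibility. Abstract: Reconstructing an intensity image from the events of a moving event camera is a challenging task that is typically approached with neural networks deployed on graphics processing units. This paper presents a much simpler, FIlter Based Asynchronous Reconstruction method (FIBAR). First, intensity changes signaled by events are integrated with a temporal digital IIR filter. To reduce reconstruction noise, stale pixels are detected by a novel algorithm that regulates a window of recently updated pixels. Arguing that for a moving camera, the absence of events at a pixel location likely implies a low image gradient, stale pixels are then blurred with a Gaussian filter. In contrast to most existing methods, FIBAR is asynchronous and permits image read-out at an arbitrary time. It runs on a modern laptop CPU at about 42(140) million events/s with (without) spatial filtering enabled. A few simple qualitative experiments are presented that show the difference in image reconstruction between FIBAR and a neural network-based approach (FireNet). FIBAR's reconstruction is noisier than neural network-based methods and suffers from ghost images. However, it is sufficient for certain tasks such as the detection of fiducial markers. Code is available at https://github.com/ros-event-camera/event_image_reconstruction_fibar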
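As a rough illustration of the first step described above (integrating event polarities with a temporal IIR filter), the sketch below uses a first-order leaky integrator per pixel. This is not the FIBAR implementation: the function name `integrate_events`, the `decay` and `contrast` parameters, and the `(x, y, polarity)` event layout are assumptions chosen for the example, and stale-pixel detection and Gaussian blurring are left out.

```python
def integrate_events(events, width, height, decay=0.9, contrast=0.1):
    """Leaky per-pixel integration of event polarities (first-order IIR).

    events: iterable of (x, y, polarity), polarity is +1 (brighter) or -1 (darker).
    decay and contrast are illustrative values, not taken from the paper.
    """
    image = [[0.0] * width for _ in range(height)]
    for x, y, polarity in events:
        # Each event nudges the pixel state; older contributions fade by `decay`.
        image[y][x] = decay * image[y][x] + contrast * polarity
    return image


events = [(0, 0, +1), (0, 0, +1), (1, 0, -1)]
image = integrate_events(events, width=2, height=1)
```

Because the state is updated per event and not per frame, the image can be read out at any time, which is the asynchronous property the abstract emphasizes.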

[74] Data-Adaptive Transformed Bilateral Tensor Low-Rank Representation for Clustering

Hui Chen,Xinjie Wang,Xianchao Xiu,Wanquan Liu

Main category: cs.CV

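TL;DR: Proposes TBTLRR, a new transformed bilateral tensor low-rank representation model. It learns data-adaptive unitary transforms to improve robustness to noise, uses a bilateral structure to capture local correlations between image samples and features, and combines $\ell_{1/2}$-norm and Frobenius norm regularization to better handle complex noise. It outperforms existing methods on clustering tasks.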

Details Motivation: Existing tensor low-rank representation methods rely on fixed transforms, are not robust to noise, and struggle to capture both global and local correlations in image data. Method: Proposes the TBTLRR model, which learns arbitrary unitary transforms to obtain a data-adaptive tensor nuclear norm, models local correlations through a bilateral tensor structure, and handles complex noise with $\ell_{1/2}$-norm and Frobenius norm regularization; an efficient ADMM-based optimization algorithm is developed with a convergence proof. Result: Experiments show that TBTLRR outperforms current state-of-the-art methods on image clustering, with stronger noise robustness and correlation modeling. Conclusion: Through data-adaptive transforms and a bilateral structure, TBTLRR improves the performance and robustness of tensor low-rank representation for image clustering and offers a new approach to complex noise in real-world scenarios. Abstract: Tensor low-rank representation (TLRR) has demonstrated significant success in image clustering. However, most existing methods rely on fixed transformations and suffer from poor robustness to noise. In this paper, we propose a novel transformed bilateral tensor low-rank representation model called TBTLRR, which introduces a data-adaptive tensor nuclear norm by learning arbitrary unitary transforms, allowing for more effective capture of global correlations. In addition, by leveraging the bilateral structure of latent tensor data, TBTLRR is able to exploit local correlations between image samples and features. Furthermore, TBTLRR integrates the $\ell_{1/2}$-norm and Frobenius norm regularization terms for better dealing with complex noise in real-world scenarios. To solve the proposed nonconvex model, we develop an efficient optimization algorithm inspired by the alternating direction method of multipliers (ADMM) and provide theoretical convergence. Extensive experiments validate its superiority over the state-of-the-art methods in clustering. The code will be available at https://github.com/xianchaoxiu/TBTLRR.

[75] Endoshare: A Source Available Solution to De-Identify and Manage Surgical Videos

Lorenzo Arboit,Dennis N. Schneider,Britty Baby,Vinkle Srivastav,Pietro Mascagni,Nicolas Padoy

Main category: cs.CV

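TL;DR: Endoshare is a source-available, cross-platform application for merging, standardizing, and de-identifying endoscopic videos from minimally invasive surgery, supporting privacy-preserving surgical video management.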

Details Motivation: To advance the use of video-based assessment and surgical data science in surgical training, research, and quality improvement, while addressing heterogeneous video formats and the privacy concerns of video sharing. Method: Development followed the software development life cycle with iterative, user-centered feedback; internal and external clinician surveys based on ten usability heuristics and the Technology Acceptance Model were combined with performance benchmarking across hardware configurations. Result: Prototype testing showed high usability (4.68±0.40/5 from clinicians, 4.03±0.51/5 from computer scientists). After refinement, surgeons rated usefulness (5.07±1.75/7), ease of use (5.15±1.71/7), heuristic usability (4.38±0.48/5), and likelihood of recommendation (9.20±0.79/10) highly. Processing time varied with processing mode and video duration (both p≤0.001) and with computational power (p=0.041). Conclusion: Endoshare provides a transparent, user-friendly pipeline for standardized, privacy-preserving surgical video management, but compliance certification and broader interoperability validation are still needed before it can replace proprietary systems. Abstract: Video-based assessment and surgical data science can advance surgical training, research, and quality improvement. However, widespread use remains limited by heterogeneous recording formats and privacy concerns associated with video sharing. We present Endoshare, a source-available, cross-platform application for merging, standardizing, and de-identifying endoscopic videos in minimally invasive surgery. Development followed the software development life cycle with iterative, user-centered feedback. During the analysis phase, an internal survey of clinicians and computer scientists based on ten usability heuristics identified key requirements that guided a privacy-by-design architecture. In the testing phase, an external clinician survey combined the same heuristics with Technology Acceptance Model constructs to assess usability and adoption, complemented by benchmarking across different hardware configurations. Four clinicians and four computer scientists initially tested the prototype, reporting high usability (4.68 +/- 0.40/5 and 4.03 +/- 0.51/5), with the lowest score (4.00 +/- 0.93/5) relating to label clarity. After refinement, the testing phase surveyed ten surgeons who reported high perceived usefulness (5.07 +/- 1.75/7), ease of use (5.15 +/- 1.71/7), heuristic usability (4.38 +/- 0.48/5), and strong recommendation (9.20 +/- 0.79/10). Processing time varied with processing mode, video duration (both p <= 0.001), and machine computational power (p = 0.041).
Endoshare provides a transparent, user-friendly pipeline for standardized, privacy-preserving surgical video management. Compliance certification and broader interoperability validation are needed to establish it as a deployable alternative to proprietary systems. The software is available at https://camma-public.github.io/Endoshare/

[76] Attentive Convolution: Unifying the Expressivity of Self-Attention with Convolutional Efficiency

Hao Yu,Haoyu Chen,Yan Jiang,Wei Peng,Zhaodong Sun,Samuel Kaski,Guoying Zhao

Main category: cs.CV

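TL;DR: This paper proposes Attentive Convolution (ATConv), a new convolutional operator that builds in two principles from self-attention, adaptive routing and lateral inhibition. It substantially improves the expressivity of convolutional networks and outperforms self-attention on image classification and generation tasks while keeping linear complexity.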

Details Motivation: Self-attention is highly expressive but computationally expensive, while convolution is efficient but less expressive. The paper investigates the fundamental reasons self-attention outperforms convolution and uses them to improve convolution design. Method: Identifies two key principles behind self-attention's advantage, adaptive routing and lateral inhibition, and injects them into the convolution operation, yielding Attentive Convolution (ATConv) and the AttNet family of networks built on it. Result: With only 3×3 kernels, ATConv outperforms various self-attention mechanisms on several vision tasks; AttNet reaches 84.4% Top-1 accuracy on ImageNet-1K with 27M parameters; replacing self-attention in a diffusion model lowers FID by 0.15 with faster sampling. Conclusion: By absorbing the core strengths of self-attention, convolution can achieve stronger expressivity and better performance, supporting a renaissance of convolutional networks. Abstract: Self-attention (SA) has become the cornerstone of modern vision backbones for its powerful expressivity over traditional Convolutions (Conv). However, its quadratic complexity remains a critical bottleneck for practical applications. Given that Conv offers linear complexity and strong visual priors, continuing efforts have been made to promote the renaissance of Conv. However, a persistent performance chasm remains, highlighting that these modernizations have not yet captured the intrinsic expressivity that defines SA. In this paper, we re-examine the design of the CNNs, directed by a key question: what principles give SA its edge over Conv? As a result, we reveal two fundamental insights that challenge the long-standing design intuitions in prior research (e.g., Receptive field). The two findings are: (1) \textit{Adaptive routing}: SA dynamically regulates positional information flow according to semantic content, whereas Conv employs static kernels uniformly across all positions. (2) \textit{Lateral inhibition}: SA induces score competition among token weighting, effectively suppressing redundancy and sharpening representations, whereas Conv filters lack such inhibitory dynamics and exhibit considerable redundancy. Based on this, we propose \textit{Attentive Convolution} (ATConv), a principled reformulation of the convolutional operator that intrinsically injects these principles. Interestingly, with only $3\times3$ kernels, ATConv consistently outperforms various SA mechanisms in fundamental vision tasks.
Building on ATConv, we introduce AttNet, a CNN family that can attain \textbf{84.4\%} ImageNet-1K Top-1 accuracy with only 27M parameters. In diffusion-based image generation, replacing all SA with the proposed $3\times 3$ ATConv in SiT-XL/2 reduces ImageNet FID by 0.15 in 400k steps with faster sampling. Code is available at: github.com/price112/Attentive-Convolution.

[77] StableSketcher: Enhancing Diffusion Model for Pixel-based Sketch Generation via Visual Question Answering Feedback

Jiho Park,Sieun Choi,Jaeyoon Seo,Jihie Kim

Main category: cs.CV

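TL;DR: Proposes StableSketcher, a framework that fine-tunes the variational autoencoder and adds a reward function based on visual question answering to improve the prompt fidelity and semantic consistency of hand-drawn sketches generated by diffusion models. It also introduces SketchDUO, described as the first dataset of instance-level sketches paired with captions and question-answer pairs.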

Details Motivation: Diffusion models still struggle to generate abstract expressions such as pixel-based hand-drawn sketches, and high-quality annotated datasets that support semantic alignment in sketch generation are lacking. Method: The StableSketcher framework (1) fine-tunes the variational autoencoder to optimize latent decoding so that it better captures sketch characteristics, and (2) applies reinforcement learning with a reward function based on visual question answering to improve text-image alignment and semantic consistency. The SketchDUO dataset of sketches, captions, and question-answer pairs is built alongside. Result: Experiments show that StableSketcher generates sketches with more consistent style and better prompt alignment than the Stable Diffusion baseline; SketchDUO provides new data for sketch generation and understanding tasks. Conclusion: StableSketcher improves diffusion models on hand-drawn sketch generation and, together with the new SketchDUO dataset, moves sketch generation toward higher semantic fidelity. Abstract: Although recent advancements in diffusion models have significantly enriched the quality of generated images, challenges remain in synthesizing pixel-based human-drawn sketches, a representative example of abstract expression. To combat these challenges, we propose StableSketcher, a novel framework that empowers diffusion models to generate hand-drawn sketches with high prompt fidelity. Within this framework, we fine-tune the variational autoencoder to optimize latent decoding, enabling it to better capture the characteristics of sketches. In parallel, we integrate a new reward function for reinforcement learning based on visual question answering, which improves text-image alignment and semantic consistency. Extensive experiments demonstrate that StableSketcher generates sketches with improved stylistic fidelity, achieving better alignment with prompts compared to the Stable Diffusion baseline. Additionally, we introduce SketchDUO, to the best of our knowledge, the first dataset comprising instance-level sketches paired with captions and question-answer pairs, thereby addressing the limitations of existing datasets that rely on image-label pairs. Our code and dataset will be made publicly available upon acceptance.

[78] BIOCAP: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models

Ziheng Zhang,Xinyue Ma,Arpita Chowdhury,Elizabeth G. Campolongo,Matthew J. Thompson,Net Zhang,Samuel Stevens,Hilmar Lapp,Tanya Berger-Wolf,Yu Su,Wei-Lun Chao,Jianyang Gu

Main category: cs.CV

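TL;DR: This study uses descriptive captions to strengthen the training of biological multimodal foundation models. Synthetic captions fill the gap left by the lack of large-scale instance-level annotation, and the resulting BIOCAP model performs strongly on species classification and text-image retrieval.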

Details Motivation: Biological images usually lack detailed instance-level text descriptions, which limits the use of multimodal models in biology; the paper introduces descriptive captions as a richer supervision signal. Method: Uses Wikipedia-derived visual information and taxon-tailored format examples to guide multimodal large language models (MLLMs) in generating accurate synthetic captions, then trains the BIOCAP model on these captions. Result: BIOCAP performs strongly on species classification and text-image retrieval, indicating that descriptive captions bridge biological images and multimodal models more effectively than simple labels. Conclusion: As a complementary supervision signal, descriptive captions improve the semantic understanding of biological multimodal models, and synthetic caption generation is an effective way to address annotation scarcity. Abstract: This work investigates descriptive captions as an additional source of supervision for biological multimodal foundation models. Images and captions can be viewed as complementary samples from the latent morphospace of a species, each capturing certain biological traits. Incorporating captions during training encourages alignment with this shared latent structure, emphasizing potentially diagnostic characters while suppressing spurious correlations. The main challenge, however, lies in obtaining faithful, instance-specific captions at scale. This requirement has limited the utilization of natural language supervision in organismal biology compared with many other scientific domains. We complement this gap by generating synthetic captions with multimodal large language models (MLLMs), guided by Wikipedia-derived visual information and taxon-tailored format examples. These domain-specific contexts help reduce hallucination and yield accurate, instance-based descriptive captions. Using these captions, we train BIOCAP (i.e., BIOCLIP with Captions), a biological foundation model that captures rich semantics and achieves strong performance in species classification and text-image retrieval. These results demonstrate the value of descriptive captions beyond labels in bridging biological images with multimodal foundation models.

[79] Physics-Guided Fusion for Robust 3D Tracking of Fast Moving Small Objects

Prithvi Raj Singh,Raju Gottumukkala,Anthony S. Maida,Alan B. Barhorst,Vijaya Gopu

Main category: cs.CV

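TL;DR: Proposes a system that combines deep learning-based detection with a physics-based tracking algorithm for 3D detection and tracking of fast-moving small objects with an RGB-D camera, clearly outperforming conventional Kalman filter trackers.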

Details Motivation: Detecting and tracking fast-moving small objects remains a weak spot in computer vision research, especially in difficult scenarios such as occlusions and rapid direction changes. Method: Uses deep learning for object detection and a physics-based tracking algorithm built on kinematic equations, fusing RGB-D data for 3D tracking; an outlier detection and correction module improves robustness. Result: On a custom racquetball dataset, Average Displacement Error is reduced by up to 70% compared with Kalman filter trackers, with greater stability under occlusion and fast motion. Conclusion: Combining physical models with deep learning improves perception of fast, small targets and gives autonomous robotic systems a more reliable real-time 3D detection and tracking solution. Abstract: While computer vision has advanced considerably for general object detection and tracking, the specific problem of fast-moving tiny objects remains underexplored. This paper addresses the significant challenge of detecting and tracking rapidly moving small objects using an RGB-D camera. Our novel system combines deep learning-based detection with physics-based tracking to overcome the limitations of existing approaches. Our contributions include: (1) a comprehensive system design for object detection and tracking of fast-moving small objects in 3D space, (2) an innovative physics-based tracking algorithm that integrates kinematics motion equations to handle outliers and missed detections, and (3) an outlier detection and correction module that significantly improves tracking performance in challenging scenarios such as occlusions and rapid direction changes. We evaluated our proposed system on a custom racquetball dataset. Our evaluation shows our system surpassing Kalman filter-based trackers with up to 70\% less Average Displacement Error. Our system has significant applications for improving robot perception on autonomous platforms and demonstrates the effectiveness of combining physics-based models with deep learning approaches for real-time 3D detection and tracking of challenging small objects.
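To make the idea of kinematics-based tracking with outlier handling concrete, here is a minimal sketch under simple assumptions: constant gravitational acceleration, a fixed distance gate, and no bounce or drag modeling. The names `predict`, `fuse`, and `gate` are hypothetical and are not taken from the paper, whose actual algorithm may differ substantially.

```python
GRAVITY = (0.0, 0.0, -9.81)  # m/s^2, assuming the z axis points up


def predict(position, velocity, dt, accel=GRAVITY):
    """Constant-acceleration kinematics: p' = p + v*dt + 0.5*a*dt^2, v' = v + a*dt."""
    new_pos = tuple(p + v * dt + 0.5 * a * dt * dt
                    for p, v, a in zip(position, velocity, accel))
    new_vel = tuple(v + a * dt for v, a in zip(velocity, accel))
    return new_pos, new_vel


def fuse(prediction, detection, gate=0.5):
    """Keep the detection unless it is missing or farther than `gate` metres
    from the kinematic prediction, in which case fall back to the prediction."""
    if detection is None:
        return prediction
    dist = sum((d - p) ** 2 for d, p in zip(detection, prediction)) ** 0.5
    return prediction if dist > gate else detection


pos, vel = predict((0.0, 0.0, 1.0), (1.0, 0.0, 0.0), dt=0.1)
missed = fuse(pos, None)               # missed detection: use the prediction
outlier = fuse(pos, (5.0, 5.0, 5.0))   # implausible jump: use the prediction
accepted = fuse(pos, (0.1, 0.0, 0.95))  # close to the prediction: keep it
```

The gate is the simplest possible outlier test; a real tracker would likely scale it with speed and sensor noise.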

[80] Inverse Image-Based Rendering for Light Field Generation from Single Images

Hyunjun Jung,Hae-Gon Jeon

Main category: cs.CV

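TL;DR: This paper proposes inverse image-based rendering, a new method that generates a light field from a single image. A neural rendering pipeline reconstructs the light flow and achieves high-quality novel view synthesis.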

Details Motivation: Light fields are useful for scene representation and novel view rendering, but acquiring them conventionally requires high computational cost or specialized devices. The paper aims to generate light fields from single images to broaden their usability and range of applications. Method: Proposes inverse image-based rendering with a neural rendering pipeline: it first stores the light flow of source rays from the input image, computes relationships among the rays through cross-attention, and predicts the color of the target ray; after a novel view is generated, the new content is added to the set of source rays, and generation proceeds iteratively to keep occluded regions consistent. Result: The method works well on several challenging datasets without retraining or finetuning and outperforms current state-of-the-art novel view synthesis methods. Conclusion: Inverse image-based rendering generates light fields from a single image and achieves high-quality, consistent novel view synthesis with broad applicability and strong generalization. Abstract: A concept of light-fields computed from multiple view images on regular grids has proven its benefit for scene representations, and supported realistic renderings of novel views and photographic effects such as refocusing and shallow depth of field. In spite of its effectiveness of light flow computations, obtaining light fields requires either computational costs or specialized devices like a bulky camera setup and a specialized microlens array. In an effort to broaden its benefit and applicability, in this paper, we propose a novel view synthesis method for light field generation from only single images, named inverse image-based rendering. Unlike previous attempts to implicitly rebuild 3D geometry or to explicitly represent objective scenes, our method reconstructs light flows in a space from image pixels, which behaves in the opposite way to image-based rendering. To accomplish this, we design a neural rendering pipeline to render a target ray in an arbitrary viewpoint. Our neural renderer first stores the light flow of source rays from the input image, then computes the relationships among them through cross-attention, and finally predicts the color of the target ray based on these relationships. After the rendering pipeline generates the first novel view from a single input image, the generated out-of-view contents are updated to the set of source rays. This procedure is iteratively performed while ensuring the consistent generation of occluded contents.
We demonstrate that our inverse image-based rendering works well with various challenging datasets without any retraining or finetuning after being trained once on a synthetic dataset, and outperforms relevant state-of-the-art novel view synthesis methods.

[81] Revisiting Logit Distributions for Reliable Out-of-Distribution Detection

Jiachen Liang,Ruibing Hou,Minyang Hu,Hong Chang,Shiguang Shan,Xilin Chen

Main category: cs.CV

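TL;DR: Proposes LogitGap, a new post-hoc OOD detection method that exploits the relationship between the maximum logit and the remaining logits to improve the separability of ID and OOD samples, and adds a training-free strategy for selecting the most informative logits. Experiments show state-of-the-art performance across a range of scenarios.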

Details Motivation: Existing post-hoc OOD detection methods underexploit the rich information in the model's logit space, which leads to poor separation between ID and OOD samples. Method: Proposes LogitGap, which uses the relationship between the maximum logit and the remaining logits, and introduces a training-free strategy that automatically identifies the most informative subset of logits for scoring. Result: In extensive experiments on vision-language and vision-only models, LogitGap consistently achieves state-of-the-art performance across multiple OOD detection benchmarks and scenarios. Conclusion: By explicitly modeling relationships within the logits and focusing on an information-dense subspace, LogitGap markedly improves post-hoc OOD detection and is both general and practical. Abstract: Out-of-distribution (OOD) detection is critical for ensuring the reliability of deep learning models in open-world applications. While post-hoc methods are favored for their efficiency and ease of deployment, existing approaches often underexploit the rich information embedded in the model's logits space. In this paper, we propose LogitGap, a novel post-hoc OOD detection method that explicitly exploits the relationship between the maximum logit and the remaining logits to enhance the separability between in-distribution (ID) and OOD samples. To further improve its effectiveness, we refine LogitGap by focusing on a more compact and informative subset of the logit space. Specifically, we introduce a training-free strategy that automatically identifies the most informative logits for scoring. We provide both theoretical analysis and empirical evidence to validate the effectiveness of our approach. Extensive experiments on both vision-language and vision-only models demonstrate that LogitGap consistently achieves state-of-the-art performance across diverse OOD detection scenarios and benchmarks. Code is available at https://github.com/GIT-LJc/LogitGap.
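The abstract does not give the scoring formula, so the following is only one plausible reading of "the relationship between the maximum logit and the remaining logits": the gap between the largest logit and the mean of the next `k` largest ones. The function name `logit_gap_score`, the fixed `k`, and the use of a plain mean are all assumptions for illustration; the paper's actual score and its training-free subset selection may differ.

```python
def logit_gap_score(logits, k=3):
    """Gap between the top logit and the mean of the next k logits.

    A larger gap suggests a confident, in-distribution-like prediction.
    Requires at least two logits. This is an illustrative score, not the
    formula from the paper.
    """
    ranked = sorted(logits, reverse=True)
    rest = ranked[1:1 + k]  # a compact subset of the remaining logits
    return ranked[0] - sum(rest) / len(rest)


confident = logit_gap_score([9.0, 1.0, 0.5, 0.0])  # one dominant class
flat = logit_gap_score([2.0, 1.9, 1.8, 1.7])       # no dominant class
```

In a post-hoc detector, a threshold on such a score would separate ID from OOD inputs without retraining the model.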

[82] PartNeXt: A Next-Generation Dataset for Fine-Grained and Hierarchical 3D Part Understanding

Penghao Wang,Yiyang He,Xin Lv,Yukai Zhou,Lan Xu,Jingyi Yu,Jiayuan Gu

Main category: cs.CV

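TL;DR: PartNeXt is a large-scale, high-quality, textured dataset for 3D part understanding with more than 23,000 models and fine-grained hierarchical part annotations. It supports multi-task evaluation, including fine-grained part segmentation and 3D part question answering, and training on it yields substantial gains for existing methods.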

Details Motivation: Existing 3D part datasets such as PartNet rely on untextured geometry and expert annotation, which limits scalability and usability; a higher-quality, scalable dataset that supports multiple tasks is needed. Method: Introduces the PartNeXt dataset of more than 23,000 textured 3D models across 50 categories, annotated with fine-grained hierarchical part labels through a scalable annotation pipeline, and benchmarks it on part segmentation and 3D-LLM question answering. Result: Existing state-of-the-art methods struggle on class-agnostic part segmentation; the 3D part question answering task exposes the weakness of 3D-LLMs in open-vocabulary part grounding; training Point-SAM on PartNeXt yields substantial gains over training on PartNet. Conclusion: Through scalable annotation, texture-aware labels, and multi-task evaluation, PartNeXt opens new directions for research on structured 3D understanding. Abstract: Understanding objects at the level of their constituent parts is fundamental to advancing computer vision, graphics, and robotics. While datasets like PartNet have driven progress in 3D part understanding, their reliance on untextured geometries and expert-dependent annotation limits scalability and usability. We introduce PartNeXt, a next-generation dataset addressing these gaps with over 23,000 high-quality, textured 3D models annotated with fine-grained, hierarchical part labels across 50 categories. We benchmark PartNeXt on two tasks: (1) class-agnostic part segmentation, where state-of-the-art methods (e.g., PartField, SAMPart3D) struggle with fine-grained and leaf-level parts, and (2) 3D part-centric question answering, a new benchmark for 3D-LLMs that reveals significant gaps in open-vocabulary part grounding. Additionally, training Point-SAM on PartNeXt yields substantial gains over PartNet, underscoring the dataset's superior quality and diversity. By combining scalable annotation, texture-aware labels, and multi-task evaluation, PartNeXt opens new avenues for research in structured 3D understanding.

[83] Monocular Visual 8D Pose Estimation for Articulated Bicycles and Cyclists

Eduardo R. Corral-Soto,Yang Liu,Yuan Ren,Bai Dongfeng,Liu Bingbing

Main category: cs.CV

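TL;DR: This paper proposes a method for category-level 8D pose estimation of articulated bicycles and cyclists from a single RGB image. Besides the 3D translation and rotation of the bicycle, it estimates the rotation of the handlebars and pedals relative to the bicycle body frame, enabling more precise prediction of the cyclist's travel direction and behavior.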

Details Motivation: In autonomous driving, accurate cyclist pose estimation is critical for crossing intention classification, behavior prediction, and collision avoidance. Conventional 6D pose estimation cannot account for the changes caused by moving bicycle parts such as handlebars and pedals, so a finer-grained pose representation is needed. Method: Proposes a model that jointly estimates the 8D pose and 3D keypoints, where the 8D pose consists of 3D position, 3D rotation, and the two rotation angles of the handlebars and pedals; training uses a mix of synthetic and real images to improve generalization. Result: Experiments show good accuracy on the 8D pose parameters and competitive results against state-of-the-art 6D pose estimators that use rigid templates. Conclusion: The proposed 8D pose estimation describes the motion state of articulated bicycles in finer detail and helps autonomous driving systems understand cyclist intention and behavior. Abstract: In Autonomous Driving, cyclists belong to the safety-critical class of Vulnerable Road Users (VRU), and accurate estimation of their pose is critical for cyclist crossing intention classification, behavior prediction, and collision avoidance. Unlike rigid objects, articulated bicycles are composed of movable rigid parts linked by joints and constrained by a kinematic structure. 6D pose methods can estimate the 3D rotation and translation of rigid bicycles, but 6D becomes insufficient when the steering/pedal angles of the bicycle vary. That is because: 1) varying the articulated pose of the bicycle causes its 3D bounding box to vary as well, and 2) the 3D box orientation is not necessarily aligned to the orientation of the steering which determines the actual intended travel direction. In this work, we introduce a method for category-level 8D pose estimation for articulated bicycles and cyclists from a single RGB image. Besides being able to estimate the 3D translation and rotation of a bicycle from a single image, our method also estimates the rotations of its steering handles and pedals with respect to the bicycle body frame. These two new parameters enable the estimation of a more fine-grained bicycle pose state and travel direction. Our proposed model jointly estimates the 8D pose and the 3D Keypoints of articulated bicycles, and trains with a mix of synthetic and real image data to generalize on real images.
We include an evaluation section where we evaluate the accuracy of our estimated 8D pose parameters, and our method shows promising results by achieving competitive scores when compared against state-of-the-art category-level 6D pose estimators that use rigid canonical object templates for matching.

[84] TOMCAT: Test-time Comprehensive Knowledge Accumulation for Compositional Zero-Shot Learning

Xudong Yan,Songhe Feng

Main category: cs.CV

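TL;DR: Proposes TOMCAT, a new method that accumulates multimodal knowledge from unsupervised data at test time and adaptively updates prototypes to cope with distribution shift, achieving state-of-the-art performance on four benchmark datasets.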

Details Motivation: Existing CZSL methods degrade because the label space shifts at test time, when unseen compositions recombined from attributes and objects appear. Method: Accumulates comprehensive knowledge in the textual and visual modalities from unsupervised data to update multimodal prototypes at test time; designs an adaptive update weight to control the degree of prototype adjustment; introduces a dynamic priority queue that stores high-confidence images to draw on historical visual knowledge; and aligns textual and visual prototypes through multimodal collaborative representation learning. Result: Achieves state-of-the-art performance on four benchmark datasets under both closed-world and open-world settings. Conclusion: The proposed method handles test-time distribution shift effectively and improves the generalization of CZSL models. Abstract: Compositional Zero-Shot Learning (CZSL) aims to recognize novel attribute-object compositions based on the knowledge learned from seen ones. Existing methods suffer from performance degradation caused by the distribution shift of label space at test time, which stems from the inclusion of unseen compositions recombined from attributes and objects. To overcome the challenge, we propose a novel approach that accumulates comprehensive knowledge in both textual and visual modalities from unsupervised data to update multimodal prototypes at test time. Building on this, we further design an adaptive update weight to control the degree of prototype adjustment, enabling the model to flexibly adapt to distribution shift during testing. Moreover, a dynamic priority queue is introduced that stores high-confidence images to acquire visual knowledge from historical images for inference. Considering the semantic consistency of multimodal knowledge, we align textual and visual prototypes by multimodal collaborative representation learning. Extensive experiments indicate that our approach achieves state-of-the-art performance on four benchmark datasets under both closed-world and open-world settings. Code will be available at https://github.com/xud-yan/TOMCAT .
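Two of the components mentioned above, a priority queue of high-confidence samples and a weighted prototype update, can be sketched with the standard library. This is a simplified illustration: the class `ConfidenceQueue`, the function `update_prototype`, and the fixed `weight` argument are assumptions, whereas the paper computes the update weight adaptively and operates on learned image and text features.

```python
import heapq


class ConfidenceQueue:
    """Bounded queue that keeps the `capacity` highest-confidence features."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []    # min-heap of (confidence, insertion index, feature)
        self._counter = 0  # tie-breaker so features are never compared directly

    def push(self, confidence, feature):
        item = (confidence, self._counter, feature)
        self._counter += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        else:
            # Insert the new item, then evict whichever item has lowest confidence.
            heapq.heappushpop(self._heap, item)

    def features(self):
        """Stored features, highest confidence first."""
        return [f for _, _, f in sorted(self._heap, reverse=True)]


def update_prototype(prototype, feature, weight):
    """Move the prototype toward a new feature by a factor `weight` in [0, 1]."""
    return [(1 - weight) * p + weight * f for p, f in zip(prototype, feature)]


queue = ConfidenceQueue(capacity=2)
queue.push(0.2, [1.0])
queue.push(0.9, [2.0])
queue.push(0.5, [3.0])  # evicts the 0.2-confidence entry
kept = queue.features()
proto = update_prototype([0.0, 1.0], [1.0, 1.0], weight=0.25)
```

A small weight keeps the prototype close to its original value, which limits drift when test-time pseudo-labels are noisy.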

[85] IB-GAN: Disentangled Representation Learning with Information Bottleneck Generative Adversarial Networks

Insu Jeon,Wonkwang Lee,Myeongjang Pyeon,Gunhee Kim

Main category: cs.CV

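TL;DR: Proposes IB-GAN, a new GAN model based on the Information Bottleneck framework for unsupervised disentangled representation learning. On several datasets it outperforms InfoGAN in disentanglement, is competitive with β-VAE, and often produces better samples.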

Details Motivation: To bring the Information Bottleneck (IB) framework into GAN optimization in order to learn disentangled representations that are more effective and interpretable. Method: Adds an intermediate stochastic layer in the GAN generator that acts as an information bottleneck, constraining the mutual information between the input and the generated output, and learns a disentangled latent space through joint end-to-end training. Result: On dSprites and Color-dSprites, disentanglement is competitive with β-VAE and better than InfoGAN; FID scores on CelebA and 3D Chairs indicate better sample quality and diversity. Conclusion: By introducing an information bottleneck, IB-GAN learns disentangled representations effectively and performs well in sample quality and diversity, showing its potential for unsupervised disentanglement learning. Abstract: We propose a new GAN-based unsupervised model for disentangled representation learning. The new model is discovered in an attempt to apply the Information Bottleneck (IB) framework to the optimization of GAN, thereby named IB-GAN. The architecture of IB-GAN is partially similar to that of InfoGAN but has a critical difference; an intermediate layer of the generator is leveraged to constrain the mutual information between the input and the generated output. The intermediate stochastic layer can serve as a learnable latent distribution that is trained with the generator jointly in an end-to-end fashion. As a result, the generator of IB-GAN can harness the latent space in a disentangled and interpretable manner. With the experiments on dSprites and Color-dSprites datasets, we demonstrate that IB-GAN achieves competitive disentanglement scores to those of state-of-the-art $\beta$-VAEs and outperforms InfoGAN. Moreover, the visual quality and the diversity of samples generated by IB-GAN are often better than those by $\beta$-VAEs and InfoGAN in terms of FID score on CelebA and 3D Chairs dataset.

[86] PPMStereo: Pick-and-Play Memory Construction for Consistent Dynamic Stereo Matching

Yun Wang,Junjie Hu,Qiaole Dong,Yongjian Zhang,Yanwei Fu,Tin Lun Lam,Dapeng Wu

Main category: cs.CV

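TL;DR: This paper proposes PPMStereo, a new method for temporally consistent depth estimation from stereo video. Its Pick-and-Play Memory (PPM) module models long-range spatio-temporal consistency efficiently, reaching state-of-the-art accuracy and temporal consistency at lower computational cost.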

Details Motivation: In applications such as augmented reality, depth estimated from stereo video must stay temporally consistent to avoid degrading the user experience. Existing methods, however, face a trade-off between computational efficiency and performance when modeling long-term temporal consistency. Method: Proposes PPMStereo, with a 'pick' process that selects the most relevant frames and a 'play' process that adaptively weights those frames for spatio-temporal aggregation, building a compact yet informative memory buffer for efficient dynamic stereo matching. Result: PPMStereo achieves state-of-the-art performance on the Sintel dataset, with TEPE of 0.62/1.11 on Sintel clean/final, improvements of 17.3% and 9.02% over BiDAStereo, at lower computational cost. Conclusion: With its two-stage memory mechanism, PPMStereo balances temporal consistency modeling against computational efficiency and offers a new solution for dynamic stereo matching. Abstract: Temporally consistent depth estimation from stereo video is critical for real-world applications such as augmented reality, where inconsistent depth estimation disrupts the immersion of users. Despite its importance, this task remains challenging due to the difficulty in modeling long-term temporal consistency in a computationally efficient manner. Previous methods attempt to address this by aggregating spatio-temporal information but face a fundamental trade-off: limited temporal modeling provides only modest gains, whereas capturing long-range dependencies significantly increases computational cost. To address this limitation, we introduce a memory buffer for modeling long-range spatio-temporal consistency while achieving efficient dynamic stereo matching. Inspired by the two-stage decision-making process in humans, we propose a \textbf{P}ick-and-\textbf{P}lay \textbf{M}emory (PPM) construction module for dynamic \textbf{Stereo} matching, dubbed as \textbf{PPMStereo}. PPM consists of a `pick' process that identifies the most relevant frames and a `play' process that weights the selected frames adaptively for spatio-temporal aggregation. This two-stage collaborative process maintains a compact yet highly informative memory buffer while achieving temporally consistent information aggregation. Extensive experiments validate the effectiveness of PPMStereo, demonstrating state-of-the-art performance in both accuracy and temporal consistency. Notably, PPMStereo achieves 0.62/1.11 TEPE on the Sintel clean/final (17.3\% \& 9.02\% improvements over BiDAStereo) with fewer computational costs.
Codes are available at https://github.com/cocowy1/PPMStereo.
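The two-stage idea (pick the most relevant frames, then weight them adaptively) can be illustrated with a toy example on plain feature vectors. Everything here is an assumption made for the sketch: dot-product relevance, top-`k` selection, and softmax weighting are common choices, but the abstract does not say that PPMStereo uses exactly these, and the function name `pick_and_play` is hypothetical.

```python
import math


def pick_and_play(query, memory, k=2):
    """Pick the k memory frames most similar to the query, then fuse them
    with softmax weights over their similarity scores."""
    # Relevance of each stored frame: dot product with the query feature.
    scores = [sum(q * m for q, m in zip(query, frame)) for frame in memory]
    # "Pick": indices of the k highest-scoring frames (ties keep memory order).
    picked = sorted(range(len(memory)), key=lambda i: scores[i], reverse=True)[:k]
    # "Play": numerically stable softmax over the picked scores.
    peak = max(scores[i] for i in picked)
    exps = [math.exp(scores[i] - peak) for i in picked]
    total = sum(exps)
    weights = [e / total for e in exps]
    fused = [sum(w * memory[i][d] for w, i in zip(weights, picked))
             for d in range(len(query))]
    return picked, weights, fused


memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
picked, weights, fused = pick_and_play([1.0, 0.0], memory, k=2)
```

Restricting aggregation to `k` frames is what keeps the cost bounded while still allowing frames from far back in the sequence to contribute.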

[87] Evaluating Video Models as Simulators of Multi-Person Pedestrian Trajectories

Aaron Appelle,Jerome P. Lynch

Main category: cs.CV

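TL;DR: Proposes a new protocol for evaluating text-to-video and image-to-video models as simulators of pedestrian dynamics. Existing models handle multi-agent behavior well but still show failures such as people merging and disappearing.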

Details Motivation: Large-scale video generation models show high visual realism in many contexts, but their plausibility in scenes with multiple interacting people has not been verified. Method: Proposes a rigorous evaluation protocol that uses start frames from established datasets for I2V comparison and a prompt suite covering different pedestrian densities and interactions for T2V; it also introduces a method for reconstructing 2D bird's-eye view trajectories from pixel space without known camera parameters. Result: The analysis shows that leading video generation models have learned effective priors for multi-agent behavior, but failure modes such as merging and disappearing people remain. Conclusion: Current T2V and I2V models have potential as implicit simulators of pedestrian dynamics but need further improvement to resolve inconsistencies in multi-agent dynamics. Abstract: Large-scale video generation models have demonstrated high visual realism in diverse contexts, spurring interest in their potential as general-purpose world simulators. Existing benchmarks focus on individual subjects rather than scenes with multiple interacting people. However, the plausibility of multi-agent dynamics in generated videos remains unverified. We propose a rigorous evaluation protocol to benchmark text-to-video (T2V) and image-to-video (I2V) models as implicit simulators of pedestrian dynamics. For I2V, we leverage start frames from established datasets to enable comparison with a ground truth video dataset. For T2V, we develop a prompt suite to explore diverse pedestrian densities and interactions. A key component is a method to reconstruct 2D bird's-eye view trajectories from pixel-space without known camera parameters. Our analysis reveals that leading models have learned surprisingly effective priors for plausible multi-agent behavior. However, failure modes like merging and disappearing people highlight areas for future improvement.

[88] SPAN: Continuous Modeling of Suspicion Progression for Temporal Intention Localization

Xinyi Hu,Yuran Wang,Yue Li,Wenxuan Liu,Zheng Wang

Main category: cs.CV

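TL;DR: This paper proposes the Suspicion Progression Analysis Network (SPAN), which recasts temporal intention localization from discrete classification to continuous regression in order to capture how suspicious intentions evolve in video surveillance. By introducing a suspicion score formula, Suspicion Coefficient Modulation, and Concept-Anchored Mapping, SPAN significantly improves detection accuracy and explainability.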

Details Motivation: Existing methods use discrete classification, which struggles to model the continuity and temporal evolution of suspicious intentions, limiting early intervention and system explainability. Method: Proposes SPAN, which designs a continuous suspicion score formula based on Temporal Point Process theory, introduces a suspicion coefficient modulation mechanism driven by multimodal information, and designs Concept-Anchored Mapping to link actions to underlying intentions. Result: On the HAI dataset, SPAN reduces MSE by 19.8% relative to existing methods, improves average mAP by 1.78%, and improves mAP by 2.74% in low-frequency cases. Conclusion: Continuous modeling of suspicious intentions outperforms discrete classification: it detects anomalies earlier, supports proactive intervention, and improves explainability and practical utility in security applications. Abstract: Temporal Intention Localization (TIL) is crucial for video surveillance, focusing on identifying varying levels of suspicious intentions to improve security monitoring. However, existing discrete classification methods fail to capture the continuous nature of suspicious intentions, limiting early intervention and explainability. In this paper, we propose the Suspicion Progression Analysis Network (SPAN), which shifts from discrete classification to continuous regression, enabling the capture of fluctuating and evolving suspicious intentions. We reveal that suspicion exhibits long-term dependencies and cumulative effects, similar to Temporal Point Process (TPP) theory. Based on these insights, we define a suspicion score formula that models continuous changes while accounting for temporal characteristics. We also introduce Suspicion Coefficient Modulation, which adjusts suspicion coefficients using multimodal information to reflect the varying impacts of suspicious actions. Additionally, the Concept-Anchored Mapping method is proposed to link suspicious actions to predefined intention concepts, offering insights into both the actions and their potential underlying intentions. Extensive experiments on the HAI dataset show that SPAN significantly outperforms existing methods, reducing MSE by 19.8% and improving average mAP by 1.78%. Notably, SPAN achieves a 2.74% mAP gain in low-frequency cases, demonstrating its superior ability to capture subtle behavioral changes. 
Compared to discrete classification systems, our continuous suspicion modeling approach enables earlier detection and proactive intervention, greatly enhancing system explainability and practical utility in security applications.
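
To make the idea of a cumulative, time-decaying suspicion score concrete, here is a minimal sketch. The abstract does not give the paper's formula, so the exponential decay kernel below is borrowed from Hawkes-style point processes as an assumption; `suspicion_score`, `events`, and `decay` are hypothetical names.

```python
import math

def suspicion_score(t, events, decay=0.5):
    """Continuous suspicion at time t accumulated from earlier actions.

    events: list of (time, coefficient) pairs; each coefficient stands in for
            a per-action suspicion weight (illustrative values only).
    decay:  rate at which an action's influence fades (assumed).
    """
    # Each past action contributes its coefficient, discounted by elapsed time.
    return sum(c * math.exp(-decay * (t - s)) for s, c in events if s <= t)
```

With `events = [(0.0, 1.0), (2.0, 2.0)]`, the score rises when the second action occurs and then decays smoothly, which is the kind of continuous signal a regression target can follow and a discrete label cannot.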

[89] A Structured Review and Quantitative Profiling of Public Brain MRI Datasets for Foundation Model Development

Minh Sao Khue Luu,Margaret V. Benedichuk,Ekaterina I. Roppert,Roman M. Kenzhin,Bair N. Tuchinov

Main category: cs.CV

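TL;DR: This study systematically analyzes 54 public brain MRI datasets, revealing marked imbalances in dataset scale, modality composition, and disease coverage, as well as image-level heterogeneity. Covariate shift persists even after standardized preprocessing, indicating that preprocessing-aware and domain-adaptive strategies are needed to develop generalizable brain MRI foundation models.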

Details Motivation: Supporting the development of brain MRI foundation models requires a systematic assessment of the scale, diversity, and consistency of existing public datasets, yet such assessments are scarce. Method: At the dataset level, analyzes modality composition, disease coverage, and scale; at the image level, quantifies voxel spacing, orientation, and intensity distributions; evaluates how several preprocessing steps affect voxel statistics and geometry; and runs a feature-space case study with a 3D DenseNet121 to test for residual covariate shift after preprocessing. Result: Healthy-cohort datasets are far larger than clinical ones, and image characteristics are markedly heterogeneous; preprocessing improves within-dataset consistency, but cross-dataset differences remain; feature-space analysis confirms that standardized preprocessing cannot fully eliminate covariate shift. Conclusion: Public brain MRI data vary at multiple levels, and standardized preprocessing alone is not enough to harmonize them; future foundation model design should combine preprocessing-aware and domain-adaptive methods to improve generalization. Abstract: The development of foundation models for brain MRI depends critically on the scale, diversity, and consistency of available data, yet systematic assessments of these factors remain scarce. In this study, we analyze 54 publicly accessible brain MRI datasets encompassing over 538,031 to provide a structured, multi-level overview tailored to foundation model development. At the dataset level, we characterize modality composition, disease coverage, and dataset scale, revealing strong imbalances between large healthy cohorts and smaller clinical populations. At the image level, we quantify voxel spacing, orientation, and intensity distributions across 15 representative datasets, demonstrating substantial heterogeneity that can influence representation learning. We then perform a quantitative evaluation of preprocessing variability, examining how intensity normalization, bias field correction, skull stripping, spatial registration, and interpolation alter voxel statistics and geometry. While these steps improve within-dataset consistency, residual differences persist between datasets. Finally, a feature-space case study using a 3D DenseNet121 shows measurable residual covariate shift after standardized preprocessing, confirming that harmonization alone cannot eliminate inter-dataset bias. Together, these analyses provide a unified characterization of variability in public brain MRI resources and emphasize the need for preprocessing-aware and domain-adaptive strategies in the design of generalizable brain MRI foundation models.

[90] RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling

Bingjie Gao,Qianli Ma,Xiaoxue Wu,Shuai Yang,Guanzhou Lan,Haonan Zhao,Jiaxuan Chen,Qingyang Liu,Yu Qiao,Xinyuan Chen,Yaohui Wang,Li Niu

Main category: cs.CV

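TL;DR: RAPO++ is a cross-stage prompt optimization framework that combines training-data-aligned prompt refinement, test-time iterative optimization, and large language model fine-tuning to significantly improve text-to-video generation quality without modifying the generative model itself.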

Details Motivation: User-provided text prompts are usually short, unstructured, and mismatched with the training data, which limits diffusion models in text-to-video generation, so an effective prompt optimization method is needed. Method: RAPO++ has three stages: Stage 1 uses Retrieval-Augmented Prompt Optimization (RAPO) to retrieve semantically relevant modifiers from a relation graph and refactor prompts to match the training distribution; Stage 2 introduces Sample-Specific Prompt Optimization (SSPO), which uses multi-source feedback (such as semantic alignment, spatial fidelity, and temporal coherence) for closed-loop iterative refinement; Stage 3 fine-tunes a large language model on the optimized prompts so that it internalizes the optimization patterns. Result: Experiments on five state-of-the-art text-to-video models and five benchmarks show that RAPO++ significantly outperforms existing methods in semantic alignment, compositional reasoning, temporal stability, and physical plausibility. Conclusion: RAPO++ is a model-agnostic, low-cost, and scalable prompt optimization approach that sets a new standard for prompt engineering in text-to-video generation. Abstract: Prompt design plays a crucial role in text-to-video (T2V) generation, yet user-provided prompts are often short, unstructured, and misaligned with training data, limiting the generative potential of diffusion-based T2V models. We present \textbf{RAPO++}, a cross-stage prompt optimization framework that unifies training-data--aligned refinement, test-time iterative scaling, and large language model (LLM) fine-tuning to substantially improve T2V generation without modifying the underlying generative backbone. In \textbf{Stage 1}, Retrieval-Augmented Prompt Optimization (RAPO) enriches user prompts with semantically relevant modifiers retrieved from a relation graph and refactors them to match training distributions, enhancing compositionality and multi-object fidelity. \textbf{Stage 2} introduces Sample-Specific Prompt Optimization (SSPO), a closed-loop mechanism that iteratively refines prompts using multi-source feedback -- including semantic alignment, spatial fidelity, temporal coherence, and task-specific signals such as optical flow -- yielding progressively improved video generation quality. \textbf{Stage 3} leverages optimized prompt pairs from SSPO to fine-tune the rewriter LLM, internalizing task-specific optimization patterns and enabling efficient, high-quality prompt generation even before inference. 
Extensive experiments across five state-of-the-art T2V models and five benchmarks demonstrate that RAPO++ achieves significant gains in semantic alignment, compositional reasoning, temporal stability, and physical plausibility, outperforming existing methods by large margins. Our results highlight RAPO++ as a model-agnostic, cost-efficient, and scalable solution that sets a new standard for prompt optimization in T2V generation. The code is available at https://github.com/Vchitect/RAPO.

[91] FlowCycle: Pursuing Cycle-Consistent Flows for Text-based Editing

Yanghao Wang,Zhen Wang,Long Chen

Main category: cs.CV

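TL;DR: Proposes FlowCycle, an inversion-free, flow-based image editing framework that learns target-aware intermediate states through cycle consistency, enabling high-quality and consistent text-driven image editing.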

Details Motivation: Existing text-driven image editing methods use target-agnostic intermediate states, which limits editability or produces inconsistent results, especially when the edit departs substantially from the source image. Method: Proposes FlowCycle, which parameterizes the corruption process with learnable noise and optimizes it for cycle consistency under dual reconstruction constraints from forward editing and backward recovery, producing target-aware intermediate states. Result: Experiments show that FlowCycle outperforms current state-of-the-art methods in editing quality and consistency with the source image. Conclusion: Target-aware intermediate states are crucial for improving the fidelity and consistency of text-based image editing, and FlowCycle provides an effective framework for inversion-free editing. Abstract: Recent advances in pre-trained text-to-image flow models have enabled remarkable progress in text-based image editing. Mainstream approaches always adopt a corruption-then-restoration paradigm, where the source image is first corrupted into an ``intermediate state'' and then restored to the target image under the prompt guidance. However, current methods construct this intermediate state in a target-agnostic manner, i.e., they primarily focus on realizing source image reconstruction while neglecting the semantic gaps towards the specific editing target. This design inherently results in limited editability or inconsistency when the desired modifications substantially deviate from the source. In this paper, we argue that the intermediate state should be target-aware, i.e., selectively corrupting editing-relevant contents while preserving editing-irrelevant ones. To this end, we propose FlowCycle, a novel inversion-free and flow-based editing framework that parameterizes corruption with learnable noises and optimizes them through a cycle-consistent process. By iteratively editing the source to the target and recovering back to the source with dual consistency constraints, FlowCycle learns to produce a target-aware intermediate state, enabling faithful modifications while preserving source consistency. Extensive ablations have demonstrated that FlowCycle achieves superior editing quality and consistency over state-of-the-art methods.

[92] Towards Objective Obstetric Ultrasound Assessment: Contrastive Representation Learning for Fetal Movement Detection

Talha Ilyas,Duong Nhu,Allison Thomas,Arie Levin,Lim Wei Yap,Shu Gong,David Vera Anaya,Yiwen Jiang,Deval Mehta,Ritesh Warty,Vinayak Smith,Maya Reddy,Euan Wallace,Wenlong Cheng,Zongyuan Ge,Faezeh Marzbanrad

Main category: cs.CV

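TL;DR: Proposes CURL, a fetal movement detection framework based on self-supervised contrastive learning that uses a dual-contrastive loss and a task-specific sampling strategy to detect fetal movement accurately in long ultrasound videos.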

Details Motivation: Traditional methods such as maternal perception and CTG are subjective and insufficiently accurate for fetal movement detection, so more objective and reliable techniques are needed to assess fetal health. Method: Proposes Contrastive Ultrasound Video Representation Learning (CURL), which performs self-supervised learning with dual spatial and temporal contrastive losses, designs a task-specific sampling strategy, and uses probabilistic fine-tuning to allow flexible inference on ultrasound videos of arbitrary length. Result: On an in-house dataset of 92 subjects (30 minutes of ultrasound each), CURL achieves a sensitivity of 78.01% and an AUROC of 81.60%. Conclusion: CURL learns effective fetal movement representations and shows potential for reliable, objective fetal movement analysis, which could improve prenatal monitoring and clinical decision-making. Abstract: Accurate fetal movement (FM) detection is essential for assessing prenatal health, as abnormal movement patterns can indicate underlying complications such as placental dysfunction or fetal distress. Traditional methods, including maternal perception and cardiotocography (CTG), suffer from subjectivity and limited accuracy. To address these challenges, we propose Contrastive Ultrasound Video Representation Learning (CURL), a novel self-supervised learning framework for FM detection from extended fetal ultrasound video recordings. Our approach leverages a dual-contrastive loss, incorporating both spatial and temporal contrastive learning, to learn robust motion representations. Additionally, we introduce a task-specific sampling strategy, ensuring the effective separation of movement and non-movement segments during self-supervised training, while enabling flexible inference on arbitrarily long ultrasound recordings through a probabilistic fine-tuning approach. Evaluated on an in-house dataset of 92 subjects, each with 30-minute ultrasound sessions, CURL achieves a sensitivity of 78.01% and an AUROC of 81.60%, demonstrating its potential for reliable and objective FM analysis. These results highlight the potential of self-supervised contrastive learning for fetal movement analysis, paving the way for improved prenatal monitoring and clinical decision-making.
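
As a rough illustration of what a contrastive term looks like, the sketch below implements the widely used InfoNCE loss on plain feature vectors. The abstract does not specify CURL's loss, so treating each of the spatial and temporal terms as InfoNCE is an assumption, and `info_nce` and `temperature` are hypothetical names.

```python
import math

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: low when the positive is the closest sample."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    # Logit 0 belongs to the positive pair; the rest are negatives.
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]

    # Numerically stable log-sum-exp, then negative log-probability of the positive.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]
```

Under this assumption, a dual-contrastive objective would simply add one such term computed on spatially paired clips to another computed on temporally paired clips.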

[93] EditInfinity: Image Editing with Binary-Quantized Generative Models

Jiahuan Wang,Yuxin Chen,Jun Yu,Guangming Lu,Wenjie Pei

Main category: cs.CV

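TL;DR: This paper proposes EditInfinity, which adapts the binary-quantized generative model Infinity for text-driven image editing. By exploiting the fact that exact intermediate quantized representations are attainable, it achieves precise image inversion and high-quality editing, outperforming existing diffusion-based methods on the PIE-Bench benchmark.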

Details Motivation: Existing diffusion-based image editing methods are limited by approximation errors in image inversion, which arise from the lack of exact supervision at intermediate generative steps. Method: Proposes EditInfinity, built on a VQ-based generative model; exploiting the attainability of exact intermediate quantized representations of the source image, it designs an efficient image inversion mechanism that integrates text prompting rectification and image style preservation, and introduces a holistic smoothing strategy to improve editing fidelity and semantic alignment. Result: On PIE-Bench, experiments across "add", "change", and "delete" editing operations show that EditInfinity outperforms state-of-the-art diffusion-based baselines in image fidelity and text alignment. Conclusion: By adapting a VQ-based generative model in a parameter-efficient way and using exact intermediate representations, EditInfinity achieves more precise image inversion and higher-quality text-driven image editing. Abstract: Adapting pretrained diffusion-based generative models for text-driven image editing with negligible tuning overhead has demonstrated remarkable potential. A classical adaptation paradigm, as followed by these methods, first infers the generative trajectory inversely for a given source image by image inversion, then performs image editing along the inferred trajectory guided by the target text prompts. However, the performance of image editing is heavily limited by the approximation errors introduced during image inversion by diffusion models, which arise from the absence of exact supervision in the intermediate generative steps. To circumvent this issue, we investigate the parameter-efficient adaptation of VQ-based generative models for image editing, and leverage their inherent characteristic that the exact intermediate quantized representations of a source image are attainable, enabling more effective supervision for precise image inversion. Specifically, we propose \emph{EditInfinity}, which adapts \emph{Infinity}, a binary-quantized generative model, for image editing. We propose an efficient yet effective image inversion mechanism that integrates text prompting rectification and image style preservation, enabling precise image inversion. Furthermore, we devise a holistic smoothing strategy which allows our \emph{EditInfinity} to perform image editing with high fidelity to source images and precise semantic alignment to the text prompts. 
Extensive experiments on the PIE-Bench benchmark across "add", "change", and "delete" editing operations, demonstrate the superior performance of our model compared to state-of-the-art diffusion-based baselines. Code available at: https://github.com/yx-chen-ust/EditInfinity.

[94] Why LVLMs Are More Prone to Hallucinations in Longer Responses: The Role of Context

Ge Zheng,Jiaye Qian,Jiajin Tang,Sibei Yang

Main category: cs.CV

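TL;DR: This paper proposes a new "induce-detect-suppress" framework for reducing hallucinations in long-form generation by large vision-language models, and verifies that reliance on context, rather than length itself, is the main cause of increased hallucination.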

Details Motivation: The study investigates why large vision-language models hallucinate more in long, free-form outputs: whether response length is the direct cause or a deeper mechanism is at work. Method: Analyzes the relationship between hallucination and response length in a series of preliminary experiments, then proposes an 'induce-detect-suppress' framework: deliberately designed contexts actively induce hallucinations, the induced instances enable early detection of high-risk cases, and object-level hallucinations are suppressed during decoding. Result: The method yields significant and consistent improvements on multiple benchmarks, detecting and suppressing hallucinations effectively and confirming that reliance on context is a key factor behind increased hallucination. Conclusion: Hallucination risk stems mainly from increased reliance on context for coherence and completeness rather than from response length itself; the framework improves performance and offers a new perspective on long-response hallucination in LVLMs. Abstract: Large Vision-Language Models (LVLMs) have made significant progress in recent years but are also prone to hallucination issues. They exhibit more hallucinations in longer, free-form responses, often attributed to accumulated uncertainties. In this paper, we ask: Does increased hallucination result solely from length-induced errors, or is there a deeper underlying mechanism? After a series of preliminary experiments and findings, we suggest that the risk of hallucinations is not caused by length itself but by the increased reliance on context for coherence and completeness in longer responses. Building on these insights, we propose a novel "induce-detect-suppress" framework that actively induces hallucinations through deliberately designed contexts, leverages induced instances for early detection of high-risk cases, and ultimately suppresses potential object-level hallucinations during actual decoding. Our approach achieves consistent, significant improvements across all benchmarks, demonstrating its efficacy. The strong detection and improved hallucination mitigation not only validate our framework but, more importantly, re-validate our hypothesis on context. Rather than solely pursuing performance gains, this study aims to provide new insights and serves as a first step toward a deeper exploration of hallucinations in LVLMs' longer responses.

[95] COS3D: Collaborative Open-Vocabulary 3D Segmentation

Runsong Zhu,Ka-Hei Hui,Zhengzhe Liu,Qianyi Wu,Weiliang Tang,Shi Qiu,Pheng-Ann Heng,Chi-Wing Fu

Main category: cs.CV

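TL;DR: This paper proposes COS3D, a collaborative prompt-segmentation framework that introduces a collaborative field, comprising an instance field and a language field, to integrate language and segmentation cues effectively, addressing the limitations of existing Gaussian-splatting-based methods in open-vocabulary 3D segmentation.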

Details Motivation: Existing Gaussian-splatting-based methods rely on either a single 3D language field or pre-computed class-agnostic segmentations, leading to poor segmentation or error accumulation, which makes high-quality open-vocabulary 3D segmentation difficult. Method: Proposes a collaborative field (comprising an instance field and a language field), designs an instance-to-language feature mapping and a two-stage training strategy, and at inference uses adaptive language-to-instance prompt refinement to bridge the differences between the two fields. Result: Significantly outperforms existing methods on two mainstream benchmarks and shows potential in applications such as novel image-based 3D segmentation, hierarchical segmentation, and robotics. Conclusion: By integrating language and segmentation cues throughout the pipeline, COS3D achieves better open-vocabulary 3D segmentation and has broad application prospects. Abstract: Open-vocabulary 3D segmentation is a fundamental yet challenging task, requiring a mutual understanding of both segmentation and language. However, existing Gaussian-splatting-based methods rely either on a single 3D language field, leading to inferior segmentation, or on pre-computed class-agnostic segmentations, suffering from error accumulation. To address these limitations, we present COS3D, a new collaborative prompt-segmentation framework that effectively integrates complementary language and segmentation cues throughout its entire pipeline. We first introduce the new concept of collaborative field, comprising an instance field and a language field, as the cornerstone for collaboration. During training, to effectively construct the collaborative field, our key idea is to capture the intrinsic relationship between the instance field and language field, through a novel instance-to-language feature mapping and an efficient two-stage training strategy. During inference, to bridge distinct characteristics of the two fields, we further design an adaptive language-to-instance prompt refinement, promoting high-quality prompt-segmentation inference. Extensive experiments not only demonstrate COS3D's leading performance over existing methods on two widely-used benchmarks but also show its high potential for various applications, i.e., novel image-based 3D segmentation, hierarchical segmentation, and robotics. The code is publicly available at https://github.com/Runsong123/COS3D.

[96] Empower Words: DualGround for Structured Phrase and Sentence-Level Temporal Grounding

Minseok Kang,Minhyeok Lee,Minjung Kim,Donghyeong Kim,Sangyoun Lee

Main category: cs.CV

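TL;DR: This paper proposes DualGround, a dual-branch architecture that improves video temporal grounding by separating sentence-level and phrase-level semantics, achieving fine-grained temporal alignment and state-of-the-art performance on multiple benchmarks.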

Details Motivation: Existing methods treat all text tokens uniformly in cross-modal attention and ignore their different semantic roles, so models over-rely on [EOS] global semantics and neglect word-level signals, which limits fine-grained temporal alignment. Method: Proposes DualGround, a dual-branch structure in which the [EOS] token is processed through a sentence-level path while word tokens are clustered into phrase-level units for localized grounding; introduces token-role-aware cross-modal interaction strategies together with a joint modeling framework to strengthen global alignment and fine-grained grounding respectively. Result: DualGround achieves state-of-the-art performance on Moment Retrieval and Highlight Detection on the QVHighlights and Charades-STA datasets. Conclusion: By disentangling global and local semantic modeling, DualGround improves expressiveness and context awareness in video-language alignment, validating disentangled semantic modeling for video temporal grounding. Abstract: Video Temporal Grounding (VTG) aims to localize temporal segments in long, untrimmed videos that align with a given natural language query. This task typically comprises two subtasks: Moment Retrieval (MR) and Highlight Detection (HD). While recent advances have been driven by powerful pretrained vision-language models such as CLIP and InternVideo2, existing approaches commonly treat all text tokens uniformly during cross-modal attention, disregarding their distinct semantic roles. To validate the limitations of this approach, we conduct controlled experiments demonstrating that VTG models overly rely on [EOS]-driven global semantics while failing to effectively utilize word-level signals, which limits their ability to achieve fine-grained temporal alignment. Motivated by this limitation, we propose DualGround, a dual-branch architecture that explicitly separates global and local semantics by routing the [EOS] token through a sentence-level path and clustering word tokens into phrase-level units for localized grounding. Our method introduces (1) token-role-aware cross-modal interaction strategies that align video features with sentence-level and phrase-level semantics in a structurally disentangled manner, and (2) a joint modeling framework that not only improves global sentence-level alignment but also enhances fine-grained temporal grounding by leveraging structured phrase-aware context. This design allows the model to capture both coarse and localized semantics, enabling more expressive and context-aware video grounding. 
DualGround achieves state-of-the-art performance on both Moment Retrieval and Highlight Detection tasks across QVHighlights and Charades-STA benchmarks, demonstrating the effectiveness of disentangled semantic modeling in video-language alignment.

[97] Seeing the Unseen: Mask-Driven Positional Encoding and Strip-Convolution Context Modeling for Cross-View Object Geo-Localization

Shuhan Hu,Yiru Li,Yuanyuan Li,Yingying Zhu

Main category: cs.CV

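TL;DR: This paper proposes a mask-based positional encoding scheme (MPE) and a context enhancement module (CEM), and builds the EDGeo framework to improve cross-view object geo-localization accuracy, particularly for large-span objects, achieving state-of-the-art performance on the CVOGL and VIGOR-Building datasets.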

Details Motivation: Existing methods rely on keypoint-based positional encoding, which captures only 2D coordinates and ignores object shape, making them sensitive to annotation shifts and limiting cross-view matching. Method: Proposes mask-based positional encoding (MPE), which uses segmentation masks to capture both spatial coordinates and object silhouettes; designs a context enhancement module (CEM) that uses strip convolution kernels to extract long-range contextual features and better discriminate elongated objects; and builds the end-to-end EDGeo framework. Result: Experiments on two public datasets, CVOGL and VIGOR-Building, show a 3.39% improvement in localization accuracy under challenging ground-to-satellite scenarios, reaching the state of the art. Conclusion: MPE and CEM improve the understanding of object shape and context in cross-view matching, providing a more robust positional encoding paradigm and a context modeling framework for cross-view geo-localization. Abstract: Cross-view object geo-localization enables high-precision object localization through cross-view matching, with critical applications in autonomous driving, urban management, and disaster response. However, existing methods rely on keypoint-based positional encoding, which captures only 2D coordinates while neglecting object shape information, resulting in sensitivity to annotation shifts and limited cross-view matching capability. To address these limitations, we propose a mask-based positional encoding (MPE) scheme that leverages segmentation masks to capture both spatial coordinates and object silhouettes, thereby upgrading the model from "location-aware" to "object-aware." Furthermore, to tackle the challenge of large-span objects (e.g., elongated buildings) in satellite imagery, we design a context enhancement module (CEM). This module employs horizontal and vertical strip convolutional kernels to extract long-range contextual features, enhancing feature discrimination among strip-like objects. Integrating MPE and CEM, we present EDGeo, an end-to-end framework for robust cross-view object geo-localization. Extensive experiments on two public datasets (CVOGL and VIGOR-Building) demonstrate that our method achieves state-of-the-art performance, with a 3.39% improvement in localization accuracy under challenging ground-to-satellite scenarios. This work provides a robust positional encoding paradigm and a contextual modeling framework for advancing cross-view geo-localization research.
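
A toy example may help show why horizontal and vertical strip kernels suit elongated objects. The sketch below is not the paper's module: real strip convolutions use learned weights, whereas this one assumes uniform averaging weights and zero padding, and `strip_conv` is a hypothetical name.

```python
def strip_conv(image, k, horizontal=True):
    """Apply a 1 x k (horizontal) or k x 1 (vertical) averaging strip kernel.

    image: 2D list of floats; k: odd kernel length; borders are zero padded.
    Uniform weights are an illustrative stand-in for learned kernel weights.
    """
    height, width = len(image), len(image[0])
    radius = k // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for offset in range(-radius, radius + 1):
                # Walk along the row for horizontal strips, the column otherwise.
                yy, xx = (y, x + offset) if horizontal else (y + offset, x)
                if 0 <= yy < height and 0 <= xx < width:
                    acc += image[yy][xx]
            out[y][x] = acc / k
    return out
```

On an image containing a single horizontal bar, the horizontal strip responds strongly along the bar while the vertical strip responds weakly, so the pair of responses encodes the object's orientation and extent with far fewer weights than a full k x k kernel.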

[98] Calibrating Multimodal Consensus for Emotion Recognition

Guowei Zhong,Junjie Li,Huaiyu Zhu,Ruohong Huan,Yun Pan

Main category: cs.CV

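TL;DR: This paper proposes a model called Calibrated Multimodal Consensus (CMC), which uses a pseudo label generation module and a parameter-free fusion module to address semantic inconsistency and text dominance in multimodal emotion recognition, and performs strongly on multiple datasets.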

Details Motivation: Existing methods overlook possible semantic inconsistencies across modalities and are prone to modality dominance because of the strong representational capacity of text, which hurts emotion recognition accuracy. Method: Proposes the CMC model, which includes a Pseudo Label Generation Module (PLGM) for self-supervised pretraining, and a Parameter-free Fusion Module (PFM) and a Multimodal Consensus Router (MCR) for fine-tuning, to mitigate text dominance and achieve more reliable multimodal fusion. Result: On four datasets (CH-SIMS, CH-SIMS v2, CMU-MOSI, and CMU-MOSEI), CMC matches or exceeds state-of-the-art methods, with clear advantages in scenarios with semantic inconsistencies. Conclusion: CMC effectively addresses modality semantic inconsistency and text dominance in multimodal emotion recognition, improving robustness and accuracy, and has good application prospects. Abstract: In recent years, Multimodal Emotion Recognition (MER) has made substantial progress. Nevertheless, most existing approaches neglect the semantic inconsistencies that may arise across modalities, such as conflicting emotional cues between text and visual inputs. Besides, current methods are often dominated by the text modality due to its strong representational capacity, which can compromise recognition accuracy. To address these challenges, we propose a model termed Calibrated Multimodal Consensus (CMC). CMC introduces a Pseudo Label Generation Module (PLGM) to produce pseudo unimodal labels, enabling unimodal pretraining in a self-supervised fashion. It then employs a Parameter-free Fusion Module (PFM) and a Multimodal Consensus Router (MCR) for multimodal finetuning, thereby mitigating text dominance and guiding the fusion process toward a more reliable consensus. Experimental results demonstrate that CMC achieves performance on par with or superior to state-of-the-art methods across four datasets, CH-SIMS, CH-SIMS v2, CMU-MOSI, and CMU-MOSEI, and exhibits notable advantages in scenarios with semantic inconsistencies on CH-SIMS and CH-SIMS v2. The implementation of this work is publicly accessible at https://github.com/gw-zhong/CMC.

[99] Real-Time Currency Detection and Voice Feedback for Visually Impaired Individuals

Saraf Anzum Shreya,MD. Abu Ismail Siddique,Sharaf Tasnim

Main category: cs.CV

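TL;DR: This paper presents a real-time currency detection system based on the YOLOv8 nano model, intended to help visually impaired people independently identify US dollar, euro, and Bangladeshi taka notes and coins.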

Details Motivation: Visually impaired people have difficulty handling money in daily life, and relying on others is inconvenient and offers no privacy, so an efficient real-time assistive tool is needed to increase their independence. Method: Uses a YOLOv8 nano model with a custom detection head that adds deep convolutional layers and Squeeze-and-Excitation blocks to strengthen feature extraction, trained and optimized on a dataset with 30 classes of currency. Result: The model achieves an accuracy of 97.73%, a recall of 95.23%, an F1-score of 95.85%, and an mAP50(B) of 97.21%. Conclusion: Combined with voice feedback, the system has practical value and can help visually impaired people identify currency independently, improving their quality of life. Abstract: Technologies like smartphones have become essential in our daily lives and are accessible to everyone, including visually impaired individuals. With smartphone cameras, image capture and processing have become more convenient. With smartphones and machine learning, the lives of visually impaired people can be made a little easier. Daily tasks such as handling money without relying on someone else can be troublesome for them. For that purpose, this paper presents a real-time currency detection system designed to assist visually impaired individuals. The proposed model is trained on a dataset containing 30 classes of notes and coins, representing 3 types of currency: US dollar (USD), Euro (EUR), and Bangladeshi taka (BDT). Our approach uses a YOLOv8 nano model with a custom detection head featuring deep convolutional layers and Squeeze-and-Excitation blocks to enhance feature extraction and detection accuracy. Our model achieves an accuracy of 97.73%, a recall of 95.23%, an F1-score of 95.85%, and a mean Average Precision at IoU=0.5 (mAP50(B)) of 97.21%. Voice feedback after detection helps visually impaired users identify the currency. This paper aims to create a practical and efficient currency detection system that empowers visually impaired individuals to handle money independently.
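
Since the abstract reports mAP at IoU=0.5, a short sketch of the underlying overlap test may be useful. This is the standard intersection-over-union definition, not code from the paper; the `(x1, y1, x2, y2)` box layout, the helper `is_true_positive`, and the class label strings are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred_box, pred_cls, gt_box, gt_cls, threshold=0.5):
    """A detection counts at mAP50 only if the class matches and IoU >= 0.5."""
    return pred_cls == gt_cls and iou(pred_box, gt_box) >= threshold
```

For example, two equal 2 x 2 boxes shifted by half their width overlap with an IoU of 1/3, so that prediction would not count as a hit at the 0.5 threshold even if the predicted denomination were correct.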

[100] GMFVAD: Using Grained Multi-modal Feature to Improve Video Anomaly Detection

Guangyu Dai,Dong Chen,Siliang Tang,Yueting Zhuang

Main category: cs.CV

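TL;DR: Proposes GMFVAD, a video anomaly detection method based on multi-modal information that uses fine-grained text features to enhance visual features and reduce redundancy, achieving state-of-the-art performance.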

Details Motivation: Existing methods fuse multi-modal information coarsely and overlook redundant information in video snippets, which hurts anomaly detection. Method: Proposes GMFVAD, which generates finer-grained multi-modal features, uses text features from video captions to enhance the visual features of key portions, and exploits multi-modal diversity to reduce visual redundancy. Result: Achieves state-of-the-art performance on four mainstream datasets; ablation experiments confirm that reducing redundant information drives the improvement. Conclusion: Through fine-grained multi-modal fusion, GMFVAD reduces feature redundancy and significantly improves video anomaly detection accuracy. Abstract: Video anomaly detection (VAD) is a challenging task that detects anomalous frames in continuous surveillance videos. Most previous work utilizes the spatio-temporal correlation of visual features to distinguish whether there are abnormalities in video snippets. Recently, some works attempt to introduce multi-modal information, such as text features, to enhance the results of video anomaly detection. However, these works merely incorporate text features into video snippets in a coarse manner, overlooking the significant amount of redundant information that may exist within the video snippets. Therefore, we propose to leverage the diversity among multi-modal information to further refine the extracted features, reducing the redundancy in visual features, and we propose Grained Multi-modal Feature for Video Anomaly Detection (GMFVAD). Specifically, we generate finer-grained multi-modal features based on the video snippet, which summarize the main content, and introduce text features based on the captions of the original video to further enhance the visual features of highlighted portions. Experiments show that the proposed GMFVAD achieves state-of-the-art performance on four mainstream datasets. Ablation experiments also validate that the improvement of GMFVAD is due to the reduction of redundant information.

[101] Causal Debiasing for Visual Commonsense Reasoning

Jiayi Zou,Gengyun Jia,Bing-Kun Bao

Main category: cs.CV

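TL;DR: This paper introduces the VCR-OOD datasets to evaluate model generalization across the visual and textual modalities, and uses backdoor adjustment with an answer dictionary to remove co-occurrence and statistical biases in the data, improving debiasing.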

Details Motivation: Existing visual commonsense reasoning methods overlook dataset bias and lack effective debiasing strategies, which harms generalization. Method: Builds two subsets, VCR-OOD-QA and VCR-OOD-VA, to evaluate cross-modal generalization; analyzes the causal graphs and prediction shortcuts in the VCR task, applies backdoor adjustment for debiasing, and builds a dictionary from the set of correct answers to eliminate shortcut effects. Result: Experiments show the debiasing method is effective on multiple datasets and improves generalization in OOD settings. Conclusion: Building a debiasing dictionary and combining it with causal intervention mitigates bias in multimodal VCR data and improves model robustness and interpretability. Abstract: Visual Commonsense Reasoning (VCR) refers to answering questions and providing explanations based on images. While existing methods achieve high prediction accuracy, they often overlook bias in datasets and lack debiasing strategies. In this paper, our analysis reveals co-occurrence and statistical biases in both textual and visual data. We introduce the VCR-OOD datasets, comprising VCR-OOD-QA and VCR-OOD-VA subsets, which are designed to evaluate the generalization capabilities of models across two modalities. Furthermore, we analyze the causal graphs and prediction shortcuts in VCR and adopt a backdoor adjustment method to remove bias. Specifically, we create a dictionary based on the set of correct answers to eliminate prediction shortcuts. Experiments demonstrate the effectiveness of our debiasing method across different datasets.
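
Backdoor adjustment has a standard closed form, P(y | do(x)) = sum over z of P(y | x, z) P(z), which the sketch below evaluates for a discrete confounder. How the paper maps its answer dictionary onto z is not stated in the abstract, so reading the dictionary entries as confounder values is an assumption, and the function and variable names are hypothetical.

```python
def backdoor_adjust(p_y_given_xz, p_z):
    """Return P(y | do(x)) = sum_z P(y | x, z) * P(z) for a discrete confounder.

    p_y_given_xz: maps each confounder value z to P(y | x, z)
    p_z:          maps each confounder value z to its prior P(z)
    """
    # Weight by the prior P(z), not by P(z | x), to cut the backdoor path.
    return sum(p_y_given_xz[z] * p_z[z] for z in p_z)
```

Calling `backdoor_adjust({"a": 0.8, "b": 0.4}, {"a": 0.25, "b": 0.75})` averages the two conditional probabilities under the prior. An ordinary conditional would instead weight them by how often each z co-occurs with x, which is where a dataset shortcut can enter.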

[102] Knowledge-Informed Neural Network for Complex-Valued SAR Image Recognition

Haodong Yang,Zhongling Huang,Shaojie Guo,Zhe Zhang,Gong Cheng,Junwei Han

Main category: cs.CV

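TL;DR: Proposes the Knowledge-Informed Neural Network (KINN), which uses physical priors and a compact architecture to resolve the representation trilemma in CV-SAR image recognition, achieving efficient, interpretable, and generalizable recognition in data-scarce and out-of-distribution scenarios.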

Details Motivation: Conventional data-driven models fail to exploit the electromagnetic scattering features in CV-SAR data, making it hard to balance generalization, interpretability, and efficiency under limited data and domain shift. Method: Proposes the KINN framework with a "compression-aggregation-compression" structure: the first stage embeds prior knowledge through a physics-guided dictionary processor and extracts sparse physical features with a compact unfolding network; the second stage aggregates features; the third stage performs semantic compression with a lightweight self-distilled classification head to learn task-relevant, discriminative representations. Result: Validated on five SAR benchmarks, KINN achieves the best performance with only 0.7M (CNN) and 0.95M (ViT) parameters, markedly improving generalization in few-shot and out-of-distribution scenarios while remaining interpretable. Conclusion: KINN effectively resolves the representation trilemma in CV-SAR image recognition and offers a new path for trustworthy AI in SAR analysis. Abstract: Deep learning models for complex-valued Synthetic Aperture Radar (CV-SAR) image recognition are fundamentally constrained by a representation trilemma under data-limited and domain-shift scenarios: the concurrent, yet conflicting, optimization of generalization, interpretability, and efficiency. Our work is motivated by the premise that the rich electromagnetic scattering features inherent in CV-SAR data hold the key to resolving this trilemma, yet they are insufficiently harnessed by conventional data-driven models. To this end, we introduce the Knowledge-Informed Neural Network (KINN), a lightweight framework built upon a novel "compression-aggregation-compression" architecture. The first stage performs a physics-guided compression, wherein a novel dictionary processor adaptively embeds physical priors, enabling a compact unfolding network to efficiently extract sparse, physically-grounded signatures. A subsequent aggregation module enriches these representations, followed by a final semantic compression stage that utilizes a compact classification head with self-distillation to learn maximally task-relevant and discriminative embeddings. We instantiate KINN in both CNN (0.7M) and Vision Transformer (0.95M) variants. Extensive evaluations on five SAR benchmarks confirm that KINN establishes a state-of-the-art in parameter-efficient recognition, offering exceptional generalization in data-scarce and out-of-distribution scenarios and tangible interpretability, thereby providing an effective solution to the representation trilemma and offering a new path for trustworthy AI in SAR image analysis.

[103] DMC$^3$: Dual-Modal Counterfactual Contrastive Construction for Egocentric Video Question Answering

Jiayi Zou,Chaofan Chen,Bing-Kun Bao,Changsheng Xu

Main category: cs.CV

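TL;DR: Proposes a Dual-Modal Counterfactual Contrastive Construction framework (DMC³) to address multi-event understanding and hand-object interaction recognition in egocentric video question answering, reaching state-of-the-art performance on multiple datasets.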

Details Motivation: Existing methods overlook the unique challenges of the first-person perspective, such as multi-event understanding and hand-object interaction recognition, so more effective modeling is needed to improve Egocentric VideoQA. Method: Proposes the DMC³ framework, comprising a baseline model, a counterfactual sample construction module (which generates positive and negative textual and visual samples through event description paraphrasing and core interaction mining), and a counterfactual sample-involved contrastive optimization module that uses a contrastive loss to pull positive samples closer and push negative samples away. Result: Achieves 52.51% and 46.04% accuracy on the normal and indirect splits of EgoTaskQA and 13.2% on QAEGO4D, all state of the art. Conclusion: By introducing counterfactual contrastive learning, DMC³ improves egocentric video question answering and confirms the importance of modeling first-person characteristics. Abstract: Egocentric Video Question Answering (Egocentric VideoQA) plays an important role in egocentric video understanding, which refers to answering questions based on first-person videos. Although existing methods have made progress through the paradigm of pre-training and fine-tuning, they ignore the unique challenges posed by the first-person perspective, such as understanding multiple events and recognizing hand-object interactions. To deal with these challenges, we propose a Dual-Modal Counterfactual Contrastive Construction (DMC$^3$) framework, which contains an Egocentric VideoQA baseline, a counterfactual sample construction module and a counterfactual sample-involved contrastive optimization. Specifically, we first develop a counterfactual sample construction module to generate positive and negative samples for textual and visual modalities through event description paraphrasing and core interaction mining, respectively. Then, we feed these samples together with the original samples into the baseline. Finally, in the counterfactual sample-involved contrastive optimization module, we apply contrastive loss to minimize the distance between the original sample features and the positive sample features, while maximizing the distance from the negative samples. Experiments show that our method achieves 52.51% and 46.04% on the \textit{normal} and \textit{indirect} splits of EgoTaskQA, and 13.2% on QAEGO4D, both reaching the state-of-the-art performance.

[104] UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning

Liangyu Chen,Hanzhang Zhou,Chenglin Cai,Jianan Zhang,Panrong Tong,Quyu Kong,Xu Zhang,Chen Liu,Yuqi Liu,Wenxuan Wang,Yue Wang,Qin Jin,Steven Hoi

Main category: cs.CV

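TL;DR: This paper proposes the "Instruction-as-Reasoning" paradigm, which treats natural-language instructions as dynamic analytical pathways and improves GUI element grounding through a two-stage training framework (supervised fine-tuning followed by reinforcement learning), achieving state-of-the-art results on multiple benchmarks and showing strong agentic potential.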

Details Motivation: Prior work treats instructions as a static proxy for user intent, overlooking how instruction diversity and quality affect grounding performance. The authors find a 23.3% flaw rate in the instructions of existing datasets and aim to exploit instruction diversity to strengthen model reasoning. Method: The "Instruction-as-Reasoning" paradigm is trained in two stages: supervised fine-tuning on synthesized, diverse instructions to instill multi-perspective reasoning, followed by reinforcement learning to optimize the selection and composition of reasoning pathways. Result: UI-Ins-7B and UI-Ins-32B achieve state-of-the-art performance on five challenging benchmarks; for example, UI-Ins-32B reaches 87.3% accuracy on UI-I2E-Bench, and using UI-Ins-7B as the executor yields a 74.1% success rate on AndroidWorld. The models also exhibit emergent reasoning. Conclusion: Treating instructions as dynamic reasoning pathways substantially improves GUI grounding; the method raises accuracy, mitigates policy collapse in the SFT+RL framework, and shows strong generalization and agentic ability. Abstract: GUI grounding, which maps natural-language instructions to actionable UI elements, is a core capability of GUI agents. Prior work largely treats instructions as a static proxy for user intent, overlooking the impact of instruction diversity and quality on grounding performance. Through a careful investigation of existing grounding datasets, we find a 23.3% flaw rate in their instructions and show that inference-time exploitation of instruction diversity yields a relative performance improvement of up to 76%. In this paper, we introduce the Instruction-as-Reasoning paradigm, treating instructions as dynamic analytical pathways that offer distinct perspectives and enabling the model to select the most effective pathway during reasoning. To achieve this, we propose a two-stage training framework: supervised fine-tuning (SFT) on synthesized, diverse instructions to instill multi-perspective reasoning, followed by reinforcement learning (RL) to optimize pathway selection and composition. Our resulting models, UI-Ins-7B and UI-Ins-32B, achieve state-of-the-art results on five challenging grounding benchmarks and exhibit emergent reasoning, selectively composing and synthesizing novel instruction pathways at inference. In particular, UI-Ins-32B attains the best grounding accuracy, scoring 87.3% on UI-I2E-Bench, 57.0% on ScreenSpot-Pro, and 84.9% on MMBench-GUI L2. Furthermore, our model demonstrates strong agentic potential, achieving a 74.1% success rate on AndroidWorld using UI-Ins-7B as the executor.
Our in-depth analysis reveals additional insights such as how reasoning can be formulated to enhance rather than hinder grounding performance, and how our method mitigates policy collapse in the SFT+RL framework. All code and model checkpoints will be publicly released at https://github.com/alibaba/UI-Ins.

[105] Breakdance Video classification in the age of Generative AI

Sauptik Dhar,Naveen Ramakrishnan,Michelle Munson

Main category: cs.CV

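TL;DR: This paper studies the application of modern video foundation models to breakdance, a niche but popular dance sport. It finds that video encoder models outperform state-of-the-art video language models on prediction tasks and analyzes in depth how a finetuned decoder model behaves in breakdance video classification.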

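Details Motivation: Existing research concentrates on mainstream sports such as soccer and basketball and pays little attention to niche but popular dance sports such as breakdance; this paper explores how well video foundation models apply in such specialized settings. Method: Modern video foundation models (both encoders and decoders) are applied to breakdance video classification; encoder models are compared against video language models, and a finetuned decoder model is analyzed in depth. Result: Video encoder models consistently outperform current state-of-the-art video language models on prediction tasks; the paper also offers guidance on choosing an encoder model and insight into how the decoder model works. Conclusion: Video encoder models are better suited to breakdance video classification, and the study provides useful directions for model selection and optimization in video understanding for niche sports. Abstract: Large vision-language models have recently been applied widely across sports use cases. Most of this work targets a limited subset of popular sports such as soccer, cricket, and basketball, and focuses on generative tasks such as visual question answering and highlight generation. This work analyzes the applicability of modern video foundation models (both encoder and decoder) to a very niche but hugely popular dance sport: breakdance. Our results show that Video Encoder models continue to outperform state-of-the-art Video Language Models for prediction tasks. We provide insights on how to choose the encoder model and provide a thorough analysis of the workings of a finetuned decoder model for breakdance video classification.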

[106] A Parameter-Efficient Mixture-of-Experts Framework for Cross-Modal Geo-Localization

LinFeng Li,Jian Zhao,Zepeng Yang,Yuhang Song,Bojun Lin,Tianle Zhang,Yuchen Yuan,Chi Zhang,Xuelong Li

Main category: cs.CV

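TL;DR: This paper proposes a cross-modal drone navigation solution built on domain-aligned preprocessing and a Mixture-of-Experts (MoE) framework, which took the top position in RoboSense 2025 Track 4.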

Details Motivation: To address inter-platform heterogeneity and the domain gap between generic training descriptions and platform-specific test queries. Method: Preprocessing includes platform-wise partitioning, satellite augmentation, and removal of orientation words; an LLM refines the textual captions; using BGE-M3 and EVA-CLIP, three platform experts are trained with two-stage hard-negative mining, and their scores are fused at inference. Result: The system ranks first on the official leaderboard and markedly improves cross-modal geo-localization under heterogeneous viewpoints. Conclusion: The proposed domain-aligned preprocessing and MoE architecture effectively mitigate cross-platform differences and the semantic gap, achieving robust cross-modal geo-localization. Abstract: We present a winning solution to RoboSense 2025 Track 4: Cross-Modal Drone Navigation. The task retrieves the most relevant geo-referenced image from a large multi-platform corpus (satellite/drone/ground) given a natural-language query. Two obstacles are severe inter-platform heterogeneity and a domain gap between generic training descriptions and platform-specific test queries. We mitigate these with a domain-aligned preprocessing pipeline and a Mixture-of-Experts (MoE) framework: (i) platform-wise partitioning, satellite augmentation, and removal of orientation words; (ii) an LLM-based caption refinement pipeline to align textual semantics with the distinct visual characteristics of each platform. Using BGE-M3 (text) and EVA-CLIP (image), we train three platform experts using a progressive two-stage, hard-negative mining strategy to enhance discriminative power, and fuse their scores at inference. The system tops the official leaderboard, demonstrating robust cross-modal geo-localization under heterogeneous viewpoints.
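The abstract says only that the three platform experts' scores are fused at inference, without giving the fusion rule. The sketch below assumes a simple weighted average of per-candidate similarity scores; `fuse_scores`, `top_candidate`, the expert names, and the score values are all hypothetical illustrations.

```python
def fuse_scores(expert_scores, weights=None):
    """Fuse per-candidate scores from several experts by weighted sum.

    expert_scores maps an expert name to a {candidate_id: score} dict.
    Uniform weights are an assumption; the paper may use another rule.
    """
    names = list(expert_scores)
    if weights is None:
        weights = {name: 1.0 / len(names) for name in names}
    fused = {}
    for name in names:
        for candidate, score in expert_scores[name].items():
            fused[candidate] = fused.get(candidate, 0.0) + weights[name] * score
    return fused

def top_candidate(fused):
    """Return the candidate id with the highest fused score."""
    return max(fused, key=fused.get)

scores = {
    "satellite": {"img_a": 0.9, "img_b": 0.2},
    "drone": {"img_a": 0.1, "img_b": 0.5},
    "ground": {"img_a": 0.3, "img_b": 0.4},
}
fused = fuse_scores(scores)
print(top_candidate(fused))
```

Here two of the three experts prefer `img_b`, yet the fused ranking picks `img_a` because the satellite expert is far more confident; a rank-based fusion would behave differently, which is why the choice of rule matters.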

[107] HyperET: Efficient Training in Hyperbolic Space for Multi-modal Large Language Models

Zelin Peng,Zhengqin Xu,Qingyang Liu,Xiaokang Yang,Wei Shen

Main category: cs.CV

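TL;DR: This paper proposes HyperET, an efficient training paradigm for multi-modal large language models in hyperbolic space. By dynamically adjusting the hyperbolic radius, it aligns vision and text at an arbitrary granularity level, clearly improving model performance while adding less than 1% extra parameters.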

Details Motivation: Because their vision encoders lack multi-granularity alignment with language, existing multi-modal large language models require extremely high computational resources to train and are inefficient. Method: Exploiting the inherently hierarchical structure of hyperbolic space, HyperET uses learnable matrices with Möbius multiplication to adjust the radius dynamically in hyperbolic space, aligning vision and text at an arbitrary granularity level. Result: Across multiple MLLM benchmarks, HyperET clearly improves performance in both pre-training and fine-tuning settings while introducing less than 1% additional parameters. Conclusion: HyperET offers an efficient and flexible multi-modal alignment scheme that narrows the granularity gap between the visual and language modalities and lowers the resource requirements of multi-modal model training. Abstract: Multi-modal large language models (MLLMs) have emerged as a transformative approach for aligning visual and textual understanding. They typically require extremely high computational resources (e.g., thousands of GPUs) for training to achieve cross-modal alignment at multi-granularity levels. We argue that a key source of this inefficiency lies in the vision encoders they are widely equipped with, e.g., CLIP and SAM, which lack alignment with language at multi-granularity levels. To address this issue, in this paper, we leverage hyperbolic space, which inherently models hierarchical levels and thus provides a principled framework for bridging the granularity gap between visual and textual modalities at an arbitrary granularity level. Concretely, we propose an efficient training paradigm for MLLMs, dubbed HyperET, which can optimize visual representations to align with their textual counterparts at an arbitrary granularity level through dynamic hyperbolic radius adjustment in hyperbolic space. HyperET employs learnable matrices with Möbius multiplication operations, implemented via three effective configurations: diagonal scaling matrices, block-diagonal matrices, and banded matrices, providing a flexible yet efficient parametrization strategy. Comprehensive experiments across multiple MLLM benchmarks demonstrate that HyperET clearly and consistently improves existing MLLMs in both pre-training and fine-tuning settings, with less than 1\% additional parameters.
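To make the link between Möbius multiplication and radius adjustment concrete, here is a minimal sketch of the diagonal-scaling configuration. It uses the standard Möbius matrix-vector product on the Poincaré ball from the hyperbolic neural network literature; whether HyperET uses exactly this form, and the names `mobius_diag_matvec` and `hyperbolic_radius`, are assumptions.

```python
import math

def mobius_diag_matvec(diag, x, c=1.0):
    """Moebius product of a diagonal matrix with a point x on the Poincare ball.

    Assumed standard form (curvature -c), with M = diag(diag):
      M (x) x = (1/sqrt(c)) * tanh(|Mx|/|x| * artanh(sqrt(c)*|x|)) * Mx/|Mx|
    Requires sqrt(c) * |x| < 1, i.e. x lies inside the ball.
    """
    mx = [d * xi for d, xi in zip(diag, x)]
    x_norm = math.sqrt(sum(v * v for v in x))
    mx_norm = math.sqrt(sum(v * v for v in mx))
    if x_norm == 0.0 or mx_norm == 0.0:
        return [0.0] * len(x)
    sqrt_c = math.sqrt(c)
    scale = math.tanh(mx_norm / x_norm * math.atanh(sqrt_c * x_norm)) / (sqrt_c * mx_norm)
    return [scale * v for v in mx]

def hyperbolic_radius(x, c=1.0):
    """Hyperbolic distance from the origin to x on the Poincare ball."""
    sqrt_c = math.sqrt(c)
    x_norm = math.sqrt(sum(v * v for v in x))
    return 2.0 / sqrt_c * math.atanh(sqrt_c * x_norm)

x = [0.3, 0.4]
y = mobius_diag_matvec([2.0, 2.0], x)
print(hyperbolic_radius(y) / hyperbolic_radius(x))
```

With a uniform diagonal of 2, the point keeps its direction and its hyperbolic radius doubles, so a learnable diagonal acts as a direct control on how far a visual embedding sits from the origin, that is, on its level in the hierarchy.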

[108] AnyPcc: Compressing Any Point Cloud with a Single Universal Model

Kangli Wang,Qianxi Yi,Yuqi Ye,Shihao Li,Wei Gao

Main category: cs.CV

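TL;DR: This paper proposes AnyPcc, a universal point cloud compression framework that substantially improves the generalization of point cloud geometry compression through a universal context model and an instance-adaptive fine-tuning strategy.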

Details Motivation: The generalization of deep learning-based point cloud geometry compression is limited by weak context models and inefficient handling of out-of-distribution data. Method: AnyPcc has two key components: 1) a Universal Context Model that uses spatial and channel-wise grouping priors to capture contextual dependencies; 2) an Instance-Adaptive Fine-Tuning (IAFT) strategy that combines explicit and implicit compression paradigms, fine-tuning a small subset of network weights for each instance and encoding them into the bitstream. Result: On a benchmark of 15 diverse datasets, AnyPcc sets a new state of the art in point cloud compression and shows greater robustness and compression efficiency on out-of-distribution data. Conclusion: By strengthening context modeling and introducing a lightweight instance-adaptive mechanism, AnyPcc improves the generalization of point cloud compression models; code and datasets will be released to support reproducible research. Abstract: Generalization remains a critical challenge for deep learning-based point cloud geometry compression. We argue this stems from two key limitations: the lack of robust context models and the inefficient handling of out-of-distribution (OOD) data. To address both, we introduce AnyPcc, a universal point cloud compression framework. AnyPcc first employs a Universal Context Model that leverages priors from both spatial and channel-wise grouping to capture robust contextual dependencies. Second, our novel Instance-Adaptive Fine-Tuning (IAFT) strategy tackles OOD data by synergizing explicit and implicit compression paradigms. It fine-tunes a small subset of network weights for each instance and incorporates them into the bitstream, where the marginal bit cost of the weights is dwarfed by the resulting savings in geometry compression. Extensive experiments on a benchmark of 15 diverse datasets confirm that AnyPcc sets a new state-of-the-art in point cloud compression. Our code and datasets will be released to encourage reproducible research.

[109] AccuQuant: Simulating Multiple Denoising Steps for Quantizing Diffusion Models

Seunghoon Lee,Jeongwoo Choi,Byunggwan Son,Jaehyeon Moon,Jeimin Jeon,Bumsub Ham

Main category: cs.CV

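TL;DR: This paper proposes AccuQuant, a novel post-training quantization method for diffusion models that reduces the accumulation of quantization error by explicitly simulating multiple denoising steps.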

Details Motivation: Quantized diffusion models suffer from error accumulation, especially across multi-step denoising, which degrades generation quality; a quantization method that mitigates this problem is needed. Method: AccuQuant explicitly simulates multiple denoising steps by minimizing the discrepancy between the outputs of the full-precision model and the quantized model over several denoising steps; an efficient implementation technique and a new objective reduce memory complexity from O(n) to O(1). Result: Experiments across multiple tasks and standard benchmarks show that AccuQuant preserves generation quality while markedly improving quantization efficiency and performance. Conclusion: AccuQuant effectively mitigates quantization error accumulation in diffusion models, is efficient and broadly applicable, and provides a practical post-training quantization option for deploying diffusion models. Abstract: We present in this paper a novel post-training quantization (PTQ) method, dubbed AccuQuant, for diffusion models. We show analytically and empirically that quantization errors for diffusion models are accumulated over denoising steps in a sampling process. To alleviate the error accumulation problem, AccuQuant minimizes the discrepancies between outputs of a full-precision diffusion model and its quantized version within a couple of denoising steps. That is, it simulates multiple denoising steps of a diffusion sampling process explicitly for quantization, accounting for the errors accumulated over multiple denoising steps, in contrast to previous approaches that imitate the training process of diffusion models, namely, minimizing the discrepancies independently for each step. We also present an efficient implementation technique for AccuQuant, together with a novel objective, which reduces the memory complexity significantly from $\mathcal{O}(n)$ to $\mathcal{O}(1)$, where $n$ is the number of denoising steps. We demonstrate the efficacy and efficiency of AccuQuant across various tasks and diffusion models on standard benchmarks.

[110] Positional Encoding Field

Yunpeng Bai,Haoxiang Li,Qixing Huang

Main category: cs.CV

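TL;DR: This paper proposes the Positional Encoding Field (PE-Field), which extends positional encodings from the 2D plane to a structured 3D field to strengthen the geometric modeling ability of DiTs in 3D space, achieving state-of-the-art performance on single-image novel view synthesis and spatial image editing.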

Details Motivation: Patch tokens in DiTs turn out to be surprisingly independent: even when the positional encodings are perturbed, DiTs still produce globally coherent outputs, indicating that spatial coherence is governed primarily by the positional encodings. This motivates a more powerful positional encoding mechanism to support 3D modeling. Method: PE-Field extends positional encodings into a structured 3D field with depth-aware encodings and hierarchical sub-patch control, allowing DiTs to model geometry directly in 3D space. Result: The method reaches state-of-the-art performance on single-image novel view synthesis and generalizes to controllable spatial image editing. Conclusion: By extending positional encodings to a structured 3D field, PE-Field markedly improves DiT performance on 3D visual generation tasks and confirms the central role of positional encodings in vision Transformers. Abstract: Diffusion Transformers (DiTs) have emerged as the dominant architecture for visual generation, powering state-of-the-art image and video models. By representing images as patch tokens with positional encodings (PEs), DiTs combine Transformer scalability with spatial and temporal inductive biases. In this work, we revisit how DiTs organize visual content and discover that patch tokens exhibit a surprising degree of independence: even when PEs are perturbed, DiTs still produce globally coherent outputs, indicating that spatial coherence is primarily governed by PEs. Motivated by this finding, we introduce the Positional Encoding Field (PE-Field), which extends positional encodings from the 2D plane to a structured 3D field. PE-Field incorporates depth-aware encodings for volumetric reasoning and hierarchical encodings for fine-grained sub-patch control, enabling DiTs to model geometry directly in 3D space. Our PE-Field-augmented DiT achieves state-of-the-art performance on single-image novel view synthesis and generalizes to controllable spatial image editing.

[111] Mitigating Cross-modal Representation Bias for Multicultural Image-to-Recipe Retrieval

Qing Wang,Chong-Wah Ngo,Yu Cao,Ee-Peng Lim

Main category: cs.CV

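TL;DR: This paper proposes a novel causal approach to image-to-recipe retrieval that predicts culinary elements likely to be overlooked in images and injects them explicitly into cross-modal representation learning, easing the retrieval difficulties caused by visual bias.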

Details Motivation: Existing methods assume that a food image fully reflects the textual details of its recipe, but an image shows only the appearance of the finished dish and cannot capture key non-visual cooking information (such as subtle differences in ingredient use and cooking steps). This biases cross-modal representation learning, and the bias is more severe in data that mixes multiple cuisines. Method: A causal representation learning framework first predicts latent culinary elements not visible in the image (such as implicit ingredients or actions) and then injects them explicitly into cross-modal representation learning, reducing reliance on dominant visual features and improving the modeling of subtle differences. Result: Experiments on the standard monolingual Recipe1M dataset and a newly curated multilingual multicultural cuisine dataset show that the method uncovers overlooked subtle ingredients and cooking actions and achieves strong retrieval performance in both monolingual and multilingual multicultural settings. Conclusion: By using causal reasoning to model the missing link between images and recipes, the method mitigates visual bias in cross-modal representation learning and improves fine-grained image-to-recipe matching and retrieval, particularly in mixed multicultural settings. Abstract: Existing approaches for image-to-recipe retrieval have the implicit assumption that a food image can fully capture the details textually documented in its recipe. However, a food image only reflects the visual outcome of a cooked dish and not the underlying cooking process. Consequently, learning cross-modal representations to bridge the modality gap between images and recipes tends to ignore subtle, recipe-specific details that are not visually apparent but are crucial for recipe retrieval. Specifically, the representations are biased to capture the dominant visual elements, resulting in difficulty in ranking similar recipes with subtle differences in use of ingredients and cooking methods. The bias in representation learning is expected to be more severe when the training data is a mix of images and recipes sourced from different cuisines. This paper proposes a novel causal approach that predicts the culinary elements potentially overlooked in images, while explicitly injecting these elements into cross-modal representation learning to mitigate biases. Experiments are conducted on the standard monolingual Recipe1M dataset and a newly curated multilingual multicultural cuisine dataset. The results indicate that the proposed causal representation learning is capable of uncovering subtle ingredients and cooking actions and achieves impressive retrieval performance on both monolingual and multilingual multicultural datasets.

[112] Dynamic Weight Adjustment for Knowledge Distillation: Leveraging Vision Transformer for High-Accuracy Lung Cancer Detection and Real-Time Deployment

Saif Ur Rehman Khan,Muhammad Nabeel Asim,Sebastian Vollmer,Andreas Dengel

Main category: cs.CV

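TL;DR: This paper proposes the FuzzyDistillViT-MobileNet model, which combines dynamic fuzzy logic-driven knowledge distillation with image fusion for lung cancer classification and achieves high accuracy on multiple datasets.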

Details Motivation: Traditional knowledge distillation uses fixed weights and copes poorly with the uncertainty and complexity of lung cancer diagnosis; the student model should attend more to high-confidence regions and be less affected by ambiguous ones. Method: A Vision Transformer (ViT-B32) serves as the teacher and MobileNet as the student, with the distillation weight adjusted dynamically by fuzzy logic; Gamma correction and Histogram Equalization provide pixel-level image enhancement, and a wavelet-based fusion method (wavedec2) fuses multi-scale features; a Genetic Algorithm selects the best pre-trained student model; a dynamic wait adjustment mechanism optimizes convergence during training. Result: The model reaches 99.16% accuracy on the LC25000 histopathological image dataset and 99.54% on the IQOTH/NCCD CT-scan image dataset, showing robustness across imaging domains. Conclusion: Dynamic fuzzy distillation and image fusion substantially improve lung cancer classification while balancing computational efficiency and generalization, making the method suitable for complex medical image diagnosis. Abstract: This paper presents the FuzzyDistillViT-MobileNet model, a novel approach for lung cancer (LC) classification, leveraging dynamic fuzzy logic-driven knowledge distillation (KD) to address uncertainty and complexity in disease diagnosis. Unlike traditional models that rely on static KD with fixed weights, our method dynamically adjusts the distillation weight using fuzzy logic, enabling the student model to focus on high-confidence regions while reducing attention to ambiguous areas. This dynamic adjustment improves the model's ability to handle varying uncertainty levels across different regions of LC images. We employ the Vision Transformer (ViT-B32) as the instructor model, which effectively transfers knowledge to the student model, MobileNet, enhancing the student's generalization capabilities. The training process is further optimized using a dynamic wait adjustment mechanism that adapts the training procedure for improved convergence and performance. To enhance image quality, we introduce pixel-level image fusion improvement techniques such as Gamma correction and Histogram Equalization. The processed images (Pix1 and Pix2) are fused using a wavelet-based fusion method to improve image resolution and feature preservation. This fusion method uses the wavedec2 function to standardize images to a 224x224 resolution, decompose them into multi-scale frequency components, and recursively average coefficients at each level for better feature representation.
To address computational efficiency, a Genetic Algorithm (GA) is used to select the most suitable pre-trained student model from a pool of 12 candidates, balancing model performance with computational cost. The model is evaluated on two datasets, LC25000 histopathological images (99.16% accuracy) and IQOTH/NCCD CT-scan images (99.54% accuracy), demonstrating robustness across different imaging domains.
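The two pixel-level enhancement steps named in the abstract, Gamma correction and Histogram Equalization, are standard operations and can be sketched in plain Python. This is a minimal illustration on an 8-bit grayscale image stored as nested lists; the function names and the gamma value are hypothetical, and the wavelet fusion step (wavedec2, from the PyWavelets library) is deliberately left out to keep the sketch dependency-free.

```python
def gamma_correct(img, gamma):
    """Apply out = 255 * (in / 255) ** gamma to an 8-bit grayscale image.

    gamma < 1 brightens dark regions, gamma > 1 darkens them.
    """
    return [[round(255 * (p / 255) ** gamma) for p in row] for row in img]

def equalize_histogram(img):
    """Classic histogram equalization via the cumulative distribution."""
    flat = [p for row in img for p in row]
    hist = [0] * 256
    for p in flat:
        hist[p] += 1
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    n = len(flat)
    if n == cdf_min:  # constant image: nothing to stretch
        return [row[:] for row in img]
    return [[round((cdf[p] - cdf_min) / (n - cdf_min) * 255) for p in row]
            for row in img]

image = [[50, 50], [100, 150]]
pix1 = gamma_correct(image, 0.5)   # analogous to the "Pix1" branch
pix2 = equalize_histogram(image)   # analogous to the "Pix2" branch
print(pix1, pix2)
```

The two outputs emphasize different properties of the same input (brightness of dark areas versus overall contrast), which is the reason for fusing them afterwards rather than choosing one.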

[113] Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence

Kun Ouyang,Yuanxin Liu,Linli Yao,Yishuo Cai,Hao Zhou,Jie Zhou,Fandong Meng,Xu Sun

Main category: cs.CV

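TL;DR: This paper proposes Conan, a framework for evidence-grounded multi-step video reasoning. By constructing the large-scale Conan-91K dataset and designing a multi-stage progressive cold-start strategy, it improves accuracy by more than 10% on average across six benchmarks and achieves state-of-the-art performance.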

Details Motivation: In video reasoning, existing multimodal large language models either produce reasoning chains detached from visual evidence or localize evidence frames inaccurately, and they lack effective cross-frame multi-step reasoning. Method: Conan combines an Identification-Reasoning-Action (AIR) reinforcement learning training scheme with Conan-91K, a large-scale, automatically constructed dataset of reasoning traces, to identify contextual and evidence frames, reason over cross-frame clues, and decide adaptively whether to explore further or conclude. Result: On six multi-step video reasoning benchmarks, Conan exceeds Qwen2.5-VL-7B-Instruct by more than 10% in average accuracy, and it shows good generalization, scalability, and robustness on long-video understanding tasks. Conclusion: By introducing visual evidence grounding and adaptive reasoning, Conan markedly improves multimodal large models on complex video reasoning and offers a more reliable and scalable approach to video understanding. Abstract: Video reasoning, which requires multi-step deduction across frames, remains a major challenge for multimodal large language models (MLLMs). While reinforcement learning (RL)-based methods enhance reasoning capabilities, they often rely on text-only chains that yield ungrounded or hallucinated conclusions. Conversely, frame-retrieval approaches introduce visual grounding but still struggle with inaccurate evidence localization. To address these challenges, we present Conan, a framework for evidence-grounded multi-step video reasoning. Conan identifies contextual and evidence frames, reasons over cross-frame clues, and adaptively decides when to conclude or explore further. To achieve this, we (1) construct Conan-91K, a large-scale dataset of automatically generated reasoning traces that includes frame identification, evidence reasoning, and action decision, and (2) design a multi-stage progressive cold-start strategy combined with an Identification-Reasoning-Action (AIR) RLVR training framework to jointly enhance multi-step visual reasoning. Extensive experiments on six multi-step reasoning benchmarks demonstrate that Conan surpasses the baseline Qwen2.5-VL-7B-Instruct by an average of over 10% in accuracy, achieving state-of-the-art performance. Furthermore, Conan generalizes effectively to long-video understanding tasks, validating its strong scalability and robustness.

[114] Reliable and Reproducible Demographic Inference for Fairness in Face Analysis

Alexandre Fournier-Montgieux,Hervé Le Borgne,Adrian Popescu,Bertrand Luvison

Main category: cs.CV

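TL;DR: This paper proposes a modular transfer learning approach to automatic demographic attribute inference for face analysis systems, improving the reliability and robustness of fairness evaluation, and releases the full set of resources to support reproducibility.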

Details Motivation: Fairness evaluation depends on automatic demographic attribute inference (DAI), whose reliability determines the validity of the evaluation, so DAI must be made more reliable for fairness audits to be accurate. Method: A modular transfer learning DAI pipeline combines pretrained face recognition encoders with non-linear classification heads and is evaluated on accuracy, fairness, and a new robustness dimension based on intra-identity consistency. Result: Across multiple datasets and training setups, the method outperforms strong baselines on gender and ethnicity inference, particularly on the more challenging ethnicity task. Conclusion: The proposed DAI pipeline provides a more reliable basis for demographic inference in fairness auditing and is reproducible and broadly applicable. Abstract: Fairness evaluation in face analysis systems (FAS) typically depends on automatic demographic attribute inference (DAI), which itself relies on predefined demographic segmentation. However, the validity of fairness auditing hinges on the reliability of the DAI process. We begin by providing a theoretical motivation for this dependency, showing that improved DAI reliability leads to less biased and lower-variance estimates of FAS fairness. To address this, we propose a fully reproducible DAI pipeline that replaces conventional end-to-end training with a modular transfer learning approach. Our design integrates pretrained face recognition encoders with non-linear classification heads. We audit this pipeline across three dimensions: accuracy, fairness, and a newly introduced notion of robustness, defined via intra-identity consistency. The proposed robustness metric is applicable to any demographic segmentation scheme. We benchmark the pipeline on gender and ethnicity inference across multiple datasets and training setups. Our results show that the proposed method outperforms strong baselines, particularly on ethnicity, which is the more challenging attribute. To promote transparency and reproducibility, we will publicly release the training dataset metadata, full codebase, pretrained models, and evaluation toolkit. This work contributes a reliable foundation for demographic inference in fairness auditing.
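The abstract defines robustness "via intra-identity consistency" without giving a formula. One plausible reading, sketched here purely as an assumption, is the share of an identity's images that receive that identity's majority prediction, averaged over identities. The function name and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict

def intra_identity_consistency(identity_ids, predictions):
    """Mean, over identities, of the fraction of images whose predicted
    attribute matches the majority prediction for that identity.

    This is an assumed reading of the metric, not the paper's definition.
    It needs no ground-truth labels, so it applies to any segmentation scheme.
    """
    by_identity = defaultdict(list)
    for identity, pred in zip(identity_ids, predictions):
        by_identity[identity].append(pred)
    scores = []
    for preds in by_identity.values():
        majority_count = Counter(preds).most_common(1)[0][1]
        scores.append(majority_count / len(preds))
    return sum(scores) / len(scores)

ids = ["a", "a", "a", "b", "b"]
preds = ["F", "F", "M", "M", "M"]
print(intra_identity_consistency(ids, preds))
```

A score of 1.0 means every image of a person gets the same predicted attribute; lower values flag a classifier whose output changes with pose, lighting, or image quality rather than with the person.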

[115] EchoDistill: Bidirectional Concept Distillation for One-Step Diffusion Personalization

Yixiong Yang,Tao Wu,Senmao Li,Shiqi Yang,Yaxing Wang,Joost van de Weijer,Kai Wang

Main category: cs.CV

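TL;DR: This paper proposes EchoDistill, a bidirectional concept distillation framework for one-step diffusion personalization (1-SDP). Through bidirectional knowledge transfer between teacher and student and a shared text encoder, it markedly improves both the ability to personalize novel concepts and generation quality in text-to-image synthesis.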

Details Motivation: One-step text-to-image diffusion models are fast but limited when personalizing novel concepts, because they struggle to capture new concept distributions; their personalization ability needs improvement. Method: A joint teacher-student training framework uses a multi-step model as the teacher and a one-step model as the student; concepts are distilled bidirectionally (from teacher to student, then echoed back from student to teacher), the text encoder is shared, and adversarial and alignment losses are introduced; the student uses its fast generation to feed back into and refine the teacher. Result: Experiments show that the method significantly outperforms existing personalization methods in the 1-SDP setup, improving the student's personalization ability and the teacher's generation quality. Conclusion: EchoDistill establishes a new paradigm for fast and effective personalization of T2I diffusion models, with teacher and student optimized cooperatively through the bidirectional echoing mechanism. Abstract: Recent advances in accelerating text-to-image (T2I) diffusion models have enabled the synthesis of high-fidelity images even in a single step. However, personalizing these models to incorporate novel concepts remains a challenge due to the limited capacity of one-step models to capture new concept distributions effectively. We propose a bidirectional concept distillation framework, EchoDistill, to enable one-step diffusion personalization (1-SDP). Our approach involves an end-to-end training process where a multi-step diffusion model (teacher) and a one-step diffusion model (student) are trained simultaneously. The concept is first distilled from the teacher model to the student, and then echoed back from the student to the teacher. During EchoDistill, we share the text encoder between the two models to ensure consistent semantic understanding. Following this, the student model is optimized with adversarial losses to align with the real image distribution and with alignment losses to maintain consistency with the teacher's output. Furthermore, we introduce the bidirectional echoing refinement strategy, wherein the student model leverages its faster generation capability to provide feedback to the teacher model. This bidirectional concept distillation mechanism not only enhances the student's ability to personalize novel concepts but also improves the generative quality of the teacher model. Our experiments demonstrate that this collaborative framework significantly outperforms existing personalization methods in the 1-SDP setup, establishing a novel paradigm for rapid and effective personalization in T2I diffusion models.

[116] Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning

Xiaohan Lan,Fanfan Liu,Haibo Qiu,Siqi Yang,Delian Ruan,Peng Shi,Lin Ma

Main category: cs.CV

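TL;DR: This paper proposes Metis-HOME, a mixture-of-experts framework that splits the model into a thinking branch and a non-thinking branch and routes queries dynamically, improving both complex reasoning and general performance and resolving the trade-off between reasoning and generalization in multimodal large models.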

Details Motivation: Existing large multimodal reasoning models incur heavy computation even on simple queries, and over-specializing in reasoning weakens their general understanding, so an architecture that balances efficiency and generalization is needed. Method: Metis-HOME converts Qwen2.5-VL-7B into an MoE-based two-expert system with a thinking branch (for complex reasoning) and a non-thinking branch (for rapid, direct inference), plus a lightweight trainable router that allocates queries dynamically. Result: Experiments show that Metis-HOME substantially improves complex reasoning while also improving performance on tasks such as general VQA and OCR, overcoming the loss of generalization seen in conventional reasoning models. Conclusion: Metis-HOME offers a new design paradigm for multimodal large models that balances reasoning ability with generality and points the way toward efficient, versatile MLLMs. Abstract: Inspired by recent advancements in LLM reasoning, the field of multimodal reasoning has seen remarkable progress, achieving significant performance gains on intricate tasks such as mathematical problem-solving. Despite this progress, current multimodal large reasoning models exhibit two key limitations. They tend to employ computationally expensive reasoning even for simple queries, leading to inefficiency. Furthermore, this focus on specialized reasoning often impairs their broader, more general understanding capabilities. In this paper, we propose Metis-HOME: a Hybrid Optimized Mixture-of-Experts framework designed to address this trade-off. Metis-HOME enables a "Hybrid Thinking" paradigm by structuring the original dense model into two distinct expert branches: a thinking branch tailored for complex, multi-step reasoning, and a non-thinking branch optimized for rapid, direct inference on tasks like general VQA and OCR. A lightweight, trainable router dynamically allocates queries to the most suitable expert. We instantiate Metis-HOME by adapting the Qwen2.5-VL-7B into an MoE architecture. Comprehensive evaluations reveal that our approach not only substantially enhances complex reasoning abilities but also improves the model's general capabilities, reversing the degradation trend observed in other reasoning-specialized models. Our work establishes a new paradigm for building powerful and versatile MLLMs, effectively resolving the prevalent reasoning-vs-generalization dilemma.

[117] Fake-in-Facext: Towards Fine-Grained Explainable DeepFake Analysis

Lixiong Qin,Yang Zhang,Mei Wang,Jiani Hu,Weihong Deng,Weiran Xu

Main category: cs.CV

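TL;DR: This paper proposes the Fake-in-Facext (FiFa) framework, which uses fine-grained facial region annotation and the multi-task model FiFa-MLLM to ground explainable DeepFake analysis more firmly in visual context.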

Details Motivation: Existing DeepFake explanation methods lack fine-grained awareness: annotations are coarse, and models cannot connect textual explanations to visual evidence, which limits grounding in facial visual context. Method: A Facial Image Concept Tree (FICT) enables fine-grained annotation and underpins the FiFa-Annotator data annotation pipeline; a new Artifact-Grounding Explanation (AGE) task is introduced; the FiFa-MLLM multi-task architecture supports interleaved text-and-mask outputs and multimodal inputs. Result: FiFa-MLLM outperforms strong baselines on the AGE task and achieves SOTA performance on existing explainable DeepFake analysis datasets. Conclusion: Through fine-grained structured annotation and a unified multi-task model, the FiFa framework markedly improves the accuracy and visual traceability of DeepFake explanations. Abstract: The advancement of Multimodal Large Language Models (MLLMs) has bridged the gap between vision and language tasks, enabling the implementation of Explainable DeepFake Analysis (XDFA). However, current methods suffer from a lack of fine-grained awareness: the description of artifacts in data annotation is unreliable and coarse-grained, and the models fail to support the output of connections between textual forgery explanations and the visual evidence of artifacts, as well as the input of queries for arbitrary facial regions. As a result, their responses are not sufficiently grounded in Face Visual Context (Facext). To address this limitation, we propose the Fake-in-Facext (FiFa) framework, with contributions focusing on data annotation and model construction. We first define a Facial Image Concept Tree (FICT) to divide facial images into fine-grained regional concepts, thereby obtaining a more reliable data annotation pipeline, FiFa-Annotator, for forgery explanation. Based on this dedicated data annotation, we introduce a novel Artifact-Grounding Explanation (AGE) task, which generates textual forgery explanations interleaved with segmentation masks of manipulated artifacts. We propose a unified multi-task learning architecture, FiFa-MLLM, to simultaneously support abundant multimodal inputs and outputs for fine-grained Explainable DeepFake Analysis. With multiple auxiliary supervision tasks, FiFa-MLLM can outperform strong baselines on the AGE task and achieve SOTA performance on existing XDFA datasets. The code and data will be made open-source at https://github.com/lxq1000/Fake-in-Facext.

[118] Blur2seq: Blind Deblurring and Camera Trajectory Estimation from a Single Camera Motion-blurred Image

Guillermo Carbajal,Andrés Almansa,Pablo Musé

Main category: cs.CV

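TL;DR: This paper proposes a deep learning framework for image deblurring that jointly estimates the sharp image and the camera motion trajectory, effectively handling motion blur caused by large or rotational movements.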

Details Motivation: Motion blur under large or rotational camera shake remains a major challenge in image restoration, and existing end-to-end networks perform poorly on severe or spatially variant blur. Method: Using a differentiable Projective Motion Blur Model (PMBM), a neural network predicts a 3D rotation trajectory that guides a model-based restoration network trained end-to-end; a reblur loss refines the trajectory after inference. Result: The method achieves state-of-the-art performance on both synthetic and real datasets and outperforms existing methods particularly on severe or spatially variant blur. Conclusion: Beyond improving deblurring quality, the interpretable camera trajectory makes it possible to analyze the cause of the blur and to reconstruct the sequence of sharp images. Abstract: Motion blur caused by camera shake, particularly under large or rotational movements, remains a major challenge in image restoration. We propose a deep learning framework that jointly estimates the latent sharp image and the underlying camera motion trajectory from a single blurry image. Our method leverages the Projective Motion Blur Model (PMBM), implemented efficiently using a differentiable blur creation module compatible with modern networks. A neural network predicts a full 3D rotation trajectory, which guides a model-based restoration network trained end-to-end. This modular architecture provides interpretability by revealing the camera motion that produced the blur. Moreover, this trajectory enables the reconstruction of the sequence of sharp images that generated the observed blurry image. To further refine results, we optimize the trajectory post-inference via a reblur loss, improving consistency between the blurry input and the restored output. Extensive experiments show that our method achieves state-of-the-art performance on both synthetic and real datasets, particularly in cases with severe or spatially variant blur, where end-to-end deblurring networks struggle. Code and trained models are available at https://github.com/GuillermoCarbajal/Blur2Seq/

[119] Deep Learning-Powered Visual SLAM Aimed at Assisting Visually Impaired Navigation

Marziyeh Bamdad,Hans-Peter Hutter,Alireza Darvishy

Main category: cs.CV

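TL;DR: SELM-SLAM3 is a deep learning-based visual SLAM framework that combines SuperPoint and LightGlue for robust feature extraction and matching, and clearly outperforms ORB-SLAM3 and existing RGB-D SLAM systems under challenging conditions such as low texture and motion blur.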

Details Motivation: Under challenging conditions such as low texture, motion blur, or difficult lighting, existing SLAM techniques struggle to remain stable and accurate, which particularly affects the reliability and safety of applications such as assistive navigation for the visually impaired. Method: The SELM-SLAM3 framework integrates SuperPoint for keypoint detection and LightGlue for feature matching to strengthen visual SLAM under adverse conditions. Result: On the TUM RGB-D, ICL-NUIM, and TartanAir datasets, SELM-SLAM3 improves on ORB-SLAM3 by 87.84% on average and exceeds current RGB-D SLAM systems by 36.77%, with better performance in low-texture and fast-motion scenes. Conclusion: SELM-SLAM3 markedly improves the robustness and accuracy of SLAM in challenging environments and provides a reliable technical basis for navigation aids for the visually impaired. Abstract: Despite advancements in SLAM technologies, robust operation under difficult conditions such as low texture, motion blur, or adverse lighting remains an open problem. Such conditions are common in applications such as assistive navigation for the visually impaired. These challenges undermine localization accuracy and tracking stability, reducing navigation reliability and safety. To overcome these limitations, we present SELM-SLAM3, a deep learning-enhanced visual SLAM framework that integrates SuperPoint and LightGlue for robust feature extraction and matching. We evaluated our framework using TUM RGB-D, ICL-NUIM, and TartanAir datasets, which feature diverse and challenging scenarios. SELM-SLAM3 outperforms conventional ORB-SLAM3 by an average of 87.84% and exceeds state-of-the-art RGB-D SLAM systems by 36.77%. Our framework demonstrates enhanced performance under challenging conditions, such as low-texture scenes and fast motion, providing a reliable platform for developing navigation aids for the visually impaired.

[120] From Cheap to Pro: A Learning-based Adaptive Camera Parameter Network for Professional-Style Imaging

Fuchen Li,Yansong Du,Wenbo Cheng,Xiaoxia Zhou,Sen Yin

Main category: cs.CV

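TL;DR: This paper proposes ACamera-Net, a lightweight, scene-adaptive camera parameter adjustment network that predicts optimal exposure and white balance parameters directly from RAW images, effectively improving image quality under complex illumination.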

Details Motivation: Under complex illumination (low light, high dynamic range, backlighting, and color temperature variation), consumer-grade cameras often produce underexposure, color casts, and tonal inconsistency, which degrade vision task performance, so a method that optimizes image quality in real time is needed. Method: ACamera-Net contains two modules: ACamera-Exposure estimates ISO to alleviate underexposure and contrast loss, and ACamera-Color predicts correlated color temperature and gain factors to improve color consistency. The model is trained on annotated real-world data, optimized for real-time inference on edge devices, and can be integrated seamlessly into imaging pipelines. Result: Experiments show that ACamera-Net improves image quality and stabilizes perception outputs across lighting conditions, outperforming conventional auto modes and lightweight baselines without additional image enhancement modules. Conclusion: ACamera-Net is an efficient and practical camera parameter adjustment scheme that delivers high-quality, stable image capture on resource-constrained devices in complex and changing lighting. Abstract: Consumer-grade camera systems often struggle to maintain stable image quality under complex illumination conditions such as low light, high dynamic range, and backlighting, as well as spatial color temperature variation. These issues lead to underexposure, color casts, and tonal inconsistency, which degrade the performance of downstream vision tasks. To address this, we propose ACamera-Net, a lightweight and scene-adaptive camera parameter adjustment network that directly predicts optimal exposure and white balance from RAW inputs. The framework consists of two modules: ACamera-Exposure, which estimates ISO to alleviate underexposure and contrast loss, and ACamera-Color, which predicts correlated color temperature and gain factors for improved color consistency. Optimized for real-time inference on edge devices, ACamera-Net can be seamlessly integrated into imaging pipelines. Trained on diverse real-world data with annotated references, the model generalizes well across lighting conditions. Extensive experiments demonstrate that ACamera-Net consistently enhances image quality and stabilizes perception outputs, outperforming conventional auto modes and lightweight baselines without relying on additional image enhancement modules.

[121] From Far and Near: Perceptual Evaluation of Crowd Representations Across Levels of Detail

Xiaohan Sun,Carol O'Sullivan

Main category: cs.CV

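TL;DR: Studies how users perceive the visual quality of crowd character representations at different levels of detail and viewing distances, comparing the trade-offs between visual fidelity and computational performance for geometric meshes, image-based impostors, NeRFs, and 3D Gaussians.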

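Details Motivation: Optimizing level-of-detail strategies for crowd rendering requires understanding how users perceive the visual quality of different representations. Method: Qualitative and quantitative experiments evaluate the visual performance of geometric meshes, image-based impostors, NeRFs, and 3D Gaussians at different LoDs and distances. Result: The representations differ clearly in visual fidelity and performance, and perceived quality varies significantly with distance. Conclusion: The results can guide the design of efficient, perceptually aligned LoD strategies for crowd rendering. Abstract: In this paper, we investigate how users perceive the visual quality of crowd character representations at different levels of detail (LoD) and viewing distances. Each representation (geometric meshes, image-based impostors, Neural Radiance Fields (NeRFs), and 3D Gaussians) exhibits distinct trade-offs between visual fidelity and computational performance. Our qualitative and quantitative results provide insights to guide the design of perceptually optimized LoD strategies for crowd rendering.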

[122] EmbodiedBrain: Expanding Performance Boundaries of Task Planning for Embodied Intelligence

Ding Zou,Feifan Wang,Mengyu Ge,Siyuan Fan,Zongbing Zhang,Wei Chen,Lingfeng Wang,Zhongyou Hu,Wenrui Yan,Zhengwei Gao,Hao Wang,Weizhao Jin,Yu Zhang,Hainan Zhao,Mingliang Zhang,Xianxian Xi,Yaru Zhang,Wenyuan Li,Zhengguang Gao,Yurui Zhu

Main category: cs.CV

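TL;DR: Proposes EmbodiedBrain, a new vision-language foundation model that addresses the limitations of current large models on embodied intelligence tasks, combining large-scale supervised fine-tuning with Step-GRPO to raise long-horizon task success and introducing a generative reward model to improve training efficiency.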

Details Motivation: Existing large language models and multimodal large models for embodied tasks suffer from a gap between model design and agent requirements, a trade-off between real-time latency and performance, and reliance on unauthentic offline evaluation metrics. Method: Proposes EmbodiedBrain, a new vision-language foundation model that uses an agent-aligned data structure, is trained with large-scale supervised fine-tuning (SFT) combined with Step-GRPO, and integrates a generative reward model (GRM) to improve training efficiency. Result: Experiments show that EmbodiedBrain achieves superior performance on all metrics, setting a new benchmark for embodied foundation models. Conclusion: With its training methodology and comprehensive evaluation system, EmbodiedBrain paves the way for the next generation of generalist embodied agents. Abstract: The realization of Artificial General Intelligence (AGI) necessitates Embodied AI agents capable of robust spatial perception, effective task planning, and adaptive execution in physical environments. However, current large language models (LLMs) and multimodal LLMs (MLLMs) for embodied tasks suffer from key limitations, including a significant gap between model design and agent requirements, an unavoidable trade-off between real-time latency and performance, and the use of unauthentic, offline evaluation metrics. To address these challenges, we propose EmbodiedBrain, a novel vision-language foundation model available in both 7B and 32B parameter sizes. Our framework features an agent-aligned data structure and employs a powerful training methodology that integrates large-scale Supervised Fine-Tuning (SFT) with Step-Augmented Group Relative Policy Optimization (Step-GRPO), which boosts long-horizon task success by integrating preceding steps as Guided Precursors. Furthermore, we incorporate a comprehensive reward system, including a Generative Reward Model (GRM) accelerated at the infrastructure level, to improve training efficiency. To enable thorough validation, we establish a three-part evaluation system encompassing General, Planning, and End-to-End Simulation Benchmarks, highlighted by the proposal and open-sourcing of a novel, challenging simulation environment. Experimental results demonstrate that EmbodiedBrain achieves superior performance across all metrics, establishing a new state-of-the-art for embodied foundation models.
To pave the way for the next generation of generalist embodied agents, we open-source all of our data, model weights, and evaluation methods, which are available at https://zterobot.github.io/EmbodiedBrain.github.io.

[123] Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence

Jiahao Meng,Xiangtai Li,Haochen Wang,Yue Tan,Tao Zhang,Lingdong Kong,Yunhai Tong,Anran Wang,Zhiyang Teng,Yujing Wang,Zhuochen Wang

Main category: cs.CV

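TL;DR: Proposes Open-o3 Video, a non-agent framework that integrates explicit spatio-temporal evidence into video reasoning; by building high-quality datasets and designing a reinforcement learning strategy, it achieves state-of-the-art performance on multiple video understanding benchmarks.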

Details Motivation: Most existing video reasoning models generate only textual reasoning traces, without indicating when and where key evidence appears, and extending evidence-centered reasoning from images to videos poses the challenge of joint spatio-temporal modeling. Method: Proposes the Open-o3 Video framework, builds two high-quality datasets, STGR-CoT-30k and STGR-RL-36k, and adopts a cold-start reinforcement learning strategy with several specially designed rewards that jointly optimize answer accuracy, temporal alignment, and spatial precision. Result: On the V-STAR benchmark, mAM improves by 14.4% and mLGM by 24.2% over the Qwen2.5-VL baseline; consistent improvements are also seen on VideoMME, WorldSense, VideoMMMU, and TVGBench. Conclusion: Open-o3 Video produces interpretable spatio-temporal reasoning traces that not only improve model performance but also support test-time scaling and answer reliability. Abstract: Most video reasoning models only generate textual reasoning traces without indicating when and where key evidence appears. Recent models such as OpenAI-o3 have sparked wide interest in evidence-centered reasoning for images, yet extending this ability to videos is more challenging, as it requires joint temporal tracking and spatial localization across dynamic scenes. We introduce Open-o3 Video, a non-agent framework that integrates explicit spatio-temporal evidence into video reasoning, and carefully collect training data and design training strategies to address the aforementioned challenges. The model highlights key timestamps, objects, and bounding boxes alongside its answers, allowing reasoning to be grounded in concrete visual observations. To enable this functionality, we first curate and build two high-quality datasets, STGR-CoT-30k for SFT and STGR-RL-36k for RL, with carefully constructed temporal and spatial annotations, since most existing datasets offer either temporal spans for videos or spatial boxes on images, lacking unified spatio-temporal supervision and reasoning traces. Then, we adopt a cold-start reinforcement learning strategy with multiple specially designed rewards that jointly encourage answer accuracy, temporal alignment, and spatial precision. On the V-STAR benchmark, Open-o3 Video achieves state-of-the-art performance, raising mAM by 14.4% and mLGM by 24.2% over the Qwen2.5-VL baseline. Consistent improvements are also observed on a broad range of video understanding benchmarks, including VideoMME, WorldSense, VideoMMMU, and TVGBench.
Beyond accuracy, the reasoning traces produced by Open-o3 Video also provide valuable signals for test-time scaling, enabling confidence-aware verification and improving answer reliability.
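The abstract above describes rewards that jointly encourage answer accuracy, temporal alignment, and spatial precision, but does not give their form. As a rough illustration only, the sketch below scores temporal alignment with interval IoU and spatial precision with box IoU, then combines them in a weighted sum. The function names, the use of IoU, and the weights are all assumptions made for this example, not the paper's reward design.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def grounded_reward(answer_ok, pred_span, gt_span, pred_box, gt_box,
                    w_ans=1.0, w_time=0.5, w_space=0.5):
    """Hypothetical combined reward; the weights are illustrative only."""
    return (w_ans * float(answer_ok)
            + w_time * temporal_iou(pred_span, gt_span)
            + w_space * box_iou(pred_box, gt_box))
```

A weighted sum is the simplest way to trade the three terms off against each other; a real training setup might gate the spatial term on the temporal one instead.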

[124] GenColorBench: A Color Evaluation Benchmark for Text-to-Image Generation Models

Muhammad Atif Butt,Alexandra Gomez-Villa,Tao Wu,Javier Vazquez-Corral,Joost Van De Weijer,Kai Wang

Main category: cs.CV

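TL;DR: Proposes GenColorBench, the first comprehensive benchmark for color precision in text-to-image generation; grounded in color systems such as ISCC-NBS and CSS3/X11, it contains 44K color-focused prompts covering more than 400 colors and reveals performance differences and weaknesses in how existing models control color.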

Details Motivation: Existing text-to-image models perform poorly at fine-grained color control, and no benchmark systematically evaluates color precision, even though color is essential to visual perception and practical applications; a comprehensive, fine-grained benchmark for color generation is therefore needed. Method: Builds a new benchmark, GenColorBench, that integrates color systems such as ISCC-NBS and CSS3/X11 and includes numerical colors (such as RGB); it designs 44,000 color-related prompts covering more than 400 colors and evaluates mainstream text-to-image models with perceptual experiments and automated assessments. Result: The evaluation shows significant differences in color generation accuracy across models, revealing how well models understand different color specifications (for example, named versus numerical colors) and common failure modes, such as failing to interpret RGB values accurately or to match subtle hue differences. Conclusion: GenColorBench effectively evaluates the color generation ability of text-to-image models and provides an important tool and direction for improving color controllability; future work can build on it to improve how models understand and generate complex color semantics. Abstract: Recent years have seen impressive advances in text-to-image generation, with image generative or unified models producing high-quality images from text. Yet these models still struggle with fine-grained color controllability, often failing to accurately match colors specified in text prompts. While existing benchmarks evaluate compositional reasoning and prompt adherence, none systematically assess color precision. Color is fundamental to human visual perception and communication, critical for applications from art to design workflows requiring brand consistency. However, current benchmarks either neglect color or rely on coarse assessments, missing key capabilities such as interpreting RGB values or aligning with human expectations. To this end, we propose GenColorBench, the first comprehensive benchmark for text-to-image color generation, grounded in color systems like ISCC-NBS and CSS3/X11, including numerical colors which are absent elsewhere. With 44K color-focused prompts covering 400+ colors, it reveals models' true capabilities via perceptual and automated assessments. Evaluations of popular text-to-image models using GenColorBench show performance variations, highlighting which color conventions models understand best and identifying failure modes. Our GenColorBench assessments will guide improvements in precise color generation. The benchmark will be made public upon acceptance.

[125] Unsupervised Domain Adaptation via Similarity-based Prototypes for Cross-Modality Segmentation

Ziyu Ye,Chen Ju,Chaofan Ma,Xiaoyun Zhang

Main category: cs.CV

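TL;DR: Proposes a new similarity-based prototype framework for cross-modality segmentation; by learning class-wise prototypes in an embedding space with a similarity constraint, combined with dictionary storage and contrastive learning, it mitigates domain gaps and the class-missing problem and outperforms existing methods in the unsupervised domain adaptation setting.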

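Details Motivation: Deep learning models perform well on their training data but degrade sharply on unseen data because of domain shift, so effective unsupervised domain adaptation methods are needed to reduce the domain gap and avoid costly annotation. Method: Proposes a similarity-based prototype framework for cross-modality segmentation: class-wise prototypes are learned in an embedding space, a similarity constraint strengthens intra-class representativeness and inter-class separability, and dictionaries store prototypes extracted from multiple images, supporting contrastive learning and preventing the class-missing problem. Result: Extensive experiments show that the method outperforms other state-of-the-art unsupervised domain adaptation methods on cross-modality segmentation. Conclusion: The proposed framework, based on similarity-based prototypes and a dictionary mechanism, improves generalization to unseen domains and offers a new solution for cross-modality segmentation under unsupervised domain adaptation. Abstract: Deep learning models have achieved great success on various vision challenges, but a well-trained model would face drastic performance degradation when applied to unseen data. Since the model is sensitive to domain shift, unsupervised domain adaptation attempts to reduce the domain gap and avoid costly annotation of unseen domains. This paper proposes a novel framework for cross-modality segmentation via similarity-based prototypes. Specifically, we learn class-wise prototypes within an embedding space, then introduce a similarity constraint to make these prototypes representative for each semantic class while separable from different classes. Moreover, we use dictionaries to store prototypes extracted from different images, which prevents the class-missing problem and enables the contrastive learning of prototypes, and further improves performance. Extensive experiments show that our method achieves better results than other state-of-the-art methods.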
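To make the idea of class-wise prototypes concrete, here is a minimal sketch that assumes a prototype is simply the mean embedding of a class and that similarity is cosine similarity. Both choices, and the helper names `class_prototypes`, `cosine`, and `assign`, are assumptions for illustration; the abstract does not specify how the prototypes or the similarity constraint are computed.

```python
import math


def class_prototypes(embeddings, labels):
    """Mean embedding per class label (one prototype per class)."""
    sums, counts = {}, {}
    for vec, lab in zip(embeddings, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def assign(vec, prototypes):
    """Label of the prototype most similar to vec."""
    return max(prototypes, key=lambda lab: cosine(vec, prototypes[lab]))
```

A dictionary of prototypes gathered over many images, as the abstract describes, would amount to keeping several such prototype vectors per class rather than one.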

[126] OnlineSplatter: Pose-Free Online 3D Reconstruction for Free-Moving Objects

Mark He Huang,Lin Geng Foo,Christian Theobalt,Ying Sun,De Wen Soh

Main category: cs.CV

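TL;DR: Proposes OnlineSplatter, an online feed-forward framework that generates high-quality, object-centric 3D Gaussian representations directly from the RGB frames of monocular video, without camera poses, depth priors, or bundle optimization.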

Details Motivation: Reconstructing free-moving objects remains challenging when reliable pose or depth cues are unavailable and the object moves arbitrarily. Existing methods usually rely on pose estimation or optimization, which limits their use in dynamic scenes. Method: The method anchors on the first frame and progressively updates the object representation through a dense Gaussian primitive field; a dual-key memory module combines latent appearance-geometry keys with explicit directional keys to fuse the current frame with historical states, and spatially guided memory readout together with an efficient sparsification mechanism keeps the representation compact. Result: Experiments on real-world datasets show that OnlineSplatter significantly outperforms existing pose-free reconstruction methods, improving reconstruction quality as more frames are observed while keeping memory and runtime constant. Conclusion: OnlineSplatter provides an efficient, robust online 3D reconstruction solution for free-moving objects in monocular video, with good practicality and scalability. Abstract: Free-moving object reconstruction from monocular video remains challenging, particularly without reliable pose or depth cues and under arbitrary object motion. We introduce OnlineSplatter, a novel online feed-forward framework generating high-quality, object-centric 3D Gaussians directly from RGB frames without requiring camera pose, depth priors, or bundle optimization. Our approach anchors reconstruction using the first frame and progressively refines the object representation through a dense Gaussian primitive field, maintaining constant computational cost regardless of video sequence length. Our core contribution is a dual-key memory module combining latent appearance-geometry keys with explicit directional keys, robustly fusing current frame features with temporally aggregated object states. This design enables effective handling of free-moving objects via spatial-guided memory readout and an efficient sparsification mechanism, ensuring comprehensive yet compact object coverage. Evaluations on real-world datasets demonstrate that OnlineSplatter significantly outperforms state-of-the-art pose-free reconstruction baselines, consistently improving with more observations while maintaining constant memory and runtime.

[127] SeViCES: Unifying Semantic-Visual Evidence Consensus for Long Video Understanding

Yuan Sheng,Yanbin Hao,Chenxu Li,Shuo Wang,Xiangnan He

Main category: cs.CV

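TL;DR: Proposes SeViCES, a semantic-visual consensus evidence selection framework for efficient and reliable long video understanding. The method is training-free and model-agnostic: it selects key frames by combining semantic and visual branches and refines predictions through answer consensus, markedly improving the accuracy and robustness of video large language models on long-video tasks.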

Details Motivation: Long videos are complex and their content is scattered; existing video large language models are computationally expensive on long sequences and prone to unfocused or inconsistent reasoning, and existing frame selection methods often ignore temporal dependencies or rely on unimodal evidence, so they struggle to provide complete, query-relevant context. Method: Proposes the SeViCES framework with two core modules: the Semantic-Visual Consensus Frame Selection (SVCFS) module, in which a semantic branch uses an LLM for temporal-aware reasoning over captions and a cluster-guided visual branch aligns embeddings with semantic scores via mutual information; and the Answer Consensus Refinement (ACR) module, which fuses multimodal evidence and constrains the answer space to resolve inconsistent predictions. Result: Experiments on multiple long video understanding benchmarks show that SeViCES outperforms existing state-of-the-art methods in both accuracy and robustness, especially on long-horizon, complex query tasks. Conclusion: Consensus-driven multimodal evidence selection effectively improves how well video large language models understand long videos, and SeViCES, as a training-free and general framework, offers a new reliable paradigm for long video understanding. Abstract: Long video understanding remains challenging due to its complex, diverse, and temporally scattered content. Although video large language models (Video-LLMs) can process videos lasting tens of minutes, applying them to truly long sequences is computationally prohibitive and often leads to unfocused or inconsistent reasoning. A promising solution is to select only the most informative frames, yet existing approaches typically ignore temporal dependencies or rely on unimodal evidence, limiting their ability to provide complete and query-relevant context. We propose a Semantic-Visual Consensus Evidence Selection (SeViCES) framework for effective and reliable long video understanding. SeViCES is training-free and model-agnostic, and introduces two key components. The Semantic-Visual Consensus Frame Selection (SVCFS) module selects frames through (1) a temporal-aware semantic branch that leverages LLM reasoning over captions, and (2) a cluster-guided visual branch that aligns embeddings with semantic scores via mutual information. The Answer Consensus Refinement (ACR) module further resolves inconsistencies between semantic- and visual-based predictions by fusing evidence and constraining the answer space. Extensive experiments on long video understanding benchmarks show that SeViCES consistently outperforms state-of-the-art methods in both accuracy and robustness, demonstrating the importance of consensus-driven evidence selection for Video-LLMs.

[128] Deep Learning in Dental Image Analysis: A Systematic Review of Datasets, Methodologies, and Emerging Challenges

Zhenhuan Zhou,Jingbo Zhu,Yuchen Zhang,Xiaohang Guan,Peng Wang,Tao Li

Main category: cs.CV

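TL;DR: Reviews deep learning applications in dental image analysis across 260 studies, focusing on datasets and models; it systematically summarizes the characteristics and acquisition methods of existing datasets, deep learning techniques, network architectures, optimization strategies, training methods, and performance evaluation metrics, and discusses current challenges and future directions.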

Details Motivation: Dental image analysis faces challenges such as low contrast, metallic artifacts, and variations in projection angle, and manual interpretation is time-consuming and subject to clinician subjectivity, so automated, accurate analysis methods are needed. Method: Systematically reviews 49 papers on publicly available dental datasets and 211 papers on deep learning algorithms, categorizing models and algorithms by dental image analysis task and analyzing their network architectures, optimization strategies, training methods, and performance. Result: Summarizes the characteristics of commonly used datasets, foundational deep learning techniques, model categories, and training and evaluation metrics in dental image analysis, and provides detailed comparison tables and supplementary materials. Conclusion: Deep learning shows great potential in dental image analysis; the review gives researchers in the field a systematic reference and points out future research directions. Abstract: Efficient analysis and processing of dental images are crucial for dentists to achieve accurate diagnosis and optimal treatment planning. However, dental imaging inherently poses several challenges, such as low contrast, metallic artifacts, and variations in projection angles. Combined with the subjectivity arising from differences in clinicians' expertise, manual interpretation often proves time-consuming and prone to inconsistency. Artificial intelligence (AI)-based automated dental image analysis (DIA) offers a promising solution to these issues and has become an integral part of computer-aided dental diagnosis and treatment. Among various AI technologies, deep learning (DL) stands out as the most widely applied and influential approach due to its superior feature extraction and representation capabilities. To comprehensively summarize recent progress in this field, we focus on the two fundamental aspects of DL research: datasets and models. In this paper, we systematically review 260 studies on DL applications in DIA, including 49 papers on publicly available dental datasets and 211 papers on DL-based algorithms. We first introduce the basic concepts of dental imaging and summarize the characteristics and acquisition methods of existing datasets. Then, we present the foundational techniques of DL and categorize relevant models and algorithms according to different DIA tasks, analyzing their network architectures, optimization strategies, training methods, and performance. Furthermore, we summarize commonly used training and evaluation metrics in the DIA domain.
Finally, we discuss the current challenges of existing research and outline potential future directions. We hope that this work provides a valuable and systematic reference for researchers in this field. All supplementary materials and detailed comparison tables will be made publicly available on GitHub.

[129] Better Tokens for Better 3D: Advancing Vision-Language Modeling in 3D Medical Imaging

Ibrahim Ethem Hamamci,Sezgin Er,Suprosanna Shit,Hadrien Reynaud,Dong Yang,Pengfei Guo,Marc Edgar,Daguang Xu,Bernhard Kainz,Bjoern Menze

Main category: cs.CV

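TL;DR: Proposes BTB3D, a causal convolutional encoder-decoder for 3D medical imaging that unifies 2D and 3D training and produces compact, frequency-aware volumetric tokens, markedly improving report generation and text-to-CT synthesis.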

Details Motivation: Existing methods struggle with high-resolution, long-sequence 3D medical images: vision encoders are misaligned with clinical language and slice-wise tokenization blurs fine anatomy, limiting diagnostic performance on downstream tasks. Method: Proposes BTB3D, a causal convolutional encoder-decoder with a three-stage training strategy (local reconstruction, overlapping-window tiling, and long-context decoder refinement) for efficient, scalable 3D image tokenization. Result: For report generation, BLEU scores improve and clinical F1 rises by 40% over existing models (CT2Rep, CT-CHAT, and Merlin); for text-to-CT synthesis, FID drops by 75% and FVD is halved, and the model generates anatomically consistent, high-quality volumes of size 512×512×241. Conclusion: Precise three-dimensional tokenization, rather than larger language model backbones alone, is essential for scalable vision-language modeling in 3D medical imaging. Abstract: Recent progress in vision-language modeling for 3D medical imaging has been fueled by large-scale computed tomography (CT) corpora with paired free-text reports, stronger architectures, and powerful pretrained models. This has enabled applications such as automated report generation and text-conditioned 3D image synthesis. Yet, current approaches struggle with high-resolution, long-sequence volumes: contrastive pretraining often yields vision encoders that are misaligned with clinical language, and slice-wise tokenization blurs fine anatomy, reducing diagnostic performance on downstream tasks. We introduce BTB3D (Better Tokens for Better 3D), a causal convolutional encoder-decoder that unifies 2D and 3D training and inference while producing compact, frequency-aware volumetric tokens. A three-stage training curriculum enables (i) local reconstruction, (ii) overlapping-window tiling, and (iii) long-context decoder refinement, during which the model learns from short slice excerpts yet generalizes to scans exceeding 300 slices without additional memory overhead. BTB3D sets a new state-of-the-art on two key tasks: it improves BLEU scores and increases clinical F1 by 40% over CT2Rep, CT-CHAT, and Merlin for report generation; and it reduces FID by 75% and halves FVD compared to GenerateCT and MedSyn for text-to-CT synthesis, producing anatomically consistent 512×512×241 volumes. These results confirm that precise three-dimensional tokenization, rather than larger language backbones alone, is essential for scalable vision-language modeling in 3D medical imaging.
The codebase is available at: https://github.com/ibrahimethemhamamci/BTB3D

[130] UltraHR-100K: Enhancing UHR Image Synthesis with A Large-Scale High-Quality Dataset

Chen Zhao,En Ci,Yunzhe Xu,Tiehan Fan,Shanyan Guan,Yanhao Ge,Jian Yang,Ying Tai

Main category: cs.CV

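TL;DR: Proposes UltraHR-100K, a high-quality dataset of 100K ultra-high-resolution images, and designs frequency-aware post-training methods, DOTS and SWFR, to improve fine-grained detail quality in text-to-image generation.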

Details Motivation: Ultra-high-resolution text-to-image generation faces two major challenges: the lack of a large-scale, high-quality dataset and the lack of training strategies tailored to fine-grained detail synthesis. Method: Builds UltraHR-100K, a richly captioned UHR image dataset in which every image exceeds 3K resolution, and proposes Detail-Oriented Timestep Sampling (DOTS) and Soft-Weighting Frequency Regularization (SWFR); the latter uses the Discrete Fourier Transform to softly constrain high-frequency components and preserve detail. Result: Experiments on the UltraHR-eval4K benchmark show that the proposed method markedly improves the detail quality and overall fidelity of UHR image generation. Conclusion: By building a high-quality dataset and introducing frequency-aware training strategies, the work addresses key challenges in UHR T2I generation and markedly improves detail rendering. Abstract: Ultra-high-resolution (UHR) text-to-image (T2I) generation has seen notable progress. However, two key challenges remain: (1) the absence of a large-scale high-quality UHR T2I dataset, and (2) the neglect of tailored training strategies for fine-grained detail synthesis in UHR scenarios. To tackle the first challenge, we introduce UltraHR-100K, a high-quality dataset of 100K UHR images with rich captions, offering diverse content and strong visual fidelity. Each image exceeds 3K resolution and is rigorously curated based on detail richness, content complexity, and aesthetic quality. To tackle the second challenge, we propose a frequency-aware post-training method that enhances fine-detail generation in T2I diffusion models. Specifically, we design (i) Detail-Oriented Timestep Sampling (DOTS) to focus learning on detail-critical denoising steps, and (ii) Soft-Weighting Frequency Regularization (SWFR), which leverages Discrete Fourier Transform (DFT) to softly constrain frequency components, encouraging high-frequency detail preservation. Extensive experiments on our proposed UltraHR-eval4K benchmarks demonstrate that our approach significantly improves the fine-grained detail quality and overall fidelity of UHR image generation. The code is available at https://github.com/NJU-PCALab/UltraHR-100k.
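The abstract says SWFR uses the DFT to softly constrain frequency components but gives no formula. The sketch below shows one way such a loss could look on a 1-D signal: compare prediction and target bin by bin in the frequency domain, with a weight that grows smoothly with frequency. The linear weight `1 + sharpness * f`, the 1-D setting, and the function names are assumptions chosen for brevity, not the paper's definition.

```python
import cmath


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real 1-D signal."""
    n = len(signal)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal))
            for k in range(n)]


def soft_frequency_loss(pred, target, sharpness=1.0):
    """Spectral squared error with weights that grow with frequency.

    The weight 1 + sharpness * f, with f in [0, 1], is a hypothetical
    choice made for this sketch only.
    """
    n = len(pred)
    fp, ft = dft(pred), dft(target)
    loss = 0.0
    for k in range(n):
        # Distance of bin k from the DC bin, normalised to [0, 1].
        f = min(k, n - k) / (n / 2)
        loss += (1.0 + sharpness * f) * abs(fp[k] - ft[k]) ** 2
    return loss / n
```

With this weighting, an error of the same energy costs more when it sits in high-frequency bins than at DC, which is the intuition behind encouraging detail preservation.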

[131] HybridSOMSpikeNet: A Deep Model with Differentiable Soft Self-Organizing Maps and Spiking Dynamics for Waste Classification

Debojyoti Ghosh,Adrijit Goswami

Main category: cs.CV

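TL;DR: Proposes HybridSOMSpikeNet, a hybrid deep learning framework that combines convolutional feature extraction, differentiable self-organization, and spiking neural network temporal processing for efficient, accurate waste classification, reaching 97.39% accuracy on a ten-class dataset and supporting sustainable development goals.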

Details Motivation: Accurate waste classification is vital for sustainable waste management and for reducing the environmental impact of urbanization; misclassification in existing approaches leads to more landfill, inefficient recycling, and greenhouse gas emissions. Method: Uses a pre-trained ResNet-152 to extract spatial features, adds a differentiable soft self-organizing map (Soft-SOM) to enhance topological clustering, and introduces a spiking neural head that accumulates temporal activations, forming the HybridSOMSpikeNet model. Result: The model reaches 97.39% test accuracy on a ten-class waste dataset, outperforming several state-of-the-art models while remaining computationally lightweight and suitable for real-world deployment. Conclusion: HybridSOMSpikeNet improves waste classification accuracy and recycling efficiency, reduces contamination and operating costs, supports the United Nations Sustainable Development Goals (SDG 11 and SDG 12), and advances intelligent environmental systems. Abstract: Accurate waste classification is vital for achieving sustainable waste management and reducing the environmental footprint of urbanization. Misclassification of recyclable materials contributes to landfill accumulation, inefficient recycling, and increased greenhouse gas emissions. To address these issues, this study introduces HybridSOMSpikeNet, a hybrid deep learning framework that integrates convolutional feature extraction, differentiable self-organization, and spiking-inspired temporal processing to enable intelligent and energy-efficient waste classification. The proposed model employs a pre-trained ResNet-152 backbone to extract deep spatial representations, followed by a Differentiable Soft Self-Organizing Map (Soft-SOM) that enhances topological clustering and interpretability. A spiking neural head accumulates temporal activations over discrete time steps, improving robustness and generalization. Trained on a ten-class waste dataset, HybridSOMSpikeNet achieved a test accuracy of 97.39%, outperforming several state-of-the-art architectures while maintaining a lightweight computational profile suitable for real-world deployment. Beyond its technical innovations, the framework provides tangible environmental benefits. By enabling precise and automated waste segregation, it supports higher recycling efficiency, reduces contamination in recyclable streams, and minimizes the ecological and operational costs of waste processing.
The approach aligns with global sustainability priorities, particularly the United Nations Sustainable Development Goals (SDG 11 and SDG 12), by contributing to cleaner cities, circular economy initiatives, and intelligent environmental management systems.
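A classical self-organizing map picks a single best-matching unit, which is not differentiable. One common way to soften that step, sketched here as an assumption about what "Soft-SOM" could mean, is a softmax over negative squared distances to the map's prototypes. The temperature parameter and the function names are illustrative, and the sketch omits the grid neighbourhood term a full SOM would use; the abstract does not specify the formulation.

```python
import math


def soft_som_assign(x, prototypes, temperature=1.0):
    """Softmax over negative squared distances to each SOM prototype."""
    d2 = [sum((a - b) ** 2 for a, b in zip(x, p)) for p in prototypes]
    m = min(d2)  # subtract the minimum for numerical stability
    w = [math.exp(-(d - m) / temperature) for d in d2]
    total = sum(w)
    return [v / total for v in w]


def soft_som_output(x, prototypes, temperature=1.0):
    """Convex combination of prototypes weighted by the soft assignment."""
    a = soft_som_assign(x, prototypes, temperature)
    return [sum(a[j] * prototypes[j][i] for j in range(len(prototypes)))
            for i in range(len(x))]
```

Lowering the temperature pushes the assignment toward the hard best-matching unit, while a higher temperature spreads weight over neighbouring prototypes.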

[132] Efficient Multi-bit Quantization Network Training via Weight Bias Correction and Bit-wise Coreset Sampling

Jinhee Kim,Jae Jun An,Kang Eun Jeon,Jong Hwan Ko

Main category: cs.CV

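TL;DR: Proposes an efficient training method for multi-bit quantization networks that uses weight bias correction and a bit-wise coreset sampling strategy based on gradient importance scores to cut training overhead substantially while preserving model performance.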

Details Motivation: Existing multi-bit quantization training methods repeat full-dataset updates for every bit-width, which is costly, and often require extra fine-tuning, making training inefficient. Method: Proposes two techniques: (1) weight bias correction, which neutralizes quantization-induced bias across bit-widths and aligns activation distributions, enabling shared batch normalization and avoiding fine-tuning; and (2) bit-wise coreset sampling, which uses gradient-based importance scores to select an informative subset for each child model, improving training efficiency. Result: Experiments on CIFAR-10/100, TinyImageNet, and ImageNet-1K with ResNet and ViT architectures show that the method maintains competitive or better accuracy while reducing training time by up to 7.88x. Conclusion: The method substantially lowers the training cost of multi-bit quantization networks without extra fine-tuning, suits flexible deployment scenarios, and is practical and scalable. Abstract: Multi-bit quantization networks enable flexible deployment of deep neural networks by supporting multiple precision levels within a single model. However, existing approaches suffer from significant training overhead as full-dataset updates are repeated for each supported bit-width, resulting in a cost that scales linearly with the number of precisions. Additionally, extra fine-tuning stages are often required to support additional or intermediate precision options, further compounding the overall training burden. To address this issue, we propose two techniques that greatly reduce the training overhead without compromising model utility: (i) Weight bias correction enables shared batch normalization and eliminates the need for fine-tuning by neutralizing quantization-induced bias across bit-widths and aligning activation distributions; and (ii) Bit-wise coreset sampling strategy allows each child model to train on a compact, informative subset selected via gradient-based importance scores by exploiting the implicit knowledge transfer phenomenon. Experiments on CIFAR-10/100, TinyImageNet, and ImageNet-1K with both ResNet and ViT architectures demonstrate that our method achieves competitive or superior accuracy while reducing training time by up to 7.88x. Our code is released at https://github.com/a2jinhee/EMQNet_jk.
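To illustrate what "quantization-induced bias" can mean, the sketch below quantizes a weight vector uniformly at several bit-widths and then shifts each quantized vector so that its mean matches the full-precision mean. This mean-shift is only one simple reading of bias correction, assumed here for illustration; the uniform quantizer and the function names are likewise hypothetical and may differ from the paper's method.

```python
def quantize(weights, bits):
    """Uniform round-to-nearest quantization onto 2**bits levels."""
    lo, hi = min(weights), max(weights)
    levels = 2 ** bits - 1
    step = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((w - lo) / step) * step for w in weights]


def bias_corrected(weights, bits):
    """Quantize, then subtract the mean shift introduced by quantization.

    Returns (corrected_weights, bias). Matching only the mean is an
    assumption of this sketch.
    """
    q = quantize(weights, bits)
    bias = sum(q) / len(q) - sum(weights) / len(weights)
    return [v - bias for v in q], bias
```

After the shift, every bit-width shares the same weight mean, which is the kind of alignment that would make shared normalization statistics more plausible across precisions.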

[133] Diagnosing Visual Reasoning: Challenges, Insights, and a Path Forward

Jing Bi,Guangyu Sun,Ali Vosoughi,Chen Chen,Chenliang Xu

Main category: cs.CV

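TL;DR: Proposes an agent-based architecture that combines large language model reasoning with lightweight visual modules to address visual hallucinations and over-reliance on textual priors in multimodal large language models on visual tasks, yielding significant performance gains.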

Details Motivation: Multimodal large language models (MLLMs) still exhibit visual hallucinations and over-reliance on textual priors on complex visual tasks, so their visual reasoning needs systematic diagnosis and improvement. Method: Proposes a three-stage evaluation framework to diagnose existing models and designs an agent-based architecture that combines LLM reasoning with lightweight visual modules, enabling fine-grained analysis and iterative refinement of reasoning chains. Result: The new system gains +10.3 on MMMU and +6.0 on MathVista over a 7B baseline, matching or surpassing much larger models. Conclusion: Future visual reasoning models should integrate more specialized tools for analyzing visual content; the proposed framework offers an effective path toward better visual understanding in MLLMs. Abstract: Multimodal large language models (MLLMs) that integrate visual and textual reasoning leverage chain-of-thought (CoT) prompting to tackle complex visual tasks, yet continue to exhibit visual hallucinations and an over-reliance on textual priors. We present a systematic diagnosis of state-of-the-art vision-language models using a three-stage evaluation framework, uncovering key failure modes. To address these, we propose an agent-based architecture that combines LLM reasoning with lightweight visual modules, enabling fine-grained analysis and iterative refinement of reasoning chains. Our results suggest that future visual reasoning models should focus on integrating a broader set of specialized tools for analyzing visual content. Our system achieves significant gains (+10.3 on MMMU, +6.0 on MathVista over a 7B baseline), matching or surpassing much larger models. We will release our framework and evaluation suite to facilitate future research.

[134] Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models

Xuyang Liu,Xiyan Gui,Yuchao Zhang,Linfeng Zhang

Main category: cs.CV

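TL;DR: Proposes MixKV, a new method that combines importance with diversity to optimize KV cache compression in large vision-language models, easing the memory bottleneck of multi-modal sequence processing.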

Details Motivation: Existing KV cache compression methods focus on retaining high-importance KV pairs but overlook the semantic redundancy patterns specific to multi-modal caches, so they may lose semantic information during compression and hurt model performance. Method: Analyzes the KV cache redundancy of different attention heads in LVLMs and proposes MixKV, which adaptively balances importance and diversity during compression to better cover the distribution of semantic information; the method adapts to different modalities and model architectures. Result: Under extreme compression (budget=64), MixKV improves baseline methods by an average of 5.1% across five multi-modal understanding benchmarks and improves SnapKV and AdaKV by 8.0% and 9.0% on GUI grounding tasks, while maintaining comparable inference efficiency and extending seamlessly to LLMs. Conclusion: By mixing importance with diversity, MixKV improves the semantic completeness of KV cache compression and model performance, generalizes well, and supports efficient deployment of large multi-modal models. Abstract: Recent large vision-language models (LVLMs) demonstrate remarkable capabilities in processing extended multi-modal sequences, yet the resulting key-value (KV) cache expansion creates a critical memory bottleneck that fundamentally limits deployment scalability. While existing KV cache compression methods focus on retaining high-importance KV pairs to minimize storage, they often overlook the modality-specific semantic redundancy patterns that emerge distinctively in multi-modal KV caches. In this work, we first analyze how, beyond simple importance, the KV cache in LVLMs exhibits varying levels of redundancy across attention heads. We show that relying solely on importance can only cover a subset of the full KV cache information distribution, leading to potential loss of semantic coverage. To address this, we propose MixKV, a novel method that mixes importance with diversity for optimized KV cache compression in LVLMs. MixKV adapts to head-wise semantic redundancy, selectively balancing diversity and importance when compressing KV pairs. Extensive experiments demonstrate that MixKV consistently enhances existing methods across multiple LVLMs. Under extreme compression (budget=64), MixKV improves baseline methods by an average of 5.1% across five multi-modal understanding benchmarks and achieves remarkable gains of 8.0% and 9.0% for SnapKV and AdaKV on GUI grounding tasks, all while maintaining comparable inference efficiency.
Furthermore, MixKV extends seamlessly to LLMs with comparable performance gains. Our code is available at https://github.com/xuyang-liu16/MixKV.
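As a rough picture of mixing importance with diversity, the sketch below scores each cached key by a blend of its importance and how far it points away from the mean key, with the blend controlled by a per-head redundancy value. Everything specific here is an assumption for illustration: the diversity measure (one minus cosine similarity to the mean key), the linear blend, the `redundancy` input, and the function names are not taken from the paper.

```python
import math


def _normalize(scores):
    """Min-max scale scores to [0, 1]; all zeros if they are constant."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def select_kv(keys, importance, budget, redundancy):
    """Indices of `budget` KV pairs to keep for one attention head.

    redundancy in [0, 1] is a hypothetical per-head value: the more
    redundant the head, the more weight diversity receives.
    """
    dim = len(keys[0])
    mean_key = [sum(k[i] for k in keys) / len(keys) for i in range(dim)]
    diversity = [1.0 - _cosine(k, mean_key) for k in keys]
    imp, div = _normalize(importance), _normalize(diversity)
    mixed = [(1.0 - redundancy) * a + redundancy * b for a, b in zip(imp, div)]
    order = sorted(range(len(keys)), key=lambda i: mixed[i], reverse=True)
    return sorted(order[:budget])
```

With `redundancy=0` this reduces to plain importance-based top-k selection, so the sketch can be read as a drop-in rescoring step on top of an existing importance score.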

[135] ALICE-LRI: A General Method for Lossless Range Image Generation for Spinning LiDAR Sensors without Calibration Metadata

Samuel Soutullo,Miguel Yermo,David L. Vilariño,Óscar G. Lorenzo,José C. Cabaleiro,Francisco F. Rivera

Main category: cs.CV

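TL;DR: Proposes ALICE-LRI, a general, sensor-agnostic method that for the first time generates lossless range images from spinning LiDAR point clouds without manufacturer metadata or calibration files, reconstructing point clouds without losing any points while preserving geometric accuracy and running in real time.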

Details Motivation: Conventional LiDAR projection methods suffer from geometric inconsistencies that cause irreversible information loss and compromise high-fidelity applications, so a general method is needed that achieves lossless projection without device metadata. Method: Proposes ALICE-LRI, which automatically reverse-engineers the intrinsic geometry of a spinning LiDAR sensor, including laser beam configuration, angular distributions, and per-beam calibration corrections, enabling lossless range image generation and complete point cloud reconstruction. Result: Experiments on the KITTI and DurLAR datasets show that ALICE-LRI achieves perfect point preservation (zero points lost), keeps geometric accuracy within sensor precision limits, and runs in real time; a compression case study shows significant quality improvements in downstream tasks. Conclusion: ALICE-LRI marks a paradigm shift from approximate to lossless projection, opening new possibilities for high-precision remote sensing applications that require complete geometric preservation. Abstract: 3D LiDAR sensors are essential for autonomous navigation, environmental monitoring, and precision mapping in remote sensing applications. To efficiently process the massive point clouds generated by these sensors, LiDAR data is often projected into 2D range images that organize points by their angular positions and distances. While these range image representations enable efficient processing, conventional projection methods suffer from fundamental geometric inconsistencies that cause irreversible information loss, compromising high-fidelity applications. We present ALICE-LRI (Automatic LiDAR Intrinsic Calibration Estimation for Lossless Range Images), the first general, sensor-agnostic method that achieves lossless range image generation from spinning LiDAR point clouds without requiring manufacturer metadata or calibration files. Our algorithm automatically reverse-engineers the intrinsic geometry of any spinning LiDAR sensor by inferring critical parameters including laser beam configuration, angular distributions, and per-beam calibration corrections, enabling lossless projection and complete point cloud reconstruction with zero point loss. Comprehensive evaluation across the complete KITTI and DurLAR datasets demonstrates that ALICE-LRI achieves perfect point preservation, with zero points lost across all point clouds. Geometric accuracy is maintained well within sensor precision limits, establishing geometric losslessness with real-time performance.
We also present a compression case study that validates substantial downstream benefits, demonstrating significant quality improvements in practical applications. This paradigm shift from approximate to lossless LiDAR projections opens new possibilities for high-precision remote sensing applications requiring complete geometric preservation.
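The information loss the abstract attributes to conventional projection can be seen in a small sketch of the usual spherical projection: points are binned by azimuth and elevation on a uniform grid, and any two points that land in the same pixel collide. This is the baseline being improved upon, not ALICE-LRI itself; the uniform-grid assumption, the keep-nearest collision rule, and the function name are choices made for this example.

```python
import math


def project_to_range_image(points, height, width, fov_up_deg, fov_down_deg):
    """Conventional spherical projection of (x, y, z) points.

    Returns (image, lost): image maps (row, col) -> range, and lost counts
    points dropped by pixel collisions or by falling outside the vertical
    field of view.
    """
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    image, lost = {}, 0
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            lost += 1
            continue
        azimuth = math.atan2(y, x)  # in [-pi, pi]
        elevation = math.asin(z / r)
        col = int((azimuth + math.pi) / (2 * math.pi) * width) % width
        row = math.floor((fov_up - elevation) / (fov_up - fov_down) * height)
        if not 0 <= row < height:
            lost += 1
            continue
        if (row, col) in image:
            lost += 1  # collision: only the nearer point survives
            image[(row, col)] = min(image[(row, col)], r)
        else:
            image[(row, col)] = r
    return image, lost
```

Counting `lost` on real scans gives a direct measure of how far a given grid is from lossless; a method that recovers the sensor's true per-beam angles aims to drive that count to zero.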

[136] Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation

Yuhan Liu,Lianhui Qin,Shengjie Wang

Main category: cs.CV

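TL;DR: Proposes Speculative Verdict (SV), a training-free framework that combines multiple lightweight "draft experts" with a strong "verdict model" for efficient, accurate multi-step reasoning over information-intensive images.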

Details Motivation: Existing large vision-language models struggle to localize key cues precisely and to perform multi-hop reasoning on information-intensive images that interleave text and graphics, which degrades performance. Method: SV has a draft stage and a verdict stage: small VLMs act as draft experts that generate diverse reasoning paths providing localization candidates, and a large VLM acts as the verdict model that synthesizes these paths into the final answer; a consensus expert selection mechanism forwards only high-agreement paths to the verdict stage to improve efficiency and accuracy. Result: SV achieves significant gains on several high-resolution, information-intensive visual question answering benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K, correcting erroneous reasoning paths while reducing computational cost. Conclusion: By fusing multiple partially correct reasoning paths, SV achieves error correction and computational efficiency without training, outperforming large proprietary models or training-based pipelines. Abstract: Large Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet they struggle when reasoning over information-intensive images that densely interleave textual annotations with fine-grained graphical elements. The main challenges lie in precisely localizing critical cues in dense layouts and multi-hop reasoning to integrate dispersed evidence. We propose Speculative Verdict (SV), a training-free framework inspired by speculative decoding that combines multiple lightweight draft experts with a large verdict model. In the draft stage, small VLMs act as draft experts to generate reasoning paths that provide diverse localization candidates; in the verdict stage, a strong VLM synthesizes these paths to produce the final answer, minimizing computational cost while recovering correct answers. To further improve efficiency and accuracy, SV introduces a consensus expert selection mechanism that forwards only high-agreement reasoning paths to the verdict. Empirically, SV achieves consistent gains on challenging information-intensive and high-resolution visual question answering benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K. By synthesizing correct insights from multiple partially accurate reasoning paths, SV achieves both error correction and cost-efficiency compared to large proprietary models or training pipelines. Code is available at https://github.com/Tinaliu0123/speculative-verdict
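The consensus expert selection step can be pictured as ranking draft experts by how often their answers agree with the other drafts and forwarding only the top few. The sketch below assumes agreement is a case-insensitive exact match on the final answer string, which is a simplification chosen for illustration; the abstract does not say how agreement is measured, and `select_consensus` is a hypothetical name.

```python
def select_consensus(answers, k):
    """Names of the k draft experts whose answers agree most with the rest.

    answers: dict mapping expert name -> answer string. Ties keep the
    original insertion order.
    """
    names = list(answers)
    norm = {n: answers[n].strip().lower() for n in names}

    def agreement(name):
        # Number of other experts giving the same normalised answer.
        return sum(norm[name] == norm[o] for o in names if o != name)

    ranked = sorted(names, key=lambda n: (-agreement(n), names.index(n)))
    return ranked[:k]
```

In a full pipeline the reasoning paths of the selected experts, rather than just their answers, would be passed to the verdict model.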

[137] AutoScape: Geometry-Consistent Long-Horizon Scene Generation

Jiacheng Chen,Ziyu Jiang,Mingfu Liang,Bingbing Zhuang,Jong-Chyi Su,Sparsh Garg,Ying Wu,Manmohan Chandraker

Main category: cs.CV

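TL;DR: This paper proposes AutoScape, a long-horizon driving scene generation framework that uses an RGB-D diffusion model to generate geometrically consistent keyframes and a video diffusion model to interpolate between them into coherent long driving videos, substantially improving FID and FVD.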

Details Motivation: Existing methods struggle to maintain geometric consistency and visual quality when generating long-horizon driving scenes, so a generative framework is needed that models appearance and depth structure jointly while preserving long-term consistency. Method: A novel RGB-D diffusion model jointly handles image and depth in a shared latent space, conditions explicitly on point clouds from previously generated keyframes, and steers sampling with a warp-consistent guidance; a video diffusion model then interpolates between keyframes to produce dense video frames. Result: AutoScape generates high-quality, geometrically consistent driving videos of over 20 seconds, improving long-horizon FID and FVD over the prior state of the art by 48.6% and 43.0%, respectively. Conclusion: Through keyframe guidance and geometry-aware diffusion modeling, AutoScape effectively addresses geometric inconsistency in long-horizon driving scene generation and yields more realistic, stable videos. Abstract: This paper proposes AutoScape, a long-horizon driving scene generation framework. At its core is a novel RGB-D diffusion model that iteratively generates sparse, geometrically consistent keyframes, serving as reliable anchors for the scene's appearance and geometry. To maintain long-range geometric consistency, the model 1) jointly handles image and depth in a shared latent space, 2) explicitly conditions on the existing scene geometry (i.e., rendered point clouds) from previously generated keyframes, and 3) steers the sampling process with a warp-consistent guidance. Given high-quality RGB-D keyframes, a video diffusion model then interpolates between them to produce dense and coherent video frames. AutoScape generates realistic and geometrically consistent driving videos of over 20 seconds, improving the long-horizon FID and FVD scores over the prior state-of-the-art by 48.6% and 43.0%, respectively.

[138] ACS-SegNet: An Attention-Based CNN-SegFormer Segmentation Network for Tissue Segmentation in Histopathology

Nima Torbati,Anastasia Meshcheryakova,Ramona Woitek,Diana Mechtcheriakova,Amirreza Mahbod

Main category: cs.CV

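TL;DR: This paper proposes a dual-encoder model that fuses CNN and vision transformer features through attention-driven feature fusion to improve semantic segmentation of histopathology images, outperforming existing methods on the GCPS and PUMA datasets.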

Details Motivation: To improve the accuracy of semantic tissue segmentation in histopathology images and overcome the limitations of conventional deep learning models in capturing long-range dependencies and preserving detail. Method: A unified dual-encoder framework combines a convolutional neural network (CNN) with a vision transformer (ViT) and fuses their features through an attention mechanism to strengthen the feature representation. Result: The model reaches 76.79% μIoU and 86.87% μDice on the GCPS dataset and 64.93% μIoU and 76.60% μDice on the PUMA dataset, outperforming existing baseline models. Conclusion: The proposed attention-driven feature fusion effectively improves histopathology image segmentation and shows good application prospects. Abstract: Automated histopathological image analysis plays a vital role in computer-aided diagnosis of various diseases. Among developed algorithms, deep learning-based approaches have demonstrated excellent performance in multiple tasks, including semantic tissue segmentation in histological images. In this study, we propose a novel approach based on attention-driven feature fusion of convolutional neural networks (CNNs) and vision transformers (ViTs) within a unified dual-encoder model to improve semantic segmentation performance. Evaluation on two publicly available datasets showed that our model achieved μIoU/μDice scores of 76.79%/86.87% on the GCPS dataset and 64.93%/76.60% on the PUMA dataset, outperforming state-of-the-art and baseline benchmarks. The implementation of our method is publicly available in a GitHub repository: https://github.com/NimaTorbati/ACS-SegNet

[139] DyPE: Dynamic Position Extrapolation for Ultra High Resolution Diffusion

Noam Issachar,Guy Yariv,Sagie Benaim,Yossi Adi,Dani Lischinski,Raanan Fattal

Main category: cs.CV

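TL;DR: This paper proposes Dynamic Position Extrapolation (DyPE), a new training-free method that lets pre-trained diffusion transformers generate high-quality images at resolutions far beyond their training resolution, with no additional sampling cost.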

Details Motivation: Because self-attention scales quadratically with the number of image tokens, training diffusion transformers at ultra-high resolutions is extremely costly, so an efficient way to break the resolution limit is needed. Method: DyPE exploits the spectral progression inherent to the diffusion process and dynamically adjusts the model's positional encoding at each step so that its frequency spectrum matches the current stage of generation, allowing high-frequency details to be resolved progressively. Result: DyPE consistently improves generation quality on multiple benchmarks, generates images of up to 16 million pixels with FLUX, and achieves state-of-the-art fidelity in ultra-high-resolution generation, with larger gains at higher resolutions. Conclusion: DyPE is an effective training-free method that substantially extends the generation resolution of pre-trained diffusion transformers and offers a new approach to ultra-high-resolution image generation. Abstract: Diffusion Transformer models can generate images with remarkable fidelity and detail, yet training them at ultra-high resolutions remains extremely costly due to the self-attention mechanism's quadratic scaling with the number of image tokens. In this paper, we introduce Dynamic Position Extrapolation (DyPE), a novel, training-free method that enables pre-trained diffusion transformers to synthesize images at resolutions far beyond their training data, with no additional sampling cost. DyPE takes advantage of the spectral progression inherent to the diffusion process, where low-frequency structures converge early, while high frequencies take more steps to resolve. Specifically, DyPE dynamically adjusts the model's positional encoding at each diffusion step, matching its frequency spectrum with the current stage of the generative process. This approach allows us to generate images at resolutions that exceed the training resolution dramatically, e.g., 16 million pixels using FLUX. On multiple benchmarks, DyPE consistently improves performance and achieves state-of-the-art fidelity in ultra-high-resolution image generation, with gains becoming even more pronounced at higher resolutions. Project page is available at https://noamissachar.github.io/DyPE/.

[140] AlphaFlow: Understanding and Improving MeanFlow Models

Huijie Zhang,Aliaksandr Siarohin,Willi Menapace,Michael Vasilkovsky,Sergey Tulyakov,Qing Qu,Ivan Skorokhodov

Main category: cs.CV

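TL;DR: This paper proposes α-Flow, a new framework that unifies trajectory flow matching, Shortcut Model, and MeanFlow. A curriculum strategy eases the optimization conflict, improving the convergence and performance of few-step generative models and setting new state-of-the-art results on ImageNet.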

Details Motivation: MeanFlow is powerful for few-step generative modeling, but its training suffers from optimization conflict and slow convergence because the trajectory flow matching and trajectory consistency terms in its objective are strongly negatively correlated. Method: Gradient analysis exposes the optimization conflict in MeanFlow; the α-Flow family of objectives is then introduced with a curriculum that smoothly anneals from trajectory flow matching to MeanFlow, disentangling the conflicting objectives and improving training stability and efficiency. Result: On ImageNet-1K 256x256 with standard DiT backbones, α-Flow outperforms MeanFlow across settings; the largest α-Flow-XL/2+ model reaches FID scores of 2.58 (1-NFE) and 2.15 (2-NFE), a new state of the art. Conclusion: By disentangling the optimization objectives and introducing curriculum learning, α-Flow resolves the optimization conflict in MeanFlow and markedly improves the training efficiency and generation quality of few-step generative models. Abstract: MeanFlow has recently emerged as a powerful framework for few-step generative modeling trained from scratch, but its success is not yet fully understood. In this work, we show that the MeanFlow objective naturally decomposes into two parts: trajectory flow matching and trajectory consistency. Through gradient analysis, we find that these terms are strongly negatively correlated, causing optimization conflict and slow convergence. Motivated by these insights, we introduce $\alpha$-Flow, a broad family of objectives that unifies trajectory flow matching, Shortcut Model, and MeanFlow under one formulation. By adopting a curriculum strategy that smoothly anneals from trajectory flow matching to MeanFlow, $\alpha$-Flow disentangles the conflicting objectives, and achieves better convergence. When trained from scratch on class-conditional ImageNet-1K 256x256 with vanilla DiT backbones, $\alpha$-Flow consistently outperforms MeanFlow across scales and settings. Our largest $\alpha$-Flow-XL/2+ model achieves new state-of-the-art results using vanilla DiT backbones, with FID scores of 2.58 (1-NFE) and 2.15 (2-NFE).

[141] CUPID: Pose-Grounded Generative 3D Reconstruction from a Single Image

Binbin Huang,Haobin Duan,Yiqun Zhao,Zibo Zhao,Yi Ma,Shenghua Gao

Main category: cs.CV

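TL;DR: Cupid is a generation-based 3D reconstruction method that accurately infers an object's camera pose, 3D shape, and texture from a single 2D image, using a two-stage flow matching pipeline for robust pose and shape estimation within a unified framework.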

Details Motivation: Existing 3D reconstruction methods are limited in recovering accurate pose, shape, and texture from a single image, so a more robust, unified generative framework is needed. Method: 3D reconstruction is cast as conditional sampling from a learned distribution of 3D objects, jointly generating voxels and pixel-voxel correspondences. Using a distributional representation in a shared 3D latent space, a two-stage flow matching pipeline is designed: the first stage produces coarse initial 3D geometry for pose recovery, and the second stage integrates pose-aligned image features to improve structural fidelity and appearance detail. Result: Experiments show that Cupid gains over 3 dB in PSNR and reduces Chamfer Distance by over 10%, matches monocular estimators on pose accuracy, and surpasses baseline 3D generative models in visual quality. Conclusion: With a unified generative framework, Cupid achieves high-accuracy 3D reconstruction and outperforms existing methods in shape, pose, and texture recovery. Abstract: This work proposes a new generation-based 3D reconstruction method, named Cupid, that accurately infers the camera pose, 3D shape, and texture of an object from a single 2D image. Cupid casts 3D reconstruction as a conditional sampling process from a learned distribution of 3D objects, and it jointly generates voxels and pixel-voxel correspondences, enabling robust pose and shape estimation under a unified generative framework. By representing both input camera poses and 3D shape as a distribution in a shared 3D latent space, Cupid adopts a two-stage flow matching pipeline: (1) a coarse stage that produces initial 3D geometry with associated 2D projections for pose recovery; and (2) a refinement stage that integrates pose-aligned image features to enhance structural fidelity and appearance details. Extensive experiments demonstrate Cupid outperforms leading 3D reconstruction methods with an over 3 dB PSNR gain and an over 10% Chamfer Distance reduction, while matching monocular estimators on pose accuracy and delivering superior visual fidelity over baseline 3D generative models. For an immersive view of the 3D results generated by Cupid, please visit cupid3d.github.io.

[142] Radar-Camera Fused Multi-Object Tracking: Online Calibration and Common Feature

Lei Cheng,Siyang Cao

Main category: cs.CV

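TL;DR: This paper proposes a multi-object tracking framework that fuses radar and camera data, using online joint calibration and feature matching to improve tracking accuracy and reduce manual intervention. To the authors' knowledge, it is the first study to use radar-camera common features for online calibration in multi-object tracking.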

Details Motivation: Existing studies often undervalue radar and fail to exploit its ability to provide accurate range information in 3D space; this paper aims to highlight radar's key role and reduce reliance on manual calibration. Method: A radar-camera fusion multi-object tracking framework uses features common to both sensors for online joint calibration, and combines feature matching with category-consistency checking to associate the sensors' detections more accurately. Result: In real-world experiments, the framework shows a more streamlined radar-camera mapping process and higher tracking precision, in both controlled environments and actual traffic scenarios. Conclusion: The method improves the automation and accuracy of multi-object tracking, confirms radar's key role in fusion systems, and advances multi-sensor fusion for autonomous driving. Abstract: This paper presents a Multi-Object Tracking (MOT) framework that fuses radar and camera data to enhance tracking efficiency while minimizing manual interventions. Contrary to many studies that underutilize radar and assign it a supplementary role (despite its capability to provide accurate range/depth information of targets in a world 3D coordinate system), our approach positions radar in a crucial role. Meanwhile, this paper utilizes common features to enable online calibration to autonomously associate detections from radar and camera. The main contributions of this work include: (1) the development of a radar-camera fusion MOT framework that exploits online radar-camera calibration to simplify the integration of detection results from these two sensors, (2) the utilization of common features between radar and camera data to accurately derive real-world positions of detected objects, and (3) the adoption of feature matching and category-consistency checking to surpass the limitations of mere position matching in enhancing sensor association accuracy. To the best of our knowledge, we are the first to investigate the integration of radar-camera common features and their use in online calibration for achieving MOT. The efficacy of our framework is demonstrated by its ability to streamline the radar-camera mapping process and improve tracking precision, as evidenced by real-world experiments conducted in both controlled environments and actual traffic scenarios. Code is available at https://github.com/radar-lab/Radar_Camera_MOT

[143] ARGenSeg: Image Segmentation with Autoregressive Image Generation Model

Xiaolong Wang,Lixiang Ru,Ziyuan Huang,Kaixiang Ji,Dandan Zheng,Jingdong Chen,Jun Zhou

Main category: cs.CV

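TL;DR: Proposes ARGenSeg, a new image segmentation paradigm based on autoregressive generation that achieves multimodal understanding and pixel-level perception within a unified framework.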

Details Motivation: Existing methods rely on discrete representations or task-specific decoders and struggle to capture fine-grained visual details, which limits multimodal large models on image segmentation. Method: Within an image generation framework, the MLLM outputs visual tokens that a universal VQ-VAE decodes into dense masks, giving end-to-end pixel-level segmentation; a next-scale-prediction strategy generates visual tokens in parallel to reduce inference latency. Result: The method surpasses prior state-of-the-art approaches on multiple segmentation datasets with a marked boost in inference speed, while retaining strong understanding capabilities. Conclusion: ARGenSeg offers an efficient, fine-grained new paradigm for image segmentation in multimodal large language models and advances generative methods for pixel-level tasks. Abstract: We propose a novel AutoRegressive Generation-based paradigm for image Segmentation (ARGenSeg), achieving multimodal understanding and pixel-level perception within a unified framework. Prior works integrating image segmentation into multimodal large language models (MLLMs) typically employ either boundary points representation or dedicated segmentation heads. These methods rely on discrete representations or semantic prompts fed into task-specific decoders, which limits the ability of the MLLM to capture fine-grained visual details. To address these challenges, we introduce a segmentation framework for MLLM based on image generation, which naturally produces dense masks for target objects. We leverage MLLM to output visual tokens and detokenize them into images using a universal VQ-VAE, making the segmentation fully dependent on the pixel-level understanding of the MLLM. To reduce inference latency, we employ a next-scale-prediction strategy to generate required visual tokens in parallel. Extensive experiments demonstrate that our method surpasses prior state-of-the-art approaches on multiple segmentation datasets with a remarkable boost in inference speed, while maintaining strong understanding capabilities.

[144] Video Prediction of Dynamic Physical Simulations With Pixel-Space Spatiotemporal Transformers

Dean L Slack,G Thomas Hudson,Thomas Winterbottom,Noura Al Moubayed

Main category: cs.CV

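TL;DR: Proposes a pure-transformer, end-to-end video prediction model that uses continuous pixel-space representations and achieves longer-horizon, more physically accurate predictions on physical simulation data, extending the time horizon by up to 50% over existing methods while performing well on parameter-estimation generalization and interpretability.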

Details Motivation: Existing video generation methods fall short on causal modeling of physical simulations and struggle to make physically consistent long-horizon predictions, so a simple, effective model is needed that isolates spatiotemporal reasoning and focuses on learning physical laws. Method: A pure transformer performs autoregressive video prediction directly in continuous pixel space. Several spatiotemporal self-attention layouts are compared, spatiotemporal reasoning is assessed through physical object tracking metrics and unsupervised training, and probing models are used for interpretability analysis to identify the network regions that encode PDE parameters. Result: Compared with existing latent-space approaches, the model extends the time horizon for physically accurate predictions by up to 50% while remaining comparable on common video quality metrics; probing experiments show that it generalizes to estimating out-of-distribution PDE parameters, indicating that the learned representations are physically meaningful and interpretable. Conclusion: A pure transformer with continuous pixel-space representations is a simple, parameter-efficient, and interpretable approach that provides a platform for attention-based spatiotemporal video modeling, especially for prediction tasks that need long-term physical consistency. Abstract: Inspired by the performance and scalability of autoregressive large language models (LLMs), transformer-based models have seen recent success in the visual domain. This study investigates a transformer adaptation for video prediction with a simple end-to-end approach, comparing various spatiotemporal self-attention layouts. Focusing on causal modeling of physical simulations over time, a common shortcoming of existing video-generative approaches, we attempt to isolate spatiotemporal reasoning via physical object tracking metrics and unsupervised training on physical simulation datasets. We introduce a simple yet effective pure transformer model for autoregressive video prediction that uses continuous pixel-space representations. Without the need for complex training strategies or latent feature-learning components, our approach significantly extends the time horizon for physically accurate predictions by up to 50% when compared with existing latent-space approaches, while maintaining comparable performance on common video quality metrics. In addition, we conduct interpretability experiments to identify network regions that encode information useful to perform accurate estimations of PDE simulation parameters via probing models, and find that this generalizes to the estimation of out-of-distribution simulation parameters. This work serves as a platform for further attention-based spatiotemporal modeling of videos via a simple, parameter-efficient, and interpretable approach.

[145] SpectraMorph: Structured Latent Learning for Self-Supervised Hyperspectral Super-Resolution

Ritik Shah,Marco F Duarte

Main category: cs.CV

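TL;DR: Proposes SpectraMorph, a physics-guided, self-supervised hyperspectral-multispectral image fusion framework that reconstructs spectra through an unmixing bottleneck and linear mixing, offering interpretability, fast training, and robustness to MSIs with few bands.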

Details Motivation: Existing deep learning methods for hyperspectral super-resolution lack interpretability, and their performance drops when the multispectral image has very few bands. Method: A physics-guided self-supervised framework extracts endmembers from the low-resolution hyperspectral image, predicts abundance maps from the multispectral image with a multilayer perceptron, reconstructs spectra by linear mixing, and trains in a self-supervised manner using the multispectral sensor's spectral response function. Result: On synthetic and real-world datasets, SpectraMorph outperforms existing unsupervised/self-supervised methods and is competitive with supervised ones; it stays robust with very few bands (for example, a single band) and trains in under a minute. Conclusion: SpectraMorph delivers high-performance, interpretable, and robust hyperspectral super-resolution, suits a range of sensor configurations, and stands out with few-band inputs. Abstract: Hyperspectral sensors capture dense spectra per pixel but suffer from low spatial resolution, causing blurred boundaries and mixed-pixel effects. Co-registered companion sensors such as multispectral, RGB, or panchromatic cameras provide high-resolution spatial detail, motivating hyperspectral super-resolution through the fusion of hyperspectral and multispectral images (HSI-MSI). Existing deep learning based methods achieve strong performance but rely on opaque regressors that lack interpretability and often fail when the MSI has very few bands. We propose SpectraMorph, a physics-guided self-supervised fusion framework with a structured latent space. Instead of direct regression, SpectraMorph enforces an unmixing bottleneck: endmember signatures are extracted from the low-resolution HSI, and a compact multilayer perceptron predicts abundance-like maps from the MSI. Spectra are reconstructed by linear mixing, with training performed in a self-supervised manner via the MSI sensor's spectral response function. SpectraMorph produces interpretable intermediates, trains in under a minute, and remains robust even with a single-band (panchromatic) MSI. Experiments on synthetic and real-world datasets show SpectraMorph consistently outperforming state-of-the-art unsupervised/self-supervised baselines while remaining very competitive against supervised baselines.
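The unmixing bottleneck described above rests on the standard linear mixing model: each pixel's spectrum is a weighted sum of endmember signatures, and projecting that spectrum through the sensor's spectral response function (SRF) should reproduce the observed MSI. The sketch below illustrates only that arithmetic with random data. The softmax constraint on abundances, the mean-squared-error loss, and all function and variable names are assumptions made for illustration; random logits stand in for the MLP's output.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax; yields non-negative weights summing to 1."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def reconstruct_spectra(abundance_logits, endmembers):
    """Linear mixing: (pixels x K) abundances times (K x HSI bands) endmembers."""
    abundances = softmax(abundance_logits, axis=-1)  # assumed constraint
    return abundances @ endmembers, abundances


def srf_loss(spectra, srf, msi):
    """Self-supervised loss: project spectra through the SRF, compare to the MSI."""
    return float(np.mean((spectra @ srf - msi) ** 2))


rng = np.random.default_rng(0)
num_pixels, num_endmembers, hsi_bands, msi_bands = 6, 3, 8, 2

endmembers = rng.random((num_endmembers, hsi_bands))   # would come from the low-res HSI
srf = rng.random((hsi_bands, msi_bands))
srf /= srf.sum(axis=0, keepdims=True)                  # each MSI band's response sums to 1
logits = rng.normal(size=(num_pixels, num_endmembers)) # stand-in for the MLP output

spectra, abundances = reconstruct_spectra(logits, endmembers)
msi = spectra @ srf                                    # synthetic "observed" MSI
loss = srf_loss(spectra, srf, msi)
print(spectra.shape, loss)
```

Because the synthetic MSI is generated from the same spectra, the loss here is zero by construction; in training, the MSI would be the real observation and the gradient would flow back into the abundance predictor.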

[146] Towards General Modality Translation with Contrastive and Predictive Latent Diffusion Bridge

Nimrod Berman,Omkar Joglekar,Eitan Kosman,Dotan Di Castro,Omri Azencot

Main category: cs.CV

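TL;DR: This paper proposes the Latent Denoising Diffusion Bridge Model (LDDBM), a general modality translation framework that builds a denoising diffusion bridge in a shared latent space to translate between arbitrary modalities without aligned dimensions or modality-specific assumptions, and performs strongly on a range of tasks.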

Details Motivation: Existing modality translation methods usually rely on restrictive assumptions such as shared dimensionality, Gaussian priors, and modality-specific architectures, which limit their generality and theoretical grounding, so a more general and well-founded approach is needed. Method: LDDBM is a latent-variable extension of Denoising Diffusion Bridge Models that learns the mapping between modalities in a shared latent space. It adds a contrastive alignment loss to ensure semantic consistency, a domain-agnostic encoder-decoder architecture for noise prediction in latent space, and a predictive loss with several training strategies to improve cross-domain translation accuracy and training stability. Result: The method supports arbitrary modality pairs and performs strongly on multi-view to 3D shape generation, image super-resolution, and multi-view scene synthesis; experiments and ablations confirm its effectiveness. Conclusion: LDDBM provides a new strong baseline for general modality translation, overcomes the limitations of existing methods, and shows good scalability and practical potential. Abstract: Recent advances in generative modeling have positioned diffusion models as state-of-the-art tools for sampling from complex data distributions. While these models have shown remarkable success across single-modality domains such as images and audio, extending their capabilities to Modality Translation (MT), translating information across different sensory modalities, remains an open challenge. Existing approaches often rely on restrictive assumptions, including shared dimensionality, Gaussian source priors, and modality-specific architectures, which limit their generality and theoretical grounding. In this work, we propose the Latent Denoising Diffusion Bridge Model (LDDBM), a general-purpose framework for modality translation based on a latent-variable extension of Denoising Diffusion Bridge Models. By operating in a shared latent space, our method learns a bridge between arbitrary modalities without requiring aligned dimensions. We introduce a contrastive alignment loss to enforce semantic consistency between paired samples and design a domain-agnostic encoder-decoder architecture tailored for noise prediction in latent space. Additionally, we propose a predictive loss to guide training toward accurate cross-domain translation and explore several training strategies to improve stability. Our approach supports arbitrary modality pairs and performs strongly on diverse MT tasks, including multi-view to 3D shape generation, image super-resolution, and multi-view scene synthesis. Comprehensive experiments and ablations validate the effectiveness of our framework, establishing a new strong baseline in general modality translation. For more information, see our project page: https://sites.google.com/view/lddbm/home.

[147] LayerComposer: Interactive Personalized T2I via Spatially-Aware Layered Canvas

Guocheng Gordon Qian,Ruihang Zhang,Tsai-Shien Chen,Yusuf Dalva,Anujraaj Argo Goyal,Willi Menapace,Ivan Skorokhodov,Meng Dong,Arpit Sahni,Daniil Ostashev,Ju Hu,Sergey Tulyakov,Kuan-Chieh Jackson Wang

Main category: cs.CV

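TL;DR: Proposes LayerComposer, an interactive framework for personalized multi-subject text-to-image generation that uses a layered canvas and a locking mechanism for better spatial control and identity preservation.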

Details Motivation: Existing personalized generative models offer too little interactive control over spatial composition and scale poorly to multiple subjects. Method: A layered canvas representation places each subject on its own layer, enabling occlusion-free composition; a locking mechanism preserves selected layers with high fidelity while the remaining layers adapt flexibly to the context. Result: Experiments show that LayerComposer achieves better spatial control and identity preservation than existing methods in multi-subject personalized image generation. Conclusion: With a layered structure and a locking mechanism that needs no architectural changes, LayerComposer improves the controllability and quality of multi-subject generation. Abstract: Despite their impressive visual fidelity, existing personalized generative models lack interactive control over spatial composition and scale poorly to multiple subjects. To address these limitations, we present LayerComposer, an interactive framework for personalized, multi-subject text-to-image generation. Our approach introduces two main contributions: (1) a layered canvas, a novel representation in which each subject is placed on a distinct layer, enabling occlusion-free composition; and (2) a locking mechanism that preserves selected layers with high fidelity while allowing the remaining layers to adapt flexibly to the surrounding context. Similar to professional image-editing software, the proposed layered canvas allows users to place, resize, or lock input subjects through intuitive layer manipulation. Our versatile locking mechanism requires no architectural changes, relying instead on inherent positional embeddings combined with a new complementary data sampling strategy. Extensive experiments demonstrate that LayerComposer achieves superior spatial control and identity preservation compared to the state-of-the-art methods in multi-subject personalized image generation.

[148] HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives

Yihao Meng,Hao Ouyang,Yue Yu,Qiuyu Wang,Wen Wang,Ka Leong Cheng,Hanlin Wang,Yixuan Li,Cheng Chen,Yanhong Zeng,Yujun Shen,Huamin Qu

Main category: cs.CV

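TL;DR: HoloCine is a new text-to-video model that bridges the "narrative gap" by generating entire scenes holistically, achieving global consistency from the first shot to the last and supporting automated filmmaking.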

Details Motivation: Existing text-to-video models can only generate isolated clips and lack multi-shot narrative coherence, so they fall short of what storytelling requires. Method: HoloCine uses a Window Cross-Attention mechanism to localize text prompts to specific shots and a Sparse Inter-Shot Self-Attention pattern (dense within shots, sparse between them) to keep generation efficient. Result: HoloCine sets a new state of the art in narrative coherence and shows emergent abilities such as persistent memory for characters and scenes and a grasp of cinematic techniques. Conclusion: HoloCine marks a pivotal shift from clip synthesis toward end-to-end automated filmmaking and advances long, coherent video generation. Abstract: State-of-the-art text-to-video models excel at generating isolated clips but fall short of creating the coherent, multi-shot narratives that are the essence of storytelling. We bridge this "narrative gap" with HoloCine, a model that generates entire scenes holistically to ensure global consistency from the first shot to the last. Our architecture achieves precise directorial control through a Window Cross-Attention mechanism that localizes text prompts to specific shots, while a Sparse Inter-Shot Self-Attention pattern (dense within shots but sparse between them) ensures the efficiency required for minute-scale generation. Beyond setting a new state-of-the-art in narrative coherence, HoloCine develops remarkable emergent abilities: a persistent memory for characters and scenes, and an intuitive grasp of cinematic techniques. Our work marks a pivotal shift from clip synthesis towards automated filmmaking, making end-to-end cinematic creation a tangible future. Our code is available at: https://holo-cine.github.io/.
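A "dense within shots, sparse between shots" self-attention pattern can be pictured as a boolean mask over token pairs. The abstract does not specify which cross-shot connections HoloCine keeps, so the sketch below makes an illustrative assumption: every token attends to all tokens in its own shot, plus the first `summary_tokens` tokens of every other shot. The function name and that parameter are hypothetical.

```python
def shot_attention_mask(shot_lengths, summary_tokens=1):
    """Return mask[q][k] = True where query token q may attend to key token k.

    Assumption (not from the paper): cross-shot attention is limited to the
    first `summary_tokens` tokens of each shot.
    """
    shot_of, pos_in_shot = [], []
    for shot_id, length in enumerate(shot_lengths):
        for position in range(length):
            shot_of.append(shot_id)
            pos_in_shot.append(position)

    n = len(shot_of)
    return [
        [shot_of[q] == shot_of[k] or pos_in_shot[k] < summary_tokens
         for k in range(n)]
        for q in range(n)
    ]


# Two shots of 3 and 2 tokens: tokens 0-2 belong to shot 0, tokens 3-4 to shot 1.
mask = shot_attention_mask([3, 2], summary_tokens=1)
allowed = sum(sum(row) for row in mask)
print(f"{allowed} of {len(mask) ** 2} token pairs are attended")
```

With long shots, the number of allowed pairs grows roughly with the sum of squared shot lengths instead of the square of the total length, which is the kind of saving the description attributes to the sparse pattern.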