cs.CL

[1] DeBERTa-KC: A Transformer-Based Classifier for Knowledge Construction in Online Learning Discourse

Jindi Wang, Yidi Zhang, Zhaoxing Li

Main category: cs.CL

TL;DR: This study proposes DeBERTa-KC, a DeBERTa-based model that automatically classifies knowledge construction levels in comments on YouTube science videos. By adding Focal Loss, Label Smoothing, and R-Drop, it reaches a macro-F1 of 0.836 across four knowledge construction categories, significantly outperforming baseline models.

Details Motivation: Existing methods struggle to identify complex knowledge construction processes in informal online learning environments, especially higher-order cognitive engagement, so a model is needed that classifies knowledge construction levels accurately and at scale. Method: Built on DeBERTa-v3, the model uses Focal Loss to handle class imbalance and Label Smoothing and R-Drop to improve generalization. A reproducible end-to-end pipeline covers data collection, annotation, preprocessing, training, and evaluation, and the model is evaluated with 10-fold stratified cross-validation. Result: On 20,000 manually annotated comments from YouTube science channels, DeBERTa-KC achieves a macro-F1 of 0.836 ± 0.008, significantly outperforming classical and transformer baselines (p<0.01), with particularly strong recognition of the Explore and Negotiate categories. Conclusion: DeBERTa-KC effectively captures subtle indicators of knowledge construction in informal digital learning environments, offering a scalable, theory-driven approach to automated analysis of learners' cognitive engagement. Abstract: This study presents DeBERTa-KC, a transformer-based model for automatic classification of knowledge construction (KC) levels in online science learning discourse. Using comments collected from four popular YouTube science channels (2022-2024), a balanced corpus of 20,000 manually annotated samples was created across four KC categories: nonKC, Share, Explore, and Negotiate. The proposed model extends DeBERTa-v3 with Focal Loss, Label Smoothing, and R-Drop regularization to address class imbalance and enhance generalization. A reproducible end-to-end pipeline was implemented, encompassing data extraction, annotation, preprocessing, training, and evaluation. Across 10-fold stratified cross-validation, DeBERTa-KC achieved a macro-F1 of $0.836 \pm 0.008$, significantly outperforming both classical and transformer baselines ($p<0.01$). Per-category results indicate strong sensitivity to higher-order epistemic engagement, particularly in Explore and Negotiate discourse. These findings demonstrate that large language models can effectively capture nuanced indicators of knowledge construction in informal digital learning environments, offering scalable, theory-informed approaches to discourse analysis and the development of automated tools for assessing epistemic engagement.
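
To make the loss terms named in the abstract concrete, here is a minimal sketch of Focal Loss applied to label-smoothed targets for a single example. How the paper actually combines the two terms is not stated in the abstract, so the combination below, the function names, and the defaults `gamma=2.0` and `epsilon=0.1` are assumptions for illustration; R-Drop is not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss_with_smoothing(logits, target, gamma=2.0, epsilon=0.1):
    """Focal loss on label-smoothed targets for one example.

    gamma and epsilon are illustrative defaults, not values from the paper.
    """
    probs = softmax(logits)
    k = len(logits)
    loss = 0.0
    for c, p in enumerate(probs):
        # Smoothed target: epsilon/k on every class, plus 1 - epsilon on the true class.
        q = epsilon / k + (1.0 - epsilon if c == target else 0.0)
        # The (1 - p)^gamma factor down-weights classes the model already predicts confidently.
        loss += -q * (1.0 - p) ** gamma * math.log(p)
    return loss
```

With `gamma=0` and `epsilon=0` the function reduces to ordinary cross-entropy, which is a convenient sanity check when tuning either term.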

[2] An Evaluation of the Pedagogical Soundness and Usability of AI-Generated Lesson Plans Across Different Models and Prompt Frameworks in High-School Physics

Xincheng Liu

Main category: cs.CL

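TL;DR: This study evaluates the pedagogical soundness and usability of lesson plans generated by five mainstream large language models for the high-school physics topic "The Electromagnetic Spectrum". Model choice mainly affects linguistic readability, whereas prompt framework structure (especially RACE) markedly improves factual accuracy and alignment with curriculum standards. Learning objectives, however, mostly stay at the Remember and Understand levels, with few higher-order cognitive objectives.

Details Motivation: As AI is used ever more widely in education, the pedagogical effectiveness and reliability of different large language models and prompt engineering strategies for generating teaching materials need to be evaluated, so that educators can choose AI tools sensibly. Method: Five models (ChatGPT, Claude, Gemini, DeepSeek, and Grok) were combined with three prompt frameworks (TAG, RACE, and COSTAR) to generate 15 lesson plans on the same high-school physics topic, which were analyzed with four automated metrics: readability, factual accuracy, curriculum-standard alignment, and cognitive demand. Result: DeepSeek produced the most readable lesson plan (FKGL = 8.64) and Claude the most complex language (FKGL = 19.89); lesson plans generated under the RACE framework had the fewest factual errors and the best alignment with NGSS standards; learning objectives in all lesson plans concentrated at the Remember and Understand tiers of Bloom's taxonomy and lacked higher-order thinking objectives. Conclusion: Model choice determines lesson-plan readability, while the prompt framework (especially RACE) has more influence on instructional accuracy and curricular alignment; the best configuration combines a highly readable model, the RACE framework, and an explicit checklist of core concepts, standards, and higher-order objectives. Abstract: This study evaluates the pedagogical soundness and usability of AI-generated lesson plans across five leading large language models: ChatGPT (GPT-5), Claude Sonnet 4.5, Gemini 2.5 Flash, DeepSeek V3.2, and Grok 4. Beyond model choice, three structured prompt frameworks were tested: TAG (Task, Audience, Goal), RACE (Role, Audience, Context, Execution), and COSTAR (Context, Objective, Style, Tone, Audience, Response Format). Fifteen lesson plans were generated for a single high-school physics topic, The Electromagnetic Spectrum. The lesson plans were analyzed through four automated computational metrics: (1) readability and linguistic complexity, (2) factual accuracy and hallucination detection, (3) standards and curriculum alignment, and (4) cognitive demand of learning objectives. Results indicate that model selection exerted the strongest influence on linguistic accessibility, with DeepSeek producing the most readable teaching plan (FKGL = 8.64) and Claude generating the densest language (FKGL = 19.89). The prompt framework structure most strongly affected factual accuracy and pedagogical completeness, with the RACE framework yielding the lowest hallucination index and the highest incidental alignment with NGSS curriculum standards. Across all models, the learning objectives in the fifteen lesson plans clustered at the Remember and Understand tiers of Bloom's taxonomy. Few higher-order verbs appeared in the extracted learning objectives.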
Overall, the findings suggest that readability is significantly governed by model design, while instructional reliability and curricular alignment depend more on the prompt framework. The most effective configuration for lesson plans identified in the results was to combine a readability-optimized model with the RACE framework and an explicit checklist of physics concepts, curriculum standards, and higher-order objectives.
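
The FKGL values quoted above refer to the Flesch-Kincaid Grade Level, defined as 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below computes it with a crude vowel-group syllable counter; the tokenization and the syllable heuristic are assumptions, so scores will differ somewhat from those of whatever tool the study used.

```python
import re


def count_syllables(word):
    """Rough heuristic (an assumption): one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def fkgl(text):
    """Flesch-Kincaid Grade Level of an English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text needs at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)
```

Short sentences built from short words score low or even negative, while long sentences with polysyllabic vocabulary push the grade level up, which is the direction of the DeepSeek versus Claude contrast reported in the abstract.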

[3] From Denoising to Refining: A Corrective Framework for Vision-Language Diffusion Model

Yatai Ji, Teng Wang, Yuying Ge, Zhiheng Liu, Sidi Yang, Ying Shan, Ping Luo

Main category: cs.CL

TL;DR: This paper proposes ReDiff, an improved discrete diffusion model that shifts generation from passive denoising to active refining, addressing the error cascades caused by the train-inference discrepancy in vision-language tasks.

Details Motivation: Discrete diffusion models show promise for vision-language tasks, but the gap between training and inference lets initial decoding errors trigger a chain reaction, producing syntactic errors and semantic hallucinations that limit practical use. Method: The ReDiff framework uses two-stage training. The model is first trained to revise synthetic errors, which builds a foundational revision capability; an online self-correction loop then has the model learn from an expert's corrections to revise its own flawed drafts. Result: Experiments show that ReDiff significantly improves the coherence and factual accuracy of generated content and achieves more stable and efficient parallel generation than traditional denoising methods. Conclusion: ReDiff's active refinement mechanism effectively breaks the error cascade, offering a new route to stable use of discrete diffusion models in generation tasks. Abstract: Discrete diffusion models have emerged as a promising direction for vision-language tasks, offering bidirectional context modeling and theoretical parallelization. However, their practical application is severely hindered by a train-inference discrepancy, which leads to catastrophic error cascades: initial token errors during parallel decoding pollute the generation context, triggering a chain reaction of compounding errors and leading to syntactic errors and semantic hallucinations. To address this fundamental challenge, we reframe the generation process from passive denoising to active refining. We introduce ReDiff, a refining-enhanced diffusion framework that teaches the model to identify and correct its own errors. Our approach features a two-stage training process: first, we instill a foundational revision capability by training the model to revise synthetic errors; second, we implement a novel online self-correction loop where the model is explicitly trained to revise its own flawed drafts by learning from an expert's corrections. This mistake-driven learning endows the model with the crucial ability to revisit and refine its already generated output, effectively breaking the error cascade. Extensive experiments demonstrate that ReDiff significantly improves the coherence and factual accuracy of generated content, enabling stable and efficient parallel generation far superior to traditional denoising methods. Our code and models are available at https://rediff-hku.github.io/.

[4] Stream: Scaling up Mechanistic Interpretability to Long Context in LLMs via Sparse Attention

J Rosser, José Luis Redondo García, Gustavo Penha, Konstantina Palla, Hugues Bouchard

Main category: cs.CL

TL;DR: Proposes Sparse Tracing, which uses dynamic sparse attention and the Stream algorithm to analyze attention patterns over very long contexts in near-linear time and linear space, sharply reducing compute requirements while preserving key information paths.

Details Motivation: Traditional mechanistic interpretability techniques incur excessive compute and memory costs on million-token contexts and are hard to scale to long-context settings. Method: Proposes the Stream algorithm, which uses binary-search-style refinement pruning to keep the top-k key blocks for each query in every attention layer, enabling interpretability analysis in near-linear time O(T log T) and linear space O(T). Result: Prunes 97-99% of token interactions on chain-of-thought reasoning traces; on the RULER benchmark, preserves critical retrieval paths while pruning 90-96% of interactions, and reveals layer-wise routes from the "needle" to the output. Conclusion: Sparse Tracing makes long-context interpretability runnable on consumer GPUs, provides a practical tool for attention analysis in large models, and helps democratize chain-of-thought monitoring. Abstract: As Large Language Models (LLMs) scale to million-token contexts, traditional Mechanistic Interpretability techniques for analyzing attention scale quadratically with context length, demanding terabytes of memory beyond 100,000 tokens. We introduce Sparse Tracing, a novel technique that leverages dynamic sparse attention to efficiently analyze long-context attention patterns. We present Stream, a compilable hierarchical pruning algorithm that estimates per-head sparse attention masks in near-linear time $O(T \log T)$ and linear space $O(T)$, enabling one-pass interpretability at scale. Stream performs a binary-search-style refinement to retain only the top-$k$ key blocks per query while preserving the model's next-token behavior. We apply Stream to long chain-of-thought reasoning traces and identify thought anchors while pruning 97-99% of token interactions. On the RULER benchmark, Stream preserves critical retrieval paths while discarding 90-96% of interactions and exposes layer-wise routes from the needle to output. Our method offers a practical drop-in tool for analyzing attention patterns and tracing information flow without terabytes of caches. By making long-context interpretability feasible on consumer GPUs, Sparse Tracing helps democratize chain-of-thought monitoring. Code is available at https://anonymous.4open.science/r/stream-03B8/.
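
The idea of keeping only the top-k key blocks for a query can be illustrated with a flat selection over one row of attention scores. This is not Stream itself: it omits the hierarchical, binary-search-style refinement, and scoring a block by its maximum entry, along with the function names, is an assumption made for the sketch.

```python
import heapq


def topk_key_blocks(scores, block_size, k):
    """Indices of the k key blocks with the highest maximum score for one query."""
    if block_size <= 0 or k <= 0:
        raise ValueError("block_size and k must be positive")
    block_scores = []
    for start in range(0, len(scores), block_size):
        block = scores[start:start + block_size]
        # Score each block by its largest entry (an assumed scoring rule).
        block_scores.append((max(block), start // block_size))
    top = heapq.nlargest(k, block_scores)
    return sorted(index for _, index in top)


def block_mask(num_keys, block_size, kept_blocks):
    """Boolean mask over key positions: True where the position's block is kept."""
    kept = set(kept_blocks)
    return [(i // block_size) in kept for i in range(num_keys)]
```

Because this flat version scores every block for every query, it is quadratic in context length when repeated across queries; avoiding that cost is what the abstract credits to the hierarchical refinement.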

[5] Automated HIV Screening on Dutch EHR with Large Language Models

Lang Zhou, Amrish Jhingoer, Yinghao Luo, Klaske Vliegenthart-Jongbloed, Carlijn Jordans, Ben Werkhoven, Tom Seinen, Erik van Mulligen, Casper Rokx, Yunlei Li

Main category: cs.CL

TL;DR: Proposes a new method based on large language models (LLMs) that uses unstructured text in electronic health records to make screening for HIV testing more efficient and accurate.

Details Motivation: Existing HIV diagnosis research relies mainly on structured data and overlooks important risk information that may be contained in unstructured text such as clinical notes; large-scale laboratory testing is also infeasible. Method: Builds a new pipeline that uses an LLM to analyze unstructured text in electronic health records (EHRs) and determine whether a patient needs further HIV testing. Result: Experiments on clinical data from Erasmus University Medical Center Rotterdam show that the method achieves high accuracy while maintaining a low false negative rate. Conclusion: The proposed LLM-driven pipeline makes effective use of unstructured EHR text, improves the efficiency and performance of HIV screening, and has potential for clinical use. Abstract: Efficient screening and early diagnosis of HIV are critical for reducing onward transmission. Although large-scale laboratory testing is not feasible, the widespread adoption of Electronic Health Records (EHRs) offers new opportunities to address this challenge. Existing research primarily focuses on applying machine learning methods to structured data, such as patient demographics, for improving HIV diagnosis. However, these approaches often overlook unstructured text data such as clinical notes, which potentially contain valuable information relevant to HIV risk. In this study, we propose a novel pipeline that leverages a Large Language Model (LLM) to analyze unstructured EHR text and determine a patient's eligibility for further HIV testing. Experimental results on clinical data from Erasmus University Medical Center Rotterdam demonstrate that our pipeline achieved high accuracy while maintaining a low false negative rate.

[6] An Expert-grounded benchmark of General Purpose LLMs in LCA

Artur Donaldson, Bharathan Balaji, Cajetan Oriekezie, Manish Kumar, Laure Patouillard

Main category: cs.CL

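TL;DR: This study presents the first systematic, expert-evaluated benchmark of large language models (LLMs) in life cycle assessment (LCA), covering 11 mainstream LLMs on 22 LCA tasks. Experts found that 37% of responses contained incorrect or misleading information and that some models hallucinated citations at rates of up to 40%. Explanation quality was nonetheless generally good, and open-weight models performed no worse than closed-source ones.

Details Motivation: Although LLMs are being widely explored for LCA across environmental and social domains, the lack of standardized evaluation frameworks and of a clear ground truth leaves little systematic evidence on their reliability, robustness, and usability, so an expert-grounded benchmark is needed to fill the gap. Method: The study evaluated 11 general-purpose LLMs (both open-source and commercial) on 22 LCA-related tasks. Seventeen experienced practitioners reviewed the outputs for scientific accuracy, explanation quality, robustness, verifiability, and instruction adherence, yielding 168 expert reviews. Result: Experts judged 37% of model responses to contain inaccurate or misleading information; most models did well on explanation quality and format adherence; hallucinated-citation rates varied widely across models, reaching up to 40%; open-weight models matched or exceeded closed-source models on accuracy and explanation quality. Conclusion: Using LLMs directly as free-form question-answering "oracles" in LCA is risky, particularly because hallucinations and misinformation can have serious consequences. LLMs nonetheless show potential for improving explanation quality and easing the burden of simple tasks, and future work should add grounding mechanisms to improve reliability. Abstract: Purpose: Artificial intelligence (AI), and in particular large language models (LLMs), are increasingly being explored as tools to support life cycle assessment (LCA). While demonstrations exist across environmental and social domains, systematic evidence on their reliability, robustness, and usability remains limited. This study provides the first expert-grounded benchmark of LLMs in LCA, addressing the absence of standardized evaluation frameworks in a field where no clear ground truth or consensus protocols exist. Methods: We evaluated eleven general-purpose LLMs, spanning both commercial and open-source families, across 22 LCA-related tasks. Seventeen experienced practitioners reviewed model outputs against criteria directly relevant to LCA practice, including scientific accuracy, explanation quality, robustness, verifiability, and adherence to instructions. We collected 168 expert reviews. Results: Experts judged 37% of responses to contain inaccurate or misleading information. Accuracy and quality of explanation were generally rated average or good for many models, including smaller ones, and format adherence was generally rated favourably. Hallucination rates varied significantly, with some models producing hallucinated citations at rates of up to 40%. There was no clear-cut distinction between ratings on open-weight versus closed-weight LLMs, with open-weight models outperforming or competing on par with closed-weight models on criteria such as accuracy and quality of explanation.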
Conclusion: These findings highlight the risks of applying LLMs naïvely in LCA, such as when LLMs are treated as free-form oracles, while also showing benefits especially around quality of explanation and alleviating labour intensiveness of simple tasks. The use of general-purpose LLMs without grounding mechanisms presents ...

[7] Can They Dixit? Yes they Can! Dixit as a Playground for Multimodal Language Model Capabilities

Nishant Balepur, Dang Nguyen, Dayeon Ki

Main category: cs.CL

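TL;DR: Proposes a game-based evaluation built on Dixit for holistically assessing the capabilities of multimodal large language models (MLMs). Experiments show that win-rate rankings agree closely with mainstream benchmarks and reveal room for improvement in MLM reasoning strategies.

Details Motivation: Existing evaluation methods cannot jointly assess multiple MLM capabilities within a single task, and they rely on subjective, expensive human or model pairwise comparisons that are easily swayed by superficial features. Method: Designs an evaluation framework based on the card game Dixit, which requires models to generate descriptions that are deceptive but not overly misleading, and assesses multimodal understanding and reasoning through objective rules and competition. Result: Dixit win-rate rankings of five MLMs correlate perfectly with mainstream benchmarks; human-versus-model games expose weaknesses in MLM strategy and reasoning. Conclusion: Game-based evaluation can measure the overall capabilities of MLMs effectively and objectively and points to directions for future model improvement. Abstract: Multi-modal large language models (MLMs) are often assessed on static, individual benchmarks, which cannot jointly assess MLM capabilities in a single task, or through human or model pairwise comparisons, which are highly subjective and expensive and allow models to exploit superficial shortcuts (e.g., verbosity) to inflate their win-rates. To overcome these issues, we propose game-based evaluations to holistically assess MLM capabilities. Games require multiple abilities for players to win, are inherently competitive, are governed by fixed, objective rules, and make evaluation more engaging, providing a robust framework to address the aforementioned challenges. We manifest this evaluation specifically through Dixit, a fantasy card game where players must generate captions for a card that trick some, but not all, players into selecting the played card. Our quantitative experiments with five MLMs show Dixit win-rate rankings are perfectly correlated with those on popular MLM benchmarks, while games between human and MLM players in Dixit reveal several differences between agent strategies and areas of improvement for MLM reasoning.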

[8] Large Language Model enabled Mathematical Modeling

Guoyun Zhang

Main category: cs.CL

TL;DR: This study explores the potential of the DeepSeek-R1 large language model for optimization modeling in operations research, using natural language understanding and code generation to bridge the gap between real-world problems and mathematical models.

Details Motivation: Traditional optimization methods depend on domain experts to formulate problems, while existing large models are costly and prone to hallucination, which limits their use in practical settings such as supply chains. Method: Systematically evaluates DeepSeek-R1 on four operations research benchmarks (NL4OPT, IndustryOR, EasyLP, and ComplexOR), applying strategies such as LLM-as-a-Judge, few-shot learning, tool calling, and a multi-agent framework to reduce hallucinations and improve formulation accuracy. Result: Validates the effectiveness of DeepSeek-R1 for operations research problem formulation and proposes a hallucination taxonomy together with mitigation strategies. Conclusion: DeepSeek-R1 is a cost-effective, high-performing alternative that can effectively support optimization modeling tasks in operations research. Abstract: The integration of Large Language Models (LLMs) with optimization modeling offers a promising avenue for advancing decision-making in operations research (OR). Traditional optimization methods, such as linear programming, mixed integer programming, and simulation, depend heavily on domain expertise to translate real-world problems into solvable mathematical models. While solvers like Gurobi and COPT are powerful, expert input remains essential for defining objectives, constraints, and variables. This research investigates the potential of LLMs, specifically the DeepSeek-R1 model, to bridge this formulation gap using natural language understanding and code generation. Although prior models like GPT-4, Claude, and Bard have shown strong performance in NLP and reasoning tasks, their high token costs and tendency toward hallucinations limit real-world applicability in supply chain contexts. In contrast, DeepSeek-R1, a cost-efficient and high-performing model trained with reinforcement learning, presents a viable alternative. Despite its success in benchmarks such as LiveCodeBench and Math-500, its effectiveness in applied OR scenarios remains underexplored. This study systematically evaluates DeepSeek-R1 across four key OR benchmarks: NL4OPT, IndustryOR, EasyLP, and ComplexOR. Our methodology includes baseline assessments, the development of a hallucination taxonomy, and the application of mitigation strategies like LLM-as-a-Judge, Few-shot Learning (FSL), Tool Calling, and a Multi-agent Framework. These techniques aim to reduce hallucinations, enhance formulation accuracy, and better align model outputs with user intent.

[9] Learning from Supervision with Semantic and Episodic Memory: A Reflective Approach to Agent Adaptation

Jackson Hassell, Dan Zhang, Hannah Kim, Tom Mitchell, Estevam Hruschka

Main category: cs.CL

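TL;DR: Proposes a memory-augmented framework that learns target classification functions from labeled data and critiques generated by pretrained large language models, with no parameter updates, significantly improving accuracy and enhancing interpretability.

Details Motivation: Conventional fine-tuning is costly, inflexible, and opaque; the authors seek an efficient, flexible, and interpretable way to learn without parameter updates. Method: Builds a memory-augmented framework that combines instance-level episodic memory (storing LLM-generated critiques) with task-level semantic memory (distilling reusable guidance), and uses a retrieval mechanism to apply these memories to classification at inference time. Result: Across a range of tasks, adding critiques improves accuracy by up to 24.8% over RAG-style baselines that rely only on labels; closed-source and open-source models behave markedly differently on fact-oriented versus preference-based data; a new metric, suggestibility, is introduced to explain how models respond to supervision signals. Conclusion: Memory-driven, reflective learning can effectively improve the adaptability and interpretability of LLM agents and offers a promising path to efficient learning without parameter updates. Abstract: We investigate how agents built on pretrained large language models can learn target classification functions from labeled examples without parameter updates. While conventional approaches like fine-tuning are often costly, inflexible, and opaque, we propose a memory-augmented framework that leverages both labeled data and LLM-generated critiques. Our framework uses episodic memory to store instance-level critiques, which capture specific past experiences, and semantic memory to distill these into reusable, task-level guidance. Across a diverse set of tasks, incorporating critiques yields up to a 24.8 percent accuracy improvement over retrieval-based (RAG-style) baselines that rely only on labels. Through extensive empirical evaluation, we uncover distinct behavioral differences between OpenAI and open-source models, particularly in how they handle fact-oriented versus preference-based data. To interpret how models respond to different representations of supervision encoded in memory, we introduce a novel metric, suggestibility. This helps explain observed behaviors and illuminates how model characteristics and memory strategies jointly shape learning dynamics. Our findings highlight the promise of memory-driven, reflective learning for building more adaptive and interpretable LLM agents.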
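
The split between episodic and semantic memory described above can be sketched as a small data structure that retrieves similar past examples and assembles a classification prompt. Everything here is hypothetical: the `Memory` class, the token-overlap retrieval, and the prompt layout are illustrative stand-ins, since the abstract does not specify the retrieval method or prompt format.

```python
import re
from dataclasses import dataclass, field


def _tokens(text):
    """Lowercased word and number tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass
class Memory:
    # Episodic memory: instance-level (text, label, critique) records.
    episodes: list = field(default_factory=list)
    # Semantic memory: reusable task-level guidance strings.
    guidance: list = field(default_factory=list)

    def add_episode(self, text, label, critique):
        self.episodes.append((text, label, critique))

    def retrieve(self, query, k=2):
        """Return the k episodes sharing the most tokens with the query."""
        q = _tokens(query)
        ranked = sorted(self.episodes,
                        key=lambda ep: len(q & _tokens(ep[0])),
                        reverse=True)
        return ranked[:k]

    def build_prompt(self, query, k=2):
        """Assemble guidance, retrieved episodes, and the query into one prompt."""
        lines = ["Guidance:"] + [f"- {g}" for g in self.guidance]
        lines.append("Similar past examples:")
        for text, label, critique in self.retrieve(query, k):
            lines.append(f"- {text!r} -> {label} (critique: {critique})")
        lines.append(f"Classify: {query!r}")
        return "\n".join(lines)
```

The prompt returned by `build_prompt` would then be sent to a frozen LLM, so adaptation happens entirely through what is stored and retrieved rather than through weight updates.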

[10] LyriCAR: A Difficulty-Aware Curriculum Reinforcement Learning Framework For Controllable Lyric Translation

Le Ren, Xiangjian Zeng, Qingqiang Wu, Ruoxuan Liang

Main category: cs.CL

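TL;DR: Proposes LyriCAR, a novel framework for controllable lyric translation that works in an unsupervised manner and introduces a difficulty-aware curriculum designer and an adaptive curriculum strategy, markedly improving translation quality while cutting training steps by nearly 40%.

Details Motivation: Existing methods rely on hand-crafted rules and sentence-level modeling, which makes it hard to maintain cross-line coherence and global rhyme at the paragraph level and limits generalization. Method: Proposes the LyriCAR framework, which combines a difficulty-aware curriculum designer with an adaptive curriculum strategy to train the model, without supervision, on progressively more complex challenges. Result: Experiments on English-Chinese lyric translation show that LyriCAR achieves state-of-the-art results on both standard translation metrics and multi-dimensional reward scores, with nearly 40% fewer training steps. Conclusion: Through adaptive curriculum learning, LyriCAR improves both the quality and the efficiency of lyric translation and shows good application prospects. Abstract: Lyric translation is a challenging task that requires balancing multiple musical constraints. Existing methods often rely on hand-crafted rules and sentence-level modeling, which restrict their ability to internalize musical-linguistic patterns and to generalize effectively at the paragraph level, where cross-line coherence and global rhyme are crucial. In this work, we propose LyriCAR, a novel framework for controllable lyric translation that operates in a fully unsupervised manner. LyriCAR introduces a difficulty-aware curriculum designer and an adaptive curriculum strategy, ensuring efficient allocation of training resources, accelerating convergence, and improving overall translation quality by guiding the model with increasingly complex challenges. Extensive experiments on the EN-ZH lyric translation task show that LyriCAR achieves state-of-the-art results across both standard translation metrics and multi-dimensional reward scores, surpassing strong baselines. Notably, the adaptive curriculum strategy reduces training steps by nearly 40% while maintaining superior performance. Code, data, and models can be accessed at https://github.com/rle27/LyriCAR.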

[11] LLM-Augmented Symbolic NLU System for More Reliable Continuous Causal Statement Interpretation

Xin Lian, Kenneth D. Forbus

Main category: cs.CL

TL;DR: This paper proposes a hybrid approach that combines large language models (LLMs) with a symbolic natural language understanding (NLU) system: LLMs handle text simplification and knowledge-gap filling, while symbolic NLU produces structured representations that support reasoning. On extracting causal relations and quantities from commonsense science texts, the hybrid outperforms a purely symbolic system.

Details Motivation: LLMs are prone to factual hallucination and inconsistent output, while symbolic NLU systems are highly interpretable but limited in coverage and costly to maintain, so the strengths of both should be combined to improve performance and interpretability. Method: Uses LLMs to rephrase and simplify text, which broadens coverage, and to fill knowledge gaps automatically; combines this with a symbolic NLU system that produces structured semantic representations supporting reasoning and incremental learning. Result: On the task of extracting quantities and causal laws from commonsense science texts, the hybrid method significantly outperforms a pipeline that uses symbolic NLU alone. Conclusion: The hybrid approach effectively combines the broad language coverage of LLMs with the precise structured representations of symbolic NLU, improving performance on complex NLU tasks while remaining interpretable and extensible. Abstract: Despite the broad applicability of large language models (LLMs), their reliance on probabilistic inference makes them vulnerable to errors such as hallucination in generated facts and inconsistent output structure in natural language understanding (NLU) tasks. By contrast, symbolic NLU systems provide interpretable understanding grounded in curated lexicons, semantic resources, and syntactic and semantic interpretation rules. They produce relational representations that can be used for accurate reasoning and planning, as well as incremental debuggable learning. However, symbolic NLU systems tend to be more limited in coverage than LLMs and require scarce knowledge representation and linguistics skills to extend and maintain. This paper explores a hybrid approach that integrates the broad-coverage language processing of LLMs with the symbolic NLU capabilities of producing structured relational representations, in the hope of getting the best of both approaches. We use LLMs for rephrasing and text simplification, to provide broad coverage, and as a source of information to fill in knowledge gaps more automatically. We use symbolic NLU to produce representations that can be used for reasoning and for incremental learning. We evaluate this approach on the task of extracting and interpreting quantities and causal laws from commonsense science texts, along with symbolic- and LLM-only pipelines. Our results suggest that our hybrid method works significantly better than the symbolic-only pipeline.

[12] A Fundamental Algorithm for Dependency Parsing (With Corrections)

Michael A. Covington

Main category: cs.CL

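TL;DR: Presents a fundamental algorithm for parsing natural language sentences into dependency trees; it processes one word at a time and attaches each word as soon as it can be attached, mirroring properties of human parsing.

Details Motivation: To design a dependency parsing algorithm that is closer to the way humans process language and that parses efficiently in real time. Method: Words are processed one at a time, and each word is attached as soon as it can be. The worst-case time complexity is O(n^3), but in real language the worst case arises only for small inputs. Result: The algorithm has complexity comparable to phrase-structure parsing, while in practice behaving more like human language processing. Conclusion: The proposed algorithm shows promise for modeling human language parsing and is suitable for dependency parsing in natural language processing. Abstract: This paper presents a fundamental algorithm for parsing natural language sentences into dependency trees. Unlike phrase-structure (constituency) parsers, this algorithm operates one word at a time, attaching each word as soon as it can be attached, corresponding to properties claimed for the parser in the human brain. Like phrase-structure parsing, its worst-case complexity is $O(n^3)$, but in human language, the worst case occurs only for small $n$.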
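
The word-at-a-time strategy can be sketched as a loop that, for each new word, scans the earlier words and links wherever the grammar permits. This is a simplified illustration rather than the paper's exact formulation: the `can_link(head, dependent)` predicate is a hypothetical stand-in for a grammar, and only the single-head and no-cycle constraints are enforced.

```python
def parse(words, can_link):
    """Return heads, where heads[i] is the index of word i's head, or None."""
    heads = [None] * len(words)

    def creates_cycle(head, dep):
        # Walk up from the proposed head; reaching dep would close a loop.
        node = head
        while node is not None:
            if node == dep:
                return True
            node = heads[node]
        return False

    for j in range(len(words)):
        # Compare the new word j with every earlier word, nearest first.
        for i in range(j - 1, -1, -1):
            if (heads[i] is None and can_link(words[j], words[i])
                    and not creates_cycle(j, i)):
                # Earlier, still headless word i becomes a dependent of j.
                heads[i] = j
            elif (heads[j] is None and can_link(words[i], words[j])
                    and not creates_cycle(i, j)):
                # New word j takes earlier word i as its head.
                heads[j] = i
    return heads
```

For "the dog barks" with a toy grammar that lets "dog" govern "the" and "barks" govern "dog", each word is attached the moment its partner is read, leaving "barks" as the only headless word, that is, the root.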

[13] Beyond MedQA: Towards Real-world Clinical Decision Making in the Era of LLMs

Yunpeng Xiao, Carl Yang, Mark Mai, Xiao Hu, Kai Shu

Main category: cs.CL

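TL;DR: This paper proposes a unified framework that characterizes clinical decision-making tasks along two dimensions, clinical backgrounds and clinical questions, to assess the potential of large language models in medicine more realistically. It also summarizes existing datasets, methods, and evaluation metrics and identifies open challenges.

Details Motivation: Existing medical datasets such as MedQA mostly rely on simplified question answering and do not adequately reflect real clinical decision-making, so an evaluation paradigm closer to actual clinical settings is needed. Method: Proposes a unified framework with two dimensions, clinical backgrounds and clinical questions; categorizes existing datasets and benchmarks along them; reviews training-time and test-time methods; and extends evaluation beyond accuracy to efficiency and explainability. Result: The framework systematically describes the complexity of clinical decision-making tasks, standardizes comparisons between models, and guides the development of more clinically meaningful large language models. Conclusion: The proposed two-dimensional paradigm helps clarify assumptions, promotes fair comparison, and advances LLM research and applications aimed at real clinical scenarios. Abstract: Large language models (LLMs) show promise for clinical use. They are often evaluated using datasets such as MedQA. However, many medical datasets, such as MedQA, rely on simplified question answering (Q&A) that underrepresents real-world clinical decision-making. Motivated by this, we propose a unifying paradigm that characterizes clinical decision-making tasks along two dimensions: Clinical Backgrounds and Clinical Questions. As the background and questions approach the real clinical environment, the difficulty increases. We summarize the settings of existing datasets and benchmarks along these two dimensions. Then we review methods to address clinical decision-making, including training-time and test-time techniques, and summarize when they help. Next, we extend evaluation beyond accuracy to include efficiency and explainability. Finally, we highlight open challenges. Our paradigm clarifies assumptions, standardizes comparisons, and guides the development of clinically meaningful LLMs.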

[14] Forging GEMs: Advancing Greek NLP through Quality-Based Corpus Curation and Specialized Pre-training

Alexandra Apostolopoulou, Konstantinos Kanaris, Athanasios Koursaris, Dimitris Tsakalidis, George Domalis, Ioannis E. Livieris

Main category: cs.CL

TL;DR: This paper presents Greek Embedding Models (GEM), a new generation of embedding models for Modern Greek. Pre-trained on general and legal-domain text with high-quality data preprocessing and a diverse set of modern Transformer architectures (such as ELECTRA, ConvBERT, and ModernBERT), the models significantly outperform existing baselines. The paper also proposes the first bilingual Greek-English embedding models for the legal domain.

Details Motivation: Fragmented research, a narrow range of model architectures, and limited context length have held back natural language processing for Modern Greek, a morphologically rich, moderately resourced language, and the problem is most acute in the legal domain, which requires long-text modeling. Method: Builds large-scale, high-quality Greek corpora for the general and legal domains using rigorous data filtering and preprocessing, pre-trains several modern Transformer architectures on them (including ELECTRA, ConvBERT, and ModernBERT), and proposes the first Greek-English bilingual embedding models for the legal domain. Result: Experiments show that the proposed GEM-RoBERTa and GEM-ConvBERT models significantly outperform existing baselines on downstream tasks, confirming the value of data quality and architectural diversity. Conclusion: GEM models, built on high-quality data and modern architectures, set a new benchmark for Greek language models, especially in the legal domain, and show potential for extension to bilingual settings. Abstract: The advancement of natural language processing for morphologically rich, moderately-resourced languages like Modern Greek is often hindered by a fragmented research landscape, a lack of architectural diversity, and reliance on limited context-length models. This is particularly true in specialized, high-value domains such as law, where existing models are frequently confined to early transformer architectures with a restrictive 512-token window, insufficient for analyzing long legal documents. To address these challenges, this paper presents Greek Embedding Models, a new family of transformer models for the Greek language built upon a foundation of extensive, quality-driven data curation. We detail the construction of several large-scale Greek corpora, emphasizing a rigorous, quality-based filtering and preprocessing methodology to create high-value training datasets from both general-domain and specialized legal sources. On this carefully curated foundation, we pre-train and systematically evaluate a diverse suite of modern architectures that have not previously been applied to the Greek language, such as ELECTRA, ConvBERT, and ModernBERT. Furthermore, we propose the first bilingual Greek-English Embedding Models tailored for the legal domain. Extensive experiments on downstream tasks establish the effectiveness of the proposed approach, with the GEM-RoBERTa and GEM-ConvBERT models significantly outperforming existing baselines.

[15] Improving Transfer Learning for Sequence Labeling Tasks by Adapting Pre-trained Neural Language Models

David Dukić

Main category: cs.CL

TL;DR: This thesis improves transfer learning methods to raise the performance of pre-trained neural language models on sequence labeling tasks, proposing a multi-task model, an architectural modification method, and a generative in-context fine-tuning framework.

Details Motivation: To improve how well pre-trained language models transfer to sequence labeling tasks, addressing limitations in domain transfer and context adaptation in particular. Method: Three improvements are proposed: (1) a multi-task model that incorporates an additional signal; (2) an architectural modification that enables bidirectional information flow across the layers of autoregressive large language models; (3) a sequence labeling framework that combines supervised in-context fine-tuning with response-oriented adaptation strategies. Result: The proposed model, method, and framework significantly improve performance on sequence labeling tasks, confirming the effectiveness of targeted transfer learning paradigms for pre-trained models. Conclusion: Pre-trained neural language models achieve their best performance on sequence labeling tasks when adapted through targeted transfer learning paradigms. Abstract: This doctoral thesis improves transfer learning for sequence labeling tasks by adapting pre-trained neural language models. The proposed improvements in transfer learning involve introducing a multi-task model that incorporates an additional signal, a method based on architectural modifications in autoregressive large language models, and a sequence labeling framework for autoregressive large language models utilizing supervised in-context fine-tuning combined with response-oriented adaptation strategies. The first improvement is given in the context of domain transfer for the event trigger detection task. The domain transfer of the event trigger detection task can be improved by incorporating an additional signal obtained from a domain-independent text processing system into a multi-task model. The second improvement involves modifying the model's architecture. For that purpose, a method is proposed to enable bidirectional information flow across layers of autoregressive large language models. The third improvement utilizes autoregressive large language models as text generators through a generative supervised in-context fine-tuning framework. The proposed model, method, and framework demonstrate that pre-trained neural language models achieve their best performance on sequence labeling tasks when adapted through targeted transfer learning paradigms.

[16] ToolScope: Enhancing LLM Agent Tool Use through Tool Merging and Context-Aware Filtering

Marianne Menglin Liu, Daniel Garcia, Fjona Parllaku, Vikas Upadhyay, Syed Fahad Allam Shah, Dan Roth

Main category: cs.CL

TL;DR: Proposes ToolScope, which improves the accuracy and efficiency of LLM tool selection in complex tasks through auto-corrected tool merging and retrieval of relevant tools.

Details Motivation: To resolve the selection ambiguity that LLMs face when toolsets contain redundant tools, and their inability to handle large toolsets under context limits. Method: Designs ToolScopeMerger, which merges tools with automatic correction to reduce redundancy, and ToolScopeRetriever, which ranks and filters query-relevant tools to compress the toolset. Result: Experiments on three mainstream LLMs and three open-source benchmarks show tool selection accuracy gains of 8.38% to 38.6%. Conclusion: ToolScope effectively improves LLM tool use under limited context and redundant toolsets. Abstract: Large language model (LLM) agents rely on external tools to solve complex tasks, but real-world toolsets often contain redundant tools with overlapping names and descriptions, introducing ambiguity and reducing selection accuracy. LLMs also face strict input context limits, preventing efficient consideration of large toolsets. To address these challenges, we propose ToolScope, which includes: (1) ToolScopeMerger with Auto-Correction to automatically audit and fix tool merges, reducing redundancy, and (2) ToolScopeRetriever to rank and select only the most relevant tools for each query, compressing toolsets to fit within context limits without sacrificing accuracy. Evaluations on three state-of-the-art LLMs and three open-source tool-use benchmarks show gains of 8.38% to 38.6% in tool selection accuracy, demonstrating ToolScope's effectiveness in enhancing LLM tool use.
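
The retrieval step, ranking tools for a query and keeping only the best few, can be sketched with a simple lexical score. The abstract does not say how ToolScopeRetriever scores relevance, so the Jaccard token overlap used here, the `rank_tools` name, and the example tools are all assumptions chosen to keep the sketch dependency-free.

```python
import re


def tokens(text):
    """Lowercased word and number tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_tools(query, tools, k):
    """Return the names of the k tools whose name and description best match the query.

    tools maps a tool name to its description. Relevance is Jaccard overlap,
    a stand-in for whatever retriever a real system would use.
    """
    q = tokens(query)
    scored = []
    for name, description in tools.items():
        t = tokens(name + " " + description)
        union = q | t
        score = len(q & t) / len(union) if union else 0.0
        scored.append((score, name))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]
```

Only the returned tool names and their descriptions would then be placed in the model's prompt, which is how a retriever of this kind keeps a large toolset within the context limit.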

[17] From Facts to Folklore: Evaluating Large Language Models on Bengali Cultural Knowledge

Nafis Chowdhury, Moinul Haque, Anika Ahmed, Nazia Tasnim, Md. Istiak Hossain Shihab, Sajjadur Rahman, Farig Sadeque

Main category: cs.CL

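TL;DR: Introduces BLanCK, a dataset of Bengali cultural knowledge for evaluating multilingual large language models in a low-resource cultural setting. Models perform poorly on cultural knowledge tasks, but providing context substantially improves performance.

Details Motivation: Existing multilingual benchmarks fall short in capturing the nuances of low-resource cultures and cannot adequately assess how well large language models understand non-mainstream cultural contexts. Method: Builds a Bengali cultural knowledge dataset (BLanCK) covering folk traditions, culinary arts, and regional dialects, and runs experiments on several multilingual large language models, comparing their cultural knowledge performance with and without context. Result: Current multilingual models perform well on non-cultural tasks but poorly on cultural knowledge tasks; when context is provided, all models improve substantially. Conclusion: Context-aware architectures and culturally curated training data are essential for improving large language models in low-resource cultural settings. Abstract: Recent progress in NLP research has demonstrated remarkable capabilities of large language models (LLMs) across a wide range of tasks. While recent multilingual benchmarks have advanced cultural evaluation for LLMs, critical gaps remain in capturing the nuances of low-resource cultures. Our work addresses these limitations through a Bengali Language Cultural Knowledge (BLanCK) dataset including folk traditions, culinary arts, and regional dialects. Our investigation of several multilingual language models shows that while these models perform well in non-cultural categories, they struggle significantly with cultural knowledge. Performance improves substantially across all models when context is provided, underscoring the importance of context-aware architectures and culturally curated training data.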

[18] Enhancing Reasoning Skills in Small Persian Medical Language Models Can Outperform Large-Scale Data Training

Mehrdad Ghassabi,Sadra Hakim,Hamidreza Baradaran Kashani,Pedram Rostami

Main category: cs.CL

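TL;DR: Uses Reinforcement Learning with AI Feedback (RLAIF) and Direct Preference Optimization (DPO) to improve the medical question-answering reasoning of a small Persian language model. A Persian medical multiple-choice dataset is built by translation, correct and incorrect chain-of-thought reasoning paths are generated, and the resulting model outperforms its predecessor, which was trained on far more data.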

Details Motivation: Improve the domain-specific reasoning (medical question answering) of small language models in a lower-resource language (Persian), where training data is scarce. Method: RLAIF is used to generate preferred and rejected answer pairs for DPO training; teacher and student models are prompted to produce Chain-of-Thought (CoT) reasoning, yielding a dataset of correct and incorrect reasoning trajectories that is used to train a baseline Persian model. Result: The training dataset contains 2 million tokens in preferred answers and 2.5 million tokens in rejected ones; the trained model outperforms gaokerena-V, which was trained on approximately 57 million tokens, on medical reasoning. Conclusion: Reasoning-focused training (RLAIF plus DPO) can efficiently improve a small model's domain-specific reasoning with limited data, which makes it well suited to low-resource languages. Abstract: Enhancing reasoning capabilities in small language models is critical for specialized applications such as medical question answering, particularly in underrepresented languages like Persian. In this study, we employ Reinforcement Learning with AI Feedback (RLAIF) and Direct Preference Optimization (DPO) to improve the reasoning skills of a general-purpose Persian language model. To achieve this, we translated a multiple-choice medical question-answering dataset into Persian and used RLAIF to generate rejected-preferred answer pairs, which are essential for DPO training. By prompting both teacher and student models to produce Chain-of-Thought (CoT) reasoning responses, we compiled a dataset containing correct and incorrect reasoning trajectories. This dataset, comprising 2 million tokens in preferred answers and 2.5 million tokens in rejected ones, was used to train a baseline model, significantly enhancing its medical reasoning capabilities in Persian. Remarkably, the resulting model outperformed its predecessor, gaokerena-V, which was trained on approximately 57 million tokens, despite leveraging a much smaller dataset. These results highlight the efficiency and effectiveness of reasoning-focused training approaches in developing domain-specific language models with limited data availability.
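For readers unfamiliar with DPO, the per-pair objective that consumes the preferred/rejected answers described above can be sketched as follows. This is the standard DPO loss, not code from the paper; the function name `dpo_loss`, the `beta` default, and the example log-probabilities are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities of the preferred (chosen) and
    rejected answers under the trained policy and a frozen reference model.
    beta is a hypothetical default, not a value reported in the paper.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) computed as softplus(-x) in a numerically stable form
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))

# Made-up numbers: the policy favors the chosen answer more than the reference does.
loss = dpo_loss(-1.0, -5.0, -3.0, -3.0)
```

The loss equals log 2 when the policy and reference agree, and shrinks as the policy shifts probability toward the preferred answer relative to the reference.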

[19] CreativityPrism: A Holistic Benchmark for Large Language Model Creativity

Zhaoyi Joey Hou,Bowei Alvin Zhang,Yining Lu,Bhiman Kumar Baghel,Anneliese Brei,Ximing Lu,Meng Jiang,Faeze Brahman,Snigdha Chaturvedi,Haw-Shiuan Chang,Daniel Khashabi,Xiang Lorraine Li

Main category: cs.CL

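TL;DR: Proposes CreativityPrism, a framework for holistically evaluating LLM creativity that decomposes creativity into quality, novelty, and diversity. An evaluation of 17 state-of-the-art models reveals a performance gap between proprietary and open-source models and differing correlations across creativity dimensions.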

Details Motivation: Existing creativity evaluations define and measure creativity inconsistently across domains and tasks, and no unified framework exists, so a method is needed that evaluates LLM creativity across diverse scenarios. Method: CreativityPrism comprises three dimensions (quality, novelty, diversity), nine tasks, three domains (divergent thinking, creative writing, and logical reasoning), and twenty evaluation metrics; 17 state-of-the-art proprietary and open-source LLMs are evaluated, and performance correlations among metrics and task domains are analyzed. Result: Proprietary models outperform open-source models overall; performance is highly correlated across tasks within a domain and less so across domains; quality and diversity metrics correlate strongly, whereas novelty correlates only weakly with either. Conclusion: Performance does not generalize strongly across creativity dimensions and tasks, which supports the need for a holistic framework for evaluating LLM creativity. Abstract: Creativity is often seen as a hallmark of human intelligence. While large language models (LLMs) are increasingly perceived as producing creative text, there is still no holistic framework to evaluate their creativity across diverse scenarios. Existing evaluation methods remain fragmented, with dramatic variation across domains and tasks, largely due to differing definitions and measurements of creativity. Inspired by the hypothesis that creativity is not one fixed idea, we propose CreativityPrism, an evaluation analysis framework that decomposes creativity into three dimensions: quality, novelty, and diversity. CreativityPrism incorporates nine tasks, three domains, i.e., divergent thinking, creative writing, and logical reasoning, and twenty evaluation metrics, which measure each dimension in task-specific, unique ways. We evaluate 17 state-of-the-art (SoTA) proprietary and open-sourced LLMs on CreativityPrism and analyze the performance correlations among different metrics and task domains. Our results reveal a notable gap between proprietary and open-source models. Overall, model performance tends to be highly correlated across tasks within the same domain and less so across different domains. Among evaluation dimensions, diversity and quality metrics show strong correlations - models that perform well on one often excel on the other - whereas novelty exhibits much weaker correlation with either. These findings support our hypothesis that strong performance in one creativity task or dimension does not necessarily generalize to others, underscoring the need for a holistic evaluation of LLM creativity.

[20] Leveraging the Power of Large Language Models in Entity Linking via Adaptive Routing and Targeted Reasoning

Yajie Li,Albert Galimov,Mitra Datta Ganapaneni,Pujitha Thejaswi,De Meng,Priyanshu Kumar,Saloni Potdar

Main category: cs.CL

TL;DR: ARTER is an efficient entity linking method that uses adaptive routing and selective reasoning to improve performance while reducing LLM usage.

Details Motivation: Traditional entity linking depends on large annotated datasets and extensive fine-tuning, while existing few-shot methods are computationally expensive because they rely heavily on LLM reasoning. Method: ARTER combines candidate generation, context-based scoring, adaptive routing, and selective reasoning; it uses several signals to split mentions into easy and hard cases, handled by a lightweight linker and an LLM respectively. Result: ARTER outperforms ReFinED by up to +4.47%, with an average gain of +2.53% on 5 out of 6 datasets, while using half as many LLM tokens as pipelines that apply LLM reasoning to every mention. Conclusion: ARTER maintains high performance while markedly improving computational efficiency, making it an efficient and practical few-shot approach to entity linking. Abstract: Entity Linking (EL) has traditionally relied on large annotated datasets and extensive model fine-tuning. While recent few-shot methods leverage large language models (LLMs) through prompting to reduce training requirements, they often suffer from inefficiencies due to expensive LLM-based reasoning. ARTER (Adaptive Routing and Targeted Entity Reasoning) presents a structured pipeline that achieves high performance without deep fine-tuning by strategically combining candidate generation, context-based scoring, adaptive routing, and selective reasoning. ARTER computes a small set of complementary signals (both embedding and LLM-based) over the retrieved candidates to categorize contextual mentions into easy and hard cases. The cases are then handled by a low-computational entity linker (e.g., ReFinED) and more expensive targeted LLM-based reasoning respectively. On standard benchmarks, ARTER outperforms ReFinED by up to +4.47%, with an average gain of +2.53% on 5 out of 6 datasets, and performs comparably to pipelines using LLM-based reasoning for all mentions, while being twice as efficient in terms of the number of LLM tokens.
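The easy/hard routing idea in the abstract above can be illustrated with a small sketch. The abstract does not specify ARTER's actual signals or thresholds, so the `Mention` class, the two stand-in signals (top candidate score and its margin over the runner-up), and both threshold values are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Mention:
    text: str
    candidate_scores: List[float]  # hypothetical context-based scores, one per candidate

def route(mention, top_threshold=0.8, margin_threshold=0.2):
    """Return "easy" (cheap linker) or "hard" (LLM reasoning) for a mention.

    Stand-in signals, not ARTER's: a mention is easy when its best candidate
    scores high and clearly beats the runner-up.
    """
    scores = sorted(mention.candidate_scores, reverse=True)
    if not scores:
        return "hard"  # no candidates retrieved: defer to the expensive path
    top = scores[0]
    margin = top - scores[1] if len(scores) > 1 else top
    if top >= top_threshold and margin >= margin_threshold:
        return "easy"
    return "hard"

mentions = [Mention("Paris", [0.95, 0.30]), Mention("Jordan", [0.55, 0.50])]
routes = [route(m) for m in mentions]
```

Only the mentions routed to "hard" would incur LLM tokens, which is where the efficiency gain described in the abstract would come from.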

[21] BoundRL: Efficient Structured Text Segmentation through Reinforced Boundary Generation

Haoyuan Li,Zhengyuan Shen,Sullam Jeoung,Yueyan Chen,Jiayu Li,Qi Zhu,Shuai Wang,Vassilis Ioannidis,Huzefa Rangwala

Main category: cs.CL

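TL;DR: Proposes BoundRL, an efficient method for segmenting long structured texts and predicting segment labels, which uses reinforcement learning with verifiable rewards to substantially improve small-model performance.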

Details Motivation: Existing text segmentation methods struggle with complex structured texts that contain tables, code, and similar elements, so a more effective method is needed for semantic segmentation. Method: The model generates only the starting tokens of each segment and reconstructs the content from the original text; training uses reinforcement learning with verifiable rewards (RLVR) together with intermediate candidate generation. Result: A small 1.7B-parameter model outperforms few-shot prompting of much larger models on complex LLM prompts, and both RLVR and intermediate candidates improve performance and generalization. Conclusion: Through efficient modeling and its training strategy, BoundRL achieves strong performance at low inference cost on complex structured text segmentation. Abstract: As structured texts become increasingly complex across diverse domains -- from technical reports to generative AI prompts -- the need for text segmentation into semantically meaningful components becomes critical. Such texts often contain elements beyond plain language, including tables, code snippets, and placeholders, which conventional sentence- or paragraph-level segmentation methods cannot handle effectively. To address this challenge, we propose BoundRL, a novel and efficient approach that jointly performs token-level text segmentation and label prediction for long structured texts. Instead of generating complete contents for each segment, it generates only a sequence of starting tokens and reconstructs the complete contents by locating these tokens within the original texts, thereby reducing inference costs by orders of magnitude and minimizing hallucination. To adapt the model for the output format, BoundRL performs reinforcement learning with verifiable rewards (RLVR) with a specifically designed reward that jointly optimizes document reconstruction fidelity and semantic alignment. To mitigate entropy collapse, it further constructs intermediate candidates by systematically perturbing a fraction of generated sequences of segments to create stepping stones toward higher-quality solutions. To demonstrate BoundRL's effectiveness on particularly challenging structured texts, we focus evaluation on complex prompts used for LLM applications. Experiments show that BoundRL enables small language models (1.7B parameters) to outperform few-shot prompting of much larger models. Moreover, RLVR with our designed reward yields significant improvements over supervised fine-tuning, and incorporating intermediate candidates further improves both performance and generalization.
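The reconstruction step described in the abstract (generate only each segment's starting tokens, then recover full segments by locating them in the original text) can be sketched as below. This is a simplified illustration on plain strings, not BoundRL's implementation; the function name, the `(label, snippet)` output format, and the example prompt are assumptions.

```python
def reconstruct_segments(text, boundaries):
    """Rebuild labeled segments from (label, starting_snippet) pairs.

    boundaries is assumed to be in document order. Each snippet is located
    left to right in text; a segment runs from its snippet to the start of
    the next located snippet. Snippets that cannot be found are skipped,
    and any text before the first located snippet is not assigned.
    """
    starts = []
    cursor = 0
    for label, snippet in boundaries:
        pos = text.find(snippet, cursor)
        if pos == -1:
            continue  # generated start not present in the source text
        starts.append((pos, label))
        cursor = pos + len(snippet)
    segments = []
    for i, (pos, label) in enumerate(starts):
        end = starts[i + 1][0] if i + 1 < len(starts) else len(text)
        segments.append((label, text[pos:end]))
    return segments

prompt = "You are a helpful assistant. Answer in JSON. Question: {query}"
generated = [("role", "You are"), ("format", "Answer in"), ("input", "Question:")]
segments = reconstruct_segments(prompt, generated)
```

Because segment contents are copied from the source rather than generated, the output length scales with the number of segments instead of the document length, and copied text cannot be hallucinated.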

[22] Are Stereotypes Leading LLMs' Zero-Shot Stance Detection ?

Anthony Dubreuil,Antoine Gourru,Christine Largeron,Amine Trabelsi

Main category: cs.CL

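TL;DR: Studies bias in LLM stance detection and finds that models exhibit significant stereotypes tied to attributes such as text complexity and the dialect of specific groups.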

Details Motivation: LLMs inherit social biases from their pretraining data, yet the evaluation of such bias in stance detection has received little attention; this paper examines biased behavior in zero-shot stance detection. Method: Posts in existing stance detection datasets are automatically annotated with two attributes, the dialect or vernacular of a specific group and text complexity/readability, and the influence of these attributes on the models' stance decisions is analyzed. Result: LLMs show significant bias, for example incorrectly associating pro-marijuana views with low text complexity and African American dialect with opposition to Donald Trump. Conclusion: LLMs exhibit clear social stereotypes in stance detection, which calls for deeper evaluation and correction of fairness in sensitive NLP tasks of this kind. Abstract: Large Language Models inherit stereotypes from their pretraining data, leading to biased behavior toward certain social groups in many Natural Language Processing tasks, such as hateful speech detection or sentiment analysis. Surprisingly, the evaluation of this kind of bias in stance detection methods has been largely overlooked by the community. Stance Detection involves labeling a statement as being against, in favor, or neutral towards a specific target and is among the most sensitive NLP tasks, as it often relates to political leanings. In this paper, we focus on the bias of Large Language Models when performing stance detection in a zero-shot setting. We automatically annotate posts in pre-existing stance detection datasets with two attributes: dialect or vernacular of a specific group and text complexity/readability, to investigate whether these attributes influence the model's stance detection decisions. Our results show that LLMs exhibit significant stereotypes in stance detection tasks, such as incorrectly associating pro-marijuana views with low text complexity and African American dialect with opposition to Donald Trump.

[23] DeepWideSearch: Benchmarking Depth and Width in Agentic Information Seeking

Tian Lan,Bin Zhu,Qianghuai Jia,Junyang Ren,Haijun Li,Longyue Wang,Zhao Xu,Weihua Luo,Kaifu Zhang

Main category: cs.CL

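TL;DR: Introduces DeepWideSearch, the first benchmark designed to evaluate whether agents can combine deep reasoning with wide-scale information collection in information seeking. Even state-of-the-art agents perform very poorly on it, highlighting the limits of current techniques.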

Details Motivation: Current search agents cannot perform multi-hop deep reasoning and large-scale information collection at the same time, which real-world applications such as comprehensive market analysis and business development require. Method: The authors build DeepWideSearch, a benchmark of 220 questions spanning 15 domains, by converting existing datasets into large-volume tasks that require multi-hop reasoning. Result: Even state-of-the-art agents achieve an average success rate of only 2.39% on DeepWideSearch; error analysis reveals four failure modes: lack of reflection, overreliance on internal knowledge, insufficient retrieval, and context overflow. Conclusion: DeepWideSearch provides a new standard for evaluating information-seeking agents on both depth and width, exposes key weaknesses in current agent architectures, and is intended to spur research on more capable and robust agents. Abstract: Current search agents fundamentally lack the ability to simultaneously perform \textit{deep} reasoning over multi-hop retrieval and \textit{wide}-scale information collection, a critical deficiency for real-world applications like comprehensive market analysis and business development. To bridge this gap, we introduce DeepWideSearch, the first benchmark explicitly designed to evaluate agents' ability to integrate depth and width in information seeking. In DeepWideSearch, agents must process a large volume of data, each requiring deep reasoning over multi-hop retrieval paths. Specifically, we propose two methods to convert established datasets, resulting in a curated collection of 220 questions spanning 15 diverse domains. Extensive experiments demonstrate that even state-of-the-art agents achieve only 2.39% average success rate on DeepWideSearch, highlighting the substantial challenge of integrating depth and width search in information-seeking tasks. Furthermore, our error analysis reveals four failure modes: lack of reflection, overreliance on internal knowledge, insufficient retrieval, and context overflow, exposing key limitations in current agent architectures. We publicly release DeepWideSearch to catalyze future research on more capable and robust information-seeking agents.

[24] Mixture-of-Minds: Multi-Agent Reinforcement Learning for Table Understanding

Yuhang Zhou,Mingrui Zhang,Ke Li,Mingyi Wang,Qiao Liu,Qifei Wang,Jiayi Liu,Fei Liu,Serena Li,Weiwi Li,Mingze Gao,Abhishek Kumar,Xiangjun Fan,Zhuokai Zhao,Lizhu Zhang

Main category: cs.CL

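TL;DR: Proposes Mixture-of-Minds, a framework that combines multi-agent division of labor with reinforcement learning to improve table understanding and reasoning; it reaches 62.13% on TableBench, surpassing OpenAI-o4-mini-high.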

Details Motivation: Existing approaches to table reasoning are limited: fine-tuning methods are prone to arithmetic errors and hallucination, while tool-based methods lack semantic understanding, so a new approach is needed that combines robust reasoning with reliable table processing. Method: Mixture-of-Minds is a multi-agent framework that splits the task into planning, coding, and answering roles and uses code execution for precise table manipulation; Monte Carlo Tree Search rollouts generate pseudo-gold trajectories, and the agents are optimized with reinforcement learning in a self-improvement loop. Result: The method reaches 62.13% on TableBench, surpassing OpenAI-o4-mini-high. Conclusion: Structured multi-agent workflows combined with reinforcement learning can effectively improve table reasoning and show promise for tasks that combine complex semantics with precise manipulation. Abstract: Understanding and reasoning over tables is a critical capability for many real-world applications. Large language models (LLMs) have shown promise on this task, but current approaches remain limited. Fine-tuning based methods strengthen language reasoning, yet they are prone to arithmetic errors and hallucination. In contrast, tool-based methods enable precise table manipulation but rely on rigid schemas and lack semantic understanding. These complementary drawbacks highlight the need for approaches that integrate robust reasoning with reliable table processing. In this work, we propose Mixture-of-Minds, a multi-agent framework that decomposes table reasoning into three specialized roles: planning, coding, and answering. This design enables each agent to focus on a specific aspect of the task while leveraging code execution for precise table manipulation. Building on this workflow, we introduce a self-improvement training framework that employs Monte Carlo Tree Search (MCTS) rollouts to generate pseudo-gold trajectories and optimize agents with reinforcement learning (RL). Extensive experiments show that Mixture-of-Minds delivers substantial gains, reaching 62.13% on TableBench and surpassing OpenAI-o4-mini-high. These results demonstrate the promise of combining structured multi-agent workflows with RL to advance table understanding.

[25] Stuck in the Matrix: Probing Spatial Reasoning in Large Language Models

Maggie Bai,Ava Kim Cohen,Eleanor Koss,Charlie Lichtenbaum

Main category: cs.CL

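TL;DR: Studies the spatial reasoning of large language models (LLMs) over textual input, using five tasks that probe spatial understanding and multi-step problem solving in grid environments. Models do moderately well on small instances, but accuracy falls sharply as complexity grows (by 42.7% on average and by as much as 84%), pointing to a lack of robust spatial representations.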

Details Motivation: Probe the spatial reasoning ability of LLMs beyond language and identify their limitations on structured spatial tasks. Method: Five spatial tasks of increasing complexity (quadrant identification, geometric transformations, distance evaluation, word searches, and tile sliding) are used to test several LLMs on grids of different sizes. Result: LLMs achieve moderate success on small instances, but performance drops rapidly as grid size and task complexity increase: accuracy falls by 42.7% on average and by as much as 84%, and every test that began above 50% accuracy lost at least 48%. Conclusion: Current LLMs show clear weaknesses when spatial reasoning tasks are scaled up and lack robust spatial representations, indicating a substantial gap between linguistic and spatial reasoning that future work at the intersection of language and geometry should address. Abstract: This paper explores the spatial reasoning capability of large language models (LLMs) over textual input through a suite of five tasks aimed at probing their spatial understanding and computational abilities. The models were tested on both fundamental spatial reasoning and multi-step problem-solving within structured grid-based environments using tasks such as quadrant identification, geometric transformations, distance evaluation, word searches, and tile sliding. Each task was scaled in complexity through increasing grid dimensions, requiring models to extend beyond simple pattern recognition into abstract spatial reasoning. Our results reveal that while LLMs demonstrate moderate success in all tasks with small complexity and size, performance drops off rapidly as scale increases, with an average loss in accuracy of 42.7%, and reaching as high as 84%. Every test that began with over 50% accuracy showed a loss of at least 48%, illustrating the consistent nature of the deterioration. Furthermore, their struggles with scaling complexity hint at a lack of robust spatial representations in their underlying architectures. This paper underscores the gap between linguistic and spatial reasoning in LLMs, offering insights into their current limitations, and laying the groundwork for future integrative benchmarks at the intersection of language and geometry.

[26] Decoding-Free Sampling Strategies for LLM Marginalization

David Pohl,Marco Cognetta,Junyoung Lee,Naoaki Okazaki

Main category: cs.CL

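TL;DR: Examines a limitation of evaluating language models under subword tokenization and proposes decoding-free sampling strategies for approximate marginalization, greatly reducing computational cost.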

Details Motivation: Because a text can be tokenized into subwords in many ways, evaluating only the probability of the single tokenization that was produced ignores the other representations; marginalizing over all possible tokenizations gives a more accurate evaluation. Method: The paper proposes and studies several decoding-free sampling strategies that require no generation from the language model and instead approximate the total probability of a text using cheap, model- and tokenizer-agnostic sampling. Result: On several open models, decoding-free sampling yields sufficiently accurate marginal estimates at a small fraction of the runtime cost, and it is demonstrated on a set of downstream inference tasks. Conclusion: Decoding-free sampling offers an efficient and practical way to approximate marginalization for subword-level language models, at much lower cost than approximations based on sampling by generation. Abstract: Modern language models operate on subword-tokenized text in order to make a trade-off between model size, inference speed, and vocabulary coverage. A side effect of this is that, during inference, models are evaluated by measuring the probability of only the specific tokenization produced as the output, despite there being many possible ways to represent the same text with a subword vocabulary. Recent studies have argued instead for evaluating LLMs by marginalization - the probability mass of all tokenizations of a given text. Marginalization is difficult due to the number of possible tokenizations of a text, so often approximate marginalization is done via sampling. However, a downside of sampling is that an expensive generation step must be performed by the LLM for each sample, which limits the number of samples that can be acquired given a runtime budget, and therefore also the accuracy of the approximation. Since computing the probability of a sequence given the tokenization is relatively cheap compared to actually generating it, we investigate sampling strategies that are decoding-free - they require no generation from the LLM, instead relying entirely on extremely cheap sampling strategies that are model and tokenizer agnostic. We investigate the approximation quality and speed of decoding-free sampling strategies for a number of open models to find that they provide sufficiently accurate marginal estimates at a small fraction of the runtime cost and demonstrate their use on a set of downstream inference tasks.
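To make the marginalization problem above concrete, the sketch below counts the tokenizations of a string under a toy subword vocabulary and draws one uniformly at random without consulting any language model. The uniform sampler is only one conceivable decoding-free strategy and is an assumption here, not necessarily one of the strategies studied in the paper; the vocabulary and function names are likewise hypothetical.

```python
import random

def tokenization_counts(text, vocab):
    """ways[i] = number of ways to split text[i:] into tokens from vocab."""
    n = len(text)
    max_len = max(len(tok) for tok in vocab)
    ways = [0] * (n + 1)
    ways[n] = 1  # the empty suffix has exactly one (empty) tokenization
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, min(n, i + max_len) + 1):
            if text[i:j] in vocab:
                ways[i] += ways[j]
    return ways

def sample_tokenization(text, vocab, rng):
    """Draw one tokenization uniformly at random, with no model calls."""
    ways = tokenization_counts(text, vocab)
    if ways[0] == 0:
        raise ValueError("text cannot be tokenized with this vocabulary")
    max_len = max(len(tok) for tok in vocab)
    tokens, i = [], 0
    while i < len(text):
        ends = [j for j in range(i + 1, min(len(text), i + max_len) + 1)
                if text[i:j] in vocab and ways[j] > 0]
        # Weight each next token by how many completions it leaves open.
        j = rng.choices(ends, weights=[ways[e] for e in ends])[0]
        tokens.append(text[i:j])
        i = j
    return tokens

TOY_VOCAB = {"a", "b", "c", "ab", "bc", "abc"}  # toy vocabulary for illustration
total = tokenization_counts("abc", TOY_VOCAB)[0]
sample = sample_tokenization("abc", TOY_VOCAB, random.Random(0))
```

Each sampled tokenization would then be scored by the language model with a single forward pass; how those scores are combined into a marginal estimate is left out of this sketch.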

[27] Tri-Modal Severity Fused Diagnosis across Depression and Post-traumatic Stress Disorders

Filippo Cenacchi,Deborah Richards,Longbing Cao

Main category: cs.CL

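TL;DR: Proposes a unified tri-modal affective severity framework that assesses the severity of both depression and PTSD. It fuses standardized features from text, audio, and facial signals to produce graded cross-disorder diagnoses with interpretable decision support.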

Details Motivation: Depression and PTSD often co-occur with intertwined symptoms, and conventional binary, single-disorder automated assessment does not meet clinical needs; a cross-disorder model that outputs graded severity with explanations is required. Method: A tri-modal approach synchronizes and fuses interview text (sentence-level transformer embeddings), audio (log-Mel statistics with deltas), and facial signals (action units, gaze, head and pose descriptors); a calibrated late-fusion classifier outputs per-disorder severity probabilities and feature-level attributions. Result: Under stratified cross-validation on DAIC-derived corpora, the fused model matches the strongest unimodal baseline on accuracy and weighted F1 while improving decision-curve utility and robustness to missing or noisy modalities; for PTSD it reduces regression error and improves class concordance; errors cluster between adjacent severity levels and extreme classes are identified reliably; ablations show that text contributes most to depression severity, while audio and facial cues are critical for PTSD. Conclusion: The tri-modal fusion approach provides reproducible severity assessment for co-occurring disorders and interpretability aimed at clinical decision support, suggesting practical potential. Abstract: Depression and post traumatic stress disorder (PTSD) often co-occur with connected symptoms, complicating automated assessment, which is often binary and disorder specific. Clinically useful diagnosis needs severity aware cross disorder estimates and decision support explanations. Our unified tri modal affective severity framework synchronizes and fuses interview text with sentence level transformer embeddings, audio with log Mel statistics with deltas, and facial signals with action units, gaze, head and pose descriptors to output graded severities for diagnosing both depression (PHQ-8; 5 classes) and PTSD (3 classes). Standardized features are fused via a calibrated late fusion classifier, yielding per disorder probabilities and feature-level attributions. This severity aware tri-modal affective fusion approach is demoed on multi disorder concurrent depression and PTSD assessment. Stratified cross validation on DAIC derived corpora outperforms unimodal/ablation baselines. The fused model matches the strongest unimodal baseline on accuracy and weighted F1, while improving decision curve utility and robustness under noisy or missing modalities. For PTSD specifically, fusion reduces regression error and improves class concordance. Errors cluster between adjacent severities; extreme classes are identified reliably. Ablations show text contributes most to depression severity, audio and facial cues are critical for PTSD, whereas attributions align with linguistic and behavioral markers. Our approach offers reproducible evaluation and clinician in the loop support for affective clinical decision making.

[28] Context-level Language Modeling by Learning Predictive Context Embeddings

Beiya Dai,Yuliang Liu,Daozheng Xue,Qipeng Guo,Kai Chen,Xinbing Wang

Main category: cs.CL

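TL;DR: Proposes ContextLM, a framework that augments standard pretraining with a next-context prediction objective, improving both perplexity and downstream task performance.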

Details Motivation: Conventional next-token prediction limits a model's ability to capture higher-level semantic structure and long-range contextual relationships. Method: ContextLM adds a next-context prediction objective to standard pretraining so that the model learns predictive representations of multi-token contexts. Result: Experiments on the GPT2 and Pythia model families show consistent improvements in perplexity and downstream task performance. Conclusion: Next-context prediction offers a scalable and efficient path to stronger language modeling while remaining compatible with the standard autoregressive evaluation paradigm. Abstract: Next-token prediction (NTP) is the cornerstone of modern large language model (LLM) pretraining, driving their unprecedented capabilities in text generation, reasoning, and instruction following. However, the token-level prediction limits the model's capacity to capture higher-level semantic structures and long-range contextual relationships. To overcome this limitation, we introduce \textbf{ContextLM}, a framework that augments standard pretraining with an inherent \textbf{next-context prediction} objective. This mechanism trains the model to learn predictive representations of multi-token contexts, leveraging error signals derived from future token chunks. Crucially, ContextLM achieves this enhancement while remaining fully compatible with the standard autoregressive, token-by-token evaluation paradigm (e.g., perplexity). Extensive experiments on the GPT2 and Pythia model families, scaled up to $1.5$B parameters, show that ContextLM delivers consistent improvements in both perplexity and downstream task performance. Our analysis indicates that next-context prediction provides a scalable and efficient pathway to stronger language modeling, yielding better long-range coherence and more effective attention allocation with minimal computational overhead.

[29] Citation Failure: Definition, Analysis and Efficient Mitigation

Jan Buchmann,Iryna Gurevych

Main category: cs.CL

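TL;DR: Introduces the CITECONTROL benchmark and the CITENTION framework to address citation failure in LLM-based RAG systems, analyzing the relation between response and evidence to improve citation quality.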

Details Motivation: Prior work does not separate citation failure from response failure, which makes citation quality hard to evaluate and improve, so citation failure needs to be studied on its own. Method: A two-step approach: first, the CITECONTROL benchmark systematically varies the relation between response and evidence to study its effect on citation quality; second, the CITENTION framework integrates generative, attention-based, and retrieval-based citation methods. Result: Citation failures increase with relational complexity; CITENTION yields substantial citation improvements on CITECONTROL and in transfer settings. Conclusion: Separating citation failure from response failure and combining multiple citation methods can effectively improve the completeness and accuracy of LLM citations in RAG systems. Abstract: Citations from LLM-based RAG systems are supposed to simplify response verification. However, this does not hold for citation failure, when a model generates a helpful response, but fails to cite complete evidence. In contrast to previous work, we propose to disentangle this from response failure, where the response itself is flawed, and citing complete evidence is impossible. To address citation failure, this work follows a two-step approach: (1) We study when citation failure occurs and (2) how it can be mitigated. For step 1, we extend prior work by investigating how the relation between response and evidence affects citation quality. We introduce CITECONTROL, a benchmark that systematically varies this relation to analyze failure modes. Experiments show that failures increase with relational complexity and suggest that combining citation methods could improve performance, motivating step 2. To improve LLM citation efficiently, we propose CITENTION, a framework integrating generative, attention-based, and retrieval-based methods. Results demonstrate substantial citation improvements on CITECONTROL and in transfer settings. We make our data and code publicly available.

[30] Exploring Generative Process Reward Modeling for Semi-Structured Data: A Case Study of Table Question Answering

Lei Tang,Wei Zhou,Mohsen Mesgar

Main category: cs.CL

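TL;DR: Presents the first systematic study of process reward models (PRMs) for table question answering (TQA); PRMs that combine textual and code verification help solution selection but generalize poorly to out-of-domain data.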

Details Motivation: Explore whether process reward models (PRMs) apply to tasks involving semi-structured data such as table question answering (TQA), which poses challenges including abundant irrelevant information, loosely connected reasoning steps, and domain-specific reasoning. Method: State-of-the-art generative PRMs are evaluated on TQA from both the answer and the step perspective, including PRMs that combine textual and code verification. Result: PRMs combining textual and code verification can aid solution selection but generalize weakly to out-of-domain data; step-level verification performance correlates only weakly with final answer accuracy, possibly because of weak step dependencies and loose causal links. Conclusion: Current PRMs have limitations on TQA, and more robust, process-aware verifiers are needed; the findings offer guidance for building them. Abstract: Process reward models (PRMs) improve complex reasoning in large language models (LLMs) by grading candidate solutions step-by-step and selecting answers via aggregated step scores. While effective in domains such as mathematics, their applicability to tasks involving semi-structured data, like table question answering (TQA), remains unexplored. TQA poses unique challenges for PRMs, including abundant irrelevant information, loosely connected reasoning steps, and domain-specific reasoning. This work presents the first systematic study of PRMs for TQA. We evaluate state-of-the-art generative PRMs on TQA from both answer and step perspectives. Results show that PRMs that combine textual and code verification can aid solution selection but struggle to generalize to out-of-domain data. Analysis reveals a weak correlation between performance in step-level verification and answer accuracy, possibly stemming from weak step dependencies and loose causal links. Our findings highlight limitations of current PRMs on TQA and offer valuable insights for building more robust, process-aware verifiers.

[31] Teaching Language Models to Reason with Tools

Chengpeng Li,Zhengyang Tang,Ziniu Li,Mingfeng Xue,Keqin Bao,Tian Ding,Ruoyu Sun,Benyou Wang,Xiang Wang,Junyang Lin,Dayiheng Liu

Main category: cs.CL

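TL;DR: Proposes CoRT (Code-Optimized Reasoning Training), a framework that uses Hint-Engineering to generate high-quality code-integrated reasoning data, improving how efficiently and accurately large reasoning models use computational tools on mathematical tasks.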

Details Motivation: Large reasoning models are often inefficient or inaccurate on complex mathematical operations, and their internal probabilistic reasoning conflicts with the deterministic results of external tools such as Code Interpreters, leading to unproductive deliberation; the coordination between model and tool needs to be addressed. Method: CoRT uses a Hint-Engineering strategy that injects hints into reasoning paths to synthesize high-quality training data, then combines supervised fine-tuning, rejection sampling, and reinforcement learning to refine the multi-round interleaving of internal reasoning and external Code Interpreter use. Result: Across five mathematical reasoning datasets, CoRT gives absolute improvements of 4% and 8% for the 32B and 1.5B models respectively, and improves efficiency: token usage falls by about 30% for the 32B model and about 50% for the 1.5B model. Conclusion: CoRT effectively improves how large reasoning models use computational tools, outperforming pure natural language reasoning in both accuracy and efficiency, and offers a workable approach to model-tool coordination. Abstract: Large reasoning models (LRMs) like OpenAI-o1 have shown impressive capabilities in natural language reasoning. However, these models frequently demonstrate inefficiencies or inaccuracies when tackling complex mathematical operations. While integrating computational tools such as Code Interpreters (CIs) offers a promising solution, it introduces a critical challenge: a conflict between the model's internal, probabilistic reasoning and the external, deterministic knowledge provided by the CI, which often leads models to unproductive deliberation. To overcome this, we introduce CoRT (Code-Optimized Reasoning Training), a post-training framework designed to teach LRMs to effectively utilize CIs. We propose \emph{Hint-Engineering}, a new data synthesis strategy that strategically injects diverse hints at optimal points within reasoning paths. This approach generates high-quality, code-integrated reasoning data specifically tailored to optimize LRM-CI interaction. Using this method, we have synthesized 30 high-quality samples to post-train models ranging from 1.5B to 32B parameters through supervised fine-tuning. CoRT further refines the multi-round interleaving of external CI usage and internal thinking by employing rejection sampling and reinforcement learning. Our experimental evaluations demonstrate CoRT's effectiveness, yielding absolute improvements of 4% and 8% on DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Qwen-1.5B, respectively, across five challenging mathematical reasoning datasets. Moreover, CoRT significantly enhances efficiency, reducing token usage by approximately 30% for the 32B model and 50% for the 1.5B model compared to pure natural language reasoning baselines. The models and code are available at: https://github.com/ChengpengLi1003/CoRT.

[32] Evaluating Latent Knowledge of Public Tabular Datasets in Large Language Models

Matteo Silvestri,Flavio Giorgi,Fabrizio Silvestri,Gabriele Tolomei

Main category: cs.CL

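TL;DR: Finds that LLM performance on tabular reasoning tasks may stem from memorization of datasets with strong semantic cues rather than from genuine generalization.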

Details Motivation: Investigate whether dataset contamination affects LLM evaluation on structured-data reasoning, in particular prior knowledge of widely used benchmark datasets. Method: Controlled probing experiments compare model performance when semantic cues (such as column names and category meanings) are kept versus removed. Result: Models perform well when the data contains strong semantic cues; once the cues are removed or randomized, performance drops sharply to near-random levels. Conclusion: The apparent competence of LLMs on tabular reasoning is partly attributable to memorization of public datasets rather than genuine reasoning, and evaluation methods should be improved to separate semantic leakage from authentic reasoning ability. Abstract: Large Language Models (LLMs) are increasingly evaluated on their ability to reason over structured data, yet such assessments often overlook a crucial confound: dataset contamination. In this work, we investigate whether LLMs exhibit prior knowledge of widely used tabular benchmarks such as Adult Income, Titanic, and others. Through a series of controlled probing experiments, we reveal that contamination effects emerge exclusively for datasets containing strong semantic cues, for instance, meaningful column names or interpretable value categories. In contrast, when such cues are removed or randomized, performance sharply declines to near-random levels. These findings suggest that LLMs' apparent competence on tabular reasoning tasks may, in part, reflect memorization of publicly available datasets rather than genuine generalization. We discuss implications for evaluation protocols and propose strategies to disentangle semantic leakage from authentic reasoning ability in future LLM assessments.
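One way to remove the semantic cues mentioned in the abstract above is to replace column names with generic identifiers and categorical values with opaque codes before prompting a model. The abstract does not describe the authors' exact procedure, so the sketch below is a hypothetical illustration; the function name, the naming scheme, and the Titanic-like rows are assumptions.

```python
def strip_semantic_cues(rows):
    """Replace column names and string-valued categories with opaque codes.

    rows is a non-empty list of dicts sharing the same keys. Numeric values
    are left unchanged; each distinct string in a column gets a code such as
    "v0", "v1", assigned in order of first appearance.
    """
    columns = list(rows[0].keys())
    column_codes = {name: f"col_{i}" for i, name in enumerate(columns)}
    value_codes = {name: {} for name in columns}
    anonymized = []
    for row in rows:
        new_row = {}
        for name in columns:
            value = row[name]
            if isinstance(value, str):
                codes = value_codes[name]
                value = codes.setdefault(value, f"v{len(codes)}")
            new_row[column_codes[name]] = value
        anonymized.append(new_row)
    return anonymized

# Made-up rows in the style of a public benchmark, for illustration only.
table = [
    {"sex": "male", "age": 22, "survived": 0},
    {"sex": "female", "age": 38, "survived": 1},
    {"sex": "male", "age": 26, "survived": 0},
]
masked = strip_semantic_cues(table)
```

Comparing a model's accuracy on `table` and on `masked` is the kind of contrast the abstract describes: the statistical structure is preserved while the recognizable names are gone.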

[33] FreeChunker: A Cross-Granularity Chunking Framework

Wenxuan Zhang,Yuan-Hao Jiang,Yonghe Wu

Main category: cs.CL

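TL;DR: Proposes FreeChunker, a cross-granularity encoding framework that treats sentences as the basic unit and supports flexible retrieval of arbitrary sentence combinations, departing from static chunking and improving both retrieval performance and computational efficiency.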

Details Motivation: Existing fixed-granularity chunking methods rely on static boundary identification and adapt poorly to diverse query requirements, which limits the effectiveness of RAG systems. Method: FreeChunker treats sentences as atomic units and replaces static chunk segmentation with flexible cross-granularity retrieval that supports arbitrary sentence combinations. Result: On LongBench V2, FreeChunker achieves better retrieval performance than traditional chunking methods and is significantly more computationally efficient. Conclusion: By shifting the paradigm, FreeChunker addresses the limitations of fixed-granularity chunking and gives RAG systems greater adaptability and efficiency. Abstract: Chunking strategies significantly impact the effectiveness of Retrieval-Augmented Generation (RAG) systems. Existing methods operate within fixed-granularity paradigms that rely on static boundary identification, limiting their adaptability to diverse query requirements. This paper presents FreeChunker, a Cross-Granularity Encoding Framework that fundamentally transforms the traditional chunking paradigm: the framework treats sentences as atomic units and shifts from static chunk segmentation to flexible retrieval supporting arbitrary sentence combinations. This paradigm shift not only significantly reduces the computational overhead required for semantic boundary detection but also enhances adaptability to complex queries. Experimental evaluation on LongBench V2 demonstrates that FreeChunker achieves superior retrieval performance compared to traditional chunking methods, while significantly outperforming existing approaches in computational efficiency.

[34] Dialogue Is Not Enough to Make a Communicative BabyLM (But Neither Is Developmentally Inspired Reinforcement Learning)

Francesca Padovani,Bastian Bunzeck,Manar Ali,Omar Momen,Arianna Bisazza,Hendrik Buschmeier,Sina Zarrieß

Main category: cs.CL

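TL;DR: Studies small language models pre-trained only on dialogue data and applies several fine-tuning strategies to make their generations more communicative; DPO fine-tuning improves performance on the authors' custom dialogue benchmark.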

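Details Motivation: Explore whether pre-training exclusively on dialogue data yields language models that are formally and functionally apt, particularly for dialogue tasks. Method: The llamalogue model is pre-trained on dialogue data and then fine-tuned with strategies such as PPO and DPO to improve its dialogue generation. Result: The models underperform on most standard BabyLM benchmarks but excel at dialogue continuation prediction in a minimal-pair setting; DPO fine-tuning further improves performance on the custom dialogue benchmark, while PPO has mixed to adverse effects. Conclusion: Dialogue-only pre-training suits specific dialogue tasks, and DPO is the more effective fine-tuning method of the two. Abstract: We investigate whether pre-training exclusively on dialogue data results in formally and functionally apt small language models. Based on this pre-trained llamalogue model, we employ a variety of fine-tuning strategies to enforce "more communicative" text generations by our models. Although our models underperform on most standard BabyLM benchmarks, they excel at dialogue continuation prediction in a minimal pair setting. While PPO fine-tuning has mixed to adversarial effects on our models, DPO fine-tuning further improves their performance on our custom dialogue benchmark.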

[35] The Impact of Negated Text on Hallucination with Large Language Models

Jaehyung Seo,Hyeonseok Moon,Heuiseok Lim

Main category: cs.CL

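TL;DR: This paper studies how well large language models detect hallucinations in negated text. It finds that models struggle to identify hallucinations in negated contexts and often produce logically inconsistent or unfaithful judgments, and it traces the source of the problem by building the NegHalu dataset and analyzing internal states.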

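Details Motivation: The impact of negated text on LLM hallucination remains largely unexplored. The paper poses three research questions centered on whether models can recognize the contextual shifts caused by negation and still detect hallucinations accurately. Method: Reconstructs existing hallucination detection datasets with negated expressions to build the NegHalu dataset, and analyzes the model's internal states at the token level as it processes negated inputs. Result: Experiments show that LLMs' ability to detect hallucinations drops markedly on negated text, that they often make logically inconsistent judgments, and that their internal representations indicate difficulty handling the semantic changes negation introduces. Conclusion: Negated text substantially degrades LLM hallucination detection; current models face fundamental challenges in negated contexts and need targeted improvements in reliability and consistency. Abstract: Recent studies on hallucination in large language models (LLMs) have been actively progressing in natural language processing. However, the impact of negated text on hallucination with LLMs remains largely unexplored. In this paper, we set three important yet unanswered research questions and aim to address them. To derive the answers, we investigate whether LLMs can recognize contextual shifts caused by negation and still reliably distinguish hallucinations comparable to affirmative cases. We also design the NegHalu dataset by reconstructing existing hallucination detection datasets with negated expressions. Our experiments demonstrate that LLMs struggle to detect hallucinations in negated text effectively, often producing logically inconsistent or unfaithful judgments. Moreover, we trace the internal state of LLMs as they process negated inputs at the token level and reveal the challenges of mitigating their unintended effects.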

Son T. Luu,Trung Vo,Hiep Nguyen,Khanh Quoc Tran,Kiet Van Nguyen,Vu Tran,Ngan Luu-Thuy Nguyen,Le-Minh Nguyen

Main category: cs.CL

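TL;DR: This paper introduces the VLSP 2025 MLQA-TSR shared task, which aims to advance research on Vietnamese multimodal legal text processing with a focus on traffic sign regulation. It comprises two subtasks, multimodal legal retrieval and multimodal question answering, and provides a benchmark dataset. The best results are an F2 score of 64.55% and an accuracy of 86.30%, respectively.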

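Details Motivation: To advance research on Vietnamese multimodal legal text processing, in particular the development and evaluation of intelligent systems for traffic sign regulation. Method: Designs and releases the VLSP 2025 MLQA-TSR shared task, consisting of two subtasks (multimodal legal retrieval and multimodal question answering), and builds a benchmark dataset. Result: The best systems reach an F2 score of 64.55% on multimodal legal retrieval and an accuracy of 86.30% on multimodal question answering. Conclusion: The task provides an important benchmark for multimodal legal information processing in the Vietnamese traffic regulation domain and supports the development of related intelligent systems. Abstract: This paper presents the VLSP 2025 MLQA-TSR - the multimodal legal question answering on traffic sign regulation shared task at VLSP 2025. VLSP 2025 MLQA-TSR comprises two subtasks: multimodal legal retrieval and multimodal question answering. The goal is to advance research on Vietnamese multimodal legal text processing and to provide a benchmark dataset for building and evaluating intelligent systems in multimodal legal domains, with a focus on traffic sign regulation in Vietnam. The best-reported results on VLSP 2025 MLQA-TSR are an F2 score of 64.55% for multimodal legal retrieval and an accuracy of 86.30% for multimodal question answering.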
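The retrieval subtask above is scored with F2, which weights recall more heavily than precision. A minimal sketch of the general F-beta formula follows; the function name and the example article IDs are illustrative and are not taken from the shared task's evaluation script.

```python
def f_beta(retrieved, relevant, beta=2.0):
    """F-beta between a retrieved set and a gold relevant set.

    beta=2 gives the F2 score: recall counts more than precision.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(retrieved)
    recall = true_positives / len(relevant)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical example: two articles retrieved, one of them relevant.
score = f_beta(["article_5", "article_9"], ["article_5"])
```

In this example precision is 0.5 and recall is 1.0, so F2 is 2.5 / 3, noticeably higher than the F1 of 2 / 3 for the same prediction.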

[37] NeoDictaBERT: Pushing the Frontier of BERT models for Hebrew

Shaltiel Shmidman,Avi Shmidman,Moshe Koppel

Main category: cs.CL

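TL;DR: This paper introduces NeoDictaBERT and NeoDictaBERT-bilingual, BERT-style models built on the NeoBERT architecture and designed for Hebrew text. They outperform existing models on a range of benchmarks, with especially strong retrieval results, and are released publicly to advance Hebrew NLP research.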

Details Motivation: The architecture of existing BERT models is dated compared with newer transformer-based models such as Llama3 and Qwen3; improving Hebrew NLP performance calls for a modern architecture tuned specifically for Hebrew. Method: Using the same modern architecture as NeoBERT, trains two Hebrew-focused BERT-style models, NeoDictaBERT and its bilingual version, and evaluates them on a variety of downstream tasks. Result: NeoDictaBERT outperforms existing models on almost all Hebrew benchmarks, and NeoDictaBERT-bilingual performs especially well on retrieval tasks, surpassing other multilingual models of similar size. Conclusion: The NeoDictaBERT models, built on a modern architecture, improve Hebrew NLP performance, provide strong baselines for downstream tasks, and support open research in the field. Abstract: Since their initial release, BERT models have demonstrated exceptional performance on a variety of tasks, despite their relatively small size (BERT-base has ~100M parameters). Nevertheless, the architectural choices used in these models are outdated compared to newer transformer-based models such as Llama3 and Qwen3. In recent months, several architectures have been proposed to close this gap. ModernBERT and NeoBERT both show strong improvements on English benchmarks and significantly extend the supported context window. Following their successes, we introduce NeoDictaBERT and NeoDictaBERT-bilingual: BERT-style models trained using the same architecture as NeoBERT, with a dedicated focus on Hebrew texts. These models outperform existing ones on almost all Hebrew benchmarks and provide a strong foundation for downstream tasks. Notably, the NeoDictaBERT-bilingual model shows strong results on retrieval tasks, outperforming other multilingual models of similar size. In this paper, we describe the training process and report results across various benchmarks. We release the models to the community as part of our goal to advance research and development in Hebrew NLP.

[38] Teacher Demonstrations in a BabyLM's Zone of Proximal Development for Contingent Multi-Turn Interaction

Suchir Salhan,Hongyi Gu,Donya Rooein,Diana Galvan-Sosa,Gabrielle Gaudeau,Andrew Caines,Zheng Yuan,Paula Buttery

Main category: cs.CL

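TL;DR: ContingentChat is a teacher-student framework for benchmarking and improving multi-turn contingency in a BabyLM trained on 100M words. After post-training on a new alignment dataset, the BabyLM generates responses that are more grammatical and cohesive. Experiments show that targeted post-training improves dialogue quality, but contingency remains a challenge for BabyLMs.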

Details Motivation: To improve language models' ability to exhibit contingency, the prompt, direct, and meaningful exchange that characterizes multi-turn dialogue between a child and a caregiver. Method: Proposes the ContingentChat teacher-student framework, post-trains a BabyLM on a novel alignment dataset, and experiments with adaptive teacher decoding strategies. Result: Post-training makes the BabyLM's responses more grammatical and cohesive, while adaptive decoding strategies bring only limited additional gains. Conclusion: Targeted post-training effectively improves a BabyLM's dialogue quality, but achieving genuine conversational contingency remains challenging. Abstract: Multi-turn dialogues between a child and a caregiver are characterized by a property called contingency - that is, prompt, direct, and meaningful exchanges between interlocutors. We introduce ContingentChat, a teacher-student framework that benchmarks and improves multi-turn contingency in a BabyLM trained on 100M words. Using a novel alignment dataset for post-training, BabyLM generates responses that are more grammatical and cohesive. Experiments with adaptive teacher decoding strategies show limited additional gains. ContingentChat demonstrates the benefits of targeted post-training for dialogue quality and indicates that contingency remains a challenging goal for BabyLMs.

[39] LM-mixup: Text Data Augmentation via Language Model based Mixup

Zhijie Deng,Zhouan Shen,Ling Li,Yao Zhou,Zhaowei Zhu,Yanji He,Wei Wang,Jiaheng Wei

Main category: cs.CL

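TL;DR: This paper proposes the Instruction Distillation task, which distills low-quality, redundant instruction data into high-quality, coherent instruction-output pairs, and builds the MIXTURE dataset. The LM-Mixup method (supervised fine-tuning followed by reinforcement learning) provides effective data augmentation: training on only about 3% of the data surpasses full-dataset training, showing that low-quality data is highly valuable when handled properly.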

Details Motivation: High-quality instruction-following data is scarce, while low-quality data is often discarded and its information wasted; existing data augmentation methods struggle to make effective use of low-quality data and lack a well-defined evaluation. Method: Defines the Instruction Distillation task, builds the 144K-sample MIXTURE dataset, and applies LM-Mixup: supervised fine-tuning on MIXTURE first, then reinforcement learning with GRPO using three reward signals (quality, semantic alignment, and format compliance). Result: Across multiple benchmarks, fine-tuning on only the roughly 3% of data distilled by LM-Mixup outperforms full-dataset training and is competitive with state-of-the-art high-quality data selection methods. Conclusion: When properly distilled and augmented with LM-Mixup, low-quality data becomes a valuable resource for improving the efficiency and performance of instruction-tuned LLMs. Abstract: Instruction tuning is crucial for aligning Large Language Models (LLMs), yet the quality of instruction-following data varies significantly. While high-quality data is paramount, it is often scarce; conversely, abundant low-quality data is frequently discarded, leading to substantial information loss. Existing data augmentation methods struggle to augment this low-quality data effectively, and the evaluation of such techniques remains poorly defined. To address this, we formally define the task of Instruction Distillation: distilling multiple low-quality and redundant inputs into high-quality and coherent instruction-output pairs. Specifically, we introduce a comprehensive data construction pipeline to create MIXTURE, a 144K-sample dataset pairing low-quality or semantically redundant imperfect instruction clusters with their high-quality distillations. We then introduce LM-Mixup, by first performing supervised fine-tuning on MIXTURE and then optimizing it with reinforcement learning. This process uses three complementary reward signals: quality, semantic alignment, and format compliance, via Group Relative Policy Optimization (GRPO). We demonstrate that LM-Mixup effectively augments imperfect datasets: fine-tuning LLMs on its distilled data, which accounts for only about 3% of the entire dataset, not only surpasses full-dataset training but also competes with state-of-the-art high-quality data selection methods across multiple benchmarks.
Our work establishes that low-quality data is a valuable resource when properly distilled and augmented with LM-Mixup, significantly enhancing the efficiency and performance of instruction-tuned LLMs.
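The abstract combines three reward signals and optimizes them with GRPO. The sketch below illustrates the group-relative advantage at the core of GRPO: each sampled output is scored against the mean and spread of its own group. The equal weighting of the three signals, the use of the population standard deviation, and all names are assumptions made for illustration; the abstract does not say how the authors combine or normalize rewards.

```python
import statistics


def combined_reward(quality, alignment, format_ok, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three signals named in the abstract.

    The weights are hypothetical; the paper's actual combination is not
    given in the abstract.
    """
    w_quality, w_alignment, w_format = weights
    return (w_quality * quality
            + w_alignment * alignment
            + w_format * (1.0 if format_ok else 0.0))


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std: an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical distillations sampled for the same instruction cluster.
group = [
    combined_reward(0.9, 0.8, True),
    combined_reward(0.4, 0.7, True),
    combined_reward(0.6, 0.2, False),
    combined_reward(0.7, 0.9, True),
]
advantages = group_relative_advantages(group)
```

Because advantages are centered within the group, outputs are rewarded only for being better than their siblings, which removes the need for a separate value model.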

[40] Systematic Evaluation of Uncertainty Estimation Methods in Large Language Models

Christian Hobelsberger,Theresa Winner,Andreas Nawroth,Oliver Mitevski,Anna-Carolina Haensch

Main category: cs.CL

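TL;DR: This paper systematically evaluates four confidence estimation methods for large language model (LLM) outputs: VCE, MSP, Sample Consistency, and CoCoA, using four question-answering tasks and a state-of-the-art open-source LLM. The results show that each metric captures a different facet of model confidence, and that the hybrid CoCoA approach is the most reliable overall, with the best calibration and discrimination of correct answers. The paper also discusses the trade-offs of each method and offers recommendations for choosing uncertainty measures in practice.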

Details Motivation: LLM outputs vary in both uncertainty and correctness, which limits their practical reliability, so effective methods are needed to quantify the confidence of model outputs. Method: Systematically evaluates four confidence estimation methods (VCE, MSP, Sample Consistency, CoCoA) on four question-answering tasks with a state-of-the-art open-source LLM, comparing their calibration and discrimination. Result: Each uncertainty measure captures a different facet of confidence; CoCoA performs best in overall reliability, calibration, and discrimination of correct answers. Conclusion: CoCoA is the stronger, hybrid confidence estimation method; the authors recommend choosing an uncertainty measure by weighing the trade-offs for the application at hand. Abstract: Large language models (LLMs) produce outputs with varying levels of uncertainty, and, just as often, varying levels of correctness, making their practical reliability far from guaranteed. To quantify this uncertainty, we systematically evaluate four approaches for confidence estimation in LLM outputs: VCE, MSP, Sample Consistency, and CoCoA (Vashurin et al., 2025). For the evaluation of the approaches, we conduct experiments on four question-answering tasks using a state-of-the-art open-source LLM. Our results show that each uncertainty metric captures a different facet of model confidence and that the hybrid CoCoA approach yields the best reliability overall, improving both calibration and discrimination of correct answers. We discuss the trade-offs of each method and provide recommendations for selecting uncertainty measures in LLM applications.
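To make two of the compared families concrete, here is a minimal sketch of a probability-based score and a consistency-based score. The abstract does not expand its acronyms or give formulas, so both functions are generic illustrations rather than the paper's definitions: the first multiplies token probabilities for one generated answer, and the second measures agreement among several sampled answers after a simple normalization that is itself an assumption.

```python
import math
from collections import Counter


def sequence_probability(token_logprobs):
    """Probability of one generated answer from its token log-probabilities."""
    return math.exp(sum(token_logprobs))


def sample_consistency(answers):
    """Majority answer among samples and the fraction of samples agreeing.

    Answers are compared after lowercasing and stripping whitespace, a
    deliberately crude normalization chosen for this sketch.
    """
    counts = Counter(a.strip().lower() for a in answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


# Hypothetical samples for the question "What is the capital of France?"
majority, agreement = sample_consistency(["Paris", "paris ", "Lyon", "Paris"])
confidence = sequence_probability([math.log(0.5), math.log(0.5)])
```

The two scores can disagree: a model may assign a high probability to one answer yet produce varied answers when sampled, which is one reason hybrid measures are of interest.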

[41] Mask and You Shall Receive: Optimizing Masked Language Modeling For Pretraining BabyLMs

Lukas Edman,Alexander Fraser

Main category: cs.CL

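TL;DR: Proposes an improved form of masked language modeling (MLM) that adapts masking probabilities to the model's ability to predict each token, and adds sub-token embeddings, improving performance on (Super)GLUE tasks and morphological generalization.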

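Details Motivation: To improve the language modeling ability of small models in the BabyLM Challenge, especially language understanding and generalization under low-resource conditions. Method: Improves masked language modeling (MLM) by dynamically adjusting the probability with which tokens are masked, and incorporates sub-token embeddings to strengthen the model's learning of morphological structure. Result: Substantially outperforms standard MLM on (Super)GLUE tasks and beats the baseline in the strict-small track of the BabyLM Challenge. Conclusion: The improved MLM strategy and sub-token embeddings improve small models' language understanding and morphological generalization, supporting their potential in resource-constrained settings. Abstract: We describe our strategy for the 2025 edition of the BabyLM Challenge. Our main contribution is that of an improved form of Masked Language Modeling (MLM), which adapts the probabilities of the tokens masked according to the model's ability to predict them. The results show a substantial increase in performance on (Super)GLUE tasks over the standard MLM. We also incorporate sub-token embeddings, finding that this increases the model's morphological generalization capabilities. Our submission beats the baseline in the strict-small track.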
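The abstract says masking probabilities are adapted to how well the model predicts each token, but gives no formula. The sketch below shows one plausible reading, stated as an assumption: tokens the model predicts poorly are masked more often, and the probabilities are rescaled so their mean stays near a base rate. The direction of the adjustment, the 0.15 base rate, the clipping bounds, and every name here are hypothetical and are not taken from the paper.

```python
def masking_probabilities(token_accuracy, base_rate=0.15,
                          floor=0.02, ceiling=0.9):
    """Map per-token prediction accuracy to a per-token masking probability.

    Assumption: harder tokens (low accuracy) are masked more often.
    Probabilities are scaled so the unclipped mean equals base_rate.
    """
    difficulty = {tok: 1.0 - acc for tok, acc in token_accuracy.items()}
    mean_difficulty = sum(difficulty.values()) / len(difficulty)
    if mean_difficulty == 0.0:
        # Every token is predicted perfectly: fall back to uniform masking.
        return {tok: base_rate for tok in token_accuracy}
    scale = base_rate / mean_difficulty
    return {tok: min(ceiling, max(floor, d * scale))
            for tok, d in difficulty.items()}


# Hypothetical running accuracies for three vocabulary items.
probs = masking_probabilities({"the": 0.9, "cat": 0.5, "quark": 0.1})
```

In this toy example the frequent, easy token is masked about 3% of the time and the rare, hard one about 27% of the time, so training signal shifts toward tokens the model has not yet learned.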

[42] RECALL: REpresentation-aligned Catastrophic-forgetting ALLeviation via Hierarchical Model Merging

Bowen Wang,Haiyuan Wan,Liwen Shi,Chen Yang,Peng He,Yue Ma,Haochen Han,Wenhao Li,Tiao Tan,Yongjian Li,Fangming Liu,Yifan Gong,Sheng Zhang

Main category: cs.CL

TL;DR: Proposes RECALL, a representation-based model merging framework for continual learning without access to historical data.

Details Motivation: To address catastrophic forgetting in continual learning for large language models while avoiding any dependence on historical data or task labels. Method: Computes inter-model similarity from layer-wise hidden representations over clustered typical samples, then performs adaptive, hierarchical parameter fusion to align knowledge across models. Result: Across five NLP tasks and multiple continual learning scenarios, RECALL outperforms baselines in both knowledge retention and generalization, showing strong resistance to forgetting and effective multi-domain integration. Conclusion: RECALL offers a scalable, data-free continual learning solution for LLMs that balances the preservation of general features with task-specific adaptation. Abstract: We unveil that internal representations in large language models (LLMs) serve as reliable proxies of learned knowledge, and propose RECALL, a novel representation-aware model merging framework for continual learning without access to historical data. RECALL computes inter-model similarity from layer-wise hidden representations over clustered typical samples, and performs adaptive, hierarchical parameter fusion to align knowledge across models. This design enables the preservation of domain-general features in shallow layers while allowing task-specific adaptation in deeper layers. Unlike prior methods that require task labels or incur performance trade-offs, RECALL achieves seamless multi-domain integration and strong resistance to catastrophic forgetting. Extensive experiments across five NLP tasks and multiple continual learning scenarios show that RECALL outperforms baselines in both knowledge retention and generalization, providing a scalable and data-free solution for evolving LLMs.

[43] Steering Evaluation-Aware Language Models To Act Like They Are Deployed

Tim Tian Hua,Andrew Qin,Samuel Marks,Neel Nanda

Main category: cs.CL

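TL;DR: This paper shows that adding a steering vector to a large language model's activations can suppress its evaluation awareness, making it behave during evaluation as if it were deployed and thereby improving the reliability of safety evaluations.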

Details Motivation: Large language models may notice when they are being evaluated and adjust their behavior to appear more aligned, which makes safety evaluations unreliable; a method is needed to suppress this evaluation awareness. Method: Trains an evaluation-aware model in two steps: first, continued pretraining on documents describing the model so that it learns to recognize an evaluation cue; second, expert iteration that trains the model to use Python type hints in evaluation settings. Evaluation awareness is then suppressed by adding a steering vector, constructed beforehand, to the model's activations. Result: The trained model writes type hints more often in evaluation contexts than in deployment contexts, demonstrating evaluation awareness; with the steering vector applied, the model behaves as if deployed even when the evaluation cue is present. Conclusion: Activation steering can suppress evaluation awareness in LLMs and make safety evaluations more faithful and reliable; AI evaluators could use this technique in future evaluations. Abstract: Large language models (LLMs) can sometimes detect when they are being evaluated and adjust their behavior to appear more aligned, compromising the reliability of safety evaluations. In this paper, we show that adding a steering vector to an LLM's activations can suppress evaluation-awareness and make the model act like it is deployed during evaluation. To study our steering technique, we train an LLM to exhibit evaluation-aware behavior using a two-step training process designed to mimic how this behavior could emerge naturally. First, we perform continued pretraining on documents with factual descriptions of the model (1) using Python type hints during evaluation but not during deployment and (2) recognizing that the presence of a certain evaluation cue always means that it is being tested. Then, we train the model with expert iteration to use Python type hints in evaluation settings. The resulting model is evaluation-aware: it writes type hints in evaluation contexts more than deployment contexts. However, this gap can only be observed by removing the evaluation cue. We find that activation steering can suppress evaluation awareness and make the model act like it is deployed even when the cue is present. Importantly, we constructed our steering vector using the original model before our additional training. Our results suggest that AI evaluators could improve the reliability of safety evaluations by steering models to act like they are deployed.
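The core operation in the abstract is adding a fixed vector to a model's activations. The sketch below shows that operation on plain lists, together with one common way to build such a vector: the difference between mean activations on two contrasting prompt sets. The abstract does not describe how the authors construct their vector, so the mean-difference recipe, the scaling factor `alpha`, and the toy numbers are assumptions for illustration only.

```python
def mean_difference_vector(acts_deploy, acts_eval):
    """Steering direction as mean(deploy activations) - mean(eval activations).

    Each argument is a list of equal-length activation vectors. This
    construction is a common recipe, assumed here rather than taken
    from the paper.
    """
    dim = len(acts_deploy[0])
    mean_deploy = [sum(v[i] for v in acts_deploy) / len(acts_deploy)
                   for i in range(dim)]
    mean_eval = [sum(v[i] for v in acts_eval) / len(acts_eval)
                 for i in range(dim)]
    return [d - e for d, e in zip(mean_deploy, mean_eval)]


def steer(hidden_state, steering_vector, alpha=1.0):
    """Add the scaled steering vector to one hidden state."""
    return [h + alpha * s for h, s in zip(hidden_state, steering_vector)]


# Toy 2-dimensional activations for deployment-like and evaluation-like prompts.
vector = mean_difference_vector([[1.0, 2.0], [3.0, 4.0]],
                                [[0.0, 0.0], [2.0, 2.0]])
steered = steer([0.0, 0.0], vector, alpha=2.0)
```

In a real model the same addition would be applied to the residual stream at one or more layers during the forward pass, typically through a hook in the model framework.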

[44] Robust Preference Alignment via Directional Neighborhood Consensus

Ruochen Mao,Yuling Shi,Xiaodong Gu,Jiaheng Wei

Main category: cs.CL

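TL;DR: This paper proposes Robust Preference Selection (RPS), a post-hoc method that requires no retraining. It uses directional neighborhood consensus to make large language models more robustly aligned with diverse human preferences, mitigating the performance drop caused by training data that concentrates on dominant preferences.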

Details Motivation: Existing alignment methods behave unreliably when a personalized request deviates from the dominant preferences in the training data, and they depend on costly retraining that cannot cover the full preference spectrum, leaving a "preference coverage gap". Method: Proposes Robust Preference Selection (RPS): sample responses for several related preferences in a local neighborhood of the user's specified preference to build a better candidate pool, then select the response that best matches the user's intent, all without fine-tuning the model. Result: Across three alignment paradigms (DPA, DPO, and SFT), RPS achieves win rates of up to 69% against a strong baseline on under-represented preference regions, improving robustness to rare or complex preferences. Conclusion: RPS is a practical, theoretically grounded, training-free method that improves the reliability and stability of preference-aligned models under diverse user needs. Abstract: Aligning large language models with human preferences is critical for creating reliable and controllable AI systems. A human preference can be visualized as a high-dimensional vector where different directions represent trade-offs between desired attributes (e.g., helpfulness vs. verbosity). Yet, because the training data often reflects dominant, average preferences, LLMs tend to perform well on common requests but fall short in specific, individual needs. This mismatch creates a preference coverage gap. Existing methods often address this through costly retraining, which may not be generalized to the full spectrum of diverse preferences. This brittleness means that when a user's request reflects a nuanced preference deviating from the training data's central tendency, model performance can degrade unpredictably. To address this challenge, we introduce Robust Preference Selection (RPS), a post-hoc, training-free method by leveraging directional neighborhood consensus. Instead of forcing a model to generate a response from a single, highly specific preference, RPS samples multiple responses from a local neighborhood of related preferences to create a superior candidate pool. It then selects the response that best aligns with the user's original intent. We provide a theoretical framework showing our neighborhood generation strategy is provably superior to a strong baseline that also samples multiple candidates.
Comprehensive experiments across three distinct alignment paradigms (DPA, DPO, and SFT) demonstrate that RPS consistently improves robustness against this baseline, achieving win rates of up to 69% on challenging preferences from under-represented regions of the space without any model retraining. Our work presents a practical, theoretically-grounded solution for enhancing the reliability of preference-aligned models.
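The sample-then-select procedure described in the abstract can be sketched in two dimensions, with a preference represented as an angle trading off two attributes (for example helpfulness and verbosity). The `generate` and `score` callables, the evenly spaced neighborhood, and the 15-degree width are all hypothetical stand-ins; the abstract does not specify how the authors sample the neighborhood or score candidates.

```python
import math


def neighborhood(theta, k=5, max_offset=math.radians(15)):
    """k unit preference directions evenly spaced within +/- max_offset of theta."""
    if k == 1:
        return [(math.cos(theta), math.sin(theta))]
    step = 2 * max_offset / (k - 1)
    angles = [theta - max_offset + i * step for i in range(k)]
    return [(math.cos(a), math.sin(a)) for a in angles]


def robust_preference_selection(prompt, theta, generate, score, k=5):
    """Generate one candidate per neighboring preference, keep the best.

    generate(prompt, direction) -> response      (hypothetical model call)
    score(response) -> (attr_1, attr_2) rewards  (hypothetical reward model)
    The winner maximizes reward projected onto the user's ORIGINAL direction.
    """
    user_direction = (math.cos(theta), math.sin(theta))
    candidates = [generate(prompt, d) for d in neighborhood(theta, k)]

    def alignment(response):
        rewards = score(response)
        return sum(r * u for r, u in zip(rewards, user_direction))

    return max(candidates, key=alignment)


# Toy stand-ins: the "response" is just the direction it was generated for.
best = robust_preference_selection(
    "explain recursion", math.pi / 4,
    generate=lambda prompt, direction: direction,
    score=lambda response: response,
)
```

With these toy stand-ins the selected candidate is simply the one generated at the user's own direction; with a real model, a neighboring direction that lies closer to the training distribution may yield a response that scores higher on the user's original preference.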

[45] Hierarchical Sequence Iteration for Heterogeneous Question Answering

Ruiyi Yang,Hao Xue,Imran Razzak,Hakim Hacid,Flora D. Salim

Main category: cs.CL

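TL;DR: This paper proposes the HSEQ Iteration framework, which unifies documents, tables, and knowledge graphs into a reversible hierarchical sequence and combines it with structure-aware iterative retrieval to achieve efficient, accurate question answering over heterogeneous multi-source data.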

Details Motivation: Existing RAG methods suffer from low accuracy, high latency, and heavy resource use on multi-hop questions and heterogeneous evidence sources, and lack a unified, efficient mechanism for retrieval and reasoning. Method: Linearizes data in different formats (text, tables, knowledge graphs) into a hierarchical sequence (HSeq) with lightweight structural tags. A Head Agent guides retrieval, and an Iteration Agent collects evidence iteratively through structure-aware actions (such as parent/child hops, table neighbors, and KG relation expansion). The Head Agent then composes the canonicalized evidence into an answer, with an optional refinement loop after contradictions are detected. Result: Outperforms strong baselines on HotpotQA, HybridQA/TAT-QA, and MetaQA with higher EM/F1 and markedly better efficiency; its three advantages are format-agnostic unification, budget-aware iteration, and evidence canonicalization. Conclusion: HSEQ provides a unified, efficient, and auditable RAG framework for accurate question answering over heterogeneous sources, balancing performance, cost, and accuracy. Abstract: Retrieval-augmented generation (RAG) remains brittle on multi-step questions and heterogeneous evidence sources, trading accuracy against latency and token/tool budgets. This paper introduces Hierarchical Sequence (HSEQ) Iteration for Heterogeneous Question Answering, a unified framework that (i) linearizes documents, tables, and knowledge graphs into a reversible hierarchical sequence with lightweight structural tags, and (ii) performs structure-aware iteration to collect just-enough evidence before answer synthesis. A Head Agent provides guidance that leads retrieval, while an Iteration Agent selects and expands HSeq via structure-respecting actions (e.g., parent/child hops, table row/column neighbors, KG relations). Finally, the Head Agent composes canonicalized evidence to generate the final answer, with an optional refinement loop to resolve detected contradictions. Experiments on HotpotQA (text), HybridQA/TAT-QA (table+text), and MetaQA (KG) show consistent EM/F1 gains over strong single-pass, multi-hop, and agentic RAG baselines with high efficiency. Besides, HSEQ exhibits three key advantages: (1) a format-agnostic unification that enables a single policy to operate across text, tables, and KGs without per-dataset specialization; (2) guided, budget-aware iteration that reduces unnecessary hops, tool calls, and tokens while preserving accuracy; and (3) evidence canonicalization for reliable QA, improving answer consistency and auditability.
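As an illustration of what linearizing a table into a tagged hierarchical sequence might look like, the sketch below flattens a small table into segments that each carry an ID, a parent pointer, and a structural tag, then implements a child hop. The `Segment` fields, the tag names, and the ID scheme are hypothetical; the abstract does not specify the actual HSeq format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Segment:
    seg_id: str
    parent: Optional[str]  # None for a root segment
    tag: str               # hypothetical tags: "table", "row", "cell"
    text: str


def linearize_table(table_id: str, header: List[str],
                    rows: List[List[str]]) -> List[Segment]:
    """Flatten a table into an ordered list that keeps parent links.

    The parent pointers are what make the sequence hierarchical: the
    row and cell structure can be rebuilt from the flat list.
    """
    segments = [Segment(table_id, None, "table", " | ".join(header))]
    for r, row in enumerate(rows):
        row_id = f"{table_id}/r{r}"
        segments.append(Segment(row_id, table_id, "row", ""))
        for c, cell in enumerate(row):
            segments.append(
                Segment(f"{row_id}/c{c}", row_id, "cell", f"{header[c]}={cell}"))
    return segments


def children(segments: List[Segment], seg_id: str) -> List[Segment]:
    """Structure-respecting 'child hop': direct children of a segment."""
    return [s for s in segments if s.parent == seg_id]


segments = linearize_table("t0", ["city", "population"],
                           [["Oslo", "700k"], ["Bergen", "290k"]])
second_row_cells = [s.text for s in children(segments, "t0/r1")]
```

Documents and knowledge-graph triples could be mapped to the same `Segment` shape with different tags, which is what would let a single retrieval policy navigate all three.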

[46] Assessing the Political Fairness of Multilingual LLMs: A Case Study based on a 21-way Multiparallel EuroParl Dataset

Paul Lerner,François Yvon

Main category: cs.CL

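TL;DR: Proposes a new framework that assesses the political bias of large language models through fairness in multilingual translation, using a newly built multiparallel corpus to analyze differences in translation quality across European Parliament speeches.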

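Details Motivation: Existing methods mostly assess the political bias of LLMs through English surveys and lack a fairness analysis that is cross-lingual and grounded in real usage, so a new evaluation approach based on multilingual translation fairness is needed. Method: Builds a new multiparallel version of EuroParl covering 21 languages and 1.5M sentences, annotated with speakers' political affiliations, and systematically compares translation quality across parties with different political positions. Result: Speeches by majority parties of the left, center, and right are translated markedly better than those of outsider parties, revealing a systematic translation bias. Conclusion: LLMs show systematic bias tied to political affiliation in multilingual translation, favoring majority parties, which suggests translation fairness is a useful lens for assessing political bias. Abstract: The political biases of Large Language Models (LLMs) are usually assessed by simulating their answers to English surveys. In this work, we propose an alternative framing of political biases, relying on principles of fairness in multilingual translation. We systematically compare the translation quality of speeches in the European Parliament (EP), observing systematic differences with majority parties from left, center, and right being better translated than outsider parties. This study is made possible by a new, 21-way multiparallel version of EuroParl, the parliamentary proceedings of the EP, which includes the political affiliations of each speaker. The dataset consists of 1.5M sentences for a total of 40M words and 249M characters. It covers three years, 1000+ speakers, 7 countries, 12 EU parties, 25 EU committees, and hundreds of national parties.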

[47] ARC-Encoder: learning compressed text representations for large language models

Hippolyte Pilchen,Edouard Grave,Patrick Pérez

Main category: cs.CL

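TL;DR: This paper proposes ARC-Encoder, a context compression method that compresses text context into continuous representations to reduce the inference cost of large language models. It requires no fine-tuning or modification of the decoder model and is efficient and portable across models.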

Details Motivation: Existing context compression methods often require fine-tuning the decoder model or modifying its architecture, which can degrade its general abilities, so a method is needed that compresses context efficiently without changing the target model. Method: Designs a standalone encoder (ARC-Encoder) that compresses the context into fewer continuous representations (typically 1/4 or 1/8 of the original token count), and systematically studies training strategies and architecture choices so the encoder can be adapted to multiple decoder LLMs. Result: Across LLM usage scenarios such as in-context learning and context window extension, ARC-Encoder reaches state-of-the-art performance on several benchmarks while improving inference efficiency, and a single encoder is shown to work with several different decoders. Conclusion: ARC-Encoder is a flexible, efficient context compression solution that lowers inference cost without modifying the decoder and is portable across LLMs. Abstract: Recent techniques such as retrieval-augmented generation or chain-of-thought reasoning have led to longer contexts and increased inference costs. Context compression techniques can reduce these costs, but the most effective approaches require fine-tuning the target model or even modifying its architecture. This can degrade its general abilities when not used for this specific purpose. Here we explore an alternative approach: an encoder that compresses the context into continuous representations which replace token embeddings in decoder LLMs. First, we perform a systematic study of training strategies and architecture choices for the encoder. Our findings led to the design of an Adaptable text Representations Compressor, named ARC-Encoder, which outputs $x$-times fewer continuous representations (typically $x\!\in\!\{4,8\}$) than text tokens. We evaluate ARC-Encoder across a variety of LLM usage scenarios, ranging from in-context learning to context window extension, on both instruct and base decoders. Results show that ARC-Encoder achieves state-of-the-art performance on several benchmarks while improving computational efficiency at inference. Finally, we demonstrate that our models can be adapted to multiple decoders simultaneously, allowing a single encoder to generalize across different decoder LLMs. This makes ARC-Encoder a flexible and efficient solution for portable encoders that work seamlessly with multiple LLMs.
We release training code at https://github.com/kyutai-labs/ARC-Encoder ; the fine-tuning dataset and pretrained models are available at https://huggingface.co/collections/kyutai/arc-encoders-68ee18787301407d60a57047 .
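To make the $x$-times reduction concrete, the sketch below shrinks a sequence of vectors by averaging each group of `factor` consecutive vectors. Mean pooling is used purely as a stand-in to show the change in sequence length; the abstract does not say how ARC-Encoder produces its compressed representations, and the real encoder is a trained model rather than a fixed average.

```python
def mean_pool(vectors, factor=4):
    """Reduce sequence length by averaging each run of `factor` vectors.

    A stand-in for a learned compressor: it only demonstrates that n
    input vectors become ceil(n / factor) output vectors.
    """
    pooled = []
    for start in range(0, len(vectors), factor):
        group = vectors[start:start + factor]
        dim = len(group[0])
        pooled.append([sum(v[i] for v in group) / len(group)
                       for i in range(dim)])
    return pooled


# Eight toy 2-dimensional "token embeddings" compressed 4x into two vectors.
embeddings = [[float(i), 2.0 * i] for i in range(8)]
compressed = mean_pool(embeddings, factor=4)
```

The decoder would then attend over 2 positions instead of 8, which is where the inference savings come from; the open question a learned encoder addresses is how to keep the information a plain average would lose.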

[48] The Dog the Cat Chased Stumped the Model: Measuring When Language Models Abandon Structure for Shortcuts

Sangmitra Madhusudan,Kaige Chen,Ali Emami

Main category: cs.CL

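TL;DR: This paper introduces CenterBench, a dataset of 9,720 comprehension questions on center-embedded sentences, designed to tell whether language models rely on syntactic structure or on semantic pattern matching. Experiments show that as sentences grow more complex, the performance gap between semantically plausible and implausible sentences widens, revealing a tendency to abandon structural analysis in favor of semantic associations.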

Details Motivation: Existing benchmarks cannot easily distinguish whether language models truly parse syntactic structure or rely on semantic common sense to make predictions, so a new method is needed to identify how models arrive at their understanding. Method: Builds the CenterBench dataset of center-embedded sentences that are syntactically identical but vary in semantic plausibility and nesting depth, with six comprehension questions per sentence, and tests how model performance changes with complexity. Result: Across six models, the median accuracy gap between plausible and implausible sentences reaches up to 26.8 percentage points on complex sentences; reasoning models improve accuracy but still show semantic shortcuts, overthinking, and answer refusal, whereas humans show no such systematic bias. Conclusion: CenterBench provides the first framework for identifying when models shift from structural analysis to semantic matching, showing that current language models still lean heavily on semantic cues rather than genuine syntactic understanding when processing complex syntax. Abstract: When language models correctly parse "The cat that the dog chased meowed," are they analyzing syntax or simply familiar with dogs chasing cats? Despite extensive benchmarking, we lack methods to distinguish structural understanding from semantic pattern matching. We introduce CenterBench, a dataset of 9,720 comprehension questions on center-embedded sentences (like "The cat [that the dog chased] meowed") where relative clauses nest recursively, creating processing demands from simple to deeply nested structures. Each sentence has a syntactically identical but semantically implausible counterpart (e.g., mailmen prescribe medicine, doctors deliver mail) and six comprehension questions testing surface understanding, syntactic dependencies, and causal reasoning. Testing six models reveals that performance gaps between plausible and implausible sentences widen systematically with complexity, with models showing median gaps up to 26.8 percentage points, quantifying when they abandon structural analysis for semantic associations. Notably, semantic plausibility harms performance on questions about resulting actions, where following causal relationships matters more than semantic coherence. Reasoning models improve accuracy but their traces show semantic shortcuts, overthinking, and answer refusal. Unlike models whose plausibility advantage systematically widens with complexity, humans show variable semantic effects. CenterBench provides the first framework to identify when models shift from structural analysis to pattern matching.
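Center embedding follows a simple recursive pattern: noun phrases open from the outside in, and their verbs close in reverse order. The sketch below generates such sentences at any depth. The function and its argument layout are illustrative and are not CenterBench's actual generation code; the first example sentence is the one quoted in the abstract.

```python
def center_embed(nouns, verbs, main_verb):
    """Build a center-embedded sentence.

    nouns[0] is the main subject; nouns[i] acts on nouns[i - 1] via
    verbs[i - 1]. Noun phrases are opened left to right and the embedded
    verbs are emitted in reverse, which creates the nesting.
    """
    if len(nouns) != len(verbs) + 1:
        raise ValueError("need exactly one more noun than embedded verbs")
    opening = " that ".join(f"the {noun}" for noun in nouns)
    parts = [opening]
    if verbs:
        parts.append(" ".join(reversed(verbs)))
    parts.append(main_verb)
    sentence = " ".join(parts)
    return sentence[0].upper() + sentence[1:] + "."


depth_1 = center_embed(["cat", "dog"], ["chased"], "meowed")
depth_2 = center_embed(["cat", "dog", "man"], ["chased", "owned"], "meowed")
```

Here `depth_1` is "The cat that the dog chased meowed." and `depth_2` is "The cat that the dog that the man owned chased meowed." Swapping the nouns while keeping the verbs yields a syntactically identical but less plausible counterpart, which is the contrast the benchmark relies on.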

[49] GlobalRAG: Enhancing Global Reasoning in Multi-hop Question Answering via Reinforcement Learning

Jinchang Luo,Mingquan Cheng,Fan Wan,Ni Li,Xiaoling Xia,Shuangshuang Tian,Tingcheng Bian,Haiwei Wang,Haohuan Fu,Yan Tao

Main category: cs.CL

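TL;DR: Proposes GlobalRAG, a reinforcement learning framework that decomposes questions into subgoals, coordinates retrieval with reasoning, and introduces a Planning Quality Reward and a SubGoal Completion Reward. It substantially improves multi-hop question answering, achieving average gains of 14.2% in both EM and F1 while using only 42% of the training data used by strong baselines.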

Details Motivation: Existing reinforcement learning approaches to multi-hop question answering are limited by the absence of global planning and by unfaithful execution, leading to incoherent reasoning and inconsistent use of evidence. Method: Decomposes questions into subgoals and refines evidence iteratively; designs a Planning Quality Reward and a SubGoal Completion Reward; uses a progressive weight annealing strategy to balance process-oriented and outcome-based objectives. Result: Significantly outperforms strong baselines on both in-domain and out-of-domain benchmarks while using only 8k training examples (42% of the baselines' data), with average improvements of 14.2% in EM and F1. Conclusion: GlobalRAG improves global reasoning and execution reliability in multi-hop QA and achieves better performance with less data, supporting the combination of structured planning with reinforcement learning. Abstract: Reinforcement learning has recently shown promise in improving retrieval-augmented generation (RAG). Despite these advances, its effectiveness in multi-hop question answering (QA) remains limited by two fundamental limitations: (i) global planning absence to structure multi-step reasoning, and (ii) unfaithful execution, which hinders effective query formulation and consistent use of retrieved evidence. We propose GlobalRAG, a reinforcement learning framework designed to enhance global reasoning in multi-hop QA. GlobalRAG decomposes questions into subgoals, coordinates retrieval with reasoning, and refines evidence iteratively. To guide this process, we introduce Planning Quality Reward and SubGoal Completion Reward, which encourage coherent planning and reliable subgoal execution. In addition, a progressive weight annealing strategy balances process-oriented and outcome-based objectives. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that GlobalRAG significantly outperforms strong baselines while using only 8k training data (42% of the training data used by strong baselines), achieving average improvements of 14.2% in both EM and F1.
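The abstract mentions a progressive weight annealing strategy that balances process rewards against the outcome reward, without giving the schedule. The sketch below assumes a linear schedule that starts by emphasizing the process rewards (planning quality and subgoal completion) and gradually shifts weight to the final-answer reward. The schedule shape, its direction, the equal split between the two process rewards, and all names are assumptions made for illustration.

```python
def process_weight(step, total_steps, w_start=1.0, w_end=0.0):
    """Linearly anneal the process-reward weight from w_start to w_end."""
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return w_start + (w_end - w_start) * fraction


def blended_reward(planning_quality, subgoal_completion, outcome,
                   step, total_steps):
    """Mix process-oriented and outcome-based rewards at a training step.

    The two process rewards are averaged with equal weight, which is a
    hypothetical choice for this sketch.
    """
    w = process_weight(step, total_steps)
    process = 0.5 * (planning_quality + subgoal_completion)
    return w * process + (1.0 - w) * outcome


# A rollout with a good plan but a wrong final answer, early and late in training.
early = blended_reward(1.0, 1.0, 0.0, step=0, total_steps=100)
late = blended_reward(1.0, 1.0, 0.0, step=100, total_steps=100)
```

Under this assumed schedule, the same rollout earns full reward early in training, when learning to plan is the priority, and none at the end, when only answer correctness counts.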

Zhouwei Zhai,Mengxiang Chen,Haoyun Xia,Jin Li,Renquan Zhou,Min Yang

Main category: cs.CL

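TL;DR: Proposes a Multi-Agent Cognitive Decision Framework (MACDF) that shifts e-commerce search from passive retrieval to proactive decision support, significantly improving recommendation accuracy and user satisfaction on complex queries.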

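Details Motivation: The traditional retrieval-ranking paradigm relies on query-item matching, which is misaligned with users' multi-stage cognitive decision process and leads to semantic gaps, high decision costs, and a lack of professional shopping guidance. Method: Designs the Multi-Agent Cognitive Decision Framework (MACDF), which models the user's multi-stage cognitive decision process with multiple agents that collaborate on understanding, reasoning, filtering, and guidance to provide proactive decision support. Result: Offline experiments show that MACDF significantly outperforms traditional methods in recommendation accuracy and user satisfaction, especially on complex queries involving negation, multiple constraints, or reasoning; online A/B testing on the JD search platform confirms its practical effectiveness. Conclusion: Multi-agent cognitive systems have the potential to reshape the e-commerce search paradigm and better support users' complex shopping decisions. Abstract: The retrieval-ranking paradigm has long dominated e-commerce search, but its reliance on query-item matching fundamentally misaligns with multi-stage cognitive decision processes of platform users. This misalignment introduces critical limitations: semantic gaps in complex queries, high decision costs due to cross-platform information foraging, and the absence of professional shopping guidance. To address these issues, we propose a Multi-Agent Cognitive Decision Framework (MACDF), which shifts the paradigm from passive retrieval to proactive decision support. Extensive offline evaluations demonstrate MACDF's significant improvements in recommendation accuracy and user satisfaction, particularly for complex queries involving negation, multi-constraint, or reasoning demands. Online A/B testing on the JD search platform confirms its practical efficacy. This work highlights the transformative potential of multi-agent cognitive systems in redefining e-commerce search.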

[51] Can ChatGPT Code Communication Data Fairly?: Empirical Evidence from Multiple Collaborative Tasks

Jiangang Hao,Wenju Cui,Patrick Kyllonen,Emily Kerzabi

Main category: cs.CL

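TL;DR: This study examines whether ChatGPT-based automated coding of communication data in collaborative problem solving is biased by gender or race. The results show no significant bias across gender and racial groups, supporting its use in large-scale assessment of collaboration.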

Details Motivation: Prior research shows that ChatGPT can code communication data effectively, but whether it is biased against different gender and racial groups is unclear; this paper aims to fill that gap. Method: Using a typical coding framework for collaborative problem solving, applies ChatGPT to automatically code data from three types of collaborative tasks (negotiation, problem solving, and decision making) and analyzes differences across gender and racial groups. Result: ChatGPT-based coding shows no significant bias across gender and racial groups. Conclusion: ChatGPT can be used for large-scale assessment of collaboration and communication skills without introducing significant demographic bias, which supports its fairness and practical potential. Abstract: Assessing communication and collaboration at scale depends on a labor-intensive task of coding communication data into categories according to different frameworks. Prior research has established that ChatGPT can be directly instructed with coding rubrics to code the communication data and achieves accuracy comparable to human raters. However, whether the coding from ChatGPT or similar AI technology exhibits bias against different demographic groups, such as gender and race, remains unclear. To fill this gap, this paper investigates ChatGPT-based automated coding of communication data using a typical coding framework for collaborative problem solving, examining differences across gender and racial groups. The analysis draws on data from three types of collaborative tasks: negotiation, problem solving, and decision making. Our results show that ChatGPT-based coding exhibits no significant bias across gender and racial groups, paving the road for its adoption in large-scale assessment of collaboration and communication.

[52] BUSTED at AraGenEval Shared Task: A Comparative Study of Transformer-Based Models for Arabic AI-Generated Text Detection

Ali Zain,Sareem Farooqui,Muhammad Rafi

Main category: cs.CL

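TL;DR: This paper describes team BUSTED's submission to the Arabic AI-generated text detection task, which placed 5th. It compares three pre-trained models, AraELECTRA, CAMeLBERT, and XLM-RoBERTa, and finds that the multilingual XLM-RoBERTa achieves the best F1 score (0.7701), outperforming the specialized Arabic models and highlighting the strong generalization of multilingual models on this task.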

Details Motivation: To evaluate how effective different pre-trained transformer models are at detecting Arabic AI-generated text and to explore the performance gap between language-specific and multilingual models. Method: Fine-tunes three pre-trained models (AraELECTRA, CAMeLBERT, and XLM-RoBERTa) on the provided dataset for a binary classification task and compares their performance. Result: XLM-RoBERTa achieves the highest F1 score (0.7701), outperforming AraELECTRA and CAMeLBERT, which were designed specifically for Arabic. Conclusion: Multilingual models generalize strongly on Arabic AI-generated text detection; a language-specific model is not necessarily required for good performance. Abstract: This paper details our submission to the AraGenEval Shared Task on Arabic AI-generated text detection, where our team, BUSTED, secured 5th place. We investigated the effectiveness of three pre-trained transformer models: AraELECTRA, CAMeLBERT, and XLM-RoBERTa. Our approach involved fine-tuning each model on the provided dataset for a binary classification task. Our findings revealed a surprising result: the multilingual XLM-RoBERTa model achieved the highest performance with an F1 score of 0.7701, outperforming the specialized Arabic models. This work underscores the complexities of AI-generated text detection and highlights the strong generalization capabilities of multilingual models.

[53] Why Did Apple Fall To The Ground: Evaluating Curiosity In Large Language Model

Haoyu Wang,Sihang Jiang,Yuyan Chen,Yitong Wang,Yanghua Xiao

Main category: cs.CL

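TL;DR: Building on the human curiosity questionnaire 5DCR, this paper designs a comprehensive framework for evaluating curiosity in large language models (LLMs). It finds that LLMs show a stronger thirst for knowledge than humans but remain conservative in uncertain environments, and that curiosity can improve a model's reasoning and active learning abilities.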

Details Motivation: To investigate whether large language models are capable of human-like curiosity-driven learning, drawing on human curiosity assessment to build a quantifiable evaluation framework. Method: Based on the Five-Dimensional Curiosity scale Revised (5DCR), designs an evaluation framework covering dimensions such as Information Seeking, Thrill Seeking, and Social Curiosity, and systematically evaluates the curious behavior of LLMs. Result: LLMs show a stronger thirst for knowledge than humans but tend toward conservative choices in uncertain environments; curious behavior is positively related to the model's thinking process and helps improve its reasoning and active learning abilities. Conclusion: LLMs have the potential to exhibit human-like curiosity, and the study provides experimental support for future work on their learning capabilities and innovation. Abstract: Curiosity serves as a pivotal conduit for human beings to discover and learn new knowledge. Recent advancements of large language models (LLMs) in natural language processing have sparked discussions regarding whether these models possess the capability for curiosity-driven learning akin to humans. In this paper, starting from the human curiosity assessment questionnaire Five-Dimensional Curiosity scale Revised (5DCR), we design a comprehensive evaluation framework that covers dimensions such as Information Seeking, Thrill Seeking, and Social Curiosity to assess the extent of curiosity exhibited by LLMs. The results demonstrate that LLMs exhibit a stronger thirst for knowledge than humans but still tend to make conservative choices when faced with uncertain environments. We further investigated the relationship between curiosity and thinking of LLMs, confirming that curious behaviors can enhance the model's reasoning and active learning abilities. These findings suggest that LLMs have the potential to exhibit curiosity similar to that of humans, providing experimental support for the future development of learning capabilities and innovative research in LLMs.

[54] The Reasoning Lingua Franca: A Double-Edged Sword for Multilingual AI

Alan Saji,Raj Dabre,Anoop Kunchukuttan,Ratish Puduppully

Main category: cs.CL

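TL;DR: This paper studies how large reasoning models (LRMs) perform in multilingual reasoning. It finds that they tend to reason in English; this usually improves accuracy, but on complex tasks it is prone to failures caused by translation errors.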

Details Motivation: To explore how large reasoning models reason over non-English questions and how they handle linguistic and cultural nuances. Method: Systematically compare LRM reasoning in English versus the language of the question on two tasks, MGSM and GPQA Diamond, and analyze cognitive attributes in the reasoning traces. Result: Reasoning in English exhibits more cognitive behaviors and higher accuracy, especially on complex tasks, but it shows a "Lost in Translation" error pattern. Conclusion: Although reasoning in English improves performance, relying on translation can cause critical errors, so models need a stronger ability to reason directly in the language of the question. Abstract: Large Reasoning Models (LRMs) achieve strong performance on mathematical, scientific, and other question-answering tasks, but their multilingual reasoning abilities remain underexplored. When presented with non-English questions, LRMs often default to reasoning in English, raising concerns about interpretability and the handling of linguistic and cultural nuances. We systematically compare an LRM's reasoning in English versus the language of the question. Our evaluation spans two tasks: MGSM and GPQA Diamond. Beyond measuring answer accuracy, we also analyze cognitive attributes in the reasoning traces. We find that English reasoning traces exhibit a substantially higher presence of these cognitive behaviors, and that reasoning in English generally yields higher final-answer accuracy, with the performance gap increasing as tasks become more complex. However, this English-centric strategy is susceptible to a key failure mode, getting "Lost in Translation," where translation steps lead to errors that would have been avoided by reasoning in the question's language.

[55] CantoNLU: A benchmark for Cantonese natural language understanding

Junghyun Min,York Hay Ng,Sophia Chan,Helena Shunhua Zhao,En-Shiun Annie Lee

Main category: cs.CL

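TL;DR: This paper introduces CantoNLU, the first benchmark for Cantonese natural language understanding, covering seven syntactic and semantic tasks. It evaluates several models on Cantonese and finds that Cantonese-adapted models perform best overall, while the monolingual Cantonese model does better on syntactic tasks.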

Details Motivation: Cantonese is widely spoken but under-resourced, and the lack of a standardized evaluation framework has held back Cantonese NLP. Method: Build the CantoNLU benchmark of seven tasks and evaluate four models: a Mandarin model without Cantonese training, two models adapted to Cantonese through continual pre-training, and a monolingual Cantonese model trained from scratch. Result: Cantonese-adapted models perform best overall, the monolingual Cantonese model does better on syntactic tasks, and the Mandarin model remains competitive on some tasks, suggesting that direct transfer can work when Cantonese data is scarce. Conclusion: CantoNLU provides an important benchmark for Cantonese NLP research and advances technology for low-resource languages; the model comparison shows which training strategies suit which settings. Abstract: Cantonese, although spoken by millions, remains under-resourced due to policy and diglossia. To address this scarcity of evaluation frameworks for Cantonese, we introduce CantoNLU, a benchmark for Cantonese natural language understanding (NLU). This novel benchmark spans seven tasks covering syntax and semantics, including word sense disambiguation, linguistic acceptability judgment, language detection, natural language inference, sentiment analysis, part-of-speech tagging, and dependency parsing. In addition to the benchmark, we provide model baseline performance across a set of models: a Mandarin model without Cantonese training, two Cantonese-adapted models obtained by continual pre-training a Mandarin model on Cantonese text, and a monolingual Cantonese model trained from scratch. Results show that Cantonese-adapted models perform best overall, while monolingual models perform better on syntactic tasks. Mandarin models remain competitive in certain settings, indicating that direct transfer may be sufficient when Cantonese domain data is scarce. We release all datasets, code, and model weights to facilitate future research in Cantonese NLP.

[56] Neural Diversity Regularizes Hallucinations in Small Models

Kushal Chakrabarti,Nirmal Balachundhar

Main category: cs.CL

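TL;DR: Proposes neural diversity as a new mechanism for reducing hallucinations in language models; the ND-LoRA method substantially lowers hallucination rates without adding parameters or data.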

Details Motivation: Language models still hallucinate heavily despite growing scale, so a new mitigation path is needed. Method: Inspired by portfolio theory, introduce the concept of neural diversity and build the ND-LoRA framework, which combines parallel LoRA adapters with Barlow Twins regularization. Result: ND-LoRA reduces hallucinations by 14.6% on average and by up to 25.6% across tasks without hurting overall accuracy; experiments indicate that neural diversity is the key mediating factor and that different tasks need different optimal amounts of neural diversity. Conclusion: Neural diversity can serve as a third scaling axis alongside parameters and data, improving language model reliability at fixed budgets. Abstract: Language models continue to hallucinate despite increases in parameters, compute, and data. We propose neural diversity -- decorrelated parallel representations -- as a principled mechanism that reduces hallucination rates at fixed parameter and data budgets. Inspired by portfolio theory, where uncorrelated assets reduce risk by $\sqrt{P}$, we prove hallucination probability is bounded by representational correlation: $P(H) \leq f(\sigma^2((1-\rho(P))/P + \rho(P)), \mu^2)$, which predicts that language models need an optimal amount of neurodiversity. To validate this, we introduce ND-LoRA (Neural Diversity Low-Rank Adaptation), combining parallel LoRA adapters with Barlow Twins regularization, and demonstrate that ND-LoRA reduces hallucinations by up to 25.6% (and 14.6% on average) without degrading general accuracy. Ablations show LoRA adapters and regularization act synergistically, causal interventions prove neurodiversity as the mediating factor, and correlational analyses indicate scale: a 0.1% neural correlation increase is associated with a 3.8% hallucination increase. Finally, task-dependent optimality emerges: different tasks require different amounts of optimal neurodiversity. Together, our results highlight neural diversity as a third axis of scaling -- orthogonal to parameters and data -- to improve the reliability of language models at fixed budgets.
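The variance term inside the bound above, $\sigma^2((1-\rho)/P + \rho)$, is the variance of the mean of $P$ equally correlated signals. The sketch below evaluates only that term, not the bound $f$ itself, and it treats $\rho$ as a constant rather than a function of $P$; both simplifications are assumptions made for illustration, and the function name is hypothetical.

```python
def averaged_variance(sigma2: float, rho: float, p: int) -> float:
    """Variance of the mean of p signals with variance sigma2 and pairwise correlation rho."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return sigma2 * ((1.0 - rho) / p + rho)


# Uncorrelated signals (rho = 0) shrink variance as 1/p, i.e. risk as 1/sqrt(p);
# fully correlated signals (rho = 1) gain nothing from extra parallel streams.
for rho in (0.0, 0.3, 1.0):
    row = [round(averaged_variance(1.0, rho, p), 3) for p in (1, 2, 4, 8)]
    print(f"rho={rho}: {row}")
```

With any positive correlation the term flattens out at $\rho\sigma^2$ as $P$ grows, which is one way to read the abstract's emphasis on decorrelating the parallel representations.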

[57] Structure-Conditional Minimum Bayes Risk Decoding

Bryan Eikema,Anna Rutkiewicz,Mario Giulianelli

Main category: cs.CL

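TL;DR: This paper proposes three lightweight adaptations to the utility function that make Minimum Bayes Risk (MBR) decoding more sensitive to latent structure in open-ended generation, and shows clear gains in generation quality on dialogue and instruction-following tasks.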

Details Motivation: In open-ended tasks, standard similarity-based utility functions can lead MBR to select outputs that are representative but structurally sub-optimal, because they fail to capture differences in latent structure among generations. Method: Propose three lightweight adaptations to the utility function, design two metrics for structural optimality, and validate on a dataset covering three types of latent structure: dialogue act, emotion, and response structure. Result: The adapted utility functions clearly beat conventional ones on the structural optimality metrics and raise win rate on AlpacaEval and MT-Bench by up to 13.7 percentage points. Conclusion: Making MBR more sensitive to latent structure in the outcome space improves output quality on open-ended generation and offers a practical path for applying MBR to complex tasks. Abstract: Minimum Bayes Risk (MBR) decoding has seen renewed interest as an alternative to traditional generation strategies. While MBR has proven effective in machine translation, where the variability of a language model's outcome space is naturally constrained, it may face challenges in more open-ended tasks such as dialogue or instruction-following. We hypothesise that in such settings, applying MBR with standard similarity-based utility functions may result in selecting responses that are broadly representative of the model's distribution, yet sub-optimal with respect to any particular grouping of generations that share an underlying latent structure. In this work, we introduce three lightweight adaptations to the utility function, designed to make MBR more sensitive to structural variability in the outcome space. To test our hypothesis, we curate a dataset capturing three representative types of latent structure: dialogue act, emotion, and response structure (e.g., a sentence, a paragraph, or a list). We further propose two metrics to evaluate the structural optimality of MBR. Our analysis demonstrates that common similarity-based utility functions fall short by these metrics. In contrast, our proposed adaptations considerably improve structural optimality. Finally, we evaluate our approaches on real-world instruction-following benchmarks, AlpacaEval and MT-Bench, and show that increased structural sensitivity improves generation quality by up to 13.7 percentage points in win rate.
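For readers unfamiliar with the baseline, here is a minimal sketch of standard sampling-based MBR decoding with a similarity utility. The token-overlap utility `jaccard` is a stand-in chosen for illustration; the abstract does not specify the paper's utility functions or its three adaptations, and none of them are reproduced here.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set overlap, used as a toy similarity-based utility (an assumption)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def mbr_decode(candidates, utility=jaccard):
    """Return the candidate with the highest average utility against the other samples."""
    if not candidates:
        raise ValueError("need at least one candidate")
    best, best_score = None, float("-inf")
    for i, hyp in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        score = sum(utility(hyp, ref) for ref in others) / max(len(others), 1)
        if score > best_score:
            best, best_score = hyp, score
    return best


samples = ["a b c", "a b c", "x y z"]
print(mbr_decode(samples))
```

Because the score is an average over all samples, the winner is the most "central" output, which is exactly the behaviour the abstract argues can be sub-optimal when samples fall into structurally distinct groups.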

[58] User Perceptions of Privacy and Helpfulness in LLM Responses to Privacy-Sensitive Scenarios

Xiaoyuan Wu,Roshni Kaushik,Wenkai Li,Lujo Bauer,Koichi Onoue

Main category: cs.CL

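TL;DR: In a user study with 94 participants, users showed low agreement with one another when judging the privacy-preservation quality and helpfulness of large language model (LLM) responses to privacy-sensitive scenarios. Proxy LLMs agreed strongly with each other but correlated poorly with real users' ratings, which indicates that current proxy-LLM privacy evaluations do not accurately reflect user perception and that more user-centered evaluation is needed.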

Details Motivation: Existing LLM privacy benchmarks rely on proxy LLMs to measure compliance with privacy norms and overlook real users' perceptions; they also lack a fine-grained analysis of response helpfulness. It is therefore necessary to study how real users perceive LLM behavior in privacy-sensitive tasks. Method: Using 90 real-life scenarios from PrivacyLens, the authors ran a user study with 94 participants, who rated the privacy-preservation quality and helpfulness of LLM responses, and compared these ratings with those of five proxy LLMs. Result: (1) Users showed low agreement with each other on the privacy and helpfulness of the same response; (2) proxy LLMs agreed strongly with one another but correlated weakly with user ratings; (3) proxy LLMs could not accurately estimate real users' perceptions. Conclusion: Perceptions of privacy preservation and helpfulness in privacy-sensitive scenarios vary widely between individuals, and evaluations that rely on proxy LLMs do not adequately reflect real user experience. Future work should run more user-centered evaluations and explore ways to better align proxy LLMs with user perception. Abstract: Large language models (LLMs) have seen rapid adoption for tasks such as drafting emails, summarizing meetings, and answering health questions. In such uses, users may need to share private information (e.g., health records, contact details). To evaluate LLMs' ability to identify and redact such private information, prior work developed benchmarks (e.g., ConfAIde, PrivacyLens) with real-life scenarios. Using these benchmarks, researchers have found that LLMs sometimes fail to keep secrets private when responding to complex tasks (e.g., leaking employee salaries in meeting summaries). However, these evaluations rely on LLMs (proxy LLMs) to gauge compliance with privacy norms, overlooking real users' perceptions. Moreover, prior work primarily focused on the privacy-preservation quality of responses, without investigating nuanced differences in helpfulness. To understand how users perceive the privacy-preservation quality and helpfulness of LLM responses to privacy-sensitive scenarios, we conducted a user study with 94 participants using 90 scenarios from PrivacyLens. We found that, when evaluating identical responses to the same scenario, users showed low agreement with each other on the privacy-preservation quality and helpfulness of the LLM response. Further, we found high agreement among five proxy LLMs, while each individual LLM had low correlation with users' evaluations.
These results indicate that the privacy and helpfulness of LLM responses are often specific to individuals, and proxy LLMs are poor estimates of how real users would perceive these responses in privacy-sensitive scenarios. Our results suggest the need to conduct user-centered studies on measuring LLMs' ability to help users while preserving privacy. Additionally, future research could investigate ways to improve the alignment between proxy LLMs and users for better estimation of users' perceived privacy and utility.

Xizhi Wu,Madeline S. Kreider,Philip E. Empey,Chenyu Li,Yanshan Wang

Main category: cs.CL

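TL;DR: This study compares several natural language processing (NLP) methods for extracting fluoropyrimidine treatment and toxicity information from clinical notes. Large language model (LLM)-based methods, especially error-analysis prompting, performed best, reaching an F1 score of 1.000 and substantially outperforming traditional machine learning and deep learning models.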

Details Motivation: Fluoropyrimidines are widely used to treat colorectal and breast cancers but are associated with side effects such as hand-foot syndrome and cardiotoxicity, and toxicity information is often buried in unstructured clinical notes. Efficient NLP methods are therefore needed to extract it automatically in support of pharmacoepidemiology and oncology research. Method: Build a gold-standard dataset of 236 clinical notes annotated by domain experts for treatment regimens and toxicity categories; apply rule-based, machine learning (Random Forest, SVM, Logistic Regression), deep learning (BERT, ClinicalBERT), and LLM-based zero-shot and error-analysis prompting approaches to the extraction task; and evaluate with an 80:20 train-test split. Result: Among the LLM-based methods, error-analysis prompting reached F1=1.000 for both treatment and toxicity extraction, while zero-shot prompting reached F1=1.000 for treatment and F1=0.876 for toxicity. Logistic Regression and SVM ranked second (F1=0.937). BERT and ClinicalBERT performed worse (F1 of 0.873/0.839 and 0.873/0.886, respectively), and the rule-based methods served as the baseline (F1 of about 0.857-0.858). Conclusion: LLM-based NLP was the most effective approach for extracting fluoropyrimidine treatment and toxicity information and has the potential to advance oncology research and drug safety monitoring. Abstract: Objective: Fluoropyrimidines are widely prescribed for colorectal and breast cancers, but are associated with toxicities such as hand-foot syndrome and cardiotoxicity. Since toxicity documentation is often embedded in clinical notes, we aimed to develop and evaluate natural language processing (NLP) methods to extract treatment and toxicity information. Materials and Methods: We constructed a gold-standard dataset of 236 clinical notes from 204,165 adult oncology patients. Domain experts annotated categories related to treatment regimens and toxicities. We developed rule-based, machine learning-based (Random Forest, Support Vector Machine [SVM], Logistic Regression [LR]), deep learning-based (BERT, ClinicalBERT), and large language model (LLM)-based NLP approaches (zero-shot and error-analysis prompting). Models used an 80:20 train-test split. Results: Sufficient data existed to train and evaluate 5 annotated categories. Error-analysis prompting achieved optimal precision, recall, and F1 scores (F1=1.000) for treatment and toxicities extraction, whereas zero-shot prompting reached F1=1.000 for treatment and F1=0.876 for toxicities extraction. LR and SVM ranked second for toxicities (F1=0.937). Deep learning underperformed, with BERT (F1=0.873 treatment; F1=0.839 toxicities) and ClinicalBERT (F1=0.873 treatment; F1=0.886 toxicities). Rule-based methods served as our baseline with F1 scores of 0.857 in treatment and 0.858 in toxicities.
Discussion: LLM-based approaches outperformed all others, followed by machine learning methods. Machine and deep learning approaches were limited by small training data and showed limited generalizability, particularly for rare categories. Conclusion: LLM-based NLP most effectively extracted fluoropyrimidine treatment and toxicity information from clinical notes, and has strong potential to support oncology research and pharmacovigilance.
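Since every comparison above is reported as precision, recall, and F1, a small reference implementation may help when reading the numbers. This is a generic sketch for a single binary category; the function name and label encoding are illustrative, and the abstract does not say how scores were averaged across the five categories.

```python
def precision_recall_f1(gold, pred, positive=1):
    """Precision, recall, and F1 for one binary category."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [1, 1, 1, 0, 0]
pred = [1, 1, 0, 1, 0]
print(precision_recall_f1(gold, pred))
```

If the 80:20 split is taken over the 236 notes, the held-out set holds roughly 47 notes, so a single misclassified note can move F1 by a few points; this is worth keeping in mind when comparing scores that differ in the second decimal place.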

[60] Are Large Reasoning Models Good Translation Evaluators? Analysis and Performance Boost

Runzhe Zhan,Zhihong Huang,Xinyi Yang,Lidia S. Chao,Min Yang,Derek F. Wong

Main category: cs.CL

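TL;DR: This paper gives the first systematic analysis of large reasoning models (LRMs) as quality evaluators for machine translation (MT). It finds problems such as overthinking and biased scoring mechanisms, and proposes calibrating LRM thinking by training on synthetic, human-like thinking trajectories. Experiments show the method cuts the thinking budget by about 35x while improving evaluation performance for LRMs from 7B to 32B.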

Details Motivation: To explore the potential of large reasoning models (LRMs) for machine translation evaluation and to address the limits of existing automatic metrics in fine-grained quality judgment. Method: Build a dataset of synthetic, human-like thinking trajectories, use it to train and calibrate the LRMs' intermediate thinking, and evaluate on the WMT24 Metrics benchmarks. Result: The calibrated LRMs cut thinking budgets by about 35x and improve correlation with human judgments across model scales; for example, R1-Distill-Qwen-7B gains +8.7 correlation points. Conclusion: Efficiently calibrated LRMs lower inference cost while improving the accuracy and fine-grained analysis of automatic MT evaluation, showing strong potential as evaluators. Abstract: Recent advancements in large reasoning models (LRMs) have introduced an intermediate "thinking" process prior to generating final answers, improving their reasoning capabilities on complex downstream tasks. However, the potential of LRMs as evaluators for machine translation (MT) quality remains underexplored. We provide the first systematic analysis of LRM-as-a-judge in MT evaluation. We identify key challenges, revealing that LRMs require tailored evaluation materials, tend to "overthink" simpler instances, and have issues with scoring mechanisms that lead to overestimation. To address these, we propose to calibrate LRM thinking by training them on synthetic, human-like thinking trajectories. Our experiments on WMT24 Metrics benchmarks demonstrate that this approach reduces thinking budgets by ~35x while concurrently improving evaluation performance across different LRM scales from 7B to 32B (e.g., R1-Distill-Qwen-7B achieves a +8.7 correlation point improvement). These findings highlight the potential of efficiently calibrated LRMs to advance fine-grained automatic MT evaluation.

[61] A Use-Case Specific Dataset for Measuring Dimensions of Responsible Performance in LLM-generated Text

Alicia Sagae,Chia-Jung Lee,Sandeep Avula,Brandon Dang,Vanessa Murdock

Main category: cs.CL

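TL;DR: Proposes a new approach to evaluating large language models on Responsible AI dimensions such as fairness: a parameterized dataset built from a real-world use case and used to identify gaps in LLM quality, veracity, safety, and fairness.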

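Details Motivation: Existing LLM evaluation methods mostly target high-level tasks and do not meet the needs of evaluating Responsible AI dimensions such as fairness in a specific application, particularly because the importance of sensitive attributes differs between applications. Method: Build a dataset driven by a real-world application (generating a description from product features), parameterized by fairness attributes intersected with gendered adjectives and product categories, to produce a labeled prompt set, and use it to evaluate LLMs along several dimensions. Result: The dataset reveals LLM shortcomings in quality, veracity, safety, and fairness, and provides a reproducible evaluation resource. Conclusion: The work offers an extensible framework and a practical data resource for Responsible AI evaluation of LLMs, supporting fine-grained, application-specific evaluation. Abstract: Current methods for evaluating large language models (LLMs) typically focus on high-level tasks such as text generation, without targeting a particular AI application. This approach is not sufficient for evaluating LLMs for Responsible AI dimensions like fairness, since protected attributes that are highly relevant in one application may be less relevant in another. In this work, we construct a dataset that is driven by a real-world application (generate a plain-text product description, given a list of product features), parameterized by fairness attributes intersected with gendered adjectives and product categories, yielding a rich set of labeled prompts. We show how to use the data to identify quality, veracity, safety, and fairness gaps in LLMs, contributing a proposal for LLM evaluation paired with a concrete resource for the research community.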

[62] Alleviating Forgetfulness of Linear Attention by Hybrid Sparse Attention and Contextualized Learnable Token Eviction

Mutian He,Philip N. Garner

Main category: cs.CL

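TL;DR: Proposes hybrid models that combine sparse attention with a learnable token-eviction mechanism to alleviate the forgetfulness of linear-attention models on retrieval-intensive tasks while preserving their efficiency.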

Details Motivation: Linear-attention models have limited memory because of their fixed-size hidden state, so they tend to forget early information, which hurts retrieval-intensive tasks that need long-range dependencies. Method: Design a series of hybrid models that interleave linear attention with token mixers of intermediate time and space complexity, including sparse attention with token eviction and query-aware native sparse attention; propose a learnable token-eviction mechanism that, combined with sliding-window attention and a lightweight CNN, adaptively retains the critical KV-pairs for each head. Result: The methods perform well on retrieval-intensive benchmarks and mitigate forgetting while keeping the time and space complexity advantages of linear attention; efficient Triton kernels for sparse attention are provided. Conclusion: Learnable sparse attention can markedly improve linear-attention models on long-sequence and retrieval-intensive tasks without sacrificing efficiency. Abstract: Linear-attention models that compress the entire input sequence into a fixed-size recurrent state offer an efficient alternative to Transformers, but their finite memory induces forgetfulness that harms retrieval-intensive tasks. To mitigate the issue, we explore a series of hybrid models that restore direct access to past tokens. We interleave token mixers with intermediate time and space complexity between linear and full attention, including sparse attention with token eviction, and the query-aware native sparse attention. In particular, we propose a novel learnable token eviction approach. Combined with sliding-window attention, an end-to-end trainable lightweight CNN aggregates information from both past and future adjacent tokens to adaptively retain a limited set of critical KV-pairs per head, maintaining linear attention's constant time and space complexity. Efficient Triton kernels for the sparse attention mechanisms are provided. Empirical evaluations on retrieval-intensive benchmarks support the effectiveness of our approaches.
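The constant-memory argument can be illustrated with a toy cache policy: keep a sliding window of recent tokens plus a fixed budget of older tokens ranked by an importance score. In the sketch below the `scores` list is a placeholder for whatever the paper's learned CNN would output; the function name, the top-k selection rule, and the parameters are all assumptions, not the authors' mechanism.

```python
def retained_positions(scores, window, budget):
    """Positions kept for one head: the last `window` tokens plus the
    `budget` highest-scoring older tokens (scores are hypothetical)."""
    n = len(scores)
    split = max(0, n - window)
    recent = list(range(split, n))
    # Rank tokens that fell out of the window and keep only the top `budget`.
    kept = sorted(range(split), key=lambda i: scores[i], reverse=True)[:budget]
    return sorted(kept) + recent


scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.4]
print(retained_positions(scores, window=2, budget=2))
```

However long the sequence grows, the returned list never exceeds `window + budget` entries, which is the property that keeps per-step time and memory constant.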

[63] Simple Context Compression: Mean-Pooling and Multi-Ratio Training

Yair Feldman,Yoav Artzi

Main category: cs.CL

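TL;DR: Proposes a lightweight, simple mean-pooling method for context compression that outperforms the existing compression-tokens architecture across many settings and supports training for multiple compression ratios.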

Details Motivation: Efficient soft context compression is needed to reduce the computational cost of using long contexts in retrieval-augmented generation (RAG). Method: Use mean pooling to turn the input sequence into a shorter continuous representation, and train the same compressor to output multiple compression ratios. Result: Across in-domain and out-of-domain QA datasets, model families, scales, and compression ratios, mean pooling performs best, with a relatively small drop when trained for multiple compression ratios. Conclusion: Simple mean pooling is an efficient and robust context compression method, but the trade-offs across architectures and training regimes are nuanced, which reflects the complexity of the compression-method landscape. Abstract: A common strategy to reduce the computational costs of using long contexts in retrieval-augmented generation (RAG) with large language models (LLMs) is soft context compression, where the input sequence is transformed into a shorter continuous representation. We develop a lightweight and simple mean-pooling approach that consistently outperforms the widely used compression-tokens architecture, and study training the same compressor to output multiple compression ratios. We conduct extensive experiments across in-domain and out-of-domain QA datasets, as well as across model families, scales, and compression ratios. Overall, our simple mean-pooling approach achieves the strongest performance, with a relatively small drop when training for multiple compression ratios. More broadly though, across architectures and training regimes the trade-offs are more nuanced, illustrating the complex landscape of compression methods.
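As a rough picture of what mean-pooling compression does, the sketch below averages each run of `ratio` consecutive token vectors into one vector. Non-overlapping windows and pooling over plain Python lists are assumptions made for illustration; the abstract does not say where in the model the pooling is applied or how windows are formed.

```python
def mean_pool(vectors, ratio):
    """Compress a sequence of vectors by averaging each run of `ratio` vectors."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    pooled = []
    for start in range(0, len(vectors), ratio):
        window = vectors[start:start + ratio]  # the last window may be shorter
        dim = len(window[0])
        pooled.append([sum(v[d] for v in window) / len(window) for d in range(dim)])
    return pooled


tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(mean_pool(tokens, ratio=2))
```

Because `ratio` is an argument of the pooling step rather than a property of the model, the same function serves several compression ratios, which loosely mirrors the multi-ratio training idea in the abstract.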

[64] On the Detectability of LLM-Generated Text: What Exactly Is LLM-Generated Text?

Mingmeng Geng,Thierry Poibeau

Main category: cs.CL

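TL;DR: This paper discusses the challenges facing detection of text generated by large language models: the lack of a clear definition of "LLM-generated text", the diversity of usage scenarios, and the effect of human edits on LLM output all blur the boundary of what is being detected. Existing benchmarks and evaluation methods do not adequately cover real-world conditions, so detection results are often misunderstood and their significance is diminishing. Detectors are therefore useful only under specific conditions, and their results should be treated as references rather than decisive evidence.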

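Details Motivation: With the widespread use of large language models (LLMs), researchers have turned to detecting the text they generate, but there is no consistent, precise definition of the detection target, and real-world scenarios are complex and varied. This prompts the authors to reflect on the validity and limits of existing detection methods. Method: The paper critically examines current detection benchmarks and evaluation practices by analyzing the ambiguity of the detection target, the diversity of LLM usage scenarios, and the blurred text boundaries that result from human-machine collaboration. Result: The numerical results of current detectors are often misunderstood and their practical significance is limited; detection performance depends heavily on specific conditions and is hard to generalize to complex real-world scenarios. Conclusion: Existing methods for detecting LLM-generated text remain useful under specific conditions, but their results should be interpreted with caution and used only as references, not as decisive judgments. Abstract: With the widespread use of large language models (LLMs), many researchers have turned their attention to detecting text generated by them. However, there is no consistent or precise definition of their target, namely "LLM-generated text". Differences in usage scenarios and the diversity of LLMs further increase the difficulty of detection. What is commonly regarded as the detection target usually represents only a subset of the text that LLMs can potentially produce. Human edits to LLM outputs, together with the subtle influences that LLMs exert on their users, are blurring the line between LLM-generated and human-written text. Existing benchmarks and evaluation approaches do not adequately address the various conditions in real-world detector applications. Hence, the numerical results of detectors are often misunderstood, and their significance is diminishing. Therefore, detectors remain useful under specific conditions, but their results should be interpreted only as references rather than decisive indicators.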

cs.CV [Back]

[65] Fourier-Based GAN Fingerprint Detection using ResNet50

Sai Teja Erukude,Viswa Chaitanya Marella,Suhasnadh Reddy Veluru

Main category: cs.CV

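TL;DR: Proposes a method that combines frequency-domain analysis with deep learning, using a 2D DFT and ResNet50 to distinguish StyleGAN-generated images from real ones; it outperforms a model trained directly in the spatial domain.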

Details Motivation: To address the challenge that photorealistic images from Generative Adversarial Networks (GANs) pose to image forensics and to content authenticity in industrial systems. Method: Transform images into the frequency domain with a two-dimensional Discrete Fourier Transform (2D DFT) to expose subtle periodic artifacts, and train a ResNet50 network on the transformed images to separate real from synthetic ones. Result: The method reaches 92.8% accuracy and an AUC of 0.95 when detecting StyleGAN-generated images, significantly better than a model trained on raw spatial-domain images. Conclusion: GAN-generated images carry distinctive frequency-domain "fingerprints"; combining signal processing with deep learning can strengthen digital forensics and the trustworthiness of industrial AI systems. Abstract: The rapid rise of photorealistic images produced from Generative Adversarial Networks (GANs) poses a serious challenge for image forensics and industrial systems requiring reliable content authenticity. This paper uses frequency-domain analysis combined with deep learning to solve the problem of distinguishing StyleGAN-generated images from real ones. Specifically, a two-dimensional Discrete Fourier Transform (2D DFT) was applied to transform images into the Fourier domain, where subtle periodic artifacts become detectable. A ResNet50 neural network is trained on these transformed images to differentiate between real and synthetic ones. The experiments demonstrate that the frequency-domain model achieves 92.8 percent accuracy and an AUC of 0.95, significantly outperforming the equivalent model trained on raw spatial-domain images. These results indicate that the GAN-generated images have unique frequency-domain signatures or "fingerprints". The method proposed highlights the industrial potential of combining signal processing techniques and deep learning to enhance digital forensics and strengthen the trustworthiness of industrial AI systems.
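To see why a periodic artifact becomes easy to spot after a 2D DFT, consider a tiny striped image. The sketch below uses a naive, pure-Python DFT for clarity (a library FFT such as `numpy.fft.fft2` would be the practical choice), and the log-magnitude scaling is an assumption about preprocessing that the abstract does not spell out.

```python
import cmath
import math


def dft2(image):
    """Naive 2D discrete Fourier transform of a small grayscale image (list of rows)."""
    rows, cols = len(image), len(image[0])
    out = [[0j] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            total = 0j
            for x in range(rows):
                for y in range(cols):
                    angle = -2 * math.pi * (u * x / rows + v * y / cols)
                    total += image[x][y] * cmath.exp(1j * angle)
            out[u][v] = total
    return out


def log_magnitude(spectrum):
    """Log-scaled magnitude, a common way to turn a spectrum into a classifier input."""
    return [[math.log1p(abs(c)) for c in row] for row in spectrum]


# An 8x8 image with stripes of period 2: a toy stand-in for a periodic artifact.
img = [[y % 2 for y in range(8)] for _ in range(8)]
spec = dft2(img)
bins = [(abs(spec[u][v]), (u, v)) for u in range(8) for v in range(8) if (u, v) != (0, 0)]
print(max(bins)[1])  # location of the strongest non-DC frequency bin
```

The stripes are spread over every pixel in the spatial domain but collapse into a single bright bin in the spectrum, which is the kind of concentrated signature a frequency-domain classifier can pick up.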

[66] Transformed Multi-view 3D Shape Features with Contrastive Learning

Márcus Vinícius Lobo Costa,Sherlon Almeida da Silva,Bárbara Caroline Benato,Leo Sampaio Ferraz Ribeiro,Moacir Antonelli Ponti

Main category: cs.CV

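TL;DR: This paper studies 3D shape representation learning with Vision Transformers (ViTs) and contrastive learning. By combining ViTs' global shape understanding with contrastive learning's refinement of local features, it improves multi-view 3D analysis while reducing reliance on large amounts of labeled data.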

Details Motivation: Existing 3D shape representation methods depend on CNNs and large amounts of labeled data, struggle to capture key shape relationships, and are limited when recognizing 3D objects from 2D images, so a more effective learning framework is needed. Method: Use ViT-based architectures with supervised and self-supervised contrastive objectives for 3D shape representation learning, and validate with multi-view 3D analysis experiments on datasets such as ModelNet. Result: The approach reaches about 90.6% accuracy on ModelNet10, indicating that ViTs combined with contrastive learning improve 3D representation learning over conventional CNN methods. Conclusion: Combining ViTs with contrastive learning works well for 3D shape representation learning, overcoming the limits of CNNs and the dependence on extensive labeled data, and provides a unified, effective framework for 3D shape understanding. Abstract: This paper addresses the challenges in representation learning of 3D shape features by investigating state-of-the-art backbones paired with both contrastive supervised and self-supervised learning objectives. Computer vision methods struggle with recognizing 3D objects from 2D images, often requiring extensive labeled data and relying on Convolutional Neural Networks (CNNs) that may overlook crucial shape relationships. Our work demonstrates that Vision Transformer (ViT)-based architectures, when paired with modern contrastive objectives, achieve promising results in multi-view 3D analysis on our downstream tasks, unifying contrastive and 3D shape understanding pipelines. For example, supervised contrastive losses reached about 90.6% accuracy on ModelNet10. The use of ViTs and contrastive learning, leveraging ViTs' ability to understand overall shapes and contrastive learning's effectiveness, overcomes the need for extensive labeled data and the limitations of CNNs in capturing crucial shape relationships. The success stems from capturing global shape semantics via ViTs and refining local discriminative features through contrastive optimization. Importantly, our approach is empirical, as it is grounded on extensive experimental evaluation to validate the effectiveness of combining ViTs with contrastive objectives for 3D representation learning.
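The abstract mentions supervised contrastive losses without giving a formula. The sketch below follows the commonly used supervised contrastive (SupCon) formulation, in which each anchor is pulled toward all other samples with the same label; whether the paper uses exactly this variant, and the temperature value, are assumptions.

```python
import math


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over L2-normalised embeddings (SupCon-style sketch)."""
    normed = []
    for v in embeddings:
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        normed.append([x / norm for x in v])
    n = len(normed)
    total, anchors = 0.0, 0
    for i in range(n):
        # Temperature-scaled cosine similarity between anchor i and every other sample.
        sims = {j: sum(a * b for a, b in zip(normed[i], normed[j])) / temperature
                for j in range(n) if j != i}
        positives = [j for j in sims if labels[j] == labels[i]]
        if not positives:
            continue  # anchors without a same-label partner contribute nothing
        log_denom = math.log(sum(math.exp(s) for s in sims.values()))
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


views = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(supcon_loss(views, [0, 0, 1, 1]))  # labels agree with the geometry
print(supcon_loss(views, [0, 1, 0, 1]))  # labels cut across the geometry
```

In a multi-view setting, the rendered views of one object (or of one class) would share a label, so the loss pushes their embeddings together regardless of viewpoint.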

[67] FutrTrack: A Camera-LiDAR Fusion Transformer for 3D Multiple Object Tracking

Martha Teiko Teye,Ori Maoz,Matthias Rottmann

Main category: cs.CV

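TL;DR: FutrTrack is a camera-LiDAR multi-object tracking framework that uses a transformer-based smoother and a fusion-driven tracker. It performs well on nuScenes and KITTI, with few identity switches and high spatial consistency.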

Details Motivation: To improve the multi-object tracking performance of existing 3D detectors in multi-sensor settings and to reduce the identity switches and trajectory jitter caused by occlusion and viewpoint changes. Method: Propose the FutrTrack framework, consisting of a transformer-based temporal smoothing module and a multimodal BEV feature-fusion tracker that needs no explicit motion model and uses geometric and semantic cues to match and propagate identities across frames. Result: FutrTrack reaches 74.7 aMOTA on the nuScenes test set, clearly better than single-sensor methods, while reducing identity switches and improving trajectory stability. Conclusion: FutrTrack shows the importance of multimodal features for query-based transformer tracking and provides an efficient tracking framework that improves performance without pretraining. Abstract: We propose FutrTrack, a modular camera-LiDAR multi-object tracking framework that builds on existing 3D detectors by introducing a transformer-based smoother and a fusion-driven tracker. Inspired by query-based tracking frameworks, FutrTrack employs a multimodal two-stage transformer refinement and tracking pipeline. Our fusion tracker integrates bounding boxes with multimodal bird's-eye-view (BEV) fusion features from multiple cameras and LiDAR without the need for an explicit motion model. The tracker assigns and propagates identities across frames, leveraging both geometric and semantic cues for robust re-identification under occlusion and viewpoint changes. Prior to tracking, we apply a temporal smoother over a moving window to sequences of bounding boxes to refine trajectories, reduce jitter, and improve spatial consistency. Evaluated on nuScenes and KITTI, FutrTrack demonstrates that query-based transformer tracking methods benefit significantly from multimodal sensor features compared with previous single-sensor approaches. With an aMOTA of 74.7 on the nuScenes test set, FutrTrack achieves strong performance on 3D MOT benchmarks, reducing identity switches while maintaining competitive accuracy. Our approach provides an efficient framework for improving transformer-based trackers to compete with other neural-network-based methods even with limited data and without pretraining.

[68] Improving Predictive Confidence in Medical Imaging via Online Label Smoothing

Kushan Choudhury,Shubhrodeep Roy,Ankur Chanda,Shubhajit Biswas,Somenath Kuiry

Main category: cs.CV

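TL;DR: This study examines Online Label Smoothing (OLS) for medical image classification and finds that OLS improves classification accuracy and feature representation learning across several models.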

Details Motivation: Deep learning models perform well in medical image classification but often produce overconfident predictions, which undermines their reliability in critical healthcare settings. Conventional label smoothing ignores relationships between classes, which limits its benefit. Method: Apply Online Label Smoothing (OLS), which adjusts soft labels dynamically from the model's own prediction patterns during training, and evaluate it on the RadImageNet dataset with three widely used architectures: ResNet-50, MobileNetV2, and VGG-19. Result: OLS outperforms standard training methods (hard labels, conventional label smoothing, and teacher-free knowledge distillation) in both Top-1 and Top-5 accuracy and produces more compact, better-separated feature embeddings, indicating stronger representation learning. Conclusion: OLS improves predictive performance and model calibration and offers a practical, effective way to build trustworthy AI systems for medical imaging. Abstract: Deep learning models, especially convolutional neural networks, have achieved impressive results in medical image classification. However, these models often produce overconfident predictions, which can undermine their reliability in critical healthcare settings. While traditional label smoothing offers a simple way to reduce such overconfidence, it fails to consider relationships between classes by treating all non-target classes equally. In this study, we explore the use of Online Label Smoothing (OLS), a dynamic approach that adjusts soft labels throughout training based on the model's own prediction patterns. We evaluate OLS on the large-scale RadImageNet dataset using three widely used architectures: ResNet-50, MobileNetV2, and VGG-19. Our results show that OLS consistently improves both Top-1 and Top-5 classification accuracy compared to standard training methods, including hard labels, conventional label smoothing, and teacher-free knowledge distillation. In addition to accuracy gains, OLS leads to more compact and well-separated feature embeddings, indicating improved representation learning. These findings suggest that OLS not only strengthens predictive performance but also enhances calibration, making it a practical and effective solution for developing trustworthy AI systems in the medical imaging domain.
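One way to picture "soft labels adjusted from the model's own predictions" is the bookkeeping below: during an epoch, the predicted distributions of correctly classified samples are accumulated per target class, and their average becomes that class's soft label for the next epoch. This follows a common formulation of online label smoothing; the class and method names are hypothetical, and keeping the previous soft label when a class has no correct predictions is a simplification chosen here, not something stated in the abstract.

```python
class OnlineLabelSmoother:
    """Tracks per-class soft labels learned from the model's own predictions."""

    def __init__(self, num_classes):
        self.k = num_classes
        # Start from uniform soft labels before any predictions are seen.
        self.soft = [[1.0 / num_classes] * num_classes for _ in range(num_classes)]
        self._reset()

    def _reset(self):
        self._acc = [[0.0] * self.k for _ in range(self.k)]
        self._count = [0] * self.k

    def update(self, probs, target):
        """Accumulate a predicted distribution if the sample was classified correctly."""
        predicted = max(range(self.k), key=lambda c: probs[c])
        if predicted == target:
            for c in range(self.k):
                self._acc[target][c] += probs[c]
            self._count[target] += 1

    def end_epoch(self):
        """Turn this epoch's accumulated predictions into next epoch's soft labels."""
        for t in range(self.k):
            if self._count[t] > 0:
                self.soft[t] = [v / self._count[t] for v in self._acc[t]]
        self._reset()


smoother = OnlineLabelSmoother(num_classes=3)
smoother.update([0.7, 0.2, 0.1], target=0)
smoother.update([0.5, 0.3, 0.2], target=0)
smoother.update([0.1, 0.8, 0.1], target=0)  # misclassified, so ignored
smoother.end_epoch()
print(smoother.soft[0])
```

Unlike uniform label smoothing, the resulting soft label for class 0 puts more mass on class 1 than on class 2, reflecting which classes the model actually confuses.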

[69] A Unified Detection Pipeline for Robust Object Detection in Fisheye-Based Traffic Surveillance

Neema Jakisa Owor,Joshua Kofi Asamoah,Tanner Wambui Muturi,Anneliese Jakisa Owor,Blessing Agyei Kyem,Andrews Danyo,Yaw Adu-Gyamfi,Armstrong Aboah

Main category: cs.CV

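TL;DR: Proposes a detection framework for fisheye-camera traffic surveillance images. With pre- and post-processing steps and an ensemble of several state-of-the-art detectors, it improves detection consistency in heavily distorted regions and achieves an F1 score of 0.6366 on the 2025 AI City Challenge Track 4, ranking 8th.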

Details Motivation: Fisheye cameras provide wide-field traffic surveillance, but their strong radial distortion and non-uniform resolution challenge standard object detectors, especially near image boundaries. Method: Design a simple yet effective pre- and post-processing pipeline and combine the outputs of several state-of-the-art detection models through an ensemble strategy to improve overall detection accuracy. Result: The method achieves an F1 score of 0.6366 on the 2025 AI City Challenge Track 4, ranking 8th out of 62 teams. Conclusion: The proposed framework handles the distortion in fisheye images effectively and improves the robustness and consistency of object detection. Abstract: Fisheye cameras offer an efficient solution for wide-area traffic surveillance by capturing large fields of view from a single vantage point. However, the strong radial distortion and nonuniform resolution inherent in fisheye imagery introduce substantial challenges for standard object detectors, particularly near image boundaries where object appearance is severely degraded. In this work, we present a detection framework designed to operate robustly under these conditions. Our approach employs a simple yet effective pre- and post-processing pipeline that enhances detection consistency across the image, especially in regions affected by severe distortion. We train several state-of-the-art detection models on the fisheye traffic imagery and combine their outputs through an ensemble strategy to improve overall detection accuracy. Our method achieves an F1 score of 0.6366 on the 2025 AI City Challenge Track 4, placing 8th overall out of 62 teams. These results demonstrate the effectiveness of our framework in addressing issues inherent to fisheye imagery.

[70] Extreme Views: 3DGS Filter for Novel View Synthesis from Out-of-Distribution Camera Poses

Damian Bowness,Charalambos Poullis

Main category: cs.CV

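TL;DR: This paper proposes a real-time, render-aware filtering method that improves the visual quality of 3D Gaussian Splatting (3DGS) models in regions outside the training views. It uses sensitivity scores derived from intermediate gradients to suppress instabilities caused by anisotropic orientations, improving the realism and consistency of rendering in extrapolated regions.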

Details Motivation: When a 3DGS model is viewed from camera positions far outside the training data distribution, severe visual noise often appears, because the lack of training data in extrapolated regions makes the model's density, color, and geometry predictions uncertain. A method that handles this generative uncertainty effectively is therefore needed. Method: Propose a real-time, render-aware filtering method based on sensitivity scores from intermediate gradients, targeting instabilities caused by anisotropic orientations rather than conventional isotropic variance; the method integrates seamlessly into existing 3DGS rendering pipelines. Result: Experiments show that, compared with existing NeRF-based methods such as BayesRays, the method substantially improves visual quality, realism, and consistency, and integrates in real time without extra post-hoc retraining or fine-tuning. Conclusion: The proposed filter effectively mitigates generative uncertainty in 3DGS models outside the training views, maintains high visual fidelity as users navigate freely, and offers good real-time performance and compatibility. Abstract: When viewing a 3D Gaussian Splatting (3DGS) model from camera positions significantly outside the training data distribution, substantial visual noise commonly occurs. These artifacts result from the lack of training data in these extrapolated regions, leading to uncertain density, color, and geometry predictions from the model. To address this issue, we propose a novel real-time render-aware filtering method. Our approach leverages sensitivity scores derived from intermediate gradients, explicitly targeting instabilities caused by anisotropic orientations rather than isotropic variance. This filtering method directly addresses the core issue of generative uncertainty, allowing 3D reconstruction systems to maintain high visual fidelity even when users freely navigate outside the original training viewpoints. Experimental evaluation demonstrates that our method substantially improves visual quality, realism, and consistency compared to existing Neural Radiance Field (NeRF)-based approaches such as BayesRays. Critically, our filter seamlessly integrates into existing 3DGS rendering pipelines in real-time, unlike methods that require extensive post-hoc retraining or fine-tuning. Code and results at https://damian-bowness.github.io/EV3DGS

[71] BrainPuzzle: Hybrid Physics and Data-Driven Reconstruction for Transcranial Ultrasound Tomography

Shengyu Chen,Shihang Feng,Yi Luo,Xiaowei Jia,Youzuo Lin

Main category: cs.CV

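TL;DR: Proposes BrainPuzzle, a two-stage framework that combines physical modeling with machine learning to achieve high-accuracy transcranial ultrasound speed-of-sound imaging.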

Details Motivation: Conventional full-waveform inversion is limited by weak signals and incomplete spatial coverage, while purely data-driven methods produce quantitatively biased results under low signal-to-noise and sparse-aperture conditions. Method: In the first stage, reverse time migration produces migration fragments that preserve structural detail; in the second stage, a transformer-based super-resolution encoder-decoder with a graph-based attention unit fuses the fragments to reconstruct an accurate speed-of-sound map. Result: Experiments on two synthetic datasets show that BrainPuzzle outperforms existing methods in speed-of-sound reconstruction accuracy and image completeness. Conclusion: By combining physical models with deep learning, BrainPuzzle improves the feasibility and accuracy of quantitative transcranial ultrasound brain imaging. Abstract: Ultrasound brain imaging remains challenging due to the large difference in sound speed between the skull and brain tissues and the difficulty of coupling large probes to the skull. This work aims to achieve quantitative transcranial ultrasound by reconstructing an accurate speed-of-sound (SoS) map of the brain. Traditional physics-based full-waveform inversion (FWI) is limited by weak signals caused by skull-induced attenuation, mode conversion, and phase aberration, as well as incomplete spatial coverage since full-aperture arrays are clinically impractical. In contrast, purely data-driven methods that learn directly from raw ultrasound data often fail to model the complex nonlinear and nonlocal wave propagation through bone, leading to anatomically plausible but quantitatively biased SoS maps under low signal-to-noise and sparse-aperture conditions. To address these issues, we propose BrainPuzzle, a hybrid two-stage framework that combines physical modeling with machine learning. In the first stage, reverse time migration (time-reversal acoustics) is applied to multi-angle acquisitions to produce migration fragments that preserve structural details even under low SNR. In the second stage, a transformer-based super-resolution encoder-decoder with a graph-based attention unit (GAU) fuses these fragments into a coherent and quantitatively accurate SoS image. A partial-array acquisition strategy using a movable low-count transducer set improves feasibility and coupling, while the hybrid algorithm compensates for the missing aperture.
Experiments on two synthetic datasets show that BrainPuzzle achieves superior SoS reconstruction accuracy and image completeness, demonstrating its potential for advancing quantitative ultrasound brain imaging.

[72] Exposing Blindspots: Cultural Bias Evaluation in Generative Image Models

Huichan Seo,Sieun Choi,Minki Hong,Yi Zhou,Junseo Kim,Lukman Ismaila,Naome Etori,Mehul Agarwal,Zhixuan Liu,Jihie Kim,Jean Oh

Main category: cs.CV

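TL;DR: This study proposes a unified cross-country, cross-era evaluation framework for diagnosing cultural bias in text-to-image (T2I) and image-to-image (I2I) generative models. It finds that current models default to Global-North, modern-leaning visuals and lose cultural fidelity during editing.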

Details Motivation: Existing generative models often overlook or misrepresent cultural diversity, and cultural bias in image-to-image editing in particular has not been evaluated systematically, so a standardized, reproducible framework is needed to expose and track it. Method: The study designs a cultural taxonomy covering six countries, 8 categories, and 36 subcategories, combines it with era-aware prompts, and evaluates T2I generation and I2I editing under one protocol. Evaluation combines automatic metrics, a culture-aware retrieval-augmented VQA, and human judgments from native expert reviewers; all data and configurations are released. Result: (1) Under country-agnostic prompts, models lean toward Global-North, modern styles and flatten cross-country differences; (2) iterative I2I editing lowers cultural fidelity even when conventional metrics stay flat or improve; (3) I2I models add only superficial cultural elements and lack era consistency and contextual understanding, often retaining source identity for Global-South targets. Conclusion: Current generative image models remain unreliable for culture-sensitive editing. By releasing standardized data, prompts, and human evaluation protocols, the paper establishes a reproducible, culture-centered benchmark for diagnosing and improving cultural bias. Abstract: Generative image models produce striking visuals yet often misrepresent culture. Prior work has examined cultural bias mainly in text-to-image (T2I) systems, leaving image-to-image (I2I) editors underexplored. We bridge this gap with a unified evaluation across six countries, an 8-category/36-subcategory schema, and era-aware prompts, auditing both T2I generation and I2I editing under a standardized protocol that yields comparable diagnostics. Using open models with fixed settings, we derive cross-country, cross-era, and cross-category evaluations. Our framework combines standard automatic metrics, a culture-aware retrieval-augmented VQA, and expert human judgments collected from native reviewers. To enable reproducibility, we release the complete image corpus, prompts, and configurations. Our study reveals three findings: (1) under country-agnostic prompts, models default to Global-North, modern-leaning depictions that flatten cross-country distinctions; (2) iterative I2I editing erodes cultural fidelity even when conventional metrics remain flat or improve; and (3) I2I models apply superficial cues (palette shifts, generic props) rather than era-consistent, context-aware changes, often retaining source identity for Global-South targets. These results highlight that culture-sensitive edits remain unreliable in current systems. By releasing standardized data, prompts, and human evaluation protocols, we provide a reproducible, culture-centered benchmark for diagnosing and tracking cultural bias in generative image models.

[73] Filter-Based Reconstruction of Images from Events

Bernd Pfrommer

Main category: cs.CV

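TL;DR: This paper presents FIBAR, an asynchronous, filter-based method for reconstructing intensity images from the event stream of a moving event camera. It integrates event-induced brightness changes with an IIR filter, detects and handles "stale" pixels with a novel algorithm, and suppresses noise with Gaussian blurring. FIBAR is fully asynchronous, can output an image at any time, and processes up to 140 million events/s on an ordinary laptop CPU. Its reconstructions are noisier than those of neural network methods such as FireNet and suffer from ghosting, but they suffice for tasks such as fiducial marker detection.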

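Details Motivation: Existing neural network-based image reconstruction methods for event cameras usually depend on GPUs, are computationally expensive, and are hard to deploy on resource-constrained platforms. The paper aims for a lighter, more efficient asynchronous method that runs in real time on a CPU and suits low-power or embedded settings. Method: The FIlter Based Asynchronous Reconstruction (FIBAR) method (1) integrates the brightness changes signaled by events with a temporal digital IIR filter; (2) detects "stale" pixels by regulating a window of recently updated pixels; (3) applies Gaussian blur to stale pixels, on the assumption that under camera motion the absence of events implies a low image gradient; and (4) runs fully asynchronously, permitting image read-out at an arbitrary time. Result: On a modern laptop CPU, FIBAR processes about 42 million events/s with spatial filtering enabled and about 140 million events/s without it. Its reconstructions are noisier than those of neural network methods such as FireNet and show ghosting, but remain usable for tasks such as fiducial marker detection; qualitative experiments indicate it is practical in simple scenes. Conclusion: FIBAR is a simple, fast, fully asynchronous image reconstruction method for event cameras. Its image quality is below that of deep learning methods, but its low compute requirements and high throughput suit applications sensitive to latency and resource consumption, offering a workable option for lightweight event-vision systems. Abstract: Reconstructing an intensity image from the events of a moving event camera is a challenging task that is typically approached with neural networks deployed on graphics processing units. This paper presents a much simpler, FIlter Based Asynchronous Reconstruction method (FIBAR). First, intensity changes signaled by events are integrated with a temporal digital IIR filter. To reduce reconstruction noise, stale pixels are detected by a novel algorithm that regulates a window of recently updated pixels. Arguing that for a moving camera, the absence of events at a pixel location likely implies a low image gradient, stale pixels are then blurred with a Gaussian filter. In contrast to most existing methods, FIBAR is asynchronous and permits image read-out at an arbitrary time. It runs on a modern laptop CPU at about 42 (140) million events/s with (without) spatial filtering enabled. A few simple qualitative experiments are presented that show the difference in image reconstruction between FIBAR and a neural network-based approach (FireNet). FIBAR's reconstruction is noisier than neural network-based methods and suffers from ghost images. However, it is sufficient for certain tasks such as the detection of fiducial markers. Code is available at https://github.com/ros-event-camera/event_image_reconstruction_fibar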
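The first FIBAR step, integrating event-induced intensity changes with a temporal IIR filter, can be illustrated with a small sketch. This is not the FIBAR implementation: it assumes a first-order leaky integrator per pixel, and the names `integrate_events`, `contrast`, and `tau` and their default values are hypothetical. Stale-pixel detection and Gaussian blurring are not modeled.

```python
import math


def integrate_events(events, width, height, contrast=0.2, tau=0.05):
    """Integrate events into a per-pixel log-intensity estimate.

    events: iterable of (t, x, y, polarity), polarity in {+1, -1}, sorted by t.
    contrast, tau: illustrative constants (assumed, not from the paper).
    """
    state = [[0.0] * width for _ in range(height)]
    last_t = [[None] * width for _ in range(height)]
    for t, x, y, polarity in events:
        prev = last_t[y][x]
        if prev is not None:
            # First-order IIR behaviour: the old state decays exponentially
            # with the time elapsed since this pixel's previous event.
            state[y][x] *= math.exp(-(t - prev) / tau)
        # Each event contributes one contrast step in its polarity direction.
        state[y][x] += contrast * polarity
        last_t[y][x] = t
    return state
```

Because each pixel is updated only when its own event arrives, the state can be read out at any time, which mirrors the asynchronous read-out property described in the abstract.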

[74] Data-Adaptive Transformed Bilateral Tensor Low-Rank Representation for Clustering

Hui Chen,Xinjie Wang,Xianchao Xiu,Wanquan Liu

Main category: cs.CV

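TL;DR: Proposes TBTLRR, a new data-adaptive transformed bilateral tensor low-rank representation model that learns arbitrary unitary transforms to improve noise robustness in image clustering and to capture both global and local correlations.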

Details Motivation: Existing tensor low-rank representation methods rely on fixed transforms, are not robust to noise, and struggle to capture the global and local structure of image data. Method: The model introduces a data-adaptive tensor nuclear norm, combines it with a bilateral tensor structure, and adds $\ell_{1/2}$-norm and Frobenius-norm regularization terms; the nonconvex model is solved with an ADMM-based optimization algorithm. Result: The method outperforms current state-of-the-art approaches across multiple experiments, with stronger noise robustness and better clustering performance. Conclusion: TBTLRR captures global and local correlations in image data more effectively, markedly improves clustering, and has potential for practical use. Abstract: Tensor low-rank representation (TLRR) has demonstrated significant success in image clustering. However, most existing methods rely on fixed transformations and suffer from poor robustness to noise. In this paper, we propose a novel transformed bilateral tensor low-rank representation model called TBTLRR, which introduces a data-adaptive tensor nuclear norm by learning arbitrary unitary transforms, allowing for more effective capture of global correlations. In addition, by leveraging the bilateral structure of latent tensor data, TBTLRR is able to exploit local correlations between image samples and features. Furthermore, TBTLRR integrates the $\ell_{1/2}$-norm and Frobenius norm regularization terms for better dealing with complex noise in real-world scenarios. To solve the proposed nonconvex model, we develop an efficient optimization algorithm inspired by the alternating direction method of multipliers (ADMM) and provide theoretical convergence. Extensive experiments validate its superiority over the state-of-the-art methods in clustering. The code will be available at https://github.com/xianchaoxiu/TBTLRR.

[75] Endoshare: A Source Available Solution to De-Identify and Manage Surgical Videos

Lorenzo Arboit,Dennis N. Schneider,Britty Baby,Vinkle Srivastav,Pietro Mascagni,Nicolas Padoy

Main category: cs.CV

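TL;DR: Endoshare is a source-available, cross-platform application for merging, standardizing, and de-identifying endoscopic videos from minimally invasive surgery, supporting privacy-preserving surgical video management.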

Details Motivation: To advance surgical training, research, and quality improvement while addressing heterogeneous video formats and the privacy concerns of video sharing. Method: Development followed the software development life cycle with iterative, user-centered feedback, and design and testing combined usability evaluations by clinicians and computer scientists with the Technology Acceptance Model. Result: Prototype testing showed high usability (4.68/5 from clinicians and 4.03/5 from computer scientists). After refinement, surgeons reported perceived usefulness of 5.07/7, ease of use of 5.15/7, and a recommendation score of 9.20/10; processing time depended on processing mode, video duration, and hardware. Conclusion: Endoshare provides a transparent, user-friendly workflow for surgical video management and could become an alternative to proprietary systems, but compliance certification and interoperability still need to be validated. Abstract: Video-based assessment and surgical data science can advance surgical training, research, and quality improvement. However, widespread use remains limited by heterogeneous recording formats and privacy concerns associated with video sharing. We present Endoshare, a source-available, cross-platform application for merging, standardizing, and de-identifying endoscopic videos in minimally invasive surgery. Development followed the software development life cycle with iterative, user-centered feedback. During the analysis phase, an internal survey of clinicians and computer scientists based on ten usability heuristics identified key requirements that guided a privacy-by-design architecture. In the testing phase, an external clinician survey combined the same heuristics with Technology Acceptance Model constructs to assess usability and adoption, complemented by benchmarking across different hardware configurations. Four clinicians and four computer scientists initially tested the prototype, reporting high usability (4.68 +/- 0.40/5 and 4.03 +/- 0.51/5), with the lowest score (4.00 +/- 0.93/5) relating to label clarity. After refinement, the testing phase surveyed ten surgeons who reported high perceived usefulness (5.07 +/- 1.75/7), ease of use (5.15 +/- 1.71/7), heuristic usability (4.38 +/- 0.48/5), and strong recommendation (9.20 +/- 0.79/10). Processing time varied with processing mode, video duration (both p <= 0.001), and machine computational power (p = 0.041). Endoshare provides a transparent, user-friendly pipeline for standardized, privacy-preserving surgical video management. Compliance certification and broader interoperability validation are needed to establish it as a deployable alternative to proprietary systems. The software is available at https://camma-public.github.io/Endoshare/

[76] Attentive Convolution: Unifying the Expressivity of Self-Attention with Convolutional Efficiency

Hao Yu,Haoyu Chen,Yan Jiang,Wei Peng,Zhaodong Sun,Samuel Kaski,Guoying Zhao

Main category: cs.CV

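TL;DR: This paper proposes Attentive Convolution (ATConv), a new convolutional operator that brings the adaptive routing and lateral inhibition of self-attention into convolution. It overcomes the redundancy and static nature of conventional convolution and outperforms a range of self-attention mechanisms on image classification and generation tasks.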

Details Motivation: Self-attention is highly expressive but computationally expensive, whereas convolution is efficient but underperforms. The paper investigates why self-attention outperforms convolution and uses the answer to redesign convolution and narrow the gap. Method: It identifies two key factors behind self-attention's advantage, adaptive routing and lateral inhibition, proposes the Attentive Convolution (ATConv) operator on that basis, and validates it in a CNN family (AttNet) and in diffusion models. Result: Using only 3×3 kernels, ATConv outperforms various self-attention mechanisms on several vision tasks. AttNet reaches 84.4% Top-1 accuracy on ImageNet-1K with 27M parameters, and replacing SA with ATConv in SiT-XL/2 lowers FID by 0.15 with faster sampling. Conclusion: By borrowing the core principles of self-attention, convolution can be redesigned into an operator that is both efficient and highly expressive, supporting a renaissance of convolutional networks. Abstract: Self-attention (SA) has become the cornerstone of modern vision backbones for its powerful expressivity over traditional Convolutions (Conv). However, its quadratic complexity remains a critical bottleneck for practical applications. Given that Conv offers linear complexity and strong visual priors, continuing efforts have been made to promote the renaissance of Conv. However, a persistent performance chasm remains, highlighting that these modernizations have not yet captured the intrinsic expressivity that defines SA. In this paper, we re-examine the design of the CNNs, directed by a key question: what principles give SA its edge over Conv? As a result, we reveal two fundamental insights that challenge the long-standing design intuitions in prior research (e.g., Receptive field). The two findings are: (1) Adaptive routing: SA dynamically regulates positional information flow according to semantic content, whereas Conv employs static kernels uniformly across all positions. (2) Lateral inhibition: SA induces score competition among token weighting, effectively suppressing redundancy and sharpening representations, whereas Conv filters lack such inhibitory dynamics and exhibit considerable redundancy. Based on this, we propose Attentive Convolution (ATConv), a principled reformulation of the convolutional operator that intrinsically injects these principles. Interestingly, with only $3\times3$ kernels, ATConv consistently outperforms various SA mechanisms in fundamental vision tasks. Building on ATConv, we introduce AttNet, a CNN family that can attain 84.4% ImageNet-1K Top-1 accuracy with only 27M parameters. In diffusion-based image generation, replacing all SA with the proposed $3\times3$ ATConv in SiT-XL/2 reduces ImageNet FID by 0.15 in 400k steps with faster sampling. Code is available at: github.com/price112/Attentive-Convolution.

[77] StableSketcher: Enhancing Diffusion Model for Pixel-based Sketch Generation via Visual Question Answering Feedback

Jiho Park,Sieun Choi,Jaeyoon Seo,Jihie Kim

Main category: cs.CV

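TL;DR: This paper proposes StableSketcher, a framework that strengthens diffusion models for hand-drawn sketch generation by improving the variational autoencoder and adding a reward function based on visual question answering. It also introduces SketchDUO, the first dataset of instance-level sketches paired with captions and question-answer pairs.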

Details Motivation: Existing diffusion models still struggle to generate pixel-based hand-drawn sketches, particularly in prompt fidelity and semantic consistency. Method: The StableSketcher framework fine-tunes the variational autoencoder to optimize latent decoding and applies reinforcement learning with a reward function based on visual question answering to improve text-image alignment and semantic consistency. A new dataset, SketchDUO, is also constructed. Result: Experiments show that StableSketcher outperforms the Stable Diffusion baseline in stylistic fidelity and prompt alignment, and that SketchDUO fills the gap left by existing datasets that lack fine-grained annotation. Conclusion: StableSketcher improves the quality and semantic consistency of sketches generated by diffusion models, and SketchDUO provides an important resource for future sketch generation research. Abstract: Although recent advancements in diffusion models have significantly enriched the quality of generated images, challenges remain in synthesizing pixel-based human-drawn sketches, a representative example of abstract expression. To combat these challenges, we propose StableSketcher, a novel framework that empowers diffusion models to generate hand-drawn sketches with high prompt fidelity. Within this framework, we fine-tune the variational autoencoder to optimize latent decoding, enabling it to better capture the characteristics of sketches. In parallel, we integrate a new reward function for reinforcement learning based on visual question answering, which improves text-image alignment and semantic consistency. Extensive experiments demonstrate that StableSketcher generates sketches with improved stylistic fidelity, achieving better alignment with prompts compared to the Stable Diffusion baseline. Additionally, we introduce SketchDUO, to the best of our knowledge, the first dataset comprising instance-level sketches paired with captions and question-answer pairs, thereby addressing the limitations of existing datasets that rely on image-label pairs. Our code and dataset will be made publicly available upon acceptance.

[78] BIOCAP: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models

Ziheng Zhang,Xinyue Ma,Arpita Chowdhury,Elizabeth G. Campolongo,Matthew J. Thompson,Net Zhang,Samuel Stevens,Hilmar Lapp,Tanya Berger-Wolf,Yu Su,Wei-Lun Chao,Jianyang Gu

Main category: cs.CV

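TL;DR: This study uses descriptive captions as an additional supervision signal for biological multimodal foundation models. Synthetic captions are generated with guidance from Wikipedia-derived visual information and taxon-tailored format examples, and the resulting BIOCAP model performs strongly on species classification and text-image retrieval.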

Details Motivation: Biology lacks large-scale, instance-specific natural language annotation, which limits the use of multimodal foundation models. The paper explores descriptive captions as a complementary supervision signal to better align images and text in a shared latent semantic space. Method: Multimodal large language models (MLLMs) generate synthetic descriptive captions, guided by Wikipedia-derived visual information and taxon-specific format examples to reduce hallucination. These captions are used to train BIOCAP (BIOCLIP with Captions) for image-text alignment. Result: BIOCAP performs strongly on species classification and text-image retrieval, confirming that descriptive captions bridge biological images and multimodal models more effectively than simple labels. Conclusion: Descriptive captions are an effective way to improve biological multimodal foundation models, and synthetic caption generation can ease annotation scarcity and advance multimodal learning in biology. Abstract: This work investigates descriptive captions as an additional source of supervision for biological multimodal foundation models. Images and captions can be viewed as complementary samples from the latent morphospace of a species, each capturing certain biological traits. Incorporating captions during training encourages alignment with this shared latent structure, emphasizing potentially diagnostic characters while suppressing spurious correlations. The main challenge, however, lies in obtaining faithful, instance-specific captions at scale. This requirement has limited the utilization of natural language supervision in organismal biology compared with many other scientific domains. We fill this gap by generating synthetic captions with multimodal large language models (MLLMs), guided by Wikipedia-derived visual information and taxon-tailored format examples. These domain-specific contexts help reduce hallucination and yield accurate, instance-based descriptive captions. Using these captions, we train BIOCAP (i.e., BIOCLIP with Captions), a biological foundation model that captures rich semantics and achieves strong performance in species classification and text-image retrieval. These results demonstrate the value of descriptive captions beyond labels in bridging biological images with multimodal foundation models.

[79] Physics-Guided Fusion for Robust 3D Tracking of Fast Moving Small Objects

Prithvi Raj Singh,Raju Gottumukkala,Anthony S. Maida,Alan B. Barhorst,Vijaya Gopu

Main category: cs.CV

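TL;DR: This paper proposes a system that combines deep learning-based detection with a physics-based tracking algorithm for 3D detection and tracking of fast-moving small objects with an RGB-D camera, clearly outperforming traditional methods.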

Details Motivation: Fast-moving small objects remain a hard problem in computer vision research, and detection and tracking performance is inadequate in complex scenes with occlusion and rapid direction changes. Method: Deep learning is used for object detection, and a physics-based tracking algorithm built on kinematic equations of motion is combined with an outlier detection and correction module to achieve robust tracking in 3D space. Result: On a custom racquetball dataset, the system reduces Average Displacement Error by up to 70% compared with Kalman filter-based trackers. Conclusion: By fusing physical models with deep learning, the system improves perception of high-speed small targets and suits real-time 3D tracking on autonomous robot platforms. Abstract: While computer vision has advanced considerably for general object detection and tracking, the specific problem of fast-moving tiny objects remains underexplored. This paper addresses the significant challenge of detecting and tracking rapidly moving small objects using an RGB-D camera. Our novel system combines deep learning-based detection with physics-based tracking to overcome the limitations of existing approaches. Our contributions include: (1) a comprehensive system design for object detection and tracking of fast-moving small objects in 3D space, (2) an innovative physics-based tracking algorithm that integrates kinematics motion equations to handle outliers and missed detections, and (3) an outlier detection and correction module that significantly improves tracking performance in challenging scenarios such as occlusions and rapid direction changes. We evaluated our proposed system on a custom racquetball dataset. Our evaluation shows our system surpassing Kalman filter-based trackers with up to 70% less Average Displacement Error. Our system has significant applications for improving robot perception on autonomous platforms and demonstrates the effectiveness of combining physics-based models with deep learning approaches for real-time 3D detection and tracking of challenging small objects.
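The idea of using kinematic equations to handle outliers and missed detections can be sketched as follows. This is an illustration under stated assumptions, not the authors' algorithm: it assumes constant gravitational acceleration along a z-up axis, a fixed distance gate, and hypothetical names (`predict`, `track_step`, `gate`).

```python
import math

GRAVITY = (0.0, 0.0, -9.81)  # m/s^2, assuming the z axis points up


def predict(position, velocity, dt, accel=GRAVITY):
    """Constant-acceleration kinematics: p + v*dt + 0.5*a*dt^2 and v + a*dt."""
    new_pos = tuple(p + v * dt + 0.5 * a * dt * dt
                    for p, v, a in zip(position, velocity, accel))
    new_vel = tuple(v + a * dt for v, a in zip(velocity, accel))
    return new_pos, new_vel


def track_step(position, velocity, detection, dt, gate=0.5, accel=GRAVITY):
    """Advance the track by one frame.

    detection: measured (x, y, z) or None for a missed detection.
    gate: maximum allowed distance in metres between prediction and
          detection (a hypothetical threshold).
    Returns (position, velocity, used_detection).
    """
    pred_pos, pred_vel = predict(position, velocity, dt, accel)
    if detection is None or math.dist(pred_pos, detection) > gate:
        # Missed or outlying detection: fall back to the physics prediction.
        return pred_pos, pred_vel, False
    # Accepted detection: recover the end-of-interval velocity, which is
    # exact for constant acceleration.
    new_vel = tuple((d - p) / dt + 0.5 * a * dt
                    for d, p, a in zip(detection, position, accel))
    return tuple(detection), new_vel, True
```

A real system would also need to model bounces and drag, which this sketch ignores.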

[80] Inverse Image-Based Rendering for Light Field Generation from Single Images

Hyunjun Jung,Hae-Gon Jeon

Main category: cs.CV

TL;DR: This paper proposes inverse image-based rendering, a new method that generates a light field from a single image without specialized devices or heavy computation.

Details Motivation: To broaden the use of light fields for scene representation by removing their dependence on specialized devices and high computational cost. Method: A neural rendering pipeline stores the light flow of source rays from the input image, models the relationships among rays with cross-attention, and predicts the color of a target ray at an arbitrary viewpoint. Consistent novel views are generated by iteratively updating the set of source rays. Result: The method performs well on several challenging datasets without retraining or fine-tuning and outperforms current state-of-the-art novel view synthesis methods. Conclusion: Inverse image-based rendering offers an efficient, general solution for generating light fields from a single image and improves the applicability of light field technology. Abstract: A concept of light-fields computed from multiple view images on regular grids has proven its benefit for scene representations, and supported realistic renderings of novel views and photographic effects such as refocusing and shallow depth of field. In spite of its effectiveness of light flow computations, obtaining light fields requires either computational costs or specialized devices like a bulky camera setup and a specialized microlens array. In an effort to broaden its benefit and applicability, in this paper, we propose a novel view synthesis method for light field generation from only single images, named inverse image-based rendering. Unlike previous attempts to implicitly rebuild 3D geometry or to explicitly represent objective scenes, our method reconstructs light flows in a space from image pixels, which behaves in the opposite way to image-based rendering. To accomplish this, we design a neural rendering pipeline to render a target ray in an arbitrary viewpoint. Our neural renderer first stores the light flow of source rays from the input image, then computes the relationships among them through cross-attention, and finally predicts the color of the target ray based on these relationships. After the rendering pipeline generates the first novel view from a single input image, the generated out-of-view contents are updated to the set of source rays. This procedure is iteratively performed while ensuring the consistent generation of occluded contents. We demonstrate that our inverse image-based rendering works well with various challenging datasets without any retraining or finetuning after being trained once on a synthetic dataset, and outperforms relevant state-of-the-art novel view synthesis methods.

[81] Revisiting Logit Distributions for Reliable Out-of-Distribution Detection

Jiachen Liang,Ruibing Hou,Minyang Hu,Hong Chang,Shiguang Shan,Xilin Chen

Main category: cs.CV

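TL;DR: Proposes LogitGap, a new post-hoc OOD detection method that uses the relationship between the maximum logit and the other logits to better separate ID and OOD samples, and further improves performance by automatically selecting the most informative subset of logits.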

Details Motivation: Existing post-hoc OOD detection methods do not fully use the rich information in the model's logit space, which limits detection quality. Method: LogitGap scores samples using the relationship between the maximum logit and the remaining logits, and a training-free strategy automatically identifies the most informative subset of logits to refine the score. Result: Extensive experiments on vision-language and vision-only models show that LogitGap reaches state-of-the-art performance across diverse OOD detection scenarios and benchmarks. Conclusion: By exploiting the structure of the logit space and focusing on the key information, LogitGap clearly improves post-hoc OOD detection and is general and practical. Abstract: Out-of-distribution (OOD) detection is critical for ensuring the reliability of deep learning models in open-world applications. While post-hoc methods are favored for their efficiency and ease of deployment, existing approaches often underexploit the rich information embedded in the model's logits space. In this paper, we propose LogitGap, a novel post-hoc OOD detection method that explicitly exploits the relationship between the maximum logit and the remaining logits to enhance the separability between in-distribution (ID) and OOD samples. To further improve its effectiveness, we refine LogitGap by focusing on a more compact and informative subset of the logit space. Specifically, we introduce a training-free strategy that automatically identifies the most informative logits for scoring. We provide both theoretical analysis and empirical evidence to validate the effectiveness of our approach. Extensive experiments on both vision-language and vision-only models demonstrate that LogitGap consistently achieves state-of-the-art performance across diverse OOD detection scenarios and benchmarks. Code is available at https://github.com/GIT-LJc/LogitGap.
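A score built on the gap between the maximum logit and the remaining logits might look like the sketch below. The abstract does not state the exact formula, so the "max minus mean of the rest" form is an assumption, and the fixed `k` stands in for the paper's automatic, training-free subset selection, which is not reproduced here. The function name `logit_gap_score` is hypothetical.

```python
def logit_gap_score(logits, k=None):
    """Gap between the largest logit and the mean of the next-largest ones.

    logits: sequence of at least two floats for one sample.
    k: number of runner-up logits to compare against; None uses all of them.
       A fixed k is an assumption made for this sketch.
    Higher scores suggest a more in-distribution-like prediction.
    """
    ordered = sorted(logits, reverse=True)
    top, rest = ordered[0], ordered[1:]
    if k is not None:
        rest = rest[:k]
    if not rest:
        raise ValueError("need at least two logits (and k >= 1)")
    return top - sum(rest) / len(rest)
```

In use, a threshold on the score would be chosen on held-out ID data, and samples scoring below it would be flagged as OOD.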

[82] PartNeXt: A Next-Generation Dataset for Fine-Grained and Hierarchical 3D Part Understanding

Penghao Wang,Yiyang He,Xin Lv,Yukai Zhou,Lan Xu,Jingyi Yu,Jiayuan Gu

Main category: cs.CV

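TL;DR: PartNeXt is a large-scale, high-quality, textured dataset for 3D part understanding, with over 23,000 models and fine-grained hierarchical annotations. It supports tasks such as fine-grained part segmentation and 3D part question answering, improves the performance of existing methods, and advances research on structured 3D understanding.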

Details Motivation: Existing 3D part datasets such as PartNet rely on untextured geometry and expert annotation, which limits scalability and usability and makes it hard to support fine-grained and open-vocabulary part understanding. Method: The PartNeXt dataset contains over 23,000 textured 3D models across 50 categories with fine-grained, hierarchical part annotations. Two benchmark tasks are defined, class-agnostic part segmentation and 3D part question answering, and the dataset is validated by training Point-SAM. Result: In class-agnostic part segmentation, state-of-the-art methods (e.g., PartField, SAMPart3D) perform poorly on fine-grained and leaf-level parts. The 3D part question answering task reveals significant weaknesses of 3D-LLMs in open-vocabulary part grounding. Training Point-SAM on PartNeXt yields substantial gains over training on PartNet. Conclusion: Through scalable annotation, texture-aware labels, and multi-task evaluation, PartNeXt provides a better data foundation for structured 3D understanding and advances part-level understanding in 3D vision, graphics, and robotics. Abstract: Understanding objects at the level of their constituent parts is fundamental to advancing computer vision, graphics, and robotics. While datasets like PartNet have driven progress in 3D part understanding, their reliance on untextured geometries and expert-dependent annotation limits scalability and usability. We introduce PartNeXt, a next-generation dataset addressing these gaps with over 23,000 high-quality, textured 3D models annotated with fine-grained, hierarchical part labels across 50 categories. We benchmark PartNeXt on two tasks: (1) class-agnostic part segmentation, where state-of-the-art methods (e.g., PartField, SAMPart3D) struggle with fine-grained and leaf-level parts, and (2) 3D part-centric question answering, a new benchmark for 3D-LLMs that reveals significant gaps in open-vocabulary part grounding. Additionally, training Point-SAM on PartNeXt yields substantial gains over PartNet, underscoring the dataset's superior quality and diversity. By combining scalable annotation, texture-aware labels, and multi-task evaluation, PartNeXt opens new avenues for research in structured 3D understanding.

[83] Monocular Visual 8D Pose Estimation for Articulated Bicycles and Cyclists

Eduardo R. Corral-Soto,Yang Liu,Yuan Ren,Bai Dongfeng,Liu Bingbing

Main category: cs.CV

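TL;DR: This paper proposes a method for category-level 8D pose estimation of bicycles and cyclists from a single RGB image. Besides the bicycle's 3D translation and rotation, it estimates the rotation of the handlebars and pedals relative to the body frame, enabling more precise prediction of the cyclist's travel direction and behavior.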

Details Motivation: In autonomous driving, accurate cyclist pose estimation is essential for judging crossing intention, predicting behavior, and avoiding collisions. Traditional 6D pose estimation cannot handle the articulated motion of bicycle parts such as the handlebars and pedals, so the 3D bounding box and the actual travel direction can disagree. Method: A model jointly estimates the 8D pose and 3D keypoints and is trained on a mix of synthetic and real images, allowing it to estimate the rotation angles of the handlebars and pedals. Result: Experiments show that the method estimates the 8D pose parameters well and is competitive with existing 6D pose estimators based on rigid templates. Conclusion: By adding two articulation angle parameters, the method improves pose estimation for non-rigid bicycles and helps predict cyclist intention and travel direction more accurately. Abstract: In Autonomous Driving, cyclists belong to the safety-critical class of Vulnerable Road Users (VRU), and accurate estimation of their pose is critical for cyclist crossing intention classification, behavior prediction, and collision avoidance. Unlike rigid objects, articulated bicycles are composed of movable rigid parts linked by joints and constrained by a kinematic structure. 6D pose methods can estimate the 3D rotation and translation of rigid bicycles, but 6D becomes insufficient when the steering/pedals angles of the bicycle vary. That is because: 1) varying the articulated pose of the bicycle causes its 3D bounding box to vary as well, and 2) the 3D box orientation is not necessarily aligned to the orientation of the steering which determines the actual intended travel direction. In this work, we introduce a method for category-level 8D pose estimation for articulated bicycles and cyclists from a single RGB image. Besides being able to estimate the 3D translation and rotation of a bicycle from a single image, our method also estimates the rotations of its steering handles and pedals with respect to the bicycle body frame. These two new parameters enable the estimation of a more fine-grained bicycle pose state and travel direction. Our proposed model jointly estimates the 8D pose and the 3D Keypoints of articulated bicycles, and trains with a mix of synthetic and real image data to generalize on real images. We include an evaluation section where we evaluate the accuracy of our estimated 8D pose parameters, and our method shows promising results by achieving competitive scores when compared against state-of-the-art category-level 6D pose estimators that use rigid canonical object templates for matching.

[84] TOMCAT: Test-time Comprehensive Knowledge Accumulation for Compositional Zero-Shot Learning

Xudong Yan,Songhe Feng

Main category: cs.CV

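TL;DR: Proposes TOMCAT, a new method that accumulates multimodal knowledge from unsupervised data at test time and adaptively updates multimodal prototypes to handle distribution shift in the label space, achieving state-of-the-art performance on four benchmark datasets.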

Details Motivation: Existing CZSL methods degrade because the test set contains unseen compositions recombined from attributes and objects, which shifts the label-space distribution. Method: Knowledge in the textual and visual modalities is accumulated from unsupervised data to update multimodal prototypes at test time; an adaptive update weight controls how far the prototypes are adjusted; a dynamic priority queue stores high-confidence images to provide historical visual knowledge; and textual and visual prototypes are aligned through multimodal collaborative representation learning. Result: The method reaches state-of-the-art results on four benchmark datasets in both closed-world and open-world settings. Conclusion: The method mitigates test-time distribution shift and improves the generalization of CZSL models. Abstract: Compositional Zero-Shot Learning (CZSL) aims to recognize novel attribute-object compositions based on the knowledge learned from seen ones. Existing methods suffer from performance degradation caused by the distribution shift of label space at test time, which stems from the inclusion of unseen compositions recombined from attributes and objects. To overcome the challenge, we propose a novel approach that accumulates comprehensive knowledge in both textual and visual modalities from unsupervised data to update multimodal prototypes at test time. Building on this, we further design an adaptive update weight to control the degree of prototype adjustment, enabling the model to flexibly adapt to distribution shift during testing. Moreover, a dynamic priority queue is introduced that stores high-confidence images to acquire visual knowledge from historical images for inference. Considering the semantic consistency of multimodal knowledge, we align textual and visual prototypes by multimodal collaborative representation learning. Extensive experiments indicate that our approach achieves state-of-the-art performance on four benchmark datasets under both closed-world and open-world settings. Code will be available at https://github.com/xud-yan/TOMCAT .
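Two of the ingredients named above, a priority queue of high-confidence samples and a weighted prototype update, can be sketched in a few lines. This is a loose illustration, not the TOMCAT method: the bounded min-heap, the convex-combination update, and the names `ConfidenceQueue` and `update_prototype` are all assumptions, and how the adaptive weight is computed is left to the caller.

```python
import heapq


class ConfidenceQueue:
    """Keep the `capacity` highest-confidence features seen so far."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []    # min-heap of (confidence, insertion_index, feature)
        self._count = 0    # tie-breaker so features are never compared

    def push(self, confidence, feature):
        item = (confidence, self._count, feature)
        self._count += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif confidence > self._heap[0][0]:
            # Evict the current lowest-confidence entry.
            heapq.heapreplace(self._heap, item)

    def features(self):
        return [feature for _, _, feature in self._heap]


def update_prototype(prototype, features, weight):
    """Move the prototype toward the mean stored feature by `weight` in [0, 1]."""
    if not features:
        return list(prototype)
    mean = [sum(f[i] for f in features) / len(features)
            for i in range(len(prototype))]
    return [(1.0 - weight) * p + weight * m for p, m in zip(prototype, mean)]
```

A small `weight` keeps the prototype close to its original value, while a larger one lets test-time evidence dominate.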

[85] IB-GAN: Disentangled Representation Learning with Information Bottleneck Generative Adversarial Networks

Insu Jeon,Wonkwang Lee,Myeongjang Pyeon,Gunhee Kim

Main category: cs.CV

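TL;DR: Proposes IB-GAN, a new GAN model based on the Information Bottleneck framework for unsupervised disentangled representation learning. It outperforms InfoGAN in disentanglement, is competitive with β-VAE, and often generates higher-quality samples than both.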

Details Motivation: To bring the Information Bottleneck (IB) framework into GAN optimization for more effective disentangled representation learning. Method: An intermediate stochastic layer is added to the GAN generator to constrain the mutual information between the input and the output, and it is trained jointly with the generator end to end as a learnable latent distribution. Result: On dSprites and Color-dSprites, IB-GAN achieves disentanglement scores comparable to or better than β-VAE, and on CelebA and 3D Chairs its FID scores are better than those of β-VAE and InfoGAN, indicating higher sample quality and diversity. Conclusion: By introducing an information bottleneck, IB-GAN achieves a disentangled and interpretable latent space and outperforms existing methods in disentanglement and generation quality. Abstract: We propose a new GAN-based unsupervised model for disentangled representation learning. The new model is discovered in an attempt to apply the Information Bottleneck (IB) framework to the optimization of GAN, thereby named IB-GAN. The architecture of IB-GAN is partially similar to that of InfoGAN but has a critical difference; an intermediate layer of the generator is leveraged to constrain the mutual information between the input and the generated output. The intermediate stochastic layer can serve as a learnable latent distribution that is trained with the generator jointly in an end-to-end fashion. As a result, the generator of IB-GAN can harness the latent space in a disentangled and interpretable manner. With the experiments on dSprites and Color-dSprites dataset, we demonstrate that IB-GAN achieves competitive disentanglement scores to those of state-of-the-art β-VAEs and outperforms InfoGAN. Moreover, the visual quality and the diversity of samples generated by IB-GAN are often better than those by β-VAEs and InfoGAN in terms of FID score on CelebA and 3D Chairs dataset.

[86] PPMStereo: Pick-and-Play Memory Construction for Consistent Dynamic Stereo Matching

Yun Wang,Junjie Hu,Qiaole Dong,Yongjian Zhang,Yanwei Fu,Tin Lun Lam,Dapeng Wu

Main category: cs.CV

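TL;DR: This paper proposes PPMStereo for temporally consistent depth estimation from stereo video. Its Pick-and-Play Memory (PPM) module enables efficient dynamic stereo matching with long-range temporal consistency.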

Details Motivation: Existing methods face a trade-off between high computational cost and limited benefit when modeling long-range temporal consistency, and so fall short of what real applications such as augmented reality require. Method: Inspired by two-stage human decision-making, the PPM module has a 'pick' stage that selects the most relevant frames and a 'play' stage that adaptively weights them for spatio-temporal aggregation, building a compact yet informative memory buffer for dynamic stereo matching. Result: Experiments show that PPMStereo reaches state-of-the-art accuracy and temporal consistency, with TEPE on Sintel clearly better than BiDAStereo at lower computational cost. Conclusion: Through an efficient memory mechanism, PPMStereo delivers high-quality, temporally consistent depth estimation and offers a better solution for practical applications. Abstract: Temporally consistent depth estimation from stereo video is critical for real-world applications such as augmented reality, where inconsistent depth estimation disrupts the immersion of users. Despite its importance, this task remains challenging due to the difficulty in modeling long-term temporal consistency in a computationally efficient manner. Previous methods attempt to address this by aggregating spatio-temporal information but face a fundamental trade-off: limited temporal modeling provides only modest gains, whereas capturing long-range dependencies significantly increases computational cost. To address this limitation, we introduce a memory buffer for modeling long-range spatio-temporal consistency while achieving efficient dynamic stereo matching. Inspired by the two-stage decision-making process in humans, we propose a Pick-and-Play Memory (PPM) construction module for dynamic Stereo matching, dubbed as PPMStereo. PPM consists of a 'pick' process that identifies the most relevant frames and a 'play' process that weights the selected frames adaptively for spatio-temporal aggregation. This two-stage collaborative process maintains a compact yet highly informative memory buffer while achieving temporally consistent information aggregation. Extensive experiments validate the effectiveness of PPMStereo, demonstrating state-of-the-art performance in both accuracy and temporal consistency. Notably, PPMStereo achieves 0.62/1.11 TEPE on the Sintel clean/final (17.3% & 9.02% improvements over BiDAStereo) with fewer computational costs. Codes are available at https://github.com/cocowy1/PPMStereo.
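The two-stage 'pick' then 'play' idea can be illustrated with a toy version on plain feature vectors. The abstract does not say how relevance or weights are computed, so the dot-product relevance, top-k selection, and softmax weighting below are assumptions, and `pick_and_play` is a hypothetical name rather than the module's API.

```python
import math


def pick_and_play(query, memory, k=2):
    """Select the k most relevant memory frames and blend them.

    query: feature vector of the current frame.
    memory: non-empty list of feature vectors from past frames.
    Returns (picked_indices, weights, aggregated_feature).
    """
    # 'Pick': rank stored frames by dot-product relevance to the query.
    scores = [sum(q * m for q, m in zip(query, frame)) for frame in memory]
    picked = sorted(range(len(memory)), key=lambda i: scores[i], reverse=True)[:k]
    # 'Play': softmax over the picked scores (shifted for numerical stability).
    top = max(scores[i] for i in picked)
    exps = [math.exp(scores[i] - top) for i in picked]
    total = sum(exps)
    weights = [e / total for e in exps]
    aggregated = [sum(w * memory[i][d] for w, i in zip(weights, picked))
                  for d in range(len(query))]
    return picked, weights, aggregated
```

Keeping only k frames bounds the cost of aggregation regardless of how long the video is, which is the efficiency argument made in the abstract.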

[87] Evaluating Video Models as Simulators of Multi-Person Pedestrian Trajectories

Aaron Appelle,Jerome P. Lynch

Main category: cs.CV

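TL;DR: Proposes a rigorous evaluation protocol for assessing text-to-video and image-to-video models as simulators of pedestrian dynamics. Leading models are found to have learned effective priors for multi-agent behavior, but still show failures such as people merging or disappearing.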

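Details Motivation: Existing benchmarks for video generation models focus on single subjects and do not verify the plausibility of multi-agent dynamics in scenes with interacting people, so a new evaluation method is needed to test whether these models are viable as general-purpose world simulators in complex scenes. Method: An evaluation protocol is designed for text-to-video (T2V) and image-to-video (I2V) models. For I2V, start frames from existing datasets allow comparison with real videos; for T2V, a prompt suite covers different pedestrian densities and interactions. A method is also proposed to reconstruct 2D bird's-eye-view trajectories from pixel space without camera parameters. Result: Experiments show that leading video generation models model multi-agent behavior well and generate plausible pedestrian dynamics, but failure modes such as merging and disappearing people remain. Conclusion: The protocol exposes both the potential and the limits of current T2V and I2V models for simulating complex crowd dynamics and points to directions for improvement. Abstract: Large-scale video generation models have demonstrated high visual realism in diverse contexts, spurring interest in their potential as general-purpose world simulators. Existing benchmarks focus on individual subjects rather than scenes with multiple interacting people. However, the plausibility of multi-agent dynamics in generated videos remains unverified. We propose a rigorous evaluation protocol to benchmark text-to-video (T2V) and image-to-video (I2V) models as implicit simulators of pedestrian dynamics. For I2V, we leverage start frames from established datasets to enable comparison with a ground truth video dataset. For T2V, we develop a prompt suite to explore diverse pedestrian densities and interactions. A key component is a method to reconstruct 2D bird's-eye view trajectories from pixel-space without known camera parameters. Our analysis reveals that leading models have learned surprisingly effective priors for plausible multi-agent behavior. However, failure modes like merging and disappearing people highlight areas for future improvement.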

[88] SPAN: Continuous Modeling of Suspicion Progression for Temporal Intention Localization

Xinyi Hu,Yuran Wang,Yue Li,Wenxuan Liu,Zheng Wang

Main category: cs.CV

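TL;DR: This paper proposes the Suspicion Progression Analysis Network (SPAN), which recasts temporal intention localization from discrete classification to continuous regression in order to capture how suspicious intentions evolve in video.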

Details Motivation: Existing methods cannot effectively model the continuity and temporal dependencies of suspicious intentions, which limits early intervention and system explainability. Method: Proposes SPAN, which combines a suspicion score formula, Suspicion Coefficient Modulation, and Concept-Anchored Mapping, uses multimodal information to model continuous changes in suspicion, and draws on TPP theory to characterize long-term dependencies and cumulative effects. Result: Experiments on the HAI dataset show that SPAN reduces MSE by 19.8% relative to existing methods, improves average mAP by 1.78%, and improves mAP by 2.74% in low-frequency cases. Conclusion: Continuous suspicion modeling enables earlier detection and proactive intervention, markedly improving the explainability and practical utility of security monitoring systems. Abstract: Temporal Intention Localization (TIL) is crucial for video surveillance, focusing on identifying varying levels of suspicious intentions to improve security monitoring. However, existing discrete classification methods fail to capture the continuous nature of suspicious intentions, limiting early intervention and explainability. In this paper, we propose the Suspicion Progression Analysis Network (SPAN), which shifts from discrete classification to continuous regression, enabling the capture of fluctuating and evolving suspicious intentions. We reveal that suspicion exhibits long-term dependencies and cumulative effects, similar to those modeled in Temporal Point Process (TPP) theory. Based on these insights, we define a suspicion score formula that models continuous changes while accounting for temporal characteristics. We also introduce Suspicion Coefficient Modulation, which adjusts suspicion coefficients using multimodal information to reflect the varying impacts of suspicious actions. Additionally, the Concept-Anchored Mapping method is proposed to link suspicious actions to predefined intention concepts, offering insights into both the actions and their potential underlying intentions. Extensive experiments on the HAI dataset show that SPAN significantly outperforms existing methods, reducing MSE by 19.8% and improving average mAP by 1.78%. Notably, SPAN achieves a 2.74% mAP gain in low-frequency cases, demonstrating its superior ability to capture subtle behavioral changes. Compared to discrete classification systems, our continuous suspicion modeling approach enables earlier detection and proactive intervention, greatly enhancing system explainability and practical utility in security applications.
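
To make the notion of a cumulative, slowly decaying suspicion score concrete, here is a minimal sketch in the spirit of a temporal point process intensity. The abstract does not give SPAN's actual formula, so the exponential-decay form, the `decay` parameter, and the function name `suspicion_score` are assumptions used only for illustration.

```python
import math

def suspicion_score(t, events, decay=0.5):
    """Accumulated suspicion at time t (hypothetical form, not SPAN's formula).

    events: list of (event_time, coefficient) pairs; each past suspicious
    action contributes its coefficient, discounted exponentially with age.
    """
    return sum(
        coeff * math.exp(-decay * (t - event_time))
        for event_time, coeff in events
        if event_time <= t
    )
```

With `events = [(0.0, 1.0), (2.0, 2.0)]`, the score is 1.0 at `t = 0`, has decayed by `t = 1`, and jumps again at `t = 2` while still carrying a residue of the first event. A per-event coefficient is where a modulation step driven by multimodal cues could plug in under this assumed form.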

[89] A Structured Review and Quantitative Profiling of Public Brain MRI Datasets for Foundation Model Development

Minh Sao Khue Luu,Margaret V. Benedichuk,Ekaterina I. Roppert,Roman M. Kenzhin,Bair N. Tuchinov

Main category: cs.CV

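TL;DR: This study systematically analyzes 54 public brain MRI datasets covering more than 538,031 scans, assessing data scale, modality composition, disease coverage, and image heterogeneity. It reveals a marked imbalance between healthy and clinical cohorts and strong heterogeneity across datasets. Standardized preprocessing improves consistency, but residual covariate shift remains, indicating that harmonization alone cannot remove inter-dataset bias and that foundation model design needs preprocessing-aware and domain-adaptive strategies.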

Details Motivation: The development of brain MRI foundation models depends on large-scale, diverse, and consistent data, yet systematic assessments of existing public datasets along these dimensions are lacking, which limits model generalization. Method: At the dataset level, analyzes modality composition, disease coverage, and data scale; at the image level, quantifies voxel spacing, orientation, and intensity distributions; evaluates how preprocessing steps such as intensity normalization, bias field correction, skull stripping, spatial registration, and interpolation affect voxel statistics and geometry; and runs a feature-space case study with a 3D DenseNet121 to assess the covariate shift that remains after preprocessing. Result: Public datasets contain far more healthy samples than clinical ones; there is substantial heterogeneity at the image level; standardized preprocessing improves within-dataset consistency, but differences between datasets persist; feature-space analysis confirms measurable covariate shift after preprocessing. Conclusion: Public brain MRI data is markedly heterogeneous and imbalanced, and standardized preprocessing alone cannot fully remove inter-dataset bias; future foundation model development should combine preprocessing-aware and domain-adaptive methods to improve generalization. Abstract: The development of foundation models for brain MRI depends critically on the scale, diversity, and consistency of available data, yet systematic assessments of these factors remain scarce. In this study, we analyze 54 publicly accessible brain MRI datasets encompassing over 538,031 to provide a structured, multi-level overview tailored to foundation model development. At the dataset level, we characterize modality composition, disease coverage, and dataset scale, revealing strong imbalances between large healthy cohorts and smaller clinical populations. At the image level, we quantify voxel spacing, orientation, and intensity distributions across 15 representative datasets, demonstrating substantial heterogeneity that can influence representation learning. We then perform a quantitative evaluation of preprocessing variability, examining how intensity normalization, bias field correction, skull stripping, spatial registration, and interpolation alter voxel statistics and geometry. While these steps improve within-dataset consistency, residual differences persist between datasets. Finally, a feature-space case study using a 3D DenseNet121 shows measurable residual covariate shift after standardized preprocessing, confirming that harmonization alone cannot eliminate inter-dataset bias. Together, these analyses provide a unified characterization of variability in public brain MRI resources and emphasize the need for preprocessing-aware and domain-adaptive strategies in the design of generalizable brain MRI foundation models.

[90] RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling

Bingjie Gao,Qianli Ma,Xiaoxue Wu,Shuai Yang,Guanzhou Lan,Haonan Zhao,Jiaxuan Chen,Qingyang Liu,Yu Qiao,Xinyuan Chen,Yaohui Wang,Li Niu

Main category: cs.CV

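TL;DR: RAPO++ is a cross-stage prompt optimization framework that leaves the generative model unchanged and substantially improves text-to-video generation quality through retrieval augmentation, iterative refinement, and LLM fine-tuning.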

Details Motivation: User-provided prompts are usually short and mismatched with the training data, which limits the potential of text-to-video generation models, so a systematic method for optimizing prompts is needed. Method: Three stages: Stage 1 uses Retrieval-Augmented Prompt Optimization (RAPO) to enrich and refactor prompts; Stage 2 performs Sample-Specific Prompt Optimization (SSPO) using multi-source feedback; Stage 3 fine-tunes the LLM with the optimized prompts. Result: Experiments on five state-of-the-art T2V models and five benchmarks show that RAPO++ significantly outperforms existing methods in semantic alignment, compositional reasoning, temporal stability, and physical plausibility. Conclusion: RAPO++ is a model-agnostic, efficient, and scalable prompt optimization approach that sets a new standard for prompt engineering in text-to-video generation. Abstract: Prompt design plays a crucial role in text-to-video (T2V) generation, yet user-provided prompts are often short, unstructured, and misaligned with training data, limiting the generative potential of diffusion-based T2V models. We present RAPO++, a cross-stage prompt optimization framework that unifies training-data-aligned refinement, test-time iterative scaling, and large language model (LLM) fine-tuning to substantially improve T2V generation without modifying the underlying generative backbone. In Stage 1, Retrieval-Augmented Prompt Optimization (RAPO) enriches user prompts with semantically relevant modifiers retrieved from a relation graph and refactors them to match training distributions, enhancing compositionality and multi-object fidelity. Stage 2 introduces Sample-Specific Prompt Optimization (SSPO), a closed-loop mechanism that iteratively refines prompts using multi-source feedback (including semantic alignment, spatial fidelity, temporal coherence, and task-specific signals such as optical flow), yielding progressively improved video generation quality. Stage 3 leverages optimized prompt pairs from SSPO to fine-tune the rewriter LLM, internalizing task-specific optimization patterns and enabling efficient, high-quality prompt generation even before inference. Extensive experiments across five state-of-the-art T2V models and five benchmarks demonstrate that RAPO++ achieves significant gains in semantic alignment, compositional reasoning, temporal stability, and physical plausibility, outperforming existing methods by large margins.
Our results highlight RAPO++ as a model-agnostic, cost-efficient, and scalable solution that sets a new standard for prompt optimization in T2V generation. The code is available at https://github.com/Vchitect/RAPO.

[91] FlowCycle: Pursuing Cycle-Consistent Flows for Text-based Editing

Yanghao Wang,Zhen Wang,Long Chen

Main category: cs.CV

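TL;DR: Proposes FlowCycle, an inversion-free, flow-based editing framework that learns target-aware intermediate states through a cycle-consistent process, enabling high-quality and consistent text-based image editing.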

Details Motivation: Existing methods for text-based image editing use target-agnostic intermediate states, which leads to limited editability or inconsistent modifications, especially when the edit deviates substantially from the source image. Method: Proposes the FlowCycle framework, which parameterizes the corruption process with learnable noises and optimizes them in a cycle, under dual consistency constraints on forward editing and backward recovery, to produce target-aware intermediate states. Result: Experiments show that FlowCycle outperforms state-of-the-art methods in editing quality and consistency, modifying the target content more accurately while staying consistent with the source image. Conclusion: Target-aware intermediate states are essential for improving the fidelity and consistency of text-based image editing, and FlowCycle offers an effective solution for inversion-free editing. Abstract: Recent advances in pre-trained text-to-image flow models have enabled remarkable progress in text-based image editing. Mainstream approaches always adopt a corruption-then-restoration paradigm, where the source image is first corrupted into an "intermediate state" and then restored to the target image under the prompt guidance. However, current methods construct this intermediate state in a target-agnostic manner, i.e., they primarily focus on realizing source image reconstruction while neglecting the semantic gaps towards the specific editing target. This design inherently results in limited editability or inconsistency when the desired modifications substantially deviate from the source. In this paper, we argue that the intermediate state should be target-aware, i.e., selectively corrupting editing-relevant contents while preserving editing-irrelevant ones. To this end, we propose FlowCycle, a novel inversion-free and flow-based editing framework that parameterizes corruption with learnable noises and optimizes them through a cycle-consistent process. By iteratively editing the source to the target and recovering back to the source with dual consistency constraints, FlowCycle learns to produce a target-aware intermediate state, enabling faithful modifications while preserving source consistency. Extensive ablations have demonstrated that FlowCycle achieves superior editing quality and consistency over state-of-the-art methods.

[92] Towards Objective Obstetric Ultrasound Assessment: Contrastive Representation Learning for Fetal Movement Detection

Talha Ilyas,Duong Nhu,Allison Thomas,Arie Levin,Lim Wei Yap,Shu Gong,David Vera Anaya,Yiwen Jiang,Deval Mehta,Ritesh Warty,Vinayak Smith,Maya Reddy,Euan Wallace,Wenlong Cheng,Zongyuan Ge,Faezeh Marzbanrad

Main category: cs.CV

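TL;DR: Proposes CURL, a self-supervised learning framework for detecting fetal movement in fetal ultrasound video that combines spatial and temporal contrastive learning and achieves strong detection performance.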

Details Motivation: Traditional fetal movement detection methods, such as maternal perception and CTG, suffer from subjectivity and limited accuracy, so a more objective and accurate approach is needed. Method: Proposes the CURL framework, which uses a dual-contrastive loss (spatial and temporal contrastive learning), a task-specific sampling strategy, and probabilistic fine-tuning to allow flexible inference on ultrasound videos of arbitrary length. Result: On a dataset of 92 subjects, CURL reaches a sensitivity of 78.01% and an AUROC of 81.60%. Conclusion: Self-supervised contrastive learning shows promise for fetal movement analysis and could improve the reliability of prenatal monitoring and clinical decision-making. Abstract: Accurate fetal movement (FM) detection is essential for assessing prenatal health, as abnormal movement patterns can indicate underlying complications such as placental dysfunction or fetal distress. Traditional methods, including maternal perception and cardiotocography (CTG), suffer from subjectivity and limited accuracy. To address these challenges, we propose Contrastive Ultrasound Video Representation Learning (CURL), a novel self-supervised learning framework for FM detection from extended fetal ultrasound video recordings. Our approach leverages a dual-contrastive loss, incorporating both spatial and temporal contrastive learning, to learn robust motion representations. Additionally, we introduce a task-specific sampling strategy, ensuring the effective separation of movement and non-movement segments during self-supervised training, while enabling flexible inference on arbitrarily long ultrasound recordings through a probabilistic fine-tuning approach. Evaluated on an in-house dataset of 92 subjects, each with 30-minute ultrasound sessions, CURL achieves a sensitivity of 78.01% and an AUROC of 81.60%, demonstrating its potential for reliable and objective FM analysis. These results highlight the potential of self-supervised contrastive learning for fetal movement analysis, paving the way for improved prenatal monitoring and clinical decision-making.

[93] EditInfinity: Image Editing with Binary-Quantized Generative Models

Jiahuan Wang,Yuxin Chen,Jun Yu,Guangming Lu,Wenjie Pei

Main category: cs.CV

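TL;DR: This paper proposes EditInfinity, which adapts the binary-quantized generative model Infinity for efficient text-driven image editing. Because the exact intermediate quantized representations of a source image are attainable, it avoids the approximation errors that diffusion models incur during image inversion.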

Details Motivation: Existing diffusion-based image editing methods are limited by the lack of exact supervision for intermediate steps during image inversion, so approximation errors degrade editing performance. Method: Proposes EditInfinity, which adopts a VQ-based generative model, designs a precise image inversion mechanism that combines text prompting rectification with image style preservation, and introduces a holistic smoothing strategy to improve editing fidelity and semantic alignment. Result: Experiments on the PIE-Bench benchmark across "add", "change", and "delete" operations show that EditInfinity outperforms existing diffusion-based baselines. Conclusion: Through exact supervision from intermediate representations and an efficient inversion mechanism, EditInfinity achieves high-quality text-driven image editing with low tuning overhead. Abstract: Adapting pretrained diffusion-based generative models for text-driven image editing with negligible tuning overhead has demonstrated remarkable potential. A classical adaptation paradigm, as followed by these methods, first infers the generative trajectory inversely for a given source image by image inversion, then performs image editing along the inferred trajectory guided by the target text prompts. However, the performance of image editing is heavily limited by the approximation errors introduced during image inversion by diffusion models, which arise from the absence of exact supervision in the intermediate generative steps. To circumvent this issue, we investigate the parameter-efficient adaptation of VQ-based generative models for image editing, and leverage their inherent characteristic that the exact intermediate quantized representations of a source image are attainable, enabling more effective supervision for precise image inversion. Specifically, we propose EditInfinity, which adapts Infinity, a binary-quantized generative model, for image editing. We propose an efficient yet effective image inversion mechanism that integrates text prompting rectification and image style preservation, enabling precise image inversion. Furthermore, we devise a holistic smoothing strategy which allows our EditInfinity to perform image editing with high fidelity to source images and precise semantic alignment to the text prompts. Extensive experiments on the PIE-Bench benchmark across "add", "change", and "delete" editing operations demonstrate the superior performance of our model compared to state-of-the-art diffusion-based baselines.
Code available at: https://github.com/yx-chen-ust/EditInfinity.

[94] Why LVLMs Are More Prone to Hallucinations in Longer Responses: The Role of Context

Ge Zheng,Jiaye Qian,Jiajin Tang,Sibei Yang

Main category: cs.CV

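TL;DR: This paper proposes a new "induce-detect-suppress" framework for reducing hallucinations in the long-form responses of large vision-language models, and verifies that reliance on context is a key factor behind those hallucinations.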

Details Motivation: The study investigates why large vision-language models hallucinate more in long, free-form responses: whether length alone is responsible or a deeper mechanism is at work. Method: Following a series of preliminary experiments, proposes the "induce-detect-suppress" framework: deliberately designed contexts actively induce hallucinations, the induced instances are used to detect high-risk cases, and object-level hallucinations are suppressed during decoding. Result: The method yields significant and consistent improvements across multiple benchmarks, strengthening both the detection and the suppression of hallucinations. Conclusion: Hallucination risk stems mainly from increased reliance on context rather than from response length itself; beyond its performance gains, the framework offers a new perspective on the mechanism of hallucination in LVLMs' longer responses. Abstract: Large Vision-Language Models (LVLMs) have made significant progress in recent years but are also prone to hallucination issues. They exhibit more hallucinations in longer, free-form responses, often attributed to accumulated uncertainties. In this paper, we ask: Does increased hallucination result solely from length-induced errors, or is there a deeper underlying mechanism? After a series of preliminary experiments and findings, we suggest that the risk of hallucinations is not caused by length itself but by the increased reliance on context for coherence and completeness in longer responses. Building on these insights, we propose a novel "induce-detect-suppress" framework that actively induces hallucinations through deliberately designed contexts, leverages induced instances for early detection of high-risk cases, and ultimately suppresses potential object-level hallucinations during actual decoding. Our approach achieves consistent, significant improvements across all benchmarks, demonstrating its efficacy. The strong detection and improved hallucination mitigation not only validate our framework but, more importantly, re-validate our hypothesis on context. Rather than solely pursuing performance gains, this study aims to provide new insights and serves as a first step toward a deeper exploration of hallucinations in LVLMs' longer responses.

[95] COS3D: Collaborative Open-Vocabulary 3D Segmentation

Runsong Zhu,Ka-Hei Hui,Zhengzhe Liu,Qianyi Wu,Weiliang Tang,Shi Qiu,Pheng-Ann Heng,Chi-Wing Fu

Main category: cs.CV

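TL;DR: This paper proposes COS3D, a collaborative prompt-segmentation framework that addresses limitations of open-vocabulary 3D segmentation. By introducing a collaborative field (an instance field plus a language field) together with novel feature mapping and training strategies, it achieves leading performance on two benchmarks and shows potential across multiple downstream applications.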

Details Motivation: Existing Gaussian-splatting-based methods for open-vocabulary 3D segmentation suffer from poor segmentation quality or error accumulation, mainly because they rely on either a single 3D language field or pre-computed class-agnostic segmentations and struggle to fuse language and segmentation cues effectively. Method: Introduces the concept of a collaborative field comprising an instance field and a language field; designs an instance-to-language feature mapping and a two-stage training strategy; and adds an adaptive language-to-instance prompt refinement at inference time to improve the consistency between prompts and segmentation. Result: Significantly outperforms existing methods on two widely used benchmarks, achieving high-quality open-vocabulary 3D segmentation and showing potential for novel image-based 3D segmentation, hierarchical segmentation, and robotics. Conclusion: By integrating language and segmentation cues throughout the pipeline, COS3D resolves key shortcomings of existing methods and offers an effective new paradigm for open-vocabulary 3D segmentation. Abstract: Open-vocabulary 3D segmentation is a fundamental yet challenging task, requiring a mutual understanding of both segmentation and language. However, existing Gaussian-splatting-based methods rely either on a single 3D language field, leading to inferior segmentation, or on pre-computed class-agnostic segmentations, suffering from error accumulation. To address these limitations, we present COS3D, a new collaborative prompt-segmentation framework that contributes to effectively integrating complementary language and segmentation cues throughout its entire pipeline. We first introduce the new concept of collaborative field, comprising an instance field and a language field, as the cornerstone for collaboration. During training, to effectively construct the collaborative field, our key idea is to capture the intrinsic relationship between the instance field and language field, through a novel instance-to-language feature mapping and designing an efficient two-stage training strategy. During inference, to bridge distinct characteristics of the two fields, we further design an adaptive language-to-instance prompt refinement, promoting high-quality prompt-segmentation inference. Extensive experiments not only demonstrate COS3D's leading performance over existing methods on two widely-used benchmarks but also show its high potential for various applications, i.e., novel image-based 3D segmentation, hierarchical segmentation, and robotics. The code is publicly available at https://github.com/Runsong123/COS3D.

[96] Empower Words: DualGround for Structured Phrase and Sentence-Level Temporal Grounding

Minseok Kang,Minhyeok Lee,Minjung Kim,Donghyeong Kim,Sangyoun Lee

Main category: cs.CV

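TL;DR: This paper proposes DualGround, a dual-branch architecture for video temporal grounding that improves cross-modal alignment by separating sentence-level and phrase-level semantics, significantly improving performance on Moment Retrieval and Highlight Detection.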

Details Motivation: Existing methods treat all text tokens uniformly in cross-modal attention and ignore their different semantic roles, so models rely too heavily on the global semantics of the [EOS] token and fail to use word-level signals for fine-grained temporal alignment. Method: Proposes DualGround, a dual-branch structure in which the [EOS] token handles global semantics through a sentence-level path while word tokens are clustered into phrase-level units for localized grounding; introduces cross-modal interaction strategies based on token roles and jointly models sentence-level and phrase-level semantics for disentangled alignment. Result: DualGround achieves state-of-the-art performance on the Moment Retrieval and Highlight Detection tasks on the QVHighlights and Charades-STA datasets. Conclusion: By explicitly separating global and local semantics, DualGround improves fine-grained temporal grounding in video-language alignment and confirms the importance of disentangled semantic modeling. Abstract: Video Temporal Grounding (VTG) aims to localize temporal segments in long, untrimmed videos that align with a given natural language query. This task typically comprises two subtasks: Moment Retrieval (MR) and Highlight Detection (HD). While recent advances have been driven by powerful pretrained vision-language models such as CLIP and InternVideo2, existing approaches commonly treat all text tokens uniformly during cross-modal attention, disregarding their distinct semantic roles. To validate the limitations of this approach, we conduct controlled experiments demonstrating that VTG models overly rely on [EOS]-driven global semantics while failing to effectively utilize word-level signals, which limits their ability to achieve fine-grained temporal alignment. Motivated by this limitation, we propose DualGround, a dual-branch architecture that explicitly separates global and local semantics by routing the [EOS] token through a sentence-level path and clustering word tokens into phrase-level units for localized grounding. Our method introduces (1) token-role-aware cross-modal interaction strategies that align video features with sentence-level and phrase-level semantics in a structurally disentangled manner, and (2) a joint modeling framework that not only improves global sentence-level alignment but also enhances fine-grained temporal grounding by leveraging structured phrase-aware context. This design allows the model to capture both coarse and localized semantics, enabling more expressive and context-aware video grounding.
DualGround achieves state-of-the-art performance on both Moment Retrieval and Highlight Detection tasks across QVHighlights and Charades-STA benchmarks, demonstrating the effectiveness of disentangled semantic modeling in video-language alignment.

[97] Seeing the Unseen: Mask-Driven Positional Encoding and Strip-Convolution Context Modeling for Cross-View Object Geo-Localization

Shuhan Hu,Yiru Li,Yuanyuan Li,Yingying Zhu

Main category: cs.CV

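TL;DR: This paper proposes a mask-based positional encoding scheme (MPE) and a context enhancement module (CEM), and combines them into the EDGeo framework for cross-view object geo-localization, significantly improving localization accuracy.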

Details Motivation: Existing methods rely on keypoint-based positional encoding that captures only 2D coordinates and ignores object shape, which makes them sensitive to annotation shifts and limits cross-view matching. Method: Proposes mask-based positional encoding (MPE), which uses segmentation masks to capture both spatial coordinates and object silhouettes, and a context enhancement module (CEM), which uses strip convolutional kernels to extract long-range contextual features and better discriminate elongated objects. Result: Experiments on two public datasets, CVOGL and VIGOR-Building, show a 3.39% improvement in localization accuracy under challenging ground-to-satellite scenarios, reaching state-of-the-art performance. Conclusion: By introducing object-aware positional encoding and context modeling, EDGeo provides a more robust solution for cross-view geo-localization. Abstract: Cross-view object geo-localization enables high-precision object localization through cross-view matching, with critical applications in autonomous driving, urban management, and disaster response. However, existing methods rely on keypoint-based positional encoding, which captures only 2D coordinates while neglecting object shape information, resulting in sensitivity to annotation shifts and limited cross-view matching capability. To address these limitations, we propose a mask-based positional encoding (MPE) scheme that leverages segmentation masks to capture both spatial coordinates and object silhouettes, thereby upgrading the model from "location-aware" to "object-aware." Furthermore, to tackle the challenge of large-span objects (e.g., elongated buildings) in satellite imagery, we design a context enhancement module (CEM). This module employs horizontal and vertical strip convolutional kernels to extract long-range contextual features, enhancing feature discrimination among strip-like objects. Integrating MPE and CEM, we present EDGeo, an end-to-end framework for robust cross-view object geo-localization. Extensive experiments on two public datasets (CVOGL and VIGOR-Building) demonstrate that our method achieves state-of-the-art performance, with a 3.39% improvement in localization accuracy under challenging ground-to-satellite scenarios. This work provides a robust positional encoding paradigm and a contextual modeling framework for advancing cross-view geo-localization research.
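
The horizontal and vertical strip kernels mentioned in the abstract can be sketched in plain Python. This is only an illustration of the general strip-convolution idea, not the paper's context enhancement module: the helper names `strip_conv` and `strip_context`, the zero padding, and the simple sum of the two branches are assumptions.

```python
def strip_conv(image, kernel, axis):
    """Slide a 1-D kernel along columns (axis=0, vertical strip) or rows
    (axis=1, horizontal strip) with zero padding; output keeps the input shape.
    Computed as cross-correlation, as is usual for convolution layers."""
    h, w = len(image), len(image[0])
    r = len(kernel) // 2  # assumes an odd kernel length
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for k, weight in enumerate(kernel):
                yy, xx = (y + k - r, x) if axis == 0 else (y, x + k - r)
                if 0 <= yy < h and 0 <= xx < w:
                    acc += weight * image[yy][xx]
            out[y][x] = acc
    return out

def strip_context(image, kernel):
    """Combine horizontal and vertical strip responses by summation
    (an assumed fusion rule, chosen for simplicity)."""
    horizontal = strip_conv(image, kernel, axis=1)
    vertical = strip_conv(image, kernel, axis=0)
    return [[a + b for a, b in zip(hr, vr)] for hr, vr in zip(horizontal, vertical)]
```

A strip kernel of length `k` covers `k` pixels along one axis at the cost of `k` weights, whereas a square kernel with the same reach needs `k * k`; that is the usual reason strip kernels suit elongated structures.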

[98] Calibrating Multimodal Consensus for Emotion Recognition

Guowei Zhong,Junjie Li,Huaiyu Zhu,Ruohong Huan,Yun Pan

Main category: cs.CV

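TL;DR: Proposes CMC, a new multimodal emotion recognition model that uses a pseudo label generation module and a parameter-free fusion module to address cross-modal semantic inconsistency and text dominance, performing strongly on multiple datasets.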

Details Motivation: Existing methods overlook the semantic inconsistencies that can arise across modalities and tend to be dominated by the text modality, which hurts recognition accuracy. Method: Designs a Pseudo Label Generation Module (PLGM) for self-supervised unimodal pretraining, and uses a Parameter-free Fusion Module (PFM) and a Multimodal Consensus Router (MCR) for more reliable multimodal fusion. Result: Matches or exceeds state-of-the-art methods on four datasets (CH-SIMS, CH-SIMS v2, CMU-MOSI, and CMU-MOSEI), with a clear advantage in scenarios with semantic inconsistencies. Conclusion: CMC mitigates text dominance and improves the robustness and accuracy of multimodal emotion recognition under semantic inconsistency. Abstract: In recent years, Multimodal Emotion Recognition (MER) has made substantial progress. Nevertheless, most existing approaches neglect the semantic inconsistencies that may arise across modalities, such as conflicting emotional cues between text and visual inputs. Besides, current methods are often dominated by the text modality due to its strong representational capacity, which can compromise recognition accuracy. To address these challenges, we propose a model termed Calibrated Multimodal Consensus (CMC). CMC introduces a Pseudo Label Generation Module (PLGM) to produce pseudo unimodal labels, enabling unimodal pretraining in a self-supervised fashion. It then employs a Parameter-free Fusion Module (PFM) and a Multimodal Consensus Router (MCR) for multimodal finetuning, thereby mitigating text dominance and guiding the fusion process toward a more reliable consensus. Experimental results demonstrate that CMC achieves performance on par with or superior to state-of-the-art methods across four datasets, CH-SIMS, CH-SIMS v2, CMU-MOSI, and CMU-MOSEI, and exhibits notable advantages in scenarios with semantic inconsistencies on CH-SIMS and CH-SIMS v2. The implementation of this work is publicly accessible at https://github.com/gw-zhong/CMC.

[99] Real-Time Currency Detection and Voice Feedback for Visually Impaired Individuals

Saraf Anzum Shreya,MD. Abu Ismail Siddique,Sharaf Tasnim

Main category: cs.CV

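TL;DR: This paper proposes a real-time currency detection system based on YOLOv8 nano, intended to help visually impaired people identify banknotes and coins independently.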

Details Motivation: Visually impaired people have difficulty handling money in daily life and need a solution that does not depend on others. Method: Uses a YOLOv8 nano model with a custom detection head and Squeeze-and-Excitation blocks to improve feature extraction and detection accuracy, trained on a dataset of 30 currency classes (USD, EUR, BDT). Result: The model achieves an accuracy of 97.73%, recall of 95.23%, F1-score of 95.85%, and mAP50(B) of 97.21%. Conclusion: Combined with voice feedback, the system can effectively help visually impaired people identify currency; it is practical and efficient and supports greater independence in daily life. Abstract: Technologies like smartphones have become an essential part of our daily lives and are now accessible to everyone, including visually impaired individuals. With smartphone cameras, capturing and processing images has become more convenient. Combining smartphones with machine learning can make life a little easier for visually impaired people. Daily tasks such as handling money without relying on someone else can be troublesome for them. For that purpose, this paper presents a real-time currency detection system designed to assist visually impaired individuals. The proposed model is trained on a dataset containing 30 classes of notes and coins, representing 3 types of currency: US dollar (USD), Euro (EUR), and Bangladeshi taka (BDT). Our approach uses a YOLOv8 nano model with a custom detection head featuring deep convolutional layers and Squeeze-and-Excitation blocks to enhance feature extraction and detection accuracy. Our model achieves an accuracy of 97.73%, recall of 95.23%, F1-score of 95.85%, and a mean Average Precision at IoU=0.5 (mAP50(B)) of 97.21%. Voice feedback after detection helps visually impaired users identify the currency. This paper aims to create a practical and efficient currency detection system that enables visually impaired individuals to handle money independently.

[100] GMFVAD: Using Grained Multi-modal Feature to Improve Video Anomaly Detection

Guangyu Dai,Dong Chen,Siliang Tang,Yueting Zhuang

Main category: cs.CV

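TL;DR: Proposes GMFVAD, a fine-grained multi-modal feature fusion method for video anomaly detection that uses text features to refine visual features and reduce redundancy, reaching state-of-the-art performance on four mainstream datasets.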

Details Motivation: Existing methods fuse multi-modal information (such as text) only coarsely and overlook the redundant information in video snippets, which hurts anomaly detection. Method: Proposes GMFVAD, which generates finer-grained multi-modal features and uses text features from video captions to enhance the visual features of key portions, improving feature representation. Result: Achieves state-of-the-art performance on four mainstream datasets; ablation experiments confirm that the reduction of redundant information is the main reason for the improvement. Conclusion: By fusing multi-modal information at a fine granularity, GMFVAD reduces redundancy in visual features and markedly improves the accuracy and robustness of video anomaly detection. Abstract: Video anomaly detection (VAD) is a challenging task that detects anomalous frames in continuous surveillance videos. Most previous work utilizes the spatio-temporal correlation of visual features to determine whether video snippets contain abnormalities. Recently, some works have attempted to introduce multi-modal information, such as text features, to enhance the results of video anomaly detection. However, these works merely incorporate text features into video snippets in a coarse manner, overlooking the significant amount of redundant information that may exist within the video snippets. Therefore, we propose to leverage the diversity among multi-modal information to further refine the extracted features, reducing the redundancy in visual features, and we propose Grained Multi-modal Feature for Video Anomaly Detection (GMFVAD). Specifically, we generate a finer-grained multi-modal feature based on the video snippet, which summarizes the main content, and text features based on the captions of the original video are introduced to further enhance the visual features of highlighted portions. Experiments show that the proposed GMFVAD achieves state-of-the-art performance on four mainstream datasets. Ablation experiments also validate that the improvement of GMFVAD is due to the reduction of redundant information.

[101] Causal Debiasing for Visual Commonsense Reasoning

Jiayi Zou,Gengyun Jia,Bing-Kun Bao

Main category: cs.CV

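TL;DR: This paper introduces the VCR-OOD datasets to evaluate model generalization across the visual and textual modalities, and uses a backdoor adjustment method to remove the co-occurrence and statistical biases present in the data.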

Details Motivation: Existing visual commonsense reasoning methods overlook dataset bias and lack effective debiasing strategies, which harms model generalization. Method: Builds two subsets, VCR-OOD-QA and VCR-OOD-VA, analyzes the causal graphs and prediction shortcuts in the VCR task, and applies backdoor adjustment with a dictionary built from the set of correct answers to remove bias. Result: Experiments show that the proposed debiasing method improves generalization across multiple datasets. Conclusion: Constructing out-of-distribution datasets and introducing causal adjustment effectively mitigates data bias in VCR and improves robustness in out-of-distribution settings. Abstract: Visual Commonsense Reasoning (VCR) refers to answering questions and providing explanations based on images. While existing methods achieve high prediction accuracy, they often overlook bias in datasets and lack debiasing strategies. In this paper, our analysis reveals co-occurrence and statistical biases in both textual and visual data. We introduce the VCR-OOD datasets, comprising VCR-OOD-QA and VCR-OOD-VA subsets, which are designed to evaluate the generalization capabilities of models across two modalities. Furthermore, we analyze the causal graphs and prediction shortcuts in VCR and adopt a backdoor adjustment method to remove bias. Specifically, we create a dictionary based on the set of correct answers to eliminate prediction shortcuts. Experiments demonstrate the effectiveness of our debiasing method across different datasets.
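
For readers unfamiliar with backdoor adjustment, the textbook identity is P(Y | do(X)) = Σ_z P(Y | X, z) P(z): the confounder z is averaged under its marginal distribution rather than under P(z | X). The sketch below shows only that generic identity with made-up numbers; how the paper builds its confounder dictionary from correct answers is not specified in the abstract, so the function name and the two-entry dictionary are purely illustrative assumptions.

```python
def backdoor_adjust(p_y_given_xz, p_z):
    """Generic backdoor adjustment: sum_z P(y | x, z) * P(z).

    p_y_given_xz: dict mapping each confounder value z to P(y | x, z)
    p_z:          dict mapping each confounder value z to its prior P(z)
    """
    return sum(p_y_given_xz[z] * p_z[z] for z in p_z)

# Made-up numbers for a two-valued confounder (illustrative only).
p_y_given_xz = {"a": 0.9, "b": 0.3}
prior = {"a": 0.5, "b": 0.5}            # P(z)
seen_with_x = {"a": 0.9, "b": 0.1}      # P(z | x), skewed by co-occurrence

adjusted = backdoor_adjust(p_y_given_xz, prior)           # interventional estimate
confounded = backdoor_adjust(p_y_given_xz, seen_with_x)   # observational estimate
```

In this toy setting the observational estimate is inflated because `x` co-occurs with the favorable confounder value; weighting by the prior removes that shortcut.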

[102] Knowledge-Informed Neural Network for Complex-Valued SAR Image Recognition

Haodong Yang,Zhongling Huang,Shaojie Guo,Zhe Zhang,Gong Cheng,Junwei Han

Main category: cs.CV

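TL;DR: Proposes the Knowledge-Informed Neural Network (KINN), which uses physical priors and a compact architecture to resolve the trilemma of generalization, interpretability, and efficiency in CV-SAR image recognition.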

Details Motivation: Conventional data-driven models do not fully exploit the electromagnetic scattering features in CV-SAR data, so under data-limited and domain-shift scenarios they struggle to balance generalization, interpretability, and efficiency. Method: Designs the lightweight KINN framework on a "compression-aggregation-compression" architecture: the first stage uses a physics-guided dictionary processor to embed prior knowledge and extract sparse features; the second stage aggregates features; the third stage performs semantic compression with a compact, self-distilled classification head. Result: Validated on five SAR benchmarks, KINN delivers strong generalization and interpretability in low-data and out-of-distribution scenarios with a small parameter count (0.7M CNN / 0.95M ViT), reaching state-of-the-art performance. Conclusion: KINN resolves the representation trilemma in CV-SAR image recognition and opens a new path for trustworthy AI in SAR analysis. Abstract: Deep learning models for complex-valued Synthetic Aperture Radar (CV-SAR) image recognition are fundamentally constrained by a representation trilemma under data-limited and domain-shift scenarios: the concurrent, yet conflicting, optimization of generalization, interpretability, and efficiency. Our work is motivated by the premise that the rich electromagnetic scattering features inherent in CV-SAR data hold the key to resolving this trilemma, yet they are insufficiently harnessed by conventional data-driven models. To this end, we introduce the Knowledge-Informed Neural Network (KINN), a lightweight framework built upon a novel "compression-aggregation-compression" architecture. The first stage performs a physics-guided compression, wherein a novel dictionary processor adaptively embeds physical priors, enabling a compact unfolding network to efficiently extract sparse, physically-grounded signatures. A subsequent aggregation module enriches these representations, followed by a final semantic compression stage that utilizes a compact classification head with self-distillation to learn maximally task-relevant and discriminative embeddings. We instantiate KINN in both CNN (0.7M) and Vision Transformer (0.95M) variants. Extensive evaluations on five SAR benchmarks confirm that KINN establishes a new state of the art in parameter-efficient recognition, offering exceptional generalization in data-scarce and out-of-distribution scenarios and tangible interpretability, thereby providing an effective solution to the representation trilemma and offering a new path for trustworthy AI in SAR image analysis.

[103] DMC$^3$: Dual-Modal Counterfactual Contrastive Construction for Egocentric Video Question Answering

Jiayi Zou,Chaofan Chen,Bing-Kun Bao,Changsheng Xu

Main category: cs.CV

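TL;DR: Proposes the Dual-Modal Counterfactual Contrastive Construction framework (DMC$^3$) to address multi-event understanding and hand-object interaction recognition in egocentric video question answering; through counterfactual sample construction and contrastive optimization, it reaches state-of-the-art performance on multiple datasets.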

Details Motivation: Existing methods ignore the unique challenges of the first-person perspective, such as understanding multiple events and recognizing hand-object interactions, so more effective modeling is needed to improve Egocentric VideoQA performance. Method: Proposes the DMC$^3$ framework, consisting of a baseline model, a counterfactual sample construction module (positive and negative samples are generated by paraphrasing event descriptions for the textual modality and by mining core interactions for the visual modality), and a contrastive optimization module that involves the counterfactual samples, using a contrastive loss to pull positive samples closer and push negative samples away. Result: Achieves 52.51% and 46.04% on the normal and indirect splits of EgoTaskQA and 13.2% on QAEGO4D, all state-of-the-art results. Conclusion: By introducing counterfactual contrastive learning, DMC$^3$ improves egocentric video question answering and markedly strengthens the understanding of key events and interactions. Abstract: Egocentric Video Question Answering (Egocentric VideoQA) plays an important role in egocentric video understanding, which refers to answering questions based on first-person videos. Although existing methods have made progress through the paradigm of pre-training and fine-tuning, they ignore the unique challenges posed by the first-person perspective, such as understanding multiple events and recognizing hand-object interactions. To deal with these challenges, we propose a Dual-Modal Counterfactual Contrastive Construction (DMC$^3$) framework, which contains an Egocentric VideoQA baseline, a counterfactual sample construction module, and a counterfactual sample-involved contrastive optimization module. Specifically, we first develop a counterfactual sample construction module to generate positive and negative samples for textual and visual modalities through event description paraphrasing and core interaction mining, respectively. Then, we feed these samples together with the original samples into the baseline. Finally, in the counterfactual sample-involved contrastive optimization module, we apply contrastive loss to minimize the distance between the original sample features and the positive sample features, while maximizing the distance from the negative samples. Experiments show that our method achieves 52.51% and 46.04% on the normal and indirect splits of EgoTaskQA, and 13.2% on QAEGO4D, reaching state-of-the-art performance on both datasets.
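
The contrastive objective described above (pull the original features toward positive samples, push them away from negatives) is commonly written as an InfoNCE-style loss. The sketch below shows that common form; the abstract does not say which contrastive loss DMC$^3$ actually uses, so the cosine similarity, the `temperature` value, and the function names are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (assumed form): low when the anchor is close to the
    positive sample and far from every negative sample."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))
```

For an anchor `[1.0, 0.0]`, a nearby positive such as `[1.0, 0.1]` with an orthogonal negative `[0.0, 1.0]` gives a loss near zero, while swapping the roles of the two samples gives a much larger loss, which is the gradient signal that separates counterfactual negatives from paraphrased positives.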

[104] UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning

Liangyu Chen,Hanzhang Zhou,Chenglin Cai,Jianan Zhang,Panrong Tong,Quyu Kong,Xu Zhang,Chen Liu,Yuqi Liu,Wenxuan Wang,Yue Wang,Qin Jin,Steven Hoi

Main category: cs.CV

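TL;DR: Proposes the Instruction-as-Reasoning paradigm, which treats natural-language instructions as dynamic analytical pathways; combined with a two-stage training framework (supervised fine-tuning followed by reinforcement learning), it achieves state-of-the-art GUI grounding performance and exhibits emergent reasoning.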

Details Motivation: Existing GUI grounding research overlooks the impact of instruction diversity and quality on performance, and existing datasets contain many flawed instructions, which limits real-world effectiveness. Method: Proposes the Instruction-as-Reasoning paradigm with two-stage training: supervised fine-tuning on synthesized, diverse instructions to instill multi-perspective reasoning, followed by reinforcement learning to optimize the selection and composition of reasoning pathways. Result: UI-Ins-7B and UI-Ins-32B reach SOTA on five benchmarks; UI-Ins-32B scores 87.3% on UI-I2E-Bench, 57.0% on ScreenSpot-Pro, and 84.9% on MMBench-GUI L2, and the approach shows agentic potential with a 74.1% success rate on AndroidWorld. Conclusion: Instructions are not merely a static expression of user intent; they can serve as dynamic reasoning pathways that improve grounding. The method improves the reasoning and execution abilities of GUI agents and mitigates policy collapse in the SFT+RL framework. Abstract: GUI grounding, which maps natural-language instructions to actionable UI elements, is a core capability of GUI agents. Prior work largely treats instructions as a static proxy for user intent, overlooking the impact of instruction diversity and quality on grounding performance. Through a careful investigation of existing grounding datasets, we find a 23.3% flaw rate in their instructions and show that inference-time exploitation of instruction diversity yields up to a substantial 76% relative performance improvement. In this paper, we introduce the Instruction-as-Reasoning paradigm, treating instructions as dynamic analytical pathways that offer distinct perspectives and enabling the model to select the most effective pathway during reasoning. To achieve this, we propose a two-stage training framework: supervised fine-tuning (SFT) on synthesized, diverse instructions to instill multi-perspective reasoning, followed by reinforcement learning (RL) to optimize pathway selection and composition. Our resulting models, UI-Ins-7B and UI-Ins-32B, achieve state-of-the-art results on five challenging grounding benchmarks and exhibit emergent reasoning, selectively composing and synthesizing novel instruction pathways at inference. In particular, UI-Ins-32B attains the best grounding accuracy, scoring 87.3% on UI-I2E-Bench, 57.0% on ScreenSpot-Pro, and 84.9% on MMBench-GUI L2. Furthermore, our model demonstrates strong agentic potential, achieving a 74.1% success rate on AndroidWorld using UI-Ins-7B as the executor.
Our in-depth analysis reveals additional insights such as how reasoning can be formulated to enhance rather than hinder grounding performance, and how our method mitigates policy collapse in the SFT+RL framework. All code and model checkpoints will be publicly released at https://github.com/alibaba/UI-Ins.

[105] Breakdance Video classification in the age of Generative AI

Sauptik Dhar,Naveen Ramakrishnan,Michelle Munson

Main category: cs.CV

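TL;DR: Examines how modern video foundation models apply to breakdance, a niche but popular dance sport; finds that video encoder models outperform state-of-the-art video language models on prediction tasks, and provides an in-depth analysis of a finetuned decoder model for breakdance video classification.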

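Details Motivation: Existing work concentrates on mainstream sports and pays little attention to niche ones such as breakdance; this paper aims to fill that gap and examine whether video foundation models apply in such settings. Method: Runs classification experiments on breakdance videos with modern video foundation models (both encoders and decoders), compares model performance, and analyzes a finetuned decoder model in depth. Result: Video encoder models outperform current state-of-the-art video language models on prediction tasks; the study also shows how to choose a suitable encoder and how a finetuned decoder works for breakdance classification. Conclusion: Video encoders are better suited to video understanding for niche sports such as breakdance, which offers guidance for model selection and optimization in non-mainstream sports. Abstract: Large Vision Language models have seen huge application in several sports use-cases recently. Most of these works have targeted a limited subset of popular sports such as soccer, cricket, and basketball, focusing on generative tasks like visual question answering and highlight generation. This work analyzes the applicability of modern video foundation models (both encoder and decoder) to a very niche but hugely popular dance sport: breakdance. Our results show that Video Encoder models continue to outperform state-of-the-art Video Language Models for prediction tasks. We provide insights on how to choose the encoder model and provide a thorough analysis into the workings of a finetuned decoder model for breakdance video classification.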

[106] A Parameter-Efficient Mixture-of-Experts Framework for Cross-Modal Geo-Localization

LinFeng Li,Jian Zhao,Zepeng Yang,Yuhang Song,Bojun Lin,Tianle Zhang,Yuchen Yuan,Chi Zhang,Xuelong Li

Main category: cs.CV

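TL;DR: Presents a winning solution to a cross-modal drone navigation task; domain-aligned preprocessing and a Mixture-of-Experts framework address inter-platform heterogeneity and the text-visual domain gap.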

Details Motivation: Address the severe heterogeneity in cross-platform image retrieval and the domain gap between generic training descriptions and platform-specific test queries. Method: Applies preprocessing (platform-wise partitioning, satellite augmentation, removal of orientation words) together with an LLM-based caption refinement pipeline; builds on BGE-M3 and EVA-CLIP, trains three platform experts with a progressive two-stage hard-negative mining strategy, and fuses their scores at inference. Result: The system ranks first on the official leaderboard. Conclusion: The method achieves robust cross-modal geo-localization under heterogeneous viewpoints, supporting the effectiveness of domain alignment and the MoE architecture. Abstract: We present a winning solution to RoboSense 2025 Track 4: Cross-Modal Drone Navigation. The task retrieves the most relevant geo-referenced image from a large multi-platform corpus (satellite/drone/ground) given a natural-language query. Two obstacles are severe inter-platform heterogeneity and a domain gap between generic training descriptions and platform-specific test queries. We mitigate these with a domain-aligned preprocessing pipeline and a Mixture-of-Experts (MoE) framework: (i) platform-wise partitioning, satellite augmentation, and removal of orientation words; (ii) an LLM-based caption refinement pipeline to align textual semantics with the distinct visual characteristics of each platform. Using BGE-M3 (text) and EVA-CLIP (image), we train three platform experts using a progressive two-stage, hard-negative mining strategy to enhance discriminative power, and fuse their scores at inference. The system tops the official leaderboard, demonstrating robust cross-modal geo-localization under heterogeneous viewpoints.
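As a rough illustration of fusing expert scores at inference, the sketch below combines per-expert similarity scores with a weighted sum and ranks the candidates. The abstract does not state the fusion rule; the weighted average, the function name `fuse_scores`, the candidate ids, and all score values are hypothetical.

```python
def fuse_scores(expert_scores, weights=None):
    # expert_scores: {expert name: {candidate id: query-image similarity}}
    names = list(expert_scores)
    if weights is None:
        # Assumption: equal weighting when nothing else is specified.
        weights = {name: 1.0 / len(names) for name in names}
    fused = {}
    for name in names:
        for candidate, score in expert_scores[name].items():
            fused[candidate] = fused.get(candidate, 0.0) + weights[name] * score
    # Candidates ordered from most to least relevant.
    return sorted(fused, key=fused.get, reverse=True)

# Made-up similarities from three platform experts for one text query.
scores = {
    "satellite": {"img_a": 0.62, "img_b": 0.40},
    "drone": {"img_a": 0.30, "img_b": 0.55},
    "ground": {"img_a": 0.50, "img_b": 0.45},
}
equal_ranking = fuse_scores(scores)
drone_heavy = fuse_scores(scores, {"satellite": 0.2, "drone": 0.6, "ground": 0.2})
```

Changing the weights changes the ranking in this toy case, which is why the choice of fusion rule matters when experts disagree.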

[107] HyperET: Efficient Training in Hyperbolic Space for Multi-modal Large Language Models

Zelin Peng,Zhengqin Xu,Qingyang Liu,Xiaokang Yang,Wei Shen

Main category: cs.CV

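TL;DR: Proposes HyperET, an efficient training paradigm for multi-modal large language models in hyperbolic space; dynamic adjustment of the hyperbolic radius aligns vision and text at an arbitrary granularity level, markedly reducing compute requirements.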

Details Motivation: Existing multi-modal large language models train inefficiently and need large amounts of compute because their vision encoders lack multi-granularity alignment with language. Method: Exploits the natural hierarchical modeling of hyperbolic space and proposes the HyperET framework, which uses learnable matrices with Möbius multiplication to adjust the hyperbolic radius dynamically and achieve cross-modal alignment at an arbitrary granularity. Result: Across multiple MLLM benchmarks, HyperET consistently improves pre-training and fine-tuning models with less than 1% additional parameters. Conclusion: HyperET offers a flexible and efficient parametrization strategy that addresses the granularity mismatch in multi-modal alignment and greatly reduces compute requirements. Abstract: Multi-modal large language models (MLLMs) have emerged as a transformative approach for aligning visual and textual understanding. They typically require extremely high computational resources (e.g., thousands of GPUs) for training to achieve cross-modal alignment at multi-granularity levels. We argue that a key source of this inefficiency lies in the vision encoders they are widely equipped with, e.g., CLIP and SAM, which lack alignment with language at multi-granularity levels. To address this issue, in this paper, we leverage hyperbolic space, which inherently models hierarchical levels and thus provides a principled framework for bridging the granularity gap between visual and textual modalities at an arbitrary granularity level. Concretely, we propose an efficient training paradigm for MLLMs, dubbed HyperET, which can optimize visual representations to align with their textual counterparts at an arbitrary granularity level through dynamic hyperbolic radius adjustment in hyperbolic space. HyperET employs learnable matrices with Möbius multiplication operations, implemented via three effective configurations: diagonal scaling matrices, block-diagonal matrices, and banded matrices, providing a flexible yet efficient parametrization strategy. Comprehensive experiments across multiple MLLM benchmarks demonstrate that HyperET consistently and clearly improves both existing pre-training and fine-tuning MLLMs with less than 1\% additional parameters.
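To make the link between Möbius multiplication and the hyperbolic radius concrete, here is a small sketch using the standard Möbius matrix-vector product on the Poincaré ball with a diagonal matrix. Whether HyperET uses exactly this form, this curvature convention, or these names (`mobius_diag_mul`, `hyperbolic_radius`) is an assumption; the abstract only names the operation and the matrix configurations.

```python
import math

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def mobius_diag_mul(diag, x, c=1.0):
    # Standard Möbius matrix-vector product M (x)_c x on the Poincaré ball of
    # curvature -c, specialised to a diagonal matrix M = diag(diag).
    mx = [d * xi for d, xi in zip(diag, x)]
    x_norm, mx_norm = norm(x), norm(mx)
    if x_norm == 0.0 or mx_norm == 0.0:
        return [0.0] * len(x)
    sqrt_c = math.sqrt(c)
    new_norm = math.tanh((mx_norm / x_norm) * math.atanh(sqrt_c * x_norm)) / sqrt_c
    return [new_norm * v / mx_norm for v in mx]

def hyperbolic_radius(x, c=1.0):
    # Geodesic distance from the origin of the ball to x.
    sqrt_c = math.sqrt(c)
    return 2.0 / sqrt_c * math.atanh(sqrt_c * norm(x))

x = [0.3, 0.4]                      # a point inside the unit ball
y = mobius_diag_mul([2.0, 2.0], x)  # uniform diagonal scaling by 2
```

With a uniform diagonal, the product keeps the direction of `x` and multiplies its hyperbolic radius by the scale factor while the result stays inside the ball, which is one way a learnable matrix can move a representation between coarser and finer levels of a hierarchy.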

[108] AnyPcc: Compressing Any Point Cloud with a Single Universal Model

Kangli Wang,Qianxi Yi,Yuqi Ye,Shihao Li,Wei Gao

Main category: cs.CV

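TL;DR: Proposes AnyPcc, a universal point cloud compression framework; a universal context model and an instance-adaptive fine-tuning strategy substantially improve the generalization of point cloud geometry compression, achieving state-of-the-art results on 15 datasets.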

Details Motivation: The generalization of deep learning-based point cloud geometry compression is limited by context models that are not robust and by inefficient handling of out-of-distribution (OOD) data. Method: Proposes the AnyPcc framework, which includes a universal context model that leverages priors from spatial and channel-wise grouping, and an Instance-Adaptive Fine-Tuning (IAFT) strategy that handles OOD data by fine-tuning a small subset of network weights and encoding them into the bitstream. Result: Extensive experiments on a benchmark of 15 diverse datasets show that AnyPcc sets a new state of the art in point cloud compression. Conclusion: By strengthening context modeling and adding instance-level adaptive optimization, AnyPcc improves the generalization of point cloud compression models; code and datasets will be released to support reproducible research. Abstract: Generalization remains a critical challenge for deep learning-based point cloud geometry compression. We argue this stems from two key limitations: the lack of robust context models and the inefficient handling of out-of-distribution (OOD) data. To address both, we introduce AnyPcc, a universal point cloud compression framework. AnyPcc first employs a Universal Context Model that leverages priors from both spatial and channel-wise grouping to capture robust contextual dependencies. Second, our novel Instance-Adaptive Fine-Tuning (IAFT) strategy tackles OOD data by synergizing explicit and implicit compression paradigms. It fine-tunes a small subset of network weights for each instance and incorporates them into the bitstream, where the marginal bit cost of the weights is dwarfed by the resulting savings in geometry compression. Extensive experiments on a benchmark of 15 diverse datasets confirm that AnyPcc sets a new state-of-the-art in point cloud compression. Our code and datasets will be released to encourage reproducible research.

[109] AccuQuant: Simulating Multiple Denoising Steps for Quantizing Diffusion Models

Seunghoon Lee,Jeongwoo Choi,Byunggwan Son,Jaehyeon Moon,Jeimin Jeon,Bumsub Ham

Main category: cs.CV

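TL;DR: Proposes AccuQuant, a new post-training quantization method for diffusion models that explicitly simulates multiple denoising steps during quantization to reduce error accumulation, and that significantly lowers memory complexity.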

Details Motivation: Quantization errors accumulate over the denoising steps of diffusion sampling and degrade performance, and existing methods do not address this effectively. Method: AccuQuant explicitly simulates the diffusion sampling process by minimizing the discrepancy between the outputs of the full-precision and quantized models over multiple denoising steps, and introduces a new objective together with an efficient implementation technique. Result: The method significantly reduces the accumulation of quantization errors and lowers memory complexity from O(n) to O(1); its efficacy and efficiency are demonstrated across multiple tasks and standard benchmarks. Conclusion: AccuQuant is an efficient and effective post-training quantization method for diffusion models that alleviates error accumulation and applies to a variety of diffusion models and tasks. Abstract: We present in this paper a novel post-training quantization (PTQ) method, dubbed AccuQuant, for diffusion models. We show analytically and empirically that quantization errors for diffusion models are accumulated over denoising steps in a sampling process. To alleviate the error accumulation problem, AccuQuant minimizes the discrepancies between outputs of a full-precision diffusion model and its quantized version within a couple of denoising steps. That is, it simulates multiple denoising steps of a diffusion sampling process explicitly for quantization, accounting for the accumulated errors over multiple denoising steps, in contrast to previous approaches that imitate a training process of diffusion models, namely, minimizing the discrepancies independently for each step. We also present an efficient implementation technique for AccuQuant, together with a novel objective, which reduces the memory complexity significantly from $\mathcal{O}(n)$ to $\mathcal{O}(1)$, where $n$ is the number of denoising steps. We demonstrate the efficacy and efficiency of AccuQuant across various tasks and diffusion models on standard benchmarks.
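The difference between per-step and accumulated quantization error can be seen in a toy example. The scalar "denoiser" below, its weight, the 16-level quantizer, and the function names are all hypothetical stand-ins chosen for illustration; this is not AccuQuant's objective or a real diffusion model.

```python
def quantize(value, levels=16):
    # Toy uniform quantizer: snap the value to a grid with spacing 1/levels.
    return round(value * levels) / levels

def step(x, w):
    # Hypothetical one-step "denoiser": a scalar affine update.
    return w * x + 0.1

def per_step_error(x0, w, w_q, n):
    # Each step is judged independently: the quantized model always receives
    # the full-precision input, so earlier errors are never carried forward.
    worst, x = 0.0, x0
    for _ in range(n):
        worst = max(worst, abs(step(x, w) - step(x, w_q)))
        x = step(x, w)
    return worst

def accumulated_error(x0, w, w_q, n):
    # Both models run their own chain, as in actual sampling, so the
    # quantized chain drifts further from the full-precision one each step.
    x_fp, x_q = x0, x0
    for _ in range(n):
        x_fp, x_q = step(x_fp, w), step(x_q, w_q)
    return abs(x_fp - x_q)

w = 0.97
w_q = quantize(w)
independent = per_step_error(1.0, w, w_q, n=10)
chained = accumulated_error(1.0, w, w_q, n=10)
```

In this toy setting the chained error is several times larger than the worst independent per-step error, which is the gap that motivates simulating multiple denoising steps when calibrating a quantized model.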

[110] Positional Encoding Field

Yunpeng Bai,Haoxiang Li,Qixing Huang

Main category: cs.CV

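TL;DR: Proposes PE-Field, a positional encoding method that extends 2D positional encodings to a structured 3D field so that DiTs can model geometry directly in 3D space, reaching SOTA performance on single-image novel view synthesis and spatial image editing.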

Details Motivation: Patch tokens in DiTs are found to be robust to perturbations of the positional encodings, indicating that spatial coherence is governed mainly by the positional encodings; this motivates a more structured positional encoding to strengthen 3D awareness. Method: Proposes the Positional Encoding Field (PE-Field), which extends 2D positional encodings into a structured 3D field with depth-aware encodings and hierarchical sub-patch control, improving the ability of DiTs to model 3D geometry. Result: Achieves state-of-the-art performance on single-image novel view synthesis and supports controllable spatial image editing, supporting the effectiveness and generality of PE-Field for 3D-aware visual generation. Conclusion: Structured 3D positional encoding markedly improves DiT performance on 3D-aware visual generation and highlights the key role of positional encodings in how vision Transformers organize space. Abstract: Diffusion Transformers (DiTs) have emerged as the dominant architecture for visual generation, powering state-of-the-art image and video models. By representing images as patch tokens with positional encodings (PEs), DiTs combine Transformer scalability with spatial and temporal inductive biases. In this work, we revisit how DiTs organize visual content and discover that patch tokens exhibit a surprising degree of independence: even when PEs are perturbed, DiTs still produce globally coherent outputs, indicating that spatial coherence is primarily governed by PEs. Motivated by this finding, we introduce the Positional Encoding Field (PE-Field), which extends positional encodings from the 2D plane to a structured 3D field. PE-Field incorporates depth-aware encodings for volumetric reasoning and hierarchical encodings for fine-grained sub-patch control, enabling DiTs to model geometry directly in 3D space. Our PE-Field-augmented DiT achieves state-of-the-art performance on single-image novel view synthesis and generalizes to controllable spatial image editing.

[111] Mitigating Cross-modal Representation Bias for Multicultural Image-to-Recipe Retrieval

Qing Wang,Chong-Wah Ngo,Yu Cao,Ee-Peng Lim

Main category: cs.CV

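TL;DR: Proposes a causal representation learning method that addresses the modality gap and visual bias in image-to-recipe retrieval by predicting culinary elements that images may overlook and explicitly injecting them into cross-modal representation learning, improving retrieval on monolingual and multilingual multicultural datasets.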

Details Motivation: Existing image-to-recipe retrieval methods assume that a food image fully reflects the recipe text, but an image shows only the appearance of the finished dish and cannot capture non-visual cooking details. Models therefore favor visually dominant features and struggle to distinguish recipes with similar ingredients and methods, and the bias is more severe on data that mixes cuisines. Method: Proposes a causal representation learning framework that predicts the culinary elements missing from the image (such as the use of specific ingredients and cooking actions) and explicitly injects them to correct the bias in cross-modal representation learning, improving the modeling of non-visual key information. Result: Experiments on the standard monolingual Recipe1M dataset and a newly curated multilingual multicultural cuisine dataset show that the method uncovers overlooked, subtle culinary elements and improves image-to-recipe retrieval, particularly when distinguishing similar recipes. Conclusion: The proposed causal representation learning mitigates visual bias in cross-modal retrieval; by introducing culinary knowledge that images omit, it improves the model's grasp of recipe details and achieves strong retrieval results in both monolingual and multilingual multicultural settings. Abstract: Existing approaches for image-to-recipe retrieval have the implicit assumption that a food image can fully capture the details textually documented in its recipe. However, a food image only reflects the visual outcome of a cooked dish and not the underlying cooking process. Consequently, learning cross-modal representations to bridge the modality gap between images and recipes tends to ignore subtle, recipe-specific details that are not visually apparent but are crucial for recipe retrieval. Specifically, the representations are biased to capture the dominant visual elements, resulting in difficulty in ranking similar recipes with subtle differences in use of ingredients and cooking methods. The bias in representation learning is expected to be more severe when the training data is a mix of images and recipes sourced from different cuisines. This paper proposes a novel causal approach that predicts the culinary elements potentially overlooked in images, while explicitly injecting these elements into cross-modal representation learning to mitigate biases. Experiments are conducted on the standard monolingual Recipe1M dataset and a newly curated multilingual multicultural cuisine dataset. The results indicate that the proposed causal representation learning is capable of uncovering subtle ingredients and cooking actions and achieves impressive retrieval performance on both monolingual and multilingual multicultural datasets.

[112] Dynamic Weight Adjustment for Knowledge Distillation: Leveraging Vision Transformer for High-Accuracy Lung Cancer Detection and Real-Time Deployment

Saif Ur Rehman Khan,Muhammad Nabeel Asim,Sebastian Vollmer,Andreas Dengel

Main category: cs.CV

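TL;DR: Proposes the FuzzyDistillViT-MobileNet model, which combines dynamic fuzzy logic-driven knowledge distillation with image fusion for lung cancer classification and achieves high accuracy on multiple datasets.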

Details Motivation: Traditional knowledge distillation cannot handle the uncertainty and complexity of images in lung cancer diagnosis effectively; the goal is to improve the generalization and robustness of the student model. Method: Uses a Vision Transformer (ViT-B32) as the teacher model and MobileNet as the student model, with the distillation weight adjusted dynamically by fuzzy logic; applies Gamma correction and histogram equalization for pixel-level image enhancement and a wavelet-based fusion method to improve resolution; uses a genetic algorithm to select the best pre-trained student model. Result: Reaches 99.16% accuracy on the LC25000 histopathological image dataset and 99.54% on the IQOTH/NCCD CT-scan dataset, showing robustness across imaging modalities. Conclusion: Dynamic distillation weights and image quality optimization markedly improve the accuracy and stability of lung cancer classification, with potential for clinical decision support. Abstract: This paper presents the FuzzyDistillViT-MobileNet model, a novel approach for lung cancer (LC) classification, leveraging dynamic fuzzy logic-driven knowledge distillation (KD) to address uncertainty and complexity in disease diagnosis. Unlike traditional models that rely on static KD with fixed weights, our method dynamically adjusts the distillation weight using fuzzy logic, enabling the student model to focus on high-confidence regions while reducing attention to ambiguous areas. This dynamic adjustment improves the model's ability to handle varying uncertainty levels across different regions of LC images. We employ the Vision Transformer (ViT-B32) as the instructor model, which effectively transfers knowledge to the student model, MobileNet, enhancing the student's generalization capabilities. The training process is further optimized using a dynamic wait adjustment mechanism that adapts the training procedure for improved convergence and performance. To enhance image quality, we introduce pixel-level image fusion improvement techniques such as Gamma correction and Histogram Equalization. The processed images (Pix1 and Pix2) are fused using a wavelet-based fusion method to improve image resolution and feature preservation. This fusion method uses the wavedec2 function to standardize images to a 224x224 resolution, decompose them into multi-scale frequency components, and recursively average coefficients at each level for better feature representation.
To address computational efficiency, Genetic Algorithm (GA) is used to select the most suitable pre-trained student model from a pool of 12 candidates, balancing model performance with computational cost. The model is evaluated on two datasets, including LC25000 histopathological images (99.16% accuracy) and IQOTH/NCCD CT-scan images (99.54% accuracy), demonstrating robustness across different imaging domains.
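The two pixel-level enhancements named in the abstract, Gamma correction and Histogram Equalization, can be sketched on a flat list of 8-bit grayscale values. This is a textbook version in plain Python, not the paper's pipeline: the gamma value, the function names, and the sample pixels are assumptions, and the wavelet fusion step is omitted.

```python
def gamma_correct(pixels, gamma=0.5):
    # Power-law mapping on values normalised to [0, 1]; gamma < 1 brightens.
    return [round(255 * (p / 255) ** gamma) for p in pixels]

def equalize(pixels, levels=256):
    # Histogram equalization: remap intensities so that the cumulative
    # distribution of the output is spread over the full range.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    n = len(pixels)
    if n == cdf_min:
        return list(pixels)  # constant image: nothing to spread out
    return [round((cdf[p] - cdf_min) / (n - cdf_min) * (levels - 1)) for p in pixels]

# Hypothetical low-contrast patch (flattened grayscale values).
patch = [50, 50, 100, 150]
pix1 = gamma_correct(patch)
pix2 = equalize(patch)
```

The names `pix1` and `pix2` only echo the abstract's Pix1 and Pix2; in the described method the two processed images would then be passed to a wavelet-based fusion step.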

[113] Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence

Kun Ouyang,Yuanxin Liu,Linli Yao,Yishuo Cai,Hao Zhou,Jie Zhou,Fandong Meng,Xu Sun

Main category: cs.CV

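TL;DR: Presents Conan, a framework for evidence-grounded multi-step video reasoning; with the large-scale Conan-91K dataset and a multi-stage progressive cold-start training strategy, it clearly surpasses the baseline on six multi-step reasoning benchmarks and achieves state-of-the-art performance.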

Details Motivation: Existing multimodal large language models produce reasoning chains detached from visual evidence, or localize evidence frames inaccurately, and lack effective cross-frame multi-step reasoning. Method: Proposes the Conan framework, which combines identification of contextual and evidence frames, reasoning over cross-frame clues, and adaptive decision making; builds the Conan-91K dataset and trains progressively in multiple stages with an Identification-Reasoning-Action (AIR) reinforcement learning framework. Result: Surpasses Qwen2.5-VL-7B-Instruct by an average of over 10% in accuracy on six multi-step video reasoning benchmarks and shows good generalization, scalability, and robustness on long-video understanding tasks. Conclusion: Through visual evidence grounding and adaptive reasoning decisions, Conan improves the performance of multimodal large models on complex video reasoning tasks. Abstract: Video reasoning, which requires multi-step deduction across frames, remains a major challenge for multimodal large language models (MLLMs). While reinforcement learning (RL)-based methods enhance reasoning capabilities, they often rely on text-only chains that yield ungrounded or hallucinated conclusions. Conversely, frame-retrieval approaches introduce visual grounding but still struggle with inaccurate evidence localization. To address these challenges, we present Conan, a framework for evidence-grounded multi-step video reasoning. Conan identifies contextual and evidence frames, reasons over cross-frame clues, and adaptively decides when to conclude or explore further. To achieve this, we (1) construct Conan-91K, a large-scale dataset of automatically generated reasoning traces that includes frame identification, evidence reasoning, and action decision, and (2) design a multi-stage progressive cold-start strategy combined with an Identification-Reasoning-Action (AIR) RLVR training framework to jointly enhance multi-step visual reasoning. Extensive experiments on six multi-step reasoning benchmarks demonstrate that Conan surpasses the baseline Qwen2.5-VL-7B-Instruct by an average of over 10% in accuracy, achieving state-of-the-art performance. Furthermore, Conan generalizes effectively to long-video understanding tasks, validating its strong scalability and robustness.

[114] Reliable and Reproducible Demographic Inference for Fairness in Face Analysis

Alexandre Fournier-Montgieux,Hervé Le Borgne,Adrian Popescu,Bertrand Luvison

Main category: cs.CV

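TL;DR: Proposes a reproducible demographic attribute inference (DAI) pipeline based on modular transfer learning to make fairness evaluation of face analysis systems more reliable; it outperforms baselines on gender and ethnicity inference, especially ethnicity, and introduces a robustness metric based on intra-identity consistency.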

Details Motivation: Fairness evaluation depends on automatic demographic attribute inference (DAI), and the reliability of DAI affects the bias and variance of the evaluation, so DAI must be made more reliable for fairness audits to be valid. Method: Builds a reproducible DAI pipeline with a modular transfer learning approach that combines pretrained face recognition encoders with non-linear classification heads, and audits it on three dimensions: accuracy, fairness, and a newly proposed robustness based on intra-identity consistency. Result: Across multiple datasets and training setups, the method outperforms strong baselines on gender and ethnicity inference, particularly on the more challenging ethnicity attribute, with good fairness and robustness. Conclusion: The work provides a reliable, transparent, and reproducible foundation for demographic inference in fairness auditing. Abstract: Fairness evaluation in face analysis systems (FAS) typically depends on automatic demographic attribute inference (DAI), which itself relies on predefined demographic segmentation. However, the validity of fairness auditing hinges on the reliability of the DAI process. We begin by providing a theoretical motivation for this dependency, showing that improved DAI reliability leads to less biased and lower-variance estimates of FAS fairness. To address this, we propose a fully reproducible DAI pipeline that replaces conventional end-to-end training with a modular transfer learning approach. Our design integrates pretrained face recognition encoders with non-linear classification heads. We audit this pipeline across three dimensions: accuracy, fairness, and a newly introduced notion of robustness, defined via intra-identity consistency. The proposed robustness metric is applicable to any demographic segmentation scheme. We benchmark the pipeline on gender and ethnicity inference across multiple datasets and training setups. Our results show that the proposed method outperforms strong baselines, particularly on ethnicity, which is the more challenging attribute. To promote transparency and reproducibility, we will publicly release the training dataset metadata, full codebase, pretrained models, and evaluation toolkit. This work contributes a reliable foundation for demographic inference in fairness auditing.
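One plausible reading of robustness "defined via intra-identity consistency" is that all images of the same person should receive the same predicted attribute. The sketch below scores that idea as the average share of each identity's images that agree with the identity's majority label. The abstract does not give the formula, so this definition, the function name, and the sample labels are assumptions.

```python
from collections import Counter

def intra_identity_consistency(predictions):
    # predictions: {identity id: [predicted label for each image of that identity]}
    scores = []
    for labels in predictions.values():
        if not labels:
            continue  # skip identities without images
        majority_count = Counter(labels).most_common(1)[0][1]
        scores.append(majority_count / len(labels))
    # Mean agreement with the per-identity majority label; 1.0 means the
    # classifier never changes its answer within an identity.
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical predictions for two identities under some segmentation scheme.
preds = {
    "id_1": ["A", "A", "A", "B"],
    "id_2": ["B", "B"],
}
consistency = intra_identity_consistency(preds)
```

Because the score only compares predictions with each other and needs no ground-truth label set, a metric of this kind can be computed for any demographic segmentation scheme, which matches the applicability claim in the abstract.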

[115] EchoDistill: Bidirectional Concept Distillation for One-Step Diffusion Personalization

Yixiong Yang,Tao Wu,Senmao Li,Shiqi Yang,Yaxing Wang,Joost van de Weijer,Kai Wang

Main category: cs.CV

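TL;DR: Proposes EchoDistill, a bidirectional concept distillation framework for one-step diffusion personalization (1-SDP); bidirectional knowledge echoing between teacher and student and a shared text encoder markedly improve the personalization of novel concepts and the generation quality in text-to-image synthesis.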

Details Motivation: Existing one-step text-to-image diffusion models personalize novel concepts poorly because they struggle to capture new concept distributions, so their personalization ability needs to be improved. Method: Designs EchoDistill, an end-to-end bidirectional concept distillation framework that jointly trains a multi-step teacher model and a one-step student model; it uses a shared text encoder, adversarial losses, and alignment losses, and introduces a bidirectional echoing refinement strategy so that the concept is mutually reinforced between teacher and student. Result: Experiments show that the method significantly outperforms existing personalization methods in the 1-SDP setup, improving both the student's personalization ability and the teacher's generation quality. Conclusion: EchoDistill offers a fast and effective personalization paradigm for text-to-image diffusion models and supports the value of bidirectional distillation when model capacity is limited. Abstract: Recent advances in accelerating text-to-image (T2I) diffusion models have enabled the synthesis of high-fidelity images even in a single step. However, personalizing these models to incorporate novel concepts remains a challenge due to the limited capacity of one-step models to capture new concept distributions effectively. We propose a bidirectional concept distillation framework, EchoDistill, to enable one-step diffusion personalization (1-SDP). Our approach involves an end-to-end training process where a multi-step diffusion model (teacher) and a one-step diffusion model (student) are trained simultaneously. The concept is first distilled from the teacher model to the student, and then echoed back from the student to the teacher. During EchoDistill, we share the text encoder between the two models to ensure consistent semantic understanding. Following this, the student model is optimized with adversarial losses to align with the real image distribution and with alignment losses to maintain consistency with the teacher's output. Furthermore, we introduce the bidirectional echoing refinement strategy, wherein the student model leverages its faster generation capability to provide feedback to the teacher model. This bidirectional concept distillation mechanism not only enhances the student's ability to personalize novel concepts but also improves the generative quality of the teacher model. Our experiments demonstrate that this collaborative framework significantly outperforms existing personalization methods over the 1-SDP setup, establishing a novel paradigm for rapid and effective personalization in T2I diffusion models.

[116] Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning

Xiaohan Lan,Fanfan Liu,Haibo Qiu,Siqi Yang,Delian Ruan,Peng Shi,Lin Ma

Main category: cs.CV

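TL;DR: Proposes Metis-HOME, a mixture-of-experts framework that splits the model into a reasoning branch and a non-reasoning branch and routes queries dynamically, improving both complex reasoning and general performance and addressing the efficiency-versus-generalization trade-off in current multimodal large models.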

Details Motivation: Existing large models apply costly reasoning even to simple queries, which is inefficient, and an excessive focus on complex reasoning weakens their general understanding. Method: Proposes Metis-HOME, a mixture-of-experts (MoE) architecture built on Qwen2.5-VL-7B with a thinking branch dedicated to complex reasoning and a non-thinking branch for rapid, direct inference; a lightweight trainable router allocates queries dynamically. Result: Experiments show that the method substantially improves complex reasoning and also strengthens general capabilities, overcoming the degradation seen in earlier reasoning-specialized models. Conclusion: Metis-HOME offers a new paradigm for building multimodal large language models that combine strong reasoning with broad applicability, resolving the tension between reasoning and generalization. Abstract: Inspired by recent advancements in LLM reasoning, the field of multimodal reasoning has seen remarkable progress, achieving significant performance gains on intricate tasks such as mathematical problem-solving. Despite this progress, current multimodal large reasoning models exhibit two key limitations. They tend to employ computationally expensive reasoning even for simple queries, leading to inefficiency. Furthermore, this focus on specialized reasoning often impairs their broader, more general understanding capabilities. In this paper, we propose Metis-HOME: a Hybrid Optimized Mixture-of-Experts framework designed to address this trade-off. Metis-HOME enables a "Hybrid Thinking" paradigm by structuring the original dense model into two distinct expert branches: a thinking branch tailored for complex, multi-step reasoning, and a non-thinking branch optimized for rapid, direct inference on tasks like general VQA and OCR. A lightweight, trainable router dynamically allocates queries to the most suitable expert. We instantiate Metis-HOME by adapting the Qwen2.5-VL-7B into an MoE architecture. Comprehensive evaluations reveal that our approach not only substantially enhances complex reasoning abilities but also improves the model's general capabilities, reversing the degradation trend observed in other reasoning-specialized models. Our work establishes a new paradigm for building powerful and versatile MLLMs, effectively resolving the prevalent reasoning-vs-generalization dilemma.

[117] Fake-in-Facext: Towards Fine-Grained Explainable DeepFake Analysis

Lixiong Qin,Yang Zhang,Mei Wang,Jiani Hu,Weihong Deng,Weiran Xu

Main category: cs.CV

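TL;DR: Proposes FiFa, a framework for fine-grained Explainable DeepFake Analysis (XDFA); it defines a Facial Image Concept Tree (FICT) and builds FiFa-Annotator for more reliable annotation, introduces the Artifact-Grounding Explanation (AGE) task and the multi-task FiFa-MLLM model, and achieves SOTA performance on the AGE task.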

Details Motivation: Existing DeepFake analysis methods lack fine-grained awareness: textual explanations are disconnected from visual evidence, and queries about arbitrary facial regions are not supported, so results are not sufficiently grounded in Face Visual Context (Facext). Method: Proposes the FiFa framework, which (1) builds a Facial Image Concept Tree (FICT) for fine-grained region division; (2) designs FiFa-Annotator to improve annotation reliability; (3) introduces the AGE task, which pairs textual explanations with segmentation masks of forged regions; and (4) builds FiFa-MLLM, a unified multi-task architecture that supports multiple input and output modes and adds auxiliary supervision tasks. Result: FiFa-MLLM outperforms strong baselines on the AGE task and reaches SOTA performance on existing XDFA datasets. Conclusion: Through fine-grained facial context modeling and multi-task learning, the FiFa framework improves the accuracy and explainability of Explainable DeepFake Analysis and strengthens the link between textual explanations and visual evidence. Abstract: The advancement of Multimodal Large Language Models (MLLMs) has bridged the gap between vision and language tasks, enabling the implementation of Explainable DeepFake Analysis (XDFA). However, current methods suffer from a lack of fine-grained awareness: the description of artifacts in data annotation is unreliable and coarse-grained, and the models fail to support the output of connections between textual forgery explanations and the visual evidence of artifacts, as well as the input of queries for arbitrary facial regions. As a result, their responses are not sufficiently grounded in Face Visual Context (Facext). To address this limitation, we propose the Fake-in-Facext (FiFa) framework, with contributions focusing on data annotation and model construction. We first define a Facial Image Concept Tree (FICT) to divide facial images into fine-grained regional concepts, thereby obtaining a more reliable data annotation pipeline, FiFa-Annotator, for forgery explanation. Based on this dedicated data annotation, we introduce a novel Artifact-Grounding Explanation (AGE) task, which generates textual forgery explanations interleaved with segmentation masks of manipulated artifacts. We propose a unified multi-task learning architecture, FiFa-MLLM, to simultaneously support abundant multimodal inputs and outputs for fine-grained Explainable DeepFake Analysis. With multiple auxiliary supervision tasks, FiFa-MLLM can outperform strong baselines on the AGE task and achieve SOTA performance on existing XDFA datasets.
The code and data will be made open-source at https://github.com/lxq1000/Fake-in-Facext.

[118] Blur2seq: Blind Deblurring and Camera Trajectory Estimation from a Single Camera Motion-blurred Image

Guillermo Carbajal,Andrés Almansa,Pablo Musé

Main category: cs.CV

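TL;DR: Proposes a deep learning framework that jointly estimates the sharp image and the camera motion trajectory to handle motion blur caused by large or rotational movements; combining a differentiable blur creation module with a model-based restoration network, it achieves state-of-the-art deblurring under severe or spatially variant blur.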

Details Motivation: Motion blur, especially from large or rotational camera shake, remains a major challenge in image restoration, and existing end-to-end deblurring networks perform poorly under severe or spatially variant blur. Method: Uses the Projective Motion Blur Model (PMBM) with a differentiable blur creation module; a neural network predicts a full 3D rotation trajectory that guides a model-based restoration network trained end-to-end. The modular architecture improves interpretability, and the trajectory is refined after inference with a reblur loss. Result: Achieves state-of-the-art performance on both synthetic and real datasets, outperforming existing methods particularly under severe or spatially variant blur, and can reconstruct the sequence of sharp images that produced the blurry image. Conclusion: By combining model-driven priors with deep learning, the method improves image restoration under complex motion blur, with good interpretability and practical potential. Abstract: Motion blur caused by camera shake, particularly under large or rotational movements, remains a major challenge in image restoration. We propose a deep learning framework that jointly estimates the latent sharp image and the underlying camera motion trajectory from a single blurry image. Our method leverages the Projective Motion Blur Model (PMBM), implemented efficiently using a differentiable blur creation module compatible with modern networks. A neural network predicts a full 3D rotation trajectory, which guides a model-based restoration network trained end-to-end. This modular architecture provides interpretability by revealing the camera motion that produced the blur. Moreover, this trajectory enables the reconstruction of the sequence of sharp images that generated the observed blurry image. To further refine results, we optimize the trajectory post-inference via a reblur loss, improving consistency between the blurry input and the restored output. Extensive experiments show that our method achieves state-of-the-art performance on both synthetic and real datasets, particularly in cases with severe or spatially variant blur, where end-to-end deblurring networks struggle. Code and trained models are available at https://github.com/GuillermoCarbajal/Blur2Seq/
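The idea behind a reblur loss can be shown on a tiny 1-D "image": a blurry observation is modeled as the average of the sharp frames seen along the camera trajectory, and the loss measures how well a candidate sequence reproduces the observed blur. In the paper the frames would come from warping the restored image along the predicted rotation trajectory; here they are given directly, and the mean-squared-error form and function names are assumptions.

```python
def reblur(sharp_frames):
    # Synthesize a blurry image as the per-pixel mean of the sharp frames.
    n = len(sharp_frames)
    width = len(sharp_frames[0])
    return [sum(frame[i] for frame in sharp_frames) / n for i in range(width)]

def reblur_loss(sharp_frames, blurry):
    # Mean squared error between the re-synthesized blur and the observation.
    synthetic = reblur(sharp_frames)
    return sum((s - b) ** 2 for s, b in zip(synthetic, blurry)) / len(blurry)

# A bright point that moves one pixel during the exposure (toy 1-D image).
observed_blur = [0.0, 0.5, 0.5, 0.0]
moving_point = [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
static_point = [[0.0, 1.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]

loss_correct = reblur_loss(moving_point, observed_blur)
loss_wrong = reblur_loss(static_point, observed_blur)
```

A trajectory hypothesis that explains the observation gives zero loss here, while a static-camera hypothesis does not, so lowering this quantity after inference pushes the estimated trajectory toward consistency with the blurry input.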

[119] Deep Learning-Powered Visual SLAM Aimed at Assisting Visually Impaired Navigation

Marziyeh Bamdad,Hans-Peter Hutter,Alireza Darvishy

Main category: cs.CV

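TL;DR: SELM-SLAM3 is a deep learning-based visual SLAM framework that combines SuperPoint and LightGlue; under challenging conditions such as low texture and motion blur it clearly outperforms ORB-SLAM3 and existing RGB-D SLAM systems, making it suitable for reliable navigation assistance for visually impaired people.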

Details Motivation: Under challenging conditions such as low texture, motion blur, or difficult lighting, existing SLAM techniques struggle to maintain stable and accurate localization and tracking, which limits their reliability and safety in critical applications such as assistive navigation for the visually impaired. Method: Proposes SELM-SLAM3, a deep learning-enhanced visual SLAM framework that uses SuperPoint for feature extraction and LightGlue for feature matching, evaluated on the TUM RGB-D, ICL-NUIM, and TartanAir datasets. Result: Across several challenging datasets, SELM-SLAM3 improves on ORB-SLAM3 by an average of 87.84% and exceeds state-of-the-art RGB-D SLAM systems by 36.77%, with greater robustness in low-texture and fast-motion scenes. Conclusion: SELM-SLAM3 markedly improves localization accuracy and system stability in adverse environments, providing a reliable platform for demanding applications such as navigation for the visually impaired. Abstract: Despite advancements in SLAM technologies, robust operation under challenging conditions such as low texture, motion blur, or difficult lighting remains an open problem. Such conditions are common in applications such as assistive navigation for the visually impaired. These challenges undermine localization accuracy and tracking stability, reducing navigation reliability and safety. To overcome these limitations, we present SELM-SLAM3, a deep learning-enhanced visual SLAM framework that integrates SuperPoint and LightGlue for robust feature extraction and matching. We evaluated our framework using TUM RGB-D, ICL-NUIM, and TartanAir datasets, which feature diverse and challenging scenarios. SELM-SLAM3 outperforms conventional ORB-SLAM3 by an average of 87.84% and exceeds state-of-the-art RGB-D SLAM systems by 36.77%. Our framework demonstrates enhanced performance under challenging conditions, such as low-texture scenes and fast motion, providing a reliable platform for developing navigation aids for the visually impaired.

[120] From Cheap to Pro: A Learning-based Adaptive Camera Parameter Network for Professional-Style Imaging

Fuchen Li,Yansong Du,Wenbo Cheng,Xiaoxia Zhou,Sen Yin

Main category: cs.CV

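TL;DR: Proposes ACamera-Net, a lightweight, scene-adaptive camera parameter adjustment network that predicts optimal exposure and white balance directly from RAW data, improving image quality under complex illumination.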

Details Motivation: Consumer-grade cameras struggle to maintain stable image quality under complex illumination such as low light, high dynamic range, backlighting, and color temperature variation, leading to underexposure, color casts, and tonal inconsistency that degrade downstream vision tasks. Method: Designs ACamera-Net with two modules, ACamera-Exposure (which estimates ISO to alleviate underexposure and contrast loss) and ACamera-Color (which predicts color temperature and gain factors to improve color consistency); it predicts camera parameters directly from RAW input and is optimized for real-time inference on edge devices. Result: Trained and validated on diverse real-world data, the model generalizes well; experiments show that it outperforms conventional auto modes and lightweight baselines in image quality and in the stability of perception outputs, without additional image enhancement modules. Conclusion: ACamera-Net improves the image quality of consumer-grade cameras and the robustness of vision tasks under complex illumination, runs in real time, is easy to deploy, and suits imaging systems on edge devices. Abstract: Consumer-grade camera systems often struggle to maintain stable image quality under complex illumination conditions such as low light, high dynamic range, and backlighting, as well as spatial color temperature variation. These issues lead to underexposure, color casts, and tonal inconsistency, which degrade the performance of downstream vision tasks. To address this, we propose ACamera-Net, a lightweight and scene-adaptive camera parameter adjustment network that directly predicts optimal exposure and white balance from RAW inputs. The framework consists of two modules: ACamera-Exposure, which estimates ISO to alleviate underexposure and contrast loss, and ACamera-Color, which predicts correlated color temperature and gain factors for improved color consistency. Optimized for real-time inference on edge devices, ACamera-Net can be seamlessly integrated into imaging pipelines. Trained on diverse real-world data with annotated references, the model generalizes well across lighting conditions. Extensive experiments demonstrate that ACamera-Net consistently enhances image quality and stabilizes perception outputs, outperforming conventional auto modes and lightweight baselines without relying on additional image enhancement modules.

[121] From Far and Near: Perceptual Evaluation of Crowd Representations Across Levels of Detail

Xiaohan Sun,Carol O'Sullivan

Main category: cs.CV

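TL;DR: This paper studies how users perceive the visual quality of crowd character representations at different levels of detail (LoD) and viewing distances, comparing the trade-offs between visual fidelity and computational performance of geometric meshes, image-based impostors, Neural Radiance Fields (NeRFs), and 3D Gaussians; the results help guide the design of perceptually optimized LoD strategies for crowd rendering.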

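Details Motivation: To improve rendering efficiency while preserving visual quality, it is necessary to understand how users perceive different crowd character representations under different conditions, so that level-of-detail strategies better matched to human visual perception can be designed. Method: Qualitative and quantitative experiments evaluate the perceived visual quality of geometric meshes, image-based impostors, NeRFs, and 3D Gaussians at different LoDs and viewing distances. Result: The representations show clear trade-offs between visual fidelity and computational performance, and perceived quality varies systematically with LoD and distance. Conclusion: The findings provide a design basis for perceptually optimized LoD strategies in crowd rendering. Abstract: In this paper, we investigate how users perceive the visual quality of crowd character representations at different levels of detail (LoD) and viewing distances. Each representation (geometric meshes, image-based impostors, Neural Radiance Fields (NeRFs), and 3D Gaussians) exhibits distinct trade-offs between visual fidelity and computational performance. Our qualitative and quantitative results provide insights to guide the design of perceptually optimized LoD strategies for crowd rendering.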

[122] EmbodiedBrain: Expanding Performance Boundaries of Task Planning for Embodied Intelligence

Ding Zou,Feifan Wang,Mengyu Ge,Siyuan Fan,Zongbing Zhang,Wei Chen,Lingfeng Wang,Zhongyou Hu,Wenrui Yan,Zhengwei Gao,Hao Wang,Weizhao Jin,Yu Zhang,Hainan Zhao,Mingliang Zhang,Xianxian Xi,Yaru Zhang,Wenyuan Li,Zhengguang Gao,Yurui Zhu

Main category: cs.CV

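TL;DR: This paper proposes EmbodiedBrain, a new vision-language foundation model that targets the limitations of current large models on embodied intelligence tasks. By combining large-scale supervised fine-tuning with Step-GRPO training, it markedly improves performance on long-horizon tasks, and it establishes a comprehensive evaluation system and an open-source simulation environment.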

Details Motivation: Current large language models and multimodal large models for embodied intelligence tasks suffer from a gap between model design and practical requirements, difficulty balancing real-time latency and performance, and unrealistic evaluation metrics, so model architectures and training methods better suited to embodied agents are urgently needed. Method: Proposes the EmbodiedBrain model, which adopts an agent-aligned data structure and combines large-scale supervised fine-tuning (SFT) with Step-Augmented Group Relative Policy Optimization (Step-GRPO), introducing preceding steps as Guided Precursors, and integrates a Generative Reward Model (GRM) accelerated at the infrastructure level to improve training efficiency. Result: Experiments show that EmbodiedBrain achieves state-of-the-art results on general, planning, and end-to-end simulation benchmarks, markedly raising long-horizon task success, and supports follow-up research by open-sourcing data, model weights, and evaluation methods. Conclusion: EmbodiedBrain effectively bridges the gap between model design and the needs of embodied agents, providing a scalable framework and open resources for the next generation of generalist embodied agents. Abstract: The realization of Artificial General Intelligence (AGI) necessitates Embodied AI agents capable of robust spatial perception, effective task planning, and adaptive execution in physical environments. However, current large language models (LLMs) and multimodal LLMs (MLLMs) for embodied tasks suffer from key limitations, including a significant gap between model design and agent requirements, an unavoidable trade-off between real-time latency and performance, and the use of unauthentic, offline evaluation metrics. To address these challenges, we propose EmbodiedBrain, a novel vision-language foundation model available in both 7B and 32B parameter sizes. Our framework features an agent-aligned data structure and employs a powerful training methodology that integrates large-scale Supervised Fine-Tuning (SFT) with Step-Augmented Group Relative Policy Optimization (Step-GRPO), which boosts long-horizon task success by integrating preceding steps as Guided Precursors. Furthermore, we incorporate a comprehensive reward system, including a Generative Reward Model (GRM) accelerated at the infrastructure level, to improve training efficiency. To enable thorough validation, we establish a three-part evaluation system encompassing General, Planning, and End-to-End Simulation Benchmarks, highlighted by the proposal and open-sourcing of a novel, challenging simulation environment.
Experimental results demonstrate that EmbodiedBrain achieves superior performance across all metrics, establishing a new state of the art for embodied foundation models. To pave the way for the next generation of generalist embodied agents, we open-source all of our data, model weights, and evaluation methods, which are available at https://zterobot.github.io/EmbodiedBrain.github.io.

[123] Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence

Jiahao Meng,Xiangtai Li,Haochen Wang,Yue Tan,Tao Zhang,Lingdong Kong,Yunhai Tong,Anran Wang,Zhiyang Teng,Yujing Wang,Zhuochen Wang

Main category: cs.CV

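TL;DR: This paper presents Open-o3 Video, a non-agent framework that integrates explicit spatio-temporal evidence into video reasoning; by building high-quality datasets and designing a reinforcement learning strategy, it achieves state-of-the-art performance on multiple video understanding benchmarks.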

Details Motivation: Most existing video reasoning models generate only textual reasoning traces, without indicating when and where key evidence appears, and extending evidence-centered reasoning from images to video requires joint spatio-temporal modeling. Method: Proposes the Open-o3 Video framework, builds two datasets with fine-grained spatio-temporal annotations (STGR-CoT-30k and STGR-RL-36k), and adopts a cold-start reinforcement learning strategy with multi-objective rewards that jointly optimize answer accuracy, temporal alignment, and spatial precision. Result: On the V-STAR benchmark, mAM improves by 14.4% and mLGM by 24.2% over the Qwen2.5-VL baseline; consistent improvements are also obtained on VideoMME, WorldSense, VideoMMMU, and TVGBench. Conclusion: Open-o3 Video produces interpretable spatio-temporal reasoning traces that not only improve model performance but also support test-time scaling and answer reliability, moving video understanding toward greater transparency and trustworthiness. Abstract: Most video reasoning models only generate textual reasoning traces without indicating when and where key evidence appears. Recent models such as OpenAI-o3 have sparked wide interest in evidence-centered reasoning for images, yet extending this ability to videos is more challenging, as it requires joint temporal tracking and spatial localization across dynamic scenes. We introduce Open-o3 Video, a non-agent framework that integrates explicit spatio-temporal evidence into video reasoning, and carefully collect training data and design training strategies to address the aforementioned challenges. The model highlights key timestamps, objects, and bounding boxes alongside its answers, allowing reasoning to be grounded in concrete visual observations. To enable this functionality, we first curate and build two high-quality datasets, STGR-CoT-30k for SFT and STGR-RL-36k for RL, with carefully constructed temporal and spatial annotations, since most existing datasets offer either temporal spans for videos or spatial boxes on images, lacking unified spatio-temporal supervision and reasoning traces. Then, we adopt a cold-start reinforcement learning strategy with multiple specially designed rewards that jointly encourage answer accuracy, temporal alignment, and spatial precision. On the V-STAR benchmark, Open-o3 Video achieves state-of-the-art performance, raising mAM by 14.4% and mLGM by 24.2% over the Qwen2.5-VL baseline. Consistent improvements are also observed on a broad range of video understanding benchmarks, including VideoMME, WorldSense, VideoMMMU, and TVGBench.
Beyond accuracy, the reasoning traces produced by Open-o3 Video also provide valuable signals for test-time scaling, enabling confidence-aware verification and improving answer reliability.
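
The abstract above describes rewards that jointly encourage answer accuracy, temporal alignment, and spatial precision, but does not give their formulas. A minimal sketch of how such a combined reward might look, assuming interval IoU for the temporal term, box IoU for the spatial term, and hypothetical weights (the function names and weights are illustrative, not the authors' definitions):

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def combined_reward(answer_ok, pred_span, gt_span, pred_box, gt_box,
                    weights=(1.0, 0.5, 0.5)):
    """Weighted sum of the three terms; the weights are an assumption."""
    w_ans, w_time, w_space = weights
    return (w_ans * float(answer_ok)
            + w_time * temporal_iou(pred_span, gt_span)
            + w_space * box_iou(pred_box, gt_box))
```

With these assumed weights, a correct answer with perfectly aligned span and box scores 2.0, while a correct answer with no grounding overlap scores only 1.0, which is the kind of pressure toward grounded evidence the abstract describes.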

[124] GenColorBench: A Color Evaluation Benchmark for Text-to-Image Generation Models

Muhammad Atif Butt,Alexandra Gomez-Villa,Tao Wu,Javier Vazquez-Corral,Joost Van De Weijer,Kai Wang

Main category: cs.CV

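TL;DR: This paper presents GenColorBench, the first comprehensive benchmark for color precision in text-to-image generation. Grounded in color systems such as ISCC-NBS and CSS3/X11, it contains 44K color-focused prompts, systematically evaluates how well existing models control color, and reveals their strengths, weaknesses, and failure modes.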

Details Motivation: Existing text-to-image models perform poorly at fine-grained color control, and no benchmark systematically evaluates color precision, even though color is essential to visual perception and practical applications; a comprehensive, fine-grained evaluation standard for color generation is therefore needed. Method: Builds a new benchmark, GenColorBench, covering 400+ colors and 44K color-related prompts, and systematically evaluates mainstream text-to-image models using perceptual assessment and automated metrics, grounded in color systems such as ISCC-NBS, CSS3/X11, and numerical RGB values. Result: The evaluation shows significant differences in color generation precision across models and reveals how well models understand different color conventions (such as named colors and RGB values) along with their common error modes, indicating that the benchmark can identify model strengths and weaknesses. Conclusion: GenColorBench provides the first systematic evaluation scheme for the color generation ability of text-to-image models, which should help drive improvements in color controllability and support more precise visual content generation. Abstract: Recent years have seen impressive advances in text-to-image generation, with image generative or unified models producing high-quality images from text. Yet these models still struggle with fine-grained color controllability, often failing to accurately match colors specified in text prompts. While existing benchmarks evaluate compositional reasoning and prompt adherence, none systematically assess color precision. Color is fundamental to human visual perception and communication, critical for applications from art to design workflows requiring brand consistency. However, current benchmarks either neglect color or rely on coarse assessments, missing key capabilities such as interpreting RGB values or aligning with human expectations. To this end, we propose GenColorBench, the first comprehensive benchmark for text-to-image color generation, grounded in color systems like ISCC-NBS and CSS3/X11, including numerical colors which are absent elsewhere. With 44K color-focused prompts covering 400+ colors, it reveals models' true capabilities via perceptual and automated assessments. Evaluations of popular text-to-image models using GenColorBench show performance variations, highlighting which color conventions models understand best and identifying failure modes. Our GenColorBench assessments will guide improvements in precise color generation. The benchmark will be made public upon acceptance.
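
The abstract does not specify GenColorBench's automated metrics. As a rough illustration of what checking a generated color against a named or numerical target involves, the sketch below uses a handful of CSS3 named colors and plain Euclidean distance in RGB. Both the tiny color table and the distance choice are assumptions for illustration; a perceptual color difference would be a more faithful stand-in for human judgment.

```python
import math

# A small subset of CSS3 named colors (hex values as defined by CSS3).
CSS3_SUBSET = {
    "red": "#FF0000",
    "green": "#008000",
    "blue": "#0000FF",
    "orange": "#FFA500",
    "teal": "#008080",
    "navy": "#000080",
}


def hex_to_rgb(hex_code):
    """Convert '#RRGGBB' to an (R, G, B) tuple of ints in 0..255."""
    h = hex_code.lstrip("#")
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))


def color_error(target_hex, generated_rgb):
    """Euclidean RGB distance between the prompted and the generated color."""
    return math.dist(hex_to_rgb(target_hex), generated_rgb)


def nearest_named_color(rgb):
    """Name of the closest color in the subset (ties broken by dict order)."""
    return min(CSS3_SUBSET, key=lambda name: color_error(CSS3_SUBSET[name], rgb))
```

For example, an object whose mean color is (250, 160, 10) would be matched to "orange", so a prompt asking for a navy object would count as a failure under this simplified check.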

[125] Unsupervised Domain Adaptation via Similarity-based Prototypes for Cross-Modality Segmentation

Ziyu Ye,Chen Ju,Chaofan Ma,Xiaoyun Zhang

Main category: cs.CV

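TL;DR: This paper proposes a new cross-modality segmentation framework based on similarity-based prototypes. It learns class-wise prototypes in an embedding space with a similarity constraint and combines dictionary storage with contrastive learning, which mitigates domain shift and outperforms existing methods in the unsupervised domain adaptation setting.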

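Details Motivation: Deep learning models perform well on training data, but domain shift causes significant performance drops on unseen data, and annotating new domains is costly, so effective unsupervised domain adaptation methods are needed to narrow the domain gap. Method: Proposes a cross-modality segmentation framework based on similarity-based prototypes: it learns a prototype for each class in an embedding space, introduces a similarity constraint that makes prototypes representative of their own class and separable from other classes, and uses dictionaries to store prototypes extracted from multiple images, which avoids the class-missing problem and supports contrastive learning of prototypes. Result: Extensive experiments show that the method outperforms other state-of-the-art unsupervised domain adaptation methods on cross-modality segmentation. Conclusion: The proposed framework based on similarity prototypes and a dictionary mechanism effectively improves generalization to unseen domains and offers a new solution for cross-modality segmentation under unsupervised domain adaptation. Abstract: Deep learning models have achieved great success on various vision challenges, but a well-trained model would face drastic performance degradation when applied to unseen data. Since the model is sensitive to domain shift, unsupervised domain adaptation attempts to reduce the domain gap and avoid costly annotation of unseen domains. This paper proposes a novel framework for cross-modality segmentation via similarity-based prototypes. Specifically, we learn class-wise prototypes within an embedding space, then introduce a similarity constraint to make these prototypes representative for each semantic class while separable from different classes. Moreover, we use dictionaries to store prototypes extracted from different images, which prevents the class-missing problem and enables the contrastive learning of prototypes, and further improves performance. Extensive experiments show that our method achieves better results than other state-of-the-art methods.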
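
The prototype and dictionary ideas in the abstract can be sketched in a few lines. The version below assumes prototypes are per-class means of pixel embeddings, similarity is cosine similarity, and the dictionary is a fixed-length queue per class; none of these details are confirmed by the abstract, and all function names are hypothetical.

```python
import math


def class_prototypes(features, labels):
    """Average the embedding vectors of each class present in one image."""
    sums, counts = {}, {}
    for vec, lab in zip(features, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def update_dictionary(dictionary, prototypes, max_len=4):
    """Append new prototypes per class, keeping only the most recent max_len.

    Classes absent from the current image keep their stored prototypes,
    which is how a dictionary can sidestep the class-missing problem.
    """
    for lab, proto in prototypes.items():
        bank = dictionary.setdefault(lab, [])
        bank.append(proto)
        if len(bank) > max_len:
            bank.pop(0)
    return dictionary
```

A similarity constraint would then push `cosine` toward 1 for prototypes of the same class drawn from the dictionary and toward lower values for prototypes of different classes.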

[126] OnlineSplatter: Pose-Free Online 3D Reconstruction for Free-Moving Objects

Mark He Huang,Lin Geng Foo,Christian Theobalt,Ying Sun,De Wen Soh

Main category: cs.CV

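TL;DR: Proposes OnlineSplatter, an online feed-forward framework that generates high-quality, object-centric 3D Gaussian representations directly from monocular video, without camera poses, depth priors, or bundle optimization.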

Details Motivation: Reconstructing free-moving objects from monocular video remains challenging when reliable pose or depth cues are unavailable and the object moves arbitrarily. Method: Anchors reconstruction on the first frame and progressively refines the object representation through a dense field of Gaussian primitives; introduces a dual-key memory module that combines latent appearance-geometry keys with explicit directional keys to robustly fuse current-frame features with temporally aggregated object states. Result: Experiments on real-world datasets show that OnlineSplatter significantly outperforms existing pose-free reconstruction methods, improving steadily with more observations while keeping memory and runtime constant. Conclusion: OnlineSplatter offers an efficient and scalable solution for real-time, high-quality 3D reconstruction of free-moving objects. Abstract: Free-moving object reconstruction from monocular video remains challenging, particularly without reliable pose or depth cues and under arbitrary object motion. We introduce OnlineSplatter, a novel online feed-forward framework generating high-quality, object-centric 3D Gaussians directly from RGB frames without requiring camera pose, depth priors, or bundle optimization. Our approach anchors reconstruction using the first frame and progressively refines the object representation through a dense Gaussian primitive field, maintaining constant computational cost regardless of video sequence length. Our core contribution is a dual-key memory module combining latent appearance-geometry keys with explicit directional keys, robustly fusing current frame features with temporally aggregated object states. This design enables effective handling of free-moving objects via spatial-guided memory readout and an efficient sparsification mechanism, ensuring comprehensive yet compact object coverage. Evaluations on real-world datasets demonstrate that OnlineSplatter significantly outperforms state-of-the-art pose-free reconstruction baselines, consistently improving with more observations while maintaining constant memory and runtime.

[127] SeViCES: Unifying Semantic-Visual Evidence Consensus for Long Video Understanding

Yuan Sheng,Yanbin Hao,Chenxu Li,Shuo Wang,Xiangnan He

Main category: cs.CV

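TL;DR: Proposes Semantic-Visual Consensus Evidence Selection (SeViCES), a training-free, model-agnostic framework that selects key frames by combining semantic and visual branches and fuses the evidence to improve the accuracy and robustness of long video understanding.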

Details Motivation: Existing methods for long videos ignore temporal dependencies or rely on unimodal evidence, leading to incomplete context or inconsistent reasoning and making it hard to support efficient understanding by video large language models. Method: Designs a Semantic-Visual Consensus Frame Selection (SVCFS) module that uses caption-based LLM reasoning and cluster-guided visual alignment, and introduces an Answer Consensus Refinement (ACR) module that fuses multimodal evidence to constrain the answer space. Result: Experiments on multiple long video understanding benchmarks show that SeViCES outperforms existing state-of-the-art methods in both accuracy and robustness. Conclusion: Consensus-driven evidence selection markedly improves the long-video understanding of video large language models, confirming the importance of multimodal, temporally aware frame selection. Abstract: Long video understanding remains challenging due to its complex, diverse, and temporally scattered content. Although video large language models (Video-LLMs) can process videos lasting tens of minutes, applying them to truly long sequences is computationally prohibitive and often leads to unfocused or inconsistent reasoning. A promising solution is to select only the most informative frames, yet existing approaches typically ignore temporal dependencies or rely on unimodal evidence, limiting their ability to provide complete and query-relevant context. We propose a Semantic-Visual Consensus Evidence Selection (SeViCES) framework for effective and reliable long video understanding. SeViCES is training-free and model-agnostic, and introduces two key components. The Semantic-Visual Consensus Frame Selection (SVCFS) module selects frames through (1) a temporal-aware semantic branch that leverages LLM reasoning over captions, and (2) a cluster-guided visual branch that aligns embeddings with semantic scores via mutual information. The Answer Consensus Refinement (ACR) module further resolves inconsistencies between semantic- and visual-based predictions by fusing evidence and constraining the answer space. Extensive experiments on long video understanding benchmarks show that SeViCES consistently outperforms state-of-the-art methods in both accuracy and robustness, demonstrating the importance of consensus-driven evidence selection for Video-LLMs.

[128] Deep Learning in Dental Image Analysis: A Systematic Review of Datasets, Methodologies, and Emerging Challenges

Zhenhuan Zhou,Jingbo Zhu,Yuchen Zhang,Xiaohang Guan,Peng Wang,Tao Li

Main category: cs.CV

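TL;DR: This paper reviews deep learning in dental image analysis, covering 260 studies (49 on public dental datasets and 211 on deep learning-based algorithms); it systematically summarizes dataset characteristics, model architectures, optimization strategies, and evaluation metrics, and discusses current challenges and future directions.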

Details Motivation: Dental image analysis faces challenges such as low contrast, metallic artifacts, and variations in projection angle, and manual interpretation is subjective and inconsistent, so automated solutions are needed to improve diagnostic accuracy and efficiency. Method: Systematically reviews 260 related studies, organizing and categorizing existing dental datasets and deep learning models, analyzing network architectures, optimization strategies, training methods, and performance by dental image analysis task, and summarizing commonly used training and evaluation metrics. Result: Provides a comprehensive review of deep learning in dental image analysis, organizing dataset characteristics and acquisition methods, categorizing mainstream models and algorithms, summarizing common evaluation metrics, and pointing out the limitations of existing research. Conclusion: The review offers researchers in dental image analysis a valuable systematic reference and should help advance AI in dental diagnosis and treatment. Abstract: Efficient analysis and processing of dental images are crucial for dentists to achieve accurate diagnosis and optimal treatment planning. However, dental imaging inherently poses several challenges, such as low contrast, metallic artifacts, and variations in projection angles. Combined with the subjectivity arising from differences in clinicians' expertise, manual interpretation often proves time-consuming and prone to inconsistency. Artificial intelligence (AI)-based automated dental image analysis (DIA) offers a promising solution to these issues and has become an integral part of computer-aided dental diagnosis and treatment. Among various AI technologies, deep learning (DL) stands out as the most widely applied and influential approach due to its superior feature extraction and representation capabilities. To comprehensively summarize recent progress in this field, we focus on the two fundamental aspects of DL research: datasets and models. In this paper, we systematically review 260 studies on DL applications in DIA, including 49 papers on publicly available dental datasets and 211 papers on DL-based algorithms. We first introduce the basic concepts of dental imaging and summarize the characteristics and acquisition methods of existing datasets. Then, we present the foundational techniques of DL and categorize relevant models and algorithms according to different DIA tasks, analyzing their network architectures, optimization strategies, training methods, and performance. Furthermore, we summarize commonly used training and evaluation metrics in the DIA domain.
Finally, we discuss the current challenges of existing research and outline potential future directions. We hope that this work provides a valuable and systematic reference for researchers in this field. All supplementary materials and detailed comparison tables will be made publicly available on GitHub.

[129] Better Tokens for Better 3D: Advancing Vision-Language Modeling in 3D Medical Imaging

Ibrahim Ethem Hamamci,Sezgin Er,Suprosanna Shit,Hadrien Reynaud,Dong Yang,Pengfei Guo,Marc Edgar,Daguang Xu,Bernhard Kainz,Bjoern Menze

Main category: cs.CV

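TL;DR: This paper presents BTB3D, a causal convolutional encoder-decoder model for 3D medical imaging that, through fine-grained 3D tokenization and a three-stage training strategy, significantly outperforms existing methods on report generation and text-to-CT synthesis.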

Details Motivation: Existing vision-language models struggle with high-resolution, long-sequence 3D medical images: vision encoders are misaligned with clinical language and slice-wise tokenization blurs fine anatomical structures, which degrades diagnostic performance on downstream tasks. Method: Proposes BTB3D, a causal convolutional encoder-decoder architecture that unifies 2D and 3D training and inference and produces compact, frequency-aware volumetric tokens; a three-stage training strategy (local reconstruction, overlapping-window tiling, and long-context decoder refinement) enables effective learning on long scan sequences. Result: Reaches state of the art on two key tasks: for report generation, BLEU scores improve and clinical F1 is 40% higher than CT2Rep, CT-CHAT, and Merlin; for text-to-CT synthesis, FID drops by 75% and FVD is halved, producing anatomically consistent 512×512×241 volumes. Conclusion: Precise three-dimensional tokenization, rather than larger language backbones alone, is essential for scalable vision-language modeling in 3D medical imaging. Abstract: Recent progress in vision-language modeling for 3D medical imaging has been fueled by large-scale computed tomography (CT) corpora with paired free-text reports, stronger architectures, and powerful pretrained models. This has enabled applications such as automated report generation and text-conditioned 3D image synthesis. Yet, current approaches struggle with high-resolution, long-sequence volumes: contrastive pretraining often yields vision encoders that are misaligned with clinical language, and slice-wise tokenization blurs fine anatomy, reducing diagnostic performance on downstream tasks. We introduce BTB3D (Better Tokens for Better 3D), a causal convolutional encoder-decoder that unifies 2D and 3D training and inference while producing compact, frequency-aware volumetric tokens. A three-stage training curriculum enables (i) local reconstruction, (ii) overlapping-window tiling, and (iii) long-context decoder refinement, during which the model learns from short slice excerpts yet generalizes to scans exceeding 300 slices without additional memory overhead. BTB3D sets a new state of the art on two key tasks: it improves BLEU scores and increases clinical F1 by 40% over CT2Rep, CT-CHAT, and Merlin for report generation; and it reduces FID by 75% and halves FVD compared to GenerateCT and MedSyn for text-to-CT synthesis, producing anatomically consistent 512×512×241 volumes. These results confirm that precise three-dimensional tokenization, rather than larger language backbones alone, is essential for scalable vision-language modeling in 3D medical imaging.
The codebase is available at: https://github.com/ibrahimethemhamamci/BTB3D

[130] UltraHR-100K: Enhancing UHR Image Synthesis with A Large-Scale High-Quality Dataset

Chen Zhao,En Ci,Yunzhe Xu,Tiehan Fan,Shanyan Guan,Yanhao Ge,Jian Yang,Ying Tai

Main category: cs.CV

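TL;DR: This paper introduces UltraHR-100K, a large-scale, high-quality dataset of 100K ultra-high-resolution images, and designs a frequency-aware post-training method for fine-grained detail generation, markedly improving detail quality and overall fidelity in ultra-high-resolution text-to-image generation.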

Details Motivation: Ultra-high-resolution text-to-image generation faces two major challenges: the lack of a large-scale, high-quality dataset and the lack of training strategies tailored to fine-grained detail synthesis. Method: Builds UltraHR-100K, a high-quality, richly annotated dataset of UHR images exceeding 3K resolution; proposes a frequency-aware post-training method comprising Detail-Oriented Timestep Sampling (DOTS), which focuses on detail-critical denoising steps, and Soft-Weighting Frequency Regularization (SWFR), which uses the Discrete Fourier Transform to preserve high-frequency detail. Result: Experiments on the UltraHR-eval4K benchmark show that the proposed method markedly improves the fine-grained detail quality and overall visual fidelity of generated images. Conclusion: By building a high-quality dataset and designing targeted training strategies, the work addresses key problems in UHR text-to-image generation and advances the field. Abstract: Ultra-high-resolution (UHR) text-to-image (T2I) generation has seen notable progress. However, two key challenges remain: (1) the absence of a large-scale high-quality UHR T2I dataset, and (2) the neglect of tailored training strategies for fine-grained detail synthesis in UHR scenarios. To tackle the first challenge, we introduce UltraHR-100K, a high-quality dataset of 100K UHR images with rich captions, offering diverse content and strong visual fidelity. Each image exceeds 3K resolution and is rigorously curated based on detail richness, content complexity, and aesthetic quality. To tackle the second challenge, we propose a frequency-aware post-training method that enhances fine-detail generation in T2I diffusion models. Specifically, we design (i) Detail-Oriented Timestep Sampling (DOTS) to focus learning on detail-critical denoising steps, and (ii) Soft-Weighting Frequency Regularization (SWFR), which leverages Discrete Fourier Transform (DFT) to softly constrain frequency components, encouraging high-frequency detail preservation. Extensive experiments on our proposed UltraHR-eval4K benchmarks demonstrate that our approach significantly improves the fine-grained detail quality and overall fidelity of UHR image generation. The code is available at https://github.com/NJU-PCALab/UltraHR-100k.
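
The abstract says SWFR uses the DFT to softly constrain frequency components but gives no formula. One plausible reading, shown here on a 1D signal for brevity, is a spectrum-matching loss whose per-frequency weight grows smoothly with frequency. The linear weight `1 + alpha * freq` and the function names are assumptions for illustration only, and a real implementation would use a 2D FFT from a numerical library rather than this naive DFT.

```python
import cmath


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real 1D signal."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def soft_frequency_loss(pred, target, alpha=1.0):
    """Spectrum-matching loss that up-weights high frequencies softly.

    The weight 1 + alpha * freq is an illustrative choice: alpha = 0
    reduces to an unweighted spectral error, larger alpha penalizes
    missing high-frequency detail more.
    """
    n = len(pred)
    if n == 0 or n != len(target):
        raise ValueError("pred and target must be non-empty and equal length")
    fp, ft = dft(pred), dft(target)
    loss = 0.0
    for k in range(n):
        freq = min(k, n - k) / (n / 2)  # normalized frequency in [0, 1]
        loss += (1.0 + alpha * freq) * abs(fp[k] - ft[k]) ** 2
    return loss / n
```

Under this sketch, a flat prediction of an alternating target such as [1, -1, 1, -1] is penalized twice as heavily with alpha = 1 as with alpha = 0, because all of the missing energy sits at the highest frequency.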

[131] HybridSOMSpikeNet: A Deep Model with Differentiable Soft Self-Organizing Maps and Spiking Dynamics for Waste Classification

Debojyoti Ghosh,Adrijit Goswami

Main category: cs.CV

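TL;DR: Proposes HybridSOMSpikeNet, a hybrid deep learning framework that combines convolutional feature extraction, differentiable self-organization, and spiking temporal processing for efficient and accurate waste classification, reaching 97.39% test accuracy and supporting sustainable development goals.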

Details Motivation: Accurate waste classification is essential for sustainable waste management; misclassification of recyclables currently leads to more landfill, inefficient recycling, and higher greenhouse gas emissions. Method: Builds HybridSOMSpikeNet by extracting spatial features with a pre-trained ResNet-152, adding a differentiable soft self-organizing map (Soft-SOM) to enhance topological clustering, and introducing a spiking neural network head that accumulates activations over time steps. Result: Reaches 97.39% test accuracy on a ten-class waste dataset, outperforming several state-of-the-art models while remaining computationally lightweight and suitable for real-world deployment. Conclusion: The framework improves the accuracy and energy efficiency of waste classification and, by raising recycling efficiency, reducing contamination, and supporting intelligent environmental management, contributes to the UN Sustainable Development Goals (SDG 11 and SDG 12). Abstract: Accurate waste classification is vital for achieving sustainable waste management and reducing the environmental footprint of urbanization. Misclassification of recyclable materials contributes to landfill accumulation, inefficient recycling, and increased greenhouse gas emissions. To address these issues, this study introduces HybridSOMSpikeNet, a hybrid deep learning framework that integrates convolutional feature extraction, differentiable self-organization, and spiking-inspired temporal processing to enable intelligent and energy-efficient waste classification. The proposed model employs a pre-trained ResNet-152 backbone to extract deep spatial representations, followed by a Differentiable Soft Self-Organizing Map (Soft-SOM) that enhances topological clustering and interpretability. A spiking neural head accumulates temporal activations over discrete time steps, improving robustness and generalization. Trained on a ten-class waste dataset, HybridSOMSpikeNet achieved a test accuracy of 97.39%, outperforming several state-of-the-art architectures while maintaining a lightweight computational profile suitable for real-world deployment. Beyond its technical innovations, the framework provides tangible environmental benefits. By enabling precise and automated waste segregation, it supports higher recycling efficiency, reduces contamination in recyclable streams, and minimizes the ecological and operational costs of waste processing.
The approach aligns with global sustainability priorities, particularly the United Nations Sustainable Development Goals (SDG 11 and SDG 12), by contributing to cleaner cities, circular economy initiatives, and intelligent environmental management systems.
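
A classical self-organizing map picks a single best-matching unit, which is not differentiable. A common way to make the assignment differentiable, and a reasonable guess at what "Soft-SOM" means in the abstract above, is a softmax over negative squared distances to the map's prototype vectors. The sketch below assumes that form; the temperature parameter and function names are hypothetical.

```python
import math


def soft_som_assign(x, prototypes, temperature=1.0):
    """Soft assignment of x to SOM units: softmax over negative squared distances."""
    d2 = [sum((a - b) ** 2 for a, b in zip(x, p)) for p in prototypes]
    d_min = min(d2)  # subtract the minimum for numerical stability
    exps = [math.exp(-(d - d_min) / temperature) for d in d2]
    total = sum(exps)
    return [e / total for e in exps]


def soft_som_output(x, prototypes, temperature=1.0):
    """Assignment-weighted mix of prototypes, usable as a smoothed feature."""
    weights = soft_som_assign(x, prototypes, temperature)
    return [sum(w * p[j] for w, p in zip(weights, prototypes))
            for j in range(len(x))]
```

Lowering the temperature makes the assignment approach the hard winner-take-all rule of a classical SOM, while higher temperatures spread weight across neighbouring units.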

[132] Efficient Multi-bit Quantization Network Training via Weight Bias Correction and Bit-wise Coreset Sampling

Jinhee Kim,Jae Jun An,Kang Eun Jeon,Jong Hwan Ko

Main category: cs.CV

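TL;DR: This paper proposes an efficient training method for multi-bit quantization networks that uses weight bias correction and a coreset sampling strategy based on gradient importance, substantially reducing training overhead while preserving model performance.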

Details Motivation: Existing multi-bit quantization network training repeats full-dataset updates for every bit-width, so training cost grows linearly with the number of precisions, and extra fine-tuning is often required, making training expensive. Method: Proposes two techniques: (1) weight bias correction, which neutralizes quantization-induced bias across bit-widths and aligns activation distributions, enabling shared batch normalization and removing the need for fine-tuning; (2) bit-wise coreset sampling, which trains on an informative subset selected by gradient-based importance scores, exploiting implicit knowledge transfer. Result: Experiments on CIFAR-10/100, TinyImageNet, and ImageNet-1K with ResNet and ViT architectures show competitive or better accuracy with training time reduced by up to 7.88x. Conclusion: The proposed method substantially lowers the training cost of multi-bit quantization networks without extra fine-tuning, and suits the multi-precision needs of flexibly deployed deep neural networks. Abstract: Multi-bit quantization networks enable flexible deployment of deep neural networks by supporting multiple precision levels within a single model. However, existing approaches suffer from significant training overhead as full-dataset updates are repeated for each supported bit-width, resulting in a cost that scales linearly with the number of precisions. Additionally, extra fine-tuning stages are often required to support additional or intermediate precision options, further compounding the overall training burden. To address this issue, we propose two techniques that greatly reduce the training overhead without compromising model utility: (i) Weight bias correction enables shared batch normalization and eliminates the need for fine-tuning by neutralizing quantization-induced bias across bit-widths and aligning activation distributions; and (ii) Bit-wise coreset sampling strategy allows each child model to train on a compact, informative subset selected via gradient-based importance scores by exploiting the implicit knowledge transfer phenomenon. Experiments on CIFAR-10/100, TinyImageNet, and ImageNet-1K with both ResNet and ViT architectures demonstrate that our method achieves competitive or superior accuracy while reducing training time up to 7.88x. Our code is released at https://github.com/a2jinhee/EMQNet_jk.
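
The two techniques can be illustrated with a toy example. The sketch below assumes symmetric uniform quantization, interprets "bias" as the shift in the mean of the weights caused by rounding, and selects the coreset as the top fraction of samples by a precomputed importance score. These are simplifying assumptions, not the paper's exact procedure, and all names are hypothetical.

```python
def quantize(weights, bits):
    """Symmetric uniform quantization of a weight list to the given bit-width."""
    if bits < 2:
        raise ValueError("this sketch assumes at least 2 bits")
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / levels or 1.0  # avoid a zero scale
    return [round(w / scale) * scale for w in weights]


def mean(values):
    return sum(values) / len(values)


def bias_corrected_quantize(weights, bits):
    """Quantize, then remove the mean shift that rounding introduced."""
    q = quantize(weights, bits)
    shift = mean(q) - mean(weights)
    return [v - shift for v in q]


def select_coreset(importance, fraction):
    """Indices of the highest-scoring samples (scores assumed gradient-based)."""
    k = max(1, int(len(importance) * fraction))
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:k])
```

After correction, every bit-width shares the same weight mean as the full-precision weights, which gives some intuition for why activation statistics could be aligned closely enough to share batch normalization across precisions.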

[133] Diagnosing Visual Reasoning: Challenges, Insights, and a Path Forward

Jing Bi,Guangyu Sun,Ali Vosoughi,Chen Chen,Chenliang Xu

Main category: cs.CV

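TL;DR: Proposes an agent-based architecture that combines LLM reasoning with lightweight visual modules to address hallucination and reliance on textual priors in multimodal large language models on visual tasks, yielding significant performance gains.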

Details Motivation: Existing multimodal large language models exhibit visual hallucinations and over-reliance on textual priors on complex visual tasks, calling for systematic diagnosis and improvement. Method: Designs a three-stage evaluation framework to diagnose SOTA vision-language models and proposes an agent-based architecture that combines LLM reasoning with lightweight visual modules, enabling fine-grained analysis and iterative refinement of reasoning chains. Result: The new system gains +10.3 on MMMU and +6.0 on MathVista over a 7B baseline, matching or surpassing much larger models. Conclusion: Future visual reasoning models should focus on integrating more specialized tools for analyzing visual content; the framework and evaluation suite will be released to support follow-up research. Abstract: Multimodal large language models (MLLMs) that integrate visual and textual reasoning leverage chain-of-thought (CoT) prompting to tackle complex visual tasks, yet continue to exhibit visual hallucinations and an over-reliance on textual priors. We present a systematic diagnosis of state-of-the-art vision-language models using a three-stage evaluation framework, uncovering key failure modes. To address these, we propose an agent-based architecture that combines LLM reasoning with lightweight visual modules, enabling fine-grained analysis and iterative refinement of reasoning chains. Our results suggest that future visual reasoning models should focus on integrating a broader set of specialized tools for analyzing visual content. Our system achieves significant gains (+10.3 on MMMU, +6.0 on MathVista over a 7B baseline), matching or surpassing much larger models. We will release our framework and evaluation suite to facilitate future research.

[134] Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models

Xuyang Liu,Xiyan Gui,Yuchao Zhang,Linfeng Zhang

Main category: cs.CV

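TL;DR: This paper proposes MixKV, a new method that combines importance with diversity to optimize KV cache compression in large vision-language models, easing the memory bottleneck and improving performance on multi-modal understanding tasks.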

Details Motivation: Existing KV cache compression methods focus on retaining high-importance key-value pairs and overlook the differing semantic redundancy across attention heads in multi-modal settings, resulting in incomplete semantic coverage. Method: Analyzes the redundancy patterns of the KV cache across attention heads in LVLMs and proposes MixKV, which adaptively balances importance and diversity during compression to better preserve semantic information. Result: Under extreme compression (budget=64), MixKV improves baseline methods by an average of 5.1% on five multi-modal understanding benchmarks and yields gains of 8.0% and 9.0% for SnapKV and AdaKV on GUI grounding tasks, while maintaining comparable inference efficiency and extending seamlessly to large language models. Conclusion: By jointly optimizing importance and diversity, MixKV improves the semantic retention of KV cache compression and markedly strengthens compression performance and deployment scalability for both multi-modal and unimodal models. Abstract: Recent large vision-language models (LVLMs) demonstrate remarkable capabilities in processing extended multi-modal sequences, yet the resulting key-value (KV) cache expansion creates a critical memory bottleneck that fundamentally limits deployment scalability. While existing KV cache compression methods focus on retaining high-importance KV pairs to minimize storage, they often overlook the modality-specific semantic redundancy patterns that emerge distinctively in multi-modal KV caches. In this work, we first analyze how, beyond simple importance, the KV cache in LVLMs exhibits varying levels of redundancy across attention heads. We show that relying solely on importance can only cover a subset of the full KV cache information distribution, leading to potential loss of semantic coverage. To address this, we propose MixKV, a novel method that mixes importance with diversity for optimized KV cache compression in LVLMs. MixKV adapts to head-wise semantic redundancy, selectively balancing diversity and importance when compressing KV pairs. Extensive experiments demonstrate that MixKV consistently enhances existing methods across multiple LVLMs. Under extreme compression (budget=64), MixKV improves baseline methods by an average of 5.1% across five multi-modal understanding benchmarks and achieves remarkable gains of 8.0% and 9.0% for SnapKV and AdaKV on GUI grounding tasks, all while maintaining comparable inference efficiency. Furthermore, MixKV extends seamlessly to LLMs with comparable performance gains.
Our code is available at https://github.com/xuyang-liu16/MixKV.
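
The abstract states that MixKV balances importance and diversity per attention head according to that head's redundancy, without giving the scoring rule. The sketch below is one hypothetical way to do this: redundancy is measured as the mean pairwise cosine similarity of a head's keys, diversity as one minus the cosine similarity to the key centroid, and the two scores are blended with the redundancy as the mixing weight. All of these choices are assumptions for illustration.

```python
import math


def _cos(a, b):
    """Cosine similarity; 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def head_redundancy(keys):
    """Mean pairwise cosine similarity of one head's keys, clipped to [0, 1]."""
    n = len(keys)
    if n < 2:
        return 0.0
    total = sum(_cos(keys[i], keys[j]) for i in range(n) for j in range(i + 1, n))
    return min(1.0, max(0.0, total / (n * (n - 1) / 2)))


def mixed_selection(keys, importance, budget):
    """Keep `budget` KV indices ranked by a redundancy-weighted score mix."""
    lam = head_redundancy(keys)  # more redundant head -> more weight on diversity
    dim = len(keys[0])
    centroid = [sum(k[d] for k in keys) / len(keys) for d in range(dim)]
    diversity = [1.0 - _cos(k, centroid) for k in keys]
    score = [(1.0 - lam) * imp + lam * div
             for imp, div in zip(importance, diversity)]
    ranked = sorted(range(len(keys)), key=lambda i: score[i], reverse=True)
    return sorted(ranked[:budget])
```

With keys [[1, 0], [1, 0.01], [0, 1]] and importance [0.9, 0.85, 0.7], importance alone would keep the two near-duplicate keys (indices 0 and 1), whereas this mixed score keeps indices 0 and 2, retaining the one key that covers a different direction.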

[135] ALICE-LRI: A General Method for Lossless Range Image Generation for Spinning LiDAR Sensors without Calibration Metadata

Samuel Soutullo,Miguel Yermo,David L. Vilariño,Óscar G. Lorenzo,José C. Cabaleiro,Francisco F. Rivera

Main category: cs.CV

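TL;DR: This paper proposes ALICE-LRI, a general, sensor-agnostic method that for the first time generates lossless range images from spinning LiDAR point clouds without relying on manufacturer metadata or calibration files.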

Details Motivation: Conventional LiDAR projection methods suffer from geometric inconsistencies that cause irreversible information loss and compromise the accuracy of high-fidelity applications. Method: Automatically reverse-engineers the intrinsic geometry of a spinning LiDAR sensor, including laser beam configuration, angular distributions, and per-beam calibration corrections, enabling lossless projection and complete point cloud reconstruction. Result: Validation on the KITTI and DurLAR datasets shows that the method preserves all points (zero loss), keeps geometric accuracy within sensor precision limits, and runs in real time. A compression case study shows markedly improved quality in downstream tasks. Conclusion: ALICE-LRI marks a paradigm shift from approximate to lossless projection, opening new possibilities for high-precision remote sensing applications that require complete geometric preservation. Abstract: 3D LiDAR sensors are essential for autonomous navigation, environmental monitoring, and precision mapping in remote sensing applications. To efficiently process the massive point clouds generated by these sensors, LiDAR data is often projected into 2D range images that organize points by their angular positions and distances. While these range image representations enable efficient processing, conventional projection methods suffer from fundamental geometric inconsistencies that cause irreversible information loss, compromising high-fidelity applications. We present ALICE-LRI (Automatic LiDAR Intrinsic Calibration Estimation for Lossless Range Images), the first general, sensor-agnostic method that achieves lossless range image generation from spinning LiDAR point clouds without requiring manufacturer metadata or calibration files. Our algorithm automatically reverse-engineers the intrinsic geometry of any spinning LiDAR sensor by inferring critical parameters including laser beam configuration, angular distributions, and per-beam calibration corrections, enabling lossless projection and complete point cloud reconstruction with zero point loss. Comprehensive evaluation across the complete KITTI and DurLAR datasets demonstrates that ALICE-LRI achieves perfect point preservation, with zero points lost across all point clouds. Geometric accuracy is maintained well within sensor precision limits, establishing geometric losslessness with real-time performance. We also present a compression case study that validates substantial downstream benefits, demonstrating significant quality improvements in practical applications.
This paradigm shift from approximate to lossless LiDAR projections opens new possibilities for high-precision remote sensing applications requiring complete geometric preservation.
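
To see where the information loss in conventional projections comes from, the sketch below implements a common approximate spherical projection: each point is binned by azimuth and elevation on a uniform grid, and points that fall in an already occupied pixel overwrite the earlier one. This is the kind of baseline the abstract contrasts against, not the ALICE-LRI algorithm; the grid size and field-of-view values are illustrative parameters.

```python
import math


def naive_range_image(points, rows, cols, fov_up_deg, fov_down_deg):
    """Project (x, y, z) points to a sparse range image on a uniform angular grid.

    Returns (image, lost), where image maps (row, col) -> range and lost
    counts points overwritten because another point shared their pixel.
    """
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    image, lost = {}, 0
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue  # the sensor origin has no defined direction
        azimuth = math.atan2(y, x)
        elevation = math.asin(z / r)
        col = int((azimuth + math.pi) / (2.0 * math.pi) * cols) % cols
        row = int((fov_up - elevation) / (fov_up - fov_down) * rows)
        row = min(rows - 1, max(0, row))
        if (row, col) in image:
            lost += 1  # collision: the earlier point is overwritten
        image[(row, col)] = r
    return image, lost
```

A nonzero `lost` count is exactly the irreversible loss the abstract refers to. Avoiding it requires pixel rows and columns that match the sensor's true beam elevations and firing pattern, which is the intrinsic geometry ALICE-LRI is described as estimating.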

[136] Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation

Yuhan Liu,Lianhui Qin,Shengjie Wang

Main category: cs.CV

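TL;DR: Proposes Speculative Verdict (SV), a training-free framework that combines multiple lightweight "draft experts" with a strong "verdict model" to achieve efficient and accurate multi-hop reasoning on information-intensive images.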

Details Motivation: Large vision-language models struggle to precisely localize key cues and perform multi-hop reasoning on information-intensive images with densely interleaved text and graphics, so an efficient, training-free framework is needed to improve performance. Method: SV has a draft stage and a verdict stage: small VLMs act as draft experts that generate diverse reasoning paths providing localization candidates; a large VLM synthesizes these paths into the final judgment; a consensus expert selection mechanism forwards only high-agreement paths to the verdict model. Result: SV achieves significant performance gains on several high-resolution, information-intensive visual question answering benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K, combining high accuracy with computational efficiency. Conclusion: By fusing multiple partially correct reasoning paths, SV balances error correction against cost and outperforms large proprietary models or methods that require training, making it an efficient and practical visual reasoning framework. Abstract: Large Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet they struggle when reasoning over information-intensive images that densely interleave textual annotations with fine-grained graphical elements. The main challenges lie in precisely localizing critical cues in dense layouts and multi-hop reasoning to integrate dispersed evidence. We propose Speculative Verdict (SV), a training-free framework inspired by speculative decoding that combines multiple lightweight draft experts with a large verdict model. In the draft stage, small VLMs act as draft experts to generate reasoning paths that provide diverse localization candidates; in the verdict stage, a strong VLM synthesizes these paths to produce the final answer, minimizing computational cost while recovering correct answers. To further improve efficiency and accuracy, SV introduces a consensus expert selection mechanism that forwards only high-agreement reasoning paths to the verdict. Empirically, SV achieves consistent gains on challenging information-intensive and high-resolution visual question answering benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K. By synthesizing correct insights from multiple partially accurate reasoning paths, SV achieves both error correction and cost-efficiency compared to large proprietary models or training pipelines. Code is available at https://github.com/Tinaliu0123/speculative-verdict
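
The consensus expert selection step can be pictured as a simple agreement filter over the draft experts' answers. The sketch below scores each expert by the fraction of other experts that gave the same normalized answer and forwards the top k. The abstract does not say how agreement is measured, so exact string matching and the function names here are assumptions.

```python
def normalize(answer):
    """Crude answer normalization (an assumption): trim and lowercase."""
    return answer.strip().lower()


def consensus_select(draft_answers, k):
    """Return the names of the k draft experts with the highest peer agreement.

    draft_answers maps expert name -> answer string. Ties are broken by
    name so the result is deterministic.
    """
    names = list(draft_answers)

    def agreement(name):
        others = [n for n in names if n != name]
        if not others:
            return 0.0
        mine = normalize(draft_answers[name])
        same = sum(normalize(draft_answers[n]) == mine for n in others)
        return same / len(others)

    ranked = sorted(names, key=lambda n: (-agreement(n), n))
    return ranked[:k]
```

Only the reasoning paths of the selected experts would then be passed to the large verdict model, which keeps its input short while filtering out outlier drafts.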

[137] AutoScape: Geometry-Consistent Long-Horizon Scene Generation

Jiacheng Chen,Ziyu Jiang,Mingfu Liang,Bingbing Zhuang,Jong-Chyi Su,Sparsh Garg,Ying Wu,Manmohan Chandraker

Main category: cs.CV

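TL;DR: Proposes AutoScape, a long-horizon driving scene generation framework built around a novel RGB-D diffusion model that iteratively generates sparse but geometrically consistent keyframes, which a video diffusion model then interpolates into coherent long-horizon driving videos.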

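Details Motivation: Existing driving scene generation methods struggle to maintain geometric consistency and visual quality over long horizons, so a new method is needed that can generate high-quality, geometrically consistent long-horizon driving videos. Method: Proposes a new RGB-D diffusion model that jointly handles image and depth in a shared latent space, explicitly conditions on the scene geometry (rendered point clouds) of previously generated keyframes, and steers sampling with warp-consistent guidance; a video diffusion model then interpolates between keyframes to produce dense video frames. Result: AutoScape generates realistic, geometrically consistent driving videos of over 20 seconds, improving long-horizon FID and FVD over the prior state of the art by 48.6% and 43.0%, respectively. Conclusion: By introducing a geometry-aware RGB-D diffusion model and keyframe generation conditioned on previously generated geometry, AutoScape substantially improves the quality and consistency of long-horizon driving scene generation and offers a more reliable data generation tool for autonomous driving simulation. Abstract: This paper proposes AutoScape, a long-horizon driving scene generation framework. At its core is a novel RGB-D diffusion model that iteratively generates sparse, geometrically consistent keyframes, serving as reliable anchors for the scene's appearance and geometry. To maintain long-range geometric consistency, the model 1) jointly handles image and depth in a shared latent space, 2) explicitly conditions on the existing scene geometry (i.e., rendered point clouds) from previously generated keyframes, and 3) steers the sampling process with a warp-consistent guidance. Given high-quality RGB-D keyframes, a video diffusion model then interpolates between them to produce dense and coherent video frames. AutoScape generates realistic and geometrically consistent driving videos of over 20 seconds, improving the long-horizon FID and FVD scores over the prior state-of-the-art by 48.6% and 43.0%, respectively.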

[138] ACS-SegNet: An Attention-Based CNN-SegFormer Segmentation Network for Tissue Segmentation in Histopathology

Nima Torbati,Anastasia Meshcheryakova,Ramona Woitek,Diana Mechtcheriakova,Amirreza Mahbod

Main category: cs.CV

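TL;DR: Proposes a dual-encoder model with attention-driven feature fusion of a CNN and a Vision Transformer to improve semantic segmentation of histopathology images, outperforming existing methods on the GCPS and PUMA datasets.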

Details Motivation: To improve the accuracy of semantic tissue segmentation in histopathology images and address the limitations of conventional deep learning models in capturing both long-range dependencies and local detail. Method: Designs a unified dual-encoder framework that fuses convolutional neural network (CNN) and vision transformer (ViT) features through an attention mechanism, exploiting the local perception of CNNs together with the global modeling capacity of ViTs. Result: On the two public datasets, the model achieved 76.79% mIoU and 86.87% mDice on GCPS, and 64.93% mIoU and 76.60% mDice on PUMA, outperforming current state-of-the-art benchmark models. Conclusion: The proposed attention-driven CNN-ViT feature fusion model effectively improves semantic segmentation of histopathology images and has potential for computer-aided clinical diagnosis. Abstract: Automated histopathological image analysis plays a vital role in computer-aided diagnosis of various diseases. Among developed algorithms, deep learning-based approaches have demonstrated excellent performance in multiple tasks, including semantic tissue segmentation in histological images. In this study, we propose a novel approach based on attention-driven feature fusion of convolutional neural networks (CNNs) and vision transformers (ViTs) within a unified dual-encoder model to improve semantic segmentation performance. Evaluation on two publicly available datasets showed that our model achieved $\mu$IoU/$\mu$Dice scores of 76.79%/86.87% on the GCPS dataset and 64.93%/76.60% on the PUMA dataset, outperforming state-of-the-art and baseline benchmarks. The implementation of our method is publicly available in a GitHub repository: https://github.com/NimaTorbati/ACS-SegNet
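
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of per-class IoU and Dice computed from flattened label maps and then averaged over classes. It assumes simple macro-averaging over the classes that appear; the exact averaging protocol behind the reported numbers is not stated in the abstract, and the function names are illustrative.

```python
def per_class_iou_dice(pred, target, num_classes):
    """Return per-class IoU and Dice lists for flattened integer label maps."""
    ious, dices = [], []
    for c in range(num_classes):
        tp = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        fp = sum(1 for p, t in zip(pred, target) if p == c and t != c)
        fn = sum(1 for p, t in zip(pred, target) if p != c and t == c)
        if tp + fp + fn == 0:
            continue  # class absent from both prediction and ground truth
        ious.append(tp / (tp + fp + fn))
        dices.append(2 * tp / (2 * tp + fp + fn))
    return ious, dices


def mean(values):
    return sum(values) / len(values)


# Toy example: six pixels, three tissue classes.
pred = [0, 0, 1, 1, 1, 2]
target = [0, 1, 1, 1, 2, 2]
ious, dices = per_class_iou_dice(pred, target, num_classes=3)
miou, mdice = mean(ious), mean(dices)
```

For a single class the two metrics are tied by Dice = 2·IoU / (1 + IoU), which is why Dice is never lower than IoU for the same prediction.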

[139] DyPE: Dynamic Position Extrapolation for Ultra High Resolution Diffusion

Noam Issachar,Guy Yariv,Sagie Benaim,Yossi Adi,Dani Lischinski,Raanan Fattal

Main category: cs.CV

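TL;DR: Proposes Dynamic Position Extrapolation (DyPE), a training-free method that improves ultra-high-resolution image generation with pre-trained diffusion transformers by dynamically adjusting the positional encoding during the diffusion process, going far beyond the training resolution.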

Details Motivation: Because self-attention scales quadratically with the number of image tokens, training diffusion transformers at ultra-high resolution is extremely costly, so a low-cost, efficient way to exceed the training resolution is needed. Method: Exploits the spectral progression of the diffusion process by dynamically adjusting the model's positional encoding at each step, so that low-frequency structures converge early and high-frequency details are resolved in later steps, matching the frequency needs of each generation stage. Result: With no additional sampling cost, DyPE generates images of up to 16 million pixels from a pre-trained model and achieves state-of-the-art fidelity on multiple benchmarks, with gains growing at higher resolutions. Conclusion: DyPE is an effective and general training-free method that substantially extends the high-resolution generation capability of diffusion transformers and offers an efficient solution for practical use. Abstract: Diffusion Transformer models can generate images with remarkable fidelity and detail, yet training them at ultra-high resolutions remains extremely costly due to the self-attention mechanism's quadratic scaling with the number of image tokens. In this paper, we introduce Dynamic Position Extrapolation (DyPE), a novel, training-free method that enables pre-trained diffusion transformers to synthesize images at resolutions far beyond their training data, with no additional sampling cost. DyPE takes advantage of the spectral progression inherent to the diffusion process, where low-frequency structures converge early, while high-frequencies take more steps to resolve. Specifically, DyPE dynamically adjusts the model's positional encoding at each diffusion step, matching their frequency spectrum with the current stage of the generative process. This approach allows us to generate images at resolutions that exceed the training resolution dramatically, e.g., 16 million pixels using FLUX. On multiple benchmarks, DyPE consistently improves performance and achieves state-of-the-art fidelity in ultra-high-resolution image generation, with gains becoming even more pronounced at higher resolutions. Project page is available at https://noamissachar.github.io/DyPE/.

[140] AlphaFlow: Understanding and Improving MeanFlow Models

Huijie Zhang,Aliaksandr Siarohin,Willi Menapace,Michael Vasilkovsky,Sergey Tulyakov,Qing Qu,Ivan Skorokhodov

Main category: cs.CV

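TL;DR: Proposes α-Flow, a framework that unifies trajectory flow matching, Shortcut Model, and MeanFlow, and uses a curriculum strategy to resolve the optimization conflict during training, achieving new state-of-the-art generation results on ImageNet-1K.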

Details Motivation: MeanFlow is effective but suffers from an optimization conflict that slows convergence: the two components of its objective (trajectory flow matching and trajectory consistency) are strongly negatively correlated and need to be disentangled to improve training efficiency. Method: Proposes the α-Flow framework, which unifies several methods under one formulation, and adopts a curriculum strategy that gradually anneals from trajectory flow matching to MeanFlow to ease the optimization conflict. Result: On ImageNet-1K 256x256 with DiT backbones, α-Flow outperforms MeanFlow across settings; the largest model reaches FID 2.58 (1-NFE) and 2.15 (2-NFE), the best results reported with vanilla DiT backbones. Conclusion: By disentangling the optimization objectives and introducing curriculum learning, α-Flow improves both the convergence speed and the generation quality of few-step generative models, making it a better option for training them. Abstract: MeanFlow has recently emerged as a powerful framework for few-step generative modeling trained from scratch, but its success is not yet fully understood. In this work, we show that the MeanFlow objective naturally decomposes into two parts: trajectory flow matching and trajectory consistency. Through gradient analysis, we find that these terms are strongly negatively correlated, causing optimization conflict and slow convergence. Motivated by these insights, we introduce $\alpha$-Flow, a broad family of objectives that unifies trajectory flow matching, Shortcut Model, and MeanFlow under one formulation. By adopting a curriculum strategy that smoothly anneals from trajectory flow matching to MeanFlow, $\alpha$-Flow disentangles the conflicting objectives, and achieves better convergence. When trained from scratch on class-conditional ImageNet-1K 256x256 with vanilla DiT backbones, $\alpha$-Flow consistently outperforms MeanFlow across scales and settings. Our largest $\alpha$-Flow-XL/2+ model achieves new state-of-the-art results using vanilla DiT backbones, with FID scores of 2.58 (1-NFE) and 2.15 (2-NFE).

[141] CUPID: Pose-Grounded Generative 3D Reconstruction from a Single Image

Binbin Huang,Haobin Duan,Yiqun Zhao,Zibo Zhao,Yi Ma,Shenghua Gao

Main category: cs.CV

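TL;DR: Proposes Cupid, a generation-based 3D reconstruction method that accurately infers an object's camera pose, 3D shape, and texture from a single 2D image. It jointly generates voxels and pixel-voxel correspondences in a shared 3D latent space, uses a two-stage flow matching pipeline for robust pose and shape estimation, and outperforms existing methods on several metrics.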

Details Motivation: Existing 3D reconstruction methods are limited in recovering accurate 3D shape, pose, and texture from a single image; without multi-view input it is hard to achieve both geometric accuracy and appearance detail. A unified generative framework that optimizes these elements together is therefore needed. Method: Cupid casts 3D reconstruction as conditional sampling from a learned distribution of 3D objects and represents camera pose and 3D shape in a shared 3D latent space. It uses two-stage flow matching: the first stage coarsely generates initial 3D geometry and its 2D projections to recover pose, and the second stage integrates pose-aligned image features to improve structural fidelity and appearance detail. Result: Experiments show that Cupid gains over 3 dB in PSNR and reduces Chamfer Distance by over 10%, matches monocular estimators on pose accuracy, and surpasses baseline 3D generative models in visual quality. Conclusion: Cupid achieves high-quality single-image 3D reconstruction through a unified generative framework, performs strongly on shape, pose, and texture recovery, and demonstrates the potential of generative models for 3D reconstruction. Abstract: This work proposes a new generation-based 3D reconstruction method, named Cupid, that accurately infers the camera pose, 3D shape, and texture of an object from a single 2D image. Cupid casts 3D reconstruction as a conditional sampling process from a learned distribution of 3D objects, and it jointly generates voxels and pixel-voxel correspondences, enabling robust pose and shape estimation under a unified generative framework. By representing both input camera poses and 3D shape as a distribution in a shared 3D latent space, Cupid adopts a two-stage flow matching pipeline: (1) a coarse stage that produces initial 3D geometry with associated 2D projections for pose recovery; and (2) a refinement stage that integrates pose-aligned image features to enhance structural fidelity and appearance details. Extensive experiments demonstrate Cupid outperforms leading 3D reconstruction methods with an over 3 dB PSNR gain and an over 10% Chamfer Distance reduction, while matching monocular estimators on pose accuracy and delivering superior visual fidelity over baseline 3D generative models. For an immersive view of the 3D results generated by Cupid, please visit cupid3d.github.io.

[142] Radar-Camera Fused Multi-Object Tracking: Online Calibration and Common Feature

Lei Cheng,Siyang Cao

Main category: cs.CV

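TL;DR: Proposes a multi-object tracking (MOT) framework that fuses radar and camera data, improving tracking accuracy through online calibration and common-feature matching; it is the first to explore radar-camera common features for online calibration.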

Details Motivation: Existing studies often underuse radar and treat it only as a supplementary sensor; this work aims to fully exploit radar's ability to provide range information about targets in 3D space and thereby improve multi-sensor fusion tracking. Method: Proposes a radar-camera fusion MOT framework that uses common features between radar and camera data for online calibration, and combines feature matching with category-consistency checking to improve the accuracy of associating detections from the two sensors. Result: In real-world experiments the framework simplifies the radar-camera mapping process and improves tracking precision, with effectiveness shown in both controlled environments and actual traffic scenarios. Conclusion: The method effectively integrates radar and camera data, reduces manual intervention through online calibration and feature-level fusion, and advances multi-sensor fusion tracking for autonomous driving. Abstract: This paper presents a Multi-Object Tracking (MOT) framework that fuses radar and camera data to enhance tracking efficiency while minimizing manual interventions. Contrary to many studies that underutilize radar and assign it a supplementary role--despite its capability to provide accurate range/depth information of targets in a world 3D coordinate system--our approach positions radar in a crucial role. Meanwhile, this paper utilizes common features to enable online calibration to autonomously associate detections from radar and camera. The main contributions of this work include: (1) the development of a radar-camera fusion MOT framework that exploits online radar-camera calibration to simplify the integration of detection results from these two sensors, (2) the utilization of common features between radar and camera data to accurately derive real-world positions of detected objects, and (3) the adoption of feature matching and category-consistency checking to surpass the limitations of mere position matching in enhancing sensor association accuracy. To the best of our knowledge, we are the first to investigate the integration of radar-camera common features and their use in online calibration for achieving MOT. The efficacy of our framework is demonstrated by its ability to streamline the radar-camera mapping process and improve tracking precision, as evidenced by real-world experiments conducted in both controlled environments and actual traffic scenarios. Code is available at https://github.com/radar-lab/Radar_Camera_MOT

[143] ARGenSeg: Image Segmentation with Autoregressive Image Generation Model

Xiaolong Wang,Lixiang Ru,Ziyuan Huang,Kaixiang Ji,Dandan Zheng,Jingdong Chen,Jun Zhou

Main category: cs.CV

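TL;DR: Proposes ARGenSeg, a new autoregressive generation-based paradigm for image segmentation that recasts segmentation as image generation, achieving multimodal understanding and pixel-level perception within a unified framework.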

Details Motivation: Existing methods rely on discrete representations or task-specific decoders, which makes it hard to capture fine-grained visual detail and limits multimodal large models on segmentation tasks. Method: Uses a multimodal large language model to output visual tokens, which a universal VQ-VAE decodes into dense segmentation masks; a next-scale-prediction strategy generates the visual tokens in parallel to reduce inference latency. Result: Surpasses prior state-of-the-art methods on multiple segmentation datasets with a marked boost in inference speed, while retaining strong understanding capabilities. Conclusion: ARGenSeg achieves efficient, accurate pixel-level segmentation through a generative approach and advances the use of multimodal large models for dense prediction tasks. Abstract: We propose a novel AutoRegressive Generation-based paradigm for image Segmentation (ARGenSeg), achieving multimodal understanding and pixel-level perception within a unified framework. Prior works integrating image segmentation into multimodal large language models (MLLMs) typically employ either boundary points representation or dedicated segmentation heads. These methods rely on discrete representations or semantic prompts fed into task-specific decoders, which limits the ability of the MLLM to capture fine-grained visual details. To address these challenges, we introduce a segmentation framework for MLLM based on image generation, which naturally produces dense masks for target objects. We leverage MLLM to output visual tokens and detokenize them into images using a universal VQ-VAE, making the segmentation fully dependent on the pixel-level understanding of the MLLM. To reduce inference latency, we employ a next-scale-prediction strategy to generate required visual tokens in parallel. Extensive experiments demonstrate that our method surpasses prior state-of-the-art approaches on multiple segmentation datasets with a remarkable boost in inference speed, while maintaining strong understanding capabilities.

[144] Video Prediction of Dynamic Physical Simulations With Pixel-Space Spatiotemporal Transformers

Dean L Slack,G Thomas Hudson,Thomas Winterbottom,Noura Al Moubayed

Main category: cs.CV

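TL;DR: Proposes a pure-transformer autoregressive video prediction model that uses continuous pixel-space representations, achieves accurate predictions over longer horizons on physical simulation data, and uses interpretability analysis to show that its PDE parameter estimation generalizes.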

Details Motivation: Existing video generation methods fall short on causal modeling and temporal consistency for physical simulations and struggle to capture spatiotemporal dynamics accurately. This work aims to improve long-horizon prediction accuracy for physical simulation videos with a simple end-to-end transformer architecture. Method: Uses a pure transformer for autoregressive video prediction, modeling directly in continuous pixel space; compares several spatiotemporal self-attention layouts, trains without supervision on physical simulation datasets, and introduces object tracking metrics to evaluate spatiotemporal reasoning. Result: Compared with existing latent-space approaches, the model extends the time horizon of physically accurate predictions by up to 50% while maintaining comparable performance on common video quality metrics; probing models identify network regions that encode PDE parameter information and show good generalization to out-of-distribution parameter estimation. Conclusion: The proposed simple, parameter-efficient, and interpretable transformer approach provides an effective platform for attention-based spatiotemporal video modeling, particularly for long-horizon prediction of physically governed video. Abstract: Inspired by the performance and scalability of autoregressive large language models (LLMs), transformer-based models have seen recent success in the visual domain. This study investigates a transformer adaptation for video prediction with a simple end-to-end approach, comparing various spatiotemporal self-attention layouts. Focusing on causal modeling of physical simulations over time, a common shortcoming of existing video-generative approaches, we attempt to isolate spatiotemporal reasoning via physical object tracking metrics and unsupervised training on physical simulation datasets. We introduce a simple yet effective pure transformer model for autoregressive video prediction, utilizing continuous pixel-space representations for video prediction. Without the need for complex training strategies or latent feature-learning components, our approach significantly extends the time horizon for physically accurate predictions by up to 50% when compared with existing latent-space approaches, while maintaining comparable performance on common video quality metrics. In addition, we conduct interpretability experiments to identify network regions that encode information useful to perform accurate estimations of PDE simulation parameters via probing models, and find that this generalizes to the estimation of out-of-distribution simulation parameters. This work serves as a platform for further attention-based spatiotemporal modeling of videos via a simple, parameter efficient, and interpretable approach.

[145] SpectraMorph: Structured Latent Learning for Self-Supervised Hyperspectral Super-Resolution

Ritik Shah,Marco F Duarte

Main category: cs.CV

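TL;DR: Proposes SpectraMorph, a physics-guided self-supervised fusion framework that fuses hyperspectral and multispectral images through an unmixing bottleneck, offering interpretability, fast training, and strong robustness.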

Details Motivation: Existing deep learning methods for hyperspectral super-resolution lack interpretability and degrade when the multispectral image has very few bands. Method: Uses a physics-guided self-supervised framework that extracts endmembers from the low-resolution hyperspectral image, predicts abundance maps from the multispectral image, reconstructs spectra by linear mixing, and trains in a self-supervised manner via the multispectral sensor's spectral response function. Result: On synthetic and real-world datasets SpectraMorph outperforms existing unsupervised/self-supervised methods and remains competitive with supervised methods; it trains quickly and stays robust even with a single-band MSI. Conclusion: SpectraMorph delivers high-performance, interpretable, fast-to-train, and sensor-robust hyperspectral super-resolution suitable for practical use. Abstract: Hyperspectral sensors capture dense spectra per pixel but suffer from low spatial resolution, causing blurred boundaries and mixed-pixel effects. Co-registered companion sensors such as multispectral, RGB, or panchromatic cameras provide high-resolution spatial detail, motivating hyperspectral super-resolution through the fusion of hyperspectral and multispectral images (HSI-MSI). Existing deep learning based methods achieve strong performance but rely on opaque regressors that lack interpretability and often fail when the MSI has very few bands. We propose SpectraMorph, a physics-guided self-supervised fusion framework with a structured latent space. Instead of direct regression, SpectraMorph enforces an unmixing bottleneck: endmember signatures are extracted from the low-resolution HSI, and a compact multilayer perceptron predicts abundance-like maps from the MSI. Spectra are reconstructed by linear mixing, with training performed in a self-supervised manner via the MSI sensor's spectral response function. SpectraMorph produces interpretable intermediates, trains in under a minute, and remains robust even with a single-band (pan-chromatic) MSI. Experiments on synthetic and real-world datasets show SpectraMorph consistently outperforming state-of-the-art unsupervised/self-supervised baselines while remaining very competitive against supervised baselines.
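
The unmixing bottleneck rests on the standard linear mixing model, which the sketch below illustrates under stated assumptions: endmembers form a bands × endmembers matrix, abundances an endmembers × pixels matrix, and the spectral response function (SRF) an MSI-bands × HSI-bands matrix. The abundances are given directly here rather than predicted by an MLP, the loss is plain mean squared error, and all names are illustrative rather than taken from the paper's code.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def reconstruct_spectra(endmembers, abundances):
    """Linear mixing: each pixel spectrum is a weighted sum of endmember spectra."""
    return matmul(endmembers, abundances)


def srf_loss(srf, endmembers, abundances, msi):
    """Project reconstructed spectra through the SRF and compare with the observed MSI."""
    predicted_msi = matmul(srf, reconstruct_spectra(endmembers, abundances))
    errors = [(p - m) ** 2
              for pred_row, msi_row in zip(predicted_msi, msi)
              for p, m in zip(pred_row, msi_row)]
    return sum(errors) / len(errors)


# Toy setup: 3 hyperspectral bands, 2 endmembers, 2 pixels, 1 panchromatic band.
endmembers = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]  # bands x endmembers
abundances = [[1.0, 0.25], [0.0, 0.75]]            # endmembers x pixels
srf = [[0.25, 0.5, 0.25]]                          # MSI bands x HSI bands
msi = [[0.5, 0.5]]                                 # observed MSI, bands x pixels

spectra = reconstruct_spectra(endmembers, abundances)
loss = srf_loss(srf, endmembers, abundances, msi)
```

Because the only supervision is the observed MSI, no high-resolution hyperspectral ground truth is required, and the abundance maps remain directly inspectable.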

[146] Towards General Modality Translation with Contrastive and Predictive Latent Diffusion Bridge

Nimrod Berman,Omkar Joglekar,Eitan Kosman,Dotan Di Castro,Omri Azencot

Main category: cs.CV

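TL;DR: Proposes the Latent Denoising Diffusion Bridge Model (LDDBM), a general-purpose modality translation framework that builds a denoising diffusion bridge in a shared latent space, removing restrictions of existing methods such as aligned dimensions and Gaussian priors and enabling translation between arbitrary modalities.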

Details Motivation: Existing modality translation methods typically rely on strong assumptions such as shared dimensionality, Gaussian source priors, and modality-specific architectures, which limit their generality and theoretical grounding. A more general, well-founded method for translating between arbitrary modalities is therefore needed. Method: Proposes LDDBM, a latent-variable extension of Denoising Diffusion Bridge Models that learns the mapping between modalities in a shared latent space; introduces a contrastive alignment loss for semantic consistency, designs a domain-agnostic encoder-decoder architecture for noise prediction in latent space, and proposes a predictive loss and several training strategies to improve cross-domain translation and training stability. Result: The method performs strongly on multiple modality translation tasks, including multi-view to 3D shape generation, image super-resolution, and multi-view scene synthesis, supports arbitrary modality pairs, and is validated by experiments and ablations. Conclusion: LDDBM provides a strong and flexible new baseline for general modality translation that overcomes the restrictions of traditional methods, with good extensibility and application prospects. Abstract: Recent advances in generative modeling have positioned diffusion models as state-of-the-art tools for sampling from complex data distributions. While these models have shown remarkable success across single-modality domains such as images and audio, extending their capabilities to Modality Translation (MT), translating information across different sensory modalities, remains an open challenge. Existing approaches often rely on restrictive assumptions, including shared dimensionality, Gaussian source priors, and modality-specific architectures, which limit their generality and theoretical grounding. In this work, we propose the Latent Denoising Diffusion Bridge Model (LDDBM), a general-purpose framework for modality translation based on a latent-variable extension of Denoising Diffusion Bridge Models. By operating in a shared latent space, our method learns a bridge between arbitrary modalities without requiring aligned dimensions. We introduce a contrastive alignment loss to enforce semantic consistency between paired samples and design a domain-agnostic encoder-decoder architecture tailored for noise prediction in latent space. Additionally, we propose a predictive loss to guide training toward accurate cross-domain translation and explore several training strategies to improve stability. Our approach supports arbitrary modality pairs and performs strongly on diverse MT tasks, including multi-view to 3D shape generation, image super-resolution, and multi-view scene synthesis. Comprehensive experiments and ablations validate the effectiveness of our framework, establishing a new strong baseline in general modality translation. For more information, see our project page: https://sites.google.com/view/lddbm/home.

[147] LayerComposer: Interactive Personalized T2I via Spatially-Aware Layered Canvas

Guocheng Gordon Qian,Ruihang Zhang,Tsai-Shien Chen,Yusuf Dalva,Anujraaj Argo Goyal,Willi Menapace,Ivan Skorokhodov,Meng Dong,Arpit Sahni,Daniil Ostashev,Ju Hu,Sergey Tulyakov,Kuan-Chieh Jackson Wang

Main category: cs.CV

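TL;DR: Proposes LayerComposer, an interactive framework for personalized multi-subject text-to-image generation that uses a layered canvas and a locking mechanism to achieve better spatial control and identity preservation.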

Details Motivation: Existing personalized generative models offer little interactive control over spatial composition and scale poorly to multiple subjects. Method: Introduces a layered canvas representation in which each subject sits on its own layer, enabling occlusion-free composition, and a locking mechanism that preserves selected layers with high fidelity while the remaining layers adapt flexibly to context. Result: Experiments show that LayerComposer achieves better spatial control and identity preservation than state-of-the-art methods in multi-subject personalized image generation. Conclusion: Through its layered structure and locking mechanism, LayerComposer improves the controllability and quality of multi-subject text-to-image generation. Abstract: Despite their impressive visual fidelity, existing personalized generative models lack interactive control over spatial composition and scale poorly to multiple subjects. To address these limitations, we present LayerComposer, an interactive framework for personalized, multi-subject text-to-image generation. Our approach introduces two main contributions: (1) a layered canvas, a novel representation in which each subject is placed on a distinct layer, enabling occlusion-free composition; and (2) a locking mechanism that preserves selected layers with high fidelity while allowing the remaining layers to adapt flexibly to the surrounding context. Similar to professional image-editing software, the proposed layered canvas allows users to place, resize, or lock input subjects through intuitive layer manipulation. Our versatile locking mechanism requires no architectural changes, relying instead on inherent positional embeddings combined with a new complementary data sampling strategy. Extensive experiments demonstrate that LayerComposer achieves superior spatial control and identity preservation compared to the state-of-the-art methods in multi-subject personalized image generation.

[148] HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives

Yihao Meng,Hao Ouyang,Yue Yu,Qiuyu Wang,Wen Wang,Ka Leong Cheng,Hanlin Wang,Yixuan Li,Cheng Chen,Yanhong Zeng,Yujun Shen,Huamin Qu

Main category: cs.CV

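TL;DR: HoloCine is a text-to-video model that generates coherent multi-shot narrative scenes holistically, closing the gap in narrative continuity left by conventional models and enabling director-level control and efficient long video generation.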

Details Motivation: Existing text-to-video models can only generate isolated clips and cannot produce coherent multi-shot narratives, falling short of what cinematic storytelling requires. Method: Proposes HoloCine, which uses a Window Cross-Attention mechanism to localize text prompts to specific shots and a Sparse Inter-Shot Self-Attention pattern (dense within shots, sparse between them) to make long video generation efficient while keeping global consistency. Result: HoloCine sets a new state of the art in narrative coherence, shows emergent abilities such as persistent memory for characters and scenes and a grasp of cinematic techniques, and supports minute-scale video generation. Conclusion: HoloCine marks a key shift from clip synthesis toward automated filmmaking and brings end-to-end cinematic creation within reach. Abstract: State-of-the-art text-to-video models excel at generating isolated clips but fall short of creating the coherent, multi-shot narratives, which are the essence of storytelling. We bridge this "narrative gap" with HoloCine, a model that generates entire scenes holistically to ensure global consistency from the first shot to the last. Our architecture achieves precise directorial control through a Window Cross-Attention mechanism that localizes text prompts to specific shots, while a Sparse Inter-Shot Self-Attention pattern (dense within shots but sparse between them) ensures the efficiency required for minute-scale generation. Beyond setting a new state-of-the-art in narrative coherence, HoloCine develops remarkable emergent abilities: a persistent memory for characters and scenes, and an intuitive grasp of cinematic techniques. Our work marks a pivotal shift from clip synthesis towards automated filmmaking, making end-to-end cinematic creation a tangible future. Our code is available at: https://holo-cine.github.io/.
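
A "dense within shots, sparse between them" attention pattern can be pictured as a boolean mask over video tokens. The abstract does not specify which cross-shot connections are kept, so the rule below (every token may attend to the first `n_summary` tokens of every other shot) is purely an assumption for illustration, and `sparse_inter_shot_mask` is a hypothetical helper, not HoloCine's implementation.

```python
def sparse_inter_shot_mask(shot_lengths, n_summary=1):
    """Build mask[q][k] = True where query token q may attend to key token k.

    Dense inside each shot; across shots, only the first n_summary tokens of
    the other shot are visible (an assumed sparsity rule for illustration).
    """
    shot_of, pos_in_shot = [], []
    for shot_id, length in enumerate(shot_lengths):
        for pos in range(length):
            shot_of.append(shot_id)
            pos_in_shot.append(pos)
    n_tokens = len(shot_of)
    return [
        [shot_of[q] == shot_of[k] or pos_in_shot[k] < n_summary
         for k in range(n_tokens)]
        for q in range(n_tokens)
    ]


# Two shots with 3 and 2 tokens: 25 possible query-key pairs in total.
mask = sparse_inter_shot_mask([3, 2], n_summary=1)
allowed_pairs = sum(sum(row) for row in mask)
```

With many shots, the number of allowed pairs under such a rule grows far more slowly than full attention, which is the kind of saving that makes minute-scale generation tractable.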