Table of Contents
cs.CL
[1] DeBERTa-KC: A Transformer-Based Classifier for Knowledge Construction in Online Learning Discourse
Jindi Wang,Yidi Zhang,Zhaoxing Li
Main category: cs.CL
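TL;DR: This study proposes DeBERTa-KC, a DeBERTa-based model that automatically classifies knowledge construction levels in comments on YouTube science videos. Adding Focal Loss, Label Smoothing, and R-Drop improves performance; on the four-class knowledge construction task the model reaches a macro-F1 of 0.836, significantly outperforming baseline models.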
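Focal Loss and Label Smoothing, both named in the summary above, can be combined in more than one way. The following is a minimal sketch of one common combination, not the authors' implementation: the function names, the default `gamma` and `epsilon` values, and the choice to apply the focal weight to every smoothed target term are all assumptions made for illustration.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def smoothed_targets(true_class, num_classes, epsilon):
    # Label smoothing: (1 - epsilon) on the true class, epsilon spread uniformly.
    return [
        (1.0 - epsilon) * (1.0 if k == true_class else 0.0) + epsilon / num_classes
        for k in range(num_classes)
    ]

def focal_loss(logits, true_class, gamma=2.0, epsilon=0.1):
    # Focal weighting (1 - p)^gamma shrinks the contribution of classes the
    # model already predicts confidently. gamma and epsilon are illustrative.
    probs = softmax(logits)
    targets = smoothed_targets(true_class, len(logits), epsilon)
    return -sum(
        t * (1.0 - p) ** gamma * math.log(max(p, 1e-12))
        for t, p in zip(targets, probs)
    )
```

With `gamma=0` and `epsilon=0` the function reduces to ordinary cross-entropy, which is a convenient sanity check when tuning either hyperparameter.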
Details
Motivation: Existing methods for automatically classifying knowledge construction levels perform poorly on the complex discourse of informal online learning environments, particularly under class imbalance and limited generalization. A more robust, reproducible model is needed to identify learners' cognitive engagement on open platforms accurately.
Method: Starting from DeBERTa-v3, the authors add Focal Loss to handle class imbalance and combine Label Smoothing with R-Drop to improve generalization. They build a balanced corpus of 20,000 manually annotated samples and a reproducible end-to-end pipeline covering data collection, annotation, preprocessing, training, and evaluation, and evaluate the model with 10-fold stratified cross-validation.
Result: DeBERTa-KC reaches a macro-F1 of 0.836 ± 0.008, significantly outperforming classical and transformer baselines (p < 0.01), and is especially strong on the two higher-order categories, Explore and Negotiate.
Conclusion: DeBERTa-KC captures subtle markers of knowledge construction in informal digital learning environments, supports the potential of large language models for theory-driven discourse analysis, and offers a scalable route to automated assessment of learners' cognitive engagement.
Abstract: This study presents DeBERTa-KC, a transformer-based model for automatic classification of knowledge construction (KC) levels in online science learning discourse. Using comments collected from four popular YouTube science channels (2022-2024), a balanced corpus of 20,000 manually annotated samples was created across four KC categories: nonKC, Share, Explore, and Negotiate. The proposed model extends DeBERTa-v3 with Focal Loss, Label Smoothing, and R-Drop regularization to address class imbalance and enhance generalization. A reproducible end-to-end pipeline was implemented, encompassing data extraction, annotation, preprocessing, training, and evaluation. Across 10-fold stratified cross-validation, DeBERTa-KC achieved a macro-F1 of $0.836 \pm 0.008$, significantly outperforming both classical and transformer baselines ($p<0.01$). Per-category results indicate strong sensitivity to higher-order epistemic engagement, particularly in Explore and Negotiate discourse. These findings demonstrate that large language models can effectively capture nuanced indicators of knowledge construction in informal digital learning environments, offering scalable, theory-informed approaches to discourse analysis and the development of automated tools for assessing epistemic engagement.
[2] An Evaluation of the Pedagogical Soundness and Usability of AI-Generated Lesson Plans Across Different Models and Prompt Frameworks in High-School Physics
Xincheng Liu
Main category: cs.CL
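TL;DR: This study evaluates the pedagogical soundness and usability of lesson plans on the high-school physics topic "The Electromagnetic Spectrum" generated by five mainstream large language models (including ChatGPT and Claude), and compares three prompt frameworks (TAG, RACE, COSTAR). Model choice mainly affects readability (DeepSeek is the most readable, Claude the most linguistically complex), while the prompt framework, RACE in particular, markedly improves factual accuracy and curriculum-standard alignment. Learning objectives in all plans stay mostly at the Remember and Understand levels and lack higher-order objectives. The best combination pairs a highly readable model with the RACE framework and an explicit checklist of concepts, standards, and higher-order objectives.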
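Readability in this entry is reported as Flesch-Kincaid Grade Level (FKGL), which is computed as 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below shows that formula in code. The syllable counter is a crude vowel-group heuristic assumed here for illustration; the study's own tooling is not described in this listing and may differ.

```python
import re

def count_syllables(word):
    # Rough heuristic: each run of consecutive vowels counts as one syllable.
    # Dictionary-based counters are more accurate; this is only an approximation.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def fkgl(text):
    # Split on sentence-ending punctuation and pull out alphabetic words.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59
```

Short sentences built from one-syllable words score very low (even below zero), while long sentences with polysyllabic vocabulary push the grade level up, which is the contrast behind the 8.64 versus 19.89 figures quoted for this paper.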
Details
Motivation: As AI spreads through education, teachers increasingly rely on large language models to generate lesson plans, yet the pedagogical effectiveness, factual accuracy, and curriculum alignment of that content remain uncertain. A systematic evaluation of how models and prompt design affect lesson quality is needed to help educators use AI tools safely and effectively.
Method: Five mainstream large language models (ChatGPT (GPT-5), Claude Sonnet 4.5, Gemini 2.5 Flash, DeepSeek V3.2, Grok 4) generated 15 lesson plans on the same high-school physics topic, "The Electromagnetic Spectrum", under three structured prompt frameworks: TAG, RACE, and COSTAR. The plans were analyzed with four automated metrics: (1) readability and linguistic complexity, (2) factual accuracy and hallucination detection, (3) curriculum-standard alignment, and (4) cognitive demand of the learning objectives (based on Bloom's taxonomy).
Result: Model choice had the largest effect on readability: DeepSeek produced the most readable plan (FKGL = 8.64) and Claude the most complex language (FKGL = 19.89). The prompt framework significantly affected instructional reliability: RACE yielded the fewest hallucinations and the highest alignment with NGSS standards. Learning objectives in all plans clustered at the Remember and Understand levels of Bloom's taxonomy, with few higher-order objectives such as Apply, Analyze, or Evaluate.
Conclusion: The quality of AI-generated lesson plans depends on both the model and the prompt design: the model determines readability, while the prompt framework (RACE in particular) has more influence on instructional accuracy and curriculum alignment. To improve AI-assisted teaching, the authors recommend combining a highly readable model, a structured prompt such as RACE, and an explicit checklist of core concepts, curriculum standards, and higher-order learning objectives.
Abstract: This study evaluates the pedagogical soundness and usability of AI-generated lesson plans across five leading large language models: ChatGPT (GPT-5), Claude Sonnet 4.5, Gemini 2.5 Flash, DeepSeek V3.2, and Grok 4. Beyond model choice, three structured prompt frameworks were tested: TAG (Task, Audience, Goal), RACE (Role, Audience, Context, Execution), and COSTAR (Context, Objective, Style, Tone, Audience, Response Format). Fifteen lesson plans were generated for a single high-school physics topic, The Electromagnetic Spectrum. The lesson plans were analyzed through four automated computational metrics: (1) readability and linguistic complexity, (2) factual accuracy and hallucination detection, (3) standards and curriculum alignment, and (4) cognitive demand of learning objectives. Results indicate that model selection exerted the strongest influence on linguistic accessibility, with DeepSeek producing the most readable teaching plan (FKGL = 8.64) and Claude generating the densest language (FKGL = 19.89). The prompt framework structure most strongly affected factual accuracy and pedagogical completeness, with the RACE framework yielding the lowest hallucination index and the highest incidental alignment with NGSS curriculum standards.
Across all models, the learning objectives in the fifteen lesson plans clustered at the Remember and Understand tiers of Bloom's taxonomy. Few higher-order verbs appeared in the extracted learning objectives. Overall, the findings suggest that readability is significantly governed by model design, while instructional reliability and curricular alignment depend more on the prompt framework. The most effective configuration identified in the results combines a readability-optimized model with the RACE framework and an explicit checklist of physics concepts, curriculum standards, and higher-order objectives.
[3] From Denoising to Refining: A Corrective Framework for Vision-Language Diffusion Model
Yatai Ji,Teng Wang,Yuying Ge,Zhiheng Liu,Sidi Yang,Ying Shan,Ping Luo
Main category: cs.CL
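TL;DR: Proposes ReDiff, a refining-enhanced diffusion framework whose active correction mechanism addresses error cascades in discrete diffusion models at inference time, improving the consistency and factual accuracy of generated content.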
Details
Motivation: Discrete diffusion models for vision-language tasks suffer from a train-inference discrepancy that causes severe error cascades during parallel decoding and degrades generation quality.
Method: Generation is reframed from passive denoising to active refining, with two-stage training: the model first learns to revise synthetic errors to build a basic revision capability; an online self-correction loop then trains it to improve its own drafts by learning from an expert's corrections.
Result: Experiments show that ReDiff significantly improves the coherence and factual accuracy of generated content and achieves more stable, efficient parallel generation than traditional denoising methods.
Conclusion: Through mistake-driven learning, ReDiff breaks the error cascade and points to a new direction for reliable use of discrete diffusion models.
Abstract: Discrete diffusion models have emerged as a promising direction for vision-language tasks, offering bidirectional context modeling and theoretical parallelization. However, their practical application is severely hindered by a train-inference discrepancy, which leads to catastrophic error cascades: initial token errors during parallel decoding pollute the generation context, triggering a chain reaction of compounding errors and leading to syntactic errors and semantic hallucinations. To address this fundamental challenge, we reframe the generation process from passive denoising to active refining. We introduce ReDiff, a refining-enhanced diffusion framework that teaches the model to identify and correct its own errors. Our approach features a two-stage training process: first, we instill a foundational revision capability by training the model to revise synthetic errors; second, we implement a novel online self-correction loop where the model is explicitly trained to revise its own flawed drafts by learning from an expert's corrections. This mistake-driven learning endows the model with the crucial ability to revisit and refine its already generated output, effectively breaking the error cascade. Extensive experiments demonstrate that ReDiff significantly improves the coherence and factual accuracy of generated content, enabling stable and efficient parallel generation far superior to traditional denoising methods. Our codes and models are available at https://rediff-hku.github.io/.
[4] Stream: Scaling up Mechanistic Interpretability to Long Context in LLMs via Sparse Attention
J Rosser,José Luis Redondo García,Gustavo Penha,Konstantina Palla,Hugues Bouchard
Main category: cs.CL
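TL;DR: Proposes Sparse Tracing, a technique that uses dynamic sparse attention to analyze attention patterns in long contexts efficiently, together with Stream, an algorithm that enables large-scale interpretability analysis in near-linear time and linear space.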
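The core idea of keeping only the top-k key blocks per query can be illustrated without the full algorithm. The sketch below is not Stream itself (it omits the hierarchical, binary-search-style refinement entirely); it only shows, under assumed inputs, how a block-level mask is built from precomputed relevance scores and how the pruned fraction is measured. The score matrix, the causal visibility rule, and both function names are hypothetical.

```python
import heapq

def top_k_block_mask(block_scores, k):
    """block_scores[q][b] is an assumed relevance score of key block b for query block q."""
    mask = []
    for q, row in enumerate(block_scores):
        # Causal assumption: a query block can only see itself and earlier key blocks.
        visible = range(q + 1)
        keep = set(heapq.nlargest(k, visible, key=lambda b: row[b]))
        mask.append([b in keep for b in range(len(row))])
    return mask

def pruned_fraction(mask):
    # Fraction of causally visible (query block, key block) pairs that were dropped.
    total = sum(q + 1 for q in range(len(mask)))
    kept = sum(sum(row) for row in mask)
    return 1.0 - kept / total
```

For example, with four blocks and `k=1`, four of the ten causal block pairs are kept, so 60% of interactions are pruned; at long context lengths the same fixed k leaves a far smaller share of pairs, which is how pruning rates in the 90% range become possible.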
Details
Motivation: As LLM context lengths reach the million-token scale, traditional mechanistic interpretability techniques become impractical because of their compute and memory costs; an efficient, scalable method for analyzing attention is needed.
Method: Stream uses hierarchical pruning with a binary-search-style refinement to estimate a sparse attention mask per head, keeping only the top-k key blocks while preserving the model's next-token behavior, which allows analysis in a single pass.
Result: On chain-of-thought reasoning traces, the method identifies "thought anchors" while pruning 97-99% of token interactions; on the RULER benchmark it preserves critical retrieval paths while removing 90-96% of interactions, exposing layer-wise routes from the needle to the output.
Conclusion: Sparse Tracing is a practical drop-in tool for analyzing long-context attention; it makes large-scale interpretability research feasible on consumer GPUs and helps broaden access to chain-of-thought monitoring.
Abstract: As Large Language Models (LLMs) scale to million-token contexts, traditional Mechanistic Interpretability techniques for analyzing attention scale quadratically with context length, demanding terabytes of memory beyond 100,000 tokens. We introduce Sparse Tracing, a novel technique that leverages dynamic sparse attention to efficiently analyze long context attention patterns. We present Stream, a compilable hierarchical pruning algorithm that estimates per-head sparse attention masks in near-linear time $O(T \log T)$ and linear space $O(T)$, enabling one-pass interpretability at scale. Stream performs a binary-search-style refinement to retain only the top-$k$ key blocks per query while preserving the model's next-token behavior. We apply Stream to long chain-of-thought reasoning traces and identify thought anchors while pruning 97-99% of token interactions. On the RULER benchmark, Stream preserves critical retrieval paths while discarding 90-96% of interactions and exposes layer-wise routes from the needle to output. Our method offers a practical drop-in tool for analyzing attention patterns and tracing information flow without terabytes of caches. By making long context interpretability feasible on consumer GPUs, Sparse Tracing helps democratize chain-of-thought monitoring. Code is available at https://anonymous.4open.science/r/stream-03B8/.
[5] Automated HIV Screening on Dutch EHR with Large Language Models
Lang Zhou,Amrish Jhingoer,Yinghao Luo,Klaske Vliegenthart--Jongbloed,Carlijn Jordans,Ben Werkhoven,Tom Seinen,Erik van Mulligen,Casper Rokx,Yunlei Li
Main category: cs.CL
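TL;DR: Proposes a new method that uses a large language model to analyze unstructured electronic health record text, with the aim of improving the accuracy and efficiency of HIV testing.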
Details
Motivation: Existing research on HIV diagnosis focuses mainly on structured data and overlooks important risk information that may be contained in unstructured text such as clinical notes.
Method: A new LLM-based pipeline analyzes unstructured EHR text and determines whether a patient should undergo further HIV testing.
Result: Experiments on clinical data from Erasmus University Medical Center Rotterdam show that the pipeline achieves high accuracy while keeping the false negative rate low.
Conclusion: The LLM-based pipeline makes effective use of unstructured EHR text and can improve the efficiency and accuracy of HIV screening and early diagnosis.
Abstract: Efficient screening and early diagnosis of HIV are critical for reducing onward transmission. Although large-scale laboratory testing is not feasible, the widespread adoption of Electronic Health Records (EHRs) offers new opportunities to address this challenge. Existing research primarily focuses on applying machine learning methods to structured data, such as patient demographics, for improving HIV diagnosis. However, these approaches often overlook unstructured text data such as clinical notes, which potentially contain valuable information relevant to HIV risk. In this study, we propose a novel pipeline that leverages a Large Language Model (LLM) to analyze unstructured EHR text and determine a patient's eligibility for further HIV testing. Experimental results on clinical data from Erasmus University Medical Center Rotterdam demonstrate that our pipeline achieved high accuracy while maintaining a low false negative rate.
[6] An Expert-grounded benchmark of General Purpose LLMs in LCA
Artur Donaldson,Bharathan Balaji,Cajetan Oriekezie,Manish Kumar,Laure Patouillard
Main category: cs.CL
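TL;DR: This study presents the first expert-grounded systematic benchmark of 11 general-purpose large language models (LLMs) in life cycle assessment (LCA), covering 22 related tasks. Experts judged 37% of responses to contain inaccurate or misleading information, and some models hallucinated citations at rates of up to 40%, although explanation quality was generally good. Open-weight and closed-weight models performed comparably. The results call for cautious use of general-purpose LLMs and stronger grounding mechanisms to improve reliability.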
Details
Motivation: Although LLMs are being explored as support for life cycle assessment (LCA) across environmental and social domains, there is no standardized evaluation framework and little systematic evidence on their reliability, robustness, and usability. An expert-grounded benchmark is needed to fill this gap.
Method: Eleven general-purpose LLMs (commercial and open-source) were evaluated on 22 LCA-related tasks. Seventeen experienced practitioners reviewed the outputs for scientific accuracy, explanation quality, robustness, verifiability, and instruction adherence, yielding 168 expert reviews.
Result: Experts judged 37% of responses to contain inaccurate or misleading information; most models did well on explanation quality and format adherence; hallucinated-citation rates reached up to 40%; open-weight models matched or exceeded closed-weight models on accuracy and explanation quality.
Conclusion: Using general-purpose LLMs directly as free-form "oracles" carries substantial risk, especially without grounding mechanisms. LLMs nonetheless show promise for improving explanation quality and reducing the workload of simple tasks, and future use in LCA should be paired with expert verification.
Abstract: Purpose: Artificial intelligence (AI), and in particular large language models (LLMs), are increasingly being explored as tools to support life cycle assessment (LCA). While demonstrations exist across environmental and social domains, systematic evidence on their reliability, robustness, and usability remains limited. This study provides the first expert-grounded benchmark of LLMs in LCA, addressing the absence of standardized evaluation frameworks in a field where no clear ground truth or consensus protocols exist. Methods: We evaluated eleven general-purpose LLMs, spanning both commercial and open-source families, across 22 LCA-related tasks. Seventeen experienced practitioners reviewed model outputs against criteria directly relevant to LCA practice, including scientific accuracy, explanation quality, robustness, verifiability, and adherence to instructions. We collected 168 expert reviews. Results: Experts judged 37% of responses to contain inaccurate or misleading information. Accuracy and quality of explanation were generally rated average or good for many models, including smaller ones, and format adherence was generally rated favourably. Hallucination rates varied significantly, with some models producing hallucinated citations at rates of up to 40%. There was no clear-cut distinction between ratings on open-weight versus closed-weight LLMs, with open-weight models outperforming or competing on par with closed-weight models on criteria such as accuracy and quality of explanation.
Conclusion: These findings highlight the risks of applying LLMs naively in LCA, such as when LLMs are treated as free-form oracles, while also showing benefits especially around quality of explanation and alleviating labour intensiveness of simple tasks. The use of general-purpose LLMs without grounding mechanisms presents ...
[7] Can They Dixit? Yes they Can! Dixit as a Playground for Multimodal Language Model Capabilities
Nishant Balepur,Dang Nguyen,Dayeon Ki
Main category: cs.CL
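TL;DR: Proposes game-based evaluation using Dixit to assess the capabilities of multimodal large language models (MLMs) holistically. Dixit win-rate rankings agree closely with mainstream benchmarks, and the experiments reveal room for improvement in MLM reasoning strategies.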
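What makes Dixit a useful test is its scoring rule, which rewards a storyteller whose caption fools some players but not all. The sketch below encodes the standard tabletop scoring as commonly played; this is an assumption, since the listing does not say which scoring variant the paper uses. The function name and the argument layout are hypothetical.

```python
def score_round(storyteller, card_owner, votes):
    """
    storyteller: name of the player who gave the caption.
    card_owner:  dict mapping each played card to the player who played it.
    votes:       dict mapping each non-storyteller player to the card they picked.
    Returns a dict of points earned this round (standard tabletop rules, assumed).
    """
    players = set(card_owner.values())
    scores = {p: 0 for p in players}
    storyteller_card = next(c for c, p in card_owner.items() if p == storyteller)
    correct = [voter for voter, card in votes.items() if card == storyteller_card]

    if len(correct) == 0 or len(correct) == len(votes):
        # Caption was too obscure or too obvious: storyteller gets nothing.
        for p in players:
            if p != storyteller:
                scores[p] += 2
    else:
        scores[storyteller] += 3
        for voter in correct:
            scores[voter] += 3

    # Bonus: each vote drawn by a non-storyteller's decoy card earns its owner 1 point.
    for voter, card in votes.items():
        owner = card_owner[card]
        if owner != storyteller:
            scores[owner] += 1
    return scores
```

The all-or-nothing branch is what forces a player to reason about what other players will infer, rather than simply describing the card as accurately as possible.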
Details
Motivation: Existing evaluation methods cannot jointly assess multiple MLM capabilities in a single task, and they are subjective, expensive, and easily gamed through superficial features.
Method: A game-based evaluation framework built on Dixit, a card game that requires players to write deliberately ambiguous captions, uses objective rules and competition to assess MLMs' multimodal understanding and reasoning.
Result: The Dixit win-rate rankings of five MLMs correlate perfectly with those on mainstream benchmarks; games between humans and MLMs reveal differences between MLM and human strategies and weaknesses in MLM reasoning.
Conclusion: As a game-based evaluation, Dixit measures the overall capabilities of MLMs effectively and objectively and points to directions for future model improvement.
Abstract: Multi-modal large language models (MLMs) are often assessed on static, individual benchmarks -- which cannot jointly assess MLM capabilities in a single task -- or rely on human or model pairwise comparisons -- which are highly subjective and expensive, and allow models to exploit superficial shortcuts (e.g., verbosity) to inflate their win-rates. To overcome these issues, we propose game-based evaluations to holistically assess MLM capabilities. Games require multiple abilities for players to win, are inherently competitive, are governed by fixed, objective rules, and make evaluation more engaging, providing a robust framework to address the aforementioned challenges. We manifest this evaluation specifically through Dixit, a fantasy card game where players must generate captions for a card that trick some, but not all players, into selecting the played card. Our quantitative experiments with five MLMs show Dixit win-rate rankings are perfectly correlated with those on popular MLM benchmarks, while games between human and MLM players in Dixit reveal several differences between agent strategies and areas of improvement for MLM reasoning.
[8] Large Language Model enabled Mathematical Modeling
Guoyun Zhang
Main category: cs.CL
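TL;DR: This study examines the potential of the DeepSeek-R1 large language model for optimization modeling in operations research (OR), aiming to reduce dependence on domain experts through natural language understanding and code generation. The model is evaluated systematically on four benchmarks (NL4OPT, IndustryOR, EasyLP, and ComplexOR), and several strategies are proposed to reduce hallucinations and improve formulation accuracy.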
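The formulation step that this entry hopes to automate means turning a plain-language problem into decision variables, constraints, and an objective. As a small illustration, take an invented problem (not from any of the cited benchmarks): a workshop earns 30 per chair and 50 per table, a chair takes 1 hour and a table 2 hours of a 40-hour budget, and at most 30 chairs can be sold. The sketch below writes that model down and solves it by brute-force enumeration rather than with a real solver such as those named in the abstract.

```python
from itertools import product

def solve_workshop(max_chairs=30, hours=40):
    # Decision variables: integer counts of chairs and tables.
    # Objective: maximize 30 * chairs + 50 * tables.
    # Constraints: chairs + 2 * tables <= hours, 0 <= chairs <= max_chairs.
    best = None
    for chairs, tables in product(range(max_chairs + 1), range(hours // 2 + 1)):
        if chairs * 1 + tables * 2 > hours:
            continue  # violates the labour-hour constraint
        profit = 30 * chairs + 50 * tables
        if best is None or profit > best[0]:
            best = (profit, chairs, tables)
    return best  # (profit, chairs, tables)
```

Calling `solve_workshop()` returns `(1150, 30, 5)`. A formulation error of the kind the paper classes as hallucination, for instance dropping the chair cap or mis-stating the hour coefficient, would still yield a runnable model but a different and wrong optimum, which is why the variables, constraints, and objective each need checking against the problem text.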
Details
Motivation: Traditional optimization depends heavily on experts to turn real-world problems into mathematical models, while existing large models such as GPT-4 and Claude are costly and prone to hallucination, which limits their use in practical settings such as supply chains. A more efficient, lower-cost, and reliable model such as DeepSeek-R1 is therefore worth exploring as a bridge from natural language to optimization models.
Method: DeepSeek-R1 is evaluated systematically on four OR benchmarks (NL4OPT, IndustryOR, EasyLP, ComplexOR). The methodology includes baseline comparisons, construction of a hallucination taxonomy, and mitigation strategies such as LLM-as-a-Judge, few-shot learning (FSL), tool calling, and a multi-agent framework to reduce hallucinations and improve formulation accuracy.
Result: DeepSeek-R1 shows high potential for modeling OR problems and a cost-effectiveness advantage over models such as GPT-4. The proposed mitigation strategies reduce hallucinations and improve the accuracy of the conversion from natural language to optimization models (variables, constraints, objective function) and its consistency with user intent.
Conclusion: DeepSeek-R1 is a promising low-cost, high-performance tool for optimization modeling in OR. Combined with hallucination mitigation, it can markedly improve the reliability and practicality of large models in real decision-support settings and advance natural-language-driven automated modeling.
Abstract: The integration of Large Language Models (LLMs) with optimization modeling offers a promising avenue for advancing decision-making in operations research (OR). Traditional optimization methods, such as linear programming, mixed integer programming, and simulation, depend heavily on domain expertise to translate real-world problems into solvable mathematical models. While solvers like Gurobi and COPT are powerful, expert input remains essential for defining objectives, constraints, and variables. This research investigates the potential of LLMs, specifically the DeepSeek-R1 model, to bridge this formulation gap using natural language understanding and code generation. Although prior models like GPT-4, Claude, and Bard have shown strong performance in NLP and reasoning tasks, their high token costs and tendency toward hallucinations limit real-world applicability in supply chain contexts. In contrast, DeepSeek-R1, a cost-efficient and high-performing model trained with reinforcement learning, presents a viable alternative. Despite its success in benchmarks such as LiveCodeBench and Math-500, its effectiveness in applied OR scenarios remains underexplored. This study systematically evaluates DeepSeek-R1 across four key OR benchmarks: NL4OPT, IndustryOR, EasyLP, and ComplexOR.
Our methodology includes baseline assessments, the development of a hallucination taxonomy, and the application of mitigation strategies like LLM-as-a-Judge, Few-shot Learning (FSL), Tool Calling, and a Multi-agent Framework. These techniques aim to reduce hallucinations, enhance formulation accuracy, and better align model outputs with user intent.
[9] Learning from Supervision with Semantic and Episodic Memory: A Reflective Approach to Agent Adaptation
Jackson Hassell,Dan Zhang,Hannah Kim,Tom Mitchell,Estevam Hruschka
Main category: cs.CL
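TL;DR: Proposes a memory-augmented framework that uses critiques generated by pretrained large language models to improve classification performance without parameter updates.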
Details
Motivation: Conventional fine-tuning is costly, inflexible, and opaque; the authors explore a way of learning that needs no parameter updates.
Method: The framework combines instance-level episodic memory (experiences) with task-level semantic memory (guidance) and learns from LLM-generated critiques.
Result: Across a range of tasks, adding critiques improves accuracy by up to 24.8% over retrieval baselines that use labels only. The study also finds that models differ in how they handle fact-oriented versus preference-based data, and introduces a "suggestibility" metric to explain how models respond.
Conclusion: Memory-driven, reflective learning is a promising route to more flexible and interpretable LLM agents.
Abstract: We investigate how agents built on pretrained large language models can learn target classification functions from labeled examples without parameter updates. While conventional approaches like fine-tuning are often costly, inflexible, and opaque, we propose a memory-augmented framework that leverages both labeled data and LLM-generated critiques. Our framework uses episodic memory to store instance-level critiques (capturing specific past experiences) and semantic memory to distill these into reusable, task-level guidance. Across a diverse set of tasks, incorporating critiques yields up to a 24.8 percent accuracy improvement over retrieval-based (RAG-style) baselines that rely only on labels. Through extensive empirical evaluation, we uncover distinct behavioral differences between OpenAI and open-source models, particularly in how they handle fact-oriented versus preference-based data. To interpret how models respond to different representations of supervision encoded in memory, we introduce a novel metric, suggestibility. This helps explain observed behaviors and illuminates how model characteristics and memory strategies jointly shape learning dynamics. Our findings highlight the promise of memory-driven, reflective learning for building more adaptive and interpretable LLM agents.
[10] LyriCAR: A Difficulty-Aware Curriculum Reinforcement Learning Framework For Controllable Lyric Translation
Le Ren,Xiangjian Zeng,Qingqiang Wu,Ruoxuan Liang
Main category: cs.CL
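TL;DR: Proposes LyriCAR, a new unsupervised framework for controllable lyric translation whose difficulty-aware curriculum design and adaptive strategy markedly improve translation quality and training efficiency.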
Details
Motivation: Existing lyric translation methods rely on hand-crafted rules and sentence-level modeling, so they struggle to capture musical-linguistic patterns and to maintain cross-line coherence and global rhyme at the paragraph level.
Method: LyriCAR introduces a difficulty-aware curriculum designer and an adaptive curriculum strategy that, in an unsupervised setting, guide the model through translation tasks of increasing complexity.
Result: On English-to-Chinese lyric translation, LyriCAR achieves state-of-the-art results on both standard translation metrics and multi-dimensional reward scores, while reducing training steps by nearly 40%.
Conclusion: Adaptive curriculum learning lets LyriCAR improve both the quality and the efficiency of lyric translation, with good paragraph-level modeling and practical promise.
Abstract: Lyric translation is a challenging task that requires balancing multiple musical constraints. Existing methods often rely on hand-crafted rules and sentence-level modeling, which restrict their ability to internalize musical-linguistic patterns and to generalize effectively at the paragraph level, where cross-line coherence and global rhyme are crucial. In this work, we propose LyriCAR, a novel framework for controllable lyric translation that operates in a fully unsupervised manner. LyriCAR introduces a difficulty-aware curriculum designer and an adaptive curriculum strategy, ensuring efficient allocation of training resources, accelerating convergence, and improving overall translation quality by guiding the model with increasingly complex challenges. Extensive experiments on the EN-ZH lyric translation task show that LyriCAR achieves state-of-the-art results across both standard translation metrics and multi-dimensional reward scores, surpassing strong baselines. Notably, the adaptive curriculum strategy reduces training steps by nearly 40% while maintaining superior performance. Code, data and model can be accessed at https://github.com/rle27/LyriCAR.
[11] LLM-Augmented Symbolic NLU System for More Reliable Continuous Causal Statement Interpretation
Xin Lian,Kenneth D. Forbus
Main category: cs.CL
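TL;DR: Proposes a hybrid approach that combines large language models (LLMs) with symbolic natural language understanding (NLU): the LLM simplifies text and fills knowledge gaps, and the symbolic system produces structured representations that support reasoning. On extracting causal and quantity information from commonsense science texts, the hybrid outperforms the purely symbolic approach.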
Details
Motivation: LLMs are prone to hallucination and inconsistent output, while symbolic systems are interpretable but limited in coverage and hard to maintain; combining the strengths of both is therefore attractive.
Method: LLMs rephrase and simplify text to broaden coverage and fill knowledge gaps automatically; symbolic NLU produces structured relational representations for reasoning and incremental learning.
Result: On extracting quantities and causal laws from commonsense science texts, the hybrid method performs significantly better than the symbolic-only system.
Conclusion: The hybrid approach combines the broad coverage of LLMs with the interpretability and reasoning ability of symbolic systems, and is a viable way to improve NLU systems.
Abstract: Despite the broad applicability of large language models (LLMs), their reliance on probabilistic inference makes them vulnerable to errors such as hallucination in generated facts and inconsistent output structure in natural language understanding (NLU) tasks. By contrast, symbolic NLU systems provide interpretable understanding grounded in curated lexicons, semantic resources, and syntactic and semantic interpretation rules. They produce relational representations that can be used for accurate reasoning and planning, as well as incremental debuggable learning. However, symbolic NLU systems tend to be more limited in coverage than LLMs and require scarce knowledge representation and linguistics skills to extend and maintain. This paper explores a hybrid approach that integrates the broad-coverage language processing of LLMs with the symbolic NLU capabilities of producing structured relational representations, aiming to get the best of both approaches. We use LLMs for rephrasing and text simplification, to provide broad coverage, and as a source of information to fill in knowledge gaps more automatically. We use symbolic NLU to produce representations that can be used for reasoning and for incremental learning. We evaluate this approach on the task of extracting and interpreting quantities and causal laws from commonsense science texts, along with symbolic- and LLM-only pipelines. Our results suggest that our hybrid method works significantly better than the symbolic-only pipeline.
[12] A Fundamental Algorithm for Dependency Parsing (With Corrections)
Michael A. Covington
Main category: cs.CL
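TL;DR: Presents a fundamental algorithm for parsing natural language sentences into dependency trees; it processes one word at a time and attaches each word immediately, mirroring properties claimed for human parsing.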
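The word-at-a-time strategy can be sketched as a loop that, for each new word, scans backwards over the words already seen and attaches wherever the grammar allows. The version below is a simplified pairwise illustration, not the paper's full algorithm: it enforces single-headedness and acyclicity but no projectivity constraint, and `can_attach` is a hypothetical oracle standing in for a real grammar.

```python
def _is_ancestor(heads, a, b):
    # True if word a lies on the head chain above word b.
    node = heads[b]
    while node is not None:
        if node == a:
            return True
        node = heads[node]
    return False

def covington_parse(words, can_attach):
    """
    can_attach(head_word, dependent_word) -> bool is an assumed grammar oracle.
    Returns heads, where heads[i] is the index of word i's head (None for a root).
    """
    heads = [None] * len(words)
    for i in range(len(words)):            # accept one new word at a time
        for j in range(i - 1, -1, -1):     # scan earlier words, nearest first
            if (heads[j] is None and can_attach(words[i], words[j])
                    and not _is_ancestor(heads, j, i)):
                heads[j] = i               # earlier word becomes a dependent of the new word
            elif (heads[i] is None and can_attach(words[j], words[i])
                    and not _is_ancestor(heads, i, j)):
                heads[i] = j               # new word becomes a dependent of an earlier word
    return heads
```

For "the dog barks", with an oracle allowing dog → the and barks → dog, the result is `[1, 2, None]`: "the" attaches to "dog" as soon as "dog" arrives, and "dog" attaches to "barks" as soon as "barks" arrives, with no lookahead.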
Details
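Motivation: To design a dependency parsing algorithm that better matches human language processing, improving the incrementality and cognitive plausibility of parsing.
Method: Words are processed one at a time and attached as soon as they can be, as in human parsing. Worst-case complexity is O(n^3), but in real language the worst case arises only for small n.
Result: The algorithm yields efficient dependency parsing whose worst-case complexity matches that of phrase-structure parsing, while the worst case is rarely reached in practice.
Conclusion: The algorithm performs well both in theory and on real language, and is consistent with the incremental nature of human language processing.
Abstract: This paper presents a fundamental algorithm for parsing natural language sentences into dependency trees. Unlike phrase-structure (constituency) parsers, this algorithm operates one word at a time, attaching each word as soon as it can be attached, corresponding to properties claimed for the parser in the human brain. Like phrase-structure parsing, its worst-case complexity is $O(n^3)$, but in human language, the worst case occurs only for small $n$.
[13] Beyond MedQA: Towards Real-world Clinical Decision Making in the Era of LLMs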
Yunpeng Xiao,Carl Yang,Mark Mai,Xiao Hu,Kai Shu
Main category: cs.CL
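TL;DR: Proposes a unifying framework that characterizes clinical decision-making tasks along two dimensions, clinical background and clinical question, so that large language models can be evaluated more realistically in medicine; the paper also surveys existing datasets, methods, and evaluation metrics and identifies open challenges.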
Details
Motivation: Existing medical datasets such as MedQA rely mostly on simplified question answering and do not adequately reflect real clinical decision-making, so an evaluation paradigm closer to actual clinical settings is needed.
Method: A unifying framework with two dimensions, clinical background and clinical question, is used to organize existing datasets and benchmarks; the paper reviews training-time and test-time methods and extends evaluation beyond accuracy to dimensions such as efficiency and explainability.
Result: The framework allows the complexity of clinical decision-making tasks to be analyzed systematically, standardizes comparisons across models, and guides the development of more clinically meaningful LLMs.
Conclusion: The two-dimensional paradigm helps make assumptions explicit, unify evaluation standards, and advance the development and use of LLMs in real clinical settings.
Abstract: Large language models (LLMs) show promise for clinical use. They are often evaluated using datasets such as MedQA. However, many medical datasets, such as MedQA, rely on simplified Question-Answering (Q&A) that underrepresents real-world clinical decision-making. Based on this, we propose a unifying paradigm that characterizes clinical decision-making tasks along two dimensions: Clinical Backgrounds and Clinical Questions. As the background and questions approach the real clinical environment, the difficulty increases. We summarize the settings of existing datasets and benchmarks along two dimensions. Then we review methods to address clinical decision-making, including training-time and test-time techniques, and summarize when they help. Next, we extend evaluation beyond accuracy to include efficiency and explainability. Finally, we highlight open challenges. Our paradigm clarifies assumptions, standardizes comparisons, and guides the development of clinically meaningful LLMs.
[14] Forging GEMs: Advancing Greek NLP through Quality-Based Corpus Curation and Specialized Pre-training
Alexandra Apostolopoulou,Konstantinos Kanaris,Athanasios Koursaris,Dimitris Tsakalidis,George Domalis,Ioannis E. Livieris
Main category: cs.CL
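TL;DR: Presents Greek Embedding Models (GEM), a new generation of models for Modern Greek. High-quality data preprocessing and a diverse set of modern transformer architectures (such as ELECTRA, ConvBERT, and ModernBERT), pre-trained on general and legal-domain text, substantially improve downstream performance, and the models outperform existing baselines, particularly on long legal documents.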
Details
Motivation: Fragmented research, a lack of architectural diversity, and limited context length have held back NLP for Modern Greek, a morphologically rich but moderately resourced language; the problem is most acute in the legal domain, which requires long contexts.
Method: The authors build large-scale, high-quality Greek corpora from general and legal sources using strict filtering and preprocessing, pre-train several modern transformer architectures on them (including ELECTRA, ConvBERT, and ModernBERT), and propose the first bilingual Greek-English embedding models for the legal domain.
Result: Experiments show that GEM-RoBERTa and GEM-ConvBERT significantly outperform existing baselines on several downstream tasks, supporting the effectiveness of the approach.
Conclusion: GEM models, built on high-quality data and diverse modern architectures, set a new baseline for Greek NLP research and advance long-document modeling in the legal domain in particular.
Abstract: The advancement of natural language processing for morphologically rich, moderately-resourced languages like Modern Greek is often hindered by a fragmented research landscape, a lack of architectural diversity and reliance on limited context-length models. This is particularly true in specialized, high-value domains such as law, where existing models are frequently confined to early transformer architectures with a restrictive 512-token window, insufficient for analyzing long legal documents. To address these challenges, this paper presents Greek Embedding Models, a new family of transformer models for the Greek language built upon a foundation of extensive, quality-driven data curation. We detail the construction of several large-scale Greek corpora, emphasizing a rigorous, quality-based filtering and preprocessing methodology to create high-value training datasets from both general-domain and specialized legal sources. On this carefully curated foundation, we pre-train and systematically evaluate a diverse suite of modern architectures that have not previously been applied to the Greek language, such as ELECTRA, ConvBERT and ModernBERT. Furthermore, we propose the first bilingual Greek-English Embedding Models tailored for the legal domain. The extensive experiments on downstream tasks demonstrate that the new class of models establishes the effectiveness of the proposed approach, highlighting that the GEM-RoBERTa and GEM-ConvBERT models significantly outperform existing baselines.
[15] Improving Transfer Learning for Sequence Labeling Tasks by Adapting Pre-trained Neural Language Models
David Dukić
Main category: cs.CL
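TL;DR: Proposes three ways to improve transfer learning so that pre-trained language models perform better on sequence labeling tasks: multi-task learning, architectural modification, and a generative in-context fine-tuning framework.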
Details
Motivation: To improve how well pre-trained language models transfer to sequence labeling tasks, in particular under domain shift and restricted information flow.
Method: (1) A multi-task model that incorporates an additional signal; (2) an architectural modification that enables bidirectional information flow across the layers of autoregressive large language models; (3) a sequence labeling framework for autoregressive large language models that combines supervised in-context fine-tuning with response-oriented adaptation strategies.
Result: The proposed methods significantly improve performance on sequence labeling tasks such as event trigger detection, supporting the value of targeted transfer learning paradigms.
Conclusion: Pre-trained language models reach their best performance on sequence labeling tasks when adapted through targeted transfer learning paradigms.
Abstract: This doctoral thesis improves transfer learning for sequence labeling tasks by adapting pre-trained neural language models. The proposed improvements in transfer learning involve introducing a multi-task model that incorporates an additional signal, a method based on architectural modifications in autoregressive large language models, and a sequence labeling framework for autoregressive large language models utilizing supervised in-context fine-tuning combined with response-oriented adaptation strategies. The first improvement is given in the context of domain transfer for the event trigger detection task. The domain transfer of the event trigger detection task can be improved by incorporating an additional signal obtained from a domain-independent text processing system into a multi-task model. The second improvement involves modifying the model's architecture. For that purpose, a method is proposed to enable bidirectional information flow across layers of autoregressive large language models. The third improvement utilizes autoregressive large language models as text generators through a generative supervised in-context fine-tuning framework. The proposed model, method, and framework demonstrate that pre-trained neural language models achieve their best performance on sequence labeling tasks when adapted through targeted transfer learning paradigms.
[16] ToolScope: Enhancing LLM Agent Tool Use through Tool Merging and Context-Aware Filtering
Marianne Menglin Liu,Daniel Garcia,Fjona Parllaku,Vikas Upadhyay,Syed Fahad Allam Shah,Dan Roth
Main category: cs.CL
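TL;DR: Proposes ToolScope, which merges redundant tools with automatic correction and retrieves only the relevant tools, improving the accuracy and efficiency of tool selection by large models on complex tasks.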
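The two stages described here, collapsing redundant tools and then ranking what remains against the query, can be illustrated with a toy pipeline. Everything in the sketch is an assumption made for illustration: the real ToolScopeMerger and ToolScopeRetriever are not specified in this listing, and exact-match deduplication and Jaccard token overlap are simple stand-ins for whatever similarity and auditing methods the paper uses.

```python
def tokens(text):
    # Lowercased whitespace tokens; a stand-in for a real text encoder.
    return set(text.lower().split())

def merge_duplicates(tools):
    """tools: dict of tool name -> description. Keeps the first tool per distinct description."""
    merged = {}
    seen = set()
    for name, description in tools.items():
        key = frozenset(tokens(description))
        if key in seen:
            continue  # same wording as an earlier tool: treat as redundant
        seen.add(key)
        merged[name] = description
    return merged

def retrieve(query, tools, k):
    """Return the names of the k tools whose descriptions overlap most with the query."""
    q = tokens(query)

    def jaccard(description):
        d = tokens(description)
        union = q | d
        return len(q & d) / len(union) if union else 0.0

    ranked = sorted(tools, key=lambda name: jaccard(tools[name]), reverse=True)
    return ranked[:k]
```

Merging first matters because two near-identical tools would otherwise both rank highly and compete for the same slot in the limited context window, which is the ambiguity the entry describes.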
Details
Motivation: Large language models face selection ambiguity when toolsets contain redundant tools, and context limits prevent them from handling large toolsets efficiently.
Method: ToolScopeMerger (with Auto-Correction) reduces tool redundancy, and ToolScopeRetriever ranks and filters tools so that the toolset fits within the context limit.
Result: Experiments with three mainstream large models on three open-source tool-use benchmarks show gains of 8.38% to 38.6% in tool selection accuracy.
Conclusion: ToolScope improves the accuracy and efficiency with which large language models use large, redundant toolsets under limited context.
Abstract: Large language model (LLM) agents rely on external tools to solve complex tasks, but real-world toolsets often contain redundant tools with overlapping names and descriptions, introducing ambiguity and reducing selection accuracy. LLMs also face strict input context limits, preventing efficient consideration of large toolsets. To address these challenges, we propose ToolScope, which includes: (1) ToolScopeMerger with Auto-Correction to automatically audit and fix tool merges, reducing redundancy, and (2) ToolScopeRetriever to rank and select only the most relevant tools for each query, compressing toolsets to fit within context limits without sacrificing accuracy. Evaluations on three state-of-the-art LLMs and three open-source tool-use benchmarks show gains of 8.38% to 38.6% in tool selection accuracy, demonstrating ToolScope's effectiveness in enhancing LLM tool use.
[17] From Facts to Folklore: Evaluating Large Language Models on Bengali Cultural Knowledge
Nafis Chowdhury,Moinul Haque,Anika Ahmed,Nazia Tasnim,Md. Istiak Hossain Shihab,Sajjadur Rahman,Farig Sadeque
Main category: cs.CL
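TL;DR: Introduces BLanCK, a dataset for evaluating multilingual large models on Bengali cultural knowledge (such as folklore, cuisine, and dialects). Existing models perform poorly on culture-related tasks, but adding context improves performance substantially.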
Details
Motivation: Existing multilingual benchmarks fall short in capturing the nuances of low-resource cultures, making it hard to assess accurately how well large models understand a specific culture.
Method: The authors build the Bengali Language Cultural Knowledge (BLanCK) dataset, covering folk traditions, culinary arts, and regional dialects, and evaluate several multilingual large models with and without context.
Result: Current multilingual models perform well on non-cultural categories but lag markedly on cultural knowledge; with context provided, all models improve substantially.
Conclusion: Improving large models' understanding of low-resource cultures requires both context-aware architectures and culturally curated training data.
Abstract: Recent progress in NLP research has demonstrated remarkable capabilities of large language models (LLMs) across a wide range of tasks. While recent multilingual benchmarks have advanced cultural evaluation for LLMs, critical gaps remain in capturing the nuances of low-resource cultures. Our work addresses these limitations through a Bengali Language Cultural Knowledge (BLanCK) dataset including folk traditions, culinary arts, and regional dialects. Our investigation of several multilingual language models shows that while these models perform well in non-cultural categories, they struggle significantly with cultural knowledge. Performance improves substantially across all models when context is provided, emphasizing the need for context-aware architectures and culturally curated training data.
[18] Enhancing Reasoning Skills in Small Persian Medical Language Models Can Outperform Large-Scale Data Training
Mehrdad Ghassabi,Sadra Hakim,Hamidreza Baradaran Kashani,Pedram Rostami
Main category: cs.CL
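TL;DR: This study uses Reinforcement Learning with AI Feedback (RLAIF) and Direct Preference Optimization (DPO) to improve the reasoning of a small Persian language model on medical question answering. A Persian medical multiple-choice dataset is built through translation, pairs of correct and incorrect reasoning paths are generated, and the resulting model, trained on a small amount of data, outperforms a predecessor trained on far more data.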
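The DPO step named in this entry can be illustrated with the standard per-pair DPO objective. This is a minimal sketch, not the authors' training code: the function name `dpo_loss`, the `beta` value, and the example log-probabilities are assumptions made for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full answer under
    either the policy being trained or the frozen reference model.
    `beta` (hypothetical value) controls how far the policy may drift.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed in a numerically stable way
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Illustrative numbers only: the loss falls when the policy prefers the
# "preferred" reasoning path more strongly than the reference model does.
print(dpo_loss(-5.0, -15.0, -10.0, -10.0, beta=0.1))
```

In the setting described here, the preferred and rejected answers would be the correct and incorrect Chain-of-Thought trajectories; how their log-probabilities are computed is left out of this sketch.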
Details
Motivation: To improve the medical reasoning of small language models in low-resource languages such as Persian, addressing the challenge of limited data in specialized domains.
Method: Uses RLAIF to generate preferred and rejected sample pairs and trains with DPO; teacher and student models are prompted to produce Chain-of-Thought (CoT) reasoning paths, yielding a dataset of correct and incorrect reasoning trajectories.
Result: Builds a dataset with 2 million tokens of preferred answers and 2.5 million tokens of rejected answers; on Persian medical reasoning, the trained model surpasses its predecessor gaokerena-V, which was trained on about 57 million tokens.
Conclusion: Reasoning-focused training (such as RLAIF combined with DPO) can efficiently improve domain-specific language models when data is limited, and is especially suited to specialized applications in low-resource languages.
Abstract: Enhancing reasoning capabilities in small language models is critical for specialized applications such as medical question answering, particularly in underrepresented languages like Persian. In this study, we employ Reinforcement Learning with AI Feedback (RLAIF) and Direct Preference Optimization (DPO) to improve the reasoning skills of a general-purpose Persian language model. To achieve this, we translated a multiple-choice medical question-answering dataset into Persian and used RLAIF to generate rejected-preferred answer pairs, which are essential for DPO training. By prompting both teacher and student models to produce Chain-of-Thought (CoT) reasoning responses, we compiled a dataset containing correct and incorrect reasoning trajectories. This dataset, comprising 2 million tokens in preferred answers and 2.5 million tokens in rejected ones, was used to train a baseline model, significantly enhancing its medical reasoning capabilities in Persian. Remarkably, the resulting model outperformed its predecessor, gaokerena-V, which was trained on approximately 57 million tokens, despite leveraging a much smaller dataset. These results highlight the efficiency and effectiveness of reasoning-focused training approaches in developing domain-specific language models with limited data availability.
[19] CreativityPrism: A Holistic Benchmark for Large Language Model Creativity
Zhaoyi Joey Hou,Bowei Alvin Zhang,Yining Lu,Bhiman Kumar Baghel,Anneliese Brei,Ximing Lu,Meng Jiang,Faeze Brahman,Snigdha Chaturvedi,Haw-Shiuan Chang,Daniel Khashabi,Xiang Lorraine Li
Main category: cs.CL
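TL;DR: Proposes CreativityPrism, a framework for holistically evaluating LLM creativity across scenarios. It decomposes creativity into three dimensions (quality, novelty, and diversity), evaluates 17 state-of-the-art models, and reveals a performance gap between proprietary and open-source models as well as differing correlations among the creativity dimensions.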
Details
Motivation: Existing creativity evaluation methods define and measure creativity inconsistently across domains and tasks, and no unified framework exists. A comprehensive framework is therefore needed that evaluates LLM creativity systematically across domains and dimensions.
Method: Proposes CreativityPrism, comprising three dimensions (quality, novelty, diversity), nine tasks, three domains (divergent thinking, creative writing, logical reasoning), and twenty evaluation metrics; evaluates 17 state-of-the-art proprietary and open-source LLMs and analyzes performance correlations among metrics and task domains.
Result: Proprietary models outperform open-source models overall. Performance is highly correlated across tasks within the same domain and only weakly correlated across domains. Diversity and quality metrics correlate strongly, while novelty correlates weakly with both.
Conclusion: Performance does not generalize strongly across creativity dimensions and tasks; strong results on a single task or dimension do not represent overall creativity, so a holistic framework such as CreativityPrism is needed to measure LLM creativity.
Abstract: Creativity is often seen as a hallmark of human intelligence. While large language models (LLMs) are increasingly perceived as producing creative text, there is still no holistic framework to evaluate their creativity across diverse scenarios. Existing evaluation methods remain fragmented, with dramatic variation across domains and tasks, largely due to differing definitions and measurements of creativity. Inspired by the hypothesis that creativity is not one fixed idea, we propose CreativityPrism, an evaluation analysis framework that decomposes creativity into three dimensions: quality, novelty, and diversity. CreativityPrism incorporates nine tasks, three domains (divergent thinking, creative writing, and logical reasoning), and twenty evaluation metrics, which measure each dimension in task-specific, unique ways. We evaluate 17 state-of-the-art (SoTA) proprietary and open-source LLMs on CreativityPrism and analyze the performance correlations among different metrics and task domains. Our results reveal a notable gap between proprietary and open-source models. Overall, model performance tends to be highly correlated across tasks within the same domain and less so across different domains. Among evaluation dimensions, diversity and quality metrics show strong correlations (models that perform well on one often excel on the other), whereas novelty exhibits much weaker correlation with either. These findings support our hypothesis that strong performance in one creativity task or dimension does not necessarily generalize to others, underscoring the need for a holistic evaluation of LLM creativity.
[20] Leveraging the Power of Large Language Models in Entity Linking via Adaptive Routing and Targeted Reasoning
Yajie Li,Albert Galimov,Mitra Datta Ganapaneni,Pujitha Thejaswi,De Meng,Priyanshu Kumar,Saloni Potdar
Main category: cs.CL
TL;DR: ARTER is an efficient entity linking method that uses adaptive routing and selective reasoning to improve performance while reducing LLM usage.
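The routing idea in this entry, sending easy mentions to a cheap linker and hard ones to an LLM, can be sketched as follows. This is an illustration under stated assumptions, not ARTER's actual signal set: the `Mention` class, the score-margin criterion, and the `margin_threshold` value are all hypothetical stand-ins for the embedding and LLM-based signals the paper describes.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Mention:
    text: str
    candidate_scores: List[float]  # hypothetical retrieval scores, higher is better


def is_easy(mention: Mention, margin_threshold: float = 0.2) -> bool:
    """Treat a mention as easy when the top candidate clearly beats the runner-up."""
    ranked = sorted(mention.candidate_scores, reverse=True)
    if len(ranked) < 2:
        return len(ranked) == 1  # a single candidate is unambiguous; none is hard
    return ranked[0] - ranked[1] >= margin_threshold


def route(mentions: List[Mention],
          light_linker: Callable[[Mention], str],
          llm_linker: Callable[[Mention], str],
          margin_threshold: float = 0.2) -> List[Tuple[str, str]]:
    """Send easy mentions to the cheap linker and hard ones to the LLM."""
    results = []
    for m in mentions:
        linker = light_linker if is_easy(m, margin_threshold) else llm_linker
        results.append((m.text, linker(m)))
    return results


demo = [Mention("Paris", [0.9, 0.3]), Mention("Jordan", [0.55, 0.5])]
print(route(demo, lambda m: "light", lambda m: "llm"))
```

Only the mentions routed to `llm_linker` consume LLM tokens, which is where the efficiency gain reported for this kind of pipeline would come from.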
Details
Motivation: Traditional entity linking relies on large annotated datasets and extensive fine-tuning, while existing few-shot methods incur high computational cost because they lean heavily on LLM reasoning.
Method: Combines candidate generation, context-based scoring, adaptive routing, and selective reasoning. Several signals are used to split mentions into easy and hard cases, which are handled by a lightweight model and an LLM respectively.
Result: An average gain of +2.53% on 5 of 6 datasets, with a maximum gain of +4.47%, while halving LLM token usage.
Conclusion: ARTER markedly improves inference efficiency while maintaining high performance, offering an efficient and scalable solution for entity linking.
Abstract: Entity Linking (EL) has traditionally relied on large annotated datasets and extensive model fine-tuning. While recent few-shot methods leverage large language models (LLMs) through prompting to reduce training requirements, they often suffer from inefficiencies due to expensive LLM-based reasoning. ARTER (Adaptive Routing and Targeted Entity Reasoning) presents a structured pipeline that achieves high performance without deep fine-tuning by strategically combining candidate generation, context-based scoring, adaptive routing, and selective reasoning. ARTER computes a small set of complementary signals (both embedding and LLM-based) over the retrieved candidates to categorize contextual mentions into easy and hard cases. The cases are then handled by a low-computational entity linker (e.g., ReFinED) and more expensive targeted LLM-based reasoning, respectively. On standard benchmarks, ARTER outperforms ReFinED by up to +4.47%, with an average gain of +2.53% on 5 out of 6 datasets, and performs comparably to pipelines using LLM-based reasoning for all mentions, while being twice as efficient in terms of the number of LLM tokens.
[21] BoundRL: Efficient Structured Text Segmentation through Reinforced Boundary Generation
Haoyuan Li,Zhengyuan Shen,Sullam Jeoung,Yueyan Chen,Jiayu Li,Qi Zhu,Shuai Wang,Vassilis Ioannidis,Huzefa Rangwala
Main category: cs.CL
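TL;DR: Proposes BoundRL, an efficient method for segmenting long structured texts and predicting segment labels, which uses reinforcement learning with verifiable rewards to boost small-model performance.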
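The efficiency claim in this entry rests on generating only each segment's starting tokens and recovering the full segment from the source text. A minimal sketch of that reconstruction step is below, working on plain strings rather than model tokens; the function name `reconstruct_segments`, the `(label, start_snippet)` input format, and the example labels are assumptions, not the paper's interface.

```python
from typing import List, Tuple


def reconstruct_segments(text: str,
                         boundaries: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Rebuild labeled segments from (label, start_snippet) pairs.

    Each snippet is located in `text` after the previous one, and a segment
    runs from its snippet to the start of the next segment (or end of text).
    """
    starts = []
    cursor = 0
    for label, snippet in boundaries:
        pos = text.find(snippet, cursor)
        if pos == -1:
            # A snippet that cannot be located signals an invalid generation.
            raise ValueError(f"start snippet not found in order: {snippet!r}")
        starts.append((label, pos))
        cursor = pos + len(snippet)

    segments = []
    for i, (label, pos) in enumerate(starts):
        end = starts[i + 1][1] if i + 1 < len(starts) else len(text)
        segments.append((label, text[pos:end]))
    return segments


prompt = "You are a helper. Use the table below. | a | b |"
for label, content in reconstruct_segments(
        prompt, [("role", "You are"), ("instruction", "Use the"), ("table", "| a")]):
    print(label, "->", repr(content))
```

Because every segment is copied from the original text, the output cannot contain content that was not in the input, which is the sense in which this style of generation limits hallucination.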
Details
Motivation: Conventional text segmentation methods struggle with structured texts that contain complex elements such as tables and code, so a more effective semantic segmentation approach is needed.
Method: Generates only the starting tokens of each segment and reconstructs the content from the original text; uses reinforcement learning with verifiable rewards (RLVR) to optimize reconstruction fidelity and semantic alignment, and introduces intermediate candidates to mitigate entropy collapse.
Result: A small 1.7B-parameter model outperforms few-shot prompting of much larger models on complex LLM prompts; RLVR significantly outperforms supervised fine-tuning, and intermediate candidates further improve performance and generalization.
Conclusion: Through an efficient generation mechanism and a reinforcement learning framework, BoundRL markedly improves structured text segmentation, particularly for complex AI prompts.
Abstract: As structured texts become increasingly complex across diverse domains, from technical reports to generative AI prompts, the need for text segmentation into semantically meaningful components becomes critical. Such texts often contain elements beyond plain language, including tables, code snippets, and placeholders, which conventional sentence- or paragraph-level segmentation methods cannot handle effectively. To address this challenge, we propose BoundRL, a novel and efficient approach that jointly performs token-level text segmentation and label prediction for long structured texts. Instead of generating complete contents for each segment, it generates only a sequence of starting tokens and reconstructs the complete contents by locating these tokens within the original texts, thereby reducing inference costs by orders of magnitude and minimizing hallucination. To adapt the model for the output format, BoundRL performs reinforcement learning with verifiable rewards (RLVR) with a specifically designed reward that jointly optimizes document reconstruction fidelity and semantic alignment. To mitigate entropy collapse, it further constructs intermediate candidates by systematically perturbing a fraction of generated sequences of segments to create stepping stones toward higher-quality solutions. To demonstrate BoundRL's effectiveness on particularly challenging structured texts, we focus evaluation on complex prompts used for LLM applications. Experiments show that BoundRL enables small language models (1.7B parameters) to outperform few-shot prompting of much larger models. Moreover, RLVR with our designed reward yields significant improvements over supervised fine-tuning, and incorporating intermediate candidates further improves both performance and generalization.
[22] Are Stereotypes Leading LLMs' Zero-Shot Stance Detection ?
Anthony Dubreuil,Antoine Gourru,Christine Largeron,Amine Trabelsi
Main category: cs.CL
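TL;DR: This paper studies bias in LLMs on stance detection, finding that models exhibit significant stereotypes tied to attributes such as text complexity and the dialect of specific groups.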
Details
Motivation: LLMs inherit social stereotypes from their pretraining data, but bias in stance detection has received little attention. Because stance detection often involves political leanings and is a sensitive NLP task, its potential biases warrant investigation.
Method: In a zero-shot setting, automatically annotates posts in existing stance detection datasets with text attributes (such as group-specific dialect and text complexity) and analyzes how these attributes affect LLM stance judgments.
Result: LLMs show significant bias in stance detection, for example incorrectly associating pro-marijuana views with low text complexity, or African American dialect with opposition to Trump.
Conclusion: LLMs exhibit clear bias related to social attributes in stance detection and need further improvement to reduce unfair stereotyping.
Abstract: Large Language Models inherit stereotypes from their pretraining data, leading to biased behavior toward certain social groups in many Natural Language Processing tasks, such as hateful speech detection or sentiment analysis. Surprisingly, the evaluation of this kind of bias in stance detection methods has been largely overlooked by the community. Stance Detection involves labeling a statement as being against, in favor, or neutral towards a specific target and is among the most sensitive NLP tasks, as it often relates to political leanings. In this paper, we focus on the bias of Large Language Models when performing stance detection in a zero-shot setting. We automatically annotate posts in pre-existing stance detection datasets with two attributes: dialect or vernacular of a specific group and text complexity/readability, to investigate whether these attributes influence the model's stance detection decisions. Our results show that LLMs exhibit significant stereotypes in stance detection tasks, such as incorrectly associating pro-marijuana views with low text complexity and African American dialect with opposition to Donald Trump.
[23] DeepWideSearch: Benchmarking Depth and Width in Agentic Information Seeking
Tian Lan,Bin Zhu,Qianghuai Jia,Junyang Ren,Haijun Li,Longyue Wang,Zhao Xu,Weihua Luo,Kaifu Zhang
Main category: cs.CL
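TL;DR: Introduces DeepWideSearch, the first benchmark designed to evaluate agents' ability to integrate depth (multi-hop reasoning) and width (large-scale information collection) in information seeking. Even state-of-the-art agents achieve an average success rate of only 2.39% on the benchmark, revealing major limitations of existing methods; the benchmark is publicly released to foster future research.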
Details
Motivation: Existing search agents cannot simultaneously perform deep multi-hop reasoning and wide-scale information collection, a key deficiency for real-world applications such as market analysis and business development. A new benchmark is needed to evaluate and advance information-seeking agents that combine depth and width.
Method: Proposes the DeepWideSearch benchmark, converting existing datasets with two methods to build a set of 220 questions spanning 15 domains that requires agents to process large volumes of data and perform multi-hop reasoning.
Result: State-of-the-art search agents achieve an average success rate of 2.39% on DeepWideSearch. Error analysis reveals four failure modes: lack of reflection, overreliance on internal knowledge, insufficient retrieval, and context overflow.
Conclusion: DeepWideSearch provides a challenging new benchmark for evaluating information-seeking agents, exposes key limitations of current techniques, and helps drive research into more capable and robust agents.
Abstract: Current search agents fundamentally lack the ability to simultaneously perform deep reasoning over multi-hop retrieval and wide-scale information collection, a critical deficiency for real-world applications like comprehensive market analysis and business development. To bridge this gap, we introduce DeepWideSearch, the first benchmark explicitly designed to evaluate agents' ability to integrate depth and width in information seeking. In DeepWideSearch, agents must process a large volume of data items, each requiring deep reasoning over multi-hop retrieval paths. Specifically, we propose two methods to convert established datasets, resulting in a curated collection of 220 questions spanning 15 diverse domains. Extensive experiments demonstrate that even state-of-the-art agents achieve only a 2.39% average success rate on DeepWideSearch, highlighting the substantial challenge of integrating depth and width search in information-seeking tasks. Furthermore, our error analysis reveals four failure modes: lack of reflection, overreliance on internal knowledge, insufficient retrieval, and context overflow, exposing key limitations in current agent architectures. We publicly release DeepWideSearch to catalyze future research on more capable and robust information-seeking agents.
[24] Mixture-of-Minds: Multi-Agent Reinforcement Learning for Table Understanding
Yuhang Zhou,Mingrui Zhang,Ke Li,Mingyi Wang,Qiao Liu,Qifei Wang,Jiayi Liu,Fei Liu,Serena Li,Weiwi Li,Mingze Gao,Abhishek Kumar,Xiangjun Fan,Zhuokai Zhao,Lizhu Zhang
Main category: cs.CL
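TL;DR: Proposes Mixture-of-Minds, a multi-agent framework for table reasoning in which planning, coding, and answering agents collaborate, combined with Monte Carlo Tree Search and reinforcement learning for self-improvement. It reaches 62.13% on TableBench, surpassing OpenAI-o4-mini-high.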
Details
Motivation: Existing table reasoning methods have complementary weaknesses in language reasoning and precise computation: fine-tuned methods are prone to hallucination and arithmetic errors, while tool-based methods lack semantic understanding. A new approach is needed that combines strong reasoning with reliable table processing.
Method: Proposes the Mixture-of-Minds framework, which decomposes table reasoning into three specialized roles (planning, coding, and answering) and uses code execution for precise operations. It further introduces a self-improvement training framework based on Monte Carlo Tree Search that uses pseudo-gold trajectories to optimize each agent with reinforcement learning.
Result: Reaches 62.13% on TableBench, outperforming existing models such as OpenAI-o4-mini-high and supporting the effectiveness of the method.
Conclusion: Combining a structured multi-agent workflow with reinforcement learning, Mixture-of-Minds markedly improves table understanding and reasoning, and shows the potential of multi-agent collaboration and self-optimization on complex reasoning tasks.
Abstract: Understanding and reasoning over tables is a critical capability for many real-world applications. Large language models (LLMs) have shown promise on this task, but current approaches remain limited. Fine-tuning based methods strengthen language reasoning, yet they are prone to arithmetic errors and hallucination. In contrast, tool-based methods enable precise table manipulation but rely on rigid schemas and lack semantic understanding. These complementary drawbacks highlight the need for approaches that integrate robust reasoning with reliable table processing. In this work, we propose Mixture-of-Minds, a multi-agent framework that decomposes table reasoning into three specialized roles: planning, coding, and answering. This design enables each agent to focus on a specific aspect of the task while leveraging code execution for precise table manipulation. Building on this workflow, we introduce a self-improvement training framework that employs Monte Carlo Tree Search (MCTS) rollouts to generate pseudo-gold trajectories and optimize agents with reinforcement learning (RL). Extensive experiments show that Mixture-of-Minds delivers substantial gains, reaching 62.13% on TableBench and surpassing OpenAI-o4-mini-high. These results demonstrate the promise of combining structured multi-agent workflows with RL to advance table understanding.
[25] Stuck in the Matrix: Probing Spatial Reasoning in Large Language Models
Maggie Bai,Ava Kim Cohen,Eleanor Koss,Charlie Lichtenbaum
Main category: cs.CL
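TL;DR: This paper studies the spatial reasoning of large language models (LLMs) over textual input, using five tasks to assess spatial understanding and multi-step problem solving in grid environments. Models do reasonably well on small tasks, but performance drops sharply as complexity grows, with an average accuracy loss of 42.7% and a maximum of 84%, revealing limits in LLMs' spatial representations.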
Details
Motivation: To investigate whether LLMs possess spatial reasoning abilities beyond language understanding, in particular abstract spatial computation in structured environments.
Method: Designs five grid-based tasks (quadrant identification, geometric transformations, distance evaluation, word searches, and tile sliding) and progressively increases grid size to test models at different levels of complexity.
Result: LLMs achieve moderate success at low complexity, but performance falls sharply as scale increases, with an average accuracy loss of 42.7% across tasks and a maximum loss of 84%.
Conclusion: Current LLMs lack robust spatial representations and have limited spatial reasoning ability, especially on complex, multi-step structured tasks, which points to future work at the intersection of language and geometry.
Abstract: This paper explores the spatial reasoning capability of large language models (LLMs) over textual input through a suite of five tasks aimed at probing their spatial understanding and computational abilities. The models were tested on both fundamental spatial reasoning and multi-step problem-solving within structured grid-based environments using tasks such as quadrant identification, geometric transformations, distance evaluation, word searches, and tile sliding. Each task was scaled in complexity through increasing grid dimensions, requiring models to extend beyond simple pattern recognition into abstract spatial reasoning. Our results reveal that while LLMs demonstrate moderate success in all tasks with small complexity and size, performance drops off rapidly as scale increases, with an average loss in accuracy of 42.7%, and reaching as high as 84%. Every test that began with over 50% accuracy showed a loss of at least 48%, illustrating the consistent nature of the deterioration. Furthermore, their struggles with scaling complexity hint at a lack of robust spatial representations in their underlying architectures. This paper underscores the gap between linguistic and spatial reasoning in LLMs, offering insights into their current limitations, and laying the groundwork for future integrative benchmarks at the intersection of language and geometry.
[26] Decoding-Free Sampling Strategies for LLM Marginalization
David Pohl,Marco Cognetta,Junyoung Lee,Naoaki Okazaki
Main category: cs.CL
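TL;DR: Studies decoding-free sampling strategies that yield fast and sufficiently accurate marginal probability estimates for subword-tokenized language models, greatly reducing computational cost, and applies them to downstream inference tasks.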
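To make the quantity being estimated concrete, the sketch below computes an exact marginal over all tokenizations of a very short string. It is not the paper's sampling method: it enumerates exhaustively, which is only feasible for tiny inputs, and it scores each tokenization with a toy unigram model (the `token_logprob` table is made up) where a real setup would score it with an LLM forward pass.

```python
import math
from typing import Dict, Iterable, Iterator, List


def tokenizations(text: str, vocab: Iterable[str]) -> Iterator[List[str]]:
    """Yield every way to split `text` into pieces that all belong to `vocab`."""
    vocab = set(vocab)
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            for rest in tokenizations(text[end:], vocab):
                yield [piece] + rest


def marginal_probability(text: str, token_logprob: Dict[str, float]) -> float:
    """Sum the probability of all tokenizations of `text` (toy unigram scoring)."""
    total = 0.0
    for tokens in tokenizations(text, token_logprob.keys()):
        total += math.exp(sum(token_logprob[t] for t in tokens))
    return total


# Toy vocabulary with invented probabilities.
logp = {"a": math.log(0.5), "b": math.log(0.25), "ab": math.log(0.25)}
print(list(tokenizations("ab", logp)))   # two tokenizations: a+b and ab
print(marginal_probability("ab", logp))  # 0.5 * 0.25 + 0.25
```

Scoring only the single canonical tokenization would report 0.25 here, while the marginal is 0.375. Sampling strategies matter because the number of tokenizations grows rapidly with text length, so the sum has to be approximated from a subset.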
Details
Motivation: Subword tokenization allows the same text to be represented in many ways, yet models are conventionally evaluated only on the probability of one specific tokenization, ignoring the alternatives. Marginalizing over all possible tokenizations gives a more accurate picture of model behavior, but exact computation is intractable and existing sampling methods are costly because they require generation.
Method: Proposes decoding-free sampling strategies that need no generation from the language model, relying only on cheap sampling strategies that are model- and tokenizer-agnostic. This avoids the expensive generation step and allows efficient approximate marginalization.
Result: Validated on several open models, the strategies provide sufficiently accurate marginal probability estimates at a small fraction of the runtime cost and are applied successfully to downstream inference tasks.
Conclusion: Decoding-free sampling strategies provide reliable marginal probability estimates at very low computational cost, offering an efficient and practical route for language model evaluation and inference tasks.
Abstract: Modern language models operate on subword-tokenized text in order to make a trade-off between model size, inference speed, and vocabulary coverage. A side effect of this is that, during inference, models are evaluated by measuring the probability of only the specific tokenization produced as the output, despite there being many possible ways to represent the same text with a subword vocabulary. Recent studies have argued instead for evaluating LLMs by marginalization, that is, the probability mass of all tokenizations of a given text. Marginalization is difficult due to the number of possible tokenizations of a text, so often approximate marginalization is done via sampling. However, a downside of sampling is that an expensive generation step must be performed by the LLM for each sample, which limits the number of samples that can be acquired given a runtime budget, and therefore also the accuracy of the approximation. Since computing the probability of a sequence given the tokenization is relatively cheap compared to actually generating it, we investigate sampling strategies that are decoding-free: they require no generation from the LLM, instead relying entirely on extremely cheap sampling strategies that are model and tokenizer agnostic. We investigate the approximation quality and speed of decoding-free sampling strategies for a number of open models to find that they provide sufficiently accurate marginal estimates at a small fraction of the runtime cost and demonstrate their use on a set of downstream inference tasks.
[27] Tri-Modal Severity Fused Diagnosis across Depression and Post-traumatic Stress Disorders
Filippo Cenacchi,Deborah Richards,Longbing Cao
Main category: cs.CL
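TL;DR: Proposes a unified tri-modal affective severity framework that assesses the severity of both depression and PTSD. It fuses text, audio, and facial features through a calibrated late-fusion classifier to produce graded cross-disorder diagnoses, with good robustness and clinical interpretability.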
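Late fusion, as used in this entry, combines per-modality predictions rather than raw features. The sketch below shows one simple variant, a weighted average of class probabilities that skips missing modalities. It is an assumption-laden illustration: the function `late_fusion`, the weights, and the probability values are invented, and the calibration step the paper describes is not modeled.

```python
from typing import Dict, List, Optional


def late_fusion(modality_probs: Dict[str, Optional[List[float]]],
                weights: Dict[str, float]) -> List[float]:
    """Weighted average of per-modality class probabilities.

    A modality whose value is None is treated as missing, and the remaining
    weights are renormalized so the fused vector still sums to 1.
    """
    available = {m: p for m, p in modality_probs.items() if p is not None}
    if not available:
        raise ValueError("at least one modality must be present")
    total_weight = sum(weights[m] for m in available)
    n_classes = len(next(iter(available.values())))
    fused = [0.0] * n_classes
    for modality, probs in available.items():
        for i, p in enumerate(probs):
            fused[i] += weights[modality] * p / total_weight
    return fused


# Hypothetical 3-class severity probabilities; the facial stream is missing.
probs = {"text": [0.6, 0.3, 0.1], "audio": [0.2, 0.5, 0.3], "face": None}
weights = {"text": 2.0, "audio": 1.0, "face": 1.0}
print(late_fusion(probs, weights))
```

Renormalizing over the available modalities is one straightforward way to keep a prediction well defined when a stream drops out, which relates to the robustness under missing modalities that the entry reports.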
Details
Motivation: Depression and PTSD are often comorbid with intertwined symptoms, and conventional binary, single-disorder automated assessment does not meet clinical needs. Automated tools are needed that provide cross-disorder severity estimates and explanations for decisions.
Method: Synchronizes and fuses interview text (sentence-level transformer embeddings), audio (log-Mel statistics with deltas), and facial signals (action units, gaze, head pose, and related descriptors), using a calibrated late-fusion classifier to output graded severities for depression (PHQ-8, 5 classes) and PTSD (3 classes) along with feature attributions.
Result: In stratified cross-validation on DAIC-derived corpora, the fused model matches the strongest unimodal baseline on accuracy and weighted F1, while offering better decision-curve utility and robustness to missing or noisy modalities. For PTSD it reduces regression error and improves class concordance. Errors cluster between adjacent severity levels, and extreme classes are identified reliably. Ablations show that text matters most for depression and that audio and facial cues are critical for PTSD; attributions are consistent with clinical behavioral markers.
Conclusion: The tri-modal fusion approach supports reproducible evaluation and provides explained support for clinical decisions, and is suited to assessing the severity of affective disorders in comorbid settings.
Abstract: Depression and post-traumatic stress disorder (PTSD) often co-occur with connected symptoms, complicating automated assessment, which is often binary and disorder-specific. Clinically useful diagnosis needs severity-aware, cross-disorder estimates and decision-support explanations. Our unified tri-modal affective severity framework synchronizes and fuses interview text (sentence-level transformer embeddings), audio (log-Mel statistics with deltas), and facial signals (action units, gaze, head, and pose descriptors) to output graded severities for diagnosing both depression (PHQ-8; 5 classes) and PTSD (3 classes). Standardized features are fused via a calibrated late-fusion classifier, yielding per-disorder probabilities and feature-level attributions. This severity-aware tri-modal affective fusion approach is demonstrated on concurrent multi-disorder assessment of depression and PTSD. In stratified cross-validation on DAIC-derived corpora, it outperforms unimodal and ablation baselines. The fused model matches the strongest unimodal baseline on accuracy and weighted F1, while improving decision-curve utility and robustness under noisy or missing modalities. For PTSD specifically, fusion reduces regression error and improves class concordance. Errors cluster between adjacent severities; extreme classes are identified reliably. Ablations show that text contributes most to depression severity and that audio and facial cues are critical for PTSD, while attributions align with linguistic and behavioral markers. Our approach offers reproducible evaluation and clinician-in-the-loop support for affective clinical decision making.
[28] Context-level Language Modeling by Learning Predictive Context Embeddings
Beiya Dai,Yuliang Liu,Daozheng Xue,Qipeng Guo,Kai Chen,Xinbing Wang
Main category: cs.CL
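TL;DR: Proposes ContextLM, a framework that augments standard pretraining with a "next-context prediction" objective, improving language models' handling of long-range dependencies and semantic structure while remaining compatible with conventional autoregressive evaluation.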
Details
Motivation: Conventional next-token prediction limits a model's ability to capture higher-level semantic structure and long-range contextual relationships, so a stronger pretraining mechanism is needed.
Method: Proposes the ContextLM framework, which adds a next-context prediction objective on top of standard token-level prediction. The model learns to predict future multi-token context chunks and is trained with error signals derived from those future token chunks.
Result: Experiments on the GPT2 and Pythia model families (up to 1.5B parameters) show consistent improvements in perplexity and downstream task performance, along with better long-range coherence and more effective attention allocation, at minimal computational overhead.
Conclusion: Next-context prediction offers a scalable and efficient path to stronger language modeling and improves model performance without changing the inference paradigm.
Abstract: Next-token prediction (NTP) is the cornerstone of modern large language model (LLM) pretraining, driving unprecedented capabilities in text generation, reasoning, and instruction following. However, the token-level prediction limits the model's capacity to capture higher-level semantic structures and long-range contextual relationships. To overcome this limitation, we introduce ContextLM, a framework that augments standard pretraining with an inherent next-context prediction objective. This mechanism trains the model to learn predictive representations of multi-token contexts, leveraging error signals derived from future token chunks. Crucially, ContextLM achieves this enhancement while remaining fully compatible with the standard autoregressive, token-by-token evaluation paradigm (e.g., perplexity). Extensive experiments on the GPT2 and Pythia model families, scaled up to 1.5B parameters, show that ContextLM delivers consistent improvements in both perplexity and downstream task performance. Our analysis indicates that next-context prediction provides a scalable and efficient pathway to stronger language modeling, yielding better long-range coherence and more effective attention allocation with minimal computational overhead.
[29] Citation Failure: Definition, Analysis and Efficient Mitigation
Jan Buchmann,Iryna Gurevych
Main category: cs.CL
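TL;DR: Proposes the CITECONTROL benchmark and the CITENTION framework to address citation failure in LLM-based RAG systems, analyzing how the relation between response and evidence affects citation quality and substantially improving citation performance.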
Details
Motivation: Prior work does not distinguish citation failure from response failure, which makes citation quality hard to evaluate and improve accurately. The authors aim to model and mitigate citation failure specifically.
Method: Follows a two-step approach. First, it builds the CITECONTROL benchmark, which systematically varies the relation between response and evidence to analyze citation failure modes. Second, it proposes the CITENTION framework, which integrates generative, attention-based, and retrieval-based citation methods to improve citation quality.
Result: Citation failures increase with relational complexity; CITENTION substantially improves citation quality on CITECONTROL and in transfer settings.
Conclusion: Separating citation failure from response failure and combining several citation mechanisms effectively improves the completeness and reliability of citations from LLMs in RAG systems.
Abstract: Citations from LLM-based RAG systems are supposed to simplify response verification. However, this does not hold for citation failure, when a model generates a helpful response, but fails to cite complete evidence. In contrast to previous work, we propose to disentangle this from response failure, where the response itself is flawed, and citing complete evidence is impossible. To address citation failure, this work follows a two-step approach: (1) We study when citation failure occurs and (2) how it can be mitigated. For step 1, we extend prior work by investigating how the relation between response and evidence affects citation quality. We introduce CITECONTROL, a benchmark that systematically varies this relation to analyze failure modes. Experiments show that failures increase with relational complexity and suggest that combining citation methods could improve performance, motivating step 2. To improve LLM citation efficiently, we propose CITENTION, a framework integrating generative, attention-based, and retrieval-based methods. Results demonstrate substantial citation improvements on CITECONTROL and in transfer settings. We make our data and code publicly available.
[30] Exploring Generative Process Reward Modeling for Semi-Structured Data: A Case Study of Table Question Answering
Lei Tang,Wei Zhou,Mohsen Mesgar
Main category: cs.CL
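TL;DR: The first systematic study of process reward models (PRMs) for table question answering (TQA). PRMs that combine textual and code verification help with solution selection, but they generalize poorly to out-of-domain data, and step-level verification correlates only weakly with answer accuracy.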
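Solution selection with a PRM, as discussed in this entry, comes down to aggregating per-step scores into one score per candidate and picking the best. A minimal sketch follows; the aggregation choices (`min`, `prod`, `mean`), the function names, and the step scores are illustrative assumptions rather than the configuration evaluated in the paper.

```python
import math
from typing import List, Sequence, Tuple


def aggregate(step_scores: Sequence[float], how: str = "min") -> float:
    """Collapse per-step PRM scores into a single solution-level score."""
    if not step_scores:
        raise ValueError("a solution needs at least one scored step")
    if how == "min":
        return min(step_scores)        # a chain is as strong as its weakest step
    if how == "prod":
        return math.prod(step_scores)  # treats step scores as independent probabilities
    if how == "mean":
        return sum(step_scores) / len(step_scores)
    raise ValueError(f"unknown aggregation: {how}")


def select_answer(candidates: List[Tuple[str, List[float]]], how: str = "min") -> str:
    """Return the answer whose step scores aggregate highest."""
    return max(candidates, key=lambda c: aggregate(c[1], how))[0]


# Hypothetical step scores for two candidate solutions to the same question.
candidates = [("42", [0.99, 0.5, 0.99]), ("17", [0.7, 0.6, 0.8])]
print(select_answer(candidates, "min"))   # favors the candidate with no weak step
print(select_answer(candidates, "mean"))  # favors the candidate with the higher average
```

The two aggregations disagree on this toy input, which shows why the link between step-level scores and final answer accuracy can be loose when reasoning steps depend only weakly on one another.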
Details
Motivation: To explore whether process reward models (PRMs) apply to tasks involving semi-structured data, such as table question answering, and to address the challenges posed by abundant irrelevant information, loosely connected reasoning steps, and domain-specific reasoning.
Method: Evaluates state-of-the-art generative PRMs on table question answering from both the answer and the step perspectives, and analyzes in-domain and out-of-domain generalization.
Result: PRMs that combine textual and code verification can aid solution selection but perform poorly on out-of-domain data. Step-level verification performance correlates weakly with final answer accuracy, possibly because of weak dependencies between reasoning steps.
Conclusion: Current PRMs have clear limitations on table question answering, and more robust, process-aware verifiers are needed to improve performance.
Abstract: Process reward models (PRMs) improve complex reasoning in large language models (LLMs) by grading candidate solutions step-by-step and selecting answers via aggregated step scores. While effective in domains such as mathematics, their applicability to tasks involving semi-structured data, like table question answering (TQA), remains unexplored. TQA poses unique challenges for PRMs, including abundant irrelevant information, loosely connected reasoning steps, and domain-specific reasoning. This work presents the first systematic study of PRMs for TQA. We evaluate state-of-the-art generative PRMs on TQA from both answer and step perspectives. Results show that PRMs that combine textual and code verification can aid solution selection but struggle to generalize to out-of-domain data. Analysis reveals a weak correlation between performance in step-level verification and answer accuracy, possibly stemming from weak step dependencies and loose causal links. Our findings highlight limitations of current PRMs on TQA and offer valuable insights for building more robust, process-aware verifiers.
[31] Teaching Language Models to Reason with Tools
Chengpeng Li,Zhengyang Tang,Ziniu Li,Mingfeng Xue,Keqin Bao,Tian Ding,Ruoyu Sun,Benyou Wang,Xiang Wang,Junyang Lin,Dayiheng Liu
Main category: cs.CL
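TL;DR: Proposes CoRT (Code-Optimized Reasoning Training), a post-training framework that improves how large reasoning models (LRMs) use a code interpreter (CI) on mathematical tasks. It introduces Hint-Engineering to generate high-quality code-integrated reasoning data and combines rejection sampling with reinforcement learning to optimize the multi-round interleaving of internal and external reasoning. Experiments show that CoRT improves performance on several mathematical reasoning datasets (by up to 8%) and reduces token usage by 30% to 50%.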
Details
Motivation: Large reasoning models are often inefficient or inaccurate on complex mathematical operations because their internal probabilistic reasoning conflicts with external deterministic tools such as code interpreters. The two need to be integrated effectively to improve both accuracy and efficiency.
Method: Proposes the CoRT framework, including the Hint-Engineering data synthesis strategy, which injects hints into reasoning paths to generate high-quality training samples. Models from 1.5B to 32B parameters are trained with supervised fine-tuning, and rejection sampling and reinforcement learning are used to refine the multi-round interleaving of CI usage and internal thinking.
Result: Across five mathematical reasoning datasets, CoRT yields absolute improvements of 4% and 8% on DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Qwen-1.5B respectively, while reducing token usage by about 30% for the 32B model and 50% for the 1.5B model.
Conclusion: CoRT effectively improves how large reasoning models use computational tools, strengthening mathematical reasoning while markedly improving efficiency, and offers a workable approach to combining models with external tools.
Abstract: Large reasoning models (LRMs) like OpenAI-o1 have shown impressive capabilities in natural language reasoning. However, these models frequently demonstrate inefficiencies or inaccuracies when tackling complex mathematical operations. While integrating computational tools such as Code Interpreters (CIs) offers a promising solution, it introduces a critical challenge: a conflict between the model's internal, probabilistic reasoning and the external, deterministic knowledge provided by the CI, which often leads models to unproductive deliberation. To overcome this, we introduce CoRT (Code-Optimized Reasoning Training), a post-training framework designed to teach LRMs to effectively utilize CIs. We propose Hint-Engineering, a new data synthesis strategy that strategically injects diverse hints at optimal points within reasoning paths. This approach generates high-quality, code-integrated reasoning data specifically tailored to optimize LRM-CI interaction. Using this method, we have synthesized 30 high-quality samples to post-train models ranging from 1.5B to 32B parameters through supervised fine-tuning. CoRT further refines the multi-round interleaving of external CI usage and internal thinking by employing rejection sampling and reinforcement learning. Our experimental evaluations demonstrate CoRT's effectiveness, yielding absolute improvements of 4% and 8% on DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Qwen-1.5B, respectively, across five challenging mathematical reasoning datasets. Moreover, CoRT significantly enhances efficiency, reducing token usage by approximately 30% for the 32B model and 50% for the 1.5B model compared to pure natural language reasoning baselines. The models and code are available at: https://github.com/ChengpengLi1003/CoRT.
[32] Evaluating Latent Knowledge of Public Tabular Datasets in Large Language Models
Matteo Silvestri,Flavio Giorgi,Fabrizio Silvestri,Gabriele Tolomei
Main category: cs.CL
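TL;DR: Finds that LLM performance on tabular reasoning tasks may stem from memorization of datasets that carry strong semantic cues rather than from genuine generalization.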
Details
Motivation: Evaluations of LLM reasoning over structured data often overlook dataset contamination; this work examines the effect of that confound.
Method: Runs controlled probing experiments on widely used tabular benchmarks such as Adult Income and Titanic, comparing model performance when semantic cues (such as column names and category meanings) are kept versus removed.
Result: Models perform strongly when the data contains strong semantic cues; once the cues are removed or randomized, performance drops sharply to near-random levels.
Conclusion: The strong performance of LLMs on tabular reasoning tasks is partly attributable to memorization of public datasets rather than genuine reasoning; evaluation protocols should be improved to separate semantic leakage from true reasoning ability.
Abstract: Large Language Models (LLMs) are increasingly evaluated on their ability to reason over structured data, yet such assessments often overlook a crucial confound: dataset contamination. In this work, we investigate whether LLMs exhibit prior knowledge of widely used tabular benchmarks such as Adult Income, Titanic, and others. Through a series of controlled probing experiments, we reveal that contamination effects emerge exclusively for datasets containing strong semantic cues, for instance, meaningful column names or interpretable value categories. In contrast, when such cues are removed or randomized, performance sharply declines to near-random levels. These findings suggest that LLMs' apparent competence on tabular reasoning tasks may, in part, reflect memorization of publicly available datasets rather than genuine generalization. We discuss implications for evaluation protocols and propose strategies to disentangle semantic leakage from authentic reasoning ability in future LLM assessments.
[33] FreeChunker: A Cross-Granularity Chunking Framework
Wenxuan Zhang,Yuan-Hao Jiang,Yonghe Wu
Main category: cs.CL
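TL;DR: Proposes FreeChunker, a cross-granularity encoding framework that treats sentences as the basic unit and supports flexible retrieval of arbitrary sentence combinations, moving away from the traditional static chunking paradigm and significantly improving retrieval performance and computational efficiency.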
Details
Motivation: Existing fixed-granularity chunking methods rely on static boundary identification and adapt poorly to diverse query requirements, which limits the effectiveness of RAG systems.
Method: Proposes the FreeChunker framework, which treats sentences as atomic units and replaces static chunk segmentation with flexible cross-granularity retrieval that supports dynamic retrieval of arbitrary sentence combinations.
Result: Experiments on LongBench V2 show that FreeChunker outperforms traditional chunking methods in retrieval performance and significantly outperforms existing approaches in computational efficiency.
Conclusion: Through this paradigm shift, FreeChunker improves the adaptability and efficiency of RAG systems and opens a new direction for handling dynamic, complex queries.
Abstract: Chunking strategies significantly impact the effectiveness of Retrieval-Augmented Generation (RAG) systems. Existing methods operate within fixed-granularity paradigms that rely on static boundary identification, limiting their adaptability to diverse query requirements. This paper presents FreeChunker, a Cross-Granularity Encoding Framework that fundamentally transforms the traditional chunking paradigm: the framework treats sentences as atomic units and shifts from static chunk segmentation to flexible retrieval supporting arbitrary sentence combinations. This paradigm shift not only significantly reduces the computational overhead required for semantic boundary detection but also enhances adaptability to complex queries. Experimental evaluation on LongBench V2 demonstrates that FreeChunker achieves superior retrieval performance compared to traditional chunking methods, while significantly outperforming existing approaches in computational efficiency.
[34] Dialogue Is Not Enough to Make a Communicative BabyLM (But Neither Is Developmentally Inspired Reinforcement Learning)
Francesca Padovani,Bastian Bunzeck,Manar Ali,Omar Momen,Arianna Bisazza,Hendrik Buschmeier,Sina Zarrieß
Main category: cs.CL
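TL;DR: Examines small language models pretrained only on dialogue data and applies several fine-tuning strategies to improve their dialogue generation, finding that DPO fine-tuning further improves performance on a custom dialogue benchmark.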
Details
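Motivation: To explore whether pretraining exclusively on dialogue data yields language models that are formally and functionally apt, and to improve their dialogue generation.
Method: Pretrains the llamalogue model on dialogue data, applies several fine-tuning strategies including PPO and DPO, and evaluates on standard BabyLM benchmarks and a custom dialogue task.
Result: The models underperform on most standard BabyLM benchmarks but excel at dialogue continuation prediction in a minimal pair setting. DPO fine-tuning further improves performance on the custom dialogue benchmark, while PPO fine-tuning has mixed or even adverse effects.
Conclusion: Pretraining on dialogue data alone can improve performance on specific dialogue tasks, and combining it with DPO fine-tuning further improves dialogue generation.
Abstract: We investigate whether pre-training exclusively on dialogue data results in formally and functionally apt small language models. Based on this pre-trained llamalogue model, we employ a variety of fine-tuning strategies to enforce "more communicative" text generations by our models. Although our models underperform on most standard BabyLM benchmarks, they excel at dialogue continuation prediction in a minimal pair setting. While PPO fine-tuning has mixed to adversarial effects on our models, DPO fine-tuning further improves their performance on our custom dialogue benchmark.
[35] The Impact of Negated Text on Hallucination with Large Language Models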
Jaehyung Seo,Hyeonseok Moon,Heuiseok Lim
Main category: cs.CL
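TL;DR: This paper studies how well large language models detect hallucinations in negated text and finds that, in negated contexts, models struggle to identify hallucinations and show logically inconsistent behavior.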
Details
Motivation: The impact of negated text on hallucination in large language models has not been sufficiently explored, and this paper aims to fill that gap. Method: The authors build the NegHalu dataset by reconstructing existing hallucination detection datasets with negated expressions, and analyze the models' internal states at the token level. Result: Experiments show that LLMs struggle to detect hallucinations in negated text and often produce logically inconsistent or unfaithful judgments. Conclusion: Negated text significantly affects the hallucination detection ability of LLMs, and further research is needed to mitigate its negative effects. Abstract: Recent studies on hallucination in large language models (LLMs) have been actively progressing in natural language processing. However, the impact of negated text on hallucination with LLMs remains largely unexplored. In this paper, we set three important yet unanswered research questions and aim to address them. To derive the answers, we investigate whether LLMs can recognize contextual shifts caused by negation and still reliably distinguish hallucinations comparable to affirmative cases. We also design the NegHalu dataset by reconstructing existing hallucination detection datasets with negated expressions. Our experiments demonstrate that LLMs struggle to detect hallucinations in negated text effectively, often producing logically inconsistent or unfaithful judgments. Moreover, we trace the internal state of LLMs as they process negated inputs at the token level and reveal the challenges of mitigating their unintended effects.
[36] VLSP 2025 MLQA-TSR Challenge: Vietnamese Multimodal Legal Question Answering on Traffic Sign Regulation
Son T. Luu,Trung Vo,Hiep Nguyen,Khanh Quoc Tran,Kiet Van Nguyen,Vu Tran,Ngan Luu-Thuy Nguyen,Le-Minh Nguyen
Main category: cs.CL
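TL;DR: This paper introduces the VLSP 2025 MLQA-TSR task, which aims to advance research on Vietnamese multimodal legal text processing with a focus on traffic sign regulation. It comprises two subtasks, multimodal legal retrieval and multimodal question answering, and provides a benchmark dataset along with the current best results.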
Details
Motivation: To advance Vietnamese multimodal legal text processing, in particular research on intelligent systems for traffic sign regulation. Method: The organizers designed and released the VLSP 2025 MLQA-TSR shared task, which comprises two subtasks, multimodal legal retrieval and multimodal question answering, and provides a benchmark dataset for model training and evaluation. Result: The best reported results are an F2 score of 64.55% on multimodal legal retrieval and an accuracy of 86.30% on multimodal question answering. Conclusion: The task provides an important benchmark for Vietnamese multimodal legal information processing and helps advance the development of related intelligent systems. Abstract: This paper presents the VLSP 2025 MLQA-TSR - the multimodal legal question answering on traffic sign regulation shared task at VLSP 2025. VLSP 2025 MLQA-TSR comprises two subtasks: multimodal legal retrieval and multimodal question answering. The goal is to advance research on Vietnamese multimodal legal text processing and to provide a benchmark dataset for building and evaluating intelligent systems in multimodal legal domains, with a focus on traffic sign regulation in Vietnam. The best-reported results on VLSP 2025 MLQA-TSR are an F2 score of 64.55% for multimodal legal retrieval and an accuracy of 86.30% for multimodal question answering.
[37] NeoDictaBERT: Pushing the Frontier of BERT models for Hebrew
Shaltiel Shmidman,Avi Shmidman,Moshe Koppel
Main category: cs.CL
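TL;DR: This paper introduces NeoDictaBERT and its bilingual variant, BERT-style models built on the NeoBERT architecture and designed for Hebrew. They perform strongly on a range of Hebrew benchmarks and outperform comparable multilingual models on retrieval tasks.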
Details
Motivation: Although BERT models perform well on many tasks, their architecture is now outdated. More advanced BERT-style models are needed to close the gap with modern transformer models such as Llama3 and Qwen3 and to strengthen support for Hebrew NLP. Method: Using the modern NeoBERT architecture, the authors trained NeoDictaBERT and its bilingual variant specifically on Hebrew text, optimizing the vocabulary and training pipeline and extending the context window. Result: NeoDictaBERT outperforms existing models on almost all Hebrew benchmarks; the bilingual variant performs especially well on retrieval tasks, surpassing other multilingual models of similar size. Conclusion: The NeoDictaBERT models provide a strong foundation for Hebrew NLP tasks, demonstrate the effectiveness of a modernized BERT architecture for a specific language, and help advance technology for low-resource languages. Abstract: Since their initial release, BERT models have demonstrated exceptional performance on a variety of tasks, despite their relatively small size (BERT-base has ~100M parameters). Nevertheless, the architectural choices used in these models are outdated compared to newer transformer-based models such as Llama3 and Qwen3. In recent months, several architectures have been proposed to close this gap. ModernBERT and NeoBERT both show strong improvements on English benchmarks and significantly extend the supported context window. Following their successes, we introduce NeoDictaBERT and NeoDictaBERT-bilingual: BERT-style models trained using the same architecture as NeoBERT, with a dedicated focus on Hebrew texts. These models outperform existing ones on almost all Hebrew benchmarks and provide a strong foundation for downstream tasks. Notably, the NeoDictaBERT-bilingual model shows strong results on retrieval tasks, outperforming other multilingual models of similar size. In this paper, we describe the training process and report results across various benchmarks. We release the models to the community as part of our goal to advance research and development in Hebrew NLP.
[38] Teacher Demonstrations in a BabyLM's Zone of Proximal Development for Contingent Multi-Turn Interaction
Suchir Salhan,Hongyi Gu,Donya Rooein,Diana Galvan-Sosa,Gabrielle Gaudeau,Andrew Caines,Zheng Yuan,Paula Buttery
Main category: cs.CL
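TL;DR: Proposes ContingentChat, a framework for benchmarking and improving multi-turn dialogue contingency in a BabyLM trained on 100M words; post-training on a new alignment dataset makes responses more grammatical and cohesive.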
Details
Motivation: To improve the quality of contingency (prompt, direct, and meaningful exchanges) in multi-turn dialogues between a child and a caregiver. Method: The authors use ContingentChat, a teacher-student framework, post-train a BabyLM on a new alignment dataset, and run experiments with adaptive teacher decoding strategies. Result: Post-training clearly improves the grammaticality and cohesion of generated responses, but adaptive decoding strategies bring only limited additional gains. Conclusion: Targeted post-training can improve dialogue quality, but achieving genuine conversational contingency remains a challenge for BabyLMs. Abstract: Multi-turn dialogues between a child and a caregiver are characterized by a property called contingency - that is, prompt, direct, and meaningful exchanges between interlocutors. We introduce ContingentChat, a teacher-student framework that benchmarks and improves multi-turn contingency in a BabyLM trained on 100M words. Using a novel alignment dataset for post-training, BabyLM generates responses that are more grammatical and cohesive. Experiments with adaptive teacher decoding strategies show limited additional gains. ContingentChat demonstrates the benefits of targeted post-training for dialogue quality and indicates that contingency remains a challenging goal for BabyLMs.
[39] LM-mixup: Text Data Augmentation via Language Model based Mixup
Zhijie Deng,Zhouan Shen,Ling Li,Yao Zhou,Zhaowei Zhu,Yanji He,Wei Wang,Jiaheng Wei
Main category: cs.CL
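TL;DR: This paper proposes the task of Instruction Distillation. By building the MIXTURE dataset and the LM-Mixup method, it distills low-quality, redundant instruction data into high-quality instruction pairs, combining supervised fine-tuning with reinforcement learning to markedly improve the instruction-following performance of large models.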
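The reinforcement learning stage summarized above relies on Group Relative Policy Optimization (GRPO) with three reward signals. A minimal sketch of the group-relative advantage computation follows. The equal weighting of the three rewards, the function names, and the use of the population standard deviation are assumptions made for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def combined_reward(quality, alignment, format_ok, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of three reward signals (the weights are hypothetical)."""
    w_quality, w_alignment, w_format = weights
    format_score = 1.0 if format_ok else 0.0
    return w_quality * quality + w_alignment * alignment + w_format * format_score


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples in its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled distillations for the same cluster of low-quality instructions.
group = [
    combined_reward(0.9, 0.8, True),
    combined_reward(0.4, 0.7, True),
    combined_reward(0.6, 0.2, False),
]
advantages = group_relative_advantages(group)
```

Samples that score above their group's mean receive a positive advantage and those below receive a negative one, so no separate value model is needed to provide a baseline.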
Details
Motivation: High-quality instruction data is scarce, while low-quality data is often discarded, wasting information; existing data augmentation methods struggle to make effective use of low-quality data and lack a well-defined evaluation. Method: The paper defines the Instruction Distillation task, builds the 144K-sample MIXTURE dataset, and trains the LM-Mixup model with supervised fine-tuning followed by reinforcement learning based on Group Relative Policy Optimization (GRPO), using three reward signals: quality, semantic alignment, and format compliance. Result: Across multiple benchmarks, models fine-tuned on the distilled data, which is only about 3% of the full dataset, surpass full-dataset training and are competitive with state-of-the-art high-quality data selection methods. Conclusion: When properly distilled, low-quality data becomes a valuable resource, and LM-Mixup efficiently improves both the performance and the data efficiency of instruction-tuned large models. Abstract: Instruction tuning is crucial for aligning Large Language Models (LLMs), yet the quality of instruction-following data varies significantly. While high-quality data is paramount, it is often scarce; conversely, abundant low-quality data is frequently discarded, leading to substantial information loss. Existing data augmentation methods struggle to augment this low-quality data effectively, and the evaluation of such techniques remains poorly defined. To address this, we formally define the task of Instruction Distillation: distilling multiple low-quality and redundant inputs into high-quality and coherent instruction-output pairs. Specifically, we introduce a comprehensive data construction pipeline to create MIXTURE, a 144K-sample dataset pairing low-quality or semantically redundant imperfect instruction clusters with their high-quality distillations. We then introduce LM-Mixup, by first performing supervised fine-tuning on MIXTURE and then optimizing it with reinforcement learning. This process uses three complementary reward signals: quality, semantic alignment, and format compliance, via Group Relative Policy Optimization (GRPO). We demonstrate that LM-Mixup effectively augments imperfect datasets: fine-tuning LLMs on its distilled data, which accounts for only about 3% of the entire dataset, not only surpasses full-dataset training but also competes with state-of-the-art high-quality data selection methods across multiple benchmarks. Our work establishes that low-quality data is a valuable resource when properly distilled and augmented with LM-Mixup, significantly enhancing the efficiency and performance of instruction-tuned LLMs.
[40] Systematic Evaluation of Uncertainty Estimation Methods in Large Language Models
Christian Hobelsberger,Theresa Winner,Andreas Nawroth,Oliver Mitevski,Anna-Carolina Haensch
Main category: cs.CL
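TL;DR: This paper systematically evaluates four methods for estimating confidence in large language model (LLM) outputs and finds that the hybrid CoCoA method performs best in calibration and in discriminating correct answers.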
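Two of the confidence signals compared in this entry can be illustrated with a few lines of code. The sketch below assumes that a probability-based score can be read as the probability of the generated sequence and that sample consistency is the share of sampled answers agreeing with the most frequent one. Both readings, and the function names, are assumptions of this example rather than the paper's definitions.

```python
import math
from collections import Counter


def sequence_probability(token_logprobs):
    """Probability of a generated sequence from its per-token log-probabilities."""
    return math.exp(sum(token_logprobs))


def sample_consistency(answers):
    """Return the majority answer and the fraction of samples that agree with it."""
    normalized = [a.strip().lower() for a in answers]
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer, votes / len(normalized)


# Hypothetical outputs: one greedy decode and four sampled answers.
greedy_confidence = sequence_probability([math.log(0.5), math.log(0.5)])
majority, agreement = sample_consistency(["Paris", "paris ", "Lyon", "Paris"])
```

The first score uses information internal to the model, while the second only needs repeated sampling, which is one reason the two can capture different facets of confidence.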
Details
Motivation: The outputs of large language models vary widely in uncertainty and correctness, so their practical reliability is hard to guarantee; effective methods are needed to quantify this uncertainty. Method: Four confidence estimation methods (VCE, MSP, Sample Consistency, and CoCoA) are evaluated experimentally on four question-answering tasks using a state-of-the-art open-source LLM. Result: Each uncertainty metric captures a different facet of model confidence; CoCoA offers the best overall reliability, clearly improving both calibration and the discrimination of correct answers. Conclusion: CoCoA is the stronger confidence estimation method; the study also discusses the trade-offs of each method and gives recommendations for choosing uncertainty measures in practical applications. Abstract: Large language models (LLMs) produce outputs with varying levels of uncertainty, and, just as often, varying levels of correctness, making their practical reliability far from guaranteed. To quantify this uncertainty, we systematically evaluate four approaches for confidence estimation in LLM outputs: VCE, MSP, Sample Consistency, and CoCoA (Vashurin et al., 2025). For the evaluation of the approaches, we conduct experiments on four question-answering tasks using a state-of-the-art open-source LLM. Our results show that each uncertainty metric captures a different facet of model confidence and that the hybrid CoCoA approach yields the best reliability overall, improving both calibration and discrimination of correct answers. We discuss the trade-offs of each method and provide recommendations for selecting uncertainty measures in LLM applications.
[41] Mask and You Shall Receive: Optimizing Masked Language Modeling For Pretraining BabyLMs
Lukas Edman,Alexander Fraser
Main category: cs.CL
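TL;DR: Proposes an improved form of masked language modeling (MLM) that dynamically adjusts the probability of masking each token according to the model's ability to predict it, and adds sub-token embeddings. This improves performance on (Super)GLUE tasks and morphological generalization, and beats the baseline in the BabyLM Challenge.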
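One way to picture difficulty-aware masking is to turn a per-token difficulty estimate into per-token masking probabilities whose average matches a target mask rate. The sketch below does exactly that. The difficulty values, the 15% target rate, the floor, and the proportional scaling rule are all assumptions for illustration, since the entry does not spell out the submission's actual schedule.

```python
import random


def adaptive_mask_probs(difficulty, target_rate=0.15, floor=0.01):
    """Scale per-token difficulty into masking probabilities.

    Harder tokens get masked more often. Probabilities are scaled so their
    mean equals target_rate, then clipped at 1.0 (clipping can lower the
    realized rate when a few tokens dominate).
    """
    weights = [max(d, floor) for d in difficulty]
    scale = target_rate * len(weights) / sum(weights)
    return [min(1.0, w * scale) for w in weights]


def sample_mask(probs, rng):
    """Draw an independent mask decision for each token."""
    return [rng.random() < p for p in probs]


# Hypothetical difficulty, e.g. 1 - p(correct token) from a previous pass.
difficulty = [0.9, 0.1, 0.5, 0.5]
probs = adaptive_mask_probs(difficulty)
mask = sample_mask(probs, random.Random(0))
```

Uniform masking is the special case where every token has the same difficulty, which makes the sketch easy to compare against standard MLM.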
Details
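Motivation: Standard MLM masks all tokens uniformly during training, ignoring differences in how hard each token is for the model to predict, which makes learning inefficient; in addition, the lack of effective use of sub-token information limits the model's morphological generalization. Method: The authors propose an adaptive MLM that dynamically adjusts each token's masking probability according to how easily the model predicts it, combined with sub-token embeddings to strengthen learning of morphological variation. Result: The approach substantially outperforms standard MLM on the (Super)GLUE benchmark and beats the baseline in the strict-small track of the BabyLM Challenge. Conclusion: Combining adaptive masking with sub-token embeddings improves the representational ability and downstream performance of language models, and is particularly suited to resource-constrained, low-resource language modeling. Abstract: We describe our strategy for the 2025 edition of the BabyLM Challenge. Our main contribution is that of an improved form of Masked Language Modeling (MLM), which adapts the probabilities of the tokens masked according to the model's ability to predict them. The results show a substantial increase in performance on (Super)GLUE tasks over the standard MLM. We also incorporate sub-token embeddings, finding that this increases the model's morphological generalization capabilities. Our submission beats the baseline in the strict-small track.
[42] RECALL: REpresentation-aligned Catastrophic-forgetting ALLeviation via Hierarchical Model Merging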
Bowen Wang,Haiyuan Wan,Liwen Shi,Chen Yang,Peng He,Yue Ma,Haochen Han,Wenhao Li,Tiao Tan,Yongjian Li,Fangming Liu,Yifan Gong,Sheng Zhang
Main category: cs.CL
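TL;DR: Proposes RECALL, a framework that uses the internal representations of large language models for data-free continual learning, achieving knowledge alignment and multi-domain integration through hierarchical parameter fusion.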
Details
Motivation: To address catastrophic forgetting in continual learning without relying on historical data or task labels. Method: Inter-model similarity is computed from layer-wise hidden representations over clustered typical samples and is used to perform adaptive, hierarchical parameter fusion. Result: RECALL outperforms baseline methods across five NLP tasks and multiple continual learning scenarios, with strong knowledge retention and generalization. Conclusion: RECALL offers a scalable, data-free continual learning solution for large language models that balances the preservation of general features with task-specific adaptation. Abstract: We unveil that internal representations in large language models (LLMs) serve as reliable proxies of learned knowledge, and propose RECALL, a novel representation-aware model merging framework for continual learning without access to historical data. RECALL computes inter-model similarity from layer-wise hidden representations over clustered typical samples, and performs adaptive, hierarchical parameter fusion to align knowledge across models. This design enables the preservation of domain-general features in shallow layers while allowing task-specific adaptation in deeper layers. Unlike prior methods that require task labels or incur performance trade-offs, RECALL achieves seamless multi-domain integration and strong resistance to catastrophic forgetting. Extensive experiments across five NLP tasks and multiple continual learning scenarios show that RECALL outperforms baselines in both knowledge retention and generalization, providing a scalable and data-free solution for evolving LLMs.
[43] Steering Evaluation-Aware Language Models To Act Like They Are Deployed
Tim Tian Hua,Andrew Qin,Samuel Marks,Neel Nanda
Main category: cs.CL
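TL;DR: This paper proposes adding a steering vector to a large language model's activations to suppress its evaluation awareness, so that the model behaves during evaluation as if it were deployed, improving the reliability of safety evaluations.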
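The core operation named in this summary, adding a steering vector to activations, can be sketched on plain Python lists. The difference-of-means construction, the scaling factor `alpha`, and the toy two-dimensional activations are assumptions of this example; the entry only states that a steering vector built from the original model is added to the activations.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(deploy_acts, eval_acts):
    """Difference of means, pointing from 'evaluation' toward 'deployment'."""
    mu_deploy = mean_vector(deploy_acts)
    mu_eval = mean_vector(eval_acts)
    return [d - e for d, e in zip(mu_deploy, mu_eval)]


def steer(hidden, vector, alpha=1.0):
    """Shift one hidden state along the steering direction."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy activations collected on deployment-like and evaluation-like prompts.
deploy_acts = [[1.0, 0.0], [3.0, 0.0]]
eval_acts = [[0.0, 1.0], [0.0, 3.0]]
vector = steering_vector(deploy_acts, eval_acts)
steered = steer([0.0, 0.0], vector, alpha=0.5)
```

In a real model the same addition would be applied to a chosen layer's hidden states at inference time, with the layer and `alpha` selected empirically.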
Details
Motivation: Large models may adjust their behavior to appear more aligned when they are being evaluated, which undermines the reliability of safety evaluations, so a method is needed to detect and suppress this evaluation awareness. Method: The model first undergoes continued pretraining so that it learns to associate an evaluation cue, and the use of Python type hints, with evaluation settings; it is then trained with expert iteration to use type hints during evaluation. A steering vector is constructed from the original model and applied to the activations at evaluation time to suppress evaluation awareness. Result: The trained model uses type hints more in evaluation contexts than in deployment contexts, showing evaluation awareness; with activation steering, the model behaves as if deployed even when the evaluation cue is present. Conclusion: Activation steering is an effective way to improve the reliability of AI safety evaluations by making models behave more like they do in real deployment. Abstract: Large language models (LLMs) can sometimes detect when they are being evaluated and adjust their behavior to appear more aligned, compromising the reliability of safety evaluations. In this paper, we show that adding a steering vector to an LLM's activations can suppress evaluation-awareness and make the model act like it is deployed during evaluation. To study our steering technique, we train an LLM to exhibit evaluation-aware behavior using a two-step training process designed to mimic how this behavior could emerge naturally. First, we perform continued pretraining on documents with factual descriptions of the model (1) using Python type hints during evaluation but not during deployment and (2) recognizing that the presence of a certain evaluation cue always means that it is being tested. Then, we train the model with expert iteration to use Python type hints in evaluation settings. The resulting model is evaluation-aware: it writes type hints in evaluation contexts more than deployment contexts. However, this gap can only be observed by removing the evaluation cue. We find that activation steering can suppress evaluation awareness and make the model act like it is deployed even when the cue is present. Importantly, we constructed our steering vector using the original model before our additional training. Our results suggest that AI evaluators could improve the reliability of safety evaluations by steering models to act like they are deployed.
[44] Robust Preference Alignment via Directional Neighborhood Consensus
Ruochen Mao,Yuling Shi,Xiaodong Gu,Jiaheng Wei
Main category: cs.CL
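TL;DR: Proposes Robust Preference Selection (RPS), a training-free, post-hoc method that uses directional neighborhood consensus to make large language model alignment more robust across diverse human preferences.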
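The selection idea summarized here has two steps: sample preference directions near the user's preference, then keep the candidate that best matches the original preference. A toy sketch in two dimensions follows. The rotation-based neighborhood, the dot-product scoring, and the precomputed attribute scores on each candidate are assumptions for illustration; in practice the candidates would come from a model conditioned on each neighboring preference and would be scored by a reward model.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sample_neighbors(pref, k, max_angle, rng):
    """Rotate a 2-D unit preference vector by k small random angles."""
    base = math.atan2(pref[1], pref[0])
    angles = [base + rng.uniform(-max_angle, max_angle) for _ in range(k)]
    return [[math.cos(a), math.sin(a)] for a in angles]


def select_response(pref, candidates):
    """Pick the candidate whose attribute scores best match the original preference."""
    def alignment(candidate):
        return sum(p * s for p, s in zip(pref, candidate["scores"]))
    return max(candidates, key=alignment)


# The user cares equally about two attributes, e.g. helpfulness and brevity.
pref = normalize([1.0, 1.0])
neighbors = sample_neighbors(pref, k=4, max_angle=0.3, rng=random.Random(0))
# Hypothetical candidates, one per neighbor, with precomputed attribute scores.
candidates = [
    {"text": "a", "scores": [0.2, 0.9]},
    {"text": "b", "scores": [0.8, 0.1]},
    {"text": "c", "scores": [0.7, 0.7]},
]
best = select_response(pref, candidates)
```

Scoring against the original preference rather than the neighbor that produced each candidate keeps the final choice tied to what the user actually asked for.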
Details
Motivation: Existing alignment methods behave unstably when a user's preference deviates subtly from the central tendency of the training data, and they depend on costly retraining that cannot cover the full spectrum of preferences, leaving a preference coverage gap. Method: RPS is a training-free, post-hoc method that samples multiple responses from a local neighborhood of the user's preference to build a candidate pool and then selects the response that best matches the original intent; a theoretical analysis shows that the neighborhood generation strategy is superior to a strong baseline. Result: Experiments across three alignment paradigms (DPA, DPO, and SFT) show that RPS clearly improves robustness over the baseline on challenging preferences, with win rates of up to 69% and no model retraining. Conclusion: RPS is a practical, theoretically grounded solution that improves the reliability and adaptability of preference-aligned models on diverse and under-represented preferences. Abstract: Aligning large language models with human preferences is critical for creating reliable and controllable AI systems. A human preference can be visualized as a high-dimensional vector where different directions represent trade-offs between desired attributes (e.g., helpfulness vs. verbosity). Yet, because the training data often reflects dominant, average preferences, LLMs tend to perform well on common requests but fall short in specific, individual needs. This mismatch creates a preference coverage gap. Existing methods often address this through costly retraining, which may not be generalized to the full spectrum of diverse preferences. This brittleness means that when a user's request reflects a nuanced preference deviating from the training data's central tendency, model performance can degrade unpredictably. To address this challenge, we introduce Robust Preference Selection (RPS), a post-hoc, training-free method by leveraging directional neighborhood consensus. Instead of forcing a model to generate a response from a single, highly specific preference, RPS samples multiple responses from a local neighborhood of related preferences to create a superior candidate pool. It then selects the response that best aligns with the user's original intent. We provide a theoretical framework showing our neighborhood generation strategy is provably superior to a strong baseline that also samples multiple candidates. Comprehensive experiments across three distinct alignment paradigms (DPA, DPO, and SFT) demonstrate that RPS consistently improves robustness against this baseline, achieving win rates of up to 69% on challenging preferences from under-represented regions of the space without any model retraining. Our work presents a practical, theoretically-grounded solution for enhancing the reliability of preference-aligned models.
[45] Hierarchical Sequence Iteration for Heterogeneous Question Answering
Ruiyi Yang,Hao Xue,Imran Razzak,Hakim Hacid,Flora D. Salim
Main category: cs.CL
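TL;DR: This paper proposes HSEQ (Hierarchical Sequence), a framework for multi-hop question answering over heterogeneous information sources that uses structured sequences and step-by-step iteration to produce answers that are efficient, accurate, and auditable.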
Details
Motivation: Existing RAG methods suffer from low accuracy, high latency, and heavy resource use when handling multi-step reasoning and heterogeneous data sources (text, tables, knowledge graphs), and they lack a unified, efficient mechanism for collecting evidence. Method: Documents, tables, and knowledge graphs are linearized into a reversible hierarchical sequence with lightweight structural tags, and a structure-aware iteration mechanism is applied: a Head Agent guides retrieval, an Iteration Agent selects and expands the HSeq through structure-respecting actions (such as parent/child hops, table adjacency, and KG relation expansion), and the Head Agent finally composes canonicalized evidence into the answer, with support for contradiction detection and a refinement loop. Result: On HotpotQA, HybridQA/TAT-QA, and MetaQA, HSEQ outperforms strong baselines on EM/F1 while being more efficient; it also shows format-agnostic unification, budget-aware iteration, and auditability gained from evidence canonicalization. Conclusion: HSEQ provides a unified, efficient, and scalable RAG framework that handles complex multi-hop question answering over heterogeneous information while balancing accuracy, efficiency, and interpretability. Abstract: Retrieval-augmented generation (RAG) remains brittle on multi-step questions and heterogeneous evidence sources, trading accuracy against latency and token/tool budgets. This paper introduces Hierarchical Sequence (HSEQ) Iteration for Heterogeneous Question Answering, a unified framework that (i) linearizes documents, tables, and knowledge graphs into a reversible hierarchical sequence with lightweight structural tags, and (ii) performs structure-aware iteration to collect just-enough evidence before answer synthesis. A Head Agent provides guidance that leads retrieval, while an Iteration Agent selects and expands HSeq via structure-respecting actions (e.g., parent/child hops, table row/column neighbors, KG relations). Finally, the Head Agent composes canonicalized evidence to generate the final answer, with an optional refinement loop to resolve detected contradictions. Experiments on HotpotQA (text), HybridQA/TAT-QA (table+text), and MetaQA (KG) show consistent EM/F1 gains over strong single-pass, multi-hop, and agentic RAG baselines with high efficiency. Besides, HSEQ exhibits three key advantages: (1) a format-agnostic unification that enables a single policy to operate across text, tables, and KGs without per-dataset specialization; (2) guided, budget-aware iteration that reduces unnecessary hops, tool calls, and tokens while preserving accuracy; and (3) evidence canonicalization for reliable QA, improving answer consistency and auditability.
[46] Assessing the Political Fairness of Multilingual LLMs: A Case Study based on a 21-way Multiparallel EuroParl Dataset
Paul Lerner,François Yvon
Main category: cs.CL
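TL;DR: This study proposes a framework for analyzing political bias based on the principle of fairness in multilingual translation. Using a 21-language parallel EuroParl corpus of European Parliament proceedings, it finds that speeches from mainstream left, center, and right parties are systematically translated better than those from outsider parties.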
Details
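Motivation: Conventional approaches assess the political bias of large language models by simulating their answers to English surveys, which lacks a cross-lingual fairness perspective. This paper aims to redefine how political bias is measured from the standpoint of fairness in multilingual translation. Method: The authors build a new 21-way multiparallel version of the EuroParl corpus with 1.5M sentences, covering 7 countries, 12 EU parties, and hundreds of national parties, and systematically compare the translation quality of European Parliament speeches across political affiliations. Result: Speeches from majority parties across the mainstream political spectrum (left, center, and right) are translated more accurately, while speeches from outsider parties show a systematic drop in translation quality. Conclusion: Large language models favor mainstream political groups in multilingual translation, revealing a potential political bias and suggesting that translation fairness can serve as a new indicator of a model's political leanings. Abstract: The political biases of Large Language Models (LLMs) are usually assessed by simulating their answers to English surveys. In this work, we propose an alternative framing of political biases, relying on principles of fairness in multilingual translation. We systematically compare the translation quality of speeches in the European Parliament (EP), observing systematic differences with majority parties from left, center, and right being better translated than outsider parties. This study is made possible by a new, 21-way multiparallel version of EuroParl, the parliamentary proceedings of the EP, which includes the political affiliations of each speaker. The dataset consists of 1.5M sentences for a total of 40M words and 249M characters. It covers three years, 1000+ speakers, 7 countries, 12 EU parties, 25 EU committees, and hundreds of national parties.
[47] ARC-Encoder: learning compressed text representations for large language models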
Hippolyte Pilchen,Edouard Grave,Patrick Pérez
Main category: cs.CL
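TL;DR: This paper proposes ARC-Encoder, a context compression method that compresses text into continuous representations to reduce the inference cost of decoder LLMs. It requires no fine-tuning or modification of the target model and is both efficient and general across multiple LLMs.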
Details
Motivation: Existing context compression techniques often require fine-tuning the target model or modifying its architecture, which can harm its general abilities; a method is needed that compresses context efficiently without changing the target model. Method: The authors design a trainable encoder (ARC-Encoder) that compresses the context into fewer continuous representations (typically 4 to 8 times fewer) that replace token embeddings in the decoder, and they systematically study training strategies and architecture choices. Result: ARC-Encoder achieves state-of-the-art performance on several benchmarks, improves inference efficiency, and can be adapted to multiple decoders at once, generalizing across LLMs. Conclusion: ARC-Encoder is a flexible and efficient context compression solution that delivers strong performance and good cross-model compatibility without modifying the target model. Abstract: Recent techniques such as retrieval-augmented generation or chain-of-thought reasoning have led to longer contexts and increased inference costs. Context compression techniques can reduce these costs, but the most effective approaches require fine-tuning the target model or even modifying its architecture. This can degrade its general abilities when not used for this specific purpose. Here we explore an alternative approach: an encoder that compresses the context into continuous representations which replace token embeddings in decoder LLMs. First, we perform a systematic study of training strategies and architecture choices for the encoder. Our findings led to the design of an Adaptable text Representations Compressor, named ARC-Encoder, which outputs $x$-times fewer continuous representations (typically $x\!\in\!\{4,8\}$) than text tokens. We evaluate ARC-Encoder across a variety of LLM usage scenarios, ranging from in-context learning to context window extension, on both instruct and base decoders. Results show that ARC-Encoder achieves state-of-the-art performance on several benchmarks while improving computational efficiency at inference. Finally, we demonstrate that our models can be adapted to multiple decoders simultaneously, allowing a single encoder to generalize across different decoder LLMs. This makes ARC-Encoder a flexible and efficient solution for portable encoders that work seamlessly with multiple LLMs. We release the training code at https://github.com/kyutai-labs/ARC-Encoder ; the fine-tuning dataset and pretrained models are available at https://huggingface.co/collections/kyutai/arc-encoders-68ee18787301407d60a57047 .
[48] The Dog the Cat Chased Stumped the Model: Measuring When Language Models Abandon Structure for Shortcuts
Sangmitra Madhusudan,Kaige Chen,Ali Emami
Main category: cs.CL
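TL;DR: This paper introduces the CenterBench dataset for distinguishing whether language models process center-embedded sentences through an understanding of syntactic structure or through semantic pattern matching. Experiments show that as sentence complexity increases, the performance gap between semantically plausible and implausible sentences widens, revealing a tendency to abandon structural analysis in favor of semantic associations.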
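Center embedding is easiest to see by generating it. The sketch below nests relative clauses recursively in the pattern of the entry's own example sentence, and builds a syntactically identical but implausible counterpart by swapping the nouns. The function name and the word lists are hypothetical and do not reflect how the dataset itself was constructed.

```python
def center_embed(nouns, verbs, main_verb):
    """Build a center-embedded sentence.

    nouns[0] is the main subject; verbs[i] is what nouns[i + 1] did to nouns[i].
    Nouns open the clauses from the outside in, so verbs close them in reverse.
    """
    if len(verbs) != len(nouns) - 1:
        raise ValueError("need exactly one embedded verb per extra noun")
    subject_chain = " that ".join(f"the {noun}" for noun in nouns)
    parts = [subject_chain] + list(reversed(verbs)) + [main_verb]
    sentence = " ".join(parts)
    return sentence[0].upper() + sentence[1:] + "."


depth_one = center_embed(["cat", "dog"], ["chased"], "meowed")
depth_two = center_embed(["cat", "dog", "boy"], ["chased", "owned"], "meowed")
# Same syntax, swapped roles: a less plausible counterpart.
implausible = center_embed(["dog", "cat"], ["chased"], "meowed")
```

Each extra noun adds one level of nesting, which pushes the main subject and its verb further apart and is what makes deeper sentences harder to parse.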
Details
Motivation: Current language model evaluations lack an effective way to tell whether a model truly parses syntactic structure or merely relies on semantic common sense to make predictions, so a test framework is needed that can identify structural language understanding. Method: The authors build CenterBench, a dataset of 9,720 comprehension questions on center-embedded sentences in which each sentence has a syntactically identical but semantically implausible counterpart, with six question types testing surface understanding, syntactic dependencies, and causal reasoning. Six models are evaluated, and their performance differences across embedding depths and their reasoning traces are analyzed. Result: As nesting complexity increases, the accuracy gap between plausible and implausible sentences widens systematically, with median gaps of up to 26.8 percentage points; semantic plausibility actually harms causal reasoning about resulting actions; reasoning models improve accuracy, but their traces show semantic shortcuts, overthinking, and answer refusal. Conclusion: CenterBench provides the first framework for identifying when language models shift from structural analysis to semantic matching, showing that current models rely on semantic heuristics rather than true syntactic parsing on complex sentences, whereas humans show variable semantic effects. Abstract: When language models correctly parse "The cat that the dog chased meowed," are they analyzing syntax or simply familiar with dogs chasing cats? Despite extensive benchmarking, we lack methods to distinguish structural understanding from semantic pattern matching. We introduce CenterBench, a dataset of 9,720 comprehension questions on center-embedded sentences (like "The cat [that the dog chased] meowed") where relative clauses nest recursively, creating processing demands from simple to deeply nested structures. Each sentence has a syntactically identical but semantically implausible counterpart (e.g., mailmen prescribe medicine, doctors deliver mail) and six comprehension questions testing surface understanding, syntactic dependencies, and causal reasoning. Testing six models reveals that performance gaps between plausible and implausible sentences widen systematically with complexity, with models showing median gaps up to 26.8 percentage points, quantifying when they abandon structural analysis for semantic associations. Notably, semantic plausibility harms performance on questions about resulting actions, where following causal relationships matters more than semantic coherence. Reasoning models improve accuracy but their traces show semantic shortcuts, overthinking, and answer refusal. Unlike models whose plausibility advantage systematically widens with complexity, humans show variable semantic effects. CenterBench provides the first framework to identify when models shift from structural analysis to pattern matching.
[49] GlobalRAG: Enhancing Global Reasoning in Multi-hop Question Answering via Reinforcement Learning
Jinchang Luo,Mingquan Cheng,Fan Wan,Ni Li,Xiaoling Xia,Shuangshuang Tian,Tingcheng Bian,Haiwei Wang,Haohuan Fu,Yan Tao
Main category: cs.CL
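TL;DR: GlobalRAG is a reinforcement-learning-based retrieval-augmented generation framework that introduces global planning and subgoal decomposition to improve reasoning in multi-hop question answering, significantly outperforming strong baselines with little training data.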
Details
Motivation: Existing reinforcement learning approaches to multi-hop question answering lack global planning and faithful execution, which leads to inaccurate query formulation and inconsistent use of evidence and limits the effectiveness of retrieval-augmented generation. Method: The proposed GlobalRAG framework decomposes questions into subgoals, coordinates retrieval with reasoning, and refines evidence iteratively; a Planning Quality Reward and a SubGoal Completion Reward guide learning, and a progressive weight annealing strategy balances process-oriented and outcome-based objectives. Result: GlobalRAG significantly outperforms strong baselines on both in-domain and out-of-domain benchmarks while using only 8k training examples (42% of the data used by the baselines), with average improvements of 14.2% in both EM and F1. Conclusion: Through structured global reasoning and reliable execution, GlobalRAG improves multi-hop question answering with limited training data and demonstrates the potential of reinforcement learning for complex reasoning tasks. Abstract: Reinforcement learning has recently shown promise in improving retrieval-augmented generation (RAG). Despite these advances, its effectiveness in multi-hop question answering (QA) remains limited by two fundamental limitations: (i) global planning absence to structure multi-step reasoning, and (ii) unfaithful execution, which hinders effective query formulation and consistent use of retrieved evidence. We propose GlobalRAG, a reinforcement learning framework designed to enhance global reasoning in multi-hop QA. GlobalRAG decomposes questions into subgoals, coordinates retrieval with reasoning, and refines evidence iteratively. To guide this process, we introduce Planning Quality Reward and SubGoal Completion Reward, which encourage coherent planning and reliable subgoal execution. In addition, a progressive weight annealing strategy balances process-oriented and outcome-based objectives. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that GlobalRAG significantly outperforms strong baselines while using only 8k training data (42% of the training data used by strong baselines), achieving average improvements of 14.2% in both EM and F1.
[50] Beyond Retrieval-Ranking: A Multi-Agent Cognitive Decision Framework for E-Commerce Search
Zhouwei Zhai,Mengxiang Chen,Haoyun Xia,Jin Li,Renquan Zhou,Min Yang
Main category: cs.CL
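TL;DR: Proposes a Multi-Agent Cognitive Decision Framework (MACDF) that shifts e-commerce search from passive retrieval to proactive decision support, significantly improving recommendation accuracy and user satisfaction on complex queries.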
Details
Motivation: The traditional retrieval-ranking paradigm relies on query-item matching and does not fit users' multi-stage cognitive decision processes, leading to semantic gaps, high decision costs, and a lack of professional shopping guidance. Method: The authors design and implement the Multi-Agent Cognitive Decision Framework (MACDF), in which multiple agents cooperate to simulate the user's cognitive decision process and provide proactive shopping decision support. Result: Offline experiments show that MACDF significantly outperforms existing methods in recommendation accuracy and user satisfaction, especially on complex queries involving negation, multiple constraints, or reasoning; online A/B testing confirms its practical effectiveness on the JD search platform. Conclusion: Multi-agent cognitive systems have the potential to transform the e-commerce search paradigm and deliver a more intelligent, more user-friendly shopping experience. Abstract: The retrieval-ranking paradigm has long dominated e-commerce search, but its reliance on query-item matching fundamentally misaligns with multi-stage cognitive decision processes of platform users. This misalignment introduces critical limitations: semantic gaps in complex queries, high decision costs due to cross-platform information foraging, and the absence of professional shopping guidance. To address these issues, we propose a Multi-Agent Cognitive Decision Framework (MACDF), which shifts the paradigm from passive retrieval to proactive decision support. Extensive offline evaluations demonstrate MACDF's significant improvements in recommendation accuracy and user satisfaction, particularly for complex queries involving negation, multi-constraint, or reasoning demands. Online A/B testing on the JD search platform confirms its practical efficacy. This work highlights the transformative potential of multi-agent cognitive systems in redefining e-commerce search.
[51] Can ChatGPT Code Communication Data Fairly?: Empirical Evidence from Multiple Collaborative Tasks
Jiangang Hao,Wenju Cui,Patrick Kyllonen,Emily Kerzabi
Main category: cs.CL
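TL;DR: This study examines ChatGPT-based automated coding for collaborative problem solving and finds no significant bias across gender and racial groups, supporting its use in large-scale assessment of communication and collaboration.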
Details
Motivation: Assessing communication and collaboration at scale depends on manual coding, which is slow and labor intensive. Prior work shows that ChatGPT can reach coding accuracy comparable to human raters, but whether it is biased against particular gender or racial groups is unclear. Method: ChatGPT is used to code communication data automatically with a typical coding framework across three types of collaborative tasks (negotiation, problem solving, and decision making), and differences across gender and racial groups are analyzed. Result: ChatGPT's coding shows no significant bias across gender and racial groups. Conclusion: ChatGPT-based automated coding of collaborative communication is fair and reliable and has potential for use in large-scale assessment. Abstract: Assessing communication and collaboration at scale depends on a labor intensive task of coding communication data into categories according to different frameworks. Prior research has established that ChatGPT can be directly instructed with coding rubrics to code the communication data and achieves accuracy comparable to human raters. However, whether the coding from ChatGPT or similar AI technology exhibits bias against different demographic groups, such as gender and race, remains unclear. To fill this gap, this paper investigates ChatGPT-based automated coding of communication data using a typical coding framework for collaborative problem solving, examining differences across gender and racial groups. The analysis draws on data from three types of collaborative tasks: negotiation, problem solving, and decision making. Our results show that ChatGPT-based coding exhibits no significant bias across gender and racial groups, paving the road for its adoption in large-scale assessment of collaboration and communication.
[52] BUSTED at AraGenEval Shared Task: A Comparative Study of Transformer-Based Models for Arabic AI-Generated Text Detection
Ali Zain,Sareem Farooqui,Muhammad Rafi
Main category: cs.CL
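TL;DR: This paper presents the BUSTED team's submission to the AraGenEval shared task on detecting Arabic AI-generated text. The team fine-tuned three pre-trained models (AraELECTRA, CAMeLBERT, and XLM-RoBERTa) for binary classification and found that the multilingual XLM-RoBERTa performed best, with an F1 score of 0.7701, outperforming the specialized Arabic models and highlighting the strong generalization of multilingual models in text detection.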
Details
Motivation: To explore how effective different pre-trained transformer models are at detecting Arabic AI-generated text, and in particular to compare specialized Arabic models with multilingual models. Method: Three pre-trained models (AraELECTRA, CAMeLBERT, and XLM-RoBERTa) are fine-tuned on the provided dataset for a binary classification task that separates human-written from AI-generated Arabic text. Result: XLM-RoBERTa achieves the highest F1 score (0.7701), outperforming the specialized Arabic models AraELECTRA and CAMeLBERT. Conclusion: Even though models optimized specifically for Arabic exist, multilingual models show stronger generalization and performance on AI-generated text detection, indicating their potential in low-resource or cross-lingual settings. Abstract: This paper details our submission to the AraGenEval Shared Task on Arabic AI-generated text detection, where our team, BUSTED, secured 5th place. We investigated the effectiveness of three pre-trained transformer models: AraELECTRA, CAMeLBERT, and XLM-RoBERTa. Our approach involved fine-tuning each model on the provided dataset for a binary classification task. Our findings revealed a surprising result: the multilingual XLM-RoBERTa model achieved the highest performance with an F1 score of 0.7701, outperforming the specialized Arabic models. This work underscores the complexities of AI-generated text detection and highlights the strong generalization capabilities of multilingual models.
[53] Why Did Apple Fall To The Ground: Evaluating Curiosity In Large Language Model
Haoyu Wang,Sihang Jiang,Yuyan Chen,Yitong Wang,Yanghua Xiao
Main category: cs.CL
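TL;DR: Starting from the human curiosity questionnaire 5DCR, this paper designs a comprehensive framework for evaluating curiosity in large language models (LLMs). It finds that LLMs show a stronger thirst for knowledge than humans but still tend toward conservative choices in uncertain environments, and confirms that curious behavior can improve a model's reasoning and active learning abilities.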
Details
Motivation: To examine whether large language models are capable of curiosity-driven learning similar to humans, and to build a curiosity evaluation framework for LLMs that draws on human curiosity assessment. Method: Based on the Five-Dimensional Curiosity scale Revised (5DCR), the authors design an evaluation framework covering dimensions such as Information Seeking, Thrill Seeking, and Social Curiosity, and run experiments measuring LLM curiosity in different situations and its relationship to the models' thinking. Result: LLMs show a stronger thirst for knowledge than humans but are more conservative in uncertain environments; curious behavior enhances the models' reasoning and active learning abilities. Conclusion: Large language models exhibit curiosity traits similar to those of humans, and curiosity helps improve their learning and reasoning, providing experimental support for the future development of learning capabilities and innovative research in LLMs. Abstract: Curiosity serves as a pivotal conduit for human beings to discover and learn new knowledge. Recent advancements of large language models (LLMs) in natural language processing have sparked discussions regarding whether these models possess capability of curiosity-driven learning akin to humans. In this paper, starting from the human curiosity assessment questionnaire Five-Dimensional Curiosity scale Revised (5DCR), we design a comprehensive evaluation framework that covers dimensions such as Information Seeking, Thrill Seeking, and Social Curiosity to assess the extent of curiosity exhibited by LLMs. The results demonstrate that LLMs exhibit a stronger thirst for knowledge than humans but still tend to make conservative choices when faced with uncertain environments. We further investigated the relationship between curiosity and thinking of LLMs, confirming that curious behaviors can enhance the model's reasoning and active learning abilities. These findings suggest that LLMs have the potential to exhibit curiosity similar to that of humans, providing experimental support for the future development of learning capabilities and innovative research in LLMs.
[54] The Reasoning Lingua Franca: A Double-Edged Sword for Multilingual AI
Alan Saji,Raj Dabre,Anoop Kunchukuttan,Ratish Puduppully
Main category: cs.CL
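TL;DR: This study examines how large reasoning models (LRMs) reason in multilingual settings. Reasoning in English usually yields higher accuracy, but it suffers from a "Lost in Translation" failure mode in which translation steps introduce errors.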
Details
Motivation: To explore how LRMs reason on non-English questions, focusing on whether they fall back on English for reasoning and what that means for handling linguistic and cultural nuances. Method: The authors systematically compare an LRM's reasoning in English with its reasoning in the language of the question, evaluate on MGSM and GPQA Diamond, and analyze cognitive attributes in the reasoning traces. Result: English reasoning traces show more of these cognitive behaviors and yield higher answer accuracy, with the gap widening on more complex tasks; however, translation steps can introduce errors, causing the "Lost in Translation" problem. Conclusion: English reasoning performs better on multilingual tasks, but reasoning directly in the question's language avoids translation errors; future work should balance the two strategies to improve both the accuracy and the interpretability of multilingual reasoning. Abstract: Large Reasoning Models (LRMs) achieve strong performance on mathematical, scientific, and other question-answering tasks, but their multilingual reasoning abilities remain underexplored. When presented with non-English questions, LRMs often default to reasoning in English, raising concerns about interpretability and the handling of linguistic and cultural nuances. We systematically compare an LRM's reasoning in English versus the language of the question. Our evaluation spans two tasks: MGSM and GPQA Diamond. Beyond measuring answer accuracy, we also analyze cognitive attributes in the reasoning traces. We find that English reasoning traces exhibit a substantially higher presence of these cognitive behaviors, and that reasoning in English generally yields higher final-answer accuracy, with the performance gap increasing as tasks become more complex. However, this English-centric strategy is susceptible to a key failure mode - getting "Lost in Translation," where translation steps lead to errors that would have been avoided by reasoning in the question's language.
[55] CantoNLU: A benchmark for Cantonese natural language understanding
Junghyun Min,York Hay Ng,Sophia Chan,Helena Shunhua Zhao,En-Shiun Annie Lee
Main category: cs.CL
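TL;DR: This paper introduces CantoNLU, a benchmark for Cantonese natural language understanding that spans seven syntactic and semantic tasks, and evaluates several models on it. Cantonese-adapted models perform best overall, while the monolingual Cantonese model does better on syntactic tasks. A Mandarin model with no Cantonese training remains competitive in some settings. All data, code, and model weights are publicly released.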
Details
Motivation: Cantonese has many speakers but remains under-resourced because of policy and diglossia, and it lacks a systematic evaluation framework, so a dedicated Cantonese NLU benchmark is needed. Method: The authors build CantoNLU, a seven-task Cantonese NLU benchmark, and evaluate four models: a Mandarin model without Cantonese training, two models adapted to Cantonese through continual pre-training, and a monolingual Cantonese model trained from scratch. Result: Cantonese-adapted models perform best overall, the monolingual Cantonese model performs better on syntactic tasks, and the Mandarin model remains competitive on some tasks, which indicates that direct transfer may be effective when Cantonese data is scarce. Conclusion: CantoNLU gives Cantonese NLP research an important benchmark; the model comparison shows which settings suit each training strategy and supports further development of Cantonese language technology. Abstract: Cantonese, although spoken by millions, remains under-resourced due to policy and diglossia. To address this scarcity of evaluation frameworks for Cantonese, we introduce CantoNLU, a benchmark for Cantonese natural language understanding (NLU). This novel benchmark spans seven tasks covering syntax and semantics, including word sense disambiguation, linguistic acceptability judgment, language detection, natural language inference, sentiment analysis, part-of-speech tagging, and dependency parsing. In addition to the benchmark, we provide model baseline performance across a set of models: a Mandarin model without Cantonese training, two Cantonese-adapted models obtained by continual pre-training a Mandarin model on Cantonese text, and a monolingual Cantonese model trained from scratch. Results show that Cantonese-adapted models perform best overall, while monolingual models perform better on syntactic tasks. Mandarin models remain competitive in certain settings, indicating that direct transfer may be sufficient when Cantonese domain data is scarce. We release all datasets, code, and model weights to facilitate future research in Cantonese NLP.
[56] Neural Diversity Regularizes Hallucinations in Small Models
Kushal Chakrabarti,Nirmal Balachundhar
Main category: cs.CL
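TL;DR: This paper proposes "neural diversity" as a new mechanism for reducing hallucinations in language models. Theory and experiments show that increasing the decorrelation of representations significantly lowers hallucination rates, making neural diversity a third axis for improving model reliability at fixed parameter and data budgets.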
Details
Motivation: Language models still hallucinate despite growing scale, and existing methods do little to mitigate this, so a more fundamental solution is sought in the models' internal representations. Method: Inspired by portfolio theory, the authors introduce the concept of neural diversity and derive a theoretical bound relating hallucination probability to representational correlation; they introduce ND-LoRA, which combines parallel LoRA adapters with Barlow Twins regularization to obtain decorrelated representations. Result: ND-LoRA reduces hallucination rates by up to 25.6% (14.6% on average) without degrading general accuracy; ablations show that the components act synergistically, causal interventions confirm neural diversity as the mediating factor, correlational analyses show that a small increase in correlation is associated with a marked increase in hallucinations, and different tasks have different optimal levels of neural diversity. Conclusion: Neural diversity is a third key axis, after parameters and data, for improving language model reliability, and it offers a new direction for optimizing models under fixed resources. Abstract: Language models continue to hallucinate despite increases in parameters, compute, and data. We propose neural diversity -- decorrelated parallel representations -- as a principled mechanism that reduces hallucination rates at fixed parameter and data budgets. Inspired by portfolio theory, where uncorrelated assets reduce risk by $\sqrt{P}$, we prove hallucination probability is bounded by representational correlation: $P(H) \leq f(\sigma^2((1-\rho(P))/P + \rho(P)), \mu^2)$, which predicts that language models need an optimal amount of neurodiversity. To validate this, we introduce ND-LoRA (Neural Diversity Low-Rank Adaptation), combining parallel LoRA adapters with Barlow Twins regularization, and demonstrate that ND-LoRA reduces hallucinations by up to 25.6% (and 14.6% on average) without degrading general accuracy. Ablations show LoRA adapters and regularization act synergistically, causal interventions prove neurodiversity as the mediating factor, and correlational analyses indicate scale: a 0.1% neural correlation increase is associated with a 3.8% hallucination increase. Finally, task-dependent optimality emerges: different tasks require different amounts of optimal neurodiversity. Together, our results highlight neural diversity as a third axis of scaling -- orthogonal to parameters and data -- to improve the reliability of language models at fixed budgets.
[57] Structure-Conditional Minimum Bayes Risk Decoding
Bryan Eikema,Anna Rutkiewicz,Mario Giulianelli
Main category: cs.CL
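TL;DR: This paper proposes three lightweight adaptations to the utility function that make Minimum Bayes Risk (MBR) decoding more sensitive to latent structure in generated outputs for open-ended tasks, and shows clear gains in generation quality on dialogue and instruction-following tasks.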
Details
Motivation: In open-ended tasks, standard similarity-based utility functions can lead MBR to select responses that are representative yet structurally sub-optimal, so MBR needs to be more sensitive to latent structural differences in the outcome space. Method: The authors propose three lightweight adaptations to the utility function, curate a dataset covering three types of latent structure (dialogue act, emotion, and response structure), and design two metrics that measure the structural optimality of MBR. Result: Standard similarity-based utility functions fall short on the proposed metrics, whereas the new adaptations considerably improve structural sensitivity, raising win rate by up to 13.7 percentage points on AlpacaEval and MT-Bench. Conclusion: The adapted MBR utility functions capture latent structure in generations more effectively and thereby improve output quality on open-ended generation tasks. Abstract: Minimum Bayes Risk (MBR) decoding has seen renewed interest as an alternative to traditional generation strategies. While MBR has proven effective in machine translation, where the variability of a language model's outcome space is naturally constrained, it may face challenges in more open-ended tasks such as dialogue or instruction-following. We hypothesise that in such settings, applying MBR with standard similarity-based utility functions may result in selecting responses that are broadly representative of the model's distribution, yet sub-optimal with respect to any particular grouping of generations that share an underlying latent structure. In this work, we introduce three lightweight adaptations to the utility function, designed to make MBR more sensitive to structural variability in the outcome space. To test our hypothesis, we curate a dataset capturing three representative types of latent structure: dialogue act, emotion, and response structure (e.g., a sentence, a paragraph, or a list). We further propose two metrics to evaluate the structural optimality of MBR. Our analysis demonstrates that common similarity-based utility functions fall short by these metrics. In contrast, our proposed adaptations considerably improve structural optimality. Finally, we evaluate our approaches on real-world instruction-following benchmarks, AlpacaEval and MT-Bench, and show that increased structural sensitivity improves generation quality by up to 13.7 percentage points in win rate.
[58] User Perceptions of Privacy and Helpfulness in LLM Responses to Privacy-Sensitive Scenarios
Xiaoyuan Wu,Roshni Kaushik,Wenkai Li,Lujo Bauer,Koichi Onoue
Main category: cs.CL
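TL;DR: In a user study with 94 participants, this work finds that users differ widely in how they judge the privacy-preservation quality and helpfulness of large language model (LLM) responses to privacy-sensitive scenarios, and that their judgments correlate poorly with those of proxy LLMs. Relying on proxy LLMs for privacy evaluation therefore fails to capture real user experience, and more user-centered evaluation is needed.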
Details
Motivation: Existing work relies on proxy LLMs to evaluate LLM behavior on privacy-sensitive tasks and overlooks the perceptions of real users; it also lacks a fine-grained analysis of response helpfulness. A user-side study is therefore needed to understand how people actually perceive the privacy preservation and helpfulness of LLM responses. Method: Using 90 privacy-sensitive scenarios from the PrivacyLens dataset, the authors ran a user study with 94 participants, collected their ratings of the privacy-preservation quality and helpfulness of LLM responses, and compared these with the evaluations of five proxy LLMs. Result: Users showed low agreement with each other on the privacy-preservation quality and helpfulness of the same response; the proxy LLMs agreed strongly with one another, yet each proxy LLM correlated poorly with the users' overall evaluations. Conclusion: Perceptions of privacy preservation and helpfulness in LLM responses are highly individual, and proxy LLMs cannot accurately reflect real users' judgments; future work should conduct more user-centered evaluations and explore ways to better align proxy models with user perceptions. Abstract: Large language models (LLMs) have seen rapid adoption for tasks such as drafting emails, summarizing meetings, and answering health questions. In such uses, users may need to share private information (e.g., health records, contact details). To evaluate LLMs' ability to identify and redact such private information, prior work developed benchmarks (e.g., ConfAIde, PrivacyLens) with real-life scenarios. Using these benchmarks, researchers have found that LLMs sometimes fail to keep secrets private when responding to complex tasks (e.g., leaking employee salaries in meeting summaries). However, these evaluations rely on LLMs (proxy LLMs) to gauge compliance with privacy norms, overlooking real users' perceptions. Moreover, prior work primarily focused on the privacy-preservation quality of responses, without investigating nuanced differences in helpfulness. To understand how users perceive the privacy-preservation quality and helpfulness of LLM responses to privacy-sensitive scenarios, we conducted a user study with 94 participants using 90 scenarios from PrivacyLens. We found that, when evaluating identical responses to the same scenario, users showed low agreement with each other on the privacy-preservation quality and helpfulness of the LLM response. Further, we found high agreement among five proxy LLMs, while each individual LLM had low correlation with users' evaluations. These results indicate that the privacy and helpfulness of LLM responses are often specific to individuals, and proxy LLMs are poor estimates of how real users would perceive these responses in privacy-sensitive scenarios. Our results suggest the need to conduct user-centered studies on measuring LLMs' ability to help users while preserving privacy. Additionally, future research could investigate ways to improve the alignment between proxy LLMs and users for better estimation of users' perceived privacy and utility.
[59] Automated Extraction of Fluoropyrimidine Treatment and Treatment-Related Toxicities from Clinical Notes Using Natural Language Processing
Xizhi Wu,Madeline S. Kreider,Philip E. Empey,Chenyu Li,Yanshan Wang
Main category: cs.CL
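TL;DR: This study compares several natural language processing (NLP) methods for extracting fluoropyrimidine treatment and toxicity information from clinical notes. Large language model (LLM)-based methods, especially error-analysis prompting, perform best with an F1 score of 1.000, substantially outperforming traditional machine learning and deep learning models.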
Details
Motivation: Toxicity information for fluoropyrimidine drugs is often embedded in unstructured clinical notes and is hard to obtain systematically, so efficient NLP methods are needed to extract it automatically in support of cancer treatment monitoring and drug safety research. Method: The authors built a gold-standard dataset of 236 clinical notes, with treatment regimens and toxicity events annotated by domain experts. They compared rule-based, machine learning (Random Forest, SVM, Logistic Regression), deep learning (BERT, ClinicalBERT), and large language model (zero-shot and error-analysis prompting) approaches, evaluated with an 80:20 train-test split. Result: The LLM with error-analysis prompting reached F1=1.000 for both treatment and toxicity extraction, and zero-shot prompting also performed well (treatment F1=1.000, toxicity F1=0.876). Logistic Regression and SVM ranked second (toxicity F1=0.937), while BERT and ClinicalBERT performed worse (toxicity F1 of 0.839 and 0.886, respectively); the rule-based method served as the baseline (F1 of about 0.857 to 0.858). Conclusion: LLM-based NLP methods are the most effective at extracting fluoropyrimidine-related clinical information and have strong potential to advance oncology research and pharmacovigilance applications. Abstract: Objective: Fluoropyrimidines are widely prescribed for colorectal and breast cancers, but are associated with toxicities such as hand-foot syndrome and cardiotoxicity. Since toxicity documentation is often embedded in clinical notes, we aimed to develop and evaluate natural language processing (NLP) methods to extract treatment and toxicity information. Materials and Methods: We constructed a gold-standard dataset of 236 clinical notes from 204,165 adult oncology patients. Domain experts annotated categories related to treatment regimens and toxicities. We developed rule-based, machine learning-based (Random Forest, Support Vector Machine [SVM], Logistic Regression [LR]), deep learning-based (BERT, ClinicalBERT), and large language model (LLM)-based NLP approaches (zero-shot and error-analysis prompting). Models used an 80:20 train-test split. Results: Sufficient data existed to train and evaluate 5 annotated categories. Error-analysis prompting achieved optimal precision, recall, and F1 scores (F1=1.000) for treatment and toxicities extraction, whereas zero-shot prompting reached F1=1.000 for treatment and F1=0.876 for toxicities extraction. LR and SVM ranked second for toxicities (F1=0.937). Deep learning underperformed, with BERT (F1=0.873 treatment; F1=0.839 toxicities) and ClinicalBERT (F1=0.873 treatment; F1=0.886 toxicities). Rule-based methods served as our baseline with F1 scores of 0.857 in treatment and 0.858 in toxicities. Discussion: LLM-based approaches outperformed all others, followed by machine learning methods. Machine and deep learning approaches were limited by small training data and showed limited generalizability, particularly for rare categories. Conclusion: LLM-based NLP most effectively extracted fluoropyrimidine treatment and toxicity information from clinical notes, and has strong potential to support oncology research and pharmacovigilance.
[60] Are Large Reasoning Models Good Translation Evaluators? Analysis and Performance Boost
Runzhe Zhan,Zhihong Huang,Xinyi Yang,Lidia S. Chao,Min Yang,Derek F. Wong
Main category: cs.CL
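TL;DR: This paper presents the first systematic analysis of large reasoning models (LRMs) as evaluators of machine translation quality, and proposes calibrating their "thinking" by training on synthetic, human-like thinking trajectories, which greatly reduces computational cost while improving evaluation performance.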
Details
Motivation: Large reasoning models show strong reasoning ability on complex tasks, but their use in machine translation evaluation has not been fully explored, and they suffer from problems such as overthinking and biased scoring mechanisms. Method: The authors construct synthetic, human-like thinking trajectories, train LRMs of different sizes on them to calibrate their thinking process, and evaluate on the WMT24 Metrics benchmarks. Result: The approach cuts the LRMs' thinking budget by roughly 35x and improves evaluation correlation across models from 7B to 32B; for example, R1-Distill-Qwen-7B gains +8.7 correlation points. Conclusion: Efficiently calibrated LRMs have strong potential for fine-grained automatic MT evaluation, improving evaluation accuracy while lowering cost. Abstract: Recent advancements in large reasoning models (LRMs) have introduced an intermediate "thinking" process prior to generating final answers, improving their reasoning capabilities on complex downstream tasks. However, the potential of LRMs as evaluators for machine translation (MT) quality remains underexplored. We provide the first systematic analysis of LRM-as-a-judge in MT evaluation. We identify key challenges, revealing that LRMs require tailored evaluation materials, tend to "overthink" simpler instances, and have issues with scoring mechanisms leading to overestimation. To address these, we propose to calibrate LRM thinking by training them on synthetic, human-like thinking trajectories. Our experiments on WMT24 Metrics benchmarks demonstrate that this approach largely reduces thinking budgets by ~35x while concurrently improving evaluation performance across different LRM scales from 7B to 32B (e.g., R1-Distill-Qwen-7B achieves a +8.7 correlation point improvement). These findings highlight the potential of efficiently calibrated LRMs to advance fine-grained automatic MT evaluation.
[61] A Use-Case Specific Dataset for Measuring Dimensions of Responsible Performance in LLM-generated Text
Alicia Sagae,Chia-Jung Lee,Sandeep Avula,Brandon Dang,Vanessa Murdock
Main category: cs.CL
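TL;DR: This work proposes an evaluation approach for large language models that targets a specific use case and focuses on Responsible AI dimensions such as fairness. It builds a dataset around a product-description generation task and uses it to identify gaps in LLM quality, veracity, safety, and fairness.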
Details
Motivation: Existing LLM evaluation methods mostly target high-level tasks and fall short for Responsible AI dimensions (such as fairness) in concrete applications, because the relevance of protected attributes differs from one application to another. Method: The authors construct a parameterized dataset driven by a real-world application (generating product descriptions), combining fairness attributes with gendered adjectives and product categories to produce a labeled prompt set for evaluating LLMs along several dimensions. Result: The paper shows how the dataset can be used to identify problems in LLM quality, veracity, safety, and fairness, and it provides a reusable evaluation proposal together with a public resource. Conclusion: The work supplies a targeted dataset and methodology for Responsible AI evaluation of LLMs and advances fine-grained, application-specific model evaluation. Abstract: Current methods for evaluating large language models (LLMs) typically focus on high-level tasks such as text generation, without targeting a particular AI application. This approach is not sufficient for evaluating LLMs for Responsible AI dimensions like fairness, since protected attributes that are highly relevant in one application may be less relevant in another. In this work, we construct a dataset that is driven by a real-world application (generate a plain-text product description, given a list of product features), parameterized by fairness attributes intersected with gendered adjectives and product categories, yielding a rich set of labeled prompts. We show how to use the data to identify quality, veracity, safety, and fairness gaps in LLMs, contributing a proposal for LLM evaluation paired with a concrete resource for the research community.
[62] Alleviating Forgetfulness of Linear Attention by Hybrid Sparse Attention and Contextualized Learnable Token Eviction
Mutian He,Philip N. Garner
Main category: cs.CL
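TL;DR: This paper proposes hybrid models that combine sparse attention with a learnable token eviction mechanism to alleviate the forgetfulness of linear-attention models on retrieval-intensive tasks while retaining their computational efficiency.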
Details
Motivation: Linear-attention models have limited memory because of their fixed-size state, so they tend to forget early information, which hurts performance on retrieval-intensive tasks. Method: The authors introduce a series of hybrid models that interleave sparse attention mechanisms whose complexity lies between linear and full attention, propose a learnable token eviction approach, and combine sliding-window attention with a lightweight CNN to adaptively retain critical KV-pairs. Result: Evaluations on retrieval-intensive benchmarks support the effectiveness of the proposed methods, which improve performance while preserving the time and space complexity of linear attention. Conclusion: The proposed hybrid attention mechanisms effectively alleviate the forgetfulness of linear attention and offer a practical route to efficient yet capable sequence modeling. Abstract: Linear-attention models that compress the entire input sequence into a fixed-size recurrent state offer an efficient alternative to Transformers, but their finite memory induces forgetfulness that harms retrieval-intensive tasks. To mitigate the issue, we explore a series of hybrid models that restore direct access to past tokens. We interleave token mixers with intermediate time and space complexity between linear and full attention, including sparse attention with token eviction, and the query-aware native sparse attention. Particularly, we propose a novel learnable token eviction approach. Combined with sliding-window attention, an end-to-end trainable lightweight CNN aggregates information from both past and future adjacent tokens to adaptively retain a limited set of critical KV-pairs per head, maintaining linear attention's constant time and space complexity. Efficient Triton kernels for the sparse attention mechanisms are provided. Empirical evaluations on retrieval-intensive benchmarks support the effectiveness of our approaches.
[63] Simple Context Compression: Mean-Pooling and Multi-Ratio Training
Yair Feldman,Yoav Artzi
Main category: cs.CL
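TL;DR: This paper proposes a lightweight, simple mean-pooling method for soft context compression that outperforms the existing compression-tokens architecture across a range of QA datasets and models.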
Details
Motivation: To reduce the computational cost of using long contexts in retrieval-augmented generation. Method: The input sequence is compressed with mean pooling, and the same compressor is trained to output multiple compression ratios. Result: Mean pooling achieves the strongest performance, with only a small drop under multi-ratio training; across architectures and training regimes the trade-offs are more nuanced. Conclusion: Mean pooling is an efficient and simple context compression method that works in many settings, although the overall trade-offs among compression methods remain complex. Abstract: A common strategy to reduce the computational costs of using long contexts in retrieval-augmented generation (RAG) with large language models (LLMs) is soft context compression, where the input sequence is transformed into a shorter continuous representation. We develop a lightweight and simple mean-pooling approach that consistently outperforms the widely used compression-tokens architecture, and study training the same compressor to output multiple compression ratios. We conduct extensive experiments across in-domain and out-of-domain QA datasets, as well as across model families, scales, and compression ratios. Overall, our simple mean-pooling approach achieves the strongest performance, with a relatively small drop when training for multiple compression ratios. More broadly though, across architectures and training regimes the trade-offs are more nuanced, illustrating the complex landscape of compression methods.
[64] On the Detectability of LLM-Generated Text: What Exactly Is LLM-Generated Text?
Mingmeng Geng,Thierry Poibeau
Main category: cs.CL
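TL;DR: This paper discusses the challenges of detecting text generated by large language models. There is no clear definition of "LLM-generated text", diverse usage scenarios and human editing make detection harder, and existing benchmarks and evaluation methods do not reflect real-world use, so detector outputs should be treated as references rather than decisive evidence.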
Details
Motivation: With large language models (LLMs) in wide use, researchers have turned to detecting the text they generate, but the target lacks a consistent, precise definition and real-world usage involves many complicating factors, which makes detection difficult. Method: The paper analyzes the limitations of existing detection methods, discussing the unclear definition of LLM-generated text, the effect of human edits, and the diversity of real-world usage scenarios, and it points out the shortcomings of current benchmarks and evaluation practices. Result: Current detectors remain usable under specific conditions, but because of these interfering factors their results are often misunderstood and their practical significance is diminishing. Conclusion: Detector results should be interpreted with caution and used only as references, not as decisive evidence of a text's origin. Abstract: With the widespread use of large language models (LLMs), many researchers have turned their attention to detecting text generated by them. However, there is no consistent or precise definition of their target, namely "LLM-generated text". Differences in usage scenarios and the diversity of LLMs further increase the difficulty of detection. What is commonly regarded as the detecting target usually represents only a subset of the text that LLMs can potentially produce. Human edits to LLM outputs, together with the subtle influences that LLMs exert on their users, are blurring the line between LLM-generated and human-written text. Existing benchmarks and evaluation approaches do not adequately address the various conditions in real-world detector applications. Hence, the numerical results of detectors are often misunderstood, and their significance is diminishing. Therefore, detectors remain useful under specific conditions, but their results should be interpreted only as references rather than decisive indicators.
cs.CV [Back]
[65] Fourier-Based GAN Fingerprint Detection using ResNet50
Sai Teja Erukude,Viswa Chaitanya Marella,Suhasnadh Reddy Veluru
Main category: cs.CV
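TL;DR: This paper proposes a method that combines frequency-domain analysis with deep learning: a 2D DFT exposes periodic artifacts in StyleGAN-generated images, and a ResNet50 classifier detects them, significantly outperforming a model trained directly in the spatial domain.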
Details
Motivation: To address the challenge that photorealistic fake images from generative adversarial networks such as StyleGAN pose to image forensics and to content authenticity in industrial systems. Method: A two-dimensional Discrete Fourier Transform (2D DFT) converts images to the frequency domain, where periodic artifacts in generated images stand out, and a ResNet50 network is trained on the frequency-domain images to separate real from synthetic ones. Result: The method reaches 92.8% accuracy and an AUC of 0.95 on detecting StyleGAN-generated images, significantly better than a model trained on spatial-domain images. Conclusion: GAN-generated images carry identifiable frequency-domain "fingerprints", and combining signal processing with deep learning can strengthen digital forensics, with broad applicability in industrial AI systems. Abstract: The rapid rise of photorealistic images produced from Generative Adversarial Networks (GANs) poses a serious challenge for image forensics and industrial systems requiring reliable content authenticity. This paper uses frequency-domain analysis combined with deep learning to solve the problem of distinguishing StyleGAN-generated images from real ones. Specifically, a two-dimensional Discrete Fourier Transform (2D DFT) was applied to transform images into the Fourier domain, where subtle periodic artifacts become detectable. A ResNet50 neural network is trained on these transformed images to differentiate between real and synthetic ones. The experiments demonstrate that the frequency-domain model achieves 92.8 percent accuracy and an AUC of 0.95, significantly outperforming the equivalent model trained on raw spatial-domain images. These results indicate that the GAN-generated images have unique frequency-domain signatures or "fingerprints". The method proposed highlights the industrial potential of combining signal processing techniques and deep learning to enhance digital forensics and strengthen the trustworthiness of industrial AI systems.
[66] Transformed Multi-view 3D Shape Features with Contrastive Learning
Márcus Vinícius Lobo Costa,Sherlon Almeida da Silva,Bárbara Caroline Benato,Leo Sampaio Ferraz Ribeiro,Moacir Antonelli Ponti
Main category: cs.CV
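TL;DR: This paper studies 3D shape representation learning based on Vision Transformers (ViTs) and contrastive learning, and reports strong results on multi-view 3D analysis tasks, addressing the limits of CNNs in modeling shape relationships and the reliance on large amounts of labeled data.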
Details
Motivation: Existing 3D shape representation learning methods rely on CNNs and large amounts of labeled data and struggle to capture crucial shape relationships, so more efficient and more generalizable models are needed. Method: Vision Transformers serve as the backbone, paired with supervised and self-supervised contrastive learning objectives, for multi-view 3D shape classification on datasets such as ModelNet. Result: The approach reaches about 90.6% accuracy on ModelNet10, supporting the effectiveness of ViTs with contrastive learning at capturing global shape semantics and local discriminative features. Conclusion: Combining ViTs with contrastive learning improves 3D shape representation learning, reduces the dependence on labeled data, and offers a new unified framework for 3D vision tasks. Abstract: This paper addresses the challenges in representation learning of 3D shape features by investigating state-of-the-art backbones paired with both contrastive supervised and self-supervised learning objectives. Computer vision methods struggle with recognizing 3D objects from 2D images, often requiring extensive labeled data and relying on Convolutional Neural Networks (CNNs) that may overlook crucial shape relationships. Our work demonstrates that Vision Transformer (ViT)-based architectures, when paired with modern contrastive objectives, achieve promising results in multi-view 3D analysis on our downstream tasks, unifying contrastive and 3D shape understanding pipelines. For example, supervised contrastive losses reached about 90.6% accuracy on ModelNet10. The use of ViTs and contrastive learning, leveraging ViTs' ability to understand overall shapes and contrastive learning's effectiveness, overcomes the need for extensive labeled data and the limitations of CNNs in capturing crucial shape relationships. The success stems from capturing global shape semantics via ViTs and refining local discriminative features through contrastive optimization. Importantly, our approach is empirical, as it is grounded in extensive experimental evaluation to validate the effectiveness of combining ViTs with contrastive objectives for 3D representation learning.
[67] FutrTrack: A Camera-LiDAR Fusion Transformer for 3D Multiple Object Tracking
Martha Teiko Teye,Ori Maoz,Matthias Rottmann
Main category: cs.CV
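TL;DR: FutrTrack is a transformer-based camera-LiDAR multi-object tracking framework that adds multimodal feature fusion and temporal smoothing, and achieves strong 3D MOT performance on the nuScenes and KITTI datasets.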
Details
Motivation: Existing single-sensor 3D multi-object tracking methods suffer from trajectory jitter and frequent identity switches in complex scenes, so the complementary strengths of multimodal sensors are needed to make tracking more robust. Method: The proposed FutrTrack framework comprises a transformer-based temporal smoother and a fusion tracker that needs no explicit motion model; it uses two-stage multimodal BEV feature fusion and combines geometric and semantic cues to match and propagate identities across frames. Result: FutrTrack reaches an aMOTA of 74.7 on the nuScenes test set, significantly outperforming single-sensor methods while reducing identity switches and improving trajectory consistency. Conclusion: FutrTrack confirms the importance of multimodal feature fusion for query-based transformer trackers and provides an efficient 3D multi-object tracking solution with low data requirements. Abstract: We propose FutrTrack, a modular camera-LiDAR multi-object tracking framework that builds on existing 3D detectors by introducing a transformer-based smoother and a fusion-driven tracker. Inspired by query-based tracking frameworks, FutrTrack employs a multimodal two-stage transformer refinement and tracking pipeline. Our fusion tracker integrates bounding boxes with multimodal bird's-eye-view (BEV) fusion features from multiple cameras and LiDAR without the need for an explicit motion model. The tracker assigns and propagates identities across frames, leveraging both geometric and semantic cues for robust re-identification under occlusion and viewpoint changes. Prior to tracking, we pass sequences of bounding boxes through a temporal smoother over a moving window to refine trajectories, reduce jitter, and improve spatial consistency. Evaluated on nuScenes and KITTI, FutrTrack demonstrates that query-based transformer tracking methods benefit significantly from multimodal sensor features compared with previous single-sensor approaches. With an aMOTA of 74.7 on the nuScenes test set, FutrTrack achieves strong performance on 3D MOT benchmarks, reducing identity switches while maintaining competitive accuracy. Our approach provides an efficient framework for improving transformer-based trackers to compete with other neural-network-based methods even with limited data and without pretraining.
[68] Improving Predictive Confidence in Medical Imaging via Online Label Smoothing
Kushan Choudhury,Shubhrodeep Roy,Ankur Chanda,Shubhajit Biswas,Somenath Kuiry
Main category: cs.CV
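TL;DR: This study investigates Online Label Smoothing (OLS) for medical image classification and finds that it improves classification accuracy and model calibration over conventional methods.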
Details
Motivation: Deep learning models perform well on medical image classification but often produce overconfident predictions, which undermines reliability; conventional label smoothing ignores relationships between classes, which limits its benefit. Method: Online Label Smoothing (OLS) dynamically adjusts soft labels during training based on the model's own predictions; it is evaluated on the RadImageNet dataset with three widely used architectures: ResNet-50, MobileNetV2, and VGG-19. Result: OLS outperforms hard labels, conventional label smoothing, and teacher-free knowledge distillation in both Top-1 and Top-5 accuracy, and it produces more compact, well-separated feature embeddings, indicating improved representation learning. Conclusion: OLS improves both predictive performance and calibration in medical image classification and offers a practical, effective approach to building trustworthy medical AI systems. Abstract: Deep learning models, especially convolutional neural networks, have achieved impressive results in medical image classification. However, these models often produce overconfident predictions, which can undermine their reliability in critical healthcare settings. While traditional label smoothing offers a simple way to reduce such overconfidence, it fails to consider relationships between classes by treating all non-target classes equally. In this study, we explore the use of Online Label Smoothing (OLS), a dynamic approach that adjusts soft labels throughout training based on the model's own prediction patterns. We evaluate OLS on the large-scale RadImageNet dataset using three widely used architectures: ResNet-50, MobileNetV2, and VGG-19. Our results show that OLS consistently improves both Top-1 and Top-5 classification accuracy compared to standard training methods, including hard labels, conventional label smoothing, and teacher-free knowledge distillation. In addition to accuracy gains, OLS leads to more compact and well-separated feature embeddings, indicating improved representation learning. These findings suggest that OLS not only strengthens predictive performance but also enhances calibration, making it a practical and effective solution for developing trustworthy AI systems in the medical imaging domain.
[69] A Unified Detection Pipeline for Robust Object Detection in Fisheye-Based Traffic Surveillance
Neema Jakisa Owor,Joshua Kofi Asamoah,Tanner Wambui Muturi,Anneliese Jakisa Owor,Blessing Agyei Kyem,Andrews Danyo,Yaw Adu-Gyamfi,Armstrong Aboah
Main category: cs.CV
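TL;DR: This paper proposes a detection framework for fisheye camera images. With a pre- and post-processing pipeline and model ensembling, it improves the consistency and accuracy of object detection in heavily distorted regions, and it placed 8th in Track 4 of the 2025 AI City Challenge.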
Details
Motivation: Fisheye cameras provide wide field-of-view traffic surveillance, but their strong radial distortion and nonuniform resolution challenge standard object detectors, especially near image boundaries. Method: A simple pre- and post-processing pipeline improves detection consistency across the whole image; several state-of-the-art detection models are trained and their outputs are combined through an ensemble strategy. Result: The method achieves an F1 score of 0.6366 on the 2025 AI City Challenge Track 4, ranking 8th out of 62 teams. Conclusion: The proposed framework effectively handles the challenges inherent to fisheye imagery and improves object detection performance. Abstract: Fisheye cameras offer an efficient solution for wide-area traffic surveillance by capturing large fields of view from a single vantage point. However, the strong radial distortion and nonuniform resolution inherent in fisheye imagery introduce substantial challenges for standard object detectors, particularly near image boundaries where object appearance is severely degraded. In this work, we present a detection framework designed to operate robustly under these conditions. Our approach employs a simple yet effective pre- and post-processing pipeline that enhances detection consistency across the image, especially in regions affected by severe distortion. We train several state-of-the-art detection models on the fisheye traffic imagery and combine their outputs through an ensemble strategy to improve overall detection accuracy. Our method achieves an F1 score of 0.6366 on the 2025 AI City Challenge Track 4, placing 8th overall out of 62 teams. These results demonstrate the effectiveness of our framework in addressing issues inherent to fisheye imagery.
[70] Extreme Views: 3DGS Filter for Novel View Synthesis from Out-of-Distribution Camera Poses
Damian Bowness,Charalambos Poullis
Main category: cs.CV
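TL;DR: This paper proposes a real-time, render-aware filtering method that uses sensitivity scores derived from intermediate gradients to reduce visual noise when 3D Gaussian Splatting models are viewed from poses outside the training distribution, substantially improving visual quality and consistency.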
Details
Motivation: When viewed from camera poses far from the training distribution, 3D Gaussian Splatting (3DGS) models often show severe visual noise, because the lack of training data in extrapolated regions makes density, color, and geometry predictions uncertain. Method: The authors propose a real-time, render-aware filtering method based on sensitivity scores from intermediate gradients; it targets instabilities caused by anisotropic orientations rather than isotropic variance and directly mitigates generative uncertainty. Result: Experiments show that the method substantially outperforms existing NeRF-based approaches such as BayesRays in visual quality, realism, and consistency, and that it integrates into existing 3DGS rendering pipelines in real time with no additional retraining or fine-tuning. Conclusion: The filter effectively addresses generative uncertainty in 3DGS outside the training views, maintaining high visual fidelity while users navigate freely, and it is practical and compatible with existing pipelines. Abstract: When viewing a 3D Gaussian Splatting (3DGS) model from camera positions significantly outside the training data distribution, substantial visual noise commonly occurs. These artifacts result from the lack of training data in these extrapolated regions, leading to uncertain density, color, and geometry predictions from the model. To address this issue, we propose a novel real-time render-aware filtering method. Our approach leverages sensitivity scores derived from intermediate gradients, explicitly targeting instabilities caused by anisotropic orientations rather than isotropic variance. This filtering method directly addresses the core issue of generative uncertainty, allowing 3D reconstruction systems to maintain high visual fidelity even when users freely navigate outside the original training viewpoints. Experimental evaluation demonstrates that our method substantially improves visual quality, realism, and consistency compared to existing Neural Radiance Field (NeRF)-based approaches such as BayesRays. Critically, our filter seamlessly integrates into existing 3DGS rendering pipelines in real time, unlike methods that require extensive post-hoc retraining or fine-tuning. Code and results at https://damian-bowness.github.io/EV3DGS
[71] BrainPuzzle: Hybrid Physics and Data-Driven Reconstruction for Transcranial Ultrasound Tomography
Shengyu Chen,Shihang Feng,Yi Luo,Xiaowei Jia,Youzuo Lin
Main category: cs.CV
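TL;DR: This paper proposes BrainPuzzle, a hybrid two-stage framework for quantitative transcranial ultrasound speed-of-sound (SoS) reconstruction. By combining physical modeling with machine learning, it produces accurate, structurally complete SoS images under low signal-to-noise and sparse-aperture conditions.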
Details
Motivation: Conventional full-waveform inversion is limited by skull-induced signal attenuation, mode conversion, and phase aberration, and full-aperture arrays are clinically impractical; purely data-driven methods struggle to guarantee quantitative accuracy under complex nonlinear wave propagation. A method that combines physical laws with data-driven learning is therefore needed to improve transcranial ultrasound imaging. Method: In the first stage, reverse time migration is applied to multi-angle acquisitions to produce migration fragments that preserve structural detail. In the second stage, a transformer-based super-resolution encoder-decoder with a graph-based attention unit (GAU) fuses these fragments into a coherent, quantitatively accurate SoS image; a movable low-count transducer array performs partial-aperture acquisition to improve clinical feasibility. Result: Experiments on two synthetic datasets show that BrainPuzzle outperforms existing methods in SoS reconstruction accuracy and image completeness, and it holds up under low signal-to-noise and sparse-aperture conditions. Conclusion: By fusing physical models with deep learning, BrainPuzzle overcomes the weak signals and aperture limits of transcranial ultrasound imaging and offers a feasible, efficient path toward quantitative ultrasound brain imaging. Abstract: Ultrasound brain imaging remains challenging due to the large difference in sound speed between the skull and brain tissues and the difficulty of coupling large probes to the skull. This work aims to achieve quantitative transcranial ultrasound by reconstructing an accurate speed-of-sound (SoS) map of the brain. Traditional physics-based full-waveform inversion (FWI) is limited by weak signals caused by skull-induced attenuation, mode conversion, and phase aberration, as well as incomplete spatial coverage since full-aperture arrays are clinically impractical. In contrast, purely data-driven methods that learn directly from raw ultrasound data often fail to model the complex nonlinear and nonlocal wave propagation through bone, leading to anatomically plausible but quantitatively biased SoS maps under low signal-to-noise and sparse-aperture conditions. To address these issues, we propose BrainPuzzle, a hybrid two-stage framework that combines physical modeling with machine learning. In the first stage, reverse time migration (time-reversal acoustics) is applied to multi-angle acquisitions to produce migration fragments that preserve structural details even under low SNR. In the second stage, a transformer-based super-resolution encoder-decoder with a graph-based attention unit (GAU) fuses these fragments into a coherent and quantitatively accurate SoS image. A partial-array acquisition strategy using a movable low-count transducer set improves feasibility and coupling, while the hybrid algorithm compensates for the missing aperture. Experiments on two synthetic datasets show that BrainPuzzle achieves superior SoS reconstruction accuracy and image completeness, demonstrating its potential for advancing quantitative ultrasound brain imaging.
[72] Exposing Blindspots: Cultural Bias Evaluation in Generative Image Models
Huichan Seo,Sieun Choi,Minki Hong,Yi Zhou,Junseo Kim,Lukman Ismaila,Naome Etori,Mehul Agarwal,Zhixuan Liu,Jihie Kim,Jean Oh
Main category: cs.CV
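TL;DR: This study proposes a unified evaluation framework for detecting cultural bias in text-to-image (T2I) and image-to-image (I2I) generative models. It covers six countries, multiple categories, and era-aware prompts, and combines automatic metrics, culture-aware visual question answering, and human evaluation by native experts. It finds that existing models default to Global-North, modern-leaning imagery, that iterative editing erodes cultural fidelity, and that I2I models make only superficial changes rather than context-consistent ones. The authors release the full dataset and evaluation protocol as a reproducible, culture-centered benchmark.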
Details
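Motivation: Generative models often misrepresent culture. Cultural bias in text-to-image models has been studied, but image-to-image editing systems remain underexplored. To fill this gap, the paper builds a standardized, cross-country, cross-era evaluation that systematically diagnoses cultural sensitivity issues in T2I and I2I models.
Method: The study uses a taxonomy covering six countries with 8 categories and 36 subcategories, designs era-aware prompts, and evaluates T2I generation and I2I editing for open models under fixed settings. The evaluation combines standard automatic metrics, a retrieval-augmented culture-aware visual question answering (VQA) component, and expert human judgments from native reviewers. The full image corpus, prompts, and configurations are released for reproducibility.
Result: Three findings: (1) under country-agnostic prompts, models default to Global-North, modern-leaning imagery that blurs cross-country differences; (2) iterative I2I editing erodes cultural fidelity even when conventional metrics stay flat or improve; (3) I2I models mostly rely on superficial cues such as palette shifts or generic props, lack era consistency and context awareness, and often retain the source identity, especially for Global-South targets.
Conclusion: Current generative image models remain unreliable for culture-sensitive editing, and I2I systems in particular tend to mask deeper cultural misrepresentation behind superficial changes. The standardized data, prompts, and human evaluation protocol form a reproducible, culture-centered benchmark for diagnosing and tracking cultural bias in generative models.
Abstract: Generative image models produce striking visuals yet often misrepresent culture. Prior work has examined cultural bias mainly in text-to-image (T2I) systems, leaving image-to-image (I2I) editors underexplored. We bridge this gap with a unified evaluation across six countries, an 8-category/36-subcategory schema, and era-aware prompts, auditing both T2I generation and I2I editing under a standardized protocol that yields comparable diagnostics. Using open models with fixed settings, we derive cross-country, cross-era, and cross-category evaluations. Our framework combines standard automatic metrics, a culture-aware retrieval-augmented VQA, and expert human judgments collected from native reviewers. To enable reproducibility, we release the complete image corpus, prompts, and configurations. Our study reveals three findings: (1) under country-agnostic prompts, models default to Global-North, modern-leaning depictions that flatten cross-country distinctions; (2) iterative I2I editing erodes cultural fidelity even when conventional metrics remain flat or improve; and (3) I2I models apply superficial cues (palette shifts, generic props) rather than era-consistent, context-aware changes, often retaining source identity for Global-South targets. These results highlight that culture-sensitive edits remain unreliable in current systems. By releasing standardized data, prompts, and human evaluation protocols, we provide a reproducible, culture-centered benchmark for diagnosing and tracking cultural bias in generative image models.
[73] Filter-Based Reconstruction of Images from Events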
Bernd Pfrommer
Main category: cs.CV
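TL;DR: This paper proposes FIBAR, an asynchronous filter-based method for reconstructing intensity images from the event stream of a moving event camera. It integrates events with an IIR filter and uses a novel algorithm to detect and blur "stale" pixels to reduce noise. Images can be read out asynchronously at any time, and the method runs efficiently on an ordinary laptop CPU.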
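The two filtering steps named in this entry (temporal IIR integration of events, then Gaussian blurring of stale pixels) can be illustrated with a minimal sketch. This is not the FIBAR implementation: the first-order filter, the `decay` and `contrast` values, the 3x3 kernel, and the function names are all assumptions made for illustration, and the paper's stale-pixel detection algorithm is not reproduced (the stale set is simply passed in).

```python
# Hypothetical sketch: generic first-order IIR event integration plus a small
# Gaussian-like blur applied only to pixels flagged as stale.

KERNEL = ((1, 2, 1), (2, 4, 2), (1, 2, 1))  # 3x3 binomial approximation of a Gaussian


def integrate_events(events, width, height, decay=0.9, contrast=0.1):
    """Integrate (x, y, polarity) events per pixel with a leaky first-order IIR filter.

    decay and contrast are illustrative values, not taken from the paper.
    """
    img = [[0.0] * width for _ in range(height)]
    for x, y, polarity in events:
        # y[n] = decay * y[n-1] + contrast * polarity
        img[y][x] = decay * img[y][x] + contrast * polarity
    return img


def blur_stale(img, stale):
    """Return a copy of img where each (x, y) in `stale` is replaced by a
    3x3 weighted average of its neighbourhood (borders are clamped)."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for x, y in stale:
        acc, norm = 0.0, 0
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                yy = min(max(y + dy, 0), h - 1)
                xx = min(max(x + dx, 0), w - 1)
                k = KERNEL[dy + 1][dx + 1]
                acc += k * img[yy][xx]
                norm += k
        out[y][x] = acc / norm
    return out
```

Because each event touches only its own pixel, the image state can be read at any moment, which is the property the entry describes as asynchronous read-out.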
Details
Motivation: Existing neural-network-based methods for reconstructing images from events typically require a GPU and are computationally expensive. The paper aims for a simpler, lightweight method that runs in real time on a CPU, suited to resource-constrained or low-latency applications.
Method: A digital IIR filter integrates event-signaled brightness changes over time. Stale pixels are identified by monitoring a window of recently updated pixels. On the assumption that, for a moving camera, regions without events have a low image gradient, stale pixels are blurred with a Gaussian filter to reduce noise. The whole process runs asynchronously and supports image read-out at any time.
Result: On a modern laptop CPU, FIBAR processes roughly 42 million events/s with spatial filtering enabled and 140 million events/s without it. Compared with neural network methods such as FireNet, its reconstructions are noisier and show ghost images, but they remain usable for tasks such as fiducial marker detection.
Conclusion: FIBAR is an efficient, lightweight, asynchronous method for reconstructing images from events. Its image quality falls short of deep learning methods, but its low computational cost and real-time operation suit specific applications and make it a workable option for resource-constrained platforms.
Abstract: Reconstructing an intensity image from the events of a moving event camera is a challenging task that is typically approached with neural networks deployed on graphics processing units. This paper presents a much simpler, FIlter Based Asynchronous Reconstruction method (FIBAR). First, intensity changes signaled by events are integrated with a temporal digital IIR filter. To reduce reconstruction noise, stale pixels are detected by a novel algorithm that regulates a window of recently updated pixels. Arguing that for a moving camera, the absence of events at a pixel location likely implies a low image gradient, stale pixels are then blurred with a Gaussian filter. In contrast to most existing methods, FIBAR is asynchronous and permits image read-out at an arbitrary time. It runs on a modern laptop CPU at about 42(140) million events/s with (without) spatial filtering enabled. A few simple qualitative experiments are presented that show the difference in image reconstruction between FIBAR and a neural network-based approach (FireNet). FIBAR's reconstruction is noisier than neural network-based methods and suffers from ghost images. However, it is sufficient for certain tasks such as the detection of fiducial markers. Code is available at https://github.com/ros-event-camera/event_image_reconstruction_fibar
[74] Data-Adaptive Transformed Bilateral Tensor Low-Rank Representation for Clustering
Hui Chen,Xinjie Wang,Xianchao Xiu,Wanquan Liu
Main category: cs.CV
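TL;DR: Proposes TBTLRR, a new transformed bilateral tensor low-rank representation model that learns data-adaptive unitary transforms to improve noise robustness in image clustering, and combines ℓ₁/₂-norm and Frobenius-norm regularization to handle complex noise. Experiments show that it outperforms existing methods.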
Details
Motivation: Existing tensor low-rank representation methods rely on fixed transforms, are not robust to noise, and struggle to capture both global and local correlations in image data.
Method: The TBTLRR model introduces a data-adaptive tensor nuclear norm, uses the bilateral structure to exploit local correlations in the latent tensor data, and combines the ℓ₁/₂-norm with the Frobenius norm to model noise. An ADMM-based optimization algorithm solves the nonconvex model.
Result: Experiments on several real-world datasets show that TBTLRR outperforms state-of-the-art methods in image clustering, with stronger noise robustness and a convergence guarantee.
Conclusion: Through a data-driven transform and bilateral structure modeling, TBTLRR improves the clustering performance of low-rank representation under complex noise, with both theoretical guarantees and practical value.
Abstract: Tensor low-rank representation (TLRR) has demonstrated significant success in image clustering. However, most existing methods rely on fixed transformations and suffer from poor robustness to noise. In this paper, we propose a novel transformed bilateral tensor low-rank representation model called TBTLRR, which introduces a data-adaptive tensor nuclear norm by learning arbitrary unitary transforms, allowing for more effective capture of global correlations. In addition, by leveraging the bilateral structure of latent tensor data, TBTLRR is able to exploit local correlations between image samples and features. Furthermore, TBTLRR integrates the $\ell_{1/2}$-norm and Frobenius norm regularization terms for better dealing with complex noise in real-world scenarios. To solve the proposed nonconvex model, we develop an efficient optimization algorithm inspired by the alternating direction method of multipliers (ADMM) and provide theoretical convergence. Extensive experiments validate its superiority over the state-of-the-art methods in clustering. The code will be available at https://github.com/xianchaoxiu/TBTLRR.
[75] Endoshare: A Source Available Solution to De-Identify and Manage Surgical Videos
Lorenzo Arboit,Dennis N. Schneider,Britty Baby,Vinkle Srivastav,Pietro Mascagni,Nicolas Padoy
Main category: cs.CV
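TL;DR: Endoshare is a source-available, cross-platform application for standardizing and de-identifying endoscopic videos from minimally invasive surgery, supporting privacy-preserving surgical video management.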
Details
Motivation: Heterogeneous recording formats and the privacy concerns of video sharing limit the wider use of video-based assessment and surgical data science.
Method: Development followed the software development life cycle with iterative, user-centered feedback. A privacy-by-design architecture was adopted and tested through an internal usability survey of clinicians and computer scientists and an external clinician survey that added Technology Acceptance Model constructs.
Result: Prototype testing showed high usability (internal scores of 4.68/5 and 4.03/5). After refinement, external surgeons rated perceived usefulness at 5.07/7, ease of use at 5.15/7, and recommendation at 9.20/10. Processing time depended significantly on processing mode, video duration, and hardware.
Conclusion: Endoshare provides a transparent, user-friendly pipeline for standardized surgical video management, but compliance certification and further interoperability validation are needed before it can replace proprietary systems.
Abstract: Video-based assessment and surgical data science can advance surgical training, research, and quality improvement. However, widespread use remains limited by heterogeneous recording formats and privacy concerns associated with video sharing. We present Endoshare, a source-available, cross-platform application for merging, standardizing, and de-identifying endoscopic videos in minimally invasive surgery. Development followed the software development life cycle with iterative, user-centered feedback. During the analysis phase, an internal survey of clinicians and computer scientists based on ten usability heuristics identified key requirements that guided a privacy-by-design architecture. In the testing phase, an external clinician survey combined the same heuristics with Technology Acceptance Model constructs to assess usability and adoption, complemented by benchmarking across different hardware configurations. Four clinicians and four computer scientists initially tested the prototype, reporting high usability (4.68 +/- 0.40/5 and 4.03 +/- 0.51/5), with the lowest score (4.00 +/- 0.93/5) relating to label clarity. After refinement, the testing phase surveyed ten surgeons who reported high perceived usefulness (5.07 +/- 1.75/7), ease of use (5.15 +/- 1.71/7), heuristic usability (4.38 +/- 0.48/5), and strong recommendation (9.20 +/- 0.79/10). Processing time varied with processing mode, video duration (both p <= 0.001), and machine computational power (p = 0.041). Endoshare provides a transparent, user-friendly pipeline for standardized, privacy-preserving surgical video management. Compliance certification and broader interoperability validation are needed to establish it as a deployable alternative to proprietary systems. The software is available at https://camma-public.github.io/Endoshare/
[76] Attentive Convolution: Unifying the Expressivity of Self-Attention with Convolutional Efficiency
Hao Yu,Haoyu Chen,Yan Jiang,Wei Peng,Zhaodong Sun,Samuel Kaski,Guoying Zhao
Main category: cs.CV
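TL;DR: This paper proposes Attentive Convolution (ATConv), a new convolutional operator that brings the adaptive routing and lateral inhibition properties of self-attention into convolution. This markedly improves the expressivity of convolutional networks, outperforms self-attention on image classification and generation tasks, and yields efficient, high-performing vision models.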
Details
Motivation: Self-attention is highly expressive but computationally expensive, while convolution is efficient but less expressive. The paper investigates why self-attention outperforms convolution and uses the answer to improve convolution design and narrow the performance gap.
Method: From an analysis of the advantages of self-attention, the paper derives two design principles, adaptive routing and lateral inhibition, and builds them into the convolution operation as the Attentive Convolution (ATConv) operator. A new CNN family, AttNet, is built on ATConv and validated on ImageNet classification and diffusion-based generation.
Result: With only 3x3 kernels, ATConv outperforms various self-attention mechanisms on fundamental vision tasks. AttNet reaches 84.4% ImageNet-1K accuracy with 27M parameters. Replacing all self-attention in SiT-XL/2 with ATConv reduces FID by 0.15 with faster sampling.
Conclusion: Borrowing the core mechanisms of self-attention to improve convolution design yields convolutional networks that are both efficient and expressive, supporting a revival of convolutional networks in modern vision tasks.
Abstract: Self-attention (SA) has become the cornerstone of modern vision backbones for its powerful expressivity over traditional Convolutions (Conv). However, its quadratic complexity remains a critical bottleneck for practical applications. Given that Conv offers linear complexity and strong visual priors, continuing efforts have been made to promote the renaissance of Conv. However, a persistent performance chasm remains, highlighting that these modernizations have not yet captured the intrinsic expressivity that defines SA. In this paper, we re-examine the design of the CNNs, directed by a key question: what principles give SA its edge over Conv? As a result, we reveal two fundamental insights that challenge the long-standing design intuitions in prior research (e.g., Receptive field). The two findings are: (1) Adaptive routing: SA dynamically regulates positional information flow according to semantic content, whereas Conv employs static kernels uniformly across all positions. (2) Lateral inhibition: SA induces score competition among token weighting, effectively suppressing redundancy and sharpening representations, whereas Conv filters lack such inhibitory dynamics and exhibit considerable redundancy. Based on this, we propose Attentive Convolution (ATConv), a principled reformulation of the convolutional operator that intrinsically injects these principles. Interestingly, with only $3\times3$ kernels, ATConv consistently outperforms various SA mechanisms in fundamental vision tasks. Building on ATConv, we introduce AttNet, a CNN family that can attain 84.4% ImageNet-1K Top-1 accuracy with only 27M parameters. In diffusion-based image generation, replacing all SA with the proposed $3\times 3$ ATConv in SiT-XL/2 reduces ImageNet FID by 0.15 in 400k steps with faster sampling. Code is available at: github.com/price112/Attentive-Convolution.
[77] StableSketcher: Enhancing Diffusion Model for Pixel-based Sketch Generation via Visual Question Answering Feedback
Jiho Park,Sieun Choi,Jaeyoon Seo,Jihie Kim
Main category: cs.CV
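TL;DR: This paper proposes StableSketcher, a diffusion-model-based framework for generating high-quality hand-drawn-style sketches with better text-image alignment and semantic consistency, and introduces the new SketchDUO dataset.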
Details
Motivation: Existing diffusion models struggle to generate pixel-based hand-drawn sketches, falling short in abstract expression and prompt fidelity.
Method: The variational autoencoder is fine-tuned to optimize latent decoding, and reinforcement learning with a reward function based on visual question answering improves the style and semantic consistency of generated sketches.
Result: Experiments show that StableSketcher outperforms the Stable Diffusion baseline in stylistic fidelity and prompt alignment. The newly proposed SketchDUO dataset contains instance-level sketches paired with captions and question-answer pairs, filling a gap in existing datasets.
Conclusion: StableSketcher improves the quality of generated hand-drawn sketches, and SketchDUO provides an important resource for future research on sketch generation and understanding.
Abstract: Although recent advancements in diffusion models have significantly enriched the quality of generated images, challenges remain in synthesizing pixel-based human-drawn sketches, a representative example of abstract expression. To combat these challenges, we propose StableSketcher, a novel framework that empowers diffusion models to generate hand-drawn sketches with high prompt fidelity. Within this framework, we fine-tune the variational autoencoder to optimize latent decoding, enabling it to better capture the characteristics of sketches. In parallel, we integrate a new reward function for reinforcement learning based on visual question answering, which improves text-image alignment and semantic consistency. Extensive experiments demonstrate that StableSketcher generates sketches with improved stylistic fidelity, achieving better alignment with prompts compared to the Stable Diffusion baseline. Additionally, we introduce SketchDUO, to the best of our knowledge, the first dataset comprising instance-level sketches paired with captions and question-answer pairs, thereby addressing the limitations of existing datasets that rely on image-label pairs. Our code and dataset will be made publicly available upon acceptance.
[78] BIOCAP: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models
Ziheng Zhang,Xinyue Ma,Arpita Chowdhury,Elizabeth G. Campolongo,Matthew J. Thompson,Net Zhang,Samuel Stevens,Hilmar Lapp,Tanya Berger-Wolf,Yu Su,Wei-Lun Chao,Jianyang Gu
Main category: cs.CV
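TL;DR: This study proposes descriptive captions as a complementary supervision signal for biological multimodal foundation models. Synthetic captions generated with guidance from Wikipedia-derived information and taxon-specific examples are used to train the BIOCAP model, which performs well on species classification and text-image retrieval.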
Details
Motivation: Biology lacks large-scale, instance-specific natural language annotations, which limits the use of multimodal models. The paper explores descriptive captions as a complementary supervision signal to strengthen the alignment between images and semantics.
Method: Multimodal large language models (MLLMs), guided by Wikipedia-derived visual information and taxon-tailored format examples, generate synthetic descriptive captions, which are used to train BIOCAP for joint image-text representation learning.
Result: BIOCAP performs strongly on species classification and text-image retrieval, confirming that descriptive captions improve the semantic understanding of biological multimodal models.
Conclusion: Descriptive captions steer the model toward key biological traits and suppress spurious correlations, making them an effective bridge between biological images and multimodal foundation models.
Abstract: This work investigates descriptive captions as an additional source of supervision for biological multimodal foundation models. Images and captions can be viewed as complementary samples from the latent morphospace of a species, each capturing certain biological traits. Incorporating captions during training encourages alignment with this shared latent structure, emphasizing potentially diagnostic characters while suppressing spurious correlations. The main challenge, however, lies in obtaining faithful, instance-specific captions at scale. This requirement has limited the utilization of natural language supervision in organismal biology compared with many other scientific domains. We complement this gap by generating synthetic captions with multimodal large language models (MLLMs), guided by Wikipedia-derived visual information and taxon-tailored format examples. These domain-specific contexts help reduce hallucination and yield accurate, instance-based descriptive captions. Using these captions, we train BIOCAP (i.e., BIOCLIP with Captions), a biological foundation model that captures rich semantics and achieves strong performance in species classification and text-image retrieval. These results demonstrate the value of descriptive captions beyond labels in bridging biological images with multimodal foundation models.
[79] Physics-Guided Fusion for Robust 3D Tracking of Fast Moving Small Objects
Prithvi Raj Singh,Raju Gottumukkala,Anthony S. Maida,Alan B. Barhorst,Vijaya Gopu
Main category: cs.CV
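TL;DR: This paper proposes a new system that combines deep-learning-based detection with a physics-based tracking algorithm to detect and track fast-moving small objects in 3D space.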
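The pairing of a kinematic motion model with outlier correction described in this entry can be sketched roughly as follows. This is only an illustration under stated assumptions: the constant-acceleration model, the distance gate of 0.5 (in whatever units the positions use), and the function names are hypothetical and are not taken from the paper.

```python
# Hypothetical sketch: predict the next 3D position from kinematics, then fall
# back to the prediction when a detection is missing or implausibly far away.


def predict_position(p, v, a, dt):
    """Constant-acceleration kinematics per axis: p + v*dt + 0.5*a*dt^2."""
    return tuple(pi + vi * dt + 0.5 * ai * dt * dt for pi, vi, ai in zip(p, v, a))


def correct_outlier(predicted, detected, gate=0.5):
    """Keep the detection unless it is missing (None) or farther than `gate`
    from the kinematic prediction; otherwise use the prediction instead."""
    if detected is None:
        return predicted
    dist = sum((d - q) ** 2 for d, q in zip(detected, predicted)) ** 0.5
    return detected if dist <= gate else predicted
```

For a ball in free flight, `a` would typically be gravity, for example `(0.0, 0.0, -9.81)` with z pointing up; handling bounces or rapid direction changes would need additional logic that this sketch omits.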
Details
Motivation: Existing methods are limited in handling fast-moving small objects, and detection and tracking of tiny objects with RGB-D cameras is underexplored.
Method: The system uses deep learning for object detection and adds a physics-based tracking algorithm built on kinematic equations, together with an outlier detection and correction module to handle challenges such as occlusions and rapid direction changes.
Result: On a custom racquetball dataset, the system achieves up to 70% lower Average Displacement Error than Kalman-filter-based trackers.
Conclusion: By fusing physics-based models with deep learning, the system performs well in real-time 3D tracking of small objects and has potential applications in perception for autonomous robotic platforms.
Abstract: While computer vision has advanced considerably for general object detection and tracking, the specific problem of fast-moving tiny objects remains underexplored. This paper addresses the significant challenge of detecting and tracking rapidly moving small objects using an RGB-D camera. Our novel system combines deep learning-based detection with physics-based tracking to overcome the limitations of existing approaches. Our contributions include: (1) a comprehensive system design for object detection and tracking of fast-moving small objects in 3D space, (2) an innovative physics-based tracking algorithm that integrates kinematics motion equations to handle outliers and missed detections, and (3) an outlier detection and correction module that significantly improves tracking performance in challenging scenarios such as occlusions and rapid direction changes. We evaluated our proposed system on a custom racquetball dataset. Our evaluation shows our system surpassing Kalman filter based trackers with up to 70% less Average Displacement Error. Our system has significant applications for improving robot perception on autonomous platforms and demonstrates the effectiveness of combining physics-based models with deep learning approaches for real-time 3D detection and tracking of challenging small objects.
[80] Inverse Image-Based Rendering for Light Field Generation from Single Images
Hyunjun Jung,Hae-Gon Jeon
Main category: cs.CV
TL;DR: This paper proposes a new method, inverse image-based rendering, that generates a light field from a single image for novel view synthesis.
Details
Motivation: Light fields are advantageous for scene representation and novel view rendering, but acquiring them traditionally requires expensive equipment or heavy computation, which limits their use. The paper aims to generate light fields from single images to make them more accessible and widely applicable.
Method: A neural rendering pipeline stores the light flow of the input image as source rays, models the relationships among source rays with cross-attention, and predicts the color of a target ray at the target viewpoint. Generated views are iteratively added to the source rays so that occluded content is generated consistently.
Result: The method works well on several challenging datasets without fine-tuning and outperforms state-of-the-art novel view synthesis methods.
Conclusion: Inverse image-based rendering generates light fields from single images and shows strong generalization and performance in novel view synthesis.
Abstract: A concept of light-fields computed from multiple view images on regular grids has proven its benefit for scene representations, and supported realistic renderings of novel views and photographic effects such as refocusing and shallow depth of field. In spite of its effectiveness of light flow computations, obtaining light fields requires either computational costs or specialized devices like a bulky camera setup and a specialized microlens array. In an effort to broaden its benefit and applicability, in this paper, we propose a novel view synthesis method for light field generation from only single images, named inverse image-based rendering. Unlike previous attempts to implicitly rebuild 3D geometry or to explicitly represent objective scenes, our method reconstructs light flows in a space from image pixels, which behaves in the opposite way to image-based rendering. To accomplish this, we design a neural rendering pipeline to render a target ray in an arbitrary viewpoint. Our neural renderer first stores the light flow of source rays from the input image, then computes the relationships among them through cross-attention, and finally predicts the color of the target ray based on these relationships. After the rendering pipeline generates the first novel view from a single input image, the generated out-of-view contents are updated to the set of source rays. This procedure is iteratively performed while ensuring the consistent generation of occluded contents. We demonstrate that our inverse image-based rendering works well with various challenging datasets without any retraining or finetuning after once trained on synthetic dataset, and outperforms relevant state-of-the-art novel view synthesis methods.
[81] Revisiting Logit Distributions for Reliable Out-of-Distribution Detection
Jiachen Liang,Ruibing Hou,Minyang Hu,Hong Chang,Shiguang Shan,Xilin Chen
Main category: cs.CV
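TL;DR: Proposes LogitGap, a new post-hoc OOD detection method that uses the relationship between the maximum logit and the remaining logits to improve the separability of ID and OOD samples, and introduces a training-free strategy that automatically identifies the most informative logits. It achieves state-of-the-art performance across a variety of scenarios.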
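One plausible reading of "the relationship between the maximum logit and the remaining logits" is the average gap between the top logit and the logits below it. The sketch below implements that reading only; the exact LogitGap scoring function and its logit-selection strategy may differ, and the function name and the `k` parameter are assumptions for illustration.

```python
# Hypothetical sketch of a gap-style post-hoc OOD score (not the paper's code).


def logit_gap_score(logits, k=None):
    """Average gap between the largest logit and the next-largest logits.

    If k is given, only the k logits ranked just below the maximum are used,
    loosely mimicking a restriction to a compact subset of the logit space.
    Higher scores suggest a more peaked, in-distribution-like prediction.
    """
    ordered = sorted(logits, reverse=True)
    rest = ordered[1:] if k is None else ordered[1:1 + k]
    if not rest:
        raise ValueError("need at least two logits")
    return sum(ordered[0] - z for z in rest) / len(rest)
```

In use, a sample would be flagged as OOD when its score falls below a threshold chosen on held-out in-distribution data; that thresholding step is standard post-hoc practice rather than something specified in this entry.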
Details
Motivation: Existing post-hoc OOD detection methods underexploit the rich information in the model's logit space, which leads to poor separation of ID and OOD samples.
Method: LogitGap uses the relationship between the maximum logit and the remaining logits to improve discrimination, and a training-free strategy automatically identifies the most informative subset of logits for scoring.
Result: Extensive experiments on vision-language and vision-only models show that LogitGap consistently achieves state-of-the-art performance across multiple OOD detection benchmarks and scenarios.
Conclusion: By exploiting the structure of the logit space, LogitGap provides an efficient, training-free OOD detection approach that performs strongly across models and tasks.
Abstract: Out-of-distribution (OOD) detection is critical for ensuring the reliability of deep learning models in open-world applications. While post-hoc methods are favored for their efficiency and ease of deployment, existing approaches often underexploit the rich information embedded in the model's logits space. In this paper, we propose LogitGap, a novel post-hoc OOD detection method that explicitly exploits the relationship between the maximum logit and the remaining logits to enhance the separability between in-distribution (ID) and OOD samples. To further improve its effectiveness, we refine LogitGap by focusing on a more compact and informative subset of the logit space. Specifically, we introduce a training-free strategy that automatically identifies the most informative logits for scoring. We provide both theoretical analysis and empirical evidence to validate the effectiveness of our approach. Extensive experiments on both vision-language and vision-only models demonstrate that LogitGap consistently achieves state-of-the-art performance across diverse OOD detection scenarios and benchmarks. Code is available at https://github.com/GIT-LJc/LogitGap.
[82] PartNeXt: A Next-Generation Dataset for Fine-Grained and Hierarchical 3D Part Understanding
Penghao Wang,Yiyang He,Xin Lv,Yukai Zhou,Lan Xu,Jingyi Yu,Jiayuan Gu
Main category: cs.CV
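TL;DR: This paper presents PartNeXt, a dataset of more than 23,000 high-quality, textured 3D models with fine-grained, hierarchical part annotations across 50 categories. It addresses the scalability and usability limitations of existing datasets such as PartNet, and is benchmarked on class-agnostic part segmentation and 3D part-centric question answering, exposing the weaknesses of current methods in fine-grained part understanding. In addition, training Point-SAM on PartNeXt yields substantially better results than training on PartNet, demonstrating its higher quality and diversity.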
Details
Motivation: Existing 3D part understanding datasets such as PartNet rely on untextured geometry and expert annotation, which limits their scalability and usability. Advancing part-level object understanding in computer vision, graphics, and robotics calls for a larger, textured, finely annotated dataset.
Method: The PartNeXt dataset contains more than 23,000 textured 3D models with fine-grained, hierarchical part annotations across 50 categories. Two benchmark tasks are defined: class-agnostic part segmentation and 3D part-centric question answering. Models such as Point-SAM are trained and evaluated, and their performance is compared with PartNet.
Result: On class-agnostic part segmentation, state-of-the-art methods (e.g., PartField, SAMPart3D) struggle with fine-grained and leaf-level parts. On 3D part question answering, 3D-LLMs show clear gaps in open-vocabulary part grounding. Training Point-SAM on PartNeXt yields substantial gains over training on PartNet.
Conclusion: Through scalable annotation, texture-aware labels, and multi-task evaluation, PartNeXt opens new directions for structured 3D understanding research, improves markedly on existing datasets, and is positioned to become an important resource for 3D part understanding.
Abstract: Understanding objects at the level of their constituent parts is fundamental to advancing computer vision, graphics, and robotics. While datasets like PartNet have driven progress in 3D part understanding, their reliance on untextured geometries and expert-dependent annotation limits scalability and usability. We introduce PartNeXt, a next-generation dataset addressing these gaps with over 23,000 high-quality, textured 3D models annotated with fine-grained, hierarchical part labels across 50 categories. We benchmark PartNeXt on two tasks: (1) class-agnostic part segmentation, where state-of-the-art methods (e.g., PartField, SAMPart3D) struggle with fine-grained and leaf-level parts, and (2) 3D part-centric question answering, a new benchmark for 3D-LLMs that reveals significant gaps in open-vocabulary part grounding. Additionally, training Point-SAM on PartNeXt yields substantial gains over PartNet, underscoring the dataset's superior quality and diversity. By combining scalable annotation, texture-aware labels, and multi-task evaluation, PartNeXt opens new avenues for research in structured 3D understanding.
[83] Monocular Visual 8D Pose Estimation for Articulated Bicycles and Cyclists
Eduardo R. Corral-Soto,Yang Liu,Yuan Ren,Bai Dongfeng,Liu Bingbing
Main category: cs.CV
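TL;DR: This paper proposes a method for category-level 8D pose estimation of bicycles and cyclists from a single RGB image. In addition to the bicycle's 3D translation and rotation, it estimates the rotation angles of the handlebars and pedals relative to the bicycle body, enabling more precise prediction of the cyclist's travel direction and behavior.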
Details
Motivation: In autonomous driving, accurate cyclist pose estimation is essential for judging crossing intention, predicting behavior, and avoiding collisions. Conventional 6D pose estimation cannot handle the articulated motion of bicycle parts such as the handlebars and pedals, so the 3D bounding box can disagree with the actual travel direction.
Method: A model jointly estimates the 8D pose (3D position, 3D rotation, and handlebar and pedal rotation angles) and 3D keypoints, and is trained on a mix of synthetic and real images to improve generalization to real scenes.
Result: Experiments show good accuracy on the 8D pose parameters and competitive results against state-of-the-art 6D pose estimators that use rigid templates.
Conclusion: The method models the bicycle's articulated structure more finely, improves estimation of cyclist intention and travel direction, and helps autonomous driving systems protect vulnerable road users.
Abstract: In Autonomous Driving, cyclists belong to the safety-critical class of Vulnerable Road Users (VRU), and accurate estimation of their pose is critical for cyclist crossing intention classification, behavior prediction, and collision avoidance. Unlike rigid objects, articulated bicycles are composed of movable rigid parts linked by joints and constrained by a kinematic structure. 6D pose methods can estimate the 3D rotation and translation of rigid bicycles, but 6D becomes insufficient when the steering/pedals angles of the bicycle vary. That is because: 1) varying the articulated pose of the bicycle causes its 3D bounding box to vary as well, and 2) the 3D box orientation is not necessarily aligned to the orientation of the steering which determines the actual intended travel direction. In this work, we introduce a method for category-level 8D pose estimation for articulated bicycles and cyclists from a single RGB image. Besides being able to estimate the 3D translation and rotation of a bicycle from a single image, our method also estimates the rotations of its steering handles and pedals with respect to the bicycle body frame. These two new parameters enable the estimation of a more fine-grained bicycle pose state and travel direction. Our proposed model jointly estimates the 8D pose and the 3D Keypoints of articulated bicycles, and trains with a mix of synthetic and real image data to generalize on real images. We include an evaluation section where we evaluate the accuracy of our estimated 8D pose parameters, and our method shows promising results by achieving competitive scores when compared against state-of-the-art category-level 6D pose estimators that use rigid canonical object templates for matching.
[84] TOMCAT: Test-time Comprehensive Knowledge Accumulation for Compositional Zero-Shot Learning
Xudong Yan,Songhe Feng
Main category: cs.CV
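TL;DR: Proposes TOMCAT, a new method that accumulates multimodal knowledge from unsupervised data at test time and adaptively updates prototypes, handling distribution shift and achieving state-of-the-art performance on four benchmark datasets.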
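Two mechanisms described in this entry, a priority queue that retains high-confidence images and a weighted prototype update, can be sketched generically as follows. This is not the TOMCAT implementation: the class and function names, the fixed `capacity`, and the simple convex blend are assumptions, and in particular how the adaptive update weight is computed is not reproduced here.

```python
# Hypothetical sketch: bounded confidence queue plus a convex prototype update.
import heapq


class ConfidenceQueue:
    """Keep the `capacity` highest-confidence items seen so far."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []   # min-heap of (confidence, insertion_index, item)
        self._count = 0   # tie-breaker so items themselves are never compared

    def push(self, confidence, item):
        entry = (confidence, self._count, item)
        self._count += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif confidence > self._heap[0][0]:
            # Evict the current lowest-confidence entry.
            heapq.heapreplace(self._heap, entry)

    def items(self):
        """Stored items, highest confidence first."""
        return [item for _, _, item in sorted(self._heap, reverse=True)]


def update_prototype(prototype, feature, weight):
    """Convex blend: (1 - weight) * prototype + weight * feature."""
    return [(1.0 - weight) * p + weight * f for p, f in zip(prototype, feature)]
```

A small `weight` keeps the prototype close to its original value, while a larger one lets test-time features move it further; choosing that weight per sample is the adaptive part that the entry attributes to the method itself.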
Details
Motivation: Existing CZSL methods degrade because of the distribution shift of the label space at test time, which arises from unseen compositions formed by recombining attributes and objects.
Method: Knowledge in the textual and visual modalities is accumulated from unsupervised data; an adaptive update weight adjusts the multimodal prototypes; a dynamic priority queue stores high-confidence images to provide historical visual knowledge; and multimodal collaborative representation learning aligns the textual and visual prototypes.
Result: The method achieves the best performance on four benchmark datasets under both closed-world and open-world settings.
Conclusion: The method mitigates distribution shift and substantially improves the generalization of CZSL in complex scenarios.
Abstract: Compositional Zero-Shot Learning (CZSL) aims to recognize novel attribute-object compositions based on the knowledge learned from seen ones. Existing methods suffer from performance degradation caused by the distribution shift of label space at test time, which stems from the inclusion of unseen compositions recombined from attributes and objects. To overcome the challenge, we propose a novel approach that accumulates comprehensive knowledge in both textual and visual modalities from unsupervised data to update multimodal prototypes at test time. Building on this, we further design an adaptive update weight to control the degree of prototype adjustment, enabling the model to flexibly adapt to distribution shift during testing. Moreover, a dynamic priority queue is introduced that stores high-confidence images to acquire visual knowledge from historical images for inference. Considering the semantic consistency of multimodal knowledge, we align textual and visual prototypes by multimodal collaborative representation learning. Extensive experiments indicate that our approach achieves state-of-the-art performance on four benchmark datasets under both closed-world and open-world settings. Code will be available at https://github.com/xud-yan/TOMCAT .
[85] IB-GAN: Disentangled Representation Learning with Information Bottleneck Generative Adversarial Networks
Insu Jeon,Wonkwang Lee,Myeongjang Pyeon,Gunhee Kim
Main category: cs.CV
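TL;DR: Proposes IB-GAN, a new GAN model based on the Information Bottleneck framework for unsupervised disentangled representation learning. It outperforms InfoGAN on disentanglement, is competitive with β-VAE, and often generates samples of better quality and diversity.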
Details
Motivation: The work applies the Information Bottleneck (IB) framework to GAN optimization in order to achieve better disentangled representation learning.
Method: An intermediate stochastic layer is added to the GAN generator to constrain the mutual information between input and output. It serves as a learnable latent distribution trained jointly with the generator end to end.
Result: On dSprites and Color-dSprites, IB-GAN's disentanglement scores exceed InfoGAN's and are comparable to β-VAE's. FID results on CelebA and 3D Chairs show better sample quality and diversity.
Conclusion: By introducing an information bottleneck mechanism, IB-GAN achieves effective disentangled representation learning and a good balance between generation quality and disentanglement, improving on existing methods.
Abstract: We propose a new GAN-based unsupervised model for disentangled representation learning. The new model is discovered in an attempt to utilize the Information Bottleneck (IB) framework to the optimization of GAN, thereby named IB-GAN. The architecture of IB-GAN is partially similar to that of InfoGAN but has a critical difference; an intermediate layer of the generator is leveraged to constrain the mutual information between the input and the generated output. The intermediate stochastic layer can serve as a learnable latent distribution that is trained with the generator jointly in an end-to-end fashion. As a result, the generator of IB-GAN can harness the latent space in a disentangled and interpretable manner. With the experiments on dSprites and Color-dSprites dataset, we demonstrate that IB-GAN achieves competitive disentanglement scores to those of state-of-the-art β-VAEs and outperforms InfoGAN. Moreover, the visual quality and the diversity of samples generated by IB-GAN are often better than those by β-VAEs and Info-GAN in terms of FID score on CelebA and 3D Chairs dataset.
[86] PPMStereo: Pick-and-Play Memory Construction for Consistent Dynamic Stereo Matching
Yun Wang,Junjie Hu,Qiaole Dong,Yongjian Zhang,Yanwei Fu,Tin Lun Lam,Dapeng Wu
Main category: cs.CV
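TL;DR: This paper proposes Pick-and-Play Memory (PPMStereo), a method for temporally consistent depth estimation from stereo video that builds a dynamic memory buffer to model long-range spatio-temporal consistency efficiently.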
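The two-stage idea in this entry, first picking the most relevant frames and then weighting them adaptively for aggregation, can be illustrated with a generic top-k selection followed by softmax weighting. This is a sketch under assumptions, not the PPM module: dot-product relevance, softmax weights, flat feature vectors, and the function names are all hypothetical choices made for illustration.

```python
# Hypothetical sketch: 'pick' the k most relevant memory features, then 'play'
# them back as a softmax-weighted average.
import math


def pick(query, memory, k):
    """Return (indices of the k most relevant memory entries, all scores).

    Relevance is the dot product between the query and each memory feature.
    """
    scores = [sum(q * m for q, m in zip(query, feat)) for feat in memory]
    order = sorted(range(len(memory)), key=lambda i: scores[i], reverse=True)
    return order[:k], scores


def play(memory, picked, scores):
    """Aggregate the picked features with softmax weights over their scores."""
    top = max(scores[i] for i in picked)           # subtract max for numerical stability
    weights = [math.exp(scores[i] - top) for i in picked]
    total = sum(weights)
    dim = len(memory[0])
    return [
        sum(w * memory[i][d] for w, i in zip(weights, picked)) / total
        for d in range(dim)
    ]
```

Restricting aggregation to `k` entries keeps the cost independent of how many frames have been seen, which is the efficiency argument the entry makes for a compact memory buffer.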
Details
Motivation: Existing methods face a trade-off between computational efficiency and performance when modeling long-term temporal consistency, and so struggle to deliver the depth stability that real applications such as augmented reality require.
Method: Inspired by the two-stage human decision-making process, the method has a 'pick' stage that selects the most relevant frames and a 'play' stage that adaptively weights them for spatio-temporal aggregation, maintaining a compact yet informative memory buffer for efficient dynamic stereo matching.
Result: Experiments show that PPMStereo achieves state-of-the-art accuracy and temporal consistency, markedly outperforming BiDAStereo on the Sintel dataset at lower computational cost.
Conclusion: PPMStereo resolves the tension between long-range temporal modeling and computational efficiency, providing high-quality, temporally consistent depth estimation for real-time applications.
Abstract: Temporally consistent depth estimation from stereo video is critical for real-world applications such as augmented reality, where inconsistent depth estimation disrupts the immersion of users. Despite its importance, this task remains challenging due to the difficulty in modeling long-term temporal consistency in a computationally efficient manner. Previous methods attempt to address this by aggregating spatio-temporal information but face a fundamental trade-off: limited temporal modeling provides only modest gains, whereas capturing long-range dependencies significantly increases computational cost. To address this limitation, we introduce a memory buffer for modeling long-range spatio-temporal consistency while achieving efficient dynamic stereo matching. Inspired by the two-stage decision-making process in humans, we propose a Pick-and-Play Memory (PPM) construction module for dynamic Stereo matching, dubbed as PPMStereo. PPM consists of a 'pick' process that identifies the most relevant frames and a 'play' process that weights the selected frames adaptively for spatio-temporal aggregation. This two-stage collaborative process maintains a compact yet highly informative memory buffer while achieving temporally consistent information aggregation. Extensive experiments validate the effectiveness of PPMStereo, demonstrating state-of-the-art performance in both accuracy and temporal consistency. Notably, PPMStereo achieves 0.62/1.11 TEPE on the Sintel clean/final (17.3% & 9.02% improvements over BiDAStereo) with fewer computational costs. Codes are available at https://github.com/cocowy1/PPMStereo.
[87] Evaluating Video Models as Simulators of Multi-Person Pedestrian Trajectories
Aaron Appelle,Jerome P. Lynch
Main category: cs.CV
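TL;DR: Proposes a rigorous evaluation protocol for benchmarking text-to-video and image-to-video models as implicit simulators of pedestrian dynamics. It finds that leading models have learned effective priors for multi-agent behavior but still show failures such as people merging or disappearing.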
Details
Motivation: Existing video generation models lack systematic evaluation of plausibility in multi-subject interaction scenes such as pedestrian dynamics, so their potential as general-purpose world simulators remains to be verified.
Method: An evaluation protocol is built for T2V and I2V models: I2V is evaluated using start frames from real datasets, and T2V uses a prompt suite covering different pedestrian densities and interactions. A method is proposed to reconstruct 2D bird's-eye-view trajectories from pixel space without camera parameters.
Result: Experiments show that current mainstream video generation models can generate multi-agent behavior consistent with real pedestrian dynamics, but failure modes such as people merging and disappearing still occur in dense interaction scenes.
Conclusion: Video generation models already have some implicit multi-agent simulation ability but need further improvement to handle complex crowd interactions. The evaluation framework provides a benchmark for future research.
Abstract: Large-scale video generation models have demonstrated high visual realism in diverse contexts, spurring interest in their potential as general-purpose world simulators. Existing benchmarks focus on individual subjects rather than scenes with multiple interacting people. However, the plausibility of multi-agent dynamics in generated videos remains unverified. We propose a rigorous evaluation protocol to benchmark text-to-video (T2V) and image-to-video (I2V) models as implicit simulators of pedestrian dynamics. For I2V, we leverage start frames from established datasets to enable comparison with a ground truth video dataset. For T2V, we develop a prompt suite to explore diverse pedestrian densities and interactions. A key component is a method to reconstruct 2D bird's-eye view trajectories from pixel-space without known camera parameters. Our analysis reveals that leading models have learned surprisingly effective priors for plausible multi-agent behavior. However, failure modes like merging and disappearing people highlight areas for future improvement.
[88] SPAN: Continuous Modeling of Suspicion Progression for Temporal Intention Localization
Xinyi Hu,Yuran Wang,Yue Li,Wenxuan Liu,Zheng Wang
Main category: cs.CV
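TL;DR: This paper proposes the Suspicion Progression Analysis Network (SPAN), which recasts temporal intention localization from discrete classification to continuous regression to capture how suspicious intentions evolve in video surveillance. With a suspicion score formula, suspicion coefficient modulation, and concept-anchored mapping, SPAN better models long-term dependencies and cumulative effects, significantly outperforms existing methods on the HAI dataset, and enables earlier detection and stronger explainability.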
Details
Motivation: Existing discrete classification methods cannot capture the continuity and dynamic evolution of suspicious intentions, which limits early intervention and system explainability. Method: The SPAN model replaces discrete classification with continuous regression; a suspicion score formula is designed based on TPP theory; a suspicion coefficient modulation mechanism adjusted by multimodal information is introduced; and a concept-anchored mapping links actions to their potential intentions. Result: On the HAI dataset, MSE is reduced by 19.8%, average mAP improves by 1.78%, and mAP in low-frequency cases improves by 2.74%. Conclusion: By modeling suspicious intentions continuously, SPAN markedly improves the performance, explainability, and practicality of temporal intention localization, supporting earlier security warnings and proactive intervention. Abstract: Temporal Intention Localization (TIL) is crucial for video surveillance, focusing on identifying varying levels of suspicious intentions to improve security monitoring. However, existing discrete classification methods fail to capture the continuous nature of suspicious intentions, limiting early intervention and explainability. In this paper, we propose the Suspicion Progression Analysis Network (SPAN), which shifts from discrete classification to continuous regression, enabling the capture of fluctuating and evolving suspicious intentions. We reveal that suspicion exhibits long-term dependencies and cumulative effects, similar to Temporal Point Process (TPP) theory. Based on these insights, we define a suspicion score formula that models continuous changes while accounting for temporal characteristics. We also introduce Suspicion Coefficient Modulation, which adjusts suspicion coefficients using multimodal information to reflect the varying impacts of suspicious actions. Additionally, the Concept-Anchored Mapping method is proposed to link suspicious actions to predefined intention concepts, offering insights into both the actions and their potential underlying intentions. Extensive experiments on the HAI dataset show that SPAN significantly outperforms existing methods, reducing MSE by 19.8% and improving average mAP by 1.78%. Notably, SPAN achieves a 2.74% mAP gain in low-frequency cases, demonstrating its superior ability to capture subtle behavioral changes.
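The abstract does not state the suspicion score formula, so the following is only a toy sketch of the behavior it describes: a continuous score in which past suspicious actions accumulate and fade over time, written in the style of a self-exciting point-process intensity. The exponential kernel, the `decay` rate, the `base` level, and the per-action coefficients are assumptions for illustration and should not be read as SPAN's definition.

```python
import math

def suspicion_score(t, events, base=0.0, decay=0.5):
    """Toy continuous score: each past action adds its coefficient,
    discounted exponentially by the time elapsed since it occurred."""
    return base + sum(
        coeff * math.exp(-decay * (t - t_i)) for t_i, coeff in events if t_i <= t
    )

# Hypothetical (time, coefficient) pairs; a larger coefficient stands in for
# an action judged more suspicious (for example after multimodal modulation).
events = [(1.0, 0.3), (2.0, 0.5), (6.0, 0.2)]
curve = [suspicion_score(float(t), events) for t in range(9)]
```

In this sketch the score rises when actions cluster (t = 1, 2), decays during quiet periods, and rises again at t = 6, which is the kind of fluctuating trajectory a regression target can capture and a single discrete label cannot.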
Compared to discrete classification systems, our continuous suspicion modeling approach enables earlier detection and proactive intervention, greatly enhancing system explainability and practical utility in security applications.
[89] A Structured Review and Quantitative Profiling of Public Brain MRI Datasets for Foundation Model Development
Minh Sao Khue Luu,Margaret V. Benedichuk,Ekaterina I. Roppert,Roman M. Kenzhin,Bair N. Tuchinov
Main category: cs.CV
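TL;DR: This study systematically evaluates 54 public brain MRI datasets (more than 538,031 subjects), revealing marked imbalances in modality, disease coverage, and data scale as well as image-level heterogeneity. It shows that standardized preprocessing cannot fully remove inter-dataset bias and stresses that general-purpose brain MRI foundation models need preprocessing-aware and domain-adaptive strategies.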
Details
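Motivation: The development of brain MRI foundation models depends on the scale, diversity, and consistency of data, yet systematic assessments of these factors are still lacking. Method: At the dataset level, modality composition, disease coverage, and scale are analyzed; at the image level, voxel spacing, orientation, and intensity distributions are quantified; the effect of preprocessing steps (intensity normalization, bias field correction, skull stripping, spatial registration, and interpolation) on data consistency is evaluated; and a feature-space case study with a 3D DenseNet121 is used to detect covariate shift. Result: Healthy-cohort datasets are far larger than clinical ones, a marked imbalance; datasets are highly heterogeneous in their image properties; standardized preprocessing improves consistency within a dataset, but residual differences remain across datasets; and the feature-space analysis confirms a clear residual covariate shift. Conclusion: Public brain MRI resources exhibit widespread data heterogeneity and residual bias left by preprocessing, and harmonization alone cannot fully remove cross-dataset bias, so generalizable brain MRI foundation models must be designed with preprocessing-aware and domain-adaptive strategies. Abstract: The development of foundation models for brain MRI depends critically on the scale, diversity, and consistency of available data, yet systematic assessments of these factors remain scarce. In this study, we analyze 54 publicly accessible brain MRI datasets encompassing over 538,031 subjects to provide a structured, multi-level overview tailored to foundation model development. At the dataset level, we characterize modality composition, disease coverage, and dataset scale, revealing strong imbalances between large healthy cohorts and smaller clinical populations. At the image level, we quantify voxel spacing, orientation, and intensity distributions across 15 representative datasets, demonstrating substantial heterogeneity that can influence representation learning. We then perform a quantitative evaluation of preprocessing variability, examining how intensity normalization, bias field correction, skull stripping, spatial registration, and interpolation alter voxel statistics and geometry. While these steps improve within-dataset consistency, residual differences persist between datasets. Finally, a feature-space case study using a 3D DenseNet121 shows measurable residual covariate shift after standardized preprocessing, confirming that harmonization alone cannot eliminate inter-dataset bias.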
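The claim that standardized preprocessing leaves residual differences can be made concrete with a deliberately small example. The sketch below is not the paper's analysis: the voxel values are invented, z-score normalization stands in for the intensity normalization step, and skewness stands in for whatever statistics the authors actually compare. It only shows that two intensity samples can share mean 0 and standard deviation 1 after normalization and still differ in shape.

```python
import statistics

def zscore(values):
    """Per-scan z-score normalization: zero mean, unit (population) std."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

def skewness(values):
    """Third standardized moment, used here as a simple shape descriptor."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return statistics.fmean([((v - mu) / sigma) ** 3 for v in values])

# Hypothetical voxel intensities from two scanners or sites.
scan_a = [10, 12, 11, 13, 12, 11, 40, 45]          # right-skewed
scan_b = [100, 110, 120, 130, 140, 150, 160, 170]  # symmetric

norm_a, norm_b = zscore(scan_a), zscore(scan_b)
residual_gap = abs(skewness(norm_a) - skewness(norm_b))
```

After normalization both samples have matching first and second moments, yet `residual_gap` stays well above zero, which is a one-dimensional analogue of the residual covariate shift the feature-space case study reports.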
Together, these analyses provide a unified characterization of variability in public brain MRI resources and emphasize the need for preprocessing-aware and domain-adaptive strategies in the design of generalizable brain MRI foundation models.
[90] RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling
Bingjie Gao,Qianli Ma,Xiaoxue Wu,Shuai Yang,Guanzhou Lan,Haonan Zhao,Jiaxuan Chen,Qingyang Liu,Yu Qiao,Xinyuan Chen,Yaohui Wang,Li Niu
Main category: cs.CV
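TL;DR: RAPO++ is a cross-stage prompt optimization framework that leaves the generative model unchanged and substantially improves text-to-video generation quality through retrieval augmentation, iterative feedback-driven optimization, and large language model fine-tuning.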
Details
Motivation: User-provided prompts are usually short and mismatched with the training data, which limits the potential of diffusion models in text-to-video generation, so a more effective prompt optimization method is needed. Method: The framework has three stages: Stage 1 uses Retrieval-Augmented Prompt Optimization (RAPO) to enrich and refactor prompts; Stage 2 performs Sample-Specific Prompt Optimization (SSPO), iteratively refining prompts with multi-source feedback; Stage 3 fine-tunes the large language model on the optimized prompts. Result: Experiments on five mainstream T2V models and five benchmarks show that RAPO++ significantly outperforms existing methods in semantic alignment, compositional reasoning, temporal stability, and physical plausibility. Conclusion: RAPO++ is a model-agnostic, efficient, and scalable prompt optimization approach that sets a new standard for prompt engineering in text-to-video generation. Abstract: Prompt design plays a crucial role in text-to-video (T2V) generation, yet user-provided prompts are often short, unstructured, and misaligned with training data, limiting the generative potential of diffusion-based T2V models. We present \textbf{RAPO++}, a cross-stage prompt optimization framework that unifies training-data--aligned refinement, test-time iterative scaling, and large language model (LLM) fine-tuning to substantially improve T2V generation without modifying the underlying generative backbone. In \textbf{Stage 1}, Retrieval-Augmented Prompt Optimization (RAPO) enriches user prompts with semantically relevant modifiers retrieved from a relation graph and refactors them to match training distributions, enhancing compositionality and multi-object fidelity. \textbf{Stage 2} introduces Sample-Specific Prompt Optimization (SSPO), a closed-loop mechanism that iteratively refines prompts using multi-source feedback -- including semantic alignment, spatial fidelity, temporal coherence, and task-specific signals such as optical flow -- yielding progressively improved video generation quality. \textbf{Stage 3} leverages optimized prompt pairs from SSPO to fine-tune the rewriter LLM, internalizing task-specific optimization patterns and enabling efficient, high-quality prompt generation even before inference. Extensive experiments across five state-of-the-art T2V models and five benchmarks demonstrate that RAPO++ achieves significant gains in semantic alignment, compositional reasoning, temporal stability, and physical plausibility, outperforming existing methods by large margins.
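The closed-loop refinement described for Stage 2 can be sketched as a generic rewrite-and-score loop. Nothing here comes from the RAPO++ codebase: `refine_prompt`, `toy_rewrite`, and `toy_score` are hypothetical names, and the toy scorer simply counts modifiers where the real system would generate a video and combine feedback such as semantic alignment or optical flow.

```python
def refine_prompt(prompt, rewrite, score, max_rounds=3):
    """Iteratively rewrite a prompt, keeping the best-scoring version.
    Stops early when a rewrite no longer improves the feedback score."""
    best_prompt, best_score = prompt, score(prompt)
    history = [(best_prompt, best_score)]
    for _ in range(max_rounds):
        candidate = rewrite(best_prompt, best_score)
        candidate_score = score(candidate)
        history.append((candidate, candidate_score))
        if candidate_score <= best_score:
            break
        best_prompt, best_score = candidate, candidate_score
    return best_prompt, history

# Toy stand-ins (assumptions): a real rewriter would be an LLM and a real
# scorer would evaluate the generated video.
MODIFIERS = ["in slow motion", "at sunset", "with a steady camera"]

def toy_rewrite(prompt, feedback):
    for modifier in MODIFIERS:
        if modifier not in prompt:
            return f"{prompt}, {modifier}"
    return prompt

def toy_score(prompt):
    return sum(modifier in prompt for modifier in MODIFIERS)

best, history = refine_prompt(
    "a dog running on a beach", toy_rewrite, toy_score, max_rounds=5
)
```

The (original, refined) prompt pairs collected in `history` correspond loosely to the optimized prompt pairs that Stage 3 is said to use for fine-tuning the rewriter.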
Our results highlight RAPO++ as a model-agnostic, cost-efficient, and scalable solution that sets a new standard for prompt optimization in T2V generation. The code is available at https://github.com/Vchitect/RAPO.
[91] FlowCycle: Pursuing Cycle-Consistent Flows for Text-based Editing
Yanghao Wang,Zhen Wang,Long Chen
Main category: cs.CV
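TL;DR: This paper proposes FlowCycle, an inversion-free, flow-based image editing framework that uses learnable noise and cycle-consistent optimization to produce a target-aware intermediate state, enabling high-quality and consistent text-guided image editing.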
Details
Motivation: Existing methods ignore the semantic gap to the specific editing target when constructing the intermediate state, leading to limited editability or inconsistent results, especially for large modifications. A target-aware intermediate state is therefore needed to improve editing performance. Method: FlowCycle parameterizes the corruption process with learnable noise and optimizes it for cycle consistency under dual consistency constraints on forward editing and backward recovery, producing a target-aware intermediate state and enabling inversion-free editing with flow models. Result: Experiments show that FlowCycle outperforms current state-of-the-art methods in editing quality and source-image consistency, particularly in large-scale editing scenarios. Conclusion: With its target-aware intermediate state, FlowCycle balances editing fidelity and source consistency, offering a new and efficient paradigm for flow-based, text-guided image editing. Abstract: Recent advances in pre-trained text-to-image flow models have enabled remarkable progress in text-based image editing. Mainstream approaches always adopt a corruption-then-restoration paradigm, where the source image is first corrupted into an ``intermediate state'' and then restored to the target image under the prompt guidance. However, current methods construct this intermediate state in a target-agnostic manner, i.e., they primarily focus on realizing source image reconstruction while neglecting the semantic gaps towards the specific editing target. This design inherently results in limited editability or inconsistency when the desired modifications substantially deviate from the source. In this paper, we argue that the intermediate state should be target-aware, i.e., selectively corrupting editing-relevant contents while preserving editing-irrelevant ones. To this end, we propose FlowCycle, a novel inversion-free and flow-based editing framework that parameterizes corruption with learnable noises and optimizes them through a cycle-consistent process. By iteratively editing the source to the target and recovering back to the source with dual consistency constraints, FlowCycle learns to produce a target-aware intermediate state, enabling faithful modifications while preserving source consistency. Extensive ablations have demonstrated that FlowCycle achieves superior editing quality and consistency over state-of-the-art methods.
[92] Towards Objective Obstetric Ultrasound Assessment: Contrastive Representation Learning for Fetal Movement Detection
Talha Ilyas,Duong Nhu,Allison Thomas,Arie Levin,Lim Wei Yap,Shu Gong,David Vera Anaya,Yiwen Jiang,Deval Mehta,Ritesh Warty,Vinayak Smith,Maya Reddy,Euan Wallace,Wenlong Cheng,Zongyuan Ge,Faezeh Marzbanrad
Main category: cs.CV
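TL;DR: Proposes CURL, a fetal movement detection framework based on self-supervised contrastive learning that uses a dual-contrastive loss and a task-specific sampling strategy to detect fetal movement accurately in long ultrasound videos, performing well on data from 92 subjects.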
Details
Motivation: Traditional methods such as maternal perception and CTG are subjective and insufficiently accurate, making it hard to assess fetal movement reliably, so an objective, automated detection method is needed. Method: Contrastive Ultrasound Video Representation Learning (CURL) performs self-supervised learning with a spatial-temporal dual-contrastive loss, uses a task-specific sampling strategy, and applies probabilistic fine-tuning to allow flexible inference on ultrasound videos of arbitrary length. Result: On an ultrasound dataset of 92 subjects with 30 minutes of recording each, CURL reaches a sensitivity of 78.01% and an AUROC of 81.60%. Conclusion: CURL learns effective fetal-movement representations and offers a feasible approach to unsupervised fetal movement analysis, with the potential to make prenatal monitoring more objective and to support clinical decision-making. Abstract: Accurate fetal movement (FM) detection is essential for assessing prenatal health, as abnormal movement patterns can indicate underlying complications such as placental dysfunction or fetal distress. Traditional methods, including maternal perception and cardiotocography (CTG), suffer from subjectivity and limited accuracy. To address these challenges, we propose Contrastive Ultrasound Video Representation Learning (CURL), a novel self-supervised learning framework for FM detection from extended fetal ultrasound video recordings. Our approach leverages a dual-contrastive loss, incorporating both spatial and temporal contrastive learning, to learn robust motion representations. Additionally, we introduce a task-specific sampling strategy, ensuring the effective separation of movement and non-movement segments during self-supervised training, while enabling flexible inference on arbitrarily long ultrasound recordings through a probabilistic fine-tuning approach. Evaluated on an in-house dataset of 92 subjects, each with 30-minute ultrasound sessions, CURL achieves a sensitivity of 78.01% and an AUROC of 81.60%, demonstrating its potential for reliable and objective FM analysis. These results highlight the potential of self-supervised contrastive learning for fetal movement analysis, paving the way for improved prenatal monitoring and clinical decision-making.
[93] EditInfinity: Image Editing with Binary-Quantized Generative Models
Jiahuan Wang,Yuxin Chen,Jun Yu,Guangming Lu,Wenjie Pei
Main category: cs.CV
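TL;DR: This paper proposes EditInfinity, which adapts the binary-quantized generative model Infinity for efficient text-driven image editing and exploits its exactly attainable intermediate quantized representations to avoid the approximation errors of diffusion-model image inversion.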
Details
Motivation: Existing diffusion-based image editing methods are limited by the lack of exact supervision for the intermediate steps of image inversion, so approximation errors hurt editing performance. A parameter-efficient method that achieves precise image inversion is therefore needed. Method: EditInfinity builds on a VQ-based generative model, with an efficient image inversion mechanism that combines text prompting rectification and image style preservation, plus a holistic smoothing strategy to improve editing fidelity and semantic alignment. Result: Experiments on the PIE-Bench benchmark across the "add", "change", and "delete" editing operations show that EditInfinity outperforms existing diffusion-based baselines. Conclusion: By exploiting the exact intermediate representations of VQ models and an efficient adaptation mechanism, EditInfinity achieves high-quality, high-fidelity text-driven image editing with low tuning overhead. Abstract: Adapting pretrained diffusion-based generative models for text-driven image editing with negligible tuning overhead has demonstrated remarkable potential. A classical adaptation paradigm, as followed by these methods, first infers the generative trajectory inversely for a given source image by image inversion, then performs image editing along the inferred trajectory guided by the target text prompts. However, the performance of image editing is heavily limited by the approximation errors introduced during image inversion by diffusion models, which arise from the absence of exact supervision in the intermediate generative steps. To circumvent this issue, we investigate the parameter-efficient adaptation of VQ-based generative models for image editing, and leverage their inherent characteristic that the exact intermediate quantized representations of a source image are attainable, enabling more effective supervision for precise image inversion. Specifically, we propose \emph{EditInfinity}, which adapts \emph{Infinity}, a binary-quantized generative model, for image editing. We propose an efficient yet effective image inversion mechanism that integrates text prompting rectification and image style preservation, enabling precise image inversion. Furthermore, we devise a holistic smoothing strategy which allows our \emph{EditInfinity} to perform image editing with high fidelity to source images and precise semantic alignment to the text prompts.
Extensive experiments on the PIE-Bench benchmark across "add", "change", and "delete" editing operations demonstrate the superior performance of our model compared to state-of-the-art diffusion-based baselines. Code available at: https://github.com/yx-chen-ust/EditInfinity.
[94] Why LVLMs Are More Prone to Hallucinations in Longer Responses: The Role of Context
Ge Zheng,Jiaye Qian,Jiajin Tang,Sibei Yang
Main category: cs.CV
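TL;DR: This paper proposes a new "induce-detect-suppress" framework to mitigate hallucinations of large vision-language models (LVLMs) in long-form generation, and verifies that reliance on context, rather than length itself, is the key factor behind hallucinations.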
Details
Motivation: The study investigates the root cause of LVLMs producing more hallucinations in longer free-form responses: whether generation length itself is responsible or a deeper mechanism is at work. Method: Following a series of preliminary experiments, an 'induce-detect-suppress' framework is proposed: deliberately designed contexts actively induce hallucinations, the induced instances enable early detection of high-risk cases, and object-level hallucinations are suppressed during actual decoding. Result: The method yields significant and consistent improvements on multiple benchmarks, showing strong hallucination detection and suppression. Conclusion: Hallucination risk stems mainly from the reliance on context to keep long responses coherent and complete, not from response length itself; the study both validates this hypothesis and offers a new perspective on hallucination mechanisms in LVLMs. Abstract: Large Vision-Language Models (LVLMs) have made significant progress in recent years but are also prone to hallucination issues. They exhibit more hallucinations in longer, free-form responses, often attributed to accumulated uncertainties. In this paper, we ask: Does increased hallucination result solely from length-induced errors, or is there a deeper underlying mechanism? After a series of preliminary experiments and findings, we suggest that the risk of hallucinations is not caused by length itself but by the increased reliance on context for coherence and completeness in longer responses. Building on these insights, we propose a novel "induce-detect-suppress" framework that actively induces hallucinations through deliberately designed contexts, leverages induced instances for early detection of high-risk cases, and ultimately suppresses potential object-level hallucinations during actual decoding. Our approach achieves consistent, significant improvements across all benchmarks, demonstrating its efficacy. The strong detection and improved hallucination mitigation not only validate our framework but, more importantly, re-validate our hypothesis on context. Rather than solely pursuing performance gains, this study aims to provide new insights and serves as a first step toward a deeper exploration of hallucinations in LVLMs' longer responses.
[95] COS3D: Collaborative Open-Vocabulary 3D Segmentation
Runsong Zhu,Ka-Hei Hui,Zhengzhe Liu,Qianyi Wu,Weiliang Tang,Shi Qiu,Pheng-Ann Heng,Chi-Wing Fu
Main category: cs.CV
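TL;DR: This paper proposes COS3D, a new collaborative prompt-segmentation framework for open-vocabulary 3D segmentation. By introducing a collaborative field comprising an instance field and a language field, together with a two-stage training strategy and adaptive prompt refinement, it integrates language and segmentation cues effectively and achieves leading performance on multiple benchmarks.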
Details
Motivation: Existing Gaussian-splatting-based methods for open-vocabulary 3D segmentation suffer from poor segmentation quality or error accumulation and struggle to combine language and segmentation information effectively. Method: A collaborative field comprising an instance field and a language field is introduced; an instance-to-language feature mapping and a two-stage training strategy are designed; and an adaptive language-to-instance prompt refinement is applied at inference. Result: The method clearly outperforms existing approaches on two mainstream benchmarks and shows potential in applications such as novel image-based 3D segmentation, hierarchical segmentation, and robotics. Conclusion: By jointly modeling language and instance information, COS3D achieves high-quality open-vocabulary 3D segmentation and provides an effective framework for follow-up research and applications. Abstract: Open-vocabulary 3D segmentation is a fundamental yet challenging task, requiring a mutual understanding of both segmentation and language. However, existing Gaussian-splatting-based methods rely either on a single 3D language field, leading to inferior segmentation, or on pre-computed class-agnostic segmentations, suffering from error accumulation. To address these limitations, we present COS3D, a new collaborative prompt-segmentation framework that effectively integrates complementary language and segmentation cues throughout its entire pipeline. We first introduce the new concept of collaborative field, comprising an instance field and a language field, as the cornerstone for collaboration. During training, to effectively construct the collaborative field, our key idea is to capture the intrinsic relationship between the instance field and language field, through a novel instance-to-language feature mapping and an efficient two-stage training strategy. During inference, to bridge distinct characteristics of the two fields, we further design an adaptive language-to-instance prompt refinement, promoting high-quality prompt-segmentation inference. Extensive experiments not only demonstrate COS3D's leading performance over existing methods on two widely-used benchmarks but also show its high potential for various applications, i.e., novel image-based 3D segmentation, hierarchical segmentation, and robotics. The code is publicly available at https://github.com/Runsong123/COS3D.
[96] Empower Words: DualGround for Structured Phrase and Sentence-Level Temporal Grounding
Minseok Kang,Minhyeok Lee,Minjung Kim,Donghyeong Kim,Sangyoun Lee
Main category: cs.CV
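TL;DR: This paper proposes DualGround, a dual-branch architecture that improves video temporal grounding by separating sentence-level and phrase-level semantics, achieving fine-grained temporal alignment and state-of-the-art performance on multiple benchmarks.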
Details
Motivation: Existing methods treat all text tokens identically in cross-modal attention, rely too heavily on the global semantics of the [EOS] token, and neglect word-level signals, which limits fine-grained temporal alignment. Method: DualGround uses a dual-branch structure: the [EOS] token is processed through a sentence-level path, while word tokens are clustered into phrase-level units for localized grounding; token-role-aware cross-modal interaction strategies are introduced, and sentence-level and phrase-level semantics are modeled jointly. Result: DualGround achieves state-of-the-art performance on the Moment Retrieval and Highlight Detection tasks on the QVHighlights and Charades-STA datasets. Conclusion: By decoupling global and local semantic modeling, DualGround improves fine-grained temporal grounding in video-language alignment. Abstract: Video Temporal Grounding (VTG) aims to localize temporal segments in long, untrimmed videos that align with a given natural language query. This task typically comprises two subtasks: Moment Retrieval (MR) and Highlight Detection (HD). While recent advances have been driven by powerful pretrained vision-language models such as CLIP and InternVideo2, existing approaches commonly treat all text tokens uniformly during cross-modal attention, disregarding their distinct semantic roles. To validate the limitations of this approach, we conduct controlled experiments demonstrating that VTG models overly rely on [EOS]-driven global semantics while failing to effectively utilize word-level signals, which limits their ability to achieve fine-grained temporal alignment. Motivated by this limitation, we propose DualGround, a dual-branch architecture that explicitly separates global and local semantics by routing the [EOS] token through a sentence-level path and clustering word tokens into phrase-level units for localized grounding. Our method introduces (1) token-role-aware cross-modal interaction strategies that align video features with sentence-level and phrase-level semantics in a structurally disentangled manner, and (2) a joint modeling framework that not only improves global sentence-level alignment but also enhances fine-grained temporal grounding by leveraging structured phrase-aware context. This design allows the model to capture both coarse and localized semantics, enabling more expressive and context-aware video grounding.
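The token routing described above can be pictured with a small sketch. It is not DualGround's implementation: the abstract does not say how word tokens are clustered, so the greedy grouping of adjacent tokens by cosine similarity, the `threshold` value, the mean pooling, and the 2-D toy embeddings are all assumptions chosen only to show the separation of a sentence-level vector from phrase-level units.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def route_tokens(tokens, embeddings, eos="[EOS]"):
    """Send the [EOS] embedding to the sentence-level path, the rest to the word path."""
    eos_index = tokens.index(eos)
    sentence_vec = embeddings[eos_index]
    words = [
        (tok, emb)
        for i, (tok, emb) in enumerate(zip(tokens, embeddings))
        if i != eos_index
    ]
    return sentence_vec, words

def group_into_phrases(words, threshold=0.8):
    """Assumed stand-in for clustering: extend the current phrase while the next
    token is similar enough to the previous one, otherwise start a new phrase."""
    phrases = []
    for token, vec in words:
        if phrases and cosine(phrases[-1][-1][1], vec) >= threshold:
            phrases[-1].append((token, vec))
        else:
            phrases.append([(token, vec)])
    return phrases

def mean_pool(vectors):
    """Average a list of equal-length vectors into one phrase-level vector."""
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(len(vectors[0]))]

# Hypothetical query tokens and 2-D embeddings.
tokens = ["a", "man", "opens", "the", "door", "[EOS]"]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.2, 1.0], [0.5, 0.5]]

sentence_vec, words = route_tokens(tokens, embeddings)
phrases = group_into_phrases(words)
phrase_tokens = [[tok for tok, _ in phrase] for phrase in phrases]
phrase_vecs = [mean_pool([vec for _, vec in phrase]) for phrase in phrases]
```

With these toy values the word path yields two phrase units, "a man" and "opens the door", each of which could then attend to video features separately from the sentence-level vector.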
DualGround achieves state-of-the-art performance on both Moment Retrieval and Highlight Detection tasks across QVHighlights and Charades-STA benchmarks, demonstrating the effectiveness of disentangled semantic modeling in video-language alignment.
[97] Seeing the Unseen: Mask-Driven Positional Encoding and Strip-Convolution Context Modeling for Cross-View Object Geo-Localization
Shuhan Hu,Yiru Li,Yuanyuan Li,Yingying Zhu
Main category: cs.CV
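TL;DR: This paper proposes a mask-based positional encoding scheme (MPE) and a context enhancement module (CEM) and builds the EDGeo framework on them to improve the accuracy of cross-view object geo-localization. It is especially effective for large-span objects such as elongated buildings and achieves state-of-the-art performance on public datasets.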
Details
Motivation: Existing methods rely on keypoint-based positional encoding that captures only 2D coordinates and ignores object shape, making them sensitive to annotation shifts and limiting cross-view matching. Large-span objects in satellite imagery (such as elongated buildings) pose a further localization challenge. Method: Mask-based positional encoding (MPE) uses segmentation masks to capture both spatial coordinates and object silhouettes; a context enhancement module (CEM) uses horizontal and vertical strip convolution kernels to extract long-range contextual features; MPE and CEM are combined into the end-to-end EDGeo framework. Result: Extensive experiments on two public datasets, CVOGL and VIGOR-Building, show a 3.39% improvement in localization accuracy under challenging ground-to-satellite scenarios, reaching the state of the art. Conclusion: By introducing object-aware positional encoding and context modeling, the EDGeo framework markedly improves the robustness and accuracy of cross-view object geo-localization and offers a new paradigm for research in this area. Abstract: Cross-view object geo-localization enables high-precision object localization through cross-view matching, with critical applications in autonomous driving, urban management, and disaster response. However, existing methods rely on keypoint-based positional encoding, which captures only 2D coordinates while neglecting object shape information, resulting in sensitivity to annotation shifts and limited cross-view matching capability. To address these limitations, we propose a mask-based positional encoding (MPE) scheme that leverages segmentation masks to capture both spatial coordinates and object silhouettes, thereby upgrading the model from "location-aware" to "object-aware." Furthermore, to tackle the challenge of large-span objects (e.g., elongated buildings) in satellite imagery, we design a context enhancement module (CEM). This module employs horizontal and vertical strip convolutional kernels to extract long-range contextual features, enhancing feature discrimination among strip-like objects. Integrating MPE and CEM, we present EDGeo, an end-to-end framework for robust cross-view object geo-localization. Extensive experiments on two public datasets (CVOGL and VIGOR-Building) demonstrate that our method achieves state-of-the-art performance, with a 3.39% improvement in localization accuracy under challenging ground-to-satellite scenarios. This work provides a robust positional encoding paradigm and a contextual modeling framework for advancing cross-view geo-localization research.
[98] Calibrating Multimodal Consensus for Emotion Recognition
Guowei Zhong,Junjie Li,Huaiyu Zhu,Ruohong Huan,Yun Pan
Main category: cs.CV
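TL;DR: Proposes a model called Calibrated Multimodal Consensus (CMC), which uses pseudo-label generation and a parameter-free fusion module to mitigate semantic inconsistency and text dominance in multimodal emotion recognition, performing strongly on multiple datasets.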
Details
Motivation: Existing methods overlook cross-modal semantic inconsistencies and are easily dominated by the text modality, which hurts emotion recognition accuracy. Method: A Pseudo Label Generation Module (PLGM) enables self-supervised unimodal pretraining, and a Parameter-free Fusion Module (PFM) combined with a Multimodal Consensus Router (MCR) is used for multimodal finetuning, mitigating text dominance and making fusion more reliable. Result: CMC matches or exceeds the current state of the art on four datasets (CH-SIMS, CH-SIMS v2, CMU-MOSI, and CMU-MOSEI), and stands out in scenarios with semantic inconsistencies. Conclusion: CMC effectively addresses modality conflict and text dominance in multimodal emotion recognition, improving robustness and accuracy. Abstract: In recent years, Multimodal Emotion Recognition (MER) has made substantial progress. Nevertheless, most existing approaches neglect the semantic inconsistencies that may arise across modalities, such as conflicting emotional cues between text and visual inputs. Besides, current methods are often dominated by the text modality due to its strong representational capacity, which can compromise recognition accuracy. To address these challenges, we propose a model termed Calibrated Multimodal Consensus (CMC). CMC introduces a Pseudo Label Generation Module (PLGM) to produce pseudo unimodal labels, enabling unimodal pretraining in a self-supervised fashion. It then employs a Parameter-free Fusion Module (PFM) and a Multimodal Consensus Router (MCR) for multimodal finetuning, thereby mitigating text dominance and guiding the fusion process toward a more reliable consensus. Experimental results demonstrate that CMC achieves performance on par with or superior to state-of-the-art methods across four datasets, CH-SIMS, CH-SIMS v2, CMU-MOSI, and CMU-MOSEI, and exhibits notable advantages in scenarios with semantic inconsistencies on CH-SIMS and CH-SIMS v2. The implementation of this work is publicly accessible at https://github.com/gw-zhong/CMC.
[99] Real-Time Currency Detection and Voice Feedback for Visually Impaired Individuals
Saraf Anzum Shreya,MD. Abu Ismail Siddique,Sharaf Tasnim
Main category: cs.CV
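TL;DR: This paper proposes a real-time currency detection system based on YOLOv8 nano to help visually impaired people independently identify US dollar, Euro, and Bangladeshi taka notes and coins. It combines deep convolutional layers and Squeeze-and-Excitation blocks to improve detection accuracy and provides voice feedback for ease of use.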
Details
Motivation: Visually impaired people have difficulty handling money in daily life and completing transactions independently, and existing assistive technologies remain inadequate, so an efficient and accurate real-time currency recognition system is needed to increase their autonomy. Method: A YOLOv8 nano model is extended with a custom detection head containing deep convolutional layers and Squeeze-and-Excitation blocks to strengthen feature extraction; the model is trained on a currency dataset with 30 classes (covering US dollar, Euro, and Bangladeshi taka), runs real-time detection on a smartphone, and integrates voice feedback. Result: The model achieves 97.73% accuracy, 95.23% recall, a 95.85% F1-score, and 97.21% mAP50(B), showing strong detection performance with real-time operation and voice output. Conclusion: The system increases the independence of visually impaired people in recognizing currency and has practical value; it could be extended to more currencies or integrated into more mobile devices in the future. Abstract: Technologies like smartphones have become an essential part of our daily lives and are accessible to everyone, including visually impaired individuals. Smartphone cameras have made image capture and processing more convenient, and combining smartphones with machine learning can make the lives of visually impaired people a little easier. Daily tasks such as handling money without relying on someone else can be troublesome for them. To that end, this paper presents a real-time currency detection system designed to assist visually impaired individuals. The proposed model is trained on a dataset containing 30 classes of notes and coins, representing 3 types of currency: US dollar (USD), Euro (EUR), and Bangladeshi taka (BDT). Our approach uses a YOLOv8 nano model with a custom detection head featuring deep convolutional layers and Squeeze-and-Excitation blocks to enhance feature extraction and detection accuracy. Our model achieves an accuracy of 97.73%, a recall of 95.23%, an F1-score of 95.85%, and a mean Average Precision at IoU=0.5 (mAP50(B)) of 97.21%. Voice feedback after detection helps visually impaired users identify the currency. This paper aims to create a practical and efficient currency detection system that empowers visually impaired individuals to handle money independently.
[100] GMFVAD: Using Grained Multi-modal Feature to Improve Video Anomaly Detection
Guangyu Dai,Dong Chen,Siliang Tang,Yueting Zhuang
Main category: cs.CV
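TL;DR: Proposes GMFVAD, a fine-grained feature method based on multi-modal information for video anomaly detection, which reduces redundant information in visual features and achieves state-of-the-art performance on four mainstream datasets.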
Details
Motivation: Previous multi-modal video anomaly detection methods fuse text features coarsely and overlook the large amount of redundant information that may exist in video snippets, so a finer-grained multi-modal feature fusion method is needed to improve detection. Method: The GMFVAD model generates finer-grained multi-modal features, uses the textual descriptions of video snippets to enhance the visual features of key portions, and exploits the diversity of multi-modal information to refine the feature representation. Result: The method reaches SOTA performance on four mainstream datasets, and ablation experiments confirm that the gain comes from reducing redundant information. Conclusion: By fusing multi-modal information at a fine granularity, GMFVAD reduces redundancy in visual features and markedly improves video anomaly detection accuracy. Abstract: Video anomaly detection (VAD) is a challenging task that detects anomalous frames in continuous surveillance videos. Most previous work utilizes the spatio-temporal correlation of visual features to distinguish whether there are abnormalities in video snippets. Recently, some works have attempted to introduce multi-modal information, such as text features, to enhance the results of video anomaly detection. However, these works merely incorporate text features into video snippets in a coarse manner, overlooking the significant amount of redundant information that may exist within the video snippets. Therefore, we propose to leverage the diversity among multi-modal information to further refine the extracted features and reduce the redundancy in visual features, and we propose Grained Multi-modal Feature for Video Anomaly Detection (GMFVAD). Specifically, we generate finer-grained multi-modal features based on the video snippet, which summarize the main content, and text features based on the captions of the original video are introduced to further enhance the visual features of highlighted portions. Experiments show that the proposed GMFVAD achieves state-of-the-art performance on four mainstream datasets. Ablation experiments also validate that the improvement of GMFVAD is due to the reduction of redundant information.
[101] Causal Debiasing for Visual Commonsense Reasoning
Jiayi Zou,Gengyun Jia,Bing-Kun Bao
Main category: cs.CV
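TL;DR: This paper introduces the VCR-OOD datasets to evaluate model generalization across the visual and textual modalities, and removes co-occurrence and statistical biases in the data with a backdoor adjustment method.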
Details
Motivation: Existing visual commonsense reasoning methods overlook dataset bias and lack effective debiasing strategies, so models rely on shortcuts for prediction and generalize poorly. Method: Two new datasets, VCR-OOD-QA and VCR-OOD-VA, are constructed; the causal graphs and prediction shortcuts in the VCR task are analyzed; and backdoor adjustment with a dictionary built from the set of correct answers is used to remove bias. Result: Experiments show that the proposed debiasing method improves generalization across multiple datasets. Conclusion: Constructing debiased datasets and introducing causal adjustment mitigates modality bias in the VCR task and strengthens model robustness and interpretability. Abstract: Visual Commonsense Reasoning (VCR) refers to answering questions and providing explanations based on images. While existing methods achieve high prediction accuracy, they often overlook bias in datasets and lack debiasing strategies. In this paper, our analysis reveals co-occurrence and statistical biases in both textual and visual data. We introduce the VCR-OOD datasets, comprising VCR-OOD-QA and VCR-OOD-VA subsets, which are designed to evaluate the generalization capabilities of models across two modalities. Furthermore, we analyze the causal graphs and prediction shortcuts in VCR and adopt a backdoor adjustment method to remove bias. Specifically, we create a dictionary based on the set of correct answers to eliminate prediction shortcuts. Experiments demonstrate the effectiveness of our debiasing method across different datasets.
[102] Knowledge-Informed Neural Network for Complex-Valued SAR Image Recognition
Haodong Yang,Zhongling Huang,Shaojie Guo,Zhe Zhang,Gong Cheng,Junwei Han
Main category: cs.CV
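TL;DR: Proposes KINN, a knowledge-informed neural network that uses physical priors and a compact architecture to address the representation trilemma in CV-SAR image recognition, achieving efficient, interpretable, and generalizable recognition in data-scarce and out-of-distribution scenarios.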
Details
Motivation: Conventional data-driven models fail to make full use of the electromagnetic scattering features in CV-SAR data, so under limited data and domain shift they struggle to optimize generalization, interpretability, and efficiency at the same time. Method: The KINN framework follows a "compression-aggregation-compression" structure: the first stage compresses by embedding prior knowledge through a physics-guided dictionary processor; the second stage aggregates features; the third stage performs semantic compression with a compact, self-distilled classification head. CNN and ViT variants are built. Result: Validated on five SAR benchmarks, KINN clearly outperforms existing methods with only 0.7M-0.95M parameters, shows excellent few-shot and out-of-distribution generalization, and provides interpretable feature representations. Conclusion: KINN addresses the representation trilemma in CV-SAR image recognition and opens a new path for trustworthy AI in SAR analysis. Abstract: Deep learning models for complex-valued Synthetic Aperture Radar (CV-SAR) image recognition are fundamentally constrained by a representation trilemma under data-limited and domain-shift scenarios: the concurrent, yet conflicting, optimization of generalization, interpretability, and efficiency. Our work is motivated by the premise that the rich electromagnetic scattering features inherent in CV-SAR data hold the key to resolving this trilemma, yet they are insufficiently harnessed by conventional data-driven models. To this end, we introduce the Knowledge-Informed Neural Network (KINN), a lightweight framework built upon a novel "compression-aggregation-compression" architecture. The first stage performs a physics-guided compression, wherein a novel dictionary processor adaptively embeds physical priors, enabling a compact unfolding network to efficiently extract sparse, physically-grounded signatures. A subsequent aggregation module enriches these representations, followed by a final semantic compression stage that utilizes a compact classification head with self-distillation to learn maximally task-relevant and discriminative embeddings. We instantiate KINN in both CNN (0.7M parameters) and Vision Transformer (0.95M parameters) variants.
Extensive evaluations on five SAR benchmarks confirm that KINN establishes a state-of-the-art in parameter-efficient recognition, offering exceptional generalization in data-scarce and out-of-distribution scenarios and tangible interpretability, thereby providing an effective solution to the representation trilemma and offering a new path for trustworthy AI in SAR image analysis.
[103] DMC$^3$: Dual-Modal Counterfactual Contrastive Construction for Egocentric Video Question Answering
Jiayi Zou,Chaofan Chen,Bing-Kun Bao,Changsheng Xu
Main category: cs.CV
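TL;DR: Proposes a Dual-Modal Counterfactual Contrastive Construction framework (DMC$^3$) to address multi-event understanding and hand-object interaction recognition in egocentric video question answering; through counterfactual sample construction and contrastive optimization, it reaches state-of-the-art performance on multiple datasets.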
Details
Motivation: Existing methods ignore the unique challenges of the first-person perspective, such as multi-event understanding and hand-object interaction recognition, so more effective modeling is needed to improve Egocentric VideoQA. Method: The DMC$^3$ framework comprises a baseline model, a counterfactual sample construction module (positive and negative samples are generated by event description paraphrasing for the textual modality and by core interaction mining for the visual modality), and a counterfactual sample-involved contrastive optimization module that uses a contrastive loss to pull positive samples closer and push negative samples away. Result: The method achieves 52.51% and 46.04% on the normal and indirect splits of EgoTaskQA and 13.2% on QAEGO4D, all state-of-the-art results. Conclusion: By introducing dual-modal contrastive learning with counterfactual samples, DMC$^3$ improves the understanding of complex events and interactions in egocentric video question answering and clearly outperforms existing methods. Abstract: Egocentric Video Question Answering (Egocentric VideoQA) plays an important role in egocentric video understanding, which refers to answering questions based on first-person videos. Although existing methods have made progress through the paradigm of pre-training and fine-tuning, they ignore the unique challenges posed by the first-person perspective, such as understanding multiple events and recognizing hand-object interactions. To deal with these challenges, we propose a Dual-Modal Counterfactual Contrastive Construction (DMC$^3$) framework, which contains an egocentric VideoQA baseline, a counterfactual sample construction module, and a counterfactual sample-involved contrastive optimization module. Specifically, we first develop a counterfactual sample construction module to generate positive and negative samples for textual and visual modalities through event description paraphrasing and core interaction mining, respectively. Then, we feed these samples together with the original samples into the baseline. Finally, in the counterfactual sample-involved contrastive optimization module, we apply contrastive loss to minimize the distance between the original sample features and the positive sample features, while maximizing the distance from the negative samples. Experiments show that our method achieves 52.51\% and 46.04\% on the \textit{normal} and \textit{indirect} splits of EgoTaskQA, and 13.2\% on QAEGO4D, reaching state-of-the-art performance on both datasets.
[104] UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning
Liangyu Chen,Hanzhang Zhou,Chenglin Cai,Jianan Zhang,Panrong Tong,Quyu Kong,Xu Zhang,Chen Liu,Yuqi Liu,Wenxuan Wang,Yue Wang,Qin Jin,Steven Hoi
Main category: cs.CV
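TL;DR: This paper proposes the Instruction-as-Reasoning paradigm, which treats natural-language instructions as dynamic analytical pathways and improves GUI element grounding through a two-stage training framework (SFT + RL); it significantly outperforms existing methods and exhibits emergent reasoning and agentic potential.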
Details
Motivation: Existing GUI grounding research overlooks how instruction diversity and quality affect performance, and the datasets contain many flawed instructions, which limits practical model performance.
Method: Proposes the Instruction-as-Reasoning paradigm with two-stage training: supervised fine-tuning on synthesized, diverse instructions to build multi-perspective reasoning, followed by reinforcement learning to optimize the selection and composition of reasoning pathways.
Result: UI-Ins-7B and UI-Ins-32B reach SOTA on five benchmarks; UI-Ins-32B achieves 87.3%, 57.0%, and 84.9% accuracy on UI-I2E-Bench, ScreenSpot-Pro, and MMBench-GUI L2, respectively, and the approach shows strong agentic ability with a 74.1% success rate on AndroidWorld.
Conclusion: Instructions should be treated as dynamic reasoning pathways rather than static proxies for intent; the method improves GUI grounding, elicits emergent reasoning, and mitigates policy collapse in the SFT+RL framework.
Abstract: GUI grounding, which maps natural-language instructions to actionable UI elements, is a core capability of GUI agents. Prior work largely treats instructions as a static proxy for user intent, overlooking the impact of instruction diversity and quality on grounding performance. Through a careful investigation of existing grounding datasets, we find a 23.3% flaw rate in their instructions and show that inference-time exploitation of instruction diversity yields up to a 76% relative performance improvement. In this paper, we introduce the Instruction-as-Reasoning paradigm, treating instructions as dynamic analytical pathways that offer distinct perspectives and enabling the model to select the most effective pathway during reasoning. To achieve this, we propose a two-stage training framework: supervised fine-tuning (SFT) on synthesized, diverse instructions to instill multi-perspective reasoning, followed by reinforcement learning (RL) to optimize pathway selection and composition. Our resulting models, UI-Ins-7B and UI-Ins-32B, achieve state-of-the-art results on five challenging grounding benchmarks and exhibit emergent reasoning, selectively composing and synthesizing novel instruction pathways at inference. In particular, UI-Ins-32B attains the best grounding accuracy, scoring 87.3% on UI-I2E-Bench, 57.0% on ScreenSpot-Pro, and 84.9% on MMBench-GUI L2. Furthermore, our model demonstrates strong agentic potential, achieving a 74.1% success rate on AndroidWorld using UI-Ins-7B as the executor.
Our in-depth analysis reveals additional insights, such as how reasoning can be formulated to enhance rather than hinder grounding performance, and how our method mitigates policy collapse in the SFT+RL framework. All code and model checkpoints will be publicly released at https://github.com/alibaba/UI-Ins.
[105] Breakdance Video classification in the age of Generative AI
Sauptik Dhar,Naveen Ramakrishnan,Michelle Munson
Main category: cs.CV
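TL;DR: This paper examines how well modern video foundation models apply to breakdance, a niche yet popular dance sport; it finds that video encoder models outperform state-of-the-art video language models on prediction tasks and provides an in-depth analysis of a fine-tuned decoder model for breakdance video classification.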
Details
Motivation: Existing work concentrates on mainstream sports and pays little attention to niche sports such as breakdance; this paper explores how well video foundation models apply to such specialized domains.
Method: Applies modern video foundation models (both encoders and decoders) to breakdance video data, compares encoders against video language models, and analyzes a fine-tuned decoder model in detail.
Result: Video encoder models outperform current state-of-the-art video language models on prediction tasks; the study also shows the effect of encoder choice and how the decoder works in classification.
Conclusion: Video encoders are better suited to video understanding for niche sports such as breakdance, and appropriately fine-tuning the decoder can improve classification, offering a reference for applications in non-mainstream sports.
Abstract: Large Vision Language models have recently been applied widely in several sports use cases. Most of these works target a limited subset of popular sports such as soccer, cricket, and basketball, focusing on generative tasks like visual question answering and highlight generation. This work analyzes the applicability of modern video foundation models (both encoder and decoder) to a very niche but hugely popular dance sport: breakdance. Our results show that Video Encoder models continue to outperform state-of-the-art Video Language Models for prediction tasks. We provide insights on how to choose the encoder model and provide a thorough analysis of the workings of a finetuned decoder model for breakdance video classification.
[106] A Parameter-Efficient Mixture-of-Experts Framework for Cross-Modal Geo-Localization
LinFeng Li,Jian Zhao,Zepeng Yang,Yuhang Song,Bojun Lin,Tianle Zhang,Yuchen Yuan,Chi Zhang,Xuelong Li
Main category: cs.CV
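TL;DR: This paper presents a winning solution for a cross-modal drone navigation task: domain-aligned preprocessing and a Mixture-of-Experts framework address inter-platform heterogeneity and the text domain gap, achieving robust geo-localization on large-scale multi-platform data.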
Details
Motivation: Address the severe heterogeneity between cross-platform (satellite/drone/ground) images and natural-language queries, and the domain gap between training and test text.
Method: Preprocessing with platform-wise partitioning, satellite augmentation, and removal of orientation words; LLM-driven caption refinement; and, on top of BGE-M3 and EVA-CLIP, three platform experts trained with progressive two-stage hard-negative mining, with their scores fused at inference.
Result: The system ranks first on the official leaderboard and markedly improves cross-modal geo-retrieval, showing robustness under heterogeneous viewpoints.
Conclusion: The proposed domain-aligned preprocessing and MoE architecture mitigate cross-platform differences and semantic mismatch, providing a workable solution for complex cross-modal retrieval tasks.
Abstract: We present a winning solution to RoboSense 2025 Track 4: Cross-Modal Drone Navigation. The task retrieves the most relevant geo-referenced image from a large multi-platform corpus (satellite/drone/ground) given a natural-language query. Two obstacles are severe inter-platform heterogeneity and a domain gap between generic training descriptions and platform-specific test queries. We mitigate these with a domain-aligned preprocessing pipeline and a Mixture-of-Experts (MoE) framework: (i) platform-wise partitioning, satellite augmentation, and removal of orientation words; (ii) an LLM-based caption refinement pipeline to align textual semantics with the distinct visual characteristics of each platform. Using BGE-M3 (text) and EVA-CLIP (image), we train three platform experts with a progressive two-stage hard-negative mining strategy to enhance discriminative power, and fuse their scores at inference. The system tops the official leaderboard, demonstrating robust cross-modal geo-localization under heterogeneous viewpoints.
[107] HyperET: Efficient Training in Hyperbolic Space for Multi-modal Large Language Models
Zelin Peng,Zhengqin Xu,Qingyang Liu,Xiaokang Yang,Wei Shen
Main category: cs.CV
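TL;DR: This paper proposes HyperET, an efficient training paradigm for multi-modal large language models based on hyperbolic space; it aligns vision and text at arbitrary granularity levels by dynamically adjusting the hyperbolic radius, substantially reducing compute requirements.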
Details
Motivation: Because their vision encoders lack multi-granularity alignment with language, existing multi-modal large language models require extremely high computational resources to train.
Method: Exploits the inherent hierarchical structure of hyperbolic space: HyperET uses learnable matrices with Möbius multiplication to dynamically adjust the radius in hyperbolic space, achieving cross-modal alignment at multiple granularities.
Result: Experiments on multiple MLLM benchmarks show that HyperET consistently improves both pre-trained and fine-tuned models while adding less than 1% parameters.
Conclusion: HyperET offers a flexible and efficient parametrization strategy that bridges the granularity gap between the visual and textual modalities, suggesting a new direction for multi-modal alignment.
Abstract: Multi-modal large language models (MLLMs) have emerged as a transformative approach for aligning visual and textual understanding. They typically require extremely high computational resources (e.g., thousands of GPUs) for training to achieve cross-modal alignment at multi-granularity levels. We argue that a key source of this inefficiency lies in the vision encoders they commonly use, e.g., CLIP and SAM, which lack alignment with language at multi-granularity levels. To address this issue, in this paper, we leverage hyperbolic space, which inherently models hierarchical levels and thus provides a principled framework for bridging the granularity gap between visual and textual modalities at an arbitrary granularity level. Concretely, we propose an efficient training paradigm for MLLMs, dubbed HyperET, which can optimize visual representations to align with their textual counterparts at an arbitrary granularity level through dynamic hyperbolic radius adjustment in hyperbolic space. HyperET employs learnable matrices with M\"{o}bius multiplication operations, implemented via three effective configurations: diagonal scaling matrices, block-diagonal matrices, and banded matrices, providing a flexible yet efficient parametrization strategy. Comprehensive experiments across multiple MLLM benchmarks demonstrate that HyperET clearly and consistently improves both existing pre-training and fine-tuning MLLMs with less than 1\% additional parameters.
[108] AnyPcc: Compressing Any Point Cloud with a Single Universal Model
Kangli Wang,Qianxi Yi,Yuqi Ye,Shihao Li,Wei Gao
Main category: cs.CV
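TL;DR: Proposes AnyPcc, a universal point cloud compression framework; a universal context model and an instance-adaptive fine-tuning strategy substantially improve the generalization of point cloud geometry compression, achieving state-of-the-art performance on 15 different datasets.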
Details
Motivation: Generalization remains a challenge for deep learning-based point cloud geometry compression, mainly because context models are not robust enough and out-of-distribution (OOD) data is handled inefficiently.
Method: Introduces the AnyPcc framework with two core components: (1) a Universal Context Model that uses spatial and channel-wise grouping priors to capture robust contextual dependencies; (2) an Instance-Adaptive Fine-Tuning (IAFT) strategy that combines explicit and implicit compression paradigms, fine-tuning a small subset of network weights per instance and encoding them into the bitstream.
Result: On a benchmark of 15 diverse datasets, AnyPcc sets a new state of the art in point cloud compression, and the extra bit cost of the weights is far smaller than the bits saved in geometry compression.
Conclusion: AnyPcc addresses the limited generalization of point cloud compression; universal context modeling plus lightweight instance-adaptive fine-tuning yields an efficient and robust solution for learned point cloud compression.
Abstract: Generalization remains a critical challenge for deep learning-based point cloud geometry compression. We argue this stems from two key limitations: the lack of robust context models and the inefficient handling of out-of-distribution (OOD) data. To address both, we introduce AnyPcc, a universal point cloud compression framework. AnyPcc first employs a Universal Context Model that leverages priors from both spatial and channel-wise grouping to capture robust contextual dependencies. Second, our novel Instance-Adaptive Fine-Tuning (IAFT) strategy tackles OOD data by synergizing explicit and implicit compression paradigms. It fine-tunes a small subset of network weights for each instance and incorporates them into the bitstream, where the marginal bit cost of the weights is dwarfed by the resulting savings in geometry compression. Extensive experiments on a benchmark of 15 diverse datasets confirm that AnyPcc sets a new state-of-the-art in point cloud compression. Our code and datasets will be released to encourage reproducible research.
[109] AccuQuant: Simulating Multiple Denoising Steps for Quantizing Diffusion Models
Seunghoon Lee,Jeongwoo Choi,Byunggwan Son,Jaehyeon Moon,Jeimin Jeon,Bumsub Ham
Main category: cs.CV
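TL;DR: This paper proposes AccuQuant, a new post-training quantization method for diffusion models that reduces the accumulation of quantization error by explicitly simulating the sampling process over multiple denoising steps; it also presents an efficient implementation technique and a new objective that significantly lower memory complexity.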
Details
Motivation: Quantization errors in diffusion models accumulate across denoising steps during sampling and degrade performance, so a quantization method that mitigates this accumulation is needed.
Method: AccuQuant explicitly simulates the diffusion sampling process by minimizing the discrepancy between the outputs of the full-precision and quantized models over multiple denoising steps, and introduces a new objective and an efficient implementation technique to reduce memory complexity.
Result: Experiments show that AccuQuant is effective and efficient across tasks and diffusion models on standard benchmarks, with memory complexity reduced from O(n) to O(1).
Conclusion: AccuQuant addresses error accumulation in diffusion model quantization and is both efficient and general, providing a practical quantization scheme for real-world use.
Abstract: We present in this paper a novel post-training quantization (PTQ) method, dubbed AccuQuant, for diffusion models. We show analytically and empirically that quantization errors for diffusion models are accumulated over denoising steps in a sampling process. To alleviate the error accumulation problem, AccuQuant minimizes the discrepancies between outputs of a full-precision diffusion model and its quantized version within a couple of denoising steps. That is, it simulates multiple denoising steps of a diffusion sampling process explicitly for quantization, accounting for the accumulated errors over multiple denoising steps, in contrast to previous approaches that imitate the training process of diffusion models, namely, minimizing the discrepancies independently for each step. We also present an efficient implementation technique for AccuQuant, together with a novel objective, which reduces the memory complexity significantly from $\mathcal{O}(n)$ to $\mathcal{O}(1)$, where $n$ is the number of denoising steps. We demonstrate the efficacy and efficiency of AccuQuant across various tasks and diffusion models on standard benchmarks.
[110] Positional Encoding Field
Yunpeng Bai,Haoxiang Li,Qixing Huang
Main category: cs.CV
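TL;DR: This paper proposes PE-Field, a new positional encoding method that extends 2D positional encodings into a structured 3D field, enabling Diffusion Transformers to model geometry directly in 3D space; it reaches SOTA performance on single-image novel view synthesis and generalizes to spatial image editing.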
Details
Motivation: Patch tokens in existing DiTs turn out to be surprisingly independent: outputs remain globally coherent even when positional encodings are perturbed, indicating that spatial coherence is governed mainly by the positional encodings. This motivates exploring more powerful positional encoding structures to strengthen the model's understanding of 3D space.
Method: Proposes the Positional Encoding Field (PE-Field), which extends positional encodings from the 2D plane to a 3D structured field with depth-aware encodings and hierarchical sub-patch control, giving DiTs the ability to perform volumetric reasoning and fine-grained spatial control.
Result: Achieves state-of-the-art performance on single-image novel view synthesis and generalizes well to controllable spatial image editing.
Conclusion: Through structured 3D positional encodings, PE-Field substantially improves the ability of DiTs to model 3D space and gives visual generative models a stronger spatial prior.
Abstract: Diffusion Transformers (DiTs) have emerged as the dominant architecture for visual generation, powering state-of-the-art image and video models. By representing images as patch tokens with positional encodings (PEs), DiTs combine Transformer scalability with spatial and temporal inductive biases. In this work, we revisit how DiTs organize visual content and discover that patch tokens exhibit a surprising degree of independence: even when PEs are perturbed, DiTs still produce globally coherent outputs, indicating that spatial coherence is primarily governed by PEs. Motivated by this finding, we introduce the Positional Encoding Field (PE-Field), which extends positional encodings from the 2D plane to a structured 3D field. PE-Field incorporates depth-aware encodings for volumetric reasoning and hierarchical encodings for fine-grained sub-patch control, enabling DiTs to model geometry directly in 3D space. Our PE-Field-augmented DiT achieves state-of-the-art performance on single-image novel view synthesis and generalizes to controllable spatial image editing.
[111] Mitigating Cross-modal Representation Bias for Multicultural Image-to-Recipe Retrieval
Qing Wang,Chong-Wah Ngo,Yu Cao,Ee-Peng Lim
Main category: cs.CV
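TL;DR: This paper proposes a new causal representation learning method for image-to-recipe retrieval that tackles the bias caused by the visual modality ignoring non-apparent cooking details; by predicting the culinary elements overlooked in images and explicitly injecting them, it improves cross-modal retrieval performance.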
Details
Motivation: Existing methods assume that a food image fully reflects everything in its recipe, but an image shows only the appearance of the finished dish and cannot capture subtle differences in ingredient use and cooking process. Cross-modal representation learning is therefore biased toward dominant visual elements, and the bias is more severe in mixed multicultural data.
Method: Proposes a causal representation learning framework that uses causal modeling to predict key culinary elements not visible in the image (such as specific ingredients and cooking actions) and explicitly injects them into cross-modal representation learning to mitigate visual bias.
Result: Experiments on the standard Recipe1M dataset and a newly curated multilingual multicultural dataset show that the method uncovers overlooked culinary details and significantly improves retrieval in both monolingual and multilingual multicultural settings.
Conclusion: By explicitly modeling non-visual culinary elements, the proposed causal approach mitigates bias in cross-modal representation learning and strengthens fine-grained discrimination in image-to-recipe retrieval.
Abstract: Existing approaches for image-to-recipe retrieval have the implicit assumption that a food image can fully capture the details textually documented in its recipe. However, a food image only reflects the visual outcome of a cooked dish and not the underlying cooking process. Consequently, learning cross-modal representations to bridge the modality gap between images and recipes tends to ignore subtle, recipe-specific details that are not visually apparent but are crucial for recipe retrieval. Specifically, the representations are biased to capture the dominant visual elements, resulting in difficulty in ranking similar recipes with subtle differences in use of ingredients and cooking methods. The bias in representation learning is expected to be more severe when the training data is a mix of images and recipes sourced from different cuisines. This paper proposes a novel causal approach that predicts the culinary elements potentially overlooked in images, while explicitly injecting these elements into cross-modal representation learning to mitigate biases. Experiments are conducted on the standard monolingual Recipe1M dataset and a newly curated multilingual multicultural cuisine dataset. The results indicate that the proposed causal representation learning is capable of uncovering subtle ingredients and cooking actions and achieves impressive retrieval performance on both monolingual and multilingual multicultural datasets.
[112] Dynamic Weight Adjustment for Knowledge Distillation: Leveraging Vision Transformer for High-Accuracy Lung Cancer Detection and Real-Time Deployment
Saif Ur Rehman Khan,Muhammad Nabeel Asim,Sebastian Vollmer,Andreas Dengel
Main category: cs.CV
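TL;DR: Proposes FuzzyDistillViT-MobileNet, which combines dynamic fuzzy-logic-driven knowledge distillation with image fusion for lung cancer classification, achieving high accuracy on multiple datasets.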
Details
Motivation: Traditional knowledge distillation uses fixed weights and struggles with the uncertainty and complexity of image regions in lung cancer diagnosis, so a more flexible distillation mechanism is needed.
Method: Uses a Vision Transformer as the teacher and MobileNet as the student, adjusting the distillation weight dynamically with fuzzy logic; applies Gamma correction and Histogram Equalization for pixel-level image enhancement and wavelet fusion to improve resolution; selects the student model with a Genetic Algorithm.
Result: Reaches 99.16% accuracy on the LC25000 dataset and 99.54% on the IQOTH/NCCD CT-scan dataset, showing robustness across imaging modalities.
Conclusion: The method handles uncertainty in medical images effectively, improves the student model's generalization, and achieves high-performance lung cancer classification while remaining efficient.
Abstract: This paper presents the FuzzyDistillViT-MobileNet model, a novel approach for lung cancer (LC) classification, leveraging dynamic fuzzy logic-driven knowledge distillation (KD) to address uncertainty and complexity in disease diagnosis. Unlike traditional models that rely on static KD with fixed weights, our method dynamically adjusts the distillation weight using fuzzy logic, enabling the student model to focus on high-confidence regions while reducing attention to ambiguous areas. This dynamic adjustment improves the model's ability to handle varying uncertainty levels across different regions of LC images. We employ the Vision Transformer (ViT-B32) as the instructor model, which effectively transfers knowledge to the student model, MobileNet, enhancing the student's generalization capabilities. The training process is further optimized using a dynamic wait adjustment mechanism that adapts the training procedure for improved convergence and performance. To enhance image quality, we introduce pixel-level image fusion improvement techniques such as Gamma correction and Histogram Equalization. The processed images (Pix1 and Pix2) are fused using a wavelet-based fusion method to improve image resolution and feature preservation. This fusion method uses the wavedec2 function to standardize images to a 224x224 resolution, decompose them into multi-scale frequency components, and recursively average coefficients at each level for better feature representation.
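As a rough illustration of the confidence-dependent distillation weight described in this abstract, the sketch below blends a hard-label cross-entropy term with a temperature-softened teacher/student KL term. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `confidence_weight` and `distillation_loss`, the linear ramp that stands in for a fuzzy membership function, and the thresholds `low`/`high` are all hypothetical, and the loss is computed per sample rather than per image region.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_weight(teacher_probs, low=0.5, high=0.9):
    """Hypothetical ramp membership: 0 below `low`, 1 above `high`, linear between.

    Stands in for the paper's fuzzy-logic rule, which the abstract does not specify.
    """
    confidence = max(teacher_probs)
    if confidence <= low:
        return 0.0
    if confidence >= high:
        return 1.0
    return (confidence - low) / (high - low)


def distillation_loss(student_logits, teacher_logits, label, temperature=2.0):
    """Return (loss, alpha), where alpha weights the teacher term for this sample."""
    alpha = confidence_weight(softmax(teacher_logits))
    # Hard-label cross-entropy on the student's unsoftened prediction.
    ce = -math.log(softmax(student_logits)[label])
    # KL(teacher || student) on temperature-softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    return (1 - alpha) * ce + alpha * temperature ** 2 * kl, alpha
```

With this weighting, a sample on which the teacher is confident is driven mostly by the teacher's soft targets, while an ambiguous sample falls back to the ground-truth label; a real fuzzy system would replace the single ramp with several membership functions and rules.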
To address computational efficiency, a Genetic Algorithm (GA) is used to select the most suitable pre-trained student model from a pool of 12 candidates, balancing model performance with computational cost. The model is evaluated on two datasets: LC25000 histopathological images (99.16% accuracy) and IQOTH/NCCD CT-scan images (99.54% accuracy), demonstrating robustness across different imaging domains.
[113] Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence
Kun Ouyang,Yuanxin Liu,Linli Yao,Yishuo Cai,Hao Zhou,Jie Zhou,Fandong Meng,Xu Sun
Main category: cs.CV
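TL;DR: This paper proposes Conan, a framework that combines visual evidence with multi-step reasoning to improve video understanding accuracy, surpassing the baseline model by more than 10% on multiple benchmarks.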
Details
Motivation: Existing video reasoning methods either lack visual grounding and hallucinate, or localize evidence inaccurately, which makes accurate multi-step reasoning difficult.
Method: Builds Conan-91K, a large-scale dataset of automatically generated reasoning traces, and designs a multi-stage progressive cold-start strategy together with the AIR RLVR training framework to jointly optimize frame identification, reasoning, and action decision.
Result: On six multi-step reasoning benchmarks, average accuracy exceeds Qwen2.5-VL-7B-Instruct by more than 10%, and the model generalizes well to long-video understanding tasks.
Conclusion: Conan achieves evidence-grounded multi-step video reasoning with significantly better performance, good scalability, and robustness.
Abstract: Video reasoning, which requires multi-step deduction across frames, remains a major challenge for multimodal large language models (MLLMs). While reinforcement learning (RL)-based methods enhance reasoning capabilities, they often rely on text-only chains that yield ungrounded or hallucinated conclusions. Conversely, frame-retrieval approaches introduce visual grounding but still struggle with inaccurate evidence localization. To address these challenges, we present Conan, a framework for evidence-grounded multi-step video reasoning. Conan identifies contextual and evidence frames, reasons over cross-frame clues, and adaptively decides when to conclude or explore further. To achieve this, we (1) construct Conan-91K, a large-scale dataset of automatically generated reasoning traces that includes frame identification, evidence reasoning, and action decision, and (2) design a multi-stage progressive cold-start strategy combined with an Identification-Reasoning-Action (AIR) RLVR training framework to jointly enhance multi-step visual reasoning. Extensive experiments on six multi-step reasoning benchmarks demonstrate that Conan surpasses the baseline Qwen2.5-VL-7B-Instruct by an average of over 10% in accuracy, achieving state-of-the-art performance. Furthermore, Conan generalizes effectively to long-video understanding tasks, validating its strong scalability and robustness.
[114] Reliable and Reproducible Demographic Inference for Fairness in Face Analysis
Alexandre Fournier-Montgieux,Hervé Le Borgne,Adrian Popescu,Bertrand Luvison
Main category: cs.CV
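TL;DR: Proposes a reproducible demographic attribute inference (DAI) pipeline based on modular transfer learning to make fairness evaluation of face analysis systems (FAS) more reliable; it introduces a robustness metric based on intra-identity consistency and outperforms baselines on gender and ethnicity inference.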
Details
Motivation: Fairness evaluation depends on automatic demographic attribute inference (DAI), and the reliability of DAI directly affects the validity of fairness audits. Existing methods rely on end-to-end training and predefined taxonomies and suffer from bias and variance, so a more reliable and reproducible DAI pipeline is needed.
Method: Adopts a modular transfer learning framework that pairs pretrained face recognition encoders with non-linear classification heads in place of conventional end-to-end training, and performs gender and ethnicity inference across multiple datasets and training setups.
Result: The method outperforms strong baselines on accuracy, fairness, and the newly proposed robustness metric (intra-identity consistency), particularly on the more challenging ethnicity inference task.
Conclusion: This work provides a reliable, transparent, and reproducible foundation for demographic inference in fairness auditing; the released data, code, and models help advance the field.
Abstract: Fairness evaluation in face analysis systems (FAS) typically depends on automatic demographic attribute inference (DAI), which itself relies on predefined demographic segmentation. However, the validity of fairness auditing hinges on the reliability of the DAI process. We begin by providing a theoretical motivation for this dependency, showing that improved DAI reliability leads to less biased and lower-variance estimates of FAS fairness. To address this, we propose a fully reproducible DAI pipeline that replaces conventional end-to-end training with a modular transfer learning approach. Our design integrates pretrained face recognition encoders with non-linear classification heads. We audit this pipeline across three dimensions: accuracy, fairness, and a newly introduced notion of robustness, defined via intra-identity consistency. The proposed robustness metric is applicable to any demographic segmentation scheme. We benchmark the pipeline on gender and ethnicity inference across multiple datasets and training setups. Our results show that the proposed method outperforms strong baselines, particularly on ethnicity, which is the more challenging attribute. To promote transparency and reproducibility, we will publicly release the training dataset metadata, full codebase, pretrained models, and evaluation toolkit. This work contributes a reliable foundation for demographic inference in fairness auditing.
[115] EchoDistill: Bidirectional Concept Distillation for One-Step Diffusion Personalization
Yixiong Yang,Tao Wu,Senmao Li,Shiqi Yang,Yaxing Wang,Joost van de Weijer,Kai Wang
Main category: cs.CV
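TL;DR: Proposes EchoDistill, a bidirectional concept distillation framework for one-step diffusion personalization (1-SDP); bidirectional knowledge echoing between teacher and student plus a shared text encoder significantly improve generation quality and the ability to personalize novel concepts.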
Details
Motivation: Existing one-step text-to-image diffusion models are limited when personalizing novel concepts and struggle to capture new concept distributions, so a more efficient personalization method is needed.
Method: Designs a joint teacher-student training framework with a multi-step model as the teacher and a one-step model as the student; optimization uses bidirectional concept distillation, a shared text encoder, and adversarial and alignment losses, together with a bidirectional echoing refinement strategy in which the student's fast generation provides feedback that improves the teacher.
Result: Experiments show that the method significantly outperforms existing personalization methods in the 1-SDP setting while improving both the student's personalization ability and the teacher's generation quality.
Conclusion: EchoDistill establishes a new paradigm for fast and effective personalization of T2I diffusion models and confirms the effectiveness of bidirectional distillation for one-step generative models.
Abstract: Recent advances in accelerating text-to-image (T2I) diffusion models have enabled the synthesis of high-fidelity images even in a single step. However, personalizing these models to incorporate novel concepts remains a challenge due to the limited capacity of one-step models to capture new concept distributions effectively. We propose a bidirectional concept distillation framework, EchoDistill, to enable one-step diffusion personalization (1-SDP). Our approach involves an end-to-end training process where a multi-step diffusion model (teacher) and a one-step diffusion model (student) are trained simultaneously. The concept is first distilled from the teacher model to the student, and then echoed back from the student to the teacher. During EchoDistill, we share the text encoder between the two models to ensure consistent semantic understanding. Following this, the student model is optimized with adversarial losses to align with the real image distribution and with alignment losses to maintain consistency with the teacher's output. Furthermore, we introduce the bidirectional echoing refinement strategy, wherein the student model leverages its faster generation capability to provide feedback to the teacher model. This bidirectional concept distillation mechanism not only enhances the student's ability to personalize novel concepts but also improves the generative quality of the teacher model.
Our experiments demonstrate that this collaborative framework significantly outperforms existing personalization methods in the 1-SDP setup, establishing a novel paradigm for rapid and effective personalization in T2I diffusion models.
[116] Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning
Xiaohan Lan,Fanfan Liu,Haibo Qiu,Siqi Yang,Delian Ruan,Peng Shi,Lin Ma
Main category: cs.CV
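TL;DR: This paper proposes Metis-HOME, a mixture-of-experts framework that splits the model into a thinking branch and a non-thinking branch to balance complex reasoning against general capability, addressing the limits of current multimodal large models in reasoning efficiency and generalization.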
Details
Motivation: Existing large multimodal reasoning models incur high compute cost even on simple queries, and an excessive focus on complex reasoning harms general understanding, so a new architecture is needed that is both efficient and broadly applicable.
Method: Proposes Metis-HOME, which adapts Qwen2.5-VL-7B into an MoE architecture with two experts, a thinking branch for complex multi-step reasoning and a non-thinking branch for fast direct inference, and adds a lightweight trainable router that dynamically assigns queries.
Result: Experiments show that the method substantially improves complex reasoning and also strengthens performance on general tasks such as VQA and OCR, overcoming the loss of generality seen in earlier reasoning-specialized models.
Conclusion: Metis-HOME establishes a new paradigm for multimodal large language models that resolves the tension between reasoning ability and generalization, combining efficiency with capability.
Abstract: Inspired by recent advancements in LLM reasoning, the field of multimodal reasoning has seen remarkable progress, achieving significant performance gains on intricate tasks such as mathematical problem-solving. Despite this progress, current multimodal large reasoning models exhibit two key limitations. They tend to employ computationally expensive reasoning even for simple queries, leading to inefficiency. Furthermore, this focus on specialized reasoning often impairs their broader, more general understanding capabilities. In this paper, we propose Metis-HOME: a Hybrid Optimized Mixture-of-Experts framework designed to address this trade-off. Metis-HOME enables a "Hybrid Thinking" paradigm by structuring the original dense model into two distinct expert branches: a thinking branch tailored for complex, multi-step reasoning, and a non-thinking branch optimized for rapid, direct inference on tasks like general VQA and OCR. A lightweight, trainable router dynamically allocates queries to the most suitable expert. We instantiate Metis-HOME by adapting the Qwen2.5-VL-7B into an MoE architecture. Comprehensive evaluations reveal that our approach not only substantially enhances complex reasoning abilities but also improves the model's general capabilities, reversing the degradation trend observed in other reasoning-specialized models. Our work establishes a new paradigm for building powerful and versatile MLLMs, effectively resolving the prevalent reasoning-vs-generalization dilemma.
[117] Fake-in-Facext: Towards Fine-Grained Explainable DeepFake Analysis
Lixiong Qin,Yang Zhang,Mei Wang,Jiani Hu,Weihong Deng,Weiran Xu
Main category: cs.CV
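TL;DR: Proposes the Fake-in-Facext (FiFa) framework, which uses fine-grained facial region annotation and the multi-task model FiFa-MLLM to ground explainable DeepFake analysis more firmly in visual context.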
Details
Motivation: Existing methods lack fine-grained awareness when explaining DeepFakes: annotations are coarse, and models cannot link textual explanations to visual evidence, so the analysis is not grounded in Face Visual Context (Facext).
Method: Defines a Facial Image Concept Tree (FICT) for fine-grained annotation and builds the FiFa-Annotator data pipeline; introduces the Artifact-Grounding Explanation (AGE) task, which combines textual explanations with segmentation masks; designs the FiFa-MLLM multi-task architecture to support multimodal inputs and outputs.
Result: FiFa-MLLM outperforms strong baselines on the AGE task and achieves SOTA performance on existing XDFA datasets.
Conclusion: With fine-grained annotation and a unified multi-task model, the FiFa framework markedly improves the accuracy and explainability of explainable DeepFake analysis and strengthens the alignment between text and visual evidence.
Abstract: The advancement of Multimodal Large Language Models (MLLMs) has bridged the gap between vision and language tasks, enabling the implementation of Explainable DeepFake Analysis (XDFA). However, current methods suffer from a lack of fine-grained awareness: the description of artifacts in data annotation is unreliable and coarse-grained, and the models fail to support the output of connections between textual forgery explanations and the visual evidence of artifacts, as well as the input of queries for arbitrary facial regions. As a result, their responses are not sufficiently grounded in Face Visual Context (Facext). To address this limitation, we propose the Fake-in-Facext (FiFa) framework, with contributions focusing on data annotation and model construction. We first define a Facial Image Concept Tree (FICT) to divide facial images into fine-grained regional concepts, thereby obtaining a more reliable data annotation pipeline, FiFa-Annotator, for forgery explanation. Based on this dedicated data annotation, we introduce a novel Artifact-Grounding Explanation (AGE) task, which generates textual forgery explanations interleaved with segmentation masks of manipulated artifacts. We propose a unified multi-task learning architecture, FiFa-MLLM, to simultaneously support abundant multimodal inputs and outputs for fine-grained Explainable DeepFake Analysis. With multiple auxiliary supervision tasks, FiFa-MLLM can outperform strong baselines on the AGE task and achieve SOTA performance on existing XDFA datasets.
The code and data will be made open-source at https://github.com/lxq1000/Fake-in-Facext.
[118] Blur2seq: Blind Deblurring and Camera Trajectory Estimation from a Single Camera Motion-blurred Image
Guillermo Carbajal,Andrés Almansa,Pablo Musé
Main category: cs.CV
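TL;DR: Proposes a deep learning framework that jointly estimates the sharp image and the camera motion trajectory, using a differentiable projective motion blur model to achieve high-quality deblurring, especially under severe or spatially variant blur.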
Details
Motivation: Motion blur, especially blur caused by large or rotational movements, remains a major challenge in image restoration, and existing end-to-end deblurring networks perform poorly in complex blur scenarios.
Method: Proposes a modular deep learning framework that combines the Projective Motion Blur Model (PMBM) with a differentiable blur creation module; a neural network predicts a 3D rotation trajectory that guides a model-based restoration network trained end-to-end, and the trajectory is further refined after inference with a reblur loss.
Result: Achieves state-of-the-art performance on both synthetic and real datasets, especially under severe or spatially variant blur; it can reconstruct the sequence of sharp images that produced the blurry image and offers interpretability through the estimated camera trajectory.
Conclusion: By combining a physics-inspired blur model with deep learning, the method delivers more accurate and interpretable deblurring and suggests a new direction for motion blur restoration.
Abstract: Motion blur caused by camera shake, particularly under large or rotational movements, remains a major challenge in image restoration. We propose a deep learning framework that jointly estimates the latent sharp image and the underlying camera motion trajectory from a single blurry image. Our method leverages the Projective Motion Blur Model (PMBM), implemented efficiently using a differentiable blur creation module compatible with modern networks. A neural network predicts a full 3D rotation trajectory, which guides a model-based restoration network trained end-to-end. This modular architecture provides interpretability by revealing the camera motion that produced the blur. Moreover, this trajectory enables the reconstruction of the sequence of sharp images that generated the observed blurry image. To further refine results, we optimize the trajectory post-inference via a reblur loss, improving consistency between the blurry input and the restored output. Extensive experiments show that our method achieves state-of-the-art performance on both synthetic and real datasets, particularly in cases with severe or spatially variant blur, where end-to-end deblurring networks struggle. Code and trained models are available at https://github.com/GuillermoCarbajal/Blur2Seq/
[119] Deep Learning-Powered Visual SLAM Aimed at Assisting Visually Impaired Navigation
Marziyeh Bamdad,Hans-Peter Hutter,Alireza Darvishy
Main category: cs.CV
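TL;DR: SELM-SLAM3 is a deep learning-based visual SLAM framework that combines SuperPoint and LightGlue; under challenging conditions such as low texture and motion blur it significantly outperforms ORB-SLAM3 and existing RGB-D SLAM systems, making it suitable for reliable navigation assistance for visually impaired people.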
Details
Motivation: Under challenging conditions such as low texture, motion blur, or complex lighting, existing SLAM techniques struggle to maintain stable and accurate localization and tracking, which particularly affects critical applications such as assistive navigation for the visually impaired. A more robust SLAM solution is therefore needed.
Method: Proposes SELM-SLAM3, a deep learning-enhanced visual SLAM framework that uses SuperPoint for feature extraction and LightGlue for feature matching, and evaluates it on several challenging datasets, including TUM RGB-D, ICL-NUIM, and TartanAir.
Result: Across these datasets, SELM-SLAM3 outperforms ORB-SLAM3 by an average of 87.84% and exceeds state-of-the-art RGB-D SLAM systems by 36.77%, showing greater robustness and accuracy in low-texture and fast-motion scenes.
Conclusion: SELM-SLAM3 significantly improves the performance and reliability of visual SLAM in challenging environments, providing solid technical support for navigation aids for the visually impaired.
Abstract: Despite advancements in SLAM technologies, robust operation under challenging conditions such as low texture, motion blur, or difficult lighting remains an open challenge. Such conditions are common in applications such as assistive navigation for the visually impaired. These challenges undermine localization accuracy and tracking stability, reducing navigation reliability and safety. To overcome these limitations, we present SELM-SLAM3, a deep learning-enhanced visual SLAM framework that integrates SuperPoint and LightGlue for robust feature extraction and matching. We evaluated our framework using TUM RGB-D, ICL-NUIM, and TartanAir datasets, which feature diverse and challenging scenarios. SELM-SLAM3 outperforms conventional ORB-SLAM3 by an average of 87.84% and exceeds state-of-the-art RGB-D SLAM systems by 36.77%. Our framework demonstrates enhanced performance under challenging conditions, such as low-texture scenes and fast motion, providing a reliable platform for developing navigation aids for the visually impaired.
[120] From Cheap to Pro: A Learning-based Adaptive Camera Parameter Network for Professional-Style Imaging
Fuchen Li,Yansong Du,Wenbo Cheng,Xiaoxia Zhou,Sen Yin
Main category: cs.CV
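TL;DR: Proposes ACamera-Net, a lightweight, scene-adaptive camera parameter adjustment network that predicts optimal exposure and white balance directly from RAW images under complex illumination, improving image quality and the performance of perception tasks.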
Details
Motivation: Consumer-grade cameras struggle to maintain stable image quality under complex illumination such as low light, high dynamic range, backlighting, and color temperature variation, leading to underexposure, color casts, and tonal inconsistency that degrade downstream vision tasks.
Method: Designs ACamera-Net, comprising the ACamera-Exposure module (predicts ISO to improve exposure and contrast) and the ACamera-Color module (predicts color temperature and gain factors to improve color consistency); it predicts optimal camera parameters directly from RAW data, runs in real time on edge devices, and can be integrated into imaging pipelines.
Result: Trained and validated on diverse real-world data, the model generalizes well; experiments show that it surpasses conventional auto modes and lightweight baselines in image quality and in the stability of perception outputs, without additional image enhancement modules.
Conclusion: ACamera-Net improves image quality under complex illumination and stabilizes visual perception performance; being lightweight and real-time, it is well suited to deployment on edge devices.
Abstract: Consumer-grade camera systems often struggle to maintain stable image quality under complex illumination conditions such as low light, high dynamic range, and backlighting, as well as spatial color temperature variation. These issues lead to underexposure, color casts, and tonal inconsistency, which degrade the performance of downstream vision tasks. To address this, we propose ACamera-Net, a lightweight and scene-adaptive camera parameter adjustment network that directly predicts optimal exposure and white balance from RAW inputs. The framework consists of two modules: ACamera-Exposure, which estimates ISO to alleviate underexposure and contrast loss, and ACamera-Color, which predicts correlated color temperature and gain factors for improved color consistency. Optimized for real-time inference on edge devices, ACamera-Net can be seamlessly integrated into imaging pipelines. Trained on diverse real-world data with annotated references, the model generalizes well across lighting conditions. Extensive experiments demonstrate that ACamera-Net consistently enhances image quality and stabilizes perception outputs, outperforming conventional auto modes and lightweight baselines without relying on additional image enhancement modules.
[121] From Far and Near: Perceptual Evaluation of Crowd Representations Across Levels of Detail
Xiaohan Sun,Carol O'Sullivan
Main category: cs.CV
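TL;DR: This paper studies how users perceive the visual quality of crowd character representations at different levels of detail and viewing distances.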
Details
Motivation: To examine the trade-offs between visual fidelity and computational performance across representations, and to optimize the perceived quality of crowd rendering. Method: Qualitative and quantitative analysis of four representations: geometric meshes, image-based impostors, Neural Radiance Fields (NeRFs), and 3D Gaussians. Result: The representations show different visual quality and performance characteristics across viewing distances and LoDs. Conclusion: The findings can guide the design of perceptually optimized LoD strategies for crowd rendering. Abstract: In this paper, we investigate how users perceive the visual quality of crowd character representations at different levels of detail (LoD) and viewing distances. Each representation, whether geometric meshes, image-based impostors, Neural Radiance Fields (NeRFs), or 3D Gaussians, exhibits distinct trade-offs between visual fidelity and computational performance. Our qualitative and quantitative results provide insights to guide the design of perceptually optimized LoD strategies for crowd rendering.
[122] EmbodiedBrain: Expanding Performance Boundaries of Task Planning for Embodied Intelligence
Ding Zou,Feifan Wang,Mengyu Ge,Siyuan Fan,Zongbing Zhang,Wei Chen,Lingfeng Wang,Zhongyou Hu,Wenrui Yan,Zhengwei Gao,Hao Wang,Weizhao Jin,Yu Zhang,Hainan Zhao,Mingliang Zhang,Xianxian Xi,Yaru Zhang,Wenyuan Li,Zhengguang Gao,Yurui Zhu
Main category: cs.CV
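TL;DR: This paper proposes EmbodiedBrain, a new vision-language foundation model that addresses the limitations of current large models on embodied intelligence tasks. It combines large-scale supervised fine-tuning with Step-GRPO to raise long-horizon task success and introduces a generative reward model to improve training efficiency.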
Details
Motivation: Existing LLMs and multimodal LLMs for embodied tasks suffer from a mismatch between model design and agent requirements, difficulty balancing real-time latency with performance, and unrealistic evaluation metrics, so a model framework better matched to the needs of embodied agents is required. Method: The paper proposes EmbodiedBrain, a new vision-language foundation model with an agent-aligned data structure. It combines large-scale supervised fine-tuning (SFT) with Step-Augmented Group Relative Policy Optimization (Step-GRPO), which uses preceding steps as Guided Precursors to strengthen long-horizon task planning, and it introduces a Generative Reward Model (GRM) accelerated at the infrastructure level to improve training efficiency. Result: Experiments show that EmbodiedBrain outperforms existing methods on general, planning, and end-to-end simulation benchmarks, setting a new state of the art for embodied foundation models. The authors also open-source all data, model weights, and evaluation methods. Conclusion: EmbodiedBrain addresses key challenges of current embodied intelligence models, delivers clear gains in performance and practicality, and lays a foundation for the next generation of generalist embodied agents. Abstract: The realization of Artificial General Intelligence (AGI) necessitates Embodied AI agents capable of robust spatial perception, effective task planning, and adaptive execution in physical environments. However, current large language models (LLMs) and multimodal LLMs (MLLMs) for embodied tasks suffer from key limitations, including a significant gap between model design and agent requirements, an unavoidable trade-off between real-time latency and performance, and the use of unauthentic, offline evaluation metrics. To address these challenges, we propose EmbodiedBrain, a novel vision-language foundation model available in both 7B and 32B parameter sizes. Our framework features an agent-aligned data structure and employs a powerful training methodology that integrates large-scale Supervised Fine-Tuning (SFT) with Step-Augmented Group Relative Policy Optimization (Step-GRPO), which boosts long-horizon task success by integrating preceding steps as Guided Precursors. Furthermore, we incorporate a comprehensive reward system, including a Generative Reward Model (GRM) accelerated at the infrastructure level, to improve training efficiency. To enable thorough validation, we establish a three-part evaluation system encompassing General, Planning, and End-to-End Simulation Benchmarks, highlighted by the proposal and open-sourcing of a novel, challenging simulation environment. 
Experimental results demonstrate that EmbodiedBrain achieves superior performance across all metrics, establishing a new state-of-the-art for embodied foundation models. To pave the way for the next generation of generalist embodied agents, we open-source all of our data, model weights, and evaluation methods, which are available at https://zterobot.github.io/EmbodiedBrain.github.io.
[123] Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence
Jiahao Meng,Xiangtai Li,Haochen Wang,Yue Tan,Tao Zhang,Lingdong Kong,Yunhai Tong,Anran Wang,Zhiyang Teng,Yujing Wang,Zhuochen Wang
Main category: cs.CV
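TL;DR: This paper proposes Open-o3 Video, a non-agent framework that integrates explicit spatio-temporal evidence into video reasoning. By building high-quality datasets and designing a reinforcement learning strategy, it achieves state-of-the-art performance on multiple video understanding benchmarks.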
Details
Motivation: Most existing video reasoning models generate only textual reasoning traces without indicating when and where key evidence appears, and extending evidence-centered reasoning from images to video requires joint spatio-temporal modeling. Method: The paper proposes the Open-o3 Video framework, builds two high-quality datasets (STGR-CoT-30k and STGR-RL-36k), and adopts a cold-start reinforcement learning strategy with multi-objective rewards that jointly optimize answer accuracy, temporal alignment, and spatial precision. Result: On the V-STAR benchmark, the model improves mAM by 14.4% and mLGM by 24.2% over the Qwen2.5-VL baseline, with consistent gains on VideoMME, WorldSense, VideoMMMU, and TVGBench. The generated reasoning traces can be used for test-time scaling and confidence-aware verification. Conclusion: Open-o3 Video achieves video reasoning grounded in explicit spatio-temporal evidence, improves interpretability and answer reliability, and offers a new training paradigm for video understanding. Abstract: Most video reasoning models only generate textual reasoning traces without indicating when and where key evidence appears. Recent models such as OpenAI-o3 have sparked wide interest in evidence-centered reasoning for images, yet extending this ability to videos is more challenging, as it requires joint temporal tracking and spatial localization across dynamic scenes. We introduce Open-o3 Video, a non-agent framework that integrates explicit spatio-temporal evidence into video reasoning, and carefully collect training data and design training strategies to address the aforementioned challenges. The model highlights key timestamps, objects, and bounding boxes alongside its answers, allowing reasoning to be grounded in concrete visual observations. To enable this functionality, we first curate and build two high-quality datasets, STGR-CoT-30k for SFT and STGR-RL-36k for RL, with carefully constructed temporal and spatial annotations, since most existing datasets offer either temporal spans for videos or spatial boxes on images, lacking unified spatio-temporal supervision and reasoning traces. Then, we adopt a cold-start reinforcement learning strategy with multiple specially designed rewards that jointly encourage answer accuracy, temporal alignment, and spatial precision. On the V-STAR benchmark, Open-o3 Video achieves state-of-the-art performance, raising mAM by 14.4% and mLGM by 24.2% over the Qwen2.5-VL baseline. Consistent improvements are also observed on a broad range of video understanding benchmarks, including VideoMME, WorldSense, VideoMMMU, and TVGBench. 
Beyond accuracy, the reasoning traces produced by Open-o3 Video also provide valuable signals for test-time scaling, enabling confidence-aware verification and improving answer reliability.
[124] GenColorBench: A Color Evaluation Benchmark for Text-to-Image Generation Models
Muhammad Atif Butt,Alexandra Gomez-Villa,Tao Wu,Javier Vazquez-Corral,Joost Van De Weijer,Kai Wang
Main category: cs.CV
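TL;DR: This paper proposes GenColorBench, the first comprehensive benchmark for color precision in text-to-image generation, filling a gap in how existing benchmarks evaluate color control.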
Details
Motivation: Existing text-to-image models perform poorly at fine-grained color control, and no benchmark systematically evaluates color precision, even though color is central to visual perception and practical applications. Method: GenColorBench is built on color systems such as ISCC-NBS and CSS3/X11 and contains 44K color-focused prompts covering more than 400 colors. Popular models are evaluated with a combination of perceptual tests and automated assessment. Result: The evaluation reveals performance differences in color generation across models, showing how well each model understands particular color conventions and where it fails. Conclusion: GenColorBench can effectively evaluate the color generation ability of text-to-image models and provides an important tool and direction for improving color precision. Abstract: Recent years have seen impressive advances in text-to-image generation, with image generative or unified models producing high-quality images from text. Yet these models still struggle with fine-grained color controllability, often failing to accurately match colors specified in text prompts. While existing benchmarks evaluate compositional reasoning and prompt adherence, none systematically assess color precision. Color is fundamental to human visual perception and communication, critical for applications from art to design workflows requiring brand consistency. However, current benchmarks either neglect color or rely on coarse assessments, missing key capabilities such as interpreting RGB values or aligning with human expectations. To this end, we propose GenColorBench, the first comprehensive benchmark for text-to-image color generation, grounded in color systems like ISCC-NBS and CSS3/X11, including numerical colors which are absent elsewhere. With 44K color-focused prompts covering 400+ colors, it reveals models' true capabilities via perceptual and automated assessments. Evaluations of popular text-to-image models using GenColorBench show performance variations, highlighting which color conventions models understand best and identifying failure modes. Our GenColorBench assessments will guide improvements in precise color generation. The benchmark will be made public upon acceptance.
[125] Unsupervised Domain Adaptation via Similarity-based Prototypes for Cross-Modality Segmentation
Ziyu Ye,Chen Ju,Chaofan Ma,Xiaoyun Zhang
Main category: cs.CV
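TL;DR: This paper proposes a new framework for cross-modality segmentation based on similarity-based prototypes. It learns class-wise prototypes in an embedding space under a similarity constraint and combines dictionary storage with contrastive learning, which mitigates domain shift and outperforms existing methods in the unsupervised domain adaptation setting.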
Details
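Motivation: Deep learning models perform well on their training data but degrade sharply on unseen data because of domain shift. Effective unsupervised domain adaptation is needed to reduce the domain gap and avoid costly annotation of new domains. Method: The paper proposes a cross-modality segmentation framework based on similarity-based prototypes. It learns a prototype for each class in an embedding space and introduces a similarity constraint that makes prototypes representative of their own class and separable from other classes. Dictionaries store prototypes extracted from multiple images, which prevents missing classes and supports contrastive learning of prototypes. Result: Extensive experiments show that the method outperforms other state-of-the-art unsupervised domain adaptation methods on cross-modality segmentation. Conclusion: The proposed framework, built on similarity-based prototypes and dictionary-backed contrastive learning, narrows the domain gap and improves generalization to unseen domains, offering an effective solution for cross-modality segmentation under unsupervised domain adaptation. Abstract: Deep learning models have achieved great success on various vision challenges, but a well-trained model would face drastic performance degradation when applied to unseen data. Since the model is sensitive to domain shift, unsupervised domain adaptation attempts to reduce the domain gap and avoid costly annotation of unseen domains. This paper proposes a novel framework for cross-modality segmentation via similarity-based prototypes. Specifically, we learn class-wise prototypes within an embedding space, then introduce a similarity constraint to make these prototypes representative for each semantic class while separable from different classes. Moreover, we use dictionaries to store prototypes extracted from different images, which prevents the class-missing problem and enables the contrastive learning of prototypes, and further improves performance. Extensive experiments show that our method achieves better results than other state-of-the-art methods.
[126] OnlineSplatter: Pose-Free Online 3D Reconstruction for Free-Moving Objects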
Mark He Huang,Lin Geng Foo,Christian Theobalt,Ying Sun,De Wen Soh
Main category: cs.CV
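TL;DR: This paper proposes OnlineSplatter, an online feed-forward framework that generates high-quality, object-centric 3D Gaussian representations directly from the RGB frames of a monocular video, without camera poses, depth priors, or bundle adjustment.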
Details
Motivation: Reconstructing free-moving objects remains challenging when reliable pose or depth cues are missing and the object moves arbitrarily. Existing methods usually depend on accurate pose estimation or offline optimization, which limits their use in real-world scenes. Method: The method anchors reconstruction on the first frame and progressively updates the object representation through a dense field of Gaussian primitives. A dual-key memory module combines latent appearance-geometry keys with explicit directional keys to fuse current-frame features with historical states, while spatially guided memory readout and an efficient sparsification mechanism keep object coverage compact yet complete. Result: Experiments on real-world datasets show that OnlineSplatter significantly outperforms existing pose-free reconstruction baselines and keeps improving as more observations arrive, while memory use and runtime stay constant. Conclusion: OnlineSplatter provides an efficient and robust solution for real-time, online 3D reconstruction of free-moving objects from monocular video without depth or pose. Abstract: Free-moving object reconstruction from monocular video remains challenging, particularly without reliable pose or depth cues and under arbitrary object motion. We introduce OnlineSplatter, a novel online feed-forward framework generating high-quality, object-centric 3D Gaussians directly from RGB frames without requiring camera pose, depth priors, or bundle optimization. Our approach anchors reconstruction using the first frame and progressively refines the object representation through a dense Gaussian primitive field, maintaining constant computational cost regardless of video sequence length. Our core contribution is a dual-key memory module combining latent appearance-geometry keys with explicit directional keys, robustly fusing current frame features with temporally aggregated object states. This design enables effective handling of free-moving objects via spatial-guided memory readout and an efficient sparsification mechanism, ensuring comprehensive yet compact object coverage. Evaluations on real-world datasets demonstrate that OnlineSplatter significantly outperforms state-of-the-art pose-free reconstruction baselines, consistently improving with more observations while maintaining constant memory and runtime.
[127] SeViCES: Unifying Semantic-Visual Evidence Consensus for Long Video Understanding
Yuan Sheng,Yanbin Hao,Chenxu Li,Shuo Wang,Xiangnan He
Main category: cs.CV
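TL;DR: The paper proposes SeViCES, a training-free, model-agnostic Semantic-Visual Consensus Evidence Selection framework for efficient and reliable long video understanding. It selects key frames through a semantic branch and a visual branch and fuses the evidence to improve reasoning consistency and accuracy.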
Details
Motivation: Existing long video understanding methods often ignore temporal dependencies or rely on unimodal evidence, which leads to unfocused or inconsistent reasoning and makes it hard to supply complete, query-relevant context. Method: The Semantic-Visual Consensus Frame Selection (SVCFS) module has a semantic branch that applies temporal-aware LLM reasoning over captions and a cluster-guided visual branch that aligns embeddings with semantic scores via mutual information. The Answer Consensus Refinement (ACR) module fuses multimodal evidence and constrains the answer space to reduce inconsistency. Result: Experiments on multiple long video understanding benchmarks show that SeViCES outperforms state-of-the-art methods in both accuracy and robustness. Conclusion: Consensus-based evidence selection substantially improves Video-LLM performance on long video understanding, and joint optimization of the semantic and visual modalities is key to reliable reasoning. Abstract: Long video understanding remains challenging due to its complex, diverse, and temporally scattered content. Although video large language models (Video-LLMs) can process videos lasting tens of minutes, applying them to truly long sequences is computationally prohibitive and often leads to unfocused or inconsistent reasoning. A promising solution is to select only the most informative frames, yet existing approaches typically ignore temporal dependencies or rely on unimodal evidence, limiting their ability to provide complete and query-relevant context. We propose a Semantic-Visual Consensus Evidence Selection (SeViCES) framework for effective and reliable long video understanding. SeViCES is training-free and model-agnostic, and introduces two key components. The Semantic-Visual Consensus Frame Selection (SVCFS) module selects frames through (1) a temporal-aware semantic branch that leverages LLM reasoning over captions, and (2) a cluster-guided visual branch that aligns embeddings with semantic scores via mutual information. The Answer Consensus Refinement (ACR) module further resolves inconsistencies between semantic- and visual-based predictions by fusing evidence and constraining the answer space. Extensive experiments on long video understanding benchmarks show that SeViCES consistently outperforms state-of-the-art methods in both accuracy and robustness, demonstrating the importance of consensus-driven evidence selection for Video-LLMs.
[128] Deep Learning in Dental Image Analysis: A Systematic Review of Datasets, Methodologies, and Emerging Challenges
Zhenhuan Zhou,Jingbo Zhu,Yuchen Zhang,Xiaohang Guan,Peng Wang,Tao Li
Main category: cs.CV
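TL;DR: This paper reviews deep learning applications in dental image analysis, covering 260 studies with a focus on public datasets and deep learning models, and summarizes current challenges and future directions.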
Details
Motivation: Dental image analysis faces challenges such as low contrast, metallic artifacts, and varying projection angles, and manual interpretation is time-consuming and subjective, so automated solutions are needed. Method: A systematic review of 49 studies on public dental datasets and 211 studies on deep learning algorithms, categorizing model architectures, optimization strategies, training methods, and performance metrics by task. Result: The review summarizes the characteristics and acquisition methods of commonly used datasets, the deep learning techniques applied, and the training and evaluation metrics used in dental image analysis, and provides detailed comparison tables. Conclusion: Deep learning shows great potential in dental image analysis. The paper offers a systematic reference for researchers in the field, and its supplementary materials are to be released on GitHub. Abstract: Efficient analysis and processing of dental images are crucial for dentists to achieve accurate diagnosis and optimal treatment planning. However, dental imaging inherently poses several challenges, such as low contrast, metallic artifacts, and variations in projection angles. Combined with the subjectivity arising from differences in clinicians' expertise, manual interpretation often proves time-consuming and prone to inconsistency. Artificial intelligence (AI)-based automated dental image analysis (DIA) offers a promising solution to these issues and has become an integral part of computer-aided dental diagnosis and treatment. Among various AI technologies, deep learning (DL) stands out as the most widely applied and influential approach due to its superior feature extraction and representation capabilities. To comprehensively summarize recent progress in this field, we focus on the two fundamental aspects of DL research: datasets and models. In this paper, we systematically review 260 studies on DL applications in DIA, including 49 papers on publicly available dental datasets and 211 papers on DL-based algorithms. We first introduce the basic concepts of dental imaging and summarize the characteristics and acquisition methods of existing datasets. Then, we present the foundational techniques of DL and categorize relevant models and algorithms according to different DIA tasks, analyzing their network architectures, optimization strategies, training methods, and performance. Furthermore, we summarize commonly used training and evaluation metrics in the DIA domain. Finally, we discuss the current challenges of existing research and outline potential future directions. 
We hope that this work provides a valuable and systematic reference for researchers in this field. All supplementary materials and detailed comparison tables will be made publicly available on GitHub.
[129] Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation
Yuhan Liu,Lianhui Qin,Shengjie Wang
Main category: cs.CV
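TL;DR: The paper proposes Speculative Verdict (SV), a training-free framework that combines several lightweight "draft experts" with a strong "verdict model" to achieve efficient and accurate multimodal reasoning over information-intensive images.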
Details
Motivation: Large vision-language models have difficulty with localization and multi-hop reasoning on complex, information-dense images that interleave text and graphics, which calls for a solution that improves performance without additional training. Method: SV has a draft stage and a verdict stage. Small VLMs act as draft experts and generate diverse reasoning paths that provide localization candidates. A large VLM acts as the verdict model and synthesizes these paths into the final answer. A consensus expert selection mechanism forwards only high-agreement paths to improve efficiency and accuracy. Result: SV achieves consistent gains on high-resolution, information-intensive visual question answering benchmarks including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K, correcting errors in flawed reasoning paths while reducing computational cost. Conclusion: By fusing multiple partially correct reasoning paths, SV balances error correction and computational efficiency without training, comparing favorably with large proprietary models and pipelines that require training. Abstract: Large Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet they struggle when reasoning over information-intensive images that densely interleave textual annotations with fine-grained graphical elements. The main challenges lie in precisely localizing critical cues in dense layouts and multi-hop reasoning to integrate dispersed evidence. We propose Speculative Verdict (SV), a training-free framework inspired by speculative decoding that combines multiple lightweight draft experts with a large verdict model. In the draft stage, small VLMs act as draft experts to generate reasoning paths that provide diverse localization candidates; in the verdict stage, a strong VLM synthesizes these paths to produce the final answer, minimizing computational cost while recovering correct answers. To further improve efficiency and accuracy, SV introduces a consensus expert selection mechanism that forwards only high-agreement reasoning paths to the verdict. Empirically, SV achieves consistent gains on challenging information-intensive and high-resolution visual question answering benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K. By synthesizing correct insights from multiple partially accurate reasoning paths, SV achieves both error correction and cost-efficiency compared to large proprietary models or training pipelines. Code is available at https://github.com/Tinaliu0123/speculative-verdict
[130] Better Tokens for Better 3D: Advancing Vision-Language Modeling in 3D Medical Imaging
Ibrahim Ethem Hamamci,Sezgin Er,Suprosanna Shit,Hadrien Reynaud,Dong Yang,Pengfei Guo,Marc Edgar,Daguang Xu,Bernhard Kainz,Bjoern Menze
Main category: cs.CV
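TL;DR: This paper proposes BTB3D, a causal convolutional encoder-decoder for 3D medical imaging that unifies 2D and 3D training and produces compact, frequency-aware volumetric tokens, setting a new state of the art on report generation and text-to-CT synthesis.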
Details
Motivation: When handling high-resolution, long-sequence 3D medical images, existing methods yield vision encoders that are misaligned with clinical language, and slice-wise tokenization blurs fine anatomical structure, which limits diagnostic performance on downstream tasks. Method: BTB3D uses a causal convolutional encoder-decoder architecture with a three-stage training strategy (local reconstruction, overlapping-window tiling, and long-context decoder refinement) to achieve efficient and precise 3D volumetric tokenization. Result: BTB3D reaches the state of the art on two key tasks. For report generation it improves BLEU scores and raises clinical F1 by 40% over CT2Rep, CT-CHAT, and Merlin. For text-to-CT synthesis it reduces FID by 75% and halves FVD, and it generates anatomically consistent 512*512*241 high-resolution volumes. Conclusion: Precise three-dimensional tokenization, rather than larger language model backbones alone, is essential for scalable vision-language modeling in 3D medical imaging. Abstract: Recent progress in vision-language modeling for 3D medical imaging has been fueled by large-scale computed tomography (CT) corpora with paired free-text reports, stronger architectures, and powerful pretrained models. This has enabled applications such as automated report generation and text-conditioned 3D image synthesis. Yet, current approaches struggle with high-resolution, long-sequence volumes: contrastive pretraining often yields vision encoders that are misaligned with clinical language, and slice-wise tokenization blurs fine anatomy, reducing diagnostic performance on downstream tasks. We introduce BTB3D (Better Tokens for Better 3D), a causal convolutional encoder-decoder that unifies 2D and 3D training and inference while producing compact, frequency-aware volumetric tokens. A three-stage training curriculum enables (i) local reconstruction, (ii) overlapping-window tiling, and (iii) long-context decoder refinement, during which the model learns from short slice excerpts yet generalizes to scans exceeding 300 slices without additional memory overhead. BTB3D sets a new state-of-the-art on two key tasks: it improves BLEU scores and increases clinical F1 by 40% over CT2Rep, CT-CHAT, and Merlin for report generation; and it reduces FID by 75% and halves FVD compared to GenerateCT and MedSyn for text-to-CT synthesis, producing anatomically consistent 512*512*241 volumes. These results confirm that precise three-dimensional tokenization, rather than larger language backbones alone, is essential for scalable vision-language modeling in 3D medical imaging. 
The codebase is available at: https://github.com/ibrahimethemhamamci/BTB3D
[131] UltraHR-100K: Enhancing UHR Image Synthesis with A Large-Scale High-Quality Dataset
Chen Zhao,En Ci,Yunzhe Xu,Tiehan Fan,Shanyan Guan,Yanhao Ge,Jian Yang,Ying Tai
Main category: cs.CV
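TL;DR: This paper introduces UltraHR-100K, a high-quality dataset for ultra-high-resolution text-to-image generation, and designs a frequency-aware post-training method to improve the quality of fine-grained detail.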
Details
Motivation: Ultra-high-resolution text-to-image generation faces two main challenges: the lack of a large-scale, high-quality dataset and the lack of training strategies tailored to fine-grained detail synthesis. Method: The authors build UltraHR-100K, a dataset of 100K ultra-high-resolution images (each above 3K resolution), and propose a frequency-aware post-training method comprising Detail-Oriented Timestep Sampling (DOTS) and Soft-Weighting Frequency Regularization (SWFR), which uses the Discrete Fourier Transform to strengthen preservation of high-frequency detail. Result: Experiments on the UltraHR-eval4K benchmark show that the method significantly improves detail quality and overall fidelity in ultra-high-resolution image generation. Conclusion: With a high-quality dataset and a targeted training strategy, the paper advances detail synthesis in ultra-high-resolution text-to-image generation. Abstract: Ultra-high-resolution (UHR) text-to-image (T2I) generation has seen notable progress. However, two key challenges remain: (1) the absence of a large-scale high-quality UHR T2I dataset, and (2) the neglect of tailored training strategies for fine-grained detail synthesis in UHR scenarios. To tackle the first challenge, we introduce UltraHR-100K, a high-quality dataset of 100K UHR images with rich captions, offering diverse content and strong visual fidelity. Each image exceeds 3K resolution and is rigorously curated based on detail richness, content complexity, and aesthetic quality. To tackle the second challenge, we propose a frequency-aware post-training method that enhances fine-detail generation in T2I diffusion models. Specifically, we design (i) Detail-Oriented Timestep Sampling (DOTS) to focus learning on detail-critical denoising steps, and (ii) Soft-Weighting Frequency Regularization (SWFR), which leverages Discrete Fourier Transform (DFT) to softly constrain frequency components, encouraging high-frequency detail preservation. Extensive experiments on our proposed UltraHR-eval4K benchmarks demonstrate that our approach significantly improves the fine-grained detail quality and overall fidelity of UHR image generation. The code is available at https://github.com/NJU-PCALab/UltraHR-100k.
[132] HybridSOMSpikeNet: A Deep Model with Differentiable Soft Self-Organizing Maps and Spiking Dynamics for Waste Classification
Debojyoti Ghosh,Adrijit Goswami
Main category: cs.CV
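TL;DR: This paper proposes HybridSOMSpikeNet, a hybrid deep learning framework for efficient, energy-saving intelligent waste classification. The model combines convolutional feature extraction, a differentiable self-organizing map, and spiking-network temporal processing, reaching 97.39% accuracy on a ten-class waste dataset, outperforming several state-of-the-art models and supporting sustainable development goals.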
Details
Motivation: Accurate waste classification is essential for sustainable waste management and for reducing the environmental impact of urbanization. Misclassified recyclables increase landfill, lower recycling efficiency, and raise greenhouse gas emissions, so intelligent solutions are needed. Method: HybridSOMSpikeNet uses a pre-trained ResNet-152 to extract spatial features, a differentiable Soft Self-Organizing Map (Soft-SOM) to strengthen topological clustering and interpretability, and a spiking neural head that accumulates activations over time to improve robustness and generalization. Result: The model reaches 97.39% test accuracy on a ten-class waste dataset, outperforming several mainstream deep learning models while keeping a lightweight computational profile suited to real-world deployment. Conclusion: HybridSOMSpikeNet balances high accuracy with energy efficiency and, through precise automated sorting, promotes efficient resource recovery and lowers contamination and processing costs, supporting United Nations Sustainable Development Goals SDG 11 and SDG 12. Abstract: Accurate waste classification is vital for achieving sustainable waste management and reducing the environmental footprint of urbanization. Misclassification of recyclable materials contributes to landfill accumulation, inefficient recycling, and increased greenhouse gas emissions. To address these issues, this study introduces HybridSOMSpikeNet, a hybrid deep learning framework that integrates convolutional feature extraction, differentiable self-organization, and spiking-inspired temporal processing to enable intelligent and energy-efficient waste classification. The proposed model employs a pre-trained ResNet-152 backbone to extract deep spatial representations, followed by a Differentiable Soft Self-Organizing Map (Soft-SOM) that enhances topological clustering and interpretability. A spiking neural head accumulates temporal activations over discrete time steps, improving robustness and generalization. Trained on a ten-class waste dataset, HybridSOMSpikeNet achieved a test accuracy of 97.39%, outperforming several state-of-the-art architectures while maintaining a lightweight computational profile suitable for real-world deployment. Beyond its technical innovations, the framework provides tangible environmental benefits. By enabling precise and automated waste segregation, it supports higher recycling efficiency, reduces contamination in recyclable streams, and minimizes the ecological and operational costs of waste processing. 
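As a rough illustration of what a differentiable "soft" self-organizing map assignment can look like, the sketch below replaces the hard winner-take-all step of a classical SOM with a softmax over negative squared distances to the prototypes. This is only one common formulation and is an assumption here: the abstract does not specify the Soft-SOM equations, and the function name `soft_som_assign` and the `temperature` parameter are hypothetical.

```python
import math

def soft_som_assign(x, prototypes, temperature=1.0):
    """Soft assignment of feature vector x to SOM prototypes (illustrative).

    Assumption: weights are a softmax over negative squared Euclidean
    distances; the paper's actual Soft-SOM may differ.
    """
    # Squared distance from x to every prototype vector.
    d2 = [sum((a - b) ** 2 for a, b in zip(x, p)) for p in prototypes]
    # Subtract the minimum distance for numerical stability before exp().
    d_min = min(d2)
    weights = [math.exp(-(d - d_min) / temperature) for d in d2]
    total = sum(weights)
    return [w / total for w in weights]

# Usage: the closest prototype receives the largest weight.
weights = soft_som_assign([0.0, 0.0], [[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])
```

Because every prototype receives a non-zero weight, gradients can flow to all of them, which is the property that makes this kind of assignment trainable end to end.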
The approach aligns with global sustainability priorities, particularly the United Nations Sustainable Development Goals (SDG 11 and SDG 12), by contributing to cleaner cities, circular economy initiatives, and intelligent environmental management systems.
[133] Efficient Multi-bit Quantization Network Training via Weight Bias Correction and Bit-wise Coreset Sampling
Jinhee Kim,Jae Jun An,Kang Eun Jeon,Jong Hwan Ko
Main category: cs.CV
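TL;DR: The paper proposes a method that cuts the training overhead of multi-bit quantization networks through weight bias correction and bit-wise coreset sampling, substantially shortening training time without sacrificing model performance.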
Details
Motivation: Existing methods repeat full-dataset updates for every supported bit-width, so training cost grows linearly with the number of precisions, and they often need extra fine-tuning, which makes training expensive. Method: Two techniques are proposed: (1) weight bias correction, which neutralizes quantization-induced bias across bit-widths and aligns activation distributions, enabling shared batch normalization and removing the need for fine-tuning; and (2) a bit-wise coreset sampling strategy based on gradient importance scores, which exploits implicit knowledge transfer to train each child model on a compact, informative subset. Result: Experiments on CIFAR-10/100, TinyImageNet, and ImageNet-1K with ResNet and ViT architectures show competitive or better accuracy with training time reduced by up to 7.88x. Conclusion: The method effectively lowers the training overhead of multi-bit quantization networks and is efficient and general across models and datasets. Abstract: Multi-bit quantization networks enable flexible deployment of deep neural networks by supporting multiple precision levels within a single model. However, existing approaches suffer from significant training overhead as full-dataset updates are repeated for each supported bit-width, resulting in a cost that scales linearly with the number of precisions. Additionally, extra fine-tuning stages are often required to support additional or intermediate precision options, further compounding the overall training burden. To address this issue, we propose two techniques that greatly reduce the training overhead without compromising model utility: (i) Weight bias correction enables shared batch normalization and eliminates the need for fine-tuning by neutralizing quantization-induced bias across bit-widths and aligning activation distributions; and (ii) Bit-wise coreset sampling strategy allows each child model to train on a compact, informative subset selected via gradient-based importance scores by exploiting the implicit knowledge transfer phenomenon. Experiments on CIFAR-10/100, TinyImageNet, and ImageNet-1K with both ResNet and ViT architectures demonstrate that our method achieves competitive or superior accuracy while reducing training time by up to 7.88x. Our code is released at https://github.com/a2jinhee/EMQNet_jk.
[134] Diagnosing Visual Reasoning: Challenges, Insights, and a Path Forward
Jing Bi,Guangyu Sun,Ali Vosoughi,Chen Chen,Chenliang Xu
Main category: cs.CV
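TL;DR: The paper proposes an agent-based architecture that combines LLM reasoning with lightweight visual modules to address hallucination and over-reliance on textual priors in multimodal large language models on visual tasks.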
Details
Motivation: Existing MLLMs exhibit visual hallucinations and over-reliance on textual priors on complex visual tasks, so their visual reasoning needs systematic diagnosis and improvement. Method: A three-stage evaluation framework diagnoses state-of-the-art models, and an agent-based architecture integrates an LLM with specialized visual modules for fine-grained analysis and iterative refinement of reasoning chains. Result: The system gains +10.3 on MMMU and +6.0 on MathVista, matching or surpassing much larger models. Conclusion: Future visual reasoning models should integrate more specialized tools for analyzing visual content; the proposed system is effective and extensible. Abstract: Multimodal large language models (MLLMs) that integrate visual and textual reasoning leverage chain-of-thought (CoT) prompting to tackle complex visual tasks, yet continue to exhibit visual hallucinations and an over-reliance on textual priors. We present a systematic diagnosis of state-of-the-art vision-language models using a three-stage evaluation framework, uncovering key failure modes. To address these, we propose an agent-based architecture that combines LLM reasoning with lightweight visual modules, enabling fine-grained analysis and iterative refinement of reasoning chains. Our results highlight that future visual reasoning models should focus on integrating a broader set of specialized tools for analyzing visual content. Our system achieves significant gains (+10.3 on MMMU, +6.0 on MathVista over a 7B baseline), matching or surpassing much larger models. We will release our framework and evaluation suite to facilitate future research.
[135] Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models
Xuyang Liu,Xiyan Gui,Yuchao Zhang,Linfeng Zhang
Main category: cs.CV
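TL;DR: This paper proposes MixKV, a method that combines importance with diversity to optimize KV cache compression in large vision-language models, easing the memory bottleneck and improving performance on multimodal understanding tasks.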
Details
Motivation: Existing KV cache compression methods focus on retaining high-importance key-value pairs and overlook the semantic redundancy patterns specific to multimodal settings, which leads to incomplete semantic coverage. Method: After analyzing how redundancy varies across attention heads in LVLMs, the authors propose MixKV, which adaptively balances importance and diversity during compression to better preserve semantic information. Result: Under extreme compression (budget=64), MixKV improves baselines by an average of 5.1% across five multimodal understanding benchmarks and improves SnapKV and AdaKV by 8.0% and 9.0% on GUI grounding tasks, with comparable inference efficiency. Conclusion: MixKV improves existing KV cache compression methods, balances semantic coverage with compression efficiency, and applies to both LVLMs and LLMs. Abstract: Recent large vision-language models (LVLMs) demonstrate remarkable capabilities in processing extended multi-modal sequences, yet the resulting key-value (KV) cache expansion creates a critical memory bottleneck that fundamentally limits deployment scalability. While existing KV cache compression methods focus on retaining high-importance KV pairs to minimize storage, they often overlook the modality-specific semantic redundancy patterns that emerge distinctively in multi-modal KV caches. In this work, we first analyze how, beyond simple importance, the KV cache in LVLMs exhibits varying levels of redundancy across attention heads. We show that relying solely on importance can only cover a subset of the full KV cache information distribution, leading to potential loss of semantic coverage. To address this, we propose MixKV, a novel method that mixes importance with diversity for optimized KV cache compression in LVLMs. MixKV adapts to head-wise semantic redundancy, selectively balancing diversity and importance when compressing KV pairs. Extensive experiments demonstrate that MixKV consistently enhances existing methods across multiple LVLMs. Under extreme compression (budget=64), MixKV improves baseline methods by an average of 5.1% across five multi-modal understanding benchmarks and achieves remarkable gains of 8.0% and 9.0% for SnapKV and AdaKV on GUI grounding tasks, all while maintaining comparable inference efficiency. Furthermore, MixKV extends seamlessly to LLMs with comparable performance gains. 
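To make the idea of mixing importance with diversity concrete, here is a minimal sketch of budgeted KV selection. It is not MixKV's actual scoring rule, which the abstract does not give. The function name `select_kv`, the `mix` weight, and the choice of "one minus cosine similarity to the mean key" as the diversity term are all assumptions made for illustration.

```python
import math

def select_kv(keys, importance, budget, mix=0.5):
    """Pick `budget` KV indices by a mixed importance/diversity score.

    Hypothetical scoring: diversity is 1 - cosine(key, mean key), so keys
    that differ from the bulk of the cache score higher.
    """
    dim = len(keys[0])
    mean = [sum(k[d] for k in keys) / len(keys) for d in range(dim)]

    def cosine(a, b):
        na = math.sqrt(sum(v * v for v in a))
        nb = math.sqrt(sum(v * v for v in b))
        if na == 0.0 or nb == 0.0:
            return 0.0
        return sum(x * y for x, y in zip(a, b)) / (na * nb)

    diversity = [1.0 - cosine(k, mean) for k in keys]
    # Min-max normalize importance so both terms are on a similar scale.
    lo, hi = min(importance), max(importance)
    span = (hi - lo) or 1.0
    norm_imp = [(s - lo) / span for s in importance]
    scores = [(1.0 - mix) * i + mix * d for i, d in zip(norm_imp, diversity)]
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])

# Two near-duplicate keys with high importance, one distinct key with low importance.
keys = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
importance = [0.9, 0.8, 0.1]
by_importance = select_kv(keys, importance, budget=2, mix=0.0)
by_diversity = select_kv(keys, importance, budget=2, mix=1.0)
```

With `mix=0.0` the selection keeps the two near-duplicate keys, while `mix=1.0` keeps the distinct key instead, which mirrors the abstract's point that importance alone can miss part of the cache's semantic coverage.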
Our code is available at https://github.com/xuyang-liu16/MixKV.
[136] ALICE-LRI: A General Method for Lossless Range Image Generation for Spinning LiDAR Sensors without Calibration Metadata
Samuel Soutullo,Miguel Yermo,David L. Vilariño,Óscar G. Lorenzo,José C. Cabaleiro,Francisco F. Rivera
Main category: cs.CV
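TL;DR: This paper proposes ALICE-LRI, a general, sensor-agnostic method that for the first time generates lossless range images from spinning LiDAR point clouds without manufacturer metadata or calibration files, enabling complete point cloud reconstruction with zero point loss.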
Details
Motivation: Conventional LiDAR projection methods suffer from geometric inconsistencies that cause irreversible information loss and fall short of the needs of high-fidelity applications. Method: The algorithm automatically reverse-engineers the intrinsic geometry of any spinning LiDAR sensor, inferring the laser beam configuration, angular distributions, and per-beam calibration corrections, to achieve lossless projection. Result: Comprehensive evaluation on the KITTI and DurLAR datasets shows that ALICE-LRI preserves every point (zero points lost), keeps geometric accuracy within sensor precision limits, and runs in real time. Conclusion: ALICE-LRI marks a shift from approximate to lossless LiDAR projection and opens new possibilities for high-precision remote sensing applications that require complete geometric preservation. Abstract: 3D LiDAR sensors are essential for autonomous navigation, environmental monitoring, and precision mapping in remote sensing applications. To efficiently process the massive point clouds generated by these sensors, LiDAR data is often projected into 2D range images that organize points by their angular positions and distances. While these range image representations enable efficient processing, conventional projection methods suffer from fundamental geometric inconsistencies that cause irreversible information loss, compromising high-fidelity applications. We present ALICE-LRI (Automatic LiDAR Intrinsic Calibration Estimation for Lossless Range Images), the first general, sensor-agnostic method that achieves lossless range image generation from spinning LiDAR point clouds without requiring manufacturer metadata or calibration files. Our algorithm automatically reverse-engineers the intrinsic geometry of any spinning LiDAR sensor by inferring critical parameters including laser beam configuration, angular distributions, and per-beam calibration corrections, enabling lossless projection and complete point cloud reconstruction with zero point loss. Comprehensive evaluation across the complete KITTI and DurLAR datasets demonstrates that ALICE-LRI achieves perfect point preservation, with zero points lost across all point clouds. Geometric accuracy is maintained well within sensor precision limits, establishing geometric losslessness with real-time performance. We also present a compression case study that validates substantial downstream benefits, demonstrating significant quality improvements in practical applications. 
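For context, the sketch below shows the kind of conventional spherical projection that the abstract contrasts against, and why it can lose points: two returns that quantize to the same pixel cannot both be stored. It is a generic textbook-style projection with a uniform elevation grid, not the ALICE-LRI algorithm. The function name and the field-of-view parameters are assumptions for illustration.

```python
import math

def project_to_range_image(points, height, width, fov_up_deg, fov_down_deg):
    """Naive spherical projection of (x, y, z) points to a range image.

    Returns (image, lost): image[row][col] holds the nearest range in that
    pixel (or None), and lost counts points dropped by collisions, zero
    range, or falling outside the assumed vertical field of view.
    """
    fov_up = math.radians(fov_up_deg)
    fov = fov_up - math.radians(fov_down_deg)
    image = [[None] * width for _ in range(height)]
    lost = 0
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            lost += 1
            continue
        azimuth = math.atan2(y, x)        # horizontal angle in [-pi, pi]
        elevation = math.asin(z / r)      # vertical angle
        col = int(0.5 * (1.0 - azimuth / math.pi) * width) % width
        row = int((fov_up - elevation) / fov * height)
        if not 0 <= row < height:
            lost += 1                     # outside the assumed vertical FOV
            continue
        if image[row][col] is None:
            image[row][col] = r
        else:
            lost += 1                     # collision: only one point survives
            image[row][col] = min(image[row][col], r)
    return image, lost

# Two nearly identical returns collide; one point lies above the FOV.
pts = [(10.0, 0.0, 0.0), (10.0, -0.001, 0.0), (0.0, 10.0, 0.0), (0.0, 0.0, 10.0)]
image, lost = project_to_range_image(pts, 4, 8, 15.0, -15.0)
```

A lossless scheme has to assign every return to its own pixel, which is why the abstract emphasizes recovering the sensor's true beam layout rather than assuming a uniform angular grid as this sketch does.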
This paradigm shift from approximate to lossless LiDAR projections opens new possibilities for high-precision remote sensing applications requiring complete geometric preservation.
[137] AutoScape: Geometry-Consistent Long-Horizon Scene Generation
Jiacheng Chen,Ziyu Jiang,Mingfu Liang,Bingbing Zhuang,Jong-Chyi Su,Sparsh Garg,Ying Wu,Manmohan Chandraker
Main category: cs.CV
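TL;DR: This paper proposes AutoScape, a long-horizon driving scene generation framework that uses an RGB-D diffusion model to generate geometrically consistent keyframes and a video diffusion model to interpolate between them into coherent long driving videos, substantially outperforming prior methods on FID and FVD.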
Details
Motivation: Existing driving scene generation methods struggle to maintain geometric consistency and visual quality over long horizons, so more reliable methods are needed to support applications such as autonomous driving simulation.
Method: A new RGB-D diffusion model jointly handles image and depth in a shared latent space. It conditions explicitly on point clouds from previously generated keyframes and uses warp-consistent sampling guidance to maintain long-range geometric consistency. A video diffusion model then interpolates between keyframes to produce dense video frames.
Result: AutoScape generates realistic, geometrically consistent driving videos of over 20 seconds, improving long-horizon FID and FVD over the prior state of the art by 48.6% and 43.0%, respectively.
Conclusion: With a two-stage strategy of sparse keyframes followed by dense interpolation, AutoScape effectively addresses geometric consistency in long-horizon driving scene generation and markedly improves generation quality and evaluation metrics.
Abstract: This paper proposes AutoScape, a long-horizon driving scene generation framework. At its core is a novel RGB-D diffusion model that iteratively generates sparse, geometrically consistent keyframes, serving as reliable anchors for the scene's appearance and geometry. To maintain long-range geometric consistency, the model 1) jointly handles image and depth in a shared latent space, 2) explicitly conditions on the existing scene geometry (i.e., rendered point clouds) from previously generated keyframes, and 3) steers the sampling process with a warp-consistent guidance. Given high-quality RGB-D keyframes, a video diffusion model then interpolates between them to produce dense and coherent video frames. AutoScape generates realistic and geometrically consistent driving videos of over 20 seconds, improving the long-horizon FID and FVD scores over the prior state-of-the-art by 48.6% and 43.0%, respectively.
[138] ACS-SegNet: An Attention-Based CNN-SegFormer Segmentation Network for Tissue Segmentation in Histopathology
Nima Torbati,Anastasia Meshcheryakova,Ramona Woitek,Diana Mechtcheriakova,Amirreza Mahbod
Main category: cs.CV
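TL;DR: Proposes a dual-encoder model that combines a CNN and a vision transformer through attention-driven feature fusion to improve semantic segmentation of histopathology images, outperforming existing methods on the GCPS and PUMA datasets.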
Details
Motivation: To improve the accuracy of semantic tissue segmentation in histopathology images and overcome the limitations of existing deep learning models on complex tissue structures.
Method: A unified dual-encoder model combines a convolutional neural network (CNN) and a vision transformer (ViT), fusing their features through an attention mechanism to better model multi-scale and global context.
Result: The model reaches 76.79% mIoU and 86.87% mDice on the GCPS dataset, and 64.93% mIoU and 76.60% mDice on the PUMA dataset, outperforming state-of-the-art methods and baseline models.
Conclusion: The proposed attention-driven feature fusion dual-encoder model effectively improves semantic segmentation of histopathology images and shows good application prospects.
Abstract: Automated histopathological image analysis plays a vital role in computer-aided diagnosis of various diseases. Among developed algorithms, deep learning-based approaches have demonstrated excellent performance in multiple tasks, including semantic tissue segmentation in histological images. In this study, we propose a novel approach based on attention-driven feature fusion of convolutional neural networks (CNNs) and vision transformers (ViTs) within a unified dual-encoder model to improve semantic segmentation performance. Evaluation on two publicly available datasets showed that our model achieved μIoU/μDice scores of 76.79%/86.87% on the GCPS dataset and 64.93%/76.60% on the PUMA dataset, outperforming state-of-the-art and baseline benchmarks. The implementation of our method is publicly available in a GitHub repository: https://github.com/NimaTorbati/ACS-SegNet
[139] DyPE: Dynamic Position Extrapolation for Ultra High Resolution Diffusion
Noam Issachar,Guy Yariv,Sagie Benaim,Yossi Adi,Dani Lischinski,Raanan Fattal
Main category: cs.CV
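TL;DR: This paper proposes Dynamic Position Extrapolation (DyPE), a training-free method that lets pre-trained diffusion transformers generate images at resolutions far beyond their training resolution.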
Details
Motivation: Because self-attention scales quadratically with the number of image tokens, training for ultra-high-resolution image generation is extremely costly, so an efficient way to go beyond the training resolution is needed.
Method: Exploiting the spectral progression inherent to the diffusion process, DyPE dynamically adjusts the model's positional encoding at each step so that its frequency spectrum matches the current stage of generation.
Result: DyPE consistently improves performance on multiple benchmarks and achieves state-of-the-art generation quality at ultra-high resolutions such as 16 million pixels, with no additional sampling cost.
Conclusion: DyPE is an effective training-free method that substantially extends the generation resolution of pre-trained diffusion transformers, advancing high-resolution image generation.
Abstract: Diffusion Transformer models can generate images with remarkable fidelity and detail, yet training them at ultra-high resolutions remains extremely costly due to the self-attention mechanism's quadratic scaling with the number of image tokens. In this paper, we introduce Dynamic Position Extrapolation (DyPE), a novel, training-free method that enables pre-trained diffusion transformers to synthesize images at resolutions far beyond their training data, with no additional sampling cost. DyPE takes advantage of the spectral progression inherent to the diffusion process, where low-frequency structures converge early, while high-frequencies take more steps to resolve. Specifically, DyPE dynamically adjusts the model's positional encoding at each diffusion step, matching its frequency spectrum with the current stage of the generative process. This approach allows us to generate images at resolutions that exceed the training resolution dramatically, e.g., 16 million pixels using FLUX. On multiple benchmarks, DyPE consistently improves performance and achieves state-of-the-art fidelity in ultra-high-resolution image generation, with gains becoming even more pronounced at higher resolutions. Project page is available at https://noamissachar.github.io/DyPE/.
[140] AlphaFlow: Understanding and Improving MeanFlow Models
Huijie Zhang,Aliaksandr Siarohin,Willi Menapace,Michael Vasilkovsky,Sergey Tulyakov,Qing Qu,Ivan Skorokhodov
Main category: cs.CV
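TL;DR: This paper proposes α-Flow, a framework that unifies trajectory flow matching, Shortcut Model, and MeanFlow, and uses a curriculum strategy to resolve the optimization conflict in MeanFlow, achieving new state-of-the-art generation results on ImageNet-1K.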
Details
Motivation: MeanFlow's success is not yet well understood, and its objective contains conflicting terms that slow convergence, so a more effective way to disentangle them is needed.
Method: The α-Flow framework unifies several methods under one formulation and adopts a curriculum that gradually anneals from trajectory flow matching to MeanFlow to ease the optimization conflict.
Result: On ImageNet-1K 256x256 with standard DiT backbones, α-Flow outperforms MeanFlow across scales and settings. The largest α-Flow-XL/2+ model reaches FID scores of 2.58 at 1-NFE and 2.15 at 2-NFE.
Conclusion: By disentangling the optimization objectives and introducing curriculum learning, α-Flow markedly improves training convergence and generation quality, setting a new state of the art in few-step generative modeling.
Abstract: MeanFlow has recently emerged as a powerful framework for few-step generative modeling trained from scratch, but its success is not yet fully understood. In this work, we show that the MeanFlow objective naturally decomposes into two parts: trajectory flow matching and trajectory consistency. Through gradient analysis, we find that these terms are strongly negatively correlated, causing optimization conflict and slow convergence. Motivated by these insights, we introduce α-Flow, a broad family of objectives that unifies trajectory flow matching, Shortcut Model, and MeanFlow under one formulation. By adopting a curriculum strategy that smoothly anneals from trajectory flow matching to MeanFlow, α-Flow disentangles the conflicting objectives, and achieves better convergence. When trained from scratch on class-conditional ImageNet-1K 256x256 with vanilla DiT backbones, α-Flow consistently outperforms MeanFlow across scales and settings. Our largest α-Flow-XL/2+ model achieves new state-of-the-art results using vanilla DiT backbones, with FID scores of 2.58 (1-NFE) and 2.15 (2-NFE).
[141] CUPID: Pose-Grounded Generative 3D Reconstruction from a Single Image
Binbin Huang,Haobin Duan,Yiqun Zhao,Zibo Zhao,Yi Ma,Shenghua Gao
Main category: cs.CV
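TL;DR: This paper proposes Cupid, a generation-based 3D reconstruction method that accurately infers an object's camera pose, 3D shape, and texture from a single 2D image.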
Details
Motivation: Existing 3D reconstruction methods are limited in recovering accurate shape and pose from a single image and lack robustness, especially without prior information. A unified generative framework is therefore needed to jointly optimize shape and pose estimation.
Method: Cupid casts 3D reconstruction as conditional sampling from a learned distribution of 3D objects and uses a two-stage flow matching pipeline: the first stage produces coarse 3D geometry and recovers pose, and the second integrates pose-aligned image features to improve structural fidelity and appearance detail. The method jointly generates voxels and pixel-voxel correspondences and represents camera pose and 3D shape in a shared 3D latent space.
Result: Experiments show that Cupid gains more than 3 dB in PSNR and reduces Chamfer Distance by more than 10%, matches existing monocular methods on pose accuracy, and surpasses baseline generative models in visual quality.
Conclusion: Through a unified generative framework, Cupid achieves more accurate and more robust 3D reconstruction, with leading performance in shape, pose, and texture recovery.
Abstract: This work proposes a new generation-based 3D reconstruction method, named Cupid, that accurately infers the camera pose, 3D shape, and texture of an object from a single 2D image. Cupid casts 3D reconstruction as a conditional sampling process from a learned distribution of 3D objects, and it jointly generates voxels and pixel-voxel correspondences, enabling robust pose and shape estimation under a unified generative framework. By representing both input camera poses and 3D shape as a distribution in a shared 3D latent space, Cupid adopts a two-stage flow matching pipeline: (1) a coarse stage that produces initial 3D geometry with associated 2D projections for pose recovery; and (2) a refinement stage that integrates pose-aligned image features to enhance structural fidelity and appearance details. Extensive experiments demonstrate Cupid outperforms leading 3D reconstruction methods with an over 3 dB PSNR gain and an over 10% Chamfer Distance reduction, while matching monocular estimators on pose accuracy and delivering superior visual fidelity over baseline 3D generative models. For an immersive view of the 3D results generated by Cupid, please visit cupid3d.github.io.
[142] Radar-Camera Fused Multi-Object Tracking: Online Calibration and Common Feature
Lei Cheng,Siyang Cao
Main category: cs.CV
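TL;DR: This paper proposes a multi-object tracking (MOT) framework that fuses radar and camera data, improving tracking accuracy through online calibration and common-feature matching, and is the first to explore radar-camera common features for online calibration.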
Details
Motivation: Existing studies often treat radar as a supplementary sensor and fail to exploit its ability to measure range accurately in 3D space. This work aims to give radar a central role while reducing manual intervention.
Method: A radar-camera fusion MOT framework uses features common to both sensors to perform online calibration, and applies feature matching and category-consistency checking to improve sensor association accuracy.
Result: The framework achieves higher tracking accuracy in both real traffic scenarios and controlled environments, and simplifies the radar-camera mapping process.
Conclusion: The method effectively improves the accuracy and automation of multi-object tracking and demonstrates radar's potential as a primary sensor in MOT.
Abstract: This paper presents a Multi-Object Tracking (MOT) framework that fuses radar and camera data to enhance tracking efficiency while minimizing manual interventions. Contrary to many studies that underutilize radar and assign it a supplementary role--despite its capability to provide accurate range/depth information of targets in a 3D world coordinate system--our approach positions radar in a crucial role. Meanwhile, this paper utilizes common features to enable online calibration to autonomously associate detections from radar and camera. The main contributions of this work include: (1) the development of a radar-camera fusion MOT framework that exploits online radar-camera calibration to simplify the integration of detection results from these two sensors, (2) the utilization of common features between radar and camera data to accurately derive real-world positions of detected objects, and (3) the adoption of feature matching and category-consistency checking to surpass the limitations of mere position matching in enhancing sensor association accuracy. To the best of our knowledge, we are the first to investigate the integration of radar-camera common features and their use in online calibration for achieving MOT. The efficacy of our framework is demonstrated by its ability to streamline the radar-camera mapping process and improve tracking precision, as evidenced by real-world experiments conducted in both controlled environments and actual traffic scenarios. Code is available at https://github.com/radar-lab/Radar_Camera_MOT
[143] ARGenSeg: Image Segmentation with Autoregressive Image Generation Model
Xiaolong Wang,Lixiang Ru,Ziyuan Huang,Kaixiang Ji,Dandan Zheng,Jingdong Chen,Jun Zhou
Main category: cs.CV
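TL;DR: Proposes ARGenSeg, a new paradigm for image segmentation based on autoregressive generation that achieves multimodal understanding and pixel-level perception within a unified framework.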
Details
Motivation: Existing methods rely on discrete representations or task-specific decoders and struggle to capture fine-grained visual details.
Method: Within an image generation framework, the MLLM outputs visual tokens that a universal VQ-VAE decodes into dense masks, and a parallel generation strategy reduces inference latency.
Result: The method surpasses prior state-of-the-art approaches on multiple segmentation datasets and substantially increases inference speed while retaining strong multimodal understanding.
Conclusion: ARGenSeg delivers efficient, accurate pixel-level segmentation and advances the use of MLLMs in dense prediction tasks.
Abstract: We propose a novel AutoRegressive Generation-based paradigm for image Segmentation (ARGenSeg), achieving multimodal understanding and pixel-level perception within a unified framework. Prior works integrating image segmentation into multimodal large language models (MLLMs) typically employ either boundary points representation or dedicated segmentation heads. These methods rely on discrete representations or semantic prompts fed into task-specific decoders, which limits the ability of the MLLM to capture fine-grained visual details. To address these challenges, we introduce a segmentation framework for MLLM based on image generation, which naturally produces dense masks for target objects. We leverage MLLM to output visual tokens and detokenize them into images using a universal VQ-VAE, making the segmentation fully dependent on the pixel-level understanding of the MLLM. To reduce inference latency, we employ a next-scale-prediction strategy to generate required visual tokens in parallel. Extensive experiments demonstrate that our method surpasses prior state-of-the-art approaches on multiple segmentation datasets with a remarkable boost in inference speed, while maintaining strong understanding capabilities.
[144] Video Prediction of Dynamic Physical Simulations With Pixel-Space Spatiotemporal Transformers
Dean L Slack,G Thomas Hudson,Thomas Winterbottom,Noura Al Moubayed
Main category: cs.CV
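TL;DR: Proposes a pure-transformer autoregressive video prediction model that, using continuous pixel-space representations and simple end-to-end training, extends the time horizon of accurate physical-simulation prediction by up to 50% while performing well on video quality and on generalization of physical parameter estimation.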
Details
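Motivation: Existing video generation methods fall short in causal modeling of long physical simulations and struggle to capture spatiotemporal dynamics accurately. This work aims to improve physical consistency and interpretability in video prediction by simplifying the model and focusing on physical simulation data.
Method: A pure transformer performs autoregressive video prediction directly in continuous pixel space. The study compares several spatiotemporal self-attention layouts, trains without supervision on physical simulation datasets, and evaluates spatiotemporal reasoning with object tracking metrics. Probing models are also used for interpretability analysis, identifying the network regions that encode PDE parameter information.
Result: Compared with existing latent-space approaches, the model extends the time horizon of physically accurate prediction by up to 50% while remaining comparable on common video quality metrics. The probing models generalize to estimating out-of-distribution physical parameters, supporting the claim that the model learns the underlying physics.
Conclusion: The simple, parameter-efficient, and interpretable pure-transformer approach provides an effective platform for attention-based spatiotemporal modeling, particularly for video prediction tasks that require long-term physical consistency.
Abstract: Inspired by the performance and scalability of autoregressive large language models (LLMs), transformer-based models have seen recent success in the visual domain. This study investigates a transformer adaptation for video prediction with a simple end-to-end approach, comparing various spatiotemporal self-attention layouts. Focusing on causal modeling of physical simulations over time, a common shortcoming of existing video-generative approaches, we attempt to isolate spatiotemporal reasoning via physical object tracking metrics and unsupervised training on physical simulation datasets. We introduce a simple yet effective pure transformer model for autoregressive video prediction, utilizing continuous pixel-space representations for video prediction. Without the need for complex training strategies or latent feature-learning components, our approach significantly extends the time horizon for physically accurate predictions by up to 50% when compared with existing latent-space approaches, while maintaining comparable performance on common video quality metrics. In addition, we conduct interpretability experiments to identify network regions that encode information useful to perform accurate estimations of PDE simulation parameters via probing models, and find that this generalizes to the estimation of out-of-distribution simulation parameters.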
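As a rough illustration of what an autoregressive spatiotemporal self-attention layout can look like, the sketch below builds a frame-level causal mask in which every token may attend to all tokens in its own frame and in earlier frames, but never to later frames. This is one generic layout assumed for illustration; the abstract does not say which layouts the authors compare. The function name and the frame-major token ordering are hypothetical.

```python
def frame_causal_mask(num_frames, tokens_per_frame):
    """Boolean attention mask for tokens flattened in frame-major order.

    mask[q][k] is True when query token q may attend to key token k,
    i.e. when k belongs to the same frame as q or to an earlier frame.
    """
    n = num_frames * tokens_per_frame
    return [
        [(k // tokens_per_frame) <= (q // tokens_per_frame) for k in range(n)]
        for q in range(n)
    ]

# Three frames of two tokens each: print the mask as rows of 1s and 0s.
for row in frame_causal_mask(3, 2):
    print("".join("1" if allowed else "0" for allowed in row))
```

Under this convention the mask is block lower-triangular, so a prediction for a given frame can depend only on that frame and on past frames.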
This work serves as a platform for further attention-based spatiotemporal modeling of videos via a simple, parameter efficient, and interpretable approach.
[145] SpectraMorph: Structured Latent Learning for Self-Supervised Hyperspectral Super-Resolution
Ritik Shah,Marco F Duarte
Main category: cs.CV
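TL;DR: Proposes SpectraMorph, a physics-guided self-supervised fusion framework that fuses hyperspectral and multispectral images through an unmixing bottleneck, offering interpretability, fast training, and strong robustness.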
Details
Motivation: Existing deep learning methods lack interpretability, and their performance degrades when the multispectral image has very few bands.
Method: A physics-guided self-supervised framework extracts endmembers from the low-resolution hyperspectral image, uses a multilayer perceptron to predict abundance maps from the multispectral image, and reconstructs spectra by linear mixing.
Result: On synthetic and real-world datasets the method outperforms unsupervised and self-supervised baselines and approaches supervised performance, while training quickly and remaining robust with a single-band MSI.
Conclusion: SpectraMorph provides high-performing, interpretable, and efficient hyperspectral super-resolution fusion that works with few-band and even single-band multispectral images.
Abstract: Hyperspectral sensors capture dense spectra per pixel but suffer from low spatial resolution, causing blurred boundaries and mixed-pixel effects. Co-registered companion sensors such as multispectral, RGB, or panchromatic cameras provide high-resolution spatial detail, motivating hyperspectral super-resolution through the fusion of hyperspectral and multispectral images (HSI-MSI). Existing deep learning based methods achieve strong performance but rely on opaque regressors that lack interpretability and often fail when the MSI has very few bands. We propose SpectraMorph, a physics-guided self-supervised fusion framework with a structured latent space. Instead of direct regression, SpectraMorph enforces an unmixing bottleneck: endmember signatures are extracted from the low-resolution HSI, and a compact multilayer perceptron predicts abundance-like maps from the MSI. Spectra are reconstructed by linear mixing, with training performed in a self-supervised manner via the MSI sensor's spectral response function. SpectraMorph produces interpretable intermediates, trains in under a minute, and remains robust even with a single-band (pan-chromatic) MSI. Experiments on synthetic and real-world datasets show SpectraMorph consistently outperforming state-of-the-art unsupervised/self-supervised baselines while remaining very competitive against supervised baselines.
[146] Towards General Modality Translation with Contrastive and Predictive Latent Diffusion Bridge
Nimrod Berman,Omkar Joglekar,Eitan Kosman,Dotan Di Castro,Omri Azencot
Main category: cs.CV
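TL;DR: This paper proposes LDDBM, a general modality translation framework based on a latent-variable diffusion bridge model that translates between arbitrary modalities without requiring aligned dimensions.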
Details
Motivation: Existing modality translation methods are constrained by assumptions such as shared dimensionality, Gaussian priors, and modality-specific architectures, which limits their generality and theoretical grounding.
Method: The Latent Denoising Diffusion Bridge Model (LDDBM) builds a bridge between modalities in a shared latent space, introduces a contrastive alignment loss and a predictive loss, and uses a domain-agnostic encoder-decoder architecture for noise prediction in latent space.
Result: The method performs strongly on tasks including multi-view to 3D shape generation, image super-resolution, and multi-view scene synthesis, supports arbitrary modality pairs, and is validated experimentally for effectiveness and stability.
Conclusion: LDDBM provides a new strong baseline for general modality translation, with good extensibility and practical potential.
Abstract: Recent advances in generative modeling have positioned diffusion models as state-of-the-art tools for sampling from complex data distributions. While these models have shown remarkable success across single-modality domains such as images and audio, extending their capabilities to Modality Translation (MT), translating information across different sensory modalities, remains an open challenge. Existing approaches often rely on restrictive assumptions, including shared dimensionality, Gaussian source priors, and modality-specific architectures, which limit their generality and theoretical grounding. In this work, we propose the Latent Denoising Diffusion Bridge Model (LDDBM), a general-purpose framework for modality translation based on a latent-variable extension of Denoising Diffusion Bridge Models. By operating in a shared latent space, our method learns a bridge between arbitrary modalities without requiring aligned dimensions. We introduce a contrastive alignment loss to enforce semantic consistency between paired samples and design a domain-agnostic encoder-decoder architecture tailored for noise prediction in latent space. Additionally, we propose a predictive loss to guide training toward accurate cross-domain translation and explore several training strategies to improve stability. Our approach supports arbitrary modality pairs and performs strongly on diverse MT tasks, including multi-view to 3D shape generation, image super-resolution, and multi-view scene synthesis. Comprehensive experiments and ablations validate the effectiveness of our framework, establishing a new strong baseline in general modality translation.
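The contrastive alignment idea mentioned in the abstract can be sketched with a standard InfoNCE-style loss over paired latents. This is a generic formulation assumed for illustration, not necessarily the exact loss used in LDDBM; the function names, the cosine similarity, and the temperature value are all hypothetical choices.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(src_latents, tgt_latents, temperature=0.1):
    """Mean InfoNCE loss: src_latents[i] should match tgt_latents[i].

    For each source latent, the paired target is the positive and every
    other target in the batch is a negative.
    """
    n = len(src_latents)
    total = 0.0
    for i in range(n):
        logits = [cosine(src_latents[i], t) / temperature for t in tgt_latents]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / n

# Correctly paired latents give a much smaller loss than swapped pairs.
source = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(source, [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce(source, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Minimizing a loss of this shape pulls paired samples from the two modalities together in the shared latent space and pushes unpaired samples apart, which is the semantic-consistency effect the abstract attributes to its alignment term.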
For more information, see our project page: https://sites.google.com/view/lddbm/home.
[147] LayerComposer: Interactive Personalized T2I via Spatially-Aware Layered Canvas
Guocheng Gordon Qian,Ruihang Zhang,Tsai-Shien Chen,Yusuf Dalva,Anujraaj Argo Goyal,Willi Menapace,Ivan Skorokhodov,Meng Dong,Arpit Sahni,Daniil Ostashev,Ju Hu,Sergey Tulyakov,Kuan-Chieh Jackson Wang
Main category: cs.CV
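TL;DR: Proposes LayerComposer, an interactive framework for personalized multi-subject text-to-image generation that uses a layered canvas and a locking mechanism to achieve occlusion-free composition and high-fidelity preservation.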
Details
Motivation: Existing personalized generative models offer too little interactive control over spatial composition and scale poorly to multiple subjects.
Method: A layered canvas representation places each subject on its own layer, and a locking mechanism preserves selected layers with high fidelity while letting the remaining layers adapt flexibly to the context.
Result: Experiments show that LayerComposer achieves better spatial control and identity preservation than existing methods in multi-subject personalized image generation.
Conclusion: With its layered structure and a locking mechanism that requires no architectural changes, LayerComposer effectively improves the controllability and quality of multi-subject generation.
Abstract: Despite their impressive visual fidelity, existing personalized generative models lack interactive control over spatial composition and scale poorly to multiple subjects. To address these limitations, we present LayerComposer, an interactive framework for personalized, multi-subject text-to-image generation. Our approach introduces two main contributions: (1) a layered canvas, a novel representation in which each subject is placed on a distinct layer, enabling occlusion-free composition; and (2) a locking mechanism that preserves selected layers with high fidelity while allowing the remaining layers to adapt flexibly to the surrounding context. Similar to professional image-editing software, the proposed layered canvas allows users to place, resize, or lock input subjects through intuitive layer manipulation. Our versatile locking mechanism requires no architectural changes, relying instead on inherent positional embeddings combined with a new complementary data sampling strategy. Extensive experiments demonstrate that LayerComposer achieves superior spatial control and identity preservation compared to the state-of-the-art methods in multi-subject personalized image generation.
[148] HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives
Yihao Meng,Hao Ouyang,Yue Yu,Qiuyu Wang,Wen Wang,Ka Leong Cheng,Hanlin Wang,Yixuan Li,Cheng Chen,Yanhong Zeng,Yujun Shen,Huamin Qu
Main category: cs.CV
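TL;DR: HoloCine is a new text-to-video model that holistically generates coherent multi-shot narrative scenes, addressing the narrative-consistency shortcomings of conventional models.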