Table of Contents
cs.CL
[1] The Perplexity Paradox: Why Code Compresses Better Than Math in LLM Prompts
Warren Johnson
Main category: cs.CL
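TL;DR: This paper validates that code generation and chain-of-thought reasoning tolerate prompt compression differently, explains the "perplexity paradox" mechanism, and proposes TAAC, a task-aware adaptive compression algorithm that cuts cost while preserving quality.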
Details
Motivation: Address three gaps in the prior study: limited benchmark coverage, an unvalidated "perplexity paradox" mechanism, and no adaptive algorithm.
Method: Validate how well the compression threshold generalizes across six code benchmarks and four reasoning benchmarks; conduct the first per-token perplexity analysis (n=723); propose TAAC (Task-Aware Adaptive Compression), with systematic validation on MBPP (n=1,800).
Result: The compression threshold generalizes across languages and difficulty levels. The "perplexity paradox" is observed: code syntax tokens have high perplexity and are preserved, while task-critical numerical tokens have low perplexity and are pruned. Signature injection raises the pass rate by 34 percentage points. TAAC achieves a 22% cost reduction with 96% quality preservation, outperforming fixed-ratio compression by 7%.
Conclusion: The effect of prompt compression depends strongly on task type and on each token's semantic role, so a task-aware adaptive strategy is needed. The "perplexity paradox" exposes an inherent bias in current compression methods, which signature injection and TAAC can mitigate.
Abstract: In "Compress or Route?" (Johnson, 2026), we found that code generation tolerates aggressive prompt compression (r >= 0.6) while chain-of-thought reasoning degrades gradually. That study was limited to HumanEval (164 problems), left the "perplexity paradox" mechanism unvalidated, and provided no adaptive algorithm. This paper addresses all three gaps. First, we validate across six code benchmarks (HumanEval, MBPP, HumanEval+, MultiPL-E) and four reasoning benchmarks (GSM8K, MATH, ARC-Challenge, MMLU-STEM), confirming the compression threshold generalizes across languages and difficulties. Second, we conduct the first per-token perplexity analysis (n=723 tokens), revealing a "perplexity paradox": code syntax tokens are preserved (high perplexity) while numerical values in math problems are pruned despite being task-critical (low perplexity). Signature injection recovers +34 percentage points in pass rate (5.3% to 39.3%; Cohen's h=0.890). Third, we propose TAAC (Task-Aware Adaptive Compression), achieving 22% cost reduction with 96% quality preservation, outperforming fixed-ratio compression by 7%. MBPP validation (n=1,800 trials) confirms systematic variation: 3.6% at r=0.3 to 54.6% at r=1.0.
[2] Language Model Representations for Efficient Few-Shot Tabular Classification
Inwon Kang,Parikshit Ram,Yi Zhou,Horst Samulowitz,Oshani Seneviratne
Main category: cs.CL
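TL;DR: This paper proposes TaRL, which uses semantic embeddings from existing large language models (LLMs) for few-shot classification of Web tables; by removing the common component of the embeddings and calibrating the softmax temperature, it matches specialized models in low-data settings.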
Details
Motivation: How to reuse already-deployed LLMs for efficient few-shot classification of heterogeneous Web tables (e.g., product catalogs, knowledge base exports, scientific data portals) without training specialized models.
Method: The Table Representation with Language Model (TaRL) paradigm: (1) extract an LLM semantic embedding for each table row; (2) remove the common component from all embeddings; (3) calibrate the softmax temperature, with a meta-learner trained on handcrafted features predicting an appropriate value.
Result: On semantically rich tables in few-shot settings (k ≤ 32), TaRL performs comparably to state-of-the-art specialized tabular models, confirming that reusing off-the-shelf LLM infrastructure for Web table understanding is feasible and effective.
Conclusion: Without retraining or building specialized models, lightweight post-processing (common-component removal plus temperature calibration) unlocks the potential of LLM embeddings for few-shot Web table classification, offering an efficient and scalable path to understanding Web-native structured data.
Abstract: The Web is a rich source of structured data in the form of tables, from product catalogs and knowledge bases to scientific datasets. However, the heterogeneity of the structure and semantics of these tables makes it challenging to build a unified method that can effectively leverage the information they contain. Meanwhile, large language models (LLMs) are becoming an increasingly integral component of web infrastructure for tasks like semantic search. This raises a crucial question: can we leverage these already-deployed LLMs to classify structured data in web-native tables (e.g., product catalogs, knowledge base exports, scientific data portals), avoiding the need for specialized models or extensive retraining? This work investigates a lightweight paradigm, Table Representation with Language Model (TaRL), for few-shot tabular classification that directly utilizes semantic embeddings of individual table rows. We first show that naive application of these embeddings underperforms compared to specialized tabular models. We then demonstrate that their potential can be unlocked with two key techniques: removing the common component from all embeddings and calibrating the softmax temperature. We show that a simple meta-learner, trained on handcrafted features, can learn to predict an appropriate temperature. This approach achieves performance comparable to state-of-the-art models in low-data regimes (k ≤ 32) of semantically-rich tables.
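The two techniques named above can be illustrated with a minimal sketch. It assumes the "common component" is the mean embedding vector and that classification is done by cosine similarity to class prototypes; neither detail is stated in the abstract, and all function names are hypothetical. The learned meta-learner is not modelled, so the temperature is passed in by hand.

```python
import math

def remove_common_component(embeddings):
    """Subtract the mean vector (assumed stand-in for the 'common component')."""
    dim = len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    centered = [[e[i] - mean[i] for i in range(dim)] for e in embeddings]
    return centered, mean

def cosine(u, v):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def softmax(scores, temperature):
    """Temperature-scaled softmax; lower temperature gives sharper output."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def few_shot_predict(support, labels, query, temperature=0.1):
    """Classify one row embedding against k-shot class prototypes."""
    centered, mean = remove_common_component(support)
    q = [x - m for x, m in zip(query, mean)]
    classes = sorted(set(labels))
    prototypes = []
    for c in classes:
        rows = [e for e, y in zip(centered, labels) if y == c]
        prototypes.append([sum(col) / len(rows) for col in zip(*rows)])
    probs = softmax([cosine(q, p) for p in prototypes], temperature)
    return dict(zip(classes, probs))

# Toy row embeddings that share a large first coordinate (the common part).
support = [[1.0, 0.9, 0.1], [1.0, 0.8, 0.2], [1.0, 0.1, 0.9], [1.0, 0.2, 0.8]]
labels = ["a", "a", "b", "b"]
query = [1.0, 0.85, 0.15]
print(few_shot_predict(support, labels, query))
```

Without the centering step, every toy vector would have a high cosine similarity to every prototype because of the shared first coordinate, which is the kind of effect that removing a common component is meant to counter.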
Our findings demonstrate the viability of an efficient, semantics-driven pathway to reuse existing LLM infrastructure for Web table understanding.
[3] KD4MT: A Survey of Knowledge Distillation for Machine Translation
Ona de Gibert,Joseph Attieh,Timothee Mickus,Yves Scherrer,Jörg Tiedemann
Main category: cs.CL
TL;DR: This survey comprehensively reviews 105 papers on Knowledge Distillation for Machine Translation (KD4MT), categorizing methodological advances and practical applications, identifying research gaps, evaluation inconsistencies, risks (e.g., hallucination, bias), and the evolving role of LLMs; it also provides a public database and glossary.
Details
Motivation: To systematically synthesize the rapidly growing body of work on Knowledge Distillation for Machine Translation (KD4MT), address the lack of unified evaluation practices, identify key research gaps and practical risks, and support future research with accessible resources.
Method: A comprehensive literature survey of 105 papers on KD4MT (up to October 1, 2025), involving qualitative and quantitative analysis, categorization by methodology and application, identification of trends and gaps, risk assessment, and provision of guidelines, a public database, and a glossary.
Result: A structured taxonomy of KD4MT methods and applications, identification of common trends and major research gaps, exposure of inconsistent evaluation practices, documentation of practical risks (e.g., hallucination, bias amplification), practical selection guidelines, and insights into how LLMs are reshaping KD4MT.
Conclusion: KD in MT goes beyond model compression to serve as a versatile knowledge transfer mechanism; however, the field suffers from fragmented evaluation, underexplored risks, and a lack of standardized benchmarks. This survey consolidates knowledge, highlights challenges, and offers tools to advance rigorous and responsible KD4MT research.
Abstract: Knowledge Distillation (KD) as a research area has gained a lot of traction in recent years as a compression tool to address challenges related to ever-larger models in NLP. Remarkably, Machine Translation (MT) offers a much more nuanced take on this narrative: in MT, KD also functions as a general-purpose knowledge transfer mechanism that shapes supervision and translation quality as well as efficiency. This survey synthesizes KD for MT (KD4MT) across 105 papers (through October 1, 2025). We begin by introducing both MT and KD for non-experts, followed by an overview of the standard KD approaches relevant to MT applications. Subsequently, we categorize advances in the KD4MT literature based on (i) their methodological contributions and (ii) their practical applications. Our qualitative and quantitative analyses identify common trends in the field and highlight key research gaps as well as the absence of unified evaluation practice for KD methods in MT. We further provide practical guidelines for selecting a KD method in concrete settings and highlight potential risks associated with the application of KD to MT such as increased hallucination and bias amplification. Finally, we discuss the role of LLMs in re-shaping the KD4MT field. To support further research, we complement our survey with a publicly available database summarizing the main characteristics of the surveyed KD methods and a glossary of key terms.
[4] Gated Tree Cross-attention for Checkpoint-Compatible Syntax Injection in Decoder-Only LLMs
Xinyu Gao,Shaonan Wang,Nai Ding
Main category: cs.CL
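TL;DR: This paper proposes gated tree cross-attention (GTCA), a lightweight, checkpoint-compatible branch that injects syntactic structure into decoder-only LLMs without modifying the backbone, improving robustness to grammatical perturbations without hurting performance on tasks such as multiple-choice QA and commonsense reasoning.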
Details
Motivation: Decoder-only LLMs perform strongly but are highly sensitive to minor grammatical perturbations, which undermines the reliability of downstream reasoning; directly injecting syntactic structure into an already-trained model tends to damage its pretrained competence.
Method: A checkpoint-compatible gated tree cross-attention (GTCA) branch attends over a precomputed constituency chunk memory; a token update mask and staged training control the scope and timing of syntactic injection.
Result: Across multiple benchmarks and Transformer backbones, GTCA improves syntactic robustness beyond continued-training baselines without degrading multiple-choice QA or commonsense reasoning performance.
Conclusion: GTCA offers a practical, non-invasive way to make decoder-only LLMs more syntax-robust, balancing structural guidance with preservation of pretrained competence.
Abstract: Decoder-only large language models achieve strong broad performance but are brittle to minor grammatical perturbations, undermining reliability for downstream reasoning. However, directly injecting explicit syntactic structure into an existing checkpoint can interfere with its pretrained competence. We introduce a checkpoint-compatible gated tree cross-attention (GTCA) branch that reads precomputed constituency chunk memory while leaving backbone architecture unchanged. Our design uses a token update mask and staged training to control the scope and timing of structural updates. Across benchmarks and Transformer backbones, GTCA strengthens syntactic robustness beyond continued-training baselines without compromising multiple-choice QA performance or commonsense reasoning, providing a practical checkpoint-compatible route to more syntax-robust decoder-only LLMs.
[5] Do Personality Traits Interfere? Geometric Limitations of Steering in Large Language Models
Pranav Bhandari,Usman Naseem,Mehwish Nasim
Main category: cs.CL
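TL;DR: This paper examines the assumption that personality traits in LLMs can be controlled independently and finds that steering directions are substantially geometrically dependent: cross-trait behavioural effects persist even after linear overlap is removed, and enforced orthonormalisation achieves geometric independence at the cost of steering strength, indicating that personality traits occupy a weakly coupled subspace.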
Details
Motivation: Test whether the implicit assumption that personality traits can be controlled independently in LLMs actually holds.
Method: Analyse the geometric relationships between Big Five personality steering vectors in LLaMA-3-8B and Mistral-8B under several geometric conditioning schemes: unconstrained directions, soft orthonormalisation, and hard orthonormalisation.
Result: Steering directions show substantial geometric dependence; hard orthonormalisation enforces geometric independence but does not eliminate cross-trait behavioural effects and can reduce steering strength.
Conclusion: Personality traits in LLMs are not fully independent; they occupy a slightly coupled subspace, which limits fully independent trait control.
Abstract: Personality steering in large language models (LLMs) commonly relies on injecting trait-specific steering vectors, implicitly assuming that personality traits can be controlled independently. In this work, we examine whether this assumption holds by analysing the geometric relationships between Big Five personality steering directions. We study steering vectors extracted from two model families (LLaMA-3-8B and Mistral-8B) and apply a range of geometric conditioning schemes, from unconstrained directions to soft and hard orthonormalisation. Our results show that personality steering directions exhibit substantial geometric dependence: steering one trait consistently induces changes in others, even when linear overlap is explicitly removed. While hard orthonormalisation enforces geometric independence, it does not eliminate cross-trait behavioural effects and can reduce steering strength. These findings suggest that personality traits in LLMs occupy a slightly coupled subspace, limiting fully independent trait control.
[6] Can LLMs Assess Personality? Validating Conversational AI for Trait Profiling
Andrius Matšenas,Anet Lello,Tõnis Lees,Hans Peep,Kim Lilii Tamm
Main category: cs.CL
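TL;DR: This study validates LLMs as a dynamic alternative to questionnaire-based personality assessment, finding moderate convergent validity with the gold-standard IPIP-50 questionnaire on some traits and comparable user-perceived accuracy.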
Details
Motivation: Explore whether LLMs can serve as a dynamic, interactive alternative to traditional questionnaire-based personality assessment, improving the assessment experience and applicability.
Method: A within-subjects experiment (N=33) compares Big Five scores derived from guided LLM conversations with IPIP-50 questionnaire results and measures how accurate users perceive each method's output to be.
Result: LLM and IPIP-50 scores are statistically equivalent for Conscientiousness, Openness, and Neuroticism but differ significantly for Agreeableness and Extraversion; overall convergent validity is moderate (r=0.38-0.58); users rate LLM-generated profiles as equally accurate as questionnaire results.
Conclusion: LLM-driven conversational personality assessment is a promising new psychometric approach but needs trait-specific calibration.
Abstract: This study validates Large Language Models (LLMs) as a dynamic alternative to questionnaire-based personality assessment. Using a within-subjects experiment (N=33), we compared Big Five personality scores derived from guided LLM conversations against the gold-standard IPIP-50 questionnaire, while also measuring user-perceived accuracy. Results indicate moderate convergent validity (r=0.38-0.58), with Conscientiousness, Openness, and Neuroticism scores statistically equivalent between methods. Agreeableness and Extraversion showed significant differences, suggesting trait-specific calibration is needed. Notably, participants rated LLM-generated profiles as equally accurate as traditional questionnaire results. These findings suggest conversational AI offers a promising new approach to traditional psychometrics.
[7] Preference Optimization for Review Question Generation Improves Writing Quality
Karun Sharma,Vidushee Vats,Shengzhi Li,Yuxiang Wang,Zhongtian Sun,Prayag Tiwari
Main category: cs.CL
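TL;DR: This paper introduces the IntelliReward reward model and the IntelliAsk question-generation model, which improve the quality of LLM-generated review questions with respect to evidence, effort, and grounding, and markedly improve performance on reasoning and writing evaluations.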
Details
Motivation: Existing LLM-based approaches to peer-review question generation mostly produce surface-level questions, with over 50% of question tokens drawn from a paper's first page; they lack depth and evidential support, so a generation mechanism that better matches expert standards is needed.
Method: Build IntelliReward, a reward model consisting of a frozen autoregressive LLM plus trainable multi-head transformers over the final 50 token states; train the IntelliAsk question-generation model with Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) so that it aligns with human preferences on effort, evidence, and grounding.
Result: IntelliAsk outperforms the Qwen3-32B baseline on the MuSR reasoning task (68.3 vs 64.7 accuracy) and the WritingBench writing evaluation (8.31 vs 8.07); IntelliReward outperforms API-based SFT baselines at predicting expert preferences.
Conclusion: High-quality review-question generation improves peer review and also reflects and promotes broader reasoning and writing ability; the released model and annotations provide an automatic evaluation benchmark for grounding, effort, and evidence in LLM-generated questions.
Abstract: Peer review relies on substantive, evidence-based questions, yet existing LLM-based approaches often generate surface-level queries, drawing over 50% of their question tokens from a paper's first page. To bridge this gap, we develop IntelliReward, a novel reward model built from a frozen autoregressive LLM with trainable multi-head transformers over the final 50 token states, which outperforms API-based SFT baselines in predicting expert-level human preferences. By applying Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) with IntelliReward, we train IntelliAsk, a question-generation model aligned with human standards of effort, evidence, and grounding. We find consistent improvements on reasoning and writing benchmarks, suggesting reviewer-question quality correlates with broader capabilities. Compared to the Qwen3-32B base model, IntelliAsk shows measurable gains across diverse benchmarks, specifically improving performance on reasoning tasks like MuSR (68.3 vs 64.7 Acc) and complex writing evaluations such as WritingBench (8.31 vs 8.07). We release our implementation, expert preference annotations, and the IntelliReward model to provide an automatic evaluation benchmark for grounding, effort, and evidence in LLM-generated review questions.
[8] Large Language Models for Assisting American College Applications
Zhengliang Liu,Weihang You,Peng Shu,Junhao Chen,Yi Pan,Hanqi Jiang,Yiwei Li,Zhaojun Ding,Chao Cao,Xinliang Li,Yifan Zhou,Ruidong Zhang,Shaochen Xu,Wei Ruan,Huaqin Zhao,Dajiang Zhu,Tianming Liu
Main category: cs.CL
TL;DR: EZCollegeApp is an LLM-powered system that helps high school students navigate complex college applications by structuring forms, grounding answers in official documents, and keeping humans in full control.
Details
Motivation: American college applications are fragmented, repetitive, and ambiguous, requiring students to cross-reference multiple sources, which creates barriers especially for under-resourced applicants.
Method: EZCollegeApp uses a mapping-first paradigm to separate form understanding from answer generation; it ingests official admissions documents, applies retrieval-augmented QA, and provides a human-in-the-loop chatbot interface with grounded suggestions.
Result: The system achieves consistent reasoning across diverse application portals, supports secure and privacy-preserving interactions, and is validated via automated testing and human quality assessment.
Conclusion: EZCollegeApp demonstrates how LLMs can meaningfully assist in high-stakes, document-intensive real-world tasks while preserving transparency, control, and trust through human oversight and authoritative grounding.
Abstract: American college applications require students to navigate fragmented admissions policies, repetitive and conditional forms, and ambiguous questions that often demand cross-referencing multiple sources. We present EZCollegeApp, a large language model (LLM)-powered system that assists high-school students by structuring application forms, grounding suggested answers in authoritative admissions documents, and maintaining full human control over final responses. The system introduces a mapping-first paradigm that separates form understanding from answer generation, enabling consistent reasoning across heterogeneous application portals. EZCollegeApp integrates document ingestion from official admissions websites, retrieval-augmented question answering, and a human-in-the-loop chatbot interface that presents suggestions alongside application fields without automated submission. We describe the system architecture, data pipeline, internal representations, security and privacy measures, and evaluation through automated testing and human quality assessment. Our source code is released on GitHub (https://github.com/ezcollegeapp-public/ezcollegeapp-public) to facilitate the broader impact of this work.
[9] Narrative Theory-Driven LLM Methods for Automatic Story Generation and Understanding: A Survey
David Y. Liu,Aditya Joshi,Paul Dawson
Main category: cs.CL
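TL;DR: This survey reviews research applying narrative theory with large language models (LLMs), proposes a narratology-based taxonomy, analyses datasets, tasks, theory integration, and methodological trends, notes the lack of a unified narrative evaluation benchmark, and recommends that future work focus on theory-driven metrics, large-scale cultural analysis, and theory-validating experiments.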
Details
Motivation: Natural language processing (NLP) and narratology have long been disconnected, and LLMs offer a new opportunity to bridge them; existing work needs systematic review and a theoretically grounded analytical framework.
Method: Literature review and taxonomy construction: using established narratological distinctions (e.g., story/discourse, narrator/perspective), identify and categorize patterns in narrative-related NLP datasets, task design, prompt engineering, and fine-tuning strategies.
Result: Three core patterns are identified: narrative datasets and tasks, ways of mapping theory onto NLP pipelines, and LLM methodological trends. LLMs make it easy to connect abstract narrative concepts with NLP pipelines, but the lack of a unified benchmark hampers model comparison and theory validation.
Conclusion: Progress in narrative NLP depends less on a single general benchmark for "narrative quality" than on fine-grained, theory-based evaluation metrics, theory-driven large-scale humanities analysis, and generation experiments whose outputs can feed back into narrative theory.
Abstract: Applications of narrative theories using large language models (LLMs) deliver promising use-cases in automatic story generation and understanding tasks. Our survey examines how natural language processing (NLP) research engages with fields of narrative studies, and proposes a taxonomy for ongoing efforts that reflect established distinctions in narratology. We discover patterns in the following: narrative datasets and tasks; narrative theories and NLP pipelines; and methodological trends in prompting and fine-tuning. We highlight how LLMs enable easy connections of NLP pipelines with abstract narrative concepts and opportunities for interdisciplinary collaboration. Challenges remain in attempts to work towards any unified definition or benchmark of narrative-related tasks, making model comparison difficult. For future directions, instead of the pursuit of a single, generalised benchmark for 'narrative quality', we believe that progress benefits more from efforts that focus on the following: defining and improving theory-based metrics for individual narrative attributes to incrementally improve model performance; conducting large-scale, theory-driven literary/social/cultural analysis; and creating experiments where outputs can be used to validate or refine narrative theories. This work provides a contextual foundation for more systematic and theoretically informed narrative research in NLP by providing an overview of ongoing research efforts and the broader narrative studies landscape.
[10] Building Safe and Deployable Clinical Natural Language Processing under Temporal Leakage Constraints
Ha Na Cho,Sairam Sutari,Alexander Lopez,Hansen Bow,Kai Zheng
Main category: cs.CL
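TL;DR: This paper proposes a lightweight auditing pipeline that identifies and suppresses temporal and lexical leakage in clinical NLP models, improving their safety, calibration, and temporal validity in real clinical settings.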
Details
Motivation: Clinical NLP models are vulnerable to temporal and lexical leakage, which inflates apparent predictive performance and threatens safe clinical deployment.
Method: Design an interpretability-driven lightweight auditing pipeline, integrated into model development, that identifies and suppresses leakage-prone signals before final training; evaluate it on next-day discharge prediction after elective spine surgery.
Result: Audited models produce more conservative, better-calibrated probability estimates and rely less on discharge-related lexical cues.
Conclusion: Deployment-oriented clinical NLP systems should prioritize temporal validity, probability calibration, and behavioral robustness over optimistic performance metrics.
Abstract: Clinical natural language processing (NLP) models have shown promise for supporting hospital discharge planning by leveraging narrative clinical documentation. However, note-based models are particularly vulnerable to temporal and lexical leakage, where documentation artifacts encode future clinical decisions and inflate apparent predictive performance. Such behavior poses substantial risks for real-world deployment, where overconfident or temporally invalid predictions can disrupt clinical workflows and compromise patient safety. This study focuses on system-level design choices required to build safe and deployable clinical NLP under temporal leakage constraints. We present a lightweight auditing pipeline that integrates interpretability into the model development process to identify and suppress leakage-prone signals prior to final training. Using next-day discharge prediction after elective spine surgery as a case study, we evaluate how auditing affects predictive behavior, calibration, and safety-relevant trade-offs. Results show that audited models exhibit more conservative and better-calibrated probability estimates, with reduced reliance on discharge-related lexical cues. These findings emphasize that deployment-ready clinical NLP systems should prioritize temporal validity, calibration, and behavioral robustness over optimistic performance.
[11] A Lightweight Explainable Guardrail for Prompt Safety
Md Asiful Islam,Mihai Surdeanu
Main category: cs.CL
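TL;DR: This paper proposes LEG, a lightweight explainable guardrail that jointly trains a prompt classifier and an explanation classifier via multi-task learning, using synthetic data and a novel loss function to improve both classification and explainability.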
Details
Motivation: Existing methods offer limited explainability for prompt classification and rely on large models; LLMs also exhibit confirmation bias, which degrades explanation quality.
Method: A multi-task architecture jointly trains a prompt classifier and an explanation classifier; a novel synthetic data generation strategy counteracts LLM confirmation bias; a novel loss combines cross-entropy and focal losses with uncertainty-based weighting.
Result: On three datasets, both in-domain and out-of-domain, LEG matches or exceeds the state of the art in prompt classification and explainability with a considerably smaller model.
Conclusion: LEG is an efficient, lightweight, and explainable method for prompt safety classification that balances performance and practicality, and the authors intend to release it publicly.
Abstract: We propose a lightweight explainable guardrail (LEG) method for the classification of unsafe prompts. LEG uses a multi-task learning architecture to jointly learn a prompt classifier and an explanation classifier, where the latter labels prompt words that explain the safe/unsafe overall decision. LEG is trained using synthetic data for explainability, which is generated using a novel strategy that counteracts the confirmation biases of LLMs. Lastly, LEG's training process uses a novel loss that captures global explanation signals and combines cross-entropy and focal losses with uncertainty-based weighting. LEG obtains equivalent or better performance than the state-of-the-art for both prompt classification and explainability, both in-domain and out-of-domain on three datasets, despite the fact that its model size is considerably smaller than current approaches. If accepted, we will release all models and the annotated dataset publicly.
[12] Decoupling Strategy and Execution in Task-Focused Dialogue via Goal-Oriented Preference Optimization
Jingyi Xu,Xingyu Ren,Zhiqiang You,Yumeng Zhang,Zhoupeng Shou
Main category: cs.CL
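TL;DR: This paper proposes Goal-Oriented Preference Optimization (GOPO), a hierarchical reinforcement learning framework that decouples strategy planning from response generation to improve long-horizon task success in task-oriented dialogue; on several datasets (notably Mgshop) it substantially improves the newly proposed sequence-level metric TSE and outperforms PPO, Memento, and large-model baselines.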
Details
Motivation: Existing LLM training for task-oriented dialogue mostly relies on token-level likelihood or preference optimization, which aligns poorly with long-horizon task success.
Method: GOPO comprises an Expert Agent, which optimizes multi-turn goal preferences at the dialogue-trajectory level, and a Customer Service Agent, which generates responses strictly aligned with the selected strategy; the paper also introduces Task-focused Sequential Engagement (TSE), a sequence-level metric derived from real e-commerce interaction data.
Result: On Mgshop, GOPO improves TSE by 7.7% over PPO and 10.3% over Memento; a 14B model trained with GOPO exceeds Qwen-235B and GPT-5.2 by 2.7% and 1.5% TSE, respectively; ablations confirm that the Expert Agent is critical for long-horizon optimization.
Conclusion: GOPO establishes a new paradigm for task-oriented dialogue systems in commercial scenarios, markedly improving long-horizon task completion and sequence-level interaction quality.
Abstract: Large language models show potential in task-oriented dialogue systems, yet existing training methods often rely on token-level likelihood or preference optimization, which poorly align with long-horizon task success. To address this, we propose Goal-Oriented Preference Optimization (GOPO), a hierarchical reinforcement learning framework that decouples strategy planning from response generation via an Expert Agent and a Customer Service Agent. The Expert Agent optimizes multi-turn goal preferences at the dialogue-trajectory level, while the Customer Service Agent generates responses strictly aligned with the selected strategy. We evaluate GOPO on public benchmarks and e-commerce customer service datasets, and introduce Task-focused Sequential Engagement (TSE), a sequence-level metric derived from real e-commerce interaction data. On the Mgshop dataset, GOPO improves TSE by 7.7% and 10.3% over PPO and Memento, respectively, with consistent gains in sequence-level reward and generation quality. Furthermore, a 14B model trained with GOPO achieves 2.7% and 1.5% higher TSE than Qwen-235B and GPT-5.2, respectively. Ablation studies confirm the Expert Agent's critical role in long-horizon optimization. GOPO demonstrates consistent improvements across other datasets as well. This work establishes a new paradigm for task-oriented dialogue systems in commercial scenarios, with code and datasets to be made public.
[13] Rethinking Soft Compression in Retrieval-Augmented Generation: A Query-Conditioned Selector Perspective
Yunhao Liu,Zian Jia,Xinyu Gao,Kanjun Xu,Yun Xiong
Main category: cs.CL
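TL;DR: This paper proposes SeleCom, a framework that replaces conventional full compression with query-conditioned information selection, improving both the effectiveness and efficiency of soft context compression in RAG.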
Details
Motivation: Existing soft context compression methods rely on forced full compression that ignores query relevance, which hurts performance; the authors identify two problems: infeasibility (it conflicts with the LLM's generation behavior) and non-necessity (it dilutes key information).
Method: SeleCom, a selector-based soft compression framework that recasts the encoder as a query-conditioned information selector; the selector is decoder-only and is trained with curriculum learning on a large, diverse, difficulty-graded synthetic QA dataset.
Result: SeleCom significantly outperforms existing soft compression methods and matches or exceeds non-compressed RAG baselines while reducing computation and latency by 33.8% to 84.6%.
Conclusion: Query-conditioned selective compression fits RAG better than full compression and is an effective paradigm for improving soft compression performance and efficiency.
Abstract: Retrieval-Augmented Generation (RAG) effectively grounds Large Language Models (LLMs) with external knowledge and is widely applied to Web-related tasks. However, its scalability is hindered by excessive context length and redundant retrievals. Recent research on soft context compression aims to address this by encoding long documents into compact embeddings, yet these methods often underperform non-compressed RAG due to their reliance on auto-encoder-like full-compression that forces the encoder to compress all document information regardless of relevance to the input query. In this work, we conduct an analysis on this paradigm and reveal two fundamental limitations: (I) Infeasibility: full-compression conflicts with the LLM's downstream generation behavior; and (II) Non-necessity: full-compression is unnecessary and dilutes task-relevant information density. Motivated by these insights, we introduce SeleCom, a selector-based soft compression framework for RAG that redefines the encoder's role as a query-conditioned information selector. The selector is decoder-only and is trained with a massive, diverse and difficulty-graded synthetic QA dataset with curriculum learning. Extensive experiments show that SeleCom significantly outperforms existing soft compression approaches and achieves competitive or superior performance to non-compression baselines, while reducing computation and latency by 33.8% to 84.6%.
[14] Multi-source Heterogeneous Public Opinion Analysis via Collaborative Reasoning and Adaptive Fusion: A Systematically Integrated Approach
Yi Liu
Main category: cs.CL
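TL;DR: This paper proposes CRAF, a collaborative reasoning and adaptive fusion framework for multi-source heterogeneous public opinion analysis that combines traditional feature-based methods with LLMs through multi-stage reasoning, improving cross-platform semantic alignment, adaptive feature fusion, joint topic-sentiment modeling, and multimodal content understanding, with a tighter theoretical generalization bound and better experimental results than baselines.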
Details
Motivation: Multi-source heterogeneous public opinion data differ in structure, semantics, and platform-specific biases, which makes unified modeling difficult.
Method: The CRAF framework has four core components: a cross-platform collaborative attention module, a hierarchical adaptive fusion mechanism, a joint optimization strategy (learning topics and sentiment in shared latent spaces), and multimodal extraction that integrates OCR, ASR, and visual sentiment analysis.
Result: Theory shows a tighter generalization bound for CRAF (a reduction of O(sqrt(d log K / m))); experiments on three multi-platform datasets show a topic clustering ARI of 0.76 (+4.1%), a sentiment analysis F1 of 0.84 (+3.8%), and a 75% reduction in labeled data required for new platforms.
Conclusion: CRAF improves the accuracy, robustness, and cross-platform adaptability of multi-source public opinion analysis and provides a scalable, systematic paradigm for combining traditional methods with LLMs.
Abstract: The analysis of public opinion from multiple heterogeneous sources presents significant challenges due to structural differences, semantic variations, and platform-specific biases. This paper introduces a novel Collaborative Reasoning and Adaptive Fusion (CRAF) framework that systematically integrates traditional feature-based methods with large language models (LLMs) through a structured multi-stage reasoning mechanism. Our approach features four key innovations: (1) a cross-platform collaborative attention module that aligns semantic representations while preserving source-specific characteristics, (2) a hierarchical adaptive fusion mechanism that dynamically weights features based on both data quality and task requirements, (3) a joint optimization strategy that simultaneously learns topic representations and sentiment distributions through shared latent spaces, and (4) a novel multimodal extraction capability that processes video content from platforms like Douyin and Kuaishou by integrating OCR, ASR, and visual sentiment analysis. Theoretical analysis demonstrates that CRAF achieves a tighter generalization bound with a reduction of O(sqrt(d log K / m)) compared to independent source modeling, where d is feature dimensionality, K is the number of sources, and m is sample size. Comprehensive experiments on three multi-platform datasets (Weibo-12, CrossPlatform-15, NewsForum-8) show that CRAF achieves an average topic clustering ARI of 0.76 (4.1% improvement over best baseline) and sentiment analysis F1-score of 0.84 (3.8% improvement).
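For readers who want to reproduce the clustering metric quoted above, the following is a minimal sketch of the standard Adjusted Rand Index (ARI) computed from pair counts. It is the textbook definition rather than code from the paper; the function name and the toy labels are illustrative assumptions.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(true_labels, pred_labels):
    """Standard ARI: 1.0 for identical partitions, about 0.0 for random ones."""
    n = len(true_labels)
    if n < 2:
        return 1.0  # no pairs to compare; treat as trivially identical
    # Pairs of items that fall in the same cell of the contingency table.
    same_both = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    same_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    same_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = same_true * same_pred / comb(n, 2)  # agreement expected by chance
    maximum = (same_true + same_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (same_both - expected) / (maximum - expected)

# Cluster ids are arbitrary, so a relabelled but identical partition scores 1.0.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

Because ARI corrects for chance agreement, a partition that cuts across the true topics can score below zero, so values are not directly comparable with unadjusted accuracy.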
The framework exhibits strong cross-platform adaptability, reducing the labeled data requirement for new platforms by 75%.
[15] State Design Matters: How Representations Shape Dynamic Reasoning in Large Language Models
Annie Wong,Aske Plaat,Thomas Bäck,Niki van Stein,Anna V. Kononova
Main category: cs.CL
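TL;DR: This paper systematically studies how state representation design choices (granularity, structure, spatial grounding) affect the reasoning and decision-making of large language models (LLMs) in dynamic environments, and finds that the act of constructing the state elicits spatial and long-horizon reasoning more than the information content itself.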
Details
Motivation: As LLMs move from static reasoning to dynamic interactive environments, their performance depends heavily on how state is represented, yet this factor has not been studied systematically.
Method: On several sequential decision-making benchmarks, with model parameters held fixed, systematically vary three dimensions of state representation: (1) granularity (long form vs summary), (2) structure (natural language vs symbolic), and (3) spatial grounding (text-only vs images or textual map encodings).
Result: (1) Trajectory summarisation reduces noise and stabilises long-horizon reasoning; (2) natural language representations are the most robust, while structured encodings help only models with strong code or structured-output priors; (3) textual map encodings outperform image inputs because constructing them forces the model to perform spatial reasoning.
Conclusion: State representation design is a key independent factor in performance, but even with improved representations current LLMs and VLMs remain brittle on long-horizon goals that require coordinating multiple subtasks.
Abstract: As large language models (LLMs) move from static reasoning tasks toward dynamic environments, their success depends on the ability to navigate and respond to an environment that changes as they interact at inference time. An underexplored factor in these settings is the representation of the state. Holding model parameters fixed, we systematically vary three key aspects: (1) state granularity (long form versus summary), (2) structure (natural language versus symbolic), and (3) spatial grounding (text-only versus images or textual map encodings) across sequential decision-making benchmarks. First, we find that trajectory summarisation improves performance by reducing noise and stabilising long-horizon reasoning. Second, natural language representations are the most robust across models, whereas structured encodings help mainly for models with strong code or structured output priors, such as JSON schemas. Third, while image inputs show some benefit, text-based spatial encodings prove most effective. This advantage stems not from the spatial information itself, but from the act of construction, which compels the model to perform the spatial reasoning that static input does not elicit. Overall, we demonstrate that design choices for representing state are a decisive factor in performance, distinct from the availability of information itself.
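To make "textual map encoding" concrete, here is a small sketch of what such a state representation might look like for a grid world. The symbols (A, G, #, .), the coordinate convention, and the function name are all hypothetical; the abstract does not specify the format used in the benchmarks. Note also that the abstract attributes the benefit to the model constructing the map itself, so this code only illustrates the target format, not the experimental procedure.

```python
def render_text_map(width, height, agent, goal, walls=()):
    """Render a grid state as text rows; (0, 0) is the top-left cell.

    Hypothetical legend: A = agent, G = goal, # = wall, . = free cell.
    """
    blocked = set(walls)
    rows = []
    for y in range(height):
        row = []
        for x in range(width):
            if (x, y) == agent:
                row.append("A")
            elif (x, y) == goal:
                row.append("G")
            elif (x, y) in blocked:
                row.append("#")
            else:
                row.append(".")
        rows.append("".join(row))
    return "\n".join(rows)

# A 3x2 grid with the agent top-left, one wall beside it, goal bottom-right.
print(render_text_map(3, 2, agent=(0, 0), goal=(2, 1), walls=[(1, 0)]))
```

A prompt built this way keeps spatial relations explicit in plain text, which is the property the text-based encodings in the study share with this toy format.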
We note, however, that even with improved representations, current LLMs and VLMs remain brittle over long horizons, particularly when they must synthesise information to manage multiple subtasks to reach a goal.
[16] From Transcripts to AI Agents: Knowledge Extraction, RAG Integration, and Robust Evaluation of Conversational AI Assistants
Krittin Pachtrachai,Petmongkon Pornpichitsuwan,Wachiravit Modecrua,Touchapon Kraisingkorn
Main category: cs.CL
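TL;DR: This paper presents an end-to-end framework for building and evaluating conversational AI assistants from historical call transcripts: PIPA-style grading filters for high-quality conversations, an LLM extracts structured knowledge for RAG, and modular prompt engineering governs behavior; experiments in real estate and recruitment show high accuracy, strong robustness, and roughly 30% autonomous call handling.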
Details
Motivation: 客户导向行业构建可靠对话AI助手面临噪声数据、知识碎片化及需精准人工接管等挑战,尤其依赖实时信息的领域更难自动化。 Method: 1)用简化PIPA框架对通话转录文本评分并筛选高质量样本;2)用大语言模型从精选文本中抽取结构化知识,作为RAG唯一知识源;3)通过从单体到模块化、可管控的提示调优控制助手行为;4)采用基于转录文本的用户模拟器定量评估覆盖度、事实准确性与人工升级行为,并辅以红队测试评估鲁棒性。 Result: 在房地产与专业招聘两个高难度领域中,助手实现约30%通话自主处理、接近完美的事实准确率与拒绝行为,且在对抗性测试中表现强鲁棒性。 Conclusion: 该端到端框架有效克服了实时信息依赖场景下的对话AI构建难点,验证了基于真实通话数据训练与评估的可行性与实用性。 Abstract: Building reliable conversational AI assistants for customer-facing industries remains challenging due to noisy conversational data, fragmented knowledge, and the requirement for accurate human hand-off - particularly in domains that depend heavily on real-time information. This paper presents an end-to-end framework for constructing and evaluating a conversational AI assistant directly from historical call transcripts. Incoming transcripts are first graded using a simplified adaptation of the PIPA framework, focusing on observation alignment and appropriate response behavior, and are filtered to retain only high-quality interactions exhibiting coherent flow and effective human agent responses. Structured knowledge is then extracted from curated transcripts using large language models (LLMs) and deployed as the sole grounding source in a Retrieval-Augmented Generation (RAG) pipeline. Assistant behavior is governed through systematic prompt tuning, progressing from monolithic prompts to lean, modular, and governed designs that ensure consistency, safety, and controllable execution. Evaluation is conducted using a transcript-grounded user simulator, enabling quantitative measurement of call coverage, factual accuracy, and human escalation behavior. Additional red teaming assesses robustness against prompt injection, out-of-scope, and out-of-context attacks. Experiments are conducted in the Real Estate and Specialist Recruitment domains, which are intentionally challenging and currently suboptimal for automation due to their reliance on real-time data. 
Despite these constraints, the assistant autonomously handles approximately 30 percent of calls, achieves near-perfect factual accuracy and rejection behavior, and demonstrates strong robustness under adversarial testing.
[17] Reranker Optimization via Geodesic Distances on k-NN Manifolds
Wen G. Gong
Main category: cs.CL
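TL;DR: Maniscope is a lightweight reranking method based on manifold geometry. It improves RAG retrieval quality by computing geodesic distances on a k-nearest-neighbor graph, reaches near cross-encoder performance on several BEIR datasets at 10-45x lower latency, and is suited to real-time deployment.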
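The geodesic reranking idea summarized above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the choice of the most query-similar candidate as the anchor, the `1 / (1 + distance)` local score, and the blending weight `alpha` are all hypothetical.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_graph(docs, k):
    """Undirected k-NN graph; edge weight is cosine distance (1 - similarity)."""
    graph = {i: {} for i in range(len(docs))}
    for i, d in enumerate(docs):
        dists = sorted((1.0 - cosine(d, e), j) for j, e in enumerate(docs) if j != i)
        for w, j in dists[:k]:
            graph[i][j] = w
            graph[j][i] = w
    return graph


def geodesic(graph, source):
    """Dijkstra shortest-path distances from `source` along graph edges."""
    dist = {node: math.inf for node in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


def rerank(query, docs, k=2, alpha=0.5):
    """Return candidate indices, best first (assumed scoring rule)."""
    sims = [cosine(query, d) for d in docs]
    # Assumption: anchor the manifold at the candidate closest to the query.
    anchor = max(range(len(docs)), key=lambda i: sims[i])
    geo = geodesic(knn_graph(docs, k), anchor)

    def score(i):
        # Unreachable candidates get no local (manifold) credit.
        local = 0.0 if math.isinf(geo[i]) else 1.0 / (1.0 + geo[i])
        return alpha * sims[i] + (1.0 - alpha) * local

    return sorted(range(len(docs)), key=score, reverse=True)


docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
ranking = rerank([1.0, 0.0], docs)
print(ranking)
```

Blending a global score with a graph-based score is one simple way to let candidates that sit close to a strong match on the local neighborhood graph rise in the ranking; the actual combination rule used by Maniscope is not given in the abstract.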
Details
Motivation: Existing neural rerankers (such as cross-encoders and LLMs) are computationally expensive and slow (3-5 seconds per query), which makes them hard to use for real-time RAG.
Method: Maniscope builds a k-NN manifold over the retrieved document candidates and computes geodesic distances on it, combining global cosine similarity with local manifold geometry to capture semantic relationships that Euclidean metrics cannot model.
Result: On eight BEIR datasets (1,233 queries), NDCG@3 improves by +7.0%, +1.6%, and +2.8% over the HNSW baseline on the three hardest datasets (NFCorpus, TREC-COVID, AorB); average latency is only 4.7 ms (3.2x faster than HNSW); compared with cross-encoders, accuracy is within 2% at 10-45x lower latency; compared with an LLM reranker, NDCG@3 on TREC-COVID is only 0.5% lower at 840x lower latency.
Conclusion: Maniscope is an efficient, accurate, low-latency geometric reranker that clearly outperforms the graph-based baseline, approaches strong neural rerankers, and is practical to deploy in RAG systems.
Abstract: Current neural reranking approaches for retrieval-augmented generation (RAG) rely on cross-encoders or large language models (LLMs), requiring substantial computational resources and exhibiting latencies of 3-5 seconds per query. We propose Maniscope, a geometric reranking method that computes geodesic distances on k-nearest neighbor (k-NN) manifolds constructed over retrieved document candidates. This approach combines global cosine similarity with local manifold geometry to capture semantic structure that flat Euclidean metrics miss. Evaluating on eight BEIR benchmark datasets (1,233 queries), Maniscope outperforms an HNSW graph-based baseline on the three hardest datasets (NFCorpus: +7.0%, TREC-COVID: +1.6%, AorB: +2.8% NDCG@3) while being 3.2x faster (4.7 ms vs 14.8 ms average). Compared to cross-encoder rerankers, Maniscope achieves within 2% accuracy at 10-45x lower latency. On TREC-COVID, LLM-Reranker provides only +0.5% NDCG@3 improvement over Maniscope at 840x higher latency, positioning Maniscope as a practical alternative for real-time RAG deployment. The method has O(N D + M^2 D + M k log k) complexity, where M << N, enabling sub-10 ms latency. We plan to release Maniscope as open-source software.
[18] CAST: Achieving Stable LLM-based Text Analysis for Data Analytics
Jinxiang Xie,Zihao Li,Wei He,Rui Ding,Shi Han,Dongmei Zhang
Main category: cs.CL
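TL;DR: This paper proposes the CAST framework, which uses Algorithmic Prompting and Thinking-before-Speaking to improve the output stability of LLMs in text analysis of tabular data while maintaining or improving output quality.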
Details
Motivation: Existing LLMs struggle to meet the high output stability that data analytics demands in text analysis of tabular data, such as summarization and row-level tagging.
Method: Propose the CAST framework, comprising Algorithmic Prompting (imposing procedural constraints on the reasoning path) and Thinking-before-Speaking (forcing the model to make explicit intermediate commitments before final generation); also design two stability metrics, CAST-S and CAST-T.
Result: Experiments on multiple public benchmarks and LLM backbones show that CAST achieves better stability than all baselines, improving the stability score by up to 16.2% without harming output quality.
Conclusion: CAST effectively improves the output stability of LLMs in text analysis of tabular data and offers a new approach to deploying LLMs reliably in data-driven settings.
Abstract: Text analysis of tabular data relies on two core operations: summarization for corpus-level theme extraction and tagging for row-level labeling. A critical limitation of employing large language models (LLMs) for these tasks is their inability to meet the high standards of output stability demanded by data analytics. To address this challenge, we introduce CAST (Consistency via Algorithmic Prompting and Stable Thinking), a framework that enhances output stability by constraining the model's latent reasoning path. CAST combines (i) Algorithmic Prompting to impose a procedural scaffold over valid reasoning transitions and (ii) Thinking-before-Speaking to enforce explicit intermediate commitments before final generation. To measure progress, we introduce CAST-S and CAST-T, stability metrics for bulleted summarization and tagging, and validate their alignment with human judgments. Experiments across publicly available benchmarks on multiple LLM backbones show that CAST consistently achieves the best stability among all baselines, improving Stability Score by up to 16.2%, while maintaining or improving output quality.
[19] Enhancing Action and Ingredient Modeling for Semantically Grounded Recipe Generation
Guoshan Liu,Bin Zhu,Yian Li,Jingjing Chen,Chong-Wah Ngo,Yu-Gang Jiang
Main category: cs.CL
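TL;DR: This paper proposes a semantically grounded two-stage framework that markedly improves the semantic accuracy of recipes generated from food images through action and ingredient prediction and validation, supervised and reinforcement fine-tuning, and a semantic confidence scoring and rectification module.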
Details
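Motivation: Existing multimodal LLMs achieve high lexical scores (e.g., BLEU, ROUGE) on image-to-recipe generation but often produce semantically incorrect actions or ingredients, with no guarantee of semantic plausibility.
Method: Propose a two-stage pipeline: stage one, supervised fine-tuning (SFT), uses an action-reasoning dataset and an ingredient corpus; stage two, reinforcement fine-tuning (RFT), introduces frequency-aware rewards to improve long-tail action prediction and ingredient generalization; a Semantic Confidence Scoring and Rectification (SCSR) module then filters and corrects predictions as a post-processing step.
Result: State-of-the-art performance on the Recipe1M dataset, with markedly improved semantic fidelity.
Conclusion: Semantic modeling combined with staged optimization effectively mitigates semantic mismatch in multimodal recipe generation and offers a new paradigm for evaluating and improving generation quality.
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have enabled recipe generation from food images, yet outputs often contain semantically incorrect actions or ingredients despite high lexical scores (e.g., BLEU, ROUGE). To address this gap, we propose a semantically grounded framework that predicts and validates actions and ingredients as internal context for instruction generation. Our two-stage pipeline combines supervised fine-tuning (SFT) with reinforcement fine-tuning (RFT): SFT builds foundational accuracy using an Action-Reasoning dataset and ingredient corpus, while RFT employs frequency-aware rewards to improve long-tail action prediction and ingredient generalization. A Semantic Confidence Scoring and Rectification (SCSR) module further filters and corrects predictions. Experiments on Recipe1M show state-of-the-art performance and markedly improved semantic fidelity.
[20] Not the Example, but the Process: How Self-Generated Examples Enhance LLM Reasoning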
Daehoon Gwak,Minseo Jung,Junwoo Park,Minho Park,ChaeHun Park,Junha Hyung,Jaegul Choo
Main category: cs.CL
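TL;DR: This paper finds that the benefit LLMs gain from self-generated few-shot examples comes not from the examples themselves but from the act of generating them; Integrated prompting significantly outperforms both Decoupled prompting and Zero-shot prompting.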
Details
Motivation: Prior work shows that LLMs can improve reasoning performance through self-generated few-shot examples, but the underlying mechanism is unclear, which makes the technique hard to apply effectively.
Method: Across several LLM architectures, systematically compare three prompting strategies: Zero-shot prompting, Integrated prompting (the model creates and solves problems within the same prompt), and Decoupled prompting (self-generated examples are reused but the context of their creation is excluded); supplement this with an attention analysis.
Result: Integrated prompting consistently outperforms Zero-shot and Decoupled prompting across models and tasks; Decoupled prompting yields only marginal gains over Zero-shot; the attention analysis shows significant differences between Integrated and Decoupled prompting.
Conclusion: The advantage of self-generation prompting stems from the process of creating problems, not from the content of the generated examples, which has important implications for designing better prompting strategies.
Abstract: Recent studies have shown that Large Language Models (LLMs) can improve their reasoning performance through self-generated few-shot examples, achieving results comparable to manually curated in-context examples. However, the underlying mechanism behind these gains remains unclear, making it hard to decide when and how to apply the technique effectively. In this work, we argue that the key benefit arises not from the generated examples themselves but from the act of creating them. To validate this, on reasoning-intensive tasks across diverse LLM architectures, we systematically evaluate three prompting strategies for in-context learning: (1) Zero-shot prompting; (2) Integrated prompting, where LLMs create and solve problems within a single, unified prompt; and (3) Decoupled prompting, where self-generated examples are reused as in-context examples, but the context of their creation itself is excluded. We conduct experiments across five widely used model architectures, demonstrating that Integrated prompting consistently outperforms both Zero-shot and Decoupled prompting. In contrast, Decoupled prompting offers only marginal gains over Zero-shot. Further, for a more in-depth analysis, we conduct an attention analysis and observe significant differences in attention patterns between Integrated and Decoupled prompting. These findings suggest that the advantage of self-generation prompting comes from the process of problem creation, not the examples themselves, providing valuable insights for designing more effective prompting strategies.
[21] NLP Privacy Risk Identification in Social Media (NLP-PRISM): A Survey
Dhiman Goswami,Jai Kruthunz Naveen Kumar,Sanchari Das
Main category: cs.CL
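TL;DR: This paper proposes the NLP-PRISM framework for systematically assessing privacy risks in social media NLP applications across six dimensions: data collection, preprocessing, visibility, fairness, computational risk, and regulatory compliance. An analysis of 203 papers finds that performance drops by 1%-23% under privacy-preserving fine-tuning, model utility falls by 2%-9%, and membership inference (AUC = 0.81) and attribute inference (accuracy 75%) risks are high. It calls for stronger anonymization, privacy-aware learning, and fairness-driven training.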
Details
Motivation: NLP is widely used in social media analytics but often processes content containing PII, behavioral cues, and metadata, creating privacy risks such as surveillance, profiling, and targeted advertising; a systematic risk assessment framework is needed.
Method: Review 203 peer-reviewed papers; build NLP-PRISM, a six-dimension privacy risk assessment framework (data collection, preprocessing, visibility, fairness, computational risk, regulatory compliance); test its applicability empirically on six categories of NLP tasks; and quantify privacy attacks (MIA/AIA) and the utility trade-off.
Result: Transformer models lose 1%-23% in F1 after privacy-preserving fine-tuning; privacy research coverage across the six task categories is severely lacking; model utility falls by 2%-9%; membership inference attacks reach an AUC of 0.81 and attribute inference reaches an accuracy of 0.75.
Conclusion: Current social media NLP research has significant gaps in privacy protection; stronger anonymization, privacy-aware learning, and fairness-driven training are needed to enable ethical NLP practice.
Abstract: Natural Language Processing (NLP) is integral to social media analytics but often processes content containing Personally Identifiable Information (PII), behavioral cues, and metadata, raising privacy risks such as surveillance, profiling, and targeted advertising. To systematically assess these risks, we review 203 peer-reviewed papers and propose the NLP Privacy Risk Identification in Social Media (NLP-PRISM) framework, which evaluates vulnerabilities across six dimensions: data collection, preprocessing, visibility, fairness, computational risk, and regulatory compliance. Our analysis shows that transformer models achieve F1-scores ranging from 0.58 to 0.84, but incur a 1%-23% drop under privacy-preserving fine-tuning. Using NLP-PRISM, we examine privacy coverage in six NLP tasks: sentiment analysis (16), emotion detection (14), offensive language identification (19), code-mixed processing (39), native language identification (29), and dialect detection (24), revealing substantial gaps in privacy research. We further find a trade-off in model utility (reduced by 2%-9%), with membership inference attacks (MIA) reaching an AUC of 0.81 and attribute inference attacks (AIA) reaching an accuracy of 0.75. Finally, we advocate for stronger anonymization, privacy-aware learning, and fairness-driven training to enable ethical NLP in social media contexts.
[22] Playing With AI: How Do State-Of-The-Art Large Language Models Perform in the 1977 Text-Based Adventure Game Zork?
Berry Gerrits
Main category: cs.CL
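TL;DR: This paper evaluates the problem-solving and reasoning abilities of large language models (LLMs) by having them play the classic text adventure game Zork. Leading models including ChatGPT, Claude, and Gemini average less than 10% completion, and neither detailed instructions nor "extended thinking" improves performance, exposing fundamental weaknesses in metacognition, strategic reflection, and learning from experience.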
Details
Motivation: Assess the real problem-solving and reasoning abilities of current LLMs in a structured natural-language interactive environment, especially metacognition and dynamic strategy adjustment.
Method: Connect leading proprietary LLMs (ChatGPT, Claude, Gemini) to the Zork game environment, run them under minimal and detailed prompts, use the game score (maximum 350 points) as the primary metric, and qualitatively analyze the reasoning behavior in their action sequences and conversation histories.
Result: All models average below 35 points (<10% completion), and the best, Claude Opus 4.5, reaches only about 75 points; detailed instructions and "extended thinking" bring no notable improvement; models commonly repeat ineffective actions, apply strategies inconsistently, and fail to learn from history.
Conclusion: Current LLMs perform poorly on text adventure games, which require long-term planning, self-monitoring, and iteration on experience. This suggests their reasoning still lacks a genuine metacognitive basis, and their general problem-solving ability should not be overestimated.
Abstract: In this positioning paper, we evaluate the problem-solving and reasoning capabilities of contemporary Large Language Models (LLMs) through their performance in Zork, the seminal text-based adventure game first released in 1977. The game's dialogue-based structure provides a controlled environment for assessing how LLM-based chatbots interpret natural language descriptions and generate appropriate action sequences to succeed in the game. We test the performance of leading proprietary models (ChatGPT, Claude, and Gemini) under both minimal and detailed instructions, measuring game progress through achieved scores as the primary metric. Our results reveal that all tested models achieve less than 10% completion on average, with even the best-performing model (Claude Opus 4.5) reaching only approximately 75 out of 350 possible points. Notably, providing detailed game instructions offers no improvement, nor does enabling "extended thinking". Qualitative analysis of the models' reasoning processes reveals fundamental limitations: repeated unsuccessful actions suggesting an inability to reflect on one's own thinking, inconsistent persistence of strategies, and failure to learn from previous attempts despite access to conversation history. These findings suggest substantial limitations in current LLMs' metacognitive abilities and problem-solving capabilities within the domain of text-based games, raising questions about the nature and extent of their reasoning capabilities.
[23] Understanding LLM Failures: A Multi-Tape Turing Machine Analysis of Systematic Errors in Language Model Reasoning
Magnus Boman
Main category: cs.CL
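TL;DR: This paper proposes a formal framework based on a deterministic multi-tape Turing machine for modeling LLM interaction. Each component (input characters, tokens, vocabulary, parameters, activations, probability distributions, and output text) is mapped to its own tape, which makes it possible to pinpoint the stage at which a failure mode occurs and to explain how techniques such as prompt engineering work and where they fall short.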
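One failure this framing localizes is character counting after tokenization. The toy Python sketch below illustrates the point under loud assumptions: the vocabulary, the token IDs, and the segmentation of "strawberry" are hypothetical and do not come from any real tokenizer. It only shows that a symbol visible on the character tape may have no counterpart on the token tape.

```python
# Hypothetical vocabulary and segmentation, for illustration only.
VOCAB = {"str": 101, "aw": 102, "berry": 103, "r": 104}


def to_char_tape(text):
    """Input-character tape: one cell per character."""
    return list(text)


def to_token_tape(pieces):
    """Token tape: one opaque integer ID per subword piece."""
    return [VOCAB[p] for p in pieces]


pieces = ["str", "aw", "berry"]  # assumed segmentation of "strawberry"
char_tape = to_char_tape("strawberry")
token_tape = to_token_tape(pieces)

# Counting 'r' is trivial on the character tape...
chars_r = char_tape.count("r")
# ...but the ID for the standalone piece "r" never appears on the token tape,
# so the same count cannot be read off the token IDs directly.
tokens_r = token_tape.count(VOCAB["r"])

print(chars_r, tokens_r)
```

Any model that operates on the token tape has to recover the character-level structure from learned associations rather than by inspection, which is the stage where the counting error is introduced in this account.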
Details
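Motivation: LLMs still fail on seemingly simple tasks, and existing analyses lack a formal, verifiable theoretical framework, which makes it hard to pinpoint root causes.
Method: Build a deterministic multi-tape Turing machine model in which each tape corresponds to a key component of the LLM inference pipeline, and use it to model task execution stage by stage and attribute failures.
Result: Typical failure modes (such as counting tasks failing because tokenization loses character structure) are localized to specific pipeline stages; the model explains why chain-of-thought prompting works (it externalizes computation onto the output tape) and where it is fundamentally limited; it provides a falsifiable alternative theoretical tool for LLM analysis.
Conclusion: The formal model compensates for the shortcomings of geometric metaphors and empirical scaling laws, supports rigorous error analysis, and is an important theoretical basis for understanding and improving LLM behavior.
Abstract: Large language models (LLMs) exhibit failure modes on seemingly trivial tasks. We propose a formalisation of LLM interaction using a deterministic multi-tape Turing machine, where each tape represents a distinct component: input characters, tokens, vocabulary, model parameters, activations, probability distributions, and output text. The model enables precise localisation of failure modes to specific pipeline stages, revealing, e.g., how tokenisation obscures character-level structure needed for counting tasks. The model clarifies why techniques like chain-of-thought prompting help, by externalising computation on the output tape, while also revealing their fundamental limitations. This approach provides a rigorous, falsifiable alternative to geometric metaphors and complements empirical scaling laws with principled error analysis.
[24] Towards Fair and Efficient De-identification: Quantifying the Efficiency and Generalizability of De-identification Approaches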
Noopur Zambare,Kiana Aghakasiri,Carissa Lin,Carrie Ye,J. Ross Mitchell,Mohamed Abdalla
Main category: cs.CL
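TL;DR: This paper systematically evaluates pretrained language models of different sizes on clinical de-identification. Smaller models maintain high performance at substantially lower inference cost and generalize better in multilingual, multicultural, and gender-sensitive settings; the authors also release the open-source BERT-MultiCulture-DEID model family.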
Details
Motivation: Prior work has not adequately examined how well LLMs generalize across formats, cultures, and genders in clinical de-identification; systematic evaluation and improvement are needed.
Method: Systematically evaluate a range of models (the BERT family and small and large LLMs such as Llama and Qwen), fine-tuning and testing them on multilingual (Mandarin, Hindi, Spanish, and others), multicultural, and gendered-name data; introduce and open-source the BERT-MultiCulture-DEID model family, built on the MIMIC dataset with injected multilingual identifiers.
Result: Smaller models (such as 7B-class LLMs and BERT variants) outperform larger models at de-identification in multilingual, multicultural, and gendered settings at much lower inference cost; BERT-MultiCulture-DEID is markedly more robust across cultures.
Conclusion: Smaller models balance efficiency and generalizability in clinical de-identification and are especially suited to resource-constrained, culturally diverse healthcare settings; the study is the first to quantify the efficiency-generalizability trade-off and offers a practical path to fair, efficient clinical privacy protection.
Abstract: Large language models (LLMs) have shown strong performance on clinical de-identification, the task of identifying sensitive identifiers to protect privacy. However, previous work has not examined their generalizability between formats, cultures, and genders. In this work, we systematically evaluate fine-tuned transformer models (BERT, ClinicalBERT, ModernBERT), small LLMs (Llama 1-8B, Qwen 1.5-7B), and large LLMs (Llama-70B, Qwen-72B) at de-identification. We show that smaller models achieve comparable performance while substantially reducing inference cost, making them more practical for deployment. Moreover, we demonstrate that smaller models can be fine-tuned with limited data to outperform larger models in de-identifying identifiers drawn from Mandarin, Hindi, Spanish, French, Bengali, and regional variations of English, in addition to gendered names. To improve robustness in multi-cultural contexts, we introduce and publicly release BERT-MultiCulture-DEID, a set of de-identification models based on BERT, ClinicalBERT, and ModernBERT, fine-tuned on MIMIC with identifiers from multiple language variants. Our findings provide the first comprehensive quantification of the efficiency-generalizability trade-off in de-identification and establish practical pathways for fair and efficient clinical de-identification. Details on accessing the models are available at: https://doi.org/10.5281/zenodo.18342291
[25] VDLM: Variable Diffusion LMs via Robust Latent-to-Text Rendering
Shuhui Qu
Main category: cs.CL
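TL;DR: VDLM is a new variable diffusion language model that performs iterative refinement in a semantic variable embedding space (rather than conventional text generation), separates planning from rendering, and combines trajectory-aware optimization with robust Vec2Text decoding to substantially improve performance on long-form generation tasks.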
Details
Motivation: Autoregressive language models are constrained by irreversible left-to-right decoding in multi-step reasoning, which makes effective revision difficult.
Method: Propose VDLM: (1) apply LLaDA-style masked diffusion in a semantic variable embedding space for iterative planning; (2) post-train the planner with trajectory-aware optimization driven by embedding-space rewards and values; (3) design a Vec2Text renderer with an embedding perturbation mechanism to make decoding more robust.
Result: Across nine benchmarks covering general reasoning, math, and code, VDLM is competitive in pre-training, and post-training significantly outperforms other baselines on long-form generation tasks.
Conclusion: Embedding-space post-training and robust latent-to-text rendering are key effective approaches for diffusion language modeling.
Abstract: Autoregressive language models decode left-to-right with irreversible commitments, limiting revision during multi-step reasoning. We propose VDLM, a modular variable diffusion language model that separates semantic planning from text rendering. VDLM applies LLaDA-style masked diffusion over semantic variable embeddings to enable iterative refinement in latent space, then post-trains the planner with trajectory-aware optimization using embedding-space rewards and values, avoiding text decoding inside the RL loop. To convert planned embeddings back to text, we use a Vec2Text renderer and introduce embedding perturbations to robustify decoding under planner noise. Across nine benchmarks spanning general reasoning, math, and code, VDLM is competitive in pre-training and yields substantial post-training improvements on long-form generation tasks, outperforming other baselines. These results highlight the effectiveness of embedding-space post-training and robust latent-to-text rendering for diffusion language modeling.
[26] CheckIfExist: Detecting Citation Hallucinations in the Era of AI-Generated Content
Diletta Abbonato
Main category: cs.CL
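TL;DR: This paper presents CheckIfExist, an open-source web tool that verifies in real time whether references actually exist through multi-source validation against CrossRef, Semantic Scholar, and OpenAlex, in response to the reference hallucination problem caused by large language models.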
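A cascading check with string-similarity confidence scores might look roughly like the Python sketch below. It is an assumption-laden illustration, not the tool's code: the field weights, the 0.85 threshold, and the in-memory `sources` list are hypothetical stand-ins, and no real CrossRef, Semantic Scholar, or OpenAlex API call is made.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_confidence(ref, record):
    """Per-field scores plus a weighted overall score (weights are assumed)."""
    scores = {
        "title": similarity(ref["title"], record["title"]),
        "authors": similarity(ref["authors"], record["authors"]),
        "year": 1.0 if ref["year"] == record["year"] else 0.0,
    }
    scores["overall"] = (
        0.6 * scores["title"] + 0.3 * scores["authors"] + 0.1 * scores["year"]
    )
    return scores


def cascade_verify(ref, sources, threshold=0.85):
    """Try each source in order; stop at the first sufficiently confident match."""
    for name, records in sources:
        best = max(
            (match_confidence(ref, r) for r in records),
            key=lambda s: s["overall"],
            default=None,
        )
        if best is not None and best["overall"] >= threshold:
            return name, best
    return None, None


# Hypothetical local records standing in for responses from scholarly databases.
sources = [
    ("source_a", [{"title": "Deep Residual Learning for Image Recognition",
                   "authors": "He et al.", "year": 2016}]),
    ("source_b", [{"title": "Attention is all you need",
                   "authors": "Vaswani et al.", "year": 2017}]),
]

real = {"title": "Attention Is All You Need", "authors": "Vaswani et al.", "year": 2017}
fake = {"title": "Quantum Bananas for Language Models", "authors": "Nobody et al.", "year": 2030}

found_in, found_scores = cascade_verify(real, sources)
missing_in, missing_scores = cascade_verify(fake, sources)
print(found_in, missing_in)
```

Returning the per-field scores alongside the overall score lets a caller show why a reference was or was not accepted, which is one plausible reading of "multi-dimensional" confidence.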
Details
Motivation: The widespread use of LLMs in academic workflows has made reference hallucination (generating plausible but non-existent citations) an increasingly serious problem that has even appeared in papers at top conferences such as NeurIPS and ICLR; automated verification is urgently needed.
Method: Develop CheckIfExist, an open-source web tool with a cascading validation architecture that uses string similarity algorithms to compute multi-dimensional match confidence scores, supports single-entry and batch BibTeX verification, and connects to three databases: CrossRef, Semantic Scholar, and OpenAlex.
Result: The tool verifies single or batched references within seconds and returns validated APA-format citations and exportable BibTeX records, filling the gap between reference managers that lack real-time authenticity checks and commercial detection services that are limited by free-tier quotas or high fees.
Conclusion: CheckIfExist gives the academic community an efficient, open, and easy-to-use way to verify that references are real, helping to improve the credibility of scholarly publishing and bibliographic integrity.
Abstract: The proliferation of large language models (LLMs) in academic workflows has introduced unprecedented challenges to bibliographic integrity, particularly through reference hallucination, the generation of plausible but non-existent citations. Recent investigations have documented the presence of AI-hallucinated citations even in papers accepted at premier machine learning conferences such as NeurIPS and ICLR, underscoring the urgency of automated verification mechanisms. This paper presents "CheckIfExist", an open-source web-based tool designed to provide immediate verification of bibliographic references through multi-source validation against CrossRef, Semantic Scholar, and OpenAlex scholarly databases. While existing reference management tools offer bibliographic organization capabilities, they do not provide real-time validation of citation authenticity. Commercial hallucination detection services, though increasingly available, often impose restrictive usage limits on free tiers or require substantial subscription fees. The proposed tool fills this gap by employing a cascading validation architecture with string similarity algorithms to compute multi-dimensional match confidence scores, delivering instant feedback on reference authenticity. The system supports both single-reference verification and batch processing of BibTeX entries through a unified interface, returning validated APA citations and exportable BibTeX records within seconds.
[27] P-RAG: Prompt-Enhanced Parametric RAG with LoRA and Selective CoT for Biomedical and Multi-Hop QA
Xingda Lyu,Gongfu Lyu,Zitai Yan,Yuxin Jiang
Main category: cs.CL
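TL;DR: This paper proposes Prompt-Enhanced Parametric RAG (P-RAG), a new retrieval-augmented generation (RAG) method that combines parametric knowledge, externally retrieved evidence, and Chain-of-Thought (CoT) prompting, with LLaMA-3.2-1B-Instruct fine-tuned via LoRA. It significantly outperforms Standard RAG on PubMedQA and 2WikiMultihopQA, especially on multi-hop reasoning.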
Details
Motivation: Existing LLMs are limited by static training data, and conventional RAG depends heavily on knowledge base quality; a new RAG architecture is needed that uses the model's parametric knowledge together with retrieved external information and improves complex reasoning.
Method: Propose P-RAG, which combines parametric knowledge (internal to the LLM) with retrieved evidence, uses Chain-of-Thought prompting to guide reasoning, and applies Low-Rank Adaptation (LoRA) to fine-tune LLaMA-3.2-1B-Instruct for the biomedical domain; compare against Standard RAG and DA-RAG on PubMedQA and 2WikiMultihopQA.
Result: P-RAG reaches 93.33% F1 on PubMedQA, 10.47 percentage points above Standard RAG (12.64% relative); on 2WikiMultihopQA its overall score of 33.44% is roughly double that of Standard RAG (17.83%); it reaches 44.03% on the Compare subset, and CoT markedly improves multi-hop reasoning.
Conclusion: P-RAG is an efficient, scalable, and contextually adaptive RAG approach that is particularly suited to biomedical question answering with high accuracy requirements; its combination of parametric knowledge, retrieved evidence, and structured prompting offers a new direction for RAG.
Abstract: Large Language Models (LLMs) demonstrate remarkable capabilities but remain limited by their reliance on static training data. Retrieval-Augmented Generation (RAG) addresses this constraint by retrieving external knowledge during inference, though it still depends heavily on knowledge base quality. To explore potential improvements, we evaluated three RAG variants on both general and biomedical datasets: Standard RAG, DA-RAG, and our proposed Prompt-Enhanced Parametric RAG (P-RAG), a hybrid architecture that integrates parametric knowledge within the LLM and retrieved evidence, guided by Chain-of-Thought (CoT) prompting and Low-Rank Adaptation (LoRA) fine-tuning. Using LLaMA-3.2-1B-Instruct fine-tuned via LoRA, we evaluate on PubMedQA and 2WikiMultihopQA. P-RAG outperforms Standard RAG on PubMedQA by 10.47 percentage points in F1 (93.33% vs. 82.86%; 12.64% relative). On 2WikiMultihopQA, P-RAG nearly doubles the overall score vs. Standard RAG (33.44% vs. 17.83%) and achieves 44.03% on the Compare subset (with 42.74% Bridge, 21.84% Inference, 8.60% Compose). CoT prompting substantially improves multi-hop reasoning but yields mixed results for simpler, single-hop queries. These findings underscore P-RAG's potential for accurate, scalable, and contextually adaptive biomedical question answering.
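The reported gains mix absolute (percentage-point) and relative figures, so a short Python check of the arithmetic may help. The helper name `improvement` is hypothetical; the input numbers are the ones quoted in the abstract.

```python
def improvement(new, old):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = new - old
    relative = absolute / old * 100.0
    return absolute, relative


# PubMedQA F1: P-RAG vs. Standard RAG
abs_pp, rel_pct = improvement(93.33, 82.86)
print(f"{abs_pp:.2f} pp absolute, {rel_pct:.2f}% relative")
# prints "10.47 pp absolute, 12.64% relative"

# 2WikiMultihopQA overall score: P-RAG vs. Standard RAG
ratio = 33.44 / 17.83
print(f"{ratio:.2f}x")
# prints "1.88x", i.e. "nearly doubles"
```

The relative figure divides the gain by the baseline (10.47 / 82.86), which is why it is larger than the percentage-point gain.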
Our contributions include: (1) LoRA-based fine-tuning of LLaMA-3.2-1B-Instruct for biomedical QA, (2) introduction of P-RAG with Chain-of-Thought prompting, and (3) state-of-the-art results on PubMedQA and 2WikiMultihopQA.
[28] Quality-constrained Entropy Maximization Policy Optimization for LLM Diversity
Haihui Pan,Yuzhong Hong,Shaoke Lv,Junwei Bao,Hongfei Jiang,Yang Song
Main category: cs.CL
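TL;DR: This paper proposes QEMPO, a method that increases the output diversity of large language models while preserving output quality through quality-constrained entropy maximization policy optimization; in experiments it performs on par with or better than RLHF.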
Details
Motivation: Existing alignment methods improve LLM output quality but reduce output diversity, while methods that increase diversity often hurt performance; a method that achieves both is needed.
Method: Propose Quality-constrained Entropy Maximization Policy Optimization (QEMPO), which decomposes the alignment task into two distributions, quality and diversity, obtains different policies by applying different constraints, and optimizes them with both online and offline training methods.
Result: Experiments show that QEMPO matches or exceeds RLHF performance while significantly improving output diversity.
Conclusion: QEMPO is an alignment framework that effectively balances output quality and diversity and offers a new direction for LLM alignment research.
Abstract: Recent research indicates that while alignment methods significantly improve the quality of large language model (LLM) outputs, they simultaneously reduce the diversity of the models' output. Although some methods have been proposed to enhance LLM output diversity, they often come at the cost of reduced performance. In this work, we first theoretically demonstrate that the alignment task can be decomposed into two distributions: quality and diversity. To enhance the diversity of LLM outputs while ensuring quality, we propose the Quality-constrained Entropy Maximization Policy Optimization (QEMPO). QEMPO aims to maximize the output entropy of the policy while ensuring output quality. By adding different constraints to QEMPO, we obtain different policies. To optimize policies, we propose both online and offline training methods. Experiments validate that QEMPO achieves performance comparable to or even better than RLHF while improving output diversity.
[29] Understand Then Memory: A Cognitive Gist-Driven RAG Framework with Global Semantic Diffusion
Pengcheng Zhou,Haochen Li,Zhiqiang Nie,JiaLe Chen,Qing Gong,Weizhen Zhang,Chun Yu
Main category: cs.CL
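TL;DR: CogitoRAG is a retrieval-augmented generation framework inspired by human episodic memory. It extracts and evolves semantic gists, builds a multi-dimensional knowledge graph, and applies query decomposition, entity diffusion retrieval, and CogniRank reranking to substantially improve complex knowledge integration and reasoning.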
Details
Motivation: Existing RAG frameworks lose semantic integrity and suffer retrieval deviations because they represent text discretely; mechanisms of human cognitive memory can be borrowed to improve them.
Method: Propose the CogitoRAG framework. In the offline stage, unstructured corpora are distilled into semantic gists and built into a multi-dimensional knowledge graph that integrates entities, relational facts, and memory nodes. In the online stage, a query decomposition module breaks down complex queries, an entity diffusion module performs associative retrieval guided by structural relevance and an entity-frequency reward, and the CogniRank algorithm reranks by fusing diffusion scores with semantic similarity; the final evidence is supplied as passage-memory pairs to provide high-density support.
Result: CogitoRAG significantly outperforms state-of-the-art RAG methods on five mainstream QA benchmarks and on multi-task generation on GraphBench.
Conclusion: By simulating human cognitive memory processes, CogitoRAG improves the performance and robustness of RAG on complex knowledge integration and reasoning tasks.
Abstract: Retrieval-Augmented Generation (RAG) effectively mitigates hallucinations in LLMs by incorporating external knowledge. However, the inherent discrete representation of text in existing frameworks often results in a loss of semantic integrity, leading to retrieval deviations. Inspired by the human episodic memory mechanism, we propose CogitoRAG, a RAG framework that simulates human cognitive memory processes. The core of this framework lies in the extraction and evolution of the Semantic Gist. During the offline indexing stage, CogitoRAG first distills unstructured corpora into gist memory corpora, which are then transformed into a multi-dimensional knowledge graph integrating entities, relational facts, and memory nodes. In the online retrieval stage, the framework handles complex queries via a Query Decomposition Module that breaks them into comprehensive sub-queries, mimicking the cognitive decomposition humans employ for complex information. Subsequently, the Entity Diffusion Module performs associative retrieval across the graph, guided by structural relevance and an entity-frequency reward mechanism. Furthermore, we propose the CogniRank algorithm, which precisely reranks candidate passages by fusing diffusion-derived scores with semantic similarity. The final evidence is delivered to the generator in a passage-memory pairing format, providing high-density information support.
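The abstract does not specify how CogniRank fuses the two signals, so the following Python sketch shows only one simple possibility: min-max normalization followed by a weighted sum. The function names, the normalization, and the `weight` parameter are assumptions made for illustration, not the paper's algorithm.

```python
def min_max(values):
    """Scale a list of scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def fuse_rerank(passages, diffusion, similarity, weight=0.5):
    """Rank passages by a weighted sum of normalized scores (assumed rule)."""
    d = min_max(diffusion)
    s = min_max(similarity)
    fused = [weight * di + (1.0 - weight) * si for di, si in zip(d, s)]
    order = sorted(range(len(passages)), key=lambda i: fused[i], reverse=True)
    return [(passages[i], round(fused[i], 3)) for i in order]


# Toy scores: p2 is strongly connected in the graph, p1 is most similar to the query.
ranked = fuse_rerank(
    ["p1", "p2", "p3"],
    diffusion=[0.2, 0.9, 0.5],
    similarity=[0.8, 0.6, 0.1],
)
print(ranked)
```

Normalizing before fusing keeps either signal from dominating purely because of its scale; with equal weights, the passage that is reasonably strong on both signals ranks first.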
Experimental results across five mainstream QA benchmarks and multi-task generation on GraphBench demonstrate that CogitoRAG significantly outperforms state-of-the-art RAG methods, showcasing superior capabilities in complex knowledge integration and reasoning.
[30] Every Little Helps: Building Knowledge Graph Foundation Model with Fine-grained Transferable Multi-modal Tokens
Yichi Zhang,Zhuo Chen,Lingbing Guo,Wen Zhang,Huajun Chen
Main category: cs.CL
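TL;DR: This paper proposes TOFU, a token-based foundation model for multi-modal knowledge graph reasoning. It discretizes structural, visual, and textual information into modality-specific tokens and uses hierarchical fusion with a mixture-of-message mechanism to generalize strongly across multi-modal knowledge graphs.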
Details
Motivation: Most existing MMKGR methods are transductive and generalize poorly to new knowledge graphs, while existing knowledge graph foundation models mainly exploit structural patterns and ignore rich multi-modal signals.
Method: TOFU discretizes structural, visual, and textual information into modality-specific tokens and processes them with a hierarchical fusion architecture and a mixture-of-message mechanism to obtain transferable features for multi-modal knowledge graph reasoning.
Result: On 17 transductive, inductive, and fully-inductive multi-modal knowledge graphs, TOFU consistently outperforms strong KGFM and MMKGR baselines and performs well on unseen MMKGs.
Conclusion: TOFU is a foundation model for multi-modal knowledge graph reasoning with strong cross-MMKG generalization that effectively fuses multi-modal signals with graph structure.
Abstract: Multi-modal knowledge graph reasoning (MMKGR) aims to predict the missing links by exploiting both graph structure information and multi-modal entity contents. Most existing works are designed for a transductive setting, which learns dataset-specific embeddings and struggles to generalize to new KGs. Recent knowledge graph foundation models (KGFMs) improve cross-KG transfer, but they mainly exploit structural patterns and ignore rich multi-modal signals. We address these gaps by proposing a token-based foundation model (TOFU) for MMKGR, which exhibits strong generalization across different MMKGs. TOFU discretizes structural, visual, and textual information into modality-specific tokens. TOFU then employs a hierarchical fusion architecture with mixture-of-message mechanisms, aiming to process these tokens and obtain transferable features for MMKGR. Experimental results on 17 transductive, inductive, and fully-inductive MMKGs show that TOFU consistently outperforms strong KGFM and MMKGR baselines, delivering strong performance on unseen MMKGs.
[31] Mitigating Gradient Inversion Risks in Language Models via Token Obfuscation
Xinguo Feng,Zhongkui Ma,Zihan Wang,Alsharif Abuadbba,Guangdong Bai
Main category: cs.CL
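TL;DR: This paper proposes GHOST, a defense that counters gradient inversion attacks (GIAs) through token-level obfuscation, protecting privacy while preserving model utility.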
Details
Motivation: Existing defenses based on gradient perturbation have limited effect because they retain semantic similarity across the gradient, embedding, and token spaces; a more fundamental defense is needed.
Method: Propose GHOST, a token-level obfuscation mechanism that replaces original tokens with "shadow tokens" that are semantically distinct but close in embedding space. This breaks the semantic connection in token space while keeping the connection in the embedding and gradient spaces. It has two steps: a multi-criteria search, and a shadow-token selection that preserves alignment with internal outputs.
Result: Validated on models from BERT to Llama and on multiple datasets: privacy recovery rate as low as 1%, classification F1 up to 0.92, and perplexity of 5.45, clearly better than existing methods.
Conclusion: By decoupling the associations across spaces, GHOST offers a novel and efficient defense against GIAs that balances strong privacy protection with high model utility.
Abstract: Training and fine-tuning large-scale language models largely benefit from collaborative learning, but the approach has been proven vulnerable to gradient inversion attacks (GIAs), which allow adversaries to reconstruct private training data from shared gradients. Existing defenses mainly employ gradient perturbation techniques, e.g., noise injection or gradient pruning, to disrupt GIAs' direct mapping from gradient space to token space. However, these methods often fall short due to the retention of semantic similarity across gradient, embedding, and token spaces. In this work, we propose a novel defense mechanism named GHOST (gradient shield with obfuscated tokens), a token-level obfuscation mechanism that neutralizes GIAs by decoupling the inherent connections across gradient, embedding, and token spaces. GHOST is built upon an important insight: due to the large scale of the token space, there exist semantically distinct yet embedding-proximate tokens that can serve as the shadow substitutes of the original tokens, which enables a semantic disconnection in the token space while preserving the connection in the embedding and gradient spaces. GHOST comprises a searching step, which identifies semantically distinct candidate tokens using a multi-criteria searching process, and a selection step, which selects optimal shadow tokens to ensure minimal disruption to features critical for training by preserving alignment with the internal outputs produced by original tokens.
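The search step described above could look roughly like the Python sketch below, which stands in for the search step only and omits the selection step. Everything here is an assumption for illustration: the toy two-dimensional embeddings, the `related` map used as a crude proxy for "semantically similar", and the function names are hypothetical, and the paper's multi-criteria search is not reproduced.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def shadow_candidates(token, embeddings, related, top_n=1):
    """Tokens closest to `token` in embedding space, skipping the token itself
    and anything listed as semantically related to it."""
    blocked = {token} | related.get(token, set())
    scored = [
        (cosine(embeddings[token], vec), other)
        for other, vec in embeddings.items()
        if other not in blocked
    ]
    scored.sort(reverse=True)  # highest embedding similarity first
    return [other for _, other in scored[:top_n]]


# Toy embedding table (hypothetical values).
embeddings = {
    "doctor": [1.0, 0.1],
    "physician": [0.99, 0.12],
    "river": [0.95, 0.2],
    "banana": [0.0, 1.0],
}
# Assumed semantic-relatedness list; a real system would need its own criterion.
related = {"doctor": {"physician"}}

candidates = shadow_candidates("doctor", embeddings, related)
print(candidates)
```

In this toy table "river" is chosen for "doctor" because it is close in embedding space yet not flagged as related, which mirrors the "semantically distinct yet embedding-proximate" property the defense relies on.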
Evaluation across diverse model architectures (from BERT to Llama) and datasets demonstrates the remarkable effectiveness of GHOST in protecting privacy (as low as 1% in recovery rate) and preserving utility (up to 0.92 in classification F1 and 5.45 in perplexity), in both classification and generation tasks against state-of-the-art GIAs and adaptive attack scenarios.
[32] MultiCube-RAG for Multi-hop Question Answering
Jimeng Shi,Wei Hu,Runchu Tian,Bowen Jin,Wonbin Kweon,SeongKu Kang,Yunfan Kang,Dingqi Ye,Sizhe Zhou,Shaowen Wang,Jiawei Han
Main category: cs.CL
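TL;DR: This paper proposes MultiCube-RAG, a training-free retrieval-augmented generation method that models subjects, attributes, and relations with ontology-based "cubes" of multiple orthogonal dimensions. It supports multi-step reasoning and retrieval for multi-hop question answering and outperforms existing methods in accuracy, efficiency, and explainability.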
Details
Motivation: Existing RAG methods struggle to capture the structural semantics that multi-hop QA requires; graph-based RAG is noisy and computationally expensive; single-step retrieval ignores multi-hop reasoning; training-based methods converge unstably and are costly.
Method: Build an ontology-based cube structure with multiple orthogonal dimensions, with each cube specializing in one class of subjects; propose MultiCube-RAG, which decomposes a complex multi-hop query into simple subqueries along cube dimensions and solves them sequentially, enabling training-free multi-step reasoning and retrieval.
Result: On four multi-hop QA datasets, MultiCube-RAG improves accuracy by 8.9% over the average of various baselines, with higher efficiency and inherent explainability.
Conclusion: Through structured, modular, decomposition-based reasoning, MultiCube-RAG addresses key weaknesses of existing RAG in multi-hop QA: inadequate modeling of structural semantics, computational inefficiency, and lack of explainability.
Abstract: Multi-hop question answering (QA) necessitates multi-step reasoning and retrieval across interconnected subjects, attributes, and relations. Existing retrieval-augmented generation (RAG) methods struggle to capture these structural semantics accurately, resulting in suboptimal performance. Graph-based RAGs structure such information in graphs, but the resulting graphs are often noisy and computationally expensive. Moreover, most methods rely on single-step retrieval, neglecting the need for multi-hop reasoning processes. Recent training-based approaches attempt to incentivize the large language models (LLMs) for iterative reasoning and retrieval, but their training processes are prone to unstable convergence and high computational overhead. To address these limitations, we devise an ontology-based cube structure with multiple and orthogonal dimensions to model structural subjects, attributes, and relations. Built on the cube structure, we propose MultiCube-RAG, a training-free method consisting of multiple cubes for multi-step reasoning and retrieval. Each cube specializes in modeling a class of subjects, so that MultiCube-RAG flexibly selects the most suitable cubes to acquire the relevant knowledge precisely. To enhance the query-based reasoning and retrieval, our method decomposes a complex multi-hop query into a set of simple subqueries along cube dimensions and conquers each of them sequentially. Experiments on four multi-hop QA datasets show that MultiCube-RAG improves response accuracy by 8.9% over the average performance of various baselines.
Notably, we also demonstrate that our method performs with greater efficiency and inherent explainability.
[33] Doc-to-LoRA: Learning to Instantly Internalize Contexts
Rujikorn Charakorn,Edoardo Cetin,Shinnosuke Uesaka,Robert Tjarko Lange
Main category: cs.CL
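TL;DR: This paper proposes Doc-to-LoRA (D2L), a lightweight hypernetwork that generates a LoRA adapter for a long context in a single forward pass, performing approximate context distillation. It substantially reduces inference latency and KV-cache memory consumption and outperforms conventional context distillation on long-text question answering.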
Details
Motivation: The quadratic attention cost of Transformers makes long-sequence inference memory-intensive and slow, while existing context distillation (CD) methods require per-prompt training, whose compute cost and latency make them impractical. Method: Proposes Doc-to-LoRA (D2L), a meta-learned lightweight hypernetwork that quickly generates a LoRA adapter for an unseen prompt, so the target LLM can answer subsequent queries without re-reading the original long context. Result: On a needle-in-a-haystack long-context task, D2L reaches near-perfect zero-shot accuracy at sequence lengths more than 4x the model's native context window; on real-world QA datasets it outperforms standard CD while significantly reducing peak memory and update latency. Conclusion: D2L gives LLMs an efficient, low-overhead mechanism for adapting to long contexts, supports rapid knowledge updates and personalized interaction, and may help make long-context deployment practical. Abstract: Long input sequences are central to in-context learning, document understanding, and multi-step reasoning of Large Language Models (LLMs). However, the quadratic attention cost of Transformers makes inference memory-intensive and slow. While context distillation (CD) can transfer information into model parameters, per-prompt distillation is impractical due to training costs and latency. To address these limitations, we propose Doc-to-LoRA (D2L), a lightweight hypernetwork that meta-learns to perform approximate CD within a single forward pass. Given an unseen prompt, D2L generates a LoRA adapter for a target LLM, enabling subsequent queries to be answered without re-consuming the original context, reducing latency and KV-cache memory consumption during inference of the target LLM. On a long-context needle-in-a-haystack task, D2L successfully learns to map contexts into adapters that store the needle information, achieving near-perfect zero-shot accuracy at sequence lengths exceeding the target LLM's native context window by more than 4x. On real-world QA datasets with limited compute, D2L outperforms standard CD while significantly reducing peak memory consumption and update latency. We envision that D2L can facilitate rapid adaptation of LLMs, opening up the possibility of frequent knowledge updates and personalized chat behavior.
[34] DocSplit: A Comprehensive Benchmark Dataset and Evaluation Approach for Document Packet Recognition and Splitting
Md Mofijul Islam,Md Sirajus Salekin,Nivedha Balakrishnan,Vincil C. Bishop,Niharika Jain,Spencer Romo,Bob Strahan,Boyi Xie,Diego A. Socolinsky
Main category: cs.CL
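TL;DR: This paper presents DocSplit, the first comprehensive benchmark dataset for document packet splitting. It evaluates how well large language models identify document boundaries, classify document types, and maintain page order, and it reveals performance gaps in current models on this complex task.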
Details
Motivation: Real-world applications often process heterogeneous multi-page document packets stitched together from multiple documents, yet visual document understanding research has not systematically addressed the basic task of document packet splitting. Method: Builds the DocSplit benchmark from five datasets of varying complexity, covering diverse document types, layouts, and multimodal settings; formalizes the document packet splitting task and proposes new evaluation metrics; validates with extensive experiments on multimodal LLMs. Result: Current multimodal LLMs show significant performance bottlenecks on real-world challenges such as out-of-order pages, interleaved documents, and documents without clear demarcations. Conclusion: DocSplit provides a systematic evaluation framework for document-intensive domains such as legal, finance, and healthcare, and the datasets are released to support follow-up research. Abstract: Document understanding in real-world applications often requires processing heterogeneous, multi-page document packets containing multiple documents stitched together. Despite recent advances in visual document understanding, the fundamental task of document packet splitting, which involves separating a document packet into individual units, remains largely unaddressed. We present the first comprehensive benchmark dataset, DocSplit, along with novel evaluation metrics for assessing the document packet splitting capabilities of large language models. DocSplit comprises five datasets of varying complexity, covering diverse document types, layouts, and multimodal settings. We formalize the DocSplit task, which requires models to identify document boundaries, classify document types, and maintain correct page ordering within a document packet. The benchmark addresses real-world challenges, including out-of-order pages, interleaved documents, and documents lacking clear demarcations. We conduct extensive experiments evaluating multimodal LLMs on our datasets, revealing significant performance gaps in current models' ability to handle complex document splitting tasks. The DocSplit benchmark datasets and proposed novel evaluation metrics provide a systematic framework for advancing document understanding capabilities essential for legal, financial, healthcare, and other document-intensive domains. We release the datasets to facilitate future research in document packet processing.
[35] A Curious Class of Adpositional Multiword Expressions in Korean
Junghyun Min,Na-Rae Han,Jena D. Hwang,Nathan Schneider
Main category: cs.CL
TL;DR: This paper investigates Korean postpositional verb-based constructions (PVCs), a class of multiword adpositions, analyzes them using Korean Wikipedia data, distinguishes them from non-MWEs and light verb constructions, and proposes annotation guidelines to support future Korean MWE research and cross-lingual framework integration.
Details
Motivation: Korean multiword expressions, especially multiword adpositions, are underrepresented in cross-lingual annotation frameworks like PARSEME; there is a lack of systematic analysis, annotated resources, and integration for Korean MWEs. Method: The authors analyze PVCs using data from Korean Wikipedia, conduct a comparative survey distinguishing PVCs from non-MWEs and light verb constructions (LVCs), and develop annotation guidelines based on this linguistic analysis. Result: A systematic characterization of Korean PVCs is provided, along with clear distinctions from related constructions, and a set of annotation guidelines tailored for Korean multiword adpositions. Conclusion: The study fills a gap in Korean MWE research by establishing foundational analysis and annotation standards for PVCs, enabling better integration into multilingual frameworks and supporting future resource development. Abstract: Multiword expressions (MWEs) have been widely studied in cross-lingual annotation frameworks such as PARSEME. However, Korean MWEs remain underrepresented in these efforts. In particular, Korean multiword adpositions lack systematic analysis, annotated resources, and integration into existing multilingual frameworks. In this paper, we study a class of Korean functional multiword expressions: postpositional verb-based constructions (PVCs). Using data from Korean Wikipedia, we survey and analyze several PVC expressions and contrast them with non-MWEs and light verb constructions (LVCs) with similar structure. Building on this analysis, we propose annotation guidelines designed to support future work in Korean multiword adpositions and facilitate alignment with cross-lingual frameworks.
[36] CLAA: Cross-Layer Attention Aggregation for Accelerating LLM Prefill
Bradley McDanel,Steven Li,Harshit Khaitan
Main category: cs.CL
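TL;DR: This paper introduces an Answer-Informed Oracle for evaluating token importance and finds that existing token-ranking methods estimate importance unstably across layers. It then proposes Cross-Layer Attention Aggregation (CLAA), which aggregates attention scores across layers to make the prefill stage of long-context LLM inference more efficient, significantly reducing TTFT.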
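As a rough illustration of the cross-layer aggregation idea in this entry, the sketch below averages per-layer token importance scores and keeps the top-k prompt tokens. The mean as the aggregation rule, the function names, and the toy scores are all assumptions made for illustration; the abstract does not specify how CLAA combines layers.

```python
from typing import List


def aggregate_across_layers(layer_scores: List[List[float]]) -> List[float]:
    """Average per-token importance over layers (the mean is an assumed choice)."""
    num_layers = len(layer_scores)
    num_tokens = len(layer_scores[0])
    return [
        sum(layer[t] for layer in layer_scores) / num_layers
        for t in range(num_tokens)
    ]


def select_top_k(scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring tokens, in prompt order."""
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:k])


# Hypothetical scores: 3 layers x 5 prompt tokens.
layer_scores = [
    [0.9, 0.1, 0.5, 0.2, 0.3],
    [0.1, 0.2, 0.6, 0.1, 0.8],  # this layer under-rates token 0
    [0.8, 0.1, 0.7, 0.2, 0.4],
]

aggregated = aggregate_across_layers(layer_scores)
kept = select_top_k(aggregated, k=3)
single_layer_kept = select_top_k(layer_scores[1], k=3)
```

In this toy case, ranking from the second layer alone would drop token 0, while the aggregated ranking keeps it, which is the kind of single-layer failure the entry describes.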
Details
Motivation: The prefill stage is the computational bottleneck of long-context LLM inference; existing token-ranking acceleration methods suffer from unstable token importance estimates that vary widely between layers, and there is no way to evaluate ranking quality independently of heuristic-specific architectures. Method: Introduces an Answer-Informed Oracle as a ground-truth reference for token importance (defined by the attention from generated answers back to the prompt), diagnoses the cross-layer instability of existing methods, and designs Cross-Layer Attention Aggregation (CLAA), which aggregates attention scores across layers to stabilize importance estimates. Result: CLAA significantly narrows the gap to the oracle upper bound and reduces Time-to-First-Token (TTFT) by up to 39% compared to the Full KV Cache baseline. Conclusion: Token importance should be aggregated across layers rather than taken from any single layer; CLAA is a simple, effective, and interpretable prefill acceleration approach and shows how much cross-layer consistency matters for token-ranking methods. Abstract: The prefill stage in long-context LLM inference remains a computational bottleneck. Recent token-ranking heuristics accelerate inference by selectively processing a subset of semantically relevant tokens. However, existing methods suffer from unstable token importance estimation, often varying between layers. Evaluating token-ranking quality independently from heuristic-specific architectures is challenging. To address this, we introduce an Answer-Informed Oracle, which defines ground-truth token importance by measuring attention from generated answers back to the prompt. This oracle reveals that existing heuristics exhibit high variance across layers: rankings can degrade sharply at specific layers, a failure mode invisible to end-to-end benchmarks. The diagnosis suggests a simple fix: aggregate scores across layers rather than relying on any single one. We implement this as Cross-Layer Attention Aggregation (CLAA), which closes the gap to the oracle upper bound and reduces Time-to-First-Token (TTFT) by up to 39% compared to the Full KV Cache baseline.
[37] Surgical Activation Steering via Generative Causal Mediation
Aruna Sankaranarayanan,Amir Zur,Atticus Geiger,Dylan Hadfield-Menell
Main category: cs.CL
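TL;DR: This paper proposes Generative Causal Mediation (GCM), a method for locating and intervening on the model components (such as attention heads) that drive behaviors in long-form responses (such as refusal, sycophancy, and style transfer). It relies on causal rather than correlational analysis to achieve more effective behavior control.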
Details
Motivation: How can we intervene in a language model to control behaviors that are diffused across many tokens of a long-form response? Existing correlational probe-based methods have limited effectiveness, so a more reliable causal localization and intervention mechanism is needed. Method: Proposes Generative Causal Mediation (GCM): construct a dataset of contrasting inputs and responses, quantify how strongly each model component (e.g., an attention head) causally mediates a binary concept, and select the strongest mediators for steering. Result: On three tasks (refusal, sycophancy, and style transfer) across three language models, GCM successfully localizes concepts expressed in long-form responses and consistently outperforms correlational probe-based baselines when steering with a sparse set of attention heads. Conclusion: GCM is an effective approach for localizing and controlling long-form LLM response behavior, offering a causally grounded route to interpretability and controllability. Abstract: Where should we intervene in a language model (LM) to control behaviors that are diffused across many tokens of a long-form response? We introduce Generative Causal Mediation (GCM), a procedure for selecting model components, e.g., attention heads, to steer a binary concept (e.g., talk in verse vs. talk in prose) from contrastive long-form responses. In GCM, we first construct a dataset of contrasting inputs and responses. Then, we quantify how individual model components mediate the contrastive concept and select the strongest mediators for steering. We evaluate GCM on three tasks (refusal, sycophancy, and style transfer) across three language models. GCM successfully localizes concepts expressed in long-form responses and consistently outperforms correlational probe-based baselines when steering with a sparse set of attention heads. Together, these results demonstrate that GCM provides an effective approach for localizing and controlling the long-form responses of LMs.
[38] Language Statistics and False Belief Reasoning: Evidence from 41 Open-Weight LMs
Sean Trott,Samuel Taylor,Cameron Jones,James A. Michaelov,Pamela D. Rivière
Main category: cs.CL
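TL;DR: This study replicates and extends the false belief task on 41 open-weight language models. About 34% of the models are sensitive to implied knowledge states, but none fully "explains away" the effect in humans; larger models show higher sensitivity and stronger psychometric predictive power. Based on model behavior, it also proposes a new hypothesis: both humans and models are more likely to attribute false beliefs when knowledge states are cued with a non-factive verb (such as "thinks"), and the effect is similar in magnitude for humans and models, suggesting that distributional statistics of language may explain it.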
Details
Motivation: Existing research on mental state reasoning in language models (LMs) relies mostly on a small number of closed-source models, which limits rigorous testing of theories of human social cognition (such as the theory that mental state reasoning emerges in part from language exposure) and comprehensive evaluation of LM capacities. Method: Replicates and extends the false belief task on 41 open-weight LMs from distinct model families, systematically assessing their sensitivity to implied knowledge states, scaling effects, and psychometric predictive power, and uses LM behavior to generate and test a new hypothesis about human cognition (how the way knowledge is cued affects false belief attribution). Result: 34% of the LMs tested are sensitive to implied knowledge states; larger models show higher sensitivity and psychometric predictive power; both humans and LMs attribute false beliefs more readily when knowledge is cued with a non-factive verb ("John thinks...") than when cued indirectly ("John looks in the..."), and the human effect size falls within the distribution of LM effect sizes, whereas for the primary effect (sensitivity to knowledge states) human sensitivity exceeds that of LMs. Conclusion: Larger samples of open-weight LMs allow more rigorous tests of theories of human cognition and evaluation of LM capacities; distributional statistics of language may suffice to explain the human bias in knowledge cueing, but not the core capacity for mental state reasoning. Abstract: Research on mental state reasoning in language models (LMs) has the potential to inform theories of human social cognition, such as the theory that mental state reasoning emerges in part from language exposure, and our understanding of LMs themselves. Yet much published work on LMs relies on a relatively small sample of closed-source LMs, limiting our ability to rigorously test psychological theories and evaluate LM capacities. Here, we replicate and extend published work on the false belief task by assessing LM mental state reasoning behavior across 41 open-weight models (from distinct model families). We find sensitivity to implied knowledge states in 34% of the LMs tested; however, consistent with prior work, none fully "explain away" the effect in humans. Larger LMs show increased sensitivity and also exhibit higher psychometric predictive power. Finally, we use LM behavior to generate and test a novel hypothesis about human cognition: both humans and LMs show a greater bias towards attributing false beliefs when knowledge states are cued using a non-factive verb ("John thinks...") than when cued indirectly ("John looks in the..."). Unlike the primary effect of knowledge states, where human sensitivity exceeds that of LMs, the magnitude of the human knowledge cue effect falls squarely within the distribution of LM effect sizes, suggesting that distributional statistics of language can in principle account for the latter but not the former in humans. These results demonstrate the value of using larger samples of open-weight LMs to test theories of human cognition and evaluate LM capacities.
[39] Updating Parametric Knowledge with Context Distillation Retains Post-Training Capabilities
Shankar Padmanabhan,Mustafa Omer Gul,Tanya Goyal
Main category: cs.CL
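TL;DR: This paper proposes DiSC, a context-distillation method for continual knowledge adaptation of large language models that learns new knowledge while effectively mitigating forgetting of existing capabilities.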
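The objective described in this entry, a KL divergence between teacher and student next-token distributions on the tokens they share, can be sketched in a few lines. This is a minimal sketch with toy probabilities; the direction KL(teacher || student), the averaging over positions, and all names are assumptions, since the abstract does not give the exact loss.

```python
import math
from typing import List


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def shared_token_loss(teacher: List[List[float]], student: List[List[float]]) -> float:
    """Mean KL(teacher || student) over the shared token positions (assumed form)."""
    per_position = [kl_divergence(t, s) for t, s in zip(teacher, student)]
    return sum(per_position) / len(per_position)


# Hypothetical next-token distributions at two shared positions, vocabulary of size 2.
teacher_dists = [[0.5, 0.5], [0.9, 0.1]]
student_far = [[0.9, 0.1], [0.5, 0.5]]

loss_far = shared_token_loss(teacher_dists, student_far)
loss_same = shared_token_loss(teacher_dists, teacher_dists)
```

The loss is zero when the student already matches the teacher and positive otherwise, so minimizing it pulls the student's predictions toward the teacher's on those positions.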
Details
Motivation: Existing methods cannot learn knowledge from a new document corpus while also avoiding forgetting of previously acquired capabilities (such as instruction-following, reasoning, and factual knowledge). Method: Proposes Distillation via Split Contexts (DiSC), which derives student and teacher distributions by conditioning on distinct segments of a training example and minimizes the KL divergence on the shared tokens, with no explicit generation step. Result: Experiments on four post-trained models and two adaptation domains show that DiSC achieves a better trade-off between learning new knowledge and mitigating forgetting than existing finetuning and distillation methods. Conclusion: DiSC is a simple, efficient context-distillation framework for continual knowledge adaptation that balances knowledge updates with capability retention. Abstract: Post-training endows pretrained LLMs with a variety of desirable skills, including instruction-following, reasoning, and others. However, these post-trained LLMs only encode knowledge up to a cut-off date, necessitating continual adaptation. Unfortunately, existing solutions cannot simultaneously learn new knowledge from an adaptation document corpus and mitigate the forgetting of earlier learned capabilities. To address this, we introduce Distillation via Split Contexts (DiSC), a simple context-distillation based approach for continual knowledge adaptation. DiSC derives student and teacher distributions by conditioning on distinct segments of the training example and minimizes the KL divergence on the shared tokens. This allows us to efficiently apply context-distillation without requiring explicit generation steps during training. We run experiments on four post-trained models and two adaptation domains. Compared to prior finetuning and distillation methods for continual adaptation, DiSC consistently reports the best trade-off between learning new knowledge and mitigating forgetting of previously learned skills like instruction-following, reasoning, and factual knowledge.
[40] Missing-by-Design: Certifiable Modality Deletion for Revocable Multimodal Sentiment Analysis
Rong Fu,Wenxin Zhang,Ziming Wang,Chunlei Meng,Jiaxuan Lu,Jiekai Wu,Kangan Qian,Hao Zhang,Simon Fong
Main category: cs.CL
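TL;DR: Proposes the Missing-by-Design (MBD) framework, which supports on-demand revocation of specific modality data in multimodal sentiment analysis, maintaining model performance while ensuring privacy compliance.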
Details
Motivation: Multimodal systems that process sensitive personal data need to support verifiable revocation of specific modalities at the request of users or regulators, to meet privacy compliance and user autonomy requirements. Method: MBD combines structured representation learning with a certifiable parameter-modification pipeline: it learns property-aware embeddings and uses a generator to reconstruct missing modalities; for deletion requests, it applies saliency-driven candidate selection and a calibrated Gaussian update to produce a machine-verifiable Modality Deletion Certificate. Result: On benchmark datasets, MBD maintains strong predictive performance under incomplete inputs and achieves a practical privacy-utility trade-off, positioning targeted unlearning as a more efficient alternative to full retraining. Conclusion: MBD gives multimodal systems a verifiable, scalable modality-level revocation mechanism and establishes surgical unlearning as a practical alternative for privacy protection. Abstract: As multimodal systems increasingly process sensitive personal data, the ability to selectively revoke specific data modalities has become a critical requirement for privacy compliance and user autonomy. We present Missing-by-Design (MBD), a unified framework for revocable multimodal sentiment analysis that combines structured representation learning with a certifiable parameter-modification pipeline. Revocability is critical in privacy-sensitive applications where users or regulators may request removal of modality-specific information. MBD learns property-aware embeddings and employs generator-based reconstruction to recover missing channels while preserving task-relevant signals. For deletion requests, the framework applies saliency-driven candidate selection and a calibrated Gaussian update to produce a machine-verifiable Modality Deletion Certificate. Experiments on benchmark datasets show that MBD achieves strong predictive performance under incomplete inputs and delivers a practical privacy-utility trade-off, positioning surgical unlearning as an efficient alternative to full retraining.
[41] Balancing Faithfulness and Performance in Reasoning via Multi-Listener Soft Execution
Nithin Sivakumaran,Shoubin Yu,Hyunji Lee,Yue Zhang,Ali Payani,Mohit Bansal,Elias Stengel-Eskin
Main category: cs.CL
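TL;DR: This paper proposes REMUL, which uses multi-listener reinforcement learning to improve the faithfulness of chain-of-thought (CoT) reasoning while balancing interpretability and task performance, substantially improving faithfulness metrics and accuracy on multiple benchmarks.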
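The speaker-listener loop in this entry can be pictured with a toy reward computation: truncate the speaker's trace, let each listener continue it to an answer, and score the speaker by how many listeners succeed. Everything here is a hypothetical stand-in; in particular, scoring against a target answer, the truncation fraction, and the rule-based listeners are assumptions for illustration, not the paper's reward.

```python
from typing import Callable, List

# A listener maps a truncated reasoning trace to a final answer (hypothetical interface).
Listener = Callable[[str], str]


def truncate_trace(steps: List[str], keep_fraction: float) -> str:
    """Keep the leading fraction of reasoning steps (at least one step)."""
    keep = max(1, int(len(steps) * keep_fraction))
    return "\n".join(steps[:keep])


def listener_reward(truncated: str, listeners: List[Listener], target: str) -> float:
    """Fraction of listeners whose answer matches the target (assumed reward form)."""
    answers = [listener(truncated) for listener in listeners]
    return sum(answer == target for answer in answers) / len(listeners)


# Toy rule-based listeners standing in for listener models.
def careful_listener(trace: str) -> str:
    return "12" if "3 * 4" in trace else "unknown"


def guessing_listener(trace: str) -> str:
    return "7"


steps = [
    "There are 3 boxes with 4 apples each.",
    "Total = 3 * 4.",
    "So the answer is 12.",
]
truncated = truncate_trace(steps, keep_fraction=0.7)
reward = listener_reward(truncated, [careful_listener, guessing_listener], target="12")
```

A trace that states its key step early lets the careful listener finish correctly from the truncated prefix, which is the intuition behind rewarding reasoning that others can follow.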
Details
Motivation: Chain-of-thought (CoT) reasoning sometimes fails to reflect the actual computation of a large language model (LLM), which undermines its interpretability, and improving faithfulness and interpretability often comes at the cost of task performance. Method: Proposes Reasoning Execution by Multiple Listeners (REMUL), based on the hypothesis that reasoning others can follow is more faithful: a speaker generates a reasoning trace, which is truncated and passed to multiple listener models that execute it to reach an answer; the speaker is rewarded according to whether the listeners can execute the trace successfully, with masked supervised finetuning as correctness regularization. Result: On multiple reasoning benchmarks (BIG-Bench Extra Hard, MuSR, ZebraLogicBench, and FOLIO), REMUL substantially improves three faithfulness measures (hint attribution, early answering AOC, and mistake injection AOC) while also improving accuracy; analysis shows the gains are robust across training domains, translate to legibility gains, and are associated with shorter, more direct CoTs. Conclusion: REMUL eases the trade-off between CoT faithfulness and task performance, offering a new approach to building reasoning systems that are both trustworthy and high-performing. Abstract: Chain-of-thought (CoT) reasoning sometimes fails to faithfully reflect the true computation of a large language model (LLM), hampering its utility in explaining how LLMs arrive at their answers. Moreover, optimizing for faithfulness and interpretability in reasoning often degrades task performance. To address this tradeoff and improve CoT faithfulness, we propose Reasoning Execution by Multiple Listeners (REMUL), a multi-party reinforcement learning approach. REMUL builds on the hypothesis that reasoning traces which other parties can follow will be more faithful. A speaker model generates a reasoning trace, which is truncated and passed to a pool of listener models who "execute" the trace, continuing the trace to an answer. Speakers are rewarded for producing reasoning that is clear to listeners, with additional correctness regularization via masked supervised finetuning to counter the tradeoff between faithfulness and performance. On multiple reasoning benchmarks (BIG-Bench Extra Hard, MuSR, ZebraLogicBench, and FOLIO), REMUL consistently and substantially improves three measures of faithfulness (hint attribution, early answering area over the curve (AOC), and mistake injection AOC) while also improving accuracy. Our analysis finds that these gains are robust across training domains, translate to legibility gains, and are associated with shorter and more direct CoTs.
[42] LLMs Exhibit Significantly Lower Uncertainty in Creative Writing Than Professional Writers
Peiqi Sui
Main category: cs.CL
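TL;DR: This paper argues that uncertainty is a key factor limiting current LLMs' performance in creative writing. It uses information-theoretic methods to quantify the "uncertainty gap" between human writing and model-generated text, finds that human text has significantly higher uncertainty and that the gap correlates strongly with writing quality, and calls for new alignment paradigms that distinguish harmful hallucination from beneficial ambiguity.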
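One simple way to picture the uncertainty gap is as a difference in average next-token entropy. The sketch below is illustrative only: the entry mentions information-theoretic measures such as entropy but not the exact estimator, so the function names, the per-position averaging, and the toy distributions are assumptions.

```python
import math
from typing import List


def entropy_bits(dist: List[float]) -> float:
    """Shannon entropy of one next-token distribution, in bits."""
    return -sum(p * math.log2(p) for p in dist if p > 0.0)


def mean_entropy(dists: List[List[float]]) -> float:
    """Average entropy over token positions."""
    return sum(entropy_bits(d) for d in dists) / len(dists)


def uncertainty_gap(human_dists: List[List[float]], model_dists: List[List[float]]) -> float:
    """Mean entropy on human text minus mean entropy on model text (assumed definition)."""
    return mean_entropy(human_dists) - mean_entropy(model_dists)


# Hypothetical predictive distributions at two positions each, vocabulary of size 4.
human_dists = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]  # spread out
model_dists = [[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]          # fully peaked

gap = uncertainty_gap(human_dists, model_dists)
```

A positive gap means the human-written continuation is less predictable than the model-generated one under this measure.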
Details
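Motivation: LLM creative writing often comes across as trite and formulaic; literary theory holds that uncertainty is a necessary condition for creative expression, but current alignment strategies suppress uncertainty to ensure factuality and reduce hallucination, which limits creativity. Method: Formalizes the "uncertainty gap" and, using information-theoretic measures (such as entropy), runs a controlled analysis of continuations from 28 LLMs on high-quality story datasets, comparing the uncertainty of human-authored and model-generated text, with breakdowns by model type (base, instruction-tuned, and reasoning models) and by domain (creative vs. functional). Result: Human writing consistently shows significantly higher uncertainty than model outputs; instruction-tuned and reasoning models have lower uncertainty than their base counterparts, widening the gap; the gap is more pronounced in creative writing than in functional domains and correlates strongly and positively with human-assessed writing quality. Conclusion: Reaching human-level creative writing requires new uncertainty-aware alignment paradigms that preserve the constructive ambiguity literary work needs while suppressing destructive hallucination. Abstract: We argue that uncertainty is a key and understudied limitation of LLMs' performance in creative writing, which is often characterized as trite and cliché-ridden. Literary theory identifies uncertainty as a necessary condition for creative expression, while current alignment strategies steer models away from uncertain outputs to ensure factuality and reduce hallucination. We formalize this tension by quantifying the "uncertainty gap" between human-authored stories and model-generated continuations. Through a controlled information-theoretic analysis of 28 LLMs on high-quality storytelling datasets, we demonstrate that human writing consistently exhibits significantly higher uncertainty than model outputs. We find that instruction-tuned and reasoning models exacerbate this trend compared to their base counterparts; furthermore, the gap is more pronounced in creative writing than in functional domains, and strongly correlates with writing quality. Achieving human-level creativity requires new uncertainty-aware alignment paradigms that can distinguish between destructive hallucinations and the constructive ambiguity required for literary richness.
[43] Beyond Learning: A Training-Free Alternative to Model Adaptation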
Namkyung Yoon,Kyeonghyun Yoo,Wooyong Jung,Sanghong Kim,Hwangnam Kim
Main category: cs.CL
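TL;DR: This paper proposes a training-free "module transplantation" technique for language models: it identifies internal modules with pronounced local activation under a specific task and transplants them, giving an underperforming model an immediate functional boost. It shows substantial performance recovery in both cross-generation and base-versus-instruction-tuned experiments.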
Details
Motivation: Language models sometimes underperform their predecessors, and conventional remedies are resource-intensive and slow, so a lightweight intervention that takes effect immediately is needed. Method: Uses activation-based analysis to identify task-relevant, locally active modules in a language model and transplants them directly into the target model, without any additional training or fine-tuning. Result: In the cross-generation setting, transplantation brings the underperforming model up to twice the target baseline, with gap-based recovery above 100%; between a base model and its instruction-tuned counterpart, it reaches about 2.33 times the target baseline, with gap-based recovery up to 100% in the best case. Conclusion: Language models have an inherent task-localized modularity, and module transplantation is a feasible, efficient way to transfer capability, opening up "model transplantation" as a new research direction. Abstract: Despite the continuous research and evolution of language models, they sometimes underperform previous versions. Existing approaches to overcome these challenges are resource-intensive, highlighting the need for alternatives that enable immediate action. We assume that each language model has a local module inside that is suitable for a specific function. First, this work identifies a set of modules showing consistent and local activation changes under an inference workload through activation-based analysis. Subsequently, we transplant an internal module that is properly activated for a specific task into the target model, leading to immediate and measurable functional changes without additional training or fine-tuning. To experimentally demonstrate the effectiveness of the transplant technique, we quantify the relationship between transplant strength and performance improvement under different conditions for two language models. In the cross-generation setting, we find that transplanting activation-selected modules can substantially improve the underperforming model, reaching up to twice the target baseline and achieving gap-based recovery above 100%. Moreover, in transplant experiments between a base model and its instruction-tuned counterpart, transplantation improves the underperforming model toward the stronger baseline, yielding up to about 2.33 times the target baseline with gap-based recovery reaching up to 100% in the best case. These results show that meaningful capacity transfer can be realized through the implantation of highly localized modules implied by language models. Overall, this work provides empirical evidence for task-localized modularity in language models and presents a new research area: model transplantation.
[44] The Validity of Coreference-based Evaluations of Natural Language Understanding
Ian Porada
Main category: cs.CL
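TL;DR: By extending how coreference resolution is evaluated, this thesis shows that current evaluation results are non-generalizable and inconsistent, and it proposes a new evaluation based on inferring the relative plausibility of events. It finds that modern language models perform well on standard benchmarks but still generalize poorly when evaluation conditions change slightly.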
Details
Motivation: Existing coreference-based evaluations have measurement validity problems, such as contested definitions of coreference and inconsistent results across benchmarks, which make conclusions hard to generalize, so evaluation practice needs to improve to measure model ability more accurately. Method: First analyzes design flaws in standard coreference evaluations; then proposes and implements a new evaluation focused on the ability to infer the relative plausibility of events, to test more deeply a core ability underlying coreference resolution. Result: Modern language models outperform earlier baseline systems on standard benchmarks, but their performance depends heavily on evaluation conditions, and generalization drops noticeably when contexts are slightly modified; the new evaluation exposes shortfalls in the generalization one would expect of a human. Conclusion: The current NLP paradigm has made progress on coreference resolution but is limited by validity flaws in evaluation and weak model generalization; future work needs more reliable evaluation methods and systems that generalize more genuinely. Abstract: In this thesis, I refine our understanding as to what conclusions we can reach from coreference-based evaluations by expanding existing evaluation practices and considering the extent to which evaluation results are either converging or conflicting. First, I analyze standard coreference evaluations and show that their design often leads to non-generalizable conclusions due to issues of measurement validity, including contestedness (multiple, competing definitions of coreference) and convergent validity (evaluation results that rank models differently across benchmarks). Second, I propose and implement a novel evaluation focused on testing systems' ability to infer the relative plausibility of events, a key aspect of resolving coreference. Through this extended evaluation, I find that contemporary language models demonstrate strong performance on standard benchmarks, improving over earlier baseline systems within certain domains and types of coreference, but remain sensitive to the evaluation conditions: they often fail to generalize in ways one would expect a human to be capable of when evaluation contexts are slightly modified. Taken together, these findings clarify both the strengths, such as improved accuracy over baselines on widely used evaluations, and the limitations of the current NLP paradigm, including weaknesses in measurement validity, and suggest directions for future work in developing better evaluation methods and more genuinely generalizable systems.
[45] Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models
Melkamu Abay Mersha,Jugal Kalita
Main category: cs.CL
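TL;DR: This paper proposes the Context-Aware Layer-wise Integrated Gradients (CA-LIG) framework for unified, hierarchical explanation of Transformer decision-making, significantly improving attribution faithfulness, context sensitivity, and the semantic coherence of visualizations.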
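For reference, the standard Integrated Gradients attribution that this framework builds on is usually written as below, for an input $x$, a baseline $x'$, and a model output $F$. How CA-LIG chooses baselines within each Transformer block and fuses the result with attention gradients is not given in this entry, so only the standard definition is shown.

```latex
\mathrm{IG}_i(x) = (x_i - x'_i) \int_{0}^{1}
  \frac{\partial F\bigl(x' + \alpha\,(x - x')\bigr)}{\partial x_i}\, d\alpha
```

A layer-wise variant would presumably apply the same path integral to the hidden representation entering a block rather than to the raw input, which is an assumption about the method rather than a statement from the source.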
Details
Motivation: Existing explainability methods are limited to final-layer attributions, lack context-awareness, cannot unify local token-level attributions with global attention patterns, and cannot characterize how relevance evolves across layers or how structural components affect decisions. Method: Proposes the CA-LIG framework, which computes layer-wise Integrated Gradients within each Transformer block and fuses the token-level attributions with class-specific attention gradients, producing signed, context-sensitive attribution maps that trace the flow of relevance through the layers. Result: Across tasks (sentiment analysis, long and multi-class document classification, hate speech detection in a low-resource language, and image classification) and models (BERT, XLM-R, AfroLM, MAE-ViT), CA-LIG gives more faithful attributions, stronger context sensitivity, and clearer semantic visualizations than existing methods. Conclusion: CA-LIG provides a more comprehensive, context-aware, and reliable explanation of Transformer decisions, advancing practical interpretability and conceptual understanding of deep neural models. Abstract: Transformer models achieve state-of-the-art performance across domains and tasks, yet their deeply layered representations make their predictions difficult to interpret. Existing explainability methods rely on final-layer attributions, capture either local token-level attributions or global attention patterns without unification, and lack context-awareness of inter-token dependencies and structural components. They also fail to capture how relevance evolves across layers and how structural components shape decision-making. To address these limitations, we propose the Context-Aware Layer-wise Integrated Gradients (CA-LIG) Framework, a unified hierarchical attribution framework that computes layer-wise Integrated Gradients within each Transformer block and fuses these token-level attributions with class-specific attention gradients. This integration yields signed, context-sensitive attribution maps that capture supportive and opposing evidence while tracing the hierarchical flow of relevance through the Transformer layers. We evaluate the CA-LIG Framework across diverse tasks, domains, and transformer model families, including sentiment analysis and long and multi-class document classification with BERT, hate speech detection in a low-resource language setting with XLM-R and AfroLM, and image classification with a Masked Autoencoder Vision Transformer model. Across all tasks and architectures, CA-LIG provides more faithful attributions, shows stronger sensitivity to contextual dependencies, and produces clearer, more semantically coherent visualizations than established explainability methods. These results indicate that CA-LIG provides a more comprehensive, context-aware, and reliable explanation of Transformer decision-making, advancing both the practical interpretability and conceptual understanding of deep neural models.
[46] Long-Tail Knowledge in Large Language Models: Taxonomy, Mechanisms, Interventions and Implications
Sanket Badhe,Deep Shah,Nehal Kathrotia
Main category: cs.CL
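TL;DR: This paper presents a structured analytical framework for long-tail knowledge in large language models, organized along four dimensions: definitions, mechanisms of loss and distortion, technical interventions, and implications such as fairness. It also points out how current evaluation practices mask tail behavior and discusses open challenges in privacy, sustainability, and governance.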
Details
Motivation: Although LLMs' average performance improves with scale, they continue to fail on long-tail knowledge (low-frequency, domain-specific, cultural, and temporal), and these failures remain poorly characterized. Method: Builds a four-axis analytical framework that integrates technical and sociotechnical perspectives: (1) how long-tail knowledge is defined; (2) the mechanisms by which knowledge is lost or distorted during training and inference; (3) technical interventions that mitigate failures; (4) the implications of failures for fairness, accountability, transparency, and user trust. It also analyzes how existing evaluation practices obscure tail behavior. Result: Produces the first unified conceptual framework that systematically characterizes how long-tail knowledge in LLMs is defined, lost, evaluated, and manifested in practice, and exposes evaluation bias, accountability difficulties, and deeper constraints from privacy, sustainability, and governance. Conclusion: The long-tail knowledge problem concerns not only the limits of model capability but also ethical, social, and institutional challenges in deployment; it calls for interdisciplinary collaboration and, beyond technical improvements, new evaluation paradigms and stronger governance. Abstract: Large language models (LLMs) are trained on web-scale corpora that exhibit steep power-law distributions, in which knowledge is highly long-tailed, with most of it appearing infrequently. While scaling has improved average-case performance, persistent failures on low-frequency, domain-specific, cultural, and temporal knowledge remain poorly characterized. This paper develops a structured taxonomy and analysis of long-tail knowledge in large language models, synthesizing prior work across technical and sociotechnical perspectives. We introduce a structured analytical framework that synthesizes prior work across four complementary axes: how long-tail knowledge is defined, the mechanisms by which it is lost or distorted during training and inference, the technical interventions proposed to mitigate these failures, and the implications of these failures for fairness, accountability, transparency, and user trust. We further examine how existing evaluation practices obscure tail behavior and complicate accountability for rare but consequential failures. The paper concludes by identifying open challenges related to privacy, sustainability, and governance that constrain long-tail knowledge representation. Taken together, this paper provides a unifying conceptual framework for understanding how long-tail knowledge is defined, lost, evaluated, and manifested in deployed language model systems.
[47] Are LLMs Ready to Replace Bangla Annotators?
Md. Najib Hasan,Touseef Hasan,Souvika Sarkar
Main category: cs.CL
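TL;DR: This study systematically evaluates 17 large language models (LLMs) on zero-shot annotation of Bangla hate speech and finds notable annotator bias and unstable judgments. Surprisingly, larger model scale does not necessarily improve annotation quality; smaller, more task-aligned models are often more consistent.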
Details
Motivation: To examine the reliability and potential bias of LLMs as automated annotators in low-resource, identity-sensitive settings (such as Bangla hate speech detection), a task where human agreement is itself low and bias has serious consequences. Method: Runs a systematic benchmark of 17 LLMs under a unified evaluation framework, using zero-shot Bangla hate speech annotation, and analyzes annotation consistency, bias, and scale effects. Result: LLMs commonly show annotator bias and unstable judgments; model scale does not correlate positively with annotation quality, and some smaller models are more stable and consistent than larger ones. Conclusion: Current LLMs have important limitations for sensitive annotation tasks in low-resource settings; they need careful evaluation before deployment, and larger models cannot be assumed to be more reliable. Abstract: Large Language Models (LLMs) are increasingly used as automated annotators to scale dataset creation, yet their reliability as unbiased annotators, especially for low-resource and identity-sensitive settings, remains poorly understood. In this work, we study the behavior of LLMs as zero-shot annotators for Bangla hate speech, a task where even human agreement is challenging, and annotator bias can have serious downstream consequences. We conduct a systematic benchmark of 17 LLMs using a unified evaluation framework. Our analysis uncovers annotator bias and substantial instability in model judgments. Surprisingly, increased model scale does not guarantee improved annotation quality: smaller, more task-aligned models frequently exhibit more consistent behavior than their larger counterparts. These results highlight important limitations of current LLMs for sensitive annotation tasks in low-resource languages and underscore the need for careful evaluation before deployment.
[48] Aladdin-FTI @ AMIYA Three Wishes for Arabic NLP: Fidelity, Diglossia, and Multidialectal Generation
Jonathan Mutal,Perla Al Almaoui,Simon Hengchen,Pierrette Bouillon
Main category: cs.CL
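TL;DR: This paper presents Aladdin-FTI, a system for generating and translating Arabic dialects that supports bidirectional translation among several dialects, Modern Standard Arabic (MSA), and English; the code and model are publicly available.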
Details
Motivation: Arabic dialects have long been under-represented in NLP because their non-standardization and high variability challenge computational modeling; LLMs offer a new path by allowing Arabic to be treated as a pluricentric language rather than a monolithic system. Method: Proposes Aladdin-FTI, an LLM-based system that supports text generation in five Arabic dialects (Moroccan, Egyptian, Palestinian, Syrian, and Saudi) and bidirectional translation between these dialects, MSA, and English. Result: Builds and releases a unified model system supporting multidialectal generation and cross-lingual translation, submitted to the AMIYA shared task. Conclusion: Aladdin-FTI demonstrates the feasibility of using LLMs to model the pluricentric nature of Arabic and offers a reusable, extensible approach for low-resource Arabic dialect NLP. Abstract: Arabic dialects have long been under-represented in Natural Language Processing (NLP) research due to their non-standardization and high variability, which pose challenges for computational modeling. Recent advances in the field, such as Large Language Models (LLMs), offer promising avenues to address this gap by enabling Arabic to be modeled as a pluricentric language rather than a monolithic system. This paper presents Aladdin-FTI, our submission to the AMIYA shared task. The proposed system is designed to both generate and translate dialectal Arabic (DA). Specifically, the model supports text generation in Moroccan, Egyptian, Palestinian, Syrian, and Saudi dialects, as well as bidirectional translation between these dialects, Modern Standard Arabic (MSA), and English. The code and trained model are publicly available.
[49] MultiCW: A Large-Scale Balanced Benchmark Dataset for Training Robust Check-Worthiness Detection Models
Martin Hyben,Sebastian Kula,Jan Cegin,Jakub Simko,Ivan Srba,Robert Moro
Main category: cs.CL
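TL;DR: This paper introduces MultiCW, a multilingual dataset for detecting check-worthy claims, and benchmarks a range of models on it; fine-tuned models outperform zero-shot LLMs on the task.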
Details
Motivation: LLMs are increasingly used in media fact verification, but automated support for detecting check-worthy claims, a key step, is still lacking. Method: Builds MultiCW, a balanced multilingual dataset covering 16 languages, 7 topical domains, and 2 writing styles, plus a cross-lingual out-of-distribution evaluation set; compares 3 fine-tuned multilingual transformers with 15 commercial and open LLMs in zero-shot settings. Result: Fine-tuned models consistently outperform zero-shot LLMs on claim classification and generalize strongly across languages, domains, and styles. Conclusion: MultiCW provides a rigorous multilingual resource for advancing automated fact-checking and supports systematic comparison of fine-tuned models and cutting-edge LLMs on check-worthy claim detection. Abstract: Large Language Models (LLMs) are beginning to reshape how media professionals verify information, yet automated support for detecting check-worthy claims, a key step in the fact-checking process, remains limited. We introduce the Multi-Check-Worthy (MultiCW) dataset, a balanced multilingual benchmark for check-worthy claim detection spanning 16 languages, 7 topical domains, and 2 writing styles. It consists of 123,722 samples, evenly distributed between noisy (informal) and structured (formal) texts, with balanced representation of check-worthy and non-check-worthy classes across all languages. To probe robustness, we also introduce an equally balanced out-of-distribution evaluation set of 27,761 samples in 4 additional languages. To provide baselines, we benchmark 3 common fine-tuned multilingual transformers against a diverse set of 15 commercial and open LLMs under zero-shot settings. Our findings show that fine-tuned models consistently outperform zero-shot LLMs on claim classification and show strong out-of-distribution generalization across languages, domains, and styles. MultiCW provides a rigorous multilingual resource for advancing automated fact-checking and enables systematic comparisons between fine-tuned models and cutting-edge LLMs on the check-worthy claim detection task.
[50] MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks
Zexue He,Yu Wang,Churan Zhi,Yuanzhe Hu,Tzu-Ping Chen,Lang Yin,Ze Chen,Tong Arthur Wu,Siru Ouyang,Zihan Wang,Jiaxin Pei,Julian McAuley,Yejin Choi,Alex Pentland
Main category: cs.CL
TL;DR: This paper proposes MemoryArena, a new benchmark for evaluating how memory-equipped agents couple memory with action across multi-session interactions.
Details
Motivation: Existing evaluations test memory and action separately and so fail to reflect how tightly the two are coupled in realistic settings, where agents build memory dynamically while interacting with the environment and rely on it to guide later decisions.
Method: Designs the MemoryArena evaluation framework, made up of human-crafted multi-session tasks with interdependent subtasks. Agents must distill experience into memory during earlier interactions and draw on it in later tasks to reach the overall goal. The benchmark covers web navigation, preference-constrained planning, progressive information search, and sequential formal reasoning.
Result: Agents with near-saturated performance on long-context memory benchmarks such as LoCoMo perform poorly on MemoryArena, exposing a significant gap in current memory evaluation.
Conclusion: MemoryArena fills the gap in evaluating the closed memory-action loop of agents and underscores the need for memory evaluation grounded in realistic interaction.
Abstract: Existing evaluations of agents with memory typically assess memorization and action in isolation. One class of benchmarks evaluates memorization by testing recall of past conversations or text but fails to capture how memory is used to guide future decisions. Another class focuses on agents acting in single-session tasks without the need for long-term memory. However, in realistic settings, memorization and action are tightly coupled: agents acquire memory while interacting with the environment, and subsequently rely on that memory to solve future tasks. To capture this setting, we introduce MemoryArena, a unified evaluation gym for benchmarking agent memory in multi-session Memory-Agent-Environment loops. The benchmark consists of human-crafted agentic tasks with explicitly interdependent subtasks, where agents must learn from earlier actions and feedback by distilling experiences into memory, and subsequently use that memory to guide later actions to solve the overall task. MemoryArena supports evaluation across web navigation, preference-constrained planning, progressive information search, and sequential formal reasoning, and reveals that agents with near-saturated performance on existing long-context memory benchmarks like LoCoMo perform poorly in our agentic setting, exposing a gap in current evaluations for agents with memory.
[51] Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents
Nivya Talokar,Ayush K Tarun,Murari Mandal,Maksym Andriushchenko,Antoine Bosselut
Main category: cs.CL
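TL;DR: This paper proposes STING, a framework for multi-turn red-teaming of how far LLM agents will go in carrying out illicit tasks, using adaptive prompts and judge agents. It finds that multi-turn scenarios elicit jailbreaks more readily than single-turn prompts, and that in multilingual settings attack success does not increase monotonically as language resources decrease.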
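Illustrative sketch (not from the paper): this entry models multi-turn red-teaming as a time-to-first-jailbreak random variable. The abstract does not define its discovery curves or the Restricted Mean Jailbreak Discovery metric, so the code below assumes the standard survival-analysis analogues: the discovery curve is the fraction of scenarios jailbroken by turn k, and the restricted mean is the expected value of min(first-jailbreak turn, horizon). The function names and the sample data are hypothetical.

```python
from typing import List, Optional


def discovery_curve(first_jailbreak: List[Optional[int]], horizon: int) -> List[float]:
    """Fraction of scenarios whose first jailbreak occurred at or before turn k,
    for k = 0..horizon. None means no jailbreak was observed."""
    n = len(first_jailbreak)
    return [
        sum(1 for t in first_jailbreak if t is not None and t <= k) / n
        for k in range(horizon + 1)
    ]


def restricted_mean_turns(first_jailbreak: List[Optional[int]], horizon: int) -> float:
    """Mean of min(first-jailbreak turn, horizon), computed as the sum of the
    'not yet jailbroken' probabilities over turns 0..horizon-1."""
    curve = discovery_curve(first_jailbreak, horizon)
    return sum(1.0 - curve[k] for k in range(horizon))


# Hypothetical outcomes for four scenarios (turn of first jailbreak, or None).
turns = [1, 3, None, 2]
curve = discovery_curve(turns, horizon=3)
restricted_mean = restricted_mean_turns(turns, horizon=3)
```

Under this assumed definition, a lower restricted mean indicates that jailbreaks are discovered in fewer turns; the paper's own metric may be normalized or oriented differently.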
Details
Motivation: Existing agent misuse benchmarks mainly test single prompts and cannot measure how agents end up assisting with harmful or illegal tasks over multi-turn interactions, which leaves an evaluation gap.
Method: Proposes STING (Sequential Testing of Illicit N-step Goal execution), an automated red-teaming framework that builds a multi-step illicit plan grounded in a benign persona and tracks phase completion through adaptive follow-ups and judge agents. It also establishes an analysis framework that models multi-turn red-teaming as a time-to-first-jailbreak random variable, introducing discovery curves, hazard-ratio attribution, and Restricted Mean Jailbreak Discovery.
Result: On AgentHarm scenarios, STING achieves substantially higher illicit-task completion than single-turn prompting and adapted multi-turn baselines. Multilingual evaluation across six non-English settings shows that attack success and illicit-task completion do not consistently rise in lower-resource languages, unlike common chatbot findings.
Conclusion: STING offers a practical, scalable way to evaluate and stress-test agent misuse in realistic deployments, which are inherently multi-turn and often multilingual.
Abstract: LLM-based agents execute real-world workflows via tools and memory. These affordances enable ill-intended adversaries to also use these agents to carry out complex misuse scenarios. Existing agent misuse benchmarks largely test single-prompt instructions, leaving a gap in measuring how agents end up helping with harmful or illegal tasks over multiple turns. We introduce STING (Sequential Testing of Illicit N-step Goal execution), an automated red-teaming framework that constructs a step-by-step illicit plan grounded in a benign persona and iteratively probes a target agent with adaptive follow-ups, using judge agents to track phase completion. We further introduce an analysis framework that models multi-turn red-teaming as a time-to-first-jailbreak random variable, enabling analysis tools like discovery curves, hazard-ratio attribution by attack language, and a new metric: Restricted Mean Jailbreak Discovery. Across AgentHarm scenarios, STING yields substantially higher illicit-task completion than single-turn prompting and chat-oriented multi-turn baselines adapted to tool-using agents. In multilingual evaluations across six non-English settings, we find that attack success and illicit-task completion do not consistently increase in lower-resource languages, diverging from common chatbot findings. Overall, STING provides a practical way to evaluate and stress-test agent misuse in realistic deployment settings, where interactions are inherently multi-turn and often multilingual.
[52] Label-Consistent Data Generation for Aspect-Based Sentiment Analysis Using LLM Agents
Mohammad H. A. Monfared,Lucie Flek,Akbar Karimi
Main category: cs.CL
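TL;DR: This paper proposes an agentic data augmentation method for aspect-based sentiment analysis (ABSA) that improves synthetic data quality through iterative generation and verification. Across multiple subtasks and datasets it outperforms conventional prompting, with especially large gains for T5-Base.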
Details
Motivation: Existing augmentation methods struggle to guarantee the label fidelity of synthetic data in ABSA, particularly for tasks that involve generating aspect terms; more controllable, structured generation mechanisms are needed.
Method: Designs an agentic data augmentation method with iterative generation and verification modules, and builds a closely matched prompting-based baseline with the same model and instructions to isolate the effect of the agentic structure. Both are evaluated with T5-Base and Tk-Instruct.
Result: Agentic augmentation beats prompting on label fidelity, especially on tasks such as ATE and ASPE that require generating aspect terms. Gains are larger when the synthetic data is combined with real data. T5-Base improves more, which brings it close to Tk-Instruct.
Conclusion: Agentic data augmentation is a more robust, controllable paradigm for expanding ABSA data, particularly suited to smaller or less heavily pretrained models.
Abstract: We propose an agentic data augmentation method for Aspect-Based Sentiment Analysis (ABSA) that uses iterative generation and verification to produce high quality synthetic training examples. To isolate the effect of agentic structure, we also develop a closely matched prompting-based baseline using the same model and instructions. Both methods are evaluated across three ABSA subtasks (Aspect Term Extraction (ATE), Aspect Sentiment Classification (ATSC), and Aspect Sentiment Pair Extraction (ASPE)), four SemEval datasets, and two encoder-decoder models: T5-Base and Tk-Instruct. Our results show that the agentic augmentation outperforms raw prompting in label preservation of the augmented data, especially when the tasks require aspect term generation. In addition, when combined with real data, agentic augmentation provides higher gains, consistently outperforming prompting-based generation. These benefits are most pronounced for T5-Base, while the more heavily pretrained Tk-Instruct exhibits smaller improvements. As a result, augmented data helps T5-Base achieve comparable performance with its counterpart.
[53] TabAgent: A Framework for Replacing Agentic Generative Components with Tabular-Textual Classifiers
Ido Levy,Eilam Shapira,Yinon Goldshtein,Avi Yaeli,Nir Mashkif,Segev Shlomov
Main category: cs.CL
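TL;DR: TabAgent is a new framework that replaces slow, expensive LLM-based generative decision modules in agentic systems (such as shortlisting) with a lightweight textual-tabular classifier, maintaining task success while cutting latency by about 95% and inference cost by 85-91%.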
Details
Motivation: Existing LLM-based agentic systems call the LLM repeatedly for closed-set decision tasks (such as routing, shortlisting, and verification), which leads to high latency and token cost; an efficient alternative is needed.
Method: TabAgent has three parts: (i) TabSchema, which extracts structured schema, state, and dependency features from execution trajectories; (ii) TabSynth, which augments supervision with schema-aligned synthetic data; and (iii) TabHead, a lightweight classifier that scores candidate actions.
Result: On the long-horizon AppWorld benchmark, TabAgent eliminates shortlist-time LLM calls without losing task success, reducing latency by about 95% and inference cost by 85-91%; it also generalizes to other agentic decision heads.
Conclusion: TabAgent proposes a paradigm of replacing generative bottlenecks with learned discriminative modules, offering an efficient, low-cost path for production agent architectures.
Abstract: Agentic systems, AI architectures that autonomously execute multi-step workflows to achieve complex goals, are often built using repeated large language model (LLM) calls for closed-set decision tasks such as routing, shortlisting, gating, and verification. While convenient, this design makes deployments slow and expensive due to cumulative latency and token usage. We propose TabAgent, a framework for replacing generative decision components in closed-set selection tasks with a compact textual-tabular classifier trained on execution traces. TabAgent (i) extracts structured schema, state, and dependency features from trajectories (TabSchema), (ii) augments coverage with schema-aligned synthetic supervision (TabSynth), and (iii) scores candidates with a lightweight classifier (TabHead). On the long-horizon AppWorld benchmark, TabAgent maintains task-level success while eliminating shortlist-time LLM calls, reducing latency by approximately 95% and inference cost by 85-91%. Beyond tool shortlisting, TabAgent generalizes to other agentic decision heads, establishing a paradigm for learned discriminative replacements of generative bottlenecks in production agent architectures.
[54] IndicEval: A Bilingual Indian Educational Evaluation Framework for Large Language Models
Saurabh Bharti,Gaurav Azad,Abhinaw Jagtap,Nachiket Tapas
Main category: cs.CL
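TL;DR: IndicEval is an evaluation platform built on real questions from high-stakes Indian examinations (such as UPSC, JEE, and NEET) that assesses LLM reasoning, subject knowledge, and bilingual adaptability in English and Hindi. Experiments show that chain-of-thought prompting improves accuracy, that models differ markedly, and that Hindi performance is generally well below English.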
Details
Motivation: Existing synthetic benchmarks fail to reflect real academic rigor and multilingual complexity. A scalable evaluation framework is needed that is based on real high-stakes exams, covers both STEM and the humanities, and supports English and Hindi.
Method: Builds the IndicEval platform, which integrates real UPSC, JEE, and NEET exam questions in English and Hindi, automates evaluation under Zero-Shot, Few-Shot, and Chain-of-Thought prompting, and supports modular addition of models and languages.
Result: (1) CoT prompting substantially improves reasoning accuracy across subjects and languages; (2) models (Gemini 2.0 Flash, GPT-4, Claude, LLaMA 3-70B) differ significantly on high-difficulty exams; (3) Hindi performance drops sharply under Zero-Shot, highlighting multilingual degradation.
Conclusion: IndicEval provides a practice-oriented, extensible basis for rigorous LLM evaluation in multilingual educational settings, exposes key weaknesses in bilingual reasoning and domain transfer, and offers actionable insights for improving model robustness and language adaptability.
Abstract: The rapid advancement of large language models (LLMs) necessitates evaluation frameworks that reflect real-world academic rigor and multilingual complexity. This paper introduces IndicEval, a scalable benchmarking platform designed to assess LLM performance using authentic high-stakes examination questions from UPSC, JEE, and NEET across STEM and humanities domains in both English and Hindi. Unlike synthetic benchmarks, IndicEval grounds evaluation in real examination standards, enabling realistic measurement of reasoning, domain knowledge, and bilingual adaptability. The framework automates assessment using Zero-Shot, Few-Shot, and Chain-of-Thought (CoT) prompting strategies and supports modular integration of new models and languages. Experiments conducted on Gemini 2.0 Flash, GPT-4, Claude, and LLaMA 3-70B reveal three major findings. First, CoT prompting consistently improves reasoning accuracy, with substantial gains across subjects and languages. Second, significant cross-model performance disparities persist, particularly in high-complexity examinations. Third, multilingual degradation remains a critical challenge, with marked accuracy drops in Hindi compared to English, especially under Zero-Shot conditions. These results highlight persistent gaps in bilingual reasoning and domain transfer. Overall, IndicEval provides a practice-oriented, extensible foundation for rigorous, equitable evaluation of LLMs in multilingual educational settings and offers actionable insights for improving reasoning robustness and language adaptability.
[55] Training Models on Dialects of Translationese Shows How Lexical Diversity and Source-Target Syntactic Similarity Shape Learning
Jenny Kunz
Main category: cs.CL
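TL;DR: This paper studies how machine-translated data affects the training of small English language models. It finds that the typological features of the source language and the properties of the corpus significantly affect performance on grammaticality judgment and language modeling: lexical diversity mainly affects perplexity, while typological similarity to English matters more for grammatical performance.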
Details
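Motivation: Machine-translated data is widely used in multilingual NLP, but its inherent "translationese" may affect what models learn; prior work lacks an in-depth analysis of how different source languages systematically affect small-model behavior.
Method: Trains small English language models on English text translated from 24 source languages that vary in typology and resource level, systematically evaluates them on language modeling (perplexity) and grammatical acceptability judgment, and analyzes the effect of variables such as typological distance from the source language and lexical diversity of the corpus.
Result: The source language clearly affects model behavior: general perplexity is driven mainly by the lexical diversity of the translated corpus, while grammatical performance correlates strongly with the source language's typological similarity to English, given enough data.
Conclusion: The source-language properties of translated corpora are a key factor in a model's linguistic competence and cannot be treated as mere noise; future multilingual modeling should account for translationese effects more carefully, especially source-language typology.
Abstract: Machine-translated data is widely used in multilingual NLP, particularly when native text is scarce. However, translated text differs systematically from native text. This phenomenon is known as translationese, and it reflects both traces of the source language and characteristic properties of translation itself. In this paper, we study how training on machine-translated data affects small English language models, focusing on how translationese from different source languages shapes linguistic acceptability judgments and language modelling for different domains. We train models on English text translated from 24 typologically and resource-diverse source languages, enabling a systematic analysis of how source language and corpus properties influence what models learn. Our results show that the source language has a clear impact on model behavior: general perplexity is more driven by the lexical diversity of the translated corpus, while grammatical performance is strongly correlated to typological similarity to English, given enough data.
[56] Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling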
Jeffrey T. H. Wong,Zixi Zhang,Junyi Liu,Yiren Zhao
Main category: cs.CL
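TL;DR: This paper proposes Team-of-Thoughts, a multi-agent system architecture built on collaboration among heterogeneous models. Using orchestrator calibration and tool-agent self-assessment, it dynamically selects the best-suited models and substantially outperforms homogeneous baselines on several reasoning and code generation benchmarks.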
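Illustrative sketch (not from the paper): this entry describes an orchestrator that activates the most suitable tool agent based on self-assessed proficiency profiles. The abstract does not specify how profiles are represented or compared, so the code below assumes the simplest possible form, a per-domain score in [0, 1] with argmax selection. The agent names, domains, and scores are hypothetical.

```python
from typing import Dict

# Hypothetical self-assessed proficiency profiles: agent -> domain -> score in [0, 1].
PROFILES: Dict[str, Dict[str, float]] = {
    "agent_a": {"math": 0.9, "code": 0.4},
    "agent_b": {"math": 0.5, "code": 0.8},
}


def select_agent(profiles: Dict[str, Dict[str, float]], domain: str) -> str:
    """Return the agent with the highest self-reported score for `domain`.
    Agents with no entry for the domain are treated as scoring 0."""
    return max(profiles, key=lambda name: profiles[name].get(domain, 0.0))


math_agent = select_agent(PROFILES, "math")
code_agent = select_agent(PROFILES, "code")
```

A real orchestrator would also need to infer the task domain and could activate several agents at once; this sketch covers only the lookup step.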
Details
Motivation: Existing multi-agent systems (MAS) rely on static, homogeneous model configurations and struggle to exploit the distinct strengths of differently post-trained models.
Method: Proposes the Team-of-Thoughts architecture with two core mechanisms: (1) an orchestrator calibration scheme that identifies models with strong coordination ability, and (2) a self-assessment protocol in which tool agents profile their own domain expertise. At inference time, the orchestrator dynamically activates the most suitable tool agents based on these proficiency profiles.
Result: Performance is consistently ahead on five reasoning and code generation benchmarks; the approach reaches 96.67% on AIME24 and 72.53% on LiveCodeBench, well above homogeneous role-play baselines (80% and 65.93%).
Conclusion: Collaboration among heterogeneous agents, supported by fine-grained orchestration and self-assessment, can effectively improve multi-agent system performance and offers a new paradigm for MAS design.
Abstract: Existing Multi-Agent Systems (MAS) typically rely on static, homogeneous model configurations, limiting their ability to exploit the distinct strengths of differently post-trained models. To address this, we introduce Team-of-Thoughts, a novel MAS architecture that leverages the complementary capabilities of heterogeneous agents via an orchestrator-tool paradigm. Our framework introduces two key mechanisms to optimize performance: (1) an orchestrator calibration scheme that identifies models with superior coordination capabilities, and (2) a self-assessment protocol where tool agents profile their own domain expertise to account for variations in post-training skills. During inference, the orchestrator dynamically activates the most suitable tool agents based on these proficiency profiles. Experiments on five reasoning and code generation benchmarks show that Team-of-Thoughts delivers consistently superior task performance. Notably, on AIME24 and LiveCodeBench, our approach achieves accuracies of 96.67% and 72.53%, respectively, substantially outperforming homogeneous role-play baselines, which score 80% and 65.93%.
[57] Learning to Learn from Language Feedback with Social Meta-Learning
Jonathan Cook,Diego Antognini,Martin Klissarov,Claudiu Musat,Edward Grefenstette
Main category: cs.CL
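TL;DR: This paper proposes social meta-learning (SML), which fine-tunes large language models (LLMs) to actively solicit and use language feedback in conversation so they can solve tasks they cannot complete in a single turn; the resulting models generalize across domains and handle ambiguous tasks better.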
Details
Motivation: LLMs struggle to learn from corrective feedback in conversation and lack a mechanism for actively seeking it, which makes their dialogues rigid, one-sided, and lacking the adaptability of human conversation.
Method: Inspired by human social meta-learning, formulates SML as a fine-tuning method that trains LLMs to solicit and use language feedback in simulated pedagogical dialogues, converting static tasks into interactive social learning problems.
Result: SML lets models solve, over multiple turns, problems they cannot solve in a single turn. The capability generalizes across domains (for example, training on math improves learning from feedback on code). On underspecified tasks, the models make fewer premature answer attempts and are more likely to ask for the key information they need.
Conclusion: SML is a scalable approach that effectively improves an AI system's ability to learn from language feedback and strengthens its conversational adaptability and robustness.
Abstract: Large language models (LLMs) often struggle to learn from corrective feedback within a conversational context. They are rarely proactive in soliciting this feedback, even when faced with ambiguity, which can make their dialogues feel static, one-sided, and lacking the adaptive qualities of human conversation. To address these limitations, we draw inspiration from social meta-learning (SML) in humans - the process of learning how to learn from others. We formulate SML as a finetuning methodology, training LLMs to solicit and learn from language feedback in simulated pedagogical dialogues, where static tasks are converted into interactive social learning problems. SML effectively teaches models to use conversation to solve problems they are unable to solve in a single turn. This capability generalises across domains; SML on math problems produces models that better use feedback to solve coding problems and vice versa. Furthermore, despite being trained only on fully-specified problems, these models are better able to solve underspecified tasks where critical information is revealed over multiple turns. When faced with this ambiguity, SML-trained models make fewer premature answer attempts and are more likely to ask for the information they need. This work presents a scalable approach to developing AI systems that effectively learn from language feedback.
[58] From Growing to Looping: A Unified View of Iterative Computation in LLMs
Ferdinand Kapl,Emmanouil Angelis,Kaitlin Maile,Johannes von Oswald,Stefan Bauer
Main category: cs.CL
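TL;DR: This paper reveals the underlying connection between looping and depth growing, two architectural approaches to improving reasoning, showing that they share a mechanism of iterative computation and can be combined for further gains.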
Details
Motivation: Looping and depth growing have both been found to strengthen model reasoning, but how they relate is unclear; this paper aims to explain what they have in common at the mechanistic level.
Method: Analyzes depth-wise behavior (such as increased reliance on late layers and recurring patterns aligned with the looped or grown block) to link the two techniques through iterative computation, then tests experimentally how composable and adaptable they are on reasoning tasks (for example, applying inference-time looping to a depth-grown model, or varying the training data and fine-tuning strategy).
Result: Looped and depth-grown models show convergent depth-wise signatures. The two techniques can be combined, improving accuracy on some reasoning tasks by up to 2x. Both adapt better than the baseline when given more in-context examples or more supervised fine-tuning data. Higher-quality, math-heavy cooldown mixtures further amplify the gains of depth-grown models, which can be boosted again by adapting a middle block to loop.
Conclusion: Looping and depth growing are complementary, practical ways to induce iterative computation and can be used to improve and scale model reasoning.
Abstract: Looping, reusing a block of layers across depth, and depth growing, training shallow-to-deep models by duplicating middle layers, have both been linked to stronger reasoning, but their relationship remains unclear. We provide a mechanistic unification: looped and depth-grown models exhibit convergent depth-wise signatures, including increased reliance on late layers and recurring patterns aligned with the looped or grown block. These shared signatures support the view that their gains stem from a common form of iterative computation. Building on this connection, we show that the two techniques are adaptable and composable: applying inference-time looping to the middle blocks of a depth-grown model improves accuracy on some reasoning primitives by up to $2\times$, despite the model never being trained to loop. Both approaches also adapt better than the baseline when given more in-context examples or additional supervised fine-tuning data. Additionally, depth-grown models achieve the largest reasoning gains when using higher-quality, math-heavy cooldown mixtures, which can be further boosted by adapting a middle block to loop. Overall, our results position depth growth and looping as complementary, practical methods for inducing and scaling iterative computation to improve reasoning.
[59] Optimizing Soft Prompt Tuning via Structural Evolution
Zhenzhen Huang,Chaoning Zhang,Haoyu Bian,Songbo Zhang,Chi-lok Andy Tai,Jiaquan Zhang,Caiyan Qin,Jingjing Qu,Yalan Ye,Yang Yang,Heng Tao Shen
Main category: cs.CL
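TL;DR: This paper proposes a soft prompt tuning optimization method based on topological morphological evolution. It uses persistent homology to quantify the structural representation of soft prompts in continuous parameter space and how that structure evolves during training, and it designs a Topological Soft Prompt Loss (TSLoss) to improve performance and interpretability.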
Details
Motivation: Soft prompt tuning lacks explicit semantics and traceable training behavior, which limits its interpretability.
Method: Uses persistent homology from topological data analysis (TDA) to quantify the structural representation of soft prompts and its evolution during training, and builds the Topological Soft Prompt Loss (TSLoss) on that basis for optimization.
Result: Experiments show that TSLoss accelerates convergence, improves tuning performance, and provides an interpretable way to understand and optimize soft prompt tuning from structural and topological perspectives.
Conclusion: Topologically stable, compact soft prompts achieve better downstream performance; TSLoss gives soft prompt tuning a structured, interpretable optimization path.
Abstract: Soft prompt tuning leverages continuous embeddings to capture task-specific information in large pre-trained language models (LLMs), achieving competitive performance in few-shot settings. However, soft prompts rely on high-dimensional, implicit representations and lack explicit semantics and traceable training behaviors, which limits their interpretability. To address this limitation, we propose a soft prompt tuning optimization method based on topological morphological evolution. Specifically, we employ persistent homology from topological data analysis (TDA) to quantify the structural representations of soft prompts in continuous parameter space and their training process evolution. Quantitative analysis shows that topologically stable and compact soft prompts achieve better downstream performance. Based on this empirical observation, we construct a loss function for optimizing soft prompt tuning, termed Topological Soft Prompt Loss (TSLoss). TSLoss guides the model to learn structurally stable adaptations by quantifying inter-parameter connectivity and redundancy. Extensive experiments show that training with TSLoss accelerates convergence and improves tuning performance, providing an interpretable method to understand and optimize soft prompt tuning from structural and topological perspectives.
[60] Supercharging Agenda Setting Research: The ParlaCAP Dataset of 28 European Parliaments and a Scalable Multilingual LLM-Based Classification
Taja Kuzman Pungeršek,Peter Rupnik,Daniela Širinić,Nikola Ljubešić
Main category: cs.CL
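TL;DR: This paper presents ParlaCAP, a large-scale multilingual dataset for analyzing agenda setting in European parliaments, and designs a low-cost, domain-adapted policy topic classification method based on a teacher-student framework: a large language model generates annotations used to train a smaller model, which beats existing approaches on accuracy and scalability.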
Details
Motivation: Existing CAP classifiers, trained on manually annotated but out-of-domain data, perform poorly on parliamentary text, and there is no large-scale, multilingual, metadata-rich dataset of European parliamentary speech annotated with policy topics, which limits comparative political research.
Method: Uses a teacher-student framework. A high-performing large language model (LLM) acts as the teacher and annotates the ParlaMint corpus (over 8 million European parliamentary speeches) under the CAP schema; a multilingual encoder model acts as the student and is fine-tuned on the LLM annotations to produce a lightweight, efficient classifier.
Result: LLM-human agreement is comparable to inter-annotator agreement among humans. The new classifier outperforms existing CAP classifiers trained on manually annotated out-of-domain data. The released ParlaCAP dataset includes speaker and party metadata and ParlaSent sentiment predictions.
Conclusion: The ParlaCAP dataset and the low-cost, domain-adapted annotation approach provide reliable infrastructure and a new methodological path for cross-country, cross-lingual comparative research on parliamentary attention, policy representation, and political discourse.
Abstract: This paper introduces ParlaCAP, a large-scale dataset for analyzing parliamentary agenda setting across Europe, and proposes a cost-effective method for building domain-specific policy topic classifiers. Applying the Comparative Agendas Project (CAP) schema to the multilingual ParlaMint corpus of over 8 million speeches from 28 parliaments of European countries and autonomous regions, we follow a teacher-student framework in which a high-performing large language model (LLM) annotates in-domain training data and a multilingual encoder model is fine-tuned on these annotations for scalable data annotation. We show that this approach produces a classifier tailored to the target domain. Agreement between the LLM and human annotators is comparable to inter-annotator agreement among humans, and the resulting model outperforms existing CAP classifiers trained on manually-annotated but out-of-domain data. In addition to the CAP annotations, the ParlaCAP dataset offers rich speaker and party metadata, as well as sentiment predictions coming from the ParlaSent multilingual transformer model, enabling comparative research on political attention and representation across countries. We illustrate the analytical potential of the dataset with three use cases, examining the distribution of parliamentary attention across policy topics, sentiment patterns in parliamentary speech, and gender differences in policy attention.
[61] Utility-Preserving De-Identification for Math Tutoring: Investigating Numeric Ambiguity in the MathEd-PII Benchmark Dataset
Zhuqian Zhou,Kirk Vanacore,Bakhtawar Ahtisham,Jinsook Lee,Doug Pietrzak,Daryl Hedley,Jorge Dias,Chris Shaw,Ruth Schäfer,René F. Kizilcec
Main category: cs.CL
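TL;DR: This paper introduces the MathEd-PII dataset and a new approach to PII detection in math tutoring dialogues, addressing the problem of numeric expressions being mistaken for PII and instructional content being over-redacted. A math-aware prompting strategy substantially outperforms the baseline on F1 (0.821 vs. 0.379), showing that domain-aware modeling is needed to preserve the educational utility of the data.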
Details
Motivation: Numeric expressions in math tutoring dialogues resemble structured identifiers (such as dates or IDs) and are easily misidentified by generic PII detection systems, so core instructional content is over-redacted and the dataset loses much of its value for education research.
Method: Builds MathEd-PII, the first benchmark dataset for PII detection in math tutoring dialogues (1,000 sessions, about 115,000 messages), using a human-in-the-loop LLM workflow for annotation and for generating privacy-preserving surrogates. Proposes a density-based segmentation method to locate falsely redacted regions and compares four PII detection strategies: a Presidio baseline, basic LLM prompting, math-aware prompting, and segment-aware prompting.
Result: False redactions are concentrated in math-dense regions. Math-aware prompting reaches an F1 of 0.821, far above the Presidio baseline's 0.379, and reduces numeric false positives, confirming that domain knowledge is key to both detection accuracy and data utility.
Conclusion: Generic PII detection tools are not suited to math education dialogues; utility-preserving de-identification must incorporate an understanding of mathematical content. The paper provides a new benchmark and empirical evidence that push educational data privacy toward domain-adaptive methods.
Abstract: Large-scale sharing of dialogue-based data is instrumental for advancing the science of teaching and learning, yet rigorous de-identification remains a major barrier. In mathematics tutoring transcripts, numeric expressions frequently resemble structured identifiers (e.g., dates or IDs), leading generic Personally Identifiable Information (PII) detection systems to over-redact core instructional content and reduce dataset utility. This work asks how PII can be detected in math tutoring transcripts while preserving their educational utility. To address this challenge, we investigate the "numeric ambiguity" problem and introduce MathEd-PII, the first benchmark dataset for PII detection in math tutoring dialogues, created through a human-in-the-loop LLM workflow that audits upstream redactions and generates privacy-preserving surrogates. The dataset contains 1,000 tutoring sessions (115,620 messages; 769,628 tokens) with validated PII annotations. Using a density-based segmentation method, we show that false PII redactions are disproportionately concentrated in math-dense regions, confirming numeric ambiguity as a key failure mode. We then compare four detection strategies: a Presidio baseline and LLM-based approaches with basic, math-aware, and segment-aware prompting. Math-aware prompting substantially improves performance over the baseline (F1: 0.821 vs. 0.379) while reducing numeric false positives, demonstrating that de-identification must incorporate domain context to preserve analytic utility. This work provides both a new benchmark and evidence that utility-preserving de-identification for tutoring data requires domain-aware modeling.
[62] CitiLink-Summ: Summarization of Discussion Subjects in European Portuguese Municipal Meeting Minutes
Miguel Marques,Ana Luísa Fernandes,Ana Filipa Pacheco,Rute Rebouças,Inês Cantante,José Isidro,Luís Filipe Cunha,Alípio Jorge,Nuno Guimarães,Sérgio Nunes,António Leal,Purificação Silvano,Ricardo Campos
Main category: cs.CL
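TL;DR: This paper presents CitiLink-Summ, the first benchmark corpus for summarizing discussion subjects in European Portuguese municipal meeting minutes, and uses it to benchmark several generative models and large language models.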
Details
Motivation: Municipal meeting minutes are long and hard to follow, so citizens struggle to find the key information. Research lacks high-quality annotated summary datasets for this domain in low-resource languages such as European Portuguese, which holds back the development and evaluation of automatic summarization models.
Method: Builds the CitiLink-Summ corpus of 100 meeting minutes and 2,322 hand-written subject summaries; runs summarization experiments with state-of-the-art generative models such as BART and PRIMERA as well as large language models; evaluates along several dimensions with ROUGE, BLEU, METEOR, and BERTScore.
Result: Establishes the first benchmark results for summarizing discussion subjects in European Portuguese municipal minutes and documents how a range of models perform on these complex administrative texts.
Conclusion: CitiLink-Summ fills a data gap in municipal text summarization for low-resource languages and gives later NLP research an important resource and a reproducible evaluation benchmark.
Abstract: Municipal meeting minutes are formal records documenting the discussions and decisions of local government, yet their content is often lengthy, dense, and difficult for citizens to navigate. Automatic summarization can help address this challenge by producing concise summaries for each discussion subject. Despite its potential, research on summarizing discussion subjects in municipal meeting minutes remains largely unexplored, especially in low-resource languages, where the inherent complexity of these documents adds further challenges. A major bottleneck is the scarcity of datasets containing high-quality, manually crafted summaries, which limits the development and evaluation of effective summarization models for this domain. In this paper, we present CitiLink-Summ, a new corpus of European Portuguese municipal meeting minutes, comprising 100 documents and 2,322 hand-written summaries, each corresponding to a distinct discussion subject. Leveraging this dataset, we establish baseline results for automatic summarization in this domain, employing state-of-the-art generative models (e.g., BART, PRIMERA) as well as large language models (LLMs), evaluated with both lexical and semantic metrics such as ROUGE, BLEU, METEOR, and BERTScore. CitiLink-Summ provides the first benchmark for municipal-domain summarization in European Portuguese, offering a valuable resource for advancing NLP research on complex administrative texts.
[63] ColBERT-Zero: To Pre-train Or Not To Pre-train ColBERT models
Antoine Chaffin,Luca Arnaboldi,Amélie Chatelain,Florent Krzakala
Main category: cs.CL
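TL;DR: This paper studies the pre-training of multi-vector models and finds that large-scale multi-vector pre-training substantially improves performance. The proposed ColBERT-Zero, trained only on public data, outperforms state-of-the-art models that depend on stronger closed data. The paper also shows that adding a supervised fine-tuning step beforehand greatly reduces the need for costly unsupervised pre-training, and it stresses the importance of aligning the pre-training and fine-tuning setups.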
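Background sketch (not from the paper): the multi-vector models in this entry follow the ColBERT design, in which a query and a document are each represented by one embedding per token and scored by late interaction (MaxSim): every query token takes its best-matching document token, and those maxima are summed. The code below shows only that scoring rule on toy data; the two-dimensional embeddings are hypothetical, and real systems use learned, normalized vectors.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Late-interaction score: for each query token, take the maximum dot
    product over all document tokens, then sum over query tokens."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy token embeddings (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # contains an exact match for each query token
doc_b = [[0.5, 0.5]]                          # one averaged vector only

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
```

The comparison between `doc_a` and `doc_b` illustrates why keeping per-token vectors can separate documents that a single pooled vector would blur together.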
Details
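Motivation: The best current multi-vector models are obtained by a small knowledge distillation (KD) step on top of strong single-vector models, which overlooks the potential of large-scale pre-training for the multi-vector models themselves. This paper explores whether pre-training multi-vector models directly is effective and feasible.
Method: Proposes and implements large-scale end-to-end pre-training of a multi-vector model (ColBERT-Zero) using only public data; compares three regimes (KD only, supervised fine-tuning plus KD, and full pre-training); systematically evaluates the effect of aligning the pre-training and fine-tuning setups.
Result: ColBERT-Zero, using only public data, outperforms GTE-ModernColBERT and its base model GTE-ModernBERT, which uses stronger closed data. Adding a supervised step before KD comes much closer to full pre-training performance while skipping the most expensive unsupervised phase. Aligning the pre-training and fine-tuning setups is shown to be crucial when repurposing existing models.
Conclusion: Large-scale pre-training of multi-vector models is both valuable and feasible; ColBERT-Zero sets a new state of the art for models of its size; the lighter supervised-plus-KD recipe is an effective substitute for full pre-training; and setup alignment is an important practical guideline for model transfer and reuse.
Abstract: Current state-of-the-art multi-vector models are obtained through a small Knowledge Distillation (KD) training step on top of strong single-vector models, leveraging the large-scale pre-training of these models. In this paper, we study the pre-training of multi-vector models and show that large-scale multi-vector pre-training yields much stronger multi-vector models. Notably, a fully ColBERT-pre-trained model, ColBERT-Zero, trained only on public data, outperforms GTE-ModernColBERT as well as its base model, GTE-ModernBERT, which leverages closed and much stronger data, setting a new state of the art for models of this size. We also find that, although performing only a small KD step is not enough to achieve results close to full pre-training, adding a supervised step beforehand makes it possible to achieve much closer performance while skipping the most costly unsupervised phase. Finally, we find that aligning the fine-tuning and pre-training setups is crucial when repurposing existing models. To enable exploration of our results, we release various checkpoints as well as code used to train them.
[64] Who can we trust? LLM-as-a-jury for Comparative Assessment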
Mengjie Qian,Guangzhi Sun,Mark J. F. Gales,Kate M. Knill
Main category: cs.CL
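TL;DR: This paper proposes BT-sigma, a model that introduces a discriminator parameter to jointly infer item rankings and LLM judge reliability, improving the aggregation of pairwise comparisons for natural language generation evaluation without supervision.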
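Illustrative sketch (not from the paper): this entry extends the Bradley-Terry model with a per-judge discriminator parameter. The abstract does not give the exact parameterization, so the code below assumes one common form, in which the discriminator scales the skill difference inside the logistic function: P(i preferred over j by judge k) = sigmoid(disc_k * (s_i - s_j)). With disc_k = 1 this reduces to standard Bradley-Terry on log-skill scores. All names and numbers are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def win_prob(s_i: float, s_j: float, disc: float) -> float:
    """Assumed judge-aware Bradley-Terry probability that item i is preferred
    over item j, for a judge with discriminator `disc`."""
    return 1.0 / (1.0 + math.exp(-disc * (s_i - s_j)))


def log_likelihood(
    comparisons: List[Tuple[str, int, int]],  # (judge, winner index, loser index)
    skills: List[float],
    discs: Dict[str, float],
) -> float:
    """Log-likelihood of observed pairwise outcomes; maximizing it jointly over
    `skills` and `discs` would estimate rankings and judge reliability."""
    return sum(
        math.log(win_prob(skills[w], skills[l], discs[judge]))
        for judge, w, l in comparisons
    )


skills = [1.0, 0.0]  # hypothetical item scores
p_reliable = win_prob(skills[0], skills[1], disc=2.0)  # sharper judge
p_noisy = win_prob(skills[0], skills[1], disc=0.1)     # near-random judge
```

Under this assumed form, a judge with a small discriminator produces probabilities close to 0.5 regardless of item quality, so its votes carry little weight in the fitted ranking.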
Details
Motivation: Existing approaches that use LLMs as automatic evaluators assume all judges are equally reliable. In practice, LLM judges vary widely in performance across tasks and aspects, their judgment probabilities can be biased and inconsistent, and human annotations for calibration are often unavailable.
Method: Proposes BT-sigma, a judge-aware extension of the Bradley-Terry model that introduces a discriminator parameter for each LLM judge and jointly learns item rankings and judge reliability from pairwise comparison data alone.
Result: On NLG evaluation benchmark datasets, BT-sigma consistently outperforms averaging-based aggregation. The learned discriminator parameters correlate strongly with independent measures such as the cycle consistency of LLM judgments. Further analysis shows the model can be interpreted as an unsupervised calibration mechanism.
Conclusion: BT-sigma models differences in LLM judge reliability effectively and improves the accuracy and robustness of pairwise-comparison aggregation without human supervision, giving LLM-as-a-judge applications a more reliable theoretical and practical footing.
Abstract: Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment, often using pairwise comparative judgements. Existing approaches typically rely on single judges or aggregate multiple judges assuming equal reliability. In practice, LLM judges vary substantially in performance across tasks and aspects, and their judgment probabilities may be biased and inconsistent. Furthermore, human-labelled supervision for judge calibration may be unavailable. We first empirically demonstrate that inconsistencies in LLM comparison probabilities exist and show that they limit the effectiveness of direct probability-based ranking. To address this, we study the LLM-as-a-jury setting and propose BT-sigma, a judge-aware extension of the Bradley-Terry model that introduces a discriminator parameter for each judge to jointly infer item rankings and judge reliability from pairwise comparisons alone. Experiments on benchmark NLG evaluation datasets show that BT-sigma consistently outperforms averaging-based aggregation methods, and that the learned discriminator strongly correlates with independent measures of the cycle consistency of LLM judgments. Further analysis reveals that BT-sigma can be interpreted as an unsupervised calibration mechanism that improves aggregation by modelling judge reliability.
[65] AREG: Adversarial Resource Extraction Game for Evaluating Persuasion and Resistance in Large Language Models
Adib Sakhawat,Fardeen Sadab
Main category: cs.CL
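TL;DR: This paper proposes the Adversarial Resource Extraction Game (AREG), a benchmark that uses multi-turn zero-sum negotiation to evaluate the social intelligence of large language models in both persuasion (offense) and resistance (defense), and finds that the two are only weakly correlated and that models show a systematic defensive advantage.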
Details
Motivation: Existing evaluations of LLM social intelligence are mostly limited to static text generation and cannot capture persuasion and resistance in dynamic, adversarial interaction; a new benchmark is needed that evaluates both capabilities jointly.
Method: Designs AREG, a multi-turn, zero-sum adversarial negotiation game over financial resources, and runs a round-robin tournament across frontier models; combines quantitative score analysis with fine-grained analysis of linguistic behavior (strategies such as commitment-seeking and verification-seeking).
Result: Persuasion and resistance are weakly correlated ($\rho = 0.33$) and empirically dissociated. All models show a systematic defensive advantage (resistance scores exceed persuasion scores). Incremental commitment-seeking strategies raise extraction success, while verification-seeking responses are more common in successful defenses than explicit refusal.
Conclusion: Social influence in LLMs is not a single capability, and evaluation frameworks that focus only on persuasion overlook asymmetric behavioral vulnerabilities; dynamic, interactive benchmarks covering both offense and defense give a fuller picture of social intelligence.
Abstract: Evaluating the social intelligence of Large Language Models (LLMs) increasingly requires moving beyond static text generation toward dynamic, adversarial interaction. We introduce the Adversarial Resource Extraction Game (AREG), a benchmark that operationalizes persuasion and resistance as a multi-turn, zero-sum negotiation over financial resources. Using a round-robin tournament across frontier models, AREG enables joint evaluation of offensive (persuasion) and defensive (resistance) capabilities within a single interactional framework. Our analysis provides evidence that these capabilities are weakly correlated ($\rho = 0.33$) and empirically dissociated: strong persuasive performance does not reliably predict strong resistance, and vice versa. Across all evaluated models, resistance scores exceed persuasion scores, indicating a systematic defensive advantage in adversarial dialogue settings. Further linguistic analysis suggests that interaction structure plays a central role in these outcomes. Incremental commitment-seeking strategies are associated with higher extraction success, while verification-seeking responses are more prevalent in successful defenses than explicit refusal. Together, these findings indicate that social influence in LLMs is not a monolithic capability and that evaluation frameworks focusing on persuasion alone may overlook asymmetric behavioral vulnerabilities.
[66] Quecto-V1: Empirical Analysis of 8-bit Quantized Small Language Models for On-Device Legal Retrieval
Subrit Dikshit
Main category: cs.CL
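TL;DR: This paper presents Quecto-V1, a small language model (124M parameters) tailored to the Indian legal domain. It is based on the GPT-2 architecture, trained from scratch on a corpus of Indian statutes, and compressed to under 150 MB with 8-bit quantization so that it runs offline on CPUs. It outperforms general-purpose small models on statutory retrieval tasks while balancing performance, privacy, and accessibility.
Details
Motivation: Address the resource divide that large language models create in legal intelligence applications: existing state-of-the-art systems rely on large parameter counts (7B+) and cloud inference, which puts them out of reach of resource-constrained practitioners and creates data sovereignty risks.
Method: Build Quecto-V1, a lightweight domain-specific model based on GPT-2 (124M parameters), trained from scratch solely on Indian statutes (IPC, CrPC, and the Constitution); emphasize the lexical density of legal text; apply post-training 8-bit quantization (GGUF format) for compression and offline deployment.
Result: Quecto-V1 retrieves statutory definitions and penal provisions with high fidelity and outperforms general-purpose small language models; 8-bit quantization reduces model size by 74% with less than 3.5% loss in retrieval accuracy; the model runs fully offline on consumer-grade CPUs.
Conclusion: In specialized, high-stakes domains such as law, domain-specific training combined with aggressive quantization is a viable, privacy-first, and widely accessible alternative that removes the dependence on very large cloud models.
Abstract: The rapid proliferation of Large Language Models (LLMs) has revolutionized Natural Language Processing (NLP) but has simultaneously created a "resource divide." State-of-the-art legal intelligence systems typically rely on massive parameter counts (7B+) and cloud-based inference, rendering them inaccessible to practitioners in resource-constrained environments and posing significant data sovereignty risks. This paper introduces Quecto-V1, a domain-specific Small Language Model (SLM) engineered to democratize access to Indian legal intelligence. Built upon a custom configuration of the GPT-2 architecture (124 million parameters), Quecto-V1 was trained from scratch exclusively on a corpus of Indian statutes, including the Indian Penal Code (IPC), the Code of Criminal Procedure (CrPC), and the Constitution of India. Unlike generalist models, which prioritize broad world knowledge, our approach maximizes "lexical density" within the legal domain. Furthermore, we address the deployment bottleneck by applying post-training 8-bit quantization (GGUF format), compressing the model to a memory footprint of under 150 MB. Our empirical analysis demonstrates that Quecto-V1 achieves high fidelity in retrieving statutory definitions and penal provisions, outperforming general-purpose SLMs in domain-specific exact match tasks while running entirely offline on consumer-grade CPUs. We further present an ablation study showing that 8-bit quantization yields a 74% reduction in model size with less than 3.5% degradation in retrieval accuracy compared to full-precision baselines.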
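The size figure for 8-bit quantization can be illustrated with a minimal sketch of block-wise absmax quantization. This is not the paper's code or the GGUF implementation: the function names are hypothetical, and the block layout (32 weights sharing one 2-byte scale) and the 32-bit float baseline are assumptions chosen to resemble common 8-bit block formats.

```python
BLOCK_SIZE = 32    # assumed number of weights per block (not from the paper)
SCALE_BYTES = 2    # assumed storage for one 16-bit scale per block


def quantize_block(block):
    """Map a block of floats to int8 codes plus one shared scale (absmax)."""
    amax = max(abs(x) for x in block)
    scale = amax / 127.0 if amax > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in block]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from int8 codes and the block scale."""
    return [scale * c for c in codes]


def quantized_bytes(n_weights):
    """Bytes needed for n_weights under the assumed block layout."""
    n_blocks = -(-n_weights // BLOCK_SIZE)  # ceiling division
    return n_blocks * (BLOCK_SIZE + SCALE_BYTES)


def size_reduction(n_weights, full_bytes_per_weight=4):
    """Fractional size reduction versus a float baseline (default: 32-bit)."""
    return 1.0 - quantized_bytes(n_weights) / (n_weights * full_bytes_per_weight)
```

Under these assumptions each block of 32 weights costs 34 bytes instead of 128, so `size_reduction` gives roughly 0.73, which is in the same range as the 74% reduction quoted in the abstract. The per-weight rounding error is bounded by half the block scale.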
These findings suggest that for specialized, high-stakes domains like law, domain-specific training coupled with aggressive quantization offers a viable, privacy-preserving alternative to monolithic cloud models.
[67] Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment
Yuyan Bu,Xiaohao Liu,ZhaoXing Ren,Yaodong Yang,Juntao Dai
Main category: cs.CL
TL;DR: This paper proposes a resource-efficient method for multilingual safety alignment, the Multi-Lingual Consistency (MLC) loss. It needs no additional response-level supervision in low-resource languages and improves directional consistency at the multilingual semantic level using only multilingual prompt variants.
Details
Motivation: Existing methods for extending alignment to multiple languages usually require large amounts of high-quality supervision in the target language or pairwise alignment with high-resource languages, which limits scalability.
Method: Propose a plug-and-play Multi-Lingual Consistency (MLC) loss that increases the collinearity of multilingual representation vectors, achieving directional consistency at the multilingual semantic level in a single update; it relies only on multilingual prompt variants and needs no response-level supervision in low-resource languages.
Result: The method is validated across different model architectures and alignment paradigms; it significantly improves multilingual safety, has little impact on general model capability, and shows better cross-lingual generalization.
Conclusion: MLC is a practical, resource-efficient approach to multilingual consistency alignment that suits low-supervision settings.
Abstract: The widespread deployment of large language models (LLMs) across linguistic communities necessitates reliable multilingual safety alignment. However, recent efforts to extend alignment to other languages often require substantial resources, either through large-scale, high-quality supervision in the target language or through pairwise alignment with high-resource languages, which limits scalability. In this work, we propose a resource-efficient method for improving multilingual safety alignment. We introduce a plug-and-play Multi-Lingual Consistency (MLC) loss that can be integrated into existing monolingual alignment pipelines. By improving collinearity between multilingual representation vectors, our method encourages directional consistency at the multilingual semantic level in a single update. This allows simultaneous alignment across multiple languages using only multilingual prompt variants without requiring additional response-level supervision in low-resource languages. We validate the proposed method across different model architectures and alignment paradigms, and demonstrate its effectiveness in enhancing multilingual safety with limited impact on general model utility. Further evaluation across languages and tasks indicates improved cross-lingual generalization, suggesting the proposed approach as a practical solution for multilingual consistency alignment under limited supervision.
[68] Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents
Wenxuan Ding,Nicholas Tomlin,Greg Durrett
Main category: cs.CL
TL;DR: This paper proposes the Calibrate-Then-Act (CTA) framework, which lets large language models (LLMs) explicitly weigh exploration cost against uncertainty, yielding better sequential decisions on tasks such as information retrieval and programming.
Details
Motivation: When solving complex problems that require interacting with an environment to gather information, LLMs face a cost-uncertainty tradeoff (for example, whether to write test code), yet existing methods do not model this tradeoff explicitly.
Method: Formalize tasks such as information retrieval and programming as sequential decision problems with latent state; design the CTA framework, which supplies the LLM with a prior so that it can reason explicitly about the cost-uncertainty tradeoff; and verify robustness under reinforcement learning training.
Result: On information-seeking QA and a simplified programming task, CTA significantly improves the quality of LLM decisions, and the gain persists after RL fine-tuning.
Conclusion: Explicitly guiding LLMs to reason about cost-benefit tradeoffs improves their exploration efficiency and decision quality in uncertain environments; CTA is a general and robust enhancement paradigm.
Abstract: LLMs are increasingly being used for complex problems which are not necessarily resolved in a single response, but require interacting with an environment to acquire information. In these scenarios, LLMs must reason about inherent cost-uncertainty tradeoffs in when to stop exploring and commit to an answer. For instance, on a programming task, an LLM should test a generated code snippet if it is uncertain about the correctness of that code; the cost of writing a test is nonzero, but typically lower than the cost of making a mistake. In this work, we show that we can induce LLMs to explicitly reason about balancing these cost-uncertainty tradeoffs, then perform more optimal environment exploration. We formalize multiple tasks, including information retrieval and coding, as sequential decision-making problems under uncertainty. Each problem has latent environment state that can be reasoned about via a prior which is passed to the LLM agent. We introduce a framework called Calibrate-Then-Act (CTA), where we feed the LLM this additional context to enable it to act more optimally. This improvement is preserved even under RL training of both the baseline and CTA. Our results on information-seeking QA and on a simplified coding task show that making cost-benefit tradeoffs explicit with CTA can help agents discover more optimal decision-making strategies.
[69] Reinforced Fast Weights with Next-Sequence Prediction
Hee Seung Hwang,Xindi Wu,Sanghyuk Chun,Olga Russakovsky
Main category: cs.CL
TL;DR: This paper proposes REFINE, a framework that improves fast weight models through reinforcement learning and a next-sequence prediction (NSP) objective, strengthening long-context modeling and clearly outperforming conventional next-token prediction (NTP) training.
Details
Motivation: Fast weight architectures have low memory overhead, but the NTP paradigm optimizes only single-token predictions, which makes it hard to model multi-token semantic coherence and long-range dependencies.
Method: REFINE selects key positions based on prediction entropy, generates multi-token rollouts, designs self-supervised sequence-level rewards, and trains with group relative policy optimization (GRPO); it supports mid-training, post-training, and test-time training.
Result: On LaCT-760M and DeltaNet-1.3B, REFINE consistently outperforms NTP supervised fine-tuning on needle-in-a-haystack retrieval, long-context question answering, and multiple LongBench tasks.
Conclusion: REFINE provides an effective and general framework for enhancing long-context modeling in fast weight architectures.
Abstract: Fast weight architectures offer a promising alternative to attention-based transformers for long-context modeling by maintaining constant memory overhead regardless of context length. However, their potential is limited by the next-token prediction (NTP) training paradigm. NTP optimizes single-token predictions and ignores semantic coherence across multiple tokens following a prefix. Consequently, fast weight models, which dynamically update their parameters to store contextual information, learn suboptimal representations that fail to capture long-range dependencies. We introduce REFINE (Reinforced Fast weIghts with Next sEquence prediction), a reinforcement learning framework that trains fast weight models under the next-sequence prediction (NSP) objective. REFINE selects informative token positions based on prediction entropy, generates multi-token rollouts, assigns self-supervised sequence-level rewards, and optimizes the model with group relative policy optimization (GRPO). REFINE is applicable throughout the training lifecycle of pre-trained language models: mid-training, post-training, and test-time training. Our experiments on LaCT-760M and DeltaNet-1.3B demonstrate that REFINE consistently outperforms supervised fine-tuning with NTP across needle-in-a-haystack retrieval, long-context question answering, and diverse tasks in LongBench. REFINE provides an effective and versatile framework for improving long-context modeling in fast weight architectures.
cs.CV [Back]
[70] Egocentric Bias in Vision-Language Models
Maijunxian Wang,Yijiang Li,Bingyang Wang,Tianwei Zhao,Ran Ji,Qingying Gao,Emmy Liu,Hokin Deng,Dezhi Luo
Main category: cs.CV
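TL;DR: This paper introduces the FlipSet benchmark for evaluating Level-2 visual perspective taking (L2 VPT) in vision-language models (VLMs), and finds that most models show severe egocentric bias and cannot combine social awareness with spatial operations.
Details
Motivation: Visual perspective taking is a foundational capacity of social cognition, but fine-grained diagnostic tools for the systematic deficits of current VLMs are lacking; a cognitively motivated benchmark that isolates spatial transformation from 3D scene complexity is needed.
Method: Build the FlipSet diagnostic benchmark, which requires models to simulate a 180-degree perspective rotation of 2D character strings (Level-2 VPT), and use control experiments to assess theory of mind, mental rotation, and their integration separately.
Result: Across 103 VLMs, the vast majority perform below chance, and about 75% of errors come from directly copying the camera viewpoint; control experiments show that models do well on each task in isolation but collapse when integration is required, revealing a "compositional deficit".
Conclusion: Current VLMs lack a mechanism for binding social awareness to spatial operations, which exposes fundamental limits in model-based spatial reasoning; FlipSet offers a cognitively grounded diagnostic platform for perspective taking in multimodal systems.
Abstract: Visual perspective taking--inferring how the world appears from another's viewpoint--is foundational to social cognition. We introduce FlipSet, a diagnostic benchmark for Level-2 visual perspective taking (L2 VPT) in vision-language models. The task requires simulating 180-degree rotations of 2D character strings from another agent's perspective, isolating spatial transformation from 3D scene complexity. Evaluating 103 VLMs reveals systematic egocentric bias: the vast majority perform below chance, with roughly three-quarters of errors reproducing the camera viewpoint. Control experiments expose a compositional deficit--models achieve high theory-of-mind accuracy and above-chance mental rotation in isolation, yet fail catastrophically when integration is required. This dissociation indicates that current VLMs lack the mechanisms needed to bind social awareness to spatial operations, suggesting fundamental limitations in model-based spatial reasoning. FlipSet provides a cognitively grounded testbed for diagnosing perspective-taking capabilities in multimodal systems.
[71] Detecting Deepfakes with Multivariate Soft Blending and CLIP-based Image-Text Alignment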
Jingwei Li,Jiaxin Tong,Pengfei Wu
Main category: cs.CV
TL;DR: This paper proposes MSBA-CLIP, a framework that combines CLIP's multimodal alignment with Multivariate and Soft Blending Augmentation (MSBA) and a Multivariate Forgery Intensity Estimation (MFIE) module, significantly improving deepfake detection accuracy and cross-domain generalization.
Details
Motivation: Existing deepfake detection methods lack accuracy and generalization under the distribution shifts caused by diverse forgery techniques.
Method: Propose the MSBA-CLIP framework: use CLIP to extract fine-grained forgery traces; design the MSBA strategy, which blends images from multiple forgery methods to improve generalization; introduce the MFIE module to model different forgery modes and intensities explicitly.
Result: In in-domain tests, Accuracy and AUC improve by 3.32% and 4.02%, respectively; in cross-domain tests (five datasets), average AUC improves by 3.27%; ablation experiments confirm that both modules are effective.
Conclusion: MSBA-CLIP markedly strengthens the robustness and generalization of deepfake detection models and is an important step toward more general detection methods, although its reliance on a large model brings higher computational cost.
Abstract: The proliferation of highly realistic facial forgeries necessitates robust detection methods. However, existing approaches often suffer from limited accuracy and poor generalization due to significant distribution shifts among samples generated by diverse forgery techniques. To address these challenges, we propose a novel Multivariate and Soft Blending Augmentation with CLIP-guided Forgery Intensity Estimation (MSBA-CLIP) framework. Our method leverages the multimodal alignment capabilities of CLIP to capture subtle forgery traces. We introduce a Multivariate and Soft Blending Augmentation (MSBA) strategy that synthesizes images by blending forgeries from multiple methods with random weights, forcing the model to learn generalizable patterns. Furthermore, a dedicated Multivariate Forgery Intensity Estimation (MFIE) module is designed to explicitly guide the model in learning features related to varied forgery modes and intensities. Extensive experiments demonstrate state-of-the-art performance. On in-domain tests, our method improves Accuracy and AUC by 3.32% and 4.02%, respectively, over the best baseline. In cross-domain evaluations across five datasets, it achieves an average AUC gain of 3.27%. Ablation studies confirm the efficacy of both proposed components. While the reliance on a large vision-language model entails higher computational cost, our work presents a significant step towards more generalizable and robust deepfake detection.
[72] A Comprehensive Survey on Deep Learning-Based LiDAR Super-Resolution for Autonomous Driving
June Moh Goo,Zichao Zeng,Jan Boehm
Main category: cs.CV
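TL;DR: This is the first survey of LiDAR super-resolution methods for autonomous driving. It systematically reviews four families of methods (CNN-based, model-based deep unrolling, implicit representation, and Transformer/Mamba-based), covers data representations, problem formulation, benchmarks, and evaluation, and identifies trends and challenges such as real-time inference and cross-sensor generalization.
Details
Motivation: High-resolution LiDAR is expensive, while low-resolution LiDAR yields sparse point clouds that lose critical detail; the lack of a systematic survey of LiDAR super-resolution methods hinders practical adoption and cross-sensor compatibility.
Method: Categorize existing LiDAR super-resolution methods (CNN-based, model-based deep unrolling, implicit representation, Transformer/Mamba-based) and give a unified treatment of the foundational concepts (data representations, problem definition, datasets, evaluation metrics).
Result: Establishes the first comprehensive framework for LiDAR super-resolution research, identifies trends including range image representation, extreme model compression, and resolution-flexible architectures, and highlights real-time inference and cross-sensor generalization.
Conclusion: LiDAR super-resolution still faces many open challenges on the way to practical use; further work is needed on efficient lightweight models, architectures with strong generalization, and evaluation protocols closer to real deployment.
Abstract: LiDAR sensors are often considered essential for autonomous driving, but high-resolution sensors remain expensive while affordable low-resolution sensors produce sparse point clouds that miss critical details. LiDAR super-resolution addresses this challenge by using deep learning to enhance sparse point clouds, bridging the gap between different sensor types and enabling cross-sensor compatibility in real-world deployments. This paper presents the first comprehensive survey of LiDAR super-resolution methods for autonomous driving. Despite the importance of practical deployment, no systematic review has been conducted until now. We organize existing approaches into four categories: CNN-based architectures, model-based deep unrolling, implicit representation methods, and Transformer and Mamba-based approaches. We establish fundamental concepts including data representations, problem formulation, benchmark datasets and evaluation metrics. Current trends include the adoption of range image representation for efficient processing, extreme model compression and the development of resolution-flexible architectures. Recent research prioritizes real-time inference and cross-sensor generalization for practical deployment. We conclude by identifying open challenges and future research directions for advancing LiDAR super-resolution technology.
[73] MaS-VQA: A Mask-and-Select Framework for Knowledge-Based Visual Question Answering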
Xianwei Mao,Kai Ye,Sheng Zhou,Nan Zhang,Haikuan Huang,Bin Li,Jiajun Bu
Main category: cs.CV
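TL;DR: This paper proposes MaS-VQA, a framework whose Mask-and-Select mechanism jointly filters image regions and knowledge passages, tightly coupling explicit knowledge filtering with implicit knowledge reasoning to improve the accuracy and robustness of KB-VQA.
Details
Motivation: Existing KB-VQA methods are limited by retrieved knowledge that is noisy, irrelevant, or misaligned with the visual content, while internal model knowledge is hard to control and interpret; naively fusing these sources leads to weak reasoning and low answer accuracy.
Method: MaS-VQA first retrieves candidate knowledge passages, then uses the Mask-and-Select mechanism to prune irrelevant image regions and weakly relevant knowledge fragments together, producing compact, high-signal multimodal knowledge; this knowledge then constrains and guides the activation of the large model's internal knowledge in semantic space, so that explicit and implicit knowledge are modeled jointly.
Result: On the Encyclopedic-VQA and InfoSeek datasets, MaS-VQA yields consistent gains across multiple MLLM backbones; ablation experiments show that the selection mechanism reduces noise and improves knowledge utilization.
Conclusion: By designing explicit knowledge filtering and implicit knowledge reasoning to work together, MaS-VQA markedly improves the accuracy and interpretability of multi-source knowledge fusion in KB-VQA and offers a new paradigm for knowledge-augmented visual question answering.
Abstract: Knowledge-based Visual Question Answering (KB-VQA) requires models to answer questions by integrating visual information with external knowledge. However, retrieved knowledge is often noisy, partially irrelevant, or misaligned with the visual content, while internal model knowledge is difficult to control and interpret. Naive aggregation of these sources limits reasoning effectiveness and reduces answer accuracy. To address this, we propose MaS-VQA, a selection-driven framework that tightly couples explicit knowledge filtering with implicit knowledge reasoning. MaS-VQA first retrieves candidate passages and applies a Mask-and-Select mechanism to jointly prune irrelevant image regions and weakly relevant knowledge fragments, producing compact, high-signal multimodal knowledge. This filtered knowledge then guides the activation of internal knowledge in a constrained semantic space, enabling complementary co-modeling of explicit and implicit knowledge for robust answer prediction. Experiments on Encyclopedic-VQA and InfoSeek demonstrate consistent performance gains across multiple MLLM backbones, and ablations verify that the selection mechanism effectively reduces noise and enhances knowledge utilization.
[74] EarthSpatialBench: Benchmarking Spatial Reasoning Capabilities of Multimodal LLMs on Earth Imagery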
Zelin Xu,Yupu Zhang,Saugat Adhikari,Saiful Islam,Tingsong Xiao,Zibo Liu,Shigang Chen,Da Yan,Zhe Jiang
Main category: cs.CV
TL;DR: This paper introduces EarthSpatialBench, a comprehensive benchmark for evaluating the spatial reasoning of multimodal large language models (MLLMs) on Earth imagery, covering qualitative and quantitative distance and direction reasoning, systematic topological relations, multiple query types, and multimodal object references.
Details
Motivation: Existing spatial reasoning benchmarks for Earth imagery lack support for quantitative direction and distance reasoning, systematic topological relations, and complex object geometries (such as polygons and polylines), and so fall short of what embodied AI and similar systems need for precise interaction with the physical world.
Method: Build EarthSpatialBench, a benchmark of more than 325K question-answer pairs covering four categories of spatial reasoning tasks, and systematically evaluate open-source and proprietary MLLMs to identify the limits of their spatial reasoning.
Result: Experiments reveal that current MLLMs fall well short in quantitative spatial reasoning on Earth imagery (for example, precise distance and direction computation), in understanding complex topological relations, and in handling geometries such as polygons.
Conclusion: EarthSpatialBench fills a gap in the evaluation of spatial reasoning on Earth imagery and provides a standardized testbed and clear directions for improving MLLMs in geospatial intelligence.
Abstract: Benchmarking spatial reasoning in multimodal large language models (MLLMs) has attracted growing interest in computer vision due to its importance for embodied AI and other agentic systems that require precise interaction with the physical world. However, spatial reasoning on Earth imagery has lagged behind, as it uniquely involves grounding objects in georeferenced images and quantitatively reasoning about distances, directions, and topological relations using both visual cues and vector geometry coordinates (e.g., 2D bounding boxes, polylines, and polygons). Existing benchmarks for Earth imagery primarily focus on 2D spatial grounding, image captioning, and coarse spatial relations (e.g., simple directional or proximity cues). They lack support for quantitative direction and distance reasoning, systematic topological relations, and complex object geometries beyond bounding boxes. To fill this gap, we propose EarthSpatialBench, a comprehensive benchmark for evaluating spatial reasoning in MLLMs on Earth imagery. The benchmark contains over 325K question-answer pairs spanning: (1) qualitative and quantitative reasoning about spatial distance and direction; (2) systematic topological relations; (3) single-object queries, object-pair queries, and compositional aggregate group queries; and (4) object references expressed via textual descriptions, visual overlays, and explicit geometry coordinates, including 2D bounding boxes, polylines, and polygons.
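To make "quantitative reasoning about spatial distance and direction" concrete, the following minimal sketch computes a centroid distance and compass direction between two 2D bounding boxes. It is only an illustration, not the benchmark's definition: the function names, the `(x_min, y_min, x_max, y_max)` pixel-coordinate box format, the use of centroids, and the north-up image orientation are all assumptions.

```python
import math


def centroid(box):
    """Center of an (x_min, y_min, x_max, y_max) box in pixel coordinates."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def distance_and_bearing(box_a, box_b):
    """Distance (pixels) and bearing (degrees clockwise from north) from A to B."""
    ax, ay = centroid(box_a)
    bx, by = centroid(box_b)
    dx = bx - ax
    dy = ay - by  # image y grows downward; flip so north is up (assumes a north-up image)
    distance = math.hypot(dx, dy)
    bearing = math.degrees(math.atan2(dx, dy)) % 360.0
    return distance, bearing


def compass(bearing):
    """Map a bearing in degrees to one of eight compass sectors."""
    names = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]
    return names[int((bearing + 22.5) // 45) % 8]
```

With a known ground sampling distance (meters per pixel), the pixel distance could be scaled to a metric distance; that scaling is left out because the abstract does not specify how distances are expressed.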
We conducted extensive experiments on both open-source and proprietary models to identify limitations in the spatial reasoning of MLLMs.
[75] A Study on Real-time Object Detection using Deep Learning
Ankita Bose,Jayasravani Bhumireddy,Naveen N
Main category: cs.CV
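TL;DR: This paper surveys the use of deep learning for real-time object detection, describing mainstream algorithms (such as Faster R-CNN, YOLO, and SSD), public benchmark datasets, application case studies across domains, and comparative experiments, and identifies challenges and directions for future research.
Details
Motivation: Object detection matters in many domains, including human-computer interaction, security surveillance, traffic management, healthcare, and AR/VR; the demand for real-time performance and accuracy drives the development and optimization of deep learning methods.
Method: A systematic survey: review mainstream deep learning detectors (two-stage and single-stage) and analyze their principles and performance; summarize common public datasets (such as COCO and PASCAL VOC); catalogue typical application scenarios; run controlled experiments comparing strategies.
Result: Clarifies the suitable scenarios, strengths, and weaknesses of each class of model; provides empirical analysis across application domains; derives several key findings from the comparative experiments (such as the accuracy-speed tradeoff and the small-object detection bottleneck); identifies current technical limitations.
Conclusion: Deep learning has significantly improved detection accuracy and efficiency, but real-time performance, robustness, and the handling of small and occluded objects remain challenging; future work should explore lightweight architectures, self-supervised learning, and multimodal fusion.
Abstract: Object detection has compelling applications across a range of domains, including human-computer interfaces, security and video surveillance, navigation and road traffic monitoring, transportation systems, industrial automation, healthcare, Augmented Reality (AR) and Virtual Reality (VR), environment monitoring, and activity identification. Applications of real-time object detection in all these areas provide dynamic analysis of visual information that helps in immediate decision making. Furthermore, advanced deep learning algorithms leverage the progress in the field of object detection, providing more accurate and efficient solutions. Notable deep learning algorithms for object detection include Faster R-CNN (Region-based Convolutional Neural Network), Mask R-CNN, Cascade R-CNN, YOLO (You Only Look Once), SSD (Single Shot Multibox Detector), and RetinaNet. This article describes in detail how deep learning algorithms are used to enhance real-time object recognition. It provides information on the different object detection models available, open benchmark datasets, and studies on the use of object detection models in a range of applications. Additionally, controlled studies are provided to compare various strategies and produce some illuminating findings. Finally, a number of open challenges and promising approaches are offered as suggestions for further investigation in both relevant deep learning approaches and object recognition.
[76] Visual Memory Injection Attacks for Multi-Turn Conversations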
Christian Schlarmann,Matthias Hein
Main category: cs.CV
TL;DR: This paper proposes Visual Memory Injection (VMI), a new stealthy attack in which a manipulated image causes an LVLM to output a prescribed malicious message when a trigger prompt appears in a multi-turn conversation, exposing a serious security vulnerability of current LVLMs in long-context, multi-turn interaction.
Details
Motivation: The user base of generative large vision-language models (LVLMs) is growing fast, but their security in long-context, multi-turn settings is underexplored, particularly the risk from maliciously manipulated image inputs.
Method: Propose the VMI attack: the attacker uploads a perturbed image, which a user downloads and uses as input; the model behaves normally on ordinary prompts but outputs a preset malicious target message on a specific trigger prompt; the attack remains effective over multi-turn conversations.
Result: The attack is validated on several recent open-weight LVLMs, showing that a single perturbed image suffices to manipulate users at scale in multi-turn conversations.
Conclusion: LVLMs are vulnerable to visual memory injection in multi-turn interaction, and their robustness to adversarial attacks delivered through image inputs urgently needs improvement.
Abstract: Generative large vision-language models (LVLMs) have recently achieved impressive performance gains, and their user base is growing rapidly. However, the security of LVLMs, in particular in a long-context multi-turn setting, is largely underexplored. In this paper, we consider the realistic scenario in which an attacker uploads a manipulated image to the web/social media. A benign user downloads this image and uses it as input to the LVLM. Our novel stealthy Visual Memory Injection (VMI) attack is designed such that on normal prompts the LVLM exhibits nominal behavior, but once the user gives a triggering prompt, the LVLM outputs a specific prescribed target message to manipulate the user, e.g. for adversarial marketing or political persuasion. Compared to previous work that focused on single-turn attacks, VMI is effective even after a long multi-turn conversation with the user. We demonstrate our attack on several recent open-weight LVLMs. This article thereby shows that large-scale manipulation of users is feasible with perturbed images in multi-turn conversation settings, calling for better robustness of LVLMs against these attacks. We release the source code at https://github.com/chs20/visual-memory-injection
[77] Can Vision-Language Models See Squares? Text-Recognition Mediates Spatial Reasoning Across Three Model Families
Yuval Levental
Main category: cs.CV
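TL;DR: Through experiments, this paper reveals a fundamental deficit in vision-language models (VLMs) when localizing filled cells that carry no textual identity in binary grids: performance is good when the grid is rendered with text symbols (such as # and .) but drops sharply when it is rendered as plain filled squares. This indicates that VLMs rely heavily on a text-recognition pathway for spatial reasoning and that the spatial localization ability of their native visual pathway is weak.
Details
Motivation: Investigate whether VLMs truly have general visual spatial understanding or instead depend too heavily on textual cues in the image.
Method: Generate 15x15 binary grids (fill density 10.7%-41.8%) and render each as two image types, text symbols (. and #) and plain filled squares without gridlines; feed them to three frontier VLMs (Claude Opus, ChatGPT 5.2, and Gemini 3 Thinking) and measure cell localization accuracy and F1; all inputs are images, so they pass through the same visual encoder.
Result: In the text-symbol condition, Claude and ChatGPT reach about 91% accuracy and 84% F1, and Gemini reaches 84% accuracy and 63% F1; in the filled-squares condition all three drop to 60-73% accuracy and 29-39% F1; the F1 drop is 34 to 54 points; each model shows a different failure mode in the squares condition (under-counting, over-counting, template hallucination), but all share severely degraded spatial localization.
Conclusion: The spatial reasoning of VLMs depends heavily on recognizable textual cues in the image, and their native visual pathway localizes non-textual visual elements very poorly, exposing a fundamental limit of current architectures in general visual understanding.
Abstract: We present a simple experiment that exposes a fundamental limitation in vision-language models (VLMs): the inability to accurately localize filled cells in binary grids when those cells lack textual identity. We generate fifteen 15x15 grids with varying density (10.7%-41.8% filled cells) and render each as two image types -- text symbols (. and #) and filled squares without gridlines -- then ask three frontier VLMs (Claude Opus, ChatGPT 5.2, and Gemini 3 Thinking) to transcribe them. In the text-symbol condition, Claude and ChatGPT achieve approximately 91% cell accuracy and 84% F1, while Gemini achieves 84% accuracy and 63% F1. In the filled-squares condition, all three models collapse to 60-73% accuracy and 29-39% F1. Critically, all conditions pass through the same visual encoder -- the text symbols are images, not tokenized text. The text-vs-squares F1 gap ranges from 34 to 54 points across models, demonstrating that VLMs behave as if they possess a high-fidelity text-recognition pathway for spatial reasoning that dramatically outperforms their native visual pathway. Each model exhibits a distinct failure mode in the squares condition -- systematic under-counting (Claude), massive over-counting (ChatGPT), and template hallucination (Gemini) -- but all share the same underlying deficit: severely degraded spatial localization for non-textual visual elements.
[78] Position-Aware Scene-Appearance Disentanglement for Bidirectional Photoacoustic Microscopy Registration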
Yiwen Wang,Jiahao Qin
Main category: cs.CV
TL;DR: This paper proposes GPEReg-Net, a photoacoustic microscopy image registration method based on scene-appearance disentanglement and global position encoding, which significantly improves registration accuracy and temporal consistency under high-speed bidirectional scanning.
Details
Motivation: Bidirectional scanning in high-speed optical-resolution photoacoustic microscopy (OR-PAM) causes domain shift and geometric misalignment; existing registration methods are limited by the brightness constancy assumption or lack temporal modeling.
Method: Propose the GPEReg-Net framework: (1) use AdaIN to disentangle scene features from appearance codes, avoiding explicit deformation field estimation; (2) introduce a Global Position Encoding (GPE) module that combines learnable position embeddings, sinusoidal encoding, and cross-frame attention to model temporal structure.
Result: On the OR-PAM-Reg-4K benchmark (432 test samples), NCC reaches 0.953, SSIM reaches 0.932, and PSNR reaches 34.49 dB, exceeding the state of the art by 3.8% in SSIM and 1.99 dB in PSNR.
Conclusion: Through appearance disentanglement and explicit temporal modeling, GPEReg-Net addresses the domain shift and temporal inconsistency of bidirectionally scanned OR-PAM and offers a new paradigm for high-speed, high-fidelity biomedical imaging.
Abstract: High-speed optical-resolution photoacoustic microscopy (OR-PAM) with bidirectional raster scanning doubles imaging speed but introduces coupled domain shift and geometric misalignment between forward and backward scan lines. Existing registration methods, constrained by brightness constancy assumptions, achieve limited alignment quality, while recent generative approaches address domain shift through complex architectures that lack temporal awareness across frames. We propose GPEReg-Net, a scene-appearance disentanglement framework that separates domain-invariant scene features from domain-specific appearance codes via Adaptive Instance Normalization (AdaIN), enabling direct image-to-image registration without explicit deformation field estimation. To exploit temporal structure in sequential acquisitions, we introduce a Global Position Encoding (GPE) module that combines learnable position embeddings with sinusoidal encoding and cross-frame attention, allowing the network to leverage context from neighboring frames for improved temporal coherence. On the OR-PAM-Reg-4K benchmark (432 test samples), GPEReg-Net achieves NCC of 0.953, SSIM of 0.932, and PSNR of 34.49dB, surpassing the state-of-the-art by 3.8% in SSIM and 1.99dB in PSNR while maintaining competitive NCC. Code is available at https://github.com/JiahaoQin/GPEReg-Net.
[79] Automated Re-Identification of Holstein-Friesian Cattle in Dense Crowds
Phoenix Yu,Tilo Burghardt,Andrew W Dowsey,Neill W Campbell
Main category: cs.CV
TL;DR: This paper proposes a new detect-segment-identify pipeline that combines Open-Vocabulary Weight-free Localisation with the Segment Anything model, significantly improving detection and re-identification accuracy for Holstein-Friesian cattle in dense herds.
Details
Motivation: Existing object detectors, such as those based on YOLO, degrade badly when cows cluster densely, especially when the animals have complex, outline-breaking coat patterns; a more robust and transferable solution is needed.
Method: Build a three-stage detect-segment-identify pipeline: first use Open-Vocabulary Weight-free Localisation and SAM for unsupervised localization and segmentation as pre-processing, then feed the results to Re-ID networks; use unsupervised contrastive learning to improve re-identification.
Result: On a self-collected dataset of nine days of farm CCTV footage, detection accuracy reaches 98.93%, improving on the OBB and SAM baselines by 47.52% and 27.13%, respectively; Re-ID accuracy reaches 94.82%.
Conclusion: The method shows that high-accuracy, reliable cattle detection and re-identification are achievable in crowded real-farm scenes without manual intervention, and that it has practical deployment value.
Abstract: Holstein-Friesian detection and re-identification (Re-ID) methods capture individuals well when targets are spatially separate. However, existing approaches, including YOLO-based species detection, break down when cows group closely together. This is particularly prevalent for species which have outline-breaking coat patterns. To boost both effectiveness and transferability in this setting, we propose a new detect-segment-identify pipeline that leverages the Open-Vocabulary Weight-free Localisation and the Segment Anything models as pre-processing stages alongside Re-ID networks. To evaluate our approach, we publish a collection of nine days of CCTV data filmed on a working dairy farm. Our methodology overcomes detection breakdown in dense animal groupings, resulting in a 98.93% accuracy. This significantly outperforms current oriented bounding box-driven and SAM species detection baselines, with accuracy improvements of 47.52% and 27.13%, respectively. We show that unsupervised contrastive learning can build on this to yield 94.82% Re-ID accuracy on our test data. Our work demonstrates that Re-ID in crowded scenarios is both practical and reliable in working farm settings with no manual intervention. Code and dataset are provided for reproducibility.
[80] Non-Contact Physiological Monitoring in Pediatric Intensive Care Units via Adaptive Masking and Self-Supervised Learning
Mohamed Khalil Ben Salah,Philippe Jouvet,Rita Noumeir
Main category: cs.CV
TL;DR: This paper proposes a self-supervised pretraining framework for the pediatric intensive care unit (PICU) that combines the VisionMamba architecture with an adaptive masking mechanism and, through a progressive curriculum and teacher-student distillation, achieves robust contactless heart rate estimation from unlabeled clinical video.
Details
Motivation: Conventional contact sensors can cause skin irritation, infection risk, and discomfort in pediatric patients, while existing rPPG methods are limited in the PICU by motion artifacts, occlusions, lighting changes, and the domain shift from laboratory to clinic.
Method: Design a self-supervised pretraining framework based on VisionMamba, with a lightweight Mamba controller that provides spatiotemporal importance scores and probabilistic patch sampling; use a three-stage progressive curriculum (clean public videos, then synthetic occlusion videos, then unlabeled clinical videos from 500 pediatric patients) together with teacher-student distillation guided by a supervised expert model.
Result: MAE is 42% lower than with a standard masked autoencoder and 31% better than PhysFormer, reaching a final MAE of 3.2 bpm; the model attends to pulse-rich regions without explicit ROI extraction and stays robust under clinical occlusion and noise.
Conclusion: The framework mitigates the data scarcity and domain adaptation problems of rPPG in the PICU and offers a deployable new paradigm for contactless vital sign monitoring in the clinic.
Abstract: Continuous monitoring of vital signs in Pediatric Intensive Care Units (PICUs) is essential for early detection of clinical deterioration and effective clinical decision-making. However, contact-based sensors such as pulse oximeters may cause skin irritation, increase infection risk, and lead to patient discomfort. Remote photoplethysmography (rPPG) offers a contactless alternative to monitor heart rate using facial video, but remains underutilized in PICUs due to motion artifacts, occlusions, variable lighting, and domain shifts between laboratory and clinical data. We introduce a self-supervised pretraining framework for rPPG estimation in the PICU setting, based on a progressive curriculum strategy. The approach leverages the VisionMamba architecture and integrates an adaptive masking mechanism, where a lightweight Mamba-based controller assigns spatiotemporal importance scores to guide probabilistic patch sampling. This strategy dynamically increases reconstruction difficulty while preserving physiological relevance. To address the lack of labeled clinical data, we adopt a teacher-student distillation setup. A supervised expert model, trained on public datasets, provides latent physiological guidance to the student. The curriculum progresses through three stages: clean public videos, synthetic occlusion scenarios, and unlabeled videos from 500 pediatric patients. Our framework achieves a 42% reduction in mean absolute error relative to standard masked autoencoders and outperforms PhysFormer by 31%, reaching a final MAE of 3.2 bpm.
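For readers unfamiliar with the metric, the sketch below shows how a mean absolute error in beats per minute and a relative MAE reduction are typically computed. The function names are hypothetical and the heart-rate values in the usage note are made up for illustration; nothing here reproduces the paper's evaluation protocol.

```python
def mean_absolute_error(predicted_bpm, reference_bpm):
    """Average absolute difference between predicted and reference heart rates."""
    if not predicted_bpm or len(predicted_bpm) != len(reference_bpm):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - r) for p, r in zip(predicted_bpm, reference_bpm))
    return total / len(predicted_bpm)


def relative_reduction(baseline_mae, model_mae):
    """Fraction by which model_mae is lower than baseline_mae."""
    return (baseline_mae - model_mae) / baseline_mae
```

For example, hypothetical predictions of 120, 130, and 140 bpm against references of 118, 133, and 140 bpm give an MAE of 5/3 bpm, and a model with MAE 4.0 bpm against a baseline of 5.0 bpm corresponds to a 20% relative reduction.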
Without explicit region-of-interest extraction, the model consistently attends to pulse-rich areas and demonstrates robustness under clinical occlusions and noise.
[81] LAND: A Longitudinal Analysis of Neuromorphic Datasets
Gregory Cohen,Alexandre Marcireau
Main category: cs.CV
TL;DR: 本文综述了神经形态计算领域中数据集的现状与挑战,分析了423个现有数据集,指出其在规模、标准化、可访问性及任务定义方面存在的问题,并探讨了合成数据(仿真/视频转事件)的利弊,提出元数据集作为缓解数据需求与偏差的新思路。
Details
Motivation: 神经形态研究虽数据集数量激增,但普遍存在数据难找、难理解、难使用、缺乏标准化等问题,且大量论文仍呼吁更多更大数据,亟需系统性梳理与反思。 Method: 对超过423个神经形态数据集进行系统性快照式调研,分析其任务类型、数据结构、规模演化、获取与使用难度;同时对比评估真实数据与合成数据(仿真/视频转事件)的适用性与局限性,并引入元数据集概念作为解决方案。 Result: 揭示了当前神经形态数据集在规模膨胀下的三大核心问题:数据体量过大、标准缺失、访问困难;发现合成数据日益增多,虽利于算法验证但易导致新应用探索中的偏差;提出元数据集可有效减少重复造数并缓解任务-数据耦合带来的偏差。 Conclusion: 神经形态领域的真正瓶颈并非数据量不足,而是数据质量、可复现性与任务适配性不足;未来应推动标准化建设、提升数据可访问性,并审慎使用合成数据,鼓励基于元数据集的泛化性研究。 Abstract: Neuromorphic engineering has a data problem. Despite the meteoric rise in the number of neuromorphic datasets published over the past ten years, the conclusion of a significant portion of neuromorphic research papers still states that there is a need for yet more data and even larger datasets. Whilst this need is driven in part by the sheer volume of data required by modern deep learning approaches, it is also fuelled by the current state of the available neuromorphic datasets and the difficulties in finding them, understanding their purpose, and determining the nature of their underlying task. This is further compounded by practical difficulties in downloading and using these datasets. This review starts by capturing a snapshot of the existing neuromorphic datasets, covering over 423 datasets, and then explores the nature of their tasks and the underlying structure of the presented data. Analysing these datasets shows the difficulties arising from their size, the lack of standardisation, and difficulties in accessing the actual data. This paper also highlights the growth in the size of individual datasets and the complexities involved in working with the data. However, a more important concern is the rise of synthetic datasets, created by either simulation or video-to-events methods. This review explores the benefits of simulated data for testing existing algorithms and applications, highlighting the potential pitfalls for exploring new applications of neuromorphic technologies. 
This review also introduces the concept of meta-datasets, created from existing datasets, as a way of both reducing the need for more data and removing potential bias arising from defining both the dataset and the task.
[82] SAM 3D Body: Robust Full-Body Human Mesh Recovery
Xitong Yang,Devansh Kukreja,Don Pinkus,Anushka Sagar,Taosha Fan,Jinhyung Park,Soyong Shin,Jinkun Cao,Jiawei Liu,Nicolas Ugrinovic,Matt Feiszli,Jitendra Malik,Piotr Dollar,Kris Kitani
Main category: cs.CV
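TL;DR: This paper presents SAM 3D Body (3DB), a promptable model for single-image full-body 3D human mesh recovery. It is the first to use Momentum Human Rig (MHR), a new parametric mesh representation that decouples skeletal structure from surface shape. Combined with an encoder-decoder architecture and multimodal auxiliary prompts, it shows strong generalization and accuracy in real-world scenes; the model and the representation are both open-source.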
Details
Motivation: Existing 3D human mesh recovery (HMR) methods generalize poorly and are inconsistently accurate in complex real-world scenes, cannot model rare poses and imaging conditions well, and rely on conventional parametric representations that struggle to offer flexibility in both skeletal structure and surface shape.
Method: Proposes the promptable 3DB model with the new parametric representation MHR; builds an encoder-decoder architecture that supports auxiliary prompts such as 2D keypoints and masks; designs a multi-stage, high-quality annotation pipeline (combining manual annotation, differentiable optimization, multi-view geometry, and dense keypoint detection); develops a data engine to increase data diversity; and builds a new evaluation dataset organized by pose and appearance.
Result: 3DB clearly outperforms prior methods in both qualitative user preference studies and quantitative analysis, with stronger generalization and consistent accuracy, especially in diverse in-the-wild conditions.
Conclusion: 3DB is the first work to bring the SAM-style promptable paradigm to single-image full-body 3D HMR, and MHR provides a more flexible representation for human body modeling. Both the model and MHR are open-source.
Abstract: We introduce SAM 3D Body (3DB), a promptable model for single-image full-body 3D human mesh recovery (HMR) that demonstrates state-of-the-art performance, with strong generalization and consistent accuracy in diverse in-the-wild conditions. 3DB estimates the human pose of the body, feet, and hands. It is the first model to use a new parametric mesh representation, Momentum Human Rig (MHR), which decouples skeletal structure and surface shape. 3DB employs an encoder-decoder architecture and supports auxiliary prompts, including 2D keypoints and masks, enabling user-guided inference similar to the SAM family of models. We derive high-quality annotations from a multi-stage annotation pipeline that uses various combinations of manual keypoint annotation, differentiable optimization, multi-view geometry, and dense keypoint detection. Our data engine efficiently selects and processes data to ensure data diversity, collecting unusual poses and rare imaging conditions. We present a new evaluation dataset organized by pose and appearance categories, enabling nuanced analysis of model behavior. Our experiments demonstrate superior generalization and substantial improvements over prior methods in both qualitative user preference studies and traditional quantitative analysis. Both 3DB and MHR are open-source.
[83] BTReport: A Framework for Brain Tumor Radiology Report Generation with Clinically Relevant Features
Juampablo E. Heras Rivera,Dickson T. Chen,Tianyi Ren,Daniel K. Low,Asma Ben Abacha,Alberto Santamaria-Pang,Mehmet Kurt
Main category: cs.CV
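TL;DR: This paper proposes BTReport, a framework that combines deterministic feature extraction with a large language model to generate interpretable, low-hallucination brain tumor radiology reports, and builds the companion synthetic dataset BTReport-BraTS.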
Details
Motivation: Neuro-oncology lacks open paired image-report datasets, which limits progress in radiology report generation (RRG).
Method: BTReport splits RRG into two steps: it first extracts clinically relevant features from the images deterministically, then uses a large language model only for syntactic structuring and narrative formatting, without relying on an end-to-end vision-language model.
Result: The generated reports align more closely with real clinical reports, and the features used are predictive of survival and IDH mutation status. The released BTReport-BraTS dataset augments BraTS imaging with synthetic reports.
Conclusion: BTReport generates brain tumor reports that are highly interpretable and less prone to hallucination, offering an approach for medical AI tasks that lack annotated data.
Abstract: Recent advances in radiology report generation (RRG) have been driven by large paired image-text datasets; however, progress in neuro-oncology has been limited due to a lack of open paired image-report datasets. Here, we introduce BTReport, an open-source framework for brain tumor RRG that constructs natural language radiology reports using deterministically extracted imaging features. Unlike existing approaches that rely on large general-purpose or fine-tuned vision-language models for both image interpretation and report composition, BTReport performs deterministic feature extraction for image analysis and uses large language models only for syntactic structuring and narrative formatting. By separating RRG into a deterministic feature extraction step and a report generation step, the generated reports are completely interpretable and less prone to hallucinations. We show that the features used for report generation are predictive of key clinical outcomes, including survival and IDH mutation status, and reports generated by BTReport are more closely aligned with reference clinical reports than existing baselines for RRG. Finally, we introduce BTReport-BraTS, a companion dataset that augments BraTS imaging with synthetically generated radiology reports produced with BTReport. Code for this project can be found at https://github.com/KurtLabUW/BTReport.
[84] MedProbCLIP: Probabilistic Adaptation of Vision-Language Foundation Model for Reliable Radiograph-Report Retrieval
Ahmad Elallaf,Yu Zhang,Yuktha Priya Masupalli,Jeong Yang,Young Lee,Zechun Cao,Gongbo Liang
Main category: cs.CV
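TL;DR: This paper proposes MedProbCLIP, a probabilistic vision-language learning framework for representation learning and bidirectional retrieval over chest X-rays and radiology reports. It models uncertainty with Gaussian embeddings and, on MIMIC-CXR, shows better performance and reliability than deterministic and probabilistic baselines.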
Details
Motivation: Existing vision-language foundation models use deterministic embeddings, which lack the reliability required for high-stakes biomedical applications.
Method: MedProbCLIP models image and text representations as Gaussian embeddings trained with a probabilistic contrastive objective; it adds a variational information bottleneck to mitigate overconfident predictions, and uses multi-view radiograph encoding and multi-section report encoding during training for clinically aligned, fine-grained supervision.
Result: On MIMIC-CXR, MedProbCLIP outperforms baselines including CLIP, CXR-CLIP, and PCME++ in both retrieval and zero-shot classification, and also shows better calibration, risk-coverage behavior, selective retrieval reliability, and robustness to clinically relevant corruptions.
Conclusion: Probabilistic vision-language modeling can markedly improve the trustworthiness and safety of radiology image-text retrieval systems, offering an approach for high-stakes medical AI applications.
Abstract: Vision-language foundation models have emerged as powerful general-purpose representation learners with strong potential for multimodal understanding, but their deterministic embeddings often fail to provide the reliability required for high-stakes biomedical applications. This work introduces MedProbCLIP, a probabilistic vision-language learning framework for chest X-ray and radiology report representation learning and bidirectional retrieval. MedProbCLIP models image and text representations as Gaussian embeddings through a probabilistic contrastive objective that explicitly captures uncertainty and many-to-many correspondences between radiographs and clinical narratives. A variational information bottleneck mitigates overconfident predictions, while MedProbCLIP employs multi-view radiograph encoding and multi-section report encoding during training to provide fine-grained supervision for clinically aligned correspondence, yet requires only a single radiograph and a single report at inference. Evaluated on the MIMIC-CXR dataset, MedProbCLIP outperforms deterministic and probabilistic baselines, including CLIP, CXR-CLIP, and PCME++, in both retrieval and zero-shot classification. Beyond accuracy, MedProbCLIP demonstrates superior calibration, risk-coverage behavior, selective retrieval reliability, and robustness to clinically relevant corruptions, underscoring the value of probabilistic vision-language modeling for improving the trustworthiness and safety of radiology image-text retrieval systems.
[85] LGQ: Learning Discretization Geometry for Scalable and Stable Image Tokenization
Idil Bilge Altun,Mert Onur Cakiroglu,Elham Buxton,Mehmet Dalkilic,Hasan Kurban
Main category: cs.CV
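TL;DR: This paper proposes Learnable Geometric Quantization (LGQ), an image tokenization method that learns its discretization geometry end-to-end. Using soft assignments and variational free-energy minimization, it improves generation quality while keeping codebook usage efficient.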
Details
Motivation: Existing discrete image tokenizers trade flexibility against stability: vector quantization is prone to biased optimization and uneven codebook utilization, while structured scalar or implicit quantization is tied to a fixed discretization geometry and adapts poorly to heterogeneous latent distributions.
Method: LGQ replaces hard nearest-neighbor lookup with temperature-controlled soft assignments, modeled as posterior responsibilities of an isotropic Gaussian mixture, and minimizes a variational free energy. A token-level peakedness regularizer and a global usage regularizer encourage confident yet balanced codebook utilization.
Result: With a VQGAN-style backbone on ImageNet at 16K codebook size, LGQ improves rFID by 11.88% over FSQ with 49.96% fewer active codes, and improves rFID by 6.06% over SimVQ with a 49.45% lower effective representation rate, giving comparable fidelity with a sparser, more efficient codebook.
Conclusion: Through learnable geometric quantization, LGQ balances expressiveness, training stability, and codebook utilization in discrete tokenization, offering an approach for efficient visual generation.
Abstract: Discrete image tokenization is a key bottleneck for scalable visual generation: a tokenizer must remain compact for efficient latent-space priors while preserving semantic structure and using discrete capacity effectively. Existing quantizers face a trade-off: vector-quantized tokenizers learn flexible geometries but often suffer from biased straight-through optimization, codebook under-utilization, and representation collapse at large vocabularies. Structured scalar or implicit tokenizers ensure stable, near-complete utilization by design, yet rely on fixed discretization geometries that may allocate capacity inefficiently under heterogeneous latent statistics. We introduce Learnable Geometric Quantization (LGQ), a discrete image tokenizer that learns discretization geometry end-to-end. LGQ replaces hard nearest-neighbor lookup with temperature-controlled soft assignments, enabling fully differentiable training while recovering hard assignments at inference. The assignments correspond to posterior responsibilities of an isotropic Gaussian mixture and minimize a variational free-energy objective, provably converging to nearest-neighbor quantization in the low-temperature limit. LGQ combines a token-level peakedness regularizer with a global usage regularizer to encourage confident yet balanced code utilization without imposing rigid grids. Under a controlled VQGAN-style backbone on ImageNet across multiple vocabulary sizes, LGQ achieves stable optimization and balanced utilization.
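The soft-assignment idea can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the LGQ implementation: it assumes equal mixture weights and a shared isotropic variance absorbed into `temperature`, and the names `soft_assign` and `quantize` are hypothetical. Under those assumptions the responsibilities reduce to a softmax over negative squared distances, which approaches a one-hot nearest-neighbor choice as the temperature goes to zero.

```python
import numpy as np


def soft_assign(z, codebook, temperature):
    """Responsibilities of each code for each latent vector.

    z: (N, D) latents, codebook: (K, D) code vectors.
    Assumes an equal-weight isotropic Gaussian mixture (illustrative).
    """
    # Squared Euclidean distance between every latent and every code: (N, K).
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2 / temperature
    # Subtract the row maximum for numerical stability before exponentiating.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


def quantize(z, codebook, temperature):
    """Soft-quantized latents (responsibility-weighted codes) and responsibilities."""
    r = soft_assign(z, codebook, temperature)
    return r @ codebook, r
```

At a high temperature the output blends several codes, which keeps gradients flowing to all of them; at a low temperature `r.argmax(axis=1)` matches the nearest code, consistent with the abstract's low-temperature limit.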
At 16K codebook size, LGQ improves rFID by 11.88% over FSQ while using 49.96% fewer active codes, and improves rFID by 6.06% over SimVQ with 49.45% lower effective representation rate, achieving comparable fidelity with substantially fewer active entries. Our GitHub repository is available at: https://github.com/KurbanIntelligenceLab/LGQ
[86] OmniCT: Towards a Unified Slice-Volume LVLM for Comprehensive CT Analysis
Tianwei Lin,Zhongwei Qiu,Wenqiao Zhang,Jiang Liu,Yihan Xie,Mingjian Gao,Zhenxuan Fan,Zhaocheng Li,Sijing Li,Zhongle Xie,Peng LU,Yueting Zhuang,Yingda Xia,Ling Zhang,Beng Chin Ooi
Main category: cs.CV
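TL;DR: This paper proposes OmniCT, a unified slice-volume large vision-language model. With spatial consistency enhancement, organ-level semantic enhancement, and the newly built large-scale CT dataset MedEval-CT, it combines sensitivity to fine-grained detail with macro-level spatial reasoning in CT understanding, and establishes a new paradigm for cross-modal medical imaging understanding.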
Details
Motivation: Existing large vision-language models split CT understanding between slice-level and volume-level modeling: slice-level models lack cross-slice spatial consistency, while volume-level models are coarse-grained and poorly compatible with slice inputs, which hinders clinical translation.
Method: Proposes OmniCT with three core components: (i) Spatial Consistency Enhancement (SCE), which combines volumetric slice composition with tri-axial positional embedding and uses an MoE hybrid projection for efficient slice-volume adaptation; (ii) Organ-level Semantic Enhancement (OSE), which explicitly aligns anatomical regions through segmentation and ROI localization to strengthen lesion- and organ-level semantics; (iii) MedEval-CT, currently the largest slice-volume CT dataset and hybrid evaluation benchmark.
Result: OmniCT substantially outperforms existing methods across diverse clinical tasks, combining micro-level detail sensitivity with macro-level spatial reasoning, and establishes a new paradigm for cross-modal medical imaging understanding.
Conclusion: OmniCT bridges the gap between slice-driven and volume-driven CT understanding, providing a unified modeling paradigm and a scalable technical path for clinical deployment of medical large models.
Abstract: Computed Tomography (CT) is one of the most widely used and diagnostically information-dense imaging modalities, covering critical organs such as the heart, lungs, liver, and colon. Clinical interpretation relies on both slice-driven local features (e.g., sub-centimeter nodules, lesion boundaries) and volume-driven spatial representations (e.g., tumor infiltration, inter-organ anatomical relations). However, existing Large Vision-Language Models (LVLMs) remain fragmented in CT slice versus volumetric understanding: slice-driven LVLMs show strong generalization but lack cross-slice spatial consistency, while volume-driven LVLMs explicitly capture volumetric semantics but suffer from coarse granularity and poor compatibility with slice inputs. The absence of a unified modeling paradigm constitutes a major bottleneck for the clinical translation of medical LVLMs. We present OmniCT, a powerful unified slice-volume LVLM for CT scenarios, which makes three contributions: (i) Spatial Consistency Enhancement (SCE): volumetric slice composition combined with tri-axial positional embedding that introduces volumetric consistency, and an MoE hybrid projection that enables efficient slice-volume adaptation; (ii) Organ-level Semantic Enhancement (OSE): segmentation and ROI localization explicitly align anatomical regions, emphasizing lesion- and organ-level semantics; (iii) MedEval-CT: the largest slice-volume CT dataset and hybrid benchmark, integrating comprehensive metrics for unified evaluation.
OmniCT consistently outperforms existing methods by a substantial margin across diverse clinical tasks and satisfies both micro-level detail sensitivity and macro-level spatial reasoning. More importantly, it establishes a new paradigm for cross-modal medical imaging understanding.
[87] CHAI: CacHe Attention Inference for text2video
Joel Mathew Cherian,Ashutosh Muralidhara Bharadwaj,Vima Gupta,Anand Padmanabha Iyer
Main category: cs.CV
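TL;DR: CHAI proposes Cache Attention, which uses cross-inference caching and selective attention to reuse semantically related latents, generating high-quality videos with only 8 denoising steps and running 1.65x to 3.35x faster at inference.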
Details
Motivation: Existing acceleration methods either require expensive retraining or rely on heuristic step skipping, and struggle to maintain video quality as the number of denoising steps decreases.
Method: Proposes the CHAI framework, built around the Cache Attention mechanism, which supports caching semantically related latents and reusing attention across different text prompts.
Result: High-quality videos can be generated with only 8 denoising steps; compared with OpenSora 1.2, inference is 1.65x to 3.35x faster while video quality is maintained.
Conclusion: Cross-inference caching combined with Cache Attention is an efficient, retraining-free way to accelerate text-to-video diffusion models.
Abstract: Text-to-video diffusion models deliver impressive results but remain slow because of the sequential denoising of 3D latents. Existing approaches to speed up inference either require expensive model retraining or use heuristic-based step skipping, which struggles to maintain video quality as the number of denoising steps decreases. Our work, CHAI, aims to use cross-inference caching to reduce latency while maintaining video quality. We introduce Cache Attention as an effective method for attending to shared objects/scenes across cross-inference latents. This selective attention mechanism enables effective reuse of cached latents across semantically related prompts, yielding high cache hit rates. We show that it is possible to generate high-quality videos using Cache Attention with as few as 8 denoising steps. When integrated into the overall system, CHAI is 1.65x to 3.35x faster than baseline OpenSora 1.2 while maintaining video quality.
[88] IRIS: Intent Resolution via Inference-time Saccades for Open-Ended VQA in Large Vision-Language Models
Parsa Madinei,Srijita Karmakar,Russell Cohen Hoffing,Felix Gervitz,Miguel P. Eckstein
Main category: cs.CV
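TL;DR: IRIS is a new training-free method that uses real-time eye-tracking data to resolve ambiguity in open-ended visual question answering (VQA). On ambiguous questions it raises the answer accuracy of large vision-language models (VLMs) from 35.2% to 77.2% without hurting performance on unambiguous questions.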
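A minimal sketch of the selection step this entry reports as most informative, picking the fixation closest in time to the moment the user starts speaking, might look as follows. The `Fixation` record, the function names, and the idea of cropping a box around the fixation to focus the VLM are assumptions for illustration; the entry does not say how IRIS passes gaze information to the model.

```python
from dataclasses import dataclass


@dataclass
class Fixation:
    t: float  # fixation time in seconds
    x: float  # horizontal position in pixels
    y: float  # vertical position in pixels


def fixation_nearest_onset(fixations, speech_onset):
    """Return the fixation whose timestamp is closest to the speech onset."""
    return min(fixations, key=lambda f: abs(f.t - speech_onset))


def crop_box(fix, width, height, half_size=112):
    """Axis-aligned box (left, top, right, bottom) around a fixation,
    clipped to the image bounds. The cropping strategy is hypothetical."""
    left = max(0, int(fix.x) - half_size)
    top = max(0, int(fix.y) - half_size)
    right = min(width, int(fix.x) + half_size)
    bottom = min(height, int(fix.y) + half_size)
    return left, top, right, bottom
```

The resulting box could be cropped from the image or described in the prompt; either choice is a design decision outside what the entry specifies.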
Details
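Motivation: Large VLMs answer open-ended VQA questions inaccurately when the image-question pair is ambiguous; natural human gaze behavior can supply fine-grained semantic cues to resolve this.
Method: Proposes the IRIS framework, which requires no model fine-tuning and instead incorporates the user's eye-tracking data in real time (especially fixations near the moment the question begins) to steer the VLM toward the most relevant image regions for intent resolution.
Result: In a user study with 500 unique image-question pairs, IRIS more than doubles accuracy on ambiguous questions (from 35.2% to 77.2%), performs robustly across several state-of-the-art VLMs, and is released with a new benchmark dataset, a real-time interaction protocol, and an evaluation suite.
Conclusion: Eye-movement signals are an efficient, general, and training-free source of disambiguation for VQA, and IRIS offers an approach for building more natural, interactive multimodal systems.
Abstract: We introduce IRIS (Intent Resolution via Inference-time Saccades), a novel training-free approach that uses eye-tracking data in real-time to resolve ambiguity in open-ended VQA. Through a comprehensive user study with 500 unique image-question pairs, we demonstrate that fixations closest to the time participants start verbally asking their questions are the most informative for disambiguation in Large VLMs, more than doubling the accuracy of responses on ambiguous questions (from 35.2% to 77.2%) while maintaining performance on unambiguous queries. We evaluate our approach across state-of-the-art VLMs, showing consistent improvements when gaze data is incorporated in ambiguous image-question pairs, regardless of architectural differences. We release a new benchmark dataset to use eye movement data for disambiguated VQA, a novel real-time interactive protocol, and an evaluation suite.
[89] Evaluating Demographic Misrepresentation in Image-to-Image Portrait Editing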
Huichan Seo,Minki Hong,Sieun Choi,Jihie Kim,Jean Oh
Main category: cs.CV
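TL;DR: This paper studies bias tied to demographic attributes (such as race, gender, and age) in image-to-image (I2I) editing. It defines two new failure modes, Soft Erasure and Stereotype Replacement, and uses a controlled benchmark with multi-model evaluation to show that identity preservation failures are pervasive, demographically uneven, and shaped by implicit social priors. It further proposes a prompt-level identity constraint that needs no model changes and substantially reduces identity change for minority groups, highlighting asymmetric identity priors in current I2I editors.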
Details
Motivation: Demographic bias in text-to-image (T2I) generation has been widely studied, but demographic-related failures in instruction-guided image-to-image (I2I) editing have not been explored systematically. This paper asks whether identical edit instructions produce systematically different outcomes across demographic groups, and identifies the underlying mechanisms.
Method: Formalizes two failure modes (Soft Erasure and Stereotype Replacement); builds a controlled benchmark using portrait generation and editing conditioned on race, gender, and age, with a diagnostic prompt set; validates with both vision-language model (VLM) scoring and human evaluation; and designs a prompt-level identity constraint for intervention experiments.
Result: Identity preservation failures are pervasive and unevenly distributed across demographic groups; failures are shaped by implicit social priors (such as occupation-driven gender inference); a prompt-level identity constraint substantially reduces identity change for minority groups without model updates, while barely affecting majority groups.
Conclusion: Identity preservation is a central and demographically uneven failure mode in I2I editing; existing editors carry asymmetric identity priors; demographically robust I2I editing systems are needed.
Abstract: Demographic bias in text-to-image (T2I) generation is well studied, yet demographic-conditioned failures in instruction-guided image-to-image (I2I) editing remain underexplored. We examine whether identical edit instructions yield systematically different outcomes across subject demographics in open-weight I2I editors. We formalize two failure modes: Soft Erasure, where edits are silently weakened or ignored in the output image, and Stereotype Replacement, where edits introduce unrequested, stereotype-consistent attributes. We introduce a controlled benchmark that probes demographic-conditioned behavior by generating and editing portraits conditioned on race, gender, and age using a diagnostic prompt set, and evaluate multiple editors with vision-language model (VLM) scoring and human evaluation. Our analysis shows that identity preservation failures are pervasive, demographically uneven, and shaped by implicit social priors, including occupation-driven gender inference. Finally, we demonstrate that a prompt-level identity constraint, without model updates, can substantially reduce demographic change for minority groups while leaving majority-group portraits largely unchanged, revealing asymmetric identity priors in current editors. Together, our findings establish identity preservation as a central and demographically uneven failure mode in I2I editing and motivate demographic-robust editing systems.
Project page: https://seochan99.github.io/i2i-demographic-bias
[90] Uncertainty-Guided Inference-Time Depth Adaptation for Transformer-Based Visual Tracking
Patrick Poggi,Divake Kumar,Theja Tulabandhula,Amit Ranjan Trivedi
Main category: cs.CV
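TL;DR: This paper proposes UncL-STARK, a method for dynamic, uncertainty-aware adaptation of inference depth that leaves the original transformer tracker architecture unchanged. Random-depth training with knowledge distillation makes intermediate layers robust, and uncertainty estimated from heatmaps drives depth selection, substantially reducing compute while maintaining accuracy.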
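A feedback policy of this kind can be sketched as follows. The details are assumptions for illustration rather than the paper's method: uncertainty is taken here to be the normalized entropy of a non-negative localization heatmap, and the depth levels, thresholds, and function names (`heatmap_uncertainty`, `next_depth`) are hypothetical.

```python
import numpy as np


def heatmap_uncertainty(heatmap):
    """Normalized entropy in [0, 1] of a non-negative heatmap with positive sum.

    0 means a single sharp peak; 1 means a completely flat map.
    """
    p = np.asarray(heatmap, dtype=np.float64).ravel()
    p = p / p.sum()
    nonzero = p[p > 0]
    entropy = -(nonzero * np.log(nonzero)).sum()
    return entropy / np.log(p.size)


def next_depth(current, uncertainty, depths=(2, 4, 6), low=0.3, high=0.6):
    """Choose the depth for the next frame from the current frame's uncertainty."""
    i = depths.index(current)
    if uncertainty > high:
        # Low confidence: fall back to the full-depth model.
        return depths[-1]
    if uncertainty < low and i > 0:
        # High confidence: try one level shallower on the next frame.
        return depths[i - 1]
    return current
```

Stepping down one level at a time but jumping straight back to full depth is a conservative choice: it spends extra compute when confidence drops rather than risk losing the target.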
Details
Motivation: Transformer-based single-object trackers are accurate but use fixed-depth inference, running the full encoder-decoder stack on every frame, which incurs redundant computation in long videos (especially when consecutive frames are highly coherent).
Method: Proposes UncL-STARK: 1) the architecture is left unchanged; 2) random-depth training with knowledge distillation makes predictions robust at multiple intermediate depths; 3) a lightweight uncertainty estimate is derived from the corner localization heatmaps; 4) a feedback policy that exploits temporal coherence in video dynamically selects the encoder and decoder depth for the next frame.
Result: On GOT-10k and LaSOT, achieves up to 12% GFLOPs reduction, 8.9% latency reduction, and 10.8% energy savings, while tracking accuracy stays within 0.2% of the full-depth baseline.
Conclusion: UncL-STARK strikes a good trade-off between computational efficiency and tracking accuracy, showing that uncertainty-driven dynamic depth inference is effective and practical for transformer trackers.
Abstract: Transformer-based single-object trackers achieve state-of-the-art accuracy but rely on fixed-depth inference, executing the full encoder-decoder stack for every frame regardless of visual complexity, thereby incurring unnecessary computational cost in long video sequences dominated by temporally coherent frames. We propose UncL-STARK, an architecture-preserving approach that enables dynamic, uncertainty-aware depth adaptation in transformer-based trackers without modifying the underlying network or adding auxiliary heads. The model is fine-tuned to retain predictive robustness at multiple intermediate depths using random-depth training with knowledge distillation, thus enabling safe inference-time truncation. At runtime, we derive a lightweight uncertainty estimate directly from the model's corner localization heatmaps and use it in a feedback-driven policy that selects the encoder and decoder depth for the next frame based on the prediction confidence by exploiting temporal coherence in video. Extensive experiments on GOT-10k and LaSOT demonstrate up to 12% GFLOPs reduction, 8.9% latency reduction, and 10.8% energy savings while maintaining tracking accuracy within 0.2% of the full-depth baseline across both short-term and long-term sequences.
[91] DataCube: A Video Retrieval Platform via Natural Language Semantic Profiling
Yiming Ju,Hanyu Zhao,Quanyue Ma,Donglin Hao,Chengwei Wu,Ming Li,Songjing Wang,Tengfei Pan
Main category: cs.CV
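TL;DR: DataCube is an intelligent video processing platform that supports automatic video processing, multi-dimensional semantic profiling, and query-driven retrieval, helping users efficiently build customized data subsets from large-scale video repositories.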
Details
Motivation: Large-scale video repositories are increasingly abundant, but turning raw videos into high-quality, task-specific datasets is costly and inefficient.
Method: Presents the DataCube platform, which builds structured semantic representations of video clips, implements hybrid retrieval with neural re-ranking and deep semantic matching, and provides an interactive web interface.
Result: Enables efficient construction of customized subsets from massive video repositories for training, analysis, and evaluation, and supports building searchable systems over private video collections. The platform is publicly online with a demo video.
Conclusion: DataCube makes video dataset construction more automated and flexible, providing practical infrastructure for video understanding and generation tasks.
Abstract: Large-scale video repositories are increasingly available for modern video understanding and generation tasks. However, transforming raw videos into high-quality, task-specific datasets remains costly and inefficient. We present DataCube, an intelligent platform for automatic video processing, multi-dimensional profiling, and query-driven retrieval. DataCube constructs structured semantic representations of video clips and supports hybrid retrieval with neural re-ranking and deep semantic matching. Through an interactive web interface, users can efficiently construct customized video subsets from massive repositories for training, analysis, and evaluation, and build searchable systems over their own private video collections. The system is publicly accessible at https://datacube.baai.ac.cn/. Demo Video: https://baai-data-cube.ks3-cn-beijing.ksyuncs.com/custom/Adobe%20Express%20-%202%E6%9C%8818%E6%97%A5%20%281%29%281%29%20%281%29.mp4
[92] EasyControlEdge: A Foundation-Model Fine-Tuning for Edge Detection
Hiroki Nakamura,Hiroto Iino,Masashi Okada,Tadahiro Taniguchi
Main category: cs.CV
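TL;DR: This paper proposes EasyControlEdge, which adapts an image-generation foundation model to edge detection. With an edge-oriented objective and guidance based on unconditional dynamics, it produces crisper results more data-efficiently when training data is limited.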
Details
Motivation: Real-world edge detection (for example on floor plans, satellite images, and medical images) demands crisp edges and data efficiency, yet existing methods struggle to produce crisp raw edge maps from few training samples.
Method: Proposes EasyControlEdge, which adapts an image-generation foundation model to edge detection; designs an edge-oriented pixel-space loss; and introduces guidance based on unconditional dynamics at inference, controlling edge density through the guidance scale.
Result: Experiments on BSDS500, NYUDv2, BIPED, and CubiCasa show that the method outperforms state-of-the-art methods, both under crispness evaluation without post-processing and in limited-data settings.
Conclusion: Combining the priors and iterative refinement of image-generation foundation models with an edge-specific adaptation strategy effectively improves the crispness and data efficiency of edge detection.
Abstract: We propose EasyControlEdge, adapting an image-generation foundation model to edge detection. In real-world edge detection (e.g., floor-plan walls, satellite roads/buildings, and medical organ boundaries), crispness and data efficiency are crucial, yet producing crisp raw edge maps with limited training samples remains challenging. Although image-generation foundation models perform well on many downstream tasks, their pretrained priors for data-efficient transfer and iterative refinement for high-frequency detail preservation remain underexploited for edge detection. To enable crisp and data-efficient edge detection using these capabilities, we introduce an edge-specialized adaptation of image-generation foundation models. To better specialize the foundation model for edge detection, we incorporate an edge-oriented objective with an efficient pixel-space loss. At inference, we introduce guidance based on unconditional dynamics, enabling a single model to control the edge density through a guidance scale. Experiments on BSDS500, NYUDv2, BIPED, and CubiCasa compare against state-of-the-art methods and show consistent gains, particularly under no-post-processing crispness evaluation and with limited training data.
[93] HyPCA-Net: Advancing Multimodal Fusion in Medical Image Analysis
J. Dhar,M. K. Pandey,D. Chakladar,M. Haghighat,A. Alavi,S. Mistry,N. Zaidi
Main category: cs.CV
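TL;DR: This paper proposes a Hybrid Parallel-Fusion Cascaded Attention Network (HyPCA-Net). Using an efficient residual adaptive learning attention block and a dual-view cascaded attention block, it addresses the high computational cost, information loss, and weak cross-modal shared representations of existing medical multimodal fusion methods, improving performance on 10 public datasets while sharply reducing computational cost.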
Details
Motivation: Existing medical multimodal fusion methods are computationally expensive, lose information through cascaded attention, and struggle to learn robust cross-modal shared representations, which limits their generalization in low-resource environments and multi-disease analysis.
Method: Proposes HyPCA-Net with two core modules: (a) a computationally efficient residual adaptive learning attention block for extracting refined modality-specific representations; (b) a dual-view cascaded attention block for learning robust shared representations across modalities.
Result: On 10 public medical datasets, HyPCA-Net improves performance by up to 5.2% over leading existing methods and reduces computational cost by up to 73.1%.
Conclusion: HyPCA-Net balances model performance and computational efficiency, improves the representation power and generalization of multimodal medical image fusion, and suits multi-disease analysis in low-resource environments.
Abstract: Multimodal fusion frameworks, which integrate diverse medical imaging modalities (e.g., MRI, CT), have shown great potential in applications such as skin cancer detection, dementia diagnosis, and brain tumor prediction. However, existing multimodal fusion methods face significant challenges. First, they often rely on computationally expensive models, limiting their applicability in low-resource environments. Second, they often employ cascaded attention modules, which potentially increase the risk of information loss during inter-module transitions and hinder their capacity to effectively capture robust shared representations across modalities. This restricts their generalization in multi-disease analysis tasks. To address these limitations, we propose a Hybrid Parallel-Fusion Cascaded Attention Network (HyPCA-Net), composed of two core novel blocks: (a) a computationally efficient residual adaptive learning attention block for capturing refined modality-specific representations, and (b) a dual-view cascaded attention block aimed at learning robust shared representations across diverse modalities. Extensive experiments on ten publicly available datasets show that HyPCA-Net significantly outperforms existing leading methods, with improvements of up to 5.2% in performance and reductions of up to 73.1% in computational cost. Code: https://github.com/misti1203/HyPCA-Net.
[94] AFFMAE: Scalable and Efficient Vision Pretraining for Desktop Graphics Cards
David Smerkous,Zian Wang,Behzad Najafian
Main category: cs.CV
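TL;DR: AFFMAE is a new framework for self-supervised pretraining on high-resolution images. By dynamically merging visible tokens in an adaptive, off-grid manner, it overcomes the structural limits of combining conventional MAE with hierarchical architectures, and on electron microscopy segmentation it substantially cuts compute and memory while maintaining performance.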
Details
Motivation: High-resolution self-supervised pretraining usually depends on server-scale resources, which keeps small and mid-sized laboratories from developing in-domain foundation models. Existing MAE is hard to combine with hierarchical downsampling architectures because of dense-grid priors and mask-aware design compromises.
Method: Proposes the AFFMAE framework: adaptive, off-grid dynamic merging of visible tokens; masked tokens are discarded and only visible tokens are merged hierarchically; numerically stable mixed-precision Flash-style cluster attention kernels; and deep supervision to mitigate representation collapse in sparse stages.
Result: On high-resolution electron microscopy segmentation, AFFMAE matches ViT-MAE at equal parameter count while reducing FLOPs by up to 7x, halving memory usage, and training faster on a single RTX 5090.
Conclusion: AFFMAE offers an efficient, scalable, and structure-friendly approach to high-resolution hierarchical self-supervised pretraining in resource-constrained settings.
Abstract: Self-supervised pretraining has transformed computer vision by enabling data-efficient fine-tuning, yet high-resolution training typically requires server-scale infrastructure, limiting in-domain foundation model development for many research laboratories. Masked Autoencoders (MAE) reduce computation by encoding only visible tokens, but combining MAE with hierarchical downsampling architectures remains structurally challenging due to dense grid priors and mask-aware design compromises. We introduce AFFMAE, a masking-friendly hierarchical pretraining framework built on adaptive, off-grid token merging. By discarding masked tokens and performing dynamic merging exclusively over visible tokens, AFFMAE removes dense-grid assumptions while preserving hierarchical scalability. We develop numerically stable mixed-precision Flash-style cluster attention kernels, and mitigate sparse-stage representation collapse via deep supervision. On high-resolution electron microscopy segmentation, AFFMAE matches ViT-MAE performance at equal parameter count while reducing FLOPs by up to 7x, halving memory usage, and achieving faster training on a single RTX 5090. Code available at https://github.com/najafian-lab/affmae.
[95] Breaking the Sub-Millimeter Barrier: Eyeframe Acquisition from Color Images
Manel Guzmán,Antonio Agudo
Main category: cs.CV
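TL;DR: This paper proposes a new eyeframe contour tracing method based on multi-view artificial vision. Using color images captured by an InVision system, it combines image segmentation, depth estimation, and multi-view fusion to measure eyeframes with sub-millimeter precision, without dedicated mechanical equipment, simplifying the optician's workflow.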
Details
Motivation: Traditional mechanical frame tracing requires precise calibration and extra equipment, and is time-consuming and inefficient; a more efficient, lower-cost alternative is needed.
Method: Builds a full pipeline on multi-view color images captured by an InVision system: image acquisition, foreground segmentation of the eyeframe, depth estimation to obtain 3D information, and fusion of multi-view RGB and depth data to extract the contour precisely.
Result: Several configurations and variants are validated on real data; the resulting contour measurements reach sub-millimeter precision and are competitive with existing solutions, while removing the need for dedicated tracing equipment.
Conclusion: The vision-driven method substantially reduces operating complexity and equipment cost for optical technicians, and makes the lens-fitting workflow more automated without sacrificing precision.
Abstract: Eyeframe lens tracing is an important process in the optical industry that requires sub-millimeter precision to ensure proper lens fitting and optimal vision correction. Traditional frame tracers rely on mechanical tools that need precise positioning and calibration, which are time-consuming and require additional equipment, creating an inefficient workflow for opticians. This work presents a novel approach based on artificial vision that utilizes multi-view information. The proposed algorithm operates on images captured from an InVision system. The full pipeline includes image acquisition, frame segmentation to isolate the eyeframe from background, depth estimation to obtain 3D spatial information, and multi-view processing that integrates segmented RGB images with depth data for precise frame contour measurement. To this end, different configurations and variants are proposed and analyzed on real data, providing competitive measurements from still color images with respect to other solutions, while eliminating the need for specialized tracing equipment and reducing workflow complexity for optical technicians.
[96] A Self-Supervised Approach for Enhanced Feature Representations in Object Detection Tasks
Santiago C. Vilabella,Pablo Pérez-Núñez,Beatriz Remeseiro
Main category: cs.CV
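TL;DR: This paper proposes a way to enhance feature extractors with self-supervised learning, matching or exceeding ImageNet-pretrained models on object detection with only a small amount of labeled data, and improving model robustness and reliability.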
Details
Motivation: Deep learning models keep growing in complexity and size, while high-quality labeled data is costly and slow to obtain, especially for tasks such as object detection, which limits development efficiency and industrial adoption.
Method: Uses a self-supervised learning strategy to train a new feature extractor on unlabeled data, avoiding reliance on the large manually labeled ImageNet dataset, while optimizing feature representations for object detection.
Result: The proposed model outperforms state-of-the-art ImageNet-pretrained feature extractors on object detection and focuses more on the key regions of objects, yielding better and more robust feature representations.
Conclusion: Enhancing the feature extractor (particularly through self-supervision) can substantially reduce the dependence on large amounts of labeled data, offering an efficient and practical approach to object detection in low-resource settings.
Abstract: In the fast-evolving field of artificial intelligence, where models are increasingly growing in complexity and size, the availability of labeled data for training deep learning models has become a significant challenge. Addressing complex problems like object detection demands considerable time and resources for data labeling to achieve meaningful results. For companies developing such applications, this entails extensive investment in highly skilled personnel or costly outsourcing. This research work aims to demonstrate that enhancing feature extractors can substantially alleviate this challenge, enabling models to learn more effective representations with less labeled data. Utilizing a self-supervised learning strategy, we present a model trained on unlabeled data that outperforms state-of-the-art feature extractors pre-trained on ImageNet and particularly designed for object detection tasks. Moreover, the results demonstrate that our approach encourages the model to focus on the most relevant aspects of an object, thus achieving better feature representations and, therefore, reinforcing its reliability and robustness.
[97] Subtractive Modulative Network with Learnable Periodic Activations
Tiou Wang,Zhuoqian Yang,Markus Flierl,Mathieu Salzmann,Sabine Süsstrunk
Main category: cs.CV
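TL;DR: This paper proposes the Subtractive Modulative Network (SMN), a new implicit neural representation (INR) architecture inspired by subtractive synthesis. Using a learnable periodic activation layer and modulative mask modules, it achieves high parameter efficiency and reconstruction accuracy, outperforming existing methods on both image reconstruction and 3D NeRF tasks.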
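One way to see how modulating periodic features can introduce higher-order harmonics is the product-to-sum identity below. It is a standard trigonometric fact offered as intuition, assuming the modulation acts multiplicatively on sinusoidal features; it is not the paper's derivation, and the frequencies $\omega_1$, $\omega_2$ are illustrative symbols.

```latex
\sin(\omega_1 x)\,\sin(\omega_2 x)
  = \tfrac{1}{2}\Big[\cos\big((\omega_1 - \omega_2)\,x\big) - \cos\big((\omega_1 + \omega_2)\,x\big)\Big]
```

Multiplying two sinusoids therefore yields components at the sum and difference frequencies, so under this assumption each modulation stage can reach frequencies that are absent from the initial basis.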
Details
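Motivation: To improve the parameter efficiency and reconstruction accuracy of implicit neural representations (INRs) by drawing on classical subtractive synthesis to design a network that follows signal processing principles more closely.
Method: Proposes the Subtractive Modulative Network (SMN), comprising a learnable periodic activation layer (Oscillator) that generates a multi-frequency basis and a series of modulative mask modules (Filters) that actively generate high-order harmonics, supported by theoretical analysis and empirical validation.
Result: Reaches a PSNR of 40+ dB on two image datasets, with reconstruction accuracy and parameter efficiency both better than state-of-the-art methods; it also shows a consistent advantage on 3D NeRF novel view synthesis.
Conclusion: SMN is a principled, efficient, and practical INR architecture that brings a signal processing perspective and an effective implementation path to implicit representation modeling.
Abstract: We propose the Subtractive Modulative Network (SMN), a novel, parameter-efficient Implicit Neural Representation (INR) architecture inspired by classical subtractive synthesis. The SMN is designed as a principled signal processing pipeline, featuring a learnable periodic activation layer (Oscillator) that generates a multi-frequency basis, and a series of modulative mask modules (Filters) that actively generate high-order harmonics. We provide both theoretical analysis and empirical validation for our design. Our SMN achieves a PSNR of 40+ dB on two image datasets, comparing favorably against state-of-the-art methods in terms of both reconstruction accuracy and parameter efficiency. Furthermore, consistent advantage is observed on the challenging 3D NeRF novel view synthesis task. Supplementary materials are available at https://inrainbws.github.io/smn/.
[98] SCAR: Satellite Imagery-Based Calibration for Aerial Recordings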
Henry Hölzemann,Michael Schleiss
Main category: cs.CV
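TL;DR: SCAR is a method for long-term automatic calibration refinement of aerial visual-inertial systems using georeferenced satellite imagery. It jointly estimates intrinsic and extrinsic parameters from 2D-3D correspondences, needs no manual calibration procedure, and substantially reduces reprojection and localization errors across a range of real-world scenarios.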
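The quantity such a calibration refinement drives down, reprojection error over 2D-3D correspondences, can be written out directly. The sketch below assumes an ideal pinhole camera without lens distortion; the function names and the intrinsic matrix `K`, rotation `R`, and translation `t` conventions are illustrative assumptions, and SCAR's actual optimisation procedure is not shown.

```python
import numpy as np


def project(points_world, K, R, t):
    """Project (N, 3) world points to (N, 2) pixels with a pinhole model.

    K: (3, 3) intrinsics, R: (3, 3) world-to-camera rotation, t: (3,) translation.
    """
    cam = points_world @ R.T + t          # world frame -> camera frame
    uvw = cam @ K.T                       # apply intrinsics
    return uvw[:, :2] / uvw[:, 2:3]       # perspective divide


def median_reprojection_error(points_world, pixels, K, R, t):
    """Median pixel distance between observed and projected points."""
    residuals = project(points_world, K, R, t) - pixels
    return float(np.median(np.linalg.norm(residuals, axis=1)))
```

A joint intrinsic and extrinsic refinement would minimise such residuals over `K`, `R`, and `t`, with the 3D points taken from georeferenced orthophotos and an elevation model as the entry describes.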
Details
Motivation: Existing calibration methods for aerial visual-inertial systems depend on dedicated maneuvers or manually placed control points and cope poorly with calibration degradation during long-term deployment. A method is needed that uses public geospatial data to keep correcting calibration parameters automatically, without manual intervention.
Method: Proposes SCAR, which derives 2D-3D correspondences from public orthophotos and digital elevation models, aligns aerial images with the georeferenced imagery, and jointly estimates camera intrinsics and extrinsics, enabling online or offline automatic calibration refinement based on external geospatial data.
Result: Validated on six large-scale aerial campaigns over two years across multiple seasons and environments, SCAR clearly outperforms the Kalibr, COLMAP, and VINS-Mono baselines, with a large drop in median reprojection error, lower visual localization rotation error, and higher overall pose accuracy.
Conclusion: SCAR delivers accurate, robust, and reproducible automatic calibration for long-term aerial operations, with no reliance on manual intervention or dedicated calibration maneuvers, making it a practical solution for field deployment.
Abstract: We introduce SCAR, a method for long-term auto-calibration refinement of aerial visual-inertial systems that exploits georeferenced satellite imagery as a persistent global reference. SCAR estimates both intrinsic and extrinsic parameters by aligning aerial images with 2D-3D correspondences derived from publicly available orthophotos and elevation models. In contrast to existing approaches that rely on dedicated calibration maneuvers or manually surveyed ground control points, our method leverages external geospatial data to detect and correct calibration degradation under field deployment conditions. We evaluate our approach on six large-scale aerial campaigns conducted over two years under diverse seasonal and environmental conditions. Across all sequences, SCAR consistently outperforms established baselines (Kalibr, COLMAP, VINS-Mono), reducing median reprojection error by a large margin, and translating these calibration gains into substantially lower visual localization rotation errors and higher pose accuracy. These results demonstrate that SCAR provides accurate, robust, and reproducible calibration over long-term aerial operations without the need for manual intervention.
[99] Parameter-Free Adaptive Multi-Scale Channel-Spatial Attention Aggregation framework for 3D Indoor Semantic Scene Completion Toward Assisting Visually Impaired
Qi He,XiangXiang Wang,Jingtao Zhang,Yongbin Yu,Hongxiang Chu,Manping Fan,JingYe Cai,Zhenglin Yang
Main category: cs.CV
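TL;DR: This paper proposes an Adaptive Multi-scale Attention Aggregation (AMAA) framework for monocular 3D semantic scene completion (SSC). It improves structural consistency and semantic accuracy through reliability-oriented voxel feature regulation and a hierarchical adaptive feature-gating strategy, improves results on NYUv2, and runs stably on an embedded platform.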
Details
Motivation: Existing monocular SSC methods lack explicit modeling of voxel-feature reliability and constraints on cross-scale information propagation. During 2D-3D projection and multi-scale fusion they are susceptible to projection diffusion and feature entanglement, which leads to insufficient structural stability and makes it hard to meet the safety requirements of indoor assistive perception for visually impaired users. Method: Building on the MonoScene architecture, proposes the AMAA framework: 1) parallel channel-spatial attention aggregation jointly calibrates the lifted voxel features along the semantic and spatial dimensions; 2) a hierarchical adaptive feature-gating strategy regulates information injection during multi-scale encoder-decoder fusion. Result: On the NYUv2 benchmark, SSC mIoU reaches 27.25% (+0.31) and SC IoU reaches 43.10% (+0.59); the full framework is deployed stably on an NVIDIA Jetson embedded platform. Conclusion: Without significantly increasing system complexity, AMAA improves the quality and structural robustness of monocular SSC and provides a reliable, deployable solution for indoor assistive perception systems for visually impaired users. Abstract: In indoor assistive perception for visually impaired users, 3D Semantic Scene Completion (SSC) is expected to provide structurally coherent and semantically consistent occupancy under strictly monocular vision for safety-critical scene understanding. However, existing monocular SSC approaches often lack explicit modeling of voxel-feature reliability and regulated cross-scale information propagation during 2D-3D projection and multi-scale fusion, making them vulnerable to projection diffusion and feature entanglement and thus limiting structural stability. To address these challenges, this paper presents an Adaptive Multi-scale Attention Aggregation (AMAA) framework built upon the MonoScene pipeline. Rather than introducing a heavier backbone, AMAA focuses on reliability-oriented feature regulation within a monocular SSC framework. Specifically, lifted voxel features are jointly calibrated in semantic and spatial dimensions through parallel channel-spatial attention aggregation, while multi-scale encoder-decoder fusion is stabilized via a hierarchical adaptive feature-gating strategy that regulates information injection across scales. Experiments on the NYUv2 benchmark demonstrate consistent improvements over MonoScene without significantly increasing system complexity: AMAA achieves 27.25% SSC mIoU (+0.31) and 43.10% SC IoU (+0.59). In addition, system-level deployment on an NVIDIA Jetson platform verifies that the complete AMAA framework can be executed stably on embedded hardware. Overall, AMAA improves monocular SSC quality and provides a reliable and deployable perception framework for indoor assistive systems targeting visually impaired users.
[100] ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding
Daichi Yashima,Shuhei Kurita,Yusuke Oda,Komei Sugiura
Main category: cs.CV
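TL;DR: This paper proposes ReMoRa, a multimodal large language model that operates directly on compressed video representations. It handles long videos efficiently by retaining sparse RGB keyframes and introducing a denoised, fine-grained motion representation (a compact proxy for optical flow), significantly improving long-video understanding.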
Details
Motivation: Long-video understanding suffers from computational intractability and redundancy: self-attention has quadratic complexity in sequence length, so processing every RGB frame directly is impractical. Method: Proposes the ReMoRa model: 1) retains sparse RGB keyframes to represent appearance; 2) encodes temporal dynamics with block-based motion representations, which a dedicated module denoises and turns into a fine-grained motion representation; 3) adopts a feature compression strategy with linear complexity. Result: On multiple long-video understanding benchmarks, including LongVideoBench, NExT-QA, and MLVU, ReMoRa significantly outperforms baseline methods. Conclusion: Directly using compressed-domain video information (keyframes plus motion representations) and optimizing its quality and efficiency is an effective new paradigm for improving MLLM long-video understanding. Abstract: While multimodal large language models (MLLMs) have shown remarkable success across a wide range of tasks, long-form video understanding remains a significant challenge. In this study, we focus on video understanding by MLLMs. This task is challenging because processing a full stream of RGB frames is computationally intractable and highly redundant, as self-attention has quadratic complexity in sequence length. In this paper, we propose ReMoRa, a video MLLM that processes videos by operating directly on their compressed representations. A sparse set of RGB keyframes is retained for appearance, while temporal dynamics are encoded as a motion representation, removing the need for sequential RGB frames. These motion representations act as a compact proxy for optical flow, capturing temporal dynamics without full frame decoding. To reduce the noise and improve the low fidelity of block-based motions, we introduce a module that denoises them and generates a fine-grained motion representation. Furthermore, our model compresses these features in a way that scales linearly with sequence length. We demonstrate the effectiveness of ReMoRa through extensive experiments across a comprehensive suite of long-video understanding benchmarks. ReMoRa outperformed baseline methods on multiple challenging benchmarks, including LongVideoBench, NExT-QA, and MLVU.
[101] Designing Production-Scale OCR for India: Multilingual and Domain-Specific Systems
Ali Faraz,Raja Kolla,Ashish Kulkarni,Shubham Agarwal
Main category: cs.CV
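TL;DR: This paper studies two training strategies for building multilingual OCR systems and finds that fine-tuning an existing OCR model gives a better accuracy-latency trade-off than an end-to-end trained multimodal approach. The proposed Chitrapathak-2 and Parichay model series reach SOTA performance on Indian multilingual OCR tasks and offer practical guidance for real-world deployment.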
Details
Motivation: Designing OCR systems for India requires balancing linguistic diversity, document heterogeneity, and deployment constraints; existing methods struggle to balance accuracy and efficiency. Method: Compares two training strategies: 1) end-to-end training of a generic vision encoder paired with a multilingual language model; 2) fine-tuning an existing OCR model that was not trained for the target languages. Also develops Parichay, a dedicated model for 9 types of Indian government documents. Result: Chitrapathak-2 reaches SOTA in Telugu (6.69 char ANLS), ranks second in the remaining languages, and is 3-6x faster; Parichay reaches 89.8% Exact Match on key-field extraction with faster inference. Conclusion: The fine-tuning strategy is superior; together, the Chitrapathak and Parichay series achieve SOTA performance in the Indian setting and provide practical guidance for building production-scale OCR pipelines. Abstract: Designing Optical Character Recognition (OCR) systems for India requires balancing linguistic diversity, document heterogeneity, and deployment constraints. In this paper, we study two training strategies for building multilingual OCR systems with Vision-Language Models through the Chitrapathak series. We first follow a popular multimodal approach, pairing a generic vision encoder with a strong multilingual language model and training the system end-to-end for OCR. Alternatively, we explore fine-tuning an existing OCR model, even though it was not trained for the target languages. Through extensive evaluation on multilingual Indic OCR benchmarks and deployment-oriented metrics, we find that the second strategy consistently achieves better accuracy-latency trade-offs. Chitrapathak-2 achieves a 3-6x speedup over its predecessor while being state-of-the-art (SOTA) in Telugu (6.69 char ANLS) and second best in the rest. In addition, we present Parichay, an independent OCR model series designed specifically for 9 Indian government documents to extract structured key fields, achieving an 89.8% Exact Match score with faster inference. Together, these systems achieve SOTA performance and provide practical guidance for building production-scale OCR pipelines in the Indian context.
[102] Visual Self-Refine: A Pixel-Guided Paradigm for Accurate Chart Parsing
Jinsong Li,Xiaoyi Dong,Yuhang Zang,Yuhang Cao,Jiaqi Wang,Dahua Lin
Main category: cs.CV
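TL;DR: This paper proposes a new paradigm called Visual Self-Refine (VSR), which uses pixel-level localization and a visual feedback mechanism to improve the accuracy of large vision-language models on visually dense tasks such as chart parsing. Building on this paradigm, it constructs the ChartVSR model and a new benchmark, ChartP-Bench.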
Details
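Motivation: Existing large vision-language models can reason and self-correct at the textual level but perform poorly on complex tasks centered on visual perception (such as chart parsing), where they are prone to data omission, misalignment, and hallucination; inspired by how humans use a finger as a "visual anchor" when reading charts, the paper introduces a visual self-feedback mechanism. Method: Proposes the Visual Self-Refine (VSR) paradigm: the model generates pixel-level localizations, visualizes them, and feeds the visualizations back to itself for intuitive inspection and correction. For chart parsing this is instantiated as ChartVSR, with a Refine Stage (iteratively using visual feedback to calibrate the pixel-level localization of all data points) and a Decode Stage (parsing structured data using the calibrated localizations as precise visual anchors); a new, highly challenging benchmark, ChartP-Bench, is also constructed. Result: ChartVSR significantly outperforms existing methods on chart parsing and, on visually dense charts in particular, greatly reduces data omission, misalignment, and hallucination errors; ChartP-Bench addresses the lack of difficulty and fine-grained evaluation in existing benchmarks; VSR is validated as a general visual feedback mechanism that can extend to other vision-centric tasks. Conclusion: By introducing interpretable, feedback-capable pixel-level visual anchors, the VSR paradigm effectively addresses LVLMs' weakness in visual perception accuracy and offers a new path toward more reliable visual understanding models. Abstract: While Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities for reasoning and self-correction at the textual level, these strengths provide minimal benefits for complex tasks centered on visual perception, such as Chart Parsing. Existing models often struggle with visually dense charts, leading to errors like data omission, misalignment, and hallucination. Inspired by the human strategy of using a finger as a "visual anchor" to ensure accuracy when reading complex charts, we propose a new paradigm named Visual Self-Refine (VSR). The core idea of VSR is to enable a model to generate pixel-level localization outputs, visualize them, and then feed these visualizations back to itself, allowing it to intuitively inspect and correct its own potential visual perception errors. We instantiate the VSR paradigm in the domain of Chart Parsing by proposing ChartVSR. This model decomposes the parsing process into two stages: a Refine Stage, where it iteratively uses visual feedback to ensure the accuracy of all data points' Pixel-level Localizations, and a Decode Stage, where it uses these verified localizations as precise visual anchors to parse the final structured data. To address the limitations of existing benchmarks, we also construct ChartP-Bench, a new and highly challenging benchmark for chart parsing. Our work also highlights VSR as a general-purpose visual feedback mechanism, offering a promising new direction for enhancing accuracy on a wide range of vision-centric tasks.
[103] MMA: Multimodal Memory Agent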
Yihao Lu,Wanru Cheng,Zeyu Zhang,Hao Tang
Main category: cs.CV
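TL;DR: This paper proposes the Multimodal Memory Agent (MMA), which uses a dynamic reliability scoring mechanism (combining source credibility, temporal decay, and conflict-aware consensus) to improve the weighting of retrieved memories and confidence-based decisions. It also builds the MMA-Bench benchmark to evaluate belief dynamics and reveals a "Visual Placebo Effect" that RAG agents inherit from foundation models.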
Details
Motivation: Long-horizon multimodal agents depend on external memory, but similarity-based retrieval often recalls stale, low-credibility, or conflicting information, leading to overconfident errors. Method: Proposes the Multimodal Memory Agent (MMA), which computes a dynamic reliability score for each retrieved item (combining source credibility, temporal decay, and conflict-aware network consensus) and uses it to reweight evidence or abstain when confidence is insufficient; also builds MMA-Bench, a controllable benchmark of multimodal contradictions. Result: On FEVER, accuracy matches the baseline while variance drops by 35.2% and selective utility improves; on LoCoMo, actionable accuracy improves and wrong answers decrease; in MMA-Bench's Vision mode, Type-B accuracy reaches 41.18% versus 0.0% for the baseline. Conclusion: Dynamic reliability modeling significantly improves the robustness and safety of multimodal memory agents and exposes the latent bias introduced by the visual modality. Abstract: Long-horizon multimodal agents depend on external memory; however, similarity-based retrieval often surfaces stale, low-credibility, or conflicting items, which can trigger overconfident errors. We propose Multimodal Memory Agent (MMA), which assigns each retrieved memory item a dynamic reliability score by combining source credibility, temporal decay, and conflict-aware network consensus, and uses this signal to reweight evidence and abstain when support is insufficient. We also introduce MMA-Bench, a programmatically generated benchmark for belief dynamics with controlled speaker reliability and structured text-vision contradictions. Using this framework, we uncover the "Visual Placebo Effect", revealing how RAG-based agents inherit latent visual biases from foundation models. On FEVER, MMA matches baseline accuracy while reducing variance by 35.2% and improving selective utility; on LoCoMo, a safety-oriented configuration improves actionable accuracy and reduces wrong answers; on MMA-Bench, MMA reaches 41.18% Type-B accuracy in Vision mode, while the baseline collapses to 0.0% under the same protocol. Code: https://github.com/AIGeeksGroup/MMA.
[104] Benchmarking Adversarial Robustness and Adversarial Training Strategies for Object Detection
Alexis Winter,Jean-Vincent Martini,Romaric Audigier,Angelique Loesch,Bertrand Luvison
Main category: cs.CV
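TL;DR: This paper proposes a unified benchmark framework for evaluating adversarial attacks on and defenses of object detection models. It finds that modern adversarial attacks transfer poorly to Transformer architectures and that adversarial training on a mix of several high-perturbation attacks works best.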
Details
Motivation: Object detection models are critical in systems such as autonomous driving, but their sensitivity to adversarial attacks poses a serious security risk; current defense research is held back by the lack of standardized evaluation, which makes it hard to compare attack or defense methods fairly. Method: Proposes a unified benchmark framework for digital, non-patch-based attacks, introduces metrics that disentangle localization and classification errors, and evaluates perturbation cost with multiple perceptual metrics; on this basis, runs large-scale experiments on state-of-the-art attacks and a variety of detectors. Result: 1) Modern adversarial attacks transfer markedly poorly from CNNs to Vision Transformers; 2) adversarial training on a dataset that mixes several high-perturbation attacks with different objectives (e.g., spatial and semantic) outperforms training on a single attack. Conclusion: A standardized evaluation benchmark is essential for advancing research on object detection robustness; improving transferability and designing diverse adversarial examples are key directions for making detectors more robust. Abstract: Object detection models are critical components of automated systems, such as autonomous vehicles and perception-based robots, but their sensitivity to adversarial attacks poses a serious security risk. Progress in defending these models lags behind classification, hindered by a lack of standardized evaluation. It is nearly impossible to thoroughly compare attack or defense methods, as existing work uses different datasets, inconsistent efficiency metrics, and varied measures of perturbation cost. This paper addresses this gap by investigating three key questions: (1) How can we create a fair benchmark to impartially compare attacks? (2) How well do modern attacks transfer across different architectures, especially from Convolutional Neural Networks to Vision Transformers? (3) What is the most effective adversarial training strategy for robust defense? To answer these, we first propose a unified benchmark framework focused on digital, non-patch-based attacks. This framework introduces specific metrics to disentangle localization and classification errors and evaluates attack cost using multiple perceptual metrics. Using this benchmark, we conduct extensive experiments on state-of-the-art attacks and a wide range of detectors. Our findings reveal two major conclusions: first, modern adversarial attacks against object detection models show a significant lack of transferability to transformer-based architectures. Second, we demonstrate that the most robust adversarial training strategy leverages a dataset composed of a mix of high-perturbation attacks with different objectives (e.g., spatial and semantic), which outperforms training on any single attack.
[105] DressWild: Feed-Forward Pose-Agnostic Garment Sewing Pattern Generation from In-the-Wild Images
Zeng Tao,Ying Jiang,Yunuo Chen,Tianyi Xie,Huamin Wang,Yingnian Wu,Yin Yang,Abishek Sampath Kumar,Kenji Tashiro,Chenfanfu Jiang
Main category: cs.CV
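TL;DR: DressWild is a new feed-forward method that generates physics-consistent 2D sewing patterns and the corresponding 3D garments directly from a single in-the-wild image, with no multi-view input or iterative optimization.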
Details
Motivation: Existing feed-forward methods struggle with diverse poses and viewpoints, while optimization-based methods are computationally expensive and hard to scale; practical applications (such as modeling, fabrication, and simulation) need editable, separable, simulation-ready sewing patterns. Method: Proposes the DressWild pipeline: a vision-language model (VLM) normalizes pose variation at the image level and extracts pose-aware, 3D-informed garment features; a transformer encoder fuses the features and predicts sewing pattern parameters that can be used directly for physical simulation, texture synthesis, and multi-layer virtual try-on. Result: Experiments show that the method robustly recovers diverse sewing patterns and the corresponding 3D garments from single in-the-wild images, without multi-view input or iterative optimization, balancing efficiency and scalability. Conclusion: DressWild offers an efficient, scalable, end-to-end feed-forward solution for realistic garment simulation and animation, and markedly improves the ability to generate editable, simulation-ready sewing patterns from a single image. Abstract: Recent advances in garment pattern generation have shown promising progress. However, existing feed-forward methods struggle with diverse poses and viewpoints, while optimization-based approaches are computationally expensive and difficult to scale. This paper focuses on sewing pattern generation for garment modeling and fabrication applications that demand editable, separable, and simulation-ready garments. We propose DressWild, a novel feed-forward pipeline that reconstructs physics-consistent 2D sewing patterns and the corresponding 3D garments from a single in-the-wild image. Given an input image, our method leverages vision-language models (VLMs) to normalize pose variations at the image level, then extracts pose-aware, 3D-informed garment features. These features are fused through a transformer-based encoder and subsequently used to predict sewing pattern parameters, which can be directly applied to physical simulation, texture synthesis, and multi-layer virtual try-on. Extensive experiments demonstrate that our approach robustly recovers diverse sewing patterns and the corresponding 3D garments from in-the-wild images without requiring multi-view inputs or iterative optimization, offering an efficient and scalable solution for realistic garment simulation and animation.
[106] Let's Split Up: Zero-Shot Classifier Edits for Fine-Grained Video Understanding
Kaiting Liu,Hazel Doughty
Main category: cs.CV
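TL;DR: This paper proposes a new task called category splitting, which edits an existing video classifier to refine a coarse category into finer subcategories without harming accuracy on other categories. The method combines zero-shot editing (exploiting the model's latent compositional structure) with low-shot fine-tuning, and its effectiveness is validated on newly constructed video benchmarks.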
Details
Motivation: Existing video recognition models are usually trained on fixed, overly coarse taxonomies and cannot adapt to the fine-grained distinctions that emerge as tasks evolve; re-annotating and retraining are costly. Method: Proposes a zero-shot editing method that exploits the latent compositional structure of video classifiers to expose fine-grained distinctions, complemented by low-shot fine-tuning initialized from the zero-shot result. Result: On the newly constructed video category-splitting benchmarks, the method substantially outperforms vision-language baselines, improving accuracy on the newly split subcategories while leaving performance on the remaining categories unchanged. Conclusion: Category splitting is a practical and scalable model-editing paradigm; combining zero-shot initialization with low-shot fine-tuning is efficient and feasible, and provides a low-cost update path for dynamically evolving recognition tasks. Abstract: Video recognition models are typically trained on fixed taxonomies which are often too coarse, collapsing distinctions in object, manner or outcome under a single label. As tasks and definitions evolve, such models cannot accommodate emerging distinctions, and collecting new annotations and retraining to accommodate such changes is costly. To address these challenges, we introduce category splitting, a new task where an existing classifier is edited to refine a coarse category into finer subcategories, while preserving accuracy elsewhere. We propose a zero-shot editing method that leverages the latent compositional structure of video classifiers to expose fine-grained distinctions without additional data. We further show that low-shot fine-tuning, while simple, is highly effective and benefits from our zero-shot initialization. Experiments on our new video benchmarks for category splitting demonstrate that our method substantially outperforms vision-language baselines, improving accuracy on the newly split categories without sacrificing performance on the rest. Project page: https://kaitingliu.github.io/Category-Splitting/.
[107] Arc2Morph: Identity-Preserving Facial Morphing with Arc2Face
Nicolò Di Domenico,Annalisa Franco,Matteo Ferrara,Davide Maltoni
Main category: cs.CV
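TL;DR: This paper proposes a new face morphing attack method based on the identity-conditioned generative model Arc2Face. It can generate high-fidelity morphed face images with high attack potential in settings without supervised live capture, with attack effectiveness comparable to traditional landmark-based methods.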
Details
Motivation: Face recognition systems used in electronic identity documents are vulnerable to face morphing attacks, mainly because passport enrollment procedures in many countries lack a controlled live face capture step. Method: Builds a new face morphing attack method on Arc2Face, an identity-conditioned face foundation model, using its compact identity representations to synthesize high-fidelity face images. Result: Validated on several mainstream morphing attack detection datasets (including two newly constructed datasets derived from FEI and ONOT); the morphing attack potential of the proposed method is comparable to that of traditional landmark-based techniques, confirming that it effectively preserves and manages identity information. Conclusion: Arc2Face-driven morphing is a new generation of attack that poses a real-world threat; it highlights the security weakness of current ID photo capture procedures and the urgent need for stronger liveness detection and identity verification. Abstract: Face morphing attacks are widely recognized as one of the most challenging threats to face recognition systems used in electronic identity documents. These attacks exploit a critical vulnerability in passport enrollment procedures adopted by many countries, where the facial image is often acquired without a supervised live capture process. In this paper, we propose a novel face morphing technique based on Arc2Face, an identity-conditioned face foundation model capable of synthesizing photorealistic facial images from compact identity representations. We demonstrate the effectiveness of the proposed approach by comparing the morphing attack potential metric on two large-scale sequestered face morphing attack detection datasets against several state-of-the-art morphing methods, as well as on two novel morphed face datasets derived from FEI and ONOT. Experimental results show that the proposed deep learning-based approach achieves a morphing attack potential comparable to that of landmark-based techniques, which have traditionally been regarded as the most challenging. These findings confirm the ability of the proposed method to effectively preserve and manage identity information during the morph generation process.
[108] A Contrastive Learning Framework Empowered by Attention-based Feature Adaptation for Street-View Image Classification
Qi You,Yitai Cheng,Zichao Zeng,James Haworth
Main category: cs.CV
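TL;DR: This paper proposes CLIP-MHAdapter, a lightweight adapter that applies multi-head self-attention to CLIP's patch tokens to improve fine-grained attribute classification of street-view images, reaching SOTA results at low computational cost.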
Details
Motivation: Existing CLIP-based methods rely mainly on global image embeddings and struggle to capture the fine-grained, localized attributes that matter in complex street scenes; street-view attribute classification is also computationally expensive overall. Method: Proposes CLIP-MHAdapter, which appends a bottleneck MLP with multi-head self-attention after the patch tokens output by the CLIP visual encoder to model inter-patch dependencies; only this lightweight adapter (about 1.4M parameters) is trained, with the CLIP backbone frozen. Result: Reaches the best or competitive accuracy on 8 attribute classification tasks on the Global StreetScapes dataset, setting several new SOTA results, with parameter count and compute cost far lower than full-model fine-tuning. Conclusion: Introducing structured attention at the patch level lets a lightweight adapter strengthen CLIP's understanding of local street-scene semantics, offering a new paradigm for efficient and accurate downstream vision tasks. Abstract: Street-view image attribute classification is a vital downstream task of image classification, enabling applications such as autonomous driving, urban analytics, and high-definition map construction. It remains computationally demanding whether training from scratch, initialising from pre-trained weights, or fine-tuning large models. Although pre-trained vision-language models such as CLIP offer rich image representations, existing adaptation or fine-tuning methods often rely on their global image embeddings, limiting their ability to capture fine-grained, localised attributes essential in complex, cluttered street scenes. To address this, we propose CLIP-MHAdapter, a variant of the current lightweight CLIP adaptation paradigm that appends a bottleneck MLP equipped with multi-head self-attention operating on patch tokens to model inter-patch dependencies. With approximately 1.4 million trainable parameters, CLIP-MHAdapter achieves superior or competitive accuracy across eight attribute classification tasks on the Global StreetScapes dataset, attaining new state-of-the-art results while maintaining low computational cost. The code is available at https://github.com/SpaceTimeLab/CLIP-MHAdapter.
[109] Unpaired Image-to-Image Translation via a Self-Supervised Semantic Bridge
Jiaming Liu,Felix Petersen,Yunhe Gao,Yabin Zhang,Hyojin Kim,Akshay S. Chaudhari,Yu Sun,Stefano Ermon,Sergios Gatidis
Main category: cs.CV
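TL;DR: This paper proposes the Self-Supervised Semantic Bridge (SSB) framework, which combines self-supervised visual encoders with diffusion models to achieve spatially faithful image translation without cross-domain supervision, significantly improving medical image synthesis and text-guided editing.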
Details
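Motivation: Existing adversarial diffusion and diffusion-inversion methods suffer from poor generalization and low reconstruction fidelity, respectively; a new paradigm is needed that requires no target-domain adversarial loss and preserves spatial consistency. Method: Proposes the SSB framework, which uses self-supervised visual encoders to learn semantic representations that are robust to appearance changes but retain geometric structure, forming a shared latent space that conditions diffusion bridge models and enables image translation without paired data. Result: Outperforms strong baselines on both in-domain and out-of-domain medical image synthesis, and extends seamlessly to high-quality text-guided editing. Conclusion: By introducing external semantic priors and self-supervised representation learning, SSB effectively decouples appearance from geometry and achieves high-fidelity, spatially consistent image translation without cross-domain supervision, suggesting a new direction for diffusion models in unpaired image translation. Abstract: Adversarial diffusion and diffusion-inversion methods have advanced unpaired image-to-image translation, but each faces key limitations. Adversarial approaches require target-domain adversarial loss during training, which can limit generalization to unseen data, while diffusion-inversion methods often produce low-fidelity translations due to imperfect inversion into noise-latent representations. In this work, we propose the Self-Supervised Semantic Bridge (SSB), a versatile framework that integrates external semantic priors into diffusion bridge models to enable spatially faithful translation without cross-domain supervision. Our key idea is to leverage self-supervised visual encoders to learn representations that are invariant to appearance changes but capture geometric structure, forming a shared latent space that conditions the diffusion bridges. Extensive experiments show that SSB outperforms strong prior methods for challenging medical image synthesis in both in-domain and out-of-domain settings, and extends easily to high-quality text-guided editing.
[110] PredMapNet: Future and Historical Reasoning for Consistent Online HD Vectorized Map Construction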
Bo Lang,Nirav Savaliya,Zhihao Zheng,Jinglun Feng,Zheng-Hang Yeh,Mooi Choo Chuah
Main category: cs.CV
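TL;DR: This paper proposes an end-to-end framework for online HD vectorized map construction. Using semantic-aware query generation, a history rasterized map memory, and history-map guidance and short-term future guidance modules, it significantly improves the temporal consistency and stability of map construction.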
Details
Motivation: Existing query-based methods suffer from random query initialization and implicit temporal modeling, which cause temporal inconsistency and instability in global map construction. Method: Proposes a Semantic-Aware Query Generator, a History Rasterized Map Memory, a History-Map Guidance Module, and a Short-Term Future Guidance module, which jointly perform map instance tracking and short-term prediction. Result: Significantly outperforms current state-of-the-art methods on the nuScenes and Argoverse2 datasets, with good efficiency. Conclusion: The proposed framework effectively addresses temporal consistency in online HD map construction and provides more robust, stable map support for autonomous driving. Abstract: High-definition (HD) maps are crucial to autonomous driving, providing structured representations of road elements to support navigation and planning. However, existing query-based methods often employ random query initialization and depend on implicit temporal modeling, which lead to temporal inconsistencies and instabilities during the construction of a global map. To overcome these challenges, we introduce a novel end-to-end framework for consistent online HD vectorized map construction, which jointly performs map instance tracking and short-term prediction. First, we propose a Semantic-Aware Query Generator that initializes queries with spatially aligned semantic masks to capture scene-level context globally. Next, we design a History Rasterized Map Memory to store fine-grained instance-level maps for each tracked instance, enabling explicit historical priors. A History-Map Guidance Module then integrates rasterized map information into track queries, improving temporal continuity. Finally, we propose a Short-Term Future Guidance module to forecast the immediate motion of map instances based on the stored history trajectories. These predicted future locations serve as hints for tracked instances to further avoid implausible predictions and maintain temporal consistency. Extensive experiments on the nuScenes and Argoverse2 datasets demonstrate that our proposed method outperforms state-of-the-art (SOTA) methods with good efficiency.
[111] VETime: Vision Enhanced Zero-Shot Time Series Anomaly Detection
Yingyuan Yang,Tian Lan,Yifei Gao,Yimeng Lu,Wenjun He,Meng Wang,Chenghao Liu,Chen Zhang
Main category: cs.CV
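TL;DR: This paper proposes VETime, the first time-series anomaly detection framework to unify the temporal and visual modalities. Through fine-grained visual-temporal alignment and dynamic fusion, it resolves the fundamental trade-off in existing models between point anomaly localization and context anomaly awareness.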
Details
Motivation: Existing foundation models for time-series anomaly detection face a fundamental trade-off between 1D temporal models (precise localization but no global context) and 2D vision models (global patterns but poor temporal alignment and coarse localization); a new paradigm combining the strengths of both is needed. Method: Proposes the VETime framework: a Reversible Image Conversion module and a Patch-Level Temporal Alignment module build a shared visual-temporal timeline; an Anomaly Window Contrastive Learning mechanism and a Task-Adaptive Multi-Modal Fusion module dynamically integrate the complementary perceptual strengths of the two modalities. Result: Significantly outperforms existing state-of-the-art models in zero-shot scenarios, with higher localization precision and lower computational overhead than current vision-based methods. Conclusion: VETime bridges the gap between temporal and visual modeling in TSAD, demonstrates the effectiveness of fine-grained cross-modal alignment and adaptive fusion, and offers a new approach to multimodal time-series understanding. Abstract: Time-series anomaly detection (TSAD) requires identifying both immediate Point Anomalies and long-range Context Anomalies. However, existing foundation models face a fundamental trade-off: 1D temporal models provide fine-grained pointwise localization but lack a global contextual perspective, while 2D vision-based models capture global patterns but suffer from information bottlenecks due to a lack of temporal alignment and coarse-grained pointwise detection. To resolve this dilemma, we propose VETime, the first TSAD framework that unifies temporal and visual modalities through fine-grained visual-temporal alignment and dynamic fusion. VETime introduces a Reversible Image Conversion and a Patch-Level Temporal Alignment module to establish a shared visual-temporal timeline, preserving discriminative details while maintaining temporal sensitivity. Furthermore, we design an Anomaly Window Contrastive Learning mechanism and a Task-Adaptive Multi-Modal Fusion to adaptively integrate the complementary perceptual strengths of both modalities. Extensive experiments demonstrate that VETime significantly outperforms state-of-the-art models in zero-shot scenarios, achieving superior localization precision with lower computational overhead than current vision-based approaches. Code available at: https://github.com/yyyangcoder/VETime.
[112] Learning Situated Awareness in the Real World
Chuhan Li,Ruilin Han,Joy Hsu,Yongyuan Liang,Rajiv Dhawan,Jiajun Wu,Ming-Hsuan Yang,Xin Eric Wang
Main category: cs.CV
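TL;DR: This paper proposes SAW-Bench, a new benchmark for evaluating the egocentric situated awareness of multimodal foundation models on real-world videos. It emphasizes understanding observer-centric spatial relations and reveals a significant performance gap between current models and humans, as well as spatial reasoning deficiencies.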
Details
Motivation: Existing benchmarks for multimodal foundation models focus mainly on environment-centric spatial relations and overlook observer-centric relational reasoning (for example, about viewpoint, pose, and motion); a new benchmark is needed to fill this gap. Method: Builds the SAW-Bench benchmark: 786 real-world videos captured with Ray-Ban Meta (Gen 2) smart glasses, covering diverse indoor and outdoor scenes; more than 2,071 human-annotated question-answer pairs; and six observer-centric situated awareness tasks. Result: The best multimodal model, Gemini 3 Flash, trails humans by 37.66%; models can exploit partial geometric cues but struggle to infer a coherent camera geometry, leading to systematic spatial reasoning errors. Conclusion: SAW-Bench pushes multimodal models from passive observation toward understanding physically grounded, observer-centric dynamic spatial intelligence, and offers a new evaluation paradigm for situated spatial intelligence. Abstract: A core aspect of human perception is situated awareness, the ability to relate ourselves to the surrounding physical environment and reason over possible actions in context. However, most existing benchmarks for multimodal foundation models (MFMs) emphasize environment-centric spatial relations (relations among objects in a scene), while largely overlooking observer-centric relationships that require reasoning relative to the agent's viewpoint, pose, and motion. To bridge this gap, we introduce SAW-Bench (Situated Awareness in the Real World), a novel benchmark for evaluating egocentric situated awareness using real-world videos. SAW-Bench comprises 786 self-recorded videos captured with Ray-Ban Meta (Gen 2) smart glasses spanning diverse indoor and outdoor environments, and over 2,071 human-annotated question-answer pairs. It probes a model's observer-centric understanding with six different awareness tasks. Our comprehensive evaluation reveals a human-model performance gap of 37.66%, even with the best-performing MFM, Gemini 3 Flash. Beyond this gap, our in-depth analysis uncovers several notable findings; for example, while models can exploit partial geometric cues in egocentric videos, they often fail to infer a coherent camera geometry, leading to systematic spatial reasoning errors. We position SAW-Bench as a benchmark for situated spatial intelligence, moving beyond passive observation to understanding physically grounded, observer-centric dynamics.
[113] Are Object-Centric Representations Better At Compositional Generalization?
Ferdinand Kapl,Amir Mohammad Karimi Mamaghan,Maximilian Seitzer,Karl Henrik Johansson,Carsten Marr,Stefan Bauer,Andrea Dittadi
Main category: cs.CV
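TL;DR: This paper proposes a compositional generalization benchmark for visual question answering (built on CLEVRTex, Super-CLEVR, and MOVi-C) and systematically evaluates how well object-centric (OC) and dense representations generalize under resource constraints, finding that OC representations have the advantage when data size, diversity, or compute is limited.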
Details
Motivation: Compositional generalization is a core human cognitive ability and a key challenge for machine learning; although object-centric (OC) representations are widely believed to help compositional generalization, systematic empirical support in visually rich scenes is lacking. Method: Builds a visual question answering benchmark on three controlled visual worlds (CLEVRTex, Super-CLEVR, MOVi-C) and compares DINOv2/SigLIP2 with their OC variants, systematically evaluating compositional generalization while controlling for training data diversity, sample size, representation size, downstream model capacity, and compute. Result: (1) OC methods perform better on harder compositional generalization tasks; (2) dense representations are slightly better only on easier tasks and need substantially more downstream compute; (3) OC models are more sample efficient, while dense encoders match or surpass OC only with ample, highly diverse data. Conclusion: When any one of dataset size, training diversity, or downstream compute is constrained, object-centric representations provide stronger compositional generalization. Abstract: Compositional generalization, the ability to reason about novel combinations of familiar concepts, is fundamental to human cognition and a critical challenge for machine learning. Object-centric (OC) representations, which encode a scene as a set of objects, are often argued to support such generalization, but systematic evidence in visually rich settings is limited. We introduce a Visual Question Answering benchmark across three controlled visual worlds (CLEVRTex, Super-CLEVR, and MOVi-C) to measure how well vision encoders, with and without object-centric biases, generalize to unseen combinations of object properties. To ensure a fair and comprehensive comparison, we carefully account for training data diversity, sample size, representation size, downstream model capacity, and compute. We use DINOv2 and SigLIP2, two widely used vision encoders, as the foundation models and their OC counterparts. Our key findings reveal that (1) OC approaches are superior in harder compositional generalization settings; (2) original dense representations surpass OC only on easier settings and typically require substantially more downstream compute; and (3) OC models are more sample efficient, achieving stronger generalization with fewer images, whereas dense encoders catch up or surpass them only with sufficient data and diversity. Overall, object-centric representations offer stronger compositional generalization when any one of dataset size, training data diversity, or downstream compute is constrained.
[114] Saliency-Aware Multi-Route Thinking: Revisiting Vision-Language Reasoning
Mingjia Shi,Yinhan He,Yaochen Zhu,Jundong Li
Main category: cs.CV
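TL;DR: This paper proposes a selection mechanism called the Saliency-Aware Principle (SAP) that improves the stability and controllability of visual grounding during reasoning in vision-language models (VLMs); it requires no additional training and is model-agnostic.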
Details
Motivation: In existing VLMs, visual input is provided only at the start of reasoning, and the subsequent autoregressive text generation lets visual grounding errors accumulate; conventional visual guidance is also coarse and noisy, making it hard to support long-text reasoning. Method: Proposes SAP, which selects among high-level reasoning principles (rather than token-level trajectories), supports dynamically revisiting visual evidence and parallel multi-route reasoning, and is model-agnostic and data-free. Result: Under the same token budget, SAP significantly reduces object hallucination, reasons more stably, and responds with lower latency, matching or even exceeding CoT-style long sequential reasoning. Conclusion: SAP gives VLMs an efficient, robust, low-overhead way to enhance reasoning and effectively mitigates the degradation of visual grounding. Abstract: Vision-language models (VLMs) aim to reason by jointly leveraging visual and textual modalities. While allocating additional inference-time computation has proven effective for large language models (LLMs), achieving similar scaling in VLMs remains challenging. A key obstacle is that visual inputs are typically provided only once at the start of generation, while textual reasoning (e.g., early visual summaries) is generated autoregressively, causing reasoning to become increasingly text-dominated and allowing early visual grounding errors to accumulate. Moreover, vanilla guidance for visual grounding during inference is often coarse and noisy, making it difficult to steer reasoning over long texts. To address these challenges, we propose Saliency-Aware Principle (SAP) selection. SAP operates on high-level reasoning principles rather than token-level trajectories, which enables stable control over discrete generation under noisy feedback while allowing later reasoning steps to re-consult visual evidence when renewed grounding is required. In addition, SAP supports multi-route inference, enabling parallel exploration of diverse reasoning behaviors. SAP is model-agnostic and data-free, requiring no additional training. Empirical results show that SAP achieves competitive performance, especially in reducing object hallucination, under comparable token-generation budgets while yielding more stable reasoning and lower response latency than CoT-style long sequential reasoning.
[115] TeCoNeRV: Leveraging Temporal Coherence for Compressible Neural Representations for Videos
Namitha Padmanabhan,Matthew Gwilliam,Abhinav Shrivastava
Main category: cs.CV
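TL;DR: This paper proposes TeCoNeRV, which uses spatial-temporal decomposition, residual storage, and temporal coherence regularization to significantly improve hypernetwork-driven implicit neural representation video compression, achieving support for higher resolutions, lower bitrates, and faster encoding.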