cs.CL

[1] Bridging the Semantic Gap: Contrastive Rewards for Multilingual Text-to-SQL

Ashish Kattamuri,Ishita Prasad,Meetu Malhotra,Arpita Vats,Rahul Raja,Albert Lie

Main category: cs.CL

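TL;DR: Proposes a new framework that combines Group Relative Policy Optimization (GRPO) with a multilingual contrastive reward signal to improve the execution accuracy and semantic accuracy of cross-lingual Text-to-SQL systems; with only 3,000 training examples, a fine-tuned 3B model surpasses a larger zero-shot 8B model in execution accuracy.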

Details Motivation: Existing Text-to-SQL methods focus too heavily on executable queries, overlook the semantic alignment challenge, and degrade markedly on non-English languages. Method: Introduces a multilingual contrastive reward signal based on semantic similarity into the GRPO framework to strengthen the semantic alignment between generated SQL and user intent. Result: On the seven-language MultiSpider dataset, GRPO fine-tuning raises LLaMA-3-3B to 87.4% execution accuracy (+26 pp over zero-shot); adding the contrastive reward raises average semantic accuracy to 59.14% (+6.85 pp) and yields 88.86% execution accuracy, above the 81.43% of the zero-shot 8B model. Conclusion: Directed semantic alignment through contrastive rewards can substantially improve cross-lingual Text-to-SQL performance with a small model and little training data. Abstract: Current Text-to-SQL methods are evaluated on, and focused only on, executable queries, overlooking the semantic alignment challenge -- both in terms of the semantic meaning of the query and the correctness of the execution results. Even execution accuracy itself shows significant drops when moving from English to other languages, with an average decline of 6 percentage points across non-English languages. We address these challenges by presenting a new framework that combines Group Relative Policy Optimization (GRPO) with a multilingual contrastive reward signal to enhance both task efficiency and semantic accuracy in Text-to-SQL systems in cross-lingual scenarios. Our method teaches models to obtain better correspondence between SQL generation and user intent by incorporating a reward signal based on semantic similarity. On the seven-language MultiSpider dataset, fine-tuning the LLaMA-3-3B model with GRPO improved the execution accuracy up to 87.4 percent (+26 pp over zero-shot) and semantic accuracy up to 52.29 percent (+32.86 pp). Adding our contrastive reward signal in the GRPO framework further improved the average semantic accuracy to 59.14 percent (+6.85 pp, up to +10 pp for Vietnamese). Our experiments showcase that a smaller, parameter-efficient 3B LLaMA model fine-tuned with our contrastive reward signal outperforms a much larger zero-shot 8B LLaMA model, with an uplift of 7.43 pp in execution accuracy (from 81.43 percent on the 8B model to 88.86 percent on the 3B model), and nearly matches its semantic accuracy (59.14 percent vs. 68.57 percent) -- all using just 3,000 reinforcement learning training examples. These results demonstrate how we can improve the performance of Text-to-SQL systems with contrastive rewards for directed semantic alignment, without requiring large-scale training datasets.
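As a rough illustration of how a semantic-similarity term could enter a GRPO-style update, the sketch below mixes an execution reward with a cosine-similarity bonus and then normalizes rewards within a group of sampled SQL candidates. The additive mix, the `weight` parameter, and the function names are assumptions made for illustration; the abstract does not specify the exact form of the contrastive reward.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def combined_reward(executes_correctly, sql_emb, question_emb, weight=0.5):
    """Hypothetical reward: execution success plus a weighted similarity bonus."""
    execution_term = 1.0 if executes_correctly else 0.0
    return execution_term + weight * cosine(sql_emb, question_emb)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]
```

Candidates whose reward exceeds the group mean receive a positive advantage, so a query that both executes and stays close to the question embedding is reinforced more strongly than one that merely executes.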

[2] From Explainability to Action: A Generative Operational Framework for Integrating XAI in Clinical Mental Health Screening

Ratna Kandala,Akshata Kishore Moharir,Divya Arvinda Nayak

Main category: cs.CL

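TL;DR: This paper proposes an LLM-based Generative Operational Framework that aims to close the gap between the technical transparency of explainable AI and its clinical utility in mental health screening.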

Details Motivation: Current XAI methods such as SHAP and LIME provide technically accurate feature-importance scores but lack actionable insights that are useful to clinicians and patients, so laboratory results are hard to translate into clinical use. Method: Proposes the Generative Operational Framework, which uses an LLM as the central translation engine and integrates clinical guidelines via RAG to turn the technical outputs of diverse XAI tools into human-readable, evidence-backed clinical narratives. Result: The framework is presented as integrating technical explanations with clinical knowledge, improving the usability of AI in clinical workflows, and supporting communication tailored to different stakeholders. Conclusion: The Generative Operational Framework offers a feasible path to bridging the lab-to-clinic gap of XAI in mental health, moving AI from isolated data points toward actionable, trustworthy clinical decision support. Abstract: Explainable Artificial Intelligence (XAI) has been presented as the critical component for unlocking the potential of machine learning in mental health screening (MHS). However, a persistent lab-to-clinic gap remains. Current XAI techniques, such as SHAP and LIME, excel at producing technically faithful outputs such as feature importance scores, but fail to deliver clinically relevant, actionable insights that can be used by clinicians or understood by patients. This disconnect between technical transparency and human utility is the primary barrier to real-world adoption. This paper argues that this gap is a translation problem and proposes the Generative Operational Framework, a novel system architecture that leverages Large Language Models (LLMs) as a central translation engine. This framework is designed to ingest the raw, technical outputs from diverse XAI tools and synthesize them with clinical guidelines (via RAG) to automatically generate human-readable, evidence-backed clinical narratives. To justify our solution, we provide a systematic analysis of the components it integrates, tracing the evolution from intrinsic models to generative XAI. We demonstrate how this framework directly addresses key operational barriers, including workflow integration, bias mitigation, and stakeholder-specific communication. This paper also provides a strategic roadmap for moving the field beyond the generation of isolated data points toward the delivery of integrated, actionable, and trustworthy AI in clinical practice.

[3] A Linguistics-Aware LLM Watermarking via Syntactic Predictability

Shinwoo Park,Hyejin Park,Hyeseon Ahn,Yo-Sub Han

Main category: cs.CL

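TL;DR: This paper proposes STELA, a new framework that dynamically adjusts watermark strength using linguistic indeterminacy modeled with part-of-speech (POS) n-grams: it weakens the watermark where grammatical constraints are strong to preserve text quality and strengthens it where linguistic flexibility is high to improve detectability, enabling publicly verifiable detection without access to model logits.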

Details Motivation: Current LLM watermarking techniques rely on model-specific signals such as token-level entropy, which prevents public verification and makes it hard to balance text quality against detection robustness. Method: STELA models linguistic indeterminacy with part-of-speech (POS) n-grams and modulates watermark strength accordingly: it lowers the strength in grammatically constrained contexts to preserve generation quality and raises it in contexts with more linguistic freedom to improve detection; the detector does not depend on any model logits, so detection is publicly verifiable. Result: Experiments on typologically diverse languages (English, Chinese, and Korean) show that STELA surpasses existing methods in detection robustness while maintaining good text quality. Conclusion: By exploiting the natural flexibility of linguistic structure, STELA delivers high-quality, robust, and publicly verifiable text watermarking, offering an effective tool for a trustworthy AI ecosystem. Abstract: As large language models (LLMs) continue to advance rapidly, reliable governance tools have become critical. Publicly verifiable watermarking is particularly essential for fostering a trustworthy AI ecosystem. A central challenge persists: balancing text quality against detection robustness. Recent studies have sought to navigate this trade-off by leveraging signals from model output distributions (e.g., token-level entropy); however, their reliance on these model-specific signals presents a significant barrier to public verification, as the detection process requires access to the logits of the underlying model. We introduce STELA, a novel framework that aligns watermark strength with the linguistic degrees of freedom inherent in language. STELA dynamically modulates the signal using part-of-speech (POS) n-gram-modeled linguistic indeterminacy, weakening it in grammatically constrained contexts to preserve quality and strengthening it in contexts with greater linguistic flexibility to enhance detectability. Our detector operates without access to any model logits, thus facilitating publicly verifiable detection. Through extensive experiments on typologically diverse languages (analytic English, isolating Chinese, and agglutinative Korean), we show that STELA surpasses prior methods in detection robustness. Our code is available at https://github.com/Shinwoo-Park/stela_watermark.

[4] Users as Annotators: LLM Preference Learning from Comparison Mode

Zhongze Cai,Xiaocheng Li

Main category: cs.CL

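TL;DR: This paper proposes a new method for aligning models with the preference data that users generate while interacting with LLMs: a user behavior model is used to infer data quality, and an expectation-maximization algorithm estimates each user's latent quality factor so that low-quality annotations can be filtered out.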

Details Motivation: Conventional LLM alignment relies on pairwise preference data annotated by professionals, which is costly and hard to scale; preference labels produced by users in everyday interactions are better matched to each user's own queries but lack quality control, so a method is needed to assess and filter user annotation quality automatically. Method: Proposes a user behavior model built on asymmetric response generation (from different models or different versions of the same model), uses an expectation-maximization (EM) algorithm to estimate each user's latent quality factor, and filters user annotations accordingly. Result: Experiments show that the method captures user behavior and, on a downstream task, improves the quality of the preference data used for LLM alignment. Conclusion: By modeling user behavior and estimating annotation quality, the method makes effective use of preference data from non-professional users and offers a low-cost route to high-quality data collection for LLM alignment. Abstract: Pairwise preference data have played an important role in the alignment of large language models (LLMs). Each sample of such data consists of a prompt, two different responses to the prompt, and a binary label indicating which of the two responses is better. The labels are usually annotated by professional human annotators. In this paper, we consider an alternative approach to collect pairwise preference data -- user annotation from comparison mode. With the increasingly wider adoption of LLMs among the population, users are contributing more and more of their preference labels through their daily interactions with the LLMs. The upside of such labels is that users are the best experts in judging the responses to their own queries/prompts, but the downside is the lack of quality control in these labels. In this paper, we consider a new idea of generating two responses from two different models or two different versions of the same model. The asymmetry allows us to make an inference of the user's data quality through our proposed user behavior model. We develop an expectation-maximization algorithm to estimate a latent quality factor of the user, and filter users' annotation data accordingly. The downstream task shows the effectiveness of our approach in both capturing the user behavior and data filtering for LLM alignment.

[5] Informed Routing in LLMs: Smarter Token-Level Computation for Faster Inference

Chao Han,Yijuan Liang,Zihao Xuan,Daokuan Wu,Wei Zhang,Xiaoyu Shen

Main category: cs.CL

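TL;DR: This paper proposes informed routing, a new approach in which a predictive module assesses each token's recoverability at inference time, enabling a flexible execute-or-approximate policy that preserves model performance while substantially reducing computation.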

Details Motivation: Existing dynamic token-level computation allocation methods rely on greedy routing, which tends to cause irreversible information loss and suboptimal selection, limiting the efficiency of LLMs in practical applications. Method: Proposes the informed routing paradigm and introduces the Lightweight Feature Forecaster (LFF), which predicts each unit's output before the routing decision is made, so that the router can decide whether to execute the computation or approximate it, taking into account both a token's importance and its recoverability. Result: Experiments on language modeling and reasoning tasks show state-of-the-art efficiency-performance trade-offs across multiple sparsity levels; even without final LoRA fine-tuning, the method matches or surpasses strong baselines that require full fine-tuning, while reducing training time by over 50%. Conclusion: By assessing token recoverability ahead of the routing decision, informed routing addresses the information loss of conventional greedy routing and offers a new solution for efficient LLM inference. Abstract: The deployment of large language models (LLMs) in real-world applications is increasingly limited by their high inference cost. While recent advances in dynamic token-level computation allocation attempt to improve efficiency by selectively activating model components per token, existing methods rely on greedy routing--a myopic execute-or-skip mechanism that often leads to irreversible information loss and suboptimal token selection. This paper introduces informed routing, a new paradigm that proactively addresses these issues. The key insight is to assess not only a token's immediate importance but also its recoverability, i.e., how well its transformation can be approximated. To this end, we propose the Lightweight Feature Forecaster (LFF), a small predictive module that estimates a unit's output before routing decisions are made. This enables a flexible execute-or-approximate policy that preserves model fidelity while drastically reducing computation. Extensive experiments on both language modeling and reasoning tasks show that informed routing achieves state-of-the-art efficiency-performance trade-offs across multiple sparsity levels. Notably, even without final LoRA fine-tuning, our method matches or surpasses strong baselines that require full fine-tuning, all while reducing training time by over 50%. The code is available at: https://github.com/EIT-NLP/informed-routing

[6] Entropy Meets Importance: A Unified Head Importance-Entropy Score for Stable and Efficient Transformer Pruning

Minsik Choi,Hyegang Son,Changhoon Kim,Young Geun Kim

Main category: cs.CL

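TL;DR: Proposes HIES, a new pruning criterion that combines head importance scores with attention entropy and yields clear gains in both model quality and stability over HIS-only methods.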

Details Motivation: Existing gradient-based Head Importance Score (HIS) methods capture only the gradient-driven contribution and ignore the diversity of attention patterns, which limits pruning quality. Method: Introduces HIES (Head Importance-Entropy Score), which combines HIS with attention entropy to assess each attention head's contribution more completely and thus identify redundant heads for pruning more effectively. Result: Experiments show that HIES-based pruning improves model quality by up to 15.2% and stability by 2.04x over HIS-only methods, enabling substantial compression without sacrificing accuracy or stability. Conclusion: By fusing importance with attention-diversity information, HIES outperforms conventional HIS and is a stronger pruning criterion for Transformer models. Abstract: Transformer-based models have achieved remarkable performance in NLP tasks. However, their structural characteristics (multiple layers and attention heads) introduce efficiency challenges in inference and deployment. To address these challenges, various pruning methods have recently been proposed. Notably, gradient-based methods using Head Importance Scores (HIS) have gained traction for their interpretability, efficiency, and ability to identify redundant heads. However, HIS alone has limitations as it captures only the gradient-driven contribution, overlooking the diversity of attention patterns. To overcome these limitations, we introduce a novel pruning criterion, HIES (Head Importance-Entropy Score), which integrates head importance scores with attention entropy, providing complementary evidence on per-head contribution. Empirically, HIES-based pruning yields up to 15.2% improvement in model quality and 2.04x improvement in stability over HIS-only methods, enabling substantial model compression without sacrificing either accuracy or stability. Code will be released upon publication.
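The abstract does not give the HIES formula, so the following is only a minimal sketch of the general idea: compute a per-head attention entropy, rescale it and a given importance score to a common range, and blend the two. The convex combination, the `alpha` parameter, the choice to treat higher entropy as evidence for keeping a head, and all function names are assumptions for illustration.

```python
import math


def attention_entropy(attn_rows):
    """Mean Shannon entropy (in nats) over the attention rows of one head."""
    total = 0.0
    for row in attn_rows:
        total += -sum(p * math.log(p) for p in row if p > 0)
    return total / len(attn_rows)


def min_max(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def blended_head_scores(importance, entropy, alpha=0.5):
    """Hypothetical blend of normalized importance and normalized entropy."""
    imp = min_max(importance)
    ent = min_max(entropy)
    return [alpha * i + (1 - alpha) * e for i, e in zip(imp, ent)]


def heads_to_prune(scores, n):
    """Indices of the n lowest-scoring heads."""
    return sorted(range(len(scores)), key=lambda h: scores[h])[:n]
```

In practice the importance values would come from a gradient-based estimate and the attention rows from forward passes over a calibration set; both are taken as given inputs here.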

[7] ConDABench: Interactive Evaluation of Language Models for Data Analysis

Avik Dutta,Priyanshu Gupta,Hosein Hasanbeig,Rahul Pratap Singh,Harshit Nigam,Sumit Gulwani,Arjun Radhakrishna,Gustavo Soares,Ashish Tiwari

Main category: cs.CL

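TL;DR: This paper presents ConDABench, a framework for generating and evaluating conversational data analysis (ConDA) tasks, aimed at the interactivity challenges of real-world data analysis with under-specified goals and unclean data.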

Details Motivation: Existing benchmarks for LLM data analysis fail to capture the complexity and interaction needs of real scenarios, so a new framework is needed to evaluate models on complex, interactive tasks. Method: Proposes a multi-agent workflow that automatically generates 1,420 ConDA problems from articles describing insights gained from public datasets, together with an evaluation harness for systematically testing existing LLMs on these tasks. Result: The evaluation shows that although newer models solve more task instances, they are not necessarily better on tasks that require sustained, long-form interaction. Conclusion: ConDABench gives model builders a way to measure progress toward more collaborative models and supports the development of LLMs for complex interactive data analysis. Abstract: Real-world data analysis tasks often come with under-specified goals and unclean data. User interaction is necessary to understand and disambiguate a user's intent, and hence, essential to solving these complex tasks. Existing benchmarks for evaluating LLMs on data analysis tasks do not capture these complexities or provide first-class support for interactivity. We introduce ConDABench, a framework for generating conversational data analysis (ConDA) benchmarks and evaluating external tools on the generated benchmarks. ConDABench consists of (a) a multi-agent workflow for generating realistic benchmarks from articles describing insights gained from public datasets, (b) 1,420 ConDA problems generated using this workflow, and (c) an evaluation harness that, for the first time, makes it possible to systematically evaluate conversational data analysis tools on the generated ConDA problems. Evaluation of state-of-the-art LLMs on the benchmarks reveals that while the new generation of models are better at solving more instances, they are not necessarily better at solving tasks that require sustained, long-form engagement. ConDABench is an avenue for model builders to measure progress towards truly collaborative models that can complete complex interactive tasks.

[8] SIMBA UQ: Similarity-Based Aggregation for Uncertainty Quantification in Large Language Models

Debarun Bhattacharjya,Balaji Ganesan,Junkyu Lee,Radu Marinescu,Katsiaryna Mirylenka,Michael Glass,Xiao Shou

Main category: cs.CL

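TL;DR: This paper studies uncertainty quantification (UQ) for LLM-generated outputs, proposes a black-box UQ framework based on output consistency, and introduces similarity-based aggregation together with new confidence estimation techniques, showing better calibration than baselines across several tasks.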

Details Motivation: Trustworthy AI systems need reliable estimates of the uncertainty of LLM outputs; black-box UQ methods are practically attractive because they need no access to model internals, but their effectiveness on complex generative tasks requires further study. Method: Proposes a high-level, non-verbalized similarity-based aggregation framework that uses the consistency among generated outputs as a proxy for correctness, and designs new confidence estimation models within this framework that are trained on small training sets. Result: Experiments on question answering, summarization, and text-to-SQL show that the proposed similarity-based UQ methods produce better calibrated confidences than baseline methods. Conclusion: Black-box UQ based on output consistency is effective for complex generative tasks, and the proposed framework and techniques improve the reliability and practicality of LLM uncertainty estimation. Abstract: When does a large language model (LLM) know what it does not know? Uncertainty quantification (UQ) provides measures of uncertainty, such as an estimate of the confidence in an LLM's generated output, and is therefore increasingly recognized as a crucial component of trusted AI systems. Black-box UQ methods do not require access to internal model information from the generating LLM and therefore have numerous real-world advantages, such as robustness to system changes, adaptability to choice of LLM, reduced costs, and computational tractability. In this paper, we investigate the effectiveness of UQ techniques that are primarily but not necessarily entirely black-box, where the consistency between a generated output and other sampled generations is used as a proxy for confidence in its correctness. We propose a high-level non-verbalized similarity-based aggregation framework that subsumes a broad swath of UQ approaches suitable for complex generative tasks, and introduce specific novel techniques from the framework that train confidence estimation models using small training sets. Through an empirical study with datasets spanning the diverse tasks of question answering, summarization, and text-to-SQL, we demonstrate that our proposed similarity-based methods can yield better calibrated confidences than baselines.
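The simplest instance of this idea scores each sampled generation by its average similarity to the other samples. The sketch below uses token-level Jaccard overlap as a stand-in similarity and a plain mean as the aggregator; both are assumptions chosen for brevity, and the paper's trained confidence estimators are not reproduced here.

```python
def jaccard(a, b):
    """Token-set overlap between two strings (a stand-in similarity measure)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def similarity_confidences(generations, sim=jaccard):
    """Confidence of each generation = mean similarity to all other samples."""
    n = len(generations)
    if n < 2:
        raise ValueError("need at least two sampled generations")
    confidences = []
    for i in range(n):
        others = [sim(generations[i], generations[j]) for j in range(n) if j != i]
        confidences.append(sum(others) / (n - 1))
    return confidences
```

An answer that most samples agree with receives a high score, while an outlier receives a low one; swapping `sim` for an embedding-based or task-specific measure (for example, comparing SQL execution results) leaves the aggregation unchanged.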

[9] Seeing Hate Differently: Hate Subspace Modeling for Culture-Aware Hate Speech Detection

Weibin Cai,Reza Zafarani

Main category: cs.CL

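TL;DR: Proposes a culture-aware framework that constructs individuals' hate subspaces to address biased training labels and cross-cultural differences in interpretation; experiments show it outperforms the state of the art by 1.05% on average across all metrics.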

Details Motivation: Existing hate speech detection methods usually overlook the real-world complexity that training labels are biased and that definitions of hate speech differ across cultural backgrounds. Method: Analyzes the challenges of data sparsity, cultural entanglement, and ambiguous labeling; proposes a culture-aware framework that models combinations of cultural attributes, uses label propagation to capture the distinctive features of each combination, and constructs individual hate subspaces to improve classification. Result: Experiments show the method outperforms the state of the art by 1.05% on average across all metrics. Conclusion: The culture-aware framework handles the label bias and ambiguity caused by cultural differences and improves hate speech detection performance. Abstract: Hate speech detection has been extensively studied, yet existing methods often overlook a real-world complexity: training labels are biased, and interpretations of what is considered hate vary across individuals with different cultural backgrounds. We first analyze these challenges, including data sparsity, cultural entanglement, and ambiguous labeling. To address them, we propose a culture-aware framework that constructs individuals' hate subspaces. To alleviate data sparsity, we model combinations of cultural attributes. For cultural entanglement and ambiguous labels, we use label propagation to capture distinctive features of each combination. Finally, we construct individual hate subspaces, which in turn can further enhance classification performance. Experiments show our method outperforms state-of-the-art by 1.05% on average across all metrics.

[10] Meronymic Ontology Extraction via Large Language Models

Dekai Zhang,Simone Conia,Antonio Rago

Main category: cs.CL

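TL;DR: This paper proposes a fully automated method that uses LLMs to extract product ontologies (specifically meronymies) from raw review texts, and it outperforms a BERT-based baseline.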

Details Motivation: Building ontologies by hand is time-consuming, expensive, and laborious, and existing automated methods leave room for improvement, so a more efficient and accurate automatic extraction method is needed. Method: Uses LLMs to extract meronymy relations directly from product review texts to build a product ontology, and evaluates the result with an LLM-as-a-judge. Result: Under LLM-based evaluation, the ontologies produced by the method surpass those of the existing BERT-based baseline. Conclusion: LLMs can support fully automated product ontology extraction, laying the groundwork for broader use of LLMs in ontology construction. Abstract: Ontologies have become essential in today's digital age as a way of organising the vast amount of readily available unstructured text. In providing formal structure to this information, ontologies have immense value and application across various domains, e.g., e-commerce, where countless product listings necessitate proper product organisation. However, the manual construction of these ontologies is a time-consuming, expensive and laborious process. In this paper, we harness the recent advancements in large language models (LLMs) to develop a fully-automated method of extracting product ontologies, in the form of meronymies, from raw review texts. We demonstrate that the ontologies produced by our method surpass an existing, BERT-based baseline when evaluated using an LLM-as-a-judge. Our investigation provides the groundwork for LLMs to be used more generally in (product or otherwise) ontology extraction.

[11] ADMIT: Few-shot Knowledge Poisoning Attacks on RAG-based Fact Checking

Yutao Wu,Xiao Liu,Yinghui Li,Yifeng Gao,Yifan Ding,Jiale Ding,Xiang Zheng,Xingjun Ma

Main category: cs.CL

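TL;DR: This paper proposes ADMIT, an adversarial multi-injection technique for knowledge poisoning attacks on retrieval-augmented generation (RAG) systems; at an extremely low poisoning rate it flips fact-checking decisions and induces deceptive justifications, without access to the target LLM or retriever.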

Details Motivation: Prior knowledge poisoning research focuses on the susceptibility of LLMs to misleading content, but in real fact-checking scenarios the retrieved results usually contain plenty of credible evidence, so attacks that remain effective in this harder setting need to be studied. Method: Proposes ADMIT, a few-shot, semantically aligned knowledge poisoning method that manipulates fact-checking outcomes by injecting multiple semantically consistent adversarial documents into the knowledge base, without access to the target LLM or retriever and without token-level control. Result: Experiments show that ADMIT transfers well across 4 retrievers, 11 LLMs, and 4 cross-domain benchmarks, reaching an average attack success rate (ASR) of 86% at a poisoning rate of only 0.93 × 10⁻⁶; it stays robust even against strong counter-evidence and improves ASR by 11.2% over prior state-of-the-art attacks. Conclusion: ADMIT exposes serious security vulnerabilities in real-world RAG-based fact-checking systems and underlines the urgency of developing more robust defenses. Abstract: Knowledge poisoning poses a critical threat to Retrieval-Augmented Generation (RAG) systems by injecting adversarial content into knowledge bases, tricking Large Language Models (LLMs) into producing attacker-controlled outputs grounded in manipulated context. Prior work highlights LLMs' susceptibility to misleading or malicious retrieved content. However, real-world fact-checking scenarios are more challenging, as credible evidence typically dominates the retrieval pool. To investigate this problem, we extend knowledge poisoning to the fact-checking setting, where retrieved context includes authentic supporting or refuting evidence. We propose ADMIT (ADversarial Multi-Injection Technique), a few-shot, semantically aligned poisoning attack that flips fact-checking decisions and induces deceptive justifications, all without access to the target LLMs, retrievers, or token-level control. Extensive experiments show that ADMIT transfers effectively across 4 retrievers, 11 LLMs, and 4 cross-domain benchmarks, achieving an average attack success rate (ASR) of 86% at an extremely low poisoning rate of $0.93 \times 10^{-6}$, and remaining robust even in the presence of strong counter-evidence. Compared with prior state-of-the-art attacks, ADMIT improves ASR by 11.2% across all settings, exposing significant vulnerabilities in real-world RAG-based fact-checking systems.

[12] Serialized EHR make for good text representations

Zhirong Chou,Quan Qin,Shi Li

Main category: cs.CL

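TL;DR: SerialBEHRT is a domain-aligned foundation model that extends SciBERT through additional pretraining on serialized electronic health record (EHR) sequences; by modeling temporal serialization it produces stronger patient representations and outperforms existing methods on antibiotic susceptibility prediction.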

Details Motivation: Existing healthcare foundation models struggle to reconcile the tabular, event-based nature of electronic health records (EHRs) with the sequential priors of natural language models, which limits their ability to capture long-range dependencies across patient encounters. Method: Proposes SerialBEHRT, which applies additional pretraining to SciBERT on structured EHR sequences and explicitly models the temporal and contextual relationships among clinical events to produce richer patient representations. Result: On antibiotic susceptibility prediction, SerialBEHRT achieves better and more consistent performance than current state-of-the-art EHR representation methods. Conclusion: Organizing EHR data as temporal sequences for foundation model pretraining improves clinical representation learning, confirming the importance of temporal serialization in the design of healthcare AI models. Abstract: The emergence of foundation models in healthcare has opened new avenues for learning generalizable representations from large scale clinical data. Yet, existing approaches often struggle to reconcile the tabular and event based nature of Electronic Health Records (EHRs) with the sequential priors of natural language models. This structural mismatch limits their ability to capture longitudinal dependencies across patient encounters. We introduce SerialBEHRT, a domain aligned foundation model that extends SciBERT through additional pretraining on structured EHR sequences. SerialBEHRT is designed to encode temporal and contextual relationships among clinical events, thereby producing richer patient representations. We evaluate its effectiveness on the task of antibiotic susceptibility prediction, a clinically meaningful problem in antibiotic stewardship. Through extensive benchmarking against state of the art EHR representation strategies, we demonstrate that SerialBEHRT achieves superior and more consistent performance, highlighting the importance of temporal serialization in foundation model pretraining for healthcare.

[13] DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models

Jinbin Zhang,Nasib Ullah,Erik Schultheis,Rohit Babbar

Main category: cs.CL

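TL;DR: This paper proposes DynaSpec, a context-dependent dynamic shortlisting mechanism that accelerates speculative decoding in LLM inference and is more robust and efficient than fixed vocabulary-subset methods.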

Details Motivation: Existing speculative decoding methods that use a fixed, frequency-ranked vocabulary subset are brittle: they do not adapt across corpora and domains, and they suppress rare or domain-specific tokens, which lowers the accepted length per verification step. Method: Introduces lightweight, coarse-grained meta-classifiers that route contexts to a small number of token clusters; the union of the selected clusters forms the drafter's shortlist, while verification still uses the full vocabulary to preserve exactness; draft encoding and meta shortlisting run in parallel so that the meta-classifier finishes early. Result: On standard speculative decoding benchmarks, DynaSpec consistently improves mean accepted length over fixed-shortlist baselines, and context-dependent selection allows smaller shortlists without degrading acceptance. Conclusion: With a dynamic, context-aware vocabulary shortlist, DynaSpec accelerates speculative decoding while preserving exact verification, and it generalizes across diverse tasks. Abstract: Speculative decoding (a.k.a. speculative sampling) has become a standard way to accelerate LLM inference: a small drafter proposes multiple tokens and a large target model verifies them once per speculation length. Recently, scaling of the LLM vocabulary has pushed the number of tokens to grow substantially. While verification over the full vocabulary leaves the target model largely unaffected, the O(|V|d) parameters in the drafter's output head become a latency bottleneck, slowing the entire pipeline. Contemporary methods (e.g., FR-Spec, VocabTrim) restrict the drafter's vocabulary to a fixed subset of the target model's vocabulary, ranked in descending order of token frequency. Although this reduces draft-time compute, it is brittle, since: (i) frequency lists are corpus-dependent and require retuning to generalize, and (ii) static shortlists suppress rare or domain-specific tokens, lowering the expected number of tokens per verification step. We propose DynaSpec, a context-dependent dynamic shortlisting mechanism that is robust, speeds up drafting, and generalizes across diverse tasks. Concretely, we introduce lightweight, coarse-grained meta-classifiers that route contexts to a small number of token clusters; the union of the top-k selected clusters forms the drafter's shortlist, while verification retains the full vocabulary and exactness. The meta-classifier finishes its computation earlier than the drafter's hidden state generation by exploiting parallel execution of draft encoding and meta shortlisting on separate streams. On standard speculative-decoding benchmarks, we observe consistent gains in mean accepted length over fixed-shortlist baselines, while context-dependent selection enables smaller shortlists without degrading acceptance.
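The shortlist construction described in the abstract (route the context to clusters, keep the top-k, take the union of their tokens) can be sketched in a few lines. The cluster scores below are taken as given; how the meta-classifier produces them and how tokens are clustered are outside this sketch, and the function and parameter names are hypothetical.

```python
def dynamic_shortlist(cluster_scores, clusters, k):
    """Union of the token ids in the k highest-scoring clusters.

    cluster_scores: one score per cluster, e.g. from a meta-classifier
    clusters: list of token-id lists, one list per cluster
    """
    ranked = sorted(
        range(len(cluster_scores)),
        key=lambda c: cluster_scores[c],
        reverse=True,
    )
    shortlist = set()
    for c in ranked[:k]:
        shortlist.update(clusters[c])
    return sorted(shortlist)


def restrict_logits(logits, shortlist):
    """Keep only the drafter logits whose token id is on the shortlist."""
    return {token_id: logits[token_id] for token_id in shortlist}
```

Only the drafter's output head is restricted this way; the target model still verifies against the full vocabulary, which is what keeps the procedure exact according to the abstract.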

[14] On-device System of Compositional Multi-tasking in Large Language Models

Ondrej Bohdal,Konstantinos Theodosiadis,Asterios Mpatziakas,Dimitris Filippidis,Iro Spyrou,Christos Zonios,Anastasios Drosou,Dimosthenis Ioannidis,Kyeng-Hun Lee,Jijoong Moon,Hyeonmok Ko,Mete Ozay,Umberto Michieli

Main category: cs.CL

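TL;DR: Proposes an efficient multi-tasking method for the combined summarization-and-translation task that adds a learnable projection layer on top of the adapters, achieving good performance while remaining computationally efficient.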

Details Motivation: Existing adapter methods struggle to handle complex compositional tasks at the same time (for example, producing a translated summary of a long conversation), so a more efficient integration scheme is needed. Method: On top of the combined summarization and translation LoRA adapters, introduces a learnable projection layer that fuses the two tasks effectively with reduced computational overhead. Result: Experiments show the method performs well and is fast in both cloud-based and on-device settings, making it suitable for high-speed, resource-constrained applications. Conclusion: The framework stays parameter-efficient while improving compositional multi-tasking and has practical potential. Abstract: Large language models (LLMs) are commonly adapted for diverse downstream tasks via parameter-efficient fine-tuning techniques such as Low-Rank Adapters (LoRA). While adapters can be combined to handle multiple tasks separately, standard approaches struggle when targeting the simultaneous execution of complex tasks, such as generating a translated summary from a long conversation. To address this challenge, we propose a novel approach tailored specifically for compositional multi-tasking scenarios involving summarization and translation. Our technique involves adding a learnable projection layer on top of the combined summarization and translation adapters. This design enables effective integration while maintaining efficiency through reduced computational overhead compared to alternative strategies requiring extensive retraining or sequential processing. We demonstrate the practical viability of our method within an on-device environment by developing an Android app capable of executing compositional tasks seamlessly. Experimental results indicate our solution performs well and is fast in both cloud-based and on-device implementations, highlighting the potential benefits of adopting our framework in real-world applications demanding high-speed operation alongside resource constraints.

[15] Language steering in latent space to mitigate unintended code-switching

Andrey Goncharov,Nikolai Kondusov,Alexey Zaytsev

Main category: cs.CL

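TL;DR: Proposes a latent-space language steering method based on principal component analysis that reduces unintended code-switching in multilingual LLMs while preserving semantics at negligible computational cost.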

Details Motivation: Multilingual LLMs often code-switch unintentionally in downstream tasks, which hurts reliability, so a lightweight way to control language identity at inference time is needed. Method: Identifies language directions by running principal component analysis on parallel translations and, at inference time, shifts token embeddings along these directions to control language identity. Result: A single principal component yields 95-99% language classification accuracy; on Qwen2.5 and Llama-3.2 models the method reduces next-token distributional divergence by up to 42%; language identity is found to concentrate in the final layers with near-perfect linear separability. Conclusion: The method suppresses code-switching in multilingual LLMs while preserving semantics, needs only a small amount of parallel data for calibration, and has low computational overhead. Abstract: Multilingual Large Language Models (LLMs) often exhibit unintended code-switching, reducing reliability in downstream tasks. We propose latent-space language steering, a lightweight inference-time method that identifies language directions via PCA on parallel translations and steers token embeddings along these axes to control language identity. Our approach mitigates code-switching while preserving semantics with negligible computational overhead and requires only minimal parallel data for calibration. Empirically, we achieve 95-99% language classification accuracy using a single principal component and reduce next-token distributional divergence by up to 42% across multiple language pairs on Qwen2.5 and Llama-3.2 models. We further analyze the layer-wise evolution of language representations, revealing that language identity concentrates in final layers with near-perfect linear separability.
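A minimal sketch of the steering idea, under several assumptions: embeddings from two languages are pooled, the top principal component is found by power iteration (so no external library is needed), the component is oriented toward the target language, and a vector is shifted along it by a hypothetical strength `alpha`. The paper's actual calibration procedure and the layers at which steering is applied are not reproduced here.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def top_principal_component(vectors, iters=200):
    """Mean and first principal direction of the rows, via power iteration."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[i] for v in vectors) / n for i in range(d)]
    centered = [[v[i] - mean[i] for i in range(d)] for v in vectors]
    w = [1.0 / (i + 1) for i in range(d)]  # deterministic, non-symmetric start
    for _ in range(iters):
        proj = [dot(x, w) for x in centered]
        new = [sum(proj[r] * centered[r][i] for r in range(n)) for i in range(d)]
        norm = math.sqrt(dot(new, new))
        if norm == 0:
            break
        w = [c / norm for c in new]
    return mean, w


def language_direction(source_vecs, target_vecs):
    """Top component of the pooled data, oriented toward the target language."""
    mean, w = top_principal_component(source_vecs + target_vecs)
    d = len(w)
    target_mean = [sum(v[i] for v in target_vecs) / len(target_vecs) for i in range(d)]
    offset = [t - m for t, m in zip(target_mean, mean)]
    if dot(offset, w) < 0:
        w = [-c for c in w]
    return w


def steer(vec, direction, alpha):
    """Shift an embedding along the language direction by strength alpha."""
    return [x + alpha * c for x, c in zip(vec, direction)]
```

The sign of the projection onto the returned direction then acts as a one-component language classifier, which mirrors the single-principal-component classification reported in the abstract.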

[16] Revisiting the UID Hypothesis in LLM Reasoning Traces

Minju Gwak,Guijin Son,Jaehyung Kim

Main category: cs.CL

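TL;DR: The paper introduces entropy-based metrics of information flow and finds that successful LLM reasoning shows globally non-uniform information density, contrary to the Uniform Information Density hypothesis for human communication.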

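Details Motivation: Inspired by the Uniform Information Density (UID) hypothesis from psycholinguistics, the paper studies how information flows during LLM reasoning, motivated in particular by concerns about the faithfulness and interpretability of intermediate Chain-of-Thought (CoT) steps. Method: Introduces entropy-based metrics to analyze information flow within LLM reasoning traces and validates them on three mathematical reasoning benchmarks. Result: Across the three mathematical benchmarks, correct reasoning shows pronounced swings in information density, i.e., global non-uniformity, the opposite of the UID pattern observed in humans. Conclusion: Successful machine reasoning does not follow a uniform information flow; this challenges existing assumptions and suggests new directions for designing more interpretable and adaptive reasoning models. Abstract: Large language models (LLMs) often solve problems using step-by-step Chain-of-Thought (CoT) reasoning, yet these intermediate steps are frequently unfaithful or hard to interpret. Inspired by the Uniform Information Density (UID) hypothesis in psycholinguistics -- which posits that humans communicate by maintaining a stable flow of information -- we introduce entropy-based metrics to analyze the information flow within reasoning traces. Surprisingly, across three challenging mathematical benchmarks, we find that successful reasoning in LLMs is globally non-uniform: correct solutions are characterized by uneven swings in information density, in stark contrast to human communication patterns. This result challenges assumptions about machine reasoning and suggests new directions for designing interpretable and adaptive reasoning models.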
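One simple way to quantify uniformity of information flow is to compute an entropy per reasoning step and then measure how much those per-step values vary across the trace. The sketch below does exactly that; treating the variance of mean step entropy as the non-uniformity score is an assumption for illustration, and the paper's specific metrics may differ.

```python
import math


def entropy(dist):
    """Shannon entropy in bits of one next-token distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


def step_information(step_token_dists):
    """Mean token-level entropy within one reasoning step."""
    return sum(entropy(d) for d in step_token_dists) / len(step_token_dists)


def global_non_uniformity(trace):
    """Variance of per-step information across a reasoning trace.

    trace: list of steps; each step is a list of next-token distributions.
    A value of 0 means perfectly uniform information density across steps.
    """
    infos = [step_information(step) for step in trace]
    mean = sum(infos) / len(infos)
    return sum((x - mean) ** 2 for x in infos) / len(infos)
```

Under this measure, a trace whose steps alternate between near-deterministic and highly uncertain predictions scores high, which is the kind of uneven pattern the abstract associates with correct solutions.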

[17] EvoEdit: Evolving Null-space Alignment for Robust and Efficient Knowledge Editing

Sicheng Lyu,Yu Gu,Xinyu Wang,Jerry Huang,Sitao Luan,Yufei Cui,Xiao-Wen Chang,Peng Lu

Main category: cs.CL

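TL;DR: EvoEdit is a new LLM editing strategy that mitigates catastrophic interference through sequential null-space alignment, enabling stable and efficient model editing; on real-world sequential knowledge-editing benchmarks it matches or beats prior state-of-the-art methods with up to a 3.53x speedup.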

Details Motivation: Existing model editing methods suffer from catastrophic interference in sequential editing, where new edits damage earlier knowledge updates, so a more stable method is needed to support continual updates of LLMs. Method: Proposes EvoEdit, which performs sequential null-space alignment for each incoming edit so that both original and previously modified knowledge representations are preserved and outputs on preserved knowledge stay invariant, thereby reducing interference. Result: On real-world sequential knowledge-editing benchmarks, EvoEdit performs better than or comparably to prior state-of-the-art locate-then-edit techniques, with up to a 3.53x speedup. Conclusion: EvoEdit offers a simple yet effective solution, with strong theoretical guarantees, for editing LLMs in dynamically evolving information settings, and underscores the need for more principled editing approaches. Abstract: Large language models (LLMs) require continual updates to rectify outdated or erroneous knowledge. Model editing has emerged as a compelling paradigm for introducing targeted modifications without the computational burden of full retraining. Existing approaches are mainly based on a locate-then-edit framework. However, in sequential editing contexts, where multiple updates are applied over time, they exhibit significant limitations and suffer from catastrophic interference, i.e., new edits compromise previously integrated updates and degrade preserved knowledge. To address these challenges, we introduce EvoEdit, a novel editing strategy that mitigates catastrophic interference through sequential null-space alignment, enabling stable and efficient model editing. By performing sequential null-space alignment for each incoming edit, EvoEdit preserves both original and previously modified knowledge representations and maintains output invariance on preserved knowledge even across long edit sequences, effectively mitigating interference. Evaluations on real-world sequential knowledge-editing benchmarks show that EvoEdit achieves performance better than or comparable to prior state-of-the-art locate-then-edit techniques, with up to 3.53 times speedup. Overall, these results underscore the necessity of developing more principled approaches for designing LLMs in dynamically evolving information settings, while providing a simple yet effective solution with strong theoretical guarantees.
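The general null-space idea can be illustrated as follows: if each row of a weight update is projected onto the orthogonal complement of the preserved key vectors, the update leaves the layer's output on those keys unchanged. The sketch below maintains an orthonormal basis of preserved keys with Gram-Schmidt and extends it after each edit. This shows the generic projection technique only; it is not EvoEdit's specific alignment procedure, and all names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def extend_basis(basis, new_keys, tol=1e-10):
    """Add new key vectors to an orthonormal basis (modified Gram-Schmidt)."""
    basis = [list(b) for b in basis]
    for key in new_keys:
        w = list(key)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip keys already spanned by the basis
            basis.append([wi / norm for wi in w])
    return basis


def project_to_null_space(update_row, basis):
    """Remove from an update row every component along the preserved keys."""
    out = list(update_row)
    for b in basis:
        coeff = dot(out, b)
        out = [oi - coeff * bi for oi, bi in zip(out, b)]
    return out
```

For sequential editing, the keys touched by each applied edit would be passed to `extend_basis` before the next edit, so later updates are constrained by both the original and the previously edited knowledge.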

[18] ConsistencyAI: A Benchmark to Assess LLMs' Factual Consistency When Responding to Different Demographic Groups

Peter Banyas,Shristi Sharma,Alistair Simmons,Atharva Vispute

Main category: cs.CL

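TL;DR: This paper introduces ConsistencyAI, an independent benchmark for measuring the factual consistency of large language models (LLMs) across different user personas. The experiments show that both the provider and the topic affect consistency: xAI's Grok-3 is the most consistent, while several lightweight models rank lowest.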

Details Motivation: To test whether large language models give factually inconsistent answers to the same question depending on the user's demographics, as a way of assessing their fairness and reliability. Method: 19 LLMs were each asked for 5 facts on each of 15 topics; the query was repeated 100 times per model, each time with a different persona added as context. Responses were converted to sentence embeddings, and a weighted average of cross-persona cosine similarity was used as the consistency score. Result: In the 100-persona experiments, consistency scores ranged from 0.7896 to 0.9065, with a mean of 0.8656. Grok-3 was the most consistent and lightweight models ranked lowest. The job market was the least consistent topic and G7 world leaders the most consistent, while topics such as vaccines and the Israeli-Palestinian conflict diverged by provider. Conclusion: The factual consistency of LLMs is shaped by both provider and topic; persona-invariant prompting strategies are needed to improve fairness and trustworthiness. Abstract: Is an LLM telling you different facts than it's telling me? This paper introduces ConsistencyAI, an independent benchmark for measuring the factual consistency of large language models (LLMs) for different personas. ConsistencyAI tests whether, when users of different demographics ask identical questions, the model responds with factually inconsistent answers. Designed without involvement from LLM providers, this benchmark offers impartial evaluation and accountability. In our experiment, we queried 19 LLMs with prompts that requested 5 facts for each of 15 topics. We repeated this query 100 times for each LLM, each time adding prompt context from a different persona selected from a subset of personas modeling the general population. We processed the responses into sentence embeddings, computed cross-persona cosine similarity, and computed the weighted average of cross-persona cosine similarity to calculate factual consistency scores. In 100-persona experiments, scores ranged from 0.9065 to 0.7896, and the mean was 0.8656, which we adopt as a benchmark threshold. xAI's Grok-3 is most consistent, while several lightweight models rank lowest. Consistency varies by topic: the job market is least consistent, G7 world leaders most consistent, and issues like vaccines or the Israeli-Palestinian conflict diverge by provider. These results show that both the provider and the topic shape the factual consistency. We release our code and interactive demo to support reproducible evaluation and encourage persona-invariant prompting strategies.
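The scoring step described in the ConsistencyAI abstract (sentence embeddings, cross-persona cosine similarity, weighted average) can be sketched roughly as below. This is only an illustration under stated assumptions, not the authors' code: the embeddings are hand-written toy vectors, the names `cosine` and `consistency_score` are hypothetical, and uniform pair weights are assumed because the abstract does not specify the weighting scheme or the embedding model.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def consistency_score(embeddings, weights=None):
    """Weighted average of cosine similarity over all persona pairs.

    `embeddings` holds one vector per persona (answers to the same prompt).
    `weights` holds one weight per pair; uniform weights are an assumption.
    """
    pairs = list(combinations(range(len(embeddings)), 2))
    if weights is None:
        weights = [1.0] * len(pairs)
    total = sum(
        w * cosine(embeddings[i], embeddings[j])
        for w, (i, j) in zip(weights, pairs)
    )
    return total / sum(weights)


# Toy data: three personas give the same answer vs. one persona diverging.
identical = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
mixed = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

score_identical = consistency_score(identical)
score_mixed = consistency_score(mixed)
print(score_identical, score_mixed)
```

In this toy setup a model that answers identically for every persona scores 1.0, and each diverging persona pulls the average down, which is the intuition behind reading higher scores as higher factual consistency.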

[19] BenchPress: A Human-in-the-Loop Annotation System for Rapid Text-to-SQL Benchmark Curation

Fabian Wenz,Omar Bouattour,Devin Yang,Justin Choi,Cecil Gregg,Nesime Tatbul,Çağatay Demiralp

Main category: cs.CL

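TL;DR: This paper presents BenchPress, a system that combines human experts with large language models (LLMs) to accelerate the construction of domain-specific text-to-SQL benchmark datasets, substantially reducing manual annotation time and cost.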

Details Motivation: Most text-to-SQL research relies on public datasets and performs poorly in real enterprise settings; building private enterprise benchmarks such as Beaver depends on extensive manual annotation of SQL logs, which is slow and expensive, so a more efficient annotation method is needed. Method: BenchPress uses retrieval-augmented generation (RAG) and LLMs to draft multiple natural language descriptions for each SQL query; human experts then select, rank, or edit the drafts, yielding a human-in-the-loop annotation workflow. Result: Experiments on enterprise SQL logs show that LLM assistance markedly reduces the time and effort required for annotation while improving annotation accuracy and benchmark reliability. Conclusion: By combining LLM generation with human verification, BenchPress lowers the cost and barrier to building domain-specific text-to-SQL benchmarks and makes model evaluation more robust, giving it practical value. Abstract: Large language models (LLMs) have been successfully applied to many tasks, including text-to-SQL generation. However, much of this work has focused on publicly available datasets, such as Fiben, Spider, and Bird. Our earlier work showed that LLMs are much less effective in querying large private enterprise data warehouses and released Beaver, the first private enterprise text-to-SQL benchmark. To create Beaver, we leveraged SQL logs, which are often readily available. However, manually annotating these logs to identify which natural language questions they answer is a daunting task. Asking database administrators, who are highly trained experts, to take on additional work to construct and validate corresponding natural language utterances is not only challenging but also quite costly. To address this challenge, we introduce BenchPress, a human-in-the-loop system designed to accelerate the creation of domain-specific text-to-SQL benchmarks. Given a SQL query, BenchPress uses retrieval-augmented generation (RAG) and LLMs to propose multiple natural language descriptions. Human experts then select, rank, or edit these drafts to ensure accuracy and domain alignment. We evaluated BenchPress on annotated enterprise SQL logs, demonstrating that LLM-assisted annotation drastically reduces the time and effort required to create high-quality benchmarks. Our results show that combining human verification with LLM-generated suggestions enhances annotation accuracy, benchmark reliability, and model evaluation robustness. 
By streamlining the creation of custom benchmarks, BenchPress offers researchers and practitioners a mechanism for assessing text-to-SQL models on a given domain-specific workload. BenchPress is freely available via our public GitHub repository at https://github.com/fabian-wenz/enterprise-txt2sql and is also accessible on our website at http://dsg-mcgraw.csail.mit.edu:5000.

[20] R2T: Rule-Encoded Loss Functions for Low-Resource Sequence Tagging

Mamadou K. Keita,Christopher Homan,Sebastien Diarra

Main category: cs.CL

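TL;DR: Proposes Rule-to-Tag (R2T), a hybrid framework that embeds linguistic rules into the neural network's training objective under a "principled learning" (PrL) paradigm. It achieves high accuracy on low-resource NLP tasks and clearly outperforms conventional approaches on Zarma POS tagging and NER.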

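Details Motivation: In low-resource language processing, labeled data is scarce and conventional supervised learning is limited. The authors explore an approach that relies on explicit linguistic rules rather than large labeled datasets to guide learning and to improve robustness to challenges such as out-of-vocabulary words. Method: The R2T framework integrates a multi-tiered system of linguistic rules into the network's training objective through an adaptive loss function, which includes a regularization term that models OOV words with principled uncertainty. The approach is an instance of principled learning (PrL), in which models are trained with task constraints rather than labeled examples alone. Result: On Zarma POS tagging, an R2T-BiLSTM model trained only on unlabeled text reaches 98.2% accuracy, outperforming an AfriBERTa baseline fine-tuned on 300 labeled sentences. On more complex tasks such as NER, R2T pre-training followed by fine-tuning on 50 labeled sentences beats a baseline trained on 300 labeled sentences. Conclusion: R2T supports the effectiveness of principled learning for low-resource NLP, indicating that combining explicit rules with neural networks can substantially reduce reliance on labeled data and improve generalization and data efficiency. Abstract: We introduce the Rule-to-Tag (R2T) framework, a hybrid approach that integrates a multi-tiered system of linguistic rules directly into a neural network's training objective. R2T's novelty lies in its adaptive loss function, which includes a regularization term that teaches the model to handle out-of-vocabulary (OOV) words with principled uncertainty. We frame this work as a case study in a paradigm we call principled learning (PrL), where models are trained with explicit task constraints rather than on labeled examples alone. Our experiments on Zarma part-of-speech (POS) tagging show that the R2T-BiLSTM model, trained only on unlabeled text, achieves 98.2% accuracy, outperforming baselines like AfriBERTa fine-tuned on 300 labeled sentences. We further show that for more complex tasks like named entity recognition (NER), R2T serves as a powerful pre-training step; a model pre-trained with R2T and fine-tuned on just 50 labeled sentences outperforms a baseline trained on 300.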

[21] Harnessing Consistency for Robust Test-Time LLM Ensemble

Zhichen Zeng,Qi Yu,Xiao Lin,Ruizhong Qiu,Xuying Ning,Tianxin Wei,Yuchen Yan,Jingrui He,Hanghang Tong

Main category: cs.CL

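TL;DR: This paper proposes CoRE, a plug-and-play technique that uses model consistency to make LLM ensembles more robust. By modeling consistency at both the token level and the model level, it mitigates ensemble failures caused by tokenization differences and uneven model capability.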

Details Motivation: Different large language models have different strengths and weaknesses. Ensembling can combine their complementary capabilities, but existing methods are not robust to the erroneous signals introduced by heterogeneous tokenization and varying model expertise, and tend to fail at both the token level and the model level. Method: CoRE combines token-level and model-level consistency. The former applies a low-pass filter to downweight highly inconsistent tokens (for example, those caused by token misalignment); the latter increases the contribution of models with high self-confidence and small divergence from the others. The method can be combined with a variety of ensemble strategies. Result: Experiments across multiple benchmarks, model combinations, and ensemble strategies show that CoRE improves ensemble performance and robustness, handling both token-level disagreement and model-level low confidence. Conclusion: By modeling consistency at fine and coarse granularity, CoRE makes LLM ensembles more robust to heterogeneity and serves as a general, effective ensemble enhancement. Abstract: Different large language models (LLMs) exhibit diverse strengths and weaknesses, and LLM ensemble serves as a promising approach to integrate their complementary capabilities. Despite substantial progress in improving ensemble quality, limited attention has been paid to the robustness of ensembles against potential erroneous signals, which often arise from heterogeneous tokenization schemes and varying model expertise. Our analysis shows that ensemble failures typically arise from both the token level and the model level: the former reflects severe disagreement in token predictions, while the latter involves low confidence and pronounced disparities among models. In light of this, we propose CoRE, a plug-and-play technique that harnesses model consistency for robust LLM ensemble, which can be seamlessly integrated with diverse ensemble methods. Token-level consistency captures fine-grained disagreements by applying a low-pass filter to downweight uncertain tokens with high inconsistency, often due to token misalignment, thereby improving robustness at a granular level. Model-level consistency models global agreement by promoting model outputs with high self-confidence and minimal divergence from others, enhancing robustness at a coarser level. Extensive experiments across diverse benchmarks, model combinations, and ensemble strategies demonstrate that CoRE consistently improves ensemble performance and robustness.

[22] Multimodal Retrieval-Augmented Generation with Large Language Models for Medical VQA

A H M Rezaul Karim,Ozlem Uzuner

Main category: cs.CL

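TL;DR: The MasonNLP system combines a general-purpose large language model with a retrieval-augmented generation (RAG) framework and placed third in the MEDIQA-WV 2025 wound-care VQA task, showing that lightweight RAG is effective for multimodal clinical NLP tasks.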

Details Motivation: To improve the answer quality and structured attribute generation of medical visual question answering systems in wound-care scenarios, in support of clinical decision-making. Method: A general-domain, instruction-tuned large language model is paired with a RAG framework built on textual and visual examples; relevant exemplars are added through simple indexing and fusion, with no extra training or complex re-ranking. Result: The system ranked 3rd among 19 teams and 51 submissions with an average score of 41.37%, improving response quality and schema adherence across dBLEU, ROUGE, BERTScore, and LLM-based metrics. Conclusion: Lightweight RAG with a general-purpose LLM is a simple and effective baseline for multimodal clinical NLP tasks. Abstract: Medical Visual Question Answering (MedVQA) enables natural language queries over medical images to support clinical decision-making and patient care. The MEDIQA-WV 2025 shared task addressed wound-care VQA, requiring systems to generate free-text responses and structured wound attributes from images and patient queries. We present the MasonNLP system, which employs a general-domain, instruction-tuned large language model with a retrieval-augmented generation (RAG) framework that incorporates textual and visual examples from in-domain data. This approach grounds outputs in clinically relevant exemplars, improving reasoning, schema adherence, and response quality across dBLEU, ROUGE, BERTScore, and LLM-based metrics. Our best-performing system ranked 3rd among 19 teams and 51 submissions with an average score of 41.37%, demonstrating that lightweight RAG with general-purpose LLMs -- a minimal inference-time layer that adds a few relevant exemplars via simple indexing and fusion, with no extra training or complex re-ranking -- provides a simple and effective baseline for multimodal clinical NLP tasks.

[23] ShishuLM: Lightweight Language Model with Hybrid Decoder-MLP Architecture and Paired Weight Sharing

Shivanshu Kumar,Gopalakrishnan Srinivasan

Main category: cs.CL

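TL;DR: Proposes ShishuLM, an efficient language model architecture that reduces parameter count and KV cache requirements, substantially lowering memory use and latency while preserving performance.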

Details Motivation: Transformer models perform well but carry high memory and compute overhead and contain architectural redundancy, so more efficient architectures are needed. Method: Building on research in AI interpretability and inference-time layer pruning, the approach exploits the roughly linear behavior of normalization plus attention in moderate-context scenarios and approximates entire transformer blocks with Multi-Layer Perceptrons (MLPs). Result: ShishuLM achieves up to a 25% reduction in memory and up to a 40% reduction in latency during both training and inference, and applies to small language models of different scales. Conclusion: From a pre-training standpoint, ShishuLM offers a viable path to more efficient small language models, particularly for use in agentic AI systems. Abstract: While the transformer architecture has achieved state-of-the-art performance on natural language processing tasks, these models impose substantial memory and computational overhead. Recent research has identified significant architectural redundancies within these models, presenting opportunities for optimization without compromising performance. Taking insights from research in AI interpretability and inference-time layer pruning, we introduce an efficient language model architecture, referred to as ShishuLM, which reduces both the parameter count and Key-Value (KV) cache requirements. Given the increasing importance of Small Language Models (SLMs) in agentic AI systems, we evaluate our approach on two SLMs of different scales. Our analysis reveals that for moderate-context scenarios, normalization coupled with attention computation is roughly linear with the input, enabling entire transformer blocks to be approximated through Multi-Layer Perceptrons (MLPs). Our results show that ShishuLM provides up to 25% reduction in memory requirements and up to 40% improvement in latency during both training and inference, compared to parent models. Our experimental and analytical findings provide insights towards building more efficient SLM architectures from a pre-training standpoint.

[24] Ensembling Large Language Models to Characterize Affective Dynamics in Student-AI Tutor Dialogues

Chenyu Zhang,Sharifa Alghowinem,Cynthia Breazeal

Main category: cs.CL

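TL;DR: This study introduces the first ensemble-LLM framework for large-scale affect sensing and analyzes 16,986 conversational turns between students and the AI tutor PyTutor, revealing how affect changes during learning. Students mostly show mildly positive affect and moderate arousal; confusion and curiosity are common; frustration is rarer but disruptive; emotional states are short-lived and volatile; and neutral moments often serve as turning points toward positive states.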

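Details Motivation: Prior work has paid considerable attention to the impact of large language models in education, but their effect on students' affective states during tutoring is poorly understood. Studying the affective dynamics of LLM-mediated instruction is needed to support the responsible use of generative AI in education. Method: An ensemble of three frontier models (Gemini, GPT-4o, Claude) produces zero-shot affect annotations for dialogues between PyTutor and 261 undergraduates, extracting valence, arousal, learning-helpfulness ratings, and free-text emotion labels. The results are fused through rank-weighted pooling and cross-model plurality consensus to produce robust emotion profiles. Result: Students interacting with the AI tutor show mildly positive affect and moderate arousal. Confusion and curiosity appear frequently; frustration is less common but still affects learning progress. Emotional states are short-lived: positive states last slightly longer but are easily disrupted. Negative emotions usually resolve quickly and sometimes turn positive, and neutral moments often act as turning points toward more positive states. Conclusion: The ensemble-LLM framework captures subtle affective changes during learning and reveals that emotional states are transient and malleable, suggesting that timely intervention at neutral moments may help steer students toward positive learning states and offering practical guidance for designing emotionally responsive AI tutors. Abstract: While recent studies have examined the learning impact of large language models (LLMs) in educational contexts, the affective dynamics of LLM-mediated tutoring remain insufficiently understood. This work introduces the first ensemble-LLM framework for large-scale affect sensing in tutoring dialogues, advancing the conversation on responsible pathways for integrating generative AI into education by attending to learners' evolving affective states. To achieve this, we analyzed two semesters' worth of 16,986 conversational turns exchanged between PyTutor, an LLM-powered AI tutor, and 261 undergraduate learners across three U.S. institutions. To investigate learners' emotional experiences, we generate zero-shot affect annotations from three frontier LLMs (Gemini, GPT-4o, Claude), including scalar ratings of valence, arousal, and learning-helpfulness, along with free-text emotion labels. These estimates are fused through rank-weighted intra-model pooling and plurality consensus across models to produce robust emotion profiles. Our analysis shows that during interaction with the AI tutor, students typically report mildly positive affect and moderate arousal. Yet learning is not uniformly smooth: confusion and curiosity are frequent companions to problem solving, and frustration, while less common, still surfaces in ways that can derail progress. Emotional states are short-lived: positive moments last slightly longer than neutral or negative ones, but they are fragile and easily disrupted. 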
Encouragingly, negative emotions often resolve quickly, sometimes rebounding directly into positive states. Neutral moments frequently act as turning points, more often steering students upward than downward, suggesting opportunities for tutors to intervene at precisely these junctures.
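The fusion step named in the abstract (rank-weighted intra-model pooling followed by plurality consensus across models) might look something like the sketch below. The abstract does not give the exact weighting or tie-breaking rules, so the 1/rank weights, the alphabetical tie-break inside a model, the "no consensus" result on cross-model ties, and the function names `pool_within_model` and `plurality` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def pool_within_model(samples):
    """Pool several ranked label lists from ONE model into a single label.

    Each sample is a list of emotion labels ordered from most to least
    likely. A label at rank r contributes 1/r (assumed weighting).
    Ties are broken alphabetically (assumed).
    """
    scores = defaultdict(float)
    for ranked_labels in samples:
        for rank, label in enumerate(ranked_labels, start=1):
            scores[label] += 1.0 / rank
    return max(sorted(scores), key=lambda label: scores[label])


def plurality(labels):
    """Return the label chosen by the most models, or None on a tie."""
    counts = Counter(labels)
    top = max(counts.values())
    winners = [label for label, count in counts.items() if count == top]
    return winners[0] if len(winners) == 1 else None


# Toy annotations for one conversational turn (labels are illustrative).
model_a = [["confusion", "curiosity"], ["confusion", "frustration"]]
model_b = [["curiosity", "confusion"], ["curiosity", "neutral"]]
model_c = [["confusion"], ["neutral", "confusion"]]

per_model = [pool_within_model(m) for m in (model_a, model_b, model_c)]
consensus = plurality(per_model)
print(per_model, consensus)
```

Returning `None` on a cross-model tie is one possible design choice; a real pipeline could instead fall back to the scalar valence and arousal ratings, which this sketch does not model.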

[25] Unlocking the Potential of Diffusion Language Models through Template Infilling

Junhoo Lee,Seungyeon Kim,Nojun Kwak

Main category: cs.CL

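TL;DR: Proposes Template Infilling (TI) for diffusion language models, combined with Dynamic Segment Allocation (DSA), which clearly outperforms the baseline on mathematical reasoning and code generation tasks.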

Details Motivation: Diffusion language models still use the prefix prompting inherited from autoregressive models, which limits their generation flexibility and performance, so a better-suited conditioning strategy is needed. Method: Template Infilling (TI) first generates a structural template for the target response and then fills in the masked segments; Dynamic Segment Allocation (DSA) adaptively adjusts segment lengths based on generation confidence. Result: An average improvement of 17.01 percentage points over the baseline on mathematical reasoning and code generation benchmarks, plus effective speedup in multi-token generation settings while maintaining generation quality. Conclusion: TI combined with DSA gives diffusion language models a more flexible and efficient way to generate, improving both generation performance and controllability. Abstract: Diffusion Language Models (DLMs) have emerged as a promising alternative to Autoregressive Language Models, yet their inference strategies remain limited to prefix-based prompting inherited from the autoregressive paradigm. In this paper, we propose Template Infilling (TI), a tailored conditioning methodology for DLMs' generation process. Unlike conventional prefix prompting, TI first generates a structural template for the target response, then fills in the masked segments. To enhance the flexibility of this structural control, we introduce Dynamic Segment Allocation (DSA), which adaptively adjusts segment lengths based on generation confidence. We demonstrate the effectiveness of our approach on mathematical reasoning and code generation benchmarks, achieving consistent improvements of 17.01 percentage points over baseline. Furthermore, we show that TI provides additional advantages in multi-token generation settings, enabling effective speedup while maintaining generation quality.

[26] Quechua Speech Datasets in Common Voice: The Case of Puno Quechua

Elwin Huaman,Wendi Huaman,Jorge Luis Huaman,Ninfa Quispe

Main category: cs.CL

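TL;DR: This paper examines the integration of Quechua languages into the Common Voice platform to address the data scarcity that low-resource languages face in speech technology. Using Puno Quechua (qxp) as a case study, it describes language onboarding and corpus collection in practice, and reports that Common Voice now hosts 191.1 hours of Quechua speech (86% validated), of which Puno Quechua contributes 12 hours (77% validated). It also proposes a research agenda covering technical challenges, community engagement, and ethical issues such as indigenous data sovereignty, with the aim of advancing inclusive voice technology and the digital empowerment of language communities.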

Details Motivation: Low-resource languages such as Quechua lack speech data, which limits their development in speech technology. Common Voice offers an open, community-driven solution that promotes the digitization of these languages and more inclusive voice technology. Method: 17 Quechua languages are brought onto the Common Voice platform, with Puno Quechua (qxp) as the focal case: the language onboarding process is carried out and both read and spontaneous speech are collected, alongside community participation and data validation. Result: Common Voice now hosts 191.1 hours of Quechua speech (86% validated), with Puno Quechua contributing 12 hours (77% validated), supporting the effectiveness of the approach and the potential of the platform. Conclusion: Common Voice offers a viable path for building data for low-resource languages; combined with technical and ethical considerations, it supports inclusive voice technology and the digital empowerment of indigenous language communities. Abstract: Under-resourced languages, such as the Quechua languages, face data and resource scarcity, hindering their development in speech technology. To address this issue, Common Voice presents a crucial opportunity to foster an open and community-driven speech dataset creation. This paper examines the integration of Quechua languages into Common Voice. We detail the current 17 Quechua languages, presenting Puno Quechua (ISO 639-3: qxp) as a focused case study that includes language onboarding and corpus collection of both reading and spontaneous speech data. Our results demonstrate that Common Voice now hosts 191.1 hours of Quechua speech (86% validated), with Puno Quechua contributing 12 hours (77% validated), highlighting Common Voice's potential. We further propose a research agenda addressing technical challenges, alongside ethical considerations for community engagement and indigenous data sovereignty. Our work contributes towards inclusive voice technology and digital empowerment of under-resourced language communities.

[27] FRACCO: A gold-standard annotated corpus of oncological entities with ICD-O-3.1 normalisation

Johann Pignat,Milena Vucetic,Christophe Gaudet-Blavignac,Jamil Zaghir,Amandine Stettler,Fanny Amrein,Jonatan Bonjour,Jean-Philippe Goldman,Olivier Michielin,Christian Lovis,Mina Bjelogrlic

Main category: cs.CL

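TL;DR: FRACCO is an expert-annotated corpus of 1301 synthetic French clinical cases that supports research on named entity recognition and concept normalisation in French oncology.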

Details Motivation: Clinical NLP tools for French text lack annotated datasets, and resources are especially scarce in oncology. Method: French clinical texts were produced by translating the Spanish CANTEMIST corpus and were annotated by two domain experts. Terms for morphology, topography, and histologic differentiation were annotated with ICD-O as the reference, and an additional layer normalises composite expressions; the normalisations were produced through a combination of automated matching and manual validation. Result: The final dataset contains 71127 ICD-O normalisations covering 399 unique morphology codes (from 2549 different expressions), 272 topography codes (from 3143 expressions), and 2043 unique composite expressions (from 11144 expressions). Conclusion: FRACCO provides a reference benchmark dataset for named entity recognition and concept normalisation in French oncology texts. Abstract: Developing natural language processing tools for clinical text requires annotated datasets, yet French oncology resources remain scarce. We present FRACCO (FRench Annotated Corpus for Clinical Oncology), an expert-annotated corpus of 1301 synthetic French clinical cases, initially translated from the Spanish CANTEMIST corpus as part of the FRASIMED initiative. Each document is annotated with terms related to morphology, topography, and histologic differentiation, using the International Classification of Diseases for Oncology (ICD-O) as reference. An additional annotation layer captures composite expression-level normalisations that combine multiple ICD-O elements into unified clinical concepts. Annotation quality was ensured through expert review: 1301 texts were manually annotated for entity spans by two domain experts. A total of 71127 ICD-O normalisations were produced through a combination of automated matching and manual validation by a team of five annotators. The final dataset represents 399 unique morphology codes (from 2549 different expressions), 272 topography codes (from 3143 different expressions), and 2043 unique composite expressions (from 11144 different expressions). This dataset provides a reference standard for named entity recognition and concept normalisation in French oncology texts.

[28] What Layers When: Learning to Skip Compute in LLMs with Residual Gates

Filipe Laitenberger,Dawid Kopiczko,Cees G. M. Snoek,Yuki M. Asano

Main category: cs.CL

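TL;DR: Proposes GateSkip, a residual-stream gating mechanism that enables token-wise layer skipping in decoder-only language models, reducing compute while maintaining high accuracy.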

Details Motivation: To reduce the inference cost of large language models without a significant loss in performance by finding an effective layer-skipping method. Method: A sigmoid-linear gate condenses the output of each Attention/MLP branch; at inference, tokens are ranked by gate value and low-importance tokens are skipped under a per-layer budget, yielding dynamic computation. Result: On long-form reasoning, up to 15% of compute is saved while over 90% of baseline accuracy is retained; on instruction-tuned models, baseline quality is matched near 50% compute savings, and accuracy even improves at full compute. Conclusion: GateSkip is a stable, differentiable layer-skipping scheme that is easy to add to pretrained models, combining efficiency gains with insight into information flow inside the model. Abstract: We introduce GateSkip, a simple residual-stream gating mechanism that enables token-wise layer skipping in decoder-only LMs. Each Attention/MLP branch is equipped with a sigmoid-linear gate that condenses the branch's output before it re-enters the residual stream. During inference we rank tokens by the gate values and skip low-importance ones using a per-layer budget. While early-exit or router-based Mixture-of-Depths models are known to be unstable and need extensive retraining, our smooth, differentiable gates fine-tune stably on top of pretrained models. On long-form reasoning, we save up to 15% compute while retaining over 90% of baseline accuracy. On instruction-tuned models we see accuracy gains at full compute and match baseline quality near 50% savings. The learned gates give insight into transformer information flow (e.g., BOS tokens act as anchors), and the method combines easily with quantization, pruning, and self-speculative decoding.
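To make the GateSkip description more concrete, here is a minimal pure-Python sketch of gated, budgeted token skipping for a single branch. It is an illustration under assumptions rather than the paper's implementation: it uses one scalar gate per token computed from the token's hidden state (the actual gate may be vector-valued and computed elsewhere), the budget is read as the fraction of tokens processed per layer, and the names `gate_values`, `select_tokens`, and `gated_layer` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate_values(hidden, w, b):
    """One scalar gate per token: sigmoid(w . h + b) (assumed gate form)."""
    return [sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b) for h in hidden]


def select_tokens(gates, budget):
    """Indices of the tokens to process, keeping the highest-gate ones.

    `budget` is the fraction of tokens this layer may process (assumed).
    """
    k = max(1, math.ceil(budget * len(gates)))
    ranked = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)
    return sorted(ranked[:k])


def gated_layer(hidden, branch, w, b, budget):
    """Apply `branch` to kept tokens only; skipped tokens pass through."""
    gates = gate_values(hidden, w, b)
    keep = set(select_tokens(gates, budget))
    output = []
    for i, h in enumerate(hidden):
        if i in keep:
            delta = branch(h)
            # The gate scales the branch output before it re-enters
            # the residual stream.
            output.append([x + gates[i] * d for x, d in zip(h, delta)])
        else:
            output.append(list(h))  # residual stream unchanged
    return output


# Toy example: three tokens with 2-dimensional hidden states.
hidden = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
w, b = [1.0, 0.0], 0.0
gates = gate_values(hidden, w, b)
kept = select_tokens(gates, budget=0.5)
out = gated_layer(hidden, lambda h: [1.0, 1.0], w, b, budget=0.5)
print(kept, out)
```

With a budget of 0.5 over three tokens, the two highest-gate tokens are processed and the lowest-gate token is copied through untouched, which is where the compute saving would come from in a real model.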

[29] TextBandit: Evaluating Probabilistic Reasoning in LLMs Through Language-Only Decision Tasks

Jimin Lim,Arjun Damerla,Arthur Jiang,Nam Le

Main category: cs.CL

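TL;DR: This paper introduces a new benchmark for evaluating how well large language models (LLMs) make sequential decisions under uncertainty in multi-armed bandit environments that provide only natural-language feedback. Qwen3-4B selected the best arm 89.2% of the time, outperforming the other LLMs and the traditional algorithms, which suggests that probabilistic reasoning can emerge from language alone.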

Details Motivation: To explore whether large language models can make sequential decisions under uncertainty using natural language alone, without numerical cues, an ability that remains underexplored. Method: A multi-armed bandit environment with text-only feedback requires LLMs to infer the latent reward structure from messages such as "you earned a token" and to act on it; the LLMs are compared with Thompson Sampling, Epsilon Greedy, UCB, and random choice. Result: Most open-source LLMs underperformed the traditional algorithms, but Qwen3-4B reached a best-arm selection rate of 89.2%, significantly better than the other LLMs and the traditional methods. Conclusion: Probabilistic reasoning and effective decision-making can emerge from language alone, and the benchmark offers a new direction for evaluating decision-making in non-numeric, natural-language settings. Abstract: Large language models (LLMs) have been shown to be increasingly capable of performing reasoning tasks, but their ability to make sequential decisions under uncertainty using only natural language remains underexplored. We introduce a novel benchmark in which LLMs interact with multi-armed bandit environments using purely textual feedback, such as "you earned a token", without access to numerical cues or explicit probabilities, so that the model must infer latent reward structures purely from linguistic cues and adapt accordingly. We evaluated the performance of four open-source LLMs and compared it to standard decision-making algorithms such as Thompson Sampling, Epsilon Greedy, Upper Confidence Bound (UCB), and random choice. While most of the LLMs underperformed compared to the baselines, Qwen3-4B achieved a best-arm selection rate of 89.2%, which significantly outperformed both the larger LLMs and traditional methods. Our findings suggest that probabilistic reasoning is able to emerge from language alone, and we present this benchmark as a step towards evaluating decision-making capabilities in naturalistic, non-numeric contexts.
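For readers unfamiliar with the baselines, the following sketch shows how a standard UCB1 agent could be run against a text-only bandit of the kind the abstract describes. It is not the benchmark's code: only the success message "you earned a token" comes from the abstract, while the failure message, the arm probabilities, and the names `text_feedback`, `parse_reward`, and `run_ucb1` are assumptions chosen for illustration.

```python
import math
import random

SUCCESS = "you earned a token"      # wording taken from the abstract
FAILURE = "you earned nothing"      # hypothetical failure message


def text_feedback(success_prob, rng):
    """Simulated environment: returns a sentence, never a number."""
    return SUCCESS if rng.random() < success_prob else FAILURE


def parse_reward(message):
    """Map the textual feedback back to a 0/1 reward for the baseline."""
    return 1 if message == SUCCESS else 0


def run_ucb1(arm_probs, steps, seed=0):
    """Standard UCB1: pull each arm once, then maximise mean + bonus."""
    rng = random.Random(seed)
    n_arms = len(arm_probs)
    pulls = [0] * n_arms
    wins = [0] * n_arms
    for t in range(1, steps + 1):
        if t <= n_arms:
            arm = t - 1  # initial round-robin
        else:
            arm = max(
                range(n_arms),
                key=lambda a: wins[a] / pulls[a]
                + math.sqrt(2.0 * math.log(t) / pulls[a]),
            )
        pulls[arm] += 1
        wins[arm] += parse_reward(text_feedback(arm_probs[arm], rng))
    return pulls


pulls = run_ucb1([0.1, 0.9], steps=2000)
best_arm_rate = pulls[1] / sum(pulls)
print(pulls, best_arm_rate)
```

The fraction of pulls that go to the better arm is one simple reading of a "best-arm selection rate"; the benchmark itself may define the metric differently, for example by looking only at the final choice.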

[30] Catch Your Breath: Adaptive Computation for Self-Paced Sequence Production

Alexandre Galashov,Matt Jones,Rosemary Ke,Yuan Cao,Vaishnavh Nagarajan,Michael C. Mozer

Main category: cs.CL

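TL;DR: Proposes "Catch Your Breath" (CYB), a class of supervised training objectives that let a language model dynamically and autonomously adjust the number of compute steps for each input token. By introducing a delay-request output and a pause token, the model can ask for extra compute when it needs it. Experiments show that CYB models match baseline performance with less training data and adapt their compute to token complexity.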

Details Motivation: Conventional language models spend a fixed amount of compute on every token, which handles varying input complexity poorly. The goal is to let the model allocate compute dynamically as needed, improving efficiency and accuracy. Method: Selecting each output token is framed as a sequential-decision problem with a time cost, and a mechanism is introduced that lets the model request a delay when it is uncertain. Three CYB loss variants are studied: CYB-AP (anytime prediction), CYB-VA (a variational approach), and CYB-DP (a penalty based on a computational budget). Result: The CYB model needs only one third of the training data the baseline needs to reach the same performance, and adapts its processing time to token-level complexity and context. For example, it often pauses after plural nouns but never after the first token of a contracted word, and shows high pause variability for ambiguous tokens such as 'won'. Conclusion: CYB losses teach the model to request extra compute when it is needed, achieving dynamic allocation of compute, reducing training data requirements while maintaining or improving performance, and showing the potential of adaptive computation in language modeling. Abstract: We explore a class of supervised training objectives that allow a language model to dynamically and autonomously scale the number of compute steps used for each input token. For any token, the model can request additional compute steps by emitting a special delay-request output. If the model is granted a delay, a specialized pause token is inserted at the next input step, providing the model with additional compute resources to generate an output. The model can request multiple pauses. To train the model to use these delay-request outputs judiciously and to calibrate its uncertainty, we frame the selection of each output token as a sequential-decision problem with a time cost. We refer to the class of methods as $\textit{Catch Your Breath}$ losses and we study three methods in this class: CYB-AP frames the model's task as anytime prediction, where an output may be required at any step and accuracy is discounted over time; CYB-VA is a variational approach that aims to maximize prediction accuracy subject to a specified distribution over stopping times; and CYB-DP imposes a penalty based on a computational budget. Through fine-tuning experiments, we identify the best performing loss variant. The CYB model needs only one third as much training data as the baseline (no pause) model needs to achieve the same performance, and half as much data as a model with pauses and a cross-entropy loss. We find that the CYB model requests additional steps when doing so improves accuracy, and the model adapts its processing time to token-level complexity and context. 
For example, it often pauses after plural nouns like $\textit{patients}$ and $\textit{challenges}$ but never pauses after the first token of contracted words like $\textit{wasn}$ and $\textit{didn}$, and it shows high variability for ambiguous tokens like $\textit{won}$, which could function as either a verb or part of a contraction.

[31] PAGE: Prompt Augmentation for text Generation Enhancement

Mauro Jose Pacchiotti,Luciana Ballejos,Mariel Ale

Main category: cs.CL

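TL;DR: This paper proposes PAGE, a framework that enriches the input with the output of lightweight auxiliary modules (such as classifiers or extractors) to improve the quality and controllability of natural language generation models on specific tasks, without requiring complex auxiliary generative models.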

Details Motivation: Existing natural language generation models perform poorly on specific tasks, and adapting them requires large amounts of additional data, so a simpler and more flexible way to improve them is needed. Method: The PAGE framework uses lightweight auxiliary modules to draw inferences from the input text and uses their output to construct an enriched input that improves generation; its modular design is easy to adapt to different tasks. Result: In a proof of concept in requirements engineering, an auxiliary module with a classifier improved the quality of generated software requirements. Conclusion: PAGE is a simple, extensible approach to generation enhancement that does not require training large auxiliary generative models and adapts well across tasks. Abstract: In recent years, natural language generative models have shown outstanding performance in text generation tasks. However, when facing specific tasks or particular requirements, they may exhibit poor performance or require adjustments that demand large amounts of additional data. This work introduces PAGE (Prompt Augmentation for text Generation Enhancement), a framework designed to assist these models through the use of simple auxiliary modules. These modules, lightweight models such as classifiers or extractors, provide inferences from the input text. The output of these auxiliaries is then used to construct an enriched input that improves the quality and controllability of the generation. Unlike other generation-assistance approaches, PAGE does not require auxiliary generative models; instead, it proposes a simpler, modular architecture that is easy to adapt to different tasks. This paper presents the proposal, its components and architecture, and reports a proof of concept in the domain of requirements engineering, where an auxiliary module with a classifier is used to improve the quality of software requirements generation.

[32] Too Open for Opinion? Embracing Open-Endedness in Large Language Models for Social Simulation

Bolei Ma,Yong Cao,Indira Sen,Anna-Carolina Haensch,Frauke Kreuter,Barbara Plank,Daniel Hershcovich

Main category: cs.CL

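TL;DR: This position paper argues that social simulation with large language models (LLMs) should value open-ended generated text rather than confining itself to closed question formats. Drawing on survey methodology and advances in NLP, the authors argue that open-ended designs capture viewpoints, reasoning, and individual differences more faithfully, improve measurement validity, reduce researcher bias, and bring social science and NLP closer together.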

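Details Motivation: Most current work on LLM-based social simulation uses closed formats such as multiple-choice questions, which underuses the generative ability of LLMs and fails to reflect the complexity and diversity of social phenomena. The authors advocate exploiting open-ended generation to make social simulation more realistic and methodologically valuable. Method: Combining decades of survey-methodology research with recent NLP advances, the paper makes a conceptual argument for open-ended text output in LLM social simulation and outlines directions for improving measurement, design, and evaluation frameworks. Result: The paper argues for several advantages of open-ended generation in LLM social simulation: better capture of viewpoint diversity and reasoning processes, support for discovering unanticipated views, less researcher-imposed directive bias, greater expressiveness and individuality, help with pretesting, and higher overall methodological utility. Conclusion: New practices and evaluation frameworks should be developed that leverage rather than constrain the open-ended generative capacity of LLMs, fostering synergy between NLP and social science. Abstract: Large Language Models (LLMs) are increasingly used to simulate public opinion and other social phenomena. Most current studies constrain these simulations to multiple-choice or short-answer formats for ease of scoring and comparison, but such closed designs overlook the inherently generative nature of LLMs. In this position paper, we argue that open-endedness, using free-form text that captures topics, viewpoints, and reasoning processes "in" LLMs, is essential for realistic social simulation. Drawing on decades of survey-methodology research and recent advances in NLP, we argue why this open-endedness is valuable in LLM social simulations, showing how it can improve measurement and design, support exploration of unanticipated views, and reduce researcher-imposed directive bias. It also captures expressiveness and individuality, aids in pretesting, and ultimately enhances methodological utility. We call for novel practices and evaluation frameworks that leverage rather than constrain the open-ended generative diversity of LLMs, creating synergies between NLP and social science.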

[33] Order from Chaos: Comparative Study of Ten Leading LLMs on Unstructured Data Categorization

Ariel Kamen

Main category: cs.CL

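TL;DR: This study evaluates ten state-of-the-art large language models on text categorization under the IAB 2.2 hierarchical taxonomy. Despite growing model scale, performance on classic metrics remains limited, and hallucination and category inflation are widespread; a multi-model ensemble substantially improves accuracy and eliminates hallucinations.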

Details Motivation: To assess how current large language models actually perform on structured text categorization and to explore how their limits in accuracy and reliability can be overcome. Method: Ten LLMs were tested on 8,660 human-annotated samples with a uniform zero-shot prompt, and evaluated with classic metrics (accuracy, precision, recall, F1-score) and LLM-specific metrics (hallucination ratio, inflation ratio, categorization cost); an ensemble method in which multiple LLMs act as independent experts was also designed. Result: The models average 34% accuracy, 42% precision, 45% recall, and 41% F1-score, with widespread hallucination and category inflation. Gemini 1.5/2.0 Flash and GPT 20B/120B offer the best cost-effectiveness, and GPT 120B hallucinates least. The ensemble method substantially improves performance, eliminates hallucinations, and reduces inflation. Conclusion: Scaling models up or improving architectures alone is not enough to improve categorization accuracy; ensemble strategies that coordinate multiple models may be an effective path to matching or surpassing human-expert performance. Abstract: This study presents a comparative evaluation of ten state-of-the-art large language models (LLMs) applied to unstructured text categorization using the Interactive Advertising Bureau (IAB) 2.2 hierarchical taxonomy. The analysis employed a uniform dataset of 8,660 human-annotated samples and identical zero-shot prompts to ensure methodological consistency across all models. Evaluation metrics included four classic measures - accuracy, precision, recall, and F1-score - and three LLM-specific indicators: hallucination ratio, inflation ratio, and categorization cost. Results show that, despite their rapid advancement, contemporary LLMs achieve only moderate classic performance, with average scores of 34% accuracy, 42% precision, 45% recall, and 41% F1-score. Hallucination and inflation ratios reveal that models frequently overproduce categories relative to human annotators. Among the evaluated systems, Gemini 1.5/2.0 Flash and GPT 20B/120B offered the most favorable cost-to-performance balance, while GPT 120B demonstrated the lowest hallucination ratio. The findings suggest that scaling and architectural improvements alone do not ensure better categorization accuracy, as the task requires compressing rich unstructured text into a limited taxonomy - a process that challenges current model architectures. To address these limitations, a separate ensemble-based approach was developed and tested. The ensemble method, in which multiple LLMs act as independent experts, substantially improved accuracy, reduced inflation, and completely eliminated hallucinations. 
These results indicate that coordinated orchestration of models - rather than sheer scale - may represent the most effective path toward achieving or surpassing human-expert performance in large-scale text categorization.

[34] Reliable Fine-Grained Evaluation of Natural Language Math Proofs

Wenjie Ma,Andrei Cojocaru,Neel Kolhe,Bradley Louie,Robin Said Sharif,Haihan Zhang,Vincent Zhuang,Matei Zaharia,Sewon Min

Main category: cs.CL

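TL;DR: This paper proposes a systematic methodology for developing and validating fine-grained evaluators of LLM-generated math proofs and introduces ProofBench, the first expert-annotated dataset of fine-grained proof ratings. The resulting evaluator, ProofGrader, significantly outperforms naive baselines and approaches the human oracle in a practical best-of-n selection task.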

Details Motivation: Evaluation of LLM mathematical reasoning has focused mainly on tasks with easily verifiable final answers, while the generation and evaluation of natural language math proofs still lacks a reliable fine-grained evaluator; an accurate automatic evaluator is needed. Method: A systematic evaluator design methodology based on a fine-grained 0-7 scoring scale; the ProofBench dataset, covering 145 problems from six major math competitions and 435 LLM-generated solutions; an exploration of the evaluator design space across backbone model, input context, instructions, and evaluation workflow, combining a strong reasoning LM, reference solutions and marking schemes, and an ensembling strategy. Result: ProofGrader achieves a Mean Absolute Error (MAE) of 0.926 against expert scores, significantly outperforming naive baselines; in best-of-n selection (n=16) it reaches an average score of 4.14 (out of 7), closing 78% of the gap between a naive binary evaluator (2.48) and the human oracle (4.62). Conclusion: The work addresses the lack of a reliable fine-grained evaluator for LLM-generated math proofs; ProofGrader is accurate and practical, and may help advance both proof generation and proof evaluation. Abstract: Recent advances in large language models (LLMs) for mathematical reasoning have largely focused on tasks with easily verifiable final answers; however, generating and verifying natural language math proofs remains an open challenge. We identify the absence of a reliable, fine-grained evaluator for LLM-generated math proofs as a critical gap. To address this, we propose a systematic methodology for developing and validating evaluators that assign fine-grained scores on a 0-7 scale to model-generated math proofs. To enable this study, we introduce ProofBench, the first expert-annotated dataset of fine-grained proof ratings, spanning 145 problems from six major math competitions (USAMO, IMO, Putnam, etc.) and 435 LLM-generated solutions from Gemini-2.5-pro, o3, and DeepSeek-R1. Using ProofBench as a testbed, we systematically explore the evaluator design space across key axes: the backbone model, input context, instructions and evaluation workflow. Our analysis delivers ProofGrader, an evaluator that combines a strong reasoning backbone LM, rich context from reference solutions and marking schemes, and a simple ensembling method; it achieves a low Mean Absolute Error (MAE) of 0.926 against expert scores, significantly outperforming naive baselines. 
Finally, we demonstrate its practical utility in a best-of-$n$ selection task: at $n=16$, ProofGrader achieves an average score of 4.14 (out of 7), closing 78% of the gap between a naive binary evaluator (2.48) and the human oracle (4.62), highlighting its potential to advance downstream proof generation.
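The two headline numbers in the ProofGrader abstract can be reproduced with a few lines of arithmetic. The sketch below shows how a "gap closed" percentage and a Mean Absolute Error are conventionally computed; the helper names are hypothetical and the three-element score lists are made-up toy values, not ProofBench data.

```python
def gap_closed(score, baseline, oracle):
    """Fraction of the baseline-to-oracle gap recovered by `score`."""
    return (score - baseline) / (oracle - baseline)


def mean_absolute_error(predicted, expert):
    """Average absolute difference between evaluator scores and expert scores."""
    if len(predicted) != len(expert) or not predicted:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(abs(p - e) for p, e in zip(predicted, expert)) / len(predicted)


# Values quoted in the abstract: binary evaluator 2.48, ProofGrader 4.14, human oracle 4.62.
fraction = gap_closed(4.14, baseline=2.48, oracle=4.62)
print(f"{fraction:.0%}")  # (4.14 - 2.48) / (4.62 - 2.48) = 1.66 / 2.14, about 78%

# Toy 0-7 scores, purely illustrative.
print(mean_absolute_error([7, 5, 0], [6, 5, 2]))
```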

[35] A Survey on Collaborating Small and Large Language Models for Performance, Cost-effectiveness, Cloud-edge Privacy, and Trustworthiness

Fali Wang,Jihai Chen,Shuhua Yang,Ali Al-Lawati,Linli Tang,Hui Liu,Suhang Wang

Main category: cs.CL

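TL;DR: This paper systematically surveys research on collaboration between small language models (SLMs) and large language models (LLMs), proposes a taxonomy organized by four objectives (performance enhancement, cost-effectiveness, cloud-edge privacy, and trustworthiness), and summarizes representative methods, design paradigms, and future challenges.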

Details Motivation: LLMs are limited by fine-tuning cost, inference latency, edge deployability, and reliability, while SLMs are efficient, lightweight, and adaptable, so frameworks in which SLMs and LLMs work together are needed to balance efficiency and performance. Method: The paper proposes a taxonomy based on collaboration objectives, covering performance enhancement, cost-effectiveness, cloud-edge privacy, and trustworthiness, and systematically reviews and analyzes existing methods within this framework. Result: It summarizes representative methods and design paradigms for SLM-LLM collaboration and identifies the open challenges the field currently faces. Conclusion: SLM-LLM collaboration is an important direction for efficient, secure, and scalable language model deployment, and collaboration mechanisms need further refinement to meet practical application needs. Abstract: Large language models (LLMs) have advanced many domains and applications but face high fine-tuning costs, inference latency, limited edge deployability, and reliability concerns. Small language models (SLMs), compact, efficient, and adaptable, offer complementary remedies. Recent work explores collaborative frameworks that fuse SLMs' specialization and efficiency with LLMs' generalization and reasoning to meet diverse objectives across tasks and deployment scenarios. Motivated by these developments, this paper presents a systematic survey of SLM-LLM collaboration organized by collaboration objectives. We propose a taxonomy with four goals: performance enhancement, cost-effectiveness, cloud-edge privacy, and trustworthiness. Within this framework, we review representative methods, summarize design paradigms, and outline open challenges and future directions toward efficient, secure, and scalable SLM-LLM collaboration.

[36] The Harder The Better: Maintaining Supervised Fine-tuning Generalization with Less but Harder Data

Zhaoyang Shang,Sibo Wei,Jianbin Guo,Rui Zhou,Lifeng Dong,Yin Luo

Main category: cs.CL

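TL;DR: Proposes THTB, a cognitive science-inspired framework for instruction data selection and annotation guidance that combines quality filtering with intrinsic and extrinsic hardness scoring to prioritize higher-level cognitive instructions, substantially improving model performance and domain adaptation while using only a small fraction of the data.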

Details Motivation: Existing methods for selecting high-quality fine-tuning data rely too heavily on the internal knowledge of large models, lack interpretability and generalization, and give little effective guidance for data annotation in specialized domains. Method: The THTB framework combines intrinsic and extrinsic hardness scoring, applies quality filtering and cognitive-difficulty assessment to instruction data, and prioritizes more challenging higher-level cognitive instructions for supervised fine-tuning and annotation guidance. Result: Experiments show that models trained on 5% of the data outperform models trained on the full dataset, and that in vertical domains a model trained on 2% of the data outperforms models trained on much larger datasets, indicating good generalization and useful annotation guidance. Conclusion: THTB offers an interpretable, quantifiable way to select instruction data efficiently and guide annotation, reducing training cost while improving performance in both general and vertical domains. Abstract: Large Language Models (LLMs) excel in general tasks, but adapting them to specialized domains relies on high-quality supervised fine-tuning (SFT) data. Although existing methods can identify subsets of high-quality data and reduce training cost to some extent, their selection process still suffers from over-reliance on LLMs' internal knowledge, weak interpretability, and limited generalization. To address these limitations, we propose THTB (The Harder The Better), a cognitive science-inspired framework for instruction data selection and annotation guidance. THTB prioritizes higher-level cognitive instructions by combining quality filtering with intrinsic and extrinsic hardness scoring, offering interpretable and quantifiable criteria for efficient SFT, both in data selection and annotation guidance. Experiments show that THTB enables models trained on only 5% of the data to outperform full-dataset training, while achieving superior generalization compared with LLM-only selection. In addition, THTB provides effective annotation guidance in vertical domains, enabling a model trained on just 2% of the data to surpass models trained on much larger datasets, demonstrating strong potential for domain adaptation. Our code, datasets, and models are available on https://github.com/DYJG-research/THTB.
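The selection step described in the THTB abstract (filter by quality, then keep the hardest few percent) can be pictured roughly as follows. This is a sketch under loud assumptions, not the released implementation: the `quality`, `intrinsic`, and `extrinsic` fields, the additive combination of the two hardness scores, and the `min_quality` threshold are all hypothetical.

```python
def select_hardest(examples, fraction=0.05, min_quality=0.5):
    """Quality-filter the pool, then keep the hardest `fraction` of the original pool."""
    # Step 1: discard low-quality instructions (threshold is an assumption).
    kept = [ex for ex in examples if ex["quality"] >= min_quality]
    # Step 2: rank by combined hardness; a plain sum is assumed here.
    ranked = sorted(
        kept,
        key=lambda ex: ex["intrinsic"] + ex["extrinsic"],
        reverse=True,
    )
    # Step 3: the budget is a share of the full pool, e.g. 5% as in the abstract.
    budget = max(1, round(fraction * len(examples)))
    return ranked[:budget]


# Toy pool of 100 instructions with synthetic scores.
pool = [
    {"id": i, "quality": 1.0 if i % 2 == 0 else 0.0, "intrinsic": i / 100, "extrinsic": 0.0}
    for i in range(100)
]
subset = select_hardest(pool)
print([ex["id"] for ex in subset])
```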

[37] Guarding the Guardrails: A Taxonomy-Driven Approach to Jailbreak Detection

Olga E. Sorokoletova,Francesco Giarrusso,Vincenzo Suriani,Daniele Nardi

Main category: cs.CL

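TL;DR: This paper presents a systematic study of jailbreak attacks on large language models. It builds a hierarchical taxonomy of 50 strategies, analyzes the effectiveness and prevalence of attack types through a red-teaming challenge, evaluates taxonomy-guided automatic detection, and releases the first Italian dataset of multi-turn adversarial dialogues.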

Details Motivation: Existing defenses mostly target single-turn attacks, lack cross-language coverage, and rely on incomplete taxonomies that fail to capture the diversity of jailbreaking techniques, so a more comprehensive, structured framework is needed to understand how different strategies exploit model vulnerabilities. Method: A structured red-teaming challenge was run to collect multi-turn adversarial dialogues (notably in Italian); prior classifications were consolidated and extended into a hierarchical taxonomy of 50 strategies in seven families, which was then used to analyze attack success rates and detection performance. Result: A hierarchical taxonomy of 50 jailbreak strategies in seven families; a finding that some strategies, such as impersonation and cognitive overload, are more effective in practice; evidence that taxonomy-guided prompting improves jailbreak detection; and an Italian dataset of 1364 multi-turn dialogues. Conclusion: A systematic taxonomy helps explain how jailbreak attacks work, exposes gaps in current defenses, and provides a basis for more robust detection methods, especially in multi-turn and multilingual settings. Abstract: Jailbreaking techniques pose a significant threat to the safety of Large Language Models (LLMs). Existing defenses typically focus on single-turn attacks, lack coverage across languages, and rely on limited taxonomies that either fail to capture the full diversity of attack strategies or emphasize risk categories rather than the jailbreaking techniques. To advance the understanding of the effectiveness of jailbreaking techniques, we conducted a structured red-teaming challenge. The outcomes of our experiments are manifold. First, we developed a comprehensive hierarchical taxonomy of 50 jailbreak strategies, consolidating and extending prior classifications into seven broad families, including impersonation, persuasion, privilege escalation, cognitive overload, obfuscation, goal conflict, and data poisoning. Second, we analyzed the data collected from the challenge to examine the prevalence and success rates of different attack types, providing insights into how specific jailbreak strategies exploit model vulnerabilities and induce misalignment. Third, we benchmarked a popular LLM for jailbreak detection, evaluating the benefits of taxonomy-guided prompting for improving automatic detection. Finally, we compiled a new Italian dataset of 1364 multi-turn adversarial dialogues, annotated with our taxonomy, enabling the study of interactions where adversarial intent emerges gradually and succeeds in bypassing traditional safeguards.

[38] Attribution Quality in AI-Generated Content: Benchmarking Style Embeddings and LLM Judges

Misam Abbas

Main category: cs.CL

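TL;DR: This paper evaluates two authorship attribution methods (fixed style embeddings and an LLM-based judge) for distinguishing human from AI-generated text, and finds that each has strengths on different text types, indicating that attribution calls for a combination of strategies.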

Details Motivation: As LLM-generated text comes ever closer to human writing, telling the source of a text apart becomes harder, and effective authorship attribution mechanisms are needed. Method: Two methods, fixed style embeddings and an instruction-tuned LLM judge (GPT-4o), are benchmarked on a human-AI parallel corpus covering six domains. Result: Style embeddings reach 82% accuracy on GPT-4o continuations versus 68% for the LLM judge; the LLM judge is slightly better on LLaMA continuations (85% vs. 81%), but the difference is not statistically significant. The LLM judge performs better on fiction and academic prose, while style embeddings are stronger on spoken transcripts and scripted dialogue. Conclusion: Authorship attribution is a multidimensional problem in which the two methods complement each other across domains, so hybrid strategies should be developed; code and data are released to support reproducible research. Abstract: Attributing authorship in the era of large language models (LLMs) is increasingly challenging as machine-generated prose rivals human writing. We benchmark two complementary attribution mechanisms, fixed Style Embeddings and an instruction-tuned LLM judge (GPT-4o), on the Human AI Parallel Corpus, an open dataset of 600 balanced instances spanning six domains (academic, news, fiction, blogs, spoken transcripts, and TV/movie scripts). Each instance contains a human prompt with both a gold continuation and an LLM-generated continuation from either GPT-4o or LLaMA-70B-Instruct. The Style Embedding baseline achieves stronger aggregate accuracy on GPT continuations (82% vs. 68%). The LLM judge is slightly better than the Style Embeddings on LLaMA continuations (85% vs. 81%), but the difference is not statistically significant. Crucially, the LLM judge significantly outperforms in fiction and academic prose, indicating semantic sensitivity, whereas embeddings dominate in spoken and scripted dialogue, reflecting structural strengths. These complementary patterns highlight attribution as a multidimensional problem requiring hybrid strategies. To support reproducibility we provide code on GitHub and derived data on Hugging Face under the MIT license. This open framework provides a reproducible benchmark for attribution quality assessment in AI-generated content, along with a review of related literature influencing this work.

[39] Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences

Julian Minder,Clément Dumas,Stewart Slocum,Helena Casademunt,Cameron Holmes,Robert West,Neel Nanda

Main category: cs.CL

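TL;DR: This paper studies the activation biases produced by finetuning large language models (LLMs) on narrow domains. The biases can be found with model-diffing methods and used to infer the content and format of the finetuning data. The authors build an LLM-based interpretability agent that uses these biases to understand the finetuning domain far better than baselines, and they confirm the phenomenon across several model architectures and scales. The biases appear to stem from overfitting; mixing in pretraining data reduces but does not fully remove the risk. The paper warns that studies using narrowly finetuned models as proxies for general finetuning may be limited, and calls for deeper investigation and more realistic case studies.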

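Details Motivation: Narrow finetuning is widely used to adapt LLMs and to build models with specific behaviors for research, but it can create strong, readable biases inside the model. Understanding these biases matters for interpretability, safety, and research validity, yet the effects of such finetuning have not been analyzed systematically and its potential to mislead is underappreciated. Method: Model diffing is used to compare activations before and after finetuning, in particular on the first few tokens of random text. The difference is used for activation steering, which produces text resembling the format and content of the finetuning data. An LLM-based interpretability agent with access to the activation bias is evaluated on identifying the finetuning domain and compared with prompting-only baselines. Experiments cover several architectures (Gemma, LLaMA, Qwen) and scales (1B to 32B parameters) on false facts, emergent misalignment, subliminal learning, and taboo word guessing, and also test whether mixing in pretraining data mitigates the bias. Result: Narrow finetuning introduces strong, detectable biases in LLM activations that reveal the content and structure of the finetuning data; steering with the activation difference generates text closely resembling that data; the agent with access to the bias significantly outperforms prompting-only baselines; the phenomenon is consistent across architectures and scales; mixing in pretraining data greatly weakens but does not fully eliminate the bias, so residual risk remains. Conclusion: Narrow finetuning leaves clear traces of the training objective in LLM activations, which matters for interpretability and safety research. The common practice of using narrowly finetuned models as a proxy for broader finetuning (such as chat-tuning) may lack ecological validity. Finetuning procedures should be improved to reduce overfitting-related biases, and more realistic case studies are needed for model diffing, safety, and interpretability research. Abstract: Finetuning on narrow domains has become an essential tool to adapt Large Language Models (LLMs) to specific tasks and to create models with known unusual properties that are useful for research. We show that narrow finetuning creates strong biases in LLM activations that can be interpreted to understand the finetuning domain. These biases can be discovered using simple tools from model diffing - the study of differences between models before and after finetuning. In particular, analyzing activation differences on the first few tokens of random text and steering by adding this difference to the model activations produces text similar to the format and general content of the finetuning data. We demonstrate that these analyses contain crucial information by creating an LLM-based interpretability agent to understand the finetuning domain. With access to the bias, the agent performs significantly better compared to baseline agents using simple prompting. Our analysis spans synthetic document finetuning for false facts, emergent misalignment, subliminal learning, and taboo word guessing game models across different architectures (Gemma, LLaMA, Qwen) and scales (1B to 32B parameters). 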
We suspect these biases reflect overfitting and find that mixing pretraining data into the finetuning corpus largely removes them, though residual risks may remain. Our work (1) demonstrates that narrowly finetuned models have salient traces of their training objective in their activations and suggests ways to improve how they are trained, (2) warns AI safety and interpretability researchers that the common practice of using such models as a proxy for studying broader finetuning (e.g., chat-tuning) might not be realistic, and (3) highlights the need for deeper investigation into the effects of narrow finetuning and development of truly realistic case studies for model-diffing, safety and interpretability research.
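The core operation in the abstract above, taking the difference between finetuned and base activations and adding it back as a steering vector, reduces to simple vector arithmetic. The sketch below shows that arithmetic on plain Python lists at a single token position; it assumes the activations have already been extracted from the two models, and the function names and toy numbers are hypothetical rather than taken from the paper's code.

```python
def mean_activation_difference(base, finetuned):
    """Average (finetuned - base) activation over prompts.

    `base` and `finetuned` are lists of equal-length vectors, one per prompt,
    captured at the same layer and token position in the two models.
    """
    if len(base) != len(finetuned) or not base:
        raise ValueError("need the same non-zero number of prompts for both models")
    hidden = len(base[0])
    return [
        sum(f[j] - b[j] for b, f in zip(base, finetuned)) / len(base)
        for j in range(hidden)
    ]


def steer(activation, difference, scale=1.0):
    """Add the scaled difference vector to an activation, as in activation steering."""
    return [a + scale * d for a, d in zip(activation, difference)]


# Toy 2-dimensional activations for two prompts; finetuning shifts dimension 0 by +1.
base_acts = [[0.0, 1.0], [2.0, 3.0]]
finetuned_acts = [[1.0, 1.0], [3.0, 3.0]]
diff = mean_activation_difference(base_acts, finetuned_acts)
print(diff)
print(steer([5.0, 5.0], diff, scale=2.0))
```

In a real setting the vectors would come from a hook on a transformer layer and the steered activation would be written back during generation; those parts depend on the model library and are left out here.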

[40] RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs

Tuan T. Nguyen,John Le,Thai T. Vu,Willy Susilo,Heath Cooper

Main category: cs.CL

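TL;DR: Proposes RAID, a framework that optimizes continuous representations in embedding space with a refusal-aware regularizer and a coherence constraint to generate natural suffixes that bypass LLM safety mechanisms, with higher attack success rates at lower cost.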

Details Motivation: Despite their strong performance, large language models remain vulnerable to jailbreak attacks, and existing methods fall short in attack efficiency and naturalness, so a more systematic way to expose these safety weaknesses is needed. Method: Discrete tokens are relaxed into continuous embeddings and optimized with a joint objective that encourages restricted content, adds a refusal-aware regularizer to avoid refusal directions, and preserves semantic coherence; a critic-guided decoding step then maps the embeddings back to natural tokens. Result: Experiments on multiple open-source LLMs show that RAID achieves higher attack success rates with fewer queries and lower computational cost than existing white-box and black-box baselines. Conclusion: Embedding-space regularization is important for understanding and mitigating LLM jailbreak vulnerabilities, and RAID provides an effective tool for evaluating and strengthening model safety. Abstract: Large language models (LLMs) achieve impressive performance across diverse tasks yet remain vulnerable to jailbreak attacks that bypass safety mechanisms. We present RAID (Refusal-Aware and Integrated Decoding), a framework that systematically probes these weaknesses by crafting adversarial suffixes that induce restricted content while preserving fluency. RAID relaxes discrete tokens into continuous embeddings and optimizes them with a joint objective that (i) encourages restricted responses, (ii) incorporates a refusal-aware regularizer to steer activations away from refusal directions in embedding space, and (iii) applies a coherence term to maintain semantic plausibility and non-redundancy. After optimization, a critic-guided decoding procedure maps embeddings back to tokens by balancing embedding affinity with language-model likelihood. This integration yields suffixes that are both effective in bypassing defenses and natural in form. Experiments on multiple open-source LLMs show that RAID achieves higher attack success rates with fewer queries and lower computational cost than recent white-box and black-box baselines. These findings highlight the importance of embedding-space regularization for understanding and mitigating LLM jailbreak vulnerabilities.

[41] Investigating Political and Demographic Associations in Large Language Models Through Moral Foundations Theory

Nicole Smith-Vaniz,Harper Lyon,Lorraine Steigner,Ben Armstrong,Nicholas Mattei

Main category: cs.CL

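TL;DR: This paper uses Moral Foundations Theory (MFT) to analyze how large language models (LLMs) respond to political and moral questions, assessing whether they show ideological leanings and how accurately they represent different ideologies under explicit prompting and role-playing.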

Details Motivation: Because LLMs are widely used as advice givers in critical areas such as medicine, law, and personal relationships, their potential biases on political and moral issues need to be understood, in particular whether their responses lean toward a specific ideology. Method: Using the Moral Foundations Theory (MFT) framework, LLM responses are compared directly with existing large-scale human data, covering the models' inherent responses, their expression of political positions under explicit prompting, and their role-playing of personas based on demographic characteristics. Result: LLM responses show quantifiable ideological leanings; under explicit prompting the models simulate different political positions fairly accurately, but deviations remain in demographic-based role-playing. Conclusion: The moral and political responses generated by LLMs depend on ideology, which calls for caution about their outputs in high-stakes applications and for further work to reduce potential bias. Abstract: Large Language Models (LLMs) have become increasingly incorporated into everyday life for many internet users, taking on significant roles as advice givers in the domains of medicine, personal relationships, and even legal matters. The importance of these roles raises questions about how and what responses LLMs make in difficult political and moral domains, especially questions about possible biases. To quantify the nature of potential biases in LLMs, various works have applied Moral Foundations Theory (MFT), a framework that categorizes human moral reasoning into five dimensions: Harm, Fairness, Ingroup Loyalty, Authority, and Purity. Previous research has used the MFT to measure differences in human participants along political, national, and cultural lines. While there has been some analysis of the responses of LLMs with respect to political stance in role-playing scenarios, no work so far has directly assessed the moral leanings in the LLM responses, nor has any connected LLM outputs with robust human data. In this paper we analyze the distinctions between LLM MFT responses and existing human research directly, investigating whether commonly available LLM responses demonstrate ideological leanings: either through their inherent responses, straightforward representations of political ideologies, or when responding from the perspectives of constructed human personas. 
We assess whether LLMs inherently generate responses that align more closely with one political ideology over another, and additionally examine how accurately LLMs can represent ideological perspectives through both explicit prompting and demographic-based role-playing. By systematically analyzing LLM behavior across these conditions and experiments, our study provides insight into the extent of political and demographic dependency in AI-generated responses.

[42] Schema for In-Context Learning

Pan Chen,Shaohong Chen,Mark Wang,Shi Xuan Leong,Priscilla Fung,Varinia Bernales,Alan Aspuru-Guzik

Main category: cs.CL

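TL;DR: This paper proposes Schema Activated In-Context Learning (SA-ICL), which draws on schema theory from cognitive science to give large language models an explicit, abstracted reasoning structure, improving their reasoning on new tasks.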

Details Motivation: Traditional in-context learning lacks an explicit mechanism for knowledge retrieval and transfer at the abstraction level; inspired by how humans use existing mental frameworks to understand new information, the authors aim for a reasoning-enhancement method closer to human cognition. Method: Key reasoning steps and their relationships are extracted from examples to build a lightweight, structured abstract schema, which is then used to augment the model's reasoning on new questions. Result: On chemistry and physics questions from the GPQA dataset, SA-ICL significantly improves several large language models, by up to 36.19%, while reducing reliance on multiple demonstrations and improving interpretability. Conclusion: SA-ICL both unifies several in-context learning strategies and opens a new path toward more human-like reasoning in large language models. Abstract: In-Context Learning (ICL) enables transformer-based language models to adapt to new tasks by conditioning on demonstration examples. However, traditional example-driven in-context learning lacks explicit modules for knowledge retrieval and transfer at the abstraction level. Inspired by cognitive science, specifically schema theory, which holds that humans interpret new information by activating pre-existing mental frameworks (schemas) to structure understanding, we introduce SCHEMA ACTIVATED IN CONTEXT LEARNING (SA-ICL). This framework extracts the representation of the building blocks of cognition for the reasoning process instilled from prior examples, creating an abstracted schema, a lightweight, structured template of key inferential steps and their relationships, which is then used to augment a model's reasoning process when presented with a novel question. We demonstrate that a broad range of large language models (LLMs) lack the capacity to form and utilize internal schema-based learning representations implicitly, but instead benefit significantly from explicit schema-based scaffolding. Across chemistry and physics questions from the GPQA dataset, our experiments show that SA-ICL consistently boosts performance, up to 36.19 percent, when the single demonstration example is of high quality, which simultaneously reduces reliance on the number of demonstrations and enhances interpretability. SCHEMA ACTIVATED IN CONTEXT LEARNING not only bridges disparate ICL strategies ranging from pattern priming to Chain-of-Thought prompting, but also paves a new path for enhancing human-like reasoning in LLMs.

[43] LLM Prompt Duel Optimizer: Efficient Label-Free Prompt Optimization

Yuanchen Wu,Saurabh Verma,Justin Lee,Fangzhou Xiong,Poppy Zhang,Amel Awadelkarim,Xu Chen,Yubai Yuan,Shawndra Hill

Main category: cs.CL

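TL;DR: Proposes PDO, a label-free prompt optimization framework in which an LLM judge provides pairwise preference feedback; it outperforms baseline methods on BBH and MS MARCO.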

Details Motivation: To reduce dependence on high-quality labeled data and address the cost and slowness of obtaining labels in prompt engineering. Method: Prompt optimization is modeled as a dueling-bandit problem; Double Thompson Sampling (D-TS) selects informative prompt pairs, and new candidates are generated by mutating high-performing prompts. Result: On BIG-bench Hard and MS MARCO, PDO outperforms baseline methods in both sample efficiency and performance, and ablation studies confirm the contribution of D-TS and prompt mutation. Conclusion: PDO is an efficient and flexible label-free prompt optimization framework that can also be extended to partially labeled settings to mitigate judge noise. Abstract: Large language models (LLMs) are highly sensitive to their input prompts, making prompt design a central challenge. While automatic prompt optimization (APO) reduces manual engineering, most approaches assume access to ground-truth references such as labeled validation data. In practice, however, collecting high-quality labels is costly and slow. We propose the Prompt Duel Optimizer (PDO), a sample-efficient framework for label-free prompt optimization. PDO formulates the problem as a dueling-bandit setting, where the supervision signal comes from pairwise preference feedback provided by an LLM judge. The framework combines Double Thompson Sampling (D-TS), which prioritizes informative prompt comparisons, with Top-Performer Guided Mutation, which expands the candidate pool by mutating high-performing prompts. PDO naturally operates in label-free settings and can also incorporate partial labels to mitigate judge noise. Experiments on BIG-bench Hard (BBH) and MS MARCO show that PDO consistently outperforms baseline methods. Ablation studies further demonstrate the effectiveness of both D-TS and prompt mutation.
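To make the dueling-bandit framing in the PDO abstract concrete, here is a deliberately simplified Thompson-sampling duel selector. It is not the full Double Thompson Sampling algorithm and not PDO's code: it keeps only the idea of sampling pairwise win probabilities from Beta posteriors twice (once to pick a likely champion, once to pick its challenger), and it leaves out the other steps of the published algorithm. The `wins` matrix layout and both function names are hypothetical; in PDO the duel outcome would come from an LLM judge.

```python
import random


def select_duel(wins, rng):
    """Pick two distinct prompts to compare; wins[i][j] counts times i beat j."""
    k = len(wins)
    # First sampling pass: draw a plausible win probability for every pair.
    theta = [[0.5] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            theta[i][j] = rng.betavariate(wins[i][j] + 1, wins[j][i] + 1)
            theta[j][i] = 1.0 - theta[i][j]
    # Champion: the prompt that beats the most others under the sampled probabilities.
    scores = [sum(theta[i][j] > 0.5 for j in range(k) if j != i) for i in range(k)]
    first = max(range(k), key=lambda i: scores[i])
    # Second sampling pass: the challenger most likely to beat the champion.
    challenge = {
        j: rng.betavariate(wins[j][first] + 1, wins[first][j] + 1)
        for j in range(k)
        if j != first
    }
    second = max(challenge, key=challenge.get)
    return first, second


def record_outcome(wins, winner, loser):
    """Update the pairwise win counts after a judged duel."""
    wins[winner][loser] += 1


rng = random.Random(0)
wins = [[0] * 3 for _ in range(3)]  # three candidate prompts, no duels yet
a, b = select_duel(wins, rng)
record_outcome(wins, winner=a, loser=b)  # pretend the judge preferred prompt `a`
```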

[44] Interpreting the Latent Structure of Operator Precedence in Language Models

Dharunish Yugeswardeenoo,Harshil Nukala,Cole Blondin,Sean O Brien,Vasu Sharma,Kevin Zhu

Main category: cs.CL

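TL;DR: Investigates whether the LLaMA 3.2-3B model encodes operator precedence in arithmetic tasks. Intermediate results are found in the residual stream, precedence is linearly encoded in the operator embeddings, and a partial embedding swap method is proposed that modifies precedence by exchanging key embedding dimensions.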

Details Motivation: To explore the internal mechanisms of large language models in arithmetic reasoning, in particular whether operator precedence is encoded in their internal representations. Method: A dataset of arithmetic expressions with three operands and two operators is built, and interpretability techniques such as logit lens, linear classification probes, and UMAP visualization are used to trace intermediate results in the residual stream and their representations after MLP and attention layers. Result: Intermediate results are present in the residual stream (especially after MLP blocks), operator embeddings linearly encode precedence information, and swapping embedding dimensions is shown to change how the model handles precedence. Conclusion: LLMs do encode operator precedence in their internal representations, and their arithmetic reasoning can be intervened on by modifying specific embedding dimensions, which reveals part of the mechanism by which the model carries out symbolic computation. Abstract: Large Language Models (LLMs) have demonstrated impressive reasoning capabilities but continue to struggle with arithmetic tasks. Prior works largely focus on outputs or prompting strategies, leaving open the question of the internal structure through which models do arithmetic computation. In this work, we investigate whether LLMs encode operator precedence in their internal representations via the open-source instruction-tuned LLaMA 3.2-3B model. We constructed a dataset of arithmetic expressions with three operands and two operators, varying the order and placement of parentheses. Using this dataset, we trace whether intermediate results appear in the residual stream of the instruction-tuned LLaMA 3.2-3B model. We apply interpretability techniques such as logit lens, linear classification probes, and UMAP geometric visualization. Our results show that intermediate computations are present in the residual stream, particularly after MLP blocks. We also find that the model linearly encodes precedence in each operator's embeddings after the attention layer. We introduce partial embedding swap, a technique that modifies operator precedence by exchanging high-impact embedding dimensions between operators.
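A dataset of the kind described above (three operands, two operators, varying parenthesis placement) is easy to generate, and it helps to store the intermediate result alongside the final value, since the intermediate result is what one would look for in the residual stream. The sketch below is an assumption-laden illustration: the operator set, the operand values, and the field names are hypothetical, and the abstract does not say which operators or number ranges the authors used.

```python
import itertools
import operator

APPLY = {"+": operator.add, "-": operator.sub, "*": operator.mul}
PRECEDENCE = {"+": 1, "-": 1, "*": 2}


def evaluate(a, op1, b, op2, c, parens):
    """Return (final value, intermediate result of the sub-expression computed first)."""
    right_first = parens == "right" or (
        parens == "none" and PRECEDENCE[op2] > PRECEDENCE[op1]
    )
    if right_first:
        inner = APPLY[op2](b, c)
        return APPLY[op1](a, inner), inner
    # Equal precedence without parentheses is evaluated left to right.
    inner = APPLY[op1](a, b)
    return APPLY[op2](inner, c), inner


def build_dataset(operands=(2, 3, 4)):
    """Enumerate every operator pair under three parenthesis placements."""
    a, b, c = operands
    rows = []
    for op1, op2 in itertools.product(APPLY, repeat=2):
        texts = {
            "none": f"{a} {op1} {b} {op2} {c}",
            "left": f"({a} {op1} {b}) {op2} {c}",
            "right": f"{a} {op1} ({b} {op2} {c})",
        }
        for parens, text in texts.items():
            value, intermediate = evaluate(a, op1, b, op2, c, parens)
            rows.append(
                {"text": text, "parens": parens, "value": value, "intermediate": intermediate}
            )
    return rows


rows = build_dataset()
by_text = {row["text"]: row for row in rows}
print(by_text["2 + 3 * 4"])
print(by_text["(2 + 3) * 4"])
```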

cs.CV [Back]

[45] MultiFoodhat: A potential new paradigm for intelligent food quality inspection

Yue Hu,Guohang Zhuang

Main category: cs.CV

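TL;DR: Proposes MultiFoodChat, a zero-shot food recognition framework based on multi-agent dialogue that combines vision-language models and large language models and classifies food images through multi-round visual-textual dialogue without training; it outperforms existing methods on several public datasets.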

Details Motivation: Existing supervised models depend on large labeled datasets and generalize poorly to unseen food categories, which limits their use in real-world settings. Method: A multi-agent dialogue framework driven by vision-language models and large language models is built; an Object Perception Token (OPT) captures fine-grained visual features, and an Interactive Reasoning Agent (IRA) dynamically interprets contextual information to refine predictions. Result: Experiments on multiple public food datasets show that MultiFoodChat outperforms existing unsupervised and few-shot methods in both recognition accuracy and interpretability. Conclusion: MultiFoodChat offers a new paradigm for zero-shot food recognition with good application potential for intelligent food quality inspection and dietary analysis. Abstract: Food image classification plays a vital role in intelligent food quality inspection, dietary assessment, and automated monitoring. However, most existing supervised models rely heavily on large labeled datasets and exhibit limited generalization to unseen food categories. To overcome these challenges, this study introduces MultiFoodChat, a dialogue-driven multi-agent reasoning framework for zero-shot food recognition. The framework integrates vision-language models (VLMs) and large language models (LLMs) to enable collaborative reasoning through multi-round visual-textual dialogues. An Object Perception Token (OPT) captures fine-grained visual attributes, while an Interactive Reasoning Agent (IRA) dynamically interprets contextual cues to refine predictions. This multi-agent design allows flexible and human-like understanding of complex food scenes without additional training or manual annotations. Experiments on multiple public food datasets demonstrate that MultiFoodChat achieves superior recognition accuracy and interpretability compared with existing unsupervised and few-shot methods, highlighting its potential as a new paradigm for intelligent food quality inspection and analysis.

[46] Post-surgical Endometriosis Segmentation in Laparoscopic Videos

Andreas Leibetseder,Klaus Schoeffmann,Jörg Keckstein,Simon Keckstein

Main category: cs.CV

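TL;DR: This paper presents a system to assist gynecologists in diagnosing endometriosis; it segments the dark endometrial implants commonly seen in laparoscopic videos and improves video browsing through colored overlays and a detection summary.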

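Details Motivation: Endometriosis appears in many forms inside the body and is hard to identify visually, so non-specialist practitioners can easily misjudge it; an assistive tool is needed to improve the accuracy and efficiency of identification. Method: A system is developed and trained to segment the dark endometrial implants commonly seen in laparoscopic surgery videos, annotating implant regions with multi-colored overlays and generating a detection summary to support fast video browsing. Result: The system analyzes laparoscopic videos, annotates implant regions, and provides a detection summary that improves video navigation and supports clinical decisions. Conclusion: The system is a feasible assistive tool for the visual identification of endometriosis and can help improve diagnostic efficiency and accuracy, particularly in non-specialist settings. Abstract: Endometriosis is a common women's condition exhibiting a manifold visual appearance in various body-internal locations. Having such properties makes its identification very difficult and error-prone, at least for laymen and non-specialized medical practitioners. In an attempt to provide assistance to gynecologic physicians treating endometriosis, this demo paper describes a system that is trained to segment one frequently occurring visual appearance of endometriosis, namely dark endometrial implants. The system is capable of analyzing laparoscopic surgery videos, annotating identified implant regions with multi-colored overlays and displaying a detection summary for improved video browsing.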

[47] Efficient Few-Shot Learning in Remote Sensing: Fusing Vision and Vision-Language Models

Jia Yun Chua,Argyrios Zolotas,Miguel Arana-Catania

Main category: cs.CV

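TL;DR: This study combines YOLO with vision-language models (VLMs) such as LLaVA, ChatGPT, and Gemini to improve aircraft detection and scene understanding in remote sensing images, with clear gains in detection accuracy and contextual understanding on labelled, unlabelled, and degraded images.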

Details Motivation: Traditional vision models for remote sensing image analysis are limited by their need for large amounts of labelled data and by weak contextual understanding, while general-purpose vision-language models (VLMs) remain underexplored in this field, so combined approaches are worth investigating. Method: The YOLO object detector is combined with several VLMs (LLaVA, ChatGPT, Gemini), using the VLMs' semantic understanding to improve contextual interpretation of remote sensing images, and the combination is evaluated on labelled, unlabelled, and degraded image data. Result: For aircraft detection and counting, MAE improves by 48.46% on average across models; CLIPScore improves by 6.17%, indicating better image understanding, particularly under low-quality or challenging conditions. Conclusion: Combining traditional vision models with VLMs improves the accuracy and contextual awareness of remote sensing image analysis and offers a practical route for few-shot learning and similar applications. Abstract: Remote sensing has become a vital tool across sectors such as urban planning, environmental monitoring, and disaster response. While the volume of data generated has increased significantly, traditional vision models are often constrained by the requirement for extensive domain-specific labelled data and their limited ability to understand the context within complex environments. Vision Language Models (VLMs) offer a complementary approach by integrating visual and textual data; however, their application to remote sensing remains underexplored, particularly given their generalist nature. This work investigates the combination of vision models and VLMs to enhance image analysis in remote sensing, with a focus on aircraft detection and scene understanding. The integration of YOLO with VLMs such as LLaVA, ChatGPT, and Gemini aims to achieve more accurate and contextually aware image interpretation. Performance is evaluated on both labelled and unlabelled remote sensing data, as well as degraded image scenarios which are crucial for remote sensing. The findings show an average MAE improvement of 48.46% across models in the accuracy of aircraft detection and counting, especially in challenging conditions, in both raw and degraded scenarios. A 6.17% improvement in CLIPScore for comprehensive understanding of remote sensing images is obtained. The proposed approach combining traditional vision models and VLMs paves the way for more advanced and efficient remote sensing image analysis, especially in few-shot learning scenarios.

[48] Finding Holes: Pathologist Level Performance Using AI for Cribriform Morphology Detection in Prostate Cancer

Kelvin Szolnoky,Anders Blilie,Nita Mulliqi,Toyonori Tsuzuki,Hemamali Samaratunga,Matteo Titus,Xiaoyi Ji,Sol Erika Boman,Einar Gudlaugsson,Svein Reidar Kjosavik,José Asenjo,Marcello Gambacorta,Paolo Libretti,Marcin Braun,Radisław Kordek,Roman Łowicki,Brett Delahunt,Kenneth A. Iczkowski,Theo van der Kwast,Geert J. L. H. van Leenders,Katia R. M. Leite,Chin-Chen Pan,Emiel Adrianus Maria Janssen,Martin Eklund,Lars Egevad,Kimmo Kartasalo

Main category: cs.CV

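TL;DR: This study develops and validates an AI-based deep learning model to improve detection of cribriform morphology in prostate cancer; the model performs strongly in internal and external validation and achieves higher agreement than any of the nine expert pathologists it was compared with.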

Details Motivation: Cribriform morphology is an important histological feature indicating poor prognosis in prostate cancer, but it is underreported and subject to large interobserver variability among pathologists, so more consistent and accurate detection is needed. Method: A deep learning model using an EfficientNetV2-S encoder with multiple instance learning was trained end to end for whole-slide classification on 640 prostate core needle biopsy slides from three cohorts and validated on internal and independent external cohorts; its results were also compared with the readings of nine expert pathologists. Result: In internal validation the model reached an AUC of 0.97 (95% CI: 0.95-0.99) and a Cohen's kappa of 0.81; in external validation the AUC was 0.90 (95% CI: 0.86-0.93) and kappa 0.55; in the 88-slide comparison the model's average agreement (kappa=0.66) was higher than that of all nine pathologists (kappa=0.35-0.62). Conclusion: The AI model matches or exceeds pathologist-level performance in detecting cribriform morphology in prostate cancer and could improve diagnostic reliability, standardise reporting, and support better treatment decisions. Abstract: Background: Cribriform morphology in prostate cancer is a histological feature that indicates poor prognosis and contraindicates active surveillance. However, it remains underreported and subject to significant interobserver variability amongst pathologists. We aimed to develop and validate an AI-based system to improve cribriform pattern detection. Methods: We created a deep learning model using an EfficientNetV2-S encoder with multiple instance learning for end-to-end whole-slide classification. The model was trained on 640 digitised prostate core needle biopsies from 430 patients, collected across three cohorts. It was validated internally (261 slides from 171 patients) and externally (266 slides, 104 patients from three independent cohorts). Internal validation cohorts included laboratories or scanners from the development set, while external cohorts used completely independent instruments and laboratories. Annotations were provided by three expert uropathologists with known high concordance. Additionally, we conducted an inter-rater analysis and compared the model's performance against nine expert uropathologists on 88 slides from the internal validation cohort. Results: The model showed strong internal validation performance (AUC: 0.97, 95% CI: 0.95-0.99; Cohen's kappa: 0.81, 95% CI: 0.72-0.89) and robust external validation (AUC: 0.90, 95% CI: 0.86-0.93; Cohen's kappa: 0.55, 95% CI: 0.45-0.64). 
In our inter-rater analysis, the model achieved the highest average agreement (Cohen's kappa: 0.66, 95% CI: 0.57-0.74), outperforming all nine pathologists whose Cohen's kappas ranged from 0.35 to 0.62. Conclusion: Our AI model demonstrates pathologist-level performance for cribriform morphology detection in prostate cancer. This approach could enhance diagnostic reliability, standardise reporting, and improve treatment decisions for prostate cancer patients.
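Cohen's kappa, the agreement statistic reported throughout the abstract above, corrects raw agreement for the agreement two raters would reach by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below implements the standard two-rater formula in plain Python; the function name and the eight toy slide labels are hypothetical and unrelated to the study's data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Two-rater Cohen's kappa for categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items on which both raters give the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:  # both raters always use the same single label
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy labels for eight slides: 1 = cribriform present, 0 = absent.
model = [1, 1, 0, 0, 1, 0, 1, 0]
pathologist = [1, 0, 0, 0, 1, 0, 1, 1]
print(cohens_kappa(model, pathologist))  # observed 0.75, chance 0.5, kappa 0.5
```

In this toy case the raters agree on 6 of 8 slides (0.75), and with balanced marginals chance agreement is 0.5, so kappa is 0.25 / 0.5 = 0.5.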

[49] NAPPure: Adversarial Purification for Robust Image Classification under Non-Additive Perturbations

Junjie Nan,Jianing Li,Wei Chen,Mingkun Zhang,Xueqi Cheng

Main category: cs.CV

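TL;DR: Proposes NAPPure, an extended adversarial purification framework for non-additive adversarial perturbations that separates the clean image from the perturbation parameters through likelihood maximization, significantly improving the robustness of image classifiers on GTSRB and CIFAR-10.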

Details Motivation: Existing adversarial purification methods are designed mainly for additive perturbations and work poorly on the non-additive perturbations common in the real world (such as blur, occlusion, and distortion), so a method that handles multiple perturbation types is needed. Method: The generation process of an adversarial image is modeled, and the underlying clean image is disentangled from the perturbation parameters by maximum likelihood estimation, which purifies non-additive perturbations effectively. Result: Experiments on GTSRB and CIFAR-10 show that NAPPure significantly improves the robustness of image classification models against non-additive adversarial perturbations. Conclusion: The NAPPure framework extends adversarial purification to non-additive perturbations, improving model stability and applicability under complex perturbations. Abstract: Adversarial purification has achieved great success in combating adversarial image perturbations, which are usually assumed to be additive. However, non-additive adversarial perturbations such as blur, occlusion, and distortion are also common in the real world. Under such perturbations, existing adversarial purification methods are much less effective since they are designed to fit the additive nature. In this paper, we propose an extended adversarial purification framework named NAPPure, which can further handle non-additive perturbations. Specifically, we first establish the generation process of an adversarial image, and then disentangle the underlying clean image and perturbation parameters through likelihood maximization. Experiments on GTSRB and CIFAR-10 datasets show that NAPPure significantly boosts the robustness of image classification models against non-additive perturbations.

[50] Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding

Xiaoqian Shen,Wenxuan Zhang,Jun Chen,Mohamed Elhoseiny

Main category: cs.CV

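TL;DR: This paper proposes Vgent, a graph-based retrieval-reasoning-augmented generation framework that improves large video language models (LVLMs) on long video understanding. Structured semantic graphs and an intermediate reasoning step address long-range temporal dependencies and retrieval noise, and the method significantly outperforms existing approaches on several benchmarks.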

Details Motivation: Because of limited context windows and the difficulty of retaining long-term temporal information, existing LVLMs struggle with long videos; directly applying retrieval-augmented generation (RAG) disrupts temporal dependencies and introduces irrelevant information, which hurts reasoning accuracy. Method: Vgent (1) represents the video as a structured graph that preserves semantic relationships between clips to improve retrieval, and (2) adds an intermediate reasoning step that uses structured verification to reduce retrieval noise and explicitly aggregates relevant information across clips. Result: Evaluated with several open-source LVLMs on three long-video understanding benchmarks, Vgent improves over base models by 3.0% to 5.4% on MLVU and outperforms existing video RAG methods by 8.6%. Conclusion: Through graph-structured modeling and intermediate reasoning, Vgent improves the accuracy and context awareness of LVLMs on long video understanding and offers a new approach to video RAG. Abstract: Understanding and reasoning over long videos pose significant challenges for large video language models (LVLMs) due to the difficulty in processing intensive video tokens beyond context window and retaining long-term sequential information. Retrieval-Augmented Generation (RAG) has demonstrated effectiveness in processing long context for Large Language Models (LLMs); however, applying RAG to long video faces challenges such as disrupted temporal dependencies and inclusion of irrelevant information that can hinder accurate reasoning. To address these limitations, we propose Vgent, a novel graph-based retrieval-reasoning-augmented generation framework to enhance LVLMs for long video understanding. Our approach introduces two key innovations: (i) It represents videos by structured graphs with semantic relationships across video clips preserved to improve retrieval effectiveness. (ii) It introduces an intermediate reasoning step to mitigate the reasoning limitation of LVLMs, which leverages structured verification to reduce retrieval noise and facilitate the explicit aggregation of relevant information across clips, resulting in more accurate and context-aware responses. We comprehensively evaluate our framework with various open-source LVLMs on three long-video understanding benchmarks. Our approach yielded an overall performance improvement of 3.0% to 5.4% over base models on MLVU, and outperformed state-of-the-art video RAG methods by 8.6%. Our code is publicly available at https://xiaoqian-shen.github.io/Vgent.

[51] Synchronization of Multiple Videos

Avihai Naaman,Ron Shapira Weber,Oren Freifeld

Main category: cs.CV

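TL;DR: Proposes Temporal Prototype Learning (TPL), a prototype-based temporal alignment framework for synchronizing multiple videos from different scenes or from generative AI, with significant gains in accuracy, efficiency, and robustness.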

Details Motivation: Conventional methods struggle with nonlinear temporal misalignment between videos from different scenes or between generative AI videos, so a more robust multi-video synchronization method is needed. Method: Extract high-dimensional embeddings with pretrained models and build a shared, compact 1D prototype sequence that serves as temporal anchors, aligning multiple videos without pairwise matching. Result: TPL is validated on several datasets, where it significantly improves synchronization performance, and it is the first method to synchronize multiple generative AI videos. Conclusion: TPL is an efficient and general multi-video synchronization framework, particularly suited to temporal alignment of complex scenes and generative AI videos. Abstract: Synchronizing videos captured simultaneously from multiple cameras in the same scene is often easy and typically requires only simple time shifts. However, synchronizing videos from different scenes or, more recently, generative AI videos, poses a far more complex challenge due to diverse subjects, backgrounds, and nonlinear temporal misalignment. We propose Temporal Prototype Learning (TPL), a prototype-based framework that constructs a shared, compact 1D representation from high-dimensional embeddings extracted by any of various pretrained models. TPL robustly aligns videos by learning a unified prototype sequence that anchors key action phases, thereby avoiding exhaustive pairwise matching. Our experiments show that TPL improves synchronization accuracy, efficiency, and robustness across diverse datasets, including fine-grained frame retrieval and phase classification tasks. Importantly, TPL is the first approach to mitigate synchronization issues in multiple generative AI videos depicting the same action. Our code and a new multiple video synchronization dataset are available at https://bgu-cs-vil.github.io/TPL/

[52] Capture, Canonicalize, Splat: Zero-Shot 3D Gaussian Avatars from Unstructured Phone Images

Emanuel Garbin,Guy Adam,Oded Krams,Zohar Barzelay,Eran Guendelman,Michael Schwarz,Moran Vatelmacher,Yigal Shenkman,Eli Peker,Itai Druker,Uri Patish,Yoav Blum,Max Bluvstein,Junxuan Li,Rawal Khirodkar,Shunsuke Saito

Main category: cs.CV

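TL;DR: Proposes a zero-shot method for generating high-fidelity 3D avatars from multi-view phone photos, using a "Capture, Canonicalize, Splat" pipeline that preserves identity and achieves a high degree of realism.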

Details Motivation: Existing single-view methods suffer from geometric inconsistencies and identity distortion, and models trained on synthetic data fail to reproduce high-frequency details such as skin wrinkles and fine hair, which limits realism. Method: Two key modules are proposed: a generative canonicalization module that converts unstructured multi-view images into a unified representation, and a transformer-based model trained on a large dataset of high-fidelity Gaussian splatting avatars built from dome captures of real people. Result: The method generates static quarter-body 3D avatars from unstructured phone photos, with strong geometric consistency, identity preservation, and visual realism, and clear gains in skin texture and hair detail. Conclusion: The "Capture, Canonicalize, Splat" pipeline achieves high-quality, identity-preserving 3D avatar generation in a zero-shot setting, overcoming the limits of existing methods in realism and detail. Abstract: We present a novel, zero-shot pipeline for creating hyperrealistic, identity-preserving 3D avatars from a few unstructured phone images. Existing methods face several challenges: single-view approaches suffer from geometric inconsistencies and hallucinations, degrading identity preservation, while models trained on synthetic data fail to capture high-frequency details like skin wrinkles and fine hair, limiting realism. Our method introduces two key contributions: (1) a generative canonicalization module that processes multiple unstructured views into a standardized, consistent representation, and (2) a transformer-based model trained on a new, large-scale dataset of high-fidelity Gaussian splatting avatars derived from dome captures of real people. This "Capture, Canonicalize, Splat" pipeline produces static quarter-body avatars with compelling realism and robust identity preservation from unstructured photos.

[53] cubic: CUDA-accelerated 3D Bioimage Computing

Alexandr A. Kalinin,Anne E. Carpenter,Shantanu Singh,Matthew J. O'Meara

Main category: cs.CV

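TL;DR: cubic is an open-source Python library that extends the SciPy and scikit-image APIs with GPU acceleration from CuPy and RAPIDS cuCIM, enabling efficient, scalable analysis of 2D and 3D bioimages. It dispatches computation in a device-agnostic way and substantially improves the speed and compatibility of existing workflows.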

Details Motivation: Existing bioimage analysis tools are limited in scalability, efficiency, GPU acceleration support, and interoperability with other scientific computing workflows, and struggle with the large 2D and 3D datasets produced by modern microscopy. Method: cubic is an open-source Python library whose API is compatible with SciPy and scikit-image and which uses CuPy and RAPIDS cuCIM for GPU acceleration; its device-agnostic design automatically runs each operation on the CPU or GPU depending on where the data reside. Result: In benchmarks of individual operations and in reproductions of deconvolution and segmentation pipelines, cubic achieves substantial speedups while preserving algorithmic fidelity, and it integrates into existing bioimage analysis pipelines, accelerating the full workflow from preprocessing to feature extraction. Conclusion: cubic provides a solid foundation for scalable, reproducible bioimage analysis, fits well into the Python scientific computing ecosystem, and supports both interactive exploration and automated high-throughput analysis. Abstract: Quantitative analysis of multidimensional biological images is useful for understanding complex cellular phenotypes and accelerating advances in biomedical research. As modern microscopy generates ever-larger 2D and 3D datasets, existing computational approaches are increasingly limited by their scalability, efficiency, and integration with modern scientific computing workflows. Existing bioimage analysis tools often lack application programming interfaces (APIs), do not support graphics processing unit (GPU) acceleration, lack broad 3D image processing capabilities, and/or have poor interoperability for compute-heavy workflows. Here, we introduce cubic, an open-source Python library that addresses these challenges by augmenting widely used SciPy and scikit-image APIs with GPU-accelerated alternatives from CuPy and RAPIDS cuCIM. cubic's API is device-agnostic and dispatches operations to GPU when data reside on the device and otherwise executes on CPU, seamlessly accelerating a broad range of image processing routines. This approach enables GPU acceleration of existing bioimage analysis workflows, from preprocessing to segmentation and feature extraction for 2D and 3D data. We evaluate cubic both by benchmarking individual operations and by reproducing existing deconvolution and segmentation pipelines, achieving substantial speedups while maintaining algorithmic fidelity. 
These advances establish a robust foundation for scalable, reproducible bioimage analysis that integrates with the broader Python scientific computing ecosystem, including other GPU-accelerated methods, enabling both interactive exploration and automated high-throughput analysis workflows. cubic is openly available at https://github.com/alxndrkalinin/cubic

[54] Virtually Being: Customizing Camera-Controllable Video Diffusion Models with Multi-View Performance Captures

Yuancheng Xu,Wenqi Xian,Li Ma,Julien Philip,Ahmet Levent Taşel,Yiwei Zhao,Ryan Burgert,Mingming He,Oliver Hermann,Oliver Pilarski,Rahul Garg,Paul Debevec,Ning Yu

Main category: cs.CV

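TL;DR: Proposes a video diffusion framework that achieves multi-view character consistency and 3D camera control through a novel customization data pipeline, using 4D Gaussian Splatting and video relighting to improve generation quality and controllability for virtual production.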

Details Motivation: To achieve multi-view character consistency and precise 3D camera control in video diffusion models, meeting virtual production's need for high-quality, customizable video generation. Method: A customization data pipeline re-renders volumetric capture performances with 4D Gaussian Splatting under diverse camera trajectories and applies video relighting to produce training data; open-source video diffusion models are fine-tuned on this data for multi-view identity preservation and camera and lighting control. The framework also supports multi-subject generation (joint training and noise blending), scene customization, and control over motion and spatial layout. Result: Experiments show improvements in video quality, personalization accuracy, camera control, and lighting adaptability, with support for efficient multi-subject composition and complex scene customization. Conclusion: The framework advances the integration of video generation into virtual production and offers an effective solution for character consistency, multi-view control, and practical extensions. Abstract: We introduce a framework that enables both multi-view character consistency and 3D camera control in video diffusion models through a novel customization data pipeline. We train the character consistency component with recorded volumetric capture performances re-rendered with diverse camera trajectories via 4D Gaussian Splatting (4DGS), and with lighting variability obtained with a video relighting model. We fine-tune state-of-the-art open-source video diffusion models on this data to provide strong multi-view identity preservation, precise camera control, and lighting adaptability. Our framework also supports core capabilities for virtual production, including multi-subject generation using two approaches: joint training and noise blending, the latter enabling efficient composition of independently customized models at inference time; it also achieves scene and real-life video customization as well as control over motion and spatial layout during customization. Extensive experiments show improved video quality, higher personalization accuracy, and enhanced camera control and lighting adaptability, advancing the integration of video generation into virtual production. Our project page is available at: https://eyeline-labs.github.io/Virtually-Being.

[55] Joint Modeling of Big Five and HEXACO for Multimodal Apparent Personality-trait Recognition

Ryo Masumura,Shota Orihashi,Mana Ihori,Tomohiro Tanaka,Naoki Makishima,Taiga Yamane,Naotaka Kawata,Satoshi Suzuki,Taichi Katayama

Main category: cs.CV

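TL;DR: This paper proposes a method that jointly models the Big Five and HEXACO personality models to automatically recognize apparent personality traits from multimodal human behavior.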

Details Motivation: Prior work has focused mostly on the Big Five and overlooked HEXACO, which can assess traits such as Honesty-Humility, and the relationship between the two under machine learning modeling remains unclear. Method: Jointly optimize recognition of the Big Five and HEXACO, with multimodal behavior analysis on a self-introduction video dataset. Result: Experiments show that the proposed method effectively recognizes both Big Five and HEXACO traits. Conclusion: Joint modeling helps improve the understanding of multimodal human behavior, and incorporating HEXACO into apparent personality recognition is promising. Abstract: This paper proposes a joint modeling method of the Big Five, which has long been studied, and HEXACO, which has recently attracted attention in psychology, for automatically recognizing apparent personality traits from multimodal human behavior. Most previous studies have used the Big Five for multimodal apparent personality-trait recognition. However, no study has focused on apparent HEXACO, which can evaluate an Honesty-Humility trait related to displaced aggression and vengefulness, social-dominance orientation, etc. In addition, the relationships between the Big Five and HEXACO when modeled by machine learning have not been clarified. We expect awareness of multimodal human behavior to improve by considering these relationships. The key advance of our proposed method is to jointly optimize the recognition of the Big Five and HEXACO. Experiments using a self-introduction video dataset demonstrate that the proposed method can effectively recognize the Big Five and HEXACO.

[56] LOTA: Bit-Planes Guided AI-Generated Image Detection

Hongsong Wang,Renxi Cheng,Yang Zhang,Chaolei Han,Jie Gui

Main category: cs.CV

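TL;DR: Proposes a bit-plane-based method for generating noisy images and detecting on them, to efficiently distinguish AI-generated images from real ones.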

Details Motivation: As GANs and diffusion models advance, AI-generated images look increasingly realistic, and existing detectors based on image reconstruction error are computationally expensive and fail to capture the intrinsic noise features of the raw image. Method: Extract noise features with bit-plane image processing, design a maximum gradient patch selection strategy to amplify the noise signal, and propose lightweight classification heads (a noise-based classifier and a noise-guided classifier) for detection. Result: The method reaches 98.9% average accuracy on the GenImage benchmark, 11.9% higher than existing methods, with strong cross-generator generalization (over 98.2% from GAN to Diffusion and over 99.2% from Diffusion to GAN); error extraction runs at the millisecond level, nearly a hundred times faster than existing methods. Conclusion: The method detects AI-generated images with high accuracy, high efficiency, and strong generalization, and suits applications that need fast discrimination. Abstract: The rapid advancement of GAN and Diffusion models makes it more difficult to distinguish AI-generated images from real ones. Recent studies often use image-based reconstruction errors as an important feature for determining whether an image is AI-generated. However, these approaches typically incur high computational costs and also fail to capture intrinsic noisy features present in the raw images. To solve these problems, we innovatively refine error extraction by using bit-plane-based image processing, as lower bit planes indeed represent noise patterns in images. We introduce an effective bit-planes guided noisy image generation and exploit various image normalization strategies, including scaling and thresholding. Then, to amplify the noise signal for easier AI-generated image detection, we design a maximum gradient patch selection that applies multi-directional gradients to compute the noise score and selects the region with the highest score. Finally, we propose a lightweight and effective classification head and explore two different structures: noise-based classifier and noise-guided classifier. Extensive experiments on the GenImage benchmark demonstrate the outstanding performance of our method, which achieves an average accuracy of 98.9% (an 11.9% improvement) and shows excellent cross-generator generalization capability. Particularly, our method achieves an accuracy of over 98.2% from GAN to Diffusion and over 99.2% from Diffusion to GAN. Moreover, it performs error extraction at the millisecond level, nearly a hundred times faster than existing methods. 
The code is at https://github.com/hongsong-wang/LOTA.
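The abstract's starting point is that the lower bit planes of an image mostly carry noise. A minimal sketch of that idea, assuming an 8-bit grayscale image stored as nested lists, might look like the following. The function names, the choice of two low planes, and the scaling step are illustrative assumptions, not the LOTA implementation.

```python
def bit_plane(image, k):
    """Return bit plane k (0 = least significant) of an 8-bit grayscale image."""
    if not 0 <= k <= 7:
        raise ValueError("k must be in 0..7 for 8-bit pixels")
    return [[(pixel >> k) & 1 for pixel in row] for row in image]


def low_bit_residual(image, num_low_planes=2):
    """Keep only the lowest `num_low_planes` bits of each pixel (the noisy part)."""
    mask = (1 << num_low_planes) - 1
    return [[pixel & mask for pixel in row] for row in image]


def scale_to_8bit(residual, num_low_planes=2):
    """Stretch the residual to 0..255 so it can be viewed or passed to a classifier."""
    max_value = (1 << num_low_planes) - 1
    return [[value * 255 // max_value for value in row] for row in residual]


# Tiny hypothetical 2x2 image.
image = [[200, 201], [13, 255]]
lsb = bit_plane(image, 0)
noise = scale_to_8bit(low_bit_residual(image))
```

Bit operations like these touch each pixel once, which is consistent with the abstract's claim that error extraction can run in milliseconds, in contrast to reconstruction-based approaches that need a generative model forward pass.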

[57] PIA: Deepfake Detection Using Phoneme-Temporal and Identity-Dynamic Analysis

Soumyya Kanti Datta,Tanvi Ranga,Chengzhe Sun,Siwei Lyu

Main category: cs.CV

TL;DR: Proposes PIA, a new multimodal audio-visual framework for detecting deepfake content produced by generative models.

Details Motivation: Conventional detectors struggle to identify deepfake media produced by advanced generative models such as GANs and diffusion models, because they rely mainly on hand-designed rules and unimodal strategies that miss subtle temporal inconsistencies. Method: The study builds a multimodal framework, Phoneme-Temporal and Identity-Dynamic Analysis (PIA), that uses phoneme sequences, lip geometry data, and advanced facial identity embeddings, combining language, dynamic face motion, and facial identification cues. Result: By identifying inconsistencies across multiple complementary modalities, the method significantly improves detection of subtle deepfake alterations. Conclusion: The PIA framework outperforms conventional methods at detecting modern deepfakes, with greater robustness and accuracy. Abstract: The rise of manipulated media has made deepfakes a particularly insidious threat, involving various generative manipulations such as lip-sync modifications, face-swaps, and avatar-driven facial synthesis. Conventional detection methods, which predominantly depend on manually designed phoneme-viseme alignment thresholds, fundamental frame-level consistency checks, or a unimodal detection strategy, inadequately identify modern-day deepfakes generated by advanced generative models such as GANs, diffusion models, and neural rendering techniques. These advanced techniques generate nearly perfect individual frames yet inadvertently create minor temporal discrepancies frequently overlooked by traditional detectors. We present a novel multimodal audio-visual framework, Phoneme-Temporal and Identity-Dynamic Analysis (PIA), incorporating language, dynamic face motion, and facial identification cues to address these limitations. We utilize phoneme sequences, lip geometry data, and advanced facial identity embeddings. This integrated method significantly improves the detection of subtle deepfake alterations by identifying inconsistencies across multiple complementary modalities. Code is available at https://github.com/skrantidatta/PIA

[58] Event Interval Modulation: A Novel Scheme for Event-based Optical Camera Communication

Miu Sumino,Mayu Ishii,Shun Kaizu,Daisuke Hisano,Yu Nakayama

Main category: cs.CV

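TL;DR: This paper proposes event interval modulation (EIM), a new modulation scheme designed for event-based optical camera communication (OCC) that carries information in the time intervals between events. It significantly raises the transmission rate, achieving 28 kbps at 10 m and 8.4 kbps at 50 m indoors and setting a new performance benchmark.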

Details Motivation: Conventional frame-based OCC systems suffer from low bit rate and high processing load, and existing event-camera OCC systems do not fully exploit the unique characteristics of event sensors and lack a purpose-built modulation scheme. Method: Propose the event interval modulation (EIM) scheme, build a theoretical model of EIM, tune the event-based vision sensor (EVS) parameters for EIM, determine the maximum usable modulation order experimentally, and run transmission experiments. Result: Transmission at 28 kbps over 10 meters and 8.4 kbps over 50 meters is achieved, setting a new bit-rate benchmark for event-based OCC systems. Conclusion: EIM is an efficient modulation method that exploits the asynchronous operation and high dynamic range of event sensors, significantly improves event-based OCC performance, and has potential for high-speed, low-latency communication. Abstract: Optical camera communication (OCC) represents a promising visible light communication technology. Nonetheless, typical OCC systems utilizing frame-based cameras are encumbered by limitations, including low bit rate and high processing load. To address these issues, OCC systems utilizing an event-based vision sensor (EVS) as the receiver have been proposed. The EVS enables high-speed, low-latency, and robust communication due to its asynchronous operation and high dynamic range. In existing event-based OCC systems, conventional modulation schemes such as on-off keying (OOK) and pulse position modulation have been applied; however, to the best of our knowledge, no modulation method has been proposed that fully exploits the unique characteristics of the EVS. This paper proposes a novel modulation scheme, called the event interval modulation (EIM) scheme, specifically designed for event-based OCC. EIM enables improvement in transmission speed by modulating information using the intervals between events. This paper proposes a theoretical model of EIM and conducts a proof-of-concept experiment. First, the parameters of the EVS are tuned and customized to optimize the frequency response specifically for EIM. Then, the maximum modulation order usable in EIM is determined experimentally. We conduct transmission experiments based on the obtained parameters. Finally, we report successful transmission at 28 kbps over 10 meters and 8.4 kbps over 50 meters in an indoor environment. This sets a new benchmark for bit rate in event-based OCC systems.
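To make "modulating information using the intervals between events" concrete, here is a minimal sketch in which each symbol selects one of several discrete gaps between consecutive events. The base interval, step size, modulation order, and the rounding decoder are all illustrative assumptions; the abstract does not give the paper's actual parameters or receiver design.

```python
def eim_encode(symbols, base_us=100, step_us=20, order=4):
    """Turn symbols into event timestamps; each gap to the previous event carries one symbol."""
    timestamps = [0]  # reference event that opens the frame
    for s in symbols:
        if not 0 <= s < order:
            raise ValueError("symbol out of range for this modulation order")
        timestamps.append(timestamps[-1] + base_us + s * step_us)
    return timestamps


def eim_decode(timestamps, base_us=100, step_us=20, order=4):
    """Recover symbols by quantizing each inter-event interval to the nearest level."""
    symbols = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        s = round((cur - prev - base_us) / step_us)
        symbols.append(min(max(s, 0), order - 1))  # clamp against timing noise
    return symbols


sent = [0, 3, 1, 2]
clean = eim_encode(sent)
# Hypothetical received timestamps with a few microseconds of jitter.
jittered = [0, 103, 257, 382, 519]
```

With `order` levels, each interval carries log2(order) bits, so a higher modulation order raises the bit rate only as long as the sensor's timing resolution can still separate adjacent interval levels. That trade-off is why the paper determines the maximum usable order experimentally.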

[59] MACE: Mixture-of-Experts Accelerated Coordinate Encoding for Large-Scale Scene Localization and Rendering

Mingkai Liu,Dikai Fan,Haohua Que,Haojia Gao,Xiao Liu,Shuxue Peng,Meixia Lin,Shengyu Gu,Ruicong Ye,Wanli Qiu,Handong Yao,Ruopeng Zhang,Xianliang Huang

Main category: cs.CV

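TL;DR: Proposes MACE, a mixture-of-experts accelerated coordinate encoding method for efficient localization and high-quality rendering in large-scale scenes, which significantly reduces cost while maintaining high precision.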

Details Motivation: Existing Scene Coordinate Regression (SCR) methods perform well in small scenes but are limited by the capacity of a single network when extended to large-scale scenes, and they are computationally expensive. Method: Introduce an MOE-inspired gating network that implicitly classifies and selects sub-networks so that only one sub-network is activated per inference, and propose an Auxiliary-Loss-Free Load Balancing (ALF-LB) strategy to improve localization accuracy. Result: Experiments on the Cambridge test set show that the method achieves high-quality rendering with only 10 minutes of training and significantly reduces computational cost. Conclusion: MACE offers an efficient, accurate solution for localization and rendering in large-scale scenes. Abstract: Efficient localization and high-quality rendering in large-scale scenes remain a significant challenge due to the computational cost involved. While Scene Coordinate Regression (SCR) methods perform well in small-scale localization, they are limited by the capacity of a single network when extended to large-scale scenes. To address these challenges, we propose the Mixed Expert-based Accelerated Coordinate Encoding method (MACE), which enables efficient localization and high-quality rendering in large-scale scenes. Inspired by the remarkable capabilities of MOE in large model domains, we introduce a gating network to implicitly classify and select sub-networks, ensuring that only a single sub-network is activated during each inference. Furthermore, we present an Auxiliary-Loss-Free Load Balancing (ALF-LB) strategy to enhance the localization accuracy on large-scale scenes. Our framework provides a significant reduction in costs while maintaining higher precision, offering an efficient solution for large-scale scene applications. Additional experiments on the Cambridge test set demonstrate that our method achieves high-quality rendering results with merely 10 minutes of training.
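The key cost saving described above is that the gate picks one sub-network and only that one runs. A minimal sketch of such top-1 routing, with a toy gate and toy experts standing in for the real networks (all names and values here are hypothetical, not MACE's architecture), could be:

```python
def top1_route(gate_scores):
    """Index of the sub-network with the highest gate score."""
    return max(range(len(gate_scores)), key=lambda i: gate_scores[i])


def moe_forward(x, gate, experts):
    """Score all experts with the gate, then evaluate only the selected one."""
    idx = top1_route(gate(x))
    return idx, experts[idx](x)


calls = []  # records which experts actually executed


def make_expert(name, fn):
    def expert(x):
        calls.append(name)
        return fn(x)
    return expert


# Toy experts, each imagined as responsible for one region of the scene.
experts = [
    make_expert("A", lambda x: x + 1.0),
    make_expert("B", lambda x: x * 2.0),
    make_expert("C", lambda x: -x),
]
centers = (0.0, 5.0, 10.0)


def gate(x):
    # Toy gate: prefer the expert whose region center is closest to the input.
    return [-abs(x - c) for c in centers]


idx, y = moe_forward(4.0, gate, experts)
```

Because inference cost depends on one sub-network rather than all of them, total capacity can grow with scene size without a matching growth in per-query compute. Keeping the experts evenly used is a separate problem, which the abstract assigns to the ALF-LB strategy.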

[60] Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization

Liao Shen,Wentao Jiang,Yiran Zhu,Tiezheng Ge,Zhiguo Cao,Bo Zheng

Main category: cs.CV

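TL;DR: This paper proposes Identity-Preserving Reward-guided Optimization (IPRO), a reinforcement-learning-based framework that improves identity consistency in image-to-video generation, performing especially well when the face changes substantially or occupies a small part of the image.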

Details Motivation: Existing image-to-video models struggle to keep identity consistent between the input image and the generated video when the person's expression and motion change substantially, and the problem is worse when the face occupies a small part of the image. Method: The IPRO framework uses a face identity scorer as the reward signal, optimizes the diffusion model by backpropagating the reward through the last steps of the sampling chain, and adds KL-divergence regularization to stabilize training; a new facial scoring mechanism uses multi-angle facial features from ground-truth videos to improve generalization. Result: Extensive experiments on the Wan 2.2 I2V model and an in-house I2V model show significantly improved identity consistency without changing the model architecture or adding auxiliary modules. Conclusion: IPRO is a direct and effective tuning method that markedly strengthens identity preservation in image-to-video generation without altering the model architecture. Abstract: Recent advances in image-to-video (I2V) generation have achieved remarkable progress in synthesizing high-quality, temporally coherent videos from static images. Among all the applications of I2V, human-centric video generation accounts for a large portion. However, existing I2V models encounter difficulties in maintaining identity consistency between the input human image and the generated video, especially when the person in the video exhibits significant expression changes and movements. This issue becomes critical when the human face occupies merely a small fraction of the image. Since humans are highly sensitive to identity variations, this poses a critical yet under-explored challenge in I2V generation. In this paper, we propose Identity-Preserving Reward-guided Optimization (IPRO), a novel video diffusion framework based on reinforcement learning to enhance identity preservation. Instead of introducing auxiliary modules or altering model architectures, our approach introduces a direct and effective tuning algorithm that optimizes diffusion models using a face identity scorer. To improve performance and accelerate convergence, our method backpropagates the reward signal through the last steps of the sampling chain, enabling richer gradient feedback. We also propose a novel facial scoring mechanism that treats faces in ground-truth videos as facial feature pools, providing multi-angle facial information to enhance generalization. A KL-divergence regularization is further incorporated to stabilize training and prevent overfitting to the reward signal. 
Extensive experiments on the Wan 2.2 I2V model and our in-house I2V model demonstrate the effectiveness of our method. Our project and code are available at https://ipro-alimama.github.io/.

[61] Identity-GRPO: Optimizing Multi-Human Identity-preserving Video Generation via Reinforcement Learning

Xiangyu Meng,Zixian Zhang,Zhenghao Zhang,Junchao Liao,Long Qin,Weizhi Wang

Main category: cs.CV

TL;DR: Proposes Identity-GRPO, a human-feedback-driven optimization framework that improves identity consistency in multi-human video generation.

Details Motivation: Existing methods struggle to keep multiple people's identities consistent in complex scenes, especially during dynamic interactions, which hurts the realism and coherence of generated videos. Method: Build a large-scale preference dataset, train a video reward model on it, and design a GRPO variant tailored to multi-human identity consistency, used to optimize existing video generation methods such as VACE and Phantom. Result: Experiments show that Identity-GRPO improves human consistency metrics by up to 18.9% over baseline methods, and ablation studies examine how annotation quality and design choices affect optimization. Conclusion: Identity-GRPO effectively improves identity preservation in multi-human video generation and offers a workable path for combining reinforcement learning with personalized video generation. Abstract: While advanced methods like VACE and Phantom have advanced video generation for specific subjects in diverse scenarios, they struggle with multi-human identity preservation in dynamic interactions, where consistent identities across multiple characters are critical. To address this, we propose Identity-GRPO, a human feedback-driven optimization pipeline for refining multi-human identity-preserving video generation. First, we construct a video reward model trained on a large-scale preference dataset containing human-annotated and synthetic distortion data, with pairwise annotations focused on maintaining human consistency throughout the video. We then employ a GRPO variant tailored for multi-human consistency, which greatly enhances both VACE and Phantom. Through extensive ablation studies, we evaluate the impact of annotation quality and design choices on policy optimization. Experiments show that Identity-GRPO achieves up to 18.9% improvement in human consistency metrics over baseline methods, offering actionable insights for aligning reinforcement learning with personalized video generation.
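For readers unfamiliar with GRPO, its core step is to score each sample relative to the other samples generated for the same prompt. The sketch below shows that standard group-relative advantage only; it is not the paper's multi-human variant, and the reward values are made up. It uses the population standard deviation, which is an assumption, since implementations differ on this detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward-model scores for four videos sampled from one prompt.
rewards = [0.2, 0.4, 0.6, 0.8]
advantages = group_relative_advantages(rewards)
```

Samples above the group mean get positive advantages and are reinforced, and those below are discouraged, so no separate value network is needed. If every sample in a group receives the same reward, all advantages are zero and that group contributes no learning signal.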

[62] MatchAttention: Matching the Relative Positions for High-Resolution Cross-View Matching

Tingman Yan,Tao Liu,Xilian Yang,Qunfei Zhao,Zeyang Xia

Main category: cs.CV

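TL;DR: This paper proposes MatchAttention, a new attention mechanism that dynamically matches relative positions for efficient high-resolution cross-view matching. Combined with MatchDecoder, gated cross-attention, and a consistency loss, it achieves state-of-the-art results on several benchmarks with both high accuracy and low computational complexity.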

Details Motivation: Existing cross-attention mechanisms have quadratic complexity and lack explicit matching constraints, so they handle cross-view matching of high-resolution images poorly. Method: MatchAttention uses learnable relative positions to dynamically determine the sampling center of key-value pairs; BilinearSoftmax provides continuous, differentiable sliding-window attention sampling; relative positions are updated iteratively across layers through residual connections; and a MatchDecoder built on MatchAttention is combined with gated cross-MatchAttention and a consistency-constrained loss to handle cross-view occlusions. Result: MatchStereo-B ranks first in average error on the Middlebury benchmark and needs only 29 ms for KITTI-resolution inference; MatchStereo-T processes 4K images in 0.1 seconds using only 3 GB of GPU memory; the models reach state-of-the-art performance on KITTI 2012, KITTI 2015, ETH3D, and Spring. Conclusion: By explicitly modeling relative positions, MatchAttention significantly improves the efficiency and accuracy of cross-view matching, making real-time, high-resolution, high-accuracy matching practical. Abstract: Cross-view matching is fundamentally achieved through cross-attention mechanisms. However, matching of high-resolution images remains challenging due to the quadratic complexity and lack of explicit matching constraints in the existing cross-attention. This paper proposes an attention mechanism, MatchAttention, that dynamically matches relative positions. The relative position determines the attention sampling center of the key-value pairs given a query. Continuous and differentiable sliding-window attention sampling is achieved by the proposed BilinearSoftmax. The relative positions are iteratively updated through residual connections across layers by embedding them into the feature channels. Since the relative position is exactly the learning target for cross-view matching, an efficient hierarchical cross-view decoder, MatchDecoder, is designed with MatchAttention as its core component. To handle cross-view occlusions, gated cross-MatchAttention and a consistency-constrained loss are proposed. These two components collectively mitigate the impact of occlusions in both forward and backward passes, allowing the model to focus more on learning matching relationships. When applied to stereo matching, MatchStereo-B ranked 1st in average error on the public Middlebury benchmark and requires only 29ms for KITTI-resolution inference. MatchStereo-T can process 4K UHD images in 0.1 seconds using only 3GB of GPU memory. 
The proposed models also achieve state-of-the-art performance on KITTI 2012, KITTI 2015, ETH3D, and Spring flow datasets. The combination of high accuracy and low computational complexity makes real-time, high-resolution, and high-accuracy cross-view matching possible. Code is available at https://github.com/TingmanYan/MatchAttention.

[63] Experimental Demonstration of Event-based Optical Camera Communication in Long-Range Outdoor Environment

Miu Sumino,Mayu Ishii,Shun Kaizu,Daisuke Hisano,Yu Nakayama

Main category: cs.CV

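TL;DR: Proposes a robust demodulation scheme for optical camera communication systems based on an event camera, achieving long-range, low-bit-error-rate data transmission in outdoor experiments for the first time.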

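Details Motivation: To improve the demodulation performance and communication reliability of optical camera communication systems in complex outdoor environments. Method: Combine OOK modulation, toggle demodulation, and a digital phase-locked loop, using an event camera for robust demodulation. Result: In outdoor experiments, the bit error rate stays below 10^-3 at 200 m with 60 kbps and at 400 m with 30 kbps. Conclusion: The proposed demodulation scheme significantly improves optical camera communication performance over long distances and in dynamic environments, and has good application prospects. Abstract: We propose a robust demodulation scheme for optical camera communication systems using an event-based vision sensor, combining OOK with toggle demodulation and a digital phase-locked loop. This is the first report to achieve a $\mathrm{BER} < 10^{-3}$ at 200 m with 60 kbps and at 400 m with 30 kbps in outdoor experiments.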

[64] GauSSmart: Enhanced 3D Reconstruction through 2D Foundation Models and Geometric Filtering

Alexander Valverde,Brian Xu,Yuyin Zhou,Meng Xu,Hongyun Wang

Main category: cs.CV

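TL;DR: This paper proposes GauSSmart, a hybrid method that combines 2D foundation models with 3D Gaussian Splatting reconstruction to improve scene reconstruction quality in sparsely covered regions.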

Details Motivation: Existing Gaussian Splatting methods struggle to capture detail and maintain realism with sparse data, limited by the sparsity of 3D training data. Method: Bring in 2D computer vision techniques, such as convex filtering and semantic feature supervision from foundation models like DINO, and use 2D segmentation priors and high-dimensional feature embeddings to guide the densification and refinement of Gaussian splats. Result: Validated on three datasets, GauSSmart outperforms existing Gaussian Splatting in the majority of scenes. Conclusion: Hybrid 2D-3D approaches have significant potential and can overcome the limits of either approach alone. Abstract: Scene reconstruction has emerged as a central challenge in computer vision, with approaches such as Neural Radiance Fields (NeRF) and Gaussian Splatting achieving remarkable progress. While Gaussian Splatting demonstrates strong performance on large-scale datasets, it often struggles to capture fine details or maintain realism in regions with sparse coverage, largely due to the inherent limitations of sparse 3D training data. In this work, we propose GauSSmart, a hybrid method that effectively bridges 2D foundational models and 3D Gaussian Splatting reconstruction. Our approach integrates established 2D computer vision techniques, including convex filtering and semantic feature supervision from foundational models such as DINO, to enhance Gaussian-based scene reconstruction. By leveraging 2D segmentation priors and high-dimensional feature embeddings, our method guides the densification and refinement of Gaussian splats, improving coverage in underrepresented areas and preserving intricate structural details. We validate our approach across three datasets, where GauSSmart consistently outperforms existing Gaussian Splatting in the majority of evaluated scenes. Our results demonstrate the significant potential of hybrid 2D-3D approaches, highlighting how the thoughtful combination of 2D foundational models with 3D reconstruction pipelines can overcome the limitations inherent in either approach alone.

[65] CLEAR: Causal Learning Framework For Robust Histopathology Tumor Detection Under Out-Of-Distribution Shifts

Kieu-Anh Truong Thi,Huy-Hieu Pham,Duc-Trong Le

Main category: cs.CV

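TL;DR: Proposes a causal-inference-based framework that leverages semantic features while mitigating confounders, using transformation strategies that explicitly incorporate mediators and observed tissue slides, and achieves up to a 7% improvement on CAMELYON17 and a private dataset.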

Details Motivation: Address domain shift in whole-slide images (WSIs) caused by differences in acquisition processes or data sources; existing methods rely mainly on modeling statistical correlations and overlook causal relationships. Method: A causal-inference-based framework implements the front-door principle, with transformation strategies that incorporate mediators and observed tissue slides. Result: The method is validated on CAMELYON17 and a private histopathology dataset, with consistent gains across unseen domains of up to 7%. Conclusion: Causal inference is a powerful tool for handling domain shift in histopathology image analysis, and the proposed method outperforms existing baselines. Abstract: Domain shift in histopathology, often caused by differences in acquisition processes or data sources, poses a major challenge to the generalization ability of deep learning models. Existing methods primarily rely on modeling statistical correlations by aligning feature distributions or introducing statistical variation, yet they often overlook causal relationships. In this work, we propose a novel causal-inference-based framework that leverages semantic features while mitigating the impact of confounders. Our method implements the front-door principle by designing transformation strategies that explicitly incorporate mediators and observed tissue slides. We validate our method on the CAMELYON17 dataset and a private histopathology dataset, demonstrating consistent performance gains across unseen domains. As a result, our approach achieved up to a 7% improvement in both the CAMELYON17 dataset and the private histopathology dataset, outperforming existing baselines. These results highlight the potential of causal inference as a powerful tool for addressing domain shift in histopathology image analysis.
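For reference, the front-door principle mentioned above is usually stated as the following adjustment formula from causal inference, where $X$ is the input, $Y$ the outcome, and $M$ a mediator through which all of the effect of $X$ on $Y$ flows. How the paper maps slides, mediators, and labels onto these symbols is not given in the abstract, so the roles below are the textbook ones rather than the authors' instantiation.

```latex
P\bigl(y \mid do(x)\bigr)
  \;=\; \sum_{m} P(m \mid x) \sum_{x'} P\bigl(y \mid m, x'\bigr)\, P(x')
```

The inner sum averages over other observed inputs $x'$, which is what removes the influence of an unobserved confounder acting on both $X$ and $Y$. This is consistent with the abstract's mention of incorporating "observed tissue slides" alongside mediators.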

[66] Watermarking for Factuality: Guiding Vision-Language Models Toward Truth via Tri-layer Contrastive Decoding

Kyungryul Back,Seongbeom Park,Milim Kim,Mincheol Kwon,SangHyeok Lee,Hyunyoung Lee,Junhee Cho,Seunghyun Park,Jinkyu Kim

Main category: cs.CV

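TL;DR: Proposes a training-free tri-layer contrastive decoding method that selects a key layer using watermark questions, effectively reducing hallucinations in large vision-language models.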

Details Motivation: Large vision-language models (LVLMs) perform well on multimodal tasks but are prone to hallucination, relying on a single modality or memorized training data without visual grounding. Method: A training-free tri-layer contrastive decoding scheme with watermarking: first select a mature layer and an amateur layer, then use a watermark-related question to identify a visually well-grounded pivot layer, and finally apply tri-layer contrastive decoding to generate the output. Result: Experiments on public benchmarks such as POPE, MME, and AMBER show state-of-the-art performance in reducing LVLM hallucinations and producing more visually grounded responses. Conclusion: The method mitigates LVLM hallucination without any training and improves the reliability of generated output and its alignment with the visual input. Abstract: Large Vision-Language Models (LVLMs) have recently shown promising results on various multimodal tasks, even achieving human-comparable performance in certain cases. Nevertheless, LVLMs remain prone to hallucinations -- they often rely heavily on a single modality or memorize training data without properly grounding their outputs. To address this, we propose a training-free, tri-layer contrastive decoding with watermarking, which proceeds in three steps: (1) select a mature layer and an amateur layer among the decoding layers, (2) identify a pivot layer using a watermark-related question to assess whether the layer is visually well-grounded, and (3) apply tri-layer contrastive decoding to generate the final output. Experiments on public benchmarks such as POPE, MME and AMBER demonstrate that our method achieves state-of-the-art performance in reducing hallucinations in LVLMs and generates more visually grounded responses.

[67] A Multi-domain Image Translative Diffusion StyleGAN for Iris Presentation Attack Detection

Shivangi Yadav,Arun Ross

Main category: cs.CV

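TL;DR: Proposes MID-StyleGAN, a framework that combines diffusion models and GANs to generate multi-domain synthetic ocular images, easing data scarcity in iris presentation attack detection and substantially improving detection performance.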

Details Motivation: Real presentation attack (PA) samples are hard to collect, so iris presentation attack detection (PAD) suffers from a scarcity of training and evaluation data and needs an effective way to generate synthetic data. Method: MID-StyleGAN combines diffusion models with generative adversarial networks (GANs), uses a multi-domain architecture to translate between bonafide ocular images and several PA domains (such as printed eyes and cosmetic contact lenses), and employs an adaptive loss function to maintain domain consistency for ocular data. Result: MID-StyleGAN outperforms existing methods at generating high-quality, diverse synthetic ocular images; on the LivDet2020 dataset, the true detect rate at 1% false detect rate rises from 93.41% to 98.72%. Conclusion: MID-StyleGAN generates realistic multi-domain ocular images, significantly improves PAD performance, and offers a scalable answer to data scarcity in iris biometric systems. Abstract: An iris biometric system can be compromised by presentation attacks (PAs) where artifacts such as artificial eyes, printed eye images, or cosmetic contact lenses are presented to the system. To counteract this, several presentation attack detection (PAD) methods have been developed. However, there is a scarcity of datasets for training and evaluating iris PAD techniques due to the implicit difficulties in constructing and imaging PAs. To address this, we introduce the Multi-domain Image Translative Diffusion StyleGAN (MID-StyleGAN), a new framework for generating synthetic ocular images that captures the PA and bonafide characteristics in multiple domains such as bonafide, printed eyes and cosmetic contact lens. MID-StyleGAN combines the strengths of diffusion models and generative adversarial networks (GANs) to produce realistic and diverse synthetic data. Our approach utilizes a multi-domain architecture that enables the translation between bonafide ocular images and different PA domains. The model employs an adaptive loss function tailored for ocular data to maintain domain consistency. Extensive experiments demonstrate that MID-StyleGAN outperforms existing methods in generating high-quality synthetic ocular images. The generated data was used to significantly enhance the performance of PAD systems, providing a scalable solution to the data scarcity problem in iris and ocular biometrics. For example, on the LivDet2020 dataset, the true detect rate at 1% false detect rate improved from 93.41% to 98.72%, showcasing the impact of the proposed method.
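
The headline metric above, true detect rate (TDR) at 1% false detect rate (FDR), can be computed from raw detector scores roughly as follows. This is a minimal sketch, assuming that a higher score means "more likely a presentation attack", that TDR is the fraction of attacks flagged, and that FDR is the fraction of bonafide samples wrongly flagged. The function name `tdr_at_fdr` and the toy scores are illustrative and not taken from the paper.

```python
def tdr_at_fdr(attack_scores, bonafide_scores, max_fdr=0.01):
    """Best true detect rate among thresholds whose false detect rate <= max_fdr.

    Assumption: scores are attack likelihoods, so a sample is flagged as a
    presentation attack when its score is >= the threshold.
    """
    # Candidate thresholds: every observed score, strictest (highest) first.
    thresholds = sorted(set(attack_scores) | set(bonafide_scores), reverse=True)
    best_tdr = 0.0
    for t in thresholds:
        fdr = sum(s >= t for s in bonafide_scores) / len(bonafide_scores)
        if fdr > max_fdr:
            break  # lowering the threshold further can only raise the FDR
        best_tdr = sum(s >= t for s in attack_scores) / len(attack_scores)
    return best_tdr


# Toy data: 100 bonafide scores (0..99) and 4 attack scores.
bonafide = list(range(100))
attacks = [50, 100, 101, 102]
print(tdr_at_fdr(attacks, bonafide, max_fdr=0.01))
```

Fixing the FDR first and then reading off the TDR makes detectors comparable at the same operating point, which is why the abstract reports a single pair of TDR values at 1% FDR.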

[68] Vision-Centric Activation and Coordination for Multimodal Large Language Models

Yunnan Wang,Fan Lu,Kecheng Zheng,Ziyuan Huang,Ziqiang Li,Wenjun Zeng,Xin Jin

Main category: cs.CV

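TL;DR: Proposes VaCo, which introduces vision-centric activation and coordination and draws on multiple vision foundation models to improve the representations of multimodal large language models (MLLMs).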

Details Motivation: Mainstream MLLMs are supervised only by next-token prediction on textual tokens, neglecting the vision-centric information that analytical abilities depend on. Method: VaCo introduces visual discriminative alignment with learnable Modular Task Queries (MTQs) and Visual Alignment Layers (VALs), and applies a Token Gateway Mask (TGM) across groups of MTQs to coordinate representation conflicts between different vision foundation models. Result: Extensive experiments show that VaCo significantly improves the performance of different MLLMs on multiple benchmarks and strengthens visual comprehension. Conclusion: By fusing supervision signals from multiple vision foundation models, VaCo improves the visual understanding of MLLMs and confirms the value of vision-centric optimization. Abstract: Multimodal large language models (MLLMs) integrate image features from visual encoders with LLMs, demonstrating advanced comprehension capabilities. However, mainstream MLLMs are solely supervised by the next-token prediction of textual tokens, neglecting critical vision-centric information essential for analytical abilities. To tackle this dilemma, we introduce VaCo, which optimizes MLLM representations through Vision-Centric activation and Coordination from multiple vision foundation models (VFMs). VaCo introduces visual discriminative alignment to integrate task-aware perceptual features extracted from VFMs, thereby unifying the optimization of both textual and visual outputs in MLLMs. Specifically, we incorporate the learnable Modular Task Queries (MTQs) and Visual Alignment Layers (VALs) into MLLMs, activating specific visual signals under the supervision of diverse VFMs. To coordinate representation conflicts across VFMs, the crafted Token Gateway Mask (TGM) restricts the information flow among multiple groups of MTQs. Extensive experiments demonstrate that VaCo significantly improves the performance of different MLLMs on various benchmarks, showcasing its superior capabilities in visual comprehension.

[69] Leveraging Cycle-Consistent Anchor Points for Self-Supervised RGB-D Registration

Siddharth Tourani,Jayaram Reddy,Sarvesh Thakur,K Madhava Krishna,Muhammad Haris Khan,N Dinesh Reddy

Main category: cs.CV

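TL;DR: Proposes a self-supervised RGB-D point cloud registration method built on cycle-consistent keypoints and a pose block that combines a GRU with transformation synchronization, outperforming previous self-supervised methods on ScanNet and 3DMatch.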

Details Motivation: Exploit the large amount of unlabeled RGB-D data for geometric reasoning about scenes and improve unsupervised point cloud registration. Method: Cycle-consistent keypoints serve as salient points that enforce spatial coherence constraints during matching, and a novel pose block combines a GRU recurrent unit with transformation synchronization to blend historical and multi-view information. Result: The method surpasses previous self-supervised registration methods on ScanNet and 3DMatch, even outperforms some older supervised methods, and its components integrate effectively into existing methods. Conclusion: The approach makes effective use of unlabeled RGB-D data, improving registration accuracy through keypoint consistency and a recurrent structure while remaining broadly applicable. Abstract: With the rise in consumer depth cameras, a wealth of unlabeled RGB-D data has become available. This prompts the question of how to utilize this data for geometric reasoning of scenes. While many RGB-D registration methods rely on geometric and feature-based similarity, we take a different approach. We use cycle-consistent keypoints as salient points to enforce spatial coherence constraints during matching, improving correspondence accuracy. Additionally, we introduce a novel pose block that combines a GRU recurrent unit with transformation synchronization, blending historical and multi-view data. Our approach surpasses previous self-supervised registration methods on ScanNet and 3DMatch, even outperforming some older supervised methods. We also integrate our components into existing methods, showing their effectiveness.

[70] Spatial Preference Rewarding for MLLMs Spatial Understanding

Han Qiu,Peng Gao,Lewei Lu,Xiaoqin Zhang,Ling Shao,Shijian Lu

Main category: cs.CV

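TL;DR: Proposes SPR (Spatial Preference Rewarding), a reward-based approach that improves the fine-grained spatial understanding of multimodal large language models (MLLMs), yielding more accurate object localization and region descriptions.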

Details Motivation: Existing MLLMs fall short in fine-grained spatial perception and receive no direct supervision on their actual responses, so they often fail to meet user requirements for precise spatial understanding. Method: SPR introduces semantic and localization scores to evaluate the quality of MLLM-generated descriptions, then pairs the best-scored refined description with the lowest-scored initial description for direct preference optimization of the model's responses. Result: Experiments on standard referring and grounding benchmarks show that SPR improves MLLM spatial understanding with minimal training overhead. Conclusion: Through fine-grained feedback and preference optimization, SPR strengthens the spatial perception and description abilities of MLLMs on complex visual tasks. Abstract: Multimodal large language models (MLLMs) have demonstrated promising spatial understanding capabilities, such as referencing and grounding object descriptions. Despite their successes, MLLMs still fall short in fine-grained spatial perception abilities, such as generating detailed region descriptions or accurately localizing objects. Additionally, they often fail to respond to the user's requirements for desired fine-grained spatial understanding. This issue might arise because existing approaches primarily focus on tuning MLLMs to model pre-annotated instruction data to inject spatial knowledge, without direct supervision of MLLMs' actual responses. We address this issue with SPR, a Spatial Preference Rewarding approach that enhances MLLMs' spatial capabilities by rewarding MLLMs' detailed responses with precise object localization over vague or inaccurate responses. With randomly selected image regions and region descriptions from MLLMs, SPR introduces semantic and localization scores to comprehensively evaluate the text quality and localization quality in MLLM-generated descriptions. We also refine the MLLM descriptions with better localization accuracy and pair the best-scored refinement with the initial descriptions of the lowest score for direct preference optimization, thereby enhancing fine-grained alignment with visual input. Extensive experiments over standard referring and grounding benchmarks show that SPR improves MLLM spatial understanding capabilities effectively with minimal overhead in training. Data and code will be released at https://github.com/hanqiu-hq/SPR
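
The pairing and preference step described above can be sketched as follows. This is a minimal sketch, assuming the standard direct preference optimization (DPO) objective and a single combined score per description. The function names, the way scores are combined, and `beta=0.1` are illustrative assumptions and are not taken from the SPR paper.

```python
import math


def build_pair(initial, refined):
    """Pick (chosen, rejected) texts from lists of (text, score) tuples.

    Assumption: each score already combines the semantic and localization
    scores; how SPR combines them is not stated in the abstract.
    """
    chosen = max(refined, key=lambda item: item[1])[0]    # best-scored refinement
    rejected = min(initial, key=lambda item: item[1])[0]  # lowest-scored initial
    return chosen, rejected


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair, from sequence log-probabilities."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) computed stably as softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


chosen, rejected = build_pair(
    initial=[("a dog", 0.2), ("a dog on the left", 0.5)],
    refined=[("a brown dog at [12, 40, 88, 120]", 0.9), ("a dog at [10, 40, 90, 118]", 0.7)],
)
print(chosen, "|", rejected)
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

When the policy and the reference model assign the same log-probabilities, the margin is zero and the loss equals log 2; the loss falls as the policy raises the chosen response relative to the rejected one.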

[71] DOS: Directional Object Separation in Text Embeddings for Multi-Object Image Generation

Dongnam Byun,Jungwon Park,Jumgmin Ko,Changin Choi,Wonjong Rhee

Main category: cs.CV

TL;DR: Proposes DOS (Directional Object Separation), which adjusts CLIP text embeddings to reduce object neglect and object mixing in multi-object text-to-image generation.

Details Motivation: Existing text-to-image models often neglect or mix objects when a prompt contains several of them, especially when the objects have similar shapes or textures or strongly differing background biases, so a method that better separates objects is needed. Method: Building on two key observations about CLIP embeddings, DOS modifies three types of CLIP text embeddings to strengthen the directional separation of objects before the embeddings are passed to the text-to-image model. Result: DOS consistently raises the success rate of multi-object image generation and reduces object mixing; in human evaluations it receives 26.24%-43.04% more votes than four competing methods across four benchmarks. Conclusion: DOS is a practical and effective way to improve multi-object text-to-image generation. Abstract: Recent progress in text-to-image (T2I) generative models has led to significant improvements in generating high-quality images aligned with text prompts. However, these models still struggle with prompts involving multiple objects, often resulting in object neglect or object mixing. Through extensive studies, we identify four problematic scenarios, Similar Shapes, Similar Textures, Dissimilar Background Biases, and Many Objects, where inter-object relationships frequently lead to such failures. Motivated by two key observations about CLIP embeddings, we propose DOS (Directional Object Separation), a method that modifies three types of CLIP text embeddings before passing them into text-to-image models. Experimental results show that DOS consistently improves the success rate of multi-object image generation and reduces object mixing. In human evaluations, DOS significantly outperforms four competing methods, receiving 26.24%-43.04% more votes across four benchmarks. These results highlight DOS as a practical and effective solution for improving multi-object image generation.

[72] DRBD-Mamba for Robust and Efficient Brain Tumor Segmentation with Analytical Insights

Danish Ali,Ajmal Mian,Naveed Akhtar,Ghulam Mubashar Hassan

Main category: cs.CV

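TL;DR: Proposes an efficient dual-resolution bi-directional Mamba model (DRBD-Mamba) for 3D brain tumor segmentation that keeps accuracy high while greatly improving computational efficiency, with robustness validated on multiple BraTS data partitions.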

Details Motivation: Existing Mamba-based models for brain tumor segmentation incur high computational overhead, and their robustness across different data partitions is largely unexplored, leaving a gap in reliable evaluation. Method: DRBD-Mamba uses a space-filling curve to reduce the cost of multi-axial scans, a gated fusion module to integrate forward and reverse contexts, and a quantization block to improve robustness; five systematic folds on BraTS2023 are also proposed for more rigorous evaluation. Result: On the 20% test set used by recent methods, Dice improves by 0.10% for whole tumor, 1.75% for tumor core, and 0.93% for enhancing tumor; on the proposed five folds, average Dice improves by 0.86% for tumor core and 1.45% for enhancing tumor, with a 15-fold gain in efficiency. Conclusion: DRBD-Mamba sharply reduces computational cost while maintaining high segmentation accuracy and shows stronger robustness, offering an efficient and reliable option for Mamba-based medical image segmentation. Abstract: Accurate brain tumor segmentation is significant for clinical diagnosis and treatment. It is challenging due to the heterogeneity of tumor subregions. Mamba-based State Space Models have demonstrated promising performance. However, they incur significant computational overhead due to sequential feature computation across multiple spatial axes. Moreover, their robustness across diverse BraTS data partitions remains largely unexplored, leaving a critical gap in reliable evaluation. To address these limitations, we propose dual-resolution bi-directional Mamba (DRBD-Mamba), an efficient 3D segmentation model that captures multi-scale long-range dependencies with minimal computational overhead. We leverage a space-filling curve to preserve spatial locality during 3D-to-1D feature mapping, thereby reducing reliance on computationally expensive multi-axial feature scans. To enrich feature representation, we propose a gated fusion module that adaptively integrates forward and reverse contexts, along with a quantization block that discretizes features to improve robustness. In addition, we propose five systematic folds on BraTS2023 for rigorous evaluation of segmentation techniques under diverse conditions and present detailed analysis of common failure scenarios. On the 20% test set used by recent methods, our model achieves Dice improvements of 0.10% for whole tumor, 1.75% for tumor core, and 0.93% for enhancing tumor. 
Evaluations on the proposed systematic five folds demonstrate that our model maintains competitive whole tumor accuracy while achieving clear average Dice gains of 0.86% for tumor core and 1.45% for enhancing tumor over existing state-of-the-art. Furthermore, our model attains 15 times improvement in efficiency while maintaining high segmentation accuracy, highlighting its robustness and computational advantage over existing approaches.

[73] BoardVision: Deployment-ready and Robust Motherboard Defect Detection with YOLO+Faster-RCNN Ensemble

Brandon Hill,Kma Solaiman

Main category: cs.CV

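TL;DR: Presents BoardVision, a framework for detecting assembly-level motherboard defects; it benchmarks YOLOv7 against Faster R-CNN, proposes the lightweight CTV Voter ensemble to balance precision and recall, and releases a deployable GUI inspection tool.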

Details Motivation: Assembly-level motherboard defect detection is critical in high-volume electronics manufacturing, yet prior work has focused on bare-board or trace-level defects and has not systematically explored full-board assembly defects. Method: The BoardVision framework benchmarks YOLOv7 and Faster R-CNN and introduces Confidence-Temporal Voting (CTV Voter), a lightweight ensemble that balances precision and recall through interpretable rules. Result: The ensemble achieves a good balance of precision and recall, robustness is evaluated under realistic perturbations (brightness, sharpness, and orientation changes), and a deployable GUI inspection tool is released. Conclusion: With a systematic framework and practical tooling, computer vision can move from benchmark evaluation to real quality assurance for assembly-level motherboard manufacturing. Abstract: Motherboard defect detection is critical for ensuring reliability in high-volume electronics manufacturing. While prior research in PCB inspection has largely targeted bare-board or trace-level defects, assembly-level inspection of full motherboards remains underexplored. In this work, we present BoardVision, a reproducible framework for detecting assembly-level defects such as missing screws, loose fan wiring, and surface scratches. We benchmark two representative detectors, YOLOv7 and Faster R-CNN, under controlled conditions on the MiracleFactory motherboard dataset, providing the first systematic comparison in this domain. To mitigate the limitations of single models, where YOLO excels in precision but underperforms in recall and Faster R-CNN shows the reverse, we propose a lightweight ensemble, Confidence-Temporal Voting (CTV Voter), that balances precision and recall through interpretable rules. We further evaluate robustness under realistic perturbations including sharpness, brightness, and orientation changes, highlighting stability challenges often overlooked in motherboard defect detection. Finally, we release a deployable GUI-driven inspection tool that bridges research evaluation with operator usability. Together, these contributions demonstrate how computer vision techniques can transition from benchmark results to practical quality assurance for assembly-level motherboard manufacturing.

[74] DCMIL: A Progressive Representation Learning Model of Whole Slide Images for Cancer Prognosis Analysis

Chao Tu,Kun Huang,Jie Zhang,Qianjin Feng,Yu Zhang,Zhenyuan Ning

Main category: cs.CV

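TL;DR: Proposes DCMIL, a progressive representation learning model that efficiently processes whole slide images (WSIs) for cancer prognosis prediction without dense annotations and performs well across many cancer types.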

Details Motivation: Existing methods are limited by the computational bottleneck of gigapixel-size inputs and the scarcity of dense manual annotations, and they often overlook fine-grained information across multi-magnification WSIs and variations in tumor microenvironments. Method: Dual-curriculum contrastive multi-instance learning (DCMIL) follows an easy-to-hard progressive representation learning strategy and turns gigapixel-size WSIs directly into outcome predictions without relying on dense annotations. Result: Experiments on twelve cancer types (5,954 patients, 12.54 million tiles) show that DCMIL outperforms standard WSI-based prognostic models, identifies fine-grained prognosis-salient regions, provides robust instance uncertainty estimates, and captures morphological differences between normal and tumor tissues. Conclusion: DCMIL is an efficient WSI analysis framework that needs no dense annotations, performs strongly in cancer prognosis prediction, and has the potential to generate new biological insights; the code is publicly available. Abstract: The burgeoning discipline of computational pathology shows promise in harnessing whole slide images (WSIs) to quantify morphological heterogeneity and develop objective prognostic models for human cancers. However, progress is impeded by the computational bottleneck of gigapixel-size inputs and the scarcity of dense manual annotations. Current methods often overlook fine-grained information across multi-magnification WSIs and variations in tumor microenvironments. Here, we propose an easy-to-hard progressive representation learning model, termed dual-curriculum contrastive multi-instance learning (DCMIL), to efficiently process WSIs for cancer prognosis. The model does not rely on dense annotations and enables the direct transformation of gigapixel-size WSIs into outcome predictions. Extensive experiments on twelve cancer types (5,954 patients, 12.54 million tiles) demonstrate that DCMIL outperforms standard WSI-based prognostic models. Additionally, DCMIL identifies fine-grained prognosis-salient regions, provides robust instance uncertainty estimation, and captures morphological differences between normal and tumor tissues, with the potential to generate new biological insights. All codes have been made publicly accessible at https://github.com/tuuuc/DCMIL.

[75] Real-Time Neural Video Compression with Unified Intra and Inter Coding

Hui Xiang,Yifan Bian,Li Li,Jingran Wu,Xianguo Zhang,Dong Liu

Main category: cs.CV

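TL;DR: Proposes a unified neural video compression framework in which a single model adaptively performs intra and inter coding on every frame, addressing disocclusion, new content, and error propagation, and outperforming DCVC-RT in compression efficiency and stability while remaining real-time.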

Details Motivation: Existing neural video compression schemes handle disocclusion, new content, and interframe error propagation poorly, and would benefit from an intra coding mechanism like the one used in classic video coding. Method: A neural video compression framework with unified intra and inter coding, in which a single model adaptively selects the coding mode, together with a simultaneous two-frame compression design that exploits interframe redundancy both forward and backward. Result: The scheme achieves an average 10.7% BD-rate reduction over DCVC-RT, delivers more stable per-frame bitrate and quality, and retains real-time encoding and decoding. Conclusion: The unified coding framework overcomes key weaknesses of existing NVC methods, improving compression efficiency and stability while staying real-time. Abstract: Neural video compression (NVC) technologies have advanced rapidly in recent years, yielding state-of-the-art schemes such as DCVC-RT that offer superior compression efficiency to H.266/VVC and real-time encoding/decoding capabilities. Nonetheless, existing NVC schemes have several limitations, including inefficiency in dealing with disocclusion and new content, interframe error propagation and accumulation, among others. To eliminate these limitations, we borrow the idea from classic video coding schemes, which allow intra coding within inter-coded frames. With the intra coding tool enabled, disocclusion and new content are properly handled, and interframe error propagation is naturally intercepted without the need for manual refresh mechanisms. We present an NVC framework with unified intra and inter coding, where every frame is processed by a single model that is trained to perform intra/inter coding adaptively. Moreover, we propose a simultaneous two-frame compression design to exploit interframe redundancy not only forwardly but also backwardly. Experimental results show that our scheme outperforms DCVC-RT by an average of 10.7% BD-rate reduction, delivers more stable bitrate and quality per frame, and retains real-time encoding/decoding performances. Code and models will be released.

[76] Structured Universal Adversarial Attacks on Object Detection for Video Sequences

Sven Jacob,Weijia Shao,Gjergji Kasneci

Main category: cs.CV

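TL;DR: Proposes a minimally distorted universal adversarial attack on video object detection that uses nuclear norm regularization and is optimized with an adaptive optimistic exponentiated gradient method, improving attack effectiveness while remaining stealthy.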

Details Motivation: Deep learning models for video object detection are vulnerable to universal adversarial perturbations, and existing attacks suffer from poorly structured perturbations and inefficient optimization. Method: Nuclear norm regularization steers the perturbation to concentrate in the background, and an adaptive optimistic exponentiated gradient algorithm optimizes the formulation efficiently, yielding a minimally distorted universal attack. Result: The attack is more effective than low-rank projected gradient descent and Frank-Wolfe based attacks while maintaining high stealthiness. Conclusion: The method strengthens vulnerability assessment of video object detectors against universal perturbations and supports robustness research for safety-critical applications. Abstract: Video-based object detection plays a vital role in safety-critical applications. While deep learning-based object detectors have achieved impressive performance, they remain vulnerable to adversarial attacks, particularly those involving universal perturbations. In this work, we propose a minimally distorted universal adversarial attack tailored for video object detection, which leverages nuclear norm regularization to promote structured perturbations concentrated in the background. To optimize this formulation efficiently, we employ an adaptive, optimistic exponentiated gradient method that enhances both scalability and convergence. Our results demonstrate that the proposed attack outperforms both low-rank projected gradient descent and Frank-Wolfe based attacks in effectiveness while maintaining high stealthiness. All code and data are publicly available at https://github.com/jsve96/AO-Exp-Attack.

[77] Unsupervised Deep Generative Models for Anomaly Detection in Neuroimaging: A Systematic Scoping Review

Youwan Mahé,Elise Bannier,Stéphanie Leplaideur,Elisa Fromont,Francesca Galassi

Main category: cs.CV

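TL;DR: Reviews 49 studies published between 2018 and 2025 on unsupervised deep generative models for anomaly detection in neuroimaging, covering autoencoders, variational autoencoders, generative adversarial networks, and denoising diffusion models, and finds them promising for brain MRI anomaly detection, especially for rare or heterogeneous diseases where annotated data are scarce.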

Details Motivation: Fully supervised methods require large voxel-level annotated datasets and are limited to well-characterised pathologies, whereas healthy data are easier to obtain, so unsupervised methods that train only on healthy data and flag deviations from normal structure are needed. Method: A PRISMA-guided scoping review that systematically surveys and analyses recent work using autoencoders, variational autoencoders, generative adversarial networks, and denoising diffusion models for anomaly detection and segmentation in brain MRI and CT. Result: 49 studies were included; generative models perform well on large focal lesions and show progress on subtler abnormalities, they produce interpretable pseudo-healthy reconstructions that help localize anomalies, and architectural design choices affect performance. Conclusion: Unsupervised generative models are a promising direction for neuroimaging anomaly detection; future work should prioritise anatomy-aware modelling, foundation models, task-appropriate evaluation metrics, and rigorous clinical validation. Abstract: Unsupervised deep generative models are emerging as a promising alternative to supervised methods for detecting and segmenting anomalies in brain imaging. Unlike fully supervised approaches, which require large voxel-level annotated datasets and are limited to well-characterised pathologies, these models can be trained exclusively on healthy data and identify anomalies as deviations from learned normative brain structures. This PRISMA-guided scoping review synthesises recent work on unsupervised deep generative models for anomaly detection in neuroimaging, including autoencoders, variational autoencoders, generative adversarial networks, and denoising diffusion models. A total of 49 studies published between 2018 and 2025 were identified, covering applications to brain MRI and, less frequently, CT across diverse pathologies such as tumours, stroke, multiple sclerosis, and small vessel disease. Reported performance metrics are compared alongside architectural design choices. Across the included studies, generative models achieved encouraging performance for large focal lesions and demonstrated progress in addressing more subtle abnormalities. A key strength of generative models is their ability to produce interpretable pseudo-healthy (also referred to as counterfactual) reconstructions, which is particularly valuable when annotated data are scarce, as in rare or heterogeneous diseases. 
Looking ahead, these models offer a compelling direction for anomaly detection, enabling semi-supervised learning, supporting the discovery of novel imaging biomarkers, and facilitating within- and cross-disease deviation mapping in unified end-to-end frameworks. To realise clinical impact, future work should prioritise anatomy-aware modelling, development of foundation models, task-appropriate evaluation metrics, and rigorous clinical validation.

[78] Pruning Overparameterized Multi-Task Networks for Degraded Web Image Restoration

Thomas Katraouras,Dimitrios Rafailidis

Main category: cs.CV

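TL;DR: Proposes MIR-L, a compression method for multi-task image restoration models that uses iterative pruning to find high-performing subnetworks at high sparsity, retaining only 10% of the parameters while keeping performance high.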

Details Motivation: Images are degraded by the lossy operations of online social networks; multi-task image restoration models can handle many degradation types but tend to have very large parameter counts and low computational efficiency, so an effective compression method is needed. Method: An iterative pruning strategy removes low-magnitude weights in each round and resets the remaining weights to their original initialization, searching overparameterized deep models for high-performing sparse subnetworks ("winning tickets"). Result: On benchmark datasets for deraining, dehazing, and denoising, MIR-L retains only 10% of the trainable parameters while maintaining high restoration performance. Conclusion: Iterative pruning lets MIR-L uncover highly sparse yet high-performing subnetworks, substantially improving the computational efficiency of multi-task image restoration models and making practical deployment more feasible. Abstract: Image quality is a critical factor in delivering visually appealing content on web platforms. However, images often suffer from degradation due to lossy operations applied by online social networks (OSNs), negatively affecting user experience. Image restoration is the process of recovering a clean high-quality image from a given degraded input. Recently, multi-task (all-in-one) image restoration models have gained significant attention, due to their ability to simultaneously handle different types of image degradations. However, these models often come with an excessively high number of trainable parameters, making them computationally inefficient. In this paper, we propose a strategy for compressing multi-task image restoration models. We aim to discover highly sparse subnetworks within overparameterized deep models that can match or even surpass the performance of their dense counterparts. The proposed model, namely MIR-L, utilizes an iterative pruning strategy that removes low-magnitude weights across multiple rounds, while resetting the remaining weights to their original initialization. This iterative process is important for the multi-task image restoration model's optimization, effectively uncovering "winning tickets" that maintain or exceed state-of-the-art performance at high sparsity levels. Experimental evaluation on benchmark datasets for the deraining, dehazing, and denoising tasks shows that MIR-L retains only 10% of the trainable parameters while maintaining high image restoration performance. Our code, datasets and pre-trained models are made publicly available at https://github.com/Thomkat/MIR-L.
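
The prune-and-reset loop described above can be sketched on a flat weight list as follows. This is a minimal sketch of generic iterative magnitude pruning with rewinding to the initial weights, not the MIR-L implementation; `train_fn`, the per-round pruning fraction, and the toy weights are hypothetical placeholders.

```python
def iterative_magnitude_pruning(init_weights, train_fn, rounds, prune_frac):
    """Return (ticket, mask) after `rounds` of prune-and-reset.

    train_fn(weights, mask) is a hypothetical stand-in for a full training
    run; it must return trained weights in the same order as its input.
    """
    mask = [1] * len(init_weights)
    for _ in range(rounds):
        # Every round restarts from the ORIGINAL initialization, masked.
        weights = [w * m for w, m in zip(init_weights, mask)]
        trained = train_fn(weights, mask)
        # Rank the surviving weights by trained magnitude, smallest first.
        alive = sorted((i for i, m in enumerate(mask) if m),
                       key=lambda i: abs(trained[i]))
        for i in alive[:int(len(alive) * prune_frac)]:
            mask[i] = 0  # prune the lowest-magnitude survivors
    # The candidate "winning ticket": initial values of the surviving weights.
    ticket = [w * m for w, m in zip(init_weights, mask)]
    return ticket, mask


def toy_train(weights, mask):
    # Toy stand-in for training: scales weights and keeps pruned ones at zero.
    return [2.0 * w * m for w, m in zip(weights, mask)]


init = [0.5, -0.2, 0.1, 0.9, -0.7, 0.3, 0.05, -0.4, 0.6, -0.8]
ticket, mask = iterative_magnitude_pruning(init, toy_train, rounds=2, prune_frac=0.5)
print(mask)
```

Pruning a fixed fraction of the survivors each round compounds geometrically: removing 20% per round, for example, leaves about 0.8^10 ≈ 10.7% of the weights after ten rounds, which is one way to approach the 10% level quoted in the abstract (the schedule MIR-L actually uses is not stated there).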

[79] Grazing Detection using Deep Learning and Sentinel-2 Time Series Data

Aleksis Pirinen,Delia Fano Yela,Smita Chakraborty,Erik Källman

Main category: cs.CV

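TL;DR: Uses Sentinel-2 L2A time series imagery with CNN-LSTM models for binary detection of seasonal grazing on fields, achieving high recall and practical value for inspections.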

Details Motivation: Grazing affects both agricultural production and biodiversity, but scalable monitoring is lacking, so an efficient, large-scale way to identify grazed areas is needed. Method: An ensemble of CNN-LSTM models is trained on multi-temporal reflectance features from April-October imagery to predict whether each field was grazed. Result: The average F1 score across five validation splits is 77%, with 90% recall on grazed pastures; when only 4% of sites can be inspected, prioritising fields the model predicts as non-grazed yields 17.2 times more confirmed non-grazing sites than random inspection. Conclusion: Coarse-resolution, freely available satellite data can effectively support conservation-aligned land-use compliance monitoring and is suitable for practical use. Abstract: Grazing shapes both agricultural production and biodiversity, yet scalable monitoring of where grazing occurs remains limited. We study seasonal grazing detection from Sentinel-2 L2A time series: for each polygon-defined field boundary, April-October imagery is used for binary prediction (grazed / not grazed). We train an ensemble of CNN-LSTM models on multi-temporal reflectance features, and achieve an average F1 score of 77 percent across five validation splits, with 90 percent recall on grazed pastures. Operationally, if inspectors can visit at most 4 percent of sites annually, prioritising fields predicted by our model as non-grazed yields 17.2 times more confirmed non-grazing sites than random inspection. These results indicate that coarse-resolution, freely available satellite data can reliably steer inspection resources for conservation-aligned land-use compliance. Code and models have been made publicly available.

[80] Vision Mamba for Permeability Prediction of Porous Media

Ali Kashefi,Tapan Mukerji

Main category: cs.CV

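TL;DR: Introduces, for the first time, a network that uses Vision Mamba as its backbone to predict the permeability of three-dimensional porous media, with advantages over ViT and CNN models in computational efficiency, memory use, and parameter count.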

Details Motivation: Vision Mamba has shown better computational and memory efficiency than Vision Transformers (ViTs) and CNNs in image classification, which motivates exploring it for permeability prediction of three-dimensional porous media. Method: A model with a Vision Mamba backbone is compared with ViT and CNN models across multiple aspects of permeability prediction, and an ablation study assesses how its components affect accuracy. Result: The study demonstrates the advantages of Vision Mamba over ViTs and CNNs for permeability prediction, including lower computational cost and better memory efficiency; the source code is released for reproducibility. Conclusion: Vision Mamba performs well for permeability prediction of three-dimensional porous media and could replace ViTs in large vision models. Abstract: Vision Mamba has recently received attention as an alternative to Vision Transformers (ViTs) for image classification. The network size of Vision Mamba scales linearly with input image resolution, whereas ViTs scale quadratically, a feature that improves computational and memory efficiency. Moreover, Vision Mamba requires a significantly smaller number of trainable parameters than traditional convolutional neural networks (CNNs), and thus it can be more memory efficient. Because of these features, we introduce, for the first time, a neural network that uses Vision Mamba as its backbone for predicting the permeability of three-dimensional porous media. We compare the performance of Vision Mamba with ViT and CNN models across multiple aspects of permeability prediction and perform an ablation study to assess the effects of its components on accuracy. We demonstrate in practice the aforementioned advantages of Vision Mamba over ViTs and CNNs in the permeability prediction of three-dimensional porous media. We make the source code publicly available to facilitate reproducibility and to enable other researchers to build on and extend this work. We believe the proposed framework has the potential to be integrated into large vision models in which Vision Mamba is used instead of ViTs.

[81] Real-Time Surgical Instrument Defect Detection via Non-Destructive Testing

Qurrat Ul Ain,Atif Aftab Ahmed Jilani,Zunaira Shafqat,Nigar Azhar Butt

Main category: cs.CV

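TL;DR: SurgScan is a YOLOv8-based AI system for real-time defect detection in surgical instruments; trained on 102,876 images, it reaches 99.3% accuracy with inference times of 4.2-5.8 ms per image and supports industrial-scale automated quality control.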

Details Motivation: Quality control of surgical instruments traditionally relies on manual inspection, which is error-prone and inconsistent and puts sterility, mechanical integrity, and patient safety at risk, so an automated, high-accuracy defect detection solution is needed. Method: The SurgScan framework uses a YOLOv8 model trained on a high-resolution dataset of 102,876 images covering 11 instrument types and five major defect categories, with contrast-enhanced preprocessing to improve detection. Result: SurgScan reaches 99.3% accuracy with inference times of 4.2-5.8 ms per image, outperforming existing CNN models; statistical analysis shows that contrast enhancement significantly improves detection. Conclusion: SurgScan is accurate, real-time, and scalable, enabling automated quality control that complies with ISO 13485 and FDA standards, reduces reliance on manual inspection, and improves safety and efficiency in medical device manufacturing. Abstract: Defective surgical instruments pose serious risks to sterility, mechanical integrity, and patient safety, increasing the likelihood of surgical complications. However, quality control in surgical instrument manufacturing often relies on manual inspection, which is prone to human error and inconsistency. This study introduces SurgScan, an AI-powered defect detection framework for surgical instruments. Using YOLOv8, SurgScan classifies defects in real-time, ensuring high accuracy and industrial scalability. The model is trained on a high-resolution dataset of 102,876 images, covering 11 instrument types and five major defect categories. Extensive evaluation against state-of-the-art CNN architectures confirms that SurgScan achieves the highest accuracy (99.3%) with real-time inference speeds of 4.2-5.8 ms per image, making it suitable for industrial deployment. Statistical analysis demonstrates that contrast-enhanced preprocessing significantly improves defect detection, addressing key limitations in visual inspection. SurgScan provides a scalable, cost-effective AI solution for automated quality control, reducing reliance on manual inspection while ensuring compliance with ISO 13485 and FDA standards, paving the way for enhanced defect detection in medical manufacturing.

[82] Noise Projection: Closing the Prompt-Agnostic Gap Behind Text-to-Image Misalignment in Diffusion Models

Yunze Tong,Didi Zhu,Zijing Hu,Jinluan Yang,Ziyu Zhao

Main category: cs.CV

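TL;DR: Proposes a prompt-aware noise projection method that applies text-conditioned refinement to the initial noise before denoising, improving image-prompt alignment in text-to-image generation without modifying the pretrained model and at low inference cost.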

Details Motivation: In text-to-image generation, the noise distribution differs between training and inference (prompt-conditioned during training, drawn from a prompt-agnostic Gaussian prior at inference), so generated images may not match the prompt; existing remedies alter the denoising process or sample multiple noises and select afterwards, with limited efficiency or effect. Method: A noise projector refines the random noise at inference time, conditioned on the prompt embedding, so that it better matches the noise distribution seen in training; the framework samples several noises, obtains token-level feedback on the corresponding images from a vision-language model, distills these signals into a reward model, and trains the noise projector with a quasi-direct preference optimization. Result: Experiments show improved text-image alignment across diverse prompts, with no reference images or handcrafted priors and only a single forward pass at inference, which is cheaper than multi-sample selection. Conclusion: Prompt-aware noise projection narrows the training-inference gap in text-to-image generation and improves semantic consistency while keeping inference cost low and leaving the generative model unchanged. Abstract: In text-to-image generation, different initial noises induce distinct denoising paths with a pretrained Stable Diffusion (SD) model. While this pattern could output diverse images, some of them may fail to align well with the prompt. Existing methods alleviate this issue either by altering the denoising dynamics or by drawing multiple noises and conducting post-selection. In this paper, we attribute the misalignment to a training-inference mismatch: during training, prompt-conditioned noises lie in a prompt-specific subset of the latent space, whereas at inference the noise is drawn from a prompt-agnostic Gaussian prior. To close this gap, we propose a noise projector that applies text-conditioned refinement to the initial noise before denoising. Conditioned on the prompt embedding, it maps the noise to a prompt-aware counterpart that better matches the distribution observed during SD training, without modifying the SD model. Our framework consists of these steps: we first sample some noises and obtain token-level feedback for their corresponding images from a vision-language model (VLM), then distill these signals into a reward model, and finally optimize the noise projector via a quasi-direct preference optimization. Our design has two benefits: (i) it requires no reference images or handcrafted priors, and (ii) it incurs small inference cost, replacing multi-sample selection with a single forward pass. Extensive experiments further show that our prompt-aware noise projection improves text-image alignment across diverse prompts.

[83] PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

Cheng Cui,Ting Sun,Suyin Liang,Tingquan Gao,Zelun Zhang,Jiaxuan Liu,Xueqing Wang,Changda Zhou,Hongen Liu,Manhui Lin,Yue Zhang,Yubo Zhang,Handong Zheng,Jing Zhang,Jun Zhang,Yi Liu,Dianhai Yu,Yanjun Ma

Main category: cs.CV

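TL;DR: PaddleOCR-VL is an efficient, low-resource document parsing model that combines a dynamic-resolution visual encoder with a lightweight language model, supports 109 languages, and achieves SOTA performance on page-level and element-level recognition.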

Details Motivation: Achieve efficient and accurate multilingual document parsing, especially for complex elements such as tables, formulas, and charts, while lowering resource consumption and making real deployment more feasible. Method: A NaViT-style dynamic resolution visual encoder is combined with the ERNIE-4.5-0.3B language model to form the compact vision-language model PaddleOCR-VL-0.9B, which is evaluated on public and in-house benchmarks. Result: PaddleOCR-VL achieves SOTA performance on multiple public and in-house benchmarks, significantly outperforms existing solutions, and offers fast inference and support for 109 languages. Conclusion: PaddleOCR-VL is a high-performance, low-resource document parsing solution suited to large-scale deployment in real-world scenarios. Abstract: In this report, we propose PaddleOCR-VL, a SOTA and resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. This innovative model efficiently supports 109 languages and excels in recognizing complex elements (e.g., text, tables, formulas, and charts), while maintaining minimal resource consumption. Through comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference speeds. These strengths make it highly suitable for practical deployment in real-world scenarios.

[84] Towards Generalist Intelligence in Dentistry: Vision Foundation Models for Oral and Maxillofacial Radiology

Xinrui Huang,Fan Xiao,Dongming He,Anqi Gao,Dandan Li,Xiaofan Zhang,Shaoting Zhang,Xudong Wang

Main category: cs.CV

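TL;DR: DentVFM is the first family of vision foundation models for dentistry. Trained with self-supervised learning on DentVista, a large-scale multi-modal dental imaging dataset, it generalizes strongly across tasks and modalities, performs well in a range of dental applications, and significantly outperforms existing methods.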

Details Motivation: Existing dental AI systems are limited by single-modality focus, task-specific design, and dependence on labeled data, so they generalize poorly to diverse clinical scenarios; the lack of specialized evaluation benchmarks further holds back dental intelligence. Method: DentVFM is proposed, a set of 2D and 3D vision foundation models built on the Vision Transformer architecture and pre-trained with self-supervised learning on DentVista, a dataset of roughly 1.6 million multi-center, multi-modal images. DentBench is constructed as a comprehensive benchmark covering eight dental subspecialties. Result: DentVFM generalizes well across disease diagnosis, treatment analysis, biomarker identification, and anatomical landmark detection and segmentation, significantly outperforming supervised, self-supervised, and weakly supervised baselines. It also supports cross-modality diagnostics and, in some scenarios, is more reliable than experienced dentists. Conclusion: DentVFM sets a new paradigm for dental AI, providing a scalable, adaptable, and label-efficient solution that may advance intelligent dental healthcare and help close gaps in global oral healthcare resources. Abstract: Oral and maxillofacial radiology plays a vital role in dental healthcare, but radiographic image interpretation is limited by a shortage of trained professionals. While AI approaches have shown promise, existing dental AI systems are restricted by their single-modality focus, task-specific design, and reliance on costly labeled data, hindering their generalization across diverse clinical scenarios. To address these challenges, we introduce DentVFM, the first family of vision foundation models (VFMs) designed for dentistry. DentVFM generates task-agnostic visual representations for a wide range of dental applications and uses self-supervised learning on DentVista, a large curated dental imaging dataset with approximately 1.6 million multi-modal radiographic images from various medical centers. DentVFM includes 2D and 3D variants based on the Vision Transformer (ViT) architecture. To address gaps in dental intelligence assessment and benchmarks, we introduce DentBench, a comprehensive benchmark covering eight dental subspecialties, more diseases, imaging modalities, and a wide geographical distribution. DentVFM shows impressive generalist intelligence, demonstrating robust generalization to diverse dental tasks, such as disease diagnosis, treatment analysis, biomarker identification, and anatomical landmark detection and segmentation. Experimental results indicate DentVFM significantly outperforms supervised, self-supervised, and weakly supervised baselines, offering superior generalization, label efficiency, and scalability. Additionally, DentVFM enables cross-modality diagnostics, providing more reliable results than experienced dentists in situations where conventional imaging is unavailable. DentVFM sets a new paradigm for dental AI, offering a scalable, adaptable, and label-efficient model to improve intelligent dental healthcare and address critical gaps in global oral healthcare.

[85] Acquisition of interpretable domain information during brain MR image harmonization for content-based image retrieval

Keima Abe,Hayato Muraki,Shuhei Tomoshige,Kenichi Oishi,Hitoshi Iyatomi

Main category: cs.CV

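TL;DR: Proposes PL-SE-ADA, a new domain adaptation framework for domain harmonization and interpretable representation learning on medical images such as brain MR. By separating domain-invariant and domain-specific features, it preserves disease-relevant information while achieving strong performance in image reconstruction, disease classification, and domain recognition, and it offers high interpretability across the entire framework.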

Details Motivation: Medical images exhibit domain shifts across imaging sites, which degrades machine learning performance, and existing methods lack the interpretability that medical applications require. Method: Two encoders (f_E and f_SE) extract domain-invariant (z_u) and domain-specific (z_d) features, respectively. Together with a decoder for image reconstruction (f_D) and a domain predictor (g_D), they are optimized jointly through adversarial training and by reconstructing the input as the sum of the reconstructions from z_u and z_d. Result: The method matches or outperforms existing methods on image reconstruction, disease classification, and domain recognition, and it enables visualization of both domain-independent features and domain-specific components, improving interpretability. Conclusion: PL-SE-ADA is an effective general domain adaptation framework that preserves disease-relevant information while achieving good domain harmonization and highly interpretable representation learning, making it suitable for multi-center medical image analysis. Abstract: Medical images like MR scans often show domain shifts across imaging sites due to scanner and protocol differences, which degrade machine learning performance in tasks such as disease classification. Domain harmonization is thus a critical research focus. Recent approaches encode brain images $\boldsymbol{x}$ into a low-dimensional latent space $\boldsymbol{z}$, then disentangle it into $\boldsymbol{z_u}$ (domain-invariant) and $\boldsymbol{z_d}$ (domain-specific), achieving strong results. However, these methods often lack interpretability, an essential requirement in medical applications, leaving practical issues unresolved. We propose Pseudo-Linear-Style Encoder Adversarial Domain Adaptation (PL-SE-ADA), a general framework for domain harmonization and interpretable representation learning that preserves disease-relevant information in brain MR images. PL-SE-ADA includes two encoders $f_E$ and $f_{SE}$ to extract $\boldsymbol{z_u}$ and $\boldsymbol{z_d}$, a decoder $f_D$ to reconstruct the image, and a domain predictor $g_D$. Beyond adversarial training between the encoder and domain predictor, the model learns to reconstruct the input image $\boldsymbol{x}$ by summing reconstructions from $\boldsymbol{z_u}$ and $\boldsymbol{z_d}$, ensuring both harmonization and informativeness. Compared to prior methods, PL-SE-ADA achieves equal or better performance in image reconstruction, disease classification, and domain recognition. It also enables visualization of both domain-independent brain features and domain-specific components, offering high interpretability across the entire framework.
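Read literally, the summed reconstruction described in the abstract can be written out as below. This is only a sketch: it assumes the same decoder $f_D$ is applied to each latent separately and the two outputs are added, and the reconstruction loss $\mathcal{L}_{\mathrm{rec}}$ with a squared-error norm is an illustrative choice that the abstract does not specify.

$$
\boldsymbol{z_u} = f_E(\boldsymbol{x}), \qquad
\boldsymbol{z_d} = f_{SE}(\boldsymbol{x}), \qquad
\hat{\boldsymbol{x}} = f_D(\boldsymbol{z_u}) + f_D(\boldsymbol{z_d}), \qquad
\mathcal{L}_{\mathrm{rec}} = \lVert \boldsymbol{x} - \hat{\boldsymbol{x}} \rVert_2^2
$$

Under this reading, the two summands could be inspected separately, which is one plausible way the domain-independent and domain-specific components might be visualized.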

[86] Exploring Image Representation with Decoupled Classical Visual Descriptors

Chenyuan Qu,Hao Chen,Jianbo Jiao

Main category: cs.CV

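TL;DR: This paper proposes VisualSplit, a framework that decomposes images into classical visual descriptors (such as edges, colour, and intensity distribution) to make modern learning methods more interpretable and to enable effective attribute control in tasks such as image generation and editing.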

Details Motivation: Deep learning has made remarkable progress on image understanding tasks, but its internal representations are usually opaque and hard to interpret. Classical visual descriptors, by contrast, are readily understandable to humans. The paper asks whether modern learning methods can benefit from these classical cues. Method: VisualSplit explicitly decomposes images into decoupled classical descriptors and uses a reconstruction-based pre-training strategy to learn the essential features of each descriptor while preserving its interpretability. Result: VisualSplit achieves effective attribute control in advanced visual tasks such as image generation and editing, going beyond conventional classification and segmentation and supporting the effectiveness of the approach for visual understanding. Conclusion: Integrating classical visual descriptors into a modern learning framework can deliver both performance and interpretability; VisualSplit offers a new, effective route for visual representation learning. Abstract: Exploring and understanding efficient image representations is a long-standing challenge in computer vision. While deep learning has achieved remarkable progress across image understanding tasks, its internal representations are often opaque, making it difficult to interpret how visual information is processed. In contrast, classical visual descriptors (e.g. edge, colour, and intensity distribution) have long been fundamental to image analysis and remain intuitively understandable to humans. Motivated by this gap, we ask a central question: Can modern learning benefit from these classical cues? In this paper, we answer it with VisualSplit, a framework that explicitly decomposes images into decoupled classical descriptors, treating each as an independent but complementary component of visual knowledge. Through a reconstruction-driven pre-training scheme, VisualSplit learns to capture the essence of each visual descriptor while preserving their interpretability. By explicitly decomposing visual attributes, our method inherently facilitates effective attribute control in various advanced visual tasks, including image generation and editing, extending beyond conventional classification and segmentation, suggesting the effectiveness of this new learning approach for visual understanding. Project page: https://chenyuanqu.com/VisualSplit/.

[87] Exploring Cross-Modal Flows for Few-Shot Learning

Ziqi Jiang,Yanghao Wang,Long Chen

Main category: cs.CV

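TL;DR: This paper proposes Flow Matching Alignment (FMA), a model-agnostic multi-step adjustment method that learns a cross-modal velocity field to achieve more precise and robust feature alignment, yielding significant gains on complex datasets.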

Details Motivation: Existing parameter-efficient fine-tuning (PEFT) methods perform only a one-step adjustment, which struggles to disentangle highly entangled cross-modal features and falls short on complex datasets in particular. Method: FMA uses a fixed coupling strategy to ensure category correspondence, introduces noise augmentation to alleviate data scarcity, and designs an early-stopping solver to improve efficiency and accuracy. Result: FMA yields significant performance gains across multiple benchmarks and backbones, especially on challenging datasets. Conclusion: As the first model-agnostic PEFT approach to support multi-step adjustment, FMA achieves finer cross-modal alignment than conventional one-step methods. Abstract: Aligning features from different modalities is one of the most fundamental challenges for cross-modal tasks. Although pre-trained vision-language models can achieve a general alignment between image and text, they often require parameter-efficient fine-tuning (PEFT) for further adjustment. Today's PEFT methods (e.g., prompt tuning, LoRA-based, or adapter-based) always selectively fine-tune a subset of parameters, which can slightly adjust either visual or textual features, and avoid overfitting. In this paper, we are the first to highlight that all existing PEFT methods perform one-step adjustment. It is insufficient for complex (or difficult) datasets, where features of different modalities are highly entangled. To this end, we propose the first model-agnostic multi-step adjustment approach by learning a cross-modal velocity field: Flow Matching Alignment (FMA). Specifically, to ensure the correspondence between categories during training, we first utilize a fixed coupling strategy. Then, we propose a noise augmentation strategy to alleviate the data scarcity issue. Finally, we design an early-stopping solver, which terminates the transformation process earlier, improving both efficiency and accuracy. Compared with one-step PEFT methods, FMA has the multi-step rectification ability to achieve more precise and robust alignment. Extensive results have demonstrated that FMA can consistently yield significant performance gains across various benchmarks and backbones, particularly on challenging datasets.
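The idea of multi-step adjustment with an early-stopping solver can be pictured with a toy example. The sketch below is not the authors' implementation: it assumes a hypothetical constant velocity field that points from an image feature to its paired text feature (in FMA the field is learned), and it integrates that field with plain Euler steps, halting after `stop_step` of `num_steps`. All names (`euler_transport`, `toy_velocity`, the feature values) are illustrative.

```python
from typing import Callable, List

Vector = List[float]


def euler_transport(
    x0: Vector,
    velocity: Callable[[Vector, float], Vector],
    num_steps: int,
    stop_step: int,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 with Euler steps of size 1/num_steps.

    Integration ends after `stop_step` steps (early stopping) instead of
    always running all the way to t=1.
    """
    if not 0 <= stop_step <= num_steps:
        raise ValueError("stop_step must lie in [0, num_steps]")
    dt = 1.0 / num_steps
    x = list(x0)
    for k in range(stop_step):
        v = velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy fixed pair (assumption): one image feature and the text feature of its class.
image_feature = [0.0, 2.0]
text_feature = [4.0, 0.0]


def toy_velocity(x: Vector, t: float) -> Vector:
    # Constant field along the straight path between the fixed pair.
    return [b - a for a, b in zip(image_feature, text_feature)]


full = euler_transport(image_feature, toy_velocity, num_steps=8, stop_step=8)
early = euler_transport(image_feature, toy_velocity, num_steps=8, stop_step=4)
```

With this straight-line field, running all eight steps lands on the text feature, while stopping after four steps leaves the feature halfway along the path. A one-step method corresponds to `num_steps=1`.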

[88] Consistent text-to-image generation via scene de-contextualization

Song Tang,Peihao Gong,Kunyu Li,Kai Guo,Boyu Wang,Mao Ye,Jianwei Zhang,Xiatian Zhu

Main category: cs.CV

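TL;DR: This paper proposes SDeC, a training-free prompt embedding editing method that addresses identity shift in text-to-image generation, using a de-contextualization mechanism to keep the subject's identity consistent.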

Details Motivation: Existing methods for identity preservation in text-to-image generation typically assume that all target scenes are known in advance, which is unrealistic in practice. Method: Scene De-Contextualization (SDeC) quantifies SVD directional stability to adaptively re-weight eigenvalues, suppressing the scene-identity correlation in the prompt embedding. Result: Experiments show that SDeC significantly improves identity preservation while maintaining scene diversity, without needing to know all target scenes beforehand. Conclusion: SDeC is an efficient, flexible, and general training-free method suited to consistent text-to-image generation in real-world settings. Abstract: Consistent text-to-image (T2I) generation seeks to produce identity-preserving images of the same subject across diverse scenes, yet it often fails due to a phenomenon called identity (ID) shift. Previous methods have tackled this issue, but typically rely on the unrealistic assumption of knowing all target scenes in advance. This paper reveals that a key source of ID shift is the native correlation between subject and scene context, called scene contextualization, which arises naturally as T2I models fit the training distribution of vast natural images. We formally prove the near-universality of this scene-ID correlation and derive theoretical bounds on its strength. On this basis, we propose a novel, efficient, training-free prompt embedding editing approach, called Scene De-Contextualization (SDeC), that imposes an inversion process of T2I's built-in scene contextualization. Specifically, it identifies and suppresses the latent scene-ID correlation within the ID prompt's embedding by quantifying the SVD directional stability to adaptively re-weight the corresponding eigenvalues. Critically, SDeC allows for per-scene use (one scene per prompt) without requiring prior access to all target scenes. This makes it a highly flexible and general solution well-suited to real-world applications where such prior knowledge is often unavailable or varies over time. Experiments demonstrate that SDeC significantly enhances identity preservation while maintaining scene diversity.

[89] Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video

Yulin Zhang,Cheng Shi,Yang Wang,Sibei Yang

Main category: cs.CV

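TL;DR: This paper proposes a proactive question-answering task for egocentric streaming video, aiming for AI that proactively understands and responds in real time in everyday human settings. To this end, the authors build the ESTP-Bench benchmark and the ESTP-F1 metric, and design a complete technical framework comprising a data engine, a multi-stage training strategy, and a dynamic compression technique.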

Details Motivation: For AI to proactively perceive, understand, and respond to dynamic events in real human environments, rather than merely observe passively, systems are needed that are proactive, timely, and efficient while staying synchronized. Method: A technical pipeline with three core components is proposed: a data engine that generates high-quality training data; a multi-stage training strategy that improves the model's understanding and reasoning; and a dynamic compression technique that keeps responses timely and computation efficient. The ESTP-Bench benchmark and the ESTP-F1 metric are also introduced. Result: The proposed model outperforms multiple baselines on several online and offline benchmarks and effectively achieves the three key properties of Proactive Coherence, Just-in-Time Responsiveness, and Synchronized Efficiency. Conclusion: The work advances proactive interaction from a first-person perspective and offers a feasible technical path and evaluation framework for deploying intelligent assistants in complex, dynamic environments. Abstract: Envision an AI capable of functioning in human-like settings, moving beyond mere observation to actively understand, anticipate, and proactively respond to unfolding events. Towards this vision, we focus on the innovative task where, given ego-streaming video input, an assistant proactively answers diverse, evolving questions at the opportune moment, while maintaining synchronized perception and reasoning. This task embodies three key properties: (1) Proactive Coherence, (2) Just-in-Time Responsiveness, and (3) Synchronized Efficiency. To evaluate and address these properties, we first introduce ESTP-Bench (Ego Streaming Proactive Benchmark) alongside the ESTP-F1 metric, a novel framework designed for their rigorous assessment. Secondly, we propose a comprehensive technical pipeline to enable models to tackle this challenging task. This pipeline comprises: (1) a data engine, (2) a multi-stage training strategy, and (3) a proactive dynamic compression technique. Our proposed model effectively addresses these critical properties while outperforming multiple baselines across diverse online and offline benchmarks. Project Page: https://zhangyl4.github.io/publications/eyes-wide-open/

[90] BalanceGS: Algorithm-System Co-design for Efficient 3D Gaussian Splatting Training on GPU

Junyi Wu,Jiaming Xu,Jinhao Li,Yongkang Zhou,Jiayi Pan,Xingyang Li,Guohao Dai

Main category: cs.CV

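TL;DR: Proposes BalanceGS, an algorithm-system co-design that improves the training efficiency of 3D Gaussian Splatting through density control, adaptive sampling, and memory access optimization, achieving a 1.44x speedup with almost no loss in quality.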

Details Motivation: Conventional 3D Gaussian Splatting training suffers from three efficiency problems: skewed density allocation, imbalanced computation workload, and fragmented memory access. Method: (1) heuristic workload-sensitive Gaussian density control; (2) similarity-based Gaussian sampling and merging; (3) a reordering-based memory access mapping strategy. Result: On an NVIDIA A100 GPU, training is 1.44x faster than 3DGS with negligible quality loss. Conclusion: By co-optimizing algorithm and system, BalanceGS significantly improves the training efficiency of 3D Gaussian Splatting and offers a more efficient solution for real-time 3D reconstruction. Abstract: 3D Gaussian Splatting (3DGS) has emerged as a promising 3D reconstruction technique. The traditional 3DGS training pipeline follows three sequential steps: Gaussian densification, Gaussian projection, and color splatting. Despite its promising reconstruction quality, this conventional approach suffers from three critical inefficiencies: (1) skewed density allocation during Gaussian densification, (2) imbalanced computation workload during Gaussian projection, and (3) fragmented memory access during color splatting. To tackle the above challenges, we introduce BalanceGS, an algorithm-system co-design for efficient training in 3DGS. (1) At the algorithm level, we propose heuristic workload-sensitive Gaussian density control to automatically balance point distributions, removing 80% redundant Gaussians in dense regions while filling gaps in sparse areas. (2) At the system level, we propose similarity-based Gaussian sampling and merging, which replaces the static one-to-one thread-pixel mapping with adaptive workload distribution; threads now dynamically process variable numbers of Gaussians based on local cluster density. (3) At the mapping level, we propose a reordering-based memory access mapping strategy that restructures RGB storage and enables batch loading in shared memory. Extensive experiments demonstrate that compared with 3DGS, our approach achieves a 1.44$\times$ training speedup on an NVIDIA A100 GPU with negligible quality degradation.

[91] CALM-Net: Curvature-Aware LiDAR Point Cloud-based Multi-Branch Neural Network for Vehicle Re-Identification

Dongwook Lee,Sol Han,Jinwhan Kim

Main category: cs.CV

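TL;DR: This paper proposes CALM-Net, a curvature-aware multi-branch neural network for vehicle re-identification from LiDAR point clouds. By fusing edge convolution, point attention, and a curvature embedding, it learns more discriminative features and improves mean re-identification accuracy on the nuScenes dataset by about 1.97 percentage points over the strongest baseline.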

Details Motivation: To learn more discriminative and complementary features from 3D point clouds so that different vehicles can be told apart, addressing the limited use of geometric detail in existing methods. Method: CALM-Net adopts a multi-branch architecture in which edge convolution, a point attention mechanism, and a curvature embedding capture local surface variation and contextual information; the features from these branches are fused for vehicle re-identification. Result: Experiments on the large-scale nuScenes dataset show that CALM-Net improves mean re-identification accuracy by about 1.97 percentage points over the strongest baseline. Conclusion: Incorporating curvature information improves LiDAR point cloud-based vehicle re-identification, and the multi-branch feature learning framework effectively fuses geometric and contextual features. Abstract: This paper presents CALM-Net, a curvature-aware LiDAR point cloud-based multi-branch neural network for vehicle re-identification. The proposed model addresses the challenge of learning discriminative and complementary features from three-dimensional point clouds to distinguish between vehicles. CALM-Net employs a multi-branch architecture that integrates edge convolution, point attention, and a curvature embedding that characterizes local surface variation in point clouds. By combining these mechanisms, the model learns richer geometric and contextual features that are well suited for the re-identification task. Experimental evaluation on the large-scale nuScenes dataset demonstrates that CALM-Net achieves a mean re-identification accuracy improvement of approximately 1.97 percentage points compared with the strongest baseline in our study. The results confirm the effectiveness of incorporating curvature information into deep learning architectures and highlight the benefit of multi-branch feature learning for LiDAR point cloud-based vehicle re-identification.
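The abstract does not say how the curvature embedding is computed. One common proxy for local surface variation in point clouds is the ratio of the smallest eigenvalue of a neighbourhood's covariance matrix to the sum of all three eigenvalues; the sketch below assumes that proxy purely for illustration and should not be read as CALM-Net's actual feature. The function names are hypothetical, and the eigenvalues use the standard closed-form solution for symmetric 3x3 matrices so the example needs only the standard library.

```python
import math
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]


def covariance(points: Sequence[Point3]) -> List[List[float]]:
    """3x3 covariance matrix of a point neighbourhood."""
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(3)]
    return [
        [sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / n for j in range(3)]
        for i in range(3)
    ]


def symmetric_eigenvalues(a: List[List[float]]) -> List[float]:
    """Eigenvalues of a symmetric 3x3 matrix in ascending order (closed form)."""
    off = a[0][1] ** 2 + a[0][2] ** 2 + a[1][2] ** 2
    if off == 0.0:
        # Already diagonal: the eigenvalues are the diagonal entries.
        return sorted(a[i][i] for i in range(3))
    q = (a[0][0] + a[1][1] + a[2][2]) / 3.0
    p = math.sqrt((sum((a[i][i] - q) ** 2 for i in range(3)) + 2.0 * off) / 6.0)
    b = [[(a[i][j] - (q if i == j else 0.0)) / p for j in range(3)] for i in range(3)]
    det_b = (
        b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
        - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
        + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0])
    )
    r = max(-1.0, min(1.0, det_b / 2.0))  # clamp against rounding error
    phi = math.acos(r) / 3.0
    largest = q + 2.0 * p * math.cos(phi)
    smallest = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    return [smallest, 3.0 * q - largest - smallest, largest]


def surface_variation(points: Sequence[Point3]) -> float:
    """Smallest covariance eigenvalue over the eigenvalue sum, in [0, 1/3]."""
    lam = symmetric_eigenvalues(covariance(points))
    total = sum(lam)
    return max(lam[0], 0.0) / total if total > 0.0 else 0.0


# A flat (tilted) patch versus the corners of a cube.
plane = [(float(x), float(y), float(x + y)) for x in (-1, 0, 1) for y in (-1, 0, 1)]
cube = [(float(x), float(y), float(z)) for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)]
flat_score = surface_variation(plane)
cube_score = surface_variation(cube)
```

The flat patch scores close to zero, while the isotropic cube corners reach the maximum of one third. A per-point value like this could be fed to a network branch alongside coordinates.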

[92] Talking Points: Describing and Localizing Pixels

Matan Rusanovsky,Shimon Malnick,Shai Avidan

Main category: cs.CV

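TL;DR: This paper proposes a new framework for pixel-level keypoint grounding, consisting of a Point Descriptor that generates keypoint descriptions and a Point Localizer that regresses precise coordinates, and builds the LlamaPointInPart dataset for training and evaluation.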

Details Motivation: Existing vision-language models are limited to object-level or region-level grounding and cannot understand pixel-level keypoints through natural language. Method: A bidirectional framework composed of a Point Descriptor and a Point Localizer is proposed. It is trained on LlamaPointInPart, a synthesized dataset of 20K+ image-keypoint-description triplets, and uses GRPO optimization to improve cross-category generalization. Result: Experiments show that the method outperforms baseline models on LlamaPointInPart, and a new evaluation protocol confirms its high localization precision. Conclusion: The framework aligns natural language with pixel-level keypoints precisely and supports future applications in keypoint-guided understanding and language-guided precise localization. Abstract: Vision-language models have achieved remarkable success in cross-modal understanding. Yet, these models remain limited to object-level or region-level grounding, lacking the capability for pixel-precise keypoint comprehension through natural language. We introduce a novel framework for pixel-level grounding. The framework consists of two complementary components: a Point Descriptor that generates rich, contextual descriptions of individual keypoints, and a Point Localizer that regresses precise pixel coordinates from these descriptions. Unlike prior work that relies on templated prompts or keypoint names, our approach produces free-form, coarse-to-fine descriptions that situate keypoints within their visual context. Since there is no available dataset to train such a system, we introduce LlamaPointInPart, a carefully curated dataset of 20K+ image-keypoint-description triplets synthesized from multiple vision-language models, capturing multi-scale information from scene-level context to visual features around the keypoint. For cross-category generalization, we optimize the Point Descriptor on AP-10K via GRPO, using the frozen Point Localizer as a reward model to produce descriptions that maximize localization accuracy. To evaluate our results we establish a new evaluation protocol. Instead of comparing the text description produced by our method to the ground truth, we use the localizer to determine how close the predicted point is to the ground-truth point. Experiments demonstrate superior performance compared to baseline models on LlamaPointInPart. The bidirectional nature of our framework should enable future applications in both keypoint-guided image understanding and language-guided precise localization. Our code and dataset are publicly available at https://github.com/matanr/Talking_Points.
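The evaluation protocol scores a description by where the localizer places the point, not by text overlap. The abstract does not give the exact metric, so the sketch below makes two assumptions for illustration: the pixel distance is normalized by the image diagonal, and a prediction counts as correct when that normalized error is at most a hypothetical `threshold`. The localizer itself is not modeled; its predicted points are simply passed in.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]
Size = Tuple[int, int]  # (width, height) in pixels


def normalized_error(pred: Point, gt: Point, size: Size) -> float:
    """Euclidean pixel distance between two points, divided by the image diagonal."""
    distance = math.hypot(pred[0] - gt[0], pred[1] - gt[1])
    return distance / math.hypot(size[0], size[1])


def localization_accuracy(
    preds: Sequence[Point],
    gts: Sequence[Point],
    sizes: Sequence[Size],
    threshold: float = 0.05,
) -> float:
    """Fraction of predictions whose normalized error is at most `threshold`."""
    if not (len(preds) == len(gts) == len(sizes)) or not preds:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(
        normalized_error(p, g, s) <= threshold for p, g, s in zip(preds, gts, sizes)
    )
    return hits / len(preds)


# Two localizer outputs on 300x400 images (diagonal 500 px); the values are made up.
preds = [(110.0, 100.0), (83.0, 64.0)]
gts = [(80.0, 60.0), (80.0, 60.0)]
sizes = [(300, 400), (300, 400)]
errors = [normalized_error(p, g, s) for p, g, s in zip(preds, gts, sizes)]
accuracy = localization_accuracy(preds, gts, sizes)
```

Here the first prediction is 50 px off (normalized error 0.1) and the second 5 px off (0.01), so only the second passes the 0.05 threshold.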