cs.CL

[1] Hallucination reduction with CASAL: Contrastive Activation Steering For Amortized Learning

Wannan Yang,Xinchi Qiu,Lei Yu,Yuchen Zhang,Oliver Aobo Yang,Narine Kokhlikyan,Nicola Cancedda,Diego Garcia-Olano

Main category: cs.CL

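TL;DR: This paper proposes CASAL, a contrastive activation steering method that combines interpretability with amortized optimization to reduce hallucinations in large language models, with substantially better compute and data efficiency than existing methods.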

Details

Motivation: Large language models are highly capable but prone to hallucination, giving incorrect answers rather than admitting ignorance. A method is needed that reduces hallucinations without real-time intervention. Method: Introduces the CASAL algorithm, which uses contrastive activation steering to embed the benefits of activation steering directly in the model weights, training only a submodule of a single transformer layer. Result: Reduces hallucination by 30%-40% on multiple short-form QA benchmarks, is 30x more compute-efficient and 20x more data-efficient than LoRA-based baselines, generalizes to out-of-distribution domains, and applies to both text-only and vision-language models, including dense and MoE models. Conclusion: CASAL is an efficient, practical method for reducing hallucinations in large models and points to a new direction for applying interpretability-driven methods in production systems. Abstract: Large Language Models (LLMs) exhibit impressive capabilities but often hallucinate, confidently providing incorrect answers instead of admitting ignorance. Prior work has shown that models encode linear representations of their own knowledge and that activation steering can reduce hallucinations. These approaches, however, require real-time monitoring and intervention during inference. We introduce Contrastive Activation Steering for Amortized Learning (CASAL), an efficient algorithm that connects interpretability with amortized optimization. CASAL directly bakes the benefits of activation steering into the model's weights. Once trained, LLMs answer questions they know while abstaining from answering those they do not. CASAL's lightweight design requires training only a submodule of a single transformer layer and yet reduces hallucination by 30%-40% across multiple short-form QA benchmarks. CASAL is 30x more compute-efficient and 20x more data-efficient than strong LoRA-based baselines such as SFT and DPO, boosting its practical applicability in data-scarce domains. Importantly, CASAL also generalizes effectively to out-of-distribution (OOD) domains. We showcase CASAL's flexibility in mitigating hallucinations in both text-only and vision-language models. To our knowledge, CASAL is the first steering-based training method that has been shown to be effective for both dense and Mixture-of-Experts (MoE) models. CASAL represents a promising step forward for applying interpretability-inspired methods for practical deployment in production systems.
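The activation-steering step that CASAL builds on can be illustrated with a small sketch. The abstract does not specify how the steering direction is computed, so the difference-of-means construction below is an assumption (a common choice in the steering literature), and the function names and toy activations are hypothetical. The amortization step, in which a submodule is trained to reproduce the steered activations, is not shown.

```python
from statistics import fmean

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]

def steering_vector(known_acts, unknown_acts):
    """Difference of means, pointing from 'unknown' toward 'known' activations.

    Assumption: a difference-of-means direction; CASAL's exact recipe may differ.
    """
    mu_known = mean_vector(known_acts)
    mu_unknown = mean_vector(unknown_acts)
    return [k - u for k, u in zip(mu_known, mu_unknown)]

def steer(activation, direction, alpha):
    """Shift one activation vector along the steering direction by strength alpha."""
    return [a + alpha * d for a, d in zip(activation, direction)]

# Toy 2-dimensional activations (hypothetical values).
known = [[1.0, 0.0], [3.0, 2.0]]
unknown = [[0.0, 1.0], [0.0, 3.0]]
v = steering_vector(known, unknown)
steered = steer([0.0, 0.0], v, alpha=0.5)
```

In an inference-time steering setup, `steer` would be applied to the residual stream on every forward pass; per the abstract, CASAL avoids that runtime step by training the weights so the shift is already built in.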

[2] Hallucination-Resistant, Domain-Specific Research Assistant with Self-Evaluation and Vector-Grounded Retrieval

Vivek Bhavsar,Joseph Ereifej,Aravanan Gurusami

Main category: cs.CL

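TL;DR: RA-FSM is a modular GPT-based research assistant that controls generation with a finite-state machine and combines vector retrieval with a deterministic citation pipeline to deliver trustworthy, traceable answers for high-stakes technical settings.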

Details

Motivation: Large language models are prone to hallucination and mis-citation in literature synthesis, which limits their use in expert workflows. A more reliable and interpretable research assistant is therefore needed. Method: Proposes RA-FSM, which controls generation with a Relevance -> Confidence -> Knowledge finite-state machine, combines vector retrieval with a deterministic citation mechanism, and builds a tiered knowledge base from multiple academic sources. The system scores answerability to decide whether to retrieve, decomposes questions dynamically, and triggers retrieval only when needed. Result: In an evaluation on six task categories in photonics, domain experts preferred RA-FSM in blinded reviews, citing stronger boundary-condition handling and more defensible evidence use. Compared with Notebook LM and a basic GPT API call, RA-FSM explores more broadly, with tunable latency and cost overheads. Conclusion: Through a structured control flow and multi-source knowledge integration, RA-FSM improves the reliability and citation accuracy of generated answers, suits high-stakes technical fields, and generalizes to other scientific domains. Abstract: Large language models accelerate literature synthesis but can hallucinate and mis-cite, limiting their usefulness in expert workflows. We present RA-FSM (Research Assistant - Finite State Machine), a modular GPT-based research assistant that wraps generation in a finite-state control loop: Relevance -> Confidence -> Knowledge. The system is grounded in vector retrieval and a deterministic citation pipeline. The controller filters out-of-scope queries, scores answerability, decomposes questions, triggers retrieval only when needed, and emits answers with confidence labels and in-corpus, de-duplicated references. A ranked-tier ingestion workflow constructs a domain knowledge base from journals, conferences, indices, preprints, and patents, writing both to a dense vector index and to a relational store of normalized metrics. We implement the system for photonics and evaluate it on six task categories: analytical reasoning, numerical analysis, methodological critique, comparative synthesis, factual extraction, and application design. In blinded A/B reviews, domain experts prefer RA-FSM to both a strong Notebook LM (NLM) baseline and a vanilla single-pass baseline using a default GPT API call, citing stronger boundary-condition handling and more defensible evidence use. Coverage and novelty analyses indicate that RA-FSM explores beyond the NLM while incurring tunable latency and cost overheads. The design emphasizes transparent, well-cited answers for high-stakes technical work and is generalizable to other scientific domains.
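The Relevance -> Confidence -> Knowledge control loop described above can be sketched as a small state machine. This is an illustration only: the scoring callables, the `threshold` value, and the label strings are hypothetical stand-ins, not RA-FSM's actual components.

```python
from enum import Enum, auto

class State(Enum):
    RELEVANCE = auto()
    CONFIDENCE = auto()
    KNOWLEDGE = auto()
    DONE = auto()

def run_controller(query, in_scope, answerability, retrieve, threshold=0.7):
    """Walk the Relevance -> Confidence -> Knowledge states for one query.

    in_scope, answerability and retrieve are caller-supplied callables
    (hypothetical placeholders for the real filters and retriever).
    """
    state = State.RELEVANCE
    trace, evidence, label = [], [], None
    while state is not State.DONE:
        trace.append(state.name)
        if state is State.RELEVANCE:
            if in_scope(query):
                state = State.CONFIDENCE
            else:
                label, state = "out_of_scope", State.DONE
        elif state is State.CONFIDENCE:
            if answerability(query) >= threshold:
                label, state = "high_confidence", State.DONE
            else:
                state = State.KNOWLEDGE  # retrieve only when needed
        elif state is State.KNOWLEDGE:
            evidence = retrieve(query)
            label = "grounded" if evidence else "low_confidence"
            state = State.DONE
    return {"label": label, "evidence": evidence, "trace": trace}

# Toy usage with stub components.
result = run_controller(
    "photonic crystal bandgap",
    in_scope=lambda q: "photonic" in q,
    answerability=lambda q: 0.2,
    retrieve=lambda q: ["doc-1"],
)
rejected = run_controller(
    "best pizza in town",
    in_scope=lambda q: "photonic" in q,
    answerability=lambda q: 0.9,
    retrieve=lambda q: [],
)
```

Keeping the transitions explicit makes every answer traceable to the path it took, which is the property the abstract emphasizes for high-stakes use.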

[3] KAME: Tandem Architecture for Enhancing Knowledge in Real-Time Speech-to-Speech Conversational AI

So Kuroki,Yotaro Kubo,Takuya Akiba,Yujin Tang

Main category: cs.CL

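TL;DR: Proposes a hybrid architecture that combines a real-time speech-to-speech model with a back-end large language model to generate knowledge-rich, natural conversational responses at low latency.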

Details

Motivation: Real-time speech-to-speech models lack deep knowledge and semantic understanding, while cascaded systems are knowledge-rich but suffer from high latency that disrupts natural interaction. Method: An S2S transformer processes user speech for immediate responsiveness while the query is concurrently sent to a back-end large language model, whose text response is injected in real time to guide speech generation. Result: Evaluated on a speech-synthesized version of MT-Bench, the system substantially outperforms the baseline S2S model in response correctness and approaches the level of a cascaded system, while keeping latency on par with the baseline. Conclusion: The hybrid architecture balances low latency with knowledge richness and improves the performance of real-time spoken dialogue systems. Abstract: Real-time speech-to-speech (S2S) models excel at generating natural, low-latency conversational responses but often lack deep knowledge and semantic understanding. Conversely, cascaded systems combining automatic speech recognition, a text-based Large Language Model (LLM), and text-to-speech synthesis offer superior knowledge representation at the cost of high latency, which disrupts the flow of natural interaction. This paper introduces a novel hybrid architecture that bridges the gap between these two paradigms. Our framework processes user speech through an S2S transformer for immediate responsiveness while concurrently relaying the query to a powerful back-end LLM. The LLM's text-based response is then injected in real time to guide the S2S model's speech generation, effectively infusing its output with rich knowledge without the full latency penalty of a cascaded system. We evaluated our method using a speech-synthesized variant of the MT-Bench benchmark that consists of multi-turn question-answering sessions. The results demonstrate that our system substantially outperforms a baseline S2S model in response correctness, approaching that of a cascaded system, while maintaining a latency on par with the baseline.

[4] AMANDA: Agentic Medical Knowledge Augmentation for Data-Efficient Medical Visual Question Answering

Ziqing Wang,Chengsheng Mao,Xiaole Wen,Yuan Luo,Kaize Ding

Main category: cs.CL

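TL;DR: Proposes AMANDA, a training-free agentic framework that performs medical knowledge augmentation via LLM agents to address the intrinsic and extrinsic reasoning bottlenecks in medical visual question answering.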

Details

Motivation: Existing medical multimodal large models perform poorly in low-resource settings because of medical reasoning bottlenecks: they overlook image details and fail to integrate specialized knowledge. Method: The AMANDA framework combines coarse-to-fine question decomposition for intrinsic knowledge augmentation with biomedical knowledge graph retrieval for extrinsic knowledge augmentation, and requires no training. Result: Across eight medical visual question answering benchmarks, AMANDA yields substantial improvements in both zero-shot and few-shot settings. Conclusion: AMANDA alleviates the reasoning bottlenecks of Med-MLLMs and improves medical visual question answering in low-resource settings. Abstract: Medical Multimodal Large Language Models (Med-MLLMs) have shown great promise in medical visual question answering (Med-VQA). However, when deployed in low-resource settings where abundant labeled data are unavailable, existing Med-MLLMs commonly fail due to their medical reasoning capability bottlenecks: (i) the intrinsic reasoning bottleneck that ignores the details from the medical image; (ii) the extrinsic reasoning bottleneck that fails to incorporate specialized medical knowledge. To address those limitations, we propose AMANDA, a training-free agentic framework that performs medical knowledge augmentation via LLM agents. Specifically, our intrinsic medical knowledge augmentation focuses on coarse-to-fine question decomposition for comprehensive diagnosis, while extrinsic medical knowledge augmentation grounds the reasoning process via biomedical knowledge graph retrieval. Extensive experiments across eight Med-VQA benchmarks demonstrate substantial improvements in both zero-shot and few-shot Med-VQA settings. The code is available at https://github.com/REAL-Lab-NU/AMANDA.

[5] SelfJudge: Faster Speculative Decoding via Self-Supervised Judge Verification

Kanghoon Yoon,Minsub Kim,Sungjae Lee,Joonhyung Lee,Sunghyeon Woo,Yeonjun In,Se Jung Kwon,Chanyoung Park,Dongsoo Lee

Main category: cs.CL

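TL;DR: Proposes SelfJudge, which trains judge verifiers through self-supervision from the target model to improve the trade-off between LLM inference speed and accuracy.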

Details

Motivation: Existing judge decoding methods rely on human annotations or on tasks with verifiable ground truths, which limits their generalization across diverse NLP tasks. Method: Trains the judge verifier in a self-supervised way, using the target model to assess whether a response with substituted tokens preserves the original meaning, which enables automatic, semantics-preserving verifier training. Result: SelfJudge achieves a better inference-accuracy trade-off than existing judge decoding baselines across a range of NLP tasks. Conclusion: SelfJudge offers a broadly applicable, effective approach to accelerating large language model inference. Abstract: Speculative decoding accelerates LLM inference by verifying candidate tokens from a draft model against a larger target model. Recent judge decoding boosts this process by relaxing the verification criteria, accepting draft tokens that may exhibit minor discrepancies from the target model output, but existing methods are restricted by their reliance on human annotations or tasks with verifiable ground truths, limiting generalizability across diverse NLP tasks. We propose SelfJudge, which trains judge verifiers via self-supervision of the target model. Our method measures semantic preservation by assessing whether token-substituted responses preserve the meaning of original responses, enabling automatic verifier training across diverse NLP tasks. Our experiments show SelfJudge achieves better inference-accuracy trade-offs than judge decoding baselines, offering a broadly applicable solution for faster LLM inference.
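The difference between strict verification and judge-relaxed verification can be shown with a toy sketch. It makes simplifying assumptions: greedy exact-match verification rather than probabilistic acceptance, fixed token lists rather than target tokens re-conditioned on the accepted prefix, and a hypothetical `judge` callable standing in for a trained verifier.

```python
def accepted_prefix(draft_tokens, target_tokens, judge=None):
    """Return the draft tokens accepted before the first rejection.

    Without a judge, a draft token is accepted only on an exact match with
    the target token. With a judge, a mismatching token is also accepted
    when judge(position, draft, target) returns True.
    """
    accepted = []
    for position, (draft, target) in enumerate(zip(draft_tokens, target_tokens)):
        if draft == target or (judge is not None and judge(position, draft, target)):
            accepted.append(draft)
        else:
            break  # everything after the first rejection is discarded
    return accepted

draft = ["The", "film", "was", "great", "."]
target = ["The", "movie", "was", "good", "."]

# Hypothetical judge: accepts one hand-picked meaning-preserving substitution.
acceptable = {("film", "movie")}
toy_judge = lambda pos, d, t: (d, t) in acceptable

strict = accepted_prefix(draft, target)
relaxed = accepted_prefix(draft, target, judge=toy_judge)
```

Longer accepted prefixes mean fewer target-model calls per generated token, which is where the speed-up comes from; the judge's quality determines how much accuracy is traded for it.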

[6] EntropyLong: Effective Long-Context Training via Predictive Uncertainty

Junlong Jia,Ziyang Chen,Xing Wu,Chaochen Gao,Zijia Lin,Debing Zhang,Songlin Hu,Binghui Guo

Main category: cs.CL

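TL;DR: Proposes EntropyLong, which uses predictive uncertainty to verify the quality of long-range dependencies and builds high-quality long-context training data through model-in-the-loop verification.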

Details

Motivation: Existing data construction methods for long-context language models, such as text concatenation or heuristics, struggle to guarantee genuine long-range dependencies. Method: Identifies high-entropy positions in documents, retrieves semantically relevant contexts from large corpora, verifies their usefulness by checking whether they reduce prediction entropy, and builds training samples that include the verified contextual supplements. Result: A dataset of 128K-length sequences was built from FineWebEdu and Cosmopedia; models trained on it significantly outperform baselines on RULER and LongBenchv2, especially on tasks that require distant information. Conclusion: Entropy-based verification is essential for long-context training, and EntropyLong effectively improves a model's ability to capture long-range dependencies. Abstract: Training long-context language models to capture long-range dependencies requires specialized data construction. Current approaches, such as generic text concatenation or heuristic-based variants, frequently fail to guarantee genuine long-range dependencies. We propose EntropyLong, a novel data construction method that leverages predictive uncertainty to verify dependency quality. Our approach identifies high-entropy positions in documents, retrieves semantically relevant contexts from large corpora, and verifies their utility by assessing whether they reduce prediction entropy. This model-in-the-loop verification ensures each dependency represents measurable information gain rather than spurious correlation. We construct training samples with long-range dependencies by combining original documents with these verified contextual supplements. Using FineWebEdu and Cosmopedia, we generate a dataset of 128K-length sequences with verified dependencies. Models trained on this data demonstrate significant improvements on RULER benchmarks, particularly in tasks requiring distant information. Following instruction fine-tuning, our models also achieve substantial gains on LongBenchv2, demonstrating enhanced long-context understanding. Extensive ablation studies further validate the necessity and effectiveness of entropy-based verification for long-context training.
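The verification step, keeping a retrieved context only if it lowers prediction entropy at a high-entropy position, can be sketched as follows. The next-token distributions here are toy values rather than model outputs, and the `min_gain` threshold is a hypothetical parameter; the abstract does not say how the acceptance criterion is set.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a next-token distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def context_is_useful(probs_without, probs_with, min_gain=0.5):
    """Check whether prepending a retrieved context reduced prediction entropy.

    Returns (keep, gain), where gain is the entropy reduction in bits.
    min_gain is an assumed threshold, not a value from the paper.
    """
    gain = entropy(probs_without) - entropy(probs_with)
    return gain >= min_gain, gain

# Toy distributions over four candidate tokens at one position.
without_context = [0.25, 0.25, 0.25, 0.25]  # maximally uncertain
with_context = [0.5, 0.5, 0.0, 0.0]         # context rules out two options

keep, gain = context_is_useful(without_context, with_context)
```

A context that leaves the distribution unchanged yields zero gain and is rejected, which is how this kind of check separates genuine dependencies from merely topical neighbors.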

[7] Synthetic Dialogue Generation for Interactive Conversational Elicitation & Recommendation (ICER)

Moonkyung Ryu,Chih-Wei Hsu,Yinlam Chow,Mohammad Ghavamzadeh,Craig Boutilier

Main category: cs.CL

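TL;DR: Proposes a method that combines behavior simulators with language model prompting to generate natural dialogues consistent with the user's state, and uses it to build an open dataset for conversational recommender systems.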

Details

Motivation: Public data for conversational recommender systems (CRSs) is scarce, which makes direct fine-tuning of language models difficult, and existing LM-based user simulators often lack behavioral consistency. Method: Combines behavior simulators with LM prompting to generate natural dialogues that match the user's latent state, and builds a large open-source CRS dataset covering preference elicitation and example critiquing. Result: In rater evaluations, the generated dialogues show considerable behavioral consistency, factuality, and naturalness. Conclusion: The method generates high-quality, behaviorally consistent CRS dialogue data and helps advance the use of language models in recommender systems. Abstract: While language models (LMs) offer great potential for conversational recommender systems (CRSs), the paucity of public CRS data makes fine-tuning LMs for CRSs challenging. In response, LMs as user simulators qua data generators can be used to train LM-based CRSs, but often lack behavioral consistency, generating utterance sequences inconsistent with those of any real user. To address this, we develop a methodology for generating natural dialogues that are consistent with a user's underlying state using behavior simulators together with LM-prompting. We illustrate our approach by generating a large, open-source CRS data set with both preference elicitation and example critiquing. Rater evaluation on some of these dialogues shows them to exhibit considerable consistency, factuality and naturalness.

[8] A High-Capacity and Secure Disambiguation Algorithm for Neural Linguistic Steganography

Yapei Feng,Feng Jiang,Shanhao Wu,Hua Zhong

Main category: cs.CL

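TL;DR: This paper proposes look-ahead Sync, a new method that resolves the decoding failures caused by tokenization ambiguity in neural linguistic steganography. It substantially increases embedding capacity while retaining provable security, and experiments show it significantly outperforms SyncPool on English and Chinese benchmarks.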

Details

Motivation: The existing method SyncPool resolves tokenization ambiguity with a coarse-grained synchronization mechanism but sacrifices a large amount of embedding capacity, which limits practical use. Method: Proposes look-ahead Sync, which performs minimal synchronized sampling only on truly indistinguishable token sequences, preserves all other distinguishable paths to maximize embedding capacity, and comes with theoretical security proofs. Result: On English (Llama 3) and Chinese (Qwen 2.5) benchmarks, the method approaches the theoretical capacity upper bound, improving the embedding rate by more than 160% in English and 25% in Chinese, with larger gains for larger candidate pools. Conclusion: look-ahead Sync greatly increases embedding capacity without sacrificing security, moving high-capacity, provably secure linguistic steganography closer to practical use. Abstract: Neural linguistic steganography aims to embed information into natural text while preserving statistical undetectability. A fundamental challenge in this field stems from tokenization ambiguity in modern tokenizers, which can lead to catastrophic decoding failures. The recent method, SyncPool, addresses this ambiguity by employing a coarse-grained synchronization mechanism over groups of ambiguous candidates. However, SyncPool sacrifices embedding capacity, as it utilizes the entire Shannon entropy of an ambiguous group solely for synchronization rather than for payload embedding. We propose a method named look-ahead Sync, which overcomes the capacity limitation of SyncPool while retaining its provable security guarantees. Our approach performs minimal synchronized sampling only on truly indistinguishable token sequences, while strategically preserving all other discernible paths to maximize embedding capacity. We provide theoretical proofs for the security of our method and analyze the gap between its achievable embedding capacity and the theoretical upper bound. Experiments on English (using Llama 3) and Chinese (using Qwen 2.5) benchmarks show that our method consistently approaches the theoretical capacity upper bound and significantly outperforms SyncPool. The improvement in embedding rate exceeds 160% in English and 25% in Chinese, particularly in settings with larger candidate pools. This work represents a significant step toward practical high-capacity provably secure linguistic steganography.
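Tokenization ambiguity, the failure mode both SyncPool and look-ahead Sync target, is easy to reproduce with a toy vocabulary. The sketch below only demonstrates the problem; it does not implement either method, and the four-token vocabulary is hypothetical.

```python
def segmentations(text, vocab):
    """Enumerate every way to split text into tokens from vocab."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            for rest in segmentations(text[end:], vocab):
                results.append([piece] + rest)
    return results

# Hypothetical vocabulary in which "ab" is both one token and two tokens.
vocab = {"a", "b", "ab", "c"}
paths = segmentations("abc", vocab)
```

If the sender samples the tokens `a`, `b`, `c` while embedding a payload, the receiver sees only the string `abc` and may re-tokenize it as `ab`, `c`. The two sides then disagree about which token sequence was generated, and the payload extracted from that point onward is corrupted.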

Chiara Pugliese,Francesco Lettich,Guido Rocchietti,Chiara Renso,Fabio Pinelli

Main category: cs.CL

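TL;DR: This paper presents two public, semantically enriched human trajectory datasets containing GPS traces, contextual information, and LLM-generated social media text, supporting multimodal mobility analysis and semantic reasoning.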

Details

Motivation: Research on behavior modeling, mobility prediction, and knowledge graph construction needs reusable data resources that combine real movement data with rich semantic information. Method: Starting from GPS traces retrieved from OpenStreetMap, an open-source reproducible pipeline adds contextual layers such as stops, moves, points of interest, transportation modes, and weather, uses large language models to generate realistic social media posts, and publishes the datasets in tabular and RDF formats. Result: Two large, structurally distinct, semantically enriched trajectory datasets covering Paris and New York are released; they follow FAIR data principles and are multimodal, multi-city, and semantic-web compatible. Conclusion: The resource is the first reusable one to integrate real-world movement, structured semantic enrichment, LLM-generated text, and semantic web technologies, advancing research on LLM-based mobility analysis and intelligent applications. Abstract: In this resource paper, we present two publicly available datasets of semantically enriched human trajectories, together with the pipeline to build them. The trajectories are publicly available GPS traces retrieved from OpenStreetMap. Each dataset includes contextual layers such as stops, moves, points of interest (POIs), inferred transportation modes, and weather data. A novel semantic feature is the inclusion of synthetic, realistic social media posts generated by Large Language Models (LLMs), enabling multimodal and semantic mobility analysis. The datasets are available in both tabular and Resource Description Framework (RDF) formats, supporting semantic reasoning and FAIR data practices. They cover two structurally distinct, large cities: Paris and New York. Our open source reproducible pipeline allows for dataset customization, while the datasets support research tasks such as behavior modeling, mobility prediction, knowledge graph construction, and LLM-based applications. To our knowledge, our resource is the first to combine real-world movement, structured semantic enrichment, LLM-generated text, and semantic web compatibility in a reusable framework.

[10] Where Did It Go Wrong? Attributing Undesirable LLM Behaviors via Representation Gradient Tracing

Zhe Li,Wei Zhao,Yige Li,Jun Sun

Main category: cs.CL

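TL;DR: Proposes RepT, a new framework based on representation and gradient analysis for diagnosing undesirable behaviors in large language models, offering both efficiency and semantic interpretability.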

Details

Motivation: Large language models often produce harmful content, factual errors, and social biases, and existing attribution methods are too noisy and computationally expensive to diagnose the root causes effectively. Method: Analyzes representations and their gradients directly in the model's activation space to obtain a semantically meaningful signal linking outputs to training data, enabling both sample-level and fine-grained token-level attribution. Result: Performs strongly on tracking harmful content, detecting backdoor poisoning, and identifying knowledge contamination, precisely locating the specific samples and phrases that influence model behavior. Conclusion: RepT provides a powerful tool for understanding and auditing the risks of large language models and, ultimately, for mitigating them. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their deployment is frequently undermined by undesirable behaviors such as generating harmful content, factual inaccuracies, and societal biases. Diagnosing the root causes of these failures poses a critical challenge for AI safety. Existing attribution methods, particularly those based on parameter gradients, often fall short due to noisy signals and prohibitive computational complexity. In this work, we introduce a novel and efficient framework that diagnoses a range of undesirable LLM behaviors by analyzing representations and their gradients. It operates directly in the model's activation space to provide a semantically meaningful signal linking outputs to their training data. We systematically evaluate our method on tasks that include tracking harmful content, detecting backdoor poisoning, and identifying knowledge contamination. The results demonstrate that our approach not only excels at sample-level attribution but also enables fine-grained token-level analysis, precisely identifying the specific samples and phrases that causally influence model behavior. This work provides a powerful diagnostic tool to understand, audit, and ultimately mitigate the risks associated with LLMs. The code is available at https://github.com/plumprc/RepT.

[11] FormalML: A Benchmark for Evaluating Formal Subgoal Completion in Machine Learning Theory

Xiao-Wen Yang,Zihao Zhang,Jianuo Cao,Zhi Zhou,Zenan Li,Lan-Zhe Guo,Yuan Yao,Taolue Chen,Yu-Feng Li,Xiaoxing Ma

Main category: cs.CL

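TL;DR: This paper introduces FormalML, a Lean 4 benchmark for evaluating how well large language models complete subgoals in mathematical proofs, and highlights the accuracy and efficiency shortcomings of current models.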

Details

Motivation: To explore the ability of large language models to act as assistants to mathematicians by filling in missing steps in complex proofs, a task referred to as subgoal completion. Method: Builds the FormalML benchmark from foundational theories of machine learning, using a translation tactic that converts procedural proofs into declarative form to extract 4937 problems spanning optimization and probability inequalities. Result: The evaluation shows that state-of-the-art theorem provers still have significant limitations in accuracy and efficiency. Conclusion: More capable LLM-based theorem provers are needed for effective subgoal completion. Abstract: Large language models (LLMs) have recently demonstrated remarkable progress in formal theorem proving. Yet their ability to serve as practical assistants for mathematicians, filling in missing steps within complex proofs, remains underexplored. We identify this challenge as the task of subgoal completion, where an LLM must discharge short but nontrivial proof obligations left unresolved in a human-provided sketch. To study this problem, we introduce FormalML, a Lean 4 benchmark built from foundational theories of machine learning. Using a translation tactic that converts procedural proofs into declarative form, we extract 4937 problems spanning optimization and probability inequalities, with varying levels of difficulty. FormalML is the first subgoal completion benchmark to combine premise retrieval and complex research-level contexts. Evaluation of state-of-the-art provers highlights persistent limitations in accuracy and efficiency, underscoring the need for more capable LLM-based theorem provers for effective subgoal completion.
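To make the subgoal completion task concrete, here is a deliberately trivial Lean 4 sketch. It is not drawn from FormalML, whose problems come from optimization and probability theory; the theorem names are hypothetical and the statement was chosen only because it can be closed with a lemma from Lean core.

```lean
-- A proof sketch with one obligation left open, as a human might write it.
theorem le_add_sketch (a b : Nat) : a + b ≥ a := by
  have h : a ≤ a + b := by
    sorry -- the subgoal a prover would be asked to discharge
  exact h

-- One possible completion of the open subgoal.
theorem le_add_completed (a b : Nat) : a + b ≥ a := by
  have h : a ≤ a + b := Nat.le_add_right a b
  exact h
```

The declarative `have` step isolates the obligation together with its local context, which is the shape of problem the benchmark extracts from existing proofs.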

[12] KurdSTS: The Kurdish Semantic Textual Similarity

Abdulhady Abas Abdullah,Hadi Veisi,Hussein M. Al

Main category: cs.CL

TL;DR: This paper introduces the first Kurdish semantic textual similarity (STS) dataset, containing 10,000 sentence pairs, and evaluates several models on it.

Details

Motivation: Low-resource languages such as Kurdish lack resources for semantic similarity research, and the authors aim to fill this gap. Method: Builds a Kurdish STS dataset covering formal and informal registers and benchmarks models including Sentence-BERT and multilingual BERT. Result: The baselines achieve competitive results while revealing challenges arising from Kurdish morphology, orthographic variation, and code-mixing. Conclusion: The dataset and baselines provide a reproducible evaluation suite and a strong starting point for research on Kurdish semantics and low-resource NLP. Abstract: Semantic Textual Similarity (STS) measures the degree of meaning overlap between two texts and underpins many NLP tasks. While extensive resources exist for high-resource languages, low-resource languages such as Kurdish remain underserved. We present, to our knowledge, the first Kurdish STS dataset: 10,000 sentence pairs spanning formal and informal registers, each annotated for similarity. We benchmark Sentence-BERT, multilingual BERT, and other strong baselines, obtaining competitive results while highlighting challenges arising from Kurdish morphology, orthographic variation, and code-mixing. The dataset and baselines establish a reproducible evaluation suite and provide a strong starting point for future research on Kurdish semantics and low-resource NLP.
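For readers unfamiliar with how STS baselines are usually scored, a minimal sketch follows. The abstract does not name its metric, so the cosine-similarity-plus-Pearson setup below is an assumption based on common STS practice, and the embeddings and gold scores are toy values rather than data from KurdSTS.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pearson(xs, ys):
    """Pearson correlation between predicted and gold similarity scores."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Toy sentence-pair embeddings (hypothetical) and gold similarity labels.
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # same direction
    ([1.0, 0.0], [1.0, 1.0]),  # partially similar
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal
]
gold = [5.0, 3.0, 0.0]

predicted = [cosine(u, v) for u, v in pairs]
score = pearson(predicted, gold)
```

In a real evaluation the vectors would come from a sentence encoder such as the Sentence-BERT and multilingual BERT baselines the paper mentions.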

[13] CRACQ: A Multi-Dimensional Approach To Automated Document Assessment

Ishak Soltani,Francisco Belo,Bernardo Tavares

Main category: cs.CL

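TL;DR: This paper proposes CRACQ, a multi-dimensional automated evaluation framework for machine-generated text covering five dimensions: coherence, rigor, appropriateness, completeness, and quality. It is more interpretable and stable than single-score and LLM-as-a-judge approaches.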

Details

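Motivation: Existing automated evaluation methods mostly rely on a single overall score, lack fine-grained analysis of a text's multiple dimensions, and are poorly interpretable, which makes them hard to apply across diverse generated-text scenarios. Method: Drawing on trait-based Automated Essay Scoring (AES), the authors design a rubric with five evaluation dimensions, combine linguistic, semantic, and structural signals into the CRACQ framework, and train and test it on 500 synthetic grant proposals. Result: On real strong and weak applications, CRACQ produces more stable and interpretable trait-level judgments than an LLM-as-a-judge, supporting its advantage in fine-grained evaluation, although challenges in reliability and domain scope remain. Conclusion: CRACQ offers an interpretable, multi-dimensional approach to automated text evaluation that applies to many kinds of machine-generated text; future work needs to improve its robustness and cross-domain adaptability. Abstract: This paper presents CRACQ, a multi-dimensional evaluation framework tailored to evaluate documents across five specific traits: Coherence, Rigor, Appropriateness, Completeness, and Quality. Building on insights from trait-based Automated Essay Scoring (AES), CRACQ expands its focus beyond essays to encompass diverse forms of machine-generated text, providing a rubric-driven and interpretable methodology for automated evaluation. Unlike single-score approaches, CRACQ integrates linguistic, semantic, and structural signals into a cumulative assessment, enabling both holistic and trait-level analysis. Trained on 500 synthetic grant proposals, CRACQ was benchmarked against an LLM-as-a-judge and further tested on both strong and weak real applications. Preliminary results indicate that CRACQ produces more stable and interpretable trait-level judgments than direct LLM evaluation, though challenges in reliability and domain scope remain.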

[14] Optimizing Long-Form Clinical Text Generation with Claim-Based Rewards

Samyak Jhaveri,Praphul Singh,Jangwon Kim,Tara Taghavi,Krishnaram Kenthapadi

Main category: cs.CL

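TL;DR: Proposes an evaluation-integrated reinforcement learning framework that couples GRPO with DocLens for long-form clinical text generation, directly optimizing factual accuracy and completeness without training a separate reward model or relying on human-written references.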

Details

Motivation: Automating clinical documentation requires precise alignment with priorities such as completeness and factual accuracy, while existing methods depend on human-written references or additional reward models, which are costly and hard to scale. Method: A reinforcement learning framework that combines Group Relative Policy Optimization (GRPO) with DocLens, a deterministic, dialogue-grounded evaluator, using claim-level evaluation as the reward signal and adding a simple reward-gating strategy to reduce training cost. Result: Empirically, the generated clinical notes are better in factuality, completeness, and brevity, with fewer omissions and hallucinations; an independent GPT-5 qualitative evaluation also prefers the method's outputs; and training cost is reduced. Conclusion: The framework scales to real-world settings, supports custom objectives such as guideline adherence and billing preferences, and still shows conservative but clear improvements on high-quality benchmarks. Abstract: Automating clinical documentation with large language models requires precise alignment with priorities such as completeness and factual grounding. We present an evaluation-integrated reinforcement learning framework for long-form clinical text generation that couples Group Relative Policy Optimization (GRPO) with DocLens, a claim-level evaluator that provides deterministic, dialogue-grounded rewards. Our method directly optimizes factual grounding and completeness without training a separate reward model or relying on human-authored references. Empirically, the approach improves clinical note quality and reduces training cost via a simple reward-gating strategy. An independent GPT-5 qualitative evaluation further supports these gains, showing higher preference for GRPO outputs in factuality, completeness, and brevity, with fewer omissions and hallucinations. Because the benchmarks are relatively clean and the base model already well aligned, these improvements likely represent a conservative lower bound. The framework is scalable to real-world settings and can incorporate custom objectives such as guideline adherence or billing preferences.
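The way GRPO turns per-sample rewards into a training signal can be sketched in a few lines. The group-relative normalization is the standard GRPO idea; the `claim_reward` function is a hypothetical stand-in for a claim-level evaluator and does not reproduce DocLens's scoring, and the paper's reward-gating strategy is not shown because the abstract does not describe it.

```python
from statistics import fmean, pstdev

def claim_reward(claim_supported):
    """Fraction of claims judged supported (hypothetical reward definition)."""
    return sum(claim_supported) / len(claim_supported)

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    Each sample's advantage is its reward minus the group mean, divided by
    the group standard deviation. Population std is used here; some
    implementations use the sample std instead.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled notes for one dialogue, each with per-claim support flags.
group = [
    [True, False, False, False],
    [True, True, False, False],
    [True, True, True, False],
    [True, True, True, True],
]
rewards = [claim_reward(flags) for flags in group]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group, no learned value function or separate reward model is needed; a deterministic evaluator score per sample is enough.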

[15] Evaluating Uncertainty Quantification Methods in Argumentative Large Language Models

Kevin Zhou,Adam Dejl,Gabriel Freedman,Lihu Chen,Antonio Rago,Francesca Toni

Main category: cs.CL

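TL;DR: This paper studies the effect of integrating uncertainty quantification (UQ) methods into argumentative LLMs (ArgLLMs), an explainable framework based on computational argumentation. Experiments on claim verification show that direct prompting, despite its simplicity, performs well in ArgLLMs and outperforms more complex methods.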

Details

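Motivation: As large language models (LLMs) advance, ensuring their reliability becomes increasingly important, especially in decision-making scenarios involving complex or contentious statements, so effective uncertainty quantification (UQ) methods are needed to improve trustworthiness and explainability. Method: Integrates different LLM UQ methods into ArgLLMs, an explainable framework based on computational argumentation, evaluates them experimentally on claim verification tasks, and in doing so proposes a new way to assess the effectiveness of UQ methods. Result: Direct prompting is an effective UQ strategy in ArgLLMs, outperforming several more complex UQ methods and proving more robust on intricate and contentious statements. Conclusion: Direct prompting is a simple yet effective UQ method with superior performance in ArgLLMs, suggesting that within a specific explainability framework simple UQ strategies can beat complex ones, and offering a new perspective on designing and evaluating UQ methods. Abstract: Research in uncertainty quantification (UQ) for large language models (LLMs) is increasingly important towards guaranteeing the reliability of this groundbreaking technology. We explore the integration of LLM UQ methods in argumentative LLMs (ArgLLMs), an explainable LLM framework for decision-making based on computational argumentation in which UQ plays a critical role. We conduct experiments to evaluate ArgLLMs' performance on claim verification tasks when using different LLM UQ methods, inherently performing an assessment of the UQ methods' effectiveness. Moreover, the experimental procedure itself is a novel way of evaluating the effectiveness of UQ methods, especially when intricate and potentially contentious statements are present. Our results demonstrate that, despite its simplicity, direct prompting is an effective UQ strategy in ArgLLMs, outperforming considerably more complex approaches.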

[16] Can Prompts Rewind Time for LLMs? Evaluating the Effectiveness of Prompted Knowledge Cutoffs

Xin Gao,Ruiyi Zhang,Daniel Du,Saurabh Mahindre,Sai Ashish Somayajula,Pengtao Xie

Main category: cs.CL

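TL;DR: This paper studies whether prompts can make a large language model (LLM) simulate an earlier knowledge cutoff. Prompting works when the model is asked directly, but it struggles to achieve real "forgetting" for causally related knowledge, which suggests that current evaluations of temporal prediction need stricter settings.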

Details

Motivation: Because LLMs rely on pretraining data, temporal prediction tasks may suffer from data contamination: accurate predictions may stem from memorization rather than reasoning, so generalization is overestimated. The paper therefore explores whether prompting can make an LLM simulate an earlier knowledge cutoff to mitigate this problem. Method: Builds three evaluation datasets that test whether a prompted LLM can forget direct factual knowledge, semantic shifts, and causally related knowledge, and uses them to assess whether prompting effectively simulates a knowledge cutoff. Result: Prompting effectively suppresses later knowledge under direct queries, but has limited effect when the query is only causally related to the forgotten content, and the model still leaks post-cutoff knowledge. Conclusion: Current prompt-based unlearning is limited in simulating knowledge cutoffs, particularly for implicit causal knowledge, and stricter evaluation frameworks are needed for temporal prediction tasks. Abstract: Large Language Models (LLMs) are widely used for temporal prediction, but their reliance on pretraining data raises contamination concerns, as accurate predictions on pre-cutoff test data may reflect memorization rather than reasoning, leading to an overestimation of their generalization capability. With the recent emergence of prompting-based unlearning techniques, a natural question arises: Can LLMs be prompted to simulate an earlier knowledge cutoff? In this work, we investigate the capability of prompting to simulate an earlier knowledge cutoff in LLMs. We construct three evaluation datasets to assess the extent to which LLMs can forget (1) direct factual knowledge, (2) semantic shifts, and (3) causally related knowledge. Results demonstrate that while prompt-based simulated knowledge cutoffs are effective when the model is directly queried about information after that date, they struggle to induce forgetting when the forgotten content is not directly asked about but is causally related to the query. These findings highlight the need for more rigorous evaluation settings when applying LLMs to temporal prediction tasks. The full dataset and evaluation code are available at https://github.com/gxx27/time_unlearn.

[17] DRIFT: Learning from Abundant User Dissatisfaction in Real-World Preference Learning

Yifan Wang,Bolian Li,Junlin Wu,Zhaoxuan Tan,Zheli Liu,Ruqi Zhang,Ananth Grama,Qingkai Zeng

Main category: cs.CL

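TL;DR: This paper proposes DRIFT, which uses the abundant user dissatisfaction signals found in real-world deployments for preference learning, enabling efficient post-training of large models without explicit satisfaction feedback, with significant performance gains and preserved answer diversity.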

Details

Motivation: Real-world LLM applications produce large volumes of implicit user dissatisfaction signals while explicit satisfaction feedback is scarce, and existing preference learning methods rely on human annotation or assume plentiful positive feedback, which does not match the actual data distribution. Method: Proposes DRIFT (Dissatisfaction-Refined Iterative Preference Training), which anchors training on real-world dissatisfaction signals and dynamically samples positives from the evolving policy for iterative training. Result: On benchmarks such as WildBench and AlpacaEval2, DRIFT significantly outperforms baseline methods on 7B and 14B models, with the 14B model even surpassing GPT-4o-mini; it also preserves the ability to generate diverse high-reward responses and avoids gradient degeneration. Conclusion: DRIFT is an effective and scalable recipe for real-world post-training that makes full use of the most abundant and informative dissatisfaction signals to improve model performance. Abstract: Real-world large language model deployments (e.g., conversational AI systems, code generation assistants) naturally generate abundant implicit user dissatisfaction (DSAT) signals, as users iterate toward better answers through refinements, corrections, and expressed preferences, while explicit satisfaction (SAT) feedback is scarce. Existing preference learning approaches are poorly aligned with this data profile, as they rely on costly human annotations or assume plentiful positive responses. In this paper, we introduce DRIFT (Dissatisfaction-Refined Iterative preFerence Training), which anchors training on real-world DSAT signals and samples positives dynamically from the evolving policy. Empirically, DRIFT models trained on real-world WildFeedback datasets and synthetic UltraFeedback datasets achieve up to +6.23% (7B) / +7.61% (14B) on WildBench Task Score and up to +8.95% (7B) / +12.29% (14B) on AlpacaEval2 win rate over base models, outperforming strong baseline methods such as iterative DPO and SPIN. At larger scales, the improvements are particularly pronounced: 14B models trained with DRIFT surpass GPT-4o-mini on WildBench. Further analysis shows that DRIFT also preserves exploratory capacity, yielding more diverse high-reward solutions rather than collapsing to narrow subsets. Theoretically, we demonstrate that this design preserves preference margins and avoids gradient degeneration. These results show that DRIFT is an effective and scalable recipe for real-world post-training that leverages the most abundant and informative signal. The code and data are available at https://github.com/cacayaya/DRIFT.git.
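The data-construction idea behind DRIFT, using a response that drew user dissatisfaction as the rejected side and a fresh sample from the current policy as the chosen side, can be sketched as below. The record fields and the `sample_from_policy` callable are hypothetical, and the iterative training loop and preference objective are not shown because the abstract does not specify them.

```python
def build_preference_pairs(dsat_records, sample_from_policy):
    """Build (prompt, chosen, rejected) triples from dissatisfaction logs.

    dsat_records: dicts with 'prompt' and 'response', where the response
        drew a dissatisfaction signal from the user (hypothetical schema).
    sample_from_policy: callable returning a new response from the current
        policy for a prompt (hypothetical interface).
    """
    pairs = []
    for record in dsat_records:
        pairs.append({
            "prompt": record["prompt"],
            "chosen": sample_from_policy(record["prompt"]),
            "rejected": record["response"],  # real-world DSAT anchor
        })
    return pairs

# Toy usage with a stub policy.
logs = [{"prompt": "Sort a list in Python", "response": "Use a for loop."}]
stub_policy = lambda prompt: "Call sorted(items)."
pairs = build_preference_pairs(logs, stub_policy)
```

Re-running this step after each policy update would refresh the chosen side with samples from the improved model, which is one reading of "samples positives dynamically from the evolving policy".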

Aurélien Bück-Kaeffer,Je Qin Chooi,Dan Zhao,Maximilian Puelma Touzel,Kellin Pelrine,Jean-François Godbout,Reihaneh Rabbany,Zachary Yang

Main category: cs.CL

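TL;DR: This paper presents the SIMPACT framework and the BluePrint dataset for training and evaluating LLM-based social media agents, supporting scalable, ethical social media simulation through behavior modeling and standardized data.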

Details

Motivation: The field lacks standardized data resources for fine-tuning and evaluating large language models as social media agents, and experiments with human subjects pose ethical and logistical difficulties. Method: Proposes the SIMPACT framework, which formulates next-action prediction as the task and builds datasets grounded in real social media behavior; it clusters users into behavioral personas and defines evaluation metrics at the population and cluster levels. Result: Releases the BluePrint dataset, which covers 12 social interaction types, is built from de-identified Bluesky political discourse data, supports context-dependent modeling of language and behavior, and enables evaluation of behavioral fidelity and stylistic realism. Conclusion: SIMPACT and BluePrint provide a standardized, privacy-preserving data and evaluation foundation for training realistic social media agents, advancing responsible simulation research on issues such as misinformation and polarization. Abstract: Large language models (LLMs) offer promising capabilities for simulating social media dynamics at scale, enabling studies that would be ethically or logistically challenging with human subjects. However, the field lacks standardized data resources for fine-tuning and evaluating LLMs as realistic social media agents. We address this gap by introducing SIMPACT, the SIMulation-oriented Persona and Action Capture Toolkit, a privacy-respecting framework for constructing behaviorally grounded social media datasets suitable for training agent models. We formulate next-action prediction as a task for training and evaluating LLM-based agents and introduce metrics at both the cluster and population levels to assess behavioral fidelity and stylistic realism. As a concrete implementation, we release BluePrint, a large-scale dataset built from public Bluesky data focused on political discourse. BluePrint clusters anonymized users into personas of aggregated behaviours, capturing authentic engagement patterns while safeguarding privacy through pseudonymization and removal of personally identifiable information. The dataset includes a sizable action set of 12 social media interaction types (likes, replies, reposts, etc.), each instance tied to the posting activity preceding it. This supports the development of agents that model social media users with context dependence not only in language but also in interaction behaviours. By standardizing data and evaluation protocols, SIMPACT provides a foundation for advancing rigorous, ethically responsible social media simulations. BluePrint serves as both an evaluation benchmark for political discourse modeling and a template for building domain-specific datasets to study challenges such as misinformation and polarization.

[19] Breaking the MoE LLM Trilemma: Dynamic Expert Clustering with Structured Compression

Peijun Zhu,Ning Yang,Jiayu Wei,Jinghang Wu,Haijun Zhang

Main category: cs.CL

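TL;DR: Proposes a unified framework based on dynamic expert clustering and structured compression that addresses the trilemma of load imbalance, parameter redundancy, and communication overhead in MoE LLMs, substantially improving efficiency while preserving quality.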

Details

Motivation: MoE large language models face the threefold challenge of load imbalance, parameter redundancy, and communication overhead, which existing methods struggle to address jointly. Method: Uses online clustering to dynamically regroup experts based on parameter and activation similarity; within each cluster, expert weights are decomposed into a shared base matrix and low-rank residual adapters, combined with a two-stage hierarchical routing strategy, heterogeneous-precision storage, and dynamic offloading. Result: On GLUE and WikiText-103, quality matches standard MoE models while total parameters drop by about 80%, throughput rises by 10%-20%, expert load variance falls by a factor of more than three, and peak memory approaches that of dense models. Conclusion: Structural reorganization is an effective path toward scalable, efficient, and memory-friendly MoE LLMs. Abstract: Mixture-of-Experts (MoE) Large Language Models (LLMs) face a trilemma of load imbalance, parameter redundancy, and communication overhead. We introduce a unified framework based on dynamic expert clustering and structured compression to address these issues cohesively. Our method employs an online clustering procedure that periodically regroups experts using a fused metric of parameter and activation similarity, which stabilizes expert utilization. To our knowledge, this is one of the first frameworks to leverage the semantic embedding capability of the router to dynamically reconfigure the model's architecture during training for substantial efficiency gains. Within each cluster, we decompose expert weights into a shared base matrix and extremely low-rank residual adapters, achieving up to fivefold parameter reduction per group while preserving specialization. This structure enables a two-stage hierarchical routing strategy: tokens are first assigned to a cluster, then to specific experts within it, drastically reducing the routing search space and the volume of all-to-all communication. Furthermore, a heterogeneous precision scheme, which stores shared bases in FP16 and residual factors in INT4, coupled with dynamic offloading of inactive clusters, reduces peak memory consumption to levels comparable to dense models. Evaluated on GLUE and WikiText-103, our framework matches the quality of standard MoE models while reducing total parameters by approximately 80%, improving throughput by 10% to 20%, and lowering expert load variance by a factor of over three. Our work demonstrates that structural reorganization is a principled path toward scalable, efficient, and memory-effective MoE LLMs.
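
The shared-base-plus-low-rank-residual decomposition described in this abstract can be illustrated with a parameter count. The sketch below is only an arithmetic illustration: the function name, the cluster size, the layer dimensions, and the rank are all hypothetical and are not taken from the paper.

```python
def group_param_counts(num_experts, d_in, d_out, rank):
    """Parameter counts for one expert cluster, before and after decomposition.

    All sizes are illustrative assumptions, not values from the paper.
    """
    # Before: every expert stores its own full d_in x d_out weight matrix.
    dense = num_experts * d_in * d_out
    # After: one shared base matrix for the whole cluster ...
    shared_base = d_in * d_out
    # ... plus, per expert, a rank-r residual stored as two thin factors
    # of shape (d_in x r) and (r x d_out).
    residuals = num_experts * rank * (d_in + d_out)
    return dense, shared_base + residuals


dense, compact = group_param_counts(num_experts=4, d_in=1024, d_out=1024, rank=16)
print(dense, compact, dense / compact)
```

The ratio grows with the number of experts per cluster and shrinks as the residual rank increases, which is why the abstract stresses that the residual adapters are extremely low-rank.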

[20] Small Language Models for Curriculum-based Guidance

Konstantinos Katharakis,Sippo Rossi,Raghava Rao Mukkamala

Main category: cs.CL

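TL;DR: This study explores the potential of small language models (SLMs) with retrieval-augmented generation (RAG) as AI teaching assistants in education, finding that with optimized prompting and retrieval, SLMs can match large language models such as GPT-4o in accuracy and pedagogical alignment while offering lower energy use, cost, and privacy risk, making them better suited to sustainable personalized teaching.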

Details

Motivation: To explore how generative AI can be adopted in education while maintaining teaching quality and reducing computational resource consumption and environmental impact, promoting sustainable and scalable personalized learning. Method: Builds a RAG-based AI teaching assistant, evaluates eight open-source small language models (such as LLaMA 3.1, IBM Granite 3.3, and Gemma 3) on curriculum guidance tasks, and benchmarks them against GPT-4o. Result: With proper prompting and retrieval support, small language models reach a level comparable to large language models in response accuracy and pedagogical alignment, while running in real time on consumer-grade hardware and significantly reducing energy consumption and cloud dependence. Conclusion: Small language models combined with RAG are a sustainable, efficient, and privacy-friendly solution for AI teaching assistants, suitable for educational institutions that want to scale green AI education without sacrificing performance. Abstract: The adoption of generative AI and large language models (LLMs) in education is still emerging. In this study, we explore the development and evaluation of AI teaching assistants that provide curriculum-based guidance using a retrieval-augmented generation (RAG) pipeline applied to selected open-source small language models (SLMs). We benchmarked eight SLMs, including LLaMA 3.1, IBM Granite 3.3, and Gemma 3 (7-17B parameters), against GPT-4o. Our findings show that with proper prompting and targeted retrieval, SLMs can match LLMs in delivering accurate, pedagogically aligned responses. Importantly, SLMs offer significant sustainability benefits due to their lower computational and energy requirements, enabling real-time use on consumer-grade hardware without depending on cloud infrastructure. This makes them not only cost-effective and privacy-preserving but also environmentally responsible, positioning them as viable AI teaching assistants for educational institutions aiming to scale personalized learning in a sustainable and energy-efficient manner.

[21] mini-vec2vec: Scaling Universal Geometry Alignment with Linear Transformations

Guy Dar

Main category: cs.CL

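TL;DR: Proposes mini-vec2vec, an efficient and stable linear method for aligning text embedding spaces without parallel data, which matches or exceeds the original vec2vec at a far lower computational cost.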

Details

Motivation: The original vec2vec aligns text embedding spaces almost perfectly, but it is computationally expensive and unstable, which limits its use. Method: A three-stage pipeline (tentative matching of pseudo-parallel embedding vectors, transformation fitting, and iterative refinement) learns a linear mapping. Result: mini-vec2vec is orders of magnitude more efficient than the original vec2vec while matching or exceeding its results, and it is highly robust and interpretable. Conclusion: mini-vec2vec is an efficient, stable, and interpretable method for aligning text embeddings, suited to large-scale use and to adoption in new domains. Abstract: We build upon vec2vec, a procedure designed to align text embedding spaces without parallel data. vec2vec finds a near-perfect alignment, but it is expensive and unstable. We present mini-vec2vec, a simple and efficient alternative that requires substantially lower computational cost and is highly robust. Moreover, the learned mapping is a linear transformation. Our method consists of three main stages: a tentative matching of pseudo-parallel embedding vectors, transformation fitting, and iterative refinement. Our linear alternative exceeds the original instantiation of vec2vec by orders of magnitude in efficiency, while matching or exceeding its results. The method's stability and interpretable algorithmic steps facilitate scaling and unlock new opportunities for adoption in new domains and fields.
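
To make the "transformation fitting" stage concrete, the sketch below fits a linear map to already-matched vector pairs by ordinary least squares in two dimensions. This is an assumption for illustration only: the abstract does not say which fitting procedure mini-vec2vec uses, real embeddings have hundreds of dimensions, and the function name `fit_linear_map_2d` is hypothetical.

```python
def fit_linear_map_2d(pairs):
    """Least-squares 2x2 matrix W minimizing sum ||W x - y||^2 over (x, y) pairs.

    Illustrative stand-in for a transformation-fitting step; not the
    procedure from the paper.
    """
    a = [[0.0, 0.0], [0.0, 0.0]]  # A = sum of x x^T
    b = [[0.0, 0.0], [0.0, 0.0]]  # B = sum of y x^T
    for x, y in pairs:
        for i in range(2):
            for j in range(2):
                a[i][j] += x[i] * x[j]
                b[i][j] += y[i] * x[j]
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    if abs(det) < 1e-12:
        raise ValueError("source vectors do not span the plane")
    inv = [[a[1][1] / det, -a[0][1] / det],
           [-a[1][0] / det, a[0][0] / det]]
    # The normal equations give W A = B, so W = B A^{-1}.
    return [[sum(b[i][k] * inv[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


# Pseudo-parallel pairs generated by a 90-degree rotation of the source vectors.
pairs = [((1.0, 0.0), (0.0, 1.0)),
         ((0.0, 1.0), (-1.0, 0.0)),
         ((1.0, 1.0), (-1.0, 1.0))]
w = fit_linear_map_2d(pairs)
print(w)
```

With noisy or partly wrong matches, the fitted map is only approximate, which is where an iterative refinement stage like the one the abstract mentions would come in.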

[22] LLMSQL: Upgrading WikiSQL for the LLM Era of Text-to-SQL

Dzmitry Pihulski,Karol Charchut,Viktoria Novogrodskaia,Jan Kocoń

Main category: cs.CL

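TL;DR: LLMSQL is a systematic revision of WikiSQL that aims to provide a clean, standardized Text-to-SQL benchmark for the LLM era; it fixes several classes of errors in the original data and supplies natural language questions and full SQL queries as plain text, improving the training and evaluation of modern NL2SQL models.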

Details

Motivation: Because of flaws such as case inconsistencies, data type errors, syntax errors, and unanswered questions, WikiSQL is no longer well suited to modern LLM research, and a higher-quality benchmark adapted to current generative models is needed. Method: Classifies the errors in WikiSQL and develops automated methods for data cleaning and re-annotation; the data is reformatted into pairs of plain-text natural language questions and full SQL queries, yielding the LLMSQL benchmark for direct generation and evaluation with LLMs. Result: Builds the LLMSQL dataset, fixes the various issues in WikiSQL, and evaluates several mainstream LLMs (such as Gemma 3, LLaMA 3.2, and Qwen 2.5), validating its effectiveness and practicality as a new NL2SQL benchmark. Conclusion: LLMSQL is not a simple update of WikiSQL but a new LLM-oriented benchmark with a cleaner, more direct data format that helps advance Text-to-SQL in the generative AI era. Abstract: Converting natural language questions into SQL queries (Text-to-SQL) enables non-expert users to interact with relational databases and has long been a central task for natural language interfaces to data. While the WikiSQL dataset played a key role in early NL2SQL research, its usage has declined due to structural and annotation issues, including case sensitivity inconsistencies, data type mismatches, syntax errors, and unanswered questions. We present LLMSQL, a systematic revision and transformation of WikiSQL designed for the LLM era. We classify these errors and implement automated methods for cleaning and re-annotation. To assess the impact of these improvements, we evaluated multiple large language models (LLMs), including Gemma 3, LLaMA 3.2, Mistral 7B, gpt-oss 20B, Phi-3.5 Mini, Qwen 2.5, OpenAI o4-mini, DeepSeek R1 and others. Rather than serving as an update, LLMSQL is introduced as an LLM-ready benchmark: unlike the original WikiSQL, tailored for pointer-network models selecting tokens from input, LLMSQL provides clean natural language questions and full SQL queries as plain text, enabling straightforward generation and evaluation for modern natural language-to-SQL models.
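
When questions and queries are plain text, one common way to score a generated query is to execute it and compare the returned rows with those of the reference query. The sketch below shows that idea with Python's standard `sqlite3` module. It is an assumption that execution matching is an appropriate metric here, since the abstract does not name the evaluation protocol; the `players` table and the `execution_match` helper are hypothetical.

```python
import sqlite3


def execution_match(conn, predicted_sql, gold_sql):
    """Return True when both queries run and yield the same multiset of rows."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute counts as wrong
    gold = conn.execute(gold_sql).fetchall()
    # Sort by repr so that row order does not affect the comparison.
    return sorted(predicted, key=repr) == sorted(gold, key=repr)


# Hypothetical toy table standing in for one benchmark table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (name TEXT, team TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO players VALUES (?, ?, ?)",
    [("Ana", "Red", 31), ("Bo", "Blue", 17), ("Cy", "Red", 24)],
)

print(execution_match(
    conn,
    "SELECT name FROM players WHERE team = 'Red'",
    "SELECT name FROM players WHERE points > 20",
))
```

Execution matching accepts differently worded queries that return the same rows, so on a small table it can also accept a query that is right only by coincidence, as in the usage example above.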

[23] Language, Culture, and Ideology: Personalizing Offensiveness Detection in Political Tweets with Reasoning LLMs

Dzmitry Pihulski,Jan Kocoń

Main category: cs.CL

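TL;DR: This study examines how large language models (LLMs) judge offensive content in political discourse when they adopt specific political and cultural perspectives.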

Details

Motivation: To understand how LLM judgments of offensive speech differ across political stances and cultural backgrounds, and to improve their applicability in ideologically diverse settings. Method: Uses a multilingual subset of the MD-Agreement dataset, centered on tweets from the 2020 US elections, to evaluate several recent LLMs (such as DeepSeek-R1, o4-mini, and GPT-4.1-mini) on judging whether tweets are offensive from far-right, conservative, centrist, and progressive standpoints in English, Polish, and Russian. Result: Larger models with explicit reasoning abilities (such as DeepSeek-R1 and o4-mini) are more consistent and more sensitive across ideologies and cultures, while smaller models struggle to capture subtle distinctions. Reasoning ability significantly improves the personalization and interpretability of judgments. Conclusion: Reasoning ability is a key factor in adapting LLMs to nuanced sociopolitical text classification across languages and ideologies. Abstract: We explore how large language models (LLMs) assess offensiveness in political discourse when prompted to adopt specific political and cultural perspectives. Using a multilingual subset of the MD-Agreement dataset centered on tweets from the 2020 US elections, we evaluate several recent LLMs - including DeepSeek-R1, o4-mini, GPT-4.1-mini, Qwen3, Gemma, and Mistral - tasked with judging tweets as offensive or non-offensive from the viewpoints of varied political personas (far-right, conservative, centrist, progressive) across English, Polish, and Russian contexts. Our results show that larger models with explicit reasoning abilities (e.g., DeepSeek-R1, o4-mini) are more consistent and sensitive to ideological and cultural variation, while smaller models often fail to capture subtle distinctions. We find that reasoning capabilities significantly improve both the personalization and interpretability of offensiveness judgments, suggesting that such mechanisms are key to adapting LLMs for nuanced sociopolitical text classification across languages and ideologies.

[24] Evaluating Bias in Spoken Dialogue LLMs for Real-World Decisions and Recommendations

Yihao Wu,Tianrui Wang,Yizhou Peng,Yi-Wen Chao,Xuyi Zhuang,Xinsheng Wang,Shunshun Yin,Ziyang Ma

Main category: cs.CL

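TL;DR: This paper presents the first systematic study of bias in end-to-end spoken dialogue models, finding that closed-source models show lower overall bias, that open-source models are more sensitive to age and gender, and that recommendation tasks tend to amplify cross-group disparities, with biased decisions potentially persisting across multi-turn conversations.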

Details

Motivation: Paralinguistic features in spoken dialogue models (such as age, gender, and accent) may introduce or amplify bias and affect the fairness of decisions and recommendations, yet this remains understudied, so the bias of such models in multi-turn dialogue needs systematic evaluation. Method: Builds the FairDialogue dataset and quantifies bias with the Group Unfairness Score (GUS) and the similarity-based normalized statistics rate (SNSR), evaluating how bias changes under repeated negative feedback in multi-turn dialogue for several speech LLMs, including Qwen2.5-Omni, GLM-4-Voice, GPT-4o Audio, and Gemini-2.5-Flash. Result: Closed-source models (such as GPT-4o) generally show lower bias than open-source models; open-source models are more sensitive to age and gender; recommendation tasks amplify cross-group disparities more readily than decision tasks; and bias may persist across multi-turn conversations. Conclusion: Spoken dialogue models exhibit notable bias, especially open-source models and recommendation tasks, and multi-turn interaction may exacerbate unfairness. The study stresses the need to attend to fairness in audio input and output settings and releases the dataset and code to support follow-up research. Abstract: While biases in large language models (LLMs), such as stereotypes and cultural tendencies in outputs, have been examined and identified, their presence and characteristics in spoken dialogue models (SDMs) with audio input and output remain largely unexplored. Paralinguistic features, such as age, gender, and accent, can affect model outputs; when compounded by multi-turn conversations, these effects may exacerbate biases, with potential implications for fairness in decision-making and recommendation tasks. In this paper, we systematically evaluate biases in speech LLMs and study the impact of multi-turn dialogues with repeated negative feedback. Bias is measured using Group Unfairness Score (GUS) for decisions and similarity-based normalized statistics rate (SNSR) for recommendations, across both open-source models like Qwen2.5-Omni and GLM-4-Voice, as well as closed-source APIs such as GPT-4o Audio and Gemini-2.5-Flash. Our analysis reveals that closed-source models generally exhibit lower bias, while open-source models are more sensitive to age and gender, and recommendation tasks tend to amplify cross-group disparities. We found that biased decisions may persist in multi-turn conversations. This work provides the first systematic study of biases in end-to-end spoken dialogue models, offering insights towards fair and reliable audio-based interactive systems. To facilitate further research, we release the FairDialogue dataset and evaluation code.

[25] An Senegalese Legal Texts Structuration Using LLM-augmented Knowledge Graph

Oumar Kane,Mouhamad M. Allaya,Dame Samb,Mamadou Bousso

Main category: cs.CL

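TL;DR: This study explores using artificial intelligence and large language models (LLMs) to improve access to legal texts in Senegal's judicial system; it extracts and organizes a large number of legal articles, builds a graph database to visualize the links between legal texts, and validates advanced triple extraction techniques.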

Details

Motivation: To address the difficulty of obtaining and organizing legal documents in Senegal's judicial system, and to help the public and legal practitioners understand and use legal information more efficiently. Method: Extracts 7,967 legal articles from various legal documents, focusing on the Land and Public Domain Code, builds a graph database with 2,872 nodes and 10,774 relationships, and uses models such as GPT-4o, GPT-4, and Mistral-Large for triple extraction to identify legal relationships and metadata. Result: Builds a structured legal knowledge graph that shows the interconnections between legal texts; several LLMs perform well on triple extraction, and the accessibility and organization of legal information improve markedly. Conclusion: Artificial intelligence and LLMs can effectively support the digital transformation of Senegal's legal system, giving citizens and legal professionals a more efficient way to understand their rights and responsibilities. Abstract: This study examines the application of artificial intelligence (AI) and large language models (LLMs) to improve access to legal texts in Senegal's judicial system. The emphasis is on the difficulties of extracting and organizing legal documents, highlighting the need for better access to judicial information. The research successfully extracted 7,967 articles from various legal documents, particularly focusing on the Land and Public Domain Code. A detailed graph database was developed, which contains 2,872 nodes and 10,774 relationships, aiding in the visualization of interconnections within legal texts. In addition, advanced triple extraction techniques were applied to extract knowledge, demonstrating the effectiveness of models such as GPT-4o, GPT-4, and Mistral-Large in identifying relationships and relevant metadata. Through these technologies, the aim is to create a solid framework that allows Senegalese citizens and legal professionals to more effectively understand their rights and responsibilities.

[26] Modeling the language cortex with form-independent and enriched representations of sentence meaning reveals remarkable semantic abstractness

Shreya Saha,Shurui Li,Greta Tuckute,Yuanning Li,Ru-Yuan Zhang,Leila Wehbe,Evelina Fedorenko,Meenakshi Khosla

Main category: cs.CL

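TL;DR: Using representations from vision and language models to predict language cortex responses to sentences, the study finds that aggregating over multiple generated images and averaging over multiple paraphrases improves prediction accuracy, indicating that the human language system holds highly abstract, form-independent meaning representations that are richer than those of existing language models.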

Details

Motivation: To examine how abstract meaning representations are in the human language system, and to test whether abstract semantic representations exist beyond linguistic form. Method: Generates multiple images for each sentence and extracts vision model embeddings, and also averages embeddings over multiple paraphrases of a sentence and their context-enriched versions, in order to predict neural responses in the language cortex. Result: Aggregating embeddings over multiple generated images or multiple paraphrases significantly improves prediction of language cortex responses, even surpassing predictions based on the embedding of the original sentence; adding implicit contextual details improves results further. Conclusion: The language cortex holds highly abstract, form-independent meaning representations that are richer and broader than those of current language models. Abstract: The human language system represents both linguistic forms and meanings, but the abstractness of the meaning representations remains debated. Here, we searched for abstract representations of meaning in the language cortex by modeling neural responses to sentences using representations from vision and language models. When we generate images corresponding to sentences and extract vision model embeddings, we find that aggregating across multiple generated images yields increasingly accurate predictions of language cortex responses, sometimes rivaling large language models. Similarly, averaging embeddings across multiple paraphrases of a sentence improves prediction accuracy compared to any single paraphrase. Enriching paraphrases with contextual details that may be implicit (e.g., augmenting "I had a pancake" to include details like "maple syrup") further increases prediction accuracy, even surpassing predictions based on the embedding of the original sentence, suggesting that the language system maintains richer and broader semantic representations than language models. Together, these results demonstrate the existence of highly abstract, form-independent meaning representations within the language cortex.

[27] DiffuSpec: Unlocking Diffusion Language Models for Speculative Decoding

Guanghao Li,Zhihui Fu,Min Fang,Qibin Zhao,Ming Tang,Chun Yuan,Jun Wang

Main category: cs.CL

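TL;DR: DiffuSpec is a training-free framework that uses a pretrained diffusion language model (DLM) to produce multi-token drafts in a single forward pass; with a causal-consistency path search and an adaptive draft-length controller, it achieves up to 3x wall-clock speedup.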

Details

Motivation: Although accuracy improves as large language models scale, the serial nature of autoregressive decoding increases latency, so decoding latency needs to be reduced. Method: Uses a pretrained diffusion language model to generate multi-token drafts, and introduces a causal-consistency path search (CPS) and an adaptive draft-length (ADL) controller to improve draft generation and verification. Result: Across several benchmarks, DiffuSpec achieves up to 3x wall-clock speedup, showing that diffusion-based drafting is a strong alternative to autoregressive drafters. Conclusion: DiffuSpec offers an effective non-autoregressive drafting method that significantly improves LLM decoding efficiency. Abstract: As large language models (LLMs) scale up, accuracy improves, but the autoregressive (AR) nature of decoding increases latency since each token requires a serial forward pass. Speculative decoding addresses this by employing a fast drafter to propose multi-token drafts, which are then verified in parallel by the target model. However, many deployments still rely on AR drafters, where sequential passes limit wall-clock gains. We revisit the drafting stage and present DiffuSpec, a training-free drop-in framework that uses a pretrained diffusion language model (DLM) to produce multi-token drafts in a single forward pass, while remaining compatible with standard AR verifiers. Because DLM drafts are generated under bidirectional conditioning, parallel per-position candidates form a token lattice in which the locally highest-probability token at each position need not form a causal left-to-right path. Moreover, DLM drafting requires pre-specifying a draft length, inducing a speed-quality trade-off. To address these challenges, we introduce two practical components: (i) a causal-consistency path search (CPS) over this lattice that extracts a left-to-right path aligned with AR verification; and (ii) an adaptive draft-length (ADL) controller that adjusts the next proposal size based on recent acceptance feedback and realized generated length. Across benchmarks, DiffuSpec yields up to 3x wall-clock speedup, establishing diffusion-based drafting as a robust alternative to autoregressive drafters for speculative decoding.
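
The draft-then-verify loop that speculative decoding relies on can be sketched with the standard greedy acceptance rule: keep draft tokens for as long as the target model would have produced the same token, then fall back to the target's own choice. The sketch below is that generic rule only, not DiffuSpec's CPS or ADL components; `verify_draft` and `toy_target` are hypothetical names, and the toy target simply continues a fixed sentence.

```python
def verify_draft(prefix, draft, target_next_token):
    """Accept the longest draft prefix that the target model agrees with.

    target_next_token(tokens) stands in for the target model's greedy
    next-token choice. A real verifier scores every draft position in one
    parallel pass; this loop only mimics the acceptance rule.
    """
    accepted = []
    for token in draft:
        expected = target_next_token(prefix + accepted)
        if token != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(token)
    # The whole draft was accepted, so the target adds one bonus token.
    accepted.append(target_next_token(prefix + accepted))
    return accepted


def toy_target(tokens):
    # Hypothetical target model: always continues one fixed sentence.
    sentence = ["the", "cat", "sat", "on", "the", "mat", "."]
    return sentence[len(tokens)] if len(tokens) < len(sentence) else "."


print(verify_draft(["the"], ["cat", "sat", "in", "a"], toy_target))
```

Each verification round therefore emits between one and `len(draft) + 1` tokens, which is why the draft length matters for the speed-quality trade-off the abstract mentions.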

[28] Emission-GPT: A domain-specific language model agent for knowledge retrieval, emission inventory and data analysis

Jiashu Ye,Tong Wu,Weiwen Chen,Hao Zhang,Zeteng Lin,Xingxing Li,Shujuan Weng,Manni Zhu,Xin Yuan,Xinlong Hong,Jingjie Li,Junyu Zheng,Zhijiong Huang,Jing Tang

Main category: cs.CL

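TL;DR: This paper presents Emission-GPT, a knowledge-enhanced LLM agent for the atmospheric emissions domain, built on a knowledge base of more than 10,000 documents, which lets non-expert users query, visualize, and run scenario analyses on emissions data through natural language interaction.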

Details

Motivation: Emission-related knowledge is fragmented and inefficient to access, and non-specialists find emissions data hard to understand and use, which holds back research and management. Method: Builds a knowledge base of standards, reports, guidebooks, and literature, and combines prompt engineering with question completion to develop Emission-GPT, which supports natural-language-driven emissions data analysis and question answering. Result: A case study in Guangdong Province shows that Emission-GPT can extract key information such as point source distributions and sectoral trends from raw data with simple prompts, and supports emission inventory queries, source contribution analysis, and emission factor recommendation for scenarios. Conclusion: Emission-GPT has a modular, extensible architecture that can automate traditionally manual workflows and could become a foundational tool for next-generation emission inventory development and scenario assessment. Abstract: Improving air quality and addressing climate change rely on accurate understanding and analysis of air pollutant and greenhouse gas emissions. However, emission-related knowledge is often fragmented and highly specialized, while existing methods for accessing and compiling emissions data remain inefficient. These issues hinder the ability of non-experts to interpret emissions information, posing challenges to research and management. To address this, we present Emission-GPT, a knowledge-enhanced large language model agent tailored for the atmospheric emissions domain. Built on a curated knowledge base of over 10,000 documents (including standards, reports, guidebooks, and peer-reviewed literature), Emission-GPT integrates prompt engineering and question completion to support accurate domain-specific question answering. Emission-GPT also enables users to interactively analyze emissions data via natural language, such as querying and visualizing inventories, analyzing source contributions, and recommending emission factors for user-defined scenarios. A case study in Guangdong Province demonstrates that Emission-GPT can extract key insights--such as point source distributions and sectoral trends--directly from raw data with simple prompts. Its modular and extensible architecture facilitates automation of traditionally manual workflows, positioning Emission-GPT as a foundational tool for next-generation emission inventory development and scenario-based assessment.

[29] Spiral of Silence in Large Language Model Agents

Mingze Zhong,Meng Fang,Zijing Shi,Yuxuan Huang,Shunfeng Zheng,Yali Du,Ling Chen,Jun Wang

Main category: cs.CL

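TL;DR: This study asks whether opinion dynamics resembling the Spiral of Silence (SoS) can emerge in collectives of large language models (LLMs) and proposes an evaluation framework, finding that history and persona information together produce majority dominance and SoS patterns.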

Details

Motivation: The Spiral of Silence theory was developed for human societies, and LLMs lack the underlying psychological mechanisms, so it is necessary to test whether purely statistical language generation can produce similar social dynamics. Method: Designs four experimental conditions by controlling the presence or absence of 'History' and 'Persona' signals, and assesses opinion dynamics with trend tests such as Mann-Kendall and Spearman's rank correlation and with concentration measures such as kurtosis and interquartile range. Result: History and persona information together lead to clear majority dominance and SoS patterns; history alone induces strong anchoring; persona alone yields diverse but uncorrelated opinions; without historical anchoring, SoS dynamics cannot form. Conclusion: The study shows that social phenomena resembling the Spiral of Silence can emerge in LLM collectives and stresses the need to monitor and mitigate such conformity in AI system design, with implications for computational sociology and responsible AI. Abstract: The Spiral of Silence (SoS) theory holds that individuals with minority views often refrain from speaking out for fear of social isolation, enabling majority positions to dominate public discourse. When the 'agents' are large language models (LLMs), however, the classical psychological explanation is not directly applicable, since SoS was developed for human societies. This raises a central question: can SoS-like dynamics nevertheless emerge from purely statistical language generation in LLM collectives? We propose an evaluation framework for examining SoS in LLM agents. Specifically, we consider four controlled conditions that systematically vary the availability of 'History' and 'Persona' signals. Opinion dynamics are assessed using trend tests such as Mann-Kendall and Spearman's rank, along with concentration measures including kurtosis and interquartile range. Experiments across open-source and closed-source models show that history and persona together produce strong majority dominance and replicate SoS patterns; history signals alone induce strong anchoring; and persona signals alone foster diverse but uncorrelated opinions, indicating that without historical anchoring, SoS dynamics cannot emerge. The work bridges computational sociology and responsible AI design, highlighting the need to monitor and mitigate emergent conformity in LLM-agent systems.
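
The Mann-Kendall trend test named in this abstract can be computed in a few lines. The sketch below implements the textbook statistic under the assumption that the series has no tied values; how the authors apply it to opinion trajectories (which quantity is tracked per round, how ties are handled) is not stated in the abstract, and the sample series is invented.

```python
import math


def mann_kendall(series):
    """Return (S, Z) for the Mann-Kendall trend test, assuming no tied values."""
    n = len(series)
    s = 0
    # S counts later-minus-earlier pairs that increase, minus those that decrease.
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)
    # Variance of S under the no-trend null hypothesis (no-ties formula).
    variance = n * (n - 1) * (2 * n + 5) / 18
    # Continuity-corrected normal approximation.
    if s > 0:
        z = (s - 1) / math.sqrt(variance)
    elif s < 0:
        z = (s + 1) / math.sqrt(variance)
    else:
        z = 0.0
    return s, z


# Hypothetical share of agents holding the majority opinion in successive rounds.
majority_share = [0.50, 0.55, 0.61, 0.70, 0.82]
print(mann_kendall(majority_share))
```

A large positive Z indicates a monotonic upward trend, for example a majority position that keeps gaining ground over rounds; the normal approximation is rough for series this short.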

[30] ChunkLLM: A Lightweight Pluggable Framework for Accelerating LLMs Inference

Haojie Ouyang,Jianwei Lv,Lei Ren,Chen Wei,Xiaojie Wang,Fangxiang Feng

Main category: cs.CL

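TL;DR: This paper proposes ChunkLLM, a lightweight, pluggable training framework that uses a QK Adapter and a Chunk Adapter to address the computational inefficiency of Transformers on long sequences, substantially speeding up inference while maintaining performance.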

Details

Motivation: Transformer models face severe computational inefficiency on long sequences because of the quadratic complexity of self-attention, and existing methods based on block selection and compression suffer from semantic incompleteness or poor training-inference efficiency. Method: Proposes the ChunkLLM framework, comprising a QK Adapter (for feature compression and chunk attention acquisition) and a Chunk Adapter (for detecting semantic chunk boundaries); only the adapter parameters are updated during training, and an attention distillation method is used to train the QK Adapter; at inference, chunk selection is triggered only when a chunk boundary is detected. Result: Across long-text and short-text benchmarks, ChunkLLM performs comparably on short-text tasks, retains 98.64% of performance on long-context tasks with a 48.58% KV-cache retention rate, and reaches a maximum speedup of 4.48x on 120K-token texts. Conclusion: ChunkLLM balances efficiency and performance, offers good training-inference efficiency and semantic completeness, and suits large-model deployment in long-context scenarios. Abstract: Transformer-based large models excel in natural language processing and computer vision, but face severe computational inefficiencies due to the quadratic complexity of self-attention in the number of input tokens. Recently, researchers have proposed a series of methods based on block selection and compression to alleviate this problem, but they either have issues with semantic incompleteness or poor training-inference efficiency. To comprehensively address these challenges, we propose ChunkLLM, a lightweight and pluggable training framework. Specifically, we introduce two components: QK Adapter (Q-Adapter and K-Adapter) and Chunk Adapter. The former is attached to each Transformer layer, serving dual purposes of feature compression and chunk attention acquisition. The latter operates at the bottommost layer of the model, functioning to detect chunk boundaries by leveraging contextual semantic information. During the training phase, the parameters of the backbone remain frozen, with only the QK Adapter and Chunk Adapter undergoing training. Notably, we design an attention distillation method for training the QK Adapter, which enhances the recall rate of key chunks. During the inference phase, chunk selection is triggered exclusively when the current token is detected as a chunk boundary, thereby accelerating model inference. Experimental evaluations are conducted on a diverse set of long-text and short-text benchmark datasets spanning multiple tasks. ChunkLLM not only attains comparable performance on short-text benchmarks but also maintains 98.64% of the performance on long-context benchmarks while preserving a 48.58% key-value cache retention rate. Particularly, ChunkLLM attains a maximum speedup of 4.48x in comparison to the vanilla Transformer in the processing of 120K long texts.

[31] A Cross-Lingual Analysis of Bias in Large Language Models Using Romanian History

Matei-Iulian Cocu,Răzvan-Cosmin Cristia,Adrian Marius Dumitran

Main category: cs.CL

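TL;DR: This study assesses the potential biases of large language models by having them answer controversial questions about Romanian history across languages and contexts. It finds that response stability is limited and that answers are inconsistent across languages and formats, revealing the models' leanings in specific contexts.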

Details

Motivation: Historical narratives are often shaped by culture and national ideology, and large language models may inherit biases from their training data and so lack neutrality, which makes it necessary to assess the consistency of their answers across languages. Method: The study proceeds in three stages: multiple LLMs first give a binary answer to a set of controversial historical questions, are then asked the same questions with a numerical rating as the response format, and the consistency between the two is compared, along with differences across languages and response formats. Result: Binary response stability is relatively high but far from perfect and varies by language; models often change stance across languages or answer formats, numerical ratings frequently diverge from the initial binary choice, and the most consistent models are not necessarily the most accurate or neutral. Conclusion: LLMs show clear inconsistency and potential bias on controversial historical questions, so users should treat their outputs with caution, especially on historically and culturally sensitive topics. Abstract: In this case study, we select a set of controversial Romanian historical questions and ask multiple Large Language Models to answer them across languages and contexts, in order to assess their biases. Besides being a study mainly performed for educational purposes, the motivation also lies in the recognition that history is often presented through altered perspectives, primarily influenced by the culture and ideals of a state, even through large language models. Since they are often trained on certain data sets that may present certain ambiguities, the lack of neutrality is subsequently instilled in users. The research process was carried out in three stages, to confirm the idea that the type of response expected can influence, to a certain extent, the response itself; after providing an affirmative answer to some given question, an LLM could shift its way of thinking after being asked the same question again, but being told to respond with a numerical value on a scale. Results show that binary response stability is relatively high but far from perfect and varies by language. Models often flip stance across languages or between formats; numeric ratings frequently diverge from the initial binary choice, and the most consistent models are not always those judged most accurate or neutral. Our research brings to light the predisposition of models to such inconsistencies, within a specific contextualization of the language for the question asked.

[32] Beyond Manuals and Tasks: Instance-Level Context Learning for LLM Agents

Kuntai Cai,Juncheng Liu,Xianglin Yang,Zhaojie Niu,Xiaokui Xiao,Xing Chen

Main category: cs.CL

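TL;DR: This paper proposes Instance-Level Context Learning (ILCL), which uses guided exploration and a compact TODO forest to automatically build high-precision, reusable context documents, significantly improving the success rate and efficiency of LLM agents on complex tasks.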

Details

Motivation: Existing LLM agents make little use of the verifiable, reusable facts tied to a specific environment instance (instance-level context), which leads to poor performance on complex tasks. Method: Proposes Instance-Level Context Learning (ILCL), which uses a TODO forest to prioritize actions and a lightweight plan-act-extract loop to explore, validate, and format instance-level context efficiently. Result: Experiments on TextWorld, ALFWorld, and Crafter show significant gains: for example, ReAct's success rate in TextWorld rises from 37% to 95%, and IGE improves from 81% to 95%. Conclusion: Turning one-off exploration into persistent, reusable knowledge adds instance-level context alongside environment-level and task-level context, making LLM agents more reliable and efficient. Abstract: Large language model (LLM) agents typically receive two kinds of context: (i) environment-level manuals that define interaction interfaces and global rules, and (ii) task-level guidance or demonstrations tied to specific goals. In this work, we identify a crucial but overlooked third type of context, instance-level context, which consists of verifiable and reusable facts tied to a specific environment instance, such as object locations, crafting recipes, and local rules. We argue that the absence of instance-level context is a common source of failure for LLM agents in complex tasks, as success often depends not only on reasoning over global rules or task prompts but also on making decisions based on precise and persistent facts. Acquiring such context requires more than memorization: the challenge lies in efficiently exploring, validating, and formatting these facts under tight interaction budgets. We formalize this problem as Instance-Level Context Learning (ILCL) and introduce our task-agnostic method to solve it. Our method performs a guided exploration, using a compact TODO forest to intelligently prioritize its next actions and a lightweight plan-act-extract loop to execute them. This process automatically produces a high-precision context document that is reusable across many downstream tasks and agents, thereby amortizing the initial exploration cost. Experiments across TextWorld, ALFWorld, and Crafter demonstrate consistent gains in both success and efficiency: for instance, ReAct's mean success rate in TextWorld rises from 37% to 95%, while IGE improves from 81% to 95%. By transforming one-off exploration into persistent, reusable knowledge, our method complements existing contexts to enable more reliable and efficient LLM agents.

[33] Training Dynamics of Parametric and In-Context Knowledge Utilization in Language Models

Minsung Kim,Dong-Kyum Kim,Jea Kwon,Nakyeong Yang,Kyomin Jung,Meeyoung Cha

Main category: cs.CL

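TL;DR: This is the first systematic study of how training conditions shape a language model's strategy for arbitrating between in-context and parametric knowledge; it finds that intra-document repetition of facts, and datasets containing inconsistent information or distributional skew, help models develop more robust knowledge-utilization strategies.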

Details

Motivation: At inference time, large language models often face conflicts between knowledge retrieved in context and parametric knowledge acquired during pretraining; existing models either accept external knowledge uncritically and are easily misled, or cling to parametric knowledge and fail to benefit from retrieval. Although retrieval-augmented generation is widely adopted, there is no systematic understanding of how knowledge arbitration forms during training, which risks wasting pretraining resources. Method: Trains transformer-based language models on a synthetic biographies corpus while systematically controlling various training conditions, running controlled experiments to analyze how different factors affect arbitration behavior. Result: Intra-document repetition of facts fosters both parametric and in-context capabilities; training on a corpus with inconsistent information or distributional skew pushes models toward more robust knowledge-utilization strategies. These seemingly non-ideal data properties actually help models learn robust arbitration. Conclusion: Repetition and inconsistency in data are not noise to be removed but key factors in training models to integrate internal and external knowledge flexibly and reliably, providing empirical guidance for efficiently pretraining models that combine parametric and in-context knowledge. Abstract: Large language models often encounter conflicts between in-context knowledge retrieved at inference time and parametric knowledge acquired during pretraining. Models that accept external knowledge uncritically are vulnerable to misinformation, whereas models that adhere rigidly to parametric knowledge fail to benefit from retrieval. Despite the widespread adoption of retrieval-augmented generation, we still lack a systematic understanding of what shapes knowledge-arbitration strategies during training. This gap risks producing pretrained models with undesirable arbitration behaviors and, consequently, wasting substantial computational resources after the pretraining budget has already been spent. To address this problem, we present the first controlled study of how training conditions influence models' use of in-context and parametric knowledge, and how they arbitrate between them. We train transformer-based language models on a synthetic biographies corpus while systematically controlling various conditions. Our experiments reveal that intra-document repetition of facts fosters the development of both parametric and in-context capabilities. Moreover, training on a corpus that contains inconsistent information or distributional skew encourages models to develop robust strategies for leveraging parametric and in-context knowledge. Rather than viewing these non-ideal properties as artifacts to remove, our results indicate that they are important for learning robust arbitration. These insights offer concrete, empirical guidance for pretraining models that harmoniously integrate parametric and in-context knowledge.

[34] Pretraining with hierarchical memories: separating long-tail and common knowledge

Hadi Pouransari,David Grangier,C Thomas,Michael Kirchhof,Oncel Tuzel

Main category: cs.CL

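TL;DR: Proposes a memory-augmented small language model architecture that improves performance by fetching context-dependent memory blocks from a large hierarchical parametric memory bank; in trillion-token-scale experiments it performs comparably to regular models with more than twice the parameters.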

Details

Motivation: Modern large models rely on parameter scaling, but compressing all world knowledge into parameters is both unnecessary and ill-suited to edge devices; a more efficient way to store and access knowledge is needed. Method: Introduces an architecture that pairs a small language model with a large hierarchical parametric memory bank, dynamically fetching and injecting a small context-dependent memory block during pretraining and inference so the model can use long-tail knowledge held in external memory. Result: In trillion-token-scale experiments, a 160M-parameter model augmented with an 18M-parameter memory block fetched from a 4.6B-parameter memory bank matches regular models with more than twice the parameters; hierarchical feed-forward memories work robustly across transformer architectures. Conclusion: The memory-augmented architecture decouples general capability from the storage of specific knowledge, offering a practical way for lightweight models to use large-scale knowledge efficiently while remaining compatible with existing hardware paradigms and scaling well. Abstract: The impressive performance gains of modern language models currently rely on scaling parameters: larger models store more world knowledge and reason better. Yet compressing all world knowledge into parameters is unnecessary, as only a fraction is used per prompt, and impractical for edge devices with limited inference-time memory and compute. We address this shortcoming by a memory-augmented architecture and a pretraining strategy aligned with existing hardware paradigms. We introduce small language models that access large hierarchical parametric memory banks encoding world knowledge. During pretraining and inference, we fetch a small, context-dependent memory block and add it to the model. Our pretraining learns to store long-tail world knowledge in the memory parameters, while the small language model acts as an anchor capturing common knowledge and general reasoning abilities. Through trillion-token-scale experiments, we show significant gains: a 160M-parameter model augmented with an 18M-parameter memory fetched from a 4.6B memory bank obtains comparable performance to a regular model with more than 2x the parameters. Through extensive experiments, we study the optimal type and size of parametric memories in transformers, scaling them to over 21B parameters. We find that our proposed hierarchical feed-forward memories work robustly across transformer architectures, whether added during pretraining or post-hoc.

[35] Uncertainty-Aware Answer Selection for Improved Reasoning in Multi-LLM Systems

Aakriti Agrawal,Rohith Aralikatti,Anirudh Satheesh,Souradip Chakraborty,Amrit Singh Bedi,Furong Huang

Main category: cs.CL

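TL;DR: Proposes an efficient method, based on a calibrated log-likelihood score, for selecting the best response from multiple different large language models (LLMs); it outperforms existing methods on GSM8K, MMLU, and ARC.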

Details

Motivation: Reliably picking the best answer from the diverse responses of multiple LLMs remains challenging in resource-constrained settings; existing methods depend on costly verification or repeated sampling and are inefficient. Method: Uses a calibrated log-likelihood score as the selection criterion, implicitly drawing on the models' own knowledge and confidence to select responses efficiently. Result: On GSM8K, MMLU (6 subsets), and ARC, performance improves by about 4%, 3%, and 5% respectively, in both debate and non-debate settings. Conclusion: The method is principled, novel, and computationally efficient, and it improves response selection in multi-LLM systems. Abstract: Large Language Models (LLMs) have demonstrated exceptional capabilities, yet selecting the most reliable response from multiple LLMs remains a challenge, particularly in resource-constrained settings. Existing approaches often depend on costly external verifiers, human evaluators, or self-consistency techniques that require multiple samples from a single model. While multi-LLM systems produce more diverse responses than single models and thus have greater potential, they often underperform compared to single-LLM self-consistency. We propose a principled, novel, and computationally efficient method to select the best response from multiple different LLMs using a calibrated log-likelihood score, implicitly leveraging the inherent knowledge and confidence of these models. Our method demonstrates improvements of approx. 4%, 3%, and 5% across both debate (multi-round LLM discussions) and non-debate (Best-of-N with multiple LLMs) settings on GSM8K, MMLU (6 subsets), and ARC datasets respectively.
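
The selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how the log-likelihood score is calibrated, so the per-scorer z-normalization, the length normalization, the function names, and the toy numbers are all assumptions made here.

```python
from statistics import mean, pstdev


def length_normalized(logprobs):
    """Average per-token log-probability of one response (assumed normalization)."""
    return mean(logprobs)


def calibrate(scores):
    """Z-normalize one scorer's scores across candidates (assumed calibration)."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def select_best(token_logprobs):
    """token_logprobs[i][j] holds the per-token log-probs scorer i assigns to candidate j.

    Returns the index of the candidate with the highest summed calibrated score.
    """
    num_candidates = len(token_logprobs[0])
    totals = [0.0] * num_candidates
    for per_scorer in token_logprobs:
        raw = [length_normalized(lp) for lp in per_scorer]
        for j, z in enumerate(calibrate(raw)):
            totals[j] += z
    return max(range(num_candidates), key=totals.__getitem__)


# Toy input: two scoring models, three candidate responses (made-up numbers).
toy = [
    [[-0.2, -0.3], [-1.0, -1.2, -0.9], [-0.6, -0.5]],
    [[-0.4, -0.2], [-0.8, -1.1, -1.0], [-0.3, -0.9]],
]
best = select_best(toy)
```

Normalizing each scorer's scores before summing keeps a model that is systematically over- or under-confident from dominating the vote, which is one plausible reading of "calibrated" in this setting.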

[36] Learning to Route: A Rule-Driven Agent Framework for Hybrid-Source Retrieval-Augmented Generation

Haoyue Bai,Haoyu Wang,Shengyu Chen,Zhengzhang Chen,Lu-An Tang,Wei Cheng,Haifeng Chen,Yanjie Fu

Main category: cs.CL

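TL;DR: Proposes a rule-driven routing framework that dynamically chooses relational databases or documents as the knowledge source in retrieval-augmented generation, improving the accuracy and efficiency of domain-specific question answering.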

Details

Motivation: Existing retrieval-augmented generation systems rely mainly on unstructured documents and overlook relational databases, which provide precise and timely information. This work aims to close that gap and combine the strengths of both knowledge sources. Method: A systematic analysis shows regularities between query types and retrieval paths; based on these, the authors design a rule-driven routing framework comprising a routing agent, a rule-making expert agent that refines the rules, and a path-level meta-cache, so that knowledge sources are selected and reused intelligently. Result: Experiments on three QA benchmarks show that the framework outperforms static strategies and learned routing baselines, improving accuracy at moderate computational cost. Conclusion: Rule-driven routing balances accuracy and efficiency, exploits the complementary strengths of databases and documents, and offers a practical solution for domain-specific QA systems. Abstract: Large Language Models (LLMs) have shown remarkable performance on general Question Answering (QA), yet they often struggle in domain-specific scenarios where accurate and up-to-date information is required. Retrieval-Augmented Generation (RAG) addresses this limitation by enriching LLMs with external knowledge, but existing systems primarily rely on unstructured documents, while largely overlooking relational databases, which provide precise, timely, and efficiently queryable factual information, serving as indispensable infrastructure in domains such as finance, healthcare, and scientific research. Motivated by this gap, we conduct a systematic analysis that reveals three central observations: (i) databases and documents offer complementary strengths across queries, (ii) naively combining both sources introduces noise and cost without consistent accuracy gains, and (iii) selecting the most suitable source for each query is crucial to balance effectiveness and efficiency. We further observe that query types show consistent regularities in their alignment with retrieval paths, suggesting that routing decisions can be effectively guided by systematic rules that capture these patterns. Building on these insights, we propose a rule-driven routing framework. A routing agent scores candidate augmentation paths based on explicit rules and selects the most suitable one; a rule-making expert agent refines the rules over time using QA feedback to maintain adaptability; and a path-level meta-cache reuses past routing decisions for semantically similar queries to reduce latency and cost. Experiments on three QA benchmarks demonstrate that our framework consistently outperforms static strategies and learned routing baselines, achieving higher accuracy while maintaining moderate computational cost.
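
The routing agent and meta-cache can be sketched roughly as below. Everything concrete here is hypothetical: the regex rules, their weights, the two path names, and the token-overlap (Jaccard) similarity standing in for semantic similarity are illustrative assumptions, and the rule-making expert agent that refines rules from QA feedback is not modeled.

```python
import re

# Hypothetical rules: (regex over the lowercased query, path, weight).
RULES = [
    (r"\b(how many|average|total|latest|price)\b", "database", 2.0),
    (r"\b(why|explain|describe|how does)\b", "documents", 2.0),
    (r"\b\d{4}\b", "database", 0.5),
]
PATHS = ("documents", "database")  # first entry wins ties, so documents is the default


def score_paths(query):
    """Sum the weights of all rules that fire, per candidate path."""
    scores = {path: 0.0 for path in PATHS}
    lowered = query.lower()
    for pattern, path, weight in RULES:
        if re.search(pattern, lowered):
            scores[path] += weight
    return scores


def tokens(query):
    return frozenset(re.findall(r"[a-z0-9]+", query.lower()))


class MetaCache:
    """Reuses a past routing decision when a new query overlaps enough with a cached one."""

    def __init__(self, threshold=0.8):
        self.entries = []  # list of (token set, chosen path)
        self.threshold = threshold

    def lookup(self, query):
        current = tokens(query)
        for cached, path in self.entries:
            union = current | cached
            if union and len(current & cached) / len(union) >= self.threshold:
                return path
        return None

    def store(self, query, path):
        self.entries.append((tokens(query), path))


def route(query, cache):
    """Return (path, cache_hit)."""
    hit = cache.lookup(query)
    if hit is not None:
        return hit, True
    scores = score_paths(query)
    path = max(scores, key=scores.get)
    cache.store(query, path)
    return path, False


cache = MetaCache()
first = route("How many patients were admitted in 2023?", cache)
repeat = route("how many patients were admitted in 2023", cache)
other = route("Explain why interest rates affect bond prices", cache)
```

A cache hit skips rule scoring entirely, which is where the latency and cost saving mentioned in the abstract would come from; a production system would presumably compare query embeddings rather than token sets.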

[37] KnowledgeSmith: Uncovering Knowledge Updating in LLMs with Model Editing and Unlearning

Yinyi Luo,Zhexian Zhou,Hao Chen,Kai Qiu,Marios Savvides,Yixuan Li,Jindong Wang

Main category: cs.CL

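TL;DR: Proposes KnowledgeSmith, a unified framework for studying knowledge editing and unlearning in large language models; using an automatic dataset generator for controlled multi-level, multi-scale experiments, it yields insights into knowledge propagation, plasticity, consistency, and robustness.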

Details

Motivation: Because evaluation so far has been insufficient, isolated, and small-scale, the knowledge-updating mechanism of LLMs remains unclear. For example, do models modify knowledge the way humans do? How do editing and unlearning differ as training data grows? Method: Casts knowledge editing and machine unlearning as one constrained optimization problem, and designs an automatic dataset generator that provides structured interventions across multiple graph levels and data scales, enabling systematic analysis of how different modification strategies propagate through model knowledge. Result: Experiments reveal the complexity of knowledge propagation, how plasticity changes with scale, and a trade-off between consistency and capacity, and show that LLMs update knowledge at different levels differently from humans. Conclusion: KnowledgeSmith offers a systematic framework for understanding how LLMs update knowledge, and its findings can inform the design of more reliable and scalable update strategies. Abstract: Knowledge editing and machine unlearning are two popular approaches for large language models (LLMs) to stay up-to-date. However, the knowledge updating mechanism of LLMs remains largely unexplored due to insufficient, isolated, and small-scale evaluation. For instance, are LLMs similar to humans in modifying certain knowledge? How do editing and unlearning differ as training data increases? This paper proposes KnowledgeSmith, a unified framework to systematically understand the updating mechanism of LLMs. We first cast editing and unlearning as instances of one constrained optimization problem. Then, we propose an automatic dataset generator that provides structured interventions across multiple graph levels and data scales, enabling controlled studies of how different modification strategies propagate through model knowledge. Extensive experiments demonstrate nuanced insights into knowledge propagation, plasticity scaling, consistency, and robustness. For instance, our results show that LLMs do not exhibit similar updating as humans for different levels of knowledge, and that a consistency-capacity trade-off exists. We hope our findings can offer suggestions for the design of more reliable and scalable strategies. Code: https://github.com/AIFrontierLab/KnowledgeSmith.git

[38] Retrieval and Augmentation of Domain Knowledge for Text-to-SQL Semantic Parsing

Manasi Patwardhan,Ayush Agarwal,Shabbirhussain Bhaisaheb,Aseem Arora,Lovekesh Vig,Sunita Sarawagi

Main category: cs.CL

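TL;DR: Proposes a systematic framework for associating structured domain statements at the database level and retrieving the relevant ones by sub-string matching, which significantly improves natural-language-to-SQL accuracy.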

Details

Motivation: Existing benchmarks express domain knowledge through unrealistic, query-specific textual hints, and LLM performance on NL-to-SQL varies widely across databases; a more practical and accurate way of expressing domain knowledge is needed. Method: Introduces structured domain statements at the database level and retrieves the statements relevant to a user query using sub-string level matching. Result: Evaluated on eleven realistic database schemas covering diverse domains with five open-source and proprietary LLMs, the results show that (1) database-level structured domain statements are more practical and accurate than existing ad-hoc, query-specific textual statements, and (2) sub-string match based retrieval yields significantly higher accuracy than other retrieval approaches. Conclusion: Database-level structured domain knowledge combined with sub-string match retrieval is a better way to improve NL-to-SQL performance. Abstract: The performance of Large Language Models (LLMs) for translating Natural Language (NL) queries into SQL varies significantly across databases (DBs). NL queries are often expressed using a domain-specific vocabulary, and mapping these to the correct SQL requires an understanding of the embedded domain expressions and their relationship to the DB schema structure. Existing benchmarks rely on unrealistic, ad-hoc, query-specific textual hints for expressing domain knowledge. In this paper, we propose a systematic framework for associating structured domain statements at the database level. We present retrieval of relevant structured domain statements given a user query using sub-string level match. We evaluate on eleven realistic DB schemas covering diverse domains across five open-source and proprietary LLMs and demonstrate that (1) DB-level structured domain statements are more practical and accurate than existing ad-hoc, query-specific textual domain statements, and (2) our sub-string match based retrieval of relevant domain statements provides significantly higher accuracy than other retrieval approaches.
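
Sub-string level retrieval of domain statements might look like the sketch below. The statement schema (`phrase`, `definition`), the example statements, and the minimum-length threshold are hypothetical; the abstract does not specify how matches are scored, so ranking by longest common substring is an assumption.

```python
from difflib import SequenceMatcher


def longest_common_substring(a, b):
    """Longest contiguous run of characters shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


def retrieve(query, statements, min_len=5, top_k=2):
    """Rank database-level domain statements by how much of their phrase appears in the query."""
    lowered = query.lower()
    scored = []
    for statement in statements:
        overlap = longest_common_substring(lowered, statement["phrase"].lower())
        if len(overlap) >= min_len:
            scored.append((len(overlap), statement))
    scored.sort(key=lambda pair: -pair[0])  # stable sort keeps input order on ties
    return [statement for _, statement in scored[:top_k]]


# Hypothetical domain statements attached to one database.
STATEMENTS = [
    {"phrase": "active customer", "definition": "customers.status = 'A'"},
    {"phrase": "net revenue", "definition": "SUM(orders.amount - orders.refund)"},
    {"phrase": "churned", "definition": "customers.closed_at IS NOT NULL"},
]

hits = retrieve("What is the net revenue from active customers last month?", STATEMENTS)
```

The retrieved definitions would then be added to the prompt alongside the schema. Matching on sub-strings rather than whole tokens lets "active customers" in the query hit the stored phrase "active customer" without stemming.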

[39] Words That Make Language Models Perceive

Sophie L. Wang,Phillip Isola,Brian Cheung

Main category: cs.CL

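TL;DR: Sensory prompts (such as "see" or "hear") can activate modality-appropriate latent representations in LLMs trained only on text, bringing them representationally closer to specialist vision and audio encoders.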

Details

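Motivation: To explore whether sensory prompting can surface the implicit multimodal structure of LLMs trained purely on text. Method: Uses sensory prompts to steer the model's next-token predictions as if it were reasoning from visual or auditory evidence that is never actually supplied, and analyzes how its representations change. Result: Lightweight prompt engineering reliably activates modality-appropriate representations in text-trained LLMs, aligning them more closely with specialist vision and audio encoders. Conclusion: Although LLMs have no direct perceptual experience, suitable prompts can elicit the multimodal regularities implicit in their internals, indicating that cross-modal information encoded in language can be used to strengthen perceptual alignment. Abstract: Large language models (LLMs) trained purely on text ostensibly lack any direct perceptual experience, yet their internal representations are implicitly shaped by multimodal regularities encoded in language. We test the hypothesis that explicit sensory prompting can surface this latent structure, bringing a text-only LLM into closer representational alignment with specialist vision and audio encoders. When a sensory prompt tells the model to 'see' or 'hear', it cues the model to resolve its next-token predictions as if they were conditioned on latent visual or auditory evidence that is never actually supplied. Our findings reveal that lightweight prompt engineering can reliably activate modality-appropriate representations in purely text-trained LLMs.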

[40] CLARITY: Clinical Assistant for Routing, Inference, and Triage

Vladimir Shaposhnikov,Aleksandr Nesterov,Ilia Kopanichuk,Ivan Bakulin,Egor Zhelvakov,Ruslan Abramov,Ekaterina Tsapieva,Dmitry V. Dylov,Ivan Oseledets

Main category: cs.CL

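TL;DR: CLARITY is an AI-driven clinical assistant platform that combines a finite state machine with large language models for patient triage, specialist routing, and severity assessment; deployed in a real large-scale healthcare IT system, it achieves better-than-human routing precision and shorter consultations.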

Details

Motivation: To make patient-to-specialist routing in healthcare systems more efficient, shorten consultations, and improve the automation and accuracy of clinical decision support. Method: A hybrid architecture that combines a Finite State Machine (FSM) for structured dialogue control with collaborative LLM-based agents for symptom analysis and referral prioritization, built on a modular microservices framework. Result: Over 55,000 user dialogues were completed within two months of deployment, 2,500 of which were expert-annotated for validation; CLARITY exceeds human-level first-attempt routing precision while requiring consultations up to three times shorter. Conclusion: CLARITY delivers efficient, safe, and scalable performance in a real clinical environment, outperforms human routing, and can be integrated broadly into existing healthcare IT systems. Abstract: We present CLARITY (Clinical Assistant for Routing, Inference, and Triage), an AI-driven platform designed to facilitate patient-to-specialist routing, clinical consultations, and severity assessment of patients' conditions. Its hybrid architecture combines a Finite State Machine (FSM) for structured dialogue flows with collaborative agents that employ a Large Language Model (LLM) to analyze symptoms and prioritize referrals to appropriate specialists. Built on a modular microservices framework, CLARITY ensures safe, efficient, and robust performance, and is flexible and readily scalable to meet the demands of existing workflows and IT solutions in healthcare. We report integration of our clinical assistant into a large-scale nation-wide inter-hospital IT platform, with over 55,000 content-rich user dialogues completed within the two months of deployment, 2,500 of which were expert-annotated for subsequent validation. The validation results show that CLARITY surpasses human-level performance in terms of first-attempt routing precision, while requiring consultations up to 3 times shorter than with a human.

[41] Unraveling Syntax: How Language Models Learn Context-Free Grammars

Laura Ying Schulz,Daniel Mitropolsky,Tomaso Poggio

Main category: cs.CL

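TL;DR: Proposes a new framework for understanding how language models acquire syntax: small models are trained on synthetic languages generated from probabilistic context-free grammars (PCFGs) to study their learning dynamics, revealing how transformers handle recursive structure and subgrammars.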

Details

Motivation: Large language models achieve impressive results, yet their learning dynamics are poorly understood; natural language syntax, programming languages, and arithmetic problems can largely be modeled by PCFGs, so a controllable setting is needed to study how grammar is learned. Method: Generates synthetic languages from PCFGs with controllable complexity, recursion depth, and subgrammar structure, trains small transformer models on them, and derives recursive formulae for the training loss and KL divergence over the subgrammar structure. Result: Transformers reduce loss on all subgrammars in parallel, unlike children, who learn step by step; subgrammar pretraining improves the final loss of smaller models, and pretrained models develop internal representations better aligned with the grammar's structure; models struggle with deeper recursion, exposing a fundamental challenge in how neural networks represent hierarchical syntax. Conclusion: The work opens a research direction that uses PCFGs as a testbed for transformer learning dynamics, provides an extensible framework for probing how language models learn, and raises many open questions. Abstract: We introduce a new framework for understanding how language models acquire syntax. While large models achieve impressive results, little is known about their learning dynamics. Our approach starts with the observation that most domains of interest, such as natural language syntax, coding languages, and arithmetic problems, are captured by probabilistic context-free grammars (PCFGs). We study the learning dynamics of small models trained on synthetic languages generated from PCFGs, enabling precise control over grammar complexity, recursion depth, and subgrammar structure. We prove several general, recursive formulae for the training loss and Kullback-Leibler divergence over the subgrammar structure of a PCFG. Empirically, we find that unlike children, who first master simple substructures before progressing to more complex constructions, transformers reduce loss across all subgrammars in parallel. We further show that subgrammar pretraining can improve the final loss for smaller models, and that pretrained models develop internal representations more aligned with the grammar's substructure. Finally, we demonstrate that models struggle with deeper recursive structures (a limitation even of large language models), revealing fundamental challenges in how neural networks represent hierarchical syntax. Overall, our work initiates the study of the learning dynamics of transformers on PCFGs as a versatile testbed for probing learning in language models, opening a research direction with many open questions.
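
Generating a synthetic language from a PCFG with a cap on recursion depth can be sketched as follows. The toy grammar, the depth cap, and the function name are illustrative assumptions and are not the grammars used in the paper.

```python
import random

# Toy PCFG: nonterminal -> list of (probability, right-hand side).
# S -> a S b (0.7) | c (0.3) generates strings of the form a^n c b^n.
GRAMMAR = {
    "S": [(0.7, ["a", "S", "b"]), (0.3, ["c"])],
}


def sample(symbol, rng, depth=0, max_depth=8):
    """Expand `symbol` top-down, forcing non-recursive rules once max_depth is reached."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal symbol
    rules = GRAMMAR[symbol]
    if depth >= max_depth:
        # Keep only rules whose right-hand side contains no nonterminals.
        rules = [rule for rule in rules if all(s not in GRAMMAR for s in rule[1])]
    weights = [probability for probability, _ in rules]
    _, rhs = rng.choices(rules, weights=weights, k=1)[0]
    output = []
    for child in rhs:
        output.extend(sample(child, rng, depth + 1, max_depth))
    return output


rng = random.Random(0)
corpus = [sample("S", rng) for _ in range(5)]
```

Raising the probability of the recursive rule or the depth cap produces deeper nesting, which is the kind of knob the abstract refers to when it mentions control over recursion depth.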

[42] Hierarchical Semantic Retrieval with Cobweb

Anant Gupta,Karthik Singaravadivelan,Zekun Wang

Main category: cs.CL

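TL;DR: Cobweb is a hierarchy-aware document retrieval framework that organizes sentence embeddings into a prototype tree and retrieves coarse-to-fine, improving interpretability and robustness.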

Details

Motivation: Conventional neural retrieval treats the corpus as a flat set of vectors and ignores corpus structure, which makes results hard to explain and sensitive to low-quality embeddings. Method: Uses the Cobweb framework to organize sentence embeddings into a prototype tree and retrieve via coarse-to-fine traversal, with two inference methods: a generalized best-first search and a lightweight path-sum ranker. Result: On MS MARCO and QQP, performance matches dot-product search with strong encoders such as BERT/T5 and clearly exceeds it with weak embeddings such as GPT-2, showing greater robustness. Conclusion: Cobweb maintains retrieval effectiveness while improving robustness to embedding quality, scalability, and the interpretability of retrieval results. Abstract: Neural document retrieval often treats a corpus as a flat cloud of vectors scored at a single granularity, leaving corpus structure underused and explanations opaque. We use Cobweb, a hierarchy-aware framework, to organize sentence embeddings into a prototype tree and rank documents via coarse-to-fine traversal. Internal nodes act as concept prototypes, providing multi-granular relevance signals and a transparent rationale through retrieval paths. We instantiate two inference approaches: a generalized best-first search and a lightweight path-sum ranker. We evaluate our approaches on MS MARCO and QQP with encoder (e.g., BERT/T5) and decoder (GPT-2) representations. Our results show that our retrieval approaches match the dot product search on strong encoder embeddings while remaining robust when kNN degrades: with GPT-2 vectors, dot product performance collapses whereas our approaches still retrieve relevant results. Overall, our experiments suggest that Cobweb provides competitive effectiveness, improved robustness to embedding quality, scalability, and interpretable retrieval via hierarchical prototypes.
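
Coarse-to-fine retrieval over a prototype tree can be illustrated with a plain best-first search. This sketch makes several assumptions that the abstract does not confirm: prototypes are taken to be mean embeddings, nodes are scored by dot product with the query, and the tree is hand-built rather than learned by Cobweb; the `Node` class and all numbers are hypothetical.

```python
import heapq
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    prototype: List[float]         # assumed: mean embedding of the subtree
    doc_id: Optional[str] = None   # set on leaves only
    children: List["Node"] = field(default_factory=list)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def best_first(root, query, k=2, max_expansions=100):
    """Repeatedly expand the highest-scoring frontier node until k leaves are returned."""
    counter = 0  # tie-breaker so the heap never has to compare Node objects
    heap = [(-dot(root.prototype, query), counter, root)]
    results, expansions = [], 0
    while heap and len(results) < k and expansions < max_expansions:
        neg_score, _, node = heapq.heappop(heap)
        if node.doc_id is not None:
            results.append((node.doc_id, -neg_score))
            continue
        expansions += 1
        for child in node.children:
            counter += 1
            heapq.heappush(heap, (-dot(child.prototype, query), counter, child))
    return results


# Hand-built toy tree with two concept clusters.
d1, d2 = Node([1.0, 0.0], "d1"), Node([0.8, 0.2], "d2")
d3, d4 = Node([0.0, 1.0], "d3"), Node([0.1, 0.9], "d4")
cluster_a = Node([0.9, 0.1], children=[d1, d2])
cluster_b = Node([0.05, 0.95], children=[d3, d4])
root = Node([0.475, 0.525], children=[cluster_a, cluster_b])

hits = best_first(root, query=[1.0, 0.0], k=2)
```

The sequence of internal nodes visited before a leaf is returned doubles as the explanation for that result, which is the interpretability benefit the abstract attributes to retrieval paths. Because prototype scores are only a proxy for leaf scores, this search is a heuristic and is not guaranteed to return the exact top-k.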

[43] Knowledge-Graph Based RAG System Evaluation Framework

Sicheng Dong,Vahid Zolfaghari,Nenad Petrovic,Alois Knoll

Main category: cs.CL

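TL;DR: Proposes a new knowledge-graph (KG) based method for evaluating retrieval-augmented generation (RAG) systems that uses multi-hop reasoning and semantic community clustering to make evaluation more comprehensive, and validates its correlation with human judgments and its sensitivity to semantic differences.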

Details

Motivation: Traditional evaluation metrics struggle to capture the key features of content generated by modern LLMs, especially in RAG systems, so a more comprehensive and fine-grained evaluation method is needed. Method: Inspired by the RAGAS framework, extends it into a KG-based evaluation paradigm that adds multi-hop reasoning and semantic community clustering to build more comprehensive scoring metrics. Result: The method is compared against RAGAS scores and checked against human judgments on a human-annotated subset; targeted experiments show that it is more sensitive to subtle semantic differences in generated outputs. Conclusion: KG-based evaluation gives a deeper understanding of RAG system performance and a finer-grained view of it, and points to directions for future research on RAG evaluation. Abstract: Large language models (LLMs) have become a significant research focus and are utilized in various fields, such as text generation and dialog systems. One of the most essential applications of LLMs is Retrieval Augmented Generation (RAG), which greatly enhances generated content's reliability and relevance. However, evaluating RAG systems remains a challenging task. Traditional evaluation metrics struggle to effectively capture the key features of modern LLM-generated content that often exhibits high fluency and naturalness. Inspired by the RAGAS tool, a well-known RAG evaluation framework, we extended this framework into a KG-based evaluation paradigm, enabling multi-hop reasoning and semantic community clustering to derive more comprehensive scoring metrics. By incorporating these comprehensive evaluation criteria, we gain a deeper understanding of RAG systems and a more nuanced perspective on their performance. To validate the effectiveness of our approach, we compare its performance with RAGAS scores and construct a human-annotated subset to assess the correlation between human judgments and automated metrics. In addition, we conduct targeted experiments to demonstrate that our KG-based evaluation method is more sensitive to subtle semantic differences in generated outputs. Finally, we discuss the key challenges in evaluating RAG systems and highlight potential directions for future research.

[44] Transcribe, Translate, or Transliterate: An Investigation of Intermediate Representations in Spoken Language Models

Tolúlọpé Ògúnrèmí,Christopher D. Manning,Dan Jurafsky,Karen Livescu

Main category: cs.CL

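TL;DR: Examines how modality adapters (MAs) represent speech in three spoken language models (SLMs): models with a Whisper encoder represent meaning through an English-based interlingua, while models without one represent the phonetics of the input using English words.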

Details

Motivation: To understand how modality adapters transform speech representations and so shed light on their role in multilingual processing. Method: Analyzes MA representations in three SLMs by finding the decoder LM token nearest to each MA output representation. Result: Two strategies emerge: an English-based interlingua that encodes meaning (models with a Whisper encoder), and English words used to express the phonetics of the input (e.g., Phi-4-Multimodal-Instruct). Conclusion: The authors hypothesise that which strategy arises depends on whether the speech encoder is trained for translation as well as speech recognition. Abstract: Spoken language models (SLMs) that integrate speech with large language models (LMs) rely on modality adapters (MAs) to map the output of speech encoders to a representation that is understandable to the decoder LM. Yet we know very little about how these crucial MAs transform representations. Here we examine the MA output representation in three SLMs (SALMONN, Qwen2-Audio and Phi-4-Multimodal-Instruct). By finding the nearest decoder LM token to an MA representation, we uncover two strategies for MA representations. For models using a Whisper encoder, MAs appear to represent the meaning of the input using an English-based interlingua, allowing them to handle languages unseen in instruction tuning. For models that don't, like Phi-4-Multimodal-Instruct, MAs instead represent the phonetics of the input, but expressed with English words. We hypothesise that which arises depends on whether the speech encoder is trained only for speech recognition or also for translation.

[45] Evaluation Framework for Highlight Explanations of Context Utilisation in Language Models

Jingyi Sun,Pepa Atanasova,Sagnik Ray Choudhury,Sekh Mainul Islam,Isabelle Augenstein

Main category: cs.CL

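TL;DR: Introduces the first gold-standard framework for evaluating how well highlight explanations (HEs) attribute context usage; controlled test cases are used to compare four HE methods across scenarios, showing that existing methods struggle with long contexts and positional bias, with the mechanistic-interpretability-based MechLight performing best.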

Details

Motivation: How language models use context remains opaque to users, and no existing work evaluates how accurately highlight explanations (HEs) reveal context usage, so a reliable evaluation framework is needed. Method: Builds a gold-standard evaluation framework from controlled test cases with known ground-truth context usage, and evaluates three established HE methods plus MechLight, a newly adapted mechanistic interpretability method, across four context scenarios, four datasets, and five language models. Result: MechLight performs best in all context scenarios, but every method degrades on longer contexts and shows positional bias, pointing to fundamental problems of accuracy and scalability in current explanation methods. Conclusion: Existing highlight explanation methods cannot yet reliably explain how language models use information in complex or long contexts; new methods are needed to overcome positional bias and length sensitivity before context-usage explanations can be trusted at scale. Abstract: Context utilisation, the ability of Language Models (LMs) to incorporate relevant information from the provided context when generating responses, remains largely opaque to users, who cannot determine whether models draw from parametric memory or provided context, nor identify which specific context pieces inform the response. Highlight explanations (HEs) offer a natural solution as they can point to the exact context pieces and tokens that influenced model outputs. However, no existing work evaluates their effectiveness in accurately explaining context utilisation. We address this gap by introducing the first gold standard HE evaluation framework for context attribution, using controlled test cases with known ground-truth context usage, which avoids the limitations of existing indirect proxy evaluations. To demonstrate the framework's broad applicability, we evaluate four HE methods (three established techniques and MechLight, a mechanistic interpretability approach we adapt for this task) across four context scenarios, four datasets, and five LMs. Overall, we find that MechLight performs best across all context scenarios. However, all methods struggle with longer contexts and exhibit positional biases, pointing to fundamental challenges in explanation accuracy that require new approaches to deliver reliable context utilisation explanations at scale.

[46] Mind the Gap: Linguistic Divergence and Adaptation Strategies in Human-LLM Assistant vs. Human-Human Interactions

Fulei Zhang,Zhou Yu

Main category: cs.CL

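TL;DR: Users write differently to LLM chatbots than to human agents, with differences in grammatical fluency, politeness, and lexical diversity; adding stylistically diverse language to training data improves model robustness, while inference-time message reformulation helps less.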

Details

Motivation: To examine how users communicate differently with LLM chatbots and with human agents, and whether models trained on human-human interaction data can cope with that shift. Method: Empirically compares linguistic features of user messages in the two settings, and tests two mitigation strategies: data augmentation during post-training and inference-time reformulation of user messages. Result: User language differs significantly between the two settings; models trained on stylistically diverse datasets outperform those trained only on original or stylistically uniform data, while inference-time reformulation is less effective. Conclusion: To make deployed LLMs more robust, their training data should include the language styles real users adopt with chatbots, so that models better handle the communication-style shift after launch. Abstract: As Large Language Models (LLMs) are increasingly deployed in customer-facing applications, a critical yet underexplored question is how users communicate differently with LLM chatbots compared to human agents. In this study, we present empirical evidence that users adopt distinct communication styles when they interact with chatbots versus human agents. Our analysis reveals significant differences in grammatical fluency, politeness, and lexical diversity in user language between the two settings. These findings suggest that models trained exclusively on human-human interaction data may not adequately accommodate the communication style shift that occurs once an LLM chatbot is deployed. To enhance LLM robustness to post-launch communication style changes, we experimented with two strategies: (1) data augmentation during the post-training phase and (2) inference-time user message reformulation. Our results indicate that models trained on stylistically diverse datasets significantly outperform those trained exclusively on original or stylistically uniform datasets, while inference-time reformulation proved less effective. These insights help us to better adapt our models for improved LLM-user interaction experiences.

[47] SoT: Structured-of-Thought Prompting Guides Multilingual Reasoning in Large Language Models

Rui Qi,Zhibo Man,Yufeng Chen,Fengran Mo,Jinan Xu,Kaiyu Huang

Main category: cs.CL

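TL;DR: Proposes SoT, a training-free multilingual reasoning method that improves LLM reasoning in low-resource languages through Language Thinking Transformation and Structured Knowledge Transformation.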

Details

Motivation: The complex reasoning abilities of current large models transfer poorly to languages that are not high-resource, and limited language resources lead to weak multilingual reasoning. Method: Proposes Structured-of-Thought (SoT), which consists of two steps, Language Thinking Transformation and Structured Knowledge Transformation, converting language-specific semantics into language-agnostic structured representations and guiding the model to keep a consistent reasoning path. Result: SoT outperforms several strong baselines on multiple multilingual reasoning benchmarks, works with different LLM backbones, and can be combined with other training-free strategies for further gains. Conclusion: SoT improves multilingual reasoning in large models and offers an efficient, general solution for complex reasoning tasks in low-resource languages. Abstract: Recent developments have enabled Large Language Models (LLMs) to engage in complex reasoning tasks through deep thinking. However, this reasoning capacity has not been successfully transferred to non-high-resource languages due to resource constraints, so models struggle with multilingual reasoning tasks. To this end, we propose Structured-of-Thought (SoT), a training-free method that improves the performance on multilingual reasoning through a multi-step transformation: Language Thinking Transformation and Structured Knowledge Transformation. The SoT method converts language-specific semantic information into language-agnostic structured representations, enabling the models to understand queries in different languages in a more sophisticated way. Besides, SoT effectively guides LLMs toward more concentrated reasoning to maintain consistent underlying reasoning pathways when handling cross-lingual variations in expression. Experimental results demonstrate that SoT outperforms several strong baselines on multiple multilingual reasoning benchmarks when adapting to various backbones of LLMs. It can also be integrated with other training-free strategies for further improvements. Our code is available at https://github.com/Cherry-qwq/SoT.

[48] Self-Improvement in Multimodal Large Language Models: A Survey

Shijian Deng,Kai Wang,Tianyu Yang,Harsh Singh,Yapeng Tian

Main category: cs.CL

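TL;DR: The first survey of self-improvement in multimodal large language models (MLLMs); it organizes existing methods by data collection, data organization, and model optimization, and discusses evaluations, applications, and future research directions.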

Details

Motivation: Self-improvement for LLMs has made progress, and extending it to the multimodal domain holds considerable promise; a systematic summary is needed to move the area forward. Method: Classifies and reviews existing MLLM self-improvement methods from three perspectives (data collection, data organization, and model optimization) and summarizes commonly used evaluations and downstream applications. Result: Provides a comprehensive overview of MLLM self-improvement, a structured view of the literature, and a clear picture of current technical approaches. Conclusion: The field is still young; the survey closes by outlining open challenges and future research directions. Abstract: Recent advancements in self-improvement for Large Language Models (LLMs) have efficiently enhanced model capabilities without significantly increasing costs, particularly in terms of human effort. While this area is still relatively young, its extension to the multimodal domain holds immense potential for leveraging diverse data sources and developing more general self-improving models. This survey is the first to provide a comprehensive overview of self-improvement in Multimodal LLMs (MLLMs). We provide a structured overview of the current literature and discuss methods from three perspectives: 1) data collection, 2) data organization, and 3) model optimization, to facilitate the further development of self-improvement in MLLMs. We also include commonly used evaluations and downstream applications. Finally, we conclude by outlining open challenges and future research directions.

[49] Uncertainty as Feature Gaps: Epistemic Uncertainty Quantification of LLMs in Contextual Question-Answering

Yavuz Bakman,Sungmin Kang,Zhiqi Huang,Duygu Nur Yaldiz,Catarina G. Belém,Chenyang Zhu,Anoop Kumar,Alfy Samuel,Salman Avestimehr,Daben Liu,Sai Praneeth Karimireddy

Main category: cs.CL

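TL;DR: Proposes a theoretically grounded way to quantify epistemic uncertainty in contextual question answering: the cross-entropy between the model's predictive distribution and the true distribution is decomposed, three features (context-reliance, context comprehension, and honesty) are extracted, and a robust uncertainty score is built that substantially outperforms existing methods on multiple benchmarks.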

Details

Motivation: Contextual QA matters in real applications, yet uncertainty quantification research has focused on closed-book factual QA and has left contextual QA unexplored. Method: Introduces a task-agnostic, token-level uncertainty measure, decomposes the cross-entropy to isolate the epistemic component, approximates the true distribution with an idealized model, and derives an upper bound on epistemic uncertainty; for contextual QA, three features (context-reliance, context comprehension, and honesty) are extracted from a small number of labeled samples and ensembled into an uncertainty score. Result: On multiple QA benchmarks, in both in-distribution and out-of-distribution settings, the method substantially outperforms state-of-the-art supervised and unsupervised uncertainty quantification methods, with up to a 13-point PRR improvement and negligible inference overhead. Conclusion: The method is an effective and efficient solution for quantifying epistemic uncertainty in contextual QA, with a sound theoretical basis and good prospects for practical use. Abstract: Uncertainty Quantification (UQ) research has primarily focused on closed-book factual question answering (QA), while contextual QA remains unexplored, despite its importance in real-world applications. In this work, we focus on UQ for the contextual QA task and propose a theoretically grounded approach to quantify epistemic uncertainty. We begin by introducing a task-agnostic, token-level uncertainty measure defined as the cross-entropy between the predictive distribution of the given model and the unknown true distribution. By decomposing this measure, we isolate the epistemic component and approximate the true distribution by a perfectly prompted, idealized model. We then derive an upper bound for epistemic uncertainty and show that it can be interpreted as semantic feature gaps in the given model's hidden representations relative to the ideal model. We further apply this generic framework to the contextual QA task and hypothesize that three features approximate this gap: context-reliance (using the provided context rather than parametric knowledge), context comprehension (extracting relevant information from context), and honesty (avoiding intentional lies). Using a top-down interpretability approach, we extract these features by using only a small number of labeled samples and ensemble them to form a robust uncertainty score. Experiments on multiple QA benchmarks in both in-distribution and out-of-distribution settings show that our method substantially outperforms state-of-the-art unsupervised (sampling-free and sampling-based) and supervised UQ methods, achieving up to a 13-point PRR improvement while incurring a negligible inference overhead.
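
The decomposition mentioned above is presumably the standard identity that splits cross-entropy into an entropy term and a KL term; the sketch below writes it out under that assumption. The notation is chosen here and is not taken from the paper: p* is the unknown true next-token distribution, p_θ the given model's predictive distribution, x the prompt and context, and y_t the token at step t. Reading the two terms as aleatoric and epistemic uncertainty is a common interpretation, also assumed here.

```latex
\begin{aligned}
\underbrace{H\big(p^{*}, p_{\theta}\big)}_{\text{total uncertainty}}
  &= -\sum_{y_t} p^{*}(y_t \mid x)\,\log p_{\theta}(y_t \mid x) \\
  &= \underbrace{-\sum_{y_t} p^{*}(y_t \mid x)\,\log p^{*}(y_t \mid x)}_{\text{entropy of } p^{*} \text{ (aleatoric)}}
   + \underbrace{\sum_{y_t} p^{*}(y_t \mid x)\,\log \frac{p^{*}(y_t \mid x)}{p_{\theta}(y_t \mid x)}}_{\mathrm{KL}\left(p^{*} \,\|\, p_{\theta}\right) \text{ (epistemic)}}
\end{aligned}
```

Only the KL term depends on the model, so it is the part that shrinks as the model approaches the idealized one; the abstract's upper bound and feature-gap interpretation apply to this term.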

[50] Time-To-Inconsistency: A Survival Analysis of Large Language Model Robustness to Adversarial Attacks

Yubo Li,Ramayya Krishnan,Rema Padman

Main category: cs.CL

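TL;DR: The first study to apply survival analysis to LLM robustness in multi-turn dialogue; it finds that gradual semantic drift markedly lowers the risk of conversational failure whereas abrupt drift is highly destructive, challenging the conventional assumption that semantic consistency must be strictly maintained.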

Details

Motivation: Existing evaluation frameworks focus on static or single-turn performance and fail to capture how performance degrades over time in real multi-turn conversations, so a new method is needed to analyze LLM robustness in extended interactions. Method: Using 36,951 conversation turns from 9 state-of-the-art LLMs, applies survival models (Cox proportional hazards, Accelerated Failure Time (AFT), and Random Survival Forest) to model conversational failure as a time-to-event process and to analyze the effect of different semantic drift patterns. Result: Abrupt prompt-to-prompt (P2P) semantic drift sharply increases the hazard of conversational failure, whereas gradual, cumulative drift is protective and substantially extends the number of successful turns; AFT models with interaction terms give the best discrimination and calibration. Conclusion: Survival analysis is a powerful framework for evaluating conversational robustness in LLMs; it shows that the type of semantic drift is a key factor in dialogue stability, suggests new design ideas for more robust conversational systems, and challenges conventional views on semantic consistency. Abstract: Large Language Models (LLMs) have revolutionized conversational AI, yet their robustness in extended multi-turn dialogues remains poorly understood. Existing evaluation frameworks focus on static benchmarks and single-turn assessments, failing to capture the temporal dynamics of conversational degradation that characterize real-world interactions. In this work, we present the first comprehensive survival analysis of conversational AI robustness, analyzing 36,951 conversation turns across 9 state-of-the-art LLMs to model failure as a time-to-event process. Our survival modeling framework, employing Cox proportional hazards, Accelerated Failure Time, and Random Survival Forest approaches, reveals extraordinary temporal dynamics. We find that abrupt, prompt-to-prompt (P2P) semantic drift is catastrophic, dramatically increasing the hazard of conversational failure. In stark contrast, gradual, cumulative drift is highly protective, vastly reducing the failure hazard and enabling significantly longer dialogues. AFT models with interactions demonstrate superior performance, achieving excellent discrimination and exceptional calibration. These findings establish survival analysis as a powerful paradigm for evaluating LLM robustness, offer concrete insights for designing resilient conversational agents, and challenge prevailing assumptions about the necessity of semantic consistency in conversational AI systems.
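
To make the time-to-event framing concrete, the sketch below computes a Kaplan-Meier survival curve over conversation lengths. Kaplan-Meier is a simpler, non-parametric estimator and is not one of the models named in the abstract (Cox, AFT, Random Survival Forest); it is used here only because it fits in a few lines. The data are invented: each conversation has a number of turns and a flag saying whether it ended in an inconsistency (1) or was cut off while still consistent (0, censored).

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each turn t where at least one failure occurred.

    durations: turns until failure or censoring, one entry per conversation.
    observed:  1 if the conversation failed at that turn, 0 if it was censored.
    """
    events = sorted(zip(durations, observed))
    at_risk = len(events)
    survival = 1.0
    curve = []
    i = 0
    while i < len(events):
        t = events[i][0]
        failures = 0
        removed = 0
        # Group all conversations that end (fail or get censored) at turn t.
        while i < len(events) and events[i][0] == t:
            failures += events[i][1]
            removed += 1
            i += 1
        if failures:
            survival *= 1 - failures / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


# Invented example: five conversations, two of them censored.
turns = [2, 3, 3, 5, 8]
failed = [1, 1, 0, 1, 0]
curve = kaplan_meier(turns, failed)
```

Each point gives the estimated probability that a conversation is still consistent after that many turns. Covariates such as abrupt or cumulative semantic drift would enter through a regression model like Cox or AFT, which this sketch does not attempt.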

[51] TravelBench : Exploring LLM Performance in Low-Resource Domains

Srinivas Billa,Xiaonan Jing

Main category: cs.CL

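TL;DR: Examines how poorly existing large language model (LLM) benchmarks reflect model capability on low-resource tasks and introduces a new benchmark of 14 real-world travel-domain datasets covering 7 common NLP tasks. General benchmarks do not accurately reflect performance on low-resource tasks, and even LLMs trained with large amounts of FLOPs hit bottlenecks on complex, domain-specific tasks. Reasoning helps smaller LLMs more, making them better judges on certain tasks.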

Details

Motivation: Existing LLM benchmarks say little about low-resource tasks, which makes it hard to develop effective solutions in these domains; more representative domain-specific benchmarks are needed to understand real model performance. Method: Curates 14 travel-domain datasets from real-world scenarios spanning 7 common NLP tasks, and systematically evaluates a range of LLMs on accuracy, scaling behaviour, and reasoning capabilities. Result: General benchmark results are insufficient for understanding performance on low-resource tasks; out-of-the-box LLMs hit performance bottlenecks on complex domain tasks even after large-scale training; reasoning gives a larger boost to smaller LLMs. Conclusion: Evaluating LLMs accurately on low-resource, domain-specific tasks requires going beyond general benchmarks to more targeted evaluations, and reasoning ability is a key factor in improving smaller models. Abstract: Results on existing LLM benchmarks capture little information about model capabilities in low-resource tasks, making it difficult to develop effective solutions in these domains. To address these challenges, we curated 14 travel-domain datasets spanning 7 common NLP tasks using anonymised data from real-world scenarios, and analysed the performance across LLMs. We report on the accuracy, scaling behaviour, and reasoning capabilities of LLMs in a variety of tasks. Our results confirm that general benchmarking results are insufficient for understanding model performance in low-resource tasks. Despite the amount of training FLOPs, out-of-the-box LLMs hit performance bottlenecks in complex, domain-specific scenarios. Furthermore, reasoning provides a more significant boost for smaller LLMs by making the model a better judge on certain tasks.

[52] PGMEL: Policy Gradient-based Generative Adversarial Network for Multimodal Entity Linking

KM Pooja,Cheng Long,Aixin Sun

Main category: cs.CL

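TL;DR: Proposes PGMEL, a policy gradient-based generative adversarial network for multimodal entity linking that improves representation learning by generating high-quality negative samples and outperforms existing methods on several datasets.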

Details

Motivation: Existing work has not explored how the choice of negative samples affects representation learning in multimodal entity linking; this paper aims to fill that gap. Method: Uses a generative adversarial framework in which the generator produces high-quality negative samples and the discriminator performs metric learning; the generator is optimized with policy gradient methods. Result: Experiments on the Wiki-MEL, Richpedia-MEL, and WikiDiverse datasets show that PGMEL learns more meaningful representations and outperforms state-of-the-art methods. Conclusion: By adversarially generating negative samples, PGMEL improves multimodal entity linking performance and confirms the importance of negative-sample quality for this task. Abstract: The task of entity linking, which involves associating mentions with their respective entities in a knowledge graph, has received significant attention due to its numerous potential applications. Recently, various multimodal entity linking (MEL) techniques have been proposed, aiming to learn comprehensive embeddings by leveraging both text and vision modalities. The selection of high-quality negative samples can potentially play a crucial role in metric/representation learning. However, to the best of our knowledge, this possibility remains unexplored in existing literature within the framework of MEL. To fill this gap, we address the multimodal entity linking problem in a generative adversarial setting where the generator is responsible for generating high-quality negative samples, and the discriminator is assigned the responsibility for the metric learning tasks. Since the generator is involved in generating samples, which is a discrete process, we optimize it using policy gradient techniques and propose a policy gradient-based generative adversarial network for multimodal entity linking (PGMEL). Experimental results based on Wiki-MEL, Richpedia-MEL and WikiDiverse datasets demonstrate that PGMEL learns meaningful representation by selecting challenging negative samples and outperforms state-of-the-art methods.

[53] IndiCASA: A Dataset and Bias Evaluation Framework in LLMs Using Contrastive Embedding Similarity in the Indian Context

Santhosh G S,Akshay Govind S,Gokul S Krishnan,Balaraman Ravindran,Sriraam Natarajan

Main category: cs.CL

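TL;DR: Proposes an evaluation framework based on a contrastively trained encoder, together with a new dataset, IndiCASA, for detecting social bias along multiple axes in the Indian context; disability-related bias is found to be especially pronounced in current LLMs.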

Details

Motivation: Existing bias evaluation methods struggle to capture nuanced stereotypes in India's culturally diverse context, so a finer-grained evaluation framework is needed. Method: An encoder is trained with contrastive learning and measures fine-grained bias through embedding similarity; the authors also build the IndiCASA dataset of 2,575 human-validated sentences covering five axes: caste, gender, religion, disability, and socioeconomic status. Result: All evaluated open-weight LLMs show some degree of stereotypical bias; disability-related bias is the most persistent, while religion bias is generally lower, likely owing to global debiasing efforts. Conclusion: Bias evaluation needs finer-grained methods designed for specific cultural contexts, and model development needs to become fairer. Abstract: Large Language Models (LLMs) have gained significant traction across critical domains owing to their impressive contextual understanding and generative capabilities. However, their increasing deployment in high-stakes applications necessitates rigorous evaluation of embedded biases, particularly in culturally diverse contexts like India where existing embedding-based bias assessment methods often fall short in capturing nuanced stereotypes. We propose an evaluation framework based on an encoder trained using contrastive learning that captures fine-grained bias through embedding similarity. We also introduce a novel dataset - IndiCASA (IndiBias-based Contextually Aligned Stereotypes and Anti-stereotypes) comprising 2,575 human-validated sentences spanning five demographic axes: caste, gender, religion, disability, and socioeconomic status. Our evaluation of multiple open-weight LLMs reveals that all models exhibit some degree of stereotypical bias, with disability-related biases being notably persistent and religion bias generally lower, likely due to global debiasing efforts, demonstrating the need for fairer model development.

[54] The Path of Self-Evolving Large Language Models: Achieving Data-Efficient Learning via Intrinsic Feedback

Hangfan Zhang,Siyuan Xu,Zhimeng Guo,Huaisheng Zhu,Shicheng Liu,Xinrun Wang,Qiaosheng Zhang,Yang Chen,Peng Ye,Lei Bai,Shuyue Hu

Main category: cs.CL

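TL;DR: Proposes a self-awareness-based reinforcement learning method in which an LLM proposes tasks and then attempts to solve them, substantially improving reasoning with very little extra data.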

Details

Motivation: To reduce the dependence of RL training on large amounts of annotated data and to explore how LLM reasoning can be improved with minimal data. Method: Two self-awareness mechanisms are introduced: self-aware difficulty prediction (the model assesses task difficulty and prioritizes challenging yet solvable tasks) and self-aware limit breaking (the model recognizes tasks beyond its capability and proactively requests external data). Result: A 53.8% relative improvement across nine benchmarks with less than 1.2% extra data. Conclusion: Self-aware RL effectively reduces data dependence and shows the promise of self-evolving agent training. Abstract: Reinforcement learning (RL) has demonstrated potential in enhancing the reasoning capabilities of large language models (LLMs), but such training typically demands substantial efforts in creating and annotating data. In this work, we explore improving LLMs through RL with minimal data. Our approach alternates between the LLM proposing a task and then attempting to solve it. To minimize data dependency, we introduce two novel mechanisms grounded in self-awareness: (1) self-aware difficulty prediction, where the model learns to assess task difficulty relative to its own abilities and prioritize challenging yet solvable tasks, and (2) self-aware limit breaking, where the model recognizes when a task is beyond its capability boundary and proactively requests external data to break through that limit. Extensive experiments on nine benchmarks showing a 53.8% relative improvement with less than 1.2% extra data demonstrate the efficacy of self-aware RL and underscore the promise of self-evolving agent training.

[55] XTRA: Cross-Lingual Topic Modeling with Topic and Representation Alignments

Tien Phat Nguyen,Vu Minh Ngo,Tung Nguyen,Linh Van Ngo,Duc Anh Nguyen,Sang Dinh,Trung Le

Main category: cs.CL

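TL;DR: Proposes XTRA, a new cross-lingual topic modeling framework that unifies Bag-of-Words modeling with multilingual embeddings and adds representation alignment and topic alignment, significantly outperforming baselines in topic coherence, diversity, and cross-lingual consistency.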

Details

Motivation: Existing cross-lingual topic modeling methods struggle to ensure both high topic coherence and consistent alignment across languages. Method: XTRA combines Bag-of-Words modeling with multilingual embeddings, uses contrastive learning to align document-topic distributions (representation alignment), and projects topic-word distributions into the same shared semantic space (topic alignment). Result: Experiments on multilingual corpora show that XTRA significantly outperforms strong baselines in topic coherence, diversity, and alignment quality. Conclusion: XTRA learns topics that are interpretable and well aligned across languages, improving cross-lingual topic modeling. Abstract: Cross-lingual topic modeling aims to uncover shared semantic themes across languages. Several methods have been proposed to address this problem, leveraging both traditional and neural approaches. While previous methods have achieved some improvements in topic diversity, they often struggle to ensure high topic coherence and consistent alignment across languages. We propose XTRA (Cross-Lingual Topic Modeling with Topic and Representation Alignments), a novel framework that unifies Bag-of-Words modeling with multilingual embeddings. XTRA introduces two core components: (1) representation alignment, aligning document-topic distributions via contrastive learning in a shared semantic space; and (2) topic alignment, projecting topic-word distributions into the same space to enforce crosslingual consistency. This dual mechanism enables XTRA to learn topics that are interpretable (coherent and diverse) and well-aligned across languages. Experiments on multilingual corpora confirm that XTRA significantly outperforms strong baselines in topic coherence, diversity, and alignment quality. Code and reproducible scripts are available at https://github.com/tienphat140205/XTRA.

Matej Gjurković

Main category: cs.CL

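TL;DR: This thesis introduces two datasets collected from Reddit (MBTI9k and PANDORA) to address data scarcity and the psychology-NLP disconnect in NLP-based personality assessment, and develops the interpretable SIMPA framework, which uses semantic matching for efficient, accurate, and explainable personality analysis.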

Details

Motivation: The lack of large labeled datasets and the disconnect between personality psychology and natural language processing (NLP) limit the validity and interpretability of current automated personality assessment models. Method: Two new large-scale personality-labeled datasets, MBTI9k and PANDORA, are built, and the SIMPA framework is proposed; it assesses personality by semantically matching user-generated text against standardized questionnaire items, combining machine learning with semantic similarity. Result: Experiments show that demographic variables influence model validity; SIMPA delivers personality assessments comparable to human evaluations while remaining highly interpretable and efficient. Conclusion: SIMPA offers an effective approach to interpretable personality assessment, and its modular, model-agnostic, and scalable design makes it applicable to a wider range of classification tasks with complex label taxonomies. Abstract: Personality refers to individual differences in behavior, thinking, and feeling. With the growing availability of digital footprints, especially from social media, automated methods for personality assessment have become increasingly important. Natural language processing (NLP) enables the analysis of unstructured text data to identify personality indicators. However, two main challenges remain central to this thesis: the scarcity of large, personality-labeled datasets and the disconnect between personality psychology and NLP, which restricts model validity and interpretability. To address these challenges, this thesis presents two datasets -- MBTI9k and PANDORA -- collected from Reddit, a platform known for user anonymity and diverse discussions. The PANDORA dataset contains 17 million comments from over 10,000 users and integrates the MBTI and Big Five personality models with demographic information, overcoming limitations in data size, quality, and label coverage. Experiments on these datasets show that demographic variables influence model validity. In response, the SIMPA (Statement-to-Item Matching Personality Assessment) framework was developed - a computational framework for interpretable personality assessment that matches user-generated statements with validated questionnaire items. By using machine learning and semantic similarity, SIMPA delivers personality assessments comparable to human evaluations while maintaining high interpretability and efficiency. Although focused on personality assessment, SIMPA's versatility extends beyond this domain. Its model-agnostic design, layered cue detection, and scalability make it suitable for various research and practical applications involving complex label taxonomies and variable cue associations with target concepts.

[57] StepChain GraphRAG: Reasoning Over Knowledge Graphs for Multi-Hop Question Answering

Tengjun Ni,Xin Yuan,Shenghong Li,Kai Wu,Ren Ping Liu,Wei Ni,Wenjie Zhang

Main category: cs.CL

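TL;DR: StepChain GraphRAG is a retrieval-augmented generation framework that combines question decomposition with a breadth-first search (BFS) reasoning flow; by building a knowledge graph on the fly and assembling explicit evidence chains, it achieves state-of-the-art results on multi-hop question answering and improves explainability.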

Details

Motivation: Existing methods still struggle to integrate iterative reasoning steps with external knowledge retrieval, which hurts both accuracy and explainability in multi-hop QA. Method: A global index is first built over the corpus; at inference time, only the retrieved passages are parsed on the fly into a knowledge graph; the complex question is decomposed into sub-questions, and for each sub-question a BFS-based traversal expands along relevant edges to form an explicit evidence chain. Result: StepChain GraphRAG reaches state-of-the-art results on MuSiQue, 2WikiMultiHopQA, and HotpotQA, lifting average EM by 2.57% and F1 by 2.13%, with the largest gain on HotpotQA (+4.70% EM, +3.44% F1), and makes the reasoning process more explainable. Conclusion: Combining question decomposition with a BFS reasoning flow improves both performance and explainability in multi-hop QA; future work should reduce the computational overhead and mitigate LLM hallucinations to further improve efficiency and reliability. Abstract: Recent progress in retrieval-augmented generation (RAG) has led to more accurate and interpretable multi-hop question answering (QA). Yet, challenges persist in integrating iterative reasoning steps with external knowledge retrieval. To address this, we introduce StepChain GraphRAG, a framework that unites question decomposition with a Breadth-First Search (BFS) Reasoning Flow for enhanced multi-hop QA. Our approach first builds a global index over the corpus; at inference time, only retrieved passages are parsed on-the-fly into a knowledge graph, and the complex query is split into sub-questions. For each sub-question, a BFS-based traversal dynamically expands along relevant edges, assembling explicit evidence chains without overwhelming the language model with superfluous context. Experiments on MuSiQue, 2WikiMultiHopQA, and HotpotQA show that StepChain GraphRAG achieves state-of-the-art Exact Match and F1 scores. StepChain GraphRAG lifts average EM by 2.57% and F1 by 2.13% over the SOTA method, achieving the largest gain on HotpotQA (+4.70% EM, +3.44% F1). StepChain GraphRAG also fosters enhanced explainability by preserving the chain-of-thought across intermediate retrieval steps. We conclude by discussing how future work can mitigate the computational overhead and address potential hallucinations from large language models to refine efficiency and reliability in multi-hop QA.
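
The BFS-based traversal described in this entry can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy graph, the function name `bfs_evidence_chain`, and the `max_hops` cutoff are all assumptions made for illustration, and the paper's relevance filtering of edges is omitted.

```python
from collections import deque

def bfs_evidence_chain(graph, start, goal, max_hops=3):
    """Return the shortest chain of (head, relation, tail) triples linking
    start to goal, or None if goal is not reachable within max_hops."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, chain = queue.popleft()
        if node == goal:
            return chain
        if len(chain) >= max_hops:
            continue  # hop budget used up, do not expand this node
        for relation, neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, chain + [(node, relation, neighbor)]))
    return None

# Hypothetical knowledge graph parsed from retrieved passages.
graph = {
    "Inception": [("directed_by", "Christopher Nolan")],
    "Christopher Nolan": [("born_in", "London")],
    "London": [("capital_of", "United Kingdom")],
}
chain = bfs_evidence_chain(graph, "Inception", "London")
```

Because BFS explores hop by hop, the first chain found is also a shortest one, which keeps the evidence passed to the language model compact.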

[58] Evaluating Large Language Models for IUCN Red List Species Information

Shinya Uryu

Main category: cs.CL

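TL;DR: Large language models (LLMs) perform well on species taxonomy but poorly on reasoning tasks such as conservation status assessment; they show a knowledge-reasoning gap and a systematic bias toward charismatic animals, so responsible use requires human experts in the loop.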

Details

Motivation: To assess how reliable LLMs are for IUCN Red List species assessment and to expose their limitations in conservation biology applications. Method: Five leading LLMs are systematically validated on 21,955 species across four core assessment components: taxonomy, conservation status, distribution, and threats. Result: The models reach 94.9% accuracy on taxonomic classification but only 27.2% on conservation status assessment, and they consistently show a systematic bias favoring charismatic vertebrates. Conclusion: LLMs are suitable for information retrieval, but because of weak reasoning and bias, judgment and decisions must remain with human experts; a hybrid human-AI approach is recommended. Abstract: Large Language Models (LLMs) are rapidly being adopted in conservation to address the biodiversity crisis, yet their reliability for species evaluation is uncertain. This study systematically validates five leading models on 21,955 species across four core IUCN Red List assessment components: taxonomy, conservation status, distribution, and threats. A critical paradox was revealed: models excelled at taxonomic classification (94.9%) but consistently failed at conservation reasoning (27.2% for status assessment). This knowledge-reasoning gap, evident across all models, suggests inherent architectural constraints, not just data limitations. Furthermore, models exhibited systematic biases favoring charismatic vertebrates, potentially amplifying existing conservation inequities. These findings delineate clear boundaries for responsible LLM deployment: they are powerful tools for information retrieval but require human oversight for judgment-based decisions. A hybrid approach is recommended, where LLMs augment expert capacity while human experts retain sole authority over risk assessment and policy.

[59] Constraint Satisfaction Approaches to Wordle: Novel Heuristics and Cross-Lexicon Validation

Jahidul Arafat,Fariha Tasmin,Sanjaya Poudel,Kamrujjaman,Eftakhar Ahmed Arnob,Ahsan Habib Tareq

Main category: cs.CL

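TL;DR: This paper presents the first comprehensive constraint satisfaction problem (CSP) formulation of Wordle, introducing CSP-Aware Entropy and a Probabilistic CSP framework that outperform traditional approaches in solving efficiency, robustness, and cross-lexicon generalization.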

Details

Motivation: Existing Wordle solvers rely on information-theoretic entropy maximization or frequency heuristics without formal constraint treatment, which limits their performance under noise and across languages. The authors aim to build a formal CSP framework that makes solving more systematic and robust. Method: CSP-Aware Entropy computes information gain after constraint propagation; a Probabilistic CSP framework integrates Bayesian word-frequency priors with logical constraints; both are evaluated on 2,315 English words and 500 Spanish words, with analysis of noise robustness and cross-lexicon performance. Result: CSP-Aware Entropy averages 3.54 guesses (99.9% success rate) with 46% faster runtime than Forward Checking and keeps a 5.3 percentage point advantage under 10% noise; Probabilistic CSP achieves 100% success at all noise levels from 0 to 20%; cross-lexicon validation reaches 88% success with no language-specific tuning. Conclusion: Formal CSP modeling combined with constraint-aware heuristics and probabilistic-logical integration improves solving performance on structured puzzles such as Wordle and provides a reproducible new benchmark for CSP research. Abstract: Wordle presents an algorithmically rich testbed for constraint satisfaction problem (CSP) solving. While existing solvers rely on information-theoretic entropy maximization or frequency-based heuristics without formal constraint treatment, we present the first comprehensive CSP formulation of Wordle with novel constraint-aware solving strategies. We introduce CSP-Aware Entropy, computing information gain after constraint propagation rather than on raw candidate sets, and a Probabilistic CSP framework integrating Bayesian word-frequency priors with logical constraints. Through evaluation on 2,315 English words, CSP-Aware Entropy achieves 3.54 average guesses with 99.9% success rate, a statistically significant 1.7% improvement over Forward Checking (t=-4.82, p<0.001, Cohen's d=0.07) with 46% faster runtime (12.9ms versus 23.7ms per guess). Under 10% noise, CSP-aware approaches maintain 5.3 percentage point advantages (29.0% versus 23.7%, p=0.041), while Probabilistic CSP achieves 100% success across all noise levels (0-20%) through constraint recovery mechanisms. Cross-lexicon validation on 500 Spanish words demonstrates 88% success with zero language-specific tuning, validating that core CSP principles transfer across languages despite an 11.2 percentage point gap from linguistic differences (p<0.001, Fisher's exact test). Our open-source implementation with 34 unit tests achieving 91% code coverage provides reproducible infrastructure for CSP research. The combination of formal CSP treatment, constraint-aware heuristics, probabilistic-logical integration, robustness analysis, and cross-lexicon validation establishes new performance benchmarks demonstrating that principled constraint satisfaction techniques outperform classical information-theoretic and learning-based approaches for structured puzzle-solving domains.
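
To make the idea of scoring guesses on a constraint-filtered candidate set concrete, here is a small Python sketch. It is a simplification under stated assumptions, not the paper's CSP-Aware Entropy: filtering candidates by feedback consistency stands in for full constraint propagation, and the function names and the four-word lexicon are invented for illustration.

```python
import math
from collections import Counter

def feedback(guess, answer):
    """Wordle feedback string: 'g' green, 'y' yellow, '-' gray.
    Duplicate letters are handled by counting unmatched answer letters."""
    result = ["-"] * len(guess)
    remaining = Counter()
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "g"
        else:
            remaining[a] += 1
    for i, g in enumerate(guess):
        if result[i] == "-" and remaining[g] > 0:
            result[i] = "y"
            remaining[g] -= 1
    return "".join(result)

def prune(candidates, guess, pattern):
    """Keep only the words consistent with an observed feedback pattern."""
    return [w for w in candidates if feedback(guess, w) == pattern]

def guess_entropy(guess, candidates):
    """Shannon entropy (bits) of the feedback distribution a guess induces."""
    counts = Counter(feedback(guess, w) for w in candidates)
    n = len(candidates)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Hypothetical four-word lexicon.
words = ["crane", "crate", "brace", "trace"]
bits = guess_entropy("crane", words)
survivors = prune(words, "crane", "ygg-g")
```

In a solver loop, `prune` would be applied after each observed pattern and `guess_entropy` evaluated only over the surviving candidates, which is the sense in which the score is computed after the constraints have been applied.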

[60] Self-Reflective Generation at Test Time

Jian Mu,Qixin Zhang,Zhiyong Wang,Menglin Yang,Shuang Qiu,Chengwei Qin,Zhongxiang Dai,Yao Shu

Main category: cs.CL

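TL;DR: Proposes SRGen, a lightweight test-time self-reflection framework that reflects before generating uncertain tokens: it identifies high-uncertainty tokens with dynamic entropy thresholding and trains a corrective vector on the already generated context to adjust the token probability distribution, substantially improving accuracy and consistency on tasks such as mathematical reasoning.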

Details

Motivation: Autoregressive generation in LLMs is sensitive to early errors, and existing self-reflection methods are either inefficient or require expensive training; a lightweight mechanism for proactive correction during generation is missing. Method: The SRGen framework (1) detects high-uncertainty tokens with dynamic entropy thresholding, (2) trains a token-specific corrective vector on the already generated context, and (3) adjusts the token probability distribution before generating, so that reflection happens at generation time. Result: On mathematical reasoning benchmarks such as AIME2024, SRGen improves DeepSeek-R1-Distill-Qwen-7B by +12.0% absolute on Pass@1 and +13.3% absolute on Cons@5, with low overhead and good composability with other techniques (e.g., RLHF, SLOT). Conclusion: SRGen is an efficient, plug-and-play test-time self-reflection method that makes LLM reasoning more reliable and consistent across a variety of models and tasks. Abstract: Large language models (LLMs) increasingly solve complex reasoning tasks via long chain-of-thought, but their forward-only autoregressive generation process is fragile; early token errors can cascade, which creates a clear need for self-reflection mechanisms. However, existing self-reflection either performs revisions over full drafts or learns self-correction via expensive training, both fundamentally reactive and inefficient. To address this, we propose Self-Reflective Generation at Test Time (SRGen), a lightweight test-time framework that reflects before generating at uncertain points. During token generation, SRGen utilizes dynamic entropy thresholding to identify high-uncertainty tokens. For each identified token, it trains a specific corrective vector, which fully exploits the already generated context for a self-reflective generation to correct the token probability distribution. By retrospectively analyzing the partial output, this self-reflection enables more trustworthy decisions, thereby significantly reducing the probability of errors at highly uncertain points. Evaluated on challenging mathematical reasoning benchmarks and a diverse set of LLMs, SRGen can consistently strengthen model reasoning: improvements in single-pass quality also translate into stronger self-consistency voting. Especially, on AIME2024 with DeepSeek-R1-Distill-Qwen-7B, SRGen yields absolute improvements of +12.0% on Pass@1 and +13.3% on Cons@5. Moreover, our findings position SRGen as a plug-and-play method that integrates reflection into the generation process for reliable LLM reasoning, achieving consistent gains with bounded overhead and broad composability with other training-time (e.g., RLHF) and test-time (e.g., SLOT) techniques.
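
The abstract does not spell out how the dynamic entropy threshold is set, so the following Python sketch shows one plausible reading rather than SRGen's actual rule. The threshold used here (mean plus `k` standard deviations over a sliding window of recent entropies), along with the names `flag_uncertain`, `window`, and `k`, is an assumption made for illustration.

```python
import math

def token_entropy(probs):
    """Shannon entropy (nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def flag_uncertain(entropies, window=4, k=1.0):
    """Flag step t when its entropy exceeds mean + k * std of the previous
    `window` steps (an assumed rule, not the one from the paper)."""
    flags = []
    for t, h in enumerate(entropies):
        history = entropies[max(0, t - window):t]
        if len(history) < 2:
            flags.append(False)  # not enough history to estimate a threshold
            continue
        mean = sum(history) / len(history)
        std = math.sqrt(sum((x - mean) ** 2 for x in history) / len(history))
        flags.append(h > mean + k * std)
    return flags

# Hypothetical per-step entropies with one spike at step 3.
flags = flag_uncertain([0.10, 0.12, 0.11, 0.90, 0.10])
```

A threshold that adapts to recent history, unlike a fixed cutoff, lets a spike stand out relative to the model's own baseline uncertainty on the current problem; flagged positions are where a corrective step would be triggered.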

[61] Finding Diamonds in Conversation Haystacks: A Benchmark for Conversational Data Retrieval

Yohan Lee,Yongwoo Song,Sangyeop Kim

Main category: cs.CL

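TL;DR: Introduces CDR, the first benchmark for evaluating conversational data retrieval systems, with 1.6k queries and 9.1k conversations, and exposes the performance bottlenecks of current models on conversation retrieval.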

Details

Motivation: There is no unified, comprehensive standard for evaluating systems that retrieve conversation data for product insights, which makes their real-world effectiveness hard to measure. Method: The authors build CDR, a large-scale conversational retrieval benchmark covering five analytical tasks with 1.6k queries and 9.1k conversations, and evaluate 16 popular embedding models using metrics such as NDCG@10. Result: Even the best models reach an NDCG@10 of only about 0.51, well below document retrieval performance, exposing key challenges in conversational data retrieval such as implicit state recognition, turn dynamics, and contextual references. Conclusion: A substantial technical gap remains in conversational data retrieval; the CDR benchmark gives follow-up work a reliable standard, practical query templates, and detailed error analysis. Abstract: We present the Conversational Data Retrieval (CDR) benchmark, the first comprehensive test set for evaluating systems that retrieve conversation data for product insights. With 1.6k queries across five analytical tasks and 9.1k conversations, our benchmark provides a reliable standard for measuring conversational data retrieval performance. Our evaluation of 16 popular embedding models shows that even the best models reach only around NDCG@10 of 0.51, revealing a substantial gap between document and conversational data retrieval capabilities. Our work identifies unique challenges in conversational data retrieval (implicit state recognition, turn dynamics, contextual references) while providing practical query templates and detailed error analysis across different task categories. The benchmark dataset and code are available at https://github.com/l-yohai/CDR-Benchmark.
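
For readers unfamiliar with the headline metric, the sketch below computes NDCG@k in Python. It uses the linear-gain form and assumes the relevance list covers every judged item for the query in ranked order; whether the CDR benchmark uses this exact variant is not stated in the abstract.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical ranking: the only relevant conversation is retrieved second.
score = ndcg_at_k([0, 1, 0])
```

A score around 0.51, as reported for the best models, means relevant conversations are often retrieved but ranked well below where an ideal ordering would place them.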

[62] Leave No TRACE: Black-box Detection of Copyrighted Dataset Usage in Large Language Models via Watermarking

Jingqi Zhang,Ruibo Chen,Yingqing Yang,Peihua Mai,Heng Huang,Yan Pang

Main category: cs.CL

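TL;DR: This paper proposes TRACE, a practical framework for detecting, under fully black-box conditions, whether a copyrighted dataset was used to fine-tune a large language model (LLM). It embeds distortion-free watermarks guided by a private key, exploits the radioactivity effect of fine-tuning on watermarked data, and adds an entropy-gated scoring procedure to strengthen detection, achieving significant and robust detection across datasets and models.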

Details

Motivation: As LLMs are increasingly fine-tuned on small, domain-specific datasets that are proprietary or copyrighted, reliable safeguards against unauthorized data use are needed. Existing membership inference attacks and dataset-inference methods typically depend on internal signals or handcrafted prompts, which limits practical use, and existing watermarking techniques can degrade text quality or task performance. A black-box detection scheme that preserves text quality and utility while detecting strongly is therefore needed. Method: The TRACE framework (1) rewrites the dataset with distortion-free watermarks guided by a private key; (2) at detection time, exploits the radioactivity effect, whereby a model fine-tuned on watermarked data tends to reproduce the watermark pattern; and (3) applies entropy gating, scoring only high-uncertainty tokens to amplify the detection signal. No access to model parameters or gradients is required, so detection is fully black-box. Result: Across multiple datasets and model families, TRACE detects the use of watermarked data at a significant level (p<0.05), often with extremely strong statistical evidence; it supports multi-dataset attribution and remains robust even after continued pretraining on large non-watermarked corpora. Conclusion: TRACE offers a practical and reliable route to black-box verification of copyrighted dataset usage in LLM fine-tuning without sacrificing text quality or task performance. Abstract: Large Language Models (LLMs) are increasingly fine-tuned on smaller, domain-specific datasets to improve downstream performance. These datasets often contain proprietary or copyrighted material, raising the need for reliable safeguards against unauthorized use. Existing membership inference attacks (MIAs) and dataset-inference methods typically require access to internal signals such as logits, while current black-box approaches often rely on handcrafted prompts or a clean reference dataset for calibration, both of which limit practical applicability. Watermarking is a promising alternative, but prior techniques can degrade text quality or reduce task performance. We propose TRACE, a practical framework for fully black-box detection of copyrighted dataset usage in LLM fine-tuning. TRACE rewrites datasets with distortion-free watermarks guided by a private key, ensuring both text quality and downstream utility. At detection time, we exploit the radioactivity effect of fine-tuning on watermarked data and introduce an entropy-gated procedure that selectively scores high-uncertainty tokens, substantially amplifying detection power. Across diverse datasets and model families, TRACE consistently achieves significant detections (p<0.05), often with extremely strong statistical evidence. Furthermore, it supports multi-dataset attribution and remains robust even after continued pretraining on large non-watermarked corpora. These results establish TRACE as a practical route to reliable black-box verification of copyrighted dataset usage. We will make our code available at: https://github.com/NusIoraPrivacy/TRACE.

[63] Grounding Large Language Models in Clinical Evidence: A Retrieval-Augmented Generation System for Querying UK NICE Clinical Guidelines

Matthew Lewis,Samuel Thio,Richard JB Dobson,Spiros Denaxas

Main category: cs.CL

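TL;DR: This study develops and evaluates an LLM-based retrieval-augmented generation (RAG) system for querying the UK NICE clinical guidelines, substantially improving retrieval accuracy and the faithfulness of generated answers.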

Details

Motivation: The NICE clinical guidelines are numerous and long, which makes them hard to use efficiently in a time-constrained healthcare setting; a system is needed that answers natural language queries quickly with precisely matched information. Method: A RAG system with a hybrid embedding retrieval mechanism is built over a database of 10,195 text chunks extracted from 300 guidelines; retrieval is evaluated on 7,901 queries, and generation is evaluated on 70 manually curated question-answer pairs. Result: Retrieval reaches an MRR of 0.814, with recall of 81% at the first chunk and 99.1% within the top ten chunks; in generation, Faithfulness for the RAG-enhanced O4-Mini model rises by 64.7 percentage points to 99.5%, far above the 43% of Meditron3-8B, and all RAG-enhanced models score a Context Precision of 1. Conclusion: RAG is an effective, reliable, and scalable way to apply generative AI in healthcare and makes clinical guidelines more accessible and efficient to use. Abstract: This paper presents the development and evaluation of a Retrieval-Augmented Generation (RAG) system for querying the United Kingdom's National Institute for Health and Care Excellence (NICE) clinical guidelines using Large Language Models (LLMs). The extensive length and volume of these guidelines can impede their utilisation within a time-constrained healthcare system, a challenge this project addresses through the creation of a system capable of providing users with precisely matched information in response to natural language queries. The system's retrieval architecture, composed of a hybrid embedding mechanism, was evaluated against a database of 10,195 text chunks derived from three hundred guidelines. It demonstrates high performance, with a Mean Reciprocal Rank (MRR) of 0.814, a Recall of 81% at the first chunk and of 99.1% within the top ten retrieved chunks, when evaluated on 7,901 queries. The most significant impact of the RAG system was observed during the generation phase. When evaluated on a manually curated dataset of seventy question-answer pairs, RAG-enhanced models showed substantial gains in performance. Faithfulness, the measure of whether an answer is supported by the source text, was increased by 64.7 percentage points to 99.5% for the RAG-enhanced O4-Mini model and significantly outperformed the medical-focused Meditron3-8B LLM, which scored 43%. This, combined with a perfect Context Precision score of 1 for all RAG-enhanced models, confirms the system's ability to prevent information fabrication by grounding its answers in relevant source material. This study thus establishes RAG as an effective, reliable, and scalable approach for applying generative AI in healthcare, enabling cost-effective access to medical guidelines.
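
The two retrieval metrics reported in this entry can be computed as in the Python sketch below. It assumes exactly one gold chunk per query, which the abstract does not confirm, and the chunk identifiers are made up for illustration.

```python
def mean_reciprocal_rank(ranked_lists, gold):
    """Average of 1/rank of the gold chunk; a query whose gold chunk is
    not retrieved contributes 0."""
    total = 0.0
    for ranked, target in zip(ranked_lists, gold):
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(gold)

def recall_at_k(ranked_lists, gold, k):
    """Fraction of queries whose gold chunk appears in the top k results."""
    hits = sum(target in ranked[:k] for ranked, target in zip(ranked_lists, gold))
    return hits / len(gold)

# Hypothetical retrieval results for three queries.
ranked = [["c1", "c2", "c3"], ["c4", "c5", "c6"], ["c7", "c8", "c9"]]
gold = ["c1", "c5", "c0"]
mrr = mean_reciprocal_rank(ranked, gold)
```

Under this single-gold-chunk assumption, Recall@1 equals the share of queries answered by the top result, so an MRR of 0.814 alongside a Recall@1 of 81% indicates that most misses at rank 1 are recovered only at fairly deep ranks.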

[64] Semantic Differentiation in Speech Emotion Recognition: Insights from Descriptive and Expressive Speech Roles

Rongchen Guo,Vincent Francoeur,Isar Nejadgholi,Sylvain Gagnon,Miodrag Bolic

Main category: cs.CL

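TL;DR: This study distinguishes descriptive from expressive semantics in speech and examines their roles in speech emotion recognition (SER), finding that the former aligns with intended emotions and the latter correlates with evoked emotions.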

Details

Motivation: To improve the accuracy of speech emotion recognition, which is currently constrained by the subtlety and complexity of emotion in speech. Method: Participants watched emotionally charged movie segments and were then recorded describing their experiences; these recordings are analyzed together with the intended emotion tags, self-rated emotional responses, and valence/arousal scores to relate descriptive and expressive semantics. Result: Descriptive semantics align with intended emotions, while expressive semantics correlate with the emotions actually evoked, as captured by participants' self-reports. Conclusion: Distinguishing descriptive from expressive semantics can help improve SER systems and supports more context-aware AI for human-AI interaction. Abstract: Speech Emotion Recognition (SER) is essential for improving human-computer interaction, yet its accuracy remains constrained by the complexity of emotional nuances in speech. In this study, we distinguish between descriptive semantics, which represents the contextual content of speech, and expressive semantics, which reflects the speaker's emotional state. After watching emotionally charged movie segments, we recorded audio clips of participants describing their experiences, along with the intended emotion tags for each clip, participants' self-rated emotional responses, and their valence/arousal scores. Through experiments, we show that descriptive semantics align with intended emotions, while expressive semantics correlate with evoked emotions. Our findings inform SER applications in human-AI interaction and pave the way for more context-aware AI systems.

[65] Revisiting Direct Speech-to-Text Translation with Speech LLMs: Better Scaling than CoT Prompting?

Oriol Pareras,Gerard I. Gállego,Federico Costa,Cristina España-Bonet,Javier Hernando

Main category: cs.CL

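TL;DR: This paper systematically compares Chain-of-Thought (CoT) and direct prompting for LLM-based speech-to-text translation (S2TT) at different S2TT data scales, finding that direct prompting improves more consistently as data grows and may become the more effective approach at larger scales.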

Details

Motivation: CoT prompting benefits from abundant ASR and T2TT datasets, but as S2TT data becomes more plentiful it is unclear whether CoT still beats direct prompting, so a systematic comparison across data scales is needed. Method: An ASR corpus is pseudo-labeled by translating its transcriptions into six European languages, producing S2TT data at several scales; LLM-based S2TT systems are trained with both prompting strategies and compared at each scale. Result: As the amount of S2TT data increases, direct prompting improves more consistently than CoT. Conclusion: With S2TT data continuing to grow, direct prompting may become a more effective strategy than CoT. Abstract: Recent work on Speech-to-Text Translation (S2TT) has focused on LLM-based models, introducing the increasingly adopted Chain-of-Thought (CoT) prompting, where the model is guided to first transcribe the speech and then translate it. CoT typically outperforms direct prompting primarily because it can exploit abundant Automatic Speech Recognition (ASR) and Text-to-Text Translation (T2TT) datasets to explicitly model its steps. In this paper, we systematically compare CoT and Direct prompting under increasing amounts of S2TT data. To this end, we pseudo-label an ASR corpus by translating its transcriptions into six European languages, and train LLM-based S2TT systems with both prompting strategies at different data scales. Our results show that Direct improves more consistently as the amount of data increases, suggesting that it may become a more effective approach as larger S2TT resources are created.

[66] Semantic Similarity in Radiology Reports via LLMs and NER

Beth Pearson,Ahmed Adnan,Zahraa Abdallah

Main category: cs.CL

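TL;DR: This paper proposes Llama-EntScore, a semantic similarity scoring method that combines Llama 3.1 with named entity recognition (NER) to compare preliminary and final radiology reports; it gives explainable, accurate feedback and outperforms LLMs or NER used alone.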

Details

Motivation: Radiologist training needs effective tools for identifying semantic differences between preliminary and final reports, in order to improve diagnostic accuracy and clinical knowledge, but existing methods are limited in accuracy. Method: Several LLMs are compared on radiology report comparison, and a traditional NER-based approach is assessed; the authors then propose Llama-EntScore, which combines Llama 3.1 and NER with tunable weights to produce a semantic similarity score along with an interpretation of it. Result: Against ground-truth scores provided by radiologists, the method reaches 67% exact-match accuracy and 93% accuracy within +/- 1, outperforming LLMs and NER used independently. Conclusion: Llama-EntScore quantifies semantic differences between radiology reports and provides explainable feedback, supporting resident training and report quality improvement. Abstract: Radiology report evaluation is a crucial part of radiologists' training and plays a key role in ensuring diagnostic accuracy. As part of the standard reporting workflow, a junior radiologist typically prepares a preliminary report, which is then reviewed and edited by a senior radiologist to produce the final report. Identifying semantic differences between preliminary and final reports is essential for junior doctors, both as a training tool and to help uncover gaps in clinical knowledge. While AI in radiology is a rapidly growing field, the application of large language models (LLMs) remains challenging due to the need for specialised domain knowledge. In this paper, we explore the ability of LLMs to provide explainable and accurate comparisons of reports in the radiology domain. We begin by comparing the performance of several LLMs in comparing radiology reports. We then assess a more traditional approach based on Named-Entity-Recognition (NER). However, both approaches exhibit limitations in delivering accurate feedback on semantic similarity. To address this, we propose Llama-EntScore, a semantic similarity scoring method using a combination of Llama 3.1 and NER with tunable weights to emphasise or de-emphasise specific types of differences. Our approach generates a quantitative similarity score for tracking progress and also gives an interpretation of the score that aims to offer valuable guidance in reviewing and refining their reporting. We find our method achieves 67% exact-match accuracy and 93% accuracy within +/- 1 when compared to radiologist-provided ground truth scores - outperforming both LLMs and NER used independently. Code is available at: https://github.com/otmive/llama_reports

[67] Listening or Reading? Evaluating Speech Awareness in Chain-of-Thought Speech-to-Text Translation

Jacobo Romero-Díaz,Gerard I. Gállego,Oriol Pareras,Federico Costa,Javier Hernando,Cristina España-Bonet

Main category: cs.CL

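TL;DR: This paper studies speech-to-text translation (S2TT) systems based on Chain-of-Thought (CoT) prompting and finds that they rely mainly on the transcript and barely use the speech signal, so error propagation and the inability to exploit prosody persist. Simple training interventions, such as adding Direct S2TT data or injecting noisy transcripts, improve robustness and increase use of speech information. The study challenges the assumed advantages of CoT and argues for translation architectures that explicitly integrate acoustic information.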

Details

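Motivation: Cascaded S2TT systems suffer from error propagation and cannot exploit prosody; CoT prompting was expected to overcome these limitations by jointly using speech and text, but how it actually behaves is unclear and needs closer analysis. Method: CoT behavior in S2TT is analyzed through attribution methods, robustness evaluation with corrupted transcripts, and prosody-awareness tests; training interventions such as adding Direct S2TT data and noisy transcript injection are then tried to improve the models. Result: CoT prompting behaves much like a cascaded system, relying mainly on the transcript and making very little use of the speech signal; adding Direct S2TT data or injecting noisy transcripts improves robustness and increases speech attribution. Conclusion: Current CoT prompting does not effectively fuse speech and text, and its advantages have been overestimated; future S2TT architectures should explicitly integrate acoustic features to move beyond the limits of cascaded systems. Abstract: Speech-to-Text Translation (S2TT) systems built from Automatic Speech Recognition (ASR) and Text-to-Text Translation (T2TT) modules face two major limitations: error propagation and the inability to exploit prosodic or other acoustic cues. Chain-of-Thought (CoT) prompting has recently been introduced, with the expectation that jointly accessing speech and transcription will overcome these issues. Analyzing CoT through attribution methods, robustness evaluations with corrupted transcripts, and prosody-awareness, we find that it largely mirrors cascaded behavior, relying mainly on transcripts while barely leveraging speech. Simple training interventions, such as adding Direct S2TT data or noisy transcript injection, enhance robustness and increase speech attribution. These findings challenge the assumed advantages of CoT and highlight the need for architectures that explicitly integrate acoustic information into translation.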

[68] SurveyBench: How Well Can LLM(-Agents) Write Academic Surveys?

Zhaojun Sun,Xuzhou Zhu,Xuanhe Zhou,Xin Tong,Shuo Wang,Jie Fu,Guoliang Li,Zhiyuan Liu,Fan Wu

Main category: cs.CL

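TL;DR: This paper proposes SurveyBench, a fine-grained, quiz-driven framework for evaluating the quality of automatically generated academic surveys, filling the gap left by the lack of a rigorous, reader-aligned benchmark.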

Details

Motivation: Existing automatic survey generation methods (LLM4Survey) produce output of insufficient quality, and there is no rigorous benchmark aligned with readers' needs, so a finer-grained evaluation framework is needed to expose their deficiencies. Method: The authors build a set of survey topics drawn from 11,343 arXiv papers and 4,947 high-quality surveys, design a multi-level metric hierarchy (covering outline quality, content quality, and non-textual richness), and adopt a dual-mode evaluation protocol with content-based and quiz-based tests. Result: SurveyBench effectively challenges existing LLM4Survey methods, which score on average 21% lower than human-written surveys in content-based evaluation. Conclusion: SurveyBench provides a more comprehensive, reader-aligned evaluation standard that can help advance automatic survey generation. Abstract: Academic survey writing, which distills vast literature into a coherent and insightful narrative, remains a labor-intensive and intellectually demanding task. While recent approaches, such as general DeepResearch agents and survey-specialized methods, can generate surveys automatically (a.k.a. LLM4Survey), their outputs often fall short of human standards and a rigorous, reader-aligned benchmark for thoroughly revealing their deficiencies is lacking. To fill the gap, we propose a fine-grained, quiz-driven evaluation framework SurveyBench, featuring (1) typical survey topics sourced from 11,343 recent arXiv papers and corresponding 4,947 high-quality surveys; (2) a multifaceted metric hierarchy that assesses the outline quality (e.g., coverage breadth, logical coherence), content quality (e.g., synthesis granularity, clarity of insights), and non-textual richness; and (3) a dual-mode evaluation protocol that includes content-based and quiz-based answerability tests, explicitly aligned with readers' informational needs. Results show SurveyBench effectively challenges existing LLM4Survey approaches (e.g., on average 21% lower than human in content-based evaluation).

[69] Beyond the Final Layer: Intermediate Representations for Better Multilingual Calibration in Large Language Models

Ej Zhou,Caiqi Zhang,Tiancheng Hu,Chengzu Li,Nigel Collier,Ivan Vulić,Anna Korhonen

Main category: cs.CL

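TL;DR: The first large-scale study of confidence calibration in multilingual LLMs finds that non-English languages are worse calibrated, and proposes LACE, a training-free layer-ensemble method, to improve multilingual calibration.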

Details

Motivation: Confidence calibration of LLMs in multilingual settings is under-explored, and calibration in non-English languages in particular may be harmed by English-centric training. Method: A systematic analysis across six model families and over 100 languages, using layer-wise representation analysis to diagnose calibration problems, followed by LACE, a training-free calibration method built on intermediate layers. Result: The final layer, biased by English-centric training, gives a poor signal for multilingual confidence, whereas late-intermediate layers give a more reliable calibration signal; LACE improves calibration across many languages. Conclusion: Improving multilingual confidence calibration requires looking beyond the final layer, and LACE offers a new path toward more equitable and trustworthy LLMs worldwide. Abstract: Confidence calibration, the alignment of a model's predicted confidence with its actual accuracy, is crucial for the reliable deployment of Large Language Models (LLMs). However, this critical property remains largely under-explored in multilingual contexts. In this work, we conduct the first large-scale, systematic study of multilingual calibration across six model families and over 100 languages, revealing that non-English languages suffer from systematically worse calibration. To diagnose this, we investigate the model's internal representations and find that the final layer, biased by English-centric training, provides a poor signal for multilingual confidence. In contrast, our layer-wise analysis uncovers a key insight that late-intermediate layers consistently offer a more reliable and better-calibrated signal. Building on this, we introduce a suite of training-free methods, including Language-Aware Confidence Ensemble (LACE), which adaptively selects an optimal ensemble of layers for each specific language. Our study highlights the hidden costs of English-centric alignment and offers a new path toward building more globally equitable and trustworthy LLMs by looking beyond the final layer.
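
Calibration, defined in this entry as the match between predicted confidence and actual accuracy, is commonly summarized by the expected calibration error (ECE). The Python sketch below shows that standard metric under equal-width binning; the abstract does not say which calibration metric the paper reports, so treating ECE as the measure is an assumption, and the confidence values are invented.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(o for _, o in b) / len(b)
            ece += len(b) / n * abs(avg_conf - accuracy)
    return ece

# Hypothetical model that reports 0.9 confidence but is right 3 times out of 4.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

Computing this per language, once from final-layer confidences and once from an intermediate-layer signal, is one way to make the kind of comparison the entry describes.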

[70] EditLens: Quantifying the Extent of AI Editing in Text

Katherine Thai,Bradley Emi,Elyas Masrour,Mohit Iyyer

Main category: cs.CL

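TL;DR: Proposes a method for detecting AI-edited text: using lightweight similarity metrics and a trained regression model, EditLens, it distinguishes human-written, AI-generated, and mixed text and quantifies the extent of AI editing.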

Details

Motivation: Prior work focuses on detecting fully AI-generated text and overlooks the case where AI edits human-written text. This paper aims to fill that gap by examining how detectable AI-edited text is and what follows from that. Method: Uses lightweight similarity metrics to quantify the extent of AI editing and, with these metrics as intermediate supervision, trains a regression model called EditLens to predict the amount of AI editing in a text. Result: EditLens achieves state-of-the-art performance on binary (F1=94.7%) and ternary (F1=90.4%) classification, identifies AI-edited text and its degree of editing, and is applied in a case study of Grammarly. Conclusion: AI-edited text can be detected and its degree of editing quantified, which matters for authorship attribution, education, and policy. Abstract: A significant proportion of queries to large language models ask them to edit user-provided text, rather than generate new text from scratch. While previous work focuses on detecting fully AI-generated text, we demonstrate that AI-edited text is distinguishable from human-written and AI-generated text. First, we propose using lightweight similarity metrics to quantify the magnitude of AI editing present in a text given the original human-written text and validate these metrics with human annotators. Using these similarity metrics as intermediate supervision, we then train EditLens, a regression model that predicts the amount of AI editing present within a text. Our model achieves state-of-the-art performance on both binary (F1=94.7%) and ternary (F1=90.4%) classification tasks in distinguishing human, AI, and mixed writing. Not only do we show that AI-edited text can be detected, but also that the degree of change made by AI to human writing can be detected, which has implications for authorship attribution, education, and policy. Finally, as a case study, we use our model to analyze the effects of AI-edits applied by Grammarly, a popular writing assistance tool. To encourage further research, we commit to publicly releasing our models and dataset.
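
To make "lightweight similarity metric" concrete, the sketch below scores how much a text changed relative to its original using word-level sequence matching from Python's standard `difflib`. The abstract does not name the metrics the authors use, so this particular choice and the function name `edit_extent` are assumptions for illustration.

```python
import difflib

def edit_extent(original: str, edited: str) -> float:
    """Return a score in [0, 1]: 0.0 for identical texts, 1.0 for no shared words.

    Stand-in metric; the paper's actual similarity metrics are not specified
    in the abstract.
    """
    matcher = difflib.SequenceMatcher(None, original.split(), edited.split())
    return 1.0 - matcher.ratio()
```

Scores of this kind, computed between a human draft and its AI-edited version, are the sort of target a regression model could be trained to predict from the edited text alone.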

[71] Neural Correlates of Language Models Are Specific to Human Language

Iñigo Parra

Main category: cs.CL

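TL;DR: The study tests whether the correlation between LLM hidden states and fMRI brain responses is robust to several potential concerns, and it confirms and strengthens earlier findings.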

Details

Motivation: To test the robustness of the previously reported correlation between LLM and brain representations, ruling out confounds such as the curse of dimensionality and the choice of similarity measure. Method: Verifies the correlation through dimensionality-reduction analysis, new similarity measures, comparison of models trained on different data, and examination of the effect of positional encoding. Result: The correlation persists after dimensionality reduction; the new measures confirm the results; the correlation appears only in models trained on human language; and the results depend on positional encoding. Conclusion: Earlier results on the similarity between LLM and brain representations are robust, which informs the debate on the biological plausibility and interpretability of these models. Abstract: Previous work has shown correlations between the hidden states of large language models and fMRI brain responses on language tasks. These correlations have been taken as evidence of the representational similarity of these models and brain states. This study tests whether these previous results are robust to several possible concerns. Specifically, this study shows: (i) that the previous results are still found after dimensionality reduction, and thus are not attributable to the curse of dimensionality; (ii) that previous results are confirmed when using new measures of similarity; (iii) that correlations between brain representations and those from models are specific to models trained on human language; and (iv) that the results are dependent on the presence of positional encoding in the models. These results confirm and strengthen the results of previous research and contribute to the debate on the biological plausibility and interpretability of state-of-the-art large language models.

[72] Topic Modeling as Long-Form Generation: Can Long-Context LLMs revolutionize NTM via Zero-Shot Prompting?

Xuan Xu,Haolun Li,Zhongliang Yang,Beilin Chu,Jia Song,Moxuan Xu,Linna Zhou

Main category: cs.CL

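TL;DR: Proposes a new LLM-based paradigm that treats topic modeling as a long-form generation task and systematically compares it, via zero-shot prompting, with traditional neural topic models (NTMs) on topic quality.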

Details

Motivation: With the rise of LLMs, traditional neural topic models may be outdated; the paper explores a new LLM-based paradigm for topic modeling to improve the quality and usefulness of topic discovery. Method: Recasts topic modeling as long-form generation with zero-shot prompting: sample a data subset, generate topics and representative text, then assign texts by keyword matching, giving an out-of-the-box LLM topic model. Result: The experiments systematically compare LLMs and NTMs on topic quality and give preliminary evidence that zero-shot LLMs can match or exceed NTMs, in support of the view that "a majority of NTMs are outdated." Conclusion: The LLM-based long-form generation paradigm opens a new direction for topic modeling, shows strong potential without any training, and may replace traditional neural topic models. Abstract: Traditional topic models such as neural topic models rely on inference and generation networks to learn latent topic distributions. This paper explores a new paradigm for topic modeling in the era of large language models, framing TM as a long-form generation task whose definition is updated in this paradigm. We propose a simple but practical approach to implement LLM-based topic model tasks out of the box (sample a data subset, generate topics and representative text with our prompt, text assignment with keyword match). We then investigate whether the long-form generation paradigm can beat NTMs via zero-shot prompting. We conduct a systematic comparison between NTMs and LLMs in terms of topic quality and empirically examine the claim that "a majority of NTMs are outdated."

[73] Model-Based Ranking of Source Languages for Zero-Shot Cross-Lingual Transfer

Abteen Ebrahimi,Adam Wiemerslage,Katharina von der Wense

Main category: cs.CL

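TL;DR: Proposes NN-Rank, an algorithm that ranks source languages using hidden representations from multilingual models and unlabeled target-language data, outperforming existing methods on POS and NER.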

Details

Motivation: Existing source-language ranking methods for cross-lingual transfer rely on language-level features and are limited when no target-language data is available, so a method that makes better use of unlabeled data is needed. Method: Computes source-language ranking scores by nearest-neighbor similarity over the hidden representations of a multilingual pretrained model, combined with unlabeled target-language data. Result: With in-domain data, NN-Rank improves on existing methods by up to 35.56 NDCG on POS and 18.14 NDCG on NER; it stays competitive using only Bible text (out of domain); and with as few as 25 examples it reaches 92.8% of the NDCG obtained with all data. Conclusion: NN-Rank produces high-quality source-language rankings from small or out-of-domain unlabeled data, clearly outperforms baselines that depend on language features, and has practical potential. Abstract: We present NN-Rank, an algorithm for ranking source languages for cross-lingual transfer, which leverages hidden representations from multilingual models and unlabeled target-language data. We experiment with two pretrained multilingual models and two tasks: part-of-speech tagging (POS) and named entity recognition (NER). We consider 51 source languages and evaluate on 56 and 72 target languages for POS and NER, respectively. When using in-domain data, NN-Rank beats state-of-the-art baselines that leverage lexical and linguistic features, with average improvements of up to 35.56 NDCG for POS and 18.14 NDCG for NER. As prior approaches can fall back to language-level features if target language data is not available, we show that NN-Rank remains competitive using only the Bible, an out-of-domain corpus available for a large number of languages. Ablations on the amount of unlabeled target data show that, for subsets consisting of as few as 25 examples, NN-Rank produces high-quality rankings which achieve 92.8% of the NDCG achieved using all available target data for ranking.
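
The results above are reported in NDCG (normalized discounted cumulative gain), which rewards rankings that place the most useful source languages first. The sketch below computes the textbook form of the metric on a 0 to 1 scale; the figures in the abstract appear to use a 0 to 100 scale, and the exact gain definition used in the paper is not given there. The language codes and relevance values in the usage note are made up for illustration.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: gains discounted by log2 of the position."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))

def ndcg(predicted_order, true_relevance):
    """NDCG of a predicted ranking, given the true relevance of each item."""
    gains = [true_relevance[item] for item in predicted_order]
    ideal = sorted(true_relevance.values(), reverse=True)[:len(gains)]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

For example, with hypothetical transfer scores `{"de": 3, "fr": 2, "zh": 1}`, the ranking `["de", "fr", "zh"]` is ideal, and any other order scores lower.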

[74] FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents

Imene Kerboua,Sahar Omidi Shayegan,Megh Thakkar,Xing Han Lù,Léo Boisvert,Massimo Caccia,Jérémy Espinas,Alexandre Aussem,Véronique Eglin,Alexandre Lacoste

Main category: cs.CL

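TL;DR: FocusAgent uses a lightweight LLM retriever to extract the most task-relevant lines from the accessibility tree, cutting observations by more than 50% while preserving performance and improving robustness to prompt-injection attacks.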

Details

Motivation: Existing pruning strategies for web observations tend to drop key information or keep irrelevant content on very long pages, and processing full pages causes context overflow, high compute cost, and security risks such as prompt injection. Method: Proposes FocusAgent, which uses a lightweight LLM retriever to extract the most relevant lines of text from the accessibility tree (AxTree) based on the task goal, filtering out noise and irrelevant content for efficient reasoning and better security. Result: On the WorkArena and WebArena benchmarks, FocusAgent matches strong baselines while reducing observation size by over 50%; a variant significantly lowers the success rate of prompt-injection attacks while keeping task success in attack-free settings. Conclusion: Goal-driven LLM retrieval is a practical and robust strategy for building web agents that are efficient, effective, and secure. Abstract: Web agents powered by large language models (LLMs) must process lengthy web page observations to complete user goals; these pages often exceed tens of thousands of tokens. This saturates context limits and increases computational cost; moreover, processing full pages exposes agents to security risks such as prompt injection. Existing pruning strategies either discard relevant content or retain irrelevant context, leading to suboptimal action prediction. We introduce FocusAgent, a simple yet effective approach that leverages a lightweight LLM retriever to extract the most relevant lines from accessibility tree (AxTree) observations, guided by task goals. By pruning noisy and irrelevant content, FocusAgent enables efficient reasoning while reducing vulnerability to injection attacks. Experiments on WorkArena and WebArena benchmarks show that FocusAgent matches the performance of strong baselines, while reducing observation size by over 50%. Furthermore, a variant of FocusAgent significantly reduces the success rate of prompt-injection attacks, including banner and pop-up attacks, while maintaining task success performance in attack-free settings. Our results highlight that targeted LLM-based retrieval is a practical and robust strategy for building web agents that are efficient, effective, and secure.
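
The line-level pruning described above can be sketched as follows. FocusAgent asks a lightweight LLM which lines matter; here a simple keyword-overlap score stands in for that retriever so the example runs on its own. The function name `prune_axtree`, the `keep_ratio` parameter, and the AxTree line format in the usage note are all hypothetical.

```python
import re

def _words(text):
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def prune_axtree(axtree_text, goal, keep_ratio=0.5):
    """Keep the lines most relevant to the goal, preserving their original order.

    Keyword overlap is a stand-in for the LLM retriever used in the paper.
    """
    lines = axtree_text.splitlines()
    goal_words = _words(goal)
    scores = [len(goal_words & _words(line)) for line in lines]
    n_keep = max(1, int(len(lines) * keep_ratio))
    # Rank line indices by score (ties keep document order), then restore order.
    ranked = sorted(range(len(lines)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return "\n".join(lines[i] for i in kept)
```

Given a four-line tree containing a "Submit order" button, an advertisement link, an "Order notes" textbox, and a banner image, the goal "submit order" with `keep_ratio=0.5` keeps only the button and the textbox.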

[75] Cache-to-Cache: Direct Semantic Communication Between Large Language Models

Tianyu Fu,Zihan Min,Hanling Zhang,Jichao Yan,Guohao Dai,Wanli Ouyang,Yu Wang

Main category: cs.CL

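TL;DR: Proposes Cache-to-Cache (C2C), a new paradigm in which LLMs communicate by passing the deep semantic information in their KV-Cache directly, avoiding the semantic loss and generation latency of text-based communication.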

Details

Motivation: Existing LLM systems communicate through text, which loses semantic information and adds generation latency, limiting further gains in performance and efficiency. Method: C2C uses a neural network to project the source model's KV-Cache and fuse it into the target model's, and adds a learnable gating mechanism that selects the layers that benefit, enabling direct semantic transfer between models. Result: Experiments show that C2C improves average accuracy by 8.5-10.5% over individual models and by 3.0-5.0% over text communication, with an average 2.0x latency speedup. Conclusion: C2C is an effective new paradigm for multi-LLM collaboration that improves performance and efficiency through semantic communication beyond text. Abstract: Multi-LLM systems harness the complementary strengths of diverse Large Language Models, achieving performance and efficiency gains unattainable by a single model. In existing designs, LLMs communicate through text, forcing internal representations to be transformed into output token sequences. This process both loses rich semantic information and incurs token-by-token generation latency. Motivated by these limitations, we ask: Can LLMs communicate beyond text? Oracle experiments show that enriching the KV-Cache semantics can improve response quality without increasing cache size, supporting KV-Cache as an effective medium for inter-model communication. Thus, we propose Cache-to-Cache (C2C), a new paradigm for direct semantic communication between LLMs. C2C uses a neural network to project and fuse the source model's KV-cache with that of the target model to enable direct semantic transfer. A learnable gating mechanism selects the target layers that benefit from cache communication. Compared with text communication, C2C utilizes the deep, specialized semantics from both models, while avoiding explicit intermediate text generation. Experiments show that C2C achieves 8.5-10.5% higher average accuracy than individual models. It further outperforms the text communication paradigm by approximately 3.0-5.0%, while delivering an average 2.0x speedup in latency. Our code is available at https://github.com/thu-nics/C2C.
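
The project-then-fuse step with a per-layer gate might look roughly like the sketch below. It replaces the paper's neural fuser with a single linear projection and a scalar gate, works on one cache vector instead of full key and value tensors, and uses hypothetical names (`project`, `fuse_layer`); the real implementation is in the linked repository.

```python
def project(source_vec, weight):
    """Map a source-model cache vector into the target model's dimension.

    `weight` is a list of rows, one per target dimension (a linear stand-in
    for the learned projection network).
    """
    return [sum(w * x for w, x in zip(row, source_vec)) for row in weight]

def fuse_layer(target_vec, source_vec, weight, gate):
    """Add the projected source cache to the target cache, scaled by the gate.

    A gate of 0.0 leaves the layer untouched; a gate of 1.0 fully applies
    the transferred information.
    """
    projected = project(source_vec, weight)
    return [t + gate * p for t, p in zip(target_vec, projected)]
```

In a trained system the gate would be learned per target layer, so that layers which do not benefit from the source cache stay close to their original values.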

[76] Self-Anchor: Large Language Model Reasoning via Step-by-step Attention Alignment

Hongxiang Zhang,Yuan Tian,Tianyi Zhang

Main category: cs.CL

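TL;DR: Proposes Self-Anchor, a method that uses the inherent structure of the reasoning process to steer an LLM's attention and thereby improve performance on complex reasoning tasks.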

Details

Motivation: With existing prompting methods, as reasoning chains grow longer, key intermediate steps and the original prompt get buried in the context, receive too little attention, and lead to errors. Method: Self-Anchor decomposes the reasoning trajectory into a structured plan and automatically aligns the model's attention to the most relevant reasoning steps, so the model stays focused throughout generation. Result: Experiments show that Self-Anchor outperforms state-of-the-art prompting methods on six benchmarks and significantly narrows the gap between non-reasoning models and specialized reasoning models. Conclusion: Self-Anchor has the potential to let most LLMs handle complex reasoning tasks without retraining. Abstract: To solve complex reasoning tasks for Large Language Models (LLMs), prompting-based methods offer a lightweight alternative to fine-tuning and reinforcement learning. However, as reasoning chains extend, critical intermediate steps and the original prompt will be buried in the context, receiving insufficient attention and leading to errors. In this paper, we propose Self-Anchor, a novel pipeline that leverages the inherent structure of reasoning to steer LLM attention. Self-Anchor decomposes reasoning trajectories into structured plans and automatically aligns the model's attention to the most relevant inference steps, allowing the model to maintain focus throughout generation. Our experiment shows that Self-Anchor outperforms SOTA prompting methods across six benchmarks. Notably, Self-Anchor significantly reduces the performance gap between "non-reasoning" models and specialized reasoning models, with the potential to enable most LLMs to tackle complex reasoning tasks without retraining.

[77] Reward Models are Metrics in a Trench Coat

Sebastian Gehrmann

Main category: cs.CL

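TL;DR: Examines the relationship between reward models and evaluation metrics in light of the rise of reinforcement learning in LLM post-training. The two fields do similar work but are studied separately, which leads to redundant terminology and repeated pitfalls. The paper shows that metrics beat reward models on some tasks and argues for closer collaboration to tackle shared challenges such as spurious correlations, reward hacking, data quality, and meta-evaluation.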

Details

Motivation: Reward models and evaluation metrics perform similar tasks in assessing the quality of AI model outputs, yet the two research areas are largely independent, with inconsistent terminology and repeated mistakes, so cross-field collaboration is needed to improve research efficiency and model performance. Method: Uses comparative analysis and a literature survey to map the state of research on reward models and metrics, identify shared challenges, compare the two on specific tasks, and propose research directions that bring them together. Result: On some tasks evaluation metrics outperform conventional reward models; the paper also identifies several topics that collaboration could improve, including preference elicitation, avoidance of spurious correlations and reward hacking, and calibration-aware meta-evaluation. Conclusion: Closer exchange between the reward-model and metrics communities helps overcome their shared challenges and advances more reliable, interpretable, and robust AI evaluation. Abstract: The emergence of reinforcement learning in post-training of large language models has sparked significant interest in reward models. Reward models assess the quality of sampled model outputs to generate training signals. This task is also performed by evaluation metrics that monitor the performance of an AI model. We find that the two research areas are mostly separate, leading to redundant terminology and repeated pitfalls. Common challenges include susceptibility to spurious correlations, impact on downstream reward hacking, methods to improve data quality, and approaches to meta-evaluation. Our position paper argues that a closer collaboration between the fields can help overcome these issues. To that end, we show how metrics outperform reward models on specific tasks and provide an extensive survey of the two areas. Grounded in this survey, we point to multiple research topics in which closer alignment can improve reward models and metrics in areas such as preference elicitation methods, avoidance of spurious correlations and reward hacking, and calibration-aware meta-evaluation.

cs.CV

[78] Exploring OCR-augmented Generation for Bilingual VQA

JoonHo Lee,Sunho Park

Main category: cs.CV

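TL;DR: Studies OCR-augmented generation with vision-language models (VLMs), presents KLOCR, a strong bilingual OCR baseline, builds the Korean VQA dataset KOCRBench, and shows that OCR text significantly improves multilingual VQA performance.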

Details

Motivation: Advancing research on combining VLMs with OCR in multilingual settings, especially Korean and English, calls for stronger OCR augmentation and a dedicated benchmark dataset. Method: Trains and releases KLOCR, a model trained on 100M instances to give VLMs OCR ability; builds KOCRBench as a new benchmark for Korean VQA; and analyzes the effect of different prompting methods. Result: Extensive experiments show that OCR-extracted text significantly improves the performance of open-source and commercial VLMs on bilingual VQA. Conclusion: OCR-augmented generation matters for bilingual VQA, and this work supplies useful tools, data, and insights for multilingual OCR-VLM systems. Abstract: We investigate OCR-augmented generation with Vision Language Models (VLMs), exploring tasks in Korean and English toward multilingualism. To support research in this domain, we train and release KLOCR, a strong bilingual OCR baseline trained on 100M instances to augment VLMs with OCR ability. To complement existing VQA benchmarks, we curate KOCRBench for Korean VQA, and analyze different prompting methods. Extensive experiments show that OCR-extracted text significantly boosts performance across open source and commercial models. Our work offers new insights into OCR-augmented generation for bilingual VQA. Model, code, and data are available at https://github.com/JHLee0513/KLOCR.

Derek Shi,Ruben Glatt,Christine Klymko,Shubham Mohole,Hongjun Choi,Shashank Kushwaha,Sam Sakla,Felipe Leno da Silva

Main category: cs.CV

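TL;DR: Proposes Oracle-RLAIF, a framework that replaces the trained reward model with a general Oracle ranker and introduces a rank-based loss, GRPO_rank, for more efficient and flexible reinforcement learning from AI feedback when fine-tuning video-language models.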

Details

Motivation: As video-language models grow, fine-tuning with human feedback becomes costly, and existing AI-feedback methods depend on specialized reward models that are expensive to train and restrictive, so a cheaper and more flexible fine-tuning framework is needed. Method: Proposes Oracle-RLAIF, which uses a general Oracle ranker to rank model responses instead of scoring them, and designs the GRPO_rank loss to optimize the policy from ranking information. Result: On multiple video-comprehension benchmarks, Oracle-RLAIF outperforms existing state-of-the-art fine-tuning methods for video-language models. Conclusion: By ranking instead of scoring, Oracle-RLAIF provides a more flexible and data-efficient reinforcement learning framework for aligning large multi-modal video models. Abstract: Recent advances in large video-language models (VLMs) rely on extensive fine-tuning techniques that strengthen alignment between textual and visual comprehension. Leading pipelines typically pair supervised fine-tuning (SFT) with reinforcement learning from preference data to enhance video comprehension. However, as VLMs scale in parameter size, so does the cost of gathering enough human feedback. To make fine-tuning more cost-effective, recent frameworks explore reinforcement learning with AI feedback (RLAIF), which replaces human preference with AI as a judge. Current RLAIF frameworks rely on a specialized reward model trained with video narratives to create calibrated scalar rewards -- an expensive and restrictive pipeline. We propose Oracle-RLAIF, a novel framework that replaces the trained reward model with a more general Oracle ranker which acts as a drop-in model ranking candidate model responses rather than scoring them. Alongside Oracle-RLAIF, we introduce $GRPO_{rank}$, a novel rank-based loss function based on Group Relative Policy Optimization (GRPO) that directly optimizes ordinal feedback with rank-aware advantages. Empirically, we demonstrate that Oracle-RLAIF consistently outperforms leading VLMs using existing fine-tuning methods when evaluated across various video comprehension benchmarks. Oracle-RLAIF paves the path to creating flexible and data-efficient frameworks for aligning large multi-modal video models with reinforcement learning from rank rather than score.
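
One way to turn a ranking of a group of responses into GRPO-style advantages is sketched below: ranks are mapped to scores and then normalized within the group, as standard GRPO does with scalar rewards. The abstract does not give the actual $GRPO_{rank}$ formula, so the linear rank-to-score mapping and the name `rank_advantages` are assumptions for illustration only.

```python
import statistics

def rank_advantages(ranks):
    """Group-normalized advantages from ordinal feedback (rank 1 = best).

    Assumed mapping: score = group_size - rank, then (score - mean) / std,
    mirroring GRPO's group-relative normalization. Not the paper's formula.
    """
    n = len(ranks)
    scores = [float(n - r) for r in ranks]
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    if std == 0:
        # A single response, or all tied: no learning signal.
        return [0.0] * n
    return [(s - mean) / std for s in scores]
```

Because only the ordering matters, the ranker never has to produce calibrated scalar rewards, which is the restriction the abstract attributes to trained reward models.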

[80] PhysHMR: Learning Humanoid Control Policies from Vision for Physically Plausible Human Motion Reconstruction

Qiao Feng,Yiming Huang,Yufu Wang,Jiatao Gu,Lingjie Liu

Main category: cs.CV

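TL;DR: Proposes PhysHMR, a unified framework for reconstructing physically plausible human motion from monocular video by directly learning a vision-to-action policy in a physics simulator, producing motion that is both visually aligned and physically plausible.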

Details

Motivation: Existing methods are mainly kinematics-based and lack physical constraints, which often yields unrealistic results, while two-stage methods accumulate errors that limit reconstruction quality. Method: PhysHMR uses a pixel-as-ray strategy that lifts 2D keypoints into 3D spatial rays and transforms them into global space, feeds them with local visual features from a pretrained encoder into a policy network, distills knowledge from a mocap-trained expert, and then fine-tunes with physically motivated reinforcement learning rewards. Result: Experiments show that PhysHMR generates high-fidelity, physically plausible motion across diverse scenarios and outperforms prior methods in both visual accuracy and physical realism. Conclusion: Through unified vision-to-action policy learning, PhysHMR combines global pose guidance with local visual information and uses distillation to improve sample efficiency, achieving high-quality physically plausible motion reconstruction. Abstract: Reconstructing physically plausible human motion from monocular videos remains a challenging problem in computer vision and graphics. Existing methods primarily focus on kinematics-based pose estimation, often leading to unrealistic results due to the lack of physical constraints. To address such artifacts, prior methods have typically relied on physics-based post-processing following the initial kinematics-based motion estimation. However, this two-stage design introduces error accumulation, ultimately limiting the overall reconstruction quality. In this paper, we present PhysHMR, a unified framework that directly learns a visual-to-action policy for humanoid control in a physics-based simulator, enabling motion reconstruction that is both physically grounded and visually aligned with the input video. A key component of our approach is the pixel-as-ray strategy, which lifts 2D keypoints into 3D spatial rays and transforms them into global space. These rays are incorporated as policy inputs, providing robust global pose guidance without depending on noisy 3D root predictions. This soft global grounding, combined with local visual features from a pretrained encoder, allows the policy to reason over both detailed pose and global positioning. To overcome the sample inefficiency of reinforcement learning, we further introduce a distillation scheme that transfers motion knowledge from a mocap-trained expert to the vision-conditioned policy, which is then refined using physically motivated reinforcement learning rewards. Extensive experiments demonstrate that PhysHMR produces high-fidelity, physically plausible motion across diverse scenarios, outperforming prior approaches in both visual accuracy and physical realism.

[81] Unlocking the power of partnership: How humans and machines can work together to improve face recognition

P. Jonathon Phillips,Geraldine Jeckeln,Carina A. Hahn,Amy N. Yates,Peter C. Fontana,Alice J. O'Toole

Main category: cs.CV

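TL;DR: Studies how humans and machines collaborate on face-recognition decisions, proposes the Proximal Accuracy Rule (PAR), and identifies a critical zone for intelligent human-machine fusion. Fusing the machine with suitably selected humans significantly improves overall accuracy, beating the machine alone and all-human teams.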

Details

Motivation: Because of individual differences between humans and machines in face recognition, collaboration may raise or lower accuracy, so the conditions under which human-machine collaboration improves accuracy need to be established. Method: Using face-identification data from experts and non-experts, analyzes human-human and human-machine collaboration, proposes and validates the Proximal Accuracy Rule (PAR), and uses graph theory to find the best human pairings, implementing intelligent human-machine fusion. Result: Collaboration gains are larger when collaborators' baseline accuracies are close; there is a large critical fusion zone in which human-machine fusion beats the machine alone and beats simply fusing all judgments; intelligent fusion is more accurate than the machine alone and limits the impact of low-performing humans on the system. Conclusion: Humans and machines each have a role in face recognition, and PAR-based intelligent fusion yields higher and more stable system accuracy, giving an evidence base for the sensible use of AI in face identification. Abstract: Human review of consequential decisions by face recognition algorithms creates a "collaborative" human-machine system. Individual differences between people and machines, however, affect whether collaboration improves or degrades accuracy in any given case. We establish the circumstances under which combining human and machine face identification decisions improves accuracy. Using data from expert and non-expert face identifiers, we examined the benefits of human-human and human-machine collaborations. The benefits of collaboration increased as the difference in baseline accuracy between collaborators decreased, following the Proximal Accuracy Rule (PAR). This rule predicted collaborative (fusion) benefit across a wide range of baseline abilities, from people with no training to those with extensive training. Using the PAR, we established a critical fusion zone, where humans are less accurate than the machine, but fusing the two improves system accuracy. This zone was surprisingly large. We implemented "intelligent human-machine fusion" by selecting people with the potential to increase the accuracy of a high-performing machine. Intelligent fusion was more accurate than the machine operating alone and more accurate than combining all human and machine judgments. The highest system-wide accuracy achievable with human-only partnerships was found by graph theory. This fully human system approximated the average performance achieved by intelligent human-machine collaboration. However, intelligent human-machine collaboration more effectively minimized the impact of low-performing humans on system-wide accuracy. The results demonstrate a meaningful role for both humans and machines in assuring accurate face identification. This study offers an evidence-based road map for the intelligent use of AI in face identification.

[82] How Confident are Video Models? Empowering Video Models to Express their Uncertainty

Zhiting Mei,Ola Shorinwa,Anirudha Majumdar

Main category: cs.CV

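TL;DR: Presents the first uncertainty quantification (UQ) framework for generative video models: a calibration metric based on robust rank-correlation estimation, a black-box UQ method called S-QUBED that decomposes uncertainty into aleatoric and epistemic parts, and a UQ dataset for benchmarking. Experiments show that S-QUBED gives calibrated uncertainty estimates that are negatively correlated with task accuracy.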

Details

Motivation: Generative video models are widely used but hallucinate. Uncertainty quantification has been studied for LLMs, but no method exists for video models, which raises safety concerns, so a UQ approach for video generation is needed. Method: The framework comprises (1) a calibration metric based on robust rank-correlation estimation; (2) S-QUBED, which conditions generation in the latent space and uses latent modeling to decompose predictive uncertainty into aleatoric and epistemic components; and (3) a UQ dataset to support benchmarking. Result: Experiments on several benchmark video datasets show that S-QUBED produces well-calibrated total uncertainty estimates that are negatively correlated with task accuracy and that it effectively separates the aleatoric and epistemic components. Conclusion: This is the first UQ study for generative video models; S-QUBED and the UQ dataset provide tools and an evaluation basis for making video generation more reliable and safe. Abstract: Generative video models demonstrate impressive text-to-video capabilities, spurring widespread adoption in many real-world applications. However, like large language models (LLMs), video generation models tend to hallucinate, producing plausible videos even when they are factually wrong. Although uncertainty quantification (UQ) of LLMs has been extensively studied in prior work, no UQ method for video models exists, raising critical safety concerns. To our knowledge, this paper represents the first work towards quantifying the uncertainty of video models. We present a framework for uncertainty quantification of generative video models, consisting of: (i) a metric for evaluating the calibration of video models based on robust rank correlation estimation with no stringent modeling assumptions; (ii) a black-box UQ method for video models (termed S-QUBED), which leverages latent modeling to rigorously decompose predictive uncertainty into its aleatoric and epistemic components; and (iii) a UQ dataset to facilitate benchmarking calibration in video models. By conditioning the generation task in the latent space, we disentangle uncertainty arising due to vague task specifications from that arising from lack of knowledge. Through extensive experiments on benchmark video datasets, we demonstrate that S-QUBED computes calibrated total uncertainty estimates that are negatively correlated with the task accuracy and effectively computes the aleatoric and epistemic constituents.
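
The calibration check described above asks whether higher uncertainty goes with lower accuracy in rank terms. A minimal sketch using Kendall's tau is shown below; the abstract says only "robust rank correlation estimation", so the choice of Kendall's tau (the simple tau-a form, without tie correction) is an assumption, and the numbers in the usage note are invented.

```python
def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant pairs) / total pairs.

    Assumes at least two paired observations.
    """
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

With per-prompt uncertainties `[0.1, 0.4, 0.7, 0.9]` and accuracies `[0.95, 0.8, 0.6, 0.3]`, every pair is discordant, which is the strongly negative relationship a well-calibrated estimator should show.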

[83] PEO: Training-Free Aesthetic Quality Enhancement in Pre-Trained Text-to-Image Diffusion Models with Prompt Embedding Optimization

Hovhannes Margaryan,Bo Wan,Tinne Tuytelaars

Main category: cs.CV

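TL;DR: Proposes Prompt Embedding Optimization (PEO), a method that raises the aesthetic quality of images from pretrained text-to-image diffusion models given a simple prompt. It is training-free and backbone-independent, and works by optimizing the prompt's text embedding.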

Details

Motivation: Existing text-to-image models often produce images of limited aesthetic quality from simple, uncurated prompts, so a method is needed that improves image aesthetics automatically without retraining the model. Method: PEO takes a pretrained diffusion model and optimizes the text embedding of the input prompt with a three-part objective: improve image aesthetic quality, stay consistent with the text embedding, and, through a prompt preservation term, avoid drifting from the original prompt. It needs no training and applies to different backbones. Result: In quantitative and qualitative evaluations, PEO exceeds or matches state-of-the-art text-to-image and prompt adaptation methods, with strong aesthetic quality and text alignment. Conclusion: PEO is an effective, general, training-free prompt optimization method that markedly improves the visual quality of images generated from simple prompts without modifying the pretrained diffusion model. Abstract: This paper introduces a novel approach to aesthetic quality improvement in pre-trained text-to-image diffusion models when given a simple prompt. Our method, dubbed Prompt Embedding Optimization (PEO), leverages a pre-trained text-to-image diffusion model as a backbone and optimizes the text embedding of a given simple and uncurated prompt to enhance the visual quality of the generated image. We achieve this by a tripartite objective function that improves the aesthetic fidelity of the generated image, ensures adherence to the optimized text embedding, and keeps divergence from the initial prompt minimal. The latter is accomplished through a prompt preservation term. Additionally, PEO is training-free and backbone-independent. Quantitative and qualitative evaluations confirm the effectiveness of the proposed method, exceeding or matching the performance of state-of-the-art text-to-image and prompt adaptation methods.

[84] Ego-Exo 3D Hand Tracking in the Wild with a Mobile Multi-Camera Rig

Patrick Rim,Kun He,Kevin Harris,Braden Copple,Shangchen Han,Sizhe An,Ivan Shugurov,Tomas Hodan,He Wen,Xu Xie

Main category: cs.CV

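TL;DR: Presents a new marker-less multi-camera system for accurately capturing 3D hand and object interactions in genuinely in-the-wild conditions, combining a back-mounted eight-camera rig with a Meta Quest 3 headset to produce high-precision 3D hand pose annotations.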

Details

Motivation: Most existing datasets are captured in controlled lab settings and lack environmental diversity, which limits how well models generalize to real scenes, so a solution for accurate 3D hand tracking in unconstrained real environments is needed. Method: Designs an ego-exo capture system that combines a lightweight back-mounted multi-camera rig (eight exocentric cameras) with a user-worn Meta Quest 3 headset (two egocentric views), and develops a matching ego-exo tracking pipeline to generate accurate 3D hand pose ground truth. Result: Builds a dataset with synchronized multi-view images and precise 3D hand pose annotations, showing that the system markedly increases environmental realism while keeping 3D annotation accuracy high. Conclusion: The approach reduces the trade-off between environmental realism and 3D annotation accuracy, providing a reliable technical path and data for real-world 3D hand tracking research. Abstract: Accurate 3D tracking of hands and their interactions with the world in unconstrained settings remains a significant challenge for egocentric computer vision. With few exceptions, existing datasets are predominantly captured in controlled lab setups, limiting environmental diversity and model generalization. To address this, we introduce a novel marker-less multi-camera system designed to capture precise 3D hands and objects, which allows for nearly unconstrained mobility in genuinely in-the-wild conditions. We combine a lightweight, back-mounted capture rig with eight exocentric cameras, and a user-worn Meta Quest 3 headset, which contributes two egocentric views. We design an ego-exo tracking pipeline to generate accurate 3D hand pose ground truth from this system, and rigorously evaluate its quality. By collecting an annotated dataset featuring synchronized multi-view images and precise 3D hand poses, we demonstrate the capability of our approach to significantly reduce the trade-off between environmental realism and 3D annotation accuracy.

[85] Input-Aware Sparse Attention for Real-Time Co-Speech Video Generation

Beijia Lu,Ziyi Chen,Jing Xiao,Jun-Yan Zhu

Main category: cs.CV

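TL;DR: Proposes a video distillation method conditioned on the input human pose. With input-aware sparse attention and an input-aware distillation loss, it compresses a many-step diffusion model into a few-step student that generates co-speech video in real time while keeping quality high.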

Details

Motivation: Existing diffusion models are slow at generating co-speech video because of many denoising steps and costly attention, which rules out real-time use, so an efficient distillation method that keeps quality high is needed. Method: Proposes a new video distillation method that uses input human pose keypoints to guide attention (input-aware sparse attention) and incorporates pose information into the loss (input-aware distillation loss), improving lip synchronization and gesture realism while cutting redundant computation. Result: The method reaches real-time performance and outperforms recent audio-driven and input-driven methods in visual quality and motion coherence; experiments confirm the effectiveness of the proposed attention and loss designs. Conclusion: By using input human pose to optimize both attention and distillation, the method achieves high-quality, low-latency distilled diffusion video generation, advancing the use of diffusion models for real-time virtual agents and video creation. Abstract: Diffusion models can synthesize realistic co-speech video from audio for various applications, such as video creation and virtual agents. However, existing diffusion-based methods are slow due to numerous denoising steps and costly attention mechanisms, preventing real-time deployment. In this work, we distill a many-step diffusion video model into a few-step student model. Unfortunately, directly applying recent diffusion distillation methods degrades video quality and falls short of real-time performance. To address these issues, our new video distillation method leverages input human pose conditioning for both attention and loss functions. We first propose using accurate correspondence between input human pose keypoints to guide attention to relevant regions, such as the speaker's face, hands, and upper body. This input-aware sparse attention reduces redundant computations and strengthens temporal correspondences of body parts, improving inference efficiency and motion coherence. To further enhance visual quality, we introduce an input-aware distillation loss that improves lip synchronization and hand motion realism. By integrating our input-aware sparse attention and distillation loss, our method achieves real-time performance with improved visual quality compared to recent audio-driven and input-driven methods. We also conduct extensive experiments showing the effectiveness of our algorithmic design choices.

[86] Deep Generative Continual Learning using Functional LoRA: FunLoRA

Victor Enescu,Hichem Sahbi

Main category: cs.CV

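TL;DR: Proposes FunLoRA, a new conditioning mechanism based on low-rank adaptation (LoRA) for generative models in continual learning. Training only on current-task data, it avoids catastrophic forgetting and beats existing diffusion-based methods in performance, memory, and sampling efficiency.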

Details

Motivation: Deep generative models are widely used in text and vision applications, but catastrophic forgetting in continual learning severely limits them. Existing methods rely on replaying synthetic data, which lengthens training and degrades performance. Method: Proposes FunLoRA, a functionally enhanced low-rank adaptation mechanism that uses only rank-1 matrices and raises their expressiveness through reparametrization, giving a dynamic conditioning so the model needs training only on current-task data. Result: Experiments on flow-matching models trained from scratch show that FunLoRA exceeds state-of-the-art diffusion-based methods in classification accuracy while greatly reducing memory cost and sampling time. Conclusion: FunLoRA is an efficient, scalable parameter-efficient fine-tuning method that mitigates catastrophic forgetting in continual learning of generative models with lower resource use and higher performance. Abstract: Continual adaptation of deep generative models holds tremendous potential and critical importance, given their rapid and expanding usage in text and vision based applications. Incremental training, however, remains highly challenging due to the catastrophic forgetting phenomenon, which makes it difficult for neural networks to effectively incorporate new knowledge. A common strategy consists in retraining the generative model on its own synthetic data in order to mitigate forgetting. Yet, such an approach faces two major limitations: (i) the continually increasing training time eventually becomes intractable, and (ii) reliance on synthetic data inevitably leads to long-term performance degradation, since synthetic samples lack the richness of real training data. In this paper, we attenuate these issues by designing a novel and more expressive conditioning mechanism for generative models based on low rank adaptation (LoRA), that exclusively employs rank 1 matrices, whose reparametrized matrix rank is functionally increased using carefully selected functions -- and dubbed functional LoRA: FunLoRA. Using this dynamic conditioning, the generative model is guaranteed to avoid catastrophic forgetting and needs only to be trained on data from the current task. Extensive experiments using flow-matching based models trained from scratch, showcase that our proposed parameter-efficient fine-tuning (PEFT) method surpasses prior state-of-the-art results based on diffusion models, reaching higher classification accuracy scores, while only requiring a fraction of the memory cost and sampling time.
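
The idea that a function can raise the rank of a rank-1 matrix can be checked numerically. The sketch below builds a rank-1 outer product and applies an elementwise sine to it; the abstract does not say which functions FunLoRA uses, so the sine, the example vectors, and the helper `matrix_rank` are illustrative assumptions.

```python
import math

def matrix_rank(rows, tol=1e-9):
    """Rank via Gaussian elimination with partial pivoting."""
    m = [list(r) for r in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank

# Two vectors parametrize a rank-1 update, as in a rank-1 LoRA factorization.
u = [1.0, 2.0, 3.0]
v = [0.5, 1.0, 1.5]
outer = [[ui * vj for vj in v] for ui in u]          # rank 1 by construction
warped = [[math.sin(x) for x in row] for row in outer]  # same parameter count
```

Calling `matrix_rank(outer)` and `matrix_rank(warped)` shows the nonlinearity lifting the rank above 1 without adding parameters, which is the expressiveness gain the abstract describes.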

[87] Sequence-Preserving Dual-FoV Defense for Traffic Sign and Light Recognition in Autonomous Vehicles

Abhishek Joshi,Jahnavi Krishna Koda,Abhishek Phadke

Main category: cs.CV

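TL;DR: Proposes a dual field-of-view, sequence-preserving robustness framework for US traffic light and sign recognition, built on a multi-source dataset. It combines a three-layer defense stack (feature squeezing, defensive distillation, and entropy-based anomaly detection) with temporal voting, improving accuracy and safety in real-world scenes.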

Details

Motivation: Existing work does not jointly consider temporal continuity, multi-view sensing, and robustness to both digital and natural degradation, yet perception errors in traffic signal recognition directly affect the safety and navigation of autonomous vehicles. Method: Builds a multi-source dataset from aiMotive, Udacity, Waymo, and self-recorded video from Texas, temporally aligns mid- and long-term RGB image sequences for four operational design domains (highway, night, rainy, and urban), and proposes a three-layer defense framework with feature squeezing, defensive distillation, and entropy-based anomaly detection, plus temporal voting. Result: The unified defense stack reaches 79.8 mAP, lowers the attack success rate (ASR) to 18.2%, and reduces high-risk misclassification to 32%, outperforming YOLOv8, YOLOv9, and BEVFormer on accuracy, ASR, risk-weighted misclassification severity, and confidence stability. Conclusion: The framework is more robust to digital and natural perturbations, and its multi-view and temporal modeling improves the safety and reliability of traffic signal recognition for autonomous driving. Abstract: Traffic light and sign recognition are key for Autonomous Vehicles (AVs) because perception mistakes directly influence navigation and safety. In addition to digital adversarial attacks, models are vulnerable to existing perturbations (glare, rain, dirt, or graffiti), which could lead to dangerous misclassifications. The current work lacks consideration of temporal continuity, multistatic field-of-view (FoV) sensing, and robustness to both digital and natural degradation. This study proposes a dual FoV, sequence-preserving robustness framework for traffic lights and signs in the USA based on a multi-source dataset built on aiMotive, Udacity, Waymo, and self-recorded videos from the region of Texas. Mid and long-term sequences of RGB images are temporally aligned for four operational design domains (ODDs): highway, night, rainy, and urban. Over a series of experiments on a real-life application of anomaly detection, this study outlines a unified three-layer defense stack framework that incorporates feature squeezing, defensive distillation, and entropy-based anomaly detection, as well as sequence-wise temporal voting for further enhancement. The evaluation measures included accuracy, attack success rate (ASR), risk-weighted misclassification severity, and confidence stability. Physical transferability was confirmed using probes for recapture. The results showed that the Unified Defense Stack achieved 79.8 mAP and reduced the ASR to 18.2%, which is superior to YOLOv8, YOLOv9, and BEVFormer, while reducing the high-risk misclassification to 32%.
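
Two of the ingredients named above, entropy-based anomaly detection and sequence-wise temporal voting, can be combined in a few lines. In the sketch below, frames whose class distribution has high entropy are treated as suspect and excluded, and the remaining frames vote on the label. The threshold value, the function names, and the exact way the two steps interact are assumptions; the abstract does not specify them.

```python
import math
from collections import Counter

def entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def temporal_vote(frame_probs, labels, max_entropy=0.5):
    """Majority label over a frame sequence, skipping high-entropy frames.

    `max_entropy` is an illustrative threshold, not a value from the paper.
    Returns None if every frame is flagged as anomalous.
    """
    votes = Counter()
    for probs in frame_probs:
        if entropy(probs) > max_entropy:
            continue  # uncertain frame: possibly perturbed, do not trust it
        votes[labels[probs.index(max(probs))]] += 1
    return votes.most_common(1)[0][0] if votes else None
```

For a four-frame sequence with distributions `[0.9, 0.1]`, `[0.85, 0.15]`, `[0.5, 0.5]`, and `[0.1, 0.9]` over the labels `["red", "green"]`, the third frame is dropped and "red" wins two votes to one.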

[88] Smart-GRPO: Smartly Sampling Noise for Efficient RL of Flow-Matching Models

Benjamin Yu,Jackie Liu,Justin Cui

Main category: cs.CV

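TL;DR: Proposes Smart-GRPO, the first method to optimize noise perturbations for reinforcement learning in flow-matching models, using an iterative search strategy to improve reward optimization and image quality.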

Details

Motivation: The deterministic nature of flow-matching models makes them poorly suited to reinforcement learning, and existing methods that inject stochasticity are inefficient and unstable. Method: Uses an iterative search strategy that decodes candidate perturbations in latent space, scores them with a reward function, and refines the noise distribution toward higher-reward regions. Result: Experiments show that Smart-GRPO outperforms baseline methods in both reward optimization and visual quality. Conclusion: Smart-GRPO offers a practical path toward reinforcement learning in flow-matching frameworks, bridging the gap between efficient training and human-aligned generation. Abstract: Recent advancements in flow-matching have enabled high-quality text-to-image generation. However, the deterministic nature of flow-matching models makes them poorly suited for reinforcement learning, a key tool for improving image quality and human alignment. Prior work has introduced stochasticity by perturbing latents with random noise, but such perturbations are inefficient and unstable. We propose Smart-GRPO, the first method to optimize noise perturbations for reinforcement learning in flow-matching models. Smart-GRPO employs an iterative search strategy that decodes candidate perturbations, evaluates them with a reward function, and refines the noise distribution toward higher-reward regions. Experiments demonstrate that Smart-GRPO improves both reward optimization and visual quality compared to baseline methods. Our results suggest a practical path toward reinforcement learning in flow-matching frameworks, bridging the gap between efficient training and human-aligned generation.
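
The "sample candidates, score them, refine the noise distribution" loop described above resembles a cross-entropy-method search. The sketch below is a generic version of that idea on a 1-D toy problem, not the Smart-GRPO algorithm itself: the `reward` function, the identity "decoder", the elite fraction, and the standard-deviation floor are all assumptions made for illustration.

```python
import random

def reward(x):
    """Toy stand-in for an image reward model: peaks at x = 3."""
    return -(x - 3.0) ** 2

def refine_noise(mean=0.0, std=1.0, iters=10, samples=64, elite=16, seed=0):
    """Iteratively shift a 1-D Gaussian noise distribution toward high reward."""
    rng = random.Random(seed)
    for _ in range(iters):
        candidates = [rng.gauss(mean, std) for _ in range(samples)]
        # "Decode and score" each candidate; decoding is the identity here.
        candidates.sort(key=reward, reverse=True)
        top = candidates[:elite]
        # Refit the distribution to the best-scoring perturbations.
        mean = sum(top) / elite
        var = sum((c - mean) ** 2 for c in top) / elite
        std = max(var ** 0.5, 0.1)     # floor keeps some exploration alive
    return mean, std

final_mean, final_std = refine_noise()
```

In a real pipeline the candidates would be latent noise tensors, decoding would run the flow-matching sampler, and the reward would come from a learned preference or quality model.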

[89] FSFSplatter: Build Surface and Novel Views with Sparse-Views within 3min

Yibin Zhao,Yihan Pan,Jun Nan,Jianjun Yi

Main category: cs.CV

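TL;DR: This paper proposes FSFSplatter, a new method for fast Gaussian Splatting surface reconstruction from free sparse images. Through end-to-end dense initialization, camera parameter estimation, and geometry-enhanced optimization, it substantially improves reconstruction quality.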

Details

Motivation: Existing Gaussian Splatting reconstruction methods usually require dense, calibrated views and handle free sparse images poorly, producing weak surface reconstructions mainly because of limited view overlap and overfitting. Method: FSFSplatter encodes multi-view images with a large Transformer and generates a dense, geometrically consistent Gaussian scene initialization through a self-splitting Gaussian head; it combines contribution-based pruning with depth and multi-view feature supervision, and uses differentiable camera parameters during optimization to mitigate overfitting. Result: On the DTU and Replica datasets, FSFSplatter outperforms current state-of-the-art methods, achieving higher-quality surface reconstruction and faster optimization. Conclusion: FSFSplatter addresses the floater and overfitting problems of Gaussian Splatting reconstruction from free sparse images, delivering efficient, robust, and detailed surface reconstruction with broad application potential. Abstract: Gaussian Splatting has become a leading reconstruction technique, known for its high-quality novel view synthesis and detailed reconstruction. However, most existing methods require dense, calibrated views. Reconstructing from free sparse images often leads to poor surfaces due to limited overlap and overfitting. We introduce FSFSplatter, a new approach for fast surface reconstruction from free sparse images. Our method integrates end-to-end dense Gaussian initialization, camera parameter estimation, and geometry-enhanced scene optimization. Specifically, FSFSplatter employs a large Transformer to encode multi-view images and generates a dense and geometrically consistent Gaussian scene initialization via a self-splitting Gaussian head. It eliminates local floaters through contribution-based pruning and mitigates overfitting to limited views by leveraging depth and multi-view feature supervision with differentiable camera parameters during rapid optimization. FSFSplatter outperforms current state-of-the-art methods on the widely used DTU and Replica datasets.

[90] MoGIC: Boosting Motion Generation via Intention Understanding and Visual Context

Junyu Shi,Yong Sun,Zhiyuan Zhang,Lijiang Liu,Zhengjie Zhang,Yuxin He,Qiang Nie

Main category: cs.CV

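TL;DR: This paper proposes MoGIC, a unified multimodal motion synthesis framework that integrates intention modeling and visual priors. By jointly optimizing motion generation and intention prediction, it improves generation quality and controllability; the paper also introduces a new attention mechanism and the large-scale Mo440H dataset to validate the approach.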

Details

Motivation: Existing text-driven motion generation methods struggle to capture the causal logic behind behavior and the underlying human intentions, and their lack of visual grounding leaves spatiotemporal detail underspecified, limiting precision and personalization. Method: Proposes the MoGIC framework, which jointly optimizes multimodal-conditioned motion generation and intention prediction; introduces a mixture-of-attention mechanism with adaptive scope for effective local alignment between conditional tokens and motion subsequences; and builds the Mo440H benchmark containing 440 hours of data. Result: After finetuning, FID drops by 38.6% on HumanML3D and 34.6% on Mo440H; the model surpasses LLM-based methods on motion captioning and supports intention prediction and vision-conditioned generation. Conclusion: By integrating intention modeling with visual priors, MoGIC improves the controllability and semantic consistency of motion generation and advances both intention understanding and multimodal motion synthesis. Abstract: Existing text-driven motion generation methods often treat synthesis as a bidirectional mapping between language and motion, but remain limited in capturing the causal logic of action execution and the human intentions that drive behavior. The absence of visual grounding further restricts precision and personalization, as language alone cannot specify fine-grained spatiotemporal details. We propose MoGIC, a unified framework that integrates intention modeling and visual priors into multimodal motion synthesis. By jointly optimizing multimodal-conditioned motion generation and intention prediction, MoGIC uncovers latent human goals, leverages visual priors to enhance generation, and exhibits versatile multimodal generative capability. We further introduce a mixture-of-attention mechanism with adaptive scope to enable effective local alignment between conditional tokens and motion subsequences. To support this paradigm, we curate Mo440H, a 440-hour benchmark from 21 high-quality motion datasets. Experiments show that after finetuning, MoGIC reduces FID by 38.6% on HumanML3D and 34.6% on Mo440H, surpasses LLM-based methods in motion captioning with a lightweight text head, and further enables intention prediction and vision-conditioned generation, advancing controllable motion synthesis and intention understanding. The code is available at https://github.com/JunyuShi02/MoGIC

[91] From Tokens to Nodes: Semantic-Guided Motion Control for Dynamic 3D Gaussian Splatting

Jianing Chen,Zehao Li,Yujun Cai,Hao Jiang,Shuqin Gao,Honglong Zhao,Tianlu Mao,Yucheng Zhang

Main category: cs.CV

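TL;DR: Proposes a motion-adaptive dynamic 3D reconstruction framework that uses semantic and motion priors to optimize the distribution of control points, improving the quality and efficiency of reconstruction from monocular video.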

Details

Motivation: Existing sparse control methods allocate control points purely by geometry, which leaves static regions redundant and dynamic regions under-covered, so complex motion is modeled poorly. Method: Extracts semantic and motion priors from vision foundation models, establishes patch-token-node correspondences, and applies a motion-adaptive compression strategy that concentrates control points in dynamic regions; introduces a spline-based trajectory parameterization, initialized by 2D tracklets, to replace MLP deformation fields. Result: Experiments show clear gains over current state-of-the-art methods, with higher-quality and more efficient dynamic 3D reconstruction and a control point distribution that better matches motion complexity. Conclusion: The proposed motion-adaptive framework resolves the mismatch between control point allocation and motion complexity, improving the accuracy and stability of dynamic 3D reconstruction from monocular video. Abstract: Dynamic 3D reconstruction from monocular videos remains difficult due to the ambiguity of inferring 3D motion from limited views and the computational demands of modeling temporally varying scenes. While recent sparse control methods alleviate computation by reducing millions of Gaussians to thousands of control points, they suffer from a critical limitation: they allocate points purely by geometry, leading to static redundancy and dynamic insufficiency. We propose a motion-adaptive framework that aligns control density with motion complexity. Leveraging semantic and motion priors from vision foundation models, we establish patch-token-node correspondences and apply motion-adaptive compression to concentrate control points in dynamic regions while suppressing redundancy in static backgrounds. Our approach achieves flexible representational density adaptation through iterative voxelization and motion tendency scoring, directly addressing the fundamental mismatch between control point allocation and motion complexity. To capture temporal evolution, we introduce spline-based trajectory parameterization initialized by 2D tracklets, replacing MLP-based deformation fields to achieve smoother motion representation and more stable optimization. Extensive experiments demonstrate significant improvements in reconstruction quality and efficiency over existing state-of-the-art methods.

[92] Net2Net: When Un-trained Meets Pre-trained Networks for Robust Real-World Denoising

Weimin Yuan,Cai Meng

Main category: cs.CV

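TL;DR: This paper proposes Net2Net, a method that combines the strengths of untrained and pre-trained networks, pairing DIP with DRUNet in a regularization-by-denoising framework to remove real-world noise.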

Details

Motivation: Traditional denoising methods rely on handcrafted priors and perform poorly on complex real noise, while deep learning methods need large amounts of labeled data and generalize poorly. A denoising method that is both adaptive and strongly generalizing is therefore needed. Method: Combines the unsupervised DIP (deep image prior) with the supervised pre-trained model DRUNet through the regularization by denoising (RED) framework, using the untrained network to adapt to the noise characteristics of each input image and the pre-trained network to supply strong denoising capability. Result: Experiments on several benchmark datasets show that the method outperforms existing approaches on real-world noise removal, with better generalization and denoising performance especially when training data is limited. Conclusion: By fusing the strengths of untrained and pre-trained networks, Net2Net improves the performance and robustness of real-world image denoising and suggests a new direction for denoising with few or no labels. Abstract: Traditional denoising methods have largely relied on handcrafted priors; they often perform well in controlled environments but struggle to address the complexity and variability of real noise. In contrast, deep learning-based approaches have gained prominence for learning noise characteristics from large datasets, but these methods frequently require extensive labeled data and may not generalize effectively across diverse noise types and imaging conditions. In this paper, we present an innovative method, termed Net2Net, that combines the strengths of untrained and pre-trained networks to tackle the challenges of real-world noise removal. The innovation of Net2Net lies in its combination of unsupervised DIP and the supervised pre-trained model DRUNet through regularization by denoising (RED). The untrained network adapts to the unique noise characteristics of each input image without requiring labeled data, while the pre-trained network leverages learned representations from large-scale datasets to deliver robust denoising performance. This hybrid framework enhances generalization across varying noise patterns and improves performance, particularly in scenarios with limited training data. Extensive experiments on benchmark datasets demonstrate the superiority of our method for real-world noise removal.

[93] Retrv-R1: A Reasoning-Driven MLLM Framework for Universal and Efficient Multimodal Retrieval

Lanyun Zhu,Deyi Ji,Tianrun Chen,Haiyang Wu,Shiqi Wang

Main category: cs.CV

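TL;DR: Retrv-R1 is the first R1-style multimodal large language model designed specifically for universal multimodal retrieval. By introducing an information compression module and a new training paradigm, it improves reasoning efficiency and retrieval performance and achieves SOTA results.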

Details

Motivation: Inspired by DeepSeek-R1, the authors explore the potential of reinforcement learning to strengthen LLM reasoning in multimodal retrieval, but find that applying its methods directly leads to high computational cost and unstable training. Method: Proposes Retrv-R1, which includes an information compression module with a details inspection mechanism to reduce token consumption, and a staged training paradigm: the model is first activated with a tailored synthetic CoT dataset and then optimized by reinforcement learning with a curriculum reward. Result: Retrv-R1 achieves SOTA performance across multiple benchmarks and tasks, with high efficiency and strong generalization, addressing both the computational overhead and the training instability. Conclusion: Through its information compression and staged training strategy, Retrv-R1 successfully applies the R1-style reasoning framework to multimodal retrieval, improving accuracy, efficiency, and stability. Abstract: The success of DeepSeek-R1 demonstrates the immense potential of using reinforcement learning (RL) to enhance LLMs' reasoning capabilities. This paper introduces Retrv-R1, the first R1-style MLLM specifically designed for multimodal universal retrieval, achieving higher performance by employing step-by-step reasoning to produce more accurate retrieval results. We find that directly applying the methods of DeepSeek-R1 to retrieval tasks is not feasible, mainly due to (1) the high computational cost caused by the large token consumption required for multiple candidates with reasoning processes, and (2) the instability and suboptimal results when directly applying RL to train for retrieval tasks. To address these issues, Retrv-R1 introduces an information compression module with a details inspection mechanism, which enhances computational efficiency by reducing the number of tokens while ensuring that critical information for challenging candidates is preserved. Furthermore, a new training paradigm is proposed, including an activation stage using a retrieval-tailored synthetic CoT dataset for more effective optimization, followed by RL with a novel curriculum reward to improve both performance and efficiency. Incorporating these novel designs, Retrv-R1 achieves SOTA performance, high efficiency, and strong generalization ability, as demonstrated by experiments across multiple benchmarks and tasks.

[94] Bayesian Test-time Adaptation for Object Recognition and Detection with Vision-language Models

Lihua Zhou,Mao Ye,Shuaifeng Li,Nianxin Li,Jinlin Wu,Xiatian Zhu,Lei Deng,Hongbin Liu,Jiebo Luo,Zhen Lei

Main category: cs.CV

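TL;DR: Proposes BCA+, a training-free test-time adaptation framework that uses a dynamic cache and Bayesian inference to improve the robustness of vision-language models in both object recognition and detection.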

Details

Motivation: Existing test-time adaptation methods either depend on computationally expensive backpropagation or adapt only the likelihood while ignoring the prior, and their performance degrades under real-world distribution shifts. Method: Formulates adaptation as Bayesian inference and introduces a dynamic cache that stores and updates class embeddings, spatial scales, and adaptive priors derived from historical predictions, fusing likelihood and prior through uncertainty-guided prediction fusion. Result: Achieves state-of-the-art performance on multiple recognition and detection benchmarks while remaining training-free and efficient enough for real-time deployment. Conclusion: By jointly correcting semantic understanding and contextual confidence, BCA+ provides efficient, unified test-time adaptation for vision-language models. Abstract: Vision-language models (VLMs) such as CLIP and Grounding DINO have achieved remarkable success in object recognition and detection. However, their performance often degrades under real-world distribution shifts. Test-time adaptation (TTA) aims to mitigate this issue by adapting models during inference. Existing methods either rely on computationally expensive backpropagation, which hinders real-time deployment, or focus solely on likelihood adaptation, which overlooks the critical role of the prior. Our prior work, Bayesian Class Adaptation (BCA), addressed these shortcomings for object recognition by introducing a training-free framework that incorporates adaptive priors. Building upon this foundation, we now present Bayesian Class Adaptation plus (BCA+), a unified, training-free framework for TTA for both object recognition and detection. BCA+ introduces a dynamic cache that adaptively stores and updates class embeddings, spatial scales (for detection), and, crucially, adaptive class priors derived from historical predictions. We formulate adaptation as a Bayesian inference problem, where final predictions are generated by fusing the initial VLM output with a cache-based prediction. This cache-based prediction combines a dynamically updated likelihood (measuring feature and scale similarity) and a prior (reflecting the evolving class distribution). This dual-adaptation mechanism, coupled with uncertainty-guided fusion, enables BCA+ to correct both the model's semantic understanding and its contextual confidence. As a training-free method requiring no backpropagation, BCA+ is highly efficient. Extensive experiments demonstrate that BCA+ achieves state-of-the-art performance on both recognition and detection benchmarks.
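
The Bayesian fusion described above (a cache-based prediction proportional to likelihood times prior, then merged with the initial VLM output) can be sketched on plain probability vectors. This is only an illustration under stated assumptions: the `PriorCache` class, the soft-count prior update, and the `exp(-entropy)` fusion weights are hypothetical choices, and the abstract does not give the actual BCA+ update rules.

```python
import math

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

class PriorCache:
    """Running class prior built from past predictions (hypothetical helper)."""
    def __init__(self, num_classes):
        self.counts = [1.0] * num_classes      # uniform pseudo-counts

    def prior(self):
        return normalize(self.counts)

    def update(self, posterior):
        for k, p in enumerate(posterior):
            self.counts[k] += p                # soft count per class

def adapt(vlm_probs, likelihood, cache):
    """Fuse the initial VLM output with a cache-based Bayesian prediction."""
    # Bayes' rule: posterior is proportional to likelihood times prior.
    cache_pred = normalize([l * p for l, p in zip(likelihood, cache.prior())])
    # Assumed uncertainty-guided fusion: lower entropy earns more weight.
    w_vlm = math.exp(-entropy(vlm_probs))
    w_cache = math.exp(-entropy(cache_pred))
    fused = normalize([w_vlm * a + w_cache * b
                       for a, b in zip(vlm_probs, cache_pred)])
    cache.update(fused)                        # no backpropagation involved
    return fused

cache = PriorCache(num_classes=3)
for _ in range(20):                            # test stream dominated by class 0
    adapt([0.7, 0.2, 0.1], [0.6, 0.3, 0.1], cache)
# An ambiguous sample: VLM and likelihood cannot separate classes 0 and 1.
fused = adapt([0.4, 0.4, 0.2], [0.5, 0.5, 0.0001], cache)
```

The last call shows the role of the prior: the VLM output and the likelihood tie between the first two classes, and the accumulated class distribution breaks the tie.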

[95] Hierarchical Generalized Category Discovery for Brain Tumor Classification in Digital Pathology

Matthias Perkonigg,Patrick Rockenschaub,Georg Göbel,Adelheid Wöhrer

Main category: cs.CV

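TL;DR: Proposes Hierarchical Generalized Category Discovery for brain tumor classification (HGCD-BT), which combines contrastive learning with a semi-supervised hierarchical clustering loss and clearly outperforms existing methods at identifying unknown tumor types.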

Details

Motivation: Existing classification methods are limited to predefined classes and cannot recognize tumor types unseen during training; prior knowledge from labeled data needs to be combined with the generalization ability of unsupervised learning. Method: Proposes HGCD-BT, which integrates hierarchical clustering with contrastive learning and introduces a new semi-supervised hierarchical clustering loss that reflects the hierarchical structure of brain tumor taxonomies. Result: On the OpenSRH dataset, accuracy improves by 28% over current GCD methods; generalizability across imaging modalities is verified on whole-slide images from the Digital Brain Tumor Atlas. Conclusion: HGCD-BT identifies both known and unknown brain tumor types and generalizes across datasets and imaging modalities, making it suitable for intra-operative decision support. Abstract: Accurate brain tumor classification is critical for intra-operative decision making in neuro-oncological surgery. However, existing approaches are restricted to a fixed set of predefined classes and are therefore unable to capture patterns of tumor types not available during training. Unsupervised learning can extract general-purpose features, but it lacks the ability to incorporate prior knowledge from labelled data, and semi-supervised methods often assume that all potential classes are represented in the labelled data. Generalized Category Discovery (GCD) aims to bridge this gap by categorizing both known and unknown classes within unlabelled data. To reflect the hierarchical structure of brain tumor taxonomies, in this work, we introduce Hierarchical Generalized Category Discovery for Brain Tumor Classification (HGCD-BT), a novel approach that integrates hierarchical clustering with contrastive learning. Our method extends contrastive learning based GCD by incorporating a novel semi-supervised hierarchical clustering loss. We evaluate HGCD-BT on OpenSRH, a dataset of stimulated Raman histology brain tumor images, achieving a +28% improvement in accuracy over state-of-the-art GCD methods for patch-level classification, particularly in identifying previously unseen tumor categories. Furthermore, we demonstrate the generalizability of HGCD-BT on slide-level classification of hematoxylin and eosin stained whole-slide images from the Digital Brain Tumor Atlas, confirming its utility across imaging modalities.

[96] AdaRD-key: Adaptive Relevance-Diversity Keyframe Sampling for Long-form Video understanding

Xian Zhang,Zexi Wu,Zinuo Li,Hongming Xu,Luqi Gong,Farid Boussaid,Naoufel Werghi,Mohammed Bennamoun

Main category: cs.CV

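TL;DR: This paper proposes AdaRD-Key, a training-free, query-driven keyframe sampling module for video. Using a unified objective that combines relevance and diversity, it selects keyframes efficiently, accurately, and without redundancy for long-form video understanding.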

Details

Motivation: When handling long videos, existing methods use uniform sampling or fixed temporal exclusion windows, so they tend to miss key moments or short fine-grained cues and struggle to balance query relevance against visual diversity. Method: Proposes AdaRD-Key, which maximizes a Relevance-Diversity Max-Volume (RD-MV) objective combining a query-conditioned relevance score with a determinant-based diversity measure; a lightweight relevance-aware gating mechanism switches automatically to a diversity-only mode when relevance is weak. Result: Achieves state-of-the-art performance on benchmarks such as LongVideoBench and Video-MME, especially on long videos, while running in real time on a single GPU and plugging into existing VLMs. Conclusion: AdaRD-Key is an efficient, flexible, training-free keyframe sampling method that improves query-driven long-form video understanding. Abstract: Understanding long-form videos remains a significant challenge for vision-language models (VLMs) due to their extensive temporal length and high information density. Most current multimodal large language models (MLLMs) rely on uniform sampling, which often overlooks critical moments, leading to incorrect responses to queries. In parallel, many keyframe selection approaches impose rigid temporal spacing: once a frame is chosen, an exclusion window suppresses adjacent timestamps to reduce redundancy. While effective at limiting overlap, this strategy frequently misses short, fine-grained cues near important events. Other methods instead emphasize visual diversity but neglect query relevance. We propose AdaRD-Key, a training-free keyframe sampling module for query-driven long-form video understanding. AdaRD-Key maximizes a unified Relevance-Diversity Max-Volume (RD-MV) objective, combining a query-conditioned relevance score with a log-determinant diversity component to yield informative yet non-redundant frames. To handle broad queries with weak alignment to the video, AdaRD-Key employs a lightweight relevance-aware gating mechanism; when the relevance distribution indicates weak alignment, the method seamlessly shifts into a diversity-only mode, enhancing coverage without additional supervision. Our pipeline is training-free, computationally efficient (running in real time on a single GPU), and compatible with existing VLMs in a plug-and-play manner. Extensive experiments on LongVideoBench and Video-MME demonstrate state-of-the-art performance, particularly on long-form videos. Code available at https://github.com/Xian867/AdaRD-Key.
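
A relevance-plus-log-determinant objective of the kind described above can be maximized greedily. The sketch below is an assumption-laden illustration rather than the AdaRD-Key implementation: the additive form `sum(relevance) + lam * logdet(Gram)`, the `lam` and `eps` parameters, the greedy strategy, and the toy embeddings are all hypothetical, and the relevance-aware gating step is omitted.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_det(matrix):
    """Log-determinant of a small symmetric positive-definite matrix."""
    m = [row[:] for row in matrix]
    n, total = len(m), 0.0
    for i in range(n):
        pivot = m[i][i]
        if pivot <= 1e-12:
            return float("-inf")       # (near-)singular: redundant frames
        total += math.log(pivot)
        for r in range(i + 1, n):      # Gaussian elimination below the pivot
            f = m[r][i] / pivot
            for c in range(i, n):
                m[r][c] -= f * m[i][c]
    return total

def select_keyframes(frames, relevance, k, lam=1.0, eps=1e-3):
    """Greedily pick k frames maximizing sum(relevance) + lam * logdet(Gram)."""
    chosen = []
    for _ in range(k):
        best, best_score = None, float("-inf")
        for i in range(len(frames)):
            if i in chosen:
                continue
            idx = chosen + [i]
            # Gram matrix of the candidate set; eps keeps it well conditioned.
            gram = [[dot(frames[a], frames[b]) + (eps if a == b else 0.0)
                     for b in idx] for a in idx]
            score = sum(relevance[j] for j in idx) + lam * log_det(gram)
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return chosen

# Roughly unit-norm toy embeddings: frames 0 and 1 are near-duplicates.
frames = [[1.0, 0.0], [0.999, 0.0447], [0.0, 1.0]]
relevance = [0.9, 0.85, 0.3]
picked = select_keyframes(frames, relevance, k=2)
```

Frame 1 is almost as relevant as frame 0, but adding it collapses the volume spanned by the selected embeddings, so the less relevant yet dissimilar frame 2 is chosen instead. Setting all relevance scores to zero would give a diversity-only selection, which is one simple way to picture the gating fallback.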

[97] Reasoning Riddles: How Explainability Reveals Cognitive Limits in Vision-Language Models

Prahitha Movva

Main category: cs.CV

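TL;DR: This paper studies the cognitive processes of vision-language models (VLMs) on complex lateral thinking challenges such as rebus puzzles. It contributes an annotated dataset of 221 rebus puzzles and an evaluation framework, analyzes how different prompting strategies affect reasoning quality, and shows that models are strong at visual composition but limited in absence interpretation and cultural symbolism.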

Details

Motivation: Although VLMs perform well on multimodal tasks, they do poorly on tasks that require lateral thinking, such as rebus puzzles, and their reasoning processes and failure modes remain unclear, so their cognitive mechanisms need closer study. Method: Builds a dataset of 221 rebus puzzles spanning six cognitive categories and designs three prompting strategies to elicit different explanatory processes, evaluating reasoning quality separately from answer correctness. Result: Reasoning quality varies markedly across categories: models are strong at visual composition but show fundamental weaknesses in interpreting absence and cultural symbols; prompting strategy substantially affects both how models reason and how well they solve the puzzles. Conclusion: Treating explainability as a core component of model performance, rather than an afterthought, gives a fuller picture of VLM behavior on complex cognitive tasks. Abstract: Vision-Language Models (VLMs) excel at many multimodal tasks, yet their cognitive processes remain opaque on complex lateral thinking challenges like rebus puzzles. While recent work has demonstrated these models struggle significantly with rebus puzzle solving, the underlying reasoning processes and failure patterns remain largely unexplored. We address this gap through a comprehensive explainability analysis that moves beyond performance metrics to understand how VLMs approach these complex lateral thinking challenges. Our study contributes a systematically annotated dataset of 221 rebus puzzles across six cognitive categories, paired with an evaluation framework that separates reasoning quality from answer correctness. We investigate three prompting strategies designed to elicit different types of explanatory processes and reveal critical insights into VLM cognitive processes. Our findings demonstrate that reasoning quality varies dramatically across puzzle categories, with models showing systematic strengths in visual composition while exhibiting fundamental limitations in absence interpretation and cultural symbolism. We also discover that prompting strategy substantially influences both cognitive approach and problem-solving effectiveness, establishing explainability as an integral component of model performance rather than a post-hoc consideration.

[98] OTR: Synthesizing Overlay Text Dataset for Text Removal

Jan Zdenek,Wataru Shimoda,Kota Yamaguchi

Main category: cs.CV

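TL;DR: Proposes a method for synthesizing a text removal benchmark dataset, addressing the ground-truth defects and inadequate evaluation found in existing datasets.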

Details

Motivation: Existing text removal datasets suffer from ground-truth artifacts and overly simple backgrounds, which limits out-of-domain generalization and accurate evaluation. Method: Renders text on complex backgrounds using object-aware text placement and content generated by a vision-language model, producing a high-quality synthetic dataset. Result: A text removal benchmark applicable to domains beyond scene text has been built and released publicly. Conclusion: The dataset offers clean ground truth and more challenging text removal scenarios, which helps improve the generalization of text removal models and the accuracy of their evaluation. Abstract: Text removal is a crucial task in computer vision with applications such as privacy preservation, image editing, and media reuse. While existing research has primarily focused on scene text removal in natural images, limitations in current datasets hinder out-of-domain generalization or accurate evaluation. In particular, widely used benchmarks such as SCUT-EnsText suffer from ground truth artifacts due to manual editing, overly simplistic text backgrounds, and evaluation metrics that do not capture the quality of generated results. To address these issues, we introduce an approach to synthesizing a text removal benchmark applicable to domains other than scene texts. Our dataset features text rendered on complex backgrounds using object-aware placement and vision-language model-generated content, ensuring clean ground truth and challenging text removal scenarios. The dataset is available at https://huggingface.co/datasets/cyberagent/OTR .

[99] Align Your Query: Representation Alignment for Multimodality Medical Object Detection

Ara Seo,Bryan Sangwoo Kim,Hyungjin Chung,Jong Chul Ye

Main category: cs.CV

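TL;DR: Proposes a simple, detector-agnostic framework for multimodality medical object detection that introduces modality tokens and a QueryREPA pretraining strategy to align object queries with modality context and improve detection performance.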

Details

Motivation: A single detector trained on a mix of medical imaging modalities (such as CXR, CT, and MRI) performs poorly because of differing statistics and representation spaces, so cross-modality representation inconsistency must be addressed. Method: Proposes modality tokens and Multimodality Context Attention (MoCA), which inject lightweight text-derived embeddings into the object queries of DETR-style detectors, and uses QueryREPA, a contrastive pretraining stage, to align query representations with their modality. Result: When trained jointly across diverse medical imaging modalities, the method consistently improves AP with minimal computational overhead and no changes to the model architecture. Conclusion: MoCA and QueryREPA align representations for multimodality medical image detection, producing modality-aware, class-faithful queries and offering a practical route to robust multimodality medical object detection. Abstract: Medical object detection suffers when a single detector is trained on mixed medical modalities (e.g., CXR, CT, MRI) due to heterogeneous statistics and disjoint representation spaces. To address this challenge, we turn to representation alignment, an approach that has proven effective for bringing features from different sources into a shared space. Specifically, we target the representations of DETR-style object queries and propose a simple, detector-agnostic framework to align them with modality context. First, we define modality tokens: compact, text-derived embeddings encoding imaging modality that are lightweight and require no extra annotations. We integrate the modality tokens into the detection process via Multimodality Context Attention (MoCA), mixing object-query representations via self-attention to propagate modality context within the query set. This preserves DETR-style architectures and adds negligible latency while injecting modality cues into object queries. We further introduce QueryREPA, a short pretraining stage that aligns query representations to their modality tokens using a task-specific contrastive objective with modality-balanced batches. Together, MoCA and QueryREPA produce modality-aware, class-faithful queries that transfer effectively to downstream training. Across diverse modalities trained altogether, the proposed approach consistently improves AP with minimal overhead and no architectural modifications, offering a practical path toward robust multimodality medical object detection. Project page: https://araseo.github.io/alignyourquery/.

[100] MaskCD: Mitigating LVLM Hallucinations by Image Head Masked Contrastive Decoding

Jingyuan Deng,Yujiu Yang

Main category: cs.CV

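TL;DR: This paper proposes MaskCD, an image-head masked contrastive decoding method for mitigating hallucinations in large vision-language models; it is validated on multiple benchmarks and preserves the models' general capabilities.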

Details

Motivation: Large vision-language models perform well on multimodal tasks but hallucinate, generating content that contradicts their input, and existing methods fall short either in constructing contrastive samples or in stability. Method: Uses the "image heads" in the model, masking them to construct contrastive samples for contrastive decoding. Result: Tests on LLaVA-1.5-7b and Qwen-VL-7b show that MaskCD alleviates hallucinations while retaining the models' general performance. Conclusion: MaskCD is an effective and stable method for mitigating LVLM hallucinations, with practical application potential. Abstract: Large vision-language models (LVLMs) have shown remarkable performance in visual-language understanding for downstream multimodal tasks. While their capabilities are improving, problems emerge simultaneously. Among those problems, hallucination has attracted much attention; it refers to the phenomenon where LVLMs generate content that contradicts their input visual and text contents. Many approaches have been proposed to deal with this issue, such as contrastive decoding and attention manipulation. However, contrastive decoding methods struggle to construct appropriate contrastive samples, and attention manipulation methods are highly sensitive and lack stability. In this work, we propose image head Masked Contrastive Decoding (MaskCD). Our approach utilizes the "image heads" in LVLMs, masking them to construct contrastive samples for contrastive decoding. We evaluated MaskCD on LLaVA-1.5-7b and Qwen-VL-7b, using various benchmarks such as CHAIR, POPE, AMBER and MME. The results demonstrate that MaskCD effectively alleviates the phenomenon of hallucinations and retains the general capabilities of LVLMs. Corresponding resources can be found at: https://github.com/Deng-Jingyuan/MaskCD .
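
The contrastive decoding step that MaskCD builds on can be sketched for a single next-token choice. The scoring rule below, `(1 + alpha) * full - alpha * masked` with a plausibility cutoff `beta`, is a form commonly used in the contrastive decoding literature and is assumed here for illustration; the abstract does not state the exact formula MaskCD uses or how image heads are identified, so the masked logits are simply treated as given inputs with invented values.

```python
import math

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def contrastive_decode(full_logits, masked_logits, alpha=1.0, beta=0.1):
    """Pick the next token by contrasting full vs. image-head-masked logits."""
    lp_full = log_softmax(full_logits)
    lp_masked = log_softmax(masked_logits)
    # Plausibility constraint: keep only tokens the full model finds likely.
    cutoff = max(lp_full) + math.log(beta)
    scores = [
        (1 + alpha) * f - alpha * m if f >= cutoff else float("-inf")
        for f, m in zip(lp_full, lp_masked)
    ]
    return max(range(len(scores)), key=scores.__getitem__)

vocab = ["dog", "cat", "frisbee"]
full = [2.0, 1.9, 0.0]      # logits with image heads active
masked = [2.5, 0.5, 0.0]    # image heads masked: language prior favours "dog"
greedy = vocab[max(range(len(full)), key=full.__getitem__)]
token = vocab[contrastive_decode(full, masked)]
```

Greedy decoding on the full logits picks "dog", which the masked pass shows is driven largely by the language prior; the contrastive score instead prefers "cat", the token whose probability depends most on the visual pathway.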

[101] VERNIER: an open-source software pushing marker pose estimation down to the micrometer and nanometer scales

Patrick Sandoz,Antoine N. André,Guillaume J. Laurent

Main category: cs.CV

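TL;DR: This paper presents VERNIER, an open-source phase processing software for fast and reliable pose measurement based on pseudo-periodic patterns, suited to micro-scale pose estimation that requires nanometer and microradian resolution.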

Details

Motivation: Pose estimation with nanometer and microradian resolution at small scales remains challenging and few methods exist, so a reliable, robust solution that serves a range of microscopy applications is needed. Method: Presents and implements a phase processing algorithm combined with pseudo-periodic pattern designs, uses a phase-based local thresholding algorithm to improve robustness to noise, defocus, and occlusion, and illustrates the procedure with synthetic and experimental images. Result: VERNIER achieves centimeter measurement ranges with nanometer resolution, is highly robust to noise, defocus, and occlusion, and supports several pattern types for different application needs. Conclusion: VERNIER provides an open, flexible, and reliable tool for high-precision pose estimation at the micro scale; with a suitable pattern design and microscope configuration it can meet a wide range of performance requirements. Abstract: Pose estimation is still a challenge at small scales. Few solutions exist to capture the 6 degrees of freedom of an object with nanometric and microradian resolutions over relatively large ranges. Over the years, we have proposed several fiducial marker and pattern designs to achieve reliable performance for various microscopy applications. Centimeter ranges are possible using pattern encoding methods, while nanometer resolutions can be achieved using phase processing of the periodic frames. This paper presents VERNIER, an open source phase processing software designed to provide fast and reliable pose measurement based on pseudo-periodic patterns. Thanks to a phase-based local thresholding algorithm, the software has proven to be particularly robust to noise, defocus and occlusion. The successive steps of the phase processing are presented, as well as the different types of patterns that address different application needs. The implementation procedure is illustrated with synthetic and experimental images. Finally, guidelines are given for selecting the appropriate pattern design and microscope magnification lenses as a function of the desired performance.

[102] Med-K2N: Flexible K-to-N Modality Translation for Medical Image Synthesis

Feng Yuan,Yifan Gao,Yuehua Ye,Haoyue Li,Xin Gao

Main category: cs.CV

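TL;DR: This paper proposes Med-K2N, which performs K-to-N cross-modal medical image generation using adaptive weighting, quality filtering, and a causal modality identity constraint, and outperforms existing methods by significant margins.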

Details

Motivation: Addresses three key problems raised by the clinical need to reconstruct missing modalities: modeling the heterogeneous contributions of different modalities to each task, controlling fusion quality to avoid interference from noise, and keeping modality identity consistent in multi-output generation. Method: Inspired by SAM2 and clinical workflows, treats multimodal data as sequential frames with a quality-driven selection mechanism; designs three collaborative modules (PreWeightNet, ThresholdNet, and EffiWeightNet) for progressive enhancement; and proposes the Causal Modality Identity Module (CMIM), which uses vision-language modeling to maintain causal constraints between generated images and target modality descriptions. Result: Experiments on multiple benchmarks show that Med-K2N outperforms current state-of-the-art methods by significant margins. Conclusion: Med-K2N addresses modality contribution differences, fusion quality control, and modality consistency in K-to-N medical image generation, with good clinical applicability and performance. Abstract: Cross-modal medical image synthesis research focuses on reconstructing missing imaging modalities from available ones to support clinical diagnosis. Driven by clinical necessities for flexible modality reconstruction, we explore K to N medical generation, where three critical challenges emerge: How can we model the heterogeneous contributions of different modalities to various target tasks? How can we ensure fusion quality control to prevent degradation from noisy information? How can we maintain modality identity consistency in multi-output generation? Drawing inspiration from SAM2's sequential frame paradigm and clinicians' progressive workflow of incrementally adding and selectively integrating multi-modal information, we treat multi-modal medical data as sequential frames with quality-driven selection mechanisms. Our key idea is to "learn" adaptive weights for each modality-task pair and "memorize" beneficial fusion patterns through progressive enhancement. To achieve this, we design three collaborative modules: PreWeightNet for global contribution assessment, ThresholdNet for adaptive filtering, and EffiWeightNet for effective weight computation. Meanwhile, to maintain modality identity consistency, we propose the Causal Modality Identity Module (CMIM) that establishes causal constraints between generated images and target modality descriptions using vision-language modeling. Extensive experimental results demonstrate that our proposed Med-K2N outperforms state-of-the-art methods by significant margins on multiple benchmarks. Source code is available.

[103] ELMF4EggQ: Ensemble Learning with Multimodal Feature Fusion for Non-Destructive Egg Quality Assessment

Md Zahim Hassan,Md. Osama,Muhammad Ashad Kabir,Md. Saiful Islam,Zannatul Naim

Main category: cs.CV

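TL;DR: This paper proposes ELMF4EggQ, an ensemble learning framework with multimodal feature fusion that uses external egg images, shape, and weight to assess internal quality (grade and freshness) non-destructively, and releases the first publicly available labeled dataset for the task.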

Details

Motivation: Accurate egg quality assessment without breaking the egg is needed for food safety, product standards, and production efficiency, yet existing methods depend on internal inspection and machine learning prediction from external, non-invasive features alone has not been studied. Method: Builds a public dataset of 186 brown-shelled eggs with grade and freshness determined by laboratory expert assessment; extracts image features with pre-trained CNNs (ResNet152, DenseNet169, and ResNet152V2), combines them with structural features such as shape and weight, applies PCA dimensionality reduction and SMOTE augmentation, classifies with multiple machine learning models, and finally uses ensemble voting to improve accuracy. Result: The multimodal ensemble reaches 86.57% accuracy on grade classification and 70.83% on freshness prediction, significantly outperforming image-only and structural-feature-only baselines. Conclusion: ELMF4EggQ is the first machine learning framework to assess internal egg quality from external, non-invasive features alone, and its accompanying dataset promotes transparency, reproducibility, and further research in the area. Abstract: Accurate, non-destructive assessment of egg quality is critical for ensuring food safety, maintaining product standards, and operational efficiency in commercial poultry production. This paper introduces ELMF4EggQ, an ensemble learning framework that employs multimodal feature fusion to classify egg grade and freshness using only external attributes - image, shape, and weight. A novel, publicly available dataset of 186 brown-shelled eggs was constructed, with egg grade and freshness levels determined through laboratory-based expert assessments involving internal quality measurements, such as yolk index and Haugh unit. To the best of our knowledge, this is the first study to apply machine learning methods for internal egg quality assessment using only external, non-invasive features, and the first to release a corresponding labeled dataset. The proposed framework integrates deep features extracted from external egg images with structural characteristics such as egg shape and weight, enabling a comprehensive representation of each egg. Image feature extraction is performed using top-performing pre-trained CNN models (ResNet152, DenseNet169, and ResNet152V2), followed by PCA-based dimensionality reduction, SMOTE augmentation, and classification using multiple machine learning algorithms. An ensemble voting mechanism combines predictions from the best-performing classifiers to enhance overall accuracy. Experimental results demonstrate that the multimodal approach significantly outperforms image-only and tabular (shape and weight) only baselines, with the multimodal ensemble approach achieving 86.57% accuracy in grade classification and 70.83% in freshness prediction. All code and data are publicly available at https://github.com/Kenshin-Keeps/Egg_Quality_Prediction_ELMF4EggQ, promoting transparency, reproducibility, and further research in this domain.

[104] One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework

Lorenzo Bianchi,Giacomo Pacini,Fabio Carrara,Nicola Messina,Giuseppe Amato,Fabrizio Falchi

Main category: cs.CV

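TL;DR: This paper proposes a unified zero-shot image captioning framework that splits an image into semantic patches and generates captions for arbitrary regions in a patch-centric way, without region-level supervision, performing strongly on several dense and region-based captioning tasks.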

Details

Motivation: Existing zero-shot captioning models are limited to global image representations and whole-image captions, and cannot describe arbitrary sub-regions. A more flexible and scalable approach is needed to generate fine-grained captions without region-level supervision. Method: The framework treats individual image patches as the basic captioning units, uses backbones that produce dense visual features (such as DINO) to extract meaningful local representations, and aggregates patch features to describe any region, from a single patch to non-contiguous areas and entire images, shifting from an image-centric to a patch-centric paradigm. Result: Experiments show that with backbones that provide strong dense features, such as DINO, the framework outperforms existing baselines and other state-of-the-art methods on zero-shot dense captioning, region-set captioning, and a newly introduced trace captioning task, confirming the effectiveness of patch-level semantic representations. Conclusion: By introducing a unified patch-centric framework, the work achieves flexible, scalable zero-shot captioning without region-level supervision and shows the potential of dense visual representations for fine-grained generation tasks. Abstract: Zero-shot captioners are recently proposed models that utilize common-space vision-language representations to caption images without relying on paired image-text data. To caption an image, they proceed by textually decoding a text-aligned image feature, but they limit their scope to global representations and whole-image captions. We present a unified framework for zero-shot captioning that shifts from an image-centric to a patch-centric paradigm, enabling the captioning of arbitrary regions without the need for region-level supervision. Instead of relying on global image representations, we treat individual patches as atomic captioning units and aggregate them to describe arbitrary regions, from single patches to non-contiguous areas and entire images. We analyze the key ingredients that enable current latent captioners to work in our proposed framework. Experiments demonstrate that backbones producing meaningful, dense visual features, such as DINO, are key to achieving state-of-the-art performance in multiple region-based captioning tasks. Compared to other baselines and state-of-the-art competitors, our models achieve better performance on zero-shot dense, region-set, and a newly introduced trace captioning task, highlighting the effectiveness of patch-wise semantic representations for scalable caption generation. Project page at https://paciosoft.com/Patch-ioner/ .

[105] Training-Free Out-Of-Distribution Segmentation With Foundation Models

Laith Nayal,Hadi Salloum,Ahmad Taha,Yaroslav Kholodov,Alexander Gasnikov

Main category: cs.CV

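TL;DR: This paper studies whether large vision foundation models such as InternImage can detect unknown objects in semantic segmentation without outlier supervision, and proposes a training-free method that combines K-Means clustering with confidence thresholding, surpassing several supervised and unsupervised baselines on the RoadAnomaly and ADE-OoD benchmarks.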

Details

Motivation: Existing semantic segmentation methods focus mainly on closed-set tasks and are weak at detecting unknown, out-of-distribution (OoD) objects in the open world, a gap that matters most in safety-critical settings such as autonomous driving. Method: Features are extracted with the InternImage backbone and clustered with K-Means; confidence thresholding on the raw decoder logits then identifies OoD regions, with no additional training. Result: The method reaches 50.02 Average Precision on RoadAnomaly and 48.77 on ADE-OoD, surpassing several supervised and unsupervised baselines. Conclusion: Foundation models fine-tuned on segmentation datasets already have some capacity for OoD detection, and the proposed method points to a feasible direction for generic OoD segmentation with minimal assumptions and no additional data. Abstract: Detecting unknown objects in semantic segmentation is crucial for safety-critical applications such as autonomous driving. Large vision foundation models, including DINOv2, InternImage, and CLIP, have advanced visual representation learning by providing rich features that generalize well across diverse tasks. While their strength in closed-set semantic tasks is established, their capability to detect out-of-distribution (OoD) regions in semantic segmentation remains underexplored. In this work, we investigate whether foundation models fine-tuned on segmentation datasets can inherently distinguish in-distribution (ID) from OoD regions without any outlier supervision. We propose a simple, training-free approach that utilizes features from the InternImage backbone and applies K-Means clustering alongside confidence thresholding on raw decoder logits to identify OoD clusters. Our method achieves 50.02 Average Precision on the RoadAnomaly benchmark and 48.77 on the ADE-OoD benchmark with InternImage-L, surpassing several supervised and unsupervised baselines. These results suggest a promising direction for generic OoD segmentation methods that require minimal assumptions or additional data.
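A minimal sketch of the clustering-plus-thresholding step follows. It assumes that per-pixel features and per-pixel maximum decoder logits are already available as plain lists, and that a cluster is flagged as OoD when its mean maximum logit falls below a threshold. The cluster-level decision rule, the deterministic initialisation, and every name and number here are illustrative assumptions, not the paper's exact procedure.

```python
def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    # Deterministic init (an assumption for reproducibility): evenly spaced points.
    centroids = [list(points[(i * len(points)) // k]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def ood_mask(features, max_logits, k, threshold):
    """Flag every pixel whose cluster has a mean max-logit below threshold."""
    labels = kmeans(features, k)
    ood_clusters = set()
    for c in range(k):
        conf = [m for m, l in zip(max_logits, labels) if l == c]
        if conf and sum(conf) / len(conf) < threshold:
            ood_clusters.add(c)
    return [l in ood_clusters for l in labels]


# Toy data: two well-separated feature groups; the second has low logits.
features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
max_logits = [9.0, 8.5, 9.2, 2.0, 1.5, 2.4]
mask = ood_mask(features, max_logits, k=2, threshold=5.0)
```

Deciding per cluster rather than per pixel keeps the mask spatially coherent in feature space, which is one plausible reason to combine clustering with thresholding.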

[106] Don't Just Chase "Highlighted Tokens" in MLLMs: Revisiting Visual Holistic Context Retention

Xin Zou,Di Lu,Yizhou Wang,Yibo Yan,Yuanhuiyi Lyu,Xu Zheng,Linfeng Zhang,Xuming Hu

Main category: cs.CV

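TL;DR: Proposes HoloV, a simple yet effective plug-and-play visual token pruning framework that adaptively allocates the pruning budget across spatial regions to retain global visual context, markedly improving the efficiency-accuracy trade-off of multimodal large language model inference.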

Details

Motivation: Existing attention-based visual token pruning methods tend to keep semantically similar, redundant tokens at high pruning ratios, causing pronounced performance drops. Method: HoloV rethinks token retention from a holistic perspective: it adaptively distributes the pruning budget across different spatial regions so that the retained tokens capture global visual context rather than isolated salient features. Result: Experiments show that HoloV outperforms state-of-the-art methods across tasks, MLLM architectures, and pruning ratios; for example, LLaVA1.5 retains 95.8% of its original performance after 88.9% of its visual tokens are pruned. Conclusion: HoloV mitigates the representational collapse of attention-first pruning, preserves task-relevant information under high pruning ratios, and achieves a better efficiency-accuracy trade-off. Abstract: Despite their powerful capabilities, Multimodal Large Language Models (MLLMs) suffer from considerable computational overhead due to their reliance on massive visual tokens. Recent studies have explored token pruning to alleviate this problem, which typically uses text-vision cross-attention or [CLS] attention to assess and discard redundant visual tokens. In this work, we identify a critical limitation of such attention-first pruning approaches, i.e., they tend to preserve semantically similar tokens, resulting in pronounced performance drops under high pruning ratios. To this end, we propose HoloV, a simple yet effective, plug-and-play visual token pruning framework for efficient inference. Distinct from previous attention-first schemes, HoloV rethinks token retention from a holistic perspective. By adaptively distributing the pruning budget across different spatial crops, HoloV ensures that the retained tokens capture the global visual context rather than isolated salient features. This strategy minimizes representational collapse and maintains task-relevant information even under aggressive pruning. Experimental results demonstrate that our HoloV achieves superior performance across various tasks, MLLM architectures, and pruning ratios compared to SOTA methods. For instance, LLaVA1.5 equipped with HoloV preserves 95.8% of the original performance after pruning 88.9% of visual tokens, achieving superior efficiency-accuracy trade-offs.
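One way to picture a budget spread across spatial crops is the sketch below. It assumes each crop has a scalar importance score and each token a scalar saliency score, splits the total keep budget in proportion to crop scores, and keeps the top tokens inside each crop. How HoloV actually scores crops and tokens is not stated in the abstract, so the proportional rule, the function names, and the toy data are hypothetical.

```python
def allocate_budget(crop_scores, total_keep):
    """Split total_keep across crops in proportion to their scores."""
    total = sum(crop_scores)
    raw = [total_keep * s / total for s in crop_scores]
    budget = [int(r) for r in raw]
    # Largest-remainder rounding so the budgets sum exactly to total_keep.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - budget[i], reverse=True)
    for i in order[: total_keep - sum(budget)]:
        budget[i] += 1
    return budget


def prune(tokens_per_crop, crop_scores, total_keep):
    """Keep the highest-scoring tokens of each crop, up to that crop's budget."""
    budget = allocate_budget(crop_scores, total_keep)
    kept = []
    for crop, b in zip(tokens_per_crop, budget):
        ranked = sorted(crop, key=lambda t: t[1], reverse=True)
        kept.append([name for name, _ in ranked[:b]])
    return kept


# Toy input: two crops of (token_name, saliency) pairs with made-up scores.
crops = [
    [("a", 0.9), ("b", 0.1), ("c", 0.5), ("d", 0.7)],
    [("e", 0.2), ("f", 0.8)],
]
kept = prune(crops, crop_scores=[3.0, 1.0], total_keep=4)
```

Because every crop with a non-trivial score receives part of the budget, low-saliency regions still contribute tokens, which is the intuition behind retaining global context instead of only the most attended tokens. The sketch does not handle a crop that holds fewer tokens than its budget.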

[107] Zero-Shot Robustness of Vision Language Models Via Confidence-Aware Weighting

Nikoo Naghavian,Mostafa Tavassolipour

Main category: cs.CV

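TL;DR: Proposes Confidence-Aware Weighting (CAW), which uses a confidence-aware loss and feature alignment regularization to improve the adversarial robustness of vision-language models in the zero-shot setting, outperforming recent methods on multiple datasets with lower memory use.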

Details

Motivation: Vision-language models such as CLIP generalize well in the zero-shot setting but are vulnerable to adversarial attacks, so their robustness needs to improve. Method: CAW has two components: a confidence-aware loss that focuses on uncertain adversarial examples by scaling the KL divergence between clean and adversarial predictions, and a feature alignment regularization that preserves semantic consistency by minimizing the distance between frozen and fine-tuned image encoder features on adversarial inputs. Result: Experiments on TinyImageNet and 14 additional datasets show that CAW outperforms recent methods such as PMG-AFT and TGA-ZSR under strong attacks such as AutoAttack, while using less memory and balancing clean accuracy and robustness. Conclusion: CAW effectively improves the zero-shot adversarial robustness of vision-language models while preserving generalization, making it an efficient and practical defense. Abstract: Vision-language models like CLIP demonstrate impressive zero-shot generalization but remain highly vulnerable to adversarial attacks. In this work, we propose Confidence-Aware Weighting (CAW) to enhance zero-shot robustness in vision-language models. CAW consists of two components: (1) a Confidence-Aware loss that prioritizes uncertain adversarial examples by scaling the KL divergence between clean and adversarial predictions, and (2) a feature alignment regularization that preserves semantic consistency by minimizing the distance between frozen and fine-tuned image encoder features on adversarial inputs. These components work jointly to improve both clean and robust accuracy without sacrificing generalization. Extensive experiments on TinyImageNet and 14 additional datasets show that CAW outperforms recent methods such as PMG-AFT and TGA-ZSR under strong attacks like AutoAttack, while using less memory.
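The first CAW component can be sketched as a KL term multiplied by an uncertainty weight. The abstract does not give the weighting formula or the direction of the KL divergence, so the sketch below assumes a weight of one minus the maximum adversarial probability and the direction KL(clean || adversarial). Both choices, and all function names, are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def confidence_aware_kl(clean_logits, adv_logits):
    """KL between clean and adversarial predictions, scaled by uncertainty."""
    p_clean = softmax(clean_logits)
    p_adv = softmax(adv_logits)
    # Hypothetical weight: low adversarial confidence gives a larger weight.
    weight = 1.0 - max(p_adv)
    return weight * kl_divergence(p_clean, p_adv)


same = confidence_aware_kl([2.0, 0.0, 0.0], [2.0, 0.0, 0.0])
shifted = confidence_aware_kl([2.0, 0.0, 0.0], [0.0, 2.0, 0.0])
```

With this weighting, an adversarial example that the model still classifies with near-certainty contributes little to the loss, while one that leaves the model undecided contributes more.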

[108] Multimodal Carotid Risk Stratification with Large Vision-Language Models: Benchmarking, Fine-Tuning, and Clinical Insights

Daphne Tsolissou,Theofanis Ganitidis,Konstantinos Mitsis,Stergios Christodoulidis,Maria Vakalopoulou,Konstantina Nikita

Main category: cs.CV

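TL;DR: This study examines large vision-language models (LVLMs) for multimodal carotid plaque assessment, combining ultrasound images with clinical, demographic, laboratory, and protein biomarker data for stroke risk stratification. It proposes an interview-style question framework that simulates realistic diagnostic scenarios and compares several open-source LVLMs. Despite their power, these models perform poorly at identifying imaging modality and anatomy and at accurate risk classification. Adapting LLaVa-NeXT-Vicuna to the ultrasound domain with low-rank adaptation (LoRA) substantially improves risk stratification, and adding the multimodal tabular data as text further improves specificity and balanced accuracy, yielding performance competitive with convolutional neural network (CNN) baselines trained on the same dataset. The study underscores the importance of multimodal integration, model calibration, and domain adaptation for clinical translation.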

Details

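Motivation: Risk assessment for carotid atheromatous disease remains clinically challenging because it requires integrating diverse clinical and imaging information transparently and interpretably. Existing methods struggle to meet this need, which motivates exploring LVLMs for multimodal risk assessment. Method: An interview-style question framework simulates realistic diagnostic scenarios and combines ultrasound images with structured clinical, demographic, laboratory, and protein biomarker data. Several open-source LVLMs, both general-purpose and medically tuned, are compared in the zero-shot setting. To address their shortcomings, LLaVa-NeXT-Vicuna is fine-tuned for the ultrasound domain with low-rank adaptation (LoRA), and tabular data is added to the model input as text. Result: In zero-shot experiments, not all LVLMs accurately identify imaging modality and anatomy, and all perform poorly at risk classification. The LoRA-adapted LLaVa-NeXT-Vicuna improves substantially on stroke risk stratification, and adding multimodal tabular data further improves specificity and balanced accuracy, giving performance competitive with prior CNN baselines. Conclusion: Current LVLMs are limited when applied directly to carotid ultrasound risk prediction, but domain adaptation (such as LoRA) and multimodal data fusion, in particular structured clinical data supplied as text, improve performance considerably and show potential for clinical translation. The study highlights the roles of multimodal integration, model calibration, and domain adaptation. Abstract: Reliable risk assessment for carotid atheromatous disease remains a major clinical challenge, as it requires integrating diverse clinical and imaging information in a manner that is transparent and interpretable to clinicians. This study investigates the potential of state-of-the-art and recent large vision-language models (LVLMs) for multimodal carotid plaque assessment by integrating ultrasound imaging (USI) with structured clinical, demographic, laboratory, and protein biomarker data. A framework that simulates realistic diagnostic scenarios through interview-style question sequences is proposed, comparing a range of open-source LVLMs, including both general-purpose and medically tuned models. Zero-shot experiments reveal that, powerful as they are, not all LVLMs can accurately identify imaging modality and anatomy, and all of them perform poorly at risk classification. To address this limitation, LLaVa-NeXT-Vicuna is adapted to the ultrasound domain using low-rank adaptation (LoRA), resulting in substantial improvements in stroke risk stratification. The integration of multimodal tabular data in the form of text further enhances specificity and balanced accuracy, yielding competitive performance compared to prior convolutional neural network (CNN) baselines trained on the same dataset.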
Our findings highlight both the promise and limitations of LVLMs in ultrasound-based cardiovascular risk prediction, underscoring the importance of multimodal integration, model calibration, and domain adaptation for clinical translation.

[109] Flip Distribution Alignment VAE for Multi-Phase MRI Synthesis

Xiaoyan Kui,Qianmu Xiao,Qinsong Li,Zexin Ji,Jielin Zhang,Beiji Zou

Main category: cs.CV

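TL;DR: Proposes FDA-VAE, a lightweight feature-decoupled VAE for multi-phase contrast-enhanced MRI synthesis that uses flip distribution alignment and a Y-shaped bidirectional training strategy to separate shared and independent features, improving synthesis quality while reducing parameters and inference time.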

Details

Motivation: Existing methods rely on deep autoencoder generators with low parameter efficiency and lack interpretable training strategies, so they struggle to separate shared and independent features in multi-phase CE-MRI synthesis. Method: The Flip Distribution Alignment Variational Autoencoder (FDA-VAE) encodes the input and target images into two latent distributions that are symmetric with respect to a standard normal distribution, and adopts a Y-shaped bidirectional training strategy to make the feature separation more interpretable. Result: Compared with existing end-to-end deep autoencoder methods, FDA-VAE significantly reduces the number of parameters and the inference time while improving the quality of the synthesized images. Conclusion: FDA-VAE is an efficient, lightweight, and interpretable method for multi-phase CE-MRI synthesis that delivers high-quality images at much lower model cost. Abstract: Separating shared and independent features is crucial for multi-phase contrast-enhanced (CE) MRI synthesis. However, existing methods use deep autoencoder generators with low parameter efficiency and lack interpretable training strategies. In this paper, we propose Flip Distribution Alignment Variational Autoencoder (FDA-VAE), a lightweight feature-decoupled VAE model for multi-phase CE MRI synthesis. Our method encodes input and target images into two latent distributions that are symmetric with respect to a standard normal distribution, effectively separating shared and independent features. The Y-shaped bidirectional training strategy further enhances the interpretability of feature separation. Experimental results show that compared to existing deep autoencoder-based end-to-end synthesis methods, FDA-VAE significantly reduces model parameters and inference time while effectively improving synthesis quality. The source code is publicly available at https://github.com/QianMuXiao/FDA-VAE.

[110] TIT-Score: Evaluating Long-Prompt Based Text-to-Image Alignment via Text-to-Image-to-Text Consistency

Juntong Wang,Huiyu Duan,Jiarui Wang,Ziheng Jia,Guangtao Zhai,Xiongkuo Min

Main category: cs.CV

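TL;DR: This paper introduces LPG-Bench, a benchmark for evaluating long-prompt text-to-image generation, and TIT, a new zero-shot evaluation metric based on text-to-image-to-text consistency that aligns with human preferences markedly better than existing metrics.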

Details

Motivation: Existing text-to-image models handle long, detailed prompts poorly, and current automatic evaluation metrics agree weakly with human preferences, so better benchmarks and metrics are needed. Method: LPG-Bench is built from 200 long prompts, with 2,600 generated images and human annotations. The TIT framework, comprising TIT-Score and TIT-Score-LLM, evaluates generation quality by measuring how consistent a large multimodal model's description of the generated image is with the original prompt. Result: TIT agrees closely with human judgment; TIT-Score-LLM improves pairwise accuracy by 7.31% absolute over the strongest baseline and clearly outperforms existing metrics such as CLIP-score and LMM-score. Conclusion: LPG-Bench and TIT provide effective tools for evaluating long-prompt text-to-image generation and should help advance the field. Abstract: With the rapid advancement of large multimodal models (LMMs), recent text-to-image (T2I) models can generate high-quality images and demonstrate great alignment to short prompts. However, they still struggle to effectively understand and follow long and detailed prompts, displaying inconsistent generation. To address this challenge, we introduce LPG-Bench, a comprehensive benchmark for evaluating long-prompt-based text-to-image generation. LPG-Bench features 200 meticulously crafted prompts with an average length of over 250 words, approaching the input capacity of several leading commercial models. Using these prompts, we generate 2,600 images from 13 state-of-the-art models and further perform comprehensive human-ranked annotations. Based on LPG-Bench, we observe that state-of-the-art T2I alignment evaluation metrics exhibit poor consistency with human preferences on long-prompt-based image generation. To address the gap, we introduce a novel zero-shot metric based on text-to-image-to-text consistency, termed TIT, for evaluating long-prompt-generated images. The core concept of TIT is to quantify T2I alignment by directly comparing the consistency between the raw prompt and the LMM-produced description on the generated image, which includes an efficient score-based instantiation TIT-Score and a large-language-model (LLM) based instantiation TIT-Score-LLM. Extensive experiments demonstrate that our framework achieves superior alignment with human judgment compared to CLIP-score, LMM-score, etc., with TIT-Score-LLM attaining a 7.31% absolute improvement in pairwise accuracy over the strongest baseline.
LPG-Bench and TIT methods together offer a deeper perspective to benchmark and foster the development of T2I models. All resources will be made publicly available.
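The text-to-image-to-text idea reduces to comparing two pieces of text: the original prompt and a description of the generated image. The sketch below shows only that comparison step and replaces the unspecified text representation with a toy bag-of-words cosine similarity. The function names and example strings are hypothetical, and the descriptions are written by hand where a real pipeline would obtain them from an LMM.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Toy stand-in for a text embedding: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def consistency_score(prompt, description):
    """Higher means the description of the image stays closer to the prompt."""
    return cosine(bag_of_words(prompt), bag_of_words(description))


prompt = "a red fox jumps over a frozen lake"
faithful = consistency_score(prompt, "a red fox jumps over a lake")
unfaithful = consistency_score(prompt, "a blue car parked near a house")
```

A word-count vector ignores synonyms and word order, so it is only a placeholder; the point is that the score is computed purely in text space once the image has been described.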

[111] Towards Scalable and Consistent 3D Editing

Ruihao Xia,Yang Tang,Pan Zhou

Main category: cs.CV

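TL;DR: This paper introduces the 3DEditVerse dataset and the 3DEditFormer model to address cross-view consistency, structural fidelity, and fine-grained control in 3D editing, enabling efficient and precise 3D edits without accurate 3D masks.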

Details

Motivation: Existing 3D editing methods are slow, prone to geometric distortions, or dependent on manually produced, accurate 3D masks, which makes them hard to use in practice. Method: On the data side, the 3DEditVerse benchmark provides a large set of high-quality paired samples. On the model side, 3DEditFormer uses dual-guidance attention and time-adaptive gating to perform local edits while preserving 3D structure. Result: The method outperforms state-of-the-art approaches on both quantitative and qualitative measures, producing more precise and consistent 3D edits without auxiliary 3D masks. Conclusion: 3DEditFormer together with 3DEditVerse sets a new standard for practical and scalable 3D editing. Abstract: 3D editing - the task of locally modifying the geometry or appearance of a 3D asset - has wide applications in immersive content creation, digital entertainment, and AR/VR. However, unlike 2D editing, it remains challenging due to the need for cross-view consistency, structural fidelity, and fine-grained controllability. Existing approaches are often slow, prone to geometric distortions, or dependent on manual and accurate 3D masks that are error-prone and impractical. To address these challenges, we advance both the data and model fronts. On the data side, we introduce 3DEditVerse, the largest paired 3D editing benchmark to date, comprising 116,309 high-quality training pairs and 1,500 curated test pairs. Built through complementary pipelines of pose-driven geometric edits and foundation model-guided appearance edits, 3DEditVerse ensures edit locality, multi-view consistency, and semantic alignment. On the model side, we propose 3DEditFormer, a 3D-structure-preserving conditional transformer. By enhancing image-to-3D generation with dual-guidance attention and time-adaptive gating, 3DEditFormer disentangles editable regions from preserved structure, enabling precise and consistent edits without requiring auxiliary 3D masks. Extensive experiments demonstrate that our framework outperforms state-of-the-art baselines both quantitatively and qualitatively, establishing a new standard for practical and scalable 3D editing. Dataset and code will be released. Project: https://www.lv-lab.org/3DEditFormer/

[112] Not every day is a sunny day: Synthetic cloud injection for deep land cover segmentation robustness evaluation across data sources

Sara Mobsite,Renaud Hostache,Laure Berti Equille,Emmanuel Roux,Joris Guerin

Main category: cs.CV

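TL;DR: Proposes a cloud injection algorithm and a lightweight Normalized Difference Index (NDI) injection method, and combines Sentinel-1 radar data with optical data to improve land cover semantic segmentation under cloud cover.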

Details

Motivation: Most existing Sentinel-2 land cover datasets are cloud-free, which limits their use in cloudy tropical regions, and downsampling in deep networks tends to lose spatial and spectral detail. Method: A cloud injection algorithm simulates realistic cloud cover; Normalized Difference Indices (NDIs) are injected into the final decoding layers to retain spatial features; and Sentinel-1 radar data is fused in to compensate for cloud-obstructed optical imagery. Result: On the DFC2020 dataset, NDI injection improves U-Net by 1.99% and DeepLabV3 by 2.78% on cloud-free imagery; under cloud cover, adding Sentinel-1 data significantly improves all models. Conclusion: Radar-optical fusion handles cloud occlusion effectively, and NDI injection helps models retain key spatial features, making land cover segmentation more robust. Abstract: Supervised deep learning for land cover semantic segmentation (LCS) relies on labeled satellite data. However, most existing Sentinel-2 datasets are cloud-free, which limits their usefulness in tropical regions where clouds are common. To properly evaluate the extent of this problem, we developed a cloud injection algorithm that simulates realistic cloud cover, allowing us to test how Sentinel-1 radar data can fill in the gaps caused by cloud-obstructed optical imagery. We also tackle the issue of losing spatial and/or spectral details during encoder downsampling in deep networks. To mitigate this loss, we propose a lightweight method that injects Normalized Difference Indices (NDIs) into the final decoding layers, enabling the model to retain key spatial features with minimal additional computation. Injecting NDIs enhanced land cover segmentation performance on the DFC2020 dataset, yielding improvements of 1.99% for U-Net and 2.78% for DeepLabV3 on cloud-free imagery. Under cloud-covered conditions, incorporating Sentinel-1 data led to significant performance gains across all models compared to using optical data alone, highlighting the effectiveness of radar-optical fusion in challenging atmospheric scenarios.
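A normalized difference index between two spectral bands a and b is (a - b) / (a + b), computed per pixel; NDVI, for example, uses the near-infrared and red bands. The short sketch below computes such an index over flat lists of reflectance values. Which indices the paper injects is not stated in the abstract, so NDVI is used only as a familiar example, and the band values are made up.

```python
def normalized_difference(band_a, band_b):
    """Per-pixel (a - b) / (a + b); returns 0.0 where a + b is zero."""
    out = []
    for a, b in zip(band_a, band_b):
        denom = a + b
        out.append((a - b) / denom if denom != 0 else 0.0)
    return out


# Toy reflectance values for three pixels (made-up numbers).
nir = [0.75, 0.5, 0.0]
red = [0.25, 0.5, 0.0]
ndvi = normalized_difference(nir, red)  # [0.5, 0.0, 0.0]
```

Because the index is a cheap element-wise operation on the input bands, it can be computed at full resolution and concatenated to late decoder features without the detail loss that encoder downsampling introduces.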

[113] PocketSR: The Super-Resolution Expert in Your Pocket Mobiles

Haoze Sun,Linfeng Jiang,Fan Li,Renjing Pei,Zhixin Wang,Yong Guo,Jiaqi Xu,Haoyu Chen,Jin Han,Fenglong Song,Yujiu Yang,Wenbo Li

Main category: cs.CV

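TL;DR: This paper presents PocketSR, an ultra-lightweight single-step image super-resolution model that uses LiteED and online annealing pruning to improve efficiency substantially while keeping quality high, making it suitable for edge devices.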

Details

Motivation: Existing real-world image super-resolution methods built on large generative models have high computational cost and latency, which makes edge deployment impractical. Method: LiteED is proposed as an efficient replacement for the VAE, and the U-Net is optimized with online annealing pruning and a multi-layer feature distillation loss. Result: With only 146M parameters, PocketSR processes 4K images in 0.8 seconds, much faster than previous methods, and performs on par with state-of-the-art single-step and even multi-step RealSR models. Conclusion: PocketSR greatly improves efficiency while maintaining high fidelity, making it a practical RealSR solution for edge-device applications. Abstract: Real-world image super-resolution (RealSR) aims to enhance the visual quality of in-the-wild images, such as those captured by mobile phones. While existing methods leveraging large generative models demonstrate impressive results, the high computational cost and latency make them impractical for edge deployment. In this paper, we introduce PocketSR, an ultra-lightweight, single-step model that brings generative modeling capabilities to RealSR while maintaining high fidelity. To achieve this, we design LiteED, a highly efficient alternative to the original computationally intensive VAE in SD, reducing parameters by 97.5% while preserving high-quality encoding and decoding. Additionally, we propose online annealing pruning for the U-Net, which progressively shifts generative priors from heavy modules to lightweight counterparts, ensuring effective knowledge transfer and further optimizing efficiency. To mitigate the loss of prior knowledge during pruning, we incorporate a multi-layer feature distillation loss. Through an in-depth analysis of each design component, we provide valuable insights for future research. PocketSR, with a model size of 146M parameters, processes 4K images in just 0.8 seconds, achieving a remarkable speedup over previous methods. Notably, it delivers performance on par with state-of-the-art single-step and even multi-step RealSR models, making it a highly practical solution for edge-device applications.

[114] When and Where do Events Switch in Multi-Event Video Generation?

Ruotong Liao,Guowen Huang,Qing Cheng,Thomas Seidl,Daniel Cremers,Volker Tresp

Main category: cs.CV

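TL;DR: This paper introduces MEve, a self-curated prompt suite for evaluating multi-event text-to-video generation, and systematically studies two representative model families, OpenSora and CogVideoX, showing that early intervention in denoising and block-wise layer control are key to event transitions.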

Details

Motivation: Existing methods that extend to multi-event generation overlook the intrinsic factors behind event switching, and there has been no close study of when and where multi-event prompts control event transitions during video generation. Method: The MEve prompt suite is used for systematic experiments on the OpenSora and CogVideoX model families, analyzing how different denoising steps and model layers affect event transitions. Result: Intervening early in the denoising process and in specific block-wise model layers proves important for temporal coherence and controllability across multiple events. Conclusion: Early denoising steps and block-wise control are central to multi-event T2V generation and suggest a workable direction for multi-event conditioning in future models. Abstract: Text-to-video (T2V) generation has surged in response to challenging questions, especially when a long video must depict multiple sequential events with temporal coherence and controllable content. Existing methods that extend to multi-event generation omit an inspection of the intrinsic factor in event shifting. The paper aims to answer the central question: when and where do multi-event prompts control event transitions during T2V generation? This work introduces MEve, a self-curated prompt suite for evaluating multi-event text-to-video (T2V) generation, and conducts a systematic study of two representative model families, i.e., OpenSora and CogVideoX. Extensive experiments demonstrate the importance of early intervention in denoising steps and block-wise model layers, revealing the essential factor for multi-event video generation and highlighting the possibilities for multi-event conditioning in future models.

[115] InsideOut: An EfficientNetV2-S Based Deep Learning Framework for Robust Multi-Class Facial Emotion Recognition

Ahsan Farabi,Israt Khandaker,Ibrahim Khalil Shanto,Md Abdul Ahad Minhaz,Tanisha Zaman

Main category: cs.CV

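TL;DR: Proposes InsideOut, a reproducible facial emotion recognition framework built on EfficientNetV2-S that uses transfer learning, strong data augmentation, and imbalance-aware optimization to reach 62.8% accuracy and a macro-averaged F1 of 0.590 on FER2013.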

Details

Motivation: Facial emotion recognition (FER) remains challenging because of occlusions, illumination and pose variations, subtle intra-class differences, and dataset imbalance, which makes minority emotion classes especially hard to recognize. Method: EfficientNetV2-S serves as the backbone, combined with transfer learning, strong data augmentation, stratified splitting, and a class-weighted loss; a lightweight classification head is fine-tuned to cope with the skewed class distribution. Result: The framework reaches 62.8% accuracy and a macro-averaged F1 of 0.590 on FER2013, competitive with conventional CNN baselines. Conclusion: Efficient architectures combined with targeted imbalance handling can provide practical, transparent, and reproducible FER solutions. Abstract: Facial Emotion Recognition (FER) is a key task in affective computing, enabling applications in human-computer interaction, e-learning, healthcare, and safety systems. Despite advances in deep learning, FER remains challenging due to occlusions, illumination and pose variations, subtle intra-class differences, and dataset imbalance that hinders recognition of minority emotions. We present InsideOut, a reproducible FER framework built on EfficientNetV2-S with transfer learning, strong data augmentation, and imbalance-aware optimization. The approach standardizes FER2013 images, applies stratified splitting and augmentation, and fine-tunes a lightweight classification head with class-weighted loss to address skewed distributions. InsideOut achieves 62.8% accuracy with a macro averaged F1 of 0.590 on FER2013, showing competitive results compared to conventional CNN baselines. The novelty lies in demonstrating that efficient architectures, combined with tailored imbalance handling, can provide practical, transparent, and reproducible FER solutions.
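A class-weighted loss can be sketched in a few lines. The version below assumes the common "balanced" heuristic, where the weight of class c is N / (K * n_c) for N samples, K classes, and n_c samples of class c, and applies it to a per-sample cross-entropy. The abstract does not say which weighting scheme InsideOut uses, so this heuristic and the toy labels are assumptions.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by N / (K * n_c), so rare classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


def weighted_cross_entropy(probs, label, weights):
    """Cross-entropy for one sample, scaled by the weight of its true class."""
    return -weights[label] * math.log(probs[label])


# Toy imbalanced label set (made-up data).
labels = ["happy", "happy", "happy", "disgust"]
weights = balanced_class_weights(labels)

# The same predicted probability costs more when the true class is rare.
probs = {"happy": 0.5, "disgust": 0.5}
loss_common = weighted_cross_entropy(probs, "happy", weights)
loss_rare = weighted_cross_entropy(probs, "disgust", weights)
```

Scaling the loss this way raises the gradient contribution of minority emotions, which is what pushes macro-averaged F1 up relative to an unweighted loss on a skewed dataset.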

[116] What Drives Compositional Generalization in Visual Generative Models?

Karim Farid,Rajat Sahay,Yumna Ali Alnaggar,Simon Schrodi,Volker Fischer,Cordelia Schmid,Thomas Brox

Main category: cs.CV

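TL;DR: This paper systematically studies how design choices affect compositional generalization in image and video generation, identifies two key factors (whether the training objective is discrete or continuous, and how much information conditioning provides during training), and shows that adding a continuous JEPA-based objective improves the compositional performance of discrete models such as MaskGIT.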

Details

Motivation: Compositional generalization is essential for visual generative models, but the mechanisms that enable or inhibit it are not yet fully understood. Method: Controlled experiments systematically analyze the effect of the training objective's distribution type (discrete or continuous) and of how much information conditioning provides about the constituent concepts during training. Result: Two key factors for compositional generalization are identified, and adding an auxiliary continuous JEPA-based objective to discrete models such as MaskGIT is shown to improve their compositional performance. Conclusion: Designs that combine discrete and continuous training objectives can strengthen compositional generalization in visual generative models. Abstract: Compositional generalization, the ability to generate novel combinations of known concepts, is a key ingredient for visual generative models. Yet, not all mechanisms that enable or inhibit it are fully understood. In this work, we conduct a systematic study of how various design choices influence compositional generalization in image and video generation in a positive or negative way. Through controlled experiments, we identify two key factors: (i) whether the training objective operates on a discrete or continuous distribution, and (ii) to what extent conditioning provides information about the constituent concepts during training. Building on these insights, we show that relaxing the MaskGIT discrete loss with an auxiliary continuous JEPA-based objective can improve compositional performance in discrete models like MaskGIT.

[117] Latent Diffusion Unlearning: Protecting Against Unauthorized Personalization Through Trajectory Shifted Perturbations

Naresh Kumar Devulapally,Shruti Agarwal,Tejas Gokhale,Vishnu Suresh Lokhande

Main category: cs.CV

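TL;DR: Proposes a latent-space perturbation strategy that uses trajectory-shifted sampling in the latent space of diffusion models to produce "unlearnable" images that downstream models struggle to learn from yet remain visually faithful to the originals, improving both the imperceptibility and the robustness of the protection.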

Details

Motivation: Existing image poisoning methods operate in pixel space and introduce visible noise and artifacts, so they cannot protect against unauthorized model adaptation while preserving visual quality; a more imperceptible and effective defense is needed. Method: A model-based perturbation strategy in the latent space of diffusion models alternates between denoising and inversion and shifts the starting point of the denoising trajectory, so that the resulting images resist personalization by downstream generative models while keeping high visual fidelity. Result: On four benchmark datasets the method is robust against state-of-the-art inversion attacks; imperceptibility improves by about 8%-10% on metrics including PSNR, SSIM, and FID, and robustness improves by about 10% on average across five adversarial settings. Conclusion: The method integrates unlearnability into the latent diffusion model framework, providing a practical and imperceptible defense against unauthorized replication and use of sensitive data. Abstract: Text-to-image diffusion models have demonstrated remarkable effectiveness in rapid and high-fidelity personalization, even when provided with only a few user images. However, the effectiveness of personalization techniques has led to concerns regarding data privacy, intellectual property protection, and unauthorized usage. To mitigate such unauthorized usage and model replication, the idea of generating "unlearnable" training samples utilizing image poisoning techniques has emerged. Existing methods for this have limited imperceptibility as they operate in the pixel space, which results in images with noise and artifacts. In this work, we propose a novel model-based perturbation strategy that operates within the latent space of diffusion models. Our method alternates between denoising and inversion while modifying the starting point of the diffusion model's denoising trajectory. This trajectory-shifted sampling ensures that the perturbed images maintain high visual fidelity to the original inputs while being resistant to inversion and personalization by downstream generative models. This approach integrates unlearnability into the framework of Latent Diffusion Models (LDMs), enabling a practical and imperceptible defense against unauthorized model adaptation. We validate our approach on four benchmark datasets to demonstrate robustness against state-of-the-art inversion attacks.
Results demonstrate that our method achieves significant improvements in imperceptibility (approximately 8%-10% on perceptual metrics including PSNR, SSIM, and FID) and robustness (approximately 10% on average across five adversarial settings), highlighting its effectiveness in safeguarding sensitive data.

[118] Geometry Meets Vision: Revisiting Pretrained Semantics in Distilled Fields

Zhiting Mei,Ola Shorinwa,Anirudha Majumdar

Main category: cs.CV

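TL;DR: This paper studies the role of geometry-grounded semantic features in distilled radiance fields and proposes SPINE, a new framework for inverting radiance fields without an initial guess. It finds that visual-only features are more versatile for downstream tasks, while geometry-grounded features carry more geometric detail yet yield lower pose estimation accuracy.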

Details

Motivation: To determine whether geometry-grounded semantic features offer an advantage in distilled radiance fields, especially for spatial tasks such as pose estimation and semantic localization. Method: Visual-only and geometry-grounded pretrained semantic features are compared on geometry awareness, object localization, and radiance field inversion, and the SPINE framework is proposed for radiance field inversion without an initial guess. Result: Geometry-grounded features contain finer structural detail but bring no significant gain in semantic localization and reduce pose estimation accuracy; SPINE inverts radiance fields effectively. Conclusion: Visual-only features are more versatile across downstream tasks, and geometry-grounding needs further research to become more effective and broadly applicable. Abstract: Semantic distillation in radiance fields has spurred significant advances in open-vocabulary robot policies, e.g., in manipulation and navigation, founded on pretrained semantics from large vision models. While prior work has demonstrated the effectiveness of visual-only semantic features (e.g., DINO and CLIP) in Gaussian Splatting and neural radiance fields, the potential benefit of geometry-grounding in distilled fields remains an open question. In principle, visual-geometry features seem very promising for spatial tasks such as pose estimation, prompting the question: Do geometry-grounded semantic features offer an edge in distilled fields? Specifically, we ask three critical questions: First, does spatial-grounding produce higher-fidelity geometry-aware semantic features? We find that image features from geometry-grounded backbones contain finer structural details compared to their counterparts. Secondly, does geometry-grounding improve semantic object localization? We observe no significant difference in this task. Thirdly, does geometry-grounding enable higher-accuracy radiance field inversion? Given the limitations of prior work and their lack of semantics integration, we propose a novel framework SPINE for inverting radiance fields without an initial guess, consisting of two core components: coarse inversion using distilled semantics, and fine inversion using photometric-based optimization. Surprisingly, we find that the pose estimation accuracy decreases with geometry-grounded features. Our results suggest that visual-only features offer greater versatility for a broader range of downstream tasks, although geometry-grounded features contain more geometric detail.
Notably, our findings underscore the necessity of future research on effective strategies for geometry-grounding that augment the versatility and performance of pretrained semantic features.

[119] GeoComplete: Geometry-Aware Diffusion for Reference-Driven Image Completion

Beibei Lin,Tingting Chen,Robby T. Tan

Main category: cs.CV

TL;DR: 提出GeoComplete，一种结合3D结构引导的双分支扩散框架，用于参考驱动的图像补全，显著提升几何准确性和视觉质量。

Details

Motivation: Existing methods lack geometric cues, so they produce misaligned or implausible content when the target view differs greatly from the reference images. Method: The diffusion process is conditioned on projected point clouds to supply geometric information, a target-aware masking mechanism is introduced, and a dual-branch diffusion architecture with joint self-attention across branches yields consistent and accurate completion. Result: Experiments show a 17.1 PSNR improvement over state-of-the-art methods, with strong results on both geometric accuracy and visual quality. Conclusion: By combining explicit 3D structural guidance with a target-aware masking strategy, GeoComplete offers a unified and robust solution for geometry-conditioned image completion. Abstract: Reference-driven image completion, which restores missing regions in a target view using additional images, is particularly challenging when the target view differs significantly from the references. Existing generative methods rely solely on diffusion priors and, without geometric cues such as camera pose or depth, often produce misaligned or implausible content. We propose GeoComplete, a novel framework that incorporates explicit 3D structural guidance to enforce geometric consistency in the completed regions, setting it apart from prior image-only approaches. GeoComplete introduces two key ideas: conditioning the diffusion process on projected point clouds to infuse geometric information, and applying target-aware masking to guide the model toward relevant reference cues. The framework features a dual-branch diffusion architecture. One branch synthesizes the missing regions from the masked target, while the other extracts geometric features from the projected point cloud. Joint self-attention across branches ensures coherent and accurate completion. To address regions visible in references but absent in the target, we project the target view into each reference to detect occluded areas, which are then masked during training. This target-aware masking directs the model to focus on useful cues, enhancing performance in difficult scenarios. By integrating a geometry-aware dual-branch diffusion architecture with a target-aware masking strategy, GeoComplete offers a unified and robust solution for geometry-conditioned image completion. Experiments show that GeoComplete achieves a 17.1 PSNR improvement over state-of-the-art methods, significantly boosting geometric accuracy while maintaining high visual quality.

[120] Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction

Kaisi Guan,Xihua Wang,Zhengfeng Lai,Xin Cheng,Peng Zhang,XiaoJiang Liu,Ruihua Song,Meng Cao

Main category: cs.CV

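TL;DR: This paper proposes a new approach to text-to-sounding-video (T2SV) generation: Hierarchical Visual-Grounded Captioning (HVGC) produces disentangled video and audio captions, and BridgeDiT, a dual-tower diffusion transformer, uses a Dual CrossAttention (DCA) mechanism for semantic and temporal cross-modal synchronization, achieving state-of-the-art results on most metrics across three benchmark datasets.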

Details

Motivation: Existing T2SV methods use a single shared text caption, which causes interference between the audio and video modalities, and they lack an effective cross-modal interaction mechanism; both limit generation quality. Method: The HVGC framework generates separate video and audio captions to remove interference at the conditioning stage, and the BridgeDiT model uses a Dual CrossAttention mechanism for symmetric, bidirectional information exchange. Result: Experiments on three benchmark datasets, supported by human evaluations, show state-of-the-art results on most metrics, with better audio-video synchronization and text alignment. Conclusion: HVGC and BridgeDiT address modal interference and cross-modal interaction, improve T2SV generation quality, and offer insights for future work. Abstract: This study focuses on a challenging yet promising task, Text-to-Sounding-Video (T2SV) generation, which aims to generate a video with synchronized audio from text conditions while ensuring both modalities are aligned with the text. Despite progress in joint audio-video training, two critical challenges remain unaddressed: (1) a single, shared text caption where the text for video is equal to the text for audio often creates modal interference, confusing the pretrained backbones, and (2) the optimal mechanism for cross-modal feature interaction remains unclear. To address these challenges, we first propose the Hierarchical Visual-Grounded Captioning (HVGC) framework that generates pairs of disentangled captions, a video caption, and an audio caption, eliminating interference at the conditioning stage. Based on HVGC, we further introduce BridgeDiT, a novel dual-tower diffusion transformer, which employs a Dual CrossAttention (DCA) mechanism that acts as a robust "bridge" to enable a symmetric, bidirectional exchange of information, achieving both semantic and temporal synchronization. Extensive experiments on three benchmark datasets, supported by human evaluations, demonstrate that our method achieves state-of-the-art results on most metrics. Comprehensive ablation studies further validate the effectiveness of our contributions, offering key insights for the future T2SV task. All the codes and checkpoints will be publicly released.

[121] HAVIR: HierArchical Vision to Image Reconstruction using CLIP-Guided Versatile Diffusion

Shiyi Zhang,Dong Liang,Hairong Zheng,Yihang Zhou

Main category: cs.CV

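TL;DR: Proposes HAVIR, a model that hierarchically extracts structural and semantic features from the visual cortex and combines them through a diffusion model for high-quality visual reconstruction.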

Details

Motivation: Existing methods struggle to reconstruct complex natural scenes, mainly because low-level features are heterogeneous and high-level features are semantically entangled. Method: The visual cortex is divided into two hierarchical regions. A Structural Generator extracts structural information and converts it into latent diffusion priors, a Semantic Extractor produces CLIP embeddings, and the Versatile Diffusion model fuses the two to generate the image. Result: Experiments show that HAVIR improves both structural and semantic reconstruction quality over existing models, including in complex scenes. Conclusion: Through its hierarchical decoupling strategy, HAVIR improves the accuracy of reconstructing visual images from brain activity and advances the integration of neuroscience and computer vision. Abstract: The reconstruction of visual information from brain activity fosters interdisciplinary integration between neuroscience and computer vision. However, existing methods still face challenges in accurately recovering highly complex visual stimuli. This difficulty stems from the characteristics of natural scenes: low-level features exhibit heterogeneity, while high-level features show semantic entanglement due to contextual overlaps. Inspired by the hierarchical representation theory of the visual cortex, we propose the HAVIR model, which separates the visual cortex into two hierarchical regions and extracts distinct features from each. Specifically, the Structural Generator extracts structural information from spatial processing voxels and converts it into latent diffusion priors, while the Semantic Extractor converts semantic processing voxels into CLIP embeddings. These components are integrated via the Versatile Diffusion model to synthesize the final image. Experimental results demonstrate that HAVIR enhances both the structural and semantic quality of reconstructions, even in complex scenes, and outperforms existing models.

[122] Mask2IV: Interaction-Centric Video Generation via Mask Trajectories

Gen Li,Bo Zhao,Jianfei Yang,Laura Sevilla-Lara

Main category: cs.CV

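TL;DR: This paper proposes Mask2IV, an interaction-centric video generation framework that needs no dense mask inputs and uses a decoupled two-stage pipeline to generate high-quality, controllable interaction videos.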

Details

Motivation: Existing methods struggle to model complex interaction dynamics and depend on dense, precise mask annotations, which limits their use in real-world settings. Method: A decoupled two-stage framework: the first stage predicts motion trajectories for the actor and the object, and the second stage generates a video conditioned on those trajectories; control is possible through action descriptions or spatial position cues. Result: On two newly curated benchmarks covering human-object interaction and robotic manipulation, Mask2IV outperforms existing baselines in visual realism and controllability. Conclusion: Mask2IV removes the need for dense mask annotations and enables flexible, intuitive, high-quality interaction-centric video generation. Abstract: Generating interaction-centric videos, such as those depicting humans or robots interacting with objects, is crucial for embodied intelligence, as they provide rich and diverse visual priors for robot learning, manipulation policy training, and affordance reasoning. However, existing methods often struggle to model such complex and dynamic interactions. While recent studies show that masks can serve as effective control signals and enhance generation quality, obtaining dense and precise mask annotations remains a major challenge for real-world use. To overcome this limitation, we introduce Mask2IV, a novel framework specifically designed for interaction-centric video generation. It adopts a decoupled two-stage pipeline that first predicts plausible motion trajectories for both actor and object, then generates a video conditioned on these trajectories. This design eliminates the need for dense mask inputs from users while preserving the flexibility to manipulate the interaction process. Furthermore, Mask2IV supports versatile and intuitive control, allowing users to specify the target object of interaction and guide the motion trajectory through action descriptions or spatial position cues. To support systematic training and evaluation, we curate two benchmarks covering diverse action and object categories across both human-object interaction and robotic manipulation scenarios. Extensive experiments demonstrate that our method achieves superior visual realism and controllability compared to existing baselines.

[123] ReeMark: Reeb Graphs for Simulating Patterns of Life in Spatiotemporal Trajectories

Anantajit Subrahmanya,Chandrakanth Gudavalli,Connor Levenson,Umang Garg,B. S. Manjunath

Main category: cs.CV

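TL;DR: This paper proposes Markovian Reeb Graphs, a new framework for simulating spatiotemporal trajectories that preserve Patterns of Life (PoLs); it generates realistic future trajectories while remaining data- and compute-efficient.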

Details

Motivation: Accurately modeling human mobility is critical for urban planning, epidemiology, and traffic management, yet existing methods struggle to capture both individual- and population-level mobility patterns while staying efficient. Method: The Markovian Reeb Graph framework combines individual- and population-level mobility structures in a probabilistic topological model to generate realistic spatiotemporal trajectories. Result: Evaluated on the Atlanta and Berlin subsets of the Urban Anomalies dataset, the method achieves strong fidelity as measured by Jensen-Shannon divergence on population- and agent-level metrics, while remaining compute- and data-efficient. Conclusion: Markovian Reeb Graphs are a scalable framework for trajectory simulation with broad applicability across diverse urban environments. Abstract: Accurately modeling human mobility is critical for urban planning, epidemiology, and traffic management. In this work, we introduce Markovian Reeb Graphs, a novel framework for simulating spatiotemporal trajectories that preserve Patterns of Life (PoLs) learned from baseline data. By combining individual- and population-level mobility structures within a probabilistic topological model, our approach generates realistic future trajectories that capture both consistency and variability in daily life. Evaluations on the Urban Anomalies dataset (Atlanta and Berlin subsets) using the Jensen-Shannon Divergence (JSD) across population- and agent-level metrics demonstrate that the proposed method achieves strong fidelity while remaining data- and compute-efficient. These results position Markovian Reeb Graphs as a scalable framework for trajectory simulation with broad applicability across diverse urban environments.
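For readers unfamiliar with the evaluation metric, the following is a small sketch of the Jensen-Shannon divergence between two discrete distributions, for example histograms of a mobility statistic from real and simulated trajectories. The function name and the choice of base-2 logarithms are assumptions for illustration; the paper's exact binning and metric definitions are not given in the abstract.

```python
import math

def jensen_shannon_divergence(p, q, base=2.0):
    """JSD between two discrete distributions given as equal-length
    sequences of non-negative weights (normalized internally)."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of bins")
    total_p, total_q = sum(p), sum(q)
    p = [x / total_p for x in p]
    q = [x / total_q for x in q]
    # Mixture distribution M = (P + Q) / 2.
    m = [(a + b) / 2.0 for a, b in zip(p, q)]

    def kl(a, b):
        # KL(a || b); bins with a == 0 contribute nothing.
        return sum(x * math.log(x / y, base) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (disjoint support), so lower is better when comparing simulated against real trajectories.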

[124] SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus

Ming Zhao,Wenhui Dong,Yang Zhang,Xiang Zheng,Zhonghao Zhang,Zian Zhou,Yunzhi Guan,Liukun Xu,Wei Peng,Zhaoyang Gong,Zhicheng Zhang,Dachuan Li,Xiaosheng Ma,Yuli Ma,Jianing Ni,Changjiang Jiang,Lixia Tian,Qixin Chen,Kaishun Xia,Pingping Liu,Tongshun Zhang,Zhiqiang Liu,Zhongan Bi,Chenyang Si,Tiansheng Sun,Caifeng Shan

Main category: cs.CV

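TL;DR: This paper presents SpineMed, an ecosystem for AI-assisted diagnosis of spine disorders that comprises the large-scale multimodal dataset SpineMed-450k and the clinically oriented evaluation framework SpineBench; fine-tuning on the dataset significantly improves model performance on vertebral-level reasoning tasks.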

Details

Motivation: Spine disorders are widespread, but AI diagnosis is held back by the lack of multimodal datasets with vertebral-level annotations and of clinically traceable instruction data; data and evaluation resources that match clinical decision-making are needed. Method: The SpineMed ecosystem was co-designed with practicing spine surgeons. The multimodal dataset SpineMed-450k, with over 450,000 instruction instances, is built with a two-stage LLM generation method (draft and revision) inside a clinician-in-the-loop pipeline to ensure data quality; SpineBench is proposed as an evaluation framework that scores models on level identification, pathology assessment, and surgical planning. Result: Experiments on SpineBench show that current advanced large vision-language models have systematic weaknesses in fine-grained, level-specific reasoning, whereas the model fine-tuned on SpineMed-450k improves significantly on all tasks; clinician assessments confirm the diagnostic clarity and practical utility of its outputs. Conclusion: SpineMed fills the gap in high-quality, multimodal, traceable instruction data for AI diagnosis of spine disorders and provides a foundation for clinically usable AI-assisted diagnostic systems. Abstract: Spine disorders affect 619 million people globally and are a leading cause of disability, yet AI-assisted diagnosis remains limited by the lack of level-aware, multimodal datasets. Clinical decision-making for spine disorders requires sophisticated reasoning across X-ray, CT, and MRI at specific vertebral levels. However, progress has been constrained by the absence of traceable, clinically-grounded instruction data and standardized, spine-specific benchmarks. To address this, we introduce SpineMed, an ecosystem co-designed with practicing spine surgeons. It features SpineMed-450k, the first large-scale dataset explicitly designed for vertebral-level reasoning across imaging modalities with over 450,000 instruction instances, and SpineBench, a clinically-grounded evaluation framework. SpineMed-450k is curated from diverse sources, including textbooks, guidelines, open datasets, and ~1,000 de-identified hospital cases, using a clinician-in-the-loop pipeline with a two-stage LLM generation method (draft and revision) to ensure high-quality, traceable data for question-answering, multi-turn consultations, and report generation. SpineBench evaluates models on clinically salient axes, including level identification, pathology assessment, and surgical planning. Our comprehensive evaluation of several recently advanced large vision-language models (LVLMs) on SpineBench reveals systematic weaknesses in fine-grained, level-specific reasoning. In contrast, our model fine-tuned on SpineMed-450k demonstrates consistent and significant improvements across all tasks. Clinician assessments confirm the diagnostic clarity and practical utility of our model's outputs.

[125] UniShield: An Adaptive Multi-Agent Framework for Unified Forgery Image Detection and Localization

Qing Huang,Zhipei Xu,Xuanyu Zhang,Jian Zhang

Main category: cs.CV

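TL;DR: This paper proposes UniShield, a multi-agent unified system for image forgery detection and localization that works across several domains (image manipulation, document manipulation, DeepFake, and AI-generated images) and produces interpretable reports; experiments show it outperforms existing methods.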

Details

Motivation: Existing forged-image detection methods are usually confined to a single domain, generalize poorly across domains, and lack an adaptive unified framework, which limits practical use. Method: UniShield consists of a perception agent and a detection agent. The perception agent analyzes image features and dynamically selects suitable detection models; the detection agent integrates multiple expert detectors and generates interpretable reports. Result: Across multiple forged-image detection tasks, UniShield achieves state-of-the-art performance, surpassing both existing unified approaches and domain-specific detectors. Conclusion: UniShield is practical, adaptive, and scalable, making it an effective unified solution to diverse image forgery threats. Abstract: With the rapid advancements in image generation, synthetic images have become increasingly realistic, posing significant societal risks, such as misinformation and fraud. Forgery Image Detection and Localization (FIDL) thus emerges as essential for maintaining information integrity and societal security. Despite impressive performances by existing domain-specific detection methods, their practical applicability remains limited, primarily due to their narrow specialization, poor cross-domain generalization, and the absence of an integrated adaptive framework. To address these issues, we propose UniShield, a novel multi-agent-based unified system capable of detecting and localizing image forgeries across diverse domains, including image manipulation, document manipulation, DeepFake, and AI-generated images. UniShield innovatively integrates a perception agent with a detection agent. The perception agent intelligently analyzes image features to dynamically select suitable detection models, while the detection agent consolidates various expert detectors into a unified framework and generates interpretable reports. Extensive experiments show that UniShield achieves state-of-the-art results, surpassing both existing unified approaches and domain-specific detectors, highlighting its superior practicality, adaptiveness, and scalability.

[126] ROGR: Relightable 3D Objects using Generative Relighting

Jiapeng Tang,Matthew Lavine,Dor Verbin,Stephan J. Garbin,Matthias Nießner,Ricardo Martin Brualla,Pratul P. Srinivasan,Philipp Henzler

Main category: cs.CV

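TL;DR: This paper proposes ROGR, a method that uses a generative relighting model to reconstruct relightable 3D models of objects captured from multiple views, with a dual-branch Neural Radiance Field (NeRF) architecture that enables efficient feed-forward relighting under arbitrary environment illumination.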

Details

Motivation: To reconstruct the appearance of real objects accurately under different lighting and relight them efficiently, without the per-illumination optimization or light transport simulation that traditional methods require. Method: A generative relighting model samples the object's appearance under multiple lighting environments to build a dataset, which is used to train a lighting-conditioned NeRF with a novel dual-branch architecture that encodes general lighting effects and specularities separately. Result: The method improves on the state of the art on most metrics on the TensoIR and Stanford-ORB datasets and is demonstrated on relighting real-world object captures. Conclusion: ROGR efficiently produces high-quality relightable 3D models that support realistic rendering under arbitrary environment lighting, and it is practical and extensible. Abstract: We introduce ROGR, a novel approach that reconstructs a relightable 3D model of an object captured from multiple views, driven by a generative relighting model that simulates the effects of placing the object under novel environment illuminations. Our method samples the appearance of the object under multiple lighting environments, creating a dataset that is used to train a lighting-conditioned Neural Radiance Field (NeRF) that outputs the object's appearance under any input environmental lighting. The lighting-conditioned NeRF uses a novel dual-branch architecture to encode the general lighting effects and specularities separately. The optimized lighting-conditioned NeRF enables efficient feed-forward relighting under arbitrary environment maps without requiring per-illumination optimization or light transport simulation. We evaluate our approach on the established TensoIR and Stanford-ORB datasets, where it improves upon the state-of-the-art on most metrics, and showcase our approach on real-world object captures.

[127] Dynamic Prompt Generation for Interactive 3D Medical Image Segmentation Training

Tidiane Camaret Ndir,Alexander Pfefferle,Robin Tibor Schirrmeister

Main category: cs.CV

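TL;DR: Proposes a training strategy that combines dynamic volumetric prompt generation with content-aware adaptive cropping to optimize use of the image encoder in 3D biomedical image segmentation; it trains efficiently on a single GPU and performs strongly in competition.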

Details

Motivation: Existing foundation models either lack volumetric awareness or have limited interactive capabilities, so they fall short of what interactive 3D biomedical image segmentation requires. Method: A training strategy with dynamic volumetric prompt generation and content-aware adaptive cropping simulates realistic user interaction patterns; the network is initialized from the publicly available weights of the nnInteractive segmentation model. Result: The method handles the computational challenges of learning from sequential refinement feedback on a single GPU. Competition evaluation gives an average final Dice score of 0.6385, a normalized surface distance (NSD) of 0.6614, and area-under-the-curve values of 2.4799 (Dice) and 2.5671 (NSD). Conclusion: The method improves interactive 3D biomedical image segmentation, combining efficient training with good segmentation quality, and suits resource-constrained settings. Abstract: Interactive 3D biomedical image segmentation requires efficient models that can iteratively refine predictions based on user prompts. Current foundation models either lack volumetric awareness or suffer from limited interactive capabilities. We propose a training strategy that combines dynamic volumetric prompt generation with content-aware adaptive cropping to optimize the use of the image encoder. Our method simulates realistic user interaction patterns during training while addressing the computational challenges of learning from sequential refinement feedback on a single GPU. For efficient training, we initialize our network using the publicly available weights from the nnInteractive segmentation model. Evaluation on the Foundation Models for Interactive 3D Biomedical Image Segmentation competition demonstrates strong performance with an average final Dice score of 0.6385, normalized surface distance of 0.6614, and area-under-the-curve metrics of 2.4799 (Dice) and 2.5671 (NSD).
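As a reminder of what the headline metric measures, the sketch below computes the standard Dice overlap, 2|A∩B| / (|A| + |B|), between two binary 3D masks. The function name and the convention of returning 1.0 when both masks are empty are assumptions; the competition's area-under-the-curve aggregation over interaction steps is not reproduced here.

```python
import numpy as np

def dice_score(pred, target):
    """Dice overlap between two binary masks of identical shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    if pred.shape != target.shape:
        raise ValueError("masks must have the same shape")
    intersection = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return float(2.0 * intersection / denom)
```

A score of 1.0 means the predicted and reference volumes coincide exactly, and 0.0 means they share no voxels.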

[128] Product-Quantised Image Representation for High-Quality Image Synthesis

Denis Zavadski,Nikita Philip Tatsch,Carsten Rother

Main category: cs.CV

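TL;DR: This paper proposes PQGAN, a quantised image autoencoder that brings product quantisation (PQ) into the VQGAN framework; it clearly outperforms existing methods in reconstruction and integrates seamlessly into pre-trained diffusion models, enabling faster generation or higher output resolution.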

Details

Motivation: Although product quantisation (PQ) works well for scalable vector encoding, it has seen little use for latent representations in high-fidelity image generation; this paper explores its potential there and addresses the performance bottleneck of existing quantisation methods. Method: PQ is integrated into VQGAN's vector quantisation (VQ) framework, and the relationship between codebook size, embedding dimensionality, and subspace factorisation is analysed systematically for its effect on quantisation performance. Result: PQGAN reaches a PSNR of 37 dB (prior work: 27 dB) and reduces FID, LPIPS, and CMMD by up to 96%; VQ and PQ show opposite performance trends as the embedding dimension is scaled, and the observed PQ trends help guide hyperparameter selection. Conclusion: PQGAN clearly outperforms existing methods in image reconstruction and integrates efficiently into diffusion models, giving faster generation or doubled resolution, which positions PQ as a strong extension for discrete latent representations in image synthesis. Abstract: Product quantisation (PQ) is a classical method for scalable vector encoding, yet it has seen limited usage for latent representations in high-fidelity image generation. In this work, we introduce PQGAN, a quantised image autoencoder that integrates PQ into the well-known vector quantisation (VQ) framework of VQGAN. PQGAN achieves a noticeable improvement over state-of-the-art methods in terms of reconstruction performance, including both quantisation methods and their continuous counterparts. We achieve a PSNR score of 37dB, where prior work achieves 27dB, and are able to reduce the FID, LPIPS, and CMMD score by up to 96%. Our key to success is a thorough analysis of the interaction between codebook size, embedding dimensionality, and subspace factorisation, with vector and scalar quantisation as special cases. We obtain novel findings, such that the performance of VQ and PQ behaves in opposite ways when scaling the embedding dimension. Furthermore, our analysis shows performance trends for PQ that help guide optimal hyperparameter selection. Finally, we demonstrate that PQGAN can be seamlessly integrated into pre-trained diffusion models. This enables either a significantly faster and more compute-efficient generation, or a doubling of the output resolution at no additional cost, positioning PQ as a strong extension for discrete latent representation in image synthesis.
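The core of classical product quantisation can be sketched in a few lines: a D-dimensional vector is split into M subvectors, and each subvector is replaced by the index of its nearest codeword in a per-subspace codebook of size K. The code below is a generic illustration of that technique, not PQGAN's implementation; the function names and the array layout `(M, K, D // M)` are assumptions.

```python
import numpy as np

def pq_encode(x, codebooks):
    """Encode a vector with product quantisation.

    x:         (D,) vector to quantise.
    codebooks: (M, K, D // M) array, one codebook of K codewords per subspace.
    Returns an (M,) array of codeword indices.
    """
    num_subspaces, _, sub_dim = codebooks.shape
    sub = x.reshape(num_subspaces, sub_dim)
    # Squared Euclidean distance from each subvector to every codeword: (M, K).
    dists = ((codebooks - sub[:, None, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def pq_decode(codes, codebooks):
    """Reconstruct a (D,) vector by concatenating the selected codewords."""
    num_subspaces = codebooks.shape[0]
    return codebooks[np.arange(num_subspaces), codes].reshape(-1)
```

With M subspaces of K codewords each, the number of representable vectors is K**M while only M * K codewords are stored. Setting M = 1 recovers plain vector quantisation, and M = D (one dimension per subspace) gives scalar quantisation, the two special cases the abstract mentions.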

[129] Memory Forcing: Spatio-Temporal Memory for Consistent Scene Generation on Minecraft

Junchao Huang,Xinting Hu,Boyao Han,Shaoshuai Shi,Zhuotao Tian,Tianyu He,Li Jiang

Main category: cs.CV

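TL;DR: Proposes the Memory Forcing framework, which uses a geometry-indexed spatial memory and hybrid training strategies to achieve long-term spatial consistency while preserving generation quality.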

Details

Motivation: Existing autoregressive video diffusion models struggle to generate natural content in new scenes while staying spatially consistent when revisiting explored areas, and under limited compute they must compress and exploit historical information. Method: A geometry-indexed spatial memory is combined with Hybrid Training and Chained Forward Training, and history is used efficiently through Point-to-Frame Retrieval and Incremental 3D Reconstruction. Result: The method achieves better long-term spatial consistency and generation quality across diverse environments while remaining computationally efficient on long sequences. Conclusion: Memory Forcing balances memory use between exploration and revisiting, significantly improving the world-modeling ability of video diffusion models under a limited compute budget. Abstract: Autoregressive video diffusion models have proved effective for world modeling and interactive scene generation, with Minecraft gameplay as a representative application. To faithfully simulate play, a model must generate natural content while exploring new scenes and preserve spatial consistency when revisiting explored areas. Under limited computation budgets, it must compress and exploit historical cues within a finite context window, which exposes a trade-off: Temporal-only memory lacks long-term spatial consistency, whereas adding spatial memory strengthens consistency but may degrade new scene generation quality when the model over-relies on insufficient spatial context. We present Memory Forcing, a learning framework that pairs training protocols with a geometry-indexed spatial memory. Hybrid Training exposes distinct gameplay regimes, guiding the model to rely on temporal memory during exploration and incorporate spatial memory for revisits. Chained Forward Training extends autoregressive training with model rollouts, where chained predictions create larger pose variations and encourage reliance on spatial memory for maintaining consistency. Point-to-Frame Retrieval efficiently retrieves history by mapping currently visible points to their source frames, while Incremental 3D Reconstruction maintains and updates an explicit 3D cache. Extensive experiments demonstrate that Memory Forcing achieves superior long-term spatial consistency and generative quality across diverse environments, while maintaining computational efficiency for extended sequences.

[130] MonSTeR: a Unified Model for Motion, Scene, Text Retrieval

Luca Collorone,Matteo Gioia,Massimiliano Pappa,Paolo Leoni,Giovanni Ficarra,Or Litany,Indro Spinelli,Fabio Galasso

Main category: cs.CV

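TL;DR: This paper proposes MonSTeR, the first model for motion-scene-text retrieval. It models the complex relations among the three modalities in a unified latent space, enabling flexible and robust cross-modal retrieval, and its effectiveness and agreement with human preferences are validated on several tasks.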

Details

Motivation: Existing research lacks tools to evaluate the alignment between skeletal movement, intention, and environmental context, even though human behavior is driven by intention and depends on a supporting environment; a framework that models all three jointly is needed. Method: MonSTeR builds a unified latent space from unimodal and cross-modal representations and models higher-order relations to capture the complex dependencies among motion, scene, and text, enabling trimodal retrieval. Result: MonSTeR outperforms trimodal models that rely solely on unimodal representations; a user study shows its retrieval scores align with human preferences; and zero-shot in-scene object placement and motion captioning demonstrate the versatility of its latent space. Conclusion: MonSTeR effectively models the alignment among motion, scene, and text, and its unified latent space supports multiple downstream tasks, providing a new tool for multimodal human-scene interaction and behavior understanding. Abstract: Intention drives human movement in complex environments, but such movement can only happen if the surrounding context supports it. Despite the intuitive nature of this mechanism, existing research has not yet provided tools to evaluate the alignment between skeletal movement (motion), intention (text), and the surrounding context (scene). In this work, we introduce MonSTeR, the first MOtioN-Scene-TExt Retrieval model. Inspired by the modeling of higher-order relations, MonSTeR constructs a unified latent space by leveraging unimodal and cross-modal representations. This allows MonSTeR to capture the intricate dependencies between modalities, enabling flexible but robust retrieval across various tasks. Our results show that MonSTeR outperforms trimodal models that rely solely on unimodal representations. Furthermore, we validate the alignment of our retrieval scores with human preferences through a dedicated user study. We demonstrate the versatility of MonSTeR's latent space on zero-shot in-Scene Object Placement and Motion Captioning. Code and pre-trained models are available at github.com/colloroneluca/MonSTeR.

[131] Test-Time Defense Against Adversarial Attacks via Stochastic Resonance of Latent Ensembles

Dong Lao,Yuxiang Zhang,Haniyeh Ehsani Oskouie,Yangchao Wu,Alex Wong,Stefano Soatto

Main category: cs.CV

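TL;DR: Proposes a test-time defense based on stochastic resonance that strengthens robustness to adversarial attacks by applying small translational perturbations and aligning the resulting features; it needs no training, is independent of network architecture and attack type, and is broadly applicable.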

Details

Motivation: Existing defenses often rely on feature filtering or smoothing, which tends to lose information; a general defense is needed that resists adversarial attacks while minimizing information loss. Method: Small translational perturbations are applied to the input image, the transformed feature embeddings are aligned and aggregated, and the result is mapped back to the original reference image; this has a closed-form expression and needs no additional modules or fine-tuning. Result: The method shows leading robustness on image classification, stereo matching, and optical flow, recovering up to 68.1%, 71.9%, and 29.2% of the performance loss, respectively, under various adversarial attacks. Conclusion: The method is a fully training-free, architecture-agnostic, and attack-agnostic test-time defense, the first generic one shown to work for dense prediction tasks, and it is practical and extensible. Abstract: We propose a test-time defense mechanism against adversarial attacks: imperceptible image perturbations that significantly alter the predictions of a model. Unlike existing methods that rely on feature filtering or smoothing, which can lead to information loss, we propose to "combat noise with noise" by leveraging stochastic resonance to enhance robustness while minimizing information loss. Our approach introduces small translational perturbations to the input image, aligns the transformed feature embeddings, and aggregates them before mapping back to the original reference image. This can be expressed in a closed-form formula, which can be deployed on diverse existing network architectures without introducing additional network modules or fine-tuning for specific attack types. The resulting method is entirely training-free, architecture-agnostic, and attack-agnostic. Empirical results show state-of-the-art robustness on image classification and, for the first time, establish a generic test-time defense for dense prediction tasks, including stereo matching and optical flow, highlighting the method's versatility and practicality. Specifically, relative to clean (unperturbed) performance, our method recovers up to 68.1% of the accuracy loss on image classification, 71.9% on stereo matching, and 29.2% on optical flow under various types of adversarial attacks.
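The shift, align, and aggregate steps described above can be illustrated with a toy NumPy sketch. It is a simplification under stated assumptions, not the paper's closed-form method: shifts are whole pixels applied circularly with `np.roll`, the hypothetical `feature_fn` returns a feature map with the same spatial size as its input, and aggregation is a plain mean.

```python
import numpy as np

def translation_ensemble(image, feature_fn, max_shift=1):
    """Average feature maps computed on slightly translated copies of `image`.

    image:      (H, W) or (H, W, C) array.
    feature_fn: callable mapping an image to a feature map with the same
                leading (H, W) layout (an assumption of this sketch).
    max_shift:  largest shift, in pixels, along each spatial axis.
    """
    aligned = []
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            # 1) Perturb: translate the input by (dy, dx).
            shifted = np.roll(image, shift=(dy, dx), axis=(0, 1))
            feat = feature_fn(shifted)
            # 2) Align: undo the translation in feature space.
            aligned.append(np.roll(feat, shift=(-dy, -dx), axis=(0, 1)))
    # 3) Aggregate: average the aligned feature maps.
    return np.mean(aligned, axis=0)
```

For a perfectly translation-equivariant `feature_fn` the ensemble simply returns `feature_fn(image)`; the averaging only has an effect when the features respond differently to each shifted copy, which is the situation the defense is meant to exploit.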

[132] MIXER: Mixed Hyperspherical Random Embedding Neural Network for Texture Recognition

Ricardo T. Fares,Lucas C. Ribas

Main category: cs.CV

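TL;DR: This paper proposes Mixer, a new randomized neural network for texture representation learning. It combines hyperspherical random embeddings with a dual-branch learning module to capture intra- and inter-channel relationships, uses a newly formulated optimization problem to strengthen the representation, and performs well on several pure-texture benchmarks.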

Details

Motivation: Existing randomized neural networks for texture recognition focus mainly on cross-information prediction and have not significantly improved the overall architecture, so a more effective network design is needed to strengthen texture representation. Method: The Mixer model uses hyperspherical random embeddings and a dual-branch learning module to capture intra- and inter-channel relationships, together with a new optimization problem for building rich texture representations. Result: Experiments on several pure-texture benchmarks, each with distinct characteristics and challenges, show strong results for the proposed method. Conclusion: By improving the randomized network architecture, Mixer performs strongly in texture representation learning and offers a new, effective solution for the field. Abstract: Randomized neural networks for representation learning have consistently achieved prominent results in texture recognition tasks, effectively combining the advantages of both traditional techniques and learning-based approaches. However, existing approaches have so far focused mainly on improving cross-information prediction, without introducing significant advancements to the overall randomized network architecture. In this paper, we propose Mixer, a novel randomized neural network for texture representation learning. At its core, the method leverages hyperspherical random embeddings coupled with a dual-branch learning module to capture both intra- and inter-channel relationships, further enhanced by a newly formulated optimization problem for building rich texture representations. Experimental results show interesting results for the proposed approach across several pure texture benchmarks, each with distinct characteristics and challenges. The source code will be available upon publication.

[133] Improving GUI Grounding with Explicit Position-to-Coordinate Mapping

Suyuchen Wang,Tianyu Zhang,Ahmed Masry,Christopher Pal,Spandana Gella,Bang Liu,Perouz Taslakian

Main category: cs.CV

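TL;DR: This paper proposes a method for improving GUI grounding that introduces RULER tokens and Interleaved MRoPE (I-MRoPE) to map natural-language instructions to pixel coordinates more accurately, with the largest gains on high-resolution screens.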

Details

Motivation: Existing vision-language models (VLMs) have difficulty mapping natural-language instructions to pixel coordinates, especially on high-resolution screens unseen during training, where the lack of reliable patch-to-pixel mapping lowers accuracy. Method: Two innovations are proposed. First, RULER tokens act as explicit coordinate markers, letting the model reference positions like gridlines on a map and adjust coordinates rather than generate them from scratch. Second, Interleaved MRoPE (I-MRoPE) improves spatial encoding by representing the width and height dimensions equally, addressing the asymmetry of standard positional encoding schemes. Result: Experiments on ScreenSpot, ScreenSpot-V2, and ScreenSpot-Pro show consistent gains in grounding accuracy across resolutions, with the largest improvements on high-resolution interfaces. Conclusion: By providing explicit spatial guidance instead of relying on implicit learning, the method enables more reliable GUI automation across diverse resolutions and platforms. Abstract: GUI grounding, the task of mapping natural-language instructions to pixel coordinates, is crucial for autonomous agents, yet remains difficult for current VLMs. The core bottleneck is reliable patch-to-pixel mapping, which breaks when extrapolating to high-resolution displays unseen during training. Current approaches generate coordinates as text tokens directly from visual features, forcing the model to infer complex position-to-pixel mappings implicitly; as a result, accuracy degrades and failures proliferate on new resolutions. We address this with two complementary innovations. First, RULER tokens serve as explicit coordinate markers, letting the model reference positions similar to gridlines on a map and adjust rather than generate coordinates from scratch. Second, Interleaved MRoPE (I-MRoPE) improves spatial encoding by ensuring that width and height dimensions are represented equally, addressing the asymmetry of standard positional schemes. Experiments on ScreenSpot, ScreenSpot-V2, and ScreenSpot-Pro show consistent gains in grounding accuracy, with the largest improvements on high-resolution interfaces. By providing explicit spatial guidance rather than relying on implicit learning, our approach enables more reliable GUI automation across diverse resolutions and platforms.

[134] LEAML: Label-Efficient Adaptation to Out-of-Distribution Visual Tasks for Multimodal Large Language Models

Ci-Siang Lin,Min-Hung Chen,Yu-Yang Sheng,Yu-Chiang Frank Wang

Main category: cs.CV

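TL;DR: Proposes LEAML, a label-efficient adaptation framework that uses a small amount of labeled data and abundant unlabeled data to improve multimodal large language models in specialized domains such as medicine.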

Details

Motivation: Multimodal large language models perform poorly on out-of-distribution tasks in specialized domains such as medical imaging, where labeled data is scarce and expensive. Method: A QA generator regularized by caption distillation produces pseudo question-answer pairs for unlabeled images, and only the neurons most relevant to question-answering are updated, so unlabeled data is used effectively. Result: On gastrointestinal endoscopy and sports VQA tasks, LEAML outperforms standard fine-tuning under minimal supervision. Conclusion: The LEAML framework effectively improves the adaptation of multimodal large models to specialized domains with limited labeled data. Abstract: Multimodal Large Language Models (MLLMs) have achieved strong performance on general visual benchmarks but struggle with out-of-distribution (OOD) tasks in specialized domains such as medical imaging, where labeled data is limited and expensive. We introduce LEAML, a label-efficient adaptation framework that leverages both scarce labeled VQA samples and abundant unlabeled images. Our approach generates domain-relevant pseudo question-answer pairs for unlabeled data using a QA generator regularized by caption distillation. Importantly, we selectively update only those neurons most relevant to question-answering, enabling the QA Generator to efficiently acquire domain-specific knowledge during distillation. Experiments on gastrointestinal endoscopy and sports VQA demonstrate that LEAML consistently outperforms standard fine-tuning under minimal supervision, highlighting the effectiveness of our proposed LEAML framework.

Table of Contents

cs.CL [Back]

[1] Hallucination reduction with CASAL: Contrastive Activation Steering For Amortized Learning

[2] Hallucination-Resistant, Domain-Specific Research Assistant with Self-Evaluation and Vector-Grounded Retrieval

[3] KAME: Tandem Architecture for Enhancing Knowledge in Real-Time Speech-to-Speech Conversational AI

[4] AMANDA: Agentic Medical Knowledge Augmentation for Data-Efficient Medical Visual Question Answering

[5] SelfJudge: Faster Speculative Decoding via Self-Supervised Judge Verification

[6] EntropyLong: Effective Long-Context Training via Predictive Uncertainty

[7] Synthetic Dialogue Generation for Interactive Conversational Elicitation & Recommendation (ICER)

[8] A High-Capacity and Secure Disambiguation Algorithm for Neural Linguistic Steganography

[9] Human Mobility Datasets Enriched With Contextual and Social Dimensions

[10] Where Did It Go Wrong? Attributing Undesirable LLM Behaviors via Representation Gradient Tracing

[11] FormalML: A Benchmark for Evaluating Formal Subgoal Completion in Machine Learning Theory

[12] KurdSTS: The Kurdish Semantic Textual Similarity

[13] CRACQ: A Multi-Dimensional Approach To Automated Document Assessment

[14] Optimizing Long-Form Clinical Text Generation with Claim-Based Rewards

[15] Evaluating Uncertainty Quantification Methods in Argumentative Large Language Models

[16] Can Prompts Rewind Time for LLMs? Evaluating the Effectiveness of Prompted Knowledge Cutoffs

[17] DRIFT: Learning from Abundant User Dissatisfaction in Real-World Preference Learning

[18] $\texttt{BluePrint}$: A Social Media User Dataset for LLM Persona Evaluation and Training

[19] Breaking the MoE LLM Trilemma: Dynamic Expert Clustering with Structured Compression

[20] Small Language Models for Curriculum-based Guidance

[21] mini-vec2vec: Scaling Universal Geometry Alignment with Linear Transformations

[22] LLMSQL: Upgrading WikiSQL for the LLM Era of Text-to-SQL

[23] Language, Culture, and Ideology: Personalizing Offensiveness Detection in Political Tweets with Reasoning LLMs

[24] Evaluating Bias in Spoken Dialogue LLMs for Real-World Decisions and Recommendations

[25] An Senegalese Legal Texts Structuration Using LLM-augmented Knowledge Graph

[26] Modeling the language cortex with form-independent and enriched representations of sentence meaning reveals remarkable semantic abstractness

[27] DiffuSpec: Unlocking Diffusion Language Models for Speculative Decoding

[28] Emission-GPT: A domain-specific language model agent for knowledge retrieval, emission inventory and data analysis

[29] Spiral of Silence in Large Language Model Agents

[30] ChunkLLM: A Lightweight Pluggable Framework for Accelerating LLMs Inference

[31] A Cross-Lingual Analysis of Bias in Large Language Models Using Romanian History

[32] Beyond Manuals and Tasks: Instance-Level Context Learning for LLM Agents

[33] Training Dynamics of Parametric and In-Context Knowledge Utilization in Language Models

[34] Pretraining with hierarchical memories: separating long-tail and common knowledge

[35] Uncertainty-Aware Answer Selection for Improved Reasoning in Multi-LLM Systems

[36] Learning to Route: A Rule-Driven Agent Framework for Hybrid-Source Retrieval-Augmented Generation

[37] KnowledgeSmith: Uncovering Knowledge Updating in LLMs with Model Editing and Unlearning

[38] Retrieval and Augmentation of Domain Knowledge for Text-to-SQL Semantic Parsing

[39] Words That Make Language Models Perceive

[40] CLARITY: Clinical Assistant for Routing, Inference, and Triage

[41] Unraveling Syntax: How Language Models Learn Context-Free Grammars

[42] Hierarchical Semantic Retrieval with Cobweb

[43] Knowledge-Graph Based RAG System Evaluation Framework

[44] Transcribe, Translate, or Transliterate: An Investigation of Intermediate Representations in Spoken Language Models

[45] Evaluation Framework for Highlight Explanations of Context Utilisation in Language Models

[46] Mind the Gap: Linguistic Divergence and Adaptation Strategies in Human-LLM Assistant vs. Human-Human Interactions

[47] SoT: Structured-of-Thought Prompting Guides Multilingual Reasoning in Large Language Models

[48] Self-Improvement in Multimodal Large Language Models: A Survey

[49] Uncertainty as Feature Gaps: Epistemic Uncertainty Quantification of LLMs in Contextual Question-Answering

[50] Time-To-Inconsistency: A Survival Analysis of Large Language Model Robustness to Adversarial Attacks

[51] TravelBench : Exploring LLM Performance in Low-Resource Domains

[52] PGMEL: Policy Gradient-based Generative Adversarial Network for Multimodal Entity Linking

[53] IndiCASA: A Dataset and Bias Evaluation Framework in LLMs Using Contrastive Embedding Similarity in the Indian Context

[54] The Path of Self-Evolving Large Language Models: Achieving Data-Efficient Learning via Intrinsic Feedback

[55] XTRA: Cross-Lingual Topic Modeling with Topic and Representation Alignments

[56] A Computational Framework for Interpretable Text-Based Personality Assessment from Social Media

[57] StepChain GraphRAG: Reasoning Over Knowledge Graphs for Multi-Hop Question Answering

[58] Evaluating Large Language Models for IUCN Red List Species Information

[59] Constraint Satisfaction Approaches to Wordle: Novel Heuristics and Cross-Lexicon Validation

[60] Self-Reflective Generation at Test Time

[61] Finding Diamonds in Conversation Haystacks: A Benchmark for Conversational Data Retrieval

[62] Leave No TRACE: Black-box Detection of Copyrighted Dataset Usage in Large Language Models via Watermarking

[63] Grounding Large Language Models in Clinical Evidence: A Retrieval-Augmented Generation System for Querying UK NICE Clinical Guidelines

[64] Semantic Differentiation in Speech Emotion Recognition: Insights from Descriptive and Expressive Speech Roles

[65] Revisiting Direct Speech-to-Text Translation with Speech LLMs: Better Scaling than CoT Prompting?

[66] Semantic Similarity in Radiology Reports via LLMs and NER

[67] Listening or Reading? Evaluating Speech Awareness in Chain-of-Thought Speech-to-Text Translation

[68] SurveyBench: How Well Can LLM(-Agents) Write Academic Surveys?

[69] Beyond the Final Layer: Intermediate Representations for Better Multilingual Calibration in Large Language Models

[70] EditLens: Quantifying the Extent of AI Editing in Text

[71] Neural Correlates of Language Models Are Specific to Human Language

[72] Topic Modeling as Long-Form Generation: Can Long-Context LLMs revolutionize NTM via Zero-Shot Prompting?

[73] Model-Based Ranking of Source Languages for Zero-Shot Cross-Lingual Transfer

[74] FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents

[75] Cache-to-Cache: Direct Semantic Communication Between Large Language Models

[76] Self-Anchor: Large Language Model Reasoning via Step-by-step Attention Alignment

[77] Reward Models are Metrics in a Trench Coat

cs.CV [Back]